In this case study we describe applying DeepVariant to a real WGS sample using
some of the more advanced capabilities of DeepVariant, such as directly calling
variants off a BAM in Google Cloud Storage (GCS), using different virtual
machine (VM) instance types for each step of DeepVariant, and using a GPU to
accelerate call_variants
. We provide some guidelines on the computational
resources needed for each step. And finally we access the quality of the
DeepVariant variant calls with hap.py
.
NOTE: this case study demonstrates an example of how to run DeepVariant end-to-end on one machine. This might not be the fastest or cheapest configuration.
For this case study, we use a 64-core DeepVariant non-preemptible instance in the "us-west1-b" zone with no GPU. From our local command line, we do:
gcloud beta compute instances create "${USER}-deepvariant-casestudy" \
--scopes "compute-rw,storage-full,cloud-platform" \
--image-family "ubuntu-1604-lts" --image-project "ubuntu-os-cloud" \
--machine-type "custom-64-131072" \
--boot-disk-size "300" --boot-disk-type "pd-ssd" \
--boot-disk-device-name "deepvariant-casestudy" \
--zone "us-west1-b"
The custom machine type gives you 64 vCPU, 128.00 GiB.
Then connect to your instance via SSH:
gcloud compute ssh --zone "us-west1-b" "${USER}-deepvariant-casestudy"
Set a number of shell variables, to make what follows easier.
BASE="${HOME}/case-study"
BUCKET="gs://deepvariant"
BIN_VERSION="0.4.0"
MODEL_VERSION="0.4.0"
MODEL_CL="174375304"
# Note that we don't specify the CL number for the binary, only the bin version.
BIN_BUCKET="${BUCKET}/binaries/DeepVariant/${BIN_VERSION}/DeepVariant-${BIN_VERSION}+cl-*"
MODEL_BUCKET="${BUCKET}/models/DeepVariant/${MODEL_VERSION}/DeepVariant-inception_v3-${MODEL_VERSION}+cl-${MODEL_CL}.data-wgs_standard"
DATA_BUCKET="${BUCKET}/case-study-testdata"
INPUT_DIR="${BASE}/input"
BIN_DIR="${INPUT_DIR}/bin"
MODELS_DIR="${INPUT_DIR}/models"
MODEL="${MODELS_DIR}/model.ckpt"
DATA_DIR="${INPUT_DIR}/data"
REF="${DATA_DIR}/hs37d5.fa.gz"
BAM="${DATA_DIR}/HG002_NIST_150bp_50x.bam"
TRUTH_VCF="${DATA_DIR}/HG002_GIAB_highconf_IllFB-IllGATKHC-CG-Ion-Solid_CHROM1-22_v3.2.2_highconf.vcf.gz"
TRUTH_BED="${DATA_DIR}/HG002_GIAB_highconf_IllFB-IllGATKHC-CG-Ion-Solid_CHROM1-22_v3.2.2_highconf.bed"
N_SHARDS="64"
OUTPUT_DIR="${BASE}/output"
EXAMPLES="${OUTPUT_DIR}/HG002.examples.tfrecord@${N_SHARDS}.gz"
CALL_VARIANTS_OUTPUT="${OUTPUT_DIR}/HG002.cvo.tfrecord.gz"
OUTPUT_VCF="${OUTPUT_DIR}/HG002.output.vcf.gz"
LOG_DIR="${OUTPUT_DIR}/logs"
mkdir -p "${OUTPUT_DIR}"
mkdir -p "${BIN_DIR}"
mkdir -p "${DATA_DIR}"
mkdir -p "${MODELS_DIR}"
mkdir -p "${LOG_DIR}"
There are some extra programs we will need.
We are going to use GNU Parallel to
run make_examples
.
We are going to install samtools
and docker.io
to help do some
analysis at the end.
sudo apt-get -y install parallel
sudo apt-get -y install samtools
sudo apt-get -y install docker.io
Copy our binaries from the cloud bucket.
time gsutil -m cp -r "${BIN_BUCKET}/*" "${BIN_DIR}"
This step should be very fast - it took us about 6 seconds when we tested.
Now, we need to install all prerequisites on the machine. Run this command:
cd "${BIN_DIR}"; time bash run-prereq.sh; cd -
In our test run it took about 1 min.
Copy the model files to your local disk.
time gsutil -m cp -r "${MODEL_BUCKET}/*" "${MODELS_DIR}"
This step should be really fast. It took us about 5 seconds.
Copy the input files you need to your local disk from our gs:// bucket.
The original source of these files are:
-
BAM file: HG002_NIST_150bp_50x.bam
The original FASTQ file comes from the PrecisionFDA Truth Challenge. I ran it through PrecisionFDA's BWA-MEM app with default setting, and then got the HG002_NIST_150bp_50x.bam file as output. The FASTQ files are originally from the Genome in a Bottle Consortium. For more information, see this Scientific Data paper.
-
FASTA file: hs37d5.fa.gz
The original file came from: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence. Because DeepVariant requires bgzip files, we had to unzip and bgzip it, and create corresponding index files.
-
Truth VCF and BED
These come from NIST, as part of the Genomes in a Bottle project. They are downloaded from ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/NISTv3.2.2
We picked version 3.2.2 in this case study because we want the final F1 scores to be comparable to the results from the Truth Challenge.
You can simply run the command below to get all the data you need for this case study.
time gsutil -m cp -r "${DATA_BUCKET}/*" "${DATA_DIR}"
It took us about 15min to copy the files.
Because the informational log messages (written to stderr) are voluminous, and because this takes a long time to finish, we will redirect all the output (stdout and stderr) to a file.
( time seq 0 $((N_SHARDS-1)) | \
parallel --halt 2 --joblog "${LOG_DIR}/log" --res "${LOG_DIR}" \
python "${BIN_DIR}"/make_examples.zip \
--mode calling \
--ref "${REF}" \
--reads "${BAM}" \
--examples "${EXAMPLES}" \
--task {}
) >"${LOG_DIR}/make_examples.log" 2>&1
This will take several hours to run.
There are different ways to run call_variants
. In this case study, we ran just
one call_variants
job. Here's the command that we used:
( time python "${BIN_DIR}"/call_variants.zip \
--outfile "${CALL_VARIANTS_OUTPUT}" \
--examples "${EXAMPLES}" \
--checkpoint "${MODEL}"
) >"${LOG_DIR}/call_variants.log" 2>&1
This will start one process of call_variants
and read all 64 shards of
[email protected]
as input, but output to one single output file
HG002.cvo.tfrecord.gz
.
We noticed that running multiple call_variants
on the same machine didn't seem
to save the overall time, because each of the call_variants slowed down when
multiple are running on the same machine.
So if you care about finishing the end-to-end run faster, you could request 64
machines (for example, n1-standard-8
machines) and run each shard as input
separately, and output to corresponding output shards. For example, the first
machine will run this command:
python "${BIN_DIR}"/call_variants.zip \
--outfile=${OUTPUT_DIR}/HG002.cvo.tfrecord-00000-of-00064.gz \
--examples=${OUTPUT_DIR}/HG002.examples.tfrecord-00000-of-00064.gz \
--checkpoint="${MODEL}"
And the rest will process 00001 to 00063. You can also use tools like dsub. In this case study we only report the runtime on a 1-machine example.
( time python "${BIN_DIR}"/postprocess_variants.zip \
--ref "${REF}" \
--infile "${CALL_VARIANTS_OUTPUT}" \
--outfile "${OUTPUT_VCF}"
) >"${LOG_DIR}/postprocess_variants.log" 2>&1
Because this step is single-process, single-thread, if you're orchestrating a more complicated running pipeline, you might want to request a machine with fewer cores for this step.
Step | wall time |
---|---|
make_examples |
5h 17m 51s |
call_variants |
8h 17m 10s |
postprocess_variants |
24m 53s |
total time | ~ 14h 0m |
Here we use the hap.py
(https://github.com/Illumina/hap.py)
program from Illumina to evaluate the resulting vcf file. This
serves as a check to ensure the three DeepVariant commands ran correctly and
produced high-quality results.
UNCOMPRESSED_REF="${OUTPUT_DIR}/hs37d5.fa"
# hap.py cannot read the compressed fa, so uncompress
# into a writable directory and index it.
zcat <"${REF}" >"${UNCOMPRESSED_REF}"
samtools faidx "${UNCOMPRESSED_REF}"
sudo docker pull pkrusche/hap.py
sudo docker run -it \
-v "${DATA_DIR}:${DATA_DIR}" \
-v "${OUTPUT_DIR}:${OUTPUT_DIR}" \
pkrusche/hap.py /opt/hap.py/bin/hap.py \
"${TRUTH_VCF}" \
"${OUTPUT_VCF}" \
-f "${TRUTH_BED}" \
-r "${UNCOMPRESSED_REF}" \
-o "${OUTPUT_DIR}/happy.output"
Type | Recall | Precision | F1_Score |
---|---|---|---|
INDEL | 0.993081 | 0.994954 | 0.994017 |
SNP | 0.999759 | 0.999881 | 0.999820 |