Install miniconda.
Then run:
conda env create -f environment.yml
conda activate DeepSom-cnn
This is the main operating mode. In inference mode, the CNN assigns a pseudoprobability score to each candidate variant. This score can then be used for variant classification.
An example CNN run in inference mode:
python nn.py \
--test_dataset './inference/project_name/wgs_sample_name/tensors/wgs_sample_name_imgb.lst' \
--output_dir './inference/project_name/wgs_sample_name/results/' \
--model_weight './train/train_dataset_name/results/weights/epoch_N_weights_model' \
--tensor_width 150 \
--tensor_height 70 \
--max_depth 200 \
--model_weight
--- path to the weights of a pretrained model.
--tensor_width
--- width of the variant tensor
--tensor_height
--- maximal height of the variant tensor. The VAF resolution of 1/tensor_height
is guaranteed only if the same height was used for tensor generation, otherwise tensors with a larger height are just cropped.
--max_depth
--- maximal read depth (used to normalize the flanking regions feature). Since the read depth distribution
often has a long right tail, we recommend using the 99th quantile of this distribution as --max_depth
(about 2x median coverage)
Predictions on test_dataset
are saved to output_dir/predictions/final_predictions.vcf
, where the CNN score is stored under cnn_score
record in the INFO field.
--tensor_width
, --tensor_height
and --max_depth
should correspond to those used in train mode.
Please visit test for a practical example on how to run the CNN in inference mode.
This mode is used for DeepSom evaluation. Evaluation can be performed only on variants with known labels (somatic/non-somatic). To run the CNN, use the same command as in inference mode (see above).
In train mode, the CNN learns relationships between variant tensors and variant labels (somatic/non-somatic). The goal of the training step is to generate a CNN model (distribution of weights) which can then be used for inference on previously unseen samples.
You don't need to train the CNN if you use DeepSom to call somatic variants for samples of acute myeloid leukemia (LAML), bladder urothelial cancer (BLCA), esophageal adenocarcinoma (ESAD), liver cancer (LINC), or gastric cancer (GACA). Ready-to-use models for these cancers can be found in models
An example CNN run in train mode:
python nn.py \
--train_dataset './train/train_dataset_name/tensors/train_imgb.lst' \
--output_dir './train/train_dataset_name/results/' \
--tensor_width 150 \
--tensor_height 70 \
--max_depth 200 \
--val_fraction 0.0 \
--save_each 10
--tensor_width
--- width of the variant tensor
--tensor_height
--- maximal height of the variant tensor. The VAF resolution of 1/tensor_height
is guaranteed only if the same height was used for tensor generation, otherwise tensors with a larger height are just cropped.
--max_depth
--- maximal read depth (used to normalize the flanking regions feature). Since the read depth distribution
often has a long right tail, we recommend using the 99th quantile of this distribution as --max_depth
(about 2x median coverage)
--val_fraction
--- percentage of input imgb batches used for validation and not for training. After each training epoch, the CNN performance is evaluated on the validation set. CNN predictions on the validation set are saved to
output_dir/predictions/validation_epoch_N.vcf
.
--save_each
--- how often model and optimizer weights as well as predictions on train dataset should be saved.
See python nn.py --help
for more training options.
The model weights are saved in output_dir/weights/epoch_N_weights_model
, the optimizer weights are saved in output_dir/weights/epoch_N_weights_optimizer
. Optimizer weights are only required when one needs to resume training.
One can combine train and test/inference modes by using both --train_dataset
and test_dataset
parameters.
Training for 120 000 SNP and 20 000 INDEL variants (tensor_width
=150; tensor_height
=70) requires about 1.5Gb RAM and takes about 2.5h on NVIDIA Tesla 100V.
For any given variant, the CNN outputs a continuous pseudoprobability score
between 0 and 1. The higher
, the higher the probability
that the variant is somatic. Variant classification is performed by imposing a threshold on
s.t. all variants with
are considered somatic. This threshold can be chosen based on
the ROC or Precision-Recall curve. The ROC curve is a plot of true positive rate (TPR) against false positive rate (FPR) w.r.t. somatic variants. The Precision-Recall curve is a plot of recall against precision w.r.t. somatic variants. For example, using the Precision-Recall curve, one can choose the threshold that maximizes the f1-score or guarantees a given level of recall. Note that variants used for choosing
should be different from those used for CNN training.
Alternatively, the classification threshold can be chosen based on the corresponding probability . The relation between
and
is non-linear.
To compute based on
, one needs to calibrate the CNN output. For calibration, one runs the pre-trained CNN on test variants. Then, the CNN output values are binned s.t. there are at least 20 values per bin. The probability that a variant whose score ends up in bin
is somatic is given by the Bayes formula:
where is the number of somatic variants per WGS sample,
is the number of germline variants and artefacts per WGS sample,
is the fraction of true somatic variants at the input that end up in bin
,
is the fraction of true germline variants and artefacts at the input that end up in bin
.
Note that the ROC and Precision-Recall curves as well as probabilities must be computed separately for SNP and INDEL variants.