
Error: unable to open file or unable to determine types for file synthetic.vcf #60

Open
gianfilippo opened this issue Apr 4, 2020 · 31 comments


@gianfilippo

Hi,

I am trying to run NeuSomatic in ensemble mode, but I got stuck after the SomaticSeq.Wrapper.sh step, at the "preprocess.py --mode train" step.
I get the following error:
Error: unable to open file or unable to determine types for file synthetic.vcf

The synthetic.vcf is generated by processing the Ensemble.s*.tsv files produced by SomaticSeq.Wrapper.sh, following the details in your repository (my understanding of them, of course):
cat <(cat Ensemble.s*.tsv |grep CHROM|head -1) <(cat Ensemble.s*.tsv |grep -v CHROM) | sed "s/nan/0/g" > ensemble_ann1.tsv

python preprocess.py --mode train --reference $GENFILE.fa --region_bed $INTERVALFILE --tumor_bam $syntheticTumor.bam --normal_bam $syntheticNormal.bam --work WORK --truth_vcf synthetic.vcf --ensemble_tsv ensemble_ann.tsv --min_mapq 10 --num_threads 20 --scan_alignments_binary $HOME/bin/neusomatic/neusomatic/bin/scan_alignments

A few lines from the synthetic.vcf are below
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SPIKEIN
chr1 1503492 . G A 100 PASS SOMATIC;VAF=0.2;DPR=9.66666666667 GT 0/1
chr1 3752754 . G A 100 PASS SOMATIC;VAF=0.307692307692;DPR=90.0 GT 0/1
chr1 3763621 . C A 100 PASS SOMATIC;VAF=0.222222222222;DPR=17.6666666667 GT 0/1
chr1 6152482 . T A 100 PASS SOMATIC;VAF=0.127868852459;DPR=304.666666667 GT 0/1
chr1 6199629 . G C 100 PASS SOMATIC;VAF=0.21978021978;DPR=181.333333333 GT 0/1

Can you please help me understand my mistake?

Thanks
Gianfilippo

@msahraeian
Contributor

msahraeian commented Apr 4, 2020

@gianfilippo happy to see your interest in NeuSomatic.
I think there is some confusion here. The synthetic.vcf that is needed for training and the ensemble TSV file that is generated by SomaticSeq.Wrapper.sh are two separate things.

  1. If you don't have training data with a known truth VCF: as noted here, to generate training data you should follow the BAMSurgeon workflow, which gives you synthetic BAMs and a truth VCF file (in fact it gives a synthetic_snvs.vcf and a synthetic_indels.leftAlign.vcf; you need to combine them into a single VCF). These training data can then be used in NeuSomatic.

  2. Regardless of how you get your data, you can run NeuSomatic in standalone or ensemble mode (in the training or calling phase). Details can be found here. The process you mentioned will give you ensemble_ann.tsv, which can be used in ensemble mode. So this TSV file is only the output of other callers on your data and should not be used as the truth VCF.

In addition, from the few lines of the VCF you provided, I see that the VCF is in the wrong format. You need a header line like ##fileformat=VCFv4.2 as the first line of the VCF to specify the format.
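
For reference, a minimal sketch of such a header for the sample shown above; the contig length and the INFO/FORMAT descriptions here are illustrative placeholders that should be adapted to your own reference and fields:

##fileformat=VCFv4.2
##contig=<ID=chr1,length=248956422>
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Somatic mutation">
##INFO=<ID=VAF,Number=1,Type=Float,Description="Variant allele fraction (placeholder description)">
##INFO=<ID=DPR,Number=1,Type=Float,Description="Depth metric (placeholder description)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SPIKEIN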

@gianfilippo
Author

gianfilippo commented Apr 5, 2020

Thanks, you are right, I clearly got confused... I fixed it.
The VCF is OK; I had just sent you a few lines.
Now I reran it, but got another error at the preprocess stage. Below I am reporting the relevant part of the error message. It is failing in find_records. What else am I doing wrong?
Just in case, here is how I joined the SNVs and indels:
bcftools concat -a -o synthetic.vcf.gz -O z synthetic_snvs.vcf.gz synthetic_indels.leftAlign.vcf.gz
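
For completeness, bcftools concat -a expects the input VCFs to be bgzip-compressed and indexed; assuming the inputs above are already bgzipped, a sketch of the indexing step beforehand would be:
tabix -p vcf synthetic_snvs.vcf.gz
tabix -p vcf synthetic_indels.leftAlign.vcf.gz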

Thanks for your help!

INFO 2020-04-04 21:25:27,955 find_records (ForkPoolWorker-54) N_none: 536
INFO 2020-04-04 21:25:27,956 find_records (ForkPoolWorker-55) N_none: 531
INFO 2020-04-04 21:25:27,964 find_records (ForkPoolWorker-46) N_none: 531
ERROR 2020-04-04 21:25:27,974 find_records (ForkPoolWorker-51) Traceback (most recent call last):
File "neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
mt2, eqs2 = push_lr(fasta_file, mt, 2)
File "neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-04-04 21:25:27,974 find_records (ForkPoolWorker-51)
INFO 2020-04-04 21:25:27,988 find_records (ForkPoolWorker-44) N_none: 484
INFO 2020-04-04 21:25:27,997 find_records (ForkPoolWorker-60) N_none: 497
ERROR 2020-04-04 21:25:28,009 main Traceback (most recent call last):
File "neusomatic/neusomatic/python/preprocess.py", line 435, in
args.scan_alignments_binary)
File "neusomatic/neusomatic/python/preprocess.py", line 335, in preprocess
ensemble_beds[i] if ensemble_tsv else None, tsv_batch_size)
File "neusomatic/neusomatic/python/preprocess.py", line 129, in generate_dataset_region
tsv_batch_size)
File "neusomatic/neusomatic/python/generate_dataset.py", line 1461, in generate_dataset
raise Exception("find_records failed!")
Exception: find_records failed!

ERROR 2020-04-04 21:25:28,010 main Aborting!
ERROR 2020-04-04 21:25:28,010 main preprocess.py failure on arguments: Namespace(dbsnp_to_filter=None, del_merge_min_af=0, del_min_af=0.05, ensemble_tsv=None, filter_duplicate=False, first_do_without_qual=False, good_ao=10, ins_merge_min_af=0, ins_min_af=0.05, long_read=False, matrix_base_pad=7, matrix_width=32, max_dp=100000, merge_r=0.5, min_ao=1, min_dp=5, min_ev_frac_per_col=0.06, min_mapq=10, mode='train', normal_bam='syntheticNormal.bam', num_threads=20, reference='hg38.fa', region_bed='SeqCap_EZ_Exome_v3_hg38_primary_targets.sorted.bed', restart=False, scan_alignments_binary='neusomatic/neusomatic/bin/scan_alignments', scan_maf=0.01, scan_window_size=2000, snp_min_af=0.05, snp_min_ao=3, snp_min_bq=10, truth_vcf='synthetic.vcf.gz', tsv_batch_size=50000, tumor_bam='syntheticTumor.bam', work='Sample_Test')
Traceback (most recent call last):
File "neusomatic/neusomatic/python/preprocess.py", line 441, in
raise e
File "neusomatic/neusomatic/python/preprocess.py", line 435, in
args.scan_alignments_binary)
File "neusomatic/neusomatic/python/preprocess.py", line 335, in preprocess
ensemble_beds[i] if ensemble_tsv else None, tsv_batch_size)
File "neusomatic/neusomatic/python/preprocess.py", line 129, in generate_dataset_region
tsv_batch_size)
File "neusomatic/neusomatic/python/generate_dataset.py", line 1461, in generate_dataset
raise Exception("find_records failed!")
Exception: find_records failed!

@msahraeian
Contributor

@gianfilippo would you please make sure that in the truth.vcf file all the bases in the REF and ALT fields are capitalized. Also, please make sure you are using the same reference file you used for alignment.
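
If any lowercase bases do turn up in REF/ALT, a minimal sketch for capitalizing those two columns of a plain-text (uncompressed) VCF; the output file name is just illustrative:

awk 'BEGIN{FS=OFS="\t"} /^#/ {print; next} {$4=toupper($4); $5=toupper($5); print}' synthetic.vcf > synthetic.capitalized.vcf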

@gianfilippo
Author

Hi, I generated the truth.vcf with BAMSurgeon and the reference file is the same.

@msahraeian
Contributor

@gianfilippo I added more logging for failed asserts in a new branch called logging_for_asserts. Would you please check out this branch as follows, rerun your code, and send me your log?
cd neusomatic
git pull
git checkout logging_for_asserts

@gianfilippo
Author

Hi, thanks for the help!
I can now see from the file that your guess was correct: I found some bases in lowercase. Below is the error (part of it). I am not sure this is described in BAMSurgeon, but from the following open issue (adamewing/bamsurgeon#137), it seems that "the bases added for the length resolution are always at the end of the read and are in lower case."
If this is the case, should I filter out any line in the VCF with a lowercase base, or would you retain them?

INFO 2020-04-06 12:37:31,556 find_records (ForkPoolWorker-51) N_none: 546
INFO 2020-04-06 12:37:31,557 find_records (ForkPoolWorker-59) N_none: 506
INFO 2020-04-06 12:37:31,561 find_records (ForkPoolWorker-50) N_none: 535
ERROR 2020-04-06 12:37:31,563 push_lr
ERROR 2020-04-06 12:37:31,563 push_lr Failed at ['chr1', 36091182, 'a', 'C']
ERROR 2020-04-06 12:37:31,563 push_lr Assertion Failed: A==a

@msahraeian
Contributor

@gianfilippo I think we should assume those bases are correct synthetic truth variants, so we should keep them to have valid training data.
I fixed this in the code so that all bases will be capitalized internally. Please pull the logging_for_asserts branch again and retry.

@gianfilippo
Author

The preprocess step now works, but the train step fails with the error below. What do you think?

INFO 2020-04-06 17:24:37,805 make_weights_for_balanced_classes count length classes: [(0, 3846), (1, 1753), (2, 52), (3, 121)]
INFO 2020-04-06 17:24:37,805 make_weights_for_balanced_classes weight length classes: [(0, 0.08341995841995842), (1, 0.17407311157311156), (2, 0.24774774774774774), (3, 0.24475918225918225)]
INFO 2020-04-06 17:24:37,805 train_neusomatic weights_type:[0.24311331 0.24250693 8.34199584 0.18095981], weights_length:[8.34199584 0.17407311 0.24774775 0.24475918]
INFO 2020-04-06 17:24:37,807 train_neusomatic Number of candidater per epoch: 5769
ERROR 2020-04-06 17:24:38,365 main Traceback (most recent call last):
File "neusomatic/neusomatic/python/train.py", line 576, in
use_cuda)
File "neusomatic/neusomatic/python/train.py", line 436, in train_neusomatic
outputs, _ = net(inputs)
File ".local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/gpfs/ycga/project/coppola/gc223/bin/neusomatic/neusomatic/python/network.py", line 67, in forward
x = self.pool1(F.relu(self.bn1(self.conv1(x))))
File ".local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File ".local/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
return self.conv2d_forward(input, self.weight)
File ".local/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
self.padding, self.dilation, self.groups)
RuntimeError: Given groups=1, weight of size 64 119 1 3, expected input[100, 26, 5, 32] to have 119 channels, but got 26 channels instead

@msahraeian
Contributor

@gianfilippo sounds good.
To train or call in ensemble mode, you should use the --ensemble argument with train.py and call.py.

@gianfilippo
Author

I am already using --ensemble (see the command line below).
python $NEUSOMATIC/train.py --candidates_tsv $DIR/dataset//candidates.tsv --out $DIR --num_threads $Ncores --batch_size 100 --ensemble

@msahraeian
Contributor

@gianfilippo did you also use --ensemble_tsv when you ran preprocess.py?

@gianfilippo
Author

yes

@gianfilippo
Author

Below is the preprocess call:
python $NEUSOMATIC/preprocess.py --mode train --reference hg38.fa --region_bed $INTERVALFILE --tumor_bam syntheticTumor.bam --normal_bam syntheticNormal.bam --work $DIR --ensemble_tsv ensemble_ann.tsv --truth_vcf synthetic.vcf.gz --min_mapq 10 --num_threads 20 --scan_alignments_binary $NEUSOMATIC/neusomatic/bin/scan_alignments

@msahraeian
Contributor

@gianfilippo Can you share with me a few lines from ensemble_ann.tsv and a few lines from one of the candidates.tsv files?

@gianfilippo
Author

The zipped files (a few lines each) are attached. For candidates.tsv, I attached the lines from the work.0 dir. I had a typo in the command line, now corrected.

Archive.zip

@msahraeian
Contributor

@gianfilippo The candidates.tsv file does not seem like one that was produced in ensemble mode. Can you rerun preprocess.py in ensemble mode with a new (fresh) output folder, e.g. as sketched below?
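
A sketch of such a rerun, reusing the preprocess command posted above with a fresh work directory (the new directory name is just illustrative):

python $NEUSOMATIC/preprocess.py --mode train --reference hg38.fa --region_bed $INTERVALFILE --tumor_bam syntheticTumor.bam --normal_bam syntheticNormal.bam --work ${DIR}_ensemble_fresh --ensemble_tsv ensemble_ann.tsv --truth_vcf synthetic.vcf.gz --min_mapq 10 --num_threads 20 --scan_alignments_binary $NEUSOMATIC/neusomatic/bin/scan_alignments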

@gianfilippo
Author

gianfilippo commented Apr 7, 2020

It is working now, going through the iterations. I will let you know if it completes. Thanks!

@gianfilippo
Author

The network completed training. Below are the last few output lines:
INFO 2020-04-07 01:44:34,805 train_neusomatic epoch: 999, iter: 57925, lr: 0.0001, loss: 0.20255
INFO 2020-04-07 01:44:39,472 train_neusomatic epoch: 999, iter: 57939, lr: 0.0001, loss: 0.22420
INFO 2020-04-07 01:44:45,971 train_neusomatic epoch: 1000, iter: 57955, lr: 0.0001, loss: 0.19262
INFO 2020-04-07 01:44:50,754 train_neusomatic epoch: 1000, iter: 57969, lr: 0.0001, loss: 0.20993
INFO 2020-04-07 01:44:55,448 train_neusomatic epoch: 1000, iter: 57983, lr: 0.0001, loss: 0.19704
INFO 2020-04-07 01:45:00,229 train_neusomatic epoch: 1000, iter: 57997, lr: 0.0001, loss: 0.21646
INFO 2020-04-07 01:45:01,207 train_neusomatic Finished Training
INFO 2020-04-07 01:45:01,212 train_neusomatic Total Epochs: 1000
INFO 2020-04-07 01:45:01,212 train_neusomatic Total Epochs: 1000
INFO 2020-04-07 01:45:01,212 train_neusomatic Training is Done.

@msahraeian
Contributor

@gianfilippo from the loss, it seems that the training has not converged as expected. Can you send me the training log?

@gianfilippo
Author

Hi, the file is attached. I think there may be something wrong in my synthetic BAMs and VCFs.
dsq-task.TRAIN-9242965_0-c17n02.txt

@msahraeian
Contributor

@gianfilippo

  • One thing I can observe is that you don't have enough spiked variants in your tumor sample (~2,000). We normally expect >50K spiked mutations (for the SEQC model we used 2M spikes across 20 WGS samples). These can be in one sample or across multiple samples. If you have WES data you can spike at a frequency of 1-2 mutations per Kbp.

  • To monitor the progress of training, it is good to use a few of the candidate TSV files as --validation_candidates_tsv, which reports performance over the epochs (see the sketch after this list).
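
For scale, assuming for illustration a ~60 Mbp WES target (this size is an assumption, not a number from this thread), 1-2 mutations per Kbp would be roughly 60K-120K spiked mutations. And a sketch of holding out candidates for validation, reusing the training command posted earlier; the work.* paths are illustrative and should point to wherever your candidates.tsv files actually live:

python $NEUSOMATIC/train.py --candidates_tsv $DIR/dataset/work.0/candidates.tsv --validation_candidates_tsv $DIR/dataset/work.1/candidates.tsv --out $DIR --num_threads $Ncores --batch_size 100 --ensemble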

@gianfilippo
Author

Thanks for the suggestion. I have WES data. I used a modified version of the somaticseq/bamsurgeon script; the modifications simply allow the script to run under the Slurm Workload Manager. I requested 100,000 SNVs, 20,000 indels, and 1,000 SVs, and ended up with 3,179 SNVs and 609 indels.

@msahraeian
Contributor

@gianfilippo I think that is unexpected for somaticseq/bamsurgeon. Something must be wrong there. Can you share your run script for BAMSurgeon?

@gianfilippo
Author

Hi, the script is attached. I ran it on 4 threads, and the attached file is the one for thread 1; the others are the same. I also included the mergeFiles script.
BamSimulator.2020-04-10_18-19-06_702757150.cmd.txt
mergeFiles.2020-04-10_18-19-06_702757150.cmd.txt

@litaifang

litaifang commented Apr 16, 2020

Can you try the following SomaticSeq command to see what's in the BAM files at your SNV/indel positions?

If you have the latest SomaticSeq v3.4.0 installed, somatic_vcf2tsv.py is in your path. Otherwise, you can point directly to PATH/TO/somaticseq/somaticseq/somatic_vcf2tsv.py for the following command:

somatic_vcf2tsv.py -myvcf Sample_G1700T_012/synthetic_snvs.vcf -truth Sample_G1700T_012/synthetic_snvs.vcf -nbam Sample_G1700T_012/syntheticNormal.bam -tbam Sample_G1700T_012/syntheticTumor.bam -ref hg38.fa -dedup -mincaller 0 -outfile OUTPUT.SNV.tsv

You may also add caller VCF files as input above. All of the following are optional, e.g.:

-mutect mutect.vcf -strelka strelka.snv.vcf -vardict vardict.vcf -muse muse.vcf ......

And do one for indels as well (see the sketch below).
Basically, for every position in synthetic_snvs.vcf, it will extract features from the BAM files at that coordinate.
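
For the indel run, a sketch along the same lines; the indel VCF name is taken from the earlier bcftools command, so adjust the path (and decompress it first if it is gzipped):

somatic_vcf2tsv.py -myvcf Sample_G1700T_012/synthetic_indels.leftAlign.vcf -truth Sample_G1700T_012/synthetic_indels.leftAlign.vcf -nbam Sample_G1700T_012/syntheticNormal.bam -tbam Sample_G1700T_012/syntheticTumor.bam -ref hg38.fa -dedup -mincaller 0 -outfile OUTPUT.INDEL.tsv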

@msahraeian
Contributor

@gianfilippo Also, can you make sure you use the --selector argument when you run BamSimulator_multiThreads.sh and pass your WES .bed file there (e.g. as sketched below)? I suspect you are spiking 100K SNVs across the whole genome instead of the WES target, so you end up with only ~3K SNVs.
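
A sketch of what that addition might look like; everything except --selector stands in for the arguments already in your BamSimulator command, and the BED file name is taken from the preprocess arguments logged above:

BamSimulator_multiThreads.sh \
  [your existing arguments] \
  --selector SeqCap_EZ_Exome_v3_hg38_primary_targets.sorted.bed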

@gianfilippo
Author

Hi, thanks for the suggestions. The outputs from somatic_vcf2tsv.py for both SNVs and indels, also using the caller VCF files, are attached.
You are right, I guess I did not notice the --selector argument and have not used it. I will rerun it.
OUTPUT.INDEL.PLUS.tsv.txt
OUTPUT.SNV.PLUS.tsv.txt

@litaifang

Looking at the two tsv.txt files, it's not obvious to me why the four callers are all "negative" for all the variant positions.
Columns #6, 7, 10, and 14 are MuTect(2), VarScan2, VarDict, and Strelka, and they're 0 all the way through.
Columns #90 and #91 are forward and reverse variant-supporting reads, and there seem to be variant-supporting reads in these BAM files, so it seems the synthetic VCF file and the BAM files are good.
I wonder why the four callers don't detect any of those variants.

@gianfilippo
Author

If you think the synthetic VCF file and the BAM files are OK, there must be something wrong in the way I set up the script that runs the callers. Likely some silly mistake at this point. I will let you know. Thanks for your help!

@gianfilippo
Author

Hi, it seems that now, using --selector, BamSimulator_multiThreads.sh does not complete, and I get core dumps and errors. I tried with 4 and 20 threads. Now I am rerunning it on a single thread just to see if I still get the same errors.

@litaifang

@gianfilippo, make sure the synthetic_*.vcf files and the caller output VCF files are all obtained from the same pair of BAM files.
