Not enough Mutated Oncogenes or TSGs Found in Your Data #21

Closed
schulter opened this issue Aug 26, 2020 · 5 comments

Comments

@schulter

Hi,
I am currently testing the 2020plus software on my MAF data set which is basically the TCGA MAF files for 16 cancer types concatenated. The algorithm ran without problems using the MAF file from the tutorial. My MAF file contains roughly 2.5 million mutations with 500,000 of them being classified as silent.

Most of the snakemake pipeline works as expected, but I get this error in the final stages:

Version: 1.2.3
Command: /project/gcn/2020plus-1.2.3/2020plus.py --log-level=INFO train -d .7 -o 1.0 -n 1000 -r pancan16_ourmutation/trained.Rdata --features=pancan16_ourmutation/features.txt --random-seed 71
Training R's Random forest . . .
ERROR: There were either no or very few mutated oncogenes or tumor suppressor genes found in your data! Did you supply a full pan-cancer dataset? Or have you modified the training list of oncogenes or tumor suppressor genes? Or did you subset your mutations to not include oncogenes/tumor suppressor genes in the training list?
Error in job cv_predict while creating output files pancan16_ourmutation/output/results/r_random_forest_prediction.txt, pancan16_ourmutation/trained.Rdata.
RuleException:
CalledProcessError in line 304 of /project/gcn/2020plus-1.2.3/Snakefile:
Command '
python which 2020plus.py --log-level=INFO train -d .7 -o 1.0 -n 1000 -r pancan16_ourmutation/trained.Rdata --features=pancan16_ourmutation/features.txt --random-seed 71
python which 2020plus.py --log-level=INFO classify --trained-classifier pancan16_ourmutation/trained.Rdata --null-distribution pancan16_ourmutation/simulated_null_dist.txt --features pancan16_ourmutation/simulated_summary/simulated_features.txt --simulated
python which 2020plus.py --out-dir pancan16_ourmutation/output --log-level=INFO classify -n 200 -d .7 -o 1.0 --features pancan16_ourmutation/features.txt --null-distribution pancan16_ourmutation/simulated_null_dist.txt --random-seed 71
' returned non-zero exit status 1.
File "/project/gcn/2020plus-1.2.3/Snakefile", line 304, in __rule_cv_predict
File "/home/sasse/miniconda3/envs/2020plus/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message
(2020plus) sasse@bohemianrhapsody:/project/gcn/2020plus-1.2.3>

I called the tool using:

snakemake -s Snakefile predict -p --cores 64 --config mutations="data/pancancer_16_onlyrequiredcols.maf" output_dir="pancan16_ourmutation"

where data/pancancer_16_onlyrequiredcols.maf is my edited MAF file and I leave all other data as in the tutorial.
Do you know why this error happens? Could it be that there is a problem with my mutation file or is the problem in the layout (e.g. not enough mutations in some of the known TSGs/oncogenes due to using only a subset of cancer types)?

I might add that the MAF file only contains the columns required according to this page, that is:

  • Hugo_Symbol (or named “Gene”)
  • Chromosome
  • Start_Position
  • End_Position
  • Reference_Allele
  • Tumor_Seq_Allele2 (or named “Tumor_Allele”)
  • Tumor_Sample_Barcode (or named “Tumor_Sample”)
  • Variant_Classification

Maybe that has to do with the error?

Thank you for some hints on that.

Best,

Roman

@ctokheim
Collaborator

This means that none, or very few, of the genes listed in either the oncogene list (https://github.com/KarchinLab/2020plus/blob/master/data/gene_lists/oncogenes.txt) or the tumor suppressor list (https://github.com/KarchinLab/2020plus/blob/master/data/gene_lists/tsgs.txt) used for training were found in your MAF file. Can you check these files and see whether the gene symbols are actually present in your MAF file?
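One way to run this check is sketched below. It is a minimal illustration, not part of 2020plus: the function names are hypothetical, and it assumes that each gene list holds one symbol per line (first tab-separated field) and that the MAF is tab-delimited with a `Hugo_Symbol` or `Gene` column. If the gene list has a header line, that header will simply be reported as a missing gene.

```python
import csv
from collections import Counter


def count_mutations_per_gene(maf_lines, gene_cols=("Hugo_Symbol", "Gene")):
    """Count MAF rows per gene symbol, skipping '#' comment lines."""
    rows = (line for line in maf_lines if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t", quoting=csv.QUOTE_NONE)
    fields = reader.fieldnames or []
    gene_col = next((c for c in gene_cols if c in fields), None)
    if gene_col is None:
        raise ValueError("no gene column found in MAF header")
    counts = Counter()
    for row in reader:
        counts[(row.get(gene_col) or "").strip()] += 1
    return counts


def read_gene_list(list_lines):
    """Assumed format: one symbol per line; keep the first tab-separated field."""
    return [line.split("\t")[0].strip() for line in list_lines if line.strip()]


def missing_genes(gene_list, counts):
    """Return the training genes that have no rows in the MAF."""
    return [gene for gene in gene_list if counts.get(gene, 0) == 0]
```

With real files, the functions could be used along these lines: open the MAF and a gene list with `open(...)`, pass the file handles in as the line iterables, and print `missing_genes(...)`. An empty result only shows that the symbols are present; it does not show that the coordinates match the reference annotation.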

@schulter
Author

schulter commented Sep 7, 2020

Hi,
thanks for the reply. The MAF file seems to be okay and contains mutations for both oncogenes and TSGs from the files you linked. Also, it doesn't seem to be a trimming error or similar formatting stuff.
However, the features.txt file in the output dir as well as the summary file contain only roughly 1200 genes which is probably wrong, no?
The MAF file contains mutations in 21731 genes and roughly 2.4 million mutations in total. All of the 71 tsgs and 51 oncogenes contain mutations. For the oncogenes, I have the following numbers of mutations per gene (number of rows in the MAF file for those genes):

PIK3CA 1429
KRAS 646
BRAF 549
MED12 479
CTNNB1 456
SETBP1 430
PDGFRA 429
ALK 376
CARD11 365
EGFR 360

The TSGs also look as expected with TP53, KMT2D and APC being the most frequently mutated genes.

Further, the MAF contains the following classifications of variants:

Missense_Mutation 1252240
Silent 461766
3'UTR 218245
Intron 113019
Nonsense_Mutation 107922
Frame_Shift_Del 77033
5'UTR 52744
RNA 44178
Frame_Shift_Ins 35047
Splice_Site 31509
Splice_Region 23981
3'Flank 22011
5'Flank 15535
In_Frame_Ins 5861
In_Frame_Del 5349
Translation_Start_Site 1605
Nonstop_Mutation 1441
IGR 330

Here is the features.txt file containing summary statistics for only 1198 genes.

Does that help to solve the issue? I also downloaded hg19 from the tutorial page, converted it to fasta using twoBitToFa and extracted the gene sequences as indicated in the tutorial.

Thank you for your effort and time.

Best,

Roman

@ctokheim
Collaborator

ctokheim commented Sep 7, 2020

Can you run the unit test that evaluates whether the training command works on a toy dataset? You'll need the nose python package (pip install nose). From the top-level directory of 2020plus, you can run the following command:

$ nosetests tests/test_train.py

The error should be the same as what you observe on your data if there is a problem in the code.

@schulter
Author

schulter commented Sep 8, 2020

The test passed, and training on the original data also works as expected. The error occurs only on my particular data set. The oncogene.txt in the output folder contains 18,000 lines, while tsg.txt contains only 1198 lines, and the summary then also has only 1198 lines. This differs from both the pan-cancer and the bladder cancer examples you provide, and I assume it produces the observed error message.
So the problem lies somewhere in probabilistic2020, where my data set must somehow differ from the provided examples. Is that correct?
I'm trying to dig further into the issue and will try to run 20/20+ directly on a TCGA MAF file with minimal processing.
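A mismatch like this can be spotted by comparing row counts across the intermediate files. The following is only a rough sketch: the helper names and the 50% threshold are illustrative assumptions, and it assumes each file starts with a single header line.

```python
def count_data_rows(lines):
    """Count non-empty lines after the first (assumed header) line."""
    rows = [line for line in lines if line.strip()]
    return max(len(rows) - 1, 0)


def flag_short_files(row_counts, min_fraction=0.5):
    """Return labels whose row count is below min_fraction of the largest count."""
    if not row_counts:
        return []
    largest = max(row_counts.values())
    return sorted(
        label for label, n in row_counts.items() if n < min_fraction * largest
    )
```

Given the counts reported above (18,000 rows for oncogene.txt against 1198 for tsg.txt and the summary), both shorter files would be flagged, which points at a step that silently drops most genes.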

@schulter
Author

schulter commented Sep 8, 2020

I found the error. The coordinates in my MAF files are from hg38, while the snvboxgenes.bed file is for hg19. So this is actually a duplicate of #16, and I will close this issue. However, maybe you could consider adding a clear error message for this case, since most MAF files (at least those from TCGA) contain a column for the reference genome build.
Thanks for your help again!
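A pre-flight check along the lines suggested here might look like the sketch below. It is not something 2020plus provides; the function names are hypothetical. It assumes the build is stored in the standard MAF column `NCBI_Build` and treats `37`, `GRCh37`, and `hg19` as equivalent labels. A MAF trimmed to only the required columns, like the one in this issue, has no such column, in which case the function returns `None` and the build cannot be verified this way.

```python
import csv
from collections import Counter

# Labels treated as hg19-compatible (an assumption; extend as needed).
HG19_NAMES = {"37", "grch37", "hg19"}


def summarize_builds(maf_lines, build_col="NCBI_Build"):
    """Count the values of the build column, or return None if it is absent."""
    rows = (line for line in maf_lines if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t", quoting=csv.QUOTE_NONE)
    if build_col not in (reader.fieldnames or []):
        return None
    return Counter((row.get(build_col) or "").strip() for row in reader)


def non_hg19_builds(build_counts):
    """Return the build labels (with row counts) that do not look like hg19."""
    return {
        build: n
        for build, n in build_counts.items()
        if build.lower() not in HG19_NAMES
    }
```

If `non_hg19_builds` returns anything, the coordinates would need to be lifted over to hg19 (or an hg38 annotation used instead) before running the pipeline.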

@schulter schulter closed this as completed Sep 8, 2020