Not enough Mutated Oncogenes or TSGs Found in Your Data #21
Comments
This means that none of the genes listed in either the oncogene list (https://github.com/KarchinLab/2020plus/blob/master/data/gene_lists/oncogenes.txt) or the tumor suppressor list (https://github.com/KarchinLab/2020plus/blob/master/data/gene_lists/tsgs.txt) used for training were found in your MAF file. Can you check these files and see whether the gene symbols are actually present in your MAF file?
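One way to run that check is sketched below. It is only an illustration, not part of 2020plus: the helper names `load_gene_list`, `maf_gene_symbols`, and `shared_genes` are hypothetical, and it assumes the gene lists hold one symbol per line and that the MAF is tab-delimited with a `Hugo_Symbol` (or `Gene`) column.

```python
import csv
import io


def load_gene_list(handle):
    """Read one gene symbol per line, skipping blank lines and '#' comments.

    Assumption: the list file has one symbol per line. A header line, if
    present, is simply treated as one more symbol.
    """
    return {line.strip() for line in handle
            if line.strip() and not line.startswith("#")}


def maf_gene_symbols(handle):
    """Collect the gene symbols found in a tab-delimited MAF file."""
    # Skip leading comment lines such as '#version 2.4'.
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    fields = reader.fieldnames or []
    if "Hugo_Symbol" in fields:
        column = "Hugo_Symbol"
    elif "Gene" in fields:
        column = "Gene"
    else:
        raise ValueError("MAF has neither a Hugo_Symbol nor a Gene column")
    return {row[column] for row in reader if row[column]}


def shared_genes(maf_handle, list_handle):
    """Return the sorted gene symbols present in both the MAF and the list."""
    return sorted(load_gene_list(list_handle) & maf_gene_symbols(maf_handle))


if __name__ == "__main__":
    # Tiny in-memory example; replace with open(...) on the real files.
    maf = io.StringIO("Hugo_Symbol\tChromosome\nTP53\t17\nKRAS\t12\n")
    genes = io.StringIO("KRAS\nBRAF\n")
    print(shared_genes(maf, genes))
```

An empty result for both lists would match the situation described above. A non-empty result, as turned out to be the case in this thread, points to a cause other than the gene symbols.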
Hi,
The TSGs also look as expected with Further, the MAF contains the following classifications of variants:
Here is the features.txt file containing summary statistics for only 1198 genes. Does that help to solve the issue? I also downloaded hg19 from the tutorial page, converted it to fasta using
Thank you for your effort and time.
Best,
Roman
Can you run the unit test that evaluates whether the training command works on a toy dataset? You'll need the
$ nosetests tests/test_train.py
The error should be the same as what you observe on your data if there is a problem in the code.
The test ran through, and training on the original data also works as expected. The error only occurs with my particular data set. Somehow the oncogene.txt in the output folder contains 18,000 lines, while the tsg.txt contains only 1198 lines, and the summary then also contains only 1198 lines. This differs from both the pan-cancer and bladder cancer examples you provide, and I assume it produces the observed error message.
I found the error. The coordinates in my MAF files are from HG38, while the snvboxgenes.bed file is for HG19. So this is actually a duplicate of #16, and I will close this issue. However, maybe you could consider adding a clear error message for this case, as most (at least TCGA) MAF files contain a column for the reference genome build.
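A pre-flight check along those lines could look like the sketch below. This is not something 2020plus provides: `maf_builds`, `check_build`, and `HG19_LABELS` are hypothetical names, and the sketch assumes the build is stored in a column called `NCBI_Build`, as in TCGA-style MAF files, with values such as `GRCh37`, `hg19`, `37`, or `GRCh38`.

```python
import csv
import io

# Labels treated as "hg19" here; an assumption, extend as needed.
HG19_LABELS = {"hg19", "grch37", "37"}


def maf_builds(handle, column="NCBI_Build"):
    """Return the set of build labels in a MAF, or None if the column is absent."""
    # Skip leading comment lines such as '#version 2.4'.
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    if column not in (reader.fieldnames or []):
        return None
    return {row[column].strip() for row in reader if row[column]}


def check_build(handle, column="NCBI_Build"):
    """Raise ValueError if any row reports a build other than hg19/GRCh37.

    Returns "ok" when every reported build is hg19, or "unknown" when the
    MAF has no build column and nothing can be verified.
    """
    builds = maf_builds(handle, column)
    if builds is None:
        return "unknown"
    unexpected = sorted(b for b in builds if b.lower() not in HG19_LABELS)
    if unexpected:
        raise ValueError(
            "MAF reports genome build(s) %s, but hg19 coordinates are expected"
            % ", ".join(unexpected)
        )
    return "ok"


if __name__ == "__main__":
    maf = io.StringIO("Hugo_Symbol\tNCBI_Build\nTP53\tGRCh38\n")
    try:
        check_build(maf)
    except ValueError as err:
        print(err)
```

Note that a MAF reduced to only the required columns, like the one in this issue, would return "unknown", so the check has to run on the file before the build column is dropped.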
Hi,
I am currently testing the 2020plus software on my MAF data set, which is basically the TCGA MAF files for 16 cancer types concatenated. The algorithm ran without problems using the MAF file from the tutorial. My MAF file contains roughly 2.5 million mutations, of which 500,000 are classified as silent.
Most of the snakemake pipeline works as expected, but I get this error in the final stages:
I called the tool using:
snakemake -s Snakefile predict -p --cores 64 --config mutations="data/pancancer_16_onlyrequiredcols.maf" output_dir="pancan16_ourmutation"
where
data/pancancer_16_onlyrequiredcols.maf
is my edited MAF file, and I leave all other data as in the tutorial. Do you know why this error happens? Could it be that there is a problem with my mutation file, or is the problem in the layout (e.g. not enough mutations in some of the known TSGs/oncogenes due to using only a subset of cancer types)?
I might add that the MAF file only contains the columns required according to this page, that is:
*Hugo_Symbol (or named “Gene”)
*Chromosome
Maybe that has to do with the error?
Thank you for some hints on that.
Best,
Roman