Error using median_vaf_purity on a combined-polyphen-snpeff.vcf #228
Comments
@jburos it currently looks for "mutect" or "strelka" in the filename to figure out how to parse the file. Silly, I know. We need to do better.
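For readers following along, the filename check described here can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the helper name `caller_from_filename` is hypothetical, and the only thing taken from the thread is that the substrings "mutect" and "strelka" are looked for in the file name.

```python
import os


def caller_from_filename(path):
    """Guess the variant caller from the file name (hypothetical helper).

    Returns "mutect" or "strelka" if that substring appears in the base
    name, otherwise None.
    """
    name = os.path.basename(path).lower()
    for caller in ("mutect", "strelka"):
        if caller in name:
            return caller
    return None


# A merged file such as the one in this issue matches neither substring.
print(caller_from_filename("combined-polyphen-snpeff.vcf"))
```

With a check like this, a combined file has no recognizable caller, which is consistent with the failure reported in this issue.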
@tavinathanson yep, i figured it was something like this. just noting for now, it's not urgent atm. seems like having an "other" category & parsing like a generic vcf would be useful -- i assume there is a generic vcf format?
@jburos my understanding is that we can't rely on a standard format for things like depth and VAF, hence needing to understand the caller in order to grab them. But we should be able to infer the format.
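One possible way to infer the format, sketched here under assumptions: scan the VCF meta-information lines (those starting with "##") for a mention of the caller instead of relying on the file name. The function name `caller_from_header` is hypothetical, and whether a given caller actually names itself in its header is an assumption that should be checked against real files.

```python
def caller_from_header(vcf_lines):
    """Guess the caller from '##' meta lines (hypothetical helper).

    Assumes the caller's name appears somewhere in the header; returns
    the first match among "mutect" and "strelka", or None.
    """
    for line in vcf_lines:
        if not line.startswith("##"):
            break  # meta-information lines precede the #CHROM line
        lowered = line.lower()
        for caller in ("mutect", "strelka"):
            if caller in lowered:
                return caller
    return None


# Illustrative header lines, not copied from a real file.
example = [
    "##fileformat=VCFv4.1\n",
    "##source=strelka\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
]
print(caller_from_header(example))
```

A merged file may mention more than one caller in its header; this sketch simply returns the first one it sees, so a combined VCF would still need separate handling.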
Here is the backstory on this one: Strelka's VCF implementation is incomplete, and VCF validators fail its files because of the unusual fields/annotations it uses. The funny thing is that its developers are aware of this and all of Biostars has been complaining about it, but there is still no change/fix. The main problem with Strelka in terms of VAFs is that it doesn't output a VAF field, since the number of reads supporting each allele comes from simulations. This is especially true for indels, and although there is a way to get a VAF-like number out for these variants, it is often not in line with the other tools' estimates. The only exception is variants that are called by both callers, where we do see almost exactly the same VAF across tools (search the bladder repo for the keyword Strelka and there is a notebook investigating this issue). So that is why VAF extraction fails with Strelka variants: there is no VAF in those VCFs :(
Furthermore, MuTect does a terrible job of naming the variant columns, and I found that the only way to tell the tumor sample from the normal one is to look within the header. This is very MuTect-specific and therefore fails on non-MuTect/GATK files.
Jacki, if you are doing lots of VCF reads, you might want to drop all "REJECT"ed variants from the MuTect file to speed up the process. PyVCF is just terrible with huge VCFs, and I am assuming it takes quite a lot of time to process the MuTect ones. Just FYI.
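The suggestion to drop rejected variants before parsing could look something like the sketch below. It streams the file as plain text rather than going through PyVCF, and it assumes rejected records carry REJECT in the FILTER column (the seventh tab-separated field of a VCF record). The function name `drop_rejected` is hypothetical.

```python
def drop_rejected(vcf_lines):
    """Yield header lines and records whose FILTER does not include REJECT."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep all header lines untouched
            continue
        fields = line.rstrip("\n").split("\t")
        # FILTER is the 7th fixed VCF column (index 6); it may hold
        # several semicolon-separated values.
        if len(fields) > 6 and "REJECT" in fields[6].split(";"):
            continue
        yield line


# Illustrative records, not real data.
records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\t.\tA\tT\t.\tPASS\t.\n",
    "1\t200\t.\tG\tC\t.\tREJECT\t.\n",
]
kept = list(drop_rejected(records))
print(len(kept))
```

Writing the surviving lines to a new file and pointing the VCF reader at that file keeps the slow parser away from the bulk of the records.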
Yikes - @armish thanks for this backstory. It all sounds pretty horrifying. Seems like everyone would benefit from a "format normalization" tool that sanitizes/fixes VCFs into a sane format, and perhaps addresses some of these issues along the way. On a more practical note, what is the best way for me to confirm that I am processing the VCF files correctly? It makes me a little nervous to think we might be misreading them accidentally.
@jburos: unfortunately there is no silver bullet here: If you are interested in ETLing the annotations, then either But as a rule of thumb: I suggest always checking for
Getting the following error when trying to estimate median_vaf_purity on a combined/merged VCF file.