QC step to exclude failed samples (based on CDC recommended FASTQ QC thresholds) from analysis. #125

Open
jessres opened this issue Nov 7, 2024 · 3 comments

Comments

jessres commented Nov 7, 2024

Is your feature request related to a problem? Please describe.
We ran into an issue where too many low-coverage samples resulted in a nearly empty vcf-to-fasta file, which affected the results for the passing samples and left us unable to generate a phylogenetic tree. Further investigation showed that even 2-3 very low-coverage samples can affect the accuracy of the phylogenetic tree.
Describe the solution you'd like
We would like failed low-coverage samples to be removed before vcf-to-fasta generation, so that only passing samples are used for the core genome and the resulting phylogenetic tree is accurate. The QC results should still include all samples.

Describe alternatives you've considered
Alternatively, we have considered re-analyzing the run with just the passing samples; however, this negatively impacts our automated workflow and turnaround time (TAT).

Additional context
Add any other context or screenshots about the feature request here.

Collaborator

zmudge3 commented Nov 12, 2024

Hi @jessres, thanks for your request. I'm seeing three potential solutions to this issue:

  • We're working on adding an independent "pre-MycoSNP" workflow for the v1.6 release. This workflow generates de novo assemblies, then performs taxonomic classification and, for C. auris, clade typing. We can probably include some of the same pre-alignment QC stats that are found in the main workflow's qc_report.txt, e.g. GC After Trimming, Reference Length Coverage After Trimming, and Average Q Score After Trimming. Then, the main MycoSNP workflow could be run with only those samples with acceptable coverage. However, this approach wouldn't allow you to use any post-alignment QC metrics such as Mean Coverage Depth and Genome Fraction at 10X (new metric coming in v1.6), and this would still require manual assessment of the QC metrics to determine which samples to exclude from the main MycoSNP workflow.
  • This one works with the current release, v1.5, but again requires manual assessment: include the --skip_combined_analysis parameter, so that the MycoSNP workflow only runs through the alignment and QC report steps. The full pipeline could then be rerun without the failing samples. This isn't ideal either, because the trimming/alignment/QC steps would be performed again for all the passing samples, but I thought I'd mention it as a potential way to save some time with the current version, in case you weren't already aware of it.
  • We could add an option to the main MycoSNP workflow for sample inclusion/exclusion criteria based on QC metrics. Any samples that don't meet those criteria would be excluded from the rest of the run, with inclusion/exclusion indicated on the QC report. I think this is really what you're after, but I would appreciate it if you could confirm. No promises that this makes it into the next release, but I'm glad it's on our radar.
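
For illustration only, here is a minimal sketch of what the third option could look like if done outside the pipeline. It is not MycoSNP code. It assumes the QC report can be read as a tab-delimited table with a "Sample Name" column (a hypothetical name) alongside the metric names mentioned above, and the threshold values are placeholders rather than CDC-recommended numbers.

```python
import csv
import io

# Hypothetical minimum values; real thresholds would come from your own QC guidance.
THRESHOLDS = {
    "Mean Coverage Depth": 20.0,
    "Average Q Score After Trimming": 28.0,
}


def classify_samples(report_text, thresholds=THRESHOLDS, delimiter="\t"):
    """Label every sample PASS/FAIL and return (all_rows, passing_sample_names).

    All samples stay in the returned rows, so the QC report remains complete;
    only the passing names would be handed to the combined analysis.
    """
    reader = csv.DictReader(io.StringIO(report_text), delimiter=delimiter)
    rows, passing = [], []
    for row in reader:
        # Collect every metric that falls below its minimum.
        failed = [m for m, minimum in thresholds.items() if float(row[m]) < minimum]
        row["QC Status"] = "FAIL: " + "; ".join(failed) if failed else "PASS"
        rows.append(row)
        if not failed:
            passing.append(row["Sample Name"])  # assumed column name
    return rows, passing


# Made-up example data, not real results.
example = (
    "Sample Name\tMean Coverage Depth\tAverage Q Score After Trimming\n"
    "sampleA\t85.2\t35.1\n"
    "sampleB\t3.4\t34.8\n"
    "sampleC\t41.0\t25.0\n"
)
rows, passing = classify_samples(example)
for row in rows:
    print(row["Sample Name"], row["QC Status"])
print("Samples for combined analysis:", passing)
```

Only the names in the passing list would be carried into a rerun, while the labelled rows keep every sample visible in the QC output, which matches the request that QC results still include all samples.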

Author

jessres commented Nov 13, 2024

Hello @zmudge3, we appreciate such a quick response! Option 3 is exactly what we are looking for. When do you expect v1.6 to be released?

Collaborator

zmudge3 commented Nov 13, 2024

Got it, thank you. We're hoping to have it released by the end of the year.
