Wish list for Figaro2 #67

Open
hjarnek opened this issue Aug 24, 2024 · 0 comments

Hi @michael-weinstein,

I was happy to hear that you are resuming work on Figaro2. I would like to share a few things that I would be even happier to see fixed for the next version.

  • Consideration of per-sample read counts. When samples vary in quality, choosing trimming parameters can be tricky. Because most bioinformatic pipelines at some point rarefy to the same sequencing depth across samples, it is desirable for read counts after trimming to be as similar as possible across samples: the surplus reads in the better-quality samples will be discarded during rarefaction anyway. In other words, suppose you are comparing two sets of trimming parameters. The first yields 50k more reads per sample in all but one sample, while the second yields 10k more reads in that lowest-quality sample. The second set may still be the better option, but this is not how Figaro works today.
  • maxEE(fwd,rev) cutoff values as an alternative to a read retention percentile target. This is for cases where the more relevant question is "How many reads will I get with this error filtering?" rather than "How many errors will I get with this read retention?".
  • Possibility to run on already primer-trimmed samples. Many of the files I get have already been run through cutadapt, and the original raw fastq files are nowhere to be found. Besides, sometimes a lot of reads in the raw files don't contain the primer sequence for some reason and will therefore be discarded anyway, yet those sequences will be taken into account by Figaro as it is now, possibly skewing the optimum. This probably means Figaro will have to accept some variation in read length of the input files.
  • Support for binned quality scores (as far as that is possible). More and more data come from NovaSeq or NextSeq instruments, which use binned quality scores. Also, due to the nature of these technologies, the error profiles may not always look the way we are used to from MiSeq/HiSeq. Two-colour Illumina systems tend to overconfidently call G bases when quality drops too low, as discussed here, for example: "Binned quality scores and their effect on (non-decreasing) trans rates", benjjneb/dada2#1307 (comment). I would guess that at some point in the pipeline you need to auto-detect these polyG runs and treat them as low quality.
  • Figaro wrapped as an R package. As DADA2 is implemented in R, this would make for much easier integration in pipelines.
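
To make the first two points a bit more concrete, here is a rough Python sketch of what I have in mind. It is not based on Figaro's code, and all function names are made up for illustration. It assumes the usual expected-error definition (the sum of 10^(-Q/10) over a read's Phred scores), handles only one read direction, and takes the per-sample read counts for each candidate parameter set as already computed.

```python
from typing import Dict, List, Sequence


def expected_errors(quals: Sequence[int]) -> float:
    """Expected number of errors in a read: sum of 10^(-Q/10) over its Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)


def reads_passing(sample_quals: List[Sequence[int]], trim_len: int, max_ee: float) -> int:
    """Count reads that are at least trim_len long and, once truncated
    to trim_len, have at most max_ee expected errors."""
    return sum(
        1
        for quals in sample_quals
        if len(quals) >= trim_len and expected_errors(quals[:trim_len]) <= max_ee
    )


def pick_by_worst_sample(retained: Dict[str, Dict[str, int]]) -> str:
    """Given {parameter_set: {sample: reads_retained}}, return the parameter
    set whose lowest-yield sample retains the most reads."""
    return max(retained, key=lambda params: min(retained[params].values()))


# Hypothetical counts mirroring the example in the first bullet:
# set1 gives 50k more reads in samples A and B, set2 gives 10k more in sample C.
retained = {
    "set1": {"A": 150_000, "B": 150_000, "C": 20_000},
    "set2": {"A": 100_000, "B": 100_000, "C": 30_000},
}
best = pick_by_worst_sample(retained)
```

With these made-up numbers, the worst-sample criterion prefers "set2", because after rarefaction the depth is limited by sample C. A real implementation would presumably want something less extreme than a pure minimum, for example ignoring samples that are going to be dropped anyway.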

Thank you for your great work with this tool!
