Wish list for Figaro2 #67

hjarnek · 2024-08-24T11:43:24Z

I was happy to hear that you are resuming work on Figaro2. I would like to share a few things that I would be even happier to see fixed for the next version.

Consideration of individual sample read numbers. If you have samples of varying quality, choosing trimming parameters can be tricky. Most bioinformatic pipelines at some point rarefy to the same sequencing depth across samples, so it is desirable for read numbers after trimming to be as similar as possible across samples: the surplus of reads in the better-quality samples will be discarded during rarefaction anyway. In other words, suppose you are considering two sets of trimming parameters, where the first yields 50k more reads per sample in all but one sample, while the second yields 10k more reads in that lowest-quality sample. The second set may still be the better option. However, this is not how Figaro works today.
maxEE(fwd,rev) cutoff values, as an alternative to a read retention percentile target. For when the more relevant question is "How many reads will I get with this error filtering?" rather than "How many errors will I get with this read retention?".
Possibility to run on already primer-trimmed samples. Many of the files I get have already been run through cutadapt, and the original raw fastq files are nowhere to be found. Besides, sometimes a lot of reads in the raw files don't contain the primer sequence for some reason and will therefore be discarded anyway, yet those sequences will be taken into account by Figaro as it is now, possibly skewing the optimum. This probably means Figaro will have to accept some variation in read length of the input files.
Support for binned quality scores (as far as possible). More and more data comes from NovaSeq or NextSeq instruments, which use binned quality scores. Also, due to the nature of these technologies, the error profiles may not always look like what we're used to from MiSeq/HiSeq. Two-colour Illumina systems tend to overconfidently call G bases when quality drops too low, as discussed here for example: Binned quality scores and their effect on (non-decreasing) trans rates benjjneb/dada2#1307 (comment). I would guess that at some point in the pipeline you need to auto-detect these polyG runs and treat them as low quality.
Figaro wrapped as an R package. As DADA2 is implemented in R, this would make for much easier integration in pipelines.
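
To make the first two points a bit more concrete, here is a rough Python sketch. It is only an illustration of the idea, not how Figaro is or should be implemented: the function names, the parameter-set labels and the read counts are all made up, and the expected-error formula is the usual sum of per-base error probabilities derived from Phred scores.

```python
from typing import Dict, List


def expected_errors(quals: List[int]) -> float:
    """Expected number of errors in one read: sum of 10^(-Q/10) over its Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)


def passes_max_ee(quals: List[int], max_ee: float) -> bool:
    """True if the read would survive a maxEE-style cutoff (hypothetical helper)."""
    return expected_errors(quals) <= max_ee


def rarefaction_depth(counts: Dict[str, int]) -> int:
    """Depth all samples can be rarefied to, i.e. the read count of the poorest sample."""
    return min(counts.values())


def pick_parameter_set(retained: Dict[str, Dict[str, int]]) -> str:
    """Choose the parameter set whose poorest sample keeps the most reads."""
    return max(retained, key=lambda name: rarefaction_depth(retained[name]))


# Made-up reads retained per sample under two candidate parameter sets.
# set_1 keeps 50k more reads in s1 and s2; set_2 keeps 10k more in the poor sample s3.
retained = {
    "set_1": {"s1": 150_000, "s2": 150_000, "s3": 20_000},
    "set_2": {"s1": 100_000, "s2": 100_000, "s3": 30_000},
}

print(pick_parameter_set(retained))
print(passes_max_ee([30, 30, 20, 10], max_ee=2.0))
```

With these toy numbers the sketch picks set_2, because after rarefying to the poorest sample it leaves 30k reads per sample instead of 20k, even though set_1 retains more reads in total.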

Thank you for your great work with this tool!
