
Output results recommendation for DADA2 #40

Open
mentorwan opened this issue Jun 25, 2021 · 3 comments

@mentorwan

Thanks for this tool. I have downloaded and tested this tool. Here are a few questions or comments:

1. It seems the tool cannot support variable-length reads; I needed to trim all reads to a fixed length to make it work.
2. If primers are already removed, the code does not accept a forward parameter of 0 and a reverse parameter of 0, so I just put in a small number such as `-f 5 -r 5` to make it work. Is that right?
3. I don't understand the outputs.
For example:

```
python figaro/figaro.py -i /Volumes/Issue-33/RAW1/Trim/ -o ./output -a 300 -f 5 -r 5
Forward read length: 250
Reverse read length: 251
{"trimPosition": [134, 196], "maxExpectedError": [1, 2], "readRetentionPercent": 92.92, "score": 91.91915753123422}
```

Does it suggest a forward trim length of 134 and a reverse trim length of 196 in the DADA2 or QIIME command, because it has the best read retention percent?

Thanks,
Yunhu Wan
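For readers following along: the recommendation line above is plain JSON, so it can be parsed programmatically. A minimal Python sketch of reading the top entry and pulling out the values that would feed DADA2's truncation settings (the mapping onto `filterAndTrim` arguments is an assumption based on the discussion in this thread, not something the source states):

```python
import json

# The top-scoring line from figaro's output, copied verbatim from above.
raw = ('{"trimPosition": [134, 196], "maxExpectedError": [1, 2], '
       '"readRetentionPercent": 92.92, "score": 91.91915753123422}')
top_entry = json.loads(raw)

# Assumed mapping: trimPosition -> truncLen, maxExpectedError -> maxEE,
# i.e. dada2::filterAndTrim(truncLen = c(134, 196), maxEE = c(1, 2)).
forward_trunc, reverse_trunc = top_entry["trimPosition"]
max_ee_forward, max_ee_reverse = top_entry["maxExpectedError"]

print(forward_trunc, reverse_trunc, max_ee_forward, max_ee_reverse)
```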

@michael-weinstein
Collaborator

Putting in a small number for the primers (even 1) should work for this issue. I designed this with the idea that people would not be pre-trimming their reads; likewise, the requirement that all reads be one length stems from the same cause.

The output of the program should have a list of potential trimming locations and expected error values to use for forward and reverse reads. The first ones listed are the ones considered optimal by the program based on the score. The score starts with the percentage of reads retained (since retaining reads is generally a good thing) and then applies an exponentially increasing penalty for expected error allowances (since these are generally a bad thing). In this case, you are exactly right about the suggestion, and that looks like a very good score with little error.
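That scoring idea can be sketched roughly in code. Note that the penalty shape and coefficients below are illustrative assumptions, not FIGARO's actual formula (the real scores above are not round numbers, so the true penalty clearly differs in detail):

```python
def illustrative_score(read_retention_percent, expected_errors, penalty_base=2):
    """Toy version of the scoring described above: start from the percentage
    of reads retained, then subtract an exponentially growing penalty as the
    allowed expected errors increase. Coefficients are assumptions."""
    penalty = sum(penalty_base ** (ee - 1) - 1 for ee in expected_errors)
    return read_retention_percent - penalty

# With maxExpectedError [1, 2]: penalty = (2^0 - 1) + (2^1 - 1) = 1,
# so the score drops about one point below the retention percentage.
print(illustrative_score(92.92, [1, 2]))
```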

@marschmi

Thanks for the helpful discussion here and in #25 & #27!

I'm also trying to run figaro within a DADA2 analysis. I have 192 fastq files (the R1 and R2 for 96 samples), yet the JSON file and command-line output have 193 entries. I find this confusing because it makes me think that each row of output corresponds to one of the 192 files (which, given the steady decrease in readRetentionPercent, seems like a misunderstanding on my end?).

Though, with the discussion here, it now appears that each row of output is an overall score for the dataset, in order of decreasing read retention. For example:

```
{"trimPosition": [163, 147], "maxExpectedError": [1, 1], "readRetentionPercent": 93.93, "score": 93.93114158846818}
{"trimPosition": [160, 150], "maxExpectedError": [1, 1], "readRetentionPercent": 93.92, "score": 93.92239210796859}
{"trimPosition": [164, 146], "maxExpectedError": [1, 1], "readRetentionPercent": 93.92, "score": 93.91911105278125}
{"trimPosition": [162, 148], "maxExpectedError": [1, 1], "readRetentionPercent": 93.91, "score": 93.91145525734409}
{"trimPosition": [165, 145], "maxExpectedError": [1, 1], "readRetentionPercent": 93.91, "score": 93.91036157228164}
{"trimPosition": [166, 144], "maxExpectedError": [1, 1], "readRetentionPercent": 93.9, "score": 93.89833103659471}
```

This, combined with the slide in #33, implies that the top output should be used for dada2::filterAndTrim(), especially as it will retain most of the reads. Am I understanding this correctly?

@michael-weinstein
Collaborator

The output is sorted by score: the highest-scoring set is listed first, followed by the other sets in descending order. If you were to graph them, you would often see a "peaky" pattern over your optimal trimming sites. I include all the possible combinations because I am a big believer in rigorous QC measures, and tracking alterations in the optimal trimming parameters can provide a way to detect changes in sequencing quality over time. Most users will not need anything but the first value unless they are looking at read-quality trends between runs or charting how the optimization process actually worked.
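Since each line of the results file is an independent JSON object, re-sorting or inspecting the full list is straightforward. A hypothetical Python sketch (the two entries are copied from the output shown earlier in this thread):

```python
import json

# Two candidate lines from the output above, one JSON object per line.
raw_lines = [
    '{"trimPosition": [160, 150], "maxExpectedError": [1, 1], '
    '"readRetentionPercent": 93.92, "score": 93.92239210796859}',
    '{"trimPosition": [163, 147], "maxExpectedError": [1, 1], '
    '"readRetentionPercent": 93.93, "score": 93.93114158846818}',
]
entries = [json.loads(line) for line in raw_lines]

# Sort by score, descending, so the recommended parameters come first --
# the same ordering figaro itself emits.
entries.sort(key=lambda e: e["score"], reverse=True)
best = entries[0]
print(best["trimPosition"])
```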
