I had a question about calibrated aggregated classification scores output by geNomad. If I run a metagenome assembly through genomad, would it be appropriate to use the calibrated aggregated scores of all the contigs that were used as input (including those with virus or plasmid scores below 0.7) to get an idea of how many contigs had "ambiguous" scores? For example, a contig with a plasmid score of 0.4 and a virus score of 0.6 that doesn't get classified as strictly viral or plasmid. I am trying to get a sense of the whole distribution of the contigs. Thanks.
Yes, that makes sense. When you use --enable-score-calibration, the values represent approximate probabilities. For instance, a sequence with a plasmid score of 0.4 and a virus score of 0.6 has roughly a 40% chance of being a plasmid and a 60% chance of being a virus. However, keep in mind that the cutoffs for defining ambiguity can be somewhat arbitrary. It might be better to compute the entropy as a measure of ambiguity. For example:
| Sequence | Chromosome score | Plasmid score | Virus score | Entropy |
|---|---|---|---|---|
| Sequence 1 | 0.2 | 0.6 | 0.2 | 0.950271 |
| Sequence 2 | 0.0 | 0.6 | 0.4 | 0.673012 |

In this case, both sequences have the same maximum score (plasmid score = 0.6). However, the first sequence is more ambiguous than the second because its remaining probability is spread across both of the other classes (chromosome and virus), whereas the second sequence has essentially no chance of being a chromosome. This is quantified through Shannon entropy (computed here with the natural logarithm), which increases as the probabilities of the different classes become more similar. The maximum, ln 3 ≈ 1.098612, is reached when all three scores equal 1/3. So, it may be more appropriate to base your decision of what constitutes an "ambiguous classification" on the entropy value rather than directly on the scores.
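As a rough illustration of the entropy calculation, here is a minimal Python sketch. The contig names and scores are made up for the example (the first one mirrors the plasmid 0.4 / virus 0.6 case from the question, assuming a chromosome score of 0.0), and the natural logarithm is an assumption; nothing here is part of geNomad itself.

```python
import math


def shannon_entropy(scores):
    """Shannon entropy (natural log, in nats) of a vector of class scores.

    Classes with a score of zero are skipped, i.e. 0 * log(0) is taken as 0.
    """
    return sum(-p * math.log(p) for p in scores if p > 0)


# Hypothetical contigs: (chromosome, plasmid, virus) calibrated scores.
contigs = {
    "contig_a": (0.0, 0.4, 0.6),        # split between two classes
    "contig_b": (1 / 3, 1 / 3, 1 / 3),  # maximally ambiguous
    "contig_c": (0.01, 0.01, 0.98),     # confident virus call
}

entropies = {name: shannon_entropy(s) for name, s in contigs.items()}

# List contigs from most to least ambiguous.
for name, h in sorted(entropies.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}\t{h:.6f}")
```

Applied to every input contig, this gives a single ambiguity value per sequence whose distribution can be plotted directly. Where to place a cutoff on that value is still a judgment call; dividing by `math.log(3)` rescales it to the 0 to 1 range if that is easier to interpret.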