Question about use of calibrated aggregation scores #122

Open
gweeenis opened this issue Sep 11, 2024 · 1 comment

Comments

@gweeenis

Hi Antonio,

I have a question about the calibrated aggregated classification scores output by geNomad. If I run a metagenome assembly through geNomad, would it be appropriate to use the calibrated aggregated scores of all the input contigs (including those with virus or plasmid scores below 0.7) to get an idea of how many contigs have "ambiguous" scores? An example would be a contig with a plasmid score of 0.4 and a virus score of 0.6, which doesn't get classified as strictly viral or plasmid. I am trying to get a sense of the whole distribution of the contigs. Thanks.

@apcamargo
Owner

Hi @gweeenis,

Yes, that makes sense. When you use --enable-score-calibration, the values represent approximate probabilities. For instance, a sequence with a plasmid score of 0.4 and a virus score of 0.6 has roughly a 40% chance of being a plasmid and a 60% chance of being a virus. However, keep in mind that the cutoffs for defining ambiguity can be somewhat arbitrary. It might be better to compute the entropy as a measure of ambiguity. For example:

| Sequence | Chromosome score | Plasmid score | Virus score | Entropy |
|---|---|---|---|---|
| Sequence 1 | 0.2 | 0.6 | 0.2 | 0.950271 |
| Sequence 2 | 0.0 | 0.6 | 0.4 | 0.673012 |

In this case, both sequences have the same maximum score (plasmid score = 0.6). However, the first sequence is more ambiguous than the second because its remaining probability is split between the chromosome and virus classes rather than concentrated in a single alternative. This is quantified through Shannon entropy (computed here with the natural logarithm), which increases as the probabilities of the different classes become more similar; the maximum for three classes, ln 3 ≈ 1.098612, is reached when all three scores are approximately 0.33. So, it may be more appropriate to base your decision of what constitutes an "ambiguous classification" on the entropy value rather than directly on the scores.
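
For reference, here is a minimal sketch of how the entropy could be computed from three calibrated scores. It uses only the Python standard library; the function name `shannon_entropy` and the `examples` dictionary are illustrative and not part of geNomad. It assumes the natural logarithm, so the maximum for three classes is ln 3.

```python
import math


def shannon_entropy(scores):
    """Shannon entropy (natural log) of a sequence of class probabilities."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must sum to a positive value")
    # Renormalize in case rounding makes the scores not sum to exactly 1.
    probs = [s / total for s in scores]
    # Classes with p == 0 contribute nothing (p * log(p) -> 0 as p -> 0).
    return -sum(p * math.log(p) for p in probs if p > 0)


# Illustrative (chromosome, plasmid, virus) score triples, not real output.
examples = {
    "Sequence 1": (0.2, 0.6, 0.2),
    "Sequence 2": (0.0, 0.6, 0.4),
}

entropies = {name: shannon_entropy(s) for name, s in examples.items()}
max_entropy = math.log(3)  # entropy when all three scores are equal

for name, h in entropies.items():
    print(f"{name}: entropy = {h:.6f} ({h / max_entropy:.0%} of maximum)")
```

With real results, the same function could be applied to each contig's three calibrated scores. Dividing by the maximum gives a value between 0 and 1, which may be easier to threshold; any such cutoff would still be a judgment call.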
