-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add coverage per CDS to output #1513
Comments
Can you help me to better understand what needs to be done? The formula or algorithm. Maybe there's some existing code? Is this on nucleotides or amino acids? What output format do you imagine? |
Sure thing! I'm currently calculating the percentage based on amino acids, where total gene coverage equals 1.0 and half coverage equals 0.5, with rounding to three significant figures. The code here: For an example output, can look at dengue metadata here: https://data.nextstrain.org/files/workflows/dengue/metadata_all.tsv.zst It felt a little hacky to have a script back calculate from nextclade generated CDS files, so any help is appreciated to streamline the calculation. |
The word "coverage" is quite overloaded, so it's important to agree on what exactly are we going to do. If I understand correctly the idea is to calculate, for each translated CDS, the percentage of amino acid positions which are:
Is the (3) and (4) actually correct? Additionally, should insertions be involved here? The important part is also how you want the result to be presented. I think creating new column for each CDS might be a bit of an overkill for Nextclade in general (e.g. mpox has dozens/hundreds of CDSes which will end up in a very large tsv file). Can we agree on some single-column format? I propose something similar to what we have for aa mutations:
What do you think? Once I have precise specifications for the feature, I can add it in no time. |
Thanks @ivan-aksamentov for fully specifying 😀 What you write is reasonable and is what I would have distilled from OP as well. I agree with embedding in a single column with your proposed grammar. I think we could round to 4 digits after comma to not lose any precision for CDS that are up to around 5k long (e.g. ORF1a in SC2). For dengue 3 makes sense (shorter CDSes) but to generalize, the extra figure is worth it. If we wanted to have fewer pointless characters, we could multiply the resulting number by 10000, saving us While we're at improving CDS metrics, one thing we have for nucs that might make sense to add to CDSes as well is "alignment range". I'd propose we also add this. Grammar:
The question here is what to put if a gene is completely missing. Maybe None, maybe NA, or null. I don't mind. But would be nice to have! |
These were mostly my questions. I have no idea what I am specifying :) You think that deletions and stop codons need to be considered not covered? There is nothing bad about stop codons it seems. If there's an insertion do you count these positions as covered in addition to the normal ones (it might make coverage higher than 100%) I did not plan to truncate the numbers. It will be read mostly programmatically, right? |
For the nucs the current logic is as follows: nextclade/packages/nextclade/src/run/nextclade_run_one.rs Lines 151 to 153 in dc560aa
It does not count either deletions or insertions. |
Indeed we wouldn't count insertions, the way we do for nucs right now. That's partially as insertions can be artefacts and we want to avoid treating them as coverage. So we essentially are looking for aligned bases. You're right that stop codons should count. |
I made a prototype, but this needs a thorough scientific review, because I have no idea what I am doing: Prototype: #1514 |
Motivated by nextstrain/pathogen-repo-guide#50:
We are currently doing this in what feels like a very "hacky" way of using the Nextclade translations outputs to calculate gene coverage (e.g. dengue).
Would it possible to directly output gene/CDS coverage from Nextclade?
The text was updated successfully, but these errors were encountered: