Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add coverage per CDS to output #1513

Open
joverlee521 opened this issue Jul 19, 2024 · 8 comments · May be fixed by #1514
Open

Add coverage per CDS to output #1513

joverlee521 opened this issue Jul 19, 2024 · 8 comments · May be fixed by #1514
Labels
good first issue Good for newcomers help wanted Extra attention is needed t:feat Type: request of a new feature, functionality, enchancement

Comments

@joverlee521
Copy link
Contributor

Motivated by nextstrain/pathogen-repo-guide#50:

It seems like a common pattern for sequencing efforts to focus on specific genes instead of the full genome. It would be helpful for ingest to annotate each record's gene coverage to explore the data.

We are currently doing this in what feels like a very "hacky" way of using the Nextclade translations outputs to calculate gene coverage (e.g. dengue).

Would it possible to directly output gene/CDS coverage from Nextclade?

@joverlee521 joverlee521 added t:feat Type: request of a new feature, functionality, enchancement good first issue Good for newcomers help wanted Extra attention is needed needs triage Mark for review and label assignment labels Jul 19, 2024
@ivan-aksamentov
Copy link
Member

Can you help me to better understand what needs to be done? The formula or algorithm. Maybe there's some existing code?

Is this on nucleotides or amino acids?

What output format do you imagine?

@ivan-aksamentov ivan-aksamentov removed the needs triage Mark for review and label assignment label Jul 19, 2024
@j23414
Copy link

j23414 commented Jul 19, 2024

Sure thing! I'm currently calculating the percentage based on amino acids, where total gene coverage equals 1.0 and half coverage equals 0.5, with rounding to three significant figures. The code here:

https://github.com/nextstrain/dengue/blob/029702275f1e72eba07c9339185a2feb8eec942e/ingest/scripts/calculate-gene-converage-from-nextclade-translation.py#L39-L41

For an example output, can look at dengue metadata here: https://data.nextstrain.org/files/workflows/dengue/metadata_all.tsv.zst

Screenshot 2024-07-19 at 8 24 14 AM

It felt a little hacky to have a script back calculate from nextclade generated CDS files, so any help is appreciated to streamline the calculation.

@ivan-aksamentov
Copy link
Member

ivan-aksamentov commented Jul 19, 2024

@j23414

The word "coverage" is quite overloaded, so it's important to agree on what exactly are we going to do.

If I understand correctly the idea is to calculate, for each translated CDS, the percentage of amino acid positions which are:

  1. within alignment range (e.g. if a part of CDS falls to unsequenced 5' or 3' region - it is considered not covered)
  2. not X
  3. not deletion -
  4. not * stop codon

Is the (3) and (4) actually correct?

Additionally, should insertions be involved here?

The important part is also how you want the result to be presented. I think creating new column for each CDS might be a bit of an overkill for Nextclade in general (e.g. mpox has dozens/hundreds of CDSes which will end up in a very large tsv file). Can we agree on some single-column format? I propose something similar to what we have for aa mutations:

Gene1:0.123,Gene2:0.456,Gene3:0.789

What do you think?

Once I have precise specifications for the feature, I can add it in no time.

@corneliusroemer
Copy link
Member

Thanks @ivan-aksamentov for fully specifying 😀 What you write is reasonable and is what I would have distilled from OP as well.

I agree with embedding in a single column with your proposed grammar.

I think we could round to 4 digits after comma to not lose any precision for CDS that are up to around 5k long (e.g. ORF1a in SC2). For dengue 3 makes sense (shorter CDSes) but to generalize, the extra figure is worth it. If we wanted to have fewer pointless characters, we could multiply the resulting number by 10000, saving us 0.

While we're at improving CDS metrics, one thing we have for nucs that might make sense to add to CDSes as well is "alignment range". I'd propose we also add this. Grammar:

S:10-148,N:None

The question here is what to put if a gene is completely missing. Maybe None, maybe NA, or null. I don't mind. But would be nice to have!

@ivan-aksamentov
Copy link
Member

ivan-aksamentov commented Jul 19, 2024

@corneliusroemer

Thanks @ivan-aksamentov for fully specifying

These were mostly my questions. I have no idea what I am specifying :)

You think that deletions and stop codons need to be considered not covered? There is nothing bad about stop codons it seems.

If there's an insertion do you count these positions as covered in addition to the normal ones (it might make coverage higher than 100%)

I did not plan to truncate the numbers. It will be read mostly programmatically, right?

@ivan-aksamentov
Copy link
Member

ivan-aksamentov commented Jul 19, 2024

For the nucs the current logic is as follows:

let total_aligned_nucs = alignment_range.len();
let total_covered_nucs = total_aligned_nucs - total_missing - total_non_acgtns;
let coverage = total_covered_nucs as f64 / ref_seq.len() as f64;

It does not count either deletions or insertions.

@corneliusroemer
Copy link
Member

Indeed we wouldn't count insertions, the way we do for nucs right now. That's partially as insertions can be artefacts and we want to avoid treating them as coverage. So we essentially are looking for aligned bases. You're right that stop codons should count.

@ivan-aksamentov ivan-aksamentov linked a pull request Jul 19, 2024 that will close this issue
2 tasks
@ivan-aksamentov
Copy link
Member

ivan-aksamentov commented Jul 19, 2024

I made a prototype, but this needs a thorough scientific review, because I have no idea what I am doing:

Prototype: #1514
Web preview: https://nextclade-git-feat-aa-coverage-nextstrain.vercel.app/
Download CLI binaries in the "Artifacts" section of the latest build (once its' done): https://github.com/nextstrain/nextclade/actions/workflows/cli.yml?query=branch%3Afeat%2Faa-coverage

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers help wanted Extra attention is needed t:feat Type: request of a new feature, functionality, enchancement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants