Add coverage per CDS to output #1513

joverlee521 · 2024-07-19T00:20:13Z

Motivated by nextstrain/pathogen-repo-guide#50:

It seems like a common pattern for sequencing efforts to focus on specific genes instead of the full genome. It would be helpful for ingest to annotate each record's gene coverage to explore the data.

We are currently doing this in what feels like a very "hacky" way of using the Nextclade translations outputs to calculate gene coverage (e.g. dengue).

Would it possible to directly output gene/CDS coverage from Nextclade?

ivan-aksamentov · 2024-07-19T01:53:22Z

Can you help me to better understand what needs to be done? The formula or algorithm. Maybe there's some existing code?

Is this on nucleotides or amino acids?

What output format do you imagine?

j23414 · 2024-07-19T12:32:27Z

Sure thing! I'm currently calculating the percentage based on amino acids, where total gene coverage equals 1.0 and half coverage equals 0.5, with rounding to three significant figures. The code here:

https://github.com/nextstrain/dengue/blob/029702275f1e72eba07c9339185a2feb8eec942e/ingest/scripts/calculate-gene-converage-from-nextclade-translation.py#L39-L41

For an example output, can look at dengue metadata here: https://data.nextstrain.org/files/workflows/dengue/metadata_all.tsv.zst

It felt a little hacky to have a script back calculate from nextclade generated CDS files, so any help is appreciated to streamline the calculation.

ivan-aksamentov · 2024-07-19T12:47:17Z

@j23414

The word "coverage" is quite overloaded, so it's important to agree on what exactly are we going to do.

If I understand correctly the idea is to calculate, for each translated CDS, the percentage of amino acid positions which are:

within alignment range (e.g. if a part of CDS falls to unsequenced 5' or 3' region - it is considered not covered)
not X
not deletion -
not * stop codon

Is the (3) and (4) actually correct?

Additionally, should insertions be involved here?

The important part is also how you want the result to be presented. I think creating new column for each CDS might be a bit of an overkill for Nextclade in general (e.g. mpox has dozens/hundreds of CDSes which will end up in a very large tsv file). Can we agree on some single-column format? I propose something similar to what we have for aa mutations:

Gene1:0.123,Gene2:0.456,Gene3:0.789

What do you think?

Once I have precise specifications for the feature, I can add it in no time.

corneliusroemer · 2024-07-19T15:03:49Z

Thanks @ivan-aksamentov for fully specifying 😀 What you write is reasonable and is what I would have distilled from OP as well.

I agree with embedding in a single column with your proposed grammar.

I think we could round to 4 digits after comma to not lose any precision for CDS that are up to around 5k long (e.g. ORF1a in SC2). For dengue 3 makes sense (shorter CDSes) but to generalize, the extra figure is worth it. If we wanted to have fewer pointless characters, we could multiply the resulting number by 10000, saving us 0.

While we're at improving CDS metrics, one thing we have for nucs that might make sense to add to CDSes as well is "alignment range". I'd propose we also add this. Grammar:

S:10-148,N:None

The question here is what to put if a gene is completely missing. Maybe None, maybe NA, or null. I don't mind. But would be nice to have!

ivan-aksamentov · 2024-07-19T15:21:49Z

@corneliusroemer

Thanks @ivan-aksamentov for fully specifying

These were mostly my questions. I have no idea what I am specifying :)

You think that deletions and stop codons need to be considered not covered? There is nothing bad about stop codons it seems.

If there's an insertion do you count these positions as covered in addition to the normal ones (it might make coverage higher than 100%)

I did not plan to truncate the numbers. It will be read mostly programmatically, right?

ivan-aksamentov · 2024-07-19T15:22:51Z

For the nucs the current logic is as follows:

nextclade/packages/nextclade/src/run/nextclade_run_one.rs

Lines 151 to 153 in dc560aa

    
           let total_aligned_nucs = alignment_range.len(); 
        
           let total_covered_nucs = total_aligned_nucs - total_missing - total_non_acgtns; 
        
           let coverage = total_covered_nucs as f64 / ref_seq.len() as f64;

It does not count either deletions or insertions.

corneliusroemer · 2024-07-19T15:54:10Z

Indeed we wouldn't count insertions, the way we do for nucs right now. That's partially as insertions can be artefacts and we want to avoid treating them as coverage. So we essentially are looking for aligned bases. You're right that stop codons should count.

ivan-aksamentov · 2024-07-19T16:38:05Z

I made a prototype, but this needs a thorough scientific review, because I have no idea what I am doing:

Prototype: #1514
Web preview: https://nextclade-git-feat-aa-coverage-nextstrain.vercel.app/
Download CLI binaries in the "Artifacts" section of the latest build (once its' done): https://github.com/nextstrain/nextclade/actions/workflows/cli.yml?query=branch%3Afeat%2Faa-coverage

joverlee521 added t:feat Type: request of a new feature, functionality, enchancement good first issue Good for newcomers help wanted Extra attention is needed needs triage Mark for review and label assignment labels Jul 19, 2024

ivan-aksamentov removed the needs triage Mark for review and label assignment label Jul 19, 2024

ivan-aksamentov linked a pull request Jul 19, 2024 that will close this issue

feat: calculate cds coverage #1514

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add coverage per CDS to output #1513

Add coverage per CDS to output #1513

joverlee521 commented Jul 19, 2024

ivan-aksamentov commented Jul 19, 2024

j23414 commented Jul 19, 2024

ivan-aksamentov commented Jul 19, 2024 •

edited

Loading

corneliusroemer commented Jul 19, 2024

ivan-aksamentov commented Jul 19, 2024 •

edited

Loading

ivan-aksamentov commented Jul 19, 2024 •

edited

Loading

corneliusroemer commented Jul 19, 2024

ivan-aksamentov commented Jul 19, 2024 •

edited

Loading

Add coverage per CDS to output #1513

Add coverage per CDS to output #1513

Comments

joverlee521 commented Jul 19, 2024

ivan-aksamentov commented Jul 19, 2024

j23414 commented Jul 19, 2024

ivan-aksamentov commented Jul 19, 2024 • edited Loading

corneliusroemer commented Jul 19, 2024

ivan-aksamentov commented Jul 19, 2024 • edited Loading

ivan-aksamentov commented Jul 19, 2024 • edited Loading

corneliusroemer commented Jul 19, 2024

ivan-aksamentov commented Jul 19, 2024 • edited Loading

ivan-aksamentov commented Jul 19, 2024 •

edited

Loading

ivan-aksamentov commented Jul 19, 2024 •

edited

Loading

ivan-aksamentov commented Jul 19, 2024 •

edited

Loading

ivan-aksamentov commented Jul 19, 2024 •

edited

Loading