Find and use better source for typical mutations of lineages #4

lenaschimmel · 2022-03-15T01:49:08Z

See this comment by @AngieHinrichs which even contains an alternative.

Thanks a lot for your detailed explanation! I'm trying to move this over here so it's easier to find for me.

(Also, if the comment thread over at pango-designation gets locked after too many "off topic" comments, I won't be able to comment there at all. That has already happened in other issues.)

lenaschimmel · 2022-03-15T01:58:17Z

And @SVN-PhD recommended that I take a look at outbreak.info for mutation prevalences.

Currently I have problems with it: neither the website nor the API seems to work properly at the moment, but I will check back later.

FedeGueli · 2022-03-15T07:54:23Z

I suggest looking at cov-spectrum too.
Maybe you could open an issue there asking them (@chaoran-chen) to add a tool for downloading mutation lists in a machine-readable format.
The advantage of cov-spectrum is that you can choose a country and period, restricting the mass of mutations to the ones actually circulating in that place and time.

chaoran-chen · 2022-03-15T07:59:14Z

Hi everyone. I was just reading this issue here. Do the following APIs look useful to you?

Mutations of BA.1 globally:
https://lapis.cov-spectrum.org/open/v1/sample/nuc-mutations?pangoLineage=BA.1*&minProportion=0.01

Pango lineages with the C25708T mutation:
https://lapis.cov-spectrum.org/open/v1/sample/aggregated?nucMutations=C25708T&fields=pangoLineage

You can also further filter by location, dates (and much more). For example:
https://lapis.cov-spectrum.org/open/v1/sample/nuc-mutations?pangoLineage=BA.1*&minProportion=0.01&dateFrom=2022-01-01&region=Europe

Here is the documentation:
https://lapis.cov-spectrum.org/

It uses data from GenBank (prepared and hosted by Nextstrain).
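
As a rough illustration of how the example queries above can be assembled programmatically, here is a minimal Python sketch. The endpoint paths and filter names are taken from the URLs in this comment; the helper names `lapis_url` and `fetch_json` are made up for this example, and the shape of the JSON response is not assumed here, so check the documentation before relying on any particular field.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base path taken from the example URLs in this thread.
LAPIS_BASE = "https://lapis.cov-spectrum.org/open/v1/sample"


def lapis_url(endpoint: str, **filters: object) -> str:
    """Build a LAPIS query URL (hypothetical helper, not part of LAPIS)."""
    # safe="*" keeps the sublineage wildcard readable (BA.1* instead of BA.1%2A).
    return f"{LAPIS_BASE}/{endpoint}?{urlencode(filters, safe='*')}"


def fetch_json(url: str) -> object:
    """Download and parse a response; needs network access."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


# Rebuilds the filtered BA.1* example from this comment.
url = lapis_url(
    "nuc-mutations",
    pangoLineage="BA.1*",
    minProportion=0.01,
    dateFrom="2022-01-01",
    region="Europe",
)
print(url)
```

Calling `fetch_json(url)` would then return the parsed response for that query.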

lenaschimmel · 2022-03-15T08:18:05Z

Thanks a lot, that looks perfect!

lenaschimmel · 2022-03-16T11:58:47Z

I worked on the cov-spectrum integration yesterday. It's not finished yet, but it looks promising!

Also, I've been ignoring deletions and insertions until now, because they are not present in virus_properties.json and are also ignored by some other tools. Looks like cov-spectrum handles deletions just like any other mutation, which I might do as well. @chaoran-chen is there a way to also get the insertions of a lineage from cov-spectrum?

And @corneliusroemer, I saw your comment there. My code (not yet pushed, I will do that in the evening) can currently fetch up-to-date mutation lists from cov-spectrum and either generate a virus_properties.json with mostly the same syntax as your file:

        "21K": [
            "G21989-",
            "T13195C",
            ...

or it can also include the prevalence for each mutation:

        "21K": [
            {
                "mutation": "G21989-",
                "proportion": 0.9401410657729306,
                "count": 764958
            },
            {
                "mutation": "T13195C",
                "proportion": 0.9829978750416327,
                "count": 799829
            },

I don't have a use for the absolute count, so I could also break it down to:

        "21K": {
           "G21989-": 0.9401410657729306,
           "T13195C": 0.9829978750416327,
           ...

Does any of this seem useful for your work on Nextclade?
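
To make the relationship between the three layouts concrete, here is a small Python sketch that derives the plain list and the compact mapping from the detailed entries. The function names are invented for this example and are not taken from the actual tool; the sample values are the two 21K entries shown above.

```python
from typing import Dict, List, TypedDict


class MutationEntry(TypedDict):
    """One entry of the detailed layout (mutation, proportion, count)."""
    mutation: str
    proportion: float
    count: int


def to_mutation_list(entries: List[MutationEntry]) -> List[str]:
    """First layout: just the mutation names."""
    return [entry["mutation"] for entry in entries]


def to_proportion_map(entries: List[MutationEntry]) -> Dict[str, float]:
    """Third layout: mutation name mapped to its proportion, count dropped."""
    return {entry["mutation"]: entry["proportion"] for entry in entries}


entries_21k: List[MutationEntry] = [
    {"mutation": "G21989-", "proportion": 0.9401410657729306, "count": 764958},
    {"mutation": "T13195C", "proportion": 0.9829978750416327, "count": 799829},
]

print(to_mutation_list(entries_21k))
print(to_proportion_map(entries_21k))
```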

chaoran-chen · 2022-03-16T12:27:44Z

> @chaoran-chen is there a way to also get the insertions of a lineage from cov-spectrum?

Unfortunately not yet. It will be possible eventually, but I don't know yet when I will be able to implement it.

AngieHinrichs · 2022-03-16T16:22:37Z

Unfortunately, for SARS-CoV-2 sequences there are many genome assembly pipelines in use that do not do a good job with indels, so it may be just as well to skip them. I've seen cases where expected deletions are filled in with Ns, back-filled with reference sequence, or partially filled with read alignments that extend a bit into the deleted part of the reference genome, sometimes causing false "substitutions" in the deleted region. There is definitely enough information in the substitutions alone to distinguish between the Nextstrain clades. (Although if properly assembled sequences with reliable indels are available, I suppose including indels could provide a more precise estimate of the breakpoint.)

lenaschimmel · 2022-03-16T16:39:07Z

@chaoran-chen:

> Unfortunately not yet. It will be possible eventually, but I don't know yet when I will be able to implement it.

Ok, just wanted to make sure that I'm not missing anything that's already there.

@AngieHinrichs: I agree. So I'll make it so that deletions are ignored by default, but can be enabled with a flag.

lenaschimmel · 2022-03-16T20:26:53Z

Support for LAPIS / cov-spectrum is now released! The repo contains a pre-built virus_properties.json which can be updated with --rebuild-examples.

Deletions are disabled / ignored by default, but can be enabled with --enable-deletions. I'm not perfectly happy with how it works right now, but I think it's a good start:

[Screenshots: output without deletions and with deletions]

Thanks a lot for your input!
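
The default-off behaviour for deletions could be sketched in Python roughly as follows. This is an illustration only, not the tool's actual code: the `--enable-deletions` flag name comes from this comment, the trailing `-` convention for deletions follows the `G21989-` example earlier in the thread, and the helper names are hypothetical.

```python
import argparse
from typing import Iterable, List


def is_deletion(mutation: str) -> bool:
    """Deletions are written with '-' as the new base, e.g. 'G21989-'."""
    return mutation.endswith("-")


def select_mutations(mutations: Iterable[str], enable_deletions: bool = False) -> List[str]:
    """Drop deletions unless they were explicitly enabled (hypothetical helper)."""
    if enable_deletions:
        return list(mutations)
    return [m for m in mutations if not is_deletion(m)]


parser = argparse.ArgumentParser()
parser.add_argument(
    "--enable-deletions",
    action="store_true",
    help="also consider deletions (ignored by default)",
)

mutations_21k = ["G21989-", "T13195C"]

# Explicit argument lists are used here so the example does not read sys.argv.
default_args = parser.parse_args([])
with_deletions = parser.parse_args(["--enable-deletions"])

print(select_mutations(mutations_21k, default_args.enable_deletions))
print(select_mutations(mutations_21k, with_deletions.enable_deletions))
```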

corneliusroemer · 2022-03-17T00:57:50Z

@lenaschimmel

> Does any of this seem useful for your work on Nextclade?

This is pretty much how I started creating the virus_properties.json, before switching to Nextclade data because covSpectrum doesn't have our clades.

lenaschimmel · 2022-03-19T00:08:29Z

I just pushed an update with the new --mutation-threshold parameter. See this comment for more details.

I think this finally addresses @AngieHinrichs' original suggestion.
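
The exact semantics of --mutation-threshold are described in the linked comment rather than here, so the following Python sketch only shows one plausible reading: a mutation counts as typical for a clade if its proportion is at least the threshold. The function name and the third sample entry are hypothetical; the first two proportions are the 21K values quoted earlier in this thread.

```python
from typing import Dict, List


def mutations_above_threshold(proportions: Dict[str, float], threshold: float) -> List[str]:
    """Keep mutations whose proportion is at least `threshold` (assumed semantics)."""
    return [m for m, p in proportions.items() if p >= threshold]


proportions_21k = {
    "G21989-": 0.9401410657729306,
    "T13195C": 0.9829978750416327,
    "C12345T": 0.03,  # hypothetical low-prevalence mutation, for illustration only
}

# A stricter threshold keeps fewer mutations.
print(mutations_above_threshold(proportions_21k, 0.05))
print(mutations_above_threshold(proportions_21k, 0.95))
```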

AngieHinrichs · 2022-03-22T00:25:32Z

Thanks @lenaschimmel, --mutation-threshold should do the trick!

I think another tweak might be needed for --rebuild-examples, however. In the latest virus_properties.json, and after running --rebuild-examples, the lists for 21I and 21J are empty:

        "21I": [],
        "21J": [],

-- is that perhaps because all of their defining mutations are now in 21A because of the new minimum of 0.05 when rebuilding?

21J grew much larger than 21I (almost 10x as many genomes per quick stats on the UCSC/UShER tree), so the allele frequencies in 21A are heavily skewed towards 21J.
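
A quick back-of-the-envelope check of that skew, as a Python sketch. The genome counts are made up and only preserve the roughly 10:1 ratio mentioned above. The calculation assumes, purely for illustration, that 21A frequencies are pooled over 21I and 21J only and that a clade-specific mutation is present in all genomes of its own clade and none of the other.

```python
from typing import List, Tuple


def pooled_frequency(groups: List[Tuple[int, float]]) -> float:
    """Allele frequency over several groups, weighted by genome count."""
    total = sum(count for count, _ in groups)
    return sum(count * freq for count, freq in groups) / total


# Hypothetical counts with 21J about ten times larger than 21I.
N_21I, N_21J = 1_000, 10_000

# A mutation found only in 21J, and one found only in 21I.
only_21j = pooled_frequency([(N_21I, 0.0), (N_21J, 1.0)])
only_21i = pooled_frequency([(N_21I, 1.0), (N_21J, 0.0)])

print(round(only_21j, 3))  # 0.909
print(round(only_21i, 3))  # 0.091
```

Under these assumptions both values lie above a 0.05 minimum, which would be consistent with the guess that the defining mutations of both 21I and 21J end up attributed to 21A.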

When I run on the GenBank sequences from cov-lineages/pango-designation#471 (471.genbank.aligned.fa.gz), the label for Delta is "Delta (B.1.617.2 / 21A)" but the mutations are more like 21J because they include 4181T, 6402T, 7124T, 8986T, 9053G and so on. Since the proposed recombinant is from 21J (like most would be by chance since 21J was so much more common than 21I, probably especially by the time Omicron was around though I have not checked dates), the recombination picture comes out perfect except for the '21A' label:

I believe there are very few Delta sequences that are 21A but not 21I or 21J, so the quickest fix might be to simply skip 21A, although I'm not sure what that would mean for mutations shared by 21I and 21J.

It should be pretty straightforward to transform my file of not-masked-for-UShER Nextstrain clade mutations to the virus_properties.json format. I will give that a try.

corneliusroemer · 2022-03-22T20:23:25Z

There are indeed not that many Deltas (lately) that are neither 21I nor 21J. They do exist, there are a few pango lineages, but for identifying current recombinants, one can drop 21A without having to worry too much.

lenaschimmel · 2022-03-22T22:05:51Z

There's an update on #10 which is also relevant to this issue. See my comment here.

lenaschimmel changed the title ~~Check if virus_properties is really a good source for this use case~~ Find and use better source for typical mutations of lineages Mar 16, 2022

lenaschimmel mentioned this issue Mar 18, 2022

Deltacrons with NSP3 breakpoint #8

Open

lenaschimmel closed this as completed Mar 19, 2022

AngieHinrichs mentioned this issue Mar 22, 2022

Fix or remove 21I and 21J #10

Open
