Find and use better source for typical mutations of lineages #4
Comments
And @SVN-PhD recommended that I take a look at outbreak.info for mutation prevalences. I currently have problems with it: neither the website nor the API seems to work properly at the moment, but I will check back later. |
I suggest looking at cov-spectrum too. |
Hi everyone. I was just reading this issue here. Do the following APIs look useful to you? Mutations of BA.1 globally: Pango lineage with the C25708T mutation: You can also further filter by location, dates (and much more). For example: Here is the documentation: It uses data from GenBank (prepared and hosted by Nextstrain). |
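A minimal sketch of how such a query could be built and its response parsed in Python. The base URL, the `nuc-mutations` endpoint path, the `pangoLineage` parameter and the `data` / `mutation` / `proportion` / `count` field names are assumptions, not details given in this thread, so check them against the LAPIS documentation. The sample payload is invented for illustration.

```python
import json
from urllib.parse import urlencode

# Assumed base URL and endpoint path; verify against the LAPIS documentation.
LAPIS_BASE = "https://lapis.cov-spectrum.org/open/v1"


def build_mutation_query(lineage, **filters):
    """Build a URL asking for the nucleotide mutations of one Pango lineage.

    Extra keyword arguments (e.g. country, dateFrom) are passed through
    as query parameters; their names are assumptions as well.
    """
    params = {"pangoLineage": lineage, **filters}
    return f"{LAPIS_BASE}/sample/nuc-mutations?{urlencode(params)}"


def parse_mutations(payload):
    """Turn a JSON response body into (mutation, proportion, count) tuples."""
    body = json.loads(payload)
    return [(row["mutation"], row["proportion"], row["count"])
            for row in body["data"]]


# Invented example payload in the assumed response shape.
sample_payload = json.dumps({
    "data": [
        {"mutation": "C241T", "proportion": 0.99, "count": 990},
        {"mutation": "A23403G", "proportion": 0.98, "count": 980},
    ]
})

url = build_mutation_query("BA.1", country="Germany")
mutations = parse_mutations(sample_payload)
```

To run it against the live service, the body returned by `urllib.request.urlopen(url).read()` could be passed to `parse_mutations` instead of `sample_payload`.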
Thanks a lot, that looks perfect! |
I worked on the cov-spectrum integration yesterday. It's not yet finished, but it looks promising! Also, I've been ignoring deletions and insertions until now, because they are not present in And @corneliusroemer, I saw your comment there. My code (not yet pushed, will do it in the evening) can currently get current mutation lists from cov-spectrum and either generate a
or it can also include the prevalence for each mutation:
I don't have a use for the absolute count, so I could also break it down to:
Does any of this seem useful for your work on Nextclade? |
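The output variants described above (a plain mutation list, or mutations with their prevalence but without the absolute count) could look roughly like the following sketch. The record shape and the function names are hypothetical; the actual formats from the comment were not preserved here.

```python
def to_mutation_list(records):
    """Variant 1: just the mutation names, in input order."""
    return [r["mutation"] for r in records]


def to_prevalence_map(records, digits=3):
    """Variant 2: mutation -> prevalence, dropping the absolute count."""
    return {r["mutation"]: round(r["proportion"], digits) for r in records}


# Invented records in a hypothetical shape (mutation, proportion, count).
records = [
    {"mutation": "C241T", "proportion": 0.9912, "count": 9912},
    {"mutation": "G28881A", "proportion": 0.25, "count": 2500},
]

plain = to_mutation_list(records)
with_prevalence = to_prevalence_map(records)
```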
Unfortunately not yet. It will be possible eventually, but I don't know yet when I will be able to implement it. |
Unfortunately for SARS-CoV-2 sequences there are many genome assembly pipelines in use that do not do a good job with indels, so it may be just as well to skip them. I've seen cases where expected deletions are filled in with Ns, back-filled with reference sequence, or partially filled with read alignments that extend a bit into the deleted part of the reference genome sometimes causing false "substitutions" in the deleted region. There is definitely enough information in the substitutions alone to distinguish between the Nextstrain clades. (Although if properly assembled sequences with reliable indels are available, I suppose including indels could provide a more precise estimation of the breakpoint.) |
Ok, just wanted to make sure that I'm not missing anything that's already there. @AngieHinrichs: I agree. So I'll make it so that deletions are ignored by default, but can be enabled with a flag. |
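A sketch of "ignored by default, enabled with a flag", under two assumptions: deletions are written with a trailing `-` (for example `A28271-`), and the flag is called `--include-deletions`. Both the notation and the flag name are hypothetical and are not taken from the tool itself.

```python
import argparse


def is_deletion(mutation):
    """Assumed notation: a deletion ends with '-', e.g. 'A28271-'."""
    return mutation.endswith("-")


def filter_mutations(mutations, include_deletions=False):
    """Drop deletions unless they were explicitly enabled."""
    if include_deletions:
        return list(mutations)
    return [m for m in mutations if not is_deletion(m)]


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hypothetical flag name; off by default, so deletions are ignored.
    parser.add_argument("--include-deletions", action="store_true",
                        help="also consider deletions (off by default)")
    return parser.parse_args(argv)


mutations = ["C241T", "A28271-", "G28881A"]
default_run = filter_mutations(mutations, parse_args([]).include_deletions)
flagged_run = filter_mutations(
    mutations, parse_args(["--include-deletions"]).include_deletions)
```

Keeping the default at "off" matches the concern raised above that many assembly pipelines do not handle indels reliably.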
Support for LAPIS / cov-spectrum is now released! The repo contains a pre-built
Deletions are disabled / ignored by default, but can be enabled with
Without deletions
With deletions
Thanks a lot for your input! |
This is pretty much how I started creating the |
I just pushed an update with the new I think this finally addresses @AngieHinrichs' original suggestion. |
Thanks @lenaschimmel, I think another tweak might be needed for
-- is that perhaps because all of their defining mutations are now in 21A because of the new minimum of 0.05 when rebuilding? 21J grew much larger than 21I (almost 10x as many genomes per quick stats on the UCSC/UShER tree), so the allele frequencies in 21A are heavily skewed towards 21J.
When I run on the GenBank sequences from cov-lineages/pango-designation#471 (471.genbank.aligned.fa.gz), the label for Delta is "Delta (B.1.617.2 / 21A)" but the mutations are more like 21J because they include 4181T, 6402T, 7124T, 8986T, 9053G and so on. Since the proposed recombinant is from 21J (like most would be by chance, since 21J was so much more common than 21I, probably especially by the time Omicron was around, though I have not checked dates), the recombination picture comes out perfect except for the '21A' label:
I believe there are very few Delta sequences that are 21A but not 21I or 21J, so the quickest fix might be to simply skip 21A, although I'm not sure what that would mean for mutations shared by 21I and 21J.
It should be pretty straightforward to transform my file of not-masked-for-UShER Nextstrain clade mutations to the virus_properties.json format. I will give that a try. |
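The skew can be illustrated with a small calculation. This is a sketch under simplifying assumptions: 21A is treated as the pool of only 21I and 21J, the genome counts are invented round numbers in the roughly 1:10 ratio mentioned above, and a clade-specific mutation is taken to be present in every genome of its clade and in none of the other.

```python
MIN_FREQ = 0.05  # the minimum frequency mentioned above


def pooled_frequency(groups):
    """Allele frequency over several groups of (genome_count, frequency)."""
    total = sum(n for n, _ in groups)
    return sum(n * f for n, f in groups) / total


# Invented counts: 21J about ten times as large as 21I.
n_21i, n_21j = 1_000, 10_000

# A mutation found only in 21J, and one found only in 21I.
freq_j_only = pooled_frequency([(n_21i, 0.0), (n_21j, 1.0)])  # 10/11
freq_i_only = pooled_frequency([(n_21i, 1.0), (n_21j, 0.0)])  # 1/11

in_21a_j = freq_j_only >= MIN_FREQ
in_21a_i = freq_i_only >= MIN_FREQ
```

Under these assumptions both kinds of mutation clear the 0.05 minimum in the pooled 21A set (about 0.91 and 0.09), which would be consistent with 21A picking up the defining mutations of both subclades while looking mostly like 21J.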
There are indeed not that many Deltas (lately) that are neither 21I nor 21J. They do exist, there are a few pango lineages, but for identifying current recombinants, one can drop 21A without having to worry too much. |
There's an update on #10 which is also relevant to this issue. See my comment here. |
See this comment by @AngieHinrichs which even contains an alternative.
Thanks a lot for your detailed explanation! I'm trying to move this over here so it's easier for me to find.
(Also, if the comment thread over at pango-designation gets locked down after too many "off topic" comments, I won't be able to comment there at all. That has already happened in other issues.)