-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add yellow fever virus dataset #220
Conversation
Thanks @genehack I allowed myself to resolve merge conflict in The dataset can be tested in web like this: (this will also update when you push new changes, after bot pushes the rebuild, but make sure to refresh the browser bypassing cache or clear the cache) The 2000+ example sequences pushing the limits a little and many of them don't align, producing bright red rows in the table - this scares new users sometimes when they try to run examples. Would be nice to select a smaller subset. Example sequence are for demo purposes only, and are sometimes convenient for quick testing and debugging (when software is modified). A few dozen or a hundred is a typical count - just to show off what clades we've got, some interesting science cases, etc. Otherwise, technically looks good to me. I will let science team to review the scientific aspect of the dataset - this is the hardest part. |
Thanks!
Ah, I think I misunderstood the purpose of
🙌
Sure — I do have some updates to push, and I will provide a cut down set of examples; likely this will be early next week. |
ca51f1d
to
5fe9274
Compare
Ideally, set of examples and the set of samples that end up on reference tree to not overlap. If they are - then the examples will be placed onto the tree where they are already there. (It's like training a machine learning model and then testing it on training data - the results will be amazing, but it does not show how well the model performs in reality) |
I have force-pushed a new dataset with a |
e42a4c6
to
424b122
Compare
(I had to remove
Another interesting peculiarity I noticed is that even if there are up to 99 nuc mutations in "South America" clades, they end up only in up to 7 aa mutations. Meaning that most nuc mutations are silent. Don't know much about this virus though. Maybe that's expected. But it won't hurt to check with the experts. |
Ah, I guess this answers my first concern :)
|
Thanks for fixing that up; sorry, I'm not used to needing to propagate changes from the remote back to my branch — branch is set to remote tracking now, so I'll get any new changes.
Yeah, as you noted later -- the genotypes are defined in the literature on the basis of this short region towards the 5' end; the example sequences are all full genomes (because the first Nextstrain dataset I'm trying to build for the full genome, so those were the sequences I had handy.
It's clear from the literature (I can dig up exact cites if you're interested) that the nucleotide level divergence is much higher than the amino acid level; it's not clearly understood why but it's been noted in several papers I've read. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this! I'll have a proper look tomorrow.
The Readme right now reads like a workflow readme for workflow devs - instead it should be a readme for dataset users. Have a look at the other existing datasets to see the type of information usually included.
Many things in the Readme are good to keep, but the workflow shouldn't be mentioned so explicitly, and some more info on reference, maintainers, limitations etc would be good. Also there's no link to the actual workflow?
Is there a reason you only use the prM gene? Why not whole genome? Is YFV so recombinant that a full genome tree would be misleading
The clades are probably defined based on just a short subset of the genome because whole genome sequences weren't available due to expense. For Nextclade, there's usually no reason to not use a full tree.
That's because it was the workflow readme — not sure how that happened, but I have provided the correct version.
Because the 2 papers that originally established the genotypes that are the basis of the clades only looked at that prM-E region. It seems to be a very similar situation to the N450 region of measles, in that it's a frequent sequencing target for probably historical reasons.
The whole genome sequences are not systematically annotated with genotypes in the NCBI dataset. The purpose of building this Nextclade dataset from the reference sequences from the two papers linked in the README is so we can use it to assign genotypes to the full genome tree that's in Nextstrain staging. (N.b., that version is the previous build and still uses the geographically-based genotype names, not the clades that this dataset has been updated to use.) |
Built from [nextclade workflow in yellow fever repo](nextstrain/yellow-fever#10)
Per biweekly meeting discussion. Also updates README to provide mapping back to literature genotypes.
8dc2afd
to
7e70d34
Compare
@corneliusroemer I think this is ready to merge — could you take a look, please. |
@corneliusroemer I expect you're quite busy with Pathoplexus, but another look at this new dataset would be appreciated. |
Thanks for the ping, I'll have a look! |
Hi @corneliusroemer any chance to have that look yet? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the science side, this looks pretty good to me. I appreciate the complete description of the genotype system in the README and the citations.
I didn't notice any glaring technical issues that would block this from merge and release.
data_output/nextstrain/yellow-fever/prM-E/unreleased/CHANGELOG.md
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, save a few more tweaks to parameters/typos
* Tweaks to CHANGELOG and README * Remove minLength from pathogen.json * Tweak privateMutations cutoff and typical values in pathogen.json * Add 2 CladeIII examples to sample sequences * Rebuild tree
Thanks, tweaks applied and pushed. I will plan to merge this Tuesday the 22nd in the absence of further feedback. |
Built from nextclade workflow in yellow fever repo.
preview: https://master.clades.nextstrain.org/?dataset-server=gh:@yellow-fever-dataset@