Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add yellow fever virus dataset #220

Merged
merged 12 commits into from
Oct 18, 2024
Merged

Add yellow fever virus dataset #220

merged 12 commits into from
Oct 18, 2024

Conversation

genehack
Copy link
Contributor

@genehack genehack commented Aug 2, 2024

@ivan-aksamentov
Copy link
Member

ivan-aksamentov commented Aug 2, 2024

Thanks @genehack

I allowed myself to resolve merge conflict in minimizer_index.json (by deleting the file and rebuilding).

The dataset can be tested in web like this:

https://clades.nextstrain.org/?dataset-server=gh:@yellow-fever-dataset@/data_output&dataset-name=nextstrain/yellow-fever/prM-E

(this will also update when you push new changes, after bot pushes the rebuild, but make sure to refresh the browser bypassing cache or clear the cache)

The 2000+ example sequences pushing the limits a little and many of them don't align, producing bright red rows in the table - this scares new users sometimes when they try to run examples. Would be nice to select a smaller subset. Example sequence are for demo purposes only, and are sometimes convenient for quick testing and debugging (when software is modified). A few dozen or a hundred is a typical count - just to show off what clades we've got, some interesting science cases, etc.

Otherwise, technically looks good to me.

I will let science team to review the scientific aspect of the dataset - this is the hardest part.

@genehack
Copy link
Contributor Author

genehack commented Aug 2, 2024

I allowed myself to resolve merge conflict in minimizer_index.json (by deleting the file and rebuilding).

Thanks!

The 2000+ example sequences pushing the limits a little and many of them don't align, producing bright red rows in the table - this scares new users sometimes when they try to run examples. Would be nice to select a smaller subset.

Ah, I think I misunderstood the purpose of sequences.fasta — what I have included is the entire NCBI Dataset for yellow fever — I can certainly go through and try to filter it down to ~50 representative sequences instead.

Otherwise, technically looks good to me.

🙌

I will let science team to review the scientific aspect of the dataset - this is the hardest part.

Sure — I do have some updates to push, and I will provide a cut down set of examples; likely this will be early next week.

@ivan-aksamentov
Copy link
Member

ivan-aksamentov commented Aug 2, 2024

@genehack

Ah, I think I misunderstood the purpose of sequences.fasta — what I have included is the entire NCBI Dataset for yellow fever

Ideally, set of examples and the set of samples that end up on reference tree to not overlap. If they are - then the examples will be placed onto the tree where they are already there. (It's like training a machine learning model and then testing it on training data - the results will be amazing, but it does not show how well the model performs in reality)

@genehack
Copy link
Contributor Author

genehack commented Aug 5, 2024

Ideally, set of examples and the set of samples that end up on reference tree to not overlap.

I have force-pushed a new dataset with a sequences.fasta file containing 32 full-length yellow fever virus genomes, that has no overlap with the 122 samples used to build the nextclade tree.

@ivan-aksamentov
Copy link
Member

ivan-aksamentov commented Aug 6, 2024

(I had to remove data_output/minimizer_index.json, then merge master and then rebuild to resolve merge conflict again. Please don't forget to pull if add more changes)

One thing I noticed is that each example sequence is 672 nucs long and then there's 600+ nuc insertion at the 5' end and 9000+ nuc insertion at the 3' end. This means that the reference is 672 nucs long, but the example sequences are much longer. Not a by little bit, but like 15 times longer. This is not something I've seen before.

Another interesting peculiarity I noticed is that even if there are up to 99 nuc mutations in "South America" clades, they end up only in up to 7 aa mutations. Meaning that most nuc mutations are silent.

Don't know much about this virus though. Maybe that's expected. But it won't hurt to check with the experts.

@ivan-aksamentov
Copy link
Member

ivan-aksamentov commented Aug 6, 2024

Ah, I guess this answers my first concern :)

These two papers, collectively, define 7 distinct yellow fever virus genotypes based on a 670 nucleotide region of the yellow fever virus genome, (bases 641-1310), called the prM-E region.
This dataset can be used to assign genotypes to any sequence that includes at least 500 bp of the prM-E region, including whole genome sequences. Sequence data beyond the prM-E region will be reported as an insertion in the Nextclade output.

@genehack
Copy link
Contributor Author

genehack commented Aug 6, 2024

(I had to remove data_output/minimizer_index.json, then merge master and then rebuild to resolve merge conflict again. Please don't forget to pull if add more changes)

Thanks for fixing that up; sorry, I'm not used to needing to propagate changes from the remote back to my branch — branch is set to remote tracking now, so I'll get any new changes.

One thing I noticed is that each example sequence is 672 nucs long and then there's 600+ nuc insertion at the 5' end and 9000+ nuc insertion at the 3' end. This means that the reference is 672 nucs long, but the example sequences are much longer. Not a by little bit, but like 15 times longer. This is not something I've seen before.

Yeah, as you noted later -- the genotypes are defined in the literature on the basis of this short region towards the 5' end; the example sequences are all full genomes (because the first Nextstrain dataset I'm trying to build for the full genome, so those were the sequences I had handy.

Another interesting peculiarity I noticed is that even if there are up to 99 nuc mutations in "South America" clades, they end up only in up to 7 aa mutations. Meaning that most nuc mutations are silent.

It's clear from the literature (I can dig up exact cites if you're interested) that the nucleotide level divergence is much higher than the amino acid level; it's not clearly understood why but it's been noted in several papers I've read.

@genehack genehack temporarily deployed to refs/pull/220/merge August 20, 2024 23:48 — with GitHub Actions Inactive
Copy link
Member

@corneliusroemer corneliusroemer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! I'll have a proper look tomorrow.

The Readme right now reads like a workflow readme for workflow devs - instead it should be a readme for dataset users. Have a look at the other existing datasets to see the type of information usually included.

Many things in the Readme are good to keep, but the workflow shouldn't be mentioned so explicitly, and some more info on reference, maintainers, limitations etc would be good. Also there's no link to the actual workflow?

Is there a reason you only use the prM gene? Why not whole genome? Is YFV so recombinant that a full genome tree would be misleading

The clades are probably defined based on just a short subset of the genome because whole genome sequences weren't available due to expense. For Nextclade, there's usually no reason to not use a full tree.

@genehack genehack temporarily deployed to refs/pull/220/merge August 22, 2024 17:41 — with GitHub Actions Inactive
@genehack
Copy link
Contributor Author

The Readme right now reads like a workflow readme for workflow devs - instead it should be a readme for dataset users. Have a look at the other existing datasets to see the type of information usually included.

That's because it was the workflow readme — not sure how that happened, but I have provided the correct version.

Is there a reason you only use the prM gene? Why not whole genome? Is YFV so recombinant that a full genome tree would be misleading

Because the 2 papers that originally established the genotypes that are the basis of the clades only looked at that prM-E region. It seems to be a very similar situation to the N450 region of measles, in that it's a frequent sequencing target for probably historical reasons.

The clades are probably defined based on just a short subset of the genome because whole genome sequences weren't available due to expense. For Nextclade, there's usually no reason to not use a full tree.

The whole genome sequences are not systematically annotated with genotypes in the NCBI dataset. The purpose of building this Nextclade dataset from the reference sequences from the two papers linked in the README is so we can use it to assign genotypes to the full genome tree that's in Nextstrain staging. (N.b., that version is the previous build and still uses the geographically-based genotype names, not the clades that this dataset has been updated to use.)

@genehack
Copy link
Contributor Author

@corneliusroemer I think this is ready to merge — could you take a look, please.

@genehack genehack requested a review from a team September 26, 2024 16:59
@genehack
Copy link
Contributor Author

@corneliusroemer I think this is ready to merge — could you take a look, please.

@corneliusroemer I expect you're quite busy with Pathoplexus, but another look at this new dataset would be appreciated.

@corneliusroemer
Copy link
Member

Thanks for the ping, I'll have a look!

@genehack
Copy link
Contributor Author

Thanks for the ping, I'll have a look!

Hi @corneliusroemer any chance to have that look yet?

Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the science side, this looks pretty good to me. I appreciate the complete description of the genotype system in the README and the citations.

I didn't notice any glaring technical issues that would block this from merge and release.

Copy link
Member

@corneliusroemer corneliusroemer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, save a few more tweaks to parameters/typos

* Tweaks to CHANGELOG and README
* Remove minLength from pathogen.json
* Tweak privateMutations cutoff and typical values in pathogen.json
* Add 2 CladeIII examples to sample sequences
* Rebuild tree
genehack added a commit to nextstrain/yellow-fever that referenced this pull request Oct 17, 2024
genehack added a commit to nextstrain/yellow-fever that referenced this pull request Oct 17, 2024
@genehack genehack temporarily deployed to refs/pull/220/merge October 17, 2024 17:55 — with GitHub Actions Inactive
@genehack
Copy link
Contributor Author

Looks good, save a few more tweaks to parameters/typos

Thanks, tweaks applied and pushed.

I will plan to merge this Tuesday the 22nd in the absence of further feedback.

@ivan-aksamentov ivan-aksamentov deployed to refs/pull/220/merge October 18, 2024 16:12 — with GitHub Actions Active
@corneliusroemer corneliusroemer merged commit 745ffb9 into master Oct 18, 2024
1 check passed
@corneliusroemer corneliusroemer deleted the yellow-fever-dataset branch October 18, 2024 17:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants