Add yellow fever virus dataset #220

genehack · 2024-08-02T20:00:51Z

Built from nextclade workflow in yellow fever repo.

preview: https://master.clades.nextstrain.org/?dataset-server=gh:@yellow-fever-dataset@

ivan-aksamentov · 2024-08-02T21:54:41Z

I allowed myself to resolve merge conflict in minimizer_index.json (by deleting the file and rebuilding).

The dataset can be tested in web like this:

https://clades.nextstrain.org/?dataset-server=gh:@yellow-fever-dataset@/data_output&dataset-name=nextstrain/yellow-fever/prM-E

(this will also update when you push new changes, after bot pushes the rebuild, but make sure to refresh the browser bypassing cache or clear the cache)

The 2000+ example sequences pushing the limits a little and many of them don't align, producing bright red rows in the table - this scares new users sometimes when they try to run examples. Would be nice to select a smaller subset. Example sequence are for demo purposes only, and are sometimes convenient for quick testing and debugging (when software is modified). A few dozen or a hundred is a typical count - just to show off what clades we've got, some interesting science cases, etc.

Otherwise, technically looks good to me.

I will let science team to review the scientific aspect of the dataset - this is the hardest part.

genehack · 2024-08-02T22:46:54Z

I allowed myself to resolve merge conflict in minimizer_index.json (by deleting the file and rebuilding).

Thanks!

The 2000+ example sequences pushing the limits a little and many of them don't align, producing bright red rows in the table - this scares new users sometimes when they try to run examples. Would be nice to select a smaller subset.

Ah, I think I misunderstood the purpose of sequences.fasta — what I have included is the entire NCBI Dataset for yellow fever — I can certainly go through and try to filter it down to ~50 representative sequences instead.

Otherwise, technically looks good to me.

🙌

I will let science team to review the scientific aspect of the dataset - this is the hardest part.

Sure — I do have some updates to push, and I will provide a cut down set of examples; likely this will be early next week.

ivan-aksamentov · 2024-08-02T23:24:12Z

@genehack

Ah, I think I misunderstood the purpose of sequences.fasta — what I have included is the entire NCBI Dataset for yellow fever

Ideally, set of examples and the set of samples that end up on reference tree to not overlap. If they are - then the examples will be placed onto the tree where they are already there. (It's like training a machine learning model and then testing it on training data - the results will be amazing, but it does not show how well the model performs in reality)

genehack · 2024-08-05T17:01:51Z

Ideally, set of examples and the set of samples that end up on reference tree to not overlap.

I have force-pushed a new dataset with a sequences.fasta file containing 32 full-length yellow fever virus genomes, that has no overlap with the 122 samples used to build the nextclade tree.

ivan-aksamentov · 2024-08-06T01:38:29Z

(I had to remove data_output/minimizer_index.json, then merge master and then rebuild to resolve merge conflict again. Please don't forget to pull if add more changes)

One thing I noticed is that each example sequence is 672 nucs long and then there's 600+ nuc insertion at the 5' end and 9000+ nuc insertion at the 3' end. This means that the reference is 672 nucs long, but the example sequences are much longer. Not a by little bit, but like 15 times longer. This is not something I've seen before.

Another interesting peculiarity I noticed is that even if there are up to 99 nuc mutations in "South America" clades, they end up only in up to 7 aa mutations. Meaning that most nuc mutations are silent.

Don't know much about this virus though. Maybe that's expected. But it won't hurt to check with the experts.

ivan-aksamentov · 2024-08-06T01:40:53Z

Ah, I guess this answers my first concern :)

These two papers, collectively, define 7 distinct yellow fever virus genotypes based on a 670 nucleotide region of the yellow fever virus genome, (bases 641-1310), called the prM-E region.
This dataset can be used to assign genotypes to any sequence that includes at least 500 bp of the prM-E region, including whole genome sequences. Sequence data beyond the prM-E region will be reported as an insertion in the Nextclade output.

genehack · 2024-08-06T16:17:41Z

(I had to remove data_output/minimizer_index.json, then merge master and then rebuild to resolve merge conflict again. Please don't forget to pull if add more changes)

Thanks for fixing that up; sorry, I'm not used to needing to propagate changes from the remote back to my branch — branch is set to remote tracking now, so I'll get any new changes.

One thing I noticed is that each example sequence is 672 nucs long and then there's 600+ nuc insertion at the 5' end and 9000+ nuc insertion at the 3' end. This means that the reference is 672 nucs long, but the example sequences are much longer. Not a by little bit, but like 15 times longer. This is not something I've seen before.

Yeah, as you noted later -- the genotypes are defined in the literature on the basis of this short region towards the 5' end; the example sequences are all full genomes (because the first Nextstrain dataset I'm trying to build for the full genome, so those were the sequences I had handy.

Another interesting peculiarity I noticed is that even if there are up to 99 nuc mutations in "South America" clades, they end up only in up to 7 aa mutations. Meaning that most nuc mutations are silent.

It's clear from the literature (I can dig up exact cites if you're interested) that the nucleotide level divergence is much higher than the amino acid level; it's not clearly understood why but it's been noted in several papers I've read.

corneliusroemer

Thanks for this! I'll have a proper look tomorrow.

The Readme right now reads like a workflow readme for workflow devs - instead it should be a readme for dataset users. Have a look at the other existing datasets to see the type of information usually included.

Many things in the Readme are good to keep, but the workflow shouldn't be mentioned so explicitly, and some more info on reference, maintainers, limitations etc would be good. Also there's no link to the actual workflow?

Is there a reason you only use the prM gene? Why not whole genome? Is YFV so recombinant that a full genome tree would be misleading

The clades are probably defined based on just a short subset of the genome because whole genome sequences weren't available due to expense. For Nextclade, there's usually no reason to not use a full tree.

genehack · 2024-08-22T17:46:11Z

The Readme right now reads like a workflow readme for workflow devs - instead it should be a readme for dataset users. Have a look at the other existing datasets to see the type of information usually included.

That's because it was the workflow readme — not sure how that happened, but I have provided the correct version.

Is there a reason you only use the prM gene? Why not whole genome? Is YFV so recombinant that a full genome tree would be misleading

Because the 2 papers that originally established the genotypes that are the basis of the clades only looked at that prM-E region. It seems to be a very similar situation to the N450 region of measles, in that it's a frequent sequencing target for probably historical reasons.

The clades are probably defined based on just a short subset of the genome because whole genome sequences weren't available due to expense. For Nextclade, there's usually no reason to not use a full tree.

The whole genome sequences are not systematically annotated with genotypes in the NCBI dataset. The purpose of building this Nextclade dataset from the reference sequences from the two papers linked in the README is so we can use it to assign genotypes to the full genome tree that's in Nextstrain staging. (N.b., that version is the previous build and still uses the geographically-based genotype names, not the clades that this dataset has been updated to use.)

Built from [nextclade workflow in yellow fever repo](nextstrain/yellow-fever#10)

Per biweekly meeting discussion. Also updates README to provide mapping back to literature genotypes.

genehack · 2024-09-12T17:32:25Z

@corneliusroemer I think this is ready to merge — could you take a look, please.

genehack · 2024-09-26T17:00:38Z

@corneliusroemer I think this is ready to merge — could you take a look, please.

@corneliusroemer I expect you're quite busy with Pathoplexus, but another look at this new dataset would be appreciated.

corneliusroemer · 2024-09-26T18:30:58Z

Thanks for the ping, I'll have a look!

genehack · 2024-10-15T21:01:54Z

Thanks for the ping, I'll have a look!

Hi @corneliusroemer any chance to have that look yet?

data/nextstrain/collection.json

huddlej

On the science side, this looks pretty good to me. I appreciate the complete description of the genotype system in the README and the citations.

I didn't notice any glaring technical issues that would block this from merge and release.

data/nextstrain/yellow-fever/prM-E/README.md

data/nextstrain/yellow-fever/prM-E/pathogen.json

data_output/nextstrain/yellow-fever/prM-E/unreleased/CHANGELOG.md

data/nextstrain/yellow-fever/prM-E/sequences.fasta

data/nextstrain/yellow-fever/prM-E/pathogen.json

data_output/minimizer_index.json

corneliusroemer

Looks good, save a few more tweaks to parameters/typos

* Tweaks to CHANGELOG and README * Remove minLength from pathogen.json * Tweak privateMutations cutoff and typical values in pathogen.json * Add 2 CladeIII examples to sample sequences * Rebuild tree

Context: nextstrain/nextclade_data#220

genehack · 2024-10-17T17:58:25Z

Looks good, save a few more tweaks to parameters/typos

Thanks, tweaks applied and pushed.

I will plan to merge this Tuesday the 22nd in the absence of further feedback.

genehack temporarily deployed to refs/pull/220/merge August 2, 2024 20:00 — with GitHub Actions Inactive

ivan-aksamentov temporarily deployed to refs/pull/220/merge August 2, 2024 21:41 — with GitHub Actions Inactive

genehack mentioned this pull request Aug 2, 2024

Modernize repo nextstrain/yellow-fever#2

Open

30 tasks

genehack force-pushed the yellow-fever-dataset branch from ca51f1d to 5fe9274 Compare August 2, 2024 22:57

genehack temporarily deployed to refs/pull/220/merge August 2, 2024 22:57 — with GitHub Actions Inactive

genehack force-pushed the yellow-fever-dataset branch from e42a4c6 to 424b122 Compare August 5, 2024 17:04

genehack temporarily deployed to refs/pull/220/merge August 5, 2024 17:04 — with GitHub Actions Inactive

ivan-aksamentov temporarily deployed to refs/pull/220/merge August 6, 2024 01:31 — with GitHub Actions Inactive

genehack temporarily deployed to refs/pull/220/merge August 20, 2024 23:48 — with GitHub Actions Inactive

corneliusroemer requested changes Aug 21, 2024

View reviewed changes

genehack temporarily deployed to refs/pull/220/merge August 22, 2024 17:41 — with GitHub Actions Inactive

genehack requested a review from corneliusroemer August 23, 2024 17:08

genehack and others added 7 commits September 12, 2024 10:20

Add yellow fever virus dataset

5d66c5d

Built from [nextclade workflow in yellow fever repo](nextstrain/yellow-fever#10)

chore: rebuild [skip ci]

c787a1c

chore: remove data_output/minimizer_index.json

8a7b5e7

Update clade names to numeric version

d22045f

Per biweekly meeting discussion. Also updates README to provide mapping back to literature genotypes.

chore: rebuild [skip ci]

0413a9b

Include correct README file

82ad041

chore: rebuild [skip ci]

7e70d34

genehack force-pushed the yellow-fever-dataset branch from 8dc2afd to 7e70d34 Compare September 12, 2024 17:31

genehack requested a review from a team September 26, 2024 16:59