moncla-lab: add initial IAV h5 datasets
rneher committed May 7, 2024
1 parent fa3faff commit 2c2ff06
Showing 21 changed files with 501,082 additions and 0 deletions.
5 changes: 5 additions & 0 deletions data/community/moncla-lab/iav-h5/2.3.2.1/CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
## Unreleased

Initial release for Nextclade v3!

Read more about Nextclade datasets in the documentation: https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html
47 changes: 47 additions & 0 deletions data/community/moncla-lab/iav-h5/2.3.2.1/README.md
@@ -0,0 +1,47 @@
# H5Nx clade `2.3.2.1` dataset with A/duck/Vietnam/NCVD-1584/2012 reference

| attribute | value |
| -------------------- | ---------------------------------------- |
| authors |[Jordan Ort](https://lmoncla.github.io/monclalab/team/JordanOrt/), [Louise Moncla](https://lmoncla.github.io/monclalab/team/LouiseMoncla/)|
| dataset name | H5Nx clade `2.3.2.1` (provisional) |
| reference strain | A/duck/Vietnam/NCVD-1584/2012(H5N1) |
| reference accession | EPI424984 |


## Authors and contacts

Maintained by: [Jordan Ort](https://lmoncla.github.io/monclalab/team/JordanOrt/)

With help from: [Louise Moncla](https://lmoncla.github.io/monclalab/team/LouiseMoncla/), Todd Davis, Tommy Lam, Samuel Shepard, Richard Neher

## Scope of this dataset
This dataset uses a current H5 candidate vaccine virus (CVV) from clade `2.3.2.1` (A/duck/Vietnam/NCVD-1584/2012) as a reference and is suitable for the analysis of H5 sequences belonging to clade `2.3.2.1` and its sub-clades `2.3.2.1a` through `2.3.2.1g`. Sequences belonging to other clades cannot be annotated by this dataset and will be left `unassigned`.

## Features
This dataset supports

* Assignment to clades and subclades based on the provisional nomenclature defined by the WHO/FAO/WOAH H5 Nomenclature Working Group
* Sequence quality control (QC)
* Phylogenetic placement
* Annotations for glycosylation sites, HA cleavage site sequence, and presence/absence of a polybasic cleavage site
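
The motif annotations in the last item can be approximated outside Nextclade with plain regular expressions. The sketch below reuses the three patterns and the 0-based, half-open range `340..345` from this dataset's `pathogen.json`; the function names are hypothetical, and Nextclade's own matching (for example its handling of overlapping matches or gaps) may differ in detail.

```python
import re

# Patterns copied from the "aaMotifs" section of pathogen.json.
GLYCOSYLATION = re.compile(r"N[^P][ST]")
POLYBASIC = re.compile(r"^([A-Z-]*[RK][A-Z-]*){4}$")


def glycosylation_sites(aa_seq):
    """Return 0-based start positions of N-X-S/T motifs (X is not proline)."""
    return [m.start() for m in GLYCOSYLATION.finditer(aa_seq)]


def cleavage_site(aa_seq, begin=340, end=345):
    """Return the residues in the half-open range [begin, end) of the HA translation."""
    return aa_seq[begin:end]


def is_polybasic(site):
    """True if the site contains at least four basic residues (R or K)."""
    return POLYBASIC.match(site) is not None


# Toy peptide for illustration only (not a real HA translation).
toy = "MNPSANGTRRRKRGLF"
print(glycosylation_sites(toy))      # NPS is skipped because X is proline
print(cleavage_site(toy, 8, 13))     # custom range for the toy peptide
print(is_polybasic(cleavage_site(toy, 8, 13)))
```

For a real HA translation aligned to the reference, the default range applies; translating the reference sequence in this dataset gives `RRRKR` at those positions.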

## Clades of H5Nx avian influenza viruses

The WHO/FAO/WOAH H5 Nomenclature Working Group defines clades from HA gene sequences as genetically distinct, monophyletic groups of viruses. Viruses falling into a given clade share a common ancestor with significant bootstrap support and have low levels of within-clade diversity. [Past nomenclature updates](https://onlinelibrary.wiley.com/doi/10.1111/irv.12324) have required viruses in the same clade to be monophyletic with bootstrap support of at least 60%, with within-clade pairwise distances of less than 1.5%. These requirements are sometimes relaxed, and clades are periodically updated to account for expanding viral diversity.
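
As a rough illustration of the within-clade distance criterion, the sketch below computes the mean pairwise p-distance (proportion of differing sites) over a set of aligned HA sequences and compares it with the 1.5% threshold. This assumes a simple uncorrected p-distance averaged over all pairs; the working group's actual distance model and summary statistic may differ, and the function names are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Positions where either sequence has a gap ('-') or 'N' are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        diffs += x != y
    if compared == 0:
        raise ValueError("no comparable positions")
    return diffs / compared


def mean_pairwise_distance(seqs):
    """Average p-distance over all unordered pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


def within_clade_ok(seqs, threshold=0.015):
    """True if the mean pairwise distance is below the threshold (1.5% by default)."""
    return mean_pairwise_distance(seqs) < threshold
```

In practice the input would be a multiple sequence alignment of all HA sequences assigned to one candidate clade.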

Clade `2.3.2.1` viruses and their descendants have circulated since 2007 and are endemic in Southeast Asia. Because of the high circulating diversity within the clade, these viruses have been split into additional sub-clades, named `2.3.2.1a` through `2.3.2.1g`.
This Nextclade dataset incorporates these provisional `2.3.2.1` sub-clades.

## Alternative and complementary approaches for H5 clade assignment
Two additional tools exist for assigning clades to H5 viruses that accommodate the recent `2.3.2.1` clade splits.

1. [LABEL](https://wonder.cdc.gov/amd/flu/label/): this command-line tool is built and maintained by Sam Shepard, and performs clade assignment for all current `2.3.4.4` and `2.3.2.1` clade splits.
2. [BVBRC Subspecies Classification Tool](https://www.bv-brc.org/app/SubspeciesClassification): this is a drag-and-drop tool that classifies a variety of viruses, including influenza A H1N1, H3N2, and H5N1.

The clade assignments in this Nextclade dataset were validated against LABEL assignments and shown to be generally well-matched across subclades. The figure below shows a direct comparison of assignments for 1671 HA sequences from GISAID, performed using LABEL and this Nextclade dataset for clade `2.3.2.1` and its subclades.

![Figure 1: Comparison between LABEL and Nextclade for 2.3.2.1 assignments](https://github.com/moncla-lab/h5nx-Clades/blob/main/jordan-h5-clades/testing-nextclade-datasets/2321/files/20240430_2321.png)
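
A comparison of this kind can be sketched as a cross-tabulation of the two tools' calls per sequence. The code below is a minimal illustration, not the script used for the figure; it assumes each tool's output has already been parsed into a dictionary mapping sequence name to clade, and the function name is hypothetical.

```python
from collections import Counter


def compare_assignments(label_calls, nextclade_calls):
    """Cross-tabulate clade calls for sequences present in both inputs.

    Returns (table, concordance), where table counts
    (LABEL clade, Nextclade clade) pairs and concordance is the
    fraction of shared sequences with identical calls.
    """
    shared = sorted(set(label_calls) & set(nextclade_calls))
    table = Counter((label_calls[s], nextclade_calls[s]) for s in shared)
    agree = sum(n for (a, b), n in table.items() if a == b)
    concordance = agree / len(shared) if shared else float("nan")
    return table, concordance


# Made-up example calls for illustration only.
label = {"s1": "2.3.2.1a", "s2": "2.3.2.1c", "s3": "2.3.2.1c", "s4": "2.3.2.1"}
nextclade = {"s1": "2.3.2.1a", "s2": "2.3.2.1c", "s3": "2.3.2.1e", "s5": "2.3.2.1g"}
table, concordance = compare_assignments(label, nextclade)
for (label_clade, nextclade_clade), n in sorted(table.items()):
    print(label_clade, nextclade_clade, n)
print(f"concordance: {concordance:.2f}")
```

Off-diagonal entries in the table point to the subclade pairs where the two tools disagree and are a natural starting point for manual inspection.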

## What is a Nextclade dataset?

Read more about Nextclade datasets in the Nextclade documentation: https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html
370 changes: 370 additions & 0 deletions data/community/moncla-lab/iav-h5/2.3.2.1/example_sequences.fasta

Large diffs are not rendered by default.

@@ -0,0 +1,3 @@
##gff-version 3
##sequence-region EPI424984 1 1704
EPI424984 feature gene 1 1704 . + . gene_name="HA"
114 changes: 114 additions & 0 deletions data/community/moncla-lab/iav-h5/2.3.2.1/pathogen.json
@@ -0,0 +1,114 @@
{
  "alignmentParams": {
    "excessBandwidth": 9,
    "terminalBandwidth": 100,
    "allowedMismatches": 4,
    "gapAlignmentSide": "right",
    "minSeedCover": 0.1
  },
  "attributes": {
    "name": "H5Nx clade 2.3.2.1",
    "refAccession": "EPI424984",
    "refName": "A/duck/Vietnam/NCVD-1584/2012(H5N1)",
    "experimental": true
  },
  "compatibility": {
    "cli": "3.0.0-alpha.0",
    "web": "3.0.0-alpha.0"
  },
  "defaultCds": "HA",
  "deprecated": false,
  "experimental": true,
  "files": {
    "changelog": "CHANGELOG.md",
    "genomeAnnotation": "genome_annotation.gff3",
    "pathogenJson": "pathogen.json",
    "readme": "README.md",
    "reference": "reference.fasta",
    "treeJson": "tree.json",
    "examples": "example_sequences.fasta"
  },
  "meta": {
    "bugs": "https://github.com/nextstrain/nextclade_data/issues",
    "source code": "https://github.com/nextstrain/nextclade_data"
  },
  "qc": {
    "frameShifts": {
      "enabled": true
    },
    "missingData": {
      "enabled": false,
      "missingDataThreshold": 100,
      "scoreBias": 10
    },
    "mixedSites": {
      "enabled": true,
      "mixedSitesThreshold": 4
    },
    "privateMutations": {
      "cutoff": 25,
      "enabled": true,
      "typical": 5
    },
    "snpClusters": {
      "clusterCutOff": 5,
      "enabled": false,
      "scoreWeight": 50,
      "windowSize": 100
    },
    "stopCodons": {
      "enabled": true
    }
  },
  "aaMotifs": [
    {
      "name": "glycosylation",
      "nameShort": "Glyc.",
      "nameFriendly": "Glycosylation",
      "description": "N-linked glycosylation motifs (N-X-S/T with X any amino acid other than P)",
      "includeCdses": [
        {
          "cds": "HA",
          "ranges": [{"begin": 0, "end": 532}]
        }
      ],
      "motifs": [
        "N[^P][ST]"
      ]
    },
    {
      "name": "cleavage_site",
      "nameShort": "CS",
      "nameFriendly": "CleavageSite",
      "description": "Cleavage site of HA",
      "includeCdses": [
        {
          "cds": "HA",
          "ranges": [{"begin": 340, "end": 345}]
        }
      ],
      "motifs": [
        "[A-Z-]{5}"
      ]
    },
    {
      "name": "polybasic_cleavage_site",
      "nameShort": "PBCS",
      "nameFriendly": "PolybasicCleavageSite",
      "description": "Polybasic cleavage site of HA",
      "includeCdses": [
        {
          "cds": "HA",
          "ranges": [{"begin": 340, "end": 345}]
        }
      ],
      "motifs": [
        "^([A-Z-]*[RK][A-Z-]*){4}$"
      ]
    }
  ],
  "schemaVersion": "3.0.0",
  "version": {
    "tag": "unreleased"
  }
}
30 changes: 30 additions & 0 deletions data/community/moncla-lab/iav-h5/2.3.2.1/reference.fasta
@@ -0,0 +1,30 @@
>EPI424984 Influenza A virus (A/duck/Vietnam/NCVD-1584/2012(H5N1)) segment 4 hemagglutinin (HA) gene, complete cds
ATGGAGAAAATAGTTCTTCTCTTTGCAACAATCAGCCTTGTCAAAAGCGATCATATTTGC
ATTGGTTATCATGCAAATAACTCGACAGAGCAGGTTGACACAATAATGGAAAAGAACGTT
ACTGTTACACATGCCCAAGACATACTGGAAAAGACACACAACGGGAAGCTCTGCGATCTA
AATGGAGTGAAGCCTCTGATTTTAAAAGATTGTAGTGTAGCAGGATGGCTCCTCGGAAAT
CCATTGTGTGACGAATTCACCAATGTGCCAGAATGGTCTTACATAGTAGAGAAGGCCAAT
CCAGCCAATGACCTCTGTTACCCAGGGAATTTCAACGATTATGAAGAATTGAAACACCTA
TTGAGCAGGATAAACCATTTTGAGAAAATACAGATCATCCCCAAAGATTCTTGGTCAGAT
CATGAAGCCTCATTGGGGGTGAGTGCAGCATGTTCATACCAGGGAAATTCCTCCTTCTTC
AGAAATGTGGTGTGGCTTATCAAAAAGGACAATGCATACCCAACAATAAAGAAAGGCTAC
AATAATACCAACCGAGAAGATCTCTTGATACTGTGGGGGATCCACCATCCTAATGATGAG
GCAGAGCAGACAAGGCTCTACCAAAACCCAACTACCTATATTTCCATTGGGACTTCAACA
CTAAACCAGAGATTGGTACCAAAAATAGCCACTAGATCCAAAATAAACGGGCAAAGCGGC
AGGATAGATTTCTTCTGGACAATTTTAAAACCGAATGACGCAATCCACTTCGAGAGTAAT
GGAAATTTCATTGCTCCAGAATATGCATACAAAATTGTCAAGAAGGGAGACTCCACAATC
ATGAGAAGTGAAGTGGAATATGGTAACTGCAACACCAGGTGTCAGACTCCAATAGGGGCG
ATAAACTCTAGTATGCCATTCCACAACATACACCCTCTCACCATCGGAGAATGTCCCAAA
TATGTGAAATCAAACAAATTAGTCCTTGCAACTGGGCTCAGAAATAGTCCTCAAAGAGAG
AGAAGAAGAAAAAGAGGACTGTTTGGAGCTATAGCAGGTTTTATAGAGGGAGGATGGCAG
GGAATGGTAGATGGTTGGTATGGGTACCACCACAGCAATGAACAGGGGAGTGGTTACGCT
GCAGACAAAGAATCTACTCAAAAGGCGATAGACGGAGTCACCAATAAGGTCAATTCGATC
ATTGACAAAATGAACACTCAGTTTGAGGCTGTAGGAAGGGAATTTAATAACTTAGAGAGG
AGAATAGAAAATTTAAACAAGAAGATGGAAGACGGATTCCTAGATGTCTGGACTTATAAT
GCTGAACTTCTGGTTCTCATGGAGAATGAGAGAACTCTAGACTTCCATGACTCAAATGTC
AAGAACCTTTACGATAAGGTCCGACTACAGCTTAAGGATAATGCAAAAGAGCTGGGAAAC
GGTTGTTTCGAGTTCTATCACAAATGTAATAATGAATGTATGGAAAGTGTAAGAAACGGG
ACGTATGACTACCCGCAGTATTCAGAAGAAGCAAGATTAAAAAGAGAGGAAATAAGTGGA
GTAAAACTGGAATCAATAGGAATCTACCAAATACTGTCAATTTATTCAACAGTGGCGAGT
TCCCTAGTGCTGGCAATCATGATGGCTGGTCTATCTTTATGGATGTGTTCCAACGGGTCG
TTACAGTGCAGAATTTGCATTTAA