moncla-lab: add initial IAV h5 datasets
Showing 21 changed files with 501,082 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
## Unreleased | ||
|
||
Initial release for Nextclade v3! | ||
|
||
Read more about Nextclade datasets in the documentation: https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,47 @@ | ||
# H5Nx clade `2.3.2.1` dataset with A/duck/Vietnam/NCVD-1584/2012 reference | ||
|
||
| attribute | value | | ||
| -------------------- | ---------------------------------------- | | ||
| authors |[Jordan Ort](https://lmoncla.github.io/monclalab/team/JordanOrt/), [Louise Moncla](https://lmoncla.github.io/monclalab/team/LouiseMoncla/)| | ||
| dataset name | H5Nx clade `2.3.2.1` (provisional) | | ||
| reference strain | A/duck/Vietnam/NCVD-1584/2012(H5N1) | | ||
| reference accession | EPI424984 | | ||
|
||
|
||
## Authors and contacts | ||
|
||
Maintained by: [Jordan Ort](https://lmoncla.github.io/monclalab/team/JordanOrt/) | ||
|
||
With help from: [Louise Moncla](https://lmoncla.github.io/monclalab/team/LouiseMoncla/), Todd Davis, Tommy Lam, Samuel Shepard, Richard Neher | ||
|
||
## Scope of this dataset | ||
This dataset uses a current H5 candidate vaccine virus (CVV) from clade `2.3.2.1` (A/duck/Vietnam/NCVD-1584/2012) as a reference and is suitable for the analysis of H5 sequences belonging to clade `2.3.2.1` and its sub-clades `2.3.2.1a` through `2.3.2.1g`. Sequences belonging to other clades cannot be annotated by this dataset and will be left `unassigned`. | ||
|
||
## Features | ||
This dataset supports | ||
|
||
* Assignment to clades and subclades based on the provisional nomenclature defined by the WHO/FAO/WOAH H5 Nomenclature Working Group | ||
* Sequence quality control (QC) | ||
* Phylogenetic placement | ||
* Annotations for glycosylation sites, HA cleavage site sequence, and presence/absence of a polybasic cleavage site | ||
|
||
## Clades of H5Nx avian influenza viruses | ||
|
||
The WHO/FAO/WOAH H5 Nomenclature Working Group defines clades from HA gene sequences as genetically distinct, monophyletic groups of viruses. Viruses falling into a given clade share a common ancestor with significant bootstrap support and have low levels of within-clade diversity. [Past nomenclature updates](https://onlinelibrary.wiley.com/doi/10.1111/irv.12324) have required viruses in the same clade to be monophyletic with bootstrap support of at least 60% and within-clade pairwise distances of less than 1.5%. These requirements are sometimes relaxed, and clades are periodically updated to account for expanding viral diversity. | ||
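The distance criterion above can be illustrated with a minimal Python sketch. It assumes the sequences are already aligned to equal length, that "pairwise distance" means the mean proportion of differing sites (p-distance), and that gaps and `N` are skipped; the function names are hypothetical and this is not the Working Group's actual procedure. The bootstrap criterion needs a phylogenetic tree and is not covered here.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap ('-') or 'N' are ignored (assumption).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else 0.0


def mean_pairwise_distance(seqs: list[str]) -> float:
    """Mean p-distance over all unordered pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


def meets_distance_criterion(seqs: list[str], threshold: float = 0.015) -> bool:
    """True if mean within-group distance is below the threshold (1.5% by default)."""
    return mean_pairwise_distance(seqs) < threshold


# Toy aligned sequences (not real HA data): 1 and 2 differences per 100 sites.
tight_group = ["A" * 100, "A" * 99 + "G"]
loose_group = ["A" * 100, "A" * 98 + "GG"]
print(meets_distance_criterion(tight_group))
print(meets_distance_criterion(loose_group))
```

The toy groups differ at 1% and 2% of sites, so only the first falls under the 1.5% threshold.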
|
||
Clade `2.3.2.1` viruses and their descendants have circulated since 2007 and are endemic in Southeast Asia. These viruses have diversified into seven additional sub-clades, named `2.3.2.1a` through `2.3.2.1g`, because of high circulating diversity within the clade. | ||
This Nextclade dataset incorporates these provisional `2.3.2.1` sub-clades. | ||
|
||
## Alternative and complementary approaches for H5 clade assignment | ||
Two additional tools exist for assigning clades to H5 viruses that accommodate the recent `2.3.2.1` clade splits. | ||
|
||
1. [LABEL](https://wonder.cdc.gov/amd/flu/label/): this command-line tool is built and maintained by Sam Shepard and performs clade assignment for all current `2.3.4.4` and `2.3.2.1` clade splits. | ||
2. [BVBRC Subspecies Classification Tool](https://www.bv-brc.org/app/SubspeciesClassification): this is a drag-and-drop tool that classifies a variety of viruses, including influenza A H1N1, H3N2, and H5N1. | ||
|
||
The clade assignments in this Nextclade dataset were validated against LABEL assignments and shown to be generally well-matched across subclades. The figure below shows a direct comparison of assignments for 1671 HA sequences from GISAID, performed using LABEL and this Nextclade dataset for clade `2.3.2.1` and its subclades. | ||
|
||
![Figure 1: Comparison between LABEL and Nextclade for 2.3.2.1 assignments](https://github.com/moncla-lab/h5nx-Clades/blob/main/jordan-h5-clades/testing-nextclade-datasets/2321/files/20240430_2321.png) | ||
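A comparison of this kind can be sketched in Python as follows. This is an illustrative sketch, not the script used to produce the figure: it assumes each tool's output has already been reduced to a mapping from strain name to clade, and the strain names and the `concordance` helper are hypothetical.

```python
from collections import Counter


def concordance(label_calls: dict[str, str], nextclade_calls: dict[str, str]):
    """Compare two clade-assignment mappings over the strains they share.

    Returns the fraction of shared strains with identical calls and a
    Counter of (LABEL clade, Nextclade clade) pairs.
    """
    shared = sorted(set(label_calls) & set(nextclade_calls))
    confusion = Counter((label_calls[s], nextclade_calls[s]) for s in shared)
    agreeing = sum(n for (a, b), n in confusion.items() if a == b)
    rate = agreeing / len(shared) if shared else float("nan")
    return rate, confusion


# Toy assignments with made-up strain names (not the GISAID data).
label_calls = {"strain1": "2.3.2.1a", "strain2": "2.3.2.1c", "strain3": "2.3.2.1e"}
nextclade_calls = {
    "strain1": "2.3.2.1a",
    "strain2": "2.3.2.1c",
    "strain3": "2.3.2.1f",
    "strain4": "2.3.2.1g",  # absent from LABEL calls, so it is ignored
}

rate, confusion = concordance(label_calls, nextclade_calls)
print(f"agreement: {rate:.2%}")
for (label_clade, nextclade_clade), n in sorted(confusion.items()):
    print(label_clade, nextclade_clade, n)
```

Keeping the full table of clade pairs, rather than only the agreement rate, shows which subclades account for any disagreement.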
|
||
## What is a Nextclade dataset? | ||
|
||
Read more about Nextclade datasets in the Nextclade documentation: https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html |
370 changes: 370 additions & 0 deletions
data/community/moncla-lab/iav-h5/2.3.2.1/example_sequences.fasta
Large diffs are not rendered by default.
3 changes: 3 additions & 0 deletions
data/community/moncla-lab/iav-h5/2.3.2.1/genome_annotation.gff3
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
##gff-version 3 | ||
##sequence-region EPI424984 1 1704 | ||
EPI424984 feature gene 1 1704 . + . gene_name="HA" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,114 @@ | ||
{ | ||
"alignmentParams": { | ||
"excessBandwidth": 9, | ||
"terminalBandwidth": 100, | ||
"allowedMismatches": 4, | ||
"gapAlignmentSide": "right", | ||
"minSeedCover": 0.1 | ||
}, | ||
"attributes": { | ||
"name": "H5Nx clade 2.3.2.1", | ||
"refAccession": "EPI424984", | ||
"refName": "A/duck/Vietnam/NCVD-1584/2012(H5N1)", | ||
"experimental": true | ||
}, | ||
"compatibility": { | ||
"cli": "3.0.0-alpha.0", | ||
"web": "3.0.0-alpha.0" | ||
}, | ||
"defaultCds": "HA", | ||
"deprecated": false, | ||
"experimental": true, | ||
"files": { | ||
"changelog": "CHANGELOG.md", | ||
"genomeAnnotation": "genome_annotation.gff3", | ||
"pathogenJson": "pathogen.json", | ||
"readme": "README.md", | ||
"reference": "reference.fasta", | ||
"treeJson": "tree.json", | ||
"examples": "example_sequences.fasta" | ||
}, | ||
"meta": { | ||
"bugs": "https://github.com/nextstrain/nextclade_data/issues", | ||
"source code": "https://github.com/nextstrain/nextclade_data" | ||
}, | ||
"qc": { | ||
"frameShifts": { | ||
"enabled": true | ||
}, | ||
"missingData": { | ||
"enabled": false, | ||
"missingDataThreshold": 100, | ||
"scoreBias": 10 | ||
}, | ||
"mixedSites": { | ||
"enabled": true, | ||
"mixedSitesThreshold": 4 | ||
}, | ||
"privateMutations": { | ||
"cutoff": 25, | ||
"enabled": true, | ||
"typical": 5 | ||
}, | ||
"snpClusters": { | ||
"clusterCutOff": 5, | ||
"enabled": false, | ||
"scoreWeight": 50, | ||
"windowSize": 100 | ||
}, | ||
"stopCodons": { | ||
"enabled": true | ||
} | ||
}, | ||
"aaMotifs": [ | ||
{ | ||
"name": "glycosylation", | ||
"nameShort": "Glyc.", | ||
"nameFriendly": "Glycosylation", | ||
"description": "N-linked glycosylation motifs (N-X-S/T with X any amino acid other than P)", | ||
"includeCdses": [ | ||
{ | ||
"cds":"HA", | ||
"ranges":[{"begin":0, "end":532}] | ||
} | ||
], | ||
"motifs": [ | ||
"N[^P][ST]" | ||
] | ||
}, | ||
{ | ||
"name": "cleavage_site", | ||
"nameShort": "CS", | ||
"nameFriendly": "CleavageSite", | ||
"description": "Cleavage site of HA", | ||
"includeCdses": [ | ||
{ | ||
"cds":"HA", | ||
"ranges":[{"begin":340, "end":345}] | ||
} | ||
], | ||
"motifs": [ | ||
"[A-Z-]{5}" | ||
] | ||
}, | ||
{ | ||
"name": "polybasic_cleavage_site", | ||
"nameShort": "PBCS", | ||
"nameFriendly": "PolybasicCleavageSite", | ||
"description": "Polybasic cleavage site of HA", | ||
"includeCdses": [ | ||
{ | ||
"cds":"HA", | ||
"ranges":[{"begin":340, "end":345}] | ||
} | ||
], | ||
"motifs": [ | ||
"^([A-Z-]*[RK][A-Z-]*){4}$" | ||
] | ||
} | ||
], | ||
"schemaVersion": "3.0.0", | ||
"version": { | ||
"tag": "unreleased" | ||
} | ||
} |
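The three `aaMotifs` patterns in the configuration above can be tried outside Nextclade with a short Python sketch. The patterns are copied from the JSON; everything else is an assumption for illustration: the ranges are treated as 0-based and half-open on the translated HA protein, the helper names are hypothetical, and Python's `re` module stands in for whatever regex engine Nextclade uses internally, so edge cases such as overlapping matches may behave differently.

```python
import re

# Patterns copied verbatim from the "aaMotifs" section of the config.
GLYCOSYLATION = re.compile(r"N[^P][ST]")
CLEAVAGE_SITE = re.compile(r"[A-Z-]{5}")
POLYBASIC = re.compile(r"^([A-Z-]*[RK][A-Z-]*){4}$")


def glycosylation_sites(protein: str, begin: int = 0, end: int = 532) -> list[int]:
    """0-based start positions of N-X-S/T motifs (X != P) within [begin, end).

    re.finditer reports non-overlapping matches only.
    """
    region = protein[begin:end]
    return [begin + m.start() for m in GLYCOSYLATION.finditer(region)]


def cleavage_window(protein: str, begin: int = 340, end: int = 345) -> str:
    """The five-residue window the cleavage-site motifs are applied to."""
    return protein[begin:end]


def is_polybasic(window: str) -> bool:
    """True if the window contains at least four basic residues (R or K)."""
    return POLYBASIC.search(window) is not None


# Toy inputs (short made-up peptides, not full HA sequences).
print(glycosylation_sites("MNGTANPSNKS"))
print(is_polybasic("RRRKR"))
print(is_polybasic("RETRG"))
```

The polybasic pattern requires four repetitions of a group that each contain an `R` or `K`, so within a five-residue window it accepts any sequence with at least four basic residues, including ones with an alignment gap character.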
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
>EPI424984 Influenza A virus (A/duck/Vietnam/NCVD-1584/2012(H5N1)) segment 4 hemagglutinin (HA) gene, complete cds | ||
ATGGAGAAAATAGTTCTTCTCTTTGCAACAATCAGCCTTGTCAAAAGCGATCATATTTGC | ||
ATTGGTTATCATGCAAATAACTCGACAGAGCAGGTTGACACAATAATGGAAAAGAACGTT | ||
ACTGTTACACATGCCCAAGACATACTGGAAAAGACACACAACGGGAAGCTCTGCGATCTA | ||
AATGGAGTGAAGCCTCTGATTTTAAAAGATTGTAGTGTAGCAGGATGGCTCCTCGGAAAT | ||
CCATTGTGTGACGAATTCACCAATGTGCCAGAATGGTCTTACATAGTAGAGAAGGCCAAT | ||
CCAGCCAATGACCTCTGTTACCCAGGGAATTTCAACGATTATGAAGAATTGAAACACCTA | ||
TTGAGCAGGATAAACCATTTTGAGAAAATACAGATCATCCCCAAAGATTCTTGGTCAGAT | ||
CATGAAGCCTCATTGGGGGTGAGTGCAGCATGTTCATACCAGGGAAATTCCTCCTTCTTC | ||
AGAAATGTGGTGTGGCTTATCAAAAAGGACAATGCATACCCAACAATAAAGAAAGGCTAC | ||
AATAATACCAACCGAGAAGATCTCTTGATACTGTGGGGGATCCACCATCCTAATGATGAG | ||
GCAGAGCAGACAAGGCTCTACCAAAACCCAACTACCTATATTTCCATTGGGACTTCAACA | ||
CTAAACCAGAGATTGGTACCAAAAATAGCCACTAGATCCAAAATAAACGGGCAAAGCGGC | ||
AGGATAGATTTCTTCTGGACAATTTTAAAACCGAATGACGCAATCCACTTCGAGAGTAAT | ||
GGAAATTTCATTGCTCCAGAATATGCATACAAAATTGTCAAGAAGGGAGACTCCACAATC | ||
ATGAGAAGTGAAGTGGAATATGGTAACTGCAACACCAGGTGTCAGACTCCAATAGGGGCG | ||
ATAAACTCTAGTATGCCATTCCACAACATACACCCTCTCACCATCGGAGAATGTCCCAAA | ||
TATGTGAAATCAAACAAATTAGTCCTTGCAACTGGGCTCAGAAATAGTCCTCAAAGAGAG | ||
AGAAGAAGAAAAAGAGGACTGTTTGGAGCTATAGCAGGTTTTATAGAGGGAGGATGGCAG | ||
GGAATGGTAGATGGTTGGTATGGGTACCACCACAGCAATGAACAGGGGAGTGGTTACGCT | ||
GCAGACAAAGAATCTACTCAAAAGGCGATAGACGGAGTCACCAATAAGGTCAATTCGATC | ||
ATTGACAAAATGAACACTCAGTTTGAGGCTGTAGGAAGGGAATTTAATAACTTAGAGAGG | ||
AGAATAGAAAATTTAAACAAGAAGATGGAAGACGGATTCCTAGATGTCTGGACTTATAAT | ||
GCTGAACTTCTGGTTCTCATGGAGAATGAGAGAACTCTAGACTTCCATGACTCAAATGTC | ||
AAGAACCTTTACGATAAGGTCCGACTACAGCTTAAGGATAATGCAAAAGAGCTGGGAAAC | ||
GGTTGTTTCGAGTTCTATCACAAATGTAATAATGAATGTATGGAAAGTGTAAGAAACGGG | ||
ACGTATGACTACCCGCAGTATTCAGAAGAAGCAAGATTAAAAAGAGAGGAAATAAGTGGA | ||
GTAAAACTGGAATCAATAGGAATCTACCAAATACTGTCAATTTATTCAACAGTGGCGAGT | ||
TCCCTAGTGCTGGCAATCATGATGGCTGGTCTATCTTTATGGATGTGTTCCAACGGGTCG | ||
TTACAGTGCAGAATTTGCATTTAA |
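The reference above spans 1704 nucleotides, matching the `1 1704` gene coordinates in `genome_annotation.gff3`, starts with `ATG`, and ends with the stop codon `TAA`. A minimal Python sketch of such a consistency check is shown below; the helper names are hypothetical and the check is only an illustration, not part of Nextclade or of this dataset's build process.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def read_single_fasta(text: str) -> tuple[str, str]:
    """Parse FASTA text that holds exactly one record; return (header, sequence)."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                raise ValueError("expected exactly one FASTA record")
            header = line[1:]
        else:
            chunks.append(line.upper())
    if header is None:
        raise ValueError("no FASTA header found")
    return header, "".join(chunks)


def check_cds(header: str, seq: str, seqid: str, expected_length: int) -> dict[str, bool]:
    """Basic checks that a reference is consistent with a full-length CDS annotation."""
    return {
        "id_matches": header.split()[0] == seqid,
        "length_matches": len(seq) == expected_length,
        "multiple_of_three": len(seq) % 3 == 0,
        "starts_with_atg": seq.startswith("ATG"),
        "ends_with_stop": seq[-3:] in STOP_CODONS,
    }


# Toy record (not the real reference); for this dataset one would pass the
# contents of reference.fasta with seqid="EPI424984" and expected_length=1704.
toy = ">toy_id example record\nATGAAA\nTAA\n"
header, seq = read_single_fasta(toy)
for name, ok in check_cds(header, seq, seqid="toy_id", expected_length=9).items():
    print(name, ok)
```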