Merge pull request #29 from nextstrain/hpxv
Add Monkeypox datasets
corneliusroemer authored Jun 16, 2022
2 parents 00f186d + 8a58880 commit ba3688a
Showing 45 changed files with 10,774 additions and 7,618 deletions.
33 changes: 33 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,36 @@
## 2022-06-14

### 3 Monkeypox (MPXV) datasets introduced

Three MPXV datasets are added, each at a different zoom level:

- MPXV (All clades)
- hMPXV-1 (part of clade 3, source of 2017/2018/2022 outbreaks)
- hMPXV-1 B.1 (2022 outbreak lineage)

All 3 use the coordinate system of the recently designated NCBI Monkeypox reference sequence NC_063383 (MPXV-M5312_HM12_Rivers).

However, SNPs from two different reference sequences are added to the "all clades" and B.1 datasets to reduce the total number of mutations.

The B.1 dataset uses the SNPs of ON563414.3 (MPXV_USA_2022_MA001) on top of an NC_063383 backbone.

The "all clades" build uses the SNPs of a reconstructed ancestral MPXV sequence that is the inferred most recent common ancestor of clades 1, 2 and 3, rooted with a Cowpox outgroup.

Only the MPXV (All clades) dataset can assign all clades 1, 2 and 3.
The hMPXV-1 dataset can be used if all viruses are from hMPXV-1.
The B.1 dataset is useful for 2022 outbreak sequences but will not be able to assign anything but B.1 lineages.
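
The selection rule in the three lines above can be sketched as a small helper. This is only an illustration: `choose_dataset` and its two boolean parameters are hypothetical, and only the dataset names (`MPXV`, `hMPXV`, `hMPXV_B1`) are taken from the `dataset.json` files in this commit.

```python
def choose_dataset(all_hmpxv1: bool, all_b1: bool) -> str:
    """Pick the narrowest dataset that still covers every input sequence.

    Hypothetical helper: the flags describe what the caller already knows
    about the samples. Nothing here is part of Nextclade itself.
    """
    if all_b1:
        # The B.1 dataset cannot assign anything but B.1 lineages.
        return "hMPXV_B1"
    if all_hmpxv1:
        # Usable only when every virus is from hMPXV-1.
        return "hMPXV"
    # Mixed or unknown input: only the all-clades dataset assigns clades 1, 2 and 3.
    return "MPXV"


print(choose_dataset(all_hmpxv1=False, all_b1=False))
```

When in doubt about the composition of a sample set, falling back to `MPXV` is the conservative choice under this reading of the changelog.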

Gene annotations follow those of NC_063383 and are of the form `OPG001` (for OrthoPox Gene 001).
Since the alignment reference is always in NC_063383 coordinates, nucleotide and protein mutation positions should usually be identical in alignments done with any of the three datasets.

Quality control parameters are subject to change, especially since "known" frame shifts and stop codons have not been annotated. For example, clade 1 sequences will always show around 7 frame shifts, yet these do not indicate quality problems.

### New dataset version (tag `2022-06-14T12:00:00Z`)

#### SARS-CoV-2

- Pango lineages: New lineages up to [pango-designation release](https://github.com/cov-lineages/pango-designation/releases) v1.9 and beyond are now included, among others `BA.5.1-BA.5.3`, `BA.2.35-BA.2.48` and `XV-XY`

## 2022-04-28

### New dataset version (tag `2022-04-28T12:00:00Z`)
7 changes: 7 additions & 0 deletions data/datasets/MPXV/dataset.json
@@ -0,0 +1,7 @@
{
"defaultRef": "ancestral",
"enabled": true,
"metadata": {},
"name": "MPXV",
"nameFriendly": "Monkeypox (All Clades)"
}
9 changes: 9 additions & 0 deletions data/datasets/MPXV/references/ancestral/datasetRef.json
@@ -0,0 +1,9 @@
{
"enabled": true,
"metadata": {},
"reference": {
"source": "genbank",
"accession": "ancestral",
"strainName": "Reconstructed ancestral MPXV"
}
}
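
The two files above are linked by `defaultRef`: the value in `dataset.json` matches the directory name under `references/`. A minimal sketch of that lookup follows, assuming the directory layout visible in this commit's file paths. The helper `default_ref_path` is hypothetical and is not part of any Nextclade tooling.

```python
import json
from pathlib import PurePosixPath

# Contents of data/datasets/MPXV/dataset.json as added in this commit.
DATASET_JSON = """
{
  "defaultRef": "ancestral",
  "enabled": true,
  "metadata": {},
  "name": "MPXV",
  "nameFriendly": "Monkeypox (All Clades)"
}
"""


def default_ref_path(dataset: dict, root: str = "data/datasets") -> PurePosixPath:
    """Build the path of the datasetRef.json for the dataset's default reference."""
    return (
        PurePosixPath(root)
        / dataset["name"]
        / "references"
        / dataset["defaultRef"]
        / "datasetRef.json"
    )


dataset = json.loads(DATASET_JSON)
path = default_ref_path(dataset)
print(path)
```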

@@ -0,0 +1,32 @@
{
"schemaVersion": "1.2.0",
"privateMutations": {
"enabled": true,
"typical": 50,
"cutoff": 300
},
"missingData": {
"enabled": true,
"missingDataThreshold": 20000,
"scoreBias": 1000
},
"snpClusters": {
"enabled": false,
"windowSize": 100,
"clusterCutOff": 10,
"scoreWeight": 10
},
"mixedSites": {
"enabled": true,
"mixedSitesThreshold": 40
},
"frameShifts": {
"enabled": true,
"scoreWeight": 2
},
"stopCodons": {
"enabled": true,
"scoreWeight": 10,
"ignoredStopCodons": []
}
}
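
As a quick way to read the QC configuration above (presumably a `qc.json`, although the page does not show the filename), the sketch below lists which rules are switched on. It only parses the JSON shown in the diff and does not reproduce how Nextclade turns these parameters into scores.

```python
import json

# QC configuration copied from the diff above.
QC_JSON = """
{
  "schemaVersion": "1.2.0",
  "privateMutations": {"enabled": true, "typical": 50, "cutoff": 300},
  "missingData": {"enabled": true, "missingDataThreshold": 20000, "scoreBias": 1000},
  "snpClusters": {"enabled": false, "windowSize": 100, "clusterCutOff": 10, "scoreWeight": 10},
  "mixedSites": {"enabled": true, "mixedSitesThreshold": 40},
  "frameShifts": {"enabled": true, "scoreWeight": 2},
  "stopCodons": {"enabled": true, "scoreWeight": 10, "ignoredStopCodons": []}
}
"""

qc = json.loads(QC_JSON)

# Every rule is a nested object; "schemaVersion" is a plain string and is skipped.
rules = {name: cfg for name, cfg in qc.items() if isinstance(cfg, dict)}
enabled = [name for name, cfg in rules.items() if cfg.get("enabled")]
disabled = [name for name, cfg in rules.items() if not cfg.get("enabled")]

print("enabled: ", enabled)
print("disabled:", disabled)
```

Note that `ignoredStopCodons` is empty, which is consistent with the changelog's remark that "known" frame shifts and stop codons have not been annotated yet.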

@@ -1,6 +1,6 @@
{
- "tag": "2022-05-20T12:00:00Z",
- "comment": "First Monkeypox dataset, experimental",
+ "tag": "2022-06-14T12:00:00Z",
+ "comment": "Monkeypox all-clades dataset",
"compatibility": {
"nextcladeCli": {
"min": "1.999.0",

@@ -3,10 +3,10 @@
"nucMutLabelMap": {},
"nucMutLabelMapReverse": {},
"alignmentParams": {
- "max_indel": 5000,
- "seed_spacing": 1000,
- "terminal_bandwidth": 100,
- "excess_bandwidth": 3,
+ "max_indel": 20000,
+ "seed_spacing": 500,
+ "terminal_bandwidth": 500,
+ "excess_bandwidth": 20,
"gap_alignment_side": "left"
}
}
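
The hunk above replaces four of the five `alignmentParams` values. The sketch below makes the change explicit by comparing the old and new values taken from the diff. The helper `changed_params` is hypothetical and exists only for this illustration.

```python
# Old and new alignmentParams, copied from the hunk above.
OLD = {
    "max_indel": 5000,
    "seed_spacing": 1000,
    "terminal_bandwidth": 100,
    "excess_bandwidth": 3,
    "gap_alignment_side": "left",
}
NEW = {
    "max_indel": 20000,
    "seed_spacing": 500,
    "terminal_bandwidth": 500,
    "excess_bandwidth": 20,
    "gap_alignment_side": "left",
}


def changed_params(old: dict, new: dict) -> dict:
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    keys = sorted(old.keys() | new.keys())
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


changes = changed_params(OLD, NEW)
for key, (before, after) in changes.items():
    print(f"{key}: {before} -> {after}")
```

Three of the four values grow and `seed_spacing` is halved, while `gap_alignment_side` is unchanged.
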
7 changes: 7 additions & 0 deletions data/datasets/hMPXV/dataset.json
@@ -0,0 +1,7 @@
{
"defaultRef": "NC_063383.1",
"enabled": true,
"metadata": {},
"name": "hMPXV",
"nameFriendly": "Human Monkeypox (hMPXV)"
}
@@ -3,7 +3,7 @@
"metadata": {},
"reference": {
"source": "genbank",
- "accession": "MT903344.1",
- "strainName": "MPXV-UK_P2/2018"
+ "accession": "NC_063383.1",
+ "strainName": "MPXV-M5312_HM12_Rivers"
}
}

@@ -0,0 +1 @@
Country (Institute),Target,Oligonucleotide,Sequence

@@ -0,0 +1,25 @@
{
"tag": "2022-06-14T12:00:00Z",
"comment": "hMPXV-1 dataset",
"compatibility": {
"nextcladeCli": {
"min": "1.999.0",
"max": null
},
"nextcladeWeb": {
"min": "1.999.0",
"max": null
}
},
"enabled": true,
"files": {
"geneMap": "genemap.gff",
"primers": "primers.csv",
"qc": "qc.json",
"reference": "reference.fasta",
"sequences": "sequences.fasta",
"tree": "tree.json",
"virusPropertiesJson": "virus_properties.json"
},
"metadata": {}
}
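
The `compatibility` block above sets a lower bound of `1.999.0` and leaves the upper bound open (`null`). A minimal sketch of such a range check follows. It assumes plain `major.minor.patch` version strings and inclusive bounds. Both are assumptions made for illustration, and Nextclade's own comparison logic may differ (for example in how it treats pre-release suffixes).

```python
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.999.0' into (1, 999, 0). Pre-release suffixes are not handled."""
    return tuple(int(part) for part in text.split("."))


def is_compatible(version: str, minimum: Optional[str], maximum: Optional[str]) -> bool:
    """Check minimum <= version <= maximum, treating None as 'no bound'."""
    v = parse_version(version)
    if minimum is not None and v < parse_version(minimum):
        return False
    if maximum is not None and v > parse_version(maximum):
        return False
    return True


# Bounds copied from the tag file above; JSON null becomes None.
COMPAT = {"min": "1.999.0", "max": None}

for candidate in ("1.11.0", "1.999.0", "2.0.0"):
    print(candidate, is_compatible(candidate, COMPAT["min"], COMPAT["max"]))
```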

@@ -0,0 +1,12 @@
{
"schemaVersion": "1.10.0",
"nucMutLabelMap": {},
"nucMutLabelMapReverse": {},
"alignmentParams": {
"max_indel": 20000,
"seed_spacing": 500,
"terminal_bandwidth": 500,
"excess_bandwidth": 20,
"gap_alignment_side": "left"
}
}
7 changes: 7 additions & 0 deletions data/datasets/hMPXV_B1/dataset.json
@@ -0,0 +1,7 @@
{
"defaultRef": "pseudo_ON563414",
"enabled": true,
"metadata": {},
"name": "hMPXV_B1",
"nameFriendly": "Human Monkeypox Clade B.1"
}
@@ -0,0 +1,9 @@
{
"enabled": true,
"metadata": {},
"reference": {
"source": "genbank",
"accession": "pseudo_ON563414",
"strainName": "MPXV_USA_2022_MA001 in NC_063383 coordinates"
}
}
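
The `pseudo_ON563414` reference is described in the changelog as the SNPs of ON563414.3 applied on top of an NC_063383 backbone. The toy sketch below illustrates why this keeps the coordinate system intact: substitutions change bases but never the sequence length. The sequences, positions and the `apply_snps` helper are invented for illustration and are not the procedure actually used to build the dataset.

```python
def apply_snps(backbone: str, snps: dict) -> str:
    """Apply single-base substitutions (1-based positions) to a backbone sequence.

    Only substitutions are applied, so the result has the same length as the
    backbone and every position keeps its backbone coordinate.
    """
    seq = list(backbone)
    for pos, base in snps.items():
        if not 1 <= pos <= len(seq):
            raise ValueError(f"position {pos} is outside the backbone")
        if len(base) != 1:
            raise ValueError(f"expected a single base at position {pos}")
        seq[pos - 1] = base
    return "".join(seq)


# Toy data, not real MPXV sequence.
backbone = "ACGTACGT"
pseudo = apply_snps(backbone, {2: "T", 7: "A"})

print(backbone)
print(pseudo)
```
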