data: initial commit of community neherlab/HIV-1 dataset #186

Merged
merged 2 commits into from
May 5, 2024
3 changes: 2 additions & 1 deletion data/community/collection.json
@@ -15,6 +15,7 @@
]
},
"dataset_order": [
- "community/isuvdl/mazeller/prrsv2/orf5/yimim2023"
+ "community/isuvdl/mazeller/prrsv2/orf5/yimim2023",
+ "community/neherlab/hiv-1"
]
}
3 changes: 3 additions & 0 deletions data/community/neherlab/hiv-1/CHANGELOG.md
@@ -0,0 +1,3 @@
## Unreleased

Initial release of an HIV-1 dataset for subtype classification.
35 changes: 35 additions & 0 deletions data/community/neherlab/hiv-1/README.md
@@ -0,0 +1,35 @@
# EXPERIMENTAL: HIV-1 group M subtype and CRF classification

| Key | Value |
| ---------------------- | --------------------------------------------------------------------------------------------------------------------|
| authors | [Richard Neher (U Basel)](https://neherlab.org), [Thomas Leitner (LANL)](https://public.lanl.gov/tkl/) |
| data source | LANL database and Genbank |
| workflow | [github.com/neherlab/HIV-nextclade](https://github.com/neherlab/HIV-nextclade) |
| nextclade dataset path | neherlab/HIV-1 |
| reference | [NC_001802](https://www.ncbi.nlm.nih.gov/nuccore/NC_001802) |

This Nextclade data set aligns sequences to the HXB2 reference and finds the closest match in a reference tree built from the [2021 Super filtered Alignment](https://www.hiv.lanl.gov/content/sequence/NEWALIGN/help.html#filter) provided by LANL.
Due to extensive recombination in HIV-1 group M, the tree is not a proper phylogenetic tree, but a collection of trees inferred for each subtype or recombinant form. These individual trees are then grafted together. Rare CRFs, URFs, or subtypes represented fewer than 3 times in the reference alignment are labeled as subtype `other`.

Note that the classification is **EXPERIMENTAL** and might be unreliable for sequences that lack a close representative in the tree.
The closeness to sequences in the reference set is quantified by the `private mutations`.
As a general guide, the more private mutations a sequence has, the less reliable the subtype assignment will be.

Note that alignments of sequences that are far from the subtype B HXB2 reference will have

## Reference sequence HXB2 (NC_001802)

This data set uses the NCBI reference sequence [NC_001802](https://www.ncbi.nlm.nih.gov/nuccore/NC_001802) based on the HXB2 genome [K03455](https://www.ncbi.nlm.nih.gov/nuccore/K03455.1). The primary reason for choosing it is to ensure amino acid substitutions in conserved proteins such as `Pol` are numbered consistently.
Note that this sequence has a few problems, including a premature stop codon in `nef`.

## Treatment of indel-rich regions

There are multiple regions in the HIV-1 genome where deletions and insertions are common. Such regions are often not consistently aligned when aligning sequences individually to a reference. For the purposes of finding closely matching sequences in the reference tree, these regions are ignored. Specifically, tree placement ignores

- The LTR before the beginning of `gag` (until position 336 in `NC_001802`)
- Positions 5780 until 5870 at the end of `vpu` and the beginning of `gp120`
- Positions 6160 until 6240 in `gp120`
- Positions 6940 until 7020 in `gp120`
- The 3' LTR from the end of `nef` (positions 8963 until the end of the genome)
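
The masked regions listed above can be checked programmatically. The following is a minimal Python sketch, not part of the dataset or of Nextclade itself: the range values are copied from `placementMaskRanges` in `pathogen.json`, while the helper names `is_masked` and `masked_length` are hypothetical, and the ranges are assumed to be 0-based and half-open.

```python
# Ranges copied from "placementMaskRanges" in pathogen.json.
# Assumption: each range is 0-based and half-open, i.e. [begin, end).
PLACEMENT_MASK_RANGES = [
    (0, 336),
    (5780, 5870),
    (6160, 6240),
    (6940, 7020),
    (8963, 9181),
]


def is_masked(pos_1based: int) -> bool:
    """Return True if a 1-based reference position lies in a masked range."""
    pos = pos_1based - 1  # convert to 0-based coordinates
    return any(begin <= pos < end for begin, end in PLACEMENT_MASK_RANGES)


def masked_length() -> int:
    """Total number of reference positions ignored during tree placement."""
    return sum(end - begin for begin, end in PLACEMENT_MASK_RANGES)


if __name__ == "__main__":
    for position in (100, 2000, 5800):
        print(position, is_masked(position))
    print("masked positions:", masked_length())
```

Under these assumptions, a position in the LTR or in the indel-rich parts of `gp120` reports as masked, while a position in `pol` does not.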


27 changes: 27 additions & 0 deletions data/community/neherlab/hiv-1/genome_annotation.gff3
@@ -0,0 +1,27 @@
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
##sequence-region NC_001802.1 1 9181
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=11676
NC_001802.1 RefSeq region 1 9181 . + . ID=NC_001802.1:1..9181;Dbxref=taxon:11676;gb-acronym=HIV-1;gbkey=Src;genome=genomic;mol_type=genomic RNA;note=strain for reference annotation
NC_001802.1 RefSeq CDS 1799 2095 . + . Name=pro;gbkey=Prot;protein_id=NP_705926.1;product=aspartic peptidase;ID=id-NP_057849.4:489..587;experiment=DESCRIPTION:[PMID:2537531],DESCRIPTION:[PMID:2548279],DESCRIPTION:[PMID:3290901];Note=The proteinase domain of Gag-Pol (in the form of homodimer) mediates all the cleavages in the polyprotein. Cleaves itself from the polyprotein late in particle assembly.
NC_001802.1 RefSeq CDS 2096 3775 . + . Name=RT-p66;gbkey=Prot;product=p66 subunit;protein_id=NP_705927.1;ID=id-NP_057849.4:588..1147;experiment=DESCRIPTION:[PMID:1374166],EXISTENCE:[PMID:4316300];Note=transcribes single stranded viral RNA genome into double stranded proviral DNA%3B HIV-1 reverse transcriptase is composed of the p66 subunit (this protein) and the p51 subunit that lacks the RNAse H domain of the larger subunit
NC_001802.1 RefSeq CDS 2096 3415 . + . Name=RT-p51;gbkey=Prot;protein_id=NP_789739.1;ID=id-NP_057849.4:588..1027;product=reverse transcriptase p51 subunit;Note=HIV-1 reverse transcriptase is composed of the p66 subunit and the p51 subunit (this protein) that lacks the RNAse H domain of the larger subunit
NC_001802.1 RefSeq CDS 3776 4639 . + . Name=INT;gbkey=Prot;product=integrase;protein_id=NP_705928.1;ID=id-NP_057849.4:1148..1435;experiment=DESCRIPTION:[PMID:7983732],DESCRIPTION:[PMID:8035478];Note=mediates integration of the viral DNA into the infected cell chromosome
NC_001802.1 RefSeq CDS 336 731 . + . Name=p17;gbkey=Prot;product=matrix;protein_id=NP_579876.2;ID=id-NP_057850.1:1..132;experiment=DESCRIPTION:[PMID:12032547],DESCRIPTION:[PMID:1710290],DESCRIPTION:[PMID:8610175];Note=viral structural protein%3B forms the outer structural shell of HIV-1 virions%3B involved in the nuclear import of the HIV-1 preintegration complex
NC_001802.1 RefSeq CDS 732 1424 . + . Name=p24;gbkey=Prot;product=capsid;protein_id=NP_579880.1;ID=id-NP_057850.1:133..363;Note=viral structural protein%3B forms the core of HIV-1 virions;experiment=DESCRIPTION:[PMID:15208690],DESCRIPTION:[PMID:16041386],DESCRIPTION:[PMID:21248851]
NC_001802.1 RefSeq CDS 1425 1466 . + . Name=p2;product=p2;gbkey=Prot;protein_id=NP_579882.1;ID=id-NP_057850.1:364..377;Note=Processing of Gag-Pol by the protease domain dimer starts with cleavage between the p2 and nucleocapsid proteins.
NC_001802.1 RefSeq CDS 1467 1631 . + . Name=p7;gbkey=Prot;protein_id=NP_579881.1;product=nucleocapsid;ID=id-NP_057850.1:378..432;experiment=DESCRIPTION:[PMID:1639074],DESCRIPTION:[PMID:7666546];Note=viral structural protein%3B coats the genomic RNA inside the virion core%3B binds and delivers full-length viral RNAs into assembling HIV-1 virions
NC_001802.1 RefSeq CDS 1632 1679 . + . Name=p1;product=p1;gbkey=Prot;protein_id=NP_787042.1;ID=id-NP_057850.1:433..448;Note=important for virus infectivity%2C protein processing%2C and genomic RNA dimer stability
NC_001802.1 RefSeq CDS 1680 1835 . + . Name=p6;product=p6;gbkey=Prot;protein_id=NP_579883.1;ID=id-NP_057850.1:449..500;experiment=DESCRIPTION:[PMID:10085158],DESCRIPTION:[PMID:15527852];Note=important for incorporation of Vpr into assembling HIV-1 virions%3B helps mediate efficient virus particle release from infected cells
NC_001802.1 RefSeq CDS 4587 5165 . + 0 Name=vif;gbkey=CDS;gene=vif;product=Vif;locus_tag=HIV1gp3;protein_id=NP_057851.1;ID=cds-NP_057851.1;Dbxref=GenBank:NP_057851.1,GeneID:155459
NC_001802.1 RefSeq CDS 5105 5319 . + 0 Name=vpr;gbkey=CDS;gene=vpr;product=Vpr;locus_tag=HIV1gp4;protein_id=NP_057852.2;ID=cds-NP_057852.2;exception=artificial frameshift;Dbxref=GenBank:NP_057852.2,GeneID:155807;Note=An artificial frameshift eliminating the orf-disrupting nucleotide at position 5320 is introduced to obtain the typical HIV-1 Vpr protein sequence. For this particular HIV-1 strain%2C HXB2%2C only a short (78 amino acid long) variant of the Vpr sequence can be obtained by translation of nucleotides 5105 through 5341 without the frameshift
NC_001802.1 RefSeq CDS 5321 5396 . + 1 Name=vpr;gbkey=CDS;gene=vpr;product=Vpr;locus_tag=HIV1gp4;protein_id=NP_057852.2;ID=cds-NP_057852.2;exception=artificial frameshift;Dbxref=GenBank:NP_057852.2,GeneID:155807;Note=An artificial frameshift eliminating the orf-disrupting nucleotide at position 5320 is introduced to obtain the typical HIV-1 Vpr protein sequence. For this particular HIV-1 strain%2C HXB2%2C only a short (78 amino acid long) variant of the Vpr sequence can be obtained by translation of nucleotides 5105 through 5341 without the frameshift
NC_001802.1 RefSeq CDS 5377 5591 . + 0 Name=tat;gbkey=CDS;gene=tat;product=Tat;locus_tag=HIV1gp5;protein_id=NP_057853.1;ID=cds-NP_057853.1;Dbxref=GenBank:NP_057853.1,GeneID:155871;Note=the length of Tat varies depending on virus strain or clade
NC_001802.1 RefSeq CDS 7925 7970 . + 1 Name=tat;gbkey=CDS;gene=tat;product=Tat;locus_tag=HIV1gp5;protein_id=NP_057853.1;ID=cds-NP_057853.1;Dbxref=GenBank:NP_057853.1,GeneID:155871;Note=the length of Tat varies depending on virus strain or clade
NC_001802.1 RefSeq CDS 5516 5591 . + 0 Name=rev;gbkey=CDS;gene=rev;product=Rev;locus_tag=HIV1gp6;protein_id=NP_057854.1;ID=cds-NP_057854.1;Dbxref=GenBank:NP_057854.1,GeneID:155908
NC_001802.1 RefSeq CDS 7925 8199 . + 2 Name=rev;gbkey=CDS;gene=rev;product=Rev;locus_tag=HIV1gp6;protein_id=NP_057854.1;ID=cds-NP_057854.1;Dbxref=GenBank:NP_057854.1,GeneID:155908
NC_001802.1 RefSeq CDS 5608 5856 . + 0 Name=vpu;gbkey=CDS;gene=vpu;product=Vpu;locus_tag=HIV1gp7;protein_id=NP_057855.1;ID=cds-NP_057855.1;Dbxref=GenBank:NP_057855.1,GeneID:155945;Note=Vpu and gp160 are translated from different reading frames of the same bicistronic mRNA
NC_001802.1 RefSeq CDS 5855 7303 . + . Name=gp120;gbkey=Prot;protein_id=NP_579894.2;ID=id-NP_057856.1:29..511;experiment=DESCRIPTION:[PMID:24179160];product=Envelope surface glycoprotein gp120;Note=mediates binding of HIV-1 to CD4 and cellular co-receptors%3B cooperates with gp41 to mediate fusion of viral membrane with cellular membrane during virus entry into cells
NC_001802.1 RefSeq CDS 7304 8338 . + . Name=gp41;gbkey=Prot;protein_id=NP_579895.1;ID=id-NP_057856.1:512..856;product=Envelope transmembrane domain;Note=cooperates with gp120 to mediate fusion of viral membrane with cellular membrane during virus entry into cells
NC_001802.1 RefSeq CDS 8343 8963 . + 0 Name=nef;gbkey=CDS;gene=nef;product=Nef;locus_tag=HIV1gp9;protein_id=NP_057857.2;ID=cds-NP_057857.2;transl_except=(pos:8712..8714%2Caa:Trp);Dbxref=GenBank:NP_057857.2,GeneID:156110;Note=This particular nucleotide sequence has a premature stop codon in place of a well-conserved tryptophan codon at position 8712-8714 that truncates the HIV1 Nef protein sequence to a 123 amino acids-long N-terminal portion (not shown).
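
To inspect an annotation like the one above, a feature line can be split into its nine GFF3 columns. This is a minimal Python sketch under stated assumptions: the function name `parse_gff3_line` is hypothetical, the file on disk is assumed to be tab-delimited as the GFF3 format requires (the diff view above shows the columns separated by spaces), and the sample line is a shortened copy of the `p17` record.

```python
from urllib.parse import unquote


def parse_gff3_line(line: str) -> dict:
    """Parse one tab-delimited GFF3 feature line into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols

    attributes = {}
    for field in attrs.split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        # Values are percent-encoded in GFF3 (e.g. %3B for ';').
        attributes[key] = unquote(value)

    return {
        "seqid": seqid,
        "type": ftype,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "strand": strand,
        "attributes": attributes,
    }


if __name__ == "__main__":
    # Shortened copy of the p17 record, with tabs restored.
    sample = (
        "NC_001802.1\tRefSeq\tCDS\t336\t731\t.\t+\t.\t"
        "Name=p17;gbkey=Prot;product=matrix;Note=viral structural protein%3B forms the outer shell"
    )
    rec = parse_gff3_line(sample)
    length = rec["end"] - rec["start"] + 1
    print(rec["attributes"]["Name"], length, length // 3)
```

The nucleotide length divided by three gives the codon count, which can be compared with the amino acid range in the record's `ID` attribute.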
70 changes: 70 additions & 0 deletions data/community/neherlab/hiv-1/pathogen.json
@@ -0,0 +1,70 @@
{
"alignmentParams": {
"penaltyGapOpen": 8,
"penaltyGapOpenInFrame": 12,
"penaltyGapOpenOutOfFrame": 14,
"gapAlignmentSide": "left",
"excessBandwidth": 50,
"terminalBandwidth": 50,
"kmerLength": 10,
"kmerDistance": 50,
"minMatchLength": 40,
"allowedMismatches": 8,
"windowSize": 30
},
"compatibility": {
"cli": "3.0.0-alpha.0",
"web": "3.0.0-alpha.0"
},
"placementMaskRanges":[
{"begin":0, "end":336},
{"begin":5780, "end":5870},
{"begin":6160, "end":6240},
{"begin":6940, "end":7020},
{"begin":8963, "end":9181}
],
"deprecated": false,
"enabled": true,
"experimental": false,
"files": {
"pathogenJson": "pathogen.json",
"changelog": "CHANGELOG.md",
"examples": "sequences.fasta",
"genomeAnnotation": "genome_annotation.gff3",
"readme": "README.md",
"reference": "reference.fasta",
"treeJson": "tree.json"
},
"official": true,
"qc": {
"frameShifts": {
"enabled": true,
"scoreWeight": 20
},
"missingData": {
"enabled": true,
"missingDataThreshold": 2000,
"scoreBias": 500
},
"mixedSites": {
"enabled": true,
"mixedSitesThreshold": 40
},
"privateMutations": {
"cutoff": 700,
"enabled": true,
"typical": 600,
"weightLabeledSubstitutions": 1,
"weightReversionSubstitutions": 2,
"weightUnlabeledSubstitutions": 1
},
"stopCodons": {
"enabled": true,
"ignoredStopCodons": [
{"cdsName": "p6", "codon":49}
],
"scoreWeight": 40
}
},
"schemaVersion": "3.0.0"
}
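
The `privateMutations` block above weights reversions twice as heavily as other substitutions. As a rough illustration of how these numbers might interact, here is a Python sketch. It is not taken from the Nextclade source: the function names are hypothetical, the score formula (excess over `typical`, scaled by 100 and divided by `cutoff`) is an assumption about the scoring form, and any contribution from private deletions is ignored.

```python
# Values copied from the "privateMutations" block in pathogen.json.
PRIVATE_MUTATIONS_QC = {
    "typical": 600,
    "cutoff": 700,
    "weightLabeledSubstitutions": 1,
    "weightReversionSubstitutions": 2,
    "weightUnlabeledSubstitutions": 1,
}


def weighted_private_mutations(labeled, reversions, unlabeled, cfg=PRIVATE_MUTATIONS_QC):
    """Weighted count of private substitutions using the configured weights."""
    return (
        cfg["weightLabeledSubstitutions"] * labeled
        + cfg["weightReversionSubstitutions"] * reversions
        + cfg["weightUnlabeledSubstitutions"] * unlabeled
    )


def private_mutations_score(labeled, reversions, unlabeled, cfg=PRIVATE_MUTATIONS_QC):
    """Assumed score form: only the excess over `typical` is penalised."""
    total = weighted_private_mutations(labeled, reversions, unlabeled, cfg)
    return max(0.0, total - cfg["typical"]) * 100.0 / cfg["cutoff"]


if __name__ == "__main__":
    print(weighted_private_mutations(10, 20, 600))
    print(private_mutations_score(10, 20, 600))
    print(private_mutations_score(0, 0, 500))
```

With `typical` set to 600, a sequence would need several hundred private substitutions before this assumed score rises above zero, which fits a reference tree that only sparsely samples HIV-1 diversity.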