
This repository stores the datasets used in publications related to UNIQmin.


ChongLC/UNIQmin_PublicationData


Dataset used for publications related to UNIQmin

Tool: UNIQmin

PyPI   GitHub tag   DOI - 10.3390/biology10090853

Protocol paper:

UNIQmin, an alignment-free tool to study viral sequence diversity across taxonomic lineages: a case study of monkeypox virus

Publication:   Preprint - bioRxiv

Sequence changes in viral genomes generate protein sequence diversity that enables viruses to evade the host immune system, hindering the development of effective preventive and therapeutic interventions. The massive proliferation of sequence data provides unprecedented opportunities to study viral adaptation and evolution. An alignment-free approach removes various restrictions otherwise posed by an alignment-dependent approach to the study of sequence diversity. The publicly available tool UNIQmin offers an alignment-free approach for the study of viral sequence diversity at any given rank of the taxonomic lineage and is big-data ready. The tool performs an exhaustive search to determine the minimal set of sequences required to capture the peptidome diversity within a given dataset. This compression is achieved by removing identical sequences, as well as unique sequences that do not contribute effectively to the peptidome diversity pool. Herein, we describe a detailed four-part protocol (BP1-4) that uses UNIQmin to generate the minimal set for viral diversity analyses at any rank of the taxonomic lineage, with Monkeypox virus (MPX) as a case study. These protocols enable systematic diversity studies across the taxonomic lineage, which are much needed for future preparedness against viral epidemics, in particular when data are abundant and freely available.
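The idea of a minimal set can be illustrated with a small sketch. This is not UNIQmin's implementation: it assumes the peptidome is the set of overlapping k-mers of each protein sequence (the `k` value and the function names `kmers` and `minimal_set` are hypothetical), and it swaps the tool's exhaustive search for a simple greedy cover, so the result is only an approximation of a true minimal set.

```python
from typing import List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minimal_set(seqs: List[str], k: int = 9) -> List[str]:
    """Greedy approximation of a minimal set (illustrative only).

    Assumption: the peptidome is modelled as the set of all k-mers.
    Sequences shorter than k contribute no k-mers and are never chosen.
    """
    # Step 1: remove identical sequences, keeping first-occurrence order.
    unique = list(dict.fromkeys(seqs))

    # Step 2: repeatedly pick the sequence that covers the most
    # not-yet-covered k-mers, until every k-mer is covered.
    remaining = {s: kmers(s, k) for s in unique}
    uncovered = set().union(*remaining.values())
    chosen: List[str] = []
    while uncovered:
        best = max(remaining, key=lambda s: len(remaining[s] & uncovered))
        chosen.append(best)
        uncovered -= remaining.pop(best)
    return chosen


if __name__ == "__main__":
    toy = ["ABCDE", "ABCDE", "ABC", "CDE", "XYZ"]
    # "ABC" and "CDE" add no k-mers beyond those already in "ABCDE".
    print(minimal_set(toy, k=3))
```

In the toy input, the duplicate `ABCDE` is dropped by deduplication, and `ABC` and `CDE` are dropped because every k-mer they contain is already present in `ABCDE`.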


Basic Protocols 1, 2 & 3:

| Taxonomy lineage rank | Retrieval dataset (r) | Deduplicated dataset (nr) | Minimal dataset | Link to download |
| --- | --- | --- | --- | --- |
| Species: MPX | 19,423 | 1,245 | 866 | Download |
| Genus: Orthopoxvirus | 83,088 | 15,523 | 9,146 | Download |
| Family: Poxviridae | 163,793 | 34,782 | 25,618 | Download |
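The table above gives raw counts only. As a rough convenience, the sketch below expresses the deduplicated and minimal datasets as a percentage of the retrieval dataset (r), in the same way the SARS-CoV-2 tables in this README do. The counts are copied from the table; the helper name `percent_of_r` and the one-decimal rounding are assumptions made for illustration.

```python
# Counts copied from the Basic Protocols 1, 2 & 3 table:
# (retrieval r, deduplicated nr, minimal)
datasets = {
    "Species: MPX": (19_423, 1_245, 866),
    "Genus: Orthopoxvirus": (83_088, 15_523, 9_146),
    "Family: Poxviridae": (163_793, 34_782, 25_618),
}


def percent_of_r(r: int, n: int) -> float:
    """Return n as a percentage of the retrieval dataset r, to one decimal."""
    return round(100 * n / r, 1)


if __name__ == "__main__":
    for rank, (r, nr, minimal) in datasets.items():
        print(
            f"{rank}: nr = {percent_of_r(r, nr)}% of r, "
            f"minimal = {percent_of_r(r, minimal)}% of r"
        )
```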

Basic Protocol 4:

| Diversity spectrum | Virus and selected protein | Link to download |
| --- | --- | --- |
| Highly conserved (H < 1) | Avian H5N1 PA | Download |
| Semi-conserved (1 <= H < 2) | DENV NS3 | Download |
| Diverse (2 <= H < 3) | DENV NS2a | Download |
| Extremely diverse (H >= 3) | HIV-1 clade B Nef | Download |
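The diversity spectrum above bins datasets by an entropy value H. A minimal sketch of that binning follows, using the thresholds from the table. This README does not state how H is computed, so the `shannon_entropy` helper (base-2 Shannon entropy over a collection of observed items) is an assumption, and both function names are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable


def shannon_entropy(items: Iterable[str]) -> float:
    """Base-2 Shannon entropy of the observed items (assumed definition of H)."""
    counts = Counter(items)
    total = sum(counts.values())
    # sum of p * log2(1/p), written with counts to avoid a negative zero
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def diversity_class(h: float) -> str:
    """Map an entropy value onto the diversity spectrum used in the table."""
    if h < 1:
        return "Highly conserved"
    if h < 2:
        return "Semi-conserved"
    if h < 3:
        return "Diverse"
    return "Extremely diverse"


if __name__ == "__main__":
    # Two equally frequent variants give exactly one bit of entropy.
    h = shannon_entropy(["A", "A", "B", "B"])
    print(h, diversity_class(h))
```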

Application papers:

UNIQmin application to all viruses:

Mapping the minimal set of the viral peptidome across major viral taxonomic lineages

Status - In progress


UNIQmin application to SARS-CoV-2:

Negligible peptidome diversity of SARS-CoV-2 and its higher taxonomic ranks

Publication:   Preprint - bioRxiv

The unprecedented increase in SARS-CoV-2 sequence data limits the application of alignment-dependent approaches to study viral diversity. Herein, we applied our recently published UNIQmin, an alignment-free tool, to study the protein sequence diversity of SARS-CoV-2 (sub-species) and its higher taxonomic lineage ranks (species, genus, and family). Less than 0.5% of the reported SARS-CoV-2 protein sequences are required to represent the inherent viral peptidome diversity, which increases to only ~2% at the family rank. This is expected to remain relatively unchanged even with further increases in the sequence data. The findings have important implications for the design of vaccines, drugs, and diagnostics: the number of sequences that such studies need to consider is drastically reduced, short-circuiting the discovery process while still providing a systematic evaluation and coverage of the pathogen diversity.


Compression of SARS-CoV-2 datasets across taxonomy lineage ranks, namely sub-species (proteins), species (with and without SARS-CoV-2), genus, and family:

Note: All data were retrieved as of July 2021.

| Taxonomy lineage rank: Virus | Retrieval dataset (r) | Deduplicated dataset (% of r) | Minimal dataset (% of r) | Link to download |
| --- | --- | --- | --- | --- |
| Subspecies: SARS-CoV-2 (GISAID) | 56,340,320 | 1,780,901 (~3.2) | 273,851 (~0.5) | Download |
| Species: SARS-related coronavirus | 4,669,400 | 480,112 (~10.3) | 61,819 (~1.3) | |
| Genus: Betacoronavirus | 4,689,400 | 485,220 (~10.3) | 65,117 (~1.3) | Download |
| Family: Coronaviridae | 4,733,200 | 506,374 (~10.7) | 79,414 (~1.7) | Download |

Note: SARS-CoV-2 Spike Protein

| Month-Year | Retrieval dataset (r) | Deduplicated dataset (% of r) | Minimal dataset (% of r) |
| --- | --- | --- | --- |
| July 2021 | 2,115,156 | 358,096 (~16.9) | 42,399 (~2.0) |
| December 2022 | 14,060,695 | 2,778,826 (~19.8) | 112,912 (~0.8) |
Download

Note: All data were retrieved as of December 2022.

Download

| Variant | Retrieval dataset (r) | Deduplicated dataset (% of r) | Minimal dataset (% of r) |
| --- | --- | --- | --- |
| Alpha | 1,188,924 | 153,732 (~12.9) | 25,169 (~2.1) |
| Beta | 43,299 | 17,766 (~41.0) | 8,429 (~19.5) |
| Delta | 4,527,917 | 950,886 (~21.0) | 55,686 (~1.2) |
| Gamma | 129,136 | 27,720 (~21.5) | 8,246 (~6.4) |
| Mu | 15,792 | 5,277 (~33.4) | 2,078 (~13.2) |
| Omicron | 6,664,999 | 1,477,721 (~22.2) | 75,992 (~1.1) |

Citing resources


Found a bug?

Or would like to drop some feedback?
Just open a new issue or send an email to us ([email protected]).
