For a list of taxonIDs, VHost-Classifier will filter out the viruses and then sort these viruses into groups based on their host lineage.
The VHost-Classifier algorithm uses the Virus-Host DB, the NCBI Taxonomy DB and inbuilt predictive rules to achieve a high rate of virus host classification. VHost-Classifier will classify virus taxonIDs to family resolution.
VHost-Classifier will sort viruses it could not assign a host to by the environment they were sequenced from. To do this it uses the IMG/VR database and inbuilt predictive rules.
When benchmarked on 1000 randomly selected viral taxonids on NCBI, the software could classify 93% of vtaxids to the rank of Class, and 37% of vtaxids to the rank of Family, with an accuracy of 100%. A list of these random taxids can be found in the random_ids.csv file.
Usage:
Clone the directory and run from within cloned directory.
python vhost_classifier.py [TaxonID.tsv] [VirusHostDB.tsv] [Output Dir] [-i] [-g] [-n]
[TaxonID.tsv]
: a .tsv list of taxonIDs to be classified (one taxon ID per row).
[VHostDB.tsv]
: a copy of the Virus Host DB which can be downloaded here
or by running :
wget ftp://ftp.genome.jp/pub/db/virushostdb/virushostdb.tsv
[Output Dir]
: the name of the directory to output results to (must be unique).
[-i]
: optional argument, specify the value to start indexing the input taxonIDs from (default 0).
[-g]
: optional argument, taxonomic ranks to bin to. PCO, Phylum Class Order or POF, Phylum Order Family (default PCO).
[-n]
: optional argument, supply file of scientific names alongside taxon ids (use if taxonid list returns an index error).
Example:
python VHost_Classifier.py random_ids.csv VirusHostDB.tsv VHC_Run_1 -i 1 -g POF -n random_names.csv
Virus host classify a list of taxonIDs in random_ids.csv
, use the VHost-DB file supplied by VirusHostDB.tsv
and output the results to VHC_RUN_1
. Index the input taxonIDs from 1 in the output csv files. Classify taxonIDs to Phylum Order Family. Parse the random_names.csv
file.
Dependencies:
Python 3
ETE3 Toolkit for Python 3
Note: On first run through NCBI taxonomy database will be downloaded by ETE3.
Output:
VHost Classifier will create directories and in each directory write .csv files.
Reading the .csv files: the first column contains taxon IDs, the second column the index position (indexed from -i) of the taxon id in the input file. The final column contains the virus name. In each directory a counts.csv file is also written which contains the counts of how many taxon IDs are in each taxonomic class.
VHC-Analysis: run this script from within the Host-Assigned directory of the run you want to analyse. The script will walk the directory tree and write each Counts.csv file to a Total_Counts.csv file which will be saved in the Host-Assigned directory. This file makes it easier to compare the overall host diversity of viruses in your input.
References:
Virus-Host DB: Mihara, Tomoko, et al. "Linking virus genomes with host taxonomy." Viruses 8.3 (2016): 66.
IMG/VR: Paez-Espino, David, et al. "IMG/VR: a database of cultured and uncultured DNA Viruses and retroviruses." Nucleic acids research (2016): gkw1030.