octoFLU: Automated classification to evolutionary origin of influenza A virus gene sequences detected in U.S. swine
Reference database has been updated to include N2 clade names (i.e., 2002A vs 2002B; 1998A vs 1998B). See Zeller et al. Coordinated evolution between N2 neuraminidase and H1 and H3 hemagglutinin genes increased influenza A virus genetic diversity in swine.
Determines the evolutionary origin of influenza A virus genes through inference of a maximum likelihood tree, followed by assignment of a defined genetic clade based on the nearest neighbor determined by patristic distance.
This tool has been tested on swine H1 and H3 data collected from 2014 to the present; sequences from other serotypes, or sequences collected outside North America, may generate incorrect results. We suggest using the IRD Sequence Annotation tool prior to running this pipeline.
We also recommend that output from the automatic classification be interpreted conservatively; more comprehensive phylogenetic analyses may be required for accurate determination of evolutionary history. This pipeline generates a phylogeny using a limited set of reference sequences and annotates the queries based upon the "nearest neighbor." If query sequences are dissimilar to the annotated reference set (e.g., swine H1 sequences from the 1990s, or swine data collected in Europe or Asia), they are likely to be misclassified.
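For orientation, the classification flow is roughly: bin each query by gene segment with BLASTn against the reference set, align the binned queries with their matching references, infer a maximum likelihood tree per segment, and assign each query the clade of its nearest reference by patristic distance. The commands below are only an illustrative sketch of that flow; the per-segment file names and the treedist.py invocation are assumptions, and octoFLU.sh is the authoritative implementation.
# 0. build a BLAST database from the reference set (path taken from octoFLU.sh)
makeblastdb -in reference_data/reference.fa -dbtype nucl -out reference_db
# 1. bin queries by gene segment and lineage with BLASTn
blastn -query queries.fasta -db reference_db -outfmt 6 -max_target_seqs 1 > blast_output.txt
# 2. align one segment's queries together with its references (H1 shown as an example)
mafft --auto H1_queries_plus_refs.fasta > H1_aln.fasta
# 3. infer a maximum likelihood tree
FastTree -nt -gtr H1_aln.fasta > H1.tre
# 4. assign each query the clade of its nearest annotated reference by patristic distance
#    (the treedist.py arguments are assumed here for illustration)
python3 treedist.py H1.tre > H1_assignments.txt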
If you use this pipeline or the curated reference datasets in your work, please cite this:
Chang, J.+, Anderson, T.K.+, Zeller, M.A.+, Gauger, P.C., Vincent, A.L. (2019). octoFLU: Automated classification to evolutionary origin of influenza A virus gene sequences detected in U.S. swine. Microbiology Resource Announcements 8:e00673-19. +These authors contributed equally.
If you have problems running the pipeline, please use the Issues feature of GitHub, or e-mail [email protected] or [email protected] directly.
We thank Jordan Angell (USDA-APHIS, Visual Services) for the design of the octoFLU logo.
Unaligned fasta with query sequences (e.g., strain name with protein segment identifier).
- Text output stating the query name, protein symbol, and genetic clade or evolutionary lineage.
- Text output holding the query name and top BLASTn hit.
- Inferred maximum likelihood trees with reference gene sets and queries.
bash octoFLU.sh sample_data/query_sample.fasta
pip3 install smof
pip3 install dendropy
git clone https://github.com/flu-crew/octoFLU.git
cd octoFLU
If you are on Linux, you can likely use pip instead of pip3.
You will need to have an installation of:
- NCBI BLAST,
- smof,
- mafft,
- FastTree,
- dendropy,
- and the included treedist.py script (a quick way to check these is sketched below).
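A quick way to confirm the external tools and python modules are reachable (a convenience check, not part of the pipeline):
for tool in blastn makeblastdb mafft FastTree; do
  command -v "$tool" >/dev/null || echo "missing: $tool"
done
python3 -c "import dendropy, smof" || echo "missing python module(s)"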
Edit the paths in octoFLU.sh to connect blastn, makeblastdb, smof, mafft, and FastTree.
# Connect your reference dataset here
REFERENCE=reference_data/reference.fa
# Connect your programs here, can use full path names
BLASTN=~/bin/blastn
MAKEBLASTDB=~/bin/makeblastdb
SMOF=~/bin/smof
MAFFT=`which mafft`
FASTTREE=~/bin/FastTree
NN_CLASS=treedist.py
Then run the pipeline
bash octoFLU.sh sample_data/query_sample.fasta
The output will be in a *_Final_Output.txt file and a *_output folder; any trees generated will be listed and named by protein symbol, and blast_output.txt includes the query genes and their top BLASTn hit. The main bottleneck is waiting for trees to run in FastTree (an installation of the multi-threaded version helps). A sampling of the output is included below, split by ... (a one-liner for tallying these assignments follows the sample).
bash octoFLU.sh sample_data/query.fasta
less query.fasta_Final_Output.txt
QUERY_MH540411_A/swine/Iowa/A02169143/2018 H1 pdm 1A.3.3.2
QUERY_MH595470_A/swine/South_Dakota/A02170160/2018 H1 delta1 1B.2.2.2
QUERY_MH595472_A/swine/Illinois/A02170163/2018 H1 alpha 1A.1.1
...
QUERY_MH546131_A/swine/Minnesota/A01785562/2018 H3 2010-human_like 3.2010.1
QUERY_MH561745_A/swine/Minnesota/A01785568/2018 H3 2010-human_like 3.2010.1
QUERY_MH551260_A/swine/Iowa/A02016898/2018 H3 2010-human_like 3.2010.1
...
QUERY_MH551259_A/swine/Iowa/A02016897/2018 N1 classicalSwine
QUERY_MH561752_A/swine/Minnesota/A01785574/2018 N1 classicalSwine
QUERY_MH551263_A/swine/Minnesota/A02016891/2018 N1 classicalSwine
...
QUERY_MK024152_A/swine/Minnesota/A01785613/2018 N2 1998
QUERY_MH976804_A/swine/Michigan/A01678583/2018 N2 1998
QUERY_MH595471_A/swine/South_Dakota/A02170160/2018 N2 2002
...
QUERY_MH922882_A/swine/Ohio/18TOSU4536/2018 M pdm
QUERY_MK321295_A/swine/Florida/A01104129/2018 M pdm
QUERY_MK129490_A/swine/Illinois/A02170163/2018 M pdm
...
QUERY_MK185286_A/swine/Iowa/A02016889/2018 PB1 TRIG
QUERY_MK185322_A/swine/Iowa/A02169143/2018 PB1 pdm
QUERY_MK039744_A/swine/Iowa/A02254795/2018 PB1 TRIG
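If you want a quick tally of assignments, something like the following works on the whitespace-separated columns shown above (the column layout here is an assumption based on this sample):
awk '{print $2, $3}' query.fasta_Final_Output.txt | sort | uniq -c | sort -rn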
Start the Docker daemon and navigate to your query file location.
cd mydataset/
docker pull flucrew/octoflu
docker run -it -v ${PWD}:/data flucrew/octoflu:latest /bin/bash
From inside the Docker image you should be able to run the pipeline. Remember to copy files to /data to pull them out of the Docker image to your computer.
bash octoFLU.sh sample_data/query_sample.fasta
cp -rf query_sample.fasta_output /data/.
exit
If you want to run your own dataset, place the data in a fasta file (e.g., mydataset/myseqs.fasta).
cd mydataset
docker run -it -v ${PWD}:/data flucrew/octoflu:latest /bin/bash
bash octoFLU.sh /data/myseqs.fasta
After octoFLU has finished running, copy the data outside of Docker:
cp -rf myseqs.fasta_output /data/.
exit
cd myseqs.fasta_output
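If you prefer to skip the interactive shell, a single non-interactive command along these lines should mirror the steps above; as in the interactive example, it assumes the image's working directory contains octoFLU.sh and that the output folder is written there:
docker run --rm -v ${PWD}:/data flucrew/octoflu:latest \
  bash -c "bash octoFLU.sh /data/myseqs.fasta && cp -rf myseqs.fasta_output /data/."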
Singularity and Docker are friends. A Singularity image can be built using singularity pull:
singularity pull docker://flucrew/octoflu
This pipeline relies upon python3. Many macOS computers ship with Python 2.7, so an update is required. The Python website has an installer for Python 3.7; if you use that package, it will place python3 in /usr/local/bin/. Unfortunately, this requires you to set up an alias in your shell environment (e.g., echo "alias python=/usr/local/bin/python3.7" >> ~/.bash_profile).
The best option is to use Homebrew.
brew install pyenv
pyenv install 3.7.3
pyenv global 3.7.3
pyenv version
echo -e 'if command -v pyenv 1>/dev/null 2>&1; then\n eval "$(pyenv init -)"\nfi' >> ~/.bash_profile
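After reloading your shell, a quick check that the pyenv interpreter is active (and a reminder to install the python modules into it):
python --version                # should now report Python 3.7.3
pip install smof dendropy       # installs into the pyenv-managed interpreter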
We have also used the Anaconda distribution with python3; the dendropy module may be installed using conda (e.g., conda install -c bioconda dendropy). Installing pip is also a good idea if you don't have it.
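If you go the conda route, one possible environment setup looks like this; treating blast, mafft, and fasttree as bioconda package names is an assumption, so adjust as needed:
conda create -n octoflu python=3.7
conda activate octoflu
conda install -c bioconda blast mafft fasttree dendropy
pip install smof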
This is a very helpful article describing the best approach (towards the bottom) for getting python3 on your Mac.
A Python script, octoFLU.py, has been provided that runs on Windows, Mac, or Linux machines with usage and output similar to the original octoFLU.sh. This script can be run directly in cmd.exe or through Anaconda. If running through Anaconda, call the Python script octoFLU.py instead of the bash script octoFLU.sh.
python octoFLU.py sample_data/query_sample.fasta
The dependencies are the same, but pathing generally works better when it is explicit.
# ===== Connect your programs here, Linux style
# BLASTN = "/usr/local/bin/ncbi_blast/blastn"
# MAKEBLASTDB = "/usr/local/bin/ncbi_blast/makeblastdb"
# SMOF = "smof"
# MAFFT = "/usr/local/bin/mafft"
# FASTTREE = "/usr/local/bin/FastTree/FastTree"
# NN_CLASS = "treedist.py"
# PYTHON = "python"
# ===== Windows Style program referencing. Recommend using WHERE to find commands in cmd.exe
BLASTN = "E:/lab/tools/ncbi_blast/blastn.exe"
MAKEBLASTDB = "E:/lab/tools/ncbi_blast/makeblastdb.exe"
SMOF = "E:/Anaconda3/Scripts/smof.exe"
MAFFT = "E:/lab/tools/mafft-win/mafft.bat"
FASTTREE = "E:/lab/tools/FastTree.exe"
NN_CLASS = "treedist.py"
PYTHON = "E:/Anaconda3/python.exe"
Explicit pathing is needed for anything not in the Windows Path. After using pip to install dendropy and smof, the path to the smof executable can be found using where smof. Input and output remain unchanged.
If running through the Ubuntu command line on Windows 10, it may be easiest to run from the Desktop:
Open the Ubuntu app.
cd /mnt/c/Users/put_your_user_name_here/Desktop
git clone https://github.com/flu-crew/octoFLU.git
cd octoFLU
bash octoFLU.sh sample_data/query_sample.fasta
There have been known issues involving file encoding; the input file may need to be converted to ANSI to run correctly.
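If a query file trips this, converting it before running may help; the commands below assume "ANSI" means Windows-1252 and that the source file is UTF-8:
iconv -f UTF-8 -t WINDOWS-1252 myseqs.fasta > myseqs_ansi.fasta
bash octoFLU.sh myseqs_ansi.fasta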
- Extend to include international evolutionary lineages.
- Reannotate the tree with NN-clades for ease of use.
- Integrate a script to combine gene assignments to a whole genome constellation descriptor.
- Annotate input sequences with gene classification, and use these designations in the inferred phylogenetic trees.