diff --git a/amptk/funguild.py b/amptk/funguild.py index bd442de..c6fa83b 100644 --- a/amptk/funguild.py +++ b/amptk/funguild.py @@ -102,8 +102,9 @@ def main(args): # initialize script and log if not args.out: args.out = '{}.funguild.txt'.format(args.input.rsplit('.', 1)[0]) + #remove logfile if exists - log_name = '{}.funguild.log'.format(args.out.rsplit('.', 1)[0]) + log_name = '{}.log'.format(args.out.rsplit('.', 1)[0]) if os.path.isfile(log_name): os.remove(log_name) diff --git a/docs/index.rst b/docs/index.rst index 602724f..fefe0e0 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -20,7 +20,10 @@ AMPtk documentation downstream -AMPtk is a series of scripts to process NGS amplicon data using USEARCH and VSEARCH, it can also be used to process any NGS amplicon data and includes databases setup for analysis of fungal ITS, fungal LSU, bacterial 16S, and insect COI amplicons. It can handle Ion Torrent, MiSeq, and 454 data. At least USEARCH v9.1.13 and VSEARCH v2.2.0 are required as of AMPtk v0.7.0. +AMPtk is a series of scripts to process NGS amplicon data using USEARCH and VSEARCH, it can also be used to process any NGS amplicon data and includes databases setup for analysis of fungal ITS, fungal LSU, bacterial 16S, and insect COI amplicons. It can handle Ion Torrent, MiSeq, and 454 data. At least USEARCH v9.1.13 and VSEARCH v2.2.0 were required as of AMPtk v0.7.0. + + +Update November 2021: As of AMPtk v1.5.1, USEARCH is no longer a dependency as all of the processing can be done with VSEARCH 64-bit. This means that default/hybrid taxonomy assignment uses just global alignment and SINTAX -- however the added benefit here is that we are not constrained by the 4 GB RAM limit of the 32-bit version of USEARCH. Citation @@ -67,8 +70,7 @@ Users can also install manually, download a `release `_. Obtain the proper version of USEARCH v9.2.64 and softlink into the PATH: @@ -147,28 +149,15 @@ You only need to worry about these dependencies if you installed manually and/or Run from Docker ================== -There is a base installation of AMPtk on Docker at nextgenusfs/amptk-base. Because usearch9 and usearch10 are required but must be personally licensed, here are the directions to get a working AMPtk docker image. +There is a base installation of AMPtk on Docker at nextgenusfs/amptk-base and then one with taxonomy databases at nextgenusfs/amptk. I've written a wrapper script that will run the docker image, simply have to download the script and ensure its executable. 1) Download the Dockerfile build file. .. code-block:: none - wget https://raw.githubusercontent.com/nextgenusfs/amptk/master/Dockerfile - -2) Download usearch9.2.64 and usearch10.0.240 for linux (32 bit version) `here `_. - - -3) Build AMPtk docker image - -.. code-block:: none - - docker build -t amptk -f Dockerfile . - -4) You can now launch the docker image like so (make sure files you need are in current directory) - -.. code-block:: none + $ wget -O amptk-docker https://raw.githubusercontent.com/nextgenusfs/amptk/master/amptk-docker + $ chmod +x amptk-docker - docker run -it --rm -v $PWD:/work amptk /bin/bash More Information diff --git a/docs/taxonomy.rst b/docs/taxonomy.rst index 891c928..be8bd82 100644 --- a/docs/taxonomy.rst +++ b/docs/taxonomy.rst @@ -9,7 +9,7 @@ AMPtk can assign taxonomy using Blast, RDP Classifier, Global Alignment, UTAX, a **The hybrid taxonomy algorithm works as follows:** 1) Global Alignment is performed on a reference database. The method captures all of the top hits from the database that have Percent Identity higher than ``--usearch_cutoff`` (Default: 0.7). The top hits that have the most taxonomy information (levels) are retained and LCA is run on these hits. -2) UTAX Classifier is run generating a taxonomy string that scores better than ``--utax_cutoff`` (Default: 0.8). +2) UTAX Classifier is run generating a taxonomy string that scores better than ``--utax_cutoff`` (Default: 0.8). *v1.5.1 and greater do not run UTAX* 3) SINTAX Classifier is run generating a taxonomy string that scores better than ``--sintax_cutoff`` (Default: 0.8). 4) The Bayesian Classifier results are compared, and the method that produces the most levels of taxonomy above the threshold is retained. 5) If the best Global Alignment result is less than 97% identical, then final taxonomy string defaults to the best Bayesian Classifier result (UTAX or SINTAX). @@ -53,9 +53,9 @@ A typically command to assign taxonomy in AMPtk looks like this: .. code-block:: none amptk taxonomy -i input.otu_table.txt -f input.cluster.otus.fa -m input.mapping_file.txt -d ITS2 - + This command will run the default hybrid method and will use the ITS2 database (``-d ITS2``). The output of this command will be a tab delimited taxonomy file ``input.taxonomy.txt``, an OTU table contain taxonomy as the last column ``input.otu_table.taxonomy.txt```, a multi-fasta file with taxonomy as OTU headers ``input.otus.taxonomy.fa``, and BIOM file containing taxonomy and metadata ``input.biom``. - + Taxonomy Databases ------------------------------------- AMPtk is packaged with 4 reference databases for fungal ITS, fungal LSU, bacterial 16S, and arthropod/chordate mtCOI. These pre-built databases are updated frequently when reference databases are updated and can be downloaded/installed as follows: @@ -70,7 +70,7 @@ AMPtk is packaged with 4 reference databases for fungal ITS, fungal LSU, bacteri #update database amptk install -i ITS 16S LSU COI --force - + Users can also build their own custom databases, with the largest obstacle to overcome being formatting the taxonomy headers for reference databases. Because AMPtk uses UTAX/SINTAX Bayesian classifiers, it uses the same taxonomy header formatting which looks like the following ``Kingdom(k), Phylum(p), Class(c), Order(o), Family(f), Genus(g), Species(s)``: .. code-block:: none @@ -93,21 +93,21 @@ These databases were created from Unite v8.0, first downloading two databases fr #Create full length ITS USEARCH Database, convert taxonomy, and create USEARCH database amptk database -i UNITE_public_all_02.02.2019.fasta -f ITS1-F -r ITS4 \ --primer_required none -o ITS --create_db usearch --install --source UNITE:8.0 - + #create SINTAX database amptk database -i sh_general_release_dynamic_all_02.02.2019_dev.fasta \ -o ITS_SINTAX --create_db sintax -f ITS1-F -r ITS4 --derep_fulllength \ - --install --source UNITE:8.0 --primer_required none + --install --source UNITE:8.0 --primer_required none #Create UTAX Databases amptk database -i sh_general_release_dynamic_all_02.02.2019_dev.fasta \ -o ITS_UTAX --create_db utax -f ITS1-F -r ITS4 \ --derep_fulllength --install --source UNITE:8.0 --primer_required none - + amptk database -i sh_general_release_dynamic_all_02.02.2019_dev.fasta \ -o ITS1_UTAX -f ITS1-F -r ITS2 --primer_required rev --derep_fulllength \ --create_db utax --install --subsample 65000 --source UNITE:8.0 - + amptk database -i sh_general_release_dynamic_all_02.02.2019_dev.fasta \ -o ITS2_UTAX --create_db utax -f fITS7 -r ITS4 --derep_fulllength \ --install --source UNITE:8.0 --primer_required for @@ -128,7 +128,7 @@ Since it can literally take days to download the arthropod dataset, if you'd lik #combine datasets for usearch cat arthropods.bold-reformated.fa chordates.bold-reformated.fa > arth-chord.bold-reformated.fasta - + #generate global alignment database amptk database -i arth-chord.bold.reformated.fasta -f LCO1490 -r mlCOIintR --primer_required none \ --derep_fulllength --format off --primer_mismatch 4 -o COI --min_len 200 --create_db usearch \ @@ -140,12 +140,12 @@ The second set of output files from `bold2utax.py` are named with `.BIN-consensu #combine datasets cat arthropods.BIN-consensus.fa chordates.BIN-consensus.fa > arth-chord.bold.BIN-consensus.fasta - + #generate SINTAX database amptk database -i arth-chord.bold.BIN-consensus.fasta -f LCO1490 -r mlCOIintR --primer_required none \ --derep_fulllength --format off --primer_mismatch 4 -o COI_SINTAX --min_len 200 --create_db sintax \ --install --source BOLD:20190219 - + #generate UTAX database, need to subsample for memory issues with 32 bit usearch and we require rev primer match here amptk database -i arth-chord.bold.BIN-consensus.fasta -f LCO1490 -r mlCOIintR --primer_required rev \ --derep_fulllength --format off --subsample 00000 --primer_mismatch 4 -o COI_UTAX --min_len 200 \ @@ -165,8 +165,8 @@ The fungal 28S database (LSU) was downloaded from `RDP