update some docs

nextgenusfs · Nov 26, 2021 · 70a63f9
1 parent 767999e
Showing 3 changed files with 28 additions and 39 deletions.
diff --git a/amptk/funguild.py b/amptk/funguild.py
@@ -102,8 +102,9 @@ def main(args):
     # initialize script and log
     if not args.out:
         args.out = '{}.funguild.txt'.format(args.input.rsplit('.', 1)[0])
+
     #remove logfile if exists
-    log_name = '{}.funguild.log'.format(args.out.rsplit('.', 1)[0])
+    log_name = '{}.log'.format(args.out.rsplit('.', 1)[0])
     if os.path.isfile(log_name):
         os.remove(log_name)
 

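Note on the `funguild.py` hunk above: a minimal sketch of what the changed line does to the log file name when `args.out` is left at its default. The helper names and the input file name `sample.otu_table.txt` are hypothetical and chosen only for illustration; the format strings are copied from the hunk.

```python
def default_out(input_path):
    # Same expression as the context line in the hunk: drop the last
    # extension of the input and append '.funguild.txt'.
    return '{}.funguild.txt'.format(input_path.rsplit('.', 1)[0])


def log_name_old(out_path):
    # Removed line: appends '.funguild.log' to the output stem.
    return '{}.funguild.log'.format(out_path.rsplit('.', 1)[0])


def log_name_new(out_path):
    # Added line: appends only '.log' to the output stem.
    return '{}.log'.format(out_path.rsplit('.', 1)[0])


# Hypothetical input file name, used only to show the effect.
out = default_out('sample.otu_table.txt')
print(out)                # sample.otu_table.funguild.txt
print(log_name_old(out))  # sample.otu_table.funguild.funguild.log
print(log_name_new(out))  # sample.otu_table.funguild.log
```

With the default output name, the old expression repeated the `funguild` suffix in the log name; the new one does not.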
diff --git a/docs/index.rst b/docs/index.rst
@@ -20,7 +20,10 @@ AMPtk documentation
    downstream
 
 
-AMPtk is a series of scripts to process NGS amplicon data using USEARCH and VSEARCH, it can also be used to process any NGS amplicon data and includes databases setup for analysis of fungal ITS, fungal LSU, bacterial 16S, and insect COI amplicons. It can handle Ion Torrent, MiSeq, and 454 data. At least USEARCH v9.1.13 and VSEARCH v2.2.0 are required as of AMPtk v0.7.0.
+AMPtk is a series of scripts to process NGS amplicon data using USEARCH and VSEARCH, it can also be used to process any NGS amplicon data and includes databases setup for analysis of fungal ITS, fungal LSU, bacterial 16S, and insect COI amplicons. It can handle Ion Torrent, MiSeq, and 454 data. At least USEARCH v9.1.13 and VSEARCH v2.2.0 were required as of AMPtk v0.7.0.
+
+
+Update November 2021: As of AMPtk v1.5.1, USEARCH is no longer a dependency as all of the processing can be done with VSEARCH 64-bit. This means that default/hybrid taxonomy assignment uses just global alignment and SINTAX -- however the added benefit here is that we are not constrained by the 4 GB RAM limit of the 32-bit version of USEARCH.
 
 
 Citation
@@ -67,8 +70,7 @@ Users can also install manually, download a `release <https://github.com/nextgen
 
 
 
-
-Dependencies Requiring Manual Install
+Dependencies Requiring Manual Install for Older version of AMPtk (Deprecated)
 =========================================
 1) AMPtk utilizes USEARCH9 which must be installed manually from the developer `here <http://www.drive5.com/usearch/download.html>`_.  Obtain the proper version of USEARCH v9.2.64 and softlink into the PATH:
 
@@ -147,28 +149,15 @@ You only need to worry about these dependencies if you installed manually and/or
 
 Run from Docker
 ==================
-There is a base installation of AMPtk on Docker at nextgenusfs/amptk-base. Because usearch9 and usearch10 are required but must be personally licensed, here are the directions to get a working AMPtk docker image.
+There is a base installation of AMPtk on Docker at nextgenusfs/amptk-base and then one with taxonomy databases at nextgenusfs/amptk.  I've written a wrapper script that will run the docker image, simply have to download the script and ensure its executable.
 
 1) Download the Dockerfile build file.
 
 .. code-block:: none
 
-    wget https://raw.githubusercontent.com/nextgenusfs/amptk/master/Dockerfile
-
-2) Download usearch9.2.64 and usearch10.0.240 for linux (32 bit version) `here <http://www.drive5.com/usearch/download.html>`_.
-
-
-3) Build AMPtk docker image
-
-.. code-block:: none
-
-    docker build -t amptk -f Dockerfile .
-
-4) You can now launch the docker image like so (make sure files you need are in current directory)
-
-.. code-block:: none
+    $ wget -O amptk-docker https://raw.githubusercontent.com/nextgenusfs/amptk/master/amptk-docker
+    $ chmod +x amptk-docker
 
-    docker run -it --rm -v $PWD:/work amptk /bin/bash
 
 
 More Information

diff --git a/docs/taxonomy.rst b/docs/taxonomy.rst
@@ -9,7 +9,7 @@ AMPtk can assign taxonomy using Blast, RDP Classifier, Global Alignment, UTAX, a
 **The hybrid taxonomy algorithm works as follows:**
 
 1) Global Alignment is performed on a reference database. The method captures all of the top hits from the database that have Percent Identity higher than ``--usearch_cutoff`` (Default: 0.7). The top hits that have the most taxonomy information (levels) are retained and LCA is run on these hits.
-2) UTAX Classifier is run generating a taxonomy string that scores better than ``--utax_cutoff`` (Default: 0.8).
+2) UTAX Classifier is run generating a taxonomy string that scores better than ``--utax_cutoff`` (Default: 0.8). *v1.5.1 and greater do not run UTAX*
 3) SINTAX Classifier is run generating a taxonomy string that scores better than ``--sintax_cutoff`` (Default: 0.8).
 4) The Bayesian Classifier results are compared, and the method that produces the most levels of taxonomy above the threshold is retained.
 5) If the best Global Alignment result is less than 97% identical, then final taxonomy string defaults to the best Bayesian Classifier result (UTAX or SINTAX).
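Note on steps 4 and 5 of the list above: a rough Python sketch of the decision they describe, not AMPtk's actual implementation. The function names, the representation of a taxonomy as a list of levels, and the behavior when the global alignment hit is at or above 97% identity are all assumptions, since the remaining steps are not shown in this hunk. Classifier results are assumed to have already been filtered by their cutoffs, and `None` stands for a classifier that was not run (for example UTAX in v1.5.1 and later).

```python
def pick_bayesian(utax, sintax):
    """Step 4: keep the classifier result with the most taxonomy levels.

    Each argument is a list of levels (hypothetical representation) or
    None when that classifier produced no result or was not run.
    """
    candidates = [t for t in (utax, sintax) if t]
    if not candidates:
        return None
    return max(candidates, key=len)


def hybrid_choice(global_pident, global_lca, utax, sintax, identity_cutoff=97.0):
    """Step 5: fall back to the best Bayesian result below the cutoff."""
    best_bayes = pick_bayesian(utax, sintax)
    if global_pident < identity_cutoff and best_bayes:
        return best_bayes
    # Assumption: the handling of hits at or above the cutoff is not
    # shown in this hunk, so the sketch simply keeps the LCA result.
    return global_lca


# Example with made-up values: a 95% hit defers to SINTAX.
print(hybrid_choice(95.0, ['Fungi'], None, ['Fungi', 'Ascomycota']))
```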
@@ -53,9 +53,9 @@ A typically command to assign taxonomy in AMPtk looks like this:
 .. code-block:: none
 
     amptk taxonomy -i input.otu_table.txt -f input.cluster.otus.fa -m input.mapping_file.txt -d ITS2
-    
+
 This command will run the default hybrid method and will use the ITS2 database (``-d ITS2``).  The output of this command will be a tab delimited taxonomy file ``input.taxonomy.txt``, an OTU table contain taxonomy as the last column ``input.otu_table.taxonomy.txt```, a multi-fasta file with taxonomy as OTU headers ``input.otus.taxonomy.fa``, and BIOM file containing taxonomy and metadata ``input.biom``.
-     
+
 Taxonomy Databases
 -------------------------------------
 AMPtk is packaged with 4 reference databases for fungal ITS, fungal LSU, bacterial 16S, and arthropod/chordate mtCOI. These pre-built databases are updated frequently when reference databases are updated and can be downloaded/installed as follows:
@@ -70,7 +70,7 @@ AMPtk is packaged with 4 reference databases for fungal ITS, fungal LSU, bacteri
 
     #update database
     amptk install -i ITS 16S LSU COI --force
-    
+
 Users can also build their own custom databases, with the largest obstacle to overcome being formatting the taxonomy headers for reference databases.  Because AMPtk uses UTAX/SINTAX Bayesian classifiers, it uses the same taxonomy header formatting which looks like the following ``Kingdom(k), Phylum(p), Class(c), Order(o), Family(f), Genus(g), Species(s)``:
 
 .. code-block:: none
@@ -93,21 +93,21 @@ These databases were created from Unite v8.0, first downloading two databases fr
     #Create full length ITS USEARCH Database, convert taxonomy, and create USEARCH database
     amptk database -i UNITE_public_all_02.02.2019.fasta -f ITS1-F -r ITS4 \
         --primer_required none -o ITS --create_db usearch --install --source UNITE:8.0
-        
+
     #create SINTAX database
     amptk database -i sh_general_release_dynamic_all_02.02.2019_dev.fasta \
         -o ITS_SINTAX --create_db sintax -f ITS1-F -r ITS4 --derep_fulllength \
-        --install --source UNITE:8.0 --primer_required none    
+        --install --source UNITE:8.0 --primer_required none
 
     #Create UTAX Databases
     amptk database -i sh_general_release_dynamic_all_02.02.2019_dev.fasta  \
         -o ITS_UTAX --create_db utax -f ITS1-F -r ITS4 \
         --derep_fulllength --install --source UNITE:8.0 --primer_required none
-        
+
     amptk database -i sh_general_release_dynamic_all_02.02.2019_dev.fasta \
         -o ITS1_UTAX -f ITS1-F -r ITS2 --primer_required rev --derep_fulllength \
         --create_db utax --install --subsample 65000 --source UNITE:8.0
-        
+
     amptk database -i sh_general_release_dynamic_all_02.02.2019_dev.fasta \
         -o ITS2_UTAX --create_db utax -f fITS7 -r ITS4 --derep_fulllength \
          --install --source UNITE:8.0 --primer_required for
@@ -128,7 +128,7 @@ Since it can literally take days to download the arthropod dataset, if you'd lik
 
     #combine datasets for usearch
     cat arthropods.bold-reformated.fa chordates.bold-reformated.fa > arth-chord.bold-reformated.fasta
-    
+
     #generate global alignment database
     amptk database -i arth-chord.bold.reformated.fasta -f LCO1490 -r mlCOIintR --primer_required none \
         --derep_fulllength --format off --primer_mismatch 4 -o COI --min_len 200 --create_db usearch \
@@ -140,12 +140,12 @@ The second set of output files from `bold2utax.py` are named with `.BIN-consensu
 
     #combine datasets
     cat arthropods.BIN-consensus.fa chordates.BIN-consensus.fa > arth-chord.bold.BIN-consensus.fasta
- 
+
     #generate SINTAX database
     amptk database -i arth-chord.bold.BIN-consensus.fasta -f LCO1490 -r mlCOIintR --primer_required none \
         --derep_fulllength --format off --primer_mismatch 4 -o COI_SINTAX --min_len 200 --create_db sintax \
         --install --source BOLD:20190219
-        
+
     #generate UTAX database, need to subsample for memory issues with 32 bit usearch and we require rev primer match here
     amptk database -i arth-chord.bold.BIN-consensus.fasta -f LCO1490 -r mlCOIintR --primer_required rev \
         --derep_fulllength --format off --subsample 00000 --primer_mismatch 4 -o COI_UTAX --min_len 200 \
@@ -165,8 +165,8 @@ The fungal 28S database (LSU) was downloaded from `RDP <http://rdp.cme.msu.edu/d
 
     amptk database -i RDP_v8.0_fungi.fa -o LSU_UTAX --format rdp2utax --primer_required none \
         --skip_trimming --create_db utax --derep_fulllength --install --source RDP:8 --subsample 45000
-        
-            
+
+
 To generate a training set for UTAX, the sequences were first dereplicated, and clustered at 97% to get representative sequences for training.  This training set was then converted to a UTAX database:
 
 .. code-block:: none
@@ -181,12 +181,12 @@ This is downloaded from `R. Edgar's website <http://drive5.com/utax/data/rdp_v16
 
     amptk database -i rdp_16s_v16_sp.kingdom.fa -o 16S --format off --create_db usearch \
         --skip_trimming --install --primer_required none --derep_fulllength
-        
+
     amptk database -i rdp_16s_v16_sp.kingdom.fa -o 16S_SINTAX --format off --create_db sintax \
         -f 515FB -r 806RB --install --primer_required for --derep_fulllength
-        
+
     amptk database -i rdp_16s_v16_sp.kingdom.fa -o 16S_UTAX --format off --create_db sintax \
-        -f 515FB -r 806RB --install --primer_required for --derep_fulllength    
+        -f 515FB -r 806RB --install --primer_required for --derep_fulllength
 
 
 Checking Installed Databases
@@ -202,11 +202,10 @@ A simple ``amptk info`` command will show you all the arguments as well as displ
     ------------------------------
     Taxonomy Databases Installed:
     ------------------------------
-     DB_name         DB_type              FASTA originated from                Fwd Primer Rev Primer Records     Date   
+     DB_name         DB_type              FASTA originated from                Fwd Primer Rev Primer Records     Date
      ITS.udb         usearch                   UNITE_public_01.12.2017.fasta   ITS1-F      ITS4     532025  2018-05-01
      ITS1_UTAX.udb   utax  sh_general_release_dynamic_s_01.12.2017_dev.fasta   ITS1-F      ITS2      57293  2018-05-01
      ITS2_UTAX.udb   utax  sh_general_release_dynamic_s_01.12.2017_dev.fasta    fITS7      ITS4      55962  2018-05-01
      ITS_UTAX.udb    utax    sh_general_release_dynamic_01.12.2017_dev.fasta   ITS1-F      ITS4      30580  2018-05-01
     ------------------------------
-   
-    
+