update to 0.2.6 with 454 support
Jon Palmer authored and Jon Palmer committed Oct 31, 2015
1 parent 6b849bd commit ac9ea07
Showing 14 changed files with 270 additions and 162 deletions.
32 changes: 22 additions & 10 deletions README.md
@@ -1,7 +1,7 @@
# UFITS
###USEARCH Fungal ITS Clustering:###

UFITS is a series of scripts to process fungal ITS amplicon data using USEARCH8
UFITS is a series of scripts to process fungal ITS amplicon data using USEARCH8. It can handle Ion Torrent, MiSeq, and 454 data and is cross-platform compatible (works on Windows, Mac, Linux).
___

<img src="https://github.com/nextgenusfs/ufits/blob/master/docs/ufits.png" width="400">
@@ -21,13 +21,14 @@ UFITS comes with a wrapper script for ease of use. On UNIX, you can call it by
```
$ ufits
Usage: ufits <command> <arguments>
version: 0.2.5
version: 0.2.6
Description: UFITS is a package of scripts to process fungal ITS amplicon data. It uses the UPARSE algorithm for clustering
and thus USEARCH8 is a dependency.
Command: ion pre-process Ion Torrent data (find barcodes, remove primers, trim/pad)
illumina pre-process folder of de-multiplexed Illumina data (gunzip, merge PE, remove primers, trim/pad)
454 pre-process Roche 454 (pyrosequencing) data (find barcodes, remove primers, trim/pad)
cluster cluster OTUs (using UPARSE algorithm)
filter OTU table filtering
taxonomy Assign taxonomy to OTUs
@@ -46,7 +47,7 @@ And then by calling one of the commands, you get a help menu for each:
```
$ ufits cluster
Usage: ufits cluster <arguments>
version: 0.2.5
version: 0.2.6
Description: Script is a "wrapper" for the UPARSE algorithm. Modifications include support for a mock spike-in
community. FASTQ quality trimming via expected errors and Dereplication are run in Python which allows
@@ -68,7 +69,6 @@ Arguments: -i, --fastq Input FASTQ file (Required)
--cleanup Remove intermediate files.
Written by Jon Palmer (2015) [email protected]
```
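
The expected-errors metric mentioned in the help text above follows the standard USEARCH definition: each Phred quality score Q converts to an error probability 10^(-Q/10), and a read's expected errors are the sum of those probabilities over all its bases. A minimal illustrative sketch of the calculation (not the UFITS implementation itself):

```
def expected_errors(quals):
    #sum of per-base error probabilities derived from Phred scores
    return sum(10 ** (-q / 10.0) for q in quals)

#a read passes the filter when its expected errors fall at or below the cutoff
passes = expected_errors([30, 30, 20, 15]) <= 1.0
```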

####Processing Ion Torrent Data:####
@@ -81,6 +81,18 @@ ufits ion --barcodes 1,5,24 -i data.fastq -o data

This will find Ion barcodes (1, 5, and 24) and relabel the header with that information (barcodelabel=BC_5;). By default, it will look for all 96 Ion Xpress barcodes; specify the barcodes you used as a comma-separated list. Next the script will find and trim both the forward and reverse primers (default is ITS2 region: fITS7 & ITS4), and then finally will trim or pad with N's to a set length (default: 250 bp). Trimming to the same length is critically important for USEARCH to cluster correctly, and padding with N's after finding the reverse primer keeps short ITS sequences from being discarded. These options can be customized using: `--fwd_primer`, `--rev_primer`, `--trim_len`, etc.
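
Since trimming/padding to a uniform length is so important here, a minimal sketch of the trim-or-pad idea (illustrative only, not the actual UFITS code):

```
def trim_or_pad(seq, trim_len=250):
    #truncate long reads so every read ends up the same length
    if len(seq) >= trim_len:
        return seq[:trim_len]
    #right-pad short reads with N's instead of discarding them
    return seq + 'N' * (trim_len - len(seq))
```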

####Processing Roche 454 Data:####
Data from 454 instruments has the same read structure as Ion Torrent (<barcode><primer>Read<primer>) and thus can be processed very similarly. Provide an SFF file, FASTA + QUAL files, or a FASTQ file, along with a multi-FASTA file containing the barcodes used in the project. The data will then be processed in the same fashion as the Ion Torrent data (see above). For example:

```
#SFF input
ufits 454 -i data.sff --barcode_fasta my454barcodes.fa -o 454project
#FASTA/QUAL input
ufits 454 -i data.fa -q data.qual --barcode_fasta my454barcodes.fa -o 454project
```
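
Conceptually, demultiplexing scans the start of each read for one of the supplied barcodes, assigns the read to the matching sample, and strips the barcode before primer trimming. A rough sketch assuming exact matches at the 5' end (the real script is more tolerant):

```
from Bio import SeqIO

def demux_read(record, barcodes):
    #barcodes maps sample name -> barcode sequence
    seq = str(record.seq)
    for name, bc in barcodes.items():
        if seq.startswith(bc):
            return name, seq[len(bc):]  #strip the barcode
    return None  #unmatched reads are discarded

#load the barcode multi-FASTA from the example above
barcodes = {rec.id: str(rec.seq) for rec in SeqIO.parse('my454barcodes.fa', 'fasta')}
```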


####Processing Illumina MiSeq PE Data:####

Paired-end MiSeq data is typically delivered already de-multiplexed into separate read files that follow a defined Illumina naming structure, like this:
@@ -92,20 +104,20 @@ Paired-end MiSeq data is typically delivered already de-multiplexed into separat
You can processes a folder of Illumina data like this:

```
ufits illumina -i folder_name
ufits illumina -i folder_name -o miseqData
```

This will find all files ending with '.fastq.gz' in the input folder, gunzip the files, and then sequentially process the paired read files. First it will run USEARCH8 `-fastq_mergepairs`; however, since some ITS sequences are too long to overlap, you can rescue longer sequences by recovering the non-merged forward reads. Alternatively, you can utilize only the forward reads (R1) by passing the `--reads forward` argument. Next the forward and reverse primers are removed and the reads are trimmed/padded to a set length for clustering. Finally, the resulting FASTQ files for each of the processed samples are concatenated together into a file called `ufits.demux.fq` that will be used for the next clustering step. The script will also output a text file called `ufits-filenames.txt` that contains a tab-delimited listing of the sample names as well as the [i5] and [i7] index sequences that were used.
This will find all files ending with '.fastq.gz' in the input folder, gunzip the files, and then sequentially process the paired read files. First it will run USEARCH8 `-fastq_mergepairs`; however, since some ITS sequences are too long to overlap, you can rescue longer sequences by recovering the non-merged forward reads. Alternatively, you can utilize only the forward reads (R1) by passing the `--reads forward` argument. Next the forward and reverse primers are removed and the reads are trimmed/padded to a set length for clustering. Finally, the resulting FASTQ files for each of the processed samples are concatenated together into a file called `miseqData.demux.fq` that will be used for the next clustering step. The script will also output a text file called `miseqData-filenames.txt` that contains a tab-delimited listing of the sample names as well as the [i5] and [i7] index sequences that were used.
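
The rescue step amounts to concatenating the merged reads with, optionally, the forward reads that failed to merge; a condensed sketch of that logic (mirroring the `--rescue_forward` change to `bin/ufits-process_illumina_folder.py` further down):

```
import shutil

def combine_reads(merge_out, skip_for, final_out, rescue_forward=True):
    with open(final_out, 'wb') as cat_file:
        #merged pairs always go into the final per-sample file
        shutil.copyfileobj(open(merge_out, 'rU'), cat_file)
        #optionally append forward reads that were too long to overlap
        if rescue_forward:
            shutil.copyfileobj(open(skip_for, 'rU'), cat_file)
```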

####OTU Clustering:####

Now the data from either platform (Ion or Illumina) can be clustered by running the following:
Now the data from either platform (Ion, 454, or Illumina) can be clustered by running the following:

```
ufits cluster -i ufits.demux.fq -o ion --mock BC_5
```

This will run `usearch -fastq_filter` to filter the data based on expected errors, then remove duplicated sequences, sort the output by frequency, and finally run `usearch -cluster_otus`. You can also optionally run UCHIME reference filtering by adding the `--uchime_ref ITS2` option, or change the default clustering radius (97%) by passing the `--pct_otu` option. To process a spike-in control or mock community, specify its barcode name by passing `--mock BC_5`, which will run some additional steps and report stats of the run to STDOUT. Type `-h` for all the available options.
This will quality filter the data based on expected errors, then remove duplicated sequences, sort the output by frequency, and finally run `usearch -cluster_otus`. You can also optionally run UCHIME reference filtering by adding the `--uchime_ref ITS2` option, or change the default clustering radius (97%) by passing the `--pct_otu` option. To process a spike-in control or mock community, specify its barcode name by passing `--mock BC_5`, which will run some additional steps and report stats of the run to STDOUT. Type `-h` for all the available options.
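
Dereplication collapses identical reads and records their abundance in USEARCH-style `size=` annotations (the same annotation the `dereplicate()` fix in `bin/ufits-OTU_cluster.py` below adjusts). A simplified sketch of the idea, not the actual implementation:

```
from collections import Counter
from Bio import SeqIO

def dereplicate(input_fastq):
    #count how many times each exact sequence occurs
    counts = Counter(str(rec.seq) for rec in SeqIO.parse(input_fastq, 'fastq'))
    #USEARCH reads abundance from a trailing size= annotation in the header
    return {seq: 'size=%i;' % n for seq, n in counts.items()}
```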


####OTU Table Filtering####
@@ -133,14 +145,14 @@ You can assign taxonomy to your OTUs using UFITS, either using UTAX from USEARCH
ufits install --install_unite -f GTGARTCATCGAATCTTTG -r TCCTCCGCTTATTGATATGC
```

This command will download the newest UNITE curated ITS databases. It will first download the UNITE curated general release, reformat the UNITE headers to be compatible with UTAX classifier training, trim the data to correspond to the primers you used to generate your amplicons (i.e. ITS2 via fITS7/ITS4), and then finally train the UTAX classifier with these data. The script will then download the UNITE+INSD database, reformat the taxonomy in the headers, trim with primers, and then create a USEARCH database. The resulting databases are stored in the `DB` folder of the `ufits` directory and are named UTAX.udb and USEARCH.udb, respectively.
This command will download the newest UNITE curated ITS databases. It will first download the UNITE curated general release, reformat the UNITE headers to be compatible with UTAX classifier training, trim the data to correspond to the primers you used to generate your amplicons (i.e. ITS2 via fITS7/ITS4), and then finally train the UTAX classifier with these data. The script will then download the UNITE+INSD database, reformat the taxonomy in the headers, and then create a USEARCH database. The resulting databases are stored in the `DB` folder of the `ufits` directory and are named UTAX.udb and USEARCH.udb, respectively.
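
Trimming the reference sequences to your amplicon region conceptually means slicing each database entry between the forward primer and the reverse complement of the reverse primer. A much-simplified sketch assuming exact primer matches (the real pipeline tolerates mismatches and degenerate bases such as the R in fITS7):

```
from Bio.Seq import Seq

def trim_to_amplicon(seq, fwd='GTGARTCATCGAATCTTTG', rev='TCCTCCGCTTATTGATATGC'):
    #the reverse primer appears as its reverse complement on the forward strand
    rev_rc = str(Seq(rev).reverse_complement())
    start = seq.find(fwd)
    end = seq.find(rev_rc)
    if start == -1 or end == -1:
        return seq  #primer not found; keep the full-length sequence
    return seq[start + len(fwd):end]
```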

Issuing the `ufits taxonomy` command will show you which databases have been properly configured, as well as usage instructions:

```
$ ufits taxonomy
Usage: ufits taxonomy <arguments>
version: 0.2.5
version: 0.2.6
Description: Script maps OTUs to taxonomy information and can append to an OTU table (optional). By default the script
uses a hybrid approach, e.g. gets taxonomy information from UTAX as well as BLAST-like hits from the larger
5 changes: 4 additions & 1 deletion bin/ufits-OTU_cluster.py
@@ -71,7 +71,10 @@ def dereplicate(input, output):
for rec in SeqIO.parse(in_file, 'fastq'):
sequence = str(rec.seq)
if sequence not in seqs:
seqs[sequence]=rec.id+'size=1;'
if rec.id.endswith(';'):
seqs[sequence]=rec.id+'size=1;'
else:
seqs[sequence]=rec.id+';size=1;'
else:
count = int(seqs[sequence].split('=')[-1].rstrip(';')) + 1
formated_string = seqs[sequence].rsplit('=', 1)[0]+'='+str(count)+';'
4 changes: 2 additions & 2 deletions bin/ufits-assign_taxonomy.py
@@ -80,7 +80,7 @@ def setupLogging(LOGNAME):


if args.method == 'utax':
if not args.db:
if not args.utax_db:
log.error("No DB specified, exiting")
os._exit(1)
usearch = args.usearch
@@ -128,7 +128,7 @@ def setupLogging(LOGNAME):


elif args.method == 'usearch':
if not args.db:
if not args.usearch_db:
log.error("No DB specified, exiting")
os._exit(1)
usearch = args.usearch
4 changes: 2 additions & 2 deletions bin/ufits-heatmap.py
@@ -15,7 +15,7 @@ def __init__(self,prog):
epilog="""Written by Jon Palmer (2015) [email protected]""",
formatter_class=MyFormatter)

map_colors = ['Accent','BrBG','BuGn','Dark2','PRGn','Pastel1', 'Pastel2','Spectral','bwr','copper','cubehelix','gist_yarg','gist_earth_r','gist_heat','Blues','Reds','Purples','Greens','BuPu','BuGn','OrRd','YlGnBu','YlOrRd','terrain', 'brg', 'gnuplot','bone_r','terrain_r']
map_colors = ['Accent','BrBG','BuGn','Dark2','PRGn','Pastel1', 'Pastel2','Spectral','bwr','copper','cubehelix','gist_yarg','gist_earth_r','gist_heat','Blues','Reds','Purples','Greens','BuPu','BuGn','OrRd','YlGnBu','YlOrRd','terrain', 'brg', 'gnuplot','bone_r','terrain_r','spring','summer','ocean_r']

parser.add_argument('-i','--table', dest="table", required=True, help='OTU Table (Required)')
parser.add_argument('-o','--output', dest="output", required=True, help='Output File (Required)')
@@ -106,7 +106,7 @@ def convert_binary(x):
for line in min_table[1:]:
new_line = []
new_line.insert(0,line[0])
new_line.append([x/y * 100 for x,y in zip(line[1:], sums[1:])])
new_line.append([x/y * 100 for x,y in zip(line[1:], sums[1:])])
new_line = flatten(new_line)
binary_table.append(new_line)
binary_table.insert(0,header)
18 changes: 14 additions & 4 deletions bin/ufits-mock_filter.py
@@ -29,7 +29,8 @@ def __init__(self,prog):

parser.add_argument('-i','--otu_table', required=True, help='Input OTU table')
parser.add_argument('-b','--mock_barcode', help='Barcode of Mock community')
parser.add_argument('-p','--barcode_bleed', dest="barcodebleed", default='0.1', help='Index Bleed filter')
parser.add_argument('-p','--index_bleed', dest="barcodebleed", default='0.1', help='Index Bleed filter')
parser.add_argument('-s','--subtract_threshold', default=0, type=int, help='Threshold to subtract')
parser.add_argument('--mc',default='mock3', help='Multi-FASTA mock community')
parser.add_argument('-d','--delimiter', default='tsv', choices=['csv','tsv'], help='Delimiter')
parser.add_argument('--col_order', dest="col_order", default="naturally", help='Provide comma separated list')
@@ -109,7 +110,7 @@ def setupLogging(LOGNAME):
check = os.stat(args.otu_table).st_size
if check == 0:
log.error("Input OTU table is empty")
os.exit(1)
os._exit(1)

#get base name of files
base = args.otu_table.split(".otu_table")
@@ -160,8 +161,17 @@ def setupLogging(LOGNAME):
line_count += 1
sub_table.append([try_int(x) for x in line]) #convert to integers


#shouldn't use subtract filter without mock community, but adding it here for another purpose
tempTable = []
if args.subtract_threshold != 0:
log.info("Threshold subtracting table by %i" % args.subtract_threshold)
for line in sub_table:
tempTable.append([try_subtract(x,args.subtract_threshold) for x in line]) #subtract threshold
else:
tempTable = sub_table
#Now sort the table by row name naturally
header = sub_table[:1] #pull out header row
header = tempTable[:1] #pull out header row
header = header[0] #just get first list, double check

#convert header names (optional)
Expand All @@ -184,7 +194,7 @@ def setupLogging(LOGNAME):
sortHead.insert(0, sortHead.pop(otu)) #re-insert the OTU header in first column

#sort the table without header row and then add back
sortTable = natsorted(sub_table[1:]) #sort the table without header row
sortTable = natsorted(tempTable[1:]) #sort the table without header row
sortTable.insert(0,header) #now add back header row

#get index of the sorted header list
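The `try_subtract()` helper called in the hunk above isn't shown in this diff; presumably it subtracts the threshold from numeric cells while flooring at zero and passes non-numeric cells (row and column labels) through unchanged. A hypothetical sketch:

```
def try_subtract(value, threshold):
    #subtract from integer counts, never going below zero
    try:
        return max(0, int(value) - threshold)
    #labels and other non-numeric cells pass through untouched
    except (ValueError, TypeError):
        return value
```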
37 changes: 21 additions & 16 deletions bin/ufits-process_illumina_folder.py
@@ -31,6 +31,7 @@ class col:
parser.add_argument('-f','--fwd_primer', dest="F_primer", default='GTGARTCATCGAATCTTTG', help='Forward Primer (fITS7)')
parser.add_argument('-r','--rev_primer', dest="R_primer", default='TCCTCCGCTTATTGATATGC', help='Reverse Primer (ITS4)')
parser.add_argument('--require_primer', dest="primer", default='on', choices=['on', 'off'], help='Require Fwd primer to be present')
parser.add_argument('--rescue_forward', action="store_true", help='Rescue Not-merged forward reads')
parser.add_argument('-n','--name_prefix', dest="prefix", default='R_', help='Prefix for renaming reads')
parser.add_argument('-m','--min_len', default='50', help='Minimum read length to keep')
parser.add_argument('-l','--trim_len', default='250', help='Trim length for reads')
@@ -134,7 +135,7 @@ def setupLogging(LOGNAME):
uniq_names = []
fastq_for = []
fastq_rev = []
map = 'ufits-filenames.txt'
map = args.out + '-filenames.txt'
map_file = open(map, 'w')
map_file.write("Name\t[i5]\t[i7]\tLane\tSet_num\n")
for item in sorted(filenames):
@@ -159,10 +160,14 @@ def setupLogging(LOGNAME):
log.debug("Non-standard names detected, skipping mapping file")
map_file.close()


BarcodeCount = {}
#loop through each set
for i in range(len(fastq_for)):
name = fastq_for[i].split("_")[0]
outname = name + '.fq'
if os.path.isfile(os.path.join(args.out, outname)):
log.info("Output for %s detected, skipping files" % outname)
continue
for_reads = os.path.join(args.input, fastq_for[i])
rev_reads = os.path.join(args.input, fastq_rev[i])
log.info("Working on reads from sample %s" % name)
@@ -197,7 +202,8 @@ def setupLogging(LOGNAME):
final_out = os.path.join(args.out, outname)
cat_file = open(final_out, 'wb')
shutil.copyfileobj(open(merge_out,'rU'), cat_file)
shutil.copyfileobj(open(skip_for,'rU'), cat_file)
if args.rescue_forward:
shutil.copyfileobj(open(skip_for,'rU'), cat_file)
cat_file.close()

#clean and close up intermediate files
@@ -435,6 +441,7 @@ def OnRec(Label, Seq, Qual):
progress.FileDone("%u reads, %u output, %u bad fwd primer, %u rev primer stripped, %u too short" % \
(SeqCount, OutCount, FwdPrimerMismatchCount, RevPrimerStrippedCount, TooShortCount))

BarcodeCount[SampleLabel] = OutCount
log.info("Stats for demuxing: \
\n%10u seqs \
\n%10u fwd primer mismatches (%.1f%% discarded) \
@@ -449,7 +456,7 @@ def OnRec(Label, Seq, Qual):
#Now concatenate all of the demuxed files together
log.info("Concatenating Demuxed Files")

catDemux = 'ufits.demux.fq'
catDemux = args.out + '.demux.fq'
with open(catDemux, 'wb') as outfile:
for filename in glob.glob(os.path.join(args.out,'*.demux.fq')):
if filename == catDemux:
Expand All @@ -461,20 +468,18 @@ def OnRec(Label, Seq, Qual):
num_lines = sum(1 for line in open(catDemux))
total = int(num_lines) / 4
log.info('{0:,}'.format(total) + ' reads processed')
log.info("Output file: %s" % catDemux)

#get file size and issue warning if over 4.0 GB
#now let's count the barcodes found and count the number of times they are found.
log.info("Found %i barcoded samples\n%30s: %s" % (len(BarcodeCount), 'Sample', 'Count'))
for key,value in BarcodeCount.iteritems():
print "%30s: %s" % (key, str(value))

#get file size
filesize = os.path.getsize(catDemux)
readablesize = convertSize(filesize)
log.info("File size: %s" % readablesize)
log.info("Output file: %s (%s)" % (catDemux, readablesize))
print "-------------------------------------------------------"
if filesize >= 4294967296:
if 'win32' in sys.platform:
print "\nWarning, file is larger than 4 GB, you will need USEARCH 64 bit to cluster OTUs"
else:
print col.WARN + "\nWarning, file is larger than 4 GB, you will need USEARCH 64 bit to cluster OTUs" + col.END
if 'win32' in sys.platform:
print "\nExample of next cmd: ufits cluster -i %s -o out --uchime_ref ITS2 --mock <mock BC name> (test data: spike)\n" % (catDemux)
else:
if 'win32' in sys.platform:
print "\nExample of next cmd: ufits cluster -i %s -o out --uchime_ref ITS2 --mock <mock BC name> (test data: spike)\n" % (catDemux)
else:
print col.WARN + "\nExample of next cmd: " + col.END + "ufits cluster -i %s -o out --uchime_ref ITS2 --mock <mock BC name> (test data: spike)\n" % (catDemux)
print col.WARN + "\nExample of next cmd: " + col.END + "ufits cluster -i %s -o out --uchime_ref ITS2 --mock <mock BC name> (test data: spike)\n" % (catDemux)
