update to 0.2.6 with 454 support
Jon Palmer authored and Jon Palmer committed Oct 31, 2015
1 parent 6b849bd commit ac9ea07
Showing 14 changed files with 270 additions and 162 deletions.
32 changes: 22 additions & 10 deletions README.md
@@ -1,7 +1,7 @@
# UFITS
###USEARCH Fungal ITS Clustering:###

UFITS is a series of scripts to process fungal ITS amplicon data using USEARCH8
UFITS is a series of scripts to process fungal ITS amplicon data using USEARCH8. It can handle Ion Torrent, MiSeq, and 454 data and is cross-platform compatible (works on Windows, Mac, Linux).
___

<img src="https://github.com/nextgenusfs/ufits/blob/master/docs/ufits.png" width="400">
@@ -21,13 +21,14 @@ UFITS comes with a wrapper script for ease of use. On UNIX, you can call it by
```
$ ufits
Usage: ufits <command> <arguments>
version: 0.2.5
version: 0.2.6
Description: UFITS is a package of scripts to process fungal ITS amplicon data. It uses the UPARSE algorithm for clustering
and thus USEARCH8 is a dependency.
Command: ion pre-process Ion Torrent data (find barcodes, remove primers, trim/pad)
illumina pre-process folder of de-multiplexed Illumina data (gunzip, merge PE, remove primers, trim/pad)
454 pre-process Roche 454 (pyrosequencing) data (find barcodes, remove primers, trim/pad)
cluster cluster OTUs (using UPARSE algorithm)
filter OTU table filtering
taxonomy Assign taxonomy to OTUs
@@ -46,7 +47,7 @@ And then by calling one of the commands, you get a help menu for each:
```
$ ufits cluster
Usage: ufits cluster <arguments>
version: 0.2.5
version: 0.2.6
Description: Script is a "wrapper" for the UPARSE algorithm. Modifications include support for a mock spike-in
community. FASTQ quality trimming via expected errors and Dereplication are run in Python which allows
@@ -68,7 +69,6 @@ Arguments: -i, --fastq Input FASTQ file (Required)
--cleanup Remove intermediate files.
Written by Jon Palmer (2015) [email protected]
```
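
The expected-errors metric mentioned in the help text above follows the standard USEARCH definition: each Phred quality score Q converts to an error probability 10^(-Q/10), and a read's expected errors are the sum of those probabilities over all its bases. A minimal illustrative sketch of the calculation (not the UFITS implementation itself):

```
def expected_errors(quals):
    #sum of per-base error probabilities derived from Phred scores
    return sum(10 ** (-q / 10.0) for q in quals)

#a read passes the filter when its expected errors fall at or below the cutoff
passes = expected_errors([30, 30, 20, 15]) <= 1.0
```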

####Processing Ion Torrent Data:####
@@ -81,6 +81,18 @@ ufits ion --barcodes 1,5,24 -i data.fastq -o data

This will find Ion barcodes (1, 5, and 24) and relabel the header with that information (barcodelabel=BC_5;). By default, it will look for all 96 Ion Xpress barcodes; specify the barcodes you used as a comma-separated list. Next the script will find and trim both the forward and reverse primers (default is ITS2 region: fITS7 & ITS4), and then finally will trim or pad with N's to a set length (default: 250 bp). Trimming to the same length is critically important for USEARCH to cluster correctly, and padding with N's after finding the reverse primer keeps short ITS sequences from being discarded. These options can be customized using: `--fwd_primer`, `--rev_primer`, `--trim_len`, etc.
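
Since trimming/padding to a uniform length is so important here, a minimal sketch of the trim-or-pad idea (illustrative only, not the actual UFITS code):

```
def trim_or_pad(seq, trim_len=250):
    #truncate long reads so every read ends up the same length
    if len(seq) >= trim_len:
        return seq[:trim_len]
    #right-pad short reads with N's instead of discarding them
    return seq + 'N' * (trim_len - len(seq))
```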

####Processing Roche 454 Data:####
Data from 454 instruments has the same read structure as Ion Torrent (<barcode><primer>Read<primer>) and thus can be processed very similarly. Provide an SFF file, FASTA + QUAL files, or a FASTQ file, along with a multi-FASTA file containing the barcodes used in the project. The data will then be processed in the same fashion as the Ion Torrent data (see above). For example:

```
#SFF input
ufits 454 -i data.sff --barcode_fasta my454barcodes.fa -o 454project
#FASTA/QUAL input
ufits 454 -i data.fa -q data.qual --barcode_fasta my454barcodes.fa -o 454project
```
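
Conceptually, demultiplexing scans the start of each read for one of the supplied barcodes, assigns the read to the matching sample, and strips the barcode before primer trimming. A rough sketch assuming exact matches at the 5' end (the real script is more tolerant):

```
from Bio import SeqIO

def demux_read(record, barcodes):
    #barcodes maps sample name -> barcode sequence
    seq = str(record.seq)
    for name, bc in barcodes.items():
        if seq.startswith(bc):
            return name, seq[len(bc):]  #strip the barcode
    return None  #unmatched reads are discarded

#load the barcode multi-FASTA from the example above
barcodes = {rec.id: str(rec.seq) for rec in SeqIO.parse('my454barcodes.fa', 'fasta')}
```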


####Processing Illumina MiSeq PE Data:####

Paired-end MiSeq data is typically delivered already de-multiplexed into separate read files that follow a defined Illumina naming structure, like this:
@@ -92,20 +104,20 @@ Paired-end MiSeq data is typically delivered already de-multiplexed into separat
You can processes a folder of Illumina data like this:

```
ufits illumina -i folder_name
ufits illumina -i folder_name -o miseqData
```

This will find all files ending with '.fastq.gz' in the input folder, gunzip the files, and then sequentially process the paired read files. First it will run USEARCH8 `-fastq_mergepairs`; however, since some ITS sequences are too long to overlap, you can rescue longer sequences by recovering the non-merged forward reads. Alternatively, you can utilize only the forward reads (R1) by passing the `--reads forward` argument. Next the forward and reverse primers are removed and the reads are trimmed/padded to a set length for clustering. Finally, the resulting FASTQ files for each of the processed samples are concatenated together into a file called `ufits.demux.fq` that will be used for the next clustering step. The script will also output a text file called `ufits-filenames.txt` that contains a tab-delimited listing of the sample names as well as the [i5] and [i7] index sequences that were used.
This will find all files ending with '.fastq.gz' in the input folder, gunzip the files, and then sequentially process the paired read files. First it will run USEARCH8 `-fastq_mergepairs`; however, since some ITS sequences are too long to overlap, you can rescue longer sequences by recovering the non-merged forward reads. Alternatively, you can utilize only the forward reads (R1) by passing the `--reads forward` argument. Next the forward and reverse primers are removed and the reads are trimmed/padded to a set length for clustering. Finally, the resulting FASTQ files for each of the processed samples are concatenated together into a file called `miseqData.demux.fq` that will be used for the next clustering step. The script will also output a text file called `miseqData-filenames.txt` that contains a tab-delimited listing of the sample names as well as the [i5] and [i7] index sequences that were used.
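
The rescue step amounts to concatenating the merged reads with, optionally, the forward reads that failed to merge; a condensed sketch of that logic (mirroring the `--rescue_forward` change to `bin/ufits-process_illumina_folder.py` further down):

```
import shutil

def combine_reads(merge_out, skip_for, final_out, rescue_forward=True):
    with open(final_out, 'wb') as cat_file:
        #merged pairs always go into the final per-sample file
        shutil.copyfileobj(open(merge_out, 'rU'), cat_file)
        #optionally append forward reads that were too long to overlap
        if rescue_forward:
            shutil.copyfileobj(open(skip_for, 'rU'), cat_file)
```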

####OTU Clustering:####

Now the data from either platform (Ion or Illumina) can be clustered by running the following:
Now the data from either platform (Ion, 454, or Illumina) can be clustered by running the following:

```
ufits cluster -i ufits.demux.fq -o ion --mock BC_5
```

This will run `usearch -fastq_filter` to filter the data based on expected errors, then remove duplicated sequences, sort the output by frequency, and finally run `usearch -cluster_otus`. You can also optionally run UCHIME reference filtering by adding the `--uchime_ref ITS2` option, or change the default clustering radius (97%) by passing the `--pct_otu` option. To process a spike-in control or mock community, specify its barcode name by passing `--mock BC_5`, which will run some additional steps and report stats of the run to STDOUT. Type `-h` for all the available options.
This will quality filter the data based on expected errors, then remove duplicated sequences, sort the output by frequency, and finally run `usearch -cluster_otus`. You can also optionally run UCHIME reference filtering by adding the `--uchime_ref ITS2` option, or change the default clustering radius (97%) by passing the `--pct_otu` option. To process a spike-in control or mock community, specify its barcode name by passing `--mock BC_5`, which will run some additional steps and report stats of the run to STDOUT. Type `-h` for all the available options.
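
Dereplication collapses identical reads and records their abundance in USEARCH-style `size=` annotations (the same annotation the `dereplicate()` fix in `bin/ufits-OTU_cluster.py` below adjusts). A simplified sketch of the idea, not the actual implementation:

```
from collections import Counter
from Bio import SeqIO

def dereplicate(input_fastq):
    #count how many times each exact sequence occurs
    counts = Counter(str(rec.seq) for rec in SeqIO.parse(input_fastq, 'fastq'))
    #USEARCH reads abundance from a trailing size= annotation in the header
    return {seq: 'size=%i;' % n for seq, n in counts.items()}
```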


####OTU Table Filtering####
@@ -133,14 +145,14 @@ You can assign taxonomy to your OTUs using UFITS, either using UTAX from USEARCH
ufits install --install_unite -f GTGARTCATCGAATCTTTG -r TCCTCCGCTTATTGATATGC
```

This command will download the newest UNITE curated ITS databases. It will first download the UNITE curated general release, reformat the UNITE headers to be compatible with UTAX classifier training, trim the data to correspond to the primers you used to generate your amplicons (i.e. ITS2 via fITS7/ITS4), and then finally train the UTAX classifier with these data. The script will then download the UNITE+INSD database, reformat the taxonomy in the headers, trim with primers, and then create a USEARCH database. The resulting databases are stored in the `DB` folder of the `ufits` directory and are named UTAX.udb and USEARCH.udb, respectively.
This command will download the newest UNITE curated ITS databases. It will first download the UNITE curated general release, reformat the UNITE headers to be compatible with UTAX classifier training, trim the data to correspond to the primers you used to generate your amplicons (i.e. ITS2 via fITS7/ITS4), and then finally train the UTAX classifier with these data. The script will then download the UNITE+INSD database, reformat the taxonomy in the headers, and then create a USEARCH database. The resulting databases are stored in the `DB` folder of the `ufits` directory and are named UTAX.udb and USEARCH.udb, respectively.
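
Trimming the reference sequences to your amplicon region conceptually means slicing each database entry between the forward primer and the reverse complement of the reverse primer. A much-simplified sketch assuming exact primer matches (the real pipeline tolerates mismatches and degenerate bases such as the R in fITS7):

```
from Bio.Seq import Seq

def trim_to_amplicon(seq, fwd='GTGARTCATCGAATCTTTG', rev='TCCTCCGCTTATTGATATGC'):
    #the reverse primer appears as its reverse complement on the forward strand
    rev_rc = str(Seq(rev).reverse_complement())
    start = seq.find(fwd)
    end = seq.find(rev_rc)
    if start == -1 or end == -1:
        return seq  #primer not found; keep the full-length sequence
    return seq[start + len(fwd):end]
```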

Issuing the `ufits taxonomy` command will show you which databases have been properly configured, as well as usage instructions:

```
$ ufits taxonomy
Usage: ufits taxonomy <arguments>
version: 0.2.5
version: 0.2.6
Description: Script maps OTUs to taxonomy information and can append to an OTU table (optional). By default the script
uses a hybrid approach, e.g. gets taxonomy information from UTAX as well as BLAST-like hits from the larger
5 changes: 4 additions & 1 deletion bin/ufits-OTU_cluster.py
@@ -71,7 +71,10 @@ def dereplicate(input, output):
for rec in SeqIO.parse(in_file, 'fastq'):
sequence = str(rec.seq)
if sequence not in seqs:
seqs[sequence]=rec.id+'size=1;'
if rec.id.endswith(';'):
seqs[sequence]=rec.id+'size=1;'
else:
seqs[sequence]=rec.id+';size=1;'
else:
count = int(seqs[sequence].split('=')[-1].rstrip(';')) + 1
formated_string = seqs[sequence].rsplit('=', 1)[0]+'='+str(count)+';'
4 changes: 2 additions & 2 deletions bin/ufits-assign_taxonomy.py
@@ -80,7 +80,7 @@ def setupLogging(LOGNAME):


if args.method == 'utax':
if not args.db:
if not args.utax_db:
log.error("No DB specified, exiting")
os._exit(1)
usearch = args.usearch
@@ -128,7 +128,7 @@ def setupLogging(LOGNAME):


elif args.method == 'usearch':
if not args.db:
if not args.usearch_db:
log.error("No DB specified, exiting")
os._exit(1)
usearch = args.usearch
4 changes: 2 additions & 2 deletions bin/ufits-heatmap.py
@@ -15,7 +15,7 @@ def __init__(self,prog):
epilog="""Written by Jon Palmer (2015) [email protected]""",
formatter_class=MyFormatter)

map_colors = ['Accent','BrBG','BuGn','Dark2','PRGn','Pastel1', 'Pastel2','Spectral','bwr','copper','cubehelix','gist_yarg','gist_earth_r','gist_heat','Blues','Reds','Purples','Greens','BuPu','BuGn','OrRd','YlGnBu','YlOrRd','terrain', 'brg', 'gnuplot','bone_r','terrain_r']
map_colors = ['Accent','BrBG','BuGn','Dark2','PRGn','Pastel1', 'Pastel2','Spectral','bwr','copper','cubehelix','gist_yarg','gist_earth_r','gist_heat','Blues','Reds','Purples','Greens','BuPu','BuGn','OrRd','YlGnBu','YlOrRd','terrain', 'brg', 'gnuplot','bone_r','terrain_r','spring','summer','ocean_r']

parser.add_argument('-i','--table', dest="table", required=True, help='OTU Table (Required)')
parser.add_argument('-o','--output', dest="output", required=True, help='Output File (Required)')
@@ -106,7 +106,7 @@ def convert_binary(x):
for line in min_table[1:]:
new_line = []
new_line.insert(0,line[0])
new_line.append([x/y * 100 for x,y in zip(line[1:], sums[1:])])
new_line.append([x/y * 100 for x,y in zip(line[1:], sums[1:])])
new_line = flatten(new_line)
binary_table.append(new_line)
binary_table.insert(0,header)
18 changes: 14 additions & 4 deletions bin/ufits-mock_filter.py
@@ -29,7 +29,8 @@ def __init__(self,prog):

parser.add_argument('-i','--otu_table', required=True, help='Input OTU table')
parser.add_argument('-b','--mock_barcode', help='Barcode of Mock community')
parser.add_argument('-p','--barcode_bleed', dest="barcodebleed", default='0.1', help='Index Bleed filter')
parser.add_argument('-p','--index_bleed', dest="barcodebleed", default='0.1', help='Index Bleed filter')
parser.add_argument('-s','--subtract_threshold', default=0, type=int, help='Threshold to subtract')
parser.add_argument('--mc',default='mock3', help='Multi-FASTA mock community')
parser.add_argument('-d','--delimiter', default='tsv', choices=['csv','tsv'], help='Delimiter')
parser.add_argument('--col_order', dest="col_order", default="naturally", help='Provide comma separated list')
@@ -109,7 +110,7 @@ def setupLogging(LOGNAME):
check = os.stat(args.otu_table).st_size
if check == 0:
log.error("Input OTU table is empty")
os.exit(1)
os._exit(1)

#get base name of files
base = args.otu_table.split(".otu_table")
@@ -160,8 +161,17 @@ def setupLogging(LOGNAME):
line_count += 1
sub_table.append([try_int(x) for x in line]) #convert to integers


#shouldn't use subtract filter without mock community, but adding it here for another purpose
tempTable = []
if args.subtract_threshold != 0:
log.info("Threshold subtracting table by %i" % args.subtract_threshold)
for line in sub_table:
tempTable.append([try_subtract(x,args.subtract_threshold) for x in line]) #subtract threshold
else:
tempTable = sub_table
#Now sort the table by row name naturally
header = sub_table[:1] #pull out header row
header = tempTable[:1] #pull out header row
header = header[0] #just get first list, double check

#convert header names (optional)
Expand All @@ -184,7 +194,7 @@ def setupLogging(LOGNAME):
sortHead.insert(0, sortHead.pop(otu)) #re-insert the OTU header in first column

#sort the table without header row and then add back
sortTable = natsorted(sub_table[1:]) #sort the table without header row
sortTable = natsorted(tempTable[1:]) #sort the table without header row
sortTable.insert(0,header) #now add back header row

#get index of the sorted header list
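The `try_subtract()` helper called in the hunk above isn't shown in this diff; presumably it subtracts the threshold from numeric cells while flooring at zero and passes non-numeric cells (row and column labels) through unchanged. A hypothetical sketch:

```
def try_subtract(value, threshold):
    #subtract from integer counts, never going below zero
    try:
        return max(0, int(value) - threshold)
    #labels and other non-numeric cells pass through untouched
    except (ValueError, TypeError):
        return value
```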
37 changes: 21 additions & 16 deletions bin/ufits-process_illumina_folder.py
@@ -31,6 +31,7 @@ class col:
parser.add_argument('-f','--fwd_primer', dest="F_primer", default='GTGARTCATCGAATCTTTG', help='Forward Primer (fITS7)')
parser.add_argument('-r','--rev_primer', dest="R_primer", default='TCCTCCGCTTATTGATATGC', help='Reverse Primer (ITS4)')
parser.add_argument('--require_primer', dest="primer", default='on', choices=['on', 'off'], help='Require Fwd primer to be present')
parser.add_argument('--rescue_forward', action="store_true", help='Rescue Not-merged forward reads')
parser.add_argument('-n','--name_prefix', dest="prefix", default='R_', help='Prefix for renaming reads')
parser.add_argument('-m','--min_len', default='50', help='Minimum read length to keep')
parser.add_argument('-l','--trim_len', default='250', help='Trim length for reads')
@@ -134,7 +135,7 @@ def setupLogging(LOGNAME):
uniq_names = []
fastq_for = []
fastq_rev = []
map = 'ufits-filenames.txt'
map = args.out + '-filenames.txt'
map_file = open(map, 'w')
map_file.write("Name\t[i5]\t[i7]\tLane\tSet_num\n")
for item in sorted(filenames):
@@ -159,10 +160,14 @@ def setupLogging(LOGNAME):
log.debug("Non-standard names detected, skipping mapping file")
map_file.close()


BarcodeCount = {}
#loop through each set
for i in range(len(fastq_for)):
name = fastq_for[i].split("_")[0]
outname = name + '.fq'
if os.path.isfile(os.path.join(args.out, outname)):
log.info("Output for %s detected, skipping files" % outname)
continue
for_reads = os.path.join(args.input, fastq_for[i])
rev_reads = os.path.join(args.input, fastq_rev[i])
log.info("Working on reads from sample %s" % name)
@@ -197,7 +202,8 @@ def setupLogging(LOGNAME):
final_out = os.path.join(args.out, outname)
cat_file = open(final_out, 'wb')
shutil.copyfileobj(open(merge_out,'rU'), cat_file)
shutil.copyfileobj(open(skip_for,'rU'), cat_file)
if args.rescue_forward:
shutil.copyfileobj(open(skip_for,'rU'), cat_file)
cat_file.close()

#clean and close up intermediate files
@@ -435,6 +441,7 @@ def OnRec(Label, Seq, Qual):
progress.FileDone("%u reads, %u output, %u bad fwd primer, %u rev primer stripped, %u too short" % \
(SeqCount, OutCount, FwdPrimerMismatchCount, RevPrimerStrippedCount, TooShortCount))

BarcodeCount[SampleLabel] = OutCount
log.info("Stats for demuxing: \
\n%10u seqs \
\n%10u fwd primer mismatches (%.1f%% discarded) \
@@ -449,7 +456,7 @@ def OnRec(Label, Seq, Qual):
#Now concatenate all of the demuxed files together
log.info("Concatenating Demuxed Files")

catDemux = 'ufits.demux.fq'
catDemux = args.out + '.demux.fq'
with open(catDemux, 'wb') as outfile:
for filename in glob.glob(os.path.join(args.out,'*.demux.fq')):
if filename == catDemux:
Expand All @@ -461,20 +468,18 @@ def OnRec(Label, Seq, Qual):
num_lines = sum(1 for line in open(catDemux))
total = int(num_lines) / 4
log.info('{0:,}'.format(total) + ' reads processed')
log.info("Output file: %s" % catDemux)

#get file size and issue warning if over 4.0 GB
#now let's count the barcodes found and count the number of times they are found.
log.info("Found %i barcoded samples\n%30s: %s" % (len(BarcodeCount), 'Sample', 'Count'))
for key,value in BarcodeCount.iteritems():
print "%30s: %s" % (key, str(value))

#get file size
filesize = os.path.getsize(catDemux)
readablesize = convertSize(filesize)
log.info("File size: %s" % readablesize)
log.info("Output file: %s (%s)" % (catDemux, readablesize))
print "-------------------------------------------------------"
if filesize >= 4294967296:
if 'win32' in sys.platform:
print "\nWarning, file is larger than 4 GB, you will need USEARCH 64 bit to cluster OTUs"
else:
print col.WARN + "\nWarning, file is larger than 4 GB, you will need USEARCH 64 bit to cluster OTUs" + col.END
if 'win32' in sys.platform:
print "\nExample of next cmd: ufits cluster -i %s -o out --uchime_ref ITS2 --mock <mock BC name> (test data: spike)\n" % (catDemux)
else:
if 'win32' in sys.platform:
print "\nExample of next cmd: ufits cluster -i %s -o out --uchime_ref ITS2 --mock <mock BC name> (test data: spike)\n" % (catDemux)
else:
print col.WARN + "\nExample of next cmd: " + col.END + "ufits cluster -i %s -o out --uchime_ref ITS2 --mock <mock BC name> (test data: spike)\n" % (catDemux)
print col.WARN + "\nExample of next cmd: " + col.END + "ufits cluster -i %s -o out --uchime_ref ITS2 --mock <mock BC name> (test data: spike)\n" % (catDemux)
