This workflow replicates the QA protocol implemented at JGI for Illumina reads and uses the program “rqcfilter2” from BBTools(38:44), which implements the protocol as a pipeline.
RQCFilterData Database: a 106G tar file that includes reference datasets of artifacts, adapters, contaminants, the phiX genome, and host genomes.
Prepare the Database
mkdir -p refdata
tar xvzf RQCFilterData.tgz -C refdata
rm RQCFilterData.tgz
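The three commands above delete the archive even if extraction fails partway, which is costly for a 106G download. A minimal sketch of a more defensive variant follows; the function name `prepare_refdata` is hypothetical and not part of the workflow.

```shell
# Hypothetical helper (not part of the workflow): extract the database
# archive into a target directory and delete the archive only on success.
prepare_refdata() {
    rqc_archive="$1"   # e.g. RQCFilterData.tgz
    rqc_dest="$2"      # e.g. refdata
    if [ ! -f "$rqc_archive" ]; then
        echo "archive not found: $rqc_archive" >&2
        return 1
    fi
    mkdir -p "$rqc_dest" || return 1
    # Remove the archive only when tar exits successfully.
    if tar xzf "$rqc_archive" -C "$rqc_dest"; then
        rm -- "$rqc_archive"
    else
        echo "extraction failed; keeping $rqc_archive" >&2
        return 1
    fi
}

# Usage (left commented out so that sourcing this file has no side effects):
# prepare_refdata RQCFilterData.tgz refdata
```

If extraction fails, the archive is left in place so the step can be retried without downloading it again.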
Description of the files:
.wdl file: the WDL file for workflow definition
.json file: the example input for the workflow
.conf file: the conf file for running the workflow
.sh file: the shell script for running the example workflow
- database path
- fastq (Illumina paired-end interleaved fastq)
- output path
- memory (optional), e.g. "jgi_rqcfilter.memory": "35G"
- threads (optional), e.g. "jgi_rqcfilter.threads": "16"
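The required inputs above can be assembled into a JSON input file. The sketch below assumes that `jgi_rqcfilter.input_files` takes an array of fastq paths and passes a single file; the function name `write_inputs` and the paths in the usage line are hypothetical placeholders.

```shell
# Hypothetical helper: print a workflow input JSON for one fastq file.
# Assumes the paths contain no double quotes or backslashes, since no
# JSON escaping is applied.
write_inputs() {
    rqc_db="$1"       # database path
    rqc_fastq="$2"    # interleaved paired-end fastq
    rqc_outdir="$3"   # output path
    cat <<EOF
{
  "jgi_rqcfilter.database": "$rqc_db",
  "jgi_rqcfilter.input_files": ["$rqc_fastq"],
  "jgi_rqcfilter.outdir": "$rqc_outdir"
}
EOF
}

# Usage with placeholder paths (left commented out):
# write_inputs /path/to/refdata /path/to/reads.fastq.gz /path/to/output > input.json
```

The optional memory and threads keys can be added to the generated file in the same key/value form.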
"jgi_rqcfilter.database": "/global/cfs/projectdirs/m3408/aim2/database",
"jgi_rqcfilter.input_files": [
"jgi_rqcfilter.outdir": "/global/cfs/cdirs/m3408/ficus_rqcfiltered"
The output is a single directory named after the prefix of the fastq input file. It contains the output files, including statistics, a status log, and a shell script to reproduce the steps.
The main QC fastq output is the file ending in .anqdpht.fastq.gz in the listing below:
|-- 8434.1.102069.ACAGTG.anqdpht.fastq.gz
|-- filterStats.txt
|-- filterStats.json
|-- filterStats2.txt
|-- adaptersDetected.fa
|-- spikein.fq.gz
|-- status.log
|-- ...
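After a run, it can be useful to confirm that the files listed above were actually produced. The following is a minimal sketch, assuming the filtered reads are named `<prefix>.anqdpht.fastq.gz` as in the listing; the function name `check_outputs` is hypothetical, and only a subset of the listed files is checked.

```shell
# Hypothetical helper: verify that the key output files exist and are
# non-empty in a per-sample output directory.
check_outputs() {
    rqc_dir="$1"      # output directory named after the fastq prefix
    rqc_prefix="$2"   # e.g. 8434.1.102069.ACAGTG
    rqc_missing=0
    for rqc_f in "$rqc_prefix.anqdpht.fastq.gz" filterStats.txt \
                 filterStats.json filterStats2.txt status.log; do
        if [ ! -s "$rqc_dir/$rqc_f" ]; then
            echo "missing or empty: $rqc_dir/$rqc_f" >&2
            rqc_missing=$((rqc_missing + 1))
        fi
    done
    # Succeed only if every checked file was present and non-empty.
    [ "$rqc_missing" -eq 0 ]
}

# Usage with a placeholder output path (left commented out):
# check_outputs /path/to/output/8434.1.102069.ACAGTG 8434.1.102069.ACAGTG
```

The function returns a non-zero status and reports each missing file on standard error, so it can be used as a guard in a wrapper script.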