The bamQC workflow collects a range of alignment quality metrics, computed with third-party software tools and some custom code, and writes the results in JSON format. The output also includes metadata, such as the instrument and lane names, an estimate of ribosomal RNA (rRNA) contamination, and other information. To reduce runtime, bamQC supports downsampling and splits some tasks by chromosome.
java -jar cromwell.jar run bamQC.wdl --inputs inputs.json
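The parameters in the tables below are supplied through inputs.json. The following is a minimal sketch of how such a file might be written, assuming the workflow name bamQC; every value shown is a placeholder, and the inputGroups entries must follow the InputGroup struct defined in bamQC.wdl (not reproduced here).

```python
import json

# Minimal sketch of an inputs file for the Cromwell command above.
# Every value is a placeholder: the paths, version string, and metadata keys
# are assumptions, and inputGroups must be filled with objects matching the
# InputGroup struct declared in bamQC.wdl (not shown here).
inputs = {
    "bamQC.mode": "lane_level",                                # or "call_ready"
    "bamQC.inputGroups": [],                                   # fill in per the InputGroup struct
    "bamQC.metadata": {"instrument": "NB551051", "lane": "1"}, # hypothetical metadata entries
    "bamQC.bamQCMetrics.refFasta": "/path/to/reference.fa",
    "bamQC.bamQCMetrics.refSizesBed": "/path/to/chromosome_sizes.bed",
    "bamQC.bamQCMetrics.workflowVersion": "x.y.z",
    "bamQC.outputFileNamePrefix": "bamQC",
}

with open("inputs.json", "w") as out:
    json.dump(inputs, out, indent=2)
```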
Parameter | Value | Description |
---|---|---|
inputGroups | Array[InputGroup] | Array of objects describing sets of bams to merge together and on which to compute QC metrics |
metadata | Map[String,String] | JSON file containing metadata |
mode | String | Running mode for the workflow; the only allowed values are 'lane_level' and 'call_ready' |
bamQCMetrics.refFasta | String | Path to human genome FASTA reference |
bamQCMetrics.refSizesBed | String | Path to human genome BED reference with chromosome sizes |
bamQCMetrics.workflowVersion | String | Workflow version string |
Parameter | Value | Default | Description |
---|---|---|---|
outputFileNamePrefix | String | "bamQC" | Prefix for output files |
intervalsToParallelizeByString | String | "chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY,chrM" | Comma separated list of intervals to split by (e.g. chr1,chr2,chr3,chr4). |
Parameter | Value | Default | Description |
---|---|---|---|
splitStringToArray.lineSeparator | String | "," | Interval group separator - these are the intervals to split by. |
splitStringToArray.recordSeparator | String | "+" | Record separator - a delimiter for joining records |
splitStringToArray.jobMemory | Int | 1 | Memory allocated to job (in GB). |
splitStringToArray.cores | Int | 1 | The number of cores to allocate to the job. |
splitStringToArray.timeout | Int | 1 | Maximum amount of time (in hours) the task can run for. |
splitStringToArray.modules | String | "" | Environment module name and version to load (space separated) before command execution. |
getChrCoefficient.memory | Int | 2 | Memory allocated for this job |
getChrCoefficient.timeout | Int | 1 | Hours before task timeout |
getChrCoefficient.modules | String | "samtools/1.14" | Names and versions of modules to load |
preFilter.filterFlags | Int | 260 | Samtools filter flags to apply. |
preFilter.minMapQuality | Int? | None | Minimum alignment quality to pass filter |
preFilter.filterAdditionalParams | String? | None | Additional parameters to pass to samtools. |
preFilter.modules | String | "samtools/1.14" | Required environment modules |
preFilter.jobMemory | Int | 6 | Memory allocated for this job |
preFilter.minMemory | Int | 2 | Minimum amount of RAM allocated to the task |
preFilter.threads | Int | 4 | Requested CPU threads |
preFilter.timeout | Int | 4 | Hours before task timeout |
mergeSplitByIntervalFiles.suffix | String | ".merge" | Suffix to use for merged bam |
mergeSplitByIntervalFiles.jobMemory | Int | 24 | Memory allocated to job (in GB). |
mergeSplitByIntervalFiles.overhead | Int | 6 | Java overhead memory (in GB). jobMemory - overhead == java Xmx/heap memory. |
mergeSplitByIntervalFiles.cores | Int | 1 | The number of cores to allocate to the job. |
mergeSplitByIntervalFiles.timeout | Int | 24 | Maximum amount of time (in hours) the task can run for. |
mergeSplitByIntervalFiles.modules | String | "gatk/4.1.6.0" | Environment module name and version to load (space separated) before command execution. |
mergeFiles.suffix | String | ".merge" | Suffix to use for merged bam |
mergeFiles.jobMemory | Int | 24 | Memory allocated to job (in GB). |
mergeFiles.overhead | Int | 6 | Java overhead memory (in GB). jobMemory - overhead == java Xmx/heap memory. |
mergeFiles.cores | Int | 1 | The number of cores to allocate to the job. |
mergeFiles.timeout | Int | 24 | Maximum amount of time (in hours) the task can run for. |
mergeFiles.modules | String | "gatk/4.1.6.0" | Environment module name and version to load (space separated) before command execution. |
filter.minQuality | Int | 30 | Minimum alignment quality to pass filter |
filter.modules | String | "samtools/1.14" | Required environment modules |
filter.jobMemory | Int | 16 | Memory allocated for this job |
filter.threads | Int | 4 | Requested CPU threads |
filter.timeout | Int | 4 | Hours before task timeout |
updateMetadata.modules | String | "python/3.6" | Required environment modules |
updateMetadata.jobMemory | Int | 16 | Memory allocated for this job |
updateMetadata.threads | Int | 4 | Requested CPU threads |
updateMetadata.timeout | Int | 4 | Hours before task timeout |
findDownsampleParams.targetReads | Int | 100000 | Desired number of reads in downsampled output |
findDownsampleParams.minReadsAbsolute | Int | 10000 | Minimum value of targetReads to allow pre-downsampling |
findDownsampleParams.minReadsRelative | Int | 2 | Minimum value of (inputReads)/(targetReads) to allow pre-downsampling |
findDownsampleParams.precision | Int | 8 | Number of decimal places in fraction for pre-downsampling |
findDownsampleParams.preDSMultiplier | Float | 1.5 | Determines target size for pre-downsampled set (if any). Must have (preDSMultiplier) < (minReadsRelative). |
findDownsampleParams.modules | String | "python/3.6" | Required environment modules |
findDownsampleParams.jobMemory | Int | 16 | Memory allocated for this job |
findDownsampleParams.threads | Int | 4 | Requested CPU threads |
findDownsampleParams.timeout | Int | 4 | Hours before task timeout |
findDownsampleParamsMarkDup.threshold | Int | 10000000 | Minimum number of reads to conduct downsampling |
findDownsampleParamsMarkDup.chromosomes | Array[String] | ["chr12", "chr13", "chrXII", "chrXIII"] | Array of chromosome identifiers for downsampled subset |
findDownsampleParamsMarkDup.baseInterval | Int | 15000 | Base width of interval in each chromosome, for very large BAMs |
findDownsampleParamsMarkDup.intervalStart | Int | 100000 | Start of interval in each chromosome, for very large BAMs |
findDownsampleParamsMarkDup.customRegions | String | "" | Custom downsample regions; overrides chromosome and interval parameters |
findDownsampleParamsMarkDup.modules | String | "python/3.6" | Required environment modules |
findDownsampleParamsMarkDup.jobMemory | Int | 16 | Memory allocated for this job |
findDownsampleParamsMarkDup.threads | Int | 4 | Requested CPU threads |
findDownsampleParamsMarkDup.timeout | Int | 4 | Hours before task timeout |
downsample.downsampleSuffix | String | "downsampled.bam" | Suffix for output file |
downsample.randomSeed | Int | 42 | Random seed for pre-downsampling (if any) |
downsample.modules | String | "samtools/1.14" | Required environment modules |
downsample.jobMemory | Int | 16 | Memory allocated for this job |
downsample.threads | Int | 4 | Requested CPU threads |
downsample.timeout | Int | 4 | Hours before task timeout |
downsampleRegion.modules | String | "samtools/1.14" | Required environment modules |
downsampleRegion.jobMemory | Int | 16 | Memory allocated for this job |
downsampleRegion.threads | Int | 4 | Requested CPU threads |
downsampleRegion.timeout | Int | 4 | Hours before task timeout |
markDuplicates.opticalDuplicatePixelDistance | Int | 100 | Maximum offset between optical duplicate clusters |
markDuplicates.picardMaxMemMb | Int | 6000 | Memory requirement in MB for running Picard JAR |
markDuplicates.modules | String | "picard/2.21.2" | Required environment modules |
markDuplicates.jobMemory | Int | 16 | Memory allocated for this job |
markDuplicates.threads | Int | 4 | Requested CPU threads |
markDuplicates.timeout | Int | 4 | Hours before task timeout |
bamQCMetrics.normalInsertMax | Int | 1500 | Maximum of expected insert size range |
bamQCMetrics.modules | String | "bam-qc-metrics/0.2.5" | Required environment modules |
bamQCMetrics.jobMemory | Int | 16 | Memory allocated for this job |
bamQCMetrics.threads | Int | 4 | Requested CPU threads |
bamQCMetrics.timeout | Int | 4 | Hours before task timeout |
runMosdepth.modules | String | "mosdepth/0.2.9" | Required environment modules |
runMosdepth.jobMemory | Int | 16 | Memory allocated for this job |
runMosdepth.threads | Int | 4 | Requested CPU threads |
runMosdepth.timeout | Int | 4 | Hours before task timeout |
cumulativeDistToHistogram.modules | String | "python/3.6" | Required environment modules |
cumulativeDistToHistogram.jobMemory | Int | 8 | Memory allocated for this job |
cumulativeDistToHistogram.threads | Int | 4 | Requested CPU threads |
cumulativeDistToHistogram.timeout | Int | 1 | Hours before task timeout |
collateResults.modules | String | "python/3.6" | Required environment modules |
collateResults.jobMemory | Int | 8 | Memory allocated for this job |
collateResults.threads | Int | 4 | Requested CPU threads |
collateResults.timeout | Int | 1 | Hours before task timeout |
Output | Type | Description | Labels |
---|---|---|---|
result | File | JSON file containing the metrics and metadata described in https://github.com/oicr-gsi/bam-qc-metrics/blob/master/metrics.md | vidarr_label: result |
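Because the result is plain JSON, it can be inspected directly once the workflow finishes. A small sketch, assuming a file named bamQC.json produced with the default output prefix:

```python
import json

# Inspect the collated output; "bamQC.json" is a placeholder name based on
# the default outputFileNamePrefix.
with open("bamQC.json") as f:
    result = json.load(f)

# Top-level keys include the bam-qc-metrics fields, the merged metadata, and
# the "coverage histogram" added by the workflow (see the collation command below).
print(sorted(result.keys()))
print(result.get("coverage histogram", {}))
```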
This section lists command(s) run by the bamQC workflow
CHROM_LEN=$(samtools view -H ~{bamFile} | grep ^@SQ | grep -v _ | grep -w ~{chromosome} | cut -f 3 | sed 's/LN://')
LARGEST=$(samtools view -H ~{bamFile} | grep ^@SQ | grep -v _ | cut -f 3 | sed 's/LN://' | sort -n | tail -n 1)
echo | awk -v chrom_len=$CHROM_LEN -v largest=$LARGEST '{print int((chrom_len/largest + 0.1) * 10)/10}'
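The awk expression above divides the length of the requested chromosome by the length of the largest chromosome, adds a 0.1 offset, and truncates the result to one decimal place. A small Python sketch of the same arithmetic, using illustrative chromosome lengths:

```python
# Same arithmetic as the awk one-liner above, with illustrative lengths
# (both values here are hypothetical, not read from a BAM header).
chrom_len = 133_797_422   # length of the requested chromosome
largest = 248_956_422     # length of the largest chromosome

coefficient = int((chrom_len / largest + 0.1) * 10) / 10
print(coefficient)        # 0.6 for these lengths
```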
run_bam_qc.py \
-b ~{bamFile} \
-d ~{markDuplicates} \
--debug \
-i ~{normalInsertMax} \
-o ~{resultName} \
-r ~{refFasta} \
-t ~{refSizesBed} \
-T . \
-w ~{workflowVersion} \
~{dsInput}
python3 <<CODE
import json
data = json.loads(open("~{bamQCMetricsResult}").read())
histogram = json.loads(open("~{histogram}").read())
data["coverage histogram"] = histogram
metadata = json.loads(open("~{metadata}").read())
for key in metadata.keys():
    data[key] = metadata[key]
out = open("~{outputFileName}", "w")
json.dump(data, out, sort_keys=True)
out.close()
CODE
python3 <<CODE
import csv, json
summary = open("~{summary}").readlines()
globalDist = open("~{globalDist}").readlines()
# read chromosome lengths from the summary
summaryReader = csv.reader(summary, delimiter="\t")
lengthByChr = {}
for row in summaryReader:
    if row[0] == 'chrom' or row[0] == 'total':
        continue # skip initial header row, and final total row
    lengthByChr[row[0]] = int(row[1])
chromosomes = sorted(lengthByChr.keys())
# read the cumulative distribution for each chromosome
globalReader = csv.reader(globalDist, delimiter="\t")
cumDist = {}
for k in chromosomes:
    cumDist[k] = {}
for row in globalReader:
    if row[0]=="total":
        continue
    cumDist[row[0]][int(row[1])] = float(row[2])
# convert the cumulative distributions to non-cumulative and populate histogram
# if the input BAM is empty, chromosomes and histogram will also be empty
histogram = {}
for k in chromosomes:
    depths = sorted(cumDist[k].keys())
    dist = {}
    for i in range(len(depths)-1):
        depth = depths[i]
        nextDepth = depths[i+1]
        dist[depth] = cumDist[k][depth] - cumDist[k][nextDepth]
    maxDepth = max(depths)
    dist[maxDepth] = cumDist[k][maxDepth]
    # now find the number of loci at each depth of coverage to construct the histogram
    for depth in depths:
        loci = int(round(dist[depth]*lengthByChr[k], 0))
        histogram[depth] = histogram.get(depth, 0) + loci
# if histogram is non-empty, fill in zero values for missing depths
for i in range(max(histogram.keys(), default=0)):
    if i not in histogram:
        histogram[i] = 0
out = open("~{outFileName}", "w")
json.dump(histogram, out, sort_keys=True)
out.close()
CODE
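As a worked example of the conversion above: if mosdepth reports a cumulative distribution of {0: 1.0, 1: 0.6, 2: 0.2} for one chromosome (100% of bases at depth >= 0, 60% at depth >= 1, 20% at depth >= 2), the non-cumulative distribution becomes {0: 0.4, 1: 0.4, 2: 0.2}; each fraction is then multiplied by the chromosome length, and the resulting locus counts are summed across chromosomes to build the coverage histogram.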
set -e
set -o pipefail
samtools view -b -h ~{bamFile} | \
~{preDownsampleCommand} ~{downsampleCommand} \
samtools view -b > ~{resultName}
set -e
# ensure BAM file and index are symlinked to working directory
ln -s ~{bamFile}
ln -s ~{bamIndex}
samtools view -b -h ~{bamFileName} ~{region} > ~{resultName}
set -e
set -o pipefail
samtools view -h -b -F 2304 -U ~{nonPrimaryReadsFile} ~{bamFile} | \
samtools view -h -b -F 4 -U ~{unmappedReadsFile} | \
samtools view -h -b -q ~{minQuality} -U ~{lowQualityReadsFile} \
> ~{resultName}
samtools view -c ~{bamFile} > ~{totalInputReadsFile}
samtools view -c ~{nonPrimaryReadsFile} > ~{totalNonPrimaryReadsFile}
samtools view -c ~{unmappedReadsFile} > ~{totalUnmappedReadsFile}
samtools view -c ~{lowQualityReadsFile} > ~{totalLowQualityReadsFile}
samtools index ~{bamFile} ~{resultIndexName}
python3 <<CODE
import json, math, sys
readsIn = ~{inputReads}
readsTarget = ~{targetReads}
precision = ~{precision}
print("Input reads param =", readsIn, file=sys.stderr)
print("Target reads param =", readsTarget, file=sys.stderr)
minReadsAbsolute = ~{minReadsAbsolute}
minReadsRelative = ~{minReadsRelative}
preDownsampleMultiplier = ~{preDSMultiplier}
if readsIn <= readsTarget:
    # absolutely no downsampling
    applyPreDownsample = False
    applyDownsample = False
    preDownsampleTarget = "no_pre_downsample"
    downSampleTarget = "no_downsample"
elif readsIn < readsTarget * minReadsRelative or readsTarget < minReadsAbsolute:
    # no predownsampling
    applyPreDownsample = False
    applyDownsample = True
    preDownsampleTarget = "no_pre_downsample"
    downSampleTarget = str(readsTarget)
else:
    # predownsampling and downsampling
    applyPreDownsample = True
    applyDownsample = True
    probability = (readsTarget * preDownsampleMultiplier)/readsIn
    formatString = "{:0"+str(precision)+"d}"
    preDownsampleTarget = formatString.format(int(math.floor(probability * 10**precision)))
    downSampleTarget = str(readsTarget)
status = {
    "pre_ds": applyPreDownsample,
    "ds": applyDownsample
}
targets = {
    "pre_ds": preDownsampleTarget,
    "ds": downSampleTarget
}
statusFile = open("~{statusFile}", "w")
json.dump(status, statusFile)
statusFile.close()
targetFile = open("~{targetsFile}", "w")
json.dump(targets, targetFile)
targetFile.close()
CODE
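As a worked example with the default parameters (targetReads = 100000, minReadsAbsolute = 10000, minReadsRelative = 2, preDSMultiplier = 1.5, precision = 8): an input of 1,000,000 reads triggers both stages, with a pre-downsampling probability of 100000 * 1.5 / 1000000 = 0.15, encoded as the 8-digit string "15000000", and a final downsampling target of 100000 reads; an input of 150,000 reads skips pre-downsampling because 150000 < 2 * 100000; and an input of 100,000 reads or fewer is not downsampled at all.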
python3 <<CODE
readsIn = ~{inputReads}
threshold = ~{threshold}
interval = ~{baseInterval}
start = ~{intervalStart} + 1 # start of sub-chromosome window, if needed; exclude telomeres
chromosomes = [line.strip() for line in open("~{chromosomesText}").readlines()]
customRegions = "~{customRegions}" # overrides other chromosome/interval parameters
ds = True # True if downsampling, false otherwise
end = None # end of window, if needed
if readsIn <= threshold:
    ds = False # no downsampling
elif readsIn <= threshold*10:
    pass # default to chr12 & chr13 =~ 8% of genome
elif readsIn <= threshold*10**2:
    end = start + interval*10**3 - 1 # default 2*15 million base window ~ 1% of genome
elif readsIn <= threshold*10**3:
    end = start + interval*10**2 - 1
elif readsIn <= threshold*10**4:
    end = start + interval*10 - 1
else:
    end = start + interval - 1
if ds:
    status = "true"
    if customRegions != "":
        region = customRegions
    elif end == None:
        region = " ".join(chromosomes)
    else:
        regions = ["%s:%i-%i" % (chromosome, start, end) for chromosome in chromosomes]
        region = " ".join(regions)
else:
    status = "false"
    region = ""
outStatus = open("~{outputStatus}", "w")
print(status, file=outStatus)
outStatus.close()
outRegion = open("~{outputRegion}", "w")
print(region, file=outRegion)
outRegion.close()
CODE
java -Xmx~{picardMaxMemMb}M \
-jar ${PICARD_ROOT}/picard.jar \
MarkDuplicates \
INPUT=~{bamFile} \
OUTPUT=~{outFileBam} \
VALIDATION_STRINGENCY=SILENT \
TMP_DIR=${PWD} \
METRICS_FILE=~{outFileText} \
OPTICAL_DUPLICATE_PIXEL_DISTANCE=~{opticalDuplicatePixelDistance}
set -euo pipefail
gatk --java-options "-Xmx~{jobMemory - overhead}G" MergeSamFiles \
~{sep=" " prefix("--INPUT=", bams)} \
--OUTPUT="~{outputFileName}~{suffix}.bam" \
--CREATE_INDEX=true \
--SORT_ORDER=coordinate \
--ASSUME_SORTED=false \
--USE_THREADING=true \
--VALIDATION_STRINGENCY=SILENT
set -e
set -o pipefail
samtools view -b \
-F ~{filterFlags} \
~{"-q " + minMapQuality} \
~{filterAdditionalParams} \
~{bamFile} \
~{sep=" " intervals} > ~{resultName}
set -eo pipefail
# ensure BAM file and index are symlinked to working directory
ln -s ~{bamFile}
ln -s ~{bamIndex}
# run mosdepth
MOSDEPTH_PRECISION=8 mosdepth -x -n -t 3 bamqc ~{bamFileName}
set -euo pipefail
echo "~{str}" | tr '~{lineSeparator}' '\n' | tr '~{recordSeparator}' '\t'
python3 <<CODE
import json
metadata = json.loads(open("~{metadataJson}").read())
metadata["total input reads meta"] = ~{totalInputReads}
metadata["non-primary reads meta"] = ~{nonPrimaryReads}
metadata["unmapped reads meta"] = ~{unmappedReads}
metadata["low-quality reads meta"] = ~{lowQualityReads}
outFile = open("~{outFileName}", "w")
json.dump(metadata, outFile)
outFile.close()
CODE
For support, please file an issue on the GitHub project or send an email to [email protected].
Generated with generate-markdown-readme (https://github.com/oicr-gsi/gsi-wdl-tools/)