Skip to content

oicr-gsi/bamQC

Repository files navigation

bamQC

bamQC workflow collects a number of metrics which are computed using several methods (by employing third-party software tools along with some custom code) and outputs the results in JSON format. The output also contains metadata, such as the instrument and lane names. Also, it contains the estimate of ribosomal rRNA contamination and other information. bamQC supports downsampling for faster analysis and splits some tasks by chromosome, which also increases speed.

Overview

Dependencies

Usage

Cromwell

java -jar cromwell.jar run bamQC.wdl --inputs inputs.json

Inputs

Required workflow parameters:

Parameter Value Description
inputGroups Array[InputGroup] Array of objects describing sets of bams to merge together and on which to compute QC metrics
metadata Map[String,String] JSON file containing metadata
mode String running mode for the workflow, only allow value 'lane_level' and 'call_ready'
bamQCMetrics.refFasta String Path to human genome FASTA reference
bamQCMetrics.refSizesBed String Path to human genome BED reference with chromosome sizes
bamQCMetrics.workflowVersion String Workflow version string

Optional workflow parameters:

Parameter Value Default Description
outputFileNamePrefix String "bamQC" Prefix for output files
intervalsToParallelizeByString String "chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY,chrM" Comma separated list of intervals to split by (e.g. chr1,chr2,chr3,chr4).

Optional task parameters:

Parameter Value Default Description
splitStringToArray.lineSeparator String "," Interval group separator - these are the intervals to split by.
splitStringToArray.recordSeparator String "+" Record separator - a delimiter for joining records
splitStringToArray.jobMemory Int 1 Memory allocated to job (in GB).
splitStringToArray.cores Int 1 The number of cores to allocate to the job.
splitStringToArray.timeout Int 1 Maximum amount of time (in hours) the task can run for.
splitStringToArray.modules String "" Environment module name and version to load (space separated) before command execution.
getChrCoefficient.memory Int 2 Memory allocated for this job
getChrCoefficient.timeout Int 1 Hours before task timeout
getChrCoefficient.modules String "samtools/1.14" Names and versions of modules to load
preFilter.filterFlags Int 260 Samtools filter flags to apply.
preFilter.minMapQuality Int? None Minimum alignment quality to pass filter
preFilter.filterAdditionalParams String? None Additional parameters to pass to samtools.
preFilter.modules String "samtools/1.14" required environment modules
preFilter.jobMemory Int 6 Memory allocated for this job
preFilter.minMemory Int 2 Minimum amount of RAM allocated to the task
preFilter.threads Int 4 Requested CPU threads
preFilter.timeout Int 4 hours before task timeout
mergeSplitByIntervalFiles.suffix String ".merge" suffix to use for merged bam
mergeSplitByIntervalFiles.jobMemory Int 24 Memory allocated to job (in GB).
mergeSplitByIntervalFiles.overhead Int 6 Java overhead memory (in GB). jobMemory - overhead == java Xmx/heap memory.
mergeSplitByIntervalFiles.cores Int 1 The number of cores to allocate to the job.
mergeSplitByIntervalFiles.timeout Int 24 Maximum amount of time (in hours) the task can run for.
mergeSplitByIntervalFiles.modules String "gatk/4.1.6.0" Environment module name and version to load (space separated) before command execution.
mergeFiles.suffix String ".merge" suffix to use for merged bam
mergeFiles.jobMemory Int 24 Memory allocated to job (in GB).
mergeFiles.overhead Int 6 Java overhead memory (in GB). jobMemory - overhead == java Xmx/heap memory.
mergeFiles.cores Int 1 The number of cores to allocate to the job.
mergeFiles.timeout Int 24 Maximum amount of time (in hours) the task can run for.
mergeFiles.modules String "gatk/4.1.6.0" Environment module name and version to load (space separated) before command execution.
filter.minQuality Int 30 Minimum alignment quality to pass filter
filter.modules String "samtools/1.14" required environment modules
filter.jobMemory Int 16 Memory allocated for this job
filter.threads Int 4 Requested CPU threads
filter.timeout Int 4 hours before task timeout
updateMetadata.modules String "python/3.6" required environment modules
updateMetadata.jobMemory Int 16 Memory allocated for this job
updateMetadata.threads Int 4 Requested CPU threads
updateMetadata.timeout Int 4 hours before task timeout
findDownsampleParams.targetReads Int 100000 Desired number of reads in downsampled output
findDownsampleParams.minReadsAbsolute Int 10000 Minimum value of targetReads to allow pre-downsampling
findDownsampleParams.minReadsRelative Int 2 Minimum value of (inputReads)/(targetReads) to allow pre-downsampling
findDownsampleParams.precision Int 8 Number of decimal places in fraction for pre-downsampling
findDownsampleParams.preDSMultiplier Float 1.5 Determines target size for pre-downsampled set (if any). Must have (preDSMultiplier) < (minReadsRelative).
findDownsampleParams.modules String "python/3.6" required environment modules
findDownsampleParams.jobMemory Int 16 Memory allocated for this job
findDownsampleParams.threads Int 4 Requested CPU threads
findDownsampleParams.timeout Int 4 hours before task timeout
findDownsampleParamsMarkDup.threshold Int 10000000 Minimum number of reads to conduct downsampling
findDownsampleParamsMarkDup.chromosomes Array[String] ["chr12", "chr13", "chrXII", "chrXIII"] Array of chromosome identifiers for downsampled subset
findDownsampleParamsMarkDup.baseInterval Int 15000 Base width of interval in each chromosome, for very large BAMs
findDownsampleParamsMarkDup.intervalStart Int 100000 Start of interval in each chromosome, for very large BAMs
findDownsampleParamsMarkDup.customRegions String "" Custom downsample regions; overrides chromosome and interval parameters
findDownsampleParamsMarkDup.modules String "python/3.6" required environment modules
findDownsampleParamsMarkDup.jobMemory Int 16 Memory allocated for this job
findDownsampleParamsMarkDup.threads Int 4 Requested CPU threads
findDownsampleParamsMarkDup.timeout Int 4 hours before task timeout
downsample.downsampleSuffix String "downsampled.bam" Suffix for output file
downsample.randomSeed Int 42 Random seed for pre-downsampling (if any)
downsample.modules String "samtools/1.14" required environment modules
downsample.jobMemory Int 16 Memory allocated for this job
downsample.threads Int 4 Requested CPU threads
downsample.timeout Int 4 hours before task timeout
downsampleRegion.modules String "samtools/1.14" required environment modules
downsampleRegion.jobMemory Int 16 Memory allocated for this job
downsampleRegion.threads Int 4 Requested CPU threads
downsampleRegion.timeout Int 4 hours before task timeout
markDuplicates.opticalDuplicatePixelDistance Int 100 Maximum offset between optical duplicate clusters
markDuplicates.picardMaxMemMb Int 6000 Memory requirement in MB for running Picard JAR
markDuplicates.modules String "picard/2.21.2" required environment modules
markDuplicates.jobMemory Int 16 Memory allocated for this job
markDuplicates.threads Int 4 Requested CPU threads
markDuplicates.timeout Int 4 hours before task timeout
bamQCMetrics.normalInsertMax Int 1500 Maximum of expected insert size range
bamQCMetrics.modules String "bam-qc-metrics/0.2.5" required environment modules
bamQCMetrics.jobMemory Int 16 Memory allocated for this job
bamQCMetrics.threads Int 4 Requested CPU threads
bamQCMetrics.timeout Int 4 hours before task timeout
runMosdepth.modules String "mosdepth/0.2.9" required environment modules
runMosdepth.jobMemory Int 16 Memory allocated for this job
runMosdepth.threads Int 4 Requested CPU threads
runMosdepth.timeout Int 4 hours before task timeout
cumulativeDistToHistogram.modules String "python/3.6" required environment modules
cumulativeDistToHistogram.jobMemory Int 8 Memory allocated for this job
cumulativeDistToHistogram.threads Int 4 Requested CPU threads
cumulativeDistToHistogram.timeout Int 1 hours before task timeout
collateResults.modules String "python/3.6" required environment modules
collateResults.jobMemory Int 8 Memory allocated for this job
collateResults.threads Int 4 Requested CPU threads
collateResults.timeout Int 1 hours before task timeout

Outputs

Output Type Description Labels
result File json file that contains metrics and meta data described in https://github.com/oicr-gsi/bam-qc-metrics/blob/master/metrics.md vidarr_label: result

This section lists command(s) run by WORKFLOW workflow

Run getChrCoefficient

 
  CHROM_LEN=$(samtools view -H ~{bamFile} | grep ^@SQ | grep -v _ | grep -w ~{chromosome} | cut -f 3 | sed 's/LN://')
  LARGEST=$(samtools view -H ~{bamFile} | grep ^@SQ | grep -v _ | cut -f 3 | sed 's/LN://' | sort -n | tail -n 1)
  echo | awk -v chrom_len=$CHROM_LEN -v largest=$LARGEST '{print int((chrom_len/largest + 0.1) * 10)/10}'
 

Run bamQCMetrics

  run_bam_qc.py \
   -b ~{bamFile} \
   -d ~{markDuplicates} \
   --debug \
   -i ~{normalInsertMax} \
   -o ~{resultName} \
   -r ~{refFasta} \
   -t ~{refSizesBed} \
   -T . \
   -w ~{workflowVersion} \
   ~{dsInput}

Run collateResults

   python3 <<CODE
   import json
   data = json.loads(open("~{bamQCMetricsResult}").read())
   histogram = json.loads(open("~{histogram}").read())
   data["coverage histogram"] = histogram
   metadata = json.loads(open("~{metadata}").read())
   for key in metadata.keys():
       data[key] = metadata[key]
   out = open("~{outputFileName}", "w")
   json.dump(data, out, sort_keys=True)
   out.close()
   CODE

Run cumulativeDistToHistogram

   python3 <<CODE
   import csv, json
   summary = open("~{summary}").readlines()
   globalDist = open("~{globalDist}").readlines()
   # read chromosome lengths from the summary
   summaryReader = csv.reader(summary, delimiter="\t")
   lengthByChr = {}
   for row in summaryReader:
       if row[0] == 'chrom' or row[0] == 'total':
 	  continue # skip initial header row, and final total row
       lengthByChr[row[0]] = int(row[1])
   chromosomes = sorted(lengthByChr.keys())
   # read the cumulative distribution for each chromosome
   globalReader = csv.reader(globalDist, delimiter="\t")
   cumDist = {}
   for k in chromosomes:
       cumDist[k] = {}
   for row in globalReader:
       if row[0]=="total":
 	  continue
       cumDist[row[0]][int(row[1])] = float(row[2])
   # convert the cumulative distributions to non-cumulative and populate histogram
   # if the input BAM is empty, chromosomes and histogram will also be empty
   histogram = {}
   for k in chromosomes:
       depths = sorted(cumDist[k].keys())
       dist = {}
       for i in range(len(depths)-1):
 	  depth = depths[i]
 	  nextDepth = depths[i+1]
 	  dist[depth] = cumDist[k][depth] - cumDist[k][nextDepth]
       maxDepth = max(depths)
       dist[maxDepth] = cumDist[k][maxDepth]
       # now find the number of loci at each depth of coverage to construct the histogram
       for depth in depths:
 	  loci = int(round(dist[depth]*lengthByChr[k], 0))
 	  histogram[depth] = histogram.get(depth, 0) + loci
   # if histogram is non-empty, fill in zero values for missing depths
   for i in range(max(histogram.keys(), default=0)):
       if i not in histogram:
 	  histogram[i] = 0
   out = open("~{outFileName}", "w")
   json.dump(histogram, out, sort_keys=True)
   out.close()
   CODE

Run downsample

     set -e
     set -o pipefail
     samtools view -b -h ~{bamFile} | \
     ~{preDownsampleCommand} ~{downsampleCommand} \
     samtools view -b > ~{resultName}

downsampleRegion

     set -e
     # ensure BAM file and index are symlinked to working directory
     ln -s ~{bamFile}
     ln -s ~{bamIndex}
     samtools view -b -h ~{bamFileName} ~{region} > ~{resultName}

Run filter

     set -e
     set -o pipefail
     samtools view -h -b -F 2304 -U ~{nonPrimaryReadsFile} ~{bamFile} | \
     samtools view -h -b -F 4 -U ~{unmappedReadsFile} | \
     samtools view -h -b -q ~{minQuality} -U ~{lowQualityReadsFile} \
     > ~{resultName}
     samtools view -c ~{bamFile} > ~{totalInputReadsFile}
     samtools view -c ~{nonPrimaryReadsFile} > ~{totalNonPrimaryReadsFile}
     samtools view -c ~{unmappedReadsFile} > ~{totalUnmappedReadsFile}
     samtools view -c ~{lowQualityReadsFile} > ~{totalLowQualityReadsFile}
     samtools index ~{bamFile} > ~{resultIndexName}

Run findDownsampleParams

     python3 <<CODE
     import json, math, sys
     readsIn = ~{inputReads}
     readsTarget = ~{targetReads}
     precision = ~{precision}
     print("Input reads param =", readsIn, file=sys.stderr)
     print("Target reads param =", readsTarget, file=sys.stderr)
     minReadsAbsolute = ~{minReadsAbsolute}
     minReadsRelative = ~{minReadsRelative}
     preDownsampleMultiplier = ~{preDSMultiplier}
     if readsIn <= readsTarget:
       # absolutely no downsampling
       applyPreDownsample = False
       applyDownsample = False
       preDownsampleTarget = "no_pre_downsample"
       downSampleTarget = "no_downsample"
     elif readsIn < readsTarget * minReadsRelative or readsTarget < minReadsAbsolute:
       # no predownsampling
       applyPreDownsample = False
       applyDownsample = True
       preDownsampleTarget = "no_pre_downsample"
       downSampleTarget = str(readsTarget)
     else:
       # predownsampling and downsampling
       applyPreDownsample = True
       applyDownsample = True
       probability = (readsTarget * preDownsampleMultiplier)/readsIn
       formatString = "{:0"+str(precision)+"d}"
       preDownsampleTarget = formatString.format(int(math.floor(probability * 10**precision)))
       downSampleTarget = str(readsTarget)
     status = {
       "pre_ds": applyPreDownsample,
       "ds": applyDownsample
     }
     targets = {
       "pre_ds": preDownsampleTarget,
       "ds": downSampleTarget
     }
     statusFile = open("~{statusFile}", "w")
     json.dump(status, statusFile)
     statusFile.close()
     targetFile = open("~{targetsFile}", "w")
     json.dump(targets, targetFile)
     targetFile.close()
     CODE

Run findDownsampleParamsMarkDup

     python3 <<CODE
     readsIn = ~{inputReads}
     threshold = ~{threshold}
     interval = ~{baseInterval}
     start = ~{intervalStart} + 1 # start of sub-chromosome window, if needed; exclude telomeres
     chromosomes = [line.strip() for line in open("~{chromosomesText}").readlines()]
     customRegions = "~{customRegions}" # overrides other chromosome/interval parameters
     ds = True # True if downsampling, false otherwise
     end = None # end of window, if needed
     if readsIn <= threshold:
         ds = False # no downsampling
     elif readsIn <= threshold*10:
         pass # default to chr12 & chr13 =~ 8% of genome
     elif readsIn <= threshold*10**2:
         end = start + interval*10**3 - 1 # default 2*15 million base window ~ 1% of genome
     elif readsIn <= threshold*10**3:
         end = start + interval*10**2 - 1
     elif readsIn <= threshold*10**4:
         end = start + interval*10 - 1
     else:
         end = start + interval - 1
     if ds:
         status = "true"
         if customRegions != "":
             region = customRegions
         elif end == None:
             region = " ".join(chromosomes)
         else:
             regions = ["%s:%i-%i" % (chromosome, start, end) for chromosome in chromosomes ]
             region = " ".join(regions)
     else:
         status = "false"
         region = ""
     outStatus = open("~{outputStatus}", "w")
     print(status, file=outStatus)
     outStatus.close()
     outRegion = open("~{outputRegion}", "w")
     print(region, file=outRegion)
     outRegion.close()
     CODE

Run markDuplicates

     java -Xmx~{picardMaxMemMb}M \
     -jar ${PICARD_ROOT}/picard.jar \
     MarkDuplicates \
     INPUT=~{bamFile} \
     OUTPUT=~{outFileBam} \
     VALIDATION_STRINGENCY=SILENT \
     TMP_DIR=${PWD} \
     METRICS_FILE=~{outFileText} \
     OPTICAL_DUPLICATE_PIXEL_DISTANCE=~{opticalDuplicatePixelDistance}

Run MergeFiles

    set -euo pipefail
  
    gatk --java-options "-Xmx~{jobMemory - overhead}G" MergeSamFiles \
    ~{sep=" " prefix("--INPUT=", bams)} \
    --OUTPUT="~{outputFileName}~{suffix}.bam" \
    --CREATE_INDEX=true \
    --SORT_ORDER=coordinate \
    --ASSUME_SORTED=false \
    --USE_THREADING=true \
    --VALIDATION_STRINGENCY=SILENT 
   

Run prefilter

  set -e
  set -o pipefail
  samtools view -b \
       -F ~{filterFlags} \
       ~{"-q " + minMapQuality} \
       ~{filterAdditionalParams} \
       ~{bamFile} \
       ~{sep=" " intervals} > ~{resultName}

Run runMosdepth

  set -eo pipefail
  # ensure BAM file and index are symlinked to working directory
  ln -s ~{bamFile}
  ln -s ~{bamIndex}
  # run mosdepth
  MOSDEPTH_PRECISION=8 mosdepth -x -n -t 3 bamqc ~{bamFileName}

splitStringToArray

   set -euo pipefail
   
   echo "~{str}" | tr '~{lineSeparator}' '\n' | tr '~{recordSeparator}' '\t'

Run updateMetadata

   python3 <<CODE
   import json
   metadata = json.loads(open("~{metadataJson}").read())
   metadata["total input reads meta"] = ~{totalInputReads}
   metadata["non-primary reads meta"] = ~{nonPrimaryReads}
   metadata["unmapped reads meta"] = ~{unmappedReads}
   metadata["low-quality reads meta"] = ~{lowQualityReads}
   outFile = open("~{outFileName}", "w")
   json.dump(metadata, outFile)
   outFile.close()
   CODE

Support

For support, please file an issue on the Github project or send an email to [email protected] .

Generated with generate-markdown-readme (https://github.com/oicr-gsi/gsi-wdl-tools/)