Skip to content

Running STRetch

Harriet Dashnow edited this page Jul 20, 2018 · 33 revisions

Quick start

Run the WGS pipeline:

STRetch/tools/bin/bpipe run STRetch/pipelines/STRetch_wgs_pipeline.groovy sample1_L001_R1.fastq.gz sample1_L001_R2.fastq.gz sample2_L001_R1.fastq.gz sample2_L001_R2.fastq.gz …

All reads are assumed to be paired end with filenames ending in _R1.fastq.gz and _R2.fastq.gz. Multiple samples can be run in the same command and will be processed in parallel and their STR variation compared to find outliers.

If you would like to use the included PCR WGS reference set as controls, you will need to edit the "CONTROL" line in pipeline_config.groovy. Otherwise by default STRetch will use all the other samples in your set as controls.

STRetch pipelines

STRetch provides three pipelines, depending on what type of sequencing you are doing and what format your data is in.

For all pipelines, if you analyse multiple samples together you can look for outliers in your set of samples.

Whole genome sequencing starting from fastq files: STRetch_wgs_pipeline.groovy

This pipeline maps all paired-end reads against a reference genome containing STR-decoy chromosomes. It then uses reads mapping to the STR-decoy chromosomes to detect large STR expansions.

On a single sample:

bpipe run STRetch/pipelines/STRetch_wgs_pipeline.groovy sample_L001_R1.fastq.gz sample_L001_R2.fastq.gz

On multiple samples:

bpipe run STRetch/pipelines/STRetch_wgs_pipeline.groovy sample1_L001_R1.fastq.gz sample1_L001_R2.fastq.gz sample2_L001_R1.fastq.gz sample2_L001_R2.fastq.gz …

Whole genome sequencing starting from mapped bam files: STRetch_wgs_bam_pipeline.groovy

This pipeline takes a position sorted mapped bam or cram file and extracts reads from it likely to contain STR sequence. Specifically, all reads mapping to any STRs, flanking sequence, and unmapped reads. These reads are then mapped against a new reference genome containing STR-decoy chromosomes and the pipeline continues much as the standard STRetch WGS pipeline.

If using cram input files, make sure to set CRAM_REF=reference_genome.fasta in pipeline_config.groovy to the reference genome used to create the cram files.

bpipe run -p input_regions=STR_positions.bed STRetch/pipelines/STRetch_wgs_bam_pipeline.groovy sample1.bam sample2.bam

STR_positions.bed is a bed file defining the positions of all STRs in the genome, for example hg19.simpleRepeat_period1-6.dedup.sorted.bed. Using a bed file that only specifies some STRs may cause STRetch to miss expansions. It must match the reference genome used to produce the bam file, but doesn’t have to match the reference genome used for the rest of the pipeline. This be file must not contain two STRs at the same position (this does occur in some genome annotations, see Reference data for more details)

You can additionally run multiple instances of BWA in parallel to speed up the read extraction and mapping time. To do this, set the bwa_parallelism parameter. For example -p bwa_parallelism=12 would run BWA 12 times in parallel for each sample. You can also use the bpipe -n parameter to limit the total number of jobs. Please see the Bpipe documentation.

Exome or other targeted sequencing, starting from fastq files: STRetch_exome_pipeline.groovy

Note that because STRetch assumes uniform coverage when estimate STR allele sizes, the exome pipeline will likely not produce accurate size estimate. It can be used to find check if samples are outliers at a given STR locus. This pipeline requires that all samples run together should have been sequenced using the same technology, for example the same exome kit.

Note: for exome samples, the exome target region must be set in the pipeline_config.groovy and is assumed to be the same for all samples.

EXOME_TARGET="target_region.bed"

bpipe run STRetch/pipelines/STRetch_exome_pipeline_meerkat.groovy sample1_L001_R1.fastq.gz sample1_L001_R2.fastq.gz sample2_L001_R1.fastq.gz sample2_L001_R2.fastq.gz …

Input file naming assumptions

Paired end fastq files

Sample.X_R1.fastq.gz Sample.X_R2.fastq.gz

Where sample is a unique sample name and is separated from the rest by “.” X can be anything (and is ignored). Forward and reverse reads are indicated by _R1 and _R2.

The pipeline only allows for one pair of fastq files per sample. If you have more (e.g. due to multiple lanes) you can simply concatenate the files (compressing and uncompressing as you go).

Mapped bam files

Sample.X.bam

Where sample is a unique sample name and is separated from the rest by “.” X can be anything (and is ignored).

Making your own custom control data

If you have a set of samples you would like to use as controls for future samples, simply run all samples using the STRetch pipeline, then re-run the estimateSTR.py script in the same working directory with the --emit option, like so:

STRetch/tools/bin/python STRetch/scripts/estimateSTR.py --model STRetch/scripts/STRcov.model.csv --emit my_controls.tsv

When running STRetch with your new samples, edit the file STRetch/pipelines/pipeline_config.groovy:

CONTROL="my_controls.tsv"