Running STRetch

Quick start

Run the WGS pipeline:

bpipe run STRetch/pipelines/STRetch_wgs_pipeline.groovy sample1_L001_R1.fastq.gz sample1_L001_R2.fastq.gz sample2_L001_R1.fastq.gz sample2_L001_R2.fastq.gz …

All reads are assumed to be paired-end, with filenames ending in _R1.fastq.gz and _R2.fastq.gz. Multiple samples can be run in the same command; they are processed in parallel, and their STR variation is compared to find outliers.
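Because the pipeline relies on the _R1.fastq.gz / _R2.fastq.gz naming, it can help to confirm that every forward-read file has a mate before launching a run. The following is a minimal sketch; `missing_mates` is a hypothetical helper written for this example and is not part of STRetch or bpipe.

```shell
# Hypothetical helper (not part of STRetch): report any *_R1.fastq.gz
# argument whose *_R2.fastq.gz mate does not exist on disk.
# Returns 0 when every R1 file has a mate, 1 otherwise.
missing_mates() {
    local r1 r2 status=0
    for r1 in "$@"; do
        # Only forward-read files are checked; everything else is skipped.
        case "$r1" in
            *_R1.fastq.gz) ;;
            *) continue ;;
        esac
        # Derive the expected mate name by swapping the _R1 suffix for _R2.
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        if [ ! -e "$r2" ]; then
            echo "missing mate for $r1: expected $r2" >&2
            status=1
        fi
    done
    return "$status"
}
```

One possible use is to gate the run on the check, for example `missing_mates *.fastq.gz && bpipe run STRetch/pipelines/STRetch_wgs_pipeline.groovy *.fastq.gz`.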

Sample data

To quickly try out STRetch, you can download some simulated reads from samples with STR expansions in the SCA8 STR locus.

You will need to edit pipeline_config.groovy to use the provided target bed file:

EXOME_TARGET="SCA8_region.bed"

Sample command to run:

bpipe run path/to/STRetch/pipelines/STRetch_exome_pipeline.groovy *.fastq.gz

STRetch pipelines

STRetch provides three pipelines, depending on what type of sequencing you are doing and what format your data is in.

For all pipelines, if you analyse multiple samples together you can look for outliers in your set of samples.

Whole genome sequencing starting from fastq files: STRetch_wgs_pipeline.groovy

This is the standard STRetch pipeline. It maps all paired-end reads against a reference genome containing STR-decoy chromosomes. It then uses reads mapping to the STR-decoy chromosomes to detect large STR expansions.

On a single sample:

bpipe run STRetch/pipelines/STRetch_wgs_pipeline.groovy sample_L001_R1.fastq.gz sample_L001_R2.fastq.gz

On multiple samples:

bpipe run STRetch/pipelines/STRetch_wgs_pipeline.groovy sample1_L001_R1.fastq.gz sample1_L001_R2.fastq.gz sample2_L001_R1.fastq.gz sample2_L001_R2.fastq.gz …

Whole genome sequencing starting from mapped bam files: STRetch_wgs_bam_pipeline.groovy

This pipeline takes a position-sorted, mapped bam file and extracts the reads likely to contain STR sequence. These reads are then mapped against a new reference genome containing STR-decoy chromosomes, and the pipeline continues much as the standard STRetch WGS pipeline does.

bpipe run STRetch/pipelines/STRetch_wgs_bam_pipeline.groovy STR_positions.bed sample1.bam sample2.bam

STR_positions.bed is a bed file defining the positions of all STRs in the genome. It must match the reference genome used to produce the bam file, but doesn’t have to match the reference genome used for the rest of the pipeline.
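A common symptom of a bed file that does not match the bam's reference is differing chromosome names (for example "chr1" versus "1"). The sketch below lists chromosome names that appear in the bed file but not in the @SQ lines of a bam header. It assumes the header has first been saved to a text file, for example with `samtools view -H sample1.bam > header.txt`. `bed_chroms_missing_from_header` is a hypothetical helper, not part of STRetch, and it checks names only: matching names do not prove that the coordinates come from the same genome build.

```shell
# Hypothetical helper (not part of STRetch): print each chromosome name
# used in a bed file that has no @SQ entry in a SAM/BAM header text file.
# Usage: bed_chroms_missing_from_header STR_positions.bed header.txt
bed_chroms_missing_from_header() {
    local bed="$1" header="$2"
    awk -F'\t' '
        # First file: collect sequence names from the @SQ header lines.
        FILENAME == ARGV[1] {
            if ($1 == "@SQ")
                for (i = 2; i <= NF; i++)
                    if ($i ~ /^SN:/) seen[substr($i, 4)] = 1
            next
        }
        # Second file: skip blank lines and optional bed header lines.
        NF == 0 { next }
        /^(#|track|browser)/ { next }
        # Report each unknown chromosome name once.
        !($1 in seen) && !($1 in reported) { print $1; reported[$1] = 1 }
    ' "$header" "$bed"
}
```

Empty output means every chromosome name in the bed file is present in the bam header.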

Exome or other targeted sequencing, starting from fastq files: STRetch_exome_pipeline.groovy

Note that because STRetch assumes uniform coverage when estimating STR allele sizes, the exome pipeline will likely not produce accurate size estimates. It can still be used to check whether samples are outliers at a given STR locus. This pipeline requires that all samples run together were sequenced using the same technology, for example the same exome kit.

Note: for exome samples, the exome target region must be set in pipeline_config.groovy and is assumed to be the same for all samples.

EXOME_TARGET="target_region.bed"

bpipe run STRetch/pipelines/STRetch_exome_pipeline.groovy sample1_L001_R1.fastq.gz sample1_L001_R2.fastq.gz sample2_L001_R1.fastq.gz sample2_L001_R2.fastq.gz …

Input file naming assumptions

Paired-end fastq files

Sample.X_R1.fastq.gz Sample.X_R2.fastq.gz

Where Sample is a unique sample name, separated from the rest of the filename by “.”. X can be anything (and is ignored). Forward and reverse reads are indicated by _R1 and _R2.
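To preview which sample name a given file would be assigned under this convention, the rule can be sketched with shell parameter expansion. This assumes the name is everything before the first “.” in the filename, which is a literal reading of the rule above and not taken from the STRetch source. `sample_name` is a hypothetical helper.

```shell
# Hypothetical helper (not part of STRetch): derive the sample name from a
# fastq path, assuming the name is everything before the first "." in the
# filename.
sample_name() {
    local base="${1##*/}"          # drop any leading directory
    printf '%s\n' "${base%%.*}"    # keep the text before the first "."
}
```

For example, `sample_name /data/sampleA.L001_R1.fastq.gz` prints `sampleA`. Running it over all input files and looking for names that occur on more than one _R1 file is one way to spot samples split across several fastq pairs.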

The pipeline allows only one pair of fastq files per sample. If you have more (e.g. due to multiple lanes), you can simply concatenate the files (decompressing and recompressing as you go).
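The concatenation described above can be sketched as follows. `merge_lanes` is a hypothetical helper and the file names in the usage note are placeholders. The important detail is that the lanes are listed in the same order for the _R1 and _R2 merges, so that read pairs stay in sync.

```shell
# Hypothetical helper (not part of STRetch): decompress several gzipped
# fastq files in the order given, concatenate them, and recompress the
# result into a single output file.
# Usage: merge_lanes OUTPUT.fastq.gz LANE1.fastq.gz LANE2.fastq.gz ...
merge_lanes() {
    local out="$1"
    shift
    # gzip -dc writes the decompressed inputs to stdout in argument order.
    gzip -dc "$@" | gzip > "$out"
}
```

A sample with two lanes could then be merged with `merge_lanes sample1.merged_R1.fastq.gz sample1.L001_R1.fastq.gz sample1.L002_R1.fastq.gz`, followed by the matching command for the _R2 files with the lanes in the same order.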