Running STRetch
Run the WGS pipeline:
bpipe run STRetch/pipelines/STRetch_wgs_pipeline.groovy sample1_L001_R1.fastq.gz sample1_L001_R2.fastq.gz sample2_L001_R1.fastq.gz sample2_L001_R2.fastq.gz …
All reads are assumed to be paired-end, with filenames ending in _R1.fastq.gz and _R2.fastq.gz.
Multiple samples can be run in the same command; they will be processed in parallel and their STR variation compared to find outliers.
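Because every _R1 file needs its _R2 mate, it can help to check the pairing before starting a run. The following is a minimal shell sketch and not part of STRetch; the function names mate_of and check_pairs are hypothetical helpers introduced here for illustration.

```shell
# Hypothetical pre-flight check (not part of STRetch): verify that each
# *_R1.fastq.gz file given as an argument has a matching *_R2.fastq.gz file.

# Print the expected mate file name for an _R1 file.
mate_of() {
    printf '%s\n' "${1%_R1.fastq.gz}_R2.fastq.gz"
}

# Return 0 if every _R1 file has a mate on disk, 1 otherwise.
check_pairs() {
    missing=0
    for r1 in "$@"; do
        case "$r1" in
            *_R1.fastq.gz) ;;      # only _R1 files are checked
            *) continue ;;         # skip _R2 files and anything else
        esac
        r2=$(mate_of "$r1")
        if [ ! -e "$r2" ]; then
            echo "missing mate for $r1 (expected $r2)" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

One possible use is to call check_pairs *.fastq.gz in the data directory and only start bpipe if it returns successfully.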
To quickly try out STRetch, you can download some simulated reads from samples with STR expansions in the SCA8 STR locus.
You will need to edit pipeline_config.groovy to use the provided target bed file:
EXOME_TARGET="SCA8_region.bed"
Sample command to run:
bpipe run path/to/STRetch/pipelines/STRetch_exome_pipeline.groovy *.fastq.gz
STRetch provides three pipelines, depending on what type of sequencing you are doing and what format your data is in.
For all pipelines, if you analyse multiple samples together you can look for outliers in your set of samples.
The WGS pipeline is the standard STRetch pipeline. It maps all paired-end reads against a reference genome containing STR-decoy chromosomes. It then uses reads mapping to the STR-decoy chromosomes to detect large STR expansions.
On a single sample:
bpipe run STRetch/pipelines/STRetch_wgs_pipeline.groovy sample_L001_R1.fastq.gz sample_L001_R2.fastq.gz
On multiple samples:
bpipe run STRetch/pipelines/STRetch_wgs_pipeline.groovy sample1_L001_R1.fastq.gz sample1_L001_R2.fastq.gz sample2_L001_R1.fastq.gz sample2_L001_R2.fastq.gz …
The WGS bam pipeline takes a position-sorted mapped bam file and extracts reads from it that are likely to contain STR sequence. These reads are then mapped against a new reference genome containing STR-decoy chromosomes, and the pipeline continues much as the standard STRetch WGS pipeline does.
bpipe run STRetch/pipelines/STRetch_wgs_bam_pipeline.groovy STR_positions.bed sample1.bam sample2.bam
STR_positions.bed is a bed file defining the positions of all STRs in the genome. It must match the reference genome used to produce the bam file, but doesn’t have to match the reference genome used for the rest of the pipeline.
Note that because STRetch assumes uniform coverage when estimating STR allele sizes, the exome pipeline will likely not produce accurate size estimates. It can be used to check whether samples are outliers at a given STR locus. This pipeline requires that all samples run together were sequenced using the same technology, for example the same exome kit.
Note: for exome samples, the exome target region must be set in the pipeline_config and is assumed to be the same for all samples.
EXOME_TARGET="target_region.bed"
bpipe run STRetch/pipelines/STRetch_exome_pipeline_meerkat.groovy sample1_L001_R1.fastq.gz sample1_L001_R2.fastq.gz sample2_L001_R1.fastq.gz sample2_L001_R2.fastq.gz …
Fastq input files are expected to be named in the form:
Sample.X_R1.fastq.gz Sample.X_R2.fastq.gz
where Sample is a unique sample name, separated from the rest of the file name by “.”. X can be anything (and is ignored). Forward and reverse reads are indicated by _R1 and _R2.
The pipeline only allows for one pair of fastq files per sample. If you have more (e.g. due to multiple lanes), you can simply concatenate the files (uncompressing and recompressing as you go).
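The concatenation step can be sketched in shell as follows. This is an illustrative helper, not part of STRetch: the function name merge_lanes and all file names in the usage note are hypothetical, and gzip is assumed to be installed.

```shell
# Hypothetical helper (not part of STRetch): merge gzipped FASTQ files from
# several lanes into one gzipped file by uncompressing and recompressing.
# Usage: merge_lanes OUTPUT.fastq.gz LANE1.fastq.gz LANE2.fastq.gz ...
merge_lanes() {
    out=$1
    shift
    # Decompress the inputs in the order given, then recompress as one stream.
    gzip -dc "$@" | gzip -c > "$out"
}
```

For example, merge_lanes sample1.merged_R1.fastq.gz sample1.L001_R1.fastq.gz sample1.L002_R1.fastq.gz would produce a single forward-read file. The _R1 and _R2 files should be merged separately, listing the lanes in the same order both times, so that read pairs stay in the same positions in the two merged files.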
Bam input files are expected to be named in the form:
Sample.X.bam
where Sample is a unique sample name, separated from the rest of the file name by “.”. X can be anything (and is ignored).
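To see which sample name a given file name implies under this convention, a small shell sketch can help. It assumes the sample name is simply the text before the first “.” in the file name; the function sample_name is a hypothetical helper, not part of STRetch, and the file names in the usage note are made up.

```shell
# Hypothetical helper (not part of STRetch): print the sample name implied by
# the naming convention, i.e. the text before the first "." in the file name,
# with any leading directory removed.
sample_name() {
    file=${1##*/}                  # strip directory components
    printf '%s\n' "${file%%.*}"    # keep everything before the first "."
}
```

For example, sample_name NA12878.L001_R1.fastq.gz and sample_name NA12878.sorted.bam would both print NA12878. For a set of bam files, piping the names through sort and uniq -d is one way to spot sample names that are not unique.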