Searching for read pairs containing mixed-motif STR expansions from complex repeat structures

bahlolab/mixSTR

mixSTR

mixSTR (pronounced "mixture") is a post-processing script applied to output generated by superSTR ("a lightweight, alignment-free utility for detecting repeat-containing reads in sequencing data").

Identifying complex repeat expansions

Bioinformatic methods for detecting repeat expansions in short-read whole-genome sequencing (WGS) data rely on "anchored reads" that align uniquely to the flanking DNA sequence just outside a short tandem repeat (STR). Short reads alone therefore cannot resolve the interior structure of large repeat expansions. Complex repeat expansions, some of which contain adjacent expansions of two or more different repeat motifs, can only be fully revealed by long-read sequencing. Read pairs containing mixed-motif STR expansions nonetheless suggest that a complex repeat expansion may be present.

Detecting complex repeat expansions

Preparing superSTR data to run mixSTR

Before running mixSTR, process the sample with superSTR, then sort the raw per_read.txt.gz output so that the two reads of each pair appear on adjacent lines:

zcat per_read.txt.gz | sort | gzip > sorted_per_read.txt.gz
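
Where the shell tools are unavailable, the same sorting step can be sketched in Python. This is a minimal illustration rather than part of mixSTR: the function name `sort_per_read` is hypothetical, the whole file is held in memory, and lines are ordered byte-wise, which may differ from the order produced by `sort` under a non-C locale.

```python
import gzip


def sort_per_read(src_path, dst_path):
    """Sort the lines of a gzipped text file and write them back gzipped.

    Hypothetical helper: loads every line into memory, so it suits
    modest inputs only.
    """
    with gzip.open(src_path, "rb") as src:
        # Drop the trailing newline so comparison is on line content only.
        lines = [line.rstrip(b"\n") for line in src]
    lines.sort()  # byte-wise order
    with gzip.open(dst_path, "wb") as dst:
        for line in lines:
            dst.write(line + b"\n")
```

Calling `sort_per_read("per_read.txt.gz", "sorted_per_read.txt.gz")` would produce the input file used in the commands that follow.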

Running mixSTR

Minimal command to run mixSTR:

python3 mixSTR.py \
    --input sorted_per_read.txt.gz \
    --output mixSTR_output.txt

Usage:

mixSTR.py [-h] [OPTIONAL PARAMETERS] -i INPUT_PATH -o OUTPUT_PATH

optional arguments:
  -h, --help            show this help message and exit

required arguments:
  -i INPUT_PATH, --input INPUT_PATH
                        Path to sorted superSTR per_read output
  -o OUTPUT_PATH, --output OUTPUT_PATH
                        Output path for files.

optional parameters:
  --read-repeat-thresh READ_REPEAT_THRESH
        Minimum number of base pairs of repetitive sequence required in each read to include read pair in analysis (default 120)
  --motif-repeat-thresh MOTIF_REPEAT_THRESH
        Minimum number of base pairs of repetitive sequence in a read pair to include motif (default 50)
  --motif MOTIF_FILTER
        Comma-separated set of motifs to count (if specified, any other motifs will not be reported)
  --motif-size MOTIF_SIZE_FILTER
        Comma-separated set of motif lengths in base pairs to keep (if specified, any other motifs will not be reported unless specified with the --motif argument)
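
One way to read the two thresholds above is sketched below. This is an interpretation of the help text, not mixSTR's actual code: the function `motifs_passing` and its dict-based inputs are hypothetical, and it assumes that the per-read threshold applies to the total repetitive sequence in each read and that the per-motif threshold applies to base pairs summed over both reads of the pair.

```python
def motifs_passing(read1, read2, read_repeat_thresh=120, motif_repeat_thresh=50):
    """Return the motifs a read pair would contribute, or [] if it is excluded.

    read1 and read2 are hypothetical dicts mapping motif -> repetitive
    base pairs observed in that read.
    """
    # Assumption: both reads must individually carry enough repetitive sequence.
    if (sum(read1.values()) < read_repeat_thresh
            or sum(read2.values()) < read_repeat_thresh):
        return []
    # Assumption: a motif is kept when its total across the pair is large enough.
    totals = {}
    for read in (read1, read2):
        for motif, bp in read.items():
            totals[motif] = totals.get(motif, 0) + bp
    return sorted(m for m, bp in totals.items() if bp >= motif_repeat_thresh)
```

Under these assumptions, a pair whose reads carry 140 bp of AAAAT and 60 bp of AAAAT plus 80 bp of AAATG would pass both thresholds and contribute both motifs, while a pair with a read below 120 bp of repetitive sequence would be dropped entirely.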

mixSTR output

mixSTR writes a two-column, tab-separated text file containing the number of read pairs identified for each motif or combination of motifs.

Where there is a complex expansion containing adjacent expansions of two motifs, the expected output should include entries for both motifs individually and for the pair of motifs (e.g., AAAAT, AAATG, and AAAAT-AAATG).
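
The output can be inspected with a few lines of Python. The sketch below is illustrative only: the helper names `load_counts` and `mixed_motif_counts` are hypothetical, and it assumes the first column holds the motif label and the second the read-pair count, with any non-numeric row (such as a header) skipped.

```python
import csv


def load_counts(path):
    """Read a two-column, tab-separated file into a {label: count} dict."""
    counts = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Skip blank lines and rows whose second field is not a number.
            if len(row) < 2 or not row[1].strip().isdigit():
                continue
            counts[row[0]] = int(row[1])
    return counts


def mixed_motif_counts(counts):
    """Keep only labels that join two or more motifs with a hyphen."""
    return {label: n for label, n in counts.items() if "-" in label}
```

For example, `mixed_motif_counts(load_counts("mixSTR_output.txt"))` would return only the hyphenated entries such as AAAAT-AAATG, which are the ones that point to a possible complex expansion.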

Example use cases

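To search for the TTTTA + TTTCA signature associated with familial adult myoclonic epilepsy (FAME), specify the motifs as AAAAT and AAATG, which are the reverse complements of TTTTA and TTTCA up to rotation: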

python3 mixSTR.py \
    --input sorted_per_read.txt.gz \
    --output mixSTR_FAME.txt \
    --motif AAAAT,AAATG
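
The motifs passed to --motif in the command above are not spelled the same way as the TTTTA and TTTCA repeats named in the text. One common convention for normalising repeat motifs is to take the lexicographically smallest rotation across both strands; the sketch below applies that rule. Whether superSTR uses exactly this rule is an assumption here, and `canonical_motif` is a hypothetical helper, but the rule does reproduce the two motifs used in the command.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_motif(motif):
    """Return the smallest rotation of a motif or its reverse complement."""
    motif = motif.upper()
    revcomp = motif.translate(_COMPLEMENT)[::-1]
    # Enumerate every rotation of the motif on both strands.
    rotations = [s[i:] + s[:i] for s in (motif, revcomp) for i in range(len(s))]
    return min(rotations)
```

With this rule, `canonical_motif("TTTTA")` gives AAAAT and `canonical_motif("TTTCA")` gives AAATG.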

To search all pentanucleotide repeat motifs:

python3 mixSTR.py \
    --input sorted_per_read.txt.gz \
    --output mixSTR_pentanucleotide.txt \
    --motif-size 5
