mixSTR (pronounced "mixture") is a post-processing script applied to output generated by superSTR ("a lightweight, alignment-free utility for detecting repeat-containing reads in sequencing data").
Bioinformatic methods for detecting repeat expansions in short-read whole-genome sequencing (WGS) data rely on "anchored reads" that align uniquely to the flanking DNA sequence just outside a short tandem repeat (STR). Short reads therefore cannot resolve the interior structure of large repeat expansions. Complex repeat expansions, some of which contain adjacent expansions of two or more different repeat motifs, can only be fully characterised by long-read sequencing. However, read pairs containing mixed-motif STR expansions in short-read data suggest that a complex repeat expansion may be present.
Before running mixSTR, process the sample with superSTR, then sort the raw per_read.txt.gz
output so that the two reads of each pair appear on adjacent lines:
zcat per_read.txt.gz | sort | gzip > sorted_per_read.txt.gz
Minimal command to run mixSTR:
python3 mixSTR.py \
--input sorted_per_read.txt.gz \
--output mixSTR_output.txt
Usage:
mixSTR.py [-h] [OPTIONAL PARAMETERS] -i INPUT_PATH -o OUTPUT_PATH
optional arguments:
-h, --help show this help message and exit
required arguments:
-i INPUT_PATH, --input INPUT_PATH
Path to sorted superSTR per_read output
-o OUTPUT_PATH, --output OUTPUT_PATH
Output path for files.
optional parameters:
--read-repeat-thresh READ_REPEAT_THRESH
Minimum number of base pairs of repetitive sequence required in each read to include the read pair in the analysis (default 120)
--motif-repeat-thresh MOTIF_REPEAT_THRESH
Minimum number of base pairs of repetitive sequence in a read pair to include a motif (default 50)
--motif MOTIF_FILTER
Comma-separated set of motifs to count (if specified, no other motifs will be reported)
--motif-size MOTIF_SIZE_FILTER
Comma-separated set of motif lengths in base pairs to keep (if specified, motifs of any other length will not be reported unless they are named with the --motif argument)
The output of mixSTR is a two-column, tab-separated text file containing the number of read pairs identified for each motif or combination of motifs.
Where there is a complex expansion containing adjacent expansions of two motifs, the expected output should include entries for each motif individually and for the pair of motifs (e.g., AAAAT, AAATG, and AAAAT-AAATG).
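The check described above can be sketched in Python. This is a minimal illustration, not part of mixSTR: it assumes the first column holds the motif (or hyphen-joined motif pair) and the second column holds the read-pair count, and the function names `load_counts` and `has_mixed_signature` are hypothetical. The counts in the example are invented purely to show the file shape.

```python
import csv
import io


def load_counts(handle):
    """Read a two-column, tab-separated file into a {motif_key: count} dict.

    Assumption: column 1 is the motif or motif pair, column 2 is the count.
    Rows that do not fit that shape (e.g. a header line) are skipped.
    """
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 2:
            continue
        key, value = row
        try:
            counts[key] = int(value)
        except ValueError:
            continue  # non-numeric second column, probably a header
    return counts


def has_mixed_signature(counts, motif_a, motif_b):
    """Return True if a read-pair count exists for the combined motif entry."""
    # The order of motifs within a pair key is not documented, so try both.
    pair_keys = (f"{motif_a}-{motif_b}", f"{motif_b}-{motif_a}")
    return any(counts.get(key, 0) > 0 for key in pair_keys)


# Invented example content, for illustration only.
example = io.StringIO("AAAAT\t12\nAAATG\t9\nAAAAT-AAATG\t4\n")
counts = load_counts(example)
print(has_mixed_signature(counts, "AAAAT", "AAATG"))
```

For a real run, replace the `io.StringIO` object with `open("mixSTR_output.txt", newline="")`.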
To search for the TTTTA + TTTCA signature associated with familial adult myoclonic epilepsy (FAME):
python3 mixSTR.py \
--input sorted_per_read.txt.gz \
--output mixSTR_FAME.txt \
--motif AAAAT,AAATG
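The command above passes AAAAT and AAATG rather than TTTTA and TTTCA. One way to see the correspondence is to reduce each motif to a canonical form: the lexicographically smallest rotation across both strands. The sketch below assumes that this is the convention used for motif names in the superSTR output; that convention is an assumption here, not something this document states, and the helper `canonical_motif` is hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def canonical_motif(motif: str) -> str:
    """Smallest rotation of the motif, considering both strands.

    Assumption: motif names follow this "minimal rotation" convention.
    """
    motif = motif.upper()
    candidates = []
    for strand in (motif, reverse_complement(motif)):
        # Collect every cyclic rotation of this strand.
        for i in range(len(strand)):
            candidates.append(strand[i:] + strand[:i])
    return min(candidates)


print(canonical_motif("TTTTA"))  # AAAAT
print(canonical_motif("TTTCA"))  # AAATG
```

Under this convention, the two FAME motifs map to exactly the values given to --motif above.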
To search all pentanucleotide repeat motifs:
python3 mixSTR.py \
--input sorted_per_read.txt.gz \
--output mixSTR_pentanucleotide.txt \
--motif-size 5