-
Notifications
You must be signed in to change notification settings - Fork 15
Reference data
Although it has only been tested on humans, STRetch can be run using any reasonably complete reference genome.
Here is an example of generating a new STRetch reference genome using hg38. Place reference data in STRetch/reference-data
and edit pipeline_config.groovy
to point to the new reference data.
Obtain a fasta file of your reference genome:
For example you could download hg38 from UCSC. You will need uncompressed fasta for the following steps. Let's say for example that you have downloaded hg38, extracted it, and named it hg38.fasta
like this:
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz -O hg38.fasta.gz
gzip -d hg38.fasta.gz
Generate STR decoy chromosomes:
python STRetch/scripts/decoy_STR.py --length 2000 > STRdecoys.fasta
Add STR decoy chromosomes to your reference genome:
cat hg38.fasta STRdecoys.fasta > hg38.STRdecoys.fasta
Sort the fasta file and make sure lines are a uniform length:
python STRetch/scripts/sort_fasta.py --infile hg38.STRdecoys.fasta --outfile hg38.STRdecoys.sorted.fasta
Index for BWA:
bwa index hg38.STRdecoys.sorted.fasta
Generate faidx index:
samtools faidx hg38.STRdecoys.sorted.fasta
Create BED file specifying locations of STR decoy chromosomes:
grep STR hg38.STRdecoys.sorted.fasta.fai | awk -v OFS='\t' '{ print $1, 0, $2 -1 }' > STRdecoys.bed
Sort the BED file:
bedtools sort -i STRdecoys.bed > STRdecoys.sorted.bed
Generate genome file:
cut -f1,2 hg38.STRdecoys.sorted.fasta.fai > hg38.STRdecoys.sorted.fasta.genome
For genomes listed in UCSC and annotated with STRs you can download and use this annotation. For genomes not listed in UCSC, see the instructions for running Tandem Repeat Finder below.
Download the Tandem Repeats Finder annotation of the reference genome from UCSC
File name: hg38.simpleRepeat.txt.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/simpleRepeat.txt.gz -O hg38.simpleRepeat.txt.gz
Filter to only repeat units of 6bp or less, then extract columns required to create a bed file:
gzip -dc hg38.simpleRepeat.txt.gz | awk '($6 <= 6)' | awk -v OFS='\t' '{print $2,$3,$4,$17,$7}' > hg38.simpleRepeat_period1-6.bed
Remove duplicate lines (lines where two different repeat units are annotated at the same locus):
Rscript STRetch/scripts/dedupSTRannotation.R -i hg38.simpleRepeat_period1-6.bed -o hg38.simpleRepeat_period1-6.dedup.bed
Sort:
bedtools sort -i hg38.simpleRepeat_period1-6.dedup.bed > hg38.simpleRepeat_period1-6.dedup.sorted.bed
For a new genome that has not already been annotated with STRs, you will need to run Tandem Repeat Finder.
Download TRF from: https://tandem.bu.edu/trf/trf409.linux64.download.html
Move the file to your linux cluster, and set it to be executable. e.g.
chmod ug+xr trf409.linux64
Run TRF on your entire original reference genome (without the extra STR chromosomes). These are the default settings for TRF, except limiting it to detect STRs with up to 6bp repeat units.
./trf409.linux64 hg38.fasta 2 7 7 80 10 50 6 -dh
Convert to BED format with the provided script.
python STRetch/scripts/TRFdat_to_bed.py --dat hg38.fasta.2.7.7.80.10.50.6.dat --bed hg38.trf_period1-6.bed
Remove duplicate lines (lines where two different repeat units are annotated at the same locus):
Rscript STRetch/scripts/dedupSTRannotation.R -i hg38.trf_period1-6.bed -o hg38.trf_period1-6.dedup.bed
Sort:
bedtools sort -i hg38.trf_period1-6.dedup.bed > hg38.trf_period1-6.dedup.sorted.bed
Edit pipeline_config.groovy
to point to these new fasta and bed files (the indexes must be in the same directory, but don't need to be specified). You will need to edit the lines:
REF="/home/hdashnow/STRetch/reference-data/hg38.STRdecoys.sorted.fasta"
STR_BED="/home/hdashnow/STRetch/reference-data/hg38.simpleRepeat_period1-6_dedup.sorted.bed"
You should now, at a minimum, have the following files in STRetch/reference-data
:
hg38.simpleRepeat_period1-6.dedup.sorted.bed
hg38.STRdecoys.sorted.fasta
hg38.STRdecoys.sorted.fasta.amb
hg38.STRdecoys.sorted.fasta.ann
hg38.STRdecoys.sorted.fasta.bwt
hg38.STRdecoys.sorted.fasta.genome
hg38.STRdecoys.sorted.fasta.pac
hg38.STRdecoys.sorted.fasta.sa
STRdecoys.sorted.bed
If any of these files are missing, STRetch will not run!
- Make sure your chromosomes names match up in all files, for example "chr1" vs "1". You may need to edit the simpleRepeat bed file to ensure they match.
- Make sure all your files are sorted.