Skip to content

Reference data

Harriet Dashnow edited this page Nov 7, 2018 · 20 revisions

Making your own reference data

Although it has only been tested on humans, STRetch can be run using any reasonably complete reference genome.

Here is an example of generating a new STRetch reference genome using hg38. Place reference data in STRetch/reference-data and edit pipeline_config.groovy to point to the new reference data.

Create an STR decoy reference genome

Obtain a fasta file of your reference genome:
For example you could download hg38 from UCSC. You will need uncompressed fasta for the following steps. Let's say for example that you have downloaded hg38, extracted it, and named it hg38.fasta like this:

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz -O hg38.fasta.gz
gzip -d hg38.fasta.gz

Generate STR decoy chromosomes:
python STRetch/scripts/decoy_STR.py --length 2000 > STRdecoys.fasta

Add STR decoy chromosomes to your reference genome:
cat hg38.fasta STRdecoys.fasta > hg38.STRdecoys.fasta

Sort the fasta file and make sure lines are a uniform length:
python STRetch/scripts/sort_fasta.py --infile hg38.STRdecoys.fasta --outfile hg38.STRdecoys.sorted.fasta

Create some extra indices required for the STRetch pipeline

Index for BWA:
bwa index hg38.STRdecoys.sorted.fasta

Generate faidx index:
samtools faidx hg38.STRdecoys.sorted.fasta

Create BED file specifying locations of STR decoy chromosomes:
grep STR hg38.STRdecoys.sorted.fasta.fai | awk -v OFS='\t' '{ print $1, 0, $2 -1 }' > STRdecoys.bed

Sort the BED file: bedtools sort -i STRdecoys.bed > STRdecoys.sorted.bed

Generate genome file:
cut -f1,2 hg38.STRdecoys.sorted.fasta.fai > hg38.STRdecoys.sorted.fasta.genome

Create a BED file specifying all STRs in the reference genome

For genomes listed in UCSC and annotated with STRs you can download and use this annotation. For genomes not listed in UCSC, see the instructions for running Tandem Repeat Finder below.

For a genome that has already been annotated with STRs

Download the Tandem Repeats Finder annotation of the reference genome from UCSC
File name: hg38.simpleRepeat.txt.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/simpleRepeat.txt.gz -O hg38.simpleRepeat.txt.gz

Filter to only repeat units of 6bp or less, then extract columns required to create a bed file:
gzip -dc hg38.simpleRepeat.txt.gz | awk '($6 <= 6)' | awk -v OFS='\t' '{print $2,$3,$4,$17,$7}' > hg38.simpleRepeat_period1-6.bed

Remove duplicate lines (lines where two different repeat units are annotated at the same locus):
Rscript STRetch/scripts/dedupSTRannotation.R -i hg38.simpleRepeat_period1-6.bed -o hg38.simpleRepeat_period1-6.dedup.bed

Sort:
bedtools sort -i hg38.simpleRepeat_period1-6.dedup.bed > hg38.simpleRepeat_period1-6.dedup.sorted.bed

For a genome that has not been annotated with STRs

For a new genome that has not already been annotated with STRs, you will need to run Tandem Repeat Finder.

Download TRF from: https://tandem.bu.edu/trf/trf409.linux64.download.html

Move the file to your linux cluster, and set it to be executable. e.g.

chmod ug+xr trf409.linux64

Run TRF on your entire original reference genome (without the extra STR chromosomes). These are the default settings for TRF, except limiting it to detect STRs with up to 6bp repeat units.

./trf409.linux64 hg38.fasta 2 7 7 80 10 50 6 -dh

Convert to BED format with the provided script.

python STRetch/scripts/TRFdat_to_bed.py --dat hg38.fasta.2.7.7.80.10.50.6.dat --bed hg38.trf_period1-6.bed

Remove duplicate lines (lines where two different repeat units are annotated at the same locus):

Rscript STRetch/scripts/dedupSTRannotation.R -i hg38.trf_period1-6.bed -o hg38.trf_period1-6.dedup.bed

Sort:

bedtools sort -i hg38.trf_period1-6.dedup.bed > hg38.trf_period1-6.dedup.sorted.bed

Check you have all the reference data required by STRetch

Edit pipeline_config.groovy to point to these new fasta and bed files (the indexes must be in the same directory, but don't need to be specified). You will need to edit the lines:

REF="/home/hdashnow/STRetch/reference-data/hg38.STRdecoys.sorted.fasta"
STR_BED="/home/hdashnow/STRetch/reference-data/hg38.simpleRepeat_period1-6_dedup.sorted.bed"

You should now, at a minimum, have the following files in STRetch/reference-data:

hg38.simpleRepeat_period1-6.dedup.sorted.bed
hg38.STRdecoys.sorted.fasta
hg38.STRdecoys.sorted.fasta.amb
hg38.STRdecoys.sorted.fasta.ann
hg38.STRdecoys.sorted.fasta.bwt
hg38.STRdecoys.sorted.fasta.genome
hg38.STRdecoys.sorted.fasta.pac
hg38.STRdecoys.sorted.fasta.sa
STRdecoys.sorted.bed

If any of these files are missing, STRetch will not run!

Solutions to common reference issues

  • Make sure your chromosomes names match up in all files, for example "chr1" vs "1". You may need to edit the simpleRepeat bed file to ensure they match.
  • Make sure all your files are sorted.