This is a method for finding key features of a genome assembly (genome.fa), including the locations, lengths, and minimum repeat units of telomeres and centromeres. Here, we use the grape T2T genome as an example to show how to identify them visually and accurately.
Install TIDK
conda install -c bioconda tidk
You can check that the installation works by running tidk -h
tidk explore -f genome.fa --minimum 5 --maximum 12 -o ./genome.explore -t 10 --log --dir /your/path/telomere_find --extension tsv
TTTAGGG/CCCTAAA is the common telomeric repeat in most plants.
Reference website: http://telomerase.asu.edu/sequences_telomere.html.
tidk search -f genome.fa -s TTTAGGG -o ./genome.search --dir /your/path/telomere_find
tidk plot -c /your/path/telomere_find/genome.search_telomeric_repeat_windows.csv -o ./genome.search
You can locate the telomeres both visually, using the .svg image file, and numerically, using the original .csv file.
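To read the windows table without opening it by hand, a minimal Python sketch could list the windows with many telomeric repeats. It assumes the table has the columns id, window, forward_repeat_number, and reverse_repeat_number, so please check the header of your own file first. The function name and the threshold of 50 repeats are hypothetical choices for illustration, not part of tidk.

```python
import csv
import io


def telomere_windows(handle, min_repeats=50, delimiter=","):
    """Return (sequence id, window, forward count, reverse count) for every
    window whose forward or reverse repeat count reaches min_repeats.

    Column names are an assumption about the tidk search output.
    """
    hits = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        fwd = int(row["forward_repeat_number"])
        rev = int(row["reverse_repeat_number"])
        if max(fwd, rev) >= min_repeats:
            hits.append((row["id"], int(row["window"]), fwd, rev))
    return hits


# Made-up table standing in for the real windows file.
example = io.StringIO(
    "id,window,forward_repeat_number,reverse_repeat_number,telomeric_repeat\n"
    "chr01,10000,0,412,TTTAGGG\n"
    "chr01,20000,1,0,TTTAGGG\n"
    "chr01,30000,380,0,TTTAGGG\n"
)
for hit in telomere_windows(example):
    print(hit)
```

For a real file, pass an open file handle instead of the in-memory example, and set delimiter="\t" if your table is tab-separated. Windows reported near the start or end of a chromosome are the telomere candidates to confirm in the plot.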
This method for locating centromeres is based on previous reports, such as Song et al., 2021, Sork et al., 2022, and Hofstatter et al., 2022.
Please download two tools, EDTA and TRF, before you start searching.
# Install
git clone https://github.com/oushujun/EDTA.git
cd EDTA
conda env create -f EDTA.yml
conda activate EDTA
perl EDTA.pl
# RUN
source /your/path/anaconda3/bin/activate EDTA
perl EDTA.pl --genome genome.fa --sensitive 1 --overwrite 1 --anno 1 --species others --threads 10
If you have enough different TE families from related species, you can build your own pan-repeat database with RepeatModeler. We also provide a method from our study, which you can find in PanGenomeTE.
As reported, centromeres are closely associated with TEs (LTR_Copia, LTR_Gypsy, etc.) and with specific centromeric tandem repeat units. Therefore, we extract the records matching keywords such as Copia, Gypsy, and Helitron from genome.mod.EDTA.TEanno.gff3, which gives one .gff3 file per TE type.
grep 'Copia' genome.mod.EDTA.TEanno.gff3 > TE_Copia.split.gff3
grep 'Gypsy' genome.mod.EDTA.TEanno.gff3 > TE_Gypsy.split.gff3
grep 'Helitron' genome.mod.EDTA.TEanno.gff3 > TE_Helitron.split.gff3
grep 'MULE-MuDR' genome.mod.EDTA.TEanno.gff3 > TE_MULE-MuDR.split.gff3
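If you prefer to split all TE types in one pass, the grep commands above can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, the example lines are made up, and unlike grep it skips comment lines starting with #.

```python
def split_by_keyword(lines, keywords):
    """Group GFF3 lines by keyword, using the same substring match as grep.

    A line that contains several keywords is placed in every matching group.
    """
    groups = {kw: [] for kw in keywords}
    for line in lines:
        if line.startswith("#"):
            continue  # skip GFF3 header and comment lines
        for kw in keywords:
            if kw in line:
                groups[kw].append(line)
    return groups


# Made-up annotation lines for illustration only.
example = [
    "##gff-version 3\n",
    "chr01\tEDTA\tCopia_LTR_retrotransposon\t100\t900\t.\t+\t.\tID=TE_1\n",
    "chr01\tEDTA\tGypsy_LTR_retrotransposon\t2000\t5000\t.\t-\t.\tID=TE_2\n",
    "chr02\tEDTA\thelitron\t300\t700\t.\t+\t.\tID=TE_3;Classification=DNA/Helitron\n",
]
groups = split_by_keyword(example, ["Copia", "Gypsy", "Helitron", "MULE-MuDR"])
for keyword, matched in groups.items():
    print(keyword, len(matched))
```

To reproduce the files above, write each group to TE_<keyword>.split.gff3 after reading the lines from genome.mod.EDTA.TEanno.gff3.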
# Install
conda create -n TRF
conda activate TRF
conda install -c bioconda trf
# RUN (default)
trf yoursequence.fa 2 7 7 80 10 50 500 -f -d -m
python TRF2GFF.py -d trf_output.dat -o genome_trf.gff3
# Filter redundant information
cat genome_trf.gff3 | awk '{split($9,a,";");print $1"\t"$4"\t"$5"\t"a[1]"\t"a[2]"\t"a[3]"\t"a[4]"\t"a[9]}' > genome_trf.split.txt
You can find TRF2GFF.py in TRF2GFF.
The content of genome_trf.split.txt looks as follows.
PN01 2 668 ID=TRF_00001 period=7 copies=94.9 consensus_size=7 cons_seq=AAGTTTA
PN01 592 699 ID=TRF_00002 period=7 copies=14.6 consensus_size=7 cons_seq=GTTTCAC
...
Open genome_trf.split.txt in Excel (Microsoft Office) and summarize it with a PivotTable, as introduced in this workflow.
Then screen the top five repeat units by period in each chromosome (filter conditions: period >= 30 and copies >= 2.0; Melters et al., 2013), and extract each of these repeat units from genome_trf.gff3.
Note: for grapevine, we have analyzed numerous T2T genomes using period >= 30. If the repeat units are not clearly visible in IGV for your species, we recommend filtering on period >= 100, as outlined in the Supplementary Table. This helps reduce potential errors, such as period=30/31/32/33/etc., when extracting records by period=XXX.
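If Excel is not convenient, the PivotTable step can be approximated with a short Python sketch. It assumes the genome_trf.split.txt layout shown above (sequence name, start, end, then key=value fields) and takes "top five" to mean the five period values that occur most often on each chromosome, which is our reading of the PivotTable summary. The function name and the example rows are hypothetical.

```python
from collections import Counter, defaultdict


def top_periods(rows, min_period=30, min_copies=2.0, top_n=5):
    """Count how often each period occurs per chromosome after filtering,
    and return the top_n most frequent periods for each chromosome."""
    counts = defaultdict(Counter)
    for row in rows:
        fields = row.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        # Turn the key=value fields (period=..., copies=..., ...) into a dict.
        attrs = dict(f.split("=", 1) for f in fields[3:] if "=" in f)
        period = int(attrs["period"])
        copies = float(attrs["copies"])
        if period >= min_period and copies >= min_copies:
            counts[fields[0]][period] += 1
    return {chrom: c.most_common(top_n) for chrom, c in counts.items()}


# Made-up rows in the genome_trf.split.txt layout.
example = [
    "PN01\t2\t668\tID=TRF_00001\tperiod=7\tcopies=94.9\tconsensus_size=7\tcons_seq=AAGTTTA",
    "PN01\t9000\t9900\tID=TRF_00010\tperiod=107\tcopies=8.4\tconsensus_size=107\tcons_seq=ACGT",
    "PN01\t9950\t10500\tID=TRF_00011\tperiod=107\tcopies=5.1\tconsensus_size=107\tcons_seq=ACGT",
    "PN01\t11000\t11800\tID=TRF_00012\tperiod=214\tcopies=3.7\tconsensus_size=214\tcons_seq=ACGT",
    "PN01\t12000\t12400\tID=TRF_00013\tperiod=321\tcopies=1.2\tconsensus_size=321\tcons_seq=ACGT",
]
for chrom, periods in top_periods(example).items():
    print(chrom, periods)
```

Setting min_period=100 applies the stricter filter from the note above. The periods reported for each chromosome are the values to use as period=XXX in the extraction step.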
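# -w matches period=XXX as a whole token, so e.g. period=42 does not also match period=420
grep -w 'period=107' genome_trf.gff3 > trf_107bp.split.gff3
grep -w 'period=214' genome_trf.gff3 > trf_214bp.split.gff3
grep -w 'period=428' genome_trf.gff3 > trf_428bp.split.gff3
...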
A published study (Melters et al., 2013) compared the tandem repeats found in the centromeric regions of many species. You can use its Supplementary Table as a reference.
Please download IGV from the official website first.
Please prepare four types of files: genome.fa, genome.gff3, TE_XXX.split.gff3, and trf_XXXbp.split.gff3. Then load them into IGV to visualize your data. You can zoom in and out and delimit the core centromeric region according to the density of genome annotation, TE, and TRF features, as follows.
In centromeric regions, genome.gff3 and TE_XXX.split.gff3 show a low feature density, while the (top five) trf_XXXbp.split.gff3 tracks show a high-density peak.
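The density pattern described above can also be screened numerically before opening IGV. The following Python sketch counts features per fixed-size window and reports windows with few gene features and many TRF features. It is a rough complement to the visual inspection, not a replacement for it. The function names, the 100 kb window, and both thresholds are hypothetical and should be tuned for your genome.

```python
from collections import Counter


def window_counts(gff3_lines, window=100_000):
    """Count features per (sequence id, window index), binned by start position."""
    counts = Counter()
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # not a feature line
        start = int(cols[3])  # GFF3 column 4 is the 1-based start
        counts[(cols[0], (start - 1) // window)] += 1
    return counts


def candidate_windows(gene_counts, trf_counts, max_genes=0, min_trf=3):
    """Return windows with at least min_trf TRF features and at most max_genes genes."""
    return sorted(
        key for key, n in trf_counts.items()
        if n >= min_trf and gene_counts.get(key, 0) <= max_genes
    )


def feature(seqid, start, end):
    """Build a made-up GFF3 line for the example below."""
    return f"{seqid}\tdemo\tregion\t{start}\t{end}\t.\t+\t.\tID=x\n"


genes = [feature("chr01", 1000, 3000), feature("chr01", 50000, 52000),
         feature("chr01", 250000, 252000)]
trfs = [feature("chr01", 120000, 121000), feature("chr01", 130000, 131000),
        feature("chr01", 150000, 151000), feature("chr01", 260000, 261000)]
print(candidate_windows(window_counts(genes), window_counts(trfs)))
```

A window index i covers positions i * window + 1 to (i + 1) * window, so the reported windows can be entered directly as a region in IGV for closer inspection.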
In grape, the top five repeat units on most chromosomes are 107 bp and its multiples, such as 214 bp, 321 bp, and 428 bp.
However, chr03 and chr18 show different patterns: there the repeat units can be 135 bp and 66 bp (and their multiples).
Finally, you can record the coordinates of the core centromeric region, which are shown at the top of the IGV window.
Shi, Xiaoya, et al. "The complete reference genome for grapevine (Vitis vinifera L.) genetics and breeding." Horticulture Research (2023): uhad061.
Xu Wang ([email protected])