In this document, I will detect the introgression region from KF21 (wheat Zhou18 X Aegilops tauschii T015) as an example.
Available methods
- Low depth (<1X) resequencing data k-mer (Recommend)
- High depth (>10X) resequencing data mapping
- Low depth (<1X) resequencing data mapping
- Genotyping by sequencing (GBS), Restriction site Associated DNA Sequencing (RAD-Seq)
- exon capture (on the way)
- Partent1 Aegilops tauschii T015 sequencing raw data (depth > 10X) and SNP data sets
- Partent2 wheat Zhou18 sequencing raw data (depth > 10X) and SNP data sets
- KF21 sequencing raw data
- whole sequencing raw data. > 5X
- extract subset 1X, 0.1X, 0.2X
This is an alignment-free method to detect introgression fragment.
- Step1: count each sample k-mer Use the parameter "-ci1" to contain all k-mer
- Step2: Generate KF21 k-mer dataset for detect introgression fragments
- Compare parent1 T015 and parent2 Zhou18 k-mer dataset, only remain T015 special k-mer. Using the parameter "-ci3" to ignore low-frequency k-mer, which may be caused by sequencing error
- Compare KF21 k-mer to "T015 special k-mer", only remain common k-mer
- Step3: The introgression fragment is determined according to the k-mer distribution density at the chromosome
- Benchmark
Dataset | Software | Usage | Data size (Gb) | CPU time (sec.) | Max Memory (Mb) | threads | Run time (sec.) |
---|---|---|---|---|---|---|---|
KF21 (0.1X) | KMC3/kmc | k-mer matrix | 1.5 | 211.61 | 25213 | 38 | 33 |
KF21 (0.2X) | KMC3/kmc | k-mer matrix | 3.3 | 305.95 | 35550 | 38 | 63 |
KF21 (1X) | KMC3/kmc | k-mer matrix | 16 | 1391.37 | 45264 | 38 | 210 |
T015 | KMC3/kmc | k-mer matrix | 115 | 11583.54 | 75530 | 38 | 1372 |
Zhou18 | KMC3/kmc | k-mer matrix | 107 | 11691.82 | 76187 | 38 | 1167 |
T015-Zhou18 | KMC3/kmc_tools | special k-mer | - | 2667.51 | 11301 | 38 | 7622 |
T015-Zhou18-KF21 (0.1X) | KMC3/kmc_tools | special k-mer | - | 243.55 | 2906 | 38 | 317 |
T015-Zhou18-KF21 (0.2X) | KMC3/kmc_tools | special k-mer | - | 478.82 | 3489 | 38 | 596 |
T015-Zhou18-KF21 (1X) | KMC3/kmc_tools | special k-mer | - | 1100.86 | 3775 | 38 | 2442 |
T015-Zhou18-KF21 (0.1X) | custom script | k-mer locus | - | 8612.44 | 21240 | 38 | 241 |
T015-Zhou18-KF21 (0.2X) | custom script | k-mer locus | - | 15557.39 | 21195 | 38 | 415 |
T015-Zhou18-KF21 (1X) | custom script | k-mer locus | - | 62510.59 | 21344 | 38 | 1619 |
base_dir=/cluster/home/baishenglong/AeT093_project/results/89.introgression_region_detect
mkdir -p $base_dir/introgression-kmer
cd $base_dir/introgression-kmer
mkdir -p $base_dir/introgression-kmer/00.prepare-data
cd $base_dir/introgression-kmer/00.prepare-data
ln -s /share/home/baishenglong/Ae_Pan_Geneome/data/01.Illumina_DNA/06.T015/PE150_add/clean_data/T015-1_HCTCFALXX_L3_1.clean.fq.gz ./T015_1.fq.gz
ln -s /share/home/baishenglong/Ae_Pan_Geneome/data/01.Illumina_DNA/06.T015/PE150_add/clean_data/T015-1_HCTCFALXX_L3_2.clean.fq.gz ./T015_2.fq.gz
ln -s /cluster/home/baishenglong/-/results/20200119-extract-introgression-reseq/raw_db/Zhou18_1.fq.gz ./Zhou18_1.fq.gz
ln -s /cluster/home/baishenglong/-/results/20200119-extract-introgression-reseq/raw_db/Zhou18_2.fq.gz ./Zhou18_2.fq.gz
read1=/cluster/home/baishenglong/AeT093_project/results/80.KF21/00.raw_data/2020-BC/Clean/6-1-3/V300075181_L1_WHEilaRAAAA-683_1.fq.gz
read2=/cluster/home/baishenglong/AeT093_project/results/80.KF21/00.raw_data/2020-BC/Clean/6-1-3/V300075181_L1_WHEilaRAAAA-683_2.fq.gz
# 0.1X --> 16000000000*0.1/150(read length)/2(read1 read2) = 5300000
bsubt "seqtk sample -s100 $read1 5000000 | gzip > KF21BC-5Mreads_1.fq.gz"
bsubt "seqtk sample -s100 $read2 5000000 | gzip > KF21BC-5Mreads_2.fq.gz"
# 0.2X --> 16000000000*0.2/150(read length)/2(read1 read2) = 10600000
bsubt "seqtk sample -s100 $read1 11000000 | gzip > KF21BC-11Mreads_1.fq.gz"
bsubt "seqtk sample -s100 $read2 11000000 | gzip > KF21BC-11Mreads_2.fq.gz"
# 1X --> 16000000000*1/150(read length)/2(read1 read2) = 53000000
bsubt "seqtk sample -s100 $read1 53000000 | gzip > KF21BC-53Mreads_1.fq.gz"
bsubt "seqtk sample -s100 $read2 53000000 | gzip > KF21BC-53Mreads_2.fq.gz"
# Step1: count each sample k-mer
mkdir -p $base_dir/introgression-kmer/01.kmer-freq-count
cd $base_dir/introgression-kmer/01.kmer-freq-count
for read1 in `ls ../00.prepare-data/*_1.fq.gz`;do
kmer=96
thread=38
sample_name=$(basename $read1 | cut -d '_' -f1)
read2=$(echo $read1 | sed 's/_1.fq.gz/_2.fq.gz/g')
mkdir $sample_name
echo -e "$read1\n$read2" > $sample_name/sample.list
bsubt -q normalB -n ${thread} "mkdir $sample_name/${sample_name}.k${kmer}.tmp; /cluster/home/baishenglong/softwares/KMC3/kmc -fq -k$kmer -t${thread} -m80 -ci1 @$sample_name/sample.list $sample_name/${sample_name}.k${kmer}_reads.jf $sample_name/${sample_name}.k${kmer}.tmp && rm -r $sample_name/${sample_name}.k${kmer}.tmp"
done
# Step2: Generate KF21 kmer dataset for detect introgression fragments
mkdir -p $base_dir/introgression-kmer/02.merge-kmer-freq
cd $base_dir/introgression-kmer/02.merge-kmer-freq
# parent unique kmer
parent1=T015
parent2=Zhou18
thread=38
bsubt -J unique-kmer -n ${thread} "/cluster/home/baishenglong/softwares/KMC3/kmc_tools -t${thread} simple ../01.kmer-freq-count/${parent1}/${parent1}.k96_reads.jf -ci3 ../01.kmer-freq-count/${parent2}/${parent2}.k96_reads.jf -ci3 kmers_subtract ${parent1}.k96.unique.jf"
# bsubt -J unique-kmer -n ${thread} "/cluster/home/baishenglong/softwares/KMC3/kmc_tools -t${thread} simple ../01.kmer-freq-count/${parent1}/${parent1}.k96_reads.jf -ci3 ../01.kmer-freq-count/${parent2}/${parent2}.k96_reads.jf -ci3 reverse_kmers_subtract ${parent2}.k96.unique.jf"
# population unique kmer
for read1 in `ls ../00.prepare-data/*_1.fq.gz | egrep -v "$parent1|$parent2"`;do
sample_name=$(basename $read1 | cut -d '_' -f1)
bsubt -w 328452 -J unique-kmer -n ${thread} "/cluster/home/baishenglong/softwares/KMC3/kmc_tools -t${thread} simple ${parent1}.k${kmer}.unique.jf ../01.kmer-freq-count/${sample_name}/${sample_name}.k${kmer}_reads.jf intersect ${parent1}.${sample_name}.k${kmer}.unique.jf && /cluster/home/baishenglong/softwares/KMC3/kmc_tools -t${thread} transform ${parent1}.${sample_name}.k${kmer}.unique.jf dump ${parent1}.${sample_name}.k${kmer}.unique.dump"
done
# Step3: k-mer at genome locus
mkdir -p $base_dir/introgression-kmer/03.kmer-locus
cd $base_dir/introgression-kmer/03.kmer-locus
reference=/cluster/home/baishenglong/Ae_T093/101.novogene/final_genome_anno/assembly/Ae_T093.genome.fa
genome_windows=/cluster/home/baishenglong/Ae_T093/101.novogene/final_genome_anno/assembly/Ae_T093_windows1M_step500k.bed
for read1 in `ls ../00.prepare-data/*_1.fq.gz | egrep -v "$parent1|$parent2"`;do
sample_name=$(basename $read1 | cut -d '_' -f1)
# convert to fasta
bsubt -n $thread "number=0; cut -f 1 ../02.merge-kmer-freq/${parent1}.${sample_name}.k${kmer}.unique.dump | awk '{number += 1; print \">kmer\"number\"\n\"\$1}' | pigz -p 8 > ${parent1}.${sample_name}.k${kmer}.unique.kmer.fa.gz" 1>$sample_name.jobid; sleep 1
# bwa / blast
jobid=$(cat $sample_name.jobid | cut -d '<' -f 2 | cut -d '>' -f 1)
# bsubt -q normalB -n 38 -w $jobid "zcat ${parent1}.${sample_name}.k${kmer}.unique.kmer.fa.gz | parallel --noswap -j 80% --block 10k --recstart '>' --pipe blastn -outfmt 6 -db $reference -evalue 10 -word_size 45 -num_alignments 3 -query - | awk '\$3==100&&\$4==$kmer' > ${parent1}.${sample_name}.k${kmer}.unique.kmer.blast" >jobid
bsubt -w $jobid -n $thread "bwa mem -k 45 -M -Y -t $thread $reference ${parent1}.${sample_name}.k${kmer}.unique.kmer.fa.gz | grep -v -e 'XA:Z:' -e 'SA:Z:' | grep 'MD:Z:96' | cut -f 1,3,4 > ${parent1}.${sample_name}.k${kmer}.unique.kmer.locus.tsv" 1>jobid; sleep 1
# convert bed file
jobid=$(cat jobid | cut -d '<' -f 2 | cut -d '>' -f 1)
bsubt -w $jobid "sort -k2,2 -k3,3n ${parent1}.${sample_name}.k${kmer}.unique.kmer.locus.tsv | awk '{print \$2\"\t\"\$3\"\t\"\$3+1}' > ${parent1}.${sample_name}.k${kmer}.unique.kmer.locus.sort.bed" 1>jobid; sleep 1
# plot snp-distrubtion
jobid=$(cat jobid | cut -d '<' -f 2 | cut -d '>' -f 1)
bsubt -w $jobid "bedtools coverage -a $genome_windows -b ${parent1}.${sample_name}.k${kmer}.unique.kmer.locus.sort.bed -counts | grep '^Chr' > ${parent1}.${sample_name}.k${kmer}.snp-distubtion.tsv && Rscript SNP_distrubtion.R ${parent1}.${sample_name}.k${kmer}.snp-distubtion.tsv ${parent1}.${sample_name}.k${kmer}.snp-distubtion.png 30000"
done
# reference=/cluster/home/baishenglong/Ae_T093/101.novogene/final_genome_anno/assembly/Ae_T093.genome.fa
# bsubt -q fatB "python ../fasta-kmer-split.py --input_genome $reference --input_kmer $kmer > genome.kmer.db"
- KF21 depth 1X introgression region detection
Reference step:
- Using GATK pipeline to get KF21 SNPs data
- Compare KF21 SNPs with Zhou18 and T015, remain KF21 same with T015 and different with Zhou18 SNPs data
- Using 1 Mb windows and 500 kb steps to plot KF21, Zhou18 and T015 SNP distrubtion at each chromosome
- Introgression region at KF21 SNP peak locus
Huang et al., 2009, Genome research and Xie et al., 2010, PNAS introduced the method of using low-depth resequencing data to genotype the population, respectively. Recently, Zhao et al, 2020, Plant Methods completed the QTL mapping of the maize RIL population using a similar method. These methods can also be applied to the detection and tracking of introgression fragments.