Skip to content

Detection and tracking of introgression fragments in the introgression population

Notifications You must be signed in to change notification settings

slbai01/Introgression-region-detect

Repository files navigation

Introgression region detecting and tracking

In this document, I will detect the introgression region from KF21 (wheat Zhou18 X Aegilops tauschii T015) as an example.

Available methods

  1. Low depth (<1X) resequencing data k-mer (Recommend)
  2. High depth (>10X) resequencing data mapping
  3. Low depth (<1X) resequencing data mapping
  4. Genotyping by sequencing (GBS), Restriction site Associated DNA Sequencing (RAD-Seq)
  5. exon capture (on the way)

prepare-data

  1. Partent1 Aegilops tauschii T015 sequencing raw data (depth > 10X) and SNP data sets
  2. Partent2 wheat Zhou18 sequencing raw data (depth > 10X) and SNP data sets
  3. KF21 sequencing raw data
    1. whole sequencing raw data. > 5X
    2. extract subset 1X, 0.1X, 0.2X

Method 1. Low depth resequencing data k-mer

This is an alignment-free method to detect introgression fragment.

kmer-method-workflow

  1. Step1: count each sample k-mer Use the parameter "-ci1" to contain all k-mer
  2. Step2: Generate KF21 k-mer dataset for detect introgression fragments
    1. Compare parent1 T015 and parent2 Zhou18 k-mer dataset, only remain T015 special k-mer. Using the parameter "-ci3" to ignore low-frequency k-mer, which may be caused by sequencing error
    2. Compare KF21 k-mer to "T015 special k-mer", only remain common k-mer
  3. Step3: The introgression fragment is determined according to the k-mer distribution density at the chromosome
  • Benchmark
Dataset Software Usage Data size (Gb) CPU time (sec.) Max Memory (Mb) threads Run time (sec.)
KF21 (0.1X) KMC3/kmc k-mer matrix 1.5 211.61 25213 38 33
KF21 (0.2X) KMC3/kmc k-mer matrix 3.3 305.95 35550 38 63
KF21 (1X) KMC3/kmc k-mer matrix 16 1391.37 45264 38 210
T015 KMC3/kmc k-mer matrix 115 11583.54 75530 38 1372
Zhou18 KMC3/kmc k-mer matrix 107 11691.82 76187 38 1167
T015-Zhou18 KMC3/kmc_tools special k-mer - 2667.51 11301 38 7622
T015-Zhou18-KF21 (0.1X) KMC3/kmc_tools special k-mer - 243.55 2906 38 317
T015-Zhou18-KF21 (0.2X) KMC3/kmc_tools special k-mer - 478.82 3489 38 596
T015-Zhou18-KF21 (1X) KMC3/kmc_tools special k-mer - 1100.86 3775 38 2442
T015-Zhou18-KF21 (0.1X) custom script k-mer locus - 8612.44 21240 38 241
T015-Zhou18-KF21 (0.2X) custom script k-mer locus - 15557.39 21195 38 415
T015-Zhou18-KF21 (1X) custom script k-mer locus - 62510.59 21344 38 1619
base_dir=/cluster/home/baishenglong/AeT093_project/results/89.introgression_region_detect
mkdir -p $base_dir/introgression-kmer
cd $base_dir/introgression-kmer

mkdir -p $base_dir/introgression-kmer/00.prepare-data
cd $base_dir/introgression-kmer/00.prepare-data

ln -s /share/home/baishenglong/Ae_Pan_Geneome/data/01.Illumina_DNA/06.T015/PE150_add/clean_data/T015-1_HCTCFALXX_L3_1.clean.fq.gz ./T015_1.fq.gz
ln -s /share/home/baishenglong/Ae_Pan_Geneome/data/01.Illumina_DNA/06.T015/PE150_add/clean_data/T015-1_HCTCFALXX_L3_2.clean.fq.gz ./T015_2.fq.gz
ln -s /cluster/home/baishenglong/-/results/20200119-extract-introgression-reseq/raw_db/Zhou18_1.fq.gz ./Zhou18_1.fq.gz
ln -s /cluster/home/baishenglong/-/results/20200119-extract-introgression-reseq/raw_db/Zhou18_2.fq.gz ./Zhou18_2.fq.gz

read1=/cluster/home/baishenglong/AeT093_project/results/80.KF21/00.raw_data/2020-BC/Clean/6-1-3/V300075181_L1_WHEilaRAAAA-683_1.fq.gz
read2=/cluster/home/baishenglong/AeT093_project/results/80.KF21/00.raw_data/2020-BC/Clean/6-1-3/V300075181_L1_WHEilaRAAAA-683_2.fq.gz
# 0.1X --> 16000000000*0.1/150(read length)/2(read1 read2) = 5300000
bsubt "seqtk sample -s100 $read1 5000000 | gzip > KF21BC-5Mreads_1.fq.gz"
bsubt "seqtk sample -s100 $read2 5000000 | gzip > KF21BC-5Mreads_2.fq.gz"
# 0.2X --> 16000000000*0.2/150(read length)/2(read1 read2) = 10600000
bsubt "seqtk sample -s100 $read1 11000000 | gzip > KF21BC-11Mreads_1.fq.gz"
bsubt "seqtk sample -s100 $read2 11000000 | gzip > KF21BC-11Mreads_2.fq.gz"
# 1X --> 16000000000*1/150(read length)/2(read1 read2) = 53000000
bsubt "seqtk sample -s100 $read1 53000000 | gzip > KF21BC-53Mreads_1.fq.gz"
bsubt "seqtk sample -s100 $read2 53000000 | gzip > KF21BC-53Mreads_2.fq.gz"

# Step1: count each sample k-mer
mkdir -p $base_dir/introgression-kmer/01.kmer-freq-count
cd $base_dir/introgression-kmer/01.kmer-freq-count

for read1 in `ls ../00.prepare-data/*_1.fq.gz`;do 
  kmer=96
  thread=38
  sample_name=$(basename $read1 | cut -d '_' -f1)
  read2=$(echo $read1 | sed 's/_1.fq.gz/_2.fq.gz/g')
  mkdir $sample_name
  echo -e "$read1\n$read2" > $sample_name/sample.list
  bsubt -q normalB -n ${thread} "mkdir $sample_name/${sample_name}.k${kmer}.tmp; /cluster/home/baishenglong/softwares/KMC3/kmc -fq -k$kmer -t${thread} -m80 -ci1 @$sample_name/sample.list $sample_name/${sample_name}.k${kmer}_reads.jf $sample_name/${sample_name}.k${kmer}.tmp && rm -r $sample_name/${sample_name}.k${kmer}.tmp"
done

# Step2: Generate KF21 kmer dataset for detect introgression fragments
mkdir -p $base_dir/introgression-kmer/02.merge-kmer-freq
cd $base_dir/introgression-kmer/02.merge-kmer-freq

# parent unique kmer
parent1=T015
parent2=Zhou18
thread=38
bsubt -J unique-kmer -n ${thread} "/cluster/home/baishenglong/softwares/KMC3/kmc_tools -t${thread} simple ../01.kmer-freq-count/${parent1}/${parent1}.k96_reads.jf -ci3 ../01.kmer-freq-count/${parent2}/${parent2}.k96_reads.jf -ci3 kmers_subtract ${parent1}.k96.unique.jf"
# bsubt -J unique-kmer -n ${thread} "/cluster/home/baishenglong/softwares/KMC3/kmc_tools -t${thread} simple ../01.kmer-freq-count/${parent1}/${parent1}.k96_reads.jf -ci3 ../01.kmer-freq-count/${parent2}/${parent2}.k96_reads.jf -ci3 reverse_kmers_subtract ${parent2}.k96.unique.jf"

# population unique kmer
for read1 in `ls ../00.prepare-data/*_1.fq.gz | egrep -v "$parent1|$parent2"`;do 
  sample_name=$(basename $read1 | cut -d '_' -f1)
  bsubt -w 328452 -J unique-kmer -n ${thread} "/cluster/home/baishenglong/softwares/KMC3/kmc_tools -t${thread} simple ${parent1}.k${kmer}.unique.jf ../01.kmer-freq-count/${sample_name}/${sample_name}.k${kmer}_reads.jf intersect ${parent1}.${sample_name}.k${kmer}.unique.jf && /cluster/home/baishenglong/softwares/KMC3/kmc_tools -t${thread} transform ${parent1}.${sample_name}.k${kmer}.unique.jf dump ${parent1}.${sample_name}.k${kmer}.unique.dump"
done

# Step3: k-mer at genome locus
mkdir -p $base_dir/introgression-kmer/03.kmer-locus
cd $base_dir/introgression-kmer/03.kmer-locus

reference=/cluster/home/baishenglong/Ae_T093/101.novogene/final_genome_anno/assembly/Ae_T093.genome.fa
genome_windows=/cluster/home/baishenglong/Ae_T093/101.novogene/final_genome_anno/assembly/Ae_T093_windows1M_step500k.bed
for read1 in `ls ../00.prepare-data/*_1.fq.gz | egrep -v "$parent1|$parent2"`;do 
  sample_name=$(basename $read1 | cut -d '_' -f1)
  # convert to fasta
  bsubt -n $thread "number=0; cut -f 1 ../02.merge-kmer-freq/${parent1}.${sample_name}.k${kmer}.unique.dump | awk '{number += 1; print \">kmer\"number\"\n\"\$1}' | pigz -p 8 > ${parent1}.${sample_name}.k${kmer}.unique.kmer.fa.gz" 1>$sample_name.jobid; sleep 1
  # bwa / blast
  jobid=$(cat $sample_name.jobid | cut -d '<' -f 2 | cut -d '>' -f 1)
#   bsubt -q normalB -n 38 -w $jobid "zcat ${parent1}.${sample_name}.k${kmer}.unique.kmer.fa.gz | parallel --noswap -j 80% --block 10k --recstart '>' --pipe blastn -outfmt 6 -db $reference -evalue 10 -word_size 45 -num_alignments 3 -query - | awk '\$3==100&&\$4==$kmer' > ${parent1}.${sample_name}.k${kmer}.unique.kmer.blast" >jobid
  bsubt -w $jobid -n $thread "bwa mem -k 45 -M -Y -t $thread $reference ${parent1}.${sample_name}.k${kmer}.unique.kmer.fa.gz | grep -v -e 'XA:Z:' -e 'SA:Z:' | grep 'MD:Z:96' | cut -f 1,3,4 > ${parent1}.${sample_name}.k${kmer}.unique.kmer.locus.tsv" 1>jobid; sleep 1
  # convert bed file
  jobid=$(cat jobid | cut -d '<' -f 2 | cut -d '>' -f 1)
  bsubt -w $jobid "sort -k2,2 -k3,3n ${parent1}.${sample_name}.k${kmer}.unique.kmer.locus.tsv | awk '{print \$2\"\t\"\$3\"\t\"\$3+1}' > ${parent1}.${sample_name}.k${kmer}.unique.kmer.locus.sort.bed" 1>jobid; sleep 1
  # plot snp-distrubtion
  jobid=$(cat jobid | cut -d '<' -f 2 | cut -d '>' -f 1)
  bsubt -w $jobid "bedtools coverage -a $genome_windows -b ${parent1}.${sample_name}.k${kmer}.unique.kmer.locus.sort.bed -counts | grep '^Chr' > ${parent1}.${sample_name}.k${kmer}.snp-distubtion.tsv && Rscript SNP_distrubtion.R ${parent1}.${sample_name}.k${kmer}.snp-distubtion.tsv ${parent1}.${sample_name}.k${kmer}.snp-distubtion.png 30000"
done

# reference=/cluster/home/baishenglong/Ae_T093/101.novogene/final_genome_anno/assembly/Ae_T093.genome.fa
# bsubt -q fatB "python ../fasta-kmer-split.py --input_genome $reference --input_kmer $kmer > genome.kmer.db"
  • KF21 depth 1X introgression region detection

KF21-1X

  • KF21 depth 0.2X introgression region detection KF21-0.2X

  • KF21 depth 0.1X introgression region detection KF21-0.1X

Method 2. High depth (>10X) resequencing data mapping

Reference step:

  1. Using GATK pipeline to get KF21 SNPs data
  2. Compare KF21 SNPs with Zhou18 and T015, remain KF21 same with T015 and different with Zhou18 SNPs data
  3. Using 1 Mb windows and 500 kb steps to plot KF21, Zhou18 and T015 SNP distrubtion at each chromosome
  4. Introgression region at KF21 SNP peak locus

Method 3. Low depth (<1X) resequencing data mapping

Huang et al., 2009, Genome research and Xie et al., 2010, PNAS introduced the method of using low-depth resequencing data to genotype the population, respectively. Recently, Zhao et al, 2020, Plant Methods completed the QTL mapping of the maize RIL population using a similar method. These methods can also be applied to the detection and tracking of introgression fragments.

About

Detection and tracking of introgression fragments in the introgression population

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published