Skip to content

A R package for QTLseq.

License

GPL-3.0, GPL-3.0 licenses found

Licenses found

GPL-3.0
LICENSE
GPL-3.0
LICENSE.md
Notifications You must be signed in to change notification settings

laowang1992/easyQTLseq

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

[ English | 简体中文 ]

easyQTLseq: A R Package for QTLseq Analysis

EasyQTLseq is a R package for QTL-seq analysis.

Installation

EasyQTLseq can be installed from GitHub via devtools:

# install devtools
install.packages("devtools")
# install easyQTLseq
devtools::install_github("laowang1992/easyQTLseq")

Input Data

Use GATK VariantsToTable

The VCF file contained parent(s) and two bulks was generated from GATK best practices pipeline. Utilizing VariantsToTable function of GATK to extract GT, AD and GQ information for improving reading and processing speed.

java -Xmx30g -jar ${GATK} \
     -R ${genome} -T VariantsToTable \
     -F CHROM -F POS -F REF -F ALT -GF GT -GF AD -GF GQ \
     -V ${BSA}.filter.SNPs.vcf.gz -o ${BSA}.filter.SNPs.table

Use vcf2table

If you only have a vcf file, and you don’t install GATK on your computer. Then you can read the vcf file using vcfR package, and use vcf2table() function in easyQTLseq to convert the vcf object to a table, which is similar the output of GATK’s VariantsToTable. The output of vcf2table can be used for select_sample_and_SNP() directly.

library(vcfR)
library(easyQTLseq)
file_path <- system.file("extdata", "A07.SNPs.vcf.gz", package = "easyQTLseq")
x <- read.vcfR(file = file_path)
data <- vcf2table(x = x)

Usage

Library and import data

A sample file is included with this package. There are four samples in this sample file, R3 is the high phenotype parent sample, qY is the low phenotype parent sample, R is the high phenotype bulk sample, and Y is the low phenotype bulk sample.

library(easyQTLseq)
# Example with sample data from a GATK table.
file_path <- system.file("extdata", "subset.table.gz", package = "easyQTLseq")
# readr::read_tsv() has a faster speed than read.table() when reading a file.
data <- readr::read_tsv(file = file_path)
## Rows: 515669 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (11): CHROM, REF, ALT, R.GT, R.AD, R3.GT, R3.AD, Y.GT, Y.AD, qY.GT, qY.AD
## dbl  (5): POS, R.GQ, R3.GQ, Y.GQ, qY.GQ
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Select samples and SNPs

Some basic information should be assigned use select_sample_and_SNP(). This function will return a QTLseq S3 object.

x <- select_sample_and_SNP(data = data, highP = "qY", lowP = "R3", highB = "Y", lowB = "R", popType = "F2", bulkSize = c(30, 30))
x
## $data
## # A tibble: 128,751 × 12
##    CHROM          POS REF   ALT   HP.DP LP.DP HB.HP.AD HB.LP.AD HB.DP LB.HP.AD
##    <chr>        <dbl> <chr> <chr> <int> <int>    <int>    <int> <int>    <int>
##  1 scaffoldA01 275959 A     C        17    24        6        6    12       16
##  2 scaffoldA01 320766 G     A        19    11        6        6    12       10
##  3 scaffoldA01 361230 A     T        18     6        6        5    11        8
##  4 scaffoldA01 361486 G     T        26     9       13        5    18        8
##  5 scaffoldA01 361875 A     C        19     5        8        0     8        4
##  6 scaffoldA01 361884 T     G        19     5        8        0     8        6
##  7 scaffoldA01 362128 A     C        19     6        7        5    12        4
##  8 scaffoldA01 362153 T     G        20     5        8        5    13        6
##  9 scaffoldA01 364170 C     T        17    14       12        9    21       14
## 10 scaffoldA01 365454 T     C        18    10       11        5    16       15
## # ℹ 128,741 more rows
## # ℹ 2 more variables: LB.LP.AD <int>, LB.DP <int>
## 
## $highP
## [1] "qY"
## 
## $lowP
## [1] "R3"
## 
## $highB
## [1] "Y"
## 
## $lowB
## [1] "R"
## 
## $popType
## [1] "F2"
## 
## $bulkSize
## [1] 30 30
## 
## $slidwin
## data frame with 0 columns and 0 rows
## 
## $chrLen
## # A tibble: 3 × 2
##   CHROM            Len
##   <chr>          <dbl>
## 1 scaffoldA01 37932493
## 2 scaffoldA07 32276921
## 3 scaffoldA09 65861795
## 
## attr(,"class")
## [1] "QTLseq"     "WithParent" "BothParent"

This function can handle different situation, such as both parents of the segregation population are present, only high parent or low parent is present, and no parent is present, or one of these present has a reference genome used for SNP calling.

# If only one parent is present, e.g. high parent.
x_onlyHP <- select_sample_and_SNP(data = data, highP = "qY", highB = "Y", lowB = "R", popType = "F2", bulkSize = c(30, 30))
# If no parent is present.
x_noParent <- select_sample_and_SNP(data = data, highB = "Y", lowB = "R", popType = "F2", bulkSize = c(30, 30))
# If no parent is present, but high parent has a reference genome, this reference genome is used for SNP calling. Then the `highP` parameter should be "REF".
x_HPisREF <- select_sample_and_SNP(data = data, highP = "REF", highB = "Y", lowB = "R", popType = "F2", bulkSize = c(30, 30))

Depth statistics

After select samples and SNPs, the coverage depth distribution of every samples is calculated, then a density distribution figure is drawn, named <outPrefix>.depth_density.pdf|png

depth_statistics(x = x, outPrefix = "outprefix")
outprefix.depth_density.png

outprefix.depth_density.png

Filter SNP according depth distribution

For low coverage depth SNP may have low reliability and accuracy, and extremely high coverage depth may be derived from repetitive sequence. These SNP should be omited.

# default minimum coverage depth is 6, default maximum coverage depth is `average+3*sd`.
x_filter <- filterDP(x = x)

SNP distribution

SNPs may be unevenly distributed on chromosomes. This step will show the distribution of SNP along the chromosome and generate <outprefix>.SNP_number_per_chr.txt|csv and <outprefix>.SNP_distribution_histogram.pdf|png.

SNP_distribution(x = x_filter, outPrefix = "outprefix", 
                 targetChr = c("scaffoldA01", "scaffoldA07", "scaffoldA09"), 
                 chrLabel = c("A01", "A07", "A09"))
outprefix.SNP_distribution_histogram.png

outprefix.SNP_distribution_histogram.png

Export depth coverage information

If you want analyze QTL-seq using other method or software, you can export allele depths information using export_dp(). This information will be export to <outprefix>.Depth_information.txt|csv in work directory.

export_dp(x = x_filter, outPrefix = "outprefix")

Calculate using sliding window

To reduce noise in QTL-seq analsis, a sliding window method is adopted to calculate SNP index, delta SNP index, and Euclidean distance (ED).

If parent is present in the data or parent has a reference genome which is used for SNP calling, both delta SNP index and DE are calculated, if no parent is present in the data, only ED is calculated.

x_filter <- calc_index_etc(x = x_filter, outPrefix = "outprefix", winSize = 2000000, winStep = 200000)

Export figures

After calculating delta SNP index and DE, the result can be show along the chromosome. This function will export the figures.

export_figure(x = x_filter, 
              outPrefix = "outprefix", 
              targetChr = c("scaffoldA01", "scaffoldA07", "scaffoldA09"), # Target chromosome to be drawn in figures, default is all chromosomes in the data.
              chrLabel = c("A01", "A07", "A09"),                          # The label for chromosome shown in figures, default is chromosome names in the data.
              minN = 20,                                                  # Too few SNPs in a window will result in noise, the windows containing SNPs less than minN will be omitted in figures.
              width = 6, height = 3)
outprefix.delta_SNP_index.99CI.line.png

outprefix.delta_SNP_index.99CI.line.png

outprefix.ED4.line.png

outprefix.ED4.line.png

Get significant QTL region

If parent is present in the data or parent has a reference genome which is used for SNP calling, delta SNP index confidence intervals for different read depths under the null hypothesis (no QTL) as obtained by simulation test (10,000 replications for each read depth). The chromosome regions exceed 95% or 99% confidence intervals are considered as significant QTL region.

getQTL_and_exportFigure(x = x_filter, outPrefix = "outprefix", minN = 20)
outprefix.scaffoldA07.99CI.png

outprefix.scaffoldA07.99CI.png

outprefix.scaffoldA07.ED4.png

outprefix.scaffoldA07.ED4.png

Citation

If this package is used in your research, you should cite this package in the method section, like:

The QTL-seq analysis was performed using R package easyQTLseq (https://github.com/laowang1992/easyQTLseq.git).

This paper is also recommended to cite:

Takagi H, Abe A, Yoshida K, et al. QTL-seq: rapid mapping of quantitative trait loci in rice by whole genome resequencing of DNA from two bulked populations. Plant J. 2013;74(1):174-183. doi:10.1111/tpj.12105