Vignette

Running a DTU analysis using equivalence class counts

This vignette demonstrates how users can perform differential transcript usage analysis using equivalence classes derived from the pseudo-aligners Salmon and Kallisto. Please see our f1000 paper for full details on the methodology and results.

1 Data

Obtain the toy data set, which is a subsampled version of the simulated Drosophila data from Soneson et al.[1]. We will be using samples 1-2 from condition A and samples 4-5 from condition B (this omits samples 3 and 6 in our toy example).

Download the following data: Drosophila toy data and put it into a directory called 'data'. Un-tar the data like so:

tar -xvzf Dm_subsampled_data.tar.gz

Next obtain the reference data:

rsync -av rsync://ftp.ensembl.org/ensembl/pub/release-95/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz .
 cd ..

2 Setup

Install the following:

Salmon or Kallisto
Python 2.7+ or 3+ with:
- Numpy
- Pandas
R

Using Anaconda is recommended as this will install Python with both Numpy and Pandas.

NOTE: Using Python 3+ is highly recommended for running the create_salmon_ec_count_matrix.py script. Running the script using Python 2 may result in significantly higher compute times.

Load R and install the following packages:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DEXSeq")
BiocManager::install("DRIMSeq")
BiocManager::install("tximport")
BiocManager::install("biomaRt")
install.packages('data.table')
install.packages('dplyr')

Install bpipe if you want to use the ec-dtu-pipe pipeline.

If you want to run the analyses manually, download the following scripts:

curl -O https://raw.githubusercontent.com/Oshlack/ec-dtu-paper/master/create_salmon_ec_count_matrix.py
curl -O https://raw.githubusercontent.com/Oshlack/ec-dtu-paper/master/create_kallisto_ec_count_matrix.py

3 EC-DTU-pipe

3.1 Setup

One quick and easy way to run EC DTU analyses with Salmon is to use our ec-dtu-pipe pipeline. Clone the pipeline as follows:

git clone [email protected]:Oshlack/ec-dtu-pipe.git

Make sure bpipe is installed in addition to the dependencies in Section 2.

2.2 Running the pipeline

To run the pipeline on our toy data, first create create a parameters file. Let's call it ec_params.txt:

-p pipe_dir=ec-dtu-pipe
-p python=python
-p salmon=salmon
-p salmon_index=salmon_index
-p txome=data/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz
-p bmart_dset=dmelanogaster_gene_ensembl
-p sample_regex=Dm
-p group=Dm.4,Dm.5
-p fastqFormat=%_sample_%_subsampled_*.fq.gz
-p cores=8
-p time=time

Run the pipeline as follows (note that the file paths are relative to your working directory):

bpipe run @ec_params.txt ec-dtu-pipe/ec-dtu-pipe.groovy \
  data/Dm_sample_1_subsampled_1.fq.gz data/Dm_sample_1_subsampled_2.fq.gz \
  data/Dm_sample_2_subsampled_1.fq.gz data/Dm_sample_2_subsampled_2.fq.gz \
  data/Dm_sample_4_subsampled_1.fq.gz data/Dm_sample_4_subsampled_2.fq.gz \
  data/Dm_sample_5_subsampled_1.fq.gz data/Dm_sample_5_subsampled_2.fq.gz

3.3 Viewing results

When the pipeline successfully completes, it will create an ec_results.RData file, which can be loaded into R and investigated:

load('ec_results.RData')

Access the results as follows:

results[['dexseq_object']] is the DEXSeq object
results[['dexseq_results']] is the DEXSeq results object
results[['gene_FDR']] is the results from DEXSeq's perGeneQValue method

4 Salmon

We can also run each of the steps manually. First, we need to create a Salmon index based on the reference transcriptome that we want to use.

salmon index -t data/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz -i salmon_index

We are now ready to run Salmon on our four samples. A concise way to do this is to use a bash for loop:

for i in 1 2 4 5; do
  salmon quant --index salmon_index -l A \
      --dumpEq --seqBias --gcBias \
      -1 data/Dm_sample_${i}_subsampled_1.fq.gz \
      -2 data/Dm_sample_${i}_subsampled_2.fq.gz \
      -p 8 -o salmon_output/sample${i}
done

The --dumpEq parameter is necessary in order to obtain equivalence class output. NOTE: if you are running later versions of Salmon (0.12 or later), you must use the --hardFilter flag if you use the --validateMappings flag, otherwise ECs will not be uniquely defined by their transcript memberships (which is required for our method).

4.1 Building the counts matrix

We provide a python script to match equivalence classes between samples and create a matrix of counts. The script can then be run like so:

python create_salmon_ec_count_matrix.py \
    salmon_output/sample1/aux_info/eq_classes.txt \
    salmon_output/sample2/aux_info/eq_classes.txt \
    salmon_output/sample4/aux_info/eq_classes.txt \
    salmon_output/sample5/aux_info/eq_classes.txt \
    sample1,sample2,sample4,sample5 \
    ec_matrix.txt

Note that this script will not accept wild-card input for the eq_classes.txt files. We can instead use bash to get around this:

python create_salmon_ec_count_matrix.py \
    `ls */aux_info/eq_classes.txt` \
    sample1,sample2,sample4,sample5 \
    ec_matrix.txt

5 Kallisto

Alternatively, we can create our EC counts using Kallisto. To skip to the DTU analysis, see Section 6.

First, we must first create an index:

kallisto index -i kallisto_index data/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz

Kallisto allows us to obtain equivalence classes by running in batch mode. For this, we require a file containing our sample names and fastq file paths. Create the following file under our working directory using your favourite text editor (remember to separate the columns by tabs) and call it samples.txt:

#id file1 file2
sample1 data/Dm_sample_1_subsampled_1.fq.gz data/Dm_sample_1_subsampled_2.fq.gz
sample2 data/Dm_sample_2_subsampled_1.fq.gz data/Dm_sample_2_subsampled_2.fq.gz
sample4 data/Dm_sample_4_subsampled_1.fq.gz data/Dm_sample_4_subsampled_2.fq.gz
sample5 data/Dm_sample_5_subsampled_1.fq.gz data/Dm_sample_5_subsampled_2.fq.gz

Now run Kallisto using pseudo-align mode:

kallisto pseudo -i kallisto_index -o kallisto_output -t 8 -b samples.txt

5.1 Building the counts matrix

For this step, we need a list of transcript IDs used for our reference. We can easily generate this using our fastq reference and a one-line bash-script:

zcat < data/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz | grep '>' | cut -d' ' -f 1 | sed 's/>//' > transcript_ids.txt

Now run the create_kallisto_ec_count_matrix.py script to prepare Kallisto's output for DEXSeq input:

python create_kallisto_ec_count_matrix.py \
        kallisto_output/matrix.ec \
        kallisto_output/matrix.tsv \
        kallisto_output/matrix.cells \
        transcript_ids.txt \
        ec_matrix.txt

6 DTU analysis in R

6.1 Setup

Load R, making sure the dependencies from Section 2 are installed.

Before we load our data, we will need to create a transcript and gene ID lookup table (in our case, we need the drosophila reference):

library(biomaRt)

mart <- useMart("ENSEMBL_MART_ENSEMBL",
                dataset="dmelanogaster_gene_ensembl")
ensg_attr <- c('external_gene_name','ensembl_gene_id','ensembl_transcript_id')
ensg <- getBM(mart=mart,attributes=ensg_attr)
colnames(ensg) <- c('gene_name', 'gene_id', 'transcript')

6.2 Preparing our data

We are now ready to load our data:

library(dplyr)
library(data.table)
library(DEXSeq)
library(DRIMSeq)

ecm <- read.delim('ec_matrix.txt', sep='\t')
ecm <- inner_join(ecm, ensg, by='transcript')

Next we remove equivalence classes that map to multiple different genes:

multi_ecs <- data.table(ecm)[, length(unique(gene_id)), keyby=ec_names]
multi_ecs <- multi_ecs[multi_ecs$V1 > 1]
multi_ecs <- multi_ecs$ec_names
df <- ecm[!ecm$ec_names %in% multi_ecs,]

We now define our sample groups:

group <- c(0, 0, 1, 1)
samples <- c('sample1', 'sample2', 'sample4', 'sample5')
sampleTable <- data.frame(sample_id = samples,
                          condition = group)

Our EC data frame still potentially contains repeat entries as multiple transcripts mapping to the same EC will be divided into multiple rows. Therefore, we will remove that transcript ID column, and consider all distinct equivalence classes per gene. As we will use DRIMSeq to prepare and filter our data, we will also change create a 'feature_id' column which holds the equivalence class IDs:

df$feature_id <- df$ec_names
df <- distinct(df[,c(samples, 'gene_id', 'feature_id')])

6.3 Filtering

Our data is now ready to load into a DRIMSeq object.

d <- dmDSdata(counts = df, samples = sampleTable)

Optionally, we may filter our data further by using some standard transcript and gene-level filters:

n <- 4
n.small <- 2
d <- dmFilter(d, min_samps_feature_expr=n.small,
              min_feature_expr=10,
              min_samps_feature_prop=n.small,
              min_samps_gene_expr=n,
              min_gene_expr=10)

6.3 Analysis

Our data is now ready to load into DEXSeq to perform differential transcript usage analysis:

sample.data <- DRIMSeq::samples(d)
count.data <- round(as.matrix(counts(d)[,-c(1:2)]))

dxd <- DEXSeqDataSet(countData=count.data,
                     sampleData=sample.data,
                     design=~sample + exon + condition:exon,
                     featureID=counts(d)$feature_id,
                     groupID=counts(d)$gene_id)

We will use MulticoreParam to allow DEXSeq to take advantage of multiple for cores when performing analysis. This greatly speeds up analysis. First, we estimate the library sizes, then estimate dispersions, which we can visualise:

BPPARAM=MulticoreParam(workers=8)
dxd <- estimateSizeFactors(dxd)
dxd <- estimateDispersions(dxd, BPPARAM=BPPARAM)
plotDispEsts(dxd)

dispersion plot

Each point shows the dispersion estimate for each feature. The red line shows the fitted mean-dispersion trend. The blue dots show the adjusted dispersion estimates, shrunken towards the fitted trend. Note that we are using a toy, sparse dataset, and your 'real' data will have many more points.

We can now run the test for differential exon usage (which in our case is differential EC usage):

dxd <- testForDEU(dxd, BPPARAM=BPPARAM)
dxd <- estimateExonFoldChanges(dxd, BPPARAM=BPPARAM)
dxr <- DEXSeqResults(dxd)

6.4 Results

We can view our results as follows:

print(table(dxr$padj < 0.05))

# FALSE  TRUE
#  974    47

This shows the number of significant equivalence classes. We can visualise this as follows:

DEXSeq::plotMA(dxr)

MA plot

The red points show significantly differentially expressed ECs.

We can obtain a corrected p-value at the gene level by using the perGeneQValue function:

pgq <- perGeneQValue(dxr)

results <- data.frame(gene=names(pgq), FDR=pgq)
results <- results[order(results$FDR),]

table(results$FDR < 0.05)

# FALSE  TRUE
#   418    24

head(results)

# gene          FDR
# FBgn0045770 FBgn0045770 1.037187e-10
# FBgn0011760 FBgn0011760 1.048522e-10
# FBgn0003721 FBgn0003721 4.157964e-07
# FBgn0033820 FBgn0033820 1.200841e-06
# FBgn0000579 FBgn0000579 7.744006e-05
# FBgn0035720 FBgn0035720 3.341006e-04

Finally, we can visualise differential EC usage for a given gene by using the plot_ec_usage function which can be obtained from here.

DECU plot

lookup <- distinct(ecm[,c('gene_id', 'transcript', 'ec_names')])
lookup$id <- paste(lookup$gene_id, lookup$ec_names, sep=':')

source('https://raw.githubusercontent.com/Oshlack/ec-dtu-paper/dev/R/plotting.R')
plot_ec_usage(dxr, 'FBgn0003435', lookup, ec_col = 'featureID', conds = c('c1', 'c2'))

7 References

[1] Soneson, C., Matthes, K. L., Nowicka, M., Law, C. W., & Robinson, M. D. (2016). Isoform prefiltering improves performance of count-based methods for analysis of differential transcript usage. Genome Biology, 17(1), 1–15. https://doi.org/10.1186/s13059-015-0862-3

Provide feedback

Saved searches

Use saved searches to filter your results more quickly