-
Notifications
You must be signed in to change notification settings - Fork 1
Vignette
This vignette demonstrates how users can perform differential transcript usage analysis using equivalence classes derived from the pseudo-aligners Salmon and Kallisto. Please see our f1000 paper for full details on the methodology and results.
Obtain the toy data set, which is a subsampled version of the simulated Drosophila data from Soneson et al.[1]. We will be using samples 1-2 from condition A and samples 4-5 from condition B (this omits samples 3 and 6 in our toy example).
Download the following data: Drosophila toy data and put it into a directory called 'data'. Un-tar the data like so:
tar -xvzf Dm_subsampled_data.tar.gz
Next obtain the reference data:
rsync -av rsync://ftp.ensembl.org/ensembl/pub/release-95/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz .
cd ..
Install the following:
Using Anaconda is recommended as this will install Python with both Numpy and Pandas.
NOTE: Using Python 3+ is highly recommended for running the create_salmon_ec_count_matrix.py
script. Running the script using Python 2 may result in significantly higher compute times.
Load R and install the following packages:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DEXSeq")
BiocManager::install("DRIMSeq")
BiocManager::install("tximport")
BiocManager::install("biomaRt")
install.packages('data.table')
install.packages('dplyr')
Install bpipe if you want to use the ec-dtu-pipe pipeline.
If you want to run the analyses manually, download the following scripts:
curl -O https://raw.githubusercontent.com/Oshlack/ec-dtu-paper/master/create_salmon_ec_count_matrix.py
curl -O https://raw.githubusercontent.com/Oshlack/ec-dtu-paper/master/create_kallisto_ec_count_matrix.py
One quick and easy way to run EC DTU analyses with Salmon is to use our ec-dtu-pipe pipeline. Clone the pipeline as follows:
git clone [email protected]:Oshlack/ec-dtu-pipe.git
Make sure bpipe is installed in addition to the dependencies in Section 2.
To run the pipeline on our toy data, first create create a parameters file. Let's call it ec_params.txt:
-p pipe_dir=ec-dtu-pipe
-p python=python
-p salmon=salmon
-p salmon_index=salmon_index
-p txome=data/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz
-p bmart_dset=dmelanogaster_gene_ensembl
-p sample_regex=Dm
-p group=Dm.4,Dm.5
-p fastqFormat=%_sample_%_subsampled_*.fq.gz
-p cores=8
-p time=time
Run the pipeline as follows (note that the file paths are relative to your working directory):
bpipe run @ec_params.txt ec-dtu-pipe/ec-dtu-pipe.groovy \
data/Dm_sample_1_subsampled_1.fq.gz data/Dm_sample_1_subsampled_2.fq.gz \
data/Dm_sample_2_subsampled_1.fq.gz data/Dm_sample_2_subsampled_2.fq.gz \
data/Dm_sample_4_subsampled_1.fq.gz data/Dm_sample_4_subsampled_2.fq.gz \
data/Dm_sample_5_subsampled_1.fq.gz data/Dm_sample_5_subsampled_2.fq.gz
When the pipeline successfully completes, it will create an ec_results.RData
file, which can be loaded into R and investigated:
load('ec_results.RData')
Access the results as follows:
-
results[['dexseq_object']]
is the DEXSeq object -
results[['dexseq_results']]
is the DEXSeq results object -
results[['gene_FDR']]
is the results from DEXSeq's perGeneQValue method
We can also run each of the steps manually. First, we need to create a Salmon index based on the reference transcriptome that we want to use.
salmon index -t data/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz -i salmon_index
We are now ready to run Salmon on our four samples. A concise way to do this is to use a bash for loop:
for i in 1 2 4 5; do
salmon quant --index salmon_index -l A \
--dumpEq --seqBias --gcBias \
-1 data/Dm_sample_${i}_subsampled_1.fq.gz \
-2 data/Dm_sample_${i}_subsampled_2.fq.gz \
-p 8 -o salmon_output/sample${i}
done
The --dumpEq
parameter is necessary in order to obtain equivalence class output. NOTE: if you are running later versions of Salmon (0.12 or later), you must use the --hardFilter
flag if you use the --validateMappings
flag, otherwise ECs will not be uniquely defined by their transcript memberships (which is required for our method).
We provide a python script to match equivalence classes between samples and create a matrix of counts. The script can then be run like so:
python create_salmon_ec_count_matrix.py \
salmon_output/sample1/aux_info/eq_classes.txt \
salmon_output/sample2/aux_info/eq_classes.txt \
salmon_output/sample4/aux_info/eq_classes.txt \
salmon_output/sample5/aux_info/eq_classes.txt \
sample1,sample2,sample4,sample5 \
ec_matrix.txt
Note that this script will not accept wild-card input for the eq_classes.txt
files. We can instead use bash to get around this:
python create_salmon_ec_count_matrix.py \
`ls */aux_info/eq_classes.txt` \
sample1,sample2,sample4,sample5 \
ec_matrix.txt
Alternatively, we can create our EC counts using Kallisto. To skip to the DTU analysis, see Section 6.
First, we must first create an index:
kallisto index -i kallisto_index data/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz
Kallisto allows us to obtain equivalence classes by running in batch mode. For this, we require a file containing our sample names and fastq file paths. Create the following file under our working directory using your favourite text editor (remember to separate the columns by tabs) and call it samples.txt
:
#id file1 file2
sample1 data/Dm_sample_1_subsampled_1.fq.gz data/Dm_sample_1_subsampled_2.fq.gz
sample2 data/Dm_sample_2_subsampled_1.fq.gz data/Dm_sample_2_subsampled_2.fq.gz
sample4 data/Dm_sample_4_subsampled_1.fq.gz data/Dm_sample_4_subsampled_2.fq.gz
sample5 data/Dm_sample_5_subsampled_1.fq.gz data/Dm_sample_5_subsampled_2.fq.gz
Now run Kallisto using pseudo-align mode:
kallisto pseudo -i kallisto_index -o kallisto_output -t 8 -b samples.txt
For this step, we need a list of transcript IDs used for our reference. We can easily generate this using our fastq reference and a one-line bash-script:
zcat < data/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz | grep '>' | cut -d' ' -f 1 | sed 's/>//' > transcript_ids.txt
Now run the create_kallisto_ec_count_matrix.py
script to prepare Kallisto's output for DEXSeq input:
python create_kallisto_ec_count_matrix.py \
kallisto_output/matrix.ec \
kallisto_output/matrix.tsv \
kallisto_output/matrix.cells \
transcript_ids.txt \
ec_matrix.txt
Load R, making sure the dependencies from Section 2 are installed.
Before we load our data, we will need to create a transcript and gene ID lookup table (in our case, we need the drosophila reference):
library(biomaRt)
mart <- useMart("ENSEMBL_MART_ENSEMBL",
dataset="dmelanogaster_gene_ensembl")
ensg_attr <- c('external_gene_name','ensembl_gene_id','ensembl_transcript_id')
ensg <- getBM(mart=mart,attributes=ensg_attr)
colnames(ensg) <- c('gene_name', 'gene_id', 'transcript')
We are now ready to load our data:
library(dplyr)
library(data.table)
library(DEXSeq)
library(DRIMSeq)
ecm <- read.delim('ec_matrix.txt', sep='\t')
ecm <- inner_join(ecm, ensg, by='transcript')
Next we remove equivalence classes that map to multiple different genes:
multi_ecs <- data.table(ecm)[, length(unique(gene_id)), keyby=ec_names]
multi_ecs <- multi_ecs[multi_ecs$V1 > 1]
multi_ecs <- multi_ecs$ec_names
df <- ecm[!ecm$ec_names %in% multi_ecs,]
We now define our sample groups:
group <- c(0, 0, 1, 1)
samples <- c('sample1', 'sample2', 'sample4', 'sample5')
sampleTable <- data.frame(sample_id = samples,
condition = group)
Our EC data frame still potentially contains repeat entries as multiple transcripts mapping to the same EC will be divided into multiple rows. Therefore, we will remove that transcript ID column, and consider all distinct equivalence classes per gene. As we will use DRIMSeq to prepare and filter our data, we will also change create a 'feature_id' column which holds the equivalence class IDs:
df$feature_id <- df$ec_names
df <- distinct(df[,c(samples, 'gene_id', 'feature_id')])
Our data is now ready to load into a DRIMSeq object.
d <- dmDSdata(counts = df, samples = sampleTable)
Optionally, we may filter our data further by using some standard transcript and gene-level filters:
n <- 4
n.small <- 2
d <- dmFilter(d, min_samps_feature_expr=n.small,
min_feature_expr=10,
min_samps_feature_prop=n.small,
min_samps_gene_expr=n,
min_gene_expr=10)
Our data is now ready to load into DEXSeq to perform differential transcript usage analysis:
sample.data <- DRIMSeq::samples(d)
count.data <- round(as.matrix(counts(d)[,-c(1:2)]))
dxd <- DEXSeqDataSet(countData=count.data,
sampleData=sample.data,
design=~sample + exon + condition:exon,
featureID=counts(d)$feature_id,
groupID=counts(d)$gene_id)
We will use MulticoreParam
to allow DEXSeq to take advantage of multiple for cores when performing analysis. This greatly speeds up analysis. First, we estimate the library sizes, then estimate dispersions, which we can visualise:
BPPARAM=MulticoreParam(workers=8)
dxd <- estimateSizeFactors(dxd)
dxd <- estimateDispersions(dxd, BPPARAM=BPPARAM)
plotDispEsts(dxd)
Each point shows the dispersion estimate for each feature. The red line shows the fitted mean-dispersion trend. The blue dots show the adjusted dispersion estimates, shrunken towards the fitted trend. Note that we are using a toy, sparse dataset, and your 'real' data will have many more points.
We can now run the test for differential exon usage (which in our case is differential EC usage):
dxd <- testForDEU(dxd, BPPARAM=BPPARAM)
dxd <- estimateExonFoldChanges(dxd, BPPARAM=BPPARAM)
dxr <- DEXSeqResults(dxd)
We can view our results as follows:
print(table(dxr$padj < 0.05))
# FALSE TRUE
# 974 47
This shows the number of significant equivalence classes. We can visualise this as follows:
DEXSeq::plotMA(dxr)
The red points show significantly differentially expressed ECs.
We can obtain a corrected p-value at the gene level by using the perGeneQValue
function:
pgq <- perGeneQValue(dxr)
results <- data.frame(gene=names(pgq), FDR=pgq)
results <- results[order(results$FDR),]
table(results$FDR < 0.05)
# FALSE TRUE
# 418 24
head(results)
# gene FDR
# FBgn0045770 FBgn0045770 1.037187e-10
# FBgn0011760 FBgn0011760 1.048522e-10
# FBgn0003721 FBgn0003721 4.157964e-07
# FBgn0033820 FBgn0033820 1.200841e-06
# FBgn0000579 FBgn0000579 7.744006e-05
# FBgn0035720 FBgn0035720 3.341006e-04
Finally, we can visualise differential EC usage for a given gene by using the plot_ec_usage
function which can be obtained from here.
lookup <- distinct(ecm[,c('gene_id', 'transcript', 'ec_names')])
lookup$id <- paste(lookup$gene_id, lookup$ec_names, sep=':')
source('https://raw.githubusercontent.com/Oshlack/ec-dtu-paper/dev/R/plotting.R')
plot_ec_usage(dxr, 'FBgn0003435', lookup, ec_col = 'featureID', conds = c('c1', 'c2'))
[1] Soneson, C., Matthes, K. L., Nowicka, M., Law, C. W., & Robinson, M. D. (2016). Isoform prefiltering improves performance of count-based methods for analysis of differential transcript usage. Genome Biology, 17(1), 1–15. https://doi.org/10.1186/s13059-015-0862-3