-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathcsdex_vignette.Rmd
244 lines (182 loc) · 7.45 KB
/
csdex_vignette.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
---
title: "csDEX vignette"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{csDEX vignette}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}
---
# The csDEX package: Quick Start
csDEX (Condition-specific Differential Exon Expression) is a family of general
linear models (GLMs), used to model differential splicing given a set of
sequencing experiments based on RNA-seq. The methods assume tens to hundreds
experimental conditions to be compared, and the result are
changes in usage of splice sites that are specific to an
experimental condition.
Capabilities:
* handling of sequence data in form of read counts or percentages (e.g. percent-spliced in, PSI),
* modelling all types of splicing events,
* discovery and ranking of condition-specific changes,
* parallel processing of multiple genes (groups),
* time-efficient model approximation with no change in false positive rate.
## Installation
Install the latest version hosted on GitHub:
```{r eval=FALSE}
require(devtools)
install_github("mstrazar/csDEX")
```
Test the installation:
```{r, message=FALSE}
require(csDEX)
```
## Data preparation
The package assumes a directory with replicate data and a design (decoder) file
to link experimental conditions to appropriate files.
### Replicate data
Replicate data is stored as `.txt` files containing a table with two columns in the
following format:
```
groupID:featureID value
groupID:featureID value
```
The replicate files are tab- or space-delimited and have no header line. The
fields encode the following information:
* `groupID`: group of features belonging to the same transcriptional unit,
most likely a gene.
* `featureID`: alternative transcriptional unit, such as an exon, exonic
part, alternative 5'/3' end, etc. Most commonly referring to a
non-overlapping exonic bin, to which reads are mapped.
* `value`: expression quantification, either in form of read counts mapping
to the feature (non-negative integers) or percentage spliced-in (PSI,
real numbers 0.0-1.0, inclusive). Choice of quantification shall be
consistent within and between all files.
Example for PSI:
```
ENSG00000198912:003 0.13
ENSG00000198912:004 1.00
ENSG00000198912:005 0.02
ENSG00000198912:006 1.00
ENSG00000198912:007 0.97
ENSG00000198912:008 0.13
```
An arbitrary number of replicates can be used for each of the experimental conditions.
#### Obtaining replicate files from BAM/SAM files.
Given the number of available tools out there, we do not reinvent the wheel but
point users to existing tools. These will typically assume a genome annotation
listed in a `.gtf/.gff` format.
Read count data:
- QoRTs
- DEXSeq
PSI:
- cuffLinks
- rMATS
- manually from ENCODE pre-made transcript quantifications
### The design file.
The experimental design is stored as a tab-delimited table and links replicate filenames to
experimental conditions. It assumes (at least) two columns denoting
* `File.accession`: replicate file name within the input directory, without extension.
* `Experiment.target`: experimental condition identifier
* `input.read.count`: (optional) Number of reads from which the replicate was generated. Required to compute gene counts-per-million reads (CPM) if required.
```
File.accession Experiment.target input.read.count
ENCFF120ICS DDX27-human 1380110
ENCFF237IIP DDX27-human 1593308
ENCFF202WCG NELFE-human 1472191
ENCFF759XIJ NELFE-human 2889128
ENCFF781OAE PUM2-human 2021596
ENCFF803RGM PUM2-human 2362944
```
Files sharing the same `Experiment.target` are assumed to be technical replicates
of the same experimental condition. The column names are not fixed and can be defined at
runtime. Other columns can be stored, but will be ignored by csDEX. The format
is consistent with the ENCODE metadata format.
Two exemplary small datasets are provided with the csDEX package, to be
used for testing purposes only.
## Using csDEX
The example dataset is accessed as follows.
```{r message=FALSE}
design.file = system.file("extdata/metadata.tsv",
package="csDEX", mustWork=TRUE)
data.dir.psi = system.file("extdata/psi",
package="csDEX", mustWork=TRUE)
data.dir.count = system.file("extdata/count",
package="csDEX", mustWork=TRUE)
```
### The PSI model
The percentage-spliced in model is based on the Beta regression GLM. It is somewhat
more streamlined and therefore presented first. The data is loaded into a
`csDEXdataSet` object:
```{r, message=FALSE}
cdx = csDEXdataSet(data.dir=data.dir.psi,
design.file=design.file,
type="PSI",
aggregation=mean)
```
The `aggregation` argument is a function determining how to merge replicate
measurements associated to a feature. Other functions, e.g. `sum`, `max`,
`min`, ..., are possible, as well as custom ones. The metadata associated to
rows and columns can be viewed or modified using `rowData(cdx)` and
`colData(cdx)` extractor functions. See the methods documentation for a
complete list of preprocessing and filtering options.
The differential feature expression is run with:
```{r eval=TRUE, message=FALSE}
result = testForDEU(cdx,
workers=1,
tmp.dir=NULL)
```
```{r}
head(result[c("featureID", "condition", "y", "pvalue", "padj")])
```
The result is a table of features and conditions ranked by statistical
significance, where the columns `pvalue`, `padj` (Bonferroni correction) and
`fdr` (Benjamini-Hochberg correction) are of interest. The argument `tmp.dir`
can be set to a valid system path where intermediary results will be stored
(one file per gene), which is useful for lengthy computations. The `workers`
parameter can be used to increase the number of processing cores, if available.
### The read count model
The read count model is based on Negative binomial regression. It requires 2-3
extra preprocessing steps to infer model hyperparameters.
Data is loaded similarly as for the PSI model. Note the `type` argument.
```{r message=FALSE}
cdx = csDEXdataSet(data.dir=data.dir.count,
design.file=design.file,
type="count")
```
The following methods act on a `csDEXdataSet` object, by adding new fields to `rowData` or `colData`.
Due to possible unequal sequencing, size factors are computed.
```{r}
cdx = estimateSizeFactors(cdx)
```
Normalized count values can be accessed by setting the `normalized` parameter
to `exprData`.
```{r eval=FALSE}
exprData(cdx, normalized=TRUE)
```
The NB model assumes the precision parameter (sometimes termed precision) to be known.
csDEX uses the `edgeR` package method `estimateDisp(...)$tagwise.dispersion`, but custom
precision models can be used.
```{r message=FALSE}
cdx = estimatePrecisions(cdx)
```
Finally, gene-wise read coverage, expressed as counts-per-million reads (CPM)
can be calculated, provided that a column name `input.read.count` is present
and valid in the design file. This step is optional, but can be used to
filter out low coverage genes.
```{r results=1:10}
cdx = estimateGeneCPM(cdx)
cpmData(cdx)
```
Having computed all the hyperparameters, testing for differential
usage is run. Note the `min.cpm` argument is set to filter out genes
based on CPM, as described above.
```{r message=FALSE}
result = testForDEU(cdx, workers=1, tmp.dir=NULL, min.cpm=1)
```
```{r}
head(result[c("featureID", "condition", "y", "pvalue", "padj")])
```
#### Session information
```{r}
sessionInfo()
```