-
Notifications
You must be signed in to change notification settings - Fork 14
/
lab_dge.Rmd
395 lines (282 loc) · 16.5 KB
/
lab_dge.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
---
title: "Differential Gene Expression"
subtitle: "Workshop on RNA-Seq"
editor_options:
chunk_output_type: console
output:
bookdown::html_document2:
highlight: textmate
toc: true
toc_float:
collapsed: true
smooth_scroll: true
print: false
toc_depth: 4
number_sections: true
df_print: default
code_folding: none
self_contained: false
keep_md: false
encoding: 'UTF-8'
css: "assets/lab.css"
include:
after_body: assets/footer-lab.html
---
```{r,child="assets/header-lab.Rmd"}
```
<div class="boxy boxy-exclamation boxy-yellow">
Set `~/RNAseq/labs/` as your working directory (i.e. your working directory should be the directory where you saved this .Rmd file). Create a subdirectory named `data` in this directory (`~/RNAseq/labs/data/`) for input and output files.
</div>
Load R packages and source the download function.
```{r}
library(dplyr) # data wrangling
library(ggplot2) # plotting
library(DESeq2) # rna-seq
library(edgeR) # rna-seq
# source download function
source("https://raw.githubusercontent.com/NBISweden/workshop-RNAseq/master/assets/scripts.R")
```
# Preparation
For differential gene expression, we use the **DESeq2** package. We use the raw filtered counts and metadata.
```{r}
download_data("data/gene_counts.csv")
cf <- as.matrix( read.csv("data/gene_counts.csv",header=TRUE,stringsAsFactors=FALSE,row.names=1) )
download_data("data/metadata_raw.csv")
mr <- read.csv("data/metadata_raw.csv",header=TRUE,stringsAsFactors=F,row.names=1)
```
The data is converted to a DESeq2 object. The GLM model we use is simple since we only have one variable of interest `~Group`.
If we had other covariates to control for, we would add them to the model like so `~var+Group`. The variable of interest appears in the end. This model asks to find differentially expressed genes between groups under 'Group' variable while controlling for the effect of 'var'. Similarily, batch effects can be controlled by specifying them in the model `~batch+Group`.
```{r}
library(DESeq2)
mr$Group <- factor(mr$Group)
d <- DESeqDataSetFromMatrix(countData=cf,colData=mr,design=~Group)
```
# Size factors
The first step is estimating size factors. The data is normalised for sequencing depth and compositional bias as done for the VST step. DESeq2 uses a method called *median-of-ratios* for this step.
```{r,chunk.title=TRUE}
d <- DESeq2::estimateSizeFactors(d,type="ratio")
```
<div class="boxy boxy-lightbulb">
**Optional**
For those interested in the details of the *median-of-ratios* method, click below.
<p>
<button class="btn btn-sm btn-collapse btn-collapse-normal collapsed" type="button" data-toggle="collapse" data-target="#dge-size-factor" aria-expanded="false" aria-controls="dge-size-factor">
</button>
</p>
<div class="collapse" id="dge-size-factor">
<div class="card card-body">
This is a step-by-step guide to computing normalisation factors (size factors) using the *median-of-ratios* method.
1. The geometric mean is computed across all samples for each gene.
```{r,chunk.title=TRUE}
gm_mean = function(x, na.rm=TRUE){
exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
}
gmean <- apply(cf,1,gm_mean)
head(gmean)
```
2. Read counts for each gene is divided by the geometric mean of that gene to create a ratio.
```{r,chunk.title=TRUE}
ratio <- cf/gmean
head(ratio)[,1:5]
```
3. The median ratio for each sample (across all genes) is taken as the size factor for that sample.
```{r,chunk.title=TRUE}
sf <- apply(ratio,2,median)
sf
```
We can verify that these values are correct by comparing with size factors generated by DESeq2.
```{r,chunk.title=TRUE}
# deseq2 size factors
sizeFactors(d)
```
If we plot the size factors for each sample against the total counts for each sample, we get the plot below. We can see that they correlate very well. Size factors are mostly correcting for total counts, ie; sequencing depth.
```{r,fig.height=4.5,fig.width=4.5}
plot(sizeFactors(d),colSums(cf),xlab="Size factors",ylab="Total counts")
```
The raw counts can then be divided by the size factor to yield normalised read counts.
```{r,chunk.title=TRUE}
# custom
head(t(t(cf)/sf))[,1:5]
# deseq2
head(counts(d,normalized=TRUE))[,1:5]
```
</div>
</div>
</div>
<i class="fas fa-comments"></i> The function `estimateSizeFactors()` has options to set a custom `locfunc` other than the median. Why is this useful? What happens if you change it to "shorth". Check out the help page for `estimateSizeFactors()`.
# Gene dispersion
When it comes to comparing values between groups, some measure of variation is needed to estimate the variability in gene counts within groups. The most common measures of variation such as variance and standard deviation are not a good measure for rnaseq data because these measures correlate with the mean. For negative bionomially distributed data, dispersion is a better measure of variation in the data set.
<div class="boxy boxy-lightbulb">
**Optional**
For some more clarification on dispersion, click below.
<p>
<button class="btn btn-sm btn-collapse btn-collapse-normal collapsed" type="button" data-toggle="collapse" data-target="#dge-dispersion" aria-expanded="false" aria-controls="dge-dispersion">
</button>
</p>
<div class="collapse" id="dge-dispersion">
<div class="card card-body">
We can create a mean counts vs variance plot for all genes in our data set.
```{r,fig.height=4.5,fig.width=4.5,chunk.title=TRUE}
dm <- apply(cf,1,mean)
dv <- apply(cf,1,var)
ggplot(data.frame(mean=log10(dm+1),var=log10(dv+1)),
aes(mean,var))+
geom_point(alpha=0.2)+
geom_smooth(method="lm")+
labs(x=expression('Log'[10]~'Mean counts'),y=expression('Log'[10]~'Variance'))+
theme_bw()
```
We see a mostly linear relationship on the log scale. The blue line denotes a linear fit. Genes that have larger read counts show higher variance. It's hard to say which genes are more variable based on this alone. Therefore, variance is not a good measure to identify variation in read counts. A measure that controls for this mean-variance relationship is what we need.
One option is the coefficient of variation (CV). Let's compute the CV for each gene and plot CV vs the mean.
```{r,fig.height=4.5,fig.width=4.5,chunk.title=TRUE}
cva <- function(x) sd(x)/mean(x)
dc <- apply(cf,1,cva)
ggplot(data.frame(mean=log10(dm+1),var=dc),
aes(mean,var))+
geom_point(alpha=0.2)+
geom_smooth()+
labs(x=expression('Log'[10]~'Mean counts'),y="Coefficient of variation")+
theme_bw()
```
Now, we see that genes with lower counts have higher variability and genes with larger counts have lower variability. A measure like CV is taking the ratio of 'variation' to mean. `cv=sd(x)/mean(x)`.
This becomes even more apparent if we compute the CV and mean over replicates within sample groups (Time).
```{r,fig.height=3.5,fig.width=5,chunk.title=TRUE}
dx1 <- data.frame(day00=apply(cf[,1:3],1,cva),day07=apply(cf[,4:6],1,cva))
dx1$gene <- rownames(dx1)
dx1 <- tidyr::gather(dx1,key=sample,value=cv,-gene)
rownames(dx1) <- paste0(dx1$gene,"-",dx1$sample)
dx2 <- data.frame(day00=apply(cf[,1:3],1,mean),day07=apply(cf[,4:6],1,mean))
dx2$gene <- rownames(dx2)
dx2 <- tidyr::gather(dx2,key=sample,value=mean,-gene)
rownames(dx2) <- paste0(dx2$gene,"-",dx2$sample)
dx3 <- merge(dx1,dx2,by=0)
ggplot(dx3,aes(x=log10(mean+1),y=cv))+
geom_point(alpha=0.2)+
geom_smooth()+
facet_wrap(~sample.x)+
labs(x=expression('Log'[10]~'Mean counts'),y="Coefficient of variation")+
theme_bw()
```
We find that CV strongly declines with increasing counts. Genes with low counts show higher variability. For the sake of completeness, we can also plot the relationship between CV and variance for the same sample groups.
```{r,fig.height=3.5,fig.width=5,chunk.title=TRUE}
dx1 <- data.frame(day00=apply(cf[,1:3],1,cva),day07=apply(cf[,4:6],1,cva))
dx1$gene <- rownames(dx1)
dx1 <- tidyr::gather(dx1,key=sample,value=cv,-gene)
rownames(dx1) <- paste0(dx1$gene,"-",dx1$sample)
dx2 <- data.frame(day00=apply(cf[,1:3],1,var),day07=apply(cf[,4:6],1,var))
dx2$gene <- rownames(dx2)
dx2 <- tidyr::gather(dx2,key=sample,value=var,-gene)
rownames(dx2) <- paste0(dx2$gene,"-",dx2$sample)
dx3 <- merge(dx1,dx2,by=0)
ggplot(dx3,aes(x=log10(var+1),y=cv))+
geom_point(alpha=0.2)+
geom_smooth()+
facet_wrap(~sample.x)+
labs(x=expression('Log'[10]~'Variance'),y="Coefficient of variation")+
theme_bw()
```
And we see that they have a weak relationship.
</div>
</div>
</div>
DESeq2 computes it's own version of dispersion in a more robust manner taking into account low count values. The DESeq2 dispersion estimates are inversely related to the mean and directly related to variance. The dispersion estimate is a good measure of the variation in gene expression for a certain mean value.
Now, the variance or dispersion estimate for genes with low counts is unreliable when there are too few replicates. To overcome this, DESeq2 borrows information from other genes. DESeq2 assumes that genes with similar expression levels have similar dispersion values. Dispersion estimates are computed for each gene separately using maximum likelihood estimate. A curve is fitted to these gene-wise dispersion estimates. The gene-wise estimates are then 'shrunk' to the fitted curve.
Gene-wide dispersions, fitted curve and shrinkage can be visualised using the `plotDispEsts()` function.
```{r,fig.height=5.5,fig.width=5,chunk.title=TRUE}
d <- DESeq2::estimateDispersions(d)
plotDispEsts(d)
```
The black points denote the maximum likelihood dispersion estimate for each gene. The red curve denote the fitted curve. The blue points denote the new gene dispersion estimates after shrunk towards the curve. The circled blue points denote estimates that are not shrunk as they are too far away from the curve. Thus, shrinkage method is important to reduce false positives in DGE analysis involving too few replicates.
<i class='fas fa-lightbulb'></i> It is a good idea to visually check the dispersion shrinkage plot to verify that the method works for your data set.
# Testing
Overdispersion is the reason why RNA-Seq data is better modelled as negative-binomial distribution rather than poisson distribution. Poisson distribution has a mean = variance relationship, while negative-binomial distribution has a variance > mean relationship. The last step in the DESeq2 workflow is fitting the Negative Binomial model for each gene and performing differential expression testing. This is based on the log fold change values computed on the corrected count estimates between groups.
The log fold change is computed from the corrected count values as such:
```
FC = corrected counts group B / corrected counts group A
logFC = log2(FC)
```
and the fold-change can be computed back from the log fold-change as such:
```
FC = 2^logFC
```
The most commonly used testing for comparing two groups in DESeq2 is the Walds's test. The null hypothesis is that the groups are not different and `logFC=0`. The list of contrasts can be seen using `resultsNames()`. Then we can pick our comparisons of interest.
```{r,chunk.title=TRUE}
dg <- nbinomWaldTest(d)
resultsNames(dg)
```
And we can get the result tables for the three different comparisons. The summary of the result object shows the number of genes that are differentially expressed with positive or negative fold-change and outliers.
```{r,chunk.title=TRUE}
res <- results(dg,name="Group_day07_vs_day00",alpha=0.05)
res$padj[is.na(res$padj)] <- 1
# res <- na.omit(res)
summary(res)
write.csv(res,"data/dge_results.csv",row.names=T)
```
You can also build up the comparison using **contrasts**. Contrasts need the condition, level to compare and the reference level (base level). For example, `results(dg,contrast=c("Group","day07","day00"),alpha=0.05)`.
Note that `day00` is the reference level and other levels are compared to this. Therefore, a fold-change of 2 would mean that, a gene is 2 fold higher expressed in `day07` compared to `day00`.
The `results()` function has many useful arguments. One can set a threshold on the logFC values using `lfcThreshold`. By default, no filtering is performed based on logFC values. Outliers are detected and p-values are set to NA automatically using `cooksCutoff`. `independentFiltering=TRUE` remove genes with low counts.
```{r,chunk.title=TRUE}
head(res)
```
The result table contains mean expression value (baseMean), log2 fold change (log2FoldChange), log2 fold change standard error (lfcSE), wald test statistic (stat), wald test p-value (pvalue) and BH adjusted p-value (padj) for each gene.
<i class='fas fa-lightbulb'></i> Note that the results object is a `DESeqResults` class object and not a data.frame. It can be converted to a data.frame using `as.data.frame()` for exporting to a file.
It is a good idea to look at the distribution of unadjusted p-values.
```{r,fig.height=5,fig.width=5}
hist(res$pvalue,main="Pval distribution",xlab="P-values")
```
This is the kind of distribution to be expected when the p-values are "well-behaved". For more explanation on p-value distributions, see [here](http://varianceexplained.org/statistics/interpreting-pvalue-histogram/). If this distribution is very different or strange, then it might indicate an underlying problem.
We can filter the results table as needed.
```{r,chunk.title=TRUE}
# all genes
nrow(as.data.frame(res))
# only genes with padj <0.05
nrow(dplyr::filter(as.data.frame(res),padj<0.05))
# only genes with padj <0.05 and an absolute fold change >2
nrow(dplyr::filter(as.data.frame(res),padj<0.05,abs(log2FoldChange)>2))
```
Note that manually filtering by **log2FoldChange** on the results table is not the same as setting the `lfcThreshold` argument in the `results()` function.
### lfcShrink
This is an optional extra step to generate more accurate log2 fold changes. This step corrects the log2 fold changes for genes with high dispersion. This does not change the p-values or the list of DE genes.
```{r,chunk.title=TRUE}
lres <- lfcShrink(dg,coef="Group_day07_vs_day00",res=res,type="normal")
summary(lres)
```
```{r,chunk.title=TRUE}
head(lres)
```
We see that the number of genes up and down has not changed. But, we can see that the logFC distribution has changed in the density plot below.
```{r,fig.height=4,fig.width=8}
par(mfrow=c(1,2))
plot(density(na.omit(res$log2FoldChange)),main="log2FC")
plot(density(na.omit(lres$log2FoldChange)),main="log2FC lfcShrink")
par(mfrow=c(1,1))
```
This step does not change the total number of DE genes. This may be useful in downstream steps where genes need to be filtered down based on fold change or if fold change values are used in functional analyses such as GSEA.
<i class="fas fa-comments"></i> How many DE genes do you get?
# Visualisation
In this section, we can explore some useful visualisation of the differential gene expression output.
## MA plot
The MA plot shows mean expression vs log fold change for all genes. The `plotMA()` function from DESeq2 takes a results object as input. Differentially expressed genes are marked in bright colour.
```{r,fig.height=5.5,fig.width=5,chunk.title=TRUE}
DESeq2::plotMA(res)
```
<i class="fas fa-comments"></i> How does this plot change if you set log fold change filtering to minimum value of 1. How does the plot change when you use `lfcShrink()`?
## Volcano plot
A volcano plot is similar to the MA plot. It plots log fold change vs adjusted p-values.
```{r,fig.height=5,fig.width=5,chunk.title=TRUE}
ggplot()+
geom_point(data=as.data.frame(res),aes(x=log2FoldChange,y=-log10(padj)),col="grey80",alpha=0.5)+
geom_point(data=filter(as.data.frame(res),padj<0.05),aes(x=log2FoldChange,y=-log10(padj)),col="red",alpha=0.7)+
geom_hline(aes(yintercept=-log10(0.05)),alpha=0.5)+
theme_bw()
```
X axis denotes log fold change and the y axis denotes -log10 adjusted p-value. The adjusted p-value is transformed so that the smallest p-values appears at the top. The horizontal grey line denotes the significance threshold line. All genes above this line (coloured red as well) are considered significant.
<i class="fas fa-comments"></i> Why is the y-axis (p-value) on a -log scale?
## Counts plot
It can be a good idea to manually verify some of these genes by plotting out it's actual read count values. We can use the function `plotCounts()` to visualise the data points for a gene of interest. Below, we see the counts before and after normalisation.
```{r,fig.height=5,fig.width=5,chunk.title=TRUE}
plotCounts(d,gene=rownames(res)[1],intgroup="Group",normalized=F)
plotCounts(d,gene=rownames(res)[1],intgroup="Group",normalized=T)
```
<i class="fas fa-comments"></i> By looking at the count plots, do you agree that the top DE genes differ significantly between the groups compared? Does the fold change make sense given the reference group?
***