forked from tavareshugo/tutorial_DESeq2_contrasts
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathDESeq2_contrasts.Rmd
539 lines (396 loc) · 19.1 KB
/
DESeq2_contrasts.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
---
title: "Setting DESeq contrasts (using the model matrix)"
output:
md_document:
variant: markdown_github
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
set.seed(10)
```
```{r, message=FALSE, echo=FALSE}
library(DESeq2)
```
In this tutorial we show how to set treatment contrasts in `<DESeq2>` using the design or model matrix.
This is a general and flexible way to define contrasts, and is often useful for more complex contrasts or when the design of the experiment is imbalanced (e.g. different number of replicates in each group).
Although we focus on `<DESeq2>`, the approach can also be used with the other popular package `<edgeR>`.
Each section below covers a particular experimental design, from simpler to more complex ones.
The first chunk of code in each section is to simulate data, which has no particular meaning and is only done in order to have a DESeqDataSet object with the right kind of variables for each example.
In practice, users can ignore this step as they should have created a DESeqDataSet object from their own data following the [instructions in the vignette](https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#the-deseqdataset).
There is a set of [accompanying slides](https://docs.google.com/presentation/d/1B9zW1_F-kBqQEu4xqxIJrudYP5DecytYMRR6bY4H6aM/edit?usp=sharing) that illustrate each of the sections below.
# One factor, two levels (slide 5)
```{r}
# simulate data
dds <- makeExampleDESeqDataSet(n = 1000, m = 6, betaSD = 2)
dds$condition <- factor(rep(c("shade", "sun"), each = 3))
```
First we can look at our sample information:
```{r}
colData(dds)
```
Our factor of interest is `condition` and so we define our design and run the DESeq model fitting routine:
```{r, message=FALSE}
design(dds) <- ~ 1 + condition # or just `~ condition`
dds <- DESeq(dds) # equivalent to edgeR::glmFit()
```
Then check what coefficients DESeq estimated:
```{r}
resultsNames(dds)
```
We can see that we have a coefficient for our _intercept_ and coefficient for the effect of "sun" (i.e. differences between sun versus shade).
Using the more standard syntax, we can obtain the results for the effect of sun as such:
```{r}
res1 <- results(dds, contrast = list("condition_sun_vs_shade"))
res1
```
The above is a simple way to obtain the results of interest.
But it is worth understanding how DESeq is getting to these results by looking at the model's matrix.
DESeq defines the model matrix using base R functionality:
```{r}
model.matrix(design(dds), colData(dds))
```
We can see that R coded "condition" as a dummy variable, with an intercept (common to all samples) and a "conditionsun" variable, which adds the effect of sun to samples 4-6.
We can actually set our contrasts in `DESeq2::results()` using a numeric vector.
The way it works is to define a vector of "weights" for the coefficient(s) we want to test for.
In this case, we have `(Intercept)` and `conditionsun` as our coefficients (see model matrix above), and we want to test for the effect of sun, so our contrast vector would be `c(0, 1)`.
In other words, we don't care about the value of `(Intercept)` (so it has a weight of 0), and we're only interested in the effect of sun (so we give it a weight of 1).
In this case the design is very simple, so we could define our contrast vector "manually". But for complex designs this can get more difficult to do, so it's worth mentioning the general way in which we can define this.
For any contrast of interest, we can follow three steps:
- Get the model matrix
- Subset the matrix for each group of interest and calculate its column means - this results in a vector of coefficients for each group
- Subtract the group vectors from each other according to the comparison we're interested in
Let's see this example in action:
```{r}
# get the model matrix
mod_mat <- model.matrix(design(dds), colData(dds))
mod_mat
```
```{r}
# calculate the vector of coefficient weights in the sun
sun <- colMeans(mod_mat[dds$condition == "sun", ])
sun
```
```{r}
# calculate the vector of coefficient weights in the shade
shade <- colMeans(mod_mat[dds$condition == "shade", ])
shade
```
```{r}
# The contrast we are interested in is the difference between sun and shade
sun - shade
```
That last step is where we define our contrast vector, and we can give this directly to the `results` function:
```{r}
# get the results for this contrast
res2 <- results(dds, contrast = sun - shade)
```
This gives us exactly the same results as before, which we can check for example by plotting the log-fold-changes between the first and second approach:
```{r, eval=FALSE}
plot(res1$log2FoldChange, res2$log2FoldChange)
```
## Extra: recoding the design (slide 12)
Often, we can use different model matrices that essentially correspond to the same design.
For example, we could recode our design above by removing the intercept:
```{r, message=FALSE}
design(dds) <- ~ 0 + condition
dds <- DESeq(dds)
resultsNames(dds)
```
In this case we get a coefficient corresponding to the average expression in shade and the average expression in the sun (rather than the _difference_ between sun and shade).
If we use the same contrast trick as before (using the model matrix), we can see the result is the same:
```{r}
# get the model matrix
mod_mat <- model.matrix(design(dds), colData(dds))
mod_mat
# calculate weights for coefficients in each condition
sun <- colMeans(mod_mat[which(dds$condition == "sun"), ])
shade <- colMeans(mod_mat[which(dds$condition == "shade"), ])
# get the results for our contrast
res3 <- results(dds, contrast = sun - shade)
```
Again, the results are essentially the same:
```{r, eval=FALSE}
plot(res1$log2FoldChange, res3$log2FoldChange)
```
In theory there's no difference between these two ways of defining our design.
The design with an intercept is more common, but for the purposes of understanding what's going on, it's sometimes easier to look at models without intercept.
# One factor, three levels (slide 6)
```{r}
# simulate data
dds <- makeExampleDESeqDataSet(n = 1000, m = 9, betaSD = 2)
dds$condition <- NULL
dds$colour <- factor(rep(c("pink", "yellow", "white"), each = 3))
dds$colour <- relevel(dds$colour, "white")
```
First we can look at our sample information:
```{r}
colData(dds)
```
As in the previous example, we only have one factor of interest, `condition`, and so we define our design and run the DESeq as before:
```{r, message=FALSE}
design(dds) <- ~ 1 + colour
dds <- DESeq(dds)
# check the coefficients estimated by DEseq
resultsNames(dds)
```
We see that now we have 3 coefficients:
- "Intercept" corresponds to white colour (our reference level)
- "colour_pink_vs_white" corresponds to the difference between the reference level and pink
- "colour_yellow_vs_white" corresponds to the difference between the reference level and yellow
We could obtain the difference between white and any of the two colours easily:
```{r}
res1_pink_white <- results(dds, contrast = list("colour_pink_vs_white"))
res1_yellow_white <- results(dds, contrast = list("colour_yellow_vs_white"))
```
For comparing pink vs yellow, however, we need to compare two coefficients with each other to check whether they are themselves different (check the slide to see the illustration). This is how the standard DESeq syntax would be:
```{r}
res1_pink_yellow <- results(dds, contrast = list("colour_pink_vs_white",
"colour_yellow_vs_white"))
```
However, following our three steps detailed in the first section, we can define our comparisons from the design matrix:
```{r}
# define the model matrix
mod_mat <- model.matrix(design(dds), colData(dds))
mod_mat
```
```{r}
# calculate coefficient vectors for each group
pink <- colMeans(mod_mat[dds$colour == "pink", ])
white <- colMeans(mod_mat[dds$colour == "white", ])
yellow <- colMeans(mod_mat[dds$colour == "yellow", ])
```
And we can now define any contrasts we want:
```{r, fig.keep = "none"}
# obtain results for each pairwise contrast
res2_pink_white <- results(dds, contrast = pink - white)
res2_pink_yellow <- results(dds, contrast = pink - yellow)
res2_yellow_white <- results(dds, contrast = yellow - white)
# plot the results from the two approaches to check that they are identical
plot(res1_pink_white$log2FoldChange, res2_pink_white$log2FoldChange)
plot(res1_pink_yellow$log2FoldChange, res2_pink_yellow$log2FoldChange)
plot(res1_yellow_white$log2FoldChange, res2_yellow_white$log2FoldChange)
```
With this approach, we could even define a more unusual contrast, for example to find genes that differ between pigmented and non-pigmented samples:
```{r}
# define vector of coefficients for pigmented samples
pigmented <- colMeans(mod_mat[dds$colour %in% c("pink", "yellow"),])
# Our contrast of interest is
pigmented - white
```
Notice the contrast vector in this case assigns a "weight" of 0.5 to each of `colourpink` and `colouryellow`. This is equivalent to saying that we want to consider the average of pink and yellow expression. In fact, we could have also defined our contrast vector like this:
```{r}
# average of pink and yellow minus white
(pink + yellow)/2 - white
```
To obtain our results, we use the `results()` function as before:
```{r}
# get the results between pigmented and white
res2_pigmented <- results(dds, contrast = pigmented - white)
```
## Extra: why not define a new group in our design matrix?
For this last example (pigmented vs white), we may have considered creating a new variable in our column data:
```{r}
dds$pigmented <- factor(dds$colour %in% c("pink", "yellow"))
colData(dds)
```
and then re-run DESeq with a new design:
```{r, message=FALSE}
design(dds) <- ~ 1 + pigmented
dds <- DESeq(dds)
resultsNames(dds)
res1_pigmented <- results(dds, contrast = list("pigmented_TRUE_vs_FALSE"))
```
However, in this model the gene dispersion is estimated together for pink and yellow samples as if they were replicates of each other, which may result in inflated/deflated estimates.
Instead, our approach above estimates the error within each of those groups.
To check the difference one could compare the two approaches visually:
```{r, eval=FALSE}
# compare the log-fold-changes between the two approaches
plot(res1_pigmented$log2FoldChange, res2_pigmented$log2FoldChange)
abline(0, 1, col = "brown", lwd = 2)
# compare the errors between the two approaches
plot(res1_pigmented$lfcSE, res2_pigmented$lfcSE)
abline(0, 1, col = "brown", lwd = 2)
```
# Two factors with interaction (slide 7)
```{r}
# simulate data
dds <- makeExampleDESeqDataSet(n = 1000, m = 12, betaSD = 2)
dds$colour <- factor(rep(c("pink", "white"), each = 6))
dds$colour <- relevel(dds$colour, "white")
dds$condition <- factor(rep(c("sun", "shade"), 6))
dds <- dds[, order(dds$colour, dds$condition)]
colnames(dds) <- paste0("sample", 1:ncol(dds))
```
First let's look at our sample information:
```{r}
colData(dds)
```
This time we have two factors of interest, and we want to model both with an interaction (i.e. we assume that white and pink samples may respond differently to sun/shade).
We define our design accordingly and fit the model:
```{r, message=FALSE}
design(dds) <- ~ 1 + colour + condition + colour:condition
dds <- DESeq(dds)
resultsNames(dds)
```
Because we have two factors and an interaction, the number of comparisons we can do is larger.
Using our three-step approach from the model matrix, we do things exactly as we've been doing so far:
```{r}
# get the model matrix
mod_mat <- model.matrix(design(dds), colData(dds))
# Define coefficient vectors for each condition
pink_shade <- colMeans(mod_mat[dds$colour == "pink" & dds$condition == "shade", ])
pink_sun <- colMeans(mod_mat[dds$colour == "pink" & dds$condition == "sun", ])
white_shade <- colMeans(mod_mat[dds$colour == "white" & dds$condition == "shade", ])
white_sun <- colMeans(mod_mat[dds$colour == "white" & dds$condition == "sun", ])
```
We are now ready to define any contrast of interest from these vectors (for completeness we show the equivalent syntax using the coefficient's names from DESeq).
Pink vs White (in the shade):
```{r}
res1 <- results(dds, contrast = pink_shade - white_shade)
# or equivalently
res2 <- results(dds, contrast = list("colour_pink_vs_white"))
```
Pink vs White (in the sun):
```{r}
res1 <- results(dds, contrast = pink_sun - white_sun)
# or equivalently
res2 <- results(dds, contrast = list(c("colour_pink_vs_white",
"colourpink.conditionsun")))
```
Sun vs Shade (for whites):
```{r}
res1 <- results(dds, contrast = white_sun - white_shade)
# or equivalently
res2 <- results(dds, contrast = list(c("condition_sun_vs_shade")))
```
Sun vs Shade (for pinks):
```{r}
res1 <- results(dds, contrast = pink_sun - pink_shade)
# or equivalently
res2 <- results(dds, contrast = list(c("condition_sun_vs_shade",
"colourpink.conditionsun")))
```
Interaction between colour and condition (i.e. do pinks and whites respond differently to the sun?):
```{r}
res1 <- results(dds,
contrast = (pink_sun - pink_shade) - (white_sun - white_shade))
# or equivalently
res2 <- results(dds, contrast = list("colourpink.conditionsun"))
```
In conclusion, although we can define these contrasts using DESeq coefficient names, it is somewhat more explicit (and perhaps intuitive?) what it is we're comparing using matrix-based contrasts.
# Three factors, with nesting (slide 8)
```{r}
# simulate data
dds <- makeExampleDESeqDataSet(n = 1000, m = 24, betaSD = 2)
dds$colour <- factor(rep(c("white", "pink"), each = 12))
dds$colour <- relevel(dds$colour, "white")
dds$species <- factor(rep(LETTERS[1:4], each = 6))
dds$condition <- factor(rep(c("sun", "shade"), 12))
dds <- dds[, order(dds$colour, dds$species, dds$condition)]
colnames(dds) <- paste0("sample", 1:ncol(dds))
```
First let's look at our sample information:
```{r}
colData(dds)
```
Now we have three factors, but species is _nested_ within colour (i.e. a species is either white or pink, it cannot be both).
Therefore, colour is a linear combination with species (or, another way to think about it is that colour is redundant with species).
Because of this, we will define our design without including "colour", although later we can compare groups of species of the same colour with each other.
```{r, message=FALSE}
design(dds) <- ~ 1 + species + condition + species:condition
dds <- DESeq(dds)
resultsNames(dds)
```
Now it's harder to define contrasts between groups of species of the same colour using DESeq's coefficient names (although still possible).
But using the model matrix approach, we do it in exactly the same way we have done so far!
Again, let's define our groups from the model matrix:
```{r}
# get the model matrix
mod_mat <- model.matrix(design(dds), colData(dds))
# define coefficient vectors for each group
pink_shade <- colMeans(mod_mat[dds$colour == "pink" & dds$condition == "shade", ])
white_shade <- colMeans(mod_mat[dds$colour == "white" & dds$condition == "shade", ])
pink_sun <- colMeans(mod_mat[dds$colour == "pink" & dds$condition == "sun", ])
white_sun <- colMeans(mod_mat[dds$colour == "white" & dds$condition == "sun", ])
```
It's worth looking at some of these vectors, to see that they are composed of weighted coefficients from different species.
For example, for "pink" species, we have equal contribution from "speciesC" and "speciesD":
```{r}
pink_shade
```
And so, when we define our contrasts, each species will be correctly weighted:
```{r}
pink_sun - pink_shade
```
We can set our contrasts in exactly the same way as we did in the previous section (for completeness, we also give the contrasts using DESeq's named coefficients).
Pink vs White (in the shade):
```{r, eval=FALSE}
res1_pink_white_shade <- results(dds, contrast = pink_shade - white_shade)
# or equivalently
res2_pink_white_shade <- results(dds,
contrast = list(c("species_C_vs_A",
"species_D_vs_A"),
c("species_B_vs_A")),
listValues = c(0.5, -0.5))
```
Note that, when using the named coefficients, we had to add an extra option `listValues = c(0.5, -0.5)`, to weight each coefficient by half.
This can be thought of as taking the average of the two coefficients, similar to what we obtain with our numeric contrast, as we saw above when doing `pink_sun - pink_shade`.
Here is the comparison for Pink vs White (in the sun):
```{r}
res1_pink_white_sun <- results(dds, contrast = pink_sun - white_sun)
# or equivalently
res2_pink_white_sun <- results(dds,
contrast = list(c("species_C_vs_A",
"species_D_vs_A",
"speciesC.conditionsun",
"speciesD.conditionsun"),
c("species_B_vs_A",
"speciesB.conditionsun")),
listValues = c(0.5, -0.5))
```
Here is the results for the interaction, i.e. whether Pink and White respond differently to Sun-Shade:
```{r}
res1_interaction <- results(dds, contrast = (pink_sun - pink_shade) - (white_sun - white_shade))
# or equivalently
res2_interaction <- results(dds,
contrast = list(c("speciesC.conditionsun", "speciesD.conditionsun"),
c("speciesB.conditionsun")),
listValues = c(0.5, -0.5))
```
## Extra: imbalanced design
Let's take our previous example, but drop one of the samples from the data, so that we only have 2 replicates for it.
```{r, message=FALSE}
dds <- dds[, -13] # drop one of the species C samples
dds <- DESeq(dds)
resultsNames(dds)
```
Define our model matrix and coefficient vectors:
```{r}
mod_mat <- model.matrix(design(dds), colData(dds))
mod_mat
# define coefficient vectors for each group
pink_shade <- colMeans(mod_mat[dds$colour == "pink" & dds$condition == "shade", ])
white_shade <- colMeans(mod_mat[dds$colour == "white" & dds$condition == "shade", ])
pink_sun <- colMeans(mod_mat[dds$colour == "pink" & dds$condition == "sun", ])
white_sun <- colMeans(mod_mat[dds$colour == "white" & dds$condition == "sun", ])
```
Now let's check what happens to the pink_shade group:
```{r}
pink_shade
```
Notice that whereas before "speciesC" and "speciesD" had each a weight of 0.5, now they have different weights.
That's because for speciesC there's only 2 replicates.
So, we have a total of 5 white individuals in the shade (2 from species C and 3 from D).
Therefore, when we calculate the average coefficients for pinks, we need to do it as 0.4 x speciesC + 0.6 x speciesD.
The nice thing about this approach is that we do not need to worry about any of this, the weights come from our `colMeans()` call automatically.
And now, any contrasts that we make will take these weights into account:
```{r}
# pink vs white (in the shade)
pink_shade - white_shade
# interaction
(pink_sun - pink_shade) - (white_sun - white_shade)
```
# Further reading
- Forum discussion about nested design: http://seqanswers.com/forums/showthread.php?t=47766