forked from UCLouvain-CBIO/WSBIM2122
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path52-design.Rmd
462 lines (344 loc) · 17.1 KB
/
52-design.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
# More complex experimental design {#sec-design}
**Learning Objectives**
The goal of this chapter is
- to understand more complex experimental designs
- how to extract relevant factors
```{r, echo = FALSE}
suppressPackageStartupMessages(library("DESeq2"))
suppressPackageStartupMessages(library("tidyverse"))
```
## A more complex dataset
The dataset that we used previously was quite simple, we had 3 KD
samples vs 3 control samples. But the original dataset was in reality
a bit more complex, as the KD experiments were actually done on 2
different cell lines. So we have now 12 samples. The 6 first samples
correspond to an epithelial cell line (Cell1), and 6 other samples
correspond to a fibroblast cell line (Cell2).
```{r, echo=FALSE}
stopifnot(packageVersion("rWSBIM2122") >= "0.4.1")
```
```{r}
load("wsbim2122_data/deseq2/coldata_both.rda")
load("wsbim2122_data/deseq2/counts_both.rda")
coldata
```
### PCA analysis
Let's start by a PCA analysis, that will help us to define the most
appropriate design to use for the analysis.
To generate the [DESeqDataSet](https://www.rdocumentation.org/packages/DESeq2/versions/1.12.3/topics/DESeqDataSet-class), remember that we have to specify the experimental design.
We will see that can use different experimental designs for the analysis,
depending on the biological questions that we want to answer. We will discuss
about this latter. At this step, as we just want to do a PCA (that will anyway
be blind to the sample information specified by the design formula),
we can use a design ~ 1.
```{r, message = FALSE, warning = FALSE}
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = coldata,
design = ~ 1)
dds <- DESeq(dds)
rld <- rlogTransformation(dds)
pca_data <- plotPCA(rld,
intgroup = c("Condition", "Type"),
returnData = TRUE)
percentVar <- round(100 * attr(pca_data, "percentVar"))
ggplot(pca_data,
aes(x = PC1, y = PC2,
color = Condition, shape = Type)) +
geom_point(size = 3) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance"))
```
We can see from this PCA plot that PC1 separates the samples based on the
cell line origin, while PC2 separates the samples based on the siRNA treatment.
This means that most of the variability comes from the cell origin.
The design used for the analysis will have to take this cell effect into
account.
Different designs can be used to analyse this dataset, depending on the
biological questions that are asked.
- **A paired design** should be used to answer to the question:
*Which genes are consistently affected by the KD in both cell lines?*
- **A design with an interaction** should be used if the biological questions are:
*Which genes are significantly affected by the KD in the epithelial cells?*
*Which genes are significantly affected by the KD in the fibroblasts?*
*Which genes are not affected in the same way by the KD in the two cell lines*
## Paired design
To take into account the condition effect while controlling the cell origin,
we can use the following design:
`design = ~ Cell + Condition`
Let's run the analysis with this design (we have to re-run the
`DESeqDataSetFromMatrix()` and `DESeq()` functions)
```{r, message = FALSE, warning = FALSE}
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = coldata,
design = ~ Type + Condition)
# Set the mock level as the reference level
dds$Condition <- relevel(dds$Condition, ref = "mock")
# Set the Epithelial cells as the reference level
dds$Type <- relevel(dds$Type, ref = "Epithelial")
dds <- DESeq(dds)
```
As we have seen previously, the `resultsNames()` function gives the
names of the coefficients that can be extracted.
```{r}
resultsNames(dds)
```
The linear model set with this design can be written
$$log2(q_{ij}) = Intercept + Type\_Fibroblast\_vs\_Epithelial.x_j + Condition\_KD\_vs\_mock.y_j + \epsilon$$
- The $Intercept$ represents the log2 expression level in the
reference (in mock Epithelial cells).
- The $Type\_Fibroblast\_vs\_Epithelial$ coefficient corresponds to the log2FC
between fibroblasts and epithelial cells. $x_j = 0$ if the sample j corresponds to epithelial cells,
and $x_j = 1$ if the sample j corresponds to fibroblasts.
- The $Condition\_KD\_vs\_mock$ coefficient corresponds to the log2FC between KD and
mock cells. $y_j = 0$ if the sample j corresponds
to mock cells, and $y_j$ = 1 if the sample j corresponds to KD cells.
The following figure (generated with the [ExploreModelMatrix](http://www.bioconductor.org/packages/release/bioc/vignettes/ExploreModelMatrix/inst/doc/ExploreModelMatrix.html) package) illustrates the value of the
linear predictor of a generalized linear model for each combination of
input variables. It can help to understand and to generate contrasts.
```{r echo=FALSE}
coldata$Condition <- factor(coldata$Condition, levels = c("mock", "KD"))
coldata$Type <- factor(coldata$Type, levels = c("Epithelial", "Fibroblast"))
vd <- ExploreModelMatrix::VisualizeDesign(sampleData = coldata,
designFormula = ~ Type + Condition,
textSizeFitted = 6)
vd$plotlist[[1]]
```
From this figure, we can see how the genes log2 expression values are modelized
in the different samples.
- In mock epithelial cells
$log2(q_{Gene_i\_Epithelial\_mock}) = Intercept$
- In KD epithelial cells
$log2(q_{Gene_i\_Epithelial\_KD}) = Intercept + Condition\_KD\_vs\_mock$
- In mock fibroblasts cells
$log2(q_{Gene_i\_Fibroblast\_mock}) = Intercept + Type\_Fibroblast\_vs\_Epithelial$
- In KD fibroblasts cells
$log2(q_{Gene_i\_Fibroblast\_KD}) = Intercept + Type\_Fibroblast\_vs\_Epithelial + Condition\_KD\_vs\_mock$
### KD effect
To identify the genes that are consistently affected by the KD in both cell lines,
we will have to test (for each gene) the *Condition_KD_vs_mock* coefficient, to see
if it is significantly different from zero.
Remember that we can use the `name` parameter of the `results()` function to specify the
coefficient that we want to extract (here the *Condition_KD_vs_mock* coefficient).
This will test if genes are differentially expressed in KD cells versus
mock cells.
```{r}
res <- results(dds, name = "Condition_KD_vs_mock")
res_tbl <- as_tibble(res, rownames = "ENSEMBL") %>%
arrange(padj)
head(res_tbl)
```
The normalised counts of the genes with the lowest p-adjusted values are plotted bellow.
We can see that these genes correspond to genes that are highly affected by the siRNA
in both cells lines.
```{r}
as_tibble(counts(dds[res_tbl$ENSEMBL[1:2], ], normalize = TRUE),
rownames = 'ENSEMBL') %>%
pivot_longer(names_to = "sample", values_to = "counts", -ENSEMBL) %>%
left_join(as_tibble(colData(dds), rownames = "sample")) %>%
mutate(name = paste0(substr(Type, 1, 5), '_', Condition, '_', 1:3)) %>%
ggplot(aes(x = name, y = counts, fill = Condition)) +
geom_bar(stat = 'identity', color = "gray30") +
facet_wrap( ~ ENSEMBL, scales = "free") +
theme(axis.text.x = element_text(size = 8, angle = 90),
axis.title.x = element_blank(),
legend.position = "right",
legend.text = element_text(size = 7),
legend.title = element_text(size = 7))
```
### Epithelial vs fibroblasts cells
Of course if the question of interest was
*What are the genes that are differentially expressed in epithelial and fibroblasts?*
we would have tested the *Type_Fibroblast_vs_Epithelial* coefficient.
```{r}
res <- results(dds, name = "Type_Fibroblast_vs_Epithelial")
res_tbl <- as_tibble(res, rownames = "ENSEMBL") %>%
arrange(padj)
as_tibble(counts(dds[res_tbl$ENSEMBL[1:2], ], normalize = TRUE),
rownames = 'ENSEMBL') %>%
pivot_longer(names_to = "sample", values_to = "counts", -ENSEMBL) %>%
left_join(as_tibble(colData(dds), rownames = "sample")) %>%
mutate(name = paste0(substr(Type, 1, 5), '_', Condition, '_', 1:3)) %>%
ggplot(aes(x = name, y = counts, fill = Condition)) +
geom_bar(stat = 'identity', color = "gray30") +
facet_wrap( ~ ENSEMBL, scales = "free") +
theme(axis.text.x = element_text(size = 8, angle = 90),
axis.title.x = element_blank(),
legend.position = "right",
legend.text = element_text(size = 7),
legend.title = element_text(size = 7))
```
## Design with interaction
Interaction terms can be added to the design formula, in order to test
the treatment effect in one cell line or the treatment
effect that differs across the cell lines.
The design with an interaction could be written $$design = ~ Cell * Condition$$
which is equivalent to
$$design = ~ Cell + Condition + Cell:Condition$$
Let's use this design in our analysis (we have to re-run the
`DESeqDataSetFromMatrix()` and `DESeq()` functions)
```{r, message = FALSE, warning = FALSE}
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = coldata,
design = ~ Type * Condition)
# Set the Epithelial cells as the reference level
dds$Type <- relevel(dds$Type, ref = "Epithelial")
# Set the mock condition as the reference level
dds$Condition <- relevel(dds$Condition, ref = "mock")
dds <- DESeq(dds)
```
The `resultsNames()` function gives the names of results that can be extracted.
```{r}
resultsNames(dds)
```
The linear model used with this design can be written
$$log2(q_{ij}) = Intercept + Type\_Fibroblast\_vs\_Epithelial.x_j + Condition\_KD\_vs\_mock.y_j + TypeFibroblast.ConditionKD.z_j + \epsilon$$
- The $Intercept$ represents the log2 expression level in the
reference (in mock Cell1).
- The $Type\_Fibroblast\_vs\_Epithelial$ coefficient corresponds
to the log2FC between mock fibroblasts and mock epithelial cells.
$x_j = 0$ if the sample j corresponds to fibroblasts,
and $x_j = 1$ if the sample j corresponds to epithelial cells.
- The $Condition\_KD\_vs\_mock$ coefficient corresponds to the log2FC
between KD and mock cells in the reference cells (here epithelial cells).
$y_j = 0$ if the sample j corresponds to mock cells, and $y_j$ = 1 if
the sample j corresponds to KD cells.
- The $TypeFibroblast.ConditionKD$ coefficient corresponds to the
the *extra KD effect* in KD-fibroblasts compared to KD-epitelial cells.
$z_j = 0$ if the sample j corresponds to mock cells or to epithial cells,
and $z_j = 1$ if the sample j corresponds to KD-fibroblasts.
This is summarized in the following figure:
```{r echo=FALSE}
coldata$Condition <- factor(coldata$Condition, levels = c("mock", "KD"))
coldata$Type <- factor(coldata$Type, levels = c("Epithelial", "Fibroblast"))
vd <- ExploreModelMatrix::VisualizeDesign(sampleData = coldata,
designFormula = ~ Type * Condition,
textSizeFitted = 4)
vd$plotlist[[1]]
```
From this figure, we can see how the genes log2 expression values
are modelized in the different samples.
- in mock epithelial cells
$log2(q_{Gene_i\_Epithelial\_mock}) = Intercept$
- in KD epithelial cells
$log2(q_{Gene_i\_Epithelial\_KD}) = Intercept + Condition\_KD\_vs\_mock$
- in mock fibroblasts cells
$log2(q_{Gene_i\_Fibroblast\_mock}) = Intercept + Type\_Fibroblast\_vs\_Epithelial$
- in KD fibroblasts cells
$log2(q_{Gene_i\_Fibroblast\_KD}) = Intercept + Type\_Fibroblast\_vs\_Epithelial + Condition\_KD\_vs\_mock + TypeFibroblast.ConditionKD$
### KD effect in Epithelial cells
As the epithelial cells were set as the reference cells,
the *Condition_KD_vs_mock* coefficient should be extracted to
test the effect of the KD in these cells.
In fact, testing the siRNA effect in epithelial cells is like comparing the
logFC between epithelial KD and epithelial mock cells.
$$log2(\frac{q_{Gene_i\_Epithelial\_KD}}{q_{Gene_i\_Epithelial\_mock}})$$
$$= log2(q_{Gene_i\_Epithelial\_KD}) - log2(q_{Gene_i\_Epithelial\_mock})$$
$$= Intercept + Condition\_KD - Intercept = Condition\_KD$$
```{r}
res <- results(dds, name = "Condition_KD_vs_mock")
res_tbl <- as_tibble(res, rownames = "ENSEMBL") %>%
arrange(padj)
head(res_tbl)
as_tibble(counts(dds[res_tbl$ENSEMBL[1:2], ], normalize = TRUE),
rownames = 'ENSEMBL') %>%
pivot_longer(names_to = "sample", values_to = "counts", -ENSEMBL) %>%
left_join(as_tibble(colData(dds), rownames = "sample")) %>%
mutate(name = paste0(substr(Type, 1, 5), '_', Condition, '_', 1:3)) %>%
ggplot(aes(x = name, y = counts, fill = Condition)) +
geom_bar(stat = 'identity', color = "gray30") +
facet_wrap( ~ ENSEMBL, scales = "free") +
theme(axis.text.x = element_text(size = 8, angle = 90),
axis.title.x = element_blank(),
legend.position = "right",
legend.text = element_text(size = 7),
legend.title = element_text(size = 7))
```
### KD effect in Fibroblasts cells
As the fibroblasts cells were NOT set as the reference cells,
it is less trivial to extract the right coefficients to test the
KD effect in fibroblasts.
But remember that testing the siRNA effect in fibroblast cells is like
comparing the logFC between fibroblast KD and fibroblast mock cells.
$$log2(\frac{q_{Gene_i\_Fibroblast\_KD}}{q_{Gene_i\_Fibroblast\_mock}})$$
$$= log2(q_{Gene_i\_Fibroblast\_KD}) - log2(q_{Gene_i\_Fibroblast\_mock})$$
$$= (Intercept + Condition\_KD + Type\_Fibroblast\_vs\_Epithelial + TypeFibroblast.ConditionKD )$$ $$ - (Intercept + Type\_Fibroblast\_vs\_Epithelial) $$
$$= Condition\_KD + TypeFibroblast.ConditionKD $$
So to test the KD effect in fibroblasts, we have to extract both
*Condition_KD_vs_mock* and *TypeFibroblast.ConditionKD* coefficients.
```{r}
res <- results(dds,
list(c("Condition_KD_vs_mock", "TypeFibroblast.ConditionKD")))
res_tbl <- as_tibble(res, rownames = "ENSEMBL") %>%
arrange(padj)
head(res_tbl)
as_tibble(counts(dds[res_tbl$ENSEMBL[1:2], ], normalize = TRUE),
rownames = 'ENSEMBL') %>%
pivot_longer(names_to = "sample", values_to = "counts", -ENSEMBL) %>%
left_join(as_tibble(colData(dds), rownames = "sample")) %>%
mutate(name = paste0(substr(Type, 1, 5), '_', Condition, '_', 1:3)) %>%
ggplot(aes(x = name, y = counts, fill = Condition)) +
geom_bar(stat = 'identity', color = "gray30") +
facet_wrap( ~ ENSEMBL, scales = "free") +
theme(axis.text.x = element_text(size = 8, angle = 90),
axis.title.x = element_blank(),
legend.position = "right",
legend.text = element_text(size = 7),
legend.title = element_text(size = 7))
```
Note that another possibility would have been to set the fibroblasts
as the reference cells, re-run the whole analysis, and to simply
extract the *Condition_KD_vs_mock* coefficient.
```{r, message = FALSE, warning = FALSE}
dds_test <- DESeqDataSetFromMatrix(countData = counts,
colData = coldata,
design = ~ Type * Condition)
# Set the Epithelial cells as the reference level
dds_test$Type <- relevel(dds_test$Type, ref = "Fibroblast")
# Set the mock condition as the reference level
dds_test$Condition <- relevel(dds_test$Condition, ref = "mock")
dds_test <- DESeq(dds_test)
res <- results(dds_test,
name = "Condition_KD_vs_mock")
res_tbl <- as_tibble(res, rownames = "ENSEMBL") %>%
arrange(padj)
head(res_tbl)
```
### KD effect that are different across epithelial and fibroblasts cells
Which genes are not affected in the same way by the KD in the two cell lines?
In other terms, we are testing the KD effects that are significantly different in both cell
lines. This can be done by extracting the *TypeFibroblast.ConditionKD* coefficient.
In fact this coefficient corresponds to the *extra KD effect* ocuring in fibroblasts
compared to epitelial cells.
If the effect of the KD is similar in both cell types, we expect the
*TypeFibroblast.ConditionKD* coefficient to be nearly 0.
If the effect of the KD is higher in fibroblasts than in epithelial cells, we expect the
*TypeFibroblast.ConditionKD* coefficient to be higher than 0.
If the effect of the KD is lower in fibroblasts than in epithelial cells, we expect the
*TypeFibroblast.ConditionKD* coefficient to be lower than 0.
```{r}
res <- results(dds, name = "TypeFibroblast.ConditionKD")
res_tbl <- as_tibble(res, rownames = "ENSEMBL") %>%
arrange(padj)
head(res_tbl)
```
We can see that gene ENSG00000177494 (with the lowest p-value) has as
highly negative log2FC. This means that the effect of the siRNA treatment
is much lower in fibroblasts than in epithelial cells.
Gene ENSG00000108179 is also an illustrative example,
as its value is decreased by the siRNA treatment in the epithelial cells,
while on the contrary it is increased by the treatment in the fibroblast cells.
```{r}
as_tibble(counts(dds[res_tbl$ENSEMBL[c(1,5)], ], normalize = TRUE),
rownames = 'ENSEMBL') %>%
pivot_longer(names_to = "sample", values_to = "counts", -ENSEMBL) %>%
left_join(as_tibble(colData(dds), rownames = "sample")) %>%
mutate(name = paste0(substr(Type, 1, 5), '_', Condition, '_', 1:3)) %>%
ggplot(aes(x = name, y = counts, fill = Condition)) +
geom_bar(stat = 'identity', color = "gray30") +
facet_wrap( ~ ENSEMBL, scales = "free") +
theme(axis.text.x = element_text(size = 8, angle = 90),
axis.title.x = element_blank(),
legend.position = "right",
legend.text = element_text(size = 7),
legend.title = element_text(size = 7))
```