forked from UCLouvain-CBIO/WSBIM2122
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path30-linmods.Rmd
495 lines (362 loc) · 14.7 KB
/
30-linmods.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
# Linear models {#sec-linmod}
The objective of this chapter is to provide students with an intuitive
understanding of how linear models are used in significance testing,
and more specifically
- show how linear models are equivalent to a t-test
- demonstrate their flexibility.
## Starting with a t-test
`r msmbstyle::question_begin()`
Load the `gexp` data from the `rWSBIM2122` package (version >= 0.2.2)
as show below. It provides the log2 expression of a single gene of
interest measured in two groups, namely 1 and 2. Ignore the `ID`
column for now.
```{r gexp}
library(rWSBIM2122)
data(gexp)
gexp
```
Using a t-test, compare the expression in groups 1 and 2. In
particular, report the mean expression of each group, the log2
fold-change, the associated p-value and visualise the data.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r gexp_t_test1}
t.test(expression ~ group, data = gexp)
```
```{r gexp_vis1, message = FALSE}
library("ggplot2")
p <- ggplot(gexp, aes(x = group, y = expression)) +
geom_boxplot() + geom_jitter()
p
```
`r msmbstyle::solution_end()`
## The linear model equivalence
Let's now use linear regression and the `lm()` function to model the
expression in groups 1 and 2:
$$y = \beta_0 + \beta_1 \times x + \epsilon$$
where $y$ is the expression we want to model and $x$ represents the
two groups that are generally encoded as 0 and 1. The expression above
translates as follows:
- the expression in group 1 (i.e. when $x = 0$) is equal to $\beta_0$
(the average expression of group 1);
- the expression in group 2 (i.e. when $x = 1$) is equal to
$\beta_0$ + $\beta_1$ (the difference between group 1 and group 2).
We model this with
```{r mod1}
mod1 <- lm(expression ~ group, data = gexp)
mod1
```
An indeed, we see that
1. the intercept corresponds to the mean in group 1, and
2. the slope (`group2` coefficient) of the model corresponds to the
log2 fold-change and that, as expected, the mean of group 2 is
*intercept + slope*.
We can visualise this linear model with:
```{r}
p +
geom_point(data = data.frame(x = c(1, 2), y = c(2.75, 4.33)),
aes(x = x, y = y),
colour = "red", size = 5) +
geom_abline(intercept = coefficients(mod1)[1] - coefficients(mod1)[2],
slope = coefficients(mod1)[2])
```
Note that above, we need to subtract the slope from the intercept when
plotting the linear model because groups 1 and 2 are encoded as 0 and
1 in the model, but plotted as 1 and 2 on the figure.
We can also extract the p-value associated with the coefficients:
```{r}
summary(mod1)
```
that are computer as
$$t_{score} = \frac{\beta_i - 0}{SE_{\beta_i}}$$
that has a t-distribution with n − 2 degrees of freedom if the null
hypothesis is true. We use a t-test that tests whether the slope is
different from 0:
- $H_0$: $\beta_i = 0$
- $H_1$: $\beta_i \neq 0$
The p-values for the slope $\beta_1$ is slightly different due to the
homoscedasticity assumption that wasn't set in the t-test, which we
can repeat:
```{r}
t.test(expression ~ group, data = gexp, var.equal = TRUE)
```
## Paired t-test
`r msmbstyle::question_begin()`
The `ID` variable in the `gexp` data provides the unique sample
identifier: it is the sample in which an expression of 2.7 was
measured in group 1 and 3.9 in group 2.
Repeat the test comparing the gene expression between group 1 and
group 2 using a *paired* t-test. Report the mean expression of each
group, the log2 fold-change, the associated p-value and visualise the
data.
Compare the p-values obtained in the paired and unpaired designs and
explain the differences.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r gexp_t_test2}
t.test(expression ~ group, data = gexp, paired = TRUE)
```
```{r gexp_vis2}
gexp2 <- dplyr::full_join(gexp[1:10, ], gexp[11:20, ],
by = "ID", suffix = c("_1", "_2"))
ggplot(gexp, aes(x = group, y = expression)) +
geom_boxplot() +
geom_point(aes(colour = ID), size = 4) +
geom_segment(data = gexp2,
aes(x = group_1, xend = group_2,
y = expression_1, yend = expression_2,
colour = ID))
```
In the unpaired design, the variances in groups 1 and 2 are
`r round(var(gexp2$expression_1), 2)` and
`r round(var(gexp2$expression_2), 2)` respectively,
and do not take into account the paired relation between
samples. Using a paired design allows for a smaller estimation of the
observed variability, leading to a greater t statistic and a smaller
p-value.
`r msmbstyle::solution_end()`
This comparison can be replicated using a linar model[^mod2]
[^mod2]: See below for the exact equation.
$$y = \beta_0 + \beta_1 \times group + \beta_i \times ID_i + \epsilon$$
where the $\beta_i$ coefficients will encode the relation between
pairs of individuals. We model this with
```{r mod2}
mod2 <- lm(expression ~ group + ID, data = gexp)
mod2
```
which corresponds to the following model
```{r, echo = FALSE, results = "asis", eval = FALSE, include = FALSE}
## this has been used to generate the equation below
equatiomatic::extract_eq(mod2, wrap = TRUE, terms_per_line = 4)
## modified \alpha into \beta_{0} manually
```
$$
\begin{aligned}
\operatorname{expression} &= \beta_{0} + \beta_{1}(\operatorname{group}_{\operatorname{2}}) + \beta_{2}(\operatorname{ID}_{\operatorname{2}}) + \beta_{3}(\operatorname{ID}_{\operatorname{3}})\ + \\
&\quad \beta_{4}(\operatorname{ID}_{\operatorname{4}}) + \beta_{5}(\operatorname{ID}_{\operatorname{5}}) + \beta_{6}(\operatorname{ID}_{\operatorname{6}}) + \beta_{7}(\operatorname{ID}_{\operatorname{7}})\ + \\
&\quad \beta_{8}(\operatorname{ID}_{\operatorname{8}}) + \beta_{9}(\operatorname{ID}_{\operatorname{9}}) + \beta_{10}(\operatorname{ID}_{\operatorname{10}}) + \epsilon
\end{aligned}
$$
The `group2` coefficient still corresponds to the fold-change between
the two groups. The coefficients `ID2` to `ID10` correspond to the
respective differences between the average expression of `ID1` and
`ID2` to `ID10`.
```{r}
avg_id <- (gexp2$expression_1 + gexp2$expression_2)/2
coefs_id <- avg_id[-1] - avg_id[1]
names(coefs_id) <- paste0("ID", 2:10)
coefs_id
```
As above, a summary of the model will give us the statistical
significance for the different coefficients.
```{r}
summary(mod2)
```
`r msmbstyle::question_begin()`
Interpret the summary of `mod2` computed above.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
We see that the `group2` coefficient, that tests the difference
between the two groups is now significant. It is identical to the
paired t-test above.
The respective ID coefficients compare the average expression in these
IDs against the average expression of ID1. We can indeed see that ID6
and ID7, that have significant coefficients, are the most different
from ID1.
`r msmbstyle::solution_end()`
## Model generalisation
In the example above, we had two variables that influenced the gene
expression, namely the group and the specific individual and we ended
up with a model that included both of these[^tilda]:
[^tilda]: We will drop the left-hand side term of our formula, setting
it by default as the quantiative expression data.
```{r, eval = FALSE}
~ group + ID
```
Any additional variables could be added to the model if deemed
necessary.
`r msmbstyle::question_begin()`
We have measured the gene expression in cell lines U2OS and Jurkat
(defined as variable `cell_line`), each in the presence of drug A or B
(defined as variable `drug`). How would you define the model formula?
`r msmbstyle::question_end()`
`r msmbstyle::question_begin()`
As above, we have measured the gene expression in cell lines U2OS and
Jurkat (defined as variable `cell_line`), each in the presence of drug
A or B (defined as variable `drug`). This time however, two operators
(defined as variable `operator`) have shared the work. How would you
define the model formula?
`r msmbstyle::question_end()`
`r msmbstyle::question_begin()`
We have measured the gene expression in cell lines U2OS and Jurkat
(defined as variable `cell_line`), each in the presence of drug A, B
or C (defined as variable `drug`). How would you define the model
formula? Which coefficient would you include in the hypothesis test to
assess the drug effect?
`r msmbstyle::question_end()`
## Interactions between variables
Consider the following design: two plant genotypes, wild type and a
mutant that shows some resistance to drought, that are grown,
collected and analysed by RNA sequencing under normal, humid and dry
conditions. The researchers that set up this experiment expect that
the two genotypes will produce, for some key genes, different or even
opposite responses in the extreme conditions: that the expression of
some genes will increase in the mutant under dry conditions while they
will decrease (or change to a lesser extend at least) under humid
conditions.
```{r, echo = FALSE}
des <- data.frame(sample = 1:12,
genotype = rep(c("WT", "MT"), each = 6),
condition = rep(c("normal", "dry", "humid"), 4))
DT::datatable(des, rownames = FALSE,
class = 'cell-border stripe',
options = list(
pageLength = 12,
autoWidth = TRUE))
```
Such situation require a new model formula syntax
```{r, eval = FALSE}
~ genotype * condition
```
which is equivalent to
```{r, eval = FALSE}
~ genotype + condition + genotype * condition
```
Here, coefficients are estimated for `genotype`, `condition` and the
**interaction** between the two, and the model will test for
differences in *condition* slopes between the different
genotypes. These differences might even culminate in opposite signs.
## Exercise
`r msmbstyle::question_begin()`
Load the `lmdata` data available in the `rWSBIM2122` package (version
0.2.2 or later), that provides two variables, `lmdata` and `lmannot`,
describing a factorial design (2 factors, `condition` and `cell_type`
with 2 levels each) with gene expression for 5 genes and 12 samples.
```{r}
data(lmdata, package = "rWSBIM2122")
lmdata
lmannot
```
Analyse each gene with a linear model, defining an appropriate model
formula, and interpret the results for each gene. You can visualise
the data to help with the interpretation.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r, message = FALSE}
library("tidyverse")
## Prepare the data
x <-
data.frame(lmdata) %>%
rownames_to_column() %>%
pivot_longer(names_to = "sample", values_to = "expression", -rowname) %>%
full_join(lmannot) %>%
mutate(sample = sub("sample_", "", sample))
```
```{r}
## Analyse gene 1
x_i <- filter(x, rowname == "gene_1")
x_i
mod_i <- lm(expression ~ condition * cell_type, data = x_i)
summary(mod_i)
ggplot(x_i, aes(x = sample, y = expression,
colour = condition,
shape = cell_type)) +
geom_point()
```
```{r}
## Analyse gene 2
x_i <- filter(x, rowname == "gene_2")
x_i
mod_i <- lm(expression ~ condition * cell_type, data = x_i)
summary(mod_i)
ggplot(x_i, aes(x = sample, y = expression, colour = cell_type)) +
geom_point() +
facet_wrap(~ condition)
```
```{r}
## Analyse gene 3
x_i <- filter(x, rowname == "gene_3")
x_i
mod_i <- lm(expression ~ condition * cell_type, data = x_i)
summary(mod_i)
ggplot(x_i, aes(x = sample, y = expression, colour = condition)) +
geom_point() +
facet_wrap(~ cell_type)
```
```{r}
## Analyse gene 4
x_i <- filter(x, rowname == "gene_4")
x_i
mod_i <- lm(expression ~ condition * cell_type, data = x_i)
summary(mod_i)
ggplot(x_i, aes(x = sample, y = expression,
colour = condition,
shape = cell_type)) +
geom_point()
```
```{r}
## Analyse gene 5
x_i <- filter(x, rowname == "gene_5")
x_i
mod_i <- lm(expression ~ condition * cell_type, data = x_i)
summary(mod_i)
group_by(x_i, cell_type, condition) %>% summarise(mean = mean(expression))
group_by(x_i, cell_type) %>% summarise(mean = mean(expression))
group_by(x_i, condition) %>% summarise(mean = mean(expression))
ggplot(x_i, aes(x = sample, y = expression,
colour = condition,
shape = cell_type)) +
geom_point() +
facet_grid(cell_type ~ condition)
```
`r msmbstyle::solution_end()`
## Which model to use?
The same experimental design can be analysed using different
models. Some are clearly not appropriate, such as the omission of the
ID variable in the first example. Other cases aren't that clear.
- If we return to the previous example for gene 3, we see that in both
cases, we obtain significant results for the cell type coefficient,
albeit with a much smaller p-value in the simpler model.
```{r}
x_i <- filter(x, rowname == "gene_3")
summary(lm(expression ~ condition + cell_type, data = x_i))
summary(lm(expression ~ condition * cell_type, data = x_i))
```
The questions asked is slighly different. The coefficient `cell_typeB`
in the model without interaction tests for shared differences in the
two cell types by taking into account that there are two
conditions. In the model with interaction, the test focuses in the
cell type effect in the reference condition only.
- If we look at gene 5, we see that we miss out several significant
effects if we use the simpler model, without interaction, because
the simpler model doesn't ask the right question.
```{r}
x_i <- filter(x, rowname == "gene_5")
summary(lm(expression ~ condition + cell_type, data = x_i))
summary(lm(expression ~ condition * cell_type, data = x_i))
```
In high throughput experiments, where we test tens of thousands of
genes, we can't explore each gene separately with multiple models. The
same model must be applied to all genes and the significance of
coefficient(s) of interest is/are considered.
Solving this dilema isn't a statistical but a scientific question. You
will have to define the hypothesis that underlies the
experiment[^hypo]. If there are reasons to believe that an interaction
effect is relevant (i.e. there are different effects in the two cell
types), then the more complex model (with an interaction) is
appropriate. If no such interaction is expected, then testing for it
penalises your analysis. In other words it isn't the complexity of the
model that makes its quality, but its appropriateness for the studied
design/effect.
[^hypo]: ideally even before generating any data.
Accumulating models and tests isn't necessarily a solution either, as
this leads to the accumulation of tests that would require additional
adjustment for multiple testing.
## Additional reading
We will be using linear models when analysing gene expression
experiment, with concrete RNA sequencing example. The [*A guide to
creating design matrices for gene expression
experiments*](https://f1000research.com/articles/9-1444) paper
[@Law:2020] is an excellent guide describing the basics of how to set
up design and contrast matrices and great preparation to chapter
\@ref(sec-rnaseq).