-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathmodel_design.r
325 lines (273 loc) · 9.77 KB
/
model_design.r
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
# Expressing design formula in R
# http://genomicsclass.github.io/book/pages/expressing_design_formula.html
# The Design Matrix
#
# formula and model.matrix are used to produce design matrices
# aka model matrices.
# Say model Y[i] = beta0 + beta1*x[i] + e, i = 1, 2, ...,n
#
# Y[i] the weights
# x[i] = 1 when mouse[i] receives high fat diet
# x a indicator variable to indicate
# if the experimential subject had a certain characteristics or not.
# ( y[1] ) ( 1 x[1] ) ( e[1] )
# ( y[2] ) ( 1 x[2] ) ( beta0 ) ( e[2] )
# Y = ( .. ) = ( . ) x ( ) + ( .. )
# ( .. ) ( . ) ( beta1 ) ( .. )
# ( .. ) ( . ) ( .. )
# ( y[n] ) ( 1 x[n] ) ( e[n] )
# Or simply:
#
# Y = X*BETA + e
#
# The design matrix is the matrix XX.
# Choice of design
#
# The choice of design matrix encodes which coefficients will be fit in the model,
# as well as the inter-relationship between the samples.
# Suppose we have two groups, control and high fat diet, with two samples each.
# For illustrative purposes, we will code these with 1 and 2 respectively.
# We should first use factor to encode these values so that they dont get interpreted numerically.
# Values get interpreted as different levels of a factor.
group <- factor( c(1,1,2,2) )
# We can then use the paradigm ~ group to, say, model on the variable group.
# equivalence model.matrix(formula(~ group))
model.matrix(~ group)
# > model.matrix(~ group)
# (Intercept) group2
# 1 1 0
# 2 1 0
# 3 1 1
# 4 1 1
# attr(,"assign")
# [1] 0 1
# attr(,"contrasts")
# attr(,"contrasts")$group
# [1] "contr.treatment"
#
# We want the second column to have only 0 and 1, indicating group membership.
# Names of the levels are irrelevant to model.matrix and lm.
# All that matters is the order.
# The below code produces the same design matrix
group <- factor(c("control","control","highfat","highfat"))
model.matrix(~ group)
# > model.matrix(~ group)
# (Intercept) grouphighfat
# 1 1 0
# 2 1 0
# 3 1 1
# 4 1 1
# attr(,"assign")
# [1] 0 1
# attr(,"contrasts")
# attr(,"contrasts")$group
# [1] "contr.treatment"
# More groups
# Using the same formula, we can accommodate modeling more groups.
# Suppose we have a third diet:
group <- factor(c(1,1,2,2,3,3))
model.matrix(~ group)
# > model.matrix(~ group)
# (Intercept) group2 group3
# 1 1 0 0
# 2 1 0 0
# 3 1 1 0
# 4 1 1 0
# 5 1 0 1
# 6 1 0 1
# attr(,"assign")
# [1] 0 1 1
# attr(,"contrasts")
# attr(,"contrasts")$group
# [1] "contr.treatment"
#
# Now we have a third column which specifies which samples belong to the third group.
# An alternate formulation of design matrix is possible by specifying + 0 in the formula:
group <- factor(c(1,1,2,2,3,3))
model.matrix(~ group + 0)
# > model.matrix(~ group + 0)
# group1 group2 group3
# 1 1 0 0
# 2 1 0 0
# 3 0 1 0
# 4 0 1 0
# 5 0 0 1
# 6 0 0 1
# attr(,"assign")
# [1] 1 1 1
# attr(,"contrasts")
# attr(,"contrasts")$group
# [1] "contr.treatment"
#
# This group now fits a separate coefficient for each group.
# More variables
#
# To perform experiments with more than one variable (diet)
# Assume we are interested in the effect of diet and the difference in sexes.
# In this case, we have four possible groups:
diet <- factor(c(1,1,1,1,2,2,2,2))
sex <- factor(c("f","f","m","m","f","f","m","m"))
table(diet,sex)
# > table(diet,sex)
# sex
# diet f m
# 1 2 2
# 2 2 2
# If we assume that the diet effect is the same for males and females
# (this is an assumption), then our linear model is:
# y[i] = b0 + b1*x1[i] + b1*x2[i] + e[i]
# To fit this model we can simply add the additional variable with a + sign
# in order to build a design matrix which fits based on the information
# in additional variables:
diet <- factor(c(1,1,1,1,2,2,2,2))
sex <- factor(c("f","f","m","m","f","f","m","m"))
model.matrix(~ diet + sex)
# > model.matrix(~ diet + sex)
# (Intercept) diet2 sexm
# 1 1 0 0
# 2 1 0 0
# 3 1 0 1
# 4 1 0 1
# 5 1 1 0
# 6 1 1 0
# 7 1 1 1
# 8 1 1 1
# attr(,"assign")
# [1] 0 1 2
# attr(,"contrasts")
# attr(,"contrasts")$diet
# [1] "contr.treatment"
#
# attr(,"contrasts")$sex
# [1] "contr.treatment"
# The design matrix includes an intercept, a term for diet and a term for sex.
# This linear model accounts for differences in both the group and condition variables.
# Model assumes that the diet effect is the same for both males and females.
# We say these are an additive effect.
# For each variable, we add an effect regardless of what the other is.
# Another model is possible here,
# which fits an additional term and which encodes the potential interaction
# of group and condition variables.
# The interaction model can be written in either of the following two formulas:
model.matrix(~ diet + sex + diet:sex)
# or
model.matrix(~ diet*sex)
# > model.matrix(~ diet*sex)
# (Intercept) diet2 sexm diet2:sexm
# 1 1 0 0 0
# 2 1 0 0 0
# 3 1 0 1 0
# 4 1 0 1 0
# 5 1 1 0 0
# 6 1 1 0 0
# 7 1 1 1 1
# 8 1 1 1 1
# attr(,"assign")
# [1] 0 1 2 3
# attr(,"contrasts")
# attr(,"contrasts")$diet
# [1] "contr.treatment"
#
# attr(,"contrasts")$sex
# [1] "contr.treatment"
# Where does model.matrix look for the data?
# The model.matrix function will grab the variable from the R global environment,
# unless the data is explicitly provided as a data frame to the data argument:
group <- 1:4
model.matrix(~ group)
# Note how the R global environment variable group is ignored.
model.matrix(~ group, data=data.frame(group=5:8))
# > group <- 1:4
# > model.matrix(~ group)
# (Intercept) group
# 1 1 1
# 2 1 2
# 3 1 3
# 4 1 4
# attr(,"assign")
# [1] 0 1
# > model.matrix(~ group, data=data.frame(group=5:8))
# (Intercept) group
# 1 1 5
# 2 1 6
# 3 1 7
# 4 1 8
# attr(,"assign")
# [1] 0 1
# Continuous variables
# In the falling object example, time was a continuous variable in the model
# and time squared was also included:
tt <- seq(0,3.4,len=25)
model.matrix(~ tt + I(tt^2))
# > model.matrix(~ tt + I(tt^2))
# (Intercept) tt I(tt^2)
# 1 1 0.0000000 0.00000000
# 2 1 0.1416667 0.02006944
# 3 1 0.2833333 0.08027778
# 4 1 0.4250000 0.18062500
# 5 1 0.5666667 0.32111111
# 6 1 0.7083333 0.50173611
# 7 1 0.8500000 0.72250000
# 8 1 0.9916667 0.98340278
# 9 1 1.1333333 1.28444444
# 10 1 1.2750000 1.62562500
# 11 1 1.4166667 2.00694444
# 12 1 1.5583333 2.42840278
# 13 1 1.7000000 2.89000000
# 14 1 1.8416667 3.39173611
# 15 1 1.9833333 3.93361111
# 16 1 2.1250000 4.51562500
# 17 1 2.2666667 5.13777778
# 18 1 2.4083333 5.80006944
# 19 1 2.5500000 6.50250000
# 20 1 2.6916667 7.24506944
# 21 1 2.8333333 8.02777778
# 22 1 2.9750000 8.85062500
# 23 1 3.1166667 9.71361111
# 24 1 3.2583333 10.61673611
# 25 1 3.4000000 11.56000000
# attr(,"assign")
# [1] 0 1 2
## Expressing Design Formula Exercises
# Suppose we have an experiment with the following design:
# on three different days, perform an experiment with two treated and two control samples.
# We then measure some outcome y[i] and want to test the effect of condition,
# while controlling for whatever differences might have occured due to the the different day
# (maybe the temperature in the lab affects the measuring device).
# Assume that the true condition effect is the same for each day
# (no interaction between condition and day).
# We then define factors for 'day' and for 'condition'.
condition <- factor(c("treated","treated","treated","treated","treated","treated","control",
"control","control","control","control","control"))
day <- factor(c("A","A","B","B","C","C","A","A","B","B","C","C"))
table(condition,day)
# day: A B C
# condition: --------------
# treated | 2 2 2
# control | 2 2 2
# Given the factors we have defined above, and not defining any new ones,
# the following formula will produce a design matrix (model matrix)
# that let's us analyze the effect of condition, controlling for the different days:
model.matrix(~ condition + day)
# > model.matrix(~ condition + day)
# (Intercept) conditiontreated dayB dayC
# 1 1 1 0 0
# 2 1 1 0 0
# 3 1 1 1 0
# 4 1 1 1 0
# 5 1 1 0 1
# 6 1 1 0 1
# 7 1 0 0 0
# 8 1 0 0 0
# 9 1 0 1 0
# 10 1 0 1 0
# 11 1 0 0 1
# 12 1 0 0 1
# attr(,"assign")
# [1] 0 1 2 2
# attr(,"contrasts")
# attr(,"contrasts")$condition
# [1] "contr.treatment"
#
# attr(,"contrasts")$day
# [1] "contr.treatment"