-
Notifications
You must be signed in to change notification settings - Fork 22
/
Copy pathabd18-molerats.qmd
221 lines (161 loc) · 7.62 KB
/
abd18-molerats.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
---
title: "Mole Rats Energy Expenditure"
author:
- name: Frank Harrell
affiliation: Department of Biostatistics<br>Vanderbilt University School of Medicine<br>MSCI Biostatistics II
date: last-modified
format:
html:
self-contained: true
number-sections: true
number-depth: 3
anchor-sections: true
smooth-scroll: true
theme: journal
toc: true
toc-depth: 3
toc-title: Contents
toc-location: left
code-link: false
code-tools: true
code-fold: show
code-block-bg: "#f1f3f5"
code-block-border-left: "#31BAE9"
reference-location: margin
fig-cap-location: margin
execute:
warning: false
message: false
major: regression modeling; ABD
minor: ordinary least squares
---
```{r setup}
require(rms)
options(prType='html')
getRs('abd.r')
require(ggplot2)
```
# Problem
* Mole rats , ABD example 18.4, page 620 2nd[Adapted from Tom Stewart's Stata exercise.]{.aside}
* Dataset: 18-e-4
* observations: 35
* Response: energy (daily energy expenditure)
* Predictors: mass (body mass); caste (worker, lazy)
* Goal: How different is mean daily energy expenditure between busy and lazy molerats when adjusted for differences in body mass?
# Data Import
The dataset contained only natural log-transformed variables. This is not typically good practice. In R it's best too transform on-the-fly, so we anti-log the values after importing the data, and drop the original log variables.
```{r import}
d <- getabd('18-e-4')
d <- upData(d,
mass = exp(lnMass),
energy = exp(lnEnergy),
labels = .q(energy = 'Energy Expenditure',
mass = 'Body Mass'),
drop = .q(lnMass, lnEnergy))
contents(d) # data dictionary for imported dataset
```
# Descriptive Statistics
```{r desc}
describe(d)
```
# What relationships are the researchers studying?
# Plot the raw data
First plot the raw data on their original scale then after taking natural log of the two continuous variables. Levels of `caste` are depicted using color.
```{r}
nc <- with(d, nCoincident(mass, energy)) # nCoincident in Hmisc
nc # no coincident points so ignore
```
```{r}
ggplot(d, aes(x=mass, y=energy, color=caste)) + geom_point() +
xlab(hlab(mass)) + ylab(hlab(energy))
ggplot(d, aes(x=mass, y=energy, color=caste)) + geom_point() +
scale_x_continuous(trans='log',
breaks = c(45, seq(50, 200, by=25)),
minor_breaks = seq(50, 200, by=5)) +
scale_y_continuous(trans='log',
breaks = c(35, seq(50, 175, by=25)),
minor_breaks = seq(40, 175, by=5)) +
xlab(hlab(mass)) + ylab(hlab(energy))
```
# Write down the linear regression model that will help understand relationships of interest
Log transformations make things more symmetric and appear to make a linear model work better, so let's use them. Use natural logs.
$$\ln(\text{energy}) = \beta_{0} + \beta_{1}\times\ln(\text{mass}) + \beta_{2}\times[\text{caste=worker}] + \beta_{3}\times\ln(\text{mass})\times[\text{caste=worker}]$$
where the 0/1 indicator $[u]$ is 1 if $u$ is true, 0 if $u$ is false.
* How many parameters are in the model?
# Write null and alternative hypotheses of the interaction effect
* Using parameters in the full model: $H_{0}: \beta_{3}=0, H_\text{A}: \beta_{3} \neq 0$
* By a comparison of two models: fit a model that omits the interaction term and run a statistical test based on how much worsening there is in $\hat{\sigma}^2$ which is equivalent to how much loss there is in $R^2$.
* What does the presence of an interaction mean in terms of the research question?
# Fit the model
We fit model `f` with interaction and model `g` without interaction.
Note that the `*` notation in R automatically includes main effects along with interactions of variables on either side of `*`. R works best when transformations are specified inside the model formula.
```{r}
dd <- datadist(d); options(datadist='dd')
f <- ols(log(energy) ~ log(mass) * caste, data=d, x=TRUE) # x=TRUE needed for studentized residuals
f
latex(f)
g <- ols(log(energy) ~ log(mass) + caste, data=d, x=TRUE)
g
latex(g)
```
# Plot fits from both models
See Fig. 18.4-1 ABD 2nd. Note that `rms` plots on the original predictor scale but takes logs internally. `Predict(fit, x, z)` means to obtain predictions for combinations of values of `x` and `z` with `x` placed on the $x$-axis and `z` generating separate curves vertically. Adding `fun=exp` transforms predicted values by anti-logging them, before plotting. Predictions default to using the 10th smallest and largest values of `mass` for limits. We override that to use the full data range.
```{r}
rmass <- with(d, range(mass))
masses <- seq(rmass[1], rmass[2], length=150)
ggplot(Predict(f, mass=masses, caste, fun=exp))
ggplot(Predict(g, mass=masses, caste, fun=exp))
```
# Diagnostics for both models
* Check for randomness and constant horizontal variability in residuals vs. $\hat{Y}$
* Look for normality of residuals
We use studentized residuals. [See [this](https://stackoverflow.com/questions/4357031) for a discussion of using `ggplot2` for qqnorm plots.]{.aside}
```{r}
rf <- resid(f, type='studentized')
rg <- resid(g, type='studentized')
ggplot(mapping=aes(fitted(f), rf)) + geom_point()
ggplot(mapping=aes(sample=rf)) + geom_qq() + stat_qq_line() +
xlab('Studentized Residuals') + ylab('Normal Quantiles')
ggplot(mapping=aes(fitted(g), rg)) + geom_point()
ggplot(mapping=aes(sample=rg)) + geom_qq() + stat_qq_line() +
xlab('Studentized Residuals') + ylab('Normal Quantiles')
```
# Chunk tests
In the following example, we are going to perform a series of chunk tests. We are going to demonstrate how to perform a chunk test with two different methods. The first method is a comparison of models. The second is a direct test of the betas. The two methods are equivalent and will give identical answers for linear models. [The comparison of models approach is only valid if the regression models are fit
using the same observations. This is often an issue if there is missing data as the default practice is to disregard observations that have missing covariates. The consequence is that the reduced model may use more observations than the full model. We will cover an example of how to compare models when there is missing data.]{.aside}
Depending on your circumstances, one method may be easier to implement than another.
## Method 1: Comparison of models
Use the built-in `lm` function to do this as its `anova` function compares two models with an $F$-test
### Interaction test
* Full: $E[\ln(\text{energy})] = \beta_{0} + \beta_{1}\times\ln(\text{mass}) + \beta_{2}\times[\text{worker}] + \beta_{3}\times\ln(\text{mass})\times[\text{worker}]$
* Reduced: $E[\ln(\text{energy})] = \beta_{0} + \beta_{1}\times\ln(\text{mass}) + \beta_{2}\times[\text{worker}]$
```{r}
# Fit the full model with lm first
flm <- lm(log(energy) ~ log(mass) * caste, data=d)
anova( lm(log(energy) ~ log(mass) + caste, data=d), flm)
```
### Total impact of mass
* Reduced: $E[\ln(\text{energy})] = \beta_{0} + \beta_{1}\times[\text{worker}]$
```{r}
anova(lm(log(energy) ~ caste, data=d), flm)
```
### Total impact of caste
* Reduced: $E[\ln(\text{energy})] = \beta_{0} + \beta_{1}\times\ln(\text{mass})$
```{r}
anova(lm(log(energy) ~ log(mass), data=d), flm)
```
### Overall regression
* Reduced: $E[\ln(\text{energy})] = \beta_{0}$
```{r}
anova(lm(log(energy) ~ 1, data=d), flm)
```
## Method 2: Directly test $\beta$ in full model
All the tests are automatically done by the `rms` package's `anova` function on the `ols` full model fit.
```{r}
anova(f)
```
This approach avoids issues with missing data.
# Computing Environment
```{r echo=FALSE,results='asis'}
markupSpecs$html$session()
```