-
Notifications
You must be signed in to change notification settings - Fork 22
/
Copy pathframingham.qmd
207 lines (160 loc) · 5.62 KB
/
framingham.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
---
title: "Framingham Heart Study"
author:
- name: Frank Harrell
affiliation: Department of Biostatistics<br>Vanderbilt University School of Medicine<br>MSCI Biostatistics II
date: last-modified
format:
html:
self-contained: true
number-sections: true
number-depth: 3
anchor-sections: true
smooth-scroll: true
theme: journal
toc: true
toc-depth: 3
toc-title: Contents
toc-location: left
code-link: false
code-tools: true
code-fold: show
code-block-bg: "#f1f3f5"
code-block-border-left: "#31BAE9"
reference-location: margin
fig-cap-location: margin
execute:
warning: false
message: false
major: regression modeling; validation
minor: ordinary least squares
---
```{r setup}
require(rms)
require(ggplot2)
options(prType='html', grType='plotly') # plotly for interactive graphs from describe()
require(qreport) # for maketabs
```
# Problem
* Dataset: Framingham 2.10 dataset from WD Dupont *Statistical Modeling for Biomedical Researchers*
* N observations: 4699
* Response: Systolic Blood Pressure
* Predictors: BMI, age, serum cholesterol, sex
* Problem 1: Can regression diagnostics find a known outlier?
* Problem 2: Internally validate a linear model
# Data Preparation
## Data Import
Change the age of subject 2642 from 55 to 103 after importing.
```{r import}
fram <- read.csv('http://hbiostat.org/data/repo/2.20.Framingham.csv')
fram <-
upData(fram,
sex = factor(sex, 1:2, c('male', 'female')),
labels = .q(sbp = 'Systolic Blood Pressure',
bmi = 'Body Mass Index',
scl = 'Serum Cholesterol'),
units = .q(sbp = mmHg))
d <- upData(fram,
age = ifelse(id == 2642, 103, age))
```
```{r}
print(contents(d), levelType='table')
```
## Descriptive Statistics
```{r desc,results='asis'}
maketabs(plot(describe(d))) # needs results='asis'
```
The age outlier is already seen in the spike histogram above.
# Linear Model With Outlier in Data
## Fit
* Allow sex to modify the effects of the other three variables
* Predicting untransformed `sbp` resulted in very problematic qq plot
* Taking log(`sbp`) solved $\frac{1}{2}$ the problem
We take logs of BMI and cholesterol in the hope of leading to more linear relationships.
```{r}
dd <- datadist(d); options(datadist='dd')
f <- ols(log(sbp) ~ sex * (log(bmi) + age + log(scl)), data=d, x=TRUE)
f
```
## Diagnostics
On the first plot, distinguish the one molested point from the others by making it red.
```{r}
r <- resid(f, type='studentized')
ggplot(mapping=aes(fitted(f), r)) +
geom_point(alpha=.2) +
geom_point(aes(x=fitted(f)[1], y=r[1], color=I('red'), size=I(3)))
ggplot(mapping=aes(sample=r)) + geom_qq() + stat_qq_line() +
xlab('Studentized Residuals') + ylab('Normal Quantiles')
```
A good advertisement for semiparametric (ordinal) regression models.
## Influence Statistics
```{r}
#| fig.height: 2.75
i <- resid(f, 'influence.measures')
par(mfrow=c(1,2))
hist(i[, 'dfb.age'], nclass=100, xlab='Change in Age beta', main='')
hist(i[, 'dffit'], nclass=100, xlab='DFFIT', main='')
w <- which.influence(f)
w
show.influence(w, d)
```
The correct point was found, and it affected two parameter estimates (main age parameter and age $\times$ sex interaction; `Count`=2) by more than 0.2 standardized units (the default cutoff for `which.influence`). Another influential observation is one with a cholesterol of 492.
# Fit and Validate Model Without Outlier
Go back to the original imported data.
```{r}
d <- fram
```
## Fit
```{r}
dd <- datadist(d); options(datadist='dd')
f <- ols(log(sbp) ~ sex * (log(bmi) + age + log(scl)),
data=d, x=TRUE, y=TRUE) # x, y needed for validations
f
```
Before preceding let's check the linearity assumptions for the continuous predictors by expanding them into splines with 4 knots.
```{r}
g <- ols(log(sbp) ~ sex * (rcs(log(bmi),4) +
rcs(age, 4) + rcs(log(scl), 4)),
data=d, x=TRUE, y=TRUE) # x, y needed for validations
g
```
```{r}
AIC(f); AIC(g) # small difference, g slightly better
```
We'll go with the simpler model.
## Bootstrap Strong Internal Validation
The small difference between $R^2$ and $R^{2}_\text{adj}$ above previews the small amount of overfitting depicted below.
Use 300 bootstrap resamples for validation of statistical indexes, each of which is used to fit a model from scratch. Use 150 bootstraps for calibration just to save time.
```{r results='asis'}
set.seed(17) # so can replicate results
v <- validate(f, B=300)
html(v)
cal <- calibrate(f, B=150)
plot(cal)
```
The calibration curve looks excellent where predicted values are not rare.
* What is overfitting and why would it be a concern?
* What is optimism?
* Does an in-sample calibration plot provide useful information?
+ Plot of $Y$ vs. $\hat{Y}$ in the model development sample (whole sample)
* What is the in-sample mean |prediction error|?
* Does the in-sample mean |prediction error| over or under estimate the out-of-sample error?
* Compute the in-sample Spearman rank correlation
* What is the difference between discriminaton and calibration?
```{r}
# Compute mean |prediction error| on the original sbp scale
with(d, mean(abs(exp(fitted(f)) - sbp), na.rm=TRUE))
with(d, spearman(fitted(f), sbp))
with(d, spearman(fitted(f), log(sbp)))
```
Also use a nonparametric smoother to get an in-sample "apparent" calibration plot.
```{r app}
ggplot(d, aes(x=fitted(f), y=log(sbp))) + geom_point(alpha=0.2) +
geom_smooth(method=lm, color='red') +
geom_smooth(color='green') +
geom_abline(intercept=0, slope=1, color='blue')
```
# Computing Environment
```{r echo=FALSE,results='asis'}
markupSpecs$html$session()
```