---
title: "NHANES"
author:
  - name: Frank Harrell
    affiliation: Department of Biostatistics<br>Vanderbilt University School of Medicine<br>MSCI Biostatistics II
date: last-modified
format:
  html:
    self-contained: true
    number-sections: true
    number-depth: 3
    anchor-sections: true
    smooth-scroll: true
    theme: journal
    toc: true
    toc-depth: 3
    toc-title: Contents
    toc-location: left
    code-link: false
    code-tools: true
    code-fold: show
    code-block-bg: "#f1f3f5"
    code-block-border-left: "#31BAE9"
    reference-location: margin
    fig-cap-location: margin
execute:
  warning: false
  message: false
major: data reduction
---
```{r setup}
require(rms)
require(qreport) # for maketabs
options(prType='html', grType='plotly') # plotly makes certain graphics interactive
```
# Problem: How Are Demographics Associated with Height?
* Dataset: `nhgh`
* N observations: 6795
* Response: height
* Predictors: Race/ethnicity, sex, age
# Data Import
```{r import}
getHdata(nhgh)   # fetch the dataset from hbiostat.org/data
d <- upData(nhgh, keep=.q(sex, age, re, ht, SCr, gh,
                          dx, tx, wt, bmi, leg, arml, armc, waist, tri, sub))
contents(d)
```
# Descriptive Statistics
```{r results='asis'}
sparkline::sparkline(0) # initialize javascript for describe
des <- describe(d)
maketabs(print(des, 'both'), wide=TRUE)
```
# Plot Raw Data
There are many tied values of `ht`, so use hexagonal binning to depict frequency counts of coincident points. Use `ggplot2` faceting to stratify by `re` and `sex`. Make the plot taller than usual with the chunk markup `#| fig.height: 8.5` (8.5 inches). [If you have trouble getting this to run, try installing the `hexbin` package.]{.aside}
```{r}
#| fig.height: 8.5
cuts <- c(1:5, 10, 15, 20, 30)   # cutpoints for cell N
ggplot(d, aes(x=age, y=ht)) +
  facet_grid(re ~ sex) +
  stat_binhex(aes(alpha=after_stat(count),
                  color=after_stat(cut2(count, cuts=cuts))), bins=80) +
  xlab(hlab(age)) + ylab(hlab(ht)) +
  guides(alpha='none', fill='none', color=guide_legend(title='Frequency'))
```
# Linear Model
* No assumption of linearity for continuous predictors
* Allow for age $\times$ sex interaction (different shapes of age relationship by sex)
```{r}
dd <- datadist(d); options(datadist='dd')
# x=TRUE needed for studentized residuals
f <- ols(ht ~ re + sex * rcs(age, 4), data=d, x=TRUE)
f
latex(f)
```
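A standard follow-up for `rms` fits (not shown in the original code) is `anova`, which gives joint Wald tests for the nonlinear age terms and the age $\times$ sex interaction:
```{r}
# Joint Wald tests for all effects, nonlinear components, and the interaction
anova(f)
```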
# Diagnostics
## Residuals
```{r}
r <- resid(f, type='studentized')
ggplot(mapping=aes(fitted(f), r)) + geom_point(alpha=.2) +
  xlab('Fitted Values') + ylab('Studentized Residuals')
ggplot(mapping=aes(sample=r)) + geom_qq() + stat_qq_line() +
  xlab('Normal Quantiles') + ylab('Studentized Residuals')
```
* What is your opinion about which model assumptions hold and which don't?
* How might the right hand side of the model not fit adequately?
## Influence Statistics
Find observations that, if removed, would change a parameter estimate by more than 0.15 standardized (by standard error) units.[Because of the large sample size there were no overly influential observations at standard cutoffs such as 0.4; the cutoff was relaxed to 0.15 for illustration.]{.aside} `which.influence` in `rms` recognizes that some predictors have multiple $\beta$s associated with them (e.g., splined predictors, or categorical predictors such as `re` with more than 2 levels). It finds observations that are influential on *any* of a predictor's parameters.
```{r}
w <- which.influence(f, 0.15)
w
show.influence(w, d)
```
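Under the hood, `which.influence` examines the DFBETAS matrix; here is a sketch of inspecting it directly (the 0.15 cutoff matches the call above):
```{r}
# DFBETAS: standardized change in each coefficient estimate when an
# observation is deleted (requires x=TRUE in the ols call)
db <- resid(f, type='dfbetas')
range(db)                            # most extreme standardized changes
sum(apply(abs(db) > 0.15, 1, any))   # observations influential on any coefficient
```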
# Data Reduction
**Why do data reduction?**
If we hope to construct stable models, there is a limit to how complex they can be: the type of outcome and the number of observations limit model complexity. [Some of this material was provided by Tom Stewart]{.aside}
**Approach**
Determine the "effective sample size" and a parameter budget. Live within the parameter budget, allocating parameters to predictors based on research goals and strength of association with the outcome.
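A rough illustration of the budgeting step (a sketch only; the heuristic of roughly one candidate parameter per 15 observations for a continuous response is an assumption added here, not part of the original exercise):
```{r}
# Hypothetical parameter budget using the m = n / 15 heuristic
# for a continuous response
n <- nrow(d)    # observations currently in the analysis dataset
floor(n / 15)   # approximate number of candidate parameters affordable
```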
**Challenge**
Lots of predictors with a limited parameter budget.
**Strategies Demonstrated**
* Generalized Spearman $\rho^2$ (predictive potential)
* Variable clustering
* Redundancy analysis
* Principal components
## Data Subsetting
Include persons age 21 and older who have never been diagnosed or treated for diabetes.
```{r}
d <- subset(d, age >= 21 & dx == 0 & tx == 0)
cat(nrow(d), 'subjects qualified\n')
```
## Predictive Potential of Continuous Variables
```{r}
#| fig.height: 3.5
s <- spearman2(gh ~ age + bmi + ht + wt + leg + arml + armc +
waist + tri + sub, data=d)
s
plot(s)
```
## Variable Clustering
* Shows possibility of combining some variables
* Occasionally leads to removal of some variables
* Helps with collinearities and redundancies, lowering the effective number of parameters needing estimation
```{r}
v <- varclus(~ age + bmi + ht + wt + leg + arml + armc + waist +
tri + sub + re + sex,
data=d) # function is in Hmisc
plot(v)
```
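The dendrogram is driven by a similarity matrix, by default squared Spearman rank correlations; a corner of it can be inspected through the `sim` component of the result (a quick look, not required for the exercise):
```{r}
# Squared Spearman correlations used as similarities by varclus
round(v$sim[1:5, 1:5], 2)
```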
* Interpret what you see for continuous variables
* Interpret what you see for categorical variables
## Redundancy Analysis
* Try to predict each predictor from all the other predictors (splined if continuous; all indicator variables less one if categorical)
* Note that `bmi`, `ht`, and `wt` would have been perfectly predicted (any one from the other two; $R^{2} = 1.0$) had they been logged, as checked after the output below
```{r}
redun(~ age + bmi + ht + wt + leg + arml + armc + waist + tri + sub + re + sex,
data=d) # function is in Hmisc
```
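To verify the log-scale redundancy (a quick check added here, assuming `ht` is in cm and `wt` in kg so that $\log(\text{bmi}) = \log(\text{wt}) - 2\log(\text{ht}/100)$, which should hold up to rounding in the recorded `bmi`):
```{r}
# bmi = wt / (ht / 100)^2, so the log identity holds up to rounding
with(d, quantile(abs(log(bmi) - (log(wt) - 2 * log(ht / 100))),
                 c(.5, .9, 1), na.rm=TRUE))
```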
## Principal Components
* A way to find linear combinations of predictors that capture much of the information in the individual predictors
* Is masked to $Y$, so it is *unsupervised learning* and does not lead to overfitting
* Consider the body size measurements only, and remove BMI for the moment because it is perfectly computed from height and weight (were logs taken on all 3 variables)
* PCs are linear combinations of the original variables
* The first PC (PC$_1$) explains the maximum variance, subject to a scaling constraint
* The second PC (PC$_2$) explains the next largest amount of variance, subject to being uncorrelated with PC$_1$
```{r}
#| fig.height: 4
pc <- princmp(~ age + ht + wt + leg + arml + armc + waist + tri + sub,
data=d, sw=TRUE)$scores
pc1 <- pc[, 1] # vector of PC1
pc2 <- pc[, 2] # PC2
```
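As a cross-check (a sketch using base R's `prcomp` on standardized complete cases, not part of the original exercise), successive PCs are uncorrelated by construction:
```{r}
# Cross-check with base R: prcomp on the standardized variables
# (complete cases only; prcomp does not accept NAs)
X <- d[, .q(age, ht, wt, leg, arml, armc, waist, tri, sub)]
X <- X[complete.cases(X), ]
p <- prcomp(X, scale.=TRUE)
round(cor(p$x[, 1:3]), 10)   # first 3 PC scores are mutually uncorrelated
```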
* How many components would you choose?
* To use PC-based data reduction in outcome prediction, substitute `pc1`, `pc2`, ... for some of the constituent variables.
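For example, a minimal sketch (assuming, as the comparison loop below does, that the PC score vectors align with the rows of `d`):
```{r}
# Replace the nine measurement variables with their first two PCs,
# keeping race/ethnicity and sex as ordinary predictors
g <- orm(gh ~ pc1 + pc2 + re + sex, data=d)
g
```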
The `princmp` function can also compute sparse PCs if you have installed the `pcaPP` package. This sets many of the loadings exactly to zero, effectively doing variable clustering and principal components simultaneously.
```{r}
spc <- princmp(~ age + ht + wt + leg + arml + armc + waist + tri + sub,
data=d, sw=TRUE, method='sparse')$scores
```
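To see which variables the sparse method dropped from each component, one could refit with the same call and print the whole `princmp` object instead of extracting only the scores (a sketch):
```{r}
# Same call as above, keeping the full object so that the sparse
# (mostly zero) loadings are displayed
sp <- princmp(~ age + ht + wt + leg + arml + armc + waist + tri + sub,
              data=d, sw=TRUE, method='sparse')
print(sp)
```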
* Let's see how regular and sparse PCs predict glycohemoglobin in an ordinal model
```{r results='asis'}
w <- vector('list', 10)
# Models using the first k regular PCs, then the first k sparse PCs
for(k in 1:5) w[[k]]     <- orm(gh ~ pc[,  1:k], data=d)
for(k in 1:5) w[[k + 5]] <- orm(gh ~ spc[, 1:k], data=d)
nam <- c(paste0('PC1-', 1:5), paste0('Sparse PC1-', 1:5))   # PC1-k: first k PCs
names(w) <- nam
maketabs(w)
```
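A compact numeric comparison of the ten fits (a sketch using the pseudo $R^2$ stored in each `orm` fit's `stats` component):
```{r}
# Pseudo R^2 for each of the ten ordinal models
round(sapply(w, function(m) unname(m$stats['R2'])), 3)
```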
# Computing Environment
```{r echo=FALSE,results='asis'}
markupSpecs$html$session()
```