-
Notifications
You must be signed in to change notification settings - Fork 22
/
Copy pathcat-pred.qmd
123 lines (92 loc) · 3.82 KB
/
cat-pred.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
---
title: "Categorical Predictors"
author:
- name: Frank Harrell
affiliation: Department of Biostatistics<br>Vanderbilt University School of Medicine<br>MSCI Biostatistics II
date: last-modified
format:
html:
self-contained: true
number-sections: true
number-depth: 3
anchor-sections: true
smooth-scroll: true
theme: journal
toc: true
toc-depth: 3
toc-title: Contents
toc-location: left
code-link: false
code-tools: true
code-fold: show
code-block-bg: "#f1f3f5"
code-block-border-left: "#31BAE9"
reference-location: margin
fig-cap-location: margin
execute:
warning: false
message: false
major: regression analysis
minor: ordinary least squares; categorical
---
```{r setup}
require(rms)
options(prType='html')
```
# Problem
* Dataset: NHANES - `nhgh`
* N observations: 6805
* Response: Height
* Predictors: Race/ethnicity combination
# Data Import
```{r import}
getHdata(nhgh)
d <- upData(nhgh, keep=.q(ht, re, sex))
contents(d) # data dictionary for imported dataset
```
# Descriptive Statistics
```{r desc}
describe(d)
```
# Internal Coding for Categorical Predictors
## Cell Means Model
This is more natural but only works when there is only one categorical predictor, otherwise the model is overparameterized. The `rms` package does not support this. When not adjusting for other covariates and the model is linear (ordinary regression) the cell means are easily obtained by stratification. [See Tom Stewart's excellent [presentation](https://hbiostat.org/doc/b2/cell-means-vs-ref-cell.pdf) that graphically illustrates the difference between the reference cell and cell means models.]{.aside}
With the cell means model there is one parameter per level of the categorical predictor so there are $k$ parameters for $k$ categories. Parameter values are category-specific means.
For a model with race/ethnicity `re` as the sole predictor the model can be written
$$E(\text{height} | \text{re}) = \mu_{\text{re}}$$
where `re` is indexing the means $\mu$ over all 5 levels.
## Reference Cell Model
This is the default in R and is what `rms` uses. There are $k-1$ parameters for $k$ categories and any number of predictors may be categorical. Parameter values are offsets (differences) from the reference category mean. The reference category can be anything but by default is the first category (first `factor` level if predictor is a `factor`, first one in alphabetic order otherwise).
The model intercept is the estimated mean at the first level. If there are other variables in the model, it is the mean with all these set to zero.
```{r}
dd <- datadist(d); options(datadist='dd')
levels(d$re) # same as with(d, levels(re))
f <- ols(ht ~ re, data=d)
f
latex(f)
```
See that the reference cell is the first `level`, `Mexican American`, since `re` is a `factor` (categories are in a user-specified order).
## Compare Predictions with `re`-specific Means
```{r}
#| fig.height: 1.6
Predict(f, re)
summary(ht ~ re, data=d) # uses summary.formula in Hmisc
# Plot is flipped for categorical predictor, hence ylab
ggplot(Predict(f, re), ylab='Mean Height, cm')
```
Manually compute the estimated mean for `Other Hispanic`
```{r}
coef(f)
# First coefficient is intercept
coef(f)[1] + coef(f)['re=Other Hispanic']
```
## Controlling the Contrasts
Model estimates gave us all contrasts with `Mexican American`. Let's instead get all contrasts with `Other Hispanic`. This does not require refitting the model since it can be done using `summary.rms` (short generic name `summary`).
```{r}
summary(f, re='Other Hispanic')
```
Get a specific contrast using `contrast.rms` (short generic name `contrast`). The contrast is the mean for `Non-Hispanic Black` minus the mean for `Non-Hispanic White`.
```{r}
contrast(f, list(re='Non-Hispanic Black'),
list(re='Non-Hispanic White'))
```