-
Notifications
You must be signed in to change notification settings - Fork 0
/
report-hw08.Rmd
515 lines (382 loc) · 16.2 KB
/
report-hw08.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
---
title: "Homework 08"
author: "Spencer Pease"
date: "12/09/2019"
output: pdf_document
---
```{r setup, include=FALSE}
# Set output options
knitr::opts_chunk$set(echo = FALSE)
options(knitr.kable.NA = '-')
# Include libraries
library(uwIntroStats)
library(magrittr)
library(dplyr)
library(ggplot2)
library(ggfortify)
library(survival)
# Helper functions
# Prepare data
mri <-
read.table("data/mri.txt", header = TRUE) %>%
dplyr::as_tibble() %>%
dplyr::select(crt, obstime, death) %>%
dplyr::filter(!is.na(crt)) %>%
dplyr::mutate(
surv = dplyr::if_else(obstime >= 5 * 365.25, 1L, 0L),
n_surv = 1L - surv,
high_crt = dplyr::if_else(crt > 1.2, 1L, 0L),
low_crt = 1L - high_crt
)
mri$surv_obj <- Surv(time = mri$obstime, event = mri$death)
```
# _(Q1)_ (OR) Response: 5-Year mortality; Predictor: High Serum Creatinine Level
In this question we look at a logistic regression model with an indicator of
death within five years as the response (_death_), and an indicator of high
serum creatinine level as the predictor ($crt_{H}$):
$$ logit(p_{i}) = log(\frac{p_{i}}{1-p_{i}}) = \beta_{0} + \beta_{1} \cdot crt_{H} $$
$$ Pr(death_{i} = 1) = p_{i} $$
```{r}
DvH_mdl <- regress("odds", n_surv ~ high_crt, data = mri)
DvH_coef <- DvH_mdl$coefficients[, 1]
```
## _(Q1.a)_
This model is saturated, since the predictor variable (_high crt_) has the same
number of groupings (_low_, _high_) as the regression model has parameters
($\beta_{0}$, $\beta_{1}$). Or, put another way, each group can be fit exactly
with the model - information does not need to be borrowed from across groups.
## _(Q1.b)_
For this model, we have:
- Intercept $\beta_{0} =$ `r round(DvH_coef[1], 2)`
- Slope $\beta_{1} =$ `r round(DvH_coef[2], 2)`
Here, the exponentiated intercept is interpreted as the odds of death within
five years for the subset of the population that does not have a high serum
creatinine levels.
The exponentiated slope is interpreted as odds ratio of death within five years
between the subset of the population that has high creatinine levels and the
subset that doesn't have high creatinine levels.
## _(Q1.c,d)_
```{r}
DvH_odds <- exp(c(DvH_coef[1], sum(DvH_coef)))
DvH_prop <- DvH_odds / (1 + DvH_odds)
```
- Estimated **odds** of dying within 5 years for subjects with **low**
creatinine levels: `r round(DvH_odds[1], 2)`
- Estimated **probability** of dying within 5 years for subjects with **low**
creatinine levels: `r round(DvH_prop[1], 2)`
- Estimated **odds** of dying within 5 years for subjects with **high**
creatinine levels: `r round(DvH_odds[2], 2)`
- Estimated **probability** of dying within 5 years for subjects with **high**
creatinine levels: `r round(DvH_prop[2], 2)`
## _(Q1.e)_
```{r}
DvH_inf <- DvH_mdl$transformed %>%
as_tibble() %>%
select(-`F stat`, -`df`) %>%
magrittr::set_colnames(c("Estimate", "2.5%", "97.5%", "P-val")) %>%
mutate(`P-val` = paste0(round(`P-val` * 10^5, 2), "e-5")) %>%
slice(2)
knitr::kable(
DvH_inf,
booktabs = TRUE,
digits = 2,
caption = paste("Statistical inference of the relationship between binary",
"indicator of death and binary indicator of high crt")
)
```
From logistic regression analysis, we estimate that for two groups that differ
by binary indicator of serum creatinine level ($high > 1.2 \frac{mg}{dl}$,
$low \le 1.2 \frac{mg}{dl}$), the odds of death are `r round(DvH_inf[[1, 1]], 2)`
times larger for the group with high creatinine levels. A $95\%$ confidence
interval suggests that this observation is not unusual if the true odd ratio
were anywhere from `r round(DvH_inf[[1, 2]], 2)` to `r round(DvH_inf[[1, 3]], 2)`.
Because this interval does not include $1$, in addition to the fact that the
two-sided _P_-value for this estimate is less than $.05$, suggests that we
can reject the null hypothesis that there is no association between five-year
mortality and binary indicator of serum creatinine level.
## _(Q1.f,g)_
```{r}
DvL_mdl <- regress("odds", n_surv ~ low_crt, data = mri)
DvL_coef <- DvL_mdl$coefficients[, 1]
SvH_mdl <- regress("odds", surv ~ high_crt, data = mri)
SvH_coef <- SvH_mdl$coefficients[, 1]
coef_tbl <-
tibble(
parameter = c("Intercept", "Slope"),
`death~crtH` = DvH_coef,
`death~crtL` = DvL_coef,
`survival~crtH` = SvH_coef
) %>%
tidyr::pivot_longer(cols = -parameter, names_to = "model", values_to = "val") %>%
tidyr::pivot_wider(names_from = "model", values_from = "val")
knitr::kable(
coef_tbl,
booktabs = TRUE,
digits = 2,
caption = paste("Comparison of different representations of the same",
"underlying model")
)
```
If our original model instead used an indicator of **low** serum creatinine
level as the predictor, the statistical evidence for an association would not
change, since changing the crt indicator will only change the sign of the slope,
not the magnitude.
If our original model instead used an indicator of **surviving at least 5 years**
as the response variable, the statistical evidence for an association would not
change, only the starting point of the intercept and the direction of the slope.
# _(Q2)_ (OR) Response: High Serum Creatinine Level; Predictor: 5-Year mortality
In this question we look at a logistic regression model with an indicator of
high serum creatinine level as the response ($crt_{H}$), and an indicator of
death within 5 years as the predictor (_death_):
$$ logit(p_{i}) = log(\frac{p_{i}}{1-p_{i}}) = \beta_{0} + \beta_{1} \cdot death $$
$$ Pr(crt_{Hi} = 1) = p_{i} $$
```{r}
HvD_mdl <- regress("odds", high_crt ~ n_surv, data = mri)
HvD_coef <- HvD_mdl$coefficients[, 1]
```
## _(Q2.a)_
For this model, we have:
- Intercept $\beta_{0} =$ `r round(HvD_coef[1], 2)`
- Slope $\beta_{1} =$ `r round(HvD_coef[2], 2)`
Here, the exponentiated intercept is interpreted as the odds of having a high
creatinine level for the subset of the population that has survived at least
five years
The exponentiated slope is interpreted as odds ratio of having a high creatinine
level between the subset of the population that died within five years and
the subset that survived at least five years.
## _(Q2.b,c)_
```{r}
HvD_odds <- exp(c(HvD_coef[1], sum(HvD_coef)))
HvD_prop <- HvD_odds / (1 + HvD_odds)
```
- Estimated **odds** of high creatinine level for subjects who **die** within
5 years: `r round(HvD_odds[2], 2)`
- Estimated **probability** of high serum creatinine for subjects who **die**
within 5 years: `r round(HvD_prop[2], 2)`
- Estimated **odds** of high creatinine level for subjects who **survive** at
least 5 years: `r round(HvD_odds[1], 2)`
- Estimated **probability** of high serum creatinine for subjects who **survive**
at least 5 years: `r round(HvD_prop[1], 2)`
## _(Q2.d)_
```{r}
HvD_inf <- HvD_mdl$transformed %>%
as_tibble() %>%
select(-`F stat`, -`df`) %>%
magrittr::set_colnames(c("Estimate", "2.5%", "97.5%", "P-val")) %>%
mutate(`P-val` = paste0(round(`P-val` * 10^5, 2), "e-5")) %>%
slice(2)
knitr::kable(
HvD_inf,
booktabs = TRUE,
digits = 2,
caption = paste("Statistical inference of the relationship between binary",
"indicator of high crt and binary indicator of death")
)
```
From logistics regression analysis, we estimate that for two groups that differ
by binary indicator of five-year all-cause mortality, the odds of having high
serum creatinine level ($crt > 1.2 \frac{mg}{dl}$) are `r round(HvD_inf[1, 1], 2)`
times larger in the group that died within five years than the group that
survived at least five years. A $95\%$ confidence interval suggests this
observation is not unusual if the true odds ratio is anywhere from
`r round(HvD_inf[1, 2], 2)` to `r round(HvD_inf[1, 3], 2)`. Since this interval
does not include $1$ and our two-sided _P_-value is much less than $.05$, we have
sufficient confidence to reject the null hypothesis that there is no association
between binary indicator of serum creatinine level and five-year mortality.
## _(Q2.e)_
Both this retrospective analysis and the prospective analysis from _(Q1)_
produce the same estimate of the odds ratio, as well as the confidence
interval and _P_-value of that estimate. This is because both models are testing
for association between the same two indicators, with only the response and
predictor labels swapped.
# _(Q3)_ (RD) Response: 5-Year mortality; Predictor: High Serum Creatinine Level
This question asks us to assess the association between 5-year mortality and
serum creatinine level via a risk difference (RD) contrast. Since our response
is binary (_death_), we can use a linear model like:
$$ E[death_{i} | crt_{Hi}] = \beta_{0} + \beta_{1} \cdot death_{i}$$
```{r}
rd_mdl <- regress("mean", n_surv ~ high_crt, data = mri)
rd_coef <- rd_mdl$coefficients[, 1]
```
## _(Q3.a,b)_
For this model, we have:
- Intercept $\beta_{0} =$ `r round(rd_coef[1], 2)`
- Slope $\beta_{1} =$ `r round(rd_coef[2], 2)`
Here, the intercept is interpreted as the estimated mean proportion of death
within five years for the subset of the population that does not have high serum
creatinine levels.
The slope is interpreted as the estimated difference in population means
of death within five years between the two population subsets defined by high
serum creatinine level indicator.
## _(Q3.c)_
```{r}
rd_inf <- rd_mdl$coefficients %>%
as_tibble() %>%
select(Estimate, `95%L`, `95%H`, `Pr(>|t|)`) %>%
magrittr::set_colnames(c("Estimate", "2.5%", "97.5%", "P-val")) %>%
mutate(`P-val` = paste0(round(`P-val` * 10^4, 2), "e-4")) %>%
slice(2)
knitr::kable(
rd_inf,
booktabs = TRUE,
digits = 2,
caption = paste("Statistical inference of the difference in mean estimates",
"of proportion of death within 5 years between indicators",
"of high creatinine")
)
```
From a linear regression analysis of five-year all-cause mortality indicator
and high serum creatinine level ($crt > 1.2 \frac{mg}{dl}$) indicator, using
Huber-White estimates of the standard error, we find that the estimated risk
difference in probability of death between the population of individuals with
high creatinine level and those without is `r round(rd_inf[1, 1])`. A $95\%$
confidence interval suggests this estimate is not unusual if the true risk
difference is anywhere from `r round(rd_inf[1, 2])` to `r round(rd_inf[1, 3])`.
Because this interval does not include $0$, and since the _P_-value for this
estimate is much less than $.05$, we can reject the null hypothesis that there
is no association between high serum creatinine level indicator and five-year
mortality with confidence.
# _(Q4)_
## _(Q4.a)_
```{r}
DvC_mdl <- regress("odds", n_surv ~ crt, data = mri)
DvC_coef <- DvC_mdl$coefficients[, 1]
```
### _(Q4.a.i)_
- Intercept $\beta_{0} =$ `r round(DvC_coef[1], 2)`
- Slope $\beta_{1} =$ `r round(DvC_coef[2], 2)`
Here, the exponentiated intercept is interpreted as the odds of death within
five years for the subset of the population that has a mean serum creatinine
level of $0$. This is not useful scientifically, as we never observed an
individual with level of less than $.05$.
The exponentiated slope is interpreted as the ratio of the odds of death between
two subsets of the population differing by one unit of mean creatinine level.
Scientifically, this is a useful value, because it allows us to test for
difference between two subsets of the population.
### _(Q4.a.ii)_
```{r}
DvC_inf <- DvC_mdl$transformed %>%
as_tibble() %>%
select(-`F stat`, -`df`) %>%
magrittr::set_colnames(c("Estimate", "2.5%", "97.5%", "P-val")) %>%
mutate(`P-val` = paste0(round(`P-val` * 10^5, 2), "e-5")) %>%
slice(2)
knitr::kable(
DvC_inf,
booktabs = TRUE,
digits = 2,
caption = paste("Statistical inference of the OR relationship between ",
"creatinine level and binary indicator of death")
)
```
From logistic regression analysis, we estimate that for two groups differing
in mean serum creatinine level by one unit ($1 \frac{mg}{dl}$), the odds of
death within five years increases by `r round(DvC_inf[1, 1], 2)` times. A $95\%$
confidence interval suggests this observation is not unusual if the true odds
ratio is anywhere from `r round(DvC_inf[1, 2], 2)` to `r round(DvC_inf[1, 3], 2)`.
Since this interval does not include $1$, and the two-sided _P_-value for this
estimate is less than $.05$, we can, with confidence, reject the null hypothesis
that there is no association between serum creatinine level and five-year mortality.
## _(Q4.b)_
```{r}
rdC_mdl <- regress("mean", n_surv ~ crt, data = mri)
rdC_coef <- rdC_mdl$coefficients[, 1]
```
### _(Q4.b.i)_
- Intercept $\beta_{0} =$ `r round(rdC_coef[1], 2)`
- Slope $\beta_{1} =$ `r round(rdC_coef[2], 2)`
Here, the intercept is interpreted as the estimated mean proportion of death
within five years for the subset of the population with a mean serum creatinine
level of $0 \frac{mg}{dl}$. Again, this is not scientifically useful, as there
are no observed individuals with a measured creatinine level of less than $.05$.
The slope is interpreted as the estimated difference in mean probability of
death within five years between the two population subsets that differ by
$1 \frac{mg}{dl}$ of mean serum creatinine level. This does have scientific
value, as it lets us estimate differences across groups within our population.
### _(Q4.b.ii)_
### _(Q4.b.iii)_
```{r}
rdC_inf <- rdC_mdl$coefficients %>%
as_tibble() %>%
select(Estimate, `95%L`, `95%H`, `Pr(>|t|)`) %>%
magrittr::set_colnames(c("Estimate", "2.5%", "97.5%", "P-val")) %>%
mutate(`P-val` = paste0(round(`P-val` * 10^9, 2), "e-9")) %>%
slice(2)
knitr::kable(
rdC_inf,
booktabs = TRUE,
digits = 2,
caption = paste("Statistical inference of the RD relationship between ",
"creatinine level and binary indicator of death")
)
```
From a linear regression analysis of five-year all-cause mortality and and
serum creatinine level, using Huber-White estimates of the standard error, we
find the estimated risk difference in probability of death between between two
groups differing by one unit of mean serum level ($1 \frac{mg}{dl}$) increases
by `r round(rdC_inf[1, 1], 2)`. A $95\%$ confidence interval suggest this
estimate is not unusual if the true risk difference is in the range
`r round(rdC_inf[1, 2], 2)` to `r round(rdC_inf[1, 3], 2)`. $1$ not being
included in this interval and a two-sided _P_-value less than $.05$ suggest we
can with high confidence reject the null hypothesis that there is no association
between serum creatinine level and five-year mortality.
## _(Q4.c)_
```{r}
crt_predict <- c(0.8, 1.8, 2.8, 3.8)
or_prop <- exp(DvC_coef[1] + DvC_coef[2] * crt_predict)
pred_tbl <- tibble(
crt = crt_predict,
OR = or_prop / (1 + or_prop),
RD = rdC_coef[1] + rdC_coef[2] * crt_predict
)
knitr::kable(
pred_tbl,
booktabs = TRUE,
digits = 2,
caption = paste("Comparison of probability of death between OR and RD",
"models with continuous crt variable")
)
```
# _(Q5)_ Survival Analysis
## _(Q5.a)_
```{r}
km_mdl <- survfit(surv_obj ~ high_crt, data = mri)
km_plot <- autoplot(km_mdl) +
labs(
title = "Kaplan-Meier Estimated Survival Functions",
subtitle = "Grouped by Serum Creatinine Level Indicator",
x = "Observation Time (Days)",
y = "Percentage Surviving"
) +
scale_fill_discrete(name = "Creatinine Level\nIndicator", labels = c("Low", "High")) +
scale_colour_discrete(name = "Creatinine Level\nIndicator", labels = c("Low", "High")) +
theme_bw() +
theme(text = element_text(family = "serif"))
km_plot
```
## _(Q5.b)_
The high creatinine level indicator curve decays faster than the low indicator,
though both decay at independent steady rates until about 1500 days, at which
point both decays increase.
## _(Q5.c)_
```{r}
km_fit <- survdiff(surv_obj ~ high_crt, data = mri)
km_inf <-
tibble(
`crt Indicator` = c("Low", "High"),
`N obs` = km_fit$obs,
`Observed` = km_fit$obs,
`Expected` = km_fit$exp
)
knitr::kable(
km_inf,
booktabs = TRUE,
digits = 2,
caption = paste("Association of time-to-death and serum creatinine level")
)
```
From the Kaplan-Meier survival estimates of time-to-death and high serum
creatinine level indicator, the _P_-value of $9e-6$ and $\chi^{2}$ value of
`r round(km_fit$chisq, 2)` suggest that we can reject the null hypothesis that
there is no association between survival time and high serum creatinine level
indicator.