-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy path02-mortality_association_analysis.Rmd
382 lines (305 loc) · 12.4 KB
/
02-mortality_association_analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
---
title: "R notebook"
output: rmarkdown::github_document
editor_options:
markdown:
wrap: 72
---
# Basic Association Analysis Using the NHANES Accelerometer Data
## Introduction
In this notebook, we will associate daily step count with the risk of
all-cause mortality.
This notebook is a demo, not an indicator of best practice. Analytic
decisions here (e.g. which variables are adjusted for) should not be
interpreted as prescriptive and should be made afresh in the context of
each new analysis.
## Set up the session
We load packages:
```{r, warning=F}
# First we need to install packages that aren't already present
pkgs <- c("data.table", "ggplot2", "survival", "table1", "dplyr")
pkgs_inst <- pkgs[!{pkgs %in% rownames(installed.packages())}]
install.packages(pkgs_inst, repos = "https://www.stats.bris.ac.uk/R/")
options(bitmapType='cairo', digits = 3)
# Load packages
lapply(pkgs, library, character.only = TRUE)
# using lapply just allows us to load several packages on one line
```
We load the data we prepared earlier and make sure categorical variables
are coded as factors:
```{r}
dat_full <- fread("data/nhanes_prepped_data.csv", data.table = FALSE)
# We create factors and define the ordering of levels
dat_full$ethnicity <- factor(
dat_full$ethnicity,
levels = c(
"Mexican American",
"Other Hispanic",
"Non-Hispanic White",
"Non-Hispanic Black",
"Other Race - Including Multi-Racial"
)
)
dat_full$education <- factor(
dat_full$education,
levels = c(
"Less than 9th grade",
"9-11th grade (Includes 12th grade with no diploma)",
"High school graduate/GED or equivalent",
"Some college or AA degree",
"College graduate or above"
)
)
```
For simplicity, we will do a complete-case analysis. In other words, we
will drop participants with missing data for our main covariates of
interest. Alternatively, we could impute the missing data using
established [methods](https://www.bmj.com/content/338/bmj.b2393).
```{r}
dat <- dat_full[complete.cases(dat_full[, c("age",
"sex",
"ethnicity",
"education",
"smoking_current",
"bmi",
"any_disease")]), ]
```
**Exercise 1:** How many participants were excluded from the analysis?
Do you consider this number acceptable?
```{r}
## write code for exercise 1
```
We split daily steps into quarters:
```{r}
# Define a function to do the splitting
qtile_cut <- function(x, probs = seq(0, 1, 0.25), na.rm = TRUE, labels = NULL) {
breaks <- quantile(x = x, probs = probs, na.rm = na.rm)
out <- cut(x = x, breaks = breaks, labels = labels, right = FALSE, include.lowest = TRUE)
return(out)
}
# Use function to create step quarters
dat$steps_quarters <- qtile_cut(dat$StepsDayMedAdjusted,
labels = c("Quarter 1", "Quarter 2",
"Quarter 3", "Quarter 4"))
```
## Describe and explore the data
We use the 'table1' package to generate a nicely formatted table:
```{r}
# Add labels
label(dat$age) <- "Age"
label(dat$steps_quarters) <- "Quarter of daily step count"
label(dat$sex) <- "Sex"
label(dat$ethnicity) <- "Ethnicity"
label(dat$education) <- "Education level"
label(dat$smoking_current) <- "Current smoker"
label(dat$alcohol) <- "Drinks per week"
label(dat$bmi) <- "Body mass index"
label(dat$med_bp) <- "Medication for high blood pressure"
label(dat$med_chol) <- "Medication for high cholesterol"
label(dat$any_disease) <- "Any prior disease"
units(dat$age) <- "years"
units(dat$steps_quarters) <- "steps/day"
units(dat$bmi) <- "kg/m2"
# We'll customise how we render variables so rather than median (min, max) we present median (Q1, Q3)
my_render_cont <- function(x){
with(
stats.apply.rounding(stats.default(x)),
c(
"",
`Mean (SD)` = sprintf("%s (%s)", MEAN, SD),
`Median [Q1, Q3]` = sprintf("%s [%s, %s]",
MEDIAN, Q1, Q3)
)
)
}
# Make table
tab_desc <- table1::table1(~ age + sex + ethnicity + education + smoking_current + bmi + any_disease + StepsDayMedAdjusted| steps_quarters,
data = dat,
render.cont = my_render_cont)
print(tab_desc) # Show table
write(tab_desc, "descriptive_table.html")
```
**Exercise 2:** Do you observe any differences in participant
characteristics across the different quarters of daily steps?
There's much more we could do here, but for now we will move on to
analysing the association of daily step count with the risk of all-cause
mortality.
## Associations with risk of all-cause mortality
The mortality data consist of: 1) an event status indicator at exit
(death_f), and 2) a follow-up time variable (fwup_years).
**Exercise 3:** What is the median follow-up time and how many deaths
occurred?
```{r}
## write code for exercise 3
```
We can run a Cox model to examine the association between daily step
count and the risk of all-cause mortality. We'll start by using
time-on-study as the timescale and set this up using the 'survival'
package in R. We'll also adjust for some potential confounding
variables, though this is not an exhaustive list:
```{r}
cox_model <- coxph(
Surv(fwup_years, death_f) ~ steps_quarters + age + sex + ethnicity + education + smoking_current,
data = dat
)
summary(cox_model)
```
The `exp(coef)` column gives the hazard ratio (HR) and the `lower .95`
and `upper .95` columns its 95% confidence interval (CI).
For example, the HR for the top vs bottom quarter is 0.28 (95% CI
0.20-0.39). This means that participants in the top quarter of daily
step count had a 72% lower risk of death compared to those in the bottom
quarter, after adjusting for potential confounders.
Alternatively, we could analyse the data using [age as the
timescale](https://journals.lww.com/epidem/Fulltext/2012/07000/Proportional_Hazards_Regression_in_Epidemiologic.9.aspx)
(rather than time on study):
```{r}
# Define age of study exit
dat$age_exit <- dat$age + dat$fwup_years
# Run Cox model with age as the time scale
cox_model_age_timescale <- coxph(
Surv(age, age_exit, death_f) ~ steps_quarters + sex + ethnicity + education + smoking_current,
data = dat
)
summary(cox_model_age_timescale)
```
**Exercise 4:** Can you think of additional potential confounders that
we should adjust for? How do we choose which confounders to include?
## Proportional hazards assumption
We can now look at modelling assumptions, which we'll do using the first
model above. A key assumption of Cox regression is the **proportional
hazards assumption**. There are several ways to assess this. One way is
through plots and a statistical test of the scaled Schoenfeld residuals.
Read more
[here](http://www.sthda.com/english/wiki/cox-model-assumptions). Other
ways include use of log-log survival plots or considering interaction
terms between the variable of interest and time.
Interpretation of plots: In the figures, the solid line is a smoothing
spline fit to the plot. The dashed lines representing a ±2σ band around
the fit. Departures from a horizontal line are indicative of
non-proportional hazards. You can read more about the interpretation of
these plots
[here](https://shariq-mohammed.github.io/files/cbsa2019/1-intro-to-survival.html#:~:text=In%20principle%2C%20the%20Schoenfeld%20residuals,The%20function%20cox.).
```{r}
cox.zph(cox_model)
plot(cox.zph(cox_model))
```
It's sometimes hard to judge how important violations of the
proportional hazards assumptions are. As the statistical tests just
assess evidence against proportionality, they may detect even very
modest non-proportionality, particularly if there is a lot of data.
Therefore, it's always helpful to include some graphical method.
## Presenting results
We could plot the results. First we extract and format them:
```{r}
# Extract details from model
plot_dat <- as.data.frame(
exp(cbind(coef(cox_model), confint(cox_model)))
)
colnames(plot_dat) <- c("HR", "lower_CI", "upper_CI")
plot_dat$var_name <- rownames(plot_dat)
plot_dat$var_name <- sub("steps_quarters", "", plot_dat$var_name)
# Restrict to only activity variables and add row for the reference
plot_dat <- plot_dat[1:3, ]
ref_row <-
data.frame(
"var_name" = levels(dat$steps_quarters)[1],
"HR" = 1,
"lower_CI" = 1,
"upper_CI" = 1
)
plot_dat <- rbind(ref_row, plot_dat)
# Add event numbers
plot_dat$num_deaths <- sapply(
X = as.factor(plot_dat$var_name),
FUN = function(x) sum(dat$death_f[dat$steps_quarters == x], na.rm=TRUE)
)
# Add label columns
round_2_dp <- function(x) format(round(x, digits = 2), nsmall = 2) # this line just writes a utility function to round to 2 dp
plot_dat$label_HR <- paste0(round_2_dp(plot_dat$HR), " (", round_2_dp(plot_dat$lower_CI), ", ", round_2_dp(plot_dat$upper_CI), ")")
plot_dat$label_quarter <- c("Quarter 1", "Quarter 2", "Quarter 3", "Quarter 4")
plot_dat$label_deaths <- plot_dat$num_deaths
# Add a title row
title_row <-
data.frame(
"var_name" = " ",
"HR" = NA,
"lower_CI" = NA,
"upper_CI" = NA,
"num_deaths" = NA,
"label_quarter" = "Group",
"label_HR" = "HR (95% CI)",
"label_deaths" = "Deaths"
)
plot_dat <- rbind(title_row, plot_dat)
plot_dat$var_name <- factor(plot_dat$var_name, levels = c(" ", levels(dat$steps_quarters)))
print(plot_dat)
```
We can then create a plot:
```{r, warning=F}
steps_cox_plot <- ggplot(plot_dat, aes(x = HR, y = var_name)) + # SET UP PLOT DATA
# AXES: SCALES AND LABELS
scale_x_continuous(trans = "log", breaks = c(0.25, 0.5, 0.75, 1.0)) +
scale_y_discrete(limits = rev) +
labs(title = "Association of daily step count with all-cause mortality", x = "Hazard Ratio") +
# LINES: VERTICAL LINE AT 1 AND X AXIS
geom_vline(aes(xintercept = 1),
size = 1) +
geom_segment(aes(x = 0.2, xend = 1.05, y = 0, yend = 0), colour = "black", size = 1) + # Using this segment to colour axis so we can have a longer invisible axis to position text
# ADD PLOT DATA
geom_errorbar(aes(xmin = lower_CI, xmax = upper_CI), width = 0, size = 0.75) +
geom_point(size = 4, shape = 15) +
# ADD LABELS TO PLOT
geom_text(aes(x = 0.07, label = label_quarter), hjust = 0, size = 5) +
geom_text(aes(label = label_deaths, x = 0.15), hjust = 0, size = 5) +
geom_text(aes(label = label_HR, x = 1.2), hjust = 0, size = 5) +
# THEME (NON-DATA ELEMENTS OF PLOT)
theme_classic() +
theme(axis.line.y = element_blank(),
axis.line.x = element_blank(),
axis.text.x = element_text(size = 15, colour = "black"),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.x = element_text(size = 20),
axis.title.y = element_blank(),
title = element_text(size = 15),
legend.position = "none") +
# CANVAS
coord_cartesian(xlim = c(0.05, 3), clip = "off")
# Display plot
steps_cox_plot
```
We could edit this plot to make it visually nicer, but for now we write
it out to save it:
```{r, warning=F}
svg("steps_cox_plot.svg")
print(steps_cox_plot)
dev.off()
```
## Sensitivity analyses
### Investigation of reverse causation bias
In observational epidemiological studies, we are often concerned about
*reverse causation bias*. This type of bias occurs when the direction of
cause and effect is mistaken, i.e. instead of the exposure influencing
the outcome, the outcome may influence the exposure. In our example, it
could be that some individuals have reduced their daily steps due to
underlying health conditions. As a result, it may seem that lower daily
steps lead to higher mortality, when in fact the association is driven
by pre-existing disease.
To assess the potential for reverse causation bias influencing the
results, researchers often conduct sensitivity analyses excluding
participants with prior disease or [excluding the first few years of
follow-up](https://academic.oup.com/ije/article/49/1/162/5607291).
**Exercise 5:** Examine the association between daily step count and
all-cause mortality among participants without prior disease. How does
it compare with the previous association? Why do you think that is?
**Exercise 6:** Examine the association between daily step count and
all-cause mortality after excluding participants who died within the
first 4 years of follow-up. What do you observe?
```{r}
## write code for exercise 5
```
```{r}
## write code for exercise 6
```
End of Notebook