02-mortality_association_analysis.Rmd

---
title: "R notebook"
output: rmarkdown::github_document
editor_options: 
  markdown: 
    wrap: 72
---

# Basic Association Analysis Using the NHANES Accelerometer Data

## Introduction

In this notebook, we will associate daily step count with the risk of
all-cause mortality.

This notebook is a demo, not an indicator of best practice. Analytic
decisions here (e.g. which variables are adjusted for) should not be
interpreted as prescriptive and should be made afresh in the context of
each new analysis.

## Set up the session

We load packages:

```{r, warning=F}
# First we need to install packages that aren't already present
pkgs <- c("data.table", "ggplot2", "survival", "table1", "dplyr")
pkgs_inst <- pkgs[!{pkgs %in% rownames(installed.packages())}]
install.packages(pkgs_inst, repos = "https://www.stats.bris.ac.uk/R/")

options(bitmapType='cairo', digits = 3)

# Load packages
lapply(pkgs, library, character.only = TRUE)
# using lapply just allows us to load several packages on one line
```

We load the data we prepared earlier and make sure categorical variables
are coded as factors:

```{r}
dat_full <- fread("data/nhanes_prepped_data.csv", data.table = FALSE)

# We create factors and define the ordering of levels 
dat_full$ethnicity <- factor(
  dat_full$ethnicity,
  levels = c(
    "Mexican American",
    "Other Hispanic",
    "Non-Hispanic White",
    "Non-Hispanic Black",
    "Other Race - Including Multi-Racial"
  )
)

dat_full$education <- factor(
  dat_full$education,
  levels = c(
    "Less than 9th grade",
    "9-11th grade (Includes 12th grade with no diploma)",
    "High school graduate/GED or equivalent",
    "Some college or AA degree",
    "College graduate or above"
  )
)
```

For simplicity, we will do a complete-case analysis. In other words, we
will drop participants with missing data for our main covariates of
interest. Alternatively, we could impute the missing data using
established [methods](https://www.bmj.com/content/338/bmj.b2393).

```{r}
dat <- dat_full[complete.cases(dat_full[, c("age",
                                            "sex",
                                            "ethnicity",
                                            "education",
                                            "smoking_current",
                                            "bmi",
                                            "any_disease")]), ]
```

**Exercise 1:** How many participants were excluded from the analysis?
Do you consider this number acceptable?

```{r}
## write code for exercise 1

```

We split daily steps into quarters:

```{r}
# Define a function to do the splitting
qtile_cut <-  function(x, probs = seq(0, 1, 0.25), na.rm = TRUE, labels = NULL) {
    breaks <- quantile(x = x, probs = probs, na.rm = na.rm)
    out <- cut(x = x, breaks = breaks, labels = labels, right = FALSE, include.lowest = TRUE)
    return(out)
}

# Use function to create step quarters
dat$steps_quarters <- qtile_cut(dat$StepsDayMedAdjusted,
                                labels = c("Quarter 1", "Quarter 2", 
                                           "Quarter 3", "Quarter 4"))
```

## Describe and explore the data

We use the 'table1' package to generate a nicely formatted table:

```{r}
# Add labels
label(dat$age)   <- "Age"
label(dat$steps_quarters) <- "Quarter of daily step count"
label(dat$sex) <- "Sex"
label(dat$ethnicity) <- "Ethnicity"
label(dat$education) <- "Education level"
label(dat$smoking_current) <- "Current smoker"
label(dat$alcohol) <- "Drinks per week"
label(dat$bmi) <- "Body mass index"
label(dat$med_bp) <- "Medication for high blood pressure"
label(dat$med_chol) <- "Medication for high cholesterol"
label(dat$any_disease) <- "Any prior disease"
units(dat$age) <- "years"
units(dat$steps_quarters) <- "steps/day"
units(dat$bmi) <- "kg/m2"

# We'll customise how we render variables so rather than median (min, max) we present median (Q1, Q3)
my_render_cont <- function(x){
  with(
    stats.apply.rounding(stats.default(x)),
    c(
      "",
      `Mean (SD)` = sprintf("%s (%s)", MEAN, SD),
      `Median [Q1, Q3]` = sprintf("%s [%s, %s]",
                                    MEDIAN, Q1, Q3)
    )
  )
}

# Make table
tab_desc <- table1::table1(~ age + sex + ethnicity + education + smoking_current + bmi + any_disease + StepsDayMedAdjusted| steps_quarters, 
                           data = dat, 
                           render.cont = my_render_cont) 
print(tab_desc) # Show table
write(tab_desc, "descriptive_table.html")
```

**Exercise 2:** Do you observe any differences in participant
characteristics across the different quarters of daily steps?

There's much more we could do here, but for now we will move on to
analysing the association of daily step count with the risk of all-cause
mortality.

## Associations with risk of all-cause mortality

The mortality data consist of: 1) an event status indicator at exit
(death_f), and 2) a follow-up time variable (fwup_years).

**Exercise 3:** What is the median follow-up time and how many deaths
occurred?

```{r}
## write code for exercise 3

```

We can run a Cox model to examine the association between daily step
count and the risk of all-cause mortality. We'll start by using
time-on-study as the timescale and set this up using the 'survival'
package in R. We'll also adjust for some potential confounding
variables, though this is not an exhaustive list:

```{r}
cox_model <- coxph(
  Surv(fwup_years, death_f) ~ steps_quarters + age + sex + ethnicity + education + smoking_current,
  data = dat
)
summary(cox_model)
```

The `exp(coef)` column gives the hazard ratio (HR) and the `lower .95`
and `upper .95` columns its 95% confidence interval (CI).

For example, the HR for the top vs bottom quarter is 0.28 (95% CI
0.20-0.39). This means that participants in the top quarter of daily
step count had a 72% lower risk of death compared to those in the bottom
quarter, after adjusting for potential confounders.

Alternatively, we could analyse the data using [age as the
timescale](https://journals.lww.com/epidem/Fulltext/2012/07000/Proportional_Hazards_Regression_in_Epidemiologic.9.aspx)
(rather than time on study):

```{r}
# Define age of study exit
dat$age_exit <- dat$age + dat$fwup_years

# Run Cox model with age as the time scale  
cox_model_age_timescale <- coxph(
  Surv(age, age_exit, death_f) ~ steps_quarters + sex + ethnicity + education + smoking_current,
  data = dat
)
summary(cox_model_age_timescale)
```

**Exercise 4:** Can you think of additional potential confounders that
we should adjust for? How do we choose which confounders to include?

## Proportional hazards assumption

We can now look at modelling assumptions, which we'll do using the first
model above. A key assumption of Cox regression is the **proportional
hazards assumption**. There are several ways to assess this. One way is
through plots and a statistical test of the scaled Schoenfeld residuals.
Read more
[here](http://www.sthda.com/english/wiki/cox-model-assumptions). Other
ways include use of log-log survival plots or considering interaction
terms between the variable of interest and time.

Interpretation of plots: In the figures, the solid line is a smoothing
spline fit to the plot. The dashed lines representing a ±2σ band around
the fit. Departures from a horizontal line are indicative of
non-proportional hazards. You can read more about the interpretation of
these plots
[here](https://shariq-mohammed.github.io/files/cbsa2019/1-intro-to-survival.html#:~:text=In%20principle%2C%20the%20Schoenfeld%20residuals,The%20function%20cox.).

```{r}
cox.zph(cox_model)
plot(cox.zph(cox_model))
```

It's sometimes hard to judge how important violations of the
proportional hazards assumptions are. As the statistical tests just
assess evidence against proportionality, they may detect even very
modest non-proportionality, particularly if there is a lot of data.
Therefore, it's always helpful to include some graphical method.

## Presenting results

We could plot the results. First we extract and format them:

```{r}
# Extract details from model
plot_dat <- as.data.frame(
  exp(cbind(coef(cox_model), confint(cox_model)))
)
colnames(plot_dat) <- c("HR", "lower_CI", "upper_CI")
plot_dat$var_name <- rownames(plot_dat)
plot_dat$var_name <- sub("steps_quarters", "", plot_dat$var_name)

# Restrict to only activity variables and add row for the reference
plot_dat <- plot_dat[1:3, ]
ref_row <-
  data.frame(
    "var_name" = levels(dat$steps_quarters)[1], 
    "HR" = 1,
    "lower_CI" = 1,
    "upper_CI"  = 1
  )
plot_dat <- rbind(ref_row, plot_dat)

# Add event numbers
plot_dat$num_deaths <- sapply(
    X = as.factor(plot_dat$var_name),
    FUN = function(x) sum(dat$death_f[dat$steps_quarters == x], na.rm=TRUE)
  ) 

# Add label columns 
round_2_dp <- function(x) format(round(x, digits = 2), nsmall = 2) # this line just writes a utility function to round to 2 dp
plot_dat$label_HR <- paste0(round_2_dp(plot_dat$HR), " (", round_2_dp(plot_dat$lower_CI), ", ", round_2_dp(plot_dat$upper_CI), ")")
plot_dat$label_quarter <- c("Quarter 1", "Quarter 2", "Quarter 3", "Quarter 4")
plot_dat$label_deaths <- plot_dat$num_deaths

# Add a title row
title_row <-
  data.frame(
    "var_name" = " ", 
    "HR" = NA,
    "lower_CI" = NA,
    "upper_CI"  = NA, 
    "num_deaths" = NA, 
    "label_quarter" = "Group",
    "label_HR" = "HR (95% CI)", 
    "label_deaths" = "Deaths"
  )

plot_dat <- rbind(title_row, plot_dat)
plot_dat$var_name <- factor(plot_dat$var_name, levels = c(" ", levels(dat$steps_quarters)))

print(plot_dat)
```

We can then create a plot:

```{r, warning=F}
steps_cox_plot <- ggplot(plot_dat, aes(x = HR, y = var_name)) + # SET UP PLOT DATA 
  
  # AXES: SCALES AND LABELS
  scale_x_continuous(trans = "log", breaks = c(0.25, 0.5, 0.75, 1.0)) + 
  scale_y_discrete(limits = rev) +
  labs(title = "Association of daily step count with all-cause mortality", x = "Hazard Ratio") +

  # LINES: VERTICAL LINE AT 1 AND X AXIS
  geom_vline(aes(xintercept = 1),
             size = 1) +
  geom_segment(aes(x = 0.2, xend = 1.05, y = 0, yend = 0), colour = "black", size = 1) + # Using this segment to colour axis so we can have a longer invisible axis to position text

  # ADD PLOT DATA
  geom_errorbar(aes(xmin = lower_CI, xmax = upper_CI), width = 0, size = 0.75) +
  geom_point(size = 4, shape = 15) +
  
  # ADD LABELS TO PLOT
  geom_text(aes(x = 0.07, label = label_quarter), hjust = 0, size = 5) +
  geom_text(aes(label = label_deaths, x = 0.15), hjust = 0, size = 5) +
  geom_text(aes(label = label_HR, x = 1.2), hjust = 0, size = 5) +

  # THEME (NON-DATA ELEMENTS OF PLOT)
  theme_classic()  + 
  theme(axis.line.y = element_blank(), 
        axis.line.x = element_blank(),
        axis.text.x = element_text(size = 15, colour = "black"), 
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.x = element_text(size = 20), 
        axis.title.y = element_blank(), 
        title = element_text(size = 15), 
        legend.position = "none") + 

    # CANVAS
    coord_cartesian(xlim = c(0.05, 3), clip = "off")

# Display plot 
steps_cox_plot
```

We could edit this plot to make it visually nicer, but for now we write
it out to save it:

```{r, warning=F}
svg("steps_cox_plot.svg")
print(steps_cox_plot)
dev.off()
```

## Sensitivity analyses

### Investigation of reverse causation bias

In observational epidemiological studies, we are often concerned about
*reverse causation bias*. This type of bias occurs when the direction of
cause and effect is mistaken, i.e. instead of the exposure influencing
the outcome, the outcome may influence the exposure. In our example, it
could be that some individuals have reduced their daily steps due to
underlying health conditions. As a result, it may seem that lower daily
steps lead to higher mortality, when in fact the association is driven
by pre-existing disease.

To assess the potential for reverse causation bias influencing the
results, researchers often conduct sensitivity analyses excluding
participants with prior disease or [excluding the first few years of
follow-up](https://academic.oup.com/ije/article/49/1/162/5607291).

**Exercise 5:** Examine the association between daily step count and
all-cause mortality among participants without prior disease. How does
it compare with the previous association? Why do you think that is?

**Exercise 6:** Examine the association between daily step count and
all-cause mortality after excluding participants who died within the
first 4 years of follow-up. What do you observe?

```{r}
## write code for exercise 5

```

```{r}
## write code for exercise 6

```

End of Notebook