Survival2 (#78)
* add boxHED

* updated left-truncation + objective functions. Added KM

* Add multi-state

* Updates survival

* update eha figure

* update row/column names in figure/table

* Cleaned up CR and recurrent events.

* update survival/eha

* updates survtsk

* update srvtsk

* Finish (for now) srvtsk

* Push pdf [skip ci]

* Update book/survival.qmd

* Push pdf [skip ci]

* typos

* Push pdf [skip ci]

* typos and PI

* Push pdf [skip ci]

* move tasks around

* Push pdf [skip ci]

* add small note about discrete time

* Push pdf [skip ci]

* typos

* Push pdf [skip ci]

* equations to own line

* typo

* Push pdf [skip ci]

* typos

* Push pdf [skip ci]

* typos

* typoes

* Push pdf [skip ci]

* move equations to blocks

* Push pdf [skip ci]

* typos

* typos

* Push pdf [skip ci]

* typos, light rewrite

* Push pdf [skip ci]

* small rewrite

* typo

* Push pdf [skip ci]

* fix formatting

* typos

* Push pdf [skip ci]

* typo

* typo

* update km section

* Update KM section + replace rats data with tumor data

* more work on censoring + example table + figure for left-truncation

* Rewrite of objective function section -> Estimation

* typo

* add macro

* tex typ

* latex typo

* Push pdf [skip ci]

* typo

* Push pdf [skip ci]

* add fixme

* Push pdf [skip ci]

* further reading

* Push pdf [skip ci]

* typo

* typo

* add line 3

* typos

* Push pdf [skip ci]

* Update book/P1C4_survival.qmd

* Push pdf [skip ci]

* Update book/P1C4_survival.qmd

Co-authored-by: Raphael Sonabend <[email protected]>

* Push pdf [skip ci]

* add discrete time section + non-parametric estimation

* Push pdf [skip ci]

---------

Co-authored-by: adibender <[email protected]>
Co-authored-by: Raphael Sonabend <[email protected]>
Co-authored-by: RaphaelS1 <[email protected]>
4 people authored Dec 5, 2024
1 parent 081cc1a commit 5930499
Showing 19 changed files with 819 additions and 296 deletions.
Binary file modified book/Figures/survival/event-history-analysis.png
1 change: 0 additions & 1 deletion book/Figures/survival/event-history-analysis.svg

This file was deleted.

Binary file added book/Figures/survival/km-age-bin-tumor.png
Binary file added book/Figures/survival/km-infants.png
Binary file added book/Figures/survival/km-tumor.png
Binary file added book/Figures/survival/left-truncation.png
10 changes: 5 additions & 5 deletions book/P0C0_notation.qmd
@@ -27,13 +27,13 @@ $$
$$

Vectors are usually defined using transpose notation, for example the vector above may instead be written as $\xx^\trans = (x_1 \ x_2 \cdots x_n)$ or $\xx = (x_1 \ x_2 \cdots x_n)^\trans$.
Vectors may also be defined in a shortened format as, $\xx \in \calX^n$, which implies a vector of length $n$ with elements as represented above.
Vectors may also be defined in a shortened format as $\xx \in \calX^{n \times 1}$ or more simply $\xx \in \calX^n$, which implies a column vector of length $n$ with elements as represented above.

A letter in normal font with one subscript refers to a single element from a vector.
For example, given $\xx \in \calX^n$, the $i$th element is denoted $x_i$.
Given a matrix $\XX \in \calX^{n \times p}$, a bold-face lower-case letter with a single subscript refers to the row of a matrix, for example the $i$th row would be $\xx_i = (x_{i;1} \ x_{i;2} \cdots x_{i;p})^\trans$.
Whereas a column is referenced with a semi-colon before the subscript, for example the $j$th column would be $\xx_{;j} = (x_{1;j} \ x_{2;j} \cdots x_{n;j})^\trans$.
Two subscripts can be used to reference a single element of a matrix, for example $x_{i;j}$ would be the element in the $i$th row and $j$th column of $\XX$.
Two subscripts can be used to reference a single element of a matrix, for example $x_{i;j} \in \calX$ would be the element in the $i$th row and $j$th column of $\XX$.
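
As a small worked illustration of the indexing conventions above (the numbers are hypothetical and chosen only for this example), take $n = 2$ observations and $p = 3$ features:

$$
\XX = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}, \quad
\xx_1 = (1 \ 2 \ 3)^\trans, \quad
\xx_{;2} = (2 \ 5)^\trans, \quad
x_{2;3} = 6
$$

Here $\xx_1$ is the first row written as a column vector, $\xx_{;2}$ is the second column, and $x_{2;3}$ is the single element in row 2 and column 3.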

## Functions

@@ -44,7 +44,7 @@ $\EE(X)$ is the expectation of the random variable $X$.
We write $A \indep B$, to denote that $A$ and $B$ are independent, i.e., that $P(A \cap B) = P(A)P(B)$.

A function $f$, will either be written as a formal map of domain to codomain, $f: \calX \rightarrow \calY; (x, y) \mapsto f(x, y)$ (which is most useful for understanding inputs and outputs), or more simply and commonly as $f(x, y)$.
Given a random variable, $X$, following distribution $\zeta$ (mathematically written $X \sim \zeta$), then $f_X$ denotes the probability density function, and analogously for other distribution defining functions.
Given a random variable, $X$, following distribution $\zeta$ (mathematically written $X \sim \zeta$), then $f_X$ denotes the probability density function, and analogously for other distribution defining functions such as the cumulative distribution function, survival function, etc.
In the survival analysis context (@sec-surv), a subscript "$0$" refers to a "baseline" function, for example, $S_0$ is the baseline survival function.

## Variables and acronyms
@@ -55,8 +55,8 @@ Common variables and acronyms used in the book are given in @tbl-not-var and @tb
| - | ---- |
| $\Reals, \PReals, \NNReals, \ExtReals$|Set of Reals, positive Reals (excl. zero), non-negative Reals (incl. zero), and Reals including $\pm\infty$. |
| $\PNaturals$|Set of Naturals excluding zero. |
| $(\XX, \tt, \bsdelta)$ | Survival data where $\XX \in \Reals$ is a real-valued matrix of observations (rows) and features (columns), $\tt$ is a vector of observed outcome times, and $\bsdelta$ is a vector of observed outcome indicators. |
| $\bsbeta$|Vector of model coefficients/weights. |
| $(\XX, \tt, \bsdelta)$ | Survival data where $\XX \in \Reals^{n \times p}$ is a real-valued matrix of $n$ observations (rows) and $p$ features (columns), $\tt \in \Reals^n$ is a vector of observed outcome times, and $\bsdelta \in \Reals^n$ is a vector of observed outcome indicators. |
| $\bsbeta$|Vector of model coefficients/weights, $\bsbeta \in \Reals^p$. |
| $\bseta$ | Vector of linear predictors, $\dvec{\eta}{n}$, where $\bseta = \XX\bsbeta$ and $\eta_i = \xx_{i}^\trans\bsbeta$. |
| $\calD, \dtrain, \dtest$| Dataset, training data, and testing data. |

633 changes: 455 additions & 178 deletions book/P1C4_survival.qmd

Large diffs are not rendered by default.

133 changes: 102 additions & 31 deletions book/P1C5_eha.qmd

Large diffs are not rendered by default.

142 changes: 70 additions & 72 deletions book/P1C6_survtsk.qmd

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion book/P2C8_rank.qmd
@@ -55,7 +55,7 @@ This representation of discrimination provides more information by encoding the
In theory this representation could result in a negative value, however this would indicate that $C<0.5$, which would indicate serious problems with the model that should be addressed before proceeding with further analysis.
Representing measures as a percentage over a baseline is a common method to improve measure interpretability and closely relates to the ERV representation of scoring rules (@sec-eval-distr-score-base).

### Concordance Indices
### Concordance Indices {#sec-eval-crank-conc}

Common concordance indices in survival analysis can be expressed as a general measure:

2 changes: 1 addition & 1 deletion book/P3C15_svm.qmd
@@ -160,7 +160,7 @@ Where again $K$ is a kernel function and the calculation of the Lagrange multipl
Support vector machines can be used to estimate rankings by penalizing discordant predictions.
Recall the definition of concordance from @sec-eval-crank: ranking predictions for a pair of comparable observations $(i, j)$ where $t_i < t_j \cap \delta_i = 1$, are called concordant if $r_i > r_j$ where $r_i, r_j$ are the predicted ranks for observations $i$ and $j$ respectively and a higher value implies greater risk.
Using the prognostic index as a ranking prediction (@sec-surv-setmltask), a pair of observations is concordant if $g(\xx_i) > g(\xx_j)$ when $t_i < t_j$, leading to:
Using the prognostic index as a ranking prediction (@sec-survtsk-PI), a pair of observations is concordant if $g(\xx_i) > g(\xx_j)$ when $t_i < t_j$, leading to:
$$
\begin{aligned}
2 changes: 1 addition & 1 deletion book/P4C21_discrete.qmd
@@ -6,6 +6,6 @@ abstract: TODO (150-200 WORDS)
{{< include _macros.tex >}}
:::

# Discrete Time Survival Analysis
# Discrete Time Survival Analysis {#sec-discrete}

{{< include _soon.qmd >}}
28 changes: 25 additions & 3 deletions book/P5C24_conclusions.qmd
@@ -2,12 +2,34 @@
{{< include _macros.tex >}}
:::

# Conclusions

## Common problems in survival analysis
# Conclusions {#sec-conclusions}

{{< include _soon.qmd >}}

## Common problems in survival analysis {#sec-conclusions-faq}

### Data cleaning

#### Events at t=0 {.unnumbered .unlisted}

Throughout this book we have defined survival times taking values in the non-negative Reals (zero inclusive) $\NNReals$.
In practice, model implementations assume time is over the positive Reals (zero exclusive).
One must therefore consider how to deal with subjects that experience the outcome at $0$.
There is no established best practice for dealing with this case as the answer may be data-dependent.
Possible choices include:

1. Delete all observations where the outcome occurs at $t=0$. This may be appropriate if it only happens in a small number of observations, so that deletion is unlikely to bias predictions;
2. Update the survival time to the next smallest observed survival time. For example, if the first observation to experience the event after $t=0$ does so at $t=0.1$, then set $0.1$ as the survival time for any observation experiencing the event at $t=0$. Note that this method will not be appropriate when data is recorded over a long period; for example, if time is measured in years, then there could be a substantial difference between $t=0$ and $t=1$;
3. Update the survival time to a very small value $\epsilon$ that makes sense given the context of the data, e.g., $\epsilon = 0.0001$.
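
The three choices above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not a recommended pipeline: the data frame `dat`, its columns `time` and `status`, and the value of `eps` are all hypothetical and would need to be adapted to the data at hand.

```r
# Hypothetical toy data: two subjects experience the event at t = 0
dat <- data.frame(
  time   = c(0, 0, 0.1, 0.5, 2, 3.5),
  status = c(1, 1, 1, 0, 1, 0)
)
at_zero <- dat$time == 0

# Choice 1: delete observations whose outcome occurs at t = 0
dat_drop <- dat[!at_zero, ]

# Choice 2: move t = 0 to the smallest observed event time after 0
next_time <- min(dat$time[dat$time > 0 & dat$status == 1])
dat_next <- dat
dat_next$time[at_zero] <- next_time

# Choice 3: move t = 0 to a small, context-dependent epsilon (assumed value)
eps <- 1e-4
dat_eps <- dat
dat_eps$time[at_zero] <- eps
```

Whichever choice is made, it is worth checking afterwards that every remaining time is strictly positive, for example with `all(dat_eps$time > 0)`.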

#### Continuous v Discrete Time {.unnumbered .unlisted}

We defined survival tasks throughout this book assuming continuous time predictions in $\NNReals$.
In practice, many outcomes in survival analysis are recorded on a discrete scale, such as in medical statistics where outcomes are observed on an hourly, daily, monthly, or yearly basis.
Whilst discrete-time survival analysis exists for this purpose (@sec-discrete), software implementations overwhelmingly use theory from the continuous-time setting.
There has been little research into whether discrete-time methods outperform continuous-time methods when correctly applied to discrete data; however, available experiments do not indicate that discrete methods outperform their continuous counterparts [@Suresh2022].
Therefore, it is recommended to use available software implementations, even when data is recorded on a discrete scale.

### Evaluation and prediction

* Which time points to make predictions for?
Binary file modified book/_book/Machine-Learning-in-Survival-Analysis.pdf
Binary file not shown.
3 changes: 1 addition & 2 deletions book/_macros.tex
@@ -27,6 +27,7 @@
\providecommand{\calS}{\mathcal{S}}
\providecommand{\calT}{\mathcal{T}}
\providecommand{\calU}{\mathcal{U}}
\providecommand{\calO}{\mathcal{O}}
\providecommand{\calX}{\mathcal{X}}
\providecommand{\calY}{\mathcal{Y}}

@@ -55,8 +56,6 @@
\providecommand{\hattt}{\hat{\mathbf{t}}}
\providecommand{\hatrr}{\hat{\mathbf{r}}}

\providecommand{\tbi}{t_{(i)}}

\providecommand{\bsdelta}{\boldsymbol{\delta}}
\providecommand{\bsbeta}{\boldsymbol{\beta}}
\providecommand{\bseta}{\boldsymbol{\eta}}
97 changes: 96 additions & 1 deletion book/experiments/code.R
@@ -2,10 +2,12 @@ remotes::install_github("mlr-org/mlr3proba", ref = 'v0.5.7', upgrade = "never")
remotes::install_github("mlr-org/mlr3", ref = 'v0.16.1', upgrade = "never")
remotes::install_github("mlr-org/paradox", ref = 'v0.11.1', upgrade = "never")

library(ggplot2)
theme_set(theme_bw())

## Ranking
rm(list = ls())
library(dplyr)
library(ggplot2)
library(mlr3)
library(mlr3proba)

@@ -234,3 +236,96 @@ g = p1 + p2 + p3 + p4 &

ggsave("book/Figures/forests/bootstrap.png", g, height = 6, units = "in",
dpi = 600)

## Kaplan Meier
library(survival)
data("tumor", package = "pammtools")
tumor <- cbind(id = seq_len(nrow(tumor)), tumor)
tumor_duplicated = tumor |>
  filter(days %in% days[duplicated(days)]) |>
  arrange(days)

## Table for illustration of right-censored data
tab_surv_tumor = tumor_duplicated |>
  filter(id %in% c(13, 62, 185, 230, 431, 719)) |>
  select(id, age, sex, complications, days, status) |>
  arrange(id)
knitr::kable(tab_surv_tumor)

km = survfit(Surv(days, status)~1, data = tumor)
bkm = broom::tidy(km)

df_med = data.frame(
  x = c(0, median(km)), # Starting x-coordinates
  y = c(0.5, 0), # Starting y-coordinates
  xend = c(median(km), median(km)), # Ending x-coordinates
  yend = c(0.5, 0.5) # Ending y-coordinates
)

p_km = ggplot(bkm, aes(x = time, y = estimate)) +
  geom_step() +
  geom_segment(data=df_med, aes(x=x, xend=xend, y=y, yend=yend), lty = 3)+
  ylim(c(0, 1)) +
  ylab("S(t)") +
  xlab("time")
p_km
ggsave("book/Figures/survival/km-tumor.png", p_km, height=3, units="in", dpi=600)

# stratified KM wrt age group (age < 50 vs age >= 50)
tumor = tumor |>
  mutate(age_bin = factor(age < 50, levels = c(TRUE, FALSE), labels = c("age < 50", "age >= 50")))
km_age_bin = survfit(Surv(days, status)~age_bin, data = tumor)
bkm_age_bin = broom::tidy(km_age_bin)
med_km_age_bin = as.numeric(median(km_age_bin))
df_age_bin = data.frame(
  x = c(0, med_km_age_bin[2]), # Starting x-coordinates
  y = c(0.5, 0), # Starting y-coordinates
  xend = c(med_km_age_bin[2], med_km_age_bin[2]), # Ending x-coordinates
  yend = c(0.5, 0.5) # Ending y-coordinates
)

p_km_age_bin = ggplot(bkm_age_bin, aes(x = time, y = estimate)) +
  geom_step(aes(col = strata)) +
  geom_segment(data=df_age_bin, aes(x=x, xend=xend, y=y, yend=yend), lty = 3)+
  geom_hline(yintercept = .5, lty = 3) +
  ylim(c(0, 1)) +
  ylab("S(t)") +
  xlab("time")
p_km_age_bin
ggsave("book/Figures/survival/km-age-bin-tumor.png", p_km_age_bin, height=3, units="in", dpi=600)


## Left-truncation
data("infants", package = "eha")

# KM for infants with dead/alive mothers
km_infants = survfit(Surv(exit, event)~mother, data = infants)
bkm_infants = broom::tidy(km_infants)

p_km_infants = ggplot(bkm_infants, aes(x = time, y = estimate)) +
  geom_step(aes(col = strata)) +
  ylim(c(0, 1)) +
  ylab("S(t)") +
  xlab("time")
# adjusted for left-truncation
km_infants_lt = survfit(Surv(enter, exit, event)~mother, data = infants)
bkm_infants_lt = broom::tidy(km_infants_lt)

p_km_infants_lt = ggplot(bkm_infants_lt, aes(x = time, y = estimate)) +
  geom_step(aes(col = strata)) +
  ylim(c(0, 1)) +
  ylab("S(t)") +
  xlab("time")
library(patchwork)

p_km_infants_joined = p_km_infants + p_km_infants_lt + plot_layout(guides = "collect")
ggsave("book/Figures/survival/km-infants.png", p_km_infants_joined, height=3, width=7, units="in", dpi=600)


## table infant data

inf_sub = infants |>
  filter(stratum %in% c(1, 2, 4)) |>
  select(stratum, enter, exit, event, mother)

inf_sub |> knitr::kable()
62 changes: 62 additions & 0 deletions book/library.bib
@@ -1,3 +1,31 @@
@article{akritasGeneralizedProductlimitEstimator2005,
author = {Akritas, Michael G. and LaValley, Michael P.},
publisher = {Taylor \& Francis},
date = {2005-09},
doi = {10.1080/10485250500038637},
journaltitle = {Nonparametric Statistics},
langid = {english},
title = {A Generalized Product-Limit Estimator for Truncated Data},
urldate = {2024-11-24},
}

@article{brostrom.influence.1987,
author = {Broström, Göran},
publisher = {[Board of the Foundation of the Scandinavian Journal of Statistics, Wiley]},
date = {1987},
eprint = {4616055},
eprinttype = {jstor},
issn = {0303-6898},
journaltitle = {Scandinavian Journal of Statistics},
note = {https://www.jstor.org/stable/4616055},
number = {2},
pages = {113--123},
shorttitle = {The {{Influence}} of {{Mother}}'s {{Death}} on {{Infant Mortality}}},
title = {The {{Influence}} of {{Mother}}'s {{Death}} on {{Infant Mortality}}: {{A Case Study}} in {{Matched Data Survival Analysis}}},
urldate = {2024-04-11},
volume = {14},
}

@article{vakulenko-lagun.inverse.2020,
author = {{Vakulenko-Lagun}, Bella and Mandel, Micha and Betensky, Rebecca A.},
date = {2020-06},
@@ -648,6 +676,14 @@ @article{Cox1972
volume = {34},
}

@Manual{pkgeha,
title = {eha: Event History Analysis},
author = {Göran Broström},
year = {2024},
note = {R package version 2.11.5},
url = {https://cran.r-project.org/package=eha},
}

@article{Aalen1978,
author = {Aalen, Odd},
annotation = {Another half of the Nelson-Aalen estimator},
@@ -7978,6 +8014,19 @@ @article{Erdem2022
volume = {12},
}

@article{Suresh2022,
author = {Suresh, Krithika and Severn, Cameron and Ghosh, Debashis},
url = {https://doi.org/10.1186/s12874-022-01679-6},
date = {2022},
doi = {10.1186/s12874-022-01679-6},
issn = {1471-2288},
journaltitle = {BMC Medical Research Methodology},
number = {1},
pages = {207},
title = {{Survival prediction models: an introduction to discrete-time modeling}},
volume = {22},
}

@article{Benavoli2017,
author = {Benavoli, Alessio and Corani, Giorgio and Demšar, Janez and Zaffalon, Marco},
url = {http://jmlr.org/papers/v18/16-305.html},
@@ -8003,3 +8052,16 @@ @inbook{Simon2007
title = {Resampling Strategies for Model Assessment and Selection},
}

@article{McGough2021,
author = {McGough, Sarah F. and Incerti, Devin and Lyalina, Svetlana and Copping, Ryan and Narasimhan, Balasubramanian and Tibshirani, Robert},
title = {Penalized regression for left-truncated and right-censored survival data},
journal = {Statistics in Medicine},
volume = {40},
number = {25},
pages = {5487-5500},
keywords = {Cox model, high-dimensional data, lasso, left truncation, penalized regression, survival analysis},
doi = {10.1002/sim.9136},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.9136},
eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/sim.9136},
year = {2021}
}
