Survival2 (#78)

* add boxHED
* updated left-trncation + objective functions. Added KM
* Add multi-state
* Updates survival
* update eha figure
* update row/column names in figure/table
* Cleaned up CR and recurrent events.
* update survival/eha
* updates survtsk
* update srvtsk
* Finish (for now) srvtsk
* Push pdf [skip ci]
* Update book/survival.qmd
* Push pdf [skip ci]
* typos
* Push pdf [skip ci]
* typos and PI
* Push pdf [skip ci]
* move tasks around
* Push pdf [skip ci]
* add small note about discrete time
* Push pdf [skip ci]
* typos
* Push pdf [skip ci]
* equations to own line
* typo
* Push pdf [skip ci]
* typos
* Push pdf [skip ci]
* typos
* typoes
* Push pdf [skip ci]
* move equations to blocks
* Push pdf [skip ci]
* typos
* typos
* Push pdf [skip ci]
* typos, light rewrite
* Push pdf [skip ci]
* small rewrite
* typo
* Push pdf [skip ci]
* fix formatting
* typos
* Push pdf [skip ci]
* typo
* typo
* update km section
* Update KM section + replace rats data with tumor data
* more work on censoring + example table + figure for left-truncation
* Rewrite of objective function section -> Estimation
* typo
* add macro
* tex typ
* latex typo
* Push pdf [skip ci]
* typo
* Push pdf [skip ci]
* add fixme
* Push pdf [skip ci]
* further reading
* Push pdf [skip ci]
* typo
* typo
* add line 3
* typos
* Push pdf [skip ci]
* Update book/P1C4_survival.qmd
* Push pdf [skip ci]
* Update book/P1C4_survival.qmd Co-authored-by: Raphael Sonabend <[email protected]>
* Push pdf [skip ci]
* add discrete time section + non-parametric estimation
* Push pdf [skip ci]

---------

Co-authored-by: adibender <[email protected]>
Co-authored-by: Raphael Sonabend <[email protected]>
Co-authored-by: RaphaelS1 <[email protected]>
mlsa-book · Dec 5, 2024 · 5930499
1 parent 081cc1a

Showing 19 changed files with 819 additions and 296 deletions.
diff --git a/book/Figures/survival/event-history-analysis.png b/book/Figures/survival/event-history-analysis.png
diff --git a/book/Figures/survival/event-history-analysis.svg b/book/Figures/survival/event-history-analysis.svg
diff --git a/book/Figures/survival/km-age-bin-tumor.png b/book/Figures/survival/km-age-bin-tumor.png
diff --git a/book/Figures/survival/km-infants.png b/book/Figures/survival/km-infants.png
diff --git a/book/Figures/survival/km-tumor.png b/book/Figures/survival/km-tumor.png
diff --git a/book/Figures/survival/left-truncation.png b/book/Figures/survival/left-truncation.png
diff --git a/book/Figures/survival/multi-state-examples-w-transitions.png b/book/Figures/survival/multi-state-examples-w-transitions.png
diff --git a/book/P0C0_notation.qmd b/book/P0C0_notation.qmd
@@ -27,13 +27,13 @@ $$
 $$
 
 Vectors are usually defined using transpose notation, for example the vector above may instead be written as $\xx^\trans = (x_1 \ x_2 \cdots x_n)$ or $\xx = (x_1 \ x_2 \cdots x_n)^\trans$.
-Vectors may also be defined in a shortened format as, $\xx \in \calX^n$, which implies a vector of length $n$ with elements as represented above.
+Vectors may also be defined in a shortened format as $\xx \in \calX^{n \times 1}$ or more simply $\xx \in \calX^n$, which implies a column vector of length $n$ with elements as represented above.
 
 A letter in normal font with one subscript refers to a single element from a vector.
 For example, given $\xx \in \calX^n$, the $i$th element is denoted $x_i$.
 Given a matrix $\XX \in \calX^{n \times p}$, a bold-face lower-case letter with a single subscript refers to the row of a matrix, for example the $i$th row would be $\xx_i = (x_{i;1} \ x_{i;2} \cdots x_{i;p})^\trans$.
 Whereas a column is referenced with a semi-colon before the subscript, for example the $j$th column would be $\xx_{;j} = (x_{1;j} \ x_{2;j} \cdots x_{n;j})^\trans$.
-Two subscripts can be used to reference a single element of a matrix, for example $x_{i;j}$ would be the element in the $i$th row and $j$th column of $\XX$.
+Two subscripts can be used to reference a single element of a matrix, for example $x_{i;j} \in \calX$ would be the element in the $i$th row and $j$th column of $\XX$.
 
 ## Functions
 
@@ -44,7 +44,7 @@ $\EE(X)$ is the expectation of the random variable $X$.
 We write $A \indep B$, to denote that $A$ and $B$ are independent, i.e., that $P(A \cap B) = P(A)P(B)$.
 
 A function $f$, will either be written as a formal map of domain to codomain, $f: \calX \rightarrow \calY; (x, y) \mapsto f(x, y)$ (which is most useful for understanding inputs and outputs), or more simply and commonly as $f(x, y)$.
-Given a random variable, $X$, following distribution $\zeta$ (mathematically written $X \sim \zeta$), then $f_X$ denotes the probability density function, and analogously for other distribution defining functions.
+Given a random variable, $X$, following distribution $\zeta$ (mathematically written $X \sim \zeta$), then $f_X$ denotes the probability density function, and analogously for other distribution defining functions such as the cumulative distribution function, survival function, etc.
 In the survival analysis context (@sec-surv), a subscript "$0$" refers to a "baseline" function, for example, $S_0$ is the baseline survival function.
 
 ## Variables and acronyms
@@ -55,8 +55,8 @@ Common variables and acronyms used in the book are given in @tbl-not-var and @tb
 | - | ---- |
 | $\Reals, \PReals, \NNReals, \ExtReals$|Set of Reals, positive Reals (excl. zero), non-negative Reals (incl. zero), and Reals including $\pm\infty$. |
 | $\PNaturals$|Set of Naturals excluding zero. |
-| $(\XX, \tt, \bsdelta)$ | Survival data where $\XX \in \Reals$ is a real-valued matrix of observations (rows) and features (columns), $\tt$ is a vector of observed outcome times, and $\bsdelta$ is a vector of observed outcome indicators. |
-| $\bsbeta$|Vector of model coefficients/weights. |
+| $(\XX, \tt, \bsdelta)$ | Survival data where $\XX \in \Reals^{n \times p}$ is a real-valued matrix of $n$ observations (rows) and $p$ features (columns), $\tt \in \Reals^n$ is a vector of observed outcome times, and $\bsdelta \in \Reals^n$ is a vector of observed outcome indicators. |
+| $\bsbeta$|Vector of model coefficients/weights, $\bsbeta \in \Reals^p$. |
 | $\bseta$ | Vector of linear predictors, $\dvec{\eta}{n}$, where $\bseta = \XX\bsbeta$ and $\eta_i = \xx_{i}^\trans\bsbeta$. |
 | $\calD, \dtrain, \dtest$| Dataset, training data, and testing data. |
 

diff --git a/book/P1C4_survival.qmd b/book/P1C4_survival.qmd
diff --git a/book/P1C5_eha.qmd b/book/P1C5_eha.qmd
diff --git a/book/P1C6_survtsk.qmd b/book/P1C6_survtsk.qmd
diff --git a/book/P2C8_rank.qmd b/book/P2C8_rank.qmd
@@ -55,7 +55,7 @@ This representation of discrimination provides more information by encoding the
 In theory this representation could result in a negative value, however this would indicate that $C<0.5$, which would indicate serious problems with the model that should be addressed before proceeding with further analysis.
 Representing measures as a percentage over a baseline is a common method to improve measure interpretability and closely relates to the ERV representation of scoring rules (@sec-eval-distr-score-base).
 
-### Concordance Indices
+### Concordance Indices {#sec-eval-crank-conc}
 
 Common concordance indices in survival analysis can be expressed as a general measure:
 

diff --git a/book/P3C15_svm.qmd b/book/P3C15_svm.qmd
@@ -160,7 +160,7 @@ Where again $K$ is a kernel function and the calculation of the Lagrange multipl
 
 Support vector machines can be used to estimate rankings by penalizing discordant predictions.
 Recall the definition of concordance from @sec-eval-crank: ranking predictions for a pair of comparable observations $(i, j)$ where $t_i < t_j \cap \delta_i = 1$, are called concordant if $r_i > r_j$ where $r_i, r_j$ are the predicted ranks for observations $i$ and $j$ respectively and a higher value implies greater risk.
-Using the prognostic index as a ranking prediction (@sec-surv-setmltask), a pair of observations is concordant if $g(\xx_i) > g(\xx_j)$ when $t_i < t_j$, leading to:
+Using the prognostic index as a ranking prediction (@sec-survtsk-PI), a pair of observations is concordant if $g(\xx_i) > g(\xx_j)$ when $t_i < t_j$, leading to:
 
 $$
 \begin{aligned}

diff --git a/book/P4C21_discrete.qmd b/book/P4C21_discrete.qmd
@@ -6,6 +6,6 @@ abstract: TODO (150-200 WORDS)
 {{< include _macros.tex >}}
 :::
 
-# Discrete Time Survival Analysis
+# Discrete Time Survival Analysis {#sec-discrete}
 
 {{< include _soon.qmd >}}
diff --git a/book/P5C24_conclusions.qmd b/book/P5C24_conclusions.qmd
@@ -2,12 +2,34 @@
 {{< include _macros.tex >}}
 :::
 
-# Conclusions
-
-## Common problems in survival analysis
+# Conclusions {#sec-conclusions}
 
 {{< include _soon.qmd >}}
 
+## Common problems in survival analysis {#sec-conclusions-faq}
+
+### Data cleaning
+
+#### Events at t=0 {.unnumbered .unlisted}
+
+Throughout this book we have defined survival times taking values in the non-negative Reals (zero inclusive) $\NNReals$.
+In practice, model implementations assume time is over the positive Reals (zero exclusive).
+One must therefore consider how to deal with subjects that experience the outcome at $t=0$.
+There is no established best practice for dealing with this case as the answer may be data-dependent.
+Possible choices include:
+
+1. Delete all observations where the outcome occurs at $t=0$. This may be appropriate if it only happens in a small number of observations, so that deletion is unlikely to bias predictions;
+2. Update the survival time to the next smallest observed survival time. For example, if the first observation to experience the event after $t=0$ does so at $t=0.1$, then set $0.1$ as the survival time for any observation experiencing the event at $t=0$. Note that this method will not be appropriate when data is recorded over a long period. For example, if time is measured in years, there could be a substantial difference between $t=0$ and $t=1$;
+3. Update the survival time to a very small value $\epsilon$ that makes sense given the context of the data, e.g., $\epsilon = 0.0001$.
+
+#### Continuous v Discrete Time {.unnumbered .unlisted}
+
+We defined survival tasks throughout this book assuming continuous time predictions in $\NNReals$.
+In practice, many outcomes in survival analysis are recorded on a discrete scale, such as in medical statistics where outcomes are observed on a yearly, monthly, daily, or hourly basis.
+Whilst discrete-time survival analysis exists for this purpose (@sec-discrete), software implementations overwhelmingly use theory from the continuous-time setting.
+There has been little research into whether discrete-time methods outperform continuous-time methods when correctly applied to discrete data; however, available experiments do not indicate that discrete methods outperform their continuous counterparts [@Suresh2022].
+Therefore, it is recommended to use available software implementations, even when data is recorded on a discrete scale.
+
 ### Evaluation and prediction
 
 * Which time points to make predictions for?
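
Illustrative sketch, not part of the commit: the three options listed under "Events at t=0" in the hunk above could look roughly like this in base R. The `toy` data frame and all variable names are hypothetical, and `eps` simply reuses the example value $0.0001$ from the text.

```r
# Hypothetical toy data: two subjects experience the event at t = 0.
toy = data.frame(
  time   = c(0, 0, 0.1, 0.5, 2.0),
  status = c(1, 1, 1, 0, 1)
)
at_zero = toy$time == 0 & toy$status == 1

# Option 1: delete observations with an event at t = 0.
toy_deleted = toy[!at_zero, ]

# Option 2: move them to the next smallest observed event time.
next_event_time = min(toy$time[toy$status == 1 & toy$time > 0])
toy_shifted = toy
toy_shifted$time[at_zero] = next_event_time

# Option 3: replace t = 0 with a small, context-dependent epsilon.
eps = 1e-4  # assumed value, taken from the example in the text
toy_eps = toy
toy_eps$time[at_zero] = eps
```

Which option is appropriate remains data-dependent, as the text notes; the sketch only shows that each one is a one-line transformation applied before model fitting.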

diff --git a/book/_book/Machine-Learning-in-Survival-Analysis.pdf b/book/_book/Machine-Learning-in-Survival-Analysis.pdf
diff --git a/book/_macros.tex b/book/_macros.tex
@@ -27,6 +27,7 @@
 \providecommand{\calS}{\mathcal{S}}
 \providecommand{\calT}{\mathcal{T}}
 \providecommand{\calU}{\mathcal{U}}
+\providecommand{\calO}{\mathcal{O}}
 \providecommand{\calX}{\mathcal{X}}
 \providecommand{\calY}{\mathcal{Y}}
 
@@ -55,8 +56,6 @@
 \providecommand{\hattt}{\hat{\mathbf{t}}}
 \providecommand{\hatrr}{\hat{\mathbf{r}}}
 
-\providecommand{\tbi}{t_{(i)}}
-
 \providecommand{\bsdelta}{\boldsymbol{\delta}}
 \providecommand{\bsbeta}{\boldsymbol{\beta}}
 \providecommand{\bseta}{\boldsymbol{\eta}}

diff --git a/book/experiments/code.R b/book/experiments/code.R
@@ -2,10 +2,12 @@ remotes::install_github("mlr-org/mlr3proba", ref = 'v0.5.7', upgrade = "never")
 remotes::install_github("mlr-org/mlr3", ref = 'v0.16.1', upgrade = "never")
 remotes::install_github("mlr-org/paradox", ref = 'v0.11.1', upgrade = "never")
 
+library(ggplot2)
+theme_set(theme_bw())
+
 ## Ranking
 rm(list = ls())
 library(dplyr)
-library(ggplot2)
 library(mlr3)
 library(mlr3proba)
 
@@ -234,3 +236,96 @@ g = p1 + p2 + p3 + p4 &
 
 ggsave("book/Figures/forests/bootstrap.png", g, height = 6, units = "in",
   dpi = 600)
+
+## Kaplan Meier
+library(survival)
+data("tumor", package = "pammtools")
+tumor <- cbind(id = seq_len(nrow(tumor)), tumor)
+tumor_duplicated = tumor |>
+  filter(days %in% days[duplicated(days)]) |>
+  arrange(days)
+
+## Table for illustration of right-censored data
+tab_surv_tumor = tumor_duplicated |>
+  filter(id %in% c(13, 62, 185, 230, 431, 719)) |>
+  select(id, age, sex, complications, days, status) |>
+  arrange(id)
+knitr::kable(tab_surv_tumor)
+
+km = survfit(Surv(days, status)~1, data = tumor)
+bkm = broom::tidy(km)
+
+df_med = data.frame(
+  x = c(0, median(km)), # Starting x-coordinates
+  y = c(0.5, 0),        # Starting y-coordinates
+  xend = c(median(km), median(km)),    # Ending x-coordinates
+  yend = c(0.5, 0.5)       # Ending y-coordinates
+)
+
+p_km = ggplot(bkm, aes(x = time, y = estimate)) +
+  geom_step() +
+  geom_segment(data=df_med, aes(x=x, xend=xend, y=y, yend=yend), lty = 3)+
+  ylim(c(0, 1)) +
+  ylab("S(t)") +
+  xlab("time")
+p_km
+ggsave("book/Figures/survival/km-tumor.png", p_km, height=3, units="in", dpi=600)
+
+# stratified KM wrt age group (age < 50 vs. age >= 50)
+tumor = tumor |>
+  mutate(age_bin = factor(age < 50, levels = c(TRUE, FALSE), labels = c("age < 50", "age >= 50")))
+km_age_bin = survfit(Surv(days, status)~age_bin, data = tumor)
+bkm_age_bin = broom::tidy(km_age_bin)
+med_km_age_bin = as.numeric(median(km_age_bin))
+df_age_bin = data.frame(
+  x = c(0, med_km_age_bin[2]), # Starting x-coordinates
+  y = c(0.5, 0),        # Starting y-coordinates
+  xend = c(med_km_age_bin[2], med_km_age_bin[2]),    # Ending x-coordinates
+  yend = c(0.5, 0.5)       # Ending y-coordinates
+)
+
+p_km_age_bin = ggplot(bkm_age_bin, aes(x = time, y = estimate)) +
+  geom_step(aes(col = strata)) +
+  geom_segment(data=df_age_bin, aes(x=x, xend=xend, y=y, yend=yend), lty = 3)+
+  geom_hline(yintercept =  .5, lty = 3) +
+  ylim(c(0, 1)) +
+  ylab("S(t)") +
+  xlab("time")
+p_km_age_bin
+ggsave("book/Figures/survival/km-age-bin-tumor.png", p_km_age_bin, height=3, units="in", dpi=600)
+
+
+## Left-truncation
+data("infants", package = "eha")
+
+# KM for infants with dead/alive mothers
+km_infants = survfit(Surv(exit, event)~mother, data = infants)
+bkm_infants = broom::tidy(km_infants)
+
+p_km_infants = ggplot(bkm_infants, aes(x = time, y = estimate)) +
+  geom_step(aes(col = strata)) +
+  ylim(c(0, 1)) +
+  ylab("S(t)") +
+  xlab("time")
+# adjusted for left-truncation
+km_infants_lt = survfit(Surv(enter, exit, event)~mother, data = infants)
+bkm_infants_lt = broom::tidy(km_infants_lt)
+
+p_km_infants_lt = ggplot(bkm_infants_lt, aes(x = time, y = estimate)) +
+  geom_step(aes(col = strata)) +
+  ylim(c(0, 1)) +
+  ylab("S(t)") +
+  xlab("time")
+library(patchwork)
+
+p_km_infants_joined = p_km_infants + p_km_infants_lt + plot_layout(guides =  "collect")
+ggsave("book/Figures/survival/km-infants.png", p_km_infants_joined, height=3, width=7, units="in", dpi=600)
+
+
+## table infant data
+
+inf_sub = infants |>
+  filter(stratum %in% c(1, 2, 4)) |>
+  select(stratum, enter, exit, event, mother)
+
+inf_sub |> knitr::kable()
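
Illustrative sketch, not part of the commit: as a sanity check on what `survfit()` computes in the Kaplan-Meier section of `code.R` above, the product-limit estimate $\hat{S}(t) = \prod_{t_k \le t} (1 - d_k / n_k)$ can be reproduced by hand, where $d_k$ is the number of events and $n_k$ the number at risk at event time $t_k$. The six-observation `time` and `status` vectors are made-up toy data, and the variable names are hypothetical.

```r
library(survival)

# Made-up right-censored toy data (status: 1 = event, 0 = censored).
time   = c(1, 2, 2, 3, 4, 5)
status = c(1, 1, 0, 1, 0, 1)

# Kaplan-Meier estimate via survfit, as in the script above.
km = survfit(Surv(time, status) ~ 1)

# The same estimate by hand: at each distinct event time, count the
# events (d) and the subjects still at risk (n), then take the
# cumulative product of the conditional survival probabilities.
event_times = sort(unique(time[status == 1]))
d = sapply(event_times, function(t) sum(time == t & status == 1))
n = sapply(event_times, function(t) sum(time >= t))
s_hand = cumprod(1 - d / n)

# Evaluate the survfit estimate at the same event times and compare.
s_fit = summary(km, times = event_times)$surv
isTRUE(all.equal(s_hand, s_fit))
```

The left-truncated fits in the script differ only in how the risk set is counted, which is why they use the three-argument `Surv(enter, exit, event)` form.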
diff --git a/book/library.bib b/book/library.bib
@@ -1,3 +1,31 @@
+@article{akritasGeneralizedProductlimitEstimator2005,
+  author       = {Akritas, Michael G. and LaValley, Michael P.},
+  publisher    = {Taylor \& Francis},
+  date         = {2005-09},
+  doi          = {10.1080/10485250500038637},
+  journaltitle = {Nonparametric Statistics},
+  langid       = {english},
+  title        = {A Generalized Product-Limit Estimator for Truncated Data},
+  urldate      = {2024-11-24},
+}
+
+@article{brostrom.influence.1987,
+  author       = {Broström, Göran},
+  publisher    = {[Board of the Foundation of the Scandinavian Journal of Statistics, Wiley]},
+  date         = {1987},
+  eprint       = {4616055},
+  eprinttype   = {jstor},
+  issn         = {0303-6898},
+  journaltitle = {Scandinavian Journal of Statistics},
+  note         = {https://www.jstor.org/stable/4616055},
+  number       = {2},
+  pages        = {113--123},
+  shorttitle   = {The {{Influence}} of {{Mother}}'s {{Death}} on {{Infant Mortality}}},
+  title        = {The {{Influence}} of {{Mother}}'s {{Death}} on {{Infant Mortality}}: {{A Case Study}} in {{Matched Data Survival Analysis}}},
+  urldate      = {2024-04-11},
+  volume       = {14},
+}
+
 @article{vakulenko-lagun.inverse.2020,
   author       = {{Vakulenko-Lagun}, Bella and Mandel, Micha and Betensky, Rebecca A.},
   date         = {2020-06},
@@ -648,6 +676,14 @@ @article{Cox1972
   volume       = {34},
 }
 
+@Manual{pkgeha,
+  title = {eha: Event History Analysis},
+  author = {Göran Broström},
+  year = {2024},
+  note = {R package version 2.11.5},
+  url = {https://cran.r-project.org/package=eha},
+}
+
 @article{Aalen1978,
   author       = {Aalen, Odd},
   annotation   = {Another half of the Nelson-Aalen estimator},
@@ -7978,6 +8014,19 @@ @article{Erdem2022
   volume       = {12},
 }
 
+@article{Suresh2022,
+  author       = {Suresh, Krithika and Severn, Cameron and Ghosh, Debashis},
+  url          = {https://doi.org/10.1186/s12874-022-01679-6},
+  date         = {2022},
+  doi          = {10.1186/s12874-022-01679-6},
+  issn         = {1471-2288},
+  journaltitle = {BMC Medical Research Methodology},
+  number       = {1},
+  pages        = {207},
+  title        = {{Survival prediction models: an introduction to discrete-time modeling}},
+  volume       = {22},
+}
+
 @article{Benavoli2017,
   author       = {Benavoli, Alessio and Corani, Giorgio and Demšar, Janez and Zaffalon, Marco},
   url          = {http://jmlr.org/papers/v18/16-305.html},
@@ -8003,3 +8052,16 @@ @inbook{Simon2007
   title     = {Resampling Strategies for Model Assessment and Selection},
 }
 
+@article{McGough2021,
+  author       = {McGough, Sarah F. and Incerti, Devin and Lyalina, Svetlana and Copping, Ryan and Narasimhan, Balasubramanian and Tibshirani, Robert},
+  url          = {https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.9136},
+  date         = {2021},
+  doi          = {10.1002/sim.9136},
+  eprint       = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/sim.9136},
+  journaltitle = {Statistics in Medicine},
+  keywords     = {Cox model, high-dimensional data, lasso, left truncation, penalized regression, survival analysis},
+  number       = {25},
+  pages        = {5487--5500},
+  title        = {Penalized regression for left-truncated and right-censored survival data},
+  volume       = {40},
+}