finish LB comments

mlsa-book · Nov 21, 2024 · ec5e819
1 parent 6e35746
commit ec5e819

Showing 3 changed files with 86 additions and 28 deletions.
diff --git a/book/Figures/ml/cv.png b/book/Figures/ml/cv.png
diff --git a/book/library.bib b/book/library.bib
@@ -9477,3 +9477,31 @@ @misc{Burk2024
       primaryClass={stat.ML},
       url={https://arxiv.org/abs/2406.04098}, 
 }
+
+@article{Benavoli2017,
+  author  = {Alessio Benavoli and Giorgio Corani and Janez Dem{\v{s}}ar and Marco Zaffalon},
+  title   = {Time for a Change: a Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis},
+  journal = {Journal of Machine Learning Research},
+  year    = {2017},
+  volume  = {18},
+  number  = {77},
+  pages   = {1--36},
+  url     = {http://jmlr.org/papers/v18/16-305.html}
+}
+
+@Inbook{Simon2007,
+author="Simon, Richard",
+editor="Dubitzky, Werner
+and Granzow, Martin
+and Berrar, Daniel",
+title="Resampling Strategies for Model Assessment and Selection",
+bookTitle="Fundamentals of Data Mining in Genomics and Proteomics",
+year="2007",
+publisher="Springer US",
+address="Boston, MA",
+pages="173--186",
+isbn="978-0-387-47509-7",
+doi="10.1007/978-0-387-47509-7_8",
+url="https://doi.org/10.1007/978-0-387-47509-7_8"
+}
+
diff --git a/book/machinelearning.qmd b/book/machinelearning.qmd
@@ -10,7 +10,8 @@ abstract: TODO (150-200 WORDS)
 
 This chapter covers core concepts in machine learning.
 This is not intended as a comprehensive introduction and does not cover mathematical theory nor how to run machine learning models using software.
-Instead, the focus is on introducing important concepts and to provide basic intuition for a general machine learning workflow. This includes the concept of a machine learning task, data splitting (resampling), model training and prediction, evaluation, and model comparison.
+Instead, the focus is on introducing important concepts and providing basic intuition for a general machine learning workflow.
+This includes the concept of a machine learning task, data splitting (resampling), model training and prediction, evaluation, and model comparison.
 Recommendations for more comprehensive introductions are given at the end of this chapter, including books that cover practical implementation in different programming languages.
 
 ## Basic workflow {#sec-ml-basics}
@@ -23,10 +24,10 @@ This book is primarily concerned with *predictive* survival analysis, i.e., maki
 The basic machine learning workflow is represented in @fig-ml-basic.
 Data is split into training and test datasets.
 A learner is selected and is trained on the training data, inducing a fitted model.
-The features from the test data are passed to the model which makes predictions for the unseen labels.
-The labels from the test data are passed to a chosen measure with the predictions, which evaluates the performance of the model.
+The features from the test data are passed to the model which makes predictions for the unseen outcomes (@cnj-lm).
+The outcomes from the test data are passed to a chosen measure with the predictions, which evaluates the performance of the model (@cnj-lm-eval).
 The process of repeating this procedure to test different training and test data is called *resampling* and running multiple resampling experiments with different models is called *benchmarking*.
-All these concepts will be explained in this chapter.
+All these concepts will be explained in this chapter. 
 
 ![Basic machine learning workflow with data splitting, model training, predicting, and evaluating. Image from @Foss2024 (CC BY-NC-SA 4.0).](Figures/ml/resampling.png){#fig-ml-basic}
 
@@ -94,21 +95,23 @@ Parameters are learned from data whereas hyperparameters are set beforehand to g
 Model *parameters* (or *weights*), $\bstheta$, are coefficients to be estimated during model training.
 Hyperparameters, $\bslambda$, control how the algorithms are run but are not directly updated by them.
 Hyperparameters can be mathematical, for example the learning rate in a gradient boosting machine (@sec-boost), or structural, for example the depth of a decision tree (@sec-ranfor).
-The number of hyperparameters usually increases with learner complexity and affects its performance. 
+The number of hyperparameters usually increases with learner complexity and affects its predictive performance. 
 Often hyperparameters need to be tuned (@sec-ml-opt) instead of manually set.
 Computationally, storing $(\hat{\bstheta}, \bslambda)$ is sufficient to recreate any trained model.
 
 :::: {.callout-note icon=false}
 
 ## Ridge regression
 
+::: {#cnj-lm}
+
 Let $f : \calX \rightarrow \calY$ be the regression task of interest with $\calX \subseteq \Reals$ and $\calY \subseteq \Reals$.
-Let $(\xx, \yy) = ((x_1, y_1), \ldots, (x_n, y_n))$ be data such that $x_i \in \calX$ and $y_i \in \calY$.
+Let $(\xx, \yy) = ((x_1, y_1), \ldots, (x_n, y_n))$ be data such that $x_i \in \calX$ and $y_i \in \calY$ for all $i = 1, \ldots, n$.
 
 Say the **learner** of interest is a regularized linear regression model with **learning algorithm**:
 
 $$
-(\hat{\beta}_0,\hat{\beta}_1):=\argmin_{\beta_0,\beta_1}\Bigg\{\sum_{i=1}^n(y_i-(\beta_0 +\beta_1 x_i))^2+\gamma\beta_1^2\Bigg\}.
+(\hat{\beta}_0,\hat{\beta}_1):=\argmin_{\beta_0,\beta_1}\Bigg\{\sum_{i=1}^n\big(y_i-(\beta_0 +\beta_1 x_i)\big)^2+\gamma\beta_1^2\Bigg\}.
 $$
 
 and **prediction algorithm**:
@@ -119,7 +122,9 @@ $$
 
 The **hyperparameters** are $\bslambda = (\gamma \in \PReals)$ and the **parameters** are $\bstheta = (\beta_0, \beta_1)^\trans$.
 Say that $\gamma = 2$ is set and the learner is then trained by passing $(\xx, \yy)$ to the learning algorithm and thus estimating $\hat{\bstheta}$ and $\hatf$.
-A **prediction**, can then be made by passing $x^* \in \calX$ to the fitted model: $\haty := \hatf(x^*) = \hat{\beta}_0 + \hat{\beta}_1x^*$.
+A **prediction** can then be made by passing new data $x^* \in \calX$ to the fitted model: $\haty := \hatf(x^*) = \hat{\beta}_0 + \hat{\beta}_1x^*$.
+
+:::
 
 ::::
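
The ridge regression example can be sketched in a few lines of code. This is an illustrative sketch rather than part of the book's source: it assumes the closed-form minimizer of the one-feature objective (with the intercept left unpenalized, as in the formula), and the names `fit_ridge` and `predict` as well as the toy data are hypothetical.

```python
def fit_ridge(x, y, gamma):
    """Learning algorithm: minimize sum (y_i - (b0 + b1*x_i))^2 + gamma * b1^2.

    Setting the derivatives to zero gives
    b1 = S_xy / (S_xx + gamma) and b0 = mean(y) - b1 * mean(x).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = s_xy / (s_xx + gamma)
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


def predict(theta_hat, x_new):
    """Prediction algorithm: evaluate the fitted line at a new point."""
    beta0, beta1 = theta_hat
    return beta0 + beta1 * x_new


# Hypothetical toy data; the hyperparameter gamma = 2 is fixed before training.
x = [0.0, 1.0, 2.0]
y = [1.0, 3.0, 5.0]
theta_hat = fit_ridge(x, y, gamma=2.0)   # S_xy = 4, S_xx = 2, so beta1 = 1 and beta0 = 2
y_hat = predict(theta_hat, 3.0)          # 2 + 1 * 3 = 5
```

Storing `theta_hat` together with `gamma` is enough to recreate the fitted model, which mirrors the remark about $(\hat{\bstheta}, \bslambda)$. With `gamma=0.0` the same function reduces to ordinary least squares.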
 
@@ -128,24 +133,22 @@ A **prediction**, can then be made by passing $x^* \in \calX$ to the fitted mode
 To understand if a model is 'good', its predictions are evaluated with a *loss function*.
 Loss functions assign a score to the discrepancy between predictions and true values, $L: \calY \times \calY \rightarrow \ExtReals$.
 Given (unseen) real-world data, $(\XX^*, \yy^*)$, and a trained model, $\hatf$, the loss is given by $L(\hatf(\XX^*), \yy^*) = L(\hatyy, \yy^*)$.
-For a model to be useful, it should perform well in general, and not just for the data used for training and development, which is known as a model's *generalization error*.
+For a model to be useful, it should perform well in general, meaning its generalization error should be low.
+The *generalization error* refers to the model's performance on new data, rather than just the data encountered during training and development.
 
-A model should not be deployed, that is, manually or automatically used to make predictions, unless its generalization error was estimated to be acceptable for a given context.
+A model should only be used to make predictions if its generalization error was estimated to be acceptable for a given context.
 If a model were to be trained and evaluated on the same data, the resulting loss, known as the *training error*, would be an overoptimistic estimate of the true generalization error [@Hastie2013].
 This occurs as the model is making predictions for data it has already 'seen' and the loss is therefore not evaluating the model's ability to generalize to new, unseen data.
 Estimation of the generalization error requires *data splitting*, which is the process of splitting available data, $\calD$, into *training data*, $\dtrain \subset \calD$, and *testing data*, $\dtest = \calD \setminus \dtrain$.
 
 The simplest method to estimate the generalization error is to use *holdout resampling*, which is the process of partitioning the data into one training dataset and one testing dataset, with the model trained on the former and predictions made for the latter.
-Using 2/3 of the data for training and 1/3 for testing is a common splitting ratio (@Kohavi1995).
-In general, for independent and identically distributed (iid) data, data should be partitioned randomly to ensure any information encoded in data ordering is removed.
-Ordering is often important in the real-world, for example in healthcare data when patients are recorded in order of enrolment to a study.
-Whilst ordering can provide useful information, it does not generalize to new, unseen data.
-For example, the number of days a patient has been in hospital is more useful than the patient's index in the dataset, as the former could be calculated for a new patient whereas the latter is meaningless.
-Another example is the `rats` dataset, which will be explored again in @sec-surv.
-The `rats` data explores how a novel drug effects tumor incidence.
-However, the data is ordered by rat litters with every three rats being in the same litter.
-Hence if rats were to be bred or raised differently over time,  even if the `litter` column were removed, this information would still be encoded in the order of the dataset and could impact upon any findings.
-Randomly splitting the dataset breaks any possible association between order and outcome.
+Using 2/3 of the data for training and 1/3 for testing is a common splitting ratio [@Kohavi1995].
+For independent and identically distributed (iid) data, it is generally best practice to partition the data randomly.
+This ensures that any potential patterns or information encoded in the ordering of the data are removed, as such patterns are unlikely to generalize to new, unseen data.
+For example, in clinical datasets, the order in which patients enter a study might inadvertently encode latent information such as which doctor was on duty at the time, which could theoretically influence patient outcomes.
+As this information is not explicitly captured in measured features, it is unlikely to hold predictive value for future patients.
+Random splitting breaks any spurious associations between the order of data and the outcomes.
+
 When data is not iid, for example spatially correlated or time-series data, then random splitting may not be advisable, see @Hornung2023 for an overview of evaluation strategies in non-standard settings.
 
 Holdout resampling is a quick method to estimate the generalization error, and is particularly useful when very large datasets are available.
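
A random holdout split can be sketched as follows. This is a minimal illustration that assumes iid data; the function name `holdout_split`, the seed, and the default 2/3 training fraction (taken from the ratio mentioned in the text) are choices made for the example only.

```python
import random


def holdout_split(n, train_fraction=2 / 3, seed=0):
    """Randomly partition the row indices 0..n-1 into training and test sets."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    indices = list(range(n))
    rng.shuffle(indices)           # shuffling removes information encoded in row order
    n_train = round(train_fraction * n)
    return indices[:n_train], indices[n_train:]


# Nine observations: six are used for training and three for testing.
train_idx, test_idx = holdout_split(9)
```

The model would be trained on the rows in `train_idx` only, and the loss computed on predictions for the rows in `test_idx`.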
@@ -155,30 +158,57 @@ $k$-fold cross-validation (CV) can be used as a more robust method to better est
 $k$-fold CV partitions the data into $k$ subsets, called *folds*.
 The training data comprises $k-1$ of the folds and the remaining one is used for testing and evaluation.
 This is repeated $k$ times until each of the folds has been used exactly once as the testing data.
-The performance from each fold is averaged into a final performance estimate.
+The performance from each fold is averaged into a final performance estimate (@fig-ml-cv).
 It is common to use $k = 5$ or $k = 10$ [@Breiman1992; @Kohavi1995].
 This process can be repeated multiple times (*repeated $k$-fold CV*) and/or $k$ can even be set to $n$, which is known as *leave-one-out cross-validation*.
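
The fold construction described above can be sketched as follows. This is an illustrative sketch, not a reference implementation: `k_fold_splits` and `cross_validate` are hypothetical names, and `fit_and_score` stands for any user-supplied function that trains on the given training indices and returns a loss on the test indices.

```python
import random


def k_fold_splits(n, k, seed=0):
    """Return k (train_indices, test_indices) pairs; each index is tested exactly once."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal-sized folds
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train, test))
    return splits


def cross_validate(n, k, fit_and_score, seed=0):
    """Average the per-fold scores into a single performance estimate."""
    scores = [fit_and_score(train, test) for train, test in k_fold_splits(n, k, seed)]
    return sum(scores) / len(scores)


splits = k_fold_splits(10, 5)       # five folds of two observations each
loo_splits = k_fold_splits(4, 4)    # k = n gives leave-one-out cross-validation
```

Repeated $k$-fold CV would call `cross_validate` several times with different seeds and average the results.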
 
 Cross-validation can also be stratified, which ensures that a variable of interest will have the same distribution in each fold as in the original data.
-This is important, and often recommended, in survival analysis to ensure that the proportion of censoring in each fold is representative of the full dataset [@Burk2024].
+This is important, and often recommended, in survival analysis to ensure that the proportion of censoring in each fold is representative of the full dataset [@Casalicchio2024; @Herrmann2020].
+
+![Three-fold cross-validation. In each iteration a different dataset is used for predictions and the other two for training. The performance from each iteration is averaged into a final, single metric. Image from @Casalicchio2024 (CC BY-NC-SA 4.0).](Figures/ml/cv.png){#fig-ml-cv}
+
+<!-- FIXME - TO USE IN PRINT WE NEED PERMISSION FROM ALL AUTHORS OR NEED TO MAKE OUR OWN VERSION -->
 
 Repeating resampling experiments with multiple models is referred to as a *benchmark experiment*.
 A benchmark experiment compares models by evaluating their performance on *identical* data, which means the same resampling strategy and folds should be used for all models.
-Determining if one model is actually better than another is a surprisingly complex topic [@Demsar2006; @Dietterich1998; @Nadeau2003] and is out of scope for this book, instead any benchmark experiments performed in this book are purely for illustrative reasons and no results are expected to generalize outside of these experiments.
+Determining if one model is actually better than another is a surprisingly complex topic [@Benavoli2017; @Demsar2006; @Dietterich1998; @Nadeau2003] and is out of scope for this book; instead, any benchmark experiments performed in this book are purely illustrative, and no results are expected to generalize outside of these experiments.
+
+:::: {.callout-note icon=false}
+
+## Evaluating ridge regression
+
+::: {#cnj-lm-eval}
+
+Let $\calX \subseteq \Reals$ and $\calY \subseteq \Reals$ and let $(\xx^*, \yy^*) = ((x^*_1, y^*_1), \ldots, (x^*_m, y^*_m))$ be data previously unseen by the model trained in @cnj-lm where $x^*_i \in \calX$ and $y^*_i \in \calY$ for all $i = 1, \ldots, m$.
+
+Predictions are made by passing $\xx^*$ to the fitted model yielding $\hatyy = (\haty_1, \ldots, \haty_m)$ where $\haty_i := \hatf(x_i^*) = \hat{\beta}_0 + \hat{\beta}_1x_i^*$.
+
+Say the mean absolute error is used to evaluate the model, defined by
+
+$$
+L(\boldsymbol{\phi}, \boldsymbol{\varphi}) = \frac{1}{m} \sum^m_{i=1} |\phi_i - \varphi_i|
+$$
+where $(\boldsymbol{\phi}, \boldsymbol{\varphi}) = ((\phi_1, \varphi_1),\ldots,(\phi_m, \varphi_m))$.
+
+The model's predictive performance is then calculated as $L(\hatyy, \yy^*)$.
+
+:::
+
+::::
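
The evaluation step can be sketched in code as well. This is a hypothetical example: the fitted coefficients and the unseen test data are made-up values chosen so the arithmetic is easy to follow, and `mean_absolute_error` is a name introduced for illustration.

```python
def mean_absolute_error(y_pred, y_true):
    """Average absolute difference between predictions and observed outcomes."""
    if len(y_pred) != len(y_true):
        raise ValueError("predictions and outcomes must have the same length")
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)


# Hypothetical fitted parameters and previously unseen test data.
beta0_hat, beta1_hat = 2.0, 1.0
x_star = [3.0, 4.0, 5.0]
y_star = [7.0, 9.0, 11.0]

# Predict each test outcome with the fitted model, then score the predictions.
y_hat = [beta0_hat + beta1_hat * x for x in x_star]   # [5.0, 6.0, 7.0]
loss = mean_absolute_error(y_hat, y_star)             # (2 + 3 + 4) / 3 = 3.0
```

Only the test features are passed to the model; the test outcomes are used solely by the measure.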
 
-## Optimization {#sec-ml-opt}
+## Hyperparameter Optimization {#sec-ml-opt}
 
 @sec-ml-models introduced model hyperparameters, which control how training and prediction algorithms are run.
 Setting hyperparameters is a critical part of model fitting and can significantly change model performance.
-*Tuning* is the process of using internal benchmark experiment to automatically select the optimal hyper-parameter configuration.
+*Tuning* is the process of using internal benchmark experiments to automatically select the optimal hyperparameter configuration.
 For example, the depth of trees, $m_r$, in a random forest (@sec-ranfor) is a potential hyperparameter to tune.
 This hyperparameter may be tuned over a range of values, say $[1, 15]$, or over a discrete subset, say $\{1, 5, 15\}$; for now, assume the latter.
 Three random forests with tree depths of $1$, $5$, and $15$, respectively, are compared in a benchmark experiment.
 The depth that results in the model with the optimal performance is then selected for the hyperparameter value going forward.
-*Nested resampling* is a common method to prevent overfitting that could occur from using overlapping data for tuning, training, or testing.
+*Nested resampling* is a common method to reduce bias that could occur from using overlapping data for tuning, training, or testing [@Simon2007].
 Nested resampling resamples the training set again for tuning, and the model with the selected configuration is then refit on the entire training data (@fig-ml-nested).
 
-![An illustration of nested resampling. The large blocks represent three-fold CV for the outer resampling for model evaluation and the small blocks represent four-fold CV for the inner resampling for HPO. The light blue blocks are the training sets and the dark blue blocks are the test sets. Image and caption from @Becker2024 (CC BY-NC-SA 4.0).](Figures/ml/nested.png){#fig-ml-nested}
+![An illustration of nested resampling. The large blocks represent three-fold CV for the outer resampling for model evaluation and the small blocks represent four-fold CV for the inner resampling for hyperparameter optimization. The light blue blocks are the training sets and the dark blue blocks are the test sets. Image and caption from @Becker2024 (CC BY-NC-SA 4.0).](Figures/ml/nested.png){#fig-ml-nested}
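
Nested resampling can be sketched as two loops. This is a sketch under stated assumptions rather than the book's implementation: to stay self-contained it tunes the ridge penalty $\gamma$ of a one-feature ridge regression instead of the tree depth of a random forest, it scores with the mean absolute error, and all function names, the grid, and the toy data are hypothetical.

```python
import random


def fit_ridge(x, y, gamma):
    """Closed-form minimizer of sum (y_i - (b0 + b1*x_i))^2 + gamma * b1^2."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = s_xy / (s_xx + gamma)
    return y_bar - beta1 * x_bar, beta1


def mae(theta, x, y):
    """Mean absolute error of the fitted line on (x, y)."""
    beta0, beta1 = theta
    return sum(abs(beta0 + beta1 * xi - yi) for xi, yi in zip(x, y)) / len(y)


def k_fold_splits(indices, k, rng):
    """Shuffle the given indices and return k (train, test) index pairs."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    return [
        ([j for m, fold in enumerate(folds) if m != i for j in fold], folds[i])
        for i in range(k)
    ]


def take(values, indices):
    return [values[i] for i in indices]


def tune(x, y, train_idx, grid, k_inner, rng):
    """Inner resampling: pick the gamma with the lowest mean validation loss."""
    inner_splits = k_fold_splits(train_idx, k_inner, rng)  # same folds for every candidate

    def inner_score(gamma):
        scores = [
            mae(fit_ridge(take(x, tr), take(y, tr), gamma), take(x, va), take(y, va))
            for tr, va in inner_splits
        ]
        return sum(scores) / len(scores)

    return min(grid, key=inner_score)


def nested_cv(x, y, grid, k_outer=3, k_inner=4, seed=0):
    """Outer resampling: estimate the generalization error of the tuned learner."""
    rng = random.Random(seed)
    outer_scores = []
    for train_idx, test_idx in k_fold_splits(range(len(x)), k_outer, rng):
        gamma = tune(x, y, train_idx, grid, k_inner, rng)        # uses training data only
        theta = fit_ridge(take(x, train_idx), take(y, train_idx), gamma)  # refit on full outer training set
        outer_scores.append(mae(theta, take(x, test_idx), take(y, test_idx)))
    return sum(outer_scores) / len(outer_scores)


# Hypothetical toy data: a noisy line with 12 observations.
x = [float(i) for i in range(12)]
y = [1.0 + 2.0 * xi + (0.5 if i % 2 else -0.5) for i, xi in enumerate(x)]
estimate = nested_cv(x, y, grid=[0.1, 1.0, 10.0])
```

The outer test folds are never seen during tuning, so the averaged outer loss is not biased by the hyperparameter selection. The defaults of three outer and four inner folds follow the figure.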
 
 ## Conclusion
 
@@ -191,7 +221,7 @@ Nested resampling is the process of resampling the training set again for tuning
 * Classification tasks make predictions for discrete outcomes, such as the predicted weather tomorrow;
 * Both regression and classification tasks may make deterministic predictions (a single number or category), or probabilistic predictions (the probability of a number or category);
 * Models have parameters that are fit during training and hyperparameters that are set or tuned;
-* Models should be resampled to estimate the generalization error to understand future performance.
+* Models should be evaluated on resampled data to estimate the generalization error to understand future performance.
 
 ::::