diff --git a/_freeze/notebooks/6-training-efficiency-exercise-solution/execute-results/html.json b/_freeze/notebooks/6-training-efficiency-exercise-solution/execute-results/html.json index 8f54a36..1533bc5 100644 --- a/_freeze/notebooks/6-training-efficiency-exercise-solution/execute-results/html.json +++ b/_freeze/notebooks/6-training-efficiency-exercise-solution/execute-results/html.json @@ -2,7 +2,7 @@ "hash": "6afa555cbebd20dcf7d2176d38060953", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Training Efficiency\"\nsolutions: true\n---\n\n---\ntitle: \"Training Efficiency\"\n---\n\n\n\n\n\n\n**Question 1:** Validation\n\nIn this exercise, we will once again train a simple multi-layer perceptron on the *Indian Liver Patient Dataset* (ILPD). Create a learner that:\n\n1. Uses 2 hidden layers with 100 neurons each.\n2. Utilizes a batch size of 128.\n3. Trains for 200 epochs.\n4. Employs a validation set comprising 30% of the data.\n5. Tracks the training and validation log-loss during training.\n6. Utilizes trace-jitting to speed up the training process.\n7. Employs the history callback to record the training and validation log-loss during training.\n\nAfterward, plot the validation log-loss, which is accessible via `learner$model$callbacks$history`.\n\nBelow, we create the task and remove the `gender` feature for simplicity.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(mlr3verse)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nLoading required package: mlr3\n```\n\n\n:::\n\n```{.r .cell-code}\nlibrary(mlr3torch)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nLoading required package: mlr3pipelines\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nLoading required package: torch\n```\n\n\n:::\n\n```{.r .cell-code}\nilpd_num <- tsk(\"ilpd\")\nilpd_num$select(setdiff(ilpd_num$feature_names, \"gender\"))\nilpd_num\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n (583 x 10): Indian Liver Patient Data\n* Target: diseased\n* Properties: twoclass\n* Features (9):\n - dbl (5): albumin, albumin_globulin_ratio, direct_bilirubin, total_bilirubin, total_protein\n - int (4): age, alanine_transaminase, alkaline_phosphatase, aspartate_transaminase\n```\n\n\n:::\n:::\n\n\n
\nHint\n* To specify the validation set, use the `validate` field, which can either be set during construction or by calling `$configure()`.\n* Trace-jitting can be enabled via the `jit_trace` parameter.\n* The history callback can be constructed via `t_clbk(\"history\")` and needs to be passed during the *construction* of the learner.\n* The validation and measures can be specified via `measures_valid` and take a measure object that is constructed via `msr()`.\n
\n\n::: {.content-visible when-meta=solutions}\n**Solution**\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(ggplot2)\n\nmlp <- lrn(\"classif.mlp\",\n neurons = c(100, 100),\n batch_size = 128,\n epochs = 200,\n predict_type = \"prob\",\n validate = 0.3,\n jit_trace = TRUE,\n callbacks = t_clbk(\"history\"),\n measures_valid = msr(\"classif.logloss\")\n)\n\nmlp$train(ilpd_num)\nhead(mlp$model$callbacks$history)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n epoch valid.classif.logloss\n \n1: 1 3.373034\n2: 2 5.475234\n3: 3 4.667771\n4: 4 3.047842\n5: 5 1.563049\n6: 6 0.958690\n```\n\n\n:::\n\n```{.r .cell-code}\nggplot(mlp$model$callbacks$history) +\n geom_line(aes(x = epoch, y = valid.classif.logloss)) +\n labs(\n y = \"Log-Loss (Validation)\",\n x = \"Epoch\"\n ) +\n theme_minimal()\n```\n\n::: {.cell-output-display}\n![](6-training-efficiency-exercise-solution_files/figure-html/unnamed-chunk-3-1.png){fig-align='center' width=672}\n:::\n:::\n\n:::\n\n**Question 2:** Early Stopping\nEnable early stopping to prevent overfitting and re-train the learner (using a patience of 10). Print the final validation performance of the learner and the early stopped results. You can consult the [documentation of `LearnerTorch`](https://mlr3torch.mlr-org.com/reference/mlr_learners_torch.html) on how to access these two results (section *Active Bindings*).\n\n
\nHint\nYou can enable early stopping by setting the `patience` parameter.\n
\n\n::: {.content-visible when-meta=solutions}\n**Solution**\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmlp$configure(\n patience = 10\n)\nmlp$train(ilpd_num)\nmlp$internal_tuned_values\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$epochs\n[1] 24\n```\n\n\n:::\n\n```{.r .cell-code}\nmlp$internal_valid_scores\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$classif.logloss\n[1] 0.5598296\n```\n\n\n:::\n:::\n\n:::\n\n**Question 3:** Early Stopping and Dropout Tuning\n\nWhile early stopping in itself is already useful, `mlr3torch` also allows you to simultaneously tune the number of epochs using early stopping while tuning other hyperparameters via traditional hyperparameter tuning from `mlr3tuning`.\n\nOne thing we have not covered so far is that the MLP learner we have used so far also uses a dropout layer. The dropout probability can be configured via the `p` parameter.\n\nYour task is to tune the dropout probability `p` in the range $[0, 1]$ and the epochs using early stopping (using the configuration from the previous exercise).\n\nTo adapt this to work with early stopping, you need to set the:\n\n1. `epochs` to `to_tune(upper = , internal = TRUE)`: This tells the `Tuner` that the learner will tune the number of epochs itself.\n2. `$validate` field of the `\"test\"` so the same data is used for tuning and validation.\n3. Tuning `measure` to `msr(\"internal_valid_score\", minimize = TRUE)`. We set `minimize` to `TRUE` because we have used the log-loss as a validation measure.\n\nApart from this, the tuning works just like in tutorial 5. Use 3-fold cross-validation for the tuning and evaluate 10 configurations.\n\nRun the tuning and print the optimal configuration.\n\n::: {.content-visible when-meta=solutions}\n**Solution**\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(mlr3torch)\n\nmlp$configure(\n epochs = to_tune(upper = 100, internal = TRUE),\n p = to_tune(lower = 0, upper = 1),\n validate = \"test\"\n)\n\ntuner <- tnr(\"random_search\")\nresampling <- rsmp(\"cv\", folds = 3)\nmeasure <- msr(\"internal_valid_score\", minimize = TRUE)\n\nti <- tune(\n tuner = tuner,\n task = ilpd_num,\n learner = mlp,\n resampling = resampling,\n measure = measure,\n term_evals = 10\n)\n\nti$learner_result_param_vals\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nNULL\n```\n\n\n:::\n:::\n\n:::\n\n", + "markdown": "---\ntitle: \"Training Efficiency\"\nsolutions: true\n---\n\n---\ntitle: \"Training Efficiency\"\n---\n\n\n\n\n\n\n**Question 1:** Validation\n\nIn this exercise, we will once again train a simple multi-layer perceptron on the *Indian Liver Patient Dataset* (ILPD). Create a learner that:\n\n1. Uses 2 hidden layers with 100 neurons each.\n2. Utilizes a batch size of 128.\n3. Trains for 200 epochs.\n4. Employs a validation set comprising 30% of the data.\n5. Track the validation log-loss.\n6. Utilizes trace-jitting to speed up the training process.\n7. 
Employs the history callback to record the training and validation log-loss during training.\n\nAfterward, plot the validation log-loss, which is accessible via `learner$model$callbacks$history`.\n\nBelow, we create the task and remove the `gender` feature again for simplicity.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(mlr3verse)\nlibrary(mlr3torch)\nilpd_num <- tsk(\"ilpd\")\nilpd_num$select(setdiff(ilpd_num$feature_names, \"gender\"))\nilpd_num\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n (583 x 10): Indian Liver Patient Data\n* Target: diseased\n* Properties: twoclass\n* Features (9):\n - dbl (5): albumin, albumin_globulin_ratio, direct_bilirubin, total_bilirubin, total_protein\n - int (4): age, alanine_transaminase, alkaline_phosphatase, aspartate_transaminase\n```\n\n\n:::\n:::\n\n\n::: {.content-visible when-meta=solutions}\n**Solution**\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(ggplot2)\n\nmlp <- lrn(\"classif.mlp\",\n neurons = c(100, 100),\n batch_size = 128,\n epochs = 200,\n predict_type = \"prob\",\n validate = 0.3,\n jit_trace = TRUE,\n callbacks = t_clbk(\"history\"),\n measures_valid = msr(\"classif.logloss\")\n)\n\nmlp$train(ilpd_num)\nhead(mlp$model$callbacks$history)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n epoch valid.classif.logloss\n \n1: 1 3.373034\n2: 2 5.475234\n3: 3 4.667771\n4: 4 3.047842\n5: 5 1.563049\n6: 6 0.958690\n```\n\n\n:::\n\n```{.r .cell-code}\nggplot(mlp$model$callbacks$history) +\n geom_line(aes(x = epoch, y = valid.classif.logloss)) +\n labs(\n y = \"Log-Loss (Validation)\",\n x = \"Epoch\"\n ) +\n theme_minimal()\n```\n\n::: {.cell-output-display}\n![](6-training-efficiency-exercise-solution_files/figure-html/unnamed-chunk-3-1.png){fig-align='center' width=672}\n:::\n:::\n\n:::\n\n**Question 2:** Early Stopping\n\nEnable early stopping to prevent overfitting and re-train the learner (using a patience of 10). Print the final validation performance of the learner and the early stopped results. You can consult the [documentation of `LearnerTorch`](https://mlr3torch.mlr-org.com/reference/mlr_learners_torch.html) on how to access these (see section *Active Bindings*).\n\n
\nHint\nYou can enable early stopping by setting the `patience` parameter.\n
\n\n::: {.content-visible when-meta=solutions}\n**Solution**\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmlp$configure(\n patience = 10\n)\nmlp$train(ilpd_num)\nmlp$internal_tuned_values\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$epochs\n[1] 24\n```\n\n\n:::\n\n```{.r .cell-code}\nmlp$internal_valid_scores\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$classif.logloss\n[1] 0.5598296\n```\n\n\n:::\n:::\n\n:::\n\n**Question 3:** Early Stopping and Dropout Tuning\n\nWhile early stopping in itself is already useful, `mlr3torch` also allows you to simultaneously tune the number of epochs using early stopping while tuning other hyperparameters via traditional hyperparameter tuning from `mlr3tuning`.\n\nOne thing we have not mentioned so far is that the MLP learner also uses a dropout layer.\nThe dropout probability can be configured via the `p` parameter.\n\nYour task is to tune the dropout probability `p` in the range $[0, 1]$ and the epochs using early stopping (using the configuration from the previous exercise) with an upper bound of 100 epochs.\n\nTo adapt this to work with early stopping, you need to set the:\n\n1. `epochs` to `to_tune(upper = , internal = TRUE)`: This tells the `Tuner` that the learner will tune the number of epochs itself.\n2. `$validate` field of the `\"test\"` so the same data is used for tuning and validation.\n3. Tuning `measure` to `msr(\"internal_valid_score\", minimize = TRUE)`. We set `minimize` to `TRUE` because we have used the log-loss as a validation measure.\n\nApart from this, the tuning works just like in tutorial 5. Use 3-fold cross-validation and evaluate 10 configurations using random search.\nFinally, print the optimal configuration.\n\n::: {.content-visible when-meta=solutions}\n**Solution**\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(mlr3torch)\n\nmlp$configure(\n epochs = to_tune(upper = 100, internal = TRUE),\n p = to_tune(lower = 0, upper = 1),\n validate = \"test\"\n)\n\ntuner <- tnr(\"random_search\")\nresampling <- rsmp(\"cv\", folds = 3)\nmeasure <- msr(\"internal_valid_score\", minimize = TRUE)\n\nti <- tune(\n tuner = tuner,\n task = ilpd_num,\n learner = mlp,\n resampling = resampling,\n measure = measure,\n term_evals = 10\n)\n\nti$result_learner_param_vals\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$epochs\n[1] 53\n\n$device\n[1] \"auto\"\n\n$num_threads\n[1] 1\n\n$num_interop_threads\n[1] 1\n\n$seed\n[1] \"random\"\n\n$jit_trace\n[1] TRUE\n\n$eval_freq\n[1] 1\n\n$measures_train\nlist()\n\n$measures_valid\nlist()\n\n$patience\n[1] 0\n\n$min_delta\n[1] 0\n\n$batch_size\n[1] 128\n\n$neurons\n[1] 100 100\n\n$p\n[1] 0.3738756\n\n$activation\n object generator\n Inherits from: \n Public:\n .classes: nn_relu nn_module\n initialize: function (inplace = FALSE) \n forward: function (input) \n clone: function (deep = FALSE, ..., replace_values = TRUE) \n Private:\n .__clone_r6__: function (deep = FALSE) \n Parent env: \n Locked objects: FALSE\n Locked class: FALSE\n Portable: TRUE\n\n$activation_args\nlist()\n```\n\n\n:::\n:::\n\n:::\n\n", "supporting": [ "6-training-efficiency-exercise-solution_files" ], diff --git a/_freeze/notebooks/6-training-efficiency-exercise-task/execute-results/html.json b/_freeze/notebooks/6-training-efficiency-exercise-task/execute-results/html.json index 9b3125f..eb2aa16 100644 --- a/_freeze/notebooks/6-training-efficiency-exercise-task/execute-results/html.json +++ 
b/_freeze/notebooks/6-training-efficiency-exercise-task/execute-results/html.json @@ -2,7 +2,7 @@ "hash": "04252552fc4e85b5f601de01091b4ab7", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Training Efficiency\"\nsolutions: false\n---\n\n---\ntitle: \"Training Efficiency\"\n---\n\n\n\n\n\n\n**Question 1:** Validation\n\nIn this exercise, we will once again train a simple multi-layer perceptron on the *Indian Liver Patient Dataset* (ILPD). Create a learner that:\n\n1. Uses 2 hidden layers with 100 neurons each.\n2. Utilizes a batch size of 128.\n3. Trains for 200 epochs.\n4. Employs a validation set comprising 30% of the data.\n5. Tracks the training and validation log-loss during training.\n6. Utilizes trace-jitting to speed up the training process.\n7. Employs the history callback to record the training and validation log-loss during training.\n\nAfterward, plot the validation log-loss, which is accessible via `learner$model$callbacks$history`.\n\nBelow, we create the task and remove the `gender` feature for simplicity.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(mlr3verse)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nLoading required package: mlr3\n```\n\n\n:::\n\n```{.r .cell-code}\nlibrary(mlr3torch)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nLoading required package: mlr3pipelines\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nLoading required package: torch\n```\n\n\n:::\n\n```{.r .cell-code}\nilpd_num <- tsk(\"ilpd\")\nilpd_num$select(setdiff(ilpd_num$feature_names, \"gender\"))\nilpd_num\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n (583 x 10): Indian Liver Patient Data\n* Target: diseased\n* Properties: twoclass\n* Features (9):\n - dbl (5): albumin, albumin_globulin_ratio, direct_bilirubin, total_bilirubin, total_protein\n - int (4): age, alanine_transaminase, alkaline_phosphatase, aspartate_transaminase\n```\n\n\n:::\n:::\n\n\n
\nHint\n* To specify the validation set, use the `validate` field, which can either be set during construction or by calling `$configure()`.\n* Trace-jitting can be enabled via the `jit_trace` parameter.\n* The history callback can be constructed via `t_clbk(\"history\")` and needs to be passed during the *construction* of the learner.\n* The validation and measures can be specified via `measures_valid` and take a measure object that is constructed via `msr()`.\n
\n\n::: {.content-visible when-meta=solutions}\n**Solution**\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(ggplot2)\n\nmlp <- lrn(\"classif.mlp\",\n neurons = c(100, 100),\n batch_size = 128,\n epochs = 200,\n predict_type = \"prob\",\n validate = 0.3,\n jit_trace = TRUE,\n callbacks = t_clbk(\"history\"),\n measures_valid = msr(\"classif.logloss\")\n)\n\nmlp$train(ilpd_num)\nhead(mlp$model$callbacks$history)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n epoch valid.classif.logloss\n \n1: 1 3.373034\n2: 2 5.475234\n3: 3 4.667771\n4: 4 3.047842\n5: 5 1.563049\n6: 6 0.958690\n```\n\n\n:::\n\n```{.r .cell-code}\nggplot(mlp$model$callbacks$history) +\n geom_line(aes(x = epoch, y = valid.classif.logloss)) +\n labs(\n y = \"Log-Loss (Validation)\",\n x = \"Epoch\"\n ) +\n theme_minimal()\n```\n\n::: {.cell-output-display}\n![](6-training-efficiency-exercise-task_files/figure-html/unnamed-chunk-3-1.png){fig-align='center' width=672}\n:::\n:::\n\n:::\n\n**Question 2:** Early Stopping\nEnable early stopping to prevent overfitting and re-train the learner (using a patience of 10). Print the final validation performance of the learner and the early stopped results. You can consult the [documentation of `LearnerTorch`](https://mlr3torch.mlr-org.com/reference/mlr_learners_torch.html) on how to access these two results (section *Active Bindings*).\n\n
\nHint\nYou can enable early stopping by setting the `patience` parameter.\n
\n\n::: {.content-visible when-meta=solutions}\n**Solution**\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmlp$configure(\n patience = 10\n)\nmlp$train(ilpd_num)\nmlp$internal_tuned_values\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$epochs\n[1] 24\n```\n\n\n:::\n\n```{.r .cell-code}\nmlp$internal_valid_scores\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$classif.logloss\n[1] 0.5598296\n```\n\n\n:::\n:::\n\n:::\n\n**Question 3:** Early Stopping and Dropout Tuning\n\nWhile early stopping in itself is already useful, `mlr3torch` also allows you to simultaneously tune the number of epochs using early stopping while tuning other hyperparameters via traditional hyperparameter tuning from `mlr3tuning`.\n\nOne thing we have not covered so far is that the MLP learner we have used so far also uses a dropout layer. The dropout probability can be configured via the `p` parameter.\n\nYour task is to tune the dropout probability `p` in the range $[0, 1]$ and the epochs using early stopping (using the configuration from the previous exercise).\n\nTo adapt this to work with early stopping, you need to set the:\n\n1. `epochs` to `to_tune(upper = , internal = TRUE)`: This tells the `Tuner` that the learner will tune the number of epochs itself.\n2. `$validate` field of the `\"test\"` so the same data is used for tuning and validation.\n3. Tuning `measure` to `msr(\"internal_valid_score\", minimize = TRUE)`. We set `minimize` to `TRUE` because we have used the log-loss as a validation measure.\n\nApart from this, the tuning works just like in tutorial 5. Use 3-fold cross-validation for the tuning and evaluate 10 configurations.\n\nRun the tuning and print the optimal configuration.\n\n::: {.content-visible when-meta=solutions}\n**Solution**\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(mlr3torch)\n\nmlp$configure(\n epochs = to_tune(upper = 100, internal = TRUE),\n p = to_tune(lower = 0, upper = 1),\n validate = \"test\"\n)\n\ntuner <- tnr(\"random_search\")\nresampling <- rsmp(\"cv\", folds = 3)\nmeasure <- msr(\"internal_valid_score\", minimize = TRUE)\n\nti <- tune(\n tuner = tuner,\n task = ilpd_num,\n learner = mlp,\n resampling = resampling,\n measure = measure,\n term_evals = 10\n)\n\nti$learner_result_param_vals\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nNULL\n```\n\n\n:::\n:::\n\n:::\n\n", + "markdown": "---\ntitle: \"Training Efficiency\"\nsolutions: false\n---\n\n---\ntitle: \"Training Efficiency\"\n---\n\n\n\n\n\n\n**Question 1:** Validation\n\nIn this exercise, we will once again train a simple multi-layer perceptron on the *Indian Liver Patient Dataset* (ILPD). Create a learner that:\n\n1. Uses 2 hidden layers with 100 neurons each.\n2. Utilizes a batch size of 128.\n3. Trains for 200 epochs.\n4. Employs a validation set comprising 30% of the data.\n5. Track the validation log-loss.\n6. Utilizes trace-jitting to speed up the training process.\n7. 
Employs the history callback to record the training and validation log-loss during training.\n\nAfterward, plot the validation log-loss, which is accessible via `learner$model$callbacks$history`.\n\nBelow, we create the task and remove the `gender` feature again for simplicity.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(mlr3verse)\nlibrary(mlr3torch)\nilpd_num <- tsk(\"ilpd\")\nilpd_num$select(setdiff(ilpd_num$feature_names, \"gender\"))\nilpd_num\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n (583 x 10): Indian Liver Patient Data\n* Target: diseased\n* Properties: twoclass\n* Features (9):\n - dbl (5): albumin, albumin_globulin_ratio, direct_bilirubin, total_bilirubin, total_protein\n - int (4): age, alanine_transaminase, alkaline_phosphatase, aspartate_transaminase\n```\n\n\n:::\n:::\n\n\n::: {.content-visible when-meta=solutions}\n**Solution**\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(ggplot2)\n\nmlp <- lrn(\"classif.mlp\",\n neurons = c(100, 100),\n batch_size = 128,\n epochs = 200,\n predict_type = \"prob\",\n validate = 0.3,\n jit_trace = TRUE,\n callbacks = t_clbk(\"history\"),\n measures_valid = msr(\"classif.logloss\")\n)\n\nmlp$train(ilpd_num)\nhead(mlp$model$callbacks$history)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n epoch valid.classif.logloss\n \n1: 1 3.373034\n2: 2 5.475234\n3: 3 4.667771\n4: 4 3.047842\n5: 5 1.563049\n6: 6 0.958690\n```\n\n\n:::\n\n```{.r .cell-code}\nggplot(mlp$model$callbacks$history) +\n geom_line(aes(x = epoch, y = valid.classif.logloss)) +\n labs(\n y = \"Log-Loss (Validation)\",\n x = \"Epoch\"\n ) +\n theme_minimal()\n```\n\n::: {.cell-output-display}\n![](6-training-efficiency-exercise-task_files/figure-html/unnamed-chunk-3-1.png){fig-align='center' width=672}\n:::\n:::\n\n:::\n\n**Question 2:** Early Stopping\n\nEnable early stopping to prevent overfitting and re-train the learner (using a patience of 10). Print the final validation performance of the learner and the early stopped results. You can consult the [documentation of `LearnerTorch`](https://mlr3torch.mlr-org.com/reference/mlr_learners_torch.html) on how to access these (see section *Active Bindings*).\n\n
\nHint\nYou can enable early stopping by setting the `patience` parameter.\n
\n\n::: {.content-visible when-meta=solutions}\n**Solution**\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmlp$configure(\n patience = 10\n)\nmlp$train(ilpd_num)\nmlp$internal_tuned_values\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$epochs\n[1] 24\n```\n\n\n:::\n\n```{.r .cell-code}\nmlp$internal_valid_scores\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$classif.logloss\n[1] 0.5598296\n```\n\n\n:::\n:::\n\n:::\n\n**Question 3:** Early Stopping and Dropout Tuning\n\nWhile early stopping in itself is already useful, `mlr3torch` also allows you to simultaneously tune the number of epochs using early stopping while tuning other hyperparameters via traditional hyperparameter tuning from `mlr3tuning`.\n\nOne thing we have not mentioned so far is that the MLP learner also uses a dropout layer.\nThe dropout probability can be configured via the `p` parameter.\n\nYour task is to tune the dropout probability `p` in the range $[0, 1]$ and the epochs using early stopping (using the configuration from the previous exercise) with an upper bound of 100 epochs.\n\nTo adapt this to work with early stopping, you need to set the:\n\n1. `epochs` to `to_tune(upper = , internal = TRUE)`: This tells the `Tuner` that the learner will tune the number of epochs itself.\n2. `$validate` field of the `\"test\"` so the same data is used for tuning and validation.\n3. Tuning `measure` to `msr(\"internal_valid_score\", minimize = TRUE)`. We set `minimize` to `TRUE` because we have used the log-loss as a validation measure.\n\nApart from this, the tuning works just like in tutorial 5. Use 3-fold cross-validation and evaluate 10 configurations using random search.\nFinally, print the optimal configuration.\n\n::: {.content-visible when-meta=solutions}\n**Solution**\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(mlr3torch)\n\nmlp$configure(\n epochs = to_tune(upper = 100, internal = TRUE),\n p = to_tune(lower = 0, upper = 1),\n validate = \"test\"\n)\n\ntuner <- tnr(\"random_search\")\nresampling <- rsmp(\"cv\", folds = 3)\nmeasure <- msr(\"internal_valid_score\", minimize = TRUE)\n\nti <- tune(\n tuner = tuner,\n task = ilpd_num,\n learner = mlp,\n resampling = resampling,\n measure = measure,\n term_evals = 10\n)\n\nti$result_learner_param_vals\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$epochs\n[1] 53\n\n$device\n[1] \"auto\"\n\n$num_threads\n[1] 1\n\n$num_interop_threads\n[1] 1\n\n$seed\n[1] \"random\"\n\n$jit_trace\n[1] TRUE\n\n$eval_freq\n[1] 1\n\n$measures_train\nlist()\n\n$measures_valid\nlist()\n\n$patience\n[1] 0\n\n$min_delta\n[1] 0\n\n$batch_size\n[1] 128\n\n$neurons\n[1] 100 100\n\n$p\n[1] 0.3738756\n\n$activation\n object generator\n Inherits from: \n Public:\n .classes: nn_relu nn_module\n initialize: function (inplace = FALSE) \n forward: function (input) \n clone: function (deep = FALSE, ..., replace_values = TRUE) \n Private:\n .__clone_r6__: function (deep = FALSE) \n Parent env: \n Locked objects: FALSE\n Locked class: FALSE\n Portable: TRUE\n\n$activation_args\nlist()\n```\n\n\n:::\n:::\n\n:::\n\n", "supporting": [ "6-training-efficiency-exercise-task_files" ], diff --git a/_freeze/notebooks/6-training-efficiency-exercise/execute-results/html.json b/_freeze/notebooks/6-training-efficiency-exercise/execute-results/html.json index 91944b2..8265b80 100644 --- a/_freeze/notebooks/6-training-efficiency-exercise/execute-results/html.json +++ b/_freeze/notebooks/6-training-efficiency-exercise/execute-results/html.json @@ -1,8 +1,8 @@ 
{ - "hash": "86df75e7675bf153e8efac089ccfc032", + "hash": "7c5038c0ca81018ec207ba0a4da4ddac", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Training Efficiency\"\n---\n\n\n\n\n\n\n**Question 1:** Validation\n\nIn this exercise, we will once again train a simple multi-layer perceptron on the *Indian Liver Patient Dataset* (ILPD). Create a learner that:\n\n1. Uses 2 hidden layers with 100 neurons each.\n2. Utilizes a batch size of 128.\n3. Trains for 200 epochs.\n4. Employs a validation set comprising 30% of the data.\n5. Tracks the training and validation log-loss during training.\n6. Utilizes trace-jitting to speed up the training process.\n7. Employs the history callback to record the training and validation log-loss during training.\n\nAfterward, plot the validation log-loss, which is accessible via `learner$model$callbacks$history`.\n\nBelow, we create the task and remove the `gender` feature for simplicity.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(mlr3verse)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nLoading required package: mlr3\n```\n\n\n:::\n\n```{.r .cell-code}\nlibrary(mlr3torch)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nLoading required package: mlr3pipelines\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nLoading required package: torch\n```\n\n\n:::\n\n```{.r .cell-code}\nilpd_num <- tsk(\"ilpd\")\nilpd_num$select(setdiff(ilpd_num$feature_names, \"gender\"))\nilpd_num\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n (583 x 10): Indian Liver Patient Data\n* Target: diseased\n* Properties: twoclass\n* Features (9):\n - dbl (5): albumin, albumin_globulin_ratio, direct_bilirubin, total_bilirubin, total_protein\n - int (4): age, alanine_transaminase, alkaline_phosphatase, aspartate_transaminase\n```\n\n\n:::\n:::\n\n\n\n
\nHint\n* To specify the validation set, use the `validate` field, which can either be set during construction or by calling `$configure()`.\n* Trace-jitting can be enabled via the `jit_trace` parameter.\n* The history callback can be constructed via `t_clbk(\"history\")` and needs to be passed during the *construction* of the learner.\n* The validation and measures can be specified via `measures_valid` and take a measure object that is constructed via `msr()`.\n
\n\n::: {.content-visible when-meta=solutions}\n**Solution**\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(ggplot2)\n\nmlp <- lrn(\"classif.mlp\",\n neurons = c(100, 100),\n batch_size = 128,\n epochs = 200,\n predict_type = \"prob\",\n validate = 0.3,\n jit_trace = TRUE,\n callbacks = t_clbk(\"history\"),\n measures_valid = msr(\"classif.logloss\")\n)\n\nmlp$train(ilpd_num)\nhead(mlp$model$callbacks$history)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n epoch valid.classif.logloss\n \n1: 1 3.373034\n2: 2 5.475234\n3: 3 4.667771\n4: 4 3.047842\n5: 5 1.563049\n6: 6 0.958690\n```\n\n\n:::\n\n```{.r .cell-code}\nggplot(mlp$model$callbacks$history) +\n geom_line(aes(x = epoch, y = valid.classif.logloss)) +\n labs(\n y = \"Log-Loss (Validation)\",\n x = \"Epoch\"\n ) +\n theme_minimal()\n```\n\n::: {.cell-output-display}\n![](6-training-efficiency-exercise_files/figure-html/unnamed-chunk-3-1.png){fig-align='center' width=672}\n:::\n:::\n\n\n:::\n\n**Question 2:** Early Stopping\nEnable early stopping to prevent overfitting and re-train the learner (using a patience of 10). Print the final validation performance of the learner and the early stopped results. You can consult the [documentation of `LearnerTorch`](https://mlr3torch.mlr-org.com/reference/mlr_learners_torch.html) on how to access these two results (section *Active Bindings*).\n\n
\nHint\nYou can enable early stopping by setting the `patience` parameter.\n
\n\n::: {.content-visible when-meta=solutions}\n**Solution**\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmlp$configure(\n patience = 10\n)\nmlp$train(ilpd_num)\nmlp$internal_tuned_values\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$epochs\n[1] 24\n```\n\n\n:::\n\n```{.r .cell-code}\nmlp$internal_valid_scores\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$classif.logloss\n[1] 0.5598296\n```\n\n\n:::\n:::\n\n\n:::\n\n**Question 3:** Early Stopping and Dropout Tuning\n\nWhile early stopping in itself is already useful, `mlr3torch` also allows you to simultaneously tune the number of epochs using early stopping while tuning other hyperparameters via traditional hyperparameter tuning from `mlr3tuning`.\n\nOne thing we have not covered so far is that the MLP learner we have used so far also uses a dropout layer. The dropout probability can be configured via the `p` parameter.\n\nYour task is to tune the dropout probability `p` in the range $[0, 1]$ and the epochs using early stopping (using the configuration from the previous exercise).\n\nTo adapt this to work with early stopping, you need to set the:\n\n1. `epochs` to `to_tune(upper = , internal = TRUE)`: This tells the `Tuner` that the learner will tune the number of epochs itself.\n2. `$validate` field of the `\"test\"` so the same data is used for tuning and validation.\n3. Tuning `measure` to `msr(\"internal_valid_score\", minimize = TRUE)`. We set `minimize` to `TRUE` because we have used the log-loss as a validation measure.\n\nApart from this, the tuning works just like in tutorial 5. Use 3-fold cross-validation for the tuning and evaluate 10 configurations.\n\nRun the tuning and print the optimal configuration.\n\n::: {.content-visible when-meta=solutions}\n**Solution**\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(mlr3torch)\n\nmlp$configure(\n epochs = to_tune(upper = 100, internal = TRUE),\n p = to_tune(lower = 0, upper = 1),\n validate = \"test\"\n)\n\ntuner <- tnr(\"random_search\")\nresampling <- rsmp(\"cv\", folds = 3)\nmeasure <- msr(\"internal_valid_score\", minimize = TRUE)\n\nti <- tune(\n tuner = tuner,\n task = ilpd_num,\n learner = mlp,\n resampling = resampling,\n measure = measure,\n term_evals = 10\n)\n\nti$learner_result_param_vals\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nNULL\n```\n\n\n:::\n:::\n\n\n:::\n", + "markdown": "---\ntitle: \"Training Efficiency\"\n---\n\n\n\n\n\n\n**Question 1:** Validation\n\nIn this exercise, we will once again train a simple multi-layer perceptron on the *Indian Liver Patient Dataset* (ILPD). Create a learner that:\n\n1. Uses 2 hidden layers with 100 neurons each.\n2. Utilizes a batch size of 128.\n3. Trains for 200 epochs.\n4. Employs a validation set comprising 30% of the data.\n5. Track the validation log-loss.\n6. Utilizes trace-jitting to speed up the training process.\n7. 
Employs the history callback to record the training and validation log-loss during training.\n\nAfterward, plot the validation log-loss, which is accessible via `learner$model$callbacks$history`.\n\nBelow, we create the task and remove the `gender` feature again for simplicity.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(mlr3verse)\nlibrary(mlr3torch)\nilpd_num <- tsk(\"ilpd\")\nilpd_num$select(setdiff(ilpd_num$feature_names, \"gender\"))\nilpd_num\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n (583 x 10): Indian Liver Patient Data\n* Target: diseased\n* Properties: twoclass\n* Features (9):\n - dbl (5): albumin, albumin_globulin_ratio, direct_bilirubin, total_bilirubin, total_protein\n - int (4): age, alanine_transaminase, alkaline_phosphatase, aspartate_transaminase\n```\n\n\n:::\n:::\n\n\n\n::: {.content-visible when-meta=solutions}\n**Solution**\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(ggplot2)\n\nmlp <- lrn(\"classif.mlp\",\n neurons = c(100, 100),\n batch_size = 128,\n epochs = 200,\n predict_type = \"prob\",\n validate = 0.3,\n jit_trace = TRUE,\n callbacks = t_clbk(\"history\"),\n measures_valid = msr(\"classif.logloss\")\n)\n\nmlp$train(ilpd_num)\nhead(mlp$model$callbacks$history)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n epoch valid.classif.logloss\n \n1: 1 3.373034\n2: 2 5.475234\n3: 3 4.667771\n4: 4 3.047842\n5: 5 1.563049\n6: 6 0.958690\n```\n\n\n:::\n\n```{.r .cell-code}\nggplot(mlp$model$callbacks$history) +\n geom_line(aes(x = epoch, y = valid.classif.logloss)) +\n labs(\n y = \"Log-Loss (Validation)\",\n x = \"Epoch\"\n ) +\n theme_minimal()\n```\n\n::: {.cell-output-display}\n![](6-training-efficiency-exercise_files/figure-html/unnamed-chunk-3-1.png){fig-align='center' width=672}\n:::\n:::\n\n\n:::\n\n**Question 2:** Early Stopping\n\nEnable early stopping to prevent overfitting and re-train the learner (using a patience of 10). Print the final validation performance of the learner and the early stopped results. You can consult the [documentation of `LearnerTorch`](https://mlr3torch.mlr-org.com/reference/mlr_learners_torch.html) on how to access these (see section *Active Bindings*).\n\n
\nHint\nYou can enable early stopping by setting the `patience` parameter.\n
\n\n::: {.content-visible when-meta=solutions}\n**Solution**\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmlp$configure(\n patience = 10\n)\nmlp$train(ilpd_num)\nmlp$internal_tuned_values\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$epochs\n[1] 24\n```\n\n\n:::\n\n```{.r .cell-code}\nmlp$internal_valid_scores\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$classif.logloss\n[1] 0.5598296\n```\n\n\n:::\n:::\n\n\n:::\n\n**Question 3:** Early Stopping and Dropout Tuning\n\nWhile early stopping in itself is already useful, `mlr3torch` also allows you to simultaneously tune the number of epochs using early stopping while tuning other hyperparameters via traditional hyperparameter tuning from `mlr3tuning`.\n\nOne thing we have not mentioned so far is that the MLP learner also uses a dropout layer.\nThe dropout probability can be configured via the `p` parameter.\n\nYour task is to tune the dropout probability `p` in the range $[0, 1]$ and the epochs using early stopping (using the configuration from the previous exercise) with an upper bound of 100 epochs.\n\nTo adapt this to work with early stopping, you need to set the:\n\n1. `epochs` to `to_tune(upper = , internal = TRUE)`: This tells the `Tuner` that the learner will tune the number of epochs itself.\n2. `$validate` field of the `\"test\"` so the same data is used for tuning and validation.\n3. Tuning `measure` to `msr(\"internal_valid_score\", minimize = TRUE)`. We set `minimize` to `TRUE` because we have used the log-loss as a validation measure.\n\nApart from this, the tuning works just like in tutorial 5. Use 3-fold cross-validation and evaluate 10 configurations using random search.\nFinally, print the optimal configuration.\n\n::: {.content-visible when-meta=solutions}\n**Solution**\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(mlr3torch)\n\nmlp$configure(\n epochs = to_tune(upper = 100, internal = TRUE),\n p = to_tune(lower = 0, upper = 1),\n validate = \"test\"\n)\n\ntuner <- tnr(\"random_search\")\nresampling <- rsmp(\"cv\", folds = 3)\nmeasure <- msr(\"internal_valid_score\", minimize = TRUE)\n\nti <- tune(\n tuner = tuner,\n task = ilpd_num,\n learner = mlp,\n resampling = resampling,\n measure = measure,\n term_evals = 10\n)\n\nti$result_learner_param_vals\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$epochs\n[1] 53\n\n$device\n[1] \"auto\"\n\n$num_threads\n[1] 1\n\n$num_interop_threads\n[1] 1\n\n$seed\n[1] \"random\"\n\n$jit_trace\n[1] TRUE\n\n$eval_freq\n[1] 1\n\n$measures_train\nlist()\n\n$measures_valid\nlist()\n\n$patience\n[1] 0\n\n$min_delta\n[1] 0\n\n$batch_size\n[1] 128\n\n$neurons\n[1] 100 100\n\n$p\n[1] 0.3738756\n\n$activation\n object generator\n Inherits from: \n Public:\n .classes: nn_relu nn_module\n initialize: function (inplace = FALSE) \n forward: function (input) \n clone: function (deep = FALSE, ..., replace_values = TRUE) \n Private:\n .__clone_r6__: function (deep = FALSE) \n Parent env: \n Locked objects: FALSE\n Locked class: FALSE\n Portable: TRUE\n\n$activation_args\nlist()\n```\n\n\n:::\n:::\n\n\n:::\n", "supporting": [ "6-training-efficiency-exercise_files" ], diff --git a/_freeze/notebooks/6-training-efficiency/execute-results/html.json b/_freeze/notebooks/6-training-efficiency/execute-results/html.json index a18dc80..c4713fc 100644 --- a/_freeze/notebooks/6-training-efficiency/execute-results/html.json +++ b/_freeze/notebooks/6-training-efficiency/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": 
"9e10fc19f004b9f389fb22b417272fb8", + "hash": "90bea0fb54de0a892a773e9e42492b1f", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Training Efficiency\"\n---\n\n\n\n\n\n\nMethods for increasing training efficiency can be roughly split into:\n\n1. Computational methods such as JIT compilation, using GPU, parallel data loading, etc., that allow doing the same thing **faster**.\n2. Methodological approaches that change how we approach modeling to achieve either better results or faster training.\n\n# Computational Approaches\n\n## Parallel Processing\n\n### Graphical Processing Unit (GPU)\n\nUsing a GPU is crucial when training relatively large neural networks because GPUs are specifically designed to handle the parallel processing required for complex computations.\nTo use a GPU in mlr3torch, we can set the device parameter to \"cuda\". By default, it is set to \"auto\", which will use a GPU if it is available and otherwise fall back to the CPU.\n\n:::{.callout-tip}\nTo check if a GPU is available, we can use the `torch::cuda_is_available()` function.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(torch)\ncuda_is_available()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] FALSE\n```\n\n\n:::\n:::\n\n\n\nIf you have an M1 Mac (or later), you can also use the available graphics card by setting the `device` parameter to `\"mps\"`.\nYou can check this by running:\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbackends_mps_is_available()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n:::\n\n\n:::\n\nTo demonstrate the speed improvements obtained by using a GPU, we conduct a large matrix operation on a GPU and a CPU.\nWe start by randomly sampling a matrix of size 1000x1000.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nx_cpu = torch_randn(1000, 1000, device = \"cpu\")\n```\n:::\n\n\n\nBelow, we perform a matrix multiplication on the CPU and the GPU and compare the timings.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# this will only run if a GPU is available\nx_cuda = x_cpu$cuda()\n\nbench::mark(\n cpu = x_cpu$matmul(x_cpu),\n cuda = x_cuda$matmul(x_cuda)\n)\n```\n:::\n\n\n\n### CPU Threads\n\nTraining large networks on a CPU is not a recommended approach, but it can be useful for smaller networks or when you don't have a GPU.\nYou can still use multiple threads to speed up the execution of operations.\nNote that the code below will not run on macOS, as it is not possible to set the number of threads on macOS.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# this will be skipped on macOS\nbench::mark(\n {torch_set_num_threads(1L); x_cpu$matmul(x_cpu)},\n {torch_set_num_threads(16L); x_cpu$matmul(x_cpu)}\n)\n```\n:::\n\n\n\n`torch` also allows for interop-parallelization, but this is more advanced and code needs to be written in a specific way.\n\n:::{.callout-note}\n## Quiz: Number of Threads\n\nQuestion 1: On a CPU with 4 cores, does it make sense to set the number of threads to values greater than 4? Explain your answer.\n\n
\nClick for answer\nOn a CPU with 4 cores, at most 4 threads can run in parallel.\nUsing more threads than the number of cores will not speed up the execution of operations.\n
\n\nQuestion 2: On a CPU with 64 cores, is it always the case that using 64 threads is better than using 32 threads?\n\n
\nClick for answer\nNot necessarily. Using more threads will mean that:\n\n1. The threads need to communicate and synchronize, which increases the runtime.\n2. More resources are used for the computation, which decreases the runtime.\n\nThe optimal number of threads is a trade-off between these two effects.\n
\n:::\n\n## Efficient Data Loading\n\nBesides speeding up the computation of operations in the forward and backward pass, another possible bottleneck is the loading of data.\nThere are various ways to improve data loading speed:\n\n1. Improve the implementation of the `dataset` class\n2. Parallelize the data loading process\n3. Move data to the GPU\n\nThese approaches will now be discussed.\n\n### Efficient Dataset Implementation\n\nWhen implementing a dataset, we need to define:\n\n1. How we store and load the data\n2. Whether implementing loading of a batch is beneficial\n\n:::{.callout-note}\n## Quiz: Data Loading\n\nThe *tiny imagenet* dataset is a dataset of 100,000 images of size 64x64x3.\nIt is a subset of the famous *imagenet* dataset.\nBelow, we show some examples from the dataset:\n\n![](../assets/tiny-imagenet.png)\n\n\n\n\n\n\n\nWe will now consider different ways to write a `torch::dataset` implementation for this data.\nAssume we have some image paths stored in a character vector as well as in an array where they are already loaded into memory.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nstr(image_paths)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n chr [1:100] \"/Users/sebi/Library/Caches/org.R-project.R/R/mlr3torch/datasets/tiny_imagenet/raw/tiny-imagenet-200/train/n0144\"| __truncated__ ...\n```\n\n\n:::\n\n```{.r .cell-code}\nstr(image_array)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n num [1:100, 1:3, 1:64, 1:64] 1 0.0784 0.4706 0.5608 0.5647 ...\n```\n\n\n:::\n:::\n\n\n\nAn individual image can, for example, be loaded using the `torchvision::base_loader()` function:\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(torchvision)\nstr(base_loader(image_paths[1]))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n num [1:64, 1:64, 1:3] 1 1 1 1 1 ...\n```\n\n\n:::\n:::\n\n\n\n**Question 1:** Reading From Disk or RAM\n\nWhich of the following is the faster way to load the images? Explain why.\n\n1. Loading the images from disk:\n\n\n\n ::: {.cell layout-align=\"center\"}\n \n ```{.r .cell-code}\n ds_disk = dataset(\"image_paths\",\n initialize = function(image_paths) {\n self$image_paths = image_paths\n },\n .getitem = function(i) {\n torch_tensor(torchvision::base_loader(self$image_paths[i]))\n },\n .length = function() {\n length(self$image_paths)\n }\n )(image_paths)\n ```\n :::\n\n\n\n2. Loading the images from an array:\n\n\n\n ::: {.cell layout-align=\"center\"}\n \n ```{.r .cell-code}\n ds_ram = dataset(\"image_array\",\n initialize = function(image_array) {\n self$image_array = image_array\n },\n .getbatch = function(i) {\n torch_tensor(self$image_array[i, , , ])\n },\n .length = function() {\n nrow(self$image_array)\n }\n )(image_array)\n ```\n :::\n\n\n\n
\nClick for answer\n\nGenerally, loading images from RAM is significantly faster than loading them from disk.\nAlthough the benchmark presented below may seem somewhat 'unfair' since `ds_ram` has already loaded the images into memory, this difference is evident in practice.\nWhen iterating over the dataset for multiple epochs, the first method will need to reload the images from disk for each epoch, while the second method only requires a single loading of the images into memory.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\niter = function(ds, ..., epochs = 1) {\n dl = torch::dataloader(ds, batch_size = 16, ...)\n for (epoch in seq_len(epochs)) {\n coro::loop(for(batch in dl) {\n batch\n })\n }\n}\nbench::mark(\n disk = iter(ds_disk),\n ram = iter(ds_ram),\n check = FALSE\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 6\n expression min median `itr/sec` mem_alloc `gc/sec`\n \n1 disk 18ms 20.01ms 47.6 14MB 14.0\n2 ram 8.4ms 9.06ms 110. 9.4MB 26.0\n```\n\n\n:::\n:::\n\n\n\n
\n\n**Question 2:** (Don't) Copy that\n\nConsider now the next dataset implementation:\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nds_tensor = dataset(\"tensor\",\n initialize = function(image_array) {\n self$tensor = torch_tensor(image_array)\n },\n .getitem = function(i) {\n self$tensor[i, ..]\n },\n .length = function() {\n nrow(self$tensor)\n }\n)(image_array)\n```\n:::\n\n\n\nDo you think this implementation is faster or slower than the `ds_ram` implementation? Explain why.\n\n
\nClick for answer\nThis implementation is faster than the `ds_ram` implementation.\nThis is because the `ds_tensor` implementation copies the R array to a torch tensor only once, whereas the `ds_ram` implementation copies the R array to a torch tensor for each item.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbench::mark(\n tensor = iter(ds_tensor),\n array = iter(ds_ram),\n check = FALSE\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 6\n expression min median `itr/sec` mem_alloc `gc/sec`\n \n1 tensor 4.62ms 5.06ms 196. 96.08KB 6.77\n2 array 8.03ms 9.28ms 107. 9.38MB 27.6 \n```\n\n\n:::\n:::\n\n\n\n
\n\n**Question 3**: `$.getbatch()` vs `$.getitem()`\n\nWhich implementation is faster? Explain why.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nds_tensor_batch = dataset(\"tensor_batch\",\n initialize = function(image_array) {\n self$tensor = torch_tensor(image_array)\n },\n .getbatch = function(i) {\n self$tensor[i, ..]\n },\n .length = function() {\n nrow(self$tensor)\n }\n)(image_array)\n```\n:::\n\n\n\n
\nClick for answer\nThe `$.getbatch()` implementation is faster than the `$.getitem()` implementation.\nThis is because when using the `$.getitem()` method, the batch for indices `ids` is obtained by calling `$.getitem(id)` for each index in `ids` and then stacking them together, which requires a new tensor allocation.\nSlicing the tensor, however, avoids this allocation when `shuffle = TRUE` (which is also the default).\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbench::mark(\n getbatch = iter(ds_tensor_batch),\n getitem = iter(ds_tensor),\n check = FALSE\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 6\n expression min median `itr/sec` mem_alloc `gc/sec`\n \n1 getbatch 1.69ms 1.99ms 466. 3.83KB 4.25\n2 getitem 4.48ms 4.85ms 204. 54.69KB 7.19\n```\n\n\n:::\n:::\n\n\n
\n:::\n\n### Parallel Data Loading\n\nIn Deep Learning, datasets can be very large, and it might therefore be the case that the data is simply too large to fit into memory.\nIn this case, we can use parallel data loading to speed up the data loading process.\nInstead of loading the data sequentially in the main process, other R processes will be started that execute the data loading.\nFor example, if we set `num_workers = 4L`, 4 R processes will be started that load the data, while the main process is free to train the model.\nThese processes then send the batches to the main process.\nThe image below visualizes this process:\n\n![](../assets/parallel-dataloader.png)\n\nCreating such a parallel dataloader is as easy as setting the `num_workers` parameter to a value greater than 0.\n\n:::{.callout-note}\nNote that there is some communication overhead that results from sending the batches from the worker to the main process.\nThis will hopefully be reduced in the future, but is currently there.\nFor this reason, parallel data loading is therefore -- currently -- only beneficial when it is slow, e.g., because of loading the data from disk or because of expensive preprocessing.\n:::\n\n### Moving Data to the GPU\n\nOne thing we have ignored so far is that when training using a GPU, the data needs to be moved to the GPU.\nThis is because a GPU has its own memory (VRAM), and the data needs to be moved to this memory before it can be used for training.\nThe moving of the data to the GPU cannot be done on the processes that are loading the data but must be done in the main process, i.e., after the batch was received from (possibly parallelized) dataloader.\nOne way to speed up the data loading process is to pin the memory of the data to the GPU.\nBefore a tensor can be moved from RAM to VRAM, it needs to be in so-called page-locked memory, which can be done using the `pin_memory` parameter.\n\n![](../assets/pinned-memory.png)\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\niter_cuda = function(ds, pin_memory = TRUE) {\n dl = torch::dataloader(ds, batch_size = 16, pin_memory = pin_memory)\n coro::loop(for(batch in dl) {\n batch$cuda()\n })\n}\n\nbench::mark(\n not_pinned = iter_cuda(ds_disk, pin_memory = FALSE),\n pinned = iter_cuda(ds_disk, pin_memory = TRUE)\n)\n```\n:::\n\n\n\n:::{.callout-note}\n\nIn order to use parallel data loading or memory pinning with `mlr3torch`, these parameters can simply be specified in the learner:\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlrn(\"classif.mlp\", num_workers = 8L, pin_memory = TRUE, device = \"cuda\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n: My Little Powny\n* Model: -\n* Parameters: device=cuda, num_threads=1, num_interop_threads=1, seed=random, jit_trace=FALSE, eval_freq=1,\n measures_train=, measures_valid=, patience=0, min_delta=0, num_workers=8, pin_memory=TRUE,\n neurons=integer(0), p=0.5, activation=, activation_args=\n* Validate: NULL\n* Packages: mlr3, mlr3torch, torch\n* Predict Types: [response], prob\n* Feature Types: integer, numeric, lazy_tensor\n* Properties: internal_tuning, marshal, multiclass, twoclass, validation\n* Optimizer: adam\n* Loss: cross_entropy\n* Callbacks: -\n```\n\n\n:::\n:::\n\n\n:::\n\n## JIT Compilation & Ignite Optimizers\n\nSome special care needs to be taken when using `torch` (or `mlr3torch`) in order to get good performance.\nIn the future, this will hopefully not be necessary anymore, but is currently required.\n\n### 'Ignite' Optimizers\n\nIn `torch`, 
different versions of optimizers exist:\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\noptim_adamw\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n object generator\n Inherits from: \n Public:\n initialize: function (params, lr = 0.001, betas = c(0.9, 0.999), eps = 1e-08, \n loop_fun: function (group, param, g, p) \n step: function (closure = NULL) \n clone: function (deep = FALSE) \n Parent env: \n Locked objects: FALSE\n Locked class: FALSE\n Portable: TRUE\n```\n\n\n:::\n\n```{.r .cell-code}\noptim_ignite_adamw\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n object generator\n object generator\n Inherits from: \n Public:\n initialize: function (params, lr = 0.001, betas = c(0.9, 0.999), eps = 1e-08, \n clone: function (deep = FALSE) \n Private:\n .config_names: lr betas eps weight_decay amsgrad\n .state_names: exp_avg exp_avg_sq max_exp_avg_sq step\n .optim: function (params, ...) \n .get_states: function (opt) \n .set_states: function (opt, params, states) \n .add_param_group: function (opt, params, lr, betas, eps, weight_decay, amsgrad) \n .assert_params: function (lr, betas, eps, weight_decay, amsgrad) \n .set_param_group_options: function (opt, list) \n .zero_grad: function (opt) \n .get_param_groups: function (ptr) \n Parent env: \n Locked objects: FALSE\n Locked class: FALSE\n Portable: TRUE\n```\n\n\n:::\n:::\n\n\n\nThe 'ignite' indicates that the optimizer is a version that is optimized for performance.\nNot for all optimizers does an ignite version exist, but for the most common ones, it does.\n\nBelow, we compare the performance of the default optimizer and the ignite optimizer and see that the latter is considerably faster.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nadamw = as_torch_optimizer(torch::optim_adamw)\nignite_adamw = as_torch_optimizer(torch::optim_ignite_adamw)\n\nlearner = lrn(\"classif.mlp\", epochs = 10, neurons = c(100, 100), batch_size = 32, optimizer = adamw)\n\nlearner_ignite = learner$clone(deep = TRUE)\nlearner_ignite$configure(\n optimizer = ignite_adamw\n)\ntask_sonar = tsk(\"sonar\")\n\nbench::mark(\n learner$train(task_sonar),\n learner_ignite$train(task_sonar),\n check = FALSE\n)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning: Some expressions had a GC in every iteration; so filtering is disabled.\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 6\n expression min median `itr/sec` mem_alloc `gc/sec`\n \n1 learner$train(task_sonar) 667ms 667ms 1.50 15.7MB 4.49\n2 learner_ignite$train(task_sonar) 202ms 211ms 4.78 10.7MB 4.78\n```\n\n\n:::\n:::\n\n\n\n### JIT Compilation\n\nJIT (Just-In-Time) compilation is a runtime optimization technique that compiles code into machine code during execution rather than beforehand.\nThis has different advantages:\n\n1. By JIT-compiling a model, some operations can be optimized for performance.\n2. A JIT-compiled model can be saved and executed without an R dependency for deployment (only LibTorch is required), e.g., in a C++ application.\n3. 
Running a JIT-compiled model in R is faster because the whole network is executed in C++ instead of R.\n\nIn `torch`, this can either be done using TorchScript or by tracing a model.\nWe will briefly discuss both approaches, but for more information, see the [torch documentation](https://torch.mlverse.org/docs/articles/torchscript).\n\n#### TorchScript\n\nTorchScript is a subset of Python -- i.e., its own programming language -- that can be used to define compiled functions.\nIn R, this is available via the [`jit_compile`](https://torch.mlverse.org/docs/reference/jit_compile.html) function.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nf = jit_compile(\"\ndef f(x, w, bias):\n return x @ w + bias\n\")$f\n\nx = torch_randn(10, 10)\nw = torch_randn(10, 1)\nbias = torch_randn(1)\n\nout = f(x, w, bias)\nstr(out)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nFloat [1:10, 1:1]\n```\n\n\n:::\n:::\n\n\n\nBesides syntax, there are some important differences between TorchScript and R:\n\n1. In TorchScript, indexing tensors is 0-based, and\n2. TorchScript is statically typed, so you need to specify the types of the arguments, unless they are tensors, which is the default.\n\nBelow, we define a function that takes a list of tensors and calculates their sum.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsum_jit = jit_compile(\"\ndef sum_jit(xs: List[Tensor]):\n output = torch.zeros_like(xs[0])\n for x in xs:\n output = output + x\n return output\n\")$sum_jit\n\nsum_jit(list(torch_randn(1), torch_randn(1)))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntorch_tensor\n-0.7121\n[ CPUFloatType{1} ]\n```\n\n\n:::\n:::\n\n\n\n#### Tracing\n\nThe alternative to writing TorchScript is to write your module in R and to use [`jit_trace`](https://torch.mlverse.org/docs/reference/jit_trace_module.html) to compile it.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nf2 = function(x, w, bias) {\n x$matmul(w) + bias\n}\n# need to provide some example input\n# arguments are passed by position\nf2 = jit_trace(f2, torch_randn(10, 10), torch_randn(10, 100), torch_randn(100))\nout2 = f2(x, w, bias)\ntorch_equal(out, out2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n:::\n\n\n\nAn advantage of trace-compilation is that it even allows you to JIT-compile modules, which is currently not possible with `jit_compile`.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nnet = nn_sequential(\n nn_linear(10, 100),\n nn_relu(),\n nn_linear(100, 10)\n)\nnet_jit = jit_trace(net, torch_randn(10, 10))\n\ntorch_equal(net(x), net_jit(x))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n:::\n\n\n\nTrace-compilation is restrictive because it only records operations applied to torch tensors and is unaware of R control flow, so you need to be careful when using it.\nFurthermore, it only accepts torch tensors as arguments.\nUnless you have dynamic inputs and outputs or modify the configuration of the module, trace-compilation should usually work.\nYou can also check this by running the original and trace-jitted module on some example inputs and see if they return the same result.\n\n:::{.callout-note}\nA trace-jitted module *does* respect the mode of the network, i.e., whether it is training or evaluating.\n:::\n\nIn `mlr3torch`, trace compilation is also available and can be enabled by setting `jit_trace = TRUE` in the learner.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r 
.cell-code}\nlearner = lrn(\"classif.mlp\", jit_trace = TRUE)\n```\n:::\n\n\n\nYou can also combine TorchScript with tracing:\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nnet_both = nn_module(\n initialize = function() {\n self$linear = nn_linear(1, 1)\n },\n forward = function(x) {\n self$linear(sum_jit(x))\n }\n)()\n\nnet_both(list(torch_randn(1), torch_randn(1)))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntorch_tensor\n 1.0027\n[ CPUFloatType{1} ][ grad_fn = ]\n```\n\n\n:::\n\n```{.r .cell-code}\nnet_both(list(torch_randn(1)))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntorch_tensor\n0.01 *\n 8.5286\n[ CPUFloatType{1} ][ grad_fn = ]\n```\n\n\n:::\n:::\n\n\n\n:::{.callout-note}\n## Quiz: Just In Time\n\n**Question 1**: Consider the trace-jitted function below. Can you predict the output of the last two lines? Can you explain why this happens?\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nf = function(a, b, multiply) {\n if (multiply$item()) {\n a * b\n } else {\n a + b\n }\n}\nfjit = jit_trace(f, torch_tensor(1), torch_tensor(2), torch_tensor(TRUE))\n\nfjit(torch_tensor(2), torch_tensor(3), torch_tensor(TRUE))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntorch_tensor\n 6\n[ CPUFloatType{1} ]\n```\n\n\n:::\n\n```{.r .cell-code}\nfjit(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntorch_tensor\n 6\n[ CPUFloatType{1} ]\n```\n\n\n:::\n:::\n\n\n\n**Question 2**: Answer the same question for the following function:\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nf = function(a, b, multiply) {\n torch_where(multiply, a * b, a + b)\n}\nfjit = jit_trace(f, torch_tensor(1), torch_tensor(2), torch_tensor(TRUE))\n\nfjit(torch_tensor(2), torch_tensor(3), torch_tensor(TRUE))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntorch_tensor\n 6\n[ CPUFloatType{1} ]\n```\n\n\n:::\n\n```{.r .cell-code}\nfjit(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntorch_tensor\n 5\n[ CPUFloatType{1} ]\n```\n\n\n:::\n:::\n\n\n:::\n\n### Mixed Precision Training\n\nAnother way to speed up the training process is to use mixed precision training.\nThis technique involves training the model using both 16-bit and 32-bit floating point numbers.\nThis allows reducing the memory footprint of the model and speeding up the training process.\n\nWe won't cover this here, but refer to the [torch documentation](https://torch.mlverse.org/docs/articles/amp) that explains how to do this.\n\n## Methodological Approaches\n\n### Validation and Early Stopping\n\nFor more details on this topic, see the [corresponding chapter](https://mlr3book.mlr-org.com/chapters/chapter15/predsets_valid_inttune.html) in the `mlr3` book.\n\nAs we have already seen in one of the previous notebooks, in deep learning, some part of the data is often used for validation purposes.\nThis allows monitoring the performance of the model on unseen data.\n\nIn `mlr3torch`, we can track the performance of the model on a validation set by specifying:\n\n* `validate`, which is the ratio of the data that is used for validation\n* `measures_valid`, which is a list of measures to use for validation\n* `eval_freq`, which is the frequency at which the validation is performed\n* `callbacks`, which is a list of callbacks to use during training, in this case, we use the `history` callback, which records the performance of the model on the validation set at 
regular intervals, enabling us to monitor and analyze the model's performance over time.\n\n:::{.callout-tip}\nWhile `mlr3torch` comes with predefined callbacks, it is also possible to define custom callbacks that modify the training process.\n:::\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntask = tsk(\"sonar\")\n\nmlp_learner = lrn(\"classif.mlp\",\n neurons = c(50, 50), batch_size = 256, epochs = 400,\n optimizer = t_opt(\"adam\", lr = 0.003),\n predict_type = \"prob\", jit_trace = TRUE,\n # Validation / Performance Monitoring\n validate = 0.3, # how much data to use for validation\n measures_valid = msr(\"classif.logloss\"), # how to evaluate train performance\n measures_train = msr(\"classif.logloss\"), # how to evaluate validation performance\n callbacks = t_clbk(\"history\"), # history callbacks save train and validation performance\n eval_freq = 10 # after how many training epochs to perform validation\n)\nmlp_learner$train(task)\nhistory = mlp_learner$model$callbacks$history\nstr(history)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nClasses 'data.table' and 'data.frame':\t40 obs. of 3 variables:\n $ epoch : num 10 20 30 40 50 60 70 80 90 100 ...\n $ train.classif.logloss: num 0.678 0.643 0.569 0.515 0.478 ...\n $ valid.classif.logloss: num 0.667 0.618 0.536 0.469 0.436 ...\n - attr(*, \".internal.selfref\")= \n - attr(*, \"sorted\")= chr \"epoch\"\n```\n\n\n:::\n\n```{.r .cell-code}\nhead(history)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nKey: \n epoch train.classif.logloss valid.classif.logloss\n \n1: 10 0.6775741 0.6665855\n2: 20 0.6430574 0.6176948\n3: 30 0.5685190 0.5364953\n4: 40 0.5151559 0.4694589\n5: 50 0.4780497 0.4363074\n6: 60 0.3861667 0.4153698\n```\n\n\n:::\n:::\n\n\n\nBelow we plot the training and validation for the different epochs:\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](6-training-efficiency_files/figure-html/unnamed-chunk-30-1.png){fig-align='center' width=672}\n:::\n:::\n\n\n\nInstead of only monitoring the validation loss (and watching it get worse and worse), we can also stop the training process dynamically when the validation loss begins to increase.\nThis regularization technique is called early stopping, and it prevents overfitting during the training of iteratively trained machine learning models.\nIt involves monitoring the validation loss during training and stopping the training process when the validation loss begins to increase, indicating that the model is starting to overfit the training data.\n\nThe key configuration option for early stopping is the `patience` parameter, which defines the number of epochs to wait after the last improvement in validation loss before stopping the training. For example, if patience is set to 10, the training will continue for 10 additional epochs after the last observed improvement in validation loss. 
If no improvement is seen during this period, training will be halted.\n\nAdvantages of early stopping include:\n\n- **Prevention of Overfitting**: By stopping training when the model starts to overfit, we can achieve better generalization on unseen data.\n- **Resource Efficiency**: It saves computational resources by avoiding unnecessary training epochs once the model performance has plateaued.\n\nNow, let's train the learner again using early stopping with a patience of 10 epochs:\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmlp_learner$param_set$set_values(\n patience = 5\n)\nmlp_learner$train(task)\nmlp_learner$internal_tuned_values$epochs\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 160\n```\n\n\n:::\n:::\n\n\n\nBeyond only tuning the number of epochs, `mlr3`'s internal tuning mechanism also allows tuning the number of epochs internally while using an offline tuning method to optimize other hyperparameters.\nTo use this, we can set the parameters we want to tune `TuneTokens`:\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(mlr3tuning)\nmlp_learner$param_set$set_values(\n epochs = to_tune(upper = 100, internal = TRUE),\n opt.lr = to_tune(lower = 1e-4, upper = 1e-1, logscale = TRUE)\n)\n```\n:::\n\n\n\nWe could now pass this learner to a tuner, where the tuner would only optimize the learning rate, while the learner optimizes the epochs internally.\n\n## Architecture Design\n\nAnother essential aspect of training neural networks efficiently and effectively is the design of the network architecture, which can be a challenging task.\nHowever, for many tasks, there are well-known architectures that perform well and can be used as a starting point.\nUnless there is a specific reason to design a new architecture, it is recommended to use such an architecture.\n\n:::{.callout-note}\nBecause the Python deep learning ecosystem is so large, many more architectures are implemented in Python than in R.\nOne way to use them in R is to simply translate the PyTorch code to (R-)torch.\nWhile PyTorch and (R-)torch are quite similar, there are some differences, e.g., 1-based and 0-based indexing.\nThe `torch` website contains a [brief tutorial](https://torch.mlverse.org/docs/articles/python-to-r) on how to do this.\n:::\n\nNonetheless, we will cover important techniques that can be used to speed up the training process, namely *batch normalization* and *dropout*.\n\n### Batch Normalization\n\nBatch Normalization is an important technique in deep learning that contributed significantly to speeding up the training process.\n\nThe formula for batch normalization (during training) is given by:\n\n$$\n\\hat{x} = \\frac{x - \\mu_B}{\\sqrt{\\sigma_B^2 + \\epsilon}}\n$$\n\nwhere:\n\n- $\\hat{x}$ is the normalized output,\n- $x$ is the input,\n- $\\mu_B$ is the mean of the batch,\n- $\\sigma_B^2$ is the variance of the batch,\n- $\\epsilon$ is a small constant added for numerical stability.\n\nDuring inference, the module uses the running mean and variance of the training data to normalize the input.\n\nIn `torch`, different versions of batch normalization exist for different dimensions of the input tensor.\nBelow, we illustrate the batch normalization module using a 1D input tensor (the batch dimension does not count here)\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nx = torch_randn(10, 5)\nbn = nn_batch_norm1d(num_features = 5)\nbn(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntorch_tensor\n 1.4613 -1.3934 -0.2146 
1.0406 0.1413\n-0.9634 -0.3388 1.7441 0.7744 2.1476\n-2.0328 0.5667 -2.0592 0.4071 -0.0529\n 0.6778 0.3264 0.2637 -0.2301 -0.0409\n-0.9243 0.1298 -0.6447 -1.5477 -2.1935\n 0.8150 -0.1962 0.7988 -1.5426 0.1137\n-0.2350 -2.0121 -0.1847 1.1725 0.0143\n 0.8381 0.6141 0.9971 1.0148 -0.5667\n 0.2166 0.7147 -0.7208 -0.1408 -0.0285\n 0.1467 1.5887 0.0203 -0.9482 0.4657\n[ CPUFloatType{10,5} ][ grad_fn = ]\n```\n\n\n:::\n:::\n\n\n\n:::{.callout-note}\n## Quiz: Batch Normalization\n\n**Question 1**: Earlier we have learned that `nn_module`s have buffers and parameters, where the latter are learned with gradient descent.\nDo you think the mean and variance are parameters or buffers?\n\n
\nClick for answer\nThey are both buffers as they only store the variance and running mean of all training samples seen, i.e., they are not updated using gradient information.\n
\n\n**Question 2**: Training vs. Evaluation Mode:\nWhile many `nn_module`s behave the same way irrespective of their mode, batch normalization is an example of a module that behaves differently during training and evaluation, i.e.,\nduring training, the module uses the mean and variance of the current batch, while during evaluation, it uses the running mean and variance of all training samples seen.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbn(x[1:10, ])\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntorch_tensor\n 1.4613 -1.3934 -0.2146 1.0406 0.1413\n-0.9634 -0.3388 1.7441 0.7744 2.1476\n-2.0328 0.5667 -2.0592 0.4071 -0.0529\n 0.6778 0.3264 0.2637 -0.2301 -0.0409\n-0.9243 0.1298 -0.6447 -1.5477 -2.1935\n 0.8150 -0.1962 0.7988 -1.5426 0.1137\n-0.2350 -2.0121 -0.1847 1.1725 0.0143\n 0.8381 0.6141 0.9971 1.0148 -0.5667\n 0.2166 0.7147 -0.7208 -0.1408 -0.0285\n 0.1467 1.5887 0.0203 -0.9482 0.4657\n[ CPUFloatType{10,5} ][ grad_fn = ]\n```\n\n\n:::\n:::\n\n\n\nWhich of the following statements is true and why?\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbn$eval()\nequal1 = torch_equal(\n torch_cat(list(bn(x[1:2, ]), bn(x[3:4, ]))),\n bn(x[1:4, ])\n)\nbn$train()\nequal2 = torch_equal(\n torch_cat(list(bn(x[1:2, ]), bn(x[3:4, ]))),\n bn(x[1:4, ])\n)\n```\n:::\n\n\n\n
\nClick for answer\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nc(equal1, equal2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE FALSE\n```\n\n\n:::\n:::\n\n\n\nThe first statement is true because, in evaluation mode, the module uses the running mean and variance of all training samples seen.\nThe second statement is false because the first tensor uses different means and variances for rows 1-2 and 3-4, while the second tensor uses the same mean and variance for all rows.\n
\n:::\n\nTo illustrate its effectiveness, we will define a simple CNN, with and without batch normalization, train it on CIFAR-10, and compare their performance.\n\nTo build the neural networks, we will use `mlr3torch`, which allows building architectures from `PipeOp`s.\nThis makes the creation of network architectures easier, as we, e.g., don't have to specify auxiliary parameters (such as the input dimension of a linear layer).\nRecall that the `po(\"torch_ingress_ltnsr\")` is a special `PipeOp` that marks the input of the neural network.\nNote that `po(\"nn_relu_1\")` is equivalent to `po(\"nn_relu\", id = \"nn_relu_1\")`.\nWe need to specify unique ID parameters as this is required in `mlr3pipelines`.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ncnn_bn = po(\"torch_ingress_ltnsr\") %>>%\n po(\"nn_conv2d_1\", out_channels = 32, kernel_size = 3, stride = 1, padding = 1) %>>%\n po(\"nn_batch_norm2d_1\") %>>%\n po(\"nn_relu_1\") %>>%\n po(\"nn_max_pool2d_1\", kernel_size = 2, stride = 2) %>>%\n po(\"nn_conv2d_2\", out_channels = 64, kernel_size = 3, stride = 1, padding = 1) %>>%\n po(\"nn_batch_norm2d_2\") %>>%\n po(\"nn_relu_2\") %>>%\n po(\"nn_max_pool2d_2\", kernel_size = 2, stride = 2)\n\ncnn = po(\"torch_ingress_ltnsr\") %>>%\n po(\"nn_conv2d_1\", out_channels = 32, kernel_size = 3, stride = 1, padding = 1) %>>%\n po(\"nn_relu_1\") %>>%\n po(\"nn_max_pool2d_1\", kernel_size = 2, stride = 2) %>>%\n po(\"nn_conv2d\", out_channels = 64, kernel_size = 3, stride = 1, padding = 1) %>>%\n po(\"nn_relu_2\") %>>%\n po(\"nn_max_pool2d_2\", kernel_size = 2, stride = 2)\n\nhead = po(\"nn_flatten\") %>>%\n po(\"nn_linear\", out_features = 128) %>>%\n po(\"nn_relu\") %>>%\n po(\"nn_head\")\n\nmodel = po(\"torch_optimizer\", optimizer = t_opt(\"adam\", lr = 0.003)) %>>%\n po(\"torch_model_classif\",\n epochs = 100,\n batch_size = 256,\n predict_type = \"prob\",\n device = \"cuda\"\n )\n```\n:::\n\n\n\nWe evaluate the two models on the CIFAR-10 image classification task that we have introduced earlier.\nThere, the goal is to classify images into 10 different classes.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nnet_bn = as_learner(cnn_bn %>>% head %>>% model)\nnet_bn$id = \"net_bn\"\nnet = as_learner(cnn %>>% head %>>% model)\nnet$id = \"net\"\n\ncifar10 = tsk(\"cifar10\")\nresampling = rsmp(\"holdout\")$instantiate(cifar10)\n\ndesign = benchmark_grid(\n task = cifar10,\n learner = list(net_bn, net),\n resampling = resampling\n)\ndesign\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n task learner resampling\n \n1: cifar10 net_bn holdout\n2: cifar10 net holdout\n```\n\n\n:::\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbmr = benchmark(design)\nbmr$aggregate()\n```\n:::\n\n\n\n## Dropout\n\nDropout is a regularization technique used to prevent overfitting in neural networks by randomly setting a fraction of input units to zero during training. This encourages the network to learn more robust features that are not reliant on specific neurons, thereby improving its generalization capabilities.\nDuring each training iteration, dropout randomly \"drops\" a subset of neurons by setting their activations to zero with a specified probability (commonly between 20% to 50%). 
This forces the network to distribute the learned representations more evenly across neurons, reducing the reliance on any single neuron and mitigating overfitting.\nDropout is more commonly used in the context of fully connected layers.\n\n![](../assets/dropout.png){fig-align=\"center\" width=100%}\n\nSource: https://medium.com/konvergen/understanding-dropout-ddb60c9f98aa\n\nJust like batch normalization, it also has different behavior during training and evaluation.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndropout = nn_dropout(p = 0.5)\ndropout(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntorch_tensor\n 0.0000 -3.9488 0.0093 0.0000 0.7024\n-0.0000 -1.4141 0.0000 2.9566 5.7694\n-4.4366 0.7622 -0.0000 2.1163 0.2118\n 0.0000 0.0000 0.0000 0.6584 0.2422\n-0.0000 -0.0000 -0.9663 -0.0000 -5.1942\n 0.0000 -1.0714 2.3080 -0.0000 0.6326\n-1.1987 -5.4360 0.0000 3.8675 0.0000\n 0.0000 0.8761 2.7579 3.5069 -0.0000\n-0.3855 1.1178 -0.0000 0.8627 0.0000\n-0.0000 0.0000 0.0000 -0.0000 1.5217\n[ CPUFloatType{10,5} ]\n```\n\n\n:::\n\n```{.r .cell-code}\ndropout$eval()\ndropout(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntorch_tensor\n 0.9281 -1.9744 0.0046 1.7829 0.3512\n-1.2553 -0.7071 2.2261 1.4783 2.8847\n-2.2183 0.3811 -2.0875 1.0582 0.1059\n 0.2226 0.0924 0.5471 0.3292 0.1211\n-1.2201 -0.1440 -0.4831 -1.1782 -2.5971\n 0.3462 -0.5357 1.1540 -1.1725 0.3163\n-0.5994 -2.7180 0.0385 1.9338 0.1908\n 0.3669 0.4380 1.3789 1.7534 -0.5429\n-0.1927 0.5589 -0.5695 0.4313 0.1367\n-0.2556 1.6093 0.2711 -0.4924 0.7609\n[ CPUFloatType{10,5} ]\n```\n\n\n:::\n:::\n\n\n\nTo look at the effects, we will create a second classification head with dropout and then define new learners\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nhead_dropout = po(\"nn_flatten\") %>>%\n po(\"nn_linear\", out_features = 128) %>>%\n po(\"nn_relu\") %>>%\n po(\"nn_dropout\", p = 0.5) %>>%\n po(\"nn_head\")\n\nnet_bn_dropout = as_learner(cnn_bn %>>% head_dropout %>>% model)\nnet_bn_dropout$id = \"net_bn_dropout\"\nnet_dropout = as_learner(cnn %>>% head_dropout %>>% model)\nnet_dropout$id = \"net_dropout\"\n\ndesign2 = benchmark_grid(\n task = cifar10,\n learner = list(net_bn_dropout, net_dropout),\n resampling = resampling\n)\n```\n:::\n\n\n\nNext, we run the second benchmark experiment and afterwards combine the results with the first benchmark experiment.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbmr2 = benchmark(design2)\nbmr = c(bmr, bmr2)\nautoplot(bmr)\n```\n:::\n\n\n\n:::{.callout-note}\n## Quiz: Dropout\n\n**Question 1**: Worse Training Loss: You are training a neural network with and without dropout. The training loss is higher with dropout, is this a bug?\n\n
\nClick for answer\nNot necessarily, as dropout is a regularization technique that prevents overfitting.\nIt's goal is to reduce the generalization performance of the model.\n
\n:::\n\n## Transfer Learning\n\nTransfer learning is a powerful technique in machine learning where a pre-trained model developed for a specific task is reused as the starting point for a model on a second, related task. Instead of training a model from scratch, which can be time-consuming and computationally expensive, transfer learning leverages the knowledge gained from a previously learned task to improve learning efficiency and performance on a new task.\n\nThe advantages of transfer learning are:\n\n1. Reduced Training Time: Leveraging a pre-trained model can significantly decrease the time required to train a new model, as the foundational feature extraction layers are already optimized.\n2. Improved Performance: Transfer learning can enhance model performance, especially when the new task has limited training data. The pre-trained model's knowledge helps in achieving better generalization.\n3. Resource Efficiency: Utilizing pre-trained models reduces the computational resources needed, making it feasible to develop sophisticated models without extensive hardware.\n\nWhen the model is then trained on a new task, only the last layer is replaced with a new output layer to adjust for the new task.\n\nThis is visualized below:\n\n![](../assets/transfer-learning.svg)\n\nSource: https://en.wikipedia.org/wiki/Transfer_learning\n\n`mlr3torch` connects various pretrained image networks that are available in the [`torchvision` package](https://torchvision.mlverse.org/).\nThe ResNet-18 model is a popular pre-trained model that was pretrained on ImageNet.\nWe can use the pretrained weights by setting the `pretrained` parameter to `TRUE`.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nresnet = lrn(\"classif.resnet18\",\n pretrained = TRUE,\n epochs = 2,\n batch_size = 256,\n validate = 0.3,\n measures_valid = msr(\"classif.logloss\"),\n device = \"cuda\",\n predict_type = \"prob\",\n id = \"pretrained\"\n)\nresnet_no_pretrain = resnet$clone(deep = TRUE)\nresnet_no_pretrain$param_set$set_values(\n pretrained = FALSE\n)\nresnet_no_pretrain$id = \"not_pretrained\"\n\ngrid = benchmark_grid(\n task = tsk(\"cifar10\"),\n learner = list(resnet, resnet_no_pretrain),\n resampling = rsmp(\"insample\")\n)\n\nbmr = benchmark(grid, store_models = TRUE)\nbmr$aggregate()\n```\n:::\n\n\n\nWhen fine-tuning a pretrained model like ResNet-18, it's common to observe instabilities in gradients, which can manifest as fluctuating validation performance.\nThis can e.g. be because the learning rate is too high (compared to the learning rate that was used during pretraining).\n\nTo address this, one can:\n\n1. Use a smaller learning rate for the pretrained layers than for the new output head.\n2. Freeze the pretrained layers (for some epochs) and only train the new output head.\n\nIn `mlr3torch` this can be achieved via the callback mechanism.\nFor the unfreezing, there even exists a predefined callback `t_clbk(\"unfreeze\")`.\nTo create a custom callback, the `torch_callback()` function can be used.\nA tutorial on this can be found on the [`mlr3torch` package website](https://mlr3torch.mlr-org.com/index.html).\n\n:::{.callout-note}\n## In-Context Learning\n\nLarge foundation models (such as GPT-4) even allow to perform tasks on which they were not pretrained on without any finetuning.\nThis is referred to as in-context learning or zero-shot learning.\nThere, the task is fed into the model during inference: \"Hey ChatGPT, is What is the sentiment of this sentence. 
Return -1 for sad, 0 for neutral, 1 for happy: \"\n:::\n\n## Data Augmentation\n\nData augmentation is a technique used to increase the diversity and quantity of training data without actually collecting new data.\nBy applying various transformations to the existing dataset, data augmentation helps improve the generalization capabilities of machine learning models, reduce overfitting, and enhance model robustness.\nThis is especially crucial when you have limited data.\n\nData augmentation for images can consist of rotation, flipping, translating, grey scaling, etc.\nWhich data augmentation is admissible, depends on the task:\n\n- If the modeling task is to predict whether there is a mark in the top right corner of an image, vertical or horizontal flipping is not admissible.\n- If the goal is to predict whether there is a mark somewhere in the image, it would be admissible.\n\nIn other words, the data augmentation must be compatible with the invariances of the task.\n\nIn `mlr3torch`, data augmentation is available via `PipeOp`s of the form `po(\"augment_\")`.\nCurrently, only augemntation operators from the `torchvision` package are available, but you can also add your own.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\naugment = po(\"augment_random_resized_crop\") %>>%\n po(\"augment_random_horizontal_flip\") %>>%\n po(\"augment_random_vertical_flip\")\n```\n:::\n\n\n\nWe can just create a new `GraphLearner` that includes the augemntation steps as well as the learner from above:\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nresnet_augmented = as_learner(augment %>>% resnet)\nresnet_augmented$id = \"resnet_augmented\"\nresnet_augmented$train(task = cifar10)\n```\n:::\n\n\n\n:::{.callout-note}\n## Quiz: Data Augmentation\n\n**Question 1**: Do you think data augmentation should be applied to the validation set?\n\n
\nClick for answer\nNo, as the purpose of data augmentation is not to improve an individual prediction, it will not be applied during test time and hence also not to the validation set.\nLooking at the performance of augmented validation data is, however, also not a mistake.\n
\n:::\n", + "markdown": "---\ntitle: \"Training Efficiency\"\n---\n\n\n\n\n\n\nMethods for increasing training efficiency can be roughly split into:\n\n1. Computational methods such as JIT compilation, using GPUs, parallel data loading, etc., that allow doing the same thing faster.\n2. Methodological approaches that change how we approach modeling to achieve either better results or the same results faster.\n\n# Computational Approaches\n\n## Parallel Processing\n\n### Graphical Processing Unit (GPU)\n\nUsing a GPU is crucial when training relatively large neural networks because GPUs are specifically designed to handle the parallel processing required for complex computations.\nTo use a GPU in `mlr3torch`, we can set the device parameter to \"cuda\". By default, it is set to \"auto\", which will use a GPU if available and otherwise fall back to the CPU.\n\n:::{.callout-tip}\nTo check if a GPU is available, we can use the `torch::cuda_is_available()` function.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(torch)\ncuda_is_available()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] FALSE\n```\n\n\n:::\n:::\n\n\n\nIf you have an M1 Mac (or later), you can also use the available graphics card by setting the `device` parameter to `\"mps\"`.\nYou can check this by running:\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbackends_mps_is_available()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n:::\n\n\n:::\n\nTo demonstrate the speed improvements obtained by using a GPU, we conduct a large matrix operation on a GPU and a CPU.\nWe start by randomly sampling a matrix of size 1000x1000.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nx_cpu = torch_randn(1000, 1000, device = \"cpu\")\n```\n:::\n\n\n\nBelow, we perform a matrix multiplication on the CPU and the GPU and compare the timings.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# this will only run if a GPU is available\nx_cuda = x_cpu$cuda()\n\nbench::mark(\n cpu = x_cpu$matmul(x_cpu),\n cuda = x_cuda$matmul(x_cuda)\n)\n```\n:::\n\n\n\n### CPU Threads\n\nTraining large networks on a CPU is not a recommended approach, but it can be a viable option for smaller networks.\nYou can still use multiple threads to speed up the execution of operations.\nPlease be aware that the code below will not execute on macOS, as setting the number of threads is not supported on this operating system.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# this will be skipped on macOS\nbench::mark(\n {torch_set_num_threads(1L); x_cpu$matmul(x_cpu)},\n {torch_set_num_threads(16L); x_cpu$matmul(x_cpu)}\n)\n```\n:::\n\n\n\n`torch` also allows for interop-parallelization, but this is more advanced and code needs to be written in a specific way.\n\n:::{.callout-note}\n## Quiz: Number of Threads\n\nQuestion 1: On a CPU with 4 cores, does it make sense to set the number of threads to values greater than 4? Explain your answer.\n\n
\nClick for answer\nOn a CPU with 4 cores, at most 4 threads can run in parallel.\nUsing more threads than the number of cores will not speed up the execution of operations.\n
\n\nQuestion 2: On a CPU with 64 cores, is it always the case that using 64 threads is better than using 32 threads?\n\n
\nClick for answer\nNot necessarily. Using more threads will mean that:\n\n1. The threads need to communicate and synchronize, which increases the runtime.\n2. More resources are used for the computation, which decreases the runtime.\n\nThe optimal number of threads is a trade-off between these two effects.\n
\n:::\n\n## Efficient Data Loading\n\nBesides parallelizing the computation of operations in the forward and backward pass, another possible bottleneck is the loading of data.\nThere are various ways to improve data loading speed:\n\n1. Improve the implementation of the `dataset` class\n2. Parallelize the data loading process\n3. Increase the speed of data transfer to the GPU\n\nThese approaches will now be discussed.\n\n### Efficient Dataset Implementation\n\nWhen implementing a dataset, we need to define:\n\n1. How we store and load the data\n2. Whether implementing loading of a batch is beneficial\n\n:::{.callout-note}\n## Quiz: Data Loading\n\nThe *tiny imagenet* dataset is a dataset of 100,000 images of size 64x64.\nIt is a subset of the famous *imagenet* dataset.\nBelow, we show some examples from it:\n\n![](../assets/tiny-imagenet.png)\n\n\n\n\n\n\n\nWe will now consider different ways to write a `torch::dataset` implementation for this data.\nAssume we have some image paths stored in a character vector as well as in an array where they are already loaded into memory.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nstr(image_paths)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n chr [1:100] \"/Users/sebi/Library/Caches/org.R-project.R/R/mlr3torch/datasets/tiny_imagenet/raw/tiny-imagenet-200/train/n0144\"| __truncated__ ...\n```\n\n\n:::\n\n```{.r .cell-code}\nstr(image_array)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n num [1:100, 1:3, 1:64, 1:64] 1 0.0784 0.4706 0.5608 0.5647 ...\n```\n\n\n:::\n:::\n\n\n\nAn individual image can, for example, be loaded using the `torchvision::base_loader()` function:\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(torchvision)\nstr(base_loader(image_paths[1]))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n num [1:64, 1:64, 1:3] 1 1 1 1 1 ...\n```\n\n\n:::\n:::\n\n\n\n**Question 1:** Reading From Disk or RAM\n\nWhich of the following is the faster way to load the images? Explain why.\n\n1. Loading the images from disk:\n\n\n\n ::: {.cell layout-align=\"center\"}\n \n ```{.r .cell-code}\n ds_disk = dataset(\"image_paths\",\n initialize = function(image_paths) {\n self$image_paths = image_paths\n },\n .getitem = function(i) {\n torch_tensor(torchvision::base_loader(self$image_paths[i]))\n },\n .length = function() {\n length(self$image_paths)\n }\n )(image_paths)\n ```\n :::\n\n\n\n2. Loading the images from an array:\n\n\n\n ::: {.cell layout-align=\"center\"}\n \n ```{.r .cell-code}\n ds_ram = dataset(\"image_array\",\n initialize = function(image_array) {\n self$image_array = image_array\n },\n .getitem = function(i) {\n torch_tensor(self$image_array[i, , , ])\n },\n .length = function() {\n nrow(self$image_array)\n }\n )(image_array)\n ```\n :::\n\n\n\n
\nClick for answer\n\nGenerally, loading images from RAM is significantly faster than loading them from disk.\nAlthough the benchmark presented below may seem somewhat 'unfair' since `ds_ram` has already loaded the images into memory, this difference is evident in practice.\nWhen iterating over the dataset for multiple epochs, the first method will need to reload the images from disk for each epoch, while the second method only requires a single loading of the images into memory.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\niter = function(ds, ..., epochs = 1) {\n dl = torch::dataloader(ds, batch_size = 16, ...)\n for (epoch in seq_len(epochs)) {\n coro::loop(for(batch in dl) {\n batch\n })\n }\n}\nbench::mark(\n disk = iter(ds_disk, epochs = 10),\n ram = iter(ds_ram, epochs = 10),\n check = FALSE\n)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning: Some expressions had a GC in every iteration; so filtering is disabled.\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 6\n expression min median `itr/sec` mem_alloc `gc/sec`\n \n1 disk 228ms 258ms 3.88 109MB 9.69\n2 ram 183ms 184ms 5.33 94.4MB 10.7 \n```\n\n\n:::\n:::\n\n\n\n
\n\n**Question 2:** (Don't) Copy that\n\nConsider now the next dataset implementation:\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nds_tensor = dataset(\"tensor\",\n initialize = function(image_array) {\n self$tensor = torch_tensor(image_array)\n },\n .getitem = function(i) {\n self$tensor[i, ..]\n },\n .length = function() {\n nrow(self$tensor)\n }\n)(image_array)\n```\n:::\n\n\n\nDo you think this implementation is faster or slower than the `ds_ram` implementation? Explain why.\n\n
\nClick for answer\nThis implementation is faster than the `ds_ram` implementation.\nThis is because the `ds_tensor` implementation copies the R array to a torch tensor only once, whereas the `ds_ram` implementation copies the R array to a torch tensor for each item.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbench::mark(\n tensor = iter(ds_tensor),\n array = iter(ds_ram),\n check = FALSE\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 6\n expression min median `itr/sec` mem_alloc `gc/sec`\n \n1 tensor 4.52ms 4.82ms 206. 96.08KB 6.71\n2 array 14.94ms 16.26ms 62.0 9.44MB 16.9 \n```\n\n\n:::\n:::\n\n\n\n
\n\n**Question 3**: `$.getbatch()` vs `$.getitem()`\n\nWhich implementation is faster? Explain why.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nds_tensor_batch = dataset(\"tensor_batch\",\n initialize = function(image_array) {\n self$tensor = torch_tensor(image_array)\n },\n .getbatch = function(i) {\n self$tensor[i, .., drop = FALSE]\n },\n .length = function() {\n nrow(self$tensor)\n }\n)(image_array)\n```\n:::\n\n\n\n
\nClick for answer\nThe `$.getbatch()` implementation is faster than the `$.getitem()` implementation.\nThis is because when using the `$.getitem()` method, the batch for indices `ids` is obtained by calling `$.getitem(id)` for each index in `ids` and then stacking them together, which requires a new tensor allocation.\nSlicing the tensor, however, avoids this allocation when `shuffle = TRUE` (which is also the default).\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbench::mark(\n getbatch = iter(ds_tensor_batch),\n getitem = iter(ds_tensor),\n check = FALSE\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 6\n expression min median `itr/sec` mem_alloc `gc/sec`\n \n1 getbatch 1.75ms 1.94ms 502. 23KB 4.69\n2 getitem 4.61ms 4.94ms 198. 54.7KB 7.54\n```\n\n\n:::\n:::\n\n\n
\n:::\n\n### Parallel Data Loading\n\nIn Deep Learning, datasets can be very large, and it might therefore be the case that the data is simply too large to fit into memory.\nIn this case, we can use parallel data loading to speed up the data loading process.\nInstead of loading the data sequentially in the main process, other R processes will be started that execute the data loading.\nFor example, if we set `num_workers = 4L`, 4 R processes will be started that load the data, while the main process is free to train the model.\nThese processes then send the batches to the main process.\nThe image below visualizes this process:\n\n![](../assets/parallel-dataloader.png)\n\nCreating such a parallel dataloader is as easy as setting the `num_workers` parameter to a value greater than 0.\n\n:::{.callout-note}\nNote that in the current implementation, parallel data loading is only beneficial when it is slow, e.g., because of loading the data from disk or because of expensive preprocessing.\nThis will hopefully be improved in the future (by a faster implementation of the parallel dataloader).\n:::\n\n\n### Moving Data to the GPU\n\nOne thing we have ignored so far is that when training using a GPU, the data needs to be moved from RAM to the GPU.\nThis is because a GPU has its own memory (VRAM), and the data needs to be moved to this memory before it can be used for training.\nThe moving of the data to the GPU cannot be done on the processes that are loading the data but must be done in the main process, i.e., after the batch was received from (possibly parallelized) dataloader.\nOne way to speed up the data loading process is to pin the memory of the data to the GPU.\nBefore a tensor can be moved from RAM to VRAM, it needs to be in so-called page-locked memory, which can be enabled using the `pin_memory` parameter of `dataloader()`..\n\n![](../assets/pinned-memory.png)\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\niter_cuda = function(ds, pin_memory = TRUE) {\n dl = torch::dataloader(ds, batch_size = 16, pin_memory = pin_memory)\n coro::loop(for(batch in dl) {\n batch$cuda()\n })\n}\n\nbench::mark(\n not_pinned = iter_cuda(ds_disk, pin_memory = FALSE),\n pinned = iter_cuda(ds_disk, pin_memory = TRUE)\n)\n```\n:::\n\n\n\n:::{.callout-note}\n\nIn order to use parallel data loading or memory pinning with `mlr3torch`, these parameters can simply be specified in the learner:\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlrn(\"classif.mlp\", num_workers = 8L, pin_memory = TRUE, device = \"cuda\")\n```\n:::\n\n\n:::\n\n## JIT Compilation & Ignite Optimizers\n\nSome special care needs to be taken when using `torch` (or `mlr3torch`) in order to get good performance.\nIn the future, this will hopefully not be necessary anymore, but is currently required.\n\n### 'Ignite' Optimizers\n\nIn `torch`, different versions of optimizers exist:\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\noptim_adamw\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n object generator\n Inherits from: \n Public:\n initialize: function (params, lr = 0.001, betas = c(0.9, 0.999), eps = 1e-08, \n loop_fun: function (group, param, g, p) \n step: function (closure = NULL) \n clone: function (deep = FALSE) \n Parent env: \n Locked objects: FALSE\n Locked class: FALSE\n Portable: TRUE\n```\n\n\n:::\n\n```{.r .cell-code}\noptim_ignite_adamw\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n object generator\n object generator\n Inherits from: \n Public:\n initialize: 
function (params, lr = 0.001, betas = c(0.9, 0.999), eps = 1e-08, \n clone: function (deep = FALSE) \n Private:\n .config_names: lr betas eps weight_decay amsgrad\n .state_names: exp_avg exp_avg_sq max_exp_avg_sq step\n .optim: function (params, ...) \n .get_states: function (opt) \n .set_states: function (opt, params, states) \n .add_param_group: function (opt, params, lr, betas, eps, weight_decay, amsgrad) \n .assert_params: function (lr, betas, eps, weight_decay, amsgrad) \n .set_param_group_options: function (opt, list) \n .zero_grad: function (opt) \n .get_param_groups: function (ptr) \n Parent env: \n Locked objects: FALSE\n Locked class: FALSE\n Portable: TRUE\n```\n\n\n:::\n:::\n\n\n\nThe 'ignite' indicates that the optimizer is a version that is optimized for performance.\nNot for all optimizers does an ignite version exist, but for the most common ones, there does.\n\nBelow, we compare the performance of the default optimizer and the ignite optimizer and see that the latter is considerably faster.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nadamw = as_torch_optimizer(torch::optim_adamw)\nignite_adamw = as_torch_optimizer(torch::optim_ignite_adamw)\n\nlearner = lrn(\"classif.mlp\", epochs = 10, neurons = c(100, 100), batch_size = 32, optimizer = adamw)\n\nlearner_ignite = learner$clone(deep = TRUE)\nlearner_ignite$configure(\n optimizer = ignite_adamw\n)\ntask_sonar = tsk(\"sonar\")\n\nbench::mark(\n learner$train(task_sonar),\n learner_ignite$train(task_sonar),\n check = FALSE\n)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning: Some expressions had a GC in every iteration; so filtering is disabled.\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 6\n expression min median `itr/sec` mem_alloc `gc/sec`\n \n1 learner$train(task_sonar) 588ms 588ms 1.70 15.7MB 5.10\n2 learner_ignite$train(task_sonar) 204ms 208ms 4.77 10.7MB 4.77\n```\n\n\n:::\n:::\n\n\n\n### JIT Compilation\n\nJIT (Just-In-Time) compilation is a runtime optimization technique that compiles code into machine code during execution rather than beforehand.\nThis has different advantages:\n\n1. By JIT-compiling a model, some operations can be optimized for performance.\n2. A JIT-compiled model can be saved and executed without an R dependency for deployment (only LibTorch is required), e.g., in a C++ application.\n3. Running a JIT-compiled model in R is faster because the whole network is executed in C++ instead of R.\n\nIn `torch`, this can either be done using TorchScript or by tracing a model.\nWe will briefly discuss both approaches, but for more information, see the [torch documentation](https://torch.mlverse.org/docs/articles/torchscript).\n\n#### TorchScript\n\nTorchScript is a subset of Python -- i.e., its own programming language -- that can be used to define compiled functions.\nIn R, this is available via the [`jit_compile`](https://torch.mlverse.org/docs/reference/jit_compile.html) function.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nf = jit_compile(\"\ndef f(x, w, bias):\n return x @ w + bias\n\")$f\n\nx = torch_randn(10, 10)\nw = torch_randn(10, 1)\nbias = torch_randn(1)\n\nout = f(x, w, bias)\nstr(out)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nFloat [1:10, 1:1]\n```\n\n\n:::\n:::\n\n\n\nBesides syntax, there are some notable differences between TorchScript and R to be aware of:\n\n1. In TorchScript, indexing tensors is 0-based, and\n2. 
TorchScript is statically typed, so you need to specify the types of the arguments, unless they are tensors, which is the default.\n\nBelow, we define a function that takes a list of tensors and calculates their sum.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsum_jit = jit_compile(\"\ndef sum_jit(xs: List[Tensor]):\n output = torch.zeros_like(xs[0])\n for x in xs:\n output = output + x\n return output\n\")$sum_jit\n\nsum_jit(list(torch_randn(1), torch_randn(1)))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntorch_tensor\n-0.7121\n[ CPUFloatType{1} ]\n```\n\n\n:::\n:::\n\n\n\n#### Tracing\n\nThe alternative to writing TorchScript is to write your module in R and to use [`jit_trace`](https://torch.mlverse.org/docs/reference/jit_trace_module.html) to compile it.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nf2 = function(x, w, bias) {\n x$matmul(w) + bias\n}\n# need to provide some example input\n# arguments are passed by position\nf2 = jit_trace(f2, torch_randn(10, 10), torch_randn(10, 100), torch_randn(100))\nout2 = f2(x, w, bias)\ntorch_equal(out, out2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n:::\n\n\n\nAn advantage of trace-compilation is that it can be applied to modules, which is currently not possible with `jit_compile`.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nnet = nn_sequential(\n nn_linear(10, 100),\n nn_relu(),\n nn_linear(100, 10)\n)\nnet_jit = jit_trace(net, torch_randn(10, 10))\n\ntorch_equal(net(x), net_jit(x))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n:::\n\n\n\nHowever, trace-compilation is restrictive because it only records operations applied to torch tensors and is unaware of R control flow, so you need to be careful when using it.\nFurthermore, it only accepts torch tensors as arguments.\nFor many simple modules, trace-compilation should usually work.\nYou can also check this by running the original and trace-jitted module on some example inputs and see if they return the same result.\n\n:::{.callout-note}\nA trace-jitted module *does* respect the mode of the network, i.e., whether it is training or evaluating.\n:::\n\nIn `mlr3torch`, trace compilation is also available and can be enabled by setting `jit_trace = TRUE` in the learner.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlearner = lrn(\"classif.mlp\", jit_trace = TRUE)\n```\n:::\n\n\n\nYou can also combine TorchScript with tracing:\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nnet_both = nn_module(\n initialize = function() {\n self$linear = nn_linear(1, 1)\n },\n forward = function(x) {\n self$linear(sum_jit(x))\n }\n)()\n\nnet_both(list(torch_randn(1), torch_randn(1)))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntorch_tensor\n 1.0027\n[ CPUFloatType{1} ][ grad_fn = ]\n```\n\n\n:::\n\n```{.r .cell-code}\nnet_both(list(torch_randn(1)))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntorch_tensor\n0.01 *\n 8.5286\n[ CPUFloatType{1} ][ grad_fn = ]\n```\n\n\n:::\n:::\n\n\n\n:::{.callout-note}\n## Quiz: Just In Time\n\n**Question 1**: Consider the trace-jitted function below. Can you predict the output of the last two lines? 
Can you explain why this happens?\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nf = function(a, b, multiply) {\n if (multiply$item()) {\n a * b\n } else {\n a + b\n }\n}\nfjit = jit_trace(f, torch_tensor(1), torch_tensor(2), torch_tensor(TRUE))\n\nfjit(torch_tensor(2), torch_tensor(3), torch_tensor(TRUE))\nfjit(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE))\n```\n:::\n\n\n\n
\nClick for answer\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nfjit(torch_tensor(2), torch_tensor(3), torch_tensor(TRUE))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntorch_tensor\n 6\n[ CPUFloatType{1} ]\n```\n\n\n:::\n\n```{.r .cell-code}\nfjit(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntorch_tensor\n 6\n[ CPUFloatType{1} ]\n```\n\n\n:::\n:::\n\n\n\n
\n\n**Question 2**: Answer the same question for the following function:\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nf = function(a, b, multiply) {\n torch_where(multiply, a * b, a + b)\n}\nfjit = jit_trace(f, torch_tensor(1), torch_tensor(2), torch_tensor(TRUE))\n\nfjit(torch_tensor(2), torch_tensor(3), torch_tensor(TRUE))\nfjit(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE))\n```\n:::\n\n\n\n
\nClick for answer\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nfjit(torch_tensor(2), torch_tensor(3), torch_tensor(TRUE))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntorch_tensor\n 6\n[ CPUFloatType{1} ]\n```\n\n\n:::\n\n```{.r .cell-code}\nfjit(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntorch_tensor\n 5\n[ CPUFloatType{1} ]\n```\n\n\n:::\n:::\n\n\n
\n:::\n\n### Mixed Precision Training\n\nAnother way to speed up the training process is to use mixed precision training.\nThis technique involves training the model using both 16-bit and 32-bit floating point numbers.\nThis allows reducing the memory footprint of the model and speeding up the training process.\nWe won't cover this here, but refer to the [torch documentation](https://torch.mlverse.org/docs/articles/amp) that explains how to do this.\n\n## Methodological Approaches\n\n### Validation and Early Stopping\n\nFor more details on this topic, see the [corresponding chapter](https://mlr3book.mlr-org.com/chapters/chapter15/predsets_valid_inttune.html) in the `mlr3` book.\n\nAs we have already seen in one of the previous notebooks, in deep learning, some part of the data is often used for validation purposes.\nThis allows monitoring the performance of the model on unseen data.\n\nIn `mlr3torch`, we can track the performance of the model on a validation set by specifying:\n\n* `validate`, which is the ratio of the data that is used for validation\n* `measures_valid`, which is a list of measures to evaluate the validation performance\n* `eval_freq`, which is the frequency at which the validation is performed\n* `callbacks`, which is a list of callbacks to use during training, in this case, we use the `t_clbk(\"history\")` callback, which records the performance of the model on the validation set at regular intervals, enabling us to monitor and analyze the model's performance over time.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntask = tsk(\"sonar\")\n\nmlp_learner = lrn(\"classif.mlp\",\n neurons = c(50, 50), batch_size = 256, epochs = 400,\n optimizer = t_opt(\"adam\", lr = 0.003),\n predict_type = \"prob\", jit_trace = TRUE,\n # Validation / Performance Monitoring\n validate = 0.3, # how much data to use for validation\n measures_valid = msr(\"classif.logloss\"), # how to evaluate train performance\n measures_train = msr(\"classif.logloss\"), # how to evaluate validation performance\n callbacks = t_clbk(\"history\"), # history callbacks save train and validation performance\n eval_freq = 10 # after how many training epochs to perform validation\n)\nmlp_learner$train(task)\nhistory = mlp_learner$model$callbacks$history\nhead(history)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nKey: \n epoch train.classif.logloss valid.classif.logloss\n \n1: 10 0.6775741 0.6665855\n2: 20 0.6430574 0.6176948\n3: 30 0.5685190 0.5364953\n4: 40 0.5151559 0.4694589\n5: 50 0.4780497 0.4363074\n6: 60 0.3861667 0.4153698\n```\n\n\n:::\n:::\n\n\n\nBelow we plot the training and validation for the different epochs:\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](6-training-efficiency_files/figure-html/unnamed-chunk-32-1.png){fig-align='center' width=672}\n:::\n:::\n\n\n\nInstead of only monitoring the validation loss (and watching it get worse and worse), we can also stop the training process dynamically when the validation loss begins to increase.\nThis regularization technique is called **early stopping**, and it prevents overfitting during the training of iteratively trained machine learning models.\n\nThe key configuration option for early stopping is the `patience` parameter, which defines the number of epochs to wait after the last improvement in validation loss before stopping the training.\nFor example, if the patience is set to 5, the training will continue for 5 additional epochs after the last observed improvement in validation loss.\nIf no 
improvement is seen during this period, training will be halted.\n\nAdvantages of early stopping include:\n\n- **Prevention of Overfitting**: By stopping training when the model starts to overfit, we can achieve better generalization on unseen data.\n- **Resource Efficiency**: It saves computational resources by avoiding unnecessary training epochs once the model performance has plateaued.\n\nNow, let's train the learner again using early stopping with a patience of 5 epochs:\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmlp_learner$param_set$set_values(\n patience = 5\n)\nmlp_learner$train(task)\nmlp_learner$internal_tuned_values$epochs\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 160\n```\n\n\n:::\n:::\n\n\n\nBeyond only tuning the number of epochs, `mlr3`'s internal tuning mechanism also allows tuning the number of epochs internally while using an offline tuning method to optimize other hyperparameters.\nTo use this, we can set the parameters we want to tune using `to_tune()`, but need to set `internal = TRUE` for the `epochs` parameter.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(mlr3tuning)\nmlp_learner$param_set$set_values(\n epochs = to_tune(upper = 100, internal = TRUE),\n opt.lr = to_tune(lower = 1e-4, upper = 1e-1, logscale = TRUE)\n)\n```\n:::\n\n\n\nWe could now pass this learner to a tuner as usual.\n\n## Architecture Design\n\nAnother essential aspect of training neural networks efficiently and effectively is the design of the network architecture, which can be a challenging task.\nHowever, for many problems, there are predefined architectures that perform well and can be used.\nUnless there is a specific reason to design a new architecture, it is recommended to use such an architecture.\n\n:::{.callout-note}\nBecause the Python deep learning ecosystem is so large, many more architectures are implemented in Python than in R.\nOne way to use them in R is to simply translate the PyTorch code to (R-)torch.\nWhile PyTorch and (R-)torch are quite similar, there are some differences, e.g., 1-based and 0-based indexing.\nThe `torch` website contains a [brief tutorial](https://torch.mlverse.org/docs/articles/python-to-r) on this topic.\n:::\n\nNonetheless, we will cover important techniques that can be used to speed up the training process, namely *batch normalization* and *dropout*.\n\n### Batch Normalization\n\nBatch Normalization is an important technique in deep learning that contributed significantly to speeding up the training process.\n\nThe formula for batch normalization (during training) is given by:\n\n$$\n\\hat{x} = \\frac{x - \\mu_B}{\\sqrt{\\sigma_B^2 + \\epsilon}}\n$$\n\nwhere:\n\n- $\\hat{x}$ is the normalized output,\n- $x$ is the input,\n- $\\mu_B$ is the mean of the batch,\n- $\\sigma_B^2$ is the variance of the batch,\n- $\\epsilon$ is a small constant added for numerical stability.\n\nDuring inference, the module uses the running mean and variance of the training data to normalize the input.\n\nIn `torch`, different versions of batch normalization exist for different dimensions of the input tensor.\nBelow, we illustrate the batch normalization module using a 1D input tensor (the batch dimension does not count here):\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nx = torch_randn(10, 5)\nbn = nn_batch_norm1d(num_features = 5)\nbn(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntorch_tensor\n 1.4613 -1.3934 -0.2146 1.0406 0.1413\n-0.9634 -0.3388 1.7441 0.7744 2.1476\n-2.0328 
0.5667 -2.0592 0.4071 -0.0529\n 0.6778 0.3264 0.2637 -0.2301 -0.0409\n-0.9243 0.1298 -0.6447 -1.5477 -2.1935\n 0.8150 -0.1962 0.7988 -1.5426 0.1137\n-0.2350 -2.0121 -0.1847 1.1725 0.0143\n 0.8381 0.6141 0.9971 1.0148 -0.5667\n 0.2166 0.7147 -0.7208 -0.1408 -0.0285\n 0.1467 1.5887 0.0203 -0.9482 0.4657\n[ CPUFloatType{10,5} ][ grad_fn = ]\n```\n\n\n:::\n:::\n\n\n\n:::{.callout-note}\n## Quiz: Batch Normalization\n\n**Question 1**: Earlier we have learned that `nn_module`s have buffers and parameters, where only the latter are learned with gradient descent.\nDo you think the mean and variance are parameters or buffers?\n\n
\nClick for answer\nThey are both buffers as they only store the variance and running mean of all training samples seen, i.e., they are not updated using gradient information.\n
\n\n**Question 2**: Training vs. Evaluation Mode:\nWhile many `nn_module`s behave the same way irrespective of their mode, batch normalization is an example of a module that behaves differently during training and evaluation.\nDuring training, the module uses the mean and variance of the current batch, while during evaluation, it uses the running mean and variance of all training samples seen.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbn(x[1:10, ])\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntorch_tensor\n 1.4613 -1.3934 -0.2146 1.0406 0.1413\n-0.9634 -0.3388 1.7441 0.7744 2.1476\n-2.0328 0.5667 -2.0592 0.4071 -0.0529\n 0.6778 0.3264 0.2637 -0.2301 -0.0409\n-0.9243 0.1298 -0.6447 -1.5477 -2.1935\n 0.8150 -0.1962 0.7988 -1.5426 0.1137\n-0.2350 -2.0121 -0.1847 1.1725 0.0143\n 0.8381 0.6141 0.9971 1.0148 -0.5667\n 0.2166 0.7147 -0.7208 -0.1408 -0.0285\n 0.1467 1.5887 0.0203 -0.9482 0.4657\n[ CPUFloatType{10,5} ][ grad_fn = ]\n```\n\n\n:::\n:::\n\n\n\nWhich of the following statements is true and why?\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbn$eval()\nequal1 = torch_equal(\n torch_cat(list(bn(x[1:2, ]), bn(x[3:4, ]))),\n bn(x[1:4, ])\n)\nbn$train()\nequal2 = torch_equal(\n torch_cat(list(bn(x[1:2, ]), bn(x[3:4, ]))),\n bn(x[1:4, ])\n)\n```\n:::\n\n\n\n
\nClick for answer\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nc(equal1, equal2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE FALSE\n```\n\n\n:::\n:::\n\n\n\nThe first statement is true because, in evaluation mode, the module uses the running mean and variance of all training samples seen.\nThe second statement is false because the first tensor uses different means and variances for rows 1-2 and 3-4, while the second tensor uses the same mean and variance for all rows.\n
\n:::\n\nTo illustrate its effectiveness, we will define a simple CNN, with and without batch normalization, train it on CIFAR-10, and compare their performances.\n\nTo build the neural networks, we will use `mlr3torch`, which allows building architectures from `PipeOp`s.\nRecall that the `po(\"torch_ingress_ltnsr\")` is a special `PipeOp` that marks the input of the neural network.\nNote that `po(\"nn_relu_1\")` is equivalent to `po(\"nn_relu\", id = \"nn_relu_1\")`.\nWe need to specify unique IDs for each `PipeOp` as this is required in mlr3pipelines graphs.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ncnn_bn = po(\"torch_ingress_ltnsr\") %>>%\n po(\"nn_conv2d_1\", out_channels = 32, kernel_size = 3, stride = 1, padding = 1) %>>%\n po(\"nn_batch_norm2d_1\") %>>%\n po(\"nn_relu_1\") %>>%\n po(\"nn_max_pool2d_1\", kernel_size = 2, stride = 2) %>>%\n po(\"nn_conv2d_2\", out_channels = 64, kernel_size = 3, stride = 1, padding = 1) %>>%\n po(\"nn_batch_norm2d_2\") %>>%\n po(\"nn_relu_2\") %>>%\n po(\"nn_max_pool2d_2\", kernel_size = 2, stride = 2)\n\ncnn = po(\"torch_ingress_ltnsr\") %>>%\n po(\"nn_conv2d_1\", out_channels = 32, kernel_size = 3, stride = 1, padding = 1) %>>%\n po(\"nn_relu_1\") %>>%\n po(\"nn_max_pool2d_1\", kernel_size = 2, stride = 2) %>>%\n po(\"nn_conv2d\", out_channels = 64, kernel_size = 3, stride = 1, padding = 1) %>>%\n po(\"nn_relu_2\") %>>%\n po(\"nn_max_pool2d_2\", kernel_size = 2, stride = 2)\n\nhead = po(\"nn_flatten\") %>>%\n po(\"nn_linear\", out_features = 128) %>>%\n po(\"nn_relu\") %>>%\n po(\"nn_head\")\n\nmodel = po(\"torch_optimizer\", optimizer = t_opt(\"adam\", lr = 0.003)) %>>%\n po(\"torch_model_classif\",\n epochs = 100,\n batch_size = 256,\n predict_type = \"prob\",\n device = \"cuda\"\n )\n```\n:::\n\n\n\nWe evaluate the two models on the CIFAR-10 image classification task that we have introduced earlier.\nThere, the goal is to classify images into 10 different classes.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nnet_bn = as_learner(cnn_bn %>>% head %>>% model)\nnet_bn$id = \"net_bn\"\nnet = as_learner(cnn %>>% head %>>% model)\nnet$id = \"net\"\n\ncifar10 = tsk(\"cifar10\")\nresampling = rsmp(\"holdout\")$instantiate(cifar10)\n\ndesign = benchmark_grid(\n task = cifar10,\n learner = list(net_bn, net),\n resampling = resampling\n)\ndesign\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n task learner resampling\n \n1: cifar10 net_bn holdout\n2: cifar10 net holdout\n```\n\n\n:::\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbmr = benchmark(design)\nbmr$aggregate()\n```\n:::\n\n\n\n## Dropout\n\nDropout is a regularization technique used to prevent overfitting in neural networks.\nDuring each training iteration, dropout randomly \"drops\" a subset of neurons by setting their activations to zero with a specified probability (commonly between 20% to 50%).\nThis forces the network to distribute the learned representations more evenly across neurons.\nDropout is most commonly used in the context of fully connected layers.\n\n![](../assets/dropout.png){fig-align=\"center\" width=100%}\n\n[Source](https://medium.com/konvergen/understanding-dropout-ddb60c9f98aa)\n\nJust like batch normalization, it also has different behavior during training and evaluation.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndropout = nn_dropout(p = 0.5)\ndropout(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntorch_tensor\n 0.0000 -3.9488 0.0093 0.0000 0.7024\n-0.0000 -1.4141 
0.0000 2.9566 5.7694\n-4.4366 0.7622 -0.0000 2.1163 0.2118\n 0.0000 0.0000 0.0000 0.6584 0.2422\n-0.0000 -0.0000 -0.9663 -0.0000 -5.1942\n 0.0000 -1.0714 2.3080 -0.0000 0.6326\n-1.1987 -5.4360 0.0000 3.8675 0.0000\n 0.0000 0.8761 2.7579 3.5069 -0.0000\n-0.3855 1.1178 -0.0000 0.8627 0.0000\n-0.0000 0.0000 0.0000 -0.0000 1.5217\n[ CPUFloatType{10,5} ]\n```\n\n\n:::\n\n```{.r .cell-code}\ndropout$eval()\ndropout(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntorch_tensor\n 0.9281 -1.9744 0.0046 1.7829 0.3512\n-1.2553 -0.7071 2.2261 1.4783 2.8847\n-2.2183 0.3811 -2.0875 1.0582 0.1059\n 0.2226 0.0924 0.5471 0.3292 0.1211\n-1.2201 -0.1440 -0.4831 -1.1782 -2.5971\n 0.3462 -0.5357 1.1540 -1.1725 0.3163\n-0.5994 -2.7180 0.0385 1.9338 0.1908\n 0.3669 0.4380 1.3789 1.7534 -0.5429\n-0.1927 0.5589 -0.5695 0.4313 0.1367\n-0.2556 1.6093 0.2711 -0.4924 0.7609\n[ CPUFloatType{10,5} ]\n```\n\n\n:::\n:::\n\n\n\nTo look at the effects, we will create a second classification head with dropout and then define new learners.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nhead_dropout = po(\"nn_flatten\") %>>%\n po(\"nn_linear\", out_features = 128) %>>%\n po(\"nn_relu\") %>>%\n po(\"nn_dropout\", p = 0.5) %>>%\n po(\"nn_head\")\n\nnet_bn_dropout = as_learner(cnn_bn %>>% head_dropout %>>% model)\nnet_bn_dropout$id = \"net_bn_dropout\"\nnet_dropout = as_learner(cnn %>>% head_dropout %>>% model)\nnet_dropout$id = \"net_dropout\"\n\ndesign2 = benchmark_grid(\n task = cifar10,\n learner = list(net_bn_dropout, net_dropout),\n resampling = resampling\n)\n```\n:::\n\n\n\nNext, we run the second benchmark experiment and afterwards combine the results with the first benchmark experiment.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbmr2 = benchmark(design2)\nbmr = c(bmr, bmr2)\nautoplot(bmr)\n```\n:::\n\n\n\n:::{.callout-note}\n## Quiz: Dropout\n\n**Question 1**: Worse Training Loss: You are training a neural network with and without dropout. The training loss is higher with dropout, is this a bug?\n\n
\nClick for answer\nNot necessarily, as dropout is a regularization technique that prevents overfitting.\nIts goal is to improve the generalization performance of the model, not its training performance, so a somewhat higher training loss is expected.\n
\n:::\n\n## Transfer Learning\n\nTransfer learning is a powerful technique in machine learning where a pre-trained model developed for a specific task is reused as the starting point for a model on a second, related task. Instead of training a model from scratch, which can be time-consuming and computationally expensive, transfer learning leverages the knowledge gained from a previously learned task to improve learning efficiency and performance on a new task.\n\nThe advantages of transfer learning are:\n\n1. Reduced Training Time: Leveraging a pre-trained model can significantly decrease the time required to train a new model, as the foundational feature extraction layers are already optimized.\n2. Improved Performance: Transfer learning can enhance model performance, especially when the new task has limited training data. The pre-trained model's knowledge helps in achieving better generalization.\n3. Resource Efficiency: Utilizing pre-trained models reduces the computational resources needed, making it feasible to develop sophisticated models without extensive hardware.\n\nWhen the model is then trained on a new task, only the last layer is replaced with a new output layer to adjust for the new task.\n\nThis is visualized below:\n\n![](../assets/transfer-learning.svg)\n\n[Source](https://en.wikipedia.org/wiki/Transfer_learning)\n\n`mlr3torch` offers various pretrained image networks that are available through the [`torchvision` package](https://torchvision.mlverse.org/).\nThe ResNet-18 model is a popular pre-trained model that was pretrained on ImageNet.\nWe can use the pretrained weights by setting the `pretrained` parameter to `TRUE`.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nresnet = lrn(\"classif.resnet18\",\n pretrained = TRUE,\n epochs = 2,\n batch_size = 256,\n validate = 0.3,\n measures_valid = msr(\"classif.logloss\"),\n device = \"cuda\",\n predict_type = \"prob\",\n id = \"pretrained\"\n)\nresnet_no_pretrain = resnet$clone(deep = TRUE)\nresnet_no_pretrain$param_set$set_values(\n pretrained = FALSE\n)\nresnet_no_pretrain$id = \"not_pretrained\"\n\ngrid = benchmark_grid(\n task = tsk(\"cifar10\"),\n learner = list(resnet, resnet_no_pretrain),\n resampling = rsmp(\"insample\")\n)\n\nbmr = benchmark(grid, store_models = TRUE)\nbmr$aggregate()\n```\n:::\n\n\n\nWhen fine-tuning a pretrained model like ResNet-18, it's common to observe instabilities in gradients, which can manifest as fluctuating validation performance.\n\nTo address this, one can for example freeze the pretrained layers (for some epochs) and only train the new output head.\nIn `mlr3torch`, this can be achieved by using the `t_clbk(\"unfreeze\")` callback.\n\n:::{.callout-note}\n## In-Context Learning\n\nLarge foundation models (such as GPT-4) even allow performing tasks on which they were not pretrained on without any finetuning.\nThis is referred to as in-context learning or zero-shot learning.\nThere, the task is fed into the model during inference: \"Hey ChatGPT, is What is the sentiment of this sentence. 
Return -1 for sad, 0 for neutral, 1 for happy: \"\n:::\n\n## Data Augmentation\n\nData augmentation is a technique used to increase the diversity and quantity of training data without actually collecting new data.\nBy applying various transformations to the existing dataset, data augmentation helps improve the generalization capabilities of machine learning models, reduce overfitting, and enhance model robustness.\nThis is especially crucial when you have limited data.\n\nData augmentation for images can consist of rotation, flipping, translating, grey scaling, etc.\nWhich data augmentation is admissible, depends on the task:\n\n- If the modeling task is to predict whether there is a mark in the top right corner of an image, vertical or horizontal flipping is not admissible.\n- If the goal is to predict whether there is a mark somewhere in the image, it would be admissible.\n\nIn other words, the data augmentation must be compatible with the invariances of the task.\n\nIn `mlr3torch`, data augmentation is available via `PipeOp`s of the form `po(\"augment_\")`.\nCurrently, only augmentation operators from the `torchvision` package are available, but you can also add your own.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\naugment = po(\"augment_random_resized_crop\") %>>%\n po(\"augment_random_horizontal_flip\") %>>%\n po(\"augment_random_vertical_flip\")\n```\n:::\n\n\n\nWe can just create a new `GraphLearner` that includes the augmentation steps as well as the learner from above:\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nresnet_augmented = as_learner(augment %>>% resnet)\nresnet_augmented$id = \"resnet_augmented\"\nresnet_augmented$train(task = cifar10)\n```\n:::\n\n\n\n:::{.callout-note}\n## Quiz: Data Augmentation\n\n**Question 1**: Do you think data augmentation should be applied to the validation set?\n\n
\nClick for answer\nNo. Data augmentation is meant to improve generalization during training rather than to change individual predictions, so it is not applied at test time and should therefore not be applied to the validation set either.\nInspecting the performance on augmented validation data is, however, not a mistake per se.\n
\n:::\n", "supporting": [ "6-training-efficiency_files" ], diff --git a/_freeze/notebooks/6-training-efficiency/figure-html/unnamed-chunk-30-1.png b/_freeze/notebooks/6-training-efficiency/figure-html/unnamed-chunk-32-1.png similarity index 100% rename from _freeze/notebooks/6-training-efficiency/figure-html/unnamed-chunk-30-1.png rename to _freeze/notebooks/6-training-efficiency/figure-html/unnamed-chunk-32-1.png diff --git a/docs/notebooks/6-training-efficiency-exercise-solution.html b/docs/notebooks/6-training-efficiency-exercise-solution.html index cf1a8e5..13f8d9b 100644 --- a/docs/notebooks/6-training-efficiency-exercise-solution.html +++ b/docs/notebooks/6-training-efficiency-exercise-solution.html @@ -321,27 +321,18 @@

Training Efficiency

  • Utilizes a batch size of 128.
  • Trains for 200 epochs.
  • Employs a validation set comprising 30% of the data.
  • -
  • Tracks the training and validation log-loss during training.
  • +
  • Tracks the validation log-loss.
  • Utilizes trace-jitting to speed up the training process.
  • Employs the history callback to record the training and validation log-loss during training.
  • Afterward, plot the validation log-loss, which is accessible via learner$model$callbacks$history.

    -

    Below, we create the task and remove the gender feature for simplicity.

    +

    Below, we create the task and remove the gender feature again for simplicity.

    -
    library(mlr3verse)
    -
    -
    Loading required package: mlr3
    -
    -
    library(mlr3torch)
    -
    -
    Loading required package: mlr3pipelines
    -
    -
    -
    Loading required package: torch
    -
    -
    ilpd_num <- tsk("ilpd")
    -ilpd_num$select(setdiff(ilpd_num$feature_names, "gender"))
    -ilpd_num
    +
    library(mlr3verse)
    +library(mlr3torch)
    +ilpd_num <- tsk("ilpd")
    +ilpd_num$select(setdiff(ilpd_num$feature_names, "gender"))
    +ilpd_num
    <TaskClassif:ilpd> (583 x 10): Indian Liver Patient Data
     * Target: diseased
    @@ -351,34 +342,23 @@ 

    Training Efficiency

    - int (4): age, alanine_transaminase, alkaline_phosphatase, aspartate_transaminase
    -
    - -Hint - -
      -
    • To specify the validation set, use the validate field, which can either be set during construction or by calling $configure().
    • -
    • Trace-jitting can be enabled via the jit_trace parameter.
    • -
    • The history callback can be constructed via t_clbk("history") and needs to be passed during the construction of the learner.
    • -
    • The validation and measures can be specified via measures_valid and take a measure object that is constructed via msr().
    • -
    -

    Solution

    -
    library(ggplot2)
    -
    -mlp <- lrn("classif.mlp",
    -  neurons = c(100, 100),
    -  batch_size = 128,
    -  epochs = 200,
    -  predict_type = "prob",
    -  validate = 0.3,
    -  jit_trace = TRUE,
    -  callbacks = t_clbk("history"),
    -  measures_valid = msr("classif.logloss")
    -)
    -
    -mlp$train(ilpd_num)
    -head(mlp$model$callbacks$history)
    +
    library(ggplot2)
    +
    +mlp <- lrn("classif.mlp",
    +  neurons = c(100, 100),
    +  batch_size = 128,
    +  epochs = 200,
    +  predict_type = "prob",
    +  validate = 0.3,
    +  jit_trace = TRUE,
    +  callbacks = t_clbk("history"),
    +  measures_valid = msr("classif.logloss")
    +)
    +
    +mlp$train(ilpd_num)
    +head(mlp$model$callbacks$history)
       epoch valid.classif.logloss
        <num>                 <num>
    @@ -389,13 +369,13 @@ 

    Training Efficiency

    5: 5 1.563049 6: 6 0.958690
    -
    ggplot(mlp$model$callbacks$history) +
    -  geom_line(aes(x = epoch, y = valid.classif.logloss)) +
    -  labs(
    -    y = "Log-Loss (Validation)",
    -    x = "Epoch"
    -  ) +
    -  theme_minimal()
    +
    ggplot(mlp$model$callbacks$history) +
    +  geom_line(aes(x = epoch, y = valid.classif.logloss)) +
    +  labs(
    +    y = "Log-Loss (Validation)",
    +    x = "Epoch"
    +  ) +
    +  theme_minimal()
    @@ -404,7 +384,8 @@

    Training Efficiency

    -

    Question 2: Early Stopping Enable early stopping to prevent overfitting and re-train the learner (using a patience of 10). Print the final validation performance of the learner and the early stopped results. You can consult the documentation of LearnerTorch on how to access these two results (section Active Bindings).

    +

    Question 2: Early Stopping

    +

    Enable early stopping to prevent overfitting and re-train the learner (using a patience of 10). Print the final validation performance of the learner and the early stopped results. You can consult the documentation of LearnerTorch on how to access these (see section Active Bindings).

    Hint @@ -413,16 +394,16 @@

    Training Efficiency

    Solution

    -
    mlp$configure(
    -  patience = 10
    -)
    -mlp$train(ilpd_num)
    -mlp$internal_tuned_values
    +
    mlp$configure(
    +  patience = 10
    +)
    +mlp$train(ilpd_num)
    +mlp$internal_tuned_values
    $epochs
     [1] 24
    -
    mlp$internal_valid_scores
    +
    mlp$internal_valid_scores
    $classif.logloss
     [1] 0.5598296
    @@ -430,42 +411,99 @@

    Training Efficiency

    Question 3: Early Stopping and Dropout Tuning

    While early stopping in itself is already useful, mlr3torch also allows you to simultaneously tune the number of epochs using early stopping while tuning other hyperparameters via traditional hyperparameter tuning from mlr3tuning.

    -

    One thing we have not covered so far is that the MLP learner we have used so far also uses a dropout layer. The dropout probability can be configured via the p parameter.

    -

    Your task is to tune the dropout probability p in the range \([0, 1]\) and the epochs using early stopping (using the configuration from the previous exercise).

    +

    One thing we have not mentioned so far is that the MLP learner also uses a dropout layer. The dropout probability can be configured via the p parameter.

    +

    Your task is to tune the dropout probability p in the range \([0, 1]\) and the epochs using early stopping (using the configuration from the previous exercise) with an upper bound of 100 epochs.

    To adapt this to work with early stopping, you need to set the:

    1. epochs to to_tune(upper = <value>, internal = TRUE): This tells the Tuner that the learner will tune the number of epochs itself.
    2. $validate field to "test" so that the same data is used for tuning and validation.
    3. Tuning measure to msr("internal_valid_score", minimize = TRUE). We set minimize to TRUE because we have used the log-loss as a validation measure.
    -

    Apart from this, the tuning works just like in tutorial 5. Use 3-fold cross-validation for the tuning and evaluate 10 configurations.

    -

    Run the tuning and print the optimal configuration.

    +

    Apart from this, the tuning works just like in tutorial 5. Use 3-fold cross-validation and evaluate 10 configurations using random search. Finally, print the optimal configuration.

    Solution

    -
    library(mlr3torch)
    -
    -mlp$configure(
    -  epochs = to_tune(upper = 100, internal = TRUE),
    -  p = to_tune(lower = 0, upper = 1),
    -  validate = "test"
    -)
    -
    -tuner <- tnr("random_search")
    -resampling <- rsmp("cv", folds = 3)
    -measure <- msr("internal_valid_score", minimize = TRUE)
    -
    -ti <- tune(
    -  tuner = tuner,
    -  task = ilpd_num,
    -  learner = mlp,
    -  resampling = resampling,
    -  measure = measure,
    -  term_evals = 10
    -)
    -
    -ti$learner_result_param_vals
    +
    library(mlr3torch)
    +
    +mlp$configure(
    +  epochs = to_tune(upper = 100, internal = TRUE),
    +  p = to_tune(lower = 0, upper = 1),
    +  validate = "test"
    +)
    +
    +tuner <- tnr("random_search")
    +resampling <- rsmp("cv", folds = 3)
    +measure <- msr("internal_valid_score", minimize = TRUE)
    +
    +ti <- tune(
    +  tuner = tuner,
    +  task = ilpd_num,
    +  learner = mlp,
    +  resampling = resampling,
    +  measure = measure,
    +  term_evals = 10
    +)
    +
    +ti$result_learner_param_vals
    -
    NULL
    +
    $epochs
    +[1] 53
    +
    +$device
    +[1] "auto"
    +
    +$num_threads
    +[1] 1
    +
    +$num_interop_threads
    +[1] 1
    +
    +$seed
    +[1] "random"
    +
    +$jit_trace
    +[1] TRUE
    +
    +$eval_freq
    +[1] 1
    +
    +$measures_train
    +list()
    +
    +$measures_valid
    +list()
    +
    +$patience
    +[1] 0
    +
    +$min_delta
    +[1] 0
    +
    +$batch_size
    +[1] 128
    +
    +$neurons
    +[1] 100 100
    +
    +$p
    +[1] 0.3738756
    +
    +$activation
    +<nn_relu> object generator
    +  Inherits from: <inherit>
    +  Public:
    +    .classes: nn_relu nn_module
    +    initialize: function (inplace = FALSE) 
    +    forward: function (input) 
    +    clone: function (deep = FALSE, ..., replace_values = TRUE) 
    +  Private:
    +    .__clone_r6__: function (deep = FALSE) 
    +  Parent env: <environment: 0x12f15a7b8>
    +  Locked objects: FALSE
    +  Locked class: FALSE
    +  Portable: TRUE
    +
    +$activation_args
    +list()
    diff --git a/docs/notebooks/6-training-efficiency-exercise-task.html b/docs/notebooks/6-training-efficiency-exercise-task.html index 51a740e..a47a088 100644 --- a/docs/notebooks/6-training-efficiency-exercise-task.html +++ b/docs/notebooks/6-training-efficiency-exercise-task.html @@ -321,27 +321,18 @@

    Training Efficiency

  • Utilizes a batch size of 128.
  • Trains for 200 epochs.
  • Employs a validation set comprising 30% of the data.
  • -
  • Tracks the training and validation log-loss during training.
  • +
  • Tracks the validation log-loss.
  • Utilizes trace-jitting to speed up the training process.
  • Employs the history callback to record the training and validation log-loss during training.
  • Afterward, plot the validation log-loss, which is accessible via learner$model$callbacks$history.

    -

    Below, we create the task and remove the gender feature for simplicity.

    +

    Below, we create the task and remove the gender feature again for simplicity.

    -
    library(mlr3verse)
    -
    -
    Loading required package: mlr3
    -
    -
    library(mlr3torch)
    -
    -
    Loading required package: mlr3pipelines
    -
    -
    -
    Loading required package: torch
    -
    -
    ilpd_num <- tsk("ilpd")
    -ilpd_num$select(setdiff(ilpd_num$feature_names, "gender"))
    -ilpd_num
    +
    library(mlr3verse)
    +library(mlr3torch)
    +ilpd_num <- tsk("ilpd")
    +ilpd_num$select(setdiff(ilpd_num$feature_names, "gender"))
    +ilpd_num
    <TaskClassif:ilpd> (583 x 10): Indian Liver Patient Data
     * Target: diseased
    @@ -351,18 +342,8 @@ 

    Training Efficiency

    - int (4): age, alanine_transaminase, alkaline_phosphatase, aspartate_transaminase
    -
    - -Hint - -
      -
    • To specify the validation set, use the validate field, which can either be set during construction or by calling $configure().
    • -
    • Trace-jitting can be enabled via the jit_trace parameter.
    • -
    • The history callback can be constructed via t_clbk("history") and needs to be passed during the construction of the learner.
    • -
    • The validation and measures can be specified via measures_valid and take a measure object that is constructed via msr().
    • -
    -
    -

    Question 2: Early Stopping Enable early stopping to prevent overfitting and re-train the learner (using a patience of 10). Print the final validation performance of the learner and the early stopped results. You can consult the documentation of LearnerTorch on how to access these two results (section Active Bindings).

    +

    Question 2: Early Stopping

    +

    Enable early stopping to prevent overfitting and re-train the learner (using a patience of 10). Print the final validation performance of the learner and the early stopped results. You can consult the documentation of LearnerTorch on how to access these (see section Active Bindings).

    Hint @@ -371,16 +352,15 @@

    Training Efficiency

    Question 3: Early Stopping and Dropout Tuning

    While early stopping in itself is already useful, mlr3torch also allows you to simultaneously tune the number of epochs using early stopping while tuning other hyperparameters via traditional hyperparameter tuning from mlr3tuning.

    -

    One thing we have not covered so far is that the MLP learner we have used so far also uses a dropout layer. The dropout probability can be configured via the p parameter.

    -

    Your task is to tune the dropout probability p in the range \([0, 1]\) and the epochs using early stopping (using the configuration from the previous exercise).

    +

    One thing we have not mentioned so far is that the MLP learner also uses a dropout layer. The dropout probability can be configured via the p parameter.

    +

    Your task is to tune the dropout probability p in the range \([0, 1]\) and the epochs using early stopping (using the configuration from the previous exercise) with an upper bound of 100 epochs.

    To adapt this to work with early stopping, you need to set the:

    1. epochs to to_tune(upper = <value>, internal = TRUE): This tells the Tuner that the learner will tune the number of epochs itself.
    2. $validate field to "test" so that the same data is used for tuning and validation.
    3. Tuning measure to msr("internal_valid_score", minimize = TRUE). We set minimize to TRUE because we have used the log-loss as a validation measure.
    -

    Apart from this, the tuning works just like in tutorial 5. Use 3-fold cross-validation for the tuning and evaluate 10 configurations.

    -

    Run the tuning and print the optimal configuration.

    +

    Apart from this, the tuning works just like in tutorial 5. Use 3-fold cross-validation and evaluate 10 configurations using random search. Finally, print the optimal configuration.

    diff --git a/docs/notebooks/6-training-efficiency-exercise.html b/docs/notebooks/6-training-efficiency-exercise.html index 51a740e..a47a088 100644 --- a/docs/notebooks/6-training-efficiency-exercise.html +++ b/docs/notebooks/6-training-efficiency-exercise.html @@ -321,27 +321,18 @@

    Training Efficiency

  • Utilizes a batch size of 128.
  • Trains for 200 epochs.
  • Employs a validation set comprising 30% of the data.
  • -
  • Tracks the training and validation log-loss during training.
  • +
  • Tracks the validation log-loss.
  • Utilizes trace-jitting to speed up the training process.
  • Employs the history callback to record the training and validation log-loss during training.
  • Afterward, plot the validation log-loss, which is accessible via learner$model$callbacks$history.

    -

    Below, we create the task and remove the gender feature for simplicity.

    +

    Below, we create the task and remove the gender feature again for simplicity.

    -
    library(mlr3verse)
    -
    -
    Loading required package: mlr3
    -
    -
    library(mlr3torch)
    -
    -
    Loading required package: mlr3pipelines
    -
    -
    -
    Loading required package: torch
    -
    -
    ilpd_num <- tsk("ilpd")
    -ilpd_num$select(setdiff(ilpd_num$feature_names, "gender"))
    -ilpd_num
    +
    library(mlr3verse)
    +library(mlr3torch)
    +ilpd_num <- tsk("ilpd")
    +ilpd_num$select(setdiff(ilpd_num$feature_names, "gender"))
    +ilpd_num
    <TaskClassif:ilpd> (583 x 10): Indian Liver Patient Data
     * Target: diseased
    @@ -351,18 +342,8 @@ 

    Training Efficiency

    - int (4): age, alanine_transaminase, alkaline_phosphatase, aspartate_transaminase
    -
    - -Hint - -
      -
    • To specify the validation set, use the validate field, which can either be set during construction or by calling $configure().
    • -
    • Trace-jitting can be enabled via the jit_trace parameter.
    • -
    • The history callback can be constructed via t_clbk("history") and needs to be passed during the construction of the learner.
    • -
    • The validation and measures can be specified via measures_valid and take a measure object that is constructed via msr().
    • -
    -
    -

    Question 2: Early Stopping Enable early stopping to prevent overfitting and re-train the learner (using a patience of 10). Print the final validation performance of the learner and the early stopped results. You can consult the documentation of LearnerTorch on how to access these two results (section Active Bindings).

    +

    Question 2: Early Stopping

    +

    Enable early stopping to prevent overfitting and re-train the learner (using a patience of 10). Print the final validation performance of the learner and the early stopped results. You can consult the documentation of LearnerTorch on how to access these (see section Active Bindings).

    Hint @@ -371,16 +352,15 @@

    Training Efficiency

    Question 3: Early Stopping and Dropout Tuning

    While early stopping in itself is already useful, mlr3torch also allows you to simultaneously tune the number of epochs using early stopping while tuning other hyperparameters via traditional hyperparameter tuning from mlr3tuning.

    -

    One thing we have not covered so far is that the MLP learner we have used so far also uses a dropout layer. The dropout probability can be configured via the p parameter.

    -

    Your task is to tune the dropout probability p in the range \([0, 1]\) and the epochs using early stopping (using the configuration from the previous exercise).

    +

    One thing we have not mentioned so far is that the MLP learner also uses a dropout layer. The dropout probability can be configured via the p parameter.

    +

    Your task is to tune the dropout probability p in the range \([0, 1]\) and the epochs using early stopping (using the configuration from the previous exercise) with an upper bound of 100 epochs.

    To adapt this to work with early stopping, you need to set the:

    1. epochs to to_tune(upper = <value>, internal = TRUE): This tells the Tuner that the learner will tune the number of epochs itself.
    2. $validate field to "test" so that the same data is used for tuning and validation.
    3. Tuning measure to msr("internal_valid_score", minimize = TRUE). We set minimize to TRUE because we have used the log-loss as a validation measure.
    -

    Apart from this, the tuning works just like in tutorial 5. Use 3-fold cross-validation for the tuning and evaluate 10 configurations.

    -

    Run the tuning and print the optimal configuration.

    +

    Apart from this, the tuning works just like in tutorial 5. Use 3-fold cross-validation and evaluate 10 configurations using random search. Finally, print the optimal configuration.

    diff --git a/docs/notebooks/6-training-efficiency.html b/docs/notebooks/6-training-efficiency.html index 8b2e8bd..31578b7 100644 --- a/docs/notebooks/6-training-efficiency.html +++ b/docs/notebooks/6-training-efficiency.html @@ -316,8 +316,8 @@

    Training Efficiency

    Methods for increasing training efficiency can be roughly split into:

      -
    1. Computational methods such as JIT compilation, using GPU, parallel data loading, etc., that allow doing the same thing faster.
    2. -
    3. Methodological approaches that change how we approach modeling to achieve either better results or faster training.
    4. +
    5. Computational methods such as JIT compilation, using GPUs, parallel data loading, etc., that allow doing the same thing faster.
    6. +
    7. Methodological approaches that change how we approach modeling to achieve either better results or the same results faster.

    Computational Approaches

    @@ -325,7 +325,7 @@

    Computational Approaches

    Parallel Processing

    Graphical Processing Unit (GPU)

    -

    Using a GPU is crucial when training relatively large neural networks because GPUs are specifically designed to handle the parallel processing required for complex computations. To use a GPU in mlr3torch, we can set the device parameter to “cuda”. By default, it is set to “auto”, which will use a GPU if it is available and otherwise fall back to the CPU.

    +

    Using a GPU is crucial when training relatively large neural networks because GPUs are specifically designed to handle the parallel processing required for complex computations. To use a GPU in mlr3torch, we can set the device parameter to “cuda”. By default, it is set to “auto”, which will use a GPU if available and otherwise fall back to the CPU.
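For example, the device can be set directly when constructing a learner; this is only a minimal sketch (the classif.mlp learner and its hyperparameter values are arbitrary placeholders, not part of the original text):

```r
library(mlr3torch)

# request GPU training explicitly; with device = "auto" (the default),
# a GPU is used whenever one is available and the CPU otherwise
learner_gpu = lrn("classif.mlp",
  epochs = 10,
  batch_size = 32,
  device = "cuda"
)
```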

    @@ -370,7 +370,7 @@

    Graphical Pr

    CPU Threads

    -

    Training large networks on a CPU is not a recommended approach, but it can be useful for smaller networks or when you don’t have a GPU. You can still use multiple threads to speed up the execution of operations. Note that the code below will not run on macOS, as it is not possible to set the number of threads on macOS.

    +

    Training large networks on a CPU is not a recommended approach, but it can be a viable option for smaller networks. You can still use multiple threads to speed up the execution of operations. Please be aware that the code below will not execute on macOS, as setting the number of threads is not supported on this operating system.

    # this will be skipped on macOS
     bench::mark(
    @@ -414,11 +414,11 @@ 

    CPU Threads

    Efficient Data Loading

    -

    Besides speeding up the computation of operations in the forward and backward pass, another possible bottleneck is the loading of data. There are various ways to improve data loading speed:

    +

    Besides parallelizing the computation of operations in the forward and backward pass, another possible bottleneck is the loading of data. There are various ways to improve data loading speed:

    1. Improve the implementation of the dataset class
    2. Parallelize the data loading process
    3. -
    4. Move data to the GPU
    5. +
    6. Increase the speed of data transfer to the GPU

    These approaches will now be discussed.

    @@ -438,7 +438,7 @@

    Efficient

    -

    The tiny imagenet dataset is a dataset of 100,000 images of size 64x64x3. It is a subset of the famous imagenet dataset. Below, we show some examples from the dataset:

    +

    The tiny imagenet dataset is a dataset of 100,000 images of size 64x64. It is a subset of the famous imagenet dataset. Below, we show some examples from it:

    We will now consider different ways to write a torch::dataset implementation for this data. Assume we have some image paths stored in a character vector as well as in an array where they are already loaded into memory.

    @@ -482,7 +482,7 @@

    Efficient initialize = function(image_array) { self$image_array = image_array }, - .getbatch = function(i) { + .getitem = function(i) { torch_tensor(self$image_array[i, , , ]) }, .length = function() { @@ -506,33 +506,36 @@

    Efficient } } bench::mark( - disk = iter(ds_disk), - ram = iter(ds_ram), + disk = iter(ds_disk, epochs = 10), + ram = iter(ds_ram, epochs = 10), check = FALSE )

    +
    +
    Warning: Some expressions had a GC in every iteration; so filtering is disabled.
    +
    # A tibble: 2 × 6
       expression      min   median `itr/sec` mem_alloc `gc/sec`
       <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
    -1 disk           18ms  20.01ms      47.6      14MB     14.0
    -2 ram           8.4ms   9.06ms     110.      9.4MB     26.0
+1 disk           228ms    258ms      3.88     109MB     9.69
+2 ram            183ms    184ms      5.33    94.4MB    10.7

    Question 2: (Don’t) Copy that

    Consider now the next dataset implementation:

    -
    ds_tensor = dataset("tensor",
    -  initialize = function(image_array) {
    -    self$tensor = torch_tensor(image_array)
    -  },
    -  .getitem = function(i) {
    -    self$tensor[i, ..]
    -  },
    -  .length = function() {
    -    nrow(self$tensor)
    -  }
    -)(image_array)
    +
    ds_tensor = dataset("tensor",
    +  initialize = function(image_array) {
    +    self$tensor = torch_tensor(image_array)
    +  },
    +  .getitem = function(i) {
    +    self$tensor[i, ..]
    +  },
    +  .length = function() {
    +    nrow(self$tensor)
    +  }
    +)(image_array)

    Do you think this implementation is faster or slower than the ds_ram implementation? Explain why.

    @@ -541,34 +544,34 @@

    Efficient

    This implementation is faster than the ds_ram implementation. This is because the ds_tensor implementation copies the R array to a torch tensor only once, whereas the ds_ram implementation copies the R array to a torch tensor for each item.

    -
    bench::mark(
    -  tensor = iter(ds_tensor),
    -  array = iter(ds_ram),
    -  check = FALSE
    -)
    +
    bench::mark(
    +  tensor = iter(ds_tensor),
    +  array = iter(ds_ram),
    +  check = FALSE
    +)
    # A tibble: 2 × 6
       expression      min   median `itr/sec` mem_alloc `gc/sec`
       <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
    -1 tensor       4.62ms   5.06ms      196.   96.08KB     6.77
    -2 array        8.03ms   9.28ms      107.    9.38MB    27.6 
+1 tensor       4.52ms   4.82ms      206.   96.08KB     6.71
+2 array       14.94ms  16.26ms      62.0    9.44MB    16.9

    Question 3: $.getbatch() vs $.getitem()

    Which implementation is faster? Explain why.

    -
    ds_tensor_batch = dataset("tensor_batch",
    -  initialize = function(image_array) {
    -    self$tensor = torch_tensor(image_array)
    -  },
    -  .getbatch = function(i) {
    -    self$tensor[i, ..]
    -  },
    -  .length = function() {
    -    nrow(self$tensor)
    -  }
    -)(image_array)
    +
    ds_tensor_batch = dataset("tensor_batch",
    +  initialize = function(image_array) {
    +    self$tensor = torch_tensor(image_array)
    +  },
    +  .getbatch = function(i) {
    +    self$tensor[i, .., drop = FALSE]
    +  },
    +  .length = function() {
    +    nrow(self$tensor)
    +  }
    +)(image_array)
    @@ -576,17 +579,17 @@

    Efficient

    The $.getbatch() implementation is faster than the $.getitem() implementation. This is because when using the $.getitem() method, the batch for indices ids is obtained by calling $.getitem(id) for each index in ids and then stacking them together, which requires a new tensor allocation. Slicing the tensor, however, avoids this allocation when shuffle = TRUE (which is also the default).

    -
    bench::mark(
    -  getbatch = iter(ds_tensor_batch),
    -  getitem = iter(ds_tensor),
    -  check = FALSE
    -)
    +
    bench::mark(
    +  getbatch = iter(ds_tensor_batch),
    +  getitem = iter(ds_tensor),
    +  check = FALSE
    +)
    # A tibble: 2 × 6
       expression      min   median `itr/sec` mem_alloc `gc/sec`
       <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
    -1 getbatch     1.69ms   1.99ms      466.    3.83KB     4.25
    -2 getitem      4.48ms   4.85ms      204.   54.69KB     7.19
+1 getbatch     1.75ms   1.94ms      502.      23KB     4.69
+2 getitem      4.61ms   4.94ms      198.    54.7KB     7.54
    @@ -608,26 +611,26 @@

    Parallel Data Loadin
    -

    Note that there is some communication overhead that results from sending the batches from the worker to the main process. This will hopefully be reduced in the future, but is currently there. For this reason, parallel data loading is therefore – currently – only beneficial when it is slow, e.g., because of loading the data from disk or because of expensive preprocessing.

    +

    Note that in the current implementation, parallel data loading is only beneficial when it is slow, e.g., because of loading the data from disk or because of expensive preprocessing. This will hopefully be improved in the future (by a faster implementation of the parallel dataloader).
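As a rough sketch of what parallel loading looks like with plain torch (reusing the ds_disk dataset from above; the number of workers is an arbitrary choice):

```r
library(torch)

# two background worker processes prepare batches; this only pays off
# when producing a batch (e.g. reading images from disk or expensive
# preprocessing) is slower than the communication overhead
dl_parallel = dataloader(ds_disk, batch_size = 16, num_workers = 2)

coro::loop(for (batch in dl_parallel) {
  # training step would go here
})
```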

    Moving Data to the GPU

    -

    One thing we have ignored so far is that when training using a GPU, the data needs to be moved to the GPU. This is because a GPU has its own memory (VRAM), and the data needs to be moved to this memory before it can be used for training. The moving of the data to the GPU cannot be done on the processes that are loading the data but must be done in the main process, i.e., after the batch was received from (possibly parallelized) dataloader. One way to speed up the data loading process is to pin the memory of the data to the GPU. Before a tensor can be moved from RAM to VRAM, it needs to be in so-called page-locked memory, which can be done using the pin_memory parameter.

    +

One thing we have ignored so far is that when training on a GPU, the data needs to be moved from RAM to the GPU. This is because a GPU has its own memory (VRAM), and the data must be transferred there before it can be used for training. This transfer cannot be done by the processes that load the data; it has to happen in the main process, i.e., after a batch has been received from the (possibly parallelized) dataloader. One way to speed up this step is memory pinning: before a tensor can be moved from RAM to VRAM, it needs to reside in so-called page-locked memory, which can be enabled via the pin_memory parameter of dataloader().

    -
    iter_cuda = function(ds, pin_memory = TRUE) {
    -  dl = torch::dataloader(ds, batch_size = 16, pin_memory = pin_memory)
    -  coro::loop(for(batch in dl) {
    -    batch$cuda()
    -  })
    -}
    -
    -bench::mark(
    -  not_pinned = iter_cuda(ds_disk, pin_memory = FALSE),
    -  pinned = iter_cuda(ds_disk, pin_memory = TRUE)
    -)
    +
    iter_cuda = function(ds, pin_memory = TRUE) {
    +  dl = torch::dataloader(ds, batch_size = 16, pin_memory = pin_memory)
    +  coro::loop(for(batch in dl) {
    +    batch$cuda()
    +  })
    +}
    +
    +bench::mark(
    +  not_pinned = iter_cuda(ds_disk, pin_memory = FALSE),
    +  pinned = iter_cuda(ds_disk, pin_memory = TRUE)
    +)
    @@ -641,22 +644,7 @@

    Moving Data to the

    In order to use parallel data loading or memory pinning with mlr3torch, these parameters can simply be specified in the learner:

    -
    lrn("classif.mlp", num_workers = 8L, pin_memory = TRUE, device = "cuda")
    -
    -
    <LearnerTorchMLP[classif]:classif.mlp>: My Little Powny
    -* Model: -
    -* Parameters: device=cuda, num_threads=1, num_interop_threads=1, seed=random, jit_trace=FALSE, eval_freq=1,
    -  measures_train=<list>, measures_valid=<list>, patience=0, min_delta=0, num_workers=8, pin_memory=TRUE,
    -  neurons=integer(0), p=0.5, activation=<nn_relu>, activation_args=<list>
    -* Validate: NULL
    -* Packages: mlr3, mlr3torch, torch
    -* Predict Types:  [response], prob
    -* Feature Types: integer, numeric, lazy_tensor
    -* Properties: internal_tuning, marshal, multiclass, twoclass, validation
    -* Optimizer: adam
    -* Loss: cross_entropy
    -* Callbacks: -
    -
    +
    lrn("classif.mlp", num_workers = 8L, pin_memory = TRUE, device = "cuda")

    @@ -678,7 +666,7 @@

    ‘Ignite’ Optimizers< loop_fun: function (group, param, g, p) step: function (closure = NULL) clone: function (deep = FALSE) - Parent env: <environment: 0x143916d88> + Parent env: <environment: 0x12cfbcbc8> Locked objects: FALSE Locked class: FALSE Portable: TRUE @@ -702,13 +690,13 @@

    ‘Ignite’ Optimizers< .set_param_group_options: function (opt, list) .zero_grad: function (opt) .get_param_groups: function (ptr) - Parent env: <environment: 0x117450168> + Parent env: <environment: 0x12cae2478> Locked objects: FALSE Locked class: FALSE Portable: TRUE

    -

    The ‘ignite’ indicates that the optimizer is a version that is optimized for performance. Not for all optimizers does an ignite version exist, but for the most common ones, it does.

    +

The ‘ignite’ indicates that the optimizer is a version that is optimized for performance. An ignite version does not exist for every optimizer, but it does for the most common ones.

    Below, we compare the performance of the default optimizer and the ignite optimizer and see that the latter is considerably faster.

    adamw = as_torch_optimizer(torch::optim_adamw)
    @@ -734,8 +722,8 @@ 

    ‘Ignite’ Optimizers<
    # A tibble: 2 × 6
       expression                            min   median `itr/sec` mem_alloc `gc/sec`
       <bch:expr>                       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
    -1 learner$train(task_sonar)           667ms    667ms      1.50    15.7MB     4.49
    -2 learner_ignite$train(task_sonar)    202ms    211ms      4.78    10.7MB     4.78
+1 learner$train(task_sonar)           588ms    588ms      1.70    15.7MB     5.10
+2 learner_ignite$train(task_sonar)    204ms    208ms      4.77    10.7MB     4.77

    @@ -767,7 +755,7 @@

    TorchScript

    Float [1:10, 1:1]
    -

    Besides syntax, there are some important differences between TorchScript and R:

    +

Besides syntax, there are some notable differences between TorchScript and R to be aware of (see the short sketch after this list):

    1. In TorchScript, indexing tensors is 0-based, and
    2. TorchScript is statically typed, so you need to specify the types of the arguments, unless they are tensors, which is the default.
    3. @@ -806,7 +794,7 @@
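To make the first two points concrete, here is a small sketch using jit_compile() (the function scaled_first_row and its inputs are made up for illustration; exact scalar handling may vary between torch versions):

```r
library(torch)

# TorchScript: the non-tensor argument `k` needs a type annotation
# (static typing), and `x[0]` selects the *first* row (0-based indexing)
script = jit_compile("
def scaled_first_row(x, k: int):
    return x[0] * k
")

x = torch_tensor(matrix(1:6, nrow = 2, byrow = TRUE))
script$scaled_first_row(x, 2L)  # first row of x, multiplied by 2
```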

      Tracing

      [1] TRUE
      -

      An advantage of trace-compilation is that it even allows you to JIT-compile modules, which is currently not possible with jit_compile.

      +

      An advantage of trace-compilation is that it can be applied to modules, which is currently not possible with jit_compile.

      net = nn_sequential(
         nn_linear(10, 100),
      @@ -820,7 +808,7 @@ 

      Tracing

      [1] TRUE
      -

      Trace-compilation is restrictive because it only records operations applied to torch tensors and is unaware of R control flow, so you need to be careful when using it. Furthermore, it only accepts torch tensors as arguments. Unless you have dynamic inputs and outputs or modify the configuration of the module, trace-compilation should usually work. You can also check this by running the original and trace-jitted module on some example inputs and see if they return the same result.

      +

      However, trace-compilation is restrictive because it only records operations applied to torch tensors and is unaware of R control flow, so you need to be careful when using it. Furthermore, it only accepts torch tensors as arguments. For many simple modules, trace-compilation should usually work. You can also check this by running the original and trace-jitted module on some example inputs and see if they return the same result.

      @@ -884,47 +872,64 @@

      Tracing

      } fjit = jit_trace(f, torch_tensor(1), torch_tensor(2), torch_tensor(TRUE)) -fjit(torch_tensor(2), torch_tensor(3), torch_tensor(TRUE))
      +fjit(torch_tensor(2), torch_tensor(3), torch_tensor(TRUE)) +fjit(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE))
      +
      +
      + +Click for answer + +
      +
      fjit(torch_tensor(2), torch_tensor(3), torch_tensor(TRUE))
      torch_tensor
        6
       [ CPUFloatType{1} ]
      -
      fjit(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE))
      +
      fjit(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE))
      torch_tensor
        6
       [ CPUFloatType{1} ]
      +

      Question 2: Answer the same question for the following function:

      -
      f = function(a, b, multiply) {
      -  torch_where(multiply, a * b, a + b)
      -}
      -fjit = jit_trace(f, torch_tensor(1), torch_tensor(2), torch_tensor(TRUE))
      -
      -fjit(torch_tensor(2), torch_tensor(3), torch_tensor(TRUE))
      +
      f = function(a, b, multiply) {
      +  torch_where(multiply, a * b, a + b)
      +}
      +fjit = jit_trace(f, torch_tensor(1), torch_tensor(2), torch_tensor(TRUE))
      +
      +fjit(torch_tensor(2), torch_tensor(3), torch_tensor(TRUE))
      +fjit(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE))
      +
      +
      + +Click for answer + +
      +
      fjit(torch_tensor(2), torch_tensor(3), torch_tensor(TRUE))
      torch_tensor
        6
       [ CPUFloatType{1} ]
      -
      fjit(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE))
      +
      fjit(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE))
      torch_tensor
        5
       [ CPUFloatType{1} ]
      +

      Mixed Precision Training

      -

      Another way to speed up the training process is to use mixed precision training. This technique involves training the model using both 16-bit and 32-bit floating point numbers. This allows reducing the memory footprint of the model and speeding up the training process.

      -

      We won’t cover this here, but refer to the torch documentation that explains how to do this.

      +

      Another way to speed up the training process is to use mixed precision training. This technique involves training the model using both 16-bit and 32-bit floating point numbers. This allows reducing the memory footprint of the model and speeding up the training process. We won’t cover this here, but refer to the torch documentation that explains how to do this.
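As a very rough, hedged sketch of what a mixed precision training step can look like (this assumes the autocast helpers with_autocast() and cuda_amp_grad_scaler() available in recent torch versions; the model, optimizer, loss, and data below are placeholders and a CUDA device is required):

```r
library(torch)

net = nn_linear(10, 1)$cuda()                  # placeholder model
opt = optim_adamw(net$parameters, lr = 1e-3)   # placeholder optimizer
loss_fn = nn_mse_loss()
x = torch_randn(64, 10, device = "cuda")       # placeholder batch
y = torch_randn(64, 1, device = "cuda")

scaler = cuda_amp_grad_scaler()

# the forward pass runs (partly) in float16; the scaler rescales the loss
# so that small gradients do not underflow before the optimizer step
with_autocast(device_type = "cuda", {
  loss = loss_fn(net(x), y)
})
scaler$scale(loss)$backward()
scaler$step(opt)
scaler$update()
opt$zero_grad()
```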

      @@ -936,49 +941,27 @@

      Validation a

      In mlr3torch, we can track the performance of the model on a validation set by specifying:

      • validate, which is the ratio of the data that is used for validation
      • -
      • measures_valid, which is a list of measures to use for validation
      • +
      • measures_valid, which is a list of measures to evaluate the validation performance
      • eval_freq, which is the frequency at which the validation is performed
      • -
      • callbacks, which is a list of callbacks to use during training, in this case, we use the history callback, which records the performance of the model on the validation set at regular intervals, enabling us to monitor and analyze the model’s performance over time.
      • +
      • callbacks, which is a list of callbacks to use during training, in this case, we use the t_clbk("history") callback, which records the performance of the model on the validation set at regular intervals, enabling us to monitor and analyze the model’s performance over time.
      -
      -
      -
      - -
      -
      -Tip -
      -
      -
      -

      While mlr3torch comes with predefined callbacks, it is also possible to define custom callbacks that modify the training process.

      -
      -
      -
      task = tsk("sonar")
      -
      -mlp_learner = lrn("classif.mlp",
      -  neurons = c(50, 50), batch_size = 256, epochs = 400,
      -  optimizer = t_opt("adam", lr = 0.003),
      -  predict_type = "prob", jit_trace = TRUE,
      -  # Validation / Performance Monitoring
      -  validate = 0.3, # how much data to use for validation
      -  measures_valid = msr("classif.logloss"), # how to evaluate train performance
      -  measures_train = msr("classif.logloss"), # how to evaluate validation performance
      -  callbacks = t_clbk("history"), # history callbacks save train and validation performance
      -  eval_freq = 10 # after how many training epochs to perform validation
      -)
      -mlp_learner$train(task)
      -history = mlp_learner$model$callbacks$history
      -str(history)
      -
      -
      Classes 'data.table' and 'data.frame':  40 obs. of  3 variables:
      - $ epoch                : num  10 20 30 40 50 60 70 80 90 100 ...
      - $ train.classif.logloss: num  0.678 0.643 0.569 0.515 0.478 ...
      - $ valid.classif.logloss: num  0.667 0.618 0.536 0.469 0.436 ...
      - - attr(*, ".internal.selfref")=<externalptr> 
      - - attr(*, "sorted")= chr "epoch"
      -
      -
      head(history)
      +
      task = tsk("sonar")
      +
      +mlp_learner = lrn("classif.mlp",
      +  neurons = c(50, 50), batch_size = 256, epochs = 400,
      +  optimizer = t_opt("adam", lr = 0.003),
      +  predict_type = "prob", jit_trace = TRUE,
      +  # Validation / Performance Monitoring
      +  validate = 0.3, # how much data to use for validation
+  measures_valid = msr("classif.logloss"), # how to evaluate validation performance
+  measures_train = msr("classif.logloss"), # how to evaluate training performance
      +  callbacks = t_clbk("history"), # history callbacks save train and validation performance
      +  eval_freq = 10 # after how many training epochs to perform validation
      +)
      +mlp_learner$train(task)
      +history = mlp_learner$model$callbacks$history
      +head(history)
      Key: <epoch>
          epoch train.classif.logloss valid.classif.logloss
      @@ -996,19 +979,19 @@ 

      Validation a
      -

      +

      -

      Instead of only monitoring the validation loss (and watching it get worse and worse), we can also stop the training process dynamically when the validation loss begins to increase. This regularization technique is called early stopping, and it prevents overfitting during the training of iteratively trained machine learning models. It involves monitoring the validation loss during training and stopping the training process when the validation loss begins to increase, indicating that the model is starting to overfit the training data.

      -

      The key configuration option for early stopping is the patience parameter, which defines the number of epochs to wait after the last improvement in validation loss before stopping the training. For example, if patience is set to 10, the training will continue for 10 additional epochs after the last observed improvement in validation loss. If no improvement is seen during this period, training will be halted.

      +

      Instead of only monitoring the validation loss (and watching it get worse and worse), we can also stop the training process dynamically when the validation loss begins to increase. This regularization technique is called early stopping, and it prevents overfitting during the training of iteratively trained machine learning models.

      +

      The key configuration option for early stopping is the patience parameter, which defines the number of epochs to wait after the last improvement in validation loss before stopping the training. For example, if the patience is set to 5, the training will continue for 5 additional epochs after the last observed improvement in validation loss. If no improvement is seen during this period, training will be halted.

      Advantages of early stopping include:

      • Prevention of Overfitting: By stopping training when the model starts to overfit, we can achieve better generalization on unseen data.
      • Resource Efficiency: It saves computational resources by avoiding unnecessary training epochs once the model performance has plateaued.
      -

      Now, let’s train the learner again using early stopping with a patience of 10 epochs:

      +

      Now, let’s train the learner again using early stopping with a patience of 5 epochs:

      mlp_learner$param_set$set_values(
         patience = 5
      @@ -1019,7 +1002,7 @@ 

      Validation a
      [1] 160

      -

      Beyond only tuning the number of epochs, mlr3’s internal tuning mechanism also allows tuning the number of epochs internally while using an offline tuning method to optimize other hyperparameters. To use this, we can set the parameters we want to tune TuneTokens:

      +

      Beyond only tuning the number of epochs, mlr3’s internal tuning mechanism also allows tuning the number of epochs internally while using an offline tuning method to optimize other hyperparameters. To use this, we can set the parameters we want to tune using to_tune(), but need to set internal = TRUE for the epochs parameter.

      library(mlr3tuning)
       mlp_learner$param_set$set_values(
      @@ -1027,12 +1010,12 @@ 

      Validation a opt.lr = to_tune(lower = 1e-4, upper = 1e-1, logscale = TRUE) )

      -

      We could now pass this learner to a tuner, where the tuner would only optimize the learning rate, while the learner optimizes the epochs internally.

      +

      We could now pass this learner to a tuner as usual.
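For instance, a minimal sketch of such a call, reusing the learner and task from above (the tuner, resampling, and evaluation budget are arbitrary choices):

```r
library(mlr3tuning)

# validate = "test" so that each resampling iteration's test split is
# also used as the internal validation data for early stopping
mlp_learner$configure(validate = "test")

ti = tune(
  tuner = tnr("random_search"),
  task = task,
  learner = mlp_learner,
  resampling = rsmp("cv", folds = 3),
  measure = msr("internal_valid_score", minimize = TRUE),
  term_evals = 10
)

# the internally tuned number of epochs ends up in the result as well
ti$result_learner_param_vals$epochs
```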

      Architecture Design

      -

      Another essential aspect of training neural networks efficiently and effectively is the design of the network architecture, which can be a challenging task. However, for many tasks, there are well-known architectures that perform well and can be used as a starting point. Unless there is a specific reason to design a new architecture, it is recommended to use such an architecture.

      +

      Another essential aspect of training neural networks efficiently and effectively is the design of the network architecture, which can be a challenging task. However, for many problems, there are predefined architectures that perform well and can be used. Unless there is a specific reason to design a new architecture, it is recommended to use such an architecture.

      @@ -1043,7 +1026,7 @@

      Architecture Design

      -

      Because the Python deep learning ecosystem is so large, many more architectures are implemented in Python than in R. One way to use them in R is to simply translate the PyTorch code to (R-)torch. While PyTorch and (R-)torch are quite similar, there are some differences, e.g., 1-based and 0-based indexing. The torch website contains a brief tutorial on how to do this.

      +

      Because the Python deep learning ecosystem is so large, many more architectures are implemented in Python than in R. One way to use them in R is to simply translate the PyTorch code to (R-)torch. While PyTorch and (R-)torch are quite similar, there are some differences, e.g., 1-based and 0-based indexing. The torch website contains a brief tutorial on this topic.

      Nonetheless, we will cover important techniques that can be used to speed up the training process, namely batch normalization and dropout.

      @@ -1063,7 +1046,7 @@

      Batch Normalization\(\epsilon\) is a small constant added for numerical stability.

      During inference, the module uses the running mean and variance of the training data to normalize the input.

      -

      In torch, different versions of batch normalization exist for different dimensions of the input tensor. Below, we illustrate the batch normalization module using a 1D input tensor (the batch dimension does not count here)

      +

      In torch, different versions of batch normalization exist for different dimensions of the input tensor. Below, we illustrate the batch normalization module using a 1D input tensor (the batch dimension does not count here):

      x = torch_randn(10, 5)
       bn = nn_batch_norm1d(num_features = 5)
      @@ -1093,14 +1076,14 @@ 

      Batch Normalization

      -

      Question 1: Earlier we have learned that nn_modules have buffers and parameters, where the latter are learned with gradient descent. Do you think the mean and variance are parameters or buffers?

      +

      Question 1: Earlier we have learned that nn_modules have buffers and parameters, where only the latter are learned with gradient descent. Do you think the mean and variance are parameters or buffers?

      Click for answer They are both buffers as they only store the variance and running mean of all training samples seen, i.e., they are not updated using gradient information.
      -

      Question 2: Training vs. Evaluation Mode: While many nn_modules behave the same way irrespective of their mode, batch normalization is an example of a module that behaves differently during training and evaluation, i.e., during training, the module uses the mean and variance of the current batch, while during evaluation, it uses the running mean and variance of all training samples seen.

      +

      Question 2: Training vs. Evaluation Mode: While many nn_modules behave the same way irrespective of their mode, batch normalization is an example of a module that behaves differently during training and evaluation. During training, the module uses the mean and variance of the current batch, while during evaluation, it uses the running mean and variance of all training samples seen.

      bn(x[1:10, ])
      @@ -1145,8 +1128,8 @@

      Batch Normalization

      -

      To illustrate its effectiveness, we will define a simple CNN, with and without batch normalization, train it on CIFAR-10, and compare their performance.

      -

      To build the neural networks, we will use mlr3torch, which allows building architectures from PipeOps. This makes the creation of network architectures easier, as we, e.g., don’t have to specify auxiliary parameters (such as the input dimension of a linear layer). Recall that the po("torch_ingress_ltnsr") is a special PipeOp that marks the input of the neural network. Note that po("nn_relu_1") is equivalent to po("nn_relu", id = "nn_relu_1"). We need to specify unique ID parameters as this is required in mlr3pipelines.

      +

      To illustrate its effectiveness, we will define a simple CNN, with and without batch normalization, train it on CIFAR-10, and compare their performances.

      +

      To build the neural networks, we will use mlr3torch, which allows building architectures from PipeOps. Recall that the po("torch_ingress_ltnsr") is a special PipeOp that marks the input of the neural network. Note that po("nn_relu_1") is equivalent to po("nn_relu", id = "nn_relu_1"). We need to specify unique IDs for each PipeOp as this is required in mlr3pipelines graphs.

      cnn_bn = po("torch_ingress_ltnsr") %>>%
         po("nn_conv2d_1", out_channels = 32, kernel_size = 3, stride = 1, padding = 1) %>>%
      @@ -1210,13 +1193,13 @@ 

      Batch Normalization

      Dropout

      -

      Dropout is a regularization technique used to prevent overfitting in neural networks by randomly setting a fraction of input units to zero during training. This encourages the network to learn more robust features that are not reliant on specific neurons, thereby improving its generalization capabilities. During each training iteration, dropout randomly “drops” a subset of neurons by setting their activations to zero with a specified probability (commonly between 20% to 50%). This forces the network to distribute the learned representations more evenly across neurons, reducing the reliance on any single neuron and mitigating overfitting. Dropout is more commonly used in the context of fully connected layers.

      +

      Dropout is a regularization technique used to prevent overfitting in neural networks. During each training iteration, dropout randomly “drops” a subset of neurons by setting their activations to zero with a specified probability (commonly between 20% to 50%). This forces the network to distribute the learned representations more evenly across neurons. Dropout is most commonly used in the context of fully connected layers.

      -

      Source: https://medium.com/konvergen/understanding-dropout-ddb60c9f98aa

      +

      Source

      Just like batch normalization, it also has different behavior during training and evaluation.

      dropout = nn_dropout(p = 0.5)
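A minimal sketch of this difference in behavior, assuming a random 10x5 input as before:

x = torch_randn(10, 5)
dropout(x)       # training mode: roughly half of the entries are zeroed, the rest scaled by 1 / (1 - p)
dropout$eval()
dropout(x)       # evaluation mode: dropout is disabled, the input is returned unchanged
dropout$train()  # switch back to training mode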
      @@ -1252,7 +1235,7 @@ 

      Dropout

      [ CPUFloatType{10,5} ]
      -

      To look at the effects, we will create a second classification head with dropout and then define new learners

      +

      To look at the effects, we will create a second classification head with dropout and then define new learners.

      head_dropout = po("nn_flatten") %>>%
         po("nn_linear", out_features = 128) %>>%
      @@ -1292,7 +1275,7 @@ 

      Dropout

Click for answer -Not necessarily, as dropout is a regularization technique that prevents overfitting. It’s goal is to reduce the generalization performance of the model. +Not necessarily, as dropout is a regularization technique that prevents overfitting. Its goal is to reduce the generalization error of the model, not to improve training performance.
      @@ -1309,8 +1292,8 @@

      Transfer Learning

When the model is then trained on a new task, only the last layer is replaced by a new output layer that matches the new task.

      This is visualized below:

      -

      Source: https://en.wikipedia.org/wiki/Transfer_learning

      -

      mlr3torch connects various pretrained image networks that are available in the torchvision package. The ResNet-18 model is a popular pre-trained model that was pretrained on ImageNet. We can use the pretrained weights by setting the pretrained parameter to TRUE.

      +

      Source

      +

mlr3torch offers various pretrained image networks that are available through the torchvision package. ResNet-18 is a popular model that was pretrained on ImageNet. We can use the pretrained weights by setting the pretrained parameter to TRUE.

      resnet = lrn("classif.resnet18",
         pretrained = TRUE,
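A minimal sketch of a full learner construction; the training parameters shown here, such as epochs and batch_size, are illustrative assumptions:

resnet_sketch = lrn("classif.resnet18",
   pretrained = TRUE,
   epochs = 2,            # assumed value for illustration
   batch_size = 32,       # assumed value for illustration
   predict_type = "prob"
)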
      @@ -1337,13 +1320,8 @@ 

      Transfer Learning

bmr = benchmark(grid, store_models = TRUE)
bmr$aggregate()
      -

      When fine-tuning a pretrained model like ResNet-18, it’s common to observe instabilities in gradients, which can manifest as fluctuating validation performance. This can e.g. be because the learning rate is too high (compared to the learning rate that was used during pretraining).

      -

      To address this, one can:

1. Use a smaller learning rate for the pretrained layers than for the new output head.
2. Freeze the pretrained layers (for some epochs) and only train the new output head.

      In mlr3torch this can be achieved via the callback mechanism. For the unfreezing, there even exists a predefined callback t_clbk("unfreeze"). To create a custom callback, the torch_callback() function can be used. A tutorial on this can be found on the mlr3torch package website.

      +

      When fine-tuning a pretrained model like ResNet-18, it’s common to observe instabilities in gradients, which can manifest as fluctuating validation performance.

      +

      To address this, one can for example freeze the pretrained layers (for some epochs) and only train the new output head. In mlr3torch, this can be achieved by using the t_clbk("unfreeze") callback.
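A minimal sketch of attaching this callback to the learner; epochs and batch_size are illustrative assumptions, and which weights start frozen as well as when they are unfrozen is configured via the callback's own parameters, which are omitted here:

resnet_frozen = lrn("classif.resnet18",
   pretrained = TRUE,
   epochs = 5,                       # assumed value for illustration
   batch_size = 32,                  # assumed value for illustration
   callbacks = t_clbk("unfreeze")    # unfreezing schedule is set via the callback's parameters
)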

      @@ -1354,7 +1332,7 @@

      Transfer Learning

      -

      Large foundation models (such as GPT-4) even allow to perform tasks on which they were not pretrained on without any finetuning. This is referred to as in-context learning or zero-shot learning. There, the task is fed into the model during inference: “Hey ChatGPT, is What is the sentiment of this sentence. Return -1 for sad, 0 for neutral, 1 for happy:

      +

Large foundation models (such as GPT-4) even allow performing tasks on which they were not pretrained without any finetuning. This is referred to as in-context learning or zero-shot learning. There, the task is fed into the model during inference, e.g.: “What is the sentiment of this sentence? Return -1 for sad, 0 for neutral, 1 for happy:

      @@ -1367,13 +1345,13 @@

      Data Augmentation

    4. If the goal is to predict whether there is a mark somewhere in the image, it would be admissible.
    5. In other words, the data augmentation must be compatible with the invariances of the task.

      -

      In mlr3torch, data augmentation is available via PipeOps of the form po("augment_"). Currently, only augemntation operators from the torchvision package are available, but you can also add your own.

      +

In mlr3torch, data augmentation is available via PipeOps with IDs of the form "augment_*", e.g., po("augment_random_horizontal_flip"). Currently, only augmentation operators from the torchvision package are available, but you can also add your own.

      augment = po("augment_random_resized_crop") %>>%
         po("augment_random_horizontal_flip") %>>%
         po("augment_random_vertical_flip")
      -

      We can just create a new GraphLearner that includes the augemntation steps as well as the learner from above:

      +

      We can just create a new GraphLearner that includes the augmentation steps as well as the learner from above:

      resnet_augmented = as_learner(augment %>>% resnet)
       resnet_augmented$id = "resnet_augmented"
      diff --git a/docs/notebooks/6-training-efficiency_files/figure-html/unnamed-chunk-30-1.png b/docs/notebooks/6-training-efficiency_files/figure-html/unnamed-chunk-32-1.png
      similarity index 100%
      rename from docs/notebooks/6-training-efficiency_files/figure-html/unnamed-chunk-30-1.png
      rename to docs/notebooks/6-training-efficiency_files/figure-html/unnamed-chunk-32-1.png
      diff --git a/docs/notebooks/resources.html b/docs/notebooks/resources.html
      index 74fd648..e841625 100644
      --- a/docs/notebooks/resources.html
      +++ b/docs/notebooks/resources.html
      @@ -254,7 +254,7 @@ 

      Resources

      • The torch website: https://torch.mlverse.org/
      • The torch package website: https://torch.mlverse.org/docs/
      • -
      • Book by Sigrid Keydana: Deep Learning and Scientific Computing with R torch on which part of this course is based.
      • +
      • Book by Sigrid Keydana: Deep Learning and Scientific Computing with R torch on which parts of this course are based.
      • mlr3torch package website: https://mlr3torch.mlr-org.com/
      • mlr3 book: https://mlr3book.mlr-org.com/
      • mlr3 website: https://mlr-org.com/
      • diff --git a/docs/search.json b/docs/search.json index 3f2d8cc..06a2c9a 100644 --- a/docs/search.json +++ b/docs/search.json @@ -32,7 +32,7 @@ "href": "notebooks/resources.html", "title": "Resources", "section": "", - "text": "The torch website: https://torch.mlverse.org/\nThe torch package website: https://torch.mlverse.org/docs/\nBook by Sigrid Keydana: Deep Learning and Scientific Computing with R torch on which part of this course is based.\nmlr3torch package website: https://mlr3torch.mlr-org.com/\nmlr3 book: https://mlr3book.mlr-org.com/\nmlr3 website: https://mlr-org.com/\nSince torch mimics PyTorch and the latter has a larger community, you can often learn from the PyTorch documentation." + "text": "The torch website: https://torch.mlverse.org/\nThe torch package website: https://torch.mlverse.org/docs/\nBook by Sigrid Keydana: Deep Learning and Scientific Computing with R torch on which parts of this course are based.\nmlr3torch package website: https://mlr3torch.mlr-org.com/\nmlr3 book: https://mlr3book.mlr-org.com/\nmlr3 website: https://mlr-org.com/\nSince torch mimics PyTorch and the latter has a larger community, you can often learn from the PyTorch documentation." }, { "objectID": "notebooks/7-usecase-exercise.html", @@ -46,14 +46,14 @@ "href": "notebooks/6-training-efficiency-exercise.html", "title": "Training Efficiency", "section": "", - "text": "Question 1: Validation\nIn this exercise, we will once again train a simple multi-layer perceptron on the Indian Liver Patient Dataset (ILPD). Create a learner that:\n\nUses 2 hidden layers with 100 neurons each.\nUtilizes a batch size of 128.\nTrains for 200 epochs.\nEmploys a validation set comprising 30% of the data.\nTracks the training and validation log-loss during training.\nUtilizes trace-jitting to speed up the training process.\nEmploys the history callback to record the training and validation log-loss during training.\n\nAfterward, plot the validation log-loss, which is accessible via learner$model$callbacks$history.\nBelow, we create the task and remove the gender feature for simplicity.\n\nlibrary(mlr3verse)\n\nLoading required package: mlr3\n\nlibrary(mlr3torch)\n\nLoading required package: mlr3pipelines\n\n\nLoading required package: torch\n\nilpd_num <- tsk(\"ilpd\")\nilpd_num$select(setdiff(ilpd_num$feature_names, \"gender\"))\nilpd_num\n\n<TaskClassif:ilpd> (583 x 10): Indian Liver Patient Data\n* Target: diseased\n* Properties: twoclass\n* Features (9):\n - dbl (5): albumin, albumin_globulin_ratio, direct_bilirubin, total_bilirubin, total_protein\n - int (4): age, alanine_transaminase, alkaline_phosphatase, aspartate_transaminase\n\n\n\n\nHint\n\n\nTo specify the validation set, use the validate field, which can either be set during construction or by calling $configure().\nTrace-jitting can be enabled via the jit_trace parameter.\nThe history callback can be constructed via t_clbk(\"history\") and needs to be passed during the construction of the learner.\nThe validation and measures can be specified via measures_valid and take a measure object that is constructed via msr().\n\n\nQuestion 2: Early Stopping Enable early stopping to prevent overfitting and re-train the learner (using a patience of 10). Print the final validation performance of the learner and the early stopped results. 
You can consult the documentation of LearnerTorch on how to access these two results (section Active Bindings).\n\n\nHint\n\nYou can enable early stopping by setting the patience parameter.\n\nQuestion 3: Early Stopping and Dropout Tuning\nWhile early stopping in itself is already useful, mlr3torch also allows you to simultaneously tune the number of epochs using early stopping while tuning other hyperparameters via traditional hyperparameter tuning from mlr3tuning.\nOne thing we have not covered so far is that the MLP learner we have used so far also uses a dropout layer. The dropout probability can be configured via the p parameter.\nYour task is to tune the dropout probability p in the range \\([0, 1]\\) and the epochs using early stopping (using the configuration from the previous exercise).\nTo adapt this to work with early stopping, you need to set the:\n\nepochs to to_tune(upper = <value>, internal = TRUE): This tells the Tuner that the learner will tune the number of epochs itself.\n$validate field of the \"test\" so the same data is used for tuning and validation.\nTuning measure to msr(\"internal_valid_score\", minimize = TRUE). We set minimize to TRUE because we have used the log-loss as a validation measure.\n\nApart from this, the tuning works just like in tutorial 5. Use 3-fold cross-validation for the tuning and evaluate 10 configurations.\nRun the tuning and print the optimal configuration." + "text": "Question 1: Validation\nIn this exercise, we will once again train a simple multi-layer perceptron on the Indian Liver Patient Dataset (ILPD). Create a learner that:\n\nUses 2 hidden layers with 100 neurons each.\nUtilizes a batch size of 128.\nTrains for 200 epochs.\nEmploys a validation set comprising 30% of the data.\nTrack the validation log-loss.\nUtilizes trace-jitting to speed up the training process.\nEmploys the history callback to record the training and validation log-loss during training.\n\nAfterward, plot the validation log-loss, which is accessible via learner$model$callbacks$history.\nBelow, we create the task and remove the gender feature again for simplicity.\n\nlibrary(mlr3verse)\nlibrary(mlr3torch)\nilpd_num <- tsk(\"ilpd\")\nilpd_num$select(setdiff(ilpd_num$feature_names, \"gender\"))\nilpd_num\n\n<TaskClassif:ilpd> (583 x 10): Indian Liver Patient Data\n* Target: diseased\n* Properties: twoclass\n* Features (9):\n - dbl (5): albumin, albumin_globulin_ratio, direct_bilirubin, total_bilirubin, total_protein\n - int (4): age, alanine_transaminase, alkaline_phosphatase, aspartate_transaminase\n\n\nQuestion 2: Early Stopping\nEnable early stopping to prevent overfitting and re-train the learner (using a patience of 10). Print the final validation performance of the learner and the early stopped results. You can consult the documentation of LearnerTorch on how to access these (see section Active Bindings).\n\n\nHint\n\nYou can enable early stopping by setting the patience parameter.\n\nQuestion 3: Early Stopping and Dropout Tuning\nWhile early stopping in itself is already useful, mlr3torch also allows you to simultaneously tune the number of epochs using early stopping while tuning other hyperparameters via traditional hyperparameter tuning from mlr3tuning.\nOne thing we have not mentioned so far is that the MLP learner also uses a dropout layer. 
The dropout probability can be configured via the p parameter.\nYour task is to tune the dropout probability p in the range \\([0, 1]\\) and the epochs using early stopping (using the configuration from the previous exercise) with an upper bound of 100 epochs.\nTo adapt this to work with early stopping, you need to set the:\n\nepochs to to_tune(upper = <value>, internal = TRUE): This tells the Tuner that the learner will tune the number of epochs itself.\n$validate field of the \"test\" so the same data is used for tuning and validation.\nTuning measure to msr(\"internal_valid_score\", minimize = TRUE). We set minimize to TRUE because we have used the log-loss as a validation measure.\n\nApart from this, the tuning works just like in tutorial 5. Use 3-fold cross-validation and evaluate 10 configurations using random search. Finally, print the optimal configuration." }, { "objectID": "notebooks/6-training-efficiency-exercise-solution.html", "href": "notebooks/6-training-efficiency-exercise-solution.html", "title": "Training Efficiency", "section": "", - "text": "Question 1: Validation\nIn this exercise, we will once again train a simple multi-layer perceptron on the Indian Liver Patient Dataset (ILPD). Create a learner that:\n\nUses 2 hidden layers with 100 neurons each.\nUtilizes a batch size of 128.\nTrains for 200 epochs.\nEmploys a validation set comprising 30% of the data.\nTracks the training and validation log-loss during training.\nUtilizes trace-jitting to speed up the training process.\nEmploys the history callback to record the training and validation log-loss during training.\n\nAfterward, plot the validation log-loss, which is accessible via learner$model$callbacks$history.\nBelow, we create the task and remove the gender feature for simplicity.\n\nlibrary(mlr3verse)\n\nLoading required package: mlr3\n\nlibrary(mlr3torch)\n\nLoading required package: mlr3pipelines\n\n\nLoading required package: torch\n\nilpd_num <- tsk(\"ilpd\")\nilpd_num$select(setdiff(ilpd_num$feature_names, \"gender\"))\nilpd_num\n\n<TaskClassif:ilpd> (583 x 10): Indian Liver Patient Data\n* Target: diseased\n* Properties: twoclass\n* Features (9):\n - dbl (5): albumin, albumin_globulin_ratio, direct_bilirubin, total_bilirubin, total_protein\n - int (4): age, alanine_transaminase, alkaline_phosphatase, aspartate_transaminase\n\n\n\n\nHint\n\n\nTo specify the validation set, use the validate field, which can either be set during construction or by calling $configure().\nTrace-jitting can be enabled via the jit_trace parameter.\nThe history callback can be constructed via t_clbk(\"history\") and needs to be passed during the construction of the learner.\nThe validation and measures can be specified via measures_valid and take a measure object that is constructed via msr().\n\n\nSolution\n\nlibrary(ggplot2)\n\nmlp <- lrn(\"classif.mlp\",\n neurons = c(100, 100),\n batch_size = 128,\n epochs = 200,\n predict_type = \"prob\",\n validate = 0.3,\n jit_trace = TRUE,\n callbacks = t_clbk(\"history\"),\n measures_valid = msr(\"classif.logloss\")\n)\n\nmlp$train(ilpd_num)\nhead(mlp$model$callbacks$history)\n\n epoch valid.classif.logloss\n <num> <num>\n1: 1 3.373034\n2: 2 5.475234\n3: 3 4.667771\n4: 4 3.047842\n5: 5 1.563049\n6: 6 0.958690\n\nggplot(mlp$model$callbacks$history) +\n geom_line(aes(x = epoch, y = valid.classif.logloss)) +\n labs(\n y = \"Log-Loss (Validation)\",\n x = \"Epoch\"\n ) +\n theme_minimal()\n\n\n\n\n\n\n\n\nQuestion 2: Early Stopping Enable early stopping to prevent overfitting and re-train the 
learner (using a patience of 10). Print the final validation performance of the learner and the early stopped results. You can consult the documentation of LearnerTorch on how to access these two results (section Active Bindings).\n\n\nHint\n\nYou can enable early stopping by setting the patience parameter.\n\nSolution\n\nmlp$configure(\n patience = 10\n)\nmlp$train(ilpd_num)\nmlp$internal_tuned_values\n\n$epochs\n[1] 24\n\nmlp$internal_valid_scores\n\n$classif.logloss\n[1] 0.5598296\n\n\nQuestion 3: Early Stopping and Dropout Tuning\nWhile early stopping in itself is already useful, mlr3torch also allows you to simultaneously tune the number of epochs using early stopping while tuning other hyperparameters via traditional hyperparameter tuning from mlr3tuning.\nOne thing we have not covered so far is that the MLP learner we have used so far also uses a dropout layer. The dropout probability can be configured via the p parameter.\nYour task is to tune the dropout probability p in the range \\([0, 1]\\) and the epochs using early stopping (using the configuration from the previous exercise).\nTo adapt this to work with early stopping, you need to set the:\n\nepochs to to_tune(upper = <value>, internal = TRUE): This tells the Tuner that the learner will tune the number of epochs itself.\n$validate field of the \"test\" so the same data is used for tuning and validation.\nTuning measure to msr(\"internal_valid_score\", minimize = TRUE). We set minimize to TRUE because we have used the log-loss as a validation measure.\n\nApart from this, the tuning works just like in tutorial 5. Use 3-fold cross-validation for the tuning and evaluate 10 configurations.\nRun the tuning and print the optimal configuration.\nSolution\n\nlibrary(mlr3torch)\n\nmlp$configure(\n epochs = to_tune(upper = 100, internal = TRUE),\n p = to_tune(lower = 0, upper = 1),\n validate = \"test\"\n)\n\ntuner <- tnr(\"random_search\")\nresampling <- rsmp(\"cv\", folds = 3)\nmeasure <- msr(\"internal_valid_score\", minimize = TRUE)\n\nti <- tune(\n tuner = tuner,\n task = ilpd_num,\n learner = mlp,\n resampling = resampling,\n measure = measure,\n term_evals = 10\n)\n\nti$learner_result_param_vals\n\nNULL" + "text": "Question 1: Validation\nIn this exercise, we will once again train a simple multi-layer perceptron on the Indian Liver Patient Dataset (ILPD). 
Create a learner that:\n\nUses 2 hidden layers with 100 neurons each.\nUtilizes a batch size of 128.\nTrains for 200 epochs.\nEmploys a validation set comprising 30% of the data.\nTrack the validation log-loss.\nUtilizes trace-jitting to speed up the training process.\nEmploys the history callback to record the training and validation log-loss during training.\n\nAfterward, plot the validation log-loss, which is accessible via learner$model$callbacks$history.\nBelow, we create the task and remove the gender feature again for simplicity.\n\nlibrary(mlr3verse)\nlibrary(mlr3torch)\nilpd_num <- tsk(\"ilpd\")\nilpd_num$select(setdiff(ilpd_num$feature_names, \"gender\"))\nilpd_num\n\n<TaskClassif:ilpd> (583 x 10): Indian Liver Patient Data\n* Target: diseased\n* Properties: twoclass\n* Features (9):\n - dbl (5): albumin, albumin_globulin_ratio, direct_bilirubin, total_bilirubin, total_protein\n - int (4): age, alanine_transaminase, alkaline_phosphatase, aspartate_transaminase\n\n\nSolution\n\nlibrary(ggplot2)\n\nmlp <- lrn(\"classif.mlp\",\n neurons = c(100, 100),\n batch_size = 128,\n epochs = 200,\n predict_type = \"prob\",\n validate = 0.3,\n jit_trace = TRUE,\n callbacks = t_clbk(\"history\"),\n measures_valid = msr(\"classif.logloss\")\n)\n\nmlp$train(ilpd_num)\nhead(mlp$model$callbacks$history)\n\n epoch valid.classif.logloss\n <num> <num>\n1: 1 3.373034\n2: 2 5.475234\n3: 3 4.667771\n4: 4 3.047842\n5: 5 1.563049\n6: 6 0.958690\n\nggplot(mlp$model$callbacks$history) +\n geom_line(aes(x = epoch, y = valid.classif.logloss)) +\n labs(\n y = \"Log-Loss (Validation)\",\n x = \"Epoch\"\n ) +\n theme_minimal()\n\n\n\n\n\n\n\n\nQuestion 2: Early Stopping\nEnable early stopping to prevent overfitting and re-train the learner (using a patience of 10). Print the final validation performance of the learner and the early stopped results. You can consult the documentation of LearnerTorch on how to access these (see section Active Bindings).\n\n\nHint\n\nYou can enable early stopping by setting the patience parameter.\n\nSolution\n\nmlp$configure(\n patience = 10\n)\nmlp$train(ilpd_num)\nmlp$internal_tuned_values\n\n$epochs\n[1] 24\n\nmlp$internal_valid_scores\n\n$classif.logloss\n[1] 0.5598296\n\n\nQuestion 3: Early Stopping and Dropout Tuning\nWhile early stopping in itself is already useful, mlr3torch also allows you to simultaneously tune the number of epochs using early stopping while tuning other hyperparameters via traditional hyperparameter tuning from mlr3tuning.\nOne thing we have not mentioned so far is that the MLP learner also uses a dropout layer. The dropout probability can be configured via the p parameter.\nYour task is to tune the dropout probability p in the range \\([0, 1]\\) and the epochs using early stopping (using the configuration from the previous exercise) with an upper bound of 100 epochs.\nTo adapt this to work with early stopping, you need to set the:\n\nepochs to to_tune(upper = <value>, internal = TRUE): This tells the Tuner that the learner will tune the number of epochs itself.\n$validate field of the \"test\" so the same data is used for tuning and validation.\nTuning measure to msr(\"internal_valid_score\", minimize = TRUE). We set minimize to TRUE because we have used the log-loss as a validation measure.\n\nApart from this, the tuning works just like in tutorial 5. Use 3-fold cross-validation and evaluate 10 configurations using random search. 
Finally, print the optimal configuration.\nSolution\n\nlibrary(mlr3torch)\n\nmlp$configure(\n epochs = to_tune(upper = 100, internal = TRUE),\n p = to_tune(lower = 0, upper = 1),\n validate = \"test\"\n)\n\ntuner <- tnr(\"random_search\")\nresampling <- rsmp(\"cv\", folds = 3)\nmeasure <- msr(\"internal_valid_score\", minimize = TRUE)\n\nti <- tune(\n tuner = tuner,\n task = ilpd_num,\n learner = mlp,\n resampling = resampling,\n measure = measure,\n term_evals = 10\n)\n\nti$result_learner_param_vals\n\n$epochs\n[1] 53\n\n$device\n[1] \"auto\"\n\n$num_threads\n[1] 1\n\n$num_interop_threads\n[1] 1\n\n$seed\n[1] \"random\"\n\n$jit_trace\n[1] TRUE\n\n$eval_freq\n[1] 1\n\n$measures_train\nlist()\n\n$measures_valid\nlist()\n\n$patience\n[1] 0\n\n$min_delta\n[1] 0\n\n$batch_size\n[1] 128\n\n$neurons\n[1] 100 100\n\n$p\n[1] 0.3738756\n\n$activation\n<nn_relu> object generator\n Inherits from: <inherit>\n Public:\n .classes: nn_relu nn_module\n initialize: function (inplace = FALSE) \n forward: function (input) \n clone: function (deep = FALSE, ..., replace_values = TRUE) \n Private:\n .__clone_r6__: function (deep = FALSE) \n Parent env: <environment: 0x12f15a7b8>\n Locked objects: FALSE\n Locked class: FALSE\n Portable: TRUE\n\n$activation_args\nlist()" }, { "objectID": "notebooks/5-mlr3torch-exercise.html", @@ -375,7 +375,7 @@ "href": "notebooks/6-training-efficiency-exercise-task.html", "title": "Training Efficiency", "section": "", - "text": "Question 1: Validation\nIn this exercise, we will once again train a simple multi-layer perceptron on the Indian Liver Patient Dataset (ILPD). Create a learner that:\n\nUses 2 hidden layers with 100 neurons each.\nUtilizes a batch size of 128.\nTrains for 200 epochs.\nEmploys a validation set comprising 30% of the data.\nTracks the training and validation log-loss during training.\nUtilizes trace-jitting to speed up the training process.\nEmploys the history callback to record the training and validation log-loss during training.\n\nAfterward, plot the validation log-loss, which is accessible via learner$model$callbacks$history.\nBelow, we create the task and remove the gender feature for simplicity.\n\nlibrary(mlr3verse)\n\nLoading required package: mlr3\n\nlibrary(mlr3torch)\n\nLoading required package: mlr3pipelines\n\n\nLoading required package: torch\n\nilpd_num <- tsk(\"ilpd\")\nilpd_num$select(setdiff(ilpd_num$feature_names, \"gender\"))\nilpd_num\n\n<TaskClassif:ilpd> (583 x 10): Indian Liver Patient Data\n* Target: diseased\n* Properties: twoclass\n* Features (9):\n - dbl (5): albumin, albumin_globulin_ratio, direct_bilirubin, total_bilirubin, total_protein\n - int (4): age, alanine_transaminase, alkaline_phosphatase, aspartate_transaminase\n\n\n\n\nHint\n\n\nTo specify the validation set, use the validate field, which can either be set during construction or by calling $configure().\nTrace-jitting can be enabled via the jit_trace parameter.\nThe history callback can be constructed via t_clbk(\"history\") and needs to be passed during the construction of the learner.\nThe validation and measures can be specified via measures_valid and take a measure object that is constructed via msr().\n\n\nQuestion 2: Early Stopping Enable early stopping to prevent overfitting and re-train the learner (using a patience of 10). Print the final validation performance of the learner and the early stopped results. 
You can consult the documentation of LearnerTorch on how to access these two results (section Active Bindings).\n\n\nHint\n\nYou can enable early stopping by setting the patience parameter.\n\nQuestion 3: Early Stopping and Dropout Tuning\nWhile early stopping in itself is already useful, mlr3torch also allows you to simultaneously tune the number of epochs using early stopping while tuning other hyperparameters via traditional hyperparameter tuning from mlr3tuning.\nOne thing we have not covered so far is that the MLP learner we have used so far also uses a dropout layer. The dropout probability can be configured via the p parameter.\nYour task is to tune the dropout probability p in the range \\([0, 1]\\) and the epochs using early stopping (using the configuration from the previous exercise).\nTo adapt this to work with early stopping, you need to set the:\n\nepochs to to_tune(upper = <value>, internal = TRUE): This tells the Tuner that the learner will tune the number of epochs itself.\n$validate field of the \"test\" so the same data is used for tuning and validation.\nTuning measure to msr(\"internal_valid_score\", minimize = TRUE). We set minimize to TRUE because we have used the log-loss as a validation measure.\n\nApart from this, the tuning works just like in tutorial 5. Use 3-fold cross-validation for the tuning and evaluate 10 configurations.\nRun the tuning and print the optimal configuration." + "text": "Question 1: Validation\nIn this exercise, we will once again train a simple multi-layer perceptron on the Indian Liver Patient Dataset (ILPD). Create a learner that:\n\nUses 2 hidden layers with 100 neurons each.\nUtilizes a batch size of 128.\nTrains for 200 epochs.\nEmploys a validation set comprising 30% of the data.\nTrack the validation log-loss.\nUtilizes trace-jitting to speed up the training process.\nEmploys the history callback to record the training and validation log-loss during training.\n\nAfterward, plot the validation log-loss, which is accessible via learner$model$callbacks$history.\nBelow, we create the task and remove the gender feature again for simplicity.\n\nlibrary(mlr3verse)\nlibrary(mlr3torch)\nilpd_num <- tsk(\"ilpd\")\nilpd_num$select(setdiff(ilpd_num$feature_names, \"gender\"))\nilpd_num\n\n<TaskClassif:ilpd> (583 x 10): Indian Liver Patient Data\n* Target: diseased\n* Properties: twoclass\n* Features (9):\n - dbl (5): albumin, albumin_globulin_ratio, direct_bilirubin, total_bilirubin, total_protein\n - int (4): age, alanine_transaminase, alkaline_phosphatase, aspartate_transaminase\n\n\nQuestion 2: Early Stopping\nEnable early stopping to prevent overfitting and re-train the learner (using a patience of 10). Print the final validation performance of the learner and the early stopped results. You can consult the documentation of LearnerTorch on how to access these (see section Active Bindings).\n\n\nHint\n\nYou can enable early stopping by setting the patience parameter.\n\nQuestion 3: Early Stopping and Dropout Tuning\nWhile early stopping in itself is already useful, mlr3torch also allows you to simultaneously tune the number of epochs using early stopping while tuning other hyperparameters via traditional hyperparameter tuning from mlr3tuning.\nOne thing we have not mentioned so far is that the MLP learner also uses a dropout layer. 
The dropout probability can be configured via the p parameter.\nYour task is to tune the dropout probability p in the range \\([0, 1]\\) and the epochs using early stopping (using the configuration from the previous exercise) with an upper bound of 100 epochs.\nTo adapt this to work with early stopping, you need to set the:\n\nepochs to to_tune(upper = <value>, internal = TRUE): This tells the Tuner that the learner will tune the number of epochs itself.\n$validate field of the \"test\" so the same data is used for tuning and validation.\nTuning measure to msr(\"internal_valid_score\", minimize = TRUE). We set minimize to TRUE because we have used the log-loss as a validation measure.\n\nApart from this, the tuning works just like in tutorial 5. Use 3-fold cross-validation and evaluate 10 configurations using random search. Finally, print the optimal configuration." }, { "objectID": "notebooks/6-training-efficiency.html", @@ -389,56 +389,56 @@ "href": "notebooks/6-training-efficiency.html#parallel-processing", "title": "Training Efficiency", "section": "Parallel Processing", - "text": "Parallel Processing\n\nGraphical Processing Unit (GPU)\nUsing a GPU is crucial when training relatively large neural networks because GPUs are specifically designed to handle the parallel processing required for complex computations. To use a GPU in mlr3torch, we can set the device parameter to “cuda”. By default, it is set to “auto”, which will use a GPU if it is available and otherwise fall back to the CPU.\n\n\n\n\n\n\nTip\n\n\n\nTo check if a GPU is available, we can use the torch::cuda_is_available() function.\n\nlibrary(torch)\ncuda_is_available()\n\n[1] FALSE\n\n\nIf you have an M1 Mac (or later), you can also use the available graphics card by setting the device parameter to \"mps\". You can check this by running:\n\nbackends_mps_is_available()\n\n[1] TRUE\n\n\n\n\nTo demonstrate the speed improvements obtained by using a GPU, we conduct a large matrix operation on a GPU and a CPU. We start by randomly sampling a matrix of size 1000x1000.\n\nx_cpu = torch_randn(1000, 1000, device = \"cpu\")\n\nBelow, we perform a matrix multiplication on the CPU and the GPU and compare the timings.\n\n# this will only run if a GPU is available\nx_cuda = x_cpu$cuda()\n\nbench::mark(\n cpu = x_cpu$matmul(x_cpu),\n cuda = x_cuda$matmul(x_cuda)\n)\n\n\n\nCPU Threads\nTraining large networks on a CPU is not a recommended approach, but it can be useful for smaller networks or when you don’t have a GPU. You can still use multiple threads to speed up the execution of operations. Note that the code below will not run on macOS, as it is not possible to set the number of threads on macOS.\n\n# this will be skipped on macOS\nbench::mark(\n {torch_set_num_threads(1L); x_cpu$matmul(x_cpu)},\n {torch_set_num_threads(16L); x_cpu$matmul(x_cpu)}\n)\n\ntorch also allows for interop-parallelization, but this is more advanced and code needs to be written in a specific way.\n\n\n\n\n\n\nQuiz: Number of Threads\n\n\n\nQuestion 1: On a CPU with 4 cores, does it make sense to set the number of threads to values greater than 4? Explain your answer.\n\n\nClick for answer\n\nOn a CPU with 4 cores, at most 4 threads can run in parallel. Using more threads than the number of cores will not speed up the execution of operations.\n\nQuestion 2: On a CPU with 64 cores, is it always the case that using 64 threads is better than using 32 threads?\n\n\nClick for answer\n\nNot necessarily. 
Using more threads will mean that:\n\nThe threads need to communicate and synchronize, which increases the runtime.\nMore resources are used for the computation, which decreases the runtime.\n\nThe optimal number of threads is a trade-off between these two effects." + "text": "Parallel Processing\n\nGraphical Processing Unit (GPU)\nUsing a GPU is crucial when training relatively large neural networks because GPUs are specifically designed to handle the parallel processing required for complex computations. To use a GPU in mlr3torch, we can set the device parameter to “cuda”. By default, it is set to “auto”, which will use a GPU if available and otherwise fall back to the CPU.\n\n\n\n\n\n\nTip\n\n\n\nTo check if a GPU is available, we can use the torch::cuda_is_available() function.\n\nlibrary(torch)\ncuda_is_available()\n\n[1] FALSE\n\n\nIf you have an M1 Mac (or later), you can also use the available graphics card by setting the device parameter to \"mps\". You can check this by running:\n\nbackends_mps_is_available()\n\n[1] TRUE\n\n\n\n\nTo demonstrate the speed improvements obtained by using a GPU, we conduct a large matrix operation on a GPU and a CPU. We start by randomly sampling a matrix of size 1000x1000.\n\nx_cpu = torch_randn(1000, 1000, device = \"cpu\")\n\nBelow, we perform a matrix multiplication on the CPU and the GPU and compare the timings.\n\n# this will only run if a GPU is available\nx_cuda = x_cpu$cuda()\n\nbench::mark(\n cpu = x_cpu$matmul(x_cpu),\n cuda = x_cuda$matmul(x_cuda)\n)\n\n\n\nCPU Threads\nTraining large networks on a CPU is not a recommended approach, but it can be a viable option for smaller networks. You can still use multiple threads to speed up the execution of operations. Please be aware that the code below will not execute on macOS, as setting the number of threads is not supported on this operating system.\n\n# this will be skipped on macOS\nbench::mark(\n {torch_set_num_threads(1L); x_cpu$matmul(x_cpu)},\n {torch_set_num_threads(16L); x_cpu$matmul(x_cpu)}\n)\n\ntorch also allows for interop-parallelization, but this is more advanced and code needs to be written in a specific way.\n\n\n\n\n\n\nQuiz: Number of Threads\n\n\n\nQuestion 1: On a CPU with 4 cores, does it make sense to set the number of threads to values greater than 4? Explain your answer.\n\n\nClick for answer\n\nOn a CPU with 4 cores, at most 4 threads can run in parallel. Using more threads than the number of cores will not speed up the execution of operations.\n\nQuestion 2: On a CPU with 64 cores, is it always the case that using 64 threads is better than using 32 threads?\n\n\nClick for answer\n\nNot necessarily. Using more threads will mean that:\n\nThe threads need to communicate and synchronize, which increases the runtime.\nMore resources are used for the computation, which decreases the runtime.\n\nThe optimal number of threads is a trade-off between these two effects." }, { "objectID": "notebooks/6-training-efficiency.html#efficient-data-loading", "href": "notebooks/6-training-efficiency.html#efficient-data-loading", "title": "Training Efficiency", "section": "Efficient Data Loading", - "text": "Efficient Data Loading\nBesides speeding up the computation of operations in the forward and backward pass, another possible bottleneck is the loading of data. 
There are various ways to improve data loading speed:\n\nImprove the implementation of the dataset class\nParallelize the data loading process\nMove data to the GPU\n\nThese approaches will now be discussed.\n\nEfficient Dataset Implementation\nWhen implementing a dataset, we need to define:\n\nHow we store and load the data\nWhether implementing loading of a batch is beneficial\n\n\n\n\n\n\n\nQuiz: Data Loading\n\n\n\nThe tiny imagenet dataset is a dataset of 100,000 images of size 64x64x3. It is a subset of the famous imagenet dataset. Below, we show some examples from the dataset:\n\nWe will now consider different ways to write a torch::dataset implementation for this data. Assume we have some image paths stored in a character vector as well as in an array where they are already loaded into memory.\n\nstr(image_paths)\n\n chr [1:100] \"/Users/sebi/Library/Caches/org.R-project.R/R/mlr3torch/datasets/tiny_imagenet/raw/tiny-imagenet-200/train/n0144\"| __truncated__ ...\n\nstr(image_array)\n\n num [1:100, 1:3, 1:64, 1:64] 1 0.0784 0.4706 0.5608 0.5647 ...\n\n\nAn individual image can, for example, be loaded using the torchvision::base_loader() function:\n\nlibrary(torchvision)\nstr(base_loader(image_paths[1]))\n\n num [1:64, 1:64, 1:3] 1 1 1 1 1 ...\n\n\nQuestion 1: Reading From Disk or RAM\nWhich of the following is the faster way to load the images? Explain why.\n\nLoading the images from disk:\n\nds_disk = dataset(\"image_paths\",\n initialize = function(image_paths) {\n self$image_paths = image_paths\n },\n .getitem = function(i) {\n torch_tensor(torchvision::base_loader(self$image_paths[i]))\n },\n .length = function() {\n length(self$image_paths)\n }\n)(image_paths)\n\nLoading the images from an array:\n\nds_ram = dataset(\"image_array\",\n initialize = function(image_array) {\n self$image_array = image_array\n },\n .getbatch = function(i) {\n torch_tensor(self$image_array[i, , , ])\n },\n .length = function() {\n nrow(self$image_array)\n }\n)(image_array)\n\n\n\n\nClick for answer\n\nGenerally, loading images from RAM is significantly faster than loading them from disk. Although the benchmark presented below may seem somewhat ‘unfair’ since ds_ram has already loaded the images into memory, this difference is evident in practice. When iterating over the dataset for multiple epochs, the first method will need to reload the images from disk for each epoch, while the second method only requires a single loading of the images into memory.\n\niter = function(ds, ..., epochs = 1) {\n dl = torch::dataloader(ds, batch_size = 16, ...)\n for (epoch in seq_len(epochs)) {\n coro::loop(for(batch in dl) {\n batch\n })\n }\n}\nbench::mark(\n disk = iter(ds_disk),\n ram = iter(ds_ram),\n check = FALSE\n)\n\n# A tibble: 2 × 6\n expression min median `itr/sec` mem_alloc `gc/sec`\n <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>\n1 disk 18ms 20.01ms 47.6 14MB 14.0\n2 ram 8.4ms 9.06ms 110. 9.4MB 26.0\n\n\n\nQuestion 2: (Don’t) Copy that\nConsider now the next dataset implementation:\n\nds_tensor = dataset(\"tensor\",\n initialize = function(image_array) {\n self$tensor = torch_tensor(image_array)\n },\n .getitem = function(i) {\n self$tensor[i, ..]\n },\n .length = function() {\n nrow(self$tensor)\n }\n)(image_array)\n\nDo you think this implementation is faster or slower than the ds_ram implementation? Explain why.\n\n\nClick for answer\n\nThis implementation is faster than the ds_ram implementation. 
This is because the ds_tensor implementation copies the R array to a torch tensor only once, whereas the ds_ram implementation copies the R array to a torch tensor for each item.\n\nbench::mark(\n tensor = iter(ds_tensor),\n array = iter(ds_ram),\n check = FALSE\n)\n\n# A tibble: 2 × 6\n expression min median `itr/sec` mem_alloc `gc/sec`\n <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>\n1 tensor 4.62ms 5.06ms 196. 96.08KB 6.77\n2 array 8.03ms 9.28ms 107. 9.38MB 27.6 \n\n\n\nQuestion 3: $.getbatch() vs $.getitem()\nWhich implementation is faster? Explain why.\n\nds_tensor_batch = dataset(\"tensor_batch\",\n initialize = function(image_array) {\n self$tensor = torch_tensor(image_array)\n },\n .getbatch = function(i) {\n self$tensor[i, ..]\n },\n .length = function() {\n nrow(self$tensor)\n }\n)(image_array)\n\n\n\nClick for answer\n\nThe $.getbatch() implementation is faster than the $.getitem() implementation. This is because when using the $.getitem() method, the batch for indices ids is obtained by calling $.getitem(id) for each index in ids and then stacking them together, which requires a new tensor allocation. Slicing the tensor, however, avoids this allocation when shuffle = TRUE (which is also the default).\n\nbench::mark(\n getbatch = iter(ds_tensor_batch),\n getitem = iter(ds_tensor),\n check = FALSE\n)\n\n# A tibble: 2 × 6\n expression min median `itr/sec` mem_alloc `gc/sec`\n <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>\n1 getbatch 1.69ms 1.99ms 466. 3.83KB 4.25\n2 getitem 4.48ms 4.85ms 204. 54.69KB 7.19\n\n\n\n\n\n\n\nParallel Data Loading\nIn Deep Learning, datasets can be very large, and it might therefore be the case that the data is simply too large to fit into memory. In this case, we can use parallel data loading to speed up the data loading process. Instead of loading the data sequentially in the main process, other R processes will be started that execute the data loading. For example, if we set num_workers = 4L, 4 R processes will be started that load the data, while the main process is free to train the model. These processes then send the batches to the main process. The image below visualizes this process:\n\nCreating such a parallel dataloader is as easy as setting the num_workers parameter to a value greater than 0.\n\n\n\n\n\n\nNote\n\n\n\nNote that there is some communication overhead that results from sending the batches from the worker to the main process. This will hopefully be reduced in the future, but is currently there. For this reason, parallel data loading is therefore – currently – only beneficial when it is slow, e.g., because of loading the data from disk or because of expensive preprocessing.\n\n\n\n\nMoving Data to the GPU\nOne thing we have ignored so far is that when training using a GPU, the data needs to be moved to the GPU. This is because a GPU has its own memory (VRAM), and the data needs to be moved to this memory before it can be used for training. The moving of the data to the GPU cannot be done on the processes that are loading the data but must be done in the main process, i.e., after the batch was received from (possibly parallelized) dataloader. One way to speed up the data loading process is to pin the memory of the data to the GPU. 
Before a tensor can be moved from RAM to VRAM, it needs to be in so-called page-locked memory, which can be done using the pin_memory parameter.\n\n\niter_cuda = function(ds, pin_memory = TRUE) {\n dl = torch::dataloader(ds, batch_size = 16, pin_memory = pin_memory)\n coro::loop(for(batch in dl) {\n batch$cuda()\n })\n}\n\nbench::mark(\n not_pinned = iter_cuda(ds_disk, pin_memory = FALSE),\n pinned = iter_cuda(ds_disk, pin_memory = TRUE)\n)\n\n\n\n\n\n\n\nNote\n\n\n\nIn order to use parallel data loading or memory pinning with mlr3torch, these parameters can simply be specified in the learner:\n\nlrn(\"classif.mlp\", num_workers = 8L, pin_memory = TRUE, device = \"cuda\")\n\n<LearnerTorchMLP[classif]:classif.mlp>: My Little Powny\n* Model: -\n* Parameters: device=cuda, num_threads=1, num_interop_threads=1, seed=random, jit_trace=FALSE, eval_freq=1,\n measures_train=<list>, measures_valid=<list>, patience=0, min_delta=0, num_workers=8, pin_memory=TRUE,\n neurons=integer(0), p=0.5, activation=<nn_relu>, activation_args=<list>\n* Validate: NULL\n* Packages: mlr3, mlr3torch, torch\n* Predict Types: [response], prob\n* Feature Types: integer, numeric, lazy_tensor\n* Properties: internal_tuning, marshal, multiclass, twoclass, validation\n* Optimizer: adam\n* Loss: cross_entropy\n* Callbacks: -" + "text": "Efficient Data Loading\nBesides parallelizing the computation of operations in the forward and backward pass, another possible bottleneck is the loading of data. There are various ways to improve data loading speed:\n\nImprove the implementation of the dataset class\nParallelize the data loading process\nIncrease the speed of data transfer to the GPU\n\nThese approaches will now be discussed.\n\nEfficient Dataset Implementation\nWhen implementing a dataset, we need to define:\n\nHow we store and load the data\nWhether implementing loading of a batch is beneficial\n\n\n\n\n\n\n\nQuiz: Data Loading\n\n\n\nThe tiny imagenet dataset is a dataset of 100,000 images of size 64x64. It is a subset of the famous imagenet dataset. Below, we show some examples from it:\n\nWe will now consider different ways to write a torch::dataset implementation for this data. Assume we have some image paths stored in a character vector as well as in an array where they are already loaded into memory.\n\nstr(image_paths)\n\n chr [1:100] \"/Users/sebi/Library/Caches/org.R-project.R/R/mlr3torch/datasets/tiny_imagenet/raw/tiny-imagenet-200/train/n0144\"| __truncated__ ...\n\nstr(image_array)\n\n num [1:100, 1:3, 1:64, 1:64] 1 0.0784 0.4706 0.5608 0.5647 ...\n\n\nAn individual image can, for example, be loaded using the torchvision::base_loader() function:\n\nlibrary(torchvision)\nstr(base_loader(image_paths[1]))\n\n num [1:64, 1:64, 1:3] 1 1 1 1 1 ...\n\n\nQuestion 1: Reading From Disk or RAM\nWhich of the following is the faster way to load the images? 
Explain why.\n\nLoading the images from disk:\n\nds_disk = dataset(\"image_paths\",\n initialize = function(image_paths) {\n self$image_paths = image_paths\n },\n .getitem = function(i) {\n torch_tensor(torchvision::base_loader(self$image_paths[i]))\n },\n .length = function() {\n length(self$image_paths)\n }\n)(image_paths)\n\nLoading the images from an array:\n\nds_ram = dataset(\"image_array\",\n initialize = function(image_array) {\n self$image_array = image_array\n },\n .getitem = function(i) {\n torch_tensor(self$image_array[i, , , ])\n },\n .length = function() {\n nrow(self$image_array)\n }\n)(image_array)\n\n\n\n\nClick for answer\n\nGenerally, loading images from RAM is significantly faster than loading them from disk. Although the benchmark presented below may seem somewhat ‘unfair’ since ds_ram has already loaded the images into memory, this difference is evident in practice. When iterating over the dataset for multiple epochs, the first method will need to reload the images from disk for each epoch, while the second method only requires a single loading of the images into memory.\n\niter = function(ds, ..., epochs = 1) {\n dl = torch::dataloader(ds, batch_size = 16, ...)\n for (epoch in seq_len(epochs)) {\n coro::loop(for(batch in dl) {\n batch\n })\n }\n}\nbench::mark(\n disk = iter(ds_disk, epochs = 10),\n ram = iter(ds_ram, epochs = 10),\n check = FALSE\n)\n\nWarning: Some expressions had a GC in every iteration; so filtering is disabled.\n\n\n# A tibble: 2 × 6\n expression min median `itr/sec` mem_alloc `gc/sec`\n <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>\n1 disk 228ms 258ms 3.88 109MB 9.69\n2 ram 183ms 184ms 5.33 94.4MB 10.7 \n\n\n\nQuestion 2: (Don’t) Copy that\nConsider now the next dataset implementation:\n\nds_tensor = dataset(\"tensor\",\n initialize = function(image_array) {\n self$tensor = torch_tensor(image_array)\n },\n .getitem = function(i) {\n self$tensor[i, ..]\n },\n .length = function() {\n nrow(self$tensor)\n }\n)(image_array)\n\nDo you think this implementation is faster or slower than the ds_ram implementation? Explain why.\n\n\nClick for answer\n\nThis implementation is faster than the ds_ram implementation. This is because the ds_tensor implementation copies the R array to a torch tensor only once, whereas the ds_ram implementation copies the R array to a torch tensor for each item.\n\nbench::mark(\n tensor = iter(ds_tensor),\n array = iter(ds_ram),\n check = FALSE\n)\n\n# A tibble: 2 × 6\n expression min median `itr/sec` mem_alloc `gc/sec`\n <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>\n1 tensor 4.52ms 4.82ms 206. 96.08KB 6.71\n2 array 14.94ms 16.26ms 62.0 9.44MB 16.9 \n\n\n\nQuestion 3: $.getbatch() vs $.getitem()\nWhich implementation is faster? Explain why.\n\nds_tensor_batch = dataset(\"tensor_batch\",\n initialize = function(image_array) {\n self$tensor = torch_tensor(image_array)\n },\n .getbatch = function(i) {\n self$tensor[i, .., drop = FALSE]\n },\n .length = function() {\n nrow(self$tensor)\n }\n)(image_array)\n\n\n\nClick for answer\n\nThe $.getbatch() implementation is faster than the $.getitem() implementation. This is because when using the $.getitem() method, the batch for indices ids is obtained by calling $.getitem(id) for each index in ids and then stacking them together, which requires a new tensor allocation. 
Slicing the tensor, however, avoids this allocation when shuffle = TRUE (which is also the default).\n\nbench::mark(\n getbatch = iter(ds_tensor_batch),\n getitem = iter(ds_tensor),\n check = FALSE\n)\n\n# A tibble: 2 × 6\n expression min median `itr/sec` mem_alloc `gc/sec`\n <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>\n1 getbatch 1.75ms 1.94ms 502. 23KB 4.69\n2 getitem 4.61ms 4.94ms 198. 54.7KB 7.54\n\n\n\n\n\n\n\nParallel Data Loading\nIn Deep Learning, datasets can be very large, and it might therefore be the case that the data is simply too large to fit into memory. In this case, we can use parallel data loading to speed up the data loading process. Instead of loading the data sequentially in the main process, other R processes will be started that execute the data loading. For example, if we set num_workers = 4L, 4 R processes will be started that load the data, while the main process is free to train the model. These processes then send the batches to the main process. The image below visualizes this process:\n\nCreating such a parallel dataloader is as easy as setting the num_workers parameter to a value greater than 0.\n\n\n\n\n\n\nNote\n\n\n\nNote that in the current implementation, parallel data loading is only beneficial when it is slow, e.g., because of loading the data from disk or because of expensive preprocessing. This will hopefully be improved in the future (by a faster implementation of the parallel dataloader).\n\n\n\n\nMoving Data to the GPU\nOne thing we have ignored so far is that when training using a GPU, the data needs to be moved from RAM to the GPU. This is because a GPU has its own memory (VRAM), and the data needs to be moved to this memory before it can be used for training. The moving of the data to the GPU cannot be done on the processes that are loading the data but must be done in the main process, i.e., after the batch was received from (possibly parallelized) dataloader. One way to speed up the data loading process is to pin the memory of the data to the GPU. Before a tensor can be moved from RAM to VRAM, it needs to be in so-called page-locked memory, which can be enabled using the pin_memory parameter of dataloader()..\n\n\niter_cuda = function(ds, pin_memory = TRUE) {\n dl = torch::dataloader(ds, batch_size = 16, pin_memory = pin_memory)\n coro::loop(for(batch in dl) {\n batch$cuda()\n })\n}\n\nbench::mark(\n not_pinned = iter_cuda(ds_disk, pin_memory = FALSE),\n pinned = iter_cuda(ds_disk, pin_memory = TRUE)\n)\n\n\n\n\n\n\n\nNote\n\n\n\nIn order to use parallel data loading or memory pinning with mlr3torch, these parameters can simply be specified in the learner:\n\nlrn(\"classif.mlp\", num_workers = 8L, pin_memory = TRUE, device = \"cuda\")" }, { "objectID": "notebooks/6-training-efficiency.html#jit-compilation-ignite-optimizers", "href": "notebooks/6-training-efficiency.html#jit-compilation-ignite-optimizers", "title": "Training Efficiency", "section": "JIT Compilation & Ignite Optimizers", - "text": "JIT Compilation & Ignite Optimizers\nSome special care needs to be taken when using torch (or mlr3torch) in order to get good performance. 
In the future, this will hopefully not be necessary anymore, but is currently required.\n\n‘Ignite’ Optimizers\nIn torch, different versions of optimizers exist:\n\noptim_adamw\n\n<optim_adamw> object generator\n Inherits from: <inherit>\n Public:\n initialize: function (params, lr = 0.001, betas = c(0.9, 0.999), eps = 1e-08, \n loop_fun: function (group, param, g, p) \n step: function (closure = NULL) \n clone: function (deep = FALSE) \n Parent env: <environment: 0x143916d88>\n Locked objects: FALSE\n Locked class: FALSE\n Portable: TRUE\n\noptim_ignite_adamw\n\n<optim_ignite_adamw> object generator\n<optim_ignite> object generator\n Inherits from: <inherit>\n Public:\n initialize: function (params, lr = 0.001, betas = c(0.9, 0.999), eps = 1e-08, \n clone: function (deep = FALSE) \n Private:\n .config_names: lr betas eps weight_decay amsgrad\n .state_names: exp_avg exp_avg_sq max_exp_avg_sq step\n .optim: function (params, ...) \n .get_states: function (opt) \n .set_states: function (opt, params, states) \n .add_param_group: function (opt, params, lr, betas, eps, weight_decay, amsgrad) \n .assert_params: function (lr, betas, eps, weight_decay, amsgrad) \n .set_param_group_options: function (opt, list) \n .zero_grad: function (opt) \n .get_param_groups: function (ptr) \n Parent env: <environment: 0x117450168>\n Locked objects: FALSE\n Locked class: FALSE\n Portable: TRUE\n\n\nThe ‘ignite’ indicates that the optimizer is a version that is optimized for performance. Not for all optimizers does an ignite version exist, but for the most common ones, it does.\nBelow, we compare the performance of the default optimizer and the ignite optimizer and see that the latter is considerably faster.\n\nadamw = as_torch_optimizer(torch::optim_adamw)\nignite_adamw = as_torch_optimizer(torch::optim_ignite_adamw)\n\nlearner = lrn(\"classif.mlp\", epochs = 10, neurons = c(100, 100), batch_size = 32, optimizer = adamw)\n\nlearner_ignite = learner$clone(deep = TRUE)\nlearner_ignite$configure(\n optimizer = ignite_adamw\n)\ntask_sonar = tsk(\"sonar\")\n\nbench::mark(\n learner$train(task_sonar),\n learner_ignite$train(task_sonar),\n check = FALSE\n)\n\nWarning: Some expressions had a GC in every iteration; so filtering is disabled.\n\n\n# A tibble: 2 × 6\n expression min median `itr/sec` mem_alloc `gc/sec`\n <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>\n1 learner$train(task_sonar) 667ms 667ms 1.50 15.7MB 4.49\n2 learner_ignite$train(task_sonar) 202ms 211ms 4.78 10.7MB 4.78\n\n\n\n\nJIT Compilation\nJIT (Just-In-Time) compilation is a runtime optimization technique that compiles code into machine code during execution rather than beforehand. This has different advantages:\n\nBy JIT-compiling a model, some operations can be optimized for performance.\nA JIT-compiled model can be saved and executed without an R dependency for deployment (only LibTorch is required), e.g., in a C++ application.\nRunning a JIT-compiled model in R is faster because the whole network is executed in C++ instead of R.\n\nIn torch, this can either be done using TorchScript or by tracing a model. We will briefly discuss both approaches, but for more information, see the torch documentation.\n\nTorchScript\nTorchScript is a subset of Python – i.e., its own programming language – that can be used to define compiled functions. 
In R, this is available via the jit_compile function.\n\nf = jit_compile(\"\ndef f(x, w, bias):\n return x @ w + bias\n\")$f\n\nx = torch_randn(10, 10)\nw = torch_randn(10, 1)\nbias = torch_randn(1)\n\nout = f(x, w, bias)\nstr(out)\n\nFloat [1:10, 1:1]\n\n\nBesides syntax, there are some important differences between TorchScript and R:\n\nIn TorchScript, indexing tensors is 0-based, and\nTorchScript is statically typed, so you need to specify the types of the arguments, unless they are tensors, which is the default.\n\nBelow, we define a function that takes a list of tensors and calculates their sum.\n\nsum_jit = jit_compile(\"\ndef sum_jit(xs: List[Tensor]):\n output = torch.zeros_like(xs[0])\n for x in xs:\n output = output + x\n return output\n\")$sum_jit\n\nsum_jit(list(torch_randn(1), torch_randn(1)))\n\ntorch_tensor\n-0.7121\n[ CPUFloatType{1} ]\n\n\n\n\nTracing\nThe alternative to writing TorchScript is to write your module in R and to use jit_trace to compile it.\n\nf2 = function(x, w, bias) {\n x$matmul(w) + bias\n}\n# need to provide some example input\n# arguments are passed by position\nf2 = jit_trace(f2, torch_randn(10, 10), torch_randn(10, 100), torch_randn(100))\nout2 = f2(x, w, bias)\ntorch_equal(out, out2)\n\n[1] TRUE\n\n\nAn advantage of trace-compilation is that it even allows you to JIT-compile modules, which is currently not possible with jit_compile.\n\nnet = nn_sequential(\n nn_linear(10, 100),\n nn_relu(),\n nn_linear(100, 10)\n)\nnet_jit = jit_trace(net, torch_randn(10, 10))\n\ntorch_equal(net(x), net_jit(x))\n\n[1] TRUE\n\n\nTrace-compilation is restrictive because it only records operations applied to torch tensors and is unaware of R control flow, so you need to be careful when using it. Furthermore, it only accepts torch tensors as arguments. Unless you have dynamic inputs and outputs or modify the configuration of the module, trace-compilation should usually work. You can also check this by running the original and trace-jitted module on some example inputs and see if they return the same result.\n\n\n\n\n\n\nNote\n\n\n\nA trace-jitted module does respect the mode of the network, i.e., whether it is training or evaluating.\n\n\nIn mlr3torch, trace compilation is also available and can be enabled by setting jit_trace = TRUE in the learner.\n\nlearner = lrn(\"classif.mlp\", jit_trace = TRUE)\n\nYou can also combine TorchScript with tracing:\n\nnet_both = nn_module(\n initialize = function() {\n self$linear = nn_linear(1, 1)\n },\n forward = function(x) {\n self$linear(sum_jit(x))\n }\n)()\n\nnet_both(list(torch_randn(1), torch_randn(1)))\n\ntorch_tensor\n 1.0027\n[ CPUFloatType{1} ][ grad_fn = <ViewBackward0> ]\n\nnet_both(list(torch_randn(1)))\n\ntorch_tensor\n0.01 *\n 8.5286\n[ CPUFloatType{1} ][ grad_fn = <ViewBackward0> ]\n\n\n\n\n\n\n\n\nQuiz: Just In Time\n\n\n\nQuestion 1: Consider the trace-jitted function below. Can you predict the output of the last two lines? 
Can you explain why this happens?\n\nf = function(a, b, multiply) {\n if (multiply$item()) {\n a * b\n } else {\n a + b\n }\n}\nfjit = jit_trace(f, torch_tensor(1), torch_tensor(2), torch_tensor(TRUE))\n\nfjit(torch_tensor(2), torch_tensor(3), torch_tensor(TRUE))\n\ntorch_tensor\n 6\n[ CPUFloatType{1} ]\n\nfjit(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE))\n\ntorch_tensor\n 6\n[ CPUFloatType{1} ]\n\n\nQuestion 2: Answer the same question for the following function:\n\nf = function(a, b, multiply) {\n torch_where(multiply, a * b, a + b)\n}\nfjit = jit_trace(f, torch_tensor(1), torch_tensor(2), torch_tensor(TRUE))\n\nfjit(torch_tensor(2), torch_tensor(3), torch_tensor(TRUE))\n\ntorch_tensor\n 6\n[ CPUFloatType{1} ]\n\nfjit(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE))\n\ntorch_tensor\n 5\n[ CPUFloatType{1} ]\n\n\n\n\n\n\n\nMixed Precision Training\nAnother way to speed up the training process is to use mixed precision training. This technique involves training the model using both 16-bit and 32-bit floating point numbers. This allows reducing the memory footprint of the model and speeding up the training process.\nWe won’t cover this here, but refer to the torch documentation that explains how to do this." + "text": "JIT Compilation & Ignite Optimizers\nSome special care needs to be taken when using torch (or mlr3torch) in order to get good performance. In the future, this will hopefully not be necessary anymore, but is currently required.\n\n‘Ignite’ Optimizers\nIn torch, different versions of optimizers exist:\n\noptim_adamw\n\n<optim_adamw> object generator\n Inherits from: <inherit>\n Public:\n initialize: function (params, lr = 0.001, betas = c(0.9, 0.999), eps = 1e-08, \n loop_fun: function (group, param, g, p) \n step: function (closure = NULL) \n clone: function (deep = FALSE) \n Parent env: <environment: 0x12cfbcbc8>\n Locked objects: FALSE\n Locked class: FALSE\n Portable: TRUE\n\noptim_ignite_adamw\n\n<optim_ignite_adamw> object generator\n<optim_ignite> object generator\n Inherits from: <inherit>\n Public:\n initialize: function (params, lr = 0.001, betas = c(0.9, 0.999), eps = 1e-08, \n clone: function (deep = FALSE) \n Private:\n .config_names: lr betas eps weight_decay amsgrad\n .state_names: exp_avg exp_avg_sq max_exp_avg_sq step\n .optim: function (params, ...) \n .get_states: function (opt) \n .set_states: function (opt, params, states) \n .add_param_group: function (opt, params, lr, betas, eps, weight_decay, amsgrad) \n .assert_params: function (lr, betas, eps, weight_decay, amsgrad) \n .set_param_group_options: function (opt, list) \n .zero_grad: function (opt) \n .get_param_groups: function (ptr) \n Parent env: <environment: 0x12cae2478>\n Locked objects: FALSE\n Locked class: FALSE\n Portable: TRUE\n\n\nThe ‘ignite’ indicates that the optimizer is a version that is optimized for performance. 
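To make this concrete, here is a small sketch (plain torch code added for illustration; the toy network and random data are assumed, not part of the benchmark below) showing that an ignite optimizer is constructed and stepped much like its standard counterpart:

```{r}
# Sketch: optim_ignite_adamw() used like optim_adamw() in a manual training step.
net = nn_linear(10, 1)
opt = optim_ignite_adamw(net$parameters, lr = 0.01)

loss = nnf_mse_loss(net(torch_randn(16, 10)), torch_randn(16, 1))
opt$zero_grad()
loss$backward()
opt$step()
```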
Not for all optimizers does an ignite version exist, but for the most common ones, there does.\nBelow, we compare the performance of the default optimizer and the ignite optimizer and see that the latter is considerably faster.\n\nadamw = as_torch_optimizer(torch::optim_adamw)\nignite_adamw = as_torch_optimizer(torch::optim_ignite_adamw)\n\nlearner = lrn(\"classif.mlp\", epochs = 10, neurons = c(100, 100), batch_size = 32, optimizer = adamw)\n\nlearner_ignite = learner$clone(deep = TRUE)\nlearner_ignite$configure(\n optimizer = ignite_adamw\n)\ntask_sonar = tsk(\"sonar\")\n\nbench::mark(\n learner$train(task_sonar),\n learner_ignite$train(task_sonar),\n check = FALSE\n)\n\nWarning: Some expressions had a GC in every iteration; so filtering is disabled.\n\n\n# A tibble: 2 × 6\n expression min median `itr/sec` mem_alloc `gc/sec`\n <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>\n1 learner$train(task_sonar) 588ms 588ms 1.70 15.7MB 5.10\n2 learner_ignite$train(task_sonar) 204ms 208ms 4.77 10.7MB 4.77\n\n\n\n\nJIT Compilation\nJIT (Just-In-Time) compilation is a runtime optimization technique that compiles code into machine code during execution rather than beforehand. This has different advantages:\n\nBy JIT-compiling a model, some operations can be optimized for performance.\nA JIT-compiled model can be saved and executed without an R dependency for deployment (only LibTorch is required), e.g., in a C++ application.\nRunning a JIT-compiled model in R is faster because the whole network is executed in C++ instead of R.\n\nIn torch, this can either be done using TorchScript or by tracing a model. We will briefly discuss both approaches, but for more information, see the torch documentation.\n\nTorchScript\nTorchScript is a subset of Python – i.e., its own programming language – that can be used to define compiled functions. 
In R, this is available via the jit_compile function.\n\nf = jit_compile(\"\ndef f(x, w, bias):\n return x @ w + bias\n\")$f\n\nx = torch_randn(10, 10)\nw = torch_randn(10, 1)\nbias = torch_randn(1)\n\nout = f(x, w, bias)\nstr(out)\n\nFloat [1:10, 1:1]\n\n\nBesides syntax, there are some notable differences between TorchScript and R to be aware of:\n\nIn TorchScript, indexing tensors is 0-based, and\nTorchScript is statically typed, so you need to specify the types of the arguments, unless they are tensors, which is the default.\n\nBelow, we define a function that takes a list of tensors and calculates their sum.\n\nsum_jit = jit_compile(\"\ndef sum_jit(xs: List[Tensor]):\n output = torch.zeros_like(xs[0])\n for x in xs:\n output = output + x\n return output\n\")$sum_jit\n\nsum_jit(list(torch_randn(1), torch_randn(1)))\n\ntorch_tensor\n-0.7121\n[ CPUFloatType{1} ]\n\n\n\n\nTracing\nThe alternative to writing TorchScript is to write your module in R and to use jit_trace to compile it.\n\nf2 = function(x, w, bias) {\n x$matmul(w) + bias\n}\n# need to provide some example input\n# arguments are passed by position\nf2 = jit_trace(f2, torch_randn(10, 10), torch_randn(10, 100), torch_randn(100))\nout2 = f2(x, w, bias)\ntorch_equal(out, out2)\n\n[1] TRUE\n\n\nAn advantage of trace-compilation is that it can be applied to modules, which is currently not possible with jit_compile.\n\nnet = nn_sequential(\n nn_linear(10, 100),\n nn_relu(),\n nn_linear(100, 10)\n)\nnet_jit = jit_trace(net, torch_randn(10, 10))\n\ntorch_equal(net(x), net_jit(x))\n\n[1] TRUE\n\n\nHowever, trace-compilation is restrictive because it only records operations applied to torch tensors and is unaware of R control flow, so you need to be careful when using it. Furthermore, it only accepts torch tensors as arguments. For many simple modules, trace-compilation should usually work. You can also check this by running the original and trace-jitted module on some example inputs and see if they return the same result.\n\n\n\n\n\n\nNote\n\n\n\nA trace-jitted module does respect the mode of the network, i.e., whether it is training or evaluating.\n\n\nIn mlr3torch, trace compilation is also available and can be enabled by setting jit_trace = TRUE in the learner.\n\nlearner = lrn(\"classif.mlp\", jit_trace = TRUE)\n\nYou can also combine TorchScript with tracing:\n\nnet_both = nn_module(\n initialize = function() {\n self$linear = nn_linear(1, 1)\n },\n forward = function(x) {\n self$linear(sum_jit(x))\n }\n)()\n\nnet_both(list(torch_randn(1), torch_randn(1)))\n\ntorch_tensor\n 1.0027\n[ CPUFloatType{1} ][ grad_fn = <ViewBackward0> ]\n\nnet_both(list(torch_randn(1)))\n\ntorch_tensor\n0.01 *\n 8.5286\n[ CPUFloatType{1} ][ grad_fn = <ViewBackward0> ]\n\n\n\n\n\n\n\n\nQuiz: Just In Time\n\n\n\nQuestion 1: Consider the trace-jitted function below. Can you predict the output of the last two lines? 
Can you explain why this happens?\n\nf = function(a, b, multiply) {\n if (multiply$item()) {\n a * b\n } else {\n a + b\n }\n}\nfjit = jit_trace(f, torch_tensor(1), torch_tensor(2), torch_tensor(TRUE))\n\nfjit(torch_tensor(2), torch_tensor(3), torch_tensor(TRUE))\nfjit(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE))\n\n\n\nClick for answer\n\n\nfjit(torch_tensor(2), torch_tensor(3), torch_tensor(TRUE))\n\ntorch_tensor\n 6\n[ CPUFloatType{1} ]\n\nfjit(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE))\n\ntorch_tensor\n 6\n[ CPUFloatType{1} ]\n\n\n\nQuestion 2: Answer the same question for the following function:\n\nf = function(a, b, multiply) {\n torch_where(multiply, a * b, a + b)\n}\nfjit = jit_trace(f, torch_tensor(1), torch_tensor(2), torch_tensor(TRUE))\n\nfjit(torch_tensor(2), torch_tensor(3), torch_tensor(TRUE))\nfjit(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE))\n\n\n\nClick for answer\n\n\nfjit(torch_tensor(2), torch_tensor(3), torch_tensor(TRUE))\n\ntorch_tensor\n 6\n[ CPUFloatType{1} ]\n\nfjit(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE))\n\ntorch_tensor\n 5\n[ CPUFloatType{1} ]\n\n\n\n\n\n\n\n\nMixed Precision Training\nAnother way to speed up the training process is to use mixed precision training. This technique involves training the model using both 16-bit and 32-bit floating point numbers. This allows reducing the memory footprint of the model and speeding up the training process. We won’t cover this here, but refer to the torch documentation that explains how to do this." }, { "objectID": "notebooks/6-training-efficiency.html#methodological-approaches", "href": "notebooks/6-training-efficiency.html#methodological-approaches", "title": "Training Efficiency", "section": "Methodological Approaches", - "text": "Methodological Approaches\n\nValidation and Early Stopping\nFor more details on this topic, see the corresponding chapter in the mlr3 book.\nAs we have already seen in one of the previous notebooks, in deep learning, some part of the data is often used for validation purposes. 
This allows monitoring the performance of the model on unseen data.\nIn mlr3torch, we can track the performance of the model on a validation set by specifying:\n\nvalidate, which is the ratio of the data that is used for validation\nmeasures_valid, which is a list of measures to use for validation\neval_freq, which is the frequency at which the validation is performed\ncallbacks, which is a list of callbacks to use during training, in this case, we use the history callback, which records the performance of the model on the validation set at regular intervals, enabling us to monitor and analyze the model’s performance over time.\n\n\n\n\n\n\n\nTip\n\n\n\nWhile mlr3torch comes with predefined callbacks, it is also possible to define custom callbacks that modify the training process.\n\n\n\ntask = tsk(\"sonar\")\n\nmlp_learner = lrn(\"classif.mlp\",\n neurons = c(50, 50), batch_size = 256, epochs = 400,\n optimizer = t_opt(\"adam\", lr = 0.003),\n predict_type = \"prob\", jit_trace = TRUE,\n # Validation / Performance Monitoring\n validate = 0.3, # how much data to use for validation\n measures_valid = msr(\"classif.logloss\"), # how to evaluate train performance\n measures_train = msr(\"classif.logloss\"), # how to evaluate validation performance\n callbacks = t_clbk(\"history\"), # history callbacks save train and validation performance\n eval_freq = 10 # after how many training epochs to perform validation\n)\nmlp_learner$train(task)\nhistory = mlp_learner$model$callbacks$history\nstr(history)\n\nClasses 'data.table' and 'data.frame': 40 obs. of 3 variables:\n $ epoch : num 10 20 30 40 50 60 70 80 90 100 ...\n $ train.classif.logloss: num 0.678 0.643 0.569 0.515 0.478 ...\n $ valid.classif.logloss: num 0.667 0.618 0.536 0.469 0.436 ...\n - attr(*, \".internal.selfref\")=<externalptr> \n - attr(*, \"sorted\")= chr \"epoch\"\n\nhead(history)\n\nKey: <epoch>\n epoch train.classif.logloss valid.classif.logloss\n <num> <num> <num>\n1: 10 0.6775741 0.6665855\n2: 20 0.6430574 0.6176948\n3: 30 0.5685190 0.5364953\n4: 40 0.5151559 0.4694589\n5: 50 0.4780497 0.4363074\n6: 60 0.3861667 0.4153698\n\n\nBelow we plot the training and validation for the different epochs:\n\n\n\n\n\n\n\n\n\nInstead of only monitoring the validation loss (and watching it get worse and worse), we can also stop the training process dynamically when the validation loss begins to increase. This regularization technique is called early stopping, and it prevents overfitting during the training of iteratively trained machine learning models. It involves monitoring the validation loss during training and stopping the training process when the validation loss begins to increase, indicating that the model is starting to overfit the training data.\nThe key configuration option for early stopping is the patience parameter, which defines the number of epochs to wait after the last improvement in validation loss before stopping the training. For example, if patience is set to 10, the training will continue for 10 additional epochs after the last observed improvement in validation loss. 
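In mlr3torch, early stopping with this patience of 10 boils down to a few learner settings; the following is only a sketch with assumed hyperparameters (the tutorial configures the same thing on the existing `mlp_learner` further below):

```{r}
# Sketch: early stopping needs a validation set, a validation measure, and a
# patience; training stops once the validation log-loss has not improved for
# 10 consecutive epochs (epochs = 400 only acts as an upper bound).
lrn("classif.mlp",
  epochs = 400, batch_size = 256, predict_type = "prob",
  validate = 0.3,
  measures_valid = msr("classif.logloss"),
  patience = 10
)
```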
If no improvement is seen during this period, training will be halted.\nAdvantages of early stopping include:\n\nPrevention of Overfitting: By stopping training when the model starts to overfit, we can achieve better generalization on unseen data.\nResource Efficiency: It saves computational resources by avoiding unnecessary training epochs once the model performance has plateaued.\n\nNow, let’s train the learner again using early stopping with a patience of 10 epochs:\n\nmlp_learner$param_set$set_values(\n patience = 5\n)\nmlp_learner$train(task)\nmlp_learner$internal_tuned_values$epochs\n\n[1] 160\n\n\nBeyond only tuning the number of epochs, mlr3’s internal tuning mechanism also allows tuning the number of epochs internally while using an offline tuning method to optimize other hyperparameters. To use this, we can set the parameters we want to tune TuneTokens:\n\nlibrary(mlr3tuning)\nmlp_learner$param_set$set_values(\n epochs = to_tune(upper = 100, internal = TRUE),\n opt.lr = to_tune(lower = 1e-4, upper = 1e-1, logscale = TRUE)\n)\n\nWe could now pass this learner to a tuner, where the tuner would only optimize the learning rate, while the learner optimizes the epochs internally." + "text": "Methodological Approaches\n\nValidation and Early Stopping\nFor more details on this topic, see the corresponding chapter in the mlr3 book.\nAs we have already seen in one of the previous notebooks, in deep learning, some part of the data is often used for validation purposes. This allows monitoring the performance of the model on unseen data.\nIn mlr3torch, we can track the performance of the model on a validation set by specifying:\n\nvalidate, which is the ratio of the data that is used for validation\nmeasures_valid, which is a list of measures to evaluate the validation performance\neval_freq, which is the frequency at which the validation is performed\ncallbacks, which is a list of callbacks to use during training, in this case, we use the t_clbk(\"history\") callback, which records the performance of the model on the validation set at regular intervals, enabling us to monitor and analyze the model’s performance over time.\n\n\ntask = tsk(\"sonar\")\n\nmlp_learner = lrn(\"classif.mlp\",\n neurons = c(50, 50), batch_size = 256, epochs = 400,\n optimizer = t_opt(\"adam\", lr = 0.003),\n predict_type = \"prob\", jit_trace = TRUE,\n # Validation / Performance Monitoring\n validate = 0.3, # how much data to use for validation\n measures_valid = msr(\"classif.logloss\"), # how to evaluate train performance\n measures_train = msr(\"classif.logloss\"), # how to evaluate validation performance\n callbacks = t_clbk(\"history\"), # history callbacks save train and validation performance\n eval_freq = 10 # after how many training epochs to perform validation\n)\nmlp_learner$train(task)\nhistory = mlp_learner$model$callbacks$history\nhead(history)\n\nKey: <epoch>\n epoch train.classif.logloss valid.classif.logloss\n <num> <num> <num>\n1: 10 0.6775741 0.6665855\n2: 20 0.6430574 0.6176948\n3: 30 0.5685190 0.5364953\n4: 40 0.5151559 0.4694589\n5: 50 0.4780497 0.4363074\n6: 60 0.3861667 0.4153698\n\n\nBelow we plot the training and validation for the different epochs:\n\n\n\n\n\n\n\n\n\nInstead of only monitoring the validation loss (and watching it get worse and worse), we can also stop the training process dynamically when the validation loss begins to increase. 
This regularization technique is called early stopping, and it prevents overfitting during the training of iteratively trained machine learning models.\nThe key configuration option for early stopping is the patience parameter, which defines the number of epochs to wait after the last improvement in validation loss before stopping the training. For example, if the patience is set to 5, the training will continue for 5 additional epochs after the last observed improvement in validation loss. If no improvement is seen during this period, training will be halted.\nAdvantages of early stopping include:\n\nPrevention of Overfitting: By stopping training when the model starts to overfit, we can achieve better generalization on unseen data.\nResource Efficiency: It saves computational resources by avoiding unnecessary training epochs once the model performance has plateaued.\n\nNow, let’s train the learner again using early stopping with a patience of 5 epochs:\n\nmlp_learner$param_set$set_values(\n patience = 5\n)\nmlp_learner$train(task)\nmlp_learner$internal_tuned_values$epochs\n\n[1] 160\n\n\nBeyond only tuning the number of epochs, mlr3’s internal tuning mechanism also allows tuning the number of epochs internally while using an offline tuning method to optimize other hyperparameters. To use this, we can set the parameters we want to tune using to_tune(), but need to set internal = TRUE for the epochs parameter.\n\nlibrary(mlr3tuning)\nmlp_learner$param_set$set_values(\n epochs = to_tune(upper = 100, internal = TRUE),\n opt.lr = to_tune(lower = 1e-4, upper = 1e-1, logscale = TRUE)\n)\n\nWe could now pass this learner to a tuner as usual." }, { "objectID": "notebooks/6-training-efficiency.html#architecture-design", "href": "notebooks/6-training-efficiency.html#architecture-design", "title": "Training Efficiency", "section": "Architecture Design", - "text": "Architecture Design\nAnother essential aspect of training neural networks efficiently and effectively is the design of the network architecture, which can be a challenging task. However, for many tasks, there are well-known architectures that perform well and can be used as a starting point. Unless there is a specific reason to design a new architecture, it is recommended to use such an architecture.\n\n\n\n\n\n\nNote\n\n\n\nBecause the Python deep learning ecosystem is so large, many more architectures are implemented in Python than in R. One way to use them in R is to simply translate the PyTorch code to (R-)torch. While PyTorch and (R-)torch are quite similar, there are some differences, e.g., 1-based and 0-based indexing. 
The torch website contains a brief tutorial on how to do this.\n\n\nNonetheless, we will cover important techniques that can be used to speed up the training process, namely batch normalization and dropout.\n\nBatch Normalization\nBatch Normalization is an important technique in deep learning that contributed significantly to speeding up the training process.\nThe formula for batch normalization (during training) is given by:\n\\[\n\\hat{x} = \\frac{x - \\mu_B}{\\sqrt{\\sigma_B^2 + \\epsilon}}\n\\]\nwhere:\n\n\\(\\hat{x}\\) is the normalized output,\n\\(x\\) is the input,\n\\(\\mu_B\\) is the mean of the batch,\n\\(\\sigma_B^2\\) is the variance of the batch,\n\\(\\epsilon\\) is a small constant added for numerical stability.\n\nDuring inference, the module uses the running mean and variance of the training data to normalize the input.\nIn torch, different versions of batch normalization exist for different dimensions of the input tensor. Below, we illustrate the batch normalization module using a 1D input tensor (the batch dimension does not count here)\n\nx = torch_randn(10, 5)\nbn = nn_batch_norm1d(num_features = 5)\nbn(x)\n\ntorch_tensor\n 1.4613 -1.3934 -0.2146 1.0406 0.1413\n-0.9634 -0.3388 1.7441 0.7744 2.1476\n-2.0328 0.5667 -2.0592 0.4071 -0.0529\n 0.6778 0.3264 0.2637 -0.2301 -0.0409\n-0.9243 0.1298 -0.6447 -1.5477 -2.1935\n 0.8150 -0.1962 0.7988 -1.5426 0.1137\n-0.2350 -2.0121 -0.1847 1.1725 0.0143\n 0.8381 0.6141 0.9971 1.0148 -0.5667\n 0.2166 0.7147 -0.7208 -0.1408 -0.0285\n 0.1467 1.5887 0.0203 -0.9482 0.4657\n[ CPUFloatType{10,5} ][ grad_fn = <NativeBatchNormBackward0> ]\n\n\n\n\n\n\n\n\nQuiz: Batch Normalization\n\n\n\nQuestion 1: Earlier we have learned that nn_modules have buffers and parameters, where the latter are learned with gradient descent. Do you think the mean and variance are parameters or buffers?\n\n\nClick for answer\n\nThey are both buffers as they only store the variance and running mean of all training samples seen, i.e., they are not updated using gradient information.\n\nQuestion 2: Training vs. Evaluation Mode: While many nn_modules behave the same way irrespective of their mode, batch normalization is an example of a module that behaves differently during training and evaluation, i.e., during training, the module uses the mean and variance of the current batch, while during evaluation, it uses the running mean and variance of all training samples seen.\n\nbn(x[1:10, ])\n\ntorch_tensor\n 1.4613 -1.3934 -0.2146 1.0406 0.1413\n-0.9634 -0.3388 1.7441 0.7744 2.1476\n-2.0328 0.5667 -2.0592 0.4071 -0.0529\n 0.6778 0.3264 0.2637 -0.2301 -0.0409\n-0.9243 0.1298 -0.6447 -1.5477 -2.1935\n 0.8150 -0.1962 0.7988 -1.5426 0.1137\n-0.2350 -2.0121 -0.1847 1.1725 0.0143\n 0.8381 0.6141 0.9971 1.0148 -0.5667\n 0.2166 0.7147 -0.7208 -0.1408 -0.0285\n 0.1467 1.5887 0.0203 -0.9482 0.4657\n[ CPUFloatType{10,5} ][ grad_fn = <NativeBatchNormBackward0> ]\n\n\nWhich of the following statements is true and why?\n\nbn$eval()\nequal1 = torch_equal(\n torch_cat(list(bn(x[1:2, ]), bn(x[3:4, ]))),\n bn(x[1:4, ])\n)\nbn$train()\nequal2 = torch_equal(\n torch_cat(list(bn(x[1:2, ]), bn(x[3:4, ]))),\n bn(x[1:4, ])\n)\n\n\n\nClick for answer\n\n\nc(equal1, equal2)\n\n[1] TRUE FALSE\n\n\nThe first statement is true because, in evaluation mode, the module uses the running mean and variance of all training samples seen. 
The second statement is false because the first tensor uses different means and variances for rows 1-2 and 3-4, while the second tensor uses the same mean and variance for all rows.\n\n\n\nTo illustrate its effectiveness, we will define a simple CNN, with and without batch normalization, train it on CIFAR-10, and compare their performance.\nTo build the neural networks, we will use mlr3torch, which allows building architectures from PipeOps. This makes the creation of network architectures easier, as we, e.g., don’t have to specify auxiliary parameters (such as the input dimension of a linear layer). Recall that the po(\"torch_ingress_ltnsr\") is a special PipeOp that marks the input of the neural network. Note that po(\"nn_relu_1\") is equivalent to po(\"nn_relu\", id = \"nn_relu_1\"). We need to specify unique ID parameters as this is required in mlr3pipelines.\n\ncnn_bn = po(\"torch_ingress_ltnsr\") %>>%\n po(\"nn_conv2d_1\", out_channels = 32, kernel_size = 3, stride = 1, padding = 1) %>>%\n po(\"nn_batch_norm2d_1\") %>>%\n po(\"nn_relu_1\") %>>%\n po(\"nn_max_pool2d_1\", kernel_size = 2, stride = 2) %>>%\n po(\"nn_conv2d_2\", out_channels = 64, kernel_size = 3, stride = 1, padding = 1) %>>%\n po(\"nn_batch_norm2d_2\") %>>%\n po(\"nn_relu_2\") %>>%\n po(\"nn_max_pool2d_2\", kernel_size = 2, stride = 2)\n\ncnn = po(\"torch_ingress_ltnsr\") %>>%\n po(\"nn_conv2d_1\", out_channels = 32, kernel_size = 3, stride = 1, padding = 1) %>>%\n po(\"nn_relu_1\") %>>%\n po(\"nn_max_pool2d_1\", kernel_size = 2, stride = 2) %>>%\n po(\"nn_conv2d\", out_channels = 64, kernel_size = 3, stride = 1, padding = 1) %>>%\n po(\"nn_relu_2\") %>>%\n po(\"nn_max_pool2d_2\", kernel_size = 2, stride = 2)\n\nhead = po(\"nn_flatten\") %>>%\n po(\"nn_linear\", out_features = 128) %>>%\n po(\"nn_relu\") %>>%\n po(\"nn_head\")\n\nmodel = po(\"torch_optimizer\", optimizer = t_opt(\"adam\", lr = 0.003)) %>>%\n po(\"torch_model_classif\",\n epochs = 100,\n batch_size = 256,\n predict_type = \"prob\",\n device = \"cuda\"\n )\n\nWe evaluate the two models on the CIFAR-10 image classification task that we have introduced earlier. There, the goal is to classify images into 10 different classes.\n\nnet_bn = as_learner(cnn_bn %>>% head %>>% model)\nnet_bn$id = \"net_bn\"\nnet = as_learner(cnn %>>% head %>>% model)\nnet$id = \"net\"\n\ncifar10 = tsk(\"cifar10\")\nresampling = rsmp(\"holdout\")$instantiate(cifar10)\n\ndesign = benchmark_grid(\n task = cifar10,\n learner = list(net_bn, net),\n resampling = resampling\n)\ndesign\n\n task learner resampling\n <char> <char> <char>\n1: cifar10 net_bn holdout\n2: cifar10 net holdout\n\n\n\nbmr = benchmark(design)\nbmr$aggregate()" + "text": "Architecture Design\nAnother essential aspect of training neural networks efficiently and effectively is the design of the network architecture, which can be a challenging task. However, for many problems, there are predefined architectures that perform well and can be used. Unless there is a specific reason to design a new architecture, it is recommended to use such an architecture.\n\n\n\n\n\n\nNote\n\n\n\nBecause the Python deep learning ecosystem is so large, many more architectures are implemented in Python than in R. One way to use them in R is to simply translate the PyTorch code to (R-)torch. While PyTorch and (R-)torch are quite similar, there are some differences, e.g., 1-based and 0-based indexing. 
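As a toy illustration of the indexing difference (an example added here, not taken from the tutorial), the row that a PyTorch script addresses as `x[0]` is addressed as `x[1]` in (R-)torch:

```{r}
# Sketch: (R-)torch tensors are indexed starting at 1, not 0.
x_r = torch_randn(3, 2)
x_r[1, ]  # first row; the equivalent PyTorch expression would be x_r[0]
```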
The torch website contains a brief tutorial on this topic.\n\n\nNonetheless, we will cover important techniques that can be used to speed up the training process, namely batch normalization and dropout.\n\nBatch Normalization\nBatch Normalization is an important technique in deep learning that contributed significantly to speeding up the training process.\nThe formula for batch normalization (during training) is given by:\n\\[\n\\hat{x} = \\frac{x - \\mu_B}{\\sqrt{\\sigma_B^2 + \\epsilon}}\n\\]\nwhere:\n\n\\(\\hat{x}\\) is the normalized output,\n\\(x\\) is the input,\n\\(\\mu_B\\) is the mean of the batch,\n\\(\\sigma_B^2\\) is the variance of the batch,\n\\(\\epsilon\\) is a small constant added for numerical stability.\n\nDuring inference, the module uses the running mean and variance of the training data to normalize the input.\nIn torch, different versions of batch normalization exist for different dimensions of the input tensor. Below, we illustrate the batch normalization module using a 1D input tensor (the batch dimension does not count here):\n\nx = torch_randn(10, 5)\nbn = nn_batch_norm1d(num_features = 5)\nbn(x)\n\ntorch_tensor\n 1.4613 -1.3934 -0.2146 1.0406 0.1413\n-0.9634 -0.3388 1.7441 0.7744 2.1476\n-2.0328 0.5667 -2.0592 0.4071 -0.0529\n 0.6778 0.3264 0.2637 -0.2301 -0.0409\n-0.9243 0.1298 -0.6447 -1.5477 -2.1935\n 0.8150 -0.1962 0.7988 -1.5426 0.1137\n-0.2350 -2.0121 -0.1847 1.1725 0.0143\n 0.8381 0.6141 0.9971 1.0148 -0.5667\n 0.2166 0.7147 -0.7208 -0.1408 -0.0285\n 0.1467 1.5887 0.0203 -0.9482 0.4657\n[ CPUFloatType{10,5} ][ grad_fn = <NativeBatchNormBackward0> ]\n\n\n\n\n\n\n\n\nQuiz: Batch Normalization\n\n\n\nQuestion 1: Earlier we have learned that nn_modules have buffers and parameters, where only the latter are learned with gradient descent. Do you think the mean and variance are parameters or buffers?\n\n\nClick for answer\n\nThey are both buffers as they only store the variance and running mean of all training samples seen, i.e., they are not updated using gradient information.\n\nQuestion 2: Training vs. Evaluation Mode: While many nn_modules behave the same way irrespective of their mode, batch normalization is an example of a module that behaves differently during training and evaluation. During training, the module uses the mean and variance of the current batch, while during evaluation, it uses the running mean and variance of all training samples seen.\n\nbn(x[1:10, ])\n\ntorch_tensor\n 1.4613 -1.3934 -0.2146 1.0406 0.1413\n-0.9634 -0.3388 1.7441 0.7744 2.1476\n-2.0328 0.5667 -2.0592 0.4071 -0.0529\n 0.6778 0.3264 0.2637 -0.2301 -0.0409\n-0.9243 0.1298 -0.6447 -1.5477 -2.1935\n 0.8150 -0.1962 0.7988 -1.5426 0.1137\n-0.2350 -2.0121 -0.1847 1.1725 0.0143\n 0.8381 0.6141 0.9971 1.0148 -0.5667\n 0.2166 0.7147 -0.7208 -0.1408 -0.0285\n 0.1467 1.5887 0.0203 -0.9482 0.4657\n[ CPUFloatType{10,5} ][ grad_fn = <NativeBatchNormBackward0> ]\n\n\nWhich of the following statements is true and why?\n\nbn$eval()\nequal1 = torch_equal(\n torch_cat(list(bn(x[1:2, ]), bn(x[3:4, ]))),\n bn(x[1:4, ])\n)\nbn$train()\nequal2 = torch_equal(\n torch_cat(list(bn(x[1:2, ]), bn(x[3:4, ]))),\n bn(x[1:4, ])\n)\n\n\n\nClick for answer\n\n\nc(equal1, equal2)\n\n[1] TRUE FALSE\n\n\nThe first statement is true because, in evaluation mode, the module uses the running mean and variance of all training samples seen. 
The second statement is false because the first tensor uses different means and variances for rows 1-2 and 3-4, while the second tensor uses the same mean and variance for all rows.\n\n\n\nTo illustrate its effectiveness, we will define a simple CNN, with and without batch normalization, train it on CIFAR-10, and compare their performances.\nTo build the neural networks, we will use mlr3torch, which allows building architectures from PipeOps. Recall that the po(\"torch_ingress_ltnsr\") is a special PipeOp that marks the input of the neural network. Note that po(\"nn_relu_1\") is equivalent to po(\"nn_relu\", id = \"nn_relu_1\"). We need to specify unique IDs for each PipeOp as this is required in mlr3pipelines graphs.\n\ncnn_bn = po(\"torch_ingress_ltnsr\") %>>%\n po(\"nn_conv2d_1\", out_channels = 32, kernel_size = 3, stride = 1, padding = 1) %>>%\n po(\"nn_batch_norm2d_1\") %>>%\n po(\"nn_relu_1\") %>>%\n po(\"nn_max_pool2d_1\", kernel_size = 2, stride = 2) %>>%\n po(\"nn_conv2d_2\", out_channels = 64, kernel_size = 3, stride = 1, padding = 1) %>>%\n po(\"nn_batch_norm2d_2\") %>>%\n po(\"nn_relu_2\") %>>%\n po(\"nn_max_pool2d_2\", kernel_size = 2, stride = 2)\n\ncnn = po(\"torch_ingress_ltnsr\") %>>%\n po(\"nn_conv2d_1\", out_channels = 32, kernel_size = 3, stride = 1, padding = 1) %>>%\n po(\"nn_relu_1\") %>>%\n po(\"nn_max_pool2d_1\", kernel_size = 2, stride = 2) %>>%\n po(\"nn_conv2d\", out_channels = 64, kernel_size = 3, stride = 1, padding = 1) %>>%\n po(\"nn_relu_2\") %>>%\n po(\"nn_max_pool2d_2\", kernel_size = 2, stride = 2)\n\nhead = po(\"nn_flatten\") %>>%\n po(\"nn_linear\", out_features = 128) %>>%\n po(\"nn_relu\") %>>%\n po(\"nn_head\")\n\nmodel = po(\"torch_optimizer\", optimizer = t_opt(\"adam\", lr = 0.003)) %>>%\n po(\"torch_model_classif\",\n epochs = 100,\n batch_size = 256,\n predict_type = \"prob\",\n device = \"cuda\"\n )\n\nWe evaluate the two models on the CIFAR-10 image classification task that we have introduced earlier. There, the goal is to classify images into 10 different classes.\n\nnet_bn = as_learner(cnn_bn %>>% head %>>% model)\nnet_bn$id = \"net_bn\"\nnet = as_learner(cnn %>>% head %>>% model)\nnet$id = \"net\"\n\ncifar10 = tsk(\"cifar10\")\nresampling = rsmp(\"holdout\")$instantiate(cifar10)\n\ndesign = benchmark_grid(\n task = cifar10,\n learner = list(net_bn, net),\n resampling = resampling\n)\ndesign\n\n task learner resampling\n <char> <char> <char>\n1: cifar10 net_bn holdout\n2: cifar10 net holdout\n\n\n\nbmr = benchmark(design)\nbmr$aggregate()" }, { "objectID": "notebooks/6-training-efficiency.html#dropout", "href": "notebooks/6-training-efficiency.html#dropout", "title": "Training Efficiency", "section": "Dropout", - "text": "Dropout\nDropout is a regularization technique used to prevent overfitting in neural networks by randomly setting a fraction of input units to zero during training. This encourages the network to learn more robust features that are not reliant on specific neurons, thereby improving its generalization capabilities. During each training iteration, dropout randomly “drops” a subset of neurons by setting their activations to zero with a specified probability (commonly between 20% to 50%). This forces the network to distribute the learned representations more evenly across neurons, reducing the reliance on any single neuron and mitigating overfitting. 
Dropout is more commonly used in the context of fully connected layers.\n\n\n\n\n\nSource: https://medium.com/konvergen/understanding-dropout-ddb60c9f98aa\nJust like batch normalization, it also has different behavior during training and evaluation.\n\ndropout = nn_dropout(p = 0.5)\ndropout(x)\n\ntorch_tensor\n 0.0000 -3.9488 0.0093 0.0000 0.7024\n-0.0000 -1.4141 0.0000 2.9566 5.7694\n-4.4366 0.7622 -0.0000 2.1163 0.2118\n 0.0000 0.0000 0.0000 0.6584 0.2422\n-0.0000 -0.0000 -0.9663 -0.0000 -5.1942\n 0.0000 -1.0714 2.3080 -0.0000 0.6326\n-1.1987 -5.4360 0.0000 3.8675 0.0000\n 0.0000 0.8761 2.7579 3.5069 -0.0000\n-0.3855 1.1178 -0.0000 0.8627 0.0000\n-0.0000 0.0000 0.0000 -0.0000 1.5217\n[ CPUFloatType{10,5} ]\n\ndropout$eval()\ndropout(x)\n\ntorch_tensor\n 0.9281 -1.9744 0.0046 1.7829 0.3512\n-1.2553 -0.7071 2.2261 1.4783 2.8847\n-2.2183 0.3811 -2.0875 1.0582 0.1059\n 0.2226 0.0924 0.5471 0.3292 0.1211\n-1.2201 -0.1440 -0.4831 -1.1782 -2.5971\n 0.3462 -0.5357 1.1540 -1.1725 0.3163\n-0.5994 -2.7180 0.0385 1.9338 0.1908\n 0.3669 0.4380 1.3789 1.7534 -0.5429\n-0.1927 0.5589 -0.5695 0.4313 0.1367\n-0.2556 1.6093 0.2711 -0.4924 0.7609\n[ CPUFloatType{10,5} ]\n\n\nTo look at the effects, we will create a second classification head with dropout and then define new learners\n\nhead_dropout = po(\"nn_flatten\") %>>%\n po(\"nn_linear\", out_features = 128) %>>%\n po(\"nn_relu\") %>>%\n po(\"nn_dropout\", p = 0.5) %>>%\n po(\"nn_head\")\n\nnet_bn_dropout = as_learner(cnn_bn %>>% head_dropout %>>% model)\nnet_bn_dropout$id = \"net_bn_dropout\"\nnet_dropout = as_learner(cnn %>>% head_dropout %>>% model)\nnet_dropout$id = \"net_dropout\"\n\ndesign2 = benchmark_grid(\n task = cifar10,\n learner = list(net_bn_dropout, net_dropout),\n resampling = resampling\n)\n\nNext, we run the second benchmark experiment and afterwards combine the results with the first benchmark experiment.\n\nbmr2 = benchmark(design2)\nbmr = c(bmr, bmr2)\nautoplot(bmr)\n\n\n\n\n\n\n\nQuiz: Dropout\n\n\n\nQuestion 1: Worse Training Loss: You are training a neural network with and without dropout. The training loss is higher with dropout, is this a bug?\n\n\nClick for answer\n\nNot necessarily, as dropout is a regularization technique that prevents overfitting. It’s goal is to reduce the generalization performance of the model." + "text": "Dropout\nDropout is a regularization technique used to prevent overfitting in neural networks. During each training iteration, dropout randomly “drops” a subset of neurons by setting their activations to zero with a specified probability (commonly between 20% to 50%). This forces the network to distribute the learned representations more evenly across neurons. 
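For concreteness, here is a brief sketch (a toy architecture assumed for illustration, not the benchmark network defined below) of dropout placed between fully connected layers:

```{r}
# Sketch: dropout after each hidden activation of a small fully connected network.
mlp_drop = nn_sequential(
  nn_linear(10, 64),
  nn_relu(),
  nn_dropout(p = 0.3),
  nn_linear(64, 64),
  nn_relu(),
  nn_dropout(p = 0.3),
  nn_linear(64, 2)
)
mlp_drop(torch_randn(8, 10))
```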
Dropout is most commonly used in the context of fully connected layers.\n\n\n\n\n\nSource\nJust like batch normalization, it also has different behavior during training and evaluation.\n\ndropout = nn_dropout(p = 0.5)\ndropout(x)\n\ntorch_tensor\n 0.0000 -3.9488 0.0093 0.0000 0.7024\n-0.0000 -1.4141 0.0000 2.9566 5.7694\n-4.4366 0.7622 -0.0000 2.1163 0.2118\n 0.0000 0.0000 0.0000 0.6584 0.2422\n-0.0000 -0.0000 -0.9663 -0.0000 -5.1942\n 0.0000 -1.0714 2.3080 -0.0000 0.6326\n-1.1987 -5.4360 0.0000 3.8675 0.0000\n 0.0000 0.8761 2.7579 3.5069 -0.0000\n-0.3855 1.1178 -0.0000 0.8627 0.0000\n-0.0000 0.0000 0.0000 -0.0000 1.5217\n[ CPUFloatType{10,5} ]\n\ndropout$eval()\ndropout(x)\n\ntorch_tensor\n 0.9281 -1.9744 0.0046 1.7829 0.3512\n-1.2553 -0.7071 2.2261 1.4783 2.8847\n-2.2183 0.3811 -2.0875 1.0582 0.1059\n 0.2226 0.0924 0.5471 0.3292 0.1211\n-1.2201 -0.1440 -0.4831 -1.1782 -2.5971\n 0.3462 -0.5357 1.1540 -1.1725 0.3163\n-0.5994 -2.7180 0.0385 1.9338 0.1908\n 0.3669 0.4380 1.3789 1.7534 -0.5429\n-0.1927 0.5589 -0.5695 0.4313 0.1367\n-0.2556 1.6093 0.2711 -0.4924 0.7609\n[ CPUFloatType{10,5} ]\n\n\nTo look at the effects, we will create a second classification head with dropout and then define new learners.\n\nhead_dropout = po(\"nn_flatten\") %>>%\n po(\"nn_linear\", out_features = 128) %>>%\n po(\"nn_relu\") %>>%\n po(\"nn_dropout\", p = 0.5) %>>%\n po(\"nn_head\")\n\nnet_bn_dropout = as_learner(cnn_bn %>>% head_dropout %>>% model)\nnet_bn_dropout$id = \"net_bn_dropout\"\nnet_dropout = as_learner(cnn %>>% head_dropout %>>% model)\nnet_dropout$id = \"net_dropout\"\n\ndesign2 = benchmark_grid(\n task = cifar10,\n learner = list(net_bn_dropout, net_dropout),\n resampling = resampling\n)\n\nNext, we run the second benchmark experiment and afterwards combine the results with the first benchmark experiment.\n\nbmr2 = benchmark(design2)\nbmr = c(bmr, bmr2)\nautoplot(bmr)\n\n\n\n\n\n\n\nQuiz: Dropout\n\n\n\nQuestion 1: Worse Training Loss: You are training a neural network with and without dropout. The training loss is higher with dropout, is this a bug?\n\n\nClick for answer\n\nNot necessarily, as dropout is a regularization technique that prevents overfitting. Its goal is to reduce the generalization performance of the model and not to improve training performance." }, { "objectID": "notebooks/6-training-efficiency.html#transfer-learning", "href": "notebooks/6-training-efficiency.html#transfer-learning", "title": "Training Efficiency", "section": "Transfer Learning", - "text": "Transfer Learning\nTransfer learning is a powerful technique in machine learning where a pre-trained model developed for a specific task is reused as the starting point for a model on a second, related task. Instead of training a model from scratch, which can be time-consuming and computationally expensive, transfer learning leverages the knowledge gained from a previously learned task to improve learning efficiency and performance on a new task.\nThe advantages of transfer learning are:\n\nReduced Training Time: Leveraging a pre-trained model can significantly decrease the time required to train a new model, as the foundational feature extraction layers are already optimized.\nImproved Performance: Transfer learning can enhance model performance, especially when the new task has limited training data. 
The pre-trained model’s knowledge helps in achieving better generalization.\nResource Efficiency: Utilizing pre-trained models reduces the computational resources needed, making it feasible to develop sophisticated models without extensive hardware.\n\nWhen the model is then trained on a new task, only the last layer is replaced with a new output layer to adjust for the new task.\nThis is visualized below:\n\nSource: https://en.wikipedia.org/wiki/Transfer_learning\nmlr3torch connects various pretrained image networks that are available in the torchvision package. The ResNet-18 model is a popular pre-trained model that was pretrained on ImageNet. We can use the pretrained weights by setting the pretrained parameter to TRUE.\n\nresnet = lrn(\"classif.resnet18\",\n pretrained = TRUE,\n epochs = 2,\n batch_size = 256,\n validate = 0.3,\n measures_valid = msr(\"classif.logloss\"),\n device = \"cuda\",\n predict_type = \"prob\",\n id = \"pretrained\"\n)\nresnet_no_pretrain = resnet$clone(deep = TRUE)\nresnet_no_pretrain$param_set$set_values(\n pretrained = FALSE\n)\nresnet_no_pretrain$id = \"not_pretrained\"\n\ngrid = benchmark_grid(\n task = tsk(\"cifar10\"),\n learner = list(resnet, resnet_no_pretrain),\n resampling = rsmp(\"insample\")\n)\n\nbmr = benchmark(grid, store_models = TRUE)\nbmr$aggregate()\n\nWhen fine-tuning a pretrained model like ResNet-18, it’s common to observe instabilities in gradients, which can manifest as fluctuating validation performance. This can e.g. be because the learning rate is too high (compared to the learning rate that was used during pretraining).\nTo address this, one can:\n\nUse a smaller learning rate for the pretrained layers than for the new output head.\nFreeze the pretrained layers (for some epochs) and only train the new output head.\n\nIn mlr3torch this can be achieved via the callback mechanism. For the unfreezing, there even exists a predefined callback t_clbk(\"unfreeze\"). To create a custom callback, the torch_callback() function can be used. A tutorial on this can be found on the mlr3torch package website.\n\n\n\n\n\n\nIn-Context Learning\n\n\n\nLarge foundation models (such as GPT-4) even allow to perform tasks on which they were not pretrained on without any finetuning. This is referred to as in-context learning or zero-shot learning. There, the task is fed into the model during inference: “Hey ChatGPT, is What is the sentiment of this sentence. Return -1 for sad, 0 for neutral, 1 for happy: ”" + "text": "Transfer Learning\nTransfer learning is a powerful technique in machine learning where a pre-trained model developed for a specific task is reused as the starting point for a model on a second, related task. Instead of training a model from scratch, which can be time-consuming and computationally expensive, transfer learning leverages the knowledge gained from a previously learned task to improve learning efficiency and performance on a new task.\nThe advantages of transfer learning are:\n\nReduced Training Time: Leveraging a pre-trained model can significantly decrease the time required to train a new model, as the foundational feature extraction layers are already optimized.\nImproved Performance: Transfer learning can enhance model performance, especially when the new task has limited training data. 
The pre-trained model’s knowledge helps in achieving better generalization.\nResource Efficiency: Utilizing pre-trained models reduces the computational resources needed, making it feasible to develop sophisticated models without extensive hardware.\n\nWhen the model is then trained on a new task, only the last layer is replaced with a new output layer to adjust for the new task.\nThis is visualized below:\n\nSource\nmlr3torch offers various pretrained image networks that are available through the torchvision package. The ResNet-18 model is a popular pre-trained model that was pretrained on ImageNet. We can use the pretrained weights by setting the pretrained parameter to TRUE.\n\nresnet = lrn(\"classif.resnet18\",\n pretrained = TRUE,\n epochs = 2,\n batch_size = 256,\n validate = 0.3,\n measures_valid = msr(\"classif.logloss\"),\n device = \"cuda\",\n predict_type = \"prob\",\n id = \"pretrained\"\n)\nresnet_no_pretrain = resnet$clone(deep = TRUE)\nresnet_no_pretrain$param_set$set_values(\n pretrained = FALSE\n)\nresnet_no_pretrain$id = \"not_pretrained\"\n\ngrid = benchmark_grid(\n task = tsk(\"cifar10\"),\n learner = list(resnet, resnet_no_pretrain),\n resampling = rsmp(\"insample\")\n)\n\nbmr = benchmark(grid, store_models = TRUE)\nbmr$aggregate()\n\nWhen fine-tuning a pretrained model like ResNet-18, it’s common to observe instabilities in gradients, which can manifest as fluctuating validation performance.\nTo address this, one can for example freeze the pretrained layers (for some epochs) and only train the new output head. In mlr3torch, this can be achieved by using the t_clbk(\"unfreeze\") callback.\n\n\n\n\n\n\nIn-Context Learning\n\n\n\nLarge foundation models (such as GPT-4) even allow performing tasks on which they were not pretrained on without any finetuning. This is referred to as in-context learning or zero-shot learning. There, the task is fed into the model during inference: “Hey ChatGPT, is What is the sentiment of this sentence. Return -1 for sad, 0 for neutral, 1 for happy: ”" }, { "objectID": "notebooks/6-training-efficiency.html#data-augmentation", "href": "notebooks/6-training-efficiency.html#data-augmentation", "title": "Training Efficiency", "section": "Data Augmentation", - "text": "Data Augmentation\nData augmentation is a technique used to increase the diversity and quantity of training data without actually collecting new data. By applying various transformations to the existing dataset, data augmentation helps improve the generalization capabilities of machine learning models, reduce overfitting, and enhance model robustness. This is especially crucial when you have limited data.\nData augmentation for images can consist of rotation, flipping, translating, grey scaling, etc. Which data augmentation is admissible, depends on the task:\n\nIf the modeling task is to predict whether there is a mark in the top right corner of an image, vertical or horizontal flipping is not admissible.\nIf the goal is to predict whether there is a mark somewhere in the image, it would be admissible.\n\nIn other words, the data augmentation must be compatible with the invariances of the task.\nIn mlr3torch, data augmentation is available via PipeOps of the form po(\"augment_\"). 
Currently, only augemntation operators from the torchvision package are available, but you can also add your own.\n\naugment = po(\"augment_random_resized_crop\") %>>%\n po(\"augment_random_horizontal_flip\") %>>%\n po(\"augment_random_vertical_flip\")\n\nWe can just create a new GraphLearner that includes the augemntation steps as well as the learner from above:\n\nresnet_augmented = as_learner(augment %>>% resnet)\nresnet_augmented$id = \"resnet_augmented\"\nresnet_augmented$train(task = cifar10)\n\n\n\n\n\n\n\nQuiz: Data Augmentation\n\n\n\nQuestion 1: Do you think data augmentation should be applied to the validation set?\n\n\nClick for answer\n\nNo, as the purpose of data augmentation is not to improve an individual prediction, it will not be applied during test time and hence also not to the validation set. Looking at the performance of augmented validation data is, however, also not a mistake." + "text": "Data Augmentation\nData augmentation is a technique used to increase the diversity and quantity of training data without actually collecting new data. By applying various transformations to the existing dataset, data augmentation helps improve the generalization capabilities of machine learning models, reduce overfitting, and enhance model robustness. This is especially crucial when you have limited data.\nData augmentation for images can consist of rotation, flipping, translating, grey scaling, etc. Which data augmentation is admissible, depends on the task:\n\nIf the modeling task is to predict whether there is a mark in the top right corner of an image, vertical or horizontal flipping is not admissible.\nIf the goal is to predict whether there is a mark somewhere in the image, it would be admissible.\n\nIn other words, the data augmentation must be compatible with the invariances of the task.\nIn mlr3torch, data augmentation is available via PipeOps of the form po(\"augment_\"). Currently, only augmentation operators from the torchvision package are available, but you can also add your own.\n\naugment = po(\"augment_random_resized_crop\") %>>%\n po(\"augment_random_horizontal_flip\") %>>%\n po(\"augment_random_vertical_flip\")\n\nWe can just create a new GraphLearner that includes the augmentation steps as well as the learner from above:\n\nresnet_augmented = as_learner(augment %>>% resnet)\nresnet_augmented$id = \"resnet_augmented\"\nresnet_augmented$train(task = cifar10)\n\n\n\n\n\n\n\nQuiz: Data Augmentation\n\n\n\nQuestion 1: Do you think data augmentation should be applied to the validation set?\n\n\nClick for answer\n\nNo, as the purpose of data augmentation is not to improve an individual prediction, it will not be applied during test time and hence also not to the validation set. Looking at the performance of augmented validation data is, however, also not a mistake." }, { "objectID": "notebooks/7-usecase.html", diff --git a/notebooks/6-training-efficiency-exercise.qmd b/notebooks/6-training-efficiency-exercise.qmd index 1329030..a9865c6 100644 --- a/notebooks/6-training-efficiency-exercise.qmd +++ b/notebooks/6-training-efficiency-exercise.qmd @@ -12,15 +12,15 @@ In this exercise, we will once again train a simple multi-layer perceptron on th 2. Utilizes a batch size of 128. 3. Trains for 200 epochs. 4. Employs a validation set comprising 30% of the data. -5. Tracks the training and validation log-loss during training. +5. Track the validation log-loss. 6. Utilizes trace-jitting to speed up the training process. 7. 
Employs the history callback to record the training and validation log-loss during training. Afterward, plot the validation log-loss, which is accessible via `learner$model$callbacks$history`. -Below, we create the task and remove the `gender` feature for simplicity. +Below, we create the task and remove the `gender` feature again for simplicity. -```{r} +```{r, message = FALSE} library(mlr3verse) library(mlr3torch) ilpd_num <- tsk("ilpd") @@ -28,14 +28,6 @@ ilpd_num$select(setdiff(ilpd_num$feature_names, "gender")) ilpd_num ``` -
        -Hint -* To specify the validation set, use the `validate` field, which can either be set during construction or by calling `$configure()`. -* Trace-jitting can be enabled via the `jit_trace` parameter. -* The history callback can be constructed via `t_clbk("history")` and needs to be passed during the *construction* of the learner. -* The validation and measures can be specified via `measures_valid` and take a measure object that is constructed via `msr()`. -
        - ::: {.content-visible when-meta=solutions} **Solution** @@ -66,7 +58,8 @@ ggplot(mlp$model$callbacks$history) + ::: **Question 2:** Early Stopping -Enable early stopping to prevent overfitting and re-train the learner (using a patience of 10). Print the final validation performance of the learner and the early stopped results. You can consult the [documentation of `LearnerTorch`](https://mlr3torch.mlr-org.com/reference/mlr_learners_torch.html) on how to access these two results (section *Active Bindings*). + +Enable early stopping to prevent overfitting and re-train the learner (using a patience of 10). Print the final validation performance of the learner and the early stopped results. You can consult the [documentation of `LearnerTorch`](https://mlr3torch.mlr-org.com/reference/mlr_learners_torch.html) on how to access these (see section *Active Bindings*).
        Hint @@ -90,9 +83,10 @@ mlp$internal_valid_scores While early stopping in itself is already useful, `mlr3torch` also allows you to simultaneously tune the number of epochs using early stopping while tuning other hyperparameters via traditional hyperparameter tuning from `mlr3tuning`. -One thing we have not covered so far is that the MLP learner we have used so far also uses a dropout layer. The dropout probability can be configured via the `p` parameter. +One thing we have not mentioned so far is that the MLP learner also uses a dropout layer. +The dropout probability can be configured via the `p` parameter. -Your task is to tune the dropout probability `p` in the range $[0, 1]$ and the epochs using early stopping (using the configuration from the previous exercise). +Your task is to tune the dropout probability `p` in the range $[0, 1]$ and the epochs using early stopping (using the configuration from the previous exercise) with an upper bound of 100 epochs. To adapt this to work with early stopping, you need to set the: @@ -100,9 +94,8 @@ To adapt this to work with early stopping, you need to set the: 2. `$validate` field of the `"test"` so the same data is used for tuning and validation. 3. Tuning `measure` to `msr("internal_valid_score", minimize = TRUE)`. We set `minimize` to `TRUE` because we have used the log-loss as a validation measure. -Apart from this, the tuning works just like in tutorial 5. Use 3-fold cross-validation for the tuning and evaluate 10 configurations. - -Run the tuning and print the optimal configuration. +Apart from this, the tuning works just like in tutorial 5. Use 3-fold cross-validation and evaluate 10 configurations using random search. +Finally, print the optimal configuration. ::: {.content-visible when-meta=solutions} **Solution** @@ -129,6 +122,6 @@ ti <- tune( term_evals = 10 ) -ti$learner_result_param_vals +ti$result_learner_param_vals ``` ::: diff --git a/notebooks/6-training-efficiency.qmd b/notebooks/6-training-efficiency.qmd index bea660a..4837373 100644 --- a/notebooks/6-training-efficiency.qmd +++ b/notebooks/6-training-efficiency.qmd @@ -6,7 +6,7 @@ title: "Training Efficiency" Methods for increasing training efficiency can be roughly split into: -1. Computational methods such as JIT compilation, using GPU, parallel data loading, etc., that allow doing the same thing faster. +1. Computational methods such as JIT compilation, using GPUs, parallel data loading, etc., that allow doing the same thing faster. 2. Methodological approaches that change how we approach modeling to achieve either better results or the same results faster. # Computational Approaches @@ -16,7 +16,7 @@ Methods for increasing training efficiency can be roughly split into: ### Graphical Processing Unit (GPU) Using a GPU is crucial when training relatively large neural networks because GPUs are specifically designed to handle the parallel processing required for complex computations. -To use a GPU in mlr3torch, we can set the device parameter to "cuda". By default, it is set to "auto", which will use a GPU if it is available and otherwise fall back to the CPU. +To use a GPU in `mlr3torch`, we can set the device parameter to "cuda". By default, it is set to "auto", which will use a GPU if available and otherwise fall back to the CPU. :::{.callout-tip} To check if a GPU is available, we can use the `torch::cuda_is_available()` function. 
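For example (a small sketch, not part of the tutorial code), the result of this check can be used to select the device explicitly instead of relying on "auto":

```{r}
# Sketch: use the GPU when available, otherwise fall back to the CPU.
device = if (torch::cuda_is_available()) "cuda" else "cpu"
lrn("classif.mlp", device = device)
```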
@@ -54,7 +54,7 @@ bench::mark( ### CPU Threads -Training large networks on a CPU is not a recommended approach, but it can be useful for smaller networks or when you don't have a GPU. +Training large networks on a CPU is not a recommended approach, but it can be a viable option for smaller networks. You can still use multiple threads to speed up the execution of operations. Please be aware that the code below will not execute on macOS, as setting the number of threads is not supported on this operating system. @@ -94,12 +94,12 @@ The optimal number of threads is a trade-off between these two effects. ## Efficient Data Loading -Besides speeding up the computation of operations in the forward and backward pass, another possible bottleneck is the loading of data. +Besides parallelizing the computation of operations in the forward and backward pass, another possible bottleneck is the loading of data. There are various ways to improve data loading speed: 1. Improve the implementation of the `dataset` class 2. Parallelize the data loading process -3. Increase speed of data transfer to the GPU +3. Increase the speed of data transfer to the GPU These approaches will now be discussed. @@ -113,9 +113,9 @@ When implementing a dataset, we need to define: :::{.callout-note} ## Quiz: Data Loading -The *tiny imagenet* dataset is a dataset of 100,000 images of size 64x64x3. +The *tiny imagenet* dataset is a dataset of 100,000 images of size 64x64. It is a subset of the famous *imagenet* dataset. -Below, we show some examples from the dataset: +Below, we show some examples from it: ![](../assets/tiny-imagenet.png) @@ -168,7 +168,7 @@ Which of the following is the faster way to load the images? Explain why. initialize = function(image_array) { self$image_array = image_array }, - .getbatch = function(i) { + .getitem = function(i) { torch_tensor(self$image_array[i, , , ]) }, .length = function() { @@ -194,8 +194,8 @@ iter = function(ds, ..., epochs = 1) { } } bench::mark( - disk = iter(ds_disk), - ram = iter(ds_ram), + disk = iter(ds_disk, epochs = 10), + ram = iter(ds_ram, epochs = 10), check = FALSE ) ``` @@ -247,7 +247,7 @@ ds_tensor_batch = dataset("tensor_batch", self$tensor = torch_tensor(image_array) }, .getbatch = function(i) { - self$tensor[i, ..] + self$tensor[i, .., drop = FALSE] }, .length = function() { nrow(self$tensor) @@ -285,18 +285,18 @@ The image below visualizes this process: Creating such a parallel dataloader is as easy as setting the `num_workers` parameter to a value greater than 0. :::{.callout-note} -Note that there is some communication overhead that results from sending the batches from the worker to the main process. -This will hopefully be reduced in the future, but is currently there. -For this reason, parallel data loading is therefore -- currently -- only beneficial when it is slow, e.g., because of loading the data from disk or because of expensive preprocessing. +Note that in the current implementation, parallel data loading is only beneficial when it is slow, e.g., because of loading the data from disk or because of expensive preprocessing. +This will hopefully be improved in the future (by a faster implementation of the parallel dataloader). ::: + ### Moving Data to the GPU -One thing we have ignored so far is that when training using a GPU, the data needs to be moved to the GPU. +One thing we have ignored so far is that when training using a GPU, the data needs to be moved from RAM to the GPU. 
This is because a GPU has its own memory (VRAM), and the data needs to be moved to this memory before it can be used for training.
The moving of the data to the GPU cannot be done on the processes that are loading the data but must be done in the main process, i.e., after the batch was received from (possibly parallelized) dataloader.
One way to speed up the data loading process is to pin the memory of the data to the GPU.
-Before a tensor can be moved from RAM to VRAM, it needs to be in so-called page-locked memory, which can be done using the `pin_memory` parameter.
+Before a tensor can be moved from RAM to VRAM, it needs to be in so-called page-locked memory, which can be enabled using the `pin_memory` parameter of `dataloader()`.

![](../assets/pinned-memory.png)
@@ -318,7 +318,7 @@ bench::mark(

In order to use parallel data loading or memory pinning with `mlr3torch`, these parameters can simply be specified in the learner:

-```{r}
+```{r, output = FALSE}
lrn("classif.mlp", num_workers = 8L, pin_memory = TRUE, device = "cuda")
```
:::
@@ -338,7 +338,7 @@ optim_ignite_adamw
```

The 'ignite' indicates that the optimizer is a version that is optimized for performance.
-Not for all optimizers does an ignite version exist, but for the most common ones, it does.
+An ignite version does not exist for every optimizer, but it does for the most common ones.

Below, we compare the performance of the default optimizer and the ignite optimizer and see that the latter is considerably faster.
@@ -392,7 +392,7 @@ out = f(x, w, bias)
str(out)
```

-Besides syntax, there are some important differences between TorchScript and R:
+Besides syntax, there are some notable differences between TorchScript and R to be aware of:

1. In TorchScript, indexing tensors is 0-based, and
2. TorchScript is statically typed, so you need to specify the types of the arguments, unless they are tensors, which is the default.
@@ -425,7 +425,7 @@ out2 = f2(x, w, bias)
torch_equal(out, out2)
```

-An advantage of trace-compilation is that it even allows you to JIT-compile modules, which is currently not possible with `jit_compile`.
+An advantage of trace-compilation is that it can be applied to modules, which is currently not possible with `jit_compile`.

```{r}
net = nn_sequential(
@@ -438,9 +438,9 @@ net_jit = jit_trace(net, torch_randn(10, 10))
torch_equal(net(x), net_jit(x))
```

-Trace-compilation is restrictive because it only records operations applied to torch tensors and is unaware of R control flow, so you need to be careful when using it.
+However, trace-compilation is restrictive because it only records operations applied to torch tensors and is unaware of R control flow, so you need to be careful when using it.
Furthermore, it only accepts torch tensors as arguments.
-Unless you have dynamic inputs and outputs or modify the configuration of the module, trace-compilation should usually work.
+For many simple modules, trace-compilation should usually work.
You can also check this by running the original and trace-jitted module on some example inputs and see if they return the same result.

:::{.callout-note}
@@ -474,7 +474,7 @@ net_both(list(torch_randn(1)))

**Question 1**: Consider the trace-jitted function below. Can you predict the output of the last two lines? Can you explain why this happens?
-```{r} +```{r, output = FALSE} f = function(a, b, multiply) { if (multiply$item()) { a * b @@ -488,9 +488,19 @@ fjit(torch_tensor(2), torch_tensor(3), torch_tensor(TRUE)) fjit(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE)) ``` -**Question 2**: Answer the same question for the following function: +
        +Click for answer ```{r} +fjit(torch_tensor(2), torch_tensor(3), torch_tensor(TRUE)) +fjit(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE)) +``` + +
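+
+The reason is hinted at above: `jit_trace()` only records the tensor operations that were executed with the example inputs, so the R-level `if` is resolved once (with `multiply = TRUE`) and only the multiplication becomes part of the traced graph. A small check (reusing the `f` and `fjit` defined above) makes the divergence visible:
+
+```{r}
+# the original R function respects the flag ...
+f(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE))
+# ... while the traced version always multiplies
+fjit(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE))
+```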
        + +**Question 2**: Answer the same question for the following function: + +```{r, output = FALSE} f = function(a, b, multiply) { torch_where(multiply, a * b, a + b) } @@ -499,6 +509,15 @@ fjit = jit_trace(f, torch_tensor(1), torch_tensor(2), torch_tensor(TRUE)) fjit(torch_tensor(2), torch_tensor(3), torch_tensor(TRUE)) fjit(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE)) ``` + +
        +Click for answer + +```{r} +fjit(torch_tensor(2), torch_tensor(3), torch_tensor(TRUE)) +fjit(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE)) +``` +
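+
+Here, in contrast, the traced function should behave like the original one: `torch_where()` is itself a tensor operation, so it is recorded in the trace and both branches remain part of the graph; only R-level control flow is lost during tracing. A small sanity check (reusing the `f` and `fjit` defined for this question):
+
+```{r}
+# the traced function still selects between the two branches at run time
+torch_equal(
+  f(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE)),
+  fjit(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE))
+)
+```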
        ::: ### Mixed Precision Training @@ -506,7 +525,6 @@ fjit(torch_tensor(2), torch_tensor(3), torch_tensor(FALSE)) Another way to speed up the training process is to use mixed precision training. This technique involves training the model using both 16-bit and 32-bit floating point numbers. This allows reducing the memory footprint of the model and speeding up the training process. - We won't cover this here, but refer to the [torch documentation](https://torch.mlverse.org/docs/articles/amp) that explains how to do this. ## Methodological Approaches @@ -521,13 +539,9 @@ This allows monitoring the performance of the model on unseen data. In `mlr3torch`, we can track the performance of the model on a validation set by specifying: * `validate`, which is the ratio of the data that is used for validation -* `measures_valid`, which is a list of measures to use for validation +* `measures_valid`, which is a list of measures to evaluate the validation performance * `eval_freq`, which is the frequency at which the validation is performed -* `callbacks`, which is a list of callbacks to use during training, in this case, we use the `history` callback, which records the performance of the model on the validation set at regular intervals, enabling us to monitor and analyze the model's performance over time. - -:::{.callout-tip} -While `mlr3torch` comes with predefined callbacks, it is also possible to define custom callbacks that modify the training process. -::: +* `callbacks`, which is a list of callbacks to use during training, in this case, we use the `t_clbk("history")` callback, which records the performance of the model on the validation set at regular intervals, enabling us to monitor and analyze the model's performance over time. ```{r} task = tsk("sonar") @@ -545,7 +559,6 @@ mlp_learner = lrn("classif.mlp", ) mlp_learner$train(task) history = mlp_learner$model$callbacks$history -str(history) head(history) ``` @@ -566,17 +579,18 @@ ggplot(history, aes(x = epoch)) + ``` Instead of only monitoring the validation loss (and watching it get worse and worse), we can also stop the training process dynamically when the validation loss begins to increase. -This regularization technique is called early stopping, and it prevents overfitting during the training of iteratively trained machine learning models. -It involves monitoring the validation loss during training and stopping the training process when the validation loss begins to increase, indicating that the model is starting to overfit the training data. +This regularization technique is called **early stopping**, and it prevents overfitting during the training of iteratively trained machine learning models. -The key configuration option for early stopping is the `patience` parameter, which defines the number of epochs to wait after the last improvement in validation loss before stopping the training. For example, if patience is set to 10, the training will continue for 10 additional epochs after the last observed improvement in validation loss. If no improvement is seen during this period, training will be halted. +The key configuration option for early stopping is the `patience` parameter, which defines the number of epochs to wait after the last improvement in validation loss before stopping the training. +For example, if the patience is set to 5, the training will continue for 5 additional epochs after the last observed improvement in validation loss. +If no improvement is seen during this period, training will be halted. 
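+
+The patience rule itself is simple enough to sketch in a few lines of plain R (a schematic illustration with a random placeholder for the per-epoch validation loss, not how `mlr3torch` implements early stopping internally):
+
+```{r}
+patience = 5
+best_loss = Inf
+stagnation = 0 # epochs since the last improvement
+for (epoch in seq_len(100)) {
+  valid_loss = runif(1) # placeholder for the validation loss of this epoch
+  if (valid_loss < best_loss) {
+    best_loss = valid_loss
+    stagnation = 0
+  } else {
+    stagnation = stagnation + 1
+  }
+  if (stagnation >= patience) break
+}
+epoch # the epoch at which training would stop
+```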
Advantages of early stopping include: - **Prevention of Overfitting**: By stopping training when the model starts to overfit, we can achieve better generalization on unseen data. - **Resource Efficiency**: It saves computational resources by avoiding unnecessary training epochs once the model performance has plateaued. -Now, let's train the learner again using early stopping with a patience of 10 epochs: +Now, let's train the learner again using early stopping with a patience of 5 epochs: ```{r} mlp_learner$param_set$set_values( @@ -587,7 +601,7 @@ mlp_learner$internal_tuned_values$epochs ``` Beyond only tuning the number of epochs, `mlr3`'s internal tuning mechanism also allows tuning the number of epochs internally while using an offline tuning method to optimize other hyperparameters. -To use this, we can set the parameters we want to tune `TuneTokens`: +To use this, we can set the parameters we want to tune using `to_tune()`, but need to set `internal = TRUE` for the `epochs` parameter. ```{r, message = FALSE} library(mlr3tuning) @@ -597,19 +611,19 @@ mlp_learner$param_set$set_values( ) ``` -We could now pass this learner to a tuner, where the tuner would only optimize the learning rate, while the learner optimizes the epochs internally. +We could now pass this learner to a tuner as usual. ## Architecture Design Another essential aspect of training neural networks efficiently and effectively is the design of the network architecture, which can be a challenging task. -However, for many tasks, there are well-known architectures that perform well and can be used as a starting point. +However, for many problems, there are predefined architectures that perform well and can be used. Unless there is a specific reason to design a new architecture, it is recommended to use such an architecture. :::{.callout-note} Because the Python deep learning ecosystem is so large, many more architectures are implemented in Python than in R. One way to use them in R is to simply translate the PyTorch code to (R-)torch. While PyTorch and (R-)torch are quite similar, there are some differences, e.g., 1-based and 0-based indexing. -The `torch` website contains a [brief tutorial](https://torch.mlverse.org/docs/articles/python-to-r) on how to do this. +The `torch` website contains a [brief tutorial](https://torch.mlverse.org/docs/articles/python-to-r) on this topic. ::: Nonetheless, we will cover important techniques that can be used to speed up the training process, namely *batch normalization* and *dropout*. @@ -635,7 +649,7 @@ where: During inference, the module uses the running mean and variance of the training data to normalize the input. In `torch`, different versions of batch normalization exist for different dimensions of the input tensor. -Below, we illustrate the batch normalization module using a 1D input tensor (the batch dimension does not count here) +Below, we illustrate the batch normalization module using a 1D input tensor (the batch dimension does not count here): ```{r} x = torch_randn(10, 5) @@ -646,7 +660,7 @@ bn(x) :::{.callout-note} ## Quiz: Batch Normalization -**Question 1**: Earlier we have learned that `nn_module`s have buffers and parameters, where the latter are learned with gradient descent. +**Question 1**: Earlier we have learned that `nn_module`s have buffers and parameters, where only the latter are learned with gradient descent. Do you think the mean and variance are parameters or buffers?
        @@ -655,8 +669,8 @@ They are both buffers as they only store the variance and running mean of all tr
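+
+This can be checked directly on the `bn` module created above (a small sketch relying on the `$parameters` and `$buffers` fields of an `nn_module`):
+
+```{r}
+# the learnable affine transformation (gamma and beta) consists of parameters ...
+names(bn$parameters)
+# ... while the running statistics are stored as buffers
+names(bn$buffers)
+```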
        **Question 2**: Training vs. Evaluation Mode: -While many `nn_module`s behave the same way irrespective of their mode, batch normalization is an example of a module that behaves differently during training and evaluation, i.e., -during training, the module uses the mean and variance of the current batch, while during evaluation, it uses the running mean and variance of all training samples seen. +While many `nn_module`s behave the same way irrespective of their mode, batch normalization is an example of a module that behaves differently during training and evaluation. +During training, the module uses the mean and variance of the current batch, while during evaluation, it uses the running mean and variance of all training samples seen. ```{r} bn(x[1:10, ]) @@ -688,13 +702,12 @@ The second statement is false because the first tensor uses different means and
        ::: -To illustrate its effectiveness, we will define a simple CNN, with and without batch normalization, train it on CIFAR-10, and compare their performance. +To illustrate its effectiveness, we will define a simple CNN, with and without batch normalization, train it on CIFAR-10, and compare their performances. To build the neural networks, we will use `mlr3torch`, which allows building architectures from `PipeOp`s. -This makes the creation of network architectures easier, as we, e.g., don't have to specify auxiliary parameters (such as the input dimension of a linear layer). Recall that the `po("torch_ingress_ltnsr")` is a special `PipeOp` that marks the input of the neural network. Note that `po("nn_relu_1")` is equivalent to `po("nn_relu", id = "nn_relu_1")`. -We need to specify unique ID parameters as this is required in `mlr3pipelines`. +We need to specify unique IDs for each `PipeOp` as this is required in mlr3pipelines graphs. ```{r} cnn_bn = po("torch_ingress_ltnsr") %>>% @@ -756,13 +769,14 @@ bmr$aggregate() ## Dropout -Dropout is a regularization technique used to prevent overfitting in neural networks by randomly setting a fraction of input units to zero during training. This encourages the network to learn more robust features that are not reliant on specific neurons, thereby improving its generalization capabilities. -During each training iteration, dropout randomly "drops" a subset of neurons by setting their activations to zero with a specified probability (commonly between 20% to 50%). This forces the network to distribute the learned representations more evenly across neurons, reducing the reliance on any single neuron and mitigating overfitting. -Dropout is more commonly used in the context of fully connected layers. +Dropout is a regularization technique used to prevent overfitting in neural networks. +During each training iteration, dropout randomly "drops" a subset of neurons by setting their activations to zero with a specified probability (commonly between 20% to 50%). +This forces the network to distribute the learned representations more evenly across neurons. +Dropout is most commonly used in the context of fully connected layers. ![](../assets/dropout.png){fig-align="center" width=100%} -Source: https://medium.com/konvergen/understanding-dropout-ddb60c9f98aa +[Source](https://medium.com/konvergen/understanding-dropout-ddb60c9f98aa) Just like batch normalization, it also has different behavior during training and evaluation. @@ -773,7 +787,7 @@ dropout$eval() dropout(x) ``` -To look at the effects, we will create a second classification head with dropout and then define new learners +To look at the effects, we will create a second classification head with dropout and then define new learners. ```{r} head_dropout = po("nn_flatten") %>>% @@ -810,7 +824,7 @@ autoplot(bmr)
Click for answer
Not necessarily, as dropout is a regularization technique that prevents overfitting.
-It's goal is to reduce the generalization performance of the model.
+Its goal is to improve the generalization performance of the model, not its performance on the training data.
:::
@@ -830,9 +844,9 @@ This is visualized below:

![](../assets/transfer-learning.svg)

-Source: https://en.wikipedia.org/wiki/Transfer_learning
+[Source](https://en.wikipedia.org/wiki/Transfer_learning)

-`mlr3torch` connects various pretrained image networks that are available in the [`torchvision` package](https://torchvision.mlverse.org/).
+`mlr3torch` offers various pretrained image networks that are available through the [`torchvision` package](https://torchvision.mlverse.org/).
The ResNet-18 model is a popular pre-trained model that was pretrained on ImageNet.
We can use the pretrained weights by setting the `pretrained` parameter to `TRUE`.
@@ -864,22 +878,14 @@ bmr$aggregate()
```

When fine-tuning a pretrained model like ResNet-18, it's common to observe instabilities in gradients, which can manifest as fluctuating validation performance.
-This can e.g. be because the learning rate is too high (compared to the learning rate that was used during pretraining).
-
-To address this, one can:
-
-1. Use a smaller learning rate for the pretrained layers than for the new output head.
-2. Freeze the pretrained layers (for some epochs) and only train the new output head.
-
-In `mlr3torch` this can be achieved via the callback mechanism.
-For the unfreezing, there even exists a predefined callback `t_clbk("unfreeze")`.
-To create a custom callback, the `torch_callback()` function can be used.
-A tutorial on this can be found on the [`mlr3torch` package website](https://mlr3torch.mlr-org.com/index.html).
+To address this, one can, for example, freeze the pretrained layers (for some epochs) and only train the new output head.
+In `mlr3torch`, this can be achieved by using the `t_clbk("unfreeze")` callback.

:::{.callout-note}
## In-Context Learning

-Large foundation models (such as GPT-4) even allow to perform tasks on which they were not pretrained on without any finetuning.
+Large foundation models (such as GPT-4) even allow performing tasks on which they were not pretrained, without any finetuning.
This is referred to as in-context learning or zero-shot learning.
There, the task is fed into the model during inference: "Hey ChatGPT, is What is the sentiment of this sentence. Return -1 for sad, 0 for neutral, 1 for happy: "
:::
@@ -899,7 +905,7 @@ Which data augmentation is admissible, depends on the task:

In other words, the data augmentation must be compatible with the invariances of the task.
In `mlr3torch`, data augmentation is available via `PipeOp`s of the form `po("augment_")`.
-Currently, only augemntation operators from the `torchvision` package are available, but you can also add your own.
+Currently, only augmentation operators from the `torchvision` package are available, but you can also add your own.

```{r}
augment = po("augment_random_resized_crop") %>>%
  po("augment_random_horizontal_flip") %>>%
  po("augment_random_vertical_flip")
```

-We can just create a new `GraphLearner` that includes the augemntation steps as well as the learner from above:
+We can just create a new `GraphLearner` that includes the augmentation steps as well as the learner from above:

```{r, eval = cuda_is_available()}
resnet_augmented = as_learner(augment %>>% resnet)