# Linear Model Selection and Regularization
```{r}
#| echo: false
set.seed(1234)
source("_common.R")
```
This lab will take a look at regularization models and hyperparameter tuning. These models are related to the models we saw in chapters 3 and 4, with the difference that they contain a regularization term.
This chapter will use [parsnip](https://www.tidymodels.org/start/models/) for model fitting, [recipes and workflows](https://www.tidymodels.org/start/recipes/) to perform the transformations, and [tune and dials](https://www.tidymodels.org/start/tuning/) to tune the hyperparameters of the model.
We will be using the `Hitters` data set from the `ISLR` package. We wish to predict the baseball players' `Salary` based on several different characteristics included in the data set. Since we wish to predict `Salary`, we need to remove any rows with missing values in that column; otherwise, we won't be able to fit the models.
```{r}
#| message: false
library(tidymodels)
library(ISLR)
Hitters <- as_tibble(Hitters) %>%
  filter(!is.na(Salary))
```
## Best Subset Selection
tidymodels does not currently support subset selection methods, and it is unlikely to include them in the [near future](https://stackoverflow.com/questions/66651033/stepwise-algorithm-in-tidymodels#comment117845482_66651033).
## Forward and Backward Stepwise Selection
tidymodels does not currently support forward and backward stepwise selection methods, and it is unlikely to include them in the [near future](https://stackoverflow.com/questions/66651033/stepwise-algorithm-in-tidymodels#comment117845482_66651033).
## Ridge Regression
We will use the `glmnet` package to perform ridge regression. `parsnip` does not have a dedicated function to create a ridge regression model specification. You need to use `linear_reg()` and set `mixture = 0` to specify a ridge model. The `mixture` argument specifies the proportion of the different types of regularization: `mixture = 0` specifies only ridge regularization and `mixture = 1` specifies only lasso regularization. Setting `mixture` to a value between 0 and 1 lets us use both. When using the `glmnet` engine we also need to set a `penalty` to be able to fit the model. We will set this value to `0` for now; it is not the best value, but we will look at how to select the best value in a little bit.
```{r}
ridge_spec <- linear_reg(mixture = 0, penalty = 0) %>%
  set_mode("regression") %>%
  set_engine("glmnet")
```
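For completeness, a `mixture` value strictly between 0 and 1 gives an elastic net, which blends the two penalties. A minimal sketch (the name `enet_spec` is just for illustration and is not used later in this lab):
```{r}
#| eval: false
# Hypothetical elastic net specification: equal parts ridge and lasso.
enet_spec <- linear_reg(mixture = 0.5, penalty = 0) %>%
  set_mode("regression") %>%
  set_engine("glmnet")
```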
Once the specification is created we can fit it to our data. We will use all the predictors.
```{r}
ridge_fit <- fit(ridge_spec, Salary ~ ., data = Hitters)
```
The `glmnet` package will fit the model for all values of `penalty` at once, so let us see what the parameter estimates for the model are now that we have `penalty = 0`.
```{r}
#| message: false
tidy(ridge_fit)
```
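We can verify this by peeking at the underlying engine object: the glmnet fit stores the full sequence of penalty values it was fit for. This is a quick aside using `extract_fit_engine()` from parsnip:
```{r}
# The engine object is a plain glmnet fit; its lambda component
# holds the full sequence of penalties fit in one go.
ridge_fit %>%
  extract_fit_engine() %>%
  pluck("lambda") %>%
  head()
```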
Let us instead see what the estimates would be if the penalty was 11498.
```{r}
tidy(ridge_fit, penalty = 11498)
```
Notice how the estimates decrease as the amount of penalty goes up. Look below at the parameter estimates for `penalty = 705` and `penalty = 50`.
```{r}
tidy(ridge_fit, penalty = 705)
tidy(ridge_fit, penalty = 50)
```
We can visualize how the magnitudes of the coefficients are regularized towards zero as the penalty goes up.
```{r}
#| fig-alt: |
#| Multiple line chart. Log Lambda along the x-axis, Coefficients
#| along the y-axis. The curves start at different places along
#| the y-axis but slowly converge towards 0 as the Log Lambda
#| value increases.
ridge_fit %>%
  autoplot()
```
Prediction is done as usual; if we use `predict()` by itself, then the `penalty = 0` we set in the model specification is used.
```{r}
predict(ridge_fit, new_data = Hitters)
```
but we can also get predictions for other values of `penalty` by specifying it in `predict()`:
```{r}
predict(ridge_fit, new_data = Hitters, penalty = 500)
```
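As an aside, parsnip also provides `multi_predict()`, which returns predictions for several penalty values in one call; a small sketch:
```{r}
# Returns a tibble with a .pred list-column: one row per player,
# each holding predictions for all three penalty values.
multi_predict(ridge_fit, new_data = Hitters, penalty = c(10, 100, 1000))
```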
We saw how we can fit a ridge model and make predictions for different values of `penalty`. But it would be nice if we could find the "best" value of the penalty. This is something we can use hyperparameter tuning for. Hyperparameter tuning is, in its simplest form, a way of fitting many models with different sets of hyperparameters and trying to find one that performs "best". We will keep it simple for this lab and only use a grid search over evenly spaced parameter values. This is a fine approach if you have one or two tunable parameters, but it can become computationally infeasible as the number of parameters grows. See the chapter on [iterative search](https://www.tmwr.org/iterative-search.html) from [Tidy Modeling with R](https://www.tmwr.org/) for more information.
We start, as usual, by setting up a training/testing split. A K-fold cross-validation data set with 10 folds is then created on the training data set.
```{r}
Hitters_split <- initial_split(Hitters, strata = "Salary")
Hitters_train <- training(Hitters_split)
Hitters_test <- testing(Hitters_split)
Hitters_fold <- vfold_cv(Hitters_train, v = 10)
```
We can use the `tune_grid()` function to perform hyperparameter tuning using a grid search. `tune_grid()` needs 3 different things:
- a `workflow` object containing the model and preprocessor,
- a `rset` object containing the resamples the `workflow` should be fitted within, and
- a tibble containing the parameter values to be evaluated.
Optionally, a metric set of performance metrics can be supplied for evaluation. If you don't set one, a default set of performance metrics is used.
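As an example, here is a sketch of supplying an explicit metric set, using the `ridge_workflow` and `penalty_grid` objects we build below; `rmse` and `rsq` come from the yardstick package, and the chunk is not run at this point:
```{r}
#| eval: false
# Sketch: pass a custom metric set to tune_grid(). The workflow and
# grid are defined later in this section.
tune_grid(
  ridge_workflow,
  resamples = Hitters_fold,
  grid = penalty_grid,
  metrics = metric_set(rmse, rsq)
)
```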
We already have a resample object created in `Hitters_fold`. Next, we create the workflow specification.
We just used the data set as-is when we fit the model earlier. But ridge regression is scale sensitive, so we need to make sure that the variables are on the same scale, which we can do with `step_normalize()`. Secondly, let us deal with the factor variables ourselves using `step_novel()`, which handles factor levels not seen during training, and `step_dummy()`; `step_zv()` then removes any zero-variance columns that may result.
```{r}
ridge_recipe <-
  recipe(formula = Salary ~ ., data = Hitters_train) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())
```
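To sanity-check what this recipe produces, we can `prep()` and `bake()` it. This is an optional aside; the workflow below handles these steps for us:
```{r}
# prep() estimates the normalization statistics and dummy encodings;
# bake(new_data = NULL) returns the processed training data.
ridge_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()
```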
The model specification will look very similar to what we have seen earlier, but we will set `penalty = tune()`. This tells `tune_grid()` that the `penalty` parameter should be tuned.
```{r}
ridge_spec <-
  linear_reg(penalty = tune(), mixture = 0) %>%
  set_mode("regression") %>%
  set_engine("glmnet")
```
Now we combine the recipe and the model specification to create a `workflow` object.
```{r}
ridge_workflow <- workflow() %>%
  add_recipe(ridge_recipe) %>%
  add_model(ridge_spec)
```
The last thing we need is the values of `penalty` we are trying. This can be created using `grid_regular()`, which creates a grid of evenly spaced parameter values. We use the `penalty()` function from the [dials](https://dials.tidymodels.org/) package to denote the parameter and to set the range of the grid we are searching over. Note that this range is log-scaled.
```{r}
penalty_grid <- grid_regular(penalty(range = c(-5, 5)), levels = 50)
penalty_grid
```
Using 50 levels for one parameter might seem like overkill, and in many applications it is. But remember that `glmnet` fits all the models in one go, so adding more levels to `penalty` doesn't affect the computational speed much.
Now we have everything we need and we can fit all the models.
```{r}
tune_res <- tune_grid(
  ridge_workflow,
  resamples = Hitters_fold,
  grid = penalty_grid
)
tune_res
```
The output of `tune_grid()` can be hard to read by itself. `autoplot()` creates a great visualization:
```{r}
#| fig-alt: |
#| Facetted connected scatter chart. regularization along the
#| x-axis. Performance values along the y-axis. The facets are
#| rmse and rsq. Both are fairly constant for low values of
#| regularization, rmse starts moderately increasing and rsq
#| starts moderately decreasing once the regularization
#| gets larger.
autoplot(tune_res)
```
Here we see that the amount of regularization affects the performance metrics differently. Note how there are areas where the amount of regularization doesn't have any meaningful influence on the performance. We can also see the raw metrics that created this chart by calling `collect_metrics()`.
```{r}
collect_metrics(tune_res)
```
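A compact alternative is `show_best()`, which lists only the top candidates for a chosen metric:
```{r}
# The 5 penalty values with the highest cross-validated rsq.
show_best(tune_res, metric = "rsq", n = 5)
```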
The "best" values of this can be selected using `select_best()`, this function requires you to specify a `matric` that it should select against.
```{r}
best_penalty <- select_best(tune_res, metric = "rsq")
best_penalty
```
This value of `penalty` can then be used with `finalize_workflow()` to update/finalize the workflow by replacing `tune()` with the value of `best_penalty`. This finalized model can now be fit again, this time using the whole training data set.
```{r}
ridge_final <- finalize_workflow(ridge_workflow, best_penalty)
ridge_final_fit <- fit(ridge_final, data = Hitters_train)
```
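If you are curious about the coefficients of the finalized model, they can be pulled out of the fitted workflow; a quick aside:
```{r}
# Extract the parsnip fit from the workflow and tidy the coefficients.
ridge_final_fit %>%
  extract_fit_parsnip() %>%
  tidy()
```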
This final model can now be applied to our testing data set to validate its performance:
```{r}
augment(ridge_final_fit, new_data = Hitters_test) %>%
  rsq(truth = Salary, estimate = .pred)
```
And it performs fairly well given what we saw earlier.
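As an aside, grid search is not the only option here; the tune package also supports iterative search, for example with `tune_bayes()`. A minimal sketch using the default settings (shown but not run):
```{r}
#| eval: false
# Sketch: Bayesian optimization over the penalty instead of a grid.
tune_bayes(
  ridge_workflow,
  resamples = Hitters_fold,
  iter = 10
)
```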
## The Lasso
We will use the `glmnet` package to perform lasso regression. `parsnip` does not have a dedicated function to create a lasso regression model specification. You need to use `linear_reg()` and set `mixture = 1` to specify a lasso model. The `mixture` argument specifies the proportion of the different types of regularization: `mixture = 0` specifies only ridge regularization and `mixture = 1` specifies only lasso regularization. Setting `mixture` to a value between 0 and 1 lets us use both.
The following procedure will be very similar to what we saw in the ridge regression section. The preprocessing needed is the same, but let us write it out one more time.
```{r}
lasso_recipe <-
  recipe(formula = Salary ~ ., data = Hitters_train) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())
```
Next, we finish the lasso regression `workflow`.
```{r}
lasso_spec <-
  linear_reg(penalty = tune(), mixture = 1) %>%
  set_mode("regression") %>%
  set_engine("glmnet")

lasso_workflow <- workflow() %>%
  add_recipe(lasso_recipe) %>%
  add_model(lasso_spec)
```
While we are doing a different kind of regularization, we still use the same `penalty` argument. I have picked a different range for the values of penalty since I know it will be a good range. In practice, you would have to cast a wide net at first and then narrow in on the range of interest.
```{r}
penalty_grid <- grid_regular(penalty(range = c(-2, 2)), levels = 50)
```
And we can use `tune_grid()` again.
```{r}
#| fig-alt: |
#| Facetted connected scatter chart. regularization along the
#| x-axis. Performance values along the y-axis. The facets are
#| rmse and rsq. Both are fairly constant for low values of
#| regularization, rmse starts moderately increasing and rsq
#| starts moderately decreasing once the regularization
#| gets larger.
tune_res <- tune_grid(
  lasso_workflow,
  resamples = Hitters_fold,
  grid = penalty_grid
)

autoplot(tune_res)
```
We select the best value of `penalty` using `select_best()`:
```{r}
best_penalty <- select_best(tune_res, metric = "rsq")
```
And refit the model using the whole training data set.
```{r}
lasso_final <- finalize_workflow(lasso_workflow, best_penalty)
lasso_final_fit <- fit(lasso_final, data = Hitters_train)
```
And we are done. By calculating the `rsq` value for the lasso model, we can see that for this data set, ridge regression outperforms lasso regression.
```{r}
augment(lasso_final_fit, new_data = Hitters_test) %>%
  rsq(truth = Salary, estimate = .pred)
```
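A hallmark of the lasso, in contrast to ridge regression, is that it can shrink coefficients exactly to zero, performing variable selection. As a quick aside, we can check which predictors were dropped from the finalized fit:
```{r}
# Terms with an estimate of exactly 0 were removed by the lasso penalty.
lasso_final_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  filter(estimate == 0)
```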
## Principal Components Regression
We will talk more about principal components analysis in chapter 10. This section will show how principal components can be used as a dimensionality reduction preprocessing step.
I will treat principal component regression as a linear model with PCA transformations in the preprocessing. Within the tidymodels framework, this is still essentially one model.
```{r}
lm_spec <-
  linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")
```
The preprocessing recipe will closely resemble the recipe we saw in the ridge and lasso sections. The main difference is that we end the recipe with `step_pca()`, which will perform principal component analysis on all the predictors and return the components that explain `threshold` percent of the variance. We have set `threshold = tune()` so we can treat the threshold as a hyperparameter to be tuned. By using workflows and tune together, we can tune parameters in the preprocessing as well as parameters in the model.
```{r}
pca_recipe <-
  recipe(formula = Salary ~ ., data = Hitters_train) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), threshold = tune())

pca_workflow <-
  workflow() %>%
  add_recipe(pca_recipe) %>%
  add_model(lm_spec)
```
We create a smaller grid for `threshold` and we don't need to modify the range since `[0, 1]` is an acceptable range.
```{r}
threshold_grid <- grid_regular(threshold(), levels = 10)
threshold_grid
```
And now we fit using `tune_grid()`. This time we will actually perform 100 fits, since we need to fit a model for each of the 10 values of `threshold` within each of the 10 folds.
```{r}
tune_res <- tune_grid(
  pca_workflow,
  resamples = Hitters_fold,
  grid = threshold_grid
)
```
The results look a little shaky here.
```{r}
#| fig-alt: |
#| Facetted connected scatter chart. Threshold along the
#| x-axis. Performance values along the y-axis. The facets are
#| rmse and rsq. Very variable, appears to produce best values
#| for threshold == 0.5.
autoplot(tune_res)
```
But we can still select the best model.
```{r}
best_threshold <- select_best(tune_res, metric = "rmse")
```
And we fit the model much like we have done a couple of times by now. The workflow is finalized using the value we selected with `select_best()` and trained using the full training data set.
```{r}
pca_final <- finalize_workflow(pca_workflow, best_threshold)
pca_final_fit <- fit(pca_final, data = Hitters_train)
```
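For symmetry with the ridge and lasso sections, we can sketch a test-set check of this model as an optional addition:
```{r}
# Test-set rsq for the finalized principal components regression.
augment(pca_final_fit, new_data = Hitters_test) %>%
  rsq(truth = Salary, estimate = .pred)
```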
## Partial Least Squares
Lastly, we have a partial least squares model. We will treat this much like the PCA section and say that the partial least squares calculations will be done as a preprocessing step that we tune. The following code is almost identical to that of the previous sections and will be shown in full without much explanation to avoid repetition. If you skipped to this section, go back and read the previous sections for more commentary.
```{r}
pls_recipe <-
  recipe(formula = Salary ~ ., data = Hitters_train) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors()) %>%
  step_pls(all_predictors(), num_comp = tune(), outcome = "Salary")

lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

pls_workflow <- workflow() %>%
  add_recipe(pls_recipe) %>%
  add_model(lm_spec)

num_comp_grid <- grid_regular(num_comp(c(1, 20)), levels = 10)

tune_res <- tune_grid(
  pls_workflow,
  resamples = Hitters_fold,
  grid = num_comp_grid
)

best_num_comp <- select_best(tune_res, metric = "rmse")

pls_final <- finalize_workflow(pls_workflow, best_num_comp)
pls_final_fit <- fit(pls_final, data = Hitters_train)
```
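And, as an optional final check mirroring the earlier sections, the test-set performance of the PLS model:
```{r}
# Test-set rsq for the finalized partial least squares model.
augment(pls_final_fit, new_data = Hitters_test) %>%
  rsq(truth = Salary, estimate = .pred)
```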