# Tree-Based Methods
```{r}
#| echo: false
set.seed(1234)
source("_common.R")
```
This lab will take a look at different tree-based models. In doing so, we will explore how changing the hyperparameters can help improve performance.
This chapter will use [parsnip](https://www.tidymodels.org/start/models/) for model fitting, [recipes and workflows](https://www.tidymodels.org/start/recipes/) to perform the transformations, and [tune and dials](https://www.tidymodels.org/start/tuning/) to tune the hyperparameters of the model. `rpart.plot` is used to visualize the decision trees created with the `rpart` package as the engine, and `vip` is used to visualize variable importance for the later models.
```{r}
#| message: false
library(tidymodels)
library(ISLR)
library(rpart.plot)
library(vip)
data("Boston", package = "MASS")
Boston <- as_tibble(Boston)
```
The `Boston` data set contains various statistics for 506 neighborhoods in Boston. We will build a regression model that relates the median value of owner-occupied homes (`medv`), as the response, to the remaining variables as predictors.
:::{.callout-important}
The `Boston` data set is quite outdated and contains some really unfortunate variables.
:::
We will also use the `Carseats` data set from the `ISLR` package to demonstrate a classification model. We create a new variable `High`, which takes the value `"No"` if `Sales <= 8` and `"Yes"` otherwise. The `Sales` predictor is then removed, as it perfectly determines `High`.
```{r}
Carseats <- as_tibble(Carseats) %>%
mutate(High = factor(if_else(Sales <= 8, "No", "Yes"))) %>%
select(-Sales)
```
## Fitting Classification Trees
We will be fitting both a classification tree and a regression tree in this section, so we can save a little bit of typing by creating a general decision tree specification with `rpart` as the engine.
```{r}
tree_spec <- decision_tree() %>%
set_engine("rpart")
```
This decision tree specification can then be turned into a classification model by setting the mode. This is a good example of how the flexible composition system created by parsnip can be used to create multiple model specifications.
```{r}
class_tree_spec <- tree_spec %>%
set_mode("classification")
```
With both a model specification and our data, we are ready to fit the model.
```{r}
class_tree_fit <- class_tree_spec %>%
fit(High ~ ., data = Carseats)
```
When we look at the model output, we see a quite informative summary of the model. It tries to give a written description of the tree that was created.
```{r}
class_tree_fit
```
Once the tree gets more than a couple of nodes it can become hard to read the printed diagram. The `rpart.plot` package provides functions to let us easily visualize the decision tree. As the name implies, it only works with `rpart` trees.
```{r}
#| warning: false
#| fig-alt: |
#| Decision tree chart. A total of 10 splits. Color is used to
#| represent the prediction of High, blue values represent No,
#| green values represent Yes.
class_tree_fit %>%
extract_fit_engine() %>%
rpart.plot()
```
We can see that the most important variable to predict high sales appears to be shelving location, as it forms the first node.
The training accuracy of this model is `r augment(class_tree_fit, new_data = Carseats) %>% accuracy(truth = High, estimate = .pred_class) %>% pull(.estimate) %>% scales::percent_format()(.)`.
```{r}
augment(class_tree_fit, new_data = Carseats) %>%
accuracy(truth = High, estimate = .pred_class)
```
Let us take a look at the confusion matrix to see whether the errors are balanced between the two classes.
```{r}
augment(class_tree_fit, new_data = Carseats) %>%
conf_mat(truth = High, estimate = .pred_class)
```
And the model appears to work well overall. But this model was fit on the whole data set, so we only get the training accuracy, which could be misleading if the model is overfitting. Let us redo the fitting by creating a validation split and fitting the model on the training data set.
```{r}
set.seed(1234)
Carseats_split <- initial_split(Carseats)
Carseats_train <- training(Carseats_split)
Carseats_test <- testing(Carseats_split)
```
Now we can fit the model on the training data set.
```{r}
class_tree_fit <- fit(class_tree_spec, High ~ ., data = Carseats_train)
```
Let us take a look at the confusion matrices for the training and testing data sets.
```{r}
augment(class_tree_fit, new_data = Carseats_train) %>%
conf_mat(truth = High, estimate = .pred_class)
```
The training data set performs well, as we would expect,
```{r}
augment(class_tree_fit, new_data = Carseats_test) %>%
conf_mat(truth = High, estimate = .pred_class)
```
but the testing data set doesn't perform as well, achieving a lower accuracy of `r augment(class_tree_fit, new_data = Carseats_test) %>% accuracy(truth = High, estimate = .pred_class) %>% pull(.estimate) %>% scales::percent_format()(.)`.
```{r}
augment(class_tree_fit, new_data = Carseats_test) %>%
accuracy(truth = High, estimate = .pred_class)
```
Let us try to tune the `cost_complexity` of the decision tree to find a more optimal complexity. We use the `class_tree_spec` object with the `set_args()` function to specify that we want to tune `cost_complexity`. This is then passed directly into the workflow object, to avoid creating an intermediate object.
```{r}
class_tree_wf <- workflow() %>%
add_model(class_tree_spec %>% set_args(cost_complexity = tune())) %>%
add_formula(High ~ .)
```
To be able to tune the hyperparameter we need 2 more objects: a `resamples` object, for which we will use a k-fold cross-validation data set, and a `grid` of values to try. Since we are only tuning 1 hyperparameter it is fine to stay with a regular grid.
```{r}
set.seed(1234)
Carseats_fold <- vfold_cv(Carseats_train)
param_grid <- grid_regular(cost_complexity(range = c(-3, -1)), levels = 10)
tune_res <- tune_grid(
class_tree_wf,
resamples = Carseats_fold,
grid = param_grid,
metrics = metric_set(accuracy)
)
```
Using `autoplot()` shows which values of `cost_complexity` appear to produce the highest accuracy.
```{r}
#| fig-alt: |
#| Connected scatter chart. Cost-complexity along the x-axis,
#| accuracy along the y-axis. The accuracy stays constant for
#| low values of cost-complexity. When cost-complexity is larger
#| than 0.01 the accuracy shoots up and down rapidly.
autoplot(tune_res)
```
We can now select the best performing value with `select_best()`, finalize the workflow by updating the value of `cost_complexity` and fit the model on the full training data set.
```{r}
#| warning: false
best_complexity <- select_best(tune_res, metric = "accuracy")
class_tree_final <- finalize_workflow(class_tree_wf, best_complexity)
class_tree_final_fit <- fit(class_tree_final, data = Carseats_train)
class_tree_final_fit
```
At last, we can visualize the model, and we see that the better-performing model is less complex than the original model we fit.
```{r}
#| warning: false
#| fig-alt: |
#| Decision tree chart. A total of 2 splits. Color is used to
#| represent the prediction of High, blue values represent No,
#| green values represent Yes.
class_tree_final_fit %>%
extract_fit_engine() %>%
rpart.plot()
```
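As a quick extra check, we can also see how the tuned tree performs on the testing data set, evaluated the same way as the unpruned tree:
```{r}
augment(class_tree_final_fit, new_data = Carseats_test) %>%
  accuracy(truth = High, estimate = .pred_class)
```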
## Fitting Regression Trees
We will now show how we fit a regression tree. This is very similar to what we saw in the last section. The main difference here is that the response we are looking at will be continuous instead of categorical. We can reuse `tree_spec` as a base for the regression decision tree specification.
```{r}
reg_tree_spec <- tree_spec %>%
set_mode("regression")
```
We are using the `Boston` data set for this task, so we will start by doing a validation split.
```{r}
#| warning: false
set.seed(1234)
Boston_split <- initial_split(Boston)
Boston_train <- training(Boston_split)
Boston_test <- testing(Boston_split)
```
fitting the model to the training data set
```{r}
#| warning: false
reg_tree_fit <- fit(reg_tree_spec, medv ~ ., Boston_train)
reg_tree_fit
```
The test RMSE of this model gives us a baseline to compare the later models against.
```{r}
augment(reg_tree_fit, new_data = Boston_test) %>%
rmse(truth = medv, estimate = .pred)
```
The `rpart.plot()` function works for the regression decision tree as well.
```{r}
#| warning: false
#| fig-alt: |
#| Decision tree chart. A total of 8 splits. Color is used to
#| represent medv. Light blue colors represent small values, dark
#| blue represent high values.
reg_tree_fit %>%
extract_fit_engine() %>%
rpart.plot()
```
Notice how the result is a numeric variable instead of a class.
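To make this concrete, we can look at the raw predictions for a handful of test observations; the output is a `.pred` column of numeric values rather than a `.pred_class` column of labels:
```{r}
predict(reg_tree_fit, new_data = Boston_test %>% slice(1:5))
```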
Now let us again try to tune the `cost_complexity` to find the best performing model.
```{r}
reg_tree_wf <- workflow() %>%
add_model(reg_tree_spec %>% set_args(cost_complexity = tune())) %>%
add_formula(medv ~ .)
set.seed(1234)
Boston_fold <- vfold_cv(Boston_train)
param_grid <- grid_regular(cost_complexity(range = c(-4, -1)), levels = 10)
tune_res <- tune_grid(
reg_tree_wf,
resamples = Boston_fold,
grid = param_grid
)
```
And it appears that more complex trees, i.e. lower values of `cost_complexity`, are to be preferred according to our cross-validation.
```{r}
#| fig-alt: |
#| Facetted connected scatter chart. Cost-complexity along the
#| x-axis. Performance values along the y-axis. The facets are
#| rmse and rsq. Both are fairly constant for low values of
#| cost-complexity, rmse starts moderately increasing and rsq
#| starts moderately decreasing once the cost-complexity
#| gets larger.
autoplot(tune_res)
```
We select the best-performing model according to `"rmse"` and fit the final model on the whole training data set.
```{r}
#| warning: false
best_complexity <- select_best(tune_res, metric = "rmse")
reg_tree_final <- finalize_workflow(reg_tree_wf, best_complexity)
reg_tree_final_fit <- fit(reg_tree_final, data = Boston_train)
reg_tree_final_fit
```
Visualizing the model reveals a much more complex tree than what we saw in the last section.
```{r}
#| warning: false
#| fig-alt: |
#| Decision tree chart. A total of 14 splits. Color is used to
#| represent medv. Light blue colors represent small values, dark
#| blue represent high values.
reg_tree_final_fit %>%
extract_fit_engine() %>%
rpart.plot()
```
## Bagging and Random Forests
Here we apply bagging and random forests to the `Boston` data set. We will be using the `randomForest` package as the engine. A bagging model is the same as a random forest where `mtry` is equal to the number of predictors. We can specify `mtry` to be `.cols()`, which means that the number of columns in the predictor matrix is used. This is useful if you want to make the specification more general and usable for many different data sets. `.cols()` is one of many [descriptors](https://parsnip.tidymodels.org/reference/descriptors.html) in the parsnip package.
We also set `importance = TRUE` in `set_engine()` to tell the engine to save the information regarding variable importance. This is needed for this engine if we want to use the `vip` package later.
```{r}
bagging_spec <- rand_forest(mtry = .cols()) %>%
set_engine("randomForest", importance = TRUE) %>%
set_mode("regression")
```
We fit the model like normal
```{r}
bagging_fit <- fit(bagging_spec, medv ~ ., data = Boston_train)
```
and we take a look at the testing performance, which we see is an improvement over the decision tree.
```{r}
augment(bagging_fit, new_data = Boston_test) %>%
rmse(truth = medv, estimate = .pred)
```
We can also create a quick scatterplot between the true and predicted values as a visual diagnostic.
```{r}
#| fig-alt: |
#| Scatter chart. medv along the x-axis and .pred along the
#| y-axis. A diagonal line has been added; most of the points
#| follow fairly close to the line, with points for high values
#| of medv falling under the line.
augment(bagging_fit, new_data = Boston_test) %>%
ggplot(aes(medv, .pred)) +
geom_abline() +
geom_point(alpha = 0.5)
```
There isn't anything weird going on here, so we are happy. Next, let us take a look at the variable importance.
```{r}
#| fig-alt: |
#| Horizontal bar chart. Importance along the x-axis, predictors
#| along the y-axis. Highest values are rm, lstat, nox and dis.
#| Lowest are indus, black and age.
vip(bagging_fit)
```
Next, let us take a look at a random forest. By default, `randomForest()` uses `p / 3` variables when building a random forest of regression trees, and `sqrt(p)` variables when building a random forest of classification trees. Here we use `mtry = 6`.
```{r}
rf_spec <- rand_forest(mtry = 6) %>%
set_engine("randomForest", importance = TRUE) %>%
set_mode("regression")
```
and fitting the model like normal
```{r}
rf_fit <- fit(rf_spec, medv ~ ., data = Boston_train)
```
this model has a slightly better performance than the bagging model
```{r}
augment(rf_fit, new_data = Boston_test) %>%
rmse(truth = medv, estimate = .pred)
```
We can likewise plot the true value against the predicted value
```{r}
#| fig-alt: |
#| Scatter chart. medv along the x-axis and .pred along the
#| y-axis. A diagonal line has been added; most of the points
#| follow fairly close to the line, with points for high values
#| of medv falling under the line.
augment(rf_fit, new_data = Boston_test) %>%
ggplot(aes(medv, .pred)) +
geom_abline() +
geom_point(alpha = 0.5)
```
it looks fine. No discernible difference between this chart and the one we created for the bagging model.
The variable importance plot is also quite similar to what we saw for the bagging model which isn't surprising.
```{r}
#| fig-alt: |
#| Horizontal bar chart. Importance along the x-axis, predictors
#| along the y-axis. Highest values are rm, lstat, nox and dis.
#| Lowest are black, indus and age.
vip(rf_fit)
```
You would normally want to perform hyperparameter tuning for the random forest model to get the best out of your forest. This exercise is left for the reader; a sketch of one possible setup follows below.
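As a starting point, here is a minimal sketch of what such a tuning setup could look like, reusing the cross-validation folds from earlier. The choice to tune `mtry` and `min_n`, and the ranges used, are illustrative assumptions rather than recommendations, so the chunk is not evaluated here.
```{r}
#| eval: false
rf_tune_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 500) %>%
  set_engine("randomForest") %>%
  set_mode("regression")

rf_tune_wf <- workflow() %>%
  add_model(rf_tune_spec) %>%
  add_formula(medv ~ .)

# Boston has 13 predictors, so mtry can range up to 13
rf_grid <- grid_regular(
  mtry(range = c(2, 13)),
  min_n(range = c(2, 20)),
  levels = 5
)

set.seed(1234)
rf_tune_res <- tune_grid(
  rf_tune_wf,
  resamples = Boston_fold,
  grid = rf_grid
)
```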
## Boosting
We will now fit a boosted tree model. The `xgboost` package gives a good implementation of boosted trees. It has many parameters to tune, and we know that setting `trees` too high can lead to overfitting. Nevertheless, let us try fitting a boosted tree. We set `trees = 5000` to grow 5000 trees, with a maximal depth of 4 by setting `tree_depth = 4`.
```{r}
boost_spec <- boost_tree(trees = 5000, tree_depth = 4) %>%
set_engine("xgboost") %>%
set_mode("regression")
```
fitting the model like normal
```{r}
boost_fit <- fit(boost_spec, medv ~ ., data = Boston_train)
```
and the `rmse` is a little high in this case, which is probably because we didn't tune any of the parameters.
```{r}
augment(boost_fit, new_data = Boston_test) %>%
rmse(truth = medv, estimate = .pred)
```
We can look at the scatterplot and we don't see anything weird going on.
```{r}
#| fig-alt: |
#| Scatter chart. medv along the x-axis and .pred along the
#| y-axis. A diagonal line has been added; most of the points
#| follow fairly close to the line, with points for high values
#| of medv falling under the line.
augment(boost_fit, new_data = Boston_test) %>%
ggplot(aes(medv, .pred)) +
geom_abline() +
geom_point(alpha = 0.5)
```
You would normally want to perform hyperparameter tuning for the boosted tree model to get the best out of your model. This exercise is left for the reader. Look at the [Iterative search](https://www.tmwr.org/iterative-search.html) chapter of [Tidy Modeling with R](https://www.tmwr.org/) for inspiration; a grid-based sketch to start from follows below.
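For a grid-based starting point before moving on to iterative search, a sketch along these lines could work. The tuned parameters and ranges below are illustrative assumptions, and the chunk is not evaluated here.
```{r}
#| eval: false
boost_tune_spec <- boost_tree(
  trees = tune(),
  tree_depth = tune(),
  learn_rate = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

boost_tune_wf <- workflow() %>%
  add_model(boost_tune_spec) %>%
  add_formula(medv ~ .)

# learn_rate() is specified on the log10 scale, so the
# range c(-3, -1) corresponds to rates between 0.001 and 0.1
boost_grid <- grid_regular(
  trees(range = c(200, 2000)),
  tree_depth(range = c(1, 6)),
  learn_rate(range = c(-3, -1)),
  levels = 4
)

set.seed(1234)
boost_tune_res <- tune_grid(
  boost_tune_wf,
  resamples = Boston_fold,
  grid = boost_grid
)
```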
## Bayesian Additive Regression Trees
This section is WIP.
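In the meantime, parsnip does provide a `bart()` model specification with `"dbarts"` as the engine. A minimal sketch of fitting BART to the Boston training data could look like the following; this assumes the dbarts package is installed, and the chunk is not evaluated here.
```{r}
#| eval: false
bart_spec <- parsnip::bart(trees = 200) %>%
  set_engine("dbarts") %>%
  set_mode("regression")

bart_fit <- fit(bart_spec, medv ~ ., data = Boston_train)

# Test-set RMSE, computed the same way as for the other models
augment(bart_fit, new_data = Boston_test) %>%
  rmse(truth = medv, estimate = .pred)
```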