```{r 03_setup, include=FALSE}
knitr::opts_chunk$set(echo=TRUE, eval=FALSE)
```
# Overfitting & Cross-validation
## Learning Goals {-}
- Explain why training/in-sample model evaluation metrics can provide a misleading view of true test/out-of-sample performance
- Accurately describe all steps of cross-validation to estimate the test/out-of-sample version of a model evaluation metric
- Explain what role CV has in a predictive modeling analysis and its connection to overfitting
- Explain the pros/cons of higher vs. lower k in k-fold CV in terms of sample size and computing time
- Implement cross-validation in R using the `caret` package
<br>
Slides from today are available [here](https://docs.google.com/presentation/d/1dHajTeYPS09Fs1MvUQgGUVpPomkaNqIKWHfYjT7fwF8/edit?usp=sharing).
<br><br><br>
## The `caret` package {-}
(If you have not already installed the `caret` package, install it with `install.packages("caret")`.)
Over this course, we will be looking at a broad but linked set of specialized tools applied in statistical machine learning. Specialized tools generally require specialized code. The `caret` (**C**lassification **A**nd **RE**gression **T**raining) package helps us *streamline* much of this specialized code.
We'll learn some tweaks along the way, but the majority of our `caret` code will use the `train()` function, which allows us to build and evaluate various models. It looks like a lot, but remember that we'll be using this over and over:
```{r}
# Load the package
library(caret)
# Set the seed for the random number generator
set.seed(___)
# Run the algorithm
my_model <- train(
    y ~ x,
    data = ___,
    method = ___,
    tuneGrid = data.frame(___),
    trControl = trainControl(method = ___, number = ___, selectionFunction = ___),
    metric = ___,
    na.action = na.omit
)
```
| Argument    | Meaning |
|-------------|---------|
| `y ~ x`     | Specifies the outcome and predictor variables (just like `lm()`). |
| `data`      | Sample data |
| `method`    | Modeling method to use (e.g., `"lm"`, `"knn"`) |
| `tuneGrid`  | Modeling method's tuning parameters (parameters which define how it works) |
| `trControl` | Method by which to train and test the model (typically cross-validation). When we build multiple models, `selectionFunction` describes the process by which to select a final model. |
| `metric`    | When we build multiple models, this is the metric by which we'll evaluate and compare them (e.g., RMSE, MAE). |
| `na.action` | What to do about missing data values. We typically set `na.action = na.omit` to prevent errors if the data has missing values. |
<br>
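To make the template concrete, here is a small standalone illustration (not one of the models we'll use in the exercises): K-nearest-neighbors regression on R's built-in `mtcars` data, evaluated with 5-fold CV. The variables, tuning grid, and seed below are arbitrary choices for demonstration only.
```{r}
# Illustration only: a fully specified train() call on the built-in mtcars data
library(caret)
set.seed(253)  # arbitrary seed for reproducible fold assignments
example_mod <- train(
    mpg ~ wt + hp,                          # outcome ~ predictors
    data = mtcars,
    method = "knn",                         # K-nearest-neighbors regression
    tuneGrid = data.frame(k = c(3, 5, 7)),  # candidate numbers of neighbors
    trControl = trainControl(method = "cv", number = 5, selectionFunction = "best"),
    metric = "RMSE",                        # compare the candidate models by CV RMSE
    na.action = na.omit
)
example_mod  # printing shows the CV RMSE for each candidate k and which was selected
```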
The power of `caret` is that it allows us to streamline the vast world of machine learning techniques into one common syntax. On top of `"lm"`, check out the different machine learning methods that we can use in `caret`. We'll see several of these throughout the course:
```{r collapse=TRUE}
names(getModelInfo())
```
You can find more detailed descriptions of these methods [here](https://rdrr.io/cran/caret/man/models.html) and in a searchable table [here](https://topepo.github.io/caret/available-models.html).
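If you want to see which tuning parameters a particular method expects before filling in `tuneGrid`, `caret`'s `modelLookup()` function lists them; `"knn"` is used below purely as an illustration.
```{r}
# List the tuning parameter(s) for a given method (illustration with "knn")
modelLookup("knn")
```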
<br>
In the exercises below, you’ll need to adapt the following code to perform the CV procedure on a linear regression model (`"lm"`):
```{r}
# Regression model with CV error
my_model <- train(
    y ~ x,
    data = ___,
    method = "lm",
    trControl = trainControl(method = "cv", number = your_k),
    na.action = na.omit
)
```
Notice how this adapts the general `caret` code:
- `method = "lm"` indicates that we want to fit a linear regression model.
- `trControl = trainControl(method = "cv", number = your_k)` indicates that we want to train and test the model using cross-validation (`"cv"`). You need to specify the number of folds, `your_k`.
- `tuneGrid` is absent: linear regression doesn't depend on any tuning parameters (we're just minimizing the sum of squared residuals).
- `selectionFunction` and `metric` are absent: we're only building one model, not comparing multiple models, so we don't need a metric by which to compare models or a method by which to select a final model.
After building the model, you can check out the results:
```{r}
# Summarize the model
summary(my_model)
# Get model evaluation metrics for each fold
my_model$resample
# Get CV evaluation metrics
my_model$results
```
<br><br><br>
## Exercises {-}
**You can download a template RMarkdown file to start from [here](template_rmds/03-overfitting-cv.Rmd).**
### Context {-}
We'll be working with a dataset containing physical measurements on 80 adult males. These measurements include body fat percentage estimates as well as body circumference measurements.
- `fatBrozek`: Percent body fat using Brozek's equation: 457/Density - 414.2
- `fatSiri`: Percent body fat using Siri's equation: 495/Density - 450
- `density`: Density determined from underwater weighing (gm/cm^3).
- `age`: Age (years)
- `weight`: Weight (lbs)
- `height`: Height (inches)
- `neck`: Neck circumference (cm)
- `chest`: Chest circumference (cm)
- `abdomen`: Abdomen circumference (cm)
- `hip`: Hip circumference (cm)
- `thigh`: Thigh circumference (cm)
- `knee`: Knee circumference (cm)
- `ankle`: Ankle circumference (cm)
- `biceps`: Biceps (extended) circumference (cm)
- `forearm`: Forearm circumference (cm)
- `wrist`: Wrist circumference (cm)
It takes a lot of effort to estimate body fat percentage accurately through underwater weighing. The goal is to build the best predictive model for `fatSiri` using just circumference measurements, which are more easily attainable. (We won't use `fatBrozek` or `density` as predictors because they're other outcome variables.)
```{r}
library(readr)
library(ggplot2)
library(dplyr)
library(broom)
library(caret)
bodyfat_train <- read_csv("https://www.dropbox.com/s/js2gxnazybokbzh/bodyfat_train.csv?dl=1")
# Remove the fatBrozek and density variables
bodyfat_train <- bodyfat_train %>%
    select(-fatBrozek, -density)
```
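Optionally (this is just a sketch, not part of the exercises), you might glance at the data before modeling, for example by plotting one circumference predictor against the outcome:
```{r}
# Optional: a quick look at one predictor vs. the outcome
ggplot(bodyfat_train, aes(x = abdomen, y = fatSiri)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)
```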
### Exercise 1: 4 models {-}
Consider the 4 models below:
```{r}
mod1 <- lm(fatSiri ~ age + weight + neck + abdomen + thigh + forearm, data = bodyfat_train)
mod2 <- lm(fatSiri ~ age + weight + neck + abdomen + thigh + forearm + biceps, data = bodyfat_train)
mod3 <- lm(fatSiri ~ age + weight + neck + abdomen + thigh + forearm + biceps + chest + hip, data = bodyfat_train)
mod4 <- lm(fatSiri ~ ., data = bodyfat_train) # The . means all predictors
```
a. Which model will have the lowest training RMSE, and why?
b. Compute the RMSE for models 1 and 4 to (partially) check your answer for part a.
c. Which model do you think will perform worst on new test data? Why?
### Exercise 2: Cross-validation with `caret` {-}
a. Complete the code below to perform 10-fold cross-validation for `mod1` to estimate the test RMSE ($\text{CV}_{(10)}$). Do we need to use `set.seed()`? Why or why not? (Is there a number of folds for which we would not need to set the seed?)
```{r}
# Do we need to use set.seed()?
mod1_cv <- train(
)
```
b. **STAT 155 review:** Look at the `summary()` of `mod1_cv`. Contextually interpret the coefficient for the `weight` predictor. Is anything surprising? Why might this be?
c. Look at `mod1_cv$resample`, and use this to calculate the 10-fold cross-validated RMSE by hand (the idea is the same as when using MSE). (Note: We haven't done this together, but how can you adapt code that we've used before?)
d. Check your answer to part c by directly printing out the CV metrics: `mod1_cv$results`. Interpret this metric.
### Exercise 3: Looking at the evaluation metrics {-}
Look at the completed table below of evaluation metrics for the 4 models.
a. Which model performed the best on the training data?
b. Which model performed best on the test set?
c. Explain why there's a discrepancy between these 2 answers and why CV, in general, can help prevent overfitting.
| Model  | Training RMSE | $\text{CV}_{(10)}$ |
|--------|---------------|--------------------|
| `mod1` | 3.810712      | 4.389568           |
| `mod2` | 3.766645      | 4.438637           |
| `mod3` | 3.752362      | 4.517281           |
| `mod4` | 3.572299      | 4.543343           |
### Exercise 4: Practical issues: choosing $k$ {-}
a. In terms of sample size, what are the pros/cons of low vs. high $k$?
b. In terms of computational time, what are the pros/cons of low vs. high $k$?
c. If possible, it is advisable to choose $k$ to be a divisor of the sample size. Why do you think that is?
### Digging deeper {-}
If you have time, consider these exercises to further explore concepts related to today's ideas.
1. `caret`'s `trainControl()` function also has a `"repeatedcv"` method. Just from the name, how do you think this method differs from `"cv"`? What are the pros/cons of `"repeatedcv"` as compared to `"cv"`?
2. Adapt the `train()` code to perform leave-one-out cross-validation (LOOCV).
- Hint: `nrow(dataset)` obtains the number of cases in the dataset.
- Do we need `set.seed()`? Why or why not?
- Using the information from `your_output$resample` (which is a dataset), construct a visualization to examine the variability of RMSE from case to case. What might explain any very large values? What does this highlight about the quality of estimation of the LOOCV process?