---
title: "6_MLPS_R_model_selection"
author: "Zhe Zhang (TA - Heinz CMU PhD)"
date: "1/28/2017"
output:
  html_document:
    css: '~/Dropbox/avenir-white.css'
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = F, error = F)
```
## Lecture 6: Model Selection
Key specific tasks we covered in this lecture:
* training/validation/test set splits
* empirical misclassification error
* average squared error
* k-fold cross validation
* regularization and revisiting cross-validation in sparse linear models
* grid search
* AIC, BIC
```{r}
library(tidyverse)
# Train/Validation/Test Splits
all_data <- data.frame(location = sample(c("NA", "EU", "AS", "AF", "?"),
                                         size = 1000, replace = TRUE),
                       outcome = sample(c(1, 2, 3, 4),
                                        size = 1000, replace = TRUE))
# Assign every row to one of 10 groups; group 10 is the held-out test set
test_indices <- sample(seq(10),
                       size = nrow(all_data),
                       replace = TRUE)
held_out_test <- all_data[test_indices == 10, ]
all_training <- all_data[test_indices != 10, ]
# Fold labels for the training rows only, so they line up with all_training
training_folds <- test_indices[test_indices != 10]
for (fold in seq(9)) {
  training <- all_training[training_folds != fold, ]
  validation <- all_training[training_folds == fold, ]
  print(paste("Fold #", fold))
  print(paste("has avg outcome", mean(validation$outcome,
                                      na.rm = TRUE)))
}
```
```{r}
# Empirical Misclassification Error
predictions <- data.frame(y = sample(c(0, 1),
                                     size = 1000,
                                     replace = TRUE,
                                     prob = c(3, 1)),
                          pred = runif(1000))
# ASSUME: use 0.5 as a prediction threshold
# An error is a case where the thresholded prediction differs from y
predictions <- mutate(predictions,
                      errors = as.numeric(y != round(pred)))
error_rate <- mean(predictions$errors)
print(paste('Misclassification Rate is', error_rate))
# Average Squared Error
predictions_regression <-
  data.frame(y = rnorm(1000),
             pred = rnorm(1000))
# No threshold is needed for regression: compare predictions to y directly
predictions_regression <- predictions_regression %>%
  mutate(sq_error = (y - pred)^2)
mse <- mean(predictions_regression$sq_error)
print(paste('Avg/Mean Squared Error is', mse))
```
### K-Fold Cross Validation
This is left for a HW assignment. Use the code for creating and iterating over train/validation splits as a hint to start writing your own cross-validation functions.
The key addition for the HW will be to record your data from different iterations. You can try using a pre-made data frame or making a list of data.frames and then using `bind_rows()` to combine them.
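As a starting point for the record-keeping step only, here is a minimal sketch of storing one row per fold in a list and combining the rows with `bind_rows()`. The data (`mtcars`), the model formula, the choice of 5 folds, and the column names are all assumptions made for illustration, not the HW setup.

```{r}
library(dplyr)

set.seed(1)  # assumed seed, only so the sketch is reproducible

# Assumed toy setup: 5 folds of roughly equal size over the rows of mtcars
n_folds <- 5
fold_id <- sample(rep(seq(n_folds), length.out = nrow(mtcars)))

fold_results <- vector("list", n_folds)
for (fold in seq(n_folds)) {
  training   <- mtcars[fold_id != fold, ]
  validation <- mtcars[fold_id == fold, ]

  # Illustrative model; swap in whatever you are evaluating
  fit  <- lm(mpg ~ hp + wt, data = training)
  pred <- predict(fit, newdata = validation)

  # One-row data frame per fold, with the same column names every time
  fold_results[[fold]] <- data.frame(
    fold = fold,
    n_validation = nrow(validation),
    mse = mean((validation$mpg - pred)^2)
  )
}

# Stack the per-fold rows into a single data frame
cv_results <- bind_rows(fold_results)
cv_results
mean(cv_results$mse)  # cross-validated estimate of the squared error
```

Pre-allocating the list with `vector("list", n_folds)` avoids growing an object inside the loop, and each fold writes to its own slot.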
### Regularization & Grid Search
To identify the best-performing regularization parameters, we can use a grid search in one or more dimensions, which we ask you to do in HW2. Take the cross-validation loop above and nest a second for loop inside it that tries each regularization parameter in your grid. Practice saving the results for each parameter value and each validation fold.
I recommend using a list of data.frames (even if it's just a one row data frame) and then using `bind_rows()` to combine the list of data frames into one data frame. Importantly, the column names should be the same.
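The nested loop described above might look roughly like the sketch below. It uses ridge regression with a hand-written closed-form fit so that it needs no extra packages; the helpers `ridge_fit()` and `ridge_predict()`, the `lambda_grid` values, the predictors, and the fold count are all hypothetical choices for illustration, and your HW model may call for a different fitting function.

```{r}
library(dplyr)

set.seed(1)  # assumed seed, only so the sketch is reproducible

# Hypothetical helper: closed-form ridge fit on standardized predictors
ridge_fit <- function(x, y, lambda) {
  x_center <- colMeans(x)
  x_scale  <- apply(x, 2, sd)
  x_std    <- scale(x, center = x_center, scale = x_scale)
  y_mean   <- mean(y)
  beta <- solve(crossprod(x_std) + lambda * diag(ncol(x_std)),
                crossprod(x_std, y - y_mean))
  list(beta = beta, x_center = x_center, x_scale = x_scale, y_mean = y_mean)
}

# Hypothetical helper: predict using the training-set centering and scaling
ridge_predict <- function(fit, x) {
  x_std <- scale(x, center = fit$x_center, scale = fit$x_scale)
  as.numeric(x_std %*% fit$beta + fit$y_mean)
}

x <- as.matrix(mtcars[, c("cyl", "disp", "hp", "wt")])
y <- mtcars$mpg

n_folds <- 5
fold_id <- sample(rep(seq(n_folds), length.out = nrow(x)))
lambda_grid <- c(0, 0.1, 1, 10, 100)  # assumed grid

results <- list()
for (fold in seq(n_folds)) {
  is_val <- fold_id == fold
  for (lambda in lambda_grid) {
    fit  <- ridge_fit(x[!is_val, , drop = FALSE], y[!is_val], lambda)
    pred <- ridge_predict(fit, x[is_val, , drop = FALSE])
    # One row per (fold, lambda) pair, same column names every time
    results[[length(results) + 1]] <- data.frame(
      fold = fold, lambda = lambda,
      mse = mean((y[is_val] - pred)^2))
  }
}

# Average the validation error over folds for each grid value
grid_summary <- bind_rows(results) %>%
  group_by(lambda) %>%
  summarise(mean_mse = mean(mse)) %>%
  arrange(mean_mse)
grid_summary
```

The first row of `grid_summary` is the grid value with the lowest average validation error. Note that the centering and scaling are computed from the training folds only, so nothing from the validation fold leaks into the fit.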
### AIC/BIC
For linear models, you can use the built-in AIC/BIC commands.
```{r}
lm_full <- lm(mpg ~ ., data = mtcars)
lm_small <- lm(mpg ~ cyl + hp, data = mtcars)
# AIC
AIC(lm_full, k = 2)
AIC(lm_small, k = 2)
# BIC
AIC(lm_full, k = log(nobs(lm_full)))
AIC(lm_small, k = log(nobs(lm_small)))
```
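To see what the `k` argument is doing, the sketch below recomputes both criteria by hand from the log-likelihood, using the standard definitions AIC = -2 logLik + 2 p and BIC = -2 logLik + log(n) p, where p is the number of estimated parameters. The variable names are illustrative; `logLik()`, `nobs()`, and `BIC()` are part of base R's `stats` package.

```{r}
lm_small <- lm(mpg ~ cyl + hp, data = mtcars)

ll       <- logLik(lm_small)
n_params <- attr(ll, "df")  # regression coefficients plus the error variance
n_obs    <- nobs(lm_small)

aic_by_hand <- -2 * as.numeric(ll) + 2 * n_params
bic_by_hand <- -2 * as.numeric(ll) + log(n_obs) * n_params

# Both comparisons should return TRUE
all.equal(aic_by_hand, AIC(lm_small))
all.equal(bic_by_hand, BIC(lm_small))
```

`BIC(model)` is shorthand for `AIC(model, k = log(nobs(model)))`, so either form can be used.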