```{r 07_setup, include=FALSE}
knitr::opts_chunk$set(echo=TRUE, eval=FALSE)
```
# Splines
## Learning Goals {-}
- Explain the advantages of splines over global transformations and other types of piecewise polynomials
- Explain how splines are constructed by drawing connections to variable transformations and least squares
- Explain how the number of knots relates to the bias-variance tradeoff
<br>
Slides from today are available [here](https://docs.google.com/presentation/d/1A1QrmwmAXMYSIEwhNGgHRf0GH7vqukcJFemlWLA3yuQ/edit?usp=sharing).
<br><br><br>
## Splines in `caret` {-}
To build models with splines in `caret`, we proceed with the same structure for `train()` as we use for ordinary linear regression models. (Why can we just use least squares?) Refer to code from Topic 3: Overfitting & CV for a refresher.
To work with splines, we'll use the `splines` package. The `ns()` function in that package generates the transformed versions of a quantitative predictor that make up a natural spline. Using it requires only a small update to the *formula*:
```{r}
# Before: all linear terms
ls_mod <- train(
    y ~ quant_x1 + quant_x2,
    ...
)

# With splines
ls_mod <- train(
    y ~ ns(quant_x1, df = 3) + ns(quant_x2, df = 3),
    ...
)
```
- The `df` argument in `ns()` stands for degrees of freedom:
    - `df = # knots + 1`
    - The degrees of freedom are the number of coefficients in the transformation functions that are free to vary (essentially the number of underlying parameters behind the transformations).
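Because `ns()` only adds transformed versions of each predictor to the model, the model is still linear in its coefficients, so it can still be fit with least squares. As a minimal sketch of what the full `train()` call might look like (the dataset name `my_data` and the resampling/metric settings are placeholders mirroring our usual setup from Topic 3, not required choices):
```{r}
# Sketch only: spline terms inside a standard least squares fit in caret
spline_sketch <- train(
    y ~ ns(quant_x1, df = 3) + ns(quant_x2, df = 3),
    data = my_data,          # hypothetical dataset name
    method = "lm",           # still least squares: ns() just adds transformed predictors
    trControl = trainControl(method = "cv", number = 10),
    metric = "MAE",
    na.action = na.omit
)
```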
<br><br><br>
## Exercises {-}
**You can download a template RMarkdown file to start from [here](template_rmds/07-splines.Rmd).**
Before proceeding, note that the `splines` package is included with base R, so it typically does not need to be installed; if it is missing for some reason, you can install it with `install.packages("splines")` in the Console.
We'll continue using the `College` dataset in the `ISLR` package to explore splines. You can use `?College` in the Console to look at the data codebook.
```{r}
library(caret)
library(ggplot2)
library(dplyr)
library(readr)
library(ISLR)
library(splines)
data(College)
# A little data cleaning
college_clean <- College %>%
    mutate(school = rownames(College)) %>%
    filter(Grad.Rate <= 100) # Remove one school with a reported grad rate of 118%
rownames(college_clean) <- NULL # Remove school names as row names
```
### Exercise 1: Evaluating a fully linear model {-}
We will model `Grad.Rate` as a function of 4 predictors: `Private`, `Terminal`, `Expend`, and `S.F.Ratio`.
a. Make scatterplots with 2 different smoothing lines to explore potential nonlinearity. Adding the following to the normal scatterplot code will create a smooth (curved) blue trend line and a red linear trend line.
```{r}
geom_smooth(color = "blue", se = FALSE) +
geom_smooth(method = "lm", color = "red", se = FALSE)
```
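For example, a plot for one of the quantitative predictors might look like the following (the choice of `Terminal` here is just illustrative; the other predictors follow the same pattern):
```{r}
# Example: Grad.Rate vs. Terminal with a smooth (blue) and linear (red) trend
ggplot(college_clean, aes(x = Terminal, y = Grad.Rate)) +
    geom_point(alpha = 0.3) +
    geom_smooth(color = "blue", se = FALSE) +
    geom_smooth(method = "lm", color = "red", se = FALSE)
```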
b. Use `caret` to fit an ordinary linear regression model (no splines yet) with the following specifications:
- Use 8-fold CV.
- Use mean absolute error (MAE) to select a final model.
- Select the simplest model for which the metric is within one standard error of the best metric.
```{r}
set.seed(___)
ls_mod <- train(
)
```
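One possible way to fill in this skeleton, following the specifications above (the seed value is arbitrary):
```{r}
# Possible fill-in (sketch): 8-fold CV, MAE metric, one-standard-error rule
set.seed(253) # any seed works; 253 is arbitrary
ls_mod <- train(
    Grad.Rate ~ Private + Terminal + Expend + S.F.Ratio,
    data = college_clean,
    method = "lm",
    trControl = trainControl(method = "cv", number = 8, selectionFunction = "oneSE"),
    metric = "MAE",
    na.action = na.omit
)
```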
c. Make plots of the residuals vs. the 3 quantitative predictors to evaluate the appropriateness of linear terms.
```{r}
ls_mod_data <- college_clean %>%
    mutate(
        pred = predict(ls_mod, newdata = college_clean),
        resid = ___
    )

ggplot(ls_mod_data, ???) +
    ___ +
    ___ +
    geom_hline(yintercept = 0, color = "red")
```
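One way the blanks might be filled in, shown here for `Terminal` (the other two quantitative predictors follow the same pattern):
```{r}
# Possible fill-in (sketch): residuals vs. one quantitative predictor
ls_mod_data <- college_clean %>%
    mutate(
        pred = predict(ls_mod, newdata = college_clean),
        resid = Grad.Rate - pred
    )

ggplot(ls_mod_data, aes(x = Terminal, y = resid)) +
    geom_point(alpha = 0.3) +
    geom_smooth(color = "blue", se = FALSE) +
    geom_hline(yintercept = 0, color = "red")
```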
### Exercise 2: Evaluating a spline model {-}
We'll extend our linear regression model with spline functions of the quantitative predictors (leave `Private` as is).
a. What tuning parameter is associated with splines? How do high/low values of this parameter relate to bias and variance?
b. Update your code from Exercise 1 to model the 3 quantitative predictors with natural splines that have 2 knots (= 3 degrees of freedom).
```{r}
set.seed(___)
spline_mod <- train(
)
```
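One possible fill-in, replacing each quantitative term with a natural spline with 3 degrees of freedom and carrying the other settings over from Exercise 1 (seed again arbitrary):
```{r}
# Possible fill-in (sketch): natural splines with 2 knots (df = 3) on each quantitative predictor
set.seed(253) # arbitrary seed
spline_mod <- train(
    Grad.Rate ~ Private + ns(Terminal, df = 3) + ns(Expend, df = 3) + ns(S.F.Ratio, df = 3),
    data = college_clean,
    method = "lm",
    trControl = trainControl(method = "cv", number = 8, selectionFunction = "oneSE"),
    metric = "MAE",
    na.action = na.omit
)
```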
c. Make plots of the residuals vs. the 3 quantitative predictors to evaluate if splines improved the model.
```{r}
spline_mod_data <- ___
# Residual plots
```
### Extra! Variable scaling {-}
What is your intuition about whether variable scaling matters for the performance of splines?
Check your intuition by reusing code from Exercise 2, except with `preProcess = "scale"` inside `train()`. Call this new model `spline_mod2`.
How do the predictions from `spline_mod` and `spline_mod2` compare? You could use a plot to compare or check out the `all.equal()` function.
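A sketch of how the comparison might look, assuming `spline_mod2` has been fit exactly like `spline_mod` but with `preProcess = "scale"` added to the `train()` call (both comparison approaches below are only suggestions):
```{r}
# Numerical comparison of predictions from the unscaled and scaled fits
all.equal(
    predict(spline_mod, newdata = college_clean),
    predict(spline_mod2, newdata = college_clean)
)

# Visual comparison: points on the y = x line indicate identical predictions
ggplot(
    data.frame(
        unscaled = predict(spline_mod, newdata = college_clean),
        scaled = predict(spline_mod2, newdata = college_clean)
    ),
    aes(x = unscaled, y = scaled)
) +
    geom_point(alpha = 0.3) +
    geom_abline(color = "red")
```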