# Support Vector Machines
```{r}
#| echo: false
set.seed(1234)
source("_common.R")
```
This lab will take a look at support vector machines. Along the way, we will explore how changing the hyperparameters can help improve performance.
This chapter will use [parsnip](https://www.tidymodels.org/start/models/) for model fitting, [recipes and workflows](https://www.tidymodels.org/start/recipes/) to perform the transformations, and [tune and dials](https://www.tidymodels.org/start/tuning/) to tune the hyperparameters of the model.
```{r}
#| message: false
library(tidymodels)
library(ISLR)
```
## Support Vector Classifier
Let us start by creating a synthetic data set. We will use some normally distributed data with an added offset to create two separate classes.
```{r}
set.seed(1)

sim_data <- tibble(
  x1 = rnorm(40),
  x2 = rnorm(40),
  y = factor(rep(c(-1, 1), 20))
) %>%
  mutate(x1 = ifelse(y == 1, x1 + 1.5, x1),
         x2 = ifelse(y == 1, x2 + 1.5, x2))
```
Plotting it shows that we have two slightly overlapping classes.
```{r}
ggplot(sim_data, aes(x1, x2, color = y)) +
  geom_point()
```
We can then create a linear SVM specification by setting `degree = 1` in a polynomial SVM model. We furthermore set `scaled = FALSE` in `set_engine()` so the engine does not scale the data for us; later on, we can perform this scaling in a recipe instead, as sketched below.
:::{.callout-note}
`set_engine()` can be used to pass in additional arguments directly to the underlying engine. In this case, I'm passing in `scaled = FALSE` to `kernlab::ksvm()` which is the engine function.
:::
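For reference, here is a minimal sketch of what that recipe-based scaling could look like, assuming we use `step_normalize()` to center and scale the predictors. This recipe is illustrative only and is not used in the fits below.
```{r}
# Illustrative only: scale the predictors in a recipe instead of in the
# engine. step_normalize() centers and scales all numeric predictors.
svm_rec <- recipe(y ~ ., data = sim_data) %>%
  step_normalize(all_numeric_predictors())
```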
```{r}
svm_linear_spec <- svm_poly(degree = 1) %>%
  set_mode("classification") %>%
  set_engine("kernlab", scaled = FALSE)
```
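As an aside, newer versions of parsnip (0.1.6 and later) also provide `svm_linear()`, which specifies a linear SVM directly; a sketch of the equivalent specification, assuming such a version is installed (we will keep using `svm_linear_spec` below):
```{r}
# Equivalent linear specification via svm_linear(); kernlab is a
# supported engine for it in newer parsnip versions.
svm_linear_spec2 <- svm_linear() %>%
  set_mode("classification") %>%
  set_engine("kernlab", scaled = FALSE)
```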
Taking the specification, we can add a specific `cost` of 10 before fitting the model to the data. Using `set_args()` allows us to set the `cost` argument without modifying the model specification.
```{r}
svm_linear_fit <- svm_linear_spec %>%
  set_args(cost = 10) %>%
  fit(y ~ ., data = sim_data)

svm_linear_fit
```
The `kernlab` models can be visualized using the `plot()` function if you load the `kernlab` package.
```{r}
#| message: false
#| fig-alt: |
#|   Scatter chart of sim_data with x2 on the x-axis and x1 on the
#|   y-axis. A gradient going from red through white to blue
#|   is overlaid. Blue values occur when x1 and x2 sum to more
#|   than 2 and red values when x1 and x2 sum to less than 2.
#|   The gradient does not appear to separate the two classes,
#|   which are represented by shapes.
library(kernlab)

svm_linear_fit %>%
  extract_fit_engine() %>%
  plot()
```
What if we instead used a smaller value of the `cost` parameter?
```{r}
svm_linear_fit <- svm_linear_spec %>%
  set_args(cost = 0.1) %>%
  fit(y ~ ., data = sim_data)

svm_linear_fit
```
Now that we are using a smaller value of the cost parameter, we obtain a larger number of support vectors because the margin is now wider.
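To check this claim numerically instead of reading it off the printouts, we can pull the underlying `ksvm` object out of each fit and count its support vectors with kernlab's `nSV()` accessor; a quick sketch (`library(kernlab)` was already loaded above):
```{r}
# Count support vectors at the two costs; nSV() is kernlab's accessor
# for the number of support vectors of a fitted ksvm object.
svm_linear_spec %>%
  set_args(cost = 10) %>%
  fit(y ~ ., data = sim_data) %>%
  extract_fit_engine() %>%
  nSV()

svm_linear_spec %>%
  set_args(cost = 0.1) %>%
  fit(y ~ ., data = sim_data) %>%
  extract_fit_engine() %>%
  nSV()
```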
Let us set up a `tune_grid()` run to find the value of `cost` that leads to the highest accuracy for the SVM model.
```{r}
#| fig-alt: |
#|   Faceted connected scatter chart. The x-axis shows different
#|   values of cost, and the y-axis shows the performance metric
#|   value. The facets correspond to the two performance metrics,
#|   accuracy and roc_auc. Both charts show a constant value for
#|   all values of cost, except for one spike in the accuracy.
svm_linear_wf <- workflow() %>%
  add_model(svm_linear_spec %>% set_args(cost = tune())) %>%
  add_formula(y ~ .)

set.seed(1234)
sim_data_fold <- vfold_cv(sim_data, strata = y)

param_grid <- grid_regular(cost(), levels = 10)

tune_res <- tune_grid(
  svm_linear_wf,
  resamples = sim_data_fold,
  grid = param_grid
)

autoplot(tune_res)
```
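To see the numbers behind this chart, we can ask the tune package for the top-performing candidates; a small sketch using `show_best()`:
```{r}
# Top cost values ranked by cross-validated accuracy.
show_best(tune_res, metric = "accuracy")
```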
Using the `tune_res` object and the `select_best()` function allows us to find the value of `cost` that gives the best cross-validated accuracy. We then finalize the workflow with `finalize_workflow()` and fit the new workflow on the data set.
```{r}
best_cost <- select_best(tune_res, metric = "accuracy")
svm_linear_final <- finalize_workflow(svm_linear_wf, best_cost)
svm_linear_fit <- svm_linear_final %>% fit(sim_data)
```
We can now generate a different data set to act as the test data set. We will make sure that it is generated using the same model but with a different seed.
```{r}
set.seed(2)

sim_data_test <- tibble(
  x1 = rnorm(20),
  x2 = rnorm(20),
  y = factor(rep(c(-1, 1), 10))
) %>%
  mutate(x1 = ifelse(y == 1, x1 + 1.5, x1),
         x2 = ifelse(y == 1, x2 + 1.5, x2))
```
Assessing the model on this testing data set shows us that the model still performs very well.
```{r}
augment(svm_linear_fit, new_data = sim_data_test) %>%
  conf_mat(truth = y, estimate = .pred_class)
```
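If you prefer a single number to a confusion matrix, the same `augment()` output can be piped into `accuracy()` instead; a quick sketch:
```{r}
# Test-set accuracy as a single yardstick metric.
augment(svm_linear_fit, new_data = sim_data_test) %>%
  accuracy(truth = y, estimate = .pred_class)
```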
## Support Vector Machine
We will now see how we can fit an SVM using a non-linear kernel. Let us start by generating some data, but this time with a non-linear class boundary.
```{r}
#| fig-alt: |
#|   Scatter plot of sim_data2. The data has an oblong shape, with
#|   the points in the middle having one color and the points at
#|   both ends having another color.
set.seed(1)

sim_data2 <- tibble(
  x1 = rnorm(200) + rep(c(2, -2, 0), c(100, 50, 50)),
  x2 = rnorm(200) + rep(c(2, -2, 0), c(100, 50, 50)),
  y = factor(rep(c(1, 2), c(150, 50)))
)

sim_data2 %>%
  ggplot(aes(x1, x2, color = y)) +
  geom_point()
```
We will try an SVM with a radial basis function kernel. Such a kernel would allow us to capture the non-linearity in our data.
```{r}
svm_rbf_spec <- svm_rbf() %>%
  set_mode("classification") %>%
  set_engine("kernlab")
```
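Here we accept the default hyperparameters, but note that `svm_rbf()` exposes `cost` and `rbf_sigma`, which could be tuned exactly the way we tuned `cost` above. A sketch of such a specification (it is not run through `tune_grid()` here):
```{r}
# Illustrative only: mark both radial-kernel hyperparameters for tuning.
svm_rbf_tune_spec <- svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
  set_mode("classification") %>%
  set_engine("kernlab")
```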
Fitting the model:
```{r}
svm_rbf_fit <- svm_rbf_spec %>%
  fit(y ~ ., data = sim_data2)
```
and plotting it reveals that the model was able to separate the two classes, even though the boundary between them is non-linear.
```{r}
#| fig-alt: |
#|   Scatter chart of sim_data2 with x2 on the x-axis and x1 on the
#|   y-axis. A gradient going from red through white to blue
#|   is overlaid. The gradient is blue in the middle of the data
#|   and red at the edges, with a non-linear separation between
#|   the colors.
svm_rbf_fit %>%
  extract_fit_engine() %>%
  plot()
```
But let us see how well this model generalizes to new data from the same generating process.
```{r}
set.seed(2)

sim_data2_test <- tibble(
  x1 = rnorm(200) + rep(c(2, -2, 0), c(100, 50, 50)),
  x2 = rnorm(200) + rep(c(2, -2, 0), c(100, 50, 50)),
  y = factor(rep(c(1, 2), c(150, 50)))
)
```
And it works well!
```{r}
augment(svm_rbf_fit, new_data = sim_data2_test) %>%
  conf_mat(truth = y, estimate = .pred_class)
```
## ROC Curves
ROC curves can easily be created using the `roc_curve()` function from the yardstick package. We use this function in much the same way as we used the `accuracy()` function; the main difference is that we pass the predicted class probability instead of the predicted class.
```{r}
augment(svm_rbf_fit, new_data = sim_data2_test) %>%
  roc_curve(truth = y, .pred_1)
```
This produces the different values of `specificity` and `sensitivity` for each threshold. We can get a quick visualization by passing the results of `roc_curve()` into `autoplot()`.
```{r}
#| message: false
#| fig-alt: |
#|   A ROC curve plot. 1-specificity along the x-axis and
#|   sensitivity along the y-axis. A dotted line is drawn
#|   along the diagonal. The line quite closely follows
#|   the upper left side.
augment(svm_rbf_fit, new_data = sim_data2_test) %>%
  roc_curve(truth = y, .pred_1) %>%
  autoplot()
```
A common metric is to calculate the area under this curve. This can be done using the `roc_auc()` function (`_auc` stands for **a**rea **u**nder **c**urve).
```{r}
augment(svm_rbf_fit, new_data = sim_data2_test) %>%
  roc_auc(truth = y, .pred_1)
```
## Application to Gene Expression Data
We now examine the Khan data set, which consists of several tissue samples corresponding to four distinct types of small round blue cell tumors. For each tissue sample, gene expression measurements are available. The data set comes in the `Khan` list, which we will wrangle a little bit to create two tibbles, one for the training data and one for the testing data.
```{r}
#| warning: false
Khan_train <- bind_cols(
  y = factor(Khan$ytrain),
  as_tibble(Khan$xtrain)
)

Khan_test <- bind_cols(
  y = factor(Khan$ytest),
  as_tibble(Khan$xtest)
)
```
Looking at the dimensions of the training data reveals that we have 63 observations with 2,308 gene expression measurements.
```{r}
dim(Khan_train)
```
There is a very large number of predictors compared to the number of rows. This indicates that a linear kernel will be preferable, as the added flexibility we would get from a polynomial or radial kernel is unnecessary.
```{r}
khan_fit <- svm_linear_spec %>%
  set_args(cost = 10) %>%
  fit(y ~ ., data = Khan_train)
```
Let us take a look at the training confusion matrix. And look, we get a perfect confusion matrix: the hyperplane was able to fully separate the classes.
```{r}
augment(khan_fit, new_data = Khan_train) %>%
  conf_mat(truth = y, estimate = .pred_class)
```
But remember, we don't measure performance by how well the model does on the training data set; we measure it by how well it performs on the testing data set. So let us look at the testing confusion matrix.
```{r}
augment(khan_fit, new_data = Khan_test) %>%
  conf_mat(truth = y, estimate = .pred_class)
```
And it performs fairly well. A couple of misclassifications, but nothing too bad.
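If you want more than a visual read of that matrix, yardstick's `conf_mat` objects also have a `summary()` method that computes a collection of metrics from it; a quick sketch:
```{r}
# Accuracy, kappa, sensitivity, and other metrics derived from the
# testing confusion matrix.
augment(khan_fit, new_data = Khan_test) %>%
  conf_mat(truth = y, estimate = .pred_class) %>%
  summary()
```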