# Training with luz {#sec:luz}
At this point in the book, you know how to train a neural network. Truth be told, though, there's some cognitive effort involved in having to remember the right execution order of steps like `optimizer$zero_grad()`, `loss$backward()`, and `optimizer$step()`. Also, in more complex scenarios than our running example, the list of things to actively remember gets longer.
One thing we haven't talked about yet, for example, is how to handle the usual three stages of machine learning: training, validation, and testing. Another is the question of data flow between *devices* (CPU and GPU, if you have one). Both topics require additional code in the training loop. Writing that code is tedious, and creates potential for mistakes.
You can see exactly what I'm referring to in the appendix at the end of this chapter. But now, I want to focus on the remedy: a high-level, easy-to-use, concise way of organizing and instrumenting the training process, contributed by a package built on top of `torch`: `luz`.
## Que haya luz - Que haja luz - Let there be light
A *torch* already brings some light, but sometimes in life, there is no such thing as *too bright*. `luz` was designed to make deep learning with `torch` as effortless as possible, while at the same time allowing for easy customization. In this chapter, we focus on the overall process; examples of customization will appear in later chapters.
For ease of comparison, we take our running example, and add a third version, now using `luz`. First, we "just" directly port the example; then, we adapt it to a more realistic scenario. In that scenario, we
- make use of separate training, validation, and test sets;
- have `luz` compute *metrics* during training/validation;
- illustrate the use of *callbacks* to perform custom actions or dynamically change hyper-parameters during training; and
- explain what is going on with the aforementioned *devices*.
## Porting the toy example
### Data
`luz` does not just substantially transform the code required to train a neural network; it also adds flexibility on the data side of things. In addition to a reference to a `dataloader()`, its `fit()` method accepts `dataset()`s, tensors, and even R objects, as we'll be able to verify soon.
We start by generating an R matrix and a vector, as before. This time though, we also wrap them in a `tensor_dataset()`, and instantiate a `dataloader()`. Instead of just 100, we now generate 1000 observations.
```{r}
library(torch)
library(luz)
# input dimensionality (number of input features)
d_in <- 3
# number of observations in training set
n <- 1000
x <- torch_randn(n, d_in)
coefs <- c(0.2, -1.3, -0.5)
y <- x$matmul(coefs)$unsqueeze(2) + torch_randn(n, 1)
ds <- tensor_dataset(x, y)
dl <- dataloader(ds, batch_size = 100, shuffle = TRUE)
```
### Model
To use `luz`, no changes are needed to the model definition. Note, though, that we just *define* the model architecture; we never actually *instantiate* a model object ourselves.
```{r}
# dimensionality of hidden layer
d_hidden <- 32
# output dimensionality (number of predicted features)
d_out <- 1
net <- nn_module(
  initialize = function(d_in, d_hidden, d_out) {
    self$net <- nn_sequential(
      nn_linear(d_in, d_hidden),
      nn_relu(),
      nn_linear(d_hidden, d_out)
    )
  },
  forward = function(x) {
    self$net(x)
  }
)
```
### Training
To train the model, we don't write loops anymore. `luz` replaces the familiar *iterative* style with a *declarative* one: You tell `luz` what you want to happen, and like a docile sorcerer's apprentice, it sets the machinery in motion.
Concretely, instruction happens in two -- required -- calls.
1. In `setup()`\index{\texttt{setup()} (luz)}, you specify the loss function and the optimizer to use.
2. In `fit()`\index{\texttt{fit()} (luz)}, you pass reference(s) to the training (and optionally, validation) data, as well as the number of epochs to train for.
If the model is configurable -- meaning, it accepts arguments to `initialize()` -- a third method comes into play: `set_hparams()`\index{\texttt{set{\textunderscore}hparams()} (luz)}, to be called in-between the other two. (That's `hparams` for hyper-parameters.) Using this mechanism, you can easily experiment with, for example, different layer sizes, or other factors suspected to affect performance.
```{r}
fitted <- net %>%
  setup(loss = nn_mse_loss(), optimizer = optim_adam) %>%
  set_hparams(
    d_in = d_in,
    d_hidden = d_hidden, d_out = d_out
  ) %>%
  fit(dl, epochs = 200)
```
Running this code, you should see output approximately like this:
```
Epoch 1/200
Train metrics: Loss: 3.0343
Epoch 2/200
Train metrics: Loss: 2.5387
Epoch 3/200
Train metrics: Loss: 2.2758
...
...
Epoch 198/200
Train metrics: Loss: 0.891
Epoch 199/200
Train metrics: Loss: 0.8879
Epoch 200/200
Train metrics: Loss: 0.9036
```
Above, what we passed to `fit()` was the `dataloader()`. Let's verify that passing the `dataset()` instead works just as well:
```{r}
fitted <- net %>%
  setup(loss = nn_mse_loss(), optimizer = optim_adam) %>%
  set_hparams(
    d_in = d_in,
    d_hidden = d_hidden, d_out = d_out
  ) %>%
  fit(ds, epochs = 200)
```
Or even `torch` tensors:
```{r}
fitted <- net %>%
  setup(loss = nn_mse_loss(), optimizer = optim_adam) %>%
  set_hparams(
    d_in = d_in,
    d_hidden = d_hidden, d_out = d_out
  ) %>%
  fit(list(x, y), epochs = 200)
```
And finally, R objects, which can be convenient when we aren't already working with tensors.
```{r}
fitted <- net %>%
  setup(loss = nn_mse_loss(), optimizer = optim_adam) %>%
  set_hparams(
    d_in = d_in,
    d_hidden = d_hidden, d_out = d_out
  ) %>%
  fit(list(as.matrix(x), as.matrix(y)), epochs = 200)
```
In the following sections, we'll always be working with `dataloader()`s; but in some cases those "shortcuts" may come in handy.
Next, we extend the toy example, illustrating how to address more complex requirements.
## A more realistic scenario
### Integrating training, validation, and test
In deep learning, training and validation phases are interleaved. Every epoch of training is followed by an epoch of validation. Importantly, the data used in both phases have to be strictly disjoint.
In each training phase, gradients are computed and weights are updated; during validation, none of that happens. Why have a validation set, then? If, for each epoch, we compute task-relevant metrics on both partitions, we can see whether we are *overfitting* to the training data: that is, drawing conclusions based on specifics of the training sample that are not descriptive of the overall population we want to model. We need to do just two things: instruct `luz` to compute a suitable metric, and pass it an additional `dataloader` pointing to the validation data.
The former is done in `setup()`; for a regression task, common choices are mean squared error or mean absolute error (MSE and MAE, respectively). As we're already using MSE as our loss, let's choose MAE as a metric:
```{r}
fitted <- net %>%
  setup(
    loss = nn_mse_loss(),
    optimizer = optim_adam,
    metrics = list(luz_metric_mae())
  ) %>%
  fit(...)
```
The validation `dataloader` is passed in `fit()` -- but to be able to reference it, we need to construct it first! So now (anticipating we'll want to have a test set, too), we split up the original 1000 observations into three partitions, creating a `dataset` and a `dataloader` for each of them.
```{r}
train_ids <- sample(1:length(ds), size = 0.6 * length(ds))
valid_ids <- sample(
  setdiff(1:length(ds), train_ids),
  size = 0.2 * length(ds)
)
test_ids <- setdiff(
  1:length(ds),
  union(train_ids, valid_ids)
)

train_ds <- dataset_subset(ds, indices = train_ids)
valid_ds <- dataset_subset(ds, indices = valid_ids)
test_ds <- dataset_subset(ds, indices = test_ids)

train_dl <- dataloader(train_ds, batch_size = 100, shuffle = TRUE)
valid_dl <- dataloader(valid_ds, batch_size = 100)
test_dl <- dataloader(test_ds, batch_size = 100)
```
Now, we are ready to start the enhanced workflow:
```{r}
fitted <- net %>%
  setup(
    loss = nn_mse_loss(),
    optimizer = optim_adam,
    metrics = list(luz_metric_mae())
  ) %>%
  set_hparams(
    d_in = d_in,
    d_hidden = d_hidden, d_out = d_out
  ) %>%
  fit(train_dl, epochs = 200, valid_data = valid_dl)
```
```
Epoch 1/200
Train metrics: Loss: 2.5863 - MAE: 1.2832
Valid metrics: Loss: 2.487 - MAE: 1.2365
Epoch 2/200
Train metrics: Loss: 2.4943 - MAE: 1.26
Valid metrics: Loss: 2.4049 - MAE: 1.2161
Epoch 3/200
Train metrics: Loss: 2.4036 - MAE: 1.236
Valid metrics: Loss: 2.3261 - MAE: 1.1962
...
...
Epoch 198/200
Train metrics: Loss: 0.8947 - MAE: 0.7504
Valid metrics: Loss: 1.0572 - MAE: 0.8287
Epoch 199/200
Train metrics: Loss: 0.8948 - MAE: 0.7503
Valid metrics: Loss: 1.0569 - MAE: 0.8286
Epoch 200/200
Train metrics: Loss: 0.8944 - MAE: 0.75
Valid metrics: Loss: 1.0579 - MAE: 0.8292
```
Even though both training and validation sets come from the exact same distribution, we do see a bit of overfitting. This is a topic we'll talk about more in the next chapter.
Once training has finished, the `fitted` object holds a history of epoch-wise metrics (we'll take a closer look below), as well as references to a number of important objects involved in the training process. Among the latter is the fitted model itself -- which enables an easy way to obtain predictions on the test set:\index{\texttt{predict()} (luz)}
```{r}
fitted %>% predict(test_dl)
```
```
torch_tensor
 0.7799
 1.7839
-1.1294
-1.3002
-1.8169
-1.6762
-0.7548
-1.2041
 2.9613
-0.9551
 0.7714
-0.8265
 1.1334
-2.8406
-1.1679
 0.8350
 2.0134
 2.1083
 1.4093
 0.6962
-0.3669
-0.5292
 2.0310
-0.5814
 2.7494
 0.7855
-0.5263
-1.1257
-3.3117
 0.6157
... [the output was truncated (use n=-1 to disable)]
[ CPUFloatType{200,1} ]
```
We also want to evaluate performance on the test set:\index{\texttt{evaluate()} (luz)}
```{r}
fitted %>% evaluate(test_dl)
```
```
A `luz_module_evaluation`
── Results
loss: 0.9271
mae: 0.7348
```
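The epoch-wise history mentioned above can be inspected directly, too. Here is a minimal sketch, assuming `luz`'s `get_metrics()` accessor, which returns the recorded values as a data frame:

```{r}
# per-epoch metrics, one row per epoch / set / metric combination
metrics <- get_metrics(fitted)
head(metrics)
```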
This workflow -- training and validation in lock-step, followed by evaluation and prediction on the test set -- is something we'll encounter time and again in this book.
### Using callbacks to "hook" into the training process\index{callbacks (luz)}
At this point, you may feel that what we've gained in code efficiency, we may have lost in flexibility. Coding the training loop yourself, you can arrange for all kinds of things to happen: save model weights, adjust the learning rate ... whatever you need.
In reality, no flexibility is lost. Instead, `luz` offers a standardized way to achieve the same goals: callbacks. Callbacks are objects that can execute arbitrary R code at any of the following points in time (a minimal sketch follows the list):
- when the overall training process starts or ends (`on_fit_begin()` / `on_fit_end()`);
- when an epoch (comprising training and validation) starts or ends (`on_epoch_begin()` / `on_epoch_end()`);
- when during an epoch, the training (validation, resp.) phase starts or ends (`on_train_begin()` / `on_train_end()`; `on_valid_begin()` / `on_valid_end()`);
- when during training (validation, resp.), a new batch is either about to be or has been processed (`on_train_batch_begin()` / `on_train_batch_end()`; `on_valid_batch_begin()` / `on_valid_batch_end()`);
- and even at specific landmarks inside the "innermost" training / validation logic, such as "after loss computation", "after `backward()`" or "after `step()`".
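To give a flavor, here is a minimal sketch of a custom callback. It is built with `luz_callback()`; inside the methods, `luz` makes a context object, `ctx`, available that holds the current training state. The callback's name and the printed messages are made up for illustration:

```{r}
# a toy callback that reports when fitting starts,
# and prints the epoch number after every epoch
print_progress_callback <- luz_callback(
  name = "print_progress_callback",
  on_fit_begin = function() {
    cat("Training is starting...\n")
  },
  on_epoch_end = function() {
    cat("Finished epoch ", ctx$epoch, "\n")
  }
)
```

Like the built-in callbacks discussed next, an instance would be passed to `fit()`, e.g., `callbacks = list(print_progress_callback())`.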
While you can implement any logic you wish using callbacks (and we'll see how to do this in a later chapter), `luz` already comes equipped with a very useful set. For example:
- `luz_callback_model_checkpoint()` saves model weights after every epoch (or just in case of improvements, if so instructed).
- `luz_callback_lr_scheduler()` activates one of `torch`'s *learning rate schedulers*. Different scheduler objects exist, each following its own logic in dynamically updating the learning rate. (A sketch follows at the end of this section.)
- `luz_callback_early_stopping()` terminates training once model performance stops improving. What exactly "stops improving" should mean is configurable by the user.
Callbacks are passed to the `fit()` method in a list. For example, augmenting our most recent workflow:
```{r}
fitted <- net %>%
  setup(
    loss = nn_mse_loss(),
    optimizer = optim_adam,
    metrics = list(luz_metric_mae())
  ) %>%
  set_hparams(
    d_in = d_in,
    d_hidden = d_hidden,
    d_out = d_out
  ) %>%
  fit(
    train_dl,
    epochs = 200,
    valid_data = valid_dl,
    callbacks = list(
      luz_callback_model_checkpoint(
        path = "./models/",
        save_best_only = TRUE
      ),
      luz_callback_early_stopping(patience = 10)
    )
  )
```
With this configuration, weights will be saved, but only if validation loss decreases. Training will halt if there is no improvement (again, in validation loss) for ten epochs. With both callbacks, you can pick any other metric to base the decision on, and the metric in question may also refer to the training set.
Here, we see early stopping happening after 111 epochs:
```
Epoch 1/200
Train metrics: Loss: 2.5803 - MAE: 1.2547
Valid metrics: Loss: 3.3763 - MAE: 1.4232
Epoch 2/200
Train metrics: Loss: 2.4767 - MAE: 1.229
Valid metrics: Loss: 3.2334 - MAE: 1.3909
...
...
Epoch 110/200
Train metrics: Loss: 1.011 - MAE: 0.8034
Valid metrics: Loss: 1.1673 - MAE: 0.8578
Epoch 111/200
Train metrics: Loss: 1.0108 - MAE: 0.8032
Valid metrics: Loss: 1.167 - MAE: 0.8578
Early stopping at epoch 111 of 200
```
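A learning rate scheduler callback is configured analogously. Here is a minimal sketch, assuming `torch`'s `lr_step()` scheduler; the concrete parameter values are illustrative only:

```{r}
fitted <- net %>%
  setup(
    loss = nn_mse_loss(),
    optimizer = optim_adam,
    metrics = list(luz_metric_mae())
  ) %>%
  set_hparams(
    d_in = d_in,
    d_hidden = d_hidden, d_out = d_out
  ) %>%
  fit(
    train_dl,
    epochs = 200,
    valid_data = valid_dl,
    callbacks = list(
      # multiply the learning rate by 0.5 every 50 epochs
      luz_callback_lr_scheduler(lr_step, step_size = 50, gamma = 0.5)
    )
  )
```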
### How `luz` helps with devices\index{device handling (luz)}
Finally, let's quickly mention how `luz` helps with device placement. In a typical environment, the devices in question are the CPU and, if available, a GPU. For training, data and model weights need to be located on the same device. This introduces complexity, and -- at the very least -- necessitates additional code to keep all pieces in sync.
With `luz`, related actions happen transparently to the user. Let's take the prediction step from above:
```{r}
fitted %>% predict(test_dl)
```
If this code is executed on a machine that has a GPU, `luz` will have detected that, and the model's weight tensors will already have been moved there. For the above call to `predict()`, here is what happened under the hood (a quick conversion example follows the list):
- `luz` put the model in evaluation mode, making sure that weights are not updated.
- `luz` moved the test data to the GPU, batch by batch, and obtained model predictions.
- These predictions were then moved back to the CPU, in anticipation of the caller wanting to process them further with R. (Conversion functions like `as.numeric()`, `as.matrix()` etc. can only act on CPU-resident tensors.)
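As a consequence, R-side post-processing just works. A quick sketch:

```{r}
preds <- fitted %>% predict(test_dl)

# the predictions arrive as a CPU tensor,
# so conversion to an R matrix is immediate
head(as.matrix(preds))
```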
In the appendix below, you'll find a complete walk-through of how to implement the train-validate-test workflow by hand. You'll likely find it a lot more complex than what we did above -- and it doesn't even bring into play metrics, or any of the functionality afforded by `luz` callbacks.
In the next chapter, we discuss essential ingredients of modern deep learning we haven't yet touched upon; after that, we look at architectures designed for specific tasks and domains.
## Appendix: A train-validate-test workflow implemented by hand
For clarity, we repeat here the two things that do *not* depend on whether you're using `luz` or not: `dataloader()` preparation and model definition.
```{r}
# input dimensionality (number of input features)
d_in <- 3
# number of observations in training set
n <- 1000

x <- torch_randn(n, d_in)
coefs <- c(0.2, -1.3, -0.5)
y <- x$matmul(coefs)$unsqueeze(2) + torch_randn(n, 1)

ds <- tensor_dataset(x, y)
dl <- dataloader(ds, batch_size = 100, shuffle = TRUE)

train_ids <- sample(1:length(ds), size = 0.6 * length(ds))
valid_ids <- sample(
  setdiff(1:length(ds), train_ids),
  size = 0.2 * length(ds)
)
test_ids <- setdiff(1:length(ds), union(train_ids, valid_ids))

train_ds <- dataset_subset(ds, indices = train_ids)
valid_ds <- dataset_subset(ds, indices = valid_ids)
test_ds <- dataset_subset(ds, indices = test_ids)

train_dl <- dataloader(train_ds, batch_size = 100, shuffle = TRUE)
valid_dl <- dataloader(valid_ds, batch_size = 100)
test_dl <- dataloader(test_ds, batch_size = 100)

# dimensionality of hidden layer
d_hidden <- 32
# output dimensionality (number of predicted features)
d_out <- 1

net <- nn_module(
  initialize = function(d_in, d_hidden, d_out) {
    self$net <- nn_sequential(
      nn_linear(d_in, d_hidden),
      nn_relu(),
      nn_linear(d_hidden, d_out)
    )
  },
  forward = function(x) {
    self$net(x)
  }
)
```
Recall that with `luz`, all that separates you from watching training and validation losses evolve is a snippet like this:
```{r}
fitted <- net %>%
  setup(
    loss = nn_mse_loss(),
    optimizer = optim_adam
  ) %>%
  set_hparams(
    d_in = d_in,
    d_hidden = d_hidden, d_out = d_out
  ) %>%
  fit(train_dl, epochs = 200, valid_data = valid_dl)
```
Without `luz`, however, things to be taken care of fall into three distinct categories.
First, instantiate the network and, if CUDA is available, move its weights to the GPU.
```{r}
device <- torch_device(
  if (cuda_is_available()) "cuda" else "cpu"
)
model <- net(d_in = d_in, d_hidden = d_hidden, d_out = d_out)
model <- model$to(device = device)
```
Second, create an optimizer.
```{r}
optimizer <- optim_adam(model$parameters)
```
And third, the biggest chunk: In each epoch, iterate over training batches as well as validation batches, performing backpropagation when working on the former, while just passively reporting losses when processing the latter.
For clarity, we pack training logic and validation logic each into their own functions. `train_batch()` and `valid_batch()` will be called from inside loops over the respective batches. Those loops, in turn, will be executed for every epoch.
While `train_batch()` and `valid_batch()`, per se, trigger the usual actions in the usual order, note the device placement calls: For the model to be able to take in the data, they have to live on the same device. Then, for mean-squared-error computation to be possible, the target tensors need to live there as well.
```{r}
train_batch <- function(b) {
  optimizer$zero_grad()
  output <- model(b[[1]]$to(device = device))
  target <- b[[2]]$to(device = device)

  loss <- nnf_mse_loss(output, target)
  loss$backward()
  optimizer$step()

  loss$item()
}

valid_batch <- function(b) {
  output <- model(b[[1]]$to(device = device))
  target <- b[[2]]$to(device = device)

  loss <- nnf_mse_loss(output, target)
  loss$item()
}
```
The loop over epochs contains two lines that deserve special attention: `model$train()` and `model$eval()`. The former instructs `torch` to put the model in training mode; the latter does the opposite. With the simple model we're using here, it wouldn't be a problem if you forgot those calls; later, however, when we use regularization layers like `nn_dropout()` and `nn_batch_norm2d()`, calling these methods in the right places will be essential. This is because these layers behave differently during training and evaluation.
```{r}
num_epochs <- 200

for (epoch in 1:num_epochs) {
  model$train()
  train_loss <- c()

  # use coro::loop() for stability and performance
  coro::loop(for (b in train_dl) {
    loss <- train_batch(b)
    train_loss <- c(train_loss, loss)
  })

  cat(sprintf(
    "\nEpoch %d, training: loss: %3.5f \n",
    epoch, mean(train_loss)
  ))

  model$eval()
  valid_loss <- c()

  # disable gradient tracking to reduce memory usage
  with_no_grad({
    coro::loop(for (b in valid_dl) {
      loss <- valid_batch(b)
      valid_loss <- c(valid_loss, loss)
    })
  })

  cat(sprintf(
    "\nEpoch %d, validation: loss: %3.5f \n",
    epoch, mean(valid_loss)
  ))
}
```
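One piece is still missing for the "test" part of the workflow: evaluating the trained model on the test set. Since no gradients are needed here either, we can reuse `valid_batch()`; a minimal sketch:

```{r}
# finally, compute the loss on the held-out test set
model$eval()
test_loss <- c()

with_no_grad({
  coro::loop(for (b in test_dl) {
    loss <- valid_batch(b)
    test_loss <- c(test_loss, loss)
  })
})

cat(sprintf("Test: loss: %3.5f \n", mean(test_loss)))
```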
```
This completes our walk-through of manual training, and should have made more concrete my assertion that using `luz` significantly reduces the potential for casual (e.g., copy-paste) errors.