---
title: "05_5_Train_Test"
format: html
editor: source
params:
  print_sol: false
toc: true
toc-location: left
toc-depth: 6
self-contained: true
---
# References
NOTE: the following material is not fully original; it extends reference 1 with comments and other sources to improve the student's understanding:
1. Hyndman, R. J., & Athanasopoulos, G. (2021). *Forecasting: principles and practice* (3rd ed.). [Link](https://otexts.com/fpp3/)
2. tidyverts.org. (n.d.). *fable package documentation*. Retrieved from:
    - [Link 1](https://fable.tidyverts.org/index.html)
    - [Link 2](https://fable.tidyverts.org/articles/fable.html)
3. Fischer, J. (n.d.). Was dem MAPE fälschlicherweise vorgeworfen wird: seine wahren Schwächen und bessere Alternativen. *STATWORX*. Retrieved from [Link](https://www.statworx.com/content-hub/blog/was-dem-mape-falschlicherweise-vorgeworfen-wird-seine-wahren-schwachen-und-bessere-alternativen/)
4. Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy. *International Journal of Forecasting, 22*(4), 679–688.
# Packages
```{r, error=FALSE, warning=FALSE, message = FALSE}
library(fpp3)
```
# Train and Test Sets
The size of the residuals is not a reliable indication of how large true forecast errors are likely to be. **The accuracy of forecasts can only be determined by considering how well a model performs on new data that were not used when fitting the model.**
However, **the problem is that we have no data on which to test forecasts of the future, because those observations do not exist yet.**
**Common practice**: split the available data into two portions, **train** and **test**:
- **train data**: used to fit the model (to estimate the model parameters)
- **test data**: used to evaluate model accuracy.
This is shown in the graphs below:
![](./figs/train_test_fctask.png){width="100%"}
![](./figs/13_train_test.png){width="100%"}
**Guidelines for choosing the test data size**
- Typically **around 20% of the total sample**
- **At least as large as the maximum forecast horizon required**
- **Ideally much longer than the required forecast horizon (at least 3 times as long)**
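
The guidelines above can be turned into a quick calculation. The following is only a sketch: `suggest_test_size()` is a hypothetical helper written for this note (it is not part of `fpp3`), and the values `n_obs = 218` and `horizon = 8` are illustrative assumptions.

```{r}
# Hypothetical helper (not part of fpp3): summarizes the rules of thumb above
suggest_test_size <- function(n_obs, horizon, share = 0.2) {
  by_share <- floor(n_obs * share) # roughly 20% of the sample
  minimum  <- horizon              # never shorter than the forecast horizon
  ideal    <- 3 * horizon          # at least 3 times the horizon, if affordable
  c(by_share = by_share, minimum = minimum, ideal = ideal,
    suggested = max(by_share, minimum))
}

# Illustrative values: 218 observations, forecasts needed 8 steps ahead
size_rules <- suggest_test_size(n_obs = 218, horizon = 8)
size_rules
```

Taking the larger of the 20% share and the horizon is one possible way to combine the rules; other compromises are equally defensible when the series is short.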
**Important considerations**
- A **model that fits the training data well will not necessarily forecast well.**
- A perfect fit to the training data can always be obtained with enough parameters (**over-fitting**)
- Over-fitting a model to data is just as bad as not identifying a pattern in the data.
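
To make the over-fitting point concrete, here is a minimal sketch on simulated data (the data, the split and the polynomial degree are all assumptions chosen for illustration, and base R `lm()` stands in for a forecasting model). A flexible model always fits the training data at least as well as a simpler nested one, yet that says nothing about how it behaves on the test data.

```{r}
set.seed(1)
x <- 1:30
y <- 0.5 * x + rnorm(30, sd = 2) # simulated linear trend plus noise

# First 24 points for training, last 6 held out for testing
train_sim <- data.frame(x = x[1:24], y = y[1:24])
test_sim  <- data.frame(x = x[25:30], y = y[25:30])

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

fit_simple  <- lm(y ~ x, data = train_sim)           # 2 parameters
fit_complex <- lm(y ~ poly(x, 10), data = train_sim) # 11 parameters

errors <- c(
  train_simple  = rmse(train_sim$y, fitted(fit_simple)),
  train_complex = rmse(train_sim$y, fitted(fit_complex)),
  test_simple   = rmse(test_sim$y, predict(fit_simple, test_sim)),
  test_complex  = rmse(test_sim$y, predict(fit_complex, test_sim))
)
errors
```

Compare the two `train_*` values with the two `test_*` values: the training error of the complex model cannot exceed that of the simple one, while its test error is free to be much larger.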
## Functions to subset a time series
### `filter()`
`filter()` can be used to extract a portion of a time series.
* It is **important to convert the time index to something that can be compared against the value you are filtering by**. In the example below we convert the `yearquarter` index to a year with `year()` before comparing it against 1995.
```{r}
# Data from 1995 onwards
aus_production %>% filter(year(Quarter) >= 1995)
```
More examples of `filter()` appear in the exercises below; solutions will be provided in due course.
### `slice()`
`slice()`: allows extraction by indices.
```{r}
# Extract the first 12 observations (1 year of monthly data) of each
# combination of State and Industry
aus_retail %>%
  group_by(State, Industry) %>%
  slice(1:12) %>%
  ungroup() # So that subsequent operations are not performed group-wise
```
`slice()` can be used in combination with **`n()`, which returns the number of rows** (of the current group, if the data are grouped).
* In the following examples **the parentheses play a major role because the `:` operator takes precedence over the `-` operator: `1:n()-12` is read as `(1:n())-12`, so we need parentheses to reference the intended indices.**
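
The precedence rule can be checked in plain R, outside of `slice()`. In this small sketch `n` is an ordinary variable standing in for the value that `n()` would return (an assumption made for illustration):

```{r}
n <- 20 # stand-in for the value returned by n()

without_parens <- 1:n - 12   # parsed as (1:n) - 12: the values -11, -10, ..., 8
with_parens    <- 1:(n - 12) # the intended indices 1, 2, ..., 8

without_parens
with_parens
```

Only the second expression gives valid row positions for "everything but the last 12 rows".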
#### Slice example 1
```{r}
# Extract everything but the last 12 observations of each series
# (each combination of State, Industry)
aus_retail %>%
  group_by(State, Industry) %>%
  slice(1:(n()-12)) %>% # We need to enclose n()-12 in parentheses.
  ungroup() # So that subsequent operations are not performed group-wise
```
#### Slice example 2
The **`-` sign below** indicates that we are interested in **precisely what is not covered by (1:(n()-12))**.
```{r}
# Extract just the last 12 observations of each series
# (each combination of State, Industry)
# Recall that the - operator can be used here to indicate "everything but".
aus_retail %>%
  group_by(State, Industry) %>%
  slice(-(1:(n()-12))) %>% # We need to enclose n()-12 in parentheses.
  ungroup() # So that subsequent operations are not performed group-wise
```
#### Slice example 3
```{r}
# ALTERNATIVE: Extract just the last 12 observations of each series
# (each combination of State, Industry)
aus_retail %>%
  group_by(State, Industry) %>%
  slice((n()-11):n()) %>% # We need to enclose n()-11 in parentheses.
  ungroup() # So that subsequent operations are not performed group-wise
```
#### Slice example 4
The next example is a bit tricky and relies on the operator precedence described above. First look at the following decreasing sequence, which starts at 19 and ends at 0 in steps of 1:
```{r}
19:0
```
Because `:` is evaluated before `-`, subtracting that sequence from an integer does not change the number of elements; it only shifts (and here reverses) the values. For example, `200 - 19:0` yields the 20 integers from 181 to 200:
```{r}
200 - 19:0
```
Inside `dplyr` verbs, `n()` returns the number of rows as an integer, so the same trick selects only the last 20 observations in the dataset:
```{r}
# Extract the latest 20 observations (20 quarters -> 5 years)
# n()-19:0 is equivalent to 218-19:0 (218 is the number of rows in
# aus_production).
aus_production %>%
  slice(n()-19:0)
```
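
As a cross-check, `dplyr` also provides `slice_tail()`, which selects the last rows directly. The sketch below compares both approaches on a toy data frame (`toy` is made up for illustration; whether you prefer `slice_tail()` over the `n()` arithmetic above is a matter of taste, and its behaviour on other data classes should be checked before relying on it):

```{r}
library(dplyr)

toy <- data.frame(id = 1:10) # toy data, not one of the fpp3 datasets

last_three_a <- toy %>% slice(n() - 2:0)   # 10 - c(2, 1, 0) -> rows 8, 9, 10
last_three_b <- toy %>% slice_tail(n = 3)  # same rows, no index arithmetic

last_three_a
last_three_b
```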
## Exercises train-test datasets
1. Create a training and a test set for household wealth (dataset `hh_budget`) by withholding the last four years as a test set. Use `filter()` and a logical condition.
```{r, include = params$print_sol}
test_years = 4
train <- hh_budget %>%
filter(Year <= max(Year) - test_years)
test <- hh_budget %>%
filter(Year > max(Year) - test_years)
# Check dimensions
(nrow(train) + nrow(test)) == nrow(hh_budget)
# NOTE: other functions for dimension checking:
# nrow(), ncol(), dim(), length()
```
2. Consider the dataset `global_economy` and in particular the data related to `Argentina`
* 2.1: Determine how many datapoints related to Argentina there are in the dataset (how many rows) and store that number in a variable. Also extract the series corresponding to Argentinian exports. Hint: you can use the function `nrow()` (in an exam context, no hint will be given).
```{r, include = params$print_sol}
arg_economy <- global_economy %>% filter(Country == "Argentina") %>% select(Year, Exports)
nobs <- arg_economy %>% nrow()
nobs
```
* 2.2: Create a training dataset that consists of 80% of the observations. If 80% of the observations is not an integer, round it down to the nearest integer. Store the remaining observations as the test dataset.
```{r, include = params$print_sol}
split_row = as.integer(nobs * 0.8)
train <- arg_economy %>% slice(1:split_row)
test <- arg_economy %>% slice(-(1:split_row)) # Select everything but
# the train indexes
# Check dimensions
nrow(train)
nrow(test)
nrow(arg_economy) == nrow(train) + nrow(test)
```
3. Create a training set for Australian takeaway food turnover (dataset `aus_retail`) by withholding the last four years as a test set.
* 3.1: Using `filter()`, extract the values corresponding to the industry "Takeaway food services".
```{r, include = params$print_sol}
food_services <- aus_retail %>% filter(Industry == "Takeaway food services")
food_services
```
* 3.2: Using `sum()` and `summarize()`, obtain the aggregate turnover of this industry per month over all the states.
```{r, include = params$print_sol}
# This step is not required. It just shows the problem: several territories
# share the same month
food_services %>% filter(Month == yearmonth("1982 Apr"))
# Summarize Turnover data over all territories:
food_turnover <-
  food_services %>%
  summarize(
    Turnover = sum(Turnover) # Adds turnover over all territories
  )
food_turnover
```
* 3.3: Create your training and test datasets
```{r, include = params$print_sol}
# Four years = 4 * 12 months
test_length = 4*12
split_row = nrow(food_turnover)-test_length
# Alternative 1:
train <- food_turnover %>% filter(Month <= max(Month)-test_length)
test <- food_turnover %>% filter(Month > max(Month)-test_length)
# Check dimensions
nrow(food_turnover) == (nrow(train) + nrow(test))
nrow(train)
nrow(test)
# Alternative 2:
train <- food_turnover %>% slice(1:split_row)
# select everything but (1:split_row)
test <- food_turnover %>% slice(-(1:split_row))
# Check dimensions
nrow(food_turnover) == (nrow(train) + nrow(test))
nrow(train)
nrow(test)
```