05_5_Train_Test.qmd

---
title: "05_5_Train_Test"
format: html
editor: source
params:
  print_sol: false
toc: true
toc-location: left
toc-depth: 6
self-contained: true
---

# References

NOTE: the following material not fully original, but rather an extension of reference 1 with comments and other sources to improve the student's understanding:

1.  Hyndman, R. J., & Athanasopoulos, G. (2021). *Forecasting: principles and practice* (3rd ed.). [Link](https://otexts.com/fpp3/)

2.  tidyverts.org. (n.d.). *fable package documentation*. Retrieved from:

-   [Link 1](https://fable.tidyverts.org/index.html)
-   [Link 2](https://fable.tidyverts.org/articles/fable.html)

3.  Fischer, J. (n.d.). Was dem MAPE fälschlicherweise vorgeworfen wird: seine wahren Schwächen und bessere Alternativen. *STATWORX*. Retrieved from [Link](https://www.statworx.com/content-hub/blog/was-dem-mape-falschlicherweise-vorgeworfen-wird-seine-wahren-schwachen-und-bessere-alternativen/)

4.  Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy. *International Journal of Forecasting, 22*(4), 679–688.


# Packages

```{r, error=FALSE, warning=FALSE, message = FALSE}
library(fpp3)
```

# Train and Test Sets

The size of the residuals is not a reliable indication of how large true forecast errors are likely to be. **The accuracy of forecasts can only be determined by considering how well a model performs on new data that were not used when fitting the model.**

However, **the problem is that we do not have data on which to test the forecasts, since these are into the future.**

**Common Practice**: separate available data in two portions - **train** and **test**

-   **train data**: used to fit the model (to estimate the model parameters)
-   **test data**: used to evaluate model accuracy.

This is shown in the graphs below:

![](./figs/train_test_fctask.png){width="100%"}

![](./figs/13_train_test.png){width="100%"}

**Indicatons to choose test data size**

-   Typically **around 20% of the total sample**
-   **At least as large as the maximum forecast horizon required**
-   **Ideally much longer as the forecast horizon required (3 times min)**

**Important considerations**

-   A **model that fits the training data well will not necessarily forecast well.**
-   A perfect fit to the training data can always be obtained with enough parameters (**over-fitting**)
    -   Over-fitting a model to data is just as bad as not identifying a pattern in the data.

## Functions to subset a time series

### `filter()`

`filter()` can be used to filter out a portion of a time series.

* It is **important to convert your time index to something that can be compared against the object you want to filter by**. In the example below we need to convert the `yearquarter` index to a `year` before comparing against 1995.

```{r}
# Data from 1995 onwards
aus_production %>% filter(year(Quarter) >= 1995)
```

More examples on the use of filter in the exercises below, that will be provided with solutions in due course.

### `slice()`

`slice()`: allows extraction by indices.

```{r}
# Extact the first 12 points (1 year) of data of each combination of State, Industry
aus_retail %>%
  group_by(State, Industry)  %>%
  slice(1:12) %>% 
  ungroup() # So that subsequent operations are not performed group-wise
```

Slice can be used in combination with **`n()` to refer to the maximum number of rows.** 

* In the following examples **the parenthesis play a major role because the `:` operator takes precedence over the `-` operator, so in order to adequately reference the indexes we need to use the parenthesis**

#### Slice example 1

```{r}
# Extract everything put the last 12 observations of each series 
# (each combination of State, Industry)
aus_retail %>%
  group_by(State, Industry) %>%
  slice(1:(n()-12)) %>%  # We need to enclose (n()-12) with parenthesis.
  ungroup() # So that subsequent operations are not performed group-wise
```

#### Slice example 2

The **`-` sign below** indicates that we are interested in **precisely what is not covered by (1:(n()-12))**.

```{r}
# Extract just last 12 observations of each series 
# (each combination of State, Industry)
# Recall thath the - operator can be used here to indecate "everything but".
aus_retail %>%
  group_by(State, Industry) %>%
  slice(-(1:(n()-12))) %>% # We need to enclose (n()-12) with parenthesis.
  ungroup() # So that subsequent operations are not performed group-wise
```

#### Slice example 3

```{r}
# ALTERNATIVE: Extract just last 12 observations of each series
# (each combination of State, Industry)
aus_retail %>%
  group_by(State, Industry) %>%
  slice((n()-11):n()) %>% # We need to enclose (n()-11) with parenthesis.
  ungroup()
```

#### Slice example 4

The next one is a bit tricky and involves the operation precedence issue. First look at the following decreasing sequence that starts from 19 and ends at 0, in steps of 1.

```{r}
19:0
```

Adding or subtracting an integer to that sequence does not alter the number of elements, rather, it only shifts the sequence accordingly

```{r}
200 - 19:0
```

Using this, we can use the integer returned by `n()` when used within `dplyr` verbs (the number of rows) to obtain only the last 20 observations in the dataset:

```{r}
# Extract the latest 20 observations (20 quarters -> 5 years)
aus_production %>%
  slice(n()-19:0) # n()-19:0 is equivalent to 218-19:0 (218 is the number of
                  # rows in aus_production).
```

## Exercises train-test datasets

1.  Create a training and a test set for household wealth (dataset `hh_budget`) by withholding the last four years as a test set. Use `filter()` and a logical condition

```{r, include = params$print_sol}
test_years = 4

train <- hh_budget %>%
  filter(Year <= max(Year) - test_years)

test <- hh_budget %>%
  filter(Year > max(Year) - test_years)

# Check dimensions
(nrow(train) + nrow(test)) == nrow(hh_budget)

# NOTE: other functions for dimension checking:
# nrow(), ncol(), dim(), length()
```

2.  Consider the dataset `global_economy` and in particular the data related to `Argentina`

* 2.1: Determine how many datapoints related to Argentina there are in the dataset (how many rows) and store that number in a variable. Also extract the series corresponding to Argentinian exports. Hint: you can use the function `nrow()` (in an exam context, no hint will be given).

```{r, include = params$print_sol}
arg_economy <-  global_economy %>% filter(Country == "Argentina") %>% select(Year, Exports) 
nobs <- arg_economy %>% nrow()
nobs
```

* 2.2: Create a training dataset that consists on 80% of the observations. Since 80% of the observations is not an integer number, round it down to the nearest integer. The rest of the observations are to be stored as test dataset.

```{r, include = params$print_sol}
split_row = as.integer(nobs * 0.8)
train <- arg_economy %>% slice(1:split_row)
test <- arg_economy %>% slice(-(1:split_row)) # Select everything but 
                                              # the train indexes

# Check dimensions
nrow(train)
nrow(test)
nrow(arg_economy) == nrow(train) + nrow(test)
```

3.  Create a training set for Australian takeaway food turnover (aus_retail) by withholding the last four years as a test set.

* 3.1: using filter() extract values corresponding to the industry "Takeaway food services"

```{r, include = params$print_sol}
food_services <- aus_retail %>% filter(Industry == "Takeaway food services")
food_services
```

* 3.2: Using `sum()` and `summarize()` obtain the aggregates turnover of this industry per month over all the states

```{r, include = params$print_sol}
# This is unnecessary. Just to show the problem: multiple territories with the same month
food_services %>% filter(Month == yearmonth("1982 Apr"))

# Summarize Turnover data over all territories:
food_turnover <- 
  food_services %>% 
        summarize(
          Turnover = sum(Turnover) # Adds turnover over all territories
        )

food_turnover
```

* 3.3: Create your training and test datasets

```{r, include = params$print_sol}
# Four years = 4 * 12 months
test_length = 4*12
split_row = nrow(food_turnover)-test_length

# Alternative 1:
train <- food_turnover %>% filter(Month <= max(Month)-test_length)
test <- food_turnover %>% filter(Month > max(Month)-test_length)

# Check dimensions
nrow(food_turnover) == (nrow(train) + nrow(test))
nrow(train)
nrow(test)

# Alternative 2:
train <- food_turnover %>% slice(1:split_row)

# select everything but (1:split_row)
test <- food_turnover %>% slice(-(1:split_row)) 

# Check dimensions
nrow(food_turnover) == (nrow(train) + nrow(test))
nrow(train)
nrow(test)
```