-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path03_A_TSGraphs_Timeplots_Scatterplots_Seasonalplots.qmd
1253 lines (904 loc) · 42.5 KB
/
03_A_TSGraphs_Timeplots_Scatterplots_Seasonalplots.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "03_A_TSGraphs_Timeplots_Scatterplots_Seasonalplots"
editor: source
params:
print_sol: true
print_sol_int: true
hidden_notes: true
toc: true
toc-location: left
toc-depth: 6
self-contained: true
---
```{r, echo=FALSE, include=FALSE}
library(kableExtra)
```
# Libraries
```{r}
library(fpp3)
library(fma)
library(readr)
```
# References
This is not original material but a compilation of the following sources with some additional examples and clarifications introduced by the professor:
1. Hyndman, R.J., & Athanasopoulos, G., 2021. *Forecasting: principles and practice*. 3rd edition.
2. Additional material provided by the book authors
3. The Datasaurus Dozen - [Link](https://blog.revolutionanalytics.com/2017/05/the-datasaurus-dozen.html)
4. Shumway, R.; Stoffer, D; *Time Series Analysis and Its Applications with R examples*.
# Framework and packages:
We will work with the fpp3 package, developed by Prof. Rob Hyndman and his team. They are actually one of the leading contributing groups to the time series community, so do not forget their name.
This package extends the tidyverse framework to time series. For that it requires to load some packages under the hood (see output of the following cell):
```{r}
library(fpp3)
```
Additionally, for this particular session we will use the packages `GGally`, `fma` and `patchwork`. Install them if you do not have them already. **Remember that the command `install.packages` is to be run only once, not every time you want to use the package**.
```{r, eval=FALSE}
install.packages("GGally", dependencies = TRUE) # to generate a scatterplot matrix
install.packages("fma", dependencies = TRUE) # to load the Us treasury bills dataset
install.packages("patchwork", dependencies = TRUE) # Used to manage the relative location of ggplots
```
Once you have installed the packages with the `install.packages` command, you may use them simply by loadeing them with the `library` command. You have to do this every time you restart your R session.
```{r, error=FALSE, warning=FALSE, message = FALSE}
library(patchwork) # Used to manage the relative location of ggplots
library(GGally)
library(fma) # to load the Us treasury bills dataset
```
# Time plots and basic time series patterns
```{r, echo=FALSE}
test <- tibble(
Pattern = c("Trend", "Seasonal", "Cyclic"),
Description = c("Long-term increase OR decrease in the data", "Periodic pattern due to the calendar (quarter, month, day of the week...)", "Pattern where data exhibits rises and falls that are NOT FIXED IN PERIOD (typically duration of at least 2 years)")
)
knitr::kable(test,
caption="<center><strong>Time series patterns</strong></center>",
format="html"
) %>%
kable_styling(full_width = FALSE)
```
```{r, echo=FALSE}
test <- tibble(
Seasonal = c("Seasonal pattern of constant length", "Shorter than cyclic pattern", "Magnitude less variable than cyclic pattern"),
Cyclic = c("Cyclic patern of variable length", "Longer than periodic pattern", "Magnitude more variable than periodic pattern")
)
knitr::kable(test,
caption="<center><strong>Seasonal or cyclic</strong></center>",
format="html"
) %>%
kable_styling(full_width = FALSE)
```
Let us look at some examples:
## Example 1 - trended and seasonal data
### `scale_x_yearquarter`: adjusting the grid when the time index follows a ``yearquarter` object
Inspecting the dataset aus_production we note two things:
1. As all the datasets loaded along the `fpp3` package, it comes in `tsibble` format.
2. The column signaling time (the index of the tsibble) is of type `yearquarter``
```{r}
aus_production
```
**Because the index of the tsibble is of type yearquarter**, we will correspondingly use the function **scale_x_yearquarter** when we want to scale the x-axis of the timeplot below
```{r}
aus_production %>%
# Filter data to show appropriate timeframe
filter((year(Quarter) >= 1980 & year(Quarter)<= 2000)) %>%
# autoplot: generates a simple timeplot, yet to be adjusted with
# further commands.
autoplot(Electricity) +
# Scale the x axis adequately
# scale_x_yearquarter used because the index is a yearquarter column
scale_x_yearquarter(date_breaks = "1 year",
minor_breaks = "1 year") +
# Flip x-labels by 90 degrees
theme(axis.text.x = element_text(angle = 90))
```
Strong trend with a change in the slope of the trend around 1990.
### `autoplot()` as a wrapper around ggplot2 objects
The function `autoplot()` of the library `fpp3` is nothing but a wrapper around ggplot2 objects. Depending on the object fed to autoplot, it will have one or other behavior, combining different ggplot objects.
In the example above, `autoplot()` is fed a time series in tsibble format. In this situation, it acts as a wrapper around the ggplot command `geom_line()`. It is just that using `autoplot()` is much more convenient and it is a good idea to have such a function in a time series dedicated library such as `fpp3`. The code below uses explicitly all ggplot commands and attains the exact same output as using autoplot.
```{r}
aus_production %>%
filter((year(Quarter) >= 1980 & year(Quarter)<= 2000)) %>%
# The two lines below are equivalent to autoplot(Electricity)
ggplot(aes(x = Quarter, y = Electricity)) +
geom_line() +
# Scale the x axis adequately
# scale_x_yearquarter used because the index is a yearquarter column
scale_x_yearquarter(date_breaks = "1 year",
minor_breaks = "1 year") +
# Flip x-labels by 90 degrees
theme(axis.text.x = element_text(angle = 90))
```
## Example 2 - **exercise** - trended and seasonal data with recessions
Scale the x-grid of the following graph so that the end of each year is clearly visible.
**HINT**: remember you need to **inspect the dataset aus_production** and check **the type of its index column (time column) to pick the right function to scale the x axis**.
```{r}
aus_production %>%
autoplot(Bricks)
```
```{r, include=params$print_sol}
autoplot(aus_production, Bricks) +
scale_x_yearquarter(date_breaks = "4 year",
minor_breaks = "1 year") +
# Flip x-labels by 90 degrees
theme(axis.text.x = element_text(angle = 90))
# Trended with recessions in 1983 and 1991
```
## Example 3 - trend embedded within a cycle
US treasury bill contracts on the Chicago market for 100 consecutive trading days in 1981.
See [Treasury Bills](https://www.investopedia.com/terms/t/treasurybill.asp).
The object ustreas is loaded along with the `fma` library, which we loaded at the beginning of the notebook.
```{r}
ustreas
```
`ustreas` is a `ts` object (time-series). This is another possible format for time series in R, but remember we are going to use `tsibles` to allow for seamless functionality with the fpp3 library. You can convert the `ts` to a `tsibble` using `as_tsibble` in the following manner:
```{r}
ustreas_tsibble <- as_tsibble(ustreas)
ustreas_tsibble
```
```{r}
autoplot(ustreas_tsibble)
```
```{r}
```
Looks like a downward trend, but it is part of a much longer cycle (not shown). You do not want to make forecasts with this data too long into the future.
### `scale_x_continuous`: adjusting the grid when the index is numeric.
Examining the tsibble `ustreas_tsibble` reveals that **the index is numeric (trading day)** ("dbl", meaning double precision floating point).
```{r}
ustreas_tsibble %>% head(5)
```
If the index is of type numeric we need to use the function `scale_x_continuous` to properly adjust the x-axis. In addition to this, we need to generate a specific sequence signaling the ticks. For example, if we want major ticks every 10 days and minor ticks every 5 days:
```{r}
# Sequence for major ticks, from 0 to the maximum index in the data
major_ticks_seq = seq(0, max(ustreas_tsibble$index), 10)
major_ticks_seq
```
```{r}
# Sequence for minor ticks, from 0 to the maximum index in the data
minor_ticks_seq = seq(0, max(ustreas_tsibble$index), 5)
minor_ticks_seq
```
```{r}
ustreas_tsibble %>%
autoplot() +
scale_x_continuous(breaks = major_ticks_seq,
minor_breaks = minor_ticks_seq)
```
```{r}
```
## Example 4 - **exercise**. Yearly data and cyclic behavior
**ANUAL DATA** is not susceptible to within-year calendar periods (so called "seasonal periods"). It is sampled at the same point every year. Therefore, within-year calendar periods do not affect yearly data.
In other words: **there are no seasonal patterns in yearly data**. If anything **there could be cyclic behavior**.
Hudson Bay Company trading records for Snowshoe Hare and Canadian Lynx furs from 1845 to 1935. This data contains trade records for all areas of the company. We are going analyse the number of Canadian Lynx pelts traded.
```{r}
pelt %>% head(5)
```
```{r}
lynx <- pelt %>% select(Year, Lynx)
lynx %>% head(5)
```
### Exercise
Plot the time series ensuring we have major ticks every 10 years and minor ticks every 5 years.
**Hint**: check the type of the index sequence to select the correct variation of the family of functions `scale_x_...()` you need to properly scale the x-axis.
```{r, include=params$print_sol}
major_ticks_seq = seq(0, max(lynx$Year), 10)
minor_ticks_seq = seq(0, max(lynx$Year), 5)
lynx %>%
autoplot() +
scale_x_continuous(breaks = major_ticks_seq,
minor_breaks = minor_ticks_seq) +
xlab("Year") +
ylab("Number of lynx trapped")
```
```{r, eval=FALSE, include = params$print_sol}
Number of lynx trapped anualy in the Hudson bay reach
* BECAUSE THIS IS ANUAL DATA, IT CANNOT BE SEASONAL
* Population of lynx raises when there is plenty of food
+ When food supply is low -> population plummets
+ Surviving lynx have excess food -> population raises
```
## Example 5 - exercise - Cyclic + seasonal behavior
Monthly sales of new one-family houses sold in the USA since 1973.
The ts object `hsales` is loaded along the fma library
```{r}
hsales_tsibble <- hsales %>% as_tsibble()
```
**Create a timeplot that has major ticks every 5 years and minor ticks every year**
**Hint**: check the type of the index sequence to select the correct variation of the family of functions `scale_x_...()` you need to properly scale the x-axis. **The one you need here has not been used in previous examples, but it is consistent with the type of the index**.
```{r, include = params$print_sol}
hsales_tsibble %>%
autoplot() +
scale_x_yearmonth(date_breaks = "5 year",
minor_breaks = "1 year")
```
## Example 6 - multiple seasonality
Half-hourly electricity demand in England and Wales from Monday 5 June 2000 to Sunday 27 August 2000. Discussed in Taylor (2003), and kindly provided by James W Taylor. Units: Megawatts
Again taylor is a time-series object (ts)
```{r}
class(taylor)
```
However, since the series is sampled every half-hour (that is, we are dealing with datetime info), usiung `as_tsibble()` directly does not adequately convert this time series to a tsibble:
```{r}
# Column index represented as a double! This is not what we want.
taylor %>% as_tsibble()
```
To adequately transform the taylor 'ts' object to a `tsibble`, we are going to define a sequence of date-time objects that match the length of the `taylor` object. Based on the description and assuming the measurements on the 5th of June 2000 start at midnight, the code below turns the `ts` object into a tsibble. It is not necessary that you fully understand this code, but it is explained in detail:
```{r}
# Define the start time based on information about the Taylor series
# Assume UTC time.
start_time <- as.POSIXct("2000-06-05 00:00:00", tz = "UTC")
# We specify by = 1800 because the lowest unit we specified in our datetime object
# start_time are seconds (hh:mm:ss). The samples in this ts object are taken every
# half an hour, that is, every 1800 seconds (30 * 60)
# We specify as well that the sequence needs to be of the same length as the taylor
# object
datetime_seq <- start_time + seq(from = 0, by = 1800, length.out = length(taylor))
# Convert the time series to a tsibble
taylor_tsibble <-
tibble(
index = datetime_seq,
value = as.numeric(taylor)
) %>%
as_tsibble(index = index)
# Check the resulting tsibble
taylor_tsibble
```
Now let us create a quick plot of this object:
```{r}
autoplot(taylor_tsibble)
```
- Two types of seasonality
- Daily pattern
- Weekly pattern
- If we had data over a few years: we would see annual patern
- If we had data over a few decades: we might see a longer cyclic pattern
To inspect these patterns more closely, I am going to filter for the first four weeks in the time series. To attain this, I am going to create an auxiliary week column and then filter based on that column:
```{r}
# Create auxiliary column signalling the yearweek corresponding to
# every point in the time seres.
taylor_tsibble <-
taylor_tsibble %>%
mutate(
week = yearweek(index)
)
# Extract first week and compute 4th week
week1 <- taylor_tsibble$week[1]
week4 <- week1 + 3
# Filter for first four weeks and store in new object
taylor_tsibble_4w <-
taylor_tsibble %>%
filter(
week >= week1,
week <= week4
)
# Depict result
taylor_tsibble_4w %>%
autoplot() +
# Used because index is date-time
scale_x_datetime(
breaks = "1 week",
minor_breaks = "1 day"
)
```
Here we can see the patterns more closely.
**IMPORTANT**: the demand of electricity does not always follow such an extremely regular pattern. There might be weekly seasonality, but it is not necessarily so regular. See the next example.
## Example 7 - exercise
We are going to use the dataset `vic_elec` from the `fpp3` library. It provides half-hourly electricity demand in for Victoria (Australia).
```{r}
vic_elec %>% head(5)
```
### `scale_x_datetime`: indices that are date-time objects
Let us filter the electricity consumption in Victoria `vic_elec` to include only datapoints of January 2012:
```{r}
elec_jan <- vic_elec %>%
mutate(month = yearmonth(Time)) %>%
filter(month == yearmonth("2012 Jan"))
```
**Create a time-plot of this data** using `scale_x_datetime` to establish major breaks of 1 week and minor breaks of 1 day.
```{r, include = params$print_sol}
elec_jan %>%
autoplot() +
scale_x_datetime(breaks = "1 week",
minor_breaks = "1 day")
```
```{r, eval=FALSE, include=params$print_sol}
We can spot a daily pattern but this time there is not a clear weekly pattern.
The data does not necessarily exhibit all the possible seasonal patterns.
```
## Example 8 - exercise
### `scale_x_date`: indices that are date objects
Let us begin by creating a daily time series out of the previos `elec_jan` (see example 7) using `index_by` to compute the total electricity consumption within each day
```{r}
elec_jan_daily <-
elec_jan %>%
index_by(date = as_date(Time)) %>%
summarize(
daily_demand = sum(Demand)
)
# Inspect result
elec_jan_daily
```
Now create a time plot with an x-axis resolution that has:
* major ticks every week
* minor ticks every day
```{r, include = params$print_sol}
elec_jan_daily %>%
autoplot() +
scale_x_date(breaks = "1 week",
minor_breaks = "1 day")
```
## Example 9 - Trend + seasonality (multiplicative)
PBS contains monthly medicare australia prescription data. We are going to select those regarding drugs of the category ATC2. For further info run `?PBS` in the console.
```{r}
PBS %>%
filter(ATC2 == "A10") %>%
select(Month, Concession, Type, Cost) %>%
summarise(TotalC = sum(Cost)) %>%
# Comment on this assignment operator
mutate(Cost = TotalC / 1e6) -> a10
```
```{r}
a10 %>% head(5)
```
```{r}
autoplot(a10, Cost) +
labs(y = "$ (millions)",
title = "Australian antidiabetic drug sales") +
scale_x_yearmonth(breaks = "1 year") +
theme(axis.text.x = element_text(angle = 90))
```
```{r}
```
# Scatter plots
## Two variables
The dataset `vic_elec`, which contains half-hourly data for the demand of electricity and average temperature in the state of Victoria (Australia).
```{r}
vic_elec
```
Let us depict a timeplot for the demand and a timeplot for the average temperature. The code snippet below requires you to load the library `GGally` for one statement to run (comment in the code):
```{r}
p1 <- vic_elec %>%
filter(year(Time) == 2014) %>%
autoplot(Demand) +
labs(y = "GW",
title = "Half-hourly electricity demand: Victoria")
p2 <- vic_elec %>%
filter(year(Time) == 2014) %>%
autoplot(Temperature) +
labs(
y = "Degrees Celsius",
title = "Half-hourly temperatures: Melbourne, Australia"
)
# Requires GGally. Places one graph above the other, ensuring proper
# matching of the x-axis.
p1 / p2
```
For every point in time, we may now depict the electricity demand and the corresponding average temperature, resulting in a scatterplot of these two variables instead of a timeplot. In this scatterplot we also add two trend lines:
- A linear trend line following ordinary least squares.
- A non-linear trend line following a loess model
```{r}
vic_elec %>%
filter(year(Time) == 2014) %>%
ggplot(aes(x = Temperature, y = Demand)) +
geom_point() +
# Linear trend line
geom_smooth(method = "lm", se = FALSE) +
# Non-linear trend line
geom_smooth(method = "loess", color = "red", se = FALSE) +
# Labels
labs(x = "Temperature (degrees Celsius)",
y = "Electricity demand (GW)")
```
Just by looking at the graph, we can see that **the linear trend line does not capture the entire strignth of the relationship between demand and temperature**.
### Correlation coefficients - careful interpretation
In your previous courses on statistics, you have been introduced to the correlation coefficient between two variables `X` and `Y `and you have also derived (hopefully) the relationship between this correlation coefficient and a linear trend line between `X` and `Y` obtained using ordinary least squares (ols):
Equation for the correlation between two random variables X and Y:
$$
r_{X,Y} = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X) \cdot \text{Var}(Y)}}
$$
Value of that correlation coefficient out of a sample:
$$
r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}
$$
On the other hand, the regression of Y on X is:
$$
Y = \beta_0 + \beta_1 X
$$
with
$$
\beta_1 = \frac{\text{Cov}(X,Y)}{\text{Var}(X)} = r_{X,Y} \cdot \frac{\sigma_Y}{\sigma_X}
$$
In other words, the equation of that linear regression line **tells us that the correlation coefficient alone only accounts for the linear relationship between `X` and `Y`** but it **does not measure any non-linear relationships**. In other words, if $r_{X,Y}=0$ we can be sure that **there is no linear relationship between X and Y**, but **there may be other relationships that are non-linear**:
$$
\require{cancel}
Y = \beta_0 + \cancel{r_{X,Y} \cdot \frac{\sigma_Y}{\sigma_X} \ \ X}
$$
The correlation coefficient **only measures the strength of linear relationship**.
#### Important example 1 - correlation coefficient of 0
Let us see a **specific toy example that will should keep in your head forever.**In this, we take some points out of the function `$y=x^2$`. We take as many points to the left than points to the right, located at symmetrical positions around the y axis.
```{r}
df <- tibble(
x = seq(-4, 4, 0.05),
y = x^2
)
df %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
```
In the figure we can see that the slope of the linear relationship is 0, and, as a result, the correlation coefficient is also 0:
```{r}
# **QUESTION**: Why is it not exactly 0?
cor(df$x, df$y)
```
```{r, include=params$print_sol, eval=FALSE}
ANSWER
* Its a matter of how the computer handles numbers.
* An output with that exponent is 0 for our intents and purposes.
* You need to get used to recognising this kind of scenarios.
```
But **we know for a fact that there is a non-linear relationship between the variables (we have defined the points using it).**.
#### Important example 2 - correlatoin
If the linear relationship is not 0 (slope of the trend line $\neq$ 0) $\rightarrow$ $r_{xy}\neq0$. But **there may be a stronger non-linear relationship** as in the case of the electricity demand.
For example, the correlation for the electricity demand and temperature data shown in the previous figure is only 0.26, but their **non-linear** relationship is actually much stronger (this can be seen comparing the linear model and the loess model fitted in the previous scatterplot)
```{r}
# Compute correlation coefficient
round(cor(vic_elec$Temperature, vic_elec$Demand), 2)
```
Remember that **a correlation coefficient of 0 does not imply that there is no relationship between two variables. It just implies that there is no linear relationship**.
Note: beware that there are very bad sources on *statistics* and *data science* on the internet:
- [Bad interpretation of correlation coefficient - ex 1](https://www.investopedia.com/ask/answers/032515/what-does-it-mean-if-correlation-coefficient-positive-negative-or-zero.asp)
- [Bad interpretation of correlation coefficient = 0 - ex 2](https://www.simplypsychology.org/correlation.html)
Other examples of variables with a correlation coefficient of 0:
```{r, echo=FALSE, out.width='60%', fig.align="center", fig.cap="Examples of corr coefficient = 0 (Wikipedia)"}
knitr::include_graphics('./figs/corr_coeff_0_wiki.png')
```
#### Correlation coefficient vs. shape of the relationship
Remember as well that **two variables may have a very similar correlation coefficent yet have a very different aspect**. The example below (ref [1]) shows scatterplots of different variables, all of which have a correlation coefficient of 0.82. The set of four images below above is known as Ancombe's quartet.
```{r, echo=FALSE, out.width='60%', fig.align="center", fig.cap="Examples of variables having the same correlation coefficient. From [1]"}
knitr::include_graphics('./figs/corr_coeff_same.png')
```
Finally, the animation below is intended to show you that, given a correlation coefficient and a number $n$ of datapoints, there are infinitely many orderings of those datapoints that result in the same correlation coefficient. In other words, the correlation coefficient is relevant, but it is not very informative of the distribution of the points.
![Datasaurus - from ref. 3](figs/DataSaurus_Dozen.gif){width="641"}
```{r}
```
All these examples are meant to make you aware of **how important it is to depict a scatterplot of the variables and not to rely only on the correlation coefficient**, which is a summary only of their linear relationship.
## Several variables - scatterplot matrix
```{r}
visitors <- tourism %>%
group_by(State) %>%
summarise(Trips = sum(Trips))
visitors
```
```{r}
# Check distinct values of column "State"
distinct(visitors, State)
```
```{r}
visitors %>%
pivot_wider(values_from=Trips, names_from=State)
```
```{r}
visitors %>%
pivot_wider(values_from=Trips, names_from=State) %>%
GGally::ggpairs(columns = 2:9) +
theme(axis.text.x = element_text(angle = 90))
```
```{r}
```
We can add a **loess** line to the scatterplots to help identify non-linearities with the following code:
```{r}
visitors %>%
pivot_wider(values_from=Trips, names_from=State) %>%
GGally::ggpairs(columns = 2:9, lower = list(continuous = wrap("smooth_loess", color="lightblue"))) +
theme(axis.text.x = element_text(angle = 90))
```
## Exercise: scatterplot matrix and lagged variables
The code snippet below loads the dataset `soi_recuitment`. Run the code and select the file `soi_recruitment.csv` when prompted to pick up a file. You will find this in the folder `ZZ_Datasets`, on google drive. This dataset contains two variables and has been taken from reference 4.
* SOI: Southern Oscillation Index (SOI) for a period of 453 months ranging over the years 1950-1987. The Southern Oscillation Index (SOI) is a standardized index based on the observed sea level pressure (SLP) differences between Tahiti and Darwin, Australia. The SOI is one measure of the large-scale fluctuations in air pressure occurring between the western and eastern tropical Pacific (i.e., the state of the Southern Oscillation) during El Niño and La Niña weather episodes. For more information about this variable, see [link](https://www.ncei.noaa.gov/access/monitoring/enso/soi)
* Recruitment: Recruitment (index of the number of new fish) for a period of 453 months ranging over the years 1950-1987. Recruitment is loosely defined as an indicator of new members of a population. The data has been gathered by Dr. Roy Mendelssohn, from the Pacific Environmental Fisheries Group.
```{r, echo=FALSE, warning=FALSE, message=FALSE}
soi_recruitment <-
read_csv("ZZ_Datasets/soi_recruitment.csv") %>%
mutate(ym = yearmonth(index)) %>%
select(ym, SOI, recruitment) %>%
as_tsibble(index = ym)
# Compute lags
for (i in seq(1, 8)) {
lag_name <- paste0("SOI_l", i)
soi_recruitment[[lag_name]] = lag(soi_recruitment[["SOI"]], i)
}
# Reorder
soi_recruitment <-
soi_recruitment %>%
select(ym, recruitment, everything())
```
```{r, eval=FALSE}
soi_recruitment <-
read_csv(file.choose()) %>%
mutate(ym = yearmonth(index)) %>%
select(ym, SOI, recruitment) %>%
as_tsibble(index = ym)
# Inspect result
soi_recruitment
```
### Lagged variables and their importance in time series
We are going to delve in this concept further in the next classes, but it is important to introduce it now.
Given a **variable $x_t$**, its lags correspond to **delayed versions of the variable**, with the number of the lag indicating the time delay:
* The first lag of a variable $x_t$ is written as $x_{t-1}$. At any point in time $t$, the value of $x_{t-1}$ is the value of $x_t$ one timestep ago.
* The second lag of a variable $x_t$ is written as $x_{t-2}$. At any point in time $t$, the value of $x_{t-2}$ is the value of $x_t$ two timestep ago.
* ...
* The n-th lag of a variable $x_t$ is written as $x_{t-n}$. At any point in time $t$, the value of $x_{t-n}$ is the value of $x_t$ $n$ timesteps ago.
**Lagged variables are relevant in time series** for multiple reasons. **Primarily** because:
1. By studying the relationship between the lags {$x_{t-1}$, $x_{t-2}$, $\cdots$, $x_{t-n}$} and $x_t$, we can **understand if the variable $x_t$ is related to its values some timesteps ago.
2. By studying the relationship between the lags {$x_{t-1}$, $x_{t-2}$, $\cdots$, $x_{t-n}$} and a second variable $y_t$, we can **understand if the variable $y_t$ is related to what happened in the past in the variable x**.
* **A typical business case:** current sales might be related to investment in marketing some timesteps ago.
As an example, the code below **computes the first 8 lags of the variable `SOI`:**
```{r}
# Compute desired lags
for (i in seq(1, 8)) {
lag_name <- paste0("SOI_l", i)
soi_recruitment[[lag_name]] = lag(soi_recruitment[["SOI"]], i)
}
# Reorder
soi_recruitment <-
soi_recruitment %>%
select(ym, recruitment, everything())
# Inspect result.
soi_recruitment
```
### Exercise: **generate the scatterplot matrix below**
**Question:** do you detect any non-linear relationship between recruitment and SOI or any of its lags `soi_li` that is not adequately captured by the corr. coefficient?
```{r, echo=FALSE, include=!params$print_sol, warning=FALSE, fig.width=8, fig.height=8}
# Count number of columns
ncols <- length(names(soi_recruitment))
# Generate scatterplot matrix
soi_recruitment %>%
GGally::ggpairs(columns = 2:ncols, lower = list(continuous = wrap("smooth_loess", color="lightblue", se=TRUE))) +
theme(axis.text.x = element_text(angle = 90))
```
```{r, include=params$print_sol, warning=FALSE, fig.width=8, fig.height=8}
# Count number of columns
ncols <- length(names(soi_recruitment))
# Generate scatterplot matrix
soi_recruitment %>%
GGally::ggpairs(columns = 2:ncols, lower = list(continuous = wrap("smooth_loess", color="lightblue", se=TRUE))) +
theme(axis.text.x = element_text(angle = 90))
```
```{r, eval=FALSE, include=params$print_sol}
The relationships between recruitment and SOI and its different lags seem fairly
linear, as the smoothing lines show. The correlation coefficient seems to
capture them properly in this case.
## IMPORTANT NOTE: LEADING VS LAGGING VARIABLES
For soi_l5 to soi_l8 the relationship is stronger. This indicates that the soi
series "leads" the recruitment series. That is, what happened some timesteps ago
in the soi series (t-5, t-6, t-7, t-8...) is correlated to what happens at time
t in the recruitment series.
Conversely, the variable recruitment "lags" the variable soi. What happened
on the current timestap in recruitment is related to the value of soi some
timestamps ago.
These concepts will be more clear when we discuss autocorrelation and
crosscorrelation
```
# Seasonal plots - gg_season
**Difference with a time-plot:** now the x-axis does not extend indefinately over time, but rather only over the duration of a season. The time axis is limited to a single season. In most of the situations we are going to handle in this course, that length of the season will be a year.
The period of a time series is always of greater order than the sampling frequency of the time series.
* **Example 1:** with **quarterly data**, the natural seasonal period to consider is the immediatly superior calendar unit, which is a year.
* **Example 2:** with **monthly data**, the natural seasonal period to consider are the immediatly superior calendar units, either a quarter or a year. In most instances we will consider a year when dealing with monthly data:
* **Example 3:** with **daily data**, multiple seasonal periods can be considered: week, month, year...
* **Example 4:** with **sub-daily data**, the day itself could be considered as a seasonal period...
* ...
Let us consider again the example of the antidiabetic drug sales in Australia. The dataset PBS contains monthly medicare australia prescription data. We are going to select those regarding drugs of the category ATC2. For further info run `?PBS` in the console.
```{r}
PBS %>%
filter(ATC2 == "A10") %>%
select(Month, Concession, Type, Cost) %>%
summarise(TotalC = sum(Cost)) %>%
# Comment on this assignment operator
mutate(Cost = TotalC / 1e6) -> a10
a10
```
```{r}
# Let us remember the time-plot
autoplot(a10, Cost) +
labs(y = "$ (millions)",
title = "Australian antidiabetic drug sales") +
scale_x_yearmonth(breaks = "1 year") +
theme(axis.text.x = element_text(angle = 90))
```
In this case we have quarterly data, so the natural period to consider is a year, which covers all the calendar variations and is of higher order than the data considered. The time-plot seems to confirm this as well. We will deal with seasonality and periods in more detail in the coming lessons through the concept of *autocorrelation*.
Now ley us produce a seasonal plot using gg_season
```{r}
a10 %>%
gg_season(Cost, labels = "both") + # Labels -> "both", "right", "left"
labs(y = "$ (millions)",
title = "Seasonal plot: Antidiabetic drug sales")
```
```{r}
```
As you can see, the length of the x-axis is limited to the duration of a year. Then each year is represented and color coded using a gradient scale. On this plot we can see that there are jumps in the general level of the series each year. We also see a fiest decline from January to Febrary and a subsequent recovery throughout the rest of the year
## Multiple seasonal periods
The dataset `vic_elec`, loaded along with fpp3, contains **half-hourly electricity demand** for the state of Victoria (Australia) for **three years**. Because **data is sampled every half-hour, we can find multiple seasonal patterns in the data**:
- Daily pattern (period 48)
- Weekly pattern
- Yearly pattern...
### Daily period example
Let us examine when the data begins and ends. A possible way to do this is:
```{r}
head(vic_elec)
tail(vic_elec)
```
It appears that the dataset contains measurements of the years 2012, 2013 and 2014 (three years).
It is possible to adjust the seasonal period considered by `gg_season` (if applicable to the data) through the argument `period`. For example, we may consider a daily period as follows:
```{r}
vic_elec %>% gg_season(Demand, period = "day") +
theme(legend.position = "none") +
labs(y="MWh", title="Electricity demand: Victoria")
```
```{r}
```
1. **Question:** what does each line represent in the data?
2. **Question:** how many datapoints do we have?
3. **Question:** how many lines do we have in our graph?
4. **Question** how many data points per line do we have?
Solutions to be provided after class:
```{r, eval = FALSE, include = params$hidden_notes}
1. Each line represents the demand in a specific day.
2. 365 * 24 * 2 * 3 = 52560 # 3 years and 2 * 24 because the ta is half-hourly
3. 365 * 3 = 1095
4. 52560 / 1095 = 48 (48 half hours in a day)
```
### Weekly period example
```{r}
vic_elec %>% gg_season(Demand, period = "week") +
theme(legend.position = "none") +
labs(y="MWh", title="Electricity demand: Victoria")
```
```{r}
```
1. **Question:** what does each line represent in the data?
2. **Question:** how many weeks are there in a year?
3. **Question:** How many lines do we have in this data?
4. **Question** how many datapoints does each line have?
```{r, eval = FALSE, include=params$hidden_notes}
1, Electricity consumption over a specific week
2. 365.25/7 = 52.18 on average - one leap every 4 years approx
3. Approx 52 * 3
4. 7 * 24 * 2 = 336
```
### yearly perio example
```{r}
vic_elec %>% gg_season(Demand, period = "year") +
labs(y="MWh", title="Electricity demand: Victoria")
```
```{r}
```
1. **Question:** what does each line represent?
2. **Question:** how many datapoints do we have per line?
3. **Question:** propose an alternatve, less noisy way of representing the data:
```{r, eval=FALSE, include=params$hidden_notes}
* Each line represents the electricity consumption for a specific year
* 365 * 24 * 2 = 17520
* For example: summarizing by month using `index_by` with a max, min and an average value.
```
# Seasonal subseries plot
## Exercise seasonal plots
The `aus_arrivals` (loaded with the `fpp3` library) data set comprises quarterly international arrivals (in thousands) to Australia from Japan, New Zealand, UK and the US.
Create a:
- Timeplot of the arrivals using `autoplot()` (one curve per country of origin)
- Seasonal plot using `gg_season()` (due to the structure of the data it will produce one graph per country of origin)
- A `gg_subseries()` plot that shows the evolution of the quarterly arrivals per country over the time period considered (see next section on `gg_subseries()`)
```{r}
aus_arrivals
```
```{r, include=params$print_sol}
aus_arrivals %>% autoplot(Arrivals)
```
```{r, eval = FALSE, include=params$print_sol}
Generally the number of arrivals to Australia is increasing over the entire
series, with the exception of Japanese visitors which begin to decline after 1995.
The series appear to have a seasonal pattern which varies proportionally to the
number of arrivals. Interestingly, the number of visitors from NZ peaks sharply
in 1988. The seasonal pattern from Japan appears to change substantially.
```
```{r, fig.width = 10, fig.height=15, echo=FALSE, include=params$print_sol}
aus_arrivals %>% gg_season(Arrivals, labels = "both")
```
```{r, include=params$print_sol, eval=FALSE}
The seasonal pattern of arrivals appears to vary between each country.
In particular, arrivals from the UK appear to be lowest in Q2 and Q3, and
increase substantially for Q4 and Q1. Whereas for NZ visitors, the lowest
period of arrivals is in Q1, and highest in Q3. Similar variations can be
seen for Japan and US.
```
```{r, include=params$print_sol}
aus_arrivals %>% gg_subseries(Arrivals)
```
```{r, include=params$print_sol, eval = FALSE}
The subseries plot reveals more interesting features. It is evident that whilst
the UK arrivals is increasing, most of this increase is seasonal. More arrivals