---
title: "Analysing Campylobacter time series in R"
output:
worded::rdocx_document
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE,
fig.width = 9,
fig.height = 6)
```
```{r setlang, eval = TRUE, echo = FALSE, results = 'hide'}
Sys.setlocale("LC_ALL", "english_UK")
```
```{r, echo = FALSE, warning = FALSE, message = FALSE}
# Load the packages required for creating this markdown:
# "worded": for styling and pagebreaks
# "knitr": for styling tables
required_packages <- c("worded", "knitr")
for (pkg in required_packages) {
  library(pkg, character.only = TRUE)
}
```
<!---CHUNK_PAGEBREAK--->
# Copyright and License
**Source:**
This case study was designed under an ECDC contract for “Expertise in biostatistics for analysis of surveillance data”.
This tutorial was first designed by Esther Kissling, EpiConcept, April 2012 and was modified for R by Thomas Seyler, EpiConcept, July 2015. In 2018 the R version was re-worked by Alexander Spina (corrected and updated); some parts of the code were adapted from the EPIET time series analysis module case studies.
**Revisions:**
*If you modify this case study, please indicate below your name and changes you made*
**You are free:**
- **to Share** — to copy, distribute and transmit the work
- **to Remix** — to adapt the work
**Under the following conditions:**
- **Attribution** — You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). The best way to do this is to keep as it is the list of contributors: sources, authors and reviewers.
- **Share Alike** — If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one. Your changes must be documented. Under that condition, you are allowed to add your name to the list of contributors.
- You cannot sell this work alone, but you can use it as part of a teaching activity.
**With the understanding that:**
- **Waiver** — Any of the above conditions can be waived if you get permission from the copyright holder.
- **Public Domain** — Where the work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
- **Other Rights** — In no way are any of the following rights affected by the license:
- Your fair dealing or fair use rights, or other applicable copyright exceptions and limitations;
- The author's moral rights;
- Rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.
- **Notice** — For any reuse or distribution, you must make clear to others the license terms of this work by keeping together this work and the current license.
This licence is based on http://creativecommons.org/licenses/by-sa/3.0/
<!---CHUNK_PAGEBREAK--->
## Objectives of tutorial
At the end of the tutorial, participants should be able to:
- create a plan of analysis for simple time series analysis;
- use R for exploratory time series analysis including installing the software, importing a dataset, setting time units for aggregation of time series, data checking, trend and seasonality analysis as well as analysis of residuals;
- understand how to create simple aberration detection algorithms, taking outbreaks into account.
## Note on the dataset
The dataset is an original dataset on Campylobacter disease in a European country that has been modified for training purposes.
## Background to the tutorial
The objective of the tutorial is to carry out an exploratory time series analysis on Campylobacter data using R. The dataset is called *campy.csv*.
The dataset gives the number of daily cases and contains a time, count and period variable. The table below shows the database dictionary:
```{r, echo = FALSE}
kable( matrix( c(
"Variable name", "Variable type", "Definition",
"time", "Daily date", "Date defined as 'Date used for statistics'",
"nb", "Numeric", "Number of reports",
"period", "Numeric", "The reporting period (e.g. 1=daily; 7=week; 30=month; 90=quarter; etc.)"
), ncol = 3, nrow = 4, byrow = T)
)
```
<!---CHUNK_PAGEBREAK--->
# Question 1
Open the dataset (*campy.csv*) in R and briefly explore it (make sure your working directory is set in R). Install and load the necessary packages.
With the assumption that the reporting in the dataset here is stable over time, create an outline of a mini-protocol for TSA, including hypotheses to be tested, a plan of analysis, with specifications for the dataset, inclusion/exclusion criteria, etc.
<!---CHUNK_PAGEBREAK--->
# Help Question 1
**For information:** Background on Campylobacteriosis
Campylobacteriosis is a diarrhoeal disease caused by Campylobacter bacteria, found in animals such as poultry, cattle, pigs, wild birds and wild mammals.
The most frequent way of getting infected is through the consumption of contaminated food (mainly poultry) or water. Other risk factors include swimming in natural surface-waters and direct contact with infected animals.
After an incubation period of 2-5 days (range 1-10 days), common symptoms are severe abdominal pain, watery and/or bloody diarrhoea and fever. Usually, symptoms last for a few days and the disease is self-limiting, but occasionally they persist and result in hospitalisation. Antimicrobial therapy is seldom needed.
Campylobacter infection has been associated with complications such as later joint inflammation (5-10% of cases) and, on rare occasions, Guillain-Barré syndrome (a temporary but severe paralysis that may be total).
As a prophylactic measure, control of Campylobacter colonisation in poultry is important, as well as hygienic processing of meat and the protection and control of private drinking water supplies.
## R-Studio environment
*R-Studio* is an integrated development environment (IDE) for *R*. It includes a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging and workspace management. You can download it freely from the [*R-Studio* website](https://www.rstudio.com/).
The *RStudio* window comprises the following sections:
![](RstudioIDE.png)
1. A script, which is equivalent to a do file in Stata or a pgm in Epi Info.
2. A console where you can write commands and execute them, and where results other than graphs are shown.
3. The “Environment” and “History” windows, where you can find the list of stored R objects and the log of your session, respectively.
4. The “Plots”, “Packages” and “Help” windows, for graphic visualisation, selection of installed packages and help documentation, respectively.
## R-scripts
Throughout this case study we strongly encourage you to use *R-scripts* to keep track of the commands you are using to follow your plan of analysis. It will contain data management steps as well as descriptive and statistical analysis.
Open R studio and open a new *R-script*.
To open a new *R-script*, click on the plus button on the top left (below File), then select *R-script* from the drop-down menu.
![](RstudioScript.png)
## Working directory
Make sure you have created a working directory on your computer. This is where you store your data, your R scripts and your outputs. An example of a working directory could be “C:/data/campy” if you are using a PC.
To set your working directory you can use the menu function by clicking "Session" > "Set Working Directory" > "Choose Directory..." and then you can select the appropriate folder.
![](RstudioWorkingDir.png)
You could alternatively do this by adding code to your *R-script* (or the command line).
You can check the path for your current working directory using the *getwd* function.
```{r, eval = FALSE}
#Check your current working directory
getwd()
```
To set your working directory you can use the *setwd* function.
```{r, eval = FALSE}
#set working directory
#note the forward rather than backwards slashes
setwd("C:/data/crypto")
```
## Packages
R comes with a set of default functions. However, one of the strengths of R is that many user-written functions exist to help you do your work. To use them, you first need to install the package they belong to.
To do this you can click on "Install" on the bottom right in the "Packages" window.
![](RstudioPackages.png)
<!---CHUNK_PAGEBREAK--->
We will mainly use packages which come ready installed with R (base code), but where it makes things easier we will use add-on packages. All the R packages you need for the exercises can be installed over the Internet.
You will need to install these before starting. We will be using the following packages.
- *ISOweek*: for adjusting dates and switching between units
- *surveillance*: for fitting time series and detection algorithms
- *zoo*: for producing rolling statistics
- *moments*: for looking at skewness and kurtosis of time series
You can also install packages using the *install.packages* function, as below. Do not worry about understanding this code in detail yet.
```{r}
#choose packages required
required_packages <- c("ISOweek", "surveillance", "zoo", "moments")
```
```{r, eval = FALSE}
# Installing required packages for this case study
for (pkg in required_packages) {
if (!pkg %in% rownames(installed.packages())) {
install.packages(pkg, dependencies = TRUE)
}
}
```
You only need to install packages once. However you will need to load them each time you start a new session using the *library* function.
```{r, eval = FALSE}
library(ISOweek)
```
Here too, you can load multiple packages at the same time within a loop; again, do not worry about understanding this just yet, simply appreciate that it is possible.
```{r, results = 'hide', message = FALSE, warning = FALSE}
# Loading required packages for this case study
for (pkg in required_packages) {
  library(pkg, character.only = TRUE)
}
```
<!---CHUNK_PAGEBREAK--->
## Reading in data
Import the dataset from a comma-separated value (.csv) file using the *read.csv* function, storing it as a data frame within *R* called campy. The assignment arrow (<-) is used to assign the result to the name campy.
Here we also specify that we do not want to read in string (character variables) as factors.
```{r}
campy <- read.csv("campy.csv", stringsAsFactors = FALSE)
```
## Variable types
Campy is now an *R* object of the data frame type. A data frame is simply how datasets are stored in *R*, and it is given the class "data.frame". You can check the type of the object by typing: class(campy)
Individual variables from the data frame are called using the name of the data frame followed by the **$** sign and the name of the variable. For instance, if you wish to enquire about the type of variables:
```{r, eval = FALSE}
#check class of variable time
class(campy$time)
#check class of variable count
class(campy$count)
```
You will notice that the time variable is of class "character". However we want this as a date. You can change this by overwriting the variable using the *as.Date* function.
```{r}
#overwrite time variable as a date
campy$time <- as.Date(campy$time)
```
## Browse data
To browse the first six records you can use the *head* function.
```{r, eval = FALSE}
#see first six rows of data
head(campy)
```
To get summary statistics of all variables included in the data frame use the *summary* function.
```{r, eval = FALSE}
#see summary statistics of variables
summary(campy)
```
Alternatively, you can view the data by clicking on the data icon to the right of the object in the “Environment” window.
![](RstudioView.png)
You could achieve the same thing by using the *View* function (note the capital V).
```{r, eval = FALSE}
#view your dataset
View(campy)
```
<!---CHUNK_PAGEBREAK--->
**Examples of hypotheses to be tested could be:**
- Is there a trend in the Campylobacter reporting?
- If yes, what type of trend is it?
- Is there any seasonality in Campylobacter reporting?
- If yes, around what time of the year does it occur?
- Are the residuals white noise? Or are there other events than random variation?
**Inclusion/exclusion criteria:** Only records with daily reporting will be considered (drop records with yearly reporting; a code sketch follows the plan below)
**Time unit for analysis:** Weeks
**Draft plan of analysis:**
**Data cleaning**
- Plot original data
- Data description (frequency distributions, summary tables) of all variables
- Checks for missing data
- Checks for numeric attraction in reporting due to reporting bias, including statistical tests
**Data description**
- To better understand the nature of the data
- Plot the data
- Calculate statistical descriptors
- Checks for stationarity
- Check for constant mean (linear regression)
- Check for constant variance (mean vs. variance plot)
- Checks for autocorrelation
- Autocorrelation
- Partial autocorrelation
- Correlogram
- Checks for distribution
- Histogram
- Symmetry plot
- Normal quantile plot
- Shapiro-Wilk normality test
- Checks for transformations
- Moving average (exploring trend and seasonality)
- Decomposition of the dataset
- Checks for trend (check R^2^)
- Checks for seasonality (taking trend into account)
- Spectral analysis, Periodogram
- Check R^2^ of models
- Residual analysis
- Plot residuals
- Statistical descriptors of residuals
- Checks for stationarity
- Check for constant mean (linear regression)
- Check for constant variance (mean vs. variance plot)
- Plot of residuals against model
- Checks for autocorrelation
- Autocorrelation
- Partial autocorrelation
- Correlogram
- Checks for distribution
- Histogram, symmetry plot, normal quantile plot
- Shapiro-Wilk normality test
- Test for white noise
- Depending on results of above analyses, carry out transformation, adjust trend/seasonality
- Possible: Simple modelling of data series with 95% CI, taking outbreaks into account (downweighting/removing)
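As a minimal sketch of the inclusion/exclusion criterion above (assuming, as per the data dictionary, that period == 1 marks daily reporting; the filtering step itself is not prescribed by the original plan), records could be restricted to daily reports as follows:
```{r, eval = FALSE}
# keep only records with daily reporting
# (assumption: period == 1 marks daily reporting, as per the data dictionary)
campy <- subset(campy, period == 1)
```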
<!---CHUNK_PAGEBREAK--->
# Question 2
Make any changes necessary to the dataset. Specify the appropriate time unit for aggregation (start with days). Carry out data checking, including tables and plots of all variables, checks for missing data and checks for numeric attraction by day of week and day of month for possible reporting bias.
Interpret your findings.
<!---CHUNK_PAGEBREAK--->
# Help Question 2
In its current form, the data frame contains daily counts (counts aggregated by date of notification). Dates are unique, i.e. there is only one count reported for each day.
You could double check this using the *duplicated* function.
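A minimal sketch of that check: if every date appears only once, the number of duplicated dates should be zero.
```{r, eval = FALSE}
# count duplicated dates; 0 means there is only one record per day
sum(duplicated(campy$time))
```
Next, plot the original data.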
```{r, eval = FALSE}
#plot counts by day
plot(campy$time, campy$count,
main = "Daily Campylobacter notifications, Europe, 2007-2010",
xlab = "days", ylab = "number of cases")
```
```{r, echo = FALSE}
#This is not shown to participants
#it is just to highlight the higher reporting in december 2008
clr <- ifelse(campy$count < 700, "black", "red")[-length(campy$count)]
plot(campy$time, campy$count, col = clr,
main = "Daily Campylobacter notifications, Europe, 2007-2010",
xlab = "days", ylab = "number of cases")
```
This plot shows the original data. There do appear to be some outliers (e.g. increased reporting at the end of December 2008, highlighted in red). It would be interesting to investigate this further and also go back to the country to ask for more information.
Now check if there are missing values in the dataset. You could either do this with the *summary* function, as in question 1, or you could do the following to get a count of missing values for the whole dataset.
```{r}
# get counts of missing values in the dataset
sum(is.na(campy))
```
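If you wanted the counts of missing values per variable rather than for the whole dataset, a small sketch of an extension would be:
```{r, eval = FALSE}
# counts of missing values per variable
colSums(is.na(campy))
```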
To check for notification bias, let us plot the number of cases notified by day of the week. The first step is to extract the day of the week from the date ("time" variable) by creating a new variable "dayweek". If the language on your computer is not set to English, see the appendix for how to change this and get English days of the week.
```{r}
#extract the weekdays from date
campy$dayweek <- weekdays(campy$time)
```
Make sure the days of the week are in the correct order by making it a factor variable with appropriate levels.
```{r}
#change to factor with levels
campy$dayweek <- factor(campy$dayweek,
levels = c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday", "Sunday"))
```
You then need to use the *aggregate* function to collapse your dataset, adding the counts for each day of the week together.
In the step below, you could skip the *with* function and instead specify the dataset within *aggregate*; however, this produces less useful column names.
```{r}
# using campy dataframe
# sum counts by day of the week
campy_dayweek <- with(campy, aggregate(count ~ dayweek, FUN = sum))
```
Next make a bar chart of the number of cases per day of the week over the whole period.
You can use the aggregated *campy_dayweek* data in the *barplot* function. You may not see all of the x-axis labels in the plot window; however, if you click the "Zoom" button these should appear.
```{r}
# plot counts by day
# names.arg is required to label the x axis appropriately
barplot(campy_dayweek$count, names.arg = campy_dayweek$dayweek,
main = "Number of notified cases by day of the week",
xlab = "Day of the week",
ylab = "Cases")
```
Let us check for numeric attraction by day of the month (think of potentially higher notification on the 1^st^ of the month for administrative purposes). As we have seen, there are far fewer notifications during weekends, so let us exclude Saturday and Sunday from this analysis. We create a new object that is a copy of ‘campy’ without the weekend records.
```{r}
# keep rows from campy that are not saturday or sunday
campyw <- subset(campy,
campy$dayweek != "Saturday" & campy$dayweek != "Sunday")
```
Create a new variable "daynum", extracting the day of the month from the "time" variable. To do this, use the *format* function and specify the day using "%d".
We also need to change this to a numeric variable using the *as.numeric* function.
```{r}
# create variable with day of month
# "%d" specifies that day of month is what you want
campyw$daynum <- format(campyw$time, "%d")
# change to numeric
campyw$daynum <- as.numeric(campyw$daynum)
```
We then need to aggregate the number of cases by day of the month, creating a new data frame "campy_daynum".
```{r}
# using campyw dataframe
# sum counts by day
campy_daynum <- with(campyw, aggregate(count ~ daynum, FUN = sum))
```
Plot another bar chart.
```{r}
barplot(campy_daynum$count, names.arg = campy_daynum$daynum,
main = "Notified cases by day of month (exluding weekends)",
xlab = "Day of the month",
ylab = "Cases")
```
There appears to be numeric attraction by day of week and day of month. The low number of notifications on Saturday and Sunday is likely due to the fact that some health care settings are closed on weekends. This is a strong argument for working with weekly cases. Reporting also appears somewhat higher on Tuesdays and on the 23^rd^; this may be influenced by the high reporting on the 23^rd^ of December 2008 (which was a Tuesday). You could redo the plots on the data excluding 2008 to see if the graphs change.
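As a sketch of that check (the object names campy_no2008 and campy_dw_no2008 are illustrative), you could drop 2008 and redraw the day-of-week bar chart:
```{r, eval = FALSE}
# repeat the day-of-week plot excluding 2008
campy_no2008 <- subset(campy, format(campy$time, "%Y") != "2008")
campy_dw_no2008 <- with(campy_no2008, aggregate(count ~ dayweek, FUN = sum))
barplot(campy_dw_no2008$count, names.arg = campy_dw_no2008$dayweek,
        main = "Notified cases by day of the week (excluding 2008)",
        xlab = "Day of the week",
        ylab = "Cases")
```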
<!---CHUNK_PAGEBREAK--->
# Question 3
Describe the time series:
- Plot the time series aggregated by time unit of interest
- Check the data for stationarity, autocorrelation, distribution and possible transformations of the data. Interpret your findings and plan next steps.
- Note: we do not expect data to be stationary, autocorrelated or normally distributed before it has been decomposed. This is nevertheless an important part in descriptive analysis.
- Use the “moving average” to explore possible trends and seasonality within the time series
<!---CHUNK_PAGEBREAK--->
# Help Question 3
The first thing to do is to create a data frame where the counts are aggregated by week. Be careful: there are different methods of defining the week of the year. Here we will use the *ISOweek* function (part of the *ISOweek* package).
```{r}
# create a variable with week
campy$week <- ISOweek(campy$time)
```
Have a look at the new variable
```{r, eval = FALSE}
# look at first few rows
head(campy$week)
# look at your dataset
View(campy)
```
As before, use the aggregate function to sum the cases by ISO-week.
```{r}
# using campy dataframe
# sum counts by week
campy_week <- with(campy, aggregate(count ~ week, FUN = sum))
```
We will then create a Time-Series object. These are vectors or matrices with class "ts" (and additional attributes) which represent data that have been sampled at equispaced points in time. In the matrix case, each column is assumed to contain a single (univariate) time series. Working with a Time-Series object will make it easier to analyse the time series. We keep only the counts variable and define the time unit of the new object by specifying the starting point (beginning of the first week of 2007) and the frequency (52 weeks).
```{r}
# create time series using the weekly counts (campy_week$count)
campy_ts <- ts(campy_week$count, start = c(2007, 0), end = c(2010, 52), frequency = 52)
```
```{r, eval = FALSE}
# check the class of campy_ts
class(campy_ts)
```
You can visualise the data from the time series object, look at basic descriptive statistics and check for missing values.
```{r, eval = FALSE}
# view your timeseries object
campy_ts
# get descriptive statistics
summary(campy_ts)
# check missings
sum(is.na(campy_ts))
```
<!---CHUNK_PAGEBREAK--->
Plot your time series.
```{r}
# Plot time series by week
plot(campy_ts, main = "Weekly cases, UK, 2007-2010",
ylab = "Number of cases", xlab = "Week of notification")
```
Examine the distribution of counts per week.
```{r}
# plot histogram of counts occurences
hist(campy_ts)
```
Look at the summary statistics, as well as the skewness and kurtosis using the respective functions from the *moments* package.
```{r}
# summary statistics
summary(campy_ts)
# look at the skewness
skewness(campy_ts)
# look at kurtosis
kurtosis(campy_ts)
```
The mean and median are not too different. There appears to be some skewness in the data. Kurtosis is around 3, meaning the peak is similar to that of a normal distribution.
<!---CHUNK_PAGEBREAK--->
## Checks for stationarity
### Linear regression
Simply speaking, a time series is stationary if its mean level and variance stay steady over time, or as explained by Hyndman and Athanasopoulos (2013): “[a] stationary time series is one whose properties do not depend on the time at which the series is observed”.
We can use linear regression to check whether the mean is constant over time:
We first define the time variable as "t", ranging from week 1 to week 209, covering the whole period:
```{r}
# define number of timepoints
t <- 1:209
```
We can then use our time-series data frame and the time variable in a linear regression using the *lm* function.
```{r}
# run linear regression for counts over time
linreg <- lm(campy_ts ~ t)
```
It is possible to apply the *summary* function to this regression in order to get a nicely formatted output.
```{r}
# view output of regression
summary(linreg)
```
There appears to be an upwards trend in the data. On average, there are 2.11 more cases notified every week.
We can also use linear regression to check whether the variance is constant over time.
We first need to calculate a rolling standard deviation. In this case we choose a moving window of 10 weeks. This means that for each time point a standard deviation is calculated based on roughly the five points before and after it.
```{r}
# produce a centred rolling standard deviation with 10 week window
variance <- rollapply(data = campy_ts, width = 10, FUN = sd, fill = NA)
```
```{r}
# run linear regression of standard deviation over time
variancereg <- lm(variance ~ t)
# view output of linear regression
summary(variancereg)
```
There does not appear to be a significant change in variance over time.
<!---CHUNK_PAGEBREAK--->
## Checks for autocorrelation
Does the series correlate with itself, and for what lags?
In order to estimate autocorrelation we will use the *acf* function. This function can also create a plot; however, we will not use the plotting functionality immediately.
```{r}
# estimate autocorrelation without plotting
acplot <- acf(campy_ts, plot = FALSE)
```
We then need to transform lags from years to weeks by multiplying the lag variable by 52 (number of weeks in a year).
```{r}
# change lag from years to weeks
acplot$lag <- acplot$lag * 52
```
Then you can use the regular plot function to visualise these estimates.
```{r}
# plot your autocorrelation graph
plot(acplot, main = "Autocorrelation", xlab = "Lag (week)")
```
The first line on this autocorrelation graph shows the correlation between each observation and itself (at lag zero, therefore always 1), the second shows the correlation between each observation and the preceding one (lag one), and so on.
Blue lines show the 95% confidence intervals. Where lines go above or below these, there is significant autocorrelation (to produce confidence intervals similar to Stata's, add *ci.type = "ma"*, as shown below).
This graph therefore shows substantial autocorrelation in the data.
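For example, re-plotting the same *acf* object with moving-average-type confidence bands:
```{r, eval = FALSE}
# plot autocorrelation with "ma" confidence bands (comparable to Stata)
plot(acplot, ci.type = "ma", main = "Autocorrelation", xlab = "Lag (week)")
```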
We now need to investigate possible seasonality (or periodicity).
To do this we use the *pacf* function for partial autocorrelation. Essentially, this adjusts the autocorrelation at each lag for any autocorrelation carried over from previous observations.
```{r}
# estimate partial autocorrelation without plotting
pacplot <- pacf(campy_ts, plot = FALSE)
# change lag from years to weeks
pacplot$lag <- pacplot$lag * 52
# plot your partial autocorrelation graph
plot(pacplot, main = "Partial Autocorrelation", xlab = "Lag (week)")
```
The partial autocorrelation graph shows autocorrelation at the first lag and further autocorrelation at around the 5th and 18th/19th lags. The seasonal component of this time series needs to be examined.
<!---CHUNK_PAGEBREAK--->
## Checks for normality
Begin by visually inspecting the data.
We can plot a histogram of the data using *hist* and add a normal curve by combining the *curve* and *dnorm* functions. In this case we want to plot probability density rather than frequency on our histogram.
```{r}
#plot your histogram defining labels and breaks
hist(campy_ts, probability = TRUE,
main = "Distribution with normal curve",
xlab = "counts",
breaks = 15
)
#plot your normal curve on top
curve(dnorm(x, mean = mean(campy_ts), sd = sd(campy_ts)),
add = TRUE, col = "red", lwd = 2)
```
You can also inspect this using a normal quantile plot (or QQ-plot). To do this, combine the *qqnorm* and *qqline* functions. If the dots do not follow the straight line, this is indicative of a non-normal distribution.
```{r}
# quantile plot for the data
qqnorm(campy_ts)
qqline(campy_ts)
```
Both of these plots show that the data is skewed and not normally distributed.
This can also be shown statistically using the Shapiro-Wilk test for normality. The test shows significant deviation from normality, so we may need to consider transforming the data.
```{r}
# run shapiro-wilk test on data
shapiro.test(campy_ts)
```
<!---CHUNK_PAGEBREAK--->
## Check transformations
We will check three transformations: square root, log and square.
First try the square-root using the *sqrt* function.
```{r}
# check transformation by square-root
# apply square-root to your dataset
sqrtcampy_ts <- sqrt(campy_ts)
# plot your histogram defining labels and breaks
hist(sqrtcampy_ts, probability = TRUE,
main = "square-root",
xlab = "counts",
breaks = 15
)
# plot your normal curve on top
curve(dnorm(x, mean = mean(sqrtcampy_ts), sd = sd(sqrtcampy_ts)),
add = TRUE, col = "red", lwd = 2)
# run shapiro-wilk test on data
shapiro.test(sqrtcampy_ts)
```
Next try log transformation using the *log* function.
```{r}
# check transformation by log
# apply log to your dataset
logcampy_ts <- log(campy_ts)
# plot your histogram defining labels and breaks
hist(logcampy_ts, probability = TRUE,
main = "Log",
xlab = "counts",
breaks = 15
)
# plot your normal curve on top
curve(dnorm(x, mean = mean(logcampy_ts), sd = sd(logcampy_ts)),
add = TRUE, col = "red", lwd = 2)
# run shapiro-wilk test on data
shapiro.test(logcampy_ts)
```
Finally, try the square transformation by raising the data to the power of 2.
```{r}
# check transformation by square
# square your dataset
sqcampy_ts <- campy_ts^2
# plot your histogram defining labels and breaks
hist(sqcampy_ts, probability = TRUE,
main = "Square",
xlab = "counts",
breaks = 15
)
# plot your normal curve on top
curve(dnorm(x, mean = mean(sqcampy_ts), sd = sd(sqcampy_ts)),
add = TRUE, col = "red", lwd = 2)
# run shapiro-wilk test on data
shapiro.test(sqcampy_ts)
```
The transformation that renders the original time series most normal is the square root (W = 0.98762, p-value = 0.06633). However, does this transformation make sense? A log transformation, and also no transformation at all, should be considered as well; a quick visual comparison follows.
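As an optional visual comparison (a sketch using the objects created above), you can place normal quantile plots of the original and transformed series side by side:
```{r, eval = FALSE}
# side-by-side QQ-plots of original and transformed series
par(mfrow = c(1, 3))
qqnorm(campy_ts, main = "Original"); qqline(campy_ts)
qqnorm(sqrtcampy_ts, main = "Square root"); qqline(sqrtcampy_ts)
qqnorm(logcampy_ts, main = "Log"); qqline(logcampy_ts)
par(mfrow = c(1, 1))
```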
<!---CHUNK_PAGEBREAK--->
## Moving average
We now want to use moving averages to investigate possible trends and seasonality.
We can calculate these using the *rollapply* function, similar to how we calculated the rolling standard deviation when checking for variation over time.
We will calculate the centred moving average for several different time windows: try, for example, one each for 52 weeks, 26 weeks and 13 weeks.
```{r}
# calculate annual moving average
ma52 <- rollapply(campy_ts, width = 52, FUN = mean)
```
We can then plot this using the *ts.plot* function.
```{r}
# plot your timeseries with annual moving average
ts.plot(campy_ts, ma52,
main = "Weekly cases and moving average of 52 weeks, Campylobacter cases, 2007-2010",
ylab = "number of cases",
xlab = "week of notification")
```
Now try the same for a 26 week window.
```{r}
# calculate 26 week moving average
ma26 <- rollapply(campy_ts, width = 26, FUN = mean)
# plot your timeseries with 26 week moving average
ts.plot(campy_ts, ma26,
main = "Weekly cases and moving average of 26 weeks, Campylobacter cases, 2007-2010",
ylab = "number of cases",
xlab = "week of notification")
```
And now do the same for a 13 week window.
```{r}
# calculate 13 week moving average
ma13 <- rollapply(campy_ts, width = 13, FUN = mean)
# plot your timeseries with 13 week moving average
ts.plot(campy_ts, ma13,
main = "Weekly cases and moving average of 13 weeks, Campylobacter cases, 2007-2010",
ylab = "number of cases",
xlab = "week of notification")
```
It is also possible to decompose a time series automatically into seasonal, trend and irregular components using moving averages with the *decompose* function.
```{r}
plot(decompose(campy_ts))
```
In the UK, there appears to be an increasing annual trend in Campylobacter reports from 2007 to 2010 (this can be seen with the moving average of 52 weeks). There also appears to be an annual peak of reports in the summer (seen with the moving average of 26 weeks). The moving average window of 13 weeks gives more information and shows that there appears to be a peak in early summer and a second smaller peak after the summer.
Without more background it is difficult to judge whether the trend is a real trend and the peaks are real peaks, or whether they are perhaps due to batch reporting. However, a literature search indicates that Great Britain has marked seasonal peaks in Campylobacter enteritis occurring in May and September (“Investigation of Faecal Specimens for Bacterial Pathogens”; National Standards Methods, 2010. http://www.hpa-standardmethods.org.uk/documents/bsop/pdf/bsop30.pdf [Accessed October 2011]).
<!---CHUNK_PAGEBREAK--->
# Question 4
Decompose the time series:
- Decide on an appropriate trend to use
- With the trend removed, analyse seasonality
- What seasonality will you include in a final model? Why?
<!---CHUNK_PAGEBREAK--->
# Help Question 4
## Regression type
We are using count data, for which Poisson regression or negative binomial regression (if overdispersion is present) may sometimes be indicated. However, we have very high counts, with all weekly counts exceeding 500. In addition, the histogram does not look particularly Poisson distributed. Therefore it may make sense to use linear regression in the first instance. At the residual analysis stage we can then verify whether this choice is appropriate.
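As a quick, informal check of that reasoning (a sketch, not a formal test): for Poisson-distributed data the variance is roughly equal to the mean, so comparing the two gives a first impression of possible overdispersion.
```{r, eval = FALSE}
# informal overdispersion check: for Poisson data, variance is close to mean
mean(campy_ts)
var(campy_ts)
```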
A next step may be to decide if the data follow a linear or a quadratic trend. Look at the graphs and regression outputs for both trends.
First, define your time variable. Last time we defined it by typing in the upper limit as a fixed value; here we can instead refer to the length of our time series object.
```{r}
# define time variable (number of weeks)
t <- 1:length(campy_ts)
```
We now run a linear regression as previously, showing the summary as well as the confidence intervals.
```{r}
# run linear regression for counts over time
linreg <- lm(campy_ts ~ t)
# view output of regression
summary(linreg)
# view confidence intervals
confint(linreg)
```
We can then plot the fitted values together with our observed data, similar to the moving averages above, using the *ts.plot* function.
```{r}
ts.plot(campy_ts, linreg$fitted.values,
col = c("black", "red"),
main = "linear trend")
```
For a quadratic model we first need to square our time variable and then include it alongside the linear term in the regression.
```{r}
# create a new squared time variable
t2 <- t^2
# run linear regression
quadreg <- lm(campy_ts ~ t + t2)
# view output
summary(quadreg)
# plot with observed data
ts.plot(campy_ts, quadreg$fitted.values,
col = c("black", "red"),
main = "linear quadratic trend")
```
The adjusted R^2^ of the model with a quadratic trend (R^2^ = 0.106) is slightly lower than that of the model with a linear trend (R^2^ = 0.109). However, the difference between these values is minimal. In addition, R^2^ values do not take the complexity of the model into account (a quadratic model is more complex than a linear model). We can instead examine the Akaike Information Criterion (AIC), which does take model complexity into account.
```{r}
# view linear regression AIC
AIC(linreg)
# view quadratic AIC
AIC(quadreg)
```
The AIC for the model with a linear trend is 3054.5 and the AIC for the model with quadratic trend is 3056.4 (the smaller the AIC the better the fit). The difference is negligible. We may want to use the principle of parsimony and continue our modelling using the simpler model, with the linear trend.
Conclusions:
In the overall dataset, we see an upwards trend over time, which we expected from the results of the moving average analysis. The regression output shows a coefficient of 2.11 which indicates an increase of 2.11 cases for each week in the dataset. The p-value is highly significant, which can also be a reflection of the high sample size.
It appears that the trend is increasing, but we cannot conclude that without more information on the surveillance system and reporting practices.
Please also note that for educational purposes we are using linear regression to model the trend here. Linear regression should only be used if the data show no signs of overdispersion, and an absence of overdispersion is rare in count data. However, if counts are very high, the distribution may approximate the normal distribution, in which case it is acceptable to use linear regression.
Furthermore, it is important that we have no missing data. If we did, we would need to set an action for NA values when running the *lm* function (see the help pages for more); a sketch follows.
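For instance (a sketch; linreg_check is an illustrative name and *na.fail* is one of several possible actions), you could make the regression stop with an error if missing values were present, rather than silently dropping them:
```{r, eval = FALSE}
# make lm() fail loudly if the series contained missing values
# (the default action, na.omit, silently drops them)
linreg_check <- lm(campy_ts ~ t, na.action = na.fail)
```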
<!---CHUNK_PAGEBREAK--->
## Seasonality
We will check for seasonality on the data after the trend has been removed; however, we get essentially the same results if we check for seasonality on the original data.
We can find the most important periods in the data by doing a spectral analysis. This can be done using the *spec.pgram* function. Specifying *fast = FALSE* stops the function from "padding" the series with extra zeros (padding is otherwise used to enable the fast Fourier transform).
```{r}
# get the periodogram (spec.pgram detrends the series by default)
periodo <- spec.pgram(campy_ts, fast = FALSE)
```
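Note that *spec.pgram* detrends the series by default (*detrend = TRUE*). As a sketch (detrended and periodo_detrended are illustrative names), you could instead detrend explicitly with the residuals of the linear trend model above, and you should obtain essentially the same periodogram:
```{r, eval = FALSE}
# detrend explicitly using residuals from the linear trend model,
# then compute the periodogram (should closely match the one above)
detrended <- ts(residuals(linreg), frequency = 52)
periodo_detrended <- spec.pgram(detrended, fast = FALSE)
```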
To plot the periodogram appropriately we need to divide the frequencies by 52, as they are originally given in cycles per year.
From the periodogram we can see that there may be several important periods. The first period of importance is obviously 52 weeks; after that we see one or two further important peaks.
```{r}
# plot periodogram with period in weeks (rather than years)
plot(1/(periodo$freq/52), log(periodo$spec),
type = 'l',
xlim = c(0, 209),
xlab = "Period (Weeks)", ylab = "Log(density)")
```
We can extract the frequencies with the highest peaks and convert them to periods in weeks. Note that you need to use the *order* function in combination with a negative sign to sort the output with the highest values first.
```{r}
# extract important weeks
# multiply by 52 because currently in years
1 / periodo$freq[order(-periodo$spec)] * 52
```
We can see that after the period of 52.25 weeks, the period of 17.4 weeks is the most important. This is approximately a third of a year, so it may be a plausible further period to explore. The next most important period is 26.1 weeks, which is around half a year, again a plausible period to explore.
Let’s just start with the most important period of 52.25 weeks. We can create a sine and cosine for this period and add it to our model with the linear trend.
```{r}
# defining sine and cosine terms