---
title: "MA Paper Code Appendix"
author: "Nigel McKernan"
date: "10/18/2020"
output:
  html_document:
    toc: yes
    toc_float: yes
    number_sections: yes
bibliography: MAPaper.bibtex
nocite: |
  @stats2020hom,
  @stats2020inc,
  @stats2020rent,
  @stats2020vac,
  @stats2020unemp,
  @anderson2015asthewind,
  @arellano1991some,
  @bajari2007estimating,
  @bondy2018april,
  @cepa2007priority,
  @chan2014july,
  @chay2005does,
  @cholette2006benchmarking,
  @chow1971best,
  @coulson2013what,
  @croissant2008panel,
  @csavina2014july,
  @denton1971adjustment,
  @deryugina2016mortality,
  @driscoll1998consistent,
  @garrett2002aggregated,
  @heckman1976common,
  @hong2015august,
  @jacobs1994dividing,
  @leonard2016july,
  @li2006studies,
  @litterman1983random,
  @mullerkademann2015internal,
  @rosen1974jan,
  @sax2013december,
  @small1975air,
  @statscan2015low,
  @teranbc2020our,
  @troy2008property,
  @wang2015august,
  @wei1990disaggregation,
  @goc2019national,
  @goc2020historical,
  @hla2018star,
  @kleiber2008aer
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
This appendix is a companion piece to my Master's Research Project that I completed to fulfill the requirements of my Master's in Economics from Concordia University in Montréal, Québec.
You can find that [here](https://github.com/nigeljmckernan/MAPaperAppendix/blob/master/Thesis_Nigel_McKernan.pdf).
I have 3 main goals that I want to achieve with this document:
- To demonstrate *how* I executed the analysis of my paper via the programming language R and its vast collection of packages.
- To practice writing [R Markdown](https://rmarkdown.rstudio.com) documents (what you are reading right now).
- R Markdown documents embed R code directly in reports, slides, and other output, allowing a seamless transition between regular prose and the R code that pertains to the report.
- To demonstrate the following skills to prospective interested parties, such as employers, schools, or other organizations:
- Research
- Data analysis, cleaning, and visualization
- Communicating my research and analysis in plain, accessible language (as best as I can).
## Base-R/data.table vs Tidyverse
Apart from the econometrics-focused packages that I'll be using, I prefer using packages and functions from the [Tidyverse](https://tidyverse.org) collection of packages for data manipulation/munging/cleaning, compared to those found in base-R or in the `data.table` package.
That's simply my personal preference for how I like to clean and manipulate dirty or untidy data to the point where it's ready for analysis, regressions, machine learning, etc.
- I realize `data.table` is faster, though I prefer `dplyr`'s syntax, and the seamless transition into functions from the other Tidyverse packages, like `tidyr`, is unbelievably convenient.
- If I need the speed that `data.table` is excellent at providing, I'll just use `dtplyr` to speed up my `dplyr` operations.
As an aside, I think people often treat the `%>%` (pipe) operator from `magrittr` as something that belongs to the Tidyverse. I personally don't see it that way, as the pipe works just as well with non-Tidyverse packages (including `data.table`!); a quick illustration follows below.
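For a quick illustration (a toy example, not part of the paper's analysis), the pipe chains straight into `data.table`'s bracket syntax via the `.` placeholder:
```{r pipe-example, eval=FALSE}
library(magrittr)
library(data.table)

# The `.` placeholder stands in for the piped-in data.table,
# so data.table's own [i, j, by] syntax works on the right-hand side.
as.data.table(mtcars) %>%
  .[, .(mean_mpg = mean(mpg)), by = cyl]
```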
With that out of the way, I'll load the `tidyverse` package to bring in our essentials.
```{r packages}
library(tidyverse)
library(knitr)
library(dtplyr)
library(foreach, warn.conflicts = FALSE)
library(tempdisagg)
library(plm, warn.conflicts = FALSE)
library(broom)
```
```{r stargazer, message =FALSE}
library(stargazer)
```
## Overview of Packages Used
With that previous disclaimer out of the way, the non-Tidyverse packages I'll be using are:
- `plm` - the main workhorse package for conducting panel-data regression analyses that most closely resembles `lm()`
- `lme4` & `nlme` won't be used, as my model does not incorporate quadratic regressors (it fails to reject the null of the RESET test) or other non-linear elements, and does not take a mixed-models approach.
- `lmtest` - mainly for `waldtest()` for comparing models, and `coeftest()` for testing the significance of my regressors while incorporating different estimated covariance matrices from `plm`
- `sandwich` is not used as the more useful covariance matrix estimators for panel data are already provided in `plm`
- i.e. `vcovSCC()`
- `foreach` - the main reason I like this package so much is that its `for` loop returns an *object*. Not only that, but via the `.combine` argument you can really customize the output `foreach()` returns (see the short sketch after this list).
- For this project, I want a *single* data-frame where I iterate through various years, months, and station ID's across Canada from the Environment Canada's website to pull years-worth of climate data at the hourly and daily levels.
- To do that, the value for my `.combine` argument will be `rbind`, as I want the output of each iteration appended to the end of the previous one, returning a *single* data-frame instead of a nested list that I would have to unnest afterwards.
- By default, `foreach()` will return a nested list where each element in the list contains the output of the respective iteration.
- I will not be using a parallelized backend such as `doParallel`, as pulling data over http(s) has only worked sequentially when I've tried it, so the sequential `%do%` operator (rather than `%dopar%`) will work fine.
- `tempdisagg` - I will not be covering much of the theory of temporal disaggregation, but this package will be used to temporally disaggregate the socioeconomic data from Statistics Canada that is only available annually, down to a monthly frequency, which is the frequency needed to match our dependent variable.
- I don't attempt to use an indicator series as there is not a clear consensus, or known recommended indicator data, for something like household income, dwelling vacancy rate, or homicide rate
- As such, I will be using the Denton-Cholette (`method = "denton-cholette"`) method of estimating our AR(1) regressor by regressing on a vector of ones, which does not require an indicator series, and produces very smooth results.
- As an aside, I *really* wish there was an attempt to wrap `td()` and `ta()` from `tempdisagg` into a "tidy" way of doing temporal (dis)aggregation
- I really like what `parsnip` from [Tidymodels](https://tidymodels.org) does to "front-endify" a lot of machine-learning/regression techniques into a coherent and consistent API, and really wish there was an attempt to bring `plm` and `tempdisagg` into that ecosystem.
- A close alternative is to use the various `map_x()` functions from `purrr` to iterate in a "tidy" way across a dataframe.
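To make the `.combine` behaviour concrete, here is a minimal toy sketch (not part of the paper's analysis) contrasting `foreach()`'s default list output with row-binding:
```{r foreach-sketch, eval=FALSE}
library(foreach)

# Default: a list, one element per iteration
as_list <- foreach(i = 1:3) %do% {
  data.frame(iteration = i, value = i^2)
}

# With .combine = "rbind": one data frame, each iteration's rows appended in order
as_df <- foreach(i = 1:3, .combine = "rbind") %do% {
  data.frame(iteration = i, value = i^2)
}
```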
# Exogenous Regressor Data
My exogenous regressors all come from Statistics Canada tables (hereafter abbreviated to StatsCan). Since the process of pulling, cleaning, and extracting the data is generally the same for each of the *exogenous* variables, for brevity's sake I'll only demonstrate it for one variable.
The dataset I'll be using is from StatsCan table [11-10-0135-01](https://doi.org/10.25318/1110013501-eng), "Low income statistics by age, sex and economic family type".
The variable I'll be pulling from this table is the percentage of persons below the After-Tax Low-Income Cut-Off, by city and by year.
- In my model, the short-hand for this variable is "LICO"
## Pulling In The Data
Base-R's `read.csv()` function is known to be generally slow, and not the greatest at parsing columns into their respective data types.
`fread()` from `data.table` is blazing fast, but I generally encounter more errors with it and the columns that get parsed aren't always converted to their appropriate data type.
I do find it useful if the input file is consistent *enough* to not garble up the data on the way in.
On the other hand, `read_csv()` or `read_tsv()` from `readr` is very consistent (from my experience) in correctly parsing the data into its appropriate data types, though it is slower than `fread()`.
As such, `readr::read_csv()` will be used to pull the data from our CSV file.
```{r readr}
LICO_Data <- read_csv("Z:\\Documents and Misc Stuff\\MA Paper\\11100135-eng\\11100135.csv")
LICO_Data
```
As you can see, some columns have names that don't make clear what they represent in the dataset, or are too long and should be shortened for ease of manipulation later.
Let's proceed with that now.
## Removing and Renaming Columns
```{r removing unneeded columns}
LICO_Part_1 <- LICO_Data %>%
select(-DGUID,
-UOM_ID,
-SCALAR_ID,
-VECTOR,
-COORDINATE,
-DECIMALS,
-SYMBOL,
-STATUS,
-TERMINATED,
-SCALAR_FACTOR,
-UOM)
LICO_Part_1 %>% head() %>% kable()
```
We've eliminated some columns that were either redundant, or had no pertinent information.
However, columns like "Low income lines" and "Persons in low income" can be renamed to something more succinct.
```{r renaming}
LICO_Part_2 <- LICO_Part_1 %>%
rename(Persons = `Persons in low income`,
Lines = `Low income lines`,
Year = REF_DATE,
Value = VALUE)
LICO_Part_2 %>% head() %>% kable()
```
That looks much better.
## Splitting A Column
The "GEO" column is a bit problematic for us as this column contains observations for Canada in aggregate and the various provinces, but *also* individual cities that have the province name in the observation separated by a comma (,).
- e.g. Winnipeg, Manitoba
- (Go Jets Go!)
Instead this column should ideally be split into two:
- One column for "Province/Country"
- The other for "City"
So how do we split this column into two?
Luckily for us, any time the observation is a city, it's separated from its province by a comma (,).
The `separate()` function from `tidyr` makes splitting this easy to do:
```{r splitting}
LICO_Part_3 <- LICO_Part_2 %>%
separate(GEO, c("City","Province"), ", ")
LICO_Part_3 %>% filter(!Province %>% is.na()) %>% head() %>% kable()
```
Since not all observations denote cities (and therefore contain no comma), there's nothing to split for those rows, so the second column will contain `NA`s whenever that occurs.
This is what the warning above is telling us.
## Filtering Observations
Since this dataset contains extra data that is not needed for this analysis, let's filter those observations out.
First, since this project concerns Canadian Metropolitan Areas (CMA's) and *not* provinces or the country in aggregate, we will be filtering out any observations that are not CMA's.
Additionally, our dependent variable only has data from 2001 to 2019, so we need to filter out years outside of that range to match our dependent variable.
As well, we do not need *all* the various measures of low income from the "Lines" column. We are only concerned with the after-tax variant, based off 1992 as a base year to deflate other years.
Finally, we want to capture *all* persons; not those stratified into different age groups in the "Persons" column.
Those last two requirements can be accomplished by the `stringr` package, which makes dealing with text data very easy.
So, let's filter out the observations in accordance with our needs above:
```{r filtering}
LICO_Part_4 <- LICO_Part_3 %>%
filter(!Province %>% is.na(),
Year >= 2001 & Year < 2020,
Persons %>% str_starts("All"),
Lines %>% str_detect("1992 base"),
Lines %>% str_detect("after tax"))
LICO_Part_4 %>% head() %>% kable()
```
Alright, those observations have now been filtered out of our dataset.
Now, we need to *pivot* this dataset.
## Pivoting
If you're familiar with Pivot Tables in Microsoft Excel or something similar (LibreOffice Calc), then this should be relatively easy to understand.
If not, pivoting is hard to explain without a demonstration, but in short it either spreads individual observations out into their *own* columns, or does the opposite, collapsing several columns into a single column.
- Our dataset as it is now is in the *former* situation; we want to turn observations into columns;
- I want to make certain observations in one column, into *their own* columns:
- i.e. The "Statistics" column has different types of measurements, with their corresponding value in the "Value" column
- What I want to achieve is to make a column for every different measure in the "Statistics" column, with their respective value taken from the "Value" column
The end goal of this is to have a unique observation per pairwise combination of city and year, *per row*.
Hard to understand, right? Let's just demonstrate it instead:
But before we do, let's first:
- See how many **distinct** values the "Statistics" column can be as our dataset currently stands
- Take a look at what our dataframe looks like *before* we pivot it
```{r pre-pivot}
# See distinct values for the Statistics column
LICO_Part_4 %>% select(Statistics) %>% distinct() %>% kable()
# Preview refresh of our table
LICO_Part_4 %>% head() %>% kable()
```
What I want to achieve is to take those 3 distinct values in the "Statistics" column, and turn them into their own columns, with their respective value taken from the "Value" column.
```{r pivoting}
LICO_Part_5 <- LICO_Part_4 %>%
select(-Province) %>%
#Getting rid of the Province column since I really don't need it anymore.
pivot_wider(names_from = Statistics,
values_from = Value) %>%
#Renaming the resulting columns to something a little more concise, but based off the renaming, you probably get what they represent.
rename (No_Persons = `Number of persons in low income`,
Percentage = `Percentage of persons in low income`,
Gap = `Average gap ratio`)
LICO_Part_5 %>% head() %>% kable()
```
I did some renaming of the resulting columns, but they should be a close enough analogue to what they were named before.
## Recoding Observations
Now I'm going to do some alterations to the names of some cities in our "City" column.
To avoid any potential errors and make sure everything is named consistently, I'm going to replace any "é" with a regular "e".
- This is found in the names of Montréal and Québec.
- I don't have anything against accent aigu, or Québec, or Montréal (I mean I *live* there after all), it's just to make sure any join operations I do with the other datasets always correctly match to each other.
To do this, I'm relying again on functions in the `stringr` package, specifically `str_replace()`
```{r recoding}
LICO_Part_5$City <- LICO_Part_5$City %>%
str_replace("é", "e") %>%
str_replace("Ottawa-Gatineau","Ottawa")
LICO_Part_5 %>% head() %>% kable()
```
## Cleanup
Now that we've done all the necessary recoding, filtering, and pivoting, let's do some final cleanup by removing any unneeded columns.
For my analysis, I'm only keeping the "City", "Year", and "Percentage" columns to keep the pairwise City and Year panel observation and its corresponding value.
```{r cleanup}
LICO_Part_6 <- LICO_Part_5 %>%
select(City,
Year,
Percentage)
LICO_Part_6 %>% head() %>% kable()
```
Now our dataset looks like something we can incorporate into a dataframe for `plm()`, assuming the rest of our variables end up in the same shape, which they do.
Just for some visualization (I haven't really done any until now), I'm going to plot the percentage over time, coloured by city.
```{r plotting LICO}
LICO_Part_6 %>%
ggplot(aes(Year,Percentage, colour = City)) +
geom_line() +
xlab("Year") +
ylab("Percentage of Persons Below After-Tax LICO (%)") +
ggtitle("LICO Plots by City over Time")
```
# Temporal Disaggregation
Now that our data is "tidy", we need to temporally disaggregate each series to a monthly frequency.
This is done with the `td()` function from the `tempdisagg` package.
Before we move any further, we must do one more pivot so that each city gets its own column, which `td()` needs in order to do its job properly.
```{r pivot disagg}
LICO_Part_7 <- LICO_Part_6 %>%
pivot_wider(names_from = City,
values_from = Percentage) %>%
arrange(Year)
LICO_Part_7 %>% kable()
```
We have 18 years' worth of data; multiplied by 12 months, our new dataframe must have 216 rows for the disaggregation to line up properly.
So let's create an empty matrix, which we'll populate later, for the `for()` loop implementation.
I also run the disaggregation using `map_dfc()` from `purrr` to create the object directly, which doesn't need an empty dataframe/matrix beforehand.
```{r td}
LICO_td <- matrix(nrow = nrow(LICO_Part_7) * 12L,
ncol = ncol(LICO_Part_7)) %>%
as_tibble()
colnames(LICO_td) <- colnames(LICO_Part_7)
LICO_td
```
## Disaggregating
The next chunk of code will be using the `tempdisagg` package so we can do our disaggregations.
The package does not interface in a "tidy" way, so base-R-style `for()` loops will have to be used to get the desired results.
However, in the chunk following this, I'll also use `map_dfc()` from `purrr` (also part of the Tidyverse) to iterate over the dataframe in a "tidy" way, replacing the `for` loop.
There's probably a way to do the following with the `across()` and `mutate()` functions from `dplyr` as well, but I didn't experiment with that (a rough sketch of the idea follows below).
As stated earlier, since I'm not using any indicator series to perform the disaggregations, our estimated $AR(1)$ regressor will be regressed on a vector of ones instead.
This is accomplished by using the "Denton-Cholette" method.
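As a rough sketch of that idea (code I have not run for the paper), note that because the disaggregated series has twelve times as many rows as the input, a newer `dplyr` (>= 1.1.0) would need `reframe()` in place of `mutate()`:
```{r across-sketch, eval=FALSE}
# Hypothetical alternative to the loop/map_dfc() approaches below:
# disaggregate every city column at once with across() inside reframe().
LICO_td_alt <- LICO_Part_7 %>%
  reframe(across(-Year,
                 ~ td(.x ~ 1,
                      method = "denton-cholette",
                      to = 12,
                      conversion = "mean")$values))
```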
```{r disaggregate}
# A for loop implementation
for(i in seq(ncol(LICO_td) - 1L)) {
LICO_td[[(i+1L)]] <- td(LICO_Part_7[[(i+1)]] ~ 1L,
method = "denton-cholette",
to = 12,
conversion = "mean")$values
}
# A purrr-style implementation
# Also more "tidy"
LICO_Part_7 %>%
select(-Year) %>%
map_dfc( ~ td(.x ~ 1L,
method = "denton-cholette",
to = 12L,
conversion = "mean")$values) %>%
mutate(Year_Month = seq.Date(from = LICO_Part_7$Year %>%
min() %>%
# For ISO 8601 standard, making it safer to parse
paste("01","01",sep = "-") %>%
lubridate::ymd() %>%
as.Date(),
by = "month",
length.out = (nrow(LICO_Part_7) * 12L))) %>%
relocate(Year_Month,
.before = Quebec) -> LICO_td
LICO_td %>% head() %>% kable()
```
## Pivoting Long
Looking at the tail end of our data, we seem to have properly indexed our disaggregated series; now we have to pivot it back into the "tidy" shape we need.
Once again, we will be using the `tidyr` package, but this time we will go in the opposite direction from our previous pivoting and convert columns into *observations*.
```{r pivot longer}
LICO_td_pivot <- LICO_td %>%
pivot_longer(-Year_Month,
names_to = "City",
values_to = "LICO") %>%
relocate(Year_Month, .before = City) %>%
arrange(City,
Year_Month)
LICO_td_pivot %>% head(n = 13L) %>% kable()
```
The column "Year\_Month" I added will be one of the two indices (along with "City") to properly indicate our observations in our dataframe for `plm()` to use.
# Instrumental Variables
We've collected the data on our exogenous regressors, but we still need data for our instrumental variables: Wind Direction and Wind Speed.
Environment Canada has an API set up to pull historical data in bulk, depending on the city, time, or frequency needed.
## Pulling Climate Data
In the following example, I will only pull data for one station ID, for a single year and month, as iterating over multiple station IDs, years, and months takes a significant amount of time; I will, however, provide the `foreach()` loop I wrote to accomplish this in a later code chunk.
```{r pulling climate data}
# First part of the URL needed
URL_first <- "https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID="
# Second part of the URL needed
URL_second <- "&Year="
# Third part of the URL needed
URL_third <- "&Month="
# Fourth part of the URL needed for hourly frequency
URL_fourth <- "&timeframe=1&submit= Download+Data"
# Example station ID in Yellowknife, Northwest Territories
station = 1706
# Example year and month, 2006 and January, respectively
year = 2006
month = 1
# readr's read_csv() does not work with the following code for some reason I don't understand
read.csv(paste0(URL_first,
station,
URL_second,
year,
URL_third,
month,
URL_fourth)) %>%
as_tibble()
```
As you can probably see, this API can be easily iterated over multiple station ID's, years, and months to collect a very large amount of data.
You should not run the following chunk of code, as it takes a very long time to download and the result eats up multiple gigabytes.
It makes use of `foreach()`'s excellent `.combine` argument to determine what kind of format the iterated output will return.
In this case, I want each iteration to append its rows to the previous, returning a single dataframe in the end to use.
It uses a dataframe called `stations` that contains all relevant monitoring station metadata.
It's based off the Station Inventory CSV file that you can grab [here](https://drive.google.com/drive/folders/1WJCDEU34c60IfOnG4rv5EPZ4IhhW9vZH).
```{r not run, eval=FALSE}
data_hourly <- foreach(station = seq_along(stations$station_ID), .combine = "rbind") %:%
  foreach(year = seq(from = stations$first[[station]], to = stations$last[[station]]), .combine = "rbind") %:%
  foreach(month = seq(12), .combine = "rbind") %do% {
    # Build the bulk-download URL for this station/year/month and append the rows
    read.csv(paste0(URL_first, stations$station_ID[[station]], URL_second, year, URL_third, month, URL_fourth))
  }
```
```
## Temporal Aggregation
Since the frequency of this data is too fine (daily/hourly), it must be aggregated up to a monthly frequency.
Once again, `dplyr` comes to the rescue, with `group_by()` and `summarise()` making this fairly simple to do.
```{r silent import, include=FALSE, cache=TRUE, cache.lazy=FALSE}
data_hourly <- data.table::fread("data_hourly.csv") %>% as_tibble()
stations <- read_csv("station_data.csv")
```
```{r aggregation, cache=TRUE}
summarised <- data_hourly %>%
# In case you want to speed up dplyr's operations
lazy_dt() %>%
# We need to get the city names from our stations dataframe so we can group the summarised data by city
inner_join(stations,
by = c("Climate_ID" = "Climate ID")) %>%
filter(!Direction %>% is.na(), # Filtering out any observations where Wind Direction is null
Year >= 1991 # None of the other variables I'll use have data before 1991
) %>%
group_by(City, Year, Month) %>%
summarise(Dir = mean(Direction),
Spd = mean(Speed)) %>%
mutate(Year_Month = Year %>%
paste(Month,"1",sep = "-") %>%
lubridate::ymd() %>%
as.Date()) %>%
ungroup() %>%
select(-Year,
-Month) %>%
# If you've used dbplyr before, then just like dbplyr, dtplyr needs collect() to be put at the end of your call, or the data will not be pulled.
collect() %>%
relocate(Year_Month,
.before = City)
summarised %>% head(n = 12L) %>% kable()
```
# Pollution Data
One of the last pieces of data to obtain is fine particulate matter at 2.5 microns (PM2.5).
This variable is our main endogenous regressor, which we will instrument with the wind data prepared earlier.
This data is available from the National Air Pollution Surveillance (NAPS) program.
The various years' worth of data is available [here](http://data.ec.gc.ca/data/air/monitor/national-air-pollution-surveillance-naps-program/).
The data available at the NAPS website are (in my opinion) pretty poorly catalogued:
- Being behind .ZIP files
- The format of the underlying CSV files changing sporadically, depending on which year you're looking at
As such, I'm not going to go through the full process of tidying and cleaning it up (though a rough sketch of the final consolidation step follows the list below).
I mainly:
- Brought each year's data into Excel and did some manual cleaning
- Consolidated all years' data into one file
- Brought that one file into R
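Purely for illustration, a rough sketch of what that consolidation step could look like in R; the folder and file names here are hypothetical, not the files I actually used:
```{r naps-sketch, eval=FALSE}
# Hypothetical: one manually cleaned CSV per year in a "NAPS_Cleaned" folder
naps_files <- list.files("NAPS_Cleaned", pattern = "\\.csv$", full.names = TRUE)

# Read each year's file and stack the rows into a single dataframe
pollution_raw <- naps_files %>%
  map_dfr(read_csv)
```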
In lieu of the full cleaning code, here is a visualization of pollution levels over time, faceted by City.
```{r silent pollution, include=FALSE}
pollution <- read_csv("Data_Date.csv")
```
```{r poll viz}
pollution %>%
ggplot(aes(Date, Pollution)) +
geom_line() +
facet_wrap( ~ City) +
ylab(expression(paste("PM2.5 (",mu,"g/m"^3,")"))) +
ggtitle("PM2.5 Concentrations over Time by City")
```
# House Price Index Data
Finally, our dependent variable is the Teranet/National Bank of Canada House Price Index (THPI).
This data needs the least preparation, as it's already provided in a "tidy" format and at the frequency needed (monthly).
The only barrier to obtaining it is submitting an email address, first name, and last name.
The data can be found [here](https://housepriceindex.ca/index-history/).
```{r price viz}
pollution %>%
ggplot(aes(Date,Index)) +
geom_line() +
facet_wrap(~City) +
ylab("THPI Index") +
ggtitle("THPI Index Over Time by City")
```
# Initial Regression Models
```{r rename poll, include=FALSE, cache=TRUE}
pollution <- read_csv("Data_Date.csv") %>%
pdata.frame(index = c("City","Date"))
```
Behind the scenes, I've pulled the rest of my data and gone through cleaning, joining, and (dis)aggregating the remaining series.
Now we are at a point where we have a single `pdata.frame` containing:
- Our *Exogenous* variables
- Mainly socioeconomic data from StatsCan
- Our *Endogenous* variable
- Air Pollution (PM2.5)
- Pulled from NAPS
- Our *Instrumental Variables*
- Wind Direction and Wind Speed
- Pulled from Environment Canada
- Our *Dependent* variable
- The Teranet/National Bank of Canada House Price Index (THPI)
- Our *Panel Indices*
- City and Year\_Month
The following code chunk contains some formulae that we'll recycle across different models, which saves me from typing or copy-pasting them over and over again.
After, we will run the following regression models with `plm::plm()`:
- Pooled
- Basically pretending the panel is a single cross-section, just like "regular" OLS
- Fixed-Effects (also called a "Within" model)
- All observations are *completely* temporally de-meaned.
- Does *not* assume orthogonality between the explanatory variables and the unobserved fixed effect
- Therefore, no constant is estimated
- Time-invariant variables are eliminated, and therefore redundant/useless
- The error term **only** includes time-varying/idiosyncratic residuals
- Random-Effects
- All observations are *partially* temporally de-meaned
- *Does* assume orthogonality between explanatory variables and the unobserved time-invariant effect
- The constant is retained
- Time-invariant variables are *not* eliminated
- The error term includes **both** time-invariant, *and* time-varying/idiosyncratic residuals
With that out of the way, I'll briefly present my model and its components:
## Overview of Model(s) Estimated
Our main model of interest is as follows:
$THPI_{it} = \beta_0 + \theta \widehat{Pollution_{it}} + \beta X_{it} + \eta_i + \epsilon_{it}$
Where:
- $\beta_0$ is our time- and panel-invariant constant, which will be included *only* in our Pooled-OLS and Random-Effects models.
- $\beta$ is a $1 \times K$ vector containing the coefficients corresponding to our *exogenous* regressors
- $X_{it}$ is a $N \times K$ matrix containing the data on our exogenous regressors
- $\theta$ is our coefficient for the fitted-values of Pollution
- Fitted values are obtained from the first-stage regression, shown below
- $\widehat{Pollution_{it}}$ is the vector of fitted values from regressing Pollution on our exogenous and instrumental regressors in the first-stage regression
- $\eta_i$ is our time-invariant error that will theoretically be eliminated in our Fixed-Effects/Within model
- $\epsilon_{it}$ is our time-varying/idiosyncratic error
The first-stage regression is as follows:
$Pollution_{it} = \alpha_0 + \alpha Z_{it} + \upsilon_i + \omega_{it}$
Where:
- $Z_{it}$ is a $N \times L$ (where $L \geq K$) matrix containing data on our exogenous regressors *and* our instruments
- $\alpha$ is a $1 \times L$ vector containing the coefficients for our instruments and exogenous regressors
- $\alpha_0$ is our time- and panel-invariant constant, which is included *only* in our Pooled-OLS and Random-Effects models
- $\upsilon_i$ is our time-invariant error that will theoretically be eliminated in our Fixed-Effects/Within model
- $\omega_{it}$ is our time-varying/idiosyncratic error
## Formulae & OLS
```{r ols, results='asis'}
# Some formulae that we'll come back to as we explore some initial models for fit
# Regular OLS
form_ols <- Index ~ Pollution + Homicide + Income + LowInc + Rent + Unemp + Vacancy + Pop
# IV formula with both Direction and Speed
form_iv <- Index ~ Pollution + Homicide + Income + LowInc + Rent + Unemp + Vacancy + Pop | Homicide + Income + LowInc + Rent + Unemp + Vacancy + Pop + Direction + Speed
# IV formula with just direction
form_iv_dir <- Index ~ Pollution + Homicide + Income + LowInc + Rent + Unemp + Vacancy + Pop | Homicide + Income + LowInc + Rent + Unemp + Vacancy + Pop + Direction
# IV formula with just speed
form_iv_spd <- Index ~ Pollution + Homicide + Income + LowInc + Rent + Unemp + Vacancy + Pop | Homicide + Income + LowInc + Rent + Unemp + Vacancy + Pop + Speed
pooled_ols <- plm(form_ols,
model = "pooling",
effect = "individual",
data = pollution)
random_ols <- plm(form_ols,
model = "random",
effect = "individual",
random.method = "amemiya",
data = pollution)
fixed_ols <- plm(form_ols,
model = "within",
effect = "individual",
data = pollution)
stargazer(pooled_ols, fixed_ols, random_ols, type = "html")
```
From left to right our models are as follows:
- Pooled OLS
- Fixed-Effects
- Random-Effects
So running our regressions *without* any instrumental variables yields a Pollution coefficient that is not only the wrong sign (positive rather than the expected negative), but also negligible in magnitude (very close to zero) and statistically insignificant in the latter two models.
This improves only marginally as we introduce fixed and random effects into our models.
Our adjusted-$R^2$ improves significantly, however.
Let's see if introducing our two instruments improves this situation at all.
## IV Models
```{r iv, results='asis'}
pooled_iv <- plm(form_iv,
model = "pooling",
effect = "individual",
data = pollution)
random_iv <- plm(form_iv,
model = "random",
effect = "individual",
random.method = "amemiya",
data = pollution)
fixed_iv <- plm(form_iv,
model = "within",
effect = "individual",
data = pollution)
stargazer(pooled_iv, fixed_iv, random_iv, type = "html")
```
The ordering of models is the same as previous.
What a stark contrast to before:
- In all three models, the Pollution coefficient has the sign we want (negative), and is meaningful in both magnitude *and* statistical significance.
- Especially after introducing Fixed/Random effects
- Our Homicide Rate and Unemployment regressors are the correct sign in the latter two models
What's disappointing is that our Vacancy coefficient is supposed to be negative and our Income coefficient positive, but neither comes out with its expected sign.
- This may be due to some caveats with regards to the data *and* the models, as I'll discuss later.
Let's move on to some diagnostic tests and further discussion/selection of our models.
# Diagnostics and Further Refinements
This section will contain various tests about our models to see which one has superior properties or explanatory power.
## F-Test for Effects
The first test we'll conduct is that of testing for individual effects in our models.
Unlike a fixed-effects specification, our Pooled-OLS model includes none of the individual effects that our different Cities might introduce into the model.
This first F-test examines whether introducing City effects changes the model enough to warrant including them.
```{r F test}
pFtest(fixed_iv, pooled_iv)
```
We easily reject the null hypothesis of no individual effects in our data.
Therefore, we are justified in including City-level effects, and abandoning our Pooled-OLS model.
## Hausman Test
The next test we'll conduct is that of the Durbin-Wu-Hausman test.
This tests whether both of our Fixed-Effects and Random-Effects models are consistent or not.
The null hypothesis is that both models are consistent, but Random-Effects is more efficient.
The alternative hypothesis is that only our Fixed-Effects model is consistent.
```{r Hausman}
phtest(fixed_iv, random_iv)
```
We fail to reject the null, and therefore conclude that while both Fixed and Random-Effects are consistent, our Random-Effects model is the more efficient.
So we seem to have rejected both our Pooled-OLS model (due to the significant effects Cities have on the model) and our Fixed-Effects model (since Random-Effects is more efficient).
Let's double-check with one final test to see if Random-Effects is superior to our Pooled model.
## Breusch-Pagan LM Test
This final test is the Breusch-Pagan Lagrange Multiplier test.
This test is similar to the F-test we conducted earlier; however, that test compared our Pooled-OLS model to our Within model, *not* our Pooled-OLS model to our Random-Effects model.
The null hypothesis states that there are no panel effects in our model.
```{r bp test}
plmtest(pooled_iv, type="bp")
```
Like before, we reject the null and find that there are significant individual effects in our model to warrant including them.
We'll continue on with our Random-Effects-IV model as the model to beat.
# Instrumental Variable (IV) Diagnostics
In order for our IV models to be consistent, we need to examine the extent to which:
- Our instruments are relevant *and* strong enough
- Our suspected endogenous variable is **actually** endogenous
- Our over-identifying restrictions are valid, given that we have more instruments than endogenous variables
A lot of this testing is fairly easy in a single cross-section context with `AER::summary.ivreg(diagnostics = TRUE)`; however, `AER`'s `ivreg()` cannot automatically incorporate fixed or random effects into the model.
A lot of the following code is replicating the code for the `diagnostics=TRUE` argument found in `AER::summary.ivreg()` to test the "IV" aspects of our Random-Effects-Instrumental-Variable (REIV) model.
Someone could probably write a wrapper that provides the `AER::summary.ivreg(diagnostics = TRUE)` functionality for panel-data models constructed with `plm::plm()`.
Let's proceed.
## F-Testing our Instruments
The first step in this process is to conduct an F-test for potentially weak/irrelevant instruments.
```{r weak instruments, results='asis'}
# Wald test on our first stage regression for instrument validity
# Formula for our first-stage regression WITH our instruments
form_first <- Pollution ~ Homicide + Income + LowInc + Rent + Unemp + Vacancy + Pop + Direction + Speed
# Formula for our first-stage regression WITHOUT our instruments
form_first_no_inst <- Pollution ~ Homicide + Income + LowInc + Rent + Unemp + Vacancy + Pop
# Formula for our first-stage regression with only the Wind Direction instrument
form_first_dir <- Pollution ~ Homicide + Income + LowInc + Rent + Unemp + Vacancy + Pop + Direction
# Formula for our first-stage regression with only the Wind Speed instrument
form_first_spd <- Pollution ~ Homicide + Income + LowInc + Rent + Unemp + Vacancy + Pop + Speed
first_stage <- plm(form_first,
model = "random",
random.method = "amemiya",
data = pollution)
first_stage_no_inst <- plm(form_first_no_inst,
model = "random",
random.method = "amemiya",
data = pollution)
first_stage_dir <- plm(form_first_dir,
model = "random",
random.method = "amemiya",
data = pollution)
first_stage_spd <- plm(form_first_spd,
model = "random",
random.method = "amemiya",
data = pollution)