forked from NickCH-K/CausalitySlides
-
Notifications
You must be signed in to change notification settings - Fork 2
/
Lecture_15_Instrumental_Variables.Rmd
385 lines (295 loc) · 15 KB
/
Lecture_15_Instrumental_Variables.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
---
title: "Lecture 15 Instrumental Variables"
author: "Nick Huntington-Klein"
date: "March 20, 2019"
output:
revealjs::revealjs_presentation:
theme: solarized
transition: slide
self_contained: true
smart: true
fig_caption: true
reveal_options:
slideNumber: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)
library(tidyverse)
library(dagitty)
library(ggdag)
library(gganimate)
library(ggthemes)
library(Cairo)
library(fixest)
library(modelsummary)
theme_set(theme_gray(base_size = 15))
```
## Recap
- We've covered quite a few methods for isolating causal effects!
- Controlling for variables to close back doors (explain X and Y with the control, remove what's explained)
- Matching on variables to close back doors (find treated and non-treated observations with )
- Using a control group to control for time (before/after difference for treated and untreated, then difference them)
- Using a cutoff to construct a very good control group (treated/untreated difference near a cutoff)
## Today
- We've got ONE LAST METHOD to go deep on!
- Today we'll be covering *instrumental variables*
- The basic idea is that we have some variable - the instrumental variable - that causes `X` but has no other open back doors!
## Natural Experiments
- This calls back to our idea of trying to mimic an experiment without having an experiment. In fact, let's think about an actual randomized experiment.
- We have some random assignment `R` that determines your `X`. So even though we have back doors between `X` and `Y`, we can identify `X -> Y`
```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=3}
dag <- dagify(Y~X+W,
X~W,
X~R,
coords=list(
x=c(X=1,Y=2,W=1.5,R=0),
y=c(X=1,Y=1,W=2,R=1)
)) %>% tidy_dagitty()
ggdag_classic(dag,node_size=20) +
theme_dag_blank()
```
## Natural Experiments
- The idea of instrumental variables is this:
- What if we can find a variable that can take the place of R in the diagram despite not actually being something we randomized in an experiment?
- If we can do that, we've clearly got a "natural experiment"
- When we find a variable that can do that, we call it an "instrument" or "instrumental variable"
- Let's call it `Z`
## Instrumental Variable
So, for `Z` take the place of `R` in the diagram, what do we need?
- `Z` must be related to `X` (typically `Z -> X` but not always)
- There must be *no open paths* from `Z` to `Y` *except for ones that go through `X`*
In other words "`Z` is related to `X`, and all the effect of `Z` on `Y` goes THROUGH `X`"
## Instrumental Variable
- This doesn't relieve us of the duty of identifying a causal effect by closing back doors
- But it *moves* that duty from the endogenous variable to the instrument, which potentially is easier to identify
- (and then adds on the additional requirement that there are also no open *front* doors from $Z$ to $Y$ except through $X$ )
## Instrumental Variable
How?
- Explain `X` with `Z`, and keep only what *is* explained, `X'`
- Explain `Y` with `Z`, and keep only what *is* explained, `Y'`
- [If `Z` is logical/binary] Divide the difference in `Y'` between `Z` values by the difference in `X'` between `Z` values
- [If `Z` is not logical/binary] Get the correlation between `X'` and `Y'`
## Estimation
- We will be doing this mostly by hand today (until the end part) but most commonly this is estimated using *two stage least squares*
- We basically just do what we described on the last slide:
1. Use the instruments and controls to explain $X$ in the first stage
1. Use the controls and the predicted (explained) part of $X$ in place of $X$ in the second stage
1. (do some standard error adjustments)
Many ways to do this in R, we'll be doing 2SLS with `feols()` from **fixest**
## Graphically
```{r, dev='CairoPNG', echo=FALSE, fig.width=8,fig.height=7}
df <- data.frame(Z = as.integer(1:200>100),
W = rnorm(200)) %>%
mutate(X = .5+2*W +2*Z+ rnorm(200)) %>%
mutate(Y = -X + 4*W + 1 + rnorm(200),time="1") %>%
group_by(Z) %>%
mutate(mean_X=mean(X),mean_Y=mean(Y),YL=NA,XL=NA) %>%
ungroup()
#Calculate correlations
before_cor <- paste("1. Raw data. Correlation between X and Y: ",round(cor(df$X,df$Y),3),sep='')
afterlab <- '6. The slope between points is the effect of X on Y.'
dffull <- rbind(
#Step 1: Raw data only
df %>% mutate(mean_X=NA,mean_Y=NA,time=before_cor),
#Step 2: Add x-lines
df %>% mutate(mean_Y=NA,time='2. What differences in X are explained by Z?'),
#Step 3: X de-meaned
df %>% mutate(X = mean_X,mean_Y=NA,time="3. Remove everything in X not explained by Z"),
#Step 4: Remove X lines, add Y
df %>% mutate(X = mean_X,mean_X=NA,time="4. What differences in Y are explained by Z?"),
#Step 5: Y de-meaned
df %>% mutate(X = mean_X,Y = mean_Y,mean_X=NA,time="5. Remove everything in Y not explained by Z"),
#Step 6: Raw demeaned data only
df %>% mutate(X = mean_X,Y =mean_Y,mean_X=NA,mean_Y=NA,YL=mean_Y,XL=mean_X,time=afterlab))
#Get line segments
endpts <- df %>%
group_by(Z) %>%
summarize(mean_X=mean(mean_X),mean_Y=mean(mean_Y))
p <- ggplot(dffull,aes(y=Y,x=X,color=as.factor(Z)))+geom_point()+
geom_vline(aes(xintercept=mean_X,color=as.factor(Z)))+
geom_hline(aes(yintercept=mean_Y,color=as.factor(Z)))+
guides(color=guide_legend(title="Z"))+
geom_segment(aes(x=ifelse(time==afterlab,endpts$mean_X[1],NA),
y=endpts$mean_Y[1],xend=endpts$mean_X[2],
yend=endpts$mean_Y[2]),size=1,color='blue')+
scale_color_colorblind()+
labs(title = 'X -> Y, With Binary Z as an Instrumental Variable \n{next_state}')+
transition_states(time,transition_length=c(6,16,6,16,6,6),state_length=c(50,22,12,22,12,50),wrap=FALSE)+
ease_aes('sine-in-out')+
exit_fade()+enter_fade()
animate(p,nframes=175)
```
## Instrumental Variables
- Notice that this whole process is like the *opposite* of controlling for a variable
- We explain `X` and `Y` with the variable, but instead of tossing out what's explained, we ONLY KEEP what's explained!
- Instead of saying "you're on a back door, I want to close you" we say "you have no back doors! I want my `X` to be just like you! I'm only keeping that part of `X` that's explained by you!"
- Since `Z` has no back doors, the part of `X` explained by `Z` has no back doors to the part of `Y` explained by `Z`
## Imperfect Assignment
- Let's apply one of the common uses of instrumental variables, which actually *is* when you have a randomized experiment
- In normal circumstances, if we have an experiment and assign people with `R`, we just compare `Y` across values of `R`:
```{r, echo=TRUE}
df <- tibble(R = sample(c(0,1),500,replace=T)) %>%
mutate(X = R, Y = 5*X + rnorm(500))
#The truth is a difference of 5
df %>% group_by(R) %>% summarize(Y=mean(Y))
```
## Imperfect Assignment
- But what happens if you run a randomized experiment and assign people with `R`, but not everyone does what you say? Some "treated" people don't get the treatment, and some "untreated" people do get it
- When this happens, we can't just compare `Y` across `R`
- But `R` is still a valid instrument!
## Imperfect Assignment
```{r, echo=TRUE}
df <- tibble(R = sample(c(0,1),500,replace=T)) %>%
#We tell them whether or not to get treated
mutate(X = R) %>%
#But some of them don't listen! 20% do the OPPOSITE!
mutate(X = ifelse(runif(500) > .8,1-R,R)) %>%
mutate(Y = 5*X + rnorm(500))
#The truth is a difference of 5
df %>% group_by(R) %>% summarize(Y=mean(Y))
```
## Imperfect Assignment
- So let's do IV (instrumental variables); `R` is the IV.
```{r, echo=TRUE}
iv <- df %>% group_by(R) %>% summarize(Y = mean(Y), X = mean(X))
iv
#Remember, since our instrument is binary, we want the slope
(iv$Y[2] - iv$Y[1])/(iv$X[2]-iv$X[1])
#Truth is 5!
```
## Another Example
- Justifying that an IV has no back doors can be hard!
- Usually things aren't as clean-cut as having actual randomization
- And sometimes we may have to add controls in order to justify the IV
- Think hard - are there really no other paths from `Z` to `Y`?
- This will often require *detailed contextual knowledge* of the data generating process
## Pollution and Driving
- If air quality is really bad, you may choose to drive instead of walk/bike/bus in order to avoid breathing it
- So do particularly smoggy days lead people to drive more?
- Pan He and Cheng Xu ask this question using Shanghai as an example!
## Pollution and Driving
- Plenty of back doors - seasons, whether factories are running, smog levels last week...
```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=4.5}
dag <- dagify(Drive~Smog+LastWk+Factry+Season,
Smog~LastWk+Factry+Season,
LastWk~Season,
coords=list(
x=c(Smog=1,LastWk=3,Factry=2,Season=1,Drive=3),
y=c(Smog=1,LastWk=2,Factry=1.85,Season=2,Drive=1)
)) %>% tidy_dagitty()
ggdag_classic(dag,node_size=20) +
theme_dag_blank()
```
## Pollution and Driving
- The *direction of the wind* could be an IV - Shanghai faces the water, and so when the wind blows West, it brings pollution into the city
```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=4.5}
dag <- dagify(Drive~Smog+LastWk+Factry+Season,
Smog~LastWk+Factry+Season+Wind,
LastWk~Season,
Wind~Season,
coords=list(
x=c(Smog=1,LastWk=3,Factry=2,Season=1,Drive=3,Wind=0),
y=c(Smog=1,LastWk=2,Factry=1.85,Season=2,Drive=1,Wind=1)
)) %>% tidy_dagitty()
ggdag_classic(dag,node_size=20) +
theme_dag_blank()
```
## Pollution and Driving
- This gives us an IV we can use!
- Of course, we need to control for Season to block out the back door.
- The authors do indeed find that additional smog, brought in by the wind, increases the number of people who choose to drive - making the problem worse later!!
## Trade and Manufacturing
- Another example: did Chinese imports reduce US manufacturing employment?
- Employment in the US manufacturing sector has been dropping for decades
- (note - *manufacturing itself* isn't dropping, we're manufacturing more than ever, we're just doing it without as many actual people)
## Trade and Manufacturing
- The timing of the drop in manufacturing jobs coincides with us importing a lot more Chinese stuff
- But did the Chinese imports *cause* the decline or was it a coincidence? Automation is another good explanation!
- Or general declining US competitiveness in the global market, vs. everyone (not just China)
## Trade and Manufacturing
- Autor, Dorn, & Hanson use Chinese exports *to other countries* (`CEXoth`) as an IV for Chinese exports *to the US* (`CEXus`) in order to estimate the impact of Chinese exports *to the US* on US manufacturing employment (`mfg`)
- Let's think about whether this makes sense as an IV - any back doors from `CEXoth` to `mfg` we can imagine? Or front doors that don't go through `CEXus`?
- Also, important, do we think that the arrow from `CEXoth` to `CEXus` is actually there?
## Trade and Manufacturing
`D` is global demand for US manufactures, `L` is US labor supply, `Close` measures how similar the kinds of things the US manufactures are to China manufactures
```{r, dev='CairoPNG', echo=FALSE, fig.width=8,fig.height=4}
dag <- dagify(CEXus~CEXoth+D+L+Close,
mfg~CEXus+D+L+Close,
D~U1+L,
CEXoth~U1,
coords=list(
x=c(CEXoth=0,CEXus=1,D=1,L=2,Close=3,mfg=3.5,U1=.5),
y=c(CEXoth=0,CEXus=0,D=1,L=1,Close=1,mfg=0,U1=.5)
)) %>% tidy_dagitty()
ggdag_classic(dag,node_size=20) +
theme_dag_blank()
```
## Trade and Manufacturing
- So we need to control for `D` in some way to close `CEXoth <- U1 -> D -> mfg` but other than that we have a good instrument
- (they do this, and also use information like "what does `Close` look like on a regional level?" to improve their estimate)
- Autor, Dorn, & Hanson found that Chinese exports elsewhere predicted them in the US (China was opening up and becoming more effective as a producer, making their products attractive everywhere)
## Trade and Manufacturing
- And when you limit `mfg` and `CEXus` to just what's explained by `CEXoth`, you do see that some decline in `mfg` is because of Chinese imports
![Direct effect of instruments on `mfg`](Lecture_15_AutorDornHanson.png)
## Practice
- Does the price of cigarettes affect smoking? Get AER package and data(CigarettesSW). Examine with help().
- Get JUST thecigarette taxes `cigtax` from `taxs-tax`
- Draw a causal diagram using `packs`, `price`, `cigtax`, and some back door `W`. What might `W` be?
- Adjust `price` and `cigtax` for inflation: divide them by `cpi`
- Explain `price` and `packs` with `cigtax` using `cut(,breaks=7)` for `cigtax`
- Get correlation between the explained parts and plot the explained parts - does price reduce packs smoked?
## Practice Answers
```{r, echo=TRUE}
library(AER)
data(CigarettesSW)
CigarettesSW <- CigarettesSW %>%
mutate(cigtax = taxs-tax) %>%
mutate(price = price/cpi,
cigtax = cigtax/cpi) %>%
group_by(cut(cigtax,breaks=7)) %>%
summarize(priceexp = mean(price),
packsexp = mean(packs)) %>%
ungroup()
cor(CigarettesSW$priceexp,CigarettesSW$packsexp)
```
## Practice Answers Plot
```{r, echo=TRUE, fig.width=6,fig.height=4}
plot(CigarettesSW$priceexp,CigarettesSW$packsexp)
```
## Practice Diagram Answers
```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=4}
dag <- dagify(price~cigtax+W,
packs~price+W,
coords=list(
x=c(packs=2,price=1,cigtax=0,W=1.5),
y=c(packs=1,price=1,cigtax=1,W=2)
)) %>% tidy_dagitty()
ggdag_classic(dag,node_size=20) +
theme_dag_blank()
```
## Practice - Doing it with Regression!
- Common 2SLS estimators: `ivreg` in **AER**, `iv_robust` in **estimatr**, and `feols()` in **fixest**. We'll use the latter since it's fast easy to combine with fixed effects and all kinds of error adjustments
```{r, echo = TRUE, eval = FALSE}
m <- feols(Y ~ controls | X ~ Z, data = data)
m <- feols(Y ~ controls | fixed_effects | X ~ Z, data = data, se = 'hetero')
```
## Practice - Doing it with Regression
- Reload the cigarette data and skip the summarize step
- Run our cigarette analysis first doing 2SLS by hand - use `lm()` to run the first stage, then replace `price` with `predict(m)` in the second stage
- Then use `feols()` to do the same (use 1 to indicate no controls). Coefficients should be the same but the standard errors will be corrected in the `feols()` version!
- Show both results in `msummary()`
## Practice - Doing it with Regression
```{r, echo = TRUE}
data(CigarettesSW)
CigarettesSW <- CigarettesSW %>%
mutate(cigtax = taxs-tax) %>%
mutate(price = price/cpi,
cigtax = cigtax/cpi)
first_stage <- lm(price~cigtax, data = CigarettesSW)
second_stage <- lm(packs ~ predict(first_stage), data = CigarettesSW)
package <- feols(packs ~ 1 | price ~ cigtax, data = CigarettesSW)
```
## Practice - Doing it with Regression
```{r, echo = TRUE}
msummary(list(second_stage, package), stars = TRUE, gof_omit = 'AIC|BIC|Lik|F|R2')
```