forked from NickCH-K/CausalitySlides
-
Notifications
You must be signed in to change notification settings - Fork 2
/
Lecture_12_Regression_Discontinuity.Rmd
358 lines (288 loc) · 14 KB
/
Lecture_12_Regression_Discontinuity.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
---
title: "Lecture 12 Regression Discontinuity"
author: "Nick Huntington-Klein"
date: "`r Sys.Date()`"
output:
revealjs::revealjs_presentation:
theme: solarized
transition: slide
self_contained: true
smart: true
fig_caption: true
reveal_options:
slideNumber: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)
library(tidyverse)
library(dagitty)
library(ggdag)
library(gganimate)
library(ggpubr)
library(ggthemes)
library(Cairo)
theme_set(theme_gray(base_size = 15))
```
## Recap
- We've been going over ways in which we can use control groups to isolate causal effects
- We can select similar control groups using matching or controlling (what economists call "selection on observables")
- We can use a treated group at a different time as its own control group with fixed effects
- When a treatment is applied at a particular time, we can select a reasonable control to account for the effects of time using difference-in-difference (a "natural experiment")
## Today
- We're going to go over one other way in which we can find and isolate a very convincing control group
- Like DID, it's also an example of a natural experiment
- *Regression discontinuity*
## Regression Discontinuity
- For regression discontinuity to work, we need the Treatment to be assigned based on a *cutoff* of what's called a "running variable"
- For example, imagine we want to know the effects of being in a Gifted and Talented (GATE) program on your adult earnings
- Being admitted to the program is based on your test score (running variable)
- If you score above 75, you're in the program. 75 or below, you're out!
## Regression Discontinuity
- Notice that the y-axis here is *In GATE*, not the outcome
```{r, echo=FALSE, eval=TRUE, fig.width=7, fig.height=5.5}
rdd <- tibble(test = runif(300)*100) %>%
mutate(GATE = test >= 75,
above = test >= 75) %>%
mutate(earn = runif(300)*40+10*GATE+test/2)
ggplot(rdd,aes(x=test,y=GATE))+geom_point()+
geom_vline(aes(xintercept=75),col='red')+
theme_pubr()
labs(x='Test Score',
y='In GATE')
```
## Regression Discontinuity
- Here's how it look when we look at the actual outcome
```{r, echo=FALSE, eval=TRUE, fig.width=7, fig.height=5.5}
ggplot(rdd,aes(x=test,y=earn,color=GATE))+geom_point()+
theme_pubr() +
geom_vline(aes(xintercept=75),col='red')+
labs(x='Test Score',
y='Earnings')
```
## Regression Discontinuity
- Now, we have a bit of a problem!
- If we look at the relationship between treatment and going to college, we'll be picking up the fact that higher test scores make you more likely to go to college anyway
```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=4}
dag <- dagify(earn~GATE+Test,
GATE~Test,
coords=list(
x=c(earn=3,GATE=1,Test=2.5),
y=c(earn=1,GATE=1.5,Test=2)
)) %>% tidy_dagitty()
ggdag_classic(dag,node_size=20) +
theme_dag_blank()
```
## Regression Discontinuity
- Except, that's not actually what the diagram looks like! Test only affects GATE to the extent that it makes you be above the 90 cutoff!
```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=4}
dag <- dagify(earn~GATE+Test,
Above~Test,
GATE~Above,
coords=list(
x=c(earn=3,GATE=1,Test=2.5,Above=1.75),
y=c(earn=1,GATE=1.5,Test=2,Above=1.75)
)) %>% tidy_dagitty()
ggdag_classic(dag,node_size=20) +
theme_dag_blank()
```
## Regression Discontinuity
- What can we do with that information?
- Well, imagine that we looked at the area *just around* the cutoff
- Say, the cutoff is 75, so we look at 73 to 77
- *Within that group*, it's basically random whether you fall on one side of the line or another
## Regression Discontinuity
- Someone with a 75 is, on average, almost exactly the same as someone with a 76, except that one got the treatment and the other didn't!
- Heck, that tiny test score difference could be due to just having a bad day before the test
- So we have two groups - the just-barely-missed-outs and the just-barely-made-its, that are basically exactly the same except that one happened to get treatment
- A perfect description of what we're looking for in a control group!
## Regression Discontinuity
- So we look directly around the cutoff, and compare just below to just above.
- This is our way of controlling for test score and closing the `GATE <- Above <- Test -> earn` back door
- Why not just control for `Test` in the normal way?
- Because if we really think that, right around the cutoff, it's random whether you're on one side or the other, we don't just close the `Test` back door, we have effectively random assignment, like an experiment!
- We're not just closing the `Test` back door, we're closing *all* back doors
## In Practice
```{r, echo=TRUE, eval=FALSE}
rdd.data <- tibble(test = runif(1000)*100) %>%
mutate(GATE = test >= 75) %>% mutate(earn = runif(1000)*40+10*GATE+test/2)
#Choose a "bandwidth" of how wide around the cutoff to look (arbitrary in our example)
#Bandwidth of 2 with a cutoff of 75 means we look from 75-2 to 75+2
bandwidth <- 2
#Just look within the bandwidth
rdd <- rdd.data %>% filter(abs(75-test) < bandwidth) %>%
#Create a variable indicating we're above the cutoff
mutate(above = test >= 75) %>%
#And compare our outcome just below the cutoff to just above
group_by(above) %>% summarize(earn = mean(earn))
rdd
#Our effect looks just about right (10 is the truth)
rdd$earn[2] - rdd$earn[1]
```
```{r, echo=FALSE, eval=TRUE}
set.seed(1000)
rdd.data <- tibble(test = runif(1000)*100) %>%
mutate(GATE = test >= 75) %>% mutate(earn = runif(1000)*40+10*GATE+test/2)
#Choose a "bandwidth" of how wide around the cutoff to look (arbitrary in our example)
#Bandwidth of 2 with a cutoff of 75 means we look from 75-2 to 75+2
bandwidth <- 2
#Just look within the bandwidth
rdd <- rdd.data %>% filter(abs(75-test) < bandwidth) %>%
#Create a variable indicating we're above the cutoff
mutate(above = test >= 75) %>%
#And compare our outcome just below the cutoff to just above
group_by(above) %>% summarize(earn = mean(earn))
rdd
#Our effect looks just about right (10 is the truth)
rdd$earn[2] - rdd$earn[1]
```
## Graphically
```{r, dev='CairoPNG', echo=FALSE, fig.width=8,fig.height=7}
df <- data.frame(xaxisTime=runif(300)*20) %>%
mutate(Y = .2*xaxisTime+3*(xaxisTime>10)-.1*xaxisTime*(xaxisTime>10)+rnorm(300),
state="1",
groupX=floor(xaxisTime)+.5,
groupLine=floor(xaxisTime),
cutLine=rep(c(9,11),150)) %>%
group_by(groupX) %>%
mutate(mean_Y=mean(Y)) %>%
ungroup() %>%
arrange(groupX)
dffull <- rbind(
#Step 1: Raw data only
df %>% mutate(groupLine=NA,cutLine=NA,mean_Y=NA,state='1. Start with raw data.'),
#Step 2: Add Y-lines
df %>% mutate(cutLine=NA,state='2. What differences in Y are explained by Running Variable?'),
#Step 3: Collapse to means
df %>% mutate(Y = mean_Y,state="3. Keep only what's explained by the Running Variable."),
#Step 4: Zoom in on just the cutoff
df %>% mutate(mean_Y = ifelse(xaxisTime > 9 & xaxisTime < 11,mean_Y,NA),Y=ifelse(xaxisTime > 9 & xaxisTime < 11,mean_Y,NA),groupLine=NA,state="4. Focus just on what happens around the cutoff."),
#Step 5: Show the effect
df %>% mutate(mean_Y = ifelse(xaxisTime > 9 & xaxisTime < 11,mean_Y,NA),Y=ifelse(xaxisTime > 9 & xaxisTime < 11,mean_Y,NA),groupLine=NA,state="5. The jump at the cutoff is the effect of treatment."))
p <- ggplot(dffull,aes(y=Y,x=xaxisTime))+geom_point()+
geom_vline(aes(xintercept=10),linetype='dashed')+
geom_point(aes(y=mean_Y,x=groupX),color="red",size=2)+
geom_vline(aes(xintercept=groupLine))+
geom_vline(aes(xintercept=cutLine))+
geom_segment(aes(x=10,xend=10,
y=ifelse(state=='5. The jump at the cutoff is the effect of treatment.',
filter(df,groupLine==9)$mean_Y[1],NA),
yend=filter(df,groupLine==10)$mean_Y[1]),size=1.5,color='blue')+
scale_color_colorblind()+
scale_x_continuous(
breaks = c(5, 15),
label = c("Untreated", "Treated")
)+xlab("Running Variable")+
labs(title = 'The Effect of Treatment on Y using Regression Discontinuity \n{next_state}')+
transition_states(state,transition_length=c(6,16,6,16,6),state_length=c(50,22,12,22,50),wrap=FALSE)+
ease_aes('sine-in-out')+
exit_fade()+enter_fade()
animate(p,nframes=175)
```
## Example: Corporate Social Responsibility
- Corporate Social Responsibility (CSR) is when corporations engage in the kind of behavior that nonprofits usually do - community outreach, charity, etc.
- Is this good for the corporation? Or would it make more sense to just send the money they spend to actual nonprofits if they just want to do good?
- This is a causal question
## Example: Corporate Social Responsibility
- Convenient for our purposes, CSR policies are voted on by shareholder boards
- If a board votes 49% in favor, it fails. 51% in favor? It passes!
- Sounds like a regression discontinuity to me!
- "Close votes" is a common application of regression discontinuity
## Example: Corporate Social Responsibility
- So how do CSR policy announcements affect stock prices?
```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=5}
dag <- dagify(price~vote+CSR,
win~vote,
CSR~win,
coords=list(
x=c(price=3,CSR=1,vote=2.5,win=1.75),
y=c(price=1,CSR=1.5,vote=2,win=1.75)
)) %>% tidy_dagitty()
ggdag_classic(dag,node_size=20) +
theme_dag_blank()
```
## Example: Corporate Social Responsbility
- Caroline Flammer studies this topic
- Looking at the "abnormal return" (stock price return minus what's expected given the market) comparing CSR votes that just won vs. CSR votes that just lost
- So what should we do?
- Focus just around the cutoff and compare abnormal returns just above and just below.
## Example: Corporate Social Responsibility
![Flammer (2015) Management Science](Lecture_12_Flammer.png)
## Example: Corporate Social Responsibility
- Looks like stock returns increase by about .02, comparing CSRs that just lost to just won!
- Seems like the market likes seeing those CSRs and values them
- And all those things that we might expect to correlate with both stock price growth and CSRs - tech-savvy, youthful leadership, etc., we've closed those back doors too!
## Balance
- Have we really closed those back doors?
- One thing that's so great about RDD is that, since it's basically random whether you're on one side of the cutoff or another, there shouldn't be other back doors
- It's a form of within variation that's *so narrow* it basically closes everything
- We can check this by seeing if other variables differ on either side of the line
- This is our way of testing our diagram - if our diagram is true, then `above` should have no relationship with any back door variable after focusing around the cutoff
## Balance
```{r, echo=TRUE}
rdd.data <- tibble(test = runif(500)*100) %>%
mutate(backdoor=rnorm(500)+test/50) %>% mutate(GATE = test + backdoor >= 75) %>%
mutate(earn = runif(500)*40+10*GATE+5*backdoor+test/2)
bandwidth <- 2
rdd <- rdd.data %>% filter(abs(75-test) < bandwidth) %>%
#Create a variable indicating we're above the cutoff
mutate(above = test >= 75) %>%
#And compare our outcome just below the cutoff to just above
group_by(above) %>% summarize(backdoor = mean(backdoor))
rdd
#Not a lot of difference!
rdd$backdoor[2] - rdd$backdoor[1]
```
## Balance
- Notice there's NO real difference here, indicating that we've closed that back door
```{r, echo=FALSE, fig.width=7,fig.height=5}
rdgrph <- rdd.data %>%
mutate(bandwidth = abs(75-test) <= 2,
above = test >= 75) %>%
group_by(cut(test,breaks=(0:49*2+1))) %>%
mutate(meanback = mean(backdoor)) %>%
ungroup()
ggplot(filter(rdgrph,bandwidth==1),aes(x=test,y=backdoor))+geom_point(col='blue')+
geom_vline(aes(xintercept=75),col='red')+
geom_vline(aes(xintercept=73),col='red',linetype='dashed',alpha=.6)+
geom_vline(aes(xintercept=77),col='red',linetype='dashed',alpha=.6)+
geom_point(data=filter(rdgrph,bandwidth==0),aes(x=test,y=backdoor),alpha=.3)+
geom_step(data=rdgrph,aes(x=test,y=meanback),col='red',size=1)+
theme_pubr()
labs(x='Test Score',
y='Backdoor Variable')
```
## Summing Up
- We've covered four main methods of making comparisons as close as possible
- Controlling and matching both take a set of measured variables and adjust so you're looking at variation within those variables
- Difference-in-difference takes a chosen comparison group and uses it to adjust for changes over time in your treated group of interest
- Regression discontinuity uses a cutoff in a running variable to identify a treated and nontreated group that are basically randomly assigned
## Summing Up
- Next time we'll be doing regression discontinuity using... actual regression
- We already have a solid idea of how it works, now let's see how it's really done
## Practice
- Does winning help your party stay in power *30 years* later?
- Install and load the `politicaldata` package, and load `data(house_results)`
- Create tibbles `hr76` and `hr16` with only 1976 and 2016
- Create `repadv76` equal to `rep` vote minus `dem` for 1976, and filter only to those with `!is.na(repadv75)`
- Create `repwins16` equal to `rep > dem` for 2016, and filter `!is.na(repwins16)`
- `select()` only `district`,`repadv76`, `repwins16`, and `inner_join()` the two data sets
- Compare `repwins16` mean above and below `repadv76=0` with a bandwidth of .04
## Practice Answers
```{r, echo=TRUE}
#install.packages('politicaldata')
library(politicaldata)
data(house_results)
hr76 <- filter(house_results,year==1976) %>%
mutate(repadv76 = rep - dem) %>%
filter(!is.na(repadv76)) %>%
select(district,repadv76)
hr16 <- filter(house_results,year==2016) %>%
mutate(repwins16 = rep > dem) %>%
filter(!is.na(repwins16)) %>%
select(district,repwins16)
fulldata <- inner_join(hr76,hr16)
bandwidth <- .04
fulldata %>% filter(abs(repadv76-0)<=.04) %>%
mutate(above = repadv76 > 0) %>%
group_by(above) %>% summarize(repwins16=mean(repwins16))
```