---
title: "Lecture 13 Estimating Regression Discontinuity"
author: "Nick Huntington-Klein"
date: "`r Sys.Date()`"
output:
  revealjs::revealjs_presentation:
    theme: solarized
    transition: slide
    self_contained: true
    smart: true
    fig_caption: true
    reveal_options:
      slideNumber: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)
library(tidyverse)
library(dagitty)
library(ggdag)
library(ggpubr)
library(ggthemes)
library(Cairo)
library(rdrobust)
library(modelsummary)
library(purrr)
library(AER)
theme_set(theme_gray(base_size = 15))
```
## Recap
- Regression discontinuity is a design that can be used when treatment is applied based on a *cutoff*
- Above the cutoff? Treated! Below the cutoff? Not treated! (or below/above)
- By comparing people right around the cutoff, we are effectively closing all back doors
- Isolating the treatment effect!
## Today
- Surely we aren't just comparing averages above and below!
- How can we actually implement regression discontinuity (presumably with regression?)
- What do we need to keep in mind?
- What about closely related variations? What if the cutoff doesn't assign treatment perfectly?
## Regression Discontinuity in Regression
- How can we make a model for RDD?
- We want to look for a jump at the cutoff point
- And get as good an idea as we can of what the outcome is just on either side of the cutoff
- So...
## Regression Discontinuity in Regression
Let's start with the simple linear version:
$$ Y = \beta_0 + \beta_1(X-Cutoff) + \beta_2Treated + $$
$$ \beta_3Treated\times(X-Cutoff)+\varepsilon $$
- This formulation basically allows there to be two lines: one to the left of the cutoff ( $\beta_0 + \beta_1(X-Cutoff)$ ), and one to the right ( $(\beta_0 + \beta_2) + (\beta_1 + \beta_3)(X-Cutoff)$ )
- The jump at the cutoff is given by $\beta_2$ - that's our RDD estimate
- We use $X$ *relative to the cutoff* so that we can easily locate the jump in the $\beta_2$ coefficient
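- As a minimal by-hand sketch of estimating this model (on made-up simulated data, with an assumed cutoff of .5):
```{r, echo = TRUE, eval = FALSE}
# A by-hand sketch on simulated data; the .5 cutoff is an assumption
df <- tibble(X = runif(200)) %>%
  mutate(Treated = X > .5,
         Y = X + .5*Treated + .2*rnorm(200))
# Center the running variable at the cutoff and interact with treatment
m <- lm(Y ~ I(X - .5)*Treated, data = df)
summary(m)  # the TreatedTRUE coefficient is the jump at the cutoff
```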
## Regression Discontinuity in Regression
```{r}
set.seed(1000)
tb <- tibble(X = runif(100)) %>%
  mutate(Y = X + .5*(X > .5) + .2*rnorm(100))
ggplot(tb, aes(x = X, y = Y, group = X > .5)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  geom_vline(aes(xintercept = .5)) +
  ggpubr::theme_pubr()
```
## Choices!
- This is of course the simplest version!
- Things to consider:
- Bandwidth
- Functional form
- Controls
## Bandwidth
- The idea of RDD is that people *just around the cutoff* are very much comparable
- Whether your test score is 79 or 81 is basically random, for example, if the cutoff is 80
- So people far away from the cutoff aren't too informative! At best they help determine the slope of the fitted lines
- So... drop 'em!
## Bandwidth
- RDD generally uses data only from the observations in a given range around the cutoff
- (Or at least weights them less the further away they are from cutoff)
- How wide should the bandwidth be?
- There's a large literature on *optimal bandwidth selection*, which balances bias (from adding people far from the cutoff who may have open back doors) against variance (from adding more people so as to improve estimator precision)
- We won't be doing this by hand; we can usually rely on an RDD command to do it for us. A by-hand version is sketched below
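Applying a bandwidth by hand just means restricting the sample before estimating. A sketch, continuing the simulated `df` from before (the .1 bandwidth here is an arbitrary choice for illustration):
```{r, echo = TRUE, eval = FALSE}
# Keep only observations within .1 of the cutoff before fitting
# (.1 is arbitrary here; real work uses optimal bandwidth selection)
bw <- .1
m_bw <- lm(Y ~ I(X - .5)*Treated, data = df %>% filter(abs(X - .5) <= bw))
summary(m_bw)
```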
## Functional Form
- Why fit a straight line on either side? If the true relationship is curvy this will give us the wrong result!
- We can be much more flexible! As long as we fit some sort of line on either side, we can look for the jump
- One way to do this is with polynomials ( $\tilde{X} = X-Cutoff$, $T = Treated$ ):
$$ Y = \beta_0 + \beta_1\tilde{X} + \beta_2\tilde{X}^2 + \beta_3T + \beta_4\tilde{X}T + \beta_5\tilde{X}^2T + \varepsilon $$
## Functional Form
- (by the way, you can take this basic interaction-with-cutoff design idea and use it to look at how *anything* changes before and after cutoff, not just the level of $Y$! You could look at how the *slope* changes ("regression kink"), or how some other identified effect changes, or just about anything! The beauty of flexible design)
## Functional Form
- The interpretation is the same as before - look for the jump!
- We do want to be careful with polynomials though, and not add too many
- Remember, the more polynomial terms we add, the stranger the behavior of the line at *either end* of the range of data
- And the cutoff is at the far-right end of the pre-cutoff data and the far-left end of the post-cutoff data!
- So we can get illusory effects generated by having too many terms
## Functional Form
- A common approach is to use *non-parametric* regression or *local linear regression*
- This doesn't impose any particular shape! And it's easy to get a prediction on either side of the cutoff
- This allows for non-straight lines without dealing with the issues polynomials bring us
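A sketch of the local approach by hand, continuing the simulated `df` from earlier (`surface = 'direct'` is needed so `loess()` will predict right at the boundary of its data):
```{r, echo = TRUE, eval = FALSE}
# Fit a separate local regression on each side of the cutoff,
# then compare the two predictions right at the cutoff
m_below <- loess(Y ~ X, data = df %>% filter(X <= .5),
                 control = loess.control(surface = 'direct'))
m_above <- loess(Y ~ X, data = df %>% filter(X > .5),
                 control = loess.control(surface = 'direct'))
predict(m_above, newdata = data.frame(X = .5)) -
  predict(m_below, newdata = data.frame(X = .5))
```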
## Different Functional Forms
- Let's look at the same data with a few different functional forms
- Remember, the RDD effect is the jump at the cutoff. The TRUE effect here will be $.3$, and the TRUE model is an order-2 polynomial
```{r, echo = FALSE}
set.seed(500)
```
```{r, echo = TRUE}
tb <- tibble(Running = runif(200)) %>%
  mutate(Y = 1.5*Running - .6*Running^2 + .3*(Running > .5) + rnorm(200, 0, .25)) %>%
  mutate(RC = Running - .5, Treated = Running > .5)
```
## Different Functional Forms
```{r, echo = FALSE}
m <- lm(Y ~ Treated, data = tb)
jump <- coef(m)[2]
ggplot(tb, aes(x = RC, y = Y, group = Treated)) + geom_point() +
  geom_line(aes(y = tb %>% group_by(Treated) %>% mutate(YM = mean(Y)) %>% pull(YM)),
            color = 'blue') +
  theme_pubr() +
  labs(x = 'Running Variable Centered on Cutoff',
       y = 'Outcome',
       title = paste0('Simple Above/Below Average. Jump: ', scales::number(jump, accuracy = .001)))
```
## Different Functional Forms
```{r, echo = FALSE}
m <- lm(Y ~ RC*Treated, data = tb)
jump <- coef(m)[3]
ggplot(tb, aes(x = RC, y = Y, group = Treated)) + geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  theme_pubr() +
  labs(x = 'Running Variable Centered on Cutoff',
       y = 'Outcome',
       title = paste0('Linear RDD. Jump: ', scales::number(jump, accuracy = .001)))
```
## Different Functional Forms
```{r, echo = FALSE}
m <- lm(Y ~ RC*Treated + I(RC^2)*Treated, data = tb)
jump <- coef(m)[3]
ggplot(tb, aes(x = RC, y = Y, group = Treated)) + geom_point() +
  geom_smooth(method = 'lm', se = FALSE, formula = y ~ x + I(x^2)) +
  theme_pubr() +
  labs(x = 'Running Variable Centered on Cutoff',
       y = 'Outcome',
       title = paste0('Order-2 Polynomial RDD. Jump: ', scales::number(jump, accuracy = .001)))
```
## Different Functional Forms
```{r, echo = FALSE}
m <- lm(Y ~ RC*Treated + I(RC^2)*Treated + I(RC^3)*Treated, data = tb)
jump <- coef(m)[3]
ggplot(tb, aes(x = RC, y = Y, group = Treated)) + geom_point() +
  geom_smooth(method = 'lm', se = FALSE, formula = y ~ x + I(x^2) + I(x^3)) +
  theme_pubr() +
  labs(x = 'Running Variable Centered on Cutoff',
       y = 'Outcome',
       title = paste0('Order-3 Polynomial RDD. Jump: ', scales::number(jump, accuracy = .001)))
```
## Different Functional Forms
```{r, echo = FALSE}
m <- lm(Y ~ RC*Treated + I(RC^2)*Treated + I(RC^3)*Treated + I(RC^4)*Treated +
          I(RC^5)*Treated + I(RC^6)*Treated + I(RC^7)*Treated + I(RC^8)*Treated, data = tb)
jump <- coef(m)[3]
ggplot(tb, aes(x = RC, y = Y, group = Treated)) + geom_point() +
  geom_smooth(method = 'lm', se = FALSE, formula = y ~ poly(x, 8)) +
  theme_pubr() +
  labs(x = 'Running Variable Centered on Cutoff',
       y = 'Outcome',
       title = paste0('Order-8 Polynomial RDD. Jump: ', scales::number(jump, accuracy = .001)))
```
## Different Functional Forms
```{r, echo = FALSE}
tb <- tb %>%
  arrange(RC)
m1 <- loess(Y ~ RC, data = tb %>% filter(!Treated))
m2 <- loess(Y ~ RC, data = tb %>% filter(Treated))
jump <- predict(m2)[1] - utils::tail(predict(m1), 1)
ggplot(tb, aes(x = RC, y = Y, group = Treated)) + geom_point() +
  geom_smooth(method = 'loess', se = FALSE) +
  theme_pubr() +
  labs(x = 'Running Variable Centered on Cutoff',
       y = 'Outcome',
       title = paste0('Local Linear Regression RDD. Jump: ', scales::number(jump, accuracy = .001)))
```
## Functional Form
So:
- Avoid higher-order polynomials
- Even the "true model" can be worse than something simpler sometimes (although if I rerun this with different random data, linear > squared doesn't always remain true)
- (And fewer terms makes more sense too once we apply a bandwidth and zoom in)
- Be very suspicious if your fit veers wildly off right around the cutoff
- Consider a nonparametric approach
## Controls
- Generally you don't need control variables in an RDD
- If the design is valid, you've closed all back doors. That's sort of the whole point!
- Although maybe we want some if we have a wide bandwidth - this will remove some of the bias
- Still, we can get real value from having access to control variables. How?
## Controls
- Control variables allow us to perform *placebo tests* of our RDD model
- RDD should close all back doors... but what if it doesn't? What if we missed something?
- We can rerun our RDD model, but simply use a control variable as the outcome
- If we find an effect... uh oh, that shouldn't happen! (outside of the levels expected by normal sampling variation)
- You can run these for *every control variable you have!*
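As a sketch, where `W` is a hypothetical stand-in for whichever control variable you have:
```{r, echo = TRUE, eval = FALSE}
# Placebo test: W is a hypothetical control variable in df
# Rerun the RDD with the control as the outcome - there should be no jump
placebo <- lm(W ~ I(X - .5)*Treated, data = df)
summary(placebo)  # a meaningful TreatedTRUE coefficient is a red flag
```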
## Fuzzy Regression Discontinuity
- Commonly, treatment isn't *entirely* assigned on the basis of a cutoff
- But it becomes much *more/less* common at the cutoff
- We can still work with this!
- This is called *fuzzy regression discontinuity*
## Fuzzy Regression Discontinuity
- We can start by making sure there's actually a jump in treatment at the cutoff, by running RDD with treatment as the outcome
- There has to at least be a jump (up or down) in treatment probability at the cutoff
- If there isn't (or if there is but it's tiny - we'll be dividing by this later and don't want to divide by something close-to-zero) that's a problem!
- Statistically won't work, and theoretically implies we were wrong about our RDD design
## Fuzzy Regression Discontinuity
```{r, echo = FALSE}
set.seed(10000)
fuzz <- tibble(Running = runif(150)) %>%
  mutate(Treat = (.1 + .5*Running + .5*(Running > .5)) %>%
           map_dbl(function(x) min(x, 1)) %>%
           map_dbl(function(x) sample(c(1, 0), 1, prob = c(x, 1 - x)))) %>%
  mutate(Y = 1 + Running + 2*Treat + rnorm(150)*.5) %>%
  mutate(Runbin = cut(Running, 0:10/10)) %>%
  group_by(Runbin) %>%
  mutate(av_treat = mean(Treat),
         av_out = mean(Y))
ggplot(fuzz, aes(x = Running, y = Treat)) +
  geom_point() +
  geom_point(data = fuzz %>% group_by(Runbin) %>% slice(1), aes(x = Running, y = av_treat),
             color = 'red', size = 2) +
  geom_smooth(aes(group = Running > .5), method = 'lm', color = 'blue', se = FALSE) +
  geom_vline(aes(xintercept = .5), linetype = 'dashed') +
  ggpubr::theme_pubr() +
  labs(x = 'Running Variable', y = 'Treated')
```
## Fuzzy Regression Discontinuity
- So what happens if we just do RDD as normal?
- The effect is understated, because some observations above the cutoff are untreated and some below it are treated
- So with a positive effect the pre-cutoff value goes up (because we mix some treatment effect in there) and the post-cutoff value goes down (since we mix some untreated in there), bringing them closer together and shrinking the effect estimate
## Fuzzy Regression Discontinuity
```{r, echo = FALSE}
ggplot(fuzz, aes(x = Running, y = Y)) +
  geom_point() +
  geom_point(data = fuzz %>% group_by(Runbin) %>% slice(1), aes(x = Running, y = av_out),
             color = 'red', size = 2) +
  geom_smooth(aes(group = Running > .5), method = 'lm', color = 'blue', se = FALSE) +
  geom_vline(aes(xintercept = .5), linetype = 'dashed') +
  ggpubr::theme_pubr() +
  labs(x = 'Running Variable', y = 'Outcome')
```
## Fuzzy Regression Discontinuity
- This is simulated data; the true effect is 2.
```{r, echo = FALSE}
fuzz <- fuzz %>%
  mutate(Above = Running >= .5)
mreg <- lm(Y ~ Running*Above, data = fuzz)
msummary(list(Y = mreg), stars = TRUE, gof_omit = 'AIC|BIC|F|Lik|Adj')
```
## Fuzzy Regression Discontinuity
- We can scale by how much the treatment prevalence jumped... if the chance of being treated only went up by 50%, then the effect we see should be 50% as large, so let's adjust that away!
## Fuzzy Regression Discontinuity
- We can try literally dividing the effect on $Y$ by the effect on $Treated$
```{r, echo = FALSE}
mtr <- lm(Treat ~ Running*Above, data = fuzz)
ivr <- ivreg(Y ~ Running*Treat | Running*Above, data = fuzz)
msummary(list(Y = mreg, Treated = mtr), stars = TRUE, gof_omit = 'AIC|BIC|F|Lik|Adj')
```
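A minimal sketch of that division, pulling the two jumps out of the models just estimated:
```{r, echo = TRUE, eval = FALSE}
# Wald-style scaling: jump in the outcome divided by jump in treatment
jump_Y <- coef(mreg)['AboveTRUE']
jump_T <- coef(mtr)['AboveTRUE']
jump_Y / jump_T  # should land near the true effect of 2
```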
## Fuzzy Regression Discontinuity
- Or we can use instrumental variables (IV) for this (which we'll get to later), with being above the cutoff as an instrument for treatment
```{r, echo = FALSE}
msummary(list('Instrumental Variables' = ivr), stars = TRUE, gof_omit = 'AIC|BIC|F|Lik|Adj')
```
## But Really...
- There are additional estimation details that are difficult to implement yourself
- There are optimal bandwidth selection procedures
- There is bias introduced by using points away from the cutoff, but also available corrections for that bias
- We probably want to use a command that does this stuff for us
## rdrobust
- The **rdrobust** package has the `rdrobust` function, which runs regression discontinuity with:
- Options for fuzzy RD
- Optimal bandwidth selection
- Bias correction
- Lots of options (no control variables though)
- Unfortunately doesn't work with `modelsummary`
## rdrobust
- Remember the simulated data we had earlier with the true effect of .3?
```{r, echo = TRUE, eval = FALSE}
library(rdrobust)
m <- rdrobust(tb$Y, tb$Running, c = .5)
summary(m)
```
## rdrobust
```{r, echo = FALSE, eval = TRUE}
library(rdrobust)
m <- rdrobust(tb$Y, tb$Running, c = .5)
summary(m)
```
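It handles the fuzzy case too; a sketch on the simulated fuzzy data from earlier, passing the treatment variable to the `fuzzy` argument:
```{r, echo = TRUE, eval = FALSE}
# Fuzzy RDD: supply the treatment variable via the fuzzy argument
m_fuzzy <- rdrobust(fuzz$Y, fuzz$Running, c = .5, fuzzy = fuzz$Treat)
summary(m_fuzzy)
```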
## rdplot
- Or, easily plot the results! Note that `rdplot` defaults to an order-4 polynomial fit, unlike `rdrobust`, which uses local linear regression
```{r, echo = TRUE, eval = FALSE}
rdplot(tb$Y, tb$Running, c = .5)
```
## rdplot
```{r, echo = FALSE, eval = TRUE}
rdplot(tb$Y, tb$Running, c = .5)
```
## And for Special Cases
- We'll probably be actually *estimating* RDD models with `rdrobust` - going through the by-hand stuff is important for knowing what is going on though!
- **rdrobust** is one of a family of packages for different kinds of RDD:
- **rdpower** for power analyses of regression discontinuity models (do this!)
- **rdmulti** for RDD with multiple cutoffs
- And the wonkier **rdlocrand** and **rddensity**
## Practice
- Discuss: one place where RDD is used frequently is in politics, where vote share is used as the running variable and a .5 cutoff determines who is elected
- Why does this work? What assumptions do we need to make? What issues might there be? Is this 'fuzzy' or not?
## Practice
- Load the **rdrobust** package and the `rdrobust_RDsenate` data.
- Perform a linear and order-2 polynomial RDD using `lm()` with a bandwidth of .05 on either side
- Then use `rdrobust` and `rdplot` to do the same