Lecture_12_Regression_Discontinuity.Rmd

---
title: "Lecture 12 Regression Discontinuity"
author: "Nick Huntington-Klein"
date: "`r Sys.Date()`"
output:   
  revealjs::revealjs_presentation:
    theme: solarized
    transition: slide
    self_contained: true
    smart: true
    fig_caption: true
    reveal_options:
      slideNumber: true
    
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)
library(tidyverse)
library(dagitty)
library(ggdag)
library(gganimate)
library(ggpubr)
library(ggthemes)
library(Cairo)
theme_set(theme_gray(base_size = 15))
```

## Recap

- We've been going over ways in which we can use control groups to isolate causal effects
- We can select similar control groups using matching or controlling (what economists call "selection on observables")
- We can use a treated group at a different time as its own control group with fixed effects
- When a treatment is applied at a particular time, we can select a reasonable control to account for the effects of time using difference-in-difference (a "natural experiment")

## Today

- We're going to go over one other way in which we can find and isolate a very convincing control group
- Like DID, it's also an example of a natural experiment
- *Regression discontinuity*

## Regression Discontinuity

- For regression discontinuity to work, we need the Treatment to be assigned based on a *cutoff* of what's called a "running variable"
- For example, imagine we want to know the effects of being in a Gifted and Talented (GATE) program on your adult earnings
- Being admitted to the program is based on your test score (running variable)
- If you score above 75, you're in the program. 75 or below, you're out!

## Regression Discontinuity

- Notice that the y-axis here is *In GATE*, not the outcome

```{r, echo=FALSE, eval=TRUE, fig.width=7, fig.height=5.5}
rdd <- tibble(test = runif(300)*100) %>%
  mutate(GATE = test >= 75,
         above = test >= 75) %>%
  mutate(earn = runif(300)*40+10*GATE+test/2)

ggplot(rdd,aes(x=test,y=GATE))+geom_point()+
  geom_vline(aes(xintercept=75),col='red')+
  theme_pubr()
  labs(x='Test Score',
       y='In GATE')
```

## Regression Discontinuity

- Here's how it look when we look at the actual outcome

```{r, echo=FALSE, eval=TRUE, fig.width=7, fig.height=5.5}
ggplot(rdd,aes(x=test,y=earn,color=GATE))+geom_point()+
  theme_pubr() +
  geom_vline(aes(xintercept=75),col='red')+
  labs(x='Test Score',
       y='Earnings')
```


## Regression Discontinuity

- Now, we have a bit of a problem!
- If we look at the relationship between treatment and going to college, we'll be picking up the fact that higher test scores make you more likely to go to college anyway

```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=4}
dag <- dagify(earn~GATE+Test,
              GATE~Test,
              coords=list(
                x=c(earn=3,GATE=1,Test=2.5),
                y=c(earn=1,GATE=1.5,Test=2)
              )) %>% tidy_dagitty()
ggdag_classic(dag,node_size=20) + 
  theme_dag_blank()
```

## Regression Discontinuity

- Except, that's not actually what the diagram looks like! Test only affects GATE to the extent that it makes you be above the 90 cutoff!

```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=4}
dag <- dagify(earn~GATE+Test,
              Above~Test,
              GATE~Above,
              coords=list(
                x=c(earn=3,GATE=1,Test=2.5,Above=1.75),
                y=c(earn=1,GATE=1.5,Test=2,Above=1.75)
              )) %>% tidy_dagitty()
ggdag_classic(dag,node_size=20) + 
  theme_dag_blank()
```

## Regression Discontinuity

- What can we do with that information?
- Well, imagine that we looked at the area *just around* the cutoff
- Say, the cutoff is 75, so we look at 73 to 77
- *Within that group*, it's basically random whether you fall on one side of the line or another

## Regression Discontinuity

- Someone with a 75 is, on average, almost exactly the same as someone with a 76, except that one got the treatment and the other didn't!
- Heck, that tiny test score difference could be due to just having a bad day before the test
- So we have two groups - the just-barely-missed-outs and the just-barely-made-its, that are basically exactly the same except that one happened to get treatment
- A perfect description of what we're looking for in a control group!

## Regression Discontinuity

- So we look directly around the cutoff, and compare just below to just above.
- This is our way of controlling for test score and closing the `GATE <- Above <- Test -> earn` back door
- Why not just control for `Test` in the normal way?
- Because if we really think that, right around the cutoff, it's random whether you're on one side or the other, we don't just close the `Test` back door, we have effectively random assignment, like an experiment!
- We're not just closing the `Test` back door, we're closing *all* back doors

## In Practice

```{r, echo=TRUE, eval=FALSE}
rdd.data <- tibble(test = runif(1000)*100) %>%
  mutate(GATE = test >= 75) %>% mutate(earn = runif(1000)*40+10*GATE+test/2)
#Choose a "bandwidth" of how wide around the cutoff to look (arbitrary in our example)
#Bandwidth of 2 with a cutoff of 75 means we look from 75-2 to 75+2
bandwidth <- 2
#Just look within the bandwidth
rdd <- rdd.data %>% filter(abs(75-test) < bandwidth) %>%
  #Create a variable indicating we're above the cutoff
  mutate(above = test >= 75) %>%
  #And compare our outcome just below the cutoff to just above
  group_by(above) %>% summarize(earn = mean(earn))
rdd
#Our effect looks just about right (10 is the truth)
rdd$earn[2] - rdd$earn[1]
```

```{r, echo=FALSE, eval=TRUE}
set.seed(1000)
rdd.data <- tibble(test = runif(1000)*100) %>%
  mutate(GATE = test >= 75) %>% mutate(earn = runif(1000)*40+10*GATE+test/2)
#Choose a "bandwidth" of how wide around the cutoff to look (arbitrary in our example)
#Bandwidth of 2 with a cutoff of 75 means we look from 75-2 to 75+2
bandwidth <- 2
#Just look within the bandwidth
rdd <- rdd.data %>% filter(abs(75-test) < bandwidth) %>%
  #Create a variable indicating we're above the cutoff
  mutate(above = test >= 75) %>%
  #And compare our outcome just below the cutoff to just above
  group_by(above) %>% summarize(earn = mean(earn))
rdd
#Our effect looks just about right (10 is the truth)
rdd$earn[2] - rdd$earn[1]
```

## Graphically

```{r, dev='CairoPNG', echo=FALSE, fig.width=8,fig.height=7}
df <- data.frame(xaxisTime=runif(300)*20) %>%
  mutate(Y = .2*xaxisTime+3*(xaxisTime>10)-.1*xaxisTime*(xaxisTime>10)+rnorm(300),
         state="1",
         groupX=floor(xaxisTime)+.5,
         groupLine=floor(xaxisTime),
         cutLine=rep(c(9,11),150)) %>%
  group_by(groupX) %>%
  mutate(mean_Y=mean(Y)) %>%
  ungroup() %>%
  arrange(groupX)


dffull <- rbind(
  #Step 1: Raw data only
  df %>% mutate(groupLine=NA,cutLine=NA,mean_Y=NA,state='1. Start with raw data.'),
  #Step 2: Add Y-lines
  df %>% mutate(cutLine=NA,state='2. What differences in Y are explained by Running Variable?'),
  #Step 3: Collapse to means
  df %>% mutate(Y = mean_Y,state="3. Keep only what's explained by the Running Variable."),
  #Step 4: Zoom in on just the cutoff
  df %>% mutate(mean_Y = ifelse(xaxisTime > 9 & xaxisTime < 11,mean_Y,NA),Y=ifelse(xaxisTime > 9 & xaxisTime < 11,mean_Y,NA),groupLine=NA,state="4. Focus just on what happens around the cutoff."),
  #Step 5: Show the effect
  df %>% mutate(mean_Y = ifelse(xaxisTime > 9 & xaxisTime < 11,mean_Y,NA),Y=ifelse(xaxisTime > 9 & xaxisTime < 11,mean_Y,NA),groupLine=NA,state="5. The jump at the cutoff is the effect of treatment."))


p <- ggplot(dffull,aes(y=Y,x=xaxisTime))+geom_point()+
  geom_vline(aes(xintercept=10),linetype='dashed')+
  geom_point(aes(y=mean_Y,x=groupX),color="red",size=2)+
  geom_vline(aes(xintercept=groupLine))+
  geom_vline(aes(xintercept=cutLine))+
  geom_segment(aes(x=10,xend=10,
                   y=ifelse(state=='5. The jump at the cutoff is the effect of treatment.',
                            filter(df,groupLine==9)$mean_Y[1],NA),
                   yend=filter(df,groupLine==10)$mean_Y[1]),size=1.5,color='blue')+
  scale_color_colorblind()+
  scale_x_continuous(
    breaks = c(5, 15),
    label = c("Untreated", "Treated")
  )+xlab("Running Variable")+
  labs(title = 'The Effect of Treatment on Y using Regression Discontinuity \n{next_state}')+
  transition_states(state,transition_length=c(6,16,6,16,6),state_length=c(50,22,12,22,50),wrap=FALSE)+
  ease_aes('sine-in-out')+
  exit_fade()+enter_fade()

animate(p,nframes=175)
```

## Example: Corporate Social Responsibility

- Corporate Social Responsibility (CSR) is when corporations engage in the kind of behavior that nonprofits usually do - community outreach, charity, etc.
- Is this good for the corporation? Or would it make more sense to just send the money they spend to actual nonprofits if they just want to do good?
- This is a causal question

## Example: Corporate Social Responsibility

- Convenient for our purposes, CSR policies are voted on by shareholder boards
- If a board votes 49% in favor, it fails. 51% in favor? It passes!
- Sounds like a regression discontinuity to me!
- "Close votes" is a common application of regression discontinuity

## Example: Corporate Social Responsibility

- So how do CSR policy announcements affect stock prices?

```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=5}
dag <- dagify(price~vote+CSR,
              win~vote,
              CSR~win,
              coords=list(
                x=c(price=3,CSR=1,vote=2.5,win=1.75),
                y=c(price=1,CSR=1.5,vote=2,win=1.75)
              )) %>% tidy_dagitty()
ggdag_classic(dag,node_size=20) + 
  theme_dag_blank()
```

## Example: Corporate Social Responsbility

- Caroline Flammer studies this topic
- Looking at the "abnormal return" (stock price return minus what's expected given the market) comparing CSR votes that just won vs. CSR votes that just lost
- So what should we do?
- Focus just around the cutoff and compare abnormal returns just above and just below.

## Example: Corporate Social Responsibility

![Flammer (2015) Management Science](Lecture_12_Flammer.png)

## Example: Corporate Social Responsibility

- Looks like stock returns increase by about .02, comparing CSRs that just lost to just won!
- Seems like the market likes seeing those CSRs and values them
- And all those things that we might expect to correlate with both stock price growth and CSRs - tech-savvy, youthful leadership, etc., we've closed those back doors too!

## Balance

- Have we really closed those back doors?
- One thing that's so great about RDD is that, since it's basically random whether you're on one side of the cutoff or another, there shouldn't be other back doors
- It's a form of within variation that's *so narrow* it basically closes everything
- We can check this by seeing if other variables differ on either side of the line
- This is our way of testing our diagram - if our diagram is true, then `above` should have no relationship with any back door variable after focusing around the cutoff

## Balance

```{r, echo=TRUE}
rdd.data <- tibble(test = runif(500)*100) %>%
  mutate(backdoor=rnorm(500)+test/50) %>% mutate(GATE = test + backdoor >= 75) %>%
  mutate(earn = runif(500)*40+10*GATE+5*backdoor+test/2)
bandwidth <- 2
rdd <- rdd.data %>% filter(abs(75-test) < bandwidth) %>%
  #Create a variable indicating we're above the cutoff
  mutate(above = test >= 75) %>%
  #And compare our outcome just below the cutoff to just above
  group_by(above) %>% summarize(backdoor = mean(backdoor))
rdd
#Not a lot of difference!
rdd$backdoor[2] - rdd$backdoor[1]
```

## Balance

- Notice there's NO real difference here, indicating that we've closed that back door

```{r, echo=FALSE, fig.width=7,fig.height=5}
rdgrph <- rdd.data %>% 
  mutate(bandwidth = abs(75-test) <= 2,
         above = test >= 75) %>%
  group_by(cut(test,breaks=(0:49*2+1))) %>%
  mutate(meanback = mean(backdoor)) %>%
  ungroup()

ggplot(filter(rdgrph,bandwidth==1),aes(x=test,y=backdoor))+geom_point(col='blue')+
  geom_vline(aes(xintercept=75),col='red')+
  geom_vline(aes(xintercept=73),col='red',linetype='dashed',alpha=.6)+
  geom_vline(aes(xintercept=77),col='red',linetype='dashed',alpha=.6)+
  geom_point(data=filter(rdgrph,bandwidth==0),aes(x=test,y=backdoor),alpha=.3)+
  geom_step(data=rdgrph,aes(x=test,y=meanback),col='red',size=1)+
  theme_pubr()
  labs(x='Test Score',
      y='Backdoor Variable')
```

## Summing Up

- We've covered four main methods of making comparisons as close as possible
- Controlling and matching both take a set of measured variables and adjust so you're looking at variation within those variables
- Difference-in-difference takes a chosen comparison group and uses it to adjust for changes over time in your treated group of interest
- Regression discontinuity uses a cutoff in a running variable to identify a treated and nontreated group that are basically randomly assigned

## Summing Up

- Next time we'll be doing regression discontinuity using... actual regression
- We already have a solid idea of how it works, now let's see how it's really done

## Practice

- Does winning help your party stay in power *30 years* later?
- Install and load the `politicaldata` package, and load `data(house_results)`
- Create tibbles `hr76` and `hr16` with only 1976 and 2016
- Create `repadv76` equal to `rep` vote minus `dem` for 1976, and filter only to those with `!is.na(repadv75)`
- Create `repwins16` equal to `rep > dem` for 2016, and filter `!is.na(repwins16)`
- `select()` only `district`,`repadv76`, `repwins16`, and `inner_join()` the two data sets
- Compare `repwins16` mean above and below `repadv76=0` with a bandwidth of .04

## Practice Answers

```{r, echo=TRUE}
#install.packages('politicaldata')
library(politicaldata)
data(house_results)

hr76 <- filter(house_results,year==1976) %>%
  mutate(repadv76 = rep - dem) %>%
  filter(!is.na(repadv76)) %>%
  select(district,repadv76)
hr16 <- filter(house_results,year==2016) %>%
  mutate(repwins16 = rep > dem) %>%
  filter(!is.na(repwins16)) %>%
  select(district,repwins16)

fulldata <- inner_join(hr76,hr16)
bandwidth <- .04 

fulldata %>% filter(abs(repadv76-0)<=.04) %>%
  mutate(above = repadv76 > 0) %>%
  group_by(above) %>% summarize(repwins16=mean(repwins16))
```