-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathMarkdown-Research-Reproduction_Causal-Effect-of-Segregation.Rmd
508 lines (314 loc) · 18.7 KB
/
Markdown-Research-Reproduction_Causal-Effect-of-Segregation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
---
title: 'Reproducing Reasearch: Causal Impact of Segregation on Poverty Rates'
date: ""
output:
pdf_document:
latex_engine: pdflatex
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
This project uses data from Elizabeth Ananat’s paper, “The Wrong Side(s) of the Tracks: The Causal
Effects of Racial Segregation on Urban Poverty and Inequality,” published in the American Economic Journal:
Applied Economics in 2011. This paper studies how segregation has affected population characteristics and
income disparity in US cities using the layout of railroad tracks as an instrumental variable.
#### Keep the following variables in the final dataset.
| Name |Description |
|----------|------------------------------------------------------------------------|
|dism1990 |1990 dissimilarity index |
|herf |RDI (Railroad division index) |
|lenper |Track length per square km |
|povrate_w |White poverty rate 1990 |
|povrate_b |Black poverty rate 1990 |
|area1910 |Physical area in 1910 (1000 sq. miles) |
|count1910 |Population in 1910 (1000s) |
|ethseg10 |Ethnic Dissimilariy index in 1910 |
|ethiso10 |Ethnic isolation index in 1910 |
|black1910 |Percent Black in 1910 |
|passpc |Street cars per capita 1915 |
|black1920 |Percent Black 1920 |
|lfp1920 |Labor Force Participation 1920 |
|incseg |Income segregation 1990 |
|pctbk1990 |Percent Black 1990 |
|manshr |Share employed in manufacturing 1990 |
|pop1990 |Population in 1990 |
You can find the detailed description of each variable in the original paper.
\pagebreak
## Setup
**Code:**
```{r, message=F, warning=F}
library(dplyr)
library(stargazer)
library(lfe)
library(ggplot2)
library(haven)
library(huxtable)
library(kableExtra)
library(stringr)
library(AER)
basic<- read_dta('C:\\Users\\roryq\\Downloads\\aej.dta')
```
```{r}
df<-basic %>% select(dism1990,herf,lenper,povrate_w,povrate_b,area1910,count1910,ethiso10
,ethseg10,black1910,passpc,black1920,lfp1920,incseg,pctbk1990
,manshr,pop1990, name)
```
## Data description:
Each observation is a city. The row observation contains information in multiple years, ie one city is a row and that row has multiple columns that cover data from different years.
\pagebreak
## Report summary statistics of the following variables in the dataset:"dism1990", "herf", "lenper", "povrate_w", "povrate_b".
**Code:**
```{r, results='asis',warning=F,message=F}
s <- df %>% select(dism1990, herf, lenper, povrate_w, povrate_b)
s <- s %>%
mutate(across(everything(), ~ as.numeric(as.character(.))))
summary_stats <- s %>%
summarise(
Mean = colMeans(., na.rm = TRUE),
SD = sapply(., sd, na.rm = TRUE),
Min = sapply(., min, na.rm = TRUE),
Max = sapply(., max, na.rm = TRUE)
)
summary_stats_with_rownames <- cbind(Row = c("dism1990", "herf", "lenper", "povrate_w",
"povrate_b"), summary_stats)
# Convert the updated dataframe with row names to a huxtable
kable_summary <- knitr::kable(summary_stats_with_rownames, format = "latex", booktabs = TRUE,
col.names = c("Variable", "Mean", "SD", "Min", "Max"))
# Apply kableExtra functions for styling
kable_summary %>%
kable_styling(latex_options = c("striped", "hold_position")) %>%
column_spec(1, bold = TRUE) %>%
column_spec(2:ncol(summary_stats_with_rownames), width = "3cm") %>%
row_spec(0, bold = TRUE, color = "white", background = "#2c3e50")
```
`
\pagebreak
## Reduced Form:
### We are interested in understanding how segregation affects population characteristics and income disparity in US cities. We will focus on two outcome variables: the poverty rate for blacks and whites. Regress these two outcome variables on segregation in 1990, our explanatory variable, and interpret your results. Report robust standard errors.
**Code:**
```{r, results='asis'}
model_b<-felm(povrate_b~dism1990,data=df)
model_w<-felm(povrate_w~dism1990,data=df)
stargazer(model_w,model_b,
se = list(model_w$rse, model_b$rse), # Robust standard errors
type = "latex", # Output LaTeX table
dep.var.labels = c( "Poverty Rate (White)","Poverty Rate (Black)"),
covariate.labels = c("Segregation (1990)"),
omit.stat = c("f", "ser", "adj.rsq","rsq","n"))
```
**Answer:**
For an increase of one standard deviation of segregation, the poverty rate of white people decreases by approximately 1.02%.
For an increase of one standard deviation of segregation, the poverty rate of black people increases by approximately 2.5%.
\pagebreak
## Explain the problem with giving a causal interpretation to the estimates we just produced. Give examples of specific confounds that might make a causal interpretation of our result problematic.
**Answer:**
We can not interpret this as causal because there are a lot of omitted variables that could also affect poverty rate and is related to the segregation of the city. For example city size could be an omitted variable where larger cities have more infrastructure and are therefore more segregated, but also have higher poverty rates because of the wealth inequality with high poverty in low class neighborhoods.
Another possible confounder could be education and policy. There have long been policy the excludes black people from attaining education and the same opportunities as their white counterparts. Therefore the higher segregation is correlated with less education/opportunity for black people which contributes to the income inequality and poverty rates, because of the strong relationship between education and income.
\pagebreak
# Validity of the instrument:
## Estimate the following regression and interpret it's coefficients,
<br><br><br>
<br>
<br>
<br>
$$
dism1990_i=\beta_0+\beta_1RDI_i+\beta_2 tracklength_i+\epsilon.
$$
<br>
**Code:**
```{r, results='asis'}
model_I <- felm(dism1990 ~ herf+ lenper, data = df)
model.sum <- summary(model_I)
stargazer(model_I,
se = list(model_I$rse), # Robust standard errors
type = "latex", # Output LaTeX table
dep.var.labels = c( "Segregation (1990)"),
covariate.labels = c("RDI", "Track length per square km"),
omit.stat = c("f", "ser", "adj.rsq","rsq","n"))
```
**Answer:**
For an increase of one standard deviation of RDI, segregation 1990 increases by approximately 5%.
\pagebreak
## In the context of instrumental variables, what is this regression referred to as and why is it important?
**Answer:**
This regression is the first stage, it 'cleans' the variation in the explanatory variable to allow for causal inference, and also establishes that there is a relationship between the instrument and explanatory variable.
\pagebreak
## Illustrate the relationship between the RDI and segregation graphically.
**Code:**
```{r}
model <- lm(dism1990 ~ herf, data = df)
# Calculate residuals
df$residuals <- residuals(model)
# Calculate the IQR for the 'herf' variable (explanatory variable)
Q1 <- quantile(df$herf, 0.25)
Q3 <- quantile(df$herf, 0.75)
IQR_value <- Q3 - Q1
# Define the lower and upper bounds for IQR outliers
lower_bound_iqr <- Q1 - 1.5 * IQR_value
upper_bound_iqr <- Q3 + 1.5 * IQR_value
# Identify IQR outliers (based on herf variable)
df$outlier_iqr <- df$herf < lower_bound_iqr | df$herf > upper_bound_iqr
# Identify residual outliers (based on the model residuals)
outlier_threshold_residuals <- 2 * sd(df$residuals) # You can adjust this threshold
df$outlier_residuals <- abs(df$residuals) > outlier_threshold_residuals
# Combine both outlier flags (IQR and residuals)
df$outlier_combined <- df$outlier_iqr | df$outlier_residuals
```
```{r,warning=F,message=F}
ggplot(df, aes(x=herf,y=dism1990))+ geom_point() + # Scatter plot points
labs(title = "Full Sample Relationship between RDI and Segregation",
x = "RDI",
y = "Segregation (1990)",
subtitle = "Outlier Cities are Labled") +
theme_minimal()+
geom_smooth(method = "lm", se = TRUE, color = "firebrick")+
geom_point(color='black', fill='cornflowerblue', shape=21, size=2.5)+
theme(plot.title = element_text(hjust = 0.49,margin = margin(b = 15),size=14))+
theme(
plot.title = element_text(hjust = 0.5,face = "bold"), # Center the title
panel.grid.major = element_line(color = "gray90"),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
axis.title.x = element_text(margin = margin(t = 15),hjust = 0.49, size=11.5),
axis.title.y = element_text(margin = margin(r = 15),hjust = 0.5,size=11.5),
plot.subtitle = element_text(size = 11.5, color = "black", hjust = 0.5)
)+
geom_text(aes(label = ifelse(outlier_combined, str_sub(name, 1, nchar(name) - 2), "")),
vjust = -0.65, hjust = 0.45, size = 3, color = "black")
```
## Is there a concern that this might be a weak instrument? Why would this be a problem?
**Answer:**
The F stat value of the model is `r model.sum$fstat[1]`. This indicates that there is a not weak first stage. The F stat larger than 10 indicates that this is a strong instrument.
\pagebreak
## Regress the following cith characteristics on the RDI and track length: `area1910` `count1910`, `black1910`, `incseg`, `lfp1920`. Present your results and interpret your findings. Why do these results matter for answering our question of interest?
**Code and Answer:**
```{r,results='asis'}
model_A<- felm(area1910~herf+lenper,data=df)
model_B<- felm(count1910~herf+lenper,data=df)
model_C<- felm(black1910~herf+lenper,data=df)
model_D<- felm(incseg~herf+lenper,data=df)
model_E<- felm(lfp1920~herf+lenper,data=df)
stargazer(model_A,model_B,model_C,model_D,model_E,omit.stat=c( "ser"),
se = list(model_A$rse, model_B$rse,model_C$rse,model_D$rse,model_E$rse),
type = "latex",
covariate.labels = c("RDI", "Track length per square km"))
```
\pagebreak
## What are the two conditions necessary for a valid instrument? What evidence do you have that the RDI meet these conditions? Be specific in supporting this claim.
**Answer:**
Condition 1, there has to be a relationship between the control and the instrument (preferrably more than just a weak relationship, if you want a strong instrument). Condition 2 is the exclusion restriction, the instrument shouldnt be correlated with any other omitted variables that affect the dependent variable as well.
Since the tracks were laid down previously to the city being drawn, and the tracks were randomly laid out according to ease of transportation. Therefore the tracks randomness by coming before the urbanization ensures the exclusionary restriction. From the first stage regression we can also see that there is a relationship between the instrument and segregation, fulfilling condition 1.
## Do you believe the instrument is valid? Why/why not?
**Answer:**
Yes, because of the above conditions above that are explained, and I dont think there are any violations of the exclusion restriction that would invalidate the instrument.
\pagebreak
## Generate a table that estimates the effect of segregation on the poverty rate for blacks and whites by OLS and then using the RDI instrument. Make sure you report robust standard errors. How does the use of the RDI instrument change the estimated coefficients?
These are the exact results reported in row 2 of columns 1-4 in table 2.
**Code and Answer:**
```{r,results='asis',warning=F,message=FALSE}
# Model without IV
model_WP<- felm(povrate_w~dism1990,data=df)
model_BP<-felm(povrate_b~dism1990,data=df)
# Model with IV
model_WI<- felm(povrate_w ~ lenper |0| (dism1990~herf+lenper), data = df)
model_BI<-felm(povrate_b ~ lenper |0| (dism1990~herf+lenper), data = df)
# Table output
stargazer(model_WP,model_BP,model_WI,model_BI,se = list(model_WP$rse, model_BP$rse,model_WI$rse
,model_BI$rse), # Robust standard errors
type = "latex",
covariate.labels = c("Dissimilarity"),omit.stat=c( "ser"))
```
The use of IV increases the coefficients (and robust standard errors) compared to the original regression.
\pagebreak
## What is the reduced form equation?
**Answer:**
$$
\begin{aligned}
\text{Reduced Form: } Y=\beta_0+\beta_1RDI_i+\beta_2lenper+\epsilon_i\\
\end{aligned}
$$
## For the two poverty rates, estimate the reduced form on all the cities and illustrate the reduced form relationships graphically.
**Code:**
```{r,warning=F}
model_WIR<- felm(povrate_w~herf+lenper,data=df)
model_BIR<-felm(povrate_b~herf+lenper,data=df)
df$povrate<-(df$povrate_w+df$povrate_b)/2
ggplot(df, aes(x = herf)) +
# Scatter plot for povrate_w (white)
geom_point(aes(y = povrate_w, color = "Poverty Rate (White)"), alpha = 0.6) +
# Scatter plot for povrate_b (black)
geom_point(aes(y = povrate_b, color = "Poverty Rate (Black)"), alpha = 0.6) +
# Regression line for model_WIR (White)
geom_abline(slope = coef(model_WIR)["herf"], intercept = coef(model_WIR)["(Intercept)"],
color = "cornflowerblue", size = 1, alpha = 0.6) +
# Regression line for model_BIR (Black)
geom_abline(slope = coef(model_BIR)["herf"], intercept = coef(model_BIR)["(Intercept)"],
color = "firebrick", size = 1, alpha = 0.6) +
labs(
title = "Poverty Rates by RDI ",
x = "RDI",
y = "Poverty Rate"
) +
scale_color_manual(values = c("Poverty Rate (White)" = "cornflowerblue", "Poverty Rate (Black)" = "firebrick")) +
theme_minimal() +
guides(color = guide_legend(title = "Poverty Rate Type")) + # Customize the legend title
theme(
plot.title = element_text(hjust = 0.5) # Center the title
)
```
\pagebreak
## Generate a table with six columns that check whether the main results are robust to adding additional controls for city characteristics. What do you conclude?
You can choose the controls you want.
**Code:**
```{r,results='asis',warning=F,message=F}
# baseline
model_WI<- felm(povrate_w ~ lenper |0| (dism1990~herf+lenper), data = df)
model_BI<-felm(povrate_b ~ lenper |0| (dism1990~herf+lenper), data = df)
# Add first control
m1<-felm(povrate_w ~ lenper+pop1990 |0| (dism1990~herf+lenper+pop1990), data = df)
m1b<-felm(povrate_b ~ lenper +pop1990 |0| (dism1990~herf+lenper+pop1990), data = df)
# Add second control
m2w<-felm(povrate_w ~ lenper+black1910 |0| (dism1990~herf+lenper+black1910), data = df)
m2b<-felm(povrate_b ~lenper+ black1910 |0| (dism1990~ herf+lenper+black1910), data = df)
stargazer(model_WI, model_BI,m1,m1b,m2w,m2b, omit.stat=c("ser"))
```
**Answer:**
Each regression has a similar effect of segragation on poverty rate (in white and black regressions). the coefficients of these regressions for dism1990, are all similar to eachother and with the confidence interval of the baseline regression (and all are statistically significant).
After adding these controls it appears that the original model without the controls is robust because the new regression $\beta_{1i}$ is within the original confidence interval.
\pagebreak
# Why **Two Stage** least squares?
Because the estimates in this paper only feature one endogenous regressor and one instrument, it is an excellent example with which to illustrate build intuition and see what the instrumental variables regressor is actually doing because in this scenario the IV estimator is exactly equal to the two stage least squares estimator ($\hat{\beta}_{IV}=\hat{\beta}_{2SLS}$).
## Estimate the first stage regression and use your estimates to generate the predicted values for the explanatory variable for all the observations.
**Code:**
```{r}
# Generate the predicted values from the first-stage regression
df$pred <- predict(lm(dism1990 ~ herf + lenper, data = df))
first<- felm(dism1990 ~ herf + lenper, data = df)
```
\pagebreak
### If our instrument is valid, the step above "removed" the "bad" endogenous variation from the predicted explanatory variable, keeping only the exogenous variation that is generated by the instrument. Now run the second stage by regressing our outcome variable on the predicted values generated above and the relevant controls. Compare our estimates from this regression to those generated earlier. How do they compare?
Using robust standard errors
**Code:**
```{r,results='asis',warning=F,message=F}
new_w<- felm(povrate_w~pred+lenper,data=df)
new_b<- felm(povrate_b~pred+lenper,data=df)
original_w<- felm(povrate_w ~ lenper |0| (dism1990~herf+lenper), data = df)
original_b<- felm(povrate_b ~ lenper |0| (dism1990~herf+lenper), data = df)
stargazer(original_w,new_w,original_b,new_b,se = list( original_w$rse,new_w$rse,original_b$rse,
new_b$rse), type="latex",
column.labels = c("Original", "New", "Original", "New"),
dep.var.labels = c("Poverty Rate White",
"Poverty Rate Black","Poverty Rate White", "Poverty Rate Black"),
omit.stat = c("ser"),
title = "Additional Controls")
```
\pagebreak
# Yet another IV trick: Taking the "Good" variation and scaling it
## Question: Take the coefficient from you reduced form estimate and divide it by your first stage estimate. How does this value compare your earlier estimate for the main result?
**Answer:**
```{r}
b<- coef(model_BIR)["herf"]/coef(first)["herf"]
w<- coef(model_WIR)["herf"]/coef(first)["herf"]
```
The quotient of the coefficients for black people is `r b` and the quotient of the coefficients for white people is `r w`. These match the second stage regression coefficients for $\beta_1$ from the previous regression.
\pagebreak