```{r libraries,echo=FALSE,message=FALSE,error=FALSE,warning=FALSE}
library(MASS)
library(RColorBrewer)
library(gplots)
library(ggplot2)
library(snowfall)
library(cowplot)
library(sandwich)
library(ggpubr)
library(dplyr)
library(metafor)
library(stats)
library(mvtnorm)
library(tmvtnorm)
library(AER)
library(stargazer)
library(plm)
library(lfe)
library(fixest)
library(tidyr)
library(purrr)
library(stringr)
library(did)
library(DIDmultiplegt)
library(DIDmultiplegtDYN)
library(didimputation)
library(did2s)
library(recipes)
library(tidyverse)
library(ggthemes)
library(fastDummies)
library(Matching)
library(MatchIt)
library(scales)
library(diagonals)
library(sf)
library(spdep)
library(btb)
library(cartography)
library(spatstat)
library(broom)
library(igraph)
library(nbpMatching)
#library(maxmatching)
#library(tidymodels)
#library(nestedmodels)
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, error = FALSE, warning = FALSE,eval=TRUE)
# results='hide'
```
```{r parameters, echo=FALSE}
cst <- "60%"
Nsim <- 1000
N.sample <- c(100,1000,10000,100000)
ncpus <- 4
```
# (PART) Fundamental Problems of Inference {-}
# Introduction: the Two Fundamental Problems of Inference {-}
When trying to estimate the effect of a program on an outcome, we face two very important and difficult problems: [the Fundamental Problem of Causal Inference (FPCI)](FPCI.html) and [the Fundamental Problem of Statistical Inference (FPSI)](FPSI.html).
In its most basic form, the FPCI states that our causal parameter of interest ($TT$, short for Treatment on the Treated, which we will define shortly) is fundamentally unobservable, even when the sample size is infinite.
The main reason is that one component of $TT$, the outcome of the treated had they not received the program, remains unobservable.
We call this outcome a counterfactual outcome.
The FPCI is a very dispiriting result, and is actually the basis for all of the statistical methods of causal inference.
All of these methods try to find ways to estimate the counterfactual by using observable quantities that hopefully approximate it as well as possible.
Most people, including policymakers and ourselves, generally rely on intuitive quantities in order to generate the counterfactual (the individuals without the program or the individuals before the program was implemented).
Unfortunately, these approximations are generally very crude, and the resulting estimators of $TT$ are generally biased, sometimes severely.
The Fundamental Problem of Statistical Inference (FPSI) states that, even if we have an estimator $E$ that identifies $TT$ in the population, we cannot observe $E$ because we only have access to a finite sample of the population.
The only thing that we can form from the sample is a sample equivalent $\hat{E}$ to the population quantity $E$, and $\hat{E}\neq E$.
Why is $\hat{E}\neq E$?
Because a finite sample is never perfectly representative of the population.
What can we do to deal with the FPSI?
I am going to argue that there are mainly two things that we might want to do: estimating the extent of sampling noise and decreasing sampling noise.
# Fundamental Problem of Causal Inference {#FPCI}
In order to state the FPCI, we are going to describe the basic language to encode causality set up by Rubin, and named [Rubin Causal Model (RCM)](RCM.html).
Because the RCM involves partly observed random variables, it is hard to make these notions concrete with real data.
That's why we are going to use simulations from a simple model in order to make it clear how these variables are generated.
The second virtue of this model is that it makes the source of selection into the treatment explicit.
This will be useful for understanding the biases of intuitive comparisons, but also for discussing the methods of causal inference.
A third virtue of this approach is that it makes clear the connection between the treatment effects literature and models.
Finally, a fourth reason it is useful is that it gives us a source of sampling variation that we will use to visualize and explore the properties of our estimators.
I use $X_i$ to denote random variable $X$ all along the notes.
I assume that we have access to a sample of $N$ observations indexed by $i\in\left\{1,\dots,N\right\}$.
''$i$'' will denote the basic sampling units when we are in a sample, and a basic element of the probability space when we are in populations.
Introducing rigorous measure-theoretic notations for the population is feasible but is not necessary for comprehension.
When the sample size is infinite, we say that we have a population.
A population is a very useful fiction for two reasons.
First, in a population, there is no sampling noise: we observe an infinite amount of observations, and our estimators are infinitely precise.
This is useful to study phenomena independently of sampling noise.
For example, it is in general easier to prove that an estimator is equal to $TT$ under some conditions in the population.
Second, we are most of the time much more interested in estimating the values of parameters in the population rather than in the sample.
The population parameter, independent of sampling noise, gives a much better idea of the causal parameter for the population of interest than the parameter in the sample.
In general, the estimator for both quantities will be the same, but the estimates of the sampling noise affecting these estimators will differ.
Sampling noise for the population parameter will generally be larger, since it is affected by another source of variability (sample choice).
## Rubin Causal Model
The RCM is made of three distinct building blocks: a treatment allocation rule, which decides who receives the treatment; potential outcomes, which measure how each individual reacts to the treatment; and the switching equation, which relates potential outcomes to observed outcomes through the allocation rule.
### Treatment allocation rule
The first building block of RCM is the treatment allocation rule.
Throughout this class, we are going to be interested in inferring the causal effect of only one treatment with respect to a control condition.
Extensions to multi-valued treatments are in general self-explanatory.
In RCM, treatment allocation is captured by the variable $D_i$.
$D_i=1$ if unit $i$ receives the treatment and $D_i=0$ if unit $i$ does not receive the treatment and thus remains in the control condition.
The treatment allocation rule is critical for several reasons.
First, because it switches the treatment on or off for each unit, it is going to be at the source of the FPCI.
Second, the specific properties of the treatment allocation rule are going to matter for the feasibility and bias of the various econometric methods that we are going to study.
Let's take a few examples of allocation rules.
These allocation rules are just examples.
They do not cover the space of all possible allocation rules.
They are especially useful as concrete devices to understand the sources of biases and the nature of the allocation rule.
In reality, there exist even more complex allocation rules (awareness, eligibility, application, acceptance, active participation).
Awareness seems especially important for program participation and has only been tackled recently by economists.
First, some notation.
Let's imagine a treatment that is given to individuals.
Whether each individual receives the treatment partly depends on the level of her outcome before receiving the treatment.
Let's denote this variable $Y^B_i$, with $B$ standing for "Before".
It can be the health status assessed by a professional before deciding to give a drug to a patient.
It can be the poverty level of a household used to assess its eligibility for a cash transfer program.
#### Sharp cutoff rule
The sharp cutoff rule means that everyone below some threshold $\bar{Y}$ is going to receive the treatment.
Everyone whose outcome before the treatment lies above $\bar{Y}$ does not receive the treatment.
Such rules can be found in reality in a lot of situations.
They might be generated by administrative rules.
One very simple way to model this rule is as follows:
\begin{align}\label{eq:cutoff}
D_i & = \uns{Y_i^B\leq\bar{Y}},
\end{align}
where $\uns{A}$ is the indicator function, taking value $1$ when $A$ is true and $0$ otherwise.
```{example, name="Sharp cutoff rule"}
Imagine that $Y_i^B=\exp(y_i^B)$, with $y_i^B=\mu_i+U_i^B$, $\mu_i\sim\mathcal{N}(\bar{\mu},\sigma^2_{\mu})$ and $U_i^B\sim\mathcal{N}(0,\sigma^2_{U})$.
Now, let's choose some values for these parameters so that we can generate a sample of individuals and allocate the treatment among them.
I'm going to switch to R for that.
```
```{r param.init,eval=TRUE,echo=TRUE,results='markup'}
param <- c(8,.5,.28,1500)
names(param) <- c("barmu","sigma2mu","sigma2U","barY")
param
```
Now, I have chosen values for the parameters in my model.
For example, $\bar{\mu}=$ `r param["barmu"]` and $\bar{Y}=$ `r param["barY"]`.
What remains to be done is to generate $Y_i^B$ and then $D_i$.
For this, I have to choose a sample size ($N=1000$) and then generate the shocks from a normal.
```{r YiBD,eval=TRUE,echo=TRUE,results='hide'}
# for reproducibility, I choose a seed that will give me the same random sample each time I run the program
set.seed(1234)
N <-1000
mu <- rnorm(N,param["barmu"],sqrt(param["sigma2mu"]))
UB <- rnorm(N,0,sqrt(param["sigma2U"]))
yB <- mu + UB
YB <- exp(yB)
Ds <- ifelse(YB<=param["barY"],1,0)
```
Let's now build a histogram of the data that we have just generated.
```{r histyb,fig.cap='Histogram of $y_B$',fig.align='center',out.width=cst}
# building histogram of yB with cutoff point at ybar
# Number of steps
Nsteps.1 <- 15
#step width
step.1 <- (log(param["barY"])-min(yB[Ds==1]))/Nsteps.1
Nsteps.0 <- (-log(param["barY"])+max(yB[Ds==0]))/step.1
breaks <- cumsum(c(min(yB[Ds==1]),c(rep(step.1,Nsteps.1+Nsteps.0+1))))
hist(yB,breaks=breaks,main="")
abline(v=log(param["barY"]),col="red")
```
You can see on Figure \@ref(fig:histyb) a histogram of $y_i^B$ with the red line indicating the cutoff point: $\bar{y}=\ln(\bar{Y})=$ `r round(log(param["barY"]),1)`.
All the observations below the red line are treated according to the sharp rule, while all the ones located above are not.
In order to see how many observations eventually receive the treatment with this allocation rule, let's build a contingency table.
```{r tableDsharp,eval=TRUE,echo=TRUE,results='asis',warning=FALSE,error=FALSE,message=FALSE}
table.D.sharp <- as.matrix(table(Ds))
knitr::kable(table.D.sharp,caption='Treatment allocation with sharp cutoff rule',booktabs=TRUE)
```
We can see on Table \@ref(tab:tableDsharp) that there are `r table.D.sharp[which(rownames(table.D.sharp)==1)]` treated observations.
#### Fuzzy cutoff rule
This rule is less sharp than the sharp cutoff rule.
Here, other criteria than $Y_i^B$ enter into the decision to allocate the treatment.
The doctor might measure the health status of a patient following official guidelines, but he might also measure other factors that will influence his decision to give the drug to the patient.
The officials administering a program might measure the official income level of a household, but they might also consider other features of the household situation when deciding to enroll the household into the program or not.
If these additional criteria are unobserved to the econometrician, then we have the fuzzy cutoff rule.
A very simple way to model this rule is as follows:
\begin{align}\label{eq:fuzzcutoff}
D_i & = \uns{Y_i^B+V_i\leq\bar{Y}},
\end{align}
where $V_i$ is a random variable unobserved to the econometrician and standing for the other influences that might drive the allocation of the treatment.
$V_i$ is distributed according to a, for the moment, unspecified cumulative distribution function $F_V$.
When $V_i$ is degenerate (*i.e.*, it has only one point of support: it is a constant), the fuzzy cutoff rule becomes the sharp cutoff rule.
#### Eligibility $+$ self-selection rule
It is also possible that households, once they have been made eligible to the treatment, can decide whether they want to receive it or not.
A patient might be able to refuse the drug that the doctor suggests she should take.
A household might refuse to participate in a cash transfer program to which it has been made eligible.
Not all programs have this feature, but most of them have some room for decisions by the agents themselves of whether they want to receive the treatment or not.
One simple way to model this rule is as follows:
\begin{align}\label{eq:eligself}
D_i & = \uns{D^*_i\geq0}E_i,
\end{align}
where $D^*_i$ is individual $i$'s valuation of the treatment and $E_i$ is whether or not she is deemed eligible for the treatment.
$E_i$ might be chosen according to the sharp cutoff rule or to the fuzzy cutoff rule, or to any other eligibility rule.
We will be more explicit about $D_i^*$ in what follows.
**SIMULATIONS ARE MISSING FOR THESE LAST TWO RULES**
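In the meantime, here is a minimal sketch of what such simulations could look like, building on the sample generated above.
For the fuzzy cutoff rule, I draw an arbitrary unobserved component $V_i$ (normal on the level scale, with a standard deviation of 500 chosen purely for illustration); the chunk is not evaluated, so that the random draws used in the rest of the chapter are left untouched.
```{r Dfuzzy,eval=FALSE,echo=TRUE}
# sketch of the fuzzy cutoff rule (illustrative assumptions only)
# assumption: V is normal on the level scale with an arbitrary standard deviation of 500
V <- rnorm(N,0,500)
Dfuzzy <- ifelse(YB+V<=param["barY"],1,0)
# units close to the cutoff can switch treatment status relative to the sharp rule
table(Sharp=Ds,Fuzzy=Dfuzzy)
```
For the eligibility $+$ self-selection rule, I take $E_i$ from the sharp cutoff rule and draw a hypothetical valuation $D^*_i$ from an arbitrary normal distribution (again, these distributional choices are mine, and the chunk is not evaluated).
```{r Dselfselect,eval=FALSE,echo=TRUE}
# sketch of the eligibility + self-selection rule (illustrative assumptions only)
E <- Ds                          # eligibility from the sharp cutoff rule
Dstar <- rnorm(N,0.1,0.3)        # hypothetical net valuation of the treatment
Dself <- ifelse(Dstar>=0,1,0)*E  # participation requires eligibility and a non-negative valuation
table(Eligible=E,Participant=Dself)
```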
### Potential outcomes
The second main building block of the RCM consists of potential outcomes.
Let's say that we are interested in the effect of a treatment on an outcome $Y$.
Each unit $i$ can thus be in two potential states: treated or non treated.
Before the allocation of the treatment is decided, both of these states are feasible for each unit.
```{definition, name="Potential outcomes"}
For each unit $i$, we define two potential outcomes:
```
* $Y_i^1$: the outcome that unit $i$ is going to have if it receives the treatment,
* $Y_i^0$: the outcome that unit $i$ is going to have if it **does not** receive the treatment.
```{example}
Let's choose functional forms for our potential outcomes.
For simplicity, all lower case letters will denote log outcomes.
$y_i^0=\mu_i+\delta+U_i^0$, with $\delta$ a time shock common to all the observations and $U_i^0=\rho U_i^B+\epsilon_i$, with $|\rho|<1$.
In the absence of the treatment, part of the shocks $U_i^B$ that the individuals experienced in the previous period persists, while the rest vanishes.
$y_i^1=y_i^0+\bar{\alpha}+\theta\mu_i+\eta_i$.
In order to generate the potential outcomes, one has to define the laws for the shocks and to choose parameter values.
Let's assume that $\epsilon_i\sim\mathcal{N}(0,\sigma^2_{\epsilon})$ and $\eta_i\sim\mathcal{N}(0,\sigma^2_{\eta})$.
Now let's choose some parameter values:
```
```{r param.suite,eval=TRUE,echo=TRUE,results='markup'}
l <- length(param)
param <- c(param,0.9,0.01,0.05,0.05,0.05,0.1)
names(param)[(l+1):length(param)] <- c("rho","theta","sigma2epsilon","sigma2eta","delta","baralpha")
param
```
We can finally generate the potential outcomes:
```{r pot.out,eval=TRUE,echo=TRUE,results='hide'}
epsilon <- rnorm(N,0,sqrt(param["sigma2epsilon"]))
eta<- rnorm(N,0,sqrt(param["sigma2eta"]))
U0 <- param["rho"]*UB + epsilon
y0 <- mu + U0 + param["delta"]
alpha <- param["baralpha"]+ param["theta"]*mu + eta
y1 <- y0+alpha
Y0 <- exp(y0)
Y1 <- exp(y1)
```
Now, I would like to visualize my potential outcomes:
```{r histpotout,fig.cap='Potential outcomes',fig.align='center',out.width=cst}
plot(y0,y1)
```
You can see on the resulting Figure \@ref(fig:histpotout) that both potential outcomes are positively correlated.
Those with a large potential outcome when untreated (*e.g.* in good health without the treatment) also tend to have good health with the treatment.
It is also true that individuals with bad health in the absence of the treatment also have bad health with the treatment.
### Switching equation
The last building block of RCM is the switching equation.
It links the observed outcome to the potential outcomes through the allocation rule:
\begin{align}
(\#eq:switch)
Y_i & =
\begin{cases}
Y_i^1 & \text{if } D_i=1\\
Y_i^0 & \text{if } D_i=0
\end{cases} \\
& = Y_i^1D_i + Y_i^0(1-D_i) \nonumber
\end{align}
```{example}
In order to generate observed outcomes in our numerical example, we simply have to enforce the switching equation:
```
```{r switch,eval=TRUE,echo=TRUE,results='hide'}
y <- y1*Ds+y0*(1-Ds)
Y <- Y1*Ds+Y0*(1-Ds)
```
What the switching equation \@ref(eq:switch) means is that, for each individual $i$, we get to observe only one of the two potential outcomes.
When individual $i$ belongs to the treatment group (*i.e.* $D_i=1$), we get to observe $Y_i^1$.
When individual $i$ belongs to the control group (*i.e.* $D_i=0$), we get to observe $Y_i^0$.
Because the same individual cannot be at the same time in both groups, we can NEVER see both potential outcomes for the same individual at the same time.
For each of the individuals, one of the two potential outcomes is unobserved.
We say that it is a **counterfactual**.
A counterfactual quantity is a quantity that is, according to Hume's definition, contrary to the observed facts.
A counterfactual cannot be observed, but it can be conceived by an effort of reason: it is the consequence of what would have happened had some action not been taken.
```{remark}
One very nice way of visualising the switching equation was proposed by Jerzy Neyman in a prescient 1923 paper.
Neyman proposes to imagine two urns, each one filled with $N$ balls.
One urn is the treatment urn and contains balls with the id of the unit and the value of its potential outcome $Y_i^1$.
The other urn is the control urn, and it contains balls with the value of the potential outcome $Y_i^0$ for each unit $i$.
Following the allocation rule $D_i$, we decide whether unit $i$ is in the treatment or control group.
When unit $i$ is in the treatment group, we take the corresponding ball from the first urn and observe the potential outcome on it.
But, at the same time, the urns are connected so that the corresponding ball with the potential outcome of unit $i$ in the control urn disappears as soon as we draw ball $i$ from the treatment urn.
The switching equation works a lot like the Schrödinger's cat paradox.
Schrödinger's cat is placed in a sealed box and receives a dose of poison when an atom emits radiation.
As long as the box is sealed, there is no way we can know whether the cat is dead or alive.
When we open the box, we observe either a dead cat or a living cat, but we cannot observe the cat both alive and dead at the same time.
The switching equation is like opening the box: it collapses the observed outcome into one of the two potential ones.
```
```{example}
One way to visualize the inner workings of the switching equation is to plot the potential outcomes along with the criteria driving the allocation rule.
In our simple example, it simply amounts to plotting observed ($y_i$) and potential outcomes ($y_i^1$ and $y_i^0$) along $y_i^B$.
```
```{r ploty1y0yB,eval=TRUE,echo=TRUE,fig.cap='Potential outcomes',fig.align='center',out.width=cst}
plot(yB[Ds==0],y0[Ds==0],pch=1,xlim=c(5,11),ylim=c(5,11),xlab="yB",ylab="Outcomes")
points(yB[Ds==1],y1[Ds==1],pch=3)
points(yB[Ds==0],y1[Ds==0],pch=3,col='red')
points(yB[Ds==1],y0[Ds==1],pch=1,col='red')
test <- 5.8
i.test <- which(abs(yB-test)==min(abs(yB-test)))
points(yB[abs(yB-test)==min(abs(yB-test))],y1[abs(yB-test)==min(abs(yB-test))],col='green',pch=3)
points(yB[abs(yB-test)==min(abs(yB-test))],y0[abs(yB-test)==min(abs(yB-test))],col='green')
abline(v=log(param["barY"]),col="red")
legend(5,11,c('y0|D=0','y1|D=1','y0|D=1','y1|D=0',paste('y0',i.test,sep=''),paste('y1',i.test,sep='')),pch=c(1,3,1,3,1,3),col=c('black','black','red','red','green','green'),ncol=3)
```
```{r plotyyB,eval=TRUE,echo=TRUE,fig.cap='Observed outcomes',fig.align='center',out.width=cst}
plot(yB[Ds==0],y0[Ds==0],pch=1,xlim=c(5,11),ylim=c(5,11),xlab="yB",ylab="Outcomes")
points(yB[Ds==1],y1[Ds==1],pch=3)
legend(5,11,c('y|D=0','y|D=1'),pch=c(1,3))
abline(v=log(param["barY"]),col="red")
```
Figure \@ref(fig:ploty1y0yB) plots the observed outcomes $y_i$ along with the unobserved potential outcomes.
Figure \@ref(fig:ploty1y0yB) shows that each individual in the sample is endowed with two potential outcomes, represented by a circle and a cross.
Figure \@ref(fig:plotyyB) plots the observed outcomes $y_i$ that result from applying the switching equation.
Only one of the two potential outcomes is observed (the cross for the treated group and the circle for the untreated group) and the other is not.
The observed sample in Figure \@ref(fig:plotyyB) only shows observed outcomes, and is thus silent on the values of the missing potential outcomes.
## Treatment effects
RCM enables the definition of causal effects at the individual level.
In practice though, we generally focus on a summary measure: the effect of the treatment on the treated.
### Individual level treatment effects
Potential outcomes enable us to define the central notion of causal inference: the causal effect, also labelled the treatment effect, which is the difference between the two potential outcomes.
```{definition,causaleff,name='Individual level treatment effect'}
For each unit $i$, the causal effect of the treatment on outcome $Y$ is: $\Delta^Y_i=Y_i^1-Y_i^0$.
```
```{example}
The individual level causal effect in log terms is: $\Delta^y_i=\alpha_i=\bar{\alpha}+\theta\mu_i+\eta_i$.
The effect is the sum of a part common to all individuals, a part correlated with $\mu_i$ (the treatment might have a larger or smaller effect depending on the unobserved permanent ability or health status of individuals), and a random shock.
It is possible to make the effect of the treatment depend on $U_i^B$ as well, but it would complicate the model.
```
In Figure \@ref(fig:ploty1y0yB), the individual level treatment effects are the differences between each cross and its corresponding circle.
For example, for observation `r i.test`, the two potential outcomes appear in green in Figure \@ref(fig:ploty1y0yB).
The effect of the treatment on unit `r i.test` is equal to:
$$
\Delta^y_{`r i.test`}=y^1_{`r i.test`}-y^0_{`r i.test`}=`r round(y1[i.test],2)`-`r round(y0[i.test],2)`=`r round(y1[i.test]-y0[i.test],2)`.
$$
Since observation `r i.test` belongs to the treatment group, we can only observe the potential outcome in the presence of the treatment, $y^1_{`r i.test`}$.
RCM allows for heterogeneity of treatment effects.
The treatment has a large effect on some units and a much smaller effect on other units.
We can even have some units that benefit from the treatment and some units that are harmed by the treatment.
The individual level effect of the treatment is itself a random variable (and not a fixed parameter).
It has a distribution, $F_{\Delta^Y}$.
Heterogeneity of treatment effects seems very natural: the treatment might interact with individuals' different backgrounds.
The effect of a drug might depend on the genetic background of an individual.
An education program might only work for children that already have sufficient non-cognitive skills, and thus might depend in turn on family background.
An environmental regulation or a behavioral intervention might only trigger reactions by already environmentally aware individuals.
A conditional cash transfer (CCT) might have a larger effect when individuals are credit-constrained or face shocks.
```{example}
In our numerical example, the distribution of $\Delta^y_i=\alpha_i$ is a normal: $\alpha_i\sim\mathcal{N}(\bar{\alpha}+\theta\bar{\mu},\theta^2\sigma^2_{\mu}+\sigma^2_{\eta})$.
We would like to visualize treatment effect heterogeneity.
For that, we can build a histogram of the individual level causal effect.
```
On top of the histogram, we can also draw the theoretical distribution of the treatment effect: a normal with mean `r param["baralpha"]+param["theta"]*param["barmu"]` and variance `r round(param["theta"]^2*param["sigma2mu"]+param["sigma2eta"],2)`.
```{r histalpha,eval=TRUE,echo=TRUE,fig.cap='Histogram of $\\Delta^y$',fig.align='center',fig.pos='htbp',out.width=cst}
hist(alpha,main="",prob=TRUE)
curve(dnorm(x, mean=(param["baralpha"]+param["theta"]*param["barmu"]), sd=sqrt(param["theta"]^2*param["sigma2mu"]+param["sigma2eta"])), add=TRUE,col='red')
```
The first thing that we can see on Figure \@ref(fig:histalpha) is that the theoretical and the empirical distributions nicely align with each other.
We also see that the majority of the observations lies to the right of zero: most people experience a positive effect of the treatment.
But there are some individuals that do not benefit from the treatment: the effect of the treatment on them is negative.
### Average treatment effect on the treated
We do not generally estimate individual-level treatment effects.
We generally look for summary statistics of the effect of the treatment.
By far the most widely reported causal parameter is the Treatment on the Treated parameter (TT).
It can be defined in the sample at hand or in the population.
```{definition,TT,name='Average and expected treatment effects on the treated'}
The Treatment on the Treated parameters for outcome $Y$ are:
```
* The average Treatment effect on the Treated in the sample:
\begin{align*}
\Delta^Y_{TT_s} & = \frac{1}{\sum_{i=1}^ND_i}\sum_{i=1}^N(Y_i^1-Y_i^0)D_i,
\end{align*}
* The expected Treatment effect on the Treated in the population:
\begin{align*}
\Delta^Y_{TT} & = \esp{Y_i^1-Y_i^0|D_i=1}.
\end{align*}
The TT parameters measure the average effect of the treatment on those who actually take it, either in the sample at hand or in the population.
It is generally considered to be the most policy-relevant parameter since it measures the effect of the treatment as it has actually been allocated.
For example, the expected causal effect on the overall population is only relevant if policymakers are considering implementing the treatment even on those who have not been selected to receive it.
For a drug or an anti-poverty program, it would mean giving the treatment to healthy or rich people, which would make little sense.
TT does not say anything about how the effect of the treatment is distributed in the population or in the sample.
TT does not account for the heterogeneity of treatment effects.
In Lecture 7, we will look at other parameters of interest that look more closely into how the effect of the treatment is distributed.
```{example}
The value of TT in our sample is:
```
$$
\Delta^y_{TT_s}=`r round(mean(alpha[Ds==1]),3)`.
$$
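Since we have generated the individual treatment effects ourselves, this number is simply the sample average of $\alpha_i$ among the treated, computed directly from the simulated data:
```{r sampleTT,eval=TRUE,echo=TRUE,results='markup'}
# sample TT: average individual treatment effect among the treated (observable only by simulation)
mean(alpha[Ds==1])
```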
Computing the population value of $TT$ is slightly more involved: we have to use the formula for the conditional expectation of a censored bivariate normal random variable:
\begin{align*}
\Delta^y_{TT} & = \esp{\alpha_i|D_i=1}\\
& = \bar{\alpha}+\theta\esp{\mu_i|\mu_i+U_i^B\leq\bar{y}}\\
& = \bar{\alpha}+\theta\left(\bar{\mu} - \frac{\sigma^2_{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\frac{\phi\left(\frac{\bar{y}-\bar{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\right)}{\Phi\left(\frac{\bar{y}-\bar{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\right)}\right)\\
& = \bar{\alpha}+\theta\bar{\mu}-\theta\left(\frac{\sigma^2_{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\frac{\phi\left(\frac{\bar{y}-\bar{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\right)}{\Phi\left(\frac{\bar{y}-\bar{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\right)}\right),
\end{align*}
where $\phi$ and $\Phi$ are respectively the density and the cumulative distribution functions of the standard normal.
The second equality follows from the definition of $\alpha_i$ and $D_i$ and from the fact that $\eta_i$ is independent from $\mu_i$ and $U_i^B$.
The third equality comes from the formula for the expectation of a censored bivariate normal random variable.
In order to compute the population value of TT easily for different sets of parameter values, let's write a function in R:
```{r delta.y.tt,eval=TRUE,echo=TRUE,tidy=FALSE}
delta.y.tt <- function(param){return(param["baralpha"]+param["theta"]*param["barmu"]
-param["theta"]*((param["sigma2mu"]*dnorm((log(param["barY"])-param["barmu"])/(sqrt(param["sigma2mu"]+param["sigma2U"]))))
/(sqrt(param["sigma2mu"]+param["sigma2U"])
*pnorm((log(param["barY"])-param["barmu"])/(sqrt(param["sigma2mu"]+param["sigma2U"]))))))}
```
The population value of TT computed using this function is: $\Delta^y_{TT}=$ `r round(delta.y.tt(param),3)`.
We can see that the values of TT in the sample and in the population differ slightly.
This is because of sampling noise: the units in the sample are not perfectly representative of the units in the population.
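To get a feel for sampling noise, the following sketch (not part of the original derivations; the number of replications is an arbitrary choice of mine) redraws samples of size $N$ from the same model and recomputes the sample TT each time.
```{r samplingTT,eval=TRUE,echo=TRUE,results='markup'}
# sketch: sampling distribution of the sample TT across replications of the data generating process
sample.TT <- function(seed,N,param){
  set.seed(seed)
  mu <- rnorm(N,param["barmu"],sqrt(param["sigma2mu"]))
  UB <- rnorm(N,0,sqrt(param["sigma2U"]))
  Ds <- ifelse(exp(mu+UB)<=param["barY"],1,0)
  eta <- rnorm(N,0,sqrt(param["sigma2eta"]))
  alpha <- param["baralpha"]+param["theta"]*mu+eta
  mean(alpha[Ds==1])
}
TT.sim <- sapply(1:100,sample.TT,N=N,param=param)
# mean and standard deviation of the sample TT across 100 replications
c(mean(TT.sim),sd(TT.sim))
```
The sample TT fluctuates around the population value from one sample to the next; this is precisely the sampling noise that the next chapter studies.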
## Fundamental problem of causal inference
At least in this lecture, causal inference is about trying to infer TT, either in the sample or in the population.
The FPCI states that it is impossible to directly observe TT because one part of it remains fundamentally unobserved.
```{theorem,FPCI,name='Fundamental problem of causal inference'}
It is impossible to observe TT, either in the population or in the sample.
```
```{proof}
The proof of the FPCI is rather straightforward.
Let me start with the sample TT:
\begin{align*}
\Delta^Y_{TT_s} & = \frac{1}{\sum_{i=1}^ND_i}\sum_{i=1}^N(Y_i^1-Y_i^0)D_i \\
& = \frac{1}{\sum_{i=1}^ND_i}\sum_{i=1}^NY_i^1D_i- \frac{1}{\sum_{i=1}^ND_i}\sum_{i=1}^NY_i^0D_i \\
& = \frac{1}{\sum_{i=1}^ND_i}\sum_{i=1}^NY_iD_i- \frac{1}{\sum_{i=1}^ND_i}\sum_{i=1}^NY_i^0D_i.
\end{align*}
Since $Y_i^0$ is unobserved whenever $D_i=1$, $\frac{1}{\sum_{i=1}^ND_i}\sum_{i=1}^NY_i^0D_i$ is unobserved, and so is $\Delta^Y_{TT_s}$.
The same is true for the population TT:
\begin{align*}
\Delta^Y_{TT} & = \esp{Y_i^1-Y_i^0|D_i=1} \\
& = \esp{Y_i^1|D_i=1}-\esp{Y_i^0|D_i=1}\\
& = \esp{Y_i|D_i=1}-\esp{Y_i^0|D_i=1}.
\end{align*}
$\esp{Y_i^0|D_i=1}$ is unobserved, and so is $\Delta^Y_{TT}$.
```
The key insight for understanding the FPCI is that the outcomes the treated units would have had without the treatment are unobservable, and so are their average and their expectation.
We say that they are counterfactual, contrary to what has happened.
```{definition,counter,name='Counterfactual'}
Both $\frac{1}{\sum_{i=1}^ND_i}\sum_{i=1}^NY_i^0D_i$ and $\esp{Y_i^0|D_i=1}$ are counterfactual quantities that we will never get to observe.
```
```{example}
The average counterfactual outcome of the treated is the mean of the red circles in the $y$ axis on Figure \@ref(fig:ploty1y0yB):
```
$$
\frac{1}{\sum_{i=1}^ND_i}\sum_{i=1}^Ny_i^0D_i= `r round(mean(y0[Ds==1]),2)`.
$$
Remember that we can estimate this quantity only because we have generated the data ourselves.
In real life, this quantity is hopelessly unobserved.
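In the simulation, it is simply the average of the generated $y_i^0$ among the treated:
```{r counterfactualTreated,eval=TRUE,echo=TRUE,results='markup'}
# average counterfactual outcome of the treated (computable only because we simulated y0)
mean(y0[Ds==1])
```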
$\esp{y_i^0|D_i=1}$ can be computed using the formula for the expectation of a censored normal random variable:
\begin{align*}
\esp{y_i^0|D_i=1} & = \esp{\mu_i+\delta+U_i^0|D_i=1}\\
& = \esp{\mu_i+\delta+\rho U_i^B+\epsilon_i|D_i=1}\\
& = \delta + \esp{\mu_i+\rho U_i^B|y_i^B\leq\bar{y}}\\
& = \delta + \bar{\mu} - \frac{\sigma^2_{\mu}+\rho\sigma^2_U}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\frac{\phi\left(\frac{\bar{y}-\bar{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\right)}{\Phi\left(\frac{\bar{y}-\bar{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\right)}.
\end{align*}
We can write a function in R to compute this value:
```{r espy0D1,eval=TRUE,echo=TRUE,tidy=FALSE}
esp.y0.D1 <- function(param){
return(param["delta"]+param["barmu"]
-((param["sigma2mu"]+param["rho"]*param["sigma2U"])
*dnorm((log(param["barY"])-param["barmu"])/(sqrt(param["sigma2mu"]+param["sigma2U"]))))
/(sqrt(param["sigma2mu"]+param["sigma2U"])*pnorm((log(param["barY"])-param["barmu"])
/(sqrt(param["sigma2mu"]+param["sigma2U"])))))
}
```
The population value of this counterfactual computed using this function is: $\esp{y_i^0|D_i=1}=$ `r round(esp.y0.D1(param),3)`.
## Intuitive estimators, confounding factors and selection bias
In this section, we are going to examine the properties of two intuitive comparisons that laypeople, policymakers, and we ourselves make in order to estimate causal effects: the with/without comparison ($WW$) and the before/after comparison ($BA$).
$WW$ compares the average outcomes of the treated individuals with those of the untreated individuals.
$BA$ compares the average outcomes of the treated after taking the treatment to their average outcomes before they took the treatment.
These comparisons try to proxy for the expected counterfactual outcome in the treated group by using an observed quantity.
$WW$ uses the expected outcome of the untreated individuals as a proxy.
$BA$ uses the expected outcome of the treated before they take the treatment as a proxy.
Unfortunately, both of these proxies are generally poor and provide biased estimates of $TT$.
The reason that these proxies are poor is that the treatment is not the only factor that differentiates the treated group from the groups used to form the proxy.
The intuitive comparisons are biased because factors other than the treatment are correlated with its allocation.
The factors that bias the intuitive comparisons are generally called confounding factors or confounders.
The treatment effect measures the effect of a ceteris paribus change in treatment status, while the intuitive comparisons capture both the effect of this change and that of other correlated changes that spuriously contaminate the comparison.
Intuitive comparisons measure correlations while treatment effects measure causality.
The old motto "correlation is not causation" applies vehemently here.
```{remark}
A funny anecdote about this expression "correlation is not causation".
This expression is due to Karl Pearson, the father of modern statistics.
He coined the phrase in his famous book "The Grammar of Science."
Pearson is famous for inventing the correlation coefficient.
He actually thought that correlation was a much superior, much more rigorous term, than causation.
In his book, he actually used the sentence to argue in favor of abandoning causation altogether and focusing on the much better-defined and measurable concept of correlation.
It is an interesting turn of events that his sentence is now used to mean that correlation is weaker than causation, completely reversing the original intended meaning.
```
In this section, we are going to define both comparisons, study their biases and state the conditions under which they identify $TT$.
This will prove to be a very useful introduction to the notion of identification.
It is also very important to be able to understand the sources of bias of comparisons that we use every day and that come very naturally to policy makers and lay people.
```{remark}
In this section, we state the definitions and formulae in the population.
This is for two reasons.
First, it is simpler, and lighter in terms of notation.
Second, it emphasizes that the problems with intuitive comparisons are independent of sampling noise.
Most of the results stated here for the population extend to the sample, replacing the expectation operator by the average operator.
I will nevertheless give examples in the sample, since it is so much simpler to compute.
I will denote sample equivalents of population estimators with a hat.
```
### With/Without comparison, selection bias and cross-sectional confounders
The with/without comparison ($WW$) is very intuitive: just compare the outcomes of the treated and untreated individuals in order to estimate the causal effect.
This approach is nevertheless generally biased.
We call the bias of $WW$ selection bias ($SB$).
Selection bias is due to unobserved confounders that are distributed differently in the treatment and control group and that generate differences in outcomes even in the absence of the treatment.
In this section, I define the $WW$ estimator, derive its bias, introduce the confounders and state the conditions under which it is unbiased.
#### With/Without comparison
The with/without comparison ($WW$) is very intuitive: just compare the outcomes of the treated and untreated individuals in order to estimate the causal effect.
```{definition,name='With/without comparison'}
The with/without comparison is the difference between the expected outcomes of the treated and the expected outcomes of the untreated:
\begin{align*}
\Delta^Y_{WW} & = \esp{Y_i|D_i=1}-\esp{Y_i|D_i=0}.
\end{align*}
```
```{example}
In the population, $WW$ can be computed using the traditional formula for the expectation of a truncated normal distribution:
\begin{align*}
\Delta^y_{WW} & = \esp{y_i|D_i=1}-\esp{y_i|D_i=0} \\
& = \esp{y_i^1|D_i=1}-\esp{y^0_i|D_i=0} \\
& = \esp{\alpha_i|D_i=1}+\esp{\mu_i+\rho U_i^B|\mu_i+U_i^B\leq\bar{y}}-\esp{\mu_i+\rho U_i^B|\mu_i+U_i^B>\bar{y}} \\
& = \bar{\alpha}+\theta\left(\bar{\mu}-\frac{\sigma^2_{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\frac{\phi\left(\frac{\bar{y}-\bar{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\right)}{\Phi\left(\frac{\bar{y}-\bar{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\right)}\right)
-\frac{\sigma^2_{\mu}+\rho\sigma^2_{U}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\left(\frac{\phi\left(\frac{\bar{y}-\bar{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\right)}{\Phi\left(\frac{\bar{y}-\bar{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\right)}+\frac{\phi\left(\frac{\bar{y}-\bar{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\right)}{1-\Phi\left(\frac{\bar{y}-\bar{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\right)}\right).
\end{align*}
```
In order to compute this parameter, we are going to set up an R function.
For reasons that will become clearer later, we will define two separate functions to compute the first and second part of the formula.
In the first part, you should have recognised $TT$, which we have already computed in Lecture 1.
We are going to call the second part $SB$, for reasons that will become explicit in a bit.
```{r WW.SB,eval=TRUE,echo=TRUE}
delta.y.tt <- function(param){
return(param["baralpha"]+param["theta"]*param["barmu"]-param["theta"]
*((param["sigma2mu"]*dnorm((log(param["barY"])-param["barmu"])
/(sqrt(param["sigma2mu"]+param["sigma2U"]))))
/(sqrt(param["sigma2mu"]+param["sigma2U"])*pnorm((log(param["barY"])-param["barmu"])
/(sqrt(param["sigma2mu"]+param["sigma2U"]))))))
}
delta.y.sb <- function(param){
return(-(param["sigma2mu"]+param["rho"]*param["sigma2U"])/sqrt(param["sigma2mu"]+param["sigma2U"])
*dnorm((log(param["barY"])-param["barmu"])/(sqrt(param["sigma2mu"]+param["sigma2U"])))
*(1/pnorm((log(param["barY"])-param["barmu"])/(sqrt(param["sigma2mu"]+param["sigma2U"])))
+1/(1-pnorm((log(param["barY"])-param["barmu"])/(sqrt(param["sigma2mu"]+param["sigma2U"]))))))
}
delta.y.ww <- function(param){
return(delta.y.tt(param)+delta.y.sb(param))
}
```
At the end of all these derivations, $WW$ in the population is equal to `r round(delta.y.ww(param),3)`.
Remember that the value of $TT$ in the population is `r round(delta.y.tt(param),3)`.
In order to compute the $WW$ estimator in a sample, I'm going to generate a brand new sample and I'm going to choose a seed for the pseudo-random number generator so that we obtain the same result each time we run the code.
I use ``set.seed(1234)`` in the code chunk below.
```{r simul,eval=TRUE,echo=TRUE}
param <- c(8,.5,.28,1500)
names(param) <- c("barmu","sigma2mu","sigma2U","barY")
set.seed(1234)
N <-1000
mu <- rnorm(N,param["barmu"],sqrt(param["sigma2mu"]))
UB <- rnorm(N,0,sqrt(param["sigma2U"]))
yB <- mu + UB
YB <- exp(yB)
Ds <- rep(0,N)
Ds[YB<=param["barY"]] <- 1
l <- length(param)
param <- c(param,0.9,0.01,0.05,0.05,0.05,0.1)
names(param)[(l+1):length(param)] <- c("rho","theta","sigma2epsilon","sigma2eta","delta","baralpha")
epsilon <- rnorm(N,0,sqrt(param["sigma2epsilon"]))
eta<- rnorm(N,0,sqrt(param["sigma2eta"]))
U0 <- param["rho"]*UB + epsilon
y0 <- mu + U0 + param["delta"]
alpha <- param["baralpha"]+ param["theta"]*mu + eta
y1 <- y0+alpha
Y0 <- exp(y0)
Y1 <- exp(y1)
y <- y1*Ds+y0*(1-Ds)
Y <- Y1*Ds+Y0*(1-Ds)
```
In this sample, the average outcome of the treated in the presence of the treatment is
$$
\frac{1}{\sum_{i=1}^ND_i}\sum_{i=1}^ND_iy_i= `r round(mean(y[Ds==1]),3)`.
$$
It is materialized by a circle on Figure \@ref(fig:ploty0y1yBD).
The average outcome of the untreated is
$$
\frac{1}{\sum_{i=1}^N(1-D_i)}\sum_{i=1}^N(1-D_i)y_i= `r round(mean(y[Ds==0]),3)`.
$$
It is materialized by a plus sign on Figure \@ref(fig:ploty0y1yBD).
```{r ploty0y1yBD,eval=TRUE,fig.cap='Evolution of average outcomes in the treated and control group before (Time =1) and after (Time=2) the treatment',fig.align='center',echo=FALSE,out.width=cst}
x <- c(1,2)
y11 <- c(mean(yB[Ds==1]),mean(y[Ds==1]))
y10 <- c(mean(yB[Ds==1]),mean(y0[Ds==1]))
y00 <- c(mean(yB[Ds==0]),mean(y0[Ds==0]))
plot(x,y11,ylim=c(6.5,8.5),type='o',pch=1,xlab='Time',ylab='Average outcomes')
points(x,y10,pch=2,type='o',lty=2)
points(x,y00,pch=3,type='o',lty=6)
legend(1,8,c('Untreated (observed)','Treated (observed)','Treated (counterfactual)'),pch=c(3,1,2),lty=c(6,1,2),ncol=1)
```
The estimate of the $WW$ comparison in the sample is thus:
\begin{align*}
\hat{\Delta^Y_{WW}} & = \frac{1}{\sum_{i=1}^N D_i}\sum_{i=1}^N Y_iD_i-\frac{1}{\sum_{i=1}^N (1-D_i)}\sum_{i=1}^N Y_i(1-D_i).
\end{align*}
We have $\hat{\Delta^y_{WW}}=$ `r round(mean(y[Ds==1])-mean(y[Ds==0]),3)`.
Remember that the value of $TT$ in the sample is $\Delta^y_{TT_s}=$ `r round(mean(alpha[Ds==1]),3)`.
Overall, $WW$ severely underestimates the effect of the treatment in our example.
$WW$ suggests that the treatment has a negative effect on outcomes whereas we know by construction that it has a positive one.
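The sample computations behind these numbers are straightforward:
```{r sampleWW,eval=TRUE,echo=TRUE,results='markup'}
# with/without comparison and sample TT in the simulated sample
c(WW=mean(y[Ds==1])-mean(y[Ds==0]),TT=mean(alpha[Ds==1]))
```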
#### Selection bias
When we form the with/without comparison, we do not recover the $TT$ parameter.
Instead, we recover $TT$ plus a bias term, called **selection bias**:
\begin{align*}
\Delta^Y_{WW} & =\Delta^Y_{TT}+\Delta^Y_{SB}.
\end{align*}
```{definition,SB,name='Selection bias'}
Selection bias is the difference between the with/without comparison and the treatment on the treated parameter:
\begin{align*}
\Delta^Y_{SB} & = \Delta^Y_{WW}-\Delta^Y_{TT}.
\end{align*}
```
$WW$ tries to approximate the counterfactual expected outcome in the treated group by using $\esp{Y_i^0|D_i=0}$, the expected outcome in the untreated group.
Selection bias appears because this proxy is generally poor.
It is very easy to see that selection bias is indeed directly due to this bad proxy problem:
```{theorem,SBth,name='Selection bias and counterfactual'}
Selection bias is the difference between the counterfactual expected potential outcome in the absence of the treatment among the treated and the expected potential outcome in the absence of the treatment among the untreated.
\begin{align*}
\Delta^Y_{SB} & = \esp{Y_i^0|D_i=1}-\esp{Y_i^0|D_i=0}.
\end{align*}
```
```{proof}
\begin{align*}
\Delta^Y_{SB} & = \Delta^Y_{WW}-\Delta^Y_{TT} \\
& = \esp{Y_i|D_i=1}-\esp{Y_i|D_i=0}-\esp{Y_i^1-Y_i^0|D_i=1}\\
& = \esp{Y_i^0|D_i=1}-\esp{Y_i^0|D_i=0}.
\end{align*}
The first and second equalities stem only from the definition of both parameters.
The third equality stems from using the switching equation: $Y_i=Y_i^1D_i+Y_i^0(1-D_i)$, so that $\esp{Y_i|D_i=1}=\esp{Y^1_i|D_i=1}$ and $\esp{Y_i|D_i=0}=\esp{Y_i^0|D_i=0}$.
```
```{example}
In the population, $SB$ is equal to
```
\begin{align*}
\Delta^y_{SB} & = \Delta^y_{WW}-\Delta^y_{TT} \\
& = `r round(delta.y.ww(param),3)` - `r round(delta.y.tt(param),3)` \\
& = `r round(delta.y.ww(param)-delta.y.tt(param),3)`
\end{align*}
We could have computed $SB$ directly using the formula from Theorem \@ref(thm:SBth):
\begin{align*}
\Delta^y_{SB} & = \esp{y_i^0|D_i=1}-\esp{y_i^0|D_i=0}\\
& = -\frac{\sigma^2_{\mu}+\rho\sigma^2_{U}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\left(\frac{\phi\left(\frac{\bar{y}-\bar{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\right)}{\Phi\left(\frac{\bar{y}-\bar{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\right)}+\frac{\phi\left(\frac{\bar{y}-\bar{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\right)}{1-\Phi\left(\frac{\bar{y}-\bar{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\right)}\right).
\end{align*}
When using the R function for $SB$ that we have defined earlier, we indeed find: $\Delta^y_{SB}=$ `r round(delta.y.sb(param),3)`.
In the sample, $\hat{\Delta^y_{SB}}=$`r round(mean(y[Ds==1])-mean(y[Ds==0]),3)`-`r round((mean(y1[Ds==1])-mean(y0[Ds==1])),3)` $=$ `r round(mean(y[Ds==1])-mean(y[Ds==0])-(mean(y1[Ds==1])-mean(y0[Ds==1])),3)`.
Selection bias emerges because we are using a bad proxy for the counterfactual.
The average outcome for the untreated is equal to $\frac{1}{\sum_{i=1}^N(1-D_i)}\sum_{i=1}^N(1-D_i)y_i=$ `r round(mean(y0[Ds==0]),3)` while the counterfactual average outcome for the treated is $\frac{1}{\sum_{i=1}^ND_i}\sum_{i=1}^ND_iy^0_i=$ `r round(mean(y0[Ds==1]),3)`.
Their difference is as expected equal to $SB$:
$\hat{\Delta^y_{SB}}=$ `r round(mean(y0[Ds==1]),3)` $-$ `r round(mean(y0[Ds==0]),3)` $=$ `r round(mean(y0[Ds==1])-mean(y0[Ds==0]),3)`.
The counterfactual average outcome of the treated is much smaller than the average outcome of the untreated.
On Figure \@ref(fig:ploty0y1yBD), this is materialized by the fact that the plus sign is located much above the triangle.
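In code, the sample selection bias is just the gap between these two averages:
```{r sampleSB,eval=TRUE,echo=TRUE,results='markup'}
# sample selection bias: counterfactual mean of the treated minus observed mean of the untreated
mean(y0[Ds==1])-mean(y0[Ds==0])
```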
```{remark}
The concept of selection bias is related to but different from the concept of sample selection bias.
With sample selection bias, we worry that selection into the sample might bias the estimated effect of a treatment on outcomes.
With selection bias, we worry that selection into the treatment itself might bias the effect of the treatment on outcomes.
Both biases are due to unobserved covariates, but they do not play out in the same way.
For example, estimating the effect of education on women's wages raises both selection bias and sample selection bias issues.
Selection bias stems from the fact that more educated women are more likely to be more dynamic and thus to have higher earnings even if they had received less education.
Selection bias would be positive in that case, overestimating the effect of education on earnings.
Sample selection bias stems from the fact that we can only use a sample of working women in order to estimate the effect of education on wages, since we do not observe the wages of non-working women.
But, selection into the labor force might generate sample selection bias.
More educated women participate more in the labor market, while less educated women participate less.
As a consequence, less educated women that work are different from the overall sample of less educated women.
They might be more dynamic and work-focused.
As a consequence, their wages are higher than the average wages of the less educated women.
Comparing the wages of less educated women that work to those of more educated women that work might understate the effect of education on earnings.
Sample selection bias would generate a negative bias on the education coefficient.
```
#### Confounding factors
Confounding factors are the factors that generate differences between treated and untreated individuals even in the absence of the treatment.
The confounding factors are thus responsible for selection bias.
In general, the mere fact of being selected for receiving the treatment means that you have a host of characteristics that would differentiate you from the unselected individuals, even if you were not to receive the treatment eventually.
For example, if a drug is given to initially sicker individuals, then we expect that they will be sicker than the untreated in the absence of the treatment.
Comparing sick individuals to healthy ones is not a sound way to estimate the effect of a treatment.
Obviously, even if our treatment performs well, the initially healthier untreated individuals might still be healthier after the treatment has been allocated to the sicker patients.
The best we can expect is that the treated patients have recovered, and that their health after the treatment is comparable to that of the untreated patients.
In that case, the with/without comparison is going to be null, whereas the true effect of the treatment is positive.
Selection bias is negative in that case: in the absence of the treatment, the average health status of the treated individuals would have been smaller than that of the untreated individuals.
The confounding factor is the health status of individuals when the decision to allocate the drug has been taken.
It is correlated to both the allocation of the treatment (negatively) and to health in the absence of the treatment (positively).
```{example}
In our example, $\mu_i$ and $U_i^B$ are the confounding factors.
Because the treatment is only given to individuals with pre-treatment outcomes smaller than a threshold ($y_i^B\leq\bar{y}$), participants tend to have smaller $\mu_i$ and $U_i^B$ than non participants, as we can see on Figure \@ref(fig:histmuD).
```
```{r histmuD,eval=TRUE,echo=FALSE,results='hide',fig.cap='Distribution of confounders in the treated and control group',fig.align='center',out.width=cst,fig.pos='htbp'}
data <- as.data.frame(cbind(mu,UB,y0,U0,Ds))
data$D <-as.factor(Ds)
hist.mu <- ggplot(data, aes(x=mu, fill=D)) +
geom_histogram(binwidth=.3, alpha=.5, position="identity")
hist.UB <- ggplot(data, aes(x=UB, fill=D)) +
geom_histogram(binwidth=.3, alpha=.5, position="identity")
hist.U0D <- ggplot(data, aes(x=U0, fill=D)) +
geom_histogram(binwidth=.3, alpha=.5, position="identity")
hist.y0 <- ggplot(data, aes(x=y0, fill=D)) +
geom_histogram(binwidth=.3, alpha=.5, position="identity")
figure <- ggarrange(hist.mu,hist.UB,hist.U0D,hist.y0,labels=c("A", "B", "C","D"),ncol = 2, nrow = 2)
figure
```
Since confounding factors are persistent, they affect the outcomes of participants and non participants after the treatment date.
$\mu_i$ persists entirely over time, and $U_i^B$ persists at a rate $\rho$.
As a consequence, even in the absence of the treatment, participants have lower outcomes than non participants, as we can see on Figure \@ref(fig:histmuD).
We can derive the contributions of both confounding factors to overall SB:
\begin{align*}
\esp{Y_i^0|D_i=1} & = \esp{\mu_i+\delta+U_i^0|\mu_i+U_i^B\leq\bar{y}}\\
& = \delta + \esp{\mu_i|\mu_i+U_i^B\leq\bar{y}} + \rho\esp{U_i^B|\mu_i+U_i^B\leq\bar{y}}\\
\Delta^y_{SB} & = \esp{\mu_i|\mu_i+U_i^B\leq\bar{y}}-\esp{\mu_i|\mu_i+U_i^B>\bar{y}} \\
& \phantom{=} + \rho\left(\esp{U_i^B|\mu_i+U_i^B\leq\bar{y}}-\esp{U_i^B|\mu_i+U_i^B>\bar{y}}\right)\\
& = -\frac{\sigma^2_{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\left(\frac{\phi\left(\frac{\bar{y}-\bar{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\right)}{\Phi\left(\frac{\bar{y}-\bar{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\right)}+\frac{\phi\left(\frac{\bar{y}-\bar{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\right)}{1-\Phi\left(\frac{\bar{y}-\bar{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\right)}\right) \\
& \phantom{=} -\frac{\rho\sigma^2_{U}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\left(\frac{\phi\left(\frac{\bar{y}-\bar{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\right)}{\Phi\left(\frac{\bar{y}-\bar{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\right)}+\frac{\phi\left(\frac{\bar{y}-\bar{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\right)}{1-\Phi\left(\frac{\bar{y}-\bar{\mu}}{\sqrt{\sigma^2_{\mu}+\sigma^2_{U}}}\right)}\right)
\end{align*}
In order to evaluate these quantities, let's build two R functions:
```{r SB.decomp,eval=TRUE,echo=TRUE}
delta.y.sb.mu <- function(param){
return(-(param["sigma2mu"])/sqrt(param["sigma2mu"]+param["sigma2U"])
*dnorm((log(param["barY"])-param["barmu"])/(sqrt(param["sigma2mu"]+param["sigma2U"])))
*(1/pnorm((log(param["barY"])-param["barmu"])/(sqrt(param["sigma2mu"]+param["sigma2U"])))
+1/(1-pnorm((log(param["barY"])-param["barmu"])/(sqrt(param["sigma2mu"]+param["sigma2U"]))))))
}
delta.y.sb.U <- function(param){
return(-(param["rho"]*param["sigma2U"])/sqrt(param["sigma2mu"]+param["sigma2U"])
*dnorm((log(param["barY"])-param["barmu"])/(sqrt(param["sigma2mu"]+param["sigma2U"])))
*(1/pnorm((log(param["barY"])-param["barmu"])/(sqrt(param["sigma2mu"]+param["sigma2U"])))
+1/(1-pnorm((log(param["barY"])-param["barmu"])/(sqrt(param["sigma2mu"]+param["sigma2U"]))))))
}
```
The contribution of $\mu_i$ to selection bias is `r round(delta.y.sb.mu(param),3)` while that of $U_i^B$ is `r round(delta.y.sb.U(param),3)`.
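As a sanity check, the two contributions should sum to the total selection bias that we computed earlier:
```{r SBdecompcheck,eval=TRUE,echo=TRUE,results='markup'}
# the two contributions should add up to the overall selection bias
unname(c(delta.y.sb.mu(param)+delta.y.sb.U(param),delta.y.sb(param)))
```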
#### When does $WW$ identify $TT$?
Are there conditions under which $WW$ identifies $TT$?
The answer is yes: when there is no selection bias, the proxy used by $WW$ for the counterfactual quantity is actually valid.
Formally, $WW$ identifies $TT$ when the following assumption holds:
```{definition,noselb,name="No selection bias"}
We assume the following:
\begin{align*}
\esp{Y_i^0|D_i=1} & = \esp{Y_i^0|D_i=0}.
\end{align*}
```
Under Assumption \@ref(def:noselb), the expected counterfactual outcome of the treated is equal to the expected potential outcome of the untreated in the absence of the treatment.
This yields the following result:
```{theorem,wwtt}
Under Assumption \@ref(def:noselb), $WW$ identifies the $TT$ parameter:
\begin{align*}
\Delta^Y_{WW} & = \Delta^Y_{TT}.
\end{align*}
```
```{proof}
\begin{align*}
\Delta^Y_{WW} & = \esp{Y_i|D_i=1}-\esp{Y_i|D_i=0}\\
& = \esp{Y_i^1|D_i=1}-\esp{Y_i^0|D_i=0}\\
& = \esp{Y_i^1|D_i=1}-\esp{Y_i^0|D_i=1} \\
& = \Delta^Y_{TT},
\end{align*}
where the second equation uses the switching equation and the third uses Assumption \@ref(def:noselb).
```
So, under Assumption \@ref(def:noselb), the $WW$ comparison actually identifies the $TT$ parameter.
We say that Assumption \@ref(def:noselb) is an **identification assumption**: it serves to identify the parameter of interest using observed data.
The intuition for this result is simply that, under Assumption \@ref(def:noselb), there are no confounding factors and thus no selection bias.
Under Assumption \@ref(def:noselb), the factors that lead individuals to receive the treatment or not are mean-independent of the potential outcomes in the absence of the treatment.
In this case, the expected outcome in the untreated group actually is a perfect proxy for the counterfactual expected outcome of the treated group.
Obviously, Assumption \@ref(def:noselb) is extremely unlikely to hold in real life.
For Assumption \@ref(def:noselb) to hold, it has to be that **all** the determinants of $D_i$ are actually unrelated to $Y_i^0$.
One way to enforce Assumption \@ref(def:noselb) is to randomize treatment intake.
We will see this in the Lecture on RCTs.
It might also be possible that Assumption \@ref(def:noselb) holds in the data in the absence of an RCT.
But this is not very likely, and should be checked by every means possible.
One way to test for the validity of Assumption \@ref(def:noselb) is to compare the values of observed covariates in the treated and untreated group.
For Assumption \@ref(def:noselb) to be credible, observed covariates should be distributed in the same way.
Another nice way to test for the validity of Assumption \@ref(def:noselb) with observed data is to implement a **placebo test**.
A placebo test looks for an effect where there should be none, if we believe the identification assumptions.
For example, under Assumption \@ref(def:noselb) it should be (even though it is not rigorously implied) that outcomes before the treatment are also mean-independent of the treatment allocation.
And actually, since a future treatment cannot have an effect today (unless people anticipate the treatment, which we assume away here), the $WW$ comparison before the treatment should be null, therefore giving a zero effect of the placebo treatment "will receive the treatment in the future."
```{example}
When the allocation rule defining $D_i$ is the eligibility rule that we have used so far, we have already seen that Assumption \@ref(def:noselb) does not hold and the placebo test should not pass either.
```
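We can check this directly in our simulated sample: computing the with/without comparison on the pre-treatment outcome $y_i^B$ gives a large negative placebo "effect" where there should be none.
```{r placeboWW,eval=TRUE,echo=TRUE,results='markup'}
# placebo test: with/without comparison on the pre-treatment outcome
# a non-zero value signals selection bias (here it is large and negative by construction of the cutoff rule)
mean(yB[Ds==1])-mean(yB[Ds==0])
```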
One way of making Assumption \@ref(def:noselb) hold under the eligibility rule that we are using is to mute the persistence in outcome dynamics.
For example, one could set $\rho=0$ and $\sigma^2_{\mu}=0$.
```{r paramnopersist,eval=TRUE,echo=TRUE}
param <- c(8,0,.28,1500,0,0.01,0.05,0.05,0.05,0.1)
names(param) <- c("barmu","sigma2mu","sigma2U","barY","rho","theta","sigma2epsilon","sigma2eta","delta","baralpha")
param
```
In that case, outcomes are not persistent and Assumption \@ref(def:noselb) holds:
\begin{align*}
\esp{y_i^0|D_i=1} & = \esp{\mu_i+\delta+U_i^0|y_i^B\leq\bar{y}}\\
& = \esp{\bar{\mu}+\delta+\epsilon_i|\bar{\mu}+U_i^B\leq\bar{y}}\\
& = \bar{\mu} + \delta + \esp{\epsilon_i|\bar{\mu}+U_i^B\leq\bar{y}}\\
& = \bar{\mu} + \delta + \esp{\epsilon_i|\bar{\mu}+U_i^B>\bar{y}}\\
& = \esp{\mu_i+\delta+U_i^0|y_i^B>\bar{y}}\\
& = \esp{y_i^0|D_i=0},
\end{align*}
where the second equality follows from $\sigma^2_{\mu}=0$ and $\rho=0$ and the fourth from $\epsilon_i \Ind U_i^B$.
Another direct way to see this is to use the formula for selection bias that we have derived above.
It is easy to see that with $\rho=0$ and $\sigma^2_{\mu}=0$, $\Delta^y_{SB}=0$.
To be sure, we can compute $\Delta^y_{SB}$ with the new parameter values: $\Delta^y_{SB}=$ `r round(delta.y.sb(param),3)`.
As a consequence, $\Delta^y_{TT}=$ `r round(delta.y.tt(param),2)` $=$ `r round(delta.y.ww(param),2)` $=\Delta^y_{WW}$.
```{remark}
You might have noticed that the value of $\Delta^y_{TT}$ is different from before.
This is normal, since it depends on the values of the parameters, and especially on $\sigma_{\mu}^2$ and $\rho$.
```
Let's see how these quantities behave in the sample.
```{r simulnopersist,eval=TRUE,echo=TRUE}
set.seed(1234)
mu <- rnorm(N,param["barmu"],sqrt(param["sigma2mu"]))
UB <- rnorm(N,0,sqrt(param["sigma2U"]))
yB <- mu + UB
YB <- exp(yB)
Ds <- rep(0,N)
Ds[YB<=param["barY"]] <- 1
epsilon <- rnorm(N,0,sqrt(param["sigma2epsilon"]))
eta<- rnorm(N,0,sqrt(param["sigma2eta"]))
U0 <- param["rho"]*UB + epsilon
y0 <- mu + U0 + param["delta"]
alpha <- param["baralpha"]+ param["theta"]*mu + eta
y1 <- y0+alpha
Y0 <- exp(y0)
Y1 <- exp(y1)
y <- y1*Ds+y0*(1-Ds)
Y <- Y1*Ds+Y0*(1-Ds)
```
We can see that $\hat{\esp{Y_i^0|D_i=1}}=$ `r round(mean(y0[Ds==1]),3)` $\approx$ `r round(mean(y0[Ds==0]),3)` $=\hat{\esp{Y_i^0|D_i=0}}$.
This means that $WW$ should be close to $TT$: $\hat{\Delta^y_{TT}}=$ `r round(mean(alpha[Ds==1]),3)` $\approx$ `r round(mean(y[Ds==1])-mean(y[Ds==0]),3)` $=\hat{\Delta^y_{WW}}$.
Note that $\hat{WW}$ in the sample is not exactly, but only approximately, equal to $TT$ in the population and in the sample.
This is an instance of the Fundamental Problem of Statistical Inference that we will study in the next chapter.
Under these restrictions, the placebo test would unfortunately conclude against Assumption \@ref(def:noselb) even though it is valid: