# Clustering {#cluster}
Until now, we have assumed throughout the book that different units, say $i$ and $j$, with $i\neq j$, never interact in any way.
One way we have encoded that is by implicitly assuming that the treatment of individual $i$ has no impact on the outcomes of individual $j$.
We will clarify this assumption in Chapter \@ref(Diffusion), and study ways to relax it.
Another way we have assumed an absence of interactions is by assuming that all potential outcomes and treatment indicators are independently and identically distributed across observations.
We have encoded this restriction at various points, first with Assumption \@ref(hyp:iid) in the context of estimating the sampling noise in Randomized Controlled Trials, and then with Assumptions \@ref(hyp:iidDID) and \@ref(hyp:iidDIDCross) in DID settings.
This assumption enabled us to leverage the Central Limit Theorem, the most powerful tool we can use to derive asymptotic approximations to our estimators.
The *i.i.d.* assumption is nevertheless extremely restrictive.
It excludes various cases that we are likely to encounter in real-life scenarios.
For example, this assumption will be wrong if we allocate the treatment at the level of a group of units, such as a village or a firm.
Groups of units are generally called **clusters**.
In that case, the treatment indicator will not be i.i.d. across observations, since knowing the treatment status of a unit in a given cluster can help us predict the treatment status of its neighbors in the same cluster, so Assumption \@ref(hyp:iid) would be violated.
The i.i.d. assumption will also be wrong in panel data with repeated observations of the same units if the outcomes are correlated over time for reasons other than having a fixed effect in common.
For example, in any case in which shocks to outcomes around the fixed effect are somewhat persistent over time, we will have correlation between the observations of the same unit over time, and between the changes in outcomes of the same unit over time.
This last issue would violate Assumption \@ref(hyp:iidDID).
There are many instances where the shocks to outcomes are likely correlated over time, such as productivity shocks, earnings, education, health, or savings, that is, most of the outcomes we are generally interested in.
Finally, the i.i.d. assumption will also be violated in DID settings in repeated cross sections if the treatment is allocated at a more aggregate level than that of the individual unit.
For example, if we study the consequences of laws that vary at the level of U.S. counties or States, such as minimum wage laws, and we have observations on firms' TFP and employment level, or on individuals' earnings, then the treatment status of units belonging to the same cluster is correlated, and the outcomes are probably correlated between units in the same cluster as well.
In this chapter, we are going to examine what happens to our traditional estimators when relaxing the i.i.d. assumption, and which new estimators we can use, and what types of diagnostic tests we can make use of in order to assess the severity of the issue.
Clustering, as this issue is generally called, has been the topic of a **very** large literature in econometrics, whose results are sometimes vague and sometimes hard to comprehend or to reconcile.
What we are going to do is to try to carve a path through this literature, highlighting its main motivations, its main tools and its main results.
We are first going to study the case of clustering in the context of Randomized Controlled Trials.
We will then move on to the case of temporal autocorrelation in panel data and we will end with the case of panel data with clustering across time and between observations in each cross section.
## Clustering in Randomized Controlled Trials {#ClusterRCT}
I will first show what the problem of clustering does to sampling noise, and by how much our default estimator of sampling noise, which assumes no clustering, is biased.
I will then introduce the notion of design effect, which is a way to quantify the effect of clustering on sampling noise estimates.
Finally, I will propose ways to provide estimates of sampling noise that account for clustering.
### An example
Let us first see how that happens in our example before deriving the reason for why we have such a problem and the possible solutions.
For that, we are going to simulate data with clusters (such as villages) and we are going to have two features of the data generating process that happen at the cluster level:
1. The treatment is going to be randomized at the cluster level.
We thus have a clustered randomized controlled trial, where randomization is at the level of clusters regrouping several individual units of observation.
For example, think of a classroom where we randomize a new treatment and we observe outcomes at the student level (grades, health status, participation, etc), or think of a vaccination campaign randomized at the village level.
2. The outcome is going to be autocorrelated within clusters.
For example, think of the common determinants of school success (such as the teacher, the parents, the overall conditions at the school) or of the health status in a village (climate, sanitation, pressure by parasites, etc) at the cluster level.
These common determinants make the outcomes of two people taken at random in a given cluster more similar than those of two people taken at random in the whole population.
The share of the variance in outcomes explained by the variance across clusters is called the intra-cluster correlation coefficient (ICC) and is going to play a key role in what follows.
```{example}
Let's see how we model all of that in our example.
We choose to generate the outcomes of individual $i$ in cluster $c$ in period $t$ as follows:
```
\begin{align*}
y^0_{i,c,t} & = \mu_{i,c} + U_{i,t} \\
\mu_{i,c} & = \mu^C_c + \mu^U_i \\
U_{i,t} & = \rho U_{i,t-1} + \epsilon_{i,t},
\end{align*}
where $U_{i,0}$, $\epsilon_{i,t}$, $\mu^U_i$ and $\mu^C_c$ are i.i.d. shocks independent of each other.
I introduce a panel data structure because it will be useful for the simulations in the next section.
$\mu^C_c$ is the error term varying at the cluster level.
We call it a cluster fixed effect.
The intra-cluster correlation coefficient is equal to the ratio between the variance in outcomes between clusters and the total variance in outcomes:
\begin{align*}
ICC_t & = \frac{\sigma^2_{\mu^C}}{\sigma^2_{\mu^C}+\sigma^2_{\mu^U}+\sigma^2_{U_t}}.
\end{align*}
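As a sketch, the empirical counterpart of this ICC (the one computed in the simulations below) divides the variance of cluster-level mean outcomes by the total variance of outcomes; the function name `icc.hat` and its arguments are mine:

```{r ICCSketch,eval=TRUE,echo=TRUE,results='hide'}
# A minimal sketch of the empirical ICC: the share of total outcome variance
# accounted for by the variance of cluster means; y is a vector of outcomes
# and cluster a vector of cluster identifiers (both hypothetical names)
icc.hat <- function(y,cluster){
  # ave(y,cluster) replaces each observation by its cluster mean
  var(ave(y,cluster))/var(y)
}
```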
Let's generate some data and see what an increase in ICC does to our estimates of the treatment effect in a Brute Force design.
As usual, we are going to use the With/Without estimator to compute the treatment effect.
Let us first set parameter values:
```{r paramICC}
param <- c(8,.5,.28,1500,0.9,0.01,0.05,0.05,0.05,0.1)
names(param) <- c("barmu","sigma2mu","sigma2U","barY","rho","theta","sigma2epsilon","sigma2eta","delta","baralpha")
```
Let us then simulate one dataset:
```{r ICCDataSimul,eval=TRUE,echo=TRUE,warning=FALSE,error=FALSE,message=FALSE}
N <- 1000
ICC.mu <- 0.2
Nclusters <- 100
set.seed(1234)
# I am going to draw a cluster fixed effect
muC <- rnorm(Nclusters,0,sqrt(param["sigma2mu"]*ICC.mu))
muC <- rep(muC,each=N/Nclusters)
# I draw an individual fixed effect with the remaining variance
muU <- rnorm(N,0,sqrt(param["sigma2mu"]*(1-ICC.mu)))
mu <- param["barmu"] + muC + muU
UBB <- rnorm(N,0,sqrt(param["sigma2U"]))
yBB <- mu + UBB
YBB <- exp(yBB)
epsilonB <- rnorm(N,0,sqrt(param["sigma2epsilon"]))
U0B <- param["rho"]*UBB + epsilonB
y0B <- mu + U0B
epsilonA <- rnorm(N,0,sqrt(param["sigma2epsilon"]))
U0A <- param["rho"]*U0B + epsilonA
y0A <- mu + U0A + param["delta"]
eta<- rnorm(N,0,sqrt(param["sigma2eta"]))
alpha <- param["baralpha"]+ param["theta"]*mu + eta
y1B <- y0B+alpha
y1A <- y0A+alpha
# randomized allocation at the cluster level
Rs <- runif(Nclusters)
# cluster level treatment vector
Rc <- ifelse(Rs<=.5,1,0)
# individual level treatment vector
R <- rep(Rc,each=N/Nclusters)
# outcomes
yB <- y0B
yA <- y1A*R+y0A*(1-R)
# clusters
cluster <- 1:Nclusters
cluster <- rep(cluster,each=N/Nclusters)
```
Let us finally run a set of Monte Carlo simulations, varying ICC:
```{r monte.carlo.BF.RCT,eval=TRUE,echo=TRUE,warning=FALSE,error=FALSE,message=FALSE,results='hide',cache=TRUE}
monte.carlo.BF.ICC <- function(s,N,Nclusters,ICC.mu,param){
set.seed(s)
# I am going to draw a cluster fixed effect
muC <- rnorm(Nclusters,0,sqrt(param["sigma2mu"]*ICC.mu))
muC <- rep(muC,each=N/Nclusters)
# I draw an individual fixed effect with the remaining variance
muU <- rnorm(N,0,sqrt(param["sigma2mu"]*(1-ICC.mu)))
mu <- param["barmu"] + muC + muU
UBB <- rnorm(N,0,sqrt(param["sigma2U"]))
yBB <- mu + UBB
YBB <- exp(yBB)
epsilonB <- rnorm(N,0,sqrt(param["sigma2epsilon"]))
U0B <- param["rho"]*UBB + epsilonB
y0B <- mu + U0B
epsilonA <- rnorm(N,0,sqrt(param["sigma2epsilon"]))
U0A <- param["rho"]*U0B + epsilonA
y0A <- mu + U0A + param["delta"]
eta<- rnorm(N,0,sqrt(param["sigma2eta"]))
alpha <- param["baralpha"]+ param["theta"]*mu + eta
y1B <- y0B+alpha
y1A <- y0A+alpha
# randomized allocation at the cluster level
Rs <- runif(Nclusters)
# cluster level treatment vector
Rc <- ifelse(Rs<=.5,1,0)
# individual level treatment vector
R <- rep(Rc,each=N/Nclusters)
# outcomes
yB <- y0B
yA <- y1A*R+y0A*(1-R)
# regression
ols.ww.BF <- lm(yA ~ R)
ols.ww.BF.yB <- lm(yA ~ R + yB)
# True ICC
data <- as.data.frame(cbind(yB,R))
cluster <- 1:Nclusters
cluster <- rep(cluster,each=N/Nclusters)
data$cluster <- cluster
data <- group_by(data,cluster)
data <- merge(data,summarise(data,yB.mean=mean(yB)))
ICC <- var(data$yB.mean)/var(yB)
return(c(ols.ww.BF$coef[[2]],ols.ww.BF.yB$coef[[2]],ICC))
}
sf.simuls.BF.ICC <- function(ICC.mu,N,Nclusters,Nsim,param){
sfInit(parallel=TRUE,cpus=8)
sfLibrary(dplyr)
sim <- matrix(unlist(sfLapply(1:Nsim,monte.carlo.BF.ICC,N=N,Nclusters=Nclusters,ICC.mu=ICC.mu,param=param)),nrow=Nsim,ncol=3,byrow=TRUE)
sfStop()
colnames(sim) <- c('BF','BF.yB','ICC')
return(sim)
}
Nsim <- 1000
#Nsim <- 50
ICC.val <- c(0,0.05,0.2,0.5,0.8,1)
simuls.BF.ICC <- lapply(ICC.val,sf.simuls.BF.ICC,N=N,Nclusters=Nclusters,Nsim=Nsim,param=param)
names(simuls.BF.ICC) <- ICC.val
```
Let us now present the results of the simulations:
```{r MonteCarloBFICCHist,eval=TRUE,echo=FALSE,warning=FALSE,error=FALSE,message=FALSE,results='hide',fig.cap='Distribution of the $WW$ estimator in a Brute Force Design for various levels of ICC',fig.align='center',out.width='65%',fig.pos='htbp'}
par(mfrow=c(2,3))
for (i in 1:length(simuls.BF.ICC)){
hist(simuls.BF.ICC[[i]][,'BF'],main=paste('ICC=',as.character(round(mean(simuls.BF.ICC[[i]][,'ICC']),digits=2))),breaks=30,xlab=expression(hat(Delta^yWW)),xlim=c(-0.15,0.55))
abline(v=delta.y.ate(param),col="red")
}
```
Figure \@ref(fig:MonteCarloBFICCHist) suggests that increasing the ICC increases sampling noise.
Let's compute sampling noise formally:
```{r precisionBFICC,eval=TRUE,echo=TRUE,results='hide',echo=FALSE,warning=FALSE,error=FALSE,message=FALSE,fig.cap='Effect of ICC on the 99\\% sampling noise of the WW estimator in a Brute Force design',fig.align='center',out.width='65%',fig.pos='htbp'}
delta <- 0.99
precision.BF <- function(k,delta){
return(2*quantile(abs(simuls.BF.ICC[[k]][,'BF']-delta.y.ate(param)),probs=c(delta=delta)))
}
# preparing data for plot
precision.BF.ICC <- sapply(1:length(ICC.val),precision.BF,delta=delta)
ICC.est <- sapply(simuls.BF.ICC,function(sim) round(mean(sim[,'ICC']),digits=2))
precision <- as.data.frame(cbind(ICC.est,precision.BF.ICC,rep(delta.y.ate(param),length(ICC.val))))
colnames(precision) <- c('ICC','precision','ATE')
# plot
ggplot(precision, aes(x=as.factor(ICC), y=ATE)) +
geom_bar(position=position_dodge(), stat="identity", colour='black') +
geom_errorbar(aes(ymin=ATE-precision/2, ymax=ATE+precision/2), width=.2,position=position_dodge(.9),color='red') +
xlab("ICC") +
theme_bw()
```
Figure \@ref(fig:precisionBFICC) confirms that increasing the ICC increases sampling noise in a major way.
With a small ICC, sampling noise is equal to `r round(precision.BF.ICC[[1]],2)`, while it is equal to `r round(precision.BF.ICC[[6]],2)` with the largest ICC.
The difference is more than sizable.
Increasing the ICC is equivalent to losing sample size.
The intuition for this result is that, with autocorrelated data, the Central Limit Theorem applies at a much slower pace.
The part of the error term that is common to all observations in a given cluster only vanishes as we add more clusters, not as we add more observations per cluster.
As a result, this part of sampling noise decreases with the square root of the number of clusters, not with the square root of the number of observations.
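Under the cluster fixed effect model above, this intuition can be made precise for the simple case of the grand mean of an outcome $Y_{i,c}=\mu^C_c+u_{i,c}$, with $n$ equal-sized clusters of size $m$ and $N=nm$ (a back-of-the-envelope derivation for the mean, not the full variance of the $WW$ estimator):
\begin{align*}
\var{\bar{Y}} & = \var{\frac{1}{n}\sum_{c=1}^n\mu^C_c} + \var{\frac{1}{N}\sum_{c=1}^n\sum_{i=1}^m u_{i,c}} = \frac{\sigma^2_{\mu^C}}{n} + \frac{\sigma^2_{u}}{N},
\end{align*}
so that the cluster-level component of sampling noise only vanishes as the number of clusters $n$ grows, whatever the number of observations per cluster.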
The second result to emerge from Figure \@ref(fig:precisionBFICC) is that ignoring clustering would severely underestimate sampling noise.
The lower level of sampling noise on Figure \@ref(fig:precisionBFICC) is roughly equal to the level of sampling noise that an estimator ignoring clustering would deliver (the estimator detailed in Theorem \@ref(thm:asympnoiseWW)).
Let's compute the naive estimator of sampling noise based on Theorem \@ref(thm:asympnoiseWW):
```{r OLSWWICC,eval=TRUE,echo=TRUE,warning=FALSE,error=FALSE,message=FALSE,results='hide'}
# OLS regression in the Brute Force data
ols.ww <- lm(yA~R)
```
The naive estimate of sampling noise is thus `r round(samp.noise.ww.CLT.ols(delta,ols.ww,type='HC1'),2)`.
### Design effect
A very useful way to understand what clustering does to sampling noise is to derive the variance of the With/Without estimator in the presence of clustering of a simple nature: all outcomes within a cluster are correlated in the same way, correlation between outcomes of different clusters is zero, and all clusters have the same size.
This derivation enables us to introduce the notion of design effect, which is a way to quantify the effect of clustering on sampling noise estimates as a function of the ICC.
In order to make this derivation, we need to formally specify the covariance matrix of our error terms.
Let us assume that we have access to observations about $N$ units, allocated in $n$ clusters of equal size $m$.
Among the $n$ clusters, $n_1$ are treated and $n_0$ are in the control group.
Let $R_{i}$ denote the randomized allocation variable for each unit $i$, which takes value one when unit $i$ is located in a cluster $c$ that has been (randomly) allocated to the treatment and value zero otherwise.
Let $R^C_{c}$ denote the randomized allocation variable for each cluster $c$, which takes value one when cluster $c$ has been (randomly) allocated to the treatment and value zero otherwise.
Let $\mathbf{R}_c$ be the vector of randomized allocations at the cluster level: it has length equal to $n$, and each value is equal to $R^C_{c}$.
Let us denote $U_i^1=Y_i^1-\esp{Y_i^1|R_{i}=1}$ and $U_i^0=Y_i^0-\esp{Y_i^0|R_{i}=0}$ and $U_i=R_{i}U_i^1+(1-R_{i})U_i^0$.
Let $U$ be the vector of all $U_i$ error terms.
Let $\sigma^2_1=\var{U_i^1}$ and $\sigma^2_0=\var{U_i^0}$.
Finally, let $\Omega_1$ and $\Omega_0$ be $m\times m$ matrices with a diagonal of one and off-diagonal terms all equal to $\rho_1$ and $\rho_0$ respectively, with $\rho_1$ and $\rho_0$ the Intra-Cluster Correlation Coefficients among the treated and untreated observations respectively.
We are now equipped to relax Assumption \@ref(hyp:iid):
```{hypothesis,ClusteredErrors,name="Clustered Design"}
We assume that the error terms are clustered:
\begin{align*}
\esp{UU'} & = \sigma^2_1\diag{\mathbf{R}_c}\otimes\Omega_1+\sigma^2_0(I-\diag{\mathbf{R}_c})\otimes\Omega_0.
\end{align*}
```
```{remark}
Assumption \@ref(hyp:ClusteredErrors) imposes that the covariance structure between potential outcomes is block diagonal.
That is, there is the same correlation between outcomes for observations in the same cluster, and no correlation between outcomes for observations that do not belong to the same cluster.
```
```{theorem,VarWWClus,name="Variance of the With/Without estimator under Clustering"}
Under Assumptions \@ref(def:noselb), \@ref(hyp:fullrank) and \@ref(hyp:ClusteredErrors),
\begin{align*}
\var{{\hat{\Delta^Y_{WW}}}} & = \frac{1}{N}\left(\frac{\sigma^2_0}{1-\Pr(R_i=1)}(1+(m-1)\rho_0)+\frac{\sigma^2_1}{\Pr(R_i=1)}(1+(m-1)\rho_1)\right).
\end{align*}
```
```{proof}
See in Appendix \@ref(proofVarWWClus).
```
We can also prove the following corollary:
```{corollary,DesignEffect,name="Design Effect"}
Under Assumptions \@ref(def:noselb), \@ref(hyp:fullrank) and \@ref(hyp:ClusteredErrors), and with $\rho_0=\rho_1=\rho$, we have:
\begin{align*}
\var{{\hat{\Delta^Y_{WW}}}} & = \frac{1}{N}(1+(m-1)\rho)\left(\frac{\sigma^2_0}{1-\Pr(R_i=1)}+\frac{\sigma^2_1}{\Pr(R_i=1)}\right),
\end{align*}
with $(1+(m-1)\rho)$ the **design effect**.
```
```{proof}
The proof is straightforward using Theorem \@ref(thm:VarWWClus).
```
```{remark}
Note that the formula for the variance of the $WW$ estimator in Corollary \@ref(cor:DesignEffect) is equal to the formula for the variance of the same estimator under Assumption \@ref(hyp:iid) of i.i.d. error terms derived in Lemma \@ref(lem:asymWW) multiplied by the design effect.
As soon as $m>1$, the design effect is strictly greater than one, meaning that the variance of the $WW$ estimator in a clustered design exceeds the variance of the $WW$ estimator in a design where randomization is done at the unit level.
The larger the Intra Cluster Correlation Coefficient $\rho$ and the higher the number of units per cluster $m$, the larger the design effect.
```
```{remark}
One very useful way to think about the design effect is to think about **effective sample size** $N^*=\frac{N}{1+(m-1)\rho}$ as being the sample size that would yield the same precision as our clustered experiment but with an experiment randomized at the unit level.
In that sense, the design effect measures by how much clustering decreases our **effective** sample size.
```
```{example}
One way to visualize the extent of the design effect is to plot the effective sample size as a function of the real sample size, for various values of $\rho$ and $m$.
For that, we simply need to generate a function to compute the effective sample size.
```
```{r EffSampleSize,eval=TRUE,echo=TRUE,warning=FALSE,error=FALSE,message=FALSE,results='hide'}
DesignEffect <- function(rho,m){
return(1+(m-1)*rho)
}
EffSampleSize <- function(N,...){
return(N/DesignEffect(...))
}
```
Let us now plot the result:
```{r EffSampleSizePlot,eval=TRUE,echo=TRUE,results='hide',warning=FALSE,error=FALSE,message=FALSE,fig.cap='Effective sample size in a clustered design',fig.align='center',out.width='65%',fig.pos='htbp'}
ggplot()+
xlim(0,10000) +
ylim(0,10000) +
geom_function(aes(linetype="m=2 and rho=0.1",color="m=2 and rho=0.1"),fun=EffSampleSize,args=list(rho=0.1,m=2)) +
geom_function(aes(linetype="m=5 and rho=0.1",color="m=5 and rho=0.1"),fun=EffSampleSize,args=list(rho=0.1,m=5)) +
geom_function(aes(linetype="m=10 and rho=0.1",color="m=10 and rho=0.1"),fun=EffSampleSize,args=list(rho=0.1,m=10)) +
geom_function(aes(linetype="m=2 and rho=0.5",color="m=2 and rho=0.5"),fun=EffSampleSize,args=list(rho=0.5,m=2)) +
geom_function(aes(linetype="m=5 and rho=0.5",color="m=5 and rho=0.5"),fun=EffSampleSize,args=list(rho=0.5,m=5)) +
geom_function(aes(linetype="m=10 and rho=0.5",color="m=10 and rho=0.5"),fun=EffSampleSize,args=list(rho=0.5,m=10)) +
geom_function(aes(linetype="m=2 and rho=1",color="m=2 and rho=1"),fun=EffSampleSize,args=list(rho=1,m=2)) +
geom_function(aes(linetype="m=5 and rho=1",color="m=5 and rho=1"),fun=EffSampleSize,args=list(rho=1,m=5)) +
geom_function(aes(linetype="m=10 and rho=1",color="m=10 and rho=1"),fun=EffSampleSize,args=list(rho=1,m=10)) +
geom_abline(slope=1,intercept=0,linetype="dotted",color='black')+
scale_color_discrete(name="") +
scale_linetype_discrete(name="") +
ylab("Effective sample size") +
xlab("Actual sample size") +
theme_bw()
```
Figure \@ref(fig:EffSampleSizePlot) shows that effective sample size can decrease enormously because of clustering.
For example, with clusters of size $m=10$, and the Intra Cluster Correlation Coefficient $\rho=0.5$, a sample size of $N=10000$ observations is equivalent to an unclustered RCT run on a sample of $N^*=$ `r round(EffSampleSize(N=10000,rho=0.5,m=10),0)`, which is equivalent to dividing the true sample size by `r DesignEffect(rho=0.5,m=10)`.
**Discuss clustering in the power analysis chapter**
### Estimating sampling noise accounting for clustering
We have shown that clustering increases sampling noise around our estimates and worse, that our basic estimates of sampling noise based on the i.i.d. assumption underestimate the true amount of sampling noise, sometimes severely.
Devising ways to estimate sampling noise that offer a correct estimate of the true extent of sampling noise in the presence of clustering is thus a very important endeavor, if we are to gauge correctly the precision of our treatment effect estimates.
Let us see several ways to do that.
```{remark}
There exist several very useful resources which detail the topic of estimating standard errors under clustering.
I really like the one by [Cameron and Miller (2015)](https://doi.org/10.3368/jhr.50.2.317).
The one by [MacKinnon, Nielsen and Webb (2023)](https://doi.org/10.1016/j.jeconom.2022.04.001) is also nice, even if more technical.
```
#### Using the plug-in formula
One direct and simple way to account for clustering is to use the formula in Theorem \@ref(thm:VarWWClus) or the simpler formula in Corollary \@ref(cor:DesignEffect).
The Intra Cluster Correlation Coefficient can be computed as the ratio of the variance of the mean of the outcomes at the cluster level divided by the total variance of the outcomes in the sample.
Once we have an estimate of the variance of the treatment effect, we can build sampling noise $2\epsilon$ by multiplying the resulting standard error by $2\Phi^{-1}\left(\frac{\delta+1}{2}\right)$.
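As a sketch, this last transformation from an estimated variance to sampling noise $2\epsilon$ can be wrapped in a small helper (the function name `samp.noise.cluster` is mine):

```{r SampNoiseClusterSketch,eval=TRUE,echo=TRUE,results='hide'}
# Sampling noise 2*epsilon from an estimated variance of the treatment
# effect, at confidence level delta (hypothetical helper)
samp.noise.cluster <- function(var.hat,delta=0.99){
  2*qnorm((delta+1)/2)*sqrt(var.hat)
}
```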
```{remark}
Note that this approach implicitly assumes that the Central Limit Theorem holds with clustered data.
Actually, in the clustered case, as long as we allow the number of clusters to go to infinity, we can safely use the classical CLT.
We will also see in Section \@ref(CLTDD) that versions of the CLT exist for more general cases where there is dependence between data points.
```
```{remark}
One uncertainty with the CLT-based approach that relies on letting the number of clusters grow is that we are unsure whether the normalizing constant in the CLT is $N$, the total sample size, or $n$, the number of clusters.
This will be rigorously clarified in Sections \@ref(DesignBasedClusters) and \@ref(CLTDD).
```
```{example}
In practice, to apply the formula in Theorem \@ref(thm:VarWWClus), we simply need to compute the total variance of outcomes in the treated and control groups, and to estimate the Intra Cluster Correlation Coefficients of outcomes among the treated and the controls respectively, $\rho_1$ and $\rho_0$.
The Intra Cluster Correlation Coefficient can be computed as the ratio of the variance of the mean of the outcomes at the cluster level divided by the total variance of the outcomes in the sample.
Let's go.
```
```{r ComputingICC10}
# Function for the plug-in variance estimator
# sigma1 and sigma0 are the outcome variances in the treated and control groups
VarClusterPlugIn <- function(sigma1,sigma0,ICC1,ICC0,p,m,N){
dEffect0 <- 1+(m-1)*ICC0
dEffect1 <- 1+(m-1)*ICC1
return((dEffect0*sigma0/(1-p)+dEffect1*sigma1/p)/N)
}
# Preparing data
dataCluster <- data.frame(Cluster=cluster,yA=yA,yB=yB,R=R)
# Computing ICC
# Cluster-level variance
VaryAClus <- dataCluster %>%
group_by(R,Cluster) %>%
summarize(
MeanyAClus = mean(yA)
) %>%
ungroup(.) %>%
group_by(R) %>%
summarize(
VaryAClus = var(MeanyAClus)
)
# Total variance
VaryA <- dataCluster %>%
group_by(R) %>%
summarize(
VaryATot = var(yA)
)
# ICC
ICC <- VaryA %>%
left_join(VaryAClus,by='R') %>%
mutate(
ICC = VaryAClus/VaryATot
)
# p
p <- mean(dataCluster$R)
# Variance estimate
VarClusteredWW <- VarClusterPlugIn(sigma1=ICC %>% filter(R==1) %>% pull(VaryATot),
sigma0=ICC %>% filter(R==0) %>% pull(VaryATot),
ICC1=ICC %>% filter(R==1) %>% pull(ICC),
ICC0=ICC %>% filter(R==0) %>% pull(ICC),
p=p,
m=N/Nclusters,
N=N)
```
Our estimate of sampling noise accounting for clustering using the plug-in estimator is equal to `r round(2*(qnorm((delta+1)/2))*sqrt(VarClusteredWW),2)`.
The true level of sampling noise estimated from the simulations is `r round(precision.BF.ICC[[3]],2)`.
Remember that the naive estimate of sampling noise which ignores clustering is `r round(samp.noise.ww.CLT.ols(delta,ols.ww,type='HC1'),2)`.
#### Using cluster-robust standard errors
The most widely used approach for accounting for clustering when estimating treatment effects is using cluster-robust standard errors.
We start with the formula for the variance of the OLS estimator, that we have derived in the proof of Theorem \@ref(thm:VarWWClus): $\var{\hat{\Theta}_{OLS}|X}=(X'X)^{-1}X'\esp{UU'|X}X(X'X)^{-1}$.
The basic idea of cluster-robust standard errors is to build an empirical estimate of the covariance matrix of the residuals $\esp{UU'|X}$ using the estimated residuals $\hat{U}_i$.
We might want to use the full $\mathbf{\hat{U}}\mathbf{\hat{U}}'$ matrix, with $\mathbf{\hat{U}}$ the vector of all estimated residuals, but that would not work: by construction, the OLS estimator produces residuals which are orthogonal to the covariates, so that $X'\mathbf{\hat{U}}\mathbf{\hat{U}}'X$ is a null matrix.
Instead, cluster-robust standard errors are estimated by assuming first that the matrix $\esp{UU'|X}$ is block diagonal, meaning that observations are only correlated within clusters.
Let $\mathbf{\hat{U}}_c$ be the vector of empirical residuals of observations residing in cluster $c$.
Let's write $\hat{\Omega}_c=\mathbf{\hat{U}}_c\mathbf{\hat{U}}_c'$ and $\hat{\Omega}=\diag(\left\{\hat{\Omega}_c\right\}_{c=1}^n)$.
We can now use $\hatvar{\hat{\Theta}_{OLS,Clustered}}=(X'X)^{-1}X'\hat{\Omega}X(X'X)^{-1}$ as our cluster-robust estimate of the covariance matrix of the OLS estimator, with $\Theta=(\alpha,\beta)$ the parameter vector in the equation $Y_i=\alpha+\beta R_i+U_i$, where $\beta=\Delta^Y_{WW}$, as Lemma \@ref(lem:WWOLS) shows.
In practice, some authors and statistical software might add a degrees of freedom correction to these estimates as a way to curb small sample bias.
One common correction factor is $\frac{n}{n-1}\frac{N-1}{N-k}$, with $k$ the number of covariates.
Another classical correction is $\frac{n}{n-1}$.
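As a sketch of what the cluster-robust sandwich formula computes (without any small-sample correction), the estimator can be assembled by hand from the design matrix, the residuals and the cluster identifiers; the function name `crve` and its arguments are mine:

```{r CRVESketch,eval=TRUE,echo=TRUE,results='hide'}
# A minimal by-hand cluster-robust variance estimator (no degrees-of-freedom
# correction); X is the design matrix, uhat the vector of OLS residuals and
# cluster the vector of cluster identifiers
crve <- function(X,uhat,cluster){
  bread <- solve(t(X) %*% X)
  # the meat sums the outer products of the cluster-level scores X_c' u_c
  meat <- Reduce(`+`,lapply(split(1:nrow(X),cluster),function(idx){
    Xc <- X[idx,,drop=FALSE]
    score <- t(Xc) %*% uhat[idx]
    score %*% t(score)
  }))
  bread %*% meat %*% bread
}
```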
```{example}
Let's see how these approaches work in our example.
The most straightforward way to implement the cluster-robust variance estimator in `R` is to use the `vcovCL` command from the `sandwich` package.
The `type` option of `vcovCL` can take values `HC0`, where the only correction is $\frac{n}{n-1}$, and `HC1`, where the correction is $\frac{n}{n-1}\frac{N-1}{N-k}$.
Another approach is to use the `feols` function of the `fixest` package with the `cluster` option.
```
```{r ExampleClusterOLS,eval=TRUE,echo=TRUE,results='hide',warning=FALSE,error=FALSE,message=FALSE}
# regression
RegBFCLuster <- lm(yA ~ R, data=dataCluster)
# cluster-robust standard error
VCovClusterHC1 <- vcovCL(RegBFCLuster,cluster=~cluster,type="HC1",sandwich=TRUE)
VCovClusterHC0 <- vcovCL(RegBFCLuster,cluster=~cluster,type="HC0",sandwich=TRUE)
# using fixest
RegBFCLusterFixest <- feols(yA ~ R | 1, data=dataCluster, cluster="Cluster")
```
With the `HC1` correction, our estimate of sampling noise accounting for clustering using the cluster-robust estimator is equal to `r round(2*(qnorm((delta+1)/2))*sqrt(diag(VCovClusterHC1)[[2]]),2)`, to `r round(2*(qnorm((delta+1)/2))*sqrt(diag(VCovClusterHC0)[[2]]),2)` with the `HC0` correction, and to `r round(2*(qnorm((delta+1)/2))*sqrt(diag(vcov(RegBFCLusterFixest))[[2]]),2)` with `fixest`.
The true level of sampling noise estimated from the simulations is `r round(precision.BF.ICC[[3]],2)`.
Remember that the naive estimate of sampling noise which ignores clustering is `r round(samp.noise.ww.CLT.ols(delta,ols.ww,type='HC1'),2)`.
#### Using Feasible Generalized Least Squares
As [Cameron and Miller (2015)](https://doi.org/10.3368/jhr.50.2.317) remark, when error terms are autocorrelated, the OLS estimator is not the most efficient one, and a Generalized Least Squares (GLS) estimator could perform better.
The problem is to find the Feasible Generalized Least Squares (FGLS) estimator that realizes this potential gain in precision.
In order to derive it, we first have to specify a model for $\Omega_{GLS}=\esp{UU'|X}$.
One such model is the one in Assumption \@ref(hyp:ClusteredErrors).
Under such a model, we know that the GLS estimator of the parameter vector $\Theta=(\alpha,\beta)$ in the equation $Y_i=\alpha+\beta R_i+U_i$ is $\hat{\Theta}_{GLS}=(X'\Omega_{GLS}^{-1}X)^{-1}X'\Omega_{GLS}^{-1}Y$.
With a consistent estimator for $\Omega$, *e.g.* $\hat{\Omega}_{FGLS}$, we have $\hat{\Theta}_{FGLS}=(X'\hat{\Omega}_{FGLS}^{-1}X)^{-1}X'\hat{\Omega}_{FGLS}^{-1}Y$ with the associated estimated variance: $\hatvar{\hat{\Theta}_{FGLS}}=(X'\hat{\Omega}_{FGLS}^{-1}X)^{-1}$.
[Cameron and Miller (2015)](https://doi.org/10.3368/jhr.50.2.317) also suggest that we can build a cluster-robust estimate of the variance of the FGLS estimator as follows: $\hatvar{\hat{\Theta}_{FGLS,Clustered}}=(X'\hat{\Omega}_{FGLS}^{-1}X)^{-1}X'\hat{\Omega}_{FGLS}^{-1}\hat{\Omega}\hat{\Omega}_{FGLS}^{-1}X(X'\hat{\Omega}_{FGLS}^{-1}X)^{-1}$, with $\hat{\Omega}=\diag(\left\{\hat{\Omega}_c\right\}_{c=1}^n)$, $\hat{\Omega}_c=\mathbf{\hat{U}}_c\mathbf{\hat{U}}_c'$ and $\mathbf{\hat{U}}_c$ the vector of empirical residuals of observations residing in cluster $c$, as in the previous section, but now computed from the FGLS residuals.
```{remark}
As [Cameron and Miller (2015)](https://doi.org/10.3368/jhr.50.2.317) remark, the approach of building a cluster-robust estimator for the variance of the FGLS estimator has been popularized in biostatistics by [Liang and Zeger (1986)](https://doi.org/10.1093/biomet/73.1.13).
```
```{example}
Let's see how the FGLS approach works in our example.
Following Assumption \@ref(hyp:ClusteredErrors), we know that $\hat{\Omega}_{FGLS}=\hat{\sigma}^2_1\diag{\mathbf{R}_c}\otimes\hat{\Omega}_1+\hat{\sigma}^2_0(I-\diag{\mathbf{R}_c})\otimes\hat{\Omega}_0$, with $\hat{\sigma}^2_1$ the estimated variance of outcomes in the treated group, $\hat{\sigma}^2_0$ the estimated variance of outcomes in the control group, and $\hat{\Omega}_1$ and $\hat{\Omega}_0$ matrices with ones on the diagonal and off-diagonal elements equal to $\hat{\rho}_1$ and $\hat{\rho}_0$, the estimated intra-cluster correlation coefficients of outcomes in the treated and untreated groups, respectively.
We know how to estimate all these parameters so we can build $\hat{\Omega}_{FGLS}$.
Let's do it.
```
```{r FGLS}
# Omega matrix at the cluster level
Omega1 <- matrix(data=ICC %>% filter(R==1) %>% pull(ICC),nrow=N/Nclusters,ncol=N/Nclusters) + diag(1-ICC %>% filter(R==1) %>% pull(ICC),nrow=N/Nclusters,ncol=N/Nclusters)
Omega0 <- matrix(data=ICC %>% filter(R==0) %>% pull(ICC),nrow=N/Nclusters,ncol=N/Nclusters) + diag(1-ICC %>% filter(R==0) %>% pull(ICC),nrow=N/Nclusters,ncol=N/Nclusters)
# sigma1 and sigma0
sigma21 <- ICC %>% filter(R==1) %>% pull(VaryATot)
sigma20 <- ICC %>% filter(R==0) %>% pull(VaryATot)
# OmegaFGLS matrix at the sample level
OmegaFGLS <- sigma21*diag(Rc)%x%Omega1 +sigma20*(diag(1,nrow=Nclusters,ncol=Nclusters)-diag(Rc))%x%Omega0
# Inverting Omega
InvOmegaFGLS <- solve(OmegaFGLS)
# matrix of covariates
X <- cbind(1,R)
# FGLS estimator
RegFGLS <- solve((t(X)%*%InvOmegaFGLS%*%X))%*%(t(X)%*%InvOmegaFGLS%*%yA)
# FGLS estimate of WW
WWyFGLS <- RegFGLS[2,1]
# Basic variance estimate of the FGLS estimator
VcovRegFGLS <- solve((t(X)%*%InvOmegaFGLS%*%X))
# estimated basic standard error of WWyFGLS
SeWWyFGLS <- sqrt(VcovRegFGLS[2,2])
```
The FGLS estimator of the treatment effect is equal to `r round(WWyFGLS,2)` $\pm$ `r round(qnorm((1+0.95)/2)*SeWWyFGLS,2)` with $\delta=0.95$.
This means that the basic FGLS estimate of sampling noise is equal to `r round(2*(qnorm((delta+1)/2))*SeWWyFGLS,2)`.
The true level of sampling noise estimated from the simulations is `r round(precision.BF.ICC[[3]],2)`.
Let us now try to estimate the cluster-robust variance of the FGLS estimator.
This requires computing the matrix $\hat{\Omega}_c=\mathbf{\hat{U}}_c\mathbf{\hat{U}}_c'$ for each cluster.
To do so, we extract the block-diagonal part $\hat{\Omega}$ of $\mathbf{\hat{U}}\mathbf{\hat{U}}'$ using the `diagonals` package.
```{r FGLSCLuster}
# computing FGLS residuals
ResidFGLS <- yA-X%*%RegFGLS
# Product of residuals matrix
HatOmega <- ResidFGLS%*%t(ResidFGLS)
# set the block-diagonal elements to zero
diagonals::fatdiag(HatOmega,size=N/Nclusters) <- 0
# subtract to keep only the block-diagonal elements of the UU' matrix
HatOmega <- ResidFGLS%*%t(ResidFGLS)-HatOmega
# compute the Cluster robust FGLS variance matrix estimator
VcovRegFGLSCluster <- VcovRegFGLS%*%t(X)%*%InvOmegaFGLS%*%HatOmega%*%InvOmegaFGLS%*%X%*%VcovRegFGLS
# estimated clustered standard error of WWyFGLS
SeWWyFGLSCluster <- sqrt(VcovRegFGLSCluster[2,2])
```
This means that the cluster-robust FGLS estimate of sampling noise is equal to `r round(2*(qnorm((delta+1)/2))*SeWWyFGLSCluster,2)`.
The true level of sampling noise estimated from the simulations is `r round(precision.BF.ICC[[3]],2)`.
#### Using the Bootstrap
The bootstrap is a data-intensive non-parametric way to obtain cluster-robust estimates of sampling noise.
Here, we are going to use the non-parametric bootstrap, but other bootstrap methods exist.
[Cameron and Miller (2015)](https://doi.org/10.3368/jhr.50.2.317) cover some of them.
```{remark}
A key distinction when we use the bootstrap is whether it can offer **asymptotic refinements** or not.
```
The way we implement the non-parametric (or pair) bootstrap is as follows:
1. We build a sample of $N_c$ clusters by sampling with replacement from the original sample,
2. For each sample $b$, we estimate a treatment effect $\hat{\Delta}^y_{WW}(b)$,
3. After computing $B$ such estimates, we compute the bootstrap-cluster-robust variance estimator as follows:
\begin{align*}
\hatvar{\hat{\Delta}^y_{WW}}^{Bootstrap}_{Clustered} & = \frac{1}{B-1}\sum_{b=1}^B\left(\hat{\Delta}^y_{WW}(b)-\frac{1}{B}\sum_{b=1}^B\hat{\Delta}^y_{WW}(b)\right)^2.
\end{align*}
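Before turning to the chapter's data, the three steps above can be sketched on a small simulated dataset (all names, with a `.toy` suffix, and all parameter values are illustrative and distinct from the chapter's objects):

```{r NPBootClusterToy}
set.seed(1234)
# toy data: 20 clusters of 10 units, treatment assigned at the cluster level
Nc.toy <- 20
nc.toy <- 10
cluster.toy <- rep(1:Nc.toy,each=nc.toy)
R.toy <- rep(rbinom(Nc.toy,1,0.5),each=nc.toy)
y.toy <- 0.2*R.toy + rep(rnorm(Nc.toy,sd=0.5),each=nc.toy) + rnorm(Nc.toy*nc.toy,sd=0.5)
toy <- data.frame(cluster=cluster.toy,R=R.toy,y=y.toy)
# pair (cluster) bootstrap of the WW estimator
B.toy <- 200
WW.boot.toy <- replicate(B.toy,{
  # step 1: draw Nc.toy clusters with replacement and stack their observations
  draw <- sample(1:Nc.toy,Nc.toy,replace=TRUE)
  boot <- do.call(rbind,lapply(draw,function(c){toy[toy$cluster==c,]}))
  # step 2: WW estimate on the bootstrap sample
  coef(lm(y~R,data=boot))[[2]]
})
# step 3: bootstrap-cluster-robust variance estimate of the WW estimator
var(WW.boot.toy)
```

Note that resampling whole clusters, rather than individual observations, is what preserves the within-cluster correlation in each bootstrap sample.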
```{example}
Let's see how this works in our example.
We first need to generate a dataset with outcomes, treatment and cluster indicators.
Then we need to draw repeatedly new vectors of clusters identifiers, with replacement, and compute the WW estimator for each.
```
```{r NPBootClus,cache=TRUE}
# Regroup data and cluster id
# already done in dataCluster
# write a function to draw a new vector of clusters and return the WW estimator
# seed: the seed for the PRNG
# data: the dataset
# cluster: the name of the cluster variable (with cluster variable indexed from 1 to Nc)
# y: the name of the outcome variable
# D: the name of the treatment variable
NPBootCluster <- function(seed,data,cluster,y,D){
# set seed
set.seed(seed)
# compute number of clusters
NClusters <- data %>%
group_by(!!sym(cluster)) %>%
summarize(count=n()) %>%
summarize(NClusters=n())
# draw a sample with replacement
SampleClusters <- data.frame(BootClusters =sample(1:(NClusters[[1]]),size=NClusters[[1]],replace=TRUE)) %>%
left_join(data,by=c('BootClusters'=cluster))
# run regression
RegCluster <- lm(as.formula(paste(y,D,sep='~')),data=SampleClusters)
# WW estimate
WW <- coef(RegCluster)[[2]]
# return
return(WW)
}
# testing
test <- NPBootCluster(seed=1,data=dataCluster,cluster='Cluster',y='yA',D='R')
# programming to run in parallel
sf.NP.Boot.Cluster <- function(Nsim,...){
sfInit(parallel=TRUE,cpus=8)
sfLibrary(dplyr)
sim <- matrix(unlist(sfLapply(1:Nsim,NPBootCluster,data=dataCluster,cluster='Cluster',y='yA',D='R')),nrow=Nsim,ncol=1,byrow=TRUE)
sfStop()
colnames(sim) <- c('WW')
return(sim)
}
# Number of simulations
#Nsim <- 10
Nsim <- 400
# running in parallel
simuls.NP.Boot.Cluster <- sf.NP.Boot.Cluster(Nsim=Nsim,data=dataCluster,cluster='Cluster',y='yA',D='R')
# estimating variance
VarClusteredWWNPBoot <- var(simuls.NP.Boot.Cluster)
```
The cluster-robust non-parametric bootstrap estimate of sampling noise is equal to `r round(2*(qnorm((delta+1)/2))*sqrt(VarClusteredWWNPBoot),2)`.
The true level of sampling noise estimated from the simulations is `r round(precision.BF.ICC[[3]],2)`.
#### Using Randomization inference {#RI}
Randomization inference is a different way to obtain estimates of the true level of sampling noise using resampling.
One intuitive way of running randomization inference would be to draw new vectors of treatment status at the cluster level and estimate the variability of treatment effect estimates.
One problem with that approach is that it ignores the fact that treated and control outcomes differ by the amount of the treatment effect, which adds noise to the resulting estimate.
A more rigorous approach follows the suggestion in [Imbens and Rubin (2015), Section 5.7](https://doi.org/10.1017/CBO9781139025751.006) and proceeds as follows:
1. Assume a size for the treatment effect (let's call it $\tau$)
2. Compute each potential outcome for each treated ($\tilde{Y}_i^0=Y_i-\tau$ and $\tilde{Y}_i^1=Y_i$) and each untreated ($\tilde{Y}_i^1=Y_i+\tau$ and $\tilde{Y}_i^0=Y_i$) unit in the original sample
3. Draw a new treatment allocation $\tilde{R}^1_i$
4. Compute the realized outcomes $\tilde{Y}_i=\tilde{Y}_i^1\tilde{R}^1_i+\tilde{Y}_i^0(1-\tilde{R}^1_i)$
5. Compute the $WW$ estimate $\tilde{\Delta}^Y_{WW_1}$ using the new treatment allocation $\tilde{R}^1_i$ and the realized outcomes $\tilde{Y}_i$
6. Repeat the operation $\tilde{N}$ times, to obtain $\left\{\tilde{\Delta}^Y_{WW_k}\right\}_{k=1}^{\tilde{N}}$
7. Compute the empirical p-value $\tilde{p}(\tau)$ as the proportion of sample draws where $\left|\tilde{\Delta}^Y_{WW_k}-\tau\right|\geq\left|\hat{\Delta}^Y_{WW}-\tau\right|$, where $\hat{\Delta}^Y_{WW}$ is the treatment effect estimate in the original sample.
8. Repeat for various values of $\tau$ on a set of points $\tau_1,\dots,\tau_{K}$.
9. Find the values $\tau^u_{\alpha}$ and $\tau^l_{\alpha}$ that are such that $\tilde{p}(\tau^u_{\alpha})\approx\alpha\approx\tilde{p}(\tau^l_{\alpha})$ and $\tau^l_{\alpha}<\hat{\Delta}^Y_{WW}<\tau^u_{\alpha}$.
$\left[\tau^l_{\alpha},\tau^u_{\alpha}\right]$ is the $1-\alpha$ cluster-robust randomization-inference-based confidence interval for $\hat{\Delta}^Y_{WW}$.
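As a minimal self-contained sketch of steps 1 to 7 for a single hypothesized $\tau$ (on simulated toy data with a true effect of 0.2, not on `dataCluster`; all `.ri`-suffixed names are illustrative):

```{r RIClusterToy}
set.seed(1234)
# toy data: 20 clusters of 10 units, treatment at the cluster level
Nc.ri <- 20
nc.ri <- 10
R.ri <- rep(rbinom(Nc.ri,1,0.5),each=nc.ri)
y.ri <- 0.2*R.ri + rep(rnorm(Nc.ri,sd=0.5),each=nc.ri) + rnorm(Nc.ri*nc.ri,sd=0.5)
WWhat.ri <- coef(lm(y.ri~R.ri))[[2]]
# step 1: hypothesized treatment effect
tau.ri <- 0.2
# step 2: imputed potential outcomes under tau.ri
y0.ri <- ifelse(R.ri==1,y.ri-tau.ri,y.ri)
y1.ri <- ifelse(R.ri==1,y.ri,y.ri+tau.ri)
# steps 3 to 6: redraw cluster-level treatment and re-estimate WW
WW.RI.draws <- replicate(500,{
  Rs <- rep(rbinom(Nc.ri,1,0.5),each=nc.ri)
  ys <- ifelse(Rs==1,y1.ri,y0.ri)
  coef(lm(ys~Rs))[[2]]
})
# step 7: empirical p-value for the hypothesis that the effect equals tau.ri
pval.ri <- mean(abs(WW.RI.draws-tau.ri)>=abs(WWhat.ri-tau.ri))
pval.ri
```

Repeating this computation over a grid of $\tau$ values, as in steps 8 and 9, then yields the confidence interval.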
```{example}
Let's see how that works in our example.
```
```{r RICLuster,results='hide',cache=TRUE}
# function computing one randomization inference draw for one value of tau
# s: the seed for the PRNG
# data: the dataset
# cluster: the name of the cluster variable (with cluster variable indexed from 1 to Nc)
# y: the name of the outcome variable
# D: the name of the treatment variable
ClusteredRI <- function(s,tau,data,cluster,y,D){
# Compute potential outcomes
data <- data %>%
mutate(
y0 = case_when(
!!sym(D)==1 ~ !!sym(y)-tau,
!!sym(D)==0 ~ !!sym(y),
TRUE ~ 99
),
y1 = case_when(
!!sym(D)==1 ~ !!sym(y),
!!sym(D)==0 ~ !!sym(y)+tau,
TRUE ~ 99
)
)
# drawing alternative treatment vector at cluster level
# compute number of clusters
NClusters <- data %>%
group_by(!!sym(cluster)) %>%
summarize(count=n()) %>%
ungroup(.) %>%
summarize(NClusters=n())
# set seed
set.seed(s)
# randomized allocation at the cluster level
Rs <- runif(NClusters[[1]])
# cluster level treatment vector
Rc <- ifelse(Rs<=.5,1,0)
# dataframe of treated joined to original data
RIdata <- data.frame(ClusterId=1:NClusters[[1]],Rc=Rc) %>%
left_join(data,by=c("ClusterId"=cluster)) %>%
#generating RI observed outcomes
mutate(
yc = y0*(1-Rc)+y1*Rc
)
# estimating the RI WW treatment effect
RegWWRI <- lm(yc~Rc,data=RIdata)
# returning estimate
return(coef(RegWWRI)[[2]])
}
# testing
testRI <- ClusteredRI(s=1,tau=0,data=dataCluster,cluster='Cluster',y='yA',D='R')
# testing sapply
Nsim <- 10
testRIapply <- sapply(1:Nsim,ClusteredRI,tau=0,data=dataCluster,cluster='Cluster',y='yA',D='R')
# parallelize for one value of tau
# programming to run in parallel
sf.RI.Cluster <- function(Nsim,tau,data,cluster,y,D){
sfInit(parallel=TRUE,cpus=8)
sfLibrary(dplyr)
sfExport('dataCluster')
sim <- matrix(unlist(sfLapply(1:Nsim,ClusteredRI,tau=tau,data=data,cluster=cluster,y=y,D=D)),nrow=Nsim,ncol=1,byrow=TRUE)
sfStop()
colnames(sim) <- c('WW')
return(sim)
}
# testing
sf.test.RI <- sf.RI.Cluster(Nsim=Nsim,tau=0,data=dataCluster,cluster='Cluster',y='yA',D='R')
# lapply over values of tau on a grid
# function that takes tau as an input and spits out the empirical p-value
# tau: hypothesized value of treatment effect
# WWhat: estimated value of treatment effect in the original sample
ParallelClusteredRI <- function(tau,WWhat,...){
# computing the RI distribution of WW estimates for given tau
sf.WW.RI <- sf.RI.Cluster(tau=tau,...)
  # estimate empirical cdf of |WWRI-tau|
F.WW.RI.tau <- ecdf(abs(sf.WW.RI-tau))
# Compute the p-value
pvalue <- 1-F.WW.RI.tau(abs(WWhat-tau))
# return the p-value
return(pvalue)
}
# testing
testRItau <- ParallelClusteredRI(tau=0,WWhat=0.2,Nsim=Nsim,data=dataCluster,cluster='Cluster',y='yA',D='R')
# run on a grid of tau's
tau.grid <- seq(from=-0.15,to=0.55,by=0.01)
Nsim <- 1000
simuls.RI.tau <- sapply(tau.grid,ParallelClusteredRI,WWhat=0.2,Nsim=Nsim,data=dataCluster,cluster='Cluster',y='yA',D='R')
names(simuls.RI.tau) <- tau.grid
# putting results in data frame
RI.pvalues.tau <- data.frame(pvalue=simuls.RI.tau,tau=tau.grid)
```
Let us visualize the resulting estimates of the p-values as a function of $\tau$:
```{r RICLusterPlot,fig.cap='Randomization inference p-values as a function of $\\tau$',fig.align='center',out.width='65%',fig.pos='htbp'}
ggplot(RI.pvalues.tau,aes(x=tau,y=pvalue))+
geom_point() +
geom_line() +
geom_hline(yintercept = 0.05,linetype='dotted',color='red')+
ylim(0,1) +
ylab("p-value") +
xlab(expression(tau)) +
theme_bw()
```
Let us now compute the confidence interval and estimate of sampling noise:
```{r RIClusteredCI}
# computing RI-based confidence interval and sampling noise estimate
# function that takes a dataframe of pvalues over a grid of taus and spits out both ends of the confidence interval at level delta, along with sampling noise
# delta: level of the confidence interval (95% or 99% for example)
# data: a data frame with a variable for tau and a variable for pvalues
# tau: name of the tau variable
# pval: name of the pvalue variable
RI.CI <- function(delta,data,tau,pval){
data <- data %>%
mutate(
# finding observations with pvalue lower than delta
below.pval = case_when(
!!sym(pval) <= 1-delta ~ 1,
!!sym(pval) > 1-delta ~ 0,
TRUE ~99
),
# finding observations below the tau with maximum pvalue
max.pval = max(!!sym(pval)),
max.pval.tau = case_when(
!!sym(pval) == max.pval ~ !!sym(tau),
TRUE ~ -Inf
),
max.pval.tau = max(max.pval.tau),
below.max.paval.tau = case_when(
tau<max.pval.tau ~ 1,
tau>=max.pval.tau ~ 0,
TRUE ~ Inf
)
) %>%
# keeping only observations that are below pval of delta%
filter(below.pval==1) %>%
# grouping by whether tau is above or below the tau with maximum pvalue
group_by(below.max.paval.tau)
# finding max and min values that are extremes of delta% CI
minCI <- data %>%
filter(below.max.paval.tau==1) %>%
ungroup(.) %>%
summarize(
minCI = max(!!sym(tau))
) %>%
pull(minCI)
maxCI <- data %>%
filter(below.max.paval.tau==0) %>%
ungroup(.) %>%
summarize(
maxCI = min(!!sym(tau))
) %>%
pull(maxCI)
# results
results <- list(minCI,maxCI,maxCI-minCI)
names(results) <- c("LowCI","HighCI","SamplingNoise")
return(results)
}
# test
testRICI95 <- RI.CI(delta=0.95,data=RI.pvalues.tau,tau='tau',pval='pvalue')
testRICI99 <- RI.CI(delta=0.99,data=RI.pvalues.tau,tau='tau',pval='pvalue')
```
As a result of our procedure, the randomization inference based 95\% confidence interval for the $WW$ estimator is $\left[\right.$ `r testRICI95[[1]]` , `r testRICI95[[2]]` $\left.\right]$ and the 99\% sampling noise is equal to `r testRICI99[[3]]`.
The true level of sampling noise estimated from the simulations is `r round(precision.BF.ICC[[3]],2)`.
```{remark}
Another, simpler, option for implementing randomization inference would have been simply to reallocate the treatment vector among clusters and keep the original values of the outcomes instead of generating the outcomes under the null for each value of $\tau$.
This is the simple approach we have used in Chapter \@ref(FPSI).
The properties of this approach have not been studied to my knowledge, but it would be interesting to know under which conditions it is approximately correct.
Let's check what would have happened if we had used this simpler approach to randomization inference.
```
```{r ClusteredRISimple,results='hide',cache=TRUE}
# function computing one simplified randomization inference draw
# s: the seed for the PRNG
# data: the dataset
# cluster: the name of the cluster variable (with cluster variable indexed from 1 to Nc)
# y: the name of the outcome variable
ClusteredRISimple <- function(s,data,cluster,y){
# drawing alternative treatment vector at cluster level
# compute number of clusters
NClusters <- data %>%
group_by(!!sym(cluster)) %>%
summarize(count=n()) %>%
ungroup(.) %>%
summarize(NClusters=n())
# set seed
set.seed(s)
# randomized allocation at the cluster level
Rs <- runif(NClusters[[1]])
# cluster level treatment vector
Rc <- ifelse(Rs<=.5,1,0)
# dataframe of treated joined to original data
RIdata <- data.frame(ClusterId=1:NClusters[[1]],Rc=Rc) %>%
left_join(data,by=c("ClusterId"=cluster))
# estimating the RI WW treatment effect
RegWWRISimple <- lm(as.formula(paste(y,'Rc',sep='~')),data=RIdata)
# returning estimate
return(coef(RegWWRISimple)[[2]])
}
# testing
testRISimple <- ClusteredRISimple(s=1,data=dataCluster,cluster='Cluster',y='yA')
# testing sapply
Nsim <- 10
testRISimpleapply <- sapply(1:Nsim,ClusteredRISimple,data=dataCluster,cluster='Cluster',y='yA')
# programming to run in parallel
sf.RI.Cluster.Simple <- function(Nsim,data,cluster,y){
sfInit(parallel=TRUE,cpus=8)
sfLibrary(dplyr)
sim <- matrix(unlist(sfLapply(1:Nsim,ClusteredRISimple,data=data,cluster=cluster,y=y)),nrow=Nsim,ncol=1,byrow=TRUE)
sfStop()
colnames(sim) <- c('WW')
return(sim)
}
# testing
sf.test.RI.simple <- sf.RI.Cluster.Simple(Nsim=Nsim,data=dataCluster,cluster='Cluster',y='yA')
# running
Nsim <- 1000
simuls.RI.Simple <- sf.RI.Cluster.Simple(Nsim=Nsim,data=dataCluster,cluster='Cluster',y='yA')
Var.WW.RI.Simple <- var(simuls.RI.Simple)
```
With the simplified randomization inference procedure, the 99\% sampling noise is equal to `r round(2*(qnorm((delta+1)/2))*sqrt(Var.WW.RI.Simple),2)`.
The true level of sampling noise estimated from the simulations is `r round(precision.BF.ICC[[3]],2)`.
## Clustering in panel data
Another source of clustering appears in panel data, where we follow the same unit over time.
One way to account for the correlation between the outcomes of the same individual over time is to add individual level fixed effects, and to use either a within estimator or a Least Squares Dummy Variable estimator, as we did in Section \@ref(DID).
But adding individual level fixed effects probably does not account for the overall correlation between observations of the same individual over time.
A case in point is autocorrelation in the error terms due to persistent shocks.
If a unit has experienced a shock at date $t$, she is more likely to experience a similar shock at date $t+1$.
For example, health problems do not solve themselves miraculously between survey rounds.
In the same manner, job loss or job changes or hurricanes, etc, all have effects that persist over time.
As a consequence, we often believe that the observations for the same unit are correlated over time, above and beyond what can be captured by a unit fixed effect.
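The persistence mechanism can be illustrated with a short simulation of an AR(1) error term (a sketch with arbitrary parameter values; the `.ar`-suffixed names are illustrative):

```{r AR1PersistenceSketch}
set.seed(1234)
# simulate a long AR(1) error term: U_t = rho*U_{t-1} + epsilon_t
rho.ar <- 0.8
T.ar <- 10000
eps.ar <- rnorm(T.ar)
U.ar <- as.numeric(stats::filter(eps.ar,filter=rho.ar,method="recursive"))
# the first-order sample autocorrelation is close to rho
cor(U.ar[-1],U.ar[-T.ar])
```

Even after demeaning within units (the analogue of including a unit fixed effect), such correlation between consecutive error terms remains and has to be accounted for when estimating sampling noise.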
In this section, we are going to study the issue of clustering in panel data, focusing on the estimators we have studied in Section \@ref(DID), especially the simple DID estimator, the fixed effects estimator, and the Sun and Abraham estimator in staggered designs.
A key result is that DID estimates which only involve two periods are not affected by the autocorrelation problem, while estimates which aggregate more than two periods of data are affected, and the more so as they aggregate more time periods.
We are first going to work through an example to illustrate the importance of the problem.
We will then derive the closed form variance of the various estimators under simplified assumptions on the autocorrelation between error terms.
Finally, we will explain how to estimate sampling noise in panel data with autocorrelated error terms.
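A quick way to see why aggregation interacts with autocorrelation is to compare, by simulation, the true variance of a time average of AR(1) errors with what the i.i.d. formula would suggest (a sketch with arbitrary parameters and `.agg`-suffixed names; the AR(1) is initialized at zero, so the comparison is only indicative):

```{r AR1AggregationSketch}
set.seed(1234)
rho.agg <- 0.8
T.agg <- 20
nsim.agg <- 5000
# simulated variance of the time average of T.agg AR(1) error terms
avg.agg <- replicate(nsim.agg,{
  U <- as.numeric(stats::filter(rnorm(T.agg),filter=rho.agg,method="recursive"))
  mean(U)
})
var(avg.agg)
# the naive i.i.d. formula Var(U_t)/T, using the stationary variance of U_t,
# badly understates the variance of the average
(1/(1-rho.agg^2))/T.agg
```

The positive covariances between periods inflate the variance of multi-period averages well beyond the i.i.d. benchmark, which is why naive standard errors are too small for estimates that aggregate many periods.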
### An example
Let us start with an example of how autocorrelated error terms alter the precision of panel data estimates.
For that, we are going to generate long time series of panel data and explore how the variability of treatment effect estimates changes with persistence in error terms at the individual level, both for the event-study estimates (which rely on $2\times 2$ comparisons) and for the TT estimate, which relies on aggregated treatment effects.
For simplicity, we are going to have a design without staggered entry, but everything we are saying here applies to staggered entry as well.
**Check what happens with staggered entry**
I am going to use the following model for outcome dynamics:
\begin{align*}
Y_{i,t} & = D_{i,t}Y^1_{i,t}+(1-D_{i,t})Y_{i,t}^0\\
Y_{i,t}^1 & = Y_{i,t}^0 + \bar\alpha \\
Y_{i,t}^0 & = \mu_i + U_{i,t} \\
U_{i,t} & = \rho U_{i,t-1} +\epsilon_{i,t} \\
D_{i,t} & = \uns{D^*_{i,k}\geq0}\uns{t\geq k} \\
D^*_{i,k} & = \theta_i + \eta_{i,k}.
\end{align*}
Outcome dynamics are characterized by an individual fixed effect $\mu_i$ and an AR(1) process with autoregressive parameter $\rho\in\left[0,1\right]$.
$\epsilon_{i,t}$ is i.i.d. and has finite variance $\sigma^2$ and is independent from all the other variables in the model.
$U_{i,0}$ has finite variance $\sigma^2_{U_0}$ and is independent from all the other variables in the model.
The program is only available at period $k$.
Participation in the program (which we denote $D_{i,t}$) depends on the utility from entering the program $D^*_{i,k}$ being positive.
$\eta_{i,k}$ is a random shock uncorrelated with any of the other variables in the model.
Selection bias occurs because $\mu_i$ and $\theta_i$ are correlated.
Let us now choose some parameter values (all but the value of $\rho$):
```{r ParametersPanelClustered}
# basic parameter values
param.basic <- c(0.05,0.72,0.55,0.55,0.5,0.1,0)
names(param.basic) <- c("sigma","sigmaU0","sigmamu","sigmatheta","rhosigmamutheta","sigmaeta","baralpha")
```
Note that I have chosen the true effect of the treatment to be zero, which avoids computational difficulties because $Y_{i,t}=Y^0_{i,t}$, and which does not matter at all for now, since we are trying to estimate sampling noise and not the treatment effect itself.
Let us now encode a function generating one sample for one choice of parameter values and seed and spitting out one estimate of the event study parameters and of the TT parameter based on Sun and Abraham estimator.
I am also going to estimate the $2\times 2$ DID estimator along with its heteroskedasticity-robust standard errors, both as a within estimator and as a first-difference estimator, in order to check whether the standard error estimates are similar across all three approaches.
```{r ModelSimpleDID,cache=TRUE}
# function generating a sample of outcomes for K time periods with N sample size and spitting out DID estimates (both event study and TT, using the Sun and Abraham estimator)
# seed: seed setting the PRNG
# N: number of units in the panel
# K: number of periods in the panel
# k: treatment date
# param: basic parameters
# rho: value of rho
# cluster: whether we cluster or not our treatment effect estimates for temporal autocorrelation ("None", vs "Temporal")
Outcome.Sample.DID.Long <- function(seed,N,K,k,param,rho,cluster="None"){
# draw all i.i.d. error terms for all years
set.seed(seed)
epsilon <- rnorm(n=K*N,sd=sqrt(param["sigma"]))
# draw initial error term
U0 <- rnorm(n=N,sd=sqrt(param["sigmaU0"]))
# draw fixed effects
sigma.mu.theta <- matrix(c(param["sigmamu"],param["sigmamu"]*param["sigmatheta"]*param["rhosigmamutheta"],
param["sigmamu"]*param["sigmatheta"]*param["rhosigmamutheta"],param["sigmatheta"]
),ncol=2,nrow=2,byrow = T)
mu.theta <- rmvnorm(n=N,mean=c(0,0),sigma=sigma.mu.theta)
mu <- mu.theta[,1]
theta <- mu.theta[,2]
# building dataset (wide format)
data <- as.data.frame(cbind(mu,U0))
for (i in 1:K){
data[paste('U',i,sep="")] <- rho*data[paste('U',i-1,sep="")]+epsilon[((i-1)*N+1):(i*N)]
data[paste('Y',i,sep="")] <- mu + data[paste('U',i,sep="")]
}
# selection rule
# draw error term
data['eta'] <- rnorm(n=N,sd=sqrt(param["sigmaeta"]))
data <- data %>%
mutate(
# generating utility from participation
Ds = theta + eta,
# generating treatment participation
D = case_when(
Ds >0 ~ 1,
TRUE ~ 0
),
# generating treatment identifier for Sun&Abraham estimator with treatment group being at period k
Dsunab = case_when(
D == 1 ~ k,
D == 0 ~ 99,
TRUE ~ -99
)
)
# generate long dataset
data <- data %>%
select(contains("Y"),contains("D")) %>%
mutate(id = 1:nrow(data)) %>%
pivot_longer(starts_with("Y"),names_to = "Period",names_prefix = "Y",values_to = "Y") %>%
mutate(
Period = as.numeric(Period),
TimeToTreatment = case_when(
D == "1" ~ Period-k,
TRUE ~ -99
)
)
# estimating DID model with Sun and Abraham estimator
if (cluster=="None"){
reg.DID.SA <- feols(Y ~ sunab(Dsunab,Period) | id + Period,vcov='HC1',data=data)
}