# Fundamental Problem of Statistical Inference {#FPSI}
The Fundamental Problem of Statistical Inference (FPSI) states that, even if we have an estimator $E$ that identifies $TT$ in the population, we cannot observe $E$ because we only have access to a finite sample of the population.
The only thing that we can form from the sample is a sample equivalent $\hat{E}$ to the population quantity $E$, and $\hat{E}\neq E$.
For example, the sample analog to $WW$ is the difference in means between treated and untreated units $\hat{WW}$.
As we saw in the last lecture, $\hat{WW}$ is never exactly equal to $WW$.
Why is $\hat{E}\neq E$?
Because a finite sample is never perfectly representative of the population.
In a sample, even in a random sample, the distribution of the observed and unobserved covariates deviates from the true population one.
As a consequence, the sample value of the estimator is never precisely equal to the population value, but fluctuates around it with sampling noise.
The main problem with the FPSI is that if we find an effect of our treatment, be it small or large, we cannot know whether we should attribute it to the treatment or to the bad or good luck of sampling noise.
What can we do to deal with the FPSI?
I am going to argue that there are mainly two things that we might want to do: estimating the extent of sampling noise and decreasing sampling noise.
Estimating sampling noise means measuring how much variability there is in our estimate $\hat{E}$ due to the sampling procedure.
This is very useful because it enables us to form a confidence interval that gauges how far from $\hat{E}$ the true value $E$ might be.
It is a measure of the precision of our estimation and of the extent to which sampling noise might drive our results.
Estimating sampling noise is very hard because we have only access to one sample and we would like to know the behavior of our estimator over repeated samples.
We are going to learn four ways to estimate the extent of sampling noise using data from one sample.
Because sampling noise is such a nuisance and makes our estimates imprecise, we would like to be able to make it as small as possible.
We are going to study three ways of decreasing sampling noise, two that take place before collecting the data (increasing sample size, stratifying) and one that takes place after (conditioning).
Maybe you are surprised not to find statistical significance tests as an important answer to the FPSI.
I argue in this lecture that statistical tests are misleading tools that make us overestimate the confidence in our results and underestimate the scope of sampling noise.
Statistical tests are not meant to be used for scientific research, but were originally designed to make decisions in industrial settings where the concept of successive sampling made actual sense.
Statistical tests also generate collective behaviors such as publication bias and specification search that undermine the very foundations of science.
A general movement, in the social sciences but also in physics, is pushing to ban the reporting of p-values.
## What is sampling noise? Definition and illustration {#sec:sampnoise}
In this section, I am going to define sampling noise and illustrate it with a numerical example.
In Section \@ref(sec:definitionnoise), I define sampling noise.
In Section \@ref(sec:illusnoisepop), I illustrate how sampling noise varies when one is interested in the population treatment effect.
In Section \@ref(sec:illusnoisesamp), I illustrate how sampling noise varies when one is interested in the sample treatment effect.
Finally, in Section \@ref(sec:confinterv), I show how confidence intervals can be built from an estimate of sampling noise.
### Sampling noise, a definition {#sec:definitionnoise}
Sampling noise measures how much sampling variability moves the sample estimator $\hat{E}$ around.
One way to define it more rigorously is to make it equal to the width of a confidence interval:
```{definition,sampnoise,name='Sampling noise'}
Sampling noise is the width of the symmetric interval around $TT$ within which $\delta\times 100$\% of the sample estimators fall, where $\delta$ is the confidence level.
As a consequence, sampling noise is equal to $2\epsilon$ where $\epsilon$ is such that:
\begin{align*}
\Pr(|\hat{E}-TT|\leq\epsilon) &= \delta.
\end{align*}
```
This definition tries to capture the properties of the distribution of $\hat{E}$ using only one number.
Like every simplification, it leaves room for dissatisfaction, exactly as a 2D map is a convenient albeit arbitrary betrayal of a 3D phenomenon.
For example, there is nothing sacred about the symmetry of the interval.
It is just extremely convenient.
One might prefer an interval that is symmetric in tail probabilities instead.
Feel free to explore different concepts if you like.
A related concept to that of sampling noise is that of precision: the smaller the sampling noise, the higher the precision.
Precision can be defined for example as the inverse of sampling noise $\frac{1}{2\epsilon}$.
Finally, a very useful concept is that of signal to noise ratio.
It is rarely used in economics, but physicists use it all the time.
The signal to noise ratio measures the treatment effect in multiple of the sampling noise.
If they are of the same order of magnitude, we have a lot of noise and little confidence in our estimates.
If the signal is much larger than the noise, we tend to have a lot of confidence in our parameter estimates.
The signal to noise ratio can be computed as follows: $\frac{E}{2\epsilon}$ or $\frac{\hat{E}}{2\epsilon}$.
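To fix ideas, here is a minimal sketch (not evaluated here) of how these three quantities could be computed if we had access to a vector of replication estimates and to the true effect, which is exactly what the Monte Carlo simulations below provide; the objects ``estimates`` and ``TT`` are placeholders for illustration only.
```{r noise-concepts-sketch,eval=FALSE,echo=TRUE}
# Hedged sketch: empirical versions of sampling noise, precision and the
# signal to noise ratio, assuming `estimates` is a vector of replication
# estimates of the treatment effect and `TT` is its true value.
sampling.noise <- 2*quantile(abs(estimates-TT),probs=delta) # 2*epsilon at confidence level delta
precision <- 1/sampling.noise                               # 1/(2*epsilon)
signal.to.noise <- TT/sampling.noise                        # signal to noise ratio
```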
```{remark}
A very striking result is that the signal to noise ratio of a result that is marginally significant at the 5\% level is very small, around one half, meaning that the noise is generally double the signal in these results.
We will derive this result after studying how to estimate sampling noise with real data.
```
There are two distinct ways of understanding sampling noise, depending on whether we are after the population treatment effect ($\Delta^Y_{TT}$) or the sample treatment effect ($\Delta^Y_{TT_s}$).
Sampling noise for the population treatment effect stems from the fact that the sample is not perfectly representative of the population.
The sample differs from the population, and thus the sample estimate differs from the population parameter.
Sampling noise for the sample parameter stems from the fact that the control group is not a perfect embodiment of the counterfactual.
Discrepancies between treated and control samples are going to generate differences between the $WW$ estimate and the $TT$ effect in the sample.
### Sampling noise for the population treatment effect {#sec:illusnoisepop}
Sampling noise for the population treatment effect stems from the fact that the sample is not perfectly representative of the population.
```{example}
In order to assess the scope of sampling noise for our population treatment effect estimate, let's first draw a sample.
In order to be able to do that, I first have to define the parameter values:
```
```{r param,eval=TRUE,echo=TRUE,cache=TRUE}
param <- c(8,.5,.28,1500,0.9,0.01,0.05,0.05,0.05,0.1)
names(param) <- c("barmu","sigma2mu","sigma2U","barY","rho","theta","sigma2epsilon","sigma2eta","delta","baralpha")
param
```
```{r simul.no.selb.1234,eval=TRUE,echo=TRUE}
set.seed(1234)
N <-1000
mu <- rnorm(N,param["barmu"],sqrt(param["sigma2mu"]))
UB <- rnorm(N,0,sqrt(param["sigma2U"]))
yB <- mu + UB
YB <- exp(yB)
Ds <- rep(0,N)
V <- rnorm(N,param["barmu"],sqrt(param["sigma2mu"]+param["sigma2U"]))
Ds[V<=log(param["barY"])] <- 1
epsilon <- rnorm(N,0,sqrt(param["sigma2epsilon"]))
eta<- rnorm(N,0,sqrt(param["sigma2eta"]))
U0 <- param["rho"]*UB + epsilon
y0 <- mu + U0 + param["delta"]
alpha <- param["baralpha"]+ param["theta"]*mu + eta
y1 <- y0+alpha
Y0 <- exp(y0)
Y1 <- exp(y1)
y <- y1*Ds+y0*(1-Ds)
Y <- Y1*Ds+Y0*(1-Ds)
delta.y.ate <- function(param){
return(param["baralpha"]+param["theta"]*param["barmu"])
}
```
In this sample, the $WW$ estimator yields an estimate of $\hat{\Delta^y_{WW}}=$ `r round(mean(y[Ds==1])-mean(y[Ds==0]),3)`.
Despite random assignment, we have $\hat{\Delta^y_{WW}}\neq\Delta^y_{TT}=$ `r round(delta.y.ate(param),3)`, an instance of the FPSI.
In order to see how sampling noise varies, let's draw another sample.
In order to do so, I am going to choose a different seed to initialize the pseudo-random number generator in R.
```{r simul.no.selb.12345,eval=TRUE,echo=TRUE}
set.seed(12345)
N <-1000
mu <- rnorm(N,param["barmu"],sqrt(param["sigma2mu"]))
UB <- rnorm(N,0,sqrt(param["sigma2U"]))
yB <- mu + UB
YB <- exp(yB)
Ds <- rep(0,N)
V <- rnorm(N,param["barmu"],sqrt(param["sigma2mu"]+param["sigma2U"]))
Ds[V<=log(param["barY"])] <- 1
epsilon <- rnorm(N,0,sqrt(param["sigma2epsilon"]))
eta<- rnorm(N,0,sqrt(param["sigma2eta"]))
U0 <- param["rho"]*UB + epsilon
y0 <- mu + U0 + param["delta"]
alpha <- param["baralpha"]+ param["theta"]*mu + eta
y1 <- y0+alpha
Y0 <- exp(y0)
Y1 <- exp(y1)
y <- y1*Ds+y0*(1-Ds)
Y <- Y1*Ds+Y0*(1-Ds)
```
In this sample, the $WW$ estimator yields an estimate of $\hat{\Delta^y_{WW}}=$ `r round(mean(y[Ds==1])-mean(y[Ds==0]),3)`.
Again, despite random assignment, we have $\hat{\Delta^y_{WW}}\neq\Delta^y_{TT}=$ `r round(delta.y.ate(param),3)`, an instance of the FPSI.
Furthermore, the estimate of the population treatment effect in this sample differs from the previous one, a consequence of sampling noise.
Let's now visualize the extent of sampling noise by repeating the procedure multiple times with various sample sizes.
These are called Monte Carlo replications: in each replication, I choose a sample size, draw one sample from the population and compute the $\hat{WW}$ estimator.
At each replication, the sample I'm using is different, reflecting the actual sampling process and enabling me to gauge the extent of sampling noise.
In order to focus on sampling noise alone, I am running the replications in the model in which selection into the treatment is independent of potential outcomes, so that $WW=TT$ in the population.
In order to speed up the process, I am using parallelized computing: I send each sample to a different core in my computer so that several samples can be run at the same time.
You might want to adapt the program below to the number of cores you actually have by changing the ``ncpus`` variable at the beginning of the ``.Rmd`` file that generates this page.
In order to parallelize computations, I use the Snowfall package in R, which provides very simple and intuitive parallelization commands.
In order to save time when generating the graph, I use the wonderful "cache" option of knitr: it stores the estimates from the code chunk and does not rerun it as long as neither the code inside the chunk nor the code of the chunks it depends on (parameter values, for example) has been altered.
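If you do not know how many cores your machine has, here is a minimal sketch (not evaluated here) of one way to set ``ncpus`` automatically; it assumes the parallel package that ships with R.
```{r ncpus-sketch,eval=FALSE,echo=TRUE}
# Hedged sketch: detect the number of available cores and keep one free
# for the operating system.
library(parallel)
ncpus <- max(1,detectCores()-1)
```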
```{r montecarlo,eval=TRUE,echo=TRUE,warning=FALSE,error=FALSE,message=FALSE,fig.cap='Distribution of the $WW$ estimator over replications of samples of different sizes',fig.align='center',out.width=cst,fig.pos='htbp',cache=TRUE,dependson=c('parameters','param')}
monte.carlo.ww <- function(s,N,param){
set.seed(s)
mu <- rnorm(N,param["barmu"],sqrt(param["sigma2mu"]))
UB <- rnorm(N,0,sqrt(param["sigma2U"]))
yB <- mu + UB
YB <- exp(yB)
Ds <- rep(0,N)
V <- rnorm(N,param["barmu"],sqrt(param["sigma2mu"]+param["sigma2U"]))
Ds[V<=log(param["barY"])] <- 1
epsilon <- rnorm(N,0,sqrt(param["sigma2epsilon"]))
eta<- rnorm(N,0,sqrt(param["sigma2eta"]))
U0 <- param["rho"]*UB + epsilon
y0 <- mu + U0 + param["delta"]
alpha <- param["baralpha"]+ param["theta"]*mu + eta
y1 <- y0+alpha
Y0 <- exp(y0)
Y1 <- exp(y1)
y <- y1*Ds+y0*(1-Ds)
Y <- Y1*Ds+Y0*(1-Ds)
return(c((1/sum(Ds))*sum(y*Ds)-(1/sum(1-Ds))*sum(y*(1-Ds)),var(y[Ds==1]),var(y[Ds==0]),mean(Ds)))
}
simuls.ww.N <- function(N,Nsim,param){
simuls.ww <- as.data.frame(matrix(unlist(lapply(1:Nsim,monte.carlo.ww,N=N,param=param)),nrow=Nsim,ncol=4,byrow=TRUE))
colnames(simuls.ww) <- c('WW','V1','V0','p')
return(simuls.ww)
}
sf.simuls.ww.N <- function(N,Nsim,param){
sfInit(parallel=TRUE,cpus=ncpus)
sim <- as.data.frame(matrix(unlist(sfLapply(1:Nsim,monte.carlo.ww,N=N,param=param)),nrow=Nsim,ncol=4,byrow=TRUE))
sfStop()
colnames(sim) <- c('WW','V1','V0','p')
return(sim)
}
simuls.ww <- lapply(N.sample,sf.simuls.ww.N,Nsim=Nsim,param=param)
par(mfrow=c(2,2))
for (i in 1:4){
hist(simuls.ww[[i]][,'WW'],main=paste('N=',as.character(N.sample[i])),xlab=expression(hat(Delta^yWW)),xlim=c(-0.15,0.55))
abline(v=delta.y.ate(param),col="red")
}
```
Figure \@ref(fig:montecarlo) is essential to understanding statistical inference and the properties of our estimators.
We can see on Figure \@ref(fig:montecarlo) that the estimates indeed move around at each sample replication.
We can also see that the estimates seem to be concentrated around the truth.
We also see that the estimates are more and more concentrated around the truth as sample size grows larger and larger.
How big is sampling noise in all of these examples?
We can compute it by using the replications as an approximation to the true distribution of the estimator over an infinite number of samples.
Let's first choose a confidence level and then compute the empirical equivalent to the formula in Definition \@ref(def:sampnoise).
```{r samp.noise,eval=TRUE,echo=TRUE}
delta<- 0.99
delta.2 <- 0.95
samp.noise <- function(estim,delta){
return(2*quantile(abs(delta.y.ate(param)-estim),prob=delta))
}
samp.noise.ww <- sapply(lapply(simuls.ww,`[`,,1),samp.noise,delta=delta)
names(samp.noise.ww) <- N.sample
samp.noise.ww
```
Let's also compute precision and the signal to noise ratio and put all of these results together in a nice table.
```{r precisionsignal,eval=TRUE,echo=TRUE,results='asis'}
precision <- function(estim,delta){
return(1/samp.noise(estim,delta))
}
signal.to.noise <- function(estim,delta,param){
return(delta.y.ate(param)/samp.noise(estim,delta))
}
precision.ww <- sapply(lapply(simuls.ww,`[`,,1),precision,delta=delta)
names(precision.ww) <- N.sample
signal.to.noise.ww <- sapply(lapply(simuls.ww,`[`,,1),signal.to.noise,delta=delta,param=param)
names(signal.to.noise.ww) <- N.sample
table.noise <- cbind(samp.noise.ww,precision.ww,signal.to.noise.ww)
colnames(table.noise) <- c('Sampling noise', 'Precision', 'Signal to noise ratio')
knitr::kable(table.noise,caption=paste('Sampling noise of $\\hat{WW}$ for the population treatment effect with $\\delta=$',delta,'for various sample sizes',sep=' '),booktabs=TRUE,digits = c(2,2,2),align=c('c','c','c'))
```
Finally, a nice way to summarize the extent of sampling noise is to graph how sampling noise varies around the true treatment effect, as shown on Figure \@ref(fig:precision).
```{r precision,dependson='monte.carlo',eval=TRUE,echo=TRUE,warning=FALSE,error=FALSE,message=FALSE,fig.cap='Sampling noise of $\\hat{WW}$ (99\\% confidence) around $TT$ for various sample sizes',fig.align='center',out.width=cst,fig.pos='htbp'}
colnames(table.noise) <- c('sampling.noise', 'precision', 'signal.to.noise')
table.noise <- as.data.frame(table.noise)
table.noise$N <- as.numeric(rownames(table.noise))
table.noise$TT <- rep(delta.y.ate(param),nrow(table.noise))
ggplot(table.noise, aes(x=as.factor(N), y=TT)) +
geom_bar(position=position_dodge(), stat="identity", colour='black') +
geom_errorbar(aes(ymin=TT-sampling.noise/2, ymax=TT+sampling.noise/2), width=.2,position=position_dodge(.9),color='red') +
xlab("Sample Size")+
theme_bw()
```
With $N=$ `r N.sample[[1]]`, we can definitely see on Figure \@ref(fig:precision) that sampling noise is ridiculously large, especially compared with the treatment effect that we are trying to estimate.
The signal to noise ratio is `r round(table.noise[rownames(table.noise)==N.sample[[1]],colnames(table.noise)=='signal.to.noise'],2)`, which means that sampling noise is an order of magnitude bigger than the signal we are trying to extract.
As a consequence, in `r length(simuls.ww[[1]][,'WW'][simuls.ww[[1]][,'WW']<=0])/Nsim*100`\% of our samples, we are going to estimate a negative effect of the treatment.
There is also a `r length(simuls.ww[[1]][,'WW'][simuls.ww[[1]][,'WW']>2*delta.y.ate(param)])/Nsim*100`\% chance that we end up estimating an effect that is double the true effect.
So how much can we trust our estimate from one sample to be close to the true effect of the treatment when $N=$ `r N.sample[[1]]`?
Not much.
With $N=$ `r N.sample[[2]]`, sampling noise is still large: the signal to noise ratio is `r round(table.noise[rownames(table.noise)==N.sample[[2]],colnames(table.noise)=='signal.to.noise'],2)`, which means that sampling noise is double the signal we are trying to extract.
As a consequence, the chance that we end up with a negative treatment effect has decreased to `r length(simuls.ww[[2]][,'WW'][simuls.ww[[2]][,'WW']<=0])/Nsim*100`\% and that we end up with an effect double the true one is `r length(simuls.ww[[2]][,'WW'][simuls.ww[[2]][,'WW']>2*delta.y.ate(param)])/Nsim*100`\%.
But still, the chances that we end up with an effect that is smaller than three quarters of the true effect is `r length(simuls.ww[[2]][,'WW'][simuls.ww[[2]][,'WW']<=0.75*delta.y.ate(param)])/Nsim*100`\% and the chances that we end up with an estimator that is 25\% bigger than the true effect is `r length(simuls.ww[[2]][,'WW'][simuls.ww[[2]][,'WW']>1.25*delta.y.ate(param)])/Nsim*100`\%.
These are nontrivial differences: compare a program that increases earnings by `r 0.75*delta.y.ate(param)*100`\% to one that increases them by `r delta.y.ate(param)*100`\% and another by `r 1.25*delta.y.ate(param)*100`\%.
They would have completely different cost/benefit ratios.
But we can at least trust our estimator to give us a correct idea of the sign of the treatment effect and a vague, imprecise idea of its magnitude.
With $N=$ `r N.sample[[3]]`, sampling noise is smaller than the signal, which is encouraging.
The signal to noise ratio is `r round(table.noise[rownames(table.noise)==N.sample[[3]],colnames(table.noise)=='signal.to.noise'],2)`.
In only 1\% of the samples does the estimated effect of the treatment become smaller than `r round(quantile(simuls.ww[[3]][,'WW'],probs=c(0.005)),3)` or bigger than `r round(quantile(simuls.ww[[3]][,'WW'],probs=c(0.995)),3)`.
We start gaining a lot of confidence in the relative magnitude of the effect, even if sampling noise is still responsible for economically significant variation.
With $N=$ `r N.sample[[4]]`, sampling noise has become trivial.
The signal to noise ratio is `r round(table.noise[rownames(table.noise)==N.sample[[4]],colnames(table.noise)=='signal.to.noise'],2)`, which means that the signal is now `r round(table.noise[rownames(table.noise)==N.sample[[4]],colnames(table.noise)=='signal.to.noise'],0)` times bigger than the sampling noise.
In only 1\% of the samples does the estimated effect of the treatment become smaller than `r round(quantile(simuls.ww[[4]][,'WW'],probs=c(0.005)),3)` or bigger than `r round(quantile(simuls.ww[[4]][,'WW'],probs=c(0.995)),3)`.
Sampling noise is no longer responsible for economically meaningful variation.
### Sampling noise for the sample treatment effect {#sec:illusnoisesamp}
Sampling noise for the sample parameter stems from the fact that the treated and control groups are not perfectly identical.
The distributions of observed and unobserved covariates actually differ between the two groups because of sampling variation.
This makes the actual comparison of means in the sample a noisy estimate of the true comparison that we would obtain by comparing the potential outcomes of the treated directly.
In order to understand this issue well and to be able to illustrate it correctly, I am going to focus on the average treatment effect in the whole sample, not on the treated: $\Delta^Y_{ATE_s}=\frac{1}{N}\sum_{i=1}^N(Y_i^1-Y_i^0)$.
This enables me to define a sample parameter that is independent of the allocation of $D_i$.
This is of little consequence since these two parameters are equal in the population when there is no selection bias, as we have been assuming since the beginning of this lecture.
Furthermore, if we view the treatment allocation generating no selection bias as a true random assignment in a Randomized Controlled Trial (RCT), then it is still possible to use this approach to estimate $TT$ if we view the population over which we randomise as the population selected for receiving the treatment, as we will see in the lecture on RCTs.
```{example}
In order to assess the scope of sampling noise for our sample treatment effect estimate, we first have to draw a sample:
```
```{r simulnoselb2,eval=TRUE,echo=TRUE}
set.seed(1234)
N <-1000
mu <- rnorm(N,param["barmu"],sqrt(param["sigma2mu"]))
UB <- rnorm(N,0,sqrt(param["sigma2U"]))
yB <- mu + UB
YB <- exp(yB)
Ds <- rep(0,N)
V <- rnorm(N,param["barmu"],sqrt(param["sigma2mu"]+param["sigma2U"]))
Ds[V<=log(param["barY"])] <- 1
epsilon <- rnorm(N,0,sqrt(param["sigma2epsilon"]))
eta<- rnorm(N,0,sqrt(param["sigma2eta"]))
U0 <- param["rho"]*UB + epsilon
y0 <- mu + U0 + param["delta"]
alpha <- param["baralpha"]+ param["theta"]*mu + eta
y1 <- y0+alpha
Y0 <- exp(y0)
Y1 <- exp(y1)
y <- y1*Ds+y0*(1-Ds)
Y <- Y1*Ds+Y0*(1-Ds)
```
In this sample, the treatment effect parameter is $\Delta^y_{ATE_s}=$ `r round(mean(alpha),3)`.
The $WW$ estimator yields an estimate of $\hat{\Delta^y_{WW}}=$ `r round(mean(y[Ds==1])-mean(y[Ds==0]),3)`.
Despite random assignment, we have $\Delta^y_{ATE_s}\neq\hat{\Delta^y_{WW}}$, an instance of the FPSI.
In order to see how sampling noise varies, let's draw a new treatment allocation, while retaining the same sample and the same potential outcomes.
```{r simulsampnoise12345,eval=TRUE,echo=TRUE}
set.seed(12345)
N <-1000
Ds <- rep(0,N)
V <- rnorm(N,param["barmu"],sqrt(param["sigma2mu"]+param["sigma2U"]))
Ds[V<=log(param["barY"])] <- 1
y <- y1*Ds+y0*(1-Ds)
Y <- Y1*Ds+Y0*(1-Ds)
```
In this sample, the treatment effect parameter is still $\Delta^y_{ATE_s}=$ `r round(mean(alpha),3)`.
The $WW$ estimator yields now an estimate of $\hat{\Delta^y_{WW}}=$ `r round(mean(y[Ds==1])-mean(y[Ds==0]),3)`.
The $WW$ estimate is different from our previous estimate because the treatment was allocated to a different random subset of people.
Why is this second estimate so imprecise?
It might be because it estimates one of the two components of the average treatment effect badly, or both.
The true average potential outcome with the treatment is, in this sample, $\frac{1}{N}\sum_{i=1}^Ny_i^1=$ `r round(mean(y1),3)` while the $WW$ estimate of this quantity is $\frac{1}{\sum_{i=1}^ND_i}\sum_{i=1}^ND_iy_i=$ `r round(mean(y[Ds==1]),3)`.
The true average potential outcome without the treatment is, in this sample, $\frac{1}{N}\sum_{i=1}^Ny_i^0=$ `r round(mean(y0),3)` while the $WW$ estimate of this quantity is $\frac{1}{\sum_{i=1}^N(1-D_i)}\sum_{i=1}^N(1-D_i)y_i=$ `r round(mean(y[Ds==0]),3)`.
It thus seems that most of the bias in the estimated effect stems from the fact that the treatment has been allocated to individuals with lower-than-expected outcomes with the treatment, either because they did not react strongly to the treatment or because they were in worse shape without it.
We can check which one of these two explanations is more important.
The true average effect of the treatment is, in this sample, $\frac{1}{N}\sum_{i=1}^N(y_i^1-y^0_i)=$ `r round(mean(alpha),3)` while, in the treated group, this quantity is $\frac{1}{\sum_{i=1}^ND_i}\sum_{i=1}^ND_i(y_i^1-y_i^0)=$ `r round(mean(alpha[Ds==1]),3)`.
The true average potential outcome without the treatment is, in this sample, $\frac{1}{N}\sum_{i=1}^Ny^0_i=$ `r round(mean(y0),3)` while, in the treated group, this quantity is $\frac{1}{\sum_{i=1}^ND_i}\sum_{i=1}^ND_iy_i^0=$ `r round(mean(y0[Ds==1]),3)`.
The reason for the poor performance of the $WW$ estimator in this sample is that individuals with lower counterfactual outcomes were included in the treated group, not that the treatment had lower effects on them.
The bad counterfactual outcomes of the treated generate a bias of `r round(mean(y0[Ds==1])-mean(y0),3)`, while the bias due to heterogeneous reactions to the treatment is `r round(mean(alpha[Ds==1])-mean(alpha),3)`.
The last part of the bias is due to the fact that the individuals in the control group have slightly better counterfactual outcomes than the sample as a whole: `r round(-mean(y0[Ds==0])+mean(y0),3)`.
The sum of these three terms yields the total bias of our $WW$ estimator in this second sample: `r round(mean(y[Ds==1])-mean(y[Ds==0])-mean(alpha),3)`.
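To summarize, here is a minimal sketch (not evaluated here) of the bias decomposition just described, assuming the objects ``y``, ``y0``, ``alpha`` and ``Ds`` from the current sample are in scope.
```{r bias-decomposition-sketch,eval=FALSE,echo=TRUE}
# Hedged sketch: decomposition of the bias of the WW estimator in this sample.
bias.y0.treated <- mean(y0[Ds==1])-mean(y0)          # treated have worse counterfactual outcomes
bias.alpha.treated <- mean(alpha[Ds==1])-mean(alpha) # treated react differently to the treatment
bias.y0.control <- mean(y0)-mean(y0[Ds==0])          # controls have better counterfactual outcomes
total.bias <- mean(y[Ds==1])-mean(y[Ds==0])-mean(alpha)
# The three components sum to the total bias:
all.equal(total.bias,bias.y0.treated+bias.alpha.treated+bias.y0.control)
```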
Let's now assess the overall effect of sampling noise on the estimate of the sample treatment effect for various sample sizes.
In order to do this, I am going to use parallelized Monte Carlo simulations again.
For the sake of simplicity, I am going to generate the same potential outcomes in each replication, using the same seed, and only choose a different treatment allocation.
```{r montecarlosample,eval=TRUE,echo=TRUE,warning=FALSE,error=FALSE,message=FALSE,fig.cap='Distribution of the $WW$ estimator over replications of treatment allocation for samples of different sizes',fig.align='center',out.width=cst,fig.pos='htbp',cache=TRUE,dependson=c('parameters','param')}
monte.carlo.ww.sample <- function(s,N,param){
set.seed(1234)
mu <- rnorm(N,param["barmu"],sqrt(param["sigma2mu"]))
UB <- rnorm(N,0,sqrt(param["sigma2U"]))
yB <- mu + UB
YB <- exp(yB)
epsilon <- rnorm(N,0,sqrt(param["sigma2epsilon"]))
eta<- rnorm(N,0,sqrt(param["sigma2eta"]))
U0 <- param["rho"]*UB + epsilon
y0 <- mu + U0 + param["delta"]
alpha <- param["baralpha"]+ param["theta"]*mu + eta
y1 <- y0+alpha
Y0 <- exp(y0)
Y1 <- exp(y1)
set.seed(s)
Ds <- rep(0,N)
V <- rnorm(N,param["barmu"],sqrt(param["sigma2mu"]+param["sigma2U"]))
Ds[V<=log(param["barY"])] <- 1
y <- y1*Ds+y0*(1-Ds)
Y <- Y1*Ds+Y0*(1-Ds)
return((1/sum(Ds))*sum(y*Ds)-(1/sum(1-Ds))*sum(y*(1-Ds)))
}
simuls.ww.N.sample <- function(N,Nsim,param){
return(unlist(lapply(1:Nsim,monte.carlo.ww.sample,N=N,param=param)))
}
sf.simuls.ww.N.sample <- function(N,Nsim,param){
sfInit(parallel=TRUE,cpus=ncpus)
sim <- sfLapply(1:Nsim,monte.carlo.ww.sample,N=N,param=param)
sfStop()
return(unlist(sim))
}
simuls.ww.sample <- lapply(N.sample,sf.simuls.ww.N.sample,Nsim=Nsim,param=param)
monte.carlo.ate.sample <- function(N,s,param){
set.seed(s)
mu <- rnorm(N,param["barmu"],sqrt(param["sigma2mu"]))
UB <- rnorm(N,0,sqrt(param["sigma2U"]))
yB <- mu + UB
YB <- exp(yB)
epsilon <- rnorm(N,0,sqrt(param["sigma2epsilon"]))
eta<- rnorm(N,0,sqrt(param["sigma2eta"]))
U0 <- param["rho"]*UB + epsilon
y0 <- mu + U0 + param["delta"]
alpha <- param["baralpha"]+ param["theta"]*mu + eta
y1 <- y0+alpha
Y0 <- exp(y0)
Y1 <- exp(y1)
Ds <- rep(0,N)
V <- rnorm(N,param["barmu"],sqrt(param["sigma2mu"]+param["sigma2U"]))
Ds[V<=log(param["barY"])] <- 1
y <- y1*Ds+y0*(1-Ds)
Y <- Y1*Ds+Y0*(1-Ds)
return(mean(alpha))
}
par(mfrow=c(2,2))
for (i in 1:4){
hist(simuls.ww.sample[[i]],main=paste('N=',as.character(N.sample[i])),xlab=expression(hat(Delta^yWW)),xlim=c(-0.15,0.55))
abline(v=monte.carlo.ate.sample(N.sample[[i]],1234,param),col="red")
}
```
Let's also compute sampling noise, precision and the signal to noise ratio in these examples.
```{r sampnoisesample,eval=TRUE,echo=TRUE}
samp.noise.sample <- function(i,delta,param){
return(2*quantile(abs(monte.carlo.ate.sample(1234,N.sample[[i]],param)-simuls.ww.sample[[i]]),prob=delta))
}
samp.noise.ww.sample <- sapply(1:4,samp.noise.sample,delta=delta,param=param)
names(samp.noise.ww.sample) <- N.sample
precision.sample <- function(i,delta,param){
return(1/samp.noise.sample(i,delta,param=param))
}
signal.to.noise.sample <- function(i,delta,param){
return(monte.carlo.ate.sample(1234,N.sample[[i]],param)/samp.noise.sample(i,delta,param=param))
}
precision.ww.sample <- sapply(1:4,precision.sample,delta=delta,param=param)
names(precision.ww.sample) <- N.sample
signal.to.noise.ww.sample <- sapply(1:4,signal.to.noise.sample,delta=delta,param=param)
names(signal.to.noise.ww.sample) <- N.sample
table.noise.sample <- cbind(samp.noise.ww.sample,precision.ww.sample,signal.to.noise.ww.sample)
colnames(table.noise.sample) <- c('Sampling noise', 'Precision', 'Signal to noise ratio')
knitr::kable(table.noise.sample,caption=paste('Sampling noise of $\\hat{WW}$ for the sample treatment effect with $\\delta=$',delta,'and for various sample sizes',sep=' '),booktabs=TRUE,align=c('c','c','c'),digits=c(3,3,3))
```
Finally, let's compare the extent of sampling noise for the population and the sample treatment effect parameters.
```{r precisionpopsample,dependson=c('monte.carlo','monte.carlo.sample'),eval=TRUE,echo=TRUE,warning=FALSE,error=FALSE,message=FALSE,fig.cap='Sampling noise of $\\hat{WW}$ (99\\% confidence) around $TT$ and $TT_s$ for various sample sizes',fig.align='center',out.width=cst,fig.pos='htbp'}
colnames(table.noise.sample) <- c('sampling.noise', 'precision', 'signal.to.noise')
table.noise.sample <- as.data.frame(table.noise.sample)
table.noise.sample$N <- as.numeric(rownames(table.noise.sample))
table.noise.sample$TT <- sapply(N.sample,monte.carlo.ate.sample,s=1234,param=param)
table.noise.sample$Type <- 'TTs'
table.noise$Type <- 'TT'
table.noise.tot <- rbind(table.noise,table.noise.sample)
table.noise.tot$Type <- factor(table.noise.tot$Type)
ggplot(table.noise.tot, aes(x=as.factor(N), y=TT,fill=Type)) +
geom_bar(position=position_dodge(), stat="identity", colour='black') +
geom_errorbar(aes(ymin=TT-sampling.noise/2, ymax=TT+sampling.noise/2), width=.2,position=position_dodge(.9),color='red') +
xlab("Sample Size")+
theme_bw()+
theme(legend.position=c(0.85,0.88))
```
Figure \@ref(fig:montecarlosample) and Table \@ref(tab:sampnoisesample) present the results of the simulations of sampling noise for the sample treatment effect parameter.
Figure \@ref(fig:precisionpopsample) compares sampling noise for the population and sample treatment effects.
For all practical purposes, the estimates of sampling noise for the sample treatment effect are extremely close to the ones we have estimated for the population treatment effect.
I am actually surprised by this result, since I expected that keeping the potential outcomes constant over replications would decrease sampling noise.
It seems that the variability in observed outcomes generated by replications of the random allocation of the treatment in a given sample mimics very well the sampling process from a population.
I do not know whether this similarity of sampling noise for the population and sample treatment effects is a general result, but considering them as close seems innocuous in our example.
### Building confidence intervals from estimates of sampling noise {#sec:confinterv}
In real life, we do not observe $TT$.
We only have access to $\hat{E}$.
Let's also assume for now that we have access to an estimate of sampling noise, $2\epsilon$.
How can we use these two quantities to assess the set of values that $TT$ might take?
One very useful device that we can use is the confidence interval.
Confidence intervals are very useful because they quantify the zone within which we have a chance to find the true effect $TT$:
```{theorem,confinter,name='Confidence interval'}
For a given level of confidence $\delta$ and corresponding level of sampling noise $2\epsilon$ of the estimator $\hat{E}$ of $TT$, the confidence interval $\left\{\hat{E}-\epsilon,\hat{E}+\epsilon\right\}$ is such that the probability that it contains $TT$ is equal to $\delta$ over sample replications:
\begin{align*}
\Pr(\hat{E}-\epsilon\leq TT\leq\hat{E}+\epsilon) & = \delta.
\end{align*}
```
```{proof}
From the definition of sampling noise, we know that:
\begin{align*}
\Pr(|\hat{E}-TT|\leq\epsilon) & = \delta.
\end{align*}
Now:
\begin{align*}
\Pr(|\hat{E}-TT|\leq\epsilon) & = \Pr(TT-\epsilon\leq\hat{E}\leq TT+\epsilon)\\
& = \Pr(-\hat{E}-\epsilon\leq-TT\leq -\hat{E}+\epsilon)\\
& = \Pr(\hat{E}-\epsilon\leq TT\leq\hat{E}+\epsilon),
\end{align*}
which proves the result.
```
It is very important to note that confidence intervals are centered around $\hat{E}$ and not around $TT$.
When estimating sampling noise and building Figure \@ref(fig:precision), we centered our intervals around $TT$.
The interval was fixed, $\hat{E}$ moved across replications, and $2\epsilon$ was defined as the length of the interval around $TT$ containing a proportion $\delta$ of the estimates $\hat{E}$.
A confidence interval cannot be centered around $TT$, which is unknown, but is centered around $\hat{E}$, which we can observe.
As a consequence, it is the interval that moves around across replications, and $\delta$ is the proportion of samples in which the interval contains $TT$.
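As a sanity check, here is a minimal sketch (not evaluated here) that computes the empirical coverage rate of these confidence intervals over the Monte Carlo replications, assuming the objects ``simuls.ww``, ``samp.noise`` and ``delta.y.ate`` defined above are in scope; by construction of $\epsilon$, the coverage should be very close to $\delta$.
```{r coverage-sketch,eval=FALSE,echo=TRUE}
# Hedged sketch: share of replications whose confidence interval
# [WW-epsilon,WW+epsilon] contains the true treatment effect.
coverage <- sapply(simuls.ww,function(sim){
  epsilon <- samp.noise(sim[,'WW'],delta=delta)/2
  mean(abs(sim[,'WW']-delta.y.ate(param))<=epsilon)
})
names(coverage) <- N.sample
coverage
```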
```{example}
Let's see how confidence intervals behave in our numerical example.
```
```{r confinterval,dependson=c('monte.carlo'),eval=TRUE,echo=TRUE,results='hide',warning=FALSE,error=FALSE,message=FALSE,fig.cap=paste('Confidence intervals of $\\hat{WW}$ for $\\delta=$',delta,'(red) and',delta.2,'(blue) over sample replications for various sample sizes',sep=' '),fig.align='center',out.width=cst,fig.pos='htbp',fig.height=12,fig.width=10}
N.plot <- 40
plot.list <- list()
for (k in 1:length(N.sample)){
set.seed(1234)
test <- sample(simuls.ww[[k]][,'WW'],N.plot)
test <- as.data.frame(cbind(test,rep(samp.noise(simuls.ww[[k]][,'WW'],delta=delta)),rep(samp.noise(simuls.ww[[k]][,'WW'],delta=delta.2))))
colnames(test) <- c('WW','sampling.noise.1','sampling.noise.2')
test$id <- 1:N.plot
plot.test <- ggplot(test, aes(x=as.factor(id), y=WW)) +
geom_bar(position=position_dodge(), stat="identity", colour='black') +
geom_errorbar(aes(ymin=WW-sampling.noise.1/2, ymax=WW+sampling.noise.1/2), width=.2,position=position_dodge(.9),color='red') +
geom_errorbar(aes(ymin=WW-sampling.noise.2/2, ymax=WW+sampling.noise.2/2), width=.2,position=position_dodge(.9),color='blue') +
geom_hline(aes(yintercept=delta.y.ate(param)), colour="#990000", linetype="dashed")+
#ylim(-0.5,1.2)+
xlab("Sample id")+
theme_bw()+
ggtitle(paste("N=",N.sample[k]))
plot.list[[k]] <- plot.test
}
plot.CI <- plot_grid(plot.list[[1]],plot.list[[2]],plot.list[[3]],plot.list[[4]],ncol=1,nrow=length(N.sample))
print(plot.CI)
```
Figure \@ref(fig:confinterval) presents the `r 100*delta`\% and `r 100*delta.2`\% confidence intervals for 40 samples selected from our simulations.
First, confidence intervals do their job: they contain the true effect most of the time.
Second, the `r 100*delta.2`\% confidence interval misses the true effect more often, as expected.
For example, with $N=$ `r N.sample[[2]]`, the confidence intervals in samples 13 and 23 do not contain the true effect, but it is not far from their lower bound.
Third, confidence intervals faithfully reflect what we can learn from our estimates at each sample size.
With $N=$ `r N.sample[[1]]`, the confidence intervals make it clear that the effect might be very large or very small, even strongly negative.
With $N=$ `r N.sample[[2]]`, the confidence intervals suggest that the effect is either positive or null, but unlikely to be strongly negative.
Most of the time, we get the sign right.
With $N=$ `r N.sample[[3]]`, we know that the true effect is bigger than 0.1 and smaller than 0.3 and most intervals place the true effect somewhere between 0.11 and 0.25.
With $N=$ `r N.sample[[4]]`, we know that the true effect is bigger than 0.15 and smaller than 0.21 and most intervals place the true effect somewhere between 0.16 and 0.20.
### Reporting sampling noise: a proposal
Once sampling noise is measured (and we'll see how to get an estimate in the next section), one still has to communicate it to others.
There are many ways to report sampling noise:
* Sampling noise as defined in this book ($2*\epsilon$)
* The corresponding confidence interval
* The signal to noise ratio
* A standard error
* A significance level
* A p-value
* A t-statistic
The main problem with most of these approaches is that they do not express sampling noise in a way that is directly comparable to the magnitude of the $TT$ estimate.
P-values and t-statistics, in particular, are nonlinear transforms of sampling noise, which makes it difficult to gauge the size of sampling noise relative to the magnitude of $TT$.
My own preference goes to the following format for reporting results: $TT \pm \epsilon$.
As such, we can readily compare the size of the noise to the size of the $TT$ estimate.
We can also form all the other ways of expressing sampling noise directly.
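As an illustration, here is a minimal sketch (not part of the simulations above) of a helper that formats a result in this way; the function name and arguments are mine, chosen for this example only.
```{r report-sketch,eval=FALSE,echo=TRUE}
# Hedged sketch: format a result as "estimate +/- epsilon", where epsilon is
# half of the sampling noise (2*epsilon) as defined above.
report.tt <- function(estimate,sampling.noise,digits=2){
  paste0(round(estimate,digits)," +/- ",round(sampling.noise/2,digits))
}
report.tt(0.17,0.34) # "0.17 +/- 0.17": noise as large as the signal
```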
```{example}
Let's see how this approach behaves in our numerical example.
```
```{r samp.noise.report,dependson=c('monte.carlo'),eval=TRUE,echo=TRUE,results='hide',warning=FALSE,error=FALSE,message=FALSE}
test.all <- list()
for (k in 1:length(N.sample)){
set.seed(1234)
test <- sample(simuls.ww[[k]][,'WW'],N.plot)
test <- as.data.frame(cbind(test,rep(samp.noise(simuls.ww[[k]][,'WW'],delta=delta)),rep(samp.noise(simuls.ww[[k]][,'WW'],delta=delta.2))))
colnames(test) <- c('WW','sampling.noise.1','sampling.noise.2')
test$id <- 1:N.plot
test.all[[k]] <- test
}
```
With $N=$ `r N.sample[[1]]`, the reporting of the results for sample `r test.all[[1]]$id[[1]]` would be something like: "we find an effect of `r round(select(filter(test.all[[1]],id==1),WW),2)` $\pm$ `r round(select(filter(test.all[[1]],id==1),sampling.noise.1)/2,2)`."
Note how the choice of $\delta$ does not matter much for the result.
The previous result was for $\delta=0.99$ while the result for $\delta=0.95$ would have been: "we find an effect of `r round(select(filter(test.all[[1]],id==1),WW),2)` $\pm$ `r round(select(filter(test.all[[1]],id==1),sampling.noise.2)/2,2)`."
The precise result changes with $\delta$, but the qualitative result stays the same: the magnitude of sampling noise is large and it dwarfs the treatment effect estimate.
With $N=$ `r N.sample[[2]]`, the reporting of the results for sample `r test.all[[2]]$id[[1]]` with $\delta=0.99$ would be something like: "we find an effect of `r round(select(filter(test.all[[2]],id==1),WW),2)` $\pm$ `r round(select(filter(test.all[[2]],id==1),sampling.noise.1)/2,2)`."
With $\delta=0.95$: "we find an effect of `r round(select(filter(test.all[[2]],id==1),WW),2)` $\pm$ `r round(select(filter(test.all[[2]],id==1),sampling.noise.2)/2,2)`."
Again, although the precise quantitative result is affected by the choice of $\delta$, the qualitative message stays the same: sampling noise is of the same order of magnitude as the estimated treatment effect.
With $N=$ `r N.sample[[3]]`, the reporting of the results for sample `r test.all[[3]]$id[[1]]` with $\delta=0.99$ would be something like: "we find an effect of `r round(select(filter(test.all[[3]],id==1),WW),2)` $\pm$ `r round(select(filter(test.all[[3]],id==1),sampling.noise.1)/2,2)`."
With $\delta=0.95$: "we find an effect of `r round(select(filter(test.all[[3]],id==1),WW),2)` $\pm$ `r round(select(filter(test.all[[3]],id==1),sampling.noise.2)/2,2)`."
Again, see how the qualitative result is independent of the precise choice of $\delta$: sampling noise is almost one order of magnitude smaller than the treatment effect estimate.
With $N=$ `r N.sample[[4]]`, the reporting of the results for sample `r test.all[[4]]$id[[1]]` with $\delta=0.99$ would be something like: "we find an effect of `r round(select(filter(test.all[[4]],id==1),WW),2)` $\pm$ `r round(select(filter(test.all[[4]],id==1),sampling.noise.1)/2,2)`."
With $\delta=0.95$: "we find an effect of `r round(select(filter(test.all[[4]],id==1),WW),2)` $\pm$ `r round(select(filter(test.all[[4]],id==1),sampling.noise.2)/2,2)`."
Again, see how the qualitative result is independent of the precise choice of $\delta$: sampling noise is one order of magnitude smaller than the treatment effect estimate.
```{remark}
What I hope the example makes clear is that my proposed way of reporting results gives the same importance to sampling noise as it gives to the treatment effect estimate.
Also, comparing them is easy, without imposing a heavy computational burden on our brains.
```
```{remark}
One problem with the approach that I propose arises when the distribution of sampling noise is non-symmetric, or when $TT \pm \epsilon$ exceeds natural bounds on $TT$ (for example, when the effect cannot be bigger than one).
I think these issues are minor and rare and can be dealt with on a case by case basis.
The advantage of having one simple and directly readable number comparable to the magnitude of the treatment effect is overwhelming and makes this approach the most natural and adequate, in my opinion.
```
### Using effect sizes to normalize the reporting of treatment effects and their precision {#sec:effectsize}
When looking at the effect of a program on an outcome, we depend on the scaling of that outcome to appreciate the relative size of the estimated treatment effect.
It is often difficult to appreciate the relative importance of the size of an effect, even if we know the scale of the outcome of interest.
One useful device to normalize the treatment effects is called Cohen's $d$, or effect size.
The idea is to compare the magnitude of the treatment effect to an estimate of the usual amount of variation that the outcome undergoes in the population.
The way to build Cohen's $d$ is by dividing the estimated treatment effect by the standard deviation of the outcome.
I generally prefer to use the standard deviation of the outcome in the control group, so as not to include the additional amount of variation due to the heterogeneity in treatment effects.
```{definition, name="Cohen's $d$"}
Cohen's $d$, or effect size, is the ratio of the estimated treatment effect to the standard deviation of outcomes in the control group:
```
$$
d = \frac{\hat{TT}}{\sqrt{\frac{1}{N^0}\sum_{i=1}^{N^0}(Y_i-\bar{Y^0})^2}}
$$
where $\hat{TT}$ is an estimate of the treatment effect, $N^0$ is the number of untreated individuals, the sum is taken over the untreated observations, and $\bar{Y^0}$ is the average outcome in the control group.
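In a sample, a minimal sketch (not evaluated here) of the corresponding estimator, assuming the outcome ``y`` and the treatment indicator ``Ds`` generated above are in scope, would be:
```{r cohen-d-sketch,eval=FALSE,echo=TRUE}
# Hedged sketch: sample analogue of Cohen's d, i.e. the WW estimate of the
# treatment effect divided by the standard deviation of outcomes among the
# untreated observations (sd() uses the N-1 denominator, a negligible
# difference in large samples).
cohen.d <- function(y,Ds){
  (mean(y[Ds==1])-mean(y[Ds==0]))/sd(y[Ds==0])
}
```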
Cohen's $d$ can be interpreted in terms of magnitude of effect size:
* It is generally considered that an effect is large when its $d$ is larger than 0.8.
* An effect size around 0.5 is considered medium.
* An effect size around 0.2 is considered small.
* An effect size around 0.02 is considered very small.
These thresholds could probably be rescaled, but that is the current state of the art.
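For convenience, here is one possible way to code this classification (a sketch, not evaluated here); mapping a value to the nearest reference point is my own choice, since the thresholds above are only approximate.
```{r es-label-sketch,eval=FALSE,echo=TRUE}
# Hedged sketch: label an effect size with the category whose reference value
# (0.02, 0.2, 0.5, 0.8) is closest to it.
es.label <- function(d){
  refs <- c("very small"=0.02,"small"=0.2,"medium"=0.5,"large"=0.8)
  names(refs)[sapply(abs(d),function(x){which.min(abs(refs-x))})]
}
es.label(c(0.05,0.21,0.9)) # "very small" "small" "large"
```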
What I like about effect sizes is that they encourage an interpretation of the order of magnitude of the treatment effect.
As such, they make it possible to include information on precision by looking at which orders of magnitude are compatible with the estimated effect at the estimated precision level.
Effect sizes and orders of magnitude help make us aware that our results might be imprecise, and that the precise value that we have estimated is probably not the truth.
What is important is the range of effect sizes compatible with our results (both point estimate and precision).
```{example}
Let's see how Cohen's $d$ behaves in our numerical example.
```
The value of Cohen's $d$ (or effect size) in the population is equal to:
\begin{align*}
ES & = \frac{TT}{\sqrt{V^0}} = \frac{\bar{\alpha}+\theta\bar{\mu}}{\sqrt{\sigma^2_{\mu}+\rho^2\sigma^2_{U}+\sigma^2_{\epsilon}}}
\end{align*}
We can write a function to compute this parameter, as well as functions to implement its estimator in the simulated samples:
```{r ES,echo=TRUE,warning=FALSE,message=FALSE,results=FALSE,error=FALSE}
V0 <- function(param){
return(param["sigma2mu"]+param["rho"]^2*param["sigma2U"]+param["sigma2epsilon"])
}
ES <- function(param){
return(delta.y.ate(param)/sqrt(V0(param)))
}
samp.noise.ES <- function(estim,delta,param=param){
return(2*quantile(abs(delta.y.ate(param)/sqrt(V0(param))-estim),prob=delta))
}
for (i in 1:4){
simuls.ww[[i]][,'ES'] <- simuls.ww[[i]][,'WW']/sqrt(simuls.ww[[i]][,'V0'])
}
```
The true effect size in the population is thus `r round(ES(param),2)`.
It is considered to be small according to the current classification, although I'd say that a treatment able to move the outcomes by 20\% of their usual variation is a pretty effective treatment, and this effect should be labelled at least medium.
Let's stick with the classification though.
In our example, the effect size does not differ much from the treatment effect since the standard deviation of outcomes in the control group is pretty close to one: it is equal to `r round(sqrt(V0(param)),2)`.
Let's now build confidence intervals for the effect size and try to comment on the magnitudes of these effects using the normalized classification.
```{r confintervalES,dependson=c('monte.carlo'),eval=TRUE,echo=TRUE,results='hide',warning=FALSE,error=FALSE,message=FALSE,fig.cap=paste('Confidence intervals of $\\hat{ES}$ for $\\delta=$',delta,'(red) and',delta.2,'(blue) over sample replications for various sample sizes',sep=' '),fig.align='center',out.width=cst,fig.pos='htbp',fig.height=12,fig.width=10}
N.plot.ES <- 40
plot.list.ES <- list()
for (k in 1:length(N.sample)){
set.seed(1234)
test.ES <- sample(simuls.ww[[k]][,'ES'],N.plot)
test.ES <- as.data.frame(cbind(test.ES,rep(samp.noise.ES(simuls.ww[[k]][,'ES'],delta=delta,param=param)),rep(samp.noise.ES(simuls.ww[[k]][,'ES'],delta=delta.2,param=param))))
colnames(test.ES) <- c('ES','sampling.noise.ES.1','sampling.noise.ES.2')
test.ES$id <- 1:N.plot.ES
plot.test.ES <- ggplot(test.ES, aes(x=as.factor(id), y=ES)) +
geom_bar(position=position_dodge(), stat="identity", colour='black') +
geom_errorbar(aes(ymin=ES-sampling.noise.ES.1/2, ymax=ES+sampling.noise.ES.1/2), width=.2,position=position_dodge(.9),color='red') +
geom_errorbar(aes(ymin=ES-sampling.noise.ES.2/2, ymax=ES+sampling.noise.ES.2/2), width=.2,position=position_dodge(.9),color='blue') +
geom_hline(aes(yintercept=ES(param)), colour="#990000", linetype="dashed")+
#ylim(-0.5,1.2)+
xlab("Sample id")+
ylab("Effect Size")+
theme_bw()+
ggtitle(paste("N=",N.sample[k]))
plot.list.ES[[k]] <- plot.test.ES
}
plot.CI.ES <- plot_grid(plot.list.ES[[1]],plot.list.ES[[2]],plot.list.ES[[3]],plot.list.ES[[4]],ncol=1,nrow=length(N.sample))
print(plot.CI.ES)
```
Figure \@ref(fig:confintervalES) presents the `r 100*delta`\% and `r 100*delta.2`\% confidence intervals for the effect size estimated in 40 samples selected from our simulations.
Let's collect our estimates and see how we could present their results.
```{r samp.noise.report.ES,dependson=c('monte.carlo'),eval=TRUE,echo=TRUE,results='hide',warning=FALSE,error=FALSE,message=FALSE}
test.all.ES <- list()
for (k in 1:length(N.sample)){
set.seed(1234)
test.ES <- sample(simuls.ww[[k]][,'ES'],N.plot)
test.ES <- as.data.frame(cbind(test.ES,rep(samp.noise.ES(simuls.ww[[k]][,'ES'],delta=delta,param=param)),rep(samp.noise.ES(simuls.ww[[k]][,'ES'],delta=delta.2,param=param))))
colnames(test.ES) <- c('ES','sampling.noise.ES.1','sampling.noise.ES.2')
test.ES$id <- 1:N.plot.ES
test.all.ES[[k]] <- test.ES
}
```
With $N=$ `r N.sample[[1]]`, the reporting of the results for sample `r test.all[[1]]$id[[1]]` with $\delta=0.99$ would be something like: "we find an effect size of `r round(select(filter(test.all.ES[[1]],id==1),ES),2)` $\pm$ `r round(select(filter(test.all.ES[[1]],id==1),sampling.noise.ES.1)/2,2)`."
With $\delta=0.95$ we would say: "we find an effect of `r round(select(filter(test.all.ES[[1]],id==1),ES),2)` $\pm$ `r round(select(filter(test.all.ES[[1]],id==1),sampling.noise.ES.2)/2,2)`."
All in all, our estimate is compatible with the treatment having a large positive effect size or a medium negative one.
Low precision prevents us from saying much else.
With $N=$ `r N.sample[[2]]`, the reporting of the results for sample `r test.all.ES[[2]]$id[[1]]` with $\delta=0.99$ would be something like: "we find an effect size of `r round(select(filter(test.all.ES[[2]],id==1),ES),2)` $\pm$ `r round(select(filter(test.all.ES[[2]],id==1),sampling.noise.ES.1)/2,2)`."
With $\delta=0.95$: "we find an effect size of `r round(select(filter(test.all.ES[[2]],id==1),ES),2)` $\pm$ `r round(select(filter(test.all.ES[[2]],id==1),sampling.noise.ES.2)/2,2)`."
Our estimate is compatible with a medium positive effect or a very small positive or even negative effect (depending on the choice of $\delta$).
With $N=$ `r N.sample[[3]]`, the reporting of the results for sample `r test.all.ES[[3]]$id[[1]]` with $\delta=0.99$ would be something like: "we find an effect size of `r round(select(filter(test.all.ES[[3]],id==1),ES),2)` $\pm$ `r round(select(filter(test.all.ES[[3]],id==1),sampling.noise.ES.1)/2,2)`."
With $\delta=0.95$: "we find an effect size of `r round(select(filter(test.all.ES[[3]],id==1),ES),2)` $\pm$ `r round(select(filter(test.all.ES[[3]],id==1),sampling.noise.ES.2)/2,2)`."
Our estimate is thus compatible with a small effect of the treatment.
We can rule out that the effect of the treatment is medium since the upper bound of the 99\% confidence interval is equal to `r round(select(filter(test.all.ES[[3]],id==1),ES)+select(filter(test.all.ES[[3]],id==1),sampling.noise.ES.1)/2,2)`.
We can also rule out that the effect of the treatment is very small since the lower bound of the 99\% confidence interval is equal to `r round(select(filter(test.all.ES[[3]],id==1),ES)-select(filter(test.all.ES[[3]],id==1),sampling.noise.ES.1)/2,2)`.
With this sample size, we have been able to reach a level of precision sufficient to pin down the order of magnitude of the effect size of our treatment.
There still remains a considerable amount of uncertainty about the true effect size, though: the upper bound of our confidence interval is almost double the lower bound.
With $N=$ `r N.sample[[4]]`, the reporting of the results for sample `r test.all.ES[[4]]$id[[1]]` with $\delta=0.99$ would be something like: "we find an effect size of `r round(select(filter(test.all.ES[[4]],id==1),ES),2)` $\pm$ `r round(select(filter(test.all.ES[[4]],id==1),sampling.noise.ES.1)/2,2)`."
With $\delta=0.95$: "we find an effect size of `r round(select(filter(test.all.ES[[4]],id==1),ES),2)` $\pm$ `r round(select(filter(test.all.ES[[4]],id==1),sampling.noise.ES.2)/2,2)`."
Here, the level of precision of our result is such that, first, it does not depend on the choice of $\delta$ in any meaningful way, and second, we can do more than pinpoint the order of magnitude of the effect size: we can start to zero in on its precise value.
From our estimate, the true value of the effect size is really close to 0.2.
It could be equal to 0.18 or 0.22, but not further away from 0.2 than that.
Remember that it is actually equal to `r round(ES(param),2)`.
```{remark}
One issue with Cohen's $d$ is that its magnitude depends on the dispersion of the outcomes in the control group.
That means that, for the same treatment and the same value of the treatment effect, the effect size is larger in a population where outcomes are more homogeneous.
It is not an attractive feature of a normalizing scale that its value depends on the particular application.
One solution would be, for each outcome, to provide a standardized scale, using for example the estimated standard deviation in a reference population.
This would be similar to the invention of the metric system, where a reference scale was agreed upon once and for all.
```
```{remark}
Cohen's $d$ is well defined for continuous outcomes.
For discrete outcomes, the use of Cohen's $d$ poses a series of problems, and alternatives such as relative risk ratios and odds ratios have been proposed.
I'll comment on that in the last chapter.
```
## Estimating sampling noise {#sec:estimsampnoise}
Gauging the extent of sampling noise is very useful in order to determine how much we should trust our results.
Are they precise, so that the true treatment effect lies very close to our estimate?
Or are they imprecise, so that the true treatment effect might lie very far from our estimate?
Estimating sampling noise is hard because we want to infer a property of our estimator over repeated samples using only one sample.
In this lecture, I am going to introduce four tools that enable you to gauge sampling noise and to choose sample size.
The four tools are Chebyshev's inequality, the Central Limit Theorem, resampling methods and Fisher's permutation method.
The idea of all these methods is to use the properties of the sample to infer the properties of our estimator over replications.
Chebyshev's inequality gives an upper bound on the sampling noise and a lower bound on sample size, but these bounds are generally too wide to be useful.
The Central Limit Theorem (CLT) approximates the distribution of $\hat{E}$ by a normal distribution, and quantifies sampling noise as a multiple of the standard deviation.
Resampling methods use the sample as a population and draw new samples from it in order to approximate sampling noise.
Fisher's permutation method, also called randomization inference, derives the distribution of $\hat{E}$ under the assumption that all treatment effects are null, by reallocating the treatment indicator among the treatment and control groups.
Both the CLT and resampling methods are approximation methods, and their approximation of the true extent of sampling noise gets better and better as sample size increases.
Fisher's permutation method is exact (it is not an approximation), but it only works for the special case of the $WW$ estimator in a randomized design.
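
To get a first feel for the permutation idea, here is a minimal sketch on simulated toy data (my own illustration, not the chapter's dataset), reshuffling the treatment indicator and recomputing the $WW$ statistic each time:

```{r permsketch,eval=FALSE,echo=TRUE}
# Minimal sketch of Fisher's permutation method on toy data:
# under the sharp null of no effect for any unit, reallocating the treatment
# indicator leaves the distribution of the WW statistic unchanged.
set.seed(1234)
y.toy <- rnorm(100)
d.toy <- sample(c(rep(1,50),rep(0,50)))
ww.stat <- function(y,d){mean(y[d==1])-mean(y[d==0])}
ww.obs <- ww.stat(y.toy,d.toy)
ww.perm <- replicate(1000,ww.stat(y.toy,sample(d.toy)))
mean(abs(ww.perm)>=abs(ww.obs))   # two-sided permutation p-value
```
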
The remainder of this section is structured as follows.
Section \@ref(sec:assumptions) introduces the assumptions that we will need in order to implement the methods.
Section \@ref(sec:cheb) presents the Chebyshev approach to gauging sampling noise and choosing sample size.
Section \@ref(sec:CLT) introduces the CLT way of approximating sampling noise and choosing sample size.
Section \@ref(sec:resamp) presents the resampling methods.
```{remark}
I am going to derive the estimators for the precision only for the $WW$ estimator.
In the following lectures, I will show how these methods adapt to other estimators.
```
### Assumptions {#sec:assumptions}
In order to be able to use the theorems that power up the methods that we are going to use to gauge sampling noise, we need to make some assumptions on the properties of the data.
The main assumptions that we need are that the estimator identifies the true effect of the treatment in the population, that the estimator is well-defined in the sample, that the observations in the sample are independently and identically distributed (i.i.d.), that there is no interaction between units, and that the variances of the outcomes in the treated and untreated groups are finite.
We know from last lecture that for the $WW$ estimator to identify $TT$, we need to assume that there is no selection bias, as stated in Assumption \@ref(def:noselb).
One way to ensure that this assumption holds is to use a RCT.
In order to be able to form the $WW$ estimator in the sample, we also need that there is at least one treated and one untreated in the sample:
```{hypothesis,fullrank,name='Full rank'}
We assume that there is at least one observation in the sample that receives the treatment and one observation that does not receive it:
\begin{align*}
\exists i,j\leq N \text{ such that } & D_i=1 \& D_j=0.
\end{align*}
```
One way to ensure that this assumption holds is to sample treated and untreated units.
In order to be able to estimate the variance of the estimator easily, we assume that the observations come from random sampling and are i.i.d.:
```{hypothesis,iid,name='i.i.d. sampling'}
We assume that the observations in the sample are identically and independently distributed:
\begin{align*}
\forall i,j\leq N\text{, }i\neq j\text{, } & (Y_i,D_i)\Ind(Y_j,D_j),\\
& (Y_i,D_i)\&(Y_j,D_j)\sim F_{Y,D}.
\end{align*}
```
We have to assume something about how the observations are related to each other and to the population.
Identical sampling is natural in the sense that we are willing to assume that the observations all stem from the same population model.
Independent sampling is something else altogether.
Independence means that the fates of any two individuals in the sample, even closely related ones, are assumed to be independent.
This rules out two empirically relevant scenarios:
1. The fates of individuals are related because of common influences, such as a shared environment,
2. The fates of individuals are related because they directly influence each other, for example on a market, but also because there are diffusion effects, such as the contagion of diseases or technological adoption by imitation.
We will address both sources of failure of the independence assumption in future lectures.
Finally, in order for all our derivations to make sense, we need to assume that the outcomes in both groups have finite variances; otherwise, sampling noise would be too extreme to estimate using the methods developed in this lecture:
```{hypothesis,finitevar,name='Finite variance of potential outcomes'}
We assume that $\var{Y^1|D_i=1}$ and $\var{Y^0|D_i=0}$ are finite.
```
### Using Chebyshev's inequality {#sec:cheb}
Chebyshev's inequality is a fundamental building block of statistics.
It relates the sampling noise of an estimator to its variance.
More precisely, it derives an upper bound on the sampling noise of an unbiased estimator:
```{theorem,cheb,name="Chebyshev's inequality"}
For any unbiased estimator $\hat{\theta}$, sampling noise level $2\epsilon$ and confidence level $\delta$, sampling noise is bounded from above:
\begin{align*}
2\epsilon \leq 2\sqrt{\frac{\var{\hat{\theta}}}{1-\delta}}.
\end{align*}
```
```{remark}
The more general version of Chebyshev's inequality that is generally presented is as follows:
\begin{align*}
\Pr(|\hat{\theta}-\esp{\hat{\theta}}|>\epsilon) & \leq \frac{\var{\hat{\theta}}}{\epsilon^2}.
\end{align*}
The version I present in Theorem \@ref(thm:cheb) is adapted to bounding sampling noise for a given confidence level, while this version is adapted to bounding the confidence level for a given level of sampling noise.
In order to go from this general version to Theorem \@ref(thm:cheb), simply remember that, for an unbiased estimator, $\esp{\hat{\theta}}=\theta$ and that, by definition of sampling noise, $\Pr(|\hat{\theta}-\theta|>\epsilon)=1-\delta$.
As a result, $1-\delta\leq\var{\hat{\theta}}/\epsilon^2$, hence the result in Theorem \@ref(thm:cheb).
```
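
As a quick numerical sanity check of the inequality (my own toy example, not part of the chapter's simulations), consider the sample mean of 100 standard normal draws, an unbiased estimator of 0 with variance $1/100$:

```{r chebineqsketch,eval=FALSE,echo=TRUE}
# Empirical check of Chebyshev's inequality on a toy unbiased estimator
set.seed(1234)
theta.hat <- replicate(10000,mean(rnorm(100)))   # 10000 replications of the sample mean
epsilon <- 0.2
mean(abs(theta.hat)>epsilon)                     # empirical Pr(|theta.hat - 0| > epsilon), around 0.05
(1/100)/epsilon^2                                # Chebyshev bound var/epsilon^2 = 0.25
```

The bound holds, but it is quite conservative, a feature we will encounter again below with the bound on the sampling noise of $\hat{WW}$.
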
Using Chebyshev's inequality, we can obtain an upper bound on the sampling noise of the $WW$ estimator:
```{theorem,uppsampnoise,name='Upper bound on the sampling noise of $\\hat{WW}$'}
Under Assumptions \@ref(def:noselb), \@ref(hyp:fullrank) and \@ref(hyp:iid), for a given confidence level $\delta$, the sampling noise of the $\hat{WW}$ estimator is bounded from above:
\begin{align*}
2\epsilon \leq 2\sqrt{\frac{1}{N(1-\delta)}\left(\frac{\var{Y_i^1|D_i=1}}{\Pr(D_i=1)}+\frac{\var{Y_i^0|D_i=0}}{1-\Pr(D_i=1)}\right)}\equiv 2\bar{\epsilon}.
\end{align*}
```
```{proof}
See Appendix \@ref(proofcheb).
```
Theorem \@ref(thm:uppsampnoise) is a useful step forward for estimating sampling noise.
It states that the actual level of sampling noise of the $\hat{WW}$ estimator ($2\epsilon$) is never bigger than a quantity that depends on the sample size, the confidence level and the variances of outcomes in the treated and control groups.
We either know all the components of the formula for $2\bar{\epsilon}$ or we can estimate them in the sample.
For example, $\Pr(D_i=1)$, $\var{Y_i^1|D_i=1}$ and $\var{Y_i^0|D_i=0}$ can be approximated by, respectively:
\begin{align*}
\hat{\Pr(D_i=1)} & = \frac{1}{N}\sum_{i=1}^ND_i\\
\hat{\var{Y_i^1|D_i=1}} & = \frac{1}{\sum_{i=1}^ND_i}\sum_{i=1}^ND_i(Y_i-\frac{1}{\sum_{i=1}^ND_i}\sum_{i=1}^ND_iY_i)^2\\
\hat{\var{Y_i^0|D_i=0}} & = \frac{1}{\sum_{i=1}^N(1-D_i)}\sum_{i=1}^N(1-D_i)(Y_i-\frac{1}{\sum_{i=1}^N(1-D_i)}\sum_{i=1}^N(1-D_i)Y_i)^2.
\end{align*}
Using these approximations for the quantities in the formula, we can compute an estimate of the upper bound on sampling noise, $\hat{2\bar{\epsilon}}$.
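
For concreteness, here is a minimal sketch of these plug-in quantities in R, assuming vectors `y` and `Ds` of outcomes and treatment status as in the simulations below (note that R's `var` uses the $N-1$ denominator, a negligible difference with the formulas above in large samples):

```{r pluginsketch,eval=FALSE,echo=TRUE}
# Plug-in estimates of the components of the Chebyshev upper bound
p.hat <- mean(Ds)                # estimate of Pr(D_i=1)
V1.hat <- var(y[Ds==1])          # estimate of var(Y^1|D=1) among the treated
V0.hat <- var(y[Ds==0])          # estimate of var(Y^0|D=0) among the untreated
```
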
```{example}
Let's write an R function that is going to compute an estimate for the upper bound of sampling noise for any sample:
```
```{r sampnoisewwcheb,eval=TRUE,echo=TRUE}
samp.noise.ww.cheb <- function(N,delta,v1,v0,p){
return(2*sqrt((v1/p+v0/(1-p))/(N*(1-delta))))
}
```
Let's estimate this upper bound in our usual sample:
```{r simulnoselb1234bis,eval=TRUE,echo=TRUE,cache=TRUE,dependson='param'}
set.seed(1234)
N <-1000
delta <- 0.99
mu <- rnorm(N,param["barmu"],sqrt(param["sigma2mu"]))
UB <- rnorm(N,0,sqrt(param["sigma2U"]))
yB <- mu + UB
YB <- exp(yB)
Ds <- rep(0,N)
V <- rnorm(N,param["barmu"],sqrt(param["sigma2mu"]+param["sigma2U"]))
Ds[V<=log(param["barY"])] <- 1
epsilon <- rnorm(N,0,sqrt(param["sigma2epsilon"]))
eta<- rnorm(N,0,sqrt(param["sigma2eta"]))
U0 <- param["rho"]*UB + epsilon
y0 <- mu + U0 + param["delta"]
alpha <- param["baralpha"]+ param["theta"]*mu + eta
y1 <- y0+alpha
Y0 <- exp(y0)
Y1 <- exp(y1)
y <- y1*Ds+y0*(1-Ds)
Y <- Y1*Ds+Y0*(1-Ds)
```
In our sample, for $\delta=$ `r delta`, $\hat{2\bar{\epsilon}}=$ `r round(samp.noise.ww.cheb(N,delta,var(y[Ds==1]),var(y[Ds==0]),mean(Ds)),2)`.
How does this compare with the true extent of sampling noise when $N=$ `r N`?
Remember that we have computed an estimate of sampling noise out of our Monte Carlo replications.
In Table \@ref(fig:precision), we can read that sampling noise is actually equal to `r round(table.noise[rownames(table.noise)==as.character(N),colnames(table.noise)=='sampling.noise'],2)`.
The Chebyshev upper bound overestimates the extent of sampling noise by `r round((samp.noise.ww.cheb(N,delta,var(y[Ds==1]),var(y[Ds==0]),mean(Ds))/table.noise[rownames(table.noise)==N.sample[[2]],colnames(table.noise)=='sampling.noise']-1)*100,0)`\%.
How does the Chebyshev upper bound fare overall?
In order to know that, let's compute the Chebyshev upper bound for all the simulated samples.
You might have noticed that, when running the Monte Carlo simulations for the population parameter, I have recovered not only $\hat{WW}$ for each sample, but also estimates of the components of the formula for the upper bound on sampling noise.
I can thus easily compute the Chebyshev upper bound on sampling noise for each replication.
```{r sampnoisewwcheball,eval=TRUE,echo=TRUE,fig.cap='Distribution of the Chebyshev upper bound on sampling noise over replications of samples of different sizes (true sampling noise in red)',fig.align='center',out.width=cst,fig.pos='htbp'}
for (k in (1:length(N.sample))){
simuls.ww[[k]]$cheb.noise <- samp.noise.ww.cheb(N.sample[[k]],delta,simuls.ww[[k]][,'V1'],simuls.ww[[k]][,'V0'],simuls.ww[[k]][,'p'])
}
par(mfrow=c(2,2))
for (i in 1:4){
hist(simuls.ww[[i]][,'cheb.noise'],main=paste('N=',as.character(N.sample[i])),xlab=expression(hat(2*bar(epsilon))),xlim=c(0.25*min(simuls.ww[[i]][,'cheb.noise']),max(simuls.ww[[i]][,'cheb.noise'])))
abline(v=table.noise[i,colnames(table.noise)=='sampling.noise'],col="red")
}
```
Figure \@ref(fig:sampnoisewwcheball) shows that the upper bound works: it is always bigger than the true sampling noise.
Figure \@ref(fig:sampnoisewwcheball) also shows that the upper bound is large: it is generally an order of magnitude bigger than the true sampling noise, and thus offers a blurry and overly pessimistic view of the precision of an estimator.
Figure \@ref(fig:sampnoisewwchebplot) shows that the average Chebyshev bound gives an inflated estimate of sampling noise.
Figure \@ref(fig:confintervalcheb) shows that the Chebyshev confidence intervals are clearly less precise than the true unknown ones.
With $N=$ `r N.sample[2]`, the true confidence intervals generally reject large negative effects, whereas the Chebyshev confidence intervals do not rule out this possibility.
With $N=$ `r N.sample[3]`, the true confidence intervals generally reject effects smaller than 0.1, whereas the Chebyshev confidence intervals cannot rule out small negative effects.
To conclude on Chebyshev estimates of sampling noise, their advantage is that they provide an upper bound on the noise: we can never underestimate noise if we use them.
Their downside is their lack of precision, which makes it hard to pinpoint the true confidence intervals.
```{r sampnoisewwchebplot,eval=TRUE,echo=TRUE,fig.cap='Average Chebyshev upper bound on sampling noise over replications of samples of different sizes (true sampling noise in red)',fig.align='center',out.width=cst,fig.pos='htbp'}
for (k in (1:length(N.sample))){
table.noise$cheb.noise[k] <- mean(simuls.ww[[k]]$cheb.noise)
}
ggplot(table.noise, aes(x=as.factor(N), y=TT)) +
geom_bar(position=position_dodge(), stat="identity", colour='black') +
geom_errorbar(aes(ymin=TT-sampling.noise/2, ymax=TT+sampling.noise/2), width=.2,position=position_dodge(.9),color='red') +
geom_errorbar(aes(ymin=TT-cheb.noise/2, ymax=TT+cheb.noise/2), width=.2,position=position_dodge(.9),color='blue') +
xlab("Sample Size")+
theme_bw()
```
```{r confintervalcheb,dependson=c('monte.carlo'),eval=TRUE,echo=TRUE,results='hide',warning=FALSE,error=FALSE,message=FALSE,fig.cap=paste('Chebyshev confidence intervals of $\\hat{WW}$ for $\\delta=$',delta,'over sample replications for various sample sizes (true confidence intervals in red)',sep=' '),fig.align='center',out.width=cst,fig.pos='htbp',fig.height=12,fig.width=10}
N.plot <- 40
plot.list <- list()
for (k in 1:length(N.sample)){
set.seed(1234)
test.cheb <- simuls.ww[[k]][sample(N.plot),c('WW','cheb.noise')]
test.cheb <- as.data.frame(cbind(test.cheb,rep(samp.noise(simuls.ww[[k]][,'WW'],delta=delta),N.plot)))
colnames(test.cheb) <- c('WW','cheb.noise','sampling.noise')
test.cheb$id <- 1:N.plot
plot.test.cheb <- ggplot(test.cheb, aes(x=as.factor(id), y=WW)) +
geom_bar(position=position_dodge(), stat="identity", colour='black') +
geom_errorbar(aes(ymin=WW-sampling.noise/2, ymax=WW+sampling.noise/2), width=.2,position=position_dodge(.9),color='red') +
geom_errorbar(aes(ymin=WW-cheb.noise/2, ymax=WW+cheb.noise/2), width=.2,position=position_dodge(.9),color='blue') +
geom_hline(aes(yintercept=delta.y.ate(param)), colour="#990000", linetype="dashed")+
xlab("Sample id")+
theme_bw()+
ggtitle(paste("N=",N.sample[k]))
plot.list[[k]] <- plot.test.cheb
}
plot.CI <- plot_grid(plot.list[[1]],plot.list[[2]],plot.list[[3]],plot.list[[4]],ncol=1,nrow=length(N.sample))
print(plot.CI)
```
### Using the Central Limit Theorem {#sec:CLT}
The main problem with Chebyshev's upper bound on sampling noise is that it is an upper bound, and thus it overestimates sampling noise and underestimates precision.
One alternative to using Chebyshev's upper bound is to use the Central Limit Theorem (CLT).
In econometrics and statistics, the CLT is used to derive approximate values for the sampling noise of estimators.