-
Notifications
You must be signed in to change notification settings - Fork 14
/
12-RequiredSampleSize.Rmd
499 lines (356 loc) · 40.7 KB
/
12-RequiredSampleSize.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
# Computing the required sample size {#RequiredSampleSize}
An important decision in designing sampling schemes is the number of units to select. In other words, what should the sample size be? If a certain budget is available for sampling, we can determine the affordable sample size from this budget. A costs model is then needed.
The alternative to deriving the affordable sample size from the budget is to start from a requirement on the quality of the survey result obtained by statistical inference. Two types of inference are distinguished, estimation and testing. The required sample size\index{Required sample size} depends on the type of sampling design. With stratified random sampling, we expect that we need fewer sampling units compared to simple random sampling to estimate the population mean with the same precision, whereas with cluster random sampling and two-stage cluster random sampling in general we need more sampling units. To compute the sample size given some quality requirement, we may start with computing the required sample size for simple random sampling and then correct this sample size to account for the design effect. Therefore, I start with presenting formulas for computing the required sample size for simple random sampling. Section \@ref(DesignEffect) describes how the required sample sizes for other types of sampling design can be derived.
Hereafter, formulas for computing the required sample size are presented for simple random sampling *with replacement* of finite populations and simple random sampling of infinite populations. For simple random sampling *without replacement* (SI) of finite populations, these sample sizes can be corrected by [@loh99]
\begin{equation}
n_{\mathrm{SI}} = \frac{n_{\mathrm{SIR}}}{1+\frac{n_{\mathrm{SIR}}}{N}} \;,
(\#eq:nreqSIR)
\end{equation}
with $n_{\mathrm{SIR}}$ the required sample size for simple random sampling with replacement.
## Standard error
A first option is to set a limit on the variance of the $\pi$ estimator of the population mean, see Equation \@ref(eq:VarMean), or on the square root of this variance, the standard error of the estimator. Given a chosen limit for the standard error $se_{\mathrm{max}}$, the required sample size for simple random sampling with replacement can be computed by
\begin{equation}
n = \left(\frac{S^*(z)}{se_{\mathrm{max}}}\right)^2 \;,
(\#eq:nreqse)
\end{equation}
with $S^*(z)$ a prior estimate of the population standard deviation\index{Prior estimate!of population standard deviation}. The required sample size $n$ should be rounded to the nearest integer greater than the right-hand side of Equation \@ref(eq:nreqse). This also applies to the following equations.
For the population proportion (areal fraction) as the parameter of interest, the required sample size can be computed by (see Equation \@ref(eq:EstVarProportionSI))
\begin{equation}
n = \left(\frac{\sqrt{p^*(1-p^*)}}{se_{\mathrm{max}}}\right)^2+1 \;,
(\#eq:nreqseproportion)
\end{equation}
with $p^*$ a prior estimate of the population proportion\index{Prior estimate!of population proportion}.
```{block2, type= 'rmdnote'}
To determine the required sample size for estimating the population proportion, we need a prior estimate of the population parameter of interest itself, whereas for the population mean a prior estimate is needed of the population standard deviation. The parameter of which a prior estimate is needed for sample size determination is referred to as the design parameter\index{Design parameter}.
```
Alternatively, we may require that the *relative* standard error, i.e., the standard error of the estimator divided by the population mean, may not exceed a given limit $rse_{\mathrm{max}}$. In this case the required sample size can be computed by
\begin{equation}
n = \left(\frac{cv^*}{rse_{\mathrm{max}}}\right)^2 \;,
(\#eq:nreqrse)
\end{equation}
with $cv^*$ a prior estimate of the population coefficient of variation\index{Prior estimate!of population coefficient of variation} $S(z)/\bar{z}$. For a constraint on the relative standard error of the population proportion estimator, we obtain
\begin{equation}
n = \left(\frac{p^*(1-p^*)}{rse_{\mathrm{max}}\;p^*}\right)^2+1=\left(\frac{1-p^*}{rse_{\mathrm{max}}}\right)^2+1 \;.
(\#eq:nreqrse2)
\end{equation}
## Length of confidence interval {#ReqSampleSizeLengthCI}
Another option is to require that the length of the confidence interval of the population mean may not exceed a given limit $l_{\mathrm{max}}$:
\begin{equation}
2 \; t_{\alpha/2,n-1}\frac{S(z)}{\sqrt{n}} \leq l_{\mathrm{max}} \;,
(\#eq:widthCI)
\end{equation}
with $t_{\alpha/2,n-1}$ the $(1-\alpha/2)$ quantile of the *t* distribution with $n-1$ degrees of freedom, $S(z)$ the population standard deviation of the study variable, and $n$ the sample size. The problem is that we do not know the degrees of freedom (we want to determine the sample size $n$). Therefore, $t_{\alpha/2,n-1}$ is replaced by $u_{\alpha/2}$, the $(1-\alpha/2)$ quantile of the standard normal distribution. Rearranging yields
\begin{equation}
n = \left(u_{\alpha/2}\frac{S^*(z)}{l_{\mathrm{max}}/2}\right)^2 \;.
(\#eq:nreqwidthCI)
\end{equation}
The requirement can also be formulated as
\begin{equation}
P(|\hat{\bar{z}}-\bar{z}| \leq d_{\mathrm{max}}) \leq 1-\alpha \;,
(\#eq:nreqwidthCIalt)
\end{equation}
with $d_{\mathrm{max}}$ the margin of error\index{Margin of error}: $d_{\mathrm{max}}=l_{\mathrm{max}}/2$.
An alternative is to require that with a large probability $1-\alpha$ the absolute value of the *relative* error of the estimated mean may not exceed a given limit $r_{\mathrm{max}}$. In formula:
\begin{equation}
P\left(\frac{\lvert\hat{\bar{z}}-\bar{z}\rvert}{\bar{z}} \leq r_{\mathrm{max}}\right) \leq 1-\alpha \;.
(\#eq:nreqPrRelError)
\end{equation}
Noting that the absolute error equals $r_{\mathrm{max}} \bar{z}$ and inserting this in Equation \@ref(eq:nreqwidthCI) gives
\begin{equation}
n = \left(u_{\alpha/2}\frac{cv^*}{r_{\mathrm{max}}}\right)^2 \;.
(\#eq:nreqPrRelError2)
\end{equation}
As an example, the required sample size is computed for estimating the population mean of the soil organic matter concentration in Voorst. The requirement is that with a probability of 95\% the absolute value of the *relative* error does not exceed 10\%. A prior estimate of 0.5 for the population coefficient of variation is used.
```{r}
cv <- 0.5
rmax <- 0.1
u <- qnorm(p = 1 - 0.05 / 2, mean = 0, sd = 1)
n <- ceiling((u * cv / rmax)^2)
```
The required sample size is 97. The same result is obtained with function `nContMoe` of package **PracTools** (@PracTools, @Vaillant2018).
```{r}
library(PracTools)
print(ceiling(nContMoe(moe.sw = 2, e = rmax, alpha = 0.05, CVpop = cv)))
```
### Length of confidence interval for a proportion
Each of the methods for computing a confidence interval of a proportion described in Subsection \@ref(ConfidenceIntervalProportion) can be used to compute the required sample size given a limit for the length of the confidence interval of a proportion. The most simple option is to base the required sample size on the Wald interval\index{Wald interval} (Equation \@ref(eq:Waldinterval)), so that the required sample size can be computed by
\begin{equation}
n = \left(u_{\alpha/2}\frac{\sqrt{p^*(1-p^*)}}{l_{\mathrm{max}}/2}\right)^2 +1\;.
(\#eq:nreqwidthCIprop)
\end{equation}
The Wald interval approximates the discrete binomial distribution by a normal distribution. See the rule of thumb in Subsection \@ref(ConfidenceIntervalProportion) for when this approximation is reasonable.
Package **binomSamSize** [@Hohle2017] has quite a few functions for computing the required sample size. Function `ciss.wald` uses the normal approximation. In the next code chunk, the required sample sizes are computed for a prior estimate of the population proportion $p^*$ of 0.2.
```{block2, type='rmdcaution'}
Argument `d` in the functions below is *half* the length of the confidence interval.
```
```{r}
library(binomSamSize)
p_prior <- 0.2
n_prop_wald <- ciss.wald(p0 = p_prior, d = 0.1, alpha = 0.05)
n_prop_agrcll <- ciss.agresticoull(p0 = p_prior, d = 0.1, alpha = 0.05)
n_prop_wilson <- ciss.wilson(p0 = p_prior, d = 0.1, alpha = 0.05)
```
The required sample sizes are `r n_prop_wald`, `r n_prop_agrcll`, and `r n_prop_wilson`, for the Wald, Agresti-Coull, and Wilson approximation of the binomial proportion confidence interval, respectively. The required sample size with function `ciss.wald` is one unit smaller than as computed with Equation \@ref(eq:nreqwidthCIprop), as shown in the code chunk below.
```{r}
ceiling((qnorm(0.975) * sqrt(p_prior * (1 - p_prior)) / 0.1)^2 + 1)
```
## Statistical testing of hypothesis
The required sample size for testing a population mean with a two-sided alternative hypothesis can be computed by [@Ott2015]
\begin{equation}
n = \frac{S^2(z)}{\Delta^2}\;(u_{\alpha/2}+u_{\beta})^2 \;,
(\#eq:reqsamplesizetestingmean)
\end{equation}
with $\Delta$ the smallest relevant difference\index{Smallest relevant difference} of the population mean from the test value\index{Test value}, $\alpha$ the tolerable probability of a type I error\index{Probability of a type I error}, i.e., the probability of rejecting the null hypothesis when the population mean is equal to the test value, $\beta$ the tolerable probability of a type II error\index{Probability of a type II error}, i.e., the probability of not rejecting the null hypothesis when the population mean is not equal to the test value, $u_{\alpha/2}$ as before, and $u_{\beta}$ the $(1-\beta)$ quantile of the standard normal distribution. The quantity $1-\beta$ is the power of a test\index{Power of a test}: the probability of correctly rejecting the null hypothesis. For a one-sided test, $u_{\alpha/2}$ must be replaced by $u_{\alpha}$.
In the next code chunk, the sample size required for a given target power is computed with the standard normal distribution (Equation \@ref(eq:reqsamplesizetestingmean)), as well as with the *t* distribution using function `pwr.t.test` of package **pwr** [@pwr]^[The same result is obtained with function `power.t.test` of the **stats** package.]. This requires some iterative algorithm, as the degrees of freedom of the *t* distribution are a function of the sample size. The required sample size is computed for a one-sample test and a one-sided alternative hypothesis.
```{r}
library(pwr)
sd <- 4; delta <- 1; alpha <- 0.05; beta <- 0.2
n_norm <- (sd / delta)^2 * (qnorm(1 - alpha) + qnorm(1 - beta))^2
n_t <- pwr.t.test(
d = delta / sd, sig.level = alpha, power = (1 - beta),
type = "one.sample", alternative = "greater")
```
In this example, the required sample size computed with the *t* distribution is two units larger than that obtained with the standard normal distribution: `r ceiling(n_t$n)` vs. `r (ceiling(n_norm))`. Package **pwr** has various functions for computing the power of a test given the sample size, or reversely, the sample size for a given power, such as for the two independent samples *t* test, binomial test (for one proportion), test for two proportions, etc.
### Sample size for testing a proportion
For testing a proportion, a graph is computed with the power of a binomial test\index{Binomial test} against the sample size. This is illustrated with a one-sided alternative $H_a: p > 0.20$ and a smallest relevant difference of 0.10.
```{r}
p_test <- 0.20; alpha <- 0.10; delta <- 0.10
n <- 1:150
power <- k_min <- numeric(length = length(n))
for (i in seq_len(length(n))) {
k_min[i] <- qbinom(p = 1 - alpha, size = n[i], prob = p_test)
power[i] <- pbinom(
q = k_min[i], size = n[i], prob = p_test + delta, lower.tail = FALSE)
}
```
As can be seen in the **R** code, as a first step for each total sample size the smallest number of successes $k_{\mathrm{min}}$ is computed at which the null hypothesis is rejected. Then the binomial probability is computed of $k_{\mathrm{min}}+1$ or more successes for a probability of success equal to $p_{\mathrm{test}}+\Delta$. Note that there is no need to add 1 to `k_min` as with argument `lower.tail = FALSE` the value specified by argument `q` is not included.
```{r powerbinomialtest, echo = FALSE, fig.width = 5, fig.asp = 0.7, fig.cap = "Power of right-tail binomial test (test proportion: 0.2; significance level: 0.10)."}
beta <- 0.20
df <- data.frame(n = n, dmin = k_min, power = power)
ggplot(df) +
geom_point(mapping = aes(x = n, y = power)) +
geom_line(mapping = aes(x = n, y = power)) +
geom_hline(mapping = aes(yintercept = 1 - beta), lty = 2) +
scale_x_continuous(name = "Sample size") +
scale_y_continuous(name = "Power")
```
Figure \@ref(fig:powerbinomialtest) shows that the power does not increase monotonically with the sample size. The graph shows a saw-toothed behaviour. This is caused by the stepwise increase of the critical number of successes (`k_min`) with the total sample size.
The required sample size can be computed in two ways. The first option is to compute the smallest sample size for which the power is larger than or equal to the required power $1-\beta$. The alternative is to compute the smallest sample size for which the power is larger than or equal to $1-\beta$ *for all sample sizes larger than this*.
```{r}
n1 <- min(n[which(power > 1 - beta)])
ind <- (power > 1 - beta)
for (i in seq_len(length(n))) {
if (ind[length(n) - i] == FALSE)
break
}
n2 <- n[length(n) - i + 1]
```
The smallest sample size at which the desired level of 0.8 is reached is `r n1`. However, as can be seen in Figure \@ref(fig:powerbinomialtest), for sample sizes 89, 90, 93, 94, and 97, the power drops below the desired level of 0.80. The smallest sample size at which the power stays above the level of 0.8 is `r n2`.
Alternatively, we may use function `pwr.p.test` of package **pwr**. This is an approximation, using an arcsine transformation of proportions. The first step is to compute Cohen's $h$\index{Cohen's $h$}, which is a measure of the distance between two proportions: $h = 2\;arcsin(\sqrt{p_1})-2\;arcsin(\sqrt{p_2})$. This can be done with function `ES.h`. The value of $h$ must be positive, which is achieved when the proportion specified by argument `p1` is larger than the proportion specified by argument `p2`.
```{r}
h <- ES.h(p1 = 0.30, p2 = 0.20)
n_approx <- pwr.p.test(
h, power = (1 - beta), sig.level = alpha, alternative = "greater")
```
The approximated sample size equals `r ceiling(n_approx$n)`, which is somewhat smaller than the required sample sizes computed above.
#### Exercises {-}
1. Write an **R** script to compute the required sample size, given a requirement in terms of the half-length of the confidence interval of a population proportion. Use a normal approximation for computing the confidence interval. Use a range of values for half the length of the interval: $d = (0.01, 0.02, \dots , 0.49)$. Use a prior (anticipated) proportion of 0.1 and a significance level\index{Significance level} of 0.95. Plot the required sample size against $d$. Explain what you see. Why is it needed to provide a prior proportion?
2. Do the same for a single value for the half-length of the confidence interval of 0.2 and a range of values for the prior proportion $p^* = (0.01,0.02, \dots ,0.49)$. Explain what you see. Why is it not needed to compute the required sample size for prior proportions $>0.5$?
## Accounting for design effect {#DesignEffect}
The required sample sizes computed in the previous sections are all for simple random sampling in combination with the $\pi$ estimator of the population mean. But what is the required sample size for other types of sampling design, such as stratified (simple) random sampling, systematic random sampling, two-stage cluster random sampling, and cluster random sampling? Broadly speaking, we expect that with stratified random sampling and systematic random sampling the sampling variance of the estimator of the mean will be smaller than with simple random sampling of the same number of units, whereas with two-stage cluster random sampling and cluster random sampling we expect larger sampling variances. Therefore, reversely, for the first two types of sampling design, we expect that the sample size required to achieve the same level of accuracy or confidence will be smaller than with simple random sampling, and for the latter two design types this sample size will be larger. The design effect\index{Design effect} is commonly quantified by the ratio of two sampling variances of the population mean estimator [@loh99]:
\begin{equation}
de(p,\hat{\bar{z}}) = \frac{V_p(\hat{\bar{z}})}{V_{\mathrm{SI}}(\hat{\bar{z}}_{\pi})}= \frac{V_p(\hat{\bar{z}})}{S^2(z)/n}\;,
(\#eq:designeffect)
\end{equation}
with $V_p(\hat{\bar{z}})$ the sampling variance of an estimator ($\pi$ estimator, regression estimator) of the population mean with sampling design $p$ and $V_{\mathrm{SI}}(\hat{\bar{z}}_{\pi})$ the sampling variance of the $\pi$ estimator of the population mean with simple random sampling. Given an estimate of this design effect, the required sample size for a more complex sampling strategy (combination of sampling design and estimator), given a constraint on the standard error or the half-length of a confidence interval, can be computed by
\begin{equation}
n(p,\hat{\bar{z}}) = \sqrt{de(p,\hat{\bar{z}})} \; n(\mathrm{SI},\pi) \;.
(\#eq:nreqdesigneffect)
\end{equation}
```{block2, type='rmdnote'}
The design effect can also be quantified by the ratio of two standard errors. Then there is no need to take the square root of the design effect, as done in Equation \@ref(eq:nreqdesigneffect), to compute the required sample size for a more complex design, given a constraint on the standard error or the half-length of a confidence interval.
```
## Bayesian sample size determination {#BayesianSampleSize}
A serious drawback of the classical frequentist approach of computing the required sample size explained in the previous sections is that the required sample sizes are sensitive to the design parameters\index{Design parameter} $S^*$, $p^*$, and $cv^*$. We are rather uncertain about these parameters, and therefore it is attractive to replace a single value for these parameters by a probability distribution. This leads to a different statistical approach of computing the required sample size, the Bayesian approach\index{Bayesian approach!to sample size determination}. This Bayesian approach also offers the possibility of accommodating existing information about the population mean or proportion. In this section I show how this approach can be used to compute the required sample size for estimating a population mean or proportion.
But before going into details, let me explain the basics of the Bayesian approach of statistical inference. In the previous sections, the statistical inference was from the frequentist perspective. How does the frequency distribution of the estimator of the population mean look like if we repeat the selection of a sample with a given sampling design? Is the mean of this frequency distribution, referred to as the sampling distribution, equal to the population mean, and what is the variance of this sampling distribution?
The Bayesian approach is fundamentally different. The frequency distribution of the frequentist approach is replaced by a probability distribution of the population mean reflecting our *belief* about the population mean. Note that expressing our belief in terms of a probability distribution implies that in the Bayesian approach, contrary to the frequentist approach, the population mean is a random variable. Where in the frequentist approach, it is incorrect to say that the probability that the population mean is inside the 95\% confidence interval equals 95% (see Section \@ref(ConfidenceInterval)), this is perfectly fine in the Bayesian approach. The term confidence interval is replaced by the term *credible interval*\index{Credible interval} to underline the fundamental different meaning of the interval.
The first step in the Bayesian approach of statistical inference is to postulate a *prior distribution*\index{Prior distribution} for the population parameter of interest. This prior distribution expresses our belief and uncertainty about the parameter before the sample data are taken into account.
The next step is to formalise a theory about the data. This boils down to making an assumption about the type of distribution function of the data. Can we safely assume that the data follow a normal or a binomial distribution? Once the type of distribution has been specified, we can write an equation for the probability of the data *as a function of the parameter*. This function is referred to as the *likelihood function*\index{Likelihood function}.
The final step is to revise our prior belief about the population parameter of interest, using the data and our theory about the data as expressed in the likelihood function. This results in the *posterior distribution*\index{Posterior distribution} of the parameter. Our revised or updated belief is computed with Bayes' rule\index{Bayes' rule}:
\begin{equation}
f(\theta|\mathbf{z}) = \frac{f(\theta) f(\mathbf{z}|\theta)} {f(\mathbf{z})}\;,
(\#eq:BayesTheorem)
\end{equation}
with $f(\theta|\mathbf{z})$ the posterior distribution, i.e., the probability density^[I assume here that the parameter of interest $\theta$ is a continuous random variable.] of the parameter given the sample data, $f(\theta)$ our prior belief in the parameters specified by a probability distribution (prior distribution), $f(\mathbf{z}|\theta)$ the likelihood of the data, and $f(\mathbf{z})$ the probability distribution of the data.
### Bayesian criteria for sample size computation
Equation \@ref(eq:BayesTheorem) shows that the posterior distribution of the population parameter of interest depends on the probability distribution of the new data $f(\mathbf{z})$. The problem is that these new data are not yet known. We are designing a sample, and the data yet are to be collected, so at first glance this might seem an unsolvable problem. However, what we could do is to simulate with the prior probability density function a large number of values of the population parameter(s), and with each parameter a large number of possible vectors with $n$ data. Each data vector is then used to update the prior to the posterior, using Bayes' rule (Equation \@ref(eq:BayesTheorem)). For each posterior either the length of the highest posterior density (HPD) interval \index{Highest posterior density interval} with a coverage probability\index{Coverage probability} of $1-\alpha$ is computed, or reversely, the coverage probability of the HPD interval of length $l_{\mathrm{max}}$. Finally, the average of the lengths of the HPD intervals or the average of the coverage probabilities is computed, and these averages are compared with our precision requirement. If the average length is larger than $l_{\mathrm{max}}$, or the coverage probability of intervals of length $l_{\mathrm{max}}$ is smaller than $1-\alpha$, then we must increase $n$; if the average length is smaller than $l_{\mathrm{max}}$, or the coverage probability of intervals of length $l_{\mathrm{max}}$ is larger than $1-\alpha$, then we must decrease $n$. This whole procedure is repeated until our precision requirement is met. Simulation is one option to compute the sample size, (partly) analytical approaches are also available.
More formally, the procedure is as follows. The prior probability density function on the population parameter(s) $\theta$ is used to compute for a given sample size $n$ the *predictive* distribution of the data:
\begin{equation}
f(\mathbf{z}|n) = \int_{\Theta} f(\mathbf{z}|\theta,n)f(\theta)\mathrm{d}\theta\;,
(\#eq:predictivedistribution)
\end{equation}
with $\Theta$ the parameter space for $\theta$ containing all possible values of $\theta$. This predictive distribution\index{Predictive distribution} is also named the *preposterior* distribution\index{Preposterior distribution}, stressing that the data are not yet accounted for in the distribution.
Even if $\theta$ would be fixed, we do not have only one vector $\mathbf{z}$ with $n$ data values but a probability distribution, from which we can simulate possible data vectors, referred to as the data space $\mathcal{Z}$. In case of a binomial probability and sample size $n$, the data space $\mathcal{Z}$ (in the form of the number of observed successes given sample size $n$) can be written as the set $\{0,1,\dots,n\}$, i.e., one vector of length $n$ with all "failures", $n$ vectors of length $n$ with one success, ${n \choose 2}$ vectors with two successes, etc. Each data vector is associated with a probability density (for continuous data) or probability mass (for discrete data). As a consequence, we do not have only one posterior distribution function $f(\theta|\mathbf{z})$, but as many as we have data vectors in the data space. For each posterior distribution function the coverage of the HPD interval of a given length can be computed, or reversely, the length of the HPD interval for a given coverage. This leads to various criteria for computing the required sample size, among which are the average length criterion (ALC)\index{Average length criterion}, the average coverage criterion (ACC)\index{Average coverage criterion}, and the worst outcome criterion (WOC)\index{Worst outcome criterion} (@Joseph1995, @Joseph1997).
#### Average length criterion
For a fixed posterior HPD interval coverage of $100(1-\alpha)$\%, the smallest sample size $n$ is determined such that
\begin{equation}
\int_{\mathcal{Z}} l(\mathbf{z},n) f(\mathbf{z}|n)\mathrm{d}\mathbf{z} \leq l_{\mathrm{max}}\;,
(\#eq:ALC)
\end{equation}
where $f(\mathbf{z}|n)$ is the predictive distribution of the data (Equation \@ref(eq:predictivedistribution)) and $l(\mathbf{z},n)$ is the length of the $100(1-\alpha)$\% HPD interval for data $\mathbf{z}$ and sample size $n$, obtained by solving
\begin{equation}
\int_v^{v+l(\mathbf{z},n)}f(\theta|\mathbf{z},n)\mathrm{d}\theta= 1-\alpha\;,
(\#eq:solveALC)
\end{equation}
for $l(\mathbf{z},n)$, for each possible data set $\mathbf{z} \in \mathcal{Z}$. $f(\theta|\mathbf{z},n)$ is the posterior density of the population parameter of interest given the data $\mathbf{z}$ and sample size $n$. ALC ensures that the average length of $100(1-\alpha)$\% posterior HPD intervals, weighted by $f(\mathbf{z}|n)$, is at most $l_{\mathrm{max}}$.
#### Average coverage criterion
For a fixed posterior HPD interval of length $l_\mathrm{max}$, the smallest sample size $n$ is determined such that
\begin{equation}
\int_{\mathcal{Z}} \left\{ \int_v^{v+l_\mathrm{max}}f(\theta|\mathbf{z},n)\mathrm{d}\theta \right\} f(\mathbf{z}|n)\mathrm{d}\mathbf{z} \geq 1-\alpha \;.
(\#eq:ACC)
\end{equation}
ACC ensures that the average coverage of HPD intervals of length $l_\mathrm{max}$ is at least $1-\alpha$. The integral inside the curly brackets is the integral of the posterior density of the population parameter of interest over the HPD interval $(v,v+l_\mathrm{max})$, given a data vector $\mathbf{z}$ of size $n$. The mean of this integrated posterior density of the parameter of interest $\theta$ is obtained by multiplying the integrated density with the predictive probability of the data and integrating over all possible data sets in $\mathcal{Z}$.
#### Worst outcome criterion
Neither ALC nor ACC guarantee that for a particular data set $\mathbf{z}$ the criterion is met, as both are defined as averages over all possible data sets in $\mathcal{Z}$. A more conservative sample size can be computed by requiring that for all data sets $\mathcal{Z}$ both criteria are met. @Joseph1997 modified this criterion by restricting the data sets to a subset $\mathcal{W}$ of most likely data sets. The criterion thus obtained is referred to as the modified worst outcome criterion, or for short, the worst outcome criterion (WOC). So, the criterion is
\begin{equation}
\mathrm{inf}_{\mathbf{z} \in \mathcal{W}} \left\{\int_v^{v+l(\mathbf{z},n)}f(\theta|\mathbf{z},n)\mathrm{d}\theta\right\} \geq 1- \alpha \;.
(\#eq:worst)
\end{equation}
The smallest sample size satisfying this condition is used as the sample size. For instance, if the 95\% most likely data sets are chosen as subspace $\mathcal{W}$, WOC guarantees that there is 95\% assurance that the length of the $100(1-\alpha)$\% posterior HPD intervals will be at most $l_{\mathrm{max}}$. The fraction of most likely data sets in subspace $\mathcal{W}$ is referred to as the worst level.
### Mixed Bayesian-likelihood approach
Besides the fully Bayesian approach, @Joseph1997 describe a mixed Bayesian-likelihood approach\index{Mixed Bayesian-likelihood approach} for determining the sample size. In the mixed Bayesian-likelihood approach of sample size determination, the prior distribution of the parameter or parameters is only used to derive the predictive distribution of the data (Equation \@ref(eq:predictivedistribution)), not the posterior distributions of the parameter of interest for each data vector. For analysis of the posterior distribution, an uninformative prior\index{Uninformative prior} is therefore used. This mixed approach is of interest when, after the data have been collected, we prefer to estimate the population mean from these data only, using the frequentist approach described in previous sections.
An example of a situation where the mixed Bayesian-likelihood approach can be attractive is the following. Suppose some data of the study variable from the population of interest are already available, but we would like to collect more data so that we will be more confident about the (current) population mean once these new data are collected. The legacy data are used to construct a prior distribution. We have doubts about the quality of the legacy data because they were collected a long time ago and the study variable might have changed in the meantime. In that case, the mixed Bayesian-likelihood approach can be a good option -- we are willing to use the legacy data to plan the sampling, but not to make statements about the current population.
No closed formula for computing the required sample size exists for this approach because the posterior density function $f(\theta|z,n)$ is not a well-defined distribution as before. However, the required sample size still can be approximated by simulation.
### Estimation of population mean
The three criteria (ALC, ACC, and WOC) described above are now used to compute the required sample size for estimating the population mean, assuming that the data come from a normal distribution. As we are uncertain about the population standard deviation $\sigma$ ($S^*(z)$ in Equation \@ref(eq:nreqwidthCI) is only a prior point estimate of $\sigma$), a prior distribution is assigned to this parameter. It is convenient to assign a gamma distribution as a prior distribution to the *reciprocal* of the population variance, referred to as the precision parameter\index{Precision parameter} $\lambda = 1/\sigma^2$. More precisely, a prior *bivariate* normal-gamma distribution\index{Bivariate normal-gamma distribution} is assigned to the population mean and the precision parameter^[This is equal to a normal-inverse gamma distribution to the population mean and population variance.]. With this prior distribution, the *posterior* distribution of the population mean is fully defined, i.e., both the type of distribution and its parameters are known. The prior distribution is so-called *conjugate* with the normal distribution.
The gamma distribution\index{Gamma distribution} has two parameters: $a$ and $b$. Figure \@ref(fig:gammaprior) shows the gamma distribution for $a=5$ and $b=100$.
```{r gammaprior, fig.asp = 0.7, fig.cap = "Prior gamma distribution for the precision parameter for a shape parameter $a=5$ and a scale parameter $1/b=1/100$."}
a <- 5; b <- 100
x <- seq(from = 0, to = 0.2, length = 1000)
dg <- dgamma(x = x, shape = a, scale = 1 / b)
plot(x = x, y = dg, type = "l", ylab = "Density", xlab = "Precision")
```
The mean of the precision parameter $\lambda$ is given by $a/b$ and its standard deviation by $\sqrt{a/b^2}$.
The normal-gamma prior is used to compute the predictive distribution for the data. For ACC the required sample size can then be computed with [@Adcock1988]
\begin{equation}
n = \frac{4b}{a\; l_{\mathrm{max}}^2}t^2_{2a;\alpha/2} - n_0 \;,
(\#eq:nreqACC)
\end{equation}
with $t^2_{2a;\alpha/2}$ the squared $(1-\alpha/2)$ quantile of the (usual, i.e., neither shifted nor scaled) *t* distribution with $2a$ degrees of freedom and $n_0$ the number of prior points. The prior sample size $n_0$ is only relevant if we have prior information about the population mean and an informative prior is used for this population mean. If we have no information about the population mean, a non-informative prior is used and $n_0$ equals 0. Note that as $a/b$ is the prior mean of the inverse of the population variance, Equation \@ref(eq:nreqACC) is similar to Equation \@ref(eq:nreqwidthCI). The only difference is that a quantile from the standard normal distribution is replaced by a quantile from a *t* distribution with $2a$ degrees of freedom.
No closed-form formula exists for computing the smallest $n$ satisfying ALC, but the solution can be found by a bisectional search algorithm [@Joseph1997].
Package **SampleSizeMeans** [@Joseph2012] is used to compute Bayesian required sample sizes, using both criteria, ACC and ALC, for the fully Bayesian and the mixed Bayesian-likelihood approach. The gamma distribution plotted in Figure \@ref(fig:gammaprior) is used as a prior distribution for the precision parameter $\lambda$. As a reference, also the frequentist required sample size is computed.
```{r}
library(SampleSizeMeans)
lmax <- 2
n_freq <- mu.freq(len = lmax, lambda = a / b, level = 0.95)
n_alc <- mu.alc(len = lmax, alpha = a, beta = b, n0 = 0, level = 0.95)
n_alcmbl <- mu.mblalc(len = lmax, alpha = a, beta = b, level = 0.95)
n_acc <- mu.acc(len = lmax, alpha = a, beta = b, n0 = 0, level = 0.95)
n_accmbl <- mu.mblacc(len = lmax, alpha = a, beta = b, level = 0.95)
n_woc <- mu.modwoc(
len = lmax, alpha = a, beta = b, n0 = 0, level = 0.95, worst.level = 0.95)
n_wocmbl <- mu.mblmodwoc(
len = lmax, alpha = a, beta = b, level = 0.95, worst.level = 0.95)
```
```{r requiredsamplesizesnormalmean, tidy = FALSE, echo = FALSE}
df <- data.frame(n_freq, n_alc, n_alcmbl, n_acc, n_accmbl, n_woc, n_wocmbl)
knitr::kable(
df, caption = "Required sample sizes for estimating a normal mean, computed with three criteria for the fully Bayesian and the mixed Bayesian-likelihood (MBL) approach.",
col.names = c("Freq", "ALC", "ALC-MBL", "ACC", "ACC-MBL", "WOC", "WOC-MBL"),
booktabs = TRUE,
linesep = ""
) %>%
kable_classic() %>%
add_footnote("Freq: required sample size computed with the frequentist approach; ALC: average length criterion; ACC: average coverage criterion; WOC: worst outcome criterion.", notation = "none")
```
Table \@ref(tab:requiredsamplesizesnormalmean) shows that all six required sample sizes are larger than the frequentist required sample size. This makes sense, as the frequentist approach does not account for uncertainty in the population variance parameter. The mixed approach leads to slightly larger required sample sizes than the fully Bayesian approach. This is because in the mixed approach the prior distribution of the precision parameter is not used. Apparently, we do not lose much information by ignoring this prior. With WOC the required sample sizes are about twice the sample sizes obtained with the other two criteria, but this depends of course on the size of the subspace $\mathcal{W}$. If, for instance the 80\% most likely data sets are chosen as subspace $\mathcal{W}$, the required sample sizes are much smaller.
```{r}
n_woc80 <- mu.modwoc(
len = lmax, alpha = a, beta = b, n0 = 0, level = 0.95, worst.level = 0.80)
n_wocmbl80 <- mu.mblmodwoc(
len = lmax, alpha = a, beta = b, level = 0.95, worst.level = 0.80)
```
The required sample sizes with this criterion are `r n_woc80` and `r n_wocmbl80` using the fully Bayesian and the mixed Bayesian-likelihood approach, respectively.
### Estimation of a population proportion
The same criteria can be used to estimate the proportion of a population or, in case of an infinite population, the areal fraction, satisfying some condition [@Joseph1995]. With simple random sampling, this boils down to estimating the probability-of-success parameter $p$ of a binomial distribution. In this case the space of possible outcomes $\mathcal{Z}$ is the number of successes, which is discrete: $\mathcal{Z} = \{0,1,\dots ,n\}$ with $n$ the sample size.
The conjugate prior\index{Conjugate prior} for the binomial likelihood is the beta distribution\index{Beta distribution}:
\begin{equation}
p \sim \frac{1}{B(c,d)} \pi^{c-1} (1-\pi)^{d-1} \;,
(\#eq:priorBetadistribution)
\end{equation}
where $B(c,d)$ is the beta function. The two parameters $c$ and $d$ correspond to the number of 'successes' and 'failures' in the problem context. The larger these numbers, the more the prior information, and the more sharply defined the probability distribution. The plot below shows this distribution for $c=0.6$ and $d=2.4$.
```{r betaprior, fig.asp = 0.7, fig.cap = "Prior beta distribution for the binomial proportion for a beta function $B(0.6,2.4)$."}
c <- 0.6; d <- 2.4
x <- seq(from = 0, to = 1, length = 1000)
dbt <- dbeta(x = x, shape1 = c, shape2 = d)
plot(x = x, y = dbt, type = "l", ylab = "Density", xlab = "Proportion")
```
The mean of the binomial proportion equals $c/(c+d)$ and its standard deviation $\sqrt{cd/\{(c+d+1)(c+d)^2\}}$.
The preposterior marginal distribution of the data is the beta-binomial distribution:
\begin{equation}
f(z|n) = \binom{n}{z}\frac{B(z+c,n-z+d)}{B(c,d)}\;,
(\#eq:preposteriorbinomialdata)
\end{equation}
and for a given number of successes $z$ out of $n$ trials the posterior distribution of $p$ equals
\begin{equation}
f(p|z,n,c,d)=\frac{1}{B(z+c,n-z+d)} p^{z+c-1} (1-p)^{n-z+d-1} \;.
(\#eq:posteriorbinomialdata)
\end{equation}
For the binomial parameter, criterion ALC (Equation \@ref(eq:ALC)) can be written as
\begin{equation}
\sum_{z=0}^n l(z,n)f(z,n) \leq l_{\mathrm{max}} \;.
(\#eq:ALCbinomial)
\end{equation}
To compute the smallest $n$ satisfying this condition, for each value of $z$ and each $n$, $l(z,n)$ must be computed so that
\begin{equation}
\int_v^{v+l(z,n)}f(p|z,n,c,d) \mathrm{d}p = 1-\alpha \;,
(\#eq:xxx)
\end{equation}
with $v$ the lower bound of the HPD credible set given the sample size and the observed number of successes $z$.
For the binomial parameter, criterion ACC (Equation \@ref(eq:ACC)) can be written as
\begin{equation}
\sum_{z=0}^n \mathrm{Pr}\{p \in (v,v+l_{\mathrm{max}})\} f(z,n) \geq 1-\alpha \;,
(\#eq:ACCbinomial)
\end{equation}
with
\begin{equation}
\mathrm{Pr}\{p \in (v,v+l_{\mathrm{max}})\} \propto \int_{v}^{v+l_{\mathrm{max}}}p^z(1-p)^{n-z} f(p) \mathrm{d}p\;,
(\#eq:Binomialprob)
\end{equation}
with $f(p)$ the prior density of the binomial parameter.
For more details about ACC and ALC, and about how the required sample size can be computed with WOC in case of the binomial parameter $p$, refer to @Joseph1995.
The required sample sizes for ALC, ACC, and WOC described in the previous subsection, using the fully Bayesian approach or the mixed Bayesian-likelihood approach, can be computed with package [**SampleSizeBinomial**](http://www.medicine.mcgill.ca/epidemiology/Joseph/software/Bayesian-Sample-Size.html)[@SampleSizeBinomial]. This package is used to compute the required sample sizes using the beta distribution shown in Figure \@ref(fig:betaprior) as a prior for the population proportion. Note that argument `len` of the various functions of package **SampleSizeBinomial** specifies the total length of the confidence interval, not *half* the length as passed to function `ciss.wald` using argument `d`.
```{r}
library(SampleSizeBinomial)
n_alc <- prop.alc(
len = 0.2, alpha = c, beta = d, level = 0.95, exact = TRUE)$n
n_alcmbl <- prop.mblalc(
len = 0.2, alpha = c, beta = d, level = 0.95, exact = TRUE)$n
n_acc <- prop.acc(
len = 0.2, alpha = c, beta = d, level = 0.95, exact = TRUE)$n
n_accmbl <- prop.mblacc(
len = 0.2, alpha = c, beta = d, level = 0.95, exact = TRUE)$n
n_woc <- prop.modwoc(
len = 0.2, alpha = c, beta = d, level = 0.95, exact = TRUE,
worst.level = 0.80)$n
n_wocmbl <- prop.mblmodwoc(
len = 0.2, alpha = c, beta = d, level = 0.95, exact = TRUE,
worst.level = 0.80)$n
library(binomSamSize)
n_freq <- ciss.wald(p0 = c / (c + d), d = 0.1, alpha = 0.05)
```
The required sample sizes are shown in Table \@ref(tab:requiredsamplesizesbinomialproportion).
```{r requiredsamplesizesbinomialproportion, echo = FALSE}
df <- data.frame(n_freq, n_alc, n_alcmbl, n_acc, n_accmbl, n_woc, n_wocmbl)
knitr::kable(
df, caption = "Required sample sizes for estimating a binomial proportion, computed with three criteria for the fully Bayesian and the mixed Bayesian-likelihood (MBL) approach.",
col.names = c("Freq", "ALC", "ALC-MBL", "ACC", "ACC-MBL", "WOC", "WOC-MBL"),
booktabs = TRUE,
linesep = ""
) %>%
kable_classic() %>%
add_footnote("Freq: required sample size computed with the frequentist approach; ALC: average length criterion; ACC: average coverage criterion; WOC: worst outcome criterion.", notation = "none")
```
```{r, echo=FALSE}
rm(list = ls())
```