forked from DickBrus/SpatialSamplingwithR
-
Notifications
You must be signed in to change notification settings - Fork 0
/
09-BalancedandSpreaded.Rmd
868 lines (682 loc) · 53.1 KB
/
09-BalancedandSpreaded.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
# Balanced and well-spread sampling {#BalancedSpreaded}
In this chapter two related but fundamentally different sampling designs are described and illustrated. The similarity and difference are shortly outlined below, but hopefully will become clearer in following sections.
Roughly speaking, for a balanced sample the sample means of covariates are equal to the population means of these covariates. When the covariates are linearly related to the study variable, this may yield a more precise estimate of the population mean or total of the study variable.
A well-spread sample\index{Well-spread sample} is a sample with a large range of values for the covariates, from small to large values, but also including intermediate values. In more technical terms: the sampling units are well-spread along the axes spanned by the covariates. If the spatial coordinates are used as covariates (spreading variables), this results in samples that are well-spread in geographical space. Such samples are commonly referred to as spatially balanced samples, which is somewhat confusing, as the geographical spreading is not implemented through balancing on the geographical coordinates. On the other hand, the averages of the spatial coordinates of a sample well-spread in geographical space will be close to the population means of the coordinates. Therefore, the sample will be approximately balanced on the spatial coordinates [@Grafstrom2014]. The reverse is not true: with balanced sampling\index{Balanced sampling}, the spreading of the sampling units in the space spanned by the balancing variables can be poor. A sample with all values of a covariate used in balancing near the population mean of that variable has a poor spreading along the covariate axis, but can still be perfectly balanced.
## Balanced sampling {#Balanced}
Balanced sampling is a sampling method that exploits one or more quantitative covariates that are related to the study variable. The idea behind balanced sampling is that, if we know the mean of the covariates, then the sampling efficiency can be increased by selecting a sample whose averages of the covariates must be equal to the population means of the covariates.
Let me illustrate balanced sampling with a small simulation study. The simulated population shown in Figure \@ref(fig:simpleexample) shows a linear trend from West to East and besides a trend from South to North. Due to the West-East trend, the simulated study variable $z$ is correlated with the covariate Easting and, due to the South-North trend, also with the covariate Northing. To estimate the population mean of the simulated study variable, intuitively it is attractive to select a sample with an average of the Easting coordinate that is equal to the population mean of Easting (which is 10). Figure \@ref(fig:simpleexample) (subfigure on the left) shows such a sample of size four; we say that the sample is `balanced' on the covariate Easting. The sample in the subfigure on the right is balanced on Easting as well as on Northing.
```{r simpleexample, echo = FALSE, out.width = "100%", fig.cap = "Sample balanced on Easting (E) and on Easting and Northing (E and N)."}
#define residual variogram for simulation
vgmodel <- vgm(model = "Exp", psill = 10, range = 4, nugget = 0)
#define discretisation grid
x <- 1:20 - 0.5
y <- x
grid <- expand.grid(x, y)
names(grid) <- c("x1", "x2")
distx <- outer(grid$x1, grid$x1, FUN = "-")
disty <- outer(grid$x2, grid$x2, FUN = "-")
dist <- sqrt(distx^2 + disty^2)
#compute matrix with covariances
C <- variogramLine(vgmodel, dist_vector = dist, covariance = TRUE)
#now simulate values for grid by Cholesky decomposition
Upper <- chol(C)
set.seed(31415)
G <- rnorm(n = nrow(grid), 0, 1) #simulate random numbers from standard normal distribution
#trend coefficient in x-direction
b1 <- 2
b2 <- 1
grid$z <- crossprod(Upper, G) + b1 * grid$x1 + b2 * grid$x2
#compute population size
N <- nrow(grid)
#set sample size)
n <- 4
#define matrix with covariate for balancing; first column of matrix must be filled with ones
X <- cbind(rep(1, times = N), grid$x1)
#compute inclusion probabilities; use equal probabilities
pi <- rep(n / N, times = N)
nsam <- 100
mx_pop <- mean(grid$x1)
set.seed(31415)
repeat {
sample_ind <- samplecube(X = X, pik = pi, comment = FALSE, method = 1)
mysample <- grid[sample_ind == 1, ]
mx_sample <- mean(mysample$x1)
if (mx_sample == mx_pop) {
break
}
}
#now select a sample balanced on Easting and Northing
X <- cbind(rep(1, times = N), grid$x1, grid$x2)
mx1_pop <- mean(grid$x1)
mx2_pop <- mean(grid$x2)
set.seed(314)
repeat {
sample_ind <- samplecube(X = X, pik = pi, comment = FALSE, method = 1)
mysample2 <- grid[sample_ind == 1, ]
mx1_sample <- mean(mysample2$x1)
mx2_sample <- mean(mysample2$x2)
if (mx1_sample == mx1_pop & mx2_sample == mx2_pop) {
break
}
}
mysamples <- rbind(mysample, mysample2)
mysamples$samid <- rep(c("E", "E and N"), each = 4)
ggplot(data = grid) +
geom_tile(mapping = aes(x = x1, y = x2, fill = z)) +
geom_tile(data = mysamples, mapping = aes(x = x1, y = x2), colour = "white", linewidth = 0.8, width = 1, height = 1, fill = NA) +
scale_fill_viridis_c(name = "z") +
scale_x_continuous(name = "Easting") +
scale_y_continuous(name = "Northing") +
facet_wrap(~ samid, ncol = 2, nrow = 1) +
coord_fixed()
```
### Balanced sample vs. balanced sampling design
We must distinguish a balanced *sample* from a balanced sampling *design*. A sampling design is balanced on a covariate $x$ when *all possible* samples that can be generated by the design are balanced on $x$.
```{block2, type = 'rmdnote'}
Simple random sampling is not a balanced sampling design, because for many simple random samples the sample mean of the balancing variable $x$ is not equal to the population mean of $x$. Only the *expectation* of the sample mean of $x$, i.e., the mean of the sample means obtained by selecting an infinite number of simple random samples, equals the population mean of $x$.
```
Figure \@ref(fig:scatterplotsqerror) shows for 1,000 simple random samples the squared error of the estimated population mean of the study variable $z$ against the difference between the sample mean of balancing variable Easting and the population mean of Easting. Clearly, the larger the absolute value of the difference, the larger on average the squared error. So, to obtain a precise and accurate estimate of the population mean of $z$, we better select samples with a difference close to 0.
(ref:scatterplotsqerrorlabel) Squared error in the estimated population mean of $z$ against the difference between the sample mean and the population mean of Easting, for 1,000 simple random samples of size four selected from the population shown in Figure \@ref(fig:simpleexample).
```{r scatterplotsqerror, echo = FALSE, fig.width = 5, fig.cap = "(ref:scatterplotsqerrorlabel)"}
nsam <- 1000
n <- 4
mz_sample <- mx_sample <- numeric(length = nsam)
set.seed(31415)
for (i in 1:nsam) {
#select simple random sample from the data
sample_ind <- sample(nrow(grid), size = n)
mz_sample[i] <- mean(grid$z[sample_ind])
mx_sample[i] <- mean(grid$x1[sample_ind])
}
sqerror <- (mz_sample - mean(grid$z))^2
devx <- mx_sample - mean(grid$x1)
df <- data.frame(sqerror, devx)
ggplot(data = df) +
geom_point(mapping = aes(y = sqerror, x = devx), size = 2) +
scale_x_continuous(name = "Sample mean Easting - Population mean Easting", limits = c(-9, 9)) +
scale_y_continuous(name = "Squared error")
```
Using only Easting as a balancing variable reduces the sampling variance of the estimator of the mean substantially. Using Easting and Northing as balancing variables further reduces the sampling variance. See Table \@ref(tab:tablebalanced).
```{r tablebalanced, echo = FALSE}
tbl <- data.frame(x = c("SI", "Balanced", "Balanced"), y = c("-", "Easting", "Easting and Northing"), z = c(39.7, 14.4, 9.77))
knitr::kable(
tbl, caption = "Sampling variance of the $\\pi$ estimator of the mean for simple random sampling (SI) and balanced sampling of four units.",
col.names = c("Sampling design", "Balancing variables", "Sampling variance"),
booktabs = TRUE,
linesep = ""
) %>%
kable_classic()
```
### Unequal inclusion probabilities
Until now we have assumed that the inclusion probabilities of the population units are equal, but this is not a requirement for balanced sampling designs. A more general definition of a balanced sampling design is as follows. A sampling design is balanced on variable $x$ when for all samples generated by the design the $\pi$ estimate of the population total of $x$ equals the population total of $x$:
\begin{equation}
\sum_{k \in \mathcal{S}} \frac{x_k}{\pi_k}= \sum_{k=1}^{N} x_k \;.
(\#eq:generaldefinitionbalanced)
\end{equation}
Similar to the regression estimator\index{Regression estimator} explained in the next chapter (Subsection \@ref(RegressionEstimator)), balanced sampling exploits the linear relation between the study variable and one or more covariates. When using the regression estimator, this is done at the estimation stage. Balanced sampling does so at the sampling stage. For a single covariate the regression estimator of the population total equals (see Equation \@ref(eq:SimpleRegressionEstimatorSI))
\begin{equation}
\hat{t}_{\mathrm{regr}}(z) = \hat{t}_{\pi}(z) + \hat{b}\left(t(x) - \hat{t}_{\pi}(x)\right) \;,
(\#eq:RegressionEstimatoranydesign)
\end{equation}
with $\hat{t}_{\pi}(z)$ and $\hat{t}_{\pi}(x)$ the $\pi$ estimators of the population total of the study variable $z$ and the covariate $x$, respectively, $t(x)$ the population total of the covariate, and $\hat{b}$ the estimated slope parameter (see hereafter). With a perfectly balanced sample the second term in the regression estimator, which adjusts the $\pi$ estimator, equals zero.
Balanced samples can be selected with the cube algorithm\index{Cube algorithm for balanced sampling} of @Deville2004. The population total and mean can be estimated by the $\pi$ estimator. The approximated variance of the $\pi$ estimator of the population mean can be estimated by (@Deville2005, @Grafstrom2013)
\begin{equation}
\widehat{V}(\hat{\bar{z}}) = \frac{1}{N^2}\frac{n}{n-p} \sum_{k \in \mathcal{S}} c_k \left(\frac{e_k}{\pi_k}\right)^2 \;,
(\#eq:approxvarianceBalanced)
\end{equation}
with $p$ the number of balancing variables, $c_k$ a weight for unit $k$ (see hereafter), and $e_k$ the residual of unit $k$ given by
\begin{equation}
e_k = z_k - \mathbf{x}_k^{\text{T}}\hat{\mathbf{b}} \;,
(\#eq:residualsBalanced)
\end{equation}
with $\mathbf{x}_k$ a vector of length $p$ with the balancing variables for unit $k$, and $\hat{\mathbf{b}}$ the estimated population regression coefficients\index{Population regression coefficient}, given by
\begin{equation}
\hat{\mathbf{b}} = \left(\sum_{k \in \mathcal{S}} c_k \frac{\mathbf{x}_k}{\pi_k} \frac{\mathbf{x}_k}{\pi_k}^{\text{T}} \right)^{-1} \sum_{k \in \mathcal{S}} c_k \frac{\mathbf{x}_k}{\pi_k} \frac{z_k}{\pi_k} \;.
(\#eq:betasbalanced)
\end{equation}
Working this out for balanced sampling without replacement with equal inclusion probabilities, $\pi_k = n/N,\; k = 1, \dots , N$, yields
\begin{equation}
\widehat{V}(\hat{\bar{z}}) = \frac{1}{n(n-p)} \sum_{k \in \mathcal{S}} c_k e_k^2 \;.
(\#eq:approxvarianceBalancedSI)
\end{equation}
@Deville2005 give several formulas for computing the weights $c_k$, one of which is $c_k = (1-\pi_k)$.
Balanced sampling is now illustrated with aboveground biomass (AGB) data of Eastern Amazonia, see Figure \@ref(fig:mapsAmazonia). Log-transformed short-wave infrared radiation (lnSWIR2) is used as a balancing variable. The `samplecube` function of the **sampling** package [@Tille2016] implements the cube algorithm. Argument `X` of this function specifies the matrix of ancillary variables on which the sample must be balanced. The first column of this matrix is filled with ones, so that the sample size is fixed. To speed up the computations, a 5 km $\times$ 5 km subgrid of `grdAmazonia` is used.
```{block2, type='rmdnote'}
Recall that a sample is balanced on a covariate $x$ if the $\pi$ estimate of the population total of $x$ is equal to the known true population total of $x$ (Equation \@ref(eq:generaldefinitionbalanced)). If we know the total number of units in a population, $N$, we can balance the sample on this known total using a constant with value 1 as a balancing variable. Only for samples of size $n$ the $\pi$ estimate of the total number of population units equals $N$: $\sum_{k \in \mathcal{S}} 1/\pi_k = N$ for $|\mathcal{S}| = n$.
```
Equal inclusion probabilities are used, i.e., for all population units the inclusion probability equals $n/N$.
```{r, echo = FALSE}
grdAmazonia <- read_rds(file = "results/grdAmazonia_5km.rds")
```
```{r}
grdAmazonia <- grdAmazonia %>%
mutate(lnSWIR2 = log(SWIR2))
library(sampling)
N <- nrow(grdAmazonia)
n <- 100
X <- cbind(rep(1, times = N), grdAmazonia$lnSWIR2)
pi <- rep(n / N, times = N)
sample_ind <- samplecube(X = X, pik = pi, comment = FALSE, method = 1)
eps <- 1e-6
units <- which(sample_ind > (1 - eps))
mysample <- grdAmazonia[units, ]
```
The population mean can be estimated by the sample mean.
```{r}
mz <- mean(mysample$AGB)
```
To estimate the variance, a function is defined for estimating the population regression coefficients.
```{r}
estimate_b <- function(z, X, c) {
cXX <- matrix(nrow = ncol(X), ncol = ncol(X), data = 0)
cXz <- matrix(nrow = 1, ncol = ncol(X), data = 0)
for (i in seq_len(length(z))) {
x <- X[i, ]
cXX_i <- c[i] * (x %*% t(x))
cXX <- cXX + cXX_i
cXz_i <- c[i] * t(x) * z[i]
cXz <- cXz + cXz_i
}
b <- solve(cXX, t(cXz))
return(b)
}
```
The next code chunk shows how the estimated variance of the $\pi$ estimator of the population mean can be computed.
```{r}
pi <- rep(n / N, n)
c <- (1 - pi)
b <- estimate_b(z = mysample$AGB / pi, X = X[units, ] / pi, c = c)
zpred <- X %*% b
e <- mysample$AGB - zpred[units]
v_tz <- n / (n - ncol(X)) * sum(c * (e / pi)^2)
v_mz <- v_tz / N^2
```
Figure \@ref(fig:BalancedSampleAmazonia) shows the selected balanced sample. Note the spatial clustering of some units. The estimated population mean (as estimated by the sample mean) of AGB equals `r formatC(mz, 1, format = "f")` 10^9^ kg ha^-1^. The population mean of AGB equals `r formatC(mean(grdAmazonia$AGB), 1, format = "f")` 10^9^ kg ha^-1^. The standard error of the estimated mean equals `r formatC(sqrt(v_mz), 1, format = "f")` 10^9^ kg ha^-1^.
```{r BalancedSampleAmazonia, echo = FALSE, out.width = "100%", fig.cap = "Sample balanced on lnSWIR2 from Eastern Amazonia."}
ggplot(data = grdAmazonia) +
geom_raster(mapping = aes(x = x1 / 1000, y = x2 / 1000, fill = lnSWIR2)) +
geom_point(data = mysample, mapping = aes(x = x1 / 1000, y = x2 / 1000), size = 2) +
scale_fill_viridis_c(name = "lnSWIR2") +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed()
```
Figure \@ref(fig:SamplingDistributionBalanced) shows the approximated sampling distribution of the $\pi$ estimator of the mean AGB with balanced sampling and simple random sampling, obtained by repeating the random sampling with both designs and estimation 1,000 times.
```{r, echo = FALSE, eval = FALSE}
mz <- v_mz <- mx_sample <- mz_SI <- numeric(length = 1000)
pi <- rep(n / N, n)
c <- (1 - pi)
set.seed(314)
for (i in 1:1000) {
sample_ind <- samplecube(X = X, pik = rep(n / N, N), comment = FALSE, method = 1)
units <- which(sample_ind > (1 - eps))
mysample <- grdAmazonia[units, ]
mz[i] <- mean(mysample$AGB)
b <- estimate_b(z = mysample$AGB / pi, X = X[units, ] / pi, c = c)
zpred <- X %*% b
e <- mysample$AGB - zpred[units]
v_tz <- n / (n - ncol(X)) * sum(c * (e / pi)^2)
v_mz[i] <- v_tz / N^2
mx_sample[i] <- mean(mysample$lnSWIR2)
units <- sample(nrow(grdAmazonia), size = n, replace = FALSE)
mz_SI[i] <- mean(grdAmazonia$AGB[units])
}
save(mz, mz_SI, v_mz, mx_sample, file = "results/Balanced_Amazonia.rda")
```
(ref:SamplingDistributionBalancedlabel) Approximated sampling distribution of the $\pi$ estimator of the mean AGB (10^9^ kg ha^-1^) in Eastern Amazonia, with balanced sampling, balanced on lnSWIR2 and using equal inclusion probabilities, and simple random sampling without replacement (SI) of 100 units.
```{r SamplingDistributionBalanced, echo = FALSE, fig.width = 5, fig.asp = .8, fig.cap = "(ref:SamplingDistributionBalancedlabel)"}
load(file = "results/Balanced_Amazonia.rda")
estimates <- data.frame(mz, mz_SI, v_mz, mx_sample)
names(estimates)[c(1, 2)] <- c("Balanced", "SI")
df <- estimates %>% pivot_longer(cols = c("Balanced", "SI"))
ggplot(data = df) +
geom_boxplot(aes(y = value, x = name)) +
geom_hline(yintercept = mean(grdAmazonia$AGB), colour = "red") +
scale_x_discrete(name = "Sampling design") +
scale_y_continuous(name = "Estimated mean AGB")
```
The variance of the 1,000 estimates of the population mean of the study variable AGB equals `r formatC(var(estimates$Bal), 1, format = "f")` (10^9^ kg ha^-1^)^2^. The gain in precision compared to simple random sampling equals `r formatC(var(estimates$SI)/var(estimates$Bal), 3, format = "f")` (design effect is `r formatC(var(estimates$Bal)/var(estimates$SI), 3, format = "f")`), so with simple random sampling about three times more sampling units are needed to estimate the population mean with the same precision. The mean of the 1,000 estimated variances equals `r formatC(mean(estimates$v_mz), 1, format = "f")` (10^9^ kg ha^-1^)^2^, indicating that the approximated variance estimator somewhat underestimates the true variance in this case. The population mean of the balancing variable lnSWIR2 equals `r formatC(mean(grdAmazonia$lnSWIR2), 3, format = "f")`. The sample mean of lnSWIR2 varies a bit among the samples. Figure \@ref(fig:histSampleMeanSWIR) shows the approximated sampling distribution of the sample mean of lnSWIR2. In other words, many samples are not perfectly balanced on lnSWIR2. This is not exceptional; in most cases perfect balance is impossible.
```{r histSampleMeanSWIR, echo = FALSE, fig.width = 5, fig.cap = "Approximated sampling distribution of the sample mean of balancing variable lnSWIR2 in Eastern Amazonia with balanced sampling of size 100 and equal inclusion probabilities."}
ggplot(data = estimates) +
geom_histogram(aes(x = mx_sample, y = ..density..), binwidth = 0.0025, fill = "black", alpha = 0.5, color = "black") +
geom_density(aes(x = mx_sample, y = ..density..), lwd = 1, adjust = 2) +
scale_x_continuous(name = "Sample mean lnSWIR2") +
scale_y_continuous(name = "Density")
```
#### Exercises {-}
1. Select a sample of size 100 balanced on lnSWIR2 and Terra_PP from Eastern Amazonia, using equal inclusion probabilities for all units.
+ First, select a subgrid of 5 km $\times$ 5 km using function `spsample`, see Chapter \@ref(SY).
+ Estimate the population mean.
+ Estimate the standard error of the $\pi$ estimator. First, estimate the regression coefficients, using function `estimate_b` defined above, then compute the residuals, and finally compute the variance.
2. Spatial clustering of sampling units is not avoided in balanced sampling. What effect do you expect of this spatial clustering on the precision of the estimated mean? Can you think of a situation where this effect does not occur?
### Stratified random sampling {#StratifiedsamplingasBalancedsampling}
Much in the same way as we controlled in the previous subsection the sample size $n$ by balancing the sample on the known total number of population units $N$, we can balance a sample on the known total number of units in subpopulations. A sample balanced on the sizes of subpopulations is a stratified random sample. Figure \@ref(fig:BalancedSampleCategorical) shows four subpopulations or strata. These four strata can be used in balanced sampling by constructing the following design matrix\index{Design matrix} $\mathbf{X}$ with as many columns as there are strata and as many rows as there are population units:
\begin{equation}
\mathbf{X}=
\begin{bmatrix}
1 &0 &0 &0 \\
1 &0 &0 &0 \\
1 &0 &0 &0 \\
1 &0 &0 &0 \\
0 &1 &0 &0 \\
0 &1 &0 &0 \\
0 &0 &1 &0 \\
\vdots &\vdots &\vdots &\vdots\\
0 & 0 & 0 & 1 \\
\end{bmatrix} \;.
(\#eq:DesignMatrixBalanced)
\end{equation}
The first four rows refer to the four leftmost bottom row population units in Figure \@ref(fig:BalancedSampleCategorical). These units belong to class A, which explains that the first column for these units contain ones. The other three columns for these rows contain all zeroes. The fifth and sixth unit belong to stratum B, so that the second column for these rows contain ones, and so on. The final row is the upperright sampling unit in stratum D, so the first three columns contain zeroes, and the fourth column is filled with a one. The sum of the indicators in the columns is the total number of population units in the strata.
```{r BalancedSampleCategorical, echo = FALSE, fig.width = 5, fig.cap = "Sample balanced on a categorical variable with four classes."}
s1 <- s2 <- 1:20 - 0.5
mypop <- expand.grid(s1, s2)
names(mypop) <- c("s1", "s2")
mypop$stratum <- as.factor(findInterval(mypop$s1, c(4, 6, 14)))
levels(mypop$stratum) <- LETTERS[1:4]
mypop <- mypop %>%
group_by(stratum) %>%
summarise(N_h = n(), .groups = "drop") %>%
mutate(pi_h = rep(5, 4) / N_h) %>%
right_join(mypop, by = "stratum")
X <- model.matrix(~ stratum - 1, mypop)
set.seed(314)
sample_ind <- samplecube(X = X, pik = mypop$pi_h, comment = FALSE, method = 1)
eps <- 1e-6
mysample <- mypop[sample_ind > (1 - eps), ]
ggplot(data = mypop) +
geom_tile(mapping = aes(x = s1, y = s2, fill = factor(stratum)), width = 1, height = 1, size = 0.5, colour = "white") +
geom_tile(data = mysample, mapping = aes(x = s1, y = s2), fill = NA, width = 1, height = 1, linewidth = 0.7, colour = "black") +
scale_fill_viridis_d(name = "Stratum") +
scale_x_continuous(name = "Easting") +
scale_y_continuous(name = "Northing") +
coord_fixed()
```
In the next code chunk, the inclusion probabilities are computed by $\pi_{hk}=n_h/N_h,\; k=1, \dots , N_h$, with $n_h=5$ for all four strata. The stratum sample sizes are equal, but the number of population units of a stratum differ among the strata, so the inclusion probabilities also differ among the strata. The ten leftmost units on the bottom row of Figure \@ref(fig:BalancedSampleCategorical) are shown below.
```{r}
s1 <- s2 <- 1:20 - 0.5
mypop <- expand.grid(s1, s2)
names(mypop) <- c("s1", "s2")
mypop$stratum <- as.factor(findInterval(mypop$s1, c(4, 6, 14)))
levels(mypop$stratum) <- LETTERS[1:4]
mypop <- mypop %>%
group_by(stratum) %>%
summarise(N_h = n(), .groups = "drop") %>%
mutate(pi_h = rep(5, 4) / N_h) %>%
right_join(mypop, by = "stratum") %>%
arrange(s2, s1)
mypop
```
Next, the design matrix $\mathbf{X}$ is computed with function `model.matrix`, expanding the factor `stratum` to a set of dummy variables, see code chunk below. By adding `- 1` to the formula, the usual first column with ones in the design matrix is omitted. The design matrix has four columns with dummy variables (indicators), indicating to which stratum a unit belongs.
The columns in the design matrix with dummy variables are multiplied by the vector with inclusion probabilities, using function `sweep `, resulting in the following design matrix:
\begin{equation}
\mathbf{X}=
\begin{bmatrix}
0.0625 &0 &0 &0 \\
0.0625 &0 &0 &0 \\
0.0625 &0 &0 &0 \\
0.0625 &0 &0 &0 \\
0 &0.125 &0 &0 \\
0 &0.125 &0 &0 \\
0 &0 &0.03125 &0 \\
\vdots &\vdots &\vdots &\vdots\\
0 & 0 & 0 &0.04167 \\
\end{bmatrix} \;.
(\#eq:DesignMatrixBalanced2)
\end{equation}
The multiplication by the inclusion probabilities is not strictly needed. Using the design matrix with dummy variables implies that we balance the sample on the known total number of population units in the strata, $N_h$. For samples with stratum sample sizes equal to $n_h$, the sample sums of the dummy variables used in balancing, divided by the inclusion probability, are equal to $N_h$.
Multiplication of the dummy variables by the vector with inclusion probabilities implies that we balance the sample on the population totals of the inclusion probabilities, which are equal to the targeted stratum sample sizes. For samples with stratum sample sizes $n_h$ equal to these targeted sample sizes, the sample sums of the balancing variables, divided by the inclusion probability (having value $\pi_{hk}/\pi_{hk}=1$ or 0), are equal to the targeted sample sizes.
```{r }
X <- model.matrix(~ stratum - 1, data = mypop)
X <- sweep(X, MARGIN = 1, mypop$pi_h, `*`)
set.seed(314)
sample_ind <- samplecube(X = X, pik = mypop$pi_h, comment = FALSE, method = 1)
eps <- 1e-6
mysample <- mypop[sample_ind > (1 - eps), ]
```
In the above example all units in a stratum have the same inclusion probability, yielding a stratified simple random sample. We may also use variable inclusion probabilities, for instance proportional to a size measure of the units, yielding a stratified ppswor random sample (Section \@ref(ppswor)).
The advantage of selecting a stratified random sample by balancing the sample on a categorical variable becomes clear in case we have multiple classifications that we would like to use in stratification, and we cannot afford to use all cross-classifications as strata. This is the topic of the next subsection.
### Multiway stratification {#Multiwaystratification}
@Falorsi2008 describe how a multiway stratified sample\index{Multiway stratification} can be selected as a balanced sample. Multiway stratification is of interest when one has multiple stratification variables, each stratification variable leading to several strata, so that the total number of cross-classification strata\index{Cross-classification stratum} becomes so large that the stratum sample sizes are strongly disproportional to their size or even exceed the total sample size. For instance, suppose we have three maps with 4, 3, and 5 map units. Further, suppose that all combinations of map units are non-empty, so that we have 4 $\times$ 3 $\times$ 5 = 60 combinations. We may not like to use all combinations (cross-classifications) as strata. The alternative is then to use the $4+3+5=12$ map units as "strata". This is not a stratification in the usual sense since the strata are overlapping.
The sample sizes of the marginal strata can be controlled using a design matrix with as many columns as there are strata. The units of an individual map used for stratification are referred to as marginal strata\index{Marginal stratum}. Each row $k = 1, \dots, N$ in the design matrix $\mathbf{X}$ has as many non-zero values as we have maps, in entries corresponding to the cross-classification map unit of population unit $k$, and zeroes in the remaining entries. The non-zero value is the inclusion probability of that unit. Each column of the design matrix has non-zero values in entries corresponding to the population units in that marginal stratum and zeroes in all other entries.
Two-way stratified random sampling is illustrated with a simulated population of 400 units (Figure \@ref(fig:TwowaystratifiedPopulation)). Figure \@ref(fig:Twowaystratifiedsample) shows two classifications of the population units. Classification A consists of four classes (map units), classification B of three classes. Instead of using $4 \times 3 = 12$ cross-classifications as strata in random sampling, only $4+3=7$ marginal strata are used in two-way stratified random sampling.
```{r TwowaystratifiedPopulation, echo = FALSE, fig.width = 5, fig.cap = "Simulated population used for illustration of two-way stratified random sampling."}
s1 <- s2 <- 1:20 - 0.5
mypop <- expand.grid(s1, s2)
names(mypop) <- c("s1", "s2")
vgmodel <- vgm(model = "Exp", psill = 4, range = 4, nugget = 0)
dists1 <- outer(mypop$s1, mypop$s1, FUN = "-")
dists2 <- outer(mypop$s2, mypop$s2, FUN = "-")
dist <- sqrt(dists1^2 + dists2^2)
C <- variogramLine(vgmodel, dist_vector = dist, covariance = TRUE)
Upper <- chol(C)
set.seed(314)
N <- nrow(mypop)
G <- rnorm(N, 0, 1) #simulate random numbers from standard normal distribution
mypop$z <- crossprod(Upper, G) + 10
mypop$A <- as.factor(findInterval(mypop$s1, c(5, 8, 12)))
mypop$B <- as.factor(findInterval(mypop$s2, c(8, 15)))
levels(mypop$A) <- c("A1", "A2", "A3", "A4")
levels(mypop$B) <- c("B1", "B2", "B3")
ggplot(data = mypop) +
geom_tile(mapping = aes(x = s1, y = s2, fill = z), width = 1, height = 1, linewidth = 0.5) +
scale_fill_viridis_c() +
scale_x_continuous(name = "Easting") +
scale_y_continuous(name = "Northing") +
coord_fixed()
```
As a first step, the inclusion probabilities are added to `data.frame` `mypop` with the spatial coordinates and simulated values. To keep it simple, I computed inclusion probabilities equal to 2 divided by the number of population units in a cross-classification stratum. Note that this does not imply that a sample is selected with two units per cross-classification stratum. As we will see later, it is possible that in some cross-classification strata no units are selected at all, while in other cross-classification strata more than two units are selected. In multiway stratified sampling, the marginal stratum sample sizes are controlled. The inclusion probabilities should result in six selected units for all four units of map A and eight selected units for all three units of map B.
```{r}
mypop <- mypop %>%
group_by(A, B) %>%
summarise(N_h = n(), .groups = "drop") %>%
mutate(pi_h = rep(2, 12) / N_h) %>%
right_join(mypop, by = c("A", "B"))
```
The next step is to create the design matrix. Two submatrices are computed, one per stratification. The two submatrices are joined columnwise, using function `cbind`. The columns are multiplied by the vector with inclusion probabilities.
```{r}
XA <- model.matrix(~ A - 1, mypop)
XB <- model.matrix(~ B - 1, mypop)
X <- cbind(XA, XB)
X <- sweep(X, MARGIN = 1, mypop$pi_h, `*`)
```
Matrix $\mathbf{X}$ can be reduced by one column if in the first column the inclusion probabilities of *all* population units are inserted. This first column contains no zeroes. Balancing on this variable implies that the total sample size is controlled. Now there is no need anymore to control the sample sizes of all marginal strata. It is sufficient to control the sample sizes of three marginal strata of map A (A2, A3, and A4) and two marginal strata of map B (B2 and B3). Given the total sample size, the sample sizes of map units A1 and B1 then cannot be chosen freely anymore.
```{r}
X <- model.matrix(~ A + B, mypop)
colnames(X)
X <- sweep(X, MARGIN = 1, mypop$pi_h, `*`)
```
This reduced design matrix is not strictly needed for selecting a multiway stratified sample, but it must be used in estimation. If in estimation, as many balancing variables are used as we have marginal strata, the matrix with the sum of squares of the balancing variables (first sum in Equation \@ref(eq:betasbalanced)) cannot be inverted (the matrix is singular), and as a consequence the population regression coefficients cannot be estimated.
Finally, the two-way stratified random sample is selected with function `samplecube` of package **sampling**.
```{r}
sample_ind <- samplecube(X = X, pik = mypop$pi_h, method = 1, comment = FALSE)
units <- which(sample_ind > (1 - eps))
mysample <- mypop[units, ]
```
Figure \@ref(fig:Twowaystratifiedsample) shows the selected sample.
```{r, Twowaystratifiedsample, echo = FALSE, out.width = '100%', fig.cap = "Two-way stratified sample."}
plt1 <- ggplot(data = mypop) +
geom_tile(mapping = aes(x = s1, y = s2, fill = factor(A)), width = 1, height = 1, size = 0.5, colour = "white") +
geom_tile(data = mysample, mapping = aes(x = s1, y = s2), fill = NA, width = 1, height = 1, linewidth = 0.7, colour = "black") +
scale_fill_viridis_d(name = "Stratum") +
scale_x_continuous(name = "") +
scale_y_continuous(name = "") +
coord_fixed()
plt2 <- ggplot(data = mypop) +
geom_tile(mapping = aes(x = s1, y = s2, fill = factor(B)), width = 1, height = 1, linewidth = 0.5, colour = "white") +
geom_tile(data = mysample, mapping = aes(x = s1, y = s2), fill = NA, width = 1, height = 1, linewidth = 0.7, colour = "black") +
scale_fill_viridis_d(name = "Stratum") +
scale_y_continuous(name = "") +
scale_x_continuous(name = "") +
coord_fixed()
grid.arrange(plt1, plt2, nrow = 1)
```
All marginal stratum sample sizes of map A are six and all marginal stratum sample sizes of map B are eight, as expected. The sample sizes of the cross-classification strata vary from zero to four.
```{r}
addmargins(table(mysample$A, mysample$B))
```
The population mean can be estimated by the $\pi$ estimator.
```{r}
N <- nrow(mypop)
print(mean <- sum(mysample$z / mysample$pi_h) / N)
```
The variance is estimated as before (Equation \@ref(eq:approxvarianceBalanced)).
```{r}
c <- (1 - mysample$pi_h)
b <- estimate_b(
z = mysample$z / mysample$pi_h, X = X[units, ] / mysample$pi_h, c = c)
zpred <- X %*% b
e <- mysample$z - zpred[units]
n <- nrow(mysample)
v_tz <- n / (n - ncol(X)) * sum(c * (e / mysample$pi_h)^2)
print(v_mz <- v_tz / N^2)
```
## Well-spread sampling {#Spreaded}
With balanced sampling, the spreading of the sampling units in the space spanned by the balancing variables can be poor. For instance, in Figure \@ref(fig:simpleexample) the Easting coordinates of all units of a sample balanced on Easting can be equal or close to the population mean of 10. So, in this example, balancing does not guarantee a good geographical spreading. Stated more generally, a balanced sample can be selected that shows strong clustering in the space spanned by the balancing variables. This clustering may inflate the standard error of the estimated population total and mean. The clustering in geographical or covariate space can be avoided by the local pivotal method [@Grafstrom2012] and the spatially correlated Poisson sampling method\index{Spatially correlated Poisson sampling} [@Grafstrom2012b].
```{block2, type = 'rmdnote'}
For spreading in *geographical* space various other designs are available. A simple design is stratified random sampling from compact geographical strata, see Section \@ref(geostrata). Alternative designs are generalised random-tessellation stratified sampling [@stevens2004], see Subsection \@ref(GRTS), and balanced acceptance sampling\index{Balanced acceptance sampling} [@Robertson2013].
```
### Local pivotal method {#LPM}
The local pivotal method (LPM) is a modification of the pivotal method explained in Subsection \@ref(pivotalmethod)\index{Local pivotal method}. The only difference with the pivotal method is the selection of the pairs of units. In the pivotal method, at each step two units are selected, for instance, the first two units in the vector with inclusion probabilities after randomising the order of the units. In the local pivotal method, the first unit is selected fully randomly, and the nearest neighbour of this unit is used as its counterpart. Recall that when one unit of a pair is included in the sample, the inclusion probability of its counterpart is decreased. This leads to a better spreading of the sampling units in the space spanned by the spreading variables.
LPM can be used for arbitrary inclusion probabilities. The inclusion probabilities can be equal, but as with the pivotal method these probabilities may also differ among the population units.
Selecting samples with LPM can be done with functions `lpm`, `lpm1`, or `lpm2` of package **BalancedSampling** [@Grafstrom2016]. Functions `lpm1` and `lpm2` only differ in the selection of neighbours that are allowed to compete, for details see @Grafstrom2012. For most populations, the two algorithms perform similarly. The algorithm implemented in function `lpm` is only recommended when the population size is too large for `lpm1` or `lpm2`. It only uses a subset of the population in search for nearest neighbours and is thus not as good. Another function ` lpm2_kdtree` of package **SamplingBigData** [@samplingBigData] is developed for big data sets.
Inclusion probabilities are computed with function `inclusionprobabilities` of package **sampling**. A matrix $\mathbf{X}$ must be defined with the values of the spreading variables of the population units. Figure \@ref(fig:LPMKandahar) shows a ppswor sample of 40 units selected from the sampling frame of Kandahar, using the spatial coordinates of the population units as spreading variables. Inclusion probabilities are proportional to the agricultural area within the population units. The geographical spreading is improved compared with the sample shown in Figure \@ref(fig:ppswrKandahar).
```{r}
library(BalancedSampling)
library(sampling)
n <- 40
pi <- inclusionprobabilities(grdKandahar$agri, n)
X <- cbind(grdKandahar$s1, grdKandahar$s2)
set.seed(314)
units <- lpm1(pi, X)
myLPMsample <- grdKandahar[units, ]
```
```{r LPMKandahar, echo = FALSE, out.width = '100%', fig.cap = "Spatial ppswor sample of size 40 from Kandahar, selected by the local pivotal method, using agricultural area as a size variable."}
ggplot(data = grdKandahar) +
geom_raster(mapping = aes(x = s1 / 1000, y = s2 / 1000, fill = agri)) +
geom_tile(data = myLPMsample, mapping = aes(x = s1 / 1000, y = s2 / 1000), colour = "white", linewidth = 0.7, width = 5, height = 5, fill = NA) +
scale_fill_viridis_c(name = "Agriarea") +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed()
```
The total poppy area can be estimated with the $\pi$ estimator (Equation \@ref(eq:HTTotal)).
```{r}
myLPMsample$pi <- pi[units]
tz_HT <- sum(myLPMsample$poppy / myLPMsample$pi)
```
The estimated total poppy area equals `r formatC(tz_HT, 0, format = "f", big.mark = ",")` ha. The sampling variance of the estimator of the population total with the local pivotal method can be estimated by [@Grafstrom2014]
\begin{equation}
\widehat{V}(\hat{t}(z)) = \frac{1}{2} \sum_{k \in \mathcal{S}} \left( \frac{z_k}{\pi_k} - \frac{z_{k_j}}{\pi_{k_j}} \right)^2 \;,
(\#eq:VartotalLPM)
\end{equation}
with ${k_j}$ the nearest neighbour of unit $k$ in the sample. This variance estimator is for the case where we have only one nearest neighbour.
Function `vsb` of package **BalancedSampling** is an implementation of a more general variance estimator that accounts for more than one nearest neighbour (equation (6) in @Grafstrom2014). We expect a somewhat smaller variance compared to pps sampling, so we may use the variance of the pwr estimator (Equation \@ref(eq:VarHHTotalppswr)) as a conservative variance estimator\index{Conservative variance estimator}.
```{r vartotLPM}
Xsample <- X[units, ]
se_tz_HT <- sqrt(vsb(pi[units], myLPMsample$poppy, Xsample))
pk <- myLPMsample$pi / n
se_tz_pwr <- sqrt(var(myLPMsample$poppy / pk) / n)
```
The standard error obtained with function `vsb` equals `r formatC(se_tz_HT, 0, format = "f", big.mark = ",")` ha, the standard error of the pwr estimator equals `r formatC(se_tz_pwr, 0, format = "f", big.mark = ",")` ha.
As explained above, the LPM design can also be used to select a probability sample well-spread in the space spanned by one or more quantitative covariates. Matrix $\mathbf{X}$ then should contain the values of the *scaled* (standardised) covariates instead of the spatial coordinates.
#### Exercises {-}
3. Geographical spreading of the sampling units can also be achieved by random sampling from compact geographical strata (geostrata) (Section \@ref(geostrata)). Can you think of one or more advantages of LPM sampling over random sampling from geostrata?
### Generalised random-tessellation stratified sampling {#GRTS}
Generalised random-tessellation stratified (GRTS) sampling\index{Generalised random-tessellation stratified sampling} is designed for sampling discrete objects scattered throughout space, think for instance of the lakes in Finland, segments of hedgerows in England, etc. Each object is represented by a point in 2D-space. It is a complicated design, and for sampling points from a continuous universe or raster cells from a finite population, I recommend more simple designs such as the local pivotal method (Subsection \@ref(LPM)), balanced sampling with geographical spreading (Section \@ref(BalancedandSpreaded)), or sampling from compact geographical strata (Section \@ref(geostrata)). Let me try to explain the GRTS design with a simple example of a finite population of point objects in a circular study area (Figure \@ref(fig:GRTSNumbering)). For a more detailed description of this design, refer to @Hankin2019. As a first step, a square bounding box of the study area is constructed. This bounding box is recursively partitioned into square grid cells. First, 2 $\times$ 2 grid cells are constructed. These grid cells are numbered in a predefined order. In Figure \@ref(fig:GRTSNumbering) this numbering is from lower left, lower right, upper left to upper right. Each grid cell is then subdivided into four subcells; the subcells are numbered using the same order. This is repeated until at most one population unit occurs in each subcell. For our population only two iterations were needed, leading to 4 $\times$ 4 subcells. Note that two subcells are empty. Each address of a subcell consists of two digits: the first digit refers to the grid cell, the second digit to the subcell.
```{r GRTSNumbering, echo = FALSE, out.width = "100%", fig.cap = "Numbering of grid cells and subcells for GRTS sampling."}
s1 <- s2 <- 1:100 - 0.5
pxl <- expand.grid(s1, s2)
names(pxl) <- c("s1", "s2")
#construct strata level 1
s1bnd <- 50
s1f <- findInterval(pxl$s1, s1bnd)
s2f <- findInterval(pxl$s2, s1bnd)
pxl$partit1 <- factor(interaction(s1f, s2f), labels = c(1, 2, 3, 4))
#construct strata level 2
s1bnd <- seq(from = 25, to = 75, by = 25)
s1f <- findInterval(pxl$s1, s1bnd)
s2f <- findInterval(pxl$s2, s1bnd)
lbl <- c("00", "01", "10", "11", "02", "03", "12", "13", "20", "21", "30", "31", "22", "23", "32", "33")
pxl$partit2 <- factor(interaction(s1f, s2f), labels = lbl)
n_h <- rep(1, times = 16)
set.seed(314)
units <- sampling::strata(pxl, stratanames = "partit2",
size = n_h, method = "srswor")
myfinpop <- getdata(pxl, units)
#remove some units, and shift fourth point into circle
myfinpop$s1[4] <- 78
units <- sample(nrow(myfinpop), 2)
myfinpop <- myfinpop[-units, ]
N <- nrow(myfinpop)
circle <- data.frame(s1 = 50, s2 = 50)
plt1 <- ggplot(data = myfinpop) +
geom_raster(data = pxl, mapping = aes(x = s1, y = s2), fill = "lightgrey") +
geom_circle(data = circle, aes(x0 = s1, y0 = s2, r = 50), fill = "grey") +
geom_point(mapping = aes(x = s1, y = s2), size = 2) +
scale_x_continuous(name = "Easting") +
scale_y_continuous(name = "Northing") +
scale_fill_discrete(guide = "none") +
coord_fixed()
dat <- expand.grid(x = seq(from = 12.5, to = 87.5, by = 25), y = seq(from = 12.5, to = 87.5, by = 25))
dat$labels <- lbl
plt2 <- ggplot(data = myfinpop) +
geom_raster(data = pxl, mapping = aes(x = s1, y = s2, fill = as.factor(partit1)), alpha = 0.5) +
geom_circle(data = circle, mapping = aes(x0 = s1, y0 = s2, r = 50)) +
geom_vline(xintercept = c(25, 50, 75)) +
geom_hline(yintercept = c(25, 50, 75)) +
geom_point(mapping = aes(x = s1, y = s2), size = 2) +
geom_text(data = dat, mapping = aes(x = x, y = y, label = labels)) +
scale_fill_viridis_d(guide = "none") +
scale_x_continuous(name = "Easting") +
scale_y_continuous(name = "Northing") +
coord_fixed()
grid.arrange(plt1, plt2, nrow = 1)
```
The next step is to place the 16 subcells on a line in a random order. The randomisation is done hierarchically. First, the four grid cells at the highest level are randomised. In our example, the randomised order is 1, 2, 3, 0 (Figure \@ref(fig:GRTS)). Next, within each grid cell, the order of the subcells is randomised. This is done independently for the grid cells. In our example, for grid cell 1 the randomised order of the subcells is 2, 1, 3, 0 (Figure \@ref(fig:GRTS)). Note that the empty subcells (0,0) and (3,3) are removed from the line.
```{r}
set.seed(314)
ord <- sample(4, 4)
myfinpop_rand <- NULL
for (i in ord) {
units <- which(myfinpop$partit1 == i)
units_rand <- sample(units, size = length(units))
myfinpop_rand <- rbind(myfinpop_rand, myfinpop[units_rand, ])
}
```
After the subcells have been placed on a line, a one-dimensional systematic random sample is selected (Figure \@ref(fig:GRTS)), see also Subsection \@ref(Systematicpps). This can be done either with equal or unequal inclusion probabilities. With equal inclusion probabilities, the length of the intervals representing the population units is constant. With unequal inclusion probabilities, the interval lengths are proportional to a size variable. For a sample size of $n$, the total line is divided into $n$ segments of equal length. A random point is selected in the first segment, and the other points of the systematic sample are determined. Finally, the population units corresponding to the selected systematic sample are identified. With equal probabilities, the five selected units are the units in subcells 11, 23, 22, 32, and 03 (Figure \@ref(fig:GRTS)).
```{r sys1D}
size <- rep(1, N)
n <- 5
interval <- sum(size) / n
start <- round(runif(1) * interval, 2)
mysys <- c(start, 1:(n - 1) * interval + start)
```
```{r GRTS, echo = FALSE, fig.asp = .15, out.width = "100%", fig.cap = "Systematic random sample along a line with equal inclusion probabilities."}
cellbound <- c(0, cumsum(size))
#compute width of cells
w <- diff(cellbound)
#compute x coordinates halfway the cell boundaries
x <- cellbound[1:N] + w / 2
n1 <- table(myfinpop_rand$partit1)
address_1 <- rep(ord, n1[ord])
address <- myfinpop_rand$partit2
datfr <- data.frame(x, y = rep(1, N), address_1, address, w)
ggplot(data = datfr) +
geom_tile(mapping = aes(x = x, y = y, fill = factor(address_1), width = w), alpha = 0.5) +
geom_text(mapping = aes(x = x, y = y, label = address)) +
geom_vline(xintercept = cellbound, linetype = 2) +
scale_y_continuous("", breaks = c()) +
scale_x_continuous("", breaks = mysys) +
scale_fill_viridis_d(guide = "none") +
theme(panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.background = element_blank())
```
Figure \@ref(fig:GRTSpps) shows a systematic random sample along a line with unequal inclusion probabilities. The inclusion probabilities are proportional to a size variable, with values 1, 2, 3, or 4. The selected population units are the units in subcells 10, 20, 31, 01, and 02.
```{r GRTSpps, echo = FALSE, fig.asp = .15, out.width = "100%", fig.cap = "Systematic random sample along a line with inclusion probabilities proportional to size."}
size <- sample(4, size = N, replace = TRUE)
n <- 5
interval <- sum(size) / n
start <- round(runif(1) * interval, 2)
mysyspps <- c(start, 1:(n - 1) * interval + start)
cellbound <- c(0, cumsum(size))
w <- diff(cellbound)
x <- cellbound[1:N] + w / 2
datfr <- data.frame(x, y = rep(1, N), address_1, address, w)
ggplot(data = datfr) +
geom_tile(mapping = aes(x = x, y = y, fill = factor(address_1), width = w), alpha = 0.5) +
geom_text(mapping = aes(x = x, y = y, label = address)) +
geom_vline(xintercept = cellbound, linetype = 2) +
scale_y_continuous("", breaks = c()) +
scale_x_continuous("", breaks = mysyspps) +
scale_fill_viridis_d(guide = "none") +
theme(panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.background = element_blank())
```
GRTS samples can be selected with function `grts` of package **spsurvey** [@spsurvey]. The next code chunk shows the selection of a GRTS sample of 40 units from Kandahar. Tibble `grdKandahar` is first converted to an `sf` object with function `st_as_sf` of package **sf** [@sf]. The data set is using a UTM projection (zone 41N) with WGS84 datum. This projection is passed to function `st_as_df` with argument `crs` (`crs = 32641`). Argument `sframe` of function `grts` specifies the sampling frame. The sample size is passed to function `grts` with argument `n_base`. Argument `seltype` is set to `proportional` to select units with probabilities proportional to an ancillary variable which is passed to function `grts` with argument `aux_var`.
```{r GRTSppsKandahar}
library(spsurvey)
library(sf)
sframe_sf <- st_as_sf(grdKandahar, coords = c("s1", "s2"), crs = 32641)
set.seed(314)
res <- grts(
sframe = sframe_sf, n_base = 40, seltype = "proportional", aux_var = "agri")
myGRTSsample <- res$sites_base
```
Function `cont_analysis` computes the ratio estimator of the population mean and its standard error. The estimated mean is multiplied by the total number of population units to obtain a ratio estimate of the population total.
```{r}
myGRTSsample$siteID <- paste0("siteID", seq_len(nrow(myGRTSsample)))
res <- cont_analysis(myGRTSsample,
vars = "poppy", siteID = "siteID", weight = "wgt", statistics = "Mean")
tz_ratio <- res$Mean$Estimate * nrow(grdKandahar)
se_tz_ratio <- res$Mean$StdError * nrow(grdKandahar)
```
The estimated total poppy area is `r formatC(tz_ratio, 0, format = "f", big.mark = ",")` ha, and the estimated standard error is `r formatC(se_tz_ratio, 0, format = "f", big.mark = ",")` ha.
The alternative is to estimate the total poppy area by the $\pi$ estimator. Function `vsb` of package **BalancedSampling** can be used to estimate the standard error of the $\pi$ estimator.
```{r}
tz_HT <- sum(myGRTSsample$poppy / myGRTSsample$ip)
Xsample <- st_coordinates(myGRTSsample)
se_tz_HT <- sqrt(vsb(myGRTSsample$ip, myGRTSsample$poppy, Xsample))
```
The estimated total is `r formatC(tz_HT, 0, format = "f", big.mark = ",")` ha, and the estimated standard error is `r formatC(se_tz_HT, 0, format = "f", big.mark = ",")` ha.
## Balanced sampling with spreading {#BalancedandSpreaded}
As mentioned in the introduction to this chapter, a sample balanced on a covariate still may have a poor spreading along the axis spanned by the covariate. @Grafstrom2013 presented a method for selecting balanced samples that are also well-spread in the space spanned by the covariates, which they refer to as doubly balanced sampling\index{Doubly balanced sampling}. If we take one or more covariates as balancing variables and, besides, Easting and Northing as spreading variables, this leads to balanced samples with good *geographical* spreading. When the residuals of the regression model show spatial structure (are spatially correlated), the estimated population mean of the study variable becomes more precise thanks to the improved geographical spreading. Balanced samples with spreading can be selected with function `lcube` of package **BalancedSampling**. This is illustrated with Eastern Amazonia, using as before lnSWIR2 for balancing the sample.
```{r lcube}
library(BalancedSampling)
N <- nrow(grdAmazonia)
n <- 100
Xbal <- cbind(rep(1, times = N), grdAmazonia$lnSWIR2)
Xspread <- cbind(grdAmazonia$x1, grdAmazonia$x2)
pi <- rep(n / N, times = N)
set.seed(314)
units <- lcube(Xbal = Xbal, Xspread = Xspread, prob = pi)
mysample <- grdAmazonia[units, ]
```
```{r DoublyBalancedSampleAmazonia, echo = FALSE, out.width = '100%', fig.cap = "Sample balanced on lnSWIR2 with geographical spreading from Eastern Amazonia."}
ggplot(data = grdAmazonia) +
geom_raster(mapping = aes(x = x1 / 1000, y = x2 / 1000, fill = lnSWIR2)) +
geom_point(data = mysample, mapping = aes(x = x1 / 1000, y = x2 / 1000), size = 2) +
scale_fill_viridis_c(name = "lnSWIR2") +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed()
```
The selected sample is shown in Figure \@ref(fig:DoublyBalancedSampleAmazonia). Comparing this sample with the balanced sample of Figure \@ref(fig:BalancedSampleAmazonia) shows that the geographical spreading of the sample is improved, although there still are some close points. I used equal inclusion probabilities, so the $\pi$ estimate of the mean is equal to the sample mean, which is equal to `r formatC(mean(mysample$AGB), 1, format = "f")` 10^9^ kg ha^-1^.
The variance of the $\pi$ estimator of the mean can be estimated by (equation (7) of @Grafstrom2013)
\begin{equation}
\widehat{V}(\hat{\bar{z}}) = \frac{n}{n-p}\frac{p}{p+1} \sum_{k \in \mathcal{S}}(1-\pi_k) \left(\frac{e_k}{\pi_k}-\bar{e}_k \right)^2 \;,
(\#eq:VarmeanDoublyBalanced)
\end{equation}
with $p$ the number of balancing variables, $e_k$ the regression model residual of unit $k$ (Equation \@ref(eq:residualsBalanced)), and $\bar{e}_k$ the local mean of the residuals of this unit, computed by
\begin{equation}
\bar{e}_k = \frac{\sum_{j=1}^{p+1}(1-\pi_j)\frac{e_j}{\pi_j}}{\sum_{j=1}^{p+1}(1-\pi_j)}\;.
(\#eq:localmeanresdual)
\end{equation}
The variance estimator can be computed with functions `localmean_weight` and `localmean_var` of package **spsurvey** [@spsurvey].
```{r}
library(spsurvey)
pi <- rep(n / N, n)
c <- (1 - pi)
b <- estimate_b(z = mysample$AGB / pi, X = Xbal[units, ] / pi, c = c)
zpred <- Xbal %*% b
e <- mysample$AGB - zpred[units]
weights <- localmean_weight(x = mysample$x1, y = mysample$x2, prb = pi, nbh = 3)
v_mz <- localmean_var(z = e / pi, weight_1st = weights) / N^2
```
```{r, echo = FALSE, eval = FALSE}
#Alternative for variance estimation, but nbh equals 4 and cannot be changed
mysample$e <- mysample$AGB - zpred[units]
mysample$weight <- 1 / pi
mysample$siteID <- paste0("siteID", 1:nrow(mysample))
res <- cont_analysis(mysample, vars = "e", siteID = "siteID", weight = "weight", xcoord = "x1", ycoord = "x2", statistics = "Mean")
v_mz <- res$Mean$StdError
```
The estimated standard error is `r formatC(sqrt(v_mz), 1, format = "f")` 10^9^ kg ha^-1^, which is considerably smaller than the estimated standard error of the balanced sample without geographical spreading.
```{r, echo = FALSE}
rm(list = ls())
```