-
Notifications
You must be signed in to change notification settings - Fork 14
/
21-IntroSampling4Mapping.Rmd
187 lines (137 loc) · 17.3 KB
/
21-IntroSampling4Mapping.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
# (PART) Sampling for mapping {-}
# Introduction to sampling for mapping {#IntroSamplingforMapping}
## When is probability sampling not required?
This second part of the book deals with sampling for mapping, i.e., for predicting the study variable at the nodes of a fine discretisation grid. For mapping, a model-based sampling approach is the most natural option. When a statistical model, i.e., a model containing an error term modelled by a probability distribution, is used to map the study variable from the sample data, selection of the sampling units by probability sampling is not strictly needed anymore in order to make statistical statements about the population, i.e., statements with quantified uncertainty, see Section \@ref(DBvsMB). As a consequence, there is room for optimising the sampling units by searching for those units that lead to the most accurate map, for instance, the map with the smallest squared prediction error averaged over all locations in the mapped study area, see Chapter \@ref(Validation).
As an illustration, consider the following statistical model to be used for mapping: a simple linear regression model for the study variable to be mapped:
\begin{equation}
Z_k = \beta_0 + \beta_1 x_k + \epsilon_k \;,
(\#eq:simplelinearregressionmodel)
\end{equation}
with $Z_k$ the study variable of unit $k$, $\beta_0$ and $\beta_1$ regression coefficients, $x_k$ a covariate for unit $k$ used as a predictor, and $\epsilon_k$ the error (residual) at unit $k$, normally distributed with mean zero and a constant variance $\sigma^2$. The errors are assumed independent, so that $\text{Cov}(\epsilon_k,\epsilon_j)=0$ for all $k \neq j$. Figure \@ref(fig:twosamples) shows a simple random sample without replacement and the sample optimised for mapping with a simple linear regression model. Both samples are plotted on a map of the covariate $x$.
```{r, echo = FALSE}
#define discretisation grid
s1 <- s2 <- 1:20 - 0.5
grdSquare <- expand.grid(s1, s2)
names(grdSquare) <- c("s1", "s2")
N <- nrow(grdSquare)
#define residual semivariogram for simulation
vgmodel <- vgm(model = "Exp", psill = 25, range = 2, nugget = 0)
#compute matrix with covariances between discretisation points
dists1 <- outer(grdSquare$s1, grdSquare$s1, FUN = "-")
dists2 <- outer(grdSquare$s2, grdSquare$s2, FUN = "-")
dist <- sqrt(dists1^2 + dists2^2)
#compute matrix with covariances
C <- variogramLine(vgmodel, dist_vector = dist, covariance = TRUE)
#now simulate values for grid by Cholesky decomposition
Upper <- chol(C)
#simulate random numbers from standard normal distribution
set.seed(31415)
G <- rnorm(n = nrow(grdSquare), 0, 1)
e <- crossprod(Upper, G)
mu <- 15
grdSquare$x <- e + mu
ord <- order(grdSquare$x)
grdSquare <- grdSquare[ord, ]
#compute values of study variable as linear combination of covariate + white noise
sigma_resid <- 2
set.seed(31415)
grdSquare$z <- 2 + 0.5 * grdSquare$x + rnorm(n = nrow(grdSquare), mean = 0, sd = sigma_resid)
#save(grdSquare,file = "results/SimulatedSquare.rda")
#select simple random sample without replacement
n <- 16
idsrandom <- sample(400, n)
randomsample <- grdSquare[idsrandom, ]
model.random <- lm(z ~ x, data = randomsample)
det.random <- det(vcov(model.random))
C.random <- vcov(model.random)
#select optimal sample for simple linear regression model; grid cells are ordered on x
optsample <- grdSquare[c(1, 2, 3, 4, 5, 6, 7, 8, 393, 394, 395, 396, 397, 398, 399, 400), ]
model.opt <- lm(z ~ x, data = optsample)
det.opt <- det(vcov(model.opt))
C.opt <- vcov(model.opt)
```
```{r twosamples, echo = FALSE, out.width = '100%', fig.cap = "Simple random sample and optimal sample for mapping with a simple linear regression model, plotted on a map of the covariate."}
mysamples <- rbind(randomsample, optsample)
mysamples$design <- factor(rep(c("Random", "Optimal"), each = 16), levels = c("Random", "Optimal"), ordered = TRUE)
ggplot(data = grdSquare) +
geom_raster(mapping = aes(x = s1, y = s2, fill = x)) +
geom_tile(data = mysamples, mapping = aes(x = s1, y = s2), colour = "white", size = 0.8, width = 1, height = 1, fill = NA) +
scale_fill_viridis_c(name = "x") +
scale_x_continuous(name = "Easting") +
scale_y_continuous(name = "Northing") +
facet_wrap(~ design, ncol = 2, nrow = 1) +
coord_fixed()
```
The optimal sample for mapping with a simple linear regression model contains the units with the smallest and the largest values of the covariate $x$. The optimal sample shows strong spatial clustering. Spatial clustering is not avoided because in a simple linear regression model we assume that the residuals are not spatially correlated. In Chapter \@ref(MBSamplePattern) I will show that when the residuals are spatially correlated, spatial clustering of sampling units is avoided. The standard errors of both regression coefficients are considerably smaller for the optimal sample (Table \@ref(tab:sebetas)). The joint uncertainty about the two regression coefficients, quantified by the determinant of the variance-covariance matrix\index{Determinant of variance-covariance matrix} of the regression coefficient estimators, is also much smaller for the optimal sample. When we are less uncertain about the regression coefficients, we are also less uncertain about the regression model predictions of the study variable $z$ at points where we have observations of the covariate $x$ only. We can conclude that for mapping with a simple linear regression model, in this example simple random sampling is not a good option.
```{r sebetas, echo = FALSE}
tbl <- data.frame(x = c("SI", "Optimal"), y = c(formatC(sqrt(C.random[1,1]), 2, format = "f"), formatC(sqrt(C.opt[1,1]), 2, format = "f")), z = c(formatC(sqrt(C.random[2,2]), 3, format = "f"), formatC(sqrt(C.opt[2,2]), 3, format = "f")), d = c(formatC(det.random, 6, format = "f"), formatC(det.opt, 6, format = "f")))
knitr::kable(
tbl, caption = "Standard errors and determinant of the variance-covariance matrix of estimators of the regression coefficients for the simple random sample (SI) and the optimal sample.",
col.names = c("Sampling design", "se intercept", "se slope", "Determinant"),
booktabs = TRUE,
linesep = ""
) %>%
kable_classic()
```
Of course, this simple example would only be applicable if we have evidence of a linear relation between study variable $z$ and covariate $x$, and in addition if we are willing to rely on the assumption that the residuals are not spatially correlated.
## Sampling for simultaneously mapping and estimating means
Although probability sampling is not strictly needed for mapping with a statistical model, in some situations, when feasible, it can still be advantageous to select a probability sample. If the aim of the survey is to map the study variable, as well as to estimate the mean or total for the entire study area or for several subareas, probability sampling can be a good option. Think, for instance, of sampling for the dual objectives of mapping and at the same time estimating soil carbon stocks. Although the statistical model used for mapping can also be used for model-based prediction of the total carbon stocks in the study area and subareas (Section \@ref(SmallAreaModelBased)), we may prefer to estimate these totals by design-based or model-assisted inference. The advantage of design-based and model-assisted estimation of these totals is their validity\index{Validity}. Validity means that an objective assessment of the uncertainty of the estimated mean or total is warranted, and that the coverage of confidence intervals is (almost) correct, provided that the sample is large enough to assume an approximately normal distribution of the estimator and design-unbiasedness of the variance estimator, see Chapter \@ref(Approaches). In design-based estimation, no model of the spatial variation is used. Therefore, discussions are avoided about how realistic modelling assumptions are. In model-assisted estimation, these discussions are irrelevant as well, because we do not rely on these assumptions. A poor model results in large variances of the estimated mean or total and, as a consequence, a wide confidence interval, so that the coverage of the confidence interval is in agreement with the nominal coverage, see Section \@ref(ModelassistedvsModeldependent) for more details.
The question then is: what is a suitable probability sampling design for both aims? First, I would recommend a sampling design with equal inclusion probabilities. This is because in standard model-based inference unequal inclusion probabilities are not accounted for, which may lead to systematic prediction errors when small or large values are overrepresented in the sample (Section \@ref(BiasandVariance)).
Second, in case we have subareas of which we would like to estimate the mean or total (domains of interest), using these subareas as strata in stratified random sampling makes sense, unless there are too many. This requires that of all population units (nodes of discretisation grid) we must know to which subarea it belongs, so that this information can be added to the sampling frame.
Third, a sampling design spreading the sampling units in geographical space is attractive as well, for instance through compact geographical stratification (Section \@ref(geostrata)) or sampling with the local pivotal method (Subsection \@ref(LPM)). We may profit from this geographical spreading if some version of kriging is used for mapping (Chapter \@ref(Introkriging)). In addition, the geographical spreading may enhance the coverage of the space spanned by covariates related to the study variable. Spreading in covariate space can also be explicitly accounted for using the covariates as spreading variables in the local pivotal method.
As an illustration, I selected a single sample of 500 units from Eastern Amazonia with the dual aim of mapping aboveground biomass (AGB) as well as estimating the means of AGB for four biomes. The biomes are used as strata in stratified random sampling. First, the stratum sample sizes are computed for proportional allocation, so that the inclusion probabilities are approximately equal for all population units.
```{r, echo = FALSE}
grdAmazonia <- read_rds(file = "results/grdAmazonia_5km.rds")
```
```{r}
grdAmazonia$Biome <- as.factor(grdAmazonia$Biome)
biomes <- c("Mangrove", "Forest_dry", "Grassland", "Forest_moist")
levels(grdAmazonia$Biome) <- biomes
N_h <- table(grdAmazonia$Biome)
n <- 500
n_h <- round(n * N_h / sum(N_h))
n_h[3] <- n_h[3] + 1
print(n_h)
```
Biome Forest_moist is by far the largest biome with a sample size of `r n_h[4]` points.
In the next code chunk, a balanced sample is selected with equal inclusion probabilities, using both the categorical variable biome and the continuous variable lnSWIR2 as balancing variables (Subsection \@ref(StratifiedsamplingasBalancedsampling)). The geographical coordinates are used as spreading variables.
```{r BalancedStratified}
library(BalancedSampling)
grdAmazonia$lnSWIR2 <- log(grdAmazonia$SWIR2)
pi <- n_h / N_h
stratalabels <- levels(grdAmazonia$Biome)
lut <- data.frame(Biome = stratalabels, pi = as.numeric(pi))
grdAmazonia <- merge(x = grdAmazonia, y = lut)
Xbal <- model.matrix(~ Biome - 1, data = grdAmazonia) %>%
cbind(grdAmazonia$lnSWIR2)
Xspread <- cbind(grdAmazonia$x1, grdAmazonia$x2)
set.seed(314)
units <- lcube(Xbal = Xbal, Xspread = Xspread, prob = grdAmazonia$pi)
mysample <- grdAmazonia[units, ]
```
Figure \@ref(fig:Sample4MappingandEstimationAmazonia) shows the selected sample.
```{r Sample4MappingandEstimationAmazonia, echo = FALSE, out.width = '100%', fig.asp = 0.5, fig.cap = "Balanced sample of size 500 from Eastern Amazonia, balanced on biome and lnSWIR2, with geographical spreading. Equal inclusion probabilities are used."}
ggplot(data = grdAmazonia) +
geom_raster(mapping = aes(x = x1 / 1000, y = x2 / 1000, fill = lnSWIR2)) +
scale_fill_viridis_c(name = "lnSWIR2") +
geom_point(data = mysample, mapping = aes(x = x1 / 1000, y = x2 / 1000, shape = Biome), size = 1.5) +
scale_shape_manual(values = c(0, 1, 2, 3), name = "") +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed()
```
I think this is a suitable sample, both for mapping AGB across the entire study area, for instance by kriging with an external drift (Section \@ref(IntroKED)), and for estimating the mean AGB of the four biomes. For biome Forest_moist, the population mean can be estimated from the data of this biome only, using the $\pi$ estimator, as the sample size of this biome is very large (Section \@ref(BalancedandSpreaded)). For the other three biomes, we may prefer model-assisted estimation for small domains as described in Section \@ref(SmallDomainsModelAssisted).
In this example I used one quantitative covariate, lnSWIR2, for balancing the sample. If we have a legacy sample that can be used to fit a linear or non-linear model, for instance a random forest using multiple covariates and factors as predictors (Chapter \@ref(Modelassisted)), then this model can be used to predict the study variable for all population units, so that we can use the predictions of the study variable to balance the sample, see Section \@ref(RandomForest).
## Broad overview of sampling designs for mapping
The non-probability sampling designs for mapping described in the following chapters can be grouped into three categories [@Brus2019b]:
1. geometric sampling designs\index{Geometric sampling design} (Chapters \@ref(RegularGridSpatialCoverage) and \@ref(kmeans));
2. adapted experimental designs\index{Adapted experimental design} (Chapters \@ref(cLHS) and \@ref(SpatialResponseSurface)); and
3. model-based sampling designs (Chapters \@ref(MBgridspacing) and \@ref(MBSamplePattern)).
Square and triangular grids are examples of geometric sampling designs; the sampling units show a regular, geometric spatial pattern. In other geometric sampling designs the spatial pattern is not perfectly regular. Yet these are classified as geometric sampling designs when the samples are obtained by minimising some geometric criterion, i.e., a criterion defined in terms of distances between the sampling units and the nodes of a fine prediction grid discretising the study area (Section \@ref(SpatialCoverage) and Chapter \@ref(kmeans)).
In model-based sampling designs\index{Model-based sampling}, the samples are obtained by minimising a criterion that is defined in terms of variances of prediction errors. An example is the mean kriging variance criterion, i.e., the average of the kriging variances over all nodes of the prediction grid. Model-based sampling therefore requires prior knowledge of the model of spatial variation. Such a model must be specified and justified. Once this model is given, the sample can be optimised. In Chapter \@ref(MBgridspacing) I will show how a spatial model can be used to optimise the spacing of a square grid given a requirement on the accuracy of the map. The grid spacing determines the number of sampling units, so this optimisation boils down to determining the required sample size. In Chapter \@ref(MBSamplePattern) I will show how a sample of a given size can be further optimised through optimisation of the spatial coordinates of the sampling units.
In Chapter \@ref(GeneralIntro) the design-based and model-based approaches for sampling and statistical inference were introduced. Note that a model-based approach does not necessarily imply model-based sampling. The adjective 'model-based' refers to the model-based inference, not to the selection of the units. In a model-based approach sampling units can be, but need not be, selected by model-based sampling. If they are, then both in selecting the units and in mapping a statistical model is used. In most cases, the two models differ: once the sample data are collected, these are used to update the postulated model used for designing the sample. The updated model is then used in mapping.
Besides geometric and model-based sampling designs for a spatial survey, a third category can be distinguished: sampling designs that are adaptations of experimental designs. An adaptation is necessary because in contrast to experiments, in observational studies one is not free to choose combinations of levels of different factors. For instance, when two covariates are strongly positively correlated, it may happen that there are no units with a relatively large value for one covariate and a relatively small value for the other covariate.
In a full factorial design\index{Full factorial design}, all combinations of factor levels are observed. For instance, suppose we have only two covariates, e.g., application rates for N and P in an agricultural experiment, and four levels for each covariate. To account for possible non-linear effects, a good option is to have multiple plots for all 4 $\times$ 4 combinations. This is referred to as a full factorial design. With $k$ factors and $l$ levels per factor the total number of observations is $l^k$. With numerous factors and/or numerous levels per factor, this becomes unfeasible in practice. Alternative designs have been developed that need fewer observations but still provide detailed information about how the study variable responds to changes in the factor levels. Examples are Latin hypercube samples\index{Latin hypercube sample} and response surface designs\index{Response surface design}. The survey sampling analogues of these experimental designs are described in Chapters \@ref(cLHS) and \@ref(SpatialResponseSurface).
```{r, echo = FALSE}
rm(list = ls())
```