-
Notifications
You must be signed in to change notification settings - Fork 14
/
01-GeneralIntroduction.Rmd
392 lines (285 loc) · 34.3 KB
/
01-GeneralIntroduction.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
# Introduction {#GeneralIntro}
This book is about sampling for spatial *surveys*. A survey\index{Survey} is an inventory of an object of study about which statistical statements will be made based on data collected from that object. The object of study is referred to as the population of interest or target population. Examples are a survey of the organic carbon stored in the soil of a country, the water quality of a lake, the wood volume in a forest, the yield of rice in a country, etc. In these examples, soil organic carbon, water quality, wood volume, and rice yield are the study variables, i.e., the variables of which we want to estimate the population mean or some other parameter, or which we want to map. So, this book is about *observational research*, not about experiments. In experiments, observations are done under controlled circumstances; think of an experiment on crop yields as a function of application rates of fertiliser. Several levels of fertiliser application rate are chosen and randomly assigned to experimental plots. In observational research, factors that influence the study variable are not controlled. This implies that in observational research no conclusions can be drawn on causal relations.
If the whole population is observed, this is referred to as a *census*\index{Census}. In general, we cannot afford such a census. Only some parts of the population are selected and the study variable is observed (measured) for these selected parts (the population units in the sample) only. Such a survey is referred to as a *sample survey*\index{Sample survey}. The observations are subsequently used to derive characteristics of the whole population. For instance, to estimate the wood volume in a forest, we cannot afford to measure the wood volume of every tree in the forest. Instead, some trees are selected, the wood volume of these trees is measured, and based on these measurements the total wood volume in the forest is estimated.
## Basic sampling concepts {#BasicConcepts}
In this book the populations of interest have a spatial dimension. In selecting parts of such populations for observation, we may account for the spatial coordinates of the parts, but this is not strictly needed. Examples of spatial sampling designs are designs selecting sampling units that are spread throughout the study area, often leading to more precise estimates of the population mean or total as compared to sampling designs resulting in spatial clusters of units.
Two types of populations can be distinguished: discrete and continuous populations. *Discrete populations* consist of discrete natural objects, think of trees, agricultural fields, lakes, etc. These objects are referred to as *population units*\index{Population unit}. The total number of population units in a discrete population\index{Population!discrete population} is finite. A finite spatial population of discrete units can be denoted by $\mathcal{U}=\{u(\mathbf{s}_1),u(\mathbf{s}_2), \dots , u(\mathbf{s}_N)\}$, with $u(\mathbf{s}_k)$ the unit located at $\mathbf{s}_k$, where $\mathbf{s}$ is a vector with spatial coordinates. The population units naturally serve as the *elementary sampling units*\index{Elementary sampling unit}. In this book the spatial populations are two-dimensional, so a vector $\mathbf{s}$ has two coordinates, Easting and Northing.
Other populations may, for the purpose of sampling, be considered as a physical continuum, e.g., the soil in a region, the water in a lake, the crop on a field.
```{block2, type='rmdnote'}
If interest lies in crop properties per areal unit of the field, the population is continuous. However, if interest lies in properties per plant, the population is discrete and finite.
```
Such continuous spatial populations can be denoted by $\mathcal{U}=\{u(\mathbf{s}), \mathbf{s} \in \mathcal{A} \}$, with $\mathcal{A}$ the study area. Discrete objects that can serve as elementary sampling units do not exist in *continuous populations*\index{Population!continuous population}. Therefore, we must define the elementary sampling units. The elementary sampling units can be areal units, e.g., 10 m squares; or circular plots, e.g., with a radius of 5 m; or "points", i.e., units of such a small area, compared to the area of the population, that the area of the units can be ignored.
In this book a population unit and an elementary sampling unit can be an individual object of a discrete population as well as an areal sampling unit\index{Areal sampling unit} or a point of a continuous population.
The size and geometry of the elementary units used in sampling a continuous population is referred to as the sample support\index{Sample support}. The total number of elementary sampling units in a continuous population can be finite, e.g., all 25 m $\times$ 25 m (disjoint) raster cells in an area (raster cells in Figure \@ref(fig:support)), or infinite, e.g., all points in an area, or all squares or circular plots with a given radius that are allowed to overlap in an area (circles in Figure \@ref(fig:support)).
```{r support, echo=FALSE, out.width='100%', fig.cap="Three sample supports: points, squares, and circles. With disjoint squares, the population is finite. With points, and squares or circles that are allowed to overlap, the population is infinite."}
#Point support
set.seed(314)
s1 <- runif(10, min = -4, max = 104)
s2 <- runif(10, min = -4, max = 104)
mysisample <- data.frame(x = s1, y = s2)
plt1 <- ggplot() +
geom_tile(aes(x = 50, y = 50), width = 100, height = 100, fill = "grey") +
geom_point(data = mysisample, aes(x = x, y = y), size = 1) +
scale_x_continuous(name = "", limits = c(0, 100)) +
scale_y_continuous(name = "", limits = c(0, 100)) +
coord_fixed()
#Square support
x <- y <- seq(from = 5, to = 95, by = 10)
grid <- expand.grid(x, y)
names(grid) <- c("x", "y")
units <- sample(nrow(grid), size = 10, replace = FALSE)
mysisample <- grid[units, ]
plt2 <- ggplot(data = grid) +
geom_tile(mapping = aes(x = x, y = y), width = 10, height = 10, fill = "grey") +
geom_tile(data = mysisample, mapping = aes(x = x, y = y), colour = "red", size = 0.8, width = 10, height = 10, fill = NA) +
scale_x_continuous(name = "") +
scale_y_continuous(name = "") +
coord_fixed()
#Circle support
set.seed(315)
s1 <- runif(10, min = 0, max = 100)
s2 <- runif(10, min = 0, max = 100)
circles <- data.frame(s1, s2)
plt3 <- ggplot() +
geom_tile(aes(x = 50, y = 50), width = 100, height = 100, fill = "grey") +
geom_point(data = circles, aes(x = s1, y = s2), size = 1) +
geom_circle(data = circles, aes(x0 = s1, y0 = s2, r = 4)) +
scale_x_continuous(name = "") +
scale_y_continuous(name = "") +
coord_fixed()
grid.arrange(plt1, plt2, plt3, nrow = 1)
```
Ideally, with areal elementary sampling units, the selected elementary units are exhaustively observed, so that a measurement of the total or mean of the study variable within an areal unit is obtained, think for instance of the total aboveground biomass. In some cases this is not feasible, think for instance of measuring the mean of some soil property in 25 m squares. In this case, a sample of points is selected from each selected square, and the measurement is done at the selected points. These measurements at points are used to estimate the mean of the squares. @Stehman2018 introduced the concept of a response design\index{Response design} as "the protocol used to determine the reference condition of an element of the population". So, in the case just mentioned the response design is the sampling design and the estimator for the mean of the soil property of the 25 m squares.
Ideally, the sample support is constant, but in some situations a varying sample support cannot be avoided. Think, for instance, of square sampling units in an irregularly shaped study area. Near the border of the study area, there are squares that cross the border. The part of a square that falls outside the study area is not observed. So, the support of the observations of squares crossing the border is smaller than that of the observations of squares in the interior of the study area. See also Section \@ref(SIcircularplots).
To sample a finite spatial population, the population units are listed in a data frame. This data frame contains the spatial coordinates of the population units and other information needed for selecting sampling units according to a specific design. Think, for instance, of the labels of more or less homogeneous subpopulations (used as strata in stratified random sampling, see Chapter \@ref(STSI)) and the labels of clusters of population units, for instance, all units in a polygon of a map (used in cluster random sampling, see Chapter \@ref(Cl)). Besides, if we have information about covariates possibly related to the study variable, which we would like to use in selecting the population units, these covariates are added to the list. The list used for selecting sampling units is referred to as the *sampling frame*\index{Sampling frame}.
If the elementary sampling units are disjoint square grid cells (sample support is a square), the population is finite and the grid cells can be selected through selection of their centres (or any other point that uniquely identifies a grid cell) listed in the sampling frame.
In this book also continuous populations are sampled using a list as a sampling frame. The infinite population is discretised by the
cells of a fine discretisation grid. The grid cells are listed in the sampling frame by the spatial coordinates of the *centres* of the grid cells. So, the infinite population is represented by a finite list of points. The advantage of this is that existing **R** packages for sampling of finite populations can also be used for sampling infinite populations.
If the elementary sampling units are points (sample support is a point), the population is infinite. In this case, sampling of points can be implemented by a two-step approach. In the first step, cells of the discretisation grid are selected with or without replacement, and in the second step one or more points are selected within the selected grid cells. Figure \@ref(fig:SamplingFromInfinitePopulation) is an illustration of this two-step approach for simple random sampling of points from a discretised infinite population. Ten grid cells are selected by simple random sampling with replacement. Every time a grid cell is selected, one point is randomly selected from that grid cell. Note that a grid cell can be selected more than once, so that more than one point will be selected from that grid cell. Note also that we may select a point that falls outside the boundary of the study area. This is actually the case with one grid cell in Figure \@ref(fig:SamplingFromInfinitePopulation). The points outside the study area are discarded and replaced by a randomly selected new point inside the study area. Finally, note that near the boundary there are small areas not covered by a grid cell, so that no points can be selected in these areas. It is important that the discretisation grid is fine enough to keep the discretisation error\index{Discretisation error} so small that it can be ignored. The alternative is to extend the discretisation grid beyond the boundaries of the study area so that the full study area is covered by grid cells.
```{r SamplingFromInfinitePopulation, echo = FALSE, fig.width=5, fig.cap = "Sampling of points from discretised infinite population. The grid cells are randomly selected with replacement. Each time a grid cell is selected, a point is randomly selected from that grid cell."}
# initialize pseudo random number generator
set.seed(314)
# set cell size
cellsize <- 2
# read field
field <- read_sf(system.file("extdata/melle.gpkg", package = "sswr")) %>%
st_set_crs(value = NA_crs_)
# select grid cells within the field
grid <- st_make_grid(field, cellsize = cellsize, what = "polygons")
grid <- grid[field] # alternative to 'sp::over'
# select centers of grid cells within the field
gridcenters <- st_make_grid(field, cellsize = cellsize, what = "centers")
gridcentres <- gridcenters[field]
# randomly select grid cells and a location within each selected grid cell
sample <- gridcentres %>%
st_coordinates %>%
as_tibble %>%
slice_sample(n = 10, replace = TRUE) %>%
mutate(
X = jitter(X, amount = 0.5 * cellsize),
Y = jitter(Y, amount = 0.5 * cellsize)) %>%
st_as_sf(coords = c("X", "Y"))
# plot result
ggplot() +
geom_sf(data = field, fill = "grey") +
geom_sf(data = grid, fill = NA) +
geom_sf(data = sample, alpha = 0.5, colour = "red") +
scale_x_continuous(
name = "Easting (km)",
labels = function(x) {1.0e-3 * x}) +
scale_y_continuous(
name = "Northing (km)",
labels = function(x) {1.0e-3 * x})
```
### Population parameters {#PopulationParameters}
The sample data are used to estimate characteristics of the whole population\index{Population parameter}, e.g., the population mean\index{Population mean} or total\index{Population total}; some quantile, e.g., the median or the 90th percentile; or even the entire cumulative frequency distribution.
A finite population total is defined as
\begin{equation}
t(z) = \sum_{k \in \mathcal{U}} z_k = \sum_{k=1}^N z_k \;,
(\#eq:FinitePopTotal)
\end{equation}
with $N$ the number of population units and $z_k$ the study variable for population unit $k$. A finite population mean is defined as a finite population total divided by $N$.
An infinite population total is defined as an integral of the study variable over the study area:
\begin{equation}
t(z) = \int_{\mathbf{s} \in \mathcal{A}} z(\mathbf{s}) \;\mathrm{d}\mathbf{s} \;.
(\#eq:InfinitePopTotal)
\end{equation}
An infinite population mean is defined as a finite population total divided by the area, $A$, covered by the population.
A finite population proportion\index{Population proportion} is defined as the population mean of an 0/1 indicator $y$ with value 1 if the condition is satisfied, and 0 otherwise:
\begin{equation}
p=\frac{\sum_{k=1}^N y_k}{N} \;.
(\#eq:PopulationProportion)
\end{equation}
A cumulative distribution function\index{Cumulative distribution function} (CDF) is defined as
\begin{equation}
F(z)=\sum_{z^\prime \leq z} p(z^\prime) \;,
(\#eq:CDF)
\end{equation}
with $p(z^\prime)$ the proportion of population units whose value for the study variable equals $z^\prime$.
A population quantile\index{Population quantile}, for instance the population median or the population 90th percentile, is defined as
\begin{equation}
q_p= F^{-1}(p) \;,
(\#eq:PopQuantile)
\end{equation}
where $p$ is a number between 0 and 1 (e.g., 0.5 for the median, 0.9 for the 90th percentile), and $F^{-1}(p)$ is the smallest value of the study variable $z$ satisfying $F(z)\geq p$.
In surveys of spatial populations, the aim can also be to make a map of the population.
```{block2, type='rmdnote'}
The parameters defined in this subsection are parameters of spatial populations, i.e., populations observed in a relatively short period of time related to the dynamics of the study variable. We assume that the study variable does not change during the survey period. In Chapter \@ref(RepeatedSurveys) parameters are defined for space-time populations.
```
### Descriptive statistics vs. inference about a population
When we observe only a (small) part of the population, we are uncertain about the population parameter estimates and the map of the population. By using statistical methods, we can quantify how uncertain we are about these results. In decision making it can be important to take this uncertainty into account. An example is a survey of water quality. In Europe the concentration levels of nutrients are regulated in the European Water Framework Directive. To test whether the mean concentration of a nutrient complies with its standard, it is important to account for the uncertainty in the estimated mean. When the estimated mean is just below the standard, there is still a large probability that the population mean exceeds the standard. This example shows that it is important to distinguish computing descriptive statistics from characterising the population using the sample data. For instance, we can compute the sample mean (average of the sample data) without error, but if we use this sample mean as an *estimate* of the population mean, there is certainly an error in this estimate.
### Random sampling vs. probability sampling
Many sampling methods are available. At the highest level, one may distinguish random from non-random sampling methods. In random sampling, a subset of population units is randomly selected from the population, using a (pseudo) random number generator. In non-random sampling, no such random number generator is used. Examples of non-random sampling are (i) convenience sampling, i.e., sampling at places that are easy to access, e.g., along roads; (ii) arbitrary sampling, i.e., sampling without a specific purpose in mind; and (iii) targeted sampling, e.g., at sites suspected of soil pollution.
In the literature, the term \emph{random sampling} is often used for arbitrary sampling\index{Arbitrary sampling}, i.e., sampling without a specific purpose in mind. To avoid confusion the term *probability sampling*\index{Probability sampling} is used for random sampling using a (pseudo) random number generator, so that for any unit in the population the probability of selecting that unit is known. More precisely, a probability sample is a sample from a population such that every unit of the population has a positive probability of being included in the sample. Besides, these *inclusion probabilities* must be known, at least for the selected units, as they are needed in estimation. This is explained in following chapters.
## Design-based vs. model-based approach {#DBvsMB}
The choice between probability or non-probability sampling\index{Non-probability sampling} is closely connected with the choice between the design-based\index{Design-based approach} or the model-based approach\index{Model-based approach} for sampling and statistical inference (estimation, hypothesis testing). The difference between these two approaches is a rather technical subject, so, not to discourage you already in this very first chapter, I will keep it short. In Chapter \@ref(Approaches) I elaborate on the fundamental difference of these two approaches and a third approach, the model-assisted approach, which can be seen as a compromise of the design-based and the model-based approach.
```{r approach, echo=FALSE}
approach <- data.frame(Approach = c("Design-based", "", "Model-based"), Sampling = c("Probability sampling required", "", "Probability sampling not required"), Inference = c("Based on sampling distribution", "(no model used)", "Based on statistical model"))
knitr::kable(
approach, caption = "Statistical approaches for sampling and inference.",
booktabs = TRUE,
linesep = ""
) %>%
kable_classic()
```
In the design-based approach, units are selected by probability sampling (Table \@ref(tab:approach)). Estimates are based on the inclusion probabilities of the sampling units as determined by the sampling design (design-based inference). No model is used in estimation. On the contrary, in the model-based approach a statistical model is used in prediction, i.e., a model with a random error term, for instance a regression model. As the model already contains a random error term, probability sampling is not required in this approach.
Which statistical approach is best largely depends on the aim of the survey, see @bru97 and @gru06. Broadly speaking, the following aims can be distinguished:
1. estimating parameters for the population;
2. estimating parameters for several subpopulations; and
3. mapping the study variable.
```{block2, type = 'rmdnote'}
A map of the study variable is obtained by predicting the study variable at the points of a very fine grid that discretises the study area, or by predicting the means of the study variable for fine grid cells. Many mapping methods are available. In this book a statistical model is applied to predict the study variable, for instance a linear regression model, or a spatial linear mixed model as used in kriging.
```
When the aim is to map the study variable, a model-based approach is the most natural option. This implies that for this aim probability sampling is not necessarily required. In principle, both approaches are suitable for estimating (sub)population parameters\index{Population parameter}. The more subpopulations\index{Subpopulation} are distinguished, the more attractive a model-based approach becomes. Model-based estimates of the subpopulation means or totals are potentially more accurate (depending on how good the model is) than model-free design-based estimates. On the other hand, an advantage of design-based estimation is that an objective assessment of the uncertainty of the estimated mean or total is warranted, and that the coverage of confidence intervals is (almost) correct.
A probability sample can also be used in model-based inference. This flexibility can be attractive when we have a dual aim, mapping as well as estimation of parameters of (sub)populations. When units are not selected by probability sampling, model-free design-based estimation is impossible, and model-based prediction is the only option.
## Populations used in sampling experiments {#Datasets}
In this book various data sets are used to illustrate the sampling designs. Four data sets, Voorst, Kandahar, Eastern Amazonia, and the Iberian Peninsula (Spain and Portugal), are exhaustive, i.e., for all population units, data of the study variable and ancillary data are available. The first two exhaustive data sets\index{Exhaustive data set} are obtained through simulation\index{Simulation}, i.e., by drawing numbers from a probability distribution. Sample data from these two study areas are used to calibrate a statistical model. This model is subsequently used to simulate values of the study variable for all population units. Voorst actually is an infinite population of points. However, this study area is discretised by the cells of a fine grid, and the study variable, the soil organic matter (SOM) concentration, is simulated for all centres of the grid cells. Kandahar is a finite population consisting of 965 squares of size 5 km $\times$ 5 km. The study variable is the area cultivated with poppy. Eastern Amazonia is a map in raster format, with a resolution of 1 km $\times$ 1 km. The study variable is the aboveground biomass as derived from remote sensing images. The aboveground biomass value of a raster cell is treated as the average biomass of that raster cell. The data set Iberian Peninsula is a time series of four maps in raster format with a resolution of 30 arc sec. The study variable is the annual mean air temperature at two metres above the earth surface in $^\circ$C.
The exhaustive data sets are used in the first part of this book on probability sampling for estimating population parameters. By taking the population as the reality, we know the population parameters. Also, for any randomly selected sample from this population, the study variable values for the selected sampling units are known, so that we can *estimate* the population parameters from this sample. An estimated population parameter can then be compared with the population parameter. The difference between these two is the *sampling error*\index{Sampling error} in the estimated population parameter. This opens up the possibility of repeating the random selection of samples with a given sampling design a large number of times, estimating the population parameter for every sample, so that a frequency distribution of the estimated population parameter is obtained. Ideally, the mean of this frequency distribution, referred to as the *sampling distribution*\index{Sampling distribution}, is equal to the population parameter (mean sampling error equals zero), and the variance of the estimated population parameters is small. Another advantage is that sampling designs can be compared on the basis of the sampling distribution, for instance the sampling distributions of the estimator of the population mean with stratified random sampling and simple random sampling, to evaluate whether the stratification leads to more accurate estimates of the population mean.
Furthermore, various data sets are used with data for a sample of population units only. These data sets are described at places where they are first used.
All data sets are available by installing the **R** package **sswr**. This package can be installed from github with function `install_github` of package **remotes** [@remotes].
```{r, eval = FALSE}
library(remotes)
install_github("DickBrus/sswr")
```
The package can then be loaded. You can see the contents of the package and of the data files by typing a question mark, followed by the name of the package or a data file.
```{r, eval = FALSE}
library(sswr)
?sswr
?grdVoorst
```
### Soil organic matter in Voorst, the Netherlands {#Voorst}
The study area of Voorst is located in the eastern part of the Netherlands. The size of the study area is 6 km $\times$ 1 km. At 132 points, samples of the topsoil were collected by graduate students of Wageningen University, which were then analysed for SOM concentrations (in g kg^-1^ dry soil) in a laboratory. The map is created by conditional geostatistical simulation of natural logs of the SOM concentration on a 25 m $\times$ 25 m grid, followed by backtransformation, using a linear mixed model with spatially correlated residuals and combinations of soil type and land use as a qualitative predictor (factor). Figure \@ref(fig:mapVoorst) shows the simulated map of the SOM concentration.
(ref:mapVoorstlabel) Simulated map of the SOM concentration (g kg^-1^) in Voorst.
```{r mapVoorst, echo = FALSE, out.width = '100%', fig.cap = "(ref:mapVoorstlabel)"}
ggplot(data = grdVoorst) +
geom_raster(mapping = aes(x = s1 / 1000, y = s2 / 1000, fill = z)) +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
scale_fill_viridis_c(name = "SOM") +
coord_fixed() +
theme(panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.background = element_blank())
```
The frequency distribution of the simulated values at all 7,528 grid cells shows that the SOM concentration is skewed to the right (Figure \@ref(fig:histogramVoorst)).
(ref:histVoorstlabel) Frequency distribution of the simulated SOM concentration (g kg^-1^) in Voorst.
```{r histogramVoorst, echo = FALSE, fig.width = 5, fig.asp = 0.7, fig.cap = "(ref:histVoorstlabel)"}
ggplot(grdVoorst) +
geom_histogram(aes(x = z), breaks = seq(from = 0, to = 500, by = 20), fill = "black", alpha = 0.5, colour = "black") +
scale_y_continuous(name = "Count") +
scale_x_continuous(name = "SOM")
```
Summary statistics are:
```{r, echo = FALSE}
summary(grdVoorst$z)
```
The ancillary information consists of a map of soil classes and a land use map, which are combined to five soil-land use combinations (Figure \@ref(fig:SoilLanduseCombinationsVoorst)). The first letter in the labels for the combinations stands for the soil type: B for *beekeerdgrond* (sandy wetland soil with gleyic properties), E for *enkeerdgrond* (sandy soil with thick anthropogenic humic topsoil), P for podzols (sandy soil with eluviated horizon below the topsoil), R for river clay soil, and X for other sandy soils. The second letter is for land use: A for agriculture (grassland, arable land) and F for forest.
```{r SoilLanduseCombinationsVoorst, echo = FALSE, out.width = '100%', fig.cap = "Soil-land use combinations in Voorst."}
ggplot() +
geom_raster(data = grdVoorst, mapping = aes(x = s1 / 1000, y = s2 / 1000, fill = stratum)) +
scale_fill_manual(
name = "",
values = c(BA = "darkgreen", EA = "brown", PA = "orange", RA = "green", XF = "grey")
) +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed() +
theme(panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.background = element_blank())
```
### Poppy fields in Kandahar, Afghanistan {#Poppy}
Cultivation of poppy for opium production is a serious problem in Afghanistan. The United Nations Office on Drugs and Crime (UNODC) monitors the area cultivated with poppy through detailed analysis of aerial photographs and satellite images. This is laborious, and therefore the analysis is restricted to a probability sample of 5 km squares. These sample data are then used to estimate the total poppy area [@UNODC2014].
In 2014 the poppy area within 83 randomly selected squares in the province of Kandahar was determined, as well as the agricultural area within all 965 squares in this province. These data were used to simulate a map of poppy area per 5 km square. The map is simulated with an ordinary kriging model for the logit transform of the proportion of the agricultural area cultivated with poppy within 5 km squares. For privacy reasons, the field was simulated *unconditionally* on these sample data. Figure \@ref(fig:mapsKandahar) shows the map with the agricultural area in hectares per 5 km square, and the map with the simulated poppy area in hectares per square. The frequency distribution of the simulated poppy area per square shows very strong positive skew (Figure \@ref(fig:histogramPoppyarea)). For 375 squares, the simulated poppy area was smaller than 1 hectare (ha).
```{r mapsKandahar, echo = FALSE, fig.cap = "Agricultural area and simulated poppy area, in ha per 5 km square, in Kandahar."}
df <- grdKandahar %>%
pivot_longer(cols = c("poppy", "agri"))
df$name <- factor(df$name, levels = c("poppy", "agri"), ordered = TRUE)
ggplot(data = df) +
geom_raster(mapping = aes(x = s1 / 1000, y = s2 / 1000, fill = value)) +
scale_fill_viridis_c(name = "Area (ha)") +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed() +
facet_wrap(~ name, ncol = 1, nrow = 2)
```
```{r histogramPoppyarea, echo = FALSE, fig.width = 5, fig.asp = 0.7, fig.cap = "Frequency distribution of simulated poppy area in ha per 5 km square, in Kandahar."}
ggplot(grdKandahar) +
geom_histogram(aes(x = poppy), breaks = seq(from = 0, to = 2200, by = 100), fill = "black", alpha = 0.5, colour = "black") +
scale_x_continuous(name = "Poppy area (ha)") +
scale_y_continuous(name = "Count")
```
### Aboveground biomass in Eastern Amazonia, Brazil {#Amazonia}
This data set consists of data on the live woody aboveground biomass (AGB) in megatons per ha [@Baccini2012]. A rectangular area of 1,642 km $\times$ 928 km in Eastern Amazonia, Brazil, was selected from this data set. The data were aggregated to a map with a resolution of 1 km $\times$ 1 km. Besides, a stack of five ecologically relevant covariates of the same spatial extent was prepared, being long term mean of MODIS short-wave infrared radiation (SWIR2), primary production in kg C per m^2^ (Terra_PP), average precipitation in driest month in mm (Prec_dm), elevation in m, and clay content in g kg^-1^ soil. All covariates were either resampled by bilinear interpolation or aggregated to conform with the grid of the aboveground biomass map. Figure \@ref(fig:mapsAmazonia) shows a map of AGB and SWIR2.
(ref:mapsAmazonialabel) Aboveground biomass (AGB) in 10^9^ kg ha^-1^ and short-wave infrared radiation (SWIR2) of Eastern Amazonia.
```{r mapsAmazonia, echo = FALSE, out.width = '100%', fig.cap = "(ref:mapsAmazonialabel)"}
plt1 <- ggplot(grdAmazonia) +
geom_raster(mapping = aes(x = x1 / 1000, y = x2 / 1000, fill = AGB)) +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
scale_fill_viridis_c(name = "AGB") +
coord_fixed()
plt2 <- ggplot(grdAmazonia) +
geom_raster(mapping = aes(x = x1 / 1000, y = x2 / 1000, fill = SWIR2)) +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
scale_fill_viridis_c(name = "SWIR2") +
coord_fixed()
grid.arrange(plt1, plt2, nrow = 2)
```
Figure \@ref(fig:matrixscatter) shows a matrix of two-dimensional density plots of aboveground biomass and the five covariates, made with function `ggpairs` of **R** package **GGally** [@GGally]. The covariate with the strongest correlation with AGB is SWIR2. The Pearson correlation coefficient with AGB is -0.80. The relation does not look linear. The correlation of AGB with the covariates Terra_PP and Prec_dm is weakly positive. All correlations are significant, but this is not meaningful because of the very large number of data used in computing the correlation coefficients.
(ref:MatrixScatterlabel) Matrix of two-dimensional density plots of AGB and five covariates of Eastern Amazonia. Terra_PP values are divided by 1000.
```{r matrixscatter, echo = FALSE, fig.asp = .66, out.width = '100%', fig.cap = "(ref:MatrixScatterlabel)"}
library(GGally)
density2d <- function(data, mapping, ...) {
ggplot(data = data, mapping = mapping) +
geom_density2d_filled(...)
}
grdAmazonia %>%
dplyr::select(AGB, SWIR2, Terra_PP, Prec_dm, Elevation, Clay) %>%
mutate(Terra_PP = Terra_PP/1000) %>%
slice_sample(n = 50000) %>%
ggpairs(
upper = list(continuous = "cor"),
lower = list(continuous = density2d)
)
```
### Annual mean air temperature in Iberia {#TASIberia}
The space-time designs of Chapter \@ref(RepeatedSurveys) are illustrated with the annual mean air temperature at two metres above the earth surface (TAS) in $^\circ$C, in Iberia (Spain and Portugal, islands excluded) for 2004, 2009, 2014, and 2019 (Figure \@ref(fig:TASofIberia)). These data are part of the data set [CHELSA](https://chelsa-climate.org/wp-admin/download-page/CHELSA_tech_specification_V2.pdf) [@Karger2017]. The raster files are latitude-longitude grids with a resolution of 30 arc sec. The data are projected using the Lambert azimuthal equal area (laea) projection. The resolution of the resulting laea raster file is about 780 m $\times$ 780 m.
```{r TASofIberia, echo = FALSE, out.width = '100%', fig.cap = "Annual mean air temperature in Iberia for 2004, 2009, 2014, and 2019."}
df_lf <- grdIberia %>% pivot_longer(cols = c("TAS2004", "TAS2009", "TAS2014", "TAS2019"))
ggplot(df_lf) +
geom_raster(mapping = aes(x = x / 1000, y = y / 1000, fill = value)) +
scale_fill_viridis_c(name = "TAS") +
scale_x_continuous(name = "Easting (km)", breaks = seq(from = 2700, to = 3700, by = 200)) +
scale_y_continuous(name = "Northing (km)") +
facet_wrap(~ name, ncol = 2, nrow = 2) +
coord_fixed()
```
```{r, echo = FALSE}
rm(list = ls())
```