# RCBD Example: R {#rcbd-r}
Here are step-by-step instructions for incorporating spatial covariates into the analysis of a field experiment that uses a randomized complete block design. Several techniques are explored: correlated error models (Gaussian, exponential, spherical, and Matérn), row and column trends, splines, and an AR1xAR1 covariance structure.
Load the NIN data if it is not already in your R environment:
```{r message=FALSE, warning=FALSE}
library(agridat); library(dplyr); library(tidyr); library(purrr);
library(sp)
```
```{r error-chunk}
data("stroup.nin")
Nin <- stroup.nin %>%
  mutate(col.width = col * 1.2,
         row.length = row * 4.3) %>%
  fill(rep, .direction = "up") %>%
  arrange(col, row)
Nin_na <- filter(Nin, !is.na(yield))
Nin_spatial = Nin_na
coordinates(Nin_spatial) <- ~ col.width + row.length
```
Once spatial auto-correlation has been identified in field trials, the next step is to employ a modeling technique that will reduce the impact of spatial variation on the final estimates from the analysis.
## Prep work
The first step is to fit a standard linear mixed model. A common model specification for the randomized complete block design (RCBD) is to include cultivar as a fixed effect and block as a random effect.
```{r message=FALSE, warning=FALSE}
library(nlme); library(emmeans)
nin_lme <- lme(yield ~ gen, random = ~ 1|rep,
               data = Nin,
               na.action = na.exclude)
# extract the least squares means for variety
preds_lme <- as.data.frame(emmeans(nin_lme, "gen"))
```
The variables "gen" refers to the cultivar or breeding line being trialled, and "rep" is the block, and the dependent variable, "yield" is grain yield. Basic exploratory analysis of this data set was conducted in \@ref(spatial-r).
## Correlated errors
**Gaussian Example**
In order to fit models with a correlated error structure, we first need preliminary estimates of the nugget, sill, and range, obtained by fitting an empirical variogram.
```{r}
library(gstat)
max_dist <- 0.6*max(dist(coordinates(Nin_spatial)))
resid.var1 <- gstat::variogram(yield ~ rep + gen,
                               cutoff = max_dist,
                               width = max_dist/10,
                               data = Nin_spatial)
nugget_start <- min(resid.var1$gamma)
```
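Before committing to a correlation function, it can help to plot the empirical variogram (an optional visual check):
```{r}
# optional: visualize the empirical variogram of the residual variation
plot(resid.var1)
```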
In the previous section, an isotropic Gaussian function was identified as the best model for describing the decay of error correlations over distance.
```{r}
nin_vgm <- vgm(model = "Gau", nugget = nugget_start)
nin_variofit <- fit.variogram(resid.var1, nin_vgm)
nugget <- nin_variofit$psill[1]
range <- nin_variofit$range[2]
sill <- sum(nin_variofit$psill)
nugget.effect <- nugget/sill
```
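To see how well the fitted Gaussian model tracks the empirical variogram, the two can be plotted together (optional):
```{r}
# optional: overlay the fitted Gaussian variogram model on the empirical variogram
plot(resid.var1, nin_variofit)
```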
Create a correlated error structure using the **nlme** package.
```{r}
cor.gaus <- corSpatial(value = c(range, nugget.effect),
                       form = ~ row.length + col.width,
                       nugget = TRUE, fixed = FALSE,
                       type = "gaussian",
                       metric = "euclidean")
```
Update the linear mixed model with the correlated error structure:
```{r}
nin_gaus <- update(nin_lme, corr = cor.gaus)
```
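If you want to see the range and nugget that **nlme** actually estimated after REML fitting (they may differ from the variogram starting values), the correlation structure can be inspected. This optional check assumes the standard structure of an **nlme** model object:
```{r}
# optional check: spatial correlation parameters estimated by the fitted model
coef(nin_gaus$modelStruct$corStruct, unconstrained = FALSE)
```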
Extract variety estimates:
```{r}
preds_gaus <- as.data.frame(emmeans(nin_gaus, "gen"))
```
Other models can be implemented following this template.
### Exponential
```{r}
rm(nin_vgm, nin_variofit, nugget, sill, range, nugget.effect)
nin_vgm <- vgm(model = "Exp", nugget = nugget_start)
nin_variofit <- fit.variogram(resid.var1, nin_vgm)
nugget <- nin_variofit$psill[1]
range <- nin_variofit$range[2]
sill <- sum(nin_variofit$psill)
nugget.effect <- nugget/sill
cor.exp <- corSpatial(value = c(range, nugget.effect),
                      form = ~ row.length + col.width,
                      nugget = TRUE, fixed = FALSE,
                      type = "exponential",
                      metric = "euclidean")
nin_exp <- update(nin_lme, corr = cor.exp)
preds_exp <- as.data.frame(emmeans(nin_exp, "gen"))
```
### Spherical
```{r}
rm(nin_vgm, nin_variofit, nugget, sill, range, nugget.effect)
nin_vgm <- vgm(model = "Sph", nugget = nugget_start)
nin_variofit <- fit.variogram(resid.var1, nin_vgm)
nugget <- nin_variofit$psill[1]
range <- nin_variofit$range[2]
sill <- sum(nin_variofit$psill)
nugget.effect <- nugget/sill
cor.sph <- corSpatial(value = c(range, nugget.effect),
                      form = ~ row.length + col.width,
                      nugget = TRUE, fixed = FALSE,
                      type = "spherical",
                      metric = "euclidean")
nin_sph <- update(nin_lme, corr = cor.sph)
preds_sph <- as.data.frame(emmeans(nin_sph, "gen"))
```
### Matérn
A Matérn error function is not available in **nlme**, but the package **spaMM** provides extensions implementing the Matérn, Cauchy, and other covariance structures. The Matérn (`corMatern()`) is demonstrated below.
```{r message=FALSE, warning=FALSE}
rm(nin_vgm, nin_variofit, nugget, sill, range, nugget.effect)
library(spaMM) # required for running `corMatern()`
nin_vgm <- vgm(model = "Mat", nugget = nugget_start)
nin_variofit <- fit.variogram(resid.var1, nin_vgm, fit.kappa = TRUE)
nugget <- nin_variofit$psill[1]
range <- nin_variofit$range[2]
sill <- sum(nin_variofit$psill)
nugget.effect <- nugget/sill
kappa <- nin_variofit$kappa[2]
# From the spaMM documentation: "Warning: the range parameter used in corSpatial
# objects is the inverse of the scale parameter used in MaternCorr and thus they
# have opposite meaning despite both being denoted ρ elsewhere in this package
# or in nlme literature" - so the range is supplied as an inverse below
cor.mat <- corMatern(value = c(1/range, kappa, nugget.effect),
                     form = ~ row.length + col.width,
                     nugget = TRUE, fixed = FALSE,
                     metric = "euclidean")
nin_matern <- update(nin_lme, corr = cor.mat)
preds_mat <- as.data.frame(emmeans(nin_matern, "gen"))
```
The `corSpatial()` function in **nlme** also offers a linear correlation model. However, if a linear trend is present without a clear range or sill, it is recommended that a linear trend be fitted to the data directly instead.
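For completeness, here is a minimal sketch of the linear option, following the same template as above. It is not evaluated here; the starting values (`range`, `nugget.effect`) would need to come from a variogram fitted with `model = "Lin"`, and the object names `cor.lin` and `nin_lin` are only illustrative.
```{r eval=FALSE}
# sketch only (not run): linear correlation structure following the same template
# starting values should come from a variogram fit with model = "Lin"
cor.lin <- corSpatial(value = c(range, nugget.effect),
                      form = ~ row.length + col.width,
                      nugget = TRUE, fixed = FALSE,
                      type = "linear",
                      metric = "euclidean")
nin_lin <- update(nin_lme, corr = cor.lin)
```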
## Row/Column Trends
The package **lme4** is used here because it can handle multiple crossed random effects, which **nlme** cannot do without nesting them. Elsewhere in this tutorial, I prefer **nlme** for linear modeling because of its built-in functionality for modeling spatial variation.
```{r message=FALSE, warning=FALSE}
library(lme4); library(lmerTest)
```
```{r}
# variables specifying row and column as factors are needed
Nin$colF <- as.factor(Nin$col)
Nin$rowF <- as.factor(Nin$row)
nin_trend <- lmer(yield ~ gen + (1|rep) + (1|colF) + (1|rowF),
                  data = Nin,
                  na.action = na.exclude)
preds_trend <- as.data.frame(emmeans(nin_trend, "gen"))
```
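To gauge how much variation the block, row, and column effects absorb, the variance components of this model can be inspected (optional):
```{r}
# optional: variance components for rep, row, and column random effects
VarCorr(nin_trend)
```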
## Splines
The package **SpATS**, "spatial analysis for field trials", implements B-splines for row and column effects.
```{r message=FALSE, warning=FALSE}
library(SpATS)
nin_spline <- SpATS(response = "yield",
                    spatial = ~ PSANOVA(col, row, nseg = c(10, 20),
                                        degree = 3, pord = 2),
                    genotype = "gen",
                    random = ~ rep, # + rowF + colF,
                    data = Nin,
                    control = list(tolerance = 1e-03, monitoring = 0))
preds_spline <- predict(nin_spline, which = "gen") %>%
  dplyr::select(gen, emmean = "predicted.values", SE = "standard.errors")
```
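**SpATS** also provides a plot method that displays the raw data, the fitted spatial trend, and the residuals, which is useful for judging what the splines have captured (optional):
```{r}
# optional: visualize the spatial surface fitted by SpATS
plot(nin_spline)
```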
## AR1xAR1
```{r include=FALSE}
if (!require(breedR)) {
  # devtools::install_github('famuvie/breedR')
  source("http://famuvie.github.io/breedR/src/setup_repo.R")
  install.packages('breedR')
}
```
```{r message=FALSE, include=FALSE, echo=TRUE}
library(breedR); library(ggplot2)
nin_ar1ar1 <- remlf90(fixed = yield ~ gen,
                      # random = ~ rep, # model would not converge with rep included
                      spatial = list(model = 'AR',
                                     coord = Nin[, c('col', 'row')]),
                      data = Nin)
```
The estimates for $\rho$ are determined via a grid search. The default grid searches the entire parameter space ([-1, 1] for both rows and columns). We can narrow the grid search to the region with the highest log likelihood.
```{r}
qplot(rho_r, rho_c, fill = loglik, geom = 'tile', data = nin_ar1ar1$rho)
# Refine the grid around the most likely values (the range cannot contain exactly 1 or -1)
rho.grid <- expand.grid(rho_r = seq(0.5, 0.95, length = 4),
                        rho_c = seq(0.5, 0.95, length = 4))
```
```{r message=FALSE, include=FALSE, echo=TRUE}
nin_ar1ar1 <- remlf90(fixed = yield ~ gen,
                      # random = ~ rep, # this also did not converge
                      spatial = list(model = 'AR',
                                     coord = Nin[, c('col', 'row')],
                                     rho = rho.grid),
                      data = Nin)
preds_ar1ar1 <- as.data.frame(fixef(nin_ar1ar1)[[1]]) %>%
  tibble::rownames_to_column("gen") %>%
  mutate(SE = attr(fixef(nin_ar1ar1)[[1]], "se"))
colnames(preds_ar1ar1)[2] <- "emmean"
```
## Bayesian AR1xAR1
```{r}
#library(INLA)
## under construction....
```
## Model Selection
Now that we have built these spatial models, how do we pick the right one? Unfortunately, no single model works best in all circumstances, and there is no single accepted way to choose the best model. Some approaches include:
1. comparing model fitness (e.g. AIC, BIC, log likelihood)
1. comparing post-hoc power (that is, the p-values for the treatments)
1. comparing standard error of the estimates
To make comparisons, the code below assembles all the model objects into one list. They were generated by different packages, as shown by the `class` attribute of each one, so this takes some data conditioning.
```{r include=FALSE, message=FALSE}
rm(nin_vgm, nin_variofit, nugget, sill, range, nugget.effect)
```
```{r message=FALSE, warning=FALSE}
all.models <- mget(ls(pattern = "^nin_*"))
# print out their class
map(all.models, class)
```
### Spatial dependence of residuals
It would be helpful to know if these methods were effective in reducing the spatial autocorrelation among the error residuals.
The function below extracts the residuals from each model. It is needed because the packages return residuals in different formats and, in particular, **SpATS** handles missing values differently.
```{r message=FALSE, warning=FALSE}
L1 <- nrow(Nin)
non_na <- !is.na(Nin$yield)
L2 <- sum(non_na)
residuals <- map(all.models, function(x) {
  resids <- residuals(x)
  if (is.data.frame(resids)) {
    colnum <- ncol(resids)
    resids <- resids[, colnum]
  }
  if (length(resids) == L2) {
    resids_pl <- rep(NA, L1)
    resids_pl[non_na] <- resids
    resids <- resids_pl
  }
  return(resids)
})
names(residuals) <- names(all.models)
```
Run a Moran's I test on the extracted residuals:
```{r message=FALSE, warning=FALSE}
library(spdep)
xy.rook <- cell2nb(nrow = max(Nin$row), ncol = max(Nin$col), type="rook")
Moran.I <- map_df(residuals, function(x) {
  mi <- moran.test(x, nb2listw(xy.rook), na.action = na.exclude)
  mi.stat <- mi$estimate
  mi.stat$p.value <- mi$p.value
  return(mi.stat)
}) %>%
  mutate(model = names(all.models)) %>%
  dplyr::select(c(5, 1:4)) %>%
  mutate_at(2:5, round, 4) %>%
  arrange(p.value)
Moran.I
```
Only one model, `nin_spline`, resulted in an improvement in Moran's I. Nearest neighbor approaches can also improve Moran's I. The significant p-values indicate that auto-correlation is still present in those models. However, that does not mean the other models are ineffective: they incorporate the spatial auto-correlation directly into the error terms.
### Compare Model Fit
*log likelihood, AIC, BIC*
Since these are not nested models, likelihood ratio tests cannot be performed. Log likelihood can be compared within the models from **nlme** but not across packages since they use different estimation procedures.
```{r}
nlme_mods <- list(nin_lme, nin_trend, nin_exp, nin_gaus, nin_sph, nin_matern, nin_ar1ar1)
names(nlme_mods) <- c("LMM", "row-col_trend", "exponential",
"gaussian", "spherical", "matern", "AR1xAR1")
data.frame(logLiklihood = sapply(nlme_mods, logLik),
AIC = sapply(nlme_mods, AIC),
BIC = sapply(nlme_mods, AIC, k = log(nrow(Nin_na)))) %>% arrange(desc(logLiklihood))
```
Larger log likelihoods indicate a better fit to the data. A rule of thumb when comparing log likelihoods is that differences of less than 2 are not considered notable. These results suggest that the Gaussian, spherical, exponential, and Matérn models are substantially equivalent in capturing the variation present in this data set.
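A quick way to apply that rule of thumb is to tabulate each model's log likelihood difference from the best-fitting model (a small sketch using the list created above):
```{r}
# log likelihood difference of each model from the best-fitting model
ll <- sapply(nlme_mods, logLik)
round(max(ll) - ll, 2)
```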
### Experiment-wide error
```{r}
exp_error <- as.data.frame(sapply(nlme_mods[-7], sigma))
exp_error
```
The overall experimental error, $\sigma$, increased slightly in the correlated error models because field variation that was (erroneously) absorbed by other experimental effects has been re-partitioned to the error term.
As a result, the coefficient of variation is not a good metric for evaluating the quality of spatial models.
```{r}
CV <- sapply(nlme_mods[-7], function(x) {
  sigma(x) / mean(fitted(x), na.rm = TRUE) * 100
})
as.data.frame(CV)
```
### Post-hoc power
Simulation studies indicate that incorporating spatial correlation into field trial analysis can improve the overall power of the experiment, that is, the probability of detecting true differences among treatments. When working with data from a completed experiment, power is a transformation of the p-value, so running ANOVA on each model can indicate which approach maximizes power.
```{r message=FALSE, warning=FALSE}
anovas <- lapply(nlme_mods[-7], function(x) {
  as.data.frame(anova(x))[2, ]
})
a <- bind_rows(anovas) %>%
  mutate(model = c("LMM", "row/column trend", "exponential",
                   "gaussian", "spherical", "matern")) %>%
  arrange(desc(`p-value`)) %>%
  dplyr::select(c(model, 1:4)) %>%
  tibble::remove_rownames()
a[6, 2:5] <- anova(nin_trend)[3:6]
a
```
This table indicates changes in the hypothesis test for "gen". There is a dramatic change in power for this test when incorporating spatial covariance structures.
### Standard error of treatment means
Retrieve the variety predictions generated in the previous sections:
```{r}
#(standardise names for downstream merging step)
#preds_ar1ar1 <- preds_ar1ar1 %>% rename(emmean = "predicted.value", SE = "standard.error")
all.preds <- mget(ls(pattern = "^preds_*"))
```
Extract standard errors and plot:
```{r message=FALSE, warning=FALSE}
errors <- lapply(all.preds, "[", "SE")
pred.names <- gsub("preds_", "", names(errors))
error_df <- bind_cols(errors)
colnames(error_df) <- pred.names
```
```{r SE-box-fig, echo=FALSE, fig.cap='Differences in Variety Standard Error', out.width='80%', fig.asp=0.75, fig.align='center'}
boxplot(error_df, ylab = "standard errors", xlab = "linear model", col = "dodgerblue3")
```
### Treatment means
Extract estimates:
```{r message=FALSE, warning=FALSE}
preds <- lapply(all.preds, "[", "emmean")
preds_df <- bind_cols(preds)
colnames(preds_df) <- pred.names
preds_df$gen <- preds_exp$gen
```
Plot changes in ranks:
```{r gen-ranks-fig, echo=FALSE, fig.align='center', fig.asp=0.75, fig.cap='Differences in Variety Ranks', message=FALSE, warning=FALSE, out.width='85%'}
library(ggplot2)
lev <- c("lme", "trend", "exp", "gaus", "mat", "sph", "ar1ar1", "spline")
#reshape::melt(preds_df, id.vars = "gen", variable.name = "model", value.name = "emmeans") %>%
pivot_longer(preds_df, cols = !gen, names_to = "model", values_to = "emmeans") %>%
  mutate(model = factor(model, levels = lev)) %>%
  ggplot(aes(x = model, y = emmeans, group = gen)) +
  geom_point(size = 5, alpha = 0.5, col = "navy") +
  geom_line() +
  ylab("yield means for gen") +
  theme_minimal(base_size = 14)
```
The black lines link the least squares means for a single variety. There is some consistency in the rankings among the exponential, Gaussian, Matérn, and spherical covariance models. The control RCBD model, "lme", has fundamentally different rankings. The spline and AR1xAR1 rankings are also slightly different from the other models.
Nevertheless, the following plot indicates considerable consensus in the least squares means from all of the spatial models. The panels above the diagonal contain the Pearson correlations between those values.
```{r ls-panel-fig, echo=FALSE, fig.align='center', fig.asp=1, fig.cap='Correlations in Variety Means', message=FALSE, warning=FALSE, out.width='90%'}
library(psych)
pairs.panels(preds_df[, -9], smooth = FALSE, density = FALSE, ellipses = FALSE,
             hist.col = "gold", pch = 1)
```
## Making decisions
There is no consensus on how to pick the best model. Some studies rely on log likelihood, while others seek to maximize the experimental power. Others have sought to minimize the root mean square error from cross validation.
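As one simple tiebreaker (a sketch using the standard errors assembled above, not a definitive rule), the models can be ranked by the average standard error of their variety means:
```{r}
# rank models by mean standard error of the variety estimates (smaller is better)
sort(colMeans(error_df))
```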
The evidence suggests that, for this data set, using any spatial model is better than running a naïve RCBD model.