# Model evaluation in digital soil mapping {#evaluation}
*M. Guevara & G.F. Olmedo*
There is no single best method for statistical modeling, and different evaluation strategies should be considered in order to obtain a realistic estimate of the overall modeling accuracy [@ho2002simple; @qiao2015no; @guevara_2018; @nussbaum2018evaluation]. This Section describes quantitative methods for model evaluation applied to SOC mapping across FYROM.
Our objective is to provide a model evaluation example based on a vector of observed SOC values and a vector of modeled SOC estimates derived from the geomatching approach (GM) and the three different statistical methods described in Chapter \@ref(mappingMethods): multiple linear regression-kriging (RK) (see Section \@ref(RK)), random forests (RF) (see Section \@ref(rf)), and support vector machines (SVM) (see Section \@ref(svm)). The model evaluation methods presented here were adapted from the original work of @openair on air quality assessment and their **R** package **openair**.
This package provides a very useful set of model evaluation functions that are well suited (and highly recommended) for comparing digital soil maps derived from different prediction algorithms. We will first analyze the simple correlations and the main differences among the SOC maps generated by the different methods. For the subsequent analyses we will then prepare a data frame containing the observed and modeled vectors, together with a column identifying the method. Ideally, the observed vector should come from a completely independent SOC dataset, as explained in the previous Chapter. A cross-validation strategy or a repeated random split into training and testing sets are two alternatives when no independent dataset is available for validation purposes. However, we do not recommend using the same training dataset for the following analysis, since the resulting *best method* could simply be the one that overfits the most.
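For context, a single random split of the kind mentioned above takes only a few lines of base **R**. The sketch below is purely illustrative and assumes a hypothetical data frame `dat` of SOC point observations; repeating the split many times yields a repeated random split (hold-out) evaluation.
```{r, eval=FALSE}
# Illustrative sketch only: a random 75/25 training/testing split of a
# hypothetical data frame 'dat' of SOC point observations
set.seed(2018)
idx   <- sample(seq_len(nrow(dat)), size = round(0.75 * nrow(dat)))
train <- dat[idx, ]   # used to fit the models
test  <- dat[-idx, ]  # held out, used only for evaluation
```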
This Chapter relies on several **R** packages. To run the code provided here, the following packages need to be installed in the **R** user library; if they are not yet installed, the `install.packages()` function can be used.
```{r, eval=FALSE}
# Installing packages for Chapter 'Model Evaluation in
# Digital Soil Mapping'
install.packages(c("raster", "psych", "rasterVis",
"mapview", "openair"))
```
## Technical steps - Model correlations and spatial differences
**Step 1 - Harmonize the predicted maps to the same spatial resolution**
We will import the predicted maps and harmonize them into the same regular grid ($1 \times 1$ km spatial resolution).
```{r, out.width='70%'}
library(raster)
# Import the predicted maps
GM  <- raster('results/MKD_OCSKGM_GM.tif')
RK  <- raster('results/MKD_OCSKGM_RK.tif')
RF  <- raster('results/MKD_OCSKGM_rf.tif')
SVM <- raster('results/MKD_OCSKGM_svm.tif')
# Note that RK has a different reference system,
# so we reproject it onto the grid of the SVM map
RK <- projectRaster(RK, SVM)
# Stack the four maps into a single multi-layer object
models <- stack(GM, RK, RF, SVM)
```
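As an optional check, we can confirm that the four layers now share the same grid before moving on; this is a minimal sketch using standard **raster** functions.
```{r, eval=FALSE}
# Optional check: the layers should share the same extent, number of
# rows/columns and coordinate reference system (an error is raised otherwise)
raster::compareRaster(GM, RK, RF, SVM)
res(models)  # grid resolution (about 1 km, in the map units of the layers)
crs(models)  # common coordinate reference system
```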
**Step 2 - Compare the statistical distribution and correlation between the models**
Then we will plot the statistical distributions and the pairwise correlations between the four different maps (GM, RK, RF, SVM).
```{r, fig.cap='Comparison of DSM model correlations (GM, RK, RF, SVM) and statistical distributions'}
library(psych)
pairs.panels(na.omit(as.data.frame(models)),
             # Correlation method
             method = "pearson",
             hist.col = "#00AFBB",
             # Show density plots
             density = TRUE,
             # Show correlation ellipses
             ellipses = TRUE)
```
Here we found that the highest Pearson correlation coefficient ($r$ value) between predicted values was between RK and SVM (0.86). We also found that the statistical distributions of the predicted values are quite similar among the four methods, and that the largest discrepancies occur between the GM map and the statistical models.
In addition, we can overlay the probability density functions of the four methods to verify that their predictions are similar across the full distribution of values.
```{r, fig.cap='Density plot of the predictions of the four DSM models'}
library(rasterVis)
densityplot(models)
```
**Step 3 - Identify spatial differences between the models using the standard deviation**
Let's now take a look at the spatial differences. This step allows us to identify the geographical areas within the prediction domain where the model predictions agree and disagree the most. To compare the model predictions spatially, we will estimate the standard deviation of the four SOC maps and the differences between them.
```{r, eval=FALSE}
library(mapview)
# Pixel-wise standard deviation of the four predicted maps
SD <- calc(models, sd)
mapview(SD)
```
The `mapview()` call renders the standard deviation map as an interactive `.html` view. Note how the hotspots of high prediction variance (yellow to red colors) tend to concentrate towards the west of the country, whereas the models tend to agree in their predictions across the east. Note also that, although the variability is noisy, it shows a general pattern (e.g., from east to west), suggesting that model agreement could be associated with specific land surface characteristics.
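If a static summary of these hotspots is preferred, a simple illustrative option is to map the cells whose standard deviation falls in the upper tail of its distribution; the 90th percentile threshold below is an arbitrary choice.
```{r, eval=FALSE}
# Illustrative: flag the cells where the models disagree the most
# (standard deviation above its 90th percentile, an arbitrary cut-off)
highSD <- SD > quantile(SD, probs = 0.9, na.rm = TRUE)
mapview(highSD)
```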
Now we will analyze the specific pairwise differences between the four models (GM - RK, GM - RF, GM - SVM, RK - RF, RK - SVM, RF - SVM).
```{r, fig.cap='Pairwise comparison of the four DSM models', eval=FALSE}
library(raster)
library(rasterVis)
# Pairwise differences between the predicted maps
GMRK  <- calc(models[[c(1, 2)]], diff)
GMRF  <- calc(models[[c(1, 3)]], diff)
GMSVM <- calc(models[[c(1, 4)]], diff)
RKRF  <- calc(models[[c(2, 3)]], diff)
RKSVM <- calc(models[[c(2, 4)]], diff)
RFSVM <- calc(models[[c(3, 4)]], diff)
preds <- stack(GMRK, GMRF, GMSVM, RKRF, RKSVM, RFSVM)
names(preds) <- c('GMvsRK', 'GMvsRF', 'GMvsSVM',
                  'RKvsRF', 'RKvsSVM', 'RFvsSVM')
# Subtract the mean of each difference map before plotting (centering)
X <- raster::cellStats(preds, mean)
levelplot(preds - X, at = seq(-0.5, 0.5, length.out = 10),
          par.settings = RdBuTheme)
```
Note how the spatial differences of the predicted SOC values show similar patterns, but the difference between RK and RF is less sharp than the differences of SVM with the other two methods. We use the `levelplot()` function to generate a clearer visualization (from red-to-white-to-blue) of the sign of the differences (i.e., whether they are positive or negative), but we could also use the `mapview()` function to explore these maps interactively. The variance of predictions derived from different models can be used as a proxy of model uncertainty and provides valuable information for further applications of SOC maps (e.g., modeling crop production or quantifying SOC stocks). Note also the pronounced differences between the GM map and the three statistical predictions.
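One simple way to summarize this between-model variability in relative terms is the coefficient of variation of the four predictions. The sketch below is an optional addition, not part of the original workflow.
```{r, eval=FALSE}
# Optional sketch: between-model coefficient of variation (sd / mean),
# a relative measure of disagreement that can accompany the SOC maps
MEAN <- calc(models, mean)
CV   <- calc(models, sd) / MEAN
mapview(CV)
```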
## Technical steps - Model evaluation
**Step 1 - Data preparation**
To compare the performance of the four models, we will compare the observed values and the values predicted at the validation points. We have to load the validation dataset together with the predictions of the four models at those locations. The table containing these values was prepared in Section \@ref(TS:validation).
```{r}
dat <- read.csv("results/validation.csv")
```
We will prepare a new table from these data for model evaluation purposes. The new table should contain the observed value, the predicted value, and a column identifying the model.
```{r}
# Prepare 4 new data.frames with the observed value, the predicted value and the model name
modGM <- data.frame(obs = dat$OCSKGM, mod = dat$MKD_OCSKGM_GM,
model = "GM")
modRK <- data.frame(obs = dat$OCSKGM, mod = dat$MKD_OCSKGM_RK,
model = "RK")
modRF <- data.frame(obs = dat$OCSKGM, mod = dat$MKD_OCSKGM_rf,
model = "RF")
modSVM <- data.frame(obs = dat$OCSKGM, mod = dat$MKD_OCSKGM_svm,
model = "SVM")
# Merge the 4 data.frames into one
modData <- rbind(modGM, modRK, modRF, modSVM)
summary(modData)
```
**Step 2 - Calculate statistical evaluation metrics**
Now we will use the `modStats()` function to calculate common numerical model evaluation statistics which are described and mathematically defined in the **openair** package manual [@carslaw2015openair, Ch. 27, pp. 231-233]. These include:
* $n$, the number of complete pairs of data;
* $FAC2$, fraction of predictions within a factor of two;
* $MB$, the mean bias;
* $MGE$, the mean gross error;
* $NMB$, the normalized mean bias;
* $NMGE$, the normalized mean gross error;
* $RMSE$, the root mean squared error;
* $r$, the Pearson correlation coefficient;
* $COE$, the Coefficient of Efficiency based on @legates1999evaluating and @legates2013refined. A perfect model has a $COE$ = 1. A value of $COE$ = 0 or negative implies no predictive capacity;
* $IOA$, the Index of Agreement based on @willmott2012refined, which spans between -1 and 1, with values approaching +1 representing better model performance.
A perfect model would have $FAC2$, $r$, $COE$ and $IOA$ values close to 1.0, while all the other statistics would be close to 0.
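For reference, the simpler of these statistics can be reproduced by hand from the observed and predicted vectors. The following sketch is merely illustrative of the definitions above (the **openair** implementation handles missing values and other edge cases more carefully), here using the RK predictions.
```{r, eval=FALSE}
# Illustrative only: hand computation of some of the statistics
# reported by modStats(), using the RK observed/predicted vectors
obs <- modRK$obs
mod <- modRK$mod
MB   <- mean(mod - obs)                          # mean bias
MGE  <- mean(abs(mod - obs))                     # mean gross error
RMSE <- sqrt(mean((mod - obs)^2))                # root mean squared error
r    <- cor(obs, mod)                            # Pearson correlation
FAC2 <- mean(mod / obs >= 0.5 & mod / obs <= 2)  # fraction within a factor of two
c(MB = MB, MGE = MGE, RMSE = RMSE, r = r, FAC2 = FAC2)
```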
However, digital soil mappers should keep in mind that there is no such thing as a perfect model in digital SOC mapping over large areas, especially when dealing with sparse legacy data from soil profiles or pits usually collected over long periods of time. Depending on the situation, some performance measures might be more appropriate than others. Hence, there is no single best measure, and it is necessary to use a combination of performance measures for model evaluation purposes [@chang2004air].
```{r}
library(openair)
modsts <- modStats(modData, obs = "obs", mod = "mod", type = "model")
```
```{r modsts, echo=FALSE}
# modsts <- cbind(modsts[1], round(modsts[-1], 2))
# Print a table
knitr::kable(modsts[, c(1, 3:11)],
             caption = "Summary of different model evaluation statistics for the four models compared",
             digits = 2, row.names = FALSE, booktabs = TRUE)
```
In our SOC mapping example across FYROM, GM had the highest error and bias, while the three statistical models gave similar results. The $FAC2$ is above 0.8 in all cases, with RK being the closest to 1. The negative $MB$ and $NMB$ values suggest that all the models tend to underestimate SOC; SVM underestimates SOC less than RK and RF. The $MGE$, $NMGE$, and $RMSE$ indicate, however, that SVM generates the largest error, and judging from the values of $r$, $COE$ and $IOA$, we can say that, given the available SOC data across FYROM, RK improves on the predictive capacity of RF and SVM.
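To make this comparison easier to scan, the `modStats()` output can simply be sorted by one of the error metrics; a small illustrative example:
```{r, eval=FALSE}
# Illustrative: rank the four methods by RMSE (lower is better)
modsts[order(modsts$RMSE), c("model", "RMSE", "r", "COE", "IOA")]
```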
These conclusions can be verified by plotting a Taylor Diagram (see Figure \@ref(fig:taylor)), which summarizes multiple aspects of model performance, such as the agreement and variance between observed and predicted values [@taylor2001summarizing]. Recent reports show that integrating simple validation metrics (e.g., the RMSE-to-correlation ratio) can reveal aspects of modeling performance that cannot be obtained by analyzing the validation metrics independently, such as the agreement between explained variance and bias [@guevara_2018; @nussbaum2018evaluation]. The interpretation of a Taylor Diagram relies on the relationship between explained variance and bias (from the observed and modeled data). Note in our Taylor Diagram that GM is more distant from the observations than the other implemented approaches, and that RK is the closest to the observed values, followed by RF; however, no substantial difference is evident among the three implemented algorithms.
```{r taylor, fig.cap="Taylor diagram used in the evaluation of the three selected DSM models"}
TaylorDiagram(modData, obs = "obs", mod = "mod", group = "model",
cols = c("green", "yellow", "red","blue"),
cor.col='brown', rms.col='black')
```
However, we also need to check that the effectiveness of RK holds across the full distribution of observed SOC values. To do this, we can plot the conditional quantiles to verify the higher prediction capacity of RK (see Figure \@ref(fig:condquant)).
```{r condquant, fig.cap="Effectiveness of the different DSM models across the full distribution of SOC observed values"}
conditionalQuantile(modData, obs = "obs", mod = "mod", type = "model")
```
The blue line shows the results for a perfect model. In this case, the observations cover a range from 0 to $6 kg \cdot m^{-2}$. Interestingly, the maximum predicted value is in all cases below $3 kg \cdot m^{-2}$, which is consistent with the previous $MB$ and $NMB$ results. The red line shows the median value of the predictions. Note that the median of the predictions for the larger SOC values is more variable for SVM and RK than for RF, which tends to be conservative across the third quantile of the data distribution. The shading shows the predicted quantile intervals (i.e., the 25/75th and the 10/90th). A perfect model (blue line) would have a very narrow spread [@openair]. The histograms show the counts of observed (gray) and predicted values (blue). While RF shows more conservative results for the higher SOC values (e.g., above $4 kg \cdot m^{-2}$), SVM had the lowest model performance based on these metrics. SVM was also the method showing the largest spatial differences compared to RF and RK.
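The saturation of the predictions noted above can also be checked numerically; as a quick illustrative sketch, we can compare the maximum predicted and observed values per method.
```{r, eval=FALSE}
# Quick check of the prediction ranges discussed above
tapply(modData$mod, modData$model, max, na.rm = TRUE)
tapply(modData$obs, modData$model, max, na.rm = TRUE)
```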
We conclude, for this specific example, that RK showed the best performance based on the implemented evaluation metrics; therefore RK is a suitable method for predicting SOC values across FYROM. Although the other methods generate similar results, we propose that, given the available data, RK may also be easier to interpret, because it provides the means to identify the main effects and coefficients of the relationships between the covariate space and the response variable. Beyond model evaluation, the spatial combination of different modeling approaches (e.g., by means of ensemble learning) should also be considered in order to maximize the accuracy of the results. Regardless of the statistical criteria, users should be aware that different plausible predictions or conventional SOC maps provide complementary information. From a pedological point of view, GM is appropriate for the accurate delineation of soil units, and the continued use of different modeling approaches will enhance our capacity to describe the functional variability with higher spatial detail and accuracy within each soil type unit.
Finally, we want to highlight that integrating different statistical methods and visualization tools, such as those provided by the **openair** package in **R** [@openair], will enhance our capacity to identify the best modeling approach and combination of SOC prediction factors for a given dataset. Model evaluation improves our understanding of the circumstances under which each modeling approach generates different results, which is a first step towards reducing uncertainty while increasing the spatial resolution and accuracy of predictions, the eternal problem of DSM.