---
title: "CHASE test dataset"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      fig.path = "markdown_images/")
```
## Development of two indicator datasets for testing the CHASE assessment tool
This document describes the development of datasets for testing the CHASE assessment tool. By modifying the steps below or by using different sources of data, alternative datasets can be created as input to the tool.
This could also serve as a model for developing the actual assessment. To do so, however, a number of issues would need to be addressed, including but not limited to:
* Determination of the temporal confidence should be included. At present, random values are used.
* Determination of the methodological confidence should be included. At present, this is also represented by random values.
## Data sources
For testing purposes, we used indicator data from the HOLAS II assessment, available per station. The indicator data came from sheet **'4. BSII'** of the file **CHASEinput060318.xlsx**.
This sheet does not include station positions. These were added to the dataset by matching station names against the station positions listed on other sheets of the workbook; a sketch of this step is shown below.
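The code that performed this matching is not shown here. The following is a minimal, hypothetical sketch of the step, assuming a sheet named 'Stations' holding one row per station name with latitude and longitude columns; the actual layout of the workbook may differ.

```{r match station positions, eval=FALSE}
# NOT RUN - illustrative sketch only.
# The sheet name 'Stations' and its column names are assumptions;
# adjust them to the actual layout of CHASEinput060318.xlsx.
library(readxl)
library(dplyr)
indicators <- read_excel("CHASEinput060318.xlsx", sheet = "4. BSII")
positions <- read_excel("CHASEinput060318.xlsx", sheet = "Stations") %>%
  distinct(Station, latitude, longitude)
# add positions by matching on station name
df_stn <- indicators %>%
  left_join(positions, by = "Station")
write.table(df_stn, "./input/bsii_by_stn.txt", sep = ";",
            row.names = FALSE, fileEncoding = "UTF-8")
```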
Using shape files for level 3 assessment units and level 4 assessment units, we mapped the station positions to assessment units, creating two input data sets for the CHASE assessment:
1) [input/assessmentdata_L3.csv](input/assessmentdata_L3.csv)
2) [input/assessmentdata_L4.csv](input/assessmentdata_L4.csv)
#### Shape files
The shape files for assessment units were downloaded from the HELCOM maps and data service (MADS):
Level 3 : https://maps.helcom.fi/website/MADS/download/?id=e5a59af9-c244-4069-9752-be3acc5dabed
Level 4: https://maps.helcom.fi/website/MADS/download/?id=67d653b1-aad1-4af4-920e-0683af3c4a48
The original shape files contain several columns that are not required. For the purpose of these tests, we are only interested in the names of the assessment units and their areas in km<sup>2</sup>. The files were therefore reduced in size using the following code.
```{r shapefiles, eval=FALSE}
library(tidyverse)
library(sf)
units3 <- read_sf(dsn = "../gis/_ags_HELCOM_subbasins_with_coastal_and_offshore_division_2018_11",
layer = "HELCOM_subbasins_with_coastal_and_offshore_division_2018_1")
units3 <- units3 %>%
dplyr::select(HELCOM_ID,level_3,Area_km2=area_km2)
# note: change name of variable area_km2 for consistency
st_write(units3, "assessment_units/AssessmentUnits3.shp",append=F)
units4 <- read_sf(dsn = "../gis/_ags_HELCOM_subbasins_with_coastal_WFD_waterbodies_or_watertypes_2018_11",
layer = "HELCOM_subbasins_with_coastal_WFD_waterbodies_or_watertypes_2018_1")
units4 <- units4 %>%
dplyr::select(Name,HELCOM_ID,Area_km2)
st_write(units4, "assessment_units/AssessmentUnits4.shp",append=F)
```
## Mapping
Load packages
```{r load packages,message=FALSE}
library(sf)
library(tidyverse)
library(patchwork) # not essential - this is used only to combine the two maps of assessment units
```
Load shape files for assessment units
```{r read shape files, fig.height=4, fig.width=8}
units3 <- read_sf(dsn = "./assessment_units",
layer = "AssessmentUnits3")
units4 <- read_sf(dsn = "./assessment_units",
layer = "AssessmentUnits4")
# show maps of the assessment units
map3 <- ggplot() +
theme_minimal(base_size=9) +
ggtitle("Level 3 Assessment Units") +
geom_sf(data=units3, colour="black", fill=NA)
map4 <- ggplot() +
theme_minimal(base_size=9) +
ggtitle("Level 4 Assessment Units") +
geom_sf(data=units4, colour="black", fill=NA)
map3+map4
```
Read indicator data from text file [input/bsii_by_stn.txt](input/bsii_by_stn.txt)
```{r read indicators}
file <- "./input/bsii_by_stn.txt"
df <- read.table(file,sep=";",
                 header=T,
                 fileEncoding="UTF-8",
                 comment.char="")
```
Convert indicator data frame to simple features
```{r indicators to sf, fig.height=4, fig.width=4}
df_sf <- st_as_sf(df, coords = c("longitude", "latitude"), crs = 4326) # EPSG 4326 is WGS84
# transform the sf dataframe to the same projection as the assessment units
df_sf <- st_transform(df_sf,crs=st_crs(units3))
# plot the stations and the assessment units together
ggplot() +
  theme_minimal(base_size=9) +
  ggtitle("Level 4 Assessment Units \n+ Indicator positions") +
  geom_sf(data=units4, colour="black", fill=NA) +
  geom_sf(data=df_sf, colour="red")
```
Intersect the indicator data points with the polygons to obtain dataframes of indicators with information about which assessment unit each point belongs to: one dataframe for the Level 3 assessment units and another for the Level 4 assessment units.
*As can be seen from the map, some station positions should be checked; a simple check is sketched after the following code block. Correcting the positions is outside the scope of this test.*
```{r intersections, warning=FALSE}
df3 <- st_intersection(df_sf, units3)
df4 <- st_intersection(df_sf, units4)
# geometry information is no longer needed
df3$geometry <-NULL
df4$geometry <-NULL
# show the head for Level 3 data
print(head(df3))
```
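One way to follow up on the note about station positions is to list the stations that do not fall inside any Level 3 assessment unit, since such points are dropped by `st_intersection()`. This is a minimal sketch, not part of the original workflow:

```{r check station positions, eval=FALSE}
# NOT RUN - list stations falling outside all Level 3 assessment units;
# these points are dropped by st_intersection() above
unmatched <- df_sf[lengths(st_intersects(df_sf, units3)) == 0, ]
unique(unmatched$Station)
```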
## Further processing, incl. confidences
#### Spatial confidence
Add counts of stations and observations
```{r counts data and stations, warning=FALSE, message=FALSE}
# count stations for Level3
stn_count3 <- df3 %>%
  distinct(HELCOM_ID,level_3,Matrix,Substance,Station) %>%
  group_by(HELCOM_ID,level_3,Matrix,Substance) %>%
  summarise(CountStations=n()) %>%
  ungroup()
# count observations for Level3
data_count3 <- df3 %>%
  group_by(HELCOM_ID,level_3,Matrix,Substance) %>%
  summarise(CountData=n()) %>%
  ungroup()
# merge L3 counts back to original L3 data
df3 <- df3 %>%
  left_join(stn_count3,by=c("HELCOM_ID","level_3","Matrix","Substance")) %>%
  left_join(data_count3,by=c("HELCOM_ID","level_3","Matrix","Substance"))
# count stations for Level4
stn_count4 <- df4 %>%
  distinct(HELCOM_ID,Name,Matrix,Substance,Station) %>%
  group_by(HELCOM_ID,Name,Matrix,Substance) %>%
  summarise(CountStations=n()) %>%
  ungroup()
# count observations for Level4
data_count4 <- df4 %>%
  group_by(HELCOM_ID,Name,Matrix,Substance) %>%
  summarise(CountData=n()) %>%
  ungroup()
# merge L4 counts back to original L4 data
df4 <- df4 %>%
  left_join(stn_count4,by=c("HELCOM_ID","Name","Matrix","Substance")) %>%
  left_join(data_count4,by=c("HELCOM_ID","Name","Matrix","Substance"))
```
Add spatial confidence based on km<sup>2</sup> per sample:

km<sup>2</sup> per sample | Confidence
------------- | -------------
< 500 | High
500 - 5000 | Moderate
\> 5000 | Low
```{r add spatial confidences, warning=FALSE}
km2perSampleBounds<-c(500,5000)
conf<-c("L","M","H")
# ix = number of class boundaries (500, 5000) above km2perSample:
# 0 -> "L", 1 -> "M", 2 -> "H"
df3 <- df3 %>%
  mutate(km2perSample=Area_km2/CountData)
df3 <- df3 %>%
  rowwise() %>%
  mutate(ix=length(km2perSampleBounds[km2perSampleBounds>km2perSample])) %>%
  mutate(ConfSpatial=conf[ix+1]) %>%
  dplyr::select(-ix)
df4 <- df4 %>%
  mutate(km2perSample=Area_km2/CountData)
df4 <- df4 %>%
  rowwise() %>%
  mutate(ix=length(km2perSampleBounds[km2perSampleBounds>km2perSample])) %>%
  mutate(ConfSpatial=conf[ix+1]) %>%
  dplyr::select(-ix)
```
#### Threshold confidence
Load table of confidences for threshold values from text file [input/confidence_thresholds.txt](./input/confidence_thresholds.txt)
```{r load threshold confidences, warning=FALSE}
dfConfThreshold <- read.table("./input/confidence_thresholds.txt",sep=";",header=T)
print(dfConfThreshold)
```
The original source for the threshold confidence data is Table 2 in this meeting document:
https://portal.helcom.fi/meetings/HOLAS%20II%20HZ%20WS%201-2018-518/MeetingDocuments/2-3%20Confidence%20setting%20for%20CHASE%20integrated%20assessment.pdf
Join threshold confidences to L3 and L4 indicator tables.
_Note: the following code block is not run because **ConfThresh** is already included in the indicator data._
```{r add threshold confidences, warning=FALSE, eval=FALSE}
# NOT RUN!
df3 <- df3 %>%
  left_join(dfConfThreshold,by=c("Substance","Matrix")) %>%
  mutate(AU_scale=3)
df4 <- df4 %>%
  left_join(dfConfThreshold,by=c("Substance","Matrix")) %>%
  mutate(AU_scale=4)
```
#### Methodological confidence
Add random methodological confidences to the L3 and L4 indicator tables
```{r add method confidences, warning=FALSE}
conf<-c("L","M","H")
df3[,"ConfMethod"]<-round(runif(nrow(df3), min=0.5, max=3.49999))
df3 <- df3 %>%
rowwise() %>%
mutate(ConfMethod=conf[ConfMethod]) %>%
ungroup()
df4[,"ConfMethod"]<-round(runif(nrow(df4), min=0.5, max=3.49999))
df4 <- df4 %>%
rowwise() %>%
mutate(ConfMethod=conf[ConfMethod]) %>%
ungroup()
```
#### Temporal confidence
Add random temporal confidences to L3 and L4 indicator tables
```{r add temporal confidences, warning=FALSE}
conf<-c("L","M","H")
df3[,"ConfTemp"]<-round(runif(nrow(df3), min=0.5, max=3.49999))
df3 <- df3 %>%
rowwise() %>%
mutate(ConfTemp=conf[ConfTemp]) %>%
ungroup()
df4[,"ConfTemp"]<-round(runif(nrow(df4), min=0.5, max=3.49999))
df4 <- df4 %>%
rowwise() %>%
mutate(ConfTemp=conf[ConfTemp]) %>%
ungroup()
```
Select only the columns needed
```{r select columns, warning=FALSE}
df3 <- df3 %>%
  mutate(AU_scale=3) %>%
  dplyr::select(AU_scale,AU=level_3,Area_km2,Substance,Type,Matrix,CR,
                ConfThresh,CountStations,CountData,ConfSpatial,ConfMethod,ConfTemp)
df4 <- df4 %>%
  mutate(AU_scale=4) %>%
  dplyr::select(AU_scale,AU=HELCOM_ID,Area_km2,Substance,Type,Matrix,CR,
                ConfThresh,CountStations,CountData,ConfSpatial,ConfMethod,ConfTemp)
```
#### Save test datasets
Save the indicator data, including
* information on which assessment units they belong to
* threshold confidence [H/M/L]
* method confidence [H/M/L]
* number of stations in the AU
* number of data points in the AU
* spatial confidence [H/M/L]
* temporal confidence [H/M/L]
```{r save data, warning=FALSE}
write.table(df3,file="./input/assessmentdata_L3.csv",sep=";",row.names=F,col.names=T,quote=T)
write.table(df4,file="./input/assessmentdata_L4.csv",sep=";",row.names=F,col.names=T,quote=T)
```
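As a quick check (not part of the original workflow), the saved files can be read back with the same separator and header settings used when writing them:

```{r check saved files, eval=FALSE}
# NOT RUN - re-read one of the saved test datasets to confirm it parses correctly
chk <- read.table("./input/assessmentdata_L3.csv", sep = ";", header = TRUE)
str(chk)
```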
This markdown document is created by knitting the RMarkdown file [test_dataset.Rmd](test_dataset.Rmd). The steps for generating the test data can also be run without generating markdown, using the standalone R script [src/create_test_dataset.R](src/create_test_dataset.R).
Updated 28-10-2021 [[email protected]](mailto:[email protected])