---
title: "Tidying Public Water Quality Data"
author: "Matthew Ross"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  html_document:
    toc: yes
    toc_depth: 3
    toc_float: true
editor_options:
  chunk_output_type: console
---
# Why public datasets?
Working with large, open-access datasets can serve many purposes. It can be an excellent way to explore new ideas before investing in fieldwork or experiments. It can be a great way to take local or experimental results and expand them to different ecosystems, places, or landscapes. Or it can be an excellent way to build, validate, and test ecological models at regional or national scales.
So why doesn't everyone use public data? Well, it's often collected by a variety of organizations, with different methods, units, and inconsistent metadata. Together, these issues make large public datasets "messy." Data can be messy in many different ways, but at the basic level it means the data are hard to analyze, not because the data themselves are bad, but because the way they are organized is unclear or inconsistent.
In this lab, we will learn some tricks for "tidying" data, making it analysis-ready. We will depend heavily on the [tidyverse](https://www.tidyverse.org/), an excellent series of packages that make data manipulation beautiful and easy. Because we are working with Water Quality Portal data, we will also use the excellent [dataRetrieval](https://github.com/USGS-R/dataRetrieval) package for downloading data from the Water Quality Portal and the USGS.
## Loading key packages
This lab is meant to introduce the incredible variety of tools one can use to clean data. Many of these tools are captured by the `tidyverse` meta-package (a package of packages), but there are some additional ones that will help us locate and map our various water quality sites.
```{r setup, warning=FALSE, message=FALSE}
library(tidyverse) # Package with dplyr, tibble, readr, and others to help clean coding
library(dataRetrieval) # Package to download data.
library(sf) #Geospatial package to plot and explore data
library(mapview) #Simple interface to leaflet interactive maps
library(broom) #Simplifies model outputs
library(knitr) #Makes nice tables
library(kableExtra) #Makes even nicer tables
library(lubridate) #Makes working with dates easier
library(ggthemes) #Makes plots prettier
library(tidyr) #Makes multiple simultaneous models easier
```
# Downloading data.
For this lab, we'll explore water quality data in the Colorado River Basin as it moves from Colorado to Arizona. All data will be generated through the code you see below; the only external information needed is the SiteIDs for the monitoring locations along the Colorado River and the water quality characteristic names.
The Water Quality Portal can be accessed with the function `readWQPdata`, which takes a variety of arguments (start date, end date, constituents, etc.). We'll generate these rules for downloading the data here.
## Download prep
```{r download prep}
#First we'll make a tibble (a tidyverse table) with Site IDs. These are ordered roughly downstream from the Colorado headwaters near Grand Lake.
colorado <- tibble(sites=c('USGS-09034500','USGS-09069000','USGS-09071100',
'USGS-09085000','USGS-09095500','USGS-09152500',
'USGS-09180000','USGS-09180500','USGS-09380000'),
basin=c('colorado1','eagle','colorado2',
'roaring','colorado3','gunnison',
'dolores','colorado4','colorado5'))
#Now we need to set up a series of rules for downloading data from the Water Quality Portal.
#We'll focus on cation and anion data from 1980 to present. Each constituent has a name we might typically use (like calcium or sulfate), but the name may be different in the Water Quality Portal, so we have to check this website https://www.waterqualitydata.us/Codes/Characteristicname?mimeType=xml to get our names correct.
parameter.names <- c('ca','mg','na','k','so4','cl','hco3')
ca <- c('Calcium')
mg <- c('Magnesium')
na <- 'Sodium'
k <- 'Potassium'
so4 <- c('Sulfate','Sulfate as SO4','Sulfur Sulfate','Total Sulfate')
cl <- 'Chloride'
hco3 <- c('Alkalinity, bicarbonate','Bicarbonate')
#Compile all these names into a single list
parameters <- list(ca,mg,na,k,so4,cl,hco3)
#Name each cation or anion in the list
names(parameters) <- parameter.names
#Notice that we aren't downloading any nutrients (P or N) because they are much messier (100s of different ways to measure and report concentration data) than other cation anion data.
#Start and end dates
start <- '1980-10-01'
end <- '2024-03-20'
#Sample media (no sediment samples)
sampleMedia = 'Water'
#Compile all this information into a list of arguments
site.args <- list(siteid=colorado$sites,
sampleMedia=sampleMedia,
startDateLo=start,
startDateHi=end,
characteristicName=NA) #We'll fill this in later in a loop
```
## Concentration data download
Now that we have generated the rules for downloading the data, the code to do the download is below. Be warned that it takes 15 minutes or so to run, so once you have the data you may want to set `eval=F` in the chunk header to avoid re-downloading it every time you knit.
```{r concentration download}
conc.list <- list() #Empty list to hold each data download
#We'll loop over each anion or cation and download all data at our sites for that constituent
for(i in 1:length(parameters)){
#We need to rename the characteristicName (constituent) each time we go through the loop
site.args$characteristicName<-parameters[[i]]
#readWQPdata takes in our site.args list and downloads the data according to those rules
# time, constituent, site, etc...
# Don't forget about pipes "%>%"! Pipes pass forward the results of a previous command, so that
#You don't have to constantly rename variables. I love them.
conc.list[[i]] <- readWQPdata(site.args) %>%
mutate(parameter=names(parameters)[i]) #Mutate just adds a new column to the data frame
#Pipes make the above command simple and succinct versus something more complicated like:
## conc.list[[i]] <- readWQPdata(site.args)
## conc.list[[i]]$parameter <- names(parameters)[i]
}
#Bind all these data frames together into a single data frame
conc.long <- bind_rows(conc.list)
```
## Site info download
We also need to download some site info so we can know where these sites are.
```{r site info download}
#In addition to concentration information, we probably want to know some things about the sites.
#dplyr::select can help us only keep site information that is useful.
site.info <- whatWQPsites(siteid=colorado$sites) %>%
dplyr::select(SiteID=MonitoringLocationIdentifier,
name=MonitoringLocationName,
area=DrainageAreaMeasure.MeasureValue,
area.units=DrainageAreaMeasure.MeasureUnitCode,
lat=LatitudeMeasure,
long=LongitudeMeasure) %>%
distinct() #Distinct just keeps the first of any duplicates.
```
# Data tidying
Now that we have downloaded the data, we need to tidy it up. Water Quality Portal data comes with an incredible amount of metadata in the form of extra columns, and we don't need all of that extra data.
## Look at the data you downloaded.
We downloaded two kinds of data: first, site info, which has things like latitude and longitude; and second, concentration data. We already slightly tidied the site info data so that it has sensible column names.
```{r site info}
head(site.info)
```
This dataset looks nice because it has all the information we need and nothing extra. Now let's look at the concentration data.
```{r conc data}
head(conc.long) %>%
kable(.,'html') %>%
kable_styling() %>%
scroll_box(width='800px',height='300px')
```
## Initial cleaning up
Wow, that looks messy! Lots of extraneous columns, lots of NAs, so much information we can hardly parse it. Let's pare it down to the essentials.
```{r tidying up concentration}
#This code mostly just grabs and renames the most important data columns
conc.clean <- conc.long %>%
dplyr::select(date=ActivityStartDate,
parameter=CharacteristicName,
units=ResultMeasure.MeasureUnitCode,
SiteID=MonitoringLocationIdentifier,
org=OrganizationFormalName,
org_id=OrganizationIdentifier,
time=ActivityStartTime.Time,
value=ResultMeasureValue,
sample_method=SampleCollectionMethod.MethodName,
analytical_method=ResultAnalyticalMethod.MethodName,
particle_size=ResultParticleSizeBasisText,
date_time=ActivityStartDateTime,
media=ActivityMediaName,
sample_depth=ActivityDepthHeightMeasure.MeasureValue,
sample_depth_unit=ActivityDepthHeightMeasure.MeasureUnitCode,
fraction=ResultSampleFractionText,
status=ResultStatusIdentifier) %>%
#Remove trailing white space in labels
mutate(units = trimws(units)) %>%
#Keep only samples that are water samples
filter(media=='Water') #Some of these snuck through!
```
Now let's look at the tidier version
```{r examine tidier data}
head(conc.clean) %>%
kable(.,'html') %>%
kable_styling() %>%
scroll_box(width='800px',height='300px')
```
## Final tidy dataset
Okay, that is getting better, but we still have lots of extraneous information. For our purposes, let's assume that the sample and analytical methods used by the USGS are reasonable and exchangeable (one method is equivalent to another). If we make that assumption, then the only remaining cleaning step is to make sure that all the data share the same units.
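As a quick, optional sanity check on that assumption (just a sketch; nothing downstream depends on it), we can count how many distinct analytical methods actually show up in the cleaned data:
```{r method check}
#Optional sketch: how many distinct analytical methods appear for each parameter?
#If one parameter were split across wildly different methods, we might rethink the assumption.
conc.clean %>%
  count(parameter, analytical_method, sort = TRUE) %>%
  head(20)
```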
### Unit Check
```{r unit check}
table(conc.clean$units)
```
All the data are in mg/L, so no unit conversions (or filtering by units) are needed.
Now we just need to select an even smaller subset of useful columns, while adding a proper date column using the `lubridate::ymd` call.
```{r tidy}
conc.tidy <- conc.clean %>%
mutate(date=ymd(date)) %>%
select(date,
parameter,
SiteID,
conc=value)
```
### Daily data
Okay, now we have a manageable data frame. But how do we want to organize the data? Since we are looking at a long time series (over 40 years of data), let's look at the data as daily averages. The `dplyr::group_by` and `summarize` commands make this really easy.
```{r daily}
#The amazing group_by function groups all the data so that the summary
#only applies to each subgroup (site, date, and parameter combination).
#So in the end you get a daily average concentration for each site and parameter combination.
conc.daily <- conc.tidy %>%
group_by(date,parameter,SiteID) %>%
summarize(conc=mean(conc,na.rm=T))
```
Taking daily averages eliminated `r nrow(conc.tidy) - nrow(conc.daily)` observations, meaning those site-date combinations had multiple observations on the same day.
# Analyzing data
Now we have a 'tidy' dataset. Let's look at the data we have. First where is the data?
### Map
Here we use the `sf` package to convert the site information into a GIS-type data object called a simple feature (`sf`). The function `st_as_sf` converts the long (x) and lat (y) coordinates into a point feature with the EPSG code 4326 (WGS 84, a geographic coordinate reference system). We can then use the `mapview` package and function to look at where these sites are.
```{r}
#convert site info as an sf object
site.sf <- site.info %>%
st_as_sf(.,coords=c('long','lat'), crs=4326)
mapview(site.sf)
```
So these sites are all in the Colorado River Basin, ordered roughly downstream with increasing watershed size.
## Concentration data
Now that we know where the data is coming from, let's look at the actual data we downloaded using `ggplot2`.
### Calcium only
To keep the plots simple at first let's look at calcium data by site.
```{r daily plot}
conc.daily %>%
filter(parameter == 'Calcium') %>%
ggplot(.,aes(x=date,y=conc)) +
geom_point() +
facet_wrap(~SiteID)
```
Okay, that's a lot of data! Maybe too much. Let's focus on sites with relatively continuous records and then summarize the data by year.
### Annual summaries of full sites
Let's shrink the dataset to only look at annual change.
```{r annual only}
too.few.years <- c('USGS-09034500','USGS-09071100','USGS-09085000')
conc.annual <- conc.daily %>%
filter(!SiteID %in% too.few.years) %>% #! means "not", so we keep all the sites that are not in the too.few.years vector.
mutate(year=year(date)) %>%
group_by(SiteID,year,parameter) %>%
summarize(annual_mean=mean(conc,na.rm=T),
annual_var=var(conc,na.rm=T))
```
### Plot all the annual data.
```{r ugly}
conc.annual %>%
ggplot(.,aes(x=year,y=annual_mean,color=SiteID)) +
geom_point() +
facet_wrap(~parameter,scales='free')
```
That plot is... ugly! Maybe we can make something prettier
### Prettier annual plot.
Having the points colored by SiteID is not super useful unless you have memorized the names and locations of USGS gauges. So maybe we can color them using the table we used to download the data? That table, `colorado`, was organized so that each basin has its own name, with the mainstem sites ordered by increasing basin size. That's a better way to think about the data than SiteID names. So let's use a `join` function to join the datasets and use the basin names. We'll also use the `ggthemes` package to try to make the plots prettier.
```{r pretty,fig.width=9,fig.height=7}
conc.annual %>%
left_join(colorado %>%
rename(SiteID=sites),by='SiteID') %>%
ggplot(.,aes(x=year,y=annual_mean,color=basin)) +
geom_point() +
facet_wrap(~parameter,scales='free') +
theme_few() +
scale_color_few() +
theme(legend.position=c(.7,.15)) +
guides(color=guide_legend(ncol=2))
```
### Watershed size
Many prior publications have shown that increasing watershed size means decreasing variance in anion and cation concentrations. We can use our dataset to test this in the Colorado River Basin.
```{r}
conc.annual %>%
left_join(site.info,by='SiteID') %>%
filter(annual_var < 5000) %>%
ggplot(.,aes(x=area,y=annual_var,color=year)) +
geom_point() +
scale_x_log10() +
facet_wrap(~parameter,scales='free') +
theme_few() +
theme(legend.position=c(.7,.15))
```
Cool! Looks like it's generally true across all constituents. Potassium has low variance. Why? No clue!
## Reshaping the data
From basic weathering geochemistry we know that bicarbonate concentrations should be correlated with Mg and Ca, depending on the weathering reactions that generate these river signals. The current 'long' shape of the data makes looking at these correlations difficult. So we want to 'widen' the data so the parameters are arranged in side-by-side columns. This is really easy with tidyr's `spread` and `gather` (or their newer replacements, `pivot_wider` and `pivot_longer`)!
```{r}
head(conc.annual)
conc.wide <- conc.annual %>%
select(-annual_var) %>%
spread(key=parameter,value=annual_mean) %>%
mutate(`Mg+Ca`=Magnesium+Calcium)
ggplot(conc.wide,aes(x=Bicarbonate,y=`Mg+Ca`,color=SiteID)) +
geom_point() +
geom_abline(slope=1,intercept=0)
```
## Model changes
It looks to me like there might be some trends in the data at certain sites (Mg and SO4 in particular). Let's use some more advanced R to check whether there are linear trends in these datasets.
### Nesting and modelling
The really excellent packages `purrr` and `tidyr` make fitting multiple models to different sites easy. Quick example here:
```{r}
conc.nest <- conc.annual %>%
group_by(parameter,SiteID) %>%
nest()
head(conc.nest)
```
That nests or wraps up the data into tidy little bundles. We can access these bundled datasets by using a map function that applies a model *inside* a bundle.
```{r}
#Create a generic model function (mean as a function of time)
time.lm.models <- function(x){
  lm(annual_mean ~ year, data = x)
}
conc.models <- conc.nest %>%
mutate(mods=map(data,time.lm.models))
```
Now we have an individual time-series analysis model for each site and parameter combination. But how do we see the results in a tidy way? Broom to the rescue!
```{r}
conc.models %>%
mutate(mod.glance=map(mods,glance)) %>%
unnest(mod.glance) %>% #Unnesting unwraps the nested column.
arrange(desc(adj.r.squared)) %>%
select(parameter,SiteID,adj.r.squared,p.value,logLik,AIC) %>%
kable(.,'html') %>%
kable_styling() %>%
scroll_box(width='600px',height='500px')
```
Cool! Lots of these constituents have significant trends. Many of them are declining. Why? What does that mean for water quality? I don't know, but we could probably use more public data to investigate.
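The `glance()` table gives fit statistics but not the direction of each trend. If you want to see which way each trend is going, a similar call with `broom::tidy()` pulls out the year coefficient (a quick sketch; nothing below depends on it):
```{r}
#Sketch: pull the slope (year coefficient) out of each model with broom::tidy()
#so we can see which site/parameter combinations are increasing vs. declining.
conc.models %>%
  mutate(mod.tidy = map(mods, tidy)) %>%
  unnest(mod.tidy) %>%
  filter(term == 'year') %>%
  select(parameter, SiteID, slope = estimate, p.value) %>%
  arrange(slope) %>%
  head(10)
```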
# Assignment
The above code is... voluminous! But the goal was to show you the full pipeline of
downloading, cleaning, harmonizing, analyzing, and displaying public water quality
data. For this assignment you will continue to build out your own ability to
analyze large complex data.
## Background
The reason so many of these sites in Colorado have lots of chemical concentration
data is that the Colorado River is a critical resource throughout
the western USA, yet it has suffered from impaired water quality, specifically
increases in ion concentrations leading to high salinity. This is particularly
true in the Gunnison River, both because the bedrock there is full of readily
weathered ions and because irrigation practices artificially elevate weathering rates
(more water = more ions released). As such, the Gunnison River has a salinity
control plan (https://gunnisonriverbasin.org/water-quality/salinity-control/).
For this assignment you will be exploring changes in salinity in the Gunnison River,
the Dolores River, and the Colorado River above and below where these two rivers
join it (all of this is near Grand Junction). The main questions we have are:
1) What are the trends in specific conductivity (salinity) in these three rivers since 2007?
2) What is the correlation between specific conductivity and individual ion concentrations?
3) How does discharge control these relationships?
## Q1) What are trends in Specific Conductivity?
### Download Spec Cond Data for our Focus Sites.
```{r}
focus_sites <- tibble(sites=c('USGS-09095500','USGS-09152500',
                              'USGS-09180000','USGS-09180500'),
                      basin=c('colorado_above','gunnison',
                              'dolores','colorado_below'))
sc_code = '00095'
# You will need to use the readNWISdv function to download daily spec cond data
# for these sites. Data starts in 2007. Pay attention to how that function
# differs from readWQPdata. Consider also using the renameNWISColumns function to
# make this data easier to read.
site_nos <- gsub('USGS-','',focus_sites$sites)
sc <- readNWISdv(siteNumbers = site_nos,
parameterCd = sc_code,
startDate = '2007-01-01',
endDate = '2024-03-20') %>%
renameNWISColumns()
```
### Summarize the daily data to annual means for lower flow periods (July/Aug/Sept/Oct)
Salinity is primarily an issue during low flow periods so when we are looking for
trends we want to focus on the saltiest times of year, generally July-Oct.
```{r}
#Group_by, filter, and summarize will be key here.
sc_summers <- sc %>%
#add month columns
mutate(month = month(Date),
year = year(Date)) %>%
#filter to only those summer months
filter(month %in% c(7,8,9,10)) %>%
#group by sites and years to take site and year averages
group_by(site_no, year) %>%
# Take the mean, removing NAs and add a count of obs to make sure
# our summer annual mean represents a good amount of data
summarize(summer_sc = mean(SpecCond,na.rm = T),
count = n())
## Add a visual check.
ggplot(sc_summers, aes(x = year, y = summer_sc,
color = site_no)) +
geom_point() +
theme_few() +
scale_color_few() +
theme(legend.position = c(.2,.8))
# What's wrong with this site?
```
### Using the `map` code from above, test if there are trends
To do this correctly you need to use a Sen's slope test (`sens.slope`), a non-parametric
test for trends, from the `trend` package, which you will need to install and load.
```{r}
### Straight from Sam Struthers with MR modifications
library(trend) # Package with the sens.slope function
#Custom sens slope function and output guidance
my_sens <- function(df){
mod <- sens.slope(df$summer_sc)
#Extract the things we want
#if there is a trend (p.value),
# what is the slope?
tibble(slope = mod$estimates,
p_value = mod$p.value)
}
sc_results <- sc_summers %>%
group_by(site_no) %>%
#Bundle the data
nest() %>%
#Apply the model
mutate(mods=map(data,my_sens)) %>%
# Extract model results
unnest(cols = mods) %>%
#remove data
select(-data) %>%
#Join back to site names
#First need to add sites with USGS - in front
mutate(sites = paste0('USGS-',site_no)) %>%
inner_join(focus_sites)
kable(sc_results)
```
## Q2) What is the correlation between specific conductivity and individual ion concentrations?
```{r}
# Think about if you want a full join, inner join, left join, etc...
sc_ions <- sc %>%
## Add a new siteid column so it matches conc.daily
mutate(SiteID = paste0('USGS-',site_no)) %>%
inner_join(focus_sites, by = join_by(SiteID == sites)) %>%
#rename the date column so we can join these
rename(date = Date) %>%
# join to the conc.daily dataset. We only want matching data here!
inner_join(conc.daily)
```
### Plot the correlation with sc on the x axis and individual ions on the y
```{r}
ggplot(sc_ions, aes(x = SpecCond, y = conc, color = basin))+
geom_point()+
facet_wrap(~parameter, scales = "free")+
ylab("Concentration of Ion (mg/L)") +
scale_x_log10() +
scale_y_log10()
```
### Use the `map` and modelling code to test for correlation with `lm`
```{r}
```
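If you get stuck, here is one possible shape for the answer (a sketch only, reusing the nest/`map`/`glance` pattern from earlier; the function and object names are just illustrative):
```{r}
#Sketch: fit conc ~ SpecCond for every basin/parameter combination,
#then extract fit statistics with broom::glance().
sc_ion_lm <- function(df){
  lm(conc ~ SpecCond, data = df)
}
sc_ion_models <- sc_ions %>%
  group_by(basin, parameter) %>%
  nest() %>%
  mutate(mods = map(data, sc_ion_lm),
         mod.glance = map(mods, glance)) %>%
  unnest(mod.glance) %>%
  select(basin, parameter, adj.r.squared, p.value) %>%
  arrange(desc(adj.r.squared))
kable(sc_ion_models)
```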
### What Ions are correlated with SC? Why might this be?
## Q3) How does discharge control these relationships?
### Free play.
This section is meant to encourage you to play with the data. You will definitely need to download discharge data at our focus sites, but how you explore this question is up to you. What's the relationship between discharge and SC? Does this relationship alter the relationship between SC and ions? Does a model with both SC and discharge predict water quality concentrations better than SC alone? Try one of these approaches and write out why you think you see what you see.
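As a rough starting point, here is a sketch of how you might pull daily discharge (NWIS parameter code `00060`) for the focus sites and compare it to specific conductance; the object names here are just illustrative:
```{r}
#Sketch: download daily discharge at the focus sites and join it to the
#specific conductance data, then eyeball the discharge-SC relationship.
q <- readNWISdv(siteNumbers = site_nos,
                parameterCd = '00060', #daily discharge
                startDate = '2007-01-01',
                endDate = '2024-03-20') %>%
  renameNWISColumns()
sc_q <- sc %>%
  inner_join(q %>% select(site_no, Date, Flow),
             by = c('site_no', 'Date'))
ggplot(sc_q, aes(x = Flow, y = SpecCond, color = site_no)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  theme_few() +
  scale_color_few()
```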