---
title: "base R vs the tidyverse"
author: "Stephen Roecker"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: yes
    toc_float:
      collapsed: yes
      smooth_scroll: no
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
## Abstract
There are now 2 distinct dialects of the R programming language in the wild. The first and original dialect is typically referred to as "base" R, which derives from the base R package that comes pre-loaded as part of the standard R installation. The second is known as the ["tidyverse"](https://www.tidyverse.org/) (or affectionately the "Hadleyverse") and was largely developed by Hadley Wickham, one of R's most prolific package contributors. The tidyverse is an 'opinionated' collection of R packages that duplicate and seek to improve upon numerous base R functions for data manipulation (e.g. dplyr) and graphing (e.g. ggplot2). As the tidyverse has grown increasingly comprehensive, it has been suggested that it be taught first to new R users. The debate over which R dialect is better has generated a lot of heat, but not much light. This talk will review the similarities (with numerous examples) between the 2 dialects and hopefully give new and old R users some perspective.
## Comparison
**base**
- base R is more closely associated with "Statistics", with a focus on **statistical methods**, not data manipulation methods
- variable syntax
- multiple ways to accomplish the same thing
- vector is probably the primary object
- base functions may produce multiple outputs based on arguments
- stable
- lots of [books](https://www.r-project.org/doc/bib/R-books.html) and [examples](https://www.statmethods.net/index.html), but fewer [free books](https://en.wikibooks.org/wiki/R_Programming)
**tidyverse**
- the tidyverse is more closely associated with "Data Science", with a focus on **data manipulation** methods
- consistent syntax (e.g. the data argument always comes first; see the sketch after this list)
- fewer ways to accomplish the same thing
- data.frame is the primary object
- tidyverse functions produce a single output
- unstable
- lots of [free books](https://bookdown.org/) (e.g. [R for Data Science](http://r4ds.had.co.nz/)) and [examples](https://www.rstudio.com/resources/cheatsheets/) for the tidyverse
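To make the syntax contrast above concrete, here is a small sketch (not evaluated here, since the horizon data.frame `h` is only built in the Toy Soil Dataset section below) showing several base R spellings of the same column selection next to the single dplyr verb:
```{r syntax-sketch, eval = FALSE}
# base R: several equivalent ways to select the same two columns
h[, c("clay_r", "sand_r")]
h[names(h) %in% c("clay_r", "sand_r")]
subset(h, select = c(clay_r, sand_r))

# tidyverse - dplyr: one verb, with the data.frame always as the first argument
select(h, clay_r, sand_r)
```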
## Load Packages
```{r}
suppressWarnings( {
  library(aqp)
  library(soilDB)    # install from github
  library(lattice)
  library(tidyverse) # includes: dplyr, tidyr, ggplot2, etc...
})
```
## Toy Soil Dataset
```{r}
# component and horizon data for the Miami and Crosby soil series (common around Marion County, IN), queried from Soil Data Access (SDA)
s  <- get_component_from_SDA(WHERE = "compname IN ('Miami', 'Crosby') AND majcompflag = 'Yes' AND areasymbol != 'US'")
h1 <- get_chorizon_from_SDA(WHERE = "compname IN ('Miami', 'Crosby')")

# generalized horizon rules for the Miami series
source("https://raw.githubusercontent.com/ncss-tech/soilReports/master/inst/reports/region11/lab_summary_by_taxonname/genhz_rules/Miami_rules.R")

# keep horizons that match the components above and assign generalized horizon labels
h <- subset(h1, cokey %in% s$cokey & !grepl("H", hzname))
h$genhz <- generalize.hz(h$hzname, new = ghr$n, pat = ghr$p)
names(h) <- gsub("total", "", names(h))

# build a SoilProfileCollection (h2) for plotting and a flat data.frame (h) for the examples below
h2 <- h
h  <- merge(h, s[c("cokey", "compname")], by = "cokey", all.x = TRUE)
depths(h2) <- cokey ~ hzdept_r + hzdepb_r
site(h2) <- s
# examine dataset
str(h, 2)
# plot dataset
plot(h2[1:10], label = "compname", name = "genhz", color = "clay_r")
```
## Options
In a lot of cases the tidyverse functions have different defaults than their similarly named base R counterparts.
### Strings vs Factors
```{r}
fp <- "C:/workspace2/test.csv"
write.csv(s, file = fp, row.names = FALSE)
s1 <- read.csv(file = fp)
str(s1$drainagecl)
# base option 1
s_b <- read.csv(file = fp, stringsAsFactors = FALSE)
str(s_b$drainagecl)
# base option 2
options(stringsAsFactors = FALSE)
s_b <- read.csv(file = fp)
str(s_b$drainagecl)
# tidyverse - readr
s_t <- read_csv(file = fp) # notice the output is a tibble (a tidyverse-flavored data.frame)
str(s_t$drainagecl)
```
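The two object types convert freely in either direction, which is handy when a base function or a tidyverse verb prefers one over the other (a minimal sketch using the objects read in above):
```{r}
# tibble -> plain data.frame (also restores base's default printing behaviour)
s_df <- as.data.frame(s_t)
class(s_df)

# plain data.frame -> tibble
s_tbl <- as_tibble(s_b)
class(s_tbl)
```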
### Printing
```{r}
# base
head(s_b) # or
# print(s_b) # prints the whole table
# tidyverse
head(s_t) # or
# print(s_t) # prints the first 10 rows
```
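One convenience of the tibble print method is that the number of rows and the display width can be adjusted per call (a small sketch):
```{r}
# show only 3 rows and print all columns regardless of console width
print(s_t, n = 3, width = Inf)
```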
## Standard Evaluation with base R
```{r accessing operators}
## square brackets using column names
summary(h[, "clay_r"], na.rm = TRUE)
# square brackets using logical indices
idx <- names(h) %in% "clay_r"
summary(h[, idx], na.rm = TRUE)
# square brackets using column indices
which(idx)
summary(h[, 12], na.rm = TRUE)
## $ operator
summary(h$clay_r, na.rm = TRUE)
```
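Base R also has the double square bracket accessor, which, like `$`, always returns the column as a bare vector (a small sketch):
```{r}
## [[ operator
summary(h[["clay_r"]])
```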
## Non-Standard Evaluation (NSE)
Non-standard evaluation (NSE) allows you to access columns within a data.frame without repeatedly specifying the data.frame. This is particularly useful with long data.frame object names (e.g. soil_horizons vs h) and many calls to different columns. The tidyverse implements NSE by default. Base R has a few functions, like `with()` and `attach()`, that facilitate NSE, but few functions implement it by default. NSE is somewhat contentious because it can have unintended consequences if you have objects and columns with the same name. As such, NSE is generally meant for interactive analysis, not programming.
```{r nse}
# base option 1
with(h, { data.frame(
  min = min(clay_r, na.rm = TRUE),
  mean = mean(clay_r, na.rm = TRUE),
  max = max(clay_r, na.rm = TRUE)
)})
# base option 2
attach(h)
data.frame(
  min = min(clay_r, na.rm = TRUE),
  mean = mean(clay_r, na.rm = TRUE),
  max = max(clay_r, na.rm = TRUE)
)
detach(h)
# tidyverse non-standard evaluation (enabled by default) - dplyr
summarize(h,
  min = min(clay_r, na.rm = TRUE),
  mean = mean(clay_r, na.rm = TRUE),
  max = max(clay_r, na.rm = TRUE)
)
```
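A minimal sketch of the naming hazard mentioned above; the global object `clay_r` below is hypothetical, created only for this illustration:
```{r}
# a global object that happens to share a name with a column in h
clay_r <- 99

# while the column exists, NSE finds the column first and the global object is ignored
nrow(subset(h, clay_r > 30))

# drop the column (simulating a typo or an upstream select); subset() now silently falls
# back to the global clay_r, the condition becomes a single TRUE, and every row is returned
h_typo <- h[, setdiff(names(h), "clay_r")]
nrow(subset(h_typo, clay_r > 30))

rm(clay_r) # clean up
```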
## Subsetting vs Filtering
```{r subsetting}
# base R
sub_b <- subset(h, genhz == "Ap")
dim(sub_b)
# tidyverse - dplyr
sub_t <- filter(h, genhz == "Ap")
dim(sub_t)
```
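One subtlety worth noting: both `subset()` and `filter()` drop rows where the condition evaluates to NA, while plain bracket indexing keeps them as all-NA rows unless the NAs are excluded explicitly (a small sketch):
```{r}
# base R - bracket indexing; the !is.na() guard mirrors what subset()/filter() do automatically
sub_brackets <- h[h$genhz == "Ap" & !is.na(h$genhz), ]
dim(sub_brackets)
```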
## Ordering vs Arranging
```{r ordering}
# base
with(h, h[order(cokey, hzdept_r), ])[1:4, 1:4]
# tidyverse - dplyr
arrange(h, cokey, hzdept_r)[1:4, 1:4]
```
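Descending order is a common variation; a small sketch of the equivalent spellings:
```{r}
# base: negate a numeric column inside order() (or use decreasing = TRUE)
with(h, h[order(cokey, -hzdept_r), ])[1:4, 1:4]

# tidyverse - dplyr: wrap the column in desc()
arrange(h, cokey, desc(hzdept_r))[1:4, 1:4]
```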
## Piping
Often referred to as 'syntactic sugar', piping is supposed to make code more readable by making it read from left to right, rather than from the inside out. This becomes particularly valuable when 3 or more functions are combined. It also alleviates the need to overwrite existing objects at each step.
```{r pipping}
# base
pip_b <- {subset(s, drainagecl == "Well drained") ->.; # right-assign into "." to mimic a pipe
  .[order(.$nationalmusym), ]
}
pip_b[1:4, 1:4]
# tidyverse
pip_t <- filter(s, drainagecl == "Well drained") %>%
  arrange(nationalmusym)
pip_t[1:4, 1:4]
```
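To make the inside-out versus left-to-right contrast more obvious, here is a sketch of the same idea with three or more steps chained together (filter, select, sort, then take the first few rows of the component data `s`):
```{r}
# base R: the first operation sits in the innermost call, so the code reads inside out
head(
  subset(
    s[order(s$nationalmusym), ],
    drainagecl == "Well drained",
    select = c(nationalmusym, compname, drainagecl)
  ),
  4
)

# tidyverse: the same steps read top to bottom, left to right
filter(s, drainagecl == "Well drained") %>%
  select(nationalmusym, compname, drainagecl) %>%
  arrange(nationalmusym) %>%
  head(4)
```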
## Split-Apply-Combine
In lots of cases we want to know the variation within groups.
```{r split}
# base
vars <- c("compname", "genhz")
sca_b <- {
  split(h, h[vars], drop = TRUE) ->.;  # split
  lapply(., function(x) data.frame(    # apply
    x[vars][1, ],
    clay_min = round(min(x$clay_r, na.rm = TRUE)),
    clay_mean = round(mean(x$clay_r, na.rm = TRUE)),
    clay_max = round(max(x$clay_r, na.rm = TRUE))
  )) ->.;
  do.call("rbind", .)                  # combine
}
print(sca_b)
# tidyverse - dplyr
sca_t <- group_by(h, compname, genhz) %>% # split (sort of)
  summarize(                              # apply and combine
    clay_min = round(min(clay_r, na.rm = TRUE)),
    clay_mean = round(mean(clay_r, na.rm = TRUE)),
    clay_max = round(max(clay_r, na.rm = TRUE))
  )
print(sca_t)
```
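For completeness, base R also ships `aggregate()`, which wraps the same split-apply-combine idea in a single call; a sketch (note the three summaries come back as a single matrix column named `clay_r`, and rows with missing clay are dropped by the default `na.action`):
```{r}
sca_agg <- aggregate(clay_r ~ compname + genhz, data = h,
                     FUN = function(x) round(c(min = min(x), mean = mean(x), max = max(x))))
head(sca_agg)
```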
## Reshaping
In lots of instances, particularly for graphing, it's necessary to convert a data.frame from **wide** to **long** format.
```{r reshaping}
# base wide to long
vars <- c("clay_r", "sand_r", "om_r")
idvars <- c("compname", "genhz")
head(h[c(idvars, vars)])
lo_b <- reshape(h[c("compname", "genhz", vars)], # need to exclude unused columns
                direction = "long",
                timevar = "variable", times = vars, # capture the names of the variables in a "variable" column
                v.names = "value", varying = vars   # capture the values of the variables in a "value" column
                )
head(lo_b) # notice the row.names
# tidyverse wide to long
idx <- which(names(h) %in% vars)
lo_t <- select(h, compname, genhz, idx) %>% # need to exclude unused columns
  gather(key = variable,
         value = value,
         - compname, - genhz
         )
head(lo_t)
# sort factors
comp_sort <- aggregate(value ~ compname, data = lo_b[lo_b$variable == "clay_r", ], median, na.rm = TRUE)
comp_sort <- comp_sort[order(comp_sort$value), ]
lo_b <- within(lo_b, {
  compname = factor(lo_b$compname, levels = comp_sort$compname)
  genhz = factor(genhz, levels = rev(levels(genhz)))
})
# lattice box plots
bwplot(genhz ~ value | variable + compname,
       data = lo_b,
       scales = list(x = "free")
       )
# ggplot2 box plots
ggplot(lo_b, aes(x = genhz, y = value)) + # ggplot2 doesn't like factors or strings on the y-axis, hence coord_flip() below
  geom_boxplot() +                        # notice ggplot2 chains layers with "+", not "%>%"
  facet_wrap(~ compname + variable, scales = "free_x") +
  coord_flip()
```
```{r lattice, eval = FALSE, echo = FALSE}
test2 <- get_cosoilmoist_from_SDA_db(WHERE = "mukey = '406339'")
test <- subset(test2, !is.na(dept_r) & status == "Wet")
ggplot(test, aes(x = as.integer(month), y = dept_r)) +
  geom_ribbon(aes(ymin = dept_l, ymax = dept_h), alpha = 0.2) +
  geom_line() +
  ylim(max(test$dept_h), -5) + # won't plot unless the full range is present
  facet_wrap(~ compname)
panel_gribbon <- function(x, y, upper, lower, ...,
                          fill, col, subscripts, font, fontface) {
  upper = upper[subscripts]
  lower = lower[subscripts]
  panel.polygon(c(x, rev(x)), c(upper, rev(lower)),
                col = fill, border = FALSE)
}
panel_ribbon <- function(x, y, ...) {
  panel.superpose(x, y, ..., panel.groups = panel_gribbon)
  panel.xyplot(x, y, ...)
}
xyplot(data = test, dept_r ~ as.integer(month) | compname,
       groups = test$compname,
       type = "b", lty = 1:2,
       upper = test$dept_l, lower = test$dept_h,
       ylim = c(150, -5),
       grid = TRUE,
       panel = function(x, y, ...){
         panel.superpose(x, y, ..., panel.groups = panel_gribbon)
         panel.xyplot(x, y, ...)
       }
       )
xyplot(data = test, dept_r ~ as.integer(month) | compname,
       groups = test$compname,
       type = "b", lty = 1:2,
       upper = test$dept_l, lower = test$dept_h,
       ylim = c(150, -5),
       grid = TRUE,
       panel = panel_ribbon
       )
```
## Conclusion
The tidyverse and its precursors, plyr and reshape2, introduced me to a lot of new ways of manipulating data and made me question 'how would I do that in base?'.
**base**
- can be tidyish
- more abstract syntax (e.g. "[")
- fast
- 'very' flexible, to the point of being confusing
- awkward defaults (e.g. column and row naming, default sorting)
**tidyverse**
- more verbose syntax for some things, less verbose for others
- faster (usually)
- 'very' opinionated, to the point of being annoying
- clean defaults
## Questions
- is the tidyverse a better syntax for new R users?