---
title: "base R vs the tidyverse"
author: "Stephen Roecker"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: yes
    toc_float:
      collapsed: yes
      smooth_scroll: no
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
## Abstract
There are now 2 distinct dialects of the R programming language in the wild. The first and original dialect is typically referred to as "base" R, which derives from the base R package that comes pre-loaded as part of the standard R installation. The second is known as the ["tidyverse"](https://www.tidyverse.org/) (or affectionately the "Hadleyverse") and was largely developed by Hadley Wickham, one of R's most prolific package contributors. The tidyverse is an 'opinionated' collection of R packages that duplicate and seek to improve upon numerous base R functions for data manipulation (e.g. dplyr) and graphing (e.g. ggplot2). As the tidyverse has grown increasingly comprehensive, it has been suggested that it be taught first to new R users. The debate over which R dialect is better has generated a lot of heat, but not much light. This talk will review the similarities (with numerous examples) between the 2 dialects and hopefully give new and old R users some perspective.
## Comparison
**base**
- base R is more closely associated with "Statistics", with a focus on **statistical methods**, not data manipulation methods
- variable syntax
- multiple ways to accomplish the same thing
- vector is probably the primary object
- base functions may produce multiple outputs based on arguments
- stable
- lots of [books](https://www.r-project.org/doc/bib/R-books.html) and [examples](https://www.statmethods.net/index.html), but fewer [free books](https://en.wikibooks.org/wiki/R_Programming)
**tidyverse**
- the tidyverse is more closely associated with "Data Science", with a focus on **data manipulation** methods
- consistent syntax (e.g. the data argument always comes first; see the sketch after this list)
- fewer ways to accomplish the same thing
- data.frame is the primary object
- tidyverse functions produce a single output
- unstable
- lots of [free books](https://bookdown.org/) (e.g. [R for Data Science](http://r4ds.had.co.nz/)) and [examples](https://www.rstudio.com/resources/cheatsheets/) for the tidyverse
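To make the syntax contrast above concrete, here is a small sketch (not evaluated here, since the horizon data.frame `h` is only built in the Toy Soil Dataset section below) showing several base R spellings of the same column selection next to the single dplyr verb:
```{r syntax-sketch, eval = FALSE}
# base R: several equivalent ways to select the same two columns
h[, c("clay_r", "sand_r")]
h[names(h) %in% c("clay_r", "sand_r")]
subset(h, select = c(clay_r, sand_r))

# tidyverse - dplyr: one verb, with the data.frame always as the first argument
select(h, clay_r, sand_r)
```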
## Load Packages
```{r}
suppressWarnings( {
  library(aqp)
  library(soilDB)    # install from github
  library(lattice)
  library(tidyverse) # includes: dplyr, tidyr, ggplot2, etc...
})
```
## Toy Soil Dataset
```{r}
# component and horizon data for the Miami and Crosby soil series (common around Marion County, IN), queried from Soil Data Access (SDA)
s  <- get_component_from_SDA(WHERE = "compname IN ('Miami', 'Crosby') AND majcompflag = 'Yes' AND areasymbol != 'US'")
h1 <- get_chorizon_from_SDA(WHERE = "compname IN ('Miami', 'Crosby')")

# generalized horizon rules for the Miami series
source("https://raw.githubusercontent.com/ncss-tech/soilReports/master/inst/reports/region11/lab_summary_by_taxonname/genhz_rules/Miami_rules.R")

# keep horizons that match the components above and assign generalized horizon labels
h <- subset(h1, cokey %in% s$cokey & !grepl("H", hzname))
h$genhz <- generalize.hz(h$hzname, new = ghr$n, pat = ghr$p)
names(h) <- gsub("total", "", names(h))

# build a SoilProfileCollection (h2) for plotting and a flat data.frame (h) for the examples below
h2 <- h
h  <- merge(h, s[c("cokey", "compname")], by = "cokey", all.x = TRUE)
depths(h2) <- cokey ~ hzdept_r + hzdepb_r
site(h2) <- s
# examine dataset
str(h, 2)
# plot dataset
plot(h2[1:10], label = "compname", name = "genhz", color = "clay_r")
```
## Options
In a lot of cases the tidyverse functions have different defaults than their similarly named base R counterparts.
### Strings vs Factors
```{r}
fp <- "C:/workspace2/test.csv"
write.csv(s, file = fp, row.names = FALSE)
s1 <- read.csv(file = fp)
str(s1$drainagecl)
# base option 1
s_b <- read.csv(file = fp, stringsAsFactors = FALSE)
str(s_b$drainagecl)
# base option 2
options(stringsAsFactors = FALSE)
s_b <- read.csv(file = fp)
str(s_b$drainagecl)
# tidyverse - readr
s_t <- read_csv(file = fp) # notice the output is a tibble (a tidyverse-flavored data.frame)
str(s_t$drainagecl)
```
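The two object types convert freely in either direction, which is handy when a base function or a tidyverse verb prefers one over the other (a minimal sketch using the objects read in above):
```{r}
# tibble -> plain data.frame (also restores base's default printing behaviour)
s_df <- as.data.frame(s_t)
class(s_df)

# plain data.frame -> tibble
s_tbl <- as_tibble(s_b)
class(s_tbl)
```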
### Printing
```{r}
# base
head(s_b) # or
# print(s_b) # prints the whole table
# tidyverse
head(s_t) # or
# print(s_t) # prints the first 10 rows
```
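One convenience of the tibble print method is that the number of rows and the display width can be adjusted per call (a small sketch):
```{r}
# show only 3 rows and print all columns regardless of console width
print(s_t, n = 3, width = Inf)
```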
## Standard Evaluation with base R
```{r accessing operators}
## square brackets using column names
summary(h[, "clay_r"], na.rm = TRUE)
# square brackets using logical indices
idx <- names(h) %in% "clay_r"
summary(h[, idx], na.rm = TRUE)
# square brackets using column indices
which(idx)
summary(h[, 12], na.rm = TRUE)
## $ operator
summary(h$clay_r, na.rm = TRUE)
```
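Base R also has the double square bracket accessor, which, like `$`, always returns the column as a bare vector (a small sketch):
```{r}
## [[ operator
summary(h[["clay_r"]])
```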
## Non-Standard Evaluation (NSE)
Non-standard evaluation (NSE) allows you to access columns within a data.frame without repeatedly specifying the data.frame. This is particularly useful with long data.frame object names (e.g. soil_horizons vs h) and many calls to different columns. The tidyverse implements NSE by default. Base R has a few functions, like `with()` and `attach()`, that facilitate NSE, but few functions implement it by default. NSE is somewhat contentious because it can have unintended consequences if you have objects and columns with the same name. As such, NSE is generally meant for interactive analysis, not programming.
```{r nse}
# base option 1
with(h, { data.frame(
  min = min(clay_r, na.rm = TRUE),
  mean = mean(clay_r, na.rm = TRUE),
  max = max(clay_r, na.rm = TRUE)
)})
# base option 2
attach(h)
data.frame(
  min = min(clay_r, na.rm = TRUE),
  mean = mean(clay_r, na.rm = TRUE),
  max = max(clay_r, na.rm = TRUE)
)
detach(h)
# tidyverse non-standard evaluation (enabled by default) - dplyr
summarize(h,
  min = min(clay_r, na.rm = TRUE),
  mean = mean(clay_r, na.rm = TRUE),
  max = max(clay_r, na.rm = TRUE)
)
```
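A minimal sketch of the naming hazard mentioned above; the global object `clay_r` below is hypothetical, created only for this illustration:
```{r}
# a global object that happens to share a name with a column in h
clay_r <- 99

# while the column exists, NSE finds the column first and the global object is ignored
nrow(subset(h, clay_r > 30))

# drop the column (simulating a typo or an upstream select); subset() now silently falls
# back to the global clay_r, the condition becomes a single TRUE, and every row is returned
h_typo <- h[, setdiff(names(h), "clay_r")]
nrow(subset(h_typo, clay_r > 30))

rm(clay_r) # clean up
```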
## Subsetting vs Filtering
```{r subsetting}
# base R
sub_b <- subset(h, genhz == "Ap")
dim(sub_b)
# tidyverse - dplyr
sub_t <- filter(h, genhz == "Ap")
dim(sub_t)
```
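One subtlety worth noting: both `subset()` and `filter()` drop rows where the condition evaluates to NA, while plain bracket indexing keeps them as all-NA rows unless the NAs are excluded explicitly (a small sketch):
```{r}
# base R - bracket indexing; the !is.na() guard mirrors what subset()/filter() do automatically
sub_brackets <- h[h$genhz == "Ap" & !is.na(h$genhz), ]
dim(sub_brackets)
```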
## Ordering vs Arranging
```{r ordering}
# base
with(h, h[order(cokey, hzdept_r), ])[1:4, 1:4]
# tidyverse - dplyr
arrange(h, cokey, hzdept_r)[1:4, 1:4]
```
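Descending order is a common variation; a small sketch of the equivalent spellings:
```{r}
# base: negate a numeric column inside order() (or use decreasing = TRUE)
with(h, h[order(cokey, -hzdept_r), ])[1:4, 1:4]

# tidyverse - dplyr: wrap the column in desc()
arrange(h, cokey, desc(hzdept_r))[1:4, 1:4]
```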
## Piping
Often referred to as 'syntactic sugar', piping is supposed to make code more readable by making it read from left to right, rather than from the inside out. This becomes particularly valuable when 3 or more functions are combined. It also alleviates the need to overwrite existing objects at each step.
```{r pipping}
# base
pip_b <- {subset(s, drainagecl == "Well drained") ->.; # right-assign into "." to mimic a pipe
  .[order(.$nationalmusym), ]
}
pip_b[1:4, 1:4]
# tidyverse
pip_t <- filter(s, drainagecl == "Well drained") %>%
  arrange(nationalmusym)
pip_t[1:4, 1:4]
```
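To make the inside-out versus left-to-right contrast more obvious, here is a sketch of the same idea with three or more steps chained together (filter, select, sort, then take the first few rows of the component data `s`):
```{r}
# base R: the first operation sits in the innermost call, so the code reads inside out
head(
  subset(
    s[order(s$nationalmusym), ],
    drainagecl == "Well drained",
    select = c(nationalmusym, compname, drainagecl)
  ),
  4
)

# tidyverse: the same steps read top to bottom, left to right
filter(s, drainagecl == "Well drained") %>%
  select(nationalmusym, compname, drainagecl) %>%
  arrange(nationalmusym) %>%
  head(4)
```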
## Split-Apply-Combine
In lots of cases we want to know the variation within groups.
```{r split}
# base
vars <- c("compname", "genhz")
sca_b <- {
  split(h, h[vars], drop = TRUE) ->.;  # split
  lapply(., function(x) data.frame(    # apply
    x[vars][1, ],
    clay_min = round(min(x$clay_r, na.rm = TRUE)),
    clay_mean = round(mean(x$clay_r, na.rm = TRUE)),
    clay_max = round(max(x$clay_r, na.rm = TRUE))
  )) ->.;
  do.call("rbind", .)                  # combine
}
print(sca_b)
# tidyverse - dplyr
sca_t <- group_by(h, compname, genhz) %>% # split (sort of)
  summarize(                              # apply and combine
    clay_min = round(min(clay_r, na.rm = TRUE)),
    clay_mean = round(mean(clay_r, na.rm = TRUE)),
    clay_max = round(max(clay_r, na.rm = TRUE))
  )
print(sca_t)
```
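For completeness, base R also ships `aggregate()`, which wraps the same split-apply-combine idea in a single call; a sketch (note the three summaries come back as a single matrix column named `clay_r`, and rows with missing clay are dropped by the default `na.action`):
```{r}
sca_agg <- aggregate(clay_r ~ compname + genhz, data = h,
                     FUN = function(x) round(c(min = min(x), mean = mean(x), max = max(x))))
head(sca_agg)
```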
## Reshaping
In lots of instances, particularly for graphing, it's necessary to convert a data.frame from **wide** to **long** format.
```{r reshaping}
# base wide to long
vars <- c("clay_r", "sand_r", "om_r")
idvars <- c("compname", "genhz")
head(h[c(idvars, vars)])
lo_b <- reshape(h[c("compname", "genhz", vars)], # need to exclude unused columns
                direction = "long",
                timevar = "variable", times = vars, # capture the names of the variables in a "variable" column
                v.names = "value", varying = vars   # capture the values of the variables in a "value" column
                )
head(lo_b) # notice the row.names
# tidyverse wide to long
idx <- which(names(h) %in% vars)
lo_t <- select(h, compname, genhz, idx) %>% # need to exclude unused columns
  gather(key = variable,
         value = value,
         - compname, - genhz
         )
head(lo_t)
# sort factors
comp_sort <- aggregate(value ~ compname, data = lo_b[lo_b$variable == "clay_r", ], median, na.rm = TRUE)
comp_sort <- comp_sort[order(comp_sort$value), ]
lo_b <- within(lo_b, {
  compname = factor(lo_b$compname, levels = comp_sort$compname)
  genhz = factor(genhz, levels = rev(levels(genhz)))
})
# lattice box plots
bwplot(genhz ~ value | variable + compname,
       data = lo_b,
       scales = list(x = "free")
       )
# ggplot2 box plots
ggplot(lo_b, aes(x = genhz, y = value)) + # ggplot2 doesn't like factors or strings on the y-axis, hence coord_flip() below
  geom_boxplot() +                        # notice ggplot2 chains layers with "+", not "%>%"
  facet_wrap(~ compname + variable, scales = "free_x") +
  coord_flip()
```
```{r lattice, eval = FALSE, echo = FALSE}
test2 <- get_cosoilmoist_from_SDA_db(WHERE = "mukey = '406339'")
test <- subset(test2, !is.na(dept_r) & status == "Wet")
ggplot(test, aes(x = as.integer(month), y = dept_r)) +
  geom_ribbon(aes(ymin = dept_l, ymax = dept_h), alpha = 0.2) +
  geom_line() +
  ylim(max(test$dept_h), -5) + # won't plot unless the full range is present
  facet_wrap(~ compname)
panel_gribbon <- function(x, y, upper, lower, ...,
                          fill, col, subscripts, font, fontface) {
  upper = upper[subscripts]
  lower = lower[subscripts]
  panel.polygon(c(x, rev(x)), c(upper, rev(lower)),
                col = fill, border = FALSE)
}
panel_ribbon <- function(x, y, ...) {
  panel.superpose(x, y, ..., panel.groups = panel_gribbon)
  panel.xyplot(x, y, ...)
}
xyplot(data = test, dept_r ~ as.integer(month) | compname,
       groups = test$compname,
       type = "b", lty = 1:2,
       upper = test$dept_l, lower = test$dept_h,
       ylim = c(150, -5),
       grid = TRUE,
       panel = function(x, y, ...){
         panel.superpose(x, y, ..., panel.groups = panel_gribbon)
         panel.xyplot(x, y, ...)
       }
       )
xyplot(data = test, dept_r ~ as.integer(month) | compname,
       groups = test$compname,
       type = "b", lty = 1:2,
       upper = test$dept_l, lower = test$dept_h,
       ylim = c(150, -5),
       grid = TRUE,
       panel = panel_ribbon
       )
```
## Conclusion
The tidyverse and its precursors, plyr and reshape2, introduced me to a lot of new ways of manipulating data and made me question 'how would I do that in base?'.
**base**
- can be tidyish
- more abstract syntax (e.g. "[")
- fast
- 'very' flexible, to the point of being confusing
- awkward defaults (e.g. column and row naming, default sorting)
**tidyverse**
- more verbose syntax for some things, less verbose for others
- faster (usually)
- 'very' opinionated, to the point of being annoying
- clean defaults
## Questions
- is the tidyverse a better syntax for new R users?