-
Notifications
You must be signed in to change notification settings - Fork 0
/
02-dplyr.Rmd
executable file
·285 lines (200 loc) · 16.5 KB
/
02-dplyr.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
# Working with Tables using the Tidyverse {#work-with-tables}
In this tutorial we will introduce the **tibble** (also called a data frame), or the object type that R uses to store tables. Most of the data you will work with in R can be represented by a table (think an excel sheet), and one of the main advantages of using R is that the data frame is a powerful and intuitive interface for tabular data. In this tutorial we will use the [tidyverse](http://www.tidyverse.org/) to manipulate and summarise tabular data. The tutorial is a companion to the [Data transformation](http://r4ds.had.co.nz/transform.html) chapter in [R for Data Science](http://r4ds.had.co.nz/).
## Prerequisites
The prerequisite for this tutorial is the `tidyverse` package. If this package isn't installed, you'll have to install it using `install.packages()`.
```{r, eval = FALSE}
install.packages("tidyverse")
```
Load the packages when you're done! If there are errors, you may have not installed the above packages correctly!
```{r}
library(tidyverse)
```
Finally, you will need to load the example data. For now, copy and paste the following code to load the [Halifax geochemistry dataset](data/halifax_geochem.csv) (we will learn how to read various types of files into R in the [preparing and loading data](#prepare-load) tutorial).
```{r, include=FALSE}
# read local version to build
halifax_geochem <- read_csv(
"data/halifax_geochem.csv",
col_types = cols(.default = col_guess())
)
```
```{r, eval = FALSE}
halifax_geochem <- read_csv(
"http://paleolimbot.github.io/r4paleolim/data/halifax_geochem.csv",
col_types = cols(.default = col_guess())
)
```
It's worth mentioning a little bit about what this data frame contains, since we'll be working with it for the rest of this tutorial. The data contains several bulk geochemical parameters from a recent study of Halifax drinking water reservoirs [@dunnington18], including Pockwock Lake, Lake Major, Bennery Lake, Lake Fletcher, Lake Lemont, First Chain Lake, First Lake, and Second Lake. (Later, we will take a look at the [core locations](data/halifax_geochem_cores.xlsx) as well as the geochemical data).
## Viewing a Data Frame
The variable we have just created (`halifax_geochem`) is a **tibble**, which is a table of values much like you would find in a spreadsheet (you will notice that we loaded it directly from an Excel spreadhseet). Each column in the tibble represents a variable (in this case, the core identifier, depth of the sample, age represented by that sample, and several geochemical parameters), and each row in the table represents an observation (in this case, a sample from a sediment core).
In RStudio's "Environment" tab (usually at the top right of the screen), you should see a variable called `halifax_geochem` in the list. You can inspect it by clicking on the variable name, after which a tab will appear displaying the contents of the variable you just loaded. Clicking the little arrow to the left of the name will display the structure of the data frame, including the column names and some sample values. You can also do both of these things using the R commands `View()` and `glimpse()`, respectively. Also useful is the `head()` function, which will display the first few rows of a data frame.
```{r, eval=FALSE}
View(halifax_geochem) # will display a graphic table browser
```
```{r}
glimpse(halifax_geochem) # will display a text summary of the object
```
```{r}
head(halifax_geochem) # will display the first few rows of the data
```
## Selecting Columns
One way to subset `halifax_geochem` is to subset by column, for which we will use the `select()` function. For example, we may only be interested in the stable isotope information, represented by the columns `d13C_permille` and `d15N_permille`.
```{r}
stable_isotope_data <- select(
halifax_geochem,
core_id, depth_cm, age_ad,
d13C_permille, d15N_permille
)
```
The first argument to the `select()` function is the original data frame (in this case, `halifax_geochem`), and the remaining arguments are the names of the columns to be selected. To select the `core_id`, `age_ad`, `Ti`, and `K` columns, you would use the following R command:
```{r}
geochem_data <- select(halifax_geochem, core_id, depth_cm, age_ad, Ti_percent, K_percent)
```
Some column names in `halifax_geochem` contain characters that could be interpreted as an operation (e.g., `C/N`, which is the name of the column and not `C` divided by `N`). To select these columns, you will need to surround the column name in backticks:
```{r, eval=FALSE}
select(halifax_geochem, core_id, depth_cm, age_ad, `C/N`)
```
### Excercises
- Use `View()`, `glimpse()`, and `head()` to preview the two data frames we just created. Do they have the columns you would expect?
- Use `select()` to select `core_id`, `depth_cm`, C/N, d13C, and d15N, and assign it to the variable `cn_data`.
```{r, include = FALSE, eval = FALSE}
# using View, str, and head
View(stable_isotope_data); str(stable_isotope_data); head(stable_isotope_data)
View(geochem_data); str(geochem_data); head(geochem_data)
# selecting all the temperature variables
cn_data <- select(halifax_geochem, core_id, depth_cm, `C/N`, d13C_permille, d15N_permille)
```
## Filtering Rows {#chap2_filter}
Another way to subset `halifax_geochem` is by filtering rows using column values, similar to the filter feature in Microsoft Excel. This is done using the `filter()` function. For example, we may only be interested in the core from Pockwock Lake.
```{r}
pockwock_data <- filter(halifax_geochem, core_id == "POC15-2")
```
Just like `select()`, the first argument to `filter()` is the original data frame, and the subsequent arguments are the conditions that each row must satisfy in order to be included in the output. Column values are referred to by the column name (in the above example, `core_id`), so to include all rows where the value in the `core_id` column is POC15-2, we use `core_id == "POC15-2"`. Passing multiple conditions means each row must satisfy all of the conditions, such that to obtain the data from core POC15-2 where the depth in the core was 0 cm, we can use the following call to `filter()`:
```{r}
pockwock_surface_data <- filter(halifax_geochem, core_id == "POC15-2", depth_cm == 0)
```
It is very important that there are two equals signs within `filter()`! The `==` operator tests for equality (e.g. `(2 + 2) == 4`), whereas the `=` operator assigns a value or passes a named argument to a function, which is not what you're trying to do within `filter()`. Other common operators that are useful within filter are `!=` (not equal to), `>` (greater than), `<` (less than), `>=` (greater than or equal to), `<=` (less than or equal to), and `%in%` (tests if the value is one of several values). Using these, we could find out which observations are representative of the era 1950 to present:
```{r}
data_recent <- filter(halifax_geochem, age_ad >= 1950)
```
We could also find observations from multiple cores:
```{r}
pockwock_major_data <- filter(halifax_geochem, core_id %in% c("POC15-2", "MAJ15-1"))
```
### Exercises
- Use `View()`, `glimpse()`, and `head()` to preview the data frames we just created. Do they have the rows you would expect?
- Use `filter()` to find observations from the core FCL16-1 with an age between 1900 and present, and assign it to a name of your choosing.
- Are there any observations with a C/N value greater than 20? (hint: you will have to surround `C/N` in backticks)
```{r, include = FALSE, eval = FALSE}
# View, str, and head as above
fcl_1900_present <- filter(halifax_geochem, core_id == "FCL16-1", age_ad > 1900)
filter(halifax_geochem, `C/N` > 20) # should be zero rows
```
## Selecting and Filtering
Often we need to use both `select()` and `filter()` to obtain the desired subset of a data frame. To do this, we need to pass the result of `select()` to `filter()`, or the result of `filter()` to select. For example, we could create a data frame of recent (age greater than 1950) stable isotope measurements (you'll recall that we selected stable isotope columns in the data frame `stable_isotope_data`):
```{r}
recent_stable_isotopes <- filter(stable_isotope_data, age_ad >= 1950)
recent_stable_isotopes2 <- select(
data_recent,
core_id, depth_cm, age_ad,
d13C_permille, d15N_permille
)
```
### Exercises
- Use `View()`, `glimpse()`, and/or `head()` to verify that `recent_stable_isotopes` and `recent_stable_isotopes_2` are identical.
## The Pipe (%>%)
There is an easier way! Instead of creating intermediary variables every time we want to subset a data frame using `select()` and `filter()`, we can use the pipe operator (`%>%`) to pass the result of one function call to another. Thus, creating our `recent_stable_isotopes` data frame from above becomes one line with one variable assignment instead of two.
```{r, eval=FALSE}
recent_stable_isotopes_pipe <- halifax_geochem %>%
filter(age_ad >= 1950) %>%
select(core_id, depth_cm, age_ad, d13C_permille, d15N_permille)
```
What `%>%` does is pass the left side into the first argument of the function call on the right side. Thus, `filter(halifax_geochem, age_ad >= 1950)` becomes `halifax_geochem %>% filter(age_ad >= 1950)`. When using the tidyverse family of packages, you should use the pipe as often as possible! It usually makes for more readable, less error-prone code, and reduces the number of temporary variables you create that clutter up your workspace. When using `filter()` and `select()` with other tidyverse manipulations like `arrange()`, `group_by()`, `summarise()`, and `mutate()`, the pipe becomes indispensable.
### Exercises
- Inspect `recent_stable_isotopes_pipe` to ensure it is identical to `recent_stable_isotopes`.
- Create a data frame of stable isotope data from surface samples (`depth_cm == 0`) using `halifax_geochem`, `filter()`, `select()`, and `%>%` and assign it to a variable of a suitable name.
```{r, include = FALSE, eval = FALSE}
surface_stable_isotopes <- halifax_geochem %>%
filter(depth_cm == 0) %>%
select(core_id, depth_cm, age_ad, d13C_permille, d15N_permille)
```
## Arranging (sorting) A Data Frame
Sometimes it is desirable to view rows in a particular order, which can be used to quickly determine min and max values of various parameters. You can do this in the interactive editor using `View()`, but sometimes rows need to be in particular order for plotting or other analysis. This is done using the `arrange()` function. For example, it may make sense to view `halifax_geochem` in ascending `core_id` and `depth_cm` order (most recent first):
```{r}
halifax_geochem %>%
arrange(core_id, depth_cm)
```
Or descending depth order (most recent last):
```{r}
halifax_geochem %>%
arrange(core_id, desc(depth_cm))
```
The `arrange()` function takes columns as arguments, surrounded by `desc()` if that column should be sorted in descending order.
## Distinct Values
It is often useful to know which values exist in a data frame. For example, I've told you that the core locations are for various lakes in the halifax area, but what are they actually called in the dataset? To do this, we can use the `distinct()` function.
```{r}
halifax_geochem %>%
distinct(core_id)
```
The `distinct()` function can take any number of column names as arguments, although in this particular dataset there isn't a good example for this.
## Calculating columns using `mutate()`
To create a brand-new column, we can use the `mutate()` function. This creates a column in a way that we can use existing column names to calculate a new column. For example, we could convert the `age_ad` column to years before 1950:
```{r}
halifax_geochem %>%
mutate(age_bp = 1950 - age_ad) %>%
select(core_id, age_ad, age_bp)
```
Or, we could convert the `K_percent` and `Ti_percent` columns to parts per million:
```{r}
halifax_geochem %>%
mutate(
K_ppm = K_percent * 10000,
Ti_ppm = Ti_percent * 10000
) %>%
select(core_id, K_percent, K_ppm, Ti_percent, Ti_ppm)
```
## Summarising A Data Frame
So far we have looked at subsets of `halifax_geochem`, but what if we want per-core averages instead of raw data values? Using the tidyverse, we can `group_by()` the `core_id` column, and `summarise()`:
```{r}
halifax_geochem %>%
group_by(core_id) %>%
summarise(mean_CN = mean(`C/N`))
```
Here `group_by()` gets a list of columns, for which each unique combination of values will get one row in the output. `summarise()` gets a list of expressions that are evaluated for every unique combination of values defined by `group_by()` (e.g., `mean_CN` is the `mean()` of the `C/N` column for each core). Often, we want to include a number of summary columns in the output, which we can do by pasing more expressions to `summarise()`:
```{r}
halifax_geochem %>%
group_by(core_id) %>%
summarise(
mean_CN = mean(`C/N`),
min_CN = min(`C/N`),
max_CN = max(`C/N`),
sd_CN = sd(`C/N`)
)
```
You will notice that in for several cores the summary values are `NA`, or missing. This is because R propogates missing values unless you explicitly tell it not to. To fix this, you could replace ``mean(`C/N`)`` with ``mean(`C/N`, na.rm = TRUE)``. Other useful functions to use inside `summarise()` include `mean()`, `median()`, `sd()`, `sum()`, `min()`, and `max()`. These all take a vector of values and produce a single aggregate value suitable for use in `summarise()`. One special function, `n()`, you can use (with no arguments) inside `summarise()` to tell you how many observations were aggregated to produce the values in that row.
```{r}
halifax_geochem %>%
group_by(core_id) %>%
summarise(
mean_CN = mean(`C/N`, na.rm = TRUE),
min_CN = min(`C/N`, na.rm = TRUE),
max_CN = max(`C/N`, na.rm = TRUE),
sd_CN = sd(`C/N`, na.rm = TRUE),
n = n()
)
```
It's always a good idea to include `n()` inside `summarise()`, if nothing else as a check to make sure you've used `group_by()` with the correct columns.
### Excercises
- Assign the data frame we just created to a variable, and inspect it using `View()` and `str()`. Which cores have the most terrestrial C/N signature? Which cores have the most aquatic signature?
- Create a similar data frame to the one we just created but using `C_percent`. Which cores had the highest peak organic value.
- Which cores had the oldest estimated basal date?
## Extracting Columns
When we use `select()` or `distinct()`, we get back a data frame, however occasionally we need one or a few of the vectors that make up the data frame (recall from the last tutorial that data frames are a collection of column vectors). If we needed *just* the temperature values, we can use the `$` operator or the `pull()` function to extract a column vector.
```{r}
halifax_geochem$C_percent
halifax_geochem %>% pull(C_percent)
```
The problem with doing this is that our mean temperature values no longer have any context! They come from multiple cores, but this is not reflected without the other columns. Nevertheless, many R functions outside of the tidyverse require input as vectors (including many you've used so far, including `mean()`, `max()`, `min()`, etc.), and you will often see the `$` used in code written in other places to refer to columns. Functions in the tidyverse allow you to refer to columns by name (without the `$`) when used within specific functions (`summarise()` is a good example), so you should do this whenever you can!
## Base R Subsetting vs. select() and filter()
In the wild, there are many ways to select columns and filter rows. I highly reccomend using `filter()` and `select()` to do this when writing new code, but you may see R code that subsets a data frame using square brackets in the form `my_data_frame[c("column_name_1", "column_name_2")]` or `my_data_frame[my_data_frame$column_name_1 > some_number, c("column_name_1", "column_name_2")]`. The latter is equivalent to `my_data_frame %>% select(column_name_1, column_name_2) %>% filter(column_name_1 > some_number)`. The tidyverse method of subsetting I find to be much more clear and far less error-prone, but it's worth knowing the other form so you can read R code written by others!
## Summary
In this tutorial we introduced the use of `select()`, `filter()`, `arrange()`, `distinct()`, and the pipe (`%>%`). We also used `group_by()` and `summarise()` to provide summary statistics from a data frame. These functions are the building blocks of other powerful tools in the tidyverse. For more information, see the [Data transformation](http://r4ds.had.co.nz/transform.html) chapter in [R for Data Science](http://r4ds.had.co.nz/). Another good resource is the [tidyverse, visualization, and manipulation basics](https://www.rstudio.com/resources/webinars/tidyverse-visualization-and-manipulation-basics/) tutorial from Garrett Grolemund.