# Data transformation {#sec-data-transform}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("complete")
```
## Introduction
Visualization is an important tool for generating insight, but it's rare that you get the data in exactly the right form you need to make the graph you want.
Often you'll need to create some new variables or summaries to answer your questions with your data, or maybe you just want to rename the variables or reorder the observations to make the data a little easier to work with.
You'll learn how to do all that (and more!) in this chapter, which will introduce you to data transformation using the **dplyr** package and a new dataset on flights that departed New York City in 2013.
The goal of this chapter is to give you an overview of all the key tools for transforming a data frame.
We'll start with functions that operate on rows and then columns of a data frame, then circle back to talk more about the pipe, an important tool that you use to combine verbs.
We will then introduce the ability to work with groups.
We will end the chapter with a case study that showcases these functions in action and we'll come back to the functions in more detail in later chapters, as we start to dig into specific types of data (e.g., numbers, strings, dates).
### Prerequisites
In this chapter we'll focus on the dplyr package, another core member of the tidyverse.
We'll illustrate the key ideas using data from the nycflights13 package, and use ggplot2 to help us understand the data.
```{r}
#| label: setup
library(nycflights13)
library(tidyverse)
```
Take careful note of the conflicts message that's printed when you load the tidyverse.
It tells you that dplyr overwrites some functions in base R.
If you want to use the base version of these functions after loading dplyr, you'll need to use their full names: `stats::filter()` and `stats::lag()`.
So far we've mostly ignored which package a function comes from because most of the time it doesn't matter.
However, knowing the package can help you find help and find related functions, so when we need to be precise about which package a function comes from, we'll use the same syntax as R: `packagename::functionname()`.
### nycflights13
To explore the basic dplyr verbs, we're going to use `nycflights13::flights`.
This dataset contains all `r format(nrow(nycflights13::flights), big.mark = ",")` flights that departed from New York City in 2013.
The data comes from the US [Bureau of Transportation Statistics](http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0), and is documented in `?flights`.
```{r}
flights
```
`flights` is a tibble, a special type of data frame used by the tidyverse to avoid some common gotchas.
The most important difference between tibbles and data frames is the way tibbles print; they are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen.
There are a few options to see everything.
If you're using RStudio, the most convenient is probably `View(flights)`, which will open an interactive scrollable and filterable view.
Otherwise you can use `print(flights, width = Inf)` to show all columns, or use `glimpse()`:
```{r}
glimpse(flights)
```
In both views, the variable names are followed by abbreviations that tell you the type of each variable: `<int>` is short for integer, `<dbl>` is short for double (aka real numbers), `<chr>` for character (aka strings), and `<dttm>` for date-time.
These are important because the operations you can perform on a column depend so much on its "type".
### dplyr basics
You're about to learn the primary dplyr verbs (functions) which will allow you to solve the vast majority of your data manipulation challenges.
But before we discuss their individual differences, it's worth stating what they have in common:
1. The first argument is always a data frame.
2. The subsequent arguments typically describe which columns to operate on, using the variable names (without quotes).
3. The output is always a new data frame.
Because each verb does one thing well, solving complex problems will usually require combining multiple verbs, and we'll do so with the pipe, `|>`.
We'll discuss the pipe more in @sec-the-pipe, but in brief, the pipe takes the thing on its left and passes it along to the function on its right so that `x |> f(y)` is equivalent to `f(x, y)`, and `x |> f(y) |> g(z)` is equivalent to `g(f(x, y), z)`.
The easiest way to pronounce the pipe is "then".
That makes it possible to get a sense of the following code even though you haven't yet learned the details:
```{r}
#| eval: false
flights |>
  filter(dest == "IAH") |>
  group_by(year, month, day) |>
  summarize(
    arr_delay = mean(arr_delay, na.rm = TRUE)
  )
```
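To make the equivalence between the piped and nested forms concrete, here is a minimal sketch on a small made-up tibble (`toy` and its columns are hypothetical, not part of nycflights13). It only illustrates that `x |> f(y) |> g(z)` and `g(f(x, y), z)` give the same result.

```{r}
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(id = 1:4, delay = c(5, 150, 30, 200))

# Piped form: read as "take toy, then filter, then arrange"
piped <- toy |>
  filter(delay > 20) |>
  arrange(desc(delay))

# Nested form: the same two calls written inside out
nested <- arrange(filter(toy, delay > 20), desc(delay))

identical(piped, nested)
```

Both versions run the same function calls; the pipe only changes the order in which you read them.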
dplyr's verbs are organized into four groups based on what they operate on: **rows**, **columns**, **groups**, or **tables**.
In the following sections you'll learn the most important verbs for rows, columns, and groups, then we'll come back to the join verbs that work on tables in @sec-joins.
Let's dive in!
## Rows
The most important verbs that operate on rows of a dataset are `filter()`, which changes which rows are present without changing their order, and `arrange()`, which changes the order of the rows without changing which are present.
Both functions only affect the rows, and the columns are left unchanged.
We'll also discuss `distinct()`, which finds rows with unique values.
Unlike `arrange()` and `filter()`, it can also optionally modify the columns.
### `filter()`
`filter()` allows you to keep rows based on the values of the columns[^data-transform-1].
The first argument is the data frame.
The second and subsequent arguments are the conditions that must be true to keep the row.
For example, we could find all flights that departed more than 120 minutes (two hours) late:
[^data-transform-1]: Later, you'll learn about the `slice_*()` family which allows you to choose rows based on their positions.
```{r}
flights |>
  filter(dep_delay > 120)
```
As well as `>` (greater than), you can use `>=` (greater than or equal to), `<` (less than), `<=` (less than or equal to), `==` (equal to), and `!=` (not equal to).
You can also combine conditions with `&` or `,` to indicate "and" (check for both conditions) or with `|` to indicate "or" (check for either condition):
```{r}
# Flights that departed on January 1
flights |>
  filter(month == 1 & day == 1)
# Flights that departed in January or February
flights |>
  filter(month == 1 | month == 2)
```
There's a useful shortcut when you're combining `|` and `==`: `%in%`.
It keeps rows where the variable equals one of the values on the right:
```{r}
# A shorter way to select flights that departed in January or February
flights |>
  filter(month %in% c(1, 2))
```
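As a quick check that the `%in%` shortcut matches the longer `|` form, the sketch below compares the two on a small made-up tibble (`toy` is a hypothetical stand-in for `flights`).

```{r}
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(month = c(1, 2, 3, 12, 2), flight = 101:105)

with_or <- toy |>
  filter(month == 1 | month == 2)
with_in <- toy |>
  filter(month %in% c(1, 2))

identical(with_or, with_in)
```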
We'll come back to these comparisons and logical operators in more detail in @sec-logicals.
When you run `filter()`, dplyr executes the filtering operation, creating a new data frame, and then prints it.
It doesn't modify the existing `flights` dataset because dplyr functions never modify their inputs.
To save the result, you need to use the assignment operator, `<-`:
```{r}
jan1 <- flights |>
  filter(month == 1 & day == 1)
```
### Common mistakes
When you're starting out with R, the easiest mistake to make is to use `=` instead of `==` when testing for equality.
`filter()` will let you know when this happens:
```{r}
#| error: true
flights |>
  filter(month = 1)
```
Another mistake is to write "or" statements as you would in English:
```{r}
#| eval: false
flights |>
  filter(month == 1 | 2)
```
This "works", in the sense that it doesn't throw an error, but it doesn't do what you want because `|` first checks the condition `month == 1` and then checks the condition `2`, which is not a sensible condition to check.
We'll learn more about what's happening here and why in @sec-boolean-operations.
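As a small preview, the sketch below evaluates both conditions on a plain vector (the `month` vector here is made up for illustration). In base R, a non-zero number used with `|` is coerced to `TRUE`, so the second expression is `TRUE` for every element.

```{r}
# Hypothetical values, including one missing value
month <- c(1, 2, 3, NA)

# What was probably intended: is month equal to 1, or equal to 2?
month == 1 | month == 2

# What `month == 1 | 2` evaluates: 2 is coerced to TRUE, and
# anything "or TRUE" is TRUE, so every element passes
month == 1 | 2
```

Inside `filter()`, a condition that is `TRUE` everywhere keeps every row, which is why the mistake goes unnoticed.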
### `arrange()`
`arrange()` changes the order of the rows based on the value of the columns.
It takes a data frame and a set of column names (or more complicated expressions) to order by.
If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.
For example, the following code sorts by the departure time, which is spread over four columns.
We get the earliest years first, then within a year the earliest months, etc.
```{r}
flights |>
  arrange(year, month, day, dep_time)
```
You can use `desc()` on a column inside of `arrange()` to re-order the data frame based on that column in descending (big-to-small) order.
For example, this code orders flights from most to least delayed:
```{r}
flights |>
  arrange(desc(dep_delay))
```
Note that the number of rows has not changed -- we're only arranging the data, we're not filtering it.
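The tie-breaking behavior is easiest to see on a handful of rows. The sketch below uses a made-up tibble (`toy` is hypothetical) and combines ascending columns with `desc()`.

```{r}
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  month = c(2, 1, 1, 2),
  day = c(1, 2, 1, 1),
  dep_delay = c(10, -3, 45, 60)
)

# Sort by month, use day to break ties within a month, and
# put the largest delay first to break any remaining ties
sorted <- toy |>
  arrange(month, day, desc(dep_delay))
sorted
```

The two rows for month 2, day 1 are tied on the first two columns, so `desc(dep_delay)` decides their order.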
### `distinct()`
`distinct()` finds all the unique rows in a dataset, so in a technical sense, it primarily operates on the rows.
Most of the time, however, you'll want the distinct combination of some variables, so you can also optionally supply column names:
```{r}
# Remove duplicate rows, if any
flights |>
  distinct()
# Find all unique origin and destination pairs
flights |>
  distinct(origin, dest)
```
Alternatively, if you want to keep the other columns when filtering for unique rows, you can use the `.keep_all = TRUE` option.
```{r}
flights |>
  distinct(origin, dest, .keep_all = TRUE)
```
It's not a coincidence that all of these distinct flights are on January 1: `distinct()` will find the first occurrence of a unique row in the dataset and discard the rest.
If you want to find the number of occurrences instead, you're better off swapping `distinct()` for `count()`, and with the `sort = TRUE` argument you can arrange them in descending order of number of occurrences.
You'll learn more about count in @sec-counts.
```{r}
flights |>
  count(origin, dest, sort = TRUE)
```
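The difference between keeping the first occurrence and counting occurrences can be sketched on a few made-up rows (`toy` and its values are hypothetical):

```{r}
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  origin = c("JFK", "JFK", "LGA", "JFK"),
  dest = c("IAH", "IAH", "ORD", "ORD"),
  day = c(1, 2, 3, 4)
)

# Keeps the first row seen for each origin/dest pair (days 1, 3, and 4)
first_seen <- toy |>
  distinct(origin, dest, .keep_all = TRUE)
first_seen

# Counts how many rows each pair has, most frequent first
pair_counts <- toy |>
  count(origin, dest, sort = TRUE)
pair_counts
```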
### Exercises
1.  In a single pipeline for each condition, find all flights that meet the condition:
    -   Had an arrival delay of two or more hours
    -   Flew to Houston (`IAH` or `HOU`)
    -   Were operated by United, American, or Delta
    -   Departed in summer (July, August, and September)
    -   Arrived more than two hours late, but didn't leave late
    -   Were delayed by at least an hour, but made up over 30 minutes in flight
2.  Sort `flights` to find the flights with the longest departure delays.
    Find the flights that left earliest in the morning.
3.  Sort `flights` to find the fastest flights.
    (Hint: Try including a math calculation inside of your function.)
4.  Was there a flight on every day of 2013?
5.  Which flights traveled the farthest distance?
    Which traveled the least distance?
6.  Does it matter what order you used `filter()` and `arrange()` if you're using both?
    Why/why not?
    Think about the results and how much work the functions would have to do.
## Columns
There are four important verbs that affect the columns without changing the rows: `mutate()` creates new columns that are derived from the existing columns, `select()` changes which columns are present, `rename()` changes the names of the columns, and `relocate()` changes the positions of the columns.
### `mutate()` {#sec-mutate}
The job of `mutate()` is to add new columns that are calculated from the existing columns.
In the transform chapters, you'll learn a large set of functions that you can use to manipulate different types of variables.
For now, we'll stick with basic algebra, which allows us to compute the `gain`, how much time a delayed flight made up in the air, and the `speed` in miles per hour:
```{r}
flights |>
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60
  )
```
By default, `mutate()` adds new columns on the right hand side of your dataset, which makes it difficult to see what's happening here.
We can use the `.before` argument to instead add the variables to the left hand side[^data-transform-2]:
[^data-transform-2]: Remember that in RStudio, the easiest way to see a dataset with many columns is `View()`.
```{r}
flights |>
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .before = 1
  )
```
The `.` is a sign that `.before` is an argument to the function, not the name of a third new variable we are creating.
You can also use `.after` to add after a variable, and in both `.before` and `.after` you can use the variable name instead of a position.
For example, we could add the new variables after `day`:
```{r}
#| results: false
flights |>
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .after = day
  )
```
Alternatively, you can control which variables are kept with the `.keep` argument.
A particularly useful value is `"used"`, which specifies that we only keep the columns that were involved or created in the `mutate()` step.
For example, the following output will contain only the variables `dep_delay`, `arr_delay`, `air_time`, `gain`, `hours`, and `gain_per_hour`.
```{r}
#| results: false
flights |>
  mutate(
    gain = dep_delay - arr_delay,
    hours = air_time / 60,
    gain_per_hour = gain / hours,
    .keep = "used"
  )
```
Note that since we haven't assigned the result of the above computation back to `flights`, the new variables `gain`, `hours`, and `gain_per_hour` will only be printed but will not be stored in a data frame.
And if we want them to be available in a data frame for future use, we should think carefully about whether we want the result to be assigned back to `flights`, overwriting the original data frame with many more variables, or to a new object.
Often, the right answer is a new object that is named informatively to indicate its contents, e.g., `delay_gain`, but you might also have good reasons for overwriting `flights`.
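As a sketch of that advice, the code below stores the result in a new object and leaves the input untouched. The tibble `toy` is a made-up stand-in for `flights`, and `delay_gain` is the illustrative name suggested above.

```{r}
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  flight = c(1, 2),
  dep_delay = c(30, 10),
  arr_delay = c(10, 25),
  air_time = c(120, 90)
)

# Save the new columns in a separate, informatively named object
delay_gain <- toy |>
  mutate(
    gain = dep_delay - arr_delay,
    hours = air_time / 60,
    gain_per_hour = gain / hours,
    .keep = "used"
  )

names(delay_gain)
# The input is unchanged, and `flight` was never used, so it only lives here
names(toy)
```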
### `select()` {#sec-select}
It's not uncommon to get datasets with hundreds or even thousands of variables.
In this situation, the first challenge is often just focusing on the variables you're interested in.
`select()` allows you to rapidly zoom in on a useful subset using operations based on the names of the variables:
-   Select columns by name:

    ```{r}
    #| results: false
    flights |>
      select(year, month, day)
    ```

-   Select all columns between year and day (inclusive):

    ```{r}
    #| results: false
    flights |>
      select(year:day)
    ```

-   Select all columns except those from year to day (inclusive):

    ```{r}
    #| results: false
    flights |>
      select(!year:day)
    ```

    You can also use `-` instead of `!` (and you're likely to see that in the wild); we recommend `!` because it reads as "not", and combines well with `&` and `|`.

-   Select all columns that are characters:

    ```{r}
    #| results: false
    flights |>
      select(where(is.character))
    ```
There are a number of helper functions you can use within `select()`:
- `starts_with("abc")`: matches names that begin with "abc".
- `ends_with("xyz")`: matches names that end with "xyz".
- `contains("ijk")`: matches names that contain "ijk".
- `num_range("x", 1:3)`: matches `x1`, `x2` and `x3`.
See `?select` for more details.
Once you know regular expressions (the topic of @sec-regular-expressions) you'll also be able to use `matches()` to select variables that match a pattern.
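To see the helpers side by side, here is a small sketch on a made-up one-row tibble (`toy` and its `x1` to `x3` columns are hypothetical, chosen so that each helper matches something):

```{r}
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  dep_time = 517, dep_delay = 2,
  arr_time = 830, arr_delay = 11,
  x1 = 1, x2 = 2, x3 = 3
)

# Names that begin with "dep": dep_time, dep_delay
toy |> select(starts_with("dep")) |> names()
# Names that end with "delay": dep_delay, arr_delay
toy |> select(ends_with("delay")) |> names()
# Names that contain "_t": dep_time, arr_time
toy |> select(contains("_t")) |> names()
# x1 and x2
toy |> select(num_range("x", 1:2)) |> names()
```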
You can rename variables as you `select()` them by using `=`.
The new name appears on the left hand side of the `=`, and the old variable appears on the right hand side:
```{r}
flights |>
  select(tail_num = tailnum)
```
### `rename()`
If you want to keep all the existing variables and just want to rename a few, you can use `rename()` instead of `select()`:
```{r}
flights |>
  rename(tail_num = tailnum)
```
If you have a bunch of inconsistently named columns and it would be painful to fix them all by hand, check out `janitor::clean_names()` which provides some useful automated cleaning.
### `relocate()`
Use `relocate()` to move variables around.
You might want to collect related variables together or move important variables to the front.
By default `relocate()` moves variables to the front:
```{r}
flights |>
  relocate(time_hour, air_time)
```
You can also specify where to put them using the `.before` and `.after` arguments, just like in `mutate()`:
```{r}
#| results: false
flights |>
  relocate(year:dep_time, .after = time_hour)
flights |>
  relocate(starts_with("arr"), .before = dep_time)
```
### Exercises
```{r}
#| eval: false
#| echo: false
# For data checking, not used in results shown in book
flights <- flights |> mutate(
  dep_time = hour * 60 + minute,
  arr_time = (arr_time %/% 100) * 60 + (arr_time %% 100),
  airtime2 = arr_time - dep_time,
  dep_sched = dep_time + dep_delay
)
ggplot(flights, aes(x = dep_sched)) + geom_histogram(binwidth = 60)
ggplot(flights, aes(x = dep_sched %% 60)) + geom_histogram(binwidth = 1)
ggplot(flights, aes(x = air_time - airtime2)) + geom_histogram()
```
1.  Compare `dep_time`, `sched_dep_time`, and `dep_delay`.
    How would you expect those three numbers to be related?

2.  Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`.

3.  What happens if you specify the name of the same variable multiple times in a `select()` call?

4.  What does the `any_of()` function do?
    Why might it be helpful in conjunction with this vector?

    ```{r}
    variables <- c("year", "month", "day", "dep_delay", "arr_delay")
    ```

5.  Does the result of running the following code surprise you?
    How do the select helpers deal with upper and lower case by default?
    How can you change that default?

    ```{r}
    #| eval: false
    flights |> select(contains("TIME"))
    ```

6.  Rename `air_time` to `air_time_min` to indicate units of measurement and move it to the beginning of the data frame.

7.  Why doesn't the following work, and what does the error mean?

    ```{r}
    #| error: true
    flights |>
      select(tailnum) |>
      arrange(arr_delay)
    ```
## The pipe {#sec-the-pipe}
We've shown you simple examples of the pipe above, but its real power arises when you start to combine multiple verbs.
For example, imagine that you wanted to find the fast flights to Houston's IAH airport: you need to combine `filter()`, `mutate()`, `select()`, and `arrange()`:
```{r}
flights |>
  filter(dest == "IAH") |>
  mutate(speed = distance / air_time * 60) |>
  select(year:day, dep_time, carrier, flight, speed) |>
  arrange(desc(speed))
```
Even though this pipeline has four steps, it's easy to skim because the verbs come at the start of each line: start with the `flights` data, then filter, then mutate, then select, then arrange.
What would happen if we didn't have the pipe?
We could nest each function call inside the previous call:
```{r}
#| results: false
arrange(
  select(
    mutate(
      filter(
        flights,
        dest == "IAH"
      ),
      speed = distance / air_time * 60
    ),
    year:day, dep_time, carrier, flight, speed
  ),
  desc(speed)
)
```
Or we could use a bunch of intermediate objects:
```{r}
#| results: false
flights1 <- filter(flights, dest == "IAH")
flights2 <- mutate(flights1, speed = distance / air_time * 60)
flights3 <- select(flights2, year:day, dep_time, carrier, flight, speed)
arrange(flights3, desc(speed))
```
While both forms have their time and place, the pipe generally produces data analysis code that is easier to write and read.
To add the pipe to your code, we recommend using the built-in keyboard shortcut Ctrl/Cmd + Shift + M.
You'll need to make one change to your RStudio options to use `|>` instead of `%>%` as shown in @fig-pipe-options; more on `%>%` shortly.
```{r}
#| label: fig-pipe-options
#| echo: false
#| fig-cap: |
#|   To insert `|>`, make sure the "Use native pipe operator" option is checked.
#| fig-alt: |
#|   Screenshot showing the "Use native pipe operator" option which can
#|   be found on the "Editing" panel of the "Code" options.
knitr::include_graphics("screenshots/rstudio-pipe-options.png")
```
::: callout-note
## magrittr
If you've been using the tidyverse for a while, you might be familiar with the `%>%` pipe provided by the **magrittr** package.
The magrittr package is included in the core tidyverse, so you can use `%>%` whenever you load the tidyverse:
```{r}
#| eval: false
library(tidyverse)
mtcars %>%
  group_by(cyl) %>%
  summarize(n = n())
```
For simple cases, `|>` and `%>%` behave identically.
So why do we recommend the base pipe?
Firstly, because it's part of base R, it's always available for you to use, even when you're not using the tidyverse.
Secondly, `|>` is quite a bit simpler than `%>%`: in the time between the invention of `%>%` in 2014 and the inclusion of `|>` in R 4.1.0 in 2021, we gained a better understanding of the pipe.
This allowed the base implementation to jettison infrequently used and less important features.
:::
## Groups
So far you've learned about functions that work with rows and columns.
dplyr gets even more powerful when you add in the ability to work with groups.
In this section, we'll focus on the most important functions: `group_by()`, `summarize()`, and the slice family of functions.
### `group_by()`
Use `group_by()` to divide your dataset into groups meaningful for your analysis:
```{r}
flights |>
  group_by(month)
```
`group_by()` doesn't change the data but, if you look closely at the output, you'll notice that the output indicates that it is "grouped by" month (`Groups: month [12]`).
This means subsequent operations will now work "by month".
`group_by()` adds this grouped feature (referred to as class) to the data frame, which changes the behavior of the subsequent verbs applied to the data.
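One way to inspect that grouping metadata directly is sketched below on a made-up tibble (`toy` and `by_month` are hypothetical names); `group_vars()`, `is_grouped_df()`, and `n_groups()` are dplyr helpers for querying it.

```{r}
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(month = c(1, 1, 2), dep_delay = c(5, 15, 40))
by_month <- toy |>
  group_by(month)

# The values are untouched; only the grouping metadata is new
group_vars(by_month)
is_grouped_df(by_month)
n_groups(by_month)
```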
### `summarize()` {#sec-summarize}
The most important grouped operation is a summary, which, if being used to calculate a single summary statistic, reduces the data frame to have a single row for each group.
In dplyr, this operation is performed by `summarize()`[^data-transform-3], as shown by the following example, which computes the average departure delay by month:
[^data-transform-3]: Or `summarise()`, if you prefer British English.
```{r}
flights |>
  group_by(month) |>
  summarize(
    avg_delay = mean(dep_delay)
  )
```
Uhoh!
Something has gone wrong and all of our results are `NA`s (pronounced "N-A"), R's symbol for missing value.
This happened because some of the observed flights had missing data in the delay column, and so when we calculated the mean including those values, we got an `NA` result.
We'll come back to discuss missing values in detail in @sec-missing-values, but for now we'll tell the `mean()` function to ignore all missing values by setting the argument `na.rm` to `TRUE`:
```{r}
flights |>
  group_by(month) |>
  summarize(
    delay = mean(dep_delay, na.rm = TRUE)
  )
```
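The effect of `na.rm` is easy to check on a plain vector; the values below are made up for illustration:

```{r}
# Hypothetical delays with one missing value
delays <- c(10, NA, 30)

mean(delays)                # NA: one unknown value makes the mean unknown
mean(delays, na.rm = TRUE)  # 20: the mean of the two known values
```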
You can create any number of summaries in a single call to `summarize()`.
You'll learn various useful summaries in the upcoming chapters, but one very useful summary is `n()`, which returns the number of rows in each group:
```{r}
flights |>
  group_by(month) |>
  summarize(
    delay = mean(dep_delay, na.rm = TRUE),
    n = n()
  )
```
Means and counts can get you a surprisingly long way in data science!
### The `slice_` functions
There are five handy functions that allow you to extract specific rows within each group:
- `df |> slice_head(n = 1)` takes the first row from each group.
- `df |> slice_tail(n = 1)` takes the last row in each group.
- `df |> slice_min(x, n = 1)` takes the row with the smallest value of column `x`.
- `df |> slice_max(x, n = 1)` takes the row with the largest value of column `x`.
- `df |> slice_sample(n = 1)` takes one random row.
You can vary `n` to select more than one row, or instead of `n =`, you can use `prop = 0.1` to select (e.g.) 10% of the rows in each group.
For example, the following code finds the flights that are most delayed upon arrival at each destination:
```{r}
flights |>
  group_by(dest) |>
  slice_max(arr_delay, n = 1) |>
  relocate(dest)
```
Note that there are 105 destinations but we get 108 rows here.
What's up?
`slice_min()` and `slice_max()` keep tied values so `n = 1` means give us all rows with the highest value.
If you want exactly one row per group you can set `with_ties = FALSE`.
This is similar to computing the max delay with `summarize()`, but you get the whole corresponding row (or rows if there's a tie) instead of the single summary statistic.
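The effect of ties can be reproduced on a few made-up rows (`toy` is hypothetical; its two `IAH` rows are deliberately tied):

```{r}
library(dplyr)

# Hypothetical toy data with a tie for the largest delay at IAH
toy <- tibble(
  dest = c("IAH", "IAH", "ORD", "ORD"),
  arr_delay = c(90, 90, 20, 35)
)

with_ties <- toy |>
  group_by(dest) |>
  slice_max(arr_delay, n = 1)

no_ties <- toy |>
  group_by(dest) |>
  slice_max(arr_delay, n = 1, with_ties = FALSE)

nrow(with_ties)  # 3: both tied IAH rows plus one ORD row
nrow(no_ties)    # 2: exactly one row per destination
```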
### Grouping by multiple variables
You can create groups using more than one variable.
For example, we could make a group for each date.
```{r}
daily <- flights |>
  group_by(year, month, day)
daily
```
When you summarize a tibble grouped by more than one variable, each summary peels off the last group.
In hindsight, this wasn't a great way to make this function work, but it's difficult to change without breaking existing code.
To make it obvious what's happening, dplyr displays a message that tells you how you can change this behavior:
```{r}
daily_flights <- daily |>
  summarize(n = n())
```
If you're happy with this behavior, you can explicitly request it in order to suppress the message:
```{r}
#| results: false
daily_flights <- daily |>
  summarize(
    n = n(),
    .groups = "drop_last"
  )
```
Alternatively, change the default behavior by setting a different value, e.g., `"drop"` to drop all grouping or `"keep"` to preserve the same groups.
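To compare the three settings, the sketch below summarizes a small made-up tibble (`toy` and `daily_toy` are hypothetical) and asks which grouping variables remain afterwards:

```{r}
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  year = 2013,
  month = c(1, 1, 2),
  day = c(1, 2, 1)
)
daily_toy <- toy |>
  group_by(year, month, day)

# "drop_last" peels off `day`, leaving year and month
daily_toy |> summarize(n = n(), .groups = "drop_last") |> group_vars()
# "drop" removes all grouping
daily_toy |> summarize(n = n(), .groups = "drop") |> group_vars()
# "keep" preserves year, month, and day
daily_toy |> summarize(n = n(), .groups = "keep") |> group_vars()
```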
### Ungrouping
You might also want to remove grouping from a data frame without using `summarize()`.
You can do this with `ungroup()`.
```{r}
daily |>
  ungroup()
```
Now let's see what happens when you summarize an ungrouped data frame.
```{r}
daily |>
  ungroup() |>
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE),
    flights = n()
  )
```
You get a single row back because dplyr treats all the rows in an ungrouped data frame as belonging to one group.
### `.by`
dplyr 1.1.0 includes a new, experimental, syntax for per-operation grouping, the `.by` argument.
`group_by()` and `ungroup()` aren't going away, but you can now also use the `.by` argument to group within a single operation:
```{r}
#| results: false
flights |>
  summarize(
    delay = mean(dep_delay, na.rm = TRUE),
    n = n(),
    .by = month
  )
```
Or if you want to group by multiple variables:
```{r}
#| results: false
flights |>
  summarize(
    delay = mean(dep_delay, na.rm = TRUE),
    n = n(),
    .by = c(origin, dest)
  )
```
`.by` works with all verbs and has the advantage that you don't need to use the `.groups` argument to suppress the grouping message or `ungroup()` when you're done.
We didn't focus on this syntax in this chapter because it was very new when we wrote the book.
We did want to mention it because we think it has a lot of promise and it's likely to be quite popular.
You can learn more about it in the [dplyr 1.1.0 blog post](https://www.tidyverse.org/blog/2023/02/dplyr-1-1-0-per-operation-grouping/).
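The two approaches can be compared on a made-up tibble (`toy` is hypothetical; this sketch assumes dplyr 1.1.0 or later, since `.by` does not exist in earlier versions). One visible difference is row order: `group_by()` sorts the result by the grouping variable, while `.by` keeps groups in order of first appearance.

```{r}
library(dplyr)

# Hypothetical toy data; note that month 2 appears before month 1
toy <- tibble(month = c(2, 1, 2), dep_delay = c(10, 20, 50))

via_group_by <- toy |>
  group_by(month) |>
  summarize(delay = mean(dep_delay))

via_by <- toy |>
  summarize(delay = mean(dep_delay), .by = month)

via_group_by$month  # 1 2: sorted by the grouping variable
via_by$month        # 2 1: order of first appearance
is_grouped_df(via_by)
```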
### Exercises
1.  Which carrier has the worst average delays?
    Challenge: can you disentangle the effects of bad airports vs. bad carriers?
    Why/why not?
    (Hint: think about `flights |> group_by(carrier, dest) |> summarize(n())`)
2.  Find the flights that are most delayed upon departure from each destination.
3.  How do delays vary over the course of the day?
    Illustrate your answer with a plot.
4.  What happens if you supply a negative `n` to `slice_min()` and friends?
5.  Explain what `count()` does in terms of the dplyr verbs you just learned.
    What does the `sort` argument to `count()` do?
6.  Suppose we have the following tiny data frame:

    ```{r}
    df <- tibble(
      x = 1:5,
      y = c("a", "b", "a", "a", "b"),
      z = c("K", "K", "L", "L", "K")
    )
    ```

    a.  Write down what you think the output will look like, then check if you were correct, and describe what `group_by()` does.

        ```{r}
        #| eval: false
        df |>
          group_by(y)
        ```

    b.  Write down what you think the output will look like, then check if you were correct, and describe what `arrange()` does.
        Also comment on how it's different from the `group_by()` in part (a).

        ```{r}
        #| eval: false
        df |>
          arrange(y)
        ```

    c.  Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does.

        ```{r}
        #| eval: false
        df |>
          group_by(y) |>
          summarize(mean_x = mean(x))
        ```

    d.  Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does.
        Then, comment on what the message says.

        ```{r}
        #| eval: false
        df |>
          group_by(y, z) |>
          summarize(mean_x = mean(x))
        ```

    e.  Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does.
        How is the output different from the one in part (d)?

        ```{r}
        #| eval: false
        df |>
          group_by(y, z) |>
          summarize(mean_x = mean(x), .groups = "drop")
        ```

    f.  Write down what you think the outputs will look like, then check if you were correct, and describe what each pipeline does.
        How are the outputs of the two pipelines different?

        ```{r}
        #| eval: false
        df |>
          group_by(y, z) |>
          summarize(mean_x = mean(x))
        df |>
          group_by(y, z) |>
          mutate(mean_x = mean(x))
        ```
## Case study: aggregates and sample size {#sec-sample-size}
Whenever you do any aggregation, it's always a good idea to include a count (`n()`).
That way, you can ensure that you're not drawing conclusions based on very small amounts of data.
We'll demonstrate this with some baseball data from the **Lahman** package.
Specifically, we will compare what proportion of times a player gets a hit (`H`) vs. the number of times they try to put the ball in play (`AB`):
```{r}
batters <- Lahman::Batting |>
  group_by(playerID) |>
  summarize(
    performance = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
    n = sum(AB, na.rm = TRUE)
  )
batters
```
When we plot the skill of the batter (measured by the batting average, `performance`) against the number of opportunities to hit the ball (measured by times at bat, `n`), we see two patterns:
1. The variation in `performance` is larger among players with fewer at-bats.
The shape of this plot is very characteristic: whenever you plot a mean (or other summary statistics) vs. group size, you'll see that the variation decreases as the sample size increases[^data-transform-4].
2. There's a positive correlation between skill (`performance`) and opportunities to hit the ball (`n`) because teams want to give their best batters the most opportunities to hit the ball.
[^data-transform-4]: \*cough\* the law of large numbers \*cough\*.
```{r}
#| warning: false
#| fig-alt: |
#|   A scatterplot of batting performance vs. batting opportunities
#|   overlaid with a smoothed line. Average performance increases sharply
#|   from 0.2 when n is ~100 to 0.25 when n is ~1000. Average performance
#|   continues to increase linearly at a much shallower slope reaching
#|   ~0.3 when n is ~15,000.
batters |>
  filter(n > 100) |>
  ggplot(aes(x = n, y = performance)) +
  geom_point(alpha = 1 / 10) +
  geom_smooth(se = FALSE)
```
Note the handy pattern for combining ggplot2 and dplyr.
You just have to remember to switch from `|>`, for dataset processing, to `+` for adding layers to your plot.
This also has important implications for ranking.
If you naively sort on `desc(performance)`, the people with the best batting averages are clearly the ones who tried to put the ball in play very few times and happened to get a hit; they're not necessarily the most skilled players:
```{r}
batters |>
  arrange(desc(performance))
```
You can find a good explanation of this problem and how to overcome it at <http://varianceexplained.org/r/empirical_bayes_baseball/> and <https://www.evanmiller.org/how-not-to-sort-by-average-rating.html>.
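The first pattern, more variation among players with fewer at-bats, can be reproduced with simulated data. The sketch below is not based on the Lahman data: it assumes hypothetical players who all share the same true skill (`true_skill`, an arbitrary 0.25) and uses a made-up helper, `simulate_averages()`, to draw their observed batting averages.

```{r}
set.seed(1)

# Assumption: every simulated player has the same true skill
true_skill <- 0.25

# Hypothetical helper: observed averages for `players` players,
# each with `at_bats` attempts
simulate_averages <- function(at_bats, players = 1000) {
  rbinom(players, size = at_bats, prob = true_skill) / at_bats
}

few_at_bats <- simulate_averages(10)
many_at_bats <- simulate_averages(1000)

# The spread of observed averages shrinks as the sample size grows
sd(few_at_bats)
sd(many_at_bats)

# With few at-bats, the "best" average is far above the true skill
max(few_at_bats)
max(many_at_bats)
```

Even though no simulated player is better than any other, sorting by the observed average would put the low-sample players at the top, which is the ranking problem described above.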
## Summary
In this chapter, you've learned the tools that dplyr provides for working with data frames.
The tools are roughly grouped into three categories: those that manipulate the rows (like `filter()` and `arrange()`), those that manipulate the columns (like `select()` and `mutate()`), and those that manipulate groups (like `group_by()` and `summarize()`).
In this chapter, we've focused on these "whole data frame" tools, but you haven't yet learned much about what you can do with individual variables.
We'll come back to that in the Transform part of the book, where each chapter will give you tools for a specific type of variable.
In the next chapter, we'll pivot back to workflow to discuss the importance of code style, keeping your code well organized in order to make it easy for you and others to read and understand your code.