% !Rnw root = appendix.main.Rnw
<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_wide)
opts_knit$set(concordance=TRUE)
opts_knit$set(unnamed.chunk.label = 'data-chunk')
@
\chapter{\Rlang Extensions: Data Wrangling}\label{chap:R:data}
\begin{VF}
Essentially everything in S[R], for instance, a call to a function, is an S[R] object. One viewpoint is that S[R] has self-knowledge. This self-awareness makes a lot of things possible in S[R] that are not in other languages.
\VA{Patrick J. Burns}{\emph{S Poetry}, 1998}\nocite{Burns1998}
\end{VF}
\section{Aims of This Chapter}
Base \Rlang and the recommended extension packages (installed by default) include many functions for manipulating data. The \Rlang distribution supplies a complete set of functions and operators that allow all the usual data manipulation operations. These functions have stable and well-described behaviour, so in my view, they should be preferred unless some of their limitations justify the use of alternatives defined in contributed packages. In the present chapter, I describe the new syntax introduced by the most popular contributed \Rlang extension packages, which change in various ways (usually improving one aspect at the expense of another) how we can manipulate data in \Rlang. These independently developed packages extend the \Rlang language not only by adding new ``words'' to it but also by supporting new ways of meaningfully connecting ``words''---i.e., providing new ``grammars'' for data manipulation. While at the current stage of development of base \Rlang not breaking existing code has been the priority, several of the still ``young'' packages in the \pkgname{tidyverse} have prioritised experimentation with enhanced features over backwards compatibility. The development of \pkgname{tidyverse} packages seems to have initially emphasised users' convenience more than encouraging safe/error-free user code. The design of package \pkgname{data.table} has prioritised performance at the expense of ease of use. I do not describe these new approaches in depth but instead briefly compare them to base \Rlang, highlighting the most important differences.
\section{Introduction}
By reading previous chapters, you have already become familiar with base \Rlang classes, methods, functions, and operators for storing and manipulating data. Most of these had been originally designed to perform optimally on rather small data sets \autocite[see][]{Matloff2011}. The performance of these functions has improved significantly over the years and random-access memory in computers has become cheaper, making constraints imposed by the original design of \Rpgrm less limiting. On the other hand, the size of data sets has also increased.
Some contributed packages have aimed at improving performance by relying on different compromises between usability, speed, and reliability than used for base \Rlang.
Package \pkgname{data.table} is the best example of an alternative implementation of data storage and manipulation that maximises the speed of processing for large data sets, using new semantics and requiring a new syntax. We could say that package \pkgname{data.table} is based on a theoretical abstraction, or ``grammar of data'', that is different from that in the \Rlang language. The compromise in this case is a less intuitive syntax and, because arguments are passed by reference instead of by copy, an increased ``responsibility'' of the programmer or data analyst for not overwriting or corrupting data. This focus on performance has made obvious the performance bottlenecks present in base \Rpgrm, which have been subsequently alleviated while maintaining backwards compatibility for users' code.
Another recent development is the \pkgname{tidyverse}, which is a formidable effort to redefine how data analysis operations are expressed in \Rlang code and scripts. In many ways, it is also a new abstraction, or ``grammar of data''. With respect to its implementation, it can also be seen as a new language built on top of the \Rlang language. It is still young and evolving, and the developers from Posit remain committed to fixing what they consider earlier misguided decisions in the design of the packages comprising the \pkgname{tidyverse}. This is a wise decision for the future, but it can be annoying to occasional users who may not be aware of the changes that have taken place between uses. As a user, I highly value long-term stability and backwards compatibility of software. Older systems like base \Rlang provide this, but their long development history shows up as occasional inconsistencies and quirks. The \pkgname{tidyverse} as a paradigm is nowadays popular among data analysts, while among users for whom data analysis is not the main focus, it is more common to use only individual packages as the need arises, e.g., using the new grammar only for some stages of the data analysis workflow.
Until \Rlang 4.1.0, when a computation included a chain of sequential operations, using base \Rlang by itself we could either store the returned value in a temporary variable at each step in the computation, or nest multiple function calls. The first approach is verbose, but allows readable scripts, especially if the names used for temporary variables are wisely chosen. The second approach becomes very difficult to read as soon as there is more than one nesting level. Attempts to find an alternative syntax have borrowed the concept of data \emph{pipes} from Unix shells \autocite{Kernigham1981}. Interestingly, that it has been possible to write packages that define the operators needed to ``add'' this new syntax to \Rlang is a testimony to its flexibility and extensibility. Two packages, \pkgname{magrittr} and \pkgname{wrapr}, define operators for pipe-based syntax. In 2021, a pipe operator was added to the \Rlang language itself, and more recently its features have been enhanced.
In much of my work, I emphasise reproducibility and reliability, preferring base \Rlang over extension packages, except for plotting, whenever practical. For run-once or quick-and-dirty data analyses, I tend to use the \emph{tidyverse}. However, with modern computers and some understanding of the performance bottlenecks in \Rlang code, I have rarely found the effort needed to improve performance by using extension packages worthwhile. The benefit-to-effort balance will be different for those readers who analyse huge data sets.
The definition of the \emph{tidyverse} is rather vague, as package \pkgname{tidyverse} loads and attaches a set of packages of which most, but not all, follow a consistent design and support this new grammar. The packages that are attached by package \pkgname{tidyverse} have changed over time. Package \pkgname{tidyverse}, however, defines a function that lists them.
<<tidyverse-00>>=
tidyverse::tidyverse_packages()
@
In this chapter, you will become familiar with packages \pkgname{tibble}, \pkgname{dplyr}, \pkgname{tidyr}, and \pkgname{lubridate}. Package \pkgnameNI{ggplot2} will be described in chapter \ref{chap:R:plotting} as it implements the grammar of graphics and has little in common with other members of the \pkgname{tidyverse}. As many of the functions in the \emph{tidyverse} can be substituted by existing base \Rlang functions, recognising similarities and differences between them has become important since both approaches are now in common use, and frequently even coexist within \Rlang scripts.
\begin{explainbox}
In any design, there is a tension between opposing goals. In software for data analysis, a key pair of opposed goals are usability, including concise but expressive code, and avoidance of ambiguity. Base \Rlang function \Rfunction{subset()} has an unusual syntax, as it evaluates the expression passed as the second argument within the namespace of the data frame passed as its first argument (see section \ref{sec:calc:df:with} on page \pageref{sec:calc:df:with}). This saves typing, enhancing usability, at the expense of an increased risk of bugs: by reading a call to \code{subset()}, it is not obvious which names are resolved in the environment of the call and which ones within its first argument---i.e., as column names in the data frame. In addition, changes elsewhere in a script can alter how a call to \code{subset()} is interpreted. In reality, \code{subset()} is a wrapper function built on top of the extraction operator \code{[ ]} (see section \ref{sec:calc:indexing} on page \pageref{sec:calc:indexing}). It is a convenience function, mostly intended to be used at the console, rather than in scripts or package code. To extract columns or rows from a data frame, it is always safer to use the \Roperator{[ , ]} or \Roperator{[[ ]]} operators at the expense of some verbosity.
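The name-lookup ambiguity can be demonstrated with a minimal sketch; \code{df1} and \code{y} are hypothetical names created only for this example.
<<subset-box-01>>=
df1 <- data.frame(x = 1:5, y = 5:1)
y <- 0 # a variable whose name clashes with a column name
subset(df1, x > y) # 'y' resolves to the column within df1
df1[df1$x > y, ]   # 'y' resolves to the variable defined above
@
The two statements look interchangeable, but they select different rows: \code{subset()} compares the two columns elementwise, while the extraction operator compares column \code{x} to the value 0.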
Package \pkgname{dplyr}, and much of the \pkgname{tidyverse}, relies on a similar approach as subset to enhance convenience at the expense of ambiguity. Package \pkgname{dplyr} has undergone quite drastic changes during its development history with respect to how to handle the dilemma caused by ``guessing'' of the environment where names should be looked up. There is no easy answer; a simplified syntax leads to ambiguity, and a fully specified syntax is verbose. Recent versions of the package introduced a terse syntax to achieve a concise way of specifying where to look up names. I do appreciate the advantages of the grammar of data that is implemented in the \pkgname{tidyverse}. However, the actual implementation, can result in ambiguities and subtleties that are even more difficult to deal by inexperienced or occasional users than those caused by inconsistencies in base \Rlang. My opinion is that for code that needs to be highly reliable and produce reproducible results in the future, we should for the time being prefer base \Rlang constructs. For code that is to be used once, or for which reproducibility can depend on the use of a specific (old or soon to become old) version of packages like \pkgname{dplyr}, or which is not a burden to thoroughly test and update regularly, the conciseness and power of the new syntax can be an advantage.
\end{explainbox}
Package \pkgname{poorman} re-implements many of the functions in \pkgname{dplyr} and a few from \pkgname{tidyr} using pure \Rlang code instead of compiled \Cpplang code, and with no dependencies on other extension packages. This lightweight approach can be useful when \Rlang's data frames rather than tibbles are preferred, or when the enhanced performance possible with large data sets is not needed.
\section{Packages Used in This Chapter}
<<eval=FALSE>>=
install.packages(learnrbook::pkgs_ch_data)
@
To run the examples included in this chapter, you need first to load and attach some packages from the library (see section \ref{sec:script:packages} on page \pageref{sec:script:packages} for details on the use of packages).
<<message=FALSE>>=
library(learnrbook)
library(tibble)
library(magrittr)
library(wrapr)
library(stringr)
library(dplyr)
library(tidyr)
library(lubridate)
@
\section[Replacements for \texttt{data.frame}]{Replacements for \code{data.frame}}
\index{data frame!replacements|(}
\subsection{Package \pkgname{data.table}}
The function call semantics of the \Rlang language is that arguments are passed to functions by copy (see section \ref{sec:script:functions} on page \pageref{sec:script:functions}). Functions and methods from package \pkgname{data.table} pass arguments by reference, avoiding making copies. In base \Rlang, if the arguments are modified within the code of a function, these changes are local to the function. However, any assignments within the functions and methods defined in package \pkgname{data.table} modify the variables passed as their arguments.
If implemented naively, the copy semantics used in base \Rlang would impose a huge toll on performance. However, \Rlang in most situations only makes a copy in memory if and when the value changes. Consequently, for modern versions of \Rlang, which are good at avoiding unnecessary copying of objects, the normal \Rlang semantics has only a moderate negative impact on performance. However, this impact can still be a problem as modification is detected at the object level, and consequently, \Rlang can make copies of large objects such as a whole data frame when only values in a single column or even just an attribute have changed.
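The copy-on-modify behaviour can be observed with base \Rlang function \Rfunction{tracemem()}; this is only a sketch, as the exact output depends on the \Rlang version and build (memory profiling must be enabled, as it is in the standard CRAN binaries), and \code{a.df} is a hypothetical name used only here.
<<copy-on-modify-sketch>>=
a.df <- data.frame(x = 1:10, y = 10:1)
tracemem(a.df)   # start reporting whenever a.df is duplicated
a.df$x[1] <- 0L  # changing one value in one column triggers copying
untracemem(a.df) # stop the reporting
@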
Passing arguments by reference, as in \pkgname{data.table}, simplifies the tests needed for delayed copying and, by avoiding copies of arguments altogether, achieves the best possible performance. This is a specialised package but extremely useful when dealing with very large data sets. Writing user code, such as scripts, with \pkgname{data.table} requires a good understanding of the pass-by-reference semantics. Obviously, package \pkgname{data.table} makes no attempt at backwards compatibility with base-\Rlang \code{data.frame}.
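As a sketch of the pass-by-reference semantics, the code below uses the \code{:=} operator from \pkgname{data.table}. The package is not attached elsewhere in this chapter, so the example assumes it is installed; \code{add\_sum\_col()} is a hypothetical function name.
<<data-table-sketch>>=
library(data.table)
a.dt <- data.table(x = 1:3, y = 3:1)
add_sum_col <- function(dt) {
  dt[ , z := x + y] # ':=' adds column 'z' by reference
}
add_sum_col(a.dt)
a.dt # 'a.dt' itself now contains column 'z', modified inside the function
@
With a base \Rlang \code{data.frame}, the same assignment inside a function would modify only a local copy.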
In contrast to the design of package \pkgname{data.table}, the focus of the \pkgname{tidyverse} is not only performance. The design of this grammar has also considered usability. Design compromises have been resolved differently than in base \Rlang or \pkgname{data.table} and in some cases code written using base \Rlang can significantly outperform the \pkgname{tidyverse} and vice versa. There exist packages that implement a translation layer from the syntax of the \pkgname{tidyverse} into that of \pkgname{data.table} or relational database queries.
\subsection{Package \pkgname{tibble}}\label{sec:data:tibble}
\index{tibble!differences with data frames|(}
Package \pkgname{tibble} aimed at enhanced performance, like \pkgname{data.table}, but not at the expense of usability. The \Rfunction{tibble()} constructor supports semantics that allow more concise code compared to the \Rfunction{data.frame()} constructor. The \code{print()} method for tibbles displays them concisely and provides additional information. With small data sets, differences in performance are in most cases irrelevant. Early on, tibbles were consistently faster than base \Rlang data frames, but the performance of \Rlang has improved over the years. Nowadays, there is no clear winner. The decision to use package \pkgname{tibble} depends mostly on whether one uses the other packages from the \pkgname{tidyverse}, mainly \pkgname{dplyr} and \pkgname{tidyr}, or not.
The authors of package \pkgname{tibble} describe their \Rclass{tbl} class as nearly backwards compatible with \Rclass{data.frame} and make it a derived class. This backwards compatibility is only partial, so in some situations data frames and tibbles are not equivalent.
The class and methods that package \pkgname{tibble} defines lift some of the restrictions imposed by the design of base \Rlang data frames at the cost of creating some incompatibilities due to changed (improved) syntax for member extraction. Tibbles simplify the creation of ``columns'' of class \Rclass{list} and remove support for columns of class \Rclass{matrix}. Handling of attributes is also different, with no row names added by default. There are also differences in default behaviour of both constructors and methods.
\emph{Although, objects of class \Rclass{tbl} can be passed as arguments to functions that expect data frames as input, these functions are not guaranteed to work correctly with tibbles as a result of the differences in behaviour of some methods and operators.}
\begin{warningbox}
It is easy to write code that will work correctly both with data frames and tibbles by avoiding constructs that behave differently. However, code that is syntactically correct according to the \Rlang language may fail to work as expected if a tibble is used in place of a data frame. Only functions tested to work correctly with both tibbles and data frames can be relied upon as compatible.
\end{warningbox}
\begin{explainbox}
That it has been possible to define tibbles as objects of a class derived from \Rclass{data.frame} reveals one of the drawbacks of the simple implementation of S3 object classes in \Rlang. Allowing this is problematic because the promise of compatibility implicit in a derived class is not always fulfilled. An independently developed method designed for data frames will not necessarily work correctly with tibbles, but in the absence of a specialised method for tibbles it will be used (dispatched) when the generic method is called with a tibble as argument.
\end{explainbox}
\begin{warningbox}
One should be aware that although the constructor \Rfunction{tibble()} and conversion function \Rfunction{as\_tibble()}, as well as the test \Rfunction{is\_tibble()}, use the name \Rclass{tibble}, the class attribute is named \code{tbl}. This is inconsistent with base \Rlang conventions, as is the use of an underscore instead of a dot in the names of these functions.
<<tibble-info-01>>=
my.tb <- tibble(numbers = 1:3)
is_tibble(my.tb)
inherits(my.tb, "tibble")
class(my.tb)
@
Furthermore, to support tibbles based on different underlying data sources such as \code{data.table} objects or databases, a further derived class is needed. In our example, as our tibble has an underlying \code{data.frame} class, the most derived class of \code{my.tb} is \Rclass{tbl\_df}.
\end{warningbox}
Function \code{show\_classes()}, defined below, concisely reports the class of the object passed as an argument and of its members (\emph{apply} functions are described in section \ref{sec:data:apply} on page \pageref{sec:data:apply}).
<<tibble-01>>=
show_classes <- function(x) {
  cat(
    paste(paste(class(x)[1], "containing:"),
          paste(names(x), sapply(x, class), collapse = ", ", sep = ": "),
          sep = "\n")
  )
}
@
The \Rfunction{tibble()} constructor by default does not convert character data into factors, while the \Rfunction{data.frame()} constructor did before \Rlang version 4.0.0. The default can be overridden through an argument passed to these constructors, and in the case of \Rfunction{data.frame()} also by setting an \Rlang option. This new behaviour extends to function \Rfunction{read.table()} and its wrappers (see section \ref{sec:files:txt} on page \pageref{sec:files:txt}).
<<tibble-02>>=
my.df <- data.frame(codes = c("A", "B", "C"), numbers = -1:1, integers = 1L:3L)
is.data.frame(my.df)
is_tibble(my.df)
show_classes(my.df)
@
Tibbles are, or pretend to be (see above), data frames---or more formally, class \Rclass{tbl\_df} is derived from class \code{data.frame}. However, data frames are not tibbles.
<<tibble-03>>=
my.tb <- tibble(codes = c("A", "B", "C"), numbers = -1:1, integers = 1L:3L)
is.data.frame(my.tb)
is_tibble(my.tb)
show_classes(my.tb)
@
The \Rmethod{print()} method for tibbles differs from that for data frames in that it outputs a header with the text ``A tibble:'' followed by the dimensions (number of rows $\times$ number of columns), adds under each column name an abbreviation of its class, and displays only a limited number of rows and columns instead of all of them. In addition, individual values are formatted more compactly, using colour to highlight, for example, negative numbers in red.
<<tibble-04>>=
print(my.df)
print(my.tb)
@
\begin{explainbox}
The default number of rows printed depends on \Rlang option \code{tibble.print\_max} that can be set with a call to \Rfunction{options()}. This option plays for tibbles a similar role as option \code{max.print} plays for base \Rlang \Rmethod{print()} methods.
<<tibble-print-02>>=
options(tibble.print_max = 3, tibble.print_min = 3)
@
\end{explainbox}
\begin{playground}
Print methods for tibbles and data frames differ in their behaviour when not all columns fit in a printed line. 1) Construct a data frame and an equivalent tibble with at least 50 rows and then test how the output looks when they are printed. 2) Construct a data frame and an equivalent tibble with more columns than will fit in the width of the \Rlang console and then test how the output looks when they are printed.
\end{playground}
Data frames can be converted into tibbles with \code{as\_tibble()}.
<<tibble-05>>=
my_conv.tb <- as_tibble(my.df)
is.data.frame(my_conv.tb)
is_tibble(my_conv.tb)
show_classes(my_conv.tb)
@
Tibbles can be converted into ``real'' data.frames with \code{as.data.frame()}.
<<tibble-06>>=
my_conv.df <- as.data.frame(my.tb)
is.data.frame(my_conv.df)
is_tibble(my_conv.df)
show_classes(my_conv.df)
@
\begin{warningbox}
When dealing with tibbles, column- and row binding should be done with functions \Rfunction{bind\_rows()} and \Rfunction{bind\_cols()} from \pkgname{dplyr}, not with functions \Rfunction{rbind()} and \Rfunction{cbind()} from \Rlang. See explanation below.
\end{warningbox}
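A quick check, reusing \code{my.tb} created above, confirms that the \pkgname{dplyr} functions preserve the tibble class; as explained below, the base \Rlang functions may instead dispatch methods defined for data frames.
<<tibble-bind-01>>=
class(bind_rows(my.tb, my.tb)) # remains a tibble
nrow(bind_rows(my.tb, my.tb))
@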
\begin{explainbox}
Not all conversion functions work consistently when converting from a derived class into its parent. The reason for this is disagreement between authors on what the \emph{correct} behaviour is based on logic and theory. You are not likely to be hit by this problem frequently, but it can be difficult to diagnose.
We have already seen that calling \Rfunction{as.data.frame()} on a tibble strips the derived class attributes, returning a data frame. We will look at the whole character vector stored in the \code{"class"} attribute to demonstrate the difference. We also test the two objects for equality, in two different ways. Using the operator \code{==} tests for equivalence, i.e., that the objects contain the same data. Using \Rfunction{identical()} tests that the objects are exactly the same, including attributes such as \code{"class"}, which we retrieve using \Rfunction{class()}.
<<tibble-box-01>>=
class(my.tb)
class(my_conv.df)
my.tb == my_conv.df
identical(my.tb, my_conv.df)
@
Now we derive from a tibble, and then attempt a conversion back into a tibble.
<<tibble-box-02>>=
my.xtb <- my.tb
class(my.xtb) <- c("xtb", class(my.xtb))
class(my.xtb)
my_conv_x.tb <- as_tibble(my.xtb)
class(my_conv_x.tb)
my.xtb == my_conv_x.tb
identical(my.xtb, my_conv_x.tb)
@
The two viewpoints on conversion functions are as follows. If the argument passed to a conversion function is an object of a derived class, 1) it should be returned after stripping the derived class, or 2) it should be returned as is, without stripping the derived class. Base \Rlang follows, as far as I have been able to work out, approach 1). Some packages in the \pkgname{tidyverse} sometimes follow, or have followed in the past, approach 2). If in doubt about the behaviour of some function, then you will need to do a test similar to the one used above.
As tibbles have been defined as a class derived from \code{data.frame}, if methods have not been explicitly defined for tibbles, the methods defined for data frames are called, and these are likely to return a data frame rather than a tibble. Even a frequent operation like column binding is affected, at least at the time of writing.
<<tibble-box-03a>>=
class(my.df)
class(my.tb)
@
<<tibble-box-03b>>=
class(cbind(my.df, my.tb))
class(cbind(my.tb, my.df))
@
<<tibble-box-03c>>=
class(cbind(my.df, added = -3:-1))
class(cbind(my.tb, added = -3:-1))
identical(cbind(my.tb, added = -3:-1), cbind(my.df, added = -3:-1))
@
\end{explainbox}
There are additional important differences between the constructors \Rfunction{tibble()} and \code{data.frame()}. One of them is that in a call to \Rfunction{tibble()}, member variables (``columns'') being defined can be used in the definition of subsequent member variables.
<<tibble-07>>=
tibble(a = 1:5, b = 5:1, c = a + b, d = letters[a + 1])
@
\begin{playground}
What is the behaviour if you replace \Rfunction{tibble()} by \Rfunction{data.frame()} in the statement above?
\end{playground}
\begin{warningbox}
While objects passed directly as arguments to the \Rfunction{data.frame()} constructor to be included as ``columns'' can be factors, vectors or matrices (with the same number of rows as the data frame), arguments passed to the \Rfunction{tibble()} constructor can be factors, vectors or lists (with the same number of members as rows in the tibble). As we saw in section \ref{sec:R:data:frames} on page \pageref{sec:R:data:frames}, base \Rlang's data frames can contain columns of classes \code{list} and \code{matrix}. The difference is in the need to use \Rfunction{I()}, the identity function, to protect these variables during construction and assignment to true \code{data.frame} objects as otherwise list members and matrix columns will be assigned to multiple individual columns in the data frame.
<<tibble-08>>=
tibble(a = 1:5, b = 5:1, c = list("a", 2, 3, 4, 5))
@
A list of lists or a list of vectors can be directly passed to the constructor.
<<tibble-09>>=
tibble(a = 1:5, b = 5:1, c = list("a", 1:2, 0:3, letters[1:3], letters[3:1]))
@
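For comparison, a minimal sketch of the base \Rlang approach: \Rfunction{I()} protects a list so that it is kept as a single column by \Rfunction{data.frame()}; \code{df2} is a hypothetical name used only here.
<<data-frame-I-sketch>>=
df2 <- data.frame(a = 1:3, b = I(list("a", 1:2, 0:3)))
df2$b # a single column holding a list, protected by I()
@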
\end{warningbox}
\index{tibble!differences with data frames|)}
\index{data frame!replacements|)}
\section{Data Pipes}\label{sec:data:pipes}
\index{chaining statements with \emph{pipes}|(}
The first obvious difference between scripts using \pkgname{tidyverse} packages and those using only base \Rlang is the frequent use of \emph{pipes}. This is, however, mostly a question of preference, as pipes can be used with base \Rlang functions as well. In addition, since version 4.1.0, \Rlang has a native pipe operator \Roperator{\textbar >}, described in section \ref{sec:script:pipes} on page \pageref{sec:script:pipes}. Here I describe other, earlier implementations of pipes, and the differences between these and \Rlang's pipe operator.
\subsection{\pkgname{magrittr}}
\index{pipes!tidyverse|(}
\index{pipe operator}
A set of operators for constructing pipes of \Rlang functions is implemented in package \pkgname{magrittr}. It preceded the native \Rlang pipe by several years. The pipe operator defined in package \pkgname{magrittr}, \Roperator{\%>\%}, is imported and re-exported by package \pkgname{dplyr}, which in turn defines functions that work well in data pipes.
Operator \Roperator{\%>\%} plays a similar role as \Rlang's \Roperator{\textbar >}.
<<pipes-x00>>=
data.in <- 1:10
@
<<pipes-x04>>=
data.in %>% sqrt() %>% sum() -> data0.out
@
The piped value can be made explicit by using a dot (\code{.}) as a placeholder, passed as an argument either by name or by position to the function on the \emph{rhs} of the \Roperator{\%>\%} operator. Thus, \code{.} in \pkgname{magrittr} plays a similar but not identical role as \code{\_} in base \Rlang pipes.
<<pipes-x04a>>=
data.in %>% sqrt(x = .) %>% sum(.) -> data1.out
all.equal(data0.out, data1.out)
@
\Rlang's native pipe operator requires, consistently with \Rlang syntax in all other situations, that functions to be evaluated are written using parenthesis syntax, while \pkgname{magrittr} allows the parentheses to be omitted when the piped argument is the only one passed to the function call on the \textit{rhs}.
<<pipes-x04b>>=
data.in %>% sqrt %>% sum -> data5.out
all.equal(data0.out, data5.out)
@
Package \pkgname{magrittr} provides additional pipe operators, such as ``tee'' (\Roperator{\%T>\%}) to create a branch in the pipe, and \Roperator{\%<>\%} to apply the pipe by reference. These operators are much less frequently used than \Roperator{\%>\%}.
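As a sketch of the ``tee'' operator, the pipe below prints the intermediate result as a side effect while passing it unchanged to the next step; \code{data6.out} is a hypothetical name following the convention used above.
<<pipes-x04c>>=
data.in %>% sqrt() %T>% print() %>% sum() -> data6.out
all.equal(data0.out, data6.out)
@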
\index{pipes!tidyverse|)}
\subsection{\pkgname{wrapr}}
\index{pipes!wrapr|(}
\index{dot-pipe operator}
The \Roperator{\%.>\%}, or ``dot-pipe'', operator from package \pkgname{wrapr} allows expressions both on the \textit{rhs} and \textit{lhs}, and \emph{enforces the use of the dot} (\code{.}) as placeholder for the piped object. Given the popularity of \pkgname{dplyr}, the pipe operator from \pkgname{magrittr} has been the most used.
Rewritten using the dot-pipe operator, the pipe in the previous chunk becomes
<<pipes-x05>>=
data.in %.>% sqrt(.) %.>% sum(.) -> data2.out
all.equal(data0.out, data2.out)
@
However, as operator \Roperator{\%>\%} from \pkgname{magrittr} recognises the \code{.} placeholder without enforcing its use, the code below where \Roperator{\%.>\%} is replaced by \Roperator{\%>\%} returns the same value as that above.
<<pipes-x05a>>=
data.in %>% sqrt(.) %>% sum(.) -> data3.out
all.equal(data0.out, data3.out)
@
To use operator \Roperator{\textbar >} from \Rlang, we need to edit the code, using an underscore (\code{\_}) as the placeholder and passing it by name to parameters in the function calls on the \textit{rhs}.
<<pipes-x05b>>=
data.in |> sqrt(x = _) |> sum(x = _) -> data4.out
all.equal(data0.out, data4.out)
@
We can, in this case, simply use no placeholder, and pass the arguments by position to the first parameter of the functions.
<<pipes-x05c>>=
data.in |> sqrt() |> sum() -> data4.out
all.equal(data0.out, data4.out)
@
The \index{pipes!expressions in rhs} dot-pipe operator \Roperator{\%.>\%} from \pkgname{wrapr} allows us to use the placeholder \code{.} in bare expressions on the \emph{rhs}, in addition to in function calls.
<<pipes-x07>>=
data.in %.>% (.^2) -> data7.out
@
In contrast, operator \Roperator{\%>\%} does not support bare expressions, only function-call syntax on the \textit{rhs}, forcing operators to be called using parenthesis syntax.
<<pipes-x07b>>=
data.in %>% `^`(e1 = ., e2 = 2) -> data9.out
all.equal(data7.out, data9.out)
@
In conclusion, \Rlang syntax for expressions is preserved when using the dot-pipe operator from \pkgname{wrapr}, with the only caveat that, because of the higher precedence of the \Roperator{\%.>\%} operator, we need to ``protect'' bare expressions containing other operators by enclosing them in parentheses. In the examples above, we showed a simple expression so that it could be easily converted into a function call. The \Roperator{\%.>\%} operator also supports more complex expressions, even with multiple uses of the placeholder.
<<pipes-x07c>>=
data.in %.>% (.^2 + sqrt(. + 1))
@
\subsection{Comparing pipes}
Under the hood, the implementations of operators \Roperator{\textbar >}, \Roperator{\%>\%} and \Roperator{\%.>\%} are different, with \Roperator{\textbar >} expected to have the best performance, followed by \Roperator{\%.>\%}, and with \Roperator{\%>\%} being the slowest. As implementations evolve, performance may vary among versions. However, \Roperator{\textbar >}, being part of \Rlang, is likely to remain the fastest.
Being part of the \Rlang language, \Roperator{\textbar >} will remain available and most likely also backwards compatible, while packages could be abandoned or redesigned by their maintainers. For this reason, it is preferable to use the \Roperator{\textbar >} in scripts or code expected to be reused, unless compatibility with \Rlang versions earlier than 4.2.0 is needed. Elsewhere in the book I have used \Rlang's pipe operator \Roperator{\textbar >}.
Pipes can be used with any \Rlang function, but how elegant their use can be depends on the order of the formal parameters. This is especially the case when passing arguments implicitly to the first parameter of the function on the \emph{rhs}. Several of the functions and methods defined in \pkgnameNI{tidyr}, \pkgnameNI{dplyr}, and a few other packages from the \pkgname{tidyverse} fit this need.
Writing a series of statements and saving intermediate results in temporary variables makes debugging easiest. Debugging pipes is not as easy, as it usually requires splitting them; one alternative approach is the insertion of calls to \Rfunction{print()}. This is possible because \Rfunction{print()} returns its input invisibly in addition to displaying it.
<<pipes-x08>>=
data.in |> print() |> sqrt() |> print() |> sum() |> print() -> data10.out
data10.out
@
Debugging nested function calls is difficult. So, in general, it is preferable to use pipes instead of deeply nested function calls. However, it is best to avoid very long pipes. Normally, while writing scripts or analysing data it is important to check the correctness of intermediate results, so saving them to variables can save time and effort.
\begin{explainbox}
The design of \Rlang's native pipes has benefited from the experience gathered with earlier implementations and, being now part of the language, we can expect it to become the reference one once its implementation is stable. The designers of the three implementations have to some extent disagreed in their design decisions. Consequently, some differences are more than aesthetic. \Rlang pipes are simpler, easier to use and expected to be the fastest. Those from \pkgname{magrittr} are the most feature rich, but not as safe to use and, purportedly because of a more complex implementation, the slowest. Package \pkgname{wrapr} is an attempt to enhance pipes compared to \pkgname{magrittr}, focusing on syntactic simplicity and performance. \Rlang's \Roperator{\textbar >} operator has been enhanced since its addition to the language in version 4.1.0. These enhancements have all been backwards compatible.
The syntax of operators \Roperator{\textbar >} and \Roperator{\%>\%} is not identical. With \Rlang's \Roperator{\textbar >} (as of \Rpgrm 4.3.0), the placeholder \code{\_} can only be passed to parameters by name, while with operator \Roperator{\%>\%} from \pkgname{magrittr} the placeholder \code{.} can be used to pass arguments both by name and by position. With operator \Roperator{\%.>\%}, the use of the placeholder \code{.} is mandatory, and it can be passed by name or by position to the function call on the \textit{rhs}. Other differences are deeper, like those related to the use of the extraction operators on the \emph{rhs}, or the support, or lack of it, for expressions that are not explicit function calls.
\end{explainbox}
In the case of \Rlang, the \Roperator{\textbar >} pipe is conceptually a substitution with no alteration of the syntax or evaluation order. This avoids \emph{surprising} the user and simplifies implementation. In other words, \Rlang pipes are an alternative way of writing nested function calls. Quoting \Rlang documentation:
\begin{quotation}
Currently, pipe operations are implemented as syntax transformations. So an expression written as \code{x |> f(y)} is parsed as \code{f(x, y)}. It is worth emphasising that while the code in a pipeline is written sequentially, regular \Rlang semantics for evaluation apply and so piped expressions will be evaluated only when first used in the rhs expression.
\end{quotation}
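The syntax transformation described in the quotation above can be observed directly by quoting a piped expression: the parser rewrites the pipe into a nested function call before evaluation.
<<pipes-syntax-sketch>>=
# the pipe is parsed into nested calls: sum(sqrt(data.in))
quote(data.in |> sqrt() |> sum())
@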
\begin{warningbox}
While frequently the different pipe operators can substitute for each other by adjusting the syntax, in some cases the differences among them in the order and timing of the evaluation of terms need to be taken into account.
In some situations, operator \Roperator{\%>\%} from package \pkgname{magrittr} can behave unexpectedly. One example is the use of \Rfunction{assign()} in a pipe. With \Rlang's operator \Roperator{\textbar >} assignment takes place as expected.
<<pipes-clean, include=FALSE>>=
rm(data6.out, data7.out, data8.out)
@
<<pipes-x06>>=
data.in |> assign(x = "data6.out", value = _)
all.equal(data.in, data6.out)
@
Named arguments are also supported with the dot-pipe operator from \pkgname{wrapr}.
<<pipes-x06a>>=
data.in %.>% assign(x = "data7.out", value = .)
all.equal(data.in, data7.out)
@
However, the pipe operator (\Roperator{\%>\%}) from package \pkgname{magrittr} silently and unexpectedly fails to assign the value to the name.
<<pipes-x06b>>=
data.in %>% assign(x = "data8.out", value = .)
if (exists("data8.out")) {
all.equal(data.in, data8.out)
} else {
print("'data8.out' not found!")
}
@
Although there are usually alternatives to get the computations done correctly, unexpected silent behaviour can be confusing.
\end{warningbox}
\index{pipes!wrapr|)}
\index{chaining statements with \emph{pipes}|)}
\section{Reshaping with \pkgname{tidyr}}\label{sec:data:reshape}
\index{reshaping tibbles|(}
\index{long-form- and wide-form tabular data}
Data stored in table-like formats can be arranged in different ways, wide and long (Figure \ref{fig:wide:long:data}). In base \Rlang, most model fitting functions, and the \Rfunction{plot()} method using (model) formulas, expect data to be arranged in ``long form'' so that each row in a data frame corresponds to a single observation (or measurement) event on a subject. Each column corresponds to a different measured feature, or ancillary information like the time of measurement, or a factor describing a classification of subjects according to treatments or features of the experimental design (e.g., blocks). Covariates measured on the same subject at an earlier point in time may also be stored in a column. Data arranged in \emph{long form} have been nicknamed ``tidy'', and this is reflected in the name given to the \pkgname{tidyverse} suite of packages. However, this longitudinal arrangement of data has been the preferred format since the inception of \Slang and \Rlang. Data in which columns correspond to measurement events are described as being in a \emph{wide form}.
\begin{figure}
\begin{footnotesize}
\parbox{.55\linewidth}{
\begin{tabular}{lp{1.25cm}p{1.25cm}p{1.25cm}}
\toprule
Subject & Height January & Height February & Height March \\
\midrule
A & 5 & 7 & 9 \\
B & 4 & 7 & 9 \\
C & 6 & 7 & 8 \\
\bottomrule
\end{tabular}}
\hfill
\parbox{.4\linewidth}{
\begin{tabular}{lll}
\toprule
Subject & Date & Height \\
\midrule
A & January & 5 \\
B & January & 4 \\
C & January & 6 \\
A & February & 7 \\
B & February & 7 \\
C & February & 7 \\
A & March & 9 \\
B & March & 9 \\
C & March & 8 \\
\bottomrule
\end{tabular}}
\end{footnotesize}
\caption{Wide (left) and long (right) data formats.}\label{fig:wide:long:data}
\end{figure}
The wide form is rather frequently used when observations on each subject are repeated in time. In this case, there is one row per subject and one column for each combination of response variable and time of measurement. Real-world data at the time of acquisition are rather frequently stored in the wide format, or even in ad hoc non-rectangular formats, so in many cases the first task in data analysis is to reshape the data. Package \pkgname{tidyr} provides functions for reshaping data from wide to long form and \emph{vice versa}.
\begin{warningbox}
Package \pkgname{tidyr} replaced package \pkgname{reshape2}, which in turn replaced package \pkgname{reshape}, while, additionally, the functions originally implemented in \pkgname{tidyr} have recently been replaced by new ones with different names and syntax. If a data analyst uses these functions every day, the cost involved is frequently tolerable or even desirable given the improvements. However, for \Rlang users in applied fields, to whom this book is targeted, in the long run using function \Rfunction{reshape()} from base \Rlang can be better, even though its syntax is not as straightforward (see section \ref{sec:calc:reshape} on page \pageref{sec:calc:reshape}). This does not detract from the advantages of using a clear workflow as emphasised by the proponents of the \emph{tidyverse}. Here I only want to emphasise that using some of the packages from the \pkgname{tidyverse}, as with any software with an evolving user interface, can in some cases have a cost that needs to be taken into consideration.
\end{warningbox}
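As a sketch of the base-\Rlang alternative mentioned above, \Rfunction{reshape()} can convert \code{iris} into a long-form data frame. The column names used here (\code{part}, \code{dimension}, \code{flower}) are arbitrary choices for this illustration.
<<reshape-base-sketch>>=
# collapse the four measurement columns into one long variable
long_iris.df <-
  reshape(iris,
          direction = "long",
          varying = list(names(iris)[1:4]), # columns to stack
          v.names = "dimension",            # name of stacked values
          times = names(iris)[1:4],         # labels for each column
          timevar = "part",
          idvar = "flower")
nrow(long_iris.df) # 150 flowers x 4 parts
@
Unlike with \Rfunction{pivot\_longer()}, splitting or otherwise manipulating the column names has to be coded separately.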
I use in the examples below the \Rdata{iris} data set included in base \Rlang. Some operations on \Rlang \code{data.frame} objects with functions and operators from the \pkgname{tidyverse} packages will return \code{data.frame} objects while others will return tibbles---i.e., \Rclass{tbl\_df} objects. Consequently, it is safer to first convert into tibbles the data frames we will work with.
<<tidy-tibble-00>>=
iris.tb <- as_tibble(iris)
@
Function \Rfunction{pivot\_longer()} from \pkgname{tidyr} converts data from wide form into long form (or ``tidy''). We use it here to obtain a long-form tibble. By comparing \code{iris.tb} with \code{long\_iris.tb} we can appreciate how \Rfunction{pivot\_longer()} reshaped its input.
<<tidy-tibble-01>>=
long_iris.tb <-
pivot_longer(iris.tb,
cols = -Species,
names_to = "part",
values_to = "dimension")
long_iris.tb
@
\begin{warningbox}
Differently to base \Rlang, in most functions from the \pkgname{tidyverse} packages we can use bare column names preceded by a minus sign to signify ``all other columns''.
\end{warningbox}
Function \Rfunction{pivot\_wider()} does not directly implement the exact inverse operation of \Rfunction{pivot\_longer()}. When multiple rows share codes, i.e., with replication---in our case within each species and flower part---the returned tibble has columns that are lists of vectors. We need to expand these columns with function \Rfunction{unnest()} in a second step.
<<tidy-tibble-01a>>=
wide_iris.tb <-
pivot_wider(long_iris.tb,
names_from = "part",
values_from = "dimension",
values_fn = list) |>
unnest(cols = -Species)
wide_iris.tb
@
\begin{playground}
Is \code{wide\_iris.tb} equal to \code{iris.tb}, the tibble we converted into long shape and back into wide shape? Run the comparisons below, and print the tibbles to find out.
<<tidy-tibble-pivot-pg01, eval=eval_playground>>=
identical(iris.tb, wide_iris.tb)
all.equal(iris.tb, wide_iris.tb)
all.equal(iris.tb, wide_iris.tb[ , colnames(iris.tb)])
@
What has changed? Would it matter if our code used indexing with a numeric vector to extract columns? or if it used column names as character strings?
\end{playground}
\begin{warningbox}
Starting from version 1.0.0 of \pkgname{tidyr}, functions \Rfunction{gather()} and \Rfunction{spread()} are deprecated and replaced by functions \Rfunction{pivot\_longer()} and \Rfunction{pivot\_wider()}. These new functions, described above, use a different syntax than the old ones.
\end{warningbox}
%Base \Rlang function \Rfunction{reshape()} can do both operations, selected by passing it an argument. The names of the parameters are different and the manipulation of names is not built in as in the \pkgname{tidyr} functions. Package \pkgname{poorman} provides a light-weight and dependency-free implementation of the core functions of package \pkgname{dplyr} as well as its own versions of functions \Rfunction{pivot\_longer()} and \Rfunction{pivot\_wider()}.
%
\begin{advplayground}
Functions \Rfunction{pivot\_longer()} and \Rfunction{pivot\_wider()} from package \pkgname{poorman} attempt to replicate the behaviour of the functions of the same name from package \pkgname{tidyr}. In some edge cases, the behaviour differs. Test whether the two code chunks above return identical or equal values when \code{poorman::} is prepended to the names of these two functions. First, ensure that package \pkgname{poorman} is installed, then run the code below.
<<tidy-tibble-01c, eval=eval_playground>>=
poor_long_iris.tb <-
poorman::pivot_longer(
iris,
cols = -Species,
names_to = "part",
values_to = "dimension")
identical(long_iris.tb, poor_long_iris.tb)
all.equal(long_iris.tb, poor_long_iris.tb)
class(long_iris.tb)
class(poor_long_iris.tb)
@
What is the difference between the values returned by the two functions? Could switching from package \pkgname{tidyr} to package \pkgname{poorman} affect code downstream of pivoting?
\end{advplayground}
\index{reshaping tibbles|)}
\section{Data Manipulation with \pkgname{dplyr}}\label{sec:dplyr:manip}
\index{data manipulation in the tidyverse|(}
The first advantage a user of the \pkgname{dplyr} functions and methods sees is the completeness of the set of operations supported and the symmetry and consistency among the different functions. A second advantage is that almost all the functions are defined not only for objects of class \Rclass{tibble}, but also for objects of class \code{data.table} (package \pkgname{dtplyr}) and for SQL databases (package \pkgname{dbplyr}), with consistent syntax (see also section \ref{sec:data:db} on page \pageref{sec:data:db}). As discussed above, using a code base that is not yet fully stable has a cost that needs to be balanced against the gain obtained from its use.
\subsection{Row-wise manipulations}
\index{row-wise operations on data|(}
Assuming that the data are stored in long form, row-wise operations are operations combining values from the same observation event---i.e., calculations within a single row of a data frame or tibble (see section \ref{sec:calc:df:with} on page \pageref{sec:calc:df:with} for the base \Rlang approach). Using functions \Rfunction{mutate()} and \Rfunction{transmute()} we can obtain derived quantities by combining different variables, or variables and constants, or applying a mathematical transformation. We add new variables (columns) retaining existing ones using \Rfunction{mutate()} or we assemble a new tibble containing only the columns we explicitly specify using \Rfunction{transmute()}.
\begin{explainbox}
Different from usual \Rlang syntax, with \Rfunction{tibble()}, \Rfunction{mutate()} and \Rfunction{transmute()} we can use the values passed as earlier arguments in the statements computing the values passed as later arguments. In many cases, this allows more concise and easier-to-understand code.
<<tidy-tibble-02z>>=
tibble(a = 1:5, b = 2 * a)
@
\end{explainbox}
Continuing with the example from the previous section, a likely next step would be to split the values in variable \code{part} into \code{plant\_part} and \code{part\_dim}. We use \code{mutate()} from \pkgname{dplyr} and \Rfunction{str\_extract()} from \pkgname{stringr} (a package included in the \pkgname{tidyverse}, aimed at the manipulation of character strings). We use regular expressions (see section \ref{sec:calc:regex} on page \pageref{sec:calc:regex}) as arguments passed to \code{pattern}. We do not show it here, but \Rfunction{mutate()} can be used with variables of any \code{mode}, and calculations can involve values from several columns. It is even possible to operate on values applying a lag or, in other words, using rows displaced relative to the current one.
<<tidy-tibble-02>>=
long_iris.tb |>
mutate(plant_part = str_extract(part, "^[:alpha:]*"),
part_dimension = str_extract(part, "[:alpha:]*$")) -> long_iris.tb
long_iris.tb
@
In the next few chunks, returned values are displayed, while in normal use they would be assigned to variables or passed to the next function in a pipe using \Roperator{\textbar >}.
Function \Rfunction{arrange()} is used to sort rows---it makes sorting a data frame or tibble simpler than when using \Rfunction{sort()} or \Rfunction{order()}. Below, \code{long\_iris.tb} rows are sorted based on the values in three of its columns.
<<tidy-tibble-03>>=
arrange(long_iris.tb, Species, plant_part, part_dimension)
@
Function \Rfunction{filter()} can be used to extract a subset of rows---similar to \Rfunction{subset()} but with a syntax consistent with that of other functions in the \pkgname{tidyverse}. In this case, 300 out of the original 600 rows are retained.
<<tidy-tibble-04>>=
filter(long_iris.tb, plant_part == "Petal")
@
Function \Rfunction{slice()} can be used to extract a subset of rows based on their positions---an operation that in base \Rlang would use positional (numeric) indexes with the \code{[ , ]} operator: \code{long\_iris.tb[1:5, ]}.
<<tidy-tibble-05>>=
slice(long_iris.tb, 1:5)
@
Function \Rfunction{select()} can be used to extract a subset of columns---this would be done with positional (numeric) indexes with \code{[ , ]}, passing them to the second argument as numeric indexes or column names in a vector. It is also possible to use function \Rfunction{subset()} from base \Rlang (see section \ref{sec:calc:df:subset} on page \pageref{sec:calc:df:subset}). Negative indexes in base \Rlang can only be numeric, while \Rfunction{select()} accepts bare column names prepended with a minus for exclusion.
<<tidy-tibble-06>>=
select(long_iris.tb, -part)
@
In addition, \Rfunction{select()}, like other functions in \pkgname{dplyr}, accepts ``selectors'' returned by functions \Rfunction{starts\_with()}, \Rfunction{ends\_with()}, \Rfunction{contains()}, and \Rfunction{matches()} to extract or retain columns. For this example, we use the wide-shaped \code{iris.tb} instead of \code{long\_iris.tb}.
<<tidy-tibble-06a>>=
select(iris.tb, -starts_with("Sepal"))
@
<<tidy-tibble-06b>>=
select(iris.tb, Species, matches("pal"))
@
Function \Rfunction{rename()} can be used to rename columns, whereas base \Rlang requires the use of both \Rfunction{names()} and \Rfunction{names()<-} and \emph{ad hoc} code to match new and old names. As shown below, the syntax for each column name to be changed is \code{<new name> = <old name>}. The two names can be given either as bare names as below or as character strings.
<<tidy-tibble-07>>=
long_iris.tb |>
select(-part) |>
rename(part = plant_part, size = dimension, dimension = part_dimension)
@
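For comparison, renaming in base \Rlang requires matching names explicitly; a minimal sketch using a throw-away copy of \code{iris}:
<<rename-base-sketch>>=
iris.df <- iris
# match the old name and assign the new one in its place
names(iris.df)[names(iris.df) == "Species"] <- "species"
names(iris.df)
@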
\index{row-wise operations on data|)}
\begin{explainbox}
Several of the functions described in this section were needed because operator \Roperator{\%>\%} from package \pkgname{magrittr} did not support the use of the extraction operators on the \emph{rhs} using operator syntax. Operator \Roperator{\textbar >}, starting from \Rlang version 4.3.0, does not have this limitation (see section \ref{sec:script:pipes} on page \pageref{sec:script:pipes}); however, the functions from \pkgname{dplyr} remain useful as they allow more concise and clearer coding of complex conditions.
\end{explainbox}
\subsection{Group-wise manipulations}\label{sec:dplyr:group:wise}
\index{group-wise operations on data|(}
Another important operation is to summarise quantities by groups of rows. Contrary to base \Rlang, the grammar of data manipulation as implemented in \pkgname{dplyr}, makes it possible to split this operation into two steps: the setting of the grouping, and the calculation of summaries. This simplifies the code, making it more easily understandable when using pipes compared to the approach of base \Rlang \Rfunction{aggregate()} (see section \ref{sec:calc:df:aggregate} on page \pageref{sec:calc:df:aggregate}).
\begin{warningbox}
In early 2023, package \pkgname{dplyr} version 1.1.0 added support for per-operation grouping by adding a new parameter (\code{by} or \code{.by}) to several functions. This is still considered an experimental feature that may change. Anyway, it is important to keep in mind that this new approach to grouping is not persistent, unlike that implemented by \Rfunction{group\_by()} and described below. Depending on the circumstances, persistence can simplify the code but also create bugs when not taken into account.
\end{warningbox}
When using persistent grouping, the first step is to use \Rfunction{group\_by()} to ``tag'' a tibble with the grouping. We create a \emph{tibble} and then convert it into a \emph{grouped tibble}. Once we have a grouped tibble, function \Rfunction{summarise()} will recognise the grouping and use it when the summary values are calculated.
<<tibble-grouped-01>>=
tibble(numbers = 1:9, Letters = rep(letters[1:3], 3)) |>
group_by(Letters) |>
summarise(mean_num = mean(numbers),
median_num = median(numbers),
n = n()) |>
ungroup() # not always needed but safer
@%
\pagebreak
In the non-persistent grouping approach, we specify the grouping in the call to \Rfunction{summarise()} (this new feature is labelled as experimental in \pkgname{dplyr} version 1.1.3, and may change in future versions).
<<tibble-grouped-02>>=
tibble(numbers = 1:9, Letters = rep(letters[1:3], 3)) |>
summarise(.by = Letters,
mean_num = mean(numbers),
median_num = median(numbers),
n = n())
@
\begin{warningbox}
How is grouping implemented for data frames and tibbles?\index{grouping!implementation in tidyverse} The best way to find out is to explore how a grouped tibble differs from one that is not grouped.
Tibble \code{my.tb} is not grouped.
<<tibble-grouped-box-01>>=
my.tb <- tibble(numbers = 1:9, Letters = rep(letters[1:3], 3))
is.grouped_df(my.tb)
class(my.tb)
names(attributes(my.tb))
@
Tibble \code{my\_gr.tb} is grouped by variable, or column, \code{Letters}. In this case, as our tibble belongs to class \code{tibble\_df}, grouping adds \code{grouped\_df} as the most derived class.
<<tibble-grouped-box-02>>=
my_gr.tb <- group_by(.data = my.tb, Letters)
is.grouped_df(my_gr.tb)
class(my_gr.tb)
@
Grouping also adds several attributes with the grouping information in a format suitable for fast selection of group members.
<<tibble-grouped-box-02a>>=
names(attributes(my_gr.tb))
setdiff(attributes(my_gr.tb), attributes(my.tb))
@
A call to \Rfunction{ungroup()} removes the grouping, thereby restoring the original tibble.
<<tibble-grouped-box-03>>=
my_ugr.tb <- ungroup(my_gr.tb)
class(my_ugr.tb)
names(attributes(my_ugr.tb))
@
<<tibble-grouped-box-04>>=
all(my.tb == my_gr.tb)
all(my.tb == my_ugr.tb)
identical(my.tb, my_gr.tb)
identical(my.tb, my_ugr.tb)
@
The tests above show that the members are in all cases the same: operator \Roperator{==} tests for equality of the values at each position in the tibble, but not of the attributes. The attributes, including \code{class}, differ between normal tibbles and grouped ones, and so these are not \emph{identical} objects.
If we replace \code{tibble} by \code{data.frame} in the first statement and rerun the chunk, the result of the last statement in the chunk is \code{FALSE} instead of \code{TRUE}. At the time of writing, starting with a \code{data.frame} object, applying grouping with \Rfunction{group\_by()} followed by ungrouping with \Rfunction{ungroup()} has the side effect of converting the data frame into a tibble. This is something to be very much aware of, as there are differences in how the extraction operator \Roperator{[ , ]} behaves in the two cases. The safe way to write code making use of functions from \pkgname{dplyr} and \pkgname{tidyr} is to always make sure that subsequent code works correctly with tibbles in addition to data frames.
\end{warningbox}
\index{group-wise operations on data|)}
\subsection{Joins}
\index{joins between data sources|(}
\index{merging data from two tibbles|(}
Joins allow us to combine two data sources which share some variables. Variables in common are used to match the corresponding rows before ``joining'' variables (i.e., columns) from both sources together. There are several \emph{join} functions in \pkgname{dplyr}. They differ mainly in how they handle rows that do not have a match between data sources.
We create here some artificial data to demonstrate the use of these functions. We will create two small tibbles, with one column in common and one mismatched row in each.
<<tibble-print-10, echo=FALSE>>=
options(tibble.print_max = 6, tibble.print_min = 6)
@
<<joins-00>>=
first.tb <- tibble(idx = c(1:4, 5), values1 = "a")
second.tb <- tibble(idx = c(1:4, 6), values2 = "b")
@
Below, we apply the \emph{mutating join}\index{joins between data sources!mutating} functions exported by \pkgname{dplyr}: \Rfunction{full\_join()}, \Rfunction{left\_join()}, \Rfunction{right\_join()} and \Rfunction{inner\_join()}. These functions always retain all columns and, in the case of multiple matches, keep a row for each matching combination of rows. We repeat each of these examples with the arguments passed to \code{x} and \code{y} swapped to show the differences in the behaviour of these functions.
A full join retains all unmatched rows filling missing values with \code{NA}. By default, the match is done on columns with the same name in \code{x} and \code{y}, but this can be changed by passing an argument to parameter \code{by}. Using \code{by} one can base the match on columns that have different names in \code{x} and \code{y}, or prevent matching of columns with the same name in \code{x} and \code{y} (example at end of the section).
<<joins-01>>=
full_join(x = first.tb, y = second.tb)
@
<<joins-01a>>=
full_join(x = second.tb, y = first.tb)
@
Left and right joins retain rows not matched from only one of the two data sources, \code{x} and \code{y}, respectively.
<<joins-02>>=
left_join(x = first.tb, y = second.tb)
@
<<joins-02a>>=
left_join(x = second.tb, y = first.tb)
@
<<joins-03>>=
right_join(x = first.tb, y = second.tb)
@
<<joins-03a>>=
right_join(x = second.tb, y = first.tb)
@
An inner join discards rows in \code{x} that do not match rows in \code{y} and \emph{vice versa}.
<<joins-04>>=
inner_join(x = first.tb, y = second.tb)
@
<<joins-04a>>=
inner_join(x = second.tb, y = first.tb)
@
Next we apply the \emph{filtering join}\index{joins between data sources!filtering} functions exported by \pkgname{dplyr}: \Rfunction{semi\_join()} and \Rfunction{anti\_join()}. These functions return a tibble that contains only the columns from \code{x}, retaining rows based on their match to rows in \code{y}.
A semi join retains rows from \code{x} that have a match in \code{y}.
<<joins-05>>=
semi_join(x = first.tb, y = second.tb)
@
<<joins-05a>>=
semi_join(x = second.tb, y = first.tb)
@
An anti join retains rows from \code{x} that do not have a match in \code{y}.
<<joins-06>>=
anti_join(x = first.tb, y = second.tb)
@
<<joins-06a>>=
anti_join(x = second.tb, y = first.tb)
@
Here we rename column \code{idx} in \code{first.tb} to demonstrate the use of \code{by} to specify which columns should be searched for matches.
<<joins-01b>>=
first2.tb <- rename(first.tb, idx2 = idx)
full_join(x = first2.tb, y = second.tb, by = c("idx2" = "idx"))
@
\index{merging data from two tibbles|)}
\index{joins between data sources|)}
\index{data manipulation in the tidyverse|)}
<<tibble-print-11, echo=FALSE>>=
options(tibble.print_max = 3, tibble.print_min = 3)
@
\section{Times and Dates with \pkgname{lubridate}}\label{sec:data:datetime}
\index{time and dates|(}
In \Rlang and many other computing languages, time values are stored as numbers subject to special interpretation. In \Rlang, times are most frequently stored as objects of class \code{POSIXct} or \code{POSIXlt}. Package \pkgname{lubridate} makes working with dates and times in \Rlang much easier.
When\index{time and dates!universal time coordinates}\index{time and dates!local time}\index{time and dates!time zones} dealing with time values, it is first of all necessary to distinguish between universal time coordinates (UTC) and local time coordinates. An instant in time is an absolute value and can be unambiguously described using UTC. Local times are different representations of a given instant in time, using local time coordinates such as CET (Central European Time). The relationship between UTC and local times depends on country legislation, national borders, and in some cases, time zones within countries. In addition, many countries make use of a seasonal shift in the local time coordinates, the so-called ``summer time''. The dates on which these seasonal shifts are implemented depend on the country or region, and these dates have varied over time. Shifts in local time create gaps and overlaps: some local time values correspond to two different time instants, while the skipped ones do not exist and, when encountered, should be handled as errors.
\begin{explainbox}
Different systems are in use to describe time zones and the corresponding time coordinates. One commonly used system is based on three- or four-letter codes, e.g., EET for Eastern European Time. Another is based on the names of continents and cities, e.g., Europe/Helsinki. A third one in common use is simply expressed as an offset in hours, e.g., UTC+3. Most time zones are offset from UTC by whole hours, and a few by half hours. To some extent, which names are recognised depends on the operating system under which \Rpgrm is running. See \url{https://en.wikipedia.org/wiki/List_of_tz_database_time_zones} for a list.
\end{explainbox}
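Using only base \Rlang, the same instant can be displayed using different time-zone coordinates; a small sketch using two arbitrarily chosen tz database names:
<<tz-base-sketch>>=
instant <- as.POSIXct("2023-10-04 12:00:00", tz = "UTC")
# the same instant, displayed as local times
format(instant, tz = "Europe/Helsinki", usetz = TRUE) # EEST, UTC+3
format(instant, tz = "Asia/Kolkata", usetz = TRUE)    # IST, UTC+5:30
@
The second example shows one of the few time zones with a half-hour offset.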
Periodical adjustments introduced by leap years, and even leap seconds need to be taken into account when computing time durations between instants in time, even when using UTC. When carrying out arithmetic operations on dates and times, all these ``irregularities'' have to be accounted for. The functions and operators from package \pkgname{lubridate} implement the necessary corrections for current and historical times.
Times and dates written as text are formatted rather inconsistently, depending on the customs of different cultures and languages. Package \pkgname{lubridate} also provides functions implementing conversions between character strings and times or dates, and back. These \code{character}-to-time conversions are based on patterns and are, in general, reliable if the correct pattern is used. Package \pkgname{anytime} defines functions that can decode a broad range of formats, but relying on them can be risky, as not all possible formats are correctly decoded.
Objects of class \Rclass{POSIXlt}, the class used in \Rlang to store dates and times in a partly formatted form, do not necessarily contain time zone information. In many cases, when used in computations \Rclass{POSIXlt} values are interpreted based on the locale settings under which \Rlang is running, e.g., the time zone settings of the computer. Objects of class \Rclass{Date} do not keep track of the time zone, so do not represent instants in time traceable to UTC.
\begin{warningbox}
Whenever possible, it is best to store time data and also dates encoded using UTC as \Rclass{POSIXct} objects. This eliminates uncertainties that can cause otherwise major difficulties in computations.
\end{warningbox}
\Rclass{POSIXct} objects are of mode numeric, and thus vectors; because of this, they can be stored as columns in data frames and tibbles. Some statistical functions and even some model fitting functions accept them as input.
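That \Rclass{POSIXct} values are stored as numbers of seconds since the origin, 1970-01-01 00:00:00 UTC, can be seen by removing the class attribute. A brief sketch, assuming \pkgname{lubridate} is attached as in the examples in this section:
<<posixct-numeric-01>>=
tm <- ymd_hms("2023-10-04 12:00:00", tz = "UTC")
unclass(tm) # seconds since 1970-01-01 00:00:00 UTC
data.frame(when = tm, obs = 1) # usable as a data frame column
@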
The current date can be easily queried, and the returned value is fetched from the computer's clock.
<<lubridate-01>>=
this.day <- today()
class(this.day)
as.POSIXct(this.day, tz = "") # local time zone
@
Similarly, the current instant in time can be retrieved. Resolution is in the order of milliseconds.
<<lubridate-02>>=
this.instant <- now()
class(this.instant)
this.instant
@
Conversion from character strings to \Rclass{POSIXct} is straightforward as long as all character strings to be converted have the same or very similar formatting. A family of functions from \pkgname{lubridate} with names like \Rfunction{dmy\_hms()} can convert character strings into \Rclass{POSIXct} objects. These functions are vectorised and can convert a whole character vector in a single operation into a \Rclass{POSIXct} vector of the same length.
<<lubridate-03>>=
dmy_h("04/10/23 15", tz = "EET")
dmy_h("04/10/23 3pm", tz = "EET")
dmy_h("04/10/23 15 EET") # Wrong decoding!
@
\begin{warningbox}
Conversion functions with no time components return \Rclass{Date} objects if no argument is passed to \code{tz}, while \code{tz = ""}, as used below, signifies the local time zone.
<<lubridate-04>>=
class(ymd("2023-10-04"))
class(ymd("2023-10-04", tz = ""))
class(today(tzone = ""))
@
Conversions from \code{Date} into \code{POSIXct} can give very unexpected results! If you run the statement below, the returned value will be the time offset between your computer's local time zone and UTC!
<<lubridate-05>>=
as.POSIXct(ymd("2023-10-04"), tzone = "") - ymd("2023-10-04", tz = "")
@
The computation assumes that the value in the \code{Date} object corresponds to 00:00:00 UTC, so the offset between UTC and the local time zone at midnight is not taken into account. Forcing the time zone after conversion into \code{POSIXct} fixes the problem. Quirks like these make it imperative to do extensive checks when doing conversions involving times and/or dates.
<<lubridate-05a>>=
force_tz(as.POSIXct(ymd("2023-10-04")), tzone = "") - ymd("2023-10-04", tz = "")
@
\end{warningbox}
A difference between two instants in time returns a duration.
<<lubridate-06>>=
ymd_hms("2010-05-25 12:05:00") - ymd_hms("1810-05-25 12:00:00")
@
Functions with names in plural, like \Rfunction{years()} \ldots\ \Rfunction{seconds()}, are constructors of time spans (``periods'' in \pkgname{lubridate}'s terminology), which can be added to and subtracted from times.
<<lubridate-07>>=
ymd_hms("1810-05-25 12:00:00") + years(200) + minutes(5)
ymd_hms("2010-05-25 12:05:00") - ymd_hms("1810-05-25 12:00:00")
ymd("2023-01-01") + seconds(123)
@
Functions with names in singular, like \Rfunction{year()} \ldots\ \Rfunction{second()} are used to extract and set the implicit components of an instant in time.
<<lubridate-08>>=
my.time <- now()
my.time
year(my.time)
hour(my.time)
second(my.time)
second(my.time) <- 0
@
Special versions of methods \Rfunction{round()} and \Rfunction{trunc()} are available for times.
<<lubridate-09>>=
trunc(my.time, "days")
round(my.time, "hours")
@
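Package \pkgname{lubridate} also defines its own rounding functions, \Rfunction{floor\_date()}, \Rfunction{round\_date()} and \Rfunction{ceiling\_date()}, which accept additional units such as weeks. (These functions are part of \pkgname{lubridate}'s documented API, although not demonstrated above; this brief sketch continues using \code{my.time} from the previous chunk.)
<<lubridate-09a>>=
floor_date(my.time, "hours")
ceiling_date(my.time, "months")
@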
\begin{playground}
Working with time data frequently involves checking that the results of computations match our expectations. Sometimes the documentation is not enough and we need to explore with code examples how functions work. For example, take one date in February 2020 and one date in March 2020, and compute the duration between them. Then repeat the computation for year 2022 using the same dates. Which of these years was a leap year?
\end{playground}
\index{time and dates|)}
In the next chapter, I describe data visualisation with package \pkgnameNI{ggplot2}, frequently also considered part of the \pkgnameNI{tidyverse}.
\section{Further Reading}
An\index{further reading!new grammars of data} in-depth discussion of the \pkgname{tidyverse} is outside the scope of this book. Several books describe in detail the use of these packages. As several of them are under active development, recent editions of books such as \citebooktitle{Wickham2023a} \autocite{Wickham2023a} and \citebooktitle{Peng2022} \autocite{Peng2022} are the most useful.
<<echo=FALSE>>=
try(detach(package:lubridate))
try(detach(package:tidyr))
try(detach(package:dplyr))
try(detach(package:stringr))
try(detach(package:wrapr))
try(detach(package:magrittr))
try(detach(package:tibble))
try(detach(package:learnrbook))
@
<<eval=eval_diag, include=eval_diag, echo=eval_diag, cache=FALSE>>=
knitter_diag()
R_diag()
other_diag()
@