% !Rnw root = appendix.main.Rnw
<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_wide)
opts_knit$set(concordance=TRUE)
opts_knit$set(unnamed.chunk.label = 'plotting-chunk')
@
\chapter{\Rlang Extensions: Grammar of Graphics}\label{chap:R:plotting}
\begin{VF}
The commonality between science and art is in trying to see profoundly---to develop strategies of seeing and showing.
\VA{Edward Tufte's answer to Charlotte Thralls}{\emph{An Interview with Edward R. Tufte}, 2004}\nocite{Zachry2004}
\end{VF}
%\dictum[Edward Tufte]{The commonality between science and art is in trying to see profoundly---to develop strategies of seeing and showing.}
\index{geometries ('ggplot2')|see{grammar of graphics, geometries}}
%\index{geom@\texttt{geom}|see{grammar of graphics, geometries}}
%\index{functions!geom@\texttt{geom}|see{grammar of graphics, geometries}}
\index{statistics ('ggplot2')|see{grammar of graphics, statistics}}
%\index{stat@\texttt{stat}|see{grammar of graphics, statistics}}
%\index{functions!stat@\texttt{stat}|see{grammar of graphics, statistics}}
\index{scales ('ggplot2')|see{grammar of graphics, scales}}
%\index{scale@\texttt{scale}|see{grammar of graphics, scales}}
%\index{functions!scale@\texttt{scale}|see{grammar of graphics, scales}}
\index{coordinates ('ggplot2')|see{grammar of graphics, coordinates}}
\index{themes ('ggplot2')|see{grammar of graphics, themes}}
%\index{theme@\texttt{scale}|see{grammar of graphics, themes}}
%\index{function!theme@\texttt{scale}|see{grammar of graphics, themes}}
\index{facets ('ggplot2')|see{grammar of graphics, facets}}
\index{annotations ('ggplot2')|see{grammar of graphics, annotations}}
\index{aesthetics ('ggplot2')|see{grammar of graphics, aesthetics}}
\section{Aims of This Chapter}
Three main data plotting systems are available to \Rlang users: base \Rlang, package \pkgname{lattice} \autocite{Sarkar2008}, and package \pkgname{ggplot2} \autocite{Wickham2016}; the last one being the most recent and currently most popular system available in \Rlang for plotting data. Even two different sets of graphics primitives (i.e., those used to produce the simplest graphical elements such as lines and symbols) are available in \Rlang, those in base \Rlang and a newer one in the \pkgname{grid} package \autocite{Murrell2019}.
In this chapter, you will learn the concepts of the layered grammar of graphics, on which package \pkgname{ggplot2} is based. You will also learn how to build several types of data plots with package \pkgname{ggplot2}. As a consequence of the popularity and flexibility of \pkgname{ggplot2}, many contributed packages extending its functionality have been developed and deposited in public repositories. However, I will focus mainly on package \pkgname{ggplot2}, describing only briefly a few of these extensions.
\section{Packages Used in This Chapter}
<<eval=FALSE, include=FALSE>>=
citation(package = "ggplot2")
@
If the packages used in this chapter are not yet installed on your computer, you can install them as shown below, as long as package \pkgname{learnrbook} is already installed.
<<eval=FALSE>>=
install.packages(learnrbook::pkgs_ch_ggplot)
@%
\pagebreak
To run the examples included in this chapter, you need first to load some packages from the library (see section \ref{sec:script:packages} on page \pageref{sec:script:packages} for details on the use of packages).
<<message=FALSE>>=
library(learnrbook)
library(scales)
library(ggplot2)
library(ggrepel)
library(gginnards)
library(broom)
library(ggpmisc)
library(ggbeeswarm)
library(lubridate)
library(tibble)
library(dplyr)
library(patchwork)
@
<<echo=FALSE>>=
theme_set(theme_gray(14))
@
<<echo=FALSE>>=
# set to TRUE to test non-executed code chunks and rendering of plots
eval_plots_all <- FALSE
@
\section{The Components of a Plot}
I\index{data visualisation!concepts} start by briefly presenting concepts central to data visualisation, following the \citetitle{Koponen2019} \autocite{Koponen2019}. Plots are a medium used to convey information, like text. It is worthwhile keeping this in mind. As with text, the design of plots needs to consider what needs to be highlighted to convey the take-home message. The style of the plot should match the expectations and the plot-reading abilities of the expected audience. One needs to be careful to avoid ambiguities and, most importantly of all, not to misinform. Data visualisations, like text, need to be planned, revised, commented upon, and revised again until the best way of expressing our message is found. The flexibility of the grammar of graphics supports this approach to designing and producing high-quality data visualisations for different audiences very well.
Of course, when exploring data, fancy details of graphical design are irrelevant, but flexibility remains important as it makes it possible to look at data from many differing angles, highlighting different aspects of them. In the same way as boiler-plate text and text templates have specific but limited uses, all-in-one functions for producing plots do not support the design of original data visualisations well. They tend to get the job done, but lack the flexibility needed to do the best job of communicating information. As this is a book about languages, the focus of this chapter is on the layered grammar of graphics.
The plots described in this chapter are classified as \emph{statistical graphics}\index{statistical graphics} within the broader field of data visualisation. Plots such as scatter plots include points (geometric objects) that by their position, shape, colour, or some other property directly convey information. The location of these points in the plot ``canvas'' or ``plotting area'', given by the values of their $x$ and $y$ coordinates, describes properties of the data. Any deviation in the mapping of observations to coordinates is misleading, because it conveys false information to the audience.
A \emph{data label}\index{data visualisation!data labels} is connected to an observation but its position can be displaced as long as its link to the corresponding observation can be inferred, e.g., by the direction of an arrow or even simple proximity. Data labels provide ancillary information, such as the name of a gene or place.
\emph{Annotations}\index{data visualisation!annotations} are additions to a plot that have no connection to individual observations, but rather to all observations taken together, e.g., a text like $n = 200$ indicating the number of observations, usually included in a corner or margin of a plot free of observations.
Axis and tick labels, legends and keys make it possible for the reader to retrieve the original values represented in the plot as graphical elements. Other features of visualisations, even when not carrying additional information, affect the ease with which a plot can be read and its accessibility to readers with visual constraints such as colour blindness. These features include the size of text and symbols, thickness of lines, choice of font face, choice of colour palette, etc.
Because of the different lengths of time available for the audience to interact with visualisations, in general, plots designed to be included in books and journals are unsuitable for oral presentations, and vice versa. It is important to keep in mind the role played by plots in informing the audience, and what information can be expected to be of interest to different audiences and under different situations. The grammar of graphics and its extensions provide enough flexibility to tailor the design of plots to different uses and also to easily create variations of a given plot.
\section{The Grammar of Graphics}\label{sec:plot:intro}
\index{grammar of graphics!elements|(}
What separates \ggplot from base \Rlang and trellis/lattice plotting functions is the use of a layered grammar of graphics\index{grammar of graphics} (the reason behind `gg' in the name of package \pkgname{ggplot2}). What is meant by grammar in this case is that plots are assembled piece by piece using different ``nouns'' and ``verbs'' \autocite{Cleveland1985,Wickham2010}. Instead of using a single function with many arguments, plots are assembled by combining different elements with operators \code{+} and \verb|%+%|. Furthermore, the construction is mostly semantics-based and to a large extent, how plots look when printed, displayed, or exported to a bitmap or vector-graphics file is controlled by themes.
Plotting can be thought of as translating or mapping the observations or data into a graphical language. Properties of graphical (or geometrical) objects are used to represent different aspects of the data. An observation can consist of multiple recorded values. For instance, an observation of air temperature may be defined by a position in 3-dimensional space and a point in time, in addition to the temperature itself. An observation for the size and shape of a plant can consist of height, stem diameter, number of leaves, size of individual leaves, length of roots, fresh mass, dry mass, etc. For example, an effective way of studying and/or communicating the relationship between height and stem diameter in plants is to plot observations as points using cartesian coordinates\index{grammar of graphics!cartesian coordinates}, \emph{mapping} stem diameter to the $x$ axis and the height to the $y$ axis.
The grammar of graphics makes it possible to design plots by combining various elements in ways that are nearly orthogonal. In other words, the majority of the possible combinations of ``words'' yield valid plots as long as the rules of the grammar are respected. This flexibility makes \ggplot extremely powerful, as types of plots not considered when the \ggplot package was designed can be easily created.
\begin{warningbox}
When a ggplot is built, the whole plot and its components are created as \Rlang objects that can be saved in the workspace or written to a file as \Rlang objects. These objects encode a recipe for constructing the plot, not its final graphical representation. The graphical representation is generated when the object is printed, explicitly or not. Thus, the same \code{"gg"} plot object can be rendered into different bitmap and vector graphic formats for display and/or printing.
\end{warningbox}
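As a minimal sketch of this point, the chunk below stores a plot recipe in a variable and only afterwards renders it, first on the current graphics device and then into a bitmap (PNG) and a vector-graphics (PDF) file with \code{ggsave()}. The object name \code{p.recipe}, the temporary files, and the chosen dimensions are assumptions made only for this illustration.

<<>>=
library(ggplot2)
# build the recipe; nothing is drawn at this point
p.recipe <- ggplot(data = mtcars,
                   mapping = aes(x = disp, y = mpg)) +
  geom_point()
# render on the current graphics device
print(p.recipe)
# render the same object into a bitmap and into a vector-graphics file
png.file <- tempfile(fileext = ".png")
pdf.file <- tempfile(fileext = ".pdf")
ggsave(png.file, plot = p.recipe, width = 12, height = 8, units = "cm")
ggsave(pdf.file, plot = p.recipe, width = 12, height = 8, units = "cm")
@

The object stored in \code{p.recipe} is not modified by any of these renderings.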
The transformation of a set of data or observations into a rendered graphic with package \pkgname{ggplot2} can be represented as a flow of information, but also as a sequence of actions. What prevents this flexibility from becoming a burden on users is that in most cases adequate defaults are used when the user does not provide explicit ``instructions''. The recipe to build a plot needs to specify a) the data to use, b) which variable to map to which graphical property (or aesthetic), c) which layers to add and which geometric representation to use, d) the scales that establish the link between data values and aesthetic values, e) a coordinate system (affecting only aesthetics $x$, $y$ and possibly $z$), and f) a theme to use. The result from constructing a plot object using the grammar of graphics is an \Rlang object containing a ``recipe for a plot'', including the data, which behaves similarly to other \Rlang objects.
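The components a) to f) of the recipe can be made explicit in code. The sketch below spells out each of them using what, to my understanding, are the defaults that \pkgname{ggplot2} supplies for these data and mapping, and compares the result with the succinct version relying on defaults. The names \code{p.explicit} and \code{p.default} are used only for this illustration.

<<>>=
library(ggplot2)
# every component of the recipe stated explicitly
p.explicit <-
  ggplot(data = mtcars,                        # a) data
         mapping = aes(x = disp, y = mpg)) +   # b) mapping to aesthetics
  geom_point() +                               # c) layer with its geometry
  scale_x_continuous() +                       # d) scales
  scale_y_continuous() +
  coord_cartesian() +                          # e) coordinate system
  theme_gray()                                 # f) theme
# the same data, mapping, and layer, relying on defaults for the rest
p.default <-
  ggplot(data = mtcars,
         mapping = aes(x = disp, y = mpg)) +
  geom_point()
p.explicit
@

Apart from the theme, which follows the session default when not stated, both objects are expected to render alike.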
\subsection{The words of the grammar}
Before building a plot step by step, I introduce the different components of a ggplot recipe, or the words of the grammar of graphics.
\paragraph{Data}
The\index{grammar of graphics!data} data to be plotted must be available as a \code{data.frame} or \code{tibble}, with data stored so that each row represents a single observation event, and the columns are different values observed in that single event. In other words, the data must be in long form (so-called ``tidy data'') as described in chapter \ref{chap:R:data}. The variables to be plotted can be \code{numeric}, \code{factor}, \code{character}, and time or date stored as \code{POSIXct}. (Some extensions to \pkgname{ggplot2} add support for other types of data such as time series.)
\paragraph{Mapping}
When\index{grammar of graphics!mapping of data} constructing a plot, data variables have to be mapped to aesthetics\index{plots!aesthetics} (or graphic properties). Most plots will have an $x$ dimension, which is one of the \emph{aesthetics}, and a variable containing numbers (or categories) mapped to it. The position on a 2D plot of, say, a point, will be determined by $x$ and $y$ aesthetics, while in a 3D plot, three aesthetics need to be mapped: $x$, $y$, and $z$. Many aesthetics are not related to coordinates; they are properties, like colour, size, shape, line type, or even rotation angle, which add an additional dimension on which to represent the values of variables and/or constants.
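A minimal sketch of such mappings, assuming the \Rdata{mtcars} data set included in \Rlang: two variables are mapped to the position aesthetics $x$ and $y$, and two more to the \code{colour} and \code{shape} aesthetics. The choice of variables and the name \code{p.map} are only illustrative.

<<>>=
library(ggplot2)
p.map <-
  ggplot(data = mtcars,
         mapping = aes(x = disp,              # position along x
                       y = mpg,               # position along y
                       colour = hp,           # continuous variable
                       shape = factor(cyl))) +  # categorical variable
  geom_point()
p.map
@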
\paragraph{Statistics}
Statistics\index{grammar of graphics!statistics} are ``words'' that represent calculation of summaries or some other operation on the values in the data. When \emph{statistics} are used for a computation, the returned value is passed to a \emph{geometry}, and consequently adding a \emph{statistic} also adds a layer to the plot. For example, \ggstat{stat\_smooth()} fits a smoother, and \ggstat{stat\_summary()} applies a summary function such as \code{mean()}. Most statistics are applied by group when data have been grouped by mapping additional aesthetics such as colour to a factor.
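As a hedged illustration, the chunk below uses \ggstat{stat\_summary()} to compute the mean of \code{mpg} for each value of \code{cyl} in \Rdata{mtcars} and passes the result to a point geometry. The variables, the size of the points, and the name \code{p.stat.mean} are assumptions chosen for this sketch.

<<>>=
library(ggplot2)
p.stat.mean <-
  ggplot(data = mtcars,
         mapping = aes(x = factor(cyl), y = mpg)) +
  # one summary value (the mean) is computed for each value of x
  stat_summary(fun = mean, geom = "point", size = 3)
p.stat.mean
@

Only three points are plotted, one for each of the three numbers of cylinders present in the data, instead of one per observation.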
\paragraph{Geometries}
\sloppy%
Geometries\index{grammar of graphics!geometries} are ``words'' that describe the graphics representation of the data: for example, \gggeom{geom\_point()} plots a point or symbol for each observation or summary value, while \gggeom{geom\_line()} draws line segments between observations. Some geometries rely by default on statistics, but most ``geoms'' default to the identity statistic. Each time a \emph{geometry} is used to add a graphical representation of data to a plot, one says that a new \emph{layer} has been added. The\index{plots!layers} grammar of graphics allows plots to contain multiple layers. The name \emph{layer} reflects the fact that each new layer added is plotted on top of the layers already present in the plot, or rather when a plot is printed the layers will be generated in the order they were added to the plot object. For example, one layer in a plot can display the observations, another layer a regression line fitted to them, and a third one may contain annotations such as an equation or a text label.
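The three-layer example just described can be sketched as follows, assuming the \Rdata{mtcars} data: observations as points, a fitted linear regression line, and an annotation giving the number of observations. The position of the text and the name \code{p.layers} are arbitrary choices made for this sketch.

<<>>=
library(ggplot2)
p.layers <-
  ggplot(data = mtcars,
         mapping = aes(x = disp, y = mpg)) +
  geom_point() +                                 # layer 1: observations
  stat_smooth(method = "lm", formula = y ~ x) +  # layer 2: fitted line
  annotate(geom = "text", x = 400, y = 30,       # layer 3: annotation
           label = paste("n =", nrow(mtcars)))
p.layers
@

The layers are drawn in the order in which they were added, so the regression line and its confidence band are drawn over the points.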
\paragraph{Positions}
Positions\index{grammar of graphics!positions} are ``words'' that determine the displacement or not of graphical plot elements relative to their original $x$ and $y$ coordinates. They are one of the arguments accepted by \emph{geometries}. Position \ggposition{position\_identity()} introduces no displacement, and for example, \ggposition{position\_stack()} makes it possible to create stacked bar plots and stacked area plots. Positions will be discussed together with geometries as they are always subordinate to them.
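A small sketch of the effect of positions, assuming bars that count the observations in \Rdata{mtcars} by number of cylinders and type of transmission. \gggeom{geom\_bar()} and the semitransparent fill are used here only to make the displacement visible; the object names are hypothetical.

<<>>=
library(ggplot2)
p.bars <-
  ggplot(data = mtcars,
         mapping = aes(x = factor(cyl), fill = factor(am)))
# bars for the two groups are stacked on top of each other
p.stack <- p.bars + geom_bar(position = position_stack())
p.stack
# no displacement: bars overlap, transparency reveals the occluded ones
p.identity <- p.bars +
  geom_bar(position = position_identity(), alpha = 0.5)
p.identity
@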
\paragraph{Scales}
Scales\index{grammar of graphics!scales} give the ``translation'' or mapping between data values and the aesthetic values to be actually plotted. Mapping a variable to the ``colour'' aesthetic (also recognised when spelled as ``color'') only tells that different values stored in the mapped variable will be represented by different colours. A scale, such as \ggscale{scale\_colour\_continuous()}, will determine which colour in the plot corresponds to which value in the variable. Scales can also define transformations on the data, which are used when mapping data values to aesthetic values. All continuous scales support transformations---e.g., in the case of $x$ and $y$ aesthetics, positions on the plotting region or graphic viewport will be affected by the transformation, while the original values are used for tick labels along the axes or in keys for shapes, colours, etc. Scales are used for all aesthetics, including continuous variables, such as numbers, and categorical ones such as factors. The grammar of graphics allows only one scale per \emph{aesthetic} and plot. This restriction is imposed by design to avoid ambiguity (e.g., it ensures that the red colour will have the same ``meaning'' in all plot layers where the \code{colour} \emph{aesthetic} is mapped to data). Scales have limits that are set automatically unless supplied explicitly.
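The sketch below, which assumes the \Rdata{mtcars} data, adds two scales explicitly: \ggscale{scale\_x\_log10()} applies a logarithmic transformation to the values mapped to $x$, and \ggscale{scale\_colour\_continuous()} controls how values of \code{hp} are translated into colours. The name given to the colour key and the object name \code{p.scales} are my own choices.

<<>>=
library(ggplot2)
p.scales <-
  ggplot(data = mtcars,
         mapping = aes(x = disp, y = mpg, colour = hp)) +
  geom_point() +
  # positions follow log10(disp); tick labels show the original values
  scale_x_log10() +
  # translation of values of hp into colours, with a name for the key
  scale_colour_continuous(name = "Gross horsepower")
p.scales
@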
\paragraph{Coordinate systems}
The\index{grammar of graphics!coordinates} most frequently used coordinate system when plotting data, the cartesian system, is the default for most \emph{geometries}. In the cartesian system, $x$ and $y$ are represented as distances on two orthogonal (at 90$^\circ$) axes. Additional coordinate systems are available in \pkgname{ggplot2} and through extensions. For example, in the polar system of coordinates, the $x$ values are mapped to angles around a central point and $y$ values to the radius. Setting limits to a coordinate system changes the region of the plotting space visible in the plot, but does not discard observations. In other words, when using \emph{statistics}, observations that are not visible in the rendered plot are still included in computations if they were excluded by coordinate limits, but are ignored if they were excluded by scale limits.
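The difference between coordinate limits and scale limits can be sketched with a fitted regression line, assuming the \Rdata{mtcars} data and arbitrary limits of 100 and 300 for $x$. The object names are used only for this illustration.

<<>>=
library(ggplot2)
p.base <-
  ggplot(data = mtcars,
         mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x)
# coordinate limits: zooming in, all observations are used in the fit
p.zoom <- p.base + coord_cartesian(xlim = c(100, 300))
p.zoom
# scale limits: observations outside the limits are discarded before the fit
p.cut <- p.base + scale_x_continuous(limits = c(100, 300))
p.cut
@

The second plot is expected to trigger warnings reporting the removed observations, and its fitted line differs from that in the first plot because it is based on fewer observations.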
\paragraph{Themes}
How\index{grammar of graphics!themes} the plots look when displayed or printed can be altered by means of themes. A plot can be saved without adding a theme and then printed or displayed using different themes. Also, individual theme elements can be changed, and whole new themes defined. This adds a lot of flexibility and helps in the separation of the data representation aspects from those related to the graphical design.
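As a brief sketch, the same saved plot can be displayed with different themes, and individual theme elements can be altered. \code{theme\_bw()} and \code{theme\_classic()} are themes included in \pkgname{ggplot2}; the bold axis titles and the object names are arbitrary choices for this illustration.

<<>>=
library(ggplot2)
# plot saved without adding a theme
p.no.theme <-
  ggplot(data = mtcars,
         mapping = aes(x = disp, y = mpg)) +
  geom_point()
# the same recipe displayed using two different themes
p.bw <- p.no.theme + theme_bw()
p.bw
# a complete theme plus a change to an individual theme element
p.classic <- p.no.theme +
  theme_classic() +
  theme(axis.title = element_text(face = "bold"))
p.classic
@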
\paragraph{Operators}
The\index{grammar of graphics!operators} elements described above are assembled into a ggplot object using operator \Roperator{+} and exceptionally using \Roperator{\%+\%}. The choice of these operators makes sense, as ggplot objects are built by sequentially adding members or elements to them.
\index{grammar of graphics!elements|)}
\begin{warningbox}
The functions corresponding to the different elements of the grammar of graphics have distinctive names with the first few letters hinting at their roles: aesthetics mappings (\code{aes}), geometric elements (\code{geom\_\ldots}), statistics (\code{stat\_\ldots}), scales (\code{scale\_\ldots}), coordinate systems (\code{coord\_\ldots}), and themes (\code{theme\_\ldots}).
\end{warningbox}
\subsection{The workings of the grammar}\label{sec:plot:workings}
\index{grammar of graphics!plot structure|(}
\index{grammar of graphics!plot workings|(}
A \code{"gg"} plot object is an \Rlang object of mode \code{"list"} containing the recipe and data to construct a plot. It is self-contained in the sense that the only requirement for rendering it into a graphical representation is the availability of package \pkgname{ggplot2}. A \code{"gg"} object contains the data in one or more data frames and instructions encoded as functions and parameters, but not yet a rendering of the plot into graphical objects. Both data transformations and rendering of the plot into drawing instructions (encoded as graphical objects or \emph{grobs}) take place at the time of printing or exporting the plot, e.g., when saving a bitmap to a file.
To understand ggplots, one should first think in terms of the graphical organisation of the plot: there is always a background layer onto which other layers composed of different graphical objects are laid. Each layer contains related graphical objects originating from the same data. The last layer added is the topmost and the first one added the lowermost. Graphical objects in upper layers occlude those in the layers below them if their locations overlap. Although frequently layers in a ggplot share the same data and the same mappings to aesthetics, this is not a requirement. It is possible to build ggplots with independent layers, although always with shared scales and plotting area.
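A minimal sketch of a plot with independent layers, assuming the \Rdata{mtcars} data: the first layer uses the individual observations and the second a separate data frame of means per number of cylinders, computed here with \code{aggregate()}. The data frame \code{cyl.means} and the object \code{p.indep} are hypothetical names introduced for this example.

<<>>=
library(ggplot2)
# a second data frame, with one row per number of cylinders
cyl.means <- aggregate(cbind(disp, mpg) ~ cyl, data = mtcars, FUN = mean)
p.indep <-
  ggplot() +
  # lowermost layer: individual observations
  geom_point(data = mtcars,
             mapping = aes(x = disp, y = mpg)) +
  # topmost layer: means, from different data, occluding points below
  geom_point(data = cyl.means,
             mapping = aes(x = disp, y = mpg),
             colour = "red", size = 4, shape = 15)
p.indep
@

Although the two layers do not share data or mappings, they share the $x$ and $y$ scales and the plotting area.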
%%% Drawing of a plot with layers
A second perspective on ggplots is that of the process of converting the data into a graphical representation that can be printed on paper or viewed on a computer screen. The transformations applied to the data to achieve this can be thought of as a data flow process divided into stages. The diagram in Figure \ref{fig:ggplot:stages} represents a single self-contained layer in a plot. The data supplied by the user are transformed in stages into instructions to draw a graphical representation. In \pkgname{ggplot2} and its documentation, graphical features are called \emph{aesthetics}, with the correspondence between values in the data and values of the aesthetic controlled by \emph{scales}. The values in the data are summarised by \emph{statistics}. However, when no summaries are needed, layers make use of \Rfunction{stat\_identity()}, which copies its input to its output unchanged.
\emph{Geometries} provide the ``recipe'' used to generate graphical objects from the mapped data.
\begin{figure}
{\sffamily
\centering
\resizebox{\linewidth}{!}{%
\begin{tikzpicture}[auto]
\node [b] (data) {layer\\ data};
\node [cc, right = of data] (mapping1) {\textbf{start}};
\node [b, right = of mapping1] (statistic) {statistic};
\node [cc, right = of statistic] (mapping2) {\textbf{after\\ stat}};
\node [b, right = of mapping2] (geometry) {geometry + scale};
\node [cc, right = of geometry] (mapping3) {\textbf{after\\ scale}};
\node [b, right = of mapping3] (render) {layer\\ grobs};
\path [ll] (mapping1) -- (data) node[near end,above]{a};
\path [ll] (statistic) -- (mapping1) node[near end,above]{b};
\path [ll] (mapping2) -- (statistic) node[near end,above]{c};
\path [ll] (geometry) -- (mapping2) node[near end,above]{d};
\path [ll] (mapping3) -- (geometry) node[near end,above]{e};
\path [ll] (render) -- (mapping3) node[near end,above]{f};
\end{tikzpicture}}}
\caption[Stages of data flow in a ggplot layer]{Abstract diagram of data transformations in a ggplot layer showing the stages at which mappings between variables and graphic aesthetics take place.}\label{fig:ggplot:stages}
\end{figure}
Function \code{aes()} is used to define mappings to aesthetics. The default for \Rfunction{aes()} is for the mapping to take place at the \textbf{start} (leftmost circle in the diagram above), mapping names in the user data to aesthetics such as x, y, colour, and shape. The statistic can alter the mapped data, but in most cases not which aesthetics they are mapped to. Statistics can add default mappings for additional aesthetics. In addition, the default mappings of the data returned by the statistic can be modified by user code at this later stage, \textbf{after stat}. Default mappings can be modified again at the \textbf{after scale} stage.
\begin{explainbox}
Statistics always return a mapping to the same aesthetics that they require as input. However, the values mapped to these aesthetics at the \textbf{after stat} stage are in most cases different from those at \textbf{start}. Many statistics return additional variables, which are not mapped by default to any aesthetic. These variables facilitate variations on how results from a given type of data summary are added to plots, including the use of a geometry different from the default set by the statistic. In this case, the user has to override default mappings at the \textbf{after stat} stage. The additional variables returned by statistics are listed in their documentation. (See section \ref{sec:plot:mappings} on page \pageref{sec:plot:mappings} for details.)
\end{explainbox}
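The later mapping stages can be sketched with functions \code{after\_stat()} and \code{after\_scale()} from \pkgname{ggplot2}. In the first plot below, the variable \code{density} returned by the statistic is mapped to $y$ in place of the default; in the second one, the fill is derived from the colour after the scale has translated data values into colours. The data, the number of bins, the transparency, and the object names are assumptions made for this illustration.

<<>>=
library(ggplot2)
# after stat: map the computed variable density to the y aesthetic
p.after.stat <-
  ggplot(data = mtcars, mapping = aes(x = mpg)) +
  geom_histogram(mapping = aes(y = after_stat(density)), bins = 10)
p.after.stat
# after scale: derive fill from the colour assigned by the colour scale
p.after.scale <-
  ggplot(data = mtcars,
         mapping = aes(x = factor(cyl), y = mpg,
                       colour = factor(cyl))) +
  geom_boxplot(mapping = aes(fill = after_scale(alpha(colour, 0.3))))
p.after.scale
@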
\begin{warningbox}
As mentioned above, all ggplot layers include a statistic and a geometry. From the perspective of the construction of a plot using the grammar, both \code{stats} and \code{geoms} are layer constructor functions. While \code{stats} take a \code{geom} as one of their arguments, \code{geoms} take a \code{stat} as one of their arguments. Thus, in both cases, a \code{stat} and a \code{geom} are added as a layer, and their role and position in the data flow remain the same, i.e., the diagram in Figure \ref{fig:ggplot:stages} applies independently of how the layers are added to the plot. The default statistic of many geometries is \ggstat{stat\_identity()} making their behaviour when added to a plot as if the layer they create contained no statistics.
\end{warningbox}
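As a sketch of this equivalence, the two plots below are built with the same statistic and the same geometry, in one case adding the layer with a \code{stat} function and in the other with a \code{geom} function. The data, the summary function, and the object names are only illustrative.

<<>>=
library(ggplot2)
p.means <-
  ggplot(data = mtcars,
         mapping = aes(x = factor(cyl), y = mpg))
# layer added with a stat, with the geometry passed as an argument
p.from.stat <- p.means + stat_summary(fun = mean, geom = "point")
# layer added with a geom, with the statistic passed as an argument
p.from.geom <- p.means + geom_point(stat = "summary", fun = mean)
p.from.stat
@

Both objects are expected to render as the same plot, as in both cases the data flow through the same statistic and geometry.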
There are some statistics in \pkgname{ggplot2} that have companion geometries that can be used (almost) interchangeably. This tends to lead to confusion, and in this book, only geometries that have \ggstat{stat\_identity()} as their default are described as geometries in section \ref{sec:plot:geometries}. In the case of those that by default use other statistics, like \gggeom{geom\_smooth()}, only the companion statistic, \ggstat{stat\_smooth()} for this example, is described in section \ref{sec:plot:statistics}.
A ggplot can have a single layer or many layers, but when ggplots have more than one layer, the data flow, computations, and generation of graphical objects take place independently for each layer. As mentioned above, most ggplots do not have fully independent layers, but the layers share the same data and aesthetic mappings at the \textbf{start}. Beyond this point, computations in layers are always independent of those in other layers, except that for a given aesthetic only one scale is allowed per plot.
\index{grammar of graphics!plot workings|)}
\index{grammar of graphics!plot structure|)}
\subsection{Plot construction}
\index{grammar of graphics!plot construction|(}
As the use of the grammar is easier to demonstrate by example than to explain with words, I will show how to build plots of increasing complexity, starting from the simplest possible. All elements of a plot have defaults, although in some cases these defaults result in empty plots. Defaults make it possible to create a plot very succinctly. When building a plot step by step, the different viewpoints described in the previous section are relevant: the static structure of the plot's \Rlang object, the final graphic output, and the transformations that the data undergo ``in transit'' from the recipe stored in an object to the graphic output. In this section, I emphasise the syntax of the grammar and how it translates into a plot.
Function \code{ggplot()} by default constructs an empty plot. This is similar to how \code{character()}, \code{numeric()}, etc. construct empty vectors. This empty skeleton of a plot when printed is displayed as a grey rectangle.
<<ggplot-basics-01>>=
ggplot()
@
A data frame passed as an argument to \code{data} without adding a mapping results in the same empty grey rectangle (not shown). Data frame \Rdata{mtcars} is a data set included in \Rlang (to read a description, type \code{help("mtcars")} at the \Rlang command prompt).
<<ggplot-basics-02, eval=eval_plots_all>>=
ggplot(data = mtcars)
@
Once the data are available, a graphical or geometric representation needs to be selected. The geometry used, such as \code{geom\_point()} and \code{geom\_line()}, drawing separate points for the observations or connecting them with lines, respectively, defines the type of plot. A mapping defines which property of the geometric elements will be used to represent the values from a variable in the user's data. Most geometries require mappings to both $x$ and $y$ aesthetics, as they establish the position of the geometrical shapes like points or lines in the plotting area. Additional aesthetics like colour make use of default scales and palettes. These defaults can be overridden with \code{scale} functions added to the plot (see section \ref{sec:plot:scales}).
Mapping at the \textbf{start} stage, \code{disp} to $x$ and \code{mpg} to $y$ aesthetics, makes the ranges of the values available. They are used to find default limits for the $x$ and $y$ scales as reflected in the plot axes. The plotting area $x$ and $y$ now match the ranges of the mapped variables, expanded by a small margin. The axis labels also reflect the names of the mapped variables. However, no graphical elements are yet displayed for the individual observations.% ({\small\textsf{data $\to$ aes $\to$ \emph{ggplot object}}})
<<ggplot-basics-03>>=
ggplot(data = mtcars,
mapping = aes(x = disp, y = mpg))
@
Observations are made visible by the addition of a suitable \emph{geometry} or \code{geom} to the plot recipe. Below, adding \gggeom{geom\_point()} makes the observations visible as points or symbols. %({\small\textsf{data $\to$ aes $\to$ geom $\to$ \emph{ggplot object}}})
<<ggplot-basics-04>>=
ggplot(data = mtcars,
mapping = aes(x = disp, y = mpg)) +
geom_point()
@
\begin{warningbox}
In the examples above, the plots were printed automatically, which is the default at the \Rlang console. However, as with other \Rlang objects, ggplots can be assigned to a variable.
<<ggplot-basics-04-wb1>>=
p1 <- ggplot(data = mtcars,
mapping = aes(x = disp, y = mpg)) +
geom_point()
@
\noindent
and printed at a later time, and saved to and read from files on disk.
<<ggplot-basics-04-wb2, eval=eval_plots_all>>=
print(p1)
@
Layers and other elements can also be added to a saved ggplot, as the saved objects are not the graphical representation of the plots themselves but instead a \emph{recipe} plus the data needed to build them.
\end{warningbox}
\begin{advplayground}
As for any \Rlang object, \code{str()} displays the structure of \code{"gg"} objects. In addition, package \pkgname{ggplot2} provides a \code{summary()} method for \code{"gg"} plot objects.
As you make progress through the chapter, use these methods to explore the \code{"gg"} plot objects you construct, paying attention to layers, and global vs.\ layer-specific data and mappings. You will learn how the plot components are stored as members of \code{"gg"} plot objects.
\end{advplayground}
Although \emph{aesthetics} are usually mapped to variables in the data, constant aesthetic values can be passed as arguments to layer functions, consistently controlling a property of all elements in a layer. Variables in \code{data} can be mapped using \code{aes()} either as whole-plot defaults, as shown above, or within individual layers. In contrast, constant values for aesthetics have to be set, as shown here, as named arguments passed directly to layer functions, instead of to a call to \code{aes()}.
<<ggplot-basics-04a>>=
ggplot(data = mtcars,
mapping = aes(x = disp, y = mpg)) +
geom_point(colour = "blue", shape = "square")
@
\begin{warningbox}
Mapping an aesthetic to a constant value within a call to \Rfunction{aes()} adds a column containing this value to the data frame received as input by the \emph{statistic}. This value is not interpreted as an aesthetic value but instead as a data value. The code below is that for the plot above, but with the constants passed through a call to \Rfunction{aes()}.
<<ggplot-basics-04b>>=
ggplot(data = mtcars,
mapping = aes(x = disp, y = mpg)) +
geom_point(mapping = aes(colour = "blue", shape = "square"))
@
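The plot contains red circles instead of blue squares! The constants \code{"blue"} and \code{"square"} are treated as data, i.e., as the single level of a discrete variable, and the default discrete scales map this level to their first colour and first shape.
In principle, one could correct this plot by adding suitable \code{scales} but this would still be wasteful, as many copies of the constant \code{"blue"} would be unnecessarily stored in the \code{"gg"} plot object.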
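As a hedged illustration of such a correction (a sketch, not a recommended idiom), identity scales can be added so that the constants are used unchanged as aesthetic values. Functions \code{scale\_colour\_identity()} and \code{scale\_shape\_identity()} are exported by \pkgname{ggplot2}, while the name \code{p.ident} and the chunk label are only illustrative.
<<sketch-identity-scales>>=
library(ggplot2)
# constants mapped within aes() are treated as data values;
# identity scales pass them through unchanged as aesthetic values
p.ident <-
  ggplot(data = mtcars,
         mapping = aes(x = disp, y = mpg)) +
  geom_point(mapping = aes(colour = "blue", shape = "square")) +
  scale_colour_identity() +
  scale_shape_identity()
p.ident
# the colour used for drawing is now the constant itself
unique(layer_data(p.ident)$colour)
@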
\end{warningbox}
While a geometry directly constructs during rendering a graphical representation of the observations or summaries in the data it receives as input, a \emph{statistic} or \code{stat} ``sits'' in-between the data and a \code{geom}, applying some computation, usually but not always, to produce a statistical summary of the data. Here \code{stat\_smooth()} fits a linear regression (see section \ref{sec:stat:LM:regression} on page \pageref{sec:stat:LM:regression}) and passes the resulting predicted values to \gggeom{geom\_line()}. Passing \code{method = "lm"} selects \code{lm()} as the model fitting function. Passing \code{formula = y ~ x} sets the model to be fitted. This plot has two layers, one drawn by geometry \gggeom{geom\_point()} and one by \gggeom{geom\_line()}.%({\small\textsf{data $\to$ aes $\to$ stat $\to$ geom $\to$ \emph{ggplot object}}})
<<ggplot-basics-05>>=
ggplot(data = mtcars,
mapping = aes(x = disp, y = mpg)) +
geom_point() +
stat_smooth(geom = "line", method = "lm", formula = y ~ x)
@
The plots above relied on defaults for \emph{scales}, \emph{coordinates} and \emph{themes}. In the examples below, the defaults are overridden by arguments that produce differently rendered plots. Adding \ggscale{scale\_y\_log10()} applies a logarithmic transformation to the values mapped to $y$. This works like plotting using graph paper with rulings spaced according to a logarithmic scale. Tick marks continue to be expressed in the original units, but statistics are applied to the transformed data. In other words, the transformation specified in the scale affects the values in advance of the \textbf{start} stage, before they are mapped to aesthetics and passed to \emph{statistics}. Thus, in this example, the linear regression is fitted to \code{log10()} transformed $y$ values and the original $x$ values.%({\small\textsf{data $\to$ aes $\to$ stat $\to$ geom $\to$ scale $\to$ \emph{ggplot object}}})
<<ggplot-basics-06>>=
ggplot(data = mtcars,
mapping = aes(x = disp, y = mpg)) +
geom_point() +
stat_smooth(geom = "line", method = "lm", formula = y ~ x) +
scale_y_log10()
@
The range limits of a scale can be set manually, instead of automatically as by default. These limits create a virtual \emph{window into the data}: out-of-bounds (oob) observations, those outside the scale limits, remain hidden and are not mapped to aesthetics, i.e., these observations are neither included in the graphical representation nor used in calculations. Crucially, when using \emph{statistics} the computations are only applied to observations that fall within the limits of all scales in use. These limits \emph{indirectly} affect the plotting area when the plotting area is automatically set based on the range of the (within limits) data. Even the mapping to values of a different aesthetic may change when a subset of the data is selected by manually setting the limits of a scale.
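The effect of manually set scale limits can be checked with a minimal sketch like the one below. It assumes the default out-of-bounds handling of continuous scales, which replaces out-of-bounds values by \code{NA}. The object names \code{p.lim} and \code{n.within} are illustrative, and \code{na.rm = TRUE} is passed only to silence the warnings about removed observations.
<<sketch-scale-limits>>=
library(ggplot2)
p.lim <-
  ggplot(data = mtcars,
         mapping = aes(x = disp, y = mpg)) +
  geom_point(na.rm = TRUE) +
  stat_smooth(geom = "line", method = "lm", formula = y ~ x,
              na.rm = TRUE) +
  scale_y_continuous(limits = c(15, 25))
p.lim
# number of observations remaining within the y-scale limits
n.within <- sum(!is.na(layer_data(p.lim, 1)$y))
n.within
# number of observations in the data
nrow(mtcars)
@
Because the regression is fitted only to the observations counted in \code{n.within}, the fitted line is expected to differ from the line fitted to all observations.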
In contrast to \emph{scale limits}, \emph{coordinates}\index{grammar of graphics!cartesian coordinates} function as a \emph{zoomed view} into the plotting area, and do not affect which observations are visible to \emph{statistics}. The coordinate system is also determined by this grammar element. Below, Cartesian coordinates, which are the default, are added explicitly so that $y$ limits overriding the default ones can be set. %({\small\textsf{data $\to$ aes $\to$ stat $\to$ geom $\to$ coordinate $\to$ theme $\to$ \emph{ggplot object}}})
<<ggplot-basics-07>>=
ggplot(data = mtcars,
mapping = aes(x = disp, y = mpg)) +
geom_point() +
stat_smooth(geom = "line", method = "lm", formula = y ~ x) +
coord_cartesian(ylim = c(15, 25))
@
The next example uses a coordinate system transformation. When the transformation is applied to the coordinate system, it affects only the plotting---it sits between the \code{geom} and the rendering of the plot. The transformation is applied to the values that were returned by \emph{statistics}. The straight line fitted is plotted on the transformed coordinates as a curve, because the model was fitted to the untransformed data obtaining untransformed predicted values. The coordinate transformation is applied to these predicted values and plotted. (Other coordinate systems are described in sections \ref{sec:plot:sf} and \ref{sec:plot:circular} on pages \pageref{sec:plot:sf} and \pageref{sec:plot:circular}, respectively.)
<<ggplot-basics-08>>=
ggplot(data = mtcars,
mapping = aes(x = disp, y = mpg)) +
geom_point() +
stat_smooth(geom = "line", method = "lm", formula = y ~ x) +
coord_trans(y = "log10")
@
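The difference between a scale transformation and a coordinate transformation can be verified by inspecting the data that reach the layer. This is only a sketch: it relies on \code{layer\_data()} from \pkgname{ggplot2}, and the names \code{p.base}, \code{y.scale} and \code{y.coord} are illustrative.
<<sketch-scale-vs-coord>>=
library(ggplot2)
p.base <-
  ggplot(data = mtcars,
         mapping = aes(x = disp, y = mpg)) +
  geom_point()
# scale transformation: the layer receives log10-transformed values
y.scale <- layer_data(p.base + scale_y_log10(), 1)$y
all.equal(sort(y.scale), sort(log10(mtcars$mpg)))
# coordinate transformation: the layer receives the original values
y.coord <- layer_data(p.base + coord_trans(y = "log10"), 1)$y
all.equal(sort(y.coord), sort(mtcars$mpg))
@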
Themes affect the rendering of plots at the time of printing; they can be thought of as style sheets defining the graphic design. A complete theme can override the default grey theme. The plot is the same, the observations are represented in the same way, the limits of the axes are the same and all text is the same. On the other hand, how these elements are rendered by different themes can be drastically different.% ({\small\textsf{data $\to$ aes $\to$ $\to$ geom $\to$ theme $\to$ \emph{ggplot object}}}
<<ggplot-basics-09>>=
ggplot(data = mtcars,
mapping = aes(x = disp, y = mpg)) +
geom_point() +
theme_classic()
@
Both the base font size and the base font family can be changed. The base font size controls the size of all text elements, as other sizes are defined relative to the base size. How the plot looks changes when using the same theme as in the previous example, but with a different base point size and font family for text elements. (The use of themes is discussed in section \ref{sec:plot:themes} on page \pageref{sec:plot:themes}.)
<<ggplot-basics-10>>=
ggplot(data = mtcars,
mapping = aes(x = disp, y = mpg)) +
geom_point() +
theme_classic(base_size = 20, base_family = "serif")
@
How to set axis labels, tick positions, and tick labels will be discussed in depth in section \ref{sec:plot:scales} on page \pageref{sec:plot:scales}. Function \code{labs()} is \emph{a convenience function} used to set the title and subtitle of a plot and to replace the default \code{name} of scales, here displayed as axis labels. The default \code{name} of scales is the name of the mapped variable. In the call to \code{labs()}, the names of aesthetics are used as if they were formal parameters with character strings or \Rlang expressions as arguments. Below \code{x} and \code{y} are the names of the two \emph{aesthetics} to which two variables in \code{data} were mapped, \code{disp} and \code{mpg}, respectively. Formal parameters \code{title} and \code{subtitle} add these plot elements. (The escaped character \verb|\n| stands for new line, see section \ref{sec:calc:character} on page \pageref{sec:calc:character}.)
<<ggplot-basics-11>>=
ggplot(data = mtcars,
mapping = aes(x = disp, y = mpg)) +
geom_point() +
labs(x = "Engine displacement (cubic inches)",
y = "Fuel use efficiency\n(miles per gallon)",
title = "Motor Trend Car Road Tests",
subtitle = "Source: 1974 Motor Trend US magazine")
@
As elsewhere in \Rlang, when a value is expected, either a value stored in a variable or a more complex statement returning a suitable value can be passed as an argument to be mapped to an \emph{aesthetic}. In other words, the values to be plotted do not need to be stored as variables (or columns) in the data frame passed as an argument to parameter \code{data}; they can also be computed from these variables. Below, miles per gallon, \code{mpg}, are plotted against the engine displacement per cylinder, computed by dividing \code{disp} by \code{cyl} within the call to \code{aes()}.
<<ggplot-basics-info-01>>=
ggplot(data = mtcars,
mapping = aes(x = disp / cyl, y = mpg)) +
geom_point()
@
Each of the elements of the grammar exemplified above is implemented in multiple functions, and in addition these functions accept arguments that can be used to modify their behaviour. Multiple data objects as well as multiple mappings can coexist within a single \code{"gg"} plot object. Packages and user code can define new \emph{geometries}, \emph{statistics}, \emph{scales}, \emph{coordinates}, and even implement new \emph{aesthetics}. Individual elements in a \emph{theme} can be modified and new complete \emph{themes} created, re-used and shared. I describe below how to use the grammar of graphics to construct different types of data visualisations, both simple and complex. Because the different elements interact, I introduce some of them first briefly in sections other than where I describe them in depth.
\index{grammar of graphics!plot construction|)}
\subsection{Plots as \Rlang objects}\label{sec:plot:objects}
\index{grammar of graphics!plots as R objects|(}
\code{"gg"} plot objects and their components behave as other \Rlang objects. Operators and methods for the \code{"gg"} class are available. As above, a \code{"gg"} plot object saved as \code{p1} is used below.
<<ggplot-basics-04-wb1>>=
@
In the previous section, operator \code{+} was used to assemble the plots from ``anonymous'' \Rlang objects. Saved or ``named'' objects can also be combined with \code{+}.
<<ggplot-objects-02>>=
p1 + stat_smooth(geom = "line", method = "lm", formula = y ~ x)
@
Above, plot elements were added one by one, with operator \code{+}. Multiple components can also be added in a single operation. Like individual components, sets of components stored in a list can be saved in a variable and added to multiple plots. This ensures consistency and makes coordinated alterations to a set of plots easier. \emph{Throughout this chapter, I use this approach to achieve conciseness and to highlight what is different and what is not among plots in related examples.}
<<ggplot-objects-info-01>>=
p.ls <- list(
stat_smooth(geom = "line", method = "lm", formula = y ~ x),
scale_y_log10())
@
<<ggplot-objects-info-02>>=
p1 + p.ls
@
\begin{playground}
Reproduce the examples in the previous section, using \code{p1} defined above as a basis instead of building each plot from scratch.
\end{playground}
\begin{warningbox}
\index{grammar of graphics!structure of plot objects|(}
The separation of plot construction and rendering is possible because \code{"gg"} objects are self-contained. A copy of the data object passed as an argument to \code{data} is saved within the plot object, similarly to what is done in model-fit objects. In the example above, \code{p1} by itself could be saved to a file on disk and loaded into a clean \Rlang session, even on another computer, and rendered as long as package \ggplot and its dependencies are available. Another consequence of storing a copy of the data in the plot object is that later changes to the data object used to create a \code{"gg"} object are \emph{not} reflected in newly rendered plots from this object: the \code{"gg"} object needs to be created anew.
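Both consequences can be illustrated with a small sketch. It assumes that base \Rlang functions \code{saveRDS()} and \code{readRDS()} are used for storage, which is one of several possible choices, and uses a small made-up data frame. The names \code{df.tmp}, \code{p.tmp}, \code{rds.file} and \code{p.back} are illustrative.
<<sketch-plot-self-contained>>=
library(ggplot2)
df.tmp <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5))
p.tmp <-
  ggplot(data = df.tmp, mapping = aes(x = x, y = y)) +
  geom_point()
# modifying the data frame afterwards does not alter the plot object
df.tmp$y <- df.tmp$y * 100
range(layer_data(p.tmp)$y)
# round trip through a file on disk
rds.file <- tempfile(fileext = ".rds")
saveRDS(p.tmp, file = rds.file)
p.back <- readRDS(rds.file)
range(layer_data(p.back)$y)
@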
\end{warningbox}
\begin{explainbox}
The \emph{recipe} for a plot is stored in a \code{"gg"} plot object. Objects of class \code{"gg"} are of mode \code{"list"}. In \Rlang, lists can contain heterogeneous members and \code{"gg"} objects contain data, function definitions, and unevaluated expressions. In other words, they contain the data plus instructions to transform the data, to map them into graphic objects, and to control various aspects of the rendering, from scale limits to the typefaces to use. (\Rlang lists are described in section \ref{sec:calc:lists} on page \pageref{sec:calc:lists}.)
Top level members of the \code{"gg"} plot object \code{p1}, a simple plot, are displayed below with method \code{summary()}, which shows the components without making explicit the structure of the object.
<<ggplot-objects-03a>>=
summary(p1)
@
Method \code{str()} shows the structure of objects and can also be used to advantage with ggplots (long output not shown). Alternatively, \code{names()} extracts the names of the top-level members of \code{p1}.
<<ggplot-objects-03b>>=
names(p1)
@
\end{explainbox}
\begin{advplayground}
Explore in more detail the different members of object \code{p1}. For example, the code statement below extracts member \code{"layers"} from object \code{p1} and displays its structure.
<<ggplot-objects-box-03, eval=FALSE>>=
str(p1$layers, max.level = 1)
@
How many layers are present in this case?
\end{advplayground}
\index{grammar of graphics!structure of plot objects|)}
\index{grammar of graphics!plots as R objects|)}
\subsection{Scales and mappings}\label{sec:plot:mappings}
\index{grammar of graphics!mapping of data|(}
\index{grammar of graphics!aesthetics|(}
In \ggplot, a \emph{mapping} describes which variable in \code{data} is mapped to which \code{aesthetic}, or graphic feature of a plot, such as $x$, $y$, colour, fill, shape, and linewidth. In \ggplot, a \emph{scale} describes the correspondence between \emph{values} in the mapped variable and values of the graphic feature. Below, the numeric variable \code{cyl} is mapped to the \code{colour} aesthetic. As the variable is \code{numeric}, a continuous colour scale is used. Out of the multiple continuous colour scales available, \ggscale{scale\_colour\_continuous()} is the default.
<<ggplot-basics-12a>>=
p2 <-
ggplot(data = mtcars,
mapping = aes(x = disp, y = mpg, colour = cyl)) +
geom_point()
p2
@
Without changing the \code{mapping}, a different-looking plot can be created by changing the scale used. Below, in addition, a palette is selected with \code{option = "magma"} and the range of colours used from this palette adjusted with \code{end = 0.85}.
<<ggplot-basics-12b>>=
p2 + scale_colour_viridis_c(option = "magma", end = 0.85)
@
Changing the scale used for the \code{colour} aesthetic does not conceptually modify the plot, except for the colours used. There is a separation between the semantic structure of the plot and its graphic design. Still, how the audience interacts with and perceives the plot depends on both of these concerns.
Some scales, like those for \code{colour}, exist in multiple ``flavours'', suitable for numeric variables (continuous values) or for factors (discrete values). If \code{cyl} is converted into a \code{factor}, a discrete colour scale is used instead of a continuous one. Out of the different discrete scales, \ggscale{scale\_colour\_discrete()} is used by default.
<<ggplot-basics-12c>>=
ggplot(data = mtcars,
mapping = aes(x = disp, y = mpg, colour = factor(cyl))) +
geom_point()
@
If \code{cyl} is converted into an \code{ordered} factor, an ordinal colour scale is used, by default \ggscale{scale\_colour\_ordinal()} (plot not shown).
<<ggplot-basics-12d, eval=eval_plots_all>>=
ggplot(data = mtcars,
mapping = aes(x = disp, y = mpg, colour = ordered(cyl))) +
geom_point()
@
The scales for other aesthetics work in a similar way as those for colour. Scales are described in detail in section \ref{sec:plot:scales} on page \pageref{sec:plot:scales}.
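As a brief sketch of this similarity, the same variable can be mapped to the \code{shape} aesthetic instead of \code{colour}. As a discrete variable is mapped, a default discrete shape scale is used. The name \code{p.shape} is illustrative.
<<sketch-shape-scale>>=
library(ggplot2)
p.shape <-
  ggplot(data = mtcars,
         mapping = aes(x = disp, y = mpg, shape = factor(cyl))) +
  geom_point()
p.shape
# one shape is assigned to each level of the factor
length(unique(layer_data(p.shape)$shape))
nlevels(factor(mtcars$cyl))
@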
In the examples above for simple plots, based on data contained in a single data frame, mappings were established by passing the value returned by the call to \Rfunction{aes()} as the argument to parameter \code{mapping} of \Rfunction{ggplot()}.
Arguments passed to \code{data} and/or \code{mapping} parameters of \Rfunction{ggplot()} work as defaults for all layers in a plot. If arguments are passed to the identically named parameters of a layer function (statistic or geometry), they are applied to the layer, overriding whole-plot defaults, if they exist. Consequently, the code below creates a plot, \code{p3}, identical to \code{p2} above.
<<ggplot-basics-13, eval=eval_plots_all>>=
p3 <-
ggplot() +
geom_point(data = mtcars,
mapping = aes(x = disp, y = mpg, colour = cyl))
p3
@
These examples demonstrate two different approaches that are equally convenient for simple plots with a single layer. However, if a plot has multiple layers based on the same data, the approach used for \code{p2} makes this clear and is concise. If each layer uses different data and/or different mappings, the second approach is necessary.
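A minimal sketch of a plot in which each layer uses a different data frame follows. The summary data frame \code{mtcars.means}, computed here with \code{aggregate()}, and the name \code{p.two} are illustrative.
<<sketch-layer-specific-data>>=
library(ggplot2)
# means of disp and mpg for each number of cylinders
mtcars.means <-
  aggregate(cbind(disp, mpg) ~ cyl, data = mtcars, FUN = mean)
p.two <-
  ggplot() +
  geom_point(data = mtcars,
             mapping = aes(x = disp, y = mpg)) +
  geom_point(data = mtcars.means,
             mapping = aes(x = disp, y = mpg),
             colour = "red", size = 4)
p.two
# rows of data used in each layer
c(nrow(layer_data(p.two, 1)), nrow(layer_data(p.two, 2)))
@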
\begin{explainbox}
In some cases, when flexibility is needed while constructing complex plots with multiple layers, other \emph{idioms} can be preferable, e.g., when assembling a plot from ``pieces'' stored in variables or built programmatically.
The default mapping can also be added directly with the \code{+} operator, instead of being passed as an argument to \Rfunction{ggplot()}.
<<ggplot-basics-14, eval=eval_plots_all>>=
ggplot(data = mtcars) +
aes(x = disp, y = mpg) +
geom_point()
@
It is also possible to have a default mapping for the whole plot, but no default data.
<<ggplot-basics-15, eval=eval_plots_all>>=
ggplot() +
aes(x = disp, y = mpg) +
geom_point(data = mtcars)
@
A mapping saved in a variable (example below), as well as a mapping returned by a function call (shown above for \code{aes()}), can be passed as an argument to parameter \code{mapping}.
<<ggplot-basics-15a, eval=eval_plots_all>>=
my.mapping <- aes(x = disp, y = mpg)
ggplot(data = mtcars,
mapping = my.mapping) +
geom_point()
@
In all these examples, the plot remains unchanged (not shown). However, the flexibility of the grammar allows the assembly of plots from separately constructed pieces and reusing these pieces by storing them in variables. These approaches can be very useful in scripts that construct consistently formatted sets of plots, or when the same mapping needs to be used consistently in multiple plots.
\end{explainbox}
The mapping to aesthetics in the call to \Rfunction{aes()} does not have to be to a variable from \code{data} as in examples above. A code statement that returns a value computed from one or more variables from \code{data} is also accepted. Computations during mapping help avoid the proliferation of variables in the data frames containing observations. In this simple example, \code{mpg} in miles per gallon is converted into km per litre during mapping.
<<ggplot-basics-15b>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg * 0.43)) +
geom_point()
@
\begin{explainbox}
Operations applied to the \code{data} before they are plotted are usually implemented in \code{stats}. Sometimes it is convenient to directly modify the whole-plot default \code{data} before it reaches the layer's \code{stat} function. One approach is to pass a function to parameter \code{data} of the layer function. This argument must be the definition of a function accepting a data frame as its first argument and returning a data frame. When the argument to \code{data} is a function definition instead of the usual data frame, the function is applied to the plot's default data and the data frame returned by the function is used as the \code{data} in the layer. In the example below, an anonymous function defined in-line extracts a subset of the rows. The observations in the extracted rows are highlighted in the plot by overplotting them with smaller yellow shapes.
<<ggplot-basics-16>>=
ggplot(data = mtcars,
mapping = aes(x = disp, y = mpg)) +
geom_point(size = 4) +
geom_point(data = function(x){subset(x = x, cyl == 4)},
colour = "yellow", size = 1.5)
@
The argument passed above to \code{data} is a function definition, not a function call. Thus, if a function is passed by name, no parentheses are used. No arguments can be passed to such a function, except for the default \code{data} passed by position to its first parameter. Consequently, it is not possible to pass function \code{subset} directly. The anonymous function above is needed to be able to pass \code{cyl == 4} as argument.
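As a sketch of passing a function by name, the anonymous function can be replaced by a previously defined one. The function name \code{four.cyl} and the plot name \code{p.named} are illustrative. Note the absence of parentheses after \code{four.cyl} in the call to \code{geom\_point()}.
<<sketch-data-function-by-name>>=
library(ggplot2)
# a named function taking a data frame and returning a data frame
four.cyl <- function(x) {
  subset(x = x, cyl == 4)
}
p.named <-
  ggplot(data = mtcars,
         mapping = aes(x = disp, y = mpg)) +
  geom_point(size = 4) +
  geom_point(data = four.cyl,
             colour = "yellow", size = 1.5)
p.named
# rows of data used in the second layer
nrow(layer_data(p.named, 2))
@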
The plot's default data can also be operated upon using the \pkgname{magrittr} pipe operator, but not the pipe operator native to \Rlang (\Roperator{\textbar >}) or the dot-pipe operator from \pkgname{wrapr} (see section \ref{sec:data:pipes} on page \pageref{sec:data:pipes}). In this approach, the dot (\code{.}) placeholder at the head of the pipe stands for the plot's default \code{data} object. The code statement below uses a pipe as argument for \code{data} to call function \Rfunction{subset()} with \code{cyl == 4} passed as the condition. The plot, not shown, is as in the example above.
<<ggplot-basics-17, eval=eval_plots_all>>=
ggplot(data = mtcars,
mapping = aes(x = disp, y = mpg)) +
geom_point(size = 4) +
geom_point(data = . %>% subset(x = ., cyl == 4), colour = "yellow",
size = 1.5)
@
A third possible approach is to test the condition within the call to \Rfunction{aes()}. In this approach, it is not possible to extract a subset of rows. Making some observations invisible by reducing their size seems straightforward. However, setting \code{size = 0} draws a very small point, still visible. Out of various possible approaches, setting size to \code{NA} skips the rows, and \code{na.rm = TRUE} silences the expected warning. This is a roundabout approach to subsetting. Notice that \ggscale{scale\_size\_identity()} is also needed. The plot, not shown, when rendered does not differ from the two examples above.
<<ggplot-basics-18, eval=eval_plots_all>>=
ggplot(data = mtcars,
mapping = aes(x = disp, y = mpg)) +
geom_point(size = 4) +
geom_point(colour = "yellow",
mapping = aes(size = ifelse(cyl == 4, 1.5, NA)),
na.rm = TRUE) +
scale_size_identity()
@
As is usual in \Rlang, multiple approaches can be used to the same end.
\end{explainbox}
\begin{explainbox}
\emph{Late mapping}\index{grammar of graphics!mapping of data!late} of variables to aesthetics has been possible in \pkgname{ggplot2} for a long time. The original notation enclosed the name of a variable returned by a statistic between pairs of dots (\code{..}). This notation was deprecated some time ago and replaced by \ggscale{stat()}. In both cases, there was a limitation: within a single layer, it was impossible to map one variable to an aesthetic as input to the statistic and a computed variable to the same aesthetic as input to the geometry. There were also some other quirks that prevented passing some arguments to the geometry through the dots \code{...} parameter of a statistic.
Since version 3.3.0 of \pkgname{ggplot2}, the syntax used for mapping variables to aesthetics is based on functions \ggscale{stage()}, \ggscale{after\_stat()} and \ggscale{after\_scale()}. Function \ggscale{after\_stat()} replaces both \ggscale{stat()} and the \code{..} notation.
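Two of these functions can be sketched with examples similar to those in the documentation of \pkgname{ggplot2}. In the first plot, variable \code{density} computed by \code{stat\_bin()} is mapped to $y$ with \code{after\_stat()}. In the second plot, \code{after\_scale()} is used to derive a semitransparent \code{fill} from the \code{colour} already assigned by the scale. The names \code{p.dens} and \code{p.fill} are illustrative.
<<sketch-after-stat-after-scale>>=
library(ggplot2)
# after_stat(): map a variable computed by the statistic
p.dens <-
  ggplot(data = mtcars, mapping = aes(x = mpg)) +
  stat_bin(mapping = aes(y = after_stat(density)), bins = 10)
p.dens
# after_scale(): modify an aesthetic value after the scale is applied
p.fill <-
  ggplot(data = mtcars,
         mapping = aes(x = factor(cyl), y = mpg,
                       colour = factor(cyl))) +
  geom_boxplot(mapping = aes(fill = after_scale(alpha(colour, 0.3))))
p.fill
@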
\end{explainbox}
%Variables in the data frame passed as argument to \code{data} are mapped to aesthetics before they are received as input by a statistic (possibly \code{stat\_identity()}). The mappings of variables in the data frame returned by statistics are the input to the geometry. Those statistics that operate on \textit{x} and/or \text{y} return a transformed version of these variables, by default also mapped to these aesthetics. However, in most cases other variables in addition to \textit{x} and/or \text{y} are included in the \code{data} returned by a \emph{statistic}. Although their default mapping is coded in the statistic functions' definitions, the user can modify this default mapping explicitly within a call to \code{aes()} using \ggscale{after\_stat()}, which lets us differentiate between the data frame supplied by the user and that returned by the statistic. The third stage was not accessible in earlier versions of \pkgname{ggplot2}, but lack of access was usually not insurmountable. Now this third stage can be accessed with \ggscale{after\_scale()} making coding simpler.
%
%User-coded transformations of the data are best handled at the third stage using scale transformations. However, when the intention is to jointly display or combine different computed variables returned by a statistic we need to set the desired mapping of original and computed variables to aesthetics at more than one stage.
%
The documentation of \pkgname{ggplot2} gives several good examples of cases when the new mapping syntax is useful. I give here a different example, a polynomial fitted to data using \Rfunction{rlm()}. RLM is a procedure that, before computing the residual sum of squares, automatically assigns weights to the individual residuals in an attempt to protect the estimated fit from the influence of extreme observations or outliers. When using this and similar methods, it is of interest to plot the residuals together with the weights. One approach is to map weights to a gradient between two colours. The code below constructs a data frame containing artificial data that includes an extreme value or outlier.
<<mapping-stage-01>>=
set.seed(4321)
X <- 0:10
Y <- (X + X^2 + X^3) + rnorm(length(X), mean = 0, sd = mean(X^3) / 4)
df1 <- data.frame(X, Y)
df2 <- df1
df2[6, "Y"] <- df1[6, "Y"] * 10
@
In the first plot, \ggscale{after\_stat()} is used to map variable \code{weights} computed by the statistic to the \code{colour} aesthetic. In \ggstat{stat\_fit\_residuals()}, \gggeom{geom\_point()} is used by default. This figure shows the raw residuals with no weights applied (mapped to $y$ by default), and the computed weights (with range 0 to 1) encoded by colours ranging between red and blue.
<<mapping-stage-02>>=
ggplot(data = df2, mapping = aes(x = X, y = Y)) +
stat_fit_residuals(formula = y ~ poly(x, 3, raw = TRUE), method = "rlm",
mapping = aes(colour = after_stat(weights)),
show.legend = TRUE) +
scale_colour_gradient(low = "red", high = "blue", limits = c(0, 1),
guide = "colourbar")
@
In the second plot, weighted residuals are mapped to the $y$ aesthetic, and weights, as above, to the colour aesthetic. A call to \ggscale{stage()} can distinguish the mapping ahead of the statistic (\code{start}) from that after the statistic, i.e., ahead of the geometry. As above, the default geometry, \gggeom{geom\_point()}, is used. The mapping in this example can be read as: the variable \code{X} from the data frame \code{df2} is mapped to the \textit{x} aesthetic at all stages. Variable \code{Y} from the data frame \code{df2} is mapped to the \textit{y} aesthetic ahead of the computations in \ggstat{stat\_fit\_residuals()}. After the computations, variables \code{y} and \code{weights} in the data frame returned by \ggstat{stat\_fit\_residuals()} are multiplied and mapped to the \textit{y} aesthetic ahead of \gggeom{geom\_point()}.\label{chunk:plot:weighted:resid}
<<mapping-stage-03>>=
ggplot(df2) +
stat_fit_residuals(formula = y ~ poly(x, 3, raw = TRUE),
method = "rlm",
mapping = aes(x = X,
y = stage(start = Y,
after_stat = y * weights),
colour = after_stat(weights)),
show.legend = TRUE) +
scale_colour_gradient(low = "red", high = "blue", limits = c(0, 1),
guide = "colourbar")
@
\begin{explainbox}
When fitting models to observations with \Rfunction{lm()}, the un-weighted residuals are used to compute the sum of squares unless weights are passed as an argument. In \Rfunction{rlm()}, the weights are computed from the data by the function.
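The weights computed by \Rfunction{rlm()} can also be inspected without plotting. The sketch below assumes that package \pkgname{MASS} is installed and that the weights are those stored in member \code{w} of the fitted model object. It recreates the artificial data used above so that it can be run on its own. The name \code{fm.rlm} is illustrative.
<<sketch-rlm-weights>>=
library(MASS)
# same artificial data as used for the plots above
set.seed(4321)
X <- 0:10
Y <- (X + X^2 + X^3) + rnorm(length(X), mean = 0, sd = mean(X^3) / 4)
df2 <- data.frame(X, Y)
df2[6, "Y"] <- df2[6, "Y"] * 10
# robust fit of a third degree polynomial
fm.rlm <- rlm(Y ~ poly(X, 3, raw = TRUE), data = df2)
# residuals and the weights assigned to them during fitting
data.frame(X = df2$X,
           residuals = residuals(fm.rlm),
           weights = fm.rlm$w)
@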
\end{explainbox}
\index{grammar of graphics!mapping of data|)}
\index{grammar of graphics!aesthetics|)}
\section{Geometries}\label{sec:plot:geometries}
\index{grammar of graphics!geometries|(}
Different geometries support different \emph{aesthetics} (Table \ref{tab:plot:geoms}). While \gggeom{geom\_point()} supports \code{shape}, and \gggeom{geom\_line()} supports \code{linetype}, both support \code{x}, \code{y}, \code{colour}, and \code{alpha}. In this section I describe frequently used \code{geometries} from package \ggplot and from a few packages that extend \ggplot. The graphic output from some code examples will not be shown, with the expectation that readers will run the code to see the plots.
Mainly for historical reasons, \emph{geometries} accept a \emph{statistic} as an argument, in the same way as \emph{statistics} accept a \emph{geometry} as an argument. In this section I only describe \emph{geometries} which have as a default \emph{statistic} \code{stat\_identity}. In section \ref{sec:plot:stat:summaries} (page \pageref{sec:plot:stat:summaries}), I describe other \emph{geometries} together with the \emph{statistics} they use by default.
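This symmetry can be sketched by building the same layer in two ways: with a \emph{statistic} that is passed a \emph{geometry} as argument, and with a \emph{geometry} that is passed a \emph{statistic} as argument. The names \code{p.stat} and \code{p.geom} are illustrative.
<<sketch-geom-with-stat>>=
library(ggplot2)
# statistic with a geometry passed as argument
p.stat <-
  ggplot(data = mtcars, mapping = aes(x = disp, y = mpg)) +
  stat_smooth(geom = "line", method = "lm", formula = y ~ x)
# geometry with a statistic passed as argument
p.geom <-
  ggplot(data = mtcars, mapping = aes(x = disp, y = mpg)) +
  geom_line(stat = "smooth", method = "lm", formula = y ~ x)
# the predicted values that reach the geometry are the same
all.equal(layer_data(p.stat)$y, layer_data(p.geom)$y)
@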
\begin{table}
\caption[Geometries]{\ggplot geometries described in section \ref{sec:plot:geometries}, packages where they are defined, and the aesthetics supported. The default statistic is in all cases \ggstat{stat\_identity()}.}\vspace{1ex}\label{tab:plot:geoms}
\centering
\begin{tabular}{llp{8.25cm}}
\toprule
Geometry & Package & Aesthetics \\
\midrule
\code{geom\_point} & \pkgnameNI{ggplot2} & x, y, shape, size, fill, colour, alpha \\
\code{geom\_point\_s} & \pkgnameNI{ggpp} & x, y, size, linetype, linewidth, fill, colour, alpha \\
\code{geom\_pointrange} & \pkgnameNI{ggplot2} & x, y, ymin, ymax, shape, size, linetype, linewidth, fill, colour, alpha \\
\code{geom\_errorbar} & \pkgnameNI{ggplot2} & x, ymin, ymax, linetype, linewidth, colour, alpha \\
\code{geom\_linerange} & \pkgnameNI{ggplot2} & x, ymin, ymax, linetype, linewidth, colour, alpha \\
\code{geom\_line} & \pkgnameNI{ggplot2} & x, y, linetype, linewidth, colour, alpha \\
\code{geom\_segment} & \pkgnameNI{ggplot2} & x, y, xend, yend, linetype, linewidth, colour, alpha \\
\code{geom\_step} & \pkgnameNI{ggplot2} & x, y, linetype, linewidth, colour, alpha \\
\code{geom\_path} & \pkgnameNI{ggplot2} & x, y, linetype, linewidth, colour, alpha \\
\code{geom\_curve} & \pkgnameNI{ggplot2} & x, y, xend or yend, linetype, linewidth, colour, alpha \\
\code{geom\_area} & \pkgnameNI{ggplot2} & x, y, (ymin = 0), linetype, linewidth, fill, colour, alpha \\
\code{geom\_ribbon} & \pkgnameNI{ggplot2} & x, ymin and ymax, linetype, linewidth, fill, colour, alpha \\
\code{geom\_align} & \pkgnameNI{ggplot2} & x or y, xmin or xmax, ymin or ymax, linetype, linewidth, fill, colour, alpha \\
\code{geom\_rect} & \pkgnameNI{ggplot2} & xmin, xmax, ymin, ymax, linetype, linewidth, fill, colour, alpha \\
\code{geom\_tile} & \pkgnameNI{ggplot2} & x, y, width, height, linetype, linewidth, fill, colour, alpha \\
\code{geom\_col} & \pkgnameNI{ggplot2} & x, y, width, linetype, linewidth, fill, colour, alpha \\
\code{geom\_rug} & \pkgnameNI{ggplot2} & x or y, linewidth, colour, alpha \\
\code{geom\_hline} & \pkgnameNI{ggplot2} & yintercept, linetype, linewidth, colour, alpha \\
\code{geom\_vline} & \pkgnameNI{ggplot2} & xintercept, linetype, linewidth, colour, alpha \\
\code{geom\_abline} & \pkgnameNI{ggplot2} & intercept, slope, linetype, linewidth, colour, alpha \\
\code{geom\_text} & \pkgnameNI{ggplot2} & x, y, label, face, family, angle, size, colour, alpha \\
\code{geom\_label} & \pkgnameNI{ggplot2} & x, y, label, face, family, (angle), size, fill, colour, alpha \\
\code{geom\_text\_repel} & \pkgnameNI{ggrepel} & x, y, label, face, family, angle, size, colour, alpha \\
\code{geom\_label\_repel} & \pkgnameNI{ggrepel} & x, y, label, face, family, size, fill, colour, alpha \\
\code{geom\_sf} & \pkgnameNI{ggplot2} & fill, colour \\
\code{geom\_table} & \pkgnameNI{ggpp} & x, y, label, size, colour, angle \\
\code{geom\_plot} & \pkgnameNI{ggpp} & x, y, label, vp.width, vp.height, angle \\
\code{geom\_grob} & \pkgnameNI{ggpp} & x, y, vp.width, vp.height, label \\
\code{geom\_blank} & \pkgnameNI{ggplot2} & --- \\
\bottomrule
\end{tabular}
\end{table}
\subsection{Point}\label{sec:plot:geom:point}
\index{grammar of graphics!point geometry|(}
As seen in examples above, \gggeom{geom\_point()} can be used to add a layer with observations represented by ``points'' or symbols. In \emph{scatter plots} the variables mapped to $x$ and $y$ aesthetics are both continuous (\code{numeric}) and in \emph{dot plots} one of them is discrete (\code{factor} or \code{ordered}) and the other continuous. The plots in the examples above have been scatter plots.
\index{plots!scatter plot|(}The first examples of the use of \gggeom{geom\_point()} are for \textbf{scatter plots}, as \code{disp} and \code{mpg} are \code{numeric} variables. In the examples above, a third variable, \code{cyl}, was mapped to \code{colour}. While the colour aesthetic can be used with all \code{geoms}, other aesthetics can be used only with some \code{geoms}, for example the \code{shape} aesthetic can be used only with \gggeom{geom\_point()} and similar \code{geoms}, such as \gggeom{geom\_pointrange()}. The values in the \code{shape} aesthetic are discrete, and consequently only discrete values can be mapped to it.
<<scatter-01>>=
p.base <-
ggplot(data = mtcars,
mapping = aes(x = disp, y = mpg, shape = factor(cyl))) +
geom_point()
p.base
@
\begin{playground}
Try a different mapping: \code{disp} $\rightarrow$ \code{colour}, \code{cyl} $\rightarrow$ \code{x}, keeping the mapping \code{mpg} $\rightarrow$ \code{y} unchanged. Continue by using \code{help(mtcars)} and/or \code{names(mtcars)} to see what other variables are available, and then try the combinations that trigger your curiosity---i.e., explore the data.
\end{playground}
Adding \ggscale{scale\_shape\_discrete()}, the scale already used by default, but passing \code{solid = FALSE} in the call creates a version of the same plot based on open shapes, still selected automatically.
<<scatter-11>>=
p.base +
scale_shape_discrete(solid = FALSE)
@
In contrast to ``filled'' shapes that obey both \code{colour} and \code{fill}, ``open'' shapes obey only \code{colour}, similarly to ``solid'' shapes. Function \code{scale\_shape\_manual} can be used to set the shape used for each value in the mapped factor. Below, ``open'' shapes are used, as they reveal partial overlaps better than solid shapes (plot not shown).\label{chunk:filled:symbols}
<<scatter-11a, eval=eval_plots_all>>=
p.base +
scale_shape_manual(values = c("circle open",
"square open",
"diamond open"))
@%
\pagebreak
It is also possible to use characters as shapes. The character is centred on the position of the observation. As the numbers used as symbols are self-explanatory, the default guide is removed by passing \code{guide = "none"} (plot not shown).\label{chunk:plot:point:char}
<<scatter-12, eval=eval_plots_all>>=
p.base +
scale_shape_manual(values = c("4", "6", "8"), guide = "none")
@
A variable from \code{data} can be mapped to more than one aesthetic, allowing redundant aesthetics. This makes possible figures that, even if using colour, are readable when reproduced as black-and-white images and to viewers affected by colour blindness.
<<scatter-14, eval=eval_plots_all>>=
ggplot(data = mtcars,
mapping = aes(x = disp, y = mpg,
shape = factor(cyl), colour = factor(cyl))) +
geom_point()
@
\index{plots!scatter plot|)}
\index{plots!dot plot|(}The next examples of the use of \gggeom{geom\_point()} are for \textbf{dot plots}, as \code{mpg} is a \code{numeric} variable but \code{factor(cyl)} is discrete. Dot plots are prone to have overlapping observations, and one way of making these points visible is to make them partly transparent by setting a constant value smaller than one for the \code{alpha} \emph{aesthetic}.
<<scatter-12a>>=
ggplot(data = mtcars,
mapping = aes(x = factor(cyl), y = mpg)) +
geom_point(alpha = 1/3)
@
Function\label{par:plot:pos:jitter} \ggposition{position\_identity()}, which is the default, does not alter the coordinates or position of observations, as shown in all examples above. To make overlapping observations visible, instead of making the points semitransparent as above, it is possible to randomly displace them along the axis mapped to the discrete variable, $x$ in this case. This is called \emph{jitter}, and can be added using \ggposition{position\_jitter()} as argument to formal parameter \code{position} of \code{geoms}. The amount of jitter is set by numeric arguments passed to \code{width} and/or \code{height}, given as a fraction of the distance between adjacent factor levels in the plot.
<<scatter-13>>=
ggplot(data = mtcars,
mapping = aes(x = factor(cyl), y = mpg)) +
geom_point(position = position_jitter(width = 0.25, height = 0))
@
\begin{warningbox}
The name of a \emph{position} can also be passed as a character string when no arguments need to be passed to the \emph{position} function. For some positions, numerical arguments can instead be passed to specific parameters of the geometries. However, the default jitter of $\pm0.4$ (40\% of the resolution of the data) is rarely optimal (plot not shown).
<<scatter-13info, eval=eval_plots_all>>=
ggplot(data = mtcars,
mapping = aes(x = factor(cyl), y = mpg, colour = factor(cyl))) +
geom_point(position = "jitter")
@
\end{warningbox}
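Because jitter is random, the displacements differ each time a plot is rendered. The chunk below is a minimal sketch, assuming that a reproducible figure is wanted: an arbitrary value (\code{1234}, chosen only for illustration) is passed to parameter \code{seed} of \code{position\_jitter()}, so that repeated rendering gives the same displacements. The name \code{p.jitter} is also only illustrative.
<<scatter-jitter-seed-sketch>>=
library(ggplot2)
# the seed value is arbitrary; any fixed number makes the jitter repeatable
p.jitter <-
  ggplot(data = mtcars,
         mapping = aes(x = factor(cyl), y = mpg)) +
  geom_point(position = position_jitter(width = 0.25, height = 0,
                                        seed = 1234))
p.jitter
@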
\index{plots!dot plot|)}
\index{plots!bubble plot|(}
\textbf{Bubble plots} are scatter- or dot plots in which the size of points or bubbles varies following values of a continuous variable mapped to the \code{size} \emph{aesthetic}. There are two approaches to this mapping: values in the mapped variable describe either the area of the points or their radii. Although the radius is sometimes used, due to how visual perception works, using area is perceptually closer to a linear mapping compared to radii. Below, the weights of cars (variable \code{wt}, in units of 1000~lb) are mapped to the area of the points. Open circles are used because of overlaps.
<<scatter-16>>=
ggplot(data = mtcars,
mapping = aes(x = disp, y = mpg, colour = factor(cyl), size = wt)) +
scale_size_area() +
geom_point(shape = "circle open", stroke = 1.5)
@
\begin{playground}
If a radius-based scale is used instead of an area-based one the perceived size differences are larger, i.e., the ``impression'' on the viewer is different. In the plot above, replace \code{scale\_size\_area()} with \code{scale\_radius()}.
Display the plot, look at it carefully. Check the numerical values of some of the weights of the cars, and assess if your perception of the plot matches the numbers behind it.
\end{playground}
\index{plots!bubble plot|)}
As a final example summarising the use of \gggeom{geom\_point()}, the scatter plot below combines different \emph{aesthetics} and their \emph{scales}.
<<scatter-18>>=
ggplot(data = mtcars,
mapping = aes(x = disp, y = mpg, shape = factor(cyl),
fill = factor(cyl), size = wt)) +
geom_point(alpha = 0.33, colour = "black") +
scale_size_area() +
scale_shape_manual(values = c("circle filled",
"square filled",
"diamond filled"))
@
\begin{playground}
Play with the code in the chunk above. Remove or change each of the mappings and the scale, display the new plot, and compare it to the one above. Continue playing with the code until you are sure you understand what graphical element in the plot is added or modified by each individual argument or ``word'' in the code statement.
\end{playground}
\index{grammar of graphics!point geometry|)}
It is common to draw error bars together with points representing means or medians. These can be added in a single layer with \gggeom{geom\_pointrange()} with values mapped to the \code{x}, \code{y}, \code{ymin} and \code{ymax} aesthetics, using \code{y} for the point and \code{ymin} and \code{ymax} for the ends of the line segment. Two other \emph{geometries}, \gggeom{geom\_linerange()} and \gggeom{geom\_errorbar()}, draw only a segment or a segment with capped ends, respectively. These three \code{geoms} are frequently used together with \code{stats} that compute summaries by group. However, summary values calculated before plotting can alternatively be passed as \code{data}.
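A minimal sketch of plotting previously computed summaries follows. Means and standard errors of \code{mpg} for each value of \code{cyl} are computed with \code{aggregate()} and passed as \code{data} to a plot with a \code{geom\_pointrange()} layer. The object names and the choice of one standard error on each side of the mean as the range are assumptions made only for this illustration.
<<pointrange-sketch-01>>=
library(ggplot2)
# mean and standard error of mpg for each number of cylinders
mpg.mean <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
mpg.se <- aggregate(mpg ~ cyl, data = mtcars,
                    FUN = function(x) {sd(x) / sqrt(length(x))})
# both calls return rows sorted by cyl, so columns can be combined
mpg.summaries <- data.frame(cyl = factor(mpg.mean$cyl),
                            mpg = mpg.mean$mpg,
                            se = mpg.se$mpg)
ggplot(data = mpg.summaries,
       mapping = aes(x = cyl, y = mpg,
                     ymin = mpg - se, ymax = mpg + se)) +
  geom_pointrange()
@
Replacing \code{geom\_pointrange()} with \code{geom\_linerange()} or \code{geom\_errorbar()} in this sketch draws only the segments.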
\subsection{Rug}\label{sec:plot:rug}
\index{plots!rug margin|(}
Rug plots are rarely used by themselves; instead they are usually an addition to scatter plots, where they make it easier to see the distribution of observations along the $x$- and/or $y$-axes. An example of the use of \gggeom{geom\_rug()} follows. By default, rugs are drawn on the left and bottom edges of the plotting area. By passing \code{sides = "btlr"} they are drawn on the bottom, top, left, and right margins. Any combination of the four characters can be used to control the drawing of the rugs.
<<rug-plot-01>>=
ggplot(data = mtcars,
mapping = aes(x = disp, y = mpg, colour = factor(cyl))) +
geom_point() +
geom_rug(sides = "btlr")
@
\begin{warningbox}
Rug plots are useful when the local density of observations in a continuous variable is not high as, otherwise, rugs become too cluttered and the rug ``threads'' overlap. When the overlap is moderate, making the segments semitransparent by setting the \code{alpha} aesthetic to a constant value smaller than one, can make the variation in density easier to appreciate. When the number of observations is large, marginal density plots are preferred.
\end{warningbox}
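As a minimal sketch of the use of transparency with rugs, the chunk below sets \code{alpha} to a constant in the call to \code{geom\_rug()}. The value \code{1/3} and the name \code{p.rug} are arbitrary choices made for illustration; with only 32 observations in \code{mtcars} the effect is modest, but overlapping ``threads'' appear darker than isolated ones.
<<rug-plot-alpha-sketch>>=
library(ggplot2)
p.rug <-
  ggplot(data = mtcars,
         mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  # constant alpha: overlapping rug segments add up to a darker mark
  geom_rug(alpha = 1/3)
p.rug
@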
\index{plots!rug margin|)}
\subsection{Line and area}\label{sec:plot:line}
\index{grammar of graphics!various line and path geometries|(}\index{plots!line plot|(}
\textbf{Line plots} are normally created using \gggeom{geom\_line()}, and, occasionally using \gggeom{geom\_path()}. These two \code{geoms} differ in the sequence they follow when connecting values: \gggeom{geom\_line()} connects observations based on the ordering of \code{x} values while \gggeom{geom\_path()} uses the order in the data. Aesthetic \code{linewidth} controls the thickness of lines and \code{linetype} the patterns of dashes and dots.
In a line plot, observations, or the subset of observations in a group, are joined by straight lines. Below, a different data set, \Rdata{Orange}, with data on the growth of five orange trees (see \code{help(Orange)}) is used. By mapping \code{Tree} to \code{linetype} the observations become grouped, and a line is plotted for each tree.
\label{plot:fig:lines}
<<line-plot-01>>=
ggplot(data = Orange,
mapping = aes(x = age, y = circumference, linetype = Tree)) +
geom_line()
@
\begin{warningbox}
Before \ggplot 3.4.0 the \code{size} aesthetic controlled the width of lines. Aesthetic \code{linewidth} was added in \ggplot 3.4.0 and the use of the \code{size} aesthetic for lines deprecated.
\end{warningbox}
\index{plots!line plot|)}
\index{plots!step plot|(}%
Geometry \gggeom{geom\_step()} plots only vertical and horizontal lines to join the observations, creating a stepped line, or ``staircase''. Parameter \code{direction}, with default \code{"hv"}, controls the ordering of horizontal and vertical lines.
<<step-plot-01>>=
ggplot(data = Orange,
mapping = aes(x = age, y = circumference, linetype = Tree)) +
geom_step()
@
\index{plots!step plot|)}
\begin{playground}
Using the following toy data, make three plots using \code{geom\_line()}, \code{geom\_path()}, and \code{geom\_step()} to add a layer. How do they differ?
<<line-plots-PG01,eval=eval_playground>>=
toy.df <- data.frame(x = c(1,3,2,4), y = c(0,1,0,1))
@
\end{playground}
\index{plots!filled-area plot|(}
While \gggeom{geom\_line()} draws a line joining observations, \gggeom{geom\_area()} supports, in addition, filling the area below the line according to the \code{fill} \emph{aesthetic}. In some cases, it is useful to stack the areas, e.g., when the values plotted represent parts of a bigger whole. In the next, contrived, example, the areas representing the growth of the five orange trees are stacked (visually summed) using \code{position = "stack"}, which is the default in \gggeom{geom\_area()} and is passed explicitly here for clarity; passing \code{position = "identity"} would instead overlay the areas. The visibility of the lines for individual trees is improved by changing their colour and width from the defaults. (Compare the $y$ axis of the figure below to that drawn using \code{geom\_line()} on page \pageref{plot:fig:lines}.)
<<area-plot-01>>=
p1 <- # will be used again later
ggplot(data = Orange,
mapping = aes(x = age, y = circumference, fill = Tree)) +
geom_area(position = "stack", colour = "white", linewidth = 1)
p1
@
\gggeom{geom\_ribbon()} draws two lines based on the \code{x}, \code{ymin} and \code{ymax} \emph{aesthetics}, with the space between the lines filled according to the \code{fill} \emph{aesthetic}. \gggeom{geom\_polygon()} is similar to \gggeom{geom\_path()} but connects the first and last observations forming a closed polygon that obeys the \code{fill} aesthetic.
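A minimal sketch of a ribbon follows, assuming that the band of interest is the range of circumferences across the five trees at each age. The smallest and largest values at each age are computed with \code{aggregate()} and mapped to \code{ymin} and \code{ymax}. The object and column names are illustrative only.
<<ribbon-plot-sketch>>=
library(ggplot2)
# smallest and largest circumference among trees at each age
orange.min <- aggregate(circumference ~ age, data = Orange, FUN = min)
orange.max <- aggregate(circumference ~ age, data = Orange, FUN = max)
orange.range <- data.frame(age = orange.min$age,
                           lowest = orange.min$circumference,
                           highest = orange.max$circumference)
ggplot(data = orange.range,
       mapping = aes(x = age, ymin = lowest, ymax = highest)) +
  geom_ribbon(fill = "gray75")
@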
\index{plots!filled-area plot|)}
\index{plots!reference lines|(}
Finally,\label{sec:plot:vhline} three \emph{geometries} draw lines across the whole plotting area: \gggeom{geom\_hline()}, \gggeom{geom\_vline()} and \gggeom{geom\_abline()}. The first two draw horizontal and vertical lines, respectively, while the third one draws straight lines with their position determined by the \emph{aesthetics} \code{slope} and \code{intercept}. The lines drawn with these three geoms extend to the edge of the plotting area.
\gggeom{geom\_hline()} and \gggeom{geom\_vline()} require a single parameter (or aesthetic), \code{yintercept} and \code{xintercept}, respectively. Different from other geoms, the data for these aesthetics can be passed as a constant numeric vector containing multiple values. The reason for this is that these geoms are most frequently used to annotate plots rather than to plot observations. Vertical lines can be used to highlight time points, here the ages of 1, 2, and 3 years.
<<area-plot-02>>=
p1 +
geom_vline(xintercept = 365 * 1:3, colour = "gray75") +
geom_vline(xintercept = 365 * 1:3, linetype = "dashed")
@
\begin{playground}
Change the order of the two layers in the example above. How did the figure change? What order is best? Would the same order be the best for a scatter plot? And would it be necessary to add two \code{geom\_vline()} layers?
\end{playground}
Similarly to \gggeom{geom\_hline()} and \gggeom{geom\_vline()}, \gggeom{geom\_abline()} draws a straight line, accepting as parameters (or as aesthetics) values for the \code{intercept}, $a$, and the \code{slope}, $b$.
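As a minimal sketch of the use of \code{geom\_abline()}, the chunk below takes the intercept and slope from a least-squares fit of \code{mpg} on \code{disp} computed with \code{lm()}. The use of a fitted line is an assumption made only to obtain plausible values for $a$ and $b$; any pair of numbers could be passed instead.
<<abline-plot-sketch>>=
library(ggplot2)
# coefficients of a straight line fitted by least squares
fit <- lm(mpg ~ disp, data = mtcars)
a <- coef(fit)[["(Intercept)"]]
b <- coef(fit)[["disp"]]
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  geom_abline(intercept = a, slope = b, linetype = "dashed")
@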
\index{plots!reference lines|)}
\index{plots!segments and arrows|(}
Disconnected straight-line segments and arrows, one for each observation or row in the data, can be plotted with \gggeom{geom\_segment()} which accepts \code{x}, \code{xend}, \code{y}, and \code{yend} as mapped aesthetics. \gggeom{geom\_spoke()}, which uses a polar parametrisation, uses a different set of aesthetics, \code{x}, \code{y} for origin, and \code{angle} and \code{radius} for the segment. Similarly, \gggeom{geom\_curve()} draws curved segments, with the curvature, control points, and angles controlled through parameters. These three \emph{geometries} support arrow heads at the ends of segments or curves, controlled through parameter \code{arrow} (not through an aesthetic).
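A minimal sketch with made-up data shows how the end points of segments are mapped and how arrow heads are requested. The data frame \code{segments.df} and its values are invented for this illustration. Function \code{arrow()}, re-exported by \ggplot from package \pkgname{grid}, builds the description of the arrow head passed to parameter \code{arrow}.
<<segment-plot-sketch>>=
library(ggplot2)
# two segments, one per row: from (x, y) to (xend, yend)
segments.df <- data.frame(x = c(1, 2), y = c(1, 2),
                          xend = c(2, 3), yend = c(2, 1))
ggplot(data = segments.df,
       mapping = aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_segment(arrow = arrow(length = unit(0.1, "inches")))
@
Replacing \code{geom\_segment()} with \code{geom\_curve()} in this sketch draws curved arrows between the same points.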
\index{plots!segments and arrows|)}
\index{grammar of graphics!various line and path geometries|)}
\subsection{Column}\label{sec:plot:col}
\index{grammar of graphics!column geometry|(}
\index{plots!column plot|(}
The \emph{geometry} \gggeom{geom\_col()} can be used to create \emph{column plots}, where each bar represents an observation or row in the \code{data} (frequently means or totals previously computed from the primary observations).
\begin{warningbox}
In other contexts, column plots are frequently called bar plots. \Rlang users not familiar yet with \ggplot are frequently surprised by the default behaviour of \gggeom{geom\_bar()} as it uses \ggstat{stat\_count()} to produce a histogram, rather than plotting values as is (see section \ref{sec:plot:histogram} on page \pageref{sec:plot:histogram}). \gggeom{geom\_col()} is identical to \gggeom{geom\_bar()} but with \code{"identity"} as the default statistic.
\end{warningbox}
Using very simple artificial data helps demonstrate how variations of column plots can be obtained. The data are for two groups, hypothetical males and females.
<<col-plot-01>>=
set.seed(654321)
my.col.data <-
data.frame(treatment = factor(rep(c("A", "B", "C"), 2)),
group = factor(rep(c("male", "female"), c(3, 3))),
measurement = rnorm(6) + c(5.5, 5, 7))
@
The first plot includes data for \code{"female"} subjects extracted using a nested call to \Rfunction{subset()}. Except for \code{x} and \code{y} default mappings are used for all \emph{aesthetics}.
<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_medium)
@
<<col-plot-02>>=
ggplot(subset(my.col.data, group == "female"),
mapping = aes(x = treatment, y = measurement)) +
geom_col()
@
The\label{par:plot:pos:stack} bars above are overwhelmingly wide; passing \code{width = 0.5} makes the bars narrower, using only half the distance between the levels on the $x$ axis. Setting \code{colour = "white"} overrides the default colour of the lines bordering the bars. Below, both males and females are included and \code{group} is mapped to the \code{fill} aesthetic. The default argument for position in \gggeom{geom\_col()} is \ggposition{position\_stack()}. Function \ggposition{position\_fill()} is similar to \ggposition{position\_stack()} but divides the stacked values by their sum, i.e., the individual stacked ``slices'' of the column display proportions instead of absolute values.
<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_2fig_very_wide)
@
<<col-plot-03>>=
p.base <-
ggplot(my.col.data,
mapping = aes(x = treatment, y = measurement, fill = group))
@
<<col-plot-03a>>=
p1 <- p.base + geom_col(width = 0.5) + ggtitle("stack (default)")
@
Using \code{position = "dodge"}\label{par:plot:pos:dodge} to override the default \code{position = "stack"} the columns for males and females are plotted side by side.\qRfunction{position\_stack()}
<<col-plot-04>>=
p2 <- p.base + geom_col(position = "dodge") + ggtitle("dodge")
@
The two plots side by side (see section \ref{sec:plot:composing} on page \pageref{sec:plot:composing} for details).
<<col-plot-04a>>=
p1 + p2
@
<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_wide)
@
\begin{playground}
Change the argument to \code{position}, or let the default be active, until you understand its effect on the figure. What is the difference between \emph{positions} \code{"identity"}, \code{"dodge"}, \code{"stack"}, and \code{"fill"}?
\end{playground}
\begin{playground}
Use constants as arguments for \emph{aesthetics} or map variable \code{treatment} to one or more of the \emph{aesthetics} recognised by \gggeom{geom\_col()}, such as \code{colour}, \code{fill}, \code{linetype}, \code{linewidth}, \code{alpha} and \code{width}.
\end{playground}
\index{grammar of graphics!column geometry|)}
\index{plots!column plot|)}
\subsection{Tiles}\label{sec:tileplot}
\index{grammar of graphics!tile geometry|(}
\index{plots!tile plot|(}
\textbf{Tile plots} and \textbf{heat maps} are useful when observations are available on a regular rectangular 2D grid. The grid can, for example, represent locations in space as well as combinations of levels of two discrete classification criteria. The colour or darkness of the tiles informs about the value of the observations. A layer with square or rectangular tiles can be added with \gggeom{geom\_tile()}.
Data from 100 random draws from the $F$ distribution with degrees of freedom $\nu_1 = 2, \nu_2 = 20$ are used in the examples.
<<tile-plot-01>>=
set.seed(1234)
randomf.df <- data.frame(F.value = rf(100, df1 = 2, df2 = 20),
x = rep(letters[1:10], 10),
y = LETTERS[rep(1:10, rep(10, 10))])
@
\gggeom{geom\_tile()} requires aesthetics $x$ and $y$, with no defaults, and \code{width} and \code{height} with defaults that make all tiles of equal size filling the plotting area. Variable \code{F.value} is mapped to \code{fill}.
<<tile-plot-02>>=
ggplot(data = randomf.df,
mapping = aes(x, y, fill = F.value)) +
geom_tile()
@
Below, setting \code{colour = "gray75"} and \code{linewidth = 1} makes the tile borders visible. Whether highlighting these lines improves a tile plot depends on whether the individual tiles correspond to values of a categorical or a continuous variable. For example, when rows of tiles correspond to genes and columns to discrete treatments, visible tile borders are preferable. In contrast, when the tiles are an approximation to a continuous surface, like measurements on a regular spatial grid, it is best to suppress tile borders.
<<tile-plot-03>>=
ggplot(data = randomf.df,
mapping = aes(x, y, fill = F.value)) +
geom_tile(colour = "gray75", linewidth = 1)
@
\begin{playground}
Play with the arguments passed to parameters \code{colour} and \code{linewidth} in the example above, considering what features of the data are most clearly perceived in each of the plots you create.
\end{playground}
Continuous fill scales can be used to control the appearance. Below, the code for a tile plot based on a gray gradient, with missing values in red, is shown (plot not shown).
<<tile-plot-04, eval=eval_plots_all>>=
ggplot(data = randomf.df,
mapping = aes(x, y, fill = F.value)) +
geom_tile(colour = "white") +
scale_fill_gradient(low = "gray15", high = "gray85", na.value = "red")
@
In contrast to \gggeom{geom\_tile()}, \gggeom{geom\_rect()} draws rectangular tiles based on the position of the corners, mapped to aesthetics \code{xmin}, \code{xmax}, \code{ymin} and \code{ymax}. In this case, tiles can vary in size and do not need to be contiguous. The filled rectangles can be used, for example, to highlight a rectangular region in a plot (see example on page \pageref{par:plot:inset:zoom}).
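As a minimal sketch of highlighting a region with a filled rectangle, the chunk below adds a single rectangle as an annotation using \code{annotate()} with \code{geom = "rect"}, which avoids the need for a separate data frame. The limits of the rectangle are arbitrary values chosen for illustration, and the layer is added before the points so that these are drawn on top of it.
<<rect-plot-sketch>>=
library(ggplot2)
p.rect <-
  ggplot(data = mtcars,
         mapping = aes(x = disp, y = mpg)) +
  # corners given as constants; region chosen arbitrarily
  annotate(geom = "rect",
           xmin = 50, xmax = 150, ymin = 20, ymax = 35,
           fill = "gray85") +
  geom_point()
p.rect
@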
\index{plots!tile plot|)}
\index{grammar of graphics!tile geometry|)}
\subsection{Simple features (sf)}\label{sec:plot:sf}
\index{grammar of graphics!sf geometries|(}
\index{plots!maps and spatial plots|(}
\ggplot version 3.0.0 or later supports, with \gggeom{geom\_sf()} and its companions, \gggeom{geom\_sf\_text()}, \gggeom{geom\_sf\_label()}, and \ggstat{stat\_sf()}, the plotting of shape data similarly to geographic information systems (GIS). This makes it possible to display data on maps, for example, using different fill values for different regions. The special \emph{coordinate} \code{coord\_sf()} can be used to select different projections for maps. The \emph{aesthetic} used is called \code{geometry} and, contrary to all the other aesthetics described above, the values to be mapped are of class \code{sfc} containing \emph{simple features} data with multiple components. Manipulation of simple features data is supported by package \pkgname{sf}. Normal geometries can be used together with \ggstat{stat\_sf\_coordinates()} to add other graphical elements to maps. This subject exceeds the scope of this book, so a single and very simple example is shown below.
<<sf_plot-01>>=
nc <- sf::st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
ggplot(nc) +
geom_sf(mapping = aes(fill = AREA), colour = "gray90")
@