
% !Rnw root = appendix.main.Rnw

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_wide)
opts_knit$set(concordance=TRUE)
opts_knit$set(unnamed.chunk.label = 'plotting-chunk')
@

\chapter{\Rlang Extensions: Grammar of Graphics}\label{chap:R:plotting}

\begin{VF}
The commonality between science and art is in trying to see profoundly---to develop strategies of seeing and showing.

\VA{Edward Tufte's answer to Charlotte Thralls}{\emph{An Interview with Edward R. Tufte}, 2004}\nocite{Zachry2004}
\end{VF}

%\dictum[Edward Tufte]{The commonality between science and art is in trying to see profoundly---to develop strategies of seeing and showing.}

\index{geometries ('ggplot2')|see{grammar of graphics, geometries}}
%\index{geom@\texttt{geom}|see{grammar of graphics, geometries}}
%\index{functions!geom@\texttt{geom}|see{grammar of graphics, geometries}}
\index{statistics ('ggplot2')|see{grammar of graphics, statistics}}
%\index{stat@\texttt{stat}|see{grammar of graphics, statistics}}
%\index{functions!stat@\texttt{stat}|see{grammar of graphics, statistics}}
\index{scales ('ggplot2')|see{grammar of graphics, scales}}
%\index{scale@\texttt{scale}|see{grammar of graphics, scales}}
%\index{functions!scale@\texttt{scale}|see{grammar of graphics, scales}}
\index{coordinates ('ggplot2')|see{grammar of graphics, coordinates}}
\index{themes ('ggplot2')|see{grammar of graphics, themes}}
%\index{theme@\texttt{scale}|see{grammar of graphics, themes}}
%\index{function!theme@\texttt{scale}|see{grammar of graphics, themes}}
\index{facets ('ggplot2')|see{grammar of graphics, facets}}
\index{annotations ('ggplot2')|see{grammar of graphics, annotations}}
\index{aesthetics ('ggplot2')|see{grammar of graphics, aesthetics}}

\section{Aims of This Chapter}

Three main data plotting systems are available to \Rlang users: base \Rlang, package \pkgname{lattice} \autocite{Sarkar2008}, and package \pkgname{ggplot2} \autocite{Wickham2016}; the last one being the most recent and currently most popular system available in \Rlang for plotting data. Even two different sets of graphics primitives (i.e., those used to produce the simplest graphical elements such as lines and symbols) are available in \Rlang, those in base \Rlang and a newer one in the \pkgname{grid} package \autocite{Murrell2019}.

In this chapter, you will learn the concepts of the layered grammar of graphics, on which package \pkgname{ggplot2} is based. You will also learn how to build several types of data plots with package \pkgname{ggplot2}. As a consequence of the popularity and flexibility of \pkgname{ggplot2}, many contributed packages extending its functionality have been developed and deposited in public repositories. However, I will focus mainly on package \pkgname{ggplot2}, describing only briefly a few of these extensions.

\section{Packages Used in This Chapter}

<<eval=FALSE, include=FALSE>>=
citation(package = "ggplot2")
@

If the packages used in this chapter are not yet installed on your computer, you can install them as shown below, as long as package \pkgname{learnrbook} is already installed.

<<eval=FALSE>>=
install.packages(learnrbook::pkgs_ch_ggplot)
@%
\pagebreak

To run the examples included in this chapter, you need first to load some packages from the library (see section \ref{sec:script:packages} on page \pageref{sec:script:packages} for details on the use of packages).

<<message=FALSE>>=
library(learnrbook)
library(scales)
library(ggplot2)
library(ggrepel)
library(gginnards)
library(broom)
library(ggpmisc)
library(ggbeeswarm)
library(lubridate)
library(tibble)
library(dplyr)
library(patchwork)
@

<<echo=FALSE>>=
theme_set(theme_gray(14))
@

<<echo=FALSE>>=
# set to TRUE to test non-executed code chunks and rendering of plots
eval_plots_all <- FALSE
@

\section{The Components of a Plot}
I\index{data visualisation!concepts} start by briefly presenting concepts central to data visualisation, following the \citetitle{Koponen2019} \autocite{Koponen2019}. Plots are a medium used to convey information, like text. It is worthwhile keeping this in mind. As with text, the design of plots needs to consider what needs to be highlighted to convey the take-home message. The style of the plot should match the expectations and the plot-reading abilities of the expected audience. One needs to be careful to avoid ambiguities and, most importantly of all, not to misinform. Data visualisations, like text, need to be planned, revised, commented upon, and revised again until the best way of expressing our message is found. The flexibility of the grammar of graphics supports this approach to designing and producing high-quality data visualisations for different audiences very well.

Of course, when exploring data, fancy details of graphical design are irrelevant, but flexibility remains important as it makes it possible to look at data from many differing angles, highlighting different aspects of them. In the same way as boiler-plate text and text templates have specific but limited uses, all-in-one functions for producing plots do not support well the design of original data visualisations. They tend to get the job done, but lack the flexibility needed to do the best job of communicating information. As this is a book about languages, the focus of this chapter is on the layered grammar of graphics.

The plots described in this chapter are classified as \emph{statistical graphics}\index{statistical graphics} within the broader field of data visualisation. Plots such as scatter plots include points (geometric objects) that by their position, shape, colour, or some other property directly convey information. The location of these points in the plot ``canvas'' or ``plotting area'', given by the values of their $x$ and $y$ coordinates, describes properties of the data, and any deviation in the mapping of observations to coordinates is misleading, because deviations from the expected mapping convey false information to the audience.

A \emph{data label}\index{data visualisation!data labels} is connected to an observation but its position can be displaced as long as its link to the corresponding observation can be inferred, e.g., by the direction of an arrow or even simple proximity. Data labels provide ancillary information, such as the name of a gene or place.

\emph{Annotations}\index{data visualisation!annotations} are additions to a plot that have no connection to individual observations, but rather to all observations taken together, e.g., a text like $n = 200$ indicating the number of observations, usually included in a corner or margin of a plot free of observations.

Axis and tick labels, legends and keys make it possible for the reader to retrieve the original values represented in the plot as graphical elements. Other features of visualisations, even when not carrying additional information, affect the ease with which a plot can be read and its accessibility to readers with visual constraints such as colour blindness. These features include the size of text and symbols, thickness of lines, choice of font face, choice of colour palette, etc.

Because of the different lengths of time available for the audience to interact with visualisations, in general, plots designed to be included in books and journals are unsuitable for oral presentations, and vice versa. It is important to keep in mind the role played by plots in informing the audience, and what information can be expected to be of interest to different audiences and under different situations. The grammar of graphics and its extensions provide enough flexibility to tailor the design of plots to different uses and also to easily create variations of a given plot.

\section{The Grammar of Graphics}\label{sec:plot:intro}
\index{grammar of graphics!elements|(}
What separates \ggplot from base \Rlang and trellis/lattice plotting functions is the use of a layered grammar of graphics\index{grammar of graphics} (the reason behind `gg' in the name of package \pkgname{ggplot2}). What is meant by grammar in this case is that plots are assembled piece by piece using different ``nouns'' and ``verbs'' \autocite{Cleveland1985,Wickham2010}. Instead of using a single function with many arguments, plots are assembled by combining different elements with operators \code{+} and \verb|%+%|. Furthermore, the construction is mostly semantics-based and to a large extent, how plots look when printed, displayed, or exported to a bitmap or vector-graphics file is controlled by themes.

Plotting can be thought of as translating or mapping the observations or data into a graphical language. Properties of graphical (or geometrical) objects are used to represent different aspects of the data. An observation can consist of multiple recorded values. For instance, an observation of air temperature may be defined by a position in 3-dimensional space and a point in time, in addition to the temperature itself. An observation for the size and shape of a plant can consist of height, stem diameter, number of leaves, size of individual leaves, length of roots, fresh mass, dry mass, etc. For example, an effective way of studying and/or communicating the relationship between height and stem diameter in plants is to plot observations as points using cartesian coordinates\index{grammar of graphics!cartesian coordinates}, \emph{mapping} stem diameter to the $x$ axis and the height to the $y$ axis.

The grammar of graphics makes it possible to design plots by combining various elements in ways that are nearly orthogonal. In other words, the majority of the possible combinations of ``words'' yield valid plots as long as the rules of the grammar are respected. This flexibility makes \ggplot extremely powerful, as types of plots not considered when the \ggplot package was designed can be easily created.

\begin{warningbox}
When a ggplot is built, the whole plot and its components are created as \Rlang objects that can be saved in the workspace or written to a file as \Rlang objects. These objects encode a recipe for constructing the plot, not its final graphical representation. The graphical representation is generated when the object is printed, explicitly or not. Thus, the same \code{"gg"} plot object can be rendered into different bitmap and vector graphic formats for display and/or printing.
\end{warningbox}

The transformation of a set of data or observations into a rendered graphic with package \pkgname{ggplot2} can be represented as a flow of information, but also as a sequence of actions. What prevents this flexibility from becoming a burden on users is that, in most cases, adequate defaults are used when the user does not provide explicit ``instructions''. The recipe to build a plot needs to specify a) the data to use, b) which variable to map to which graphical property (or aesthetic), c) which layers to add and which geometric representation to use, d) the scales that establish the link between data values and aesthetic values, e) a coordinate system (affecting only aesthetics $x$, $y$ and possibly $z$), and f) a theme to use. The result from constructing a plot object using the grammar of graphics is an \Rlang object containing a ``recipe for a plot'', including the data, which behaves similarly to other \Rlang objects.
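
As a minimal sketch of such a recipe, the chunk below builds the same scatter plot twice: once relying on defaults and once spelling out components a) to f). The object names \code{p.default} and \code{p.explicit} are arbitrary, and the explicit calls assume that the package defaults have not been changed in the session. Function \code{layer\_data()} returns the data of a layer after the plot is built, which makes it possible to compare the two recipes.

<<>>=
library(ggplot2)

# succinct recipe relying on defaults
p.default <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point()

# the same recipe with the defaults written out
p.explicit <-
  ggplot(data = mtcars,                                   # a) data
         mapping = aes(x = disp, y = mpg)) +              # b) mapping
  geom_point(stat = "identity", position = "identity") +  # c) layer
  scale_x_continuous() +                                  # d) scales
  scale_y_continuous() +
  coord_cartesian() +                                     # e) coordinates
  theme_gray()                                            # f) theme

# compare the positions computed for the points in both plots
all.equal(layer_data(p.default)[c("x", "y")],
          layer_data(p.explicit)[c("x", "y")])
@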

\subsection{The words of the grammar}
Before building a plot step by step, I introduce the different components of a ggplot recipe, or the words of the grammar of graphics.

\paragraph{Data}
The\index{grammar of graphics!data} data to be plotted must be available as a \code{data.frame} or \code{tibble}, with data stored so that each row represents a single observation event, and the columns are different values observed in that single event. In other words, in long form (so-called ``tidy data'') as described in chapter \ref{chap:R:data}. The variables to be plotted can be \code{numeric}, \code{factor}, \code{character}, and time or date stored as \code{POSIXct}. (Some extensions to \pkgname{ggplot2} add support for other types of data such as time series).

\paragraph{Mapping}
When\index{grammar of graphics!mapping of data} constructing a plot, data variables have to be mapped to aesthetics\index{plots!aesthetics} (or graphic properties). Most plots will have an $x$ dimension, which is one of the \emph{aesthetics}, and a variable containing numbers (or categories) mapped to it. The position on a 2D plot of, say, a point, will be determined by $x$ and $y$ aesthetics, while in a 3D plot, three aesthetics need to be mapped: $x$, $y$, and $z$. Many aesthetics are not related to coordinates; they are properties, like colour, size, shape, line type, or even rotation angle, which add an additional dimension on which to represent the values of variables and/or constants.

\paragraph{Statistics}
Statistics\index{grammar of graphics!statistics} are ``words'' that represent calculation of summaries or some other operation on the values in the data. When \emph{statistics} are used for a computation, the returned value is passed to a \emph{geometry}, and consequently adding a \emph{statistic} also adds a layer to the plot. For example, \ggstat{stat\_smooth()} fits a smoother, and \ggstat{stat\_summary()} applies a summary function such as \code{mean()}. Most statistics are applied by group when data have been grouped by mapping additional aesthetics such as colour to a factor.

\paragraph{Geometries}
\sloppy%
Geometries\index{grammar of graphics!geometries} are ``words'' that describe the graphical representation of the data: for example, \gggeom{geom\_point()} plots a point or symbol for each observation or summary value, while \gggeom{geom\_line()} draws line segments between observations. Some geometries rely by default on statistics, but most ``geoms'' default to the identity statistic. Each time a \emph{geometry} is used to add a graphical representation of data to a plot, one says that a new \emph{layer} has been added. The\index{plots!layers} grammar of graphics allows plots to contain multiple layers. The name \emph{layer} reflects the fact that each new layer added is plotted on top of the layers already present in the plot, or rather when a plot is printed the layers will be generated in the order they were added to the plot object. For example, one layer in a plot can display the observations, another layer a regression line fitted to them, and a third one may contain annotations such as an equation or a text label.

\paragraph{Positions}
Positions\index{grammar of graphics!positions} are ``words'' that determine the displacement or not of graphical plot elements relative to their original $x$ and $y$ coordinates. They are one of the arguments accepted by \emph{geometries}. Position \ggposition{position\_identity()} introduces no displacement, and for example, \ggposition{position\_stack()} makes it possible to create stacked bar plots and stacked area plots. Positions will be discussed together with geometries as they are always subordinate to them.

\paragraph{Scales}
Scales\index{grammar of graphics!scales} give the ``translation'' or mapping between data values and the aesthetic values to be actually plotted. Mapping a variable to the ``colour'' aesthetic (also recognised when spelled as ``color'') only tells that different values stored in the mapped variable will be represented by different colours. A scale, such as \ggscale{scale\_colour\_continuous()}, will determine which colour in the plot corresponds to which value in the variable. Scales can also define transformations on the data, which are used when mapping data values to aesthetic values. All continuous scales support transformations---e.g., in the case of $x$ and $y$ aesthetics, positions on the plotting region or graphic viewport will be affected by the transformation, while the original values are used for tick labels along the axes or in keys for shapes, colours, etc. Scales are used for all aesthetics, including continuous variables, such as numbers, and categorical ones such as factors. The grammar of graphics allows only one scale per \emph{aesthetic} and plot. This restriction is imposed by design to avoid ambiguity (e.g., it ensures that the red colour will have the same ``meaning'' in all plot layers where the \code{colour} \emph{aesthetic} is mapped to data). Scales have limits that are set automatically unless supplied explicitly.

\paragraph{Coordinate systems}
The\index{grammar of graphics!coordinates} most frequently used coordinate system when plotting data, the cartesian system, is the default for most \emph{geometries}. In the cartesian system, $x$ and $y$ are represented as distances on two orthogonal (at 90$^\circ$) axes. Additional coordinate systems are available in \pkgname{ggplot2} and through extensions. For example, in the polar system of coordinates, the $x$ values are mapped to angles around a central point and $y$ values to the radius. Setting limits to a coordinate system changes the region of the plotting space visible in the plot, but does not discard observations. In other words, when using \emph{statistics}, observations that are not visible in the rendered plot will still be included in computations when they are excluded by coordinate limits, but will be ignored when they are excluded by scale limits.
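
The difference between the two kinds of limits can be checked with a short sketch. It assumes the default behaviour of continuous scales, in which out-of-bounds observations are discarded with a warning; the limits \code{c(100, 300)} and the object names are arbitrary choices made only for illustration. Function \code{layer\_data()} returns the values computed by the statistic, here the fitted line from \code{stat\_smooth()}.

<<>>=
library(ggplot2)

p.base <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  stat_smooth(geom = "line", method = "lm", formula = y ~ x)

# scale limits: observations outside them are dropped before the fit
p.scale <- p.base + scale_x_continuous(limits = c(100, 300))
# coordinate limits: all observations are used, the plot is only "zoomed"
p.coord <- p.base + coord_cartesian(xlim = c(100, 300))

# range of x values over which the fitted line was computed
# (the first call is expected to warn about removed rows)
range(layer_data(p.scale)$x)
range(layer_data(p.coord)$x)
@

With scale limits the fitted line spans only the observations within the limits, while with coordinate limits it spans the whole range of \code{disp}, even if only part of the line is visible in the rendered plot.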

\paragraph{Themes}
How\index{grammar of graphics!themes} the plots look when displayed or printed can be altered by means of themes. A plot can be saved without adding a theme and then printed or displayed using different themes. Also, individual theme elements can be changed, and whole new themes defined. This adds a lot of flexibility and helps in the separation of the data representation aspects from those related to the graphical design.

\paragraph{Operators}
The\index{grammar of graphics!operators} elements described above are assembled into a ggplot object using operator \Roperator{+} and exceptionally using \Roperator{\%+\%}. The choice of these operators makes sense, as ggplot objects are built by sequentially adding members or elements to them.
\index{grammar of graphics!elements|)}

\begin{warningbox}
The functions corresponding to the different elements of the grammar of graphics have distinctive names with the first few letters hinting at their roles: aesthetics mappings (\code{aes}), geometric elements (\code{geom\_\ldots}), statistics (\code{stat\_\ldots}), scales (\code{scale\_\ldots}), coordinate systems (\code{coord\_\ldots}), and themes (\code{theme\_\ldots}).
\end{warningbox}

\subsection{The workings of the grammar}\label{sec:plot:workings}
\index{grammar of graphics!plot structure|(}
\index{grammar of graphics!plot workings|(}
A \code{"gg"} plot object is an \Rlang object of mode \code{"list"} containing the recipe and data to construct a plot. It is self contained in the sense that the only requirement for rendering it into a graphical representation is the availability of package \pkgname{ggplot2}. A \code{"gg"} object contains the data in one or more data frames and instructions encoded as functions and parameters, but not yet a rendering of the plot into graphical objects. Both data transformations and rendering of the plot into drawing instructions (encoded as graphical objects or \emph{grobs}) take place at the time of printing or exporting the plot, e.g., when saving a bitmap to a file.
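
The distinction between the recipe and its rendering can be made visible with a small sketch. It assumes that function \code{ggplotGrob()} from \pkgname{ggplot2} is an acceptable stand-in for what happens at printing time: it builds the plot and returns the resulting table of graphical objects without displaying it. The names \code{p} and \code{g} are arbitrary.

<<>>=
library(ggplot2)

# the recipe: nothing is drawn when the object is created and assigned
p <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point()
inherits(p, "gg")

# the rendering: data transformations and creation of grobs happen here
g <- ggplotGrob(p)
inherits(g, "gtable")
@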

To understand ggplots, one should first think in terms of the graphical organisation of the plot: there is always a background layer onto which other layers composed of different graphical objects are laid. Each layer contains related graphical objects originating from the same data. The last layer added is the topmost and the first one added the lowermost. Graphical objects in upper layers occlude those in the layers below them if their locations overlap. Although frequently layers in a ggplot share the same data and the same mappings to aesthetics, this is not a requirement. It is possible to build ggplots with independent layers, although always with shared scales and plotting area.

%%% Drawing of a plot with layers

A second perspective on ggplots is that of the process of converting the data into a graphical representation that can be printed on paper or viewed on a computer screen. The transformations applied to the data to achieve this can be thought of as a data flow process divided into stages. The diagram in Figure \ref{fig:ggplot:stages} represents a single self-contained layer in a plot. The data supplied by the user are transformed in stages into instructions to draw a graphical representation. In \pkgname{ggplot2} and its documentation, graphical features are called \emph{aesthetics}, with the correspondence between values in the data and values of the aesthetic controlled by \emph{scales}. The values in the data are summarised by \emph{statistics}. However, when no summaries are needed, layers make use of \Rfunction{stat\_identity()}, which copies its input to its output unchanged.
\emph{Geometries} provide the ``recipe'' used to generate graphical objects from the mapped data.

\begin{figure}
{\sffamily
\centering
\resizebox{\linewidth}{!}{%
  \begin{tikzpicture}[auto]
    \node [b] (data) {layer\\ data};
    \node [cc, right = of data] (mapping1) {\textbf{start}};
    \node [b, right = of mapping1] (statistic) {statistic};
    \node [cc, right = of statistic] (mapping2) {\textbf{after\\ stat}};
    \node [b, right = of mapping2] (geometry) {geometry + scale};
    \node [cc, right = of geometry] (mapping3) {\textbf{after\\ scale}};
    \node [b, right = of mapping3] (render) {layer\\ grobs};

    \path [ll] (mapping1) -- (data) node[near end,above]{a};
    \path [ll] (statistic) -- (mapping1) node[near end,above]{b};
    \path [ll] (mapping2) -- (statistic) node[near end,above]{c};
    \path [ll] (geometry) -- (mapping2) node[near end,above]{d};
    \path [ll] (mapping3) -- (geometry) node[near end,above]{e};
    \path [ll] (render) -- (mapping3) node[near end,above]{f};
  \end{tikzpicture}}}
  \caption[Stages of data flow in a ggplot layer]{Abstract diagram of data transformations in a ggplot layer showing the stages at which mappings between variables and graphic aesthetics take place.}\label{fig:ggplot:stages}
\end{figure}

Function \code{aes()} is used to define mappings to aesthetics. The default for \Rfunction{aes()} is for the mapping to take place at the \textbf{start} (leftmost circle in the diagram above), mapping names in the user data to aesthetics such as x, y, colour, and shape. The statistic can alter the mapped data, but in most cases not which aesthetics they are mapped to. Statistics can add default mappings for additional aesthetics. In addition, the default mappings of the data returned by the statistic can be modified by user code at this later stage, \textbf{after stat}. Default mappings can be modified again at the \textbf{after scale} stage.

\begin{explainbox}
Statistics always return a mapping to the same aesthetics that they require as input. However, the values mapped to these aesthetics at the \textbf{after stat} stage are in most cases different from those at \textbf{start}. Many statistics return additional variables, which are not mapped by default to any aesthetic. These variables facilitate variations on how results from a given type of data summary are added to plots, including the use of a geometry different from the default set by the statistic. In this case, the user has to override default mappings at the \textbf{after stat} stage. The additional variables returned by statistics are listed in their documentation. (See section \ref{sec:plot:mappings} on page \pageref{sec:plot:mappings} for details.)
\end{explainbox}
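
As a hedged illustration of these later stages, the sketch below uses helper functions \code{after\_stat()} and \code{after\_scale()}, available in recent versions of \pkgname{ggplot2}, within calls to \code{aes()}. In the first plot, variable \code{density} returned by \code{stat\_bin()} is mapped to $y$ in place of the default \code{count}. In the second plot, the fill of the boxes is derived from the colour already chosen by the scale. The number of bins and the transparency value are arbitrary choices.

<<>>=
library(ggplot2)

# after stat: map a variable computed by the statistic
p.density <- ggplot(mtcars, aes(x = mpg)) +
  stat_bin(aes(y = after_stat(density)), bins = 8)
p.density

# the y aesthetic now holds the computed densities
d <- layer_data(p.density)
all.equal(d$y, d$density)

# after scale: derive one aesthetic from the scaled values of another
p.fill <- ggplot(mtcars,
                 aes(x = factor(cyl), y = mpg, colour = factor(cyl))) +
  geom_boxplot(aes(fill = after_scale(alpha(colour, 0.3))))
p.fill
@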

\begin{warningbox}
As mentioned above, all ggplot layers include a statistic and a geometry. From the perspective of the construction of a plot using the grammar, both \code{stats} and \code{geoms} are layer constructor functions. While \code{stats} take a \code{geom} as one of their arguments, \code{geoms} take a \code{stat} as one of their arguments. Thus, in both cases, a \code{stat} and a \code{geom} are added as a layer, and their role and position in the data flow remain the same, i.e., the diagram in Figure \ref{fig:ggplot:stages} applies independently of how the layers are added to the plot. The default statistic of many geometries is \ggstat{stat\_identity()} making their behaviour when added to a plot as if the layer they create contained no statistics.
\end{warningbox}
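
The equivalence of the two ways of adding a layer can be checked with a short sketch, in which the same combination of statistic and geometry is requested once through \code{stat\_smooth()} and once through \code{geom\_line()}. The object names are arbitrary, and \code{layer\_data()} is used only to compare the values that reach the geometry.

<<>>=
library(ggplot2)

p.base <- ggplot(mtcars, aes(x = disp, y = mpg))

# layer constructed by naming the geometry in a call to a stat
p.stat <- p.base +
  stat_smooth(geom = "line", method = "lm", formula = y ~ x)
# layer constructed by naming the statistic in a call to a geom
p.geom <- p.base +
  geom_line(stat = "smooth", method = "lm", formula = y ~ x)

# both layers hand the same fitted line to the same geometry
all.equal(layer_data(p.stat)[c("x", "y")],
          layer_data(p.geom)[c("x", "y")])
@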

There are some statistics in \pkgname{ggplot2} that have companion geometries that can be used (almost) interchangeably. This tends to lead to confusion, and in this book, only geometries that have as default \ggstat{stat\_identity()} are described as geometries in section \ref{sec:plot:geometries}. In the case of those that by default use other statistics, like \gggeom{geom\_smooth()}, only the companion statistic, \ggstat{stat\_smooth()} for this example, is described in section \ref{sec:plot:statistics}.

A ggplot can have a single layer or many layers, but when ggplots have more than one layer, the data flow, computations, and generation of graphical objects take place independently for each layer. As mentioned above, most ggplots do not have fully independent layers, but the layers share the same data and aesthetic mappings at the \textbf{start}. Beyond this point, computations in layers are always independent of those in other layers, except that for a given aesthetic only one scale is allowed per plot.


\index{grammar of graphics!plot workings|)}
\index{grammar of graphics!plot structure|)}

\subsection{Plot construction}
\index{grammar of graphics!plot construction|(}

As the use of the grammar is easier to demonstrate by example than to explain with words, I will show how to build plots of increasing complexity, starting from the simplest possible. All elements of a plot have defaults, although in some cases these defaults result in empty plots. Defaults make it possible to create a plot very succinctly. When building a plot step by step, the different viewpoints described in the previous section are relevant: the static structure of the plot's \Rlang object, the final graphic output, and the transformations that the data undergo ``in transit'' from the recipe stored in an object to the graphic output. In this section, I emphasise the syntax of the grammar and how it translates into a plot.

Function \code{ggplot()} by default constructs an empty plot. This is similar to how \code{character()}, \code{numeric()}, etc. construct empty vectors. This empty skeleton of a plot when printed is displayed as a grey rectangle.

<<ggplot-basics-01>>=
ggplot()
@

A data frame passed as an argument to \code{data} without adding a mapping results in the same empty grey rectangle (not shown). Data frame \Rdata{mtcars} is a data set included in \Rlang (to read a description, type \code{help("mtcars")} at the \Rlang command prompt).

<<ggplot-basics-02, eval=eval_plots_all>>=
ggplot(data = mtcars)
@

Once the data are available, a graphical or geometric representation needs to be selected. The geometry used, such as \code{geom\_point()} and \code{geom\_line()}, drawing separate points for the observations or connecting them with lines, respectively, defines the type of plot. A mapping defines which property of the geometric elements will be used to represent the values from a variable in the user's data. Most geometries require mappings to both $x$ and $y$ aesthetics, as they establish the position of the geometrical shapes like points or lines in the plotting area. Additional aesthetics like colour make use of default scales and palettes. These defaults can be overridden with \code{scale} functions added to the plot (see section \ref{sec:plot:scales}).

Mapping at the \textbf{start} stage, \code{disp} to $x$ and \code{mpg} to $y$ aesthetics, makes the ranges of the values available. They are used to find default limits for the $x$ and $y$ scales as reflected in the plot axes. The plotting area $x$ and $y$ now match the ranges of the mapped variables, expanded by a small margin. The axis labels also reflect the names of the mapped variables. However, no graphical elements are displayed yet for the individual observations.% ({\small\textsf{data $\to$ aes $\to$ \emph{ggplot object}}})

<<ggplot-basics-03>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg))
@

Observations are made visible by the addition of a suitable \emph{geometry} or \code{geom} to the plot recipe. Below, adding \gggeom{geom\_point()} makes the observations visible as points or symbols. %({\small\textsf{data $\to$ aes $\to$ geom $\to$ \emph{ggplot object}}})

<<ggplot-basics-04>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point()
@

\begin{warningbox}
In the examples above, the plots were printed automatically, which is the default at the \Rlang console. However, as with other \Rlang objects, ggplots can be assigned to a variable.

<<ggplot-basics-04-wb1>>=
p1 <- ggplot(data = mtcars,
             mapping = aes(x = disp, y = mpg)) +
       geom_point()
@

\noindent
and printed at a later time, and saved to and read from files on disk.

<<ggplot-basics-04-wb2, eval=eval_plots_all>>=
print(p1)
@

Layers and other elements can also be added to a saved ggplot, as the saved objects are not the graphical representation of the plots themselves but instead a \emph{recipe} plus the data needed to build them.
\end{warningbox}
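
A possible sketch of this workflow follows: a plot object is written to a file, read back, and a layer is added to the restored object before printing it. The use of \code{saveRDS()} and \code{readRDS()} with a temporary file is only one of several ways of saving \Rlang objects, and the names \code{rds.file}, \code{p1.restored}, and \code{p2} are arbitrary.

<<>>=
library(ggplot2)

p1 <- ggplot(data = mtcars,
             mapping = aes(x = disp, y = mpg)) +
  geom_point()

# write the recipe (including the data) to a file and read it back
rds.file <- tempfile(fileext = ".rds")
saveRDS(p1, file = rds.file)
p1.restored <- readRDS(rds.file)

# the restored object is still a recipe, so layers can be added to it
p2 <- p1.restored +
  stat_smooth(geom = "line", method = "lm", formula = y ~ x)
p2
@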

\begin{advplayground}
As for any \Rlang object, \code{str()} displays the structure of \code{"gg"} objects. In addition, package \pkgname{ggplot2} provides a \code{summary()} method for \code{"gg"} plot objects.

As you make progress through the chapter, use these methods to explore the \code{"gg"} plot objects you construct, paying attention to layers, and global vs.\ layer-specific data and mappings. You will learn how the plot components are stored as members of \code{"gg"} plot objects.
\end{advplayground}

Although \emph{aesthetics} are usually mapped to variables in the data, constant aesthetic values can be passed as arguments to layer functions, consistently controlling a property of all elements in a layer. While variables in \code{data} can be mapped using \code{aes()} either as whole-plot defaults, as shown above, or within individual layers, constant values for aesthetics have to be set, as shown here, as named arguments passed directly to layer functions, instead of to a call to \code{aes()}.

<<ggplot-basics-04a>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point(colour = "blue", shape = "square")
@

\begin{warningbox}
Mapping an aesthetic to a constant value within a call to \Rfunction{aes()} adds a column containing this value to the data frame received as input by the statistic. This value is not interpreted as an aesthetic value but instead as a data value. The example below recreates the plot above, but passes the constants within a call to \Rfunction{aes()}.

<<ggplot-basics-04b>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point(mapping = aes(colour = "blue", shape = "square"))
@

The plot contains red circles instead of blue squares!

In principle, one could correct this plot by adding suitable \code{scales} but this would be still wasteful by unnecessarily storing many copies of the constant \code{"blue"} in the \code{"gg"} plot object.
\end{warningbox}
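
One way of making such a correction, shown here only as a sketch, is to add identity scales, which use the data values as aesthetic values without translating them. The choice of \code{scale\_colour\_identity()} and \code{scale\_shape\_identity()} is an assumption about which scales are ``suitable'' for this case, and the name \code{p.fixed} is arbitrary.

<<>>=
library(ggplot2)

p.fixed <- ggplot(data = mtcars,
                  mapping = aes(x = disp, y = mpg)) +
  geom_point(mapping = aes(colour = "blue", shape = "square")) +
  scale_colour_identity() +  # "blue" is used as is as a colour
  scale_shape_identity()     # "square" is used as is as a shape
p.fixed

# the constant is still stored once per observation in the layer data
unique(layer_data(p.fixed)$colour)
@

Passing the constants directly to \code{geom\_point()}, as in the earlier example, remains the simpler approach.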

While a geometry directly constructs during rendering a graphical representation of the observations or summaries in the data it receives as input, a \emph{statistic} or \code{stat} ``sits'' in-between the data and a \code{geom}, applying some computation, usually but not always, to produce a statistical summary of the data. Here \code{stat\_smooth()} fits a linear regression (see section \ref{sec:stat:LM:regression} on page \pageref{sec:stat:LM:regression}) and passes the resulting predicted values to \gggeom{geom\_line()}. Passing \code{method = "lm"} selects \code{lm()} as the model fitting function. Passing \code{formula = y ~ x} sets the model to be fitted. This plot has two layers, one from geometry \gggeom{geom\_point()} and one from \gggeom{geom\_line()}.%({\small\textsf{data $\to$ aes $\to$ stat $\to$ geom $\to$ \emph{ggplot object}}})

<<ggplot-basics-05>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  stat_smooth(geom = "line", method = "lm", formula = y ~ x)
@
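
To make the role of the statistic more concrete, the sketch below uses \code{layer\_data()} to extract the data frame that the second layer's geometry receives, and compares the predicted values with those obtained by calling \code{lm()} and \code{predict()} directly. The names \code{p.fit}, \code{fitted.df}, \code{fm} and \code{direct.y} are used only for this illustration.

<<ggplot-basics-05-sketch>>=
library(ggplot2)
p.fit <-
  ggplot(data = mtcars,
         mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  stat_smooth(geom = "line", method = "lm", formula = y ~ x)
# data returned by the statistic and received by the geometry
fitted.df <- layer_data(p.fit, i = 2)
head(fitted.df[ , c("x", "y")])
# the same predictions computed directly
fm <- lm(mpg ~ disp, data = mtcars)
direct.y <- predict(fm, newdata = data.frame(disp = fitted.df$x))
all.equal(fitted.df$y, unname(direct.y))
@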

The plots above relied on defaults for \emph{scales}, \emph{coordinates} and \emph{themes}. In the examples below, the defaults are overridden by arguments that produce differently rendered plots. Adding \ggscale{scale\_y\_log10()} applies a logarithmic transformation to the values mapped to $y$. This works like plotting using graph paper with rulings spaced according to a logarithmic scale. Tick marks continue to be expressed in the original units, but statistics are applied to the transformed data. In other words, the transformation specified in the scale affects the values in advance of the \textbf{start} stage, before they are mapped to aesthetics and passed to \emph{statistics}. Thus, in this example, the linear regression is fitted to \code{log10()} transformed $y$ values and the original $x$ values.%({\small\textsf{data $\to$ aes $\to$ stat $\to$ geom $\to$ scale $\to$ \emph{ggplot object}}})

<<ggplot-basics-06>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  stat_smooth(geom = "line", method = "lm", formula = y ~ x) +
  scale_y_log10()
@
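
The claim that the regression is fitted to the transformed values can be checked with a small sketch like the one below. It assumes that the data extracted with \code{layer\_data()} are expressed in the transformed units of the scale; the names \code{p.log}, \code{log.df}, \code{fm.log} and \code{direct.log} are used only for this illustration.

<<ggplot-basics-06-sketch>>=
library(ggplot2)
p.log <-
  ggplot(data = mtcars,
         mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  stat_smooth(geom = "line", method = "lm", formula = y ~ x) +
  scale_y_log10()
# predicted values as computed by the statistic
log.df <- layer_data(p.log, i = 2)
# linear regression fitted directly to log10() transformed y values
fm.log <- lm(log10(mpg) ~ disp, data = mtcars)
direct.log <- predict(fm.log, newdata = data.frame(disp = log.df$x))
all.equal(log.df$y, unname(direct.log))
@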

The range limits of a scale can be set manually, instead of automatically as by default. These limits create a virtual \emph{window into the data}: out-of-bounds (oob) observations, those outside the scale limits, remain hidden and are not mapped to aesthetics---i.e., these observations are not included in the graphical representation or used in calculations. Crucially, when using \emph{statistics} the computations are only applied to observations that fall within the limits of all scales in use. These limits \emph{indirectly} affect the plotting area when the plotting area is automatically set based on the range of the (within limits) data---even the mapping to values of a different aesthetic may change when a subset of the data is selected by manually setting the limits of a scale.

In contrast to \emph{scale limits}, \emph{coordinates}\index{grammar of graphics!cartesian coordinates} function as a \emph{zoomed view} into the plotting area, and do not affect which observations are visible to \emph{statistics}. The coordinate system, as expected, is also determined by this grammar element. Below, Cartesian coordinates, which are the default, are added explicitly so that the default $y$ limits can be overridden. %({\small\textsf{data $\to$ aes $\to$ stat $\to$ geom $\to$ coordinate $\to$ theme $\to$ \emph{ggplot object}}})

<<ggplot-basics-07>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  stat_smooth(geom = "line", method = "lm", formula = y ~ x) +
  coord_cartesian(ylim = c(15, 25))
@
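
The difference between zooming with coordinates and setting scale limits can be explored with a sketch like the one below, in which the same limits are set in the two ways and the fitted values extracted with \code{layer\_data()}. With scale limits, \pkgname{ggplot2} is expected to warn about removed observations. The names \code{p.common}, \code{p.zoom}, \code{p.limits}, \code{zoom.df} and \code{limits.df} are used only for this illustration.

<<ggplot-basics-07-sketch>>=
library(ggplot2)
p.common <-
  ggplot(data = mtcars,
         mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  stat_smooth(geom = "line", method = "lm", formula = y ~ x)
# zoomed view: all observations are used in the fit
p.zoom <- p.common + coord_cartesian(ylim = c(15, 25))
# scale limits: out-of-bounds observations are excluded from the fit
p.limits <- p.common + scale_y_continuous(limits = c(15, 25))
zoom.df <- layer_data(p.zoom, i = 2)
limits.df <- layer_data(p.limits, i = 2)
range(zoom.df$y)
range(limits.df$y)
@

If the two ranges differ, the fitted lines differ, because they were fitted to different subsets of the observations.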

The next example uses a coordinate system transformation. When the transformation is applied to the coordinate system, it affects only the plotting---it sits between the \code{geom} and the rendering of the plot. The transformation is applied to the values that were returned by \emph{statistics}. The straight line fitted is plotted on the transformed coordinates as a curve, because the model was fitted to the untransformed data obtaining untransformed predicted values. The coordinate transformation is applied to these predicted values and plotted. (Other coordinate systems are described in sections \ref{sec:plot:sf} and \ref{sec:plot:circular} on pages \pageref{sec:plot:sf} and \pageref{sec:plot:circular}, respectively.)

<<ggplot-basics-08>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  stat_smooth(geom = "line", method = "lm", formula = y ~ x) +
  coord_trans(y = "log10")
@

Themes affect the rendering of plots at the time of printing---they can be thought of as style sheets defining the graphic design. A complete theme can override the default gray theme. The plot is the same, the observations are represented in the same way, the limits of the axes are the same and all text is the same. On the other hand, how these elements are rendered by different themes can be drastically different.% ({\small\textsf{data $\to$ aes $\to$ $\to$ geom $\to$ theme $\to$ \emph{ggplot object}}}

<<ggplot-basics-09>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  theme_classic()
@

Both the base font size and the base font family can be changed. The base font size controls the size of all text elements, as other sizes are defined relative to the base size. How the plot looks changes when using the same theme as in the previous example, but with a different base point size and font family for text elements. (The use of themes is discussed in section \ref{sec:plot:themes} on page \pageref{sec:plot:themes}.)

<<ggplot-basics-10>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  theme_classic(base_size = 20, base_family = "serif")
@

How to set axis labels, tick positions, and tick labels will be discussed in depth in section \ref{sec:plot:scales} on page \pageref{sec:plot:scales}. Function \code{labs()} is \emph{a convenience function} used to set the title and subtitle of a plot and to replace the default \code{name} of scales, here displayed as axis labels. The default \code{name} of scales is the name of the mapped variable. In the call to \code{labs()}, the names of aesthetics are used as if they were formal parameters with character strings or \Rlang expressions as arguments. Below \code{x} and \code{y} are the names of the two \emph{aesthetics} to which two variables in \code{data} were mapped, \code{disp} and \code{mpg}, respectively. Formal parameters \code{title} and \code{subtitle} add these plot elements. (The escaped character \verb|\n| stands for new line, see section \ref{sec:calc:character} on page \pageref{sec:calc:character}.)

<<ggplot-basics-11>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  labs(x = "Engine displacement (cubic inches)",
       y = "Fuel use efficiency\n(miles per gallon)",
       title = "Motor Trend Car Road Tests",
       subtitle = "Source: 1974 Motor Trend US magazine")
@

As elsewhere in \Rlang, when a value is expected, either a value stored in a variable or a more complex statement returning a suitable value can be passed as an argument to be mapped to an \emph{aesthetic}. In other words, the values to be plotted do not need to be stored as variables (or columns) in the data frame passed as an argument to parameter \code{data}; they can also be computed from these variables. Below, miles per gallon, \code{mpg}, are plotted against the engine displacement per cylinder, computed by dividing \code{disp} by \code{cyl} within the call to \code{aes()}.

<<ggplot-basics-info-01>>=
ggplot(data = mtcars,
       mapping = aes(x = disp / cyl, y = mpg)) +
  geom_point()
@

Each of the elements of the grammar exemplified above is implemented in multiple functions, and in addition these functions accept arguments that can be used to modify their behaviour. Multiple data objects as well as multiple mappings can coexist within a single \code{"gg"} plot object. Packages and user code can define new \emph{geometries}, \emph{statistics}, \emph{scales}, \emph{coordinates}, and even implement new \emph{aesthetics}. Individual elements in a \emph{theme} can be modified and new complete \emph{themes} created, re-used and shared. I describe below how to use the grammar of graphics to construct different types of data visualisations, both simple and complex. Because the different elements interact, I introduce some of them first briefly in sections other than where I describe them in depth.
\index{grammar of graphics!plot construction|)}

\subsection{Plots as \Rlang objects}\label{sec:plot:objects}
\index{grammar of graphics!plots as R objects|(}
\code{"gg"} plot objects and their components behave as other \Rlang objects. Operators and methods for the \code{"gg"} class are available. As above, a \code{"gg"} plot object saved as \code{p1} is used below.

<<ggplot-basics-04-wb1>>=
@

In the previous section, operator \code{+} was used to assemble the plots from ``anonymous'' \Rlang objects. Saved or ``named'' objects can also be combined with \code{+}.

<<ggplot-objects-02>>=
p1 + stat_smooth(geom = "line", method = "lm", formula = y ~ x)
@

Above, plot elements were added one by one, with operator \code{+}. Multiple components can also be added in a single operation. Like individual components, sets of components stored in a list can be saved in a variable and added to multiple plots. This ensures consistency and makes coordinated alterations to a set of plots easier. \emph{Throughout this chapter, I use this approach to achieve conciseness and to highlight what is different and what is not among plots in related examples.}

<<ggplot-objects-info-01>>=
p.ls <- list(
  stat_smooth(geom = "line", method = "lm", formula = y ~ x),
  scale_y_log10())
@

<<ggplot-objects-info-02>>=
p1 + p.ls
@

\begin{playground}
  Reproduce the examples in the previous section, using \code{p1} defined above as a basis instead of building each plot from scratch.
\end{playground}

\begin{warningbox}
\index{grammar of graphics!structure of plot objects|(}
The separation of plot construction and rendering is possible because \code{"gg"} objects are self-contained. A copy of the data object passed as an argument to \code{data} is saved within the plot object, similarly to model-fit objects. In the example above, \code{p1} by itself could be saved to a file on disk and loaded into a clean \Rlang session, even on another computer, and rendered as long as package \ggplot and its dependencies are available. Another consequence of storing a copy of the data in the plot object is that later changes to the data object used to create a \code{"gg"} object are \emph{not} reflected in newly rendered plots from this object: the \code{"gg"} object needs to be created anew.
\end{warningbox}
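
The behaviour described in the warning above can be checked with a minimal sketch. The data frame \code{my.df}, the plot objects \code{p.saved} and \code{p.restored}, and the temporary file are hypothetical and used only for this illustration.

<<ggplot-objects-selfcontained-sketch>>=
library(ggplot2)
my.df <- data.frame(x = 1:5, y = c(2, 4, 3, 5, 1))
p.saved <-
  ggplot(data = my.df, mapping = aes(x = x, y = y)) +
  geom_point()
# modify the data frame after the plot object has been created
my.df$y <- my.df$y * 100
# the plot object still contains the original values
layer_data(p.saved, i = 1)$y
# round trip through a file on disk
tmp.file <- tempfile(fileext = ".rds")
saveRDS(p.saved, file = tmp.file)
p.restored <- readRDS(tmp.file)
p.restored
@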

\begin{explainbox}
The \emph{recipe} for a plot is stored in a \code{"gg"} plot object. Objects of class \code{"gg"} are of mode \code{"list"}. In \Rlang, lists can contain heterogeneous members and \code{"gg"} objects contain data, function definitions, and unevaluated expressions. In other words, they contain the data plus instructions to transform the data, to map them into graphic objects, and to control various aspects of the rendering, from scale limits to the typefaces to use. (\Rlang lists are described in section \ref{sec:calc:lists} on page \pageref{sec:calc:lists}.)

Top level members of the \code{"gg"} plot object \code{p1}, a simple plot, are displayed below with method \code{summary()}, which shows the components without making explicit the structure of the object.

<<ggplot-objects-03a>>=
summary(p1)
@

Method \code{str()} shows the structure of objects and can also be used to advantage with ggplots (long output not shown). Alternatively, \code{names()} extracts the names of the top-level members of \code{p1}.

<<ggplot-objects-03b>>=
names(p1)
@
\end{explainbox}

\begin{advplayground}
Explore in more detail the different members of object \code{p1}. For example, the code statement below extracts member \code{"layers"} from object \code{p1} and displays its structure.

<<ggplot-objects-box-03, eval=FALSE>>=
str(p1$layers, max.level = 1)
@

How many layers are present in this case?
\end{advplayground}
\index{grammar of graphics!structure of plot objects|)}
\index{grammar of graphics!plots as R objects|)}

\subsection{Scales and mappings}\label{sec:plot:mappings}
\index{grammar of graphics!mapping of data|(}
\index{grammar of graphics!aesthetics|(}
In \ggplot, a \emph{mapping} describes which variable in \code{data} is mapped to which \code{aesthetic},  or graphic feature of a plot, such as $x$, $y$, colour, fill, shape, and linewidth. In \ggplot, a \emph{scale} describes the correspondence between \emph{values} in the mapped variable and values of the graphic feature. Below, the numeric variable \code{cyl} is mapped to the \code{colour} aesthetic. As the variable is \code{numeric}, a continuous colour scale is used. Out of the multiple continuous colour scales available, \ggscale{scale\_colour\_continuous()} is the default.

<<ggplot-basics-12a>>=
p2 <-
  ggplot(data = mtcars,
         mapping = aes(x = disp, y = mpg, colour = cyl)) +
  geom_point()
p2
@

Without changing the \code{mapping}, a different-looking plot can be created by changing the scale used. Below, in addition, a palette is selected with \code{option = "magma"} and the range of colours used from this palette adjusted with \code{end = 0.85}.

<<ggplot-basics-12b>>=
p2 + scale_colour_viridis_c(option = "magma", end = 0.85)
@

Conceptually, changing the scale used for the \code{colour} aesthetic does not modify the plot, except for the colours used. There is a separation between the semantic structure of the plot and its graphic design. Still, how the audience interacts with and perceives the plot depends on both of these concerns.

Some scales, like those for \code{colour}, exist in multiple ``flavours'', suitable for numeric variables (continuous values) or for factors (discrete values). If \code{cyl} is converted into a \code{factor}, a discrete colour scale is used instead of a continuous one. Out of the different discrete scales, \ggscale{scale\_colour\_discrete()} is used by default.

<<ggplot-basics-12c>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg, colour = factor(cyl))) +
  geom_point()
@

If \code{cyl} is converted into an \code{ordered} factor, an ordinal colour scale is used, by default \ggscale{scale\_colour\_ordinal()} (plot not shown).

<<ggplot-basics-12d, eval=eval_plots_all>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg, colour = ordered(cyl))) +
  geom_point()
@

The scales for other aesthetics work in a similar way as those for colour. Scales are described in detail in section \ref{sec:plot:scales} on page \pageref{sec:plot:scales}.

In the examples above for simple plots, based on data contained in a single data frame, mappings were established  by passing the value returned by the call to \Rfunction{aes()} as the argument to parameter \code{mapping} of \Rfunction{ggplot()}.

Arguments passed to \code{data} and/or \code{mapping} parameters of \Rfunction{ggplot()} work as defaults for all layers in a plot. If arguments are passed to the identically named parameters of a layer function---statistic or geometry---, they are applied to the layer, overriding whole-plot defaults, if they exist. Consequently, the code below creates a plot, \code{p3}, identical to \code{p2} above.

<<ggplot-basics-13, eval=eval_plots_all>>=
p3 <-
  ggplot() +
  geom_point(data = mtcars,
             mapping = aes(x = disp, y = mpg, colour = cyl))
p3
@

These examples demonstrate two different approaches that are equally convenient for simple plots with a single layer. However, if a plot has multiple layers based on the same data, the approach used for \code{p2}  makes this clear and is concise. If each layer uses different data and/or different mappings, the second approach is necessary.
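
As a sketch of the case in which each layer uses different data, the code below plots the individual observations from \code{mtcars} in one layer and, in a second layer, the means for each value of \code{cyl}. The summary data frame \code{means.df}, computed here with \code{aggregate()}, and the name \code{p.two} are hypothetical and used only for this illustration.

<<ggplot-basics-13-sketch>>=
library(ggplot2)
# means of disp and mpg for each number of cylinders
means.df <- aggregate(cbind(disp, mpg) ~ cyl, data = mtcars, FUN = mean)
p.two <-
  ggplot() +
  # first layer: individual observations
  geom_point(data = mtcars,
             mapping = aes(x = disp, y = mpg),
             colour = "grey50") +
  # second layer: different data and a different mapping
  geom_point(data = means.df,
             mapping = aes(x = disp, y = mpg, colour = factor(cyl)),
             size = 4)
p.two
@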

\begin{explainbox}
In some cases, when flexibility is needed while constructing complex plots with multiple layers, other \emph{idioms} can be preferable, e.g., when assembling a plot from ``pieces'' stored in variables or built programmatically.

The default mapping can also be added directly with the \code{+} operator, instead of being passed as an argument to \Rfunction{ggplot()}.

<<ggplot-basics-14, eval=eval_plots_all>>=
ggplot(data = mtcars) +
  aes(x = disp, y = mpg) +
  geom_point()
@

It is also possible to have a default mapping for the whole plot, but no default data.

<<ggplot-basics-15, eval=eval_plots_all>>=
ggplot() +
  aes(x = disp, y = mpg) +
  geom_point(data = mtcars)
@

A mapping saved in a variable (example below), as well as a mapping returned by a function call (shown above for \code{aes()}), can be passed as an argument to parameter \code{mapping}.

<<ggplot-basics-15a, eval=eval_plots_all>>=
my.mapping <- aes(x = disp, y = mpg)
ggplot(data = mtcars,
       mapping = my.mapping) +
  geom_point()
@

In all these examples, the plot remains unchanged (not shown). However, the flexibility of the grammar allows the assembly of plots from separately constructed pieces and the reuse of these pieces by storing them in variables. These approaches can be very useful in scripts that construct consistently formatted sets of plots, or when the same mapping needs to be used consistently in multiple plots.
\end{explainbox}

The mapping to aesthetics in the call to \Rfunction{aes()} does not have to be to a variable from \code{data} as in examples above. A code statement that returns a value computed from one or more variables from \code{data} is also accepted. Computations during mapping help avoid the proliferation of variables in the data frames containing observations. In this simple example, \code{mpg} in miles per gallon is converted into km per litre during mapping.

<<ggplot-basics-15b>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg * 0.43)) +
  geom_point()
@

\begin{explainbox}
Operations applied to the \code{data} before they are plotted are usually implemented in \code{stats}. Sometimes it is convenient to directly modify the whole-plot default \code{data} before it reaches the layer's \code{stat} function. One approach is to pass a function to parameter \code{data} of the layer function. This argument must be the definition of a function accepting a data frame as its first argument and returning a data frame. When the argument to \code{data} is a function definition instead of the usual data frame, the function is applied to the plot's default data and the data frame returned by the function is used as the \code{data} in the layer. In the example below, an anonymous function defined in-line extracts a subset of the rows. The observations in the extracted rows are highlighted in the plot by overplotting them with smaller yellow shapes.

<<ggplot-basics-16>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point(size = 4) +
  geom_point(data = function(x){subset(x = x, cyl == 4)},
             colour = "yellow", size = 1.5)
@

The argument passed above to \code{data} is a function definition, not a function call. Thus, if a function is passed by name, no parentheses are used. No arguments can be passed to the function, except for the default \code{data} passed by position to its first parameter. Consequently, it is not possible to pass function \code{subset} directly. The anonymous function above is needed to be able to pass \code{cyl == 4} as argument.
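
A sketch of passing a function by name is shown next. The helper function \code{only.four.cyl()} and the name \code{p.named} are hypothetical, defined only for this illustration; the function's name is passed to \code{data} without parentheses.

<<ggplot-basics-16-sketch>>=
library(ggplot2)
# hypothetical helper: keeps only the rows for four-cylinder cars
only.four.cyl <- function(df) {
  df[df$cyl == 4, ]
}
p.named <-
  ggplot(data = mtcars,
         mapping = aes(x = disp, y = mpg)) +
  geom_point(size = 4) +
  # the function is applied to the plot's default data
  geom_point(data = only.four.cyl,
             colour = "yellow", size = 1.5)
p.named
@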

The plot's default data can also be operated upon using the \pkgname{magrittr} pipe operator, but not the pipe operator native to \Rlang (\Roperator{\textbar >}) or the dot-pipe operator from \pkgname{wrapr} (see section \ref{sec:data:pipes} on page \pageref{sec:data:pipes}). In this approach, the dot (\code{.}) placeholder at the head of the pipe stands for the plot's default \code{data} object. The code statement below uses a pipe as argument for \code{data} to call function \Rfunction{subset()} with \code{cyl == 4} passed as the condition. The plot, not shown, is as in the example above.

<<ggplot-basics-17, eval=eval_plots_all>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point(size = 4) +
  geom_point(data = . %>% subset(x = ., cyl == 4), colour = "yellow",
             size = 1.5)
@

A third possible approach is to test the condition within the call to \Rfunction{aes()}. In this approach, it is not possible to extract a subset of rows. Making some observations invisible by reducing their size seems straightforward. However, setting \code{size = 0} draws a very small point, still visible. Out of various possible approaches, setting size to \code{NA} skips the rows, and \code{na.rm = TRUE} silences the expected warning. This is a roundabout approach to subsetting. Notice that \ggscale{scale\_size\_identity()} is also needed. The plot, not shown, when rendered does not differ from the two examples above.

<<ggplot-basics-18, eval=eval_plots_all>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point(size = 4) +
  geom_point(colour = "yellow",
             mapping = aes(size = ifelse(cyl == 4, 1.5, NA)),
             na.rm = TRUE) +
  scale_size_identity()
@

As is usual in \Rlang, multiple approaches can be used to the same end.
\end{explainbox}

\begin{explainbox}
\emph{Late mapping}\index{grammar of graphics!mapping of data!late} of variables to aesthetics has long been possible in \pkgname{ggplot2}. The original notation enclosed the name of a variable returned by a statistic between \code{..}, but this notation was deprecated some time ago and replaced by \ggscale{stat()}. Both notations imposed a limitation: it was impossible to map a computed variable to the same aesthetic as input to the statistic and to the geometry in the same layer. There were also some other quirks that prevented passing some arguments to the geometry through the dots \code{...} parameter of a statistic.

Since version 3.3.0 of \pkgname{ggplot2}, the syntax used for mapping variables to aesthetics is based on functions \ggscale{stage()}, \ggscale{after\_stat()} and \ggscale{after\_scale()}. Function \ggscale{after\_stat()} replaces both \ggscale{stat()} and the \code{..} notation.
\end{explainbox}
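
As a minimal sketch of this syntax, the code below uses \ggscale{after\_stat()} to map the variable \code{density}, computed by the statistic used by default in \code{geom\_histogram()}, to the $y$ aesthetic, and \ggscale{after\_scale()} to derive a semitransparent \code{fill} from the already mapped \code{colour}. The names \code{p.hist}, \code{hist.df} and \code{p.box} are used only for this illustration, and \pkgname{ggplot2} 3.3.0 or later is assumed.

<<mapping-after-stat-sketch>>=
library(ggplot2)
# after_stat(): use a variable computed by the statistic
p.hist <-
  ggplot(data = mtcars, mapping = aes(x = mpg)) +
  geom_histogram(mapping = aes(y = after_stat(density)), bins = 10)
p.hist
# the bar areas are expected to add up to one
hist.df <- layer_data(p.hist, i = 1)
sum(hist.df$density * (hist.df$xmax - hist.df$xmin))
# after_scale(): modify an aesthetic after the scale has been applied
p.box <-
  ggplot(data = mtcars,
         mapping = aes(x = factor(cyl), y = mpg, colour = factor(cyl))) +
  geom_boxplot(mapping = aes(fill = after_scale(alpha(colour, 0.3))))
p.box
@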

%Variables in the data frame passed as argument to \code{data} are mapped to aesthetics before they are received as input by a statistic (possibly \code{stat\_identity()}). The mappings of variables in the data frame returned by statistics are the input to the geometry. Those statistics that operate on \textit{x} and/or \text{y} return a transformed version of these variables, by default also mapped to these aesthetics. However, in most cases other variables in addition to \textit{x} and/or \text{y} are included in the \code{data} returned by a \emph{statistic}. Although their default mapping is coded in the statistic functions' definitions, the user can modify this default mapping explicitly within a call to \code{aes()} using \ggscale{after\_stat()}, which lets us differentiate between the data frame supplied by the user and that returned by the statistic. The third stage was not accessible in earlier versions of \pkgname{ggplot2}, but lack of access was usually not insurmountable. Now this third stage can be accessed with \ggscale{after\_scale()} making coding simpler.
%
%User-coded transformations of the data are best handled at the third stage using scale transformations. However, when the intention is to jointly display or combine different computed variables returned by a statistic we need to set the desired mapping of original and computed variables to aesthetics at more than one stage.
%
The documentation of \pkgname{ggplot2} gives several good examples of cases when the new mapping syntax is useful. I give here a different example, a polynomial fitted to data using \Rfunction{rlm()}. RLM is a procedure that, before computing the residual sums of squares, automatically assigns weights to the individual residuals in an attempt to protect the estimated fit from the influence of extreme observations or outliers. When using this and similar methods, it is of interest to plot the residuals together with the weights. One approach is to map weights to a gradient between two colours. The code below constructs a data frame containing artificial data that includes an extreme value or outlier.

<<mapping-stage-01>>=
set.seed(4321)
X <- 0:10
Y <- (X + X^2 + X^3) + rnorm(length(X), mean = 0, sd = mean(X^3) / 4)
df1 <- data.frame(X, Y)
df2 <- df1
df2[6, "Y"] <- df1[6, "Y"] * 10
@

In the first plot, \ggscale{after\_stat()} is used to map variable \code{weights} computed by the statistic to the \code{colour} aesthetic. In \ggstat{stat\_fit\_residuals()}, \gggeom{geom\_point()} is used by default. This figure shows the raw residuals with no weights applied (mapped to $y$ by default), and the computed weights (with range 0 to 1) encoded by colours ranging between red and blue.

<<mapping-stage-02>>=
ggplot(data = df2, mapping = aes(x = X, y = Y)) +
  stat_fit_residuals(formula = y ~ poly(x, 3, raw = TRUE), method = "rlm",
                     mapping = aes(colour = after_stat(weights)),
                     show.legend = TRUE) +
  scale_colour_gradient(low = "red", high = "blue", limits = c(0, 1),
                        guide = "colourbar")
@

In the second plot, weighted residuals are mapped to the $y$ aesthetic, and weights, as above, to the colour aesthetic. A call to \ggscale{stage()} can distinguish the mapping ahead of the statistic (\code{start}) from that after the statistic, i.e., ahead of the geometry. As above, the default geometry, \gggeom{geom\_point()}, is used. The mapping in this example can be read as: the variable \code{X} from the data frame \code{df2} is mapped to the \textit{x} aesthetic at all stages. Variable \code{Y} from the data frame \code{df2} is mapped to the \textit{y} aesthetic ahead of the computations in \ggstat{stat\_fit\_residuals()}. After the computations, variables \code{y} and \code{weights} in the data frame returned by \ggstat{stat\_fit\_residuals()} are multiplied and mapped to the \textit{y} aesthetic ahead of \gggeom{geom\_point()}.\label{chunk:plot:weighted:resid}

<<mapping-stage-03>>=
ggplot(df2) +
  stat_fit_residuals(formula = y ~ poly(x, 3, raw = TRUE),
                     method = "rlm",
                     mapping = aes(x = X,
                                   y = stage(start = Y,
                                             after_stat = y * weights),
                                   colour = after_stat(weights)),
                     show.legend = TRUE) +
  scale_colour_gradient(low = "red", high = "blue", limits = c(0, 1),
                        guide = "colourbar")
@

\begin{explainbox}
When fitting models to observations with \Rfunction{lm()}, the un-weighted residuals are used to compute the sum of squares unless weights are passed as an argument. In \Rfunction{rlm()}, the weights are computed from the data by the function.
\end{explainbox}
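
The contrast between the two functions can be sketched outside a plot. The code below rebuilds data like those used above under hypothetical names (\code{X.sk}, \code{Y.sk}, \code{df.sk}) and assumes that \Rfunction{rlm()} is the function from package \pkgname{MASS} and that the weights from the final iteration are stored in member \code{w} of the fitted model object.

<<mapping-stage-weights-sketch>>=
set.seed(4321)
X.sk <- 0:10
Y.sk <- (X.sk + X.sk^2 + X.sk^3) +
  rnorm(length(X.sk), mean = 0, sd = mean(X.sk^3) / 4)
df.sk <- data.frame(X = X.sk, Y = Y.sk)
df.sk[6, "Y"] <- df.sk[6, "Y"] * 10
# ordinary least squares: no weights unless passed as an argument
fm.lm <- lm(Y ~ poly(X, 3, raw = TRUE), data = df.sk)
is.null(weights(fm.lm))
# robust fit: weights are computed from the data
fm.rlm <- MASS::rlm(Y ~ poly(X, 3, raw = TRUE), data = df.sk)
round(fm.rlm$w, 2)
@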

\index{grammar of graphics!mapping of data|)}
\index{grammar of graphics!aesthetics|)}

\section{Geometries}\label{sec:plot:geometries}
\index{grammar of graphics!geometries|(}

Different geometries support different \emph{aesthetics} (Table \ref{tab:plot:geoms}). While \gggeom{geom\_point()} supports \code{shape}, and \gggeom{geom\_line()} supports \code{linetype}, both support \code{x}, \code{y}, \code{colour}, and \code{alpha}. In this section I describe frequently used \code{geometries} from package \ggplot and from a few packages that extend \ggplot. The graphic output from some code examples will not be shown, with the expectation that readers will run the code to see the plots.

Mainly for historical reasons, \emph{geometries} accept a \emph{statistic} as an argument, in the same way as \emph{statistics} accept a \emph{geometry} as an argument. In this section I only describe \emph{geometries} which have as a default \emph{statistic} \code{stat\_identity}. In section \ref{sec:plot:stat:summaries} (page \pageref{sec:plot:stat:summaries}), I describe other \emph{geometries} together with the \emph{statistics} they use by default.

\begin{table}
  \caption[Geometries]{\ggplot geometries described in section \ref{sec:plot:geometries}, packages where they are defined, and the aesthetics supported. The default statistic is in all cases \ggstat{stat\_identity()}.}\vspace{1ex}\label{tab:plot:geoms}
  \centering
   \begin{tabular}{llp{8.25cm}}
     \toprule
     Geometry & Package & Aesthetics \\
     \midrule
     \code{geom\_point} & \pkgnameNI{ggplot2} & x, y, shape, size, fill, colour, alpha \\
     \code{geom\_point\_s} & \pkgnameNI{ggpp} & x, y, size, linetype, linewidth, fill, colour, alpha \\
     \code{geom\_pointrange} & \pkgnameNI{ggplot2} & x, y, ymin, ymax, shape, size, linetype, linewidth, fill, colour, alpha \\
     \code{geom\_errorbar} & \pkgnameNI{ggplot2} & x, ymin, ymax, linetype, linewidth, colour, alpha \\
     \code{geom\_linerange} & \pkgnameNI{ggplot2} & x, ymin, ymax, linetype, linewidth, colour, alpha  \\
     \code{geom\_line} & \pkgnameNI{ggplot2} & x, y, linetype, linewidth, colour, alpha  \\
     \code{geom\_segment} & \pkgnameNI{ggplot2} & x, y, xend, yend, linetype, linewidth, colour, alpha \\
     \code{geom\_step} & \pkgnameNI{ggplot2} & x, y, linetype, linewidth, colour, alpha  \\
     \code{geom\_path} & \pkgnameNI{ggplot2} & x, y, linetype, linewidth, colour, alpha  \\
     \code{geom\_curve} & \pkgnameNI{ggplot2} & x, y, xend or yend, linetype, linewidth, colour, alpha  \\
     \code{geom\_area} & \pkgnameNI{ggplot2} & x, y, (ymin = 0), linetype, linewidth, fill, colour, alpha \\
     \code{geom\_ribbon} & \pkgnameNI{ggplot2} & x, ymin and ymax, linetype, linewidth, fill, colour, alpha \\
     \code{geom\_align} & \pkgnameNI{ggplot2} & x or y, xmin or xmax, ymin or ymax, linetype, linewidth, fill, colour, alpha \\
     \code{geom\_rect} & \pkgnameNI{ggplot2} & xmin, xmax, ymin, ymax, linetype, linewidth, fill, colour, alpha \\
     \code{geom\_tile} & \pkgnameNI{ggplot2} & x, y, width, height, linetype, linewidth, fill, colour, alpha \\
     \code{geom\_col} & \pkgnameNI{ggplot2} & x, y, width, linetype, linewidth, fill, colour, alpha \\
     \code{geom\_rug} & \pkgnameNI{ggplot2} & x or y, linewidth, colour, alpha \\
     \code{geom\_hline} & \pkgnameNI{ggplot2} & yintercept, linetype, linewidth, colour, alpha  \\
     \code{geom\_vline} & \pkgnameNI{ggplot2} & xintercept, linetype, linewidth, colour, alpha  \\
     \code{geom\_abline} & \pkgnameNI{ggplot2} & intercept, slope, linetype, linewidth, colour, alpha  \\
     \code{geom\_text} & \pkgnameNI{ggplot2} & x, y, label, fontface, family, angle, size, colour, alpha \\
     \code{geom\_label} & \pkgnameNI{ggplot2} & x, y, label, fontface, family, (angle), size, fill, colour, alpha  \\
     \code{geom\_text\_repel} & \pkgnameNI{ggrepel} & x, y, label, fontface, family, angle, size, colour, alpha \\
     \code{geom\_label\_repel} & \pkgnameNI{ggrepel} & x, y, label, fontface, family, size, fill, colour, alpha  \\
     \code{geom\_sf} & \pkgnameNI{ggplot2} & fill, colour \\
     \code{geom\_table} & \pkgnameNI{ggpp} & x, y, label, size, colour, angle \\
     \code{geom\_plot} & \pkgnameNI{ggpp} & x, y, label, vp.width, vp.height, angle \\
     \code{geom\_grob} & \pkgnameNI{ggpp} & x, y, vp.width, vp.height, label \\
     \code{geom\_blank} & \pkgnameNI{ggplot2} & --- \\
     \bottomrule
   \end{tabular}
\end{table}

\subsection{Point}\label{sec:plot:geom:point}
\index{grammar of graphics!point geometry|(}

As seen in examples above, \gggeom{geom\_point()}, can be used to add a layer with observations represented by ``points'' or symbols. In \emph{scatter plots} the variables mapped to $x$ and $y$ aesthetics are both continuous (\code{numeric}) and in \emph{dot plots} one of them is discrete (\code{factor} or \code{ordered}) and the other continuous. The plots in the examples above have been scatter plots.

\index{plots!scatter plot|(}The first examples of the use of \gggeom{geom\_point()} are for \textbf{scatter plots}, as \code{disp} and \code{mpg} are \code{numeric} variables. In the examples above, a third variable, \code{cyl}, was mapped to \code{colour}. While the colour aesthetic can be used with all \code{geoms}, other aesthetics can be used only with some \code{geoms}, for example the \code{shape} aesthetic can be used only with \gggeom{geom\_point()} and similar \code{geoms}, such as \gggeom{geom\_pointrange()}. The values in the \code{shape} aesthetic are discrete, and consequently only discrete values can be mapped to it.

<<scatter-01>>=
p.base <-
  ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg, shape = factor(cyl))) +
  geom_point()
p.base
@

\begin{playground}
Try a different mapping: \code{disp} $\rightarrow$ \code{colour}, \code{cyl} $\rightarrow$ \code{x}, keeping the mapping \code{mpg}  $\rightarrow$ \code{y} unchanged. Continue by using \code{help(mtcars)} and/or \code{names(mtcars)} to see what other variables are available, and then try the combinations that trigger your curiosity---i.e., explore the data.
\end{playground}

The scale used by default for shapes is \ggscale{scale\_shape\_discrete()}. Adding it explicitly, with \code{solid = FALSE} passed in the call, creates a version of the same plot based on open shapes, which are still selected automatically.

<<scatter-11>>=
p.base +
  scale_shape_discrete(solid = FALSE)
@

In contrast to ``filled'' shapes that obey both \code{colour} and \code{fill}, ``open'' shapes obey only \code{colour}, similarly to ``solid'' shapes. Function \code{scale\_shape\_manual()} can be used to set the shape used for each value in the mapped factor. Below, ``open'' shapes are used, as they reveal partial overlaps better than solid shapes (plot not shown).\label{chunk:filled:symbols}

<<scatter-11a, eval=eval_plots_all>>=
p.base +
  scale_shape_manual(values = c("circle open",
                                "square open",
                                "diamond open"))
@%
\pagebreak

It is also possible to use characters as shapes. The character is centred on the position of the observation. As the numbers used as symbols are self-explanatory, the default guide is removed by passing \code{guide = "none"} (plot not shown).\label{chunk:plot:point:char}

<<scatter-12, eval=eval_plots_all>>=
p.base +
 scale_shape_manual(values = c("4", "6", "8"), guide = "none")
@

A variable from \code{data} can be mapped to more than one aesthetic, allowing redundant aesthetics. This makes possible figures that, even if using colour, are readable when reproduced as black-and-white images and to viewers affected by colour blindness.

<<scatter-14, eval=eval_plots_all>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg,
       shape = factor(cyl), colour = factor(cyl))) +
  geom_point()
@

\index{plots!scatter plot|)}
\index{plots!dot plot|(}The next examples of the use of \gggeom{geom\_point()} are for \textbf{dot plots}, as \code{mpg} is a \code{numeric} variable but \code{factor(cyl)} is discrete. Dot plots are prone to have overlapping observations, and one way of making these points visible is to make them partly transparent by setting a constant value smaller than one for the \code{alpha} \emph{aesthetic}.

<<scatter-12a>>=
ggplot(data = mtcars,
       mapping = aes(x = factor(cyl), y = mpg)) +
  geom_point(alpha = 1/3)
@

Function\label{par:plot:pos:jitter} \ggposition{position\_identity()}, which is the default, does not alter the coordinates or position of observations, as shown in all examples above. To make overlapping observations visible, instead of making the points semitransparent as above, it is possible to randomly displace them along the axis mapped to the discrete variable, $x$ in this case. This is called \emph{jitter}, and can be added by passing \ggposition{position\_jitter()} as the argument to formal parameter \code{position} of \code{geoms}. The amount of jitter is set by numeric arguments passed to \code{width} and/or \code{height}, given as a fraction of the distance between adjacent factor levels in the plot.

<<scatter-13>>=
ggplot(data = mtcars,
       mapping = aes(x = factor(cyl), y = mpg)) +
  geom_point(position = position_jitter(width = 0.25, height = 0))
@

\begin{warningbox}
   When no arguments need to be passed to the \emph{position} function, its name can be given as a character string instead. For some positions, numerical arguments can also be passed to specific parameters of geometries. However, the default amount of jitter, 40\% of the resolution of the data in each direction ($\pm0.4$ of the distance between adjacent factor levels), is rarely optimal (plot not shown).

<<scatter-13info, eval=eval_plots_all>>=
ggplot(data = mtcars,
       mapping = aes(x = factor(cyl), y = mpg, colour = factor(cyl))) +
  geom_point(position = "jitter")
@
\end{warningbox}

\index{plots!dot plot|)}
\index{plots!bubble plot|(}
\textbf{Bubble plots} are scatter- or dot plots in which the size of points or bubbles varies following values of a continuous variable mapped to the \code{size} \emph{aesthetic}. There are two approaches to this mapping: values in the mapped variable describe either the area of the points or their radii. Although the radius is sometimes used, because of how visual perception works, mapping values to area is perceptually closer to linear than mapping them to radius. Below, the weights of cars (\code{wt}, in units of 1000~lb) are mapped to the area of the points. Open circles are used because of overlaps.

<<scatter-16>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg, colour = factor(cyl), size = wt)) +
  scale_size_area() +
  geom_point(shape = "circle open", stroke = 1.5)
@

\begin{playground}
If a radius-based scale is used instead of an area-based one the perceived size differences are larger, i.e., the ``impression'' on the viewer is different. In the plot above, replace \code{scale\_size\_area()} with \code{scale\_size\_radius()}.

Display the plot, look at it carefully. Check the numerical values of some of the weights of the cars, and assess if your perception of the plot matches the numbers behind it.
\end{playground}

\index{plots!bubble plot|)}

As a final example summarising the use of \gggeom{geom\_point()}, the scatter plot below combines different \emph{aesthetics} and their \emph{scales}.

<<scatter-18>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg, shape = factor(cyl),
                     fill = factor(cyl), size = wt)) +
  geom_point(alpha = 0.33, colour = "black") +
  scale_size_area() +
  scale_shape_manual(values = c("circle filled",
                                "square filled",
                                "diamond filled"))
@

\begin{playground}
Play with the code in the chunk above. Remove or change each of the mappings and the scale, display the new plot, and compare it to the one above. Continue playing with the code until you are sure you understand what graphical element in the plot is added or modified by each individual argument or ``word'' in the code statement.
\end{playground}
\index{grammar of graphics!point geometry|)}

It is common to draw error bars together with points representing means or medians. These can be added in a single layer with \gggeom{geom\_pointrange()} with values mapped to the \code{x}, \code{y}, \code{ymin} and \code{ymax} aesthetics, using \code{y} for the point and \code{ymin} and \code{ymax} for the ends of the line segment. Two other \emph{geometries}, \gggeom{geom\_linerange()} and \gggeom{geom\_errorbar()}, draw only a segment or a segment with capped ends, respectively. These three \code{geoms} are frequently used together with \code{stats} that compute summaries by group. However, summary values calculated before plotting can alternatively be passed as \code{data}.
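
A minimal sketch of the approach in which summaries are calculated before plotting is given next. The data frame \code{mpg.summaries} and its column names are chosen here only for illustration: the mean, minimum and maximum of \code{mpg} are computed for each number of cylinders with base \Rlang functions, and then mapped to \code{y}, \code{ymin} and \code{ymax}.

<<pointrange-sketch-01>>=
library(ggplot2)
# split mpg by number of cylinders and summarise each group
mpg.by.cyl <- split(mtcars$mpg, mtcars$cyl)
mpg.summaries <-
  data.frame(cyl = factor(names(mpg.by.cyl)),
             mean = sapply(mpg.by.cyl, mean),
             min = sapply(mpg.by.cyl, min),
             max = sapply(mpg.by.cyl, max))
# one point (mean) and one segment (range) per row of the summaries
ggplot(data = mpg.summaries,
       mapping = aes(x = cyl, y = mean, ymin = min, ymax = max)) +
  geom_pointrange()
@

Replacing \code{geom\_pointrange()} with \code{geom\_linerange()} in this sketch should draw only the segments, without the points.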

\subsection{Rug}\label{sec:plot:rug}
\index{plots!rug margin|(}

Rug plots are rarely used by themselves; instead, they are usually added to scatter plots with \gggeom{geom\_rug()}, where they make it easier to see the distribution of observations along the $x$- and/or $y$-axes. By default, rugs are drawn on the left and bottom edges of the plotting area. By passing \code{sides = "btlr"} they are drawn on the bottom, top, left, and right margins. Any combination of the four characters can be used to control the drawing of the rugs.

<<rug-plot-01>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg, colour = factor(cyl))) +
  geom_point() +
  geom_rug(sides = "btlr")
@

\begin{warningbox}
  Rug plots are useful when the local density of observations in a continuous variable is not high as, otherwise, rugs become too cluttered and the rug ``threads'' overlap. When the overlap is moderate, making the segments semitransparent by setting the \code{alpha} aesthetic to a constant value smaller than one can make the variation in density easier to appreciate. When the number of observations is large, marginal density plots are preferred.
\end{warningbox}
\index{plots!rug margin|)}

\subsection{Line and area}\label{sec:plot:line}

\index{grammar of graphics!various line and path geometries|(}\index{plots!line plot|(}
\textbf{Line plots} are normally created using \gggeom{geom\_line()}, and, occasionally using \gggeom{geom\_path()}. These two \code{geoms} differ in the sequence they follow when connecting values: \gggeom{geom\_line()} connects observations based on the ordering of \code{x} values while \gggeom{geom\_path()} uses the order in the data. Aesthetic \code{linewidth} controls the thickness of lines and \code{linetype} the patterns of dashes and dots.

In a line plot, observations, or the subset of observations in a group, are joined by straight lines. Below, a different data set, \Rdata{Orange}, with data on the growth of five orange trees (see \code{help(Orange)}) is used. By mapping \code{Tree} to \code{linetype} the observations become grouped, and a line is plotted for each tree.

\label{plot:fig:lines}
<<line-plot-01>>=
ggplot(data = Orange,
       mapping = aes(x = age, y = circumference, linetype = Tree)) +
  geom_line()
@

\begin{warningbox}
Before \ggplot 3.4.0 the \code{size} aesthetic controlled the width of lines. Aesthetic \code{linewidth} was added in \ggplot 3.4.0 and the use of the \code{size} aesthetic for lines deprecated.
\end{warningbox}
\index{plots!line plot|)}

\index{plots!step plot|(}%
Geometry \gggeom{geom\_step()} plots only vertical and horizontal lines to join the observations, creating a stepped line, or ``staircase''. Parameter \code{direction}, with default \code{"hv"}, controls the ordering of horizontal and vertical lines.

<<step-plot-01>>=
ggplot(data = Orange,
       mapping = aes(x = age, y = circumference, linetype = Tree)) +
  geom_step()
@
\index{plots!step plot|)}

\begin{playground}
Using the following toy data, make three plots using \code{geom\_line()}, \code{geom\_path()}, and \code{geom\_step()} to add a layer. How do they differ?

<<line-plots-PG01,eval=eval_playground>>=
toy.df <- data.frame(x = c(1,3,2,4), y = c(0,1,0,1))
@
\end{playground}

\index{plots!filled-area plot|(}
While \gggeom{geom\_line()} draws a line joining observations, \gggeom{geom\_area()} supports, in addition, filling the area below the line according to the \code{fill} \emph{aesthetic}. In some cases, it is useful to stack the areas, e.g., when the values plotted represent parts of a bigger whole. In the next, contrived, example, the areas representing the growth of the five orange trees are stacked (visually summed) using \code{position = "stack"}, which is the default in \code{geom\_area()} and is passed explicitly here only for clarity. The visibility of the lines for individual trees is improved by changing their colour and width from the defaults. (Compare the $y$ axis of the figure below to that drawn using \code{geom\_line()} on page \pageref{plot:fig:lines}.)

<<area-plot-01>>=
p1 <- # will be used again later
  ggplot(data = Orange,
         mapping = aes(x = age, y = circumference, fill = Tree)) +
  geom_area(position = "stack", colour = "white", linewidth = 1)
p1
@

\gggeom{geom\_ribbon()} draws two lines based on the \code{x}, \code{ymin} and \code{ymax} \emph{aesthetics}, with the space between the lines filled according to the \code{fill} \emph{aesthetic}. \gggeom{geom\_polygon()} is similar to \gggeom{geom\_path()} but connects the first and last observations forming a closed polygon that obeys the \code{fill} aesthetic.
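
As a minimal sketch of the use of \code{geom\_ribbon()}, the code below draws the range of circumferences observed among the five orange trees at each age. The data frame \code{circ.range} and its column names are chosen here only for illustration; the minimum and maximum at each age are computed with \code{tapply()} before plotting.

<<ribbon-sketch-01>>=
library(ggplot2)
# minimum and maximum circumference across trees at each age
circ.range <-
  data.frame(age = sort(unique(Orange$age)),
             ymin = as.vector(tapply(Orange$circumference, Orange$age, min)),
             ymax = as.vector(tapply(Orange$circumference, Orange$age, max)))
# the band between the two lines is filled according to fill
ggplot(data = circ.range,
       mapping = aes(x = age, ymin = ymin, ymax = ymax)) +
  geom_ribbon(fill = "gray75", colour = "black")
@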

\index{plots!filled-area plot|)}

\index{plots!reference lines|(}
Finally,\label{sec:plot:vhline} three \emph{geometries} draw lines across the whole plotting area: \gggeom{geom\_hline()}, \gggeom{geom\_vline()} and \gggeom{geom\_abline()}. The first two draw horizontal and vertical lines, respectively, while the third one draws straight lines whose position is determined by the \emph{aesthetics} \code{slope} and \code{intercept}. The lines drawn with these three geoms extend to the edges of the plotting area.

\gggeom{geom\_hline()} and \gggeom{geom\_vline()} require a single parameter (or aesthetic), \code{yintercept} and \code{xintercept}, respectively. Unlike with other geoms, the data for these aesthetics can be passed as a constant numeric vector containing multiple values. The reason for this is that these geoms are most frequently used to annotate plots rather than to plot observations. Vertical lines can be used to highlight time points, here the ages of 1, 2, and 3 years.

<<area-plot-02>>=
p1 +
  geom_vline(xintercept = 365 * 1:3, colour = "gray75") +
  geom_vline(xintercept = 365 * 1:3, linetype = "dashed")
@

\begin{playground}
  Change the order of the two layers in the example above. How did the figure change? What order is best? Would the same order be the best for a scatter plot? And would it be necessary to add two \code{geom\_vline()} layers?
\end{playground}

Similarly to \gggeom{geom\_hline()} and \gggeom{geom\_vline()}, \gggeom{geom\_abline()} draws a straight line, accepting as parameters (or as aesthetics) values for the \code{intercept}, $a$, and the \code{slope}, $b$.
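
A minimal sketch of the use of \code{geom\_abline()} follows. The intercept and slope are taken here from a linear regression fitted with \code{lm()}, a choice made only for illustration, as is the name \code{fit.coefs}; any numeric values can be passed instead.

<<abline-sketch-01>>=
library(ggplot2)
# intercept (a) and slope (b) of a fitted straight line
fit.coefs <- coef(lm(mpg ~ disp, data = mtcars))
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  geom_abline(intercept = fit.coefs[["(Intercept)"]],
              slope = fit.coefs[["disp"]],
              linetype = "dashed")
@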
\index{plots!reference lines|)}

\index{plots!segments and arrows|(}
Disconnected straight-line segments and arrows, one for each observation or row in the data, can be plotted with \gggeom{geom\_segment()} which accepts \code{x}, \code{xend}, \code{y}, and \code{yend} as mapped aesthetics. \gggeom{geom\_spoke()}, which uses a polar parametrisation, uses a different set of aesthetics, \code{x}, \code{y} for origin, and \code{angle} and \code{radius} for the segment. Similarly, \gggeom{geom\_curve()} draws curved segments, with the curvature, control points, and angles controlled through parameters. These three \emph{geometries} support arrow heads at the ends of segments or curves, controlled through parameter \code{arrow} (not through an aesthetic).
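
The sketch below illustrates \code{geom\_segment()} with arrow heads, using a small artificial data frame. The name \code{arrows.df} and the coordinates are arbitrary, chosen only for illustration. Functions \code{arrow()} and \code{unit()} are defined in package \pkgname{grid} and re-exported by \pkgname{ggplot2}.

<<segment-sketch-01>>=
library(ggplot2)
# one row per segment: start (x, y) and end (xend, yend)
arrows.df <- data.frame(x = c(1, 2, 3),
                        y = c(1, 1, 1),
                        xend = c(1.5, 2.5, 3.5),
                        yend = c(2, 3, 2))
ggplot(data = arrows.df,
       mapping = aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_segment(arrow = arrow(length = unit(0.1, "inches")))
@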
\index{plots!segments and arrows|)}
\index{grammar of graphics!various line and path geometries|)}

\subsection{Column}\label{sec:plot:col}
\index{grammar of graphics!column geometry|(}
\index{plots!column plot|(}

The \emph{geometry} \gggeom{geom\_col()} can be used to create \emph{column plots}, where each bar represents an observation or row in the \code{data} (frequently means or totals previously computed from the primary observations).

\begin{warningbox}
In other contexts, column plots are frequently called bar plots. \Rlang users not familiar yet with \ggplot are frequently surprised by the default behaviour of \gggeom{geom\_bar()} as it uses \ggstat{stat\_count()} to produce a histogram, rather than plotting values as is (see section \ref{sec:plot:histogram} on page \pageref{sec:plot:histogram}). \gggeom{geom\_col()} is identical to \gggeom{geom\_bar()} but with \code{"identity"} as the default statistic.
\end{warningbox}

Using very simple artificial data helps demonstrate how variations of column plots can be obtained. The data are for two groups, hypothetical males and females.

<<col-plot-01>>=
set.seed(654321)
my.col.data <-
  data.frame(treatment = factor(rep(c("A", "B", "C"), 2)),
             group = factor(rep(c("male", "female"), c(3, 3))),
             measurement = rnorm(6) + c(5.5, 5, 7))
@

The first plot includes only the data for \code{"female"} subjects, extracted using a nested call to \Rfunction{subset()}. Except for \code{x} and \code{y}, default mappings are used for all \emph{aesthetics}.

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_medium)
@

<<col-plot-02>>=
ggplot(subset(my.col.data, group == "female"),
       mapping = aes(x = treatment, y = measurement)) +
  geom_col()
@

The \label{par:plot:pos:stack} bars above are overwhelmingly wide; passing \code{width = 0.5} makes the bars narrower, using only half the distance between the levels on the $x$ axis. Setting \code{colour = "white"} overrides the default colour of the lines bordering the bars. Below, both males and females are included and \code{group} is mapped to the \code{fill} aesthetic. The default argument for \code{position} in \gggeom{geom\_col()} is \ggposition{position\_stack()}. Function \ggposition{position\_fill()} is similar to \ggposition{position\_stack()} but divides the stacked values by their sum, i.e., the individual stacked ``slices'' of the column display proportions instead of absolute values.

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_2fig_very_wide)
@

<<col-plot-03>>=
p.base <-
 ggplot(my.col.data,
        mapping = aes(x = treatment, y = measurement, fill = group))
@

<<col-plot-03a>>=
p1 <- p.base + geom_col(width = 0.5, colour = "white") +
  ggtitle("stack (default)")
@

Using \code{position = "dodge"}\label{par:plot:pos:dodge} to override the default \code{position = "stack"}, the columns for males and females are plotted side by side.\qRfunction{position\_stack()}

<<col-plot-04>>=
p2 <- p.base + geom_col(position = "dodge") + ggtitle("dodge")
@

The two plots are displayed side by side (see section \ref{sec:plot:composing} on page \pageref{sec:plot:composing} for details).

<<col-plot-04a>>=
p1 + p2
@

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_wide)
@
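
How \code{position = "fill"} rescales stacked columns can be sketched with a very small artificial data set. The data frame \code{fill.demo.df} and its values are invented only for illustration. The proportions are also computed by hand, dividing each value by the total for its column, to show what the plot is expected to display.

<<col-fill-sketch-01>>=
library(ggplot2)
fill.demo.df <-
  data.frame(treatment = rep(c("A", "B"), each = 2),
             group = rep(c("female", "male"), 2),
             measurement = c(2, 6, 3, 3))
# each value divided by the sum of the values stacked in the same column
fill.demo.df$proportion <-
  with(fill.demo.df, measurement / ave(measurement, treatment, FUN = sum))
fill.demo.df
# the columns all reach 1, and the slices show proportions
ggplot(data = fill.demo.df,
       mapping = aes(x = treatment, y = measurement, fill = group)) +
  geom_col(position = "fill", width = 0.5)
@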

\begin{playground}
Change the argument to \code{position}, or let the default be active, until you understand its effect on the figure. What is the difference between \emph{positions} \code{"identity"}, \code{"dodge"}, \code{"stack"}, and \code{"fill"}?
\end{playground}

\begin{playground}
Use constants as arguments for \emph{aesthetics} or map variable \code{treatment} to one or more of the \emph{aesthetics} recognised by \gggeom{geom\_col()}, such as \code{colour}, \code{fill}, \code{linetype}, \code{linewidth}, \code{alpha} and \code{width}.
\end{playground}

\index{grammar of graphics!column geometry|)}
\index{plots!column plot|)}

\subsection{Tiles}\label{sec:tileplot}
\index{grammar of graphics!tile geometry|(}
\index{plots!tile plot|(}
\textbf{Tile plots} and \textbf{heat maps} are useful when observations are available on a regular rectangular 2D grid. The grid can, for example, represent locations in space as well as combinations of levels of two discrete classification criteria. The colour or darkness of the tiles informs about the value of the observations. A layer with square or rectangular tiles can be added with \gggeom{geom\_tile()}.

Data from 100 random draws from the $F$ distribution with degrees of freedom $\nu_1 = 2, \nu_2 = 20$ are used in the examples.

<<tile-plot-01>>=
set.seed(1234)
randomf.df <- data.frame(F.value = rf(100, df1 = 2, df2 = 20),
                         x = rep(letters[1:10], 10),
                         y = LETTERS[rep(1:10, rep(10, 10))])
@

\gggeom{geom\_tile()} requires aesthetics $x$ and $y$, with no defaults, and \code{width} and \code{height} with defaults that make all tiles of equal size filling the plotting area. Variable \code{F.value} is mapped to \code{fill}.

<<tile-plot-02>>=
ggplot(data = randomf.df,
       mapping = aes(x, y, fill = F.value)) +
  geom_tile()
@

Below, setting \code{colour = "gray75"} and \code{linewidth = 1} makes the tile borders visible. Whether highlighting these lines improves a tile plot or not depends on whether the individual tiles correspond to values of a categorical or a continuous variable. For example, when rows of tiles correspond to genes and columns to discrete treatments, visible tile borders are preferable. In contrast, when the tiles are an approximation to a continuous surface, like measurements on a regular spatial grid, it is best to suppress tile borders.

<<tile-plot-03>>=
ggplot(data = randomf.df,
       mapping = aes(x, y, fill = F.value)) +
  geom_tile(colour = "gray75", linewidth = 1)
@

\begin{playground}
Play with the arguments passed to parameters \code{colour} and \code{linewidth} in the example above, considering what features of the data are most clearly perceived in each of the plots you create.
\end{playground}

Continuous fill scales can be used to control the appearance. Below, code for a tile plot based on a gray gradient, with missing values in red, is shown (plot not shown).

<<tile-plot-04, eval=eval_plots_all>>=
ggplot(data = randomf.df,
       mapping = aes(x, y, fill = F.value)) +
  geom_tile(colour = "white") +
  scale_fill_gradient(low = "gray15", high = "gray85", na.value = "red")
@

In contrast to \gggeom{geom\_tile()}, \gggeom{geom\_rect()} draws rectangular tiles based on the position of the corners, mapped to aesthetics \code{xmin}, \code{xmax}, \code{ymin} and \code{ymax}. In this case, tiles can vary in size and do not need to be contiguous. The filled rectangles can be used, for example, to highlight a rectangular region in a plot (see example on page \pageref{par:plot:inset:zoom}).
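
A minimal sketch of the use of \code{geom\_rect()} to highlight a region of a scatter plot follows. The data frame \code{highlight.df} and the coordinates of the corners are arbitrary, chosen only for illustration. As the rectangle uses its own \code{data} and mapping, \code{inherit.aes = FALSE} is passed so that the whole-plot mapping is not applied to this layer.

<<rect-sketch-01>>=
library(ggplot2)
# corners of the region to highlight (arbitrary values)
highlight.df <- data.frame(xmin = 250, xmax = 400, ymin = 10, ymax = 20)
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_rect(data = highlight.df,
            mapping = aes(xmin = xmin, xmax = xmax,
                          ymin = ymin, ymax = ymax),
            inherit.aes = FALSE,
            fill = "yellow", alpha = 0.3) +
  geom_point()
@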
\index{plots!tile plot|)}
\index{grammar of graphics!tile geometry|)}

\subsection{Simple features (sf)}\label{sec:plot:sf}
\index{grammar of graphics!sf geometries|(}
\index{plots!maps and spatial plots|(}

\ggplot version 3.0.0 and later support the plotting of shape data, similarly to geographic information systems (GIS), with \gggeom{geom\_sf()} and its companions \gggeom{geom\_sf\_text()}, \gggeom{geom\_sf\_label()}, and \ggstat{stat\_sf()}. This makes it possible to display data on maps, for example, using different fill values for different regions. The special \emph{coordinate} \code{coord\_sf()} can be used to select different projections for maps. The \emph{aesthetic} used is called \code{geometry} and, contrary to all the other aesthetics described above, the values to be mapped are of class \code{sfc} containing \emph{simple features} data with multiple components. Manipulation of simple features data is supported by package \pkgname{sf}. Normal geometries can be used together with \ggstat{stat\_sf\_coordinates()} to add other graphical elements to maps. This subject exceeds the scope of this book, so a single and very simple example is shown below.

<<sf_plot-01>>=
nc <- sf::st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
ggplot(nc) +
  geom_sf(mapping = aes(fill = AREA), colour = "gray90")
@
\index{grammar of graphics!sf geometries|)}
\index{plots!maps and spatial plots|)}

\subsection{Text}\label{sec:plot:text}
\index{grammar of graphics!text and label geometries|(}
\index{plots!text in|(}
\index{plots!maths in|(}
Geometries \gggeom{geom\_text()} or \gggeom{geom\_label()} are used to add textual data labels and annotations to plots.

For \gggeom{geom\_text()} and \gggeom{geom\_label()}, the aesthetic \code{label} provides the text to be plotted and aesthetics \code{x} and \code{y}, the location of the labels. The size of the text is controlled by the \code{size} aesthetic, while the font is selected by the \code{family} and \code{fontface} aesthetics. Below, the whole-plot default mappings for \code{colour} and \code{size} aesthetics are overridden within \gggeom{geom\_text()}.

<<text-plot-01>>=
ggplot(data = mtcars,
         mapping = aes(x = disp, y = mpg,
                       colour = factor(cyl), size = wt, label = cyl)) +
  geom_point(alpha = 1/3) +
  geom_text(colour = "darkblue", size = 3)
@

Aesthetics \code{angle}, expressed in degrees, and \code{vjust} and \code{hjust} can be used to rotate the text and adjust its vertical and horizontal justification. The default value of 0.5 for both \code{hjust} and \code{vjust} sets the centre of the text at the supplied \code{x} and \code{y} coordinates. \emph{``Vertical'' and ``horizontal'' for text justification are relative to the text, not the plot.} This is important when \code{angle} is different from zero. Values larger than 0.5 shift the label left or down, and values smaller than 0.5, right or up with respect to its \code{x} and \code{y} coordinates. A value of 1 or 0 sets the text so that its edge is at the supplied coordinate. Values outside the range $0\ldots 1$ shift the text even farther away, although the shift is still expressed in units based on the length or height of the text label. Recent versions of \pkgname{ggplot2} make possible justification using character constants for alignment: \code{"left"}, \code{"middle"}, \code{"right"}, \code{"bottom"}, \code{"center"}, and \code{"top"}, and two special alignments, \code{"inward"} and \code{"outward"}, that automatically vary based on the position in the plotting area.

Below, \gggeom{geom\_text()} is used together with \gggeom{geom\_point()}, as is done when adding data labels to a plot; \gggeom{geom\_label()} can be used in the same way.

<<text-plot-02>>=
my.data <-
  data.frame(x = 1:5,
             y = rep(2, 5),
             label = c("ab", "bc", "cd", "de", "ef"))
@

<<text-plot-02a>>=
ggplot(data = my.data,
       mapping = aes(x, y, label = label)) +
  geom_text(angle = 90, hjust = 1.5, size = 4) +
  geom_point()
@
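
Justification can also be set with the character constants for alignment, as in the minimal sketch below. The data frame \code{just.df} is a copy of the toy data under a different name, used only to keep this example self-contained. With \code{hjust = "inward"} the labels near the left and right edges are expected to be aligned towards the centre of the plotting area, while \code{vjust = "bottom"} places the bottom edge of the text at the $y$ coordinate.

<<text-just-sketch-01>>=
library(ggplot2)
just.df <-
  data.frame(x = 1:5,
             y = rep(2, 5),
             label = c("ab", "bc", "cd", "de", "ef"))
ggplot(data = just.df,
       mapping = aes(x, y, label = label)) +
  geom_text(hjust = "inward", vjust = "bottom", size = 4) +
  geom_point()
@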

In the case of \gggeom{geom\_label()} the text is enclosed in a box and obeys the \code{fill} \emph{aesthetic} and additional parameters (described starting at page \pageref{start:plot:label}) allowing control of the shape and size of the box. Before \ggplot 3.5.0, \gggeom{geom\_label()} did not support rotation with the \code{angle} aesthetic.

\begin{playground}
Modify the example above to use \gggeom{geom\_label()} instead of \gggeom{geom\_text()} using, in addition, the \code{fill} aesthetic.
\end{playground}

A serif font is set by passing \code{family = "serif"}. The names \code{"sans"} (the default), \code{"serif"} and \code{"mono"} are recognised by all graphics devices on all operating systems. They do not necessarily correspond to identical fonts in different computers or for different graphic devices, but instead to fonts that are similar. Additional fonts are available for specific graphic devices, such as the 35 ``PDF'' fonts by the \code{pdf()} device. In this case, their names can be queried with \code{names(pdfFonts())}.

<<text-plot-04, eval=eval_plots_all>>=
ggplot(data = my.data,
       mapping = aes(x, y, label = label)) +
  geom_text(angle = 90, hjust = 1.5, size = 4, family = "serif") +
  geom_point()
@

\begin{playground}
In the examples above, the character strings were all of the same length, containing two characters each. Redo the plots above with longer character strings of various lengths mapped to the \code{label} \emph{aesthetic}. Also play with the justification of these labels.
\end{playground}

\begin{warningbox}
\Rlang\index{plots!fonts} and \ggplot support the use of UNICODE\index{UNICODE}, such as UTF8\index{UTF8} character encodings in strings. If your editor or IDE supports their use, then you can type Greek letters and simple maths symbols directly, and they \emph{may} show correctly in labels if a suitable font is loaded and an extended encoding like UTF8 is in use by the operating system. Even if UTF8 is in use, text is not fully portable unless the same font is available\index{portability}, as even if the character positions are standardised for many languages, most UNICODE fonts support at most a small number of languages. In principle, one can use this mechanism to have labels both using other alphabets and languages like Chinese with their numerous symbols mixed in the same figure. Furthermore, the support for fonts and consequently character sets in \Rlang is output-device dependent. The font encoding used by \Rlang by default depends on the default locale settings of the operating system, which can also lead to garbage printed to the console or wrong characters being plotted running the same code on a different computer from the one where a script was created. Not all is lost, though, as \Rlang can be coerced to use system fonts and Google fonts with functions provided by packages \pkgname{showtext} and \pkgname{extrafont}. Encoding-related problems, especially in MS-Windows, are common.
\end{warningbox}

Plotting (mathematical) expressions involves mapping to the \code{label} aesthetic character strings that can be parsed as expressions, and setting \code{parse = TRUE} (see section \ref{sec:plot:plotmath} on page \pageref{sec:plot:plotmath}). Below, the character strings are assembled using \Rfunction{paste()} but, of course, they could have been also typed in as constant values. This use of \Rfunction{paste()} is an example of recycling of shorter vectors, \code{"alpha["} and \code{"]"} to match the length of \code{1:5} (see section \ref{sec:vectors} on page \pageref{sec:vectors}).

<<text-plot-05>>=
my.data <-
  data.frame(x = 1:5, y = rep(2, 5), label = paste("alpha[", 1:5, "]", sep = ""))
my.data$label
@

Text and labels do not automatically expand the plotting area past their anchoring coordinates. In the example below, \code{expand\_limits(x = 5.2)} ensures that the text is not clipped at the edge of the plotting area.

<<text-plot-06>>=
ggplot(data = my.data,
       mapping = aes(x, y, label = label)) +
  geom_text(hjust = -0.2, parse = TRUE, size = 6) +
  geom_point() +
  expand_limits(x = 5.2)
@

In the example above, the text to be parsed was mapped to the \code{label} aesthetic using character strings previously added to the data frame \code{my.data}. It is also possible, and usually preferable, to build suitable character strings with a nested function call, or a code statement, passed as an argument in the call to \code{aes()} (plot identical to the previous one, not shown).

<<text-plot-07, eval=eval_plots_all>>=
ggplot(data = my.data,
       mapping = aes(x, y, label = paste("alpha[", x, "]", sep = ""))) +
  geom_text(hjust = -0.2, parse = TRUE, size = 6) +
  geom_point() +
  expand_limits(x = 5.2)
@

Geometry \gggeom{geom\_label()} obeys the same aesthetics as \gggeom{geom\_text()} (except for \code{angle} in \ggplot <\,3.5.0) and, in addition, \code{fill} for the colour used to fill the boxes' background. It also accepts parameters \code{label.size} for the width of the border line, \code{label.r} for the roundness of the box corners, and \code{label.padding} for the space between the text boundary and the box boundary.

\label{start:plot:label}
<<label-plot-01>>=
my.data <-
  data.frame(x = 1:5, y = rep(2, 5),
             label = c("one", "two", "three", "four", "five"))

ggplot(data = my.data,
       mapping = aes(x, y, label = label)) +
  geom_label(hjust = -0.2, size = 6,
             label.size = 0,
             label.r = unit(0, "lines"),
             label.padding = unit(0.15, "lines"),
             fill = "yellow", alpha = 0.5) +
  geom_point() +
  expand_limits(x = 5.6)
@

\begin{playground}
Starting from the example above, play with the arguments to the different parameters and with the mappings to \emph{aesthetics} to get an idea of the variations in the design that they allow. For example, use thicker border lines and increase the padding so that a visually well-balanced margin is retained. You may also try mapping the \code{fill} and \code{colour} \emph{aesthetics} to factors in the data.
\end{playground}

If\index{grammar of graphics!text and label geometries!repulsive} the parameter \code{check\_overlap} of \gggeom{geom\_text()} is set to \code{TRUE}, text overlap will be avoided by suppressing the text that would otherwise overlap other text.  \emph{Repulsive} versions of \gggeom{geom\_text()} and \gggeom{geom\_label()}, \gggeom{geom\_text\_repel()} and \gggeom{geom\_label\_repel()}, are available in package \pkgname{ggrepel}. These \emph{geometries} avoid overlaps by automatically repositioning the text or labels. Please read the package documentation for details of how to control the repulsion strength and direction, and the properties of the segments linking the labels to the position of their data coordinates. Nearly all aesthetics supported by \code{geom\_text()} and \code{geom\_label()} are supported by the repulsive versions. However, given that a segment connects the label or text to its anchor point, several properties of these segments can also be controlled with aesthetics or arguments.

<<repel-plot-01>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg,
                     colour = factor(cyl), size = wt, label = cyl)) +
  scale_size() +
  geom_point(alpha = 1/3) +
  geom_text_repel(colour = "black", size = 3,
                  min.segment.length = 0.2, point.padding = 0.1)
@
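
For comparison with the repulsive geometries, the simpler approach based on \code{check\_overlap = TRUE} can be sketched as follows. This is only an illustration: the use of the row names of \code{mtcars} as labels and the object name \code{p.overlap} are choices made for this example.

<<text-overlap-sketch-01>>=
library(ggplot2)
# the row names of mtcars (car models) are long enough to overlap
# p.overlap is an arbitrary name used only in this sketch
p.overlap <-
  ggplot(data = mtcars,
         mapping = aes(x = disp, y = mpg, label = rownames(mtcars))) +
  geom_point() +
  geom_text(check_overlap = TRUE, size = 3, vjust = -0.5)
p.overlap
@

Text is drawn in the order of the rows in \code{data}, and a text label is not plotted when it would overlap one drawn earlier in the same layer, so some observations remain unlabelled.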
\index{plots!maths in|)}
\index{plots!text in|)}
\index{grammar of graphics!text and label geometries|)}

\subsection{Plot insets}\label{sec:plot:insets}
\index{grammar of graphics!inset-related geometries|(}
\index{plots!insets|(}

The support for insets in \pkgname{ggplot2} is confined to \code{annotation\_custom()}, which was designed to be used for static annotations expected to be the same in each panel of a plot (the use of annotations is described in section \ref{sec:plot:annotations}). Package \pkgname{ggpp} provides geoms that mimic \code{geom\_text()} in relation to the \emph{aesthetics} used, but that, similarly to \code{geom\_sf()}, expect the column in \code{data} mapped to the \code{label} aesthetic to be a list of objects containing multiple pieces of information, rather than an atomic vector. Three geometries are currently available: \gggeom{geom\_table()}, \gggeom{geom\_plot()} and \gggeom{geom\_grob()}.

\begin{warningbox}
Given that  \gggeom{geom\_table()}, \gggeom{geom\_plot()}, and \gggeom{geom\_grob()} will rarely use a mapping inherited from the whole plot, by default they do not inherit it. Either the mapping should be supplied as an argument to these functions or their parameter \code{inherit.aes} explicitly set to \code{TRUE}.
\end{warningbox}

\index{plots!inset tables|(}
Tables can be added as plot insets with \gggeom{geom\_table()} by mapping a list of data frames (or tibbles) to the \code{label} \emph{aesthetic}. Positioning, justification, and angle work as for \gggeom{geom\_text()} and are applied to the whole table. The table(s) are constructed as \pkgnameNI{grid} \code{grob} objects and added to the \code{gg} plot object as a layer.

The code below builds a \code{tibble} containing summaries from the \code{mtcars} data set, with the summary values formatted as character strings, adds this tibble as the single member to a list, and stores this list as a column named \code{table.inset} in another \code{tibble}, named \code{table.tb}, together with the \code{x} and \code{y} coordinates for its location as an inset.

\begin{explainbox}
The code uses functions from the \pkgname{tidyverse} (see section \ref{sec:dplyr:group:wise} on page \pageref{sec:dplyr:group:wise}). Data frames and base \Rlang functions could have been used instead (see section \ref{sec:calc:df:aggregate} on page \pageref{sec:calc:df:aggregate}).
\end{explainbox}

<<table-plot-01>>=
mtcars |>
  group_by(cyl) |>
  summarize("mean wt" = format(mean(wt), digits = 3),
            "mean disp" = format(mean(disp), digits = 2),
            "mean mpg" = format(mean(mpg), digits = 2)) -> my.table
table.tb <- tibble(x = 500, y = 35, table.inset = list(my.table))
@

As with text labels, justification is interpreted in relation to table-text orientation. However, the default, \code{"inward"}, rarely needs to be changed if one sets $x$ and $y$ coordinates to the location of the inset corner farthest from the centre of the plot. The inset table is added at its native size, given by the \code{size} aesthetic, which is applied to the text in it.

<<table-plot-02>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg, colour = factor(cyl), size = wt)) +
  scale_size() +
  geom_point() +
  geom_table(data = table.tb,
             mapping = aes(x = x, y = y, label = table.inset),
             colour = "black", size = 3)
@

Parsed text, using \Rlang's \emph{plotmath} syntax, is supported in tables, with fallback to plain text in case of parsing errors, on a cell-by-cell basis.

\begin{explainbox}
The \emph{geometry} \gggeom{geom\_table()} uses functions from package \pkgname{gridExtra} to build a graphical object for the table. A table theme can be passed as an argument to \gggeom{geom\_table()}.
\end{explainbox}
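
As a sketch of how a table theme can be passed, the code below uses parameter \code{table.theme} of \gggeom{geom\_table()} with \code{ttheme\_gtlight}, assumed here to be one of the theme constructors exported by the installed version of \pkgname{ggpp} (the package documentation lists those available). The summary table is rebuilt with \Rfunction{aggregate()} so that the example is self-contained; the names \code{mpg.means}, \code{theme.tb} and \code{p.theme} are arbitrary.

<<table-theme-sketch-01>>=
library(ggplot2)
library(ggpp)
# mean mpg by number of cylinders, formatted as character strings
mpg.means <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
mpg.means$mpg <- format(mpg.means$mpg, digits = 3)
# one-row tibble with the table as the single member of a list column
theme.tb <- tibble::tibble(x = 500, y = 35, table.inset = list(mpg.means))
p.theme <-
  ggplot(data = mtcars, mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  geom_table(data = theme.tb,
             mapping = aes(x = x, y = y, label = table.inset),
             table.theme = ttheme_gtlight) # assumed theme constructor
p.theme
@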
\index{plots!inset tables|)}

\index{plots!inset plots|(}
Geometry \gggeom{geom\_plot()} works similarly to \code{geom\_table()} but insets a ggplot within another ggplot. Thus, instead of expecting a list of data frames or tibbles to be mapped to the \code{label} aesthetic, it expects a list of ggplots (objects of class \code{gg}). Inset plots can be very useful for zooming-in on parts of a main plot where observations are crowded and for displaying summaries based on the observations shown in the main plot. The inset plots are nested in viewports which constrain the dimensions of the inset plot. Aesthetics \code{vp.height} and \code{vp.width} set the size of the viewports, with defaults of 1/3 of the height and width of the plotting area of the main plot. Themes can be applied separately to the main and inset plots.

In the first example of inset plots, the summaries shown above as numbers in a column in the inset table are displayed in an inset column plot. We first create a one-row \code{data.frame} containing the plot to be inset as a member of a \code{list}, and the $x$ and $y$ coordinates in the main plot of the location of the inset. Unlike with a \code{tibble}, with a \code{data.frame} we need to use \Rfunction{I()} to protect the \code{list}.

<<plot-plot-01>>=
mtcars |>
  group_by(cyl) |>
  summarize(mean.mpg = mean(mpg)) |>
  ggplot(data = _,
         mapping = aes(factor(cyl), mean.mpg, fill = factor(cyl))) +
  scale_fill_discrete(guide = "none") +
  scale_y_continuous(name = NULL) +
  geom_col() +
  theme_bw(8) -> my.plot
plot.tb <- data.frame(x = 500, y = 35, plot.inset = I(list(my.plot)))
@

<<plot-plot-02>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg, colour = factor(cyl))) +
  geom_point() +
  geom_plot(data = plot.tb,
            aes(x = x, y = y, label = plot.inset),
            vp.width = 1/2,
            hjust = "inward", vjust = "inward")
@

In the second example, the plot inset is a zoom-in into a region of the base plot. The code to build this plot is split into three chunks. \code{p.main} is the plot to be used as the base for the final plot.

<<plot-plot-03a>>=
p.main <-
  ggplot(data = mtcars,
         mapping = aes(x = disp, y = mpg, colour = factor(cyl))) +
  geom_point()
@

\code{p.inset} is the plot to be used as the inset; the call to \code{coord\_cartesian()} zooms into \code{p.main}; the call to \code{labs()} removes the redundant axis labels; the call to \code{scale\_colour\_discrete()} removes the redundant guide in the inset; and the calls to \code{theme\_bw()} and \code{theme()} change the theme, font size, and aspect ratio for the inset.

<<plot-plot-03b>>=
p.inset <- p.main +
  coord_cartesian(xlim = c(270, 330), ylim = c(14, 19)) +
  labs(x = NULL, y = NULL) +
  scale_colour_discrete(guide = "none") +
  theme_bw(8) + theme(aspect.ratio = 1)
@

As in the previous example, \gggeom{geom\_plot()} adds the inset, in this case with constant values for aesthetics. The call to \code{annotate()} using \gggeom{geom\_rect()} adds the rectangle highlighting the zoomed-in region in the main plot.\label{par:plot:inset:zoom}

<<plot-plot-03c>>=
p.main +
  geom_plot(x = 480, y = 34, label = list(p.inset), vp.height = 1/2) +
  annotate(geom = "rect", fill = NA, colour = "black",
           xmin = 270, xmax = 330, ymin = 14, ymax = 19,
           linetype = "dotted")
@
\index{plots!inset plots|)}
\index{plots!inset graphical objects|(}
Geometry \gggeom{geom\_grob()} differs very little from \code{geom\_plot()} but insets \pkgname{grid} graphical objects, called \code{grob} for short. This approach is very flexible, as grobs can be vector graphics as well as contain rasters (or bitmaps). In most cases, the grobs need to be first created either using functions from package \pkgname{grid} to draw them or by converting other types of objects into grobs. Geometry \gggeom{geom\_grob()} is as flexible as \gggeom{annotation\_custom()} with respect to the grobs but behaves as a \emph{geometry}. Below, two bitmaps are added as ``labels'' to the base plot.

The bitmaps are read from PNG files contained as examples in package \pkgname{ggpp}.

<<plot-grob-01a>>=
file1.name <-
  system.file("extdata", "Isoquercitin.png",
              package = "ggpp", mustWork = TRUE)
Isoquercitin <- magick::image_read(file1.name)
file2.name <-
  system.file("extdata", "Robinin.png",
              package = "ggpp", mustWork = TRUE)
Robinin <- magick::image_read(file2.name)
@

The two bitmaps are converted into \code{grobs}, added as two separate members to a list, and the list added as a column to a \code{data.frame} named, for this example, \code{grob.tb}. The coordinates for the position of each \code{grob} as well as the size of each viewport are also added to this \code{data.frame}.

<<plot-grob-01b>>=
grob.tb <-
  data.frame(x = c(0, 100), y = c(10, 20), height = 1/3, width = c(1/2),
             grobs = I(list(grid::rasterGrob(image = Isoquercitin),
                            grid::rasterGrob(image = Robinin))))
@

The two \code{grobs} are added as a single plot layer to an empty plot. Insets like these can be added to any base plot.

<<plot-grob-01c>>=
ggplot() +
  geom_grob(data = grob.tb,
            mapping = aes(x = x, y = y, label = grobs,
                          vp.height = height, vp.width = width),
            hjust = "inward", vjust = "inward")
@
\index{plots!inset graphical objects|)}

\begin{explainbox}
Grid graphics\index{grid graphics coordinate systems} provide the low-level functions that \pkgname{ggplot2} uses under the hood. Package \pkgname{grid} supports different types of units for expressing the coordinates of positions. In the \pkgname{ggplot2} user interface, \code{"native"} data coordinates are used with only a few exceptions. Package \pkgname{grid} supports the use of physical units like \code{"mm"} as well as relative units like \code{"npc"} (\emph{normalised parent coordinates}). Positions expressed as npc are numbers in the range 0 to 1, relative to the dimensions of the current \emph{viewport}, with the origin at the lower left corner. Normalised parent coordinates (\code{"npc"}) are useful when annotating plots and adding insets at positions relative to the plotting area, as these positions remain consistent across different plots, or across panels when using facets with free axis limits.

Package \pkgname{ggplot2} interprets $x$ and $y$ coordinates by default as \code{"native"} data coordinates. Starting from version 3.5.0, \pkgname{ggplot2} interprets ``mappings'' of variables and constant values enclosed in function \Rfunction{I()} as expressed in \code{"npc"} coordinates, skipping the usual mapping based on scales.

<<plot-npc-eb-02>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg, colour = factor(cyl))) +
  geom_point() +
  geom_label(x = I(0.5), y = I(0.9), label = "a label", colour = "black")
@

An earlier approach, provided by package \pkgname{ggpp}, relies on the \emph{pseudo aesthetics} \code{npcx} and \code{npcy} and on \emph{geometries} that support them; it can also be used with \pkgname{ggplot2} 3.4.4 and earlier versions.

<<plot-npc-eb-01, eval=eval_plots_all>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg, colour = factor(cyl))) +
  geom_point() +
  geom_label_npc(npcx = 0.5, npcy = 0.9, label = "a label", colour = "black",
                 vjust = "center")
@

\end{explainbox}

\index{grammar of graphics!inset-related geometries|)}
\index{plots!insets|)}
\index{grammar of graphics!geometries|)}

\section{Statistics}\label{sec:plot:statistics}
\index{grammar of graphics!statistics|(}
All statistics, except \ggstat{stat\_identity()}, modify the \code{data} they receive before passing it to a geometry. Most statistics compute a specific summary from the data, but there are exceptions. More generally, they make it possible to integrate computations on the data into the plotting workflow. This saves effort but more importantly helps ensure that the data and summaries within a given plot are consistent. Table \ref{tab:plot:stats} lists all the statistics used in the chapter.

When a factor is mapped to an aesthetic, each level creates a group. For example, in the first plot example in section \ref{sec:plot:line} on page \pageref{sec:plot:line}, the grouping resulted in separate lines. The grouping is not so obvious with other aesthetics but it is not different. Most \emph{statistics} operate separately on the data for each group, returning an independent summary for each group. Mapping a continuous variable to an aesthetic does not create groups. All aesthetics, including \code{x} and \code{y}, follow this pattern, thus a factor mapped to \code{x} also creates a group for each level of the factor.
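
The grouping described above can be inspected in the data that a layer receives. The sketch below relies on \Rfunction{layer\_data()} from \pkgname{ggplot2} and on the \code{group} column it returns; the object names are arbitrary. The same variable, \code{cyl}, is mapped to \code{colour} first as a \code{numeric} vector and then as a factor.

<<groups-sketch-01>>=
library(ggplot2)
# cyl mapped as a numeric (continuous) variable: no grouping is created
p.numeric <-
  ggplot(data = mtcars, mapping = aes(x = disp, y = mpg, colour = cyl)) +
  geom_point()
# cyl mapped as a factor: one group per level
p.factor <-
  ggplot(data = mtcars,
         mapping = aes(x = disp, y = mpg, colour = factor(cyl))) +
  geom_point()
# number of distinct groups seen by the layer in each plot
n.groups <-
  c(numeric = length(unique(layer_data(p.numeric)$group)),
    factor = length(unique(layer_data(p.factor)$group)))
n.groups
@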

\begin{table}
  \caption[Statistics]{\ggplot statistics described in section \ref{sec:plot:statistics}, packages where they are defined, their default geometry, and the aesthetics they use as input for computations.}\vspace{1ex}\label{tab:plot:stats}
  \centering
   \begin{tabular}{llll}
     \toprule
     Statistic & Package & Geometry & Aesthetics \\
     \midrule
     \code{stat\_function} & \pkgnameNI{ggplot2} & \code{geom\_function} & x \\
     \code{stat\_summary} & \pkgnameNI{ggplot2} & \code{geom\_pointrange} & x, y \\
     \code{stat\_smooth} & \pkgnameNI{ggplot2} & \code{geom\_smooth} & x, y, weight \\
     \code{stat\_poly\_line} & \pkgnameNI{ggpmisc} &\code{geom\_smooth} & x, y, weight \\
     \code{stat\_poly\_eq} & \pkgnameNI{ggpmisc} & \code{geom\_text} & x, y, weight  \\
     \code{stat\_fit\_tb} & \pkgnameNI{ggpmisc} & \code{geom\_table} & x, y, weight  \\
     \code{stat\_bin} & \pkgnameNI{ggplot2} & \code{geom\_bar} & x, y \\
     \code{geom\_histogram} & \pkgnameNI{ggplot2} & --- & x, y \\
     \code{stat\_bin2d} & \pkgnameNI{ggplot2} & \code{geom\_tile} & x, y \\
     \code{stat\_bin\_hex} & \pkgnameNI{ggplot2} & \code{geom\_hex} & x, y \\
     \code{stat\_density} & \pkgnameNI{ggplot2} & \code{geom\_area} & x, y \\
     \code{geom\_density} & \pkgnameNI{ggplot2} & --- & x, y \\
     \code{stat\_density\_2d} & \pkgnameNI{ggplot2} & \code{geom\_density\_2d} & x, y \\
     \code{stat\_boxplot} & \pkgnameNI{ggplot2} & \code{geom\_boxplot} & x, y  \\
     \code{stat\_ydensity} & \pkgnameNI{ggplot2} & \code{geom\_violin} & x, y  \\
     \code{geom\_violin} & \pkgnameNI{ggplot2} & --- & x, y  \\
     \code{geom\_quasirandom} & \pkgnameNI{ggbeeswarm} & --- & x, y \\
     \code{stat\_ma\_line} & \pkgnameNI{ggpmisc} & \code{geom\_smooth} & x, y \\
     \code{stat\_ma\_eq} & \pkgnameNI{ggpmisc} & \code{geom\_text} & x, y \\
     \code{stat\_centroid} & \pkgnameNI{ggpmisc} & \code{geom\_point} & x, y \\
     \code{stat\_quant\_line} & \pkgnameNI{ggpmisc} & \code{geom\_smooth} & x, y \\
     \code{stat\_quant\_eq} & \pkgnameNI{ggpmisc} & \code{geom\_text} & x, y \\
     \code{stat\_identity} & \pkgnameNI{ggplot2} & \code{geom\_point} & --- \\
     \bottomrule
   \end{tabular}
\end{table}

\subsection{Functions}\label{sec:plot:function}
\index{grammar of graphics!function statistic|(}
\index{plots!plots of functions|(}
Statistic \ggstat{stat\_function()} is the simplest to use and understand, even if unusual. It generates $y$ values by applying an \Rlang function to a sequence of $x$ values. The range of the \code{numeric} variable mapped to \code{x} determines the range of $x$ values used.

Any \Rlang function, user defined or not, can be used as long as it is vectorised, with the length of the returned vector equal to the length of the vector passed as an argument to its first parameter. The argument passed to parameter \code{n} of \ggstat{stat\_function()} determines the length of the generated vector of $x$ values. The data frame returned contains these $x$ values and, as $y$ values, the values returned by the function.

The code to plot the Normal probability density function is very simple, relying on the defaults \code{n = 101} and \code{geom = "function"}.

<<function-plot-01>>=
ggplot(data = data.frame(x = c(-3,3)),
       mapping = aes(x = x)) +
  stat_function(fun = dnorm)
@
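
The role of \code{n} can be checked by extracting the data generated by the statistic. In this sketch, \code{n = 11} is passed only to keep the output short, \Rfunction{layer\_data()} is used to retrieve the computed data frame, and the names \code{p.fun} and \code{fun.values} are arbitrary.

<<function-n-sketch-01>>=
library(ggplot2)
p.fun <-
  ggplot(data = data.frame(x = c(-3, 3)),
         mapping = aes(x = x)) +
  stat_function(fun = dnorm, n = 11)
# data computed by the statistic: one row per generated x value
fun.values <- layer_data(p.fun)
nrow(fun.values)
fun.values[ , c("x", "y")]
@

With such a small value of \code{n} the curve is visibly drawn as connected straight segments, which is why a larger default is used.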

Using a named list, additional arguments can be passed to the function when called to generate the data (plot not shown).

<<function-plot-02, eval=eval_plots_all>>=
ggplot(data = data.frame(x = c(-3,4)),
       mapping = aes(x = x)) +
  stat_function(fun = dnorm, args = list(mean = 1, sd = .5))
@

\begin{playground}
Edit the code above so as to plot in the same figure three curves, either for three different values for \code{mean} or for three different values for \code{sd}.
\end{playground}

Named user-defined functions (not shown), and anonymous functions (below) can also be used.

<<function-plot-03, eval=eval_plots_all>>=
ggplot(data = data.frame(x = 0:1),
       mapping = aes(x = x)) +
  stat_function(fun = function(x, a, b){a + b * x^2},
                args = list(a = 1, b = 1.4))
@

\begin{playground}
Edit the code above to use a different function, such as $e^{x + k}$, adjusting the argument(s) passed through \code{args} accordingly. Do this by means of an anonymous function, and by means of an equivalent named function defined by your code.
\end{playground}

\index{plots!plots of functions|)}
\index{grammar of graphics!function statistic|)}

\subsection{Summaries}\label{sec:plot:stat:summaries}
\index{grammar of graphics!summary statistic|(}
\index{plots!data summaries|(}
\index{plots!means}\index{plots!medians}\index{plots!error bars}
The summaries discussed in this section can be superimposed on raw data plots, or plotted on their own. Beware that if scale limits are manually set, the summaries will be calculated from the subset of observations within these limits. Scale limits can be altered when explicitly defining a scale or by means of functions \Rfunction{xlim()} and \Rfunction{ylim()}. See section \ref{sec:plot:coord} on page \pageref{sec:plot:coord} for an explanation of how coordinate limits can be used to zoom into a plot without excluding $x$ and $y$ values from the data.
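
The difference between scale limits and coordinate limits can be made concrete with a small artificial data set. In the sketch below, \code{lims.data} and the other object names are made up for the illustration; the value 10 in group \code{"A"} falls outside the limits used, and \Rfunction{layer\_data()} is used to extract the means computed by \ggstat{stat\_summary()}.

<<summary-limits-sketch-01>>=
library(ggplot2)
# artificial data: one observation in group "A" is larger than 8
lims.data <-
  data.frame(group = factor(rep(c("A", "B"), each = 5)),
             y = c(1, 2, 3, 4, 10, 1, 2, 3, 4, 5))
p.means <-
  ggplot(data = lims.data, mapping = aes(x = group, y = y)) +
  stat_summary(fun = "mean", geom = "point")
# scale limits: the out-of-range observation is discarded (with a warning)
# before the means are computed
means.scale <- layer_data(p.means + ylim(0, 8))$y
# coordinate limits: all observations are used; only the view is restricted
means.coord <- layer_data(p.means + coord_cartesian(ylim = c(0, 8)))$y
rbind(means.scale, means.coord)
@

The two rows differ only for group \code{"A"}, the group with an observation beyond the limits.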

It is possible to summarise data on the fly when plotting. The simultaneous calculation of measures of central tendency and of variation in \ggstat{stat\_summary()} allows them to be added together to the same plot layer.

Data frame \code{fake.data}, constructed below, contains normally distributed artificial values in variable \code{y} in two groups, distinguished by the levels of factor \code{group}.

<<summary-plot-00>>=
fake.data <- data.frame(
  y = c(rnorm(10, mean = 2, sd = 0.5),
        rnorm(10, mean = 4, sd = 0.7)),
  group = factor(c(rep("A", 10), rep("B", 10))))
@

Below, a base plot is constructed and assigned to \code{p1.base}.

<<summary-plot-01>>=
p1.base <-
  ggplot(data = fake.data, mapping = aes(y = y, x = group)) +
  geom_point(shape = "circle open")
@

In \ggstat{stat\_summary()}, the \Rlang function used to compute the summaries is passed as an argument. This function can be one returning a single value, like \code{mean()}, or one returning a central value and the extremes of a range. With the default argument, \ggstat{stat\_summary()} plots means and standard errors, displaying a message.

<<summary-plot-02>>=
p1.base + stat_summary()
@

For $\bar{x} \pm \mathrm{s.e.}$, the default, \code{"mean\_se"} can be passed as argument to \code{fun.data} to avoid the message seen above, and for $\bar{x} \pm \mathrm{s.d.}$ \code{"mean\_sdl"} should be passed as argument together with \code{fun.args = list(mult = 1)}, because by default this function returns $\bar{x} \pm 2\,\mathrm{s.d.}$ These functions have to be passed to parameter \code{fun.data}, while functions that return a single value, like \code{"mean"}, to \code{fun}. The \code{geom} used has to be suitable for the values computed by the \code{stat}.

Below is code for a similar plot, with means highlighted in red, using \gggeom{geom\_point()}.

<<summary-plot-03>>=
p1.base +
  stat_summary(fun = "mean", geom = "point",
               colour = "red", shape = "-", size = 15)
@

Below, confidence intervals for $P = 0.99$ computed assuming normality are added.
Intervals can also be computed without assuming normality, using the empirical distribution estimated from the data by bootstrap, by passing \code{"mean\_cl\_boot"} instead of \code{"mean\_cl\_normal"}.

<<summary-plot-04>>=
p1.base +
  stat_summary(fun.data = "mean_cl_normal", fun.args = list(conf.int = 0.99),
               colour = "red", size = 0.7, linewidth = 1, alpha = 0.5)
@

\begin{explainbox}
It is possible to use user-defined functions instead of the functions exported by package \ggplot (based on those in package \Hmisc). Additional named arguments can be passed to the summary function through parameter \code{fun.args} of \ggstat{stat\_summary()}.
\end{explainbox}
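
As a minimal sketch of a user-defined summary function, the code below defines \code{median\_iqr()}, a hypothetical function written for this example that is not part of any package. To be usable with \code{fun.data}, such a function must accept a numeric vector and return a data frame with columns \code{y}, \code{ymin} and \code{ymax}. The data are artificial and similar to \code{fake.data}.

<<summary-user-fun-sketch-01>>=
library(ggplot2)
# hypothetical summary function: median and a range between two quantiles
median_iqr <- function(y, probs = c(0.25, 0.75)) {
  limits <- quantile(y, probs = probs, names = FALSE)
  data.frame(y = median(y), ymin = limits[1], ymax = limits[2])
}
set.seed(123)
sketch.data <-
  data.frame(y = c(rnorm(10, mean = 2, sd = 0.5),
                   rnorm(10, mean = 4, sd = 0.7)),
             group = factor(rep(c("A", "B"), each = 10)))
p.iqr <-
  ggplot(data = sketch.data, mapping = aes(x = group, y = y)) +
  geom_point(shape = "circle open") +
  # probs is forwarded to median_iqr() through fun.args
  stat_summary(fun.data = median_iqr,
               fun.args = list(probs = c(0.1, 0.9)),
               colour = "red", alpha = 0.5)
p.iqr
@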

Means, or other summaries, computed by groups based on the factor mapped to the \code{x} aesthetic (\code{class} in this example) can be plotted as columns by passing \code{"col"} as an argument to parameter \code{geom}.

<<summary-plot-09a>>=
p2.base <-
  ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  stat_summary(geom = "col", fun = mean)
@

Error bars can be added to the column plot. Passing \code{linewidth = 1} makes the lines of the error bars thicker. The default \emph{geometry} in \ggstat{stat\_summary()} is \gggeom{geom\_pointrange()}; passing \code{"linerange"} as an argument for \code{geom} removes the points at the top edge of the bars.

<<summary-plot-12>>=
p2.base +
  stat_summary(geom = "linerange", fun.data = "mean_cl_normal",
               linewidth = 1, colour = "red")
@

Passing \code{"errorbar"} instead of \code{"linerange"} to \code{geom} results in traditional ``capped'' error bars. However, this type of error bar has been criticised as adding unnecessary clutter to plots \autocite{Tufte1983}. Aesthetic \code{width} controls the width of the caps at the tips of the error bars.

When calculated values for the summaries are already available in \code{data}, equivalent plots can be obtained by mapping the summary values from \code{data} to the \emph{aesthetics} \code{x}, \code{y}, \code{ymax}, and \code{ymin} and using the \code{geoms} \gggeom{geom\_errorbar()} and \gggeom{geom\_linerange()} with their default for \code{stat}, \ggstat{stat\_identity()}, to add a plot layer.
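
A minimal sketch of this approach follows. The summaries are computed with base \Rlang before plotting; note that, as an assumption made to keep the code short, the bars show $\bar{x} \pm \mathrm{s.e.}$ rather than the confidence intervals computed by \code{"mean\_cl\_normal"}. The names \code{mpg.summary} and \code{p.precomputed} are arbitrary.

<<summary-precomputed-sketch-01>>=
library(ggplot2)
# means and standard errors of hwy for each class, computed before plotting
hwy.by.class <- split(mpg$hwy, mpg$class)
hwy.means <- sapply(hwy.by.class, mean)
hwy.se <- sapply(hwy.by.class, function(v) sd(v) / sqrt(length(v)))
mpg.summary <-
  data.frame(class = names(hwy.means),
             hwy = unname(hwy.means),
             ymin = unname(hwy.means - hwy.se),
             ymax = unname(hwy.means + hwy.se))
# both layers use stat_identity(), their default
p.precomputed <-
  ggplot(data = mpg.summary, mapping = aes(x = class, y = hwy)) +
  geom_col() +
  geom_linerange(mapping = aes(ymin = ymin, ymax = ymax),
                 linewidth = 1, colour = "red")
p.precomputed
@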

\begin{explainbox}
A layer can be added to a plot directly with a \code{geom}, possibly passing a \code{stat} as an argument to it. In this book I have usually avoided this alternative syntax, except when not overriding \ggstat{stat\_identity()}, the usual default. The two code statements below are equivalent.

<<summary-plot-10, eval=eval_plots_all>>=
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_bar(stat = "summary", fun = mean)

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  stat_summary(geom = "col", fun = mean)
@
\end{explainbox}
\index{plots!data summaries|)}
\index{grammar of graphics!summary statistic|)}

\subsection{Smoothers and models}\label{sec:plot:smoothers}
\index{plots!smooth curves|(}
\index{plots!fitted curves|(}
\index{plots!statistics!smooth}

For describing or highlighting relationships between pairs of continuous variables, using a line, straight or curved, in a plot is very effective. Drawing lines that provide a meaningful and accurate description of the relationship requires lines based on predictions from models fitted to the observations. Fitted models frequently make it possible to assess the reliability of the estimation. See section \ref{sec:stat:mf} on page \pageref{sec:stat:mf} for a description of the model fitting procedures underlying the plotting described in the current section.

The statistic \ggstat{stat\_smooth()} fits a smooth curve to observations in the case when the scales for $x$ and $y$ are continuous. The corresponding \emph{geometry} \gggeom{geom\_smooth()} uses this \emph{statistic}, and differs only in how arguments are passed to formal parameters. In the first example, \ggstat{stat\_smooth()} is used with \code{method = "loess"}, a local regression smoother that is also the default for data sets with fewer than 1000 observations. When no argument is passed to \code{method}, the type of smoother is chosen automatically based on the number of observations, and the choice is reported in a message. In statistics, the \code{formula} must be stated using the names of the $x$ and $y$ aesthetics, rather than the original names of the variables mapped, i.e., in this example, not their names in the \code{mtcars} data frame. Splines, another approach to smoothing, are described in section \ref{sec:stat:splines} on page \pageref{sec:stat:splines}. When the observations are few enough to make it possible, they are usually plotted as points together with the smoother. The observations can be plotted on top of the smoother or the smoother on top of the observations, as done here.

<<smooth-plot-01>>=
p3 <-
  ggplot(data = mtcars, mapping = aes(x = disp, y = mpg)) +
  geom_point()
@

<<smooth-plot-02>>=
p3 + stat_smooth(method = "loess", formula = y ~ x)
@

A model different to the default one can be used. Below, a linear regression is fitted with \Rfunction{lm()}. Fitting of linear models is explained in section \ref{sec:stat:LM} on page \pageref{sec:stat:LM}.

<<smooth-plot-03, eval=eval_plots_all>>=
p3 + stat_smooth(method = "lm", formula = y ~ x)
@

These data can be grouped, here by mapping \code{factor(cyl)} to the \code{colour} \emph{aesthetic}. With three groups, three separate linear regressions are fitted, and displayed as three straight lines. Each line is enclosed by a confidence band for the ``true'' location of the line.

<<smooth-plot-04>>=
p3 + aes(colour = factor(cyl)) +
  stat_smooth(method = "lm", formula = y ~ x)
@

To obtain a single fitted smoother, in this case a joint linear regression line for the three groups, the grouping in the layer is disabled by setting the \code{colour} \emph{aesthetic} to a constant value in the call to \ggstat{stat\_smooth()}. Values passed to a layer function as arguments override the defaults set in \code{ggplot()}. The use of \code{"black"} is arbitrary; any other colour definition known to \Rlang could have been used instead.

<<smooth-plot-05, eval=eval_plots_all>>=
p3 + aes(colour = factor(cyl)) +
  stat_smooth(method = "lm", formula = y ~ x, colour = "black")
@

A different linear model, a second degree polynomial in this example, is fitted below by passing a different argument to \code{formula} than in the example above for linear regression.

<<smooth-plot-06>>=
p3 + aes(colour = factor(cyl)) +
  stat_smooth(method = "lm", formula = y ~ poly(x, 2), colour = "grey20")
@

\begin{explainbox}
It is possible to use other types of models, including GAM and GLM, as smoothers.  I give next two simple examples of the use of \code{nls()} to fit a model non-linear in its parameters (see section \ref{sec:stat:NLS} on page \pageref{sec:stat:NLS} for details about fitting this same model with \code{nls()}). In both examples, the model fitted is the Michaelis-Menten equation, describing the rate of a chemical reaction (\code{rate}) as a function of reactant concentration (\code{conc}). \Rdata{Puromycin} is a data set included in the \Rlang distribution. Function \Rfunction{SSmicmen()}, used in the first example, is also from \Rlang, and is a \emph{self-starting}\index{self-starting functions} implementation of the Michaelis-Menten equation. Thanks to this, even though the fit is done with an iterative algorithm, starting values for the parameters to be fitted are not needed. Passing \code{se = FALSE} suppresses the attempt to compute a confidence band as it is not supported by the \code{predict()} method for model fits done with function \Rfunction{nls()}.

<<smooth-plot-07>>=
ggplot(data = Puromycin,
       mapping = aes(conc, rate, colour = state)) +
  geom_point() +
  geom_smooth(method = "nls", formula =  y ~ SSmicmen(x, Vm, K), se = FALSE)
@

In the second example, the code describing the equation is passed as an argument to \code{formula}, with starting values passed as a named list to \code{start}. The names used for the parameters to be estimated by fitting the model can be chosen at will, within the restrictions of the \Rlang language, but of course the names used in \code{formula} and \code{start} must match each other. As for other models, \code{x} and \code{y} are the names of the aesthetics to which the observations have been mapped (plot not shown).%
\pagebreak

<<smooth-plot-08, eval=eval_plots_all>>=
ggplot(data = Puromycin,
       mapping = aes(conc, rate, colour = state)) +
  geom_point() +
  geom_smooth(method = "nls",
              formula =  y ~ (Vmax * x) / (k + x),
              method.args = list(start = list(Vmax = 200, k = 0.05)),
              se = FALSE)
@
\end{explainbox}

In some cases, it is desirable to annotate plots with fitted model equations or fitted parameters. One way of achieving this is by fitting the model and then extracting the parameters to manually construct text strings to use for text or label annotations. However, package \pkgname{ggpmisc} makes it possible to automate such annotations in many cases. This package also provides \ggstat{stat\_poly\_line()}, which is similar to \ggstat{stat\_smooth()} but with \code{method = "lm"} consistently as its default irrespective of the number of observations.

<<smooth-plot-12, warning=FALSE>>=
my.formula <- y ~ x + I(x^2)
p3 + aes(colour = factor(cyl)) +
  stat_poly_line(formula = my.formula, colour = "black") +
  stat_poly_eq(formula = my.formula, mapping = use_label(c("eq", "F")),
               colour = "black", label.x = "right")
@
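
For comparison, the manual approach mentioned above can be sketched as follows: the model is fitted with \Rfunction{lm()}, the estimates are extracted with \Rfunction{coef()}, and a character string is assembled with \Rfunction{sprintf()} and added with \code{annotate()}. The object names and the plain-text format of the label are choices made for this illustration, and a single model is fitted to all observations rather than one per group.

<<smooth-manual-eq-sketch-01>>=
library(ggplot2)
# fit the same second degree polynomial outside the plotting code
fit <- lm(mpg ~ disp + I(disp^2), data = mtcars)
b <- coef(fit)
# "%+.3g" prints each estimate with its sign and three significant digits
eq.label <- sprintf("y = %.3g %+.3g x %+.3g x^2", b[1], b[2], b[3])
p.manual <-
  ggplot(data = mtcars, mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x + I(x^2)) +
  # Inf positions the label at the top right corner of the plotting area
  annotate(geom = "text", x = Inf, y = Inf, label = eq.label,
           hjust = 1.1, vjust = 1.5)
p.manual
@

The label has to be rebuilt by hand whenever the data or the model change, and once per group or panel when needed, which is the repetition that the statistics from \pkgname{ggpmisc} avoid.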

Package \pkgname{ggpmisc} also makes it possible to annotate plots with summary tables from a model fit. The argument passed to \code{tb.vars} substitutes the names of the columns in the table.

<<smooth-plot-13>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg, colour = factor(cyl))) +
  stat_poly_line(formula = my.formula, colour = "black") +
  stat_fit_tb(method.args = list(formula = my.formula),
              colour = "black",
              parse = TRUE,
              tb.vars = c(Parameter = "term",
                          Estimate = "estimate",
                          "s.e." = "std.error",
                          "italic(t)" = "statistic",
                          "italic(P)" = "p.value"),
              label.y = "top", label.x = "right") +
  geom_point() +
  expand_limits(y = 40)
@

Package \pkgname{ggpmisc} provides additional \emph{statistics} for the annotation of plots based on fitted models supported by package \pkgname{broom} and its extensions. It also supports lines and equations for quantile regression and major axis regression. Please see the package documentation for details.

\index{plots!smooth curves|)}
\index{plots!fitted curves|)}

\subsection{Frequencies and counts}\label{sec:histogram}\label{sec:plot:histogram}
\index{plots!histograms|(}

When the number of observations is rather small, it is possible to rely on the density of graphical elements, such as points, to convey the density of the observations. For example, scatter plots using well-chosen values for transparency, \code{alpha}, can give a satisfactory impression of the density. Rug plots, described in section \ref{sec:plot:rug} on page \pageref{sec:plot:rug}, can also satisfactorily convey the density of observations along $x$ and/or $y$ axes. Such approaches do not involve computations, while the \emph{statistics} described in this section do. Frequencies by value-range (or bins) and empirical density functions are summaries especially useful when the number of observations is large. These summaries can be computed in one or more dimensions.

Histograms are defined by how the plotted values are calculated. Although histograms are most frequently plotted as bar plots, many bar or ``column'' plots are not histograms. Although rarely done in practice, a histogram could be plotted with a different \emph{geometry} by using \ggstat{stat\_bin()}, the \emph{statistic} used by default by \gggeom{geom\_histogram()}. This \emph{statistic} does binning of observations before computing frequencies, and is suitable for observations on a continuous scale, usually mapped to the \code{x} aesthetic. When a factor is mapped to \code{x}, \ggstat{stat\_count()} can be used, the default \code{stat} of \gggeom{geom\_bar()}. These two \emph{geometries} are described in this section about statistics, because they default to using statistics different from \code{stat\_identity()} and consequently summarise the data.

The code below constructs a data frame containing an artificial data set.

<<histogram-plot-00>>=
set.seed(54321)
my.data <-
  data.frame(X = rnorm(600),
             Y = c(rnorm(300, -1, 1), rnorm(300, 1, 1)),
             group = factor(rep(c("A", "B"), c(300, 300))) )
@

A default and usually suitable number of bins is automatically selected by the \ggstat{stat\_bin()} statistic; however, passing \code{bins = 15} sets it manually. In a histogram plot, the variable mapped onto the \code{y} \emph{aesthetic} is not taken from \code{data} but instead computed by the statistic as the number of observations falling in each \emph{bin}.

<<histogram-plot-01>>=
ggplot(data = my.data, mapping = aes(x = X)) +
  geom_histogram(bins = 15)
@
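
The values computed by the statistic can be inspected by extracting the layer data from a plot. The chunk below is a minimal sketch, assuming the \code{faithful} data set included in \Rlang instead of \code{my.data}; the names \code{p.hist} and \code{bins.df} are only illustrative. Function \Rfunction{layer\_data()} from \ggplot returns one row per bin, with the bin limits in \code{xmin} and \code{xmax} and the number of observations in \code{count}.

<<histogram-sketch-01>>=
library(ggplot2)
# 'faithful' ships with R; 'p.hist' and 'bins.df' are illustrative names
p.hist <-
  ggplot(data = faithful, mapping = aes(x = eruptions)) +
  geom_histogram(bins = 15)
# data frame as returned by stat_bin(), one row per bin
bins.df <- layer_data(p.hist)
bins.df[ , c("xmin", "xmax", "count")]
# each observation is counted in exactly one bin
sum(bins.df$count) == nrow(faithful)
@
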

\begin{explainbox}
A reason to add layers with \gggeom{geom\_histogram()}, instead of with \ggstat{stat\_bin()} or \ggstat{stat\_count()} is that its name is easier to remember.

<<histogram-plot-04, eval=eval_plots_all>>=
ggplot(data = my.data,
       mapping = aes(x = Y, fill = group)) +
  stat_bin(bins = 15, position = "dodge")
@
\end{explainbox}

The grouping created by mapping a factor to an additional \emph{aesthetic} results in two separate histograms. The position of the two groups of bars with respect to each other is controlled with \emph{position} functions (see section \ref{sec:plot:positions} on page \pageref{sec:plot:positions} for details). With \code{position = "dodge"}, bars are plotted side by side; with \code{position = "stack"}, the default, they are plotted one above the other; and with \code{position = "identity"}, they overlap. In this last case, adding \code{alpha = 0.5} makes occluded bars visible. The examples below use \code{position = "dodge"}.

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_2fig_very_wide)
@

<<histogram-plot-02>>=
p.base <-
  ggplot(data = my.data,
         mapping = aes(x = Y, fill = group))
@

<<histogram-plot-02a>>=
p1 <- p.base + geom_histogram(bins = 15, position = "dodge")
@

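In addition to \code{count}, \code{density} is returned. It is computed for each bin as \code{count} divided by the product of the number of observations in the group and the bin width, so that the area of the bars of each group adds up to one. In \code{p2}, it is mapped to \code{y} using \Rfunction{after\_stat()}.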

<<histogram-plot-03>>=
p2 <- p.base + geom_histogram(mapping = aes(y = after_stat(density)),
                              bins = 15, position = "dodge")
@

<<histogram-plot-03a>>=
p1 + p2
@
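
The relationship between \code{count} and \code{density} can be checked numerically. The chunk below is a sketch under the assumption that the layer data returned by \Rfunction{layer\_data()} contain the columns \code{count}, \code{density}, \code{xmin}, and \code{xmax}; the data set \code{faithful} and the names \code{p.dens}, \code{h.df}, and \code{bin.width} are used only for illustration.

<<histogram-sketch-02>>=
library(ggplot2)
p.dens <-
  ggplot(data = faithful, mapping = aes(x = eruptions)) +
  geom_histogram(bins = 15)
h.df <- layer_data(p.dens)
bin.width <- h.df$xmax - h.df$xmin
# density rescales the counts by sample size and bin width
all.equal(h.df$density,
          h.df$count / (sum(h.df$count) * bin.width))
# total area of the bars
sum(h.df$density * bin.width)
@
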

\emph{Statistic} \ggstat{stat\_bin2d()}, and its matching \emph{geometry} \gggeom{geom\_bin2d()}, by default compute a frequency histogram in two dimensions, along the \code{x} and \code{y} \emph{aesthetics}. The \code{count} for each 2D bin is mapped to the \code{fill} aesthetic, with a lighter-coloured value being equivalent to a taller bar in a 1D histogram.

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_wide)
@

<<bin2d-plot-01a>>=
p.base <-
  ggplot(data = my.data,
         mapping = aes(x = X, y = Y)) +
  facet_wrap(facets = vars(group))
@

<<bin2d-plot-01>>=
p.base + stat_bin2d(bins = 8)
@

\emph{Statistic} \ggstat{stat\_bin\_hex()}, and its matching \emph{geometry} \gggeom{geom\_hex()}, differ from \ggstat{stat\_bin2d()} only in their use of hexagonal instead of square bins and tiles.

<<hex-plot-01>>=
p.base + stat_bin_hex(bins = 8)
@

Like \ggstat{stat\_bin()}, \ggstat{stat\_bin2d()} and \ggstat{stat\_bin\_hex()} compute \code{density} in addition to \code{count}, and either of them can be plotted by mapping it to the \code{fill} aesthetic.
\index{plots!histograms|)}
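
As a brief sketch of mapping \code{density} instead of \code{count}, the chunk below uses \Rfunction{after\_stat()} within the layer. The data set \code{faithful} and the name \code{p.bin2d} are assumptions made only to keep the example self-contained.

<<bin2d-sketch-01>>=
library(ggplot2)
# 'p.bin2d' is an illustrative name
p.bin2d <-
  ggplot(data = faithful,
         mapping = aes(x = eruptions, y = waiting)) +
  stat_bin2d(mapping = aes(fill = after_stat(density)), bins = 8)
p.bin2d
@
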

\subsection{Density functions}\label{sec:plot:density}
\index{plots!density plot!1 dimension|(}
\index{plots!statistics!density}
Empirical density functions are the continuous equivalent of a histogram: instead of being calculated using bins, they are fitted to the observations. They can be estimated in 1 or 2 dimensions (1D or 2D). As with histograms, it is possible to use different \emph{geometries} with them. Examples of \gggeom{geom\_density()} used to create 1D density plots follow. A semitransparent fill is used in addition to colour. Density plots for \code{Y} and \code{X}, i.e., using as mappings \code{x = Y} and \code{x = X}, are shown below side by side.

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_2fig_very_wide)
@

<<density-plot-01>>=
p3 <-
  ggplot(data = my.data,
       mapping = aes(x = Y, colour = group, fill = group)) +
  geom_density(alpha = 0.3)
@

<<density-plot-02>>=
p4 <-
  ggplot(data = my.data,
       mapping = aes(x = X, colour = group, fill = group)) +
  geom_density(alpha = 0.3)
@

Plot composition, as used below, is described in detail in section \ref{sec:plot:composition} on page \pageref{sec:plot:composition}.

<<density-plot-03>>=
p3 + p4 # plot composition
@
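
How smooth the fitted density curve is depends on the bandwidth used. As a sketch, the chunk below overlays three estimates for the same data, multiplying the default bandwidth by different values passed to parameter \code{adjust} of \gggeom{geom\_density()}. The data set \code{faithful} and the name \code{p.adj} are assumptions used only for illustration.

<<density-sketch-01>>=
library(ggplot2)
# smaller values of 'adjust' give wigglier curves, larger ones smoother curves
p.adj <-
  ggplot(data = faithful, mapping = aes(x = eruptions)) +
  geom_density(adjust = 0.5, colour = "red") +
  geom_density(adjust = 1) +
  geom_density(adjust = 2, colour = "blue")
p.adj
@
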
\index{plots!density plot!1 dimension|)}

\index{plots!density plot!2 dimensions|(}
\index{plots!statistics!density 2d}

The 2D density plots below use the same data as the 1D plots above. In the first example, \ggstat{stat\_density\_2d()} creates two 2D density ``maps'' shown using isolines, with \code{group} mapped to the \code{colour} \emph{aesthetic}. Isolines can be used when the empirical distributions overlap. The 1D plots above show the projections of the 2D density in the plot below onto the two axes.

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_narrow_square)
@

<<density-plot-10, out.width='.46\\textwidth'>>=
ggplot(data = my.data,
       mapping = aes(x = X, y = Y, colour = group)) +
  stat_density_2d()
@%
\pagebreak

Below, the 2D density for each group is plotted in a separate panel, with \code{level}, a variable computed by \code{stat\_density\_2d()}, mapped to the \code{fill} \emph{aesthetic}.

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_wide)
@

<<density-plot-12>>=
ggplot(data = my.data,
       mapping = aes(x = X, y = Y)) +
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon") +
  facet_wrap(facets = vars(group))
@

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_narrow)
@
\index{plots!density plot!2 dimensions|)}

\subsection{Box and whiskers plots}\label{sec:boxplot}
\index{box plots|see{plots, box and whiskers plot}}
\index{plots!box and whiskers plot|(}

Box and whiskers plots, or just box plots, are summaries that convey some of the properties of a distribution. They are calculated and plotted with \ggstat{stat\_boxplot()} or the matching \gggeom{geom\_boxplot()}. Although box plots can be plotted based on just a few observations, they are not useful unless each box plot is based on more than 10 to 15 observations. In the next example, a sample of every sixth row from the data frame \code{my.data} with \Sexpr{nrow(my.data)} rows is used.

<<echo=FALSE>>=
opts_chunk$set(opts_2fig_very_wide)
@

<<bw-plot-00>>=
p.base <-
  ggplot(data = my.data[c(TRUE, rep(FALSE, 5)) , ],
         mapping = aes(x = group, y = Y))
@

<<bw-plot-01>>=
p1 <- p.base + stat_boxplot()
@

As with other \emph{statistics}, the appearance obeys both \emph{aesthetics} such as \code{colour}, and parameters specific to box plots: \code{outlier.colour}, \code{outlier.fill}, \code{outlier.shape}, \code{outlier.size}, \code{outlier.stroke}, and \code{outlier.alpha}, which affect outliers similarly to the equivalent \emph{aesthetics}. The shape and width of the ``box'' can be adjusted with \code{notch}, \code{notchwidth} and \code{varwidth}. Notches in box plots play a role similar to that of confidence limits for means.

<<bw-plot-02>>=
p2 <-
  p.base +
  stat_boxplot(notch = TRUE, width = 0.4,
               outlier.colour = "red", outlier.shape = "*", outlier.size = 5)
@

The two plots have been composed side by side to save space (see section \ref{sec:plot:composing} on page \pageref{sec:plot:composing} for details about composing plots).

<<bw-plot-03>>=
p1 + p2
@
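
The summaries computed by \ggstat{stat\_boxplot()} can be extracted from a plot with \Rfunction{layer\_data()}. The chunk below is a sketch, assuming the \code{iris} data set and the illustrative names \code{p.box} and \code{box.df}. Columns \code{lower}, \code{middle}, and \code{upper} describe the box, and \code{ymin} and \code{ymax} the ends of the whiskers.

<<bw-sketch-01>>=
library(ggplot2)
p.box <-
  ggplot(data = iris,
         mapping = aes(x = Species, y = Sepal.Length)) +
  stat_boxplot()
box.df <- layer_data(p.box)  # one row per box
box.df[ , c("ymin", "lower", "middle", "upper", "ymax")]
# the line across each box is drawn at the median of the group
all.equal(box.df$middle,
          as.numeric(tapply(iris$Sepal.Length, iris$Species, median)))
@
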

\index{plots!box and whiskers plot|)}

\subsection{Violin plots}\label{sec:plot:violin}
\index{plots!violin plot|(}

Violin plots are a more recent development than box plots, and usable with relatively large numbers of observations. They could be thought of as being a sort of hybrid between an empirical density function (see section \ref{sec:plot:density} on page \pageref{sec:plot:density}) and a box plot (see section \ref{sec:boxplot} on page \pageref{sec:boxplot}). As is the case with box plots, they are particularly useful when comparing distributions of related data, side by side. They can be created with \gggeom{geom\_violin()} as shown in the examples below.

<<violin-plot-02>>=
p3 <- p.base +
  geom_violin(aes(fill = group), alpha = 0.16) +
  geom_point(alpha = 0.33, size = 1.5, colour = "black", shape = 21)
@

As with other \emph{geometries}, their appearance obeys both the usual \emph{aesthetics}, such as colour, and others specific to these types of visual representation.
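
As a sketch of two such parameters, the chunk below passes \code{trim = FALSE}, so that the tails of the violins extend beyond the range of the observations, and \code{scale = "width"}, so that all violins have the same maximum width. The \code{iris} data set and the name \code{p.violin} are assumptions made to keep the example self-contained.

<<violin-sketch-01>>=
library(ggplot2)
# 'p.violin' is an illustrative name
p.violin <-
  ggplot(data = iris,
         mapping = aes(x = Species, y = Sepal.Length)) +
  geom_violin(trim = FALSE, scale = "width", fill = "grey90")
p.violin
@
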

Other types of displays related to violin plots are \emph{beeswarm} plots and \emph{sina} plots, and can be produced with \emph{geometries} defined in packages \pkgname{ggbeeswarm} and \pkgname{ggforce}, respectively. A minimal example of a beeswarm plot is shown below. See the documentation of the packages for details about the many options in their use.

<<ggbeeswarm-plot-01>>=
p4 <- p.base + geom_quasirandom()
@

<<ggbeeswarm-plot-012>>=
p3 + p4
@

\index{plots!violin plot|)}
\index{grammar of graphics!statistics|)}

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_medium)
@

\section{Flipped Plot Layers}\label{sec:plot:flipped}
\index{grammar of graphics!flipped axes|(}
\index{grammar of graphics!swap axes}
\index{grammar of graphics!orientation}
\index{grammar of graphics!horizontal geometries}
\index{grammar of graphics!horizontal statistics}

Although it is the norm to design plots so that the independent variable is on the $x$ axis, i.e., mapped to the \code{x} aesthetic, there are situations where swapping the roles of $x$ and $y$ is useful. In \ggplot, this is described as \emph{flipping the orientation} of a plot or of a plot layer. In the present section, I exemplify both cases where the flipping is automatic and where flipping requires user intervention. Some geometries like \gggeom{geom\_point()} are symmetric on the \textit{x} and \textit{y} aesthetics, but others like \gggeom{geom\_line()} operate differently on \textit{x} and \textit{y}. This is also the case for most \emph{statistics}.

Starting from \ggplot version 3.3.0, most geometries and statistics for which it is meaningful support flipping using a new syntax. This new approach differs from flipping the coordinate system (which is expected to be deprecated in the future), and is conceptually similar to that implemented by package \pkgname{ggstance}. However, instead of defining new horizontal layer functions as in \pkgname{ggstance}, in \ggplot the orientation of many layer functions can change. This has made package \pkgname{ggstance} nearly redundant and the coding of flipped plots easier and more intuitive. Although \ggplot has offered \ggcoordinate{coord\_flip()} for a long time, flipping of plot coordinates affects the whole plot rather than individual layers.

When a factor is mapped to $x$ or $y$ flipping is automatic. A factor creates groups and summaries are computed per group, i.e., per level of the factor irrespective of the factor being mapped to the $x$ or $y$ aesthetic. There are also cases that require user intervention. For example, flipping must be requested manually if both $x$ and $y$ are mapped to continuous variables. This is, for example, the case with \ggstat{stat\_smooth()} and with \gggeom{geom\_line()}.

\begin{figure}
  \centering%
  {\sffamily%
\resizebox{0.8\linewidth}{!}{%
\begin{tikzpicture}[auto]
    \node [b] (data) {data};
    \node [bo, right = of data] (statistic) {statistic};
    \node [b, right = of statistic] (geometry) {geometry};
    \node [b, right = of geometry] (render) {rendered\\plot};

    \path [ll] (statistic) -- (data) node[near end,above]{\ \ \ \ \ \ $x \rightleftarrows y$};
    \path [ll] (geometry) -- (statistic) node[near end,above]{\ \ \ \ \ \ $y \rightleftarrows x$};
    \path [ll] (render) -- (geometry) node[near end,above]{};
  \end{tikzpicture}}\\[2.5ex]
  \resizebox{0.8\linewidth}{!}{%
\begin{tikzpicture}[auto]
    \node [b] (data) {data};
    \node [b, right = of data] (statistic) {statistic};
    \node [bo, right = of statistic] (geometry) {geometry};
    \node [b, right = of geometry] (render) {rendered\\plot};

    \path [ll] (statistic) -- (data) node[near end,above]{};
    \path [ll] (geometry) -- (statistic) node[near end,above]{\ \ \ \ \ \ $x \rightleftarrows y$};
    \path [ll] (render) -- (geometry) node[near end,above]{\ \ \ \ \ \ $y \rightleftarrows x$};
  \end{tikzpicture}}}
  \caption[Flipped layers diagram]{Flipped layers. Top diagram, flipped aesthetics in statistic with \code{orientation = "y"}; bottom diagram, flipped aesthetics in geometry with \code{orientation = "y"}. During flipping, related aesthetics such as \code{xmin} and \code{ymin} are also swapped, but not shown in the diagram. }\label{fig:plot:flip:stat}
\end{figure}

In \emph{statistics}, passing \code{orientation = "y"} as argument results in the calculations being applied after swapping the mappings of the \code{x} and \code{y} aesthetics. After applying the calculations, the mappings of the $x$ and $y$ and related aesthetics are swapped back (Figure \ref{fig:plot:flip:stat}).

In geometries, passing \code{orientation = "y"} also results in flipping of the aesthetics  (Figure \ref{fig:plot:flip:stat}). For example, in \gggeom{geom\_line()}, flipping changes the drawing of the lines. Normally observations are sorted along the $x$ axis before drawing the line segments connecting them. After flipping, as $x$ and $y$ are swapped, observations are sorted along the $y$ axis before drawing the connecting segments. The variables shown on each axis remain the same, as does the position of points drawn with \gggeom{geom\_point()}, but the line connecting them is different: in the example below, only two segments are the same in the flipped plot and in the ``normal'' one.

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_2fig_very_wide)
@

<<flipping_box-01a-ggplot>>=
p.base <-
   ggplot(data = mtcars[1:8, ], mapping = aes(x = hp, y = mpg)) +
    geom_point()
p1 <- p.base + geom_line() + ggtitle("Not flipped")
p2 <- p.base + geom_line(orientation = "y") + ggtitle("Flipped")
p1 + p2
@

The next pair of examples demonstrates automatic flipping using \ggstat{stat\_boxplot()}. Factor \code{Species} is mapped first to $x$ and then to $y$. In both cases, the same box plots are computed and plotted for each level of the factor. Statistics \ggstat{stat\_boxplot()}, \ggstat{stat\_summary()}, \ggstat{stat\_bin()} and \ggstat{stat\_density()} behave similarly with respect to automatic flipping.

<<flipping-01-ggplot>>=
p3 <-
  ggplot(data = iris, mapping = aes(x = Species, y = Sepal.Length)) +
  stat_boxplot()
@

<<flipping-02-ggplot>>=
p4 <-
  ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Species)) +
  stat_boxplot()
@

<<flipping-01-02-ggplot>>=
p3 + p4
@

In the case of \code{stats} that do computations on a single variable mapped to \code{x} or \code{y} aesthetics, flipping is also automatic.

<<flipping-03-ggplot>>=
p5 <-
  ggplot(data = iris,
         mapping = aes(x = Sepal.Length, colour = Species)) +
  stat_density(geom = "line", position = "identity")
@

<<flipping-04-ggplot>>=
p6 <-
  ggplot(data = iris,
         mapping = aes(y = Sepal.Length, colour = Species)) +
  stat_density(geom = "line", position = "identity")
@

<<flipping-03-04-ggplot>>=
p5 + p6
@
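
The same automatic flipping applies to histograms. In the sketch below, a single continuous variable is mapped to \code{y}, and the bars are drawn horizontally. I assume here that the layer data include a logical column named \code{flipped\_aes} recording the orientation in use; the name \code{p.hflip} is only illustrative.

<<flipping-sketch-01>>=
library(ggplot2)
p.hflip <-
  ggplot(data = iris, mapping = aes(y = Sepal.Length)) +
  geom_histogram(bins = 15)
p.hflip
# TRUE when the layer has been flipped
unique(layer_data(p.hflip)$flipped_aes)
@
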

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_narrow)
@

\begin{explainbox}
In the case of ordinary least squares (OLS), regressions of $y$ on $x$ and of $x$ on $y$ in most cases yield different fitted lines, even though $R^2$ is the same for both. This is due to the assumption that $x$ values are known, either set or measured without error, i.e., not subject to uncertainty. Under this assumption, all unexplained variation in the data is attributed to $y$. See section \ref{sec:stat:mf} on page \pageref{sec:stat:mf} or consult a Statistics book such as \citetitle{Holmes2019} \autocite[][pp.\ 168--170]{Holmes2019} for additional information.
\end{explainbox}
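
The explanation in the box above can be checked numerically with \Rfunction{lm()}, without plotting. The chunk below is a sketch using the whole \code{iris} data set, ignoring \code{Species}; the names of the objects created are only illustrative. To compare the two lines in the same plane, the slope from the regression of $x$ on $y$ is inverted.

<<flipping-sketch-02>>=
fit.yx <- lm(Petal.Length ~ Sepal.Length, data = iris)
fit.xy <- lm(Sepal.Length ~ Petal.Length, data = iris)
slope.yx <- coef(fit.yx)[["Sepal.Length"]]
# slope of the flipped fit expressed as change in y per unit of x
slope.xy.flipped <- 1 / coef(fit.xy)[["Petal.Length"]]
c(y.on.x = slope.yx, x.on.y = slope.xy.flipped)
# R2 is the same irrespective of which variable is the response
r2.yx <- summary(fit.yx)$r.squared
r2.xy <- summary(fit.xy)$r.squared
all.equal(r2.yx, r2.xy)
@

The product of the two OLS slopes, before inversion, equals $R^2$, so the two lines coincide only when the correlation is perfect.
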

With two continuous variables mapped, the default is to take $x$ as independent and $y$ as dependent. Passing \code{"x"} (the default) or \code{"y"} as argument to parameter \code{orientation} indicates which of $x$ or $y$ is the independent or explanatory variable.

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_2fig_very_wide)
@

<<flipping-05base-ggplot>>=
p.base <-
  ggplot(data = iris,
       mapping = aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point() +
  facet_wrap(~Species, scales = "free")
@

<<flipping-05-ggplot>>=
p.base + stat_smooth(method = "lm", formula = y ~ x)
@

Passing \code{orientation = "y"} to \ggstat{stat\_smooth()} is equivalent to swapping $x$ and $y$ in the model \code{formula}. The looser the correlation, the more the lines fitted before and after flipping differ.

<<flipping-06-ggplot>>=
p.base + stat_smooth(method = "lm", formula = y ~ x, orientation = "y")
@

The two variables in the example above are both response variables, not directly connected by cause and effect, and with measurements subject to similar errors. Neither of the two fitted models comes close to fulfilling the assumptions of OLS regression.

\begin{explainbox}
Flipping the orientation of plot layers with \code{orientation = "y"} is not equivalent to flipping the whole plot with \ggcoordinate{coord\_flip()}. In the first case, which axis is considered independent for computation changes but not the positions of the axes in the plot, while in the second case, the position of the $x$ and $y$ axes in the plot is swapped. So, when coordinates are flipped the $x$ aesthetic is plotted on the vertical axis and the $y$ aesthetic on the horizontal axis, but the role of the variable mapped to the \code{x} aesthetic remains as explanatory variable. (Use of \ggcoordinate{coord\_flip()} will likely be deprecated in the future.)

<<flipping-06a-ggplot, >>=
p.base +
  stat_smooth(method = "lm", formula = y ~ x) +
  coord_flip()
@
\end{explainbox}

In package \ggpmisc (version $\geq$ 0.4.1), statistics related to model fitting have an \code{orientation} parameter as those from package \ggplot do, but in addition they accept formulas where $x$ is on the lhs and $y$ on the rhs, such as \code{formula = x \~{} y} providing a syntax consistent with \Rlang's model fitting functions. With two calls to \ggstat{stat\_poly\_line()}, the first using the default \code{formula = y \~{} x}, and the second using \code{formula = x \~{} y} to force the flipping of the fitted model, the plot produced contains two fitted lines per panel, with the flipped ones highlighted as red lines and yellow bands.

<<flipping-07-ggpmisc>>=
p.base +
    stat_poly_line() +
    stat_poly_line(formula = x ~ y, colour = "red", fill = "yellow")
@

In\index{plots!major axis regression}\label{par:ma:example} the case of the \code{iris} data used for these examples, both approaches to linear regression used above are wrong. In this case, the correct approach is to not assume that there is a variable that can be considered independent and another dependent on it, but instead to use a method like major axis (MA) regression, as below.

<<flipping-08-ggpmisc, message=FALSE, warning=FALSE>>=
p.base + stat_ma_line()
@
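
To illustrate how MA regression treats $x$ and $y$ symmetrically, the sketch below computes an MA slope as the direction of the first principal component of the two variables, using \Rfunction{prcomp()} from base \Rlang. This is only an illustration of the concept, not necessarily how \ggstat{stat\_ma\_line()} does the computation, and it uses the whole \code{iris} data set, ignoring \code{Species}. The names of the objects created are illustrative.

<<flipping-sketch-03>>=
xy.df <- iris[ , c("Sepal.Length", "Petal.Length")]
# first principal component of the unscaled variables
pc <- prcomp(xy.df)
ma.slope <- pc$rotation["Petal.Length", 1] /
            pc$rotation["Sepal.Length", 1]
# OLS slopes, both expressed as change in y per unit of x
ols.yx <- coef(lm(Petal.Length ~ Sepal.Length, data = xy.df))[[2]]
ols.xy <- 1 / coef(lm(Sepal.Length ~ Petal.Length, data = xy.df))[[2]]
c(OLS.y.on.x = ols.yx, MA = ma.slope, OLS.x.on.y = ols.xy)
@

With positively correlated variables, as here, the MA slope falls between the two OLS slopes.
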

%A related problem is when we need to summarise in the same plot layer $x$ and $y$ values. A simple example is adding a point with coordinates given by the means along the $x$ and $y$ axes as we need to pass these computed means simultaneously to \gggeom{geom\_point()}. Package \ggplot provides \ggstat{stat\_density\_2d()} and \ggstat{stat\_summary\_2d()}. However, \ggstat{stat\_summary\_2d()} uses bins, and is similar to \ggstat{stat\_density\_2d()} in how the computed values are returned. Package \pkgname{ggpmisc} provides two dimensional equivalents of \ggstat{stat\_summary()}: \ggstat{stat\_centroid()}, which applies the same summary function along $x$ and $y$, and \ggstat{stat\_summary\_xy()}, which accepts one function for $x$ and one for $y$.
%
%<<flipping-09-ggpmisc>>=
%ggplot(data = iris,
%       mapping = aes(x = Sepal.Length, y = Petal.Length)) +
%    geom_point() +
%    stat_centroid(colour = "red") +
%    facet_wrap(~Species, scales = "free")
%@
%
%<<flipping-10-ggpmisc>>=
%ggplot(data = iris,
%       mapping = aes(x = Sepal.Length, y = Petal.Length)) +
%    geom_point() +
%    stat_centroid(geom = "rug", sides = "trbl",
%                  colour = "red", linewidth = 1.5) +
%    facet_wrap(~Species, scales = "free")
%@
%
%\begin{playground}
%Which of the plots in the last two chunks above can be created by adding two layers with \ggstat{stat\_summary()}? Recreate this plot using \ggstat{stat\_summary()}.
%\end{playground}
%
\index{grammar of graphics!flipped axes|)}

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_narrow)
@

\section{Facets}\label{sec:plot:facets}
\index{grammar of graphics!facets|(}
\index{plots!trellis-like}\index{plots!coordinated panels}
\Kern{-1}{Facets are used in a special kind of plot containing multiple panels in which the panels share some properties. These sets of coordinated panels are a useful tool for visualising complex data. These plots became popular through the \code{trellis} graphs in \langname{S}, and the \pkgname{lattice} package in \Rlang. The basic idea is to have rows and/or columns of plots with common scales, all plots showing values for the same response variable. This is useful when there are multiple classification factors in a data set. Similar-looking plots, but with free scales or with the same scale but a `floating' intercept, are sometimes also useful. In \ggplot, there are two possible types of facets: facets organised in a grid and facets along a single `axis' of variation but, possibly, wrapped into two or more rows. These are produced by adding \Rfunction{facet\_grid()} or \Rfunction{facet\_wrap()}, respectively. Below, \gggeom{geom\_point()} is used in the examples, but faceting can be used with plots containing layers created with any \code{geom} or \code{stat}.}

<<echo=FALSE>>=
opts_chunk$set(opts_fig_wide)
@

A single-panel plot, saved as \code{p.base}, will be used throughout this section to demonstrate how the same plot changes when facets are added.

<<facets-00>>=
p.base <-
  ggplot(data = mtcars,
         mapping = aes(x = wt, y = mpg)) +
  geom_point()
p.base
@

A grid of panels has two dimensions, \code{rows} and \code{cols}. These dimensions in the grid of plot panels can be ``mapped'' to factors. Until recently, a formula-based syntax was the only available one. Although this notation has been retained, the preferred syntax is currently to use the parameters \code{rows} and \code{cols}. The argument passed to \code{cols} in this example is factor \code{cyl} retrieved from \code{data} with a call to \code{vars()}. The ``headings'' of the panels or \emph{strip labels} are by default the names or labels of the levels of the factor.

<<facets-01>>=
p.base + facet_grid(cols = vars(cyl))
@
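
For comparison, the sketch below codes the same grid of panels with the older formula-based notation, in which the left-hand side of the formula gives the rows and the right-hand side the columns, with a dot standing for ``no factor''. The names \code{p.demo}, \code{p.formula}, and \code{p.vars} are only illustrative.

<<facets-sketch-01>>=
library(ggplot2)
p.demo <-
  ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()
# formula notation: rows ~ cols
p.formula <- p.demo + facet_grid(. ~ cyl)
# equivalent call using parameter 'cols'
p.vars <- p.demo + facet_grid(cols = vars(cyl))
p.formula
@
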

Using \Rfunction{facet\_wrap()} the same plot can be coded as follows.

<<facets-01a, eval=eval_plots_all>>=
p.base + facet_wrap(facets = vars(cyl), nrow = 1)
@

By default, all panels share the same scale limits and share the plotting space evenly, but these defaults can be overridden.

<<facets-02a>>=
p.base + facet_wrap(facets = vars(cyl), nrow = 1, scales = "free_y")
@

<<facets-02b, eval = FALSE, include = FALSE>>=
p.base + facet_grid(cols = vars(cyl), scales = "free_y", space = "free_y")
@

Margins, added with \code{margins = TRUE}, display an additional column or row of panels with the combined data.

<<facets-06>>=
p.base + facet_grid(cols = vars(cyl), margins = TRUE)
@

<<echo=FALSE>>=
opts_chunk$set(opts_fig_narrow_square)
@

To obtain a 2D grid both \code{rows} and \code{cols} have to be passed factors as arguments.

<<facets-05>>=
p.base + facet_grid(rows = vars(vs), cols = vars(am), labeller = label_both)
@

<<echo=FALSE>>=
opts_chunk$set(opts_fig_wide)
@

Each faceting dimension can be mapped to more than one factor as below. As the levels are not self-explanatory, \code{label\_both} is passed as argument to \code{labeller} so that factor names are included in the \emph{strip labels} together with the levels.\qRfunction{label\_both()}

<<facets-07>>=
p.base + facet_grid(cols = vars(vs, am), labeller = label_both)
@

When faceting generates many panels, wrapping them into several rows helps keep the shape of the whole plot manageable. In this example, the number of levels is small, and no wrapping takes place by default. When more panels are present, wrapping into two or more continuation rows is the default. Here, we force wrapping with \code{nrow = 2}. When using \Rfunction{facet\_wrap()} there is only one dimension, and the parameter is called \code{facets}, instead of \code{rows} or \code{cols}.

<<echo=FALSE>>=
opts_chunk$set(opts_fig_narrow_square)
@

<<facets-13>>=
p.base + facet_wrap(facets = vars(cyl), nrow = 2)
@

<<echo=FALSE>>=
opts_chunk$set(opts_fig_wide)
@

\begin{explainbox}
By default, panel headings display the names of the levels of the factor they are based on. Changing these names is one way of changing the labels. This approach can be used to add mathematical expressions or Greek letters in the panel headings. Below, first, the factor labels in the data frame passed as argument to \code{data} are set to strings that can be parsed into \emph{plotmath} expressions. Then, in the call to \Rfunction{facet\_grid()}, or to \Rfunction{facet\_wrap()}, we pass as argument to \code{labeller} a function, \code{label\_parsed}.\qRfunction{label\_parsed()}

<<facets-11, eval=eval_plots_all>>=
mtcars$cyl12 <- factor(mtcars$cyl,
                       labels = c("alpha", "beta", "sqrt(x, y)"))
ggplot(data = mtcars,
       mapping = aes(mpg, wt)) +
  geom_point() +
  facet_grid(cols = vars(cyl12), labeller = label_parsed)
@

The labels of the levels of the factor used in faceting can be combined with text, or math, using a ``template''. Passing as argument to \code{labeller} function \Rfunction{label\_bquote()} and using a plotmath expression as argument for its parameter \code{cols}, makes this possible. In the expression used below, \code{.(cyl)} is substituted by the value of \code{cyl} when the plot is rendered---we use here the name of the variable in the data, \code{cyl}. See section \ref{sec:plot:plotmath} for an example of the use of \code{bquote()}, the \Rlang function based on which \Rfunction{label\_bquote()} is built.

<<facets-12>>=
p.base +
  facet_grid(cols = vars(cyl),
             labeller = label_bquote(cols = .(cyl)~"cylinders"))
@
\end{explainbox}
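
When only plain-text replacements are needed, the data can be left unchanged and a lookup table used instead. The chunk below is a sketch of this alternative, passing a named character vector to \Rfunction{as\_labeller()} from \ggplot; the names in the vector must match the levels of the faceting variable. The replacement strings and the names \code{cyl.labels} and \code{p.labelled} are illustrative.

<<facets-sketch-02>>=
library(ggplot2)
# names are the values of 'cyl', values are the strip labels to display
cyl.labels <- c("4" = "four cylinders",
                "6" = "six cylinders",
                "8" = "eight cylinders")
p.labelled <-
  ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  facet_grid(cols = vars(cyl), labeller = as_labeller(cyl.labels))
p.labelled
@
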

\index{grammar of graphics!facets|)}

\section{Positions}\label{sec:plot:positions}

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_2fig_very_wide)
@

Position functions are passed as arguments to the \code{position} parameter of \code{geoms}. They displace the positions (the values mapped to \code{x} and/or \code{y} aesthetics) away from their original position. Different position functions differ in what displacement is applied. Table \ref{tab:plots:position} lists most of the position functions available. Functions \ggposition{position\_stack()} and \ggposition{position\_fill()} were already described on page \pageref{par:plot:pos:stack}, with stacked column and area plots. Function \ggposition{position\_dodge()} was used in plots with side-by-side columns on page \pageref{par:plot:pos:dodge} and \ggposition{position\_jitter()} was used in dot plot examples on page \pageref{par:plot:pos:jitter}.

\begin{table}
  \caption[Positions]{Position functions from packages \ggplot and \ggpp. The table is divided into two sections. A. Positions that only return the modified $x$ and $y$ values. B. Identical positions that additionally return a copy of the unmodified $x$ and $y$ values. The last column describes the type of displacement: fixed uses constant values supplied in the call; random, uses random values for the displacement, within a maximum distance set by the user.}\vspace{1ex}\label{tab:plots:position}
  \centering
  \noindent
  \begin{tabular}{@{}llp{6.25cm}l@{}}
     \toprule
     Position & Package & Parameters & Displ. \\
     \midrule
     A. \textit{Origin not kept} & & & \\ \addlinespace
     \code{position\_identity} & \ggplot & --- & none \\
     \code{position\_stack} & \ggplot & vjust, reverse & fixed \\
     \code{position\_fill} & \ggplot & vjust, reverse & fixed \\
     \code{position\_dodge} & \ggplot & width, preserve, padding, reverse & fixed \\
     \code{position\_dodge2} & \ggplot & width, preserve, padding, reverse  & fixed \\
     \code{position\_jitter} & \ggplot & width, height, seed & rand. \\
     \code{position\_nudge} & \ggplot & x, y & fixed \\
     \midrule
     B. \textit{Origin kept} & & & \\ \addlinespace
     \code{position\_stack\_keep} & \ggpp & vjust, reverse & fixed \\
     \code{position\_fill\_keep} & \ggpp & vjust, reverse & fixed \\
     \code{position\_dodge\_keep} & \ggpp & width, preserve, padding, reverse & fixed \\
     \code{position\_dodge2\_keep} & \ggpp & width, preserve, padding, reverse & fixed \\
     \code{position\_jitter\_keep} &  \ggpp & width, height, seed & rand. \\
     \code{position\_nudge\_keep} & \ggpp & x, y & fixed \\
%     \midrule
%     C. \textit{Computed and kept} & & & \\ \addlinespace
%     \code{position\_nudge\_to} & \ggpp & \raggedright x, y, x.action, y.action, kept.origin & comp. \\
%     \code{position\_nudge\_line} & \ggpp & \raggedright x, y, xy\_relative, abline, method, formula, direction, line\_nudge, kept.origin & comp. \\
%     \code{position\_nudge\_center} & \ggpp & \raggedright x, y, center\_x, center\_y, direction, obey\_grouping, kept.origin & comp. \\
%     \midrule
%     D. \textit{Combined and kept} & & & \\ \addlinespace
%     \code{position\_stacknudge} & \ggpp & \raggedright vjust, reverse, x, y, direction, kept.origin & fixed \\
%     \code{position\_fillnudge} & \ggpp & \raggedright vjust, reverse, x, y, direction, kept.origin & fixed \\
%     \code{position\_dodgenudge} & \ggpp & \raggedright width, preserve, x, y, direction, kept.origin & fixed \\
%     \code{position\_dodge2nudge} & \ggpp & \raggedright width, preserve, x, y, direction, kept.origin & fixed \\
%     \code{position\_jitternudge} & \ggpp & \raggedright width, height, seed, x, y, direction, nudge.from, kept.origin & mix. \\
     \bottomrule
   \end{tabular}
\end{table}

The difference between \ggposition{position\_stack()} and \ggposition{position\_fill()} is illustrated by the example below.

<<position-01>>=
p.base <-
  ggplot(data = Orange,
         mapping = aes(x = age, y = circumference, fill = Tree))
@

<<position-02>>=
p1 <- p.base + geom_area(position = "stack", colour = "white", linewidth = 1) +
  ggtitle("stack")
p2 <- p.base + geom_area(position = "fill", colour = "white", linewidth = 1) +
  ggtitle("fill")
@

<<position-03>>=
p1 + p2
@
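
The difference can also be seen in the numbers computed for a simpler plot. The sketch below uses stacked columns of counts instead of areas, and calls the position functions explicitly instead of passing their names as character strings. With \ggposition{position\_stack()} the total height of each column is the number of observations, while with \ggposition{position\_fill()} it is rescaled to one. The names of the objects created are illustrative.

<<position-sketch-01>>=
library(ggplot2)
p.bars <-
  ggplot(data = mtcars,
         mapping = aes(x = factor(cyl), fill = factor(am)))
stack.df <- layer_data(p.bars + geom_bar(position = position_stack()))
fill.df <- layer_data(p.bars + geom_bar(position = position_fill()))
# top of each stacked column, one value per level of factor(cyl)
stack.tops <- tapply(stack.df$ymax, as.numeric(stack.df$x), max)
fill.tops <- tapply(fill.df$ymax, as.numeric(fill.df$x), max)
stack.tops
fill.tops
@
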

Position \ggposition{position\_nudge()} is used to consistently displace positions, and is most frequently used with \gggeom{geom\_text()} and \gggeom{geom\_label()} when adding data labels. When position functions are used to add data labels, it is common to add a segment linking the data point to the label. For this to be possible, position functions have to keep the original position. Position functions from package \ggplot discard it while the position functions from packages \ggpp and \ggrepel keep it in data under a different name. Table \ref{tab:plots:position} is divided into sections. The only difference between the position functions in the two sections of the table is in whether the original position is kept or not, i.e., those from package \ggpp are backwards compatible with those from package \ggplot.

The displacements introduced by jitter and nudge differ in that jitter is random, and nudge deterministic. In each case, the displacement can be separately adjusted vertically and horizontally. Jitter, as shown above, is useful when we desire to make overlapping points visible. Nudge is most frequently used with data labels to avoid occluding points or other graphical features.

Layer function \gggeom{geom\_point\_s()} from package \pkgname{ggpp} is used below to make the displacement visible by drawing an arrow connecting original and displaced positions for each observation. We need to use the \code{\_keep} flavour of the position functions for arrows to be drawn.

<<position-04>>=
p.base <-
  ggplot(data = mtcars,
         mapping = aes(x = factor(cyl), y = mpg)) +
  geom_point(colour = "blue")
p3 <- p.base +
  geom_point_s(position = position_jitter_keep(width = 0.35, height = 0.6),
               colour = "red") +
  ggtitle("jitter")
@

The amount of nudging is set by a distance expressed in data units through parameters \code{x} and \code{y}. (Factors have mode \code{numeric} and each level is represented by an integer, thus distance between levels of a factor is 1.)

<<position-05>>=
p4 <- p.base +
  geom_point_s(position = position_nudge_keep(x = 0.25, y = 1),
               colour = "red") +
  ggtitle("nudge")
@

<<position-06>>=
p3 + p4
@
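
As a complement, the sketch below illustrates the most frequent use of nudging mentioned above: displacing data labels so that they do not occlude the observations. It is only an illustration, using \ggposition{position\_nudge()} from \ggplot (so the original positions are not kept) and a hypothetical data frame \code{nudge.df} built from the first rows of \code{mtcars}.

<<sketch-position-nudge>>=
library(ggplot2)
# hypothetical example data: six cars from 'mtcars', names used as labels
nudge.df <- data.frame(car = rownames(mtcars)[1:6],
                       wt = mtcars$wt[1:6],
                       mpg = mtcars$mpg[1:6])
p.nudge <-
  ggplot(data = nudge.df, mapping = aes(x = wt, y = mpg, label = car)) +
  geom_point(colour = "blue") +
  # every label is shifted upwards by the same 0.8 data units
  geom_text(position = position_nudge(y = 0.8), size = 3)
p.nudge
@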

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_narrow)
@

\section{Scales}\label{sec:plot:scales}
\index{grammar of graphics!scales|(}

In earlier sections of this chapter, most examples have used the default \emph{scales}. In this section, I describe in more detail the use of \emph{scales}. There are \emph{scales} available for all the different \emph{aesthetics} recognised by \code{geoms}, such as position aesthetics (\code{x, y, z}), \code{size}, \code{shape}, \code{linewidth}, \code{linetype}, \code{colour}, \code{fill}, \code{alpha} or transparency, and \code{angle}. Scales determine how values in \code{data} are mapped to values of an \emph{aesthetic}, and optionally, also how these values are labelled.

Depending on the characteristics of the variables in \code{data} being mapped, \emph{scales} can be continuous or discrete, for \code{numeric} or \code{factor} variables in \code{data}, respectively. Some \emph{aesthetics}, like \code{size} and \code{colour}, are inherently continuous but others like \code{linetype} and \code{shape} are inherently discrete. In the case of inherently continuous aesthetics, both discrete and continuous scales are available, while for those inherently discrete, only discrete scales are available.

The scales used by default have default mappings of data values to aesthetic values (e.g., which colour value corresponds to $cyl = 4$ and which one to $cyl = 8$). For each \emph{aesthetic}, such as \code{colour}, there are multiple scales to choose from when creating a plot, both continuous and discrete (e.g., 20 different colour scales in \ggplot 3.4.3). In addition, some scales implement multiple palettes.

\begin{warningbox}
As seen in previous sections, \emph{aesthetics} in a plot layer, in addition to being determined by mappings, can also be set to constant values. Aesthetics set to constant values are not mapped to data and are consequently independent of scales.
\end{warningbox}

The most direct mapping to data is \code{identity}, with the values in the mapped variable directly interpreted as aesthetic values. In a colour scale, say \ggscale{scale\_colour\_identity()}, the variable in the data would be encoded with values such as \code{"red"}, \code{"blue"}---i.e., valid \Rlang colours. In a simple mapping using \ggscale{scale\_colour\_discrete()}, levels of a factor, such as \code{"treatment"} and \code{"control"}, would be represented as distinct colours with the correspondence between factor levels and individual colours set automatically. In contrast, with \code{scale\_colour\_manual()} the user explicitly provides the mapping between factor levels and colours by passing arguments to the scale function's parameters \code{breaks} and \code{values}.
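
The difference between a manual and an identity mapping can be sketched as follows. This is a minimal illustration under stated assumptions: data frame \code{scale.df}, its variable \code{my.colour} and the colours chosen are all hypothetical.

<<sketch-scale-manual>>=
library(ggplot2)
# hypothetical data: a factor to be mapped through a scale, and a
# variable already containing valid R colour names
scale.df <- data.frame(x = 1:4,
                       y = c(2.1, 3.4, 2.9, 4.2),
                       group = factor(rep(c("control", "treatment"), 2)),
                       my.colour = rep(c("red", "blue"), 2),
                       stringsAsFactors = FALSE)
# manual scale: the user pairs each factor level with a colour
p.manual <-
  ggplot(data = scale.df, mapping = aes(x = x, y = y, colour = group)) +
  geom_point(size = 3) +
  scale_colour_manual(breaks = c("control", "treatment"),
                      values = c("grey40", "orange"))
# identity scale: values in the data are used as colours unchanged
p.identity <-
  ggplot(data = scale.df, mapping = aes(x = x, y = y, colour = my.colour)) +
  geom_point(size = 3) +
  scale_colour_identity()
p.manual
@

With the identity scale no key is drawn by default, as the values in \code{data} are not expected to be meaningful to the reader of the plot.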

The details of the mapping of a continuous variable to an \emph{aesthetic} are controlled with a continuous scale such as \code{scale\_colour\_continuous()}. In this case, values in a \code{numeric} variable will be mapped into a continuous range of colours. How the correspondence between numeric values and colours is controlled can vary among scales. In the case of \code{colour}, some scales use complex palettes, while others implement simple gradients between two or three colours.

\begin{explainbox}
In some scales, missing values, or \code{NA}, can be assigned an aesthetic value, such as \code{colour}, while in other cases \code{NA} values are always skipped instead of plotted. The reverse, mapping values in data to \code{NA} as aesthetic value is in some cases also possible.
\end{explainbox}
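
As a small sketch of the first case, assuming a hypothetical data frame \code{na.df} with one missing value in the variable mapped to \code{colour}, the argument passed to \code{na.value} makes the corresponding observation visible in a colour of our choice.

<<sketch-scale-na>>=
library(ggplot2)
# hypothetical data with one NA in the variable mapped to colour
na.df <- data.frame(x = 1:5,
                    y = c(3, 1, 4, 1, 5),
                    z = c(10, 20, NA, 40, 50))
p.na <-
  ggplot(data = na.df, mapping = aes(x = x, y = y, colour = z)) +
  geom_point(size = 3) +
  # the observation with z = NA is drawn in red instead of the default grey
  scale_colour_gradient(low = "yellow", high = "darkgreen",
                        na.value = "red")
p.na
@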

\subsection{Axis and key labels}\label{sec:plot:scale:name}\label{sec:plot:labs}
\index{plots!labels|(}
\index{plots!title|(}
\index{plots!subtitle|(}
\index{plots!tag|(}
\index{plots!caption|(}
First I describe a feature common to all scales, their \code{name}. The default \code{name} of all scales is the name of the variable or the expression mapped to it. In the case of the \code{x}, \code{y}, and \code{z} \emph{aesthetics}, the \code{name} given to the scale is used for the axis labels. For other \emph{aesthetics} the name of the scale becomes the ``heading'' or \emph{key title} of the guide or key. All scales have a \code{name} parameter to which a character string or an \Rlang expression (see section \ref{sec:plot:plotmath}) can be passed as an argument to override the default. In scales that add a key or guide, passing \code{guide = "none"} to the scale function removes the key corresponding to the scale.

Convenience functions \Rfunction{xlab()} and \Rfunction{ylab()} can be used to set the axis labels.
Convenience function \Rfunction{labs()} can be used to manually set axis labels, key/guide titles, and title and other labels for the plot as a whole. For the names of scales, \Rfunction{labs()} accepts the names of aesthetics as if they were formal parameters, and uses \code{title}, \code{subtitle}, \code{caption}, \code{tag}, and \code{alt} for the labels for the plot as a whole. The text passed to \code{alt} is not visible in the plot but is expected to be made available to web browsers and used to enhance accessibility. (The size of title and subtitle can seem too big when rendering figures at a small size, see section \ref{sec:plot:themes} on page \pageref{sec:plot:themes} on how to replace and modify the theme used.)

<<axis-labels-00>>=
p.base <-
  ggplot(data = Orange,
         mapping = aes(x = age, y = circumference, colour = Tree)) +
  geom_line() +
  geom_point()
@

<<axis-labels-01>>=
p.base +
  expand_limits(y = 0) +
  labs(title = "Growth of orange trees",
       subtitle = "Starting from 1968-12-31",
       caption = "see Draper, N. R. and Smith, H. (1998)",
       tag = "A",
       alt = "A data plot",
       x = "Time (d)",
       y = "Circumference (mm)",
       colour = "Tree\nnumber")
@

When passing names directly to scales, the plot title and subtitle can be added with function \Rfunction{ggtitle()} by passing either character strings or \Rlang expressions as arguments.

<<axis-labels-02>>=
p.base +
  expand_limits(y = 0) +
  scale_x_continuous(name = "Time (d)") +
  scale_y_continuous(name = "Circumference (mm)") +
  ggtitle(label = "Growth of orange trees",
          subtitle = "Starting from 1968-12-31")
@
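
The two remaining features described at the start of this section, functions \Rfunction{xlab()} and \Rfunction{ylab()} and the removal of a key with \code{guide = "none"}, can be sketched in the same way. The plot name \code{p.labs.sketch} is hypothetical; the data are again from \code{Orange}.

<<sketch-axis-labels>>=
library(ggplot2)
p.labs.sketch <-
  ggplot(data = Orange,
         mapping = aes(x = age, y = circumference, colour = Tree)) +
  geom_line() +
  # axis labels set with the convenience functions
  xlab("Time (d)") +
  ylab("Circumference (mm)") +
  # the colour mapping is kept, but its key is not drawn
  scale_colour_discrete(guide = "none")
p.labs.sketch
@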

\begin{playground}
Make an empty plot (\code{ggplot()}) and add to it as title an \Rlang expression producing $y = b_0 + b_1 x + b_2 x^2$. (Hint: have a look at the examples for the use of expressions in the \code{plotmath} demo in \Rlang by typing \code{demo(plotmath)} at the \Rlang console.)
\end{playground}

\index{plots!tag|)}
\index{plots!caption|)}
\index{plots!subtitle|)}
\index{plots!title|)}
\index{plots!labels|)}

\subsection{Continuous scales}\label{sec:plot:scales:continuous}
\index{grammar of graphics!continuous scales|(}
I start by listing the most frequently used arguments to the continuous scale functions: \code{name}, \code{breaks}, \code{minor\_breaks}, \code{labels}, \code{limits}, \code{expand}, \code{na.value}, \code{trans}, \code{guide}, and \code{position}. The value of \code{name} is used for axis labels or the key title (see previous section). The arguments to \code{breaks} and \code{minor\_breaks} override the default locations of major and minor ticks and grid lines. Setting them to \code{NULL} suppresses the ticks. By default, the tick labels are generated from the value of \code{breaks} but an argument to \code{labels} of the same length as \code{breaks} will replace these defaults. The values of \code{limits} determine both the range of values in the data included and the plotting area as described above---by default the out-of-bounds (\code{oob}) observations are replaced by \code{NA} but it is possible to instead ``squish'' these observations towards the edge of the plotting area. The argument to \code{expand} determines the size of the margins or padding added to the area delimited by \code{limits} when setting the ``visual'' plotting area. The value passed to \code{na.value} is used as a replacement for \code{NA} valued observations---most useful for \code{colour} and \code{fill} aesthetics. The transformation object passed as an argument to \code{trans} determines the transformation used---the transformation affects the rendering, but breaks and tick labels remain expressed in the original data units. The argument to \code{guide} determines the type of key or removes the default key. Depending on the scale in question not all these parameters are available. A family of continuous scales, \emph{binned scales}, was added in \ggplot 3.3.0. These scales map a continuous variable from \code{data} onto a discrete gradient of aesthetic values, but are otherwise very similar.

The code below constructs data frame \code{fake2.data}, containing artificial data.

<<scales-01>>=
fake2.data <-
  data.frame(y = c(rnorm(20, mean = 20, sd = 5),
                   rnorm(20, mean = 40, sd = 10)),
             group = factor(c(rep("A", 20), rep("B", 20))),
             z = rnorm(40, mean = 12, sd = 6))
@

\subsubsection{Limits}

Limits are relevant to all kinds of \emph{scales}. Limits are set through parameter \code{limits} of the different scale functions. They can also be set with convenience functions \code{xlim()} and \code{ylim()} in the case of the \code{x} and \code{y} \emph{aesthetics}, and more generally with function \code{lims()} which like \code{labs()}, takes arguments named according to the name of the \emph{aesthetics}. The \code{limits} argument of scales accepts vectors, factors, or a function computing them from \code{data}. In contrast, the convenience functions do not accept functions as their arguments.

In the next example, by setting ``hard'' limits, some observations are excluded from the plot: they are not seen by \code{stats} and \code{geoms}, i.e., hard limits in scales subset observations in \code{data} at the \code{start} stage (see Figure \ref{fig:ggplot:stages} on page \pageref{fig:ggplot:stages}). More precisely, the off-limits observations are converted to \code{NA} values before they are passed as \code{data} to \code{stats}, and subsequently discarded with a warning.

<<echo=FALSE>>=
opts_chunk$set(opts_2fig_very_wide)
@

<<scale-limits-00>>=
p1.base <-
  ggplot(data = fake2.data, mapping = aes(x = z, y = y)) +
  geom_point()
@

<<scale-limits-01>>=
p1 <- p1.base + scale_y_continuous(limits = c(0, 100))
@

To set only one limit leaving the other free, \code{NA} is used as a boundary.

<<scale-limits-02>>=
p2 <- p1.base + scale_y_continuous(limits = c(50, NA))
@

<<scale-limits-02a>>=
p1 + p2
@

Convenience functions \Rfunction{ylim()} and \Rfunction{xlim()} can be used to set the limits to the default $x$ and $y$ scales in use. Below, \Rfunction{ylim()} is used, but \Rfunction{xlim()} works identically except for the scale it modifies (plot identical to \code{p2} above, not shown).

<<scale-limits-03, eval=eval_plots_all>>=
p1.base + ylim(50, NA)
@

In general, setting hard limits should be avoided, even though a warning is issued about \code{NA} values being omitted, as it is easy to unintentionally subset the data being plotted.
It is preferable to use function \Rfunction{expand\_limits()} as it safely \emph{expands} the dynamically computed default limits of a scale---the scale limits will grow past the requested expanded limits when needed to accommodate all observations. The arguments to \code{x} and \code{y} are numeric vectors of length one or two each, matching how the limits of the $x$ and $y$ continuous scales are defined. Below, the limits are expanded to include the origin.

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_medium)
@

<<scale-limits-04>>=
p1.base + expand_limits(y = 0, x = 0)
@
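
The contrast between these approaches can be checked numerically by inspecting the data of a plot layer with \Rfunction{layer\_data()}. The sketch below assumes a small hypothetical data frame \code{lim.df} in which two of five observations fall outside the limits 10 to 40, and compares hard limits, the same limits with \code{oob = scales::squish}, and \Rfunction{expand\_limits()}.

<<sketch-limits-oob>>=
library(ggplot2)
# hypothetical data: y values 5 and 45 are outside the limits used below
lim.df <- data.frame(x = 1:5, y = c(5, 15, 25, 35, 45))
p.lim.base <-
  ggplot(data = lim.df, mapping = aes(x = x, y = y)) +
  geom_point()
# hard limits: out-of-bounds observations are replaced by NA
p.hard <- p.lim.base + scale_y_continuous(limits = c(10, 40))
# same limits, but out-of-bounds observations are moved to the nearest limit
p.squish <- p.lim.base +
  scale_y_continuous(limits = c(10, 40), oob = scales::squish)
# expanded limits: all observations are retained unchanged
p.soft <- p.lim.base + expand_limits(y = 0)

sum(is.na(layer_data(p.hard)$y)) # number of observations lost
range(layer_data(p.squish)$y)
range(layer_data(p.soft)$y)
@

Squishing retains all observations but plots the out-of-bounds ones at positions that differ from their values in \code{data}, so it is not a safe default either.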

The \code{expand} parameter of the scales plays a different role than \Rfunction{expand\_limits()}. It adds a ``margin'' or padding around the plotting area. The actual plotting area is given by the scale limits, set either dynamically or manually. Plots are very rarely drawn so that observations are plotted on top of the axes; avoiding this is a key role of \code{expand}. Rug plots and marginal annotations can make it necessary to expand the plotting area more than the default of 5\% on each margin.

In the example below, the upper edge of the plotting area is expanded by adding 0.01 units of padding and the expansion at the bottom is set to zero.

<<echo=FALSE>>=
opts_chunk$set(opts_2fig_very_wide)
@

<<scale-limits-05>>=
p2.base <-
  ggplot(data = fake2.data,
         mapping = aes(fill = group, colour = group, x = y)) +
  stat_density(alpha = 0.3, position = "identity")
@

<<scale-limits-05a>>=
p1 <-
  p2.base + scale_y_continuous(expand = expansion(add = c(0, 0.01)))
@

Using multipliers has the advantage that the expansion is proportional to the range of the \code{limits}. An effect similar to that above is achieved below using multipliers: 10\% of the range is added at the top and none at the bottom.

<<scale-limits-06>>=
p2 <-
  p2.base + scale_y_continuous(expand = expansion(mult = c(0, 0.1)))
@

<<scale-limits-07>>=
p1 + p2
@
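
The arithmetic behind \Rfunction{expansion()} can be made explicit. With limits $[l, u]$, multipliers $(m_1, m_2)$ and additive constants $(a_1, a_2)$, the edges of the plotting area are $l - m_1 (u - l) - a_1$ and $u + m_2 (u - l) + a_2$. The sketch below, using a hypothetical data frame \code{expand.df}, compares this computation to the range stored in the built plot; the path used to extract the range from the object returned by \Rfunction{ggplot\_build()} is an assumption that may not hold in all versions of \ggplot.

<<sketch-expansion>>=
library(ggplot2)
# hypothetical data with y limits 0 and 10
expand.df <- data.frame(x = 1:3, y = c(0, 5, 10))
p.expand <-
  ggplot(data = expand.df, mapping = aes(x = x, y = y)) +
  geom_point() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1), add = c(0, 0.5)))
# expansion() returns a vector: lower mult, lower add, upper mult, upper add
expansion(mult = c(0, 0.1), add = c(0, 0.5))
# edges of the plotting area computed by hand
y.lims <- range(expand.df$y)
y.by.hand <- c(y.lims[1] - 0 * diff(y.lims) - 0,
               y.lims[2] + 0.1 * diff(y.lims) + 0.5)
y.by.hand
# edges actually used (assumed location within the built plot object)
ggplot_build(p.expand)$layout$panel_params[[1]]$y.range
@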

\begin{playground}
Compare the rendered plot from \code{p2.base} to \code{p1} and \code{p2} displayed above. What has been the effect of using \Rfunction{expansion()}? Try different values as arguments for \code{add} and \code{mult}.
\end{playground}

The direction of a scale can be reversed using a transformation (see section \ref{sec:plot:scales:trans} on page \pageref{sec:plot:scales:trans}). Scales \ggscale{scale\_x\_reverse()} and \ggscale{scale\_y\_reverse()} use by default the necessary transformation. However, inconsistently, \Rfunction{xlim()} and \Rfunction{ylim()} can be used to reverse the scale direction by passing the numeric values for the limits in decreasing order.

\begin{playground}
Test what the result is when the first limit is larger than the second one. Is it the same as when setting these same values as limits with \code{ylim()}? Or when replacing \code{scale\_y\_continuous()} with \code{scale\_y\_reverse()}?

<<scale-limits-PG01, eval=eval_playground>>=
p1.base + scale_y_continuous(limits = c(100, 0))
@
\end{playground}

\subsubsection{Breaks and their labels}\label{sec:plot:scales:ticks}

Parameter \code{breaks}\index{plots!scales!tick breaks} is used not only to set the location of ticks along the axis in scales for the \code{x} and \code{y} aesthetics, but also for the keys or guides for other continuous scales such as those for colour. Parameter \code{labels}\index{plots!scales!tick labels} is used to set the break labels, including tick labels. The argument passed to each of these parameters can be a vector or a function. The default is to compute ``good'' breaks based on the limits and to use nice numbers suitable for labels. Examples in this section are for continuous scales, see section \ref{sec:plot:scales:time:date} on page \pageref{sec:plot:scales:time:date} for break labels in time and date scales.

When manually setting breaks, labels for the \code{breaks} are automatically computed unless overridden.

<<scale-ticks-00>>=
p3.base <-
  ggplot(data = fake2.data, mapping = aes(x = z, y = y)) +
  geom_point()
@

<<scale-ticks-01, eval=eval_plots_all>>=
p3.base + scale_y_continuous(breaks = c(20, pi * 10, 40, 60))
@

The default breaks are computed by function \Rfunction{pretty\_breaks()} from \pkgname{scales}. The argument passed to its parameter \code{n} determines the target number of ticks to be generated automatically, but the actual number of ticks computed may be slightly different depending on the range of the data.

<<scale-ticks-01a>>=
p3 <-
  p3.base + scale_y_continuous(breaks = pretty_breaks(n = 7))
@

We can set tick labels manually, in parallel to the setting of \code{breaks} by passing as arguments two vectors of equal length. Below, an expression is used to include a Greek letter in the label.

<<scale-ticks-02>>=
p4 <-
  p3.base +
  scale_y_continuous(breaks = c(20, pi * 10, 40, 60),
                     labels = c("20", expression(10*pi), "40", "60"))
@

<<scale-ticks-02a>>=
p3 + p4
@
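
As breaks and labels can also be given as functions, user-defined functions can take the place of the vectors used above. In the sketch below, \code{breaks\_every\_10()} and \code{label\_grams()} are hypothetical helper functions written for this illustration: the first receives the limits of the scale and returns the breaks, the second receives the breaks and returns the labels.

<<sketch-breaks-fun>>=
library(ggplot2)
# hypothetical helper: breaks at multiples of 10 within the limits
breaks_every_10 <- function(limits) {
  seq(from = ceiling(limits[1] / 10) * 10,
      to = floor(limits[2] / 10) * 10,
      by = 10)
}
# hypothetical helper: append a unit to each break value
label_grams <- function(x) {
  paste(x, "g")
}
ticks.df <- data.frame(x = 1:6, y = c(12, 25, 31, 44, 58, 63))
p.ticks <-
  ggplot(data = ticks.df, mapping = aes(x = x, y = y)) +
  geom_point() +
  # the functions themselves are passed, they are not called here
  scale_y_continuous(breaks = breaks_every_10, labels = label_grams)
p.ticks
@

Compared to vectors, functions have the advantage that breaks and labels remain valid when the data or the limits change.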

Package \pkgname{scales} provides several functions for the automatic generation of tick labels. For example, function \code{percent()} can be used to display tick labels as percentages when the values mapped from \code{data} are expressed as decimal fractions. This ``transformation'' is applied only to the tick labels.

<<scale-ticks-03>>=
p5 <-
  ggplot(data = fake2.data, mapping = aes(x = z, y = y / max(y))) +
  geom_point() +
  scale_y_continuous(labels = percent)
@

\sloppy
Functions \code{dollar()} and \code{comma()} can be used to format the numbers in the labels as used for currency. Function \code{scientific\_format()} formats numbers using exponents of 10---useful for logarithmic-transformed scales. Additional functions, \code{label\_number(scale\_cut = cut\_short\_scale())}, \code{label\_log()}, or \code{label\_number(scale\_cut = cut\_si("g"))} provide other options. As shown below, some of these functions can be useful with untransformed continuous scales.

<<scale-ticks-04>>=
p6 <-
  ggplot(data = fake2.data, mapping = aes(x = z, y = y * 1000)) +
  geom_point() +
  scale_y_continuous(name = "Mass",
                     labels = label_number(scale_cut = cut_si("g")))
@

<<scale-ticks-05>>=
p5 + p6
@

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_medium)
@%
\pagebreak

\begin{explainbox}
Function \Rfunction{label\_number()} and the similar functions listed above build new functions based on the arguments passed to them, and the values they return are function definitions. Thus, in the example above, even if the statement passed as argument to \code{labels} is a function call, the value actually ``received'' by \code{scale\_y\_continuous()} is an \textit{ad hoc} function definition created on-the-fly. Some packages define additional functions that work similarly to those from package \pkgname{scales}.
\end{explainbox}
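
This behaviour can be verified at the \Rlang console, outside any plot. In the sketch below, \code{si.labeller} and \code{pct.labeller} are hypothetical names for the functions returned by \code{label\_number()} and \code{label\_percent()}.

<<sketch-label-factory>>=
library(scales)
# the call returns a function, not formatted labels
si.labeller <- label_number(scale_cut = cut_si("g"))
is.function(si.labeller)
# calling the returned function does the formatting
si.labeller(c(1000, 20000, 50000))
pct.labeller <- label_percent()
pct.labeller(c(0.1, 0.25, 1))
@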

\subsubsection{Transformed scales}\label{sec:plot:scales:trans}

The\index{plots!scales!transformations} default scales used by the \code{x} and \code{y} aesthetics, \ggscale{scale\_x\_continuous()} and \ggscale{scale\_y\_continuous()}, accept a user-supplied transformation function as an argument to \code{trans} with default \code{trans = "identity"} (no transformation). Package \pkgname{scales} defines several transformations that can be used as arguments for \code{trans}. User-defined transformations can also be implemented and used. In addition, there are predefined convenience scale functions for \code{log10}, \code{sqrt} and \code{reverse}.

\begin{warningbox}
  Consistently with maths functions in \Rlang, the names of the scales are \ggscale{scale\_x\_log10()} and \ggscale{scale\_y\_log10()}, rather than \ggscale{scale\_y\_log()} because in \Rlang, function \code{log()} computes the natural logarithm.
\end{warningbox}

Axis tick-labels display the original values, not transformed ones, and the argument to \code{breaks} also refers to these. Using \ggscale{scale\_y\_log10()} a $\log_{10}$ transformation is applied to the $y$ values.

<<scale-trans-02>>=
ggplot(data = fake2.data, mapping = aes(x = z, y = y)) +
  geom_point() +
  scale_y_log10(breaks = c(10, 20, 50, 100))
@

\begin{playground}
Using a transformation in a scale is not equivalent to applying the same transformation on the fly when mapping a variable to the $x$  (or $y$) \emph{aesthetic}. How does the plot produced by the code below differ from the plot using the transformed scale, shown above?

<<scale-trans-03, eval=eval_plots_all>>=
ggplot(data = fake2.data, mapping = aes(x = z, y = log10(y))) +
  geom_point()
@
\end{playground}
\pagebreak

For the most common transformations like \Rfunction{log10()}, scales with those transformations as their default are available. In other cases, as mentioned above, the transformation is set by passing an argument to parameter \code{trans} of continuous scale functions that by default do not apply a transformation. Below, a predefined transformation, \code{"reciprocal"} or $1/y$ is used (plot not shown).

<<scale-trans-04, eval=eval_plots_all>>=
ggplot(data = fake2.data, mapping = aes(x = z, y = y)) +
  geom_point() +
  scale_y_continuous(trans = "reciprocal")
@
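
A user-defined transformation, as mentioned at the start of this section, can be sketched with function \code{trans\_new()} from package \pkgname{scales}. The cube-root transformation \code{cube\_root\_trans} and the data frame \code{trans.df} are hypothetical examples. I assume here that \code{trans\_new()} and parameter \code{trans} are available; more recent versions of \pkgname{scales} and \ggplot may suggest the use of other names in their place.

<<sketch-trans-new>>=
library(ggplot2)
# hypothetical user-defined transformation: cube root and its inverse
cube_root_trans <-
  scales::trans_new(name = "cube_root",
                    transform = function(x) x^(1 / 3),
                    inverse = function(x) x^3)
trans.df <- data.frame(x = 1:5, y = c(1, 8, 27, 64, 125))
p.trans <-
  ggplot(data = trans.df, mapping = aes(x = x, y = y)) +
  geom_point() +
  # breaks are given in the original, untransformed, units
  scale_y_continuous(trans = cube_root_trans,
                     breaks = c(1, 8, 27, 64, 125))
p.trans
@

As the $y$ values are perfect cubes, the observations and the ticks are equally spaced along the transformed axis.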

Natural logarithms are important in growth analysis as the slope against time gives the relative growth rate. The growth data for orange trees, from data set \code{Orange}, are plotted using \Rfunction{log()} as transformation. Breaks are set using the original values.

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_wide)
@

<<scale-trans-05>>=
ggplot(data = Orange,
       mapping = aes(x = age, y = circumference, colour = Tree)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(trans = "log", breaks = c(20, 50, 100, 200))
@

\begin{explainbox}
In the examples above, and in practice most frequently, transformations are applied to position aesthetics, \code{x} and \code{y}. As the grammar of graphics is consistent, most if not all continuous scales also accept transformations. In some cases, applying a transformation to a \code{size} or \code{colour} scale helps convey the information contained in the data.
\end{explainbox}

\subsubsection{Position of $x$ and $y$ axes}
\index{plots!axis position}

The default position of axes can be changed through parameter \code{position}, using character constants \code{"bottom"}, \code{"top"}, \code{"left"} and \code{"right"}.

<<axis-position-01>>=
ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  scale_x_continuous(position = "top") +
  scale_y_continuous(position = "right")
@

\subsubsection{Secondary axes}

It\index{plots!secondary axes} is also possible to add secondary axes with ticks displayed in a transformed scale.

<<axis-secondary-01>>=
ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  scale_y_continuous(sec.axis = sec_axis(~ .^-1, name = "gpm"))
@

It is also possible to use different \code{breaks} and \code{labels} than for the main axes, and to provide a different \code{name} to be used as a secondary axis label.

<<axis-secondary-02, eval=eval_plots_all>>=
ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  scale_y_continuous(sec.axis = sec_axis(~ . / 2.3521458,
                                         name = expression(km / l),
                                         breaks = c(5, 7.5, 10, 12.5)))
@
\index{grammar of graphics!continuous scales|)}

\subsection{Time and date scales for $x$ and $y$}\label{sec:plot:scales:time:date}
\index{grammar of graphics!time and date scales|(}
Time scales are similar to continuous scales for \code{numeric} values. In \Rlang and many other computing languages, time values are stored as integer values subject to special interpretation (see section \ref{sec:data:datetime} on page \pageref{sec:data:datetime}). Times stored as objects of class \code{POSIXct} (or \code{POSIXlt}) can be mapped to continuous \emph{aesthetics} such as \code{x}, \code{y}, \code{colour}, etc. Special scales for different aesthetics are available for time-related data.
\pagebreak

Limits and breaks are preferably set using constant values of class \code{POSIXct}. These are most easily input with the functions in packages \pkgname{lubridate} or \pkgname{anytime} that convert dates and times from character strings.

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_very_wide)
@
\begin{explainbox}
In the next two chunks, scale limits subset a part of the observations present in \code{data}. Passing \code{na.rm = TRUE} when calling the \code{geom} functions silences warning messages.
\end{explainbox}

<<scale-datetime-01>>=
ggplot(data = weather_wk_25_2019.tb,
       mapping = aes(x = with_tz(time, tzone = "EET"),
                     y = air_temp_C)) +
  geom_line(na.rm = TRUE) +
  scale_x_datetime(name = NULL,
                   breaks = ymd_h("2019-06-11 12", tz = "EET") + days(0:1),
                   limits = ymd_h("2019-06-11 00", tz = "EET") + days(c(0, 2))) +
  scale_y_continuous(name = "Air temperature (C)") +
  expand_limits(y = 0)
@

As\index{plots!scales!axis labels} for numeric scales, breaks and the corresponding labels can be set differently to defaults. For example, if all observations have been collected within a single day, default tick labels will show hours and minutes. With several years, the labels will show only dates. The default labels are frequently good enough. Below, both breaks and the format of the labels are set through parameters passed in the call to \ggscale{scale\_x\_datetime()}.

<<scale-datetime-02>>=
ggplot(data = weather_wk_25_2019.tb,
       mapping = aes(x = with_tz(time, tzone = "EET"),
                     y = air_temp_C)) +
  geom_line(na.rm = TRUE) +
  scale_x_datetime(name = NULL,
                   date_breaks = "1 hour",
                   limits = ymd_h("2019-06-16 00", tz = "EET") + hours(c(6, 18)),
                   date_labels = "%H:%M") +
  scale_y_continuous(name = "Air temperature (C)") +
  expand_limits(y = 0)
@

\begin{playground}
The formatting strings used are those supported by \Rfunction{strptime()} and \code{help(strptime)} lists them. Change, in the two examples above, the $x$-axis tick labels used and the limits---e.g., include a single hour or a whole week of data, check which tick labels are produced by default and then pass as an argument to \code{date\_labels} different format strings, taking into account that in addition to the \emph{conversion specification} codes, format strings can include additional text.
\end{playground}

In \emph{date} scales tick labels are created with functions \Rfunction{label\_date()} or \Rfunction{label\_date\_short()}. In the case of \emph{time} scales, tick labels are created with function \Rfunction{label\_time()}. As shown for continuous scales, calls to these functions can be passed as arguments to the scales.
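
As a self-contained sketch of this last point, a call to \code{label\_date()} is passed below as argument to \code{labels} in a date scale. Data frame \code{dates.df} contains made-up values for two weeks of daily observations and is only a hypothetical stand-in for real data.

<<sketch-date-labels>>=
library(ggplot2)
# hypothetical daily data stored as class Date
dates.df <-
  data.frame(day = as.Date("2019-06-10") + 0:13,
             temp = c(14, 15, 17, 16, 18, 21, 22,
                      20, 19, 17, 16, 18, 20, 23))
# label_date() returns a function that formats dates
iso.labeller <- scales::label_date(format = "%Y-%m-%d")
p.dates <-
  ggplot(data = dates.df, mapping = aes(x = day, y = temp)) +
  geom_line() +
  scale_x_date(name = NULL,
               date_breaks = "1 week",
               labels = iso.labeller) +
  scale_y_continuous(name = "Air temperature (C)")
p.dates
@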

\index{grammar of graphics!time and date scales|)}

\subsection{Discrete scales for $x$ and $y$}
\index{grammar of graphics!discrete scales|(}

In\index{plots!scales!limits} the case of ordered or unordered factors, the tick labels are by default the names of the factor levels. Consequently, one roundabout way of obtaining the desired tick labels is to set them as factor labels in the data frame. This approach is not recommended as in many cases the text of the desired tick labels may not be a valid \Rlang name, making code more complex by the need to \emph{escape} these names each time they are used. It is best to use simple mnemonic short names for factor levels and variables, and to set suitable labels through scales.

Scales \ggscale{scale\_x\_discrete()} and \ggscale{scale\_y\_discrete()} can be used to reorder and select the factor levels without altering the data. When using this approach to subset the data, it is necessary to pass \code{na.rm = TRUE} in the call to layer functions to avoid warnings. Below, arguments passed to \code{limits} and \code{labels} in the call to \code{scale\_x\_discrete()} manually convert level names to uppercase (plot not shown, identical plot shown farther down using alternative code).

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_narrow)
@
<<scale-discrete-10, eval=eval_plots_all>>=
ggplot(data = mpg,
       mapping = aes(x = class, y = hwy)) +
  stat_summary(geom = "col", fun = mean, na.rm = TRUE) +
  scale_x_discrete(limits = c("compact", "subcompact", "midsize"),
                   labels = c("COMPACT", "SUBCOMPACT", "MIDSIZE"))
@

If, as above, replacement is with the same names in upper case, passing function \Rfunction{toupper()} automates the operation. In addition, the code becomes independent of the labels used in the data. This is a more general and less error-prone approach. Any function, user defined or not, that converts the values of \code{limits} into the desired values can be passed as an argument to \code{labels}. This example, for completeness, sets scale names and limits, as well as the width of the columns.

<<scale-discrete-10a>>=
ggplot(data = mpg,
       mapping = aes(x = class, y = hwy)) +
  stat_summary(geom = "col", fun = mean, na.rm = TRUE, width = 0.6) +
  scale_x_discrete(name = "Vehicle class",
                   limits = c("compact", "subcompact", "midsize"),
                   labels = toupper) +
  scale_y_continuous(name = "Petrol use efficiency (mpg)", limits = c(0, 30))
@
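
When the desired labels cannot be computed from the names of the levels, a named character vector passed to \code{labels} pairs each level with its label explicitly, which follows the recommendation given above of keeping short names in the data. The sketch below uses a hypothetical data frame \code{discrete.df}; the call to \Rfunction{layer\_scales()} at the end is only used to inspect the labels.

<<sketch-discrete-labels>>=
library(ggplot2)
# hypothetical data with short mnemonic names as factor levels
discrete.df <- data.frame(treat = factor(c("ctl", "lo", "hi")),
                          yield = c(2.3, 3.1, 4.0))
p.discrete <-
  ggplot(data = discrete.df, mapping = aes(x = treat, y = yield)) +
  geom_col(width = 0.6) +
  scale_x_discrete(name = "Fertiliser dose",
                   # order of the columns along the axis
                   limits = c("ctl", "lo", "hi"),
                   # names match levels, values are the labels displayed
                   labels = c(ctl = "Control",
                              lo = "Low dose",
                              hi = "High dose"))
p.discrete
layer_scales(p.discrete)$x$get_labels()
@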

The order of the columns in the plot follows the order of the levels in the factor, thus changing this ordering in factor \code{mpg\$class} works. This approach makes sense when the new ordering needs to be computed based on values in \code{data}, but can still be applied in the plotting code. Below, the breaks, and together with them the columns, are ordered based on the \code{mean()} of variable \code{hwy} by means of a call to \Rfunction{reorder()} within the call to \Rfunction{aes()}.

<<scale-discrete-11, eval=eval_plots_all>>=
ggplot(data = mpg,
       mapping = aes(x = reorder(x = factor(class), X = hwy, FUN = mean),
                     y = hwy)) +
  stat_summary(geom = "col", fun = mean)
@
\index{grammar of graphics!discrete scales|)}
<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_wide)
@

\subsection{Size and line width}
\index{grammar of graphics!size scales|(}
\index{grammar of graphics!linewidth scales|(}
\begin{warningbox}
  The \code{linewidth} aesthetic was added to package \ggplot in version 3.4.0. Previously, aesthetic \code{size} described the width of lines as well as the size of text and points or shapes. Below, I describe the scales according to version 3.4.0 and more recent.
\end{warningbox}

For the \code{size} \emph{aesthetic}, several scales are available: discrete, ordinal, continuous, and binned. They are similar to those already described above. Geometries \gggeom{geom\_point()}, \gggeom{geom\_text()}, and \gggeom{geom\_label()} obey the \code{size} aesthetic as expected. Size scales can be used with continuous numeric variables, dates and times, and with discrete variables. Examples of the use of \ggscale{scale\_size()} and \ggscale{scale\_size\_area()} were given in section \ref{sec:plot:geom:point} on page \pageref{sec:plot:geom:point}. Scale \ggscale{scale\_size\_radius()} is rarely used as it does not match human visual size perception.

A similar set of scales is available for \code{linewidth} as there is for \code{size}, discrete, ordinal, continuous, and binned. Geometries \gggeom{geom\_line()}, \gggeom{geom\_hline()}, \gggeom{geom\_vline()}, \gggeom{geom\_abline()}, \gggeom{geom\_segment()}, \gggeom{geom\_curve()} and related ones, obey the \code{linewidth} aesthetic. Geometry \gggeom{geom\_pointrange()} obeys both aesthetics, as expected, \code{size} is used for the size of the point and \code{linewidth} for the bar segment. In geometries \gggeom{geom\_bar()}, \gggeom{geom\_col()}, \gggeom{geom\_area()}, \gggeom{geom\_ribbon()} and all other geometric elements bordered by lines, \code{linewidth} controls the width of these lines. Like lines, these borders and segments also obey the \code{linetype} aesthetic.
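
The separate roles of the two aesthetics in \gggeom{geom\_pointrange()} can be sketched by setting them to constant values. Data frame \code{range.df} holds made-up means and standard errors and is hypothetical; \ggplot version 3.4.0 or later is assumed.

<<sketch-pointrange>>=
library(ggplot2)
# hypothetical summaries: means and standard errors for three groups
range.df <- data.frame(group = c("A", "B", "C"),
                       mean = c(2.0, 3.5, 3.0),
                       se = c(0.3, 0.5, 0.2))
p.range <-
  ggplot(data = range.df,
         mapping = aes(x = group, y = mean,
                       ymin = mean - se, ymax = mean + se)) +
  # size affects only the point, linewidth and linetype only the segment
  geom_pointrange(size = 1, linewidth = 1.5, linetype = "dashed")
p.range
@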

\begin{warningbox}
  Using \code{linewidth} makes code incompatible with versions of \ggplot prior to 3.4.0, while continuing to use \code{size} will trigger deprecation messages in newer versions of \ggplot. Eventually, use of \code{size} for lines will become an error, so when possible, it is preferable to use the new \code{linewidth} aesthetic.
\end{warningbox}

\index{grammar of graphics!linewidth scales|)}
\index{grammar of graphics!size scales|)}

\subsection{Colour and fill}
\index{grammar of graphics!colour and fill scales|(}
\index{plots!with colours|(}

The \code{colour} and \code{fill} scales are very similar, but they affect different elements of the plot. All visual elements in a plot obey the \code{colour} \emph{aesthetic}, but only elements that have an inner region and a boundary obey both the \code{colour} and \code{fill} \emph{aesthetics}. The boundary does not need to be rendered as a line when the plot is displayed, but it must exist. This is the case for \gggeom{geom\_area()} and \gggeom{geom\_ribbon()}, which in recent versions of \ggplot are displayed with lines only on some edges. Only a subset of the shapes supported by \gggeom{geom\_point()} can be filled. There are separate but equivalent sets of scales available for these two \emph{aesthetics}. I will describe the \code{colour} \emph{aesthetic} in more detail and give only some examples for \code{fill}. I will, however, start by reviewing how colours are defined and used in \Rlang.

\subsubsection{Colour definitions in \Rlang}\label{sec:plot:colours}
\index{colour!definitions|(}
Colours can be specified in \Rlang not only through character strings with the names of previously defined colours, but also directly as strings describing the RGB (red, green, and blue) components as hexadecimal numbers (in base 16, expressed using 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F as ``digits'') such as \code{"\#FFFFFF"} for white, \code{"\#000000"} for black, or \code{"\#FF0000"} for the brightest available pure red.

The list of colour names\index{colour!names} known to \Rlang can be obtained by typing \Rfunction{colors()} at the \Rlang console. Differently to package \pkgname{ggplot2}, base \Rlang supports only \code{color} as the spelling.
Given the number of colours available, subsetting them based on their names is frequently a good first step. Function \code{colors()} returns a character vector. Using \code{grep()} it is possible to find the names that contain a given character substring, in this example \code{"dark"}.

<<scale-colour-01>>=
length(colors())
grep("dark", colors(), value = TRUE)
@

The RGB values for an \Rlang \code{color} definition are returned by function \Rfunction{col2rgb()}.

<<scale-colour-02>>=
col2rgb("purple")
col2rgb("#FF0000")
@

Colour definitions in \Rlang can contain a \emph{transparency} component described by an \code{alpha} value, which by default is not returned.

<<scale-colour-03>>=
col2rgb("purple", alpha = TRUE)
@

With function \Rfunction{rgb()} one can define new colours. Enter \code{help(rgb)} for more details.

<<scale-colour-04>>=
rgb(1, 1, 0)
rgb(1, 1, 0, names = "my.color")
rgb(255, 255, 0, names = "my.color", maxColorValue = 255)
@
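
Transparency can also be included when defining a colour. As a brief sketch, \Rfunction{rgb()} accepts an \code{alpha} argument and \Rfunction{adjustcolor()} can add transparency to an existing colour definition. The variable names used here are arbitrary.

<<scale-colour-alpha-ex-01>>=
# semi-transparent yellow; alpha uses the same 0 to 1 range as the RGB values
yellow.half <- rgb(1, 1, 0, alpha = 0.5)
yellow.half
# add transparency to a named colour
purple.half <- adjustcolor("purple", alpha.f = 0.5)
purple.half
# the alpha component is returned when requested
col2rgb(purple.half, alpha = TRUE)
@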

As described above, colours can be defined in the RGB \emph{colour space}; however, other colour models such as HSV (hue, saturation, value) can be also used to define colours.\qRfunction{hsv()}

<<scale-colour-05>>=
hsv(c(0,0.25,0.5,0.75,1), 0.5, 0.5)
@

Frequently, sets of HCL colours returned by function \Rfunction{hcl()}, using hue, chroma, and luminance as inputs, are better for use in scales. While the ``value'' and ``saturation'' in HSV are based on physical values, the ``chroma'' and ``luminance'' values in HCL are based on human visual perception. Colours with equal luminance will be seen as equally bright by an ``average'' human. In a scale based on different hues but equal chroma and luminance values, as used by default by package \ggplot, all colours are perceived as equally bright. The hues need to be expressed as angles in degrees, with values between zero and 360.

<<scale-colour-06>>=
hcl(c(0,0.25,0.5,0.75,1) * 360)
@
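
When only hues are passed, \Rfunction{hcl()} uses its default chroma and luminance. The sketch below passes explicit values, chosen arbitrarily, to obtain a darker and a lighter set of colours from the same three hues.

<<scale-colour-hcl-ex-01>>=
hues <- c(0, 120, 240)
# same hues and chroma, low luminance
dark.set <- hcl(h = hues, c = 50, l = 40)
dark.set
# same hues and chroma, high luminance
light.set <- hcl(h = hues, c = 50, l = 80)
light.set
@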

It is also important to remember that humans can only distinguish a limited set of colours, and even smaller colour gamuts can be reproduced by screens and printers. Furthermore, variation from individual to individual exists in colour perception, including different types of colour blindness. It is important to take this into account when choosing the colours used in illustrations.
\index{colour!definitions|)}

\subsection{Continuous colour-related scales}
\sloppy
Continuous colour scales \ggscale{scale\_colour\_continuous()}, \ggscale{scale\_colour\_gradient()}, \ggscale{scale\_colour\_gradient2()},  \ggscale{scale\_colour\_gradientn()}, \ggscale{scale\_colour\_date()}, and \ggscale{scale\_colour\_datetime()}, give smooth continuous gradients between two or more colours. They are used with \code{numeric}, \code{date} and \code{datetime} data. A matching set of \code{fill} scales is also available. Other scales like \ggscale{scale\_colour\_viridis\_c()} and \ggscale{scale\_colour\_distiller()} are based on the use of ready-made palettes of sets of colour gradients chosen to work well together under multiple conditions or for human vision including different types of colour blindness.
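
As an illustration, the chunk below maps a numeric variable to \code{colour} and replaces the default scale, first by a gradient between two colours and then by a ready-made viridis palette. Data set \code{mpg} from \ggplot and the two colours are arbitrary choices made for this sketch.

<<scale-colour-cont-ex-01>>=
library(ggplot2)
p.cont <-
  ggplot(data = mpg,
         mapping = aes(x = displ, y = hwy, colour = cty)) +
  geom_point()
# gradient between two colours chosen by the user
p.cont + scale_colour_gradient(low = "yellow", high = "darkred")
# ready-made palette
p.cont + scale_colour_viridis_c()
@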

\subsection{Discrete colour-related scales}
\sloppy
Discrete colour scales, such as \ggscale{scale\_colour\_discrete()}, \ggscale{scale\_colour\_hue()}, and \ggscale{scale\_colour\_grey()}, are used with categorical data stored as factors. Some discrete scales, such as \ggscale{scale\_colour\_viridis\_d()} and \ggscale{scale\_colour\_brewer()}, provide multiple discrete sets of colours selectable through palettes. A matching set of discrete \code{fill} scales is available. Ordinal scales, such as \ggscale{scale\_colour\_ordinal()} and \ggscale{scale\_fill\_ordinal()}, use palettes that set aesthetic values that ramp in steps between two extreme values. They are used when \code{ordered} factors are mapped to the aesthetics.
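
A short sketch of the use of palettes with discrete scales follows. The variable mapped to \code{colour} is converted into a factor, and the palette name \code{"Dark2"} is just one of those available in \ggscale{scale\_colour\_brewer()}.

<<scale-colour-disc-ex-01>>=
library(ggplot2)
p.disc <-
  ggplot(data = mpg,
         mapping = aes(x = displ, y = hwy, colour = factor(cyl))) +
  geom_point()
# a qualitative palette selected by name
p.disc + scale_colour_brewer(palette = "Dark2", name = "Cylinders")
# a discrete set of colours from the default viridis palette
p.disc + scale_colour_viridis_d(name = "Cylinders")
@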

\subsection{Binned scales}\label{sec:binned:scales}
\index{grammar of graphics!binned scales|(}
Before version 3.3.0 of \pkgname{ggplot2}, only two types of scales were available, continuous and discrete. A third type of scale, called \emph{binned} (implemented for all the aesthetics where relevant), was added in version 3.3.0. Binned scales are used with continuous variables, but they convert the continuous values into discrete ones, using bins corresponding to different ranges of values, and then represent them in the plot using a discrete set of aesthetic values from a gradient. We re-do the figure shown on page \pageref{chunk:plot:weighted:resid} but replacing \ggscale{scale\_colour\_gradient()} by \ggscale{scale\_colour\_binned()}.

<<mapping-stage-01>>=
@

<<mapping-stage-01a>>=
@%
\pagebreak

<<binned-scales-01>>=
ggplot(data = df2) +
  stat_fit_residuals(formula = y ~ poly(x, 3, raw = TRUE),
                     method = "rlm",
                     mapping = aes(x = X,
                                   y = stage(start = Y,
                                             after_stat = y * weights),
                                   colour = after_stat(weights)),
                     show.legend = TRUE) +
  scale_colour_binned(low = "red", high = "blue", limits = c(0, 1),
                     guide = "colourbar", n.breaks = 5)
@

The advantage of binned scales is that they facilitate the fast reading of the plot while their disadvantage is the decreased resolution of the scale. The use of a binned instead of a continuous scale is qualitative. The number of bins can be set by passing an argument to parameter \code{n.breaks} or alternatively, a \code{numeric} vector passed as argument to \code{breaks} can be used to explicitly set bin boundaries. When deciding how many bins to use, one needs to take into account the audience, how the figure will be rendered and displayed, and the length of time available to the viewers to peruse the plot relative to the density of information. Transformations are also allowed in these scales as in others.
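
As a sketch of the second approach, the chunk below passes explicit bin boundaries through \code{breaks}. The data set and the boundary values are arbitrary and were chosen only for illustration.

<<binned-scales-ex-02>>=
library(ggplot2)
p.binned <-
  ggplot(data = mpg,
         mapping = aes(x = displ, y = hwy, colour = cty)) +
  geom_point() +
  # bin boundaries set explicitly instead of with n.breaks
  scale_colour_binned(breaks = c(12, 16, 20, 24))
p.binned
@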

\index{grammar of graphics!binned scales|)}

\subsection{Identity scales}
\index{grammar of graphics!identity colour scales|(}
In the case of identity scales, the mapping is one-to-one to the data. For example, if we map the \code{colour} or \code{fill} \emph{aesthetic} to a variable using \ggscale{scale\_colour\_identity()} or \ggscale{scale\_fill\_identity()}, the mapped variable must already contain valid \Rlang \code{color} definitions. In the case of mapping \code{alpha}, the variable must contain numeric values in the range 0 to 1.

We use a data frame with a variable \code{colours} containing character strings interpretable as the names of \code{color} definitions known to \Rlang. We then use them directly in the plot by adding \ggscale{scale\_colour\_identity()}.

<<scale-colour-10>>=
# rnorm(10) draws ten random values; dnorm(10) would return a single density
df3 <- data.frame(X = 1:10, Y = rnorm(10), colours = rep(c("red", "blue"), 5))

ggplot(data = df3, mapping = aes(x = X, y = Y, colour = colours)) +
  geom_point() +
  scale_colour_identity()
@

\begin{playground}
How does the plot look, if the identity scale is deleted from the example above? Edit and re-run the example code.

While using the identity scale, how would you need to change the code example above, to produce a plot with green and purple points?
\end{playground}
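
The same approach applies to \code{alpha}. In this sketch, the made-up variable \code{transparency} already contains values in the range 0 to 1, and \ggscale{scale\_alpha\_identity()} ensures that they are used as is.

<<scale-alpha-identity-ex-01>>=
library(ggplot2)
# hypothetical data, used only for illustration
df.alpha <- data.frame(X = 1:5, Y = 1,
                       transparency = seq(from = 0.2, to = 1, by = 0.2))

p.alpha <-
  ggplot(data = df.alpha,
         mapping = aes(x = X, y = Y, alpha = transparency)) +
  geom_point(size = 5) +
  scale_alpha_identity()
p.alpha
@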
\index{grammar of graphics!identity colour scales|)}

\begin{explainbox}
  \index{grammar of graphics!setting default colour and fill scales}
  The \code{colour} and \code{fill} scales used by default by geometries defined in package \pkgname{ggplot2} can be changed through \Rlang options \code{"ggplot2.continuous.colour"}, \code{"ggplot2.discrete.colour"}, \code{"ggplot2.ordinal.colour"}, \code{"ggplot2.binned.colour"}, \code{"ggplot2.continuous.fill"}, \code{"ggplot2.discrete.fill"}, \code{"ggplot2.ordinal.fill"} and \code{"ggplot2.binned.fill"}.
\end{explainbox}
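
A minimal sketch of the use of one of these options follows. It assumes that the value \code{"viridis"} is accepted, as documented for \ggscale{scale\_colour\_continuous()} in the 3.x versions of \pkgname{ggplot2}. The accepted values may differ in other versions. As with other \Rlang options, the previous setting is saved so that it can be restored.

<<scale-colour-options-ex-01>>=
library(ggplot2)
# change the default and save the previous setting
old.options <- options(ggplot2.continuous.colour = "viridis")

p.opt <-
  ggplot(data = mpg,
         mapping = aes(x = displ, y = hwy, colour = cty)) +
  geom_point()
p.opt # rendered while the option is in effect

# restore the previous setting
options(old.options)
@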

\index{plots!with colours|)}
\index{grammar of graphics!colour and fill scales|)}
\index{grammar of graphics!scales|)}

\section{Adding Annotations}\label{sec:plot:annotations}
\index{grammar of graphics!annotations|(}
The idea of annotations is that they add plot elements that are not directly connected to individual observations in \code{data}. Some, like company logos, could be called ``decorations'', but others, like text indicating the number of observations or even an inset plot or table, may convey information about the data set as a whole. They can be drawn referenced to the ``native'' data coordinates used to plot but the position itself does convey information. Annotations are distinct from data labels. Annotations are added to a ggplot with function \Rfunction{annotate()} as plot layers (each call to \code{annotate()} creates a new layer). To achieve the behaviour expected of annotations, \Rfunction{annotate()} does not inherit the default \code{data} or \code{mapping} of variables to \emph{aesthetics}. Annotations frequently make use of \code{"text"} or \code{"label"} \emph{geometries} with character strings as data, possibly to be parsed as expressions. In addition, for example, the \code{"segment"} geometry can be used to add arrows.

\begin{warningbox}
While layers added to a plot using \emph{geometries} and \emph{statistics} respect faceting, layers added with \Rfunction{annotate()} are replicated unchanged in all panels of a faceted plot. The reason is that annotation layers accept \emph{aesthetics} only as constant values which are the same for every panel as no grouping is possible without a \code{mapping} to \code{data}. Alternatives, using new geometries, are provided by package \ggpp.
\end{warningbox}
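
The difference can be sketched with plain \ggplot code. Below, \Rfunction{annotate()} adds the same text to every panel, while \gggeom{geom\_text()} with its own data frame adds a different label to each panel. The data frame \code{labels.df} and its contents are hypothetical.

<<annotate-facets-ex-01>>=
library(ggplot2)
# hypothetical labels, one row per panel
labels.df <- data.frame(drv = c("4", "f", "r"),
                        displ = 6.5, hwy = 42,
                        label = c("4-wheel", "front", "rear"))

p.panels <-
  ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(facets = vars(drv))
# annotate(): the same text is drawn in all panels
p.panels + annotate(geom = "text", x = 6.5, y = 42,
                    label = "all", hjust = 1)
# geom_text() with its own data: one label per panel
p.panels + geom_text(data = labels.df,
                     mapping = aes(label = label), hjust = 1)
@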

Function \Rfunction{annotate()} takes the name of a geometry as its argument, in the example below, \code{"text"}. Function \Rfunction{aes()} is not used, as only mappings to constant values are accepted. These values can be vectors, thus, layers added with annotate can add multiple graphic objects of the same type to a plot.

<<annotate-01>>=
ggplot(data = fake2.data, mapping = aes(x = z, y = y)) +
  geom_point() +
  annotate(geom = "text",
           label = "origin",
           x = 0, y = 0,
           colour = "blue",
           size = 4)
@

\begin{playground}
Play with the values of the arguments to \Rfunction{annotate()} to vary the position, size, colour, font family, font face, rotation angle, and justification of the annotation.
\end{playground}
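
As a sketch of the use of the \code{"segment"} geometry to add an arrow, the chunk below combines it with a text annotation. The data frame \code{df.arrow}, the coordinates, and the label are invented for this example.

<<annotate-arrow-ex-01>>=
library(ggplot2)
# hypothetical data, used only for illustration
df.arrow <- data.frame(x = 1:10, y = (1:10)^2)

p.arrow <-
  ggplot(data = df.arrow, mapping = aes(x = x, y = y)) +
  geom_point() +
  # arrow pointing towards the last observation
  annotate(geom = "segment",
           x = 4, xend = 9.7, y = 90, yend = 99,
           arrow = arrow(length = unit(3, "mm")),
           colour = "blue") +
  # text placed at the tail of the arrow
  annotate(geom = "text",
           x = 4, y = 90, label = "maximum",
           hjust = 1.1, colour = "blue")
p.arrow
@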

\index{plots!insets as annotations|(}
It is relatively common to use inset tables, plots, bitmaps, or vector graphics as annotations. In section \ref{sec:plot:insets} on page \pageref{sec:plot:insets}, \code{geoms} from package \ggpp were used to create insets in plots. An older alternative is to use \Rfunction{annotation\_custom()} to add grobs (\pkgname{grid} graphical objects) to a ggplot. To add another plot, or the same plot, as an inset, it first needs to be converted into a grob. A ggplot can be converted with function \Rfunction{ggplotGrob()}. In this example, the inset is a zoomed-in window into the main plot. In addition to the grob, coordinates for its location are passed in native data units.

<<inset-01>>=
p <- ggplot(data = fake2.data, mapping = aes(x = z, y = y)) +
  geom_point()
p + expand_limits(x = 40) +
  annotation_custom(ggplotGrob(p + coord_cartesian(xlim = c(4, 10), ylim = c(20, 30)) +
                               theme_bw(10)),
                    xmin = 25, xmax = 40, ymin = 30, ymax = 60)
@

This approach has the limitation, shared with the use of \Rfunction{annotate()}, that if used together with faceting, the inset is added identically to all plot panels.
\index{plots!insets as annotations|)}

In the next code example, expressions are used as annotations as well as for tick labels. Notice that we use recycling and vectorised arithmetic for setting the breaks, as \code{c(0, 0.5, 1, 1.5, 2) * pi} is equivalent to \code{c(0, 0.5 * pi, pi, 1.5 * pi, 2 * pi)}. Annotations are plotted at their own position, unrelated to any observation in the data, but using the same coordinates and units as for plotting the data.

<<annotate-03>>=
ggplot(data = data.frame(x = c(0, 2 * pi)),
       mapping = aes(x = x)) +
  stat_function(fun = sin) +
  scale_x_continuous(
    breaks = c(0, 0.5, 1, 1.5, 2) * pi,
    labels = c("0", expression(0.5~pi), expression(pi),
             expression(1.5~pi), expression(2~pi))) +
  labs(y = "sin(x)") +
  annotate(geom = "text",
           label = c("+", "-"),
           x = c(0.5, 1.5) * pi, y = c(0.5, -0.5),
           size = 20) +
  annotate(geom = "point",
           colour = "red",
           shape = 21,
           fill = "white",
           x = c(0, 1, 2) * pi, y = 0,
           size = 6)
@

\begin{playground}
Modify the plot above to show the cosine instead of the sine function, replacing \code{sin} with \code{cos}. This is easy, but the catch is that you will need to relocate the annotations.
\end{playground}

\begin{warningbox}
Function \Rfunction{annotate()} cannot be used with \code{geom = "vline"} or \code{geom = "hline"} in the way it can be used with \code{geom = "line"} or \code{geom = "segment"}. Instead, \gggeom{geom\_vline()} and/or \gggeom{geom\_hline()} can be used directly, passing constant arguments to them. See section \ref{sec:plot:line} on page \pageref{sec:plot:vhline}.
\end{warningbox}
\index{grammar of graphics!annotations|)}

\section{Coordinates and Circular Plots}\label{sec:plot:circular}\label{sec:plot:coord}
\index{grammar of graphics!polar coordinates|(}
\index{plots!circular|(}
The grammar of graphics, as implemented in \ggplot, allows many different combinations of its ``words'', and this is also how circular plots are created. To obtain circular plots, we use the same \emph{geometries}, \emph{statistics}, and \emph{scales} we have been using above, but combined with polar coordinates instead of the default cartesian coordinates. We override the default by adding \ggcoordinate{coord\_polar()} to the plot so that the \code{x} and \code{y} \textit{aesthetics} correspond to the angle and radial distance, respectively.

Special systems of coordinates, such as \ggcoordinate{coord\_sf()}, used for maps, support different projections. In contrast, coordinate functions such as \ggcoordinate{coord\_flip()}, \ggcoordinate{coord\_trans()}, and \ggcoordinate{coord\_fixed()} offer variations based on the cartesian system.

\subsection{Wind-rose plots}
\index{plots!wind rose|(}
Some types of data are more naturally expressed as angles using polar coordinates than on cartesian coordinates. The clearest example is wind direction, from which the name \textit{wind-rose} derives. In some cases of time series data with a strong periodic variation, polar coordinates can be used to highlight phase shifts or changes in frequency. A more mundane application is to plot variation in a response variable through the day with a clock-face-like representation of time of day.

Wind-rose plots are frequently histograms drawn on a polar system of coordinates (see section \ref{sec:plot:histogram} on page \pageref{sec:plot:histogram}). In the examples, we plot wind direction data, measured once per minute during 24~h (dataset \code{viikki\_d29.dat} from package \pkgname{learnrbook}).

A circular histogram of wind directions with 30-degree-wide bins can be created using \ggstat{stat\_bin()}. The counts represent the number of minutes during 24~h when the wind direction was within each bin, as the data set contains one observation per minute.

<<wind-05>>=
p.wind <-
  ggplot(data = viikki_d29.dat,
       mapping = aes(x = WindDir_D1_WVT))  +
  stat_bin(colour = "black", fill = "gray50", geom = "bar",
           binwidth = 30, boundary = 0, na.rm = TRUE) +
  coord_polar() +
  scale_x_continuous(breaks = c(0, 90, 180, 270),
                     labels = c("N", "E", "S", "W"),
                     limits = c(0, 360),
                     expand = c(0, 0),
                     name = "Wind direction") +
  scale_y_continuous(name = "Frequency (min/d)")
p.wind
@

\begin{playground}
  In the example above, \gggeom{geom\_bar()} was used. Edit the code to use other geometries, e.g., \code{geom\_line()} and \code{geom\_area()}.
\end{playground}

\begin{warningbox}
A plot created using polar coordinates is not truly circular, but resembles a plot based on cartesian coordinates rolled into a circle. The difference is crucial in the case of some wind-rose plots. In a true circular plot, the data would have to be projected onto a cylinder without any discontinuity. The plot we obtain using \ggcoordinate{coord\_polar()} retains a discontinuity at the North, at the boundary between 0 and 360 degrees. Thus for a histogram computed with \ggstat{stat\_bin()}, one boundary between bins must normally coincide with this divide. In a density plot, the densities on both sides of the North divide are fitted separately, frequently resulting in odd looking plots.

One approach to centring the bins on the cardinal directions would be to pre-compute the frequencies before plotting, pooling the observations for the slices 345--360 and 0--15 degrees into the same bin, and in a separate step, plotting them using \gggeom{geom\_col()} (not shown).
\end{warningbox}
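
A possible implementation of this approach is sketched below. It uses simulated wind directions instead of the measured ones, so variable \code{wind.dir} and all the values are hypothetical. Each observation is assigned to the nearest bin centre, with directions above 345 degrees wrapped around to the bin centred on North, and \code{coord\_polar()} is rotated by half a bin so that this bin is drawn centred at the top.

<<wind-centred-ex-01>>=
library(ggplot2)
set.seed(4321)
# simulated wind directions in degrees, one per minute during 24 h
wind.dir <- runif(n = 1440, min = 0, max = 360)

# centre of the 30-degree-wide bin containing each observation
bin.centre <- (floor((wind.dir + 15) / 30) * 30) %% 360
counts <- table(bin.centre)
counts.df <- data.frame(centre = as.numeric(names(counts)),
                        minutes = as.vector(counts))

p.centred <-
  ggplot(data = counts.df,
         mapping = aes(x = centre, y = minutes)) +
  geom_col(width = 30, colour = "black", fill = "gray50") +
  # rotate by half a bin so that 0 degrees is at the top
  coord_polar(start = -pi / 12) +
  scale_x_continuous(breaks = c(0, 90, 180, 270),
                     labels = c("N", "E", "S", "W"),
                     limits = c(-15, 345),
                     expand = c(0, 0),
                     name = "Wind direction") +
  scale_y_continuous(name = "Frequency (min/d)")
p.centred
@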

<<echo=FALSE>>=
opts_chunk$set(opts_fig_very_wide)
@

As when using other coordinates, we can add facets. In this example, we create a factor based on solar time, to plot separately the observations from before and after local solar noon.

<<wind-08>>=
p.wind +
  facet_wrap(~factor(ifelse(hour(solar_time) < 12, "AM", "PM")))
@

\index{plots!wind rose|)}
<<echo=FALSE>>=
opts_chunk$set(opts_fig_narrow)
@
\pagebreak

\subsection{Pie charts}
\index{plots!pie charts|(}

\begin{warningbox}
Pie charts are more difficult to read than bar charts because our brain is better at comparing lengths than angles. If used, pie charts should only be used to show composition, or fractional components that add up to a total. Even in this case, they should be used only if the number of ``pie slices'' is small (rule of thumb: seven at most). In general, however, they are best avoided.
\end{warningbox}

A pie chart of counts is like a bar plot in which angles, instead of heights, describe the number of counts. Geometry \gggeom{geom\_bar()}, which by default uses \code{stat\_count()}, together with \ggcoordinate{coord\_polar()} creates a pie chart. The brewer gradient scale supplies the palette for the fills, while the colour of the border line is set with \code{colour = "black"}.

<<>>=
ggplot(data = mpg,
       mapping = aes(x = factor(1), fill = factor(class))) +
  geom_bar(width = 1, colour = "black") +
  coord_polar(theta = "y") +
  scale_fill_brewer() +
  scale_x_discrete(breaks = NULL) +
  labs(x = NULL, fill = "Vehicle class")
@
\index{plots!pie charts|)}
\index{plots!circular|)}
\index{grammar of graphics!polar coordinates|)}

\begin{playground}
Edit the code for the pie chart above to obtain a bar chart. Which one of the two plots is easier to read?
\end{playground}

\section{Themes}\label{sec:plot:themes}
\index{grammar of graphics!themes|(}
\index{plots!styling|(}
In \ggplot, \emph{themes} are the equivalent of style sheets. They determine how the different elements of a plot are rendered when displayed, printed, or saved to a file. \emph{Themes} do not alter what aesthetics or scales are used to plot the observations or summaries, but instead how text labels, titles, axes, tick marks, plotting-area background, grid lines, etc., are formatted and if displayed or not. Package \ggplot includes several predefined \emph{theme constructors} (usually described as \emph{themes}), and independently developed extension packages define additional ones. These constructors return complete themes, which when added to a plot, replace the theme already present. In addition to choosing among these already available \emph{complete themes}, users can modify the ones already present in a plot by adding \emph{incomplete themes}. When used in this way, \emph{incomplete themes} usually are created on the fly. It is also possible to create new theme constructors that return complete themes, similar to \code{theme\_gray()} from \ggplot.

\subsection{Complete themes}
\index{grammar of graphics!complete themes|(}
The theme used by default is \ggtheme{theme\_gray()} with default arguments. In \pkgnameNI{ggplot2}, predefined themes are defined as constructor functions, with parameters. These parameters allow changing some ``base'' properties. The \code{base\_size} for text elements is given in points, and affects all text elements in the returned theme object because the size of these elements is by default defined relative to the base size. Another parameter, \code{base\_family}, allows the font family to be set. These functions return complete themes.

\begin{warningbox}
\emph{Themes} have no effect on layers produced by \emph{geometries} as themes have no effect on \emph{mappings}, \emph{scales}, or \emph{aesthetics}. In the name \ggtheme{theme\_bw()} black-and-white refers to the colour of the background of the plotting area and labels. If the \emph{colour} or fill \emph{aesthetics} are mapped or set to a constant in the figure, these will be respected irrespective of the theme. One cannot convert a colour figure into a black-and-white one by adding a \emph{theme}. For colour gradients an alternative is to use a greyscale gradient by changing the \emph{scale} used to map values to aesthetics. For discrete scales, a different aesthetic can be used, for example, use \code{shape} or \code{linetype} instead of \code{colour}.
\end{warningbox}

Even the default \ggtheme{theme\_gray()} can be added to a plot, to replace the default theme with a newly constructed one created with arguments different from the default ones. Below, a serif font at a larger size than the default is used.

<<themes-01>>=
ggplot(data = fake2.data,
       mapping = aes(x = z, y = y)) +
  geom_point() +
  theme_gray(base_size = 18,
             base_family = "serif")
@

\begin{playground}
Change the code in the previous chunk to use, one at a time, each of the predefined themes from \ggplot: \ggtheme{theme\_bw()}, \ggtheme{theme\_classic()}, \ggtheme{theme\_minimal()}, \ggtheme{theme\_linedraw()}, \ggtheme{theme\_light()}, \ggtheme{theme\_dark()} and \ggtheme{theme\_void()}.
\end{playground}

\begin{explainbox}
Predefined ``themes'' like \ggtheme{theme\_gray()} are, in reality, not themes but instead are constructors of theme objects. The \emph{themes} they return when called depend on the arguments passed to their parameters. In other words, \code{theme\_gray(base\_size = 15)}, creates a different theme than \code{theme\_gray(base\_size = 11)}. In this case, as sizes of different text elements are defined relative to the base size, the size of all text elements changes in coordination. Font size changes by \emph{themes} do not affect the size of text or labels in plot layers created with geometries, as their size is controlled by the \code{size} \emph{aesthetic}.
\end{explainbox}

A frequent idiom is to create a plot without specifying a theme, and then adding the theme when printing or saving it. This can save work, for example, when producing different versions of the same plot for a publication and a talk.

<<themes-03>>=
p.base <-
  ggplot(data = fake2.data,
         mapping = aes(x = z, y = y)) +
  geom_point()
print(p.base + theme_bw())
@

It is also possible to change the theme used by default in the current \Rlang session with \Rfunction{theme\_set()}.

<<themes-05, eval=eval_plots_all>>=
old_theme <- theme_set(theme_bw(15))
@%

Similar to other functions used to change options in \Rlang, \Rfunction{theme\_set()} returns the previous setting. By saving this value to a variable, here \code{old\_theme}, we are able to restore the previous default, or undo the change.

<<themes-06, eval=eval_plots_all>>=
theme_set(old_theme)
@

\begin{explainbox}
The use of a grey background as default for plots is unusual. This graphic design decision originates in typesetters' goal of maintaining a uniform average luminosity throughout the text and plots in a page. Many scientific journals require or at least prefer a more traditional graphic design. Theme \ggtheme{theme\_bw()} is the most versatile of the traditional designs supported, as it includes a box and thus works well both for individual plots and for plots with facets. Theme \ggtheme{theme\_classic()}, lacking a box and grid, works well for individual plots, but needs to be adjusted when used with facets so as to obtain nice looking plots.
\end{explainbox}
\index{grammar of graphics!complete themes|)}

\subsection{Incomplete themes}
\index{grammar of graphics!incomplete themes|(}
To create a significantly different theme, and/or reuse it in multiple plots, it is best to create a new constructor, or a modified complete theme as described in section \ref{sec:plot:theme:new} on page \pageref{sec:plot:theme:new}. In other cases, it is enough to tweak individual theme settings for a single plot. Below, overlapping $x$-axis tick labels are avoided by rotating the axis tick labels. When rotating the labels, it is also necessary to change their justification, as justification is relative to the orientation of the text.

<<themes-11>>=
ggplot(data = fake2.data,
       mapping = aes(x = z + 1000, y = y)) +
  geom_point() +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 8)) +
  theme(axis.text.x = element_text(angle = 33, hjust = 1, vjust = 1))
@

\begin{playground}
Play with the code in the last chunk above, modifying the values used for \code{angle}, \code{hjust} and \code{vjust}. (Angles are expressed in degrees, and justification with values between 0 and 1.)
\end{playground}

A less elegant approach is to use a smaller font size. Within \Rfunction{theme()}, function \Rfunction{rel()} can be used to set size relative to the base size. In this example, we use \code{axis.text.x} so as to change the size of tick labels only for the $x$ axis.

<<themes-12, eval=eval_plots_all>>=
ggplot(fake2.data, aes(z + 100, y)) +
  geom_point() +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 20)) +
  theme(axis.text.x = element_text(size = rel(0.6)))
@

Theme definitions follow a hierarchy, allowing us to modify the formatting of groups of similar elements, as well as of individual elements. In the chunk above, using \code{axis.text} instead of \code{axis.text.x}, would have affected the tick labels in both $x$ and $y$ axes.

\begin{playground}
Modify the example above, so that the tick labels on the $x$-axis are blue and those on the $y$-axis red, and the font size is the same for both axes, but changed from the default. Consult the documentation for \code{theme()} to find out the names of the elements that need to be given new values. For examples, see \citebooktitle{Wickham2016} \autocite{Wickham2016} and \citebooktitle{Chang2018} \autocite{Chang2018}.
\end{playground}
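
The hierarchy of theme elements can be sketched using the axis titles instead of the tick labels. Below, the setting for \code{axis.title} applies to both axes, while the setting for \code{axis.title.y} applies only to the $y$ axis, which also inherits the bold face. The formatting chosen is arbitrary.

<<themes-hierarchy-ex-01>>=
library(ggplot2)
p.hier <-
  ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  theme(axis.title = element_text(face = "bold"),        # both axes
        axis.title.y = element_text(colour = "darkred")) # y axis only
p.hier
@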

Formatting of other text elements can be adjusted in a similar way, as well as thickness of axes, length of tick marks, grid lines, etc. However, in most cases, these are graphic design elements that are best kept consistent throughout sets of plots and best handled by creating a new \emph{theme}.

\begin{warningbox}
If you both add a \emph{complete theme} and want to modify some of its elements, you should add the whole theme before modifying it with \code{+ theme(...)}. This may seem obvious once one has a good grasp of the grammar of graphics, but can be at first disconcerting.
\end{warningbox}
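
The effect of the order in which themes are added can be seen in the sketch below, which uses data set \code{mpg} only as a convenient example. In the first plot the axis titles are dark red, while in the second one the complete theme, added last, replaces the earlier modification.

<<themes-order-ex-01>>=
library(ggplot2)
p.ord <-
  ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()
# complete theme first, modification second: the modification is retained
p.ord + theme_bw() +
  theme(axis.title = element_text(colour = "darkred"))
# modification first, complete theme second: the modification is lost
p.ord + theme(axis.title = element_text(colour = "darkred")) +
  theme_bw()
@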

It is also possible to modify the default theme used for rendering all subsequent plots.

<<themes-15, eval=eval_plots_all>>=
old_theme <- theme_update(text = element_text(colour = "darkred"))
@

Having saved the previous default to \code{old\_theme} it can be restored when needed.

<<themes-16, eval=eval_plots_all>>=
theme_set(old_theme)
@
\index{grammar of graphics!incomplete themes|)}

\subsection{Defining a new theme}\label{sec:plot:theme:new}
\index{grammar of graphics!creating a theme|(}
Themes can be defined either from scratch, or by modifying existing saved themes and saving the modified version. As discussed above, it is also possible to define a new, parameterised theme constructor function.

Unless we plan to widely reuse the new theme, there is usually no need to define a new function. We can simply save the modified theme to a variable and add it to different plots as needed. As we will be adding a ``ready-built'' theme object rather than a function, we do not use parentheses.

<<themes-21>>=
my_theme <- theme_bw(15) + theme(text = element_text(colour = "darkred"))
p.base + my_theme
@

\begin{explainbox}
Creating a new theme constructor similar to those from package \ggplot can be fairly simple if the changes are few. As the implementation details of theme objects may change in future versions of \ggplot, the safest approach is to rely only on the public interface of the package. The functions exported by package \ggplot can be wrapped inside a new function that modifies the theme before returning it. The interface, parameters, of the wrapped function can be included in the new one and the arguments passed along to the wrapped function, as is or modified. If needed, additional parameters can be handled by code in the wrapper function. Below, a wrapper on \ggtheme{theme\_gray()} is constructed retaining a compatible interface, but adding a new base parameter, \code{base\_colour}. A different default is used for \code{base\_family}. The key detail is passing \code{complete = TRUE} to \Rfunction{theme()}, as this tags the returned theme as being usable by itself, resulting in replacement of any theme already in a plot when it is added.

<<themes-32>>=
my_theme_gray <-
  function (base_size = 11,
            base_family = "serif",
            base_line_size = base_size/22,
            base_rect_size = base_size/22,
            base_colour = "darkblue") {

    theme_gray(base_size = base_size,
               base_family = base_family,
               base_line_size = base_line_size,
               base_rect_size = base_rect_size) +

    theme(line = element_line(colour = base_colour),
          rect = element_rect(colour = base_colour),
          text = element_text(colour = base_colour),
          title = element_text(colour = base_colour),
          axis.text = element_text(colour = base_colour),
          complete = TRUE)
  }
@

Our own theme constructor, created without too much effort, is ready to be used. To avoid surprising users, it is good to make \code{my\_theme\_grey()} a synonym of \code{my\_theme\_gray()} following \ggplot practice.

<<themes-32a>>=
my_theme_grey <- my_theme_gray
@

A plot created using \code{my\_theme\_gray()} with the base colour, used for lines, rectangles and text, set to dark red.

<<themes-33>>=
p.base + my_theme_gray(15, base_colour = "darkred")
@
\end{explainbox}
\index{grammar of graphics!creating a theme|)}
\index{plots!styling|)}
\index{grammar of graphics!themes|)}

\section{Composing Plots}\label{sec:plot:composing}
\index{plots!composing|(}
While facets make it possible to create plots with panels that share the same mappings and data (see section \ref{sec:plot:facets} on page \pageref{sec:plot:facets}), plot composition makes it possible to combine separately created \code{"gg"} plot objects into a single plot. Composition before rendering makes it possible to automate the correct alignment of panels, ensure consistency of text size and even merge duplicate guides or keys. Composite plots can save space on the screen or page but, more importantly, can bring together data visualisations that need to be compared or read as a whole.

Package \pkgname{patchwork} defines a simple grammar for composing plots created with \ggplot, which I have used earlier in the chapter to display pairs of plots side by side. Composition with \pkgname{patchwork} can also include grid graphical objects. The plot composition grammar uses operators \Roperator{+}, \Roperator{|} and \Roperator{/}, although \pkgname{patchwork} provides additional tools for defining complex layouts of panels. While \Roperator{+} allows different layouts, \Roperator{|} composes panels side by side, and \Roperator{/} composes panels on top of each other. The plots to be used as panels can be grouped using parentheses. The operands must be whole plots; below, this is ensured by saving each plot to a variable. When composing anonymous plots, each of them must be enclosed in parentheses to ensure that the correct operators are dispatched.

Three simple plots, \code{p1}, \code{p2} and \code{p3} will be used below.

<<patchwork-01>>=
p1 <- ggplot(mpg, aes(displ, cty, colour = factor(cyl))) +
        geom_point() +
        theme(legend.position = "top")
p2 <- ggplot(mpg, aes(displ, cty, colour = factor(year))) +
        geom_point() +
        theme(legend.position = "top")
p3 <- ggplot(mpg, aes(factor(model), cty)) +
        geom_point() +
        theme(axis.text.x =
                element_text(angle = 90, hjust = 1, vjust = 0.5))
@

<<patchwork-00a, echo=FALSE>>=
opts_chunk$set(opts_fig_very_wide_square)
@

\begin{playground}
A combined plot can be simply assembled using the operators (plot not shown).

<<patchwork-02, eval=eval_plots_all>>=
p1 | (p2 / p3)
@

<<patchwork-02a, eval=eval_plots_all>>=
(p1 | p2) / p3
@

The operators used for composition are ordinary \Rlang arithmetic and logical operators and, even if used for a different purpose, they still obey the operator precedence rules of \Rlang. The order of precedence can be altered, as done above, using parentheses. Run the examples above after creating the three plots. Modify the code trying different ways of organising the three panels.
\end{playground}

A title for the whole plot and a letter as tag for each panel are added as a whole-plot annotation.

<<patchwork-03>>=
((p1 | p2) / p3) +
   plot_annotation(title = "Fuel use in city traffic:", tag_levels = 'a')
@

<<patchwork-00b, echo=FALSE>>=
opts_chunk$set(opts_fig_wide)
@

Recent versions of package \pkgname{patchwork} include tools for the creation of complex layouts, the addition of insets, and the combination in the same layout of plots and other graphic objects such as bitmaps, photographs, and even tables.
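
As a minimal sketch of two of these tools, the code below merges duplicated keys with \code{plot\_layout(guides = "collect")} and places one plot as an inset of another with \code{inset\_element()}. The plots \code{pw1}, \code{pw2} and \code{pw3} are assumptions made only for this illustration, as are the inset coordinates, which are given as fractions of the panel area of the main plot.

<<patchwork-04-sketch, eval=eval_plots_all>>=
library(ggplot2)
library(patchwork)

# two panels sharing the same colour mapping, hence duplicated keys
pw1 <- ggplot(mpg, aes(displ, cty, colour = factor(cyl))) +
         geom_point()
pw2 <- ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
         geom_point()
# a small plot to be used as an inset
pw3 <- ggplot(mpg, aes(cty)) +
         geom_histogram(bins = 15)

# merge the identical keys into a single one
pw.collected <- (pw1 | pw2) + plot_layout(guides = "collect")
pw.collected

# position the inset in the top right corner of the panel
pw.inset <- pw1 + inset_element(pw3, left = 0.55, bottom = 0.55,
                                right = 0.98, top = 0.98)
pw.inset
@

Collecting guides only merges keys that are identical, so the two panels above map the same variable to \code{colour}.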

\begin{advplayground}
Package \pkgname{patchwork} can be very useful. Study the documentation and its examples, and try to think how it could be useful to you. Then try to compose plots like those you could use in your work or studies.
\end{advplayground}

\index{plots!composing|)}

\section[Using \texttt{plotmath} Expressions]{Using \code{plotmath} Expressions}\label{sec:plot:plotmath}
\index{plotmath}
\index{plots!math expressions|(}
Plotmath expressions are similar to \Rlang expressions, but they are targeted at the creation of mathematical annotations. In some respects, they are similar to the math mode in \LaTeX. They are used in graphical output like plots. The syntax sometimes feels awkward and takes some time to learn, but it gets the job done.

\begin{explainbox}
The main limitation to producing rich text annotations in \Rlang similar to those possible using \LaTeX\ or using HTML is at the core of the \Rpgrm program. There is work in progress and improvements can be expected in coming years. Meanwhile, the already implemented enhancements gradually appear as enhanced features in \ggplot and its extensions.

Package \pkgname{ggtext} provides rich-text (basic \langname{HTML} and \Markdown) support for \ggplot, both for annotations and for data visualisation. This is an alternative to the use of \Rlang expressions.
\end{explainbox}

In sections \ref{sec:plot:function} and \ref{sec:plot:text}, simple examples of the use of \Rlang expressions for labelling plots were given. The \code{demo(plotmath)} demo and the help page \code{help(plotmath)} provide enough information to start using expressions in plots. Although expressions are shown here in the context of plotting, they are also used in other contexts in \Rlang code.

In general, it is possible to create \emph{expressions} explicitly with function \Rfunction{expression()} or by parsing a character string. In \ggplot, for some plot elements (layers created with \gggeom{geom\_text()} and \gggeom{geom\_label()}, and the strip labels of facets) the parsing is delayed and applied to mapped character variables in \code{data}. In contrast, for titles, subtitles, captions, axis labels, etc.\ (anything that is defined within \Rfunction{labs()}), the expressions have to be entered explicitly, or saved as such into a variable, and the variable passed as an argument.

When plotting expressions using \gggeom{geom\_text()}, the parsing of character strings is signalled by passing \code{parse = TRUE} in the call to the layer function. In the case of facets' strip labels, parsing or not depends on the \emph{labeller} function used. An additional twist is the possibility of combining static character strings with values taken from \code{data} (see section \ref{sec:plot:facets} on page \pageref{sec:plot:facets}).

The most difficult thing to remember when writing expressions is how to connect the different parts. A tilde (\code{\textasciitilde}) adds space in between symbols. An asterisk (\code{*}) can also be used as a connector, juxtaposing its operands without adding space; it is usually needed when dealing with numbers next to symbols. Using whitespace is allowed in some situations, but not in others. To include within an expression text that should not be parsed, it must be enclosed in quotation marks, which may themselves need to be escaped. For a long list of examples, have a look at the output and code displayed by \code{demo(plotmath)} at the \Rlang command prompt.
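
These rules can be tried out without creating a plot. The sketch below only builds expressions and shows that bare whitespace between two symbols is rejected by the parser; all the object names are arbitrary ones chosen for this illustration.

<<plotmath-connectors-sketch>>=
# juxtaposition without added space, and with added space
e.star  <- expression(alpha*beta)
e.tilde <- expression(alpha~beta)
# a number next to a symbol needs an explicit connector
e.num   <- expression(2*pi*r)
# text that must not be parsed is enclosed in quotation marks
e.text  <- expression("Area"~~(m^2))
# bare whitespace between two symbols is not valid syntax
e.bad <- try(parse(text = "alpha beta"), silent = TRUE)
class(e.bad)
@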

Expressions are frequently used for axis labels, e.g., when the units or symbols require the use of superscripts or Greek letters. In this case, they are usually entered as expressions.

<<plotmath-00>>=
p1 + labs(y = expression("Fuel use"~~(m~g^{-1})),
          x = "Engine displacement (L)",
          colour = "Engine\ncylinders") +
          theme(legend.position = "right")
@

<<plotmath-01>>=
set.seed(54321) # make sure we always generate the same data
my.data <-
  data.frame(x = 1:5,
             y = rnorm(5),
             greek.label = paste("alpha[", 1:5, "]", sep = ""))
@

In the example below, the $x$-axis label is a Greek $\alpha$ character with $i$ as subscript, and the $y$-axis label includes a superscript in the units. The title we use is a character string, while the subtitle is a rather complex expression.

Each observation has as data label a subscripted $\alpha$. When using a \emph{geometry}, instead of directly using an expression, we map to the \code{label} \emph{aesthetic} character strings to be parsed into expressions, in other words, character strings that are written using the syntax of expressions. We need to set \code{parse = TRUE} in the call to the \emph{geometry} so that the strings, instead of being plotted as is, are parsed into expressions before the plot is rendered.

<<plotmath-02>>=
ggplot(my.data, aes(x, y, label = greek.label)) +
   geom_point() +
   geom_text(angle = 45, hjust = 1.2, parse = TRUE) +
   labs(x = expression(alpha[i]),
        y = expression(Speed~~(m~s^{-1})),
        title = "Using expressions",
        subtitle = expression(sqrt(alpha[1] + frac(beta, gamma))))
@

As parsing character strings is an alternative way of creating expressions, this approach can also be used in other situations. For example, a character string stored in a variable can be parsed with \Rfunction{parse()} as done below for \code{title}. Tick labels are also set to expressions, taking advantage of the fact that \Rfunction{expression()} accepts multiple arguments separated by commas, returning a vector of expressions.

<<plotmath-02a>>=
my_eq.char <- "alpha[i]"
ggplot(my.data, aes(x, y)) +
   geom_point() +
   labs(title = parse(text = my_eq.char)) +
   scale_x_continuous(name = expression(alpha[i]),
                      breaks = c(1,3,5),
                      labels = expression(alpha[1], alpha[3], alpha[5]))
@

A different approach (no example shown) would be to call \Rfunction{parse()} explicitly for each individual label, something that might be needed if the tick labels need to be ``assembled'' programmatically instead of set as constants.
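
A minimal sketch of such programmatic assembly follows. As a variation on the approach just described, it relies on \code{parse()} accepting a character vector and returning one expression per string, so a single call is enough. The names \code{tick.breaks}, \code{tick.strings}, \code{tick.labels} and \code{ticks.data} are assumptions introduced only for this illustration.

<<plotmath-ticks-sketch, eval=eval_plots_all>>=
library(ggplot2)

tick.breaks <- c(1, 3, 5)
# one character string per tick label, written using plotmath syntax
tick.strings <- paste("alpha[", tick.breaks, "]", sep = "")
# parse() returns an expression vector with one member per string
tick.labels <- parse(text = tick.strings)

set.seed(54321)
ticks.data <- data.frame(x = 1:5, y = rnorm(5))
ggplot(ticks.data, aes(x, y)) +
  geom_point() +
  scale_x_continuous(name = expression(alpha[i]),
                     breaks = tick.breaks,
                     labels = tick.labels)
@

Changing \code{tick.breaks} is then enough to keep breaks and labels in agreement.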

\begin{explainbox}
\textbf{Differences between \Rfunction{parse()} and \Rfunction{expression()}}. Function \Rfunction{parse()} takes as an argument a character string. This is very useful as the character string can be created programmatically. When using \code{expression()} this is not possible, except for substitution at execution time of the value of variables into the expression. See the help pages for both functions.

Function \Rfunction{expression()} accepts its arguments without any delimiters. Function \Rfunction{parse()} takes a single character string as an argument to be parsed, in which case quotation marks within the string need to be \emph{escaped} (using \code{\backslash"} where a literal \code{"} is desired). In both cases, a character string can be embedded using one of the functions \Rfunction{plain()}, \Rfunction{italic()}, \Rfunction{bold()} or \Rfunction{bolditalic()} which also affect the font used. The argument to these functions needs to be a character string delimited by quotation marks if it is not to be parsed.

When using \Rfunction{expression()}, bare quotation marks can be embedded,

<<expr-parse-box-01, eval=eval_plots_all>>=
ggplot(cars, aes(speed, dist)) +
  geom_point() +
  xlab(expression(x[1]*"  test"))
@

while in the case of \Rfunction{parse()} they need to be \emph{escaped},

<<expr-parse-box-02, eval=eval_plots_all>>=
ggplot(cars, aes(speed, dist)) +
  geom_point() +
  xlab(parse(text = "x[1]*\"  test\""))
@

and in some cases will be enclosed within a format function.

<<expr-parse-box-03, eval=eval_plots_all>>=
ggplot(cars, aes(speed, dist)) +
  geom_point() +
  xlab(parse(text = "x[1]*italic(\"  test\")"))
@

Some additional remarks. If \Rfunction{expression()} is passed multiple arguments, it returns a vector of expressions. Where \Rfunction{ggplot()} expects a single value as an argument, as in the case of axis labels, only the first member of the vector will be used.

<<expr-parse-box-06, eval=eval_plots_all>>=
ggplot(cars, aes(speed, dist)) +
  geom_point() +
  xlab(expression(x[1], "  test"))
@

Depending on the location within an expression, spaces may be ignored or illegal. To juxtapose elements without adding space, use \code{*}, and to explicitly insert whitespace, use \code{\textasciitilde}. As shown above, spaces are accepted within quoted text. Consequently, the following alternatives can also be used.

<<expr-parse-box-07, eval=eval_plots_all, echo=3>>=
ggplot(cars, aes(speed, dist)) +
  geom_point() +
  xlab(parse(text = "x[1]~~~~\"test\""))
@

<<expr-parse-box-08, eval=eval_plots_all, echo=3>>=
ggplot(cars, aes(speed, dist)) +
  geom_point() +
  xlab(parse(text = "x[1]~~~~plain(test)"))
@

However, unquoted whitespace is discarded.

<<expr-parse-box-09, eval=eval_plots_all, echo=3>>=
ggplot(cars, aes(speed, dist)) +
  geom_point() +
  xlab(parse(text = "x[1]*plain(   test)"))
@

Finally, it can be surprising that trailing zeros in numeric values appearing within an expression are dropped.
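
The sketch below illustrates this behaviour and one possible workaround, which is to pass the digits as quoted text so that they are not parsed as a number. The names \code{e.dropped} and \code{e.kept} are arbitrary ones used only for this illustration.

<<expr-parse-box-10-sketch>>=
# the parsed number is stored as a numeric value, so 2.50 becomes 2.5
e.dropped <- parse(text = "x == 2.50")
e.dropped[[1]]
# quoting the digits keeps them as text, preserving the trailing zero
e.kept <- parse(text = "x == \"2.50\"")
e.kept[[1]]
# sprintf() can generate such strings from numeric values
sprintf("x == \"%.2f\"", 2.5)
@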
\end{explainbox}

Function \Rfunction{paste()} was used above to insert values stored in a variable; functions \Rfunction{format()}, \Rfunction{sprintf()} and \Rfunction{strftime()} allow the conversion into character strings of other values. These functions can be used when creating plots to generate suitable character strings for the \code{label} \emph{aesthetic} out of numeric, logical, date, time, and even character values. They can be, for example, used to create labels within a call to \code{aes()}.

<<sprintf-01>>=
sprintf("log(%.3f) = %.3f", 5, log(5))
sprintf("log(%.3g) = %.3g", 5, log(5))
@
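
As a sketch of the use within \code{aes()} mentioned above, the numeric values mapped to $y$ can be formatted on the fly and mapped to the \code{label} \emph{aesthetic}. The data frame \code{fmt.data} and the plot \code{p.fmt} are assumptions made only for this illustration.

<<sprintf-02-sketch, eval=eval_plots_all>>=
library(ggplot2)

set.seed(54321)
fmt.data <- data.frame(x = 1:5, y = rnorm(5))
# labels are formatted on the fly, with two decimal places
p.fmt <- ggplot(fmt.data, aes(x, y, label = sprintf("%.2f", y))) +
  geom_point() +
  geom_text(vjust = -0.8)
p.fmt
@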

\begin{playground}
Study the chunk above. If you are familiar with \langname{C} or \langname{C++}, function \Rfunction{sprintf()} will already be familiar to you; otherwise, study its help page.

Play with functions \Rfunction{format()}, \Rfunction{sprintf()} and \Rfunction{strftime()}, using them to convert and format different types of data into character strings with different numbers of significant digits, scientific notation, decimal format, different field width, justification, etc.
\end{playground}

It is also possible to substitute the value of variables or, in fact, the result of evaluation, into a new expression, allowing on the fly construction of expressions. Such expressions are frequently used as labels in plots. This is achieved through use of \emph{quoting} and \emph{substitution}.

Function \Rfunction{bquote()} can be used to substitute variables or expressions enclosed in \code{.( )} by their value. Be aware that the argument to \Rfunction{bquote()} needs to be written as an expression; in this example, a tilde, \code{\textasciitilde}, inserts a space between words. Furthermore, if the expressions include variables, these will be searched for in the environment rather than within \code{data}, except within calls to \code{aes()} or \code{vars()}.

<<expr-bquote-01>>=
ggplot(cars, aes(speed, dist)) +
  geom_point() +
  labs(title = bquote(Time~zone: .(Sys.timezone())),
       subtitle = bquote(Date: .(as.character(today())))
       )
@
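
Substitution with \code{bquote()} is not restricted to character strings. In the sketch below, a computed numeric value is inserted into a plotmath expression; the linear model fit and the names \code{fit}, \code{r2} and \code{r2.label} are assumptions made only for this illustration.

<<expr-bquote-02-sketch, eval=eval_plots_all>>=
library(ggplot2)

fit <- lm(dist ~ speed, data = cars)
r2 <- summary(fit)$r.squared
# .( ) is replaced by its value; the rest is kept as an expression
r2.label <- bquote(italic(R)^2 == .(round(r2, 2)))
r2.label
ggplot(cars, aes(speed, dist)) +
  geom_point() +
  labs(title = r2.label)
@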

In the case of \Rfunction{substitute()} a named list can be passed as argument.

<<expr-substitute-01>>=
ggplot(cars, aes(speed, dist)) +
  geom_point() +
  labs(title = substitute(Time~zone: tz, list(tz = Sys.timezone())),
       subtitle = substitute(Date: date, list(date = as.character(today())))
       )
@

For example, substitution can be used to assemble an expression within a function based on the arguments passed. One case of interest is to retrieve the name of the object passed as an argument, from within a function.

<<expr-deparse-01>>=
deparse_test <- function(x) {
  print(deparse(substitute(x)))
}

a <- "saved in variable"

deparse_test("constant")
deparse_test(1 + 2)
deparse_test(a)
@
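
In plotting code, the same idiom can be used, for example, to generate a default title from the name of the object passed as argument. The function \code{titled\_plot()} below is a hypothetical helper written only as an illustration, not part of any package.

<<expr-deparse-02-sketch, eval=eval_plots_all>>=
library(ggplot2)

# hypothetical helper: uses the name of the data argument as title
titled_plot <- function(data, mapping) {
  data.name <- deparse(substitute(data))
  ggplot(data, mapping) +
    geom_point() +
    labs(title = data.name)
}

p.titled <- titled_plot(cars, aes(speed, dist))
p.titled
@

The call to \code{substitute()} has to be located in the function that receives the argument from the user, as in a nested call it would retrieve the name of the formal parameter instead.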

\index{plots!math expressions|)}

\section{Creating Complex Data Displays}\label{sec:plot:composition}
\index{plots!modular construction|(}

The grammar of graphics\index{grammar of graphics}\index{plots!layers} allows one to build and test plots incrementally. In daily use, when creating a completely new plot, it is best to start with a simple design for a plot, \code{print()} this plot, checking that the output is as expected and the code error-free. Afterwards, one can map additional \emph{aesthetics} and add \emph{geometries} and \emph{statistics} gradually. The final steps are then to add \emph{annotations} and the text or expressions used for titles, and axis and key labels. Another approach is to start with an existing plot and modify it, e.g.,  by using the same plotting code with different \code{data} or mapping different variables. When reusing code for a different data set, scale \code{limits} and \code{names} are likely to need to be edited.
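
The sketch below follows the first of these approaches in three steps, printing the plot after each of them. The data, mappings and labels are assumptions chosen only to keep the illustration short.

<<plot-incremental-sketch, eval=eval_plots_all>>=
library(ggplot2)

# step 1: a minimal plot, printed to check data and mappings
p.step <- ggplot(mtcars, aes(disp, mpg)) +
  geom_point()
print(p.step)
# step 2: map an additional aesthetic and add a statistic
p.step <- p.step +
  aes(colour = factor(cyl)) +
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE)
print(p.step)
# step 3: finish with labels and a theme
p.step <- p.step +
  labs(x = "Engine displacement (cu. in.)",
       y = "Fuel economy (miles per gallon)",
       colour = "Cylinders") +
  theme_bw()
print(p.step)
@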

\begin{playground}
  Build a graphically complex data plot of your interest, step by step, taking advantage of the layered structure to test intermediate versions in an iterative design process, first by building up the complex plot in stages as a tool in debugging, and later using iteration in the processes of improving the graphic design of the plot, its readability, and effectiveness in conveying information.
\end{playground}

\section{Creating Sets of Plots}\label{sec:plot:sets:of}
\index{plots!consistent styling}\index{plots!programatic construction|(}
Plots to be presented at a given occasion or published as part of the same work need to be consistent in various respects: themes, scales and palettes, annotations, titles, and captions. To guarantee this consistency, we need to build plots modularly and avoid repetition by assigning names to the ``modules'' that need to be used multiple times.

A simple version of this approach was used in many examples above, where a base plot was modified by addition of different layers or scales.

\subsection{Saving plot layers and scales in variables}

When creating plots with \ggplot,\index{plots!reusing parts of} objects are composed using operator \code{+} to assemble together the individual components. The functions that create plot layers, scales, etc.\ are constructors of objects and the objects they return can be stored in variables, and once saved, added to multiple plots at a later time.

A plot can be saved to a variable, here \code{p.base}, and, e.g., the value returned by a call to function \code{labs()}, into a different variable, here, \code{p.labels}.

<<plot_composition-01, eval=eval_plots_all>>=
p.base <- ggplot(data = mtcars,
                 aes(x = disp, y = mpg,
                 colour = factor(cyl))) +
          geom_point()

p.labels <- labs(x = "Engine displacement",
                 y = "Fuel economy (miles per gallon)",
                 colour = "Number of\ncylinders",
                 shape = "Number of\ncylinders")
@

\begin{warningbox}
 When composing plots with the \code{+} operator, the left-hand-side operand must be a \code{"gg"} object. The right-hand-side operand is added to the \code{"gg"} plot object and the result returned as a new \code{"gg"} plot object.
\end{warningbox}

The final plot can be assembled from the objects saved to variables. This is useful when creating several plots that should have consistent labels. The same approach can be used with other components. Below, the objects are combined with additional components to create different versions of the same plot.

<<plot_composition-02, eval=eval_plots_all>>=
p.base
p.base + p.labels + theme_bw(16)
p.base + p.labels + theme_bw(16) + ylim(0, NA)
@

We can also save intermediate results.

<<plot_composition-03, eval=eval_plots_all>>=
p.log <- p.base + scale_y_log10(limits = c(8, 55))
p.log + p.labels + theme_bw(16)
@

\subsection{Saving plot layers and scales in lists}

If the pieces to be put together do not include a \code{"gg"} object, they can be collected into a list and saved. When the list is added to a \code{"gg"} plot object, the members of the list are added one by one to the plot respecting their order.

<<plot_composition-11, eval=eval_plots_all>>=
p.parts <- list(p.labels, theme_bw(16))
p1 + p.parts
@

\begin{playground}
Revise the code you wrote for the ``playground'' exercise in section \ref{sec:plot:composition}, but this time, pre-building and saving groups of elements that you expect to be useful unchanged when composing a different plot of the same type, or a plot of a different type from the same data.
\end{playground}

\subsection{Using functions as building blocks}

The ``packaged'' plot parts sometimes should adjust their behaviour at the time they are added to a plot. In this case a function that accepts the necessary arguments can be written, much as in the example of creating a new theme by wrapping function \ggtheme{theme\_grey()} (see section \ref{sec:plot:theme:new} on page \pageref{sec:plot:theme:new}). These functions can return a \code{"gg"} object, a list of plot components, or a single plot component. The simplest use is to alter some defaults in existing constructor functions returning \code{"gg"} objects or layers. The ellipsis (\code{...}) allows passing arguments to a nested function. In this case, every argument passed to \code{bw\_ggplot()}, whether by name or by position, is forwarded unchanged to the nested call to \code{ggplot()}. Be aware that as \code{...} is the only formal parameter of this wrapper, the arguments are matched to parameters only in the nested call, i.e., according to the definition of \code{ggplot()}.

<<plot_composition-21, eval=eval_plots_all>>=
bw_ggplot <- function(...) {
  ggplot(...) +
  theme_bw()
}
@

which could be used as follows.

<<plot_composition-22, eval=eval_plots_all>>=
bw_ggplot(data = mtcars,
          mapping = aes(x = disp, y = mpg,
          colour = factor(cyl))) +
          geom_point()
@
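
A function can also return a list of plot components. The sketch below uses the fact that a \code{NULL} member of such a list is skipped when the list is added to a plot, which makes optional components easy to implement. The function \code{mpg\_parts()} and its parameters are hypothetical, defined only for this illustration.

<<plot_composition-23-sketch, eval=eval_plots_all>>=
library(ggplot2)

# hypothetical building block returning a list of plot components;
# if() without else returns NULL when the condition is FALSE
mpg_parts <- function(base_size = 11, log_y = FALSE) {
  list(labs(x = "Engine displacement (cu. in.)",
            y = "Fuel economy (miles per gallon)"),
       if (log_y) scale_y_log10(),
       theme_bw(base_size))
}

p.lin <- ggplot(mtcars, aes(disp, mpg)) +
  geom_point() +
  mpg_parts(14)
p.logy <- ggplot(mtcars, aes(disp, mpg)) +
  geom_point() +
  mpg_parts(14, log_y = TRUE)
p.lin
p.logy
@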

\index{plots!programatic construction|)}
\index{plots!modular construction|)}

\section{Generating Output Files}\label{sec:plot:render}
\index{devices!output|see{graphic output devices}}
\index{plots!saving to file|see{plots, rendering}}
\index{graphic output devices|(}
\index{plots!rendering|(}
It is possible, when using \RStudio, to directly export the displayed plot to a file using a menu. However, if the file will have to be generated again at a later time, or a series of plots need to be produced with consistent format, it is best to include the commands to export the plot in the script.

In \Rlang,\index{plots!printing}\index{plots!saving}\index{plots!output to files} files are created by printing to different devices. Printing is directed to the currently active device, such as a window in \RStudio. Some devices produce screen output, while others write files. Devices depend on drivers. There are both devices that are part of \Rlang and additional ones defined in contributed packages.

Creating a file involves opening a device, printing and closing the device in sequence. In most cases, the file remains locked until the device is closed.

For example, when rendering a plot to\index{plots!PDF output}\index{file formats!PDF} PDF, Encapsulated Postscript, SVG or other vector graphics formats, arguments passed to \code{width} and \code{height} are expressed in inches.

<<plot-file-01, eval=eval_plots_all>>=
fig1 <- ggplot(data.frame(x = -3:3), aes(x = x)) +
  stat_function(fun = dnorm)
pdf(file = "fig1.pdf", width = 8, height = 6)
print(fig1)
dev.off()
@
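
As the file remains locked until the device is closed, code that can fail between opening and closing the device is best wrapped in a function that closes the device in all cases. The function \code{save\_pdf()} below is a hypothetical helper sketched under this assumption; it writes to the temporary folder of the \Rlang session only to keep the illustration free of side effects.

<<plot-file-01a-sketch, eval=eval_plots_all>>=
library(ggplot2)

# hypothetical helper: on.exit() closes the device even if print() fails
save_pdf <- function(plot, file, width = 8, height = 6) {
  pdf(file = file, width = width, height = height)
  on.exit(dev.off(), add = TRUE)
  print(plot)
  invisible(file)
}

fig.sketch <- ggplot(data.frame(x = -3:3), aes(x = x)) +
  stat_function(fun = dnorm)
out.file <- save_pdf(fig.sketch,
                     file.path(tempdir(), "fig-sketch.pdf"))
file.exists(out.file)
@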

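For Encapsulated Postscript\index{plots!Postscript output}\index{file formats!PS, EPS} and SVG\index{plots!SVG output}\index{file formats!SVG} output, we need to substitute \code{pdf()} with \code{postscript()} or \code{svg()}, respectively. In the case of \code{postscript()}, the defaults target printed pages, so that additional arguments are needed for the output to be a single-figure EPS file of the requested size.

<<plot-file-02, eval=eval_plots_all>>=
postscript(file = "fig1.eps", width = 8, height = 6,
           horizontal = FALSE, onefile = FALSE, paper = "special")
print(fig1)
dev.off()
@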

In the case of graphics devices for\index{plots!bitmap output}\index{file formats!JPEG}\index{file formats!PNG}\index{file formats!TIFF}\index{file formats!BMP} file output in BMP, JPEG, PNG, and TIFF bitmap formats, arguments passed to \code{width} and \code{height} are expressed in pixels.

<<plot-file-03, eval=eval_plots_all>>=
tiff(file = "fig1.tiff", width = 1000, height = 800)
print(fig1)
dev.off()
@
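
The bitmap devices in base \Rlang also accept sizes in other units when a resolution is supplied. The sketch below, with a file name and plot that are assumptions made for this illustration, requests a PNG file of 8 by 6 inches at 300 pixels per inch.

<<plot-file-04-sketch, eval=eval_plots_all>>=
library(ggplot2)

fig.bitmap <- ggplot(data.frame(x = -3:3), aes(x = x)) +
  stat_function(fun = dnorm)
png.file <- file.path(tempdir(), "fig-sketch.png")
# 8 x 6 inches at 300 pixels per inch gives 2400 x 1800 pixels
png(filename = png.file, width = 8, height = 6,
    units = "in", res = 300)
print(fig.bitmap)
dev.off()
@

Setting the size in inches keeps text and symbols at the same relative size as in the vector formats above, while \code{res} controls the number of pixels.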
\index{plots!rendering|)}
\index{graphic output devices|)}

\begin{warningbox}
Some graphics devices are part of base-\Rlang and others are implemented in contributed packages. In some cases, there are multiple graphic devices available for rendering graphics in a given file format. These devices usually use different libraries, or have been designed with different aims. These alternative graphic devices can also differ in their function signature, i.e., have differences in the parameters and their names.

Differences also exist in their limitations and supported features, so in cases when rendering fails inexplicably, it is worthwhile to switch to an alternative graphics device to find out if the problem is in the plot or in the rendering engine. Several of the new features added to \pkgname{grid} in \Rlang versions 4.1.0, 4.2.0, and 4.3.0 are currently supported only by some of the graphics devices.
\end{warningbox}

\section{Debugging Ggplots}

\Rlang package \pkgname{gginnards} provides methods \code{str()} (enhanced), \code{num\_layers()}, \code{top\_layer()}, \code{bottom\_layer()}, and \code{mapped\_vars()}. It also defines \code{geoms} and \code{stats} that, instead of creating a layer, pass to a function such as \code{print()} the data frame they receive through parameter \code{data}. These are simple functions that, even if dependent on \ggplot internals, are unlikely to break with \ggplot updates.

Package \pkgname{ggtrace} provides much more detailed and sophisticated approaches to explore the internals of \code{"gg"} plot objects. Package \ggplot itself gives access to some object components.

Of these tools, \gggeom{geom\_debug()} is probably the most intuitive to use, both on its own and as an argument to \code{stats}.

<<ggplot-debug-01>>=
ggplot(data = iris, mapping = aes(x = Petal.Length, y = Species)) +
  stat_summary(geom = "debug")
@

<<ggplot-debug-02>>=
ggplot(data = iris, mapping = aes(x = Petal.Length)) +
  stat_bin(geom = "debug")
@
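
As a sketch of the access provided by \ggplot itself, function \code{layer\_data()} returns the data frame that the \emph{geometry} of a layer receives after the \emph{statistic} has been applied. The plot \code{p.dbg} and the number of bins are assumptions made only for this illustration.

<<ggplot-debug-03-sketch>>=
library(ggplot2)

p.dbg <- ggplot(data = iris, mapping = aes(x = Petal.Length)) +
  stat_bin(bins = 10)
# data received by the geometry of the first layer
bin.data <- layer_data(p.dbg, i = 1)
head(bin.data[ , c("x", "count", "xmin", "xmax")])
# every observation is counted in one of the bins
sum(bin.data$count) == nrow(iris)
@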

\section{Further Reading}
An\index{further reading!grammar of graphics}\index{further reading!plotting} in-depth discussion of the many extensions to package \pkgname{ggplot2} is outside the scope of this book. Several books describe in detail the use of \pkgname{ggplot2}, with \citebooktitle{Wickham2016} \autocite{Wickham2016} being the one written by the main author of the package. For inspiration or worked-out examples, the book \citebooktitle{Chang2018} \autocite{Chang2018} is an excellent reference. In-depth explanations of the technical aspects of \Rlang graphics are available in the book \citebooktitle{Murrell2019} \autocite{Murrell2019}.

<<echo=FALSE>>=
try(detach(package:learnrbook))
try(detach(package:ggbeeswarm))
try(detach(package:ggpmisc))
try(detach(package:ggpp))
try(detach(package:gginnards))
try(detach(package:ggrepel))
try(detach(package:ggplot2))
try(detach(package:scales))
try(detach(package:lubridate))
try(detach(package:dplyr))
try(detach(package:tibble))
@

<<eval=eval_diag, include=eval_diag, echo=eval_diag, cache=FALSE>>=
knitter_diag()
R_diag()
other_diag()
@


%See section \ref{sec:plot:composition} on page \pageref{sec:plot:composition} on plot composition for an explanation of the code below.