using-r-main-crc-edited.tex

\documentclass[krantz2]{krantz}\usepackage{knitr}
\usepackage{color}

\usepackage{hologo}

\usepackage{csquotes}

\usepackage{graphicx}
\DeclareGraphicsExtensions{.jpg,.pdf,.png}

\usepackage{animate}

\usepackage[style=authoryear-comp,giveninits,sortcites,maxcitenames=2,%
    mincitenames=1,maxbibnames=10,minbibnames=10,backref,uniquename=mininit,%
    uniquelist=minyear,sortgiveninits=true,backend=biber]{biblatex}

\newcommand{\href}[2]{\emph{#2} (\url{#1})}

\usepackage{framed}

\usepackage{abbrev}
\usepackage{usingr}

\usepackage{imakeidx}

% this is to reduce spacing above and below verbatim, which is used by knitr
% to show returned values
\usepackage{etoolbox}
\makeatletter
\preto{\@verbatim}{\topsep=-5pt \partopsep=-4pt \itemsep=-2pt}
\makeatother

%\usepackage{polyglossia}
%\setdefaultlanguage{english}

\setcounter{topnumber}{3}
\setcounter{bottomnumber}{3}
\setcounter{totalnumber}{4}
\renewcommand{\topfraction}{0.90}
\renewcommand{\bottomfraction}{0.90}
\renewcommand{\textfraction}{0.10}
\renewcommand{\floatpagefraction}{0.70}
\renewcommand{\dbltopfraction}{0.90}
\renewcommand{\dblfloatpagefraction}{0.70}

% ensure page numbers are aligned in TOC
\makeatletter
\renewcommand{\@pnumwidth}{2.05em}
\makeatother

\addbibresource{rbooks.bib}
\addbibresource{references.bib}

\makeindex[title=General index]
\makeindex[name=rindex,title=Alphabetic index of \Rlang names]
\makeindex[name=rcatsidx,title=Index of \Rlang names by category]
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
\begin{document}
\hyphenation{pro-cess-ing paren-the-ses spe-cif-ic au-thors in-ter-act-ed lim-it}

\title{\Huge{\fontseries{ub}\sffamily Learn R\\{\Large As a Language}}}

\author{Pedro J. Aphalo}

\date{Helsinki, \today}

% knitr setup


\frontmatter

\maketitle

\newpage

\setcounter{page}{7} %previous pages will be reserved for frontmatter to be added in later.
\tableofcontents
%\include{frontmatter/foreword}
\include{frontmatter/preface}
%\listoffigures

\mainmatter


% !Rnw root = using-r.main.Rnw


\chapter{R: The language and the program}\label{chap:R:introduction}

\begin{VF}
In a world of \ldots\ relentless pressure for more of everything, one can lose sight of the basic principles---simplicity, clarity, generality---that form the bedrock of good software.

\VA{Brian W. Kernighan and Rob Pike}{\emph{The Practice of Programming}, 1999}\nocite{Kernighan1999}
\end{VF}


\section{Aims of this chapter}

In this chapter you will learn some facts about the history and design aims behind the \Rlang language, its implementation in the \Rpgrm program, and how it is used in actual practice when sitting at a computer. You will learn the difference between typing commands interactively, reading each partial response from \Rlang on the screen as you type versus using \Rlang scripts to execute a ``job'' which saves results for later inspection by the user.

I will describe the advantages and disadvantages of textual command languages such as \Rlang compared to menu-driven user interfaces as frequently used in other statistics software and occasionally also with \Rlang. I will discuss the role of textual languages in the very important question of reproducibility of data analyses.

Finally you will learn about the different types and sources of help available to \Rlang users, and how to best make use of them.

\section{R}

\subsection{What is R?}

Most people think\index{R as a language@{\Rlang as a language}}\index{R as a program@{\Rlang as a program}}
of \Rpgrm as a computer program. \Rpgrm is indeed a computer program---a piece of software--- but it is also a computer language, implemented in the \Rpgrm program. Does this make a difference? Yes. Until recently we had only one mainstream implementation of \Rlang, the program \Rpgrm. Recently another implementation has gained some popularity, \pgrmname{Microsoft R Open} (MRO), which is directly based on the \Rpgrm program from \textit{The R Project for Statistical Computing}. MRO is described as an enhanced distribution of \Rpgrm. These two very similar implementations are not the only ones available, but others are not in widespread use. In other words, the \Rlang language can be used not only in the \Rpgrm program, and it is feasible that other implementations will be developed in the future.

The name ``base \Rlang\index{base R@{base \Rlang}}'' is used to distinguish \Rlang itself, as in the \Rpgrm distribution, from \Rlang in a broader sense, which includes independently developed extensions that can be loaded from separately distributed extension packages.

Being that \Rpgrm is essentially a command-line application, it can be used on what nowadays are frugal computing resources, equivalent to a personal computer of three decades ago. \Rpgrm can run even on the Raspberry Pi\index{Raspberry Pi}, a micro-controller board with the processing power of a modest smart phone. At the other end of the spectrum, on really powerful servers, \Rpgrm can be used for the analysis of big data sets with millions of observations. How powerful a computer you will need will depend on the size of the data sets you want to analyze, on how patient you are, and on your ability to write ``good'' code.

One could think of \Rlang as a dialect of an earlier language, called \Slang. \Slang evolved into \Splang \autocite{Becker1988}. \Slang and \Splang are commercial programs, and variations in the language appeared only between versions. \Rlang started as a poor man's home-brewed implementation of \Slang, for use in teaching. Initially \Rpgrm, the program, implemented a subset of the \Slang language. The \Rpgrm program evolved until only relatively few differences between \Slang and \Rlang remained, and these differences are intentional---thought of as significant improvements. As \Rlang overtook \Splang in popularity, some of the new features in \Rlang made their way back into \Splang. \Rpgrm is free and open-source and the name \pgrmname{Gnu S} is sometimes used to refer to \Rpgrm.

What makes \Rlang different from \pgrmname{SPSS}, \pgrmname{SAS}, etc., is that \Slang was designed from the start as a computer programming language. This may look unimportant for someone not actually needing or willing to write software for data analysis. However, in reality it makes a huge difference because \Rlang is easily extensible. By this we mean that new functionality can be easily added, and shared, and this new functionality is to the user indistinguishable from that built into \Rlang. In other words, instead of having to switch between different pieces of software to do different types of analyses or plots, one can usually find an \Rlang package that will provide the tools to do the job within \Rlang. For those routinely doing similar analyses the ability to write a short program, sometimes just a handful of lines of code, allows automation of routine analyses. For those willing to spend time programming, they have the door open to building the tools they need when these do not already exist.

However, the most important advantage of using a language like \Rlang is that it makes it easy to do data analyses in a way that ensures that they can be exactly repeated. In other words, the biggest advantage of using \Rlang, as a language, is not in communicating with the computer, but in communicating to other people what has been done, in a way that is unambiguous. Of course, other people may want to run the same commands in another computer, but still it means that a translation from a set of instructions to the computer into text readable to humans---say the materials and methods section of a paper---and back is avoided together with the ambiguities usually creeping in.

\subsection{R as a language}
\index{R as a language@{\Rlang as a language}}
\Rlang is a computer language designed for data analysis and data visualization, however, in contrast to some other scripting languages, it is, from the point of view of computer programming, a complete language---it is not missing any important feature. In other words, no fundamental operations or data types are lacking \autocite{Chambers2016}. I attribute much of its success to the fact that its design achieves a very good balance between simplicity, clarity and generality. \Rlang excels at generality thanks to its extensibility at the cost of only a moderate loss of simplicity, while clarity is ensured by enforced documentation of extensions and support for both object-oriented and functional approaches to programming. The same three principles can be also easily respected by user code written in \Rlang.

As mentioned above, \Rlang started as a free and open-source implementation of the \Slang language \autocite{Becker1984,Becker1988}. We will describe the features of the \Rlang language in later chapters. Here I mention, for those with programming experience, that it does have some features that make it different from other frequently used programming languages. For example, \Rlang does not have the strict type checks of \langname{Pascal} or  \Cpplang. It has operators that can take vectors and matrices as operands allowing more concise program statements for such operations than other languages. Writing programs, specially reliable and fast code, requires familiarity with some of these idiosyncracies of the \Rlang language. For those using \Rpgrm interactively, or writing short scripts, these idiosyncratic features make life a lot easier by saving typing.

\begin{explainbox}
Some languages have been standardized, and their grammar has been formally defined. \Rlang, in contrast is not standardized, and there is no formal grammar definition. So, the \Rlang language is defined by the behavior of the \Rpgrm program.
\end{explainbox}

\subsection{R as a computer program}
\index{R as a program@{\Rpgrm as a program}}
\index{Windows@{\textsf{Windows}}|see{MS-Windows@{\textsf{MS-Windows}}}}
The \Rpgrm program itself is open-source, and the source code is available for anybody to inspect, modify and use. A small fraction of users will directly contribute improvements to the \Rpgrm program itself, but it is possible, and those contributions are important in making \Rpgrm reliable. The executable, the \Rpgrm program we actually use, can be built for different operating systems and computer hardware. The members of the \Rpgrm developing team make an important effort to keep the results obtained from calculations done on all the different builds and computer architectures as consistent as possible. The aim is to ensure that computations return consistent results not only across updates to \Rpgrm but also across different operating systems like \osname{Linux}, \osname{Unix} (including \osname{OS X}), and \osname{MS-Windows}, and computer hardware.

\begin{figure}
  \centering
  \includegraphics[width=0.85\textwidth]{figures/R-console-r}
  \caption[The R console]{The \Rpgrm console where the user can type textual commands one by one. Here the user has typed \code{print("Hello")} and \textit{entered} it by ending the line of text by pressing the ``enter'' key. The result of running the command is displayed below the command. The character at the head of the input line, a ``$>$'' in this case, is called the command prompt, signaling where a command can be typed in. Commands entered by the user are displayed in red, while results returned by \Rlang are displayed in blue.}\label{fig:intro:console}
\end{figure}

The \Rpgrm program does not have a graphical user interface (GUI), or menus from which to start different types of analyses. Instead, the user types the commands at the \Rpgrm console (Figure \ref{fig:intro:console}). The same textual commands can also be saved into a text file, line by line, and such a file, called a ``script'' can substitute repeated typing of the same sequence of commands. When we work at the console typing in commands one by one, we say that we use \Rlang interactively. When we run script, we may say that we run a ``batch job.''

The two approaches described above are part of the \Rpgrm program by itself. However, it is common to use a second program as a front-end or middleman between the user and the \Rpgrm program. Such a program allows more flexibility and has multiple features that make entering commands or writing scripts easier. Computations are still done by exactly the same \Rpgrm program. The simplest option is to use a text editor like \pgrmname{Emacs} to edit the scripts and then run the scripts in \Rpgrm from within the editor. With some editors like \pgrmname{Emacs}, rather good integration is possible. However, nowadays there are also Integrated Development Environments (IDEs) available for \Rpgrm. An IDE both gives access to the \Rpgrm console in one window and provides a text editor for writing scripts in another window. Of the available IDEs for \Rpgrm, \RStudio is currently the most popular by a wide margin.

\subsubsection{Using R interactively}

A physical terminal (keyboard plus text-only screen) decades ago was how users communicated with computers, and was frequently called a \emph{console}\index{console}. Nowadays, a text-only interface to a computer, in most cases a window or a pane within a graphical user interface, is still called a console. In our case, the \Rpgrm console (Figure \ref{fig:intro:console}). This is the native user interface of \Rpgrm.

Typing commands at the \Rpgrm console is useful when one is playing around, rather aimlessly exploring things, or trying to understand how an \Rpgrm function or operator we are not familiar with works. Once we want to keep track of what we are doing, there are better ways of using \Rpgrm, which allow us to keep a record of how an analysis has been carried out. The different ways of using \Rpgrm are not exclusive of each other, so most users will use the \Rpgrm console to test individual commands and plot data during the first stages of exploration. As soon as we decide how we want to plot or analyze the data, it is best to start using scripts. This is not enforced in any way by \Rpgrm, but scripts are what really brings to light the most important advantages of using a programming language for data analysis. In Figure \ref{fig:intro:console} we can see how the \Rpgrm console looks. The text in red has been typed in by the user, except for the prompt \code{\textcolor{red}{$>$}}, and the text in blue is what \Rpgrm has displayed in response. It is essentially a dialogue between user and \Rpgrm. The console can \emph{look} different when displayed within an IDE like \RStudio, but the only difference is in the appearance of the text rather than in the text itself (cf.\ Figures \ref{fig:intro:console} and \ref{fig:intro:console:rstudio}).

\begin{figure}
  \centering
  \includegraphics[width=\linewidth]{figures/r-console-rstudio}
  \caption[The R console]{The \Rpgrm console embedded in \RStudio. The same commands have been typed in as in Figure \ref{fig:intro:console}. Commands entered by the user are displayed in purple, while results returned by \Rpgrm are displayed in black.}\label{fig:intro:console:rstudio}
\end{figure}

The two previous figures showed the result of entering a single command. Figure \ref{fig:intro:console:capture} shows how the console looks after the user has entered several commands, each as a separate line of text.

\begin{figure}
  \centering
  \includegraphics[width=\linewidth]{figures/r-console-capture}
  \caption[The R console]{The \Rpgrm console after several commands have been entered. Commands entered by the user are displayed in red, while results returned by \Rpgrm are displayed in blue.}\label{fig:intro:console:capture}
\end{figure}

The examples in this book require only the console window for user input. Menu-driven programs are not necessarily bad, they are just unsuitable when there is a need to set very many options and choose from many different actions. They are also difficult to maintain when extensibility is desired, and when independently developed modules of very different characteristics need to be integrated. Textual languages also have the advantage, to be addressed in later chapters, that command sequences can be stored in human- and computer-readable text files. Such files constitute a record of all the steps used, and in most cases, makes it trivial to reproduce the same steps at a later time. Scripts are a very simple and handy way of communicating to other users how to do a given data analysis.

\begin{explainbox}
In the console one types commands at the \code{>} prompt. When one ends a line by pressing the return or enter key, if the line can be interpreted as an \Rlang command, the result will be printed at the console, followed by a new \code{>} prompt.
If the command is incomplete, a \code{+} continuation prompt will be shown, and you will be able to type in the rest of the command. For example if the whole calculation that you would like to do is $1 + 2 + 3$, if you enter in the console \code{1 + 2 +} in one line, you will get a continuation prompt where you will be able to type \code{3}. However, if you type \code{1 + 2}, the result will be calculated, and printed.
\end{explainbox}

\subsubsection{Using R in a ``batch job''}

To run a script\index{script}\index{batch job} we need first to prepare a script in a text editor. Figure \ref{fig:intro:script} shows the console immediately after running the script file shown in the text editor. As before, red text, the command \code{source("my-script.R")}, was typed by the user, and the blue text in the console is what was displayed by \Rpgrm as a result of this action. The title bar of the console, shows ``R-console,'' while the title bar of the editor shows the \emph{path} to the script file that is open and ready to be edited followed by ``R-editor.''

\begin{figure}
  \centering
  \includegraphics[width=\linewidth]{figures/R-console-script}
  \caption[Script sourced at the R console]{Screen capture of the \Rpgrm console and editor just after running a script. The upper pane shows the \Rpgrm console, and the lower pane, the script file in an editor. }\label{fig:intro:script}
\end{figure}

\begin{warningbox}
When working at the command prompt, most results are printed by default. However, within scripts one needs to use function \Rfunction{print()} explicitly when a result is to be displayed.
\end{warningbox}

A true ``batch job'' is not run at the \Rpgrm console but at the operating system command prompt, or shell. The shell is the console of the operating system---\osname{Linux}, \osname{Unix}, \osname{OS X}, or \osname{MS-Windows}. Figure \ref{fig:intro:shell} shows how running a script at the Windows command prompt looks. A script can be run at the operating system prompt to do time-consuming calculations with the output saved to a file. One may use this approach on a server, say, to leave a large data analysis job running overnight or even for several days.

\begin{figure}
  \centering
  \includegraphics[width=\linewidth]{figures/windows-cmd-script}
  \caption[Script at the Windows cmd promt]{Screen capture of the \osname{MS-Windows} command console just after running the same script. Here we use \code{Rscript} to run the script; the exact syntax will depend on the operating system in use. In this case, \Rpgrm prints the results at the operating system console or shell, rather than in its own \Rpgrm console.}\label{fig:intro:shell}
\end{figure}

\subsubsection{Editors and IDEs}

Integrated Development Environments (IDEs)\index{integrated development environment}\index{IDE|see{integrated development environment}} are used when developing computer programs. IDEs provide a centralized user interface from within which the different tools used to create and test a computer program can be accessed and used in coordination. Most IDEs include a dedicated editor capable of syntax highlighting, and even report some mistakes, related to the programming language in use. One could describe such an editor as the equivalent of a word processor with spelling and grammar checking, that can alert about spelling and syntax errors for a computer language like \Rlang instead of for a natural language like English. In the case of \RStudio, the main, but not only language supported is \Rlang. The main window of IDEs usually displays more than one pane simultaneously. From within the \RStudio IDE, one has access to the \Rpgrm console, a text editor, a file-system browser, a pane for graphical output, and access to several additional tools such as for installing and updating extension packages. Although \RStudio supports very well the development of large scripts and packages, it is currently, in my opinion, also the best possible way of using \Rpgrm at the console as it has the \Rpgrm help system very well integrated both in the editor and \Rlang console. Figure \ref{fig:intro:rstudio} shows the main window displayed by \RStudio after running the same script as shown above at the \Rpgrm console (Figure \ref{fig:intro:script}) and at the operating system command prompt (Figure \ref{fig:intro:shell}). We can see by comparing these three figures how \RStudio is really a layer between the user and an unmodified \Rpgrm executable. The script was sourced by pressing the ``Source'' button at the top of the editor pane. \RStudio, in response to this, generated the code needed to source the file and ``entered'' it at the console, the same console, where we would type any \Rpgrm commands.

\begin{figure}
  \centering
  \includegraphics[width=\linewidth]{figures/Rstudio-script}
  \caption[Script in Rstudio]{The \RStudio interface just after running the same script. Here we used the ``Source'' button to run the script. In this case, \Rpgrm prints the results to the \Rpgrm console in the lower left pane.}\label{fig:intro:rstudio}
\end{figure}

When a script is run, if an error is triggered, \RStudio automatically finds the location of the error. \RStudio supports the concept of projects allowing saving of settings per project. Some features are beyond what you need for everyday data analysis and aimed at package development, such as integration of debugging, traceback on errors, profiling and bench marking of code so as to analyze and improve performance. It integrates support for file version control, which is not only useful for package development, but also for keeping track of the progress or collaboration in the analysis of data.

The version of \RStudio that one uses locally, i.e., installed in a computer used locally by a single user, runs with an almost identical user interface on most modern operating systems, such as \osname{Linux}, \osname{Unix}, \osname{OS X}, and \osname{MS-Windows}. There is also a server version that runs on \osname{Linux}, and that can be used remotely through a web browser. The user interface is still the same.

\RStudio is under active development, and constantly improved. Visit \url{http://www.rstudio.org/} for an up-to-date description and download and installation instructions. Two books \autocite{vanderLoo2012,Hillebrand2015} describe and teach how to use \RStudio without going in depth into data analysis or statistics, however, as \RStudio is under very active development, several recently added important features are not described in these books. You will find tutorials and up-to-date cheat sheets at \url{http://www.rstudio.org/}.

\section{Reproducible data analysis}
\index{reproducible data analysis|(}
Reproducible data analysis is much more than a fashionable buzzword. Under any situation where accountability is important, from scientific research to decision making in commercial enterprises, industrial quality control and safety and environmental impact assessments, being able to reproduce a data analysis reaching the same conclusions from the same data is crucial. Most approaches to reproducible data analysis are based on automating report generation and including, as part of the report, all the computer commands used to generate the results presented.

A fundamental requirement for reproducibility is a reliable record of what commands have been run on which data. Such a record is especially difficult to keep when issuing commands through menus and dialogue boxes in a graphical user interface or interactively at a console. Even working interactively at the \Rpgrm console using copy and paste to include commands and results in a report is error prone, and laborious.

A further requirement is to be able to match the output of the \Rlang commands to the input. If the script saves the output to separate files, then the user will need to take care that the script saved or shared as a record of the data analysis was the one actually used for obtaining the reported results and conclusions. This is another error-prone stage in the reporting of data analysis. To solve this problem an approach was developed, inspired in what is called \emph{literate programming} \autocite{Knuth1984a}. The idea is that running the script will produce a document that includes the listing of the \Rlang code used, the results of running this code and any explanatory text needed to understand and interpret the analysis.

Although a system capable of producing such reports with \Rlang, called \pkgname{Sweave} \autocite{Leisch2002}, has been available for a couple decades, it was rather limited and not supported by an IDE, making its use rather tedious. A more recently developed system called \pkgname{knitr} \autocite{Xie2013} together with its integration into \RStudio has made the use of this type of reports very easy. The most recent development is what has been called \Rlang \emph{notebooks} produced within \RStudio. This new feature, can produce the readable report of running the script as an HTML file, displaying the code used interspersed with the results within the viewable file as in earlier approaches. However, this newer approach goes even further: the actual source script used to generate the report is embedded in the HTML file of the report and can be extracted and run very easily and consequently re-used. This means that anyone who gets access to the output of the analysis in human readable form also gets access to the code used to generate the report, in computer executable format.

Because of these recent developments, \Rlang is an ideal language to use when the goal of reproducibility is important. During recent years the problem of the lack of reproducibility in scientific research has been broadly discussed and analysed \autocite{Gandrud2015}. One of the problems faced when attempting to reproduce experimental work, is reproducing the data analysis. \Rlang together with these modern tools can help in avoiding this source of lack of reproducibility.

How powerful are these tools and how flexible? They are powerful and flexible enough to write whole books, such as this very book you are now reading, produced with \Rpgrm, \pkgname{knitr} and \LaTeX\index{Latex@{\LaTeX}}\index{languages!Latex@{\LaTeX}}. All pages in the book are generated directly, all figures are generated by \Rpgrm and included automatically, except for the figures in this chapter that have been manually captured from the computer screen. Why am I using this approach? First because I want to make sure that every bit of code as you will see printed, runs without error. In addition, I want to make sure that the output that you will see below every line or chunk of \Rlang language code is exactly what \Rpgrm returns. Furthermore, it saves a lot of work for me as author, as I can just update \Rpgrm and all the packages used to their latest version, and build the book again, to keep it up to date and free of errors.

Although the use of these tools is important, they are outside the scope of this book and well described in other books \autocite{Gandrud2015,Xie2013}. Still when writing code, using a consistent style for formatting and indentation, carefully choosing variable names, and adding textual explanations in comments when needed, helps very much with readability for humans. I have tried to be as consistent as possible throughout the whole book in this respect, with only small personal deviations from the usual style.
\index{reproducible data analysis|)}

\section{Finding additional information}

When searching for answers, asking for advice or reading books, you will be confronted with different ways of approaching the same tasks. Do not allow this to overwhelm you; in most cases it will not matter as many computations can be done in \Rpgrm, as in any language, in several different ways, still obtaining the same result. The different approaches may differ mainly in two aspects: 1) how readable to humans are the instructions given to the computer as part of a script or program, and 2) how fast the code runs. Unless computation time is an important bottleneck in your work, just concentrate on writing code that is easy to understand to you and to others, and consequently easy to check and reuse. Of course, do always check any code you write for mistakes, preferably using actual numerical test cases for any complex calculation or even relatively simple scripts. Testing and validation are extremely important steps in data analysis, so get into this habit while reading this book. Testing how every function works, as I will challenge you to do in this book, is at the core of any robust data analysis or computing programming.

\begin{warningbox}
Error messages tend to be terse in \Rpgrm, and may require some lateral thinking and/or ``experimentation'' to understand the real cause behind problems. When you are not sure you understand how some command works, it is useful in many cases to try simple examples for which you know the correct answer and see if you can reproduce them with \Rpgrm. Because of this, this book includes some code examples that trigger errors. Learning to interpret error messages is part of what is needed to become a proficient user of \Rlang. To test your understanding of how a code statement or function works, it is good to try your hand at testing its limits, testing which variations of a piece code are valid or not.
\end{warningbox}

\subsection{R's built-in help}

To\index{R@{\Rlang}!help} access help pages through the command prompt we use function \Rfunction{help()} or a question mark. Every object exported by an \Rlang package (functions, methods, classes, data) is documented. Sometimes a single help page documents several \Rlang objects. Usually at the end of the help pages, some examples are given, which tend to help very much in learning how to use the functions described. For example, one can search for a help page at the \Rpgrm console.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{help}\hlstd{(}\hlstr{"sum"}\hlstd{)}
\hlopt{?}\hlstd{sum}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{playground}
Look at help for some other functions like \code{mean()}, \code{var()}, \code{plot()} and, why not, \Rfunction{help()} itself!

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{help}\hlstd{(help)}
\end{alltt}
\end{kframe}
\end{knitrout}
\end{playground}

When using \RStudio there are easier ways of navigating to a help page than using function \Rfunction{help()}, for example, with the cursor on the name of a function in the editor or console, pressing the F1 key opens the corresponding help page in the help pane. Letting the cursor hover for a few seconds over the name of a function at the \Rpgrm console will open ``bubble help'' for it. If the function is defined in a script or another file that is open in the editor pane, one can directly navigate from the line where the function is called to where it is defined. In \RStudio one can also search for help through the graphical interface.

In addition to help pages, \Rpgrm's distribution includes useful manuals as PDF or HTML files. These can be accessed most easily through the Help menu in \RStudio or \pgrmname{RGUI}. Extension packages provide help pages for the functions and data they export. When a package is loaded into an \Rpgrm session, its help pages are added to the native help of \Rpgrm. In addition to these individual help pages, each package provides an index of its corresponding help pages for users to browse. Many packages, contain \emph{vignettes} such as User Guides or articles describing the algorithms used.

There are some web sites that give access to \Rlang documentation through a web server. These sites can be very convenient when exploring whether a certain package could be useful for a certain problem, as they allow browsing and searching the documentation without need of installing the packages. Some package maintainers have web sites with additional documentation for their own packages. The DESCRIPTION or README of packages provide contact information for the maintainer, links to web sites, and instructions on how to report bugs. As packages are contributed by independent authors, they should be cited in addition to citing \Rpgrm itself. \Rlang function \Rfunction{citation()} when called with the name of a package as its argument provides the reference that should be cited for the package, and without an explicit argument, the reference to cite for the version of \Rlang in use.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{citation}\hlstd{()}
\end{alltt}
\begin{verbatim}
##
## To cite R in publications use:
##
##   R Core Team (2020). R: A language and environment for statistical
##   computing. R Foundation for Statistical Computing, Vienna, Austria.
##   URL https://www.R-project.org/.
##
## A BibTeX entry for LaTeX users is
##
##   @Manual{,
##     title = {R: A Language and Environment for Statistical Computing},
##     author = {{R Core Team}},
##     organization = {R Foundation for Statistical Computing},
##     address = {Vienna, Austria},
##     year = {2020},
##     url = {https://www.R-project.org/},
##   }
##
## We have invested a lot of time and effort in creating R, please cite it
## when using it for data analysis. See also 'citation("pkgname")' for
## citing R packages.
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
  Look at the help page for function \code{citation()} for a discussion of why it is important for users to cite \Rpgrm and packages when using them.
\end{playground}

\subsection{Obtaining help from online forums}\label{sec:intro:net:help}

When consulting help pages, vignettes, and possibly books at hand fails to provide the information needed, the next step to follow is to search internet forums for existing answers to one's questions. When these steps fail to solve a problem, then it is time to ask for help, either from local experts or by posting your own question in a suitable online forum. When posting requests for help, one needs to abide by what is usually described as ``netiquette.''

\subsubsection{Netiquette}
In\index{netiquette}\index{network etiquette} most internet forums, a certain behavior is expected from those asking and answering questions. Some types of misbehavior, like use of offensive or inappropriate language, will usually result in the user losing writing rights in a forum. Occasional minor misbehavior, will usually result in the original question not being answered and instead the problem highlighted in the reply. In general following the steps listed below will greatly increase your chances of getting a detailed and useful answer.

\begin{itemize}
  \item Do your homework: first search for existing answers to your question, both online and in the documentation. (Do mention that you attempted this without success when you post your question.)
  \item Provide a clear explanation of the problem, and all the relevant information. Say if it concerns \Rpgrm, the version, operating system, and any packages loaded and their versions.
  \item If at all possible, provide a simplified and short, but self-contained, code example that reproduces the problem (sometimes called \emph{reprex}).
  \item Be polite.
  \item Contribute to the forum by answering other users' questions when you know the answer.
\end{itemize}

\subsubsection{StackOverflow}

Nowadays, StackOverflow (\url{http://stackoverflow.com/})\index{StackOverflow} is the best question-and-answer (Q\,\&\,A) support site for \Rpgrm. In most cases, searching for existing questions and their answers, will be all that you need to do. If asking a question, make sure that it is really a new question. If there is some question that looks similar, make clear how your question is different.

StackOverflow has a user-rights system based on reputation, and questions and answers can be up- and down-voted. Those with the most up-votes are listed at the top of searches. If the questions or answers you write are up-voted, after you accumulate enough reputation, you acquire badges and rights, such as editing other users' questions and answers or later on, even deleting wrong answers or off-topic questions from the system. This sounds complicated, but works extremely well at ensuring that the base of questions and answers is relevant and correct, without relying on nominated \emph{moderators}. When using StackOverflow, do contribute by accepting correct answers, up-voting questions and answers that you find useful, down-voting those you consider poor, and flagging or correcting errors you may discover.

\subsubsection{Reporting bugs}

Being careful in the preparation of a reproducible example\index{reproducible example}\index{reprex|see{reproducible example}} is especially important when you intend to report a bug to the maintainer of any piece of software. For the problem to be fixed, the person revising the code, needs to be able to reproduce the problem, and after modifying the code, needs to be able to test if the problem has been solved or not. However, even if you are facing a problem caused by your misunderstanding of how \Rlang works, the simpler the example, the more likely that someone will quickly realize what your intention was when writing the code that produces a result different from what you expected.

\begin{explainbox}
How to prepare a reproducible example\index{reproducible example} (``reprex''). A \emph{reprex} is a self-contained and as simple as possible piece of computer code that triggers (and so demonstrates) a problem. If possible, when you need to use data, either use a data set included in base \Rpgrm or generate artificial data within the reprex code. If you can reproduce the problem only with your own data, then you need to provide a minimal subset of it that triggers the problem.

While preparing the \emph{reprex} you will need to simplify the code, and sometimes this step allows you to diagnose the problem. Always, before posting a reprex online, it is wise to check it with the latest versions of \Rpgrm and any package being used.

I would say that about two out of three times I prepare a \emph{reprex}, it allows me to much better understand the problem and find the root of the problem and a solution or a work-around on my own.
\end{explainbox}

\section{What is needed to run the examples in this book?}

The book is written with the expectation that you will run most of the code examples and try as many other variations as needed until you are sure you understand the basic ``rules'' of the \Rpgrm language and how each function or command described works. As mentioned above, you are expected to use this book as a travel guide for your exploration of the world of \Rlang.

\Rpgrm is all that is needed to work through all the examples in this book, but it is not a convenient way of doing this. I recommend that you use an editor or an IDE, in particular \RStudio\index{IDE for R}\index{editor for R scripts}. \RStudio is user friendly, actively maintained, free, open-source and available both in desktop and server versions. The desktop version runs on \osname{MS-Windows}, \osname{Linux}, and \osname{OS X} and other \osname{Unix} distributions.

Of course when choosing which editor to use, personal preferences and previous familiarity play an important role.
Currently, for the development of packages, I use \RStudio exclusively. For writing this book I have used both \RStudio and the text editor \pgrmname{WinEdt} which has support for \Rpgrm together with excellent support for \LaTeX\index{Latex@\LaTeX}. When working on a large project or collaborating with other data analysts or researchers, one big advantage of a system based on plain text files such as \Rlang scripts, is that the same files can be edited with different programs and under different operating systems as needed or wished by the different persons involved in a project.

When I started using \Rpgrm, nearly two decades ago, I was using other editors, using the operating system shell a lot more, and struggling with debugging as no IDE was available. The only reasonably good integration with an editor was for \pgrmname{Emacs}, which was widely available only under \osname{Unix}-like systems. Given my past experience, I encourage you to use an IDE for \Rpgrm. \RStudio is nowadays very popular, but if you do not like it, need a different set of features, such as integration with \pgrmname{ImageJ}, or are already familiar with the \pgrmname{Eclipse} IDE, you may want to try the \pgrmname{Bio7} IDE, available from \url{http://bio7.org}.

The examples in this book make use of several freely available \Rlang extension packages, which can be installed from CRAN. One of them, \pkgname{learnrbook}, also available through CRAN, contains data sets and files specific to this book. The \pkgname{learnrbook} package contains installation instructions and saved lists of the names of all other packages used in the book. Instructions on installing \Rpgrm, \pgrmname{Git}, \RStudio, compilers and other tools are available online. In many cases the IT staff at your employer or school will know how to install them, or they may even be included in the default computer setup. In addition, a web site supporting the book will be available at: \url{http://www.learnr-book.info}.

\section{Further reading}
Suggestions\index{further reading!shell scripts in Unix and Linux} for further reading are dependent on how you plan to use \Rlang. If you envision yourself running batch jobs under \pgrmname{Linux} or \pgrmname{Unix}, you would profit from learning to write shell scripts. Because \pgrmname{bash} is widely used nowadays, \citebooktitle{Newham2005} \autocite{Newham2005} can be recommended. If you aim at writing \Rlang code that is going to be reused, and have some familiarity with \Clang, \Cpplang or \javalang, reading \citetitle{Kernighan1999} \autocite{Kernighan1999} will provide a mostly language-independent view of programming as an activity and help you master the all-important tricks of the trade.


% !Rnw root = appendix.main.Rnw


\chapter{The R language: ``Words'' and ``sentences''}\label{chap:R:as:calc}

\begin{VF}
The desire to economize time and mental effort in arithmetical computations, and to eliminate human liability to error, is probably as old as the science of arithmetic itself.

\VA{Howard Aiken}{\emph{Proposed automatic calculating machine}, 1937; reprinted 1964}\nocite{Aiken1964}
\end{VF}

%\dictum[Howard Aiken, \emph{Proposed automatic calculating machine}, presented to IBM in 1937]{The desire to economize time and mental effort in arithmetical computations, and to eliminate human liability to error, is probably as old as the science of arithmetic itself.}\vskip2ex

\section{Aims of this chapter}

In my experience, for those not familiar with computer programming languages, the best first step in learning the \Rlang language is to use it interactively by typing textual commands at the \emph{console} or command line. This will teach not only the syntax and grammar rules, but also give you a glimpse at the advantages and flexibility of this approach to data analysis.

In the first part of the chapter we will use \Rlang to do everyday calculations that should be so easy and familiar that you will not need to think about the operations themselves. This easy start will give you a chance to focus on learning how to issue textual commands at the command prompt.

Later in the chapter, you will gradually need to focus more on the \Rlang language and its grammar and less on how commands are entered. By the end of the chapter you will be familiar with most of the kinds of ``words'' used in the \Rlang language and you will be able to write simple ``sentences'' in \Rlang.

Along the chapter, I will occasionally show the equivalent of the \Rlang code in mathematical notation. If you are not familiar with the mathematical notation, you can safely ignore it, as long as you understand the \Rlang code.

\section{Natural and computer languages}
\index{languages!natural and computer}
Computer languages have strict rules and interpreters and compilers are unforgiving about errors. They will issue error messages, but in contrast to human readers or listeners, will not guess your intentions and continue. However, computer languages have a much smaller set of words than natural languages, such as English. If you are new to computer programming, understanding the parallels between computer and natural languages may be useful.

One can think of constant values and variables (values stored under a name) as nouns and of operators and functions as verbs. A complete command, or statement, is the equivalent of a natural language sentence: ``a comprehensible utterance.'' The simple statement \code{a + 1} has three components: \code{a}, a variable, \code{+}, an operator and \code{1} a constant. The statement \code{sqrt(4)} has two components, a function \code{sqrt()} and a numerical constant \code{4}. We say that ``to compute $\sqrt{4}$ we \emph{call} \code{sqrt()} with \code{4} as its \emph{argument}.''

In later chapters you will learn how to write compound statements, the equivalent of natural-language paragraphs, and scripts, the equivalent of essays. You will also learn how to define new verbs, user-defined functions and operators, and new nouns, user-defined classes.

\section{Numeric values and arithmetic}
\index{classes and modes!numeric, integer, double|(}\index{numbers and their arithmetic|(}\qRclass{numeric}\index{math operators}\index{math functions}\index{numeric values}\qRoperator{+}\qRoperator{-}\qRoperator{*}\qRoperator{/}
When working in \Rlang with arithmetic expressions, the normal mathematical precedence rules are respected, but parentheses can be used to alter this order. Parentheses can be nested, but in contrast to the usual practice in mathematics, the same parenthesis symbol is used at all nesting levels.

\begin{explainbox}
 Both in mathematics and programming languages \emph{operator precedence rules} determine which subexpressions are evaluated first and which later. Contrary to primitive electronic calculators, \Rlang evaluates numeric expressions containing operators according to the rules of mathematics. In the expression $3 + 2 \times 3$, the product $2 \times 3$ has precedence over the addition, and is evaluated first, yielding as the result of the whole expression, 9. In programming languages, similar rules apply to all operators, even those taking as operands non-numeric values.
\end{explainbox}

It is important to keep in mind that in \Rlang trigonometric functions interpret numeric values representing angles as being expressed in radians.

The equivalent of the math expression\qRfunction{exp()}\qRfunction{sin()}\qRconst{pi}
$$
\frac{3 + e^2}{\sin \pi}
$$
is, in \Rlang, written as follows:

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{(}\hlnum{3} \hlopt{+} \hlkwd{exp}\hlstd{(}\hlnum{2}\hlstd{))} \hlopt{/} \hlkwd{sin}\hlstd{(pi)}
\end{alltt}
\begin{verbatim}
## [1] 8.483588e+16
\end{verbatim}
\end{kframe}
\end{knitrout}

It can be seen above that mathematical constants and functions are part of the \Rlang language. One thing to remember when translating complex fractions as above into \Rlang code, is that in arithmetic expressions the bar of the fraction generates a grouping that alters the normal precedence of operations. In contrast, in an \Rlang expression this grouping must be explicitly signaled with additional parentheses.

If you are in doubt about how precedence rules work, you can add parentheses to make sure the order of computations is the one you intend. Redundant parentheses have no effect.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlnum{1} \hlopt{+} \hlnum{2} \hlopt{*} \hlnum{3}
\end{alltt}
\begin{verbatim}
## [1] 7
\end{verbatim}
\begin{alltt}
\hlnum{1} \hlopt{+} \hlstd{(}\hlnum{2} \hlopt{*} \hlnum{3}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 7
\end{verbatim}
\begin{alltt}
\hlstd{(}\hlnum{1} \hlopt{+} \hlnum{2}\hlstd{)} \hlopt{*} \hlnum{3}
\end{alltt}
\begin{verbatim}
## [1] 9
\end{verbatim}
\end{kframe}
\end{knitrout}

The number of opening (left side) and closing (right side) parentheses must be balanced, and they must be located so that each enclosed term is a valid mathematical expression. For example, while \code{(1 + 2) * 3} is valid, \code{(1 +) 2 * 3} is a syntax error as \code{1 +} is incomplete and cannot be calculated.

\begin{playground}
Here results are not shown. These are examples for you to type at the command prompt. In general you should not skip them, as in many cases, as with the statements highlighted with comments in the code chunk below, they have something to teach or demonstrate. You are strongly encouraged to \emph{play}, in other words, create new variations of the examples and execute them to explore how \Rlang works.\qRfunction{sqrt()}\qRfunction{sin()}\qRfunction{log(), log10(), log2()}\qRfunction{exp()}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlnum{1} \hlopt{+} \hlnum{1}
\hlnum{2} \hlopt{*} \hlnum{2}
\hlnum{2} \hlopt{+} \hlnum{10} \hlopt{/} \hlnum{5}
\hlstd{(}\hlnum{2} \hlopt{+} \hlnum{10}\hlstd{)} \hlopt{/} \hlnum{5}
\hlnum{10}\hlopt{^}\hlnum{2} \hlopt{+} \hlnum{1}
\hlkwd{sqrt}\hlstd{(}\hlnum{9}\hlstd{)}
\hlstd{pi} \hlcom{# whole precision not shown when printing}
\hlkwd{print}\hlstd{(pi,} \hlkwc{digits} \hlstd{=} \hlnum{22}\hlstd{)}
\hlkwd{sin}\hlstd{(pi)} \hlcom{# oops! Read on for explanation.}
\hlkwd{log}\hlstd{(}\hlnum{100}\hlstd{)}
\hlkwd{log10}\hlstd{(}\hlnum{100}\hlstd{)}
\hlkwd{log2}\hlstd{(}\hlnum{8}\hlstd{)}
\hlkwd{exp}\hlstd{(}\hlnum{1}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

\end{playground}

Variables\index{variables}\index{assignment} are used to store values. After we \emph{assign} a value to a variable, we can use the name of the variable in place of the stored value. The ``usual'' assignment operator is \Roperator{<-}. In \Rlang, all names, including variable names, are case sensitive. Variables \code{a} and \code{A} are two different variables. Variable names can be long in \Rlang although it is not a good idea to use very long names. Here I am using very short names, something that is usually also a very bad idea. However, in the examples in this chapter where the stored values have no connection to the real world, simple names emphasize their abstract nature. In the chunk below, \code{a} and \code{b} are arbitrarily chosen variable names; I could have used names like \code{my.variable.a} or \code{outside.temperature} if they had been useful to convey information.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlnum{1}
\hlstd{a} \hlopt{+} \hlnum{1}
\end{alltt}
\begin{verbatim}
## [1] 2
\end{verbatim}
\begin{alltt}
\hlstd{a}
\end{alltt}
\begin{verbatim}
## [1] 1
\end{verbatim}
\begin{alltt}
\hlstd{b} \hlkwb{<-} \hlnum{10}
\hlstd{b} \hlkwb{<-} \hlstd{a} \hlopt{+} \hlstd{b}
\hlstd{b}
\end{alltt}
\begin{verbatim}
## [1] 11
\end{verbatim}
\begin{alltt}
\hlnum{3e-2} \hlopt{*} \hlnum{2.0}
\end{alltt}
\begin{verbatim}
## [1] 0.06
\end{verbatim}
\end{kframe}
\end{knitrout}

Entering the name of a variable \emph{at the R console} implicitly calls function \code{print()} displaying the stored value on the console. The same applies to any other statement entered \emph{at the R console}: \code{print()} is implicitly called with the result of executing the statement as its argument.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a}
\end{alltt}
\begin{verbatim}
## [1] 1
\end{verbatim}
\begin{alltt}
\hlkwd{print}\hlstd{(a)}
\end{alltt}
\begin{verbatim}
## [1] 1
\end{verbatim}
\begin{alltt}
\hlstd{a} \hlopt{+} \hlnum{1}
\end{alltt}
\begin{verbatim}
## [1] 2
\end{verbatim}
\begin{alltt}
\hlkwd{print}\hlstd{(a} \hlopt{+} \hlnum{1}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 2
\end{verbatim}
\end{kframe}
\end{knitrout}
\begin{playground}
There are some syntactically legal statements that are not very frequently used, but you should be aware that they are valid, as they will not trigger error messages, and may surprise you. The most important thing is to write code consistently. The ``backwards'' assignment operator \Roperator{->} and resulting code like \code{1 -> a}\index{assignment!leftwise} are valid but less frequently used. The use of the equals sign (\Roperator{=}) for assignment in place of \Roperator{<-} although valid is discouraged. Chaining\index{assignment!chaining} assignments as in the first statement below can be used to signal to the human reader that \code{a}, \code{b} and \code{c} are being assigned the same value.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlstd{b} \hlkwb{<-} \hlstd{c} \hlkwb{<-} \hlnum{0.0}
\hlstd{a}
\hlstd{b}
\hlstd{c}
\hlnum{1} \hlkwb{->} \hlstd{a}
\hlstd{a}
\hlstd{a} \hlkwb{=} \hlnum{3}
\hlstd{a}
\end{alltt}
\end{kframe}
\end{knitrout}

\end{playground}

\begin{explainbox}
In\index{numeric, integer and double values} \Rlang, all numbers belong to mode \Rclass{numeric} (we will discuss the concepts of \emph{mode} and \emph{class} in section \ref{sec:rlang:mode} on page \pageref{sec:rlang:mode}). We can query if the mode of an object is \Rclass{numeric} with function \Rfunction{is.numeric()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{mode}\hlstd{(}\hlnum{1}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "numeric"
\end{verbatim}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlnum{1}
\hlkwd{is.numeric}\hlstd{(a)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\end{kframe}
\end{knitrout}

Because numbers can be stored in different formats, requiring different amounts of computer memory per value, most computing languages implement several different types of numbers. In most cases \Rpgrm's \Rfunction{numeric()} can be used everywhere that a number is expected. However, in some cases it has advantages to explicitly indicate that we will store or operate on whole numbers, in which case we can use class \Rclass{integer}, with integer constants indicated by a trailing capital ``L,'' as in  \code{32L}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{is.numeric}\hlstd{(}\hlnum{1L}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{is.integer}\hlstd{(}\hlnum{1L}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{is.double}\hlstd{(}\hlnum{1L}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\end{kframe}
\end{knitrout}

Real numbers are a mathematical abstraction, and do not have an exact equivalent in computers. Instead of Real numbers, computers store and operate on numbers that are restricted to a broad but finite range of values and have a finite resolution. They are called, \emph{floats} (or \emph{floating-point} numbers); in \Rlang they go by the name of \Rclass{double} and can be created with the constructor \Rfunction{double()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{is.numeric}\hlstd{(}\hlnum{1}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{is.integer}\hlstd{(}\hlnum{1}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlkwd{is.double}\hlstd{(}\hlnum{1}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\end{kframe}
\end{knitrout}

The name \code{double} originates from the \Clang language, in which there are different types of floats available. With the name \code{double} used to mean ``double-precision floating-point numbers.'' Similarly, the use of \code{L} stems from the \texttt{long} type in \Clang, meaning ``long integer numbers.''
\end{explainbox}

Numeric variables can contain more than one value. Even single numbers are in \Rlang \Rclass{vector}s of length one. We will later see why this is important. As you have seen above, the results of calculations were printed preceded with \code{[1]}. This is the index or position in the vector of the first number (or other value) displayed at the head of the current line.

One can use \Rmethod{c()} ``concatenate'' to create a vector from other vectors, including vectors of length 1, such as the \code{numeric} constants in the statements below.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlnum{3}\hlstd{,} \hlnum{1}\hlstd{,} \hlnum{2}\hlstd{)}
\hlstd{a}
\end{alltt}
\begin{verbatim}
## [1] 3 1 2
\end{verbatim}
\begin{alltt}
\hlstd{b} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlnum{4}\hlstd{,} \hlnum{5}\hlstd{,} \hlnum{0}\hlstd{)}
\hlstd{b}
\end{alltt}
\begin{verbatim}
## [1] 4 5 0
\end{verbatim}
\begin{alltt}
\hlstd{c} \hlkwb{<-} \hlkwd{c}\hlstd{(a, b)}
\hlstd{c}
\end{alltt}
\begin{verbatim}
## [1] 3 1 2 4 5 0
\end{verbatim}
\begin{alltt}
\hlstd{d} \hlkwb{<-} \hlkwd{c}\hlstd{(b, a)}
\hlstd{d}
\end{alltt}
\begin{verbatim}
## [1] 4 5 0 3 1 2
\end{verbatim}
\end{kframe}
\end{knitrout}

Method \code{c()} accepts as arguments two or more vectors and concatenates them, one after another. Quite frequently we may need to insert one vector in the middle of another. For this operation, \code{c()} is not useful by itself. One could use indexing combined with \code{c()}, but this is not needed as \Rlang provides a function capable of directly doing this operation. Although it can be used to ``insert'' values, it is named \code{append()}, and by default, it indeed appends one vector at the end of another.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{append}\hlstd{(a, b)}
\end{alltt}
\begin{verbatim}
## [1] 3 1 2 4 5 0
\end{verbatim}
\end{kframe}
\end{knitrout}

The output above is the same as for \code{c(a, b)}, however, \Rfunction{append()} accepts as an argument an index position after which to ``append'' its second argument. This results in an \emph{insert} operation when the index points at any position different from the end of the vector.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{append}\hlstd{(a,} \hlkwc{values} \hlstd{= b,} \hlkwc{after} \hlstd{=} \hlnum{2L}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 3 1 4 5 0 2
\end{verbatim}
\end{kframe}
\end{knitrout}

Both \code{c()} and \code{append()} can also be used with lists.

\begin{playground}
One can create sequences\index{sequence} using function \Rfunction{seq()} or the operator \Roperator{:}, or repeat values using function \Rfunction{rep()}. In this case, I leave to the reader to work out the rules by running these and his/her own examples, with the help of the documentation, available through \code{help(seq)} and \code{help(rep)}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlopt{-}\hlnum{1}\hlopt{:}\hlnum{5}
\hlstd{a}
\hlstd{b} \hlkwb{<-} \hlnum{5}\hlopt{:-}\hlnum{1}
\hlstd{b}
\hlstd{c} \hlkwb{<-} \hlkwd{seq}\hlstd{(}\hlkwc{from} \hlstd{=} \hlopt{-}\hlnum{1}\hlstd{,} \hlkwc{to} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{by} \hlstd{=} \hlnum{0.1}\hlstd{)}
\hlstd{c}
\hlstd{d} \hlkwb{<-} \hlkwd{rep}\hlstd{(}\hlopt{-}\hlnum{5}\hlstd{,} \hlnum{4}\hlstd{)}
\hlstd{d}
\end{alltt}
\end{kframe}
\end{knitrout}

\end{playground}

Next, something that makes \Rlang different from most other programming languages: vectorized arithmetic\index{vectorized arithmetic}. Operators and functions that are vectorized accept, as arguments, vectors of arbitrary length, in which case the result returned is equivalent to having applied the same function or operator individually to each element of the vector.\label{par:vectorized:numeric}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlopt{+} \hlnum{1} \hlcom{# we add one to vector a defined above}
\end{alltt}
\begin{verbatim}
## [1] 4 2 3
\end{verbatim}
\begin{alltt}
\hlstd{(a} \hlopt{+} \hlnum{1}\hlstd{)} \hlopt{*} \hlnum{2}
\end{alltt}
\begin{verbatim}
## [1] 8 4 6
\end{verbatim}
\begin{alltt}
\hlstd{a} \hlopt{+} \hlstd{b}
\end{alltt}
\begin{verbatim}
## [1] 7 6 2
\end{verbatim}
\begin{alltt}
\hlstd{a} \hlopt{-} \hlstd{a}
\end{alltt}
\begin{verbatim}
## [1] 0 0 0
\end{verbatim}
\end{kframe}
\end{knitrout}

As it can be seen in the first line above, another peculiarity of \Rpgrm, is what is frequently called ``recycling'' of arguments:\index{recycling of arguments} as vector \code{a} is of length 6, but the constant 1 is a vector of length 1, this short constant vector is extended, by recycling its value, into a vector of six ones---i.e., a vector of the same length as the longest vector in the statement, \code{a}.\label{par:recycling:numeric}

Make sure you understand what calculations are taking place in the chunk above, and also the one below.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlkwd{rep}\hlstd{(}\hlnum{1}\hlstd{,} \hlnum{6}\hlstd{)}
\hlstd{a}
\end{alltt}
\begin{verbatim}
## [1] 1 1 1 1 1 1
\end{verbatim}
\begin{alltt}
\hlstd{a} \hlopt{+} \hlnum{1}\hlopt{:}\hlnum{2}
\end{alltt}
\begin{verbatim}
## [1] 2 3 2 3 2 3
\end{verbatim}
\begin{alltt}
\hlstd{a} \hlopt{+} \hlnum{1}\hlopt{:}\hlnum{3}
\end{alltt}
\begin{verbatim}
## [1] 2 3 4 2 3 4
\end{verbatim}
\begin{alltt}
\hlstd{a} \hlopt{+} \hlnum{1}\hlopt{:}\hlnum{4}
\end{alltt}


{\ttfamily\noindent\color{warningcolor}{\#\# Warning in a + 1:4: longer object length is not a multiple of shorter object length}}\begin{verbatim}
## [1] 2 3 4 5 2 3
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{explainbox}
A useful thing to know: a vector can have length zero. Vectors of length zero may seem at first sight quite useless, but in fact they are very useful. They allow the handling of ``no input'' or ``nothing to do'' cases as normal cases, which in the absence of vectors of length zero would require to be treated as special cases. I describe here a useful function, \Rfunction{length()} which returns the length of a vector or list.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{z} \hlkwb{<-} \hlkwd{numeric}\hlstd{(}\hlnum{0}\hlstd{)}
\hlstd{z}
\end{alltt}
\begin{verbatim}
## numeric(0)
\end{verbatim}
\begin{alltt}
\hlkwd{length}\hlstd{(z)}
\end{alltt}
\begin{verbatim}
## [1] 0
\end{verbatim}
\end{kframe}
\end{knitrout}

Vectors and lists of length zero, behave in most cases, as expected---e.g.,  they can be concatenated as shown here.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{length}\hlstd{(}\hlkwd{c}\hlstd{(a,} \hlkwd{numeric}\hlstd{(}\hlnum{0}\hlstd{), b))}
\end{alltt}
\begin{verbatim}
## [1] 9
\end{verbatim}
\begin{alltt}
\hlkwd{length}\hlstd{(}\hlkwd{c}\hlstd{(a, b))}
\end{alltt}
\begin{verbatim}
## [1] 9
\end{verbatim}
\end{kframe}
\end{knitrout}

Many functions, such as \Rlang's maths functions and operators, will accept numeric vectors of length zero as valid input, returning also a vector of length zero, issuing neither a warning nor an error message. In other words, \emph{these are valid operations} in \Rlang.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{log}\hlstd{(}\hlkwd{numeric}\hlstd{(}\hlnum{0}\hlstd{))}
\end{alltt}
\begin{verbatim}
## numeric(0)
\end{verbatim}
\begin{alltt}
\hlnum{5} \hlopt{+} \hlkwd{numeric}\hlstd{(}\hlnum{0}\hlstd{)}
\end{alltt}
\begin{verbatim}
## numeric(0)
\end{verbatim}
\end{kframe}
\end{knitrout}

Even when of length zero, vectors do have to belong to a class acceptable for the operation.

\end{explainbox}

It\index{removing objects}\index{deleting objects|see {removing objects}} is possible to \emph{remove} variables from the workspace with \Rfunction{rm()}. Function \Rfunction{ls()} returns a \emph{list} of all objects visible in the current environment, or by supplying a \code{pattern} argument, only the objects with names matching the \code{pattern}. The pattern is given as a regular expression, with \verb|[]| enclosing alternative matching characters, \verb|^| and \verb|$|, indicating the extremes of the name (start and end, respectively). For example, \verb|"^z$"| matches only the single character `z' while \verb|"^z"| matches any name starting with `z'. In contrast \verb|"^[zy]$"| matches both `z' and `y' but neither `zy' nor `yz', and \verb|"^[a-z]"| matches any name starting with a lowercase ASCII letter. If you are using \pgrmname{RStudio}, all objects are listed in the Environment pane, and the search box of the panel can be used to find a given object.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ls}\hlstd{(}\hlkwc{pattern}\hlstd{=}\hlstr{"^z$"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "z"
\end{verbatim}
\begin{alltt}
\hlkwd{rm}\hlstd{(z)}
\hlkwd{ls}\hlstd{(}\hlkwc{pattern}\hlstd{=}\hlstr{"^z$"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## character(0)
\end{verbatim}
\end{kframe}
\end{knitrout}

There\index{special values!NA}\index{special values!NaN}\label{par:special:values} are some special values available for numbers. \Rconst{NA} meaning ``not available'' is used for missing values. Calculations can also yield the following values \Rconst{NaN} ``not a number'', \Rconst{Inf} and \Rconst{-Inf} for $\infty$ and $-\infty$. As you will see below, calculations yielding these values do \textbf{not} trigger errors or warnings, as they are arithmetically valid. \Rconst{Inf} and \Rconst{-Inf} are also valid numerical values for input and constants.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlnum{NA}
\hlstd{a}
\end{alltt}
\begin{verbatim}
## [1] NA
\end{verbatim}
\begin{alltt}
\hlopt{-}\hlnum{1} \hlopt{/} \hlnum{0}
\end{alltt}
\begin{verbatim}
## [1] -Inf
\end{verbatim}
\begin{alltt}
\hlnum{1} \hlopt{/} \hlnum{0}
\end{alltt}
\begin{verbatim}
## [1] Inf
\end{verbatim}
\begin{alltt}
\hlnum{Inf} \hlopt{/} \hlnum{Inf}
\end{alltt}
\begin{verbatim}
## [1] NaN
\end{verbatim}
\begin{alltt}
\hlnum{Inf} \hlopt{+} \hlnum{4}
\end{alltt}
\begin{verbatim}
## [1] Inf
\end{verbatim}
\begin{alltt}
\hlstd{b} \hlkwb{<-} \hlopt{-}\hlnum{Inf}
\hlstd{b} \hlopt{* -}\hlnum{1}
\end{alltt}
\begin{verbatim}
## [1] Inf
\end{verbatim}
\end{kframe}
\end{knitrout}

Not available (\Rconst{NA}) values are very important in the analysis of experimental data, as frequently some observations are missing from an otherwise complete data set due to ``accidents'' during the course of an experiment. It is important to understand how to interpret \Rconst{NA}'s. They are simple placeholders for something that is unavailable, in other words, \emph{unknown}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{A} \hlkwb{<-} \hlnum{NA}
\hlstd{A}
\end{alltt}
\begin{verbatim}
## [1] NA
\end{verbatim}
\begin{alltt}
\hlstd{A} \hlopt{+} \hlnum{1}
\end{alltt}
\begin{verbatim}
## [1] NA
\end{verbatim}
\begin{alltt}
\hlstd{A} \hlopt{+} \hlnum{Inf}
\end{alltt}
\begin{verbatim}
## [1] NA
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
\textbf{When to use vectors of length zero, and when \code{NA}s?}\index{zero length objects}\index{vectors!zero length} Make sure you understand the logic behind the different behavior of functions and operators with respect to \code{NA} and \code{numeric()} or its equivalent \code{numeric(0)}. What do they represent? Why \Rconst{NA}s are not ignored, while vectors of length zero are?

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlnum{123} \hlopt{+} \hlkwd{numeric}\hlstd{()}
\hlnum{123} \hlopt{+} \hlnum{NA}
\end{alltt}
\end{kframe}
\end{knitrout}

\emph{Model answer:}
\Rconst{NA} is used to signal a value that ``was lost'' or ``was expected'' but is unavailable because of some accident. A vector of length zero, represents no values, but within the normal expectations. In particular, if vectors are expected to have a certain length, or if index positions along a vector are meaningful, then using \Rconst{NA} is a must.

\end{playground}

Any operation, even tests of equality, involving one or more \Rconst{NA}'s return an \Rconst{NA}. In other words, when one input to a calculation is unknown, the result of the calculation is unknown. This means that a special function is needed for testing for the presence of \code{NA} values.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{is.na}\hlstd{(}\hlkwd{c}\hlstd{(}\hlnum{NA}\hlstd{,} \hlnum{1}\hlstd{))}
\end{alltt}
\begin{verbatim}
## [1]  TRUE FALSE
\end{verbatim}
\end{kframe}
\end{knitrout}

In the example above, we can also see that \Rfunction{is.na()} is vectorized, and that it applies the test to each of the two elements of the vector individually, returning the result as a logical vector of length two.

One thing\index{precision!math operations}\index{numbers!floating point} to be aware of are the consequences of the fact that numbers in computers are almost always stored with finite precision and/or range: the expectations derived from the mathematical definition of Real numbers are not always fulfilled. See the box on page \pageref{box:floats} for an in-depth explanation.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlnum{1} \hlopt{-} \hlnum{1e-20}
\end{alltt}
\begin{verbatim}
## [1] 1
\end{verbatim}
\end{kframe}
\end{knitrout}

When comparing \Rclass{integer}\index{numbers!whole}\index{numbers!integer} values these problems do not exist, as integer arithmetic is not affected by loss of precision in calculations restricted to integers. Because of the way integers are stored in the memory of computers, within the representable range, they are stored exactly. One can think of computer integers as a subset of whole numbers restricted to a certain range of values.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlnum{1L} \hlopt{+} \hlnum{3L}
\end{alltt}
\begin{verbatim}
## [1] 4
\end{verbatim}
\begin{alltt}
\hlnum{1L} \hlopt{*} \hlnum{3L}
\end{alltt}
\begin{verbatim}
## [1] 3
\end{verbatim}
\begin{alltt}
\hlnum{1L} \hlopt{%/%} \hlnum{3L}
\end{alltt}
\begin{verbatim}
## [1] 0
\end{verbatim}
\begin{alltt}
\hlnum{1L} \hlopt{%%} \hlnum{3L}
\end{alltt}
\begin{verbatim}
## [1] 1
\end{verbatim}
\begin{alltt}
\hlnum{1L} \hlopt{/} \hlnum{3L}
\end{alltt}
\begin{verbatim}
## [1] 0.3333333
\end{verbatim}
\end{kframe}
\end{knitrout}

The last statement in the example immediately above, using the ``usual'' division operator yields a floating-point \code{double} result, while the integer division operator \Roperator{\%/\%} yields an \code{integer} result, and \Roperator{\%\%} returns the remainder from the integer division. If as a result of an operation the result falls outside the range of representable values, the returned value is \code{NA}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlnum{1000000L} \hlopt{*} \hlnum{1000000L}
\end{alltt}


{\ttfamily\noindent\color{warningcolor}{\#\# Warning in 1000000L * 1000000L: NAs produced by integer overflow}}\begin{verbatim}
## [1] NA
\end{verbatim}
\end{kframe}
\end{knitrout}

Both doubles and integers are considered numeric. In most situations, conversion is automatic and we do not need to worry about the differences between these two types of numeric values. The next chunk shows returned values that are either \Rconst{TRUE} or \Rconst{FALSE}. These are \code{logical} values that will be discussed in the next section.\index{numbers!double}\index{numbers!integer}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{is.numeric}\hlstd{(}\hlnum{1L}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{is.integer}\hlstd{(}\hlnum{1L}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{is.double}\hlstd{(}\hlnum{1L}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlkwd{is.double}\hlstd{(}\hlnum{1L} \hlopt{/} \hlnum{3L}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{is.numeric}\hlstd{(}\hlnum{1L} \hlopt{/} \hlnum{3L}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{advplayground}
Study the variations of the previous example shown below, and explain why the two statements return different values. Hint: 1 is a \code{double} constant. You can use \code{is.integer()} and \code{is.double()} in your explorations.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlnum{1} \hlopt{*} \hlnum{1000000L} \hlopt{*} \hlnum{1000000L}
\hlnum{1000000L} \hlopt{*} \hlnum{1000000L} \hlopt{*} \hlnum{1}
\end{alltt}
\end{kframe}
\end{knitrout}
\end{advplayground}

Both when displaying numbers or as part of computations, we may want to decrease the number of significant digits or the number of digits after the decimal marker. Be aware that in the examples below, even if printing is being done by default, these functions return \code{numeric} values that are different from their input and can be stored and used in computations. Function \Rfunction{round()} is used to round numbers to a certain number of decimal places after or before the decimal marker, while \Rfunction{signif()} rounds to the requested number of significant digits.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{round}\hlstd{(}\hlnum{0.0124567}\hlstd{,} \hlkwc{digits} \hlstd{=} \hlnum{3}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 0.012
\end{verbatim}
\begin{alltt}
\hlkwd{signif}\hlstd{(}\hlnum{0.0124567}\hlstd{,} \hlkwc{digits} \hlstd{=} \hlnum{3}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 0.0125
\end{verbatim}
\begin{alltt}
\hlkwd{round}\hlstd{(}\hlnum{1789.1234}\hlstd{,} \hlkwc{digits} \hlstd{=} \hlnum{3}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 1789.123
\end{verbatim}
\begin{alltt}
\hlkwd{signif}\hlstd{(}\hlnum{1789.1234}\hlstd{,} \hlkwc{digits} \hlstd{=} \hlnum{3}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 1790
\end{verbatim}
\begin{alltt}
\hlkwd{round}\hlstd{(}\hlnum{1789.1234}\hlstd{,} \hlkwc{digits} \hlstd{=} \hlopt{-}\hlnum{1}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 1790
\end{verbatim}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlnum{0.12345}
\hlstd{b} \hlkwb{<-} \hlkwd{round}\hlstd{(a,} \hlkwc{digits} \hlstd{=} \hlnum{2}\hlstd{)}
\hlstd{a} \hlopt{==} \hlstd{b}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlstd{a} \hlopt{-} \hlstd{b}
\end{alltt}
\begin{verbatim}
## [1] 0.00345
\end{verbatim}
\begin{alltt}
\hlstd{b}
\end{alltt}
\begin{verbatim}
## [1] 0.12
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{explainbox}
Being \code{digits}, the second parameter of these functions, the argument can also be passed by position. However, code is usually easier to understand for humans when parameter names are made explicit.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{round}\hlstd{(}\hlnum{0.0124567}\hlstd{,} \hlkwc{digits} \hlstd{=} \hlnum{3}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 0.012
\end{verbatim}
\begin{alltt}
\hlkwd{round}\hlstd{(}\hlnum{0.0124567}\hlstd{,} \hlnum{3}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 0.012
\end{verbatim}
\end{kframe}
\end{knitrout}
\end{explainbox}

Functions \Rfunction{trunc()} and \Rfunction{ceiling()} return the non-fractional part of a numeric value as a new numeric value. They differ in how they handle negative values, and neither of them rounds the returned value to the nearest whole number.

\begin{playground}
What does value truncation mean? Function \Rfunction{trunc()} truncates a numeric value, but it does not return an \code{integer}.
\begin{itemize}
  \item Explore how \Rfunction{trunc()} and \Rfunction{ceiling()} differ. Test them both with positive and negative values.
  \item \textbf{Advanced} Use function \Rfunction{abs()} and operators \Roperator{+} and \Roperator{-} to reproduce the output of \Rfunction{trunc()} and \Rfunction{ceiling()} for the different inputs.
  \item Can \Rfunction{trunc()} and \Rfunction{ceiling()} be considered type conversion functions in \Rlang?
\end{itemize}
\end{playground}

\index{classes and modes!numeric, integer, double|)}\index{numbers and their arithmetic|)}

\section{Logical values and Boolean algebra}\label{sec:calc:boolean}
\index{classes and modes!logical|(}\index{logical operators}\index{logical values and their algebra|(}\index{Boolean arithmetic}
What in Mathematics are usually called Boolean values, are called \Rclass{logical} values in \Rlang. They can have only two values \code{TRUE} and \code{FALSE}, in addition to \code{NA} (not available). They are vectors as all other atomic types in \Rlang (by \emph{atomic} we mean that each value is not composed of ``parts''). There are also logical operators that allow Boolean algebra. In the chunk below we operate on \Rclass{logical} vectors of length one.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlnum{TRUE}
\hlstd{b} \hlkwb{<-} \hlnum{FALSE}
\hlkwd{mode}\hlstd{(a)}
\end{alltt}
\begin{verbatim}
## [1] "logical"
\end{verbatim}
\begin{alltt}
\hlstd{a}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlopt{!}\hlstd{a} \hlcom{# negation}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlstd{a} \hlopt{&&} \hlstd{b} \hlcom{# logical AND}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlstd{a} \hlopt{||} \hlstd{b} \hlcom{# logical OR}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{xor}\hlstd{(a, b)} \hlcom{# exclusive OR}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\end{kframe}
\end{knitrout}

%%%% index operators using verb!!
As with arithmetic operators, vectorization is available with \emph{some} logical operators. The availability of two kinds of logical operators is one of the most troublesome aspects of the \Rlang language for beginners. Pairs of ``equivalent'' logical operators behave differently, use similar syntax and use similar symbols! The vectorized operators have single-character names \Roperator{\&} and \Roperator{\textbar}, while the non-vectorized ones have double-character names \Roperator{\&\&} and \Roperator{\textbar\textbar}. There is only one version of the negation operator \Roperator{!} that is vectorized. In some, but not all cases, a warning will indicate that there is a possible problem.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlnum{TRUE}\hlstd{,}\hlnum{FALSE}\hlstd{)}
\hlstd{b} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlnum{TRUE}\hlstd{,}\hlnum{TRUE}\hlstd{)}
\hlstd{a}
\end{alltt}
\begin{verbatim}
## [1]  TRUE FALSE
\end{verbatim}
\begin{alltt}
\hlstd{b}
\end{alltt}
\begin{verbatim}
## [1] TRUE TRUE
\end{verbatim}
\begin{alltt}
\hlstd{a} \hlopt{&} \hlstd{b} \hlcom{# vectorized AND}
\end{alltt}
\begin{verbatim}
## [1]  TRUE FALSE
\end{verbatim}
\begin{alltt}
\hlstd{a} \hlopt{|} \hlstd{b} \hlcom{# vectorized OR}
\end{alltt}
\begin{verbatim}
## [1] TRUE TRUE
\end{verbatim}
\begin{alltt}
\hlstd{a} \hlopt{&&} \hlstd{b} \hlcom{# not vectorized}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlstd{a} \hlopt{||} \hlstd{b} \hlcom{# not vectorized}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\end{kframe}
\end{knitrout}

Functions \Rfunction{any()} and \Rfunction{all()} take zero or more logical vectors as their arguments, and return a single logical value ``summarizing'' the logical values in the vectors. Function \Rfunction{all()} returns \code{TRUE} only if all values in the vectors passed as arguments are \code{TRUE}, and \Rfunction{any()} returns \code{TRUE} unless all values in the vectors are \code{FALSE}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{any}\hlstd{(a)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{all}\hlstd{(a)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlkwd{any}\hlstd{(a} \hlopt{&} \hlstd{b)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{all}\hlstd{(a} \hlopt{&} \hlstd{b)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\end{kframe}
\end{knitrout}

Another important thing to know about logical operators is that they ``short-cut'' evaluation. If the result is known from the first part of the statement, the rest of the statement is not evaluated. Try to understand what happens when you enter the following commands. Short-cut evaluation is useful, as the first condition can be used as a guard protecting a later condition from being evaluated when it would trigger an error.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlnum{TRUE} \hlopt{||} \hlnum{NA}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlnum{FALSE} \hlopt{||} \hlnum{NA}
\end{alltt}
\begin{verbatim}
## [1] NA
\end{verbatim}
\begin{alltt}
\hlnum{TRUE} \hlopt{&&} \hlnum{NA}
\end{alltt}
\begin{verbatim}
## [1] NA
\end{verbatim}
\begin{alltt}
\hlnum{FALSE} \hlopt{&&} \hlnum{NA}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlnum{TRUE} \hlopt{&&} \hlnum{FALSE} \hlopt{&&} \hlnum{NA}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlnum{TRUE} \hlopt{&&} \hlnum{TRUE} \hlopt{&&} \hlnum{NA}
\end{alltt}
\begin{verbatim}
## [1] NA
\end{verbatim}
\end{kframe}
\end{knitrout}

When using the vectorized operators on vectors of length greater than one, `short-cut' evaluation still applies for the result obtained at each index position.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlopt{&} \hlstd{b} \hlopt{&} \hlnum{NA}
\end{alltt}
\begin{verbatim}
## [1]    NA FALSE
\end{verbatim}
\begin{alltt}
\hlstd{a} \hlopt{&} \hlstd{b} \hlopt{&} \hlkwd{c}\hlstd{(}\hlnum{NA}\hlstd{,} \hlnum{NA}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1]    NA FALSE
\end{verbatim}
\begin{alltt}
\hlstd{a} \hlopt{|} \hlstd{b} \hlopt{|} \hlkwd{c}\hlstd{(}\hlnum{NA}\hlstd{,} \hlnum{NA}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] TRUE TRUE
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
Based on the description of ``recycling'' presented on page \pageref{par:recycling:numeric} for \code{numeric} operators, explore how ``recycling'' works with vectorized logical operators. Create logical vectors of different lengths (including length one) and \emph{play} by writing several code statements with operations on them. To get you started, one example is given below. Execute this example, and then create and run your own, making sure that you understand why the values returned are what they are. Sometimes, you will need to devise several examples or test cases to tease out of \Rlang an understanding of how a certain feature of the language works, so do not give up early, and make use of your imagination!

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{x} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlnum{TRUE}\hlstd{,} \hlnum{FALSE}\hlstd{,} \hlnum{TRUE}\hlstd{,} \hlnum{NA}\hlstd{)}
\hlstd{x} \hlopt{&} \hlnum{FALSE}
\hlstd{x} \hlopt{|} \hlkwd{c}\hlstd{(}\hlnum{TRUE}\hlstd{,} \hlnum{FALSE}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

\end{playground}
\index{logical values and their algebra|)}
\section{Comparison operators and operations}
\index{comparison operators|(}\index{operators!comparison|(}\qRoperator{>}\qRoperator{<}\qRoperator{>=}\qRoperator{<=}\qRoperator{==}\qRoperator{!=}
Comparison operators return vectors of \code{logical} values as results.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlnum{1.2} \hlopt{>} \hlnum{1.0}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlnum{1.2} \hlopt{>=} \hlnum{1.0}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlnum{1.2} \hlopt{==} \hlnum{1.0} \hlcom{# be aware that here we use two = symbols}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlnum{1.2} \hlopt{!=} \hlnum{1.0}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlnum{1.2} \hlopt{<=} \hlnum{1.0}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlnum{1.2} \hlopt{<} \hlnum{1.0}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlnum{20}
\hlstd{a} \hlopt{<} \hlnum{100} \hlopt{&&} \hlstd{a} \hlopt{>} \hlnum{10}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\end{kframe}
\end{knitrout}

These operators can be used on vectors of any length, returning as a result a logical vector as long as the longest operand. In other words, they behave in the same way as the arithmetic operators described on page \pageref{par:vectorized:numeric}: their arguments are recycled when needed. Hint: if you do not know what to expect as a value for the vector returned by \code{1:10}, execute the statement \code{print(a)} after the first code statement below, or, alternatively, \code{1:10} without saving the result to a variable.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlnum{1}\hlopt{:}\hlnum{10}
\hlstd{a} \hlopt{>} \hlnum{5}
\end{alltt}
\begin{verbatim}
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
\end{verbatim}
\begin{alltt}
\hlstd{a} \hlopt{<} \hlnum{5}
\end{alltt}
\begin{verbatim}
##  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
\end{verbatim}
\begin{alltt}
\hlstd{a} \hlopt{==} \hlnum{5}
\end{alltt}
\begin{verbatim}
##  [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
\end{verbatim}
\begin{alltt}
\hlkwd{all}\hlstd{(a} \hlopt{>} \hlnum{5}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlkwd{any}\hlstd{(a} \hlopt{>} \hlnum{5}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlstd{b} \hlkwb{<-} \hlstd{a} \hlopt{>} \hlnum{5}
\hlstd{b}
\end{alltt}
\begin{verbatim}
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{any}\hlstd{(b)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{all}\hlstd{(b)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\end{kframe}
\end{knitrout}

Precedence rules also apply to comparison operators and they can be overridden by means of parentheses.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlopt{>} \hlnum{2} \hlopt{+} \hlnum{3}
\end{alltt}
\begin{verbatim}
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
\end{verbatim}
\begin{alltt}
\hlstd{(a} \hlopt{>} \hlnum{2}\hlstd{)} \hlopt{+} \hlnum{3}
\end{alltt}
\begin{verbatim}
##  [1] 3 3 4 4 4 4 4 4 4 4
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
Use the statement below as a starting point in exploring how precedence works when logical and arithmetic operators are part of the same statement. \emph{Play} with the example by adding parentheses at different positions and based on the returned values, work out the default order of operator precedence used for the evaluation of the example given below.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlnum{1}\hlopt{:}\hlnum{10}
\hlstd{a} \hlopt{>} \hlnum{3} \hlopt{|} \hlstd{a} \hlopt{+} \hlnum{2} \hlopt{<} \hlnum{3}
\end{alltt}
\end{kframe}
\end{knitrout}
\end{playground}

Again, be aware of ``short-cut evaluation''. If the result does not depend on the missing value, then the result, \code{TRUE} or \code{FALSE} is returned. If the presence of the \code{NA} makes the end result unknown, then \code{NA} is returned.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{c} \hlkwb{<-} \hlkwd{c}\hlstd{(a,} \hlnum{NA}\hlstd{)}
\hlstd{c} \hlopt{>} \hlnum{5}
\end{alltt}
\begin{verbatim}
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE    NA
\end{verbatim}
\begin{alltt}
\hlkwd{all}\hlstd{(c} \hlopt{>} \hlnum{5}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlkwd{any}\hlstd{(c} \hlopt{>} \hlnum{5}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{all}\hlstd{(c} \hlopt{<} \hlnum{20}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] NA
\end{verbatim}
\begin{alltt}
\hlkwd{any}\hlstd{(c} \hlopt{>} \hlnum{20}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] NA
\end{verbatim}
\begin{alltt}
\hlkwd{is.na}\hlstd{(a)}
\end{alltt}
\begin{verbatim}
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
\end{verbatim}
\begin{alltt}
\hlkwd{is.na}\hlstd{(c)}
\end{alltt}
\begin{verbatim}
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{any}\hlstd{(}\hlkwd{is.na}\hlstd{(c))}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{all}\hlstd{(}\hlkwd{is.na}\hlstd{(c))}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\end{kframe}
\end{knitrout}

The behavior of many of base-\Rlang's functions when \code{NA}s are present in their input arguments can be modified. \code{TRUE} passed as an argument to parameter \code{na.rm}, results in \code{NA} values being \emph{removed} from the input \textbf{before} the function is applied.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{all}\hlstd{(c} \hlopt{<} \hlnum{20}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] NA
\end{verbatim}
\begin{alltt}
\hlkwd{any}\hlstd{(c} \hlopt{>} \hlnum{20}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] NA
\end{verbatim}
\begin{alltt}
\hlkwd{all}\hlstd{(c} \hlopt{<} \hlnum{20}\hlstd{,} \hlkwc{na.rm}\hlstd{=}\hlnum{TRUE}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{any}\hlstd{(c} \hlopt{>} \hlnum{20}\hlstd{,} \hlkwc{na.rm}\hlstd{=}\hlnum{TRUE}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{explainbox}
\label{box:floats} \label{par:float}
\index{floating point numbers!arithmetic|(}\index{machine arithmetic!precision|(}
\index{floats|see{floating point numbers}}\index{machine arithmetic!rounding errors}\index{Real numbers and computers}
\index{EPS ($\epsilon$)|see{machine arithmetic precision}}%
Here I give some examples for which the finite resolution of computer machine floats, as compared to Real numbers as defined in mathematics, can cause serious problems. In \Rpgrm, numbers that are not integers are stored as \emph{double-precision floats}. In addition to having limits to the largest and smallest numbers that can be represented, the precision of floats is limited by the number of significant digits that can be stored. Precision is usually described by ``epsilon'' ($\epsilon$), abbreviated \emph{eps}, defined as the largest value of $\epsilon$ for which $1 + \epsilon = 1$. The finite resolution of floats can lead to unexpected results when testing for equality. In the second example below, the result of the subtraction is still exactly 1 due to insufficient resolution.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlnum{0} \hlopt{-} \hlnum{1e-20}
\end{alltt}
\begin{verbatim}
## [1] -1e-20
\end{verbatim}
\begin{alltt}
\hlnum{1} \hlopt{-} \hlnum{1e-20}
\end{alltt}
\begin{verbatim}
## [1] 1
\end{verbatim}
\end{kframe}
\end{knitrout}

The finiteness of floats also affects tests of equality, which is more likely to result in errors with important consequences.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlnum{1e20} \hlopt{==} \hlnum{1} \hlopt{+} \hlnum{1e20}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlnum{1} \hlopt{==} \hlnum{1} \hlopt{+} \hlnum{1e-20}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlnum{0} \hlopt{==} \hlnum{1e-20}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\end{kframe}
\end{knitrout}

As \Rpgrm can run on different types of computer hardware, the actual machine limits for storing numbers in memory may vary depending on the type of processor and even compiler used to build the \Rpgrm program executable. However, it is possible to obtain these values at run time from the variable \code{.Machine}, which is part of the \Rlang language. Please see the help page for \code{.Machine} for a detailed and up-to-date description of the available constants.\qRconst{.Machine\$double.eps}\qRconst{.Machine\$double.neg.eps}\qRconst{.Machine\$double.max}\qRconst{.Machine\$double.min}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{.Machine}\hlopt{$}\hlstd{double.eps}
\end{alltt}
\begin{verbatim}
## [1] 2.220446e-16
\end{verbatim}
\begin{alltt}
\hlstd{.Machine}\hlopt{$}\hlstd{double.neg.eps}
\end{alltt}
\begin{verbatim}
## [1] 1.110223e-16
\end{verbatim}
\begin{alltt}
\hlstd{.Machine}\hlopt{$}\hlstd{double.max}
\end{alltt}
\begin{verbatim}
## [1] 1024
\end{verbatim}
\begin{alltt}
\hlstd{.Machine}\hlopt{$}\hlstd{double.min}
\end{alltt}
\begin{verbatim}
## [1] -1022
\end{verbatim}
\end{kframe}
\end{knitrout}

The last two values refer to the exponents of 10, rather than the maximum and minimum size of numbers that can be handled as objects of class \Rclass{double}. Values outside these limits are stored as \Rconst{-Inf} or \Rconst{Inf} and enter arithmetic as infinite values according the mathematical rules.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlnum{1e1026}
\end{alltt}
\begin{verbatim}
## [1] Inf
\end{verbatim}
\begin{alltt}
\hlnum{1e-1026}
\end{alltt}
\begin{verbatim}
## [1] 0
\end{verbatim}
\begin{alltt}
\hlnum{Inf} \hlopt{+} \hlnum{1}
\end{alltt}
\begin{verbatim}
## [1] Inf
\end{verbatim}
\begin{alltt}
\hlopt{-}\hlnum{Inf} \hlopt{+} \hlnum{1}
\end{alltt}
\begin{verbatim}
## [1] -Inf
\end{verbatim}
\end{kframe}
\end{knitrout}

As \Rclass{integer} values are stored in machine memory without loss of precision, epsilon is not defined for \Rclass{integer} values.\qRconst{.Machine\$integer.max}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{.Machine}\hlopt{$}\hlstd{integer.max}
\end{alltt}
\begin{verbatim}
## [1] 2147483647
\end{verbatim}
\begin{alltt}
\hlnum{2147483699L}
\end{alltt}
\begin{verbatim}
## [1] 2147483699
\end{verbatim}
\end{kframe}
\end{knitrout}

In those statements in the chunk below where at least one operand is \Rclass{double} the \Rclass{integer} operands are \emph{promoted} to \Rclass{double} before computation. A similar promotion does not take place when operations are among \Rclass{integer} values, resulting in \emph{overflow}\index{arithmetic overflow}\index{overflow|see{arithmetic overflow}}, meaning numbers that are too big to be represented as \Rclass{integer} values.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlnum{2147483600L} \hlopt{+} \hlnum{99L}
\end{alltt}


{\ttfamily\noindent\color{warningcolor}{\#\# Warning in 2147483600L + 99L: NAs produced by integer overflow}}\begin{verbatim}
## [1] NA
\end{verbatim}
\begin{alltt}
\hlnum{2147483600L} \hlopt{+} \hlnum{99}
\end{alltt}
\begin{verbatim}
## [1] 2147483699
\end{verbatim}
\begin{alltt}
\hlnum{2147483600L} \hlopt{*} \hlnum{2147483600L}
\end{alltt}


{\ttfamily\noindent\color{warningcolor}{\#\# Warning in 2147483600L * 2147483600L: NAs produced by integer overflow}}\begin{verbatim}
## [1] NA
\end{verbatim}
\begin{alltt}
\hlnum{2147483600L} \hlopt{*} \hlnum{2147483600}
\end{alltt}
\begin{verbatim}
## [1] 4.611686e+18
\end{verbatim}
\end{kframe}
\end{knitrout}

We see next that the exponentiation operator \Roperator{\^{}} forces the promotion\index{type promotion}\index{arithmetic overflow!type promotion} of its arguments to \Rclass{double}, resulting in no overflow. In contrast, as seen above, the multiplication operator \Roperator{*} operates on integers resulting in overflow.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlnum{2147483600L} \hlopt{*} \hlnum{2147483600L}
\end{alltt}


{\ttfamily\noindent\color{warningcolor}{\#\# Warning in 2147483600L * 2147483600L: NAs produced by integer overflow}}\begin{verbatim}
## [1] NA
\end{verbatim}
\begin{alltt}
\hlnum{2147483600L}\hlopt{^}\hlnum{2L}
\end{alltt}
\begin{verbatim}
## [1] 4.611686e+18
\end{verbatim}
\end{kframe}
\end{knitrout}
\index{floating point numbers!arithmetic|)}\index{machine arithmetic!precision|)}
\end{explainbox}

\begin{warningbox}
\index{comparison of floating point numbers|(}\index{inequality and equality tests|(}\index{loss of numeric precision}In many situations, when writing programs one should avoid testing for equality of floating point numbers (`floats'). Here we show how to gracefully handle rounding errors. As the example shows, rounding errors may accumulate, and in practice \verb|.Machine$double.eps| is not always a good value to safely use in tests for ``zero,'' and a larger value may be needed. Whenever possible according to the logic of the calculations, it is best to test for inequalities, for example using \verb|x <= 1.0| instead of \verb|x == 1.0|. If this is not possible, then the tests should be done replacing tests like \verb|x == 1.0| with \verb|abs(x - 1.0) < eps|. Function \Rfunction{abs()} returns the absolute value, in simpler words, makes all values positive or zero, by changing the sign of negative values, or in mathematical notation $|x| = |-x|$.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlopt{==} \hlnum{0.0} \hlcom{# may not always work}
\end{alltt}
\begin{verbatim}
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
\end{verbatim}
\begin{alltt}
\hlkwd{abs}\hlstd{(a)} \hlopt{<} \hlnum{1e-15} \hlcom{# is safer}
\end{alltt}
\begin{verbatim}
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{sin}\hlstd{(pi)} \hlopt{==} \hlnum{0.0} \hlcom{# angle in radians, not degrees!}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlkwd{sin}\hlstd{(}\hlnum{2} \hlopt{*} \hlstd{pi)} \hlopt{==} \hlnum{0.0}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlkwd{abs}\hlstd{(}\hlkwd{sin}\hlstd{(pi))} \hlopt{<} \hlnum{1e-15}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{abs}\hlstd{(}\hlkwd{sin}\hlstd{(}\hlnum{2} \hlopt{*} \hlstd{pi))} \hlopt{<} \hlnum{1e-15}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{sin}\hlstd{(pi)}
\end{alltt}
\begin{verbatim}
## [1] 1.224606e-16
\end{verbatim}
\begin{alltt}
\hlkwd{sin}\hlstd{(}\hlnum{2} \hlopt{*} \hlstd{pi)}
\end{alltt}
\begin{verbatim}
## [1] -2.449213e-16
\end{verbatim}
\end{kframe}
\end{knitrout}

\index{comparison of floating point numbers|)}\index{inequality and equality tests|)}
\end{warningbox}

\index{comparison operators|)}\index{operators!comparison|)}
\index{classes and modes!logical|)}

\section{Sets and set operations}
\index{sets|(}\index{algebra of sets}\index{operators!set|(}

The \Rlang language supports set operations on vectors. They can be useful in many different contexts when manipulating and comparing vectors of values. In Bioinformatics it is usual, for example, to have character vectors of gene tags. We may have a vector for each of a set of different samples, and need to compare them. However, we start by using a more mundane example, everyday shopping.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{fruits} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlstr{"apple"}\hlstd{,} \hlstr{"pear"}\hlstd{,} \hlstr{"orange"}\hlstd{,} \hlstr{"lemon"}\hlstd{,} \hlstr{"tangerine"}\hlstd{)}
\hlstd{bakery} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlstr{"bread"}\hlstd{,} \hlstr{"buns"}\hlstd{,} \hlstr{"cake"}\hlstd{,} \hlstr{"cookies"}\hlstd{)}
\hlstd{dairy} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlstr{"milk"}\hlstd{,} \hlstr{"butter"}\hlstd{,} \hlstr{"cheese"}\hlstd{)}
\hlstd{shopping} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlstr{"bread"}\hlstd{,} \hlstr{"butter"}\hlstd{,} \hlstr{"apple"}\hlstd{,} \hlstr{"cheese"}\hlstd{,} \hlstr{"orange"}\hlstd{)}
\hlkwd{intersect}\hlstd{(fruits, shopping)}
\end{alltt}
\begin{verbatim}
## [1] "apple"  "orange"
\end{verbatim}
\begin{alltt}
\hlkwd{intersect}\hlstd{(bakery, shopping)}
\end{alltt}
\begin{verbatim}
## [1] "bread"
\end{verbatim}
\begin{alltt}
\hlkwd{intersect}\hlstd{(dairy, shopping)}
\end{alltt}
\begin{verbatim}
## [1] "butter" "cheese"
\end{verbatim}
\begin{alltt}
\hlstr{"lemon"} \hlopt{%in%} \hlstd{dairy}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlstr{"lemon"} \hlopt{%in%} \hlstd{fruits}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{setdiff}\hlstd{(}\hlkwd{union}\hlstd{(bakery, dairy), shopping)}
\end{alltt}
\begin{verbatim}
## [1] "buns"    "cake"    "cookies" "milk"
\end{verbatim}
\end{kframe}
\end{knitrout}

We continue next with abstract (symbolic) examples.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.set} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlstr{"a"}\hlstd{,} \hlstr{"b"}\hlstd{,} \hlstr{"c"}\hlstd{,} \hlstr{"b"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

To test if a given value belongs to a set, we use operator \Roperator{\%in\%}. In the algebra of sets notation, this is written $a \in A$, where $A$ is a set and $a$ a member. The second statement shows that the \code{\%in\%} operator is vectorized on its left-hand-side (lhs) operand, returning a logical vector.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstr{"a"} \hlopt{%in%} \hlstd{my.set}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{c}\hlstd{(}\hlstr{"a"}\hlstd{,} \hlstr{"a"}\hlstd{,} \hlstr{"z"}\hlstd{)} \hlopt{%in%} \hlstd{my.set}
\end{alltt}
\begin{verbatim}
## [1]  TRUE  TRUE FALSE
\end{verbatim}
\end{kframe}
\end{knitrout}

The negation of inclusion is $a \not\in A$, and coded in \Rlang by applying the negation operator \Roperator{!} to the result of the test done with \Roperator{\%in\%}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlopt{!}\hlstr{"a"} \hlopt{%in%} \hlstd{my.set}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlopt{!}\hlkwd{c}\hlstd{(}\hlstr{"a"}\hlstd{,} \hlstr{"a"}\hlstd{,} \hlstr{"z"}\hlstd{)} \hlopt{%in%} \hlstd{my.set}
\end{alltt}
\begin{verbatim}
## [1] FALSE FALSE  TRUE
\end{verbatim}
\end{kframe}
\end{knitrout}

Although inclusion is a set operation, it is also very useful for the simplification of \code{if()\ldots else} statements by replacing multiple tests for alternative constant values of the same \code{mode} chained by multiple \Roperator{|} operators.

\begin{playground}
Use operator \Roperator{\%in\%} to simplify the following comparison.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{x} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlstr{"a"}\hlstd{,} \hlstr{"a"}\hlstd{,} \hlstr{"z"}\hlstd{)}
\hlstd{x} \hlopt{==} \hlstr{"a"} \hlopt{|} \hlstd{x} \hlopt{==} \hlstr{"b"} \hlopt{|} \hlstd{x} \hlopt{==} \hlstr{"c"} \hlopt{|} \hlstd{x} \hlopt{==} \hlstr{"d"}
\end{alltt}
\end{kframe}
\end{knitrout}
\end{playground}

With \Rfunction{unique()} we convert a vector of possibly repeated values into a set of unique values. In the algebra of sets, a certain object belongs or not to a set. Consequently, in a set, multiple copies of the same object or value are meaningless.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{unique}\hlstd{(my.set)}
\end{alltt}
\begin{verbatim}
## [1] "a" "b" "c"
\end{verbatim}
\begin{alltt}
\hlkwd{c}\hlstd{(}\hlstr{"a"}\hlstd{,} \hlstr{"a"}\hlstd{,} \hlstr{"z"}\hlstd{)} \hlopt{%in%} \hlkwd{unique}\hlstd{(my.set)}
\end{alltt}
\begin{verbatim}
## [1]  TRUE  TRUE FALSE
\end{verbatim}
\end{kframe}
\end{knitrout}

In the notation used in algebra of sets, the set union operator is $\cup$ while the intersection operator is $\cap$. If we have sets $A$ and $B$, their union is given by $A \cup B$---in the next three examples, \code{c("a", "a", "z")} is a constant, while \code{my.set} is a variable.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{union}\hlstd{(}\hlkwd{c}\hlstd{(}\hlstr{"a"}\hlstd{,} \hlstr{"a"}\hlstd{,} \hlstr{"z"}\hlstd{), my.set)}
\end{alltt}
\begin{verbatim}
## [1] "a" "z" "b" "c"
\end{verbatim}
\end{kframe}
\end{knitrout}

If we have sets $A$ and $B$, their intersection is given by $A \cap B$.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{intersect}\hlstd{(}\hlkwd{c}\hlstd{(}\hlstr{"a"}\hlstd{,} \hlstr{"a"}\hlstd{,} \hlstr{"z"}\hlstd{), my.set)}
\end{alltt}
\begin{verbatim}
## [1] "a"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
What do you expect to be the difference between the values returned by the three statements in the code chunk below? Before running them, write down your expectations about the value each one will return. Only then run the code. Independently of whether your predictions were correct or not, write down an explanation of what each statement's operation is.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{union}\hlstd{(}\hlkwd{c}\hlstd{(}\hlstr{"a"}\hlstd{,} \hlstr{"a"}\hlstd{,} \hlstr{"z"}\hlstd{), my.set)}
\hlkwd{c}\hlstd{(}\hlkwd{c}\hlstd{(}\hlstr{"a"}\hlstd{,} \hlstr{"a"}\hlstd{,} \hlstr{"z"}\hlstd{), my.set)}
\hlkwd{c}\hlstd{(}\hlstr{"a"}\hlstd{,} \hlstr{"a"}\hlstd{,} \hlstr{"z"}\hlstd{, my.set)}
\end{alltt}
\end{kframe}
\end{knitrout}

In the algebra of sets notation $A \subseteq B$, where $A$ and $B$ are sets, indicates that $A$ is a subset or equal to $B$. For a true subset, the notation is $A \subset B$. The operators with the reverse direction are $\supseteq$ and $\supset$. Implement these four operations in four \Rlang statements, and test them on sets (represented by \Rlang vectors) with different ``overlap'' among set members.

\end{playground}

\begin{explainbox}
All set algebra examples above use character vectors and character constants. This is just the most frequent use case. Sets operations are valid on vectors of any atomic class, including \code{integer}, and computed values can be part of statements. In the second and third statements in the next chunk, we need to use additional parentheses to alter the default order of precedence between arithmetic and set operators.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlnum{9L} \hlopt{%in%} \hlnum{2L}\hlopt{:}\hlnum{4L}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlnum{9L} \hlopt{%in%} \hlstd{((}\hlnum{2L}\hlopt{:}\hlnum{4L}\hlstd{)} \hlopt{*} \hlstd{(}\hlnum{2L}\hlopt{:}\hlnum{4L}\hlstd{))}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{c}\hlstd{(}\hlnum{1L}\hlstd{,} \hlnum{16L}\hlstd{)} \hlopt{%in%} \hlstd{((}\hlnum{2L}\hlopt{:}\hlnum{4L}\hlstd{)} \hlopt{*} \hlstd{(}\hlnum{2L}\hlopt{:}\hlnum{4L}\hlstd{))}
\end{alltt}
\begin{verbatim}
## [1] FALSE  TRUE
\end{verbatim}
\end{kframe}
\end{knitrout}

\emph{Empty sets} are an important component of the algebra of sets, in \Rlang they are represented as vectors of zero length. Vectors and lists of zero length, which the \Rlang language fully supports, can be used to ``encode'' emptiness also in other contexts. These vectors do belong to a class such as \Rclass{numeric} or \Rclass{character} and must be compatible with other operands in an expression. By default, constructors for vectors, construct empty vectors.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{length}\hlstd{(}\hlkwd{integer}\hlstd{())}
\end{alltt}
\begin{verbatim}
## [1] 0
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlnum{1L} \hlopt{%in%} \hlkwd{integer}\hlstd{()}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlkwd{setdiff}\hlstd{(}\hlnum{1L}\hlopt{:}\hlnum{4L}\hlstd{,} \hlkwd{union}\hlstd{(}\hlnum{1L}\hlopt{:}\hlnum{4L}\hlstd{,} \hlkwd{integer}\hlstd{()))}
\end{alltt}
\begin{verbatim}
## integer(0)
\end{verbatim}
\end{kframe}
\end{knitrout}

Although set operators are defined for \Rclass{numeric} vectors, rounding errors in `floats' can result in unexpected results (see section \ref{box:floats} on page \pageref{box:floats}). The next two examples do, however, return the correct answers.\qRoperator{\%in\%}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlnum{9} \hlopt{%in%} \hlstd{(}\hlnum{2}\hlopt{:}\hlnum{4}\hlstd{)}\hlopt{^}\hlnum{2}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{c}\hlstd{(}\hlnum{1}\hlstd{,} \hlnum{5}\hlstd{)} \hlopt{%in%} \hlstd{(}\hlnum{1}\hlopt{:}\hlnum{10}\hlstd{)}\hlopt{^}\hlnum{2}
\end{alltt}
\begin{verbatim}
## [1]  TRUE FALSE
\end{verbatim}
\end{kframe}
\end{knitrout}

\end{explainbox}
\index{operators!set|)}
\index{sets|)}

\section{Character values}\label{sec:calc:character}
\index{character strings}\index{classes and modes!character|(}\qRclass{character}
Character variables can be used to store any character. Character constants are written by enclosing characters in quotes. There are three types of quotes in the ASCII character set, double quotes \code{"}, single quotes \code{'}, and back ticks \code{`}. The first two types of quotes can be used as delimiters of \code{character} constants.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlstr{"A"}
\hlstd{a}
\end{alltt}
\begin{verbatim}
## [1] "A"
\end{verbatim}
\begin{alltt}
\hlstd{b} \hlkwb{<-} \hlstr{'A'}
\hlstd{b}
\end{alltt}
\begin{verbatim}
## [1] "A"
\end{verbatim}
\begin{alltt}
\hlstd{a} \hlopt{==} \hlstd{b}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{explainbox}
In many computer languages, vectors of characters are distinct from vectors of character strings. In these languages, character vectors store at each index position a single character, while vectors of character strings store at each index position strings of characters of various lengths, such as words or sentences. If you are familiar with \Clang or \Cpplang, you need to keep in mind that \Clang's \code{char} and \Rlang's \code{character} are not equivalent and that in \Rlang, \code{character} vectors are vectors of character strings. In contrast to these other languages, in \Rlang there is no predefined class for vectors of individual characters and character constants enclosed in double or single quotes are not different.
\end{explainbox}

Concatenating character vectors of length one does not yield a longer character string, it yields instead a longer vector.
\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlstr{'A'}
\hlstd{b} \hlkwb{<-} \hlstr{"bcdefg"}
\hlstd{c} \hlkwb{<-} \hlstr{"123"}
\hlstd{d} \hlkwb{<-} \hlkwd{c}\hlstd{(a, b, c)}
\hlstd{d}
\end{alltt}
\begin{verbatim}
## [1] "A"      "bcdefg" "123"
\end{verbatim}
\end{kframe}
\end{knitrout}

Having two different delimiters available makes it possible to choose the type of quotes used as delimiters so that other quotes can be included in a string.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlstr{"He said 'hello' when he came in"}
\hlstd{a}
\end{alltt}
\begin{verbatim}
## [1] "He said 'hello' when he came in"
\end{verbatim}
\begin{alltt}
\hlstd{b} \hlkwb{<-} \hlstr{'He said "hello" when he came in'}
\hlstd{b}
\end{alltt}
\begin{verbatim}
## [1] "He said \"hello\" when he came in"
\end{verbatim}
\end{kframe}
\end{knitrout}

The\index{character string delimiters} outer quotes are not part of the string, they are ``delimiters'' used to mark the boundaries. As you can see when \code{b} is printed special characters can be represented using ``escape sequences''. There are several of them, and here we will show just four, new line (\verb|\n|) and tab (\verb|\t|), \verb|\"| the escape code for a quotation mark within a string and \verb|\\| the escape code for a single backslash \verb|\|. We also show here the different behavior of \Rfunction{print()} and \Rfunction{cat()}, with \Rfunction{cat()} \emph{interpreting} the escape sequences and \Rfunction{print()} displaying them as entered.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{c} \hlkwb{<-} \hlstr{"abc\textbackslash{}ndef\textbackslash{}tx\textbackslash{}"yz\textbackslash{}"\textbackslash{}\textbackslash{}\textbackslash{}tm"}
\hlkwd{print}\hlstd{(c)}
\end{alltt}
\begin{verbatim}
## [1] "abc\ndef\tx\"yz\"\\\tm"
\end{verbatim}
\begin{alltt}
\hlkwd{cat}\hlstd{(c)}
\end{alltt}
\begin{verbatim}
## abc
## def	x"yz"\	m
\end{verbatim}
\end{kframe}
\end{knitrout}

The \textit{escape codes}\index{character escape codes} work only in some contexts, as when using \Rfunction{cat()} to generate the output. For example, the new-line escape (\verb|\n|) can be embedded in strings used for axis-label, title or label in a plot to split them over two or more lines.
\index{classes and modes!character|)}

\section{The `mode' and `class' of objects}\label{sec:rlang:mode}
\index{objects!mode}
Variables have a \emph{mode} that depends on what is stored in them. But different from other languages, assignment to a variable of a different mode is allowed and in most cases its mode changes together with its contents. However, there is a restriction that all elements in a vector, array or matrix, must be of the same mode. While this is not required for lists, which can be heterogenous. In practice this means that we can assign an object, such as a vector, with a different \code{mode} to a name already in use, but we cannot use indexing to assign an object of a different mode to individual members of a vector, matrix or array. Functions with names starting with \code{is.} are tests returning a logical value, \code{TRUE}, \code{FALSE} or \code{NA}. Function \Rfunction{mode()} returns the mode of an object, as a character string and \Rfunction{typeof()} returns R's internal type or storage mode.\qRfunction{is.character()}\qRfunction{is.numeric()}\qRfunction{is.logical()}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my_var} \hlkwb{<-} \hlnum{1}\hlopt{:}\hlnum{5}
\hlkwd{mode}\hlstd{(my_var)} \hlcom{# no distinction of integer or double}
\end{alltt}
\begin{verbatim}
## [1] "numeric"
\end{verbatim}
\begin{alltt}
\hlkwd{typeof}\hlstd{(my_var)}
\end{alltt}
\begin{verbatim}
## [1] "integer"
\end{verbatim}
\begin{alltt}
\hlkwd{is.numeric}\hlstd{(my_var)} \hlcom{# no distinction of integer or double}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{is.double}\hlstd{(my_var)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlkwd{is.integer}\hlstd{(my_var)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{is.logical}\hlstd{(my_var)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlkwd{is.character}\hlstd{(my_var)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlstd{my_var} \hlkwb{<-} \hlstr{"abc"}
\hlkwd{mode}\hlstd{(my_var)}
\end{alltt}
\begin{verbatim}
## [1] "character"
\end{verbatim}
\end{kframe}
\end{knitrout}

While \emph{mode} is a fundamental property, and limited to those modes defined as part of the \Rlang language, the concept of \emph{class}, is different in that new classes can be defined in user code. In particular, different \Rlang objects of a given mode, such as \code{numeric}, can belong to different \code{class}es. The use of classes for dispatching functions is discussed in section \ref{sec:script:objects:classes:methods} on page \pageref{sec:script:objects:classes:methods}, in relation to object-oriented programming in \Rlang. Method \Rfunction{class()} is used to query the class of an object, and method \Rfunction{inherits()} is used to test if an object belongs to a specific class or not (including ``parent'' classes, to be later described).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{class}\hlstd{(my_var)}
\end{alltt}
\begin{verbatim}
## [1] "character"
\end{verbatim}
\begin{alltt}
\hlkwd{inherits}\hlstd{(my_var,} \hlstr{"character"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{inherits}\hlstd{(my_var,} \hlstr{"numeric"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\end{kframe}
\end{knitrout}

\section{`Type' conversions}\label{sec:calc:type:conversion}
\index{type conversion|(}
The least-intuitive type conversions are those related to logical values. All others are as one would expect. By convention, functions used to convert objects from one mode to a different one have names starting with \code{as.}\footnote{Except for some packages in the \pkgnameNI{tidyverse} that use names starting with \code{as\_} instead of \code{as.}.}.\qRfunction{as.character()}\qRfunction{as.numeric()}\qRfunction{as.logical()}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{as.character}\hlstd{(}\hlnum{1}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "1"
\end{verbatim}
\begin{alltt}
\hlkwd{as.numeric}\hlstd{(}\hlstr{"1"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 1
\end{verbatim}
\begin{alltt}
\hlkwd{as.logical}\hlstd{(}\hlstr{"TRUE"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{as.logical}\hlstd{(}\hlstr{"NA"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] NA
\end{verbatim}
\end{kframe}
\end{knitrout}

Conversion takes place automatically in arithmetic and logical expressions.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlnum{TRUE} \hlopt{+} \hlnum{10}
\end{alltt}
\begin{verbatim}
## [1] 11
\end{verbatim}
\begin{alltt}
\hlnum{1} \hlopt{||} \hlnum{0}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlnum{FALSE} \hlopt{| -}\hlnum{2}\hlopt{:}\hlnum{2}
\end{alltt}
\begin{verbatim}
## [1]  TRUE  TRUE FALSE  TRUE  TRUE
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
There is some flexibility in the conversion from character strings into \code{numeric} and \code{logical} values. Use the examples below plus your own variations to get an idea of what strings are acceptable and correctly converted and which are not. Do also pay attention at the conversion between \code{numeric} and \code{logical} values.\qRfunction{as.character()}\qRfunction{as.numeric()}\qRfunction{as.logical()}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{as.character}\hlstd{(}\hlnum{3.0e10}\hlstd{)}
\hlkwd{as.numeric}\hlstd{(}\hlstr{"5E+5"}\hlstd{)}
\hlkwd{as.numeric}\hlstd{(}\hlstr{"A"}\hlstd{)}
\hlkwd{as.numeric}\hlstd{(}\hlnum{TRUE}\hlstd{)}
\hlkwd{as.numeric}\hlstd{(}\hlnum{FALSE}\hlstd{)}
\hlkwd{as.logical}\hlstd{(}\hlstr{"T"}\hlstd{)}
\hlkwd{as.logical}\hlstd{(}\hlstr{"t"}\hlstd{)}
\hlkwd{as.logical}\hlstd{(}\hlstr{"true"}\hlstd{)}
\hlkwd{as.logical}\hlstd{(}\hlnum{100}\hlstd{)}
\hlkwd{as.logical}\hlstd{(}\hlnum{0}\hlstd{)}
\hlkwd{as.logical}\hlstd{(}\hlopt{-}\hlnum{1}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

\end{playground}

\begin{playground}
Compare the values returned by \Rfunction{trunc()} and \Rfunction{as.integer()} when applied to a floating point number, such as \code{12.34}. Check for the equality of values, and for the \emph{class} of the returned objects.
\end{playground}

\begin{explainbox}
Using conversions, the difference between the length of a \code{character} vector and the number of characters composing each member ``string'' within a vector is obvious.\qRfunction{length()}\qRfunction{as.numeric()}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{f} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlstr{"1"}\hlstd{,} \hlstr{"2"}\hlstd{,} \hlstr{"3"}\hlstd{)}
\hlkwd{length}\hlstd{(f)}
\end{alltt}
\begin{verbatim}
## [1] 3
\end{verbatim}
\begin{alltt}
\hlstd{g} \hlkwb{<-} \hlstr{"123"}
\hlkwd{length}\hlstd{(g)}
\end{alltt}
\begin{verbatim}
## [1] 1
\end{verbatim}
\begin{alltt}
\hlkwd{as.numeric}\hlstd{(f)}
\end{alltt}
\begin{verbatim}
## [1] 1 2 3
\end{verbatim}
\begin{alltt}
\hlkwd{as.numeric}\hlstd{(g)}
\end{alltt}
\begin{verbatim}
## [1] 123
\end{verbatim}
\end{kframe}
\end{knitrout}
\end{explainbox}

\sloppy
Other\index{formatted character strings from numbers} functions relevant to the ``conversion'' of numbers and other values are \Rfunction{format()}, and \Rfunction{sprintf()}. These two functions return \Rclass{character} strings, instead of \code{numeric} or other values, and are useful for printing output. One could think of these functions as advanced conversion functions returning formatted, and possibly combined and annotated, character strings. However, they are usually not considered normal conversion functions, as they are very rarely used in a way that preserves the original precision of the input values. We show here the use of \Rfunction{format()} and \Rfunction{sprintf()} with \code{numeric} values, but they can also be used with values of other modes.

When using \Rfunction{format()}, the format used to display numbers is set by passing arguments to several different parameters. As \Rfunction{print()} calls \Rfunction{format()} to make numbers \emph{pretty} it accepts the same options.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{x} \hlkwb{=} \hlkwd{c}\hlstd{(}\hlnum{123.4567890}\hlstd{,} \hlnum{1.0}\hlstd{)}
\hlkwd{format}\hlstd{(x)} \hlcom{# using defaults}
\end{alltt}
\begin{verbatim}
## [1] "123.4568" "  1.0000"
\end{verbatim}
\begin{alltt}
\hlkwd{format}\hlstd{(x[}\hlnum{1}\hlstd{])} \hlcom{# using defaults}
\end{alltt}
\begin{verbatim}
## [1] "123.4568"
\end{verbatim}
\begin{alltt}
\hlkwd{format}\hlstd{(x[}\hlnum{2}\hlstd{])} \hlcom{# using defaults}
\end{alltt}
\begin{verbatim}
## [1] "1"
\end{verbatim}
\begin{alltt}
\hlkwd{format}\hlstd{(x,} \hlkwc{digits} \hlstd{=} \hlnum{3}\hlstd{,} \hlkwc{nsmall} \hlstd{=} \hlnum{1}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "123.5" "  1.0"
\end{verbatim}
\begin{alltt}
\hlkwd{format}\hlstd{(x[}\hlnum{1}\hlstd{],} \hlkwc{digits} \hlstd{=} \hlnum{3}\hlstd{,} \hlkwc{nsmall} \hlstd{=} \hlnum{1}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "123.5"
\end{verbatim}
\begin{alltt}
\hlkwd{format}\hlstd{(x[}\hlnum{2}\hlstd{],} \hlkwc{digits} \hlstd{=} \hlnum{3}\hlstd{,} \hlkwc{nsmall} \hlstd{=} \hlnum{1}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "1.0"
\end{verbatim}
\begin{alltt}
\hlkwd{format}\hlstd{(x,} \hlkwc{digits} \hlstd{=} \hlnum{3}\hlstd{,} \hlkwc{scientific} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "1.23e+02" "1.00e+00"
\end{verbatim}
\end{kframe}
\end{knitrout}

Function \Rfunction{sprintf()} is similar to \Clang's function of the same name. The user interface is rather unusual, but very powerful, once one learns the syntax. All the formatting is specified using a \code{character} string as template. In this template, placeholders for data and the formatting instructions are embedded using special codes. These codes start with a percent character. We show in the example below the use of some of these: \code{f} is used for \code{numeric} values to be formatted according to a ``fixed point,'' while \code{g} is used when we set the number of significant digits and \code{e} for exponential or \emph{scientific} notation.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{x} \hlkwb{=} \hlkwd{c}\hlstd{(}\hlnum{123.4567890}\hlstd{,} \hlnum{1.0}\hlstd{)}
\hlkwd{sprintf}\hlstd{(}\hlstr{"The numbers are: %4.2f and %.0f"}\hlstd{, x[}\hlnum{1}\hlstd{], x[}\hlnum{2}\hlstd{])}
\end{alltt}
\begin{verbatim}
## [1] "The numbers are: 123.46 and 1"
\end{verbatim}
\begin{alltt}
\hlkwd{sprintf}\hlstd{(}\hlstr{"The numbers are: %.4g and %.2g"}\hlstd{, x[}\hlnum{1}\hlstd{], x[}\hlnum{2}\hlstd{])}
\end{alltt}
\begin{verbatim}
## [1] "The numbers are: 123.5 and 1"
\end{verbatim}
\begin{alltt}
\hlkwd{sprintf}\hlstd{(}\hlstr{"The numbers are: %4.2e and %.0e"}\hlstd{, x[}\hlnum{1}\hlstd{], x[}\hlnum{2}\hlstd{])}
\end{alltt}
\begin{verbatim}
## [1] "The numbers are: 1.23e+02 and 1e+00"
\end{verbatim}
\end{kframe}
\end{knitrout}

In the template \code{"The numbers are: \%4.2f and \%.0f"}, there are two placeholders for \code{numeric} values, \code{\%4.2f} and \code{\%.0f}, so in addition to the template, we pass two values extracted from the first two positions of vector \code{x}. These could have been two different vectors of length one, or even numeric constants. The template itself does not need to be a \code{character} constant as in these examples, as a variable can be also passed as argument.

\begin{playground}
Function \Rfunction{format()} may be easier to use, in some cases, but \Rfunction{sprintf()} is more flexible and powerful. Those with experience in the use of the \Clang language will already know about \Rfunction{sprintf()} and its use of templates for formatting output. Even if you are familiar with  \Clang, look up the help pages for both functions, and practice, by trying to create the same formatted output by means of the two functions. Do also play with these functions with other types of data like \code{integer} and \code{character}.
\end{playground}

\begin{explainbox}
We have above described \Rconst{NA} as a single value ignoring modes, but in reality \Rconst{NA}s come in various flavors. \Rconst{NA\_real\_}, \Rconst{NA\_character\_}, etc. and \Rconst{NA} defaults to an \Rconst{NA} of class \Rclass{logical}. \Rconst{NA} is normally converted on the fly to other modes when needed, so in general \Rconst{NA} is all we need to use.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlnum{1}\hlstd{,} \hlnum{NA}\hlstd{)}
\hlkwd{is.numeric}\hlstd{(a[}\hlnum{2}\hlstd{])}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{is.numeric}\hlstd{(}\hlnum{NA}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlstd{b} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlstr{"abc"}\hlstd{,} \hlnum{NA}\hlstd{)}
\hlkwd{is.character}\hlstd{(b[}\hlnum{2}\hlstd{])}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{is.character}\hlstd{(}\hlnum{NA}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlkwd{class}\hlstd{(}\hlnum{NA}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "logical"
\end{verbatim}
\end{kframe}
\end{knitrout}

Even the statement below works transparently.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a[}\hlnum{3}\hlstd{]} \hlkwb{<-} \hlstd{b[}\hlnum{2}\hlstd{]}
\end{alltt}
\end{kframe}
\end{knitrout}
\end{explainbox}

\index{type conversion|)}

\section{Vector manipulation}\label{sec:vectors}\label{sec:calc:indexing}
\index{vectors!indexing|(}\index{vectors!member extraction}
If you have read earlier sections of this chapter, you already know how to create a vector. \Rlang's vectors are equivalent to what would be written in mathematical notation as $x_{1\ldots n} = a_1, a_2, \ldots, a_i, \ldots, a_n$, they are not the equivalent to the vectors, common in Physics, which are symbolized with an arrow as an ``accent,'' such as $\overrightarrow{\mathbf{F}}$.

In this section we are going to see how to extract or retrieve, replace, and move elements such as $a_2$ from a vector. Elements are extracted using an index enclosed in single square brackets. The index indicates the position in the vector, starting from one, following the usual mathematical tradition. What in maths would be $a_i$ for a vector $a_{1\ldots n}$, in \Rpgrm is represented as \code{a[i]} and the whole vector as earlier seen as \code{a}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlstd{letters[}\hlnum{1}\hlopt{:}\hlnum{10}\hlstd{]}
\hlstd{a}
\end{alltt}
\begin{verbatim}
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
\end{verbatim}
\begin{alltt}
\hlstd{a[}\hlnum{2}\hlstd{]}
\end{alltt}
\begin{verbatim}
## [1] "b"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{explainbox}
Four constant vectors are available in \Rlang: \Rconst{letters}, \Rconst{LETTERS}, \Rconst{month.name} and  \Rconst{month.abb}, of which we used \code{letters} in the example above. These vectors are always for English, irrespective of the locale.
\end{explainbox}

\begin{warningbox}
In \Rlang, indexes always start from one, while in some other programming languages such as \Clang and \Cpplang, indexes start from zero. It is important to be aware of this difference, as many computation algorithms are valid only under a given indexing convention.
\end{warningbox}

It is possible to extract a subset of the elements of a vector in a single operation, using a vector of indexes. The positions of the extracted elements in the result (``returned value'') are determined by the ordering of the members of the vector of indexes---easier to demonstrate than to explain.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a[}\hlkwd{c}\hlstd{(}\hlnum{3}\hlstd{,}\hlnum{2}\hlstd{)]}
\end{alltt}
\begin{verbatim}
## [1] "c" "b"
\end{verbatim}
\begin{alltt}
\hlstd{a[}\hlnum{10}\hlopt{:}\hlnum{1}\hlstd{]}
\end{alltt}
\begin{verbatim}
##  [1] "j" "i" "h" "g" "f" "e" "d" "c" "b" "a"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
The length of the indexing vector is not restricted by the length of the indexed vector. However, only numerical indexes that match positions present in the indexed vector can extract values. Those values in the indexing vector pointing to positions that are not present in the indexed vector, result in \code{NA}s. This is easier to learn by \emph{playing} with \Rlang, than from explanations. Play with \Rlang, using the following examples as a starting point.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{length}\hlstd{(a)}
\hlstd{a[}\hlkwd{c}\hlstd{(}\hlnum{3}\hlstd{,}\hlnum{3}\hlstd{,}\hlnum{3}\hlstd{,}\hlnum{3}\hlstd{)]}
\hlstd{a[}\hlkwd{c}\hlstd{(}\hlnum{10}\hlopt{:}\hlnum{1}\hlstd{,} \hlnum{1}\hlopt{:}\hlnum{10}\hlstd{)]}
\hlstd{a[}\hlkwd{c}\hlstd{(}\hlnum{1}\hlstd{,}\hlnum{11}\hlstd{)]}
\hlstd{a[}\hlnum{11}\hlstd{]}
\end{alltt}
\end{kframe}
\end{knitrout}

Have you tried some of your own examples? If not yet, do \emph{play} with additional variations of your own before continuing.

\end{playground}

Negative indexes have a special meaning; they indicate the positions at which values should be excluded. Be aware that it is \emph{illegal} to mix positive and negative values in the same indexing operation.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a[}\hlopt{-}\hlnum{2}\hlstd{]}
\end{alltt}
\begin{verbatim}
## [1] "a" "c" "d" "e" "f" "g" "h" "i" "j"
\end{verbatim}
\begin{alltt}
\hlstd{a[}\hlopt{-}\hlkwd{c}\hlstd{(}\hlnum{3}\hlstd{,}\hlnum{2}\hlstd{)]}
\end{alltt}
\begin{verbatim}
## [1] "a" "d" "e" "f" "g" "h" "i" "j"
\end{verbatim}
\begin{alltt}
\hlstd{a[}\hlopt{-}\hlnum{3}\hlopt{:-}\hlnum{2}\hlstd{]}
\end{alltt}
\begin{verbatim}
## [1] "a" "d" "e" "f" "g" "h" "i" "j"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{advplayground}
Results from indexing with special values and zero may be surprising. Try to build a rule from the examples below, a rule that will help you remember what to expect next time you are confronted with similar statements using ``subscripts'' which are special values instead of integers larger or equal to one---this is likely to happen sooner or later as these special values can be returned by different \Rlang expressions depending on the value of operands or function arguments, some of them described earlier in this chapter.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a[ ]}
\hlstd{a[}\hlnum{0}\hlstd{]}
\hlstd{a[}\hlkwd{numeric}\hlstd{(}\hlnum{0}\hlstd{)]}
\hlstd{a[}\hlnum{NA}\hlstd{]}
\hlstd{a[}\hlkwd{c}\hlstd{(}\hlnum{1}\hlstd{,} \hlnum{NA}\hlstd{)]}
\hlstd{a[}\hlkwa{NULL}\hlstd{]}
\hlstd{a[}\hlkwd{c}\hlstd{(}\hlnum{1}\hlstd{,} \hlkwa{NULL}\hlstd{)]}
\end{alltt}
\end{kframe}
\end{knitrout}
\end{advplayground}

Another way of indexing, which is very handy, but not available in most other programming languages, is indexing with a vector of \code{logical} values. The \code{logical} vector used for indexing is usually of the same length as the vector from which elements are going to be selected. However, this is not a requirement, because if the \code{logical} vector of indexes is shorter than the indexed vector, it is ``recycled'' as discussed above in relation to other operators.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a[}\hlnum{TRUE}\hlstd{]}
\end{alltt}
\begin{verbatim}
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
\end{verbatim}
\begin{alltt}
\hlstd{a[}\hlnum{FALSE}\hlstd{]}
\end{alltt}
\begin{verbatim}
## character(0)
\end{verbatim}
\begin{alltt}
\hlstd{a[}\hlkwd{c}\hlstd{(}\hlnum{TRUE}\hlstd{,} \hlnum{FALSE}\hlstd{)]}
\end{alltt}
\begin{verbatim}
## [1] "a" "c" "e" "g" "i"
\end{verbatim}
\begin{alltt}
\hlstd{a[}\hlkwd{c}\hlstd{(}\hlnum{FALSE}\hlstd{,} \hlnum{TRUE}\hlstd{)]}
\end{alltt}
\begin{verbatim}
## [1] "b" "d" "f" "h" "j"
\end{verbatim}
\begin{alltt}
\hlstd{a} \hlopt{>} \hlstr{"c"}
\end{alltt}
\begin{verbatim}
##  [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
\end{verbatim}
\begin{alltt}
\hlstd{a[a} \hlopt{>} \hlstr{"c"}\hlstd{]}
\end{alltt}
\begin{verbatim}
## [1] "d" "e" "f" "g" "h" "i" "j"
\end{verbatim}
\end{kframe}
\end{knitrout}

Indexing with logical vectors is very frequently used in \Rlang because comparison operators are vectorized. Comparison operators, when applied to a vector, return a \code{logical} vector, a vector that can be used to extract the elements for which the result of the comparison test was \code{TRUE}.

\begin{playground}
The examples in this text box demonstrate additional uses of logical vectors: 1) the logical vector returned by a vectorized comparison can be stored in a variable, and the variable used as a ``selector'' for extracting a subset of values from the same vector, or from a different vector.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlstd{letters[}\hlnum{1}\hlopt{:}\hlnum{10}\hlstd{]}
\hlstd{b} \hlkwb{<-} \hlnum{1}\hlopt{:}\hlnum{10}
\hlstd{selector} \hlkwb{<-} \hlstd{a} \hlopt{>} \hlstr{"c"}
\hlstd{selector}
\hlstd{a[selector]}
\hlstd{b[selector]}
\end{alltt}
\end{kframe}
\end{knitrout}

Numerical indexes can be obtained from a logical vector by means of function \code{which()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{indexes} \hlkwb{<-} \hlkwd{which}\hlstd{(a} \hlopt{>} \hlstr{"c"}\hlstd{)}
\hlstd{indexes}
\hlstd{a[indexes]}
\hlstd{b[indexes]}
\end{alltt}
\end{kframe}
\end{knitrout}

Make sure to understand the examples above. These constructs are very widely used in \Rlang because they allow for concise code that is easy to understand once you are familiar with the indexing rules. However, if you do not command these rules, many of these terse statements will be unintelligible to you.
\end{playground}

Indexing can be used on either side of an assignment expression. In the chunk below, we use the extraction operator on the left-hand side of the assignments to replace values only at selected positions in the vector. This may look rather esoteric at first sight, but it is just a simple extension of the logic of indexing described above. It works, because the low precedence of the \Roperator{<-} operator results in both the left-hand side and the right-hand side being fully evaluated before the assignment takes place. To make the changes to the vectors easier to follow, we use identical vectors with different names for each of these examples.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlnum{1}\hlopt{:}\hlnum{10}
\hlstd{a}
\end{alltt}
\begin{verbatim}
##  [1]  1  2  3  4  5  6  7  8  9 10
\end{verbatim}
\begin{alltt}
\hlstd{a[}\hlnum{1}\hlstd{]} \hlkwb{<-} \hlnum{99}
\hlstd{a}
\end{alltt}
\begin{verbatim}
##  [1] 99  2  3  4  5  6  7  8  9 10
\end{verbatim}
\begin{alltt}
\hlstd{b} \hlkwb{<-} \hlnum{1}\hlopt{:}\hlnum{10}
\hlstd{b[}\hlkwd{c}\hlstd{(}\hlnum{2}\hlstd{,}\hlnum{4}\hlstd{)]} \hlkwb{<-} \hlopt{-}\hlnum{99} \hlcom{# recycling}
\hlstd{b}
\end{alltt}
\begin{verbatim}
##  [1]   1 -99   3 -99   5   6   7   8   9  10
\end{verbatim}
\begin{alltt}
\hlstd{c} \hlkwb{<-} \hlnum{1}\hlopt{:}\hlnum{10}
\hlstd{c[}\hlkwd{c}\hlstd{(}\hlnum{2}\hlstd{,}\hlnum{4}\hlstd{)]} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlopt{-}\hlnum{99}\hlstd{,} \hlnum{99}\hlstd{)}
\hlstd{c}
\end{alltt}
\begin{verbatim}
##  [1]   1 -99   3  99   5   6   7   8   9  10
\end{verbatim}
\begin{alltt}
\hlstd{d} \hlkwb{<-} \hlnum{1}\hlopt{:}\hlnum{10}
\hlstd{d[}\hlnum{TRUE}\hlstd{]} \hlkwb{<-} \hlnum{1} \hlcom{# recycling}
\hlstd{d}
\end{alltt}
\begin{verbatim}
##  [1] 1 1 1 1 1 1 1 1 1 1
\end{verbatim}
\begin{alltt}
\hlstd{e} \hlkwb{<-} \hlnum{1}\hlopt{:}\hlnum{10}
\hlstd{e} \hlkwb{<-} \hlnum{1}  \hlcom{# no recycling}
\hlstd{e}
\end{alltt}
\begin{verbatim}
## [1] 1
\end{verbatim}
\end{kframe}
\end{knitrout}

We can also use subscripting on both sides of the assignment operator, for example, to swap two elements.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlstd{letters[}\hlnum{1}\hlopt{:}\hlnum{10}\hlstd{]}
\hlstd{a[}\hlnum{1}\hlopt{:}\hlnum{2}\hlstd{]} \hlkwb{<-} \hlstd{a[}\hlnum{2}\hlopt{:}\hlnum{1}\hlstd{]}
\hlstd{a}
\end{alltt}
\begin{verbatim}
##  [1] "b" "a" "c" "d" "e" "f" "g" "h" "i" "j"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
Do play with subscripts to your heart's content, really grasping how they work and how they can be used, will be very useful in anything you do in the future with \Rlang. Even the contrived example below follows the same simple rules, just study it bit by bit. Hint: the second statement in the chunk below, modifies \code{a}, so, when studying variations of this example you will need to recreate \code{a} by executing the first statement, each time you run a variation of the second statement.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlstd{letters[}\hlnum{1}\hlopt{:}\hlnum{10}\hlstd{]}
\hlstd{a[}\hlnum{5}\hlopt{:}\hlnum{1}\hlstd{]} \hlkwb{<-} \hlstd{a[}\hlkwd{c}\hlstd{(}\hlnum{TRUE}\hlstd{,}\hlnum{FALSE}\hlstd{)]}
\hlstd{a}
\end{alltt}
\begin{verbatim}
##  [1] "i" "g" "e" "c" "a" "f" "g" "h" "i" "j"
\end{verbatim}
\end{kframe}
\end{knitrout}

\end{playground}

\begin{explainbox}\label{box:vec:sort}
In \Rlang, indexing with positional indexes can be done with \Rclass{integer} or \Rclass{numeric} values. Numeric values can be floats, but for indexing, only integer values are meaningful. Consequently, \Rclass{double} values are converted into \code{integer} values when used as indexes. The conversion is done invisibly, but it does slow down computations slightly. When working on big data sets, explicitly using \code{integer} values can improve performance.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{b} \hlkwb{<-} \hlstd{LETTERS[}\hlnum{1}\hlopt{:}\hlnum{10}\hlstd{]}
\hlstd{b[}\hlnum{1}\hlstd{]}
\end{alltt}
\begin{verbatim}
## [1] "A"
\end{verbatim}
\begin{alltt}
\hlstd{b[}\hlnum{1.1}\hlstd{]}
\end{alltt}
\begin{verbatim}
## [1] "A"
\end{verbatim}
\begin{alltt}
\hlstd{b[}\hlnum{1.9999}\hlstd{]} \hlcom{# surprise!!}
\end{alltt}
\begin{verbatim}
## [1] "A"
\end{verbatim}
\begin{alltt}
\hlstd{b[}\hlnum{2}\hlstd{]}
\end{alltt}
\begin{verbatim}
## [1] "B"
\end{verbatim}
\end{kframe}
\end{knitrout}

From this experiment, we can learn that if positive indexes are not whole numbers, they are truncated to the next smaller integer.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{b} \hlkwb{<-} \hlstd{LETTERS[}\hlnum{1}\hlopt{:}\hlnum{10}\hlstd{]}
\hlstd{b[}\hlopt{-}\hlnum{1}\hlstd{]}
\end{alltt}
\begin{verbatim}
## [1] "B" "C" "D" "E" "F" "G" "H" "I" "J"
\end{verbatim}
\begin{alltt}
\hlstd{b[}\hlopt{-}\hlnum{1.1}\hlstd{]}
\end{alltt}
\begin{verbatim}
## [1] "B" "C" "D" "E" "F" "G" "H" "I" "J"
\end{verbatim}
\begin{alltt}
\hlstd{b[}\hlopt{-}\hlnum{1.9999}\hlstd{]}
\end{alltt}
\begin{verbatim}
## [1] "B" "C" "D" "E" "F" "G" "H" "I" "J"
\end{verbatim}
\begin{alltt}
\hlstd{b[}\hlopt{-}\hlnum{2}\hlstd{]}
\end{alltt}
\begin{verbatim}
## [1] "A" "C" "D" "E" "F" "G" "H" "I" "J"
\end{verbatim}
\end{kframe}
\end{knitrout}

From this experiment, we can learn that if negative indexes are not whole numbers, they are truncated to the next larger (less negative) integer. In conclusion, \code{double} index values behave as if they where sanitized using function \code{trunc()}.

This example also shows how one can tease out of \Rlang its rules through experimentation.

\end{explainbox}

A\index{vectors!sorting} frequent operation on vectors is sorting them into an increasing or decreasing order. The most direct approach is to use \Rfunction{sort()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.vector} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlnum{10}\hlstd{,} \hlnum{4}\hlstd{,} \hlnum{22}\hlstd{,} \hlnum{1}\hlstd{,} \hlnum{4}\hlstd{)}
\hlkwd{sort}\hlstd{(my.vector)}
\end{alltt}
\begin{verbatim}
## [1]  1  4  4 10 22
\end{verbatim}
\begin{alltt}
\hlkwd{sort}\hlstd{(my.vector,} \hlkwc{decreasing} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 22 10  4  4  1
\end{verbatim}
\end{kframe}
\end{knitrout}

An indirect way of sorting a vector, possibly based on a different vector, is to generate with \Rfunction{order()} a vector of numerical indexes that can be used to achieve the ordering.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{order}\hlstd{(my.vector)}
\end{alltt}
\begin{verbatim}
## [1] 4 2 5 1 3
\end{verbatim}
\begin{alltt}
\hlstd{my.vector[}\hlkwd{order}\hlstd{(my.vector)]}
\end{alltt}
\begin{verbatim}
## [1]  1  4  4 10 22
\end{verbatim}
\begin{alltt}
\hlstd{another.vector} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlstr{"ab"}\hlstd{,} \hlstr{"aa"}\hlstd{,} \hlstr{"c"}\hlstd{,} \hlstr{"zy"}\hlstd{,} \hlstr{"e"}\hlstd{)}
\hlstd{another.vector[}\hlkwd{order}\hlstd{(my.vector)]}
\end{alltt}
\begin{verbatim}
## [1] "zy" "aa" "e"  "ab" "c"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{explainbox}
A problem linked to sorting that we may face is counting how many copies of each value are present in a vector. We need to use two functions \Rfunction{sort()} and \Rfunction{rle()}\index{vector!run length encoding}. The second of these functions computes \emph{run length} as used in \emph{run length encoding} for which \emph{rle} is an abbreviation. A \emph{run} is a series of consecutive identical values. As the objective is to count the number of copies of each value present, we need first to sort the vector.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.letters} \hlkwb{<-} \hlstd{letters[}\hlkwd{c}\hlstd{(}\hlnum{1}\hlstd{,}\hlnum{5}\hlstd{,}\hlnum{10}\hlstd{,}\hlnum{3}\hlstd{,}\hlnum{1}\hlstd{,}\hlnum{4}\hlstd{,}\hlnum{21}\hlstd{,}\hlnum{1}\hlstd{,}\hlnum{10}\hlstd{)]}
\hlstd{my.letters}
\end{alltt}
\begin{verbatim}
## [1] "a" "e" "j" "c" "a" "d" "u" "a" "j"
\end{verbatim}
\begin{alltt}
\hlkwd{sort}\hlstd{(my.letters)}
\end{alltt}
\begin{verbatim}
## [1] "a" "a" "a" "c" "d" "e" "j" "j" "u"
\end{verbatim}
\begin{alltt}
\hlkwd{rle}\hlstd{(}\hlkwd{sort}\hlstd{(my.letters))}
\end{alltt}
\begin{verbatim}
## Run Length Encoding
##   lengths: int [1:6] 3 1 1 1 2 1
##   values : chr [1:6] "a" "c" "d" "e" "j" "u"
\end{verbatim}
\end{kframe}
\end{knitrout}

The second and third statements are only to demonstrate the effect of each step. The last statement uses nested function calls to compute the number of copies of each value in the vector.
\end{explainbox}


\index{vectors!indexing|)}

\section{Matrices and multidimensional arrays}\label{sec:matrix:array}
\index{matrices|(}\index{arrays|(}\qRclass{matrix}\qRclass{array}

Vectors have a single dimension, and, as we saw above, we can query their length with method \Rfunction{length()}. Matrices have two dimensions, which can be queried with \Rfunction{dim()}, \Rfunction{ncol()} and \Rfunction{nrow()}. \Rlang arrays can have any number of dimensions, even a single dimension, which can be queried with method \Rfunction{dim()}. As expected \Rfunction{is.vector()}, \Rfunction{is.matrix()} and \Rfunction{is.array()} can be used to query the class.

We can create a new matrix using the \Rfunction{matrix()} or \Rfunction{as.matrix()} constructors. The first argument of \Rfunction{matrix()} is a vector. In the same way as vectors, matrices are homogeneous, all elements are of the same type.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{matrix}\hlstd{(}\hlnum{1}\hlopt{:}\hlnum{15}\hlstd{,} \hlkwc{ncol} \hlstd{=} \hlnum{3}\hlstd{)}
\end{alltt}
\begin{verbatim}
##      [,1] [,2] [,3]
## [1,]    1    6   11
## [2,]    2    7   12
## [3,]    3    8   13
## [4,]    4    9   14
## [5,]    5   10   15
\end{verbatim}
\begin{alltt}
\hlkwd{matrix}\hlstd{(}\hlnum{1}\hlopt{:}\hlnum{15}\hlstd{,} \hlkwc{nrow} \hlstd{=} \hlnum{3}\hlstd{)}
\end{alltt}
\begin{verbatim}
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    4    7   10   13
## [2,]    2    5    8   11   14
## [3,]    3    6    9   12   15
\end{verbatim}
\end{kframe}
\end{knitrout}

When a vector is converted to a matrix, \Rlang's default is to allocate the values in the vector to the matrix starting from the leftmost column, and within the column, down from the top. Once the first column is filled, the process continues from the top of the next column, as can be seen above. This order can be changed as you will discover in the playground below.

\begin{playground}
Check in the help page for the \code{matrix}\qRfunction{matrix()} constructor how to use the \code{byrow} parameter to alter the default order in which the elements of the vector are allocated to columns and rows of the new matrix.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{help}\hlstd{(matrix)}
\end{alltt}
\end{kframe}
\end{knitrout}

While you are looking at the help page, also consider the default number of columns and rows.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{matrix}\hlstd{(}\hlnum{1}\hlopt{:}\hlnum{15}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

And to start getting a sense of how to interpret error and warning messages, run the code below and make sure you understand which problem is being reported. Before executing the statement, analyze it and predict what the returned value will be. Afterwards, compare your prediction, to the value actually returned.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{matrix}\hlstd{(}\hlnum{1}\hlopt{:}\hlnum{15}\hlstd{,} \hlkwc{ncol} \hlstd{=} \hlnum{2}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

\end{playground}

Subscripting of matrices and arrays is consistent with that used for vectors; we only need to supply an indexing vector, or leave a blank space, for each dimension. A matrix has two dimensions, so to access any element or group of elements, we use two indices. The only complication is that there are two possible orders in which, in principle, indexes could be supplied. In \Rlang, indexes for matrices are written ``row first.'' In simpler words, the first index value selects rows, and the second one, columns.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{A} \hlkwb{<-} \hlkwd{matrix}\hlstd{(}\hlnum{1}\hlopt{:}\hlnum{20}\hlstd{,} \hlkwc{ncol} \hlstd{=} \hlnum{4}\hlstd{)}
\hlstd{A}
\end{alltt}
\begin{verbatim}
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20
\end{verbatim}
\begin{alltt}
\hlstd{A[}\hlnum{1}\hlstd{,} \hlnum{1}\hlstd{]}
\end{alltt}
\begin{verbatim}
## [1] 1
\end{verbatim}
\end{kframe}
\end{knitrout}

Remind yourself of how indexing of vectors works in \Rlang (see section \ref{sec:vectors} on page \pageref{sec:vectors}). We will now apply the same rules in two dimensions.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{A[}\hlnum{1}\hlstd{, ]}
\end{alltt}
\begin{verbatim}
## [1]  1  6 11 16
\end{verbatim}
\begin{alltt}
\hlstd{A[ ,} \hlnum{1}\hlstd{]}
\end{alltt}
\begin{verbatim}
## [1] 1 2 3 4 5
\end{verbatim}
\begin{alltt}
\hlstd{A[}\hlnum{2}\hlopt{:}\hlnum{3}\hlstd{,} \hlkwd{c}\hlstd{(}\hlnum{1}\hlstd{,}\hlnum{3}\hlstd{)]}
\end{alltt}
\begin{verbatim}
##      [,1] [,2]
## [1,]    2   12
## [2,]    3   13
\end{verbatim}
\begin{alltt}
\hlstd{A[}\hlnum{3}\hlstd{,} \hlnum{4}\hlstd{]} \hlkwb{<-} \hlnum{99}
\hlstd{A}
\end{alltt}
\begin{verbatim}
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   99
## [4,]    4    9   14   19
## [5,]    5   10   15   20
\end{verbatim}
\begin{alltt}
\hlstd{A[}\hlnum{4}\hlopt{:}\hlnum{3}\hlstd{,} \hlnum{2}\hlopt{:}\hlnum{1}\hlstd{]} \hlkwb{<-} \hlstd{A[}\hlnum{3}\hlopt{:}\hlnum{4}\hlstd{,} \hlnum{1}\hlopt{:}\hlnum{2}\hlstd{]}
\hlstd{A}
\end{alltt}
\begin{verbatim}
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    9    4   13   99
## [4,]    8    3   14   19
## [5,]    5   10   15   20
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{explainbox}
In \Rlang, a \Rclass{matrix} can have a single row, a single column, a single element or no elements. However, in all cases, a \code{matrix} will have \emph{dimensions} of length two defined and stored as an attribute.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.vector} \hlkwb{<-} \hlnum{1}\hlopt{:}\hlnum{6}
\hlkwd{dim}\hlstd{(my.vector)}
\end{alltt}
\begin{verbatim}
## NULL
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{one.col.matrix} \hlkwb{<-} \hlkwd{matrix}\hlstd{(}\hlnum{1}\hlopt{:}\hlnum{6}\hlstd{,} \hlkwc{ncol} \hlstd{=} \hlnum{1}\hlstd{)}
\hlkwd{dim}\hlstd{(one.col.matrix)}
\end{alltt}
\begin{verbatim}
## [1] 6 1
\end{verbatim}
\begin{alltt}
\hlstd{two.col.matrix} \hlkwb{<-} \hlkwd{matrix}\hlstd{(}\hlnum{1}\hlopt{:}\hlnum{6}\hlstd{,} \hlkwc{ncol} \hlstd{=} \hlnum{2}\hlstd{)}
\hlkwd{dim}\hlstd{(two.col.matrix)}
\end{alltt}
\begin{verbatim}
## [1] 3 2
\end{verbatim}
\begin{alltt}
\hlstd{one.elem.matrix} \hlkwb{<-} \hlkwd{matrix}\hlstd{(}\hlnum{1}\hlstd{,} \hlkwc{ncol} \hlstd{=} \hlnum{1}\hlstd{)}
\hlkwd{dim}\hlstd{(one.elem.matrix)}
\end{alltt}
\begin{verbatim}
## [1] 1 1
\end{verbatim}
\begin{alltt}
\hlstd{no.elem.matrix} \hlkwb{<-} \hlkwd{matrix}\hlstd{(}\hlkwd{numeric}\hlstd{(),} \hlkwc{ncol} \hlstd{=} \hlnum{0}\hlstd{)}
\hlkwd{dim}\hlstd{(no.elem.matrix)}
\end{alltt}
\begin{verbatim}
## [1] 0 0
\end{verbatim}
\end{kframe}
\end{knitrout}

\end{explainbox}

Arrays\index{matrix!dimensions}\index{arrays!dimensions} are similar to matrices, but can have more than two dimensions, which are specified with the \code{dim} argument to the \Rfunction{array()} constructor.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{B} \hlkwb{<-} \hlkwd{array}\hlstd{(}\hlnum{1}\hlopt{:}\hlnum{27}\hlstd{,} \hlkwc{dim} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{3}\hlstd{,} \hlnum{3}\hlstd{,} \hlnum{3}\hlstd{))}
\hlstd{B}
\end{alltt}
\begin{verbatim}
## , , 1
##
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
##
## , , 2
##
##      [,1] [,2] [,3]
## [1,]   10   13   16
## [2,]   11   14   17
## [3,]   12   15   18
##
## , , 3
##
##      [,1] [,2] [,3]
## [1,]   19   22   25
## [2,]   20   23   26
## [3,]   21   24   27
\end{verbatim}
\begin{alltt}
\hlstd{B[}\hlnum{2}\hlstd{,} \hlnum{2}\hlstd{,} \hlnum{2}\hlstd{]}
\end{alltt}
\begin{verbatim}
## [1] 14
\end{verbatim}
\end{kframe}
\end{knitrout}

In the chunk above, the length of the supplied vector is the product of the dimensions, $27 = 3 \times 3 \times 3$.

\begin{playground}
  How do you use indexes to extract the second element of the original vector, in each of the following matrices and arrays?

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{v} \hlkwb{<-} \hlnum{1}\hlopt{:}\hlnum{10}
\hlstd{m2c} \hlkwb{<-} \hlkwd{matrix}\hlstd{(v,} \hlkwc{ncol} \hlstd{=} \hlnum{2}\hlstd{)}
\hlstd{m2cr} \hlkwb{<-} \hlkwd{matrix}\hlstd{(v,} \hlkwc{ncol} \hlstd{=} \hlnum{2}\hlstd{,} \hlkwc{byrow} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\hlstd{m2r} \hlkwb{<-} \hlkwd{matrix}\hlstd{(v,} \hlkwc{nrow} \hlstd{=} \hlnum{2}\hlstd{)}
\hlstd{m2rc} \hlkwb{<-} \hlkwd{matrix}\hlstd{(v,} \hlkwc{nrow} \hlstd{=} \hlnum{2}\hlstd{,} \hlkwc{byrow} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{v} \hlkwb{<-} \hlnum{1}\hlopt{:}\hlnum{10}
\hlstd{a2c} \hlkwb{<-} \hlkwd{array}\hlstd{(v,} \hlkwc{dim} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{5}\hlstd{,} \hlnum{2}\hlstd{))}
\hlstd{a2c} \hlkwb{<-} \hlkwd{array}\hlstd{(v,} \hlkwc{dim} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{5}\hlstd{,} \hlnum{2}\hlstd{),} \hlkwc{dimnames} \hlstd{=} \hlkwd{list}\hlstd{(}\hlkwa{NULL}\hlstd{,} \hlkwd{c}\hlstd{(}\hlstr{"c1"}\hlstd{,} \hlstr{"c2"}\hlstd{)))}
\hlstd{a2r} \hlkwb{<-} \hlkwd{array}\hlstd{(v,} \hlkwc{dim} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{2}\hlstd{,} \hlnum{5}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}

Be aware that vectors and one-dimensional arrays are not the same thing, while two-dimensional arrays are matrices.
\begin{enumerate}
      \item Use the different constructors and query methods to explore this, and its consequences.
      \item Convert a matrix into a vector using \Rfunction{unlist()} and \Rfunction{as.vector()} and compare the returned values.
\end{enumerate}

\end{playground}

Operators for matrices are available in \Rlang, as matrices are used in many statistical algorithms. We will not describe them all here, only \Rfunction{t()} and some specializations of arithmetic operators. Function \Rfunction{t()} transposes a matrix, by swapping columns and rows.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{A} \hlkwb{<-} \hlkwd{matrix}\hlstd{(}\hlnum{1}\hlopt{:}\hlnum{20}\hlstd{,} \hlkwc{ncol} \hlstd{=} \hlnum{4}\hlstd{)}
\hlstd{A}
\end{alltt}
\begin{verbatim}
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20
\end{verbatim}
\begin{alltt}
\hlkwd{t}\hlstd{(A)}
\end{alltt}
\begin{verbatim}
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
## [3,]   11   12   13   14   15
## [4,]   16   17   18   19   20
\end{verbatim}
\end{kframe}
\end{knitrout}

As with vectors, recycling applies to arithmetic operators when applied to matrices.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{A} \hlopt{+} \hlnum{2}
\end{alltt}
\begin{verbatim}
##      [,1] [,2] [,3] [,4]
## [1,]    3    8   13   18
## [2,]    4    9   14   19
## [3,]    5   10   15   20
## [4,]    6   11   16   21
## [5,]    7   12   17   22
\end{verbatim}
\begin{alltt}
\hlstd{A} \hlopt{*} \hlnum{0}\hlopt{:}\hlnum{1}
\end{alltt}
\begin{verbatim}
##      [,1] [,2] [,3] [,4]
## [1,]    0    6    0   16
## [2,]    2    0   12    0
## [3,]    0    8    0   18
## [4,]    4    0   14    0
## [5,]    0   10    0   20
\end{verbatim}
\begin{alltt}
\hlstd{A} \hlopt{*} \hlnum{1}\hlopt{:}\hlnum{0}
\end{alltt}
\begin{verbatim}
##      [,1] [,2] [,3] [,4]
## [1,]    1    0   11    0
## [2,]    0    7    0   17
## [3,]    3    0   13    0
## [4,]    0    9    0   19
## [5,]    5    0   15    0
\end{verbatim}
\end{kframe}
\end{knitrout}

In the examples above with the usual multiplication operator \code{*}, the operation described is not a matrix product, but instead, the products between individual elements of the matrix and vectors. Matrix multiplication is indicated by operator \Roperator{\%*\%}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{B} \hlkwb{<-} \hlkwd{matrix}\hlstd{(}\hlnum{1}\hlopt{:}\hlnum{16}\hlstd{,} \hlkwc{ncol} \hlstd{=} \hlnum{4}\hlstd{)}
\hlstd{B} \hlopt{*} \hlstd{B}
\end{alltt}
\begin{verbatim}
##      [,1] [,2] [,3] [,4]
## [1,]    1   25   81  169
## [2,]    4   36  100  196
## [3,]    9   49  121  225
## [4,]   16   64  144  256
\end{verbatim}
\begin{alltt}
\hlstd{B} \hlopt{%*%} \hlstd{B}
\end{alltt}
\begin{verbatim}
##      [,1] [,2] [,3] [,4]
## [1,]   90  202  314  426
## [2,]  100  228  356  484
## [3,]  110  254  398  542
## [4,]  120  280  440  600
\end{verbatim}
\end{kframe}
\end{knitrout}

Other operators and functions for matrix algebra like cross-product (\Rfunction{crossprod()}), extracting or replacing the diagonal (\Rfunction{diag()}) are available in base \Rlang. Packages, including \pkgname{matrixStats}, provide additional functions and operators for matrices.


\index{matrices|)}\index{arrays|)}

\section{Factors}\label{sec:calc:factors}
\index{factors|(}
\index{categorical variables|see{factors}}\qRclass{factor}
Factors are used to indicate categories, most frequently the factors describing the treatments in an experiment, or categories in a survey. They can be created either from numerical or character vectors. The different possible values are called \emph{levels}. Normal factors created with \Rfunction{factor()} are unordered or categorical. \Rlang also supports ordered factors that can be created with function \Rfunction{ordered()}.\index{factors!ordered}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.vector} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlstr{"treated"}\hlstd{,} \hlstr{"treated"}\hlstd{,} \hlstr{"control"}\hlstd{,} \hlstr{"control"}\hlstd{,} \hlstr{"control"}\hlstd{,} \hlstr{"treated"}\hlstd{)}
\hlstd{my.factor} \hlkwb{<-} \hlkwd{factor}\hlstd{(my.vector)}
\hlstd{my.factor}
\end{alltt}
\begin{verbatim}
## [1] treated treated control control control treated
## Levels: control treated
\end{verbatim}
\begin{alltt}
\hlstd{my.factor} \hlkwb{<-} \hlkwd{factor}\hlstd{(}\hlkwc{x} \hlstd{= my.vector,} \hlkwc{levels} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"treated"}\hlstd{,} \hlstr{"control"}\hlstd{))}
\hlstd{my.factor}
\end{alltt}
\begin{verbatim}
## [1] treated treated control control control treated
## Levels: treated control
\end{verbatim}
\end{kframe}
\end{knitrout}

The\index{factors!labels}\index{factors!levels} labels (``names'') of the levels can be set when the factor is created. In this case, when calling \Rfunction{factor()}, parameters \code{levels} and \code{labels} should both be passed a vector as argument, with levels and matching labels in the same position in the two vectors. The argument passed to \code{levels} determines the order of the levels based on their old names or values, and the argument passed to \code{labels} gives new names to the levels.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.vector} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlnum{1}\hlstd{,} \hlnum{1}\hlstd{,} \hlnum{0}\hlstd{,} \hlnum{0}\hlstd{,} \hlnum{0}\hlstd{,} \hlnum{1}\hlstd{)}
\hlstd{my.factor} \hlkwb{<-} \hlkwd{factor}\hlstd{(}\hlkwc{x} \hlstd{= my.vector,} \hlkwc{levels} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{1}\hlstd{,} \hlnum{0}\hlstd{),} \hlkwc{labels} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"treated"}\hlstd{,} \hlstr{"control"}\hlstd{))}
\hlstd{my.factor}
\end{alltt}
\begin{verbatim}
## [1] treated treated control control control treated
## Levels: treated control
\end{verbatim}
\end{kframe}
\end{knitrout}

It is always preferable to use meaningful labels for levels, although it is also possible to use numbers.

In the examples above we passed a numeric vector or a character vector as an argument for parameter \code{x} of function \Rfunction{factor()}. It is also possible to pass a \code{factor} as an argument for parameter \code{x}. We use indexing with a test returning a logical vector to extract all ``controls.'' We use function \Rfunction{levels()} to look at the levels of the factors.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{levels}\hlstd{(my.factor)}
\end{alltt}
\begin{verbatim}
## [1] "treated" "control"
\end{verbatim}
\begin{alltt}
\hlstd{control.factor} \hlkwb{<-} \hlstd{my.factor[my.factor} \hlopt{==} \hlstr{"control"}\hlstd{]}
\hlstd{control.factor}
\end{alltt}
\begin{verbatim}
## [1] control control control
## Levels: treated control
\end{verbatim}
\begin{alltt}
\hlstd{control.factor} \hlkwb{<-} \hlkwd{factor}\hlstd{(control.factor)}
\hlstd{control.factor}
\end{alltt}
\begin{verbatim}
## [1] control control control
## Levels: control
\end{verbatim}
\end{kframe}
\end{knitrout}

It can be seen above that subsetting does not drop unused factor levels, and that \code{factor()} can be used to explicitly drop the unused factor levels.\index{factors!drop unused levels}

\begin{explainbox}
When the pattern of levels is regular, it is possible to use function \Rfunction{gl()} to \emph{generate levels} in a factor. Nowadays, it is more usual to read data into \Rlang from files in which the treatment codes are already available as character strings or numeric values, however, when we need to create a factor within R, \Rfunction{gl()} can save some typing.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{gl}\hlstd{(}\hlnum{2}\hlstd{,} \hlnum{5}\hlstd{,} \hlkwc{labels} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"A"}\hlstd{,} \hlstr{"B"}\hlstd{))}
\end{alltt}
\begin{verbatim}
##  [1] A A A A A B B B B B
## Levels: A B
\end{verbatim}
\end{kframe}
\end{knitrout}
\end{explainbox}

Converting factors into numbers is not intuitive, even in the case where a factor was created from a \code{numeric} vector.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.vector2} \hlkwb{<-} \hlkwd{rep}\hlstd{(}\hlnum{3}\hlopt{:}\hlnum{5}\hlstd{,} \hlnum{4}\hlstd{)}
\hlstd{my.vector2}
\end{alltt}
\begin{verbatim}
##  [1] 3 4 5 3 4 5 3 4 5 3 4 5
\end{verbatim}
\begin{alltt}
\hlstd{my.factor2} \hlkwb{<-} \hlkwd{factor}\hlstd{(my.vector2)}
\hlstd{my.factor2}
\end{alltt}
\begin{verbatim}
##  [1] 3 4 5 3 4 5 3 4 5 3 4 5
## Levels: 3 4 5
\end{verbatim}
\begin{alltt}
\hlkwd{as.numeric}\hlstd{(my.factor2)}
\end{alltt}
\begin{verbatim}
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3
\end{verbatim}
\begin{alltt}
\hlkwd{as.numeric}\hlstd{(}\hlkwd{as.character}\hlstd{(my.factor2))}
\end{alltt}
\begin{verbatim}
##  [1] 3 4 5 3 4 5 3 4 5 3 4 5
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{explainbox}
\textbf{Why is a double conversion needed?}\index{factors!convert to numeric} Internally, factor levels are stored as running integers starting from one, and those are the numbers returned by \Rfunction{as.numeric()} when applied to a factor. The labels of the factor levels are always stored as character strings, even when these characters are digits. In contrast to \Rfunction{as.numeric()}, \Rfunction{as.character()} returns the character labels of the levels. If these character strings represent numbers, they can be converted, in a second step, using \Rfunction{as.numeric()} into the original numeric values.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{class}\hlstd{(my.factor2)}
\end{alltt}
\begin{verbatim}
## [1] "factor"
\end{verbatim}
\begin{alltt}
\hlkwd{mode}\hlstd{(my.factor2)}
\end{alltt}
\begin{verbatim}
## [1] "numeric"
\end{verbatim}
\end{kframe}
\end{knitrout}
\end{explainbox}

\begin{playground}
Create a factor with levels labeled with words. Create another factor with the levels labeled with the same words, but ordered differently. After this convert both factors to numeric vectors using \Rfunction{as.numeric()}. Explain why the two numeric vectors differ or not from each other.
\end{playground}

Factors are very important in \Rlang. In contrast to other statistical software in which the role of a variable is set when defining a model to be fitted or when setting up a test, in \Rlang, models are specified exactly in the same way for ANOVA and regression analysis, both as \emph{linear models}. The type of model that is fitted is decided by whether the explanatory variable is a factor (giving ANOVA) or a numerical variable (giving regression). This makes a lot of sense, because in most cases, considering an explanatory variable as categorical or not, depends on the design of the experiment or survey, and in other words, is a property of the data and the experiment or survey that gave origin to them, rather than of the data analysis.

The order of the levels in a \code{factor} does not affect simple calculations or the values plotted, but it does affect how the output is printed, the order of the levels in the scales of plots, and in some cases the contrasts in significance tests. The default ordering is alphabetical, and is established at the time a factor is created. Consequently, rather frequently the default ordering of levels is not the one needed. As shown above, parameter \code{levels} in the constructor makes it possible to set the order of the levels. It is also possible to change the ordering of an existing factor.

\begin{explainbox}
\textbf{Renaming factor levels.}\index{factors!rename levels} The most direct way is using \Rfunction{levels()<-} as shown below, but it is also possible to use \Rfunction{factor()}. The difference is that \code{factor()} drops the unused levels and \code{levels()} only renames existing levels, all of them by position. (Although we here use \code{character} strings that are only one character long, longer strings can be set as labels in exactly the same way.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.factor1} \hlkwb{<-} \hlkwd{gl}\hlstd{(}\hlnum{4}\hlstd{,} \hlnum{3}\hlstd{,} \hlkwc{labels} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"A"}\hlstd{,} \hlstr{"F"}\hlstd{,} \hlstr{"B"}\hlstd{,} \hlstr{"Z"}\hlstd{))}
\hlstd{my.factor1}
\end{alltt}
\begin{verbatim}
##  [1] A A A F F F B B B Z Z Z
## Levels: A F B Z
\end{verbatim}
\begin{alltt}
\hlkwd{levels}\hlstd{(my.factor1)} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlstr{"a"}\hlstd{,} \hlstr{"b"}\hlstd{,} \hlstr{"c"}\hlstd{,} \hlstr{"d"}\hlstd{)}
\hlstd{my.factor1}
\end{alltt}
\begin{verbatim}
##  [1] a a a b b b c c c d d d
## Levels: a b c d
\end{verbatim}
\end{kframe}
\end{knitrout}

Or more safely by matching names---i.e., order in the list of replacement values is irrelevant.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.factor1} \hlkwb{<-} \hlkwd{gl}\hlstd{(}\hlnum{4}\hlstd{,} \hlnum{3}\hlstd{,} \hlkwc{labels} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"A"}\hlstd{,} \hlstr{"F"}\hlstd{,} \hlstr{"B"}\hlstd{,} \hlstr{"Z"}\hlstd{))}
\hlstd{my.factor1}
\end{alltt}
\begin{verbatim}
##  [1] A A A F F F B B B Z Z Z
## Levels: A F B Z
\end{verbatim}
\begin{alltt}
\hlkwd{levels}\hlstd{(my.factor1)} \hlkwb{<-} \hlkwd{list}\hlstd{(}\hlstr{"a"} \hlstd{=} \hlstr{"A"}\hlstd{,} \hlstr{"d"} \hlstd{=} \hlstr{"Z"}\hlstd{,} \hlstr{"c"} \hlstd{=} \hlstr{"B"}\hlstd{,} \hlstr{"b"} \hlstd{=} \hlstr{"F"}\hlstd{)}
\hlstd{my.factor1}
\end{alltt}
\begin{verbatim}
##  [1] a a a b b b c c c d d d
## Levels: a d c b
\end{verbatim}
\end{kframe}
\end{knitrout}

Or alternatively by position and replacing the labels of only some levels---i.e., rather unsafe.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.factor1} \hlkwb{<-} \hlkwd{gl}\hlstd{(}\hlnum{4}\hlstd{,} \hlnum{3}\hlstd{,} \hlkwc{labels} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"A"}\hlstd{,} \hlstr{"F"}\hlstd{,} \hlstr{"B"}\hlstd{,} \hlstr{"Z"}\hlstd{))}
\hlstd{my.factor1}
\end{alltt}
\begin{verbatim}
##  [1] A A A F F F B B B Z Z Z
## Levels: A F B Z
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{levels}\hlstd{(my.factor1)[}\hlkwd{c}\hlstd{(}\hlnum{1}\hlstd{,} \hlnum{4}\hlstd{)]} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlstr{"a"}\hlstd{,} \hlstr{"d"}\hlstd{)}
\hlstd{my.factor1}
\end{alltt}
\begin{verbatim}
##  [1] a a a F F F B B B d d d
## Levels: a F B d
\end{verbatim}
\end{kframe}
\end{knitrout}

\end{explainbox}

\begin{explainbox}
\textbf{Merging factor levels.}\index{factors!merge levels} We use \Rfunction{factor()} as shown below, setting the same label for the levels we want to merge.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.factor1} \hlkwb{<-} \hlkwd{gl}\hlstd{(}\hlnum{4}\hlstd{,} \hlnum{3}\hlstd{,} \hlkwc{labels} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"A"}\hlstd{,} \hlstr{"F"}\hlstd{,} \hlstr{"B"}\hlstd{,} \hlstr{"Z"}\hlstd{))}
\hlstd{my.factor1}
\end{alltt}
\begin{verbatim}
##  [1] A A A F F F B B B Z Z Z
## Levels: A F B Z
\end{verbatim}
\begin{alltt}
\hlkwd{factor}\hlstd{(my.factor1,}
       \hlkwc{levels} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"A"}\hlstd{,} \hlstr{"B"}\hlstd{,} \hlstr{"F"}\hlstd{,} \hlstr{"Z"}\hlstd{),}
       \hlkwc{labels} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"A"}\hlstd{,} \hlstr{"B"}\hlstd{,} \hlstr{"C"}\hlstd{,} \hlstr{"C"}\hlstd{))}
\end{alltt}
\begin{verbatim}
##  [1] A A A C C C B B B C C C
## Levels: A B C
\end{verbatim}
\end{kframe}
\end{knitrout}
\end{explainbox}

\begin{explainbox}
\textbf{Reordering factor levels.}\index{factors!reorder levels} The simplest approach is to use \Rfunction{factor()} and its \code{levels} parameter. The only complication is that the names of the existing levels and those passed as an argument need to match, and typing mistakes can cause bugs. To avoid the error-prone step, in all examples except the first, we use \Rfunction{levels()} to retrieve the names of the levels from the factor itself.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{levels}\hlstd{(my.factor2)}
\end{alltt}
\begin{verbatim}
## [1] "3" "4" "5"
\end{verbatim}
\begin{alltt}
\hlstd{my.factor2} \hlkwb{<-} \hlkwd{factor}\hlstd{(my.factor2,} \hlkwc{levels} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"5"}\hlstd{,} \hlstr{"3"}\hlstd{,} \hlstr{"4"}\hlstd{))}
\hlkwd{levels}\hlstd{(my.factor2)}
\end{alltt}
\begin{verbatim}
## [1] "5" "3" "4"
\end{verbatim}
\begin{alltt}
\hlstd{my.factor2} \hlkwb{<-} \hlkwd{factor}\hlstd{(my.factor2,} \hlkwc{levels} \hlstd{=} \hlkwd{rev}\hlstd{(}\hlkwd{levels}\hlstd{(my.factor2)))}
\hlkwd{levels}\hlstd{(my.factor2)}
\end{alltt}
\begin{verbatim}
## [1] "4" "3" "5"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.factor2} \hlkwb{<-} \hlkwd{factor}\hlstd{(my.factor2,}
                     \hlkwc{levels} \hlstd{=} \hlkwd{sort}\hlstd{(}\hlkwd{levels}\hlstd{(my.factor2),} \hlkwc{decreasing} \hlstd{=} \hlnum{TRUE}\hlstd{))}
\hlkwd{levels}\hlstd{(my.factor2)}
\end{alltt}
\begin{verbatim}
## [1] "5" "4" "3"
\end{verbatim}
\begin{alltt}
\hlstd{my.factor2} \hlkwb{<-} \hlkwd{factor}\hlstd{(my.factor2,} \hlkwc{levels} \hlstd{=} \hlkwd{levels}\hlstd{(my.factor2)[}\hlkwd{c}\hlstd{(}\hlnum{2}\hlstd{,} \hlnum{1}\hlstd{,} \hlnum{3}\hlstd{)])}
\hlkwd{levels}\hlstd{(my.factor2)}
\end{alltt}
\begin{verbatim}
## [1] "4" "5" "3"
\end{verbatim}
\end{kframe}
\end{knitrout}

Reordering the levels of a factor based on summary quantities from data is very useful, especially when plotting. Function \Rfunction{reorder()} can be used in this case. It defaults to using \code{mean()} for summaries, but other suitable functions can be supplied in its place.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.factor3} \hlkwb{<-} \hlkwd{gl}\hlstd{(}\hlnum{2}\hlstd{,} \hlnum{5}\hlstd{,} \hlkwc{labels} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"A"}\hlstd{,} \hlstr{"B"}\hlstd{))}
\hlstd{my.vector3} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlnum{5.6}\hlstd{,} \hlnum{7.3}\hlstd{,} \hlnum{3.1}\hlstd{,} \hlnum{8.7}\hlstd{,} \hlnum{6.9}\hlstd{,} \hlnum{2.4}\hlstd{,} \hlnum{4.5}\hlstd{,} \hlnum{2.1}\hlstd{,} \hlnum{1.4}\hlstd{,} \hlnum{2.0}\hlstd{)}
\hlkwd{levels}\hlstd{(my.factor3)}
\end{alltt}
\begin{verbatim}
## [1] "A" "B"
\end{verbatim}
\begin{alltt}
\hlstd{my.factor3ord} \hlkwb{<-} \hlkwd{reorder}\hlstd{(my.factor3, my.vector3)}
\hlkwd{levels}\hlstd{(my.factor3ord)}
\end{alltt}
\begin{verbatim}
## [1] "B" "A"
\end{verbatim}
\begin{alltt}
\hlstd{my.factor3rev} \hlkwb{<-} \hlkwd{reorder}\hlstd{(my.factor3,} \hlopt{-}\hlstd{my.vector3)} \hlcom{# a simple trick}
\hlkwd{levels}\hlstd{(my.factor3rev)}
\end{alltt}
\begin{verbatim}
## [1] "A" "B"
\end{verbatim}
\end{kframe}
\end{knitrout}

In the last statement, using the unary negation operator, which is vectorized, allows us to easily reverse the ordering of the levels, while still using the default function, \code{mean()}, to summarize the data.

\end{explainbox}

\begin{advplayground}\label{calc:ADVPG:order:sort}
\textbf{Reordering factor values.}\index{factors!reorder values}\index{factors!arrange values} It is possible to arrange the values stored in a factor either alphabetically according to the labels of the levels or according to the order of the levels.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlcom{# gl() keeps order of levels}
\hlstd{my.factor4} \hlkwb{<-} \hlkwd{gl}\hlstd{(}\hlnum{4}\hlstd{,} \hlnum{3}\hlstd{,} \hlkwc{labels} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"A"}\hlstd{,} \hlstr{"F"}\hlstd{,} \hlstr{"B"}\hlstd{,} \hlstr{"Z"}\hlstd{))}
\hlstd{my.factor4}
\hlkwd{as.integer}\hlstd{(my.factor4)}
\hlcom{# factor() orders levels alphabetically}
\hlstd{my.factor5} \hlkwb{<-} \hlkwd{factor}\hlstd{(}\hlkwd{rep}\hlstd{(}\hlkwd{c}\hlstd{(}\hlstr{"A"}\hlstd{,} \hlstr{"F"}\hlstd{,} \hlstr{"B"}\hlstd{,} \hlstr{"Z"}\hlstd{),} \hlkwd{rep}\hlstd{(}\hlnum{3}\hlstd{,}\hlnum{4}\hlstd{)))}
\hlstd{my.factor5}
\hlkwd{as.integer}\hlstd{(my.factor5)}
\hlkwd{levels}\hlstd{(my.factor5)[}\hlkwd{as.integer}\hlstd{(my.factor5)]}
\end{alltt}
\end{kframe}
\end{knitrout}

We see above that the integer values by which levels in a factor are stored, are equivalent to indices or ``subscripts'' referencing the vector of labels. Function \Rfunction{sort()} operates on the values' underlying integers and sorts according to the order of the levels while \Rfunction{order()} operates on the values' labels and returns a vector of indices that arrange the values alphabetically.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{sort}\hlstd{(my.factor4)}
\hlstd{my.factor4[}\hlkwd{order}\hlstd{(my.factor4)]}
\hlstd{my.factor4[}\hlkwd{order}\hlstd{(}\hlkwd{as.integer}\hlstd{(my.factor4))]}
\end{alltt}
\end{kframe}
\end{knitrout}

Run the examples in the chunk above and work out why the results differ.
\end{advplayground}


\index{factors|)}

\section{Lists}\label{sec:calc:lists}
\index{lists|(}\qRclass{list}
\emph{Lists'} main difference to other collections is, in \Rlang, that they can be heterogeneous. In \Rlang, the members of a list can be considered as following a sequence, and accessible through numerical indexes, the same as vectors. However, frequently members of a list are given names, and retrieved (indexed) through these names. Lists are created using function \Rfunction{list()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a.list} \hlkwb{<-} \hlkwd{list}\hlstd{(}\hlkwc{x} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{6}\hlstd{,} \hlkwc{y} \hlstd{=} \hlstr{"a"}\hlstd{,} \hlkwc{z} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{TRUE}\hlstd{,} \hlnum{FALSE}\hlstd{))}
\hlstd{a.list}
\end{alltt}
\begin{verbatim}
## $x
## [1] 1 2 3 4 5 6
##
## $y
## [1] "a"
##
## $z
## [1]  TRUE FALSE
\end{verbatim}
\end{kframe}
\end{knitrout}

\subsection{Member extraction and subsetting}
Using\qRoperator{[[]]}\index{lists!member extraction|(}\index{lists!member indexing|see{lists!member extraction}} double square brackets for indexing a list extracts the element stored in the list, in its original mode. In the example above, \code{a.list[["x"]]} returns a numeric vector, while \code{a.list[1]} returns a list containing the numeric vector \code{x}. \code{a.list\$x} returns the same value as \code{a.list[["x"]]}, a numeric vector. \code{a.list[c(1,3)]} returns a list of length two, while \code{a.list[[c(1,3)]]} is an error.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a.list}\hlopt{$}\hlstd{x}
\end{alltt}
\begin{verbatim}
## [1] 1 2 3 4 5 6
\end{verbatim}
\begin{alltt}
\hlstd{a.list[[}\hlstr{"x"}\hlstd{]]}
\end{alltt}
\begin{verbatim}
## [1] 1 2 3 4 5 6
\end{verbatim}
\begin{alltt}
\hlstd{a.list[[}\hlnum{1}\hlstd{]]}
\end{alltt}
\begin{verbatim}
## [1] 1 2 3 4 5 6
\end{verbatim}
\begin{alltt}
\hlstd{a.list[}\hlstr{"x"}\hlstd{]}
\end{alltt}
\begin{verbatim}
## $x
## [1] 1 2 3 4 5 6
\end{verbatim}
\begin{alltt}
\hlstd{a.list[}\hlnum{1}\hlstd{]}
\end{alltt}
\begin{verbatim}
## $x
## [1] 1 2 3 4 5 6
\end{verbatim}
\begin{alltt}
\hlstd{a.list[}\hlkwd{c}\hlstd{(}\hlnum{1}\hlstd{,}\hlnum{3}\hlstd{)]}
\end{alltt}
\begin{verbatim}
## $x
## [1] 1 2 3 4 5 6
##
## $z
## [1]  TRUE FALSE
\end{verbatim}
\begin{alltt}
\hlkwd{try}\hlstd{(a.list[[}\hlkwd{c}\hlstd{(}\hlnum{1}\hlstd{,}\hlnum{3}\hlstd{)]])}
\end{alltt}
\begin{verbatim}
## [1] 3
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{explainbox}
\emph{Lists}, as usually defined in languages like \Clang, are based on pointers stored at each node, and these pointers chain or link the different member nodes. In such implementations, indexing by position is not possible, or at least requires ``walking'' down the list, node by node. In \Rlang, \code{list} members can be accessed through positional indexes. Of course, insertions and deletions in the middle of a list, whatever their implementation, alter (or \emph{invalidate}) any position-based indexes.
\end{explainbox}

To investigate the returned values, function \Rfunction{str()} (\emph{structure}) is of help, especially when the lists have many members, as it formats lists more compactly than function \code{print()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{str}\hlstd{(a.list)}
\end{alltt}
\begin{verbatim}
## List of 3
##  $ x: int [1:6] 1 2 3 4 5 6
##  $ y: chr "a"
##  $ z: logi [1:2] TRUE FALSE
\end{verbatim}
\end{kframe}
\end{knitrout}

\index{lists!member extraction|)}

\subsection{Adding and removing list members}
In other languages the two most common operations on lists are insertions and deletions.\index{lists!insert into}\index{lists!append to} In \Rlang, function \Rfunction{append()} can be used both to append elements at the end of a list and insert elements into the head or any position in the middle of a list.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{another.list} \hlkwb{<-} \hlkwd{append}\hlstd{(a.list,} \hlkwd{list}\hlstd{(}\hlkwc{yy} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{10}\hlstd{,} \hlkwc{zz} \hlstd{= letters[}\hlnum{5}\hlopt{:}\hlnum{1}\hlstd{]),} \hlnum{2L}\hlstd{)}
\hlstd{another.list}
\end{alltt}
\begin{verbatim}
## $x
## [1] 1 2 3 4 5 6
##
## $y
## [1] "a"
##
## $yy
##  [1]  1  2  3  4  5  6  7  8  9 10
##
## $zz
## [1] "e" "d" "c" "b" "a"
##
## $z
## [1]  TRUE FALSE
\end{verbatim}
\end{kframe}
\end{knitrout}

To delete a member from a list we assign \code{NULL} to it.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a.list}\hlopt{$}\hlstd{y} \hlkwb{<-} \hlkwa{NULL}
\hlstd{a.list}
\end{alltt}
\begin{verbatim}
## $x
## [1] 1 2 3 4 5 6
##
## $z
## [1]  TRUE FALSE
\end{verbatim}
\end{kframe}
\end{knitrout}

Lists can be nested, i.e., lists of lists.\index{lists!nested}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a.list} \hlkwb{<-} \hlkwd{list}\hlstd{(}\hlstr{"a"}\hlstd{,} \hlstr{"aa"}\hlstd{,} \hlstr{"aaa"}\hlstd{)}
\hlstd{b.list} \hlkwb{<-} \hlkwd{list}\hlstd{(}\hlstr{"b"}\hlstd{,} \hlstr{"bb"}\hlstd{)}
\hlstd{nested.list} \hlkwb{<-} \hlkwd{list}\hlstd{(}\hlkwc{A} \hlstd{= a.list,} \hlkwc{B} \hlstd{= b.list)}
\hlkwd{str}\hlstd{(nested.list)}
\end{alltt}
\begin{verbatim}
## List of 2
##  $ A:List of 3
##   ..$ : chr "a"
##   ..$ : chr "aa"
##   ..$ : chr "aaa"
##  $ B:List of 2
##   ..$ : chr "b"
##   ..$ : chr "bb"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{explainbox}\index{lists!structure}
When dealing with deep lists, it is sometimes useful to limit the number of levels of nesting returned by \Rfunction{str()} by means of a \code{numeric} argument passed to parameter \code{max.levels}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{str}\hlstd{(nested.list,} \hlkwc{max.level} \hlstd{=} \hlnum{1}\hlstd{)}
\end{alltt}
\begin{verbatim}
## List of 2
##  $ A:List of 3
##  $ B:List of 2
\end{verbatim}
\end{kframe}
\end{knitrout}

\end{explainbox}

\subsection{Nested lists}
A nested\index{lists!nested} list can be constructed within a single statement in which several member lists are created. Here we combine the first three statements in the earlier chunk into a single one.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{nested.list} \hlkwb{<-} \hlkwd{list}\hlstd{(}\hlkwc{A} \hlstd{=} \hlkwd{list}\hlstd{(}\hlstr{"a"}\hlstd{,} \hlstr{"aa"}\hlstd{,} \hlstr{"aaa"}\hlstd{),} \hlkwc{B} \hlstd{=} \hlkwd{list}\hlstd{(}\hlstr{"b"}\hlstd{,} \hlstr{"bb"}\hlstd{))}
\hlkwd{str}\hlstd{(nested.list)}
\end{alltt}
\begin{verbatim}
## List of 2
##  $ A:List of 3
##   ..$ : chr "a"
##   ..$ : chr "aa"
##   ..$ : chr "aaa"
##  $ B:List of 2
##   ..$ : chr "b"
##   ..$ : chr "bb"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{explainbox}
The logic behind extraction of members of nested lists using indexing is the same as for simple lists, but applied recursively---e.g., \code{nested.list[[2]]} extracts the second member of the outermost list, which is another list. As, this is a list, its members can be extracted using again the extraction operator: \code{nested.list[[2]][[1]]}. It is important to remember that these concatenated extraction operations are written so that the leftmost operator is applied to the outermost list. The example given here uses the \Roperator{[[ ]]} operator, but the same logic applies to \Roperator{[ ]}.
\end{explainbox}

\begin{playground}
What\index{lists!nested} do you expect each of the statements below to return? \emph{Before running the code}, predict what value and of which mode each statement will return. You may use implicit or explicit calls to \Rfunction{print()}, or calls to \Rfunction{str()} to visualize the structure of the different objects.

% not handled correctly by knitr, works at console.
\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{nested.list} \hlkwb{<-} \hlkwd{list}\hlstd{(}\hlkwc{A} \hlstd{=} \hlkwd{list}\hlstd{(}\hlstr{"a"}\hlstd{,} \hlstr{"aa"}\hlstd{,} \hlstr{"aaa"}\hlstd{),} \hlkwc{B} \hlstd{=} \hlkwd{list}\hlstd{(}\hlstr{"b"}\hlstd{,} \hlstr{"bb"}\hlstd{))}
\hlkwd{str}\hlstd{(nested.list)}
\hlstd{nested.list[}\hlnum{2}\hlopt{:}\hlnum{1}\hlstd{]}
\hlstd{nested.list[}\hlnum{1}\hlstd{]}
\hlstd{nested.list[[}\hlnum{1}\hlstd{]][}\hlnum{2}\hlstd{]}
\hlstd{nested.list[[}\hlnum{1}\hlstd{]][[}\hlnum{2}\hlstd{]]}
\hlstd{nested.list[}\hlnum{2}\hlstd{]}
\hlstd{nested.list[}\hlnum{2}\hlstd{][[}\hlnum{1}\hlstd{]]}
\end{alltt}
\end{kframe}
\end{knitrout}

\end{playground}

Sometimes we need to flatten a list\index{lists!flattening}\index{lists!nested}, or a nested structure of lists within lists. Function \Rfunction{unlist()} is what should be normally used in such cases.

The list \code{nested.list} is a nested system of lists, but all the ``terminal'' members are character strings. In other words, terminal nodes are all of the same mode, allowing the list to be ``flattened'' into a character vector.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{c.vec} \hlkwb{<-} \hlkwd{unlist}\hlstd{(nested.list)}
\hlstd{c.vec}
\end{alltt}
\begin{verbatim}
##    A1    A2    A3    B1    B2
##   "a"  "aa" "aaa"   "b"  "bb"
\end{verbatim}
\begin{alltt}
\hlkwd{is.list}\hlstd{(nested.list)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{is.list}\hlstd{(c.vec)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlkwd{mode}\hlstd{(nested.list)}
\end{alltt}
\begin{verbatim}
## [1] "list"
\end{verbatim}
\begin{alltt}
\hlkwd{mode}\hlstd{(c.vec)}
\end{alltt}
\begin{verbatim}
## [1] "character"
\end{verbatim}
\begin{alltt}
\hlkwd{names}\hlstd{(nested.list)}
\end{alltt}
\begin{verbatim}
## [1] "A" "B"
\end{verbatim}
\begin{alltt}
\hlkwd{names}\hlstd{(c.vec)}
\end{alltt}
\begin{verbatim}
## [1] "A1" "A2" "A3" "B1" "B2"
\end{verbatim}
\end{kframe}
\end{knitrout}

The returned value is a vector with named member elements. We use function \Rfunction{str()} to figure out how this vector relates to the original list. The names are based on the names of list elements when available, while numbers are used for anonymous nodes. We can access the members of the vector either through numeric indexes or names.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{str}\hlstd{(c.vec)}
\end{alltt}
\begin{verbatim}
##  Named chr [1:5] "a" "aa" "aaa" "b" "bb"
##  - attr(*, "names")= chr [1:5] "A1" "A2" "A3" "B1" ...
\end{verbatim}
\begin{alltt}
\hlstd{c.vec[}\hlnum{2}\hlstd{]}
\end{alltt}
\begin{verbatim}
##   A2
## "aa"
\end{verbatim}
\begin{alltt}
\hlstd{c.vec[}\hlstr{"A2"}\hlstd{]}
\end{alltt}
\begin{verbatim}
##   A2
## "aa"
\end{verbatim}
\end{kframe}
\end{knitrout}

Function \Rfunction{unname()} can be used to remove names safely---i.e., without risk of altering the mode or class of the object.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{unname}\hlstd{(c.vec)}
\end{alltt}
\begin{verbatim}
## [1] "a"   "aa"  "aaa" "b"   "bb"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
Function \Rfunction{unlist()}\index{lists!convert into vector} has two additional parameters, with default argument values, which we did not modify in the example above. These parameters are \code{recursive} and \code{use.names}, both of them expecting a \code{logical} value as an argument. Modify the statement \code{c.vec <- unlist(c.list)}, by passing \code{FALSE} as an argument to these two parameters, in turn, and in each case, study the value returned and how it differs with respect to the one obtained above.
\end{playground}


\index{lists|)}

\section{Data frames}\label{sec:R:data:frames}
\index{data frames|(}\qRclass{data.frame}
\index{worksheet@`worksheet'|see{data frame}}
Data frames are a special type of list, in which each element is a vector or a factor of the same length. They are created with function \Rfunction{data.frame()} with a syntax similar to that used for lists---in object-oriented programming we say that data frames are derived from class \Rclass{list}. As the expectation is equal length, if vectors of different lengths are supplied as arguments, the shorter vector(s) is/are recycled, possibly several times, until the required full length is reached.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a.df} \hlkwb{<-} \hlkwd{data.frame}\hlstd{(}\hlkwc{x} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{6}\hlstd{,} \hlkwc{y} \hlstd{=} \hlstr{"a"}\hlstd{,} \hlkwc{z} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{TRUE}\hlstd{,} \hlnum{FALSE}\hlstd{))}
\hlstd{a.df}
\end{alltt}
\begin{verbatim}
##   x y     z
## 1 1 a  TRUE
## 2 2 a FALSE
## 3 3 a  TRUE
## 4 4 a FALSE
## 5 5 a  TRUE
## 6 6 a FALSE
\end{verbatim}
\begin{alltt}
\hlkwd{str}\hlstd{(a.df)}
\end{alltt}
\begin{verbatim}
## 'data.frame':	6 obs. of  3 variables:
##  $ x: int  1 2 3 4 5 6
##  $ y: chr  "a" "a" "a" "a" ...
##  $ z: logi  TRUE FALSE TRUE FALSE TRUE FALSE
\end{verbatim}
\begin{alltt}
\hlkwd{class}\hlstd{(a.df)}
\end{alltt}
\begin{verbatim}
## [1] "data.frame"
\end{verbatim}
\begin{alltt}
\hlkwd{mode}\hlstd{(a.df)}
\end{alltt}
\begin{verbatim}
## [1] "list"
\end{verbatim}
\begin{alltt}
\hlkwd{is.data.frame}\hlstd{(a.df)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{is.list}\hlstd{(a.df)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\end{kframe}
\end{knitrout}

Indexing of data frames is similar to that of the underlying list, but not exactly equivalent. We can index with operator \Roperator{[[]]} to extract individual variables, thought of being the columns in a matrix-like list or ``worksheet.''

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a.df}\hlopt{$}\hlstd{x}
\end{alltt}
\begin{verbatim}
## [1] 1 2 3 4 5 6
\end{verbatim}
\begin{alltt}
\hlstd{a.df[[}\hlstr{"x"}\hlstd{]]}
\end{alltt}
\begin{verbatim}
## [1] 1 2 3 4 5 6
\end{verbatim}
\begin{alltt}
\hlstd{a.df[[}\hlnum{1}\hlstd{]]}
\end{alltt}
\begin{verbatim}
## [1] 1 2 3 4 5 6
\end{verbatim}
\begin{alltt}
\hlkwd{class}\hlstd{(a.df)}
\end{alltt}
\begin{verbatim}
## [1] "data.frame"
\end{verbatim}
\end{kframe}
\end{knitrout}

With function \Rfunction{class()} we can query the class of an \Rlang object (see section \ref{sec:rlang:mode} on page \pageref{sec:rlang:mode}). As we saw in the two previous chunks, \code{list} and \code{data.frame} objects belong to two different classes. However, their relationship is based on a hierarchy of classes. We say that class \Rclass{data.frame} is derived from class \code{list}. Consequently, data frames inherit the methods and characteristics of lists, as long as they have not been hidden by new ones defined for data frames.

In the same way as with vectors, we can add members to lists and data frames.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a.df}\hlopt{$}\hlstd{x2} \hlkwb{<-} \hlnum{6}\hlopt{:}\hlnum{1}
\hlstd{a.df}\hlopt{$}\hlstd{x3} \hlkwb{<-} \hlstr{"b"}
\hlkwd{str}\hlstd{(a.df)}
\end{alltt}
\begin{verbatim}
## 'data.frame':	6 obs. of  5 variables:
##  $ x : int  1 2 3 4 5 6
##  $ y : chr  "a" "a" "a" "a" ...
##  $ z : logi  TRUE FALSE TRUE FALSE TRUE FALSE
##  $ x2: int  6 5 4 3 2 1
##  $ x3: chr  "b" "b" "b" "b" ...
\end{verbatim}
\end{kframe}
\end{knitrout}

We have added two columns to the data frame, and in the case of column \code{x3} recycling took place. This is where lists and data frames differ substantially in their behavior. In a data frame, although class and mode can be different for different variables (columns), they are required to be vectors or factors of the same length. In the case of lists, there is no such requirement, and recycling never takes place when adding a node. Compare the values returned below for \code{a.ls}, to those in the example above for \code{a.df}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a.ls} \hlkwb{<-} \hlkwd{list}\hlstd{(}\hlkwc{x} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{6}\hlstd{,} \hlkwc{y} \hlstd{=} \hlstr{"a"}\hlstd{,} \hlkwc{z} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{TRUE}\hlstd{,} \hlnum{FALSE}\hlstd{))}
\hlkwd{str}\hlstd{(a.ls)}
\end{alltt}
\begin{verbatim}
## List of 3
##  $ x: int [1:6] 1 2 3 4 5 6
##  $ y: chr "a"
##  $ z: logi [1:2] TRUE FALSE
\end{verbatim}
\begin{alltt}
\hlstd{a.ls}\hlopt{$}\hlstd{x2} \hlkwb{<-} \hlnum{6}\hlopt{:}\hlnum{1}
\hlstd{a.ls}\hlopt{$}\hlstd{x3} \hlkwb{<-} \hlstr{"b"}
\hlkwd{str}\hlstd{(a.ls)}
\end{alltt}
\begin{verbatim}
## List of 5
##  $ x : int [1:6] 1 2 3 4 5 6
##  $ y : chr "a"
##  $ z : logi [1:2] TRUE FALSE
##  $ x2: int [1:6] 6 5 4 3 2 1
##  $ x3: chr "b"
\end{verbatim}
\end{kframe}
\end{knitrout}

Data frames are extremely important to anyone analyzing or plotting data using \Rlang. One can think of data frames as tightly structured work-sheets, or as lists. As you may have guessed from the examples earlier in this section, there are several different ways of accessing columns, rows, and individual observations stored in a data frame. The columns can be treated as members in a list, and can be accessed both by name or index (position). When accessed by name, using \Roperator{\$} or double square brackets, a single column is returned as a vector or factor. In contrast to lists, data frames are always ``rectangular'' and for this reason the values stored can also be accessed in a way similar to how elements in a matrix are accessed, using two indexes. As we saw for vectors, indexes can be vectors of integer numbers or vectors of logical values. For columns they can, in addition, be vectors of character strings matching the names of the columns. When using indexes it is extremely important to remember that the indexes are always given \textbf{row first}.

\begin{explainbox}
Indexing of data frames can in all cases be done as if they were lists, which is preferable, as it ensures compatibility with regular \Rlang lists and with newer implementations of data-frame-like structures like those defined in package \pkgname{tibble}. Using this approach, extracting two values from the second and third positions in the first column of \code{a.df} is done as follows, using numerical indexes.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a.df[[}\hlnum{1}\hlstd{]][}\hlnum{2}\hlopt{:}\hlnum{3}\hlstd{]}
\end{alltt}
\begin{verbatim}
## [1] 2 3
\end{verbatim}
\end{kframe}
\end{knitrout}

Or using the column name.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a.df[[}\hlstr{"x"}\hlstd{]][}\hlnum{2}\hlopt{:}\hlnum{3}\hlstd{]}
\end{alltt}
\begin{verbatim}
## [1] 2 3
\end{verbatim}
\end{kframe}
\end{knitrout}

The less portable, matrix-like indexing is done as follows, with the first index indicating rows and the second one indicating columns. This notation allows simultaneous extraction from multiple columns, which is not possible with lists. The value returned is a ``smaller'' data frame.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a.df[}\hlnum{2}\hlopt{:}\hlnum{3}\hlstd{,} \hlnum{1}\hlopt{:}\hlnum{2}\hlstd{]}
\end{alltt}
\begin{verbatim}
##   x y
## 2 2 a
## 3 3 a
\end{verbatim}
\end{kframe}
\end{knitrout}

If the length of the column indexing vector is one, the returned value is a vector, which is not consistent with the previous example which returned a data frame. This is not only surprising in everyday use, but can be the source of bugs when coding algorithms in which the length of the second index vector cannot be guaranteed to be always more than one.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a.df[}\hlnum{2}\hlopt{:}\hlnum{3}\hlstd{,} \hlnum{1}\hlstd{]}
\end{alltt}
\begin{verbatim}
## [1] 2 3
\end{verbatim}
\end{kframe}
\end{knitrout}

In contrast, indexing of tibbles---defined in package \pkgname{tibble}---is always consistent, never returning a vector, even when the vector used to extract columns is of length one (see section \ref{sec:data:tibble} on page \pageref{sec:data:tibble} for details on the differences between data frames and tibbles).
\end{explainbox}


\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlcom{# first column, a.df[[1]] preferred}
\hlstd{a.df[ ,} \hlnum{1}\hlstd{]}
\end{alltt}
\begin{verbatim}
## [1] 1 2 3 4 5 6
\end{verbatim}
\begin{alltt}
\hlcom{# first column, a.df[["x"]] or a.df$x preferred}
\hlstd{a.df[ ,} \hlstr{"x"}\hlstd{]}
\end{alltt}
\begin{verbatim}
## [1] 1 2 3 4 5 6
\end{verbatim}
\begin{alltt}
\hlcom{# first row}
\hlstd{a.df[}\hlnum{1}\hlstd{, ]}
\end{alltt}
\begin{verbatim}
##   x y    z x2 x3
## 1 1 a TRUE  6  b
\end{verbatim}
\begin{alltt}
\hlcom{# first two rows of the third and fourth columns}
\hlstd{a.df[}\hlnum{1}\hlopt{:}\hlnum{2}\hlstd{,} \hlkwd{c}\hlstd{(}\hlnum{FALSE}\hlstd{,} \hlnum{FALSE}\hlstd{,} \hlnum{TRUE}\hlstd{,} \hlnum{TRUE}\hlstd{,} \hlnum{FALSE}\hlstd{)]}
\end{alltt}
\begin{verbatim}
##       z x2
## 1  TRUE  6
## 2 FALSE  5
\end{verbatim}
\begin{alltt}
\hlcom{# the rows for which z is true}
\hlstd{a.df[a.df}\hlopt{$}\hlstd{z , ]}
\end{alltt}
\begin{verbatim}
##   x y    z x2 x3
## 1 1 a TRUE  6  b
## 3 3 a TRUE  4  b
## 5 5 a TRUE  2  b
\end{verbatim}
\begin{alltt}
\hlcom{# the rows for which x > 3 keeping all columns except the third one}
\hlstd{a.df[a.df}\hlopt{$}\hlstd{x} \hlopt{>} \hlnum{3}\hlstd{,} \hlopt{-}\hlnum{3}\hlstd{]}
\end{alltt}
\begin{verbatim}
##   x y x2 x3
## 4 4 a  3  b
## 5 5 a  2  b
## 6 6 a  1  b
\end{verbatim}
\end{kframe}
\end{knitrout}

As explained earlier for vectors (see section \ref{sec:calc:indexing} on page \pageref{sec:calc:indexing}), indexing can be present both on the right-hand side and left-hand side of an assignment.
The next few examples do assignments to ``cells'' of \code{a.df}, either to one whole column, or individual values. The last statement in the chunk below copies a number from one location to another by using indexing of the same data frame both on the right side and left side of the assignment.\qRoperator{[[]]}\qRoperator{[]}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a.df[}\hlnum{1}\hlstd{,} \hlnum{1}\hlstd{]} \hlkwb{<-} \hlnum{99}
\hlstd{a.df}
\end{alltt}
\begin{verbatim}
##    x y     z x2 x3
## 1 99 a  TRUE  6  b
## 2  2 a FALSE  5  b
## 3  3 a  TRUE  4  b
## 4  4 a FALSE  3  b
## 5  5 a  TRUE  2  b
## 6  6 a FALSE  1  b
\end{verbatim}
\begin{alltt}
\hlstd{a.df[ ,} \hlnum{1}\hlstd{]} \hlkwb{<-} \hlopt{-}\hlnum{99}
\hlstd{a.df}
\end{alltt}
\begin{verbatim}
##     x y     z x2 x3
## 1 -99 a  TRUE  6  b
## 2 -99 a FALSE  5  b
## 3 -99 a  TRUE  4  b
## 4 -99 a FALSE  3  b
## 5 -99 a  TRUE  2  b
## 6 -99 a FALSE  1  b
\end{verbatim}
\begin{alltt}
\hlstd{a.df[[}\hlstr{"x"}\hlstd{]]} \hlkwb{<-} \hlnum{123}
\hlstd{a.df}
\end{alltt}
\begin{verbatim}
##     x y     z x2 x3
## 1 123 a  TRUE  6  b
## 2 123 a FALSE  5  b
## 3 123 a  TRUE  4  b
## 4 123 a FALSE  3  b
## 5 123 a  TRUE  2  b
## 6 123 a FALSE  1  b
\end{verbatim}
\begin{alltt}
\hlstd{a.df[}\hlnum{1}\hlstd{,} \hlnum{1}\hlstd{]} \hlkwb{<-} \hlstd{a.df[}\hlnum{6}\hlstd{,} \hlnum{4}\hlstd{]}
\hlstd{a.df}
\end{alltt}
\begin{verbatim}
##     x y     z x2 x3
## 1   1 a  TRUE  6  b
## 2 123 a FALSE  5  b
## 3 123 a  TRUE  4  b
## 4 123 a FALSE  3  b
## 5 123 a  TRUE  2  b
## 6 123 a FALSE  1  b
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{warningbox}
We mentioned above that indexing by name can be done either with double square brackets, \Roperator{[[]]}, or with \Roperator{\$}. In the first case the name of the variable or column is given as a character string, enclosed in quotation marks, or as a variable with mode \code{character}. When using \Roperator{\$}, the name is entered as a constant, without quotation marks, and cannot be a variable.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{x.list} \hlkwb{<-} \hlkwd{list}\hlstd{(}\hlkwc{abcd} \hlstd{=} \hlnum{123}\hlstd{,} \hlkwc{xyzw} \hlstd{=} \hlnum{789}\hlstd{)}
\hlstd{x.list[[}\hlstr{"abcd"}\hlstd{]]}
\end{alltt}
\begin{verbatim}
## [1] 123
\end{verbatim}
\begin{alltt}
\hlstd{a.var} \hlkwb{<-} \hlstr{"abcd"}
\hlstd{x.list[[a.var]]}
\end{alltt}
\begin{verbatim}
## [1] 123
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{x.list}\hlopt{$}\hlstd{abcd}
\end{alltt}
\begin{verbatim}
## [1] 123
\end{verbatim}
\begin{alltt}
\hlstd{x.list}\hlopt{$}\hlstd{ab}
\end{alltt}
\begin{verbatim}
## [1] 123
\end{verbatim}
\begin{alltt}
\hlstd{x.list}\hlopt{$}\hlstd{a}
\end{alltt}
\begin{verbatim}
## [1] 123
\end{verbatim}
\end{kframe}
\end{knitrout}

Both in the case of lists and data frames, when using double square brackets, an exact match is required between the name in the object and the name used for indexing. In contrast, with \Roperator{\$}, any unambiguous partial match will be accepted. For interactive use, partial matching is helpful in reducing typing. However, in scripts, and especially \Rlang code in packages, it is best to avoid the use of \Roperator{\$} as partial matching to a wrong variable present at a later time, e.g., when someone else revises the script, can lead to very difficult-to-diagnose errors. In addition, as \Roperator{\$} is implemented by first attempting a match to the name and then calling \Roperator{[[]]}, using \Roperator{\$} for indexing can result in slightly slower performance compared to using \Roperator{[[]]}. It is possible to set an \Rlang option so that partial matching triggers a warning, which can be very useful when debugging.
\end{warningbox}

\subsection{Operating within data frames}\label{sec:calc:df:with}
\index{data frames!operating within}
When\index{data frames!subsetting}\index{data frames!``filtering rows''} the names of data frames are long, complex conditions become awkward to write using indexing---i.e., subscripts. In such cases \Rfunction{subset()} is handy because evaluation is done in the ``environment'' of the data frame, i.e., the names of the columns are recognized if entered directly when writing the condition. Function  \Rfunction{subset()} ``filters'' rows, usually corresponding to observations or experimental units. The condition is computed for each row, and if it returns \code{TRUE}, the row is included in the returned data frame, and excluded if \code{FALSE}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a.df} \hlkwb{<-} \hlkwd{data.frame}\hlstd{(}\hlkwc{x} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{6}\hlstd{,} \hlkwc{y} \hlstd{=} \hlstr{"a"}\hlstd{,} \hlkwc{z} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{TRUE}\hlstd{,} \hlnum{FALSE}\hlstd{))}
\hlkwd{subset}\hlstd{(a.df, x} \hlopt{>} \hlnum{3}\hlstd{)}
\end{alltt}
\begin{verbatim}
##   x y     z
## 4 4 a FALSE
## 5 5 a  TRUE
## 6 6 a FALSE
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
What is the behavior of \code{subset()} when the condition is \code{NA}? Find the answer by writing code to test this, for a case where tests for different rows return \code{NA}, \code{TRUE} and \code{FALSE}.
\end{playground}

When calling functions that return a vector, data frame, or other structure, the square brackets can be appended to the rightmost parenthesis of the function call, in the same way as to the name of a variable holding the same data.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{subset}\hlstd{(a.df, x} \hlopt{>} \hlnum{3}\hlstd{)[ ,} \hlopt{-}\hlnum{3}\hlstd{]}
\end{alltt}
\begin{verbatim}
##   x y
## 4 4 a
## 5 5 a
## 6 6 a
\end{verbatim}
\begin{alltt}
\hlkwd{subset}\hlstd{(a.df, x} \hlopt{>} \hlnum{3}\hlstd{)}\hlopt{$}\hlstd{x}
\end{alltt}
\begin{verbatim}
## [1] 4 5 6
\end{verbatim}
\end{kframe}
\end{knitrout}

None of the examples in the last three code chunks alter the original data frame \code{a.df}. We can store the returned value using a new name if we want to preserve \code{a.df} unchanged, or we can assign the result to \code{a.df}, deleting in the process, the previously stored value.
\begin{warningbox}
  In the example above, the names in the expression passed as the second argument to \code{subset()} are first searched within \code{ad.df} but if not found, searched in the environment. There being no variable \code{A} in the data frame \code{a.df}, vector \code{A} from the environment is silently used in the expression resulting in a returned data frame with no rows.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{A} \hlkwb{<-} \hlnum{1}
\hlkwd{subset}\hlstd{(a.df, A} \hlopt{>} \hlnum{3}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] x y z
## <0 rows> (or 0-length row.names)
\end{verbatim}
\end{kframe}
\end{knitrout}

The use of \Rfunction{subset()} is convenient, but more prone to result in bugs compared to directly using the extraction operator \code{[]}. This same ``cost'' to achieving convenience applies to functions like \Rfunction{attach()} and \Rfunction{with()} described below. The longer time that a script is expected to be used, adapted and reused, the more careful we should be when using any of these functions. An alternative way of avoiding excessive verbosity is to keep the names of data frames short.
\end{warningbox}

A frequently used way of deleting a column by name from a data frame is to assign \code{NULL} to it---i.e., in the same way as members are deleted from \code{list}s. This approach modifies \code{a.df} in place.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{aa.df} \hlkwb{<-} \hlstd{a.df}
\hlkwd{colnames}\hlstd{(aa.df)}
\end{alltt}
\begin{verbatim}
## [1] "x" "y" "z"
\end{verbatim}
\begin{alltt}
\hlstd{aa.df[[}\hlstr{"y"}\hlstd{]]} \hlkwb{<-} \hlkwa{NULL}
\hlkwd{colnames}\hlstd{(aa.df)}
\end{alltt}
\begin{verbatim}
## [1] "x" "z"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{explainbox}
Alternatively, we can use negative indexing to remove columns from a copy of a data frame. In this example we remove a single column. As base \Rlang does not support negative indexing by name, we need to find the numerical index of the column to delete.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a.df[ ,} \hlopt{-}\hlkwd{which}\hlstd{(}\hlkwd{colnames}\hlstd{(a.df)} \hlopt{==} \hlstr{"y"}\hlstd{)]}
\end{alltt}
\begin{verbatim}
##   x     z
## 1 1  TRUE
## 2 2 FALSE
## 3 3  TRUE
## 4 4 FALSE
## 5 5  TRUE
## 6 6 FALSE
\end{verbatim}
\end{kframe}
\end{knitrout}
Instead of using the equality test, we can use the operator \code{\%in\%} or function \code{grepl()} to delete multiple columns in a single statement.
\end{explainbox}

\begin{playground}
In the previous code chunk we deleted the last column of the data frame \code{a.df}.
Here is an esoteric trick for you to first untangle how it changes the positions of columns and row, and then for you to think how and why it can be useful to use indexing with the extraction operator \Roperator{[ ]} on both sides of the assignment operator \Roperator{<-}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a.df[}\hlnum{1}\hlopt{:}\hlnum{6}\hlstd{,} \hlkwd{c}\hlstd{(}\hlnum{1}\hlstd{,}\hlnum{3}\hlstd{)]} \hlkwb{<-} \hlstd{a.df[}\hlnum{6}\hlopt{:}\hlnum{1}\hlstd{,} \hlkwd{c}\hlstd{(}\hlnum{3}\hlstd{,}\hlnum{1}\hlstd{)]}
\hlstd{a.df}
\end{alltt}
\end{kframe}
\end{knitrout}
\end{playground}

\begin{warningbox}
Although in this last example we used numeric indexes to make it more interesting, in practice, especially in scripts or other code that will be reused, do use column or member names instead of positional indexes whenever possible. This makes code much more reliable, as changes elsewhere in the script could alter the order of columns and \emph{invalidate} numerical indexes. In addition, using meaningful names makes programmers' intentions easier to understand.
\end{warningbox}

\begin{explainbox}
It is sometimes inconvenient to have to pre-pend the name of a \emph{container} such as a list or data frame to the name of each member variable being accessed. There are functions in \Rlang that allow us to change where \Rlang looks for the names of objects we include in a code statement. Here I describe the use of \Rscoping{attach()} and its matching \Rscoping{detach()}, and \Rscoping{with()} and \Rscoping{within()} to access members of a data frame. They can be used as well with lists and classes derived from \code{list}.

As we can see below, when using a rather long name for a data frame, entering a simple calculation can easily result in a long and difficult to read statement. (Method \code{head()} is used here to limit the displayed value to the first two rows---\code{head()} is described in section \ref{sec:calc:looking:at:data} on page \pageref{sec:calc:looking:at:data}.)

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my_data_frame.df} \hlkwb{<-} \hlkwd{data.frame}\hlstd{(}\hlkwc{A} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{10}\hlstd{,} \hlkwc{B} \hlstd{=} \hlnum{3}\hlstd{)}
\hlstd{my_data_frame.df}\hlopt{$}\hlstd{C} \hlkwb{<-}
  \hlstd{(my_data_frame.df}\hlopt{$}\hlstd{A} \hlopt{+} \hlstd{my_data_frame.df}\hlopt{$}\hlstd{B)} \hlopt{/} \hlstd{my_data_frame.df}\hlopt{$}\hlstd{A}
\hlkwd{head}\hlstd{(my_data_frame.df,} \hlnum{2}\hlstd{)}
\end{alltt}
\begin{verbatim}
##   A B   C
## 1 1 3 4.0
## 2 2 3 2.5
\end{verbatim}
\end{kframe}
\end{knitrout}

Using\index{data frames!attaching} \Rscoping{attach()} we can alter how \Rlang looks up names and consequently simplify the statement. With \Rscoping{detach()} we can restore the original state. It is important to remember that here we can only simplify the right-hand side of the assignment, while the ``destination'' of the result of the computation still needs to be fully specified on the left-hand side of the assignment operator. We include below only one statement between \Rscoping{attach()} and \Rscoping{detach()} but multiple statements are allowed. Furthermore, if variables with the same name as the columns exist in the search path, these will take precedence, something that can result in bugs or crashes, or as seen below, a message warns that variable \code{A} from the global environment will be used instead of column \code{A} of the attached \code{my\_data\_frame.df}. The returned value is, of course, not the desired one.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my_data_frame.df}\hlopt{$}\hlstd{C} \hlkwb{<-} \hlkwa{NULL}
\hlkwd{attach}\hlstd{(my_data_frame.df)}
\end{alltt}


{\ttfamily\noindent\itshape\color{messagecolor}{\#\# The following object is masked \_by\_ .GlobalEnv:\\\#\# \\\#\#\ \ \ \  A}}\begin{alltt}
\hlstd{my_data_frame.df}\hlopt{$}\hlstd{C} \hlkwb{<-} \hlstd{(A} \hlopt{+} \hlstd{B)} \hlopt{/} \hlstd{A}
\hlkwd{detach}\hlstd{(my_data_frame.df)}
\hlkwd{head}\hlstd{(my_data_frame.df,} \hlnum{2}\hlstd{)}
\end{alltt}
\begin{verbatim}
##   A B C
## 1 1 3 4
## 2 2 3 4
\end{verbatim}
\end{kframe}
\end{knitrout}

In the case of \Rscoping{with()} only one, possibly compound code statement is affected and this statement is passed as an argument. As before, we need to fully specify the left-hand side of the assignment. The value returned is the one returned by the statement passed as an argument, in the case of compound statements, the value returned by the last contained simple code statement to be executed. Consequently, if the intent is to modify the container, assignment to an individual member variable (column in this case) is required. In contrast to the behavior of \code{attach()}, In this case, column \code{A} of \code{my\_data\_frame.df} takes precedence, and the returned value is the expected one.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my_data_frame.df}\hlopt{$}\hlstd{C} \hlkwb{<-} \hlkwa{NULL}
\hlstd{my_data_frame.df}\hlopt{$}\hlstd{C} \hlkwb{<-} \hlkwd{with}\hlstd{(my_data_frame.df, (A} \hlopt{+} \hlstd{B)} \hlopt{/} \hlstd{A)}
\hlkwd{head}\hlstd{(my_data_frame.df,} \hlnum{2}\hlstd{)}
\end{alltt}
\begin{verbatim}
##   A B   C
## 1 1 3 4.0
## 2 2 3 2.5
\end{verbatim}
\end{kframe}
\end{knitrout}

In the case of \Rscoping{within()}, assignments in the argument to its second parameter affect the object returned, which is a copy of the container (In this case, a whole data frame), which still needs to be saved through assignment. Here the intention is to modify it, so we assign it back to the same name, but it could have been assigned to a different name so as not to overwrite the original data frame.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my_data_frame.df}\hlopt{$}\hlstd{C} \hlkwb{<-} \hlkwa{NULL}
\hlstd{my_data_frame.df} \hlkwb{<-} \hlkwd{within}\hlstd{(my_data_frame.df,  C} \hlkwb{<-} \hlstd{(A} \hlopt{+} \hlstd{B)} \hlopt{/} \hlstd{A)}
\hlkwd{head}\hlstd{(my_data_frame.df,} \hlnum{2}\hlstd{)}
\end{alltt}
\begin{verbatim}
##   A B   C
## 1 1 3 4.0
## 2 2 3 2.5
\end{verbatim}
\end{kframe}
\end{knitrout}
In the example above, \code{within()} makes little difference compared to using \Rscoping{with()} with respect to the amount of typing or clarity, but with multiple member variables being operated upon, as shown below, \Rscoping{within()} has an advantage resulting in more concise and easier to understand code.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my_data_frame.df}\hlopt{$}\hlstd{C} \hlkwb{<-} \hlkwa{NULL}
\hlstd{my_data_frame.df} \hlkwb{<-} \hlkwd{within}\hlstd{(my_data_frame.df,}
                           \hlstd{\{C} \hlkwb{<-} \hlstd{(A} \hlopt{+} \hlstd{B)} \hlopt{/} \hlstd{A}
                            \hlstd{D} \hlkwb{<-} \hlstd{A} \hlopt{*} \hlstd{B}
                            \hlstd{E} \hlkwb{<-} \hlstd{A} \hlopt{/} \hlstd{B} \hlopt{+} \hlnum{1}\hlstd{\}}
                           \hlstd{)}
\hlkwd{head}\hlstd{(my_data_frame.df,} \hlnum{2}\hlstd{)}
\end{alltt}
\begin{verbatim}
##   A B        E D   C
## 1 1 3 1.333333 3 4.0
## 2 2 3 1.666667 6 2.5
\end{verbatim}
\end{kframe}
\end{knitrout}

Use of \Rscoping{attach()} and \Rscoping{detach()}, which function as a pair of ON and OFF switches, can result in an undesired after-effect on name lookup if the script terminates after \Rscoping{attach()} is executed but before \Rscoping{detach()} is called, as cleanup is not automatic. In contrast, \Rscoping{with()} and \Rscoping{within()}, being self-contained, guarantee that cleanup takes place. Consequently, the usual recommendation is to give preference to the use of \Rscoping{with()} and \Rscoping{within()} over \Rscoping{attach()} and \Rscoping{detach()}. Use of these functions not only saves typing but also makes code more readable.
\end{explainbox}

\subsection{Re-arranging columns and rows}
\index{data frames!ordering rows}\index{data frames!ordering columns}
The most direct way of changing the order of columns and/or rows in data frames (and matrices and arrays) is to use subscripting as described above. Once we know the original position and target position we can use numerical indexes on both right-hand side and left-hand side of an assignment.

\begin{warningbox}
When using the extraction operator \Roperator{[]} on both the left-hand-side and right-hand-side to swap columns, the vectors or factors are swapped, while the names of the columns are not! The same applies to row names, which makes storing important information in them inconvenient and error prone.
\end{warningbox}

To retain the correct naming after the column swap, we need to separately swap the names of the columns.
\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my_data_frame.df} \hlkwb{<-} \hlkwd{data.frame}\hlstd{(}\hlkwc{A} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{10}\hlstd{,} \hlkwc{B} \hlstd{=} \hlnum{3}\hlstd{)}
\hlkwd{head}\hlstd{(my_data_frame.df,} \hlnum{2}\hlstd{)}
\end{alltt}
\begin{verbatim}
##   A B
## 1 1 3
## 2 2 3
\end{verbatim}
\begin{alltt}
\hlstd{my_data_frame.df[ ,} \hlnum{1}\hlopt{:}\hlnum{2}\hlstd{]} \hlkwb{<-} \hlstd{my_data_frame.df[ ,} \hlnum{2}\hlopt{:}\hlnum{1}\hlstd{]}
\hlkwd{head}\hlstd{(my_data_frame.df,} \hlnum{2}\hlstd{)}
\end{alltt}
\begin{verbatim}
##   A B
## 1 3 1
## 2 3 2
\end{verbatim}
\begin{alltt}
\hlkwd{colnames}\hlstd{(my_data_frame.df)[}\hlnum{1}\hlopt{:}\hlnum{2}\hlstd{]} \hlkwb{<-} \hlkwd{colnames}\hlstd{(my_data_frame.df)[}\hlnum{2}\hlopt{:}\hlnum{1}\hlstd{]}
\hlkwd{head}\hlstd{(my_data_frame.df,} \hlnum{2}\hlstd{)}
\end{alltt}
\begin{verbatim}
##   B A
## 1 3 1
## 2 3 2
\end{verbatim}
\end{kframe}
\end{knitrout}

Taking into account that \Rfunction{order()} returns the indexes needed to sort a vector (see page \pageref{box:vec:sort}), we can use \Rfunction{order()} to generate the indexes needed to sort either columns or rows of a data frame. When we want to sort rows, the argument to \Rfunction{order()} is usually a column of the data frame being arranged. However, any vector of suitable length, including the result of applying a function to one or more columns, can be passed as an argument to \Rfunction{order()}.

\begin{playground}\index{data frames!ordering rows}
The first task to be completed is to sort a data frame based on the values in one column, using indexing and \Rfunction{order()}. Create a new data frame and with three numeric columns with three different haphazard sequences of values. Call these columns \code{A}, \code{B} and \code{C}. 1) Sort the rows of the data frame so that the values in \code{A} are in decreasing order. 2) Sort the rows of the data frame according to increasing values of the sum of \code{A} and \code{B} without adding a new column to the data frame or storing the vector of sums in a variable. In other words, do the sorting based on sums calculated on the fly.
\end{playground}

\begin{advplayground}\index{data frames!ordering rows}
Repeat the tasks in the playground immediately above but using factors instead of numeric vectors as columns in the data frame. Hint: revisit the exercise on page \pageref{calc:ADVPG:order:sort} were the use of \Rfunction{order()} on factors is described.
\end{advplayground}

\index{data frames|)}


\section{Attributes of R objects}\label{sec:calc:attributes}
\index{attributes|(}

\Rlang objects can have attributes. Attributes are normally used to store ancillary data. They are used by \Rlang itself to store things like column names in data frames and labels of factor levels. All these attributes are visible to user code, and user code can read and write objects' attributes. Attribute \code{"comment"} is meant to be set by users---e.g.,  to store metadata together with data.\qRfunction{comment()}\qRfunction{comment()<-}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a.df} \hlkwb{<-} \hlkwd{data.frame}\hlstd{(}\hlkwc{x} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{6}\hlstd{,} \hlkwc{y} \hlstd{=} \hlstr{"a"}\hlstd{,} \hlkwc{z} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{TRUE}\hlstd{,} \hlnum{FALSE}\hlstd{))}
\hlkwd{comment}\hlstd{(a.df)}
\end{alltt}
\begin{verbatim}
## NULL
\end{verbatim}
\begin{alltt}
\hlkwd{comment}\hlstd{(a.df)} \hlkwb{<-} \hlstr{"this is stored as a comment"}
\hlkwd{comment}\hlstd{(a.df)}
\end{alltt}
\begin{verbatim}
## [1] "this is stored as a comment"
\end{verbatim}
\end{kframe}
\end{knitrout}

Methods like \Rfunction{names()}, \Rfunction{dim()} or \Rfunction{levels()} return values retrieved from attributes stored in \Rlang objects, and methods like \Rfunction{names()<-}, \Rfunction{dim()<-} or \Rfunction{levels()<-} set (or unset with \code{NULL}) the value of the respective attributes. Specific query and set methods do not exist for all attributes. Methods \Rfunction{attr()}, \Rfunction{attr()<-} and \Rfunction{attributes()} can be used with any attribute. In addition, method \Rfunction{str()} displays all components of \Rlang objects including their attributes.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{names}\hlstd{(a.df)}
\end{alltt}
\begin{verbatim}
## [1] "x" "y" "z"
\end{verbatim}
\begin{alltt}
\hlkwd{names}\hlstd{(a.df)} \hlkwb{<-} \hlkwd{toupper}\hlstd{(}\hlkwd{names}\hlstd{(a.df))}
\hlkwd{names}\hlstd{(a.df)}
\end{alltt}
\begin{verbatim}
## [1] "X" "Y" "Z"
\end{verbatim}
\begin{alltt}
\hlkwd{attr}\hlstd{(a.df,} \hlstr{"names"}\hlstd{)} \hlcom{# same as previous line}
\end{alltt}
\begin{verbatim}
## [1] "X" "Y" "Z"
\end{verbatim}
\begin{alltt}
\hlkwd{attr}\hlstd{(a.df,} \hlstr{"my.attribute"}\hlstd{)} \hlkwb{<-} \hlstr{"this is stored in my attribute"}
\hlkwd{attributes}\hlstd{(a.df)}
\end{alltt}
\begin{verbatim}
## $names
## [1] "X" "Y" "Z"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6
##
## $comment
## [1] "this is stored as a comment"
##
## $my.attribute
## [1] "this is stored in my attribute"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{warningbox}
There is no restriction to the creation, setting, resetting and reading of attributes, but not all methods and operators that can be used to modify objects will preserve non-standard attributes. So, using private attributes is a double-edged sword that usually is worthwhile considering only when designing a new class together with the corresponding methods for it. A good example of extensive use of class-specific attributes are the values returned by model fitting functions like \Rfunction{lm()} (see section \ref{sec:stat:LM} on page \pageref{sec:stat:LM}).

Even the class of S3 objects is stored as an attribute that is accessible as any other attribute---this is in contrast to the mode and atomic class of an object. Object-oriented programming in \Rlang in explained in section \ref{sec:script:objects:classes:methods} on page \pageref{sec:script:objects:classes:methods}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{numbers} \hlkwb{<-} \hlnum{1}\hlopt{:}\hlnum{10}
\hlkwd{class}\hlstd{(numbers)}
\end{alltt}
\begin{verbatim}
## [1] "integer"
\end{verbatim}
\begin{alltt}
\hlkwd{attributes}\hlstd{(numbers)}
\end{alltt}
\begin{verbatim}
## NULL
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a.factor} \hlkwb{<-} \hlkwd{factor}\hlstd{(numbers)}
\hlkwd{class}\hlstd{(a.factor)}
\end{alltt}
\begin{verbatim}
## [1] "factor"
\end{verbatim}
\begin{alltt}
\hlkwd{attributes}\hlstd{(a.factor)}
\end{alltt}
\begin{verbatim}
## $levels
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
##
## $class
## [1] "factor"
\end{verbatim}
\end{kframe}
\end{knitrout}
\end{warningbox}


\index{attributes|)}

\section{Saving and loading data}

\subsection{Data sets in R and packages}
\index{data!loading data sets|(}
To be able to present more meaningful examples, we need some real data. Here we use \code{cars}, one of the many data sets included in base \Rpgrm. Function \Rfunction{data()} is used to load data objects that are included in \Rlang or contained in packages. It is also possible to import data saved in files with \textit{foreign} formats, defined by other software or commonly used for data exchange. Package \pkgname{foreign}, included in the \Rlang distribution, as well as contributed packages make available functions capable of reading and decoding various foreign formats. How to read or import ``foreign'' data is discussed in \Rlang documentation in \emph{R Data Import/Export}, and in this book, in chapter \ref{chap:R:data:io} starting on page \pageref{chap:R:data:io}. It is also good to keep in mind that in \Rlang, URLs (Uniform Resource Locators) are accepted as arguments to the \code{file} or \code{path} parameter of many functions (see section \ref{sec:files:remote} starting on page \pageref{sec:files:remote}).

In the next example we load data included in \Rlang as \Rlang objects by calling function \Rfunction{data()}. The loaded \Rlang object \code{cars} is a data frame.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{data}\hlstd{(cars)}
\end{alltt}
\end{kframe}
\end{knitrout}

Once we have a data set available, the first step is usually to explore it, and we will do this with \code{cars} in section \ref{sec:calc:looking:at:data} on page \pageref{sec:calc:looking:at:data}.
\index{data!loading data sets|)}

\subsection{.rda files}\label{sec:data:rda}

By default, at the end of a session, the current workspace containing the results of your work is saved into a file called \code{.RData}. In addition to saving the whole workspace, it is possible to save one or more \Rlang objects present in the workspace to disk using the same file format (with file name tag \code{.rda} or \code{.Rda}). One or more objects, belonging to any mode or class can be saved into a single file using function \Rfunction{save()}. Reading the file restores all the saved objects into the current workspace with their original names. These files are portable across most \Rlang versions---i.e., old formats can be read and written by newer versions of R, although the newer, default format may be not readable with earlier \Rlang versions. Whether compression is used, and whether the ``binary'' data is encoded into ASCII characters, allowing maximum portability at the expense of increased size can be controlled by passing suitable arguments to \Rfunction{save()}.

We create a data frame object and then save it to a file.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.df} \hlkwb{<-} \hlkwd{data.frame}\hlstd{(}\hlkwc{x} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{5}\hlstd{,} \hlkwc{y} \hlstd{=} \hlnum{5}\hlopt{:}\hlnum{1}\hlstd{)}
\hlstd{my.df}
\end{alltt}
\begin{verbatim}
##   x y
## 1 1 5
## 2 2 4
## 3 3 3
## 4 4 2
## 5 5 1
\end{verbatim}
\begin{alltt}
\hlkwd{save}\hlstd{(my.df,} \hlkwc{file} \hlstd{=} \hlstr{"my-df.rda"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

We delete the data frame object and confirm that it is no longer present in the workspace.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{rm}\hlstd{(my.df)}
\hlkwd{ls}\hlstd{(}\hlkwc{pattern} \hlstd{=} \hlstr{"my.df"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## character(0)
\end{verbatim}
\end{kframe}
\end{knitrout}

We read the file we earlier saved to restore the object.\qRfunction{load()}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{load}\hlstd{(}\hlkwc{file} \hlstd{=} \hlstr{"my-df.rda"}\hlstd{)}
\hlkwd{ls}\hlstd{(}\hlkwc{pattern} \hlstd{=} \hlstr{"my.df"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "my.df"
\end{verbatim}
\begin{alltt}
\hlstd{my.df}
\end{alltt}
\begin{verbatim}
##   x y
## 1 1 5
## 2 2 4
## 3 3 3
## 4 4 2
## 5 5 1
\end{verbatim}
\end{kframe}
\end{knitrout}

The default format used is binary and compressed, which results in smaller files.

\begin{playground}
In the example above, only one object was saved, but one can simply give the names of additional objects as arguments. Just try saving more than one data frame to the same file. Then the data frames plus a few vectors. After creating each file, clear the workspace and then restore from the file the objects you saved.
\end{playground}

Sometimes it is easier to supply the names of the objects to be saved as a vector of character strings passed as an argument to parameter \code{list}. One case is when wanting to save a group of objects based on their names. We can use \Rfunction{ls()} to list the names of objects matching a simple \code{pattern} or a complex regular expression. The example below does this in two steps, first saving a character vector with the names of the objects matching a pattern, and then using this saved vector as an argument to \code{save}'s \code{list} parameter.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{objcts} \hlkwb{<-} \hlkwd{ls}\hlstd{(}\hlkwc{pattern} \hlstd{=} \hlstr{"*.df"}\hlstd{)}
\hlkwd{save}\hlstd{(}\hlkwc{list} \hlstd{= objcts,} \hlkwc{file} \hlstd{=} \hlstr{"my-df1.rda"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

The two statements above can be combined into a single statement by nesting the function calls.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{save}\hlstd{(}\hlkwc{list} \hlstd{=} \hlkwd{ls}\hlstd{(}\hlkwc{pattern} \hlstd{=} \hlstr{"*.df"}\hlstd{),} \hlkwc{file} \hlstd{=} \hlstr{"my-df1.rda"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{playground}
Practice using different patterns with \Rfunction{ls()}. You do not need to save the objects to a file. Just have a look at the list of object names returned.
\end{playground}

As a coda, we show how to clean up by deleting the two files we created. Function \Rfunction{unlink()} can be used to delete any files for which the user has enough rights.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{unlink}\hlstd{(}\hlkwd{c}\hlstd{(}\hlstr{"my-df.rda"}\hlstd{,} \hlstr{"my-df1.rda"}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}

\subsection{.rds files}\label{sec:data:rds}

The RDS format can be used to save individual objects instead of multiple objects (usually using file name tag \code{.rds}). They are read and saved with functions \Rfunction{readRDS()} and \Rfunction{saveRDS()}, respectively.
When RDS files are read, different from when RDA files are loaded, we need to assign the object read to a possibly different name for it to added to the search pass. Of course, it is also possible to use the returned object as an argument to a function or in an expression without saving it to a variable.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{saveRDS}\hlstd{(my.df,} \hlstr{"my-df.rds"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

If we read the file, by default the read \Rlang object will be printed at the console.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{readRDS}\hlstd{(}\hlstr{"my-df.rds"}\hlstd{)}
\end{alltt}
\begin{verbatim}
##   x y
## 1 1 5
## 2 2 4
## 3 3 3
## 4 4 2
## 5 5 1
\end{verbatim}
\end{kframe}
\end{knitrout}

In the next example we assign the read object to a different name, and check that the object read is identical to the one saved.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my_read.df} \hlkwb{<-} \hlkwd{readRDS}\hlstd{(}\hlstr{"my-df.rds"}\hlstd{)}
\hlkwd{identical}\hlstd{(my.df, my_read.df)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\end{kframe}
\end{knitrout}

As above, we clean up by deleting the file.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{unlink}\hlstd{(}\hlstr{"my-df.rds"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

\section{Looking at data}\label{sec:calc:looking:at:data}
\index{data!exploration at the R console|(}
There are several functions in \Rlang that let us obtain different views into objects. Function \Rfunction{print()} is useful for small data sets, or objects. Especially in the case of large data frames, we need to explore them step by step. In the case of named components, we can obtain their names with \Rfunction{colnames()}, \Rfunction{rownames()}, and \Rfunction{names()}. If a data frame contains many rows of observations, \Rfunction{head()} and \Rfunction{tail()} allow us to easily restrict the number of rows printed. Functions \Rfunction{nrow()} and \Rfunction{ncol()} return the number of rows and columns in the data frame (also applicable to matrices but not to lists or vectors where we use \Rfunction{length()}). As mentioned earlier, function \Rfunction{str()} concisely displays the structure of \Rlang objects.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{class}\hlstd{(cars)}
\end{alltt}
\begin{verbatim}
## [1] "data.frame"
\end{verbatim}
\begin{alltt}
\hlkwd{nrow}\hlstd{(cars)}
\end{alltt}
\begin{verbatim}
## [1] 50
\end{verbatim}
\begin{alltt}
\hlkwd{ncol}\hlstd{(cars)}
\end{alltt}
\begin{verbatim}
## [1] 2
\end{verbatim}
\begin{alltt}
\hlkwd{names}\hlstd{(cars)}
\end{alltt}
\begin{verbatim}
## [1] "speed" "dist"
\end{verbatim}
\begin{alltt}
\hlkwd{head}\hlstd{(cars)}
\end{alltt}
\begin{verbatim}
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
\end{verbatim}
\begin{alltt}
\hlkwd{tail}\hlstd{(cars)}
\end{alltt}
\begin{verbatim}
##    speed dist
## 45    23   54
## 46    24   70
## 47    24   92
## 48    24   93
## 49    24  120
## 50    25   85
\end{verbatim}
\begin{alltt}
\hlkwd{str}\hlstd{(cars)}
\end{alltt}
\begin{verbatim}
## 'data.frame':	50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
Look up the help pages for \Rfunction{head()} and \Rfunction{tail()}, and edit the code above to print only the first two lines, or only the last three lines of \code{cars}, respectively.
\end{playground}

The different columns of a data frame can be factors or vectors of various modes (e.g.,  numeric, logical, character, etc.) (see section \ref{sec:R:data:frames} on page \pageref{sec:R:data:frames}).
To explore the mode of the columns of \code{cars}, we can use an \emph{apply} function. In the present case, we want to apply function \code{class()} to each column of the data frame \code{cars}. (Apply functions are described in section \ref{sec:data:apply} on page \pageref{sec:data:apply}.)
\qRloop{sapply}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{sapply}\hlstd{(}\hlkwc{X} \hlstd{= cars,} \hlkwc{FUN} \hlstd{= class)}
\end{alltt}
\begin{verbatim}
##     speed      dist
## "numeric" "numeric"
\end{verbatim}
\end{kframe}
\end{knitrout}

The statement above returns a vector of character strings, with the mode of each column. Each element of the vector is named according to the name of the corresponding ``column'' in the data frame. For this same statement to be used with any other data frame or list, we need only to substitute the name of the object, the argument to the first parameter called \code{X}, to the one of current interest.

\begin{playground}
Data set \code{airquality} contains data from air quality measurements in New York, and, being included in the \Rpgrm distribution, can be loaded with \code{data(airquality)}. Load it, and repeat the steps above, to learn what variables (columns) it contains, their classes, the number of rows, etc.
\end{playground}

Function \Rfunction{summary()} can be used to obtain a summary from objects of most \Rlang classes, including data frames. We can also use \Rloop{sapply()}, \Rloop{lapply()} or \Rloop{vapply()} to apply any suitable function to individual columns.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{summary}\hlstd{(cars)}
\end{alltt}
\begin{verbatim}
##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25.0   Max.   :120.00
\end{verbatim}
\begin{alltt}
\hlkwd{sapply}\hlstd{(cars, range)}
\end{alltt}
\begin{verbatim}
##      speed dist
## [1,]     4    2
## [2,]    25  120
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{advplayground}
Obtain the summary of \code{airquality} with function \Rfunction{summary()}, but in addition, write code with an \emph{apply} function to count the number of non-missing values in each column. Hint: using \code{sum()} on a \code{logical} vector returns the count of \code{TRUE} values as \code{TRUE}, and \code{FALSE} are transparently converted into \code{numeric} 1 and 0, respectively, when \code{logical} values are used in arithmetic expressions.
\end{advplayground}

\section{Plotting}
\index{plots!base R graphics}
The base-\Rlang generic method \Rfunction{plot()} can be used to plot different data. It is a generic method that has specializations suitable for different kinds of objects (see section \ref{sec:script:objects:classes:methods} on page \pageref{sec:script:objects:classes:methods} for a brief introduction to objects, classes and methods). In this section we only very briefly demonstrate the use of the most common base-\Rlang graphics functions. They are well described in the book \citebooktitle{Murrell2019} \autocite{Murrell2019}. We will not describe the Lattice (based on S's Trellis) approach to plotting \autocite{Sarkar2008}. Instead we describe in detail the use of the \emph{grammar of graphics} and plotting with package \ggplot in chapter \ref{chap:R:plotting} starting on page \pageref{chap:R:plotting}.

It is possible to pass two variables (here columns from a data frame) directly as arguments to the \code{x} and \code{y} parameters of \Rfunction{plot()}.


\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{plot}\hlstd{(}\hlkwc{x} \hlstd{= cars}\hlopt{$}\hlstd{speed,} \hlkwc{y} \hlstd{= cars}\hlopt{$}\hlstd{dist)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-plot-1-1}

}


\end{knitrout}

It is also possible, and usually more convenient, to use a \emph{formula} to specify the variables to be plotted on the $x$ and $y$ axes, passing additionally as an argument to  parameter \code{data} the name of the data frame containing these variables. The formula \code{dist \textasciitilde\ speed}, is read as \code{dist} explained by \code{speed}---i.e., \code{dist} is mapped to the $y$-axis as the dependent variable and \code{speed} to the $x$-axis as the independent variable.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{plot}\hlstd{(dist} \hlopt{~} \hlstd{speed,} \hlkwc{data} \hlstd{= cars)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-plot-2-1}

}


\end{knitrout}

Within \Rlang there exist different specializations, or ``flavors,'' of method \Rfunction{plot()} that become active depending on the class of the variables passed as arguments: passing two numerical variables results in a scatter plot as seen above. In contrast passing one factor and one numeric variable to \code{plot()} results in a box-and-whiskers plot being produced. To exemplify this we need to use a different data set, here \code{chickwts} as \code{cars} does not contain any factors. Use \code{help("chickwts")} to learn more about this data set, also included in \Rpgrm .

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{plot}\hlstd{(weight} \hlopt{~} \hlstd{feed,} \hlkwc{data} \hlstd{= chickwts)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-plot-3-1}

}


\end{knitrout}

Method \Rfunction{plot()} and variants defined in \Rlang, when used for plotting return their graphical output to a \emph{graphical output device}. In \Rlang, graphical devices are very frequently not real physical devices like a printer, but virtual devices implemented fully in software that translate the plotting commands into a specific graphical file format. Several different graphical devices are available in \Rlang and they differ in the kind of output they produce: raster files (e.g.,  TIFF, PNG and JPEG formats), vector graphics files (e.g.,  SVG, EPS and PDF) or output to a physical device like a window in the screen of a computer. Additional devices are available through contributed \Rlang packages.

Devices follow the paradigm of ON and OFF switches. Some devices producing a file as output, save this output only when the device is closed. When opening a device the user supplies additional information. For the PDF device that produces output in a vector-graphics format, width and height of the output are specified in \emph{inches}. A default file name is used unless we pass a \code{character} string as an argument to parameter \code{file}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{pdf}\hlstd{(}\hlkwc{file} \hlstd{=} \hlstr{"output/my-file.pdf"}\hlstd{,} \hlkwc{width} \hlstd{=} \hlnum{6}\hlstd{,} \hlkwc{height} \hlstd{=} \hlnum{5}\hlstd{,} \hlkwc{onefile} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\hlkwd{plot}\hlstd{(dist} \hlopt{~} \hlstd{speed,} \hlkwc{data} \hlstd{= cars)}
\hlkwd{plot}\hlstd{(weight} \hlopt{~} \hlstd{feed,} \hlkwc{data} \hlstd{= chickwts)}
\hlkwd{dev.off}\hlstd{()}
\end{alltt}
\begin{verbatim}
## cairo_pdf
##         2
\end{verbatim}
\end{kframe}
\end{knitrout}

Raster devices return bitmaps and \code{width} and \code{height} are specified in \emph{pixels}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{png}\hlstd{(}\hlkwc{file} \hlstd{=} \hlstr{"output/my-file.png"}\hlstd{,} \hlkwc{width} \hlstd{=} \hlnum{600}\hlstd{,} \hlkwc{height} \hlstd{=} \hlnum{500}\hlstd{)}
\hlkwd{plot}\hlstd{(weight} \hlopt{~} \hlstd{feed,} \hlkwc{data} \hlstd{= chickwts)}
\hlkwd{dev.off}\hlstd{()}
\end{alltt}
\begin{verbatim}
## cairo_pdf
##         2
\end{verbatim}
\end{kframe}
\end{knitrout}

When \Rlang is used interactively, a device to output the graphical output to a display device is opened automatically. The name of the device may depend on the operating system used (e.g.,  \osname{MS-Windows} or \osname{Linux}) or an IDE---e.g.,  \RStudio defines its own graphic device for output to the Plots pane of its user interface.

\begin{warningbox}
This approach of direct output to a device, and addition of plot components as shown below directly on the output device itself, is not the only approach available. As we will see in chapter \ref{chap:R:plotting} starting on page \pageref{chap:R:plotting}, an alternative approach is to build a \emph{plot object} as a list of member components that is later rendered as a whole on a graphical device by calling \code{print()} once.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{png}\hlstd{(}\hlkwc{file} \hlstd{=} \hlstr{"output/my-file.png"}\hlstd{,} \hlkwc{width} \hlstd{=} \hlnum{600}\hlstd{,} \hlkwc{height} \hlstd{=} \hlnum{500}\hlstd{)}
\hlkwd{plot}\hlstd{(dist} \hlopt{~} \hlstd{speed,} \hlkwc{data} \hlstd{= cars)}
\hlkwd{text}\hlstd{(}\hlkwc{x} \hlstd{=} \hlnum{10}\hlstd{,} \hlkwc{y} \hlstd{=} \hlnum{110}\hlstd{,} \hlkwc{labels} \hlstd{=} \hlstr{"some texts to be added"}\hlstd{)}
\hlkwd{dev.off}\hlstd{()}
\end{alltt}
\begin{verbatim}
## cairo_pdf
##         2
\end{verbatim}
\end{kframe}
\end{knitrout}
\end{warningbox}

\section{Further reading}
For\index{further reading!using the R language} further reading on the aspects of \Rlang discussed in the current chapter,
 I suggest the books \citetitle{Peng2016} (\citeauthor{Peng2016}) and \citetitle{Matloff2011} (\citeauthor{Matloff2011}).


\index{data!exploration at the R console|)}


% !Rnw root = appendix.main.Rnw


\chapter{The R language: ``Paragraphs'' and ``essays''}\label{chap:R:scripts}
\index{scripts}

\begin{VF}
An \Rlang script is simply a text file containing (almost) the same commands that you would enter on the command line of \Rlang.

\VA{Jim Lemon}{\emph{Kickstarting R}}\nocite{LemonND}
\end{VF}

%\dictum[\href{https://cran.r-project.org/doc/contrib/Lemon-kickstart/}{Kickstarting R}]{An R script is simply a text file containing (almost) the same commands that you would enter on the command line of R.}\vskip2ex

\section{Aims of this chapter}

For those who have mainly used graphical user interfaces, understanding why and when scripts can help in communicating a certain data analysis protocol can be revelatory. As soon as a data analysis stops being trivial, describing the steps followed through a system of menus and dialogue boxes becomes extremely tedious.

Moreover, graphical user interfaces tend to be difficult to extend or improve in a way that keeps step-by-step instructions valid across program versions and operating systems.

Many times, exactly the same sequence of commands needs to be applied to different data sets, and scripts make both implementation and validation of such a requirement easy.

In this chapter, I will walk you through the use of \Rpgrm scripts, starting from an extremely simple script.

\section{Writing scripts}

In \Rlang language, the closest match to a natural language essay is a script. A script is built from multiple interconnected code statements needed to complete a given task. Simple statements can be combined into compound statements, which are the equivalent of natural language paragraphs. Scripts can vary from simple scripts containing only a few code statements, to complex scripts containing hundreds of code statements. In the rest of the present section I discuss how to write readable and reliable scripts and how to use them.

\subsection{What is a script?}\label{sec:script:what:is}
\index{scripts!definition}
A \textit{script} is a text file that contains (almost) the same commands that you would type at the console prompt. A true script is not, for example, an MS-Word file where you have pasted or typed some \Rlang commands. A script file has the following characteristics.
\begin{itemize}
  \item The script is a text file.
  \item The file contains valid \Rlang statements (including comments) and nothing else.
  \item Comments start at a \code{\#} and end at the end of the line.
  \item The \Rlang statements are in the file in the order that they must be executed.
  \item \Rlang scripts have file names ending in \texttt{.r} or \texttt{.R}.
\end{itemize}

It is good practice to write scripts so that they are self-contained. To make a script self-contained, one must include calls to \texttt{library()} to load the packages used, load or import data from files, perform the data analysis and display and/or save the results of the analysis. Such scripts can be used to apply the same analysis algorithm to other data and/or to reproduce the same analysis at a later time. Such scripts document all steps used for the analysis.


\subsection{How do we use a script?}\label{sec:script:using}
\index{scripts!sourcing}

A script can be ``sourced'' using function \Rfunction{source()}. If we have a text file called \texttt{my.first.script.r} containing the following text:
\begin{shaded}
\footnotesize
\begin{verbatim}
# this is my first R script
print(3 + 4)
\end{verbatim}
\end{shaded}

and then source this file:

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{source}\hlstd{(}\hlstr{"my.first.script.r"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 7
\end{verbatim}
\end{kframe}
\end{knitrout}

The results of executing the statements contained in the file will appear in the console. The commands themselves are not shown (by default the sourced file is not \emph{echoed} to the console) and the results will not be printed unless you include explicit \Rfunction{print()} commands in the script. This applies in many cases also to plots---e.g., a figure created with \Rfunction{ggplot()} needs to be printed if we want it to be included in the output when the script is run. Adding a redundant \Rfunction{print()} is harmless.

From within \RStudio, if you have an \Rpgrm script open in the editor, there will be a ``source'' icon visible with an attached drop-down menu from which you can choose ``Source'' as described above, or ``Source with echo,'' or ``Source as local job'' for the script in the currently active editor tab.

When a script is \emph{sourced}, the output can be saved to a text file instead of being shown in the console. It is also easy to call \Rpgrm with the \Rlang script file as an argument directly at the operating system shell or command-interpreter prompt---and obviously also from shell scripts. The next two chunks show commands entered at the OS shell command prompt rather than at the \Rlang command prompt.
\begin{shaded}
\footnotesize
\begin{verbatim}
> RScript my.first.script.r
\end{verbatim}
\end{shaded}

You can open an operating system's \emph{shell} from the Tools menu in \RStudio, to run this command. The output will be printed to the shell console. If you would like to save the output to a file, use redirection using the operating system's syntax.
\begin{shaded}
\footnotesize
\begin{verbatim}
> RScript my.first.script.r > my.output.txt
\end{verbatim}
\end{shaded}

Sourcing is very useful when the script is ready, however, while developing a script, or sometimes when testing things, one usually wants to run (or \emph{execute}) one or a few statements at a time. This can be done using the ``run'' button\footnote{If you use a different IDE or editor with an \Rlang mode, the details will vary, but a run command will be usually available.} after either positioning the cursor in the line to be executed, or selecting the text that one would like to run (the selected text can be part of a line, a whole line, or a group of lines, as long as it is syntactically valid). The key-shortcut Ctrl-Enter is equivalent to pressing the ``run'' button in \RStudio.

\subsection{How to write a script}\label{sec:script:writing}
\index{scripts!writing}

As with any type of writing, different approaches may be preferred by different \Rlang users. In general, the approach used, or mix of approaches, will also depend on how confident you are that the statements will work as expected---you already know the best approach vs.\ you are exploring different alternatives.
\begin{description}
\item[If one is very familiar with similar problems] One would just create a new text file and write the whole thing in the editor, and then test it. This is rather unusual.
\item[If one is moderately familiar with the problem] One would write the script as above, but testing it, step by step, as one is writing it. This is usually what I do.
\item[If one is mostly playing around] Then if one is using \RStudio, one can type statements at the console prompt. As you should know by now, everything you run at the console is saved to the ``History.'' In \RStudio, the History is displayed in its own pane, and in this pane one can select any previous statement(s) and by clicking on a single icon, copy and paste them to either the \Rlang console prompt, or the cursor position in the editor pane. In this way one can build a script by copying and pasting from the history to your script file, the bits that have worked as you wanted.
\end{description}

\begin{playground}
By now you should be familiar enough with \Rlang to be able to write your own script.
\begin{enumerate}
  \item Create a new \Rpgrm script (in \RStudio, from the File menu, ``+'' icon, or by typing ``Ctrl + Shift + N'').
  \item Save the file as \texttt{my.second.script.r}.
  \item Use the editor pane in \RStudio to type some \Rpgrm commands and comments.
  \item \emph{Run} individual commands.
  \item \emph{Source} the whole file.
\end{enumerate}
\end{playground}

\subsection{The need to be understandable to people}\label{sec:script:readability}
\index{scripts!readability}

When you write a script, it is either because you want to document what you have done or you want re-use the script at a later time. In either case, the script itself although still meaningful for the computer, could become very obscure to you, and even more to someone seeing it for the first time. This must be avoided by spending time and effort on the writing style.

How does one achieve an understandable script or program?
\begin{itemize}
  \item Avoid the unusual. People using a certain programming language tend to use some implicit or explicit rules of style---style includes \textit{indentation} of statements, \textit{capitalization} of variable and function names. As a minimum try to be consistent with yourself.
  \item Use meaningful names for variables, and any other object. What is meaningful depends on the context. Depending on common use, a single letter may be more meaningful than a long word. However self-explanatory names are usually better: e.g.,  using \code{n.rows} and \code{n.cols} is much clearer than using \code{n1} and \code{n2} when dealing with a matrix of data. Probably \code{number.of.rows} and \code{number.of.columns} would make the script verbose, and take longer to type without gaining anything in return.
  \item How to make the words visible in names: traditionally in \Rlang one would use dots to separate the words and use only lower case. Some years ago, it became possible to use underscores. The use of underscores is quite common nowadays because in some contexts it is ``safer'', as in some situations a dot may have a special meaning. What we call ``camel case'' is only infrequently used in \Rlang programming but is common in other languages like Pascal. An example of camel case is \code{NumCols}.
\end{itemize}

\begin{playground}
Here is an example of bad style in a script. Read \href{https://google.github.io/styleguide/Rguide.xml}{Google's R Style Guide}%\footnote{This is just an example, similar, but not exactly the same style as the style I use myself.}
, and edit the code in the chunk below so that it becomes easier to read.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlnum{2} \hlcom{# height}
\hlstd{b} \hlkwb{<-} \hlnum{4} \hlcom{# length}
\hlstd{C} \hlkwb{<-}
    \hlstd{a} \hlopt{*}
\hlstd{b}
\hlstd{C} \hlkwb{->} \hlstd{variable}
      \hlkwd{print}\hlstd{(}
\hlstr{"area: "}\hlstd{, variable}
\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
\end{playground}

The points discussed above already help a lot. However, one can go further in achieving the goal of human readability by interspersing explanations and code ``chunks'' and using all the facilities of typesetting, even of formatted maths formulas and equations, within the listing of the script. Furthermore, by including the results of the calculations and the code itself in a typeset report built automatically, we can ensure that the results are indeed the result of running the code shown. This greatly contributes to data analysis reproducibility, which is becoming a widespread requirement for any data analysis both in academia and in industry. It is possible not only to typeset whole books like this one, but also whole data-based web sites with these tools.

In the realm of programming, this approach is called literate programming\index{literate programming} and was first proposed by Donald Knuth \autocite{Knuth1984a} through his \pgrmname{WEB} system. In the case of \Rpgrm programming, the first support of literate programming was through \pkgname{Sweave}, which has been mostly superseded by \pkgname{knitr} \autocite{Xie2013}. This package supports the use of \langname{Markdown} or \LaTeX\index{Latex@\LaTeX}\ \autocite{Lamport1994} as the markup language for the textual contents and also formats and adds syntax highlighting to code chunks. \langname{Rmarkdown} is an extension to \langname{Markdown} that makes it easier to include \Rlang code in documents (see \url{http://rmarkdown.rstudio.com/}). It is the basis of \Rlang packages that support typesetting large and complex documents (\pkgname{bookdown}), web sites (\pkgname{blogdown}), package vignettes (\pkgname{pkgdown}) and slides for presentations \autocite{Xie2016,Xie2018}. The use of \pkgname{knitr} is very well integrated into the \RStudio IDE.

This is not strictly an \Rlang programming subject, as it concerns programming in any language. On the other hand, this is an incredibly important skill to learn, but well described in other books and web sites cited in the previous paragraph. This whole book, including figures, has been generated using \pkgname{knitr} and the source code for the book is available through Bitbucket at \url{https://bitbucket.org/aphalo/learnr-book}.

\subsection{Debugging scripts}\label{sec:script:debug}
\index{scripts!debugging}

The use of the word \emph{bug} to describe a problem in computer hardware and software started in 1946 when a real bug, more precisely a moth, got between the contacts of a relay in an electromechanical computer causing it to malfunction and Grace Hooper described the first computer \emph{bug}. The use of the term bug in engineering predates the use in computer science, and consequently, the first use of bug in computing caught on easily because it represented an earlier-used metaphor becoming real.

A suitable quotation from a letter written by Thomas Alva Edison 1878 \autocite[as given by][]{Hughes2004}:
\begin{quotation}
  It has been just so in all of my inventions. The first step is an intuition, and comes with a burst, then difficulties arise--this thing gives out and [it is] then that ``Bugs''--as such little faults and difficulties are called--show themselves and months of intense watching, study and labor are requisite before commercial success or failure is certainly reached.
\end{quotation}

The quoted paragraph above makes clear that only very exceptionally does any new design fully succeed. The same applies to \Rlang scripts as well as any other non-trivial piece of computer code. From this it logically follows that testing and de-bugging are fundamental steps in the development of \Rlang scripts and packages. Debugging, as an activity, is outside the scope of this book. However, clear programming style and good documentation are indispensable for efficient testing and reuse.

Even for scripts used for analyzing a single data set, we need to be confident that the algorithms and their implementation are valid, and able to return correct results. This is true both for scientific reports, expert data-based reports and any data analysis related to assessment of compliance with legislation or regulations. Of course, even in cases when we are not required to demonstrate validity, say for decision making purely internal to a private organization, we will still want to avoid costly mistakes.

The first step in producing reliable computer code is to accept that any code that we write needs to be tested and, if possible, validated. Another important step is to make sure that input is validated within the script and a suitable error produced for bad input (including valid input values falling outside the range that can be reliably handled by the script).

If during testing, or during normal use, a wrong value is returned by a calculation, or no value (e.g.,  the script crashes or triggers a fatal error), debugging consists in finding the cause of the problem. The cause can be either a mistake in the implementation of an algorithm, as well as in the algorithm itself. However, many apparent \emph{bugs} are caused by bad or missing handling of special cases like invalid input values, rounding errors, division by zero, etc., in which a program crashes instead of elegantly issuing a helpful error message.

Diagnosing the source of bugs is, in most cases, like detective work. One uses hunches based on common sense and experience to try to locate the lines of code causing the problem. One follows different \emph{leads} until the case is solved. In most cases, at the very bottom we rely on some sort of divide-and-conquer strategy. For example, we may check the value returned by intermediate calculations until we locate the earliest code statement producing a wrong value. Another common case is when some input values trigger a bug. In such cases it is frequently best to start by testing if different ``cases'' of input lead to errors/crashes or not. Boundary input values are usually the telltale ones: e.g.,  for numbers, zero, negative and positive values, very large values, very small values, missing values (\code{NA}), vectors of length zero (\code{numeric()}), etc.

\begin{warningbox}
  \textbf{Error messages} When debugging, keep in mind that in some cases a single bug can lead to a whole cascade of error messages. Do also keep in mind that typing mistakes, originating when code is entered through the keyboard, can wreak havock in a script: usually there is little correspondence between the number of error messages and the seriousness of the bug triggering them. When several errors are triggered, start by reading the error message printed first, as later errors can be an indirect consequence of earlier ones.
\end{warningbox}

There are special tools, called debuggers, available, and they help enormously. Debuggers allow one to step through the code, executing one statement at a time, and at each pause, allowing the user to inspect the objects present in the \Rlang environment and their values. It is even possible to execute additional statements, say, to modify the value of a variable, while execution is paused. An \Rlang debugger is available within \RStudio and also through the \Rlang console.

When writing your first scripts, you will manage perfectly well, and learn more by running the script one line at a time and when needed temporarily inserting \code{print()} statements to ``look'' at how the value of variables changes at each step. A debugger allows a lot more control, as one can ``step in'' and ``step out'' of function definitions, and set and unset break points where execution will stop, which is especially useful when developing \Rlang packages.

When reproducing the examples in this chapter, do keep this section in mind. In addition, if you get stuck trying to find the cause of a bug, do extend your search both to the most trivial of possible causes, and to the least likely ones (such as a bug in a package installed from CRAN or \Rlang itself). Of course, when suspecting a bug in code you have not written, it is wise to very carefully read the documentation, as the ``bug'' may be just in your understanding of what a certain piece of code is expected to do.  Also keep in mind that as discussed on page \pageref{sec:intro:net:help}, you will be able to find online already-answered questions to many of your likely problems and doubts. For example, Googling for the text of an error message is usually well rewarded.

\begin{warningbox}
When installing packages from other sources than CRAN (e.g.,  development versions from GitHub, Bitbucket or R-Forge, or in-house packages) there is no warranty that conflicts will not happen. Packages (and their versions) released through CRAN are regularly checked for inter-compatibility, while packages released through other channels are usually checked against only a few packages.

Conflicts among packages can easily arise, for example, when they use the same names for objects or functions. In addition, many packages use functions defined in packages in the \Rlang distribution itself or other independently developed packages by importing them. Updates to depended-upon packages can ``break'' (make non-functional) the dependent packages or parts of them. The rigorous testing by CRAN detects such problems in most cases when package revisions are submitted, forcing package maintainers to fix problems before distribution through CRAN is possible. However, if you use other repositories, I recommend that you make sure that revised (especially if under development) versions do work with your own script, before their use in ``production'' (important) data analyses.
\end{warningbox}


\section{Control of execution flow}\label{sec:script:flow:control}
\index{control of execution flow}
We give the name \emph{control of execution statements} to those statements that allow the execution of sections of code when a certain dynamically computed condition is \code{TRUE}. Some of the control of execution flow statements, function like \emph{ON-OFF switches} for program statements. Others allow statements to be executed repeatedly while or until a condition is met, or until all members of a list or a vector are processed.

These \emph{control of execution statements} can be also used at the \Rlang console, but it is usually awkward to do so as they can extend over several lines of text. In simple scripts, the \emph{flow of execution} can be fixed and linear from the first to the last statement in the script. \emph{Control of execution statements} allow flexibility, as they allow conditional execution  and/or repeated execution of statements. The part of the script conditionally executed can be a simple or a compound code statement providing a lot of flexibility. As we will see next, a compound statement can include multiple simple or nested compound statements.

\subsection{Compound statements}
\index{compound code statements}\index{simple code statements}

First of all, we need to consider compound statements. Individual statements can be grouped into compound statements by enclosed them in curly braces.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{print}\hlstd{(}\hlstr{"A"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "A"
\end{verbatim}
\begin{alltt}
\hlstd{\{}
  \hlkwd{print}\hlstd{(}\hlstr{"B"}\hlstd{)}
  \hlkwd{print}\hlstd{(}\hlstr{"C"}\hlstd{)}
\hlstd{\}}
\end{alltt}
\begin{verbatim}
## [1] "B"
## [1] "C"
\end{verbatim}
\end{kframe}
\end{knitrout}

The grouping of the last two statements above is of no consequence by itself, but grouping becomes useful when used together with control-of-execution constructs.

\subsection{Conditional execution}
\index{conditional execution}

Conditional execution allows handling different values, such as negative and non-negative values, differently within a script. This is achieved by evaluating or not (i.e., switching ON and OFF) parts of a script based on the result returned by a logical expression. This expression can also be a \emph{flag}---i.e., a \code{logical} variable set manually, preferable near the top of the script. Use of flags is most useful when switching between two script behaviors depends on multiple sections of code. A frequent use case for flags is jointly enabling and disabling printing of output from multiple statements scattered in over a long script.

\Rpgrm has two types of \emph{if}\index{conditional statements} statements, non-vectorized and vectorized. We will start with the non-vectorized one, which is similar to what is available in most other computer programming languages. We start with toy examples demonstrating how \emph{if} and \emph{if-else} statements work. Later we will see examples closer to real use cases.

\subsubsection[Non-vectorized \texttt{if}, \texttt{else} and \texttt{switch}]{Non-vectorized \code{if}, \code{else} and \code{switch}}
\qRcontrol{if()}\qRcontrol{if()\ldots else}%

The \code{if} construct ``decides,'' depending on a \code{logical} value, whether the next code statement is executed (if \code{TRUE}) or skipped (if \code{FALSE}).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{flag} \hlkwb{<-} \hlnum{TRUE}
\hlkwa{if} \hlstd{(flag)} \hlkwd{print}\hlstd{(}\hlstr{"Hello!"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "Hello!"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
Play with the code above by changing the value assigned to variable \code{flag}, \code{FALSE}, \code{NA}, and \code{logical(0)}.

In the example above we use variable \code{flag} as the \emph{condition}.

Nothing in the \Rlang language prevents this condition from being a \code{logical} constant. Explain why \code{if (TRUE)} in the syntactically-correct statement below is of no practical use.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwa{if} \hlstd{(}\hlnum{TRUE}\hlstd{)} \hlkwd{print}\hlstd{(}\hlstr{"Hello!"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "Hello!"
\end{verbatim}
\end{kframe}
\end{knitrout}
\end{playground}

Conditional execution is much more useful than what could be expected from the previous example, because the statement whose execution is being controlled can be a compound statement of almost any length or complexity. A very simple example follows.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{printing} \hlkwb{<-} \hlnum{TRUE}
\hlkwa{if} \hlstd{(printing) \{}
  \hlkwd{print}\hlstd{(}\hlstr{"A"}\hlstd{)}
  \hlkwd{print}\hlstd{(}\hlstr{"B"}\hlstd{)}
\hlstd{\}}
\end{alltt}
\begin{verbatim}
## [1] "A"
## [1] "B"
\end{verbatim}
\end{kframe}
\end{knitrout}

The condition passed as an argument to \code{if}, enclosed in parentheses, can be anything yielding a \Rclass{logical} vector, however, as this condition is \emph{not} vectorized, only the first element will be used and a warning issued if longer than one.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlnum{10.0}
\hlkwa{if} \hlstd{(a} \hlopt{<} \hlnum{0.0}\hlstd{)} \hlkwd{print}\hlstd{(}\hlstr{"'a' is negative"}\hlstd{)} \hlkwa{else} \hlkwd{print}\hlstd{(}\hlstr{"'a' is not negative"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "'a' is not negative"
\end{verbatim}
\begin{alltt}
\hlkwd{print}\hlstd{(}\hlstr{"This is always printed"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "This is always printed"
\end{verbatim}
\end{kframe}
\end{knitrout}

As can be seen above, the statement immediately following \code{if} is executed if the condition returns \code{TRUE} and that following \code{else} is executed if the condition returns \code{FALSE}. Statements after the conditionally executed \code{if} and \code{else} statements are always executed, independently of the value returned by the condition.

\begin{playground}
Play with the code in the chunk above by assigning different numeric vectors to \code{a}.
\end{playground}


\begin{explainbox}
Do you still remember the rules about continuation lines?

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlcom{# 1}
\hlstd{a} \hlkwb{<-} \hlnum{1}
\hlkwa{if} \hlstd{(a} \hlopt{<} \hlnum{0.0}\hlstd{)} \hlkwd{print}\hlstd{(}\hlstr{"'a' is negative"}\hlstd{)} \hlkwa{else} \hlkwd{print}\hlstd{(}\hlstr{"'a' is not negative"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "'a' is not negative"
\end{verbatim}
\end{kframe}
\end{knitrout}

Why does the statement below (not evaluated here) trigger an error while the one above does not?

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlcom{# 2 (not evaluated here)}
\hlkwd{if} (a < 0.0) \hlkwd{print}(\hlstr{"\hlstr{'a'} is negative"})
else \hlkwd{print}(\hlstr{"\hlstr{'a'} is not negative"})
\end{alltt}
\end{kframe}
\end{knitrout}

How do the continuation line rules apply when we add curly braces as shown below.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlcom{# 1}
\hlstd{a} \hlkwb{<-} \hlnum{1}
\hlkwa{if} \hlstd{(a} \hlopt{<} \hlnum{0.0}\hlstd{) \{}
    \hlkwd{print}\hlstd{(}\hlstr{"'a' is negative"}\hlstd{)}
  \hlstd{\}} \hlkwa{else} \hlstd{\{}
    \hlkwd{print}\hlstd{(}\hlstr{"'a' is not negative"}\hlstd{)}
  \hlstd{\}}
\end{alltt}
\begin{verbatim}
## [1] "'a' is not negative"
\end{verbatim}
\end{kframe}
\end{knitrout}

In the example above, we enclosed a single statement between each pair of curly braces, but as these braces create compound statements, multiple statements could have been enclosed between each pair.
\end{explainbox}

\begin{playground}
Play with the use of conditional execution, with both simple and compound statements, and also think how to combine \code{if} and \code{else} to select among more than two options.
\end{playground}

In \Rlang, the value returned by any compound statement is the value returned by the last simple statement executed within the compound one. This means that we can assign the value returned by an \code{if} and \code{else} statement to a variable. This style is less frequently used, but occasionally can result in easier-to-understand scripts.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlnum{1}
\hlstd{my.message} \hlkwb{<-}
  \hlkwa{if} \hlstd{(a} \hlopt{<} \hlnum{0.0}\hlstd{)} \hlstr{"'a' is negative"} \hlkwa{else} \hlstr{"'a' is not negative"}
\hlkwd{print}\hlstd{(my.message)}
\end{alltt}
\begin{verbatim}
## [1] "'a' is not negative"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
Study the conversion rules between \Rclass{numeric} and \Rclass{logical} values, run each of the statements below, and explain the output based on how type conversions are interpreted, remembering the difference between \emph{floating-point numbers} as implemented in computers and \emph{real numbers} ($\mathbb{R}$) as defined in mathematics.

% chunk contains intentional error-triggering examples
\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwa{if} \hlstd{(}\hlnum{0}\hlstd{)} \hlkwd{print}\hlstd{(}\hlstr{"hello"}\hlstd{)}
\hlkwa{if} \hlstd{(}\hlopt{-}\hlnum{1}\hlstd{)} \hlkwd{print}\hlstd{(}\hlstr{"hello"}\hlstd{)}
\hlkwa{if} \hlstd{(}\hlnum{0.01}\hlstd{)} \hlkwd{print}\hlstd{(}\hlstr{"hello"}\hlstd{)}
\hlkwa{if} \hlstd{(}\hlnum{1e-300}\hlstd{)} \hlkwd{print}\hlstd{(}\hlstr{"hello"}\hlstd{)}
\hlkwa{if} \hlstd{(}\hlnum{1e-323}\hlstd{)} \hlkwd{print}\hlstd{(}\hlstr{"hello"}\hlstd{)}
\hlkwa{if} \hlstd{(}\hlnum{1e-324}\hlstd{)} \hlkwd{print}\hlstd{(}\hlstr{"hello"}\hlstd{)}
\hlkwa{if} \hlstd{(}\hlnum{1e-500}\hlstd{)} \hlkwd{print}\hlstd{(}\hlstr{"hello"}\hlstd{)}
\hlkwa{if} \hlstd{(}\hlkwd{as.logical}\hlstd{(}\hlstr{"true"}\hlstd{))} \hlkwd{print}\hlstd{(}\hlstr{"hello"}\hlstd{)}
\hlkwa{if} \hlstd{(}\hlkwd{as.logical}\hlstd{(}\hlkwd{as.numeric}\hlstd{(}\hlstr{"1"}\hlstd{)))} \hlkwd{print}\hlstd{(}\hlstr{"hello"}\hlstd{)}
\hlkwa{if} \hlstd{(}\hlkwd{as.logical}\hlstd{(}\hlstr{"1"}\hlstd{))} \hlkwd{print}\hlstd{(}\hlstr{"hello"}\hlstd{)}
\hlkwa{if} \hlstd{(}\hlstr{"1"}\hlstd{)} \hlkwd{print}\hlstd{(}\hlstr{"hello"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

Hint: if you need to refresh your understanding of the type conversion rules, see section \ref{sec:calc:type:conversion} on page \pageref{sec:calc:type:conversion}.
\end{playground}

In addition to \Rcontrol{if()}, there is in \Rlang a \Rcontrol{switch()} statement, which we describe next. It can be used to select among \emph{cases}, or several alternative statements, based on an expression evaluating to a \code{numeric} or a \code{character} value of length equal to one. The switch statement returns a value, the value returned by the statement corresponding to the matching switch value, or the default if there is no match and a default return value has been defined in the code.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.object} \hlkwb{<-} \hlstr{"two"}
\hlstd{b} \hlkwb{<-} \hlkwd{switch}\hlstd{(my.object,}
            \hlkwc{one} \hlstd{=} \hlnum{1}\hlstd{,}
            \hlkwc{two} \hlstd{=} \hlnum{1} \hlopt{/} \hlnum{2}\hlstd{,}
            \hlkwc{three} \hlstd{=} \hlnum{1} \hlopt{/} \hlnum{4}\hlstd{,}
            \hlnum{0}
\hlstd{)}
\hlstd{b}
\end{alltt}
\begin{verbatim}
## [1] 0.5
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
    Do play with the use of the switch statement. Look at the documentation for \code{switch()} using \code{help(switch)} and study the examples at the end of the help page.
\end{playground}

The \Rcontrol{switch()} statement can substitute for chained \code{if else} statements when all the conditions are comparisons against different constant values, resulting in more concise and clear code.

\subsubsection[Vectorized \texttt{ifelse()}]{Vectorized \code{ifelse()}}
\index{vectorized ifelse}
Vectorized \emph{ifelse} is a peculiarity of the \Rlang language, but very useful for writing concise code that may execute faster than logically equivalent but not vectorized code.
Vectorized conditional execution is coded by means of \emph{function} \Rcontrol{ifelse()} (written as a single word). This function takes three arguments: a \code{logical} vector usually the result of a test (parameter \code{test}), a result vector for \code{TRUE} cases (parameter \code{yes}), and a result vector for \code{FALSE} cases (parameter \code{no}). At each index position along the vectors, the value included in the returned vector is taken from \code{yes} if \code{test} is \code{TRUE} and from \code{no} if \code{test} is \code{FALSE}. All three arguments can be any \Rlang statement returning the required vector. In the case of vectors passed as arguments to parameters \code{yes} and \code{no}, recycling will take place if they are shorter than the logical vector passed as argument to \code{test}. No recycling ever applies to \code{test}, even if \code{yes} and/or \code{no} are longer than \code{test}. It is customary to pass arguments to \code{ifelse} by position. We give a first example with named arguments to clarify the use of the function.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.test} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlnum{TRUE}\hlstd{,} \hlnum{FALSE}\hlstd{,} \hlnum{TRUE}\hlstd{,} \hlnum{TRUE}\hlstd{)}
\hlkwd{ifelse}\hlstd{(}\hlkwc{test} \hlstd{= my.test,} \hlkwc{yes} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{no} \hlstd{=} \hlopt{-}\hlnum{1}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1]  1 -1  1  1
\end{verbatim}
\end{kframe}
\end{knitrout}

In practice, the most common idiom is to have as an argument passed to \code{test}, the result of a comparison calculated on the fly. In the first example we compute the absolute values for a vector, equivalent to that returned by \Rlang function \code{abs()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{nums} \hlkwb{<-} \hlopt{-}\hlnum{3}\hlopt{:+}\hlnum{3}
\hlkwd{ifelse}\hlstd{(nums} \hlopt{<} \hlnum{0}\hlstd{,} \hlopt{-}\hlstd{nums, nums)}
\end{alltt}
\begin{verbatim}
## [1] 3 2 1 0 1 2 3
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
Some additional examples to play with, with a few surprises. Study the examples below until you understand why returned values are what they are. In addition, create your own examples to test other possible cases. In other words, play with the code until you fully understand how \code{ifelse} works.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlnum{1}\hlopt{:}\hlnum{10}
\hlkwd{ifelse}\hlstd{(a} \hlopt{>} \hlnum{5}\hlstd{,} \hlnum{1}\hlstd{,} \hlopt{-}\hlnum{1}\hlstd{)}
\hlkwd{ifelse}\hlstd{(a} \hlopt{>} \hlnum{5}\hlstd{, a} \hlopt{+} \hlnum{1}\hlstd{, a} \hlopt{-} \hlnum{1}\hlstd{)}
\hlkwd{ifelse}\hlstd{(}\hlkwd{any}\hlstd{(a} \hlopt{>} \hlnum{5}\hlstd{), a} \hlopt{+} \hlnum{1}\hlstd{, a} \hlopt{-} \hlnum{1}\hlstd{)} \hlcom{# tricky}
\hlkwd{ifelse}\hlstd{(}\hlkwd{logical}\hlstd{(}\hlnum{0}\hlstd{), a} \hlopt{+} \hlnum{1}\hlstd{, a} \hlopt{-} \hlnum{1}\hlstd{)} \hlcom{# even more tricky}
\hlkwd{ifelse}\hlstd{(}\hlnum{NA}\hlstd{, a} \hlopt{+} \hlnum{1}\hlstd{, a} \hlopt{-} \hlnum{1}\hlstd{)} \hlcom{# as expected}
\end{alltt}
\end{kframe}
\end{knitrout}
Hint: if you need to refresh your understanding of \code{logical} values and Boolean algebra see section \ref{sec:calc:boolean} on page \pageref{sec:calc:boolean}.
\end{playground}

\begin{warningbox}
In the case of \Rcontrol{ifelse()}, the length of the returned value is determined by the length of the logical vector passed as an argument to its first formal parameter (named \code{test})! A frequent mistake is to use a condition that returns a \code{logical} vector of length one, expecting that it will be recycled because arguments passed to the other formal parameters (named \code{yes} and \code{no}) are longer. However, no recycling will take place, resulting in a returned value of length one, with the remaining elements of the vectors passed to \code{yes} and \code{no} being discarded. Do try this by yourself, using logical vectors of different lengths. You can start with the examples below, making sure you understand why the returned values are what they are.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ifelse}\hlstd{(}\hlnum{TRUE}\hlstd{,} \hlnum{1}\hlopt{:}\hlnum{5}\hlstd{,} \hlopt{-}\hlnum{5}\hlopt{:-}\hlnum{1}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 1
\end{verbatim}
\begin{alltt}
\hlkwd{ifelse}\hlstd{(}\hlnum{FALSE}\hlstd{,} \hlnum{1}\hlopt{:}\hlnum{5}\hlstd{,} \hlopt{-}\hlnum{5}\hlopt{:-}\hlnum{1}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] -5
\end{verbatim}
\begin{alltt}
\hlkwd{ifelse}\hlstd{(}\hlkwd{c}\hlstd{(}\hlnum{TRUE}\hlstd{,} \hlnum{FALSE}\hlstd{),} \hlnum{1}\hlopt{:}\hlnum{5}\hlstd{,} \hlopt{-}\hlnum{5}\hlopt{:-}\hlnum{1}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1]  1 -4
\end{verbatim}
\begin{alltt}
\hlkwd{ifelse}\hlstd{(}\hlkwd{c}\hlstd{(}\hlnum{FALSE}\hlstd{,} \hlnum{TRUE}\hlstd{),} \hlnum{1}\hlopt{:}\hlnum{5}\hlstd{,} \hlopt{-}\hlnum{5}\hlopt{:-}\hlnum{1}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] -5  2
\end{verbatim}
\begin{alltt}
\hlkwd{ifelse}\hlstd{(}\hlkwd{c}\hlstd{(}\hlnum{FALSE}\hlstd{,} \hlnum{TRUE}\hlstd{),} \hlnum{1}\hlopt{:}\hlnum{5}\hlstd{,} \hlnum{0}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 0 2
\end{verbatim}
\end{kframe}
\end{knitrout}
\end{warningbox}

\begin{playground}
Write, using \Rcontrol{ifelse()}, a single statement to combine numbers from the two vectors \code{a} and \code{b} into a result vector \code{d}, based on whether the corresponding value in vector \code{c} is the character \code{"a"} or \code{"b"}. Then print vector \code{d} to make the result visible.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlopt{-}\hlnum{10}\hlopt{:-}\hlnum{1}
\hlstd{b} \hlkwb{<-} \hlopt{+}\hlnum{1}\hlopt{:}\hlnum{10}
\hlstd{c} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlkwd{rep}\hlstd{(}\hlstr{"a"}\hlstd{,} \hlnum{5}\hlstd{),} \hlkwd{rep}\hlstd{(}\hlstr{"b"}\hlstd{,} \hlnum{5}\hlstd{))}
\hlcom{# your code}
\end{alltt}
\end{kframe}
\end{knitrout}

If you do not understand how the three vectors are built, or you cannot guess the values they contain by reading the code, print them, and play with the arguments, until you understnd what each parameter does. Also use \code{help(rep)} and/or \code{help(ifelse)} to access the documentation.
\end{playground}

\subsection{Iteration}
\index{loops|seealso{iteration}}
We give the name \emph{iteration} to the process of repetitive execution of a program statement (simple or compound)---e.g., \emph{computed by iteration}. We use the same word, iteration, to name each one of these repetitions of the execution of a statement---e.g.,  the second iteration.

The section of computer code being executed multiple times, forms a loop (a closed path). Most loops contain a condition that determines when the flow of execution will exit the loop and continue at the next statement following the loop. In \Rlang three types of iteration loops are available: those using \Rloop{for}, \Rloop{while} and \Rloop{repeat} constructs. They differ in how much flexibility they provide with respect to the values they iterate over, and how the condition that terminates the iteration is tested. When the same algorithm can be implemented with more than one of these constructs, using the least flexible of them usually results in the easiest to understand \Rlang scripts. In \Rlang, rather frequently, explicit loops as described in this section can be replaced advantageously by calls to the \emph{apply} functions described in section \ref{sec:data:apply} on page \pageref{sec:data:apply}.

\subsubsection[\texttt{for} loops]{\code{for} loops}
The\index{for loop}\index{iteration!for loop}\qRloop{for} most frequently used type of loop is a \code{for} loop. These loops work in \Rlang on lists or vectors of values to act upon.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{b} \hlkwb{<-} \hlnum{0}
\hlkwa{for} \hlstd{(a} \hlkwa{in} \hlnum{1}\hlopt{:}\hlnum{5}\hlstd{) b} \hlkwb{<-} \hlstd{b} \hlopt{+} \hlstd{a}
\hlstd{b}
\end{alltt}
\begin{verbatim}
## [1] 15
\end{verbatim}
\begin{alltt}
\hlstd{b} \hlkwb{<-} \hlkwd{sum}\hlstd{(}\hlnum{1}\hlopt{:}\hlnum{5}\hlstd{)} \hlcom{# built-in function (faster)}
\hlstd{b}
\end{alltt}
\begin{verbatim}
## [1] 15
\end{verbatim}
\end{kframe}
\end{knitrout}

Here the statement \code{b <- b + a} is executed five times, with variable \code{a} sequentially taking each of the values in \code{1:5}. Instead of a simple statement used here, a compound statement could also have been used for the body of the \Rloop{for} loop.

\begin{warningbox}
It is important to note that a list or vector of length zero is a valid argument to \code{for()}, that triggers no error, but skips the statements in the loop body.
\end{warningbox}

Some examples of use of \code{for} loops---and of how to avoid their use.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlnum{1}\hlstd{,} \hlnum{4}\hlstd{,} \hlnum{3}\hlstd{,} \hlnum{6}\hlstd{,} \hlnum{8}\hlstd{)}
\hlkwa{for}\hlstd{(x} \hlkwa{in} \hlstd{a) \{}\hlkwd{print}\hlstd{(x}\hlopt{*}\hlnum{2}\hlstd{)\}} \hlcom{# print is needed!}
\end{alltt}
\begin{verbatim}
## [1] 2
## [1] 8
## [1] 6
## [1] 12
## [1] 16
\end{verbatim}
\end{kframe}
\end{knitrout}

A call to \Rloop{for} does not return a value. We need to assign values to an object so that they are not lost. If we print at each iteration the value of this object, we can follow how the stored value changes. Printing allows us to see, how the vector grows in length, unless we create a long-enough vector before the start of the loop.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{b} \hlkwb{<-} \hlkwa{for}\hlstd{(x} \hlkwa{in} \hlstd{a) \{x}\hlopt{*}\hlnum{2}\hlstd{\}}
\hlstd{b}
\end{alltt}
\begin{verbatim}
## NULL
\end{verbatim}
\begin{alltt}
\hlstd{b} \hlkwb{<-} \hlkwd{numeric}\hlstd{()}
\hlkwa{for}\hlstd{(i} \hlkwa{in} \hlkwd{seq}\hlstd{(}\hlkwc{along.with} \hlstd{= a)) \{}
  \hlstd{b[i]} \hlkwb{<-} \hlstd{a[i]}\hlopt{^}\hlnum{2}
  \hlkwd{print}\hlstd{(b)}
\hlstd{\}}
\end{alltt}
\begin{verbatim}
## [1] 1
## [1]  1 16
## [1]  1 16  9
## [1]  1 16  9 36
## [1]  1 16  9 36 64
\end{verbatim}
\begin{alltt}
\hlstd{b}
\end{alltt}
\begin{verbatim}
## [1]  1 16  9 36 64
\end{verbatim}
\begin{alltt}
\hlcom{# runs faster if we first allocate a long enough vector}
\hlstd{b} \hlkwb{<-} \hlkwd{numeric}\hlstd{(}\hlkwd{length}\hlstd{(a))}
\hlkwa{for}\hlstd{(i} \hlkwa{in} \hlkwd{seq}\hlstd{(}\hlkwc{along.with} \hlstd{= a)) \{}
  \hlstd{b[i]} \hlkwb{<-} \hlstd{a[i]}\hlopt{^}\hlnum{2}
  \hlkwd{print}\hlstd{(b)}
\hlstd{\}}
\end{alltt}
\begin{verbatim}
## [1] 1 0 0 0 0
## [1]  1 16  0  0  0
## [1]  1 16  9  0  0
## [1]  1 16  9 36  0
## [1]  1 16  9 36 64
\end{verbatim}
\begin{alltt}
\hlstd{b}
\end{alltt}
\begin{verbatim}
## [1]  1 16  9 36 64
\end{verbatim}
\begin{alltt}
\hlcom{# a vectorized expression is simplest and fastest}
\hlstd{b} \hlkwb{<-} \hlstd{a}\hlopt{^}\hlnum{2}
\hlstd{b}
\end{alltt}
\begin{verbatim}
## [1]  1 16  9 36 64
\end{verbatim}
\end{kframe}
\end{knitrout}

In the previous chunk we used \code{seq(along.with = a)} to build a new numeric vector with a sequence of the same length as vector \code{a}. Using this \emph{idiom} is best as it ensures that even the case when \code{a} is an \emph{empty} vector of length zero will be handled correctly, with \code{numeric(0)} assigned to \code{b}.

\begin{playground}\label{box:play:forloop}
Look at the results from the above examples, and try to understand where the returned value comes from in each case. In the code chunk above, \Rfunction{print()} is used within the \emph{loop} to make intermediate values visible. You can add additional \code{print()} statements to visualize other variables, such as \code{i}, or run parts of the code, such as \code{seq(along.with = a)}, by themselves.

In this case, the code examples trigger no errors or warnings, but the same approach can be used for debugging syntactically correct code that does not return the expected results.
\end{playground}

\begin{advplayground}
In the examples above we show the use of \code{seq()} passing a vector as an argument to its parameter \code{along.with}. Run the examples below and explain why the two approaches are equivalent only when the length of \code{a} is one or more. Find the answer by assigning to \code{a}, vectors of different lengths, including zero (using \code{a <- numeric(0)}).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{b} \hlkwb{<-} \hlkwd{numeric}\hlstd{(}\hlkwd{length}\hlstd{(a))}
\hlkwa{for}\hlstd{(i} \hlkwa{in} \hlkwd{seq}\hlstd{(}\hlkwc{along.with} \hlstd{= a)) \{}
  \hlstd{b[i]} \hlkwb{<-} \hlstd{a[i]}\hlopt{^}\hlnum{2}
\hlstd{\}}
\hlkwd{print}\hlstd{(b)}

\hlstd{c} \hlkwb{<-} \hlkwd{numeric}\hlstd{(}\hlkwd{length}\hlstd{(a))}
\hlkwa{for}\hlstd{(i} \hlkwa{in} \hlnum{1}\hlopt{:}\hlkwd{length}\hlstd{(a)) \{}
  \hlstd{c[i]} \hlkwb{<-} \hlstd{a[i]}\hlopt{^}\hlnum{2}
\hlstd{\}}
\hlkwd{print}\hlstd{(c)}
\end{alltt}
\end{kframe}
\end{knitrout}

\end{advplayground}

\begin{explainbox}
\Rloop{for} loops as described above, in the absence of errors, have statically predictable behavior. The compound statement in the loop will be executed once for each member of the vector or list. Special cases may require the alteration of the normal flow of execution in the loop. Two cases are easy to deal with, one is stopping iteration early, which we can do with \Rloop{break()}, and another is jumping ahead to the start of the next iteration, which we can do with \Rloop{next()}.
\end{explainbox}

\subsubsection[\texttt{while} loops]{\code{while} loops}
\Rloop{while} loops\index{iteration!while loop} are frequently useful, even if not as frequently used as \code{for} loops. Instead of a list or vector, they take a logical argument, which is usually an expression, but which can also be a variable.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlnum{2}
\hlkwa{while} \hlstd{(a} \hlopt{<} \hlnum{50}\hlstd{) \{}
  \hlkwd{print}\hlstd{(a)}
  \hlstd{a} \hlkwb{<-} \hlstd{a}\hlopt{^}\hlnum{2}
\hlstd{\}}
\end{alltt}
\begin{verbatim}
## [1] 2
## [1] 4
## [1] 16
\end{verbatim}
\begin{alltt}
\hlkwd{print}\hlstd{(a)}
\end{alltt}
\begin{verbatim}
## [1] 256
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
Make sure that you understand why the final value of \code{a} is larger than 50.
\end{playground}


\begin{advplayground}
The statements above can be simplified to:

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlnum{2}
\hlkwd{print}\hlstd{(a)}
\hlkwa{while} \hlstd{(a} \hlopt{<} \hlnum{50}\hlstd{) \{}
  \hlkwd{print}\hlstd{(a} \hlkwb{<-} \hlstd{a}\hlopt{^}\hlnum{2}\hlstd{)}
\hlstd{\}}
\end{alltt}
\end{kframe}
\end{knitrout}

Explain why this works, and how it relates to the support in \Rlang of \emph{chained} assignments to several variables within a single statement like the one below.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlstd{b} \hlkwb{<-} \hlstd{c} \hlkwb{<-} \hlnum{1}\hlopt{:}\hlnum{5}
\hlstd{a}
\end{alltt}
\end{kframe}
\end{knitrout}

Explain why a second \code{print(a)} has been added before \code{while()}. Hint: experiment if necessary.
\end{advplayground}

\begin{explainbox}
\Rloop{while} loops as described above will terminate when the condition tested is \code{FALSE}. In those cases that require stopping iteration based on an additional test condition within the compound statement, we can call \Rloop{break()} in the body of an \code{if} or \code{else} statement.
\end{explainbox}

\subsubsection[\texttt{repeat} loops]{\code{repeat} loops}
The \Rloop{repeat}\index{iteration!repeat loop} construct is less frequently used, but adds flexibility as termination will always depend on a call to \Rcontrol{break()}, which can be located anywhere within the compound statement that forms the body of the loop. To achieve conditional end of iteration, function \Rcontrol{break()} must be called, as otherwise, iteration in a \code{repeat} loop will not stop.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlnum{2}
\hlkwa{repeat}\hlstd{\{}
  \hlkwd{print}\hlstd{(a)}
  \hlkwa{if} \hlstd{(a} \hlopt{>} \hlnum{50}\hlstd{)} \hlkwa{break}\hlstd{()}
  \hlstd{a} \hlkwb{<-} \hlstd{a}\hlopt{^}\hlnum{2}
\hlstd{\}}
\end{alltt}
\begin{verbatim}
## [1] 2
## [1] 4
## [1] 16
## [1] 256
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
Please explain why the example above returns the values it does. Use the approach of adding \code{print()} statements, as described on page \pageref{box:play:forloop}.
\end{playground}

\begin{explainbox}
Although \code{repeat} loop constructs are easier to read if they have a single condition resulting in termination of iteration, it is allowed by the \Rlang language for the compound statement in the body of a loop to contain more than one call to \Rcontrol{break()}, each within a different \code{if} or \code{else} statement.
\end{explainbox}

\subsection{Explicit loops can be slow in \Rlang}\label{sec:loops:slow}
\index{vectorization}\index{recycling of arguments}\index{iteration}\index{loops!faster alternatives|(}
If you have written programs in other languages, it will feel natural to you to use loops (\Rloop{for}, \Rloop{while}, \Rloop{repeat}) for many of the things for which in \Rlang one would normally use vectorization. In \Rlang, using vectorization whenever possible keeps scripts shorter and easier to understand (at least for those with experience in \Rlang). More importantly, as \Rlang is an interpreted language, vectorized arithmetic tends to be much faster than the use of explicit iteration. In recent versions of \Rpgrm, byte-compilation is used by default and loops may be compiled on the fly, which relieves part of the burden of repeated interpretation. However, even byte-compiled loops are usually slower to execute than efficiently coded vectorized functions and operators.

Execution speed needs to be balanced against the effort invested in writing faster code. However, using vectorization and specific \Rlang functions requires little effort once we are familiar with them. The simplest way of measuring the execution time of an \Rlang expression is to use function \Rfunction{system.time()}. However, the returned time is in seconds and consequently the expression must take long enough to execute for the returned time to have useful resolution. See package \pkgname{microbenchmark} for tools for benchmarking code with better time resolution.\qRloop{for}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{system.time}\hlstd{(\{a} \hlkwb{<-} \hlkwd{numeric}\hlstd{()}
            \hlkwa{for} \hlstd{(i} \hlkwa{in} \hlnum{1}\hlopt{:}\hlnum{1000000}\hlstd{) \{}
              \hlstd{a[i]} \hlkwb{<-} \hlstd{i} \hlopt{/} \hlnum{1000}
              \hlstd{\}}
            \hlstd{\})}
\end{alltt}
\begin{verbatim}
##    user  system elapsed
##    0.44    0.03    0.48
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{explainbox}
Whenever\label{box:vectorization:perf} working with large data sets, or many similar data sets, we will need to take performance into account. As vectorization usually also makes code simpler, it is good style to use vectorization whenever possible. For operations that are frequently used, \Rlang includes specific functions. It is thus important to consider not only vectorization of arithmetic but also check for the availability of performance-optimized functions for specific cases. The results from running the code examples in this box are not included, because they are the same for all chunks. Here we are interested in the execution time, and we leave this as an exercise.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlkwd{rnorm}\hlstd{(}\hlnum{10}\hlopt{^}\hlnum{7}\hlstd{)} \hlcom{# 10 000 0000 pseudo-random numbers}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlcom{# b <- numeric()}
\hlstd{b} \hlkwb{<-} \hlkwd{numeric}\hlstd{(}\hlkwd{length}\hlstd{(a)}\hlopt{-}\hlnum{1}\hlstd{)} \hlcom{# pre-allocate memory}
\hlstd{i} \hlkwb{<-} \hlnum{1}
\hlkwa{while} \hlstd{(i} \hlopt{<} \hlkwd{length}\hlstd{(a)) \{}
  \hlstd{b[i]} \hlkwb{<-} \hlstd{a[i}\hlopt{+}\hlnum{1}\hlstd{]} \hlopt{-} \hlstd{a[i]}
  \hlkwd{print}\hlstd{(b)}
  \hlstd{i} \hlkwb{<-} \hlstd{i} \hlopt{+} \hlnum{1}
\hlstd{\}}
\hlstd{b}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlcom{# b <- numeric()}
\hlstd{b} \hlkwb{<-} \hlkwd{numeric}\hlstd{(}\hlkwd{length}\hlstd{(a)}\hlopt{-}\hlnum{1}\hlstd{)} \hlcom{# pre-allocate memory}
\hlkwa{for}\hlstd{(i} \hlkwa{in} \hlkwd{seq}\hlstd{(}\hlkwc{along.with} \hlstd{= b)) \{}
  \hlstd{b[i]} \hlkwb{<-} \hlstd{a[i}\hlopt{+}\hlnum{1}\hlstd{]} \hlopt{-} \hlstd{a[i]}
  \hlkwd{print}\hlstd{(b)}
\hlstd{\}}
\hlstd{b}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlcom{# although in this case there were alternatives, there}
\hlcom{# are other cases when we need to use indexes explicitly}
\hlstd{b} \hlkwb{<-} \hlstd{a[}\hlnum{2}\hlopt{:}\hlkwd{length}\hlstd{(a)]} \hlopt{-} \hlstd{a[}\hlnum{1}\hlopt{:}\hlkwd{length}\hlstd{(a)}\hlopt{-}\hlnum{1}\hlstd{]}
\hlstd{b}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlcom{# or even better}
\hlstd{b} \hlkwb{<-} \hlkwd{diff}\hlstd{(a)}
\hlstd{b}
\end{alltt}
\end{kframe}
\end{knitrout}

Execution time can be obtained with \Rfunction{system.time()}. For a vector of ten million numbers, the \code{for} loop above takes 1.1\,s and the equivalent \code{while} loop 2.0\,s, the vectorized statement using indexing takes 0.2\,s and function \code{diff()} takes 0.1\,s. The \code{for} loop without pre-allocation of memory to \code{b} takes 3.6\,s, and the equivalent while loop 4.7\,s---i.e., the fastest execution time was more than 40 times faster than the slowest one. (Times for \Rpgrm 3.5.1 on my laptop under Windows 10 x64.)
\end{explainbox}
\index{loops!faster alternatives|)}

\subsection{Nesting of loops}\label{sec:nested:loops}
\index{iteration!nesting of loops}\index{nested iteration loops}\index{loops!nested}

All the execution-flow control statements seen above can be nested. We will show an example with two \code{for} loops. We first create a matrix of data to work with:

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{A} \hlkwb{<-} \hlkwd{matrix}\hlstd{(}\hlnum{1}\hlopt{:}\hlnum{50}\hlstd{,} \hlnum{10}\hlstd{)}
\hlstd{A}
\end{alltt}
\begin{verbatim}
##       [,1] [,2] [,3] [,4] [,5]
##  [1,]    1   11   21   31   41
##  [2,]    2   12   22   32   42
##  [3,]    3   13   23   33   43
##  [4,]    4   14   24   34   44
##  [5,]    5   15   25   35   45
##  [6,]    6   16   26   36   46
##  [7,]    7   17   27   37   47
##  [8,]    8   18   28   38   48
##  [9,]    9   19   29   39   49
## [10,]   10   20   30   40   50
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{row.sum} \hlkwb{<-} \hlkwd{numeric}\hlstd{()}
\hlkwa{for} \hlstd{(i} \hlkwa{in} \hlnum{1}\hlopt{:}\hlkwd{nrow}\hlstd{(A)) \{}
  \hlstd{row.sum[i]} \hlkwb{<-} \hlnum{0}
  \hlkwa{for} \hlstd{(j} \hlkwa{in} \hlnum{1}\hlopt{:}\hlkwd{ncol}\hlstd{(A))}
    \hlstd{row.sum[i]} \hlkwb{<-} \hlstd{row.sum[i]} \hlopt{+} \hlstd{A[i, j]}
\hlstd{\}}
\hlkwd{print}\hlstd{(row.sum)}
\end{alltt}
\begin{verbatim}
##  [1] 105 110 115 120 125 130 135 140 145 150
\end{verbatim}
\end{kframe}
\end{knitrout}

The code above is very general, it will work with any two-dimensional matrix with at least one column and one row. However, sometimes we need more specific calculations. \code{A[1, 2]} selects one cell in the matrix, the one on the first row of the second column. \code{A[1, ]} selects row one, and  \code{A[ , 2]} selects column two. In the example above, the value of \code{i} changes for each iteration of the outer loop. The value of \code{j} changes for each iteration of the inner loop, and the inner loop is run in full for each iteration of the outer loop. The inner loop index \code{j} changes fastest.

\begin{advplayground}
1) Modify the code in the example in the last chunk above so that it sums the values only in the first three columns of \code{A}, 2) modify the same example so that it sums the values only in the last three rows of \code{A}, 3) modify the code so that matrices with dimensions equal to zero (as reported by \code{ncol()} and \code{nrow()}).

Will the code you wrote continue working as expected if the number of rows in \code{A} changed? What if the number of columns in \code{A} changed, and the required results still needed to be calculated for relative positions? What would happen if \code{A} had fewer than three columns? Try to think first what to expect based on the code you wrote. Then create matrices of different sizes and test your code. After that, think how to improve the code, so that wrong results are not produced.
\end{advplayground}

\begin{explainbox}
If the total number of iterations is large and the code executed at each iteration runs fast, the overhead added by the loop code can make a big contribution to the total running time of a script.
When dealing with nested loops, as the inner loop is executed most frequently, this is the best place to look for ways of reducing execution time. In this example, vectorization can be achieved easily for the inner loop, as \Rlang has a function \code{sum()} which returns the sum of a vector passed as its argument. Replacing the inner loop by an efficient function can be expected to improve performance significantly.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{row.sum} \hlkwb{<-} \hlkwd{numeric}\hlstd{(}\hlkwd{nrow}\hlstd{(A))} \hlcom{# faster}
\hlkwa{for} \hlstd{(i} \hlkwa{in} \hlnum{1}\hlopt{:}\hlkwd{nrow}\hlstd{(A)) \{}
  \hlstd{row.sum[i]} \hlkwb{<-} \hlkwd{sum}\hlstd{(A[i, ])}
\hlstd{\}}
\hlkwd{print}\hlstd{(row.sum)}
\end{alltt}
\begin{verbatim}
##  [1] 105 110 115 120 125 130 135 140 145 150
\end{verbatim}
\end{kframe}
\end{knitrout}

\code{A[i, ]} selects row \code{i} and all columns. Reminder: in \Rlang the row index comes first.

Both\index{apply functions} explicit loops can be eliminated if we use an \emph{apply} function, such as \Rloop{apply()}, \Rloop{lapply()} or \Rloop{sapply()}, in place of the outer \code{for} loop. See section \ref{sec:data:apply} below %on page \pageref{sec:data:apply}
for details on the use of the different \emph{apply} functions.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{row.sum} \hlkwb{<-} \hlkwd{apply}\hlstd{(A,} \hlkwc{MARGIN} \hlstd{=} \hlnum{1}\hlstd{, sum)} \hlcom{# MARGIN=1 indicates rows}
\hlkwd{print}\hlstd{(row.sum)}
\end{alltt}
\begin{verbatim}
##  [1] 105 110 115 120 125 130 135 140 145 150
\end{verbatim}
\end{kframe}
\end{knitrout}
Calculating row sums is a frequent operation, so \Rlang has a built-in function for this. As earlier with \code{diff()}, it is always worthwhile to check if there is an existing \Rlang function, optimized for performance, capable of doing the computations we need. In this case, using \code{rowSums()} simplifies the nested loops into a single function call, both improving performance and readability.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{rowSums}\hlstd{(A)}
\end{alltt}
\begin{verbatim}
##  [1] 105 110 115 120 125 130 135 140 145 150
\end{verbatim}
\end{kframe}
\end{knitrout}

\end{explainbox}

\begin{playground}
1) How would you change this last example, so that only the last three columns are added up? (Think about use of subscripts to select a part of the matrix.)
2) To obtain column sums, one could modify the nested loops (think how), transpose the matrix and use \code{rowSums()} (think how), or look up if there is in \Rlang a function for this operation. A good place to start is with \code{help(rowSums)} as similar functions may share the same help page, or at least be listed in the ``See also'' section. Do try this, and explore other help pages in search for some function you may find useful in the analysis of your own data.
\end{playground}

\subsubsection{Clean-up}

Sometimes we need to make sure that clean-up code is executed even if the execution of a script or function is aborted by the user or as a result of an error condition. A typical example is a script that temporarily sets a disk folder as the working directory or uses a file as temporary storage. Function \Rfunction{on.exit()} can be used to record that a user supplied expression needs to be executed when the current function, or a script, exits. Function \Rfunction{on.exit()} can also make code easier to read as it keeps creation and clean-up next to each other in the body of a function or in the listing of a script.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{file.create}\hlstd{(}\hlstr{"temp.file"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{on.exit}\hlstd{(}\hlkwd{file.remove}\hlstd{(}\hlstr{"temp.file"}\hlstd{))}
\hlcom{# code that makes use of the file goes here}
\end{alltt}
\end{kframe}
\end{knitrout}

\section[Apply functions]{\emph{Apply} functions}\label{sec:data:apply}

\emph{Apply}\index{apply functions}\index{loops!faster alternatives} functions apply a function passed as an argument to parameter \code{FUN} or equivalent, to elements in a collection of \Rlang objects passed as an argument to parameter \code{X} or equivalent. Collections to which \code{FUN} is to be applied can be vectors, lists, data frames, matrices or arrays. As long as the operations to be applied are \emph{independent---i.e., the results from one iteration are not used in another iteration---} apply functions can replace \code{for}, \code{while} or \code{repeat} loops.

The different \emph{apply} functions in base \Rlang differ in the class of the values they accept for their \code{X} parameter, the class of the object they return and/or the class of the value returned by the applied function. \Rloop{lapply()} and \Rloop{sapply()} expect a \code{vector} or \code{list} as an argument passed through \code{X}. \Rloop{lapply()} returns a \code{list} or an \code{array}; and \Rloop{vapply()} always \emph{simplifies} its returned value into a vector, while \Rloop{sapply()} does the simplification according to the argument passed to its \code{simplify} parameter. All these \emph{apply} functions can be used to apply an \Rlang function that returns a value of the same or a different class as its argument. In the case of \Rloop{apply()} and \Rloop{lapply()} not even the length of the values returned for each member of the collection passed as an argument, needs to be consistent. In summary, \Rloop{apply()} is used to apply a function to the elements along a dimension of an object that has two or more \emph{dimensions}, and \Rloop{lapply()} and \Rloop{sapply()} are used to apply a function to the members of a vector or list. \Rloop{apply()} returns an array or a list or a vector depending on the size, and consistency in length and class among the values returned by the applied function.

\subsection{Applying functions to vectors and lists}

We first exemplify the use of \Rloop{lapply()}, \Rloop{sapply()} and \Rloop{vapply()}. In the chunks below we apply a user-defined function to a vector.

\begin{warningbox}
A constraint on the function to be applied is that the member object will be always passed as an argument to the first parameter of the applied function.
\end{warningbox}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{set.seed}\hlstd{(}\hlnum{123456}\hlstd{)} \hlcom{# so that a.vector does not change}
\hlstd{a.vector} \hlkwb{<-} \hlkwd{runif}\hlstd{(}\hlnum{6}\hlstd{)} \hlcom{# A short vector as input to keep output short}
\hlkwd{str}\hlstd{(a.vector)}
\end{alltt}
\begin{verbatim}
##  num [1:6] 0.798 0.754 0.391 0.342 0.361 ...
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.fun} \hlkwb{<-} \hlkwa{function}\hlstd{(}\hlkwc{x}\hlstd{,} \hlkwc{k}\hlstd{) \{}\hlkwd{log}\hlstd{(x)} \hlopt{+} \hlstd{k\}}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{z} \hlkwb{<-} \hlkwd{lapply}\hlstd{(}\hlkwc{X} \hlstd{= a.vector,} \hlkwc{FUN} \hlstd{= my.fun,} \hlkwc{k} \hlstd{=} \hlnum{5}\hlstd{)}
\hlkwd{str}\hlstd{(z)}
\end{alltt}
\begin{verbatim}
## List of 6
##  $ : num 4.77
##  $ : num 4.72
##  $ : num 4.06
##  $ : num 3.93
##  $ : num 3.98
##  $ : num 3.38
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{z} \hlkwb{<-} \hlkwd{sapply}\hlstd{(}\hlkwc{X} \hlstd{= a.vector,} \hlkwc{FUN} \hlstd{= my.fun,} \hlkwc{k} \hlstd{=} \hlnum{5}\hlstd{)}
\hlkwd{str}\hlstd{(z)}
\end{alltt}
\begin{verbatim}
##  num [1:6] 4.77 4.72 4.06 3.93 3.98 ...
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{z} \hlkwb{<-} \hlkwd{sapply}\hlstd{(}\hlkwc{X} \hlstd{= a.vector,} \hlkwc{FUN} \hlstd{= my.fun,} \hlkwc{k} \hlstd{=} \hlnum{5}\hlstd{,} \hlkwc{simplify} \hlstd{=} \hlnum{FALSE}\hlstd{)}
\hlkwd{str}\hlstd{(z)}
\end{alltt}
\begin{verbatim}
## List of 6
##  $ : num 4.77
##  $ : num 4.72
##  $ : num 4.06
##  $ : num 3.93
##  $ : num 3.98
##  $ : num 3.38
\end{verbatim}
\end{kframe}
\end{knitrout}

We can see above that the computed results are the same in the three cases, but the class and structure of the objects returned differ.

Anonymous functions can be defined on the fly and passed to \code{FUN}, allowing us to re-write the examples above more concisely (only the second one shown).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{z} \hlkwb{<-} \hlkwd{sapply}\hlstd{(}\hlkwc{X} \hlstd{= a.vector,} \hlkwc{FUN} \hlstd{=} \hlkwa{function}\hlstd{(}\hlkwc{x}\hlstd{,} \hlkwc{k}\hlstd{) \{}\hlkwd{log}\hlstd{(x)} \hlopt{+} \hlstd{k\},} \hlkwc{k} \hlstd{=} \hlnum{5}\hlstd{)}
\hlkwd{str}\hlstd{(z)}
\end{alltt}
\begin{verbatim}
##  num [1:6] 4.77 4.72 4.06 3.93 3.98 ...
\end{verbatim}
\end{kframe}
\end{knitrout}

Of course, as discussed in section \ref{sec:loops:slow} on page \pageref{sec:loops:slow}, when suitable vectorized functions are available, their use should be preferred. On the other hand, even if \emph{apply} functions are usually not as fast as vectorized functions, they are faster than the equivalent \code{for()} loops.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{z} \hlkwb{<-} \hlkwd{log}\hlstd{(a.vector)} \hlopt{+} \hlnum{5}
\hlkwd{str}\hlstd{(z)}
\end{alltt}
\begin{verbatim}
##  num [1:6] 4.77 4.72 4.06 3.93 3.98 ...
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{explainbox}
Function \Rloop{vapply()} can be safer to use as the mode of returned values is enforced. Here is a possible way of obtaining means and variances across member vectors at each vector index position from a list of vectors. These could be called \emph{parallel} means and variances. The argument passed to \code{FUN.VALUE} provides a template for the type of the return value and its organization into rows and columns. Notice that the rows in the output are now named according to the names in \code{FUN.VALUE}.

We first use \code{lapply()} to create the object \code{a.list} containing artificial data. One or more additional \emph{named} arguments can be passed to the function to be applied.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{set.seed}\hlstd{(}\hlnum{123456}\hlstd{)}
\hlstd{a.list} \hlkwb{<-} \hlkwd{lapply}\hlstd{(}\hlkwd{rep}\hlstd{(}\hlnum{4}\hlstd{,} \hlnum{5}\hlstd{), rnorm,} \hlkwc{mean} \hlstd{=} \hlnum{10}\hlstd{,} \hlkwc{sd} \hlstd{=} \hlnum{1}\hlstd{)}
\hlkwd{str}\hlstd{(a.list)}
\end{alltt}
\begin{verbatim}
## List of 5
##  $ : num [1:4] 10.83 9.72 9.64 10.09
##  $ : num [1:4] 12.3 10.8 11.3 12.5
##  $ : num [1:4] 11.17 9.57 9 8.89
##  $ : num [1:4] 9.94 11.17 11.05 10.06
##  $ : num [1:4] 9.26 10.93 11.67 10.56
\end{verbatim}
\end{kframe}
\end{knitrout}

We define the function that we will apply, a function that returns a numeric vector of length 2.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{mean_and_sd} \hlkwb{<-} \hlkwa{function}\hlstd{(}\hlkwc{x}\hlstd{,} \hlkwc{na.rm} \hlstd{=} \hlnum{FALSE}\hlstd{) \{}
       \hlkwd{c}\hlstd{(}\hlkwd{mean}\hlstd{(x,} \hlkwc{na.rm} \hlstd{= na.rm),}  \hlkwd{sd}\hlstd{(x,} \hlkwc{na.rm} \hlstd{= na.rm))}
    \hlstd{\}}
\end{alltt}
\end{kframe}
\end{knitrout}

We next use \Rloop{vapply()} to apply our function to each member vector of the list.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{values} \hlkwb{<-} \hlkwd{vapply}\hlstd{(}\hlkwc{X} \hlstd{= a.list,}
                 \hlkwc{FUN} \hlstd{= mean_and_sd,}
                 \hlkwc{FUN.VALUE} \hlstd{=} \hlkwd{c}\hlstd{(}\hlkwc{mean} \hlstd{=} \hlnum{0}\hlstd{,} \hlkwc{sd} \hlstd{=} \hlnum{0}\hlstd{),}
                 \hlkwc{na.rm} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\hlkwd{class}\hlstd{(values)}
\end{alltt}
\begin{verbatim}
## [1] "matrix" "array"
\end{verbatim}
\begin{alltt}
\hlstd{values}
\end{alltt}
\begin{verbatim}
##            [,1]       [,2]     [,3]       [,4]      [,5]
## mean 10.0725427 11.7254442 9.657997 10.5573814 10.605846
## sd    0.5428149  0.7844356 1.050663  0.6460881  1.005676
\end{verbatim}
\end{kframe}
\end{knitrout}
\end{explainbox}

\begin{playground}
As explained in section \ref{sec:R:data:frames} on page \pageref{sec:R:data:frames}, class \code{data.frame} is derived from class \code{list}. Apply function \code{mean\_and\_sd()} defined above to the data frame \code{cars} included as example data in \Rlang. The aim is to obtain the mean and standard deviation for each column.
\end{playground}

\subsection{Applying functions to matrices and arrays}
In the next example we use \Rloop{apply()} and \Rfunction{mean()} to compute the mean for each column of matrix \code{a.matrix}. In \Rlang the dimensions of a matrix, rows and columns, over which a function is applied are called \emph{margins}. The argument passed to parameter \code{MARGIN} determines over which margin the function will be applied. If the function is applied to individual rows, we say that we operate on the first margin, and if the function is applied to individual columns, over the second margin. Arrays can have many dimensions, and consequently more margins. In the case of arrays with more than two dimensions, it is possible and useful to apply functions over multiple margins at once.

\begin{warningbox}
A constraint on the function to be applied is that the vector or ``slice'' will always be passed as a positional argument to the first formal parameter of the applied function.
\end{warningbox}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a.matrix} \hlkwb{<-} \hlkwd{matrix}\hlstd{(}\hlkwd{runif}\hlstd{(}\hlnum{100}\hlstd{),} \hlkwc{ncol} \hlstd{=} \hlnum{10}\hlstd{)}
\hlstd{z} \hlkwb{<-} \hlkwd{apply}\hlstd{(a.matrix,} \hlkwc{MARGIN} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{FUN} \hlstd{= mean)}
\hlkwd{str}\hlstd{(z)}
\end{alltt}
\begin{verbatim}
##  num [1:10] 0.247 0.404 0.537 0.5 0.504 ...
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
Modify the example above so that it computes row means instead of column means.
\end{playground}

\begin{playground}
Look up the help pages for \Rloop{apply()} and \code{mean()} and study them until you understand how additional arguments can be passed to the applied function. Can you guess why \Rloop{apply()} was designed to have parameter names fully in uppercase, something very unusual for \Rlang code style?
\end{playground}

If we apply a function that returns a value of the same length as its input, then the dimensions of the value returned by \Rloop{apply()} are the same as those of its input. We use, in the next examples, a ``no-op'' function that returns its argument unchanged, so that input and output can be easily compared.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a.small.matrix} \hlkwb{<-} \hlkwd{matrix}\hlstd{(}\hlkwd{rnorm}\hlstd{(}\hlnum{6}\hlstd{,} \hlkwc{mean} \hlstd{=} \hlnum{10}\hlstd{,} \hlkwc{sd} \hlstd{=} \hlnum{1}\hlstd{),} \hlkwc{ncol} \hlstd{=} \hlnum{2}\hlstd{)}
\hlstd{a.small.matrix} \hlkwb{<-} \hlkwd{round}\hlstd{(a.small.matrix,} \hlkwc{digits} \hlstd{=} \hlnum{1}\hlstd{)}
\hlstd{a.small.matrix}
\end{alltt}
\begin{verbatim}
##      [,1] [,2]
## [1,] 11.3 10.4
## [2,] 10.6  8.6
## [3,]  8.2 11.0
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{no_op.fun} \hlkwb{<-} \hlkwa{function}\hlstd{(}\hlkwc{x}\hlstd{) \{x\}}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{z} \hlkwb{<-} \hlkwd{apply}\hlstd{(}\hlkwc{X} \hlstd{= a.small.matrix,} \hlkwc{MARGIN} \hlstd{=} \hlnum{2}\hlstd{,} \hlkwc{FUN} \hlstd{= no_op.fun)}
\hlkwd{class}\hlstd{(z)}
\end{alltt}
\begin{verbatim}
## [1] "matrix" "array"
\end{verbatim}
\begin{alltt}
\hlstd{z}
\end{alltt}
\begin{verbatim}
##      [,1] [,2]
## [1,] 11.3 10.4
## [2,] 10.6  8.6
## [3,]  8.2 11.0
\end{verbatim}
\end{kframe}
\end{knitrout}

In the chunk above, we passed \code{MARGIN = 2}, but if we pass \code{MARGIN = 1}, we get a return value that is transposed! To restore the original layout of the matrix we can transpose the result with function \Rfunction{t()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{z} \hlkwb{<-} \hlkwd{apply}\hlstd{(}\hlkwc{X} \hlstd{= a.small.matrix,} \hlkwc{MARGIN} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{FUN} \hlstd{= no_op.fun)}
\hlstd{z}
\end{alltt}
\begin{verbatim}
##      [,1] [,2] [,3]
## [1,] 11.3 10.6  8.2
## [2,] 10.4  8.6 11.0
\end{verbatim}
\begin{alltt}
\hlkwd{t}\hlstd{(z)}
\end{alltt}
\begin{verbatim}
##      [,1] [,2]
## [1,] 11.3 10.4
## [2,] 10.6  8.6
## [3,]  8.2 11.0
\end{verbatim}
\end{kframe}
\end{knitrout}

A more realistic example, but difficult to grasp without seeing the toy examples shown above, is when we apply a function that returns a value of a different length than its input, but longer than one. When we compute column summaries (\code{MARGIN = 2}), a matrix is returned, with each column containing the summaries for the corresponding column in the original matrix (\code{a.small.matrix}). In contrast, when we compute row summaries (\code{MARGIN = 1}), each column in the returned matrix contains the summaries for one row in the original array. What happens is that by using \Rloop{apply()} the dimension of the original matrix or array over which we compute summaries ``disappears.'' Consequently, given how matrices are stored in \Rlang, when columns collapse into a single value, the rows become columns. After this, the vectors returned by the applied function, are stored as rows.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{mean_and_sd} \hlkwb{<-} \hlkwa{function}\hlstd{(}\hlkwc{x}\hlstd{,} \hlkwc{na.rm} \hlstd{=} \hlnum{FALSE}\hlstd{) \{}
       \hlkwd{c}\hlstd{(}\hlkwd{mean}\hlstd{(x,} \hlkwc{na.rm} \hlstd{= na.rm),}  \hlkwd{sd}\hlstd{(x,} \hlkwc{na.rm} \hlstd{= na.rm))}
    \hlstd{\}}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{z} \hlkwb{<-} \hlkwd{apply}\hlstd{(}\hlkwc{X} \hlstd{= a.small.matrix,} \hlkwc{MARGIN} \hlstd{=} \hlnum{2}\hlstd{,} \hlkwc{FUN} \hlstd{= mean_and_sd,} \hlkwc{na.rm} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\hlstd{z}
\end{alltt}
\begin{verbatim}
##           [,1]   [,2]
## [1,] 10.033333 10.000
## [2,]  1.625833  1.249
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{z} \hlkwb{<-} \hlkwd{apply}\hlstd{(}\hlkwc{X} \hlstd{= a.small.matrix,} \hlkwc{MARGIN} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{FUN} \hlstd{= mean_and_sd,} \hlkwc{na.rm} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\hlstd{z}
\end{alltt}
\begin{verbatim}
##            [,1]     [,2]     [,3]
## [1,] 10.8500000 9.600000 9.600000
## [2,]  0.6363961 1.414214 1.979899
\end{verbatim}
\end{kframe}
\end{knitrout}

In all examples above, we have used ordinary functions. Operators in \Rlang are functions with two formal parameters which can be called using infix notation in expressions---i.e., \code{a + b}. By back-quoting their names they can be called using the same syntax as for ordinary functions, and consequently also passed to the \code{FUN} parameter of apply functions. A toy example, equivalent to the vectorized operation \code{a.vector + 5} follows. We enclosed operator \code{+} in back ticks (\code{`}) and pass by name a constant to its second formal parameter (\code{e2 = 5}).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{set.seed}\hlstd{(}\hlnum{123456}\hlstd{)} \hlcom{# so that a.vector does not change}
\hlstd{a.vector} \hlkwb{<-} \hlkwd{runif}\hlstd{(}\hlnum{10}\hlstd{)}
\hlstd{z} \hlkwb{<-} \hlkwd{sapply}\hlstd{(}\hlkwc{X} \hlstd{= a.vector,} \hlkwc{FUN} \hlstd{= `+`,} \hlkwc{e2} \hlstd{=} \hlnum{5}\hlstd{)}
\hlkwd{str}\hlstd{(z)}
\end{alltt}
\begin{verbatim}
##  num [1:10] 5.8 5.75 5.39 5.34 5.36 ...
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{explainbox}
\textbf{Apply functions vs.\ loop constructs} Apply functions cannot always replace explicit loops as they are less flexible. A simple example is the accumulation pattern, where we ``walk'' through a collection that stores a partial result between iterations. A similar case is a pattern where calculations are done over a ``window'' that moves at each iteration. The simplest and probably most frequent calculation of this kind is the calculation of differences between successive members. Other examples are moving window summaries such as a moving median (see page \pageref{box:vectorization:perf} for other alternatives to the use of explicit iteration loops).
\end{explainbox}

\section{Object names and character strings}

In\index{object names}\index{object names!as character strings} all assignment examples before this section, we have used object names included as literal character strings in the code expressions. In other words, the names are ``decided'' as part of the code, rather than at run time. In scripts or packages, the object name to be assigned may need to be decided at run time and, consequently, be available only as a character string stored in a variable. In this case, function \Rfunction{assign()} must be used instead of the operators \code{<-} or \code{->}. The statements below demonstrate its use.

First using a \code{character} constant.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{assign}\hlstd{(}\hlstr{"a"}\hlstd{,} \hlnum{9.99}\hlstd{)}
\hlstd{a}
\end{alltt}
\begin{verbatim}
## [1] 9.99
\end{verbatim}
\end{kframe}
\end{knitrout}
Next using a \code{character} value stored in a variable.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{name.of.var} \hlkwb{<-} \hlstr{"b"}
\hlkwd{assign}\hlstd{(name.of.var,} \hlnum{9.99}\hlstd{)}
\hlstd{b}
\end{alltt}
\begin{verbatim}
## [1] 9.99
\end{verbatim}
\end{kframe}
\end{knitrout}

The two toy examples above do not demonstrate why one may want to use \Rfunction{assign()}. Common situations where we may want to use character strings to store (future or existing) object names are 1) when we allow users to provide names for objects either interactively or as \code{character} data, 2) when in a loop we transverse a vector or list of object names, or 3) we construct at runtime object names from multiple character strings based on data or settings. A common case is when we import data from a text file and we want to name the object according to the name of the file on disk, or a character string read from the header at the top of the file.

Another case is when \code{character} values are the result of a computation.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwa{for} \hlstd{(i} \hlkwa{in} \hlnum{1}\hlopt{:}\hlnum{5}\hlstd{) \{}
   \hlkwd{assign}\hlstd{(}\hlkwd{paste}\hlstd{(}\hlstr{"zz_"}\hlstd{, i,} \hlkwc{sep} \hlstd{=} \hlstr{""}\hlstd{), i}\hlopt{^}\hlnum{2}\hlstd{)}
\hlstd{\}}
\hlkwd{ls}\hlstd{(}\hlkwc{pattern} \hlstd{=} \hlstr{"zz_*"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "zz_1" "zz_2" "zz_3" "zz_4" "zz_5"
\end{verbatim}
\end{kframe}
\end{knitrout}

The complementary operation of \emph{assigning} a name to an object is to \emph{get} an object when we have available its name as a character string. The corresponding function is \Rfunction{get()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{get}\hlstd{(}\hlstr{"a"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 9.99
\end{verbatim}
\begin{alltt}
\hlkwd{get}\hlstd{(}\hlstr{"b"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 9.99
\end{verbatim}
\end{kframe}
\end{knitrout}

If we have available a character vector containing object names and we want to create a list containing these objects we can use function \Rfunction{mget()}. In the example below we use function \code{ls()} to obtain a character vector of object names matching a specific pattern and then collect all these objects into a list.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{obj_names} \hlkwb{<-} \hlkwd{ls}\hlstd{(}\hlkwc{pattern} \hlstd{=} \hlstr{"zz_*"}\hlstd{)}
\hlstd{obj_lst} \hlkwb{<-} \hlkwd{mget}\hlstd{(obj_names)}
\hlkwd{str}\hlstd{(obj_lst)}
\end{alltt}
\begin{verbatim}
## List of 5
##  $ zz_1: num 1
##  $ zz_2: num 4
##  $ zz_3: num 9
##  $ zz_4: num 16
##  $ zz_5: num 25
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{advplayground}
Think of possible uses of functions \Rfunction{assign()}, \Rfunction{get()} and \Rfunction{mget()} in scripts you use or could use to analyze your own data (or from other sources). Write a script to implement this, and iteratively test and revise this script until the result produced by the script matches your expectations.
\end{advplayground}

\section{The multiple faces of loops}\label{sec:R:faces:of:loops}

\ilAdvanced\ To close this chapter, I will mention some advanced aspects of the \Rlang language that are useful when writing complex scrips---if you are going through the book sequentially, you will want to return to this section after reading chapters \ref{chap:R:statistics} and \ref{chap:R:functions}. In the same way as we can assign names to \code{numeric}, \code{character} and other types of objects, we can assign names to functions and expressions. We can also create lists of functions and/or expressions. The \Rlang language has a very consistent grammar, with all lists and vectors behaving in the same way. The implication of this is that we can assign different functions or expressions to a given name, and consequently it is possible to write loops over lists of functions or expressions.

In this first example we use a \emph{character vector of function names}, and use function \Rfunction{do.call()} as it accepts either character strings or function names as its first argument. We obtain a numeric vector with named members with names matching the function names.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{x} \hlkwb{<-} \hlkwd{rnorm}\hlstd{(}\hlnum{10}\hlstd{)}
\hlstd{results} \hlkwb{<-} \hlkwd{numeric}\hlstd{()}
\hlstd{fun.names} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlstr{"mean"}\hlstd{,} \hlstr{"max"}\hlstd{,} \hlstr{"min"}\hlstd{)}
\hlkwa{for} \hlstd{(f.name} \hlkwa{in} \hlstd{fun.names) \{}
   \hlstd{results[[f.name]]} \hlkwb{<-} \hlkwd{do.call}\hlstd{(f.name,} \hlkwd{list}\hlstd{(x))}
   \hlstd{\}}
\hlstd{results}
\end{alltt}
\begin{verbatim}
##       mean        max        min
##  0.5453427  2.5026454 -1.1139499
\end{verbatim}
\end{kframe}
\end{knitrout}

When traversing a \emph{list of functions} in a loop, we face the problem that we cannot access the original names of the functions as what is stored in the list are the definitions of the functions. In this case, we can hold the function definitions in the loop variable (\code{f} in the chunk below) and call the functions by use of the function call notation (\code{f()}). We obtain a numeric vector with anonymous members.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{results} \hlkwb{<-} \hlkwd{numeric}\hlstd{()}
\hlstd{funs} \hlkwb{<-} \hlkwd{list}\hlstd{(mean, max, min)}
\hlkwa{for} \hlstd{(f} \hlkwa{in} \hlstd{funs) \{}
   \hlstd{results} \hlkwb{<-} \hlkwd{c}\hlstd{(results,} \hlkwd{f}\hlstd{(x))}
   \hlstd{\}}
\hlstd{results}
\end{alltt}
\begin{verbatim}
## [1]  0.5453427  2.5026454 -1.1139499
\end{verbatim}
\end{kframe}
\end{knitrout}

We can use a named list of functions to gain full control of the naming of the results. We obtain a numeric vector with named members with names matching the names given to the list members.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{results} \hlkwb{<-} \hlkwd{numeric}\hlstd{()}
\hlstd{funs} \hlkwb{<-} \hlkwd{list}\hlstd{(}\hlkwc{average} \hlstd{= mean,} \hlkwc{maximum} \hlstd{= max,} \hlkwc{minimum} \hlstd{= min)}
\hlkwa{for} \hlstd{(f} \hlkwa{in} \hlkwd{names}\hlstd{(funs)) \{}
   \hlstd{results[[f]]} \hlkwb{<-} \hlstd{funs[[f]](x)}
   \hlstd{\}}
\hlstd{results}
\end{alltt}
\begin{verbatim}
##    average    maximum    minimum
##  0.5453427  2.5026454 -1.1139499
\end{verbatim}
\end{kframe}
\end{knitrout}

Next is an example using model formulas. We use a loop to fit three models, obtaining a list of fitted models. We cannot pass to \Rfunction{anova()} this list of fitted models, as it expects each fitted model as a separate nameless argument to its \code{\ldots} parameter. We can get around this problem using function \Rfunction{do.call()} to call \Rfunction{anova()}. Function \Rfunction{do.call()} passes the members of the list passed as its second argument as individual arguments to the function being called, using their names if present. \Rfunction{anova()} expects nameless arguments so we need to remove the names present in \code{results}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.data} \hlkwb{<-} \hlkwd{data.frame}\hlstd{(}\hlkwc{x} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{10}\hlstd{,} \hlkwc{y} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{10} \hlopt{+} \hlkwd{rnorm}\hlstd{(}\hlnum{10}\hlstd{,} \hlnum{1}\hlstd{,} \hlnum{0.1}\hlstd{))}
\hlstd{results} \hlkwb{<-} \hlkwd{list}\hlstd{()}
\hlstd{models} \hlkwb{<-} \hlkwd{list}\hlstd{(}\hlkwc{linear} \hlstd{= y} \hlopt{~} \hlstd{x,} \hlkwc{linear.orig} \hlstd{= y} \hlopt{~} \hlstd{x} \hlopt{-} \hlnum{1}\hlstd{,} \hlkwc{quadratic} \hlstd{= y} \hlopt{~} \hlstd{x} \hlopt{+} \hlkwd{I}\hlstd{(x}\hlopt{^}\hlnum{2}\hlstd{))}
\hlkwa{for} \hlstd{(m} \hlkwa{in} \hlkwd{names}\hlstd{(models)) \{}
   \hlstd{results[[m]]} \hlkwb{<-} \hlkwd{lm}\hlstd{(models[[m]],} \hlkwc{data} \hlstd{= my.data)}
   \hlstd{\}}
\hlkwd{str}\hlstd{(results,} \hlkwc{max.level} \hlstd{=} \hlnum{1}\hlstd{)}
\end{alltt}
\begin{verbatim}
## List of 3
##  $ linear     :List of 12
##   ..- attr(*, "class")= chr "lm"
##  $ linear.orig:List of 12
##   ..- attr(*, "class")= chr "lm"
##  $ quadratic  :List of 12
##   ..- attr(*, "class")= chr "lm"
\end{verbatim}
\begin{alltt}
\hlkwd{do.call}\hlstd{(anova,} \hlkwd{unname}\hlstd{(results))}
\end{alltt}
\begin{verbatim}
## Analysis of Variance Table
##
## Model 1: y ~ x
## Model 2: y ~ x - 1
## Model 3: y ~ x + I(x^2)
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)
## 1      8 0.05525
## 2      9 2.31266 -1   -2.2574 306.19 4.901e-07 ***
## 3      7 0.05161  2    2.2611 153.34 1.660e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\end{verbatim}
\end{kframe}
\end{knitrout}

If we had no further use for \code{results} we could simply build a list with nameless members by using positional indexing.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{results} \hlkwb{<-} \hlkwd{list}\hlstd{()}
\hlstd{models} \hlkwb{<-} \hlkwd{list}\hlstd{(y} \hlopt{~} \hlstd{x, y} \hlopt{~} \hlstd{x} \hlopt{-} \hlnum{1}\hlstd{, y} \hlopt{~} \hlstd{x} \hlopt{+} \hlkwd{I}\hlstd{(x}\hlopt{^}\hlnum{2}\hlstd{))}
\hlkwa{for} \hlstd{(i} \hlkwa{in} \hlkwd{seq}\hlstd{(}\hlkwc{along.with} \hlstd{= models)) \{}
   \hlstd{results[[i]]} \hlkwb{<-} \hlkwd{lm}\hlstd{(models[[i]],} \hlkwc{data} \hlstd{= my.data)}
   \hlstd{\}}
\hlkwd{str}\hlstd{(results,} \hlkwc{max.level} \hlstd{=} \hlnum{1}\hlstd{)}
\end{alltt}
\begin{verbatim}
## List of 3
##  $ :List of 12
##   ..- attr(*, "class")= chr "lm"
##  $ :List of 12
##   ..- attr(*, "class")= chr "lm"
##  $ :List of 12
##   ..- attr(*, "class")= chr "lm"
\end{verbatim}
\begin{alltt}
\hlkwd{do.call}\hlstd{(anova, results)}
\end{alltt}
\begin{verbatim}
## Analysis of Variance Table
##
## Model 1: y ~ x
## Model 2: y ~ x - 1
## Model 3: y ~ x + I(x^2)
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)
## 1      8 0.05525
## 2      9 2.31266 -1   -2.2574 306.19 4.901e-07 ***
## 3      7 0.05161  2    2.2611 153.34 1.660e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\end{verbatim}
\end{kframe}
\end{knitrout}

\subsection{Further reading}
For\index{further reading!the R language} further readings on the aspects of \Rlang discussed in the current chapter, I suggest the books \citetitle{Matloff2011} (\citeauthor{Matloff2011}) and \citetitle{Wickham2019} (\citeauthor{Wickham2019}).


% !Rnw root = appendix.main.Rnw


\chapter{The R language: Statistics}\label{chap:R:statistics}

\begin{VF}
The purpose of computing is insight, not numbers.

\VA{Richard W. Hamming}{\emph{Numerical Methods for Scientists and Engineers}, 1987}\nocite{Hamming1987}
\end{VF}

\section{Aims of this chapter}

This chapter aims to give the reader only a quick introduction to statistics in base \Rlang, as there are many good texts on the use of \Rpgrm for different kinds of statistical analyses (see further reading on page \pageref{sec:stat:further:reading}). Although many of base \R's functions are specific to given statistical procedures, they use a particular approach to model specification and for returning the computed values that can be considered a part of the \Rlang language. Here you will learn the approaches used in \Rlang for calculating statistical summaries, generating (pseudo-)random numbers, sampling, fitting models and carrying out tests of significance. We will use linear correlation, \emph{t}-test, linear models, generalized linear models, non-linear models and some simple multivariate methods as examples. My aim is teaching how to specify models, contrasts and data used, and how to access different components of the objects returned by the corresponding fit and summary functions.

%\emph{At present I use several examples adapted from the help pages for the functions described. I may revise this before publication.}

\section{Statistical summaries}
\index{functions!base R}\index{summaries!statistical}
Being the main focus of the \Rlang language in data analysis and statistics, \Rlang provides functions for both simple and complex calculations, going from means and variances to fitting very complex models. Below are examples of functions implementing the calculation of the frequently used data summaries mean or average (\Rfunction{mean()}), variance (\Rfunction{var()}), standard deviation (\Rfunction{sd()}), median (\Rfunction{median()}), mean absolute deviation (\Rfunction{mad()}), mode (\Rfunction{mode()}), maximum (\Rfunction{max()}), minimum (\Rfunction{min()}), range (\Rfunction{range()}), quantiles (\Rfunction{quantile()}), length (\Rfunction{length()}), and all-encompassing summaries (\Rfunction{summary()}). All these methods accept numeric vectors and matrices as an argument. Some of them also have definitions for other classes such as data frames in the case of \Rfunction{summary()}. (The \Rlang language does not define a function for calculation of the standard error of the mean. Please, see section \ref{sec:functions:sem} on page \pageref{sec:functions:sem} for how to define your own.)

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{x} \hlkwb{<-} \hlnum{1}\hlopt{:}\hlnum{20}
\hlkwd{mean}\hlstd{(x)}
\hlkwd{var}\hlstd{(x)}
\hlkwd{sd}\hlstd{(x)}
\hlkwd{median}\hlstd{(x)}
\hlkwd{mad}\hlstd{(x)}
\hlkwd{mode}\hlstd{(x)}
\hlkwd{max}\hlstd{(x)}
\hlkwd{min}\hlstd{(x)}
\hlkwd{range}\hlstd{(x)}
\hlkwd{quantile}\hlstd{(x)}
\hlkwd{length}\hlstd{(x)}
\hlkwd{summary}\hlstd{(x)}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{playground}
  In contrast to many other examples in this book, the summaries computed with the code in the previous chunk are not shown. You should \emph{run} them, using vector \code{x} as defined above, and then play with other real or artificial data that you may find interesting.% Later in the book, only the output from certain examples will be shown, with the expectation, that other examples will be run by readers.
\end{playground}

By default, if the argument contains \code{NAs} these functions return \code{NA}. The logic behind this is that if one value exists but is unknown, the true result of the computation is unknown (see page \pageref{par:special:values} for details on the role of \code{NA} in \Rlang). However, an additional parameter called \code{na.omit} allows us to override this default behavior by requesting any \code{NA} in the input to be omitted (or discarded) before calculation,

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{x} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlnum{1}\hlopt{:}\hlnum{20}\hlstd{,} \hlnum{NA}\hlstd{)}
\hlkwd{mean}\hlstd{(x)}
\end{alltt}
\begin{verbatim}
## [1] NA
\end{verbatim}
\begin{alltt}
\hlkwd{mean}\hlstd{(x,} \hlkwc{na.omit} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] NA
\end{verbatim}
\end{kframe}
\end{knitrout}

\section{Distributions}
\index{distributions|(}\index{Normal distribution}
Density, distribution functions, quantile functions and generation of pseudo-random values for several different distributions are part of the \Rlang language. Entering \code{help(Distributions)} at the \Rlang prompt will open a help page describing all the distributions available in base \Rlang. In what follows we use the Normal distribution for the examples, but with slight differences in their parameters the functions for other theoretical distributions follow a consistent naming pattern. For each distribution the different functions contain the same ``root'' in their names: \code{norm} for the normal distribution, \code{unif} for the uniform distribution, and so on. The ``head'' of the name indicates the type of values returned: ``d'' for density, ``q'' for quantile, ``r'' (pseudo-)random numbers, and ``p'' for probabilities.

\subsection{Density from parameters}
\index{distributions!density from parameters}
Theoretical distributions are defined by mathematical functions that accept parameters that control the exact shape and location. In the case of the Normal distribution, these parameters are the \emph{mean} controlling location and \emph(standard deviation) (or its square, the \emph{variance}) controlling the spread around the center of the distribution.

To obtain a single point from the distribution curve we pass a vector of length one as an argument for \code{x}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{dnorm}\hlstd{(}\hlkwc{x} \hlstd{=} \hlnum{1.5}\hlstd{,} \hlkwc{mean} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{sd} \hlstd{=} \hlnum{0.5}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 0.4839414
\end{verbatim}
\end{kframe}
\end{knitrout}

To obtain multiple values we can pass a longer vector as an argument. As perusing a long vector of numbers is difficult, we plot the result of the computation as a line (\code{type = "l"}) that shows that the 50 generated data points give the illusion of a continuous curve.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.x} \hlkwb{<-} \hlkwd{seq}\hlstd{(}\hlkwc{from} \hlstd{=} \hlopt{-}\hlnum{1}\hlstd{,} \hlkwc{to} \hlstd{=} \hlnum{3}\hlstd{,} \hlkwc{length.out} \hlstd{=} \hlnum{50}\hlstd{)}

\hlstd{my.data} \hlkwb{<-} \hlkwd{data.frame}\hlstd{(}\hlkwc{x} \hlstd{= my.x,}
                      \hlkwc{y} \hlstd{=} \hlkwd{dnorm}\hlstd{(}\hlkwc{x} \hlstd{= my.x,} \hlkwc{mean} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{sd} \hlstd{=} \hlnum{0.5}\hlstd{))}
\hlkwd{plot}\hlstd{(y}\hlopt{~}\hlstd{x,} \hlkwc{data} \hlstd{= my.data,} \hlkwc{type} \hlstd{=} \hlstr{"l"}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-distrib-01a-1}

}


\end{knitrout}

\subsection{Probabilities from parameters and quantiles}\label{sec:prob:quant}
\index{distributions!probabilities from quantiles}

If we have a calculated quantile we can look up the corresponding $p$-value from the Normal distribution. The mean and standard deviation would, in such a case, also be computed from the same observations under the null hypothesis. In the example below, we use invented values for all parameters \code{q}, the quantile, \code{mean}, and \code{sd}, the standard deviation. Use

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{pnorm}\hlstd{(}\hlkwc{q} \hlstd{=} \hlnum{4}\hlstd{,} \hlkwc{mean} \hlstd{=} \hlnum{0}\hlstd{,} \hlkwc{sd} \hlstd{=} \hlnum{1}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 0.9999683
\end{verbatim}
\begin{alltt}
\hlkwd{pnorm}\hlstd{(}\hlkwc{q} \hlstd{=} \hlnum{4}\hlstd{,} \hlkwc{mean} \hlstd{=} \hlnum{0}\hlstd{,} \hlkwc{sd} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{lower.tail} \hlstd{=} \hlnum{FALSE}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 3.167124e-05
\end{verbatim}
\begin{alltt}
\hlkwd{pnorm}\hlstd{(}\hlkwc{q} \hlstd{=} \hlnum{4}\hlstd{,} \hlkwc{mean} \hlstd{=} \hlnum{0}\hlstd{,} \hlkwc{sd} \hlstd{=} \hlnum{4}\hlstd{,} \hlkwc{lower.tail} \hlstd{=} \hlnum{FALSE}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 0.1586553
\end{verbatim}
\begin{alltt}
\hlkwd{pnorm}\hlstd{(}\hlkwc{q} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{2}\hlstd{,} \hlnum{4}\hlstd{),} \hlkwc{mean} \hlstd{=} \hlnum{0}\hlstd{,} \hlkwc{sd} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{lower.tail} \hlstd{=} \hlnum{FALSE}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 2.275013e-02 3.167124e-05
\end{verbatim}
\begin{alltt}
\hlkwd{pnorm}\hlstd{(}\hlkwc{q} \hlstd{=} \hlnum{4}\hlstd{,} \hlkwc{mean} \hlstd{=} \hlnum{0}\hlstd{,} \hlkwc{sd} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{1}\hlstd{,} \hlnum{4}\hlstd{),} \hlkwc{lower.tail} \hlstd{=} \hlnum{FALSE}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 3.167124e-05 1.586553e-01
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{explainbox}
  In tests of significance, empirical $z$-values and $t$-values are computed by subtracting from the observed mean for one group or raw quantile, the ``expected'' mean (possibly a hypothesized theoretical value, the mean of a control condition used as reference, or the mean computed over all treatments under the assumption of no effect of treatments) and then dividing by the standard deviation. Consequently, the $p$-values corresponding to these empirical $z$-values and $t$-values need to be looked up using \code{mean = 0} and \code{sd = 1} when calling \Rfunction{pnorm()} or \Rfunction{pt()} respectively. These frequently used values are the defaults.
\end{explainbox}

\subsection{Quantiles from parameters and probabilities}\label{sec:quant:prob}
\index{distributions!quantiles from probabilities}

The reverse computation from that in the previous section is to obtain the quantile corresponding to a known $p$-value. These quantiles are equivalent to the values in the tables used earlier to assess significance.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{qnorm}\hlstd{(}\hlkwc{p} \hlstd{=} \hlnum{0.01}\hlstd{,} \hlkwc{mean} \hlstd{=} \hlnum{0}\hlstd{,} \hlkwc{sd} \hlstd{=} \hlnum{1}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] -2.326348
\end{verbatim}
\begin{alltt}
\hlkwd{qnorm}\hlstd{(}\hlkwc{p} \hlstd{=} \hlnum{0.05}\hlstd{,} \hlkwc{mean} \hlstd{=} \hlnum{0}\hlstd{,} \hlkwc{sd} \hlstd{=} \hlnum{1}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] -1.644854
\end{verbatim}
\begin{alltt}
\hlkwd{qnorm}\hlstd{(}\hlkwc{p} \hlstd{=} \hlnum{0.05}\hlstd{,} \hlkwc{mean} \hlstd{=} \hlnum{0}\hlstd{,} \hlkwc{sd} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{lower.tail} \hlstd{=} \hlnum{FALSE}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 1.644854
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{warningbox}
Quantile functions like \Rfunction{qnorm()} and probability functions like \Rfunction{pnorm()} always do computations based on a single tail of the distribution, even though it is possible to specify which tail we are interested in. If we are interested in obtaining simultaneous quantiles for both tails, we need to do this manually. If we are aiming at quantiles for $P = 0.05$, we need to find the quantile for each tail based on $P / 2 = 0.025$.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{qnorm}\hlstd{(}\hlkwc{p} \hlstd{=} \hlnum{0.025}\hlstd{,} \hlkwc{mean} \hlstd{=} \hlnum{0}\hlstd{,} \hlkwc{sd} \hlstd{=} \hlnum{1}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] -1.959964
\end{verbatim}
\begin{alltt}
\hlkwd{qnorm}\hlstd{(}\hlkwc{p} \hlstd{=} \hlnum{0.025}\hlstd{,} \hlkwc{mean} \hlstd{=} \hlnum{0}\hlstd{,} \hlkwc{sd} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{lower.tail} \hlstd{=} \hlnum{FALSE}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 1.959964
\end{verbatim}
\end{kframe}
\end{knitrout}

We see above that in the case of a symmetric distribution like the Normal, the quantiles in the two tails differ only in sign. This is not the case for asymmetric distributions.

When calculating a $p$-value from a quantile in a test of significance, we need to first decide whether a two-sided or single-sided test is relevant, and in the case of a single sided test, which tail is of interest. For a two-sided test we need to multiply the returned value by 2.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{pnorm}\hlstd{(}\hlkwc{q} \hlstd{=} \hlnum{4}\hlstd{,} \hlkwc{mean} \hlstd{=} \hlnum{0}\hlstd{,} \hlkwc{sd} \hlstd{=} \hlnum{1}\hlstd{)} \hlopt{*} \hlnum{2}
\end{alltt}
\begin{verbatim}
## [1] 1.999937
\end{verbatim}
\end{kframe}
\end{knitrout}

\end{warningbox}

\subsection{``Random'' draws from a distribution}\label{sec:stat:random}
\index{random draws|see{distributions!pseudo-random draws}}\index{distributions!pseudo-random draws}

True random sequences can only be generated by physical processes. All so-called ``random'' sequences of numbers generated by computation are really deterministic although they share some properties with true random sequences (e.g.,  in relation to autocorrelation). It is possible to compute not only pseudo-random draws from a uniform distribution but also from the Normal, $t$, $F$ and other distributions. Parameter \code{n} indicates the number of values to be drawn, or its equivalent, the length of the vector returned.\qRfunction{rnorm()}\qRfunction{runif()}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{rnorm}\hlstd{(}\hlnum{5}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] -0.8248801  0.1201213 -0.4787266 -0.7134216  1.1264443
\end{verbatim}
\begin{alltt}
\hlkwd{rnorm}\hlstd{(}\hlkwc{n} \hlstd{=} \hlnum{10}\hlstd{,} \hlkwc{mean} \hlstd{=} \hlnum{10}\hlstd{,} \hlkwc{sd} \hlstd{=} \hlnum{2}\hlstd{)}
\end{alltt}
\begin{verbatim}
##  [1] 12.394190  9.697729  9.212345 11.624844 12.194317 10.257707 10.082981
##  [8] 10.268540 10.792963  7.772915
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
Edit the examples in sections \ref{sec:prob:quant}, \ref{sec:quant:prob} and \ref{sec:stat:random} to do computations based on different distributions, such as Student's \emph{t}, \emph{F} or uniform.
\end{playground}

\begin{explainbox}
\index{random numbers|see{pseudo-random numbers}}\index{pseudo-random numbers}
It is impossible to generate truly random sequences of numbers by means of a deterministic process such as a mathematical computation. ``Random numbers'' as generated by \Rpgrm and other computer programs are \emph{pseudo random numbers}, long deterministic series of numbers that resemble random draws. Random number generation uses a \emph{seed} value that determines where in the series we start. The usual way of automatically setting the value of the seed is to take the milliseconds or similar rapidly changing set of digits from the real time clock of the computer. However, in cases when we wish to repeat a calculation using the same series of pseudo-random values, we can use \Rfunction{set.seed()} with an arbitrary integer as an argument to reset the generator to the same point in the underlying (deterministic) sequence.
\end{explainbox}

\begin{advplayground}
Execute the statement \code{rnorm(3)}\qRfunction{rnorm()} by itself several times, paying attention to the values obtained. Repeat the exercise, but now executing \code{set.seed(98765)}\qRfunction{setseed()} immediately before each call to \code{rnorm(3)}, again paying attention to the values obtained. Next execute \code{set.seed(98765)}, followed by \code{c(rnorm(3), rnorm(3))}, and then execute \code{set.seed(98765)}, followed by \code{rnorm(6)} and compare the output. Repeat the exercise using a different argument in the call to \code{set.seed()}. analyze the results and explain how \code{setseed()} affects the generation of pseudo-random numbers in \Rlang.
\end{advplayground}

\section{``Random'' sampling}
\index{random sampling|see{pseudo-random sampling}}%
\index{pseudo-random sampling}%

In addition to drawing values from a theoretical distribution, we can draw values from an existing set or collection of values. We call this operation (pseudo-)random sampling. The draws can be done either with replacement or without replacement. In the second case, all draws are taken from the whole set of values, making it possible for a given value to be drawn more than once. In the default case of not using replacement, subsequent draws are taken from the values remaining after removing the values chosen in earlier draws.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{sample}\hlstd{(}\hlkwc{x} \hlstd{= LETTERS)}
\end{alltt}
\begin{verbatim}
##  [1] "Z" "N" "Y" "R" "M" "E" "W" "J" "H" "G" "U" "O" "S" "T" "L" "F" "X" "P" "K"
## [20] "V" "D" "A" "B" "C" "I" "Q"
\end{verbatim}
\begin{alltt}
\hlkwd{sample}\hlstd{(}\hlkwc{x} \hlstd{= LETTERS,} \hlkwc{size} \hlstd{=} \hlnum{12}\hlstd{)}
\end{alltt}
\begin{verbatim}
##  [1] "M" "S" "L" "R" "B" "D" "Q" "W" "V" "N" "J" "P"
\end{verbatim}
\begin{alltt}
\hlkwd{sample}\hlstd{(}\hlkwc{x} \hlstd{= LETTERS,} \hlkwc{size} \hlstd{=} \hlnum{12}\hlstd{,} \hlkwc{replace} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\end{alltt}
\begin{verbatim}
##  [1] "K" "E" "V" "N" "A" "Q" "L" "C" "T" "L" "H" "U"
\end{verbatim}
\end{kframe}
\end{knitrout}

In practice, pseudo-random sampling is useful when we need to select subsets of observations. One such case is  assigning treatments to experimental units in an experiment or selecting persons to interview in a survey. Another use is in bootstrapping to estimate variation in parameter estimates using empirical distributions.
\index{distributions|)}

\section{Correlation}
\index{correlation|(}
Both parametric (Pearson's) and non-parametric robust (Spearman's and Kendall's) methods for the estimation of the (linear) correlation between pairs of variables are available in base \Rlang. The different methods are selected by passing arguments to a single function. While Pearson's method is based on the actual values of the observations, non-parametric methods are based on the ordering or rank of the observations, and consequently less affected by observations with extreme values.

We first load and explore the data set \Rdata{cars} from \Rlang which we will use in the example. These data consist of stopping distances for cars moving at different speeds as described in the documentation available by entering \code{help(cars)}).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{data}\hlstd{(cars)}
\hlkwd{plot}\hlstd{(cars)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-cor-00-1}

}


\end{knitrout}
\label{chunk:plot:cars}

\subsection{Pearson's $r$}
\index{correlation!parametric}
\index{correlation!Pearson}

Function \Rfunction{cor()} can be called with two vectors of the same length as arguments. In the case of the parametric Pearson method, we do not need to provide further arguments as this method is the default one.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{cor}\hlstd{(}\hlkwc{x} \hlstd{= cars}\hlopt{$}\hlstd{speed,} \hlkwc{y} \hlstd{= cars}\hlopt{$}\hlstd{dist)}
\end{alltt}
\begin{verbatim}
## [1] 0.8068949
\end{verbatim}
\end{kframe}
\end{knitrout}

It is also possible to pass a data frame (or a matrix) as the only argument. When the data frame (or matrix) contains only two columns, the returned value is equivalent to that of passing the two columns individually as vectors.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{cor}\hlstd{(cars)}
\end{alltt}
\begin{verbatim}
##           speed      dist
## speed 1.0000000 0.8068949
## dist  0.8068949 1.0000000
\end{verbatim}
\end{kframe}
\end{knitrout}

When the data frame or matrix contains more than two numeric vectors, the returned value is a matrix of estimates of pairwise correlations between columns. We here use \Rfunction{rnorm()} described above to create a long vector of pseudo-random values drawn from the Normal distribution and \Rfunction{matrix()} to convert it into a matrix with three columns (see page \pageref{sec:matrix:array} for details about \Rlang matrices).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.mat} \hlkwb{<-} \hlkwd{matrix}\hlstd{(}\hlkwd{rnorm}\hlstd{(}\hlnum{54}\hlstd{),} \hlkwc{ncol} \hlstd{=} \hlnum{3}\hlstd{,}
                 \hlkwc{dimnames} \hlstd{=} \hlkwd{list}\hlstd{(}\hlkwc{rows} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{18}\hlstd{,} \hlkwc{cols} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"A"}\hlstd{,} \hlstr{"B"}\hlstd{,} \hlstr{"C"}\hlstd{)))}
\hlkwd{cor}\hlstd{(my.mat)}
\end{alltt}
\begin{verbatim}
##            A         B          C
## A 1.00000000 0.2126595 0.05623007
## B 0.21265951 1.0000000 0.31065243
## C 0.05623007 0.3106524 1.00000000
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
Modify the code in the chunk immediately above constructing a matrix with six columns and then computing the correlations.
\end{playground}

While \Rfunction{cor()} returns and estimate for $r$ the correlation coefficient, \Rfunction{cor.test()} also computes the $t$-value, $p$-value, and confidence interval for the estimate.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{cor.test}\hlstd{(}\hlkwc{x} \hlstd{= cars}\hlopt{$}\hlstd{speed,} \hlkwc{y} \hlstd{= cars}\hlopt{$}\hlstd{dist)}
\end{alltt}
\begin{verbatim}
##
## 	Pearson's product-moment correlation
##
## data:  cars$speed and cars$dist
## t = 9.464, df = 48, p-value = 1.49e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6816422 0.8862036
## sample estimates:
##       cor
## 0.8068949
\end{verbatim}
\end{kframe}
\end{knitrout}

As described below for model fitting and $t$-test, \Rfunction{cor.test()} also accepts a \code{formula} plus \code{data} as arguments.

\begin{playground}
Functions \Rfunction{cor()} and \Rfunction{cor.test()} return \Rlang objects, that when using \Rlang interactively get automatically ``printed'' on the screen. One should be aware that \Rfunction{print()} methods do not necessarily display all the information contained in an \Rlang object. This is almost always the case for complex objects like those returned by \Rlang functions implementing statistical tests. As with any \Rlang object we can save the result of an analysis into a variable. As described in section \ref{sec:calc:lists} on page \pageref{sec:calc:lists} for lists, we can peek into the structure of an object with method \Rfunction{str()}. We can use \Rfunction{class()} and \Rfunction{attributes()} to extract further information. Run the code in the chunk below to discover what is actually returned by \Rfunction{cor()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlkwd{cor}\hlstd{(cars)}
\hlkwd{class}\hlstd{(a)}
\hlkwd{attributes}\hlstd{(a)}
\hlkwd{str}\hlstd{(a)}
\end{alltt}
\end{kframe}
\end{knitrout}

Methods \Rfunction{class()}, \Rfunction{attributes()} and \Rfunction{str()} are very powerful tools that can be used when we are in doubt about the data contained in an object and/or how it is structured. Knowing the structure allows us to retrieve the data members directly from the object when predefined extractor methods are not available.
\end{playground}

\subsection{Kendall's $\tau$ and Spearman's $\rho$}
\index{correlation!non-parametric}
\index{correlation!Kendall}
\index{correlation!Spearman}

We use the same functions as for Pearson's $r$ but explicitly request the use of one of these methods by passing and argument.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{cor}\hlstd{(}\hlkwc{x} \hlstd{= cars}\hlopt{$}\hlstd{speed,} \hlkwc{y} \hlstd{= cars}\hlopt{$}\hlstd{dist,} \hlkwc{method} \hlstd{=} \hlstr{"kendall"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 0.6689901
\end{verbatim}
\begin{alltt}
\hlkwd{cor}\hlstd{(}\hlkwc{x} \hlstd{= cars}\hlopt{$}\hlstd{speed,} \hlkwc{y} \hlstd{= cars}\hlopt{$}\hlstd{dist,} \hlkwc{method} \hlstd{=} \hlstr{"spearman"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 0.8303568
\end{verbatim}
\end{kframe}
\end{knitrout}

Function \Rfunction{cor.test()}, described above, also allows the choice of method with the same syntax as shown for \Rfunction{cor()}.

\begin{playground}
Repeat the exercise in the playground immediately above, but now using non-parametric methods. How does the information stored in the returned \code{matrix} differ depending on the method, and how can we extract information about the method used for calculation of the correlation from the returned object.
\end{playground}
\index{correlation|)}

\section{Fitting linear models}\label{sec:stat:LM}
\index{models!linear|see{linear models}}
\index{linear models|(}
\index{LM|see{linear models}}

In \Rlang, the models to be fitted are described by ``model formulas'' such as \verb|y ~ x| which we read as $y$ is explained by $x$. Model formulas are used in different contexts: fitting of models, plotting, and tests like $t$-test. The syntax of model formulas is consistent throughout base \Rlang and numerous independently developed packages. However, their use is not universal, and several packages extend the basic syntax to allow the description of specific types of models.

As most things in \Rlang, model formulas can be stored in variables. In addition, contrary to the usual behavior of other statistical software, the result of a model fit is returned as an object, containing the different components of the fit. Once the model has been fitted, different methods allow us to extract parts and/or further manipulate the results obtained by fitting a model. Most of these methods have implementations for model fit objects for different types of statistical models. Consequently, what is described in this chapter using linear models as examples, also applies in many respects to the fit of models not described here.

The \Rlang function \Rfunction{lm()} is used to fit linear models. If the explanatory variable is continuous, the fit is a regression. If the explanatory variable is a factor, the fit is an analysis of variance (ANOVA) in broad terms. However, there is another meaning of ANOVA, referring only to the tests of significance rather to an approach to model fitting. Consequently, rather confusingly, results for tests of significance for fitted parameter estimates can both in the case of regression and ANOVA, be presented in an ANOVA table. In this second, stricter meaning, ANOVA means a test of significance based on the ratios between pairs of variances.

\begin{warningbox}
If you do not clearly remember the difference between numeric vectors and factors, or how they can be created, please, revisit chapter \ref{chap:R:as:calc} on page \pageref{chap:R:as:calc}.
\end{warningbox}

\subsection{Regression}
%\index{linear regression}
\index{linear regression|see{linear models!linear regression}}\index{linear models!linear regression}
In \index{linear models!ANOVA table} the example immediately below, \code{speed} is a continuous numeric variable. In the ANOVA table calculated for the model fit, in this case a linear regression, we can see that the term for \code{speed} has only one degree of freedom (df).

In the next example we continue using the stopping distance for \Rdata{cars} data set included in \Rpgrm. Please see the plot on page \pageref{chunk:plot:cars}.
\label{xmpl:fun:lm:fm1}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{data}\hlstd{(cars)}
\hlkwd{is.factor}\hlstd{(cars}\hlopt{$}\hlstd{speed)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlkwd{is.numeric}\hlstd{(cars}\hlopt{$}\hlstd{speed)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\end{kframe}
\end{knitrout}

We then fit the simple linear model $y = \alpha \cdot 1 + \beta \cdot x$ where $y$ corresponds to stopping distance (\code{dist}) and $x$ to initial speed (\code{speed}). Such a model is formulated in \Rlang as \verb|dist ~ 1 + speed|. We save the fitted model as \code{fm1} (a mnemonic for fitted-model one).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{fm1} \hlkwb{<-} \hlkwd{lm}\hlstd{(dist} \hlopt{~} \hlnum{1} \hlopt{+} \hlstd{speed,} \hlkwc{data}\hlstd{=cars)}
\hlkwd{class}\hlstd{(fm1)}
\end{alltt}
\begin{verbatim}
## [1] "lm"
\end{verbatim}
\end{kframe}
\end{knitrout}

The next step is diagnosis of the fit. Are assumptions of the linear model procedure used reasonably close to being fulfilled? In \Rlang it is most common to use plots to this end. We show here only one of the four plots normally produced. This quantile vs.\ quantile plot allows us to assess how much the residuals deviate from being normally distributed.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{plot}\hlstd{(fm1,} \hlkwc{which} \hlstd{=} \hlnum{2}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-models-1a-1}

}


\end{knitrout}

In the case of a regression, calling \Rfunction{summary()} with the fitted model object as argument is most useful as it provides a table of coefficient estimates and their errors. Remember that as is the case for most \Rlang functions, the value returned by \Rfunction{summary()} is printed when we call this method at the \Rlang prompt.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{summary}\hlstd{(fm1)}
\end{alltt}
\begin{verbatim}
##
## Call:
## lm(formula = dist ~ 1 + speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511,	Adjusted R-squared:  0.6438
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
\end{verbatim}
\end{kframe}
\end{knitrout}

Let's\index{linear models!summary table} look at the printout of the summary, section by section. Under ``Call:'' we find, \verb|dist ~ 1 + speed| or the specification of the model fitted, plus the data used. Under ``Residuals:'' we find the extremes, quartiles and median of the residuals, or deviations between observations and the fitted line. Under ``Coefficients:'' we find the estimates of the model parameters and their variation plus corresponding $t$-tests. At the end of the summary there is information on degrees of freedom and overall coefficient of determination ($R^2$).

If we return to the model formulation, we can now replace $\alpha$ and $\beta$ by the estimates obtaining $y = -17.6 + 3.93 x$. Given the nature of the problem, we \emph{know based on first principles} that stopping distance must be zero when speed is zero. This suggests that we should not estimate the value of $\alpha$ but instead set $\alpha = 0$, or in other words, fit the model $y = \beta \cdot x$.

However, in \Rlang models, the intercept is always implicitly included, so the model fitted above can be formulated as \verb|dist ~ speed|---i.e., a missing \code{+ 1} does not change the model. To exclude the intercept from the previous model, we need to specify it as \verb|dist ~ speed - 1|, resulting in the fitting of a straight line passing through the origin ($x = 0$, $y = 0$).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{fm2} \hlkwb{<-} \hlkwd{lm}\hlstd{(dist} \hlopt{~} \hlstd{speed} \hlopt{-} \hlnum{1}\hlstd{,} \hlkwc{data} \hlstd{= cars)}
\hlkwd{summary}\hlstd{(fm2)}
\end{alltt}
\begin{verbatim}
##
## Call:
## lm(formula = dist ~ speed - 1, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -26.183 -12.637  -5.455   4.590  50.181
##
## Coefficients:
##       Estimate Std. Error t value Pr(>|t|)
## speed   2.9091     0.1414   20.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.26 on 49 degrees of freedom
## Multiple R-squared:  0.8963,	Adjusted R-squared:  0.8942
## F-statistic: 423.5 on 1 and 49 DF,  p-value: < 2.2e-16
\end{verbatim}
\end{kframe}
\end{knitrout}

Now there is no estimate for the intercept in the summary, only an estimate for the slope.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{plot}\hlstd{(fm2,} \hlkwc{which} \hlstd{=} \hlnum{1}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-models-2a-1}

}


\end{knitrout}

The equation of the second fitted model is $y = 2.91 x$, and from the residuals, it can be seen that it is inadequate, as the straight line does not follow the curvature of the relationship between \code{dist} and \code{speed}.
\begin{playground}
You will now fit a second-degree polynomial\index{linear models!polynomial regression}\index{polynomial regression}, a different linear model: $y = \alpha \cdot 1 + \beta_1 \cdot x + \beta_2 \cdot x^2$. The function used is the same as for linear regression, \Rfunction{lm()}. We only need to alter the formulation of the model. The identity function \Rfunction{I()} is used to protect its argument from being interpreted as part of the model formula. Instead, its argument is evaluated beforehand and the result is used as the, in this case second, explanatory variable.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{fm3} \hlkwb{<-} \hlkwd{lm}\hlstd{(dist} \hlopt{~} \hlstd{speed} \hlopt{+} \hlkwd{I}\hlstd{(speed}\hlopt{^}\hlnum{2}\hlstd{),} \hlkwc{data} \hlstd{= cars)}
\hlkwd{plot}\hlstd{(fm3,} \hlkwc{which} \hlstd{=} \hlnum{3}\hlstd{)}
\hlkwd{summary}\hlstd{(fm3)}
\hlkwd{anova}\hlstd{(fm3)}
\end{alltt}
\end{kframe}
\end{knitrout}

The ``same'' fit using an orthogonal polynomial can be specified using function \Rfunction{poly()}. Polynomials of different degrees can be obtained by supplying as the second argument to \Rfunction{poly()} the corresponding positive integer value. In this case, the different terms of the polynomial are bulked together in the summary.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{fm3a} \hlkwb{<-} \hlkwd{lm}\hlstd{(dist} \hlopt{~} \hlkwd{poly}\hlstd{(speed,} \hlnum{2}\hlstd{),} \hlkwc{data} \hlstd{= cars)}
\hlkwd{summary}\hlstd{(fm3a)}
\hlkwd{anova}\hlstd{(fm3a)}
\end{alltt}
\end{kframe}
\end{knitrout}

We can also compare two model fits using \Rfunction{anova()}, to test whether one of the models describes the data better than the other. It is important in this case to take into consideration the nature of the difference between the model formulas, most importantly if they can be interpreted as nested---i.e., interpreted as a base model vs. the same model with additional terms.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{anova}\hlstd{(fm2, fm1)}
\end{alltt}
\end{kframe}
\end{knitrout}

Three or more models can also be compared in a single call to \Rfunction{anova()}. However, be careful, as the order of the arguments matters.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{anova}\hlstd{(fm2, fm3, fm3a)}
\hlkwd{anova}\hlstd{(fm2, fm3a, fm3)}
\end{alltt}
\end{kframe}
\end{knitrout}

We can use different criteria to choose the ``best'' model: significance based on $p$-values or information criteria (AIC, BIC). AIC (Akaike's ``An Information Criterion'') and BIC (``Bayesian Information Criterion'' = SBC, ``Schwarz's Bayesian criterion'') that penalize the resulting ``goodness'' based on the number of parameters in the fitted model. In the case of AIC and BIC, a smaller value is better, and values returned can be either positive or negative, in which case more negative is better. Estimates for both BIC and AIC are returned by \Rfunction{anova()}, and on their own by \Rfunction{BIC()} and \Rfunction{AIC()}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{BIC}\hlstd{(fm2, fm1, fm3, fm3a)}
\hlkwd{AIC}\hlstd{(fm2, fm1, fm3, fm3a)}
\end{alltt}
\end{kframe}
\end{knitrout}

Once you have run the code in the chunks above, you will be able see that these three criteria do not necessarily agree on which is the ``best'' model. Find in the output $p$-value, BIC and AIC estimates, for the different models and conclude which model is favored by each of the three criteria. In addition you will notice that the two different formulations of the quadratic polynomial are equivalent.

\end{playground}

Additional methods give easy access to different components of fitted models: \Rfunction{vcov()} returns the variance-covariance matrix, \Rfunction{coef()} and its alias \Rfunction{coefficients()} return the estimates for the fitted model coefficients, \Rfunction{fitted()} and its alias \Rfunction{fitted.values()} extract the fitted values, and \Rfunction{resid()} and its alias \Rfunction{residuals()} the corresponding residuals (or deviations). Less frequently used accessors are \Rfunction{effects()}, \Rfunction{terms()}, \Rfunction{model.frame()} and \Rfunction{model.matrix()}.

\begin{playground}
Familiarize yourself with these extraction and summary methods by reading their documentation and use them to explore \code{fm1} fitted above or model fits to other data of your interest.
\end{playground}

\begin{explainbox}
The objects returned by model fitting functions are rather complex and contain the full information, including the data to which the model was fit to. The different functions described above, either extract parts of the object or do additional calculations and formatting based on them. There are different specializations of these methods which are called depending on the class of the model-fit object. (See section \ref{sec:methods} on page \pageref{sec:methods}.)

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{class}\hlstd{(fm1)}
\end{alltt}
\begin{verbatim}
## [1] "lm"
\end{verbatim}
\end{kframe}
\end{knitrout}

We rarely need to manually explore the structure of these model-fit objects when using \Rlang interactively. In contrast, when including model fitting in scripts or package code, the need to efficiently extract specific members from them happens more frequently. As with any other \Rlang object we can use \Rfunction{str()} to explore them.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{str}\hlstd{(fm1,} \hlkwc{max.level} \hlstd{=} \hlnum{1}\hlstd{)} \hlcom{# not evaluated}
\end{alltt}
\end{kframe}
\end{knitrout}

We frequently only look at the output of \Rfunction{anova()} as implicitly displayed by \code{print()}. However, both \Rfunction{anova()} and \Rfunction{summary()} return complex objects containing members with data not displayed by the matching \code{print()} methods. Understanding this is frequently useful, when we want to either display the results in a different format, or extract parts of them for use in additional tests or computations. Once again we use \Rfunction{str()} to look at the structure.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{str}\hlstd{(}\hlkwd{anova}\hlstd{(fm1))}
\end{alltt}
\begin{verbatim}
## Classes 'anova' and 'data.frame':	2 obs. of  5 variables:
##  $ Df     : int  1 48
##  $ Sum Sq : num  21185 11354
##  $ Mean Sq: num  21185 237
##  $ F value: num  89.6 NA
##  $ Pr(>F) : num  1.49e-12 NA
##  - attr(*, "heading")= chr [1:2] "Analysis of Variance Table\n" "Response: dist"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{str}\hlstd{(}\hlkwd{summary}\hlstd{(fm1))}
\end{alltt}
\begin{verbatim}
## List of 11
##  $ call         : language lm(formula = dist ~ 1 + speed, data = cars)
##  $ terms        :Classes 'terms', 'formula'  language dist ~ 1 + speed
##   .. ..- attr(*, "variables")= language list(dist, speed)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "dist" "speed"
##   .. .. .. ..$ : chr "speed"
##   .. ..- attr(*, "term.labels")= chr "speed"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
##   .. ..- attr(*, "predvars")= language list(dist, speed)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "dist" "speed"
##  $ residuals    : Named num [1:50] 3.85 11.85 -5.95 12.05 2.12 ...
##   ..- attr(*, "names")= chr [1:50] "1" "2" "3" "4" ...
##  $ coefficients : num [1:2, 1:4] -17.579 3.932 6.758 0.416 -2.601 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "(Intercept)" "speed"
##   .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
##  $ aliased      : Named logi [1:2] FALSE FALSE
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "speed"
##  $ sigma        : num 15.4
##  $ df           : int [1:3] 2 48 2
##  $ r.squared    : num 0.651
##  $ adj.r.squared: num 0.644
##  $ fstatistic   : Named num [1:3] 89.6 1 48
##   ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
##  $ cov.unscaled : num [1:2, 1:2] 0.19311 -0.01124 -0.01124 0.00073
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "(Intercept)" "speed"
##   .. ..$ : chr [1:2] "(Intercept)" "speed"
##  - attr(*, "class")= chr "summary.lm"
\end{verbatim}
\end{kframe}
\end{knitrout}

Once we know the structure of the object and the names of members, we can simply extract them using the usual \Rlang rules for member extraction.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{summary}\hlstd{(fm1)}\hlopt{$}\hlstd{adj.r.squared}
\end{alltt}
\begin{verbatim}
## [1] 0.6438102
\end{verbatim}
\end{kframe}
\end{knitrout}

As an example we test if the slope from a linear regression fit deviates significantly from a constant value different from the usual zero.

The examples above are for a null hypothesis of slope = 0 and next we show how to do the equivalent test with a null hypothesis of slope = 1. The procedure is applicable to any constant value as a null hypothesis for any of the fitted parameter estimates for hypotheses set \emph{a priori}. The examples use a two-sided test. In some cases, a single-sided test should be used (e.g., if its known a priori that deviation is because of physical reasons possible only in one direction away from the null hypothesis, or because only one direction of response is of interest).

To estimate the \emph{t}-value we need an estimate for the parameter and an estimate of the standard error for this estimate and its degrees of freedom.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{est.slope.value} \hlkwb{<-} \hlkwd{summary}\hlstd{(fm1)}\hlopt{$}\hlstd{coef[}\hlstr{"speed"}\hlstd{,} \hlstr{"Estimate"}\hlstd{]}
\hlstd{est.slope.se} \hlkwb{<-} \hlkwd{summary}\hlstd{(fm1)}\hlopt{$}\hlstd{coef[}\hlstr{"speed"}\hlstd{,} \hlstr{"Std. Error"}\hlstd{]}
\hlstd{degrees.of.freedom} \hlkwb{<-} \hlkwd{summary}\hlstd{(fm1)}\hlopt{$}\hlstd{df[}\hlnum{2}\hlstd{]}
\end{alltt}
\end{kframe}
\end{knitrout}

The \emph{t}-test is based on the difference between the value of the null hypothesis and the estimate.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{hyp.null} \hlkwb{<-} \hlnum{1}
\hlstd{t.value} \hlkwb{<-} \hlstd{(est.slope.value} \hlopt{-} \hlstd{hyp.null)} \hlopt{/} \hlstd{est.slope.se}
\hlstd{p.value} \hlkwb{<-} \hlkwd{dt}\hlstd{(t.value,} \hlkwc{df} \hlstd{= degrees.of.freedom)}
\end{alltt}
\end{kframe}
\end{knitrout}
\end{explainbox}

\begin{advplayground}
Check that the procedure above agrees with the output of \code{summary()} when we set \code{hyp.null <- 0} instead of \code{hyp.null <- 1}.

Modify the example so as to test whether the intercept is significantly larger than 5 feet, doing a one-sided test.
\end{advplayground}

Method \Rfunction{predict()} uses the fitted model together with new data for the independent variables to compute predictions. As \Rfunction{predict()} accepts new data as input, it allows interpolation and extrapolation to values of the independent variables not present in the original data. In the case of fits of linear- and some other models, method \Rfunction{predict()} returns, in addition to the prediction, estimates of the confidence and/or prediction intervals. The new data must be stored in a data frame with columns using the same names for the explanatory variables as in the data used for the fit, a response variable is not needed and additional columns are ignored. (The explanatory variables in the new data can be either continuous or factors, but they must match in this respect those in the original data.)

\begin{advplayground}
Predict using both \code{fm1} and \code{fm2} the distance required to stop cars moving at 0, 5, 10, 20, 30, and 40~mph. Study the help page for the predict method for linear models (using \code{help(predict.lm)}). Explore the difference between \code{"prediction"} and \code{"confidence"} bands: why are they so different?
\end{advplayground}

\subsection{Analysis of variance, ANOVA}\label{sec:anova}
%\index{analysis of variance}
\index{analysis of variance|see{linear models!analysis of variance}}\index{linear models!analysis of variance}
\index{ANOVA|see{analysis of variance}}
We use here the \Rdata{InsectSprays} data set, giving insect counts in plots sprayed with different insecticides. In these data, \code{spray} is a factor with six levels.%
\label{xmpl:fun:lm:fm4}

The call is exactly the same as the one for linear regression, only the names of the variables and data frame are different. What determines that this is an ANOVA is that \code{spray}, the explanatory variable, is a \code{factor}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{data}\hlstd{(InsectSprays)}
\hlkwd{is.numeric}\hlstd{(InsectSprays}\hlopt{$}\hlstd{spray)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlkwd{is.factor}\hlstd{(InsectSprays}\hlopt{$}\hlstd{spray)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{levels}\hlstd{(InsectSprays}\hlopt{$}\hlstd{spray)}
\end{alltt}
\begin{verbatim}
## [1] "A" "B" "C" "D" "E" "F"
\end{verbatim}
\end{kframe}
\end{knitrout}

We fit the model in exactly the same way as for linear regression; the difference is that we use a factor as the explanatory variable. By using a factor instead of a numeric vector, a different model matrix is built from an equivalent formula.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{fm4} \hlkwb{<-} \hlkwd{lm}\hlstd{(count} \hlopt{~} \hlstd{spray,} \hlkwc{data} \hlstd{= InsectSprays)}
\end{alltt}
\end{kframe}
\end{knitrout}

Diagnostic plots are obtained in the same way as for linear regression.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{plot}\hlstd{(fm4,} \hlkwc{which} \hlstd{=} \hlnum{3}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-model-6a-1}

}


\end{knitrout}

In ANOVA we are mainly interested in testing hypotheses, and \Rfunction{anova()} provides the most interesting output. Function \Rfunction{summary()} can be used to extract parameter estimates. The default contrasts and corresponding $p$-values returned by \Rfunction{summary()} test hypotheses that have little or no direct interest in an analysis of variance. Function \Rfunction{aov()} is a wrapper on \Rfunction{lm()} that returns an object that by default when printed displays the output of \Rfunction{anova()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{anova}\hlstd{(fm4)}
\end{alltt}
\begin{verbatim}
## Analysis of Variance Table
##
## Response: count
##           Df Sum Sq Mean Sq F value    Pr(>F)
## spray      5 2668.8  533.77  34.702 < 2.2e-16 ***
## Residuals 66 1015.2   15.38
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{warningbox}
The defaults used for model fits and ANOVA calculations vary among programs. There exist different so-called ``types'' of sums of squares, usually called I, II, and III. In orthogonal designs the choice has no consequences, but differences can be important for unbalanced designs, even leading to different conclusions. \Rlang's default, type~I, is usually considered to suffer milder problems than type~III, the default used by \pgrmname{SPSS} and \pgrmname{SAS}.

The contrasts used affect the estimates returned by \Rfunction{coef()} and \Rfunction{summary()} applied to an ANOVA model fit. The default used in \Rlang is different to that used in some other programs (even different than in \Slang). The most straightforward way of setting a different default for a whole series of model fits is by setting \Rlang option \code{contrasts}, which we here only print.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{options}\hlstd{(}\hlstr{"contrasts"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## $contrasts
##         unordered           ordered
## "contr.treatment"      "contr.poly"
\end{verbatim}
\end{kframe}
\end{knitrout}

It is also possible to select the contrast to be used in the call to \code{aov()} or \code{lm()}. The default, \code{contr.treatment} uses the first level of the factor (assumed to be a control) as reference for estimation of coefficients and their significance, while \code{contr.sum} uses as reference the mean of all levels, by using as condition that the sum of the coefficient estimates is equal to zero. Obviously this changes what the coefficients describe, and consequently also the estimated $p$-values.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{fm4trea} \hlkwb{<-} \hlkwd{lm}\hlstd{(count} \hlopt{~} \hlstd{spray,} \hlkwc{data} \hlstd{= InsectSprays,}
              \hlkwc{contrasts} \hlstd{=} \hlkwd{list}\hlstd{(}\hlkwc{spray} \hlstd{= contr.treatment))}
\hlstd{fm4sum}  \hlkwb{<-} \hlkwd{lm}\hlstd{(count} \hlopt{~} \hlstd{spray,} \hlkwc{data} \hlstd{= InsectSprays,}
              \hlkwc{contrasts} \hlstd{=} \hlkwd{list}\hlstd{(}\hlkwc{spray} \hlstd{= contr.sum))}
\end{alltt}
\end{kframe}
\end{knitrout}

Interpretation of any analysis has to take into account these differences and users should not be surprised if ANOVA yields different results in base \Rlang and \pgrmname{SPSS} or \pgrmname{SAS} given the different types of sums of squares used. The interpretation of ANOVA on designs that are not orthogonal will depend on which type is used, so the different results are not necessarily contradictory even when different.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{summary}\hlstd{(fm4trea)}
\end{alltt}
\begin{verbatim}
##
## Call:
## lm(formula = count ~ spray, data = InsectSprays, 
##    contrasts = list(spray = contr.treatment))
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -8.333 -1.958 -0.500  1.667  9.333
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  14.5000     1.1322  12.807  < 2e-16 ***
## spray2        0.8333     1.6011   0.520    0.604
## spray3      -12.4167     1.6011  -7.755 7.27e-11 ***
## spray4       -9.5833     1.6011  -5.985 9.82e-08 ***
## spray5      -11.0000     1.6011  -6.870 2.75e-09 ***
## spray6        2.1667     1.6011   1.353    0.181
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.922 on 66 degrees of freedom
## Multiple R-squared:  0.7244,	Adjusted R-squared:  0.7036
## F-statistic:  34.7 on 5 and 66 DF,  p-value: < 2.2e-16
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{summary}\hlstd{(fm4sum)}
\end{alltt}
\begin{verbatim}
##
## Call:
## lm(formula = count ~ spray, data = InsectSprays, 
##    contrasts = list(spray = contr.sum))
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -8.333 -1.958 -0.500  1.667  9.333
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   9.5000     0.4622  20.554  < 2e-16 ***
## spray1        5.0000     1.0335   4.838 8.22e-06 ***
## spray2        5.8333     1.0335   5.644 3.78e-07 ***
## spray3       -7.4167     1.0335  -7.176 7.87e-10 ***
## spray4       -4.5833     1.0335  -4.435 3.57e-05 ***
## spray5       -6.0000     1.0335  -5.805 2.00e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.922 on 66 degrees of freedom
## Multiple R-squared:  0.7244,	Adjusted R-squared:  0.7036
## F-statistic:  34.7 on 5 and 66 DF,  p-value: < 2.2e-16
\end{verbatim}
\end{kframe}
\end{knitrout}

In the case of contrasts, they always affect the parameter estimates independently of whether the experiment design is orthogonal or not. A different set of contrasts simply tests a different set of possible treatment effects. Contrasts, on the other hand, do not affect the table returned by \Rfunction{anova()} as this table does not deal with the effects of individual factor levels.
\end{warningbox}

\subsection{Analysis of covariance, ANCOVA}
%\index{analysis of covariance}
\index{analysis of covariance|see{linear models!analysis of covariance}}
\index{linear models!analysis of covariance}
\index{ANCOVA|see{analysis of covariance}}

When a linear model includes both explanatory factors and continuous explanatory variables, we may call it \emph{analysis of covariance} (ANCOVA). The formula syntax is the same for all linear models and, as mentioned in previous sections, what determines the type of analysis is the nature of the explanatory variable(s). As the formulation remains the same, no specific example is given. The main difficulty of ANCOVA is in the selection of the covariate and the interpretation of the results of the analysis \autocite[e.g.][]{Smith1957}.
\index{linear models|)}

\section{Generalized linear models}\label{sec:stat:GLM}
\index{generalized linear models|(}\index{models!generalized linear|see{generalized linear models}}
\index{GLM|see{generalized linear models}}

Linear models make the assumption of normally distributed residuals. Generalized linear models, fitted with function \Rfunction{glm()} are more flexible, and allow the assumed distribution to be selected as well as the link function.
For the analysis of the \Rdata{InsectSpray} data set above (section \ref{sec:anova} on page \pageref{sec:anova}), the Normal distribution is not a good approximation as count data deviates from it. This was visible in the quantile--quantile plot above.

For count data, GLMs provide a better alternative. In the example below we fit the same model as above, but we assume a quasi-Poisson distribution instead of the Normal. In addition to the model formula we need to pass an argument through \code{family} giving the error distribution to be assumed---the default for \code{family} is \code{gaussian} or Normal distribution.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{fm10} \hlkwb{<-} \hlkwd{glm}\hlstd{(count} \hlopt{~} \hlstd{spray,} \hlkwc{data} \hlstd{= InsectSprays,} \hlkwc{family} \hlstd{= quasipoisson)}
\hlkwd{anova}\hlstd{(fm10)}
\end{alltt}
\begin{verbatim}
## Analysis of Deviance Table
##
## Model: quasipoisson, link: log
##
## Response: count
##
## Terms added sequentially (first to last)
##
##
##       Df Deviance Resid. Df Resid. Dev
## NULL                     71     409.04
## spray  5   310.71        66      98.33
\end{verbatim}
\end{kframe}
\end{knitrout}

The printout from the \Rfunction{anova()} method for GLM fits has some differences to that for LM fits. By default, no significance test is computed, as a knowledgeable choice is required depending on the characteristics of the model and data. We here use \code{"F"} as an argument to request an $F$-test.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{anova}\hlstd{(fm10,} \hlkwc{test} \hlstd{=} \hlstr{"F"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## Analysis of Deviance Table
##
## Model: quasipoisson, link: log
##
## Response: count
##
## Terms added sequentially (first to last)
##
##
##       Df Deviance Resid. Df Resid. Dev      F    Pr(>F)
## NULL                     71     409.04
## spray  5   310.71        66      98.33 41.216 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\end{verbatim}
\end{kframe}
\end{knitrout}

Method \Rfunction{plot()} as for linear-model fits, produces diagnosis plots. We show as above the q-q-plot of residuals.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{plot}\hlstd{(fm10,} \hlkwc{which} \hlstd{=} \hlnum{3}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-model-11-1}

}


\end{knitrout}

We can extract different components similarly as described for linear models (see section \ref{sec:stat:LM} on page \pageref{sec:stat:LM}).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{class}\hlstd{(fm10)}
\end{alltt}
\begin{verbatim}
## [1] "glm" "lm"
\end{verbatim}
\begin{alltt}
\hlkwd{summary}\hlstd{(fm10)}
\end{alltt}
\begin{verbatim}
##
## Call:
## glm(formula = count ~ spray, family = quasipoisson, data = InsectSprays)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.3852  -0.8876  -0.1482   0.6063   2.6922
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.67415    0.09309  28.728  < 2e-16 ***
## sprayB       0.05588    0.12984   0.430    0.668
## sprayC      -1.94018    0.26263  -7.388 3.30e-10 ***
## sprayD      -1.08152    0.18499  -5.847 1.70e-07 ***
## sprayE      -1.42139    0.21110  -6.733 4.82e-09 ***
## sprayF       0.13926    0.12729   1.094    0.278
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 1.507713)
##
##     Null deviance: 409.041  on 71  degrees of freedom
## Residual deviance:  98.329  on 66  degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
\end{verbatim}
\begin{alltt}
\hlkwd{head}\hlstd{(}\hlkwd{residuals}\hlstd{(fm10))}
\end{alltt}
\begin{verbatim}
##          1          2          3          4          5          6
## -1.2524891 -2.1919537  1.3650439 -0.1320721 -0.1320721 -0.6768988
\end{verbatim}
\begin{alltt}
\hlkwd{head}\hlstd{(}\hlkwd{fitted}\hlstd{(fm10))}
\end{alltt}
\begin{verbatim}
##    1    2    3    4    5    6
## 14.5 14.5 14.5 14.5 14.5 14.5
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{explainbox}
If we use \code{str()} or \code{names()} we can see that there are some differences with respect to linear model fits. The returned object is of a different class and contains some members not present in linear models. Two of these have to do with the iterative approximation method used, \code{iter} contains the number of iterations used  and \code{converged} the success or not in finding a solution.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{names}\hlstd{(fm10)}
\end{alltt}
\begin{verbatim}
##  [1] "coefficients"      "residuals"         "fitted.values"
##  [4] "effects"           "R"                 "rank"
##  [7] "qr"                "family"            "linear.predictors"
## [10] "deviance"          "aic"               "null.deviance"
## [13] "iter"              "weights"           "prior.weights"
## [16] "df.residual"       "df.null"           "y"
## [19] "converged"         "boundary"          "model"
## [22] "call"              "formula"           "terms"
## [25] "data"              "offset"            "control"
## [28] "method"            "contrasts"         "xlevels"
\end{verbatim}
\begin{alltt}
\hlstd{fm10}\hlopt{$}\hlstd{converged}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlstd{fm10}\hlopt{$}\hlstd{iter}
\end{alltt}
\begin{verbatim}
## [1] 5
\end{verbatim}
\end{kframe}
\end{knitrout}
\end{explainbox}

\index{generalized linear models|)}

\section{Non-linear regression}\label{sec:stat:NLS}
\index{non-linear models|(}%
\index{models!non-linear|see{non-linear models}}%
\index{NLS|see{non-linear models}}

Function \Rfunction{nls()} is \Rlang's workhorse for fitting non-linear models. By \emph{non-linear} it is meant non-linear \emph{in the parameters} whose values are being estimated through fitting the model to data. This is different from the shape of the function when plotted---i.e., polynomials of any degree are linear models. In contrast, the Michaelis-Menten equation used in chemistry and the  Gompertz equation used to describe growth are non-linear models in their parameters.

While analytical algorithms exist for finding estimates for the parameters of linear models, in the case of non-linear models, the estimates are obtained by approximation. For analytical solutions, estimates can always be obtained, except in infrequent pathological cases where reliance on floating point numbers with limited resolution introduces rounding errors that ``break'' mathematical algorithms that are valid for real numbers. For approximations obtained through iteration, cases when the algorithm fails to \emph{converge} onto an answer are relatively common. Iterative algorithms attempt to improve an initial guess for the values of the parameters to be estimated, a guess frequently supplied by the user. In each iteration the estimate obtained in the previous iteration is used as the starting value, and this process is repeated one time after another. The expectation is that after a finite number of iterations the algorithm will converge into a solution that ``cannot'' be improved further. In real life we stop iteration when the improvement in the fit is smaller than a certain threshold, or when no convergence has been achieved after a certain maximum number of iterations. In the first case, we usually obtain good estimates; in the second case, we do not obtain usable estimates and need to look for different ways of obtaining them. When convergence fails, the first thing to do is to try different starting values and if this also fails, switch to a different computational algorithm. These steps usually help, but not always. Good starting values are in many cases crucial and in some cases ``guesses'' can be obtained using either graphical or analytical approximations.

For functions for which computational algorithms exist for ``guessing'' suitable starting values, \Rlang provides a mechanism for packaging the function to be fitted together with the function generating the starting values. These functions go by the name of \emph{self-starting functions} and relieve the user from the burden of guessing and supplying suitable starting values. The\index{self-starting functions} self-starting functions available in \Rlang are \code{SSasymp()}, \code{SSasympOff()}, \code{SSasympOrig()}, \code{SSbiexp()}, \code{SSfol()}, \code{SSfpl()}, \code{SSgompertz()}, \code{SSlogis()}, \code{SSmicmen()}, and \code{SSweibull()}. Function \code{selfStart()} can be used to define new ones. All these functions can be used when fitting models with \Rfunction{nls} or \Rfunction{nlme}. Please, check the respective help pages for details.

In the case of \Rfunction{nls()} the specification of the model to be fitted differs from that used for linear models. We will use as an example fitting the Michaelis-Menten equation\index{Michaelis-Menten equation} describing reaction kinetics\index{chemical reaction kinetics} in biochemistry and chemistry. The mathematical formulation is given by:

\begin{equation}\label{eq:michaelis:menten}
v = \frac{\mathrm{d} [P]}{\mathrm{d} t} = \frac{V_{\mathrm{max}} [S]}{K_{\mathrm{M}} + [S]}
\end{equation}

The function takes its name from Michaelis and Menten's paper from 1913 \autocite{Johnson2011}. A self-starting function implementing the Michaelis-Menten equation is available in \Rlang under the name \Rfunction{SSmicmen()}\index{models!selfstart@{\texttt{selfStart}}}. We will use the \Rdata{Puromycin} data set.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{data}\hlstd{(Puromycin)}
\hlkwd{names}\hlstd{(Puromycin)}
\end{alltt}
\begin{verbatim}
## [1] "conc"  "rate"  "state"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{fm21} \hlkwb{<-} \hlkwd{nls}\hlstd{(rate} \hlopt{~} \hlkwd{SSmicmen}\hlstd{(conc, Vm, K),} \hlkwc{data} \hlstd{= Puromycin,}
            \hlkwc{subset} \hlstd{= state} \hlopt{==} \hlstr{"treated"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

We can extract different components similarly as described for linear models (see section \ref{sec:stat:LM} on page \pageref{sec:stat:LM}).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{class}\hlstd{(fm21)}
\end{alltt}
\begin{verbatim}
## [1] "nls"
\end{verbatim}
\begin{alltt}
\hlkwd{summary}\hlstd{(fm21)}
\end{alltt}
\begin{verbatim}
##
## Formula: rate ~ SSmicmen(conc, Vm, K)
##
## Parameters:
##     Estimate Std. Error t value Pr(>|t|)
## Vm 2.127e+02  6.947e+00  30.615 3.24e-11 ***
## K  6.412e-02  8.281e-03   7.743 1.57e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.93 on 10 degrees of freedom
##
## Number of iterations to convergence: 0
## Achieved convergence tolerance: 1.937e-06
\end{verbatim}
\begin{alltt}
\hlkwd{residuals}\hlstd{(fm21)}
\end{alltt}
\begin{verbatim}
##  [1]  25.4339970  -3.5660030  -5.8109606   4.1890394 -11.3616076   4.6383924
##  [7]  -5.6846886 -12.6846886   0.1670799  10.1670799   6.0311724  -0.9688276
## attr(,"label")
## [1] "Residuals"
\end{verbatim}
\begin{alltt}
\hlkwd{fitted}\hlstd{(fm21)}
\end{alltt}
\begin{verbatim}
##  [1]  50.5660  50.5660 102.8110 102.8110 134.3616 134.3616 164.6847 164.6847
##  [9] 190.8329 190.8329 200.9688 200.9688
## attr(,"label")
## [1] "Fitted values"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{explainbox}
If we use \code{str()} or \code{names()} we can see that there are differences with respect to linear model and generalized model fits. The returned object is of class \code{nls} and contains some new members and lacks others. Two members are related to the iterative approximation method used, \code{control} containing nested members holding iteration settings, and \code{convInfo} (convergence information) with nested members with information on the outcome of the iterative algorithm.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{str}\hlstd{(fm21,} \hlkwc{max.level} \hlstd{=} \hlnum{1}\hlstd{)}
\end{alltt}
\begin{verbatim}
## List of 6
##  $ m          :List of 16
##   ..- attr(*, "class")= chr "nlsModel"
##  $ convInfo   :List of 5
##  $ data       : symbol Puromycin
##  $ call       : language nls(formula = rate ~ SSmicmen(conc, Vm, K), 
##                              data = Puromycin, subset = __truncated__ ...
##  $ dataClasses: Named chr "numeric"
##   ..- attr(*, "names")= chr "conc"
##  $ control    :List of 5
##  - attr(*, "class")= chr "nls"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{fm21}\hlopt{$}\hlstd{convInfo}
\end{alltt}
\begin{verbatim}
## $isConv
## [1] TRUE
##
## $finIter
## [1] 0
##
## $finTol
## [1] 1.937028e-06
##
## $stopCode
## [1] 0
##
## $stopMessage
## [1] "converged"
\end{verbatim}
\end{kframe}
\end{knitrout}
\end{explainbox}

\index{non-linear models|)}

\section{Model formulas}
\index{model formulas|(}
In the examples above we fitted simple models. More complex ones can be easily formulated using the same syntax. First of all, one can avoid use of operator \code{*} and explicitly define all individual main effects and interactions using operators \code{+} and \code{:}. The syntax implemented in base \Rlang allows grouping by means of parentheses, so it is also possible to exclude some interactions by combining the use of \code{*} and parentheses.

The same symbols as for arithmetic operators are used for model formulas. Within a formula, symbols are interpreted according to formula syntax. When we mean an arithmetic operation that could be interpreted as being part of the model formula we need to ``protect'' it by means of the identity function \Rfunction{I()}. The next two examples define formulas for models with only one explanatory variable. With formulas like these, the explanatory variable will be computed on the fly when fitting the model to data. In the first case below we need to explicitly protect the addition of the two variables into their sum, because otherwise they would be interpreted as two separate explanatory variables in the model. In the second case, \Rfunction{log()} cannot be interpreted as part of the model formula, and consequently does not require additional protection, neither does the expression passed as its argument.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{y} \hlopt{~} \hlkwd{I}\hlstd{(x1} \hlopt{+} \hlstd{x2)}
\hlstd{y} \hlopt{~} \hlkwd{log}\hlstd{(x1} \hlopt{+} \hlstd{x2)}
\end{alltt}
\end{kframe}
\end{knitrout}

\Rlang formula syntax allows alternative ways for specifying interaction terms. They allow ``abbreviated'' ways of entering formulas, which for complex experimental designs saves typing and can improve clarity. As seen above, operator \code{*} saves us from having to explicitly indicate all the interaction terms in a full factorial model.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{y} \hlopt{~} \hlstd{x1} \hlopt{+} \hlstd{x2} \hlopt{+} \hlstd{x3} \hlopt{+} \hlstd{x1}\hlopt{:}\hlstd{x2} \hlopt{+} \hlstd{x1}\hlopt{:}\hlstd{x3} \hlopt{+} \hlstd{x2}\hlopt{:}\hlstd{x3} \hlopt{+} \hlstd{x1}\hlopt{:}\hlstd{x2}\hlopt{:}\hlstd{x3}
\end{alltt}
\end{kframe}
\end{knitrout}

Can be replaced by a concise equivalent.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{y} \hlopt{~} \hlstd{x1} \hlopt{*} \hlstd{x2} \hlopt{*} \hlstd{x3}
\end{alltt}
\end{kframe}
\end{knitrout}

When the model to be specified does not include all possible interaction terms, we can combine the concise notation with parentheses.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{y} \hlopt{~} \hlstd{x1} \hlopt{+} \hlstd{(x2} \hlopt{*} \hlstd{x3)}
\hlstd{y} \hlopt{~} \hlstd{x1} \hlopt{+} \hlstd{x2} \hlopt{+} \hlstd{x3} \hlopt{+} \hlstd{x2}\hlopt{:}\hlstd{x3}
\end{alltt}
\end{kframe}
\end{knitrout}

That the two model formulas above are equivalent, can be seen using \code{terms()}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{terms}\hlstd{(y} \hlopt{~} \hlstd{x1} \hlopt{+} \hlstd{(x2} \hlopt{*} \hlstd{x3))}
\end{alltt}
\begin{verbatim}
## y ~ x1 + (x2 * x3)
## attr(,"variables")
## list(y, x1, x2, x3)
## attr(,"factors")
##    x1 x2 x3 x2:x3
## y   0  0  0     0
## x1  1  0  0     0
## x2  0  1  0     1
## x3  0  0  1     1
## attr(,"term.labels")
## [1] "x1"    "x2"    "x3"    "x2:x3"
## attr(,"order")
## [1] 1 1 1 2
## attr(,"intercept")
## [1] 1
## attr(,"response")
## [1] 1
## attr(,".Environment")
## <environment: R_GlobalEnv>
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{y} \hlopt{~} \hlstd{x1} \hlopt{*} \hlstd{(x2} \hlopt{+} \hlstd{x3)}
\hlstd{y} \hlopt{~} \hlstd{x1} \hlopt{+} \hlstd{x2} \hlopt{+} \hlstd{x3} \hlopt{+} \hlstd{x1}\hlopt{:}\hlstd{x2} \hlopt{+} \hlstd{x1}\hlopt{:}\hlstd{x3}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{terms}\hlstd{(y} \hlopt{~} \hlstd{x1} \hlopt{*} \hlstd{(x2} \hlopt{+} \hlstd{x3))}
\end{alltt}
\begin{verbatim}
## y ~ x1 * (x2 + x3)
## attr(,"variables")
## list(y, x1, x2, x3)
## attr(,"factors")
##    x1 x2 x3 x1:x2 x1:x3
## y   0  0  0     0     0
## x1  1  0  0     1     1
## x2  0  1  0     1     0
## x3  0  0  1     0     1
## attr(,"term.labels")
## [1] "x1"    "x2"    "x3"    "x1:x2" "x1:x3"
## attr(,"order")
## [1] 1 1 1 2 2
## attr(,"intercept")
## [1] 1
## attr(,"response")
## [1] 1
## attr(,".Environment")
## <environment: R_GlobalEnv>
\end{verbatim}
\end{kframe}
\end{knitrout}

The \code{\textasciicircum{}} operator provides a concise notation to limit the order of the interaction terms included in a formula.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{y} \hlopt{~} \hlstd{(x1} \hlopt{+} \hlstd{x2} \hlopt{+} \hlstd{x3)}\hlopt{^}\hlnum{2}
\hlstd{y} \hlopt{~} \hlstd{x1} \hlopt{+} \hlstd{x2} \hlopt{+} \hlstd{x3} \hlopt{+} \hlstd{x1}\hlopt{:}\hlstd{x2} \hlopt{+} \hlstd{x1}\hlopt{:}\hlstd{x3} \hlopt{+} \hlstd{x2}\hlopt{:}\hlstd{x3}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{terms}\hlstd{(y} \hlopt{~} \hlstd{(x1} \hlopt{+} \hlstd{x2} \hlopt{+} \hlstd{x3)}\hlopt{^}\hlnum{2}\hlstd{)}
\end{alltt}
\begin{verbatim}
## y ~ (x1 + x2 + x3)^2
## attr(,"variables")
## list(y, x1, x2, x3)
## attr(,"factors")
##    x1 x2 x3 x1:x2 x1:x3 x2:x3
## y   0  0  0     0     0     0
## x1  1  0  0     1     1     0
## x2  0  1  0     1     0     1
## x3  0  0  1     0     1     1
## attr(,"term.labels")
## [1] "x1"    "x2"    "x3"    "x1:x2" "x1:x3" "x2:x3"
## attr(,"order")
## [1] 1 1 1 2 2 2
## attr(,"intercept")
## [1] 1
## attr(,"response")
## [1] 1
## attr(,".Environment")
## <environment: R_GlobalEnv>
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{advplayground}
For operator \code{\textasciicircum{}} to behave as expected, its first operand should be a formula with no interactions!  Compare the result of expanding these two formulas with \Rfunction{trems()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{y} \hlopt{~} \hlstd{(x1} \hlopt{+} \hlstd{x2} \hlopt{+} \hlstd{x3)}\hlopt{^}\hlnum{2}
\hlstd{y} \hlopt{~} \hlstd{(x1} \hlopt{*} \hlstd{x2} \hlopt{*} \hlstd{x3)}\hlopt{^}\hlnum{2}
\end{alltt}
\end{kframe}
\end{knitrout}

\end{advplayground}

Operator \code{\%in\%} can also be used as a shortcut for including only some of all the possible interaction terms in a formula.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{y} \hlopt{~} \hlstd{x1} \hlopt{+} \hlstd{x2} \hlopt{+} \hlstd{x1} \hlopt{%in%} \hlstd{x2}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{terms}\hlstd{(y} \hlopt{~} \hlstd{x1} \hlopt{+} \hlstd{x2} \hlopt{+} \hlstd{x1} \hlopt{%in%} \hlstd{x2)}
\end{alltt}
\begin{verbatim}
## y ~ x1 + x2 + x1 %in% x2
## attr(,"variables")
## list(y, x1, x2)
## attr(,"factors")
##    x1 x2 x1:x2
## y   0  0     0
## x1  1  0     1
## x2  0  1     1
## attr(,"term.labels")
## [1] "x1"    "x2"    "x1:x2"
## attr(,"order")
## [1] 1 1 2
## attr(,"intercept")
## [1] 1
## attr(,"response")
## [1] 1
## attr(,".Environment")
## <environment: R_GlobalEnv>
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
Execute the examples below using the \Rdata{npk} data set from \Rlang. They demonstrate the use of different model formulas in ANOVA\index{analysis of variance!model formula}. Use these examples plus your own variations on the same theme to build your understanding of the syntax of model formulas. Based on the terms displayed in the ANOVA tables, first work out what models are being fitted in each case. In a second step, write each of the models using a mathematical formulation. Finally, think how model choice may affect the conclusions from an analysis of variance.

% runs fine but crashes LaTeX
\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{data}\hlstd{(npk)}
\hlkwd{anova}\hlstd{(}\hlkwd{lm}\hlstd{(yield} \hlopt{~} \hlstd{N} \hlopt{*} \hlstd{P} \hlopt{*} \hlstd{K,} \hlkwc{data} \hlstd{= npk))}
\hlkwd{anova}\hlstd{(}\hlkwd{lm}\hlstd{(yield} \hlopt{~} \hlstd{(N} \hlopt{+} \hlstd{P} \hlopt{+} \hlstd{K)}\hlopt{^}\hlnum{2}\hlstd{,} \hlkwc{data} \hlstd{= npk))}
\hlkwd{anova}\hlstd{(}\hlkwd{lm}\hlstd{(yield} \hlopt{~} \hlstd{N} \hlopt{+} \hlstd{P} \hlopt{+} \hlstd{K} \hlopt{+} \hlstd{P} \hlopt{%in%} \hlstd{N} \hlopt{+} \hlstd{K} \hlopt{%in%} \hlstd{N,} \hlkwc{data} \hlstd{= npk))}
\hlkwd{anova}\hlstd{(}\hlkwd{lm}\hlstd{(yield} \hlopt{~} \hlstd{N} \hlopt{+} \hlstd{P} \hlopt{+} \hlstd{K} \hlopt{+} \hlstd{N} \hlopt{%in%} \hlstd{P} \hlopt{+} \hlstd{K} \hlopt{%in%} \hlstd{P,} \hlkwc{data} \hlstd{= npk))}
\end{alltt}
\end{kframe}
\end{knitrout}

\end{playground}

Nesting of factors in experiments using hierarchical designs such as split-plots or repeated measures, results in the need to compute additional error terms, differing in their degrees of freedom. In such a design, different effects are tested based on different error terms. Whether nesting exists or not is a property of an experiment. It is decided as part of the design of the experiment based on the mechanics of treatment assignment to experimental units. In base-\Rlang model-formulas, nesting needs to be described by explicit definition of error terms by means of \code{Error()} within the formula. Nowadays, linear mixed-effects (LME) models are most frequently used with data from experiments and surveys using hierarchical designs, as implemented in packages \pkgname{nlme} and \pkgname{lme4}. These two packages use their own extensions to the model formula syntax to describe nesting and distinguishing fixed and random effects. Additive models have required other extensions, most of them specific to individual packages. These extensions fall outside the scope of this book.

\begin{warningbox}
  \Rlang will accept any syntactically correct model formula, even when the results of the fit are not interpretable. It is \emph{the responsibility of the user to ensure that models are meaningful}. The most common, and dangerous, mistake is specifying for factorial experiments, models that are missing lower-order interactions.

  Fitting models like those below to data from an experiment based on a three-way factorial design should be avoided. In both cases simpler terms are missing, while higher-order interaction(s) that include the missing term are included in the model. Such models are not interpretable, as the variation from the missing term(s) ends being ``disguised'' within the remaining terms, distorting their apparent significance and parameter estimates.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{y} \hlopt{~} \hlstd{A} \hlopt{+} \hlstd{B} \hlopt{+} \hlstd{A}\hlopt{:}\hlstd{B} \hlopt{+} \hlstd{A}\hlopt{:}\hlstd{C} \hlopt{+} \hlstd{B}\hlopt{:}\hlstd{C}
\hlstd{y} \hlopt{~} \hlstd{A} \hlopt{+} \hlstd{B} \hlopt{+} \hlstd{C} \hlopt{+} \hlstd{A}\hlopt{:}\hlstd{B} \hlopt{+} \hlstd{A}\hlopt{:}\hlstd{C} \hlopt{+} \hlstd{A}\hlopt{:}\hlstd{B}\hlopt{:}\hlstd{C}
\end{alltt}
\end{kframe}
\end{knitrout}

  In contrast to those above, the models below are interpretable, even if not ``full'' models (not including all possible interactions).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{y} \hlopt{~} \hlstd{A} \hlopt{+} \hlstd{B} \hlopt{+} \hlstd{C} \hlopt{+} \hlstd{A}\hlopt{:}\hlstd{B} \hlopt{+} \hlstd{A}\hlopt{:}\hlstd{C} \hlopt{+} \hlstd{B}\hlopt{:}\hlstd{C}
\hlstd{y} \hlopt{~} \hlstd{(A} \hlopt{+} \hlstd{B} \hlopt{+} \hlstd{C)}\hlopt{^}\hlnum{2}
\hlstd{y} \hlopt{~} \hlstd{A} \hlopt{+} \hlstd{B} \hlopt{+} \hlstd{C} \hlopt{+} \hlstd{B}\hlopt{:}\hlstd{C}
\hlstd{y} \hlopt{~} \hlstd{A} \hlopt{+} \hlstd{B} \hlopt{*} \hlstd{C}
\end{alltt}
\end{kframe}
\end{knitrout}

\end{warningbox}

As seen in chapter \ref{chap:R:data}, almost everything in the \Rlang language is an object that can be stored and manipulated. Model formulas are also objects, objects of class \Rclass{"formula"}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{class}\hlstd{(y} \hlopt{~} \hlstd{x)}
\end{alltt}
\begin{verbatim}
## [1] "formula"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlstd{y} \hlopt{~} \hlstd{x}
\hlkwd{class}\hlstd{(a)}
\end{alltt}
\begin{verbatim}
## [1] "formula"
\end{verbatim}
\end{kframe}
\end{knitrout}

There is no method \code{is.formula()} in base \Rlang, but we can easily test the class of an object with \Rfunction{inherits()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{inherits}\hlstd{(a,} \hlstr{"formula"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{explainbox}
\index{model formulas!manipulation}\textbf{Manipulation of model formulas.} Because this is a book about the \Rlang language, it is pertinent to describe how formulas can be manipulated. Formulas, as any other \Rlang objects, can be saved in variables including lists. Why is this useful? For example, if we want to fit several different models to the same data, we can write a \code{for} loop that walks through a list of model formulas. Or we can write a function that accepts one or more formulas as arguments.

The use of \code{for} \emph{loops} for iteration over a list of model formulas is described in section \ref{sec:R:faces:of:loops} on page \pageref{sec:R:faces:of:loops}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.data} \hlkwb{<-} \hlkwd{data.frame}\hlstd{(}\hlkwc{x} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{10}\hlstd{,} \hlkwc{y} \hlstd{= (}\hlnum{1}\hlopt{:}\hlnum{10}\hlstd{)} \hlopt{/} \hlnum{2} \hlopt{+} \hlkwd{rnorm}\hlstd{(}\hlnum{10}\hlstd{))}
\hlstd{anovas} \hlkwb{<-} \hlkwd{list}\hlstd{()}
\hlstd{formulas} \hlkwb{<-} \hlkwd{list}\hlstd{(}\hlkwc{a} \hlstd{= y} \hlopt{~} \hlstd{x} \hlopt{-} \hlnum{1}\hlstd{,} \hlkwc{b} \hlstd{= y} \hlopt{~} \hlstd{x,} \hlkwc{c} \hlstd{= y} \hlopt{~} \hlstd{x} \hlopt{+} \hlstd{x}\hlopt{^}\hlnum{2}\hlstd{)}
\hlkwa{for} \hlstd{(formula} \hlkwa{in} \hlstd{formulas) \{}
 \hlstd{anovas} \hlkwb{<-} \hlkwd{c}\hlstd{(anovas,} \hlkwd{list}\hlstd{(}\hlkwd{lm}\hlstd{(formula,} \hlkwc{data} \hlstd{= my.data)))}
 \hlstd{\}}
\hlkwd{str}\hlstd{(anovas,} \hlkwc{max.level} \hlstd{=} \hlnum{1}\hlstd{)}
\end{alltt}
\begin{verbatim}
## List of 3
##  $ :List of 12
##   ..- attr(*, "class")= chr "lm"
##  $ :List of 12
##   ..- attr(*, "class")= chr "lm"
##  $ :List of 12
##   ..- attr(*, "class")= chr "lm"
\end{verbatim}
\end{kframe}
\end{knitrout}

As could be expected, a conversion constructor is available with name \Rfunction{as.formula()}. It is useful when formulas are input interactively by the user or read from text files. With \Rfunction{as.formula()} we can convert a character string into a formula.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.string} \hlkwb{<-} \hlstr{"y ~ x"}
\hlkwd{lm}\hlstd{(}\hlkwd{as.formula}\hlstd{(my.string),} \hlkwc{data} \hlstd{= my.data)}
\end{alltt}
\begin{verbatim}
##
## Call:
## lm(formula = as.formula(my.string), data = my.data)
##
## Coefficients:
## (Intercept)            x
##      1.4059       0.2839
\end{verbatim}
\end{kframe}
\end{knitrout}

As there are many functions for the manipulation of character strings available in base \Rlang and through extension packages, it is straightforward to build model formulas programmatically as strings. We can use functions like \code{paste()} to assemble a formula as text, and then use \Rfunction{as.formula()} to convert it to an object of class \code{formula}, usable for fitting a model.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.string} \hlkwb{<-} \hlkwd{paste}\hlstd{(}\hlstr{"y"}\hlstd{,} \hlstr{"x"}\hlstd{,} \hlkwc{sep} \hlstd{=} \hlstr{"~"}\hlstd{)}
\hlkwd{lm}\hlstd{(}\hlkwd{as.formula}\hlstd{(my.string),} \hlkwc{data} \hlstd{= my.data)}
\end{alltt}
\begin{verbatim}
##
## Call:
## lm(formula = as.formula(my.string), data = my.data)
##
## Coefficients:
## (Intercept)            x
##      1.4059       0.2839
\end{verbatim}
\end{kframe}
\end{knitrout}

For the reverse operation of converting a formula into a string, we have available methods \code{as.character()} and \code{format()}. The first of these methods returns a character vector containing the components of the formula as individual strings, while \code{format()} returns a single character string with the formula formatted for printing.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{formatted.string} \hlkwb{<-} \hlkwd{format}\hlstd{(y} \hlopt{~} \hlstd{x)}
\hlstd{formatted.string}
\end{alltt}
\begin{verbatim}
## [1] "y ~ x"
\end{verbatim}
\begin{alltt}
\hlkwd{as.formula}\hlstd{(formatted.string)}
\end{alltt}
\begin{verbatim}
## y ~ x
\end{verbatim}
\end{kframe}
\end{knitrout}

It is also possible to \emph{edit} formula objects with method \Rfunction{update()}. In the replacement formula, a dot can replace either the left-hand side (lhs) or the right-hand side (rhs) of the existing formula in the replacement formula. We can also remove terms as can be seen below. In some cases the dot corresponding to the lhs can be omitted, but including it makes the syntax clearer.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.formula} \hlkwb{<-} \hlstd{y} \hlopt{~} \hlstd{x1} \hlopt{+} \hlstd{x2}
\hlkwd{update}\hlstd{(my.formula, .} \hlopt{~} \hlstd{.} \hlopt{+} \hlstd{x3)}
\end{alltt}
\begin{verbatim}
## y ~ x1 + x2 + x3
\end{verbatim}
\begin{alltt}
\hlkwd{update}\hlstd{(my.formula, .} \hlopt{~} \hlstd{.} \hlopt{-} \hlstd{x1)}
\end{alltt}
\begin{verbatim}
## y ~ x2
\end{verbatim}
\begin{alltt}
\hlkwd{update}\hlstd{(my.formula, .} \hlopt{~} \hlstd{x3)}
\end{alltt}
\begin{verbatim}
## y ~ x3
\end{verbatim}
\begin{alltt}
\hlkwd{update}\hlstd{(my.formula, z} \hlopt{~} \hlstd{.)}
\end{alltt}
\begin{verbatim}
## z ~ x1 + x2
\end{verbatim}
\begin{alltt}
\hlkwd{update}\hlstd{(my.formula, .} \hlopt{+} \hlstd{z} \hlopt{~} \hlstd{.)}
\end{alltt}
\begin{verbatim}
## y + z ~ x1 + x2
\end{verbatim}
\end{kframe}
\end{knitrout}

R provides high-level functions for model selection. Consequently many \Rlang users will rarely need to edit model formulas in their scripts. For example, step-wise model selection is possible with \Rlang method \code{step()}.

A matrix of dummy coefficients can be derived from a model formula, a type of contrast, and the data for the explanatory variables.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{treats.df} \hlkwb{<-} \hlkwd{data.frame}\hlstd{(}\hlkwc{A} \hlstd{=} \hlkwd{rep}\hlstd{(}\hlkwd{c}\hlstd{(}\hlstr{"yes"}\hlstd{,} \hlstr{"no"}\hlstd{),} \hlkwd{c}\hlstd{(}\hlnum{4}\hlstd{,} \hlnum{4}\hlstd{)),}
                        \hlkwc{B} \hlstd{=} \hlkwd{rep}\hlstd{(}\hlkwd{c}\hlstd{(}\hlstr{"white"}\hlstd{,} \hlstr{"black"}\hlstd{),} \hlnum{4}\hlstd{))}
\hlstd{treats.df}
\end{alltt}
\begin{verbatim}
##     A     B
## 1 yes white
## 2 yes black
## 3 yes white
## 4 yes black
## 5  no white
## 6  no black
## 7  no white
## 8  no black
\end{verbatim}
\end{kframe}
\end{knitrout}

The default contrasts types currently in use.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{options}\hlstd{(}\hlstr{"contrasts"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## $contrasts
##         unordered           ordered
## "contr.treatment"      "contr.poly"
\end{verbatim}
\end{kframe}
\end{knitrout}

A model matrix for a model for a two-way factorial design with no interaction term:

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{model.matrix}\hlstd{(}\hlopt{~} \hlstd{A} \hlopt{+} \hlstd{B, treats.df)}
\end{alltt}
\begin{verbatim}
##   (Intercept) Ayes Bwhite
## 1           1    1      1
## 2           1    1      0
## 3           1    1      1
## 4           1    1      0
## 5           1    0      1
## 6           1    0      0
## 7           1    0      1
## 8           1    0      0
## attr(,"assign")
## [1] 0 1 2
## attr(,"contrasts")
## attr(,"contrasts")$A
## [1] "contr.treatment"
##
## attr(,"contrasts")$B
## [1] "contr.treatment"
\end{verbatim}
\end{kframe}
\end{knitrout}

A model matrix for a model for a two-way factorial design with interaction term:

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{model.matrix}\hlstd{(}\hlopt{~} \hlstd{A} \hlopt{*} \hlstd{B, treats.df)}
\end{alltt}
\begin{verbatim}
##   (Intercept) Ayes Bwhite Ayes:Bwhite
## 1           1    1      1           1
## 2           1    1      0           0
## 3           1    1      1           1
## 4           1    1      0           0
## 5           1    0      1           0
## 6           1    0      0           0
## 7           1    0      1           0
## 8           1    0      0           0
## attr(,"assign")
## [1] 0 1 2 3
## attr(,"contrasts")
## attr(,"contrasts")$A
## [1] "contr.treatment"
##
## attr(,"contrasts")$B
## [1] "contr.treatment"
\end{verbatim}
\end{kframe}
\end{knitrout}

\end{explainbox}
\index{model formulas|)}

\section{Time series}\label{sec:stat:time:series}
\index{time series|(}
Longitudinal data consist of repeated measurements, usually done over time, on the same experimental units. Longitudinal data, when replicated on several experimental units at each time point, are called repeated measurements, while when not replicated, they are called time series. Base \Rlang provides special support for the analysis of time series data, while repeated measurements can be analyzed with nested linear models, mixed-effects models, and additive models.

Time series data are data collected in such a way that there is only one observation, possibly of multiple variables, available at each point in time. This brief section introduces only the most basic aspects of time-series analysis. In most cases time steps are of uniform duration and occur regularly, which simplifies data handling and storage. \Rlang not only provides methods for the analysis and manipulation of time-series, but also a specialized class for their storage, \Rclass{"ts"}. Regular time steps allow more compact storage---e.g.,  a \code{ts} object does not need to store time values for each observation but instead a combination of two of start time, step size and end time.

We start by creating a time series from a numeric vector. By now, you surely guessed that you need to use a constructor called \Rfunction{ts()} or a conversion constructor called \Rfunction{as.ts()} and that you can look up the arguments they accept by reading the corresponding help pages.

For example for a time series of monthly values we could use:

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.ts} \hlkwb{<-} \hlkwd{ts}\hlstd{(}\hlnum{1}\hlopt{:}\hlnum{10}\hlstd{,} \hlkwc{start} \hlstd{=} \hlnum{2019}\hlstd{,} \hlkwc{deltat} \hlstd{=} \hlnum{1}\hlopt{/}\hlnum{12}\hlstd{)}
\hlkwd{class}\hlstd{(my.ts)}
\end{alltt}
\begin{verbatim}
## [1] "ts"
\end{verbatim}
\begin{alltt}
\hlkwd{str}\hlstd{(my.ts)}
\end{alltt}
\begin{verbatim}
##  Time-Series [1:10] from 2019 to 2020: 1 2 3 4 5 6 7 8 9 10
\end{verbatim}
\end{kframe}
\end{knitrout}

We next use the data set \Rdata{austres} with data on the number of Australian residents and included in \Rlang.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{class}\hlstd{(austres)}
\end{alltt}
\begin{verbatim}
## [1] "ts"
\end{verbatim}
\begin{alltt}
\hlkwd{is.ts}\hlstd{(austres)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\end{kframe}
\end{knitrout}


Time series \Rdata{austres} is dominated by the increasing trend.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{plot}\hlstd{(austres)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-ts-02-1}

}


\end{knitrout}

A different example, using data set  \Rdata{nottem} containing meteorological data for Nottingham, shows a clear cyclic component. The annual cycle of mean air temperatures (in degrees Fahrenheit) is clear when data are plotted.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{data}\hlstd{(nottem)}
\hlkwd{is.ts}\hlstd{(nottem)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{plot}\hlstd{(nottem)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-ts-03-1}

}


\end{knitrout}

In\index{time series!decomposition} the next two code chunks, two different approaches to time series decomposition are used. In the first one we use a moving average to capture the trend, while in the second approach we use Loess (a smooth curve fitted by local weighted regression) for the decomposition, a method for which the acronym STL (Seasonal and Trend decomposition using Loess) is used.\qRfunction{decompose()}\qRfunction{stl()} Before decomposing the time-series we reexpress the temperatures in degrees Celsius.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{nottem.celcius} \hlkwb{<-} \hlstd{(nottem} \hlopt{-} \hlnum{32}\hlstd{)} \hlopt{*} \hlnum{5}\hlopt{/}\hlnum{9}
\end{alltt}
\end{kframe}
\end{knitrout}

We set the seasonal window to 7 months, the minimum accepted.


\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{nottem.stl} \hlkwb{<-} \hlkwd{stl}\hlstd{(nottem.celcius,} \hlkwc{s.window} \hlstd{=} \hlnum{7}\hlstd{)}
\hlkwd{plot}\hlstd{(nottem.stl)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-ts-05-1}

}


\end{knitrout}

\begin{advplayground}
It is interesting to explore the class and structure of the object returned by \Rfunction{stl()}, as we may want to extract components. Run the statements below to find out, and then plot individual components from the time series decomposition.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{class}\hlstd{(nottem.stl)}
\hlkwd{str}\hlstd{(nottem.stl)}
\end{alltt}
\end{kframe}
\end{knitrout}
\end{advplayground}

\index{time series|)}

\section{Multivariate statistics}\label{sec:stat:MV}
\index{multivariate methods|(}

\subsection{Multivariate analysis of variance}
\index{multivariate analysis of variance|(}
\index{MANOVA|see{multivariate analysis of variance}}
Multivariate methods take into account several response variables simultaneously, as part of a single analysis. In practice it is usual to use contributed packages for multivariate data analysis in \Rlang, except for simple cases. We will look first at \emph{multivariate} ANOVA or MANOVA. In the same way as \Rfunction{aov()} is a wrapper that uses internally \Rfunction{lm()}, \Rfunction{manova()} is a wrapper that uses internally \Rfunction{aov()}.

Multivariate model formulas in base \Rlang require the use of column binding (\code{cbind()}) on the left-hand side (lhs) of the model formula. For the next examples we use the well-known \Rdata{iris} data set, containing size measurements for flowers of two species of \emph{Iris}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{data}\hlstd{(iris)}
\hlstd{mmf1} \hlkwb{<-} \hlkwd{lm}\hlstd{(}\hlkwd{cbind}\hlstd{(Petal.Length, Petal.Width)} \hlopt{~}  \hlstd{Species,} \hlkwc{data} \hlstd{= iris)}
\hlkwd{anova}\hlstd{(mmf1)}
\end{alltt}
\begin{verbatim}
## Analysis of Variance Table
##
##              Df  Pillai approx F num Df den Df    Pr(>F)
## (Intercept)   1 0.98786   5939.2      2    146 < 2.2e-16 ***
## Species       2 1.04645     80.7      4    294 < 2.2e-16 ***
## Residuals   147
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\end{verbatim}
\begin{alltt}
\hlkwd{summary}\hlstd{(mmf1)}
\end{alltt}
\begin{verbatim}
## Response Petal.Length :
##
## Call:
## lm(formula = Petal.Length ~ Species, data = iris)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -1.260 -0.258  0.038  0.240  1.348
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)        1.46200    0.06086   24.02   <2e-16 ***
## Speciesversicolor  2.79800    0.08607   32.51   <2e-16 ***
## Speciesvirginica   4.09000    0.08607   47.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4303 on 147 degrees of freedom
## Multiple R-squared:  0.9414,	Adjusted R-squared:  0.9406
## F-statistic:  1180 on 2 and 147 DF,  p-value: < 2.2e-16
##
##
## Response Petal.Width :
##
## Call:
## lm(formula = Petal.Width ~ Species, data = iris)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -0.626 -0.126 -0.026  0.154  0.474
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)        0.24600    0.02894    8.50 1.96e-14 ***
## Speciesversicolor  1.08000    0.04093   26.39  < 2e-16 ***
## Speciesvirginica   1.78000    0.04093   43.49  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2047 on 147 degrees of freedom
## Multiple R-squared:  0.9289,	Adjusted R-squared:  0.9279
## F-statistic:   960 on 2 and 147 DF,  p-value: < 2.2e-16
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{mmf2} \hlkwb{<-} \hlkwd{manova}\hlstd{(}\hlkwd{cbind}\hlstd{(Petal.Length, Petal.Width)} \hlopt{~}  \hlstd{Species,} \hlkwc{data} \hlstd{= iris)}
\hlkwd{anova}\hlstd{(mmf2)}
\end{alltt}
\begin{verbatim}
## Analysis of Variance Table
##
##              Df  Pillai approx F num Df den Df    Pr(>F)
## (Intercept)   1 0.98786   5939.2      2    146 < 2.2e-16 ***
## Species       2 1.04645     80.7      4    294 < 2.2e-16 ***
## Residuals   147
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\end{verbatim}
\begin{alltt}
\hlkwd{summary}\hlstd{(mmf2)}
\end{alltt}
\begin{verbatim}
##            Df Pillai approx F num Df den Df    Pr(>F)
## Species     2 1.0465   80.661      4    294 < 2.2e-16 ***
## Residuals 147
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{advplayground}
Modify the example above to use \code{aov()} instead of \code{manova()} and save the result to a variable named \code{mmf3}.
Use \code{class()}, \code{attributes()}, \code{names()}, \code{str()} and extraction of members to explore objects \code{mmf1}, \code{mmf2} and \code{mmf3}. Are they different?
\end{advplayground}

\index{multivariate analysis of variance|)}

\subsection{Principal components analysis}\label{sec:stat:PCA}
\index{principal components analysis|(}\index{PCA|see {principal components analysis}}

Principal components analysis (PCA) is used to simplify a data set by combining variables with similar and ``mirror'' behavior into principal components. At a later stage, we frequently try to interpret these components in relation to known and/or assumed independent variables. Base \Rlang's function \Rfunction{prcomp()} computes the principal components and accepts additional arguments for centering and scaling.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{pc} \hlkwb{<-} \hlkwd{prcomp}\hlstd{(iris[}\hlkwd{c}\hlstd{(}\hlstr{"Sepal.Length"}\hlstd{,} \hlstr{"Sepal.Width"}\hlstd{,}
                    \hlstr{"Petal.Length"}\hlstd{,} \hlstr{"Petal.Width"}\hlstd{)],}
             \hlkwc{center} \hlstd{=} \hlnum{TRUE}\hlstd{,}
             \hlkwc{scale} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

By printing the returned object we can see the loadings of each variable in the principal components \code{P1} to \code{P4}.
\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{class}\hlstd{(pc)}
\end{alltt}
\begin{verbatim}
## [1] "prcomp"
\end{verbatim}
\begin{alltt}
\hlstd{pc}
\end{alltt}
\begin{verbatim}
## Standard deviations (1, .., p=4):
## [1] 1.7083611 0.9560494 0.3830886 0.1439265
##
## Rotation (n x k) = (4 x 4):
##                     PC1         PC2        PC3        PC4
## Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
## Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
## Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
## Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971
\end{verbatim}
\end{kframe}
\end{knitrout}

In the summary, the rows ``Proportion of Variance'' and ``Cumulative Proportion'' are most informative of the contribution of each principal component (PC) to explaining the variation among observations.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{summary}\hlstd{(pc)}
\end{alltt}
\begin{verbatim}
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
\end{verbatim}
\end{kframe}
\end{knitrout}


Method \Rfunction{biplot()} produces a plot with one principal component (PC) on each axis, plus arrows for the loadings.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{biplot}\hlstd{(pc)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-pca-05-1}

}


\end{knitrout}


Method \code{plot()} generates a bar plot of variances corresponding to the different components.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{plot}\hlstd{(pc)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-pca-04-1}

}


\end{knitrout}

Visually more elaborate plots of the principal components and their loadings can be obtained using packages \pkgnameNI{ggplot} described in chapter \ref{chap:R:plotting} starting on page \pageref{chap:R:plotting}. Package \pkgnameNI{ggfortify} and package \pkgnameNI{ggbiplot} extend \pkgnameNI{ggplot} so as to make it easy to plot principal components and their loadings.

\begin{playground}
For growth and morphological data, a log-transformation can be suitable given that variance is frequently proportional to the magnitude of the values measured. We leave as an exercise to repeat the above analysis using transformed values for the dimensions of petals and sepals. How much does the use of transformations change the outcome of the analysis?
\end{playground}

\begin{advplayground}
As for other fitted models, the object returned by function \Rfunction{prcomp()} is a list with multiple components.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{str}\hlstd{(pc,} \hlkwc{max.level} \hlstd{=} \hlnum{1}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
\end{advplayground}

\index{principal components analysis|)}

\subsection{Multidimensional scaling}\label{sec:stat:MDS}
\index{multidimensional scaling|(}\index{MDS|see {multidimensional scaling}}

The aim of multidimensional scaling (MDS) is to visualize in 2D space the similarity between pairs of observations. The values for the observed variable(s) are used to compute a measure of distance among pairs of observations. The nature of the data will influence what distance metric is most informative.
For MDS we start with a matrix of distances among observations. We will use, for the example, distances in kilometers between geographic locations in Europe from data set \Rdata{eurodist}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{loc} \hlkwb{<-} \hlkwd{cmdscale}\hlstd{(eurodist)}
\end{alltt}
\end{kframe}
\end{knitrout}

We can see that the returned object \code{loc} is a \code{matrix}, with names for one of the dimensions.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{class}\hlstd{(loc)}
\end{alltt}
\begin{verbatim}
## [1] "matrix" "array"
\end{verbatim}
\begin{alltt}
\hlkwd{dim}\hlstd{(loc)}
\end{alltt}
\begin{verbatim}
## [1] 21  2
\end{verbatim}
\begin{alltt}
\hlkwd{dimnames}\hlstd{(loc)}
\end{alltt}
\begin{verbatim}
## [[1]]
##  [1] "Athens"          "Barcelona"       "Brussels"        "Calais"
##  [5] "Cherbourg"       "Cologne"         "Copenhagen"      "Geneva"
##  [9] "Gibraltar"       "Hamburg"         "Hook of Holland" "Lisbon"
## [13] "Lyons"           "Madrid"          "Marseilles"      "Milan"
## [17] "Munich"          "Paris"           "Rome"            "Stockholm"
## [21] "Vienna"
##
## [[2]]
## NULL
\end{verbatim}
\begin{alltt}
\hlkwd{head}\hlstd{(loc)}
\end{alltt}
\begin{verbatim}
##                 [,1]      [,2]
## Athens    2290.27468 1798.8029
## Barcelona -825.38279  546.8115
## Brussels    59.18334 -367.0814
## Calais     -82.84597 -429.9147
## Cherbourg -352.49943 -290.9084
## Cologne    293.68963 -405.3119
\end{verbatim}
\end{kframe}
\end{knitrout}

To make the code easier to read, two vectors are first extracted from the matrix and named \code{x} and \code{y}. We force aspect to equality so that distances on both axes are comparable.


\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{x} \hlkwb{<-} \hlstd{loc[,} \hlnum{1}\hlstd{]}
\hlstd{y} \hlkwb{<-} \hlopt{-}\hlstd{loc[,} \hlnum{2}\hlstd{]} \hlcom{# change sign so North is at the top}
\hlkwd{plot}\hlstd{(x, y,} \hlkwc{type} \hlstd{=} \hlstr{"n"}\hlstd{,} \hlkwc{asp} \hlstd{=} \hlnum{1}\hlstd{,}
     \hlkwc{main} \hlstd{=} \hlstr{"cmdscale(eurodist)"}\hlstd{)}
\hlkwd{text}\hlstd{(x, y,} \hlkwd{rownames}\hlstd{(loc),} \hlkwc{cex} \hlstd{=} \hlnum{0.6}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-mds-03-1}

}


\end{knitrout}

\begin{advplayground}
  Find data on the mean annual temperature, mean annual rainfall and mean number of sunny days at each of the locations in the \code{eurodist} data set. Next, compute suitable distance metrics, for example, using function \Rfunction{dist}. Finally, use MDS to visualize how similar the locations are with respect to each of the three variables. Devise a measure of distance that takes into account the three climate variables and use MDS to find how distant the different locations are.
\end{advplayground}

\index{multidimensional scaling|)}

\subsection{Cluster analysis}\label{sec:stat:cluster}
\index{cluster analysis|(}

In cluster analysis, the aim is to group observations into discrete groups with maximal internal homogeneity and maximum group-to-group differences. In the next example we use function \Rfunction{hclust()} from the base-\Rlang package \pkgname{stats}. We use, as above, the \Rdata{eurodist} data which directly provides distances. In other cases a matrix of distances between pairs of observations needs to be first calculated with function \Rfunction{dist} which supports several methods.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{hc} \hlkwb{<-} \hlkwd{hclust}\hlstd{(eurodist)}
\hlkwd{print}\hlstd{(hc)}
\end{alltt}
\begin{verbatim}
##
## Call:
## hclust(d = eurodist)
##
## Cluster method   : complete
## Number of objects: 21
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{plot}\hlstd{(hc)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-cluster-02-1}

}


\end{knitrout}

We can use \Rfunction{cutree()} to limit the number of clusters by directly passing as an argument the desired number of clusters or the height at which to cut the tree.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{cutree}\hlstd{(hc,} \hlkwc{k} \hlstd{=} \hlnum{5}\hlstd{)}
\end{alltt}
\begin{verbatim}
##          Athens       Barcelona        Brussels          Calais       Cherbourg
##               1               2               3               3               3
##         Cologne      Copenhagen          Geneva       Gibraltar         Hamburg
##               3               4               2               5               4
## Hook of Holland          Lisbon           Lyons          Madrid      Marseilles
##               3               5               2               5               2
##           Milan          Munich           Paris            Rome       Stockholm
##               2               3               3               1               4
##          Vienna
##               3
\end{verbatim}
\end{kframe}
\end{knitrout}

The object returned by \Rfunction{hclust()} contains details of the result of the clustering, which allows further manipulation and plotting.
\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{str}\hlstd{(hc)}
\end{alltt}
\begin{verbatim}
## List of 7
##  $ merge      : int [1:20, 1:2] -8 -3 -6 -4 -16 -17 -5 -7 -2 -12 ...
##  $ height     : num [1:20] 158 172 269 280 328 428 460 460 521 668 ...
##  $ order      : int [1:21] 1 19 9 12 14 20 7 10 16 8 ...
##  $ labels     : chr [1:21] "Athens" "Barcelona" "Brussels" "Calais" ...
##  $ method     : chr "complete"
##  $ call       : language hclust(d = eurodist)
##  $ dist.method: NULL
##  - attr(*, "class")= chr "hclust"
\end{verbatim}
\end{kframe}
\end{knitrout}

\index{cluster analysis|)}

%\subsection{Discriminant analysis}\label{sec:stat:DA}
%\index{discriminant analysis|(}
%
%In discriminant analysis the categories or groups to which objects belong are known \emph{a priori} for a training data set. The aim is to fit/build a classifier that will allow us to assign future observations to the different non-overlapping groups with as few mistakes as possible.
%
%
%\index{discriminant analysis|)}
\index{multivariate methods|)}

\section{Further reading}\label{sec:stat:further:reading}

Two recent text books\index{further reading!statistics in R} on statistics, following a modern approach, and using \Rlang for examples, are \citetitle{Diez2019} \autocite{Diez2019} and \citetitle{Holmes2019} \autocite{Holmes2019}. Three examples of books introducing statistical computations in \Rlang are \citetitle{Dalgaard2008} \autocite{Dalgaard2008}, \citetitle{Everitt2009} \autocite{Everitt2009} and \citetitle{Zuur2009} \autocite{Zuur2009}. More advanced books are available with detailed descriptions various types of analyses in \Rlang, including thorough descriptions of the methods briefly presented in this chapter. Good examples of books with broad scope are \citebooktitle{Crawley2012} \autocite{Crawley2012} and the classic reference \citebooktitle{Venables2002} \autocite{Venables2002}. More specific books are also available from which a few suggestions for further reading are \citebooktitle{Everitt2011} \autocite{Everitt2011}, \citebooktitle{Faraway2004} \autocite{Faraway2004}, \citebooktitle{Faraway2006} \autocite{Faraway2006}, \citebooktitle{Pinheiro2000} \autocite{Pinheiro2000} and \citebooktitle{Wood2017} \autocite{Wood2017}.


% !Rnw root = appendix.main.Rnw


\chapter{The R language: Adding new ``words''}\label{chap:R:functions}

\begin{VF}
Computer Science is a science of abstraction---creating the right model for a problem and devising the appropriate mechanizable techniques to solve it.

\VA{Alfred V. Aho and Jeffrey D. Ullman}{\emph{Foundations of Computer Science}, 1992}\nocite{Aho1992}
\end{VF}

%\dictum[Alfred V. Aho, Jeffrey D. Ullman, \emph{Foundations of Computer Science}, Computer Science Press, 1992]{Computer Science is a science of abstraction---creating the right model for a problem and devising the appropriate mechanizable techniques to solve it.}\vskip2ex

\section{Aims of this chapter}

In earlier chapters we have only used base \Rlang features. In this chapter you will learn how to expand the range of features available. In the first part of the chapter we will focus on using existing packages and how they expand the functionality of \Rlang. In the second part you will learn how to define new functions, operators and classes. We will not consider the important, but more advanced question of packaging functions and classes into new \Rlang packages.

\section{Packages}\label{sec:script:packages}

\subsection{Sharing of \Rlang-language extensions}
\index{extensions to R}
The most elegant way of adding new features or capabilities to \Rlang is through packages. This is without doubt the best mechanism when these extensions to \Rlang need to be shared. However, in most situations it is also the best mechanism for managing code that will be reused even by a single person over time. \Rlang packages have strict rules about their contents, file structure, and documentation, which makes it possible among other things for the package documentation to be merged into \Rpgrm's help system when a package is loaded. With a few exceptions, packages can be written so that they will work on any computer where \Rpgrm runs.

Packages can be shared as source or binary package files, sent for example through e-mail. However, for sharing packages widely, it is best to submit them to a repository. The largest public repository of \Rpgrm packages is called CRAN\index{CRAN}, an acronym for Comprehensive R Archive Network. Packages available through CRAN are guaranteed to work, in the sense of not failing any tests built into the package and not crashing or aborting prematurely. They are tested daily, as they may depend on other packages whose code will change when updated. In January 2017, the number of packages available through CRAN passed the 10,000 mark.

A key repository for bioinformatics with \Rlang is Bioconductor\index{Bioconductor}, containing packages that pass strict quality tests. Recently, ROpenScience\index{ROpenScience} has established guidelines and a system for code peer review for packages. These peer-reviewed packages are available through CRAN or other repositories and listed at the ROpenScience website. In some cases you may need or want to install less stable code from Git repositories such as versions still under development not yet submitted to CRAN. Using the package \pkgname{devtools} we can install packages directly from GitHub\index{GitHub}, Bitbucket\index{GitHub} and other code repositories based on \pgrmname{Git}. Installations from code repositories are always installations from sources (see below). It is of course also possible to install packages from local files (e.g.,  after a manual download).

One good way of learning how the extensions provided by a package work, is by experimenting with them. When using a function we are not yet familiar with, looking at its help to check all its features will expand your ``toolbox.'' How much documentation is included with packages varies, while documentation of exported objects is enforced, many packages include, in addition, comprehensive user guides or articles as \emph{vignettes}. It is not unusual to decide which package to use from a set of alternatives based on the quality of available documentation. In the case of packages adding extensive new functionality, they may be documented in depth in a book. Well-known examples are \citebooktitle{Pinheiro2000} \autocite{Pinheiro2000}, \citebooktitle{Sarkar2008} \autocite{Sarkar2008} and \citebooktitle{Wickham2016} \autocite{Wickham2016}.

\subsection{How packages work}

The development of packages is beyond the scope of the current book, and thoroughly explained in the book \citebooktitle{Wickham2015} \autocite{Wickham2015}. However, it is still worthwhile mentioning a few things about the development of \Rpgrm packages. Using \RStudio it is relatively easy to develop your own packages. Packages can be of very different sizes and complexity. Packages use a relatively rigid structure of folders for storing the different types of files, including documentation compatible with \Rpgrm's built-in help system. This allows documentation for contributed packages to be seamlessly linked to \Rlang's help system when packages are loaded. In addition to \Rlang code, packages can call functions and routines written in \langname{C}, \langname{C++}, \langname{FORTRAN}, \langname{Java}, \langname{Python}, etc., but some kind of ``glue'' is needed, as function call conventions and \emph{name mangling} depend on the programming language, and in many cases also on the compiler used. For \langname{C++}, the \pkgname{Rcpp} \Rlang package makes the ``gluing'' relatively easy \autocite{Eddelbuettel2013}. In the case of \langname{Python}, \Rlang package \pkgname{reticulate} makes calling of \langname{Python} methods and exchange of data easy, and it is well supported by \RStudio. In the case of \langname{Java} we can use package \pkgname{RJava} instead. For \langname{C} and \langname{FORTRAN}, \Rlang provides the functionality needed, but the interface needs some ad hoc coding in most cases.

Only objects exported by a package that has been attached are visible outside its own namespace. Loading and attaching a package with \Rfunction{library()} makes the exported objects available. Attaching a package adds the objects exported by the package to the search path so that they can be accessed without prepending the name of the namespace. Most packages do not export all the functions and objects defined in their code; some are kept internal, in most cases because they may change or be removed in future versions. Package namespaces can be detached and also unloaded with function \Rscoping{detach()} using a slightly different notation for the argument from that which we described for data frames in section \ref{sec:calc:df:with} on page \pageref{sec:calc:df:with}.

\subsection{Download, installation and use}

\index{packages!using}
In \Rlang speak, ``library'' is the location where packages are installed. Packages are sets of functions, and data, specific for some particular purpose, that can be loaded into an \Rlang session to make them available so that they can be used in the same way as built-in \Rlang functions and data. Function \Rfunction{library()} is used to load and attach packages that are already installed in the local \Rlang library. In contrast, function \Rfunction{install.packages()} is used to install packages. When using \RStudio it is easiest to use \RStudio menus (which call \Rfunction{install.packages()} and \Rfunction{update.packages()}) to install or update packages.

\begin{playground}
Use \code{help} to look up the help pages for \Rfunction{install.packages()} and \Rfunction{library()}, and explain what the code in the next chunk does.
\end{playground}

\Rpgrm packages can be installed either from sources, or from already built ``binaries''. Installing from sources, depending on the package, may require additional software to be available. Under \pgrmname{MS-Windows}, the needed shell, commands and compilers are not available as part of the operating system. Installing them is not difficult as they are available prepackaged in installers (you will need \pgrmname{RTools}, and \pgrmnameTwo{\hologo{MiKTeX}}{MiKTeX}). It is easier to install packages from binary \texttt{.zip} files under \pgrmname{MS-Windows}. Under Linux most tools will be available, or very easy to install, so it is usual to install packages from sources. For \pgrmname{OS X} (Apple Mac) the situation is somewhere in-between. If the tools are available, packages can be very easily installed from sources from within \RStudio. However, binaries are for most packages also readily available.

\subsection{Finding suitable packages}

Due to the large number of contributed \Rlang packages it can sometimes be difficult to find a suitable package for a task at hand. It is good to first check if the necessary capability is already built into base \Rlang. Base \Rlang plus the recommended packages (installed when \Rlang is installed) cover a lot of ground. To analyze data using almost any of the more common statistical methods does not require the use of special packages. Sometimes, contributed packages duplicate or extend the functionality in base \Rlang with advantage. When one considers the use of novel or specialized types of data analysis, the use of contributed packages can be unavoidable. Even in such cases, it is not unusual to have alternatives to choose from within the available contributed packages. Sometimes groups or suites of packages are designed to work well together.

The CRAN repository has very broad scope and includes a section called ``views.'' \Rlang views are web pages providing annotated lists of packages frequently used within a given field of research, engineering or specific applications. These views are edited and updated by different editors. They can be found at \url{https://cran.r-project.org/web/views/}.

The Bioconductor repository specializes in bioinformatics with \Rlang. It also has a section with ``views'' and within it, descriptions of different data analysis workflows. The workflows are especially good as they reveal which sets of packages work well together. These views can be found at \url{https://www.bioconductor.org/packages/release/BiocViews.html}.

Although ROpenSci does not keep a separate package repository for the peer-reviewed packages, they do keep an index of them at \url{https://ropensci.org/packages/}.

The CRAN repository keeps an archive of earlier versions of packages, on an individual package basis. METACRAN (\url{https://www.r-pkg.org/}) is an archive of repositories, that keeps a historical record as snapshots from CRAN. METACRAN uses a different search engine than CRAN itself, making it easier to search the whole repository.

\section{Defining functions and operators}\label{sec:script:functions}
\index{functions!defining new}\index{operators!defining new}

\emph{Abstraction} can be defined as separating the fundamental properties from the accidental ones. Say obtaining the mean from a given vector of numbers is an actual operation. There can be many such operations on different numeric vectors, each one a specific case. When we describe an algorithm for computing the mean from any numeric vector we have created the abstraction of \emph{mean}. In the same way, each time we separate operations from specific data we create a new abstraction. In this sense, functions are abstractions of operations or actions; they are like ``verbs'' describing actions separately from actors.

The main role of functions is that of providing an abstraction allowing us to avoid repeating blocks of code (groups of statements) applying the same operations on different data. The reasons to avoid repetition of similar blocks of code statements are that 1) if the algorithm or implementation needs to be revised---e.g., to fix a bug or error---it is best to make edits in a single place; 2) sooner or later pieces of repeated code can become different leading to inconsistencies and hard-to-track bugs; 3) abstraction and division of a problem into smaller chunks, greatly helps with keeping the code understandable to humans; 4) textual repetition makes the script file longer, and this makes debugging, commenting, etc., more tedious, and error prone.

How do we, in practice, avoid repeating bits of code? We write a function containing the statements that we would need to repeat, and later we \emph{call} (``use'') the function in their place. We have been calling \Rlang functions or operators in almost every example in this book; what we will next tackle is how to define new functions of our own.

New functions and operators are defined using function \Rfunction{function()}, and saved like any other object in \Rpgrm by assignment to a variable name. In the example below, \code{x} and \code{y} are both formal parameters, or names used within the function for objects that will be supplied as \emph{arguments} when the function is called. One can think of parameter names as placeholders for actual values to be supplied as arguments when calling the function.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.prod} \hlkwb{<-} \hlkwa{function}\hlstd{(}\hlkwc{x}\hlstd{,} \hlkwc{y}\hlstd{)\{x} \hlopt{*} \hlstd{y\}}
\hlkwd{my.prod}\hlstd{(}\hlnum{4}\hlstd{,} \hlnum{3}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 12
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{warningbox}
In base \Rlang, arguments\index{functions!arguments} to functions are passed by copy. This is something very important to remember. Whatever you do within a function to modify an argument, its value outside the function will remain (almost) always unchanged. (In other languages, arguments can also be passed by reference, meaning that assignments to a formal parameter within the body of the function are referenced to the argument and modify it. Such roundabout effects are frequently called side effects of a call. It is possible to imitate such behavior in \Rlang using some language trickery and consequently, some packages such as \pkgname{data.table} do define functions that use passing of arguments by reference.)

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.change} \hlkwb{<-} \hlkwa{function}\hlstd{(}\hlkwc{x}\hlstd{)\{x} \hlkwb{<-} \hlnum{NA}\hlstd{\}}
\hlstd{a} \hlkwb{<-} \hlnum{1}
\hlkwd{my.change}\hlstd{(a)}
\hlstd{a}
\end{alltt}
\begin{verbatim}
## [1] 1
\end{verbatim}
\end{kframe}
\end{knitrout}

In general, any result that needs to be made available outside the function must be returned by the function---or explicitly assigned to an object in the enclosing environment (i.e., using \Roperator{<<-} or \Rfunction{assign()}) as a side effect.

A function can only return a single object, so when multiple results are produced they need to be collected into a single object. In many cases, lists are used to collect all the values to be returned into one \Rlang object. For example, model fit functions like \code{lm()}, discussed in section \ref{sec:stat:LM} on page \pageref{sec:stat:LM}, return a complex list with heterogeneous named members.
\end{warningbox}

\begin{playground}
When function \Rcontrol{return()} is called within a function, flow of execution within the function stops and the argument passed
to \Rcontrol{return()} is the value returned by the function call. In contrast, if function \Rcontrol{return()} is not explicitly
called, the value returned by the function call is that returned by the last statement \emph{executed} within the body of the function.

\label{chunck:print:funs}
\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{print.x.1} \hlkwb{<-} \hlkwa{function}\hlstd{(}\hlkwc{x}\hlstd{)\{}\hlkwd{print}\hlstd{(x)\}}
\hlkwd{print.x.1}\hlstd{(}\hlstr{"test"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "test"
\end{verbatim}
\begin{alltt}
\hlstd{print.x.2} \hlkwb{<-} \hlkwa{function}\hlstd{(}\hlkwc{x}\hlstd{)\{}\hlkwd{print}\hlstd{(x);} \hlkwd{return}\hlstd{(x)\}}
\hlkwd{print.x.2}\hlstd{(}\hlstr{"test"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "test"
## [1] "test"
\end{verbatim}
\begin{alltt}
\hlstd{print.x.3} \hlkwb{<-} \hlkwa{function}\hlstd{(}\hlkwc{x}\hlstd{)\{}\hlkwd{return}\hlstd{(x);} \hlkwd{print}\hlstd{(x)\}}
\hlkwd{print.x.3}\hlstd{(}\hlstr{"test"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "test"
\end{verbatim}
\begin{alltt}
\hlstd{print.x.4} \hlkwb{<-} \hlkwa{function}\hlstd{(}\hlkwc{x}\hlstd{)\{}\hlkwd{return}\hlstd{();} \hlkwd{print}\hlstd{(x)\}}
\hlkwd{print.x.4}\hlstd{(}\hlstr{"test"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## NULL
\end{verbatim}
\begin{alltt}
\hlstd{print.x.5} \hlkwb{<-} \hlkwa{function}\hlstd{(}\hlkwc{x}\hlstd{)\{x\}}
\hlkwd{print.x.4}\hlstd{(}\hlstr{"test"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## NULL
\end{verbatim}
\end{kframe}
\end{knitrout}
\end{playground}

\begin{advplayground}
Test the behavior of functions \code{print.x.1()} and \code{print.x.5()}, as defined above, both at the command prompt, and in a script. The behavior of one of these functions will be different when the script is sourced than at the command prompt. Explain why.
\end{advplayground}

Functions have their own scope. Any names created by normal assignment within the body of a function are visible only within the body of the function and disappear when the function returns from the call. In normal use, functions in \Rlang do not affect their environment through side effects. They receive input through arguments and return a value as the result of the call. This value can be either printed or assigned as we have seen when using functions earlier.

\subsection{Ordinary functions}\label{sec:functions:sem}
\index{functions!defining new}

After the toy examples above, we will define a small but useful function: a function for calculating the standard error of the mean from a numeric vector. The standard error is given by $S_{\hat{x}} = \sqrt{S^2 / n}$. We can translate this into the definition of an \Rlang function called \code{SEM}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{SEM} \hlkwb{<-} \hlkwa{function}\hlstd{(}\hlkwc{x}\hlstd{)\{}\hlkwd{sqrt}\hlstd{(}\hlkwd{var}\hlstd{(x)} \hlopt{/} \hlkwd{length}\hlstd{(x))\}}
\end{alltt}
\end{kframe}
\end{knitrout}

We can test our function.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlnum{1}\hlstd{,} \hlnum{2}\hlstd{,} \hlnum{3}\hlstd{,} \hlopt{-}\hlnum{5}\hlstd{)}
\hlstd{a.na} \hlkwb{<-} \hlkwd{c}\hlstd{(a,} \hlnum{NA}\hlstd{)}
\hlkwd{SEM}\hlstd{(}\hlkwc{x} \hlstd{= a)}
\end{alltt}
\begin{verbatim}
## [1] 1.796988
\end{verbatim}
\begin{alltt}
\hlkwd{SEM}\hlstd{(a)}
\end{alltt}
\begin{verbatim}
## [1] 1.796988
\end{verbatim}
\begin{alltt}
\hlkwd{SEM}\hlstd{(a.na)}
\end{alltt}
\begin{verbatim}
## [1] NA
\end{verbatim}
\end{kframe}
\end{knitrout}

For example in \code{SEM(a)} we are calling function \Rfunction{SEM()} with \code{a} as an argument.

The function we defined above will always give the correct answer because \code{NA} values in the input will always result in an \code{NA} being returned. The problem is that unlike \Rlang's functions like \code{var()}, there is no option to omit \code{NA} values in the function we defined.

This could be implemented by adding a second parameter \code{na.omit} to the definition of our function and passing its argument to the call to \Rfunction{var()} within the body of \code{SEM()}. However, to avoid returning wrong values we need to make sure \code{NA} values are also removed before counting the number of observations with \code{length()}.

A readable way of implementing this in code is to define the function as follows.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{sem} \hlkwb{<-} \hlkwa{function}\hlstd{(}\hlkwc{x}\hlstd{,} \hlkwc{na.omit} \hlstd{=} \hlnum{FALSE}\hlstd{) \{}
 \hlkwa{if} \hlstd{(na.omit) \{}
   \hlstd{x} \hlkwb{<-} \hlkwd{na.omit}\hlstd{(x)}
 \hlstd{\}}
 \hlkwd{sqrt}\hlstd{(}\hlkwd{var}\hlstd{(x)}\hlopt{/}\hlkwd{length}\hlstd{(x))}
\hlstd{\}}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{sem}\hlstd{(}\hlkwc{x} \hlstd{= a)}
\end{alltt}
\begin{verbatim}
## [1] 1.796988
\end{verbatim}
\begin{alltt}
\hlkwd{sem}\hlstd{(}\hlkwc{x} \hlstd{= a.na)}
\end{alltt}
\begin{verbatim}
## [1] NA
\end{verbatim}
\begin{alltt}
\hlkwd{sem}\hlstd{(}\hlkwc{x} \hlstd{= a.na,} \hlkwc{na.omit} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 1.796988
\end{verbatim}
\end{kframe}
\end{knitrout}

\Rlang does not provide a function for standard error, so the function above is generally useful. Its user interface is consistent with that of functionally similar existing functions. We have added a new word to the \Rlang vocabulary available to us.

In the definition of \code{sem()} we set a default argument for parameter \code{na.omit} which is used unless the user explicitly passes an argument to this parameter.

%In addition if names of the parameters are supplied arguments can be passed in any order. If parameter names are not supplied arguments are matched to parameters based on their position. Once one parameter name is given, all later arguments need also to be explicitly named.

%We can assign to a variable defined `outside' a function with operator \code{<<-} but the usual recommendation is to avoid its use. This type of effects of calling a function are frequently called `side-effects'.

\begin{playground}
Define your own function to calculate the mean in a similar way as \Rfunction{SEM()} was defined above. Hint: function \Rfunction{sum()} could be of help.
\end{playground}

Functions can have much more complex and larger compound statements as their body than those in the examples above. Within an expression, a function name followed by parentheses is interpreted as a call to the function. The bare name of a function instead gives access to its definition.

We first print (implicitly) the definition of our function from earlier in this section.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{sem}
\end{alltt}
\begin{verbatim}
## function(x, na.omit = FALSE) {
##  if (na.omit) {
##    x <- na.omit(x)
##  }
##  sqrt(var(x)/length(x))
## }
## <bytecode: 0x000000001c7dcd30>
\end{verbatim}
\end{kframe}
\end{knitrout}

Next we print the definition of \Rlang's linear model fitting function \code{lm()}. (Use of \code{lm()} is described in section \ref{sec:stat:LM} on page \pageref{sec:stat:LM}.)

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{lm}
\end{alltt}
\begin{verbatim}
## function (formula, data, subset, weights, na.action, method = "qr",
##     model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
##     contrasts = NULL, offset, ...)
## {
##     ret.x <- x
##     ret.y <- y
##     cl <- match.call()
##     mf <- match.call(expand.dots = FALSE)
##     m <- match(c("formula", "data", "subset", "weights", "na.action",
##         "offset"), names(mf), 0L)
##     mf <- mf[c(1L, m)]
##     mf$drop.unused.levels <- TRUE
##     mf[[1L]] <- quote(stats::model.frame)
##     mf <- eval(mf, parent.frame())
##     if (method == "model.frame")
##         return(mf)
##     else if (method != "qr")
##         warning(gettextf("method = '%s' is not supported. Using 'qr'",
##             method), domain = NA)
##     mt <- attr(mf, "terms")
##     y <- model.response(mf, "numeric")
##     w <- as.vector(model.weights(mf))
##     if (!is.null(w) && !is.numeric(w))
##         stop("'weights' must be a numeric vector")
##     offset <- model.offset(mf)
##     mlm <- is.matrix(y)
##     ny <- if (mlm)
##         nrow(y)
##     else length(y)
##     if (!is.null(offset)) {
##         if (!mlm)
##             offset <- as.vector(offset)
##         if (NROW(offset) != ny)
##             stop(gettextf("number of offsets is %d, should equal %d (number of observations)",
##                 NROW(offset), ny), domain = NA)
##     }
##     if (is.empty.model(mt)) {
##         x <- NULL
##         z <- list(coefficients = if (mlm) matrix(NA_real_, 0,
##             ncol(y)) else numeric(), residuals = y, fitted.values = 0 *
##             y, weights = w, rank = 0L, df.residual = if (!is.null(w)) sum(w !=
##             0) else ny)
##         if (!is.null(offset)) {
##             z$fitted.values <- offset
##             z$residuals <- y - offset
##         }
##     }
##     else {
##         x <- model.matrix(mt, mf, contrasts)
##         z <- if (is.null(w))
##             lm.fit(x, y, offset = offset, singular.ok = singular.ok,
##                 ...)
##         else lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok,
##             ...)
##     }
##     class(z) <- c(if (mlm) "mlm", "lm")
##     z$na.action <- attr(mf, "na.action")
##     z$offset <- offset
##     z$contrasts <- attr(x, "contrasts")
##     z$xlevels <- .getXlevels(mt, mf)
##     z$call <- cl
##     z$terms <- mt
##     if (model)
##         z$model <- mf
##     if (ret.x)
##         z$x <- x
##     if (ret.y)
##         z$y <- y
##     if (!qr)
##         z$qr <- NULL
##     z
## }
## <bytecode: 0x00000000151d2418>
## <environment: namespace:stats>
\end{verbatim}
\end{kframe}
\end{knitrout}

As can be seen at the end of the listing, this function written in the \Rlang language has been byte-compiled so that it executes faster. Functions that are part of the \Rlang language, but that are not coded using the \Rlang language, are called primitives and their full definition cannot be accessed through their name (c.f., \code{sem()} defined above).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{list}
\end{alltt}
\begin{verbatim}
## function (...)  .Primitive("list")
\end{verbatim}
\end{kframe}
\end{knitrout}

\subsection{Operators}
\index{operators!defining new}

Operators are functions that use a different syntax for being called. If their name is enclosed in back ticks they can be called as ordinary functions. Binary operators like \code{+} have two formal parameters, and unary operators like unary \code{-} have only one formal parameter. The parameters of many binary \Rlang operators are named \code{e1} and \code{e2}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlnum{1} \hlopt{/} \hlnum{2}
\end{alltt}
\begin{verbatim}
## [1] 0.5
\end{verbatim}
\begin{alltt}
\hlkwd{`/`}\hlstd{(}\hlnum{1} \hlstd{,} \hlnum{2}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 0.5
\end{verbatim}
\begin{alltt}
\hlkwd{`/`}\hlstd{(}\hlkwc{e1} \hlstd{=} \hlnum{1} \hlstd{,} \hlkwc{e2} \hlstd{=} \hlnum{2}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 0.5
\end{verbatim}
\end{kframe}
\end{knitrout}

An important consequence of the possibility of calling operators using ordinary syntax is that operators can be used as arguments to \emph{apply} functions in the same way as ordinary functions. When passing operator names as arguments to \emph{apply} functions we only need to enclose them in back ticks (see section \ref{sec:data:apply} on page \pageref{sec:data:apply}).

The name by itself and enclosed in back ticks allows us to access the definition of an operator.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{`/`}
\end{alltt}
\begin{verbatim}
## function (e1, e2)  .Primitive("/")
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{explainbox}
\textbf{Defining a new operator.} We will define a binary operator (taking two arguments) that subtracts from the numbers in a vector the mean of another vector. First we need a suitable name, but we have less freedom as names of user-defined operators must be enclosed in percent signs. We will use \code{\%-mean\%} and as with any \emph{special name}, we need to enclose it in quotation marks for the assignment.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstr{"%-mean%"} \hlkwb{<-} \hlkwa{function}\hlstd{(}\hlkwc{e1}\hlstd{,} \hlkwc{e2}\hlstd{) \{}
  \hlstd{e1} \hlopt{-} \hlkwd{mean}\hlstd{(e2)}
\hlstd{\}}
\end{alltt}
\end{kframe}
\end{knitrout}

We can then use our new operator in a example.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlnum{10}\hlopt{:}\hlnum{15} \hlopt{%-mean%} \hlnum{1}\hlopt{:}\hlnum{20}
\end{alltt}
\begin{verbatim}
## [1] -0.5  0.5  1.5  2.5  3.5  4.5
\end{verbatim}
\end{kframe}
\end{knitrout}

To print the definition, we enclose the name of our new operator in back ticks---i.e., we \emph{back quote} the special name.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{`%-mean%`}
\end{alltt}
\begin{verbatim}
## function(e1, e2) {
##   e1 - mean(e2)
## }
\end{verbatim}
\end{kframe}
\end{knitrout}

\end{explainbox}

\section{Objects, classes, and methods}\label{sec:script:objects:classes:methods}\label{sec:methods}
\index{objects}\index{classes}\index{methods}\index{object-oriented programming}
\index{S3 class system}\index{classes!S3 class system}\index{methods!S3 class system}
New classes are normally defined within packages rather than in user scripts. To be really useful implementing a new class involves not only defining a class but also a set of specialized functions or \emph{methods} that implement operations on objects belonging to the new class. Nevertheless, an understanding of how classes work is important even if only very occasionally a user will define a new method for an existing class within a script.

Classes are abstractions, but abstractions describing the shared properties of ``types'' or groups of similar objects. In this sense, classes are abstractions of ``actors,'' they are like ``nouns'' in natural language. What we obtain with classes is the possibility of defining multiple versions of functions (or \emph{methods}) sharing the same name but tailored to operate on objects belonging to different classes. We have already been using methods with multiple \emph{specializations} throughout the book, for example \code{plot()} and \code{summary()}.

We start with a quotation from \citebooktitle{Burns1998} \autocite[][, page 13]{Burns1998}.
\begin{quotation}
The idea of object-oriented programming is simple, but carries a lot of weight.
Here's the whole thing: if you told a group of people ``dress for work,'' then
you would expect each to put on clothes appropriate for that individual's job.
Likewise it is possible for S[R] objects to get dressed appropriately depending on
what class of object they are.
\end{quotation}

We say that specific methods are \emph{dispatched} based on the class of the argument passed. This, together with the loose type checks of \Rlang, allows writing code that functions as expected on different types of objects, e.g., character and numeric vectors.

\Rlang has good support for the object-oriented programming paradigm, but as a system that has evolved over the years, currently \Rlang supports multiple approaches. The still most popular approach is called S3, and a more recent and powerful approach, with slower performance, is called S4. The general idea is that a name like ``plot'' can be used as a generic name, and that the specific version of \Rfunction{plot()} called depends on the arguments of the call. Using computing terms we could say that the \emph{generic} of \Rfunction{plot()} dispatches the original call to different specific versions of \Rfunction{plot()} based on the class of the arguments passed. S3 generic functions dispatch, by default, based only on the argument passed to a single parameter, the first one. S4 generic functions can dispatch the call based on the arguments passed to more than one parameter and the structure of the objects of a given class is known to the interpreter. In S3 functions, the specializations of a generic are recognized/identified only by their name. And the class of an object by a character string stored as an attribute to the object.

We first explore one of the methods already available in \Rlang. The definition of \code{mean} shows that it is the generic for a method.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{mean}
\end{alltt}
\begin{verbatim}
## function (x, ...)
## UseMethod("mean")
## <bytecode: 0x00000000138ddc60>
## <environment: namespace:base>
\end{verbatim}
\end{kframe}
\end{knitrout}

We can find out which specializations of method are available in the current search path using \Rfunction{methods()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{methods}\hlstd{(mean)}
\end{alltt}
\begin{verbatim}
## [1] mean.Date     mean.default  mean.difftime mean.POSIXct  mean.POSIXlt
## see '?methods' for accessing help and source code
\end{verbatim}
\end{kframe}
\end{knitrout}

We can also use \Rfunction{methods()} to query all methods, including operators, defined for objects of a given class.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{methods}\hlstd{(}\hlkwc{class} \hlstd{=} \hlstr{"list"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] all.equal     as.data.frame coerce        Ops           relist
## [6] type.convert  within
## see '?methods' for accessing help and source code
\end{verbatim}
\end{kframe}
\end{knitrout}

S3 class information is stored as a character vector in an attribute named \code{"class"}. The most basic approach to creation of an object of a new S3 class, is to add the new class name to the class attribute of the object. As the implied class hierarchy is given by the order of the members of the character vector, the name of the new class must be added at the head of the vector. Even though this step can be done as shown here, in practice this step would normally take place within a \emph{constructor} function and the new class, if defined within a package, would need to be registered. We show here this bare-bones example to demonstrate how S3 classes are implemented in \Rlang.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{a} \hlkwb{<-} \hlnum{123}
\hlkwd{class}\hlstd{(a)}
\end{alltt}
\begin{verbatim}
## [1] "numeric"
\end{verbatim}
\begin{alltt}
\hlkwd{class}\hlstd{(a)} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlstr{"myclass"}\hlstd{,} \hlkwd{class}\hlstd{(a))}
\hlkwd{class}\hlstd{(a)}
\end{alltt}
\begin{verbatim}
## [1] "myclass" "numeric"
\end{verbatim}
\end{kframe}
\end{knitrout}

Now we create a print method specific to \code{"myclass"} objects. Internally we are using function \Rfunction{sprintf()} and for the format template to work we need to pass a \code{numeric} value as an argument---i.e., obviously \Rfunction{sprintf()} does not ``know'' how to handle objects of the class we have just created!

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{print.myclass} \hlkwb{<-} \hlkwa{function}\hlstd{(}\hlkwc{x}\hlstd{) \{}
    \hlkwd{sprintf}\hlstd{(}\hlstr{"[myclass] %.0f"}\hlstd{,} \hlkwd{as.numeric}\hlstd{(x))}
\hlstd{\}}
\end{alltt}
\end{kframe}
\end{knitrout}

Once a specialized method exists for a class, it will be used for objects of this class.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{print}\hlstd{(a)}
\end{alltt}
\begin{verbatim}
## [1] "[myclass] 123"
\end{verbatim}
\begin{alltt}
\hlkwd{print}\hlstd{(}\hlkwd{as.numeric}\hlstd{(a))}
\end{alltt}
\begin{verbatim}
## [1] 123
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{explainbox}
 The S3 class system is ``lightweight'' in that it adds very little additional computation load, but it is rather ``fragile'' in that most of the responsibility for consistency and correctness of the design---e.g., not messing up dispatch by redefining functions or loading a package exporting functions with the same name, etc., is not checked by the \Rlang interpreter.

Defining a new S3 generic\index{generic method!S3 class system} is also quite simple. A generic method and a default method need to be created.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my_print} \hlkwb{<-} \hlkwa{function} \hlstd{(}\hlkwc{x}\hlstd{,} \hlkwc{...}\hlstd{) \{}
   \hlkwd{UseMethod}\hlstd{(}\hlstr{"my_print"}\hlstd{, x)}
 \hlstd{\}}

\hlstd{my_print.default} \hlkwb{<-} \hlkwa{function}\hlstd{(}\hlkwc{x}\hlstd{,} \hlkwc{...}\hlstd{) \{}
   \hlkwd{print}\hlstd{(}\hlkwd{class}\hlstd{(x))}
   \hlkwd{print}\hlstd{(x, ...)}
\hlstd{\}}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{my_print}\hlstd{(}\hlnum{123}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "numeric"
## [1] 123
\end{verbatim}
\begin{alltt}
\hlkwd{my_print}\hlstd{(}\hlstr{"abc"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "character"
## [1] "abc"
\end{verbatim}
\end{kframe}
\end{knitrout}

Up to now, \Rfunction{my\_print()}, has no specialization. We now write one for data frames.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my_print.data.frame} \hlkwb{<-} \hlkwa{function}\hlstd{(}\hlkwc{x}\hlstd{,} \hlkwc{rows} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{5}\hlstd{,} \hlkwc{...}\hlstd{) \{}
   \hlkwd{print}\hlstd{(x[rows, ], ...)}
   \hlkwd{invisible}\hlstd{(x)}
\hlstd{\}}
\end{alltt}
\end{kframe}
\end{knitrout}

We add the second statement so that the function invisibly returns the whole data frame, rather than the lines printed. We now do a quick test of the function.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{my_print}\hlstd{(cars)}
\end{alltt}
\begin{verbatim}
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{my_print}\hlstd{(cars,} \hlnum{8}\hlopt{:}\hlnum{10}\hlstd{)}
\end{alltt}
\begin{verbatim}
##    speed dist
## 8     10   26
## 9     10   34
## 10    11   17
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{b} \hlkwb{<-} \hlkwd{my_print}\hlstd{(cars)}
\end{alltt}
\begin{verbatim}
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
\end{verbatim}
\begin{alltt}
\hlkwd{str}\hlstd{(b)}
\end{alltt}
\begin{verbatim}
## 'data.frame':	50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
\end{verbatim}
\begin{alltt}
\hlkwd{nrow}\hlstd{(b)} \hlopt{==} \hlkwd{nrow}\hlstd{(cars)} \hlcom{# was the whole data frame returned?}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\end{kframe}
\end{knitrout}

%\begin{playground}
%1) What would be the most concise way of defining a \code{my\_print()} specialization for \code{matrix}? Write one, and test it.
%2) How would you modify the code of your \code{my\_print.matrix()} so that also the columns to print can be selected?
%\end{playground}
%
\end{explainbox}

\section{Scope of names}
\index{names and scoping}\index{scoping rules}\index{namespaces}

The visibility of names is determined by the \emph{scoping rules} of a language. The clearest, but not the only situation when scoping rules matter, is when objects with the same name coexist. In such a situation one will be accessible by its unqualified name and the other hidden but possibly accessible by qualifying the name with its name space.

As the \Rlang language has few reserved words for which no redefinition is allowed, we should take care not to accidentally reuse names that are part of language. For example \code{pi} is a constant defined in \Rlang with the value of the mathematical constant $\pi$. If we use the same name for one of our variables, the original definition becomes hidden.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{pi}
\end{alltt}
\begin{verbatim}
## [1] 3.141593
\end{verbatim}
\begin{alltt}
\hlstd{pi} \hlkwb{<-} \hlstr{"apple pie"}
\hlstd{pi}
\end{alltt}
\begin{verbatim}
## [1] "apple pie"
\end{verbatim}
\begin{alltt}
\hlkwd{rm}\hlstd{(pi)}
\hlstd{pi}
\end{alltt}
\begin{verbatim}
## [1] 3.141593
\end{verbatim}
\begin{alltt}
\hlkwd{exists}\hlstd{(}\hlstr{"pi"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\end{kframe}
\end{knitrout}

In the example above, the two variables are not defined in the same scope. In the example below we assign a new value to a variable we have earlier created within the same scope, and consequently the second assignment overwrites, rather than hides, the existing definition.\qRscoping{exists()}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.pie} \hlkwb{<-} \hlstr{"raspberry pie"}
\hlstd{my.pie}
\end{alltt}
\begin{verbatim}
## [1] "raspberry pie"
\end{verbatim}
\begin{alltt}
\hlstd{my.pie} \hlkwb{<-} \hlstr{"apple pie"}
\hlstd{my.pie}
\end{alltt}
\begin{verbatim}
## [1] "apple pie"
\end{verbatim}
\begin{alltt}
\hlkwd{rm}\hlstd{(my.pie)}
\hlkwd{exists}\hlstd{(}\hlstr{"my.pie"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\end{kframe}
\end{knitrout}

An additional important thing to remember is that \Rlang packages define all objects within a \emph{namespace} with the same name as the package itself. This means that when we reuse a name defined in a package, its definition in the package does not get overwritten, but instead, only hidden and still accessible using the name \emph{qualified} by prepending the name of the package followed by two colons.

If two packages define objects with the same name, then which one is visible depends on the order in which the packages were attached. To avoid confusion in such cases, in scripts is best to use the qualified names for calling both definitions.

\section{Further reading}

An\index{further reading!object oriented programming in R} in-depth discussion of object-oriented programming in \Rlang is outside the scope of this book. For the non-programmer user, a basic understanding of \Rlang classes can be useful, even if he or she does not intend to create new classes. This basic knowledge is what we covered in this chapter. Several books describe in detail the different class systems available and how to use them in \Rlang packages. For an in-depth treatment of the subject please consult the books \citebooktitle{Wickham2019} \autocite{Wickham2019} and \citebooktitle{Chambers2016} \autocite{Chambers2016}.

The\index{further reading!package development} development of packages is thoroughly described in the book \citebooktitle{Wickham2015} \autocite{Wickham2015} and an in-depth description of \Rlang from the programming perspective is given in the book \citebooktitle{Wickham2019} \autocite{Wickham2019}. The book \citebooktitle{Chambers2016} \autocite{Chambers2016} covers both subjects.


% !Rnw root = appendix.main.Rnw


\chapter{New grammars of data}\label{chap:R:data}

\begin{VF}
Essentially everything in S[R], for instance, a call to a function, is an S[R] object. One viewpoint is that S[R] has self-knowledge. This self-awareness makes a lot of things possible in S[R] that are not in other languages.

\VA{Patrick J. Burns}{\emph{S Poetry}, 1998}\nocite{Burns1998}
\end{VF}

%\dictum[Patrick J. Burns (1998) S Poetry. \url{http://www.burns-stat.com/documents/books/s-poetry/}]{Essentially everything in S[R], for instance, a call to a function, is an S[R] object. One viewpoint is that S[R] has self-knowledge. This self-awareness makes a lot of things possible in S[R] that are not in other languages.}

\section{Aims of this chapter}

Base \Rlang and the recommended extension packages (installed by default) include many functions for manipulating data. The \Rlang distribution supplies a complete set of functions and operators that allow all the usual data manipulation operations. These functions have stable and well-described behavior, so they should be preferred unless some of their limitations justify the use of alternatives defined in contributed packages. In the present chapter we aim at describing the new syntaxes introduced by the most popular of these contributed \Rlang extension packages aiming at changing (usually improving one aspect at the expense of another) in various ways how we can manipulate data in \Rlang. These independently developed packages extend the \Rlang language not only by adding new ``words'' to it but by supporting new ways of meaningfully connecting ``words''---i.e., providing new ``grammars'' for data manipulation.

\section{Introduction}

By reading previous chapters, you have already become familiar with base \Rlang classes, methods, functions and operators for storing and manipulating data. Most of these had been originally designed to perform optimally on rather small data sets \autocite[see][]{Matloff2011}. The \Rlang implementation has been improved over the years significantly in performance, and random-access memory in computers has become cheaper, making constraints imposed by the original design of \Rlang less limiting, but on the other hand, the size of data sets has also increased. Some contributed packages have aimed at improving performance by relying on different compromises between usability, speed and reliability than used for base \Rlang.

Package \pkgname{data.table} is the best example of an alternative implementation of data storage that maximizes the speed of processing for large data sets using a new semantics and requiring a new syntax. We could say that package \pkgname{data.table} is based on a ``grammar of data'' that is different from that in the \Rlang language. The compromise in this case has been the use of a less intuitive syntax, and by defaulting to call by reference of arguments instead of by copy, increasing the ``responsibility'' of the author of code defining new functions.

When a computation includes a chain of sequential operations, if using base \Rlang, we can either store at each step in the computation the returned value in a variable, or nest multiple function calls. The first approach is verbose, but allows readable scripts, especially if variable names are wisely chosen. The second approach becomes very difficult too read as soon as there is more than one nesting level. Attempts to find an alternative syntax have borrowed the concept of data \emph{pipes} from Unix shells \autocite{Kernigham1981}. Interestingly, that it has been possible to write packages that define the operators needed to ``add'' this new syntax to \Rlang is a testimony to its flexibility and extensibility. Two packages, \pkgname{magrittr} and \pkgname{wrapr}, define operators for pipe-based syntax.

A different aspect of the \Rlang syntax is extraction of members from lists and data frames by name. Base \Rlang provides two different operators for this, \code{\$} and \code{[[]]}, with different syntax. These two operators also differ in how \emph{incomplete names} are handled. Package \pkgname{tibble} alters this syntax for an alternative to base \Rlang's data frames. Once again, a new syntax allows new functionality at the expense of partial incompatibility with base \Rlang syntax. Objects of class \code{"tb"} were also an attempt to improve performance compared to objects of class \code{"data.frame"}. \Rlang performance has improved in recent releases and currently, even though performance is not the same, depending on the operations and data, either \Rlang's data frames or tibbles perform better.

Base \Rlang function \Rfunction{subset()} has an unusual syntax, as it evaluates the expression passed as the second argument within the namespace of the data frame passed as its first argument (see \ref{sec:calc:df:with} on page \pageref{sec:calc:df:with}). This saves typing at the expense of increasing the risk of bugs, as by reading the call to subset, it is not obvious which names are resolved in the environment of \code{subset()} and which ones within its first argument---i.e., as column names in the data frame. In addition, changes elsewhere in a script can change how a call to subset is interpreted. In reality, subset is a wrapper function built on top of the extraction operator \code{[]}. It is a convenience function, mostly intended to be used at the console, rather than in scripts or package code. To extract rows from a data frame it is always best to use the \code{[ , ]} operator.

Package \pkgname{dplyr} provides convenience functions that work in a similar way as base \Rlang \code{subset()}, although in latest versions more safely. This package has suffered quite drastic changes during its development with respect to how to handle the dilemma caused by ``guessing'' of the environment where names should be looked up. There is no easy answer; a simplified syntax leads to ambiguity, and a fully specified syntax is verbose. Recent versions of the package introduced a terse syntax to achieve a concise way of specifying where to look up  names. My opinion is that for code that needs to be highly reliable and produce reproducible results in the future, we should for the time being prefer base \Rlang. For code that is to be used once, or for which reproducibility can depend on the use of a specific (old or soon to be old) version of \pkgname{dplyr}, or which is not a burden to update, the conciseness and power of the new syntax will be an advantage.

In this chapter you will become familiar with alternative ``grammars of data'' as implemented in some of the packages that enable new approaches to manipulating data in \Rlang. As in previous chapters I will focus more on the available tools and how to use them than on their role in the analysis of data. The books \citebooktitle{Wickham2017} \autocite{Wickham2017} and \citebooktitle{Peng2016} \autocite{Peng2016} partly cover the same subjects from the perspective of data analysis.

\section{Packages used in this chapter}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{install.packages}\hlstd{(learnrbook}\hlopt{::}\hlstd{pkgs_ch_data)}
\end{alltt}
\end{kframe}
\end{knitrout}

To run the examples included in this chapter, you need first to load some packages from the library (see section \ref{sec:script:packages} on page \pageref{sec:script:packages} for details on the use of packages).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{library}\hlstd{(learnrbook)}
\hlkwd{library}\hlstd{(tibble)}
\hlkwd{library}\hlstd{(magrittr)}
\hlkwd{library}\hlstd{(wrapr)}
\hlkwd{library}\hlstd{(stringr)}
\hlkwd{library}\hlstd{(dplyr)}
\hlkwd{library}\hlstd{(tidyr)}
\hlkwd{library}\hlstd{(lubridate)}
\end{alltt}
\end{kframe}
\end{knitrout}

\section[Replacements for \texttt{data.frame}]{Replacements for \code{data.frame}}
\index{data frame!replacements|(}
\subsection{\pkgname{data.table}}
The function call semantics of the \Rlang language is that arguments are passed to functions by copy. If the arguments are modified within the code of a function, these changes are local to the function. If implemented naively, this semantic would impose a huge toll on performance, however, \Rlang in most situations only makes a copy if and when the value changes. Consequently, for modern versions of \Rlang which are very good at avoiding unnecessary copying of objects, the normal \Rlang semantics has only a moderate negative impact on performance. However, this impact can still be a problem as modification is detected at the object level, and consequently \Rlang may make copies of a whole data frame when only values in a single column or even just an attribute have changed.

Functions and methods from package \pkgname{data.table} use arguments by reference, avoiding making any copies. However, any assignments within these functions and methods modify the variable passed as an argument. This simplifies the needed tests for delayed copying and also by avoiding the need to make a copy of arguments, achieves the best possible performance. This is a specialized package but extremely useful when dealing with very large data sets. Writing user code, such as scripts, with \pkgname{data.table} requires a good understanding of the pass-by-reference semantics. Obviously, package \pkgname{data.table} makes no attempt at backwards compatibility with base-\Rlang \code{data.frame}.

\subsection{\pkgname{tibble}}\label{sec:data:tibble}
\index{tibble!differences with data frames|(}

The authors of package \pkgname{tibble} describe their \Rclass{tbl} class as backwards compatible with \Rclass{data.frame} and make it a derived class. This backwards compatibility is only partial so in some situations data frames and tibbles are not equivalent.

The class and methods that package \pkgname{tibble} defines lift some of the restrictions imposed by the design of base \Rlang data frames at the cost of creating some incompatibilities due to changed (improved) syntax for member extraction and by adding support for ``columns'' of class \Rclass{list} and removing support for columns of class \Rclass{matrix}. Handling of attributes is also different, with no row names added by default. There are also differences in default behavior of both constructors and methods. Although, objects of class \Rclass{tbl} can be passed as arguments to functions that expect data frames as input, these functions are not guaranteed to work correctly as a result of the differences in syntax.

\begin{warningbox}
It is easy to write code that will work correctly both with data frames and tibbles. However, code that is syntactically correct according to the \Rlang language may fail if a tibble is used in place of a data frame.
\end{warningbox}

\begin{explainbox}
  The \Rmethod{print()} method for tibbles differs from that for data frames in that it outputs a header with the text ``A tibble:'' followed by the dimensions (number of rows $\times$ number of columns), adds under each column name an abbreviation of its class and instead of printing all rows and columns, a limited number of them are displayed. In addition, individual values are formatted differently even adding color highlighting for negative numbers.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{tibble}\hlstd{(}\hlkwc{A} \hlstd{= LETTERS[}\hlnum{1}\hlopt{:}\hlnum{5}\hlstd{],} \hlkwc{B} \hlstd{=} \hlopt{-}\hlnum{2}\hlopt{:}\hlnum{2}\hlstd{,} \hlkwc{C} \hlstd{=} \hlkwd{seq}\hlstd{(}\hlkwc{from} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{to} \hlstd{=} \hlnum{0}\hlstd{,} \hlkwc{length.out} \hlstd{=} \hlnum{5}\hlstd{))}
\end{alltt}
\begin{verbatim}
## # A tibble: 5 x 3
##   A         B     C
##   <chr> <int> <dbl>
## 1 A        -2  1
## 2 B        -1  0.75
## 3 C         0  0.5
## 4 D         1  0.25
## 5 E         2  0
\end{verbatim}
\end{kframe}
\end{knitrout}

The default number of rows printed can be set with options, that we set here to only three rows for most of this chapter.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{options}\hlstd{(}\hlkwc{tibble.print_max} \hlstd{=} \hlnum{3}\hlstd{,} \hlkwc{tibble.print_min} \hlstd{=} \hlnum{3}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
\end{explainbox}

\begin{infobox}
In their first incarnation, the name for \Rclass{tibble} was \code{data\_frame} (with a dash instead of a dot). The old name is still recognized, but its use should be avoided and \Rfunction{tibble()} used instead. One should be aware that although the constructor \Rfunction{tibble()} and conversion function \Rfunction{as\_tibble()}, as well as the test \Rfunction{is\_tibble()} use the name \Rclass{tibble}, the class attribute is named \code{tbl}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.tb} \hlkwb{<-} \hlkwd{tibble}\hlstd{(}\hlkwc{numbers} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{3}\hlstd{)}
\hlkwd{is_tibble}\hlstd{(my.tb)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{inherits}\hlstd{(my.tb,} \hlstr{"tibble"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlkwd{class}\hlstd{(my.tb)}
\end{alltt}
\begin{verbatim}
## [1] "tbl_df"     "tbl"        "data.frame"
\end{verbatim}
\end{kframe}
\end{knitrout}

Furthermore, by necessity, to support tibbles based on different underlying data sources, a further derived class is needed. In our example, as our tibble has an underlying \code{data.frame} class, the most derived class of \code{my.tb} is \Rclass{tbl\_df}.
\end{infobox}

We start with the constructor and conversion methods. For this we will define our own diagnosis function (\emph{apply} functions are described in section \ref{sec:data:apply} on page \pageref{sec:data:apply}).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{show_classes} \hlkwb{<-} \hlkwa{function}\hlstd{(}\hlkwc{x}\hlstd{) \{}
  \hlkwd{cat}\hlstd{(}
    \hlkwd{paste}\hlstd{(}\hlkwd{paste}\hlstd{(}\hlkwd{class}\hlstd{(x)[}\hlnum{1}\hlstd{],}
    \hlstr{"containing:"}\hlstd{),}
    \hlkwd{paste}\hlstd{(}\hlkwd{names}\hlstd{(x),}
          \hlkwd{sapply}\hlstd{(x, class),} \hlkwc{collapse} \hlstd{=} \hlstr{", "}\hlstd{,} \hlkwc{sep} \hlstd{=} \hlstr{": "}\hlstd{),}
    \hlkwc{sep} \hlstd{=} \hlstr{"\textbackslash{}n"}\hlstd{)}
    \hlstd{)}
\hlstd{\}}
\end{alltt}
\end{kframe}
\end{knitrout}

In the next two chunks we can see some of the differences. The \Rfunction{tibble()} constructor does not by default convert character data into factors, while the \Rfunction{data.frame()} constructor does.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.df} \hlkwb{<-} \hlkwd{data.frame}\hlstd{(}\hlkwc{codes} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"A"}\hlstd{,} \hlstr{"B"}\hlstd{,} \hlstr{"C"}\hlstd{),} \hlkwc{numbers} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{3}\hlstd{,} \hlkwc{integers} \hlstd{=} \hlnum{1L}\hlopt{:}\hlnum{3L}\hlstd{)}
\hlkwd{is.data.frame}\hlstd{(my.df)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{is_tibble}\hlstd{(my.df)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlkwd{show_classes}\hlstd{(my.df)}
\end{alltt}
\begin{verbatim}
## data.frame containing:
## codes: character, numbers: integer, integers: integer
\end{verbatim}
\end{kframe}
\end{knitrout}

Tibbles are data frames---or more formally class \Rclass{tibble} is derived from class \code{data.frame}. However, data frames are not tibbles.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.tb} \hlkwb{<-} \hlkwd{tibble}\hlstd{(}\hlkwc{codes} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"A"}\hlstd{,} \hlstr{"B"}\hlstd{,} \hlstr{"C"}\hlstd{),} \hlkwc{numbers} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{3}\hlstd{,} \hlkwc{integers} \hlstd{=} \hlnum{1L}\hlopt{:}\hlnum{3L}\hlstd{)}
\hlkwd{is.data.frame}\hlstd{(my.tb)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{is_tibble}\hlstd{(my.tb)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{show_classes}\hlstd{(my.tb)}
\end{alltt}
\begin{verbatim}
## tbl_df containing:
## codes: character, numbers: integer, integers: integer
\end{verbatim}
\end{kframe}
\end{knitrout}

The \Rfunction{print()} method for tibbles, overrides the one defined for data frames.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{print}\hlstd{(my.df)}
\end{alltt}
\begin{verbatim}
##   codes numbers integers
## 1     A       1        1
## 2     B       2        2
## 3     C       3        3
\end{verbatim}
\begin{alltt}
\hlkwd{print}\hlstd{(my.tb)}
\end{alltt}
\begin{verbatim}
## # A tibble: 3 x 3
##   codes numbers integers
##   <chr>   <int>    <int>
## 1 A           1        1
## 2 B           2        2
## 3 C           3        3
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
Tibbles and data frames differ in how they are printed when they have many rows or columns. 1) Construct a data frame and an equivalent tibble with at least 50 rows and then test how the output looks when they are printed. 2) Construct a data frame and an equivalent tibble with more columns than will fit in the width of the \R console and then test how the output looks when they are printed.
\end{playground}

Data frames can be converted into tibbles with \code{as\_tibble()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my_conv.tb} \hlkwb{<-} \hlkwd{as_tibble}\hlstd{(my.df)}
\hlkwd{is.data.frame}\hlstd{(my_conv.tb)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{is_tibble}\hlstd{(my_conv.tb)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{show_classes}\hlstd{(my_conv.tb)}
\end{alltt}
\begin{verbatim}
## tbl_df containing:
## codes: character, numbers: integer, integers: integer
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my_conv.df} \hlkwb{<-} \hlkwd{as.data.frame}\hlstd{(my.tb)}
\hlkwd{is.data.frame}\hlstd{(my_conv.df)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{is_tibble}\hlstd{(my_conv.df)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlkwd{show_classes}\hlstd{(my_conv.df)}
\end{alltt}
\begin{verbatim}
## data.frame containing:
## codes: character, numbers: integer, integers: integer
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
Look carefully at the result of the conversions. Why do we now have a data frame with \code{A} as \code{character} and a tibble with \code{A} as a \code{factor}?
\end{playground}

\begin{explainbox}
Not all conversion functions work consistently when converting from a derived class into its parent. The reason for this is disagreement between authors on what the \emph{correct} behavior is based on logic and theory. You are not likely to be hit by this problem frequently, but it can be difficult to diagnose.

We have already seen that calling \Rfunction{as.data.frame()} on a tibble strips the derived class attributes, returning a data frame. We will look at the whole character vector stored in the \code{"class"} attribute to demonstrate the difference. We also test the two objects for equality, in two different ways. Using the operator \code{==} tests for equivalent objects. Objects that contain the same data. Using \Rfunction{identical()} tests that objects are exactly the same, including attributes such as \code{"class"}, which we retrieve using \Rfunction{class()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{class}\hlstd{(my.tb)}
\end{alltt}
\begin{verbatim}
## [1] "tbl_df"     "tbl"        "data.frame"
\end{verbatim}
\begin{alltt}
\hlkwd{class}\hlstd{(my_conv.df)}
\end{alltt}
\begin{verbatim}
## [1] "data.frame"
\end{verbatim}
\begin{alltt}
\hlstd{my.tb} \hlopt{==} \hlstd{my_conv.df}
\end{alltt}
\begin{verbatim}
##      codes numbers integers
## [1,]  TRUE    TRUE     TRUE
## [2,]  TRUE    TRUE     TRUE
## [3,]  TRUE    TRUE     TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{identical}\hlstd{(my.tb, my_conv.df)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\end{kframe}
\end{knitrout}

Now we derive from a tibble, and then attempt a conversion back into a tibble.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.xtb} \hlkwb{<-} \hlstd{my.tb}
\hlkwd{class}\hlstd{(my.xtb)} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlstr{"xtb"}\hlstd{,} \hlkwd{class}\hlstd{(my.xtb))}
\hlkwd{class}\hlstd{(my.xtb)}
\end{alltt}
\begin{verbatim}
## [1] "xtb"        "tbl_df"     "tbl"        "data.frame"
\end{verbatim}
\begin{alltt}
\hlstd{my_conv_x.tb} \hlkwb{<-} \hlkwd{as_tibble}\hlstd{(my.xtb)}
\hlkwd{class}\hlstd{(my_conv_x.tb)}
\end{alltt}
\begin{verbatim}
## [1] "tbl_df"     "tbl"        "data.frame"
\end{verbatim}
\begin{alltt}
\hlstd{my.xtb} \hlopt{==} \hlstd{my_conv_x.tb}
\end{alltt}
\begin{verbatim}
##      codes numbers integers
## [1,]  TRUE    TRUE     TRUE
## [2,]  TRUE    TRUE     TRUE
## [3,]  TRUE    TRUE     TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{identical}\hlstd{(my.xtb, my_conv_x.tb)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\end{kframe}
\end{knitrout}

The two viewpoints on conversion functions are as follows. 1) The conversion function should return an object of its corresponding class, even if the argument is an object of a derived class, stripping the derived class. 2) If the object is of the class to be converted to, including objects of derived classes, then it should remain untouched. Base \Rlang follows, as far as I have been able to work out, approach 1). Packages in the \pkgname{tidyverse} follow approach 2). If in doubt about the behavior of some function, then you will need to do a test similar to the one used in this box.
\end{explainbox}

There are additional important differences between the constructors \Rfunction{tibble()} and \code{data.frame()}. One of them is that in a call to \Rfunction{tibble()}, member variables (``columns'')  being defined can be used in the definition of subsequent member variables.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{tibble}\hlstd{(}\hlkwc{a} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{5}\hlstd{,} \hlkwc{b} \hlstd{=} \hlnum{5}\hlopt{:}\hlnum{1}\hlstd{,} \hlkwc{c} \hlstd{= a} \hlopt{+} \hlstd{b,} \hlkwc{d} \hlstd{= letters[a} \hlopt{+} \hlnum{1}\hlstd{])}
\end{alltt}
\begin{verbatim}
## # A tibble: 5 x 4
##       a     b     c d
##   <int> <int> <int> <chr>
## 1     1     5     6 b
## 2     2     4     6 c
## 3     3     3     6 d
## # ... with 2 more rows
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
What is the behavior if you replace \Rfunction{tibble()} by \Rfunction{data.frame()} in the statement above?
\end{playground}

While data frame columns can be factors, vectors or matrices (with the same number of rows as the data frame), columns of tibbles can be factors, vectors or lists (with the same number of members as rows the tibble has).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{tibble}\hlstd{(}\hlkwc{a} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{5}\hlstd{,} \hlkwc{b} \hlstd{=} \hlnum{5}\hlopt{:}\hlnum{1}\hlstd{,} \hlkwc{c} \hlstd{=} \hlkwd{list}\hlstd{(}\hlstr{"a"}\hlstd{,} \hlnum{2}\hlstd{,} \hlnum{3}\hlstd{,} \hlnum{4}\hlstd{,} \hlnum{5}\hlstd{))}
\end{alltt}
\begin{verbatim}
## # A tibble: 5 x 3
##       a     b c
##   <int> <int> <list>
## 1     1     5 <chr [1]>
## 2     2     4 <dbl [1]>
## 3     3     3 <dbl [1]>
## # ... with 2 more rows
\end{verbatim}
\end{kframe}
\end{knitrout}

Which even allows a list of lists as a variable, or a list of vectors.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{tibble}\hlstd{(}\hlkwc{a} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{5}\hlstd{,} \hlkwc{b} \hlstd{=} \hlnum{5}\hlopt{:}\hlnum{1}\hlstd{,} \hlkwc{c} \hlstd{=} \hlkwd{list}\hlstd{(}\hlstr{"a"}\hlstd{,} \hlnum{1}\hlopt{:}\hlnum{2}\hlstd{,} \hlnum{0}\hlopt{:}\hlnum{3}\hlstd{, letters[}\hlnum{1}\hlopt{:}\hlnum{3}\hlstd{], letters[}\hlnum{3}\hlopt{:}\hlnum{1}\hlstd{]))}
\end{alltt}
\begin{verbatim}
## # A tibble: 5 x 3
##       a     b c
##   <int> <int> <list>
## 1     1     5 <chr [1]>
## 2     2     4 <int [2]>
## 3     3     3 <int [4]>
## # ... with 2 more rows
\end{verbatim}
\end{kframe}
\end{knitrout}
\index{tibble!differences with data frames|)}
\index{data frame!replacements|)}

\section{Data pipes}\label{sec:data:pipes}
\index{chaining statements with \emph{pipes}|(}
The first obvious difference between scripts using some of the new grammars is the frequent use of \emph{pipes}. This is, however, mostly a question of preferences, as pipes can be used equally well with base \Rlang functions. Pipes have been at the core of shell scripting in \osname{Unix} since early stages of its design \autocite{Kernigham1981}. Within an OS, pipes are chains of small programs or ``tools'' that carry out a single well-defined task (e.g., \code{ed}, \code{gsub}, \code{grep}, \code{more}, etc.). Data such as text is described as flowing from a source into a sink through a series of steps at which a specific transformation takes place. In \osname{Unix}, sinks and sources are files, but files as an abstraction include all devices and connections for input or output, including physical ones as terminals and printers. The connection between steps in the pipe is usually implemented by means of temporary files.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
stdin | grep("abc") | more
\end{alltt}
\end{kframe}
\end{knitrout}

How can \emph{pipes} exist within a single \Rlang script? When chaining functions into a pipe, data is passed between them through temporary \Rlang objects stored in memory, which are created and destroyed automatically. Conceptually there is little difference between Unix shell pipes and pipes in \Rlang scripts, but the implementations are different.

What do pipes achieve in \Rlang scripts? They relieve the user from the responsibility of creating and deleting the temporary objects and of enforcing the sequential execution of the different steps. Pipes usually improve readability of scripts by allowing more concise code.

Currently, two main implementations of pipes are available as \Rlang extensions, in packages \pkgnameNI{magrittr} and \pkgnameNI{wrapr}.

\subsection{\pkgname{magrittr}}
\index{pipes!tidyverse|(}
\index{pipe operator}
One set of operators needed to build pipes of \Rlang functions is implemented in package \pkgname{magrittr}. This implementation is used in the \pkgname{tidyverse} and the pipe operator re-exported by package \pkgname{dplyr}.

We start with a toy example first written using separate steps and normal \Rlang syntax

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{data.in} \hlkwb{<-} \hlnum{1}\hlopt{:}\hlnum{10}
\hlstd{data.tmp} \hlkwb{<-} \hlkwd{sqrt}\hlstd{(data.in)}
\hlstd{data.out} \hlkwb{<-} \hlkwd{sum}\hlstd{(data.tmp)}
\hlkwd{rm}\hlstd{(data.tmp)} \hlcom{# clean up!}
\end{alltt}
\end{kframe}
\end{knitrout}

next using nested function calls still using normal \Rlang syntax

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{data.out} \hlkwb{<-} \hlkwd{sum}\hlstd{(}\hlkwd{sqrt}\hlstd{(data.in))}
\end{alltt}
\end{kframe}
\end{knitrout}

written as a pipe using the chaining operator from package \pkgname{magrittr}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{data.in} \hlopt{%>%} \hlkwd{sqrt}\hlstd{()} \hlopt{%>%} \hlkwd{sum}\hlstd{()} \hlkwb{->} \hlstd{data.out}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{explainbox}
The \Roperator{\%>\%} from package \pkgname{magrittr} takes two operands. The value returned by the \emph{lhs} (left-hand side) operand, which can be any \Rlang expression, is passed as first argument to the \emph{rhs} operand, which must be a function accepting at least one argument. Consequently, in this implementation, the function in the \emph{rhs} must have a suitable signature for the pipe to work implicitly as usually used. However, it is possible to pass piped arguments to a function by name or to other parameters than the first one using a dot (\code{.}) as placeholder.

Some base \Rlang functions like \code{subset()} have a signature that is suitable for use in \pkgname{magrittr} pipes using implicit passing of the piped value to the first argument, while others such as \code{assign()} will not. In such cases we can use \code{.} as a placeholder and pass it as an argument, or, alternatively, define a wrapper function to change the order of the formal parameters in the function signature.
\end{explainbox}

Package \pkgname{magrittr} provides additional pipe operators, such as ``tee'' (\Roperator{\%T>\%}) to create a branch in the pipe, and \Roperator{\%<>\%} to apply the pipe by reference. These operators are less frequently used than \Roperator{\%>\%}.
\index{pipes!tidyverse|)}

\subsection{\pkgname{wrapr}}
\index{pipes!wrapr|(}
\index{dot-pipe operator}
The \Roperator{\%.>\%}, or ``dot-pipe'' operator from package \pkgname{wrapr}, allows expressions both on the rhs and lhs, and enforces the use of the dot (\code{.}), as placeholder for the piped object.

Rewritten using the dot-pipe operator, the pipe in the previous chunk becomes

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{data.in} \hlopt{%.>%} \hlkwd{sqrt}\hlstd{(.)} \hlopt{%.>%} \hlkwd{sum}\hlstd{(.)} \hlkwb{->} \hlstd{data1.out}
\end{alltt}
\end{kframe}
\end{knitrout}
However, the same code can use the pipe operator from \pkgname{magrittr}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{data.in} \hlopt{%>%} \hlkwd{sqrt}\hlstd{(.)} \hlopt{%>%} \hlkwd{sum}\hlstd{(.)} \hlkwb{->} \hlstd{data2.out}
\hlkwd{all.equal}\hlstd{(data1.out, data2.out)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\end{kframe}
\end{knitrout}

If needed or desired, named arguments are supported with the dot-pipe operator resulting in the expected behavior.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{data.in} \hlopt{%.>%} \hlkwd{assign}\hlstd{(}\hlkwc{value} \hlstd{= .,} \hlkwc{x} \hlstd{=} \hlstr{"data3.out"}\hlstd{)}
\hlkwd{all.equal}\hlstd{(data.in, data3.out)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\end{kframe}
\end{knitrout}

In contrast, the pipe operator silently and unexpectedly fails to create the variable for the same example.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{data.in} \hlopt{%>%} \hlkwd{assign}\hlstd{(}\hlkwc{value} \hlstd{= .,} \hlkwc{x} \hlstd{=} \hlstr{"data4.out"}\hlstd{)}
\hlkwd{exists}\hlstd{(}\hlstr{"data4.out"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\end{kframe}
\end{knitrout}

The dot-pipe operator allows us to use \code{.} in expressions as shown below, while \Roperator{\%>\%} fails with an error (not shown).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{data.in} \hlopt{%.>%} \hlstd{(}\hlnum{2} \hlopt{+} \hlstd{.}\hlopt{^}\hlnum{2}\hlstd{)} \hlopt{%.>%} \hlkwd{assign}\hlstd{(}\hlstr{"data1.out"}\hlstd{, .)}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{explainbox}
In conclusion, \Rlang syntax for expressions is preserved when using the dot-pipe operator, with the only caveat that because of the higher precedence of the \Roperator{\%.>\%} operator, we need to ``protect'' bare expressions containing other operators by enclosing them in parentheses.
\end{explainbox}

Under-the-hood, the implementations of \Roperator{\%>\%} and \Roperator{\%.>\%} are very different, with \Roperator{\%.>\%} usually having better performance.

In the rest of the book we will exclusively use \emph{dot pipes} in examples to ensure easier understanding as they avoid implicit (''invisible'') passing of arguments and impose fewer restrictions on the syntax that can be used.

Although pipes can make scripts visually very different from the use of assignments of intermediate results to variables, from the point of view of data analysis what makes pipes most convenient to use are some of the new classes, functions, and methods defined in \pkgnameNI{tidyr}, \pkgnameNI{dplyr}, and other packages from the \pkgname{tidyverse}.
\index{pipes!wrapr|)}
\index{chaining statements with \emph{pipes}|)}

\section{Reshaping with \pkgname{tidyr}}
\index{reshaping tibbles|(}
\index{long-form- and wide-form tabular data}
Data stored in table-like formats can be arranged in different ways. In base \Rlang most model fitting functions and the \Rfunction{plot()} method using (model) formulas and accepting data frames, expect data to be arranged in ``long form'' so that each row in a data frame corresponds to a single observation (or measurement) event on a subject. Each column corresponds to a different measured feature, time of measurement, or a factor describing a classification of subjects according to treatments or features of the experimental design (e.g., blocks). Covariates measured on the same subject at an earlier point in time may also be stored in a column. Data arranged in \emph{long form} has been nicknamed as ``tidy'' and this is reflected in the name given to the \pkgname{tidyverse} suite of packages. Data in which columns correspond to measurement events is described as being in a \emph{wide form}.

Although long-form data is and has been the most commonly used arrangement of data in \Rlang, manipulation of such data has not always been possible with concise \Rlang statements. The packages in the \pkgname{tidyverse} provide convenience functions to simplify coding of data manipulation, which in some cases, have, in addition, improved performance compared to base \Rlang---i.e., it is possible to code the same operations using only base \Rlang, but may require more and/or more verbose statements.

Real-world data is rather frequently stored in wide format or even ad hoc formats, so in many cases the first task in data analysis is to reshape the data. Package \pkgname{tidyr} provides functions for reshaping data from wide to long form and \emph{vice versa} (replacing the older packages \pkgname{reshape} and \pkgname{reshape2}).

%% replace iris with an example that is really ``wide''
We use in examples below the \Rdata{iris} data set included in base \Rlang. Some operations on \Rlang \code{data.frame} objects with \pkgname{tidyverse} packages will return \code{data.frame} objects while others will return tibbles---i.e., \Rclass{"tb"} objects. Consequently it is safer to first convert into tibbles the data frames we will work with.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{iris.tb} \hlkwb{<-} \hlkwd{as_tibble}\hlstd{(iris)}
\end{alltt}
\end{kframe}
\end{knitrout}

Function \Rfunction{gather()} converts data from wide form into long form (or ''tidy''). We use \code{gather} to obtain a long-form tibble. By comparing \code{iris.tb} with \code{long\_iris.tb} we can appreciate how \Rfunction{gather()} reshaped its input.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{head}\hlstd{(iris.tb,} \hlnum{2}\hlstd{)}
\end{alltt}
\begin{verbatim}
## # A tibble: 2 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>
## 1          5.1         3.5          1.4         0.2 setosa
## 2          4.9         3            1.4         0.2 setosa
\end{verbatim}
\begin{alltt}
\hlstd{iris.tb} \hlopt{%.>%}
  \hlkwd{gather}\hlstd{(.,} \hlkwc{key} \hlstd{= part,} \hlkwc{value} \hlstd{= dimension,} \hlopt{-}\hlstd{Species)} \hlkwb{->} \hlstd{long_iris.tb}
\hlstd{long_iris.tb}
\end{alltt}
\begin{verbatim}
## # A tibble: 600 x 3
##   Species part         dimension
##   <fct>   <chr>            <dbl>
## 1 setosa  Sepal.Length       5.1
## 2 setosa  Sepal.Length       4.9
## 3 setosa  Sepal.Length       4.7
## # ... with 597 more rows
\end{verbatim}
\end{kframe}
\end{knitrout}

In this statement, we can see the convenience of dispensing with quotation marks for the new (\code{part} and \code{dimension}) and existing (\code{Species}) column names. Use of bare names as above triggers errors when package code is tested, requiring the use of a less convenient but more consistent and reliable syntax instead. As it is also possible to pass column names as strings but not together with the subtraction operator, equivalent code becomes more verbose but with the intention explicit and easier to grasp.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{long_iris.tb_1} \hlkwb{<-} \hlkwd{gather}\hlstd{(iris.tb,} \hlkwc{key} \hlstd{=} \hlstr{"part"}\hlstd{,} \hlkwc{value} \hlstd{=} \hlstr{"dimension"}\hlstd{,} \hlkwd{setdiff}\hlstd{(}\hlkwd{colnames}\hlstd{(iris.tb),} \hlstr{"Species"}\hlstd{))}
\hlstd{long_iris.tb_1}
\end{alltt}
\begin{verbatim}
## # A tibble: 600 x 3
##   Species part         dimension
##   <fct>   <chr>            <dbl>
## 1 setosa  Sepal.Length       5.1
## 2 setosa  Sepal.Length       4.9
## 3 setosa  Sepal.Length       4.7
## # ... with 597 more rows
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{warningbox}
\index{function arguments in the tidyverse}
Altering \Rlang's normal interpretation of the name passed as an argument to \code{key} and \code{value} prevents these arguments from being recognized as the name of a variable in the calling environment. We need to use a new operator \Roperator{!!} to restore the normal \Rlang behavior.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{part} \hlkwb{<-} \hlstr{"not part"}
\hlstd{long_iris.tb_2} \hlkwb{<-} \hlkwd{gather}\hlstd{(iris.tb,} \hlkwc{key} \hlstd{=} \hlopt{!!}\hlstd{part,} \hlkwc{value} \hlstd{= dimension,} \hlopt{-}\hlstd{Species)}
\hlstd{long_iris.tb_2}
\end{alltt}
\begin{verbatim}
## # A tibble: 600 x 3
##   Species `not part`   dimension
##   <fct>   <chr>            <dbl>
## 1 setosa  Sepal.Length       5.1
## 2 setosa  Sepal.Length       4.9
## 3 setosa  Sepal.Length       4.7
## # ... with 597 more rows
\end{verbatim}
\end{kframe}
\end{knitrout}

This syntax has been recently subject to debate and led to John Mount developing package \pkgname{seplyr} which provides wrappers on functions and methods from \pkgname{dplyr} that respect standard evaluation (SE). At the time of writing, \pkgname{seplyr} can be considered as experimental.
\end{warningbox}

\begin{playground}
To better understand why I added \code{-Species} as an argument, edit the code by removing it, and execute the statement to see how the returned tibble is different.
\end{playground}

For the reverse operation, converting from long form to wide form, we use \Rfunction{spread()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{spread}\hlstd{(long_iris.tb,} \hlkwc{key} \hlstd{=} \hlkwd{c}\hlstd{(}\hlopt{!!}\hlstd{part, Species),} \hlkwc{value} \hlstd{= dimension)} \hlcom{# does not work!!}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{warningbox}
  Starting from version 1.0.0 of \pkgname{tidyr}, \Rfunction{gather()} and \Rfunction{spread()} are deprecated and replaced by \Rfunction{pivot\_longer()} and \Rfunction{pivot\_wider()}. These new functions use a different syntax but are not yet fully stable.
\end{warningbox}

\index{reshaping tibbles|)}

\section{Data manipulation with \pkgname{dplyr}}
\index{data manipulation in the tidyverse|(}
\begin{warningbox}
The first advantage a user of the \pkgname{dplyr} functions and methods sees is the completeness of the set of operations supported and the symmetry and consistency among the different functions. A second advantage is that almost all the functions are defined not only for objects of class \Rclass{tibble}, but also for objects of class \code{data.table} (packages \pkgname{dtplyr}) and for SQL databases (\pkgname{dbplyr}), with consistent syntax (see also section \ref{sec:data:db} on page \pageref{sec:data:db}). A further variant exists in package \pkgname{seplyr}, supporting a different syntax stemming from the use of ``standard evaluation'' (SE) instead of non-standard evaluation (NSE). A downside of \pkgname{dplyr} and much of the \pkgname{tidyverse} is that the syntax is not yet fully stable. Additionally, some function and method names either override those in base \Rlang or clash with names used in other packages. \Rlang itself is extremely stable and expected to remain forward and backward compatible for a long time. For code intended to remain in use for years, the fewer packages it depends on, the less maintenance it will need. When using the \pkgname{tidyverse} we need to be prepared to revise our own dependent code after any major revision to the \pkgname{tidyverse} packages we may use.
\end{warningbox}

\begin{infobox}
A new package, \pkgname{poorman}, implements many of the same words and grammar as \pkgname{dplyr} using pure \Rlang in the implementation instead of compiled \Cpplang and \Clang code. This light-weight approach could be useful when dealing with relatively small data sets or when the use of \Rlang's data frames instead of tibbles is preferred.
\end{infobox}

\subsection{Row-wise manipulations}
\index{row-wise operations on data|(}

Assuming that the data is stored in long form, row-wise operations are operations combining values from the same observation event---i.e., calculations within a single row of a data frame or tibble. Using functions \Rfunction{mutate()} and \Rfunction{transmute()} we can obtain derived quantities by combining different variables, or variables and constants, or applying a mathematical transformation. We add new variables (columns) retaining existing ones using \Rfunction{mutate()} or we assemble a new tibble containing only the columns we explicitly specify using \Rfunction{transmute()}.

\begin{explainbox}
Different from usual \Rlang syntax, with \Rfunction{tibble()}, \Rfunction{mutate()} and \Rfunction{transmute()} we can use values passed as arguments, in the statements computing the values passed as later arguments. In many cases, this allows more concise and easier to understand code.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{tibble}\hlstd{(}\hlkwc{a} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{5}\hlstd{,} \hlkwc{b} \hlstd{=} \hlnum{2} \hlopt{*} \hlstd{a)}
\end{alltt}
\begin{verbatim}
## # A tibble: 5 x 2
##       a     b
##   <int> <dbl>
## 1     1     2
## 2     2     4
## 3     3     6
## # ... with 2 more rows
\end{verbatim}
\end{kframe}
\end{knitrout}
\end{explainbox}

Continuing with the example from the previous section, we most likely would like to split the values in variable \code{part} into \code{plant\_part} and \code{part\_dim}. We use \code{mutate()} from \pkgname{dplyr} and \Rfunction{str\_extract()} from \pkgname{stringr}. We use regular expressions as arguments passed to \code{pattern}.  We do not show it here, but \Rfunction{mutate()} can be used with variables of any \code{mode}, and calculations can involve values from several columns. It is even possible to operate on values applying a lag or, in other words, using rows displaced relative to the current one.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{long_iris.tb} \hlopt{%.>%}
  \hlkwd{mutate}\hlstd{(.,}
         \hlkwc{plant_part} \hlstd{=} \hlkwd{str_extract}\hlstd{(part,} \hlstr{"^[:alpha:]*"}\hlstd{),}
         \hlkwc{part_dim} \hlstd{=} \hlkwd{str_extract}\hlstd{(part,} \hlstr{"[:alpha:]*$"}\hlstd{))} \hlkwb{->} \hlstd{long_iris.tb}
\hlstd{long_iris.tb}
\end{alltt}
\begin{verbatim}
## # A tibble: 600 x 5
##   Species part         dimension plant_part part_dim
##   <fct>   <chr>            <dbl> <chr>      <chr>
## 1 setosa  Sepal.Length       5.1 Sepal      Length
## 2 setosa  Sepal.Length       4.9 Sepal      Length
## 3 setosa  Sepal.Length       4.7 Sepal      Length
## # ... with 597 more rows
\end{verbatim}
\end{kframe}
\end{knitrout}

In the next few chunks, we print the returned values rather than saving them in variables. In normal use, one would combine these functions into a pipe using operator \Roperator{\%.>\%} (see section \ref{sec:data:pipes} on page \pageref{sec:data:pipes}).

Function \Rfunction{arrange()} is used for sorting the rows---makes sorting a data frame or tibble simpler than by using \Rfunction{sort()} and \Rfunction{order()}. Here we sort the tibble \code{long\_iris.tb} based on the values in three of its columns.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{arrange}\hlstd{(long_iris.tb, Species, plant_part, part_dim)}
\end{alltt}
\begin{verbatim}
## # A tibble: 600 x 5
##   Species part         dimension plant_part part_dim
##   <fct>   <chr>            <dbl> <chr>      <chr>
## 1 setosa  Petal.Length       1.4 Petal      Length
## 2 setosa  Petal.Length       1.4 Petal      Length
## 3 setosa  Petal.Length       1.3 Petal      Length
## # ... with 597 more rows
\end{verbatim}
\end{kframe}
\end{knitrout}

Function \Rfunction{filter()} can be used to extract a subset of rows---similar to \Rfunction{subset()} but with a syntax consistent with that of other functions in the \pkgname{tidyverse}. In this case, 300 out of the original 600 rows are retained.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{filter}\hlstd{(long_iris.tb, plant_part} \hlopt{==} \hlstr{"Petal"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## # A tibble: 300 x 5
##   Species part         dimension plant_part part_dim
##   <fct>   <chr>            <dbl> <chr>      <chr>
## 1 setosa  Petal.Length       1.4 Petal      Length
## 2 setosa  Petal.Length       1.4 Petal      Length
## 3 setosa  Petal.Length       1.3 Petal      Length
## # ... with 297 more rows
\end{verbatim}
\end{kframe}
\end{knitrout}

Function \Rfunction{slice()} can be used to extract a subset of rows based on their positions---an operation that in base \Rlang would use positional (numeric) indexes with the \code{[ , ]} operator: \code{long\_iris.tb[1:5, ]}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{slice}\hlstd{(long_iris.tb,} \hlnum{1}\hlopt{:}\hlnum{5}\hlstd{)}
\end{alltt}
\begin{verbatim}
## # A tibble: 5 x 5
##   Species part         dimension plant_part part_dim
##   <fct>   <chr>            <dbl> <chr>      <chr>
## 1 setosa  Sepal.Length       5.1 Sepal      Length
## 2 setosa  Sepal.Length       4.9 Sepal      Length
## 3 setosa  Sepal.Length       4.7 Sepal      Length
## # ... with 2 more rows
\end{verbatim}
\end{kframe}
\end{knitrout}

Function \Rfunction{select()} can be used to extract a subset of columns--this would be done with positional (numeric) indexes with \code{[ , ]} in base \Rlang, passing them to the second argument as numeric indexes or column names in a vector. Negative indexes in base \Rlang can only be numeric, while \Rfunction{select()} accepts bare column names prepended with a minus for exclusion.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{select}\hlstd{(long_iris.tb,} \hlopt{-}\hlstd{part)}
\end{alltt}
\begin{verbatim}
## # A tibble: 600 x 4
##   Species dimension plant_part part_dim
##   <fct>       <dbl> <chr>      <chr>
## 1 setosa        5.1 Sepal      Length
## 2 setosa        4.9 Sepal      Length
## 3 setosa        4.7 Sepal      Length
## # ... with 597 more rows
\end{verbatim}
\end{kframe}
\end{knitrout}

In addition, \Rfunction{select()} as other functions in \pkgname{dplyr} accept ``selectors'' returned by functions \Rfunction{starts\_with()}, \Rfunction{ends\_with()}, \Rfunction{contains()}, and \Rfunction{matches()} to extract or retain columns. For this example we use the ``wide''-shaped \code{iris.tb} instead of \code{long\_iris.tb}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{select}\hlstd{(iris.tb,} \hlopt{-}\hlkwd{starts_with}\hlstd{(}\hlstr{"Sepal"}\hlstd{))}
\end{alltt}
\begin{verbatim}
## # A tibble: 150 x 3
##   Petal.Length Petal.Width Species
##          <dbl>       <dbl> <fct>
## 1          1.4         0.2 setosa
## 2          1.4         0.2 setosa
## 3          1.3         0.2 setosa
## # ... with 147 more rows
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{select}\hlstd{(iris.tb, Species,} \hlkwd{matches}\hlstd{(}\hlstr{"pal"}\hlstd{))}
\end{alltt}
\begin{verbatim}
## # A tibble: 150 x 3
##   Species Sepal.Length Sepal.Width
##   <fct>          <dbl>       <dbl>
## 1 setosa           5.1         3.5
## 2 setosa           4.9         3
## 3 setosa           4.7         3.2
## # ... with 147 more rows
\end{verbatim}
\end{kframe}
\end{knitrout}

Function \Rfunction{rename()} can be used to rename columns, whereas base \Rlang requires the use of both \Rfunction{names()} and \Rfunction{names<-()} and \emph{ad hoc} code to match new and old names. As shown below, the syntax for each column name to be changed is \code{<new name> = <old name>}. The two names can be given either as bare names as below or as character strings.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{rename}\hlstd{(long_iris.tb,} \hlkwc{dim} \hlstd{= dimension)}
\end{alltt}
\begin{verbatim}
## # A tibble: 600 x 5
##   Species part           dim plant_part part_dim
##   <fct>   <chr>        <dbl> <chr>      <chr>
## 1 setosa  Sepal.Length   5.1 Sepal      Length
## 2 setosa  Sepal.Length   4.9 Sepal      Length
## 3 setosa  Sepal.Length   4.7 Sepal      Length
## # ... with 597 more rows
\end{verbatim}
\end{kframe}
\end{knitrout}
\index{row-wise operations on data|)}

\subsection{Group-wise manipulations}
\index{group-wise operations on data|(}

Another important operation is to summarize quantities by groups of rows. Contrary to base \Rlang, the grammar of data manipulation, splits this operation in two: the setting of the grouping, and the calculation of summaries. This simplifies the code, making it more easily understandable when using pipes compared to the approach of base \Rlang \Rfunction{aggregate()}, and it also makes it easier to summarize several columns in a single operation.

\begin{warningbox}
It is important to be aware that grouping is persistent, and may also affect other operations on the same data frame or tibble if it is saved or piped and reused. Grouping is invisible to users except for its side effects and because of this can lead to erroneous and surprising results from calculations. Do not save grouped tibbles or data frames, and always make sure that inputs and outputs, at the head and tail of a pipe, are not grouped, by using \Rfunction{ungroup()} when needed.
\end{warningbox}

The first step is to use \Rfunction{group\_by()} to ``tag'' a tibble with the grouping. We create a \emph{tibble} and then convert it into a \emph{grouped tibble}. Once we have a grouped tibble, function \Rfunction{summarise()} will recognize the grouping and use it when the summary values are calculated.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{tibble}\hlstd{(}\hlkwc{numbers} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{9}\hlstd{,} \hlkwc{letters} \hlstd{=} \hlkwd{rep}\hlstd{(letters[}\hlnum{1}\hlopt{:}\hlnum{3}\hlstd{],} \hlnum{3}\hlstd{))} \hlopt{%.>%}
  \hlkwd{group_by}\hlstd{(., letters)} \hlopt{%.>%}
  \hlkwd{summarise}\hlstd{(.,}
            \hlkwc{mean_numbers} \hlstd{=} \hlkwd{mean}\hlstd{(numbers),}
            \hlkwc{median_numbers} \hlstd{=} \hlkwd{median}\hlstd{(numbers),}
            \hlkwc{n} \hlstd{=} \hlkwd{n}\hlstd{())}
\end{alltt}
\begin{verbatim}
## # A tibble: 3 x 4
##   letters mean_numbers median_numbers     n
## * <chr>          <dbl>          <int> <int>
## 1 a                  4              4     3
## 2 b                  5              5     3
## 3 c                  6              6     3
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{warningbox}
How is grouping implemented for data frames and tibbles?\index{grouping!implementation in tidyverse} In our case as our tibble belongs to class \code{tibble\_df}, grouping adds \code{grouped\_df} as the most derived class. It also adds several attributes with the grouping information in a format suitable for fast selection of group members. To demonstrate this, we need to make an exception to our recommendation above and save a grouped tibble to a variable.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.tb} \hlkwb{<-} \hlkwd{tibble}\hlstd{(}\hlkwc{numbers} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{9}\hlstd{,} \hlkwc{letters} \hlstd{=} \hlkwd{rep}\hlstd{(letters[}\hlnum{1}\hlopt{:}\hlnum{3}\hlstd{],} \hlnum{3}\hlstd{))}
\hlkwd{is.grouped_df}\hlstd{(my.tb)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlkwd{class}\hlstd{(my.tb)}
\end{alltt}
\begin{verbatim}
## [1] "tbl_df"     "tbl"        "data.frame"
\end{verbatim}
\begin{alltt}
\hlkwd{names}\hlstd{(}\hlkwd{attributes}\hlstd{(my.tb))}
\end{alltt}
\begin{verbatim}
## [1] "names"     "row.names" "class"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my_gr.tb} \hlkwb{<-} \hlkwd{group_by}\hlstd{(}\hlkwc{.data} \hlstd{= my.tb, letters)}
\hlkwd{is.grouped_df}\hlstd{(my_gr.tb)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{class}\hlstd{(my_gr.tb)}
\end{alltt}
\begin{verbatim}
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
\end{verbatim}
\end{kframe}
\end{knitrout}
% allow page break
\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{names}\hlstd{(}\hlkwd{attributes}\hlstd{(my_gr.tb))}
\end{alltt}
\begin{verbatim}
## [1] "names"     "row.names" "groups"    "class"
\end{verbatim}
\begin{alltt}
\hlkwd{setdiff}\hlstd{(}\hlkwd{attributes}\hlstd{(my_gr.tb),} \hlkwd{attributes}\hlstd{(my.tb))}
\end{alltt}
\begin{verbatim}
## [[1]]
## # A tibble: 3 x 2
##   letters       .rows
## * <chr>   <list<int>>
## 1 a               [3]
## 2 b               [3]
## 3 c               [3]
##
## [[2]]
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my_ugr.tb} \hlkwb{<-} \hlkwd{ungroup}\hlstd{(my_gr.tb)}
\hlkwd{class}\hlstd{(my_ugr.tb)}
\end{alltt}
\begin{verbatim}
## [1] "tbl_df"     "tbl"        "data.frame"
\end{verbatim}
\begin{alltt}
\hlkwd{names}\hlstd{(}\hlkwd{attributes}\hlstd{(my_ugr.tb))}
\end{alltt}
\begin{verbatim}
## [1] "names"     "row.names" "class"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{all}\hlstd{(my.tb} \hlopt{==} \hlstd{my_gr.tb)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{all}\hlstd{(my.tb} \hlopt{==} \hlstd{my_ugr.tb)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{identical}\hlstd{(my.tb, my_gr.tb)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlkwd{identical}\hlstd{(my.tb, my_ugr.tb)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\end{kframe}
\end{knitrout}

The tests above show that members are in all cases the same as operator \Roperator{==} tests for equality at each position in the tibble but not the attributes, while attributes, including \code{class} differ between normal tibbles and grouped ones and so they are not \emph{identical} objects.

If we replace \code{tibble} by \code{data.frame} in the first statement, and rerun the chunk, the result of the last statement in the chunk is \code{FALSE} instead of \code{TRUE}. At the time of writing starting with a \code{data.frame} object, applying grouping with \Rfunction{group\_by()} followed by ungrouping with \Rfunction{ungroup()} has the side effect of converting the data frame into a tibble. This is something to be very much aware of, as there are differences in how the extraction operator \Roperator{[ , ]} behaves in the two cases. The safe way to write code making use of functions from \pkgname{dplyr} and \pkgname{tidyr} is to always use tibbles instead of data frames.
\end{warningbox}
\index{group-wise operations on data|)}

\subsection{Joins}
\index{joins between data sources|(}
\index{merging data from two tibbles|(}
Joins allow us to combine two data sources which share some variables. Variables in common are used to match the corresponding rows before ``joining'' variables (i.e., columns) from both sources together. There are several \emph{join} functions in \pkgname{dplyr}. They differ mainly in how they handle rows that do not have a match between data sources.

We create here some artificial data to demonstrate the use of these functions. We will create two small tibbles, with one column in common and one mismatched row in each.


\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{first.tb} \hlkwb{<-} \hlkwd{tibble}\hlstd{(}\hlkwc{idx} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{1}\hlopt{:}\hlnum{4}\hlstd{,} \hlnum{5}\hlstd{),} \hlkwc{values1} \hlstd{=} \hlstr{"a"}\hlstd{)}
\hlstd{second.tb} \hlkwb{<-} \hlkwd{tibble}\hlstd{(}\hlkwc{idx} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{1}\hlopt{:}\hlnum{4}\hlstd{,} \hlnum{6}\hlstd{),} \hlkwc{values2} \hlstd{=} \hlstr{"b"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

Below we apply the \emph{}\index{joins between data sources!mutating} functions exported by \pkgname{dplyr}: \Rfunction{full\_join()}, \Rfunction{left\_join()}, \Rfunction{right\_join()} and \Rfunction{inner\_join()}. These functions always retain all columns, and in case of multiple matches, keep a row for each matching combination of rows. We repeat each example with the arguments passed to \code{x} and \code{y} swapped to more clearly show their different behavior.

A full join retains all unmatched rows filling missing values with \code{NA}. By default the match is done on columns with the same name in \code{x} and \code{y}, but this can be changed by passing an argument to parameter \code{by}. Using \code{by} one can base the match on columns that have different names in \code{x} and \code{y}, or prevent matching of columns with the same name in \code{x} and \code{y} (example at end of the section).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{full_join}\hlstd{(}\hlkwc{x} \hlstd{= first.tb,} \hlkwc{y} \hlstd{= second.tb)}
\end{alltt}


{\ttfamily\noindent\itshape\color{messagecolor}{\#\# Joining, by = "{}idx"{}}}\begin{verbatim}
## # A tibble: 6 x 3
##     idx values1 values2
## * <dbl> <chr>   <chr>
## 1     1 a       b
## 2     2 a       b
## 3     3 a       b
## 4     4 a       b
## 5     5 a       <NA>
## 6     6 <NA>    b
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{full_join}\hlstd{(}\hlkwc{x} \hlstd{= second.tb,} \hlkwc{y} \hlstd{= first.tb)}
\end{alltt}


{\ttfamily\noindent\itshape\color{messagecolor}{\#\# Joining, by = "{}idx"{}}}\begin{verbatim}
## # A tibble: 6 x 3
##     idx values2 values1
## * <dbl> <chr>   <chr>
## 1     1 b       a
## 2     2 b       a
## 3     3 b       a
## 4     4 b       a
## 5     6 b       <NA>
## 6     5 <NA>    a
\end{verbatim}
\end{kframe}
\end{knitrout}

Left and right joins retain rows not matched from only one of the two data sources, \code{x} and \code{y}, respectively.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{left_join}\hlstd{(}\hlkwc{x} \hlstd{= first.tb,} \hlkwc{y} \hlstd{= second.tb)}
\end{alltt}


{\ttfamily\noindent\itshape\color{messagecolor}{\#\# Joining, by = "{}idx"{}}}\begin{verbatim}
## # A tibble: 5 x 3
##     idx values1 values2
## * <dbl> <chr>   <chr>
## 1     1 a       b
## 2     2 a       b
## 3     3 a       b
## 4     4 a       b
## 5     5 a       <NA>
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{left_join}\hlstd{(}\hlkwc{x} \hlstd{= second.tb,} \hlkwc{y} \hlstd{= first.tb)}
\end{alltt}


{\ttfamily\noindent\itshape\color{messagecolor}{\#\# Joining, by = "{}idx"{}}}\begin{verbatim}
## # A tibble: 5 x 3
##     idx values2 values1
## * <dbl> <chr>   <chr>
## 1     1 b       a
## 2     2 b       a
## 3     3 b       a
## 4     4 b       a
## 5     6 b       <NA>
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{right_join}\hlstd{(}\hlkwc{x} \hlstd{= first.tb,} \hlkwc{y} \hlstd{= second.tb)}
\end{alltt}


{\ttfamily\noindent\itshape\color{messagecolor}{\#\# Joining, by = "{}idx"{}}}\begin{verbatim}
## # A tibble: 5 x 3
##     idx values1 values2
## * <dbl> <chr>   <chr>
## 1     1 a       b
## 2     2 a       b
## 3     3 a       b
## 4     4 a       b
## 5     6 <NA>    b
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{right_join}\hlstd{(}\hlkwc{x} \hlstd{= second.tb,} \hlkwc{y} \hlstd{= first.tb)}
\end{alltt}


{\ttfamily\noindent\itshape\color{messagecolor}{\#\# Joining, by = "{}idx"{}}}\begin{verbatim}
## # A tibble: 5 x 3
##     idx values2 values1
## * <dbl> <chr>   <chr>
## 1     1 b       a
## 2     2 b       a
## 3     3 b       a
## 4     4 b       a
## 5     5 <NA>    a
\end{verbatim}
\end{kframe}
\end{knitrout}

An inner join discards all rows in \code{x} that do not have a matching row in \code{y} and \emph{vice versa}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{inner_join}\hlstd{(}\hlkwc{x} \hlstd{= first.tb,} \hlkwc{y} \hlstd{= second.tb)}
\end{alltt}


{\ttfamily\noindent\itshape\color{messagecolor}{\#\# Joining, by = "{}idx"{}}}\begin{verbatim}
## # A tibble: 4 x 3
##     idx values1 values2
## * <dbl> <chr>   <chr>
## 1     1 a       b
## 2     2 a       b
## 3     3 a       b
## 4     4 a       b
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{inner_join}\hlstd{(}\hlkwc{x} \hlstd{= second.tb,} \hlkwc{y} \hlstd{= first.tb)}
\end{alltt}


{\ttfamily\noindent\itshape\color{messagecolor}{\#\# Joining, by = "{}idx"{}}}\begin{verbatim}
## # A tibble: 4 x 3
##     idx values2 values1
## * <dbl> <chr>   <chr>
## 1     1 b       a
## 2     2 b       a
## 3     3 b       a
## 4     4 b       a
\end{verbatim}
\end{kframe}
\end{knitrout}

Next we apply the \emph{filtering join}\index{joins between data sources!filtering} functions exported by \pkgname{dplyr}: \Rfunction{semi\_join()} and \Rfunction{anti\_join()}. These functions only return a tibble that always contains only the columns from \code{x}, but retains rows based on their match to rows in \code{y}.

A semi join retains rows from \code{x} that have a match in \code{y}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{semi_join}\hlstd{(}\hlkwc{x} \hlstd{= first.tb,} \hlkwc{y} \hlstd{= second.tb)}
\end{alltt}


{\ttfamily\noindent\itshape\color{messagecolor}{\#\# Joining, by = "{}idx"{}}}\begin{verbatim}
## # A tibble: 4 x 2
##     idx values1
##   <dbl> <chr>
## 1     1 a
## 2     2 a
## 3     3 a
## 4     4 a
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{semi_join}\hlstd{(}\hlkwc{x} \hlstd{= second.tb,} \hlkwc{y} \hlstd{= first.tb)}
\end{alltt}


{\ttfamily\noindent\itshape\color{messagecolor}{\#\# Joining, by = "{}idx"{}}}\begin{verbatim}
## # A tibble: 4 x 2
##     idx values2
##   <dbl> <chr>
## 1     1 b
## 2     2 b
## 3     3 b
## 4     4 b
\end{verbatim}
\end{kframe}
\end{knitrout}

A anti-join retains rows from \code{x} that do not have a match in \code{y}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{anti_join}\hlstd{(}\hlkwc{x} \hlstd{= first.tb,} \hlkwc{y} \hlstd{= second.tb)}
\end{alltt}


{\ttfamily\noindent\itshape\color{messagecolor}{\#\# Joining, by = "{}idx"{}}}\begin{verbatim}
## # A tibble: 1 x 2
##     idx values1
##   <dbl> <chr>
## 1     5 a
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{anti_join}\hlstd{(}\hlkwc{x} \hlstd{= second.tb,} \hlkwc{y} \hlstd{= first.tb)}
\end{alltt}


{\ttfamily\noindent\itshape\color{messagecolor}{\#\# Joining, by = "{}idx"{}}}\begin{verbatim}
## # A tibble: 1 x 2
##     idx values2
##   <dbl> <chr>
## 1     6 b
\end{verbatim}
\end{kframe}
\end{knitrout}

We here rename column \code{idx} in \code{first.tb} to demonstrate the use of \code{by} to specify which columns should be searched for matches.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{first2.tb} \hlkwb{<-} \hlkwd{rename}\hlstd{(first.tb,} \hlkwc{idx2} \hlstd{= idx)}
\hlkwd{full_join}\hlstd{(}\hlkwc{x} \hlstd{= first2.tb,} \hlkwc{y} \hlstd{= second.tb,} \hlkwc{by} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"idx2"} \hlstd{=} \hlstr{"idx"}\hlstd{))}
\end{alltt}
\begin{verbatim}
## # A tibble: 6 x 3
##    idx2 values1 values2
## * <dbl> <chr>   <chr>
## 1     1 a       b
## 2     2 a       b
## 3     3 a       b
## 4     4 a       b
## 5     5 a       <NA>
## 6     6 <NA>    b
\end{verbatim}
\end{kframe}
\end{knitrout}
\index{merging data from two tibbles|)}
\index{joins between data sources|)}
\index{data manipulation in the tidyverse|)}


\section{Further reading}
An\index{further reading!new grammars of data} in-depth discussion of the \pkgname{tidyverse} is outside the scope of this book. Several books describe in detail the use of these packages. As several of them are under active development, recent editions of books such as \citebooktitle{Wickham2017} \autocite{Wickham2017} are the most useful.


% !Rnw root = appendix.main.Rnw


\chapter{Grammar of graphics}\label{chap:R:plotting}

\begin{VF}
The commonality between science and art is in trying to see profoundly---to develop strategies of seeing and showing.

\VA{Edward Tufte's answer to Charlotte Thralls}{\emph{An Interview with Edward R. Tufte}, 2004}\nocite{Zachry2004}
\end{VF}

%\dictum[Edward Tufte]{The commonality between science and art is in trying to see profoundly---to develop strategies of seeing and showing.}

\index{geometries ('ggplot2')|see{grammar of graphics, geometries}}
%\index{geom@\texttt{geom}|see{grammar of graphics, geometries}}
%\index{functions!geom@\texttt{geom}|see{grammar of graphics, geometries}}
\index{statistics ('ggplot2')|see{grammar of graphics, statistics}}
%\index{stat@\texttt{stat}|see{grammar of graphics, statistics}}
%\index{functions!stat@\texttt{stat}|see{grammar of graphics, statistics}}
\index{scales ('ggplot2')|see{grammar of graphics, scales}}
%\index{scale@\texttt{scale}|see{grammar of graphics, scales}}
%\index{functions!scale@\texttt{scale}|see{grammar of graphics, scales}}
\index{coordinates ('ggplot2')|see{grammar of graphics, coordinates}}
\index{themes ('ggplot2')|see{grammar of graphics, themes}}
%\index{theme@\texttt{scale}|see{grammar of graphics, themes}}
%\index{function!theme@\texttt{scale}|see{grammar of graphics, themes}}
\index{facets ('ggplot2')|see{grammar of graphics, facets}}
\index{annotations ('ggplot2')|see{grammar of graphics, annotations}}
\index{aesthetics ('ggplot2')|see{grammar of graphics, aesthetics}}

\section{Aims of this chapter}

Three main data plotting systems are available to \Rlang users: base \Rlang, package \pkgname{lattice} \autocite{Sarkar2008} and package \pkgname{ggplot2} \autocite{Wickham2016}, the last one being the most recent and currently most popular system available in \Rlang for plotting data. Even two different sets of graphics primitives (i.e., those used to produce the simplest graphical elements such as lines and symbols) are available in \Rlang, those in base \Rlang and a newer one in the \pkgname{grid} package \autocite{Murrell2011}.

In this chapter you will learn the concepts of the grammar of graphics, on which package \pkgname{ggplot2} is based. You will also learn how to build several types of data plots with package \pkgname{ggplot2}. As a consequence of the popularity and flexibility of \pkgname{ggplot2}, many contributed packages extending its functionality have been developed and deposited in public repositories. However, I will focus mainly on package \pkgname{ggplot2} only briefly describing a few of these extensions.

\section{Packages used in this chapter}


If the packages used in this chapter are not yet installed in your computer, you can install them as shown below, as long as package \pkgname{learnrbook} is already installed.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{install.packages}\hlstd{(learnrbook}\hlopt{::}\hlstd{pkgs_ch_ggplot)}
\end{alltt}
\end{kframe}
\end{knitrout}

To run the examples included in this chapter, you need first to load some packages from the library (see section \ref{sec:script:packages} on page \pageref{sec:script:packages} for details on the use of packages).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{library}\hlstd{(learnrbook)}
\hlkwd{library}\hlstd{(wrapr)}
\hlkwd{library}\hlstd{(scales)}
\hlkwd{library}\hlstd{(ggplot2)}
\hlkwd{library}\hlstd{(ggrepel)}
\hlkwd{library}\hlstd{(gginnards)}
\hlkwd{library}\hlstd{(ggpmisc)}
\hlkwd{library}\hlstd{(ggbeeswarm)}
\hlkwd{library}\hlstd{(ggforce)}
\hlkwd{library}\hlstd{(tikzDevice)}
\hlkwd{library}\hlstd{(lubridate)}
\hlkwd{library}\hlstd{(tidyverse)}
\hlkwd{library}\hlstd{(patchwork)}
\end{alltt}
\end{kframe}
\end{knitrout}


\section{Introduction to the grammar of graphics}
\index{grammar of graphics!elements|(}
What separates \ggplot from base \Rlang and trellis/lattice plotting functions is the use of a grammar of graphics\index{grammar of graphics} (the reason behind `gg' in the name of package \pkgname{ggplot2}). What is meant by grammar in this case is that plots are assembled piece by piece using different ``nouns'' and ``verbs'' \autocite{Cleveland1985}. Instead of using a single function with many arguments, plots are assembled by combining different elements with operators \code{+} and \verb|%+%|. Furthermore, the construction is mostly semantics-based and to a large extent, how plots look when printed, displayed, or exported to a bitmap or vector-graphics file is controlled by themes.

We can think of plotting as representing the observations or data in a graphical language. We use the properties of graphical objects to represent different aspects of our data. An observation can consist of multiple recorded values. Say an observation of air temperature may be defined by a position in 3-dimensional space and a point in time, in addition to the temperature itself. An observation for the size and shape of a plant can consist of height, stem diameter, number of leaves, size of individual leaves, length of roots, fresh mass, dry mass, etc. If we are interested in the relationship between height and stem diameter, we may want to use cartesian coordinates\index{grammar of graphics!cartesian coordinates}, \emph{mapping} stem diameter to the $x$ dimension of the plot and the height to the $y$ dimension. The observations could be represented on the plot by points and/or joined by lines.

The grammar of graphics allows us to design plots by combining various elements in ways that are nearly orthogonal. In other words, the majority of the possible combinations of ``words'' yield valid plots as long as we assemble them respecting the rules of the grammar. This flexibility makes \ggplot extremely powerful as we can build plots and even types of plots which were not even considered while designing the \ggplot package.

When a plot is built, the whole plot and its components are created as \Rlang objects that can be saved in the workspace or written to a file as objects. The graphical representation is generated when the object is printed, explicitly or automatically. The same ggplot object can be rendered into different bitmap and vector graphic formats for display or printing.

Even if we do not explicitly add them all, default elements may be used. The production of a rendered graphic with package \pkgname{ggplot2} can be represented as a flow of information:
\textsf{data $\to$ scale $\to$ statistic $\to$ aesthetic $\to$ geometry $\to$ coordinate $\to$ ggplot $\to$ theme $\to$ rendered graphic}.

\subsection{Data}
The\index{grammar of graphics!data} data to be plotted must be available as a \code{data.frame} or \code{tibble}, with data stored so that each row represents a single observation event, and the columns are different values observed in that single event. In other words, in long form (so-called ``tidy data'') as described in chapter \ref{chap:R:data}. The variables to be plotted can be \code{numeric}, \code{factor}, \code{character}, and time or date stored as \code{POSIXct}.

\subsection{Mapping}

When\index{grammar of graphics!mapping of data} we design a plot, we need to map data variables to aesthetics\index{plots!aesthetics} (or graphic properties). Most plots will have an $x$ dimension, which is considered an \emph{aesthetic}, and a variable containing numbers mapped to it. The position on a 2D plot of, say, a point, will be determined by $x$ and $y$ aesthetics, while in a 3D plot, three aesthetics need to be mapped $x$, $y$ and $z$. Many aesthetics are not related to coordinates, they are properties, like color, size, shape, line type, or even rotation angle, which add an additional dimension on which to represent the values of variables and/or constants.

\subsection{Geometries}

\sloppy%
Geometries\index{grammar of graphics!geometries} are ``words'' that describe the graphics representation of the data: for example, \gggeom{geom\_point()}, plots a point or symbol for each observation, while \gggeom{geom\_line()}, draws line segments between observations. Some geometries rely by default on statistics, but most ``geoms'' default to the identity statistics. Each time a \emph{geometry} is used to add a graphical representation of data to a plot, we say that a new \emph{layer} has been added. The name \emph{layer} reflects the fact that each new layer added is plotted on top of the layers already present in the plot, or rather when a plot is printed the layers will be generated in the order they were added to the ggplot object. For example, one layer in a plot can display the observations, another layer a regression line fitted to them, and a third one may contain annotations such an equation or a text label.

\subsection{Statistics}

Statistics\index{grammar of graphics!statistics} are ``words'' that represent calculation of summaries or some other operation on the values from the data. When \emph{statistics} are used for a computation, the returned value is passed directly to a \emph{geometry}, and consequently adding an \emph{statistics} also adds a layer to the plot. For example, \ggstat{stat\_smooth()} fits a smoother, and \ggstat{stat\_summary()} applies a summary function. Statistics are applied automatically by group when data have been grouped by mapping additional aesthetics such as color to a factor.

\subsection{Scales}

Scales\index{grammar of graphics!scales} give the ``translation'' or mapping between data values and the aesthetic values to be actually plotted. Mapping a variable to the ``color'' aesthetic (also recognized when spelled as ``colour'') only tells that different values stored in the mapped variable will be represented by different colors. A scale, such as \ggscale{scale\_color\_continuous()}, will determine which color in the plot corresponds to which value in the variable. Scales can also define transformations on the data, which are used when mapping data values to aesthetic values. All continuous scales support transformations---e.g., in the case of $x$ and $y$ aesthetics, positions on the plotting region or viewport will be affected by the transformation, while the original values will be used for tick labels along the axes.  Scales are used for all aesthetics, including continuous variables, such as numbers, and categorical ones such as factors. The grammar of graphics allows only one scale per \emph{aesthetic} and plot. This restriction is imposed by design to avoid ambiguity (e.g., it ensures that the red color will have the same ``meaning'' in all plot layers where the \code{color} \emph{aesthetic} is mapped to data). Scales have limits with observations falling outside these limits being ignored (replaced by \code{NA}) rather than passed to statistics or geometries---it is easy to unintentionally drop observations when setting scale limits manually as warning messages report that \code{NA} values have been omitted.

\subsection{Coordinate systems}

The\index{grammar of graphics!coordinates} most frequently used coordinate system when plotting data, the cartesian system, is the default for most \emph{geometries}. In the cartesian system, $x$ and $y$ are represented as distances on two orthogonal (at 90$^\circ$) axes. Additional coordinate systems are available in \pkgname{ggplot2} and through extensions. For example, in the polar system of coordinates, the $x$ values are mapped to angles around a central point and $y$ values to the radius. Another example is the ternary system of coordinates, an extension of the grammar implemented in package \pkgname{ggtern}, that allows the construction of ternary plots. Setting limits to a coordinate system changes the region of the plotting space visible in the plot, but does not discard observations. In other words, when using \emph{statistics}, observations located outside the coordinate limits, i.e., not visible in the rendered plot, will still be included in computations.

\subsection{Themes}

How\index{grammar of graphics!themes} the plots look when displayed or printed can be altered by means of themes. A plot can be saved without adding a theme and then printed or displayed using different themes. Also, individual theme elements can be changed, and whole new themes defined. This adds a lot of flexibility and helps in the separation of the data representation aspects from those related to the graphical design.
\index{grammar of graphics!elements|)}

\subsection{Plot construction}
\index{grammar of graphics!plot construction|(}
We have described above the components of the grammar of graphics: aesthetics (\code{aes}), for example color, geometric elements \code{geom\_\ldots} such as lines and points, statistics \code{stat\_\ldots}, scales \code{scale\_\ldots}, coordinate systems \code{coord\_\ldots}, and themes \code{theme\_\ldots}. In this section we will see how plots are assembled from these elements.

As the workings and use of the grammar are easier to show by example than to explain with words, will show how to build plots of increasing complexity. All elements of a plot have defaults, although in some cases these defaults result in empty plots. Defaults make it possible to create a plot very succinctly. We use function \code{ggplot()} to create the skeleton for a plot, which can be enhanced, but also printed as is.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{()}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-ggplot-basics-01-1}

}


\end{knitrout}

The plot above is of little use without any data, so we next pass a data frame object, in this case \code{mtcars}---\Rdata{mtcars} is a data set included in \Rlang; to learn more about this data set, type \code{help("mtcars")} at the \Rlang command prompt.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars)}
\end{alltt}
\end{kframe}
\end{knitrout}

Once the data are available, we need to \emph{map} the quantities in the data onto graphical features in the plot, or \emph{aesthetics}. When plotting in two dimensions, we need to map variables in the data to at least the $x$ and $y$ aesthetics. This mapping can be seen in the chunk below by its effect on the plotting area ranges that now match the ranges of the mapped variables, extended by a small margin. The axis labels also reflect the names of the mapped variables, however, there is no graphical element yet displayed for the individual observations.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,}
       \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg))}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-ggplot-basics-03-1}

}


\end{knitrout}

To make observations visible in the plot we need to add a suitable \emph{geometry} or \code{geom} to the plot. Here we display the observations as points using \gggeom{geom\_point()}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,}
       \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-ggplot-basics-04-1}

}


\end{knitrout}

\begin{warningbox}
In the examples above, the plots were printed automatically, which is the default at the \Rlang console. However, as with other \Rlang objects, ggplots can be assigned to a variable,

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{p} \hlkwb{<-} \hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,}
            \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg))} \hlopt{+}
       \hlkwd{geom_point}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}

and printed at a later time.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{print}\hlstd{(p)}
\end{alltt}
\end{kframe}
\end{knitrout}
\end{warningbox}

\begin{advplayground}
Above we have seen how to build a plot, layer by layer, using the grammar of graphics. We have also seen how to save a ggplot. We can peep into the innards of this object using \code{summary()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{summary}\hlstd{(p)}
\end{alltt}
\end{kframe}
\end{knitrout}
We can view the structure of the ggplot object with \code{str()}.

Package \pkgname{gginnards} provides methods \code{str()}, \code{num\_layers()}, \code{top\_layer()} and  \code{mapped\_vars()}. Use these methods to explore ggplot objects with different numbers of layers or mappings. You will see that the plot elements that were added to the plot are stored as members of a list with nested lists forming a tree-like structure.
\end{advplayground}

Although \emph{aesthetics} can be mapped to variables in the data, they can also be set to constant values, but only within layers, not as whole-plot defaults.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,}
       \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{(}\hlkwc{color} \hlstd{=} \hlstr{"red"}\hlstd{,} \hlkwc{shape} \hlstd{=} \hlstr{"square"}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-ggplot-basics-04a-1}

}


\end{knitrout}

While a geometry directly constructs a graphical representation of the observations in the data, a \emph{statistics} or \code{stat} ``sits'' in-between the data and a \code{geom}, applying some computation, usually but not always, to produce a statistical summary of the data. Here we add a fitted line using \code{stat\_smooth()} with its output added to the plot using \gggeom{geom\_line()} passed by name with \code{"line"} as an argument to \code{stat\_smooth}. We fit a linear regression, using \code{lm()} as the method.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,}
       \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{stat_smooth}\hlstd{(}\hlkwc{geom} \hlstd{=} \hlstr{"line"}\hlstd{,} \hlkwc{method} \hlstd{=} \hlstr{"lm"}\hlstd{,} \hlkwc{formula} \hlstd{= y} \hlopt{~} \hlstd{x)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-ggplot-basics-05-1}

}


\end{knitrout}

We haven't yet added some of the elements of the grammar described above: \emph{scales}, \emph{coordinates} and \emph{themes}. The plots were rendered anyway because these elements have defaults which are used when we do not set them explicitly. We next will see examples in which they are explicitly set. We start with a scale using a logarithmic transformation. This works like plotting by hand using graph paper with rulings spaced according to a logarithmic scale. Tick marks continue to be expressed in the original units, but statistics are applied to the transformed data. In other words, a transformed scale affects the values before they are passed to \emph{statistics}, and the linear regression will be fitted to \code{log10()} transformed $y$ values and the original $x$ values.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,}
       \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{stat_smooth}\hlstd{(}\hlkwc{geom} \hlstd{=} \hlstr{"line"}\hlstd{,} \hlkwc{method} \hlstd{=} \hlstr{"lm"}\hlstd{,} \hlkwc{formula} \hlstd{= y} \hlopt{~} \hlstd{x)} \hlopt{+}
  \hlkwd{scale_y_log10}\hlstd{()}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-ggplot-basics-06-1}

}


\end{knitrout}

The range limits of a scale can be set manually, instead of automatically by default. These limits create a virtual \emph{window into the data}: observations outside the scale limits remain hidden and are not mapped to aesthetics---i.e., these observations are not included in the graphical representation or used in calculations. Crucially, when using \emph{statistics} the computations are only applied to observations that fall within the limits of all scales in use. These limits \emph{indirectly} affect the plotting area when the plotting area is automatically set based on the range of the (within limits) data---even the mapping to values of a different aesthetics may change when a subset of the data are selected by manually setting the limits of a scale.

In contrast to \emph{scale limits}, \emph{coordinates}\index{grammar of graphics!cartesian coordinates} function as a \emph{zoomed view} into the plotting area, and do not affect which observations are visible to \emph{statistics}. The coordinate system, as expected, is also determined by this grammar element---here we use cartesian coordinates which are the default, but we manually set $y$ limits.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,}
       \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{stat_smooth}\hlstd{(}\hlkwc{geom} \hlstd{=} \hlstr{"line"}\hlstd{,} \hlkwc{method} \hlstd{=} \hlstr{"lm"}\hlstd{,} \hlkwc{formula} \hlstd{= y} \hlopt{~} \hlstd{x)} \hlopt{+}
  \hlkwd{coord_cartesian}\hlstd{(}\hlkwc{ylim} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{15}\hlstd{,} \hlnum{25}\hlstd{))}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-ggplot-basics-07-1}

}


\end{knitrout}

The next example uses a coordinate system transformation. When the transformation is applied to the coordinate system, it affects only the plotting---it sits between the geom and the rendering of the plot. The transformation is applied to the values returned by any \emph{statistics}. The straight line fitted is plotted on the transformed coordinates as a curve, because the model was fitted to the untransformed data and this fitted model is automatically used to obtain the predicted values, which are then plotted after the transformation is applied to them. We have here described only cartesian coordinate systems while other coordinate systems are described in sections \ref{sec:plot:sf} and \ref{sec:plot:circular} on pages \pageref{sec:plot:sf} and \pageref{sec:plot:circular}, respectively.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,}
       \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{stat_smooth}\hlstd{(}\hlkwc{geom} \hlstd{=} \hlstr{"line"}\hlstd{,} \hlkwc{method} \hlstd{=} \hlstr{"lm"}\hlstd{,} \hlkwc{formula} \hlstd{= y} \hlopt{~} \hlstd{x)} \hlopt{+}
  \hlkwd{coord_trans}\hlstd{(}\hlkwc{y} \hlstd{=} \hlstr{"log10"}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-ggplot-basics-08-1}

}


\end{knitrout}

Themes affect the rendering of plots at the time of printing---they can be thought of as style sheets defining the graphic design. A complete theme can override the default gray theme. The plot is the same, the observations are represented in the same way, the limits of the axes are the same and all text is the same. On the other, hand how these elements are rendered by different themes can be drastically different.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,}
       \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{theme_classic}\hlstd{()}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-ggplot-basics-09-1}

}


\end{knitrout}

We can also override the base font size and font family. This affects the size of all text elements, as their size is defined relative to the base size. Here we add the same theme as used in the previous example, but with a different base point size for text.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,}
       \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{theme_classic}\hlstd{(}\hlkwc{base_size} \hlstd{=} \hlnum{20}\hlstd{,} \hlkwc{base_family} \hlstd{=} \hlstr{"serif"}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-ggplot-basics-10-1}

}


\end{knitrout}

The details of how to set axis labels, tick positions and tick labels will be discussed in depth in section \ref{sec:plot:scales}. Meanwhile, we will use function \code{labs()} which is a convenience function allowing us to easily set the title and subtitle of a plot and to replace the default \code{name} of scales used for axis labels---by default \code{name} is set to the name of the mapped variable. When setting the \code{name} of scales with \code{labs()}, we use as parameter names the names of aesthetics and pass as an argument a character string, or an \Rlang expression. Here we use \code{x} and \code{y}, the names of the two \emph{aesthetics} to which we have mapped two variables in \code{data}, \code{disp} and \code{mpg}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,}
       \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{labs}\hlstd{(}\hlkwc{x} \hlstd{=} \hlstr{"Engine displacement (cubic inches)"}\hlstd{,}
       \hlkwc{y} \hlstd{=} \hlstr{"Fuel use efficiency\textbackslash{}n(miles per gallon)"}\hlstd{,}
       \hlkwc{title} \hlstd{=} \hlstr{"Motor Trend Car Road Tests"}\hlstd{,}
       \hlkwc{subtitle} \hlstd{=} \hlstr{"Source: 1974 Motor Trend US magazine"}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-ggplot-basics-11-1}

}


\end{knitrout}

\begin{infobox}
As elsewhere in \Rlang, when a value is expected, either a value stored in a variable or statement returning a suitable value can be passed as an argument to be mapped to an \emph{aesthetic}. In other words, the values to be plotted do not need to be stored as variables (or columns) in the data frame passed as an argument to parameter \code{data}, they can also be computed from these variables. Here we plot miles-per-gallon, \code{mpg} on the engine displacement per cylinder by dividing \code{disp} by \code{cyl} within the call to \code{aes()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp} \hlopt{/} \hlstd{cyl,} \hlkwc{y} \hlstd{= mpg))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-ggplot-basics-info-01-1}

}


\end{knitrout}

\end{infobox}

Each of the elements of the grammar exemplified above has several different members, and many of the individual \emph{geometries} and \emph{statistics} accept arguments that can be used to modify their behavior. There are also more \emph{aesthetics} than those shown above. Multiple data objects as well as multiple mappings can coexist within a single ggplot object. Packages and user code can define new \emph{geometries}, \emph{statistics}, \emph{coordinates} and even implement new \emph{aesthetics}. Individual elements in a theme can also be modified and new complete themes created, re-used and shared. We will describe in the remaining sections of this chapter how to use the grammar of graphics to construct other types of graphical presentations including more complex plots than those in the examples above.
\index{grammar of graphics!plot construction|)}

\subsection{Plots as \Rlang objects}
\index{grammar of graphics!plots as R objects|(}
We can manipulate ggplot objects and their components in the same way as other \Rlang objects. We can operate on them using the operators and methods defined for the \code{"gg"} class they belong to. We start by saving a ggplot into a variable.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{p} \hlkwb{<-} \hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,}
       \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{warningbox}
  \index{grammar of graphics!structure of plot objects|(}
  The separation of plot construction and rendering is possible, because \code{"gg"} objects are self-contained. Most importantly, a copy of the data object passed as argument is saved within the plot object. In the example above, \code{p} by itself could be saved to a file on disk and loaded into a clean \Rlang session, even on another computer, and rendered as long as package \ggplot and its dependencies are available. Another consequence of a copy of the data being stored in the plot object, is that editing the data used to create a \code{"gg"} object after its creation does \emph{not} affect rendered plots unless we recreate the "gg" object.

  With \code{str()} we can explore the structure of any \Rlang object, including those of class \code{"gg"}. We use \code{max.level = 1} to reduce the length of output, but to see deeper into the nested list you can increase the value passed as an argument to \code{max.level} or simply accept its default.

% the next chuck works but it leads to stack overflow in LaTeX
% there is something wrong with how knitr handles the output of str()
% eval_playground
\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{str}\hlstd{(p,} \hlkwc{max.level} \hlstd{=} \hlnum{1}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
  \index{grammar of graphics!structure of plot objects|)}
\end{warningbox}

When we used in the previous section operator \code{+} to assemble the plots, we were operating on ``anonymous'' \Rlang objects. In the same way, we can operate on saved or ``named'' objects.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{p} \hlopt{+}
  \hlkwd{stat_smooth}\hlstd{(}\hlkwc{geom} \hlstd{=} \hlstr{"line"}\hlstd{,} \hlkwc{method} \hlstd{=} \hlstr{"lm"}\hlstd{,} \hlkwc{formula} \hlstd{= y} \hlopt{~} \hlstd{x)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-ggplot-objects-02-1}

}


\end{knitrout}

\begin{playground}
  Reproduce the examples in the previous section, using \code{p} defined above as a basis instead of building each plot from scratch.
\end{playground}

\begin{infobox}
  In the examples above we have been adding elements one by one, using the \code{+} operator. It is also possible to add multiple components in a single operation using a list. This is useful, when we want to save sets of components in a variable so as to reuse them in multiple plots. This saves typing, ensures consistency and can make alterations to a set of similar plots much easier.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.layers} \hlkwb{<-} \hlkwd{list}\hlstd{(}
  \hlkwd{stat_smooth}\hlstd{(}\hlkwc{geom} \hlstd{=} \hlstr{"line"}\hlstd{,} \hlkwc{method} \hlstd{=} \hlstr{"lm"}\hlstd{,} \hlkwc{formula} \hlstd{= y} \hlopt{~} \hlstd{x),}
  \hlkwd{scale_x_log10}\hlstd{())}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{p} \hlopt{+} \hlstd{my.layers}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-ggplot-objects-info-02-1}

}


\end{knitrout}

\end{infobox}
\index{grammar of graphics!plots as R objects|)}

\subsection{Data and mappings}
\index{grammar of graphics!mapping of data|(}
In the case of simple plots, based on data contained in a single data frame, the usual style is to code a plot as described above, passing an argument, \code{mtcars} in these examples, to the \code{data} parameter of \Rfunction{ggplot()}. Data passed in this way becomes the default for all layers in the plot. The same applies to the argument passed to \code{mapping}.\qRfunction{aes()}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,}
       \hlkwc{mapping} \hlstd{=} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}

However, the grammar of graphics contemplates the possibility of data and mappings restricted to individual layers. In this case, those passed as arguments to \Rfunction{ggplot()}, if present,  are overridden by arguments passed to individual layers, making it possible to code the same plot as follows.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{()} \hlopt{+}
  \hlkwd{geom_point}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,}
             \hlkwc{mapping} \hlstd{=} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg))}
\end{alltt}
\end{kframe}
\end{knitrout}

The default mapping can also be added directly with the \code{+} operator, instead of being passed as an argument to \Rfunction{ggplot()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars)} \hlopt{+}
  \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg)} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}

It is even possible to have a default mapping for the whole plot, but no default data.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{()} \hlopt{+}
  \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg)} \hlopt{+}
  \hlkwd{geom_point}\hlstd{(}\hlkwc{data} \hlstd{= mtcars)}
\end{alltt}
\end{kframe}
\end{knitrout}

In these examples, the plot remains unchanged, but this flexibility in the grammar allows, in plots containing multiple layers, for each layer to use different data or a different mapping.

\begin{explainbox}
The argument passed to parameter \code{data} of a layer function, can be a function instead of a data frame, if the plot contains default data. In this case, the function is applied to the default data and must return a data frame containing data to be used in the layer.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,}
       \hlkwc{mapping} \hlstd{=} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{(}\hlkwc{size} \hlstd{=} \hlnum{4}\hlstd{)} \hlopt{+}
  \hlkwd{geom_point}\hlstd{(}\hlkwc{data} \hlstd{=} \hlkwa{function}\hlstd{(}\hlkwc{x}\hlstd{)\{}\hlkwd{subset}\hlstd{(x, cyl} \hlopt{==} \hlnum{4}\hlstd{)\},} \hlkwc{color} \hlstd{=} \hlstr{"yellow"}\hlstd{,}
             \hlkwc{size} \hlstd{=} \hlnum{1.5}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

The plot default data can also be operated upon using the \pkgname{magritrr} pipe operator, but not the dot-pipe operator from \pkgname{wrapr} (see section \ref{sec:data:pipes} on page \pageref{sec:data:pipes}).
\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,}
       \hlkwc{mapping} \hlstd{=} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{(}\hlkwc{size} \hlstd{=} \hlnum{4}\hlstd{)} \hlopt{+}
  \hlkwd{geom_point}\hlstd{(}\hlkwc{data} \hlstd{= .} \hlopt{%.>%} \hlkwd{subset}\hlstd{(}\hlkwc{x} \hlstd{= ., cyl} \hlopt{==} \hlnum{4}\hlstd{),} \hlkwc{color} \hlstd{=} \hlstr{"yellow"}\hlstd{,}
             \hlkwc{size} \hlstd{=} \hlnum{1.5}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

\end{explainbox}
\index{grammar of graphics!mapping of data|)}

\section{Geometries}\label{sec:plot:geometries}
\index{grammar of graphics!geometries|(}

Different geometries support different \emph{aesthetics}. While \gggeom{geom\_point()} supports \code{shape}, and \gggeom{geom\_line()} supports \code{linetype}, both support \code{x}, \code{y}, \code{color} and \code{size}. In this section we will describe the different \code{geometries} available in package \ggplot and some examples from packages that extend \ggplot. The graphic output from most code examples will not be shown, with the expectation that readers will run them to see the plots.

Mainly for historical reasons, \emph{geometries} accept a \emph{statistic} as an argument, in the same way as \emph{statistics} accept a \emph{geometry} as an argument. In this section we will only describe \emph{geometries} which have as a default \emph{statistic} \code{stat\_identity} which passes values directly as mapped. The \emph{geometries} that have other \emph{statistics} as default are described in section \ref{sec:plot:stat:summaries} together with the corresponding \emph{statistics}.

\subsection{Point}\label{sec:plot:geom:point}
\index{grammar of graphics!point geometry|(}

As shown earlier in this chapter, \gggeom{geom\_point()}, can be used to add a layer with observations represented by ``points'' or symbols. Variable \code{cyl} describes the numbers of cylinders in the engines of the cars. It is a numeric variable, and when mapped to color, a continuous color scale is used to represent this variable.

\index{plots!scatter plot|(}The first examples build scatter plots, because numeric variables are mapped to both \code{x} and \code{y}.
Some scales, like those for \code{color}, exist in two ``flavors,'' one suitable for numeric variables (continuous) and another for factors (discrete).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,}
       \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg,} \hlkwc{color} \hlstd{= cyl))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-scatter-01-1}

}


\end{knitrout}

If we convert \code{cyl} into a factor, a discrete color scale is used instead of a continuous one.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,}
       \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg,} \hlkwc{color} \hlstd{=} \hlkwd{factor}\hlstd{(cyl)))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}

If we convert \code{cyl} into an ordered factor, a different discrete color scale is used by default.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,}
       \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg,} \hlkwc{color} \hlstd{=} \hlkwd{ordered}\hlstd{(cyl)))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{playground}
Try a different mapping: \code{disp} $\rightarrow$ \code{color}, \code{cyl} $\rightarrow$ \code{x}. Continue by using \code{help(mtcars)} and/or \code{names(mtcars)} to see what variables are available, and then try the combinations that trigger your curiosity---i.e., explore the data.
\end{playground}

The mapping between data values and aesthetic values is controlled by scales. Different color scales, and even palettes within a given scale, provide different mappings between data values and rendered colours.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,}
       \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg,} \hlkwc{color} \hlstd{=} \hlkwd{factor}\hlstd{(cyl)))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{scale_color_brewer}\hlstd{(}\hlkwc{type} \hlstd{=} \hlstr{"qual"}\hlstd{,} \hlkwc{palette} \hlstd{=} \hlnum{2}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

The data, aesthetics mappings, and geometries are the same as in earlier code; to alter how the plot looks, we have changed only the scale and palette used for the color aesthetic. Conceptually it is still exactly the same plot we created earlier, except for the colours used. This is a very important point to understand, because it allows us to separate two different concerns: the semantic structure and the graphic design.

\begin{playground}
Try the different palettes available through the brewer scale. You can play directly with the palettes using function \code{brewer\_pal()} from package \pkgname{scales} together with \code{show\_col()}).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{show_col}\hlstd{(}\hlkwd{brewer_pal}\hlstd{()(}\hlnum{3}\hlstd{))}
\hlkwd{show_col}\hlstd{(}\hlkwd{brewer_pal}\hlstd{(}\hlkwc{type} \hlstd{=} \hlstr{"qual"}\hlstd{,} \hlkwc{palette} \hlstd{=} \hlnum{2}\hlstd{,} \hlkwc{direction} \hlstd{=} \hlnum{1}\hlstd{)(}\hlnum{3}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}
Once you have found a suitable palette for these data, redo the plot above with the chosen palette.
\end{playground}

When not relying on colors, the most common way of distinguishing groups of observations in scatter plots is to use the \code{shape} of the points as an \emph{aesthetic}. We need to change a single ``word'' in the code statement to achieve this different mapping.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg,} \hlkwc{shape} \hlstd{=} \hlkwd{factor}\hlstd{(cyl)))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}

We can use \code{scale\_shape\_manual} to choose each shape to be used. We set three ``open'' shapes that we will see later are very useful as they obey both \code{color} and \code{fill} \emph{aesthetics}.\label{chunk:filled:symbols}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg,} \hlkwc{shape} \hlstd{=} \hlkwd{factor}\hlstd{(cyl)))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{scale_shape_manual}\hlstd{(}\hlkwc{values} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{21}\hlstd{,} \hlnum{22}\hlstd{,} \hlnum{23}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}

It is also possible to use characters as shapes. The character is centered on the position of the observation. As the numbers used as symbols are self-explanatory, we suppress the default guide or key.\label{chunk:plot:point:char}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg,} \hlkwc{shape} \hlstd{=} \hlkwd{factor}\hlstd{(cyl)))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{(}\hlkwc{size} \hlstd{=} \hlnum{2.5}\hlstd{)} \hlopt{+}
  \hlkwd{scale_shape_manual}\hlstd{(}\hlkwc{values} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"4"}\hlstd{,} \hlstr{"6"}\hlstd{,} \hlstr{"8"}\hlstd{),} \hlkwc{guide} \hlstd{=} \hlnum{FALSE}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-scatter-12-1}

}


\end{knitrout}

\begin{infobox}
  One variable in the data can be mapped to more than one aesthetic, allowing redundant aesthetics. This may seem wasteful, but it is extremely useful as it allows one to produce figures that, even when produced in color, can still be read if reproduced as black-and-white images.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg,}
                          \hlkwc{shape} \hlstd{=} \hlkwd{factor}\hlstd{(cyl),}
                          \hlkwc{color} \hlstd{=} \hlkwd{factor}\hlstd{(cyl)))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}
\end{infobox}

\index{plots!scatter plot|)}
\index{plots!dot plot|(}Dot plots are similar to scatter plots but a factor is mapped to either the \code{x} or \code{y} \emph{aesthetic}. Dot plots are prone to have overlapping observations, and one way of making these points visible is to make them partly transparent by setting a constant value smaller than one for the \code{alpha} \emph{aesthetic}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{=} \hlkwd{factor}\hlstd{(cyl),} \hlkwc{y} \hlstd{= mpg))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{(}\hlkwc{alpha} \hlstd{=} \hlnum{1}\hlopt{/}\hlnum{3}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-scatter-12a-1}

}


\end{knitrout}

Instead of making the points semitransparent, we can randomly displace them to avoid overlaps. This is called \emph{jitter}, and can be added using \code{position\_jitter()} and the amount of jitter set with \code{width} as a fraction of the distance between adjacent factor levels in the plot.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{=} \hlkwd{factor}\hlstd{(cyl),} \hlkwc{y} \hlstd{= mpg))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{(}\hlkwc{position} \hlstd{=} \hlkwd{position_jitter}\hlstd{(}\hlkwc{width} \hlstd{=} \hlnum{0.05}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}

\index{plots!dot plot|)}
\index{plots!bubble plot|(}We can create a ``bubble'' plot by mapping the \code{size} \emph{aesthetic} to a continuous variable. In this case, one has to think what is visually more meaningful. Although the radius of the shape is frequently mapped, due to how human perception works, mapping a variable to the area of the shape is more useful by being perceptually closer to a linear mapping. For this example we add a new variable to the plot. The weight of the car in tons and map it to the area of the points.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg,}
                          \hlkwc{color} \hlstd{=} \hlkwd{factor}\hlstd{(cyl),}
                          \hlkwc{size} \hlstd{= wt))} \hlopt{+}
  \hlkwd{scale_size_area}\hlstd{()} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-scatter-16-1}

}


\end{knitrout}

\begin{playground}
If we use a radius-based scale the ``impression'' is different.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg,}
                          \hlkwc{color} \hlstd{=} \hlkwd{factor}\hlstd{(cyl),}
                          \hlkwc{size} \hlstd{= wt))} \hlopt{+}
  \hlkwd{scale_size}\hlstd{()} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}

Make the plot, look at it carefully. Check the numerical values of some of the weights, and assess if your perception of the plot matches the numbers behind it.
\end{playground}

\index{plots!bubble plot|)}

As a final example summarizing the use of \gggeom{geom\_point()}, we combine different \emph{aesthetics} and \emph{scales} in the same scatter plot.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg,}
                          \hlkwc{shape} \hlstd{=} \hlkwd{factor}\hlstd{(cyl),}
                          \hlkwc{fill} \hlstd{=} \hlkwd{factor}\hlstd{(cyl),}
                          \hlkwc{size} \hlstd{= wt))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{(}\hlkwc{alpha} \hlstd{=} \hlnum{0.33}\hlstd{,} \hlkwc{color} \hlstd{=} \hlstr{"black"}\hlstd{)} \hlopt{+}
  \hlkwd{scale_size_area}\hlstd{()} \hlopt{+}
  \hlkwd{scale_shape_manual}\hlstd{(}\hlkwc{values} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{21}\hlstd{,} \hlnum{22}\hlstd{,} \hlnum{23}\hlstd{))}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-scatter-18-1}

}


\end{knitrout}

\begin{playground}
Play with the code in the chunk above. Remove or change each of the mappings and the scale, display the new plot, and compare it to the one above. Continue playing with the code until you are sure you understand what graphical element in the plot is added or modified by each individual argument or ``word'' in the code statement.
\end{playground}
\index{grammar of graphics!point geometry|)}

It is common to draw error bars together with points representing means or medians of observations and \gggeom{geom\_pointrange()} achieves this task based on the values mapped to the \code{x}, \code{y}, \code{ymin} and \code{ymax}, using \code{y} for the position of the point and \code{ymin} and \code{ymax} for the positions of the ends of the line segment representing a range. Two other \emph{geometries}, \gggeom{geom\_range()} and  \gggeom{geom\_errorbar} draw only a segment or a segment with capped ends. They are frequently used together with \emph{statistics} when summaries are calculated on the fly, but can also be used directly when the data summaries are stored in a data frame passed as an argument to \code{data}.

\subsection{Rug}\label{sec:plot:rug}
\index{plots!rug marging|(}

Rarely, rug plots are used by themselves. Instead they are usually an addition to scatter plots. An example of the use of \gggeom{geom\_rug()} follows. They make it easier to see the distribution of observations
along the $x$- and $y$-axes.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,}
       \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg,} \hlkwc{color} \hlstd{=} \hlkwd{factor}\hlstd{(cyl)))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{geom_rug}\hlstd{()}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-rug-plot-01-1}

}


\end{knitrout}

\begin{warningbox}
  Rug plots are most useful when the local density of observations is not too high, otherwise rugs become too cluttered and the ``rug threads'' may overlap. When overlap is moderate, making the segments semitransparent by setting the \code{alpha} aesthetic to a constant value smaller than one, can make the variation in density easier to appreciate. When the number of observations is large, marginal density plots should be preferred.
\end{warningbox}
\index{plots!rug marging|)}

\subsection{Line and area}\label{sec:plot:line}

\index{grammar of graphics!various line and path geometries|(}\index{plots!line plot|(}
For line plots we use \gggeom{geom\_line()}. The \code{size} of a line is its thickness, and as we had \code{shape} for points, we have \code{linetype} for lines. In a line plot, observations in successive rows of the data frame, or the subset corresponding to a group, are joined by straight lines. We use a different data set included in \Rlang, \Rdata{Orange}, with data on the growth of five orange trees. See the help page for \code{Orange} for details.

\label{plot:fig:lines}
\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= Orange,}
       \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= age,} \hlkwc{y} \hlstd{= circumference,} \hlkwc{linetype} \hlstd{= Tree))} \hlopt{+}
  \hlkwd{geom_line}\hlstd{()}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-line-plot-01-1}

}


\end{knitrout}
\index{plots!line plot|)}

\index{plots!step plot|(}
Instead of drawing a line joining the successive observations, we may want to draw a disconnected straight-line segment for each observation or row in the data. In this case, we use \gggeom{geom\_segment()} which accepts \code{x}, \code{xend}, \code{y} and \code{yend} as mapped aesthetics. \gggeom{geom\_curve()} draws curved lines, and the curvature, control points, and angles can be controlled through additional \emph{aesthetics}. These two \emph{geometries} support arrow heads at their ends. Other \emph{geometries} useful for drawing lines or segments are \gggeom{geom\_path()}, which is similar to \gggeom{geom\_line()}, but instead of joining observations according to the values mapped to \code{x}, it joins them according to their row-order in \code{data}, and \gggeom{geom\_spoke()}, which is similar to \gggeom{geom\_segment()} but using a polar parametrization, based on \code{x}, \code{y} for origin, and \code{angle} and \code{radius} for the segment. Finally, \gggeom{geom\_step()} plots only vertical and horizontal lines to join the observations, creating a stepped line.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= Orange,}
       \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= age,} \hlkwc{y} \hlstd{= circumference,} \hlkwc{linetype} \hlstd{= Tree))} \hlopt{+}
  \hlkwd{geom_step}\hlstd{()}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-step-plot-01-1}

}


\end{knitrout}
\index{plots!step plot|)}

\begin{playground}
Using the following toy data, make three plots using \code{geom\_line()}, \code{geom\_path()}, and \code{geom\_step} to add a layer.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{toy.df} \hlkwb{<-} \hlkwd{data.frame}\hlstd{(}\hlkwc{x} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{1}\hlstd{,}\hlnum{3}\hlstd{,}\hlnum{2}\hlstd{,}\hlnum{4}\hlstd{),} \hlkwc{y} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{0}\hlstd{,}\hlnum{1}\hlstd{,}\hlnum{0}\hlstd{,}\hlnum{1}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}
\end{playground}

\index{plots!filled-area plot|(}
While \gggeom{geom\_line()} draws a line joining observations, \gggeom{geom\_area()} supports filling the area below the line according to the \code{fill} \emph{aesthetic}. In contrast \gggeom{geom\_ribbon} draws two lines based on the \code{x}, \code{ymin} and \code{ymax} \emph{aesthetics}, with the space between the lines filled according to the \code{fill} \emph{aesthetic}. Finally, \gggeom{geom\_polygom} is similar to \gggeom{geom\_path()} but connects the extreme observations forming a closed polygon that supports \code{fill}.

Much of what was described above for \gggeom{geom\_point} can be adapted to \gggeom{geom\_line}, \gggeom{geom\_ribbon},  \gggeom{geom\_area} and other \emph{geometries} described in this section. In some cases, it is useful to stack the areas---e.g.,  when the values represent parts of a bigger whole. In the next, contrived, example, we stack the growth of the different trees by using \code{position = "stack"} instead of the default \code{position = "identity"}. (Compare the $y$ axis of the figure below to that drawn using \code{geom\_line} on page \pageref{plot:fig:lines}.)

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= Orange,}
       \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= age,} \hlkwc{y} \hlstd{= circumference,} \hlkwc{fill} \hlstd{= Tree))} \hlopt{+}
  \hlkwd{geom_area}\hlstd{(}\hlkwc{position} \hlstd{=} \hlstr{"stack"}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-area-plot-01-1}

}


\end{knitrout}
\index{plots!filled-area plot|)}

\index{plots!reference lines|(}
Finally,\label{sec:plot:vhline} three \emph{geometries} for drawing lines across the whole plotting area: \gggeom{geom\_hline}, \gggeom{geom\_vline} and \gggeom{geom\_abline}. The first two draw horizontal and vertical lines, respectively, while the third one draws straight lines according to the \emph{aesthetics} \code{slope} and \code{intercept} determining the position. The lines drawn with these three geoms extend to the edge of the plotting area.

\gggeom{geom\_hline} and \gggeom{geom\_vline} require a single aesthetic, \code{yintercept} and \code{xintercept}, respectively. Different from other geoms, the data for these aesthetics can also be passed as constant numeric vectors. The reason for this is that these geoms are most frequently used to annotate plots rather than plotting observations. Let's assume that we want to highlight an event at the age of 1000 days.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= Orange,}
       \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= age,} \hlkwc{y} \hlstd{= circumference,} \hlkwc{fill} \hlstd{= Tree))} \hlopt{+}
  \hlkwd{geom_area}\hlstd{(}\hlkwc{position} \hlstd{=} \hlstr{"stack"}\hlstd{)} \hlopt{+}
  \hlkwd{geom_vline}\hlstd{(}\hlkwc{xintercept} \hlstd{=} \hlnum{1000}\hlstd{,} \hlkwc{color} \hlstd{=} \hlstr{"gray75"}\hlstd{)} \hlopt{+}
  \hlkwd{geom_vline}\hlstd{(}\hlkwc{xintercept} \hlstd{=} \hlnum{1000}\hlstd{,} \hlkwc{linetype} \hlstd{=} \hlstr{"dotted"}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-area-plot-02-1}

}


\end{knitrout}

\begin{playground}
  Change the order of the three layers in the example above. How did the figure change? What order is best? Would the same order be the best for a scatter plot? And would it be necessary to add two \code{geom\_vline()} layers?
\end{playground}
\index{plots!reference lines|)}
\index{grammar of graphics!various line and path geometries|)}

\subsection{Column}\label{sec:plot:col}
\index{grammar of graphics!column geometry|(}
\index{plots!column plot|(}

The \emph{geometry} \gggeom{geom\_col()} can be used to create \emph{column plots} where each bar represents an observation or case in the data.

\begin{warningbox}
\Rlang users not familiar yet with \ggplot are frequently surprised by the default behavior of \gggeom{geom\_bar()} as it uses \ggstat{stat\_count()} to produce a histogram, rather than plotting values as is (see section \ref{sec:plot:histogram} on page \pageref{sec:plot:histogram}). \gggeom{geom\_col()} is identical to \gggeom{geom\_bar()} but with \code{"identity"} as the default statistic.
\end{warningbox}

We create artificial data that we will reuse in multiple variations of the next figure.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{set.seed}\hlstd{(}\hlnum{654321}\hlstd{)}
\hlstd{my.col.data} \hlkwb{<-} \hlkwd{data.frame}\hlstd{(}\hlkwc{treatment} \hlstd{=} \hlkwd{factor}\hlstd{(}\hlkwd{rep}\hlstd{(}\hlkwd{c}\hlstd{(}\hlstr{"A"}\hlstd{,} \hlstr{"B"}\hlstd{,} \hlstr{"C"}\hlstd{),} \hlnum{2}\hlstd{)),}
                          \hlkwc{group} \hlstd{=} \hlkwd{factor}\hlstd{(}\hlkwd{rep}\hlstd{(}\hlkwd{c}\hlstd{(}\hlstr{"male"}\hlstd{,} \hlstr{"female"}\hlstd{),} \hlkwd{c}\hlstd{(}\hlnum{3}\hlstd{,} \hlnum{3}\hlstd{))),}
                          \hlkwc{measurement} \hlstd{=} \hlkwd{rnorm}\hlstd{(}\hlnum{6}\hlstd{)} \hlopt{+} \hlkwd{c}\hlstd{(}\hlnum{5.5}\hlstd{,} \hlnum{5}\hlstd{,} \hlnum{7}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}

First we plot data for females only, using defaults for all \emph{aesthetics} except $x$ and $y$ which we explicitly map to variables.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwd{subset}\hlstd{(my.col.data, group} \hlopt{==} \hlstr{"female"}\hlstd{),}
       \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= treatment,} \hlkwc{y} \hlstd{= measurement))} \hlopt{+}
   \hlkwd{geom_col}\hlstd{()}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-col-plot-02-1}

}


\end{knitrout}

We play with \emph{aesthetics} to produce a plot with a semi-formal style---e.g.,  suitable for a science popularization article or book. See section \ref{sec:plot:scales} and section \ref{sec:plot:themes} for information on scales and themes, respectively. We set \code{width = 0.5} to make the bars narrower. Setting \code{color = "white"} overrides the default color of the lines bordering the bars.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(my.col.data,} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= treatment,} \hlkwc{y} \hlstd{= measurement,} \hlkwc{fill} \hlstd{= group))} \hlopt{+}
     \hlkwd{geom_col}\hlstd{(}\hlkwc{color} \hlstd{=} \hlstr{"white"}\hlstd{,} \hlkwc{width} \hlstd{=} \hlnum{0.5}\hlstd{)} \hlopt{+}
     \hlkwd{scale_fill_grey}\hlstd{()} \hlopt{+} \hlkwd{theme_dark}\hlstd{()}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-col-plot-03-1}

}


\end{knitrout}

We next use a formal style, and in addition, put the bars side by side by setting \code{position = "dodge"} to override the default \code{position = "stack"}. Setting \code{color = NA} removes the lines bordering the bars.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(my.col.data,} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= treatment,} \hlkwc{y} \hlstd{= measurement,} \hlkwc{fill} \hlstd{= group))} \hlopt{+}
     \hlkwd{geom_col}\hlstd{(}\hlkwc{color} \hlstd{=} \hlnum{NA}\hlstd{,} \hlkwc{position} \hlstd{=} \hlstr{"dodge"}\hlstd{)} \hlopt{+}
     \hlkwd{scale_fill_grey}\hlstd{()} \hlopt{+} \hlkwd{theme_classic}\hlstd{()}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-col-plot-04-1}

}


\end{knitrout}

\begin{playground}
Change the argument to \code{position}, or let the default be active, until you understand its effect on the figure. What is the difference between \emph{positions} \code{"identity"}, \code{"dodge"} and \code{"stack"}?
\end{playground}

\begin{playground}
Use constants as arguments for \emph{aesthetics} or map variable \code{treatment} to one or more of the \emph{aesthetics} used by \gggeom{geom\_col()}, such as \code{color}, \code{fill}, \code{linetype}, \code{size}, \code{alpha} and \code{width}.
\end{playground}

\index{grammar of graphics!column geometry|)}
\index{plots!column plot|)}

\subsection{Tiles}\label{sec:tileplot}
\index{grammar of graphics!tile geometry|(}
\index{plots!tile plot|(}
We can draw square or rectangular tiles with \gggeom{geom\_tile()} producing tile plots or simple heat maps.

We here generate 100 random draws from the $F$ distribution with degrees of freedom $\nu_1 = 5, \nu_2 = 20$.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{set.seed}\hlstd{(}\hlnum{1234}\hlstd{)}
\hlstd{randomf.df} \hlkwb{<-} \hlkwd{data.frame}\hlstd{(}\hlkwc{F.value} \hlstd{=} \hlkwd{rf}\hlstd{(}\hlnum{100}\hlstd{,} \hlkwc{df1} \hlstd{=} \hlnum{5}\hlstd{,} \hlkwc{df2} \hlstd{=} \hlnum{20}\hlstd{),}
                         \hlkwc{x} \hlstd{=} \hlkwd{rep}\hlstd{(letters[}\hlnum{1}\hlopt{:}\hlnum{10}\hlstd{],} \hlnum{10}\hlstd{),}
                         \hlkwc{y} \hlstd{= LETTERS[}\hlkwd{rep}\hlstd{(}\hlnum{1}\hlopt{:}\hlnum{10}\hlstd{,} \hlkwd{rep}\hlstd{(}\hlnum{10}\hlstd{,} \hlnum{10}\hlstd{))])}
\end{alltt}
\end{kframe}
\end{knitrout}

\gggeom{geom\_tile()} requires aesthetics $x$ and $y$, with no defaults, and \code{width} and \code{height} with defaults that make all tiles of equal size filling the plotting area.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(randomf.df,} \hlkwd{aes}\hlstd{(x, y,} \hlkwc{fill} \hlstd{= F.value))} \hlopt{+}
  \hlkwd{geom_tile}\hlstd{()}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-tile-plot-02-1}

}


\end{knitrout}

We can set \code{color = "gray75"} and \code{size = 1} to make the tile borders more visible as in the example below, or use a contrasting color, to better delineate the borders of the tiles. What to use will depend on whether the individual tiles add meaningful information. In cases like when rows of tiles correspond to individual genes and columns to discrete treatments, the use of contrasting tile borders is preferable. In contrast, in the case when the tiles are an approximation to a continuous surface such as measurements on a regular spatial grid, it is best to suppress the tile borders.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(randomf.df,} \hlkwd{aes}\hlstd{(x, y,} \hlkwc{fill} \hlstd{= F.value))} \hlopt{+}
  \hlkwd{geom_tile}\hlstd{(}\hlkwc{color} \hlstd{=} \hlstr{"gray75"}\hlstd{,} \hlkwc{size} \hlstd{=} \hlnum{1.33}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-tile-plot-03-1}

}


\end{knitrout}

\begin{playground}
Play with the arguments passed to parameters \code{color} and \code{size} in the example above, considering what features of the data are most clearly perceived in each of the plots you create.
\end{playground}

Any continuous fill scale can be used to control the appearance. Here we show a tile plot using a gray gradient, with missing values in red.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}(randomf.df, \hlkwd{aes}(x, y, fill = F.value) +
  \hlkwd{geom_tile}(color = \hlstr{"white"}) +
  \hlkwd{scale_fill_gradient}(low = \hlstr{"gray15"}, high = \hlstr{"gray85"}, na.value = \hlstr{"red"})
\end{alltt}
\end{kframe}
\end{knitrout}

In contrast to \gggeom{geom\_tile()}, \gggeom{geom\_rect()} draws rectangular tiles based on the position of the corners, mapped to aesthetics \code{xmin}, \code{xmax}, \code{ymin} and \code{ymax}.

\index{plots!tile plot|)}
\index{grammar of graphics!tile geometry|)}

\subsection{Simple features (sf)}\label{sec:plot:sf}
\index{grammar of graphics!sf geometries|(}
\index{plots!maps and spatial plots|(}

\ggplot version 3.0.0 or later supports the plotting of shape data similar to the plotting in geographic information systems (GIS) through \gggeom{geom\_sf()} and its companions, \gggeom{geom\_sf\_text()}, \gggeom{geom\_sf\_label()}, and \ggstat{stat\_sf()}. This makes it possible to display data on maps, for example, using different fill values for different regions. Special \emph{coordinate} \code{coord\_sf()} can be used to select different projections for maps. The \emph{aesthetic} used is called \code{geometry} and contrary to all the other aesthetics we have seen until now, the values to be mapped are of class \code{sfc} containing \emph{simple features} data with multiple components. Manipulation of simple features data is supported by package \pkgname{sf}. This subject exceeds the scope of this book, so a single and very simple example follows.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{nc} \hlkwb{<-} \hlstd{sf}\hlopt{::}\hlkwd{st_read}\hlstd{(}\hlkwd{system.file}\hlstd{(}\hlstr{"shape/nc.shp"}\hlstd{,} \hlkwc{package} \hlstd{=} \hlstr{"sf"}\hlstd{),} \hlkwc{quiet} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\hlkwd{ggplot}\hlstd{(nc)} \hlopt{+}
  \hlkwd{geom_sf}\hlstd{(}\hlkwd{aes}\hlstd{(}\hlkwc{fill} \hlstd{= AREA),} \hlkwc{color} \hlstd{=} \hlstr{"gray90"}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-sf_plot-01-1}

}


\end{knitrout}
\index{grammar of graphics!sf geometries|)}
\index{plots!maps and spatial plots|)}

\subsection{Text}\label{sec:plot:text}
\index{grammar of graphics!text and label geometries|(}
\index{plots!text in|(}
\index{plots!maths in|(}
We can use \gggeom{geom\_text()} or \gggeom{geom\_label()} to add text labels to observations. For \gggeom{geom\_text()} and \gggeom{geom\_label()}, the aesthetic \code{label} provides the text to be plotted and the usual aesthetics \code{x} and \code{y}, the location of the labels. As one would expect, the \code{color} and \code{size} aesthetics can also be used for the text.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg,}
                          \hlkwc{color} \hlstd{=} \hlkwd{factor}\hlstd{(cyl),}
                          \hlkwc{size} \hlstd{= wt,}
                          \hlkwc{label} \hlstd{= cyl))} \hlopt{+}
  \hlkwd{scale_size}\hlstd{()} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{geom_text}\hlstd{(}\hlkwc{color} \hlstd{=} \hlstr{"darkblue"}\hlstd{,} \hlkwc{size} \hlstd{=} \hlnum{3}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-text-plot-01-1}

}


\end{knitrout}

In addition, \code{angle} and \code{vjust} and \code{hjust} can be used to rotate the text and adjust its position. The default value of 0.5 for both \code{hjust} and \code{vjust} sets the center of the text at the supplied \code{x} and \code{y} coordinates. ``Vertical'' and ``horizontal'' for justification refer to the text, not the plot. This is important when \code{angle} is different from zero. Values larger than 0.5 shift the label left or down, and values smaller than 0.5, right or up with respect to its \code{x} and \code{y} coordinates. A value of 1 or 0 sets the text so that its edge is at the supplied coordinate. Values outside the range $0\ldots 1$ shift the text even farther away, however, still using units based on the length or height of the text label. Recent versions of \pkgname{ggplot2} make possible justification using character constants for alignment: \code{"left"}, \code{"middle"}, \code{"right"}, \code{"bottom"}, \code{"center"} and \code{"top"}, and two special alignments, \code{"inward"} and \code{"outward"}, that automatically vary based on the position in the plotting area.

In the case of \gggeom{geom\_label()} the text is enclosed in a box, which obeys the \code{fill} \emph{aesthetic} and takes additional parameters (described starting at page \pageref{start:plot:label}) allowing control of the shape and size of the box. However, \gggeom{geom\_label()} does not support rotation with the \code{angle} aesthetic.

\begin{warningbox}
You\index{plots!fonts} should be aware that \Rlang and \ggplot support the use of UNICODE\index{UNICODE}, such as UTF8\index{UTF8} character encodings in strings. If your editor or IDE supports their use, then you can type Greek letters and simple maths symbols directly, and they \emph{may} show correctly in labels if a suitable font is loaded and an extended encoding like UTF8 is in use by the operating system. Even if UTF8 is in use, text is not fully portable unless the same font is available\index{portability}, as even if the character positions are standardized for many languages, most UNICODE fonts support at most a small number of languages. In principle one can use this mechanism to have labels both using other alphabets and languages like Chinese with their numerous symbols mixed in the same figure. Furthermore, the support for fonts and consequently character sets in \Rlang is output-device dependent. The font encoding used by \Rlang by default depends on the default locale settings of the operating system, which can also lead to garbage printed to the console or wrong characters being plotted running the same code on a different computer from the one where a script was created. Not all is lost, though, as \Rlang can be coerced to use system fonts and Google fonts with functions provided by packages \pkgname{showtext} and \pkgname{extrafont}. Encoding-related problems, especially in MS-Windows, are common.
\end{warningbox}

In the remaining examples, with output not shown, we use \gggeom{geom\_text} or \gggeom{geom\_label} together with \gggeom{geom\_point} as this is how they may be used to label observations.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.data} \hlkwb{<-}
  \hlkwd{data.frame}\hlstd{(}\hlkwc{x} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{5}\hlstd{,}
             \hlkwc{y} \hlstd{=} \hlkwd{rep}\hlstd{(}\hlnum{2}\hlstd{,} \hlnum{5}\hlstd{),}
             \hlkwc{label} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"a"}\hlstd{,} \hlstr{"b"}\hlstd{,} \hlstr{"c"}\hlstd{,} \hlstr{"d"}\hlstd{,} \hlstr{"e"}\hlstd{))}

\hlkwd{ggplot}\hlstd{(my.data,} \hlkwd{aes}\hlstd{(x, y,} \hlkwc{label} \hlstd{= label))} \hlopt{+}
  \hlkwd{geom_text}\hlstd{(}\hlkwc{angle} \hlstd{=} \hlnum{45}\hlstd{,} \hlkwc{hjust} \hlstd{=} \hlnum{1.5}\hlstd{,} \hlkwc{size} \hlstd{=} \hlnum{8}\hlstd{)} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{playground}
Modify the example above to use \gggeom{geom\_label()} instead of \gggeom{geom\_text()} using, in addition, the \code{fill} aesthetic.
\end{playground}

In the next example we select a different font family, using the same characters in the Roman alphabet. The names \code{"sans"} (the default), \code{"serif"} and \code{"mono"} are recognized by all graphics devices on all operating systems. Additional fonts are available for specific graphic devices, such as the 35 ``PDF'' fonts by the \code{pdf()} device. In this case, their names can be queried with \code{names(pdfFonts())}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(my.data,} \hlkwd{aes}\hlstd{(x, y,} \hlkwc{label} \hlstd{= label))} \hlopt{+}
  \hlkwd{geom_text}\hlstd{(}\hlkwc{angle} \hlstd{=} \hlnum{45}\hlstd{,} \hlkwc{hjust} \hlstd{=} \hlnum{1.5}\hlstd{,} \hlkwc{size} \hlstd{=} \hlnum{8}\hlstd{,} \hlkwc{family} \hlstd{=} \hlstr{"serif"}\hlstd{)} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{playground}
In the examples above the character strings were all of the same length, containing a single character. Redo the plots above with longer character strings of various lengths mapped to the \code{label} \emph{aesthetic}. Do also play with justification of these labels.
\end{playground}

Plotting (mathematical) expressions involves mapping to the \code{label} aesthetic character strings that can be parsed as expressions, and setting \code{parse = TRUE} (see section \ref{sec:plot:plotmath} on page \pageref{sec:plot:plotmath}). Here, we build the character strings using \Rfunction{paste()} but, of course, they could also have been entered one by one. This use of \Rfunction{paste()} provides an example of recycling of shorter vectors (see section \ref{sec:vectors} on page \pageref{sec:vectors}).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.data} \hlkwb{<-}
  \hlkwd{data.frame}\hlstd{(}\hlkwc{x} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{5}\hlstd{,} \hlkwc{y} \hlstd{=} \hlkwd{rep}\hlstd{(}\hlnum{2}\hlstd{,} \hlnum{5}\hlstd{),} \hlkwc{label} \hlstd{=} \hlkwd{paste}\hlstd{(}\hlstr{"alpha["}\hlstd{,} \hlnum{1}\hlopt{:}\hlnum{5}\hlstd{,} \hlstr{"]"}\hlstd{,} \hlkwc{sep} \hlstd{=} \hlstr{""}\hlstd{))}
\hlstd{my.data}\hlopt{$}\hlstd{label}
\end{alltt}
\begin{verbatim}
## [1] "alpha[1]" "alpha[2]" "alpha[3]" "alpha[4]" "alpha[5]"
\end{verbatim}
\end{kframe}
\end{knitrout}

Text and labels do not automatically expand the plotting area past their anchoring coordinates. In the example above, we need to use \code{expand\_limits()} to ensure that the text is not clipped at the edge of the plotting area.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(my.data,} \hlkwd{aes}\hlstd{(x, y,} \hlkwc{label} \hlstd{= label))} \hlopt{+}
  \hlkwd{geom_text}\hlstd{(}\hlkwc{hjust} \hlstd{=} \hlopt{-}\hlnum{0.2}\hlstd{,} \hlkwc{parse} \hlstd{=} \hlnum{TRUE}\hlstd{,} \hlkwc{size} \hlstd{=} \hlnum{6}\hlstd{)} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{expand_limits}\hlstd{(}\hlkwc{x} \hlstd{=} \hlnum{5.2}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-text-plot-06-1}

}


\end{knitrout}

In the example above, we mapped to label the text to be parsed. It is also possible, and usually preferable, to build suitable labels on the fly within \code{aes()} when setting the mapping for \code{label}. Here we use \gggeom{geom\_text()} with strings to be parsed into expressions created on the fly within the call to \Rfunction{aes()}. The same approach can be used for regular character strings not requiring parsing.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(my.data,} \hlkwd{aes}\hlstd{(x, y,} \hlkwc{label} \hlstd{=} \hlkwd{paste}\hlstd{(}\hlstr{"alpha["}\hlstd{, x,} \hlstr{"]"}\hlstd{,} \hlkwc{sep} \hlstd{=} \hlstr{""}\hlstd{)))} \hlopt{+}
  \hlkwd{geom_text}\hlstd{(}\hlkwc{hjust} \hlstd{=} \hlopt{-}\hlnum{0.2}\hlstd{,} \hlkwc{parse} \hlstd{=} \hlnum{TRUE}\hlstd{,} \hlkwc{size} \hlstd{=} \hlnum{6}\hlstd{)} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}

As \gggeom{geom\_label()} obeys the same parameters as \gggeom{geom\_text()} except for \code{angle}, we briefly describe below only the additional parameters compared to \gggeom{geom\_text()}. We may want to alter the default width of the border line or the color used to \code{fill} the rectangle, or to change the ``roundness'' of the corners. To suppress the border line, use \code{label.size = 0}. Corner roundness is controlled by parameter \code{label.r} and the size of the margin around the text by \code{label.padding}.

\label{start:plot:label}
\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.data} \hlkwb{<-}
  \hlkwd{data.frame}\hlstd{(}\hlkwc{x} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{5}\hlstd{,} \hlkwc{y} \hlstd{=} \hlkwd{rep}\hlstd{(}\hlnum{2}\hlstd{,} \hlnum{5}\hlstd{),}
             \hlkwc{label} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"one"}\hlstd{,} \hlstr{"two"}\hlstd{,} \hlstr{"three"}\hlstd{,} \hlstr{"four"}\hlstd{,} \hlstr{"five"}\hlstd{))}

\hlkwd{ggplot}\hlstd{(my.data,} \hlkwd{aes}\hlstd{(x, y,} \hlkwc{label} \hlstd{= label))} \hlopt{+}
  \hlkwd{geom_label}\hlstd{(}\hlkwc{hjust} \hlstd{=} \hlopt{-}\hlnum{0.2}\hlstd{,} \hlkwc{size} \hlstd{=} \hlnum{6}\hlstd{,}
             \hlkwc{label.size} \hlstd{=} \hlnum{0L}\hlstd{,}
             \hlkwc{label.r} \hlstd{=} \hlkwd{unit}\hlstd{(}\hlnum{0}\hlstd{,} \hlstr{"lines"}\hlstd{),}
             \hlkwc{label.padding} \hlstd{=} \hlkwd{unit}\hlstd{(}\hlnum{0.15}\hlstd{,} \hlstr{"lines"}\hlstd{),}
             \hlkwc{fill} \hlstd{=} \hlstr{"yellow"}\hlstd{,} \hlkwc{alpha} \hlstd{=} \hlnum{0.5}\hlstd{)} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{expand_limits}\hlstd{(}\hlkwc{x} \hlstd{=} \hlnum{5.6}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-label-plot-01-1}

}


\end{knitrout}

\begin{playground}
Play with the arguments to the different parameters and with the \emph{aesthetics} to get an idea of what can be done with them. For example, use thicker border lines and increase the padding so that a visually well-balanced margin is retained. You may also try mapping the \code{fill} and \code{color} \emph{aesthetics} to factors in the data.
\end{playground}

If\index{grammar of graphics!text and label geometries!repulsive} the parameter \code{check\_overlap} of \gggeom{geom\_text()} is set to \code{TRUE}, text overlap will be avoided by suppressing the text that would otherwise overlap other text.  \emph{Repulsive} versions of \gggeom{geom\_text()} and \gggeom{geom\_label()}, \gggeom{geom\_text\_repel()} and \gggeom{geom\_label\_repel()},  are available in package \pkgname{ggrepel}. These \emph{geometries} avoid overlaps by automatically repositioning the text or labels. Please read the package documentation for details of how to control the repulsion strength and direction, and the properties of the segments linking the labels to the position of their data coordinates. Nearly all aesthetics supported by \code{geom\_text()} and \code{geom\_label()} are supported by the repulsive versions. However, given that a segment connects the label or text to its anchor point, several properties of these segments can also be controlled with aesthetics or arguments.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,}
       \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg,} \hlkwc{color} \hlstd{=} \hlkwd{factor}\hlstd{(cyl),} \hlkwc{size} \hlstd{= wt,} \hlkwc{label} \hlstd{= cyl))} \hlopt{+}
  \hlkwd{scale_size}\hlstd{()} \hlopt{+}
  \hlkwd{geom_point}\hlstd{(}\hlkwc{alpha} \hlstd{=} \hlnum{1}\hlopt{/}\hlnum{3}\hlstd{)} \hlopt{+}
  \hlkwd{geom_text_repel}\hlstd{(}\hlkwc{color} \hlstd{=} \hlstr{"black"}\hlstd{,} \hlkwc{size} \hlstd{=} \hlnum{3}\hlstd{,}
                  \hlkwc{min.segment.length} \hlstd{=} \hlnum{0.2}\hlstd{,} \hlkwc{point.padding} \hlstd{=} \hlnum{0.1}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-repel-plot-01-1}

}


\end{knitrout}
\index{plots!maths in|)}
\index{plots!text in|)}
\index{grammar of graphics!text and label geometries|)}

\subsection{Plot insets}\label{sec:plot:insets}
\index{grammar of graphics!inset-related geometries|(}
\index{plots!insets|(}

The support for insets in \pkgname{ggplot2} is confined to \code{annotation\_custom()} which was designed to be used for static annotations expected to be the same in each panel of a plot (the use of annotations is described in section \ref{sec:plot:annotations}). Package \pkgname{ggpmisc} provides geoms that mimic \code{geom\_text()} in relation to the \emph{aesthetics} used, but that similarly to \code{geom\_sf()}, expect that the column in \code{data} mapped to the \code{label} aesthetics are lists of objects containing multiple pieces of information, rather than atomic vectors. Similar to \code{geom\_sf()} these geoms do not inherit the plot's default mappings to aesthetics. Three geometries are currently available: \gggeom{geom\_table()}, \gggeom{geom\_plot()} and \gggeom{geom\_grob()}.

\begin{warningbox}
Given that  \gggeom{geom\_table()}, \gggeom{geom\_plot()} and \gggeom{geom\_grob()} will rarely use a mapping inherited from the whole plot, by default they do not inherit it. Either the mapping should be supplied as an argument to these functions or their parameter \code{inherit.aes} explicitly set to \code{TRUE}.
\end{warningbox}

\index{plots!inset tables|(}
The plotting of tables by mapping a list of data frames to the \code{label} \emph{aesthetic} is done with \gggeom{geom\_table}. Positioning, justification, and angle work as for \gggeom{geom\_text} and are applied to the whole table. Only \code{tibble} objects (see documentation of package \pkgname{tibble}) can contain, as variables, lists of data frames, so this \emph{geometry} requires the use of \code{tibble} objects to store the data. The table(s) are created as 'grid' \code{grob} objects, collected in a tree and added to the \code{ggplot} object as a new layer.

We first generate a \code{tibble} containing summaries from the data, formatted as character strings, wrap this tibble in a list, and store this list as a column in another \code{tibble}. To accomplish this, we use functions from the \pkgname{tidyverse} described in chapter \ref{chap:R:data}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{mtcars} \hlopt{%.>%}
  \hlkwd{group_by}\hlstd{(., cyl)} \hlopt{%.>%}
  \hlkwd{summarize}\hlstd{(.,}
            \hlstr{"mean wt"} \hlstd{=} \hlkwd{format}\hlstd{(}\hlkwd{mean}\hlstd{(wt),} \hlkwc{digits} \hlstd{=} \hlnum{2}\hlstd{),}
            \hlstr{"mean disp"} \hlstd{=} \hlkwd{format}\hlstd{(}\hlkwd{mean}\hlstd{(disp),} \hlkwc{digits} \hlstd{=} \hlnum{0}\hlstd{),}
            \hlstr{"mean mpg"} \hlstd{=} \hlkwd{format}\hlstd{(}\hlkwd{mean}\hlstd{(mpg),} \hlkwc{digits} \hlstd{=} \hlnum{0}\hlstd{))} \hlkwb{->} \hlstd{my.table}
\hlstd{table.tb} \hlkwb{<-} \hlkwd{tibble}\hlstd{(}\hlkwc{x} \hlstd{=} \hlnum{500}\hlstd{,} \hlkwc{y} \hlstd{=} \hlnum{35}\hlstd{,} \hlkwc{table.inset} \hlstd{=} \hlkwd{list}\hlstd{(my.table))}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg,}
                          \hlkwc{color} \hlstd{=} \hlkwd{factor}\hlstd{(cyl),}
                          \hlkwc{size} \hlstd{= wt,}
                          \hlkwc{label} \hlstd{= cyl))} \hlopt{+}
  \hlkwd{scale_size}\hlstd{()} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{geom_table}\hlstd{(}\hlkwc{data} \hlstd{= table.tb,}
             \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= x,} \hlkwc{y} \hlstd{= y,} \hlkwc{label} \hlstd{= table.inset),}
             \hlkwc{color} \hlstd{=} \hlstr{"black"}\hlstd{,} \hlkwc{size} \hlstd{=} \hlnum{3}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-table-plot-02-1}

}


\end{knitrout}

The \code{color} and \code{size} aesthetics control the text in the table(s) as a whole.
It is also possible to rotate the table(s) using \code{angle}. As with text labels, justification is interpreted in relation to table-text orientation. We set the \code{y = 0} in \code{data.tb} and then use \code{vjust = 1} to position the top of the table at this coordinate value.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg,} \hlkwc{color} \hlstd{=} \hlkwd{factor}\hlstd{(cyl)))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{geom_table}\hlstd{(}\hlkwc{data} \hlstd{= table.tb,}
             \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= x,} \hlkwc{y} \hlstd{= y,} \hlkwc{label} \hlstd{= table.inset),}
             \hlkwc{color} \hlstd{=} \hlstr{"blue"}\hlstd{,} \hlkwc{size} \hlstd{=} \hlnum{3}\hlstd{,}
             \hlkwc{hjust} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{vjust} \hlstd{=} \hlnum{0}\hlstd{,} \hlkwc{angle} \hlstd{=} \hlnum{90}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

Parsed text, using R's \emph{plotmath} syntax is supported in the table, with fallback to plain text in case of parsing errors, on a cell-by-cell basis. We end this section with a simple example, which even if not very useful, demonstrates that \gggeom{geom\_table()} behaves like a ``normal'' ggplot \emph{geometry} and that a table can be the only layer in a ggplot if desired. The addition of multiple tables with a single call to \gggeom{geom\_table()} by passing a \code{tibble} with multiple rows as an argument for \code{data} is also possible.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{tb.pm} \hlkwb{<-} \hlkwd{tibble}\hlstd{(}\hlstr{'x^0'} \hlstd{=} \hlnum{1}\hlstd{,}
                \hlstr{'x^1'} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{5}\hlstd{,}
                \hlstr{'x^2'} \hlstd{= (}\hlnum{1}\hlopt{:}\hlnum{5}\hlstd{)}\hlopt{^}\hlnum{2}\hlstd{,}
                \hlstr{'x^3'} \hlstd{= (}\hlnum{1}\hlopt{:}\hlnum{5}\hlstd{)}\hlopt{^}\hlnum{3}\hlstd{)}
\hlstd{data.tb} \hlkwb{<-} \hlkwd{tibble}\hlstd{(}\hlkwc{x} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{y} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{table.inset} \hlstd{=} \hlkwd{list}\hlstd{(tb.pm))}
\hlkwd{ggplot}\hlstd{(data.tb,} \hlkwc{mapping} \hlstd{=} \hlkwd{aes}\hlstd{(x, y,} \hlkwc{label} \hlstd{= table.inset))} \hlopt{+}
  \hlkwd{geom_table}\hlstd{(}\hlkwc{inherit.aes} \hlstd{=} \hlnum{TRUE}\hlstd{,} \hlkwc{size} \hlstd{=} \hlnum{7}\hlstd{,} \hlkwc{parse} \hlstd{=} \hlnum{TRUE}\hlstd{)} \hlopt{+}
  \hlkwd{theme_void}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{explainbox}
  The \emph{geometry} \gggeom{geom\_table()} uses functions from package \pkgname{gridExtra} to build a graphical object for the table. The use of table themes was not yet supported by this geometry at the time of writing.
\end{explainbox}
\index{plots!inset tables|)}

\index{plots!inset plots|(}
Geometry \gggeom{geom\_plot()} works much like \code{geom\_table()}, but instead of expecting a list of data frames or tibbles to be mapped to the \code{label} aesthetics, it expects a list of ggplots (objects of class \code{gg}). This allows adding as an inset to a ggplot, another ggplot. In the times when plots were hand drafted with India ink on paper, the use of inset plots was more frequent than nowadays. Inset plots can be very useful for zooming-in on parts of a main plot where observations are crowded and for displaying summaries based on the observations shown in the main plot. The inset plots are nested in viewports which control the dimensions of the inset plot, and aesthetics \code{vp.height} and \code{vp.width} control their sizes---with defaults of 1/3 of the height and width of the plotting area of the main plot. Themes can be applied separately to the main and inset plots.

In the first example of inset plots, we include one of the summaries shown above as an inset table. We first create a tibble containing the plot to be inset.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{mtcars} \hlopt{%.>%}
  \hlkwd{group_by}\hlstd{(., cyl)} \hlopt{%.>%}
  \hlkwd{summarize}\hlstd{(.,} \hlkwc{mean.mpg} \hlstd{=} \hlkwd{mean}\hlstd{(mpg))} \hlopt{%.>%}
  \hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= .,}
         \hlkwd{aes}\hlstd{(}\hlkwd{factor}\hlstd{(cyl), mean.mpg,} \hlkwc{fill} \hlstd{=} \hlkwd{factor}\hlstd{(cyl)))} \hlopt{+}
  \hlkwd{scale_fill_discrete}\hlstd{(}\hlkwc{guide} \hlstd{=} \hlnum{FALSE}\hlstd{)} \hlopt{+}
  \hlkwd{scale_y_continuous}\hlstd{(}\hlkwc{name} \hlstd{=} \hlkwa{NULL}\hlstd{)} \hlopt{+}
    \hlkwd{geom_col}\hlstd{()} \hlopt{+}
    \hlkwd{theme_bw}\hlstd{(}\hlnum{8}\hlstd{)} \hlkwb{->} \hlstd{my.plot}
\hlstd{plot.tb} \hlkwb{<-} \hlkwd{tibble}\hlstd{(}\hlkwc{x} \hlstd{=} \hlnum{500}\hlstd{,} \hlkwc{y} \hlstd{=} \hlnum{35}\hlstd{,} \hlkwc{plot.inset} \hlstd{=} \hlkwd{list}\hlstd{(my.plot))}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg,}
                          \hlkwc{color} \hlstd{=} \hlkwd{factor}\hlstd{(cyl)))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{geom_plot}\hlstd{(}\hlkwc{data} \hlstd{= plot.tb,}
            \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= x,} \hlkwc{y} \hlstd{= y,} \hlkwc{label} \hlstd{= plot.inset),}
            \hlkwc{vp.width} \hlstd{=} \hlnum{1}\hlopt{/}\hlnum{2}\hlstd{,}
            \hlkwc{hjust} \hlstd{=} \hlstr{"inward"}\hlstd{,} \hlkwc{vjust} \hlstd{=} \hlstr{"inward"}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-plot-plot-02-1}

}


\end{knitrout}

In the second example we add the zoomed version of the same plot as an inset. 1) Manually set limits to the coordinates to zoom into a region of the main plot, 2) set the \emph{theme} of the inset, 3) remove axis labels as they are the same as in the main plot, 4) and 5) highlight the zoomed-in region in the main plot. This fairly complex example shows how a new extension to \pkgname{ggplot2} can integrate well into the grammar of graphics paradigm. In this example, to show an alternative approach, instead of collecting all the data into a data frame, we map constant values directly to the various aesthetics within \Rfunction{annotate()} (see section \ref{sec:plot:annotations} on page \pageref{sec:plot:annotations}).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{p.main} \hlkwb{<-} \hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg,} \hlkwc{color} \hlstd{=} \hlkwd{factor}\hlstd{(cyl)))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()}
\hlstd{p.inset} \hlkwb{<-} \hlstd{p.main} \hlopt{+}
  \hlkwd{coord_cartesian}\hlstd{(}\hlkwc{xlim} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{270}\hlstd{,} \hlnum{330}\hlstd{),} \hlkwc{ylim} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{14}\hlstd{,} \hlnum{19}\hlstd{))} \hlopt{+}
  \hlkwd{labs}\hlstd{(}\hlkwc{x} \hlstd{=} \hlkwa{NULL}\hlstd{,} \hlkwc{y} \hlstd{=} \hlkwa{NULL}\hlstd{)} \hlopt{+}
  \hlkwd{scale_color_discrete}\hlstd{(}\hlkwc{guide} \hlstd{=} \hlnum{FALSE}\hlstd{)} \hlopt{+}
  \hlkwd{theme_bw}\hlstd{(}\hlnum{8}\hlstd{)} \hlopt{+} \hlkwd{theme}\hlstd{(}\hlkwc{aspect.ratio} \hlstd{=} \hlnum{1}\hlstd{)}
\hlstd{p.main} \hlopt{+}
  \hlkwd{geom_plot}\hlstd{(}\hlkwc{x} \hlstd{=} \hlnum{480}\hlstd{,} \hlkwc{y} \hlstd{=} \hlnum{34}\hlstd{,} \hlkwc{label} \hlstd{=} \hlkwd{list}\hlstd{(p.inset),} \hlkwc{vp.height} \hlstd{=} \hlnum{1}\hlopt{/}\hlnum{2}\hlstd{,}
            \hlkwc{hjust} \hlstd{=} \hlstr{"inward"}\hlstd{,} \hlkwc{vjust} \hlstd{=} \hlstr{"inward"}\hlstd{)} \hlopt{+}
  \hlkwd{annotate}\hlstd{(}\hlkwc{geom} \hlstd{=} \hlstr{"rect"}\hlstd{,} \hlkwc{fill} \hlstd{=} \hlnum{NA}\hlstd{,} \hlkwc{color} \hlstd{=} \hlstr{"black"}\hlstd{,}
           \hlkwc{xmin} \hlstd{=} \hlnum{270}\hlstd{,} \hlkwc{xmax} \hlstd{=} \hlnum{330}\hlstd{,} \hlkwc{ymin} \hlstd{=} \hlnum{14}\hlstd{,} \hlkwc{ymax} \hlstd{=} \hlnum{19}\hlstd{,}
           \hlkwc{linetype} \hlstd{=} \hlstr{"dotted"}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-plot-plot-03-1}

}


\end{knitrout}
\index{plots!inset plots|)}
\index{plots!inset graphical objects|(}
Geometry \gggeom{geom\_grob()} works much like \code{geom\_table()} and \code{geom\_plot()} but expects a list of \pkgname{grid} graphical objects, called \code{grob} for short. This adds generality at the expense of having to separately create the grobs either using \pkgname{grid} or by converting other objects into grobs. This geometry is as flexible as \gggeom{annotation\_custom()} with respect to the grobs, but behaves as a \emph{geometry}. We show an example that adds two bitmaps to the plot. The bitmaps are read from PNG files, converted into grobs, and added to the plot as a new layer. The PNG bitmaps used have a transparent background.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{file1.name} \hlkwb{<-}
  \hlkwd{system.file}\hlstd{(}\hlstr{"extdata"}\hlstd{,} \hlstr{"Isoquercitin.png"}\hlstd{,} \hlkwc{package} \hlstd{=} \hlstr{"ggpmisc"}\hlstd{,} \hlkwc{mustWork} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\hlstd{Isoquercitin} \hlkwb{<-} \hlstd{magick}\hlopt{::}\hlkwd{image_read}\hlstd{(file1.name)}
\hlstd{file2.name} \hlkwb{<-}
  \hlkwd{system.file}\hlstd{(}\hlstr{"extdata"}\hlstd{,} \hlstr{"Robinin.png"}\hlstd{,} \hlkwc{package} \hlstd{=} \hlstr{"ggpmisc"}\hlstd{,} \hlkwc{mustWork} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\hlstd{Robinin} \hlkwb{<-} \hlstd{magick}\hlopt{::}\hlkwd{image_read}\hlstd{(file2.name)}
\hlstd{grob.tb} \hlkwb{<-} \hlkwd{tibble}\hlstd{(}\hlkwc{x} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{0}\hlstd{,} \hlnum{100}\hlstd{),} \hlkwc{y} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{10}\hlstd{,} \hlnum{20}\hlstd{),} \hlkwc{height} \hlstd{=} \hlnum{1}\hlopt{/}\hlnum{3}\hlstd{,} \hlkwc{width} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{1}\hlopt{/}\hlnum{2}\hlstd{),}
                  \hlkwc{grobs} \hlstd{=} \hlkwd{list}\hlstd{(grid}\hlopt{::}\hlkwd{rasterGrob}\hlstd{(}\hlkwc{image} \hlstd{= Isoquercitin),}
                               \hlstd{grid}\hlopt{::}\hlkwd{rasterGrob}\hlstd{(}\hlkwc{image} \hlstd{= Robinin)))}

\hlkwd{ggplot}\hlstd{()} \hlopt{+}
  \hlkwd{geom_grob}\hlstd{(}\hlkwc{data} \hlstd{= grob.tb,}
            \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= x,} \hlkwc{y} \hlstd{= y,} \hlkwc{label} \hlstd{= grobs,} \hlkwc{vp.height} \hlstd{= height,} \hlkwc{vp.width} \hlstd{= width),}
                \hlkwc{hjust} \hlstd{=} \hlstr{"inward"}\hlstd{,} \hlkwc{vjust} \hlstd{=} \hlstr{"inward"}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-plot-grob-01-1}

}


\end{knitrout}
\index{plots!inset graphical objects|)}

\begin{explainbox}
Grid graphics\index{grid graphics coordinate systems} provide the low-level functions that both \pkgname{ggplot2} and \pkgname{lattice} use under the hood. Grid supports different types of units for expressing the coordinates of positions within the plotting area. All examples outside this text box use \code{"native"} data coordinates, however, coordinates can be also given in physical units like \code{"mm"}. More useful when working with scalable plots is to use "npc" \emph{normalized parent coordinates}, which are expressed as numbers in the range 0 to 1, relative to the dimensions of the sides of the current \emph{viewport}, with origin at the lower left corner.

Package \pkgname{ggplot2} interprets $x$ and $y$ coordinates in \code{"native"} data coordinates, and trickery seems to be needed to get around this limitation. A rather general solution is provided by package \pkgname{ggpmisc} through \emph{aesthetics} \code{npcx} and \code{npcy} and \emph{geometries} that support them. At the time of writing, \gggeom{geom\_text\_npc()}, \gggeom{geom\_label\_npc()}, \gggeom{geom\_table\_npc()}, \gggeom{geom\_plot\_npc()} and \gggeom{geom\_grob\_npc()}. These \emph{geometries} are useful for annotating plots and adding insets at positions relative to the plotting area that remain always consistent across different plots, or across panels when using facets with free axis limits. Being geometries they provide freedom in the elements added to different panels and their positions.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg,} \hlkwc{color} \hlstd{=} \hlkwd{factor}\hlstd{(cyl)))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{geom_label_npc}\hlstd{(}\hlkwc{npcx} \hlstd{=} \hlnum{0.5}\hlstd{,} \hlkwc{npcy} \hlstd{=} \hlnum{0.9}\hlstd{,} \hlkwc{label} \hlstd{=} \hlstr{"a label"}\hlstd{,} \hlkwc{color} \hlstd{=} \hlstr{"black"}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-plot-npc-eb-01-1}

}


\end{knitrout}

\end{explainbox}

\index{grammar of graphics!inset-related geometries|)}
\index{plots!insets|)}
\index{grammar of graphics!geometries|)}

\section{Statistics}\label{sec:plot:statistics}
\index{grammar of graphics!statistics|(}

Before learning about \ggplot \emph{statistics}, it is important to have clear how the mapping of factors to \emph{aesthetics} works. When a factor, for example, is mapped to \code{color}, it creates a new grouping, with the observations matching a given level of the factor, corresponding to a group. Most \emph{statistics} operate on the data for each of these groups separately, returning a summary for each group, for example, the mean of the observations in a group.

\subsection{Functions}\label{sec:plot:function}
\index{grammar of graphics!function statistic|(}
\index{plots!plots of functions|(}
In addition to plotting data from a data frame with variables to map to $x$ and $y$ \emph{aesthetics}, it is possible to have only a variable mapped to $x$ and use \ggstat{stat\_function()} to compute the values to be mapped to $y$ using an \Rlang function. This avoids the need to generate data beforehand as even the number of data points to be generated can be set in \code{geom\_function()}. Any \Rlang function, user defined or not, can be used as long as it is vectorized, with the length of the returned vector equal to the length of the vector passed as first argument to it. The variable mapped to \code{x} determines the range, and the argument to parameter \code{n} of \code{geom\_function()} the length of the generated vector that is passed as first argument to \code{fun} when it is called to generate the values to napped to \code{y}. These are the $x$ and $y$ values passed to the \emph{geometry}.

We start with the Normal distribution function. We rely on the defaults \code{n = 101} and \code{geom = "path"}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwd{data.frame}\hlstd{(}\hlkwc{x} \hlstd{=} \hlopt{-}\hlnum{3}\hlopt{:}\hlnum{3}\hlstd{),} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= x))} \hlopt{+}
  \hlkwd{stat_function}\hlstd{(}\hlkwc{fun} \hlstd{= dnorm)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-function-plot-01-1}

}


\end{knitrout}

Using a list we can even pass by name additional arguments to use when the function is called.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwd{data.frame}\hlstd{(}\hlkwc{x} \hlstd{=} \hlopt{-}\hlnum{3}\hlopt{:}\hlnum{3}\hlstd{),} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= x))} \hlopt{+}
  \hlkwd{stat_function}\hlstd{(}\hlkwc{fun} \hlstd{= dnorm,} \hlkwc{args} \hlstd{=} \hlkwd{list}\hlstd{(}\hlkwc{mean} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{sd} \hlstd{=} \hlnum{.5}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{playground}
Edit the code above so as to plot in the same figure three curves, either for three different values for \code{mean} or for three different values for \code{sd}.
\end{playground}

Named user-defined functions (not shown), and anonymous functions (below) can also be used.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwd{data.frame}\hlstd{(}\hlkwc{x} \hlstd{=} \hlnum{0}\hlopt{:}\hlnum{1}\hlstd{),} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= x))} \hlopt{+}
  \hlkwd{stat_function}\hlstd{(}\hlkwc{fun} \hlstd{=} \hlkwa{function}\hlstd{(}\hlkwc{x}\hlstd{,} \hlkwc{a}\hlstd{,} \hlkwc{b}\hlstd{)\{a} \hlopt{+} \hlstd{b} \hlopt{*} \hlstd{x}\hlopt{^}\hlnum{2}\hlstd{\},}
                \hlkwc{args} \hlstd{=} \hlkwd{list}\hlstd{(}\hlkwc{a} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{b} \hlstd{=} \hlnum{1.4}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{playground}
Edit the code above to use a different function, such as $e^{x + k}$, adjusting the argument(s) passed through \code{args} accordingly. Do this by means of an anonymous function, and by means of an equivalent named function defined by your code.
\end{playground}

\index{plots!plots of functions|)}
\index{grammar of graphics!function statistic|)}

\subsection{Summaries}\label{sec:plot:stat:summaries}
\index{grammar of graphics!summary statistic|(}
\index{plots!data summaries|(}
\index{plots!means}\index{plots!medians}\index{plots!error bars}
The summaries discussed in this section can be superimposed on raw data plots, or plotted on their own. Beware, that if scale limits are manually set, the summaries will be calculated from the subset of observations within these limits. Scale limits can be altered when explicitly defining a scale or by means of functions \Rfunction{xlim()} and \Rfunction{ylim()}. See section \ref{sec:plot:coord} on page \pageref{sec:plot:coord} for an explanation of how coordinate limits can be used to zoom into a plot without excluding of $x$ and $y$ values from the data.

It is possible to summarize data on the fly when plotting. We describe in the same section the calculation of measures of central tendency and of variation, as \ggstat{stat\_summary()} allows them to be calculated simultaneously and added together with a single layer.

For use in the examples, we generate some normally distributed artificial data.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{fake.data} \hlkwb{<-} \hlkwd{data.frame}\hlstd{(}
  \hlkwc{y} \hlstd{=} \hlkwd{c}\hlstd{(}\hlkwd{rnorm}\hlstd{(}\hlnum{10}\hlstd{,} \hlkwc{mean} \hlstd{=} \hlnum{2}\hlstd{,} \hlkwc{sd} \hlstd{=} \hlnum{0.5}\hlstd{),}
        \hlkwd{rnorm}\hlstd{(}\hlnum{10}\hlstd{,} \hlkwc{mean} \hlstd{=} \hlnum{4}\hlstd{,} \hlkwc{sd} \hlstd{=} \hlnum{0.7}\hlstd{)),}
  \hlkwc{group} \hlstd{=} \hlkwd{factor}\hlstd{(}\hlkwd{c}\hlstd{(}\hlkwd{rep}\hlstd{(}\hlstr{"A"}\hlstd{,} \hlnum{10}\hlstd{),} \hlkwd{rep}\hlstd{(}\hlstr{"B"}\hlstd{,} \hlnum{10}\hlstd{)))}
  \hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

We will reuse a ``base'' scatter plot in a series of examples, so that the differences are easier to appreciate. We first add just the mean. In this case, we need to pass as an argument to \ggstat{stat\_summary()}, the \code{geom} to use, as the default one, \gggeom{geom\_pointrange()}, expects data for plotting error bars in addition to the mean. This example uses a hyphen character as the constant value of \code{shape} (see the example for \code{geom\_point()} on page \pageref{chunk:plot:point:char} on the use of digits as \code{shape}). Instead of passing \code{"mean"} as an argument to parameter \code{fun} (earlier called \code{fun.y}), we can pass, if desired, other summary functions like \code{"median"}. In the case of these functions that return a single computed value, we pass them, or character strings with their names, as an argument to parameter \code{fun}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= fake.data,} \hlkwd{aes}\hlstd{(}\hlkwc{y} \hlstd{= y,} \hlkwc{x} \hlstd{= group))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{(}\hlkwc{shape} \hlstd{=} \hlnum{21}\hlstd{)} \hlopt{+}
  \hlkwd{stat_summary}\hlstd{(}\hlkwc{fun} \hlstd{=} \hlstr{"mean"}\hlstd{,} \hlkwc{geom} \hlstd{=} \hlstr{"point"}\hlstd{,}
               \hlkwc{color} \hlstd{=} \hlstr{"red"}\hlstd{,} \hlkwc{shape} \hlstd{=} \hlstr{"-"}\hlstd{,} \hlkwc{size} \hlstd{=} \hlnum{10}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-summary-plot-02-1}

}


\end{knitrout}

To pass as an argument a function that returns a central value like the mean plus confidence or other limits, we use parameter \code{fun.data} instead of \code{fun}. In the next example we add means and confidence intervals for $p = 0.95$ (the default) assuming normality.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
  \hlkwd{stat_summary}\hlstd{(}\hlkwc{fun.data} \hlstd{=} \hlstr{"mean_cl_normal"}\hlstd{,} \hlkwc{color} \hlstd{=} \hlstr{"red"}\hlstd{,} \hlkwc{size} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{alpha} \hlstd{=} \hlnum{0.7}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

We can override the default of $p = 0.95$ for confidence intervals by setting, for example, \code{conf.int = 0.90} in the list of arguments passed to the function. The intervals can also be computed without assuming normality, using the empirical distribution estimated from the data by bootstrap. To achieve this we pass to \code{fun.data} the argument \code{"mean\_cl\_boot"} instead of \code{"mean\_cl\_normal"}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
  \hlkwd{stat_summary}\hlstd{(}\hlkwc{fun.data} \hlstd{=} \hlstr{"mean_cl_boot"}\hlstd{,}
               \hlkwc{fun.args} \hlstd{=} \hlkwd{list}\hlstd{(}\hlkwc{conf.int} \hlstd{=} \hlnum{0.90}\hlstd{),}
               \hlkwc{color} \hlstd{=} \hlstr{"red"}\hlstd{,} \hlkwc{size} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{alpha} \hlstd{=} \hlnum{0.7}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

For $\bar{x} \pm \mathrm{s.e.}$ we should pass \code{"mean\_se"} and for $\bar{x} \pm \mathrm{s.d.}$ \code{"mean\_sdl"}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
  \hlkwd{stat_summary}\hlstd{(}\hlkwc{fun.data} \hlstd{=} \hlstr{"mean_se"}\hlstd{,}
               \hlkwc{color} \hlstd{=} \hlstr{"red"}\hlstd{,} \hlkwc{size} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{alpha} \hlstd{=} \hlnum{0.7}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

We do not give an example here, but it is possible to use user-defined functions instead of the functions exported by package \ggplot (based on those in package \Hmisc). Because arguments to the function used, except for the first one containing the variable in \code{data} mapped to the $y$ aesthetic, are supplied as a named list through parameter \code{fun.args}, the names used for parameters in the function definition need only match the names in this list.

Finally, we plot the means in a scatter plot, with the observations superimposed on the error bars as a result of the order in which the layers are added to the plot. In this case, we set \code{fill}, \code{color} and \code{alpha} (transparency) to constants, but in more complex data sets, mapping them to factors in \code{data} can be used for grouping of observations. Here, adding two plot layers with \ggstat{stat\_summary()} allows us to plot the mean and the error bars using different colors.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= fake.data,} \hlkwd{aes}\hlstd{(}\hlkwc{y} \hlstd{= y,} \hlkwc{x} \hlstd{= group))} \hlopt{+}
  \hlkwd{stat_summary}\hlstd{(}\hlkwc{fun} \hlstd{=} \hlstr{"mean"}\hlstd{,} \hlkwc{geom} \hlstd{=} \hlstr{"point"}\hlstd{,}
               \hlkwc{fill} \hlstd{=} \hlstr{"white"}\hlstd{,} \hlkwc{color} \hlstd{=} \hlstr{"black"}\hlstd{)} \hlopt{+}
  \hlkwd{stat_summary}\hlstd{(}\hlkwc{fun.data} \hlstd{=} \hlstr{"mean_cl_boot"}\hlstd{,}
               \hlkwc{geom} \hlstd{=} \hlstr{"errorbar"}\hlstd{,}
               \hlkwc{width} \hlstd{=} \hlnum{0.1}\hlstd{,} \hlkwc{size} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{color} \hlstd{=} \hlstr{"red"}\hlstd{)} \hlopt{+}
  \hlkwd{geom_point}\hlstd{(}\hlkwc{size} \hlstd{=} \hlnum{3}\hlstd{,} \hlkwc{alpha} \hlstd{=} \hlnum{0.3}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

We can plot means, or other summaries, by group mapped to \code{x} (\code{class} in this example) as columns by passing \code{"col"} as an argument to  \code{geom}. In this way we avoid the need to compute the summaries in advance.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(mpg,} \hlkwd{aes}\hlstd{(class, hwy))} \hlopt{+}
  \hlkwd{stat_summary}\hlstd{(}\hlkwc{geom} \hlstd{=} \hlstr{"col"}\hlstd{,} \hlkwc{fun} \hlstd{= mean)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-summary-plot-09a-1}

}


\end{knitrout}

We can easily add error bars to the column plot. We use \code{size} to make the lines of the error bars thicker. The default \emph{geometry} in \ggstat{stat\_summary()} is \gggeom{geom\_pointrange()}, so we can pass \code{"linerange"} as an argument for \code{geom} to eliminate the point.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
  \hlkwd{stat_summary}\hlstd{(}\hlkwc{geom} \hlstd{=} \hlstr{"linerange"}\hlstd{,} \hlkwc{fun.data} \hlstd{=} \hlstr{"mean_se"}\hlstd{,}
               \hlkwc{size} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{color} \hlstd{=} \hlstr{"red"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

Passing \code{"errorbar"} instead of \code{"linerange"} to \code{geom} results in traditional ``capped'' error bars. However, this type of error bar has been criticized as adding unnecessary clutter to plots \autocite{Tufte1983}. We can use \code{width} to reduce the width of the caps at the ends of the error bars.

If we have already calculated values for the summaries, we can still obtain the same plots by mapping variables to the \emph{aesthetics} required by \gggeom{geom\_errorbar()} and \gggeom{geom\_linerange()}: \code{x}, \code{y}, \code{ymax} and \code{ymin}.

\begin{explainbox}
The ``reverse'' syntax is also valid, as we can add the \emph{geometry} to the plot object and pass the \emph{statistics} as an argument to it. In general in this book we avoid this alternative syntax for the sake of consistency.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(mpg,} \hlkwd{aes}\hlstd{(class, hwy))} \hlopt{+}
  \hlkwd{geom_col}\hlstd{(}\hlkwc{stat} \hlstd{=} \hlstr{"summary"}\hlstd{,} \hlkwc{fun} \hlstd{= mean)}
\end{alltt}
\end{kframe}
\end{knitrout}
\end{explainbox}
\index{plots!data summaries|)}
\index{grammar of graphics!summary statistic|)}

\subsection{Smoothers and models}
\index{plots!smooth curves|(}
\index{plots!fitted curves|(}
\index{plots!statistics!smooth}
The \emph{statistic} \ggstat{stat\_smooth()} fits a smooth curve to observations in the case when the scales for $x$ and $y$ are continuous---the corresponding \emph{geometry} \gggeom{geom\_smooth()} uses this \emph{statistic}, and differs only in how arguments are passed to formal parameters. For the first example, we use \ggstat{stat\_smooth()} with the default smoother, a spline. The type of spline is automatically chosen based on the number of observations and informed by a message. The \code{formula} must be stated using the names of the $x$ and $y$ aesthetics, rather the names of the mapped variables in \code{mtcars}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg))} \hlopt{+}
       \hlkwd{stat_smooth}\hlstd{(}\hlkwc{formula} \hlstd{= y} \hlopt{~} \hlstd{x)}
\end{alltt}
\end{kframe}
\end{knitrout}

In most cases we will want to plot the observations as points together with the smoother. We can plot the observation on top of the smoother, as done here, or the smoother on top of the observations.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg))} \hlopt{+}
  \hlkwd{stat_smooth}\hlstd{(}\hlkwc{formula} \hlstd{= y} \hlopt{~} \hlstd{x)} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()}
\end{alltt}


{\ttfamily\noindent\itshape\color{messagecolor}{\#\# `geom\_smooth()` using method = 'loess'}}\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-smooth-plot-02-1}

}


\end{knitrout}

Instead of using the default spline, we can fit a different model. In this example we use a linear model as smoother, fitted by \Rfunction{lm()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
  \hlkwd{stat_smooth}(method = \hlstr{"lm"}, formula = y ~ x) +
\end{alltt}
\end{kframe}
\end{knitrout}

These data are really grouped, so we map variable \code{cyl} to the \code{color} \emph{aesthetic}. Now we get three groups of points with different colours but also three separate smooth lines.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg,} \hlkwc{color} \hlstd{=} \hlkwd{factor}\hlstd{(cyl)))} \hlopt{+}
  \hlkwd{stat_smooth}\hlstd{(}\hlkwc{method} \hlstd{=} \hlstr{"lm"}\hlstd{,} \hlkwc{formula} \hlstd{= y} \hlopt{~} \hlstd{x)} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-smooth-plot-04-1}

}


\end{knitrout}

To obtain a single smoother for the three groups, we need to set the mapping of the \code{color} \emph{aesthetic} to a constant within \ggstat{stat\_smooth}. This local value overrides the default \code{color} mapping set in \code{ggplot()} just for this plot layer. We use \code{"black"} but this could be replaced by any other color definition known to \Rlang.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg,} \hlkwc{color} \hlstd{=} \hlkwd{factor}\hlstd{(cyl)))} \hlopt{+}
  \hlkwd{stat_smooth}\hlstd{(}\hlkwc{method} \hlstd{=} \hlstr{"lm"}\hlstd{,} \hlkwc{formula} \hlstd{= y} \hlopt{~} \hlstd{x,} \hlkwc{color} \hlstd{=} \hlstr{"black"}\hlstd{)} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}

Instead of using the \code{formula} for a linear regression as smoother, we pass a different \code{formula} as an argument. In this example we use a polynomial of order 2.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg,} \hlkwc{color} \hlstd{=} \hlkwd{factor}\hlstd{(cyl)))} \hlopt{+}
  \hlkwd{stat_smooth}\hlstd{(}\hlkwc{method} \hlstd{=} \hlstr{"lm"}\hlstd{,} \hlkwc{formula} \hlstd{= y} \hlopt{~} \hlkwd{poly}\hlstd{(x,} \hlnum{2}\hlstd{),} \hlkwc{color} \hlstd{=} \hlstr{"black"}\hlstd{)} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-smooth-plot-06-1}

}


\end{knitrout}

It is possible to use other types of models, including GAM and GLM, as smoothers, but we will give only two simple examples of the use of \code{nls()} to fit a model non-linear in its parameters (see section \ref{sec:stat:NLS} on page \pageref{sec:stat:NLS} for details about fitting this same model with \code{nls()}). In the first one we fit a Michaelis-Menten equation to reaction rate (\code{rate}) versus reactant concentration (\code{conc}). \Rdata{Puromycin} is a data set included in the \Rlang distribution. Function \Rfunction{SSmicmen()} is also from \Rlang, and is a \emph{self-starting}\index{self-starting functions} implementation of the Michaelis-Menten equation. Thanks to this, even though the fit is done with an iterative algorithm, we do not need to explicitly provide starting values for the parameters to be fitted. We need to set \code{se = FALSE} because standard errors are not supported by the \code{predict()} method for \code{nls} fitted models.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(Puromycin,} \hlkwd{aes}\hlstd{(conc, rate,} \hlkwc{color} \hlstd{= state))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{geom_smooth}\hlstd{(}\hlkwc{method} \hlstd{=} \hlstr{"nls"}\hlstd{,}
              \hlkwc{formula} \hlstd{=  y} \hlopt{~} \hlkwd{SSmicmen}\hlstd{(x, Vm, K),}
              \hlkwc{se} \hlstd{=} \hlnum{FALSE}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

In the second example we define the same model directly in the model formula, and provide the starting values explicitly. The names used for the parameters to be fitted can be chosen at will, within the restrictions of the \Rlang language, but of course the names used in \code{formula} and \code{start} must match each other.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(Puromycin,} \hlkwd{aes}\hlstd{(conc, rate,} \hlkwc{color} \hlstd{= state))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{geom_smooth}\hlstd{(}\hlkwc{method} \hlstd{=} \hlstr{"nls"}\hlstd{,}
              \hlkwc{method.args} \hlstd{=} \hlkwd{list}\hlstd{(}\hlkwc{formula} \hlstd{=  y} \hlopt{~} \hlstd{(Vmax} \hlopt{*} \hlstd{x)} \hlopt{/} \hlstd{(k} \hlopt{+} \hlstd{x),}
                                 \hlkwc{start} \hlstd{=} \hlkwd{list}\hlstd{(}\hlkwc{Vmax} \hlstd{=} \hlnum{200}\hlstd{,} \hlkwc{k} \hlstd{=} \hlnum{0.05}\hlstd{)),}
              \hlkwc{se} \hlstd{=} \hlnum{FALSE}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

In some cases it is desirable to annotate plots with fitted model equations or fitted parameters. One way of achieving this is by fitting the model and then extracting the parameters to manually construct text strings to use for text or label annotations. However, package \pkgname{ggpmisc} makes it possible to automate such annotations in many cases.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.formula} \hlkwb{<-} \hlstd{y} \hlopt{~} \hlkwd{poly}\hlstd{(x,} \hlnum{2}\hlstd{)}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg,} \hlkwc{color} \hlstd{=} \hlkwd{factor}\hlstd{(cyl)))} \hlopt{+}
  \hlkwd{stat_smooth}\hlstd{(}\hlkwc{method} \hlstd{=} \hlstr{"lm"}\hlstd{,} \hlkwc{formula} \hlstd{= my.formula,} \hlkwc{color} \hlstd{=} \hlstr{"black"}\hlstd{)} \hlopt{+}
  \hlkwd{stat_poly_eq}\hlstd{(}\hlkwc{formula} \hlstd{= my.formula,} \hlkwd{aes}\hlstd{(}\hlkwc{label} \hlstd{= ..eq.label..),}
               \hlkwc{color} \hlstd{=} \hlstr{"black"}\hlstd{,} \hlkwc{parse} \hlstd{=} \hlnum{TRUE}\hlstd{,} \hlkwc{label.x.npc} \hlstd{=} \hlnum{0.3}\hlstd{)} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-smooth-plot-12-1}

}


\end{knitrout}

This same package makes it possible to annotate plots with summary tables from a model fit.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.formula} \hlkwb{<-} \hlstd{y} \hlopt{~} \hlkwd{poly}\hlstd{(x,} \hlnum{2}\hlstd{)}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg,} \hlkwc{color} \hlstd{=} \hlkwd{factor}\hlstd{(cyl)))} \hlopt{+}
  \hlkwd{stat_smooth}\hlstd{(}\hlkwc{method} \hlstd{=} \hlstr{"lm"}\hlstd{,} \hlkwc{formula} \hlstd{= my.formula,} \hlkwc{color} \hlstd{=} \hlstr{"black"}\hlstd{)} \hlopt{+}
  \hlkwd{stat_fit_tb}\hlstd{(}\hlkwc{method} \hlstd{=} \hlstr{"lm"}\hlstd{,}
              \hlkwc{method.args} \hlstd{=} \hlkwd{list}\hlstd{(}\hlkwc{formula} \hlstd{= my.formula),}
              \hlkwc{color} \hlstd{=} \hlstr{"black"}\hlstd{,}
              \hlkwc{tb.vars} \hlstd{=} \hlkwd{c}\hlstd{(}\hlkwc{Parameter} \hlstd{=} \hlstr{"term"}\hlstd{,}
                          \hlkwc{Estimate} \hlstd{=} \hlstr{"estimate"}\hlstd{,}
                          \hlstr{"s.e."} \hlstd{=} \hlstr{"std.error"}\hlstd{,}
                          \hlstr{"italic(t)"} \hlstd{=} \hlstr{"statistic"}\hlstd{,}
                          \hlstr{"italic(P)"} \hlstd{=} \hlstr{"p.value"}\hlstd{),}
              \hlkwc{label.y.npc} \hlstd{=} \hlstr{"top"}\hlstd{,} \hlkwc{label.x.npc} \hlstd{=} \hlstr{"right"}\hlstd{,}
              \hlkwc{parse} \hlstd{=} \hlnum{TRUE}\hlstd{)} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-smooth-plot-13-1}

}


\end{knitrout}

Package \pkgname{ggpmisc} provides additional \emph{statistics} for the annotation of plots based on fitted models. Please see the package documentation for details.

\index{plots!smooth curves|)}
\index{plots!fitted curves|)}

\subsection{Frequencies and counts}\label{sec:histogram}\label{sec:plot:histogram}
\index{plots!histograms|(}

When the number of observations is rather small, we can rely on the density of graphical elements to convey the density of the observations. For example, scatter plots using well-chosen values for \code{alpha} can give a satisfactory impression of the density. Rug plots, described in section \ref{sec:plot:rug} on page \pageref{sec:plot:rug}, can also satisfactorily convey the density of observations along $x$ and/or $y$ axes. Such approaches do not involve computations, while the \emph{statistics} described in this section do. Frequencies by value-range (or bins) and empirical density functions are summaries especially useful when the number of observations is large. These summaries can be computed in one or more dimensions.

Histograms are defined by how the plotted values are calculated. Although histograms are most frequently plotted as bar plots, many bar or ``column'' plots are not histograms. Although rarely done in practice, a histogram could be plotted using a different \emph{geometry} using \ggstat{stat\_bin()}, the \emph{statistic} used by default by \gggeom{geom\_histogram()}. This \emph{statistic} does binning of observations before computing frequencies, and is suitable for continuous $x$ scales. When a factor is mapped to \code{x}, \ggstat{stat\_count()} should be used, which is the default \code{stat} for \gggeom{geom\_bar()}. These two \emph{geometries} are described in this section about statistics, because they default to using statistics different from \code{stat\_identity()} and consequently summarize the data.

As before, we generate suitable artificial data.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{set.seed}\hlstd{(}\hlnum{12345}\hlstd{)}
\hlstd{my.data} \hlkwb{<-}
  \hlkwd{data.frame}\hlstd{(}\hlkwc{x} \hlstd{=} \hlkwd{rnorm}\hlstd{(}\hlnum{200}\hlstd{),}
             \hlkwc{y} \hlstd{=} \hlkwd{c}\hlstd{(}\hlkwd{rnorm}\hlstd{(}\hlnum{100}\hlstd{,} \hlopt{-}\hlnum{1}\hlstd{,} \hlnum{1}\hlstd{),} \hlkwd{rnorm}\hlstd{(}\hlnum{100}\hlstd{,} \hlnum{1}\hlstd{,} \hlnum{1}\hlstd{)),}
             \hlkwc{group} \hlstd{=} \hlkwd{factor}\hlstd{(}\hlkwd{rep}\hlstd{(}\hlkwd{c}\hlstd{(}\hlstr{"A"}\hlstd{,} \hlstr{"B"}\hlstd{),} \hlkwd{c}\hlstd{(}\hlnum{100}\hlstd{,} \hlnum{100}\hlstd{))) )}
\end{alltt}
\end{kframe}
\end{knitrout}

We could have relied on the default number of bins automatically computed by the \ggstat{stat\_bin()} statistic, however, we here set it to 15 with \code{bins = 15}. It is important to remember that in this case no variable in \code{data} is mapped onto the \code{y} \emph{aesthetic}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(my.data,} \hlkwd{aes}\hlstd{(x))} \hlopt{+}
  \hlkwd{geom_histogram}\hlstd{(}\hlkwc{bins} \hlstd{=} \hlnum{15}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-histogram-plot-01-1}

}


\end{knitrout}

If we create a grouping by mapping a factor to an additional \emph{aesthetic} how the bars created are positioned with respect to each other becomes relevant. We can then plot side by side with \code{position = "dodge"}, stacked one above the other with \code{position = "stack"} and overlapping with \code{position = "identity"} in which case we need to make them semi-transparent with \code{alpha = 0.5} so that they all remain visible.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(my.data,} \hlkwd{aes}\hlstd{(y,} \hlkwc{fill} \hlstd{= group))} \hlopt{+}
  \hlkwd{geom_histogram}\hlstd{(}\hlkwc{bins} \hlstd{=} \hlnum{15}\hlstd{,} \hlkwc{position} \hlstd{=} \hlstr{"dodge"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

The computed values are contained in the \code{data} that the \emph{geometry} ``receives'' from the \emph{statistic}. Many statistics compute additional values that are not mapped by default. These can be mapped with \code{aes()} by enclosing them in a call to \code{stat()}. From the help page we can learn that in addition to counts in variable \code{count}, density is returned in variable \code{density} by this statistic. Consequently, we can create a histogram with the counts per bin expressed as densities whose integral is one (rather than their sum, as the width of the bins is in this case different from one), as follows.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(my.data,} \hlkwd{aes}\hlstd{(y,} \hlkwc{fill} \hlstd{= group))} \hlopt{+}
  \hlkwd{geom_histogram}\hlstd{(}\hlkwc{mapping} \hlstd{=} \hlkwd{aes}\hlstd{(}\hlkwc{y} \hlstd{=} \hlkwd{stat}\hlstd{(density)),} \hlkwc{bins} \hlstd{=} \hlnum{15}\hlstd{,} \hlkwc{position} \hlstd{=} \hlstr{"dodge"}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-histogram-plot-03-1}

}


\end{knitrout}

If it were not for the easier to remember name of \gggeom{geom\_histogram()}, adding the layers with \ggstat{stat\_bin()} or \ggstat{stat\_count()} would be preferable as it makes clear that computations on the data are involved.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(my.data,} \hlkwd{aes}\hlstd{(y,} \hlkwc{fill} \hlstd{= group))} \hlopt{+}
  \hlkwd{stat_bin}\hlstd{(}\hlkwc{bins} \hlstd{=} \hlnum{15}\hlstd{,} \hlkwc{position} \hlstd{=} \hlstr{"dodge"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

The \emph{statistic} \ggstat{stat\_bin2d}, and its matching \emph{geometry} \gggeom{geom\_bin2d()}, by default compute a frequency histogram in two dimensions, along the \code{x} and \code{y} \emph{aesthetics}. The frequency for each rectangular tile is mapped onto a \code{fill} scale. As for \ggstat{stat\_bin()}, \code{density} is also computed and available to be mapped as shown above for \code{geom\_histogram}. In this example, to compare dispersion in two dimensions, equal $x$ and $y$ scales are most suitable, which we achieve by adding \ggcoordinate{coord\_fixed()}, which is a variation of the default \ggcoordinate{coord\_cartesian()}  (see section \ref{sec:plot:coord} on page \pageref{sec:plot:coord} for details on other systems of coordinates).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(my.data,} \hlkwd{aes}\hlstd{(x, y))} \hlopt{+}
  \hlkwd{stat_bin2d}\hlstd{(}\hlkwc{bins} \hlstd{=} \hlnum{8}\hlstd{)} \hlopt{+}
  \hlkwd{coord_fixed}\hlstd{(}\hlkwc{ratio} \hlstd{=} \hlnum{1}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-bin2d-plot-01-1}

}


\end{knitrout}

The \emph{statistic} \ggstat{stat\_bin\_hex()}, and its matching \emph{geometry} \gggeom{geom\_hex()}, differ from \ggstat{stat\_bin2d()} in their use of hexagonal instead of square tiles. By default the frequency or \code{count} for each hexagon is mapped to the \code{fill} aesthetic, but counts expressed as \code{density} are also computed and can be mapped with \code{aes(fill = stat(density))}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(my.data,} \hlkwd{aes}\hlstd{(x, y))} \hlopt{+}
  \hlkwd{stat_bin_hex}\hlstd{(}\hlkwc{bins} \hlstd{=} \hlnum{8}\hlstd{)} \hlopt{+}
  \hlkwd{coord_fixed}\hlstd{(}\hlkwc{ratio} \hlstd{=} \hlnum{1}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-hex-plot-01-1}

}


\end{knitrout}
\index{plots!histograms|)}

\subsection{Density functions}\label{sec:plot:density}
\index{plots!density plot!1 dimension|(}
\index{plots!statistics!density}
Empirical density functions are the equivalent of a histogram, but are continuous and not calculated using bins. They can be estimated in 1 or 2 dimensions (1D or 2D), for $x$ or $x$ and $y$, respectively. As with histograms it is possible to use different \emph{geometries} to visualize them. Examples of the use of \gggeom{geom\_density()} to create 1D density plots follow.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(my.data,} \hlkwd{aes}\hlstd{(y,} \hlkwc{color} \hlstd{= group))} \hlopt{+}
  \hlkwd{geom_density}\hlstd{()}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-density-plot-01-1}

}


\end{knitrout}

A semitransparent fill can be used instead of coloured lines.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(my.data,} \hlkwd{aes}\hlstd{(y,} \hlkwc{fill} \hlstd{= group))} \hlopt{+}
  \hlkwd{geom_density}\hlstd{(}\hlkwc{alpha} \hlstd{=} \hlnum{0.5}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
\index{plots!density plot!1 dimension|)}

\index{plots!density plot!2 dimensions|(}
\index{plots!statistics!density 2d}

Examples of 2D density plots follow. In the first example we use two \emph{geometries} which were earlier described, \code{geom\_point()} and \code{geom\_rug()}, to plot the observations in the background. With \ggstat{stat\_density\_2d()} we add a two-dimensional density ``map'' represented using isolines. We map \code{group} to the \code{color} \emph{aesthetic}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(my.data,} \hlkwd{aes}\hlstd{(x, y,} \hlkwc{color} \hlstd{= group))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{geom_rug}\hlstd{()} \hlopt{+}
  \hlkwd{stat_density_2d}\hlstd{()}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-density-plot-10-1}

}


\end{knitrout}

In this case, \gggeom{geom\_density\_2d()} is equivalent, and we can replace it in the last line in the chunk above.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
  \hlkwd{geom_density_2d}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}

In the next example we plot the groups in separate panels, and use a \emph{geometry} supporting the \code{fill} \emph{aesthetic} and we map to it the variable \code{level}, computed by \code{stat\_density\_2d()}


\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(my.data,} \hlkwd{aes}\hlstd{(x, y))} \hlopt{+}
\hlkwd{stat_density_2d}\hlstd{(}\hlkwd{aes}\hlstd{(}\hlkwc{fill} \hlstd{=} \hlkwd{stat}\hlstd{(level)),} \hlkwc{geom} \hlstd{=} \hlstr{"polygon"}\hlstd{)} \hlopt{+}
  \hlkwd{facet_wrap}\hlstd{(}\hlopt{~}\hlstd{group)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-density-plot-12-1}

}


\end{knitrout}


\index{plots!density plot!2 dimensions|)}

\subsection{Box and whiskers plots}\label{sec:boxplot}
\index{box plots|see{plots, box and whiskers plot}}
\index{plots!box and whiskers plot|(}

Box and whiskers plots, also very frequently called just box plots, are also summaries that convey some of the properties of a distribution. They are calculated and plotted by means of \ggstat{stat\_boxplot()} or its matching \gggeom{geom\_boxplot()}. Although they can be calculated and plotted based on just a few observations, they are not useful unless each box plot is based on more than 10 to 15 observations.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(my.data,} \hlkwd{aes}\hlstd{(group, y))} \hlopt{+}
  \hlkwd{stat_boxplot}\hlstd{()}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-bw-plot-01-1}

}


\end{knitrout}

As with other \emph{statistics}, their appearance obeys both the usual \emph{aesthetics} such as \code{color}, and parameters specific to this type of visual representation: \code{outlier.color}, \code{outlier.fill}, \code{outlier.shape}, \code{outlier.size}, \code{outlier.stroke} and \code{outlier.alpha}, which affect the outliers in a way similar to the equivalent \code{aethetics} in \code{geom\_point()}. The shape and width of the ``box'' can be adjusted with \code{notch}, \code{notchwidth} and \code{varwidth}. Notches in a boxplot serve a similar role for comparing medians as confidence limits serve when comparing means.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(my.data,} \hlkwd{aes}\hlstd{(group, y))} \hlopt{+}
  \hlkwd{stat_boxplot}\hlstd{(}\hlkwc{notch} \hlstd{=} \hlnum{TRUE}\hlstd{,} \hlkwc{width} \hlstd{=} \hlnum{0.4}\hlstd{,}
               \hlkwc{outlier.color} \hlstd{=} \hlstr{"red"}\hlstd{,} \hlkwc{outlier.shape} \hlstd{=} \hlstr{"*"}\hlstd{,} \hlkwc{outlier.size} \hlstd{=} \hlnum{5}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-bw-plot-02-1}

}


\end{knitrout}

\index{plots!box and whiskers plot|)}

\subsection{Violin plots}\label{sec:plot:violin}
\index{plots!violin plot|(}

Violin plots are a more recent development than box plots, and usable with relatively large numbers of observations. They could be thought of as being a sort of hybrid between an empirical density function (see section \ref{sec:plot:density} on page \pageref{sec:plot:density}) and a box plot (see section \ref{sec:boxplot} on page \pageref{sec:boxplot}). As is the case with box plots, they are particularly useful when comparing distributions of related data, side by side. They can be created with  \gggeom{geom\_violin()} as shown in the examples below.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(my.data,} \hlkwd{aes}\hlstd{(group, y))} \hlopt{+}
  \hlkwd{geom_violin}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(my.data,} \hlkwd{aes}\hlstd{(group, y,} \hlkwc{fill} \hlstd{= group))} \hlopt{+}
  \hlkwd{geom_violin}\hlstd{(}\hlkwc{alpha} \hlstd{=} \hlnum{0.16}\hlstd{)} \hlopt{+}
  \hlkwd{geom_point}\hlstd{(}\hlkwc{alpha} \hlstd{=} \hlnum{0.33}\hlstd{,} \hlkwc{size} \hlstd{=} \hlnum{1.5}\hlstd{,}
             \hlkwc{color} \hlstd{=} \hlstr{"black"}\hlstd{,} \hlkwc{shape} \hlstd{=} \hlnum{21}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-violin-plot-02-1}

}


\end{knitrout}

As with other \emph{geometries}, their appearance obeys both the usual \emph{aesthetics} such as color, and others specific to these types of visual representation.

Other types of displays related to violin plots are \emph{beeswarm} plots and \emph{sina} plots, and can be produced with \emph{geometries} defined in packages \pkgname{ggbeeswarm} and \pkgname{ggforce}, respectively. A minimal example of a beeswarm plot is shown below. See the documentation of the packages for details about the many options in their use.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(my.data,} \hlkwd{aes}\hlstd{(group, y))} \hlopt{+}
  \hlkwd{geom_quasirandom}\hlstd{()}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-ggbeeswarm-plot-01-1}

}


\end{knitrout}

\index{plots!violin plot|)}
\index{grammar of graphics!statistics|)}

\section{Facets}\label{sec:plot:facets}
\index{grammar of graphics!facets|(}
\index{plots!trellis-like}\index{plots!coordinated panels}
Facets are used in a special kind of plots containing multiple panels in which the panels share some properties.
These sets of coordinated panels are a useful tool for visualizing complex data. These plots became popular through the \code{trellis} graphs in \langname{S}, and the \pkgname{lattice} package in \Rlang. The basic idea is to have rows and/or columns of plots with common scales, all plots showing values for the same response variable. This is useful when there are multiple classification factors in a data set. Similar-looking plots, but with free scales or with the same scale but a `floating' intercept, are sometimes also useful. In \ggplot there are two possible types of facets: facets organized in a grid, and facets along a single `axis' of variation but, possibly, wrapped into two or more rows. These are produced by adding \Rfunction{facet\_grid()} or \Rfunction{facet\_wrap()}, respectively. In the examples below we use \gggeom{geom\_point()} but faceting can be used with \code{ggplot} objects containing diverse kinds of layers, displaying either observations or summaries from \code{data}.


We start by creating and saving a single-panel plot that we will use through this section to demonstrate how the same plot changes when we add facets.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{p} \hlkwb{<-} \hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(wt, mpg))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()}
\hlstd{p}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-facets-00-1}

}


\end{knitrout}

A grid of panels has two dimensions, \code{rows} and \code{cols}. These dimensions in the grid of plot panels can be ``mapped'' to factors. Until recently a formula syntax was the only available one. Although this notation has been retained, the preferred syntax is currently to use the parameters \code{rows} and \code{cols}. We use \code{cols} in this example. Note that we need to use \code{vars()} to enclose the names of the variables in the data. The ``headings'' of the panels or \emph{strip labels} are by default the levels of the factors.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{p} \hlopt{+} \hlkwd{facet_grid}\hlstd{(}\hlkwc{cols} \hlstd{=} \hlkwd{vars}\hlstd{(cyl))}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-facets-01-1}

}


\end{knitrout}

In the ``historical notation'' the same plot would have been coded as follows.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{p} \hlopt{+} \hlkwd{facet_grid}\hlstd{(.} \hlopt{~} \hlstd{cyl)}
\end{alltt}
\end{kframe}
\end{knitrout}

By default, all panels share the same scale limits and share the plotting space evenly, but these defaults can be overridden.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{p} \hlopt{+} \hlkwd{facet_grid}\hlstd{(}\hlkwc{cols} \hlstd{=} \hlkwd{vars}\hlstd{(cyl),} \hlkwc{scales} \hlstd{=} \hlstr{"free"}\hlstd{)}
\hlstd{p} \hlopt{+} \hlkwd{facet_grid}\hlstd{(}\hlkwc{cols} \hlstd{=} \hlkwd{vars}\hlstd{(cyl),} \hlkwc{scales} \hlstd{=} \hlstr{"free"}\hlstd{,} \hlkwc{space} \hlstd{=} \hlstr{"free"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}


To obtain a 2D grid we need to specify both \code{rows} and \code{cols}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{p} \hlopt{+} \hlkwd{facet_grid}\hlstd{(}\hlkwc{rows} \hlstd{=} \hlkwd{vars}\hlstd{(vs),} \hlkwc{cols} \hlstd{=} \hlkwd{vars}\hlstd{(am))}
\end{alltt}
\end{kframe}
\end{knitrout}


Margins display an additional column or row of panels with the combined data.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{p} \hlopt{+} \hlkwd{facet_grid}\hlstd{(}\hlkwc{cols} \hlstd{=} \hlkwd{vars}\hlstd{(cyl),} \hlkwc{margins} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-facets-06-1}

}


\end{knitrout}

We can represent more than one variable per dimension of the grid of plot panels. For this example, we also override the default \code{labeller} used for the panels with one that includes the name of the variable in addition to factor levels in the \emph{strip labels}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{p} \hlopt{+} \hlkwd{facet_grid}\hlstd{(}\hlkwc{cols} \hlstd{=} \hlkwd{vars}\hlstd{(vs, am),} \hlkwc{labeller} \hlstd{= label_both)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-facets-07-1}

}


\end{knitrout}

\begin{explainbox}
Sometimes we may want to have mathematical expressions or Greek letters in the panel headings. The next example shows a way of achieving this. The key is to use as \code{labeller} a function that parses character strings into \Rlang expressions.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{mtcars}\hlopt{$}\hlstd{cyl12} \hlkwb{<-} \hlkwd{factor}\hlstd{(mtcars}\hlopt{$}\hlstd{cyl,}
                       \hlkwc{labels} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"alpha"}\hlstd{,} \hlstr{"beta"}\hlstd{,} \hlstr{"sqrt(x, y)"}\hlstd{))}
\hlstd{p1} \hlkwb{<-} \hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(mpg, wt))} \hlopt{+}
      \hlkwd{geom_point}\hlstd{()} \hlopt{+}
      \hlkwd{facet_grid}\hlstd{(}\hlkwc{cols} \hlstd{=} \hlkwd{vars}\hlstd{(cyl12),} \hlkwc{labeller} \hlstd{= label_parsed)}
\end{alltt}
\end{kframe}
\end{knitrout}

More frequently we may need to include the levels of the factor used in the faceting as part of the labels. Here we use as \code{labeller}, function \Rfunction{label\_bquote()} with a special syntax that allows us to use an expression where replacement based on the facet (panel) data takes place. See section \ref{sec:plot:plotmath} for an example of the use of \code{bquote()}, the \Rlang function on which \Rfunction{label\_bquote()}, is built.


\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{p} \hlopt{+}
  \hlkwd{facet_grid}\hlstd{(}\hlkwc{cols} \hlstd{=} \hlkwd{vars}\hlstd{(cyl),}
             \hlkwc{labeller} \hlstd{=} \hlkwd{label_bquote}\hlstd{(}\hlkwc{cols} \hlstd{=} \hlkwd{.}\hlstd{(cyl)}\hlopt{~}\hlstr{"cylinders"}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}

\end{explainbox}
%\begin{infobox}
%\sloppy
%In versions of \ggplot before 2.0.0, \code{labeller} was not implemented for \Rfunction{facet\_wrap()}, it was only available for \Rfunction{facet\_grid()}.
%\end{infobox}

In the next example we create a plot with wrapped facets. In this case the number of levels is small, and no wrapping takes place by default. In cases when more panels are present, wrapping into two or more continuation rows is the default. Here, we force wrapping with \code{nrow = 2}. When using \Rfunction{facet\_wrap()} there is only one dimension, and the parameter is called \code{facets}, instead of \code{rows} or \code{cols}.


\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{p} \hlopt{+} \hlkwd{facet_wrap}\hlstd{(}\hlkwc{facets} \hlstd{=} \hlkwd{vars}\hlstd{(cyl),} \hlkwc{nrow} \hlstd{=} \hlnum{2}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-facets-13-1}

}


\end{knitrout}

The example below (plot not shown), is similar to the earlier one for \code{facet\_grid}, but faceting according to two factors with \code{facet\_wrap()} along a single wrapped row of panels.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{p} \hlopt{+} \hlkwd{facet_wrap}\hlstd{(}\hlkwc{facets} \hlstd{=} \hlkwd{vars}\hlstd{(vs, am),} \hlkwc{nrow} \hlstd{=} \hlnum{2}\hlstd{,} \hlkwc{labeller} \hlstd{= label_both)}
\end{alltt}
\end{kframe}
\end{knitrout}


%In versions of \ggplot before 2.0.0, \code{labeller} was not implemented for
%\code{facet\_wrap()}, it was only available for \code{facet\_grid()}. In the current
%version it is implemented for both.
%
%<<echo=FALSE>>=
%opts_chunk$set(opts_fig_wide)
%@
%
%<<>>=
%p + facet_wrap(~ vs, labeller = label_bquote(alpha ^ .(vs)))
%@
\index{grammar of graphics!facets|)}

\section{Scales}\label{sec:plot:scales}
\index{grammar of graphics!scales|(}

In earlier sections of this chapter, examples have used the default \emph{scales} or we have set them with convenience functions. In the present section we describe in more detail the use of \emph{scales}. There are \emph{scales} available for different \emph{aesthetics} ($\approx$ attributes) of the plotted geometrical objects, such as position (\code{x, y, z}), \code{size}, \code{shape}, \code{linetype}, \code{color}, \code{fill}, \code{alpha} or transparency, \code{angle}. Scales determine how values in \code{data} are mapped to values of an \emph{aesthetics}, and how these values are labeled.

Depending on the characteristics of the data being mapped, \emph{scales} can be continuous or discrete, for \code{numeric} or \code{factor} variables in \code{data}, respectively. On the other hand, some \emph{aesthetics}, like \code{size}, can vary continuously but others like \code{linetype} are inherently discrete. In addition to discrete scales for inherently discrete \emph{aesthetics}, discrete scales are available for those \emph{aesthetics} that are inherently continuous, like \code{x}, \code{y}, \code{size}, \code{color}, etc.\

The scales used by default set the mapping automatically (e.g.,  which color value corresponds to $x = 0$ and which one to $x = 1$). However, for each \emph{aesthetic} such as \code{color}, there are multiple scales to choose from when creating a plot, both continuous and discrete (e.g.,  20 different color scales in \ggplot 3.2.0).

\begin{explainbox}
\emph{Aesthetics} in a plot layer, in addition to being determined by mappings, can also be set to constant values (e.g.,  plotting all points in a layer in red instead of the default black). \emph{Aesthetics} set to constant values, are not mapped to data, and are consequently independent of scales. In other words, properties of plot elements can be either set to a single constant value of an \emph{aesthetic} affecting all observations present in the layer \code{data}, or mapped to a variable in \code{data} in which case the value of the \emph{aesthetic}, such as \code{color}, will depend on the values of the mapped variable.
\end{explainbox}

The most direct mapping to data is \code{identity}, which means that the data is taken at its face value. In a color scale, say \ggscale{scale\_color\_identity()}, the variable in the data would be encoded with values such as \code{"red"}, \code{"blue"}---i.e., valid \Rlang colours. In a simple mapping using \ggscale{scale\_color\_discrete()} levels of a factor, such as \code{"treatment"} and \code{"control"} would be represented as distinct colours with the correspondence of individual factor levels to individual colours selected automatically by default. In contrast with \code{scale\_color\_manual()} the user needs to explicitly provide the mapping between factor levels and colours by passing arguments to the scale functions' parameters \code{breaks} and \code{values}.

A continuous data variable needs to be mapped to an \emph{aesthetic} through a continuous scale such as \code{scale\_color\_continuous()} or one its various variants. Values in a \code{numeric} variable will be mapped into a continuous range of colours, determined either automatically through a palette or manually by giving the colours at the extremes, and optionally at multiple intermediate values, within the range of variation of the mapped variable (e.g.,  scale settings so that the color varies gradually between \code{"red"} and \code{"gray50"}). Handling of missing values is such that mapping a value in a variable to an \code{NA} value for an aesthetic such as color makes the mapped values invisible. The reverse, mapping \code{NA} values in the data to a specific value of an aesthetic is also possible (e.g.,  displaying \code{NA} values in the mapped variable in red, while other values are mapped to shades of blue).

%
%
%\sloppy
%Advanced scale manipulation requires package \code{scales} to be loaded, although \ggplot (2.0.0 and later) re-export several functions from package \code{scales}. Some simple examples follow.

%\begin{infobox}
\subsection{Axis and key labels}\label{sec:plot:scale:name}\label{sec:plot:labs}
\index{plots!labels|(}
\index{plots!title|(}
\index{plots!subtitle|(}
\index{plots!tag|(}
\index{plots!caption|(}
First we describe a feature common to all scales, their \code{name}. The default \code{name} of all scales is the name of the variable or the expression mapped to it. In the case of the \code{x}, \code{y} and \code{z} \emph{aesthetics} the \code{name} given to the scale is used for the axis labels. For other \emph{aesthetics} the name of the scale becomes the ``heading'' or \emph{key title} of the guide or key. All scales have a \code{name} parameter to which a character string or \Rlang expression (see section \ref{sec:plot:plotmath}) can be passed as an argument to override the default.

Whole-plot title, subtitle and caption are not connected to \emph{scales} or \code{data}. A title (\code{label}) and \code{subtitle} can be added least confusingly with function \Rfunction{ggtitle()} by passing either character strings or \Rlang expressions as arguments.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= Orange,}
       \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= age,} \hlkwc{y} \hlstd{= circumference,} \hlkwc{color} \hlstd{= Tree))} \hlopt{+}
  \hlkwd{geom_line}\hlstd{()} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{expand_limits}\hlstd{(}\hlkwc{y} \hlstd{=} \hlnum{0}\hlstd{)} \hlopt{+}
  \hlkwd{scale_x_continuous}\hlstd{(}\hlkwc{name} \hlstd{=} \hlstr{"Time (d)"}\hlstd{)} \hlopt{+}
  \hlkwd{scale_y_continuous}\hlstd{(}\hlkwc{name} \hlstd{=} \hlstr{"Circumference (mm)"}\hlstd{)} \hlopt{+}
  \hlkwd{ggtitle}\hlstd{(}\hlkwc{label} \hlstd{=} \hlstr{"Growth of orange trees"}\hlstd{,}
          \hlkwc{subtitle} \hlstd{=} \hlstr{"Starting from 1968-12-31"}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-axis-labels-01-1}

}


\end{knitrout}

Convenience functions \Rfunction{xlab()} and \Rfunction{ylab()} can be used to set the axis labels to match those in the previous chunk.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
  \hlkwd{xlab}(\hlstr{"\hlkwd{Time} (d)"}) +
  \hlkwd{ylab}(\hlstr{"\hlkwd{Circumference} (mm)"}) +
\end{alltt}
\end{kframe}
\end{knitrout}

Convenience function \Rfunction{labs()} is useful when we use default scales for all the \emph{aesthetics} in a plot but want to manually set axis labels and/or key titles---i.e., the \code{name} of these scales. \Rfunction{labs()} accepts arguments for these names using, as parameter names, the names of the \emph{aesthetics}. It also allows us to set \code{title}, \code{subtitle}, \code{caption} and \code{tag}, of which the first two can also be set with \Rfunction{ggtitle()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= Orange,}
       \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= age,} \hlkwc{y} \hlstd{= circumference,} \hlkwc{color} \hlstd{= Tree))} \hlopt{+}
  \hlkwd{geom_line}\hlstd{()} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{expand_limits}\hlstd{(}\hlkwc{y} \hlstd{=} \hlnum{0}\hlstd{)} \hlopt{+}
  \hlkwd{labs}\hlstd{(}\hlkwc{title} \hlstd{=} \hlstr{"Growth of orange trees"}\hlstd{,}
       \hlkwc{subtitle} \hlstd{=} \hlstr{"Starting from 1968-12-31"}\hlstd{,}
       \hlkwc{caption} \hlstd{=} \hlstr{"see Draper, N. R. and Smith, H. (1998)"}\hlstd{,}
       \hlkwc{tag} \hlstd{=} \hlstr{"A"}\hlstd{,}
       \hlkwc{x} \hlstd{=} \hlstr{"Time (d)"}\hlstd{,}
       \hlkwc{y} \hlstd{=} \hlstr{"Circumference (mm)"}\hlstd{,}
       \hlkwc{color} \hlstd{=} \hlstr{"Tree\textbackslash{}nnumber"}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-axis-labels-03-1}

}


\end{knitrout}

\begin{playground}
Make an empty plot (\code{ggplot()}) and add to it as title an \Rlang expression producing $y = b_0 + b_1 x + b_2 x^2$. (Hint: have a look at the examples for the use of expressions in the \code{plotmath} demo in \Rlang by typing \code{demo(plotmath)} at the \Rlang console.
\end{playground}

%\begin{warningbox}
%Check!!
%When setting or updating labels using either \Rfunction{labs()} or \Rfunction{update\_labels()} be aware that even though \code{color} and \code{color} are synonyms for the same \emph{aesthetics}, the `name' used in the call to \Rfunction{aes()} must match the  `name' used when setting or updating the labels.
%\end{warningbox}
%
%The labels used in keys and axis tick-labels for factor levels can be changed through the different \emph{scales} as described in section \ref{sec:plot:scales} on page \pageref{sec:plot:scales}.
%
\index{plots!tag|)}
\index{plots!caption|)}
\index{plots!subtitle|)}
\index{plots!title|)}
\index{plots!labels|)}

\subsection{Continuous scales}\label{sec:plot:scales:continuous}
\index{grammar of graphics!continuous scales|(}
We start by listing the most frequently used arguments to the continuous scale functions: \code{name}, \code{breaks}, \code{minor\_breaks}, \code{labels}, \code{limits}, \code{expand}, \code{na.value}, \code{trans}, \code{guide}, and \code{position}. The value of \code{name} is used for axis labels or the key title (see previous section). The arguments to \code{breaks} and \code{minor\_breaks} override the default locations of major and minor ticks and grid lines. Setting them to \code{NULL} suppresses the ticks. By default the tick labels are generated from the value of \code{breaks} but an argument to \code{labels} of the same length as \code{breaks} will replace these defaults. The values of \code{limits} determine both the range of values in the data included and the plotting area as described above---by default the out-of-bounds (\code{oob}) observations are replaced by \code{NA} but it is possible to instead ``squish'' these observations towards the edge of the plotting area. The argument to \code{expand} determines the size of the margins or padding added to the area delimited by \code{lims} when setting the ``visual'' plotting area. The value passed to \code{na.value} is used as a replacement for \code{NA} valued observations---most useful for \code{color} and \code{fill} aesthetics. The transformation object passed as an argument to \code{trans} determines the transformation used---the transformation affects the rendering, but breaks and tick labels remain expressed in the original data units. The argument to \code{guide} determines the type of key or removes the default key. Depending on the scale in question not all these parameters are available.


We generate new fake data.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{fake2.data} \hlkwb{<-}
  \hlkwd{data.frame}\hlstd{(}\hlkwc{y} \hlstd{=} \hlkwd{c}\hlstd{(}\hlkwd{rnorm}\hlstd{(}\hlnum{20}\hlstd{,} \hlkwc{mean} \hlstd{=} \hlnum{20}\hlstd{,} \hlkwc{sd} \hlstd{=} \hlnum{5}\hlstd{),}
                   \hlkwd{rnorm}\hlstd{(}\hlnum{20}\hlstd{,} \hlkwc{mean} \hlstd{=} \hlnum{40}\hlstd{,} \hlkwc{sd} \hlstd{=} \hlnum{10}\hlstd{)),}
             \hlkwc{group} \hlstd{=} \hlkwd{factor}\hlstd{(}\hlkwd{c}\hlstd{(}\hlkwd{rep}\hlstd{(}\hlstr{"A"}\hlstd{,} \hlnum{20}\hlstd{),} \hlkwd{rep}\hlstd{(}\hlstr{"B"}\hlstd{,} \hlnum{20}\hlstd{))),}
             \hlkwc{z} \hlstd{=} \hlkwd{rnorm}\hlstd{(}\hlnum{40}\hlstd{,} \hlkwc{mean} \hlstd{=} \hlnum{12}\hlstd{,} \hlkwc{sd} \hlstd{=} \hlnum{6}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}
\subsubsection{Limits}

Limits are relevant to all kinds of \emph{scales}. Limits are set through parameter \code{limits} of the different scale functions. They can also be set with convenience functions \code{xlim()} and \code{ylim()} in the case of the \code{x} and \code{y} \emph{aesthetics}, and more generally with function \code{lims()} which like \code{labs()}, takes arguments named according to the name of the \emph{aesthetics}. The \code{limits} argument of scales accepts vectors, factors or a function computing them from \code{data}. In contrast, the convenience functions do not accept functions as their arguments.

In the next example we set ``hard'' limits, which will exclude some observations from the plot and from any computation of summaries or fitting of smoothers. More exactly, the off-limits observations are converted to \code{NA} values before they are passed as \code{data} to \emph{geometries}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(fake2.data,} \hlkwd{aes}\hlstd{(z, y))} \hlopt{+} \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{scale_y_continuous}\hlstd{(}\hlkwc{limits} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{0}\hlstd{,} \hlnum{100}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}

To set only one limit leaving the other free, we can use \code{NA} as a boundary.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
  \hlkwd{scale_y_continuous}\hlstd{(}\hlkwc{limits} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{50}\hlstd{,} \hlnum{NA}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}

Convenience functions \Rfunction{ylim()} and \Rfunction{xlim()} can be used to set the limits to the default $x$ and $y$ scales in use. We here use \Rfunction{ylim()}, but \Rfunction{xlim()} is identical except for the \emph{scale} it affects.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
  \hlkwd{ylim}\hlstd{(}\hlnum{50}\hlstd{,} \hlnum{NA}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

In general, setting hard limits should be avoided, even though a warning is issued about \code{NA} values being omitted, as it is easy to unwillingly subset the data being plotted.
It is preferable to use function \Rfunction{expand\_limits()} as it safely \emph{expands} the dynamically computed default limits of a scale---the scale limits will grow past the requested expanded limits when needed to accommodate all observations. The arguments to \code{x} and \code{y} are numeric vectors of length one or two each, matching how the limits of the $x$ and $y$ continuous scales are defined. Here we expand the limits to include the origin.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(fake2.data,} \hlkwd{aes}\hlstd{(z, y))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{expand_limits}\hlstd{(}\hlkwc{y} \hlstd{=} \hlnum{0}\hlstd{,} \hlkwc{x} \hlstd{=} \hlnum{0}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-scale-limits-04-1}

}


\end{knitrout}

The \code{expand} parameter of the scales plays a different role than \Rfunction{expand\_limits()}. It controls how much larger the ``visual'' plotting area is compared to the limits of the actual plotting area. In other words, it adds a ``margin'' or padding to the plotting area outside the limits set either dynamically or manually. Very rarely plots are drawn so that observations are plotted on top of the axes, avoiding this is a key role of \code{expand}. Rug plots and marginal annotations will also require the plotting area to be expanded. In \ggplot the default is to always apply some expansion.

We here set the upper limit of the plotting area to be expanded by adding padding to the top and remove the default padding from the bottom of the plotting area.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(fake2.data,}
  \hlkwd{aes}\hlstd{(}\hlkwc{fill} \hlstd{= group,} \hlkwc{color} \hlstd{= group,} \hlkwc{x} \hlstd{= y))} \hlopt{+}
  \hlkwd{stat_density}\hlstd{(}\hlkwc{alpha} \hlstd{=} \hlnum{0.3}\hlstd{)} \hlopt{+}
  \hlkwd{scale_y_continuous}\hlstd{(}\hlkwc{expand} \hlstd{=} \hlkwd{expand_scale}\hlstd{(}\hlkwc{add} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{0}\hlstd{,} \hlnum{0.02}\hlstd{)))}
\end{alltt}
\end{kframe}
\end{knitrout}

Here we instead use a multiplier to a similar effect as above; we add 10\% compared to the range of the \code{limits}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
  \hlkwd{scale_y_continuous}\hlstd{(}\hlkwc{expand} \hlstd{=} \hlkwd{expand_scale}\hlstd{(}\hlkwc{mult} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{0}\hlstd{,} \hlnum{0.1}\hlstd{)))}
\end{alltt}
\end{kframe}
\end{knitrout}

In the case of scales, we cannot reverse their direction through the setting of limits. We need instead to use a transformation as described in section \ref{sec:plot:scales:trans} on page \pageref{sec:plot:scales:trans}. But, inconsistently, \Rfunction{xlim()} and \Rfunction{ylim()} do implicitly allow this transformation through the numeric values passed as limits.

%%% to be moved
%We can also use \code{limits} with discrete scales, listing all or some of the levels of a factor that are to be included in the scale. This works even if the levels are defined in the factor but not present in a given data set, such as after subsetting.

\begin{playground}
Test what the result is when the first limit is larger than the second one. Is it the same as when setting these same values as limits with \code{ylim()}?

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(fake2.data,} \hlkwd{aes}\hlstd{(z, y))} \hlopt{+} \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{scale_y_continuous}\hlstd{(}\hlkwc{limits} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{100}\hlstd{,} \hlnum{0}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}
\end{playground}

\subsubsection{Ticks and their labels}\label{sec:plot:scales:ticks}

Parameter \code{breaks}\index{plots!scales!tick breaks} is used to set the location of ticks along the axis. Parameter \code{labels}\index{plots!scales!tick labels} is used to set the tick labels. Both parameters can be passed either a vector or a function as an argument. The default is to compute ``good'' breaks based on the limits and format the numbers as strings.

When manually setting breaks, we can keep the default computed labels for the \code{breaks}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(fake2.data,} \hlkwd{aes}\hlstd{(z, y))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{scale_y_continuous}\hlstd{(}\hlkwc{breaks} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{20}\hlstd{, pi} \hlopt{*} \hlnum{10}\hlstd{,} \hlnum{40}\hlstd{,} \hlnum{60}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}

The default breaks are computed by function \Rfunction{pretty\_breaks()} from \pkgname{scales}. The argument passed to its parameter \code{n} determines the target number ticks to be generated automatically, but the actual number of ticks computed may be slightly different depending on the range of the data.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
  \hlkwd{scale_y_continuous}\hlstd{(}\hlkwc{breaks} \hlstd{= scales}\hlopt{::}\hlkwd{pretty_breaks}\hlstd{(}\hlkwc{n} \hlstd{=} \hlnum{7}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}

We can set tick labels manually, in parallel to the setting of \code{breaks} by passing as arguments two vectors of equal length. In the next example we use an expression to obtain a Greek letter.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(fake2.data,} \hlkwd{aes}\hlstd{(z, y))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{scale_y_continuous}\hlstd{(}\hlkwc{breaks} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{20}\hlstd{, pi} \hlopt{*} \hlnum{10}\hlstd{,} \hlnum{40}\hlstd{,} \hlnum{60}\hlstd{),}
                     \hlkwc{labels} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"20"}\hlstd{,} \hlkwd{expression}\hlstd{(}\hlnum{10}\hlopt{*}\hlstd{pi),} \hlstr{"40"}\hlstd{,} \hlstr{"60"}\hlstd{))}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-scale-ticks-02-1}

}


\end{knitrout}

Package \pkgname{scales} provides several functions for the automatic generation of labels. For example, to display tick labels as percentages for data available as decimal fractions, we can use function \code{scales::percent()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(fake2.data,} \hlkwd{aes}\hlstd{(z, y} \hlopt{/} \hlkwd{max}\hlstd{(y)))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{scale_y_continuous}\hlstd{(}\hlkwc{labels} \hlstd{= scales}\hlopt{::}\hlstd{percent)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-scale-ticks-03-1}

}


\end{knitrout}

For currency, we can use \code{scales::dollar()}, to include commas separating thousands, millions, so on, we can use \code{scales::comma()}, and for numbers formatted using exponents of 10---useful for logarithmic-transformed scales---we can use \code{scales::scientific\_format()}. It is also possible to use user-defined functions both for breaks and labels.

\subsubsection{Transformed scales}\label{sec:plot:scales:trans}

The\index{plots!scales!transformations} default scales used by the \code{x} and \code{y} aesthetics, \ggscale{scale\_x\_continuous()} and \ggscale{scale\_y\_continuous()}, accept a user-supplied transformation function as an argument to \code{trans} with default code{trans = "identity"} (no transformation). In addition, there are predefined convenience scale functions for \code{log10}, \code{sqrt} and \code{reverse}.

\begin{warningbox}
  Similar to the maths functions of \Rlang, the name of the scales are \ggscale{scale\_x\_log10()} and \ggscale{scale\_y\_log10()} rather than \ggscale{scale\_y\_log()} because in \Rlang, the function \code{log} returns the natural logarithm.
\end{warningbox}

We can use \ggscale{scale\_x\_reverse()} to reverse the direction of a continuous scale,

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(fake2.data,} \hlkwd{aes}\hlstd{(z, y))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{scale_x_reverse}\hlstd{()}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-scale-trans-01-1}

}


\end{knitrout}

Axis tick-labels display the original values before applying the transformation. The \code{"breaks"} need to be given in the original scale as well. We use \ggscale{scale\_y\_log10()} to apply a $\log_{10}$ transformation to the $y$ values.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
  \hlkwd{scale_y_log10}\hlstd{(}\hlkwc{breaks}\hlstd{=}\hlkwd{c}\hlstd{(}\hlnum{10}\hlstd{,}\hlnum{20}\hlstd{,}\hlnum{50}\hlstd{,}\hlnum{100}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}

Using a transformation in a scale is not equivalent to applying the same transformation on the fly when mapping a variable to the $x$  (or $y$) \emph{aesthetic} as this results in tick-labels expressed in transformed values.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(fake2.data,} \hlkwd{aes}\hlstd{(z,} \hlkwd{log10}\hlstd{(y)))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}

We show next how to specify a transformation to a continuous scale, using a predefined ``transformation'' object.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
  \hlkwd{scale_y_continuous}\hlstd{(}\hlkwc{trans} \hlstd{=} \hlstr{"reciprocal"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

Natural logarithms are important in growth analysis as the slope against time gives the relative growth rate. We show this with the \code{Orange} data set.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= Orange,}
       \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= age,} \hlkwc{y} \hlstd{= circumference,} \hlkwc{color} \hlstd{= Tree))} \hlopt{+}
  \hlkwd{geom_line}\hlstd{()} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{scale_y_continuous}\hlstd{(}\hlkwc{trans} \hlstd{=} \hlstr{"log"}\hlstd{,} \hlkwc{breaks} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{20}\hlstd{,} \hlnum{50}\hlstd{,} \hlnum{100}\hlstd{,} \hlnum{200}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}

\subsubsection{Position of $x$ and $y$ axes}
\index{plots!axis position}

The default position of axes can be changed through parameter \code{position}, using character constants \code{"bottom"}, \code{"top"}, \code{"left"} and \code{"right"}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(wt, mpg))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{scale_x_continuous}\hlstd{(}\hlkwc{position} \hlstd{=} \hlstr{"top"}\hlstd{)} \hlopt{+}
  \hlkwd{scale_y_continuous}\hlstd{(}\hlkwc{position} \hlstd{=} \hlstr{"right"}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-axis-position-01-1}

}


\end{knitrout}

\subsubsection{Secondary axes}

It\index{plots!secondary axes} is also possible to add secondary axes with ticks displayed in a transformed scale.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,} \hlkwd{aes}\hlstd{(wt, mpg))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{scale_y_continuous}\hlstd{(}\hlkwc{sec.axis} \hlstd{=} \hlkwd{sec_axis}\hlstd{(}\hlopt{~} \hlstd{.} \hlopt{^-}\hlnum{1}\hlstd{,} \hlkwc{name} \hlstd{=} \hlstr{"1/y"}\hlstd{) )}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-axis-secondary-01-1}

}


\end{knitrout}

It is also possible to use different \code{breaks} and \code{labels} than for the main axes, and to provide a different \code{name} to be used as a secondary axis label.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
  \hlkwd{scale_y_continuous}\hlstd{(}\hlkwc{sec.axis} \hlstd{=} \hlkwd{sec_axis}\hlstd{(}\hlopt{~} \hlstd{.} \hlopt{/} \hlnum{2.3521458}\hlstd{,} \hlkwc{name} \hlstd{=} \hlkwd{expression}\hlstd{(km} \hlopt{/} \hlstd{l),}
                                         \hlkwc{breaks} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{5}\hlstd{,} \hlnum{7.5}\hlstd{,} \hlnum{10}\hlstd{,} \hlnum{12.5}\hlstd{)))}
\end{alltt}
\end{kframe}
\end{knitrout}
\index{grammar of graphics!continuous scales|)}

\subsection{Time and date scales for $x$ and $y$}\label{sec:plot:scales:time:date}
\index{grammar of graphics!time and date scales|(}
In \Rlang and many other computing languages, time values are stored as integer or numeric values subject to special interpretation. Times stored as objects of class \code{POSIXct} can be mapped to continuous \emph{aesthetics} such as $x$ and $y$. Special scales are available for these quantities.

We can set limits and breaks using constants as time or dates. These are most easily input with the functions in packages \pkgname{lubridate} or \pkgname{anytime}.


\begin{warningbox}
Warnings are issued in the next two chunks as we are using scale limits to subset a part of the observations present in \code{data}.
\end{warningbox}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= weather_wk_25_2019.tb,}
       \hlkwd{aes}\hlstd{(}\hlkwd{with_tz}\hlstd{(time,} \hlkwc{tzone} \hlstd{=} \hlstr{"EET"}\hlstd{), air_temp_C))} \hlopt{+}
  \hlkwd{geom_line}\hlstd{()} \hlopt{+}
  \hlkwd{scale_x_datetime}\hlstd{(}\hlkwc{name} \hlstd{=} \hlkwa{NULL}\hlstd{,}
                   \hlkwc{breaks} \hlstd{=} \hlkwd{ymd_hm}\hlstd{(}\hlstr{"2019-06-11 12:00"}\hlstd{,} \hlkwc{tz} \hlstd{=} \hlstr{"EET"}\hlstd{)} \hlopt{+} \hlkwd{days}\hlstd{(}\hlnum{0}\hlopt{:}\hlnum{1}\hlstd{),}
                   \hlkwc{limits} \hlstd{=} \hlkwd{ymd_hm}\hlstd{(}\hlstr{"2019-06-11 00:00"}\hlstd{,} \hlkwc{tz} \hlstd{=} \hlstr{"EET"}\hlstd{)} \hlopt{+} \hlkwd{days}\hlstd{(}\hlkwd{c}\hlstd{(}\hlnum{0}\hlstd{,} \hlnum{2}\hlstd{)))} \hlopt{+}
  \hlkwd{scale_y_continuous}\hlstd{(}\hlkwc{name} \hlstd{=} \hlstr{"Air temperature (C)"}\hlstd{)} \hlopt{+}
  \hlkwd{expand_limits}\hlstd{(}\hlkwc{y} \hlstd{=} \hlnum{0}\hlstd{)}
\end{alltt}


{\ttfamily\noindent\color{warningcolor}{\#\# Warning: Removed 7199 row(s) containing missing values (geom\_path).}}\end{kframe}

{\centering \includegraphics[width=.9\textwidth]{figure/pos-scale-datetime-01-1}

}


\end{knitrout}

By\index{plots!scales!axis labels} default the tick labels produced and their formatting are automatically selected based on the extent of the time data. For example, if we have all data collected within a single day, then the tick labels will show hours and minutes. If we plot data for several years, the labels will show the date portion of the time instant. The default is frequently good enough, but it is possible, as for numbers, to use different formatter functions to generate the tick labels.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= weather_wk_25_2019.tb,}
       \hlkwd{aes}\hlstd{(}\hlkwd{with_tz}\hlstd{(time,} \hlkwc{tzone} \hlstd{=} \hlstr{"EET"}\hlstd{), air_temp_C))} \hlopt{+}
  \hlkwd{geom_line}\hlstd{()} \hlopt{+}
  \hlkwd{scale_x_datetime}\hlstd{(}\hlkwc{name} \hlstd{=} \hlkwa{NULL}\hlstd{,}
                   \hlkwc{date_breaks} \hlstd{=} \hlstr{"1 hour"}\hlstd{,}
                   \hlkwc{limits} \hlstd{=} \hlkwd{ymd_hm}\hlstd{(}\hlstr{"2019-06-16 00:00"}\hlstd{,} \hlkwc{tz} \hlstd{=} \hlstr{"EET"}\hlstd{)} \hlopt{+} \hlkwd{hours}\hlstd{(}\hlkwd{c}\hlstd{(}\hlnum{6}\hlstd{,} \hlnum{18}\hlstd{)),}
                   \hlkwc{date_labels} \hlstd{=} \hlstr{"%H:%M"}\hlstd{)} \hlopt{+}
  \hlkwd{scale_y_continuous}\hlstd{(}\hlkwc{name} \hlstd{=} \hlstr{"Air temperature (C)"}\hlstd{)} \hlopt{+}
  \hlkwd{expand_limits}\hlstd{(}\hlkwc{y} \hlstd{=} \hlnum{0}\hlstd{)}
\end{alltt}


{\ttfamily\noindent\color{warningcolor}{\#\# Warning: Removed 9359 row(s) containing missing values (geom\_path).}}\end{kframe}

{\centering \includegraphics[width=.9\textwidth]{figure/pos-scale-datetime-02-1}

}


\end{knitrout}

\begin{playground}
The formatting strings used are those supported by \Rfunction{strptime()} and \code{help(strptime)} lists them. Change, in the two examples above, the $y$-axis labels used and the limits---e.g., include a single hour or a whole week of data, check which tick labels are produced by default and then pass as an argument to \code{date\_labels} different format strings, taking into account that in addition to the \emph{conversion specification} codes, format strings can include additional text.
\end{playground}
\index{grammar of graphics!time and date scales|)}

\subsection{Discrete scales for $x$ and $y$}
\index{grammar of graphics!discrete scales|(}

In\index{plots!scales!limits} the case of ordered or unordered factors, the tick labels are by default the names of the factor levels. Consequently, one roundabout way of obtaining the desired tick labels is to set them as factor levels. This approach is not recommended as in many cases the text of the desired tick labels may not be recognized as a valid name making the code using them more difficult to type in scripts or at the command prompt. It is best to use simple mnemonic short names for factor levels and variables, and to set suitable labels through \emph{scales} when plotting, as we will show here.

We can use \ggscale{scale\_x\_discrete()} to reorder and select the columns without altering the data. If we use this approach to subset the data, then to avoid warnings we need to add \code{na.rm = TRUE}. We additionally use \code{scale\_x\_discrete} to convert level names to uppercase.


\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(mpg,} \hlkwd{aes}\hlstd{(class, hwy))} \hlopt{+}
  \hlkwd{stat_summary}\hlstd{(}\hlkwc{geom} \hlstd{=} \hlstr{"col"}\hlstd{,} \hlkwc{fun} \hlstd{= mean,} \hlkwc{na.rm} \hlstd{=} \hlnum{TRUE}\hlstd{)} \hlopt{+}
  \hlkwd{scale_x_discrete}\hlstd{(}\hlkwc{limits} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"compact"}\hlstd{,} \hlstr{"subcompact"}\hlstd{,} \hlstr{"midsize"}\hlstd{),}
                   \hlkwc{labels} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"COMPACT"}\hlstd{,} \hlstr{"SUBCOMPACT"}\hlstd{,} \hlstr{"MIDSIZE"}\hlstd{))}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-scale-discrete-10-1}

}


\end{knitrout}

If, as in the previous example, only the case of character strings needs to be changed, passing function \Rfunction{toupper()} or \Rfunction{tolower()} allows a more general and less error-prone approach. In fact any function, user defined or not, which converts the values of \code{limits} into the desired values can be passed as an argument to \code{labels}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
  \hlkwd{scale_x_discrete}\hlstd{(}\hlkwc{limits} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"compact"}\hlstd{,} \hlstr{"subcompact"}\hlstd{,} \hlstr{"midsize"}\hlstd{),}
                   \hlkwc{labels} \hlstd{= toupper)}
\end{alltt}
\end{kframe}
\end{knitrout}

Alternatively, we can change the order of the columns in the plot by reordering the levels of factor \code{mpg\$class}. This approach makes sense if the ordering needs to be done programmatically based on values in \code{data}. See section \ref{sec:calc:factors} on page \pageref{sec:calc:factors} for details. The example below shows how to reorder the columns, corresponding to the levels of \code{class} based on the \code{mean()} of \code{hwy}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(mpg,} \hlkwd{aes}\hlstd{(}\hlkwd{reorder}\hlstd{(}\hlkwc{x} \hlstd{=} \hlkwd{factor}\hlstd{(class),} \hlkwc{X} \hlstd{= hwy,} \hlkwc{FUN} \hlstd{= mean), hwy))} \hlopt{+}
  \hlkwd{stat_summary}\hlstd{(}\hlkwc{geom} \hlstd{=} \hlstr{"col"}\hlstd{,} \hlkwc{fun} \hlstd{= mean)}
\end{alltt}
\end{kframe}
\end{knitrout}
\index{grammar of graphics!discrete scales|)}


\subsection{Size}
\index{grammar of graphics!size scales|(}
For the \code{size} \emph{aesthetic}, several scales are available, both discrete and continuous. They do not differ much from those already described above. \emph{Geometries} \gggeom{geom\_point()}, \gggeom{geom\_line()}, \gggeom{geom\_hline()}, \gggeom{geom\_vline()}, \gggeom{geom\_text()}, \gggeom{geom\_label()} obey \code{size} as expected. In the case of \gggeom{geom\_bar()}, \gggeom{geom\_col()}, \gggeom{geom\_area()} and all other geometric elements bordered by lines, \code{size} is obeyed by these border lines. In fact, other aesthetics natural for lines such as \code{linetype} also apply to these borders.

When using \code{size} scales, \code{breaks} and \code{labels} affect the key or \code{guide}. In scales that produce a key passing \code{guide = FALSE} removes the key corresponding to the scale.
\index{grammar of graphics!size scales|)}

\subsection{Color and fill}
\index{grammar of graphics!color and fill scales|(}
\index{plots!with colors|(}

color and fill scales are similar, but they affect different elements of the plot. All visual elements in a plot obey the \code{color} \emph{aesthetic}, but only elements that have an inner region and a boundary, obey both \code{color} and \code{fill} \emph{aesthetics}. There are separate but equivalent sets of scales available for these two \emph{aesthetics}. We will describe in more detail the \code{color} \emph{aesthetic} and give only some examples for \code{fill}. We will, however, start by reviewing how colors are defined and used in \Rlang.

\subsubsection{Color definitions in R}\label{sec:plot:colors}
\index{color!definitions|(}
Colors can be specified in \Rlang not only through character strings with the names of previously defined colors, but also directly as strings describing the RGB (red, green and blue) components as hexadecimal numbers (on base 16 expressed using 0, 1, 2, 3, 4, 6, 7, 8, 9, A, B, C, D, E, and F as ``digits'') such as \code{"\#FFFFFF"} for white or \code{"\#000000"} for black, or \code{"\#FF0000"} for the brightest available pure red.

The list of color names\index{color!names} known to \Rlang can be obtained be typing \code{colors()} at the \Rlang console.
Given the number of colors available, we may want to subset them based on their names. Function \code{colors()} returns a character vector. We can use \code{grep()} to find the names containing a given character substring, in this example \code{"dark"}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{length}\hlstd{(}\hlkwd{colors}\hlstd{())}
\end{alltt}
\begin{verbatim}
## [1] 657
\end{verbatim}
\begin{alltt}
\hlkwd{grep}\hlstd{(}\hlstr{"dark"}\hlstd{,}\hlkwd{colors}\hlstd{(),} \hlkwc{value} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\end{alltt}
\begin{verbatim}
##  [1] "darkblue"        "darkcyan"        "darkgoldenrod"   "darkgoldenrod1"
##  [5] "darkgoldenrod2"  "darkgoldenrod3"  "darkgoldenrod4"  "darkgray"
##  [9] "darkgreen"       "darkgrey"        "darkkhaki"       "darkmagenta"
## [13] "darkolivegreen"  "darkolivegreen1" "darkolivegreen2" "darkolivegreen3"
## [17] "darkolivegreen4" "darkorange"      "darkorange1"     "darkorange2"
## [21] "darkorange3"     "darkorange4"     "darkorchid"      "darkorchid1"
## [25] "darkorchid2"     "darkorchid3"     "darkorchid4"     "darkred"
## [29] "darksalmon"      "darkseagreen"    "darkseagreen1"   "darkseagreen2"
## [33] "darkseagreen3"   "darkseagreen4"   "darkslateblue"   "darkslategray"
## [37] "darkslategray1"  "darkslategray2"  "darkslategray3"  "darkslategray4"
## [41] "darkslategrey"   "darkturquoise"   "darkviolet"
\end{verbatim}
\end{kframe}
\end{knitrout}

To retrieve the RGB values for a color definition we use:

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{col2rgb}\hlstd{(}\hlstr{"purple"}\hlstd{)}
\end{alltt}
\begin{verbatim}
##       [,1]
## red    160
## green   32
## blue   240
\end{verbatim}
\begin{alltt}
\hlkwd{col2rgb}\hlstd{(}\hlstr{"#FF0000"}\hlstd{)}
\end{alltt}
\begin{verbatim}
##       [,1]
## red    255
## green    0
## blue     0
\end{verbatim}
\end{kframe}
\end{knitrout}

Color definitions in \Rlang can contain a \emph{transparency} described by an \code{alpha} value, which by default is not returned.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{col2rgb}\hlstd{(}\hlstr{"purple"}\hlstd{,} \hlkwc{alpha} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\end{alltt}
\begin{verbatim}
##       [,1]
## red    160
## green   32
## blue   240
## alpha  255
\end{verbatim}
\end{kframe}
\end{knitrout}

With function \Rfunction{rgb()} we can define new colors. Enter \code{help(rgb)} for more details.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{rgb}\hlstd{(}\hlnum{1}\hlstd{,} \hlnum{1}\hlstd{,} \hlnum{0}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "#FFFF00"
\end{verbatim}
\begin{alltt}
\hlkwd{rgb}\hlstd{(}\hlnum{1}\hlstd{,} \hlnum{1}\hlstd{,} \hlnum{0}\hlstd{,} \hlkwc{names} \hlstd{=} \hlstr{"my.color"}\hlstd{)}
\end{alltt}
\begin{verbatim}
##  my.color
## "#FFFF00"
\end{verbatim}
\begin{alltt}
\hlkwd{rgb}\hlstd{(}\hlnum{255}\hlstd{,} \hlnum{255}\hlstd{,} \hlnum{0}\hlstd{,} \hlkwc{names} \hlstd{=} \hlstr{"my.color"}\hlstd{,} \hlkwc{maxColorValue} \hlstd{=} \hlnum{255}\hlstd{)}
\end{alltt}
\begin{verbatim}
##  my.color
## "#FFFF00"
\end{verbatim}
\end{kframe}
\end{knitrout}

As described above, colors can be defined in the RGB \emph{color space}, however, other color models such as HSV (hue, saturation, value) can be also used to define colours.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{hsv}\hlstd{(}\hlkwd{c}\hlstd{(}\hlnum{0}\hlstd{,}\hlnum{0.25}\hlstd{,}\hlnum{0.5}\hlstd{,}\hlnum{0.75}\hlstd{,}\hlnum{1}\hlstd{),} \hlnum{0.5}\hlstd{,} \hlnum{0.5}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "#804040" "#608040" "#408080" "#604080" "#804040"
\end{verbatim}
\end{kframe}
\end{knitrout}

Probably a more useful flavor of HSV colors for use in scales are those returned by function \Rfunction{hcl()} for hue, chroma and luminance. While the ``value'' and ``saturation'' in HSV are based on physical values, the ``chroma'' and ``luminance'' values in HCL are based on human visual perception. Colours with equal luminance will be seen as equally bright by an ``average'' human. In a scale based on different hues but equal chroma and luminance values, as used by package \ggplot, all colours are perceived as equally bright. The hues need to be expressed as angles in degrees, with values between zero and 360.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{hcl}\hlstd{(}\hlkwd{c}\hlstd{(}\hlnum{0}\hlstd{,}\hlnum{0.25}\hlstd{,}\hlnum{0.5}\hlstd{,}\hlnum{0.75}\hlstd{,}\hlnum{1}\hlstd{)} \hlopt{*} \hlnum{360}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "#FFC5D0" "#D4D8A7" "#99E2D8" "#D5D0FC" "#FFC5D0"
\end{verbatim}
\end{kframe}
\end{knitrout}

It is also important to remember that humans can only distinguish a limited set of colours, and even smaller color gamuts can be reproduced by screens and printers. Furthermore, variation from individual to individual exists in color perception, including different types of color blindness. It is important to take this into account when choosing the colors used in illustrations.
\index{color!definitions|)}

\subsection{Continuous color-related scales}
\sloppy
Continuous color scales \ggscale{scale\_color\_continuous()}, \ggscale{scale\_color\_gradient()}, \ggscale{scale\_color\_gradient2()},  \ggscale{scale\_color\_gradientn()}, \ggscale{scale\_color\_date()} and \ggscale{scale\_color\_datetime()}, give a smooth continuous gradient between two or more colours. They are used with \code{numeric}, \code{date} and \code{datetime} data. A corresponding set of \code{fill} scales is also available. Other scales like \ggscale{scale\_color\_viridis\_c()} and \ggscale{scale\_color\_distiller()} are based on the use of ready-made palettes of sets of color gradients chosen to work well together under multiple conditions or for human vision including different types of color blindness.

\subsection{Discrete color-related scales}
\sloppy
Color scales \ggscale{scale\_color\_discrete()}, \ggscale{scale\_color\_hue()}, \ggscale{scale\_color\_gray()} are used with categorical data stored as factors. Other scales like \ggscale{scale\_color\_viridis\_d()} and \ggscale{scale\_color\_brewer()} provide discrete sets of colours based on palettes.

\subsection{Identity scales}
\index{grammar of graphics!identity color scales|(}
In the case of identity scales, the mapping is one to-one to the data. For example, if we map the \code{color} or \code{fill} \emph{aesthetic} to a variable using \ggscale{scale\_color\_identity()} or \ggscale{scale\_fill\_identity()}, the mapped variable must already contain valid color definitions. In the case of mapping \code{alpha}, the variable must contain numeric values in the range 0 to 1.

We create a data frame containing a variable \code{colors} containing character strings interpretable as the names of color definitions known to \Rlang. We then use them directly in the plot.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{df99} \hlkwb{<-} \hlkwd{data.frame}\hlstd{(}\hlkwc{x} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{10}\hlstd{,} \hlkwc{y} \hlstd{=} \hlkwd{dnorm}\hlstd{(}\hlnum{10}\hlstd{),} \hlkwc{colors} \hlstd{=} \hlkwd{rep}\hlstd{(}\hlkwd{c}\hlstd{(}\hlstr{"red"}\hlstd{,} \hlstr{"blue"}\hlstd{),} \hlnum{5}\hlstd{))}

\hlkwd{ggplot}\hlstd{(df99,} \hlkwd{aes}\hlstd{(x, y,} \hlkwc{color} \hlstd{= colors))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{scale_color_identity}\hlstd{()}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-scale-color-10-1}

}


\end{knitrout}

\begin{playground}
How does the plot look, if the identity scale is deleted from the example above? Edit and re-run the example code.

While using the identity scale, how would you need to change the code example above, to produce a plot with green and purple points?
\end{playground}
\index{grammar of graphics!identity color scales|)}
\index{plots!with colors|)}
\index{grammar of graphics!color and fill scales|)}
\index{grammar of graphics!scales|)}

\section{Adding annotations}\label{sec:plot:annotations}
\index{grammar of graphics!annotations|(}
The idea of annotations is that they add plot elements that are not directly connected with \code{data}, which we could call ``decorations'' such as arrows used to highlight some feature of the data, specific points along an axis, etc. They are referenced to the ``natural'' coordinates used to plot the observations, but are elements that do not represent observations or summaries computed from the observations.  Annotations are added to a ggplot with \Rfunction{annotate()} as plot layers (each call to \code{annotate()} creates a new layer). To achieve the behavior expected of annotations, \Rfunction{annotate()} does not inherit the default \code{data} or \code{mapping} of variables to \emph{aesthetics}. Annotations frequently make use of \code{"text"} or \code{"label"} \emph{geometries} with character strings as data, possibly to be parsed as expressions. However, for example, the \code{"segment"} geometry can be used to add arrows.

\begin{warningbox}
While layers added to a plot directly using \emph{geometries} and \emph{statistics} respect faceting, annotation layers added with \Rfunction{annotate()} are replicated unchanged in every panel of a faceted plot. The reason is that annotation layers accept \emph{aesthetics} only as constant values which are the same for every panel as no grouping is possible without a \code{mapping} to \code{data}.
\end{warningbox}

We show a simple example using \code{"text"} as \emph{geometry}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(fake2.data,} \hlkwd{aes}\hlstd{(z, y))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{annotate}\hlstd{(}\hlkwc{geom} \hlstd{=} \hlstr{"text"}\hlstd{,}
           \hlkwc{label} \hlstd{=} \hlstr{"origin"}\hlstd{,}
           \hlkwc{x} \hlstd{=} \hlnum{0}\hlstd{,} \hlkwc{y} \hlstd{=} \hlnum{0}\hlstd{,}
           \hlkwc{color} \hlstd{=} \hlstr{"blue"}\hlstd{,}
           \hlkwc{size}\hlstd{=}\hlnum{4}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-annotate-01-1}

}


\end{knitrout}

\begin{playground}
Play with the values of the arguments to \Rfunction{annotate()} to vary the position, size, color, font family, font face, rotation angle and justification of the annotation.
\end{playground}

\index{plots!insets as annotations|(}
It is relatively common to use inset tables, plots, bitmaps or vector plots as annotations. With \Rfunction{annotation\_custom()}, grobs (\pkgname{grid} graphical object) can be added to a ggplot. To add another or the same plot as an inset, we first need to convert it into a grob. In the case of a ggplot we use \Rfunction{ggplotGrob()}. In this example the inset is a zoomed-in window into the main plot. In addition to the grob, we need to provide the coordinates expressed in ``natural'' data units of the main plot for the location of the grob.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{p} \hlkwb{<-} \hlkwd{ggplot}\hlstd{(fake2.data,} \hlkwd{aes}\hlstd{(z, y))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()}
\hlstd{p} \hlopt{+} \hlkwd{expand_limits}\hlstd{(}\hlkwc{x} \hlstd{=} \hlnum{40}\hlstd{)} \hlopt{+}
  \hlkwd{annotation_custom}\hlstd{(}\hlkwd{ggplotGrob}\hlstd{(p} \hlopt{+} \hlkwd{coord_cartesian}\hlstd{(}\hlkwc{xlim} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{5}\hlstd{,} \hlnum{10}\hlstd{),} \hlkwc{ylim} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{20}\hlstd{,} \hlnum{40}\hlstd{))} \hlopt{+}
                               \hlkwd{theme_bw}\hlstd{(}\hlnum{10}\hlstd{)),}
                    \hlkwc{xmin} \hlstd{=} \hlnum{21}\hlstd{,} \hlkwc{xmax} \hlstd{=} \hlnum{40}\hlstd{,} \hlkwc{ymin} \hlstd{=} \hlnum{30}\hlstd{,} \hlkwc{ymax} \hlstd{=} \hlnum{60}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-inset-01-1}

}


\end{knitrout}

This approach has the limitation that if used together with faceting, the inset will be the same for each plot panel. See section \ref{sec:plot:insets} on page \pageref{sec:plot:insets} for \emph{geometries} that can be used to add insets.
\index{plots!insets as annotations|)}

In the next example, in addition to adding expressions as annotations, we also pass expressions as tick labels through the scale. Do notice that we use recycling for setting the breaks, as \code{c(0, 0.5, 1, 1.5, 2) * pi} is equivalent to \code{c(0, 0.5 * pi, pi, 1.5 * pi, 2 * pi}. Annotations are plotted at their own position, unrelated to any observation in the data, but using the same coordinates and units as for plotting the data.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwd{data.frame}\hlstd{(}\hlkwc{x} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{0}\hlstd{,} \hlnum{2} \hlopt{*} \hlstd{pi)),} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= x))} \hlopt{+}
  \hlkwd{stat_function}\hlstd{(}\hlkwc{fun} \hlstd{= sin)} \hlopt{+}
  \hlkwd{scale_x_continuous}\hlstd{(}
    \hlkwc{breaks} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{0}\hlstd{,} \hlnum{0.5}\hlstd{,} \hlnum{1}\hlstd{,} \hlnum{1.5}\hlstd{,} \hlnum{2}\hlstd{)} \hlopt{*} \hlstd{pi,}
    \hlkwc{labels} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"0"}\hlstd{,} \hlkwd{expression}\hlstd{(}\hlnum{0.5}\hlopt{~}\hlstd{pi),} \hlkwd{expression}\hlstd{(pi),}
             \hlkwd{expression}\hlstd{(}\hlnum{1.5}\hlopt{~}\hlstd{pi),} \hlkwd{expression}\hlstd{(}\hlnum{2}\hlopt{~}\hlstd{pi)))} \hlopt{+}
  \hlkwd{labs}\hlstd{(}\hlkwc{y} \hlstd{=} \hlstr{"sin(x)"}\hlstd{)} \hlopt{+}
  \hlkwd{annotate}\hlstd{(}\hlkwc{geom} \hlstd{=} \hlstr{"text"}\hlstd{,}
           \hlkwc{label} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"+"}\hlstd{,} \hlstr{"-"}\hlstd{),}
           \hlkwc{x} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{0.5}\hlstd{,} \hlnum{1.5}\hlstd{)} \hlopt{*} \hlstd{pi,} \hlkwc{y} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{0.5}\hlstd{,} \hlopt{-}\hlnum{0.5}\hlstd{),}
           \hlkwc{size} \hlstd{=} \hlnum{20}\hlstd{)} \hlopt{+}
  \hlkwd{annotate}\hlstd{(}\hlkwc{geom} \hlstd{=} \hlstr{"point"}\hlstd{,}
           \hlkwc{color} \hlstd{=} \hlstr{"red"}\hlstd{,}
           \hlkwc{shape} \hlstd{=} \hlnum{21}\hlstd{,}
           \hlkwc{fill} \hlstd{=} \hlstr{"white"}\hlstd{,}
           \hlkwc{x} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{0}\hlstd{,} \hlnum{1}\hlstd{,} \hlnum{2}\hlstd{)} \hlopt{*} \hlstd{pi,} \hlkwc{y} \hlstd{=} \hlnum{0}\hlstd{,}
           \hlkwc{size} \hlstd{=} \hlnum{6}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-annotate-03-1}

}


\end{knitrout}

\begin{playground}
Modify the plot above to show the cosine instead of the sine function, replacing \code{sin} with \code{cos}. This is easy, but the catch is that you will need to relocate the annotations.
\end{playground}

\begin{infobox}
We cannot use \Rfunction{annotate()} with \code{geom = "vline"} or \code{geom = "hline"} as we can use \code{geom = "line"} or \code{geom = "segment"}. Instead, \gggeom{geom\_vline()} and/or  \gggeom{geom\_hline()} can be used directly passing constant arguments to them. See section \ref{sec:plot:line} on page \pageref{sec:plot:vhline}.
\end{infobox}
\index{grammar of graphics!annotations|)}

\section{Coordinates and circular plots}\label{sec:plot:circular}\label{sec:plot:coord}
\index{grammar of graphics!polar coordinates|(}
\index{plots!circular|(}
Circular plots can be thought of as plots equivalent to those described earlier in this chapter but drawn using a different system of coordinates. This is a key insight, that the grammar of graphics as implemented in \ggplot makes use of. To obtain circular plots we use the same \emph{geometries}, \emph{statistics} and \emph{scales} we have been using with the default system of cartesian coordinates. The only thing that we need to do is to add \ggcoordinate{coord\_polar()} to override the default. Of course only some observed quantities can be better perceived in circular plots than in cartesian plots. Here we add a new ``word'' to the grammar of graphics, \textit{coordinates}, such as \ggcoordinate{coord\_polar()}.
When using polar coordinates, the \code{x} and \code{y} \textit{aesthetics} correspond to the angle and radial distance, respectively.

\subsection{Wind-rose plots}
\index{plots!wind rose|(}
Some types of data are more naturally expressed on polar coordinates than on cartesian coordinates. The clearest example is wind direction, from which the name derives. In some cases of time series data with a strong periodic variation, polar coordinates can be used to highlight any phase shifts or changes in frequency. A more mundane application is to plot variation in a response variable through the day with a clock-face-like representation of time of day.

Wind rose plots are frequently histograms or density plots drawn on a polar system of coordinates (see sections \ref{sec:plot:histogram} and \ref{sec:plot:density} on pages \pageref{sec:plot:histogram} and \pageref{sec:plot:density}, respectively for a description of the use of these \emph{statistics} and \emph{geometries}). We will use them for examples where we plot wind speed and direction data, measured once per minute during 24~h (from package \pkgname{learnrbook}).

Here we plot a circular histogram of wind directions with 30-degree-wide bins. We use \ggstat{stat\_bin()}. The counts represent the number of minutes during 24~h when the wind direction was within each bin.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{p} \hlkwb{<-} \hlkwd{ggplot}\hlstd{(viikki_d29.dat,} \hlkwd{aes}\hlstd{(WindDir_D1_WVT))}  \hlopt{+}
  \hlkwd{coord_polar}\hlstd{()} \hlopt{+}
  \hlkwd{scale_x_continuous}\hlstd{(}\hlkwc{breaks} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{0}\hlstd{,} \hlnum{90}\hlstd{,} \hlnum{180}\hlstd{,} \hlnum{270}\hlstd{),}
                     \hlkwc{labels} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"N"}\hlstd{,} \hlstr{"E"}\hlstd{,} \hlstr{"S"}\hlstd{,} \hlstr{"W"}\hlstd{),}
                     \hlkwc{limits} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{0}\hlstd{,} \hlnum{360}\hlstd{),}
                     \hlkwc{expand} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{0}\hlstd{,} \hlnum{0}\hlstd{),}
                     \hlkwc{name} \hlstd{=} \hlstr{"Wind direction"}\hlstd{)}
\hlstd{p} \hlopt{+} \hlkwd{stat_bin}\hlstd{(}\hlkwc{color} \hlstd{=} \hlstr{"black"}\hlstd{,} \hlkwc{fill} \hlstd{=} \hlstr{"gray50"}\hlstd{,} \hlkwc{geom} \hlstd{=} \hlstr{"bar"}\hlstd{,}
             \hlkwc{binwidth} \hlstd{=} \hlnum{30}\hlstd{,} \hlkwc{na.rm} \hlstd{=} \hlnum{TRUE}\hlstd{)} \hlopt{+} \hlkwd{labs}\hlstd{(}\hlkwc{y} \hlstd{=} \hlstr{"Frequency"}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-wind-05-1}

}


\end{knitrout}

For an equivalent plot, using an empirical density, we have to use \ggstat{stat\_density()} instead of \ggstat{stat\_bin()}, \gggeom{geom\_polygon()} instead of \gggeom{geom\_bar()} and change the \code{name} of the \code{y} scale.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{p} \hlopt{+} \hlkwd{stat_density}\hlstd{(}\hlkwc{color} \hlstd{=} \hlstr{"black"}\hlstd{,} \hlkwc{fill} \hlstd{=} \hlstr{"gray50"}\hlstd{,}
                 \hlkwc{geom} \hlstd{=} \hlstr{"polygon"}\hlstd{,} \hlkwc{size} \hlstd{=} \hlnum{1}\hlstd{)} \hlopt{+} \hlkwd{labs}\hlstd{(}\hlkwc{y} \hlstd{=} \hlstr{"Density"}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-wind-06-1}

}


\end{knitrout}

As the final wind-rose plot example, we do 2D density plot with facets added with \Rfunction{facet\_wrap()} to have separate panels for AM and PM. This plot uses fill to describe the density of observations for different combinations wind directions and speeds, the radius ($y$ \emph{aesthetic}) to represent wind speeds and the angle ($x$ \emph{aesthetic}) to represent wind direction.


\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(viikki_d29.dat,} \hlkwd{aes}\hlstd{(WindDir_D1_WVT, WindSpd_S_WVT))} \hlopt{+}
  \hlkwd{coord_polar}\hlstd{()} \hlopt{+}
  \hlkwd{stat_density_2d}\hlstd{(}\hlkwd{aes}\hlstd{(}\hlkwc{fill} \hlstd{=} \hlkwd{stat}\hlstd{(level)),} \hlkwc{geom} \hlstd{=} \hlstr{"polygon"}\hlstd{)} \hlopt{+}
  \hlkwd{scale_x_continuous}\hlstd{(}\hlkwc{breaks} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{0}\hlstd{,} \hlnum{90}\hlstd{,} \hlnum{180}\hlstd{,} \hlnum{270}\hlstd{),}
                     \hlkwc{labels} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"N"}\hlstd{,} \hlstr{"E"}\hlstd{,} \hlstr{"S"}\hlstd{,} \hlstr{"W"}\hlstd{),}
                     \hlkwc{limits} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{0}\hlstd{,} \hlnum{360}\hlstd{),}
                     \hlkwc{expand} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{0}\hlstd{,} \hlnum{0}\hlstd{),}
                     \hlkwc{name} \hlstd{=} \hlstr{"Wind direction"}\hlstd{)} \hlopt{+}
  \hlkwd{scale_y_continuous}\hlstd{(}\hlkwc{name} \hlstd{=} \hlstr{"Wind speed (m/s)"}\hlstd{)} \hlopt{+}
  \hlkwd{facet_wrap}\hlstd{(}\hlopt{~}\hlkwd{factor}\hlstd{(}\hlkwd{ifelse}\hlstd{(}\hlkwd{hour}\hlstd{(solar_time)} \hlopt{<} \hlnum{12}\hlstd{,} \hlstr{"AM"}\hlstd{,} \hlstr{"PM"}\hlstd{)))}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.9\textwidth]{figure/pos-wind-08-1}

}


\end{knitrout}
\index{plots!wind rose|)}


\subsection{Pie charts}
\index{plots!pie charts|(}

\begin{warningbox}
Pie charts are more difficult to read than bar charts because our brain is better at comparing lengths than angles. If used, pie charts should only be used to show composition, or fractional components that add up to a total. In this case, used only if the number of “pie slices” is small (rule of thumb: seven at most), however in general, they are best avoided.
\end{warningbox}

As we use \gggeom{geom\_bar()} which defaults to use \code{stat\_count}. We use the brewer scale for nice colors.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mpg,} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{=} \hlkwd{factor}\hlstd{(}\hlnum{1}\hlstd{),} \hlkwc{fill} \hlstd{=} \hlkwd{factor}\hlstd{(class)))} \hlopt{+}
  \hlkwd{geom_bar}\hlstd{(}\hlkwc{width} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{color} \hlstd{=} \hlstr{"black"}\hlstd{)} \hlopt{+}
  \hlkwd{coord_polar}\hlstd{(}\hlkwc{theta} \hlstd{=} \hlstr{"y"}\hlstd{)} \hlopt{+}
  \hlkwd{scale_fill_brewer}\hlstd{()} \hlopt{+}
  \hlkwd{scale_x_discrete}\hlstd{(}\hlkwc{breaks} \hlstd{=} \hlkwa{NULL}\hlstd{)} \hlopt{+}
  \hlkwd{labs}\hlstd{(}\hlkwc{x} \hlstd{=} \hlkwa{NULL}\hlstd{,} \hlkwc{fill} \hlstd{=} \hlstr{"Vehicle class"}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-main-chunk-54-1}

}


\end{knitrout}
\index{plots!pie charts|)}
\index{plots!circular|)}
\index{grammar of graphics!polar coordinates|)}

\begin{playground}
Edit the code for the pie chart above to obtain a bar chart. Which one of the two plots is easier to read?
\end{playground}

\section{Themes}\label{sec:plot:themes}
\index{grammar of graphics!themes|(}
\index{plots!styling|(}
In \ggplot, \emph{themes} are the equivalent of style sheets. They determine how the different elements of a plot are rendered when displayed, printed or saved to a file. \emph{Themes} do not alter what aesthetics or scales are used to plot the observations or summaries, but instead how text-labels, titles, axes, grids, plotting-area background and grid, etc., are formatted and if displayed or not. Package \ggplot includes several predefined \emph{theme constructors} (usually described as \emph{themes}), and independently developed extension packages define additional ones. These constructors return complete themes, which when added to a plot, replace any theme already present in whole. In addition to choosing among these already available \emph{complete themes}, users can modify the ones already present by adding \emph{incomplete themes} to a plot. When used in this way, \emph{incomplete themes} usually are created on the fly. It is also possible to create new theme constructors that return complete themes, similar to \code{theme\_gray()} from \ggplot.

\subsection{Complete themes}
\index{grammar of graphics!complete themes|(}
The theme used by default is \ggtheme{theme\_gray()} with default arguments. In \pkgnameNI{ggplot2}, predefined themes are defined as constructor functions, with parameters. These parameters allow changing some ``base'' properties. The \code{base\_size} for text elements controlled is given in points, and affects all text elements in the returned theme object as the size of these elements is by default defined relative to the base size. Another parameter, \code{base\_family}, allows the font family to be set. These functions return complete themes.

\begin{warningbox}
\emph{Themes} have no effect on layers produced by \emph{geometries} as themes have no effect on \emph{mappings}, \emph{scales} or \emph{aesthetics}. In the name \ggtheme{theme\_bw()} black-and- white refers to the color of the background of the plotting area and labels. If the \emph{color} or fill \emph{aesthetics} are mapped or set to a constant in the figure, these will be respected irrespective of the theme. We cannot convert a color figure into a black-and-white one by adding a \emph{theme}, we need to change the \emph{aesthetics} used, for example, use \code{shape} instead of \code{color} for a layer added with \code{geom\_point()}.
\end{warningbox}

Even the default \ggtheme{theme\_gray()} can be added to a plot, to modify it, if arguments different to the defaults are passed when called. In this example we override the default base size with a larger one and the default sans-serif font with one with serifs.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(fake2.data,} \hlkwd{aes}\hlstd{(z, y))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{theme_gray}\hlstd{(}\hlkwc{base_size} \hlstd{=} \hlnum{15}\hlstd{,}
             \hlkwc{base_family} \hlstd{=} \hlstr{"serif"}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-themes-01-1}

}


\end{knitrout}

\begin{playground}
Change the code in the previous chunk to use, one at a time, each of the predefined themes from \ggplot: \ggtheme{theme\_bw()}, \ggtheme{theme\_classic()}, \ggtheme{theme\_minimal()}, \ggtheme{theme\_linedraw()}, \ggtheme{theme\_light()}, \ggtheme{theme\_dark()} and \ggtheme{theme\_void()}.
\end{playground}

\begin{explainbox}
Predefined ``themes'' like \ggtheme{theme\_gray()} are, in reality, not themes but instead are constructors of theme objects. The \emph{themes} they return when called depend on the arguments passed to their parameters. In other words, \code{theme\_gray(base\_size = 15)}, creates a different theme than \code{theme\_gray(base\_size = 11)}. In this case, as sizes of different text elements are defined relative to the base size, the size of all text elements changes in coordination. Font size changes by \emph{themes} do not affect the size of text or labels in plot layers created with geometries, as their size is controlled by the \code{size} \emph{aesthetic}.
\end{explainbox}

A frequent idiom is to create a plot without specifying a theme, and then adding the theme when printing or saving it. This can save work, for example, when producing different versions of the same plot for a publication and a talk.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{p} \hlkwb{<-} \hlkwd{ggplot}\hlstd{(fake2.data,} \hlkwd{aes}\hlstd{(z, y))} \hlopt{+}
       \hlkwd{geom_point}\hlstd{()}
\hlkwd{print}\hlstd{(p} \hlopt{+} \hlkwd{theme_bw}\hlstd{())}
\end{alltt}
\end{kframe}
\end{knitrout}

It is also possible to change the theme used by default in the current \Rlang session with \Rfunction{theme\_set()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{old_theme} \hlkwb{<-} \hlkwd{theme_set}\hlstd{(}\hlkwd{theme_bw}\hlstd{(}\hlnum{15}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}

Similar to other functions used to change options in \Rlang, \Rfunction{theme\_set()} returns the previous setting. By saving this value to a variable, here \code{old\_theme}, we are able to restore the previous default, or undo the change.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{theme_set}\hlstd{(old_theme)}
\hlstd{p}
\end{alltt}
\end{kframe}
\end{knitrout}
\index{grammar of graphics!complete themes|)}

\subsection{Incomplete themes}
\index{grammar of graphics!incomplete themes|(}
If we want to extensively modify a theme, and/or reuse it in multiple plots, it is best to create a new constructor, or a modified complete theme as described in the next section. In other cases we may need to tweak some theme settings for a single figure, in which case we can most effectively do this when creating a plot. We exemplify this approach by solving the problem of overlapping $x$-axis tick labels. In practice this problem is most frequent when factor levels have long names or the labels are dates. Rotating the tick labels is the most elegant solution from the graphics design point of view.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(fake2.data,} \hlkwd{aes}\hlstd{(z} \hlopt{+} \hlnum{1000}\hlstd{, y))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{scale_x_continuous}\hlstd{(}\hlkwc{breaks} \hlstd{= scales}\hlopt{::}\hlkwd{pretty_breaks}\hlstd{(}\hlkwc{n} \hlstd{=} \hlnum{8}\hlstd{))} \hlopt{+}
  \hlkwd{theme}\hlstd{(}\hlkwc{axis.text.x} \hlstd{=} \hlkwd{element_text}\hlstd{(}\hlkwc{angle} \hlstd{=} \hlnum{90}\hlstd{,} \hlkwc{hjust} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{vjust} \hlstd{=} \hlnum{0.5}\hlstd{))}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-themes-11-1}

}


\end{knitrout}

\begin{warningbox}
When tick labels are rotated, one usually needs to set both the horizontal and vertical justification, \code{hjust} and \code{vjust}, as the default values stop being suitable. This is due to the fact that justification settings are referenced to the text itself rather than to the plot, i.e., \textbf{vertical} justification of $x$-axis tick labels rotated 90 degrees shifts their alignment with respect to tick marks along the (\textbf{horizontal}) $x$ axis.
\end{warningbox}

\begin{playground}
Play with the code in the last chunk above, modifying the values used for \code{angle}, \code{hjust} and \code{vjust}. (Angles are expressed in degrees, and justification with values between 0 and 1).
\end{playground}

A less elegant approach is to use a smaller font size. Within \Rfunction{theme()}, function \Rfunction{rel()} can be used to set size relative to the base size. In this example, we use \code{axis.text.x} so as to change the size of tick labels only for the $x$ axis.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
  \hlkwd{theme}\hlstd{(}\hlkwc{axis.text.x} \hlstd{=} \hlkwd{element_text}\hlstd{(}\hlkwc{size} \hlstd{=} \hlkwd{rel}\hlstd{(}\hlnum{0.6}\hlstd{)))}
\end{alltt}
\end{kframe}
\end{knitrout}

Theme definitions follow a hierarchy, allowing us to modify the formatting of groups of similar elements, as well as of individual elements. In the chunk above, had we used \code{axis.text} instead of \code{axis.text.x}, the change would have affected the tick labels in both $x$ and $y$ axes.

\begin{playground}
Modify the example above, so that the tick labels on the $x$-axis are blue and those on the $y$-axis red, and the font size is the same for both axes, but changed from the default. Consult the documentation for \code{theme()} to find out the names of the elements that need to be given new values. For examples, see \citebooktitle{Wickham2016} \autocite{Wickham2016} and \citebooktitle{Chang2018} \autocite{Chang2018}.
\end{playground}

Formatting of other text elements can be adjusted in a similar way, as well as thickness of axes, length of tick marks, grid lines, etc. However, in most cases these are graphic design elements that are best kept consistent throughout sets of plots and best handled by creating a new \emph{theme} that can be easily reused.

\begin{warningbox}
If you both add a \emph{complete theme} and want to modify some of its elements, you should add the whole theme before modifying it with \code{+ theme(...)}. This may seem obvious once one has a good grasp of the grammar of graphics, but can be at first disconcerting.
\end{warningbox}

It is also possible to modify the default theme used for rendering all subsequent plots.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{old_theme} \hlkwb{<-} \hlkwd{theme_update}\hlstd{(}\hlkwc{text} \hlstd{=} \hlkwd{element_text}\hlstd{(}\hlkwc{color} \hlstd{=} \hlstr{"darkred"}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}


\index{grammar of graphics!incomplete themes|)}

\subsection{Defining a new theme}
\index{grammar of graphics!creating a theme|(}
Themes can be defined both from scratch, or by modifying existing saved themes, and saving the modified version. As discussed above, it is also possible to define a new, parameterized theme constructor function.

Unless we plan to widely reuse the new theme, there is usually no need to define a new function. We can simply save the modified theme to a variable and add it to different plots as needed. As we will be adding a ``ready-build'' theme object rather than a function, we do not use parentheses.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my_theme} \hlkwb{<-} \hlkwd{theme_bw}\hlstd{()} \hlopt{+} \hlkwd{theme}\hlstd{(}\hlkwc{text} \hlstd{=} \hlkwd{element_text}\hlstd{(}\hlkwc{color} \hlstd{=} \hlstr{"darkred"}\hlstd{))}
\hlstd{p} \hlopt{+} \hlstd{my_theme}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-themes-21-1}

}


\end{knitrout}

\begin{playground}
It is always good to learn to recognize error messages. One way of doing this is by generating errors on purpose. So do add parentheses to the statement in the code chunk above and study the error message.
\end{playground}

\begin{explainbox}
How to create a new theme constructor similar to those in package \ggplot can be fairly simple if the changes are few. As the implementation details of theme objects may change in future versions of \ggplot, the safest approach is to rely only on the public interface of the package. We can ``wrap'' the functions exported by package \ggplot inside a new function. For this we need to find out what are the parameters and their order and duplicate these in our wrapper. Looking at the ``usage'' section of the help page for \ggtheme{theme\_gray()} is enough. In this case, we retain compatibility, but add a new base parameter, \code{base\_color}, and set a different default for \code{base\_family}. The key detail is passing \code{complete = TRUE} to \Rfunction{theme()}, as this tags the returned theme as being usable by itself, resulting in replacement of any theme already in a plot when it is added.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my_theme_gray} \hlkwb{<-}
  \hlkwa{function} \hlstd{(}\hlkwc{base_size} \hlstd{=} \hlnum{11}\hlstd{,}
            \hlkwc{base_family} \hlstd{=} \hlstr{"serif"}\hlstd{,}
            \hlkwc{base_line_size} \hlstd{= base_size}\hlopt{/}\hlnum{22}\hlstd{,}
            \hlkwc{base_rect_size} \hlstd{= base_size}\hlopt{/}\hlnum{22}\hlstd{,}
            \hlkwc{base_color} \hlstd{=} \hlstr{"darkblue"}\hlstd{) \{}
    \hlkwd{theme_gray}\hlstd{(}\hlkwc{base_size} \hlstd{= base_size,}
               \hlkwc{base_family} \hlstd{= base_family,}
               \hlkwc{base_line_size} \hlstd{= base_line_size,}
               \hlkwc{base_rect_size} \hlstd{= base_rect_size)} \hlopt{+}
    \hlkwd{theme}\hlstd{(}\hlkwc{line} \hlstd{=} \hlkwd{element_line}\hlstd{(}\hlkwc{color} \hlstd{= base_color),}
          \hlkwc{rect} \hlstd{=} \hlkwd{element_rect}\hlstd{(}\hlkwc{color} \hlstd{= base_color),}
          \hlkwc{text} \hlstd{=} \hlkwd{element_text}\hlstd{(}\hlkwc{color} \hlstd{= base_color),}
          \hlkwc{title} \hlstd{=} \hlkwd{element_text}\hlstd{(}\hlkwc{color} \hlstd{= base_color),}
          \hlkwc{axis.text} \hlstd{=} \hlkwd{element_text}\hlstd{(}\hlkwc{color} \hlstd{= base_color),} \hlkwc{complete} \hlstd{=} \hlnum{TRUE}\hlstd{)}
  \hlstd{\}}
\end{alltt}
\end{kframe}
\end{knitrout}

In the chunk above we have created our own theme constructor, without too much effort, and using an approach that is very likely to continue working with future versions of \ggplot. The saved theme is a function with parameters and defaults for them. In this example we have kept the function parameters the same as those used in \ggplot, only adding an additional parameter after the existing ones to maximize compatibility and avoid surprising users. To avoid surprising users, we may want additionally to make \code{my\_theme\_gray()} a synonym of \code{my\_theme\_gray()} following \ggplot practice.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my_theme_gray} \hlkwb{<-} \hlstd{my_theme_gray}
\end{alltt}
\end{kframe}
\end{knitrout}

Finally, we use the new theme constructor in the same way as those defined in \ggplot.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{p} \hlopt{+} \hlkwd{my_theme_gray}\hlstd{(}\hlnum{15}\hlstd{,} \hlkwc{base_color} \hlstd{=} \hlstr{"darkred"}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.54\textwidth]{figure/pos-themes-33-1}

}


\end{knitrout}
\end{explainbox}
\index{grammar of graphics!creating a theme|)}
\index{plots!styling|)}
\index{grammar of graphics!themes|)}

\section{Composing plots}
\index{plots!composing|(}
In section \ref{sec:plot:facets} on page \pageref{sec:plot:facets}, we described how facets can be used to created coordinated sets of panels, based on a single data set. Rather frequently, we need to assemble a composite plot from individually created plots. If one wishes to have correctly aligned axis labels and plotting areas, similar to when using facets, then the task is not easy to achieve without the help of especial tools.

Package \pkgname{patchwork} defines a simple grammar for composing plots created with \ggplot. We briefly describe here the use of operators \Roperator{+}, \Roperator{|} and \Roperator{/}, although \pkgname{patchwork} provides additional tools for defining complex layouts of panels. While \Roperator{+} allows different layouts, \Roperator{|} composes panels side by side, and \Roperator{/} composes panels on top of each other. The plots to be used as panels can be grouped using parentheses.

We start by creating and saving three plots.
\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{p1} \hlkwb{<-} \hlkwd{ggplot}\hlstd{(mpg,} \hlkwd{aes}\hlstd{(displ, cty,} \hlkwc{color} \hlstd{=} \hlkwd{factor}\hlstd{(cyl)))} \hlopt{+}
        \hlkwd{geom_point}\hlstd{()} \hlopt{+}
        \hlkwd{theme}\hlstd{(}\hlkwc{legend.position} \hlstd{=} \hlstr{"top"}\hlstd{)}
\hlstd{p2} \hlkwb{<-} \hlkwd{ggplot}\hlstd{(mpg,} \hlkwd{aes}\hlstd{(displ, cty,} \hlkwc{color} \hlstd{=} \hlkwd{factor}\hlstd{(year)))} \hlopt{+}
        \hlkwd{geom_point}\hlstd{()} \hlopt{+}
        \hlkwd{theme}\hlstd{(}\hlkwc{legend.position} \hlstd{=} \hlstr{"top"}\hlstd{)}
\hlstd{p3} \hlkwb{<-} \hlkwd{ggplot}\hlstd{(mpg,} \hlkwd{aes}\hlstd{(}\hlkwd{factor}\hlstd{(model), cty))} \hlopt{+}
        \hlkwd{geom_point}\hlstd{()} \hlopt{+}
        \hlkwd{theme}\hlstd{(}\hlkwc{axis.text.x} \hlstd{=}
                \hlkwd{element_text}\hlstd{(}\hlkwc{angle} \hlstd{=} \hlnum{90}\hlstd{,} \hlkwc{hjust} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{vjust} \hlstd{=} \hlnum{0.5}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}

Next, we compose a plot using as panels the three plots created above (plot not shown).


\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{(p1} \hlopt{|} \hlstd{p2)} \hlopt{/} \hlstd{p3}
\end{alltt}
\end{kframe}
\end{knitrout}

We add a title and tag the panels with a letter. In this, and similar cases, parentheses may be needed to alter the default precedence of the \Rlang operators.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{((p1} \hlopt{|} \hlstd{p2)} \hlopt{/} \hlstd{p3)} \hlopt{+}
   \hlkwd{plot_annotation}\hlstd{(}\hlkwc{title} \hlstd{=} \hlstr{"Fuel use in city traffic:"}\hlstd{,} \hlkwc{tag_levels} \hlstd{=} \hlstr{'a'}\hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.9\textwidth]{figure/pos-patchwork-03-1}

}


\end{knitrout}


\index{plots!composing|)}

\section[Using plotmath expressions]{Using \code{plotmath} expressions}\label{sec:plot:plotmath}
\index{plotmath}
\index{plots!math expressions|(}
In sections \ref{sec:plot:function} and \ref{sec:plot:text} we gave some simple examples of the use of \Rlang expressions in plots. The \code{plotmath} demo and help in \Rlang provide enough information to start using expressions in plots. However, composing syntactically correct expressions can be challenging because their syntax is rather unusual. Although expressions are shown here in the context of plotting, they are also used in other contexts in \Rlang code.

In general it is possible to create \emph{expressions} explicitly with function \Rfunction{expression()}, or by parsing a character string. In the case of \ggplot for some plot elements, layers created with \gggeom{geom\_text} and \gggeom{geom\_label}, and the strip labels of facets the parsing is delayed and applied to mapped character variables in \code{data}. In contrast, for titles, subtitles, captions, axis-labels, etc. (anything that is defined within \Rfunction{labs()}) the expressions have to be entered explicitly, or saved as such into a variable, and the variable passed as an argument.

When plotting expressions using \gggeom{geom\_text()}, that character strings are to be parsed is signaled with \code{parse = TRUE}. In the case of facets' strip labels, parsing or not depends on the \emph{labeller} function used. An additional twist is in this case the possibility of combining static character strings with values taken from \code{data}.

The most difficult thing to remember when writing expressions is how to connect the different parts. A tilde (\code{\textasciitilde}) adds space in between symbols. Asterisk (\code{*}) can be also used as a connector, and is needed usually when dealing with numbers. Using space is allowed in some situations, but not in others. To include bits of text within an expression we need to use quotation marks. For a long list of examples have a look at the output and code displayed by \code{demo(plotmath)} at the \Rlang command prompt.

We will use a couple of complex examples to show how to use expressions for different elements of a plot.
We first create a data frame, using \Rfunction{paste()} to assemble a vector of subscripted $\alpha$ values as character strings suitable for parsing into expressions.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{set.seed}\hlstd{(}\hlnum{54321}\hlstd{)} \hlcom{# make sure we always generate the same data}
\hlstd{my.data} \hlkwb{<-}
  \hlkwd{data.frame}\hlstd{(}\hlkwc{x} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{5}\hlstd{,}
             \hlkwc{y} \hlstd{=} \hlkwd{rnorm}\hlstd{(}\hlnum{5}\hlstd{),}
             \hlkwc{greek.label} \hlstd{=} \hlkwd{paste}\hlstd{(}\hlstr{"alpha["}\hlstd{,} \hlnum{1}\hlopt{:}\hlnum{5}\hlstd{,} \hlstr{"]"}\hlstd{,} \hlkwc{sep} \hlstd{=} \hlstr{""}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}

We use as $x$-axis label, a Greek $\alpha$ character with $i$ as subscript, and in the $y$-axis label, we have a superscript in the units. For the title we use a character string but for the subtitle a rather complex expression. We create these expressions with function \Rfunction{expression()}.

We label each observation with a subscripted $alpha$. We cannot pass expressions to \emph{geometries} by simply mapping them to the label aesthetic. Instead, we pass character strings that can be parsed into expressions. In other words, character strings, that are written using the syntax of expressions. We need to set \code{parse = TRUE} in the call to the \emph{geometry} so that the strings, instead of being plotted as is, are parsed into expressions before the plot is rendered.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(my.data,} \hlkwd{aes}\hlstd{(x, y,} \hlkwc{label} \hlstd{= greek.label))} \hlopt{+}
   \hlkwd{geom_point}\hlstd{()} \hlopt{+}
   \hlkwd{geom_text}\hlstd{(}\hlkwc{angle} \hlstd{=} \hlnum{45}\hlstd{,} \hlkwc{hjust} \hlstd{=} \hlnum{1.2}\hlstd{,} \hlkwc{parse} \hlstd{=} \hlnum{TRUE}\hlstd{)} \hlopt{+}
   \hlkwd{labs}\hlstd{(}\hlkwc{x} \hlstd{=} \hlkwd{expression}\hlstd{(alpha[i]),}
        \hlkwc{y} \hlstd{=} \hlkwd{expression}\hlstd{(Speed}\hlopt{~~}\hlstd{(m}\hlopt{~}\hlstd{s}\hlopt{^}\hlstd{\{}\hlopt{-}\hlnum{1}\hlstd{\})),}
        \hlkwc{title} \hlstd{=} \hlstr{"Using expressions"}\hlstd{,}
        \hlkwc{subtitle} \hlstd{=} \hlkwd{expression}\hlstd{(}\hlkwd{sqrt}\hlstd{(alpha[}\hlnum{1}\hlstd{]} \hlopt{+} \hlkwd{frac}\hlstd{(beta, gamma))))}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-plotmath-02-1}

}


\end{knitrout}

We can also use a character string stored in a variable, and use function \Rfunction{parse()} to parse it in cases where an expression is required as we do here for \code{subtitle}. In this example we also set tick labels to expressions, taking advantage that \Rfunction{expression()} accepts multiple arguments separated by commas returning a vector of expressions.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my_eq.char} \hlkwb{<-} \hlstr{"alpha[i]"}
\hlkwd{ggplot}\hlstd{(my.data,} \hlkwd{aes}\hlstd{(x, y))} \hlopt{+}
   \hlkwd{geom_point}\hlstd{()} \hlopt{+}
   \hlkwd{labs}\hlstd{(}\hlkwc{title} \hlstd{=} \hlkwd{parse}\hlstd{(}\hlkwc{text} \hlstd{= my_eq.char))} \hlopt{+}
   \hlkwd{scale_x_continuous}\hlstd{(}\hlkwc{name} \hlstd{=} \hlkwd{expression}\hlstd{(alpha[i]),}
                      \hlkwc{breaks} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{1}\hlstd{,}\hlnum{3}\hlstd{,}\hlnum{5}\hlstd{),}
                      \hlkwc{labels} \hlstd{=} \hlkwd{expression}\hlstd{(alpha[}\hlnum{1}\hlstd{], alpha[}\hlnum{3}\hlstd{], alpha[}\hlnum{5}\hlstd{]))}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-plotmath-02a-1}

}


\end{knitrout}

A different approach (no example shown) would be to use \Rfunction{parse()} explicitly for each individual label, something that might be needed if the tick labels need to be ``assembled'' programmatically instead of set as constants.

\begin{explainbox}
\textbf{Differences between \Rfunction{parse()} and \Rfunction{expression()}}. Function \Rfunction{parse()} takes as an argument a character string. This is very useful as the character string can be created programmatically. When using \code{expression()} this is not possible, except for substitution at execution time of the value of variables into the expression. See the help pages for both functions.

Function \Rfunction{expression()} accepts its arguments without any delimiters. Function \Rfunction{parse()} takes a single character string as an argument to be parsed, in which case quotation marks within the string need to be \emph{escaped} (using \code{\backslash"} where a literal \code{"} is desired). We can, also in both cases, embed a character string by means of one of the functions \Rfunction{plain()}, \Rfunction{italic()}, \Rfunction{bold()} or \Rfunction{bolditalic()} which also affect the font used. The argument to these functions needs to be a character string delimited by quotation marks if it is not to be parsed.

When using \Rfunction{expression()}, bare quotation marks can be embedded,

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(cars,} \hlkwd{aes}\hlstd{(speed, dist))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{xlab}\hlstd{(}\hlkwd{expression}\hlstd{(x[}\hlnum{1}\hlstd{]}\hlopt{*}\hlstr{"  test"}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}

while in the case of \Rfunction{parse()} they need to be \emph{escaped},

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(cars,} \hlkwd{aes}\hlstd{(speed, dist))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{xlab}\hlstd{(}\hlkwd{parse}\hlstd{(}\hlkwc{text} \hlstd{=} \hlstr{"x[1]*\textbackslash{}"  test\textbackslash{}""}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}

and in some cases will be enclosed within a format function.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(cars,} \hlkwd{aes}\hlstd{(speed, dist))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{xlab}\hlstd{(}\hlkwd{parse}\hlstd{(}\hlkwc{text} \hlstd{=} \hlstr{"x[1]*italic(\textbackslash{}"  test\textbackslash{}")"}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}

Some additional remarks. If \Rfunction{expression()} is passed multiple arguments, it returns a vector of expressions. Where \Rfunction{ggplot()} expects a single value as an argument, as in the case of axis labels, only the first member of the vector will be used.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(cars,} \hlkwd{aes}\hlstd{(speed, dist))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{xlab}\hlstd{(}\hlkwd{expression}\hlstd{(x[}\hlnum{1}\hlstd{],} \hlstr{"  test"}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}

Depending on the location within a expression, spaces maybe ignored, or illegal. To juxtapose elements without adding space use \code{*}, to explicitly insert white space, use \code{\textasciitilde}. As shown above, spaces are accepted within quoted text. Consequently, the following alternatives can also be used.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
  \hlkwd{xlab}\hlstd{(}\hlkwd{parse}\hlstd{(}\hlkwc{text} \hlstd{=} \hlstr{"x[1]~~~~\textbackslash{}"test\textbackslash{}""}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
  \hlkwd{xlab}\hlstd{(}\hlkwd{parse}\hlstd{(}\hlkwc{text} \hlstd{=} \hlstr{"x[1]~~~~plain(test)"}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}

However, unquoted white space is discarded.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
  \hlkwd{xlab}\hlstd{(}\hlkwd{parse}\hlstd{(}\hlkwc{text} \hlstd{=} \hlstr{"x[1]*plain(   test)"}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}

Finally, it can be surprising that trailing zeros in numeric values appearing within an expression or text to be parsed are dropped. To force the trailing zeros to be retained we need to enclose the number in quotation marks so that it is interpreted as a character string.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(cars,} \hlkwd{aes}\hlstd{(speed, dist))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{annotate}\hlstd{(}\hlkwc{geom} \hlstd{=} \hlstr{"text"}\hlstd{,}
           \hlkwc{x} \hlstd{=} \hlkwd{rep}\hlstd{(}\hlnum{6}\hlstd{,} \hlnum{3}\hlstd{),} \hlkwc{y} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{90}\hlstd{,} \hlnum{100}\hlstd{,} \hlnum{110}\hlstd{),}
           \hlkwc{label} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"'1.00'*x^2"}\hlstd{,} \hlstr{"1.00*x^2"}\hlstd{,} \hlstr{"1.01*x^2"}\hlstd{),} \hlkwc{parse} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
\end{explainbox}

Above we used \Rfunction{paste()} to insert values stored in a variable; functions \Rfunction{format()}, \Rfunction{sprintf()}, and \Rfunction{strftime()} allow the conversion into character strings of other values. These functions can be used when creating plots to generate suitable character strings for the \code{label} \emph{aesthetic} out of numeric, logical, date, time, and even character values. They can be, for example, used to create labels within a call to \code{aes()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{sprintf}\hlstd{(}\hlstr{"log(%.3f) = %.3f"}\hlstd{,} \hlnum{5}\hlstd{,} \hlkwd{log}\hlstd{(}\hlnum{5}\hlstd{))}
\end{alltt}
\begin{verbatim}
## [1] "log(5.000) = 1.609"
\end{verbatim}
\begin{alltt}
\hlkwd{sprintf}\hlstd{(}\hlstr{"log(%.3g) = %.3g"}\hlstd{,} \hlnum{5}\hlstd{,} \hlkwd{log}\hlstd{(}\hlnum{5}\hlstd{))}
\end{alltt}
\begin{verbatim}
## [1] "log(5) = 1.61"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
Study the chunck above. If you are familiar with \langname{C} or \langname{C++} function \Rfunction{sprintf()} will already be familiar to you, otherwise study its help page.

Play with functions \Rfunction{format()}, \Rfunction{sprintf()}, and \Rfunction{strftime()}, formatting different types of data, into character strings of different widths, with different numbers of digits, etc.
\end{playground}

It is also possible to substitute the value of variables or, in fact, the result of evaluation, into a new expression, allowing on the fly construction of expressions. Such expressions are frequently used as labels in plots. This is achieved through use of \emph{quoting} and \emph{substitution}.

We use \Rfunction{bquote()} to substitute variables or expressions enclosed in \code{.( )} by their value. Be aware that the argument to \Rfunction{bquote()} needs to be written as an expression; in this example we need to use a tilde, \code{\textasciitilde}, to insert a space between words. Furthermore, if the expressions include variables, these will be searched for in the environment rather than in \code{data}, except within a call to \code{aes()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(cars,} \hlkwd{aes}\hlstd{(speed, dist))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{labs}\hlstd{(}\hlkwc{title} \hlstd{=} \hlkwd{bquote}\hlstd{(Time}\hlopt{~}\hlstd{zone}\hlopt{:} \hlkwd{.}\hlstd{(}\hlkwd{Sys.timezone}\hlstd{())),}
       \hlkwc{subtitle} \hlstd{=} \hlkwd{bquote}\hlstd{(Date}\hlopt{:} \hlkwd{.}\hlstd{(}\hlkwd{as.character}\hlstd{(}\hlkwd{today}\hlstd{())))}
       \hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-expr-bquote-01-1}

}


\end{knitrout}

In the case of \Rfunction{substitute()} we supply what is to be used for substitution through a named list.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{ggplot}\hlstd{(cars,} \hlkwd{aes}\hlstd{(speed, dist))} \hlopt{+}
  \hlkwd{geom_point}\hlstd{()} \hlopt{+}
  \hlkwd{labs}\hlstd{(}\hlkwc{title} \hlstd{=} \hlkwd{substitute}\hlstd{(Time}\hlopt{~}\hlstd{zone}\hlopt{:} \hlstd{tz,} \hlkwd{list}\hlstd{(}\hlkwc{tz} \hlstd{=} \hlkwd{Sys.timezone}\hlstd{())),}
       \hlkwc{subtitle} \hlstd{=} \hlkwd{substitute}\hlstd{(Date}\hlopt{:} \hlstd{date,} \hlkwd{list}\hlstd{(}\hlkwc{date} \hlstd{=} \hlkwd{as.character}\hlstd{(}\hlkwd{today}\hlstd{())))}
       \hlstd{)}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=.7\textwidth]{figure/pos-expr-substitute-01-1}

}


\end{knitrout}

For example, substitution can be used to assemble an expression within a function based on the arguments passed. One case of interest is to retrieve the name of the object passed as an argument, from within a function.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{deparse_test} \hlkwb{<-} \hlkwa{function}\hlstd{(}\hlkwc{x}\hlstd{) \{}
  \hlkwd{print}\hlstd{(}\hlkwd{deparse}\hlstd{(}\hlkwd{substitute}\hlstd{(x)))}
\hlstd{\}}

\hlstd{a} \hlkwb{<-} \hlstr{"saved in variable"}

\hlkwd{deparse_test}\hlstd{(}\hlstr{"constant"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "\"constant\""
\end{verbatim}
\begin{alltt}
\hlkwd{deparse_test}\hlstd{(}\hlnum{1} \hlopt{+} \hlnum{2}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "1 + 2"
\end{verbatim}
\begin{alltt}
\hlkwd{deparse_test}\hlstd{(a)}
\end{alltt}
\begin{verbatim}
## [1] "a"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{infobox}
A new package, \pkgname{ggtext}, which is not yet in CRAN, provides rich-text (basic \langname{HTML} and \langname{Markdown}) support for \ggplot, both for annotations and for data visualization. This package provides an alternative to the use of \Rlang expressions.
\end{infobox}
\index{plots!math expressions|)}

\section{Creating complex data displays}\label{sec:plot:composition}
\index{plots!modular construction|(}

The grammar of graphics\index{grammar of graphics}\index{plots!layers} allows one to build and test plots incrementally. In daily use, when creating a completely new plot, it is best to start with a simple design for a plot, \code{print()} this plot, checking that the output is as expected and the code error-free. Afterwards, one can map additional \emph{aesthetics} and add \emph{geometries} and \emph{statistics} gradually. The final steps are then to add \emph{annotations} and the text or expressions used for titles, and axis and key labels. Another approach is to start with an existing plot and modify it, e.g.,  by using the same plotting code with different \code{data} or mapping different variables. When reusing code for a different data set, scale \code{limits} and \code{names} are likely to need to be edited.

\begin{playground}
  Build a graphically complex data plot of your interest, step by step. By step by step, I do not refer to using the grammar in the construction of the plot as earlier, but of taking advantage of this modularity to test intermediate versions in an iterative design process, first by building up the complex plot in stages as a tool in debugging, and later using iteration in the processes of improving the graphic design of the plot and improving its readability and effectiveness.
\end{playground}

\section{Creating sets of plots}\label{sec:plot:sets:of}
\index{plots!consistent styling}\index{plots!programatic construction|(}
Plots to be presented at a given occasion or published as part of the same work need to be consistent in various respects: themes, scales and palettes, annotations, titles and captions. To guarantee this consistency we need to build plots modularly and avoid repetition by assigning names to the ``modules'' that need to be used multiple times.

\subsection{Saving plot layers and scales in variables}

When creating plots with \ggplot,\index{plots!reusing parts of} objects are composed using operator \code{+} to assemble together the individual components. The functions that create plot layers, scales, etc.\ are constructors of objects and the objects they return can be stored in variables, and once saved, added to multiple plots at a later time.

We create a plot and save it to variable \code{myplot} and we separately save the values returned by a call to function \code{labs()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{myplot} \hlkwb{<-} \hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,}
                 \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg,}
                 \hlkwc{color} \hlstd{=} \hlkwd{factor}\hlstd{(cyl)))} \hlopt{+}
          \hlkwd{geom_point}\hlstd{()}

\hlstd{mylabs} \hlkwb{<-} \hlkwd{labs}\hlstd{(}\hlkwc{x} \hlstd{=} \hlstr{"Engine displacement)"}\hlstd{,}
               \hlkwc{y} \hlstd{=} \hlstr{"Gross horsepower"}\hlstd{,}
               \hlkwc{color} \hlstd{=} \hlstr{"Number of\textbackslash{}ncylinders"}\hlstd{,}
               \hlkwc{shape} \hlstd{=} \hlstr{"Number of\textbackslash{}ncylinders"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

We assemble the final plot from the two parts we saved into variables. This is useful when we need to create several plots ensuring that scale \code{name} arguments are used consistently. In the example above, we saved these names, but the approach can be used for other plot components or lists of components.

\begin{warningbox}
 When composing plots with the \code{+} operator, the left-hand-side operand must be a \code{"gg"} object. The left operand is added to the \code{"gg"} object and the result returned.
\end{warningbox}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{myplot}
\hlstd{myplot} \hlopt{+} \hlstd{mylabs} \hlopt{+} \hlkwd{theme_bw}\hlstd{(}\hlnum{16}\hlstd{)}
\hlstd{myplot} \hlopt{+} \hlstd{mylabs} \hlopt{+} \hlkwd{theme_bw}\hlstd{(}\hlnum{16}\hlstd{)} \hlopt{+} \hlkwd{ylim}\hlstd{(}\hlnum{0}\hlstd{,} \hlnum{NA}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

We can also save intermediate results.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{mylogplot} \hlkwb{<-} \hlstd{myplot} \hlopt{+} \hlkwd{scale_y_log10}\hlstd{(}\hlkwc{limits}\hlstd{=}\hlkwd{c}\hlstd{(}\hlnum{8}\hlstd{,}\hlnum{55}\hlstd{))}
\hlstd{mylogplot} \hlopt{+} \hlstd{mylabs} \hlopt{+} \hlkwd{theme_bw}\hlstd{(}\hlnum{16}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

\subsection{Saving plot layers and scales in lists}

If the pieces to be put together do not include a \code{"gg"} object, we can group them into an \Rlang list and save it. When we later add the saved list to a \code{"gg"} object, the members of the list are added one by one to the plot respecting their order.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{myparts} \hlkwb{<-} \hlkwd{list}\hlstd{(mylabs,} \hlkwd{theme_bw}\hlstd{(}\hlnum{16}\hlstd{))}
\hlstd{mylogplot} \hlopt{+} \hlstd{myparts}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{playground}
Revise the code you wrote for the ``playground'' exercise in section \ref{sec:plot:composition}, but this time, pre-building and saving groups of elements that you expect to be useful unchanged when composing a different plot of the same type, or a plot of a different type from the same data.
\end{playground}

\subsection{Using functions as building blocks}

When the blocks we assemble need to accept arguments when used, we have to define functions instead of saving plot components to variables. The functions we define, have to return a \code{"gg"} object, a list of plot components, or a single plot component. The simplest use is to alter some defaults in existing constructor functions returning \code{"gg"} objects or layers. The ellipsis (\code{...}) allows passing named arguments to a nested function. In this case, every single argument passed by name to \code{bw\_ggplot()} will be copied as argument to the nested call to \code{ggplot()}. Be aware, that supplying arguments by position, is possible only for parameters explicitly included in the definition of the wrapper function,

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{bw_ggplot} \hlkwb{<-} \hlkwa{function}\hlstd{(}\hlkwc{...}\hlstd{) \{}
  \hlkwd{ggplot}\hlstd{(...)} \hlopt{+}
  \hlkwd{theme_bw}\hlstd{()}
\hlstd{\}}
\end{alltt}
\end{kframe}
\end{knitrout}

which could be used as follows.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{bw_ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mtcars,}
          \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= disp,} \hlkwc{y} \hlstd{= mpg,}
          \hlkwc{color} \hlstd{=} \hlkwd{factor}\hlstd{(cyl)))} \hlopt{+}
          \hlkwd{geom_point}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}

\index{plots!programatic construction|)}
\index{plots!modular construction|)}

\section{Generating output files}
\index{devices!output|see{graphic output devices}}
\index{plots!saving to file|see{plots, rendering}}
\index{graphic output devices|(}
\index{plots!rendering|(}
It is possible, when using \RStudio, to directly export the displayed plot to a file using a menu. However, if the file will have to be generated again at a later time, or a series of plots need to be produced with consistent format, it is best to include the commands to export the plot in the script.

In \Rlang,\index{plots!printing}\index{plots!saving}\index{plots!output to files} files are created by printing to different devices. Printing is directed to a currently open device such a window in \RStudio. Some devices produce screen output, others files. Devices depend on drivers. There are both devices that are part of \Rlang and additional ones defined in contributed packages.

Creating a file involves opening a device, printing and closing the device in sequence. In most cases the file remains locked until the device is close.

For example when rendering a plot to\index{plots!PDF output} PDF, Encapsulated Postcript, SVG or other vector graphics formats, arguments passed to \code{width} and \code{height} are expressed in inches.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{fig1} \hlkwb{<-} \hlkwd{ggplot}\hlstd{(}\hlkwd{data.frame}\hlstd{(}\hlkwc{x} \hlstd{=} \hlopt{-}\hlnum{3}\hlopt{:}\hlnum{3}\hlstd{),} \hlkwd{aes}\hlstd{(}\hlkwc{x} \hlstd{= x))} \hlopt{+}
  \hlkwd{stat_function}\hlstd{(}\hlkwc{fun} \hlstd{= dnorm)}
\hlkwd{pdf}\hlstd{(}\hlkwc{file} \hlstd{=} \hlstr{"fig1.pdf"}\hlstd{,} \hlkwc{width} \hlstd{=} \hlnum{8}\hlstd{,} \hlkwc{height} \hlstd{=} \hlnum{6}\hlstd{)}
\hlkwd{print}\hlstd{(fig1)}
\hlkwd{dev.off}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}

For Encapsulated Postscript\index{plots!Postscript output} and SVG\index{plots!SVG output} output, we only need to substitute \code{pdf()} with \code{postscript()} or \code{svg()}, respectively.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{postscript}\hlstd{(}\hlkwc{file} \hlstd{=} \hlstr{"fig1.eps"}\hlstd{,} \hlkwc{width} \hlstd{=} \hlnum{8}\hlstd{,} \hlkwc{height} \hlstd{=} \hlnum{6}\hlstd{)}
\hlkwd{print}\hlstd{(fig1)}
\hlkwd{dev.off}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}

In the case of graphics devices for\index{plots!bitmap output} file output in BMP, JPEG, PNG and TIFF bitmap formats, arguments passed to \code{width} and \code{height} are expressed in pixels.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{tiff}\hlstd{(}\hlkwc{file} \hlstd{=} \hlstr{"fig1.tiff"}\hlstd{,} \hlkwc{width} \hlstd{=} \hlnum{1000}\hlstd{,} \hlkwc{height} \hlstd{=} \hlnum{800}\hlstd{)}
\hlkwd{print}\hlstd{(fig1)}
\hlkwd{dev.off}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}
\index{plots!rendering|)}
\index{graphic output devices|)}

\begin{infobox}
Some graphics devices are part of base-\Rlang, and others are implemented in contributed packages. In some cases, there are multiple graphic device available for rendering graphics in a given file format. These devices usually use different libraries, or have been designed with different aims. These alternative graphic devices can also differ in their function signature, i.e., have differences in the parameters and their names. In cases when rendering fails inexplicably, it can be worthwhile to switch to an alternative graphics device to find out if the problem is in the plot or in the rendering engine.
\end{infobox}

\section{Further reading}
An\index{further reading!grammar of graphics}\index{further reading!plotting} in-depth discussion of the many extensions to package \pkgname{ggplot2} is outside the scope of this book. Several books describe in detail the use of \pkgname{ggplot2}, being \citebooktitle{Wickham2016} \autocite{Wickham2016} the one written by the main author of the package. For inspiration or worked out examples, the book \citebooktitle{Chang2018} \autocite{Chang2018} is an excellent reference. In depth explanations of the technical aspects of \Rlang graphics are available in the book \citebooktitle{Murrell2019} \autocite{Murrell2019}.


% !Rnw root = appendix.main.Rnw


\chapter{Data import and export}\label{chap:R:data:io}\label{sec:data:io}

\begin{VF}
Most programmers have seen them, and most good programmers realize they've written at least one. They are huge, messy, ugly programs that should have been short, clean, beautiful programs.

\VA{John Bentley}{\emph{Programming Pearls}, 1986}
\end{VF}


\section{Aims of this chapter}

Base \Rlang and the recommended packages (installed by default) include several functions for importing and exporting data. Contributed packages provide both replacements for some of these functions and support for several additional file formats. In the present chapter, I aim at describing both data input and output covering in detail only the most common ``foreign'' data formats (those not native to \Rlang).

Data file formats that are foreign to \Rlang are not always well defined, making it necessary to reverse-engineer the algorithms needed to read them. These formats, even when clearly defined, may be updated by the developers of the foreign software that writes the files. Consequently, developing software to read and write files using foreign formats can easily result in long, messy, and ugly \Rlang scripts. We can also unwillingly write code that usually works but occasionally fails with specific files, or even worse, occasionally silently corrupts the imported data. The aim of this chapter is to provide guidance for finding functions for reading data encoded using foreign formats, covering both base \Rlang, including the \pkgname{foreign} package, and independently contributed packages. Such functions are well tested or validated.

In this chapter you will familiarize yourself with how to exchange data between \Rlang and other applications. The functions \code{save()} and \code{load()}, and \code{saveRDS()} and  \code{readRDS()}, all of which save and read data in \Rlang's native formats, are described in sections \ref{sec:data:rda} and \ref{sec:data:rds} starting on page \pageref{sec:data:rda}.

\section{Introduction}

The first step in any data analysis with \Rlang is to input or read-in the data. Available sources of data are many and data can be stored or transmitted using various formats, both based on text or binary encodings. It is crucial that data is not altered (corrupted) when read and that in the eventual case of an error, errors are clearly reported. Most dangerous are silent non-catastrophic errors.

The very welcome increase of awareness of the need for open availability of data, makes the output of data from \Rlang into well-defined data-exchange formats another crucial step. Consequently, in many cases an important step in data analysis is to export the data for submission to a repository, in addition to publication of the results of the analysis.

Faster internet access to data sources and cheaper random-access memory (RAM) has made it possible to efficiently work with relatively large data sets in \Rlang. That \Rlang keeps all data in memory (RAM), imposes limits to the size of data \Rlang functions can operate on. For data sets large enough not to fit in computer RAM, one can use selective reading of data from flat files, or from databases outside of \Rlang.

Some \Rlang packages have made it faster to import data saved in the same formats already supported by base \Rlang, but in some cases providing weaker guarantees of not corrupting the data than base \Rlang. Other contributed packages make it possible to import and export data stored in file formats not supported by base \Rlang functions. Some of these formats are subject-area specific while others are in widespread use.

\section{Packages used in this chapter}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{install.packages}\hlstd{(learnrbook}\hlopt{::}\hlstd{pkgs_ch_data)}
\end{alltt}
\end{kframe}
\end{knitrout}

To run the examples included in this chapter, you need first to load some packages from the library (see section \ref{sec:script:packages} on page \pageref{sec:script:packages} for details on the use of packages).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{library}\hlstd{(learnrbook)}
\hlkwd{library}\hlstd{(tibble)}
\hlkwd{library}\hlstd{(purrr)}
\hlkwd{library}\hlstd{(wrapr)}
\hlkwd{library}\hlstd{(stringr)}
\hlkwd{library}\hlstd{(dplyr)}
\hlkwd{library}\hlstd{(tidyr)}
\hlkwd{library}\hlstd{(readr)}
\hlkwd{library}\hlstd{(readxl)}
\hlkwd{library}\hlstd{(xlsx)}
\hlkwd{library}\hlstd{(readODS)}
\hlkwd{library}\hlstd{(pdftools)}
\hlkwd{library}\hlstd{(foreign)}
\hlkwd{library}\hlstd{(haven)}
\hlkwd{library}\hlstd{(xml2)}
\hlkwd{library}\hlstd{(XML)}
\hlkwd{library}\hlstd{(ncdf4)}
\hlkwd{library}\hlstd{(tidync)}
\hlkwd{library}\hlstd{(lubridate)}
\hlkwd{library}\hlstd{(jsonlite)}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{infobox}
Some data sets used in this and other chapters are available in package \pkgname{learnrbook}. In addition to the
R data objects, we provide files saved in \emph{foreign} formats, which we used in examples on how to import data. The files can be either read from the \Rlang library, or from a copy in a local folder. In this chapter we assume the user has copied the folder \code{"extdata"} from the package to a working folder.

Copy the files using:

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{pkg.path} \hlkwb{<-} \hlkwd{system.file}\hlstd{(}\hlstr{"extdata"}\hlstd{,} \hlkwc{package} \hlstd{=} \hlstr{"learnrbook"}\hlstd{)}
\hlkwd{file.copy}\hlstd{(pkg.path,} \hlstr{"."}\hlstd{,} \hlkwc{overwrite} \hlstd{=} \hlnum{TRUE}\hlstd{,} \hlkwc{recursive} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\end{kframe}
\end{knitrout}

We also make sure the folder used to save data read from the internet, exists.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{save.path} \hlkwb{=} \hlstr{"./data"}
\hlkwa{if} \hlstd{(}\hlopt{!}\hlkwd{dir.exists}\hlstd{(save.path)) \{}
  \hlkwd{dir.create}\hlstd{(save.path)}
\hlstd{\}}
\end{alltt}
\end{kframe}
\end{knitrout}
\end{infobox}

\section{File names and operations}\label{sec:files:filenames}
\index{file names!portable}
\index{file operations|(}
We start with the naming of files as it affects data sharing irrespective of the format used for its encoding. The main difficulty is that different operating systems have different rules governing the syntax used for file names and file paths. In many cases, like when depositing data files in a public repository, we need to ensure that file names are valid in multiple operating systems (OSs). If the script used to create the files is itself expected to be OS agnostic, we also need to be careful to query the OS for file names and paths without making assumptions on the naming rules or available OS commands. This is especially important when developing \Rlang packages.

\begin{warningbox}
\index{file names!script portability}
For maximum portability, file names should never contain white-space characters and contain at most one dot. For the widest possible portability, underscores should be avoided using dashes instead. As an example, instead of \code{my data.2019.csv}, use \code{my-data-2019.csv}.
\end{warningbox}

\Rlang provides functions which help with portability, by hiding the idiosyncrasies of the different OSs from \Rlang code. In scripts these functions should be preferred over direct call to OS commands (i.e., using \Rfunction{shell()} or \Rfunction{system()}) whenever possible. As the algorithm needed to extract a file name from a file path is OS specific, \Rlang provides functions such as \Rfunction{basename()}, whose implementation is OS specific but from the side of \Rlang code behave identically---these functions hide the differences among OSs from the user of \Rlang. The chunk below can be expected to work correctly under any OS for which \Rlang is available.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{basename}\hlstd{(}\hlstr{"extdata/my-file.txt"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "my-file.txt"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{warningbox}
\index{file paths!script portability}
\index{folders|see{file paths}}
\index{file paths!parsing|(}
While in \pgrmname{Unix} and \pgrmname{Linux} folder nesting in file paths is marked with a forward slash character (\verb|/|), under \pgrmname{MS-Windows} it is marked with a backslash character (\verb|\|). Backslash (\verb|\|) is an escape character in \Rlang and interpreted as the start of an embedded special sequence of characters (see section \ref{sec:calc:character} on page \pageref{sec:calc:character}), while in \Rlang a forward slash (\verb|/|) can be used for file paths under any OS, and escaped backslash (\verb|\\|) is valid only under MS-Windows. Consequently, \verb|/| should be always preferred to \verb|\\| to ensure portability, and is the approach used in this book.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{basename}\hlstd{(}\hlstr{"extdata/my-file.txt"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "my-file.txt"
\end{verbatim}
\begin{alltt}
\hlkwd{basename}\hlstd{(}\hlstr{"extdata\textbackslash{}\textbackslash{}my-file.txt"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "my-file.txt"
\end{verbatim}
\end{kframe}
\end{knitrout}
\end{warningbox}

The complementary function to \code{basename()} is \Rfunction{dirname()} and extracts the bare path to the containing folder, from a full file path.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{dirname}\hlstd{(}\hlstr{"extdata/my-file.txt"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "extdata"
\end{verbatim}
\end{kframe}
\end{knitrout}
\index{file paths!parsing|)}
\index{working directory|(}
Functions \Rfunction{getwd()} and \Rfunction{setwd()} can be used to get the path to the current working directory and to set a directory as current, respectively.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlcom{# not run}
\hlkwd{getwd}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}

Function \Rfunction{setwd()} returns the path to the current working directory, allowing us to portably set the working directory to the previous one. Both relative paths (relative to the current working directory), as in the example, or absolute paths (given in full) are accepted as an argument. In mainstream OSs ``\code{.}'' indicates the current directory and ``\code{..}'' the directory above the current one.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlcom{# not run}
\hlstd{oldwd} \hlkwb{<-} \hlkwd{setwd}\hlstd{(}\hlstr{".."}\hlstd{)}
\hlkwd{getwd}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}

The returned value is always an absolute full path, so it remains valid even if the path to the working directory changes more than once before being restored.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlcom{# not run}
\hlstd{oldwd}
\hlkwd{setwd}\hlstd{(oldwd)}
\hlkwd{getwd}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}
\index{working directory|)}

\index{listing files or directories|(}
We can also obtain lists of files and/or directories (= disk folders) portably across OSs.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{head}\hlstd{(}\hlkwd{list.files}\hlstd{())}
\end{alltt}
\begin{verbatim}
## [1] "abbrev.sty"         "anscombe.svg"       "appendixes.prj"
## [4] "appendixes.prj.bak" "bits"               "chapters-removed"
\end{verbatim}
\begin{alltt}
\hlkwd{head}\hlstd{(}\hlkwd{list.dirs}\hlstd{())}
\end{alltt}
\begin{verbatim}
## [1] "."                "./.git"           "./.git/hooks"     "./.git/info"
## [5] "./.git/logs"      "./.git/logs/refs"
\end{verbatim}
\begin{alltt}
\hlkwd{head}\hlstd{(}\hlkwd{dir}\hlstd{())}
\end{alltt}
\begin{verbatim}
## [1] "abbrev.sty"         "anscombe.svg"       "appendixes.prj"
## [4] "appendixes.prj.bak" "bits"               "chapters-removed"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
The default argument for parameter \code{path} is the current working directory, under Windows, Unix, and Linux indicated by \code{"."}. Convince yourself that this is indeed the default by calling the functions with an explicit argument. After this, play with the functions trying other existing and non-existent paths in your computer.
\end{playground}

\begin{playground}
Use parameter \code{full.names} with \Rfunction{list.files()} to obtain either a list of file paths or bare file names. Similarly, investigate how the returned list of files is affected by the argument passed to \code{all.names}.
\end{playground}

\begin{playground}
Compare the behavior of functions \Rfunction{dir()} and \Rfunction{list.dirs()}, and try by overriding the default arguments of \Rfunction{list.dirs()}, to get the call to return the same output as \Rfunction{dir()} does by default.
\end{playground}
\index{listing files or directories|)}

Base \Rlang provides several functions for portably working with files, and they are listed in the help page for \code{files} and in individual help pages. Use \code{help("files")} to access the help for this ``family'' of functions.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwa{if} \hlstd{(}\hlopt{!}\hlkwd{file.exists}\hlstd{(}\hlstr{"xxx.txt"}\hlstd{)) \{}
  \hlkwd{file.create}\hlstd{(}\hlstr{"xxx.txt"}\hlstd{)}
\hlstd{\}}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{file.size}\hlstd{(}\hlstr{"xxx.txt"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] 0
\end{verbatim}
\begin{alltt}
\hlkwd{file.info}\hlstd{(}\hlstr{"xxx.txt"}\hlstd{)}
\end{alltt}
\begin{verbatim}
##         size isdir mode               mtime               ctime
## xxx.txt    0 FALSE  666 2020-04-24 02:52:45 2020-04-24 02:52:45
##                       atime exe
## xxx.txt 2020-04-24 02:52:45  no
\end{verbatim}
\begin{alltt}
\hlkwd{file.rename}\hlstd{(}\hlstr{"xxx.txt"}\hlstd{,} \hlstr{"zzz.txt"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{file.exists}\hlstd{(}\hlstr{"xxx.txt"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] FALSE
\end{verbatim}
\begin{alltt}
\hlkwd{file.exists}\hlstd{(}\hlstr{"zzz.txt"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlkwd{file.remove}\hlstd{(}\hlstr{"zzz.txt"}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
Function \Rfunction{file.path()} can be used to construct a file path from its components in a way that is portable across OSs. Look at the help page and play with the function to assemble some paths that exist in the computer you are using.
\end{playground}
\index{file operations|)}

\section{Opening and closing file connections}\label{sec:io:connections}

Examples in the rest of this chapter use as an argument for the \code{file} formal parameter literal paths or URLs, and complete the reading or writing operations within the call to a function. Sometimes it is necessary to read or write a text file sequentially, one row or record at a time. In such cases it is most efficient to keep the file open between reads and close the connection only when it is no longer needed. See \code{help(connections)} for details about the various functions available and their behavior in different OSs. In the next example we open a file connection, read two lines, first the top one with column headers, then in a separate call to \Rfunction{readLines()}, the two lines or records with data, and finally close the connection.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{f1} \hlkwb{<-} \hlkwd{file}\hlstd{(}\hlstr{"extdata/not-aligned-ASCII-UK.csv"}\hlstd{,} \hlkwc{open} \hlstd{=} \hlstr{"r"}\hlstd{)} \hlcom{# open for reading}
\hlkwd{readLines}\hlstd{(f1,} \hlkwc{n} \hlstd{=} \hlnum{1L}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "col1,col2,col3,col4"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{readLines}\hlstd{(f1,} \hlkwc{n} \hlstd{=} \hlnum{2L}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] "1.0,24.5,346,ABC" "23.4,45.6,78,Z Y"
\end{verbatim}
\begin{alltt}
\hlkwd{close}\hlstd{(f1)}
\end{alltt}
\end{kframe}
\end{knitrout}

When \Rpgrm is used in batch mode, the ``files'' \code{stdin}, \code{stdout} and \code{stderror} can be opened, and data read from, or written to. These \emph{standard} sources and sinks, so familiar to \Clang programmers, allow the use of \Rlang scripts as tools in data pipes coded as shell scripts under Unix and other OSs.

\section{Plain-text files}\label{sec:files:txt}
\index{importing data!text files|(}
In general, text files are the most portable approach to data storage but usually also the least efficient with respect to the size of the file. Text files are composed of encoded characters. This makes them easy to edit with text editors and easy to read from programs written in most programming languages. On the other hand, how the data encoded as characters is arranged can be based on two different approaches: positional or using a specific character as a separator. The positional approach is more concise but almost unreadable to humans as the values run into each other. Reading of data stored using a positional approach requires access to a format definition and was common in FORTRAN and COBOL at the time when punch cards were used to store data. In the case of separators, different separators are in common use. Comma-separated values (CSV) encodings use either a comma or semicolon to separate the fields or columns. Tabulator, or tab-separated values (TSV) use the tab character as a column separator. Sometimes white space is used as a separator, most commonly when all values are to be converted to \code{numeric}.

\begin{explainbox}
\textbf{Not all text files are born equal.} When reading text files, and \emph{foreign} binary files which may contain embedded text strings, there is potential for their misinterpretation during the import operation. One common source of problems, is that column headers are to be read as \Rlang names. As earlier discussed, there are strict rules, such as avoiding spaces or special characters if the names are to be used with the normal syntax. On import, some functions will attempt to sanitize the names, but others not. Most such names are still accessible in \Rlang statements, but a special syntax is needed to protect them from triggering syntax errors through their interpretation as something different than variable or function names---in \Rlang jargon we say that they need to be quoted.

Some of the things we need to be on the watch for are:
1) Mismatches between the character encoding expected by the function used to read the file, and the encoding used for saving the file---usually because of different locales.
2) Leading or trailing (invisible) spaces present in the character values or column names---which are almost invisible when data frames are printed.
3) Wrongly guessed column classes---a typing mistake affecting a single value in a column, e.g.,  the wrong kind of decimal marker, prevents the column from being recognized as numeric.
4) Mismatched decimal marker in \code{CSV} files---the marker depends on the locale (language and sometimes country) settings.

If you encounter problems after import, such as failure of indexing of data frame columns by name, use function \code{names()} to get the names printed to the console as a character vector. This is useful because character vectors are always printed with each string delimited by quotation marks making leading and trailing spaces clearly visible. The same applies to use of \code{levels()} with factors created with data that might have contained mistakes.

To demonstrate some of these problems, I create a data frame with name sanitation disabled, and in the second statement with sanitation enabled. The first statement is equivalent to the default behavior of functions in package \pkgname{readr} and the second is equivalent to the behavior of base \Rlang functions. \pkgname{readr} prioritizes the integrity of the original data while \Rlang prioritizes compatibility with R's naming rules.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{data.frame}\hlstd{(}\hlkwc{a} \hlstd{=} \hlnum{1}\hlstd{,} \hlstr{"a "} \hlstd{=} \hlnum{2}\hlstd{,} \hlstr{" a"} \hlstd{=} \hlnum{3}\hlstd{,} \hlkwc{check.names} \hlstd{=} \hlnum{FALSE}\hlstd{)}
\end{alltt}
\begin{verbatim}
##   a a   a
## 1 1  2  3
\end{verbatim}
\begin{alltt}
\hlkwd{data.frame}\hlstd{(}\hlkwc{a} \hlstd{=} \hlnum{1}\hlstd{,} \hlstr{"a "} \hlstd{=} \hlnum{2}\hlstd{,} \hlstr{" a"} \hlstd{=} \hlnum{3}\hlstd{)}
\end{alltt}
\begin{verbatim}
##   a a. X.a
## 1 1  2   3
\end{verbatim}
\end{kframe}
\end{knitrout}

An even more subtle case is when characters can be easily confused by the user reading the output: zero and o (\code{a0} vs.\ \code{aO}) or el and one (\code{al} vs.\ \code{a1}) can be difficult to distinguish in some fonts. When using encodings capable of storing many character shapes, such as unicode, in some cases two characters with almost identical visual shape may be encoded as different characters.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{data.frame}\hlstd{(}\hlkwc{al} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{a1} \hlstd{=} \hlnum{2}\hlstd{,} \hlkwc{aO} \hlstd{=} \hlnum{3}\hlstd{,} \hlkwc{a0} \hlstd{=} \hlnum{4}\hlstd{)}
\end{alltt}
\begin{verbatim}
##   al a1 aO a0
## 1  1  2  3  4
\end{verbatim}
\end{kframe}
\end{knitrout}

Reading data from a text file can result in very odd-looking values stored in \Rlang variables because of a mismatch in encoding, e.g., when a CSV file saved with \pgrmname{MS-Excel} is silently encoded using 16-bit unicode format, but read as an 8-bit unicode encoded file.

The hardest part of all these problems is to diagnose their origin, as function arguments and working environment options can in most cases be used to force the correct decoding of text files with diverse characteristics, origins and vintages once one knows what is required. One function in the \Rlang \pkgname{tools} package, which is not exported, can at the time of writing be used to test files for the presence on non-ASCII characters: \Rfunction{tools:::showNonASCIIfile()}. This function takes as an argument the path to a file.
\end{explainbox}

\subsection[Base R and `utils']{Base \Rlang and \pkgname{utils}}
\index{text files!with field markers}
Text files containing data in columns can be divided into two broad groups. Those with fixed-width fields and those with delimited fields. Fixed-width fields were especially common in the early days of \langname{FORTRAN} and \langname{COBOL} when data storage capacity was very limited. These formats are frequently capable of encoding information using fewer characters than when delimited fields are used. The best way of understanding the differences is with examples. Although in this section we exemplify the use of functions by passing a file name as an argument, URLs, and open file descriptors are also accepted (see section \ref{sec:io:connections} on page \pageref{sec:io:connections}).

In the first example we will read a file with fields solely delimited by ``,.'' This is what is called comma-separated-values (CSV) format which can be read and written with \Rfunction{read.csv()} and \Rfunction{write.csv()}, respectively.

Example file \code{not-aligned-ASCII-UK.csv} contains:

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{verbatim}
col1,col2,col3,col4
1.0,24.5,346,ABC
23.4,45.6,78,Z Y
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{from_csv_a.df} \hlkwb{<-} \hlkwd{read.csv}\hlstd{(}\hlstr{"extdata/not-aligned-ASCII-UK.csv"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{sapply}\hlstd{(from_csv_a.df, class)}
\end{alltt}
\begin{verbatim}
##        col1        col2        col3        col4
##   "numeric"   "numeric"   "integer" "character"
\end{verbatim}
\begin{alltt}
\hlstd{from_csv_a.df[[}\hlstr{"col4"}\hlstd{]]}
\end{alltt}
\begin{verbatim}
## [1] "ABC" "Z Y"
\end{verbatim}
\begin{alltt}
\hlkwd{levels}\hlstd{(from_csv_a.df[[}\hlstr{"col4"}\hlstd{]])}
\end{alltt}
\begin{verbatim}
## NULL
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
Read the file \code{not-aligned-ASCII-UK.csv} with function \Rfunction{read.csv2()} instead of \Rfunction{read.csv()}. Although this may look like a waste of time, the point of the exercise is for you to get familiar with \Rlang behavior in case of such a mistake. This will help you recognize similar errors when they happen accidentally, which is quite common when files are shared.
\end{playground}

Example file \code{aligned-ASCII-UK.csv} contains comma-separated-values with added white space to align the columns, to make it easier to read by humans. These aligned fields contain leading and trailing white spaces that are included in string values when the file is read.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{verbatim}
col1, col2, col3, col4
 1.0, 24.5,  346,  ABC
23.4, 45.6,   78,  Z Y
\end{verbatim}
\end{kframe}
\end{knitrout}

Although space characters are read as part of the fields, they are ignored when conversion to numeric takes place. The remaining leading and trailing spaces in character strings are difficult to see when data frames are printed.
\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{from_csv_b.df} \hlkwb{<-} \hlkwd{read.csv}\hlstd{(}\hlstr{"extdata/aligned-ASCII-UK.csv"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

Using \code{levels()} we can more clearly see that the labels of the automatically created factor levels contain leading spaces.
\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{sapply}\hlstd{(from_csv_b.df, class)}
\end{alltt}
\begin{verbatim}
##        col1        col2        col3        col4
##   "numeric"   "numeric"   "integer" "character"
\end{verbatim}
\begin{alltt}
\hlstd{from_csv_b.df[[}\hlstr{"col4"}\hlstd{]]}
\end{alltt}
\begin{verbatim}
## [1] "  ABC" "  Z Y"
\end{verbatim}
\begin{alltt}
\hlkwd{levels}\hlstd{(from_csv_b.df[[}\hlstr{"col4"}\hlstd{]])}
\end{alltt}
\begin{verbatim}
## NULL
\end{verbatim}
\end{kframe}
\end{knitrout}

By default, column names are sanitized but factor levels are not. By consulting the documentation with \code{help(read.csv)} we discover that by passing an additional argument we can change this default and obtain the data read as desired. Most likely the default has been chosen so that by default data integrity is maintained.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{from_csv_e.df} \hlkwb{<-} \hlkwd{read.csv}\hlstd{(}\hlstr{"extdata/aligned-ASCII-UK.csv"}\hlstd{,} \hlkwc{strip.white} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\hlkwd{sapply}\hlstd{(from_csv_e.df, class)}
\end{alltt}
\begin{verbatim}
##        col1        col2        col3        col4
##   "numeric"   "numeric"   "integer" "character"
\end{verbatim}
\begin{alltt}
\hlstd{from_csv_e.df[[}\hlstr{"col4"}\hlstd{]]}
\end{alltt}
\begin{verbatim}
## [1] "ABC" "Z Y"
\end{verbatim}
\begin{alltt}
\hlkwd{levels}\hlstd{(from_csv_e.df[[}\hlstr{"col4"}\hlstd{]])}
\end{alltt}
\begin{verbatim}
## NULL
\end{verbatim}
\end{kframe}
\end{knitrout}

The functions from the \Rlang \pkgname{utils} package by default convert columns containing character strings into factors, as seen above. This default can be changed so that character strings remain as is.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{from_csv_c.df} \hlkwb{<-} \hlkwd{read.csv}\hlstd{(}\hlstr{"extdata/not-aligned-ASCII-UK.csv"}\hlstd{,}
                          \hlkwc{stringsAsFactors} \hlstd{=} \hlnum{FALSE}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{sapply}\hlstd{(from_csv_c.df, class)}
\end{alltt}
\begin{verbatim}
##        col1        col2        col3        col4
##   "numeric"   "numeric"   "integer" "character"
\end{verbatim}
\begin{alltt}
\hlstd{from_csv_c.df[[}\hlstr{"col4"}\hlstd{]]}
\end{alltt}
\begin{verbatim}
## [1] "ABC" "Z Y"
\end{verbatim}
\end{kframe}
\end{knitrout}

Decimal points and exponential notation are allowed for floating point values. In English-speaking locales, the decimal mark is a point, while in many other locales it is a comma. If a comma is used as decimal marker, we can no longer use it as field separator and is usually substituted by a semicolon (\verb|;|). In such a case we can use \Rfunction{read.csv2()} and \Rfunction{write.csv2}. Furthermore, parameters \code{dec} and \code{sep} allow setting them to arbitrary characters. Function \Rfunction{read.table()} does the actual work and functions like \Rfunction{read.csv()} only differ in the default arguments for the different parameters. By default, \Rfunction{read.table()} expects fields to be separated by white space (one or more spaces, tabs, new lines, or carriage return). Strings with embedded spaces need to be quoted in the file as shown below.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{verbatim}
col1 col2 col3 col4
 1.0 24.5  346 ABC
23.4 45.6   78 "Z Y"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{from_txt_b.df} \hlkwb{<-} \hlkwd{read.table}\hlstd{(}\hlstr{"extdata/aligned-ASCII.txt"}\hlstd{,} \hlkwc{header} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{sapply}\hlstd{(from_txt_b.df, class)}
\end{alltt}
\begin{verbatim}
##        col1        col2        col3        col4
##   "numeric"   "numeric"   "integer" "character"
\end{verbatim}
\begin{alltt}
\hlstd{from_txt_b.df[[}\hlstr{"col4"}\hlstd{]]}
\end{alltt}
\begin{verbatim}
## [1] "ABC" "Z Y"
\end{verbatim}
\begin{alltt}
\hlkwd{levels}\hlstd{(from_txt_b.df[[}\hlstr{"col4"}\hlstd{]])}
\end{alltt}
\begin{verbatim}
## NULL
\end{verbatim}
\end{kframe}
\end{knitrout}

\index{text files!fixed width fields}
With a fixed-width format, no delimiters are needed. Decoding is based solely on the position of the characters in the line or record. A file like this cannot be interpreted without a description of the format used for saving the data. Files containing data stored in \emph{fixed width format} can be read with function \Rfunction{read.fwf()}. Records for a single observation can be stored in a single or multiple lines. In either case, each line has fields of different but fixed known widths.

Function \Rfunction{read.fortran()} is a wrapper on \Rfunction{read.fwf()} that accepts format definitions similar to those used in \langname{FORTRAN}. One particularity of \langname{FORTRAN} \emph{formatted data transfer} is that the decimal marker can be omitted in the saved file and its position specified as part of the format definition, a trick used to make text files (or stacks of punch cards!) smaller. Modern versions of \langname{FORTRAN} support reading from and writing to other formats like those using field delimiters described above.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{verbatim}
 10245346ABC
234456 78Z Y
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{from_fwf_a.df} \hlkwb{<-} \hlkwd{read.fortran}\hlstd{(}\hlstr{"extdata/aligned-ASCII.fwf"}\hlstd{,}
                              \hlkwc{format} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"2F3.1"}\hlstd{,} \hlstr{"F3.0"}\hlstd{,} \hlstr{"A3"}\hlstd{),}
                              \hlkwc{col.names} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"col1"}\hlstd{,} \hlstr{"col2"}\hlstd{,} \hlstr{"col3"}\hlstd{,} \hlstr{"col4"}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{sapply}\hlstd{(from_fwf_a.df, class)}
\end{alltt}
\begin{verbatim}
##        col1        col2        col3        col4
##   "numeric"   "numeric"   "numeric" "character"
\end{verbatim}
\begin{alltt}
\hlstd{from_fwf_a.df[[}\hlstr{"col4"}\hlstd{]]}
\end{alltt}
\begin{verbatim}
## [1] "ABC" "Z Y"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{explainbox}
  The file reading functions described above share with \Rfunction{read.table()} the same parameters. In addition to those described above, other frequently useful parameters are \code{skip} and \code{n}, which can be used to skip lines at the top of a file and limit the number of lines (or records) to read; \code{header}, which accepts a logical argument indicating if the fields in the first text line read should be decoded as column names rather than data; \code{na.strings}, to which can be passed a character vector with strings to be interpreted as \code{NA}; and \code{colClasses}, which provides control of the conversion of the fields to \Rlang classes and possibly skipping some columns altogether. All these parameters are described in the corresponding help pages.
\end{explainbox}

\begin{playground}
In reality  \Rfunction{read.csv()}, \code{read.csv2()} and \Rfunction{read.table()} are the same function with different default arguments to several of their parameters. Study the help page, and by passing suitable arguments, make \Rfunction{read.csv()} behave like \Rfunction{read.table()}, then make \Rfunction{read.table()} behave like \Rfunction{read.csv2()}.
\end{playground}

\begin{explainbox}
We can read a text file as character strings, without attempting to decode them. This is occasionally useful, such as when we do the decoding as part of our own script. In this case, the function to use is \code{readLines()}. The returned value is a character vector in which each member string corresponds to one line or record in the file, with the end-of-line markers stripped (see example in section \ref{sec:io:connections} on page \pageref{sec:io:connections}).
\end{explainbox}
\index{importing data!text files|)}

\index{exporting data!text files|(}
Next we give one example of the use of a \emph{write} function matching one of the \emph{read} functions described above. The \Rfunction{write.csv()} function takes as an argument a data frame, or an object that can be coerced into a data frame, converts it to character strings, and saves them to a text file. We first create the data frame that we will write to disk.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.df} \hlkwb{<-} \hlkwd{data.frame}\hlstd{(}\hlkwc{x} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{5}\hlstd{,} \hlkwc{y} \hlstd{=} \hlnum{5}\hlopt{:}\hlnum{1} \hlopt{/} \hlnum{10}\hlstd{,} \hlkwc{z} \hlstd{= letters[}\hlnum{1}\hlopt{:}\hlnum{5}\hlstd{])}
\end{alltt}
\end{kframe}
\end{knitrout}

We write \code{my.df} to a CSV file suitable for an English language locale, and then display its contents.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{write.csv}\hlstd{(my.df,} \hlkwc{file} \hlstd{=} \hlstr{"my-file1.csv"}\hlstd{,} \hlkwc{row.names} \hlstd{=} \hlnum{FALSE}\hlstd{)}
\hlkwd{file.show}\hlstd{(}\hlstr{"my-file1.csv"}\hlstd{,} \hlkwc{pager} \hlstd{=} \hlstr{"console"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{verbatim}
"x","y","z"
1,0.5,"a"
2,0.4,"b"
3,0.3,"c"
4,0.2,"d"
5,0.1,"e"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{explainbox}
In most cases setting, as above, \code{row.names = FALSE} when writing a CSV file will help when it is read. Of course, if row names do contain important information, such as gene tags, you cannot skip writing the row names to the file unless you first copy these data into a column in the data frame. (Row names are stored separately as an attribute in \code{data.frame} objects, see section \ref{sec:calc:attributes} on page \pageref{sec:calc:attributes} for details.)
\end{explainbox}

\begin{playground}
Write the data frame \code{my.df} into text files with functions \Rfunction{write.csv2()} and \Rfunction{write.table()} instead of \Rfunction{read.csv()} and display the files.
\end{playground}

Function \Rfunction{cat()} takes \Rlang objects and writes them after conversion to character strings to the console or a file, inserting one or more characters as separators, by default, a space. This separator can be set through parameter \code{sep}. In our example we set \code{sep} to a new line (entered as the escape sequence \code{"\textbackslash n"}).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.lines} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlstr{"abcd"}\hlstd{,} \hlstr{"hello world"}\hlstd{,} \hlstr{"123.45"}\hlstd{)}
\hlkwd{cat}\hlstd{(my.lines,} \hlkwc{file} \hlstd{=} \hlstr{"my-file2.txt"}\hlstd{,} \hlkwc{sep} \hlstd{=} \hlstr{"\textbackslash{}n"}\hlstd{)}
\hlkwd{file.show}\hlstd{(}\hlstr{"my-file2.txt"}\hlstd{,} \hlkwc{pager} \hlstd{=} \hlstr{"console"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{verbatim}
abcd
hello world
123.45
\end{verbatim}
\end{kframe}
\end{knitrout}
\index{exporting data!text files|)}

\subsection[readr]{\pkgname{readr}}\label{sec:files:readr}
\index{importing data!text files|(}


Package \pkgname{readr} is part of the \pkgname{tidyverse} suite. It defines functions that allow faster input and output, and have different default behavior. Contrary to base \Rlang functions, they are optimized for speed, but may sometimes wrongly decode their input and rarely even silently do this. Base \Rlang functions do less \emph{guessing}, e.g., the delimiters must be supplied as arguments. The \pkgname{readr} functions guess more properties of the text file format; in most cases they succeed, which is very handy, but occasionally they fail. Automatic guessing can be overridden by passing arguments and this is recommended for scripts that may be reused to read different files in the future. Another important advantage is that these functions read character strings formatted as dates or times directly into columns of class \code{POSIXct}. All \code{write} functions defined in \pkgname{readr} have an \code{append} parameter, which can be used to change the default behavior of overwriting an existing file with the same name, to appending the output at its end.

Although in this section we exemplify the use of these functions by passing a file name as an argument, as is the case with \Rlang native functions, URLs, and open file descriptors are also accepted (see section \ref{sec:io:connections} on page \pageref{sec:io:connections}). Furthermore, if the file name ends in a tag recognizable as indicating a compressed file format, the file will be uncompressed on the fly.

\begin{warningbox}
The names of functions ``equivalent'' to those described in the previous section have names formed by replacing the dot with an underscore, e.g.,  \Rfunction{read\_csv()} $\approx$ \Rfunction{read.csv()}. The similarity refers to the format of the files read, but not the order, names, or roles of their formal parameters. For example, function \code{read\_table()} has a slightly different behavior than \Rfunction{read.table()}, although they both read fields separated by white space. Other aspects of the default behavior are also different, for example \pkgname{readr} functions do not convert columns of character strings into factors and row names are not set in the returned \Rclass{tibble}, which inherits from \Rclass{data.frame}, but is not fully compatible (see section \ref{sec:data:tibble} on page \pageref{sec:data:tibble}).
\end{warningbox}

As we can see in this first example, these functions also report to the console the specifications of the columns, which is important when these are guessed from the file contents, or part of it.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{read_csv}\hlstd{(}\hlkwc{file} \hlstd{=} \hlstr{"extdata/aligned-ASCII-UK.csv"}\hlstd{)}
\end{alltt}


{\ttfamily\noindent\itshape\color{messagecolor}{\#\# Parsed with column specification:\\\#\# cols(\\\#\#\ \  col1 = col\_double(),\\\#\#\ \  col2 = col\_double(),\\\#\#\ \  col3 = col\_double(),\\\#\#\ \  col4 = col\_character()\\\#\# )}}\begin{verbatim}
## # A tibble: 2 x 4
##    col1  col2  col3 col4
##   <dbl> <dbl> <dbl> <chr>
## 1   1    24.5   346 ABC
## 2  23.4  45.6    78 Z Y
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{read_csv}\hlstd{(}\hlkwc{file} \hlstd{=} \hlstr{"extdata/not-aligned-ASCII-UK.csv"}\hlstd{)}
\end{alltt}


{\ttfamily\noindent\itshape\color{messagecolor}{\#\# Parsed with column specification:\\\#\# cols(\\\#\#\ \  col1 = col\_double(),\\\#\#\ \  col2 = col\_double(),\\\#\#\ \  col3 = col\_double(),\\\#\#\ \  col4 = col\_character()\\\#\# )}}\begin{verbatim}
## # A tibble: 2 x 4
##    col1  col2  col3 col4
##   <dbl> <dbl> <dbl> <chr>
## 1   1    24.5   346 ABC
## 2  23.4  45.6    78 Z Y
\end{verbatim}
\end{kframe}
\end{knitrout}

Function \Rfunction{read\_table()}, differently to \Rfunction{read.table()}, retains quotes as part of read character strings.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{read_table}\hlstd{(}\hlkwc{file} \hlstd{=} \hlstr{"extdata/aligned-ASCII.txt"}\hlstd{)}
\end{alltt}


{\ttfamily\noindent\itshape\color{messagecolor}{\#\# Parsed with column specification:\\\#\# cols(\\\#\#\ \  col1 = col\_double(),\\\#\#\ \  col2 = col\_double(),\\\#\#\ \  col3 = col\_double(),\\\#\#\ \  col4 = col\_character()\\\#\# )}}\begin{verbatim}
## # A tibble: 2 x 4
##    col1  col2  col3 col4
##   <dbl> <dbl> <dbl> <chr>
## 1   1    24.5   346 "ABC"
## 2  23.4  45.6    78 "\"Z Y\""
\end{verbatim}
\end{kframe}
\end{knitrout}

Because of the misaligned fields in file \code{"not-aligned-ASCII.txt"}, we need to use \Rfunction{read\_table2()}, which allows misalignment of fields, like \Rfunction{read.table()}, instead of \Rfunction{read\_table()}, which expects vertically aligned fields across rows. However, in this case the embedded space character in the quoted string is misinterpreted and part of the string dropped with a warning.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{read_table2}\hlstd{(}\hlkwc{file} \hlstd{=} \hlstr{"extdata/not-aligned-ASCII.txt"}\hlstd{)}
\end{alltt}


{\ttfamily\noindent\itshape\color{messagecolor}{\#\# Parsed with column specification:\\\#\# cols(\\\#\#\ \  col1 = col\_double(),\\\#\#\ \  col2 = col\_double(),\\\#\#\ \  col3 = col\_double(),\\\#\#\ \  col4 = col\_character()\\\#\# )}}

{\ttfamily\noindent\color{warningcolor}{\#\# Warning: 1 parsing failure.\\\#\# row col\ \ expected\ \ \ \ actual\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ file\\\#\#\ \  2\ \ -- 4 columns 5 columns 'extdata/not-aligned-ASCII.txt'}}\begin{verbatim}
## # A tibble: 2 x 4
##    col1  col2  col3 col4
##   <dbl> <dbl> <dbl> <chr>
## 1   1    24.5   346 "ABC"
## 2  23.4  45.6    78 "\"Z"
\end{verbatim}
\end{kframe}
\end{knitrout}

Function \Rfunction{read\_delim()} with space as the delimiter needs to be used.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{read_delim}\hlstd{(}\hlkwc{file} \hlstd{=} \hlstr{"extdata/not-aligned-ASCII.txt"}\hlstd{,} \hlkwc{delim} \hlstd{=} \hlstr{" "}\hlstd{)}
\end{alltt}


{\ttfamily\noindent\itshape\color{messagecolor}{\#\# Parsed with column specification:\\\#\# cols(\\\#\#\ \  col1 = col\_double(),\\\#\#\ \  col2 = col\_double(),\\\#\#\ \  col3 = col\_double(),\\\#\#\ \  col4 = col\_character()\\\#\# )}}\begin{verbatim}
## # A tibble: 2 x 4
##    col1  col2  col3 col4
##   <dbl> <dbl> <dbl> <chr>
## 1   1    24.5   346 ABC
## 2  23.4  45.6    78 Z Y
\end{verbatim}
\end{kframe}
\end{knitrout}

Function \Rfunction{read\_tsv()} reads files encoded with the tab character as the delimiter, and \Rfunction{read\_fwf()} reads files with fixed width fields. There is, however, no equivalent to \Rfunction{read.fortran()}, supporting implicit decimal points.

\begin{playground}
Use the "wrong" \code{read\_} functions to read the example files used above and/or your own files. As mentioned earlier, forcing errors will help you learn how to diagnose when such errors are caused by coding mistakes. In this case, as wrongly read data are not always accompanied by error or warning messages, carefully check the returned tibbles for misread data values.
\end{playground}

\begin{explainbox}
The functions from R's \pkgname{utils} read the whole file as text before attempting to guess the class of the columns or their alignment. This is reliable but slow for very large text files. The functions from \pkgname{readr} read only the top 1000 lines by default for guessing, and then rather blindly read the whole files assuming that the guessed properties also apply to the remainder of the file. This is more efficient, but somehow risky. In earlier versions of \pkgname{readr}, a typical failure to correctly decode fields was when numbers are in increasing order and the field widths continue increasing in the lines below those used for guessing, but this case seems to be, at the time of writing correctly, handled. It also means that in cases when an individual value after \code{guess\_max} lines cannot be converted to numeric, instead of returning a column of character strings as base \Rpgrm functions, this value is encoded as a numeric \code{NA} with a warning. To demonstrate this we will drastically reduce \code{guess\_max} from its default so that we can use an example file only a few lines in length.


\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{read_table2}\hlstd{(}\hlkwc{file} \hlstd{=} \hlstr{"extdata/miss-aligned-ASCII.txt"}\hlstd{)}
\end{alltt}


{\ttfamily\noindent\itshape\color{messagecolor}{\#\# Parsed with column specification:\\\#\# cols(\\\#\#\ \  col1 = col\_character(),\\\#\#\ \  col2 = col\_double(),\\\#\#\ \  col3 = col\_double(),\\\#\#\ \  col4 = col\_character()\\\#\# )}}\begin{verbatim}
## # A tibble: 4 x 4
##   col1   col2  col3 col4
##   <chr> <dbl> <dbl> <chr>
## 1 1.0    24.5   346 ABC
## 2 2.4    45.6    78 XYZ
## 3 20.4   45.6    78 XYZ
## 4 a      20    2500 abc
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{read_table2}\hlstd{(}\hlkwc{file} \hlstd{=} \hlstr{"extdata/miss-aligned-ASCII.txt"}\hlstd{,} \hlkwc{guess_max} \hlstd{=} \hlnum{3L}\hlstd{)}
\end{alltt}


{\ttfamily\noindent\itshape\color{messagecolor}{\#\# Parsed with column specification:\\\#\# cols(\\\#\#\ \  col1 = col\_double(),\\\#\#\ \  col2 = col\_double(),\\\#\#\ \  col3 = col\_double(),\\\#\#\ \  col4 = col\_character()\\\#\# )}}

{\ttfamily\noindent\color{warningcolor}{\#\# Warning: 1 parsing failure.\\\#\# row\ \ col expected actual\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \  file\\\#\#\ \  4 col1 a double\ \ \ \ \ \ a 'extdata/miss-aligned-ASCII.txt'}}\begin{verbatim}
## # A tibble: 4 x 4
##    col1  col2  col3 col4
##   <dbl> <dbl> <dbl> <chr>
## 1   1    24.5   346 ABC
## 2   2.4  45.6    78 XYZ
## 3  20.4  45.6    78 XYZ
## 4  NA    20    2500 abc
\end{verbatim}
\end{kframe}
\end{knitrout}
\end{explainbox}
\index{importing data!text files|)}


\index{exporting data!text files|(}
The \code{write\_} functions from \pkgname{readr} are the counterpart to \code{write.} functions from \pkgname{utils}. In addition to the expected \Rfunction{write\_csv()}, \Rfunction{write\_csv2()}, \Rfunction{write\_tsv()} and \Rfunction{write\_delim()}, \pkgname{readr} provides functions that write \pgrmname{MS-Excel}-friendly CSV files. We demonstrate here the use of \Rfunction{write\_excel\_csv()} to produce a text file with comma-separated fields suitable for import into \pgrmname{MS-Excel}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{write_excel_csv}\hlstd{(my.df,} \hlkwc{path} \hlstd{=} \hlstr{"my-file6.csv"}\hlstd{)}
\hlkwd{file.show}\hlstd{(}\hlstr{"my-file6.csv"}\hlstd{,} \hlkwc{pager} \hlstd{=} \hlstr{"console"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

That saves a file containing the following text:
\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{verbatim}
x,y,z
1,0.5,a
2,0.4,b
3,0.3,c
4,0.2,d
5,0.1,e
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
Compare the output from \Rfunction{write\_excel\_csv()} and \Rfunction{write\_csv()}. What is the difference? Does it matter when you import the written CSV file into Excel (the version you are using, with the locale settings of your computer)?
\end{playground}

The pair of functions \Rfunction{read\_lines()} and \Rfunction{write\_lines()} read and write character vectors without conversion, similarly to base \Rlang \code{readLines()} and \code{writeLines()}. Functions \Rfunction{read\_file()} and \Rfunction{write\_file()} read and write the contents of a whole text file into, and from, a single character string. Functions \Rfunction{read\_file()} and \Rfunction{write\_file()} can also be used with raw vectors to read and write binary files or text files of unknown encoding.

The contents of the whole file are returned as a character vector of length one, with the embedded new line markers. We use \code{cat()} to print it so these new line characters force the start of a new print-out line.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{one.str} \hlkwb{<-} \hlkwd{read_file}\hlstd{(}\hlkwc{file} \hlstd{=} \hlstr{"extdata/miss-aligned-ASCII.txt"}\hlstd{)}
\hlkwd{length}\hlstd{(one.str)}
\end{alltt}
\begin{verbatim}
## [1] 1
\end{verbatim}
\begin{alltt}
\hlkwd{cat}\hlstd{(one.str)}
\end{alltt}
\begin{verbatim}
## col1  col2 col3 col4

## 1.0   24.5  346 ABC

## 2.4   45.6   78 XYZ

## 20.4   45.6   78 XYZ

##  a    20     2500 abc

\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{advplayground}
Use \Rfunction{write\_file()} to write a file that can be read with \Rfunction{read\_csv()}.
\end{advplayground}
\index{exporting data!text files|)}

\section{XML and HTML files}
\index{importing data!XML and HTML files|(}

XML files contain text with special markup. Several modern data exchange formats are based on the \langname{XML} standard (see \url{https://www.w3.org/TR/xml/}) which uses schemas for flexibility. Schemas define specific formats, allowing reading of formats not specifically targeted during development of the read functions. Even the modern \langname{XHTML} standard used for web pages is based on such schemas, while \langname{HTML} only differs slightly in its syntax.

\subsection[`xml2']{\pkgname{xml2}}


Package \pkgname{xml2} provides functions for reading and parsing \langname{XTML} and \langname{HTML} files. This is a vast subject, of which I will only give a brief example.

We first read a web page with function \Rfunction{read\_html()}, and explore its structure.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{web_page} \hlkwb{<-} \hlkwd{read_html}\hlstd{(}\hlstr{"http://r4photobiology.info/R/index.html"}\hlstd{)}
\hlkwd{html_structure}\hlstd{(web_page)}
\end{alltt}
\begin{verbatim}
## <html>
##   <head>
##     <title>
##       {text}
##     <meta [name, content]>
##     <meta [name, content]>
##     <meta [name, content]>
##   <body>
##     {text}
##     <hr>
##     <h1>
##       {text}
##     {text}
##     <hr>
##     <p>
##       {text}
##       <a [href]>
##         {text}
##       {text}
##     {text}
##     <p>
##       {text}
##       <a [href]>
##         {text}
##       {text}
##     {text}
##     <address>
##       {text}
##     {text}
\end{verbatim}
\end{kframe}
\end{knitrout}

Next we extract the text from its \code{title} attribute, using functions \Rfunction{xml\_find\_all()} and \Rfunction{xml\_text()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{xml_text}\hlstd{(}\hlkwd{xml_find_all}\hlstd{(web_page,} \hlstr{".//title"}\hlstd{))}
\end{alltt}
\begin{verbatim}
## [1] "Suite of R packages for photobiology"
\end{verbatim}
\end{kframe}
\end{knitrout}

The functions defined in this package can be used to ``harvest'' data from web pages, but also to read data from files using formats that are defined through \langname{XML} schemas.
\index{importing data!XML and HTML files|)}

\section{GPX files}
\index{importing data!GPX files|(}
GPX (GPS Exchange Format) files use an XML scheme designed for saving and exchanging data from geographic positioning systems (GPS). There is some variation on the variables saved depending on the settings of the GPS receiver. The example data used here is from a Transmeta BT747 GPS logger. The example below reads the data into a \code{tibble} as character strings. For plotting, the character values representing numbers and dates would need to be converted to numeric and datetime (\code{POSIXct}) values, respectively. In the case of plotting tracks on a map, it is preferable to use package \pkgname{sf} to import the tracks directly from the \code{.gpx} file into a layer (use of the dot pipe operator is described in section \ref{sec:data:pipes} on page \pageref{sec:data:pipes}).

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{xmlTreeParse}\hlstd{(}\hlkwc{file} \hlstd{=} \hlstr{"extdata/GPSDATA.gpx"}\hlstd{,} \hlkwc{useInternalNodes} \hlstd{=} \hlnum{TRUE}\hlstd{)} \hlopt{%.>%}
\hlkwd{xmlRoot}\hlstd{(}\hlkwc{x} \hlstd{= .)} \hlopt{%.>%}
\hlkwd{xmlToList}\hlstd{(}\hlkwc{node} \hlstd{= .)[[}\hlstr{"trk"}\hlstd{]]} \hlopt{%.>%}
\hlkwd{unlist}\hlstd{(}\hlkwc{x} \hlstd{= .[}\hlkwd{names}\hlstd{(.)} \hlopt{==} \hlstr{"trkseg"}\hlstd{],} \hlkwc{recursive} \hlstd{=} \hlnum{FALSE}\hlstd{)} \hlopt{%.>%}
\hlkwd{map_df}\hlstd{(}\hlkwc{.x} \hlstd{= .,} \hlkwc{.f} \hlstd{=} \hlkwa{function}\hlstd{(}\hlkwc{x}\hlstd{)} \hlkwd{as_tibble}\hlstd{(}\hlkwc{x} \hlstd{=} \hlkwd{t}\hlstd{(}\hlkwc{x} \hlstd{=} \hlkwd{unlist}\hlstd{(}\hlkwc{x} \hlstd{= x))))}
\end{alltt}
\begin{verbatim}
## # A tibble: 199 x 7
##   time              speed name                 type  fix   .attrs.lat .attrs.lon
## * <chr>             <chr> <chr>                <chr> <chr> <chr>      <chr>
## 1 2018-12-08T23:09~ 0.03~ trkpt-2018-12-08T23~ T     3d    -34.912071 138.660595
## 2 2018-12-08T23:09~ 0.08~ trkpt-2018-12-08T23~ T     3d    -34.912067 138.660543
## 3 2018-12-08T23:09~ 0.01~ trkpt-2018-12-08T23~ T     3d    -34.912102 138.660554
## # ... with 196 more rows
\end{verbatim}
\end{kframe}
\end{knitrout}

I have passed all arguments by name to make explicit how this pipe works. See section \ref{sec:data:pipes} on page \pageref{sec:data:pipes} for details on the use of the pipe and dot-pipe operators.

\begin{playground}
  To understand what data transformation takes place in each statement of this pipe, start by executing the first statement by itself, excluding the dot-pipe operator, and continue adding one statement at a time, and at each step check the returned value and look out for what has changed from the previous step.
\end{playground}

\index{importing data!GPX files|)}

\section{Worksheets}\label{sec:files:worksheets}
\index{importing data!worksheets and workbooks|(}

Microsoft Office, Open Office and Libre Office are the most frequently used suites containing programs based on the worksheet paradigm. There is available a standardized file format for exchange of worksheet data, but it does not support all the features present in native file formats. We will start by considering \pgrmname{MS-Excel}. The file format used by \pgrmname{MS-Excel} has changed significantly over the years, and old formats tend to be less well supported by available \Rlang packages and may require the file to be updated to a more modern format with \pgrmname{MS-Excel} itself before import into \Rlang. The current format is based on XML and relatively simple to decode, whereas older binary formats are more difficult. Worksheets contain code as equations in addition to the actual data. In all cases, only values entered as such or those computed by means of the embedded equations can be imported into \Rlang rather than the equations themselves.

\subsection{CSV files as middlemen}

If we have access to the original software used for creating a worksheet or workbook, then exporting worksheets to text files in CSV format and importing them into \Rlang using the functions described in sections \ref{sec:files:txt} and \ref{sec:files:readr} starting on pages \pageref{sec:files:txt} and \pageref{sec:files:readr} provides a broadly compatible route for importing data---with the caveat that we should take care that delimiters and decimal marks match the expectations of the functions used. This approach is not ideal from the perspective of having to recreate intermediate files. A better approach is, when feasible, to import the data directly from the workbook or worksheets into \Rlang.

\subsection[`readxl']{\pkgname{readxl}}\label{sec:files:excel}
\index{importing data!.xlsx files|(}


Package \pkgname{readxl} supports reading of \pgrmname{MS-Excel} workbooks, and selecting worksheets and regions within worksheets specified in ways similar to those used by \pgrmname{MS-Excel} itself. The interface is simple, and the package easy to install. We will import a file that in \pgrmname{MS-Excel} looks like the screen capture below.

\begin{center}
\includegraphics[width=0.75\textwidth]{figures/Book1-xlsx.png}
\end{center}

We first list the sheets contained in the workbook file with \Rfunction{excel\_sheets()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{sheets} \hlkwb{<-} \hlkwd{excel_sheets}\hlstd{(}\hlstr{"extdata/Book1.xlsx"}\hlstd{)}
\hlstd{sheets}
\end{alltt}
\begin{verbatim}
## [1] "my data"
\end{verbatim}
\end{kframe}
\end{knitrout}

In this case, the argument passed to \code{sheet} is redundant, as there is only a single worksheet in the file. It is possible to use either the name of the sheet or a positional index (in this case \code{1} would be equivalent to \code{"my data"}). We use function \Rfunction{read\_excel()} to import the worksheet. Being part of the \pkgname{tidyverse} the returned value is a tibble and character columns are returned as is.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{Book1.df} \hlkwb{<-} \hlkwd{read_excel}\hlstd{(}\hlstr{"extdata/Book1.xlsx"}\hlstd{,} \hlkwc{sheet} \hlstd{=} \hlstr{"my data"}\hlstd{)}
\hlstd{Book1.df}
\end{alltt}
\begin{verbatim}
## # A tibble: 10 x 3
##   sample group observation
##    <dbl> <chr>       <dbl>
## 1      1 a               1
## 2      2 a               5
## 3      3 a               7
## # ... with 7 more rows
\end{verbatim}
\end{kframe}
\end{knitrout}

We can also read a region instead of the whole worksheet.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{Book1_region.df} \hlkwb{<-} \hlkwd{read_excel}\hlstd{(}\hlstr{"extdata/Book1.xlsx"}\hlstd{,} \hlkwc{sheet} \hlstd{=} \hlstr{"my data"}\hlstd{,} \hlkwc{range} \hlstd{=} \hlstr{"A1:B8"}\hlstd{)}
\hlstd{Book1_region.df}
\end{alltt}
\begin{verbatim}
## # A tibble: 7 x 2
##   sample group
##    <dbl> <chr>
## 1      1 a
## 2      2 a
## 3      3 a
## # ... with 4 more rows
\end{verbatim}
\end{kframe}
\end{knitrout}

Of the remaining arguments, the most useful ones have the same names and play similar roles as in \pkgname{readr} (see section \ref{sec:files:readr} on page \pageref{sec:files:readr}).

\subsection[`xlsx']{\pkgname{xlsx}}


Package \pkgname{xlsx} can be more difficult to install as it uses Java functions to do the actual work. However, it is more comprehensive, with functions both for reading and writing \pgrmname{MS-Excel} worksheets and workbooks, in different formats including the older binary ones. Similar to \pkgname{readr} it allows selected regions of a worksheet to be imported.

Here we use function \Rfunction{read.xlsx()}, indexing the worksheet by name. The returned value is a data frame, and following the expectations of \Rlang package \pkgnameNI{utils}, character columns are converted into factors by default.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{Book1_xlsx.df} \hlkwb{<-} \hlkwd{read.xlsx}\hlstd{(}\hlstr{"extdata/Book1.xlsx"}\hlstd{,} \hlkwc{sheetName} \hlstd{=} \hlstr{"my data"}\hlstd{)}
\hlstd{Book1_xlsx.df}
\end{alltt}
\begin{verbatim}
##    sample group observation
## 1       1     a         1.0
## 2       2     a         5.0
## 3       3     a         7.0
## 4       4     a         2.0
## 5       5     a         5.0
## 6       6     b         0.0
## 7       7     b         2.0
## 8       8     b         3.0
## 9       9     b         1.0
## 10     10     b         1.5
\end{verbatim}
\end{kframe}
\end{knitrout}

With function \Rfunction{write.xlsx()} we can write data frames out to Excel worksheets and even append new worksheets to an existing workbook.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{set.seed}\hlstd{(}\hlnum{456321}\hlstd{)}
\hlstd{my.data} \hlkwb{<-} \hlkwd{data.frame}\hlstd{(}\hlkwc{x} \hlstd{=} \hlnum{1}\hlopt{:}\hlnum{10}\hlstd{,} \hlkwc{y} \hlstd{= letters[}\hlnum{1}\hlopt{:}\hlnum{10}\hlstd{])}
\hlkwd{write.xlsx}\hlstd{(my.data,} \hlkwc{file} \hlstd{=} \hlstr{"extdata/my-data.xlsx"}\hlstd{,} \hlkwc{sheetName} \hlstd{=} \hlstr{"first copy"}\hlstd{)}
\hlkwd{write.xlsx}\hlstd{(my.data,} \hlkwc{file} \hlstd{=} \hlstr{"extdata/my-data.xlsx"}\hlstd{,} \hlkwc{sheetName} \hlstd{=} \hlstr{"second copy"}\hlstd{,} \hlkwc{append} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

When opened in Excel, we get a workbook containing two worksheets, named using the arguments we passed through \code{sheetName} in the code chunk above.
% screen capture to be replaced!!
\begin{center}
\includegraphics[width=0.75\textwidth]{figures/my-data-xlsx.png}
\end{center}

\begin{playground}
If you have some worksheet files available, import them into \Rlang to get a feel for how the way in which data is organized in the worksheets affects how easy or difficult it is to import them into \Rlang.
\end{playground}
\index{importing data!.xlsx files|)}

\subsection[`readODS']{\pkgname{readODS}}
\index{importing data!.ods files|(}

Package \pkgname{readODS} provides functions for reading data saved in files that follow the \emph{Open Documents Standard}. Function \Rfunction{read\_ods()} has a similar but simpler user interface to that of \code{read\_excel()} and reads one worksheet at a time, with support only for skipping top rows. The value returned is a data frame.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{ods.df} \hlkwb{<-} \hlkwd{read_ods}\hlstd{(}\hlstr{"extdata/Book1.ods"}\hlstd{,} \hlkwc{sheet} \hlstd{=} \hlnum{1}\hlstd{)}
\end{alltt}


{\ttfamily\noindent\itshape\color{messagecolor}{\#\# Parsed with column specification:\\\#\# cols(\\\#\#\ \  sample = col\_double(),\\\#\#\ \  group = col\_character(),\\\#\#\ \  observation = col\_double()\\\#\# )}}\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{ods.df}
\end{alltt}
\begin{verbatim}
##    sample group observation
## 1       1     a         1.0
## 2       2     a         5.0
## 3       3     a         7.0
## 4       4     a         2.0
## 5       5     a         5.0
## 6       6     b         0.0
## 7       7     b         2.0
## 8       8     b         3.0
## 9       9     b         1.0
## 10     10     b         1.5
\end{verbatim}
\end{kframe}
\end{knitrout}

Function \Rfunction{write\_ods()} writes a data frame into an ODS file.
\index{importing data!.ods files|)}
\index{importing data!worksheets and workbooks|)}

\section{Statistical software}\label{sec:files:stat}
\index{importing data!other statistical software|(}

There are two different comprehensive packages for importing data saved from other statistical programs such as SAS, Statistica, SPSS, etc. The longtime ``standard'' is package \pkgname{foreign} included in base \Rlang, and package \pkgname{haven} is a newer contributed extension. In the case of files saved with old versions of statistical programs, functions from \pkgname{foreign} tend to be more robust than those from \pkgname{haven}.

\subsection[foreign]{\pkgname{foreign}}


Functions in package \pkgname{foreign} allow us to import data from files saved by several statistical analysis programs, including \pgrmname{SAS}, \pgrmname{Stata}, \pgrmname{SPPS}, \pgrmname{Systat}, \pgrmname{Octave} among others, and a function for writing data into files with formats native to \pgrmname{SAS}, \pgrmname{Stata}, and \pgrmname{SPPS}. \Rlang documents the use of these functions in detail in the \emph{R Data Import/Export} manual. As a simple example, we use function \Rfunction{read.spss()} to read a \texttt{.sav} file, saved a few years ago with the then current version of \pgrmname{SPSS}. We display only the first six rows and seven columns of the data frame, including a column with dates, which appears as numeric.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my_spss.df} \hlkwb{<-} \hlkwd{read.spss}\hlstd{(}\hlkwc{file} \hlstd{=} \hlstr{"extdata/my-data.sav"}\hlstd{,} \hlkwc{to.data.frame} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\end{alltt}


{\ttfamily\noindent\itshape\color{messagecolor}{\#\# re-encoding from UTF-8}}\begin{alltt}
\hlstd{my_spss.df[}\hlnum{1}\hlopt{:}\hlnum{6}\hlstd{,} \hlkwd{c}\hlstd{(}\hlnum{1}\hlopt{:}\hlnum{6}\hlstd{,} \hlnum{17}\hlstd{)]}
\end{alltt}
\begin{verbatim}
##   block       treat mycotreat water1 pot harvest harvest_date
## 1     0 Watered, EM         1      1  14       1  13653705600
## 2     0 Watered, EM         1      1  52       1  13653705600
## 3     0 Watered, EM         1      1 111       1  13653705600
## 4     0 Watered, EM         1      1 127       1  13653705600
## 5     0 Watered, EM         1      1 230       1  13653705600
## 6     0 Watered, EM         1      1 258       1  13653705600
\end{verbatim}
\end{kframe}
\end{knitrout}

A second example, this time with a simple \code{.sav} file saved 15 years ago.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{thiamin.df} \hlkwb{<-} \hlkwd{read.spss}\hlstd{(}\hlkwc{file} \hlstd{=} \hlstr{"extdata/thiamin.sav"}\hlstd{,} \hlkwc{to.data.frame} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\hlkwd{head}\hlstd{(thiamin.df)}
\end{alltt}
\begin{verbatim}
##   THIAMIN CEREAL
## 1     5.2  wheat
## 2     4.5  wheat
## 3     6.0  wheat
## 4     6.1  wheat
## 5     6.7  wheat
## 6     5.8  wheat
\end{verbatim}
\end{kframe}
\end{knitrout}

Another example, for a \pgrmname{Systat} file saved on an PC more than 20 years ago, and read with \Rfunction{read.systat()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my_systat.df} \hlkwb{<-} \hlkwd{read.systat}\hlstd{(}\hlkwc{file} \hlstd{=} \hlstr{"extdata/BIRCH1.SYS"}\hlstd{)}
\hlkwd{head}\hlstd{(my_systat.df)}
\end{alltt}
\begin{verbatim}
##   CONT DENS BLOCK SEEDL VITAL BASE ANGLE HEIGHT DIAM
## 1    1    1     1     2    44    2     0      1   53
## 2    1    1     1     2    41    2     1      2   70
## 3    1    1     1     2    21    2     0      1   65
## 4    1    1     1     2    15    3     0      1   79
## 5    1    1     1     2    37    3     0      1   71
## 6    1    1     1     2    29    2     1      1   43
\end{verbatim}
\end{kframe}
\end{knitrout}

Not all functions in \pkgname{foreign} return data frames by default, but all of them can be coerced to do so.

\subsection[haven]{\pkgname{haven}}


Package \pkgname{haven} is less ambitious with respect to the number of formats supported, or their vintages, providing read and write functions for only three file formats: \pgrmname{SAS}, \pgrmname{Stata} and \pgrmname{SPSS}. On the other hand, \pkgname{haven} provides flexible ways to convert the different labeled values that cannot be directly mapped to \Rlang modes. They also decode dates and times according to the idiosyncrasies of each of these file formats. In cases when the imported file contains labeled values the returned \Rclass{tibble} object needs some additional attention from the user. Labeled numeric columns in \pgrmname{SPSS} are not necessarily equivalent to factors, although they sometimes are. Consequently, conversion to factors cannot be automated and must be done manually in a separate step.

We can use function \Rfunction{read\_sav()} to import a \code{.sav} file saved by a recent version of \pgrmname{SPSS}. As in the previous section, we display only the first six rows and seven columns of the data frame, including a column \code{treat} containing a labeled numeric vector and \code{harvest\_date} with dates encoded as \Rlang date values.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my_spss.tb} \hlkwb{<-} \hlkwd{read_sav}\hlstd{(}\hlkwc{file} \hlstd{=} \hlstr{"extdata/my-data.sav"}\hlstd{)}
\hlstd{my_spss.tb[}\hlnum{1}\hlopt{:}\hlnum{6}\hlstd{,} \hlkwd{c}\hlstd{(}\hlnum{1}\hlopt{:}\hlnum{6}\hlstd{,} \hlnum{17}\hlstd{)]}
\end{alltt}
\begin{verbatim}
## # A tibble: 6 x 7
##   block           treat mycotreat water1   pot harvest harvest_date
##   <dbl>       <dbl+lbl>     <dbl>  <dbl> <dbl>   <dbl> <date>
## 1     0 1 [Watered, EM]         1      1    14       1 2015-06-15
## 2     0 1 [Watered, EM]         1      1    52       1 2015-06-15
## 3     0 1 [Watered, EM]         1      1   111       1 2015-06-15
## # ... with 3 more rows
\end{verbatim}
\end{kframe}
\end{knitrout}

In this case, the dates are correctly decoded.

Next, we import an \pgrmname{SPSS}'s \code{.sav} file saved 15 years ago.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{thiamin.tb} \hlkwb{<-} \hlkwd{read_sav}\hlstd{(}\hlkwc{file} \hlstd{=} \hlstr{"extdata/thiamin.sav"}\hlstd{)}
\hlstd{thiamin.tb}
\end{alltt}
\begin{verbatim}
## # A tibble: 24 x 2
##   THIAMIN    CEREAL
##     <dbl> <dbl+lbl>
## 1     5.2 1 [wheat]
## 2     4.5 1 [wheat]
## 3     6   1 [wheat]
## # ... with 21 more rows
\end{verbatim}
\begin{alltt}
\hlstd{thiamin.tb} \hlkwb{<-} \hlkwd{as_factor}\hlstd{(thiamin.tb)}
\hlstd{thiamin.tb}
\end{alltt}
\begin{verbatim}
## # A tibble: 24 x 2
##   THIAMIN CEREAL
##     <dbl> <fct>
## 1     5.2 wheat
## 2     4.5 wheat
## 3     6   wheat
## # ... with 21 more rows
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
Compare the values returned by different \code{read} functions when applied to the same file on disk. Use \Rfunction{names()}, \Rfunction{str()} and \Rfunction{class()} as tools in your exploration. If you are brave, also use \Rfunction{attributes()}, \Rfunction{mode()}, \Rfunction{dim()}, \Rfunction{dimnames()}, \Rfunction{nrow()} and \Rfunction{ncol()}.
\end{playground}

\begin{playground}
If you use or have in the past used other statistical software or a general-purpose language like \langname{Python}, look for some old files and import them into \Rlang.
\end{playground}
\index{importing data!other statistical software|)}

\section{NetCDF files}
\index{importing data!NeCDF files|(}

In some fields, including geophysics and meteorology, \pgrmname{NetCDF} is a very common format for the exchange of data. It is also used in other contexts in which data is referenced to a grid of locations, like with data read from Affymetrix microarrays used to study gene expression. \pgrmname{NetCDF} files are binary but use a format that allows the storage of metadata describing each variable together with the data itself in a well-organized and standardized format, which is ideal for exchange of moderately large data sets measured on a spatial or spatio-temporal grid.

Officially described as follows:
\begin{quote}
\pgrmname{NetCDF} is a set of software libraries [from Unidata] and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.
\end{quote}

As sometimes \pgrmname{NetCDF} files are large, it is good that it is possible to selectively read the data from individual variables with functions in packages \pkgname{ncdf4} or \pkgname{RNetCDF}. On the other hand, this implies that contrary to other data file reading operations, reading a \pgrmname{NetCDF} file is done in two or more steps---i.e., opening the file, reading metadata describing the variables and spatial grid, and finally reading the data of interest.

\subsection[ncdf4]{\pkgname{ncdf4}}


Package \pkgname{ncdf4} supports reading of files using \pgrmname{netCDF} version 4 or earlier formats. Functions in \pkgname{ncdf4} not only allow reading and writing of these files, but also their modification.

We first read metadata to obtain an index of the file contents, and in additional steps, read a subset of the data. With \Rfunction{print()} we can find out the names and characteristics of the variables and attributes. In this example, we read long-term averages for potential evapotranspiration (PET).

We first open a connection to the file with function \Rfunction{nc\_open()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{meteo_data.nc} \hlkwb{<-} \hlkwd{nc_open}\hlstd{(}\hlstr{"extdata/pevpr.sfc.mon.ltm.nc"}\hlstd{)}
\hlkwd{str}\hlstd{(meteo_data.nc,} \hlkwc{max.level} \hlstd{=} \hlnum{1}\hlstd{)}
\end{alltt}
\begin{verbatim}
## List of 14
##  $ filename   : chr "extdata/pevpr.sfc.mon.ltm.nc"
##  $ writable   : logi FALSE
##  $ id         : int 65536
##  $ safemode   : logi FALSE
##  $ format     : chr "NC_FORMAT_NETCDF4_CLASSIC"
##  $ is_GMT     : logi FALSE
##  $ groups     :List of 1
##  $ fqgn2Rindex:List of 1
##  $ ndims      : num 4
##  $ natts      : num 8
##  $ dim        :List of 4
##  $ unlimdimid : num -1
##  $ nvars      : num 3
##  $ var        :List of 3
##  - attr(*, "class")= chr "ncdf4"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{advplayground}
Increase \code{max.level} in the call to \Rfunction{str()} above and study the connection object stores information on the dimensions and for each data variable. You can also \code{print(meteo\_data.nc)} for a more complete printout once you have understood the structure of the object.
\end{advplayground}
The dimensions of the array data are described with metadata, in our examples mapping indexes to a grid of latitudes and longitudes and into a time vector as a third dimension. The dates are returned as character strings. We get here the variables one at a time with function \Rfunction{ncvar\_get()}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{time.vec} \hlkwb{<-} \hlkwd{ncvar_get}\hlstd{(meteo_data.nc,} \hlstr{"time"}\hlstd{)}
\hlkwd{head}\hlstd{(time.vec)}
\end{alltt}
\begin{verbatim}
## [1] -657073 -657042 -657014 -656983 -656953 -656922
\end{verbatim}
\begin{alltt}
\hlstd{longitude} \hlkwb{<-}  \hlkwd{ncvar_get}\hlstd{(meteo_data.nc,} \hlstr{"lon"}\hlstd{)}
\hlkwd{head}\hlstd{(longitude)}
\end{alltt}
\begin{verbatim}
## [1] 0.000 1.875 3.750 5.625 7.500 9.375
\end{verbatim}
\begin{alltt}
\hlstd{latitude} \hlkwb{<-} \hlkwd{ncvar_get}\hlstd{(meteo_data.nc,} \hlstr{"lat"}\hlstd{)}
\hlkwd{head}\hlstd{(latitude)}
\end{alltt}
\begin{verbatim}
## [1] 88.5420 86.6531 84.7532 82.8508 80.9473 79.0435
\end{verbatim}
\end{kframe}
\end{knitrout}

The \code{time} vector is rather odd, as it contains only monthly data as these are long-term averages, but expressed as days from 1800-01-01 corresponding to the first day of each month of year 1. We use package \pkgname{lubridate} for the conversion.

We construct a \Rclass{tibble} object with PET values for one grid point, taking advantage of the \emph{recycling} of short vectors.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{pet.tb} \hlkwb{<-}
    \hlkwd{tibble}\hlstd{(}\hlkwc{time} \hlstd{=} \hlkwd{ncvar_get}\hlstd{(meteo_data.nc,} \hlstr{"time"}\hlstd{),}
           \hlkwc{month} \hlstd{=} \hlkwd{month}\hlstd{(}\hlkwd{ymd}\hlstd{(}\hlstr{"1800-01-01"}\hlstd{)} \hlopt{+} \hlkwd{days}\hlstd{(time)),}
           \hlkwc{lon} \hlstd{= longitude[}\hlnum{6}\hlstd{],}
           \hlkwc{lat} \hlstd{= latitude[}\hlnum{2}\hlstd{],}
           \hlkwc{pet} \hlstd{=} \hlkwd{ncvar_get}\hlstd{(meteo_data.nc,} \hlstr{"pevpr"}\hlstd{)[}\hlnum{6}\hlstd{,} \hlnum{2}\hlstd{, ]}
           \hlstd{)}
\hlstd{pet.tb}
\end{alltt}
\begin{verbatim}
## # A tibble: 12 x 5
##      time month   lon   lat   pet
##     <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -657073    12  9.38  86.7  4.28
## 2 -657042     1  9.38  86.7  5.72
## 3 -657014     2  9.38  86.7  4.38
## # ... with 9 more rows
\end{verbatim}
\end{kframe}
\end{knitrout}

If we want to read in several grid points, we can use several different approaches. However, the order of nesting of dimensions can make adding the dimensions as columns error prone. It is much simpler to use package \pkgnameNI{tidync} described next.

\subsection[tidync]{\pkgname{tidync}}


Package \pkgname{tidync} provides functions that make it easier to extract subsets of the data from an \pgrmname{NetCDF} file. We start by doing the same operations as in the examples for \pkgnameNI{ncdf4}.

We open the file creating an object and simultaneously activating the first grid.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{meteo_data.tnc} \hlkwb{<-} \hlkwd{tidync}\hlstd{(}\hlstr{"extdata/pevpr.sfc.mon.ltm.nc"}\hlstd{)}
\hlstd{meteo_data.tnc}
\end{alltt}
\begin{verbatim}
##
## Data Source (1): pevpr.sfc.mon.ltm.nc ...
##
## Grids (5) <dimension family> : <associated variables>
##
## [1]   D0,D1,D2 : pevpr, valid_yr_count    **ACTIVE GRID** ( 216576  values per variable)
## [2]   D3,D2    : climatology_bounds
## [3]   D0       : lon
## [4]   D1       : lat
## [5]   D2       : time
##
## Dimensions 4 (3 active):
##
##   dim   name  length     min     max start count    dmin    dmax unlim coord_dim
##   <chr> <chr>  <dbl>   <dbl>   <dbl> <int> <int>   <dbl>   <dbl> <lgl> <lgl>
## 1 D0    lon      192  0.      3.58e2     1   192  0.      3.58e2 FALSE TRUE
## 2 D1    lat       94 -8.85e1  8.85e1     1    94 -8.85e1  8.85e1 FALSE TRUE
## 3 D2    time      12 -6.57e5 -6.57e5     1    12 -6.57e5 -6.57e5 FALSE TRUE
##
## Inactive dimensions:
##
##   dim   name  length   min   max unlim coord_dim
##   <chr> <chr>  <dbl> <dbl> <dbl> <lgl> <lgl>
## 1 D3    nbnds      2     1     2 FALSE FALSE
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{hyper_dims}\hlstd{(meteo_data.tnc)}
\end{alltt}
\begin{verbatim}
## # A tibble: 3 x 7
##   name  length start count    id unlim coord_dim
## * <chr>  <dbl> <int> <int> <int> <lgl> <lgl>
## 1 lon      192     1   192     0 FALSE TRUE
## 2 lat       94     1    94     1 FALSE TRUE
## 3 time      12     1    12     2 FALSE TRUE
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{hyper_vars}\hlstd{(meteo_data.tnc)}
\end{alltt}
\begin{verbatim}
## # A tibble: 2 x 6
##      id name           type     ndims natts dim_coord
##   <int> <chr>          <chr>    <int> <int> <lgl>
## 1     4 pevpr          NC_FLOAT     3    14 FALSE
## 2     5 valid_yr_count NC_FLOAT     3     4 FALSE
\end{verbatim}
\end{kframe}
\end{knitrout}

We extract a subset of the data into a tibble in long (or tidy) format, and add
the months using a pipe operator from \pkgname{wrapr} and methods from \pkgname{dplyr}.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{hyper_tibble}\hlstd{(meteo_data.tnc,}
             \hlkwc{lon} \hlstd{=} \hlkwd{signif}\hlstd{(lon,} \hlnum{1}\hlstd{)} \hlopt{==} \hlnum{9}\hlstd{,}
             \hlkwc{lat} \hlstd{=} \hlkwd{signif}\hlstd{(lat,} \hlnum{2}\hlstd{)} \hlopt{==} \hlnum{87}\hlstd{)} \hlopt{%.>%}
  \hlkwd{mutate}\hlstd{(}\hlkwc{.data} \hlstd{= .,} \hlkwc{month} \hlstd{=} \hlkwd{month}\hlstd{(}\hlkwd{ymd}\hlstd{(}\hlstr{"1800-01-01"}\hlstd{)} \hlopt{+} \hlkwd{days}\hlstd{(time)))} \hlopt{%.>%}
  \hlkwd{select}\hlstd{(}\hlkwc{.data} \hlstd{= .,} \hlopt{-}\hlstd{time)}
\end{alltt}
\begin{verbatim}
## # A tibble: 12 x 5
##   pevpr valid_yr_count   lon   lat month
##   <dbl>          <dbl> <dbl> <dbl> <dbl>
## 1  4.28       1.19e-39  9.38  86.7    12
## 2  5.72       1.19e-39  9.38  86.7     1
## 3  4.38       1.29e-39  9.38  86.7     2
## # ... with 9 more rows
\end{verbatim}
\end{kframe}
\end{knitrout}

In this second example, we extract data for all grid points along latitudes. To achieve this we need only to omit the test for \code{lat} from the chuck above. The tibble is assembled automatically and columns for the active dimensions added. The decoding of the months remains unchanged.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{hyper_tibble}\hlstd{(meteo_data.tnc,}
             \hlkwc{lon} \hlstd{=} \hlkwd{signif}\hlstd{(lon,} \hlnum{1}\hlstd{)} \hlopt{==} \hlnum{9}\hlstd{)} \hlopt{%.>%}
  \hlkwd{mutate}\hlstd{(}\hlkwc{.data} \hlstd{= .,} \hlkwc{month} \hlstd{=} \hlkwd{month}\hlstd{(}\hlkwd{ymd}\hlstd{(}\hlstr{"1800-01-01"}\hlstd{)} \hlopt{+} \hlkwd{days}\hlstd{(time)))} \hlopt{%.>%}
  \hlkwd{select}\hlstd{(}\hlkwc{.data} \hlstd{= .,} \hlopt{-}\hlstd{time)}
\end{alltt}
\begin{verbatim}
## # A tibble: 1,128 x 5
##   pevpr valid_yr_count   lon   lat month
##   <dbl>          <dbl> <dbl> <dbl> <dbl>
## 1  1.02       1.19e-39  9.38  88.5    12
## 2  4.28       1.19e-39  9.38  86.7    12
## 3  3.03       9.18e-40  9.38  84.8    12
## # ... with 1,125 more rows
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{playground}
Instead of extracting data for one longitude across latitudes, extract data across longitudes for one latitude near the Equator.
\end{playground}
\index{importing data!NeCDF files|)}

\section{Remotely located data}\label{sec:files:remote}
\index{importing data!remote connections|(}

Many of the functions described above accept an URL address in place of a file name. Consequently files can be read remotely without having to first download and save a copy in the local file system. This can be useful, especially when file names are generated within a script. However, one should avoid, especially in the case of servers open to public access, repeatedly downloading the same file as this unnecessarily increases network traffic and workload on the remote server. Because of this, our first example reads a small file from my own web site. See section \ref{sec:files:txt} on page \pageref{sec:files:txt} for details on the use of these and other functions for reading text files.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{logger.df} \hlkwb{<-}
      \hlkwd{read.csv2}\hlstd{(}\hlkwc{file} \hlstd{=} \hlstr{"http://r4photobiology.info/learnr/logger_1.txt"}\hlstd{,}
                \hlkwc{header} \hlstd{=} \hlnum{FALSE}\hlstd{,}
                \hlkwc{col.names} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"time"}\hlstd{,} \hlstr{"temperature"}\hlstd{))}
\hlkwd{sapply}\hlstd{(logger.df, class)}
\end{alltt}
\begin{verbatim}
##        time temperature
## "character"   "numeric"
\end{verbatim}
\begin{alltt}
\hlkwd{sapply}\hlstd{(logger.df, mode)}
\end{alltt}
\begin{verbatim}
##        time temperature
## "character"   "numeric"
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{logger.tb} \hlkwb{<-}
    \hlkwd{read_csv2}\hlstd{(}\hlkwc{file} \hlstd{=} \hlstr{"http://r4photobiology.info/learnr/logger_1.txt"}\hlstd{,}
              \hlkwc{col_names} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"time"}\hlstd{,} \hlstr{"temperature"}\hlstd{))}
\end{alltt}


{\ttfamily\noindent\itshape\color{messagecolor}{\#\# Using ',' as decimal and '.' as grouping mark. Use read\_delim() for more control.}}

{\ttfamily\noindent\itshape\color{messagecolor}{\#\# Parsed with column specification:\\\#\# cols(\\\#\#\ \  time = col\_character(),\\\#\#\ \  temperature = col\_double()\\\#\# )}}\begin{alltt}
\hlkwd{sapply}\hlstd{(logger.tb, class)}
\end{alltt}
\begin{verbatim}
##        time temperature
## "character"   "numeric"
\end{verbatim}
\begin{alltt}
\hlkwd{sapply}\hlstd{(logger.tb, mode)}
\end{alltt}
\begin{verbatim}
##        time temperature
## "character"   "numeric"
\end{verbatim}
\end{kframe}
\end{knitrout}

While functions in package \pkgname{readr} support the use of URLs, those in packages \pkgname{readxl} and \pkgname{xlsx} do not. Consequently, we need to first download the file and save a copy locally, that we can read as described in section \ref{sec:files:excel} on page \pageref{sec:files:excel}. Function \Rfunction{download.file()} in the \Rlang \pkgname{utils} package can be used to download files using URLs. It supports different modes such as binary or text, and write or append, and different methods such as \code{"internal"}, \code{"wget"} and \code{"libcurl" }.

\begin{warningbox}
For portability, \pgrmname{MS-Excel} files should be downloaded in binary mode, setting \code{mode = "wb"}, which is required under \osname{MS-Windows}.
\end{warningbox}


\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{download.file}\hlstd{(}\hlstr{"http://r4photobiology.info/learnr/my-data.xlsx"}\hlstd{,}
              \hlstr{"data/my-data-dwn.xlsx"}\hlstd{,}
              \hlkwc{mode} \hlstd{=} \hlstr{"wb"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

Functions in package \pkgname{foreign}, as well as those in package \pkgname{haven}, support URLs. See section \ref{sec:files:stat} on page \pageref{sec:files:stat} for more information about importing this kind of data into \Rlang.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{remote_thiamin.df} \hlkwb{<-}
  \hlkwd{read.spss}\hlstd{(}\hlkwc{file} \hlstd{=} \hlstr{"http://r4photobiology.info/learnr/thiamin.sav"}\hlstd{,}
            \hlkwc{to.data.frame} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\hlkwd{head}\hlstd{(remote_thiamin.df)}
\end{alltt}
\begin{verbatim}
##   THIAMIN CEREAL
## 1     5.2  wheat
## 2     4.5  wheat
## 3     6.0  wheat
## 4     6.1  wheat
## 5     6.7  wheat
## 6     5.8  wheat
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{remote_my_spss.tb} \hlkwb{<-}
    \hlkwd{read_sav}\hlstd{(}\hlkwc{file} \hlstd{=} \hlstr{"http://r4photobiology.info/learnr/thiamin.sav"}\hlstd{)}
\hlstd{remote_my_spss.tb}
\end{alltt}
\begin{verbatim}
## # A tibble: 24 x 2
##   THIAMIN    CEREAL
##     <dbl> <dbl+lbl>
## 1     5.2 1 [wheat]
## 2     4.5 1 [wheat]
## 3     6   1 [wheat]
## # ... with 21 more rows
\end{verbatim}
\end{kframe}
\end{knitrout}

In this example we use a downloaded NetCDF file of long-term means for potential evapotranspiration from NOOA, the same used above in the \pkgname{ncdf4} example. This is a moderately large file at 444~KB. In this case, we cannot directly open the connection to the NetCDF file, and we first download it (commented out code, as we have a local copy), and then we open the local file.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{my.url} \hlkwb{<-} \hlkwd{paste}\hlstd{(}\hlstr{"ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.derived/"}\hlstd{,}
                \hlstr{"surface_gauss/pevpr.sfc.mon.ltm.nc"}\hlstd{,}
                \hlkwc{sep} \hlstd{=} \hlstr{""}\hlstd{)}
\hlcom{#download.file(my.url,}
\hlcom{#              mode = "wb",}
\hlcom{#              destfile = "extdata/pevpr.sfc.mon.ltm.nc")}
\hlstd{pet_ltm.nc} \hlkwb{<-} \hlkwd{nc_open}\hlstd{(}\hlstr{"extdata/pevpr.sfc.mon.ltm.nc"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{warningbox}
For portability, \pgrmname{NetCDF} files should be downloaded in binary mode, setting \code{mode = "wb"}, which is required under \osname{MS-Windows}.
.
\end{warningbox}
\index{importing data!remote connections|)}

\section{Data acquisition from physical devices}\label{sec:data:acquisition}
\index{importing data!physical devices|(}

Numerous\index{internet-of-things} modern data acquisition devices based on microcontrollers, including internet-of-things (IoT) devices, have servers (or daemons) that can be queried over a network connection to retrieve either real-time or logged data. Formats based on XML schemas or in JSON format are commonly used.

\subsection[jsonlite]{\pkgname{jsonlite}}


\index{importing data!jsonlite}\index{YoctoPuce modules}
We give here a simple example using a module from the \href{http://www.yoctopuce.com/}{YoctoPuce} family using a software hub running locally. We retrieve logged data from a YoctoMeteo module.

\begin{infobox}
This example needs setting the configuration of the YoctoPuce module beforehand. Fully reproducible examples, including configuration instructions, will be provided online.
\end{infobox}

Here we use function \Rfunction{fromJSON()} from package \pkgname{jsonlite} to retrieve logged data from one sensor.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{hub.url} \hlkwb{<-} \hlstr{"http://localhost:4444/"}
\hlstd{Meteo01.df} \hlkwb{<-}
    \hlkwd{fromJSON}\hlstd{(}\hlkwd{paste}\hlstd{(hub.url,} \hlstr{"byName/METEO01/dataLogger.json"}\hlstd{,}
                   \hlkwc{sep} \hlstd{=} \hlstr{""}\hlstd{),} \hlkwc{flatten} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\hlkwd{str}\hlstd{(Meteo01.df,} \hlkwc{max.level} \hlstd{=} \hlnum{2}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}

The minimum, mean, and maximum values for each logging interval need to be split from a single vector. We do this by indexing with a logical vector (recycled). The data returned is in long form, with quantity names and units also returned by the module, as well as the time.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{Meteo01.df[[}\hlstr{"streams"}\hlstd{]][[}\hlkwd{which}\hlstd{(Meteo01.df}\hlopt{$}\hlstd{id} \hlopt{==} \hlstr{"temperature"}\hlstd{)]]} \hlopt{%.>%}
  \hlkwd{as_tibble}\hlstd{(}\hlkwc{x} \hlstd{= .)} \hlopt{%.>%}
  \hlstd{dplyr}\hlopt{::}\hlkwd{transmute}\hlstd{(}\hlkwc{.data} \hlstd{= .,}
                   \hlkwc{utc.time} \hlstd{=} \hlkwd{as.POSIXct}\hlstd{(utc,} \hlkwc{origin} \hlstd{=} \hlstr{"1970-01-01"}\hlstd{,} \hlkwc{tz} \hlstd{=} \hlstr{"UTC"}\hlstd{),}
                   \hlkwc{t_min} \hlstd{=} \hlkwd{unlist}\hlstd{(val)[}\hlkwd{c}\hlstd{(}\hlnum{TRUE}\hlstd{,} \hlnum{FALSE}\hlstd{,} \hlnum{FALSE}\hlstd{)],}
                   \hlkwc{t_mean} \hlstd{=} \hlkwd{unlist}\hlstd{(val)[}\hlkwd{c}\hlstd{(}\hlnum{FALSE}\hlstd{,} \hlnum{TRUE}\hlstd{,} \hlnum{FALSE}\hlstd{)],}
                   \hlkwc{t_max} \hlstd{=} \hlkwd{unlist}\hlstd{(val)[}\hlkwd{c}\hlstd{(}\hlnum{FALSE}\hlstd{,} \hlnum{FALSE}\hlstd{,} \hlnum{TRUE}\hlstd{)])} \hlkwb{->} \hlstd{temperature.df}

\hlstd{Meteo01.df[[}\hlstr{"streams"}\hlstd{]][[}\hlkwd{which}\hlstd{(Meteo01.df}\hlopt{$}\hlstd{id} \hlopt{==} \hlstr{"humidity"}\hlstd{)]]} \hlopt{%.>%}
  \hlkwd{as_tibble}\hlstd{(}\hlkwc{x} \hlstd{= .)} \hlopt{%.>%}
  \hlstd{dplyr}\hlopt{::}\hlkwd{transmute}\hlstd{(}\hlkwc{.data} \hlstd{= .,}
                   \hlkwc{utc.time} \hlstd{=} \hlkwd{as.POSIXct}\hlstd{(utc,} \hlkwc{origin} \hlstd{=} \hlstr{"1970-01-01"}\hlstd{,} \hlkwc{tz} \hlstd{=} \hlstr{"UTC"}\hlstd{),}
                   \hlkwc{hr_min} \hlstd{=} \hlkwd{unlist}\hlstd{(val)[}\hlkwd{c}\hlstd{(}\hlnum{TRUE}\hlstd{,} \hlnum{FALSE}\hlstd{,} \hlnum{FALSE}\hlstd{)],}
                   \hlkwc{hr_mean} \hlstd{=} \hlkwd{unlist}\hlstd{(val)[}\hlkwd{c}\hlstd{(}\hlnum{FALSE}\hlstd{,} \hlnum{TRUE}\hlstd{,} \hlnum{FALSE}\hlstd{)],}
                   \hlkwc{hr_max} \hlstd{=} \hlkwd{unlist}\hlstd{(val)[}\hlkwd{c}\hlstd{(}\hlnum{FALSE}\hlstd{,} \hlnum{FALSE}\hlstd{,} \hlnum{TRUE}\hlstd{)])} \hlkwb{->} \hlstd{humidity.df}

\hlkwd{full_join}\hlstd{(temperature.df, humidity.df)}
\end{alltt}
\end{kframe}
\end{knitrout}

\begin{explainbox}
Most YoctoPuce input modules have a built-in datalogger, and the stored data can also be downloaded as a \code{CSV} file through a physical or virtual hub. As shown above, it is possible to control them through the HTML server in the physical or virtual hubs. Alternatively the \Rlang package \pkgname{reticulate} can be used to control YoctoPuce modules by means of the \langname{Python} library giving access to their API.
\end{explainbox}
\index{importing data!physical devices|)}

\section{Databases}\label{sec:data:db}
\index{importing data!databases|(}


One of the advantages of using databases is that subsets of cases and variables can be retrieved, even remotely, making it possible to work in \Rlang both locally and remotely with huge data sets. One should remember that \Rlang natively keeps whole objects in RAM, and consequently, available machine memory limits the size of data sets with which it is possible to work. Package \pkgname{dbplyr} provides the tools to work with data in databases using the same verbs as when using \pkgname{dplyr} with data stored in memory (RAM) (see chapter \ref{chap:R:data}). This is an important subject, but extensive enough to be outside the scope of this book. We provide a few simple examples to show the very basics but interested readers should consult \citebooktitle{Wickham2017} \autocite{Wickham2017}.

The additional steps compared to using \pkgname{dplyr} start with the need to establish a connection to a local or remote database. We will use \Rlang package \pkgname{RSQLite} to create a local temporary \pgrmname{SQLite} database. \pkgname{dbplyr} backends supporting other database systems are also available. We will use meteorological data from \pkgname{learnrbook} for this example.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{library}\hlstd{(dplyr)}
\hlstd{con} \hlkwb{<-} \hlstd{DBI}\hlopt{::}\hlkwd{dbConnect}\hlstd{(RSQLite}\hlopt{::}\hlkwd{SQLite}\hlstd{(),} \hlkwc{dbname} \hlstd{=} \hlstr{":memory:"}\hlstd{)}
\hlkwd{copy_to}\hlstd{(con, weather_wk_25_2019.tb,} \hlstr{"weather"}\hlstd{,}
        \hlkwc{temporary} \hlstd{=} \hlnum{FALSE}\hlstd{,}
        \hlkwc{indexes} \hlstd{=} \hlkwd{list}\hlstd{(}
          \hlkwd{c}\hlstd{(}\hlstr{"month_name"}\hlstd{,} \hlstr{"calendar_year"}\hlstd{,} \hlstr{"solar_time"}\hlstd{),}
          \hlstr{"time"}\hlstd{,}
          \hlstr{"sun_elevation"}\hlstd{,}
          \hlstr{"was_sunny"}\hlstd{,}
          \hlstr{"day_of_year"}\hlstd{,}
          \hlstr{"month_of_year"}
        \hlstd{)}
\hlstd{)}
\hlstd{weather.db} \hlkwb{<-} \hlkwd{tbl}\hlstd{(con,} \hlstr{"weather"}\hlstd{)}
\hlkwd{colnames}\hlstd{(weather.db)}
\end{alltt}
\begin{verbatim}
##  [1] "time"           "PAR_umol"       "PAR_diff_fr"    "global_watt"
##  [5] "day_of_year"    "month_of_year"  "month_name"     "calendar_year"
##  [9] "solar_time"     "sun_elevation"  "sun_azimuth"    "was_sunny"
## [13] "wind_speed"     "wind_direction" "air_temp_C"     "air_RH"
## [17] "air_DP"         "air_pressure"   "red_umol"       "far_red_umol"
## [21] "red_far_red"
\end{verbatim}
\begin{alltt}
\hlstd{weather.db} \hlopt{%.>%}
  \hlkwd{filter}\hlstd{(., sun_elevation} \hlopt{>} \hlnum{5}\hlstd{)} \hlopt{%.>%}
  \hlkwd{group_by}\hlstd{(., day_of_year)} \hlopt{%.>%}
  \hlkwd{summarise}\hlstd{(.,} \hlkwc{energy_Wh} \hlstd{=} \hlkwd{sum}\hlstd{(global_watt,} \hlkwc{na.rm} \hlstd{=} \hlnum{TRUE}\hlstd{)} \hlopt{*} \hlnum{60} \hlopt{/} \hlnum{3600}\hlstd{)}
\end{alltt}
\begin{verbatim}
## # Source:   lazy query [?? x 2]
## # Database: sqlite 3.30.1 [:memory:]
##   day_of_year energy_Wh
##         <dbl>     <dbl>
## 1         162     7500.
## 2         163     6660.
## 3         164     3958.
## # ... with more rows
\end{verbatim}
\end{kframe}
\end{knitrout}

\begin{explainbox}
Package \pkgname{dbplyr} translates data pipes that use \pkgname{dplyr} syntax into SQL queries to databases, either local or remote. As long as there are no problems with the backend, the use of a database is almost transparent to the \Rlang user.
\end{explainbox}
\index{importing data!databases|)}

\begin{infobox}
It is always good to clean up, and in the case of the book, the best way to test that the examples
can be run in a ``clean'' system.

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{unlink}\hlstd{(}\hlstr{"./data"}\hlstd{,} \hlkwc{recursive} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\hlkwd{unlink}\hlstd{(}\hlstr{"./extdata"}\hlstd{,} \hlkwc{recursive} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
\end{infobox}

\section{Further reading}
Since\index{further reading!elegant R code}\index{further reading!idiosyncracies or R} this is the end of the book, I recommend as further reading the writings of \citeauthor{Burns1998} as they are full of insight. Having arrived at the end of \emph{Learn R: As a Language} you should read \citebooktitle{Burns1998} \autocite{Burns1998} and \citebooktitle{Burns2012} \autocite{Burns2012}. If you want to never get caught unaware by \Rlang's idiosyncrasies, read also \citebooktitle{Burns2011} \autocite{Burns2011}.


\backmatter

\printbibliography

\printindex

\printindex[rcatsidx]

\printindex[rindex]

\end{document}

\appendix

\chapter{Build information}

\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{Sys.info}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}


\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{sessionInfo}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}

\end{document}