% !Rnw root = using-r.main.Rnw
<<echo=FALSE, include=FALSE>>=
# opts_chunk$set(opts_fig_wide)
opts_knit$set(unnamed.chunk.label = 'intro-chunk')
opts_knit$set(concordance=TRUE)
@
\chapter{\Rlang: The Language and the Program}\label{chap:R:introduction}
\begin{VF}
In a world of \ldots\ relentless pressure for more of everything, one can lose sight of the basic principles---simplicity, clarity, generality---that form the bedrock of good software.
\VA{Brian W. Kernighan and Rob Pike}{\emph{The Practice of Programming}, 1999}\nocite{Kernighan1999}
\end{VF}
\section{Aims of This Chapter}
I share some facts about the history and design of the \Rlang language so that you can gain a good vantage point from which to grasp the logic behind \Rlang's features, making it easier to understand and remember them. You will learn the distinction between the \Rpgrm program itself and the front-end programs, like \RStudio, frequently used together with \Rpgrm.
You will also learn how to interact with \Rpgrm when sitting at a computer. You will learn the difference between typing commands interactively and reading each partial result from \Rlang on the screen as you enter them, versus using \Rlang scripts containing multiple commands stored in a file to execute or run a ``job'' that saves results to another file for later inspection.
I describe the steps taken in a typical scientific or technical study, including the data analysis workflow and the roles that \Rpgrm can play in it. I share my views on the advantages and disadvantages of textual command languages such as \Rlang compared to menu-driven user interfaces, frequently used in other statistics software. I discuss the role of textual languages and \emph{literate programming} in the very important question of the reproducibility of data analyses and mention how I have used them while writing and typesetting this book.
\section{What is \Rlang?}
\subsection{\Rlang as a language}
\index{R as a language@{\Rlang as a language}}
\Rlang is a computer language designed for data analysis and data visualisation. However, in contrast to some other scripting languages, it is, from the point of view of computer programming, a complete language---it is not missing any important feature. In other words, no fundamental operations or data types are lacking \autocite{Chambers2016}. I attribute much of its success to the fact that its design achieves a very good balance between simplicity, clarity, and generality. \Rlang excels at generality, thanks to its extensibility at the cost of only a moderate loss of simplicity, while clarity is ensured by enforced documentation of extensions and support for both object-oriented and functional approaches to programming. The same three principles can also be easily followed by user code written in \Rlang.
In the case of languages like \Cpplang, \Clang, \pascallang, and \langname{FORTRAN}, multiple software implementations exist (different compilers and interpreters, i.e., pieces of software that translate programs encoded in these languages into \emph{machine code} instructions for computer processors to run). So in addition to different flavours of each language stemming from different definitions, e.g., versions of international standards, different implementations of the same standard may have, usually small, unintentional and intentional differences.
Most people think\index{R as a language@{\Rlang as a language}}\index{R as a program@{\Rpgrm as a program}} of \Rpgrm as a computer program, similar to \pgrmname{SAS} or \pgrmname{SPSS}. \Rpgrm is indeed a computer program---a piece of software---but it is also a computer language, implemented in the \Rpgrm program. At the moment, this difference is not as important as for other languages because the \Rpgrm program is the only widely used implementation of the \Rlang language.
\Rlang started as a partial implementation of the then relatively new \Slang language \autocite{Becker1984,Becker1988}. When designed, \Slang, developed at Bell Labs in the U.S.A., provided a novel way of carrying out data analyses. \Slang evolved into \Splang \autocite{Becker1988}. \Splang was available as a commercial program, most recently from TIBCO, U.S. \Rlang started as a poor man's home-brewed implementation of \Slang, for use in teaching, developed by Robert Gentleman and Ross Ihaka at the University of Auckland, in New Zealand \autocite{Ihaka1996}. Initially, \Rpgrm, the program, implemented a subset of the \Slang language. The \Rpgrm program evolved until only relatively few differences between \Slang and \Rlang remained. These remaining differences are intentional---thought of as significant improvements. In more recent times, \Rlang overtook \Splang in popularity. The \Rlang language is not standardised, and no formal definition of its grammar exists. Consequently, the \Rlang language is defined by the behaviour of its implementation in the \Rpgrm program.
What makes \Rlang different from \pgrmname{SPSS}, \pgrmname{SAS}, etc., is that \Slang was designed from the start as a computer programming language. This may look unimportant for someone not actually needing or willing to write software for data analysis. However, in reality, it makes a huge difference because \Rlang is easily extensible, both using the \Rlang language for implementation and by calling, from within \Rlang, functions and routines written in other computer programming languages such as \Clang, \Cpplang, \langname{FORTRAN}, \pythonlang, or \javalang. This flexibility means that new functionality can be easily added, and easily shared with a consistent \Rlang-based user interface. In other words, instead of having to switch between different pieces of software to do different types of analyses or plots, one can usually find a package that will make new tools seamlessly available within \Rlang.
The name\index{base R@{base \Rlang}} ``base \Rlang{}'' is used to distinguish \Rlang itself, as in the \Rpgrm executable included in the \Rpgrm distribution and its default packages, from \Rlang in a broader sense, which includes contributed packages. A few packages are included in the \Rpgrm distribution, but most \Rlang packages are independently developed extensions and separately distributed. The number of freely available open-source \Rlang packages is huge, on the order of 20\,000.
The most important advantage of using a language like \Rlang is that instructions to the computer are given as text. This makes it easy to repeat or \emph{reproduce} a data analysis. Textual instructions serve to communicate to other people what has been done in a way that is unambiguous. Sharing the instructions themselves avoids a translation from a set of instructions to the computer into text readable to humans---for example, the materials and methods section of a paper.
\begin{explainbox}
Readers with programming experience will notice that some features of \Rlang differ from those in other programming languages. \Rlang does not have the strict type checks of \langname{Pascal} or \Cpplang. It has operators that can take vectors and matrices as operands. Reliable and fast \Rlang code tends to rely on different \emph{idioms} than well-written \langname{Pascal} or \Cpplang code.
\end{explainbox}
\subsection{\Rlang as a computer program}
\index{R as a program@{\Rpgrm as a program}}
\index{Windows@{\textsf{Windows}}|see{\textsf{MS-Windows}}}
The \Rpgrm program itself is open-source, i.e., its source code is available for anybody to inspect, modify, and use. A very small fraction of users will directly contribute improvements to the \Rpgrm program itself. However, those contributions and bug reports are important in making \Rpgrm extremely reliable. The executable \Rpgrm program we actually use can be built for different operating systems and computer hardware. The members of the \Rpgrm development team aim to keep the results obtained from calculations done on all the different builds and computer architectures as consistent as possible. The idea is to ensure that computations return consistent results not only across updates to \Rpgrm but also across different operating systems, like \osname{Linux}, \osname{Unix} (including \osname{OS X}) and \osname{MS-Windows}, or computer hardware, like that based on \textsf{ARM} and \textsf{x86} processors.
\begin{figure}
\centering
\includegraphics[width=0.85\textwidth]{figures/R-console-r}
\caption[The \Rpgrm console]{The \Rpgrm console. This is where the user can type textual commands line by line. Here a user has typed \code{print("Hello")} and \textit{entered} it by ending the line of text by pressing the ``enter'' key. The result of running the command is displayed below the command. The character at the head of the input line, a ``$>$'' in this case, is called the command prompt, signalling where a command can be typed in. Commands entered by the user are displayed in red, while results returned by \Rlang are displayed in blue. ``\code{[1]}'' can be ignored here; its meaning is explained on page \pageref{par:print:vec:index}. The console as displayed in \Rpgrm \textsf{GUI} under \osnameNI{MS-Windows} is shown.}\label{fig:intro:console}
\end{figure}
The \Rpgrm program does not have a full-fledged graphical user interface (GUI), or menus from which to start different types of analyses. Instead, the user types the commands at the \Rpgrm console and the result is displayed starting on the next line (Figure \ref{fig:intro:console}). The same textual commands can also be saved into a text file, line by line, and such a file, called a ``script'', can substitute for the direct typing of the same sequence of commands at the console (writing and use of \Rlang scripts are explained in chapter \ref{chap:R:scripts} on page \pageref{chap:R:scripts}). When we work at the console, typing in commands one by one, we use \Rlang \emph{interactively}. When we run a script, we may say that we run a ``batch job''. The two approaches described above are available in the \Rpgrm program itself.
\begin{explainbox}
As \Rpgrm is essentially a command-line application, it can be used on what nowadays are frugal computing resources, equivalent to a personal computer of three decades ago. \Rpgrm can run even on the Raspberry Pi\index{Raspberry Pi}, a low-cost single-board computer with the processing power of a modest smartphone (see \url{https://r4pi.org/}). At the other end of the spectrum, on really powerful servers, \Rpgrm can be used for the analysis of big data sets with millions of observations. How powerful a computer is needed for a given data analysis task depends on the size of the data sets, on how patient one is, on the ability to select efficient algorithms and on writing ``good'' code.
\end{explainbox}
\section{Using \Rlang}\label{sec:intro:using:R}
\subsection{Editors and IDEs}
Integrated Development Environments (IDEs)\index{integrated development environment}\index{IDE|see{integrated development environment}} are normally used when developing computer programs. IDEs provide a centralised user interface from within which the different tools used to create and test a computer program can be accessed and used in coordination. Most IDEs include a dedicated editor capable of syntax highlighting (automatically colouring ``code words'' based on their role in the programming language), and even able to report some mistakes in advance of running the code. One could describe such an editor as the equivalent of a word processor with spelling and grammar checking that can alert about spelling and syntax errors for a computer language like \Rlang instead of a natural language like English. IDEs frequently add other features that help navigation of the program source code and give easy access to documentation.
Nowadays, it is very common to use an IDE as a front-end or middleman between the user and the \Rpgrm program. Computations are still done in the \Rpgrm program, which is \emph{not} built into the IDEs. Of the available IDEs for \Rpgrm, \RStudio is currently the most popular by a wide margin. Recent versions of \RStudio support \pythonlang in addition to \Rlang.
\begin{explainbox}
Readers with programming experience may be already familiar with Microsoft's free \pgrmname{Visual Studio Code} or the open-source \pgrmname{Eclipse} IDEs for which plugins supporting \Rpgrm are available.
\end{explainbox}
The main window of IDEs is in most cases divided into windows or panes, possibly with tabs. In \RStudio one has access to the \Rpgrm console, a text editor, a file-system browser, a pane for graphical output, and several additional tools, such as those for installing and updating extension packages. Although \RStudio supports the development of large scripts and packages very well, it is currently, in my opinion, also the best way of using \Rpgrm at the console, as it has the \Rpgrm help system very well integrated in both the editor and the \Rlang console. Figure \ref{fig:intro:rstudio} shows the main window displayed by \RStudio after running the same script as shown at the \Rpgrm console (Figure \ref{fig:intro:script}) and at the operating system command prompt (Figure \ref{fig:intro:shell}). By comparing these three figures, it is clear that \RStudio is really only a software layer between the user and an unmodified \Rpgrm executable. In \RStudio, the script was sourced by pressing the ``Source'' button at the top of the editor panel. \RStudio, in response to this, generated the code needed to source the file and ``entered'' it at the console (Figure \ref{fig:intro:rstudio}, lower left screen panel, text in purple), the same console where we can directly type this same \Rpgrm command if we wish.
\begin{figure}
\centering
\includegraphics[width=\linewidth]{figures/Rstudio-script}
\caption[Script in \RStudio]{The \RStudio interface after running the script that is visible in tab \texttt{my-script.R} of the editor pane (top left). Here I used the ``Source'' button to run the script and \Rpgrm printed the results to the \Rpgrm console in the lower left pane. The lower right pane shows a list of files, including the script open in the editor. The upper right pane displays a list of the objects currently visible in the user workspace, object \code{a}, which was created by the code in the second line of the \Rlang script.}\label{fig:intro:rstudio}
\end{figure}
\begin{explainbox}
When a script is run, if an error is triggered, \RStudio automatically finds the location of the error, a feature you will find useful when running code from exercises in this book. Other features go beyond what one needs for this book's exercises and for simple everyday data analysis, and are aimed at package development and report generation. Tools for debugging, code profiling, benchmarking, and unit testing make it possible to analyse and improve performance, and help with quality assurance and certification of \Rlang packages. \RStudio also integrates support for file version control, which is useful not only for package development but also for keeping track of progress, or of concurrent work with collaborators, in the analysis of data.
\end{explainbox}
The ``desktop'' version of \RStudio, which one installs and uses locally, runs on most modern operating systems, such as \osname{Linux}, \osname{Unix}, \osname{OS X}, and \osname{MS-Windows}. There is also a server version that runs on \osname{Linux}, as well as a cloud service (\url{https://posit.cloud/}) providing shared access to such a server. The \RStudio server is used remotely through a web browser. The user interface is almost the same in all cases. Desktop and server versions are both distributed as unsupported free software and as supported commercial software.
\RStudio and other IDEs support saving of their state and some settings per working folder under the name of \emph{project}, so that work on a data analysis can be interrupted and later continued, even on a different computer. As explained in section \ref{sec:R:workspace} on page \pageref{sec:R:workspace}, when working with \Rlang we keep related files in a folder.
In this book, I provide only a minimum of guidance on the use of \RStudio, and no guidance for other IDEs. To learn more about \RStudio, please read the documentation available through \RStudio's help menu and keep at hand a printed copy of the \RStudio cheat sheet while learning how to use it. This and other useful \Rlang-related cheat sheets can be downloaded at \url{https://posit.co/resources/cheatsheets/}. Additional instructions on the use of \RStudio, including a video, are available through the Resources menu entry of the book's website at \url{https://www.learnr-book.info/}.
\subsection{\Rlang sessions and workspaces}\label{sec:R:workspace}
We use \emph{session} to describe the interactive execution from start to finish of one running instance of the \Rpgrm program. We use \emph{workspace} to name the imaginary space where all objects currently available in an \Rpgrm session are stored. In \Rpgrm, the whole workspace can be stored in a single file on disk at the end or during a session and restored later into another session, possibly on a different computer. Usually, when working with \Rpgrm, we dedicate a folder in disk storage to store all files from a given data analysis project. We normally keep in this folder files with data to read in, scripts, a file storing the whole contents of the workspace, named by default \code{.RData}, and a text file with the history of commands entered interactively, named by default \code{.Rhistory}. The user's files within this folder can be located in nested folders. There are no strict rules on how the files should be organised or on their number. The recommended practice is to avoid crowded folders and folders containing unrelated files. It is a good idea to keep in a given folder and workspace the work in progress for a single data analysis project or experiment, so that the workspace can be saved and restored easily between sessions and work continued from where one left it independently of work done in other workspaces. The folder where files are currently read and saved is in \Rpgrm documentation called the \emph{current working directory}. When opening an \code{.RData} file, the current working directory is automatically set to the location where the \code{.RData} file was read from.
\begin{warningbox}
\RStudio projects are implemented as a file with a name ending in \code{.Rproj} together with a hidden folder named \code{.Rproj.user}, both located in the same folder where scripts, data, \code{.RData}, and \code{.Rhistory} are stored. These are managed by \RStudio and should not be modified or deleted by the user. Only in the very rare case of their corruption should they be deleted, and the \RStudio project created again from scratch. Files \code{.RData} and \code{.Rhistory} should not be deleted by the user, except to reset the \Rlang workspace. However, this is unnecessary as it can also be easily achieved from within \Rpgrm.
\end{warningbox}
\subsection{Using \Rlang interactively}
Decades ago, users communicated with computers through a physical terminal (keyboard plus text-only screen) that was frequently called a \emph{console}\index{console}. A text-only interface to a computer program, in most cases a window or a pane within a graphical user interface, is still called a console. In our case, the \Rpgrm console (Figure \ref{fig:intro:console}). This is the native user interface of \Rpgrm.
Typing commands at the \Rpgrm console is useful when one is playing around, rather aimlessly exploring things, or trying to understand how an \Rpgrm function or operator we are not familiar with works. Once we want to keep track of what we are doing, there are better ways of using \Rpgrm, which allow us to keep a record of how an analysis has been carried out. The different ways of using \Rpgrm are not exclusive of each other, so most users will use the \Rpgrm console to test individual commands and plot data during the first stages of exploration. As soon as we decide how we want to plot or analyse the data, it is best to start using scripts. This is not enforced in any way by \Rpgrm, but scripts are what really brings to light the most important advantages of using a programming language for data analysis. In Figure \ref{fig:intro:console}, we can see how the \Rpgrm console looks. The text in red has been typed in by the user, except for the prompt \code{\textcolor{red}{$>$}}, and the text in blue is what \Rpgrm has displayed in response. It is essentially a dialogue between user and \Rpgrm. The console can \emph{look} different when displayed within an IDE like \RStudio, but the only difference is in the appearance of the text rather than in the text itself (cf.\ Figures \ref{fig:intro:console} and \ref{fig:intro:console:rstudio}).
\begin{figure}
\centering
\includegraphics[width=\linewidth]{figures/r-console-rstudio}
\caption[The \Rpgrm console in \RStudio]{The \Rpgrm console embedded in \RStudio. The same commands have been typed in as in Figure \ref{fig:intro:console}. Commands entered by the user are displayed in purple, while results returned by \Rpgrm are displayed in black.}\label{fig:intro:console:rstudio}
\end{figure}
The two previous figures showed the result of entering a single command. Figure \ref{fig:intro:console:capture} shows how the console looks after the user has entered several commands, each as a separate line of text.
\begin{figure}
\centering
\includegraphics[width=\linewidth]{figures/r-console-capture}
\caption[The \Rpgrm console in use]{The \Rpgrm console after several commands have been entered. Commands entered by the user are displayed in red, while results returned by \Rpgrm are displayed in blue.}\label{fig:intro:console:capture}
\end{figure}
The examples in this book require only the console window for user input. Menu-driven programs are not necessarily bad; they are just unsuitable when there is a need to set very many options and choose from many different actions. They are also difficult to maintain when extensibility is desired, and when independently developed modules of very different characteristics need to be integrated. Textual languages also have the advantage, to be addressed in later chapters, that command sequences can be stored in human- and computer-readable text files. Such files constitute a record of all the steps used, and in most cases, make it trivial to manually reproduce the same steps at a later time. Scripts are a very simple and handy way of communicating to other users how a given data analysis has been done or can be done.
\begin{explainbox}
In the console one types commands at the \code{>} prompt. When one ends a line by pressing the return or enter key, if the line can be interpreted as an \Rlang command, the result will be printed at the console, followed by a new \code{>} prompt.
If the command is incomplete, a \code{+} continuation prompt will be shown, and you will be able to type in the rest of the command. For example, if the whole calculation that you would like to do is $1 + 2 + 3$, if you enter in the console \code{1 + 2 +} in one line, you will get a continuation prompt where you will be able to type \code{3}. However, if you type \code{1 + 2}, the result will be calculated, and printed.
\end{explainbox}
For example, one can search for a help page at the \Rpgrm console. Below are the first code example and the first playground in the book. They are included here for illustration only; you can return to them later, as I discuss how to install or get access to the \Rpgrm program only on page \pageref{sec:R:install}.
<<help-1, eval=FALSE>>=
help("sum")
?sum
@
\begin{playground}
Look at help for some other functions like \code{mean()}, \code{var()}, \code{plot()} and, why not, \Rfunction{help()} itself!
<<eval=FALSE>>=
help(help)
@
\end{playground}
\begin{warningbox}
When trying to access help related to \Rlang extension packages through \Rlang's built-in help, make sure the package is loaded into the current \Rlang session, as described on page \pageref{sec:packages:install}, before calling \Rfunction{help()}.
\end{warningbox}
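As a minimal sketch of this order of operations, the example below uses package \pkgname{splines}, chosen here only because it is distributed together with \Rpgrm; any other installed package could take its place. The package is loaded first, and help for one of its functions, \code{bs()}, is requested next. The last statement shows that \Rfunction{help()} also accepts the name of an installed package through its parameter \code{package}, in which case the package does not need to be loaded beforehand.

<<eval=FALSE>>=
library(splines)  # load the package into the current session
help("bs")        # help page for function bs() from 'splines'
# alternative: name the installed package explicitly
help("bs", package = "splines")
@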
When using \RStudio, there are easier ways of navigating to a help page than calling function \Rfunction{help()} by typing its name. For example, with the cursor on the name of a function in the editor or console, pressing the \textsf{F1} key opens the corresponding help page in the help pane. Letting the cursor hover for a few seconds over the name of a function at the \Rpgrm console will open ``bubble help'' for it. If the function is defined in a script or another file that is open in the editor pane, one can directly navigate from the line where the function is called to where it is defined. In \RStudio one can also search for help through the graphical interface. The \Rlang manuals are most easily accessed through the Help menu in \RStudio or \pgrmname{RGUI}.
\subsection{Using \Rlang in a ``batch job''}
To run a script,\index{scripts}\index{batch job} we need first to prepare a script in a text editor. Figure \ref{fig:intro:script} shows the console immediately after running the script file shown in the text editor. As before, red text, the command \code{source("my-script.R")}, was typed by the user, and the blue text in the console is what was displayed by \Rpgrm as a result of this action. The title bar of the console shows ``R-console'', while the title bar of the editor shows the \emph{path} to the script file that is open and ready to be edited, followed by ``R-editor''.
\begin{figure}
\centering
\includegraphics[width=\linewidth]{figures/R-console-script}
\caption[Script sourced at the \Rpgrm console]{Screen capture of the \Rpgrm console and editor just after running a script. The upper pane shows the \Rpgrm console, and the lower pane, the script file in an editor. }\label{fig:intro:script}
\end{figure}
\begin{warningbox}
When working at the command prompt, most results are printed by default. However, within scripts one needs to use function \Rfunction{print()} explicitly when a result is to be displayed.
\end{warningbox}
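The difference can be seen with a very short script. The sketch below is only an illustration: the script is written to a temporary file, an assumption made to keep the example self-contained, as in practice the script would be created in an editor. It is then run with \Rfunction{source()} using its default arguments.

<<eval=FALSE>>=
# create a small script in a temporary file (for illustration only)
script.file <- tempfile(fileext = ".R")
writeLines(c("a <- 1 + 2",
             "a         # value not displayed when the script is sourced",
             "print(a)  # value displayed"),
           script.file)
source(script.file)  # the value of 'a' is displayed only once
@

Calling \code{source(script.file, echo = TRUE)} instead displays each command followed by its result, much as when the same commands are typed at the console.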
A true ``batch job'' is not run at the \Rpgrm console but at the operating system command prompt, or shell. The shell is the console of the operating system---\osname{Linux}, \osname{Unix}, \osname{OS X}, or \osname{MS-Windows}. Figure \ref{fig:intro:shell} shows how running a script at the Windows command prompt looks. A script can be run at the operating system prompt to do time-consuming calculations with the output saved to a file. One may use this approach on a server, say, to leave a large data analysis job running overnight or even for several days.
\begin{figure}
\centering
\includegraphics[width=\linewidth]{figures/windows-cmd-script}
\caption[Script at the Windows cmd prompt]{Screen capture of the \osname{MS-Windows} command console just after running the same script. Here we use \code{Rscript} to run the script; the exact syntax will depend on the operating system in use. In this case, \Rpgrm prints the results at the operating system console or shell, rather than in its own \Rpgrm console.}\label{fig:intro:shell}
\end{figure}
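A script meant to be run as a batch job usually saves its results to a file instead of relying on what is displayed on screen. The sketch below shows what such a script could contain. The file names \code{my-job.R} and \code{my-job-results.csv} are hypothetical, and the computation is only a stand-in for a time-consuming analysis. The command to type at the operating system prompt is shown as a comment, and its exact form can depend on the operating system and on how \Rpgrm was installed.

<<eval=FALSE>>=
# contents of a hypothetical script file named my-job.R
# run at the operating system prompt with:  Rscript my-job.R
set.seed(42)                # make the random numbers repeatable
x <- rnorm(1000)            # stand-in for real data
results <- data.frame(mean = mean(x), sd = sd(x))
# save the results to a file for later inspection
write.csv(results, "my-job-results.csv", row.names = FALSE)
@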
Within \RStudio desktop it is possible to access the operating system shell through the tab named ``Terminal'' and through the menu. It is also possible to run jobs in the background in the tab ``Background jobs'', i.e., while simultaneously using the \Rpgrm console. This is made possible by concurrently running two or more instances of the \Rpgrm program.
\section{Reproducible Data Analysis with \Rlang}
\index{reproducible data analysis|(}
Statistical concepts and procedures are not only important after data are collected but also crucial at the design stage of any data-based study. Rather frequently, we deal with pre-existing data already at the planning stage of an experiment or survey. Statistics provides the foundation for the design of experiments and surveys, data analysis, and data visualisation. This is similar to the role played by grammar and vocabulary in communication in a natural language like English. Statistics makes possible decision-making based on partial evidence (or samples), but it is also a means of communication. Data visualisation also plays a key role in the written and oral communication of study conclusions. \Rlang is useful throughout all stages of the research process, from the design of studies to the communication of the results.
During recent years, the lack of reproducibility in scientific research, frequently described as a \emph{reproducibility crisis}, has been broadly discussed and analysed \autocite{Gandrud2015}. One of the problems faced when attempting to reproduce scientific and technical studies is reproducing the data analysis. More generally, under any situation where accountability is important, from scientific research to decision making in commercial enterprises, industrial quality control and safety, and environmental impact assessments, being able to reproduce a data analysis reaching the same conclusions from the same data is crucial. Thus, an unambiguous description of the steps taken for an analysis is a requirement. Currently, most approaches to reproducible data analysis are based on automating report generation and including, as part of the report, all the computer commands that were used.
A reliable record of what commands have been run on which data is especially difficult to keep when issuing commands through menus and dialogue boxes in a graphical user interface or by interactively typing commands as text at a console. Even working interactively at the \Rpgrm console using copy and paste to include commands and results in a report typed in a word processor is error prone, and laborious. The use and archiving of \Rlang scripts alleviate this difficulty.
However, a further requirement to achieve reproducibility is the consistency between the saved and reported output and the \Rlang commands reported as having been used to produce them, saved separately when using scripts. This creates an error-prone step between data analysis and reporting. To solve this problem an approach to data analysis derived from what is called \emph{literate programming} \autocite{Knuth1984a} was developed: running an especially formatted script that produces a document that includes the \Rlang code used for the analysis; the results of running this code and any explanatory text needed to describe the methodology used and interpret the results of the analysis.
Although a system capable of producing such reports with \Rlang, called \pkgname{Sweave} \autocite{Leisch2002}, has been available for a couple of decades, it was rather limited and not supported by an IDE, making its use rather tedious. Package \pkgname{knitr} \autocite{Xie2013} further developed the approach and together with its integration into \RStudio made the use of this type of report much easier. Less sophisticated reports, called \Rlang \emph{notebooks}, formatted as HTML files can be created directly from ordinary \Rlang scripts containing no special formatting. Notebooks are HTML files that show as text the code used interspersed with the results, and can contain embedded the actual source script used to generate them.
Package \pkgname{knitr} supports the writing of reports with the textual explanations encoded using either \Markdown or \Latex\ as markup for text-formatting instructions. While \Markdown (\url{https://daringfireball.net/projects/markdown/}) is a text markup approach that is easy to learn and use, \Latex\ \autocite{Lamport1994} is based on \TeX\ \autocite{Knuth1987}, the most powerful typesetting engine freely available. There are different flavours of \Markdown, including \Rmarkdown (see \url{https://rmarkdown.rstudio.com/}) with special support for \Rlang code. \Quarto (see \url{https://quarto.org/}) was recently released as an enhancement of \Rmarkdown, improving typesetting and styling, and providing a single system capable of generating a broad selection of outputs. When used together with \Rlang, \Quarto relies on package \pkgname{knitr} for the key step in the conversion, so in a strict sense \Quarto does not replace it.
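The structure of a source file for \pkgname{knitr} can be illustrated with a very small example. The sketch below, which assumes that \pkgname{knitr} is installed and uses an invented file name and invented contents, writes a \Latex\ source file containing a single code chunk and then calls \texttt{knit()} to run the chunk and weave the code and its output into a \texttt{.tex} file.

<<eval=FALSE>>=
# A tiny source document: LaTeX text plus one R code chunk
# (file name and contents are hypothetical, used only for illustration).
rnw <- file.path(tempdir(), "report.Rnw")
writeLines(c("\\documentclass{article}",
             "\\begin{document}",
             "The mean of three values is computed below.",
             "<<>>=",
             "mean(c(2, 5, 9))",
             "@",
             "\\end{document}"),
           con = rnw)

# knit() runs the chunk and writes a .tex file containing the text,
# the R code and the output returned by R.
tex <- knitr::knit(rnw,
                   output = file.path(tempdir(), "report.tex"),
                   quiet = TRUE)

# Display the generated LaTeX source.
cat(readLines(tex), sep = "\n")
@

The resulting \texttt{.tex} file is an ordinary \Latex\ source file that can be typeset into a PDF in a separate step. Because the output is inserted by \Rpgrm when the document is built, the code shown and the results reported cannot drift apart.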
Because of the availability of these approaches to the generation of reports, the \Rlang language is extremely useful when reproducibility is important. Both \pkgname{knitr} and \Quarto are powerful and flexible enough to write whole books, such as this very book you are now reading, produced with \Rpgrm, \pkgname{knitr} and \LaTeX. All pages in the book were typeset directly, with plots and other \Rlang output generated on the fly by \Rpgrm and inserted automatically. All diagrams were generated by \LaTeX\ during the typesetting step. The only exceptions are the figures in this chapter that have been manually captured from the computer screen. Why am I using this approach? First, because I want to make sure that every bit of code, as you will see it printed, runs without error. In addition, I want to make sure that the output displayed below every line or chunk of \Rlang language code is exactly what \Rpgrm returns. Furthermore, it saves a lot of work for me as an author, as I can just update \Rpgrm and all the packages used to their latest versions, and build the book again, after making any changes needed to keep it up to date and free of errors. By using these tools and markup in plain text files, the indices, cross-references, citations, and list of references are all generated automatically.
Although the use of these tools is very important, they are outside the scope of this book and well described in other books dedicated to them \autocite{Gandrud2015,Xie2013}. When using \Rlang in this way, a good command of \Rlang as a language for communication with both humans and computers is very useful.
\index{reproducible data analysis|)}
\section{Getting Ready to Use \Rlang}\label{sec:R:install}
As the book is designed with the expectation that readers will run code examples as they read the text, you have to ensure access to \Rpgrm before reading the next chapter. It is likely that your school, employer or teacher has already enabled access to \Rpgrm. If not, or if you are reading the book on your own, you should install \Rpgrm or secure access to an online service. Using \RStudio or another IDE can facilitate the use of \Rpgrm, but all the code in the remaining chapters uses only \Rpgrm and packages available through \CRAN.
I have written an \Rlang package, named \pkgname{learnrbook}, containing original data and computer-readable listings for all code examples and exercises in the book. It also contains code and data that make it easier to install the packages used in later chapters. The package is available through \CRAN. \textbf{It is not necessary for you to install this or any other packages until section \ref{sec:packages:install} on page \pageref{sec:packages:install}, where I explain how to install and use \Rlang packages.}
\begin{faqbox}{Are there any resources to support the \emph{Learn R: As a Language} book?}
Please visit \url{https://www.learnr-book.info/} to find additional material related to this book, including additional free chapters. Up-to-date instructions for software installation are provided online at this and other sites, as these instructions are likely to change after the publication of the book.
\end{faqbox}
\begin{faqbox}{How to install the \textsf{R} program on my computer?}
Installation of \Rpgrm varies depending on the operating system and computer hardware, and is in general similar to that of other software under a given operating system distribution. For most types of computer hardware, the current version of \Rpgrm is available through the Comprehensive \Rlang Archive Network (\CRAN) at \url{https://cran.r-project.org/}. Especially in the case of Linux distributions, \Rpgrm can frequently be installed as a component of the operating system distribution. There are some exceptions, such as the \textsl{R4Pi}\index{Raspberry Pi} distribution of \Rpgrm for the Raspberry Pi, which is maintained independently (\url{https://r4pi.org/}).
Installers for Linux, Windows and macOS are available through \CRAN (\url{https://cran.r-project.org/}) together with brief but up-to-date installation instructions.
\end{faqbox}
\begin{faqbox}{How to install the \textsf{RStudio} IDE on my computer?}
\RStudio installers are available at Posit's web site (\url{https://posit.co/products/open-source/rstudio/}); the free version is suitable for running the code examples and exercises in the book. In many cases, the IT staff at your employer or school will install \RStudio for you, or it may already be included in the default computer setup.
\end{faqbox}
\begin{faqbox}{How to get access to \textsf{RStudio} as a cloud service?}
An alternative that is very well suited for courses or for learning as part of a group is the \RStudio cloud service, recently renamed Posit Cloud (\url{https://posit.co/products/cloud/cloud/}). For individual use, a free account is in many cases enough, and for groups that qualify for the discounted price, a low-cost teacher's account works very well.
\end{faqbox}
\section{Further Reading}
Suggestions\index{further reading!shell scripts in Unix and Linux} for further reading are dependent on how you plan to use \Rlang. If you envision yourself running batch jobs under \pgrmname{Linux} or \pgrmname{Unix}, you would profit from learning to write shell scripts. Because \pgrmname{bash} is widely used nowadays, \citebooktitle{Newham2005} \autocite{Newham2005} can be recommended. If you aim at writing \Rlang code that is going to be reused, and have some familiarity with \Clang, \Cpplang or \javalang, reading \citetitle{Kernighan1999} \autocite{Kernighan1999} will provide a mostly language-independent view of programming as an activity and help you master the all-important tricks of the trade. The history of \Rlang, and its relation to \Slang, is best told by those who were involved at the early stages of its development, \citeauthor{Chambers2016} (\citeyear[][, chapter 2]{Chambers2016}), and \citeauthor{Ihaka1998} (\citeyear{Ihaka1998}).
<<eval=eval_diag, include=eval_diag, echo=eval_diag, cache=FALSE>>=
knitter_diag()
R_diag()
other_diag()
@