
% !Rnw root = appendix.main.Rnw

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_wide)
opts_knit$set(concordance=TRUE)
opts_knit$set(unnamed.chunk.label = 'calculator-chunk')
@

\chapter{Base \Rlang: ``Words'' and ``Sentences''}\label{chap:R:as:calc}

\begin{VF}
The desire to economise time and mental effort in arithmetical computations, and to eliminate human liability to error, is probably as old as the science of arithmetic itself.

\VA{Howard Aiken}{\emph{Proposed automatic calculating machine}, 1937; reprinted 1964}\nocite{Aiken1964}
\end{VF}

%\dictum[Howard Aiken, \emph{Proposed automatic calculating machine}, presented to IBM in 1937]{The desire to economise time and mental effort in arithmetical computations, and to eliminate human liability to error, is probably as old as the science of arithmetic itself.}\vskip2ex

\section{Aims of This Chapter}

In my experience, for those who are not familiar with computer programming languages, the best first step in learning the \Rlang language is to use it interactively by typing textual commands at the \Rpgrm \emph{console}. This teaches not only the syntax and grammar rules, but also gives a glimpse at the advantages and flexibility of this approach to data analysis. In this chapter, I focus on the different simple values or items that can be stored and manipulated in \Rpgrm, as well as the role of computer program statements, the equivalent of ``sentences'' in natural languages.

In the first part of the chapter, you will use \Rlang to do everyday calculations that should be so easy and familiar that you will not need to think about the operations themselves. This easy start will give you a chance to focus on learning how to issue textual commands at the command prompt.

Later in the chapter, you will gradually need to focus more on the \Rlang language and its grammar and less on how commands are entered. By the end of the chapter, you will be familiar with most of the kinds of simple ``words'' used in the \Rlang language and you will be able to read and write simple \Rlang statements.

Throughout the chapter, I will occasionally show the equivalent of the \Rlang code in mathematical notation. If you are not familiar with the mathematical notation, you can safely ignore the mathematics, as long as you understand the diagrams and the \Rlang code.

\section{Natural and Computer Languages}
\index{languages!natural and computer}
Computer languages have strict rules, and the interpreters and compilers that translate these languages into machine code are unforgiving about errors. They will issue error messages, but in contrast to human readers or listeners, will not guess your intentions and continue. However, computer languages have a much smaller set of words than natural languages, such as English. If you are new to computer programming, understanding the parallels between computer and natural languages may be useful.

One can think of constant values and variables (values stored under a name) as nouns and of operators and functions as verbs. A complete command, or statement, is the equivalent of a natural language sentence: ``a comprehensible utterance''. The simple statement \code{a + 1} has three components: \code{a}, a variable; \code{+}, an operator; and \code{1}, a constant. The statement \code{sqrt(4)} has two components: a function, \code{sqrt()}, and a numerical constant, \code{4}. We say that ``to compute $\sqrt{4}$ we \emph{call} \code{sqrt()} with \code{4} as its \emph{argument}''.

Although all values manipulated in a digital computer are stored as \textit{bits} in memory, multiple interpretations are possible. Numbers, letters, logical values, etc., can be encoded into bits and decoded as long as their type or \code{mode} is known. The concept of \code{class} is not directly related to how values are encoded when stored in computer memory, but instead to how they are interpreted as part of a computer program. We can have, for example, RGB colour values stored as three numbers such as \code{0, 0, 255}, as hexadecimal numbers stored as characters {\#0000FF}, or even as fancy names stored as character strings like \code{"blue"}. We could create a \code{class} for colours using any of these representations, based on two different modes: \code{numeric} and \code{character}.
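
A minimal sketch of this idea, using function \code{mode()}, which is discussed in more detail later: the three representations of the colour blue mentioned above belong to only two modes. The colour values are simply those from the paragraph above.

<<>>=
mode(c(0, 0, 255))  # three numbers
mode("#0000FF")     # hexadecimal code stored as characters
mode("blue")        # colour name stored as characters
@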

\section{Numeric Values and Arithmetic}\label{sec:calc:numeric}
\index{classes and modes!numeric, integer, double|(}\index{numbers and their arithmetic|(}\qRclass{numeric}\index{math operators}\index{math functions}\index{numeric values}\qRoperator{+}\qRoperator{-}\qRoperator{*}\qRoperator{/}
When working in \Rlang with arithmetic expressions, the normal mathematical precedence rules are followed and parentheses can be used to alter this order. Parentheses can be nested, but in contrast to the usual practice in mathematics, the same parenthesis symbol is used at all nesting levels.

\begin{explainbox}
 In both mathematics and programming languages, \emph{operator precedence rules} determine which subexpressions are evaluated first and which later. In contrast to primitive electronic calculators, \Rlang evaluates numeric expressions containing operators according to the rules of mathematics. In the expression $1 + 2 \times 3$, the product $2 \times 3$ has precedence over the addition, and is evaluated first, yielding as the result of the whole expression, 7. Similar rules apply to other operators, even those taking as operands non-numeric values.
\end{explainbox}

The equivalent of the math expression\qRfunction{exp()}\qRfunction{cos()}\qRconst{pi}
$$
\frac{3 + e^2}{\cos \pi}
$$
is, in \Rlang, written as follows:

<<numbers-0>>=
(3 + exp(2)) / cos(pi)
@

Where constant \Rconst{pi} ($\pi = 3.1415\ldots$) and function \Rfunction{cos()} (cosine) are defined in base \Rlang. Many trigonometric and mathematical functions are available in addition to operators like \verb|+|, \verb|-|, \verb|*|, \verb|/|, and \verb|^|.

\begin{warningbox}
  In \Rlang, angles are expressed in radians, thus $\cos(\pi) = -1$ and $\sin(\pi) = 0$, according to trigonometry. Degrees can be converted into radians taking into account that the full circle corresponds to $2 \times \pi$ when expressed in radians and to $360^\circ$ when expressed in degrees. Thus the sine of an angle of $45^\circ$ can be computed as follows.

<<numbers-radians>>=
sin(45/180 * pi)
@
\end{warningbox}
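
The trigonometric values for $\pi$ can be checked at the console. In this small sketch, the cosine of $\pi$ is returned as $-1$, while the sine of $\pi$ is returned as a very small number rather than exactly zero, because \code{pi} is stored with finite precision. The last statement applies the same conversion from degrees to the cosine function.

<<>>=
cos(pi)
sin(pi)             # a very small number instead of exactly 0
cos(45 / 180 * pi)  # cosine of an angle of 45 degrees
@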

One thing to remember when translating fractions into \Rlang code is that in mathematical notation the bar of the fraction generates a grouping that alters the normal precedence of operations. In contrast, in \Rlang expressions this grouping must be explicitly signalled with additional parentheses.
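
The effect of omitting these parentheses can be seen by comparing two versions of the expression used earlier in this section. In the second statement only \code{exp(2)} is divided by \code{cos(pi)}, so the two statements return different values.

<<>>=
(3 + exp(2)) / cos(pi)  # numerator grouped, as by the fraction bar
3 + exp(2) / cos(pi)    # only exp(2) is divided by cos(pi)
@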

If you are in doubt about how precedence rules work, you can add parentheses to make sure the order of computations is the one you intend. Redundant parentheses have no effect.

<<numbers-00>>=
1 + 2 * 3
1 + (2 * 3)
(1 + 2) * 3
@

The number of opening (left side) and closing (right side) parentheses must be balanced, and they must be located so that each enclosed term is a valid mathematical expression, i.e., code that can be evaluated to return a value. This value is inserted in place of the expression enclosed in parentheses before evaluating the remainder of the expression. For example, \code{(1 + 2) * 3} after evaluating \code{(1 + 2)} becomes \code{3 * 3} yielding \code{9}. In contrast, \code{(1 +) 2 * 3} is a syntax error as \code{1 +} is incomplete and does not yield a number.

\begin{playground}
In \emph{playgrounds} the output from running the code in \Rpgrm is not shown, as these are exercises for you to enter at the \Rpgrm console and run. In general, you should not skip them as in most cases playgrounds aim to teach or demonstrate concepts or features that I have \emph{not} included in full detail in the main text. You are strongly encouraged to \emph{play}, in other words, to create new variations of the examples and execute them to explore how \Rlang works.\qRfunction{sqrt()}\qRfunction{sin()}\qRfunction{log()}\qRfunction{log10()}\qRfunction{log2()}\qRfunction{exp()}

<<numbers-1, eval=eval_playground>>=
1 + 1
2 * 2
2 + 10 / 5
(2 + 10) / 5
10^2 + 1
sqrt(9)
@

<<numbers-1a, eval=eval_playground>>=
pi
sin(pi)
log(100)
log10(100)
log2(8)
exp(1)
@

\end{playground}

Variables\index{variables}\index{assignment} are used to store values. After we \emph{assign} a value to a variable, we can use the name of the variable in our code in place of the stored value. The ``usual'' assignment operator is \Roperator{<-}. In \Rlang, all names, including variable names, are case sensitive. Variables \code{a} and \code{A} are two different variables. Variable names can be long in \Rlang, although it is not a good idea to use very long names. Here I am using very short names, something that is usually also a very bad idea. However, in the examples in this chapter, where the stored values have no connection to the real world, simple names emphasise their abstract nature. In the chunk below, \code{vct1} and \code{vct2} are arbitrarily chosen variable names; I would have used names like \code{height.cm} or \code{outside.temperature.C} if they had been useful to convey information.

In the book, I use variable names that help recognise the kind of object stored, as this is most relevant when learning \Rlang. Here I use \code{vct1} because in \Rlang, as we will see on page \pageref{par:numeric:vectors:start}, numeric objects are always vectors, even when of length one.

<<numbers-2>>=
vct1 <- 1
vct1 + 1
vct1
vct2 <- 10
vct2 <- vct1 + vct2
vct2
@
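
Case sensitivity of names can be checked with a small sketch. The two names and the values stored are hypothetical, chosen only for illustration; they differ just in the case of their first letter and refer to two separate variables.

<<>>=
height.cm <- 172
Height.cm <- 180
height.cm
Height.cm
remove(height.cm, Height.cm) # cleanup
@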

Entering the name of a variable \emph{at the \Rlang console} implicitly calls function \code{print()} displaying the stored value on the console. The same applies to any other statement entered \emph{at the \Rlang console}: \code{print()} is implicitly called with the result of executing the statement as its argument.

<<numbers-2a>>=
vct1
print(vct1)
vct1 + 1
print(vct1 + 1)
@
\begin{playground}
There are some syntactically legal assignment statements that are not very frequently used, but you should be aware that they are valid, as they will not trigger error messages and may surprise you. The most important thing is to write code consistently. The ``backwards'' assignment operator \Roperator{->} and resulting code like \code{1 -> VCT1}\index{assignment!leftwise} are valid but less frequently used. The use of the equals sign (\Roperator{=}) for assignment in place of \Roperator{<-}, although valid, is discouraged. Chaining\index{assignment!chaining} assignments, as in the first statement below, can be used to signal to the human reader that \code{VCT1}, \code{VCT2} and \code{VCT3} are being assigned the same value.

<<numbers-3, tidy=FALSE, eval=eval_playground>>=
VCT1 <- VCT2 <- VCT3 <- 0
VCT1
VCT2
VCT3
1 -> VCT1
VCT1
VCT1 = 3
VCT1
remove(VCT1, VCT2, VCT3) # cleanup
@

\end{playground}

\begin{explainbox}\label{box:integer:float}
In\index{numeric, integer and double values} \Rlang, all numbers belong to mode \Rclass{numeric} (we will discuss the concepts of \emph{mode} and \emph{class} in section \ref{sec:rlang:mode} on page \pageref{sec:rlang:mode}). We can query if the mode of an object is \Rclass{numeric} with function \Rfunction{is.numeric()}. The returned values are either \code{TRUE} or \code{FALSE}. These are logical values that will be discussed in section \ref{sec:calc:boolean} on page \pageref{sec:calc:boolean}.

<<classes-01>>=
mode(1)
vct1 <- 1
is.numeric(vct1)
@

Because numbers can be stored in computer memory in different formats, most computing languages, including \Rlang, implement multiple types of numerical values. In most cases, \Rpgrm's \code{numeric} values can be used everywhere that a number is expected. However, in some cases, explicitly using class \Rclass{integer} to indicate that we will store or operate on whole numbers can be advantageous. \Rclass{integer} constants are identified by a trailing capital ``L'', as in \code{32L}.

<<classes-02>>=
is.numeric(1L)
is.integer(1L)
is.double(1L)
@

Real numbers are a mathematical abstraction, and do not have an exact equivalent in computers. Instead of Real numbers, computers store and operate on numbers that are restricted to a broad but finite range of values and have a finite resolution. They are called \emph{floats} (or \emph{floating-point} numbers); in \Rlang they go by the name of \Rclass{double} and can be created with the constructor \Rfunction{double()}.

<<classes-03>>=
is.numeric(1)
@

<<classes-03a>>=
is.integer(1)
is.double(1)
@
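
As a brief sketch, the storage type of a value can also be queried directly with function \code{typeof()}, which is not otherwise used in this box, and the constructor \code{double()} can be called with the desired length as its argument. The length of 3 is an arbitrary choice.

<<>>=
typeof(1)
typeof(1L)
double(3)             # a double vector of length 3 filled with zeros
is.double(double(3))
@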
\end{explainbox}

\index{vectors!introduction|(}\label{par:calc:vectors:diag}
Vectors\label{par:numeric:vectors:start} are one-dimensional in structure, of varying length and used to store similar values, e.g., numbers. They are different from the vectors, commonly used in Physics when describing directional forces, which are symbolised with an arrow as an ``accent'', such as $\overrightarrow{\mathbf{F}}$. In \Rlang numeric values and other atomic values are always \Rclass{vector}s that can contain zero, one or more elements. The diagram below exemplifies a vector containing ten elements, also called members. These elements can be extracted using integer numbers as positional indices, and manipulated as described in more detail in section \ref{sec:calc:indexing} on page \pageref{sec:calc:indexing}.\vspace{1ex}

\begin{center}
\begin{footnotesize}
\begin{tikzpicture}[font=\sffamily,
array/.style={matrix of nodes,nodes={draw, minimum size=7mm, fill=codeshadecolor},column sep=-\pgflinewidth, row sep=0.5mm, nodes in empty cells,
row 1/.style={nodes={draw=none, fill=none, minimum size=5mm}},
row 1 column 1/.style={nodes={draw}}}]

\matrix[array] (array) {
1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
  &   &   &   &   &   &   &   &   &   \\};
\node[draw, fill=gray, minimum size=4mm] at (array-2-9) (box) {};

\begin{scope}[on background layer]
\fill[blue!10] (array-1-1.north west) rectangle (array-1-10.south east);
\end{scope}

\draw (array-2-1.west) node [signal, draw, fill=codeshadecolor, minimum size=6mm, line width=1.5pt, left] (first) {\textcolor{blue}{\ \code{<name>}\strut}};
\draw (array-1-1.north)--++(90:3mm) node [above] (first) {First index};
\draw (array-1-10.east)--++(0:3mm) node [right]{\code{integer} positional indices};
\draw (array-2-10.east)--++(0:3mm) node [right]{Elements or \textcolor{blue}{\code{<values>}}};
\node [align=center, anchor=south] at (array-2-9.north west|-first.south) (8) {element at index 9};
\draw (8)--(box);
%
\end{tikzpicture}
\end{footnotesize}
\end{center}

Vectors, in mathematical notation, are similarly represented using positional indexes as subscripts,
\begin{equation}\label{eq:vector}
  a_{1\ldots n} = a_1, a_2, \ldots, a_i, \ldots, a_n,
\end{equation}
where $a_{1\ldots n}$ is the whole vector and $a_1$ its first member. The length of $a_{1\ldots n}$ is $n$ as it contains $n$ members. In the diagram above $n = 10$.

As you have seen above, the results of calculations were printed preceded with \code{[1]}. This is the index or position in the vector of the first number (or other value) displayed at the head of the current line. As in \Rlang single values are vectors of length one, when they are printed, they are also preceded with \code{[1]}.\label{par:print:vec:index}
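
A vector long enough to wrap over more than one line makes the role of these bracketed indices visible. In this minimal sketch the operator \code{:}, described later in this chapter, is used only as a convenient way of creating a sequence of whole numbers. How many values are printed per line, and thus which indices are shown, depends on the width of the console.

<<>>=
101:130
@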

One\label{par:calc:concatenate} can use function \Rfunction{c()} ``concatenate'' to create a vector from other vectors, including vectors of length 1, such as the \code{numeric} constants in the statements below, or even vectors of length 0. The first example shows an anonymous vector created, printed, and then automatically discarded.

<<numbers-4aann>>=
c(3, 1, 2)
@

To be able to reuse the vector, we assign it to a variable, giving a name to it. The length of a vector can be queried with function \Rfunction{length()}. Below, \Rlang code is followed by diagrams depicting the structure of the vectors created.

<<numbers-4aa>>=
vct4 <- c(3, 1, 2)
length(vct4)
vct4
@

%\begin{center}
\noindent
\begin{footnotesize}
\begin{tikzpicture}[font=\sffamily,
array/.style={matrix of nodes,nodes={draw, minimum size=7mm, fill=blue!20},column sep=-\pgflinewidth, row sep=0.5mm, nodes in empty cells,
row 1/.style={nodes={draw=none, fill=none, minimum size=5mm}}}]

\matrix[array] (array) {
1 & 2 & 3 \\
3  & 1  & 2 \\};
%\node[draw, fill=gray, minimum size=4mm] at (array-2-9) (box) {};

\begin{scope}[on background layer]
\fill[blue!10] (array-1-1.north west) rectangle (array-1-3.south east);
\end{scope}

\draw (array-2-1.west) node [signal, draw, fill=codeshadecolor, minimum size=6mm, line width=1.5pt, left] (first) {\strut\code{\ vct4\phantom{mm}}};
\draw (array-1-3.east)--++(0:3mm) node [right]{\code{integer} positional indices};
\draw (array-2-3.east)--++(0:3mm) node [right]{\code{numeric} values};
%
\end{tikzpicture}
\end{footnotesize}
%\end{center}

<<numbers-4bb>>=
vct5 <- c(4, 5, 0)
vct5
@

\noindent
%\begin{center}
\begin{footnotesize}
\begin{tikzpicture}[font=\sffamily,
array/.style={matrix of nodes,nodes={draw, minimum size=7mm, fill=blue!20},column sep=-\pgflinewidth, row sep=0.5mm, nodes in empty cells,
row 1/.style={nodes={draw=none, fill=none, minimum size=5mm}}}]

\matrix[array] (array) {
1 & 2 & 3 \\
4  & 5  & 0 \\};
%\node[draw, fill=gray, minimum size=4mm] at (array-2-9) (box) {};

\begin{scope}[on background layer]
\fill[blue!10] (array-1-1.north west) rectangle (array-1-3.south east);
\end{scope}

\draw (array-2-1.west) node [signal, draw, fill=codeshadecolor, minimum size=6mm, line width=1.5pt, left] (first) {\strut\code{\ vct5\phantom{mm}}};
\draw (array-1-3.east)--++(0:3mm) node [right]{\code{integer} positional indices};
\draw (array-2-3.east)--++(0:3mm) node [right]{\code{numeric} values};
%
\end{tikzpicture}
\end{footnotesize}
%\end{center}

<<numbers-4cc>>=
vct6 <- c(vct4, vct5)
vct6
@

\noindent
%\begin{center}
\begin{footnotesize}
\begin{tikzpicture}[font=\sffamily,
array/.style={matrix of nodes,nodes={draw, minimum size=7mm, fill=blue!20},column sep=-\pgflinewidth, row sep=0.5mm, nodes in empty cells,
row 1/.style={nodes={draw=none, fill=none, minimum size=5mm}}}]

\matrix[array] (array) {
1 & 2 & 3 & 4 & 5 & 6\\
3 & 1 & 2 & 4  & 5  & 0 \\};
%\node[draw, fill=gray, minimum size=4mm] at (array-2-9) (box) {};

\begin{scope}[on background layer]
\fill[blue!10] (array-1-1.north west) rectangle (array-1-6.south east);
\end{scope}

\draw (array-2-1.west) node [signal, draw, fill=codeshadecolor, minimum size=6mm, line width=1.5pt, left] (first) {\strut\code{\ vct6\phantom{mm}}};
\draw (array-1-6.east)--++(0:3mm) node [right]{\code{integer} positional indices};
\draw (array-2-6.east)--++(0:3mm) node [right]{\code{numeric} values};
%
\end{tikzpicture}
\end{footnotesize}
%\end{center}

<<numbers-4dd>>=
vct7 <- c(vct5, vct4)
vct7
@

\noindent
%\begin{center}
\begin{footnotesize}
\begin{tikzpicture}[font=\sffamily,
array/.style={matrix of nodes,nodes={draw, minimum size=7mm, fill=blue!20},column sep=-\pgflinewidth, row sep=0.5mm, nodes in empty cells,
row 1/.style={nodes={draw=none, fill=none, minimum size=5mm}}}]

\matrix[array] (array) {
1 & 2 & 3 & 4 & 5 & 6\\
4 & 5 & 0 & 3 & 1 & 2\\};
%\node[draw, fill=gray, minimum size=4mm] at (array-2-9) (box) {};

\begin{scope}[on background layer]
\fill[blue!10] (array-1-1.north west) rectangle (array-1-6.south east);
\end{scope}

\draw (array-2-1.west) node [signal, draw, fill=codeshadecolor, minimum size=6mm, line width=1.5pt, left] (first) {\strut\code{\ vct7\phantom{mm}}};
\draw (array-1-6.east)--++(0:3mm) node [right]{\code{integer} positional indices};
\draw (array-2-6.east)--++(0:3mm) node [right]{\code{numeric} values};
%
\end{tikzpicture}
\end{footnotesize}
%\end{center}

One or more member values of a vector can be extracted using the positional indexes and the extraction operator \Roperator{[ ]}. The returned value is a new vector. Member extraction is discussed in detail in section \ref{sec:calc:indexing} on page \pageref{sec:calc:indexing}.

<<numeric-extract-member>>=
vct7[3]
vct7[c(6, 2)]
@

\begin{faqbox}{How to create an empty vector?}
<<numeric-empty-faq>>=
numeric()
@
\end{faqbox}

Next, I show concatenation of two vectors of the same class, the second of them of length zero.

<<numbers-4ee>>=
c(vct7, numeric())
@

Function \code{c()} accepts as arguments two or more vectors and concatenates them, one after another. Quite frequently we may need to insert one vector in the middle of another. For this operation, \code{c()} is not useful by itself. One could use indexing combined with \code{c()}, but this is not needed as \Rlang provides a function capable of directly doing this operation. Although it can be used to ``insert'' values, it is named \code{append()}, and by default, it indeed appends one vector at the end of another.

<<numbers-4a>>=
append(vct4, vct5)
@

The output above is the same as for \code{c(vct4, vct5)}. However, \Rfunction{append()} accepts as an argument an index position after which to ``append'' its second argument. This results in an \emph{insert} operation when the index points at any position different from the end of the vector.\label{par:calc:append:end}

<<numbers-4b>>=
append(vct4, values = vct5, after = 2)
@
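
For comparison, the same insertion can be sketched using indexing combined with \code{c()}, the more laborious approach mentioned above. The two vectors are assigned again, with the same values as before, only to keep the example self-contained. Function \code{identical()}, not described in this section, is used here to check that both approaches return the same vector.

<<>>=
vct4 <- c(3, 1, 2) # same values as above
vct5 <- c(4, 5, 0)
c(vct4[c(1, 2)], vct5, vct4[3])
identical(c(vct4[c(1, 2)], vct5, vct4[3]),
          append(vct4, values = vct5, after = 2))
@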

\begin{playground}\label{pg:seq:rep}
One can create sequences\index{sequence} using function \Rfunction{seq()} or the operator \Roperator{:}, or repeat values using function \Rfunction{rep()}. In this case, I leave it to the reader to work out the rules by running these and their own examples, with the help of the documentation, available through \code{help(seq)} and \code{help(rep)}.

<<numbers-5, eval=eval_playground>>=
-1:5
5:-1
seq(from = -1, to = 1, by = 0.1)
rep(-5, times = 4)
rep(1:2, length.out = 4)
@

\end{playground}

\begin{faqbox}{How to create a vector of zeros?}

<<numeric-zeros1-faq>>=
numeric(length = 10)
@

or

<<numeric-zeros2-faq>>=
rep(0, times = 10)
@

\end{faqbox}

Next,\label{par:calc:vectorised:opers} something that makes \Rlang different from most other programming languages: vectorised arithmetic\index{vectorised arithmetic}. Operators and functions that are vectorised accept, as arguments, vectors of arbitrary length, in which case the result returned is equivalent to having applied the same function or operator individually to each element of the vector.\label{par:vectorised:numeric}

<<numbers-6aa>>=
log10(100)
log10(c(10, 5, 100, 200))
@

Function \Rfunction{sum()} accepts vectors of different lengths as input but is not vectorised, as it always returns a vector of length one as result.

<<numbers-6ab>>=
sum(100)
sum(c(10, 5, 100, 200))
@

A vectorised sum, also called a parallel sum of vectors, differs from the sum of the members of a vector, as computed above with function \Rfunction{sum()}. It is the usual way in which \Roperator{+} and other arithmetic operators and functions work in \Rlang.

<<numbers-6ac>>=
c(3, 1, 2) + c(1, 2, 31)
@

Vectorised\index{recycling of arguments}\index{recycling of operands} functions and operators that operate on more than one vector simultaneously, in many cases accept vectors of mismatched length as arguments or operands. When two or more vectors are of different length, these functions and operators recycle the shorter vector(s) to match the length of the longest one. The two statements below are equivalent; in the first statement, the short vector \code{1} is first recycled into \code{c(1, 1, 1)}. The operation, addition in this example, is applied to the numbers stored at the same position in the two vectors, returning a new vector.

<<numbers-6ad>>=
c(3, 1, 2) + 1
c(3, 1, 2) + c(1, 1, 1)
@

In the second code statement (line) below, \code{vct4} is of length \Sexpr{length(vct4)}, but the \code{numeric} constants 1 and 2 are vectors of length 1. Each of these short constant vectors is extended, by recycling (replicating) its value, into a longer vector, i.e., a vector of the same length as the longest vector in the statement, \code{vct4}.\label{par:recycling:numeric}

<<numbers-6>>=
vct4 <- c(3, 1, 2)
(vct4 + 1) * 2
vct4 * 0:1
vct4 - vct4
@

Make sure you understand what calculations are taking place in the chunk above, and also the one below. Vectorisation and vector recycling are key features of the \Rlang language.

<<numbers-6a>>=
vct8 <- rep(1, 6)
vct8
vct8 + 1:2
vct8 + 1:3
vct8 + 1:4
@

\begin{playground}
  Create further variants of the statements in the code chunk above to work out when warnings or errors are issued. Does the length of the operands matter?
\end{playground}

\begin{warningbox}
  Most functions defined in base \Rlang apply recycling to vectors passed as arguments to at least some of their parameters. When recycling is supported, the conditions triggering warnings or errors are consistent with those you discovered in the playground above. However, if and how recycling is applied depends on how functions have been defined. Thus, there is variation, especially, but not only, in the case of functions and operators defined in contributed extension packages. For example, package \pkgname{tibble} and some other packages in the \pkgname{tidyverse} support recycling but some boundary cases that trigger a warning in base \Rlang functions, trigger an error in functions defined in these packages. See section \ref{sec:data:tibble} on page \pageref{sec:data:tibble} about package \pkgname{tibble}.
\end{warningbox}

\begin{explainbox}
As mentioned above, a vector can contain zero or more member values. Vectors of length zero may seem at first sight quite useless, but in practice they are very useful. They allow the handling of ``no input'' or ``nothing to do'' cases as normal cases, which in the absence of vectors of length zero would need to be treated as special cases. Constructors for \Rlang classes like \Rfunction{numeric()} return vectors of a length given by their first argument, which defaults to zero.

<<>>=
vct9 <- numeric(length = 0) # named argument
vct9
length(vct9)
@

<<>>=
numeric() # default argument
@

Vectors of length zero behave, in most cases, as expected, e.g., they can be concatenated as shown here.

<<>>=
length(c(vct4, vct9, vct5))
length(c(vct4, vct5))
@

Many functions, such as \Rlang's maths functions and operators, will accept numeric vectors of length zero as valid input, returning also a vector of length zero, issuing neither a warning nor an error message. In other words, \emph{these are valid operations} in \Rlang.

<<>>=
log(numeric(0))
5 + numeric(0)
@
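
Two further cases, given as a small sketch: the sum of no values is zero, and the length of the result of an arithmetic operation with a vector of length zero as operand is also zero.

<<>>=
sum(numeric())         # the sum of no values
length(5 + numeric())  # the result is also of length zero
@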

Even when of length zero, vectors do have to belong to a class acceptable for the operation: \code{5 + character(0)} is an error (\code{character} values are described in section \ref{sec:calc:character} on page \pageref{sec:calc:character}).

Passing as an argument to parameter \code{length} a value larger than zero creates a longer vector filled with zeros in the case of \Rfunction{numeric()}.

<<>>=
numeric(length = 5)
@

The length of a vector can be explicitly increased, with missing values filled automatically with \code{NA}, the marker for not available.

<<>>=
vct10 <- 1:5
length(vct10) <- 10
vct10
@

If the length is decreased, the values in the \emph{tail} of the vector are discarded.

<<>>=
vct11 <- 1:10
vct11
length(vct11) <- 5
vct11
@

\end{explainbox}
\label{par:numeric:vectors:end}\index{vectors!introduction|)}

There\index{special values!NA} are some special values available for numbers. \Rconst{NA} meaning ``not available'' is used for missing values. \Rconst{NA} values play a very important role in the analysis of data, as frequently some observations are missing from an otherwise complete data set due to ``accidents'' during the course of an experiment or survey. It is important to understand how to interpret \Rconst{NA} values: They are placeholders for something that is unavailable, in other words, whose value is \emph{unknown}. \Rconst{NA} values propagate when used, so that numerical computations yield \Rconst{NA} when one or more of the input values is unknown.

<<numbers-8>>=
vct12 <- c(NA, 5)
vct12
vct12 + 1
@

Calculations\index{special values!NaN}\label{par:special:values} can also yield the following values: \Rconst{NaN} ``not a number'', and \Rconst{Inf} and \Rconst{-Inf} for $\infty$ and $-\infty$. As you will see below, arithmetic operations yielding these values do \textbf{not} trigger errors or warnings, as they are arithmetically valid. \Rconst{Inf} and \Rconst{-Inf} are also valid numerical values for input and constants.

<<numbers-8a>>=
vct12 + Inf
Inf / vct12
-1 / 0
1 / 0
Inf / Inf
Inf + 4
-Inf * -1
@

\begin{playground}
\textbf{When to use vectors of length zero, and when \code{NA}s?}\index{zero length objects}\index{vectors!zero length} Make sure you understand the logic behind the different behaviour of functions and operators with respect to \code{NA} and \code{numeric()} or its equivalent \code{numeric(0)}. What do they represent? Why are \Rconst{NA}s not ignored, while vectors of length zero are?

<<numbers-PG00, eval=eval_playground>>=
123 + numeric()
123 + NA
@

\emph{Model answer:}
\Rconst{NA} values are used to signal a value that ``was lost'' or ``was expected'' but is unavailable because of some accident. A vector of length zero represents no values, but within the normal expectations. In particular, if vectors are expected to have a certain length, or if index positions along a vector are meaningful, then using \Rconst{NA} is a must.

\end{playground}

Nearly all operations, even tests of equality, involving one or more \Rconst{NA} values return \Rconst{NA}. In other words, when one input to a calculation is unknown, the result of the calculation is unknown. This means that a special function is needed for testing for the presence of \code{NA} values.

<<numbers-8b>>=
is.na(c(NA, 1))
@

In the example above, we can also see that \Rfunction{is.na()} is vectorised, and that it applies the test to each of the elements of the vector individually, returning the result as \code{TRUE} or \code{FALSE}.
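
The need for a special function can be seen in a short sketch: comparisons with \code{==} return \code{NA} whenever one of the values compared is \code{NA}, even when both are. The last two statements use function \code{is.nan()}, not mentioned above, to show that \code{NaN} can be told apart from \code{NA}, while \code{is.na()} returns \code{TRUE} for both.

<<>>=
NA == 1
NA == NA
c(NA, 1) == 1
is.nan(c(NaN, NA, 1))
is.na(c(NaN, NA, 1))
@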

One\index{precision!math operations}\index{numbers!floating point} needs to be aware of the consequences of numbers in computers being almost always stored with finite precision and/or range: the expectations derived from the mathematical definition of Real numbers are not always fulfilled. See the box on page \pageref{box:floats} for an in-depth explanation.

<<numbers-9>>=
1 - 1e-20
@
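
A few related checks, given as a minimal sketch. The first statement tests whether the result above is equal to 1. The second displays \code{double.eps}, a member of the built-in list \code{.Machine}, which describes the resolution of \code{double} values near 1. The last two statements compare \code{0.1 + 0.2} with \code{0.3}, first exactly and then with function \code{all.equal()}, which tolerates very small differences. The numbers compared are arbitrary choices for illustration.

<<>>=
1 - 1e-20 == 1
.Machine$double.eps
0.1 + 0.2 == 0.3
all.equal(0.1 + 0.2, 0.3)
@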

When using \Rclass{integer}\index{numbers!whole}\index{numbers!integer} values these problems do not exist, as integer arithmetic is not affected by loss of precision in calculations restricted to integers. Because of the way integers are stored in the memory of computers, within the representable range, they are stored exactly. One can think of computer integers as a subset of whole numbers restricted to a certain range of values.

<<integers-1>>=
1L + 3L
1L * 3L
@

Using the ``usual'' division operator yields a floating-point \code{double} result, while the integer division operator \Roperator{\%/\%} yields an \code{integer} result, and the modulo operator \Roperator{\%\%} returns the remainder from the integer division.

<<integers-1a>>=
1L / 3L
1L %/% 3L
1L %% 3L
@
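
The relationship between these two operators can be checked with a small example, in which the values 7 and 2 are arbitrary choices. Multiplying the integer quotient by the divisor and adding the remainder gives back the dividend.

<<>>=
7L %/% 2L
7L %% 2L
(7L %/% 2L) * 2L + 7L %% 2L
@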

If an operation would create an \code{integer} value that falls outside the range representable in \Rlang, the value returned is \code{NA} (not available) and a warning is issued.

<<integers-1b>>=
1000000L * 1000000L
@
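
The upper limit of the representable range can be inspected directly. In this minimal sketch, \code{integer.max}, a member of the built-in list \code{.Machine} and not discussed in the text above, holds the largest \code{integer} value. Adding \code{1L} to it falls outside the range.

<<>>=
.Machine$integer.max
.Machine$integer.max + 1L  # outside the range of integer values
@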

Both doubles and integers are considered numeric. In most situations, conversion is automatic and we do not need to worry about the differences between these two types of numeric values. The functions in the next chunk return \code{TRUE} or \code{FALSE}, i.e., \code{logical} values (see section \ref{sec:calc:boolean} on page \pageref{sec:calc:boolean}).\index{numbers!double}\index{numbers!integer}

<<integers-2>>=
is.numeric(1L)
is.integer(1L)
is.double(1L)
is.double(1L / 3L)
is.numeric(1L / 3L)
@

\begin{advplayground}
Study the variations of the previous example shown below, and explain why the two statements return different values. Hint: 1 is a \code{double} constant. You can use \code{is.integer()} and \code{is.double()} in your explorations.

<<integers-PG1, eval=eval_playground>>=
1 * 1000000L * 1000000L
1000000L * 1000000L * 1
@
\end{advplayground}

\begin{explainbox}
\label{box:floats} \label{par:float}
\index{integer numbers!arithmetic|(}\index{double precision numbers!arithmetic|(}
\index{floating point numbers!arithmetic|(}\index{machine arithmetic!precision|(}
\index{floats|see{floating point numbers}}\index{machine arithmetic!rounding errors}\index{Real numbers and computers}\index{integer numbers and computers}
\index{EPS ($\epsilon$)|see{machine arithmetic precision}}%
The usual way to store numerical values in computers is to reserve a fixed amount of space in memory for each value, which imposes limits on which numbers can be represented and on the maximum precision that can be achieved. The difference between \Rclass{integer} and \Rclass{double} is explained on page \pageref{box:integer:float}. Integers, or ``whole numbers'', like \Rlang \Rclass{integer} values, are always stored with the same resolution, such that the smallest difference between two integer values is 1. The amount of memory available to store an individual value creates a limit for the size of the largest and smallest values that can be represented. Thus integers in \Rlang behave like Integers or whole numbers as defined in mathematics, but constrained to a restricted finite range of values. In the computing language \Clang, different types of integer numbers are available, such as \code{short} and \code{long}, which differ in the size of the space reserved for them in memory. The \Rlang \Rclass{integer} type is equivalent to \code{long} in \Clang, thus the use of \code{L} for integer constant values like \code{5L}.

Floating point numbers like \Rlang \Rclass{double} values are stored in two parts: an integer \emph{significand} and an integer \emph{exponent}, each part using a fixed amount of space in memory. The relative resolution is constrained by the number of digits that can be stored in the significand, while the absolute size of the largest and smallest numbers that can be represented is limited by the largest and smallest values that fit in the memory reserved for the exponent. In many computing languages, different types of floating point numbers are available, which differ in the size of the space reserved for them in memory. The properties of Real numbers as defined in mathematics differ from those of floating point numbers in assuming unlimited resolution and an unlimited range of representable values.

In \Rpgrm, numbers that are not integers are stored as \emph{double-precision floats}. Precision of numerical values in computers is usually symbolised by ``epsilon'' ($\epsilon$), commonly abbreviated \emph{eps}, defined as the smallest positive value of $\epsilon$ for which $1 + \epsilon \neq 1$. The finite resolution of floats can lead to unexpected results when testing for equality or inequality. Tests for equality are done with operator \code{==}. The use of this and other comparison operators is explained in section \ref{sec:calc:comparison} on page \pageref{sec:calc:comparison}.


<<comparison-5>>=
1e20 == 1 + 1e20
1 == 1 + 1e-20
0 == 1e-20
@
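
When two \code{double} values should be treated as equal if they differ only by rounding error, a comparison with a tolerance can replace \code{==}. A minimal sketch follows, using base \Rlang function \Rfunction{all.equal()}, which is not otherwise described in this section, wrapped in \Rfunction{isTRUE()} so that a single \code{logical} value is returned. The values 0.1, 0.2, and 0.3 are arbitrary examples of decimal fractions that have no exact binary representation, and the explicit tolerance in the last statement is also an arbitrary choice.

<<comparison-sketch-01>>=
0.1 + 0.2 == 0.3                  # exact comparison
isTRUE(all.equal(0.1 + 0.2, 0.3)) # comparison with a tolerance
abs((0.1 + 0.2) - 0.3) < 1e-8     # explicit, arbitrarily chosen tolerance
@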

Another way of revealing the limited precision is during conversion to \code{character}.

<<numbers-EB10>>=
format(5.123, digits = 16) # near maximum resolution
format(5.123, digits = 22) # more digits than in resolution
@

The accumulation of successive small losses of precision from multiple operations on \Rlang \code{double} values can be a problem. Thus when computations involve both very large and very small numbers, the returned value can depend on the order of the operations. In practice ordinary users rarely need to be concerned about losses in precision except when testing for equality and inequality. On the other hand, finite resolution of \code{double} numerical values can explain why sometimes returned values for equivalent computations differ, and why some computation algorithms may be preferable, and others even fail, in specific cases.
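
The dependence on the order of operations can be illustrated with a toy computation that uses the same values as the comparisons above. Mathematically both statements below should return 1, but in the first one the small value is lost when it is added to the large one.

<<numbers-sketch-02>>=
(1e20 + 1) - 1e20 # the 1 is lost in the addition
(1e20 - 1e20) + 1 # the large values cancel out first
@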

As the \Rpgrm program can be used on different types of computer hardware, the actual machine limits for storing numbers in memory may vary depending on the type of processor and even the compiler used to build the \Rpgrm program executable. However, it is possible to obtain these values at run time, i.e., while \Rpgrm is being used, from the variable \code{.Machine}, which is part of the \Rlang language. Please see the help page for \code{.Machine} for a detailed and up-to-date description of the available constants. \emph{Beware that when you run the examples below, the values returned by \Rlang in your own computer can differ from those returned in the computer I have used to typeset the book as you are reading it here.}\qRconst{.Machine\$double.eps}\qRconst{.Machine\$double.neg.eps}\qRconst{.Machine\$double.max.exp}

<<machine-eps-01>>=
.Machine$double.eps
.Machine$double.neg.eps
.Machine$double.max.exp
.Machine$double.min.exp
.Machine$double.base
@

The values of \code{.Machine\$double.max.exp} and \code{.Machine\$double.min.exp} are exponents of the base number or \emph{radix}, \Sexpr{.Machine$double.base}, rather than the maximum and minimum size of numbers that can be handled as objects of class \Rclass{double}. The maximum size of normalised \code{double} values, given by \code{.Machine\$double.xmax}, is much larger than the maximum value of \code{integer} values, given by \code{.Machine\$integer.max}.\qRconst{.Machine\$double.min.exp}\qRconst{.Machine\$double.xmax}\qRconst{.Machine\$integer.max}

<<machine-eps-01a>>=
.Machine$double.xmax
.Machine$integer.max
@
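
The meaning of \code{.Machine\$double.eps} and \code{.Machine\$double.neg.eps} can be checked directly. The sketch below assumes the IEEE 754 double-precision arithmetic used by most current hardware: adding \code{.Machine\$double.eps} to 1 gives a value that can be distinguished from 1, while adding half of it does not. On other hardware the results could differ.

<<machine-eps-sketch-01>>=
1 + .Machine$double.eps == 1
1 + .Machine$double.eps / 2 == 1
1 - .Machine$double.neg.eps == 1
@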

As \Rclass{integer} values are stored in machine memory without loss of precision, epsilon is not defined for \Rclass{integer} values.
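In \Rlang not all out-of-range \code{numeric} values behave in the same way: \code{double} values too large to be represented are stored as \Rconst{-Inf} or \Rconst{Inf} and enter arithmetic as infinite values according to the mathematical rules, and those too close to zero are stored as zero, while out-of-range \code{integer} values returned by arithmetic operations become \code{NA} with a warning. An out-of-range constant written with the \code{L} suffix is instead read as a \code{double}, also with a warning.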

<<machine-eps-02>>=
1e1026
1e-1026
@

<<machine-eps-03, warning=TRUE>>=
2147483699L
@

In those statements in the chunk below where at least one operand is \Rclass{double} the \Rclass{integer} operands are \emph{promoted} to \Rclass{double} before computation. A similar promotion does not take place when operations are among \Rclass{integer} values, resulting in \emph{overflow}\index{arithmetic overflow}\index{overflow|see{arithmetic overflow}}, meaning numbers that are too big to be represented as \Rclass{integer} values.

<<machine-eps-04>>=
2147483600L + 99L
2147483600L + 99
2147483600L * 2147483600L
2147483600L * 2147483600
@

The exponentiation operator \Roperator{\^{}} forces the promotion\index{type promotion}\index{arithmetic overflow!type promotion} of its arguments to \Rclass{double}, resulting in no overflow. In contrast, as seen above, the multiplication operator \Roperator{*} operates on \code{integer} values resulting in overflow.

<<machine-eps-05>>=
2147483600L * 2147483600L
2147483600L^2L
@

\index{integer numbers!arithmetic|)}\index{double precision numbers!arithmetic|)}
\index{floating point numbers!arithmetic|)}\index{machine arithmetic!precision|)}
\end{explainbox}

Both\label{par:calc:round} for display and as part of computations, we may want to decrease the number of significant digits or the number of digits after the decimal marker. Be aware that in the examples below, even if printing is being done by default, these functions return \code{numeric} values that are different from their input and can be stored and used in computations. Function \Rfunction{round()} is used to round numbers to a certain number of decimal places after or before the decimal marker, with a positive or negative value for \code{digits}, respectively. In contrast, function \Rfunction{signif()} rounds to the requested number of significant digits, i.e., ignoring the position of the decimal marker.

<<convert-3>>=
round(0.0124567, digits = 3)
signif(0.0124567, digits = 3)
round(1789.1234, digits = -1)
round(1789.1234, digits = 3)
signif(1789.1234, digits = 3)
@

<<convert-3x>>=
vct13 <- 0.12345
vct14 <- round(vct13, digits = 2)
vct13 == vct14
vct13 - vct14
vct14
@
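
Rounding a value differs from formatting it for display. As a sketch of the difference, the chunk below compares \Rfunction{signif()}, which returns a \code{numeric} value, with \Rfunction{format()}, used earlier in this chapter, which returns a \code{character} string that cannot be used directly in arithmetic.

<<convert-sketch-01>>=
signif(0.0124567, digits = 3)
is.numeric(signif(0.0124567, digits = 3))
format(0.0124567, digits = 3)
is.character(format(0.0124567, digits = 3))
@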

\begin{explainbox}
Functions are described in detail in section \ref{sec:script:functions} on page \pageref{sec:script:functions}. Here I describe them briefly in relation to their use. Functions are objects containing \Rlang code that can be used to perform an operation on values passed as arguments to their parameters. They return the result of the operation as a single \Rlang object, or less frequently, as a side effect. Functions have a name like any other \Rlang object. If the name of a function is followed by parentheses \code{()} and included in a code statement, it becomes a function \emph{call} or a ``request'' for the code stored in the function object to be run. Many functions accept \Rlang objects and/or constant values as \emph{arguments} to their \emph{formal parameters}. Formal parameters are placeholder names in the code stored in the function object, or the \emph{definition} of the function. In a function call, the code in its definition is evaluated (or run) with formal parameter names taking the values passed as arguments to them.

In a function definition, formal parameters can be assigned default values, which are used if no explicit argument is passed in the call. Arguments can be passed to formal parameters by name or by position. In most cases, passing arguments by name makes the code easier to understand and more robust against coding mistakes. In the examples presented in the book, I most frequently pass arguments by name, except for the first parameter.

As \code{digits} is the second parameter of \Rfunction{round()}, its argument can also be passed by position.

<<convert-3a>>=
round(0.0124567, digits = 3)
round(0.0124567, 3)
@

When passing arguments by name, in most cases unambiguous partial matching is acceptable, but can make code difficult to read.

<<convert-3b>>=
round(0.0124567, di = 3)
@
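
The use of default values can be sketched with the same function. The help page for \Rfunction{round()} documents 0 as the default for \code{digits}, so the two calls below are expected to be equivalent.

<<convert-sketch-02>>=
round(1789.1234)             # default argument used for digits
round(1789.1234, digits = 0) # the same value passed explicitly
@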

\end{explainbox}

Functions \Rfunction{trunc()} and \Rfunction{ceiling()} return the non-fractional part of a numeric value as a new numeric value. They differ in how they handle negative values, and neither of them rounds the returned value to the nearest whole number. Hint: you can use \code{help(trunc)} or \code{?trunc} at the \Rpgrm console, or the help tab of \RStudio to find out the answer.

\begin{playground}
What does value truncation mean? Function \Rfunction{trunc()} truncates a numeric value, but it does not return an \code{integer}.
\begin{itemize}
  \item Explore how \Rfunction{trunc()} and \Rfunction{ceiling()} differ. Test them both with positive and negative values.
  \item \textbf{Advanced} Use function \Rfunction{abs()} and operators \Roperator{+} and \Roperator{-} to reproduce the output of \Rfunction{trunc()} and \Rfunction{ceiling()} for the different inputs.
  \item Can \Rfunction{trunc()} and \Rfunction{ceiling()} be considered type conversion functions in \Rlang?
\end{itemize}
\end{playground}

\begin{explainbox}
  \Rlang supports complex numbers and arithmetic operations with class \Rclass{complex}. As complex numbers rarely appear in user-written scripts, I give only one example of their use. Complex numbers, as defined in mathematics, have two parts, a real component and an imaginary one. Complex numbers can be used, for example, to describe the result of $\sqrt{-1} = 1i$.

<<numbers-complex>>=
cmp1 <- complex(real = c(-1, 1), imaginary = c(0, 0))
cmp1
cmp2 <- sqrt(cmp1)
cmp2
cmp2^2
@

\end{explainbox}

\index{classes and modes!numeric, integer, double|)}\index{numbers and their arithmetic|)}

\begin{warningbox}
  Instants in time and periods of time in computers are usually encoded as classes derived from \code{integer}, and thus are considered in \Rlang as atomic classes and the objects as vectors. Some of these encodings are standardised and supported by \Rlang classes \Rclass{POSIXlt} and \Rclass{POSIXct}. Computations based on times and dates are difficult because the relationship between local time at a given location and Coordinated Universal Time (UTC) has changed with time, as well as with changes in national borders. Packages \pkgname{lubridate} and \pkgname{anytime} support operations among time-related data and conversions between character strings and time and date classes, making them easier and less error prone than when using base \Rlang functions. Thus I describe classes and operations related to dates and times in section \ref{sec:data:datetime} on page \pageref{sec:data:datetime}.
\end{warningbox}

It\index{removing objects}\index{deleting objects|see {removing objects}}\label{par:clac:remove}\label{par:calc:remove} is good to \emph{remove} from the workspace objects that are no longer needed. We use function \Rfunction{remove()} to delete objects stored in the current workspace.

Arguments passed to \Rfunction{remove()} can be bare object names as shown here.

<<>>=
an.object <- 1:4
remove(an.object) # using a bare name
@

Function \Rfunction{remove()} also accepts the names of the objects to remove as a \code{character} vector passed to its parameter \code{list}. In spite of its name, the argument must be a \code{vector} rather than a \code{list} (see section \ref{sec:calc:character} on \code{character} and section \ref{sec:calc:lists} on \code{list} on pages \pageref{sec:calc:character} and \pageref{sec:calc:lists}).

<<>>=
an.object <- 5:2
remove(list = "an.object") # using a character vector
@

Function \Rfunction{objects()} returns a \code{character} vector containing the names of all objects visible in the current environment, or by passing an argument to parameter \code{pattern}, only the objects with names matching it.

<<>>=
an.object <- 1:4
another.object <- 2
objects(pattern = "\\.object$") # names ending in ".object"
remove(an.object)
objects(pattern = "\\.object$")
@

In \pgrmname{RStudio}, all objects are listed in the \textbf{Environment} tab and the search box of this tab can be used to find a given object.

\begin{explainbox}
Function \Rfunction{remove()} accepts both bare names of objects as in the chunk above and \code{character} strings corresponding to object names like in \code{remove("any.object")}. However, while \Rfunction{objects()} accepts patterns to be matched to object names, \Rfunction{remove()} does not. Because of this, these two functions have to be used together for removing all objects with names that match a pattern. The pattern can be given as a regular expression (see section \ref{sec:calc:regex} on page \pageref{sec:calc:regex}).

Both functions are available under short names matching those used in \osnameNI{Linux} and \osnameNI{Unix} for managing files: \Rfunction{ls()} is a synonym of \Rfunction{objects()} and \Rfunction{rm()} of \Rfunction{remove()}.

Using a simple search pattern we obtain the names of all objects with names \code{"vct1"}, \code{"vct2"}, and so on. When using a pattern to remove objects, it is good to first use \Rfunction{objects()} on its own to get a list of the objects that would be deleted by calling \Rfunction{remove()} when passing the names returned by \Rfunction{objects()} as the argument for parameter \code{list}.

<<numbers-7>>=
objects(pattern = "^vct.*")
@

The code below removes all objects with names \code{"vct1"}, \code{"vct2"}, and so on. We do this at the end of the section before reusing the same names in the code examples of the next section.

<<numbers-last>>=
remove(list = objects(pattern = "^vct[[:digit:]]?"))
@

Similar code chunks are included at the end of each section throughout the book to ensure that code examples are self-contained by section. The chunk above is shown as an example, but similar chunks are kept hidden in later sections.
\end{explainbox}

\section{Character Values}\label{sec:calc:character}
\index{character strings}\index{classes and modes!character|(}\qRclass{character}
In spite of the name \code{character}, values of this mode are vectors of \emph{character strings}. Character constants are written by enclosing character strings in quotation marks, i.e., \code{"this is a character string"}. There are three types of quotation marks in the ASCII character set: double quotes \code{"}, single quotes \code{'}, and back ticks \code{`}. The first two types of quotes can be used as delimiters of \code{character} constants.

<<char-1>>=
vct1 <- "A"
vct1
vct2 <- 'A'
vct2
vct1 == vct2 # two variables holding character values, or named objects
"A" == 'A' # two constant character values, or anonymous objects
@

\begin{explainbox}
In many computer languages, vectors of characters are distinct from vectors of character strings. In these languages, character vectors store at each index position a single character, while vectors of character strings store at each index position strings of characters of various lengths, such as words or sentences. If you are familiar with \Clang or \Cpplang, you need to keep in mind that \Clang's \code{char} and \Rlang's \code{character} are not equivalent. In contrast to these other languages, in \Rlang there is no predefined class for vectors of individual characters, and character constants enclosed in double or single quotes are not different.
\end{explainbox}

Concatenating character vectors of length one does not yield a longer character string; it yields instead a longer vector of character strings.

<<char-1a>>=
vct3 <- 'ABC'
vct4 <- "bcdefg"
vct5 <- c("123", "xyz")
c(vct3, vct4, vct5)
@
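
The difference can be checked with function \Rfunction{length()}. In this sketch, which uses constants so as to be self-contained, the value returned by \Rfunction{c()} has as many members as the vectors passed as arguments have in total.

<<char-sketch-01>>=
length("ABC")
length(c("ABC", "bcdefg", c("123", "xyz")))
@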

Having two different delimiters available makes it possible to choose the type of quotation marks used as delimiters so that other quotation marks can be easily included in a string.

<<char-3>>=
"He said 'hello' when he came in"
'He said "hello" when he came in'
@

The\index{character string delimiters} outer quotes are not part of the string; they are ``delimiters'' used to mark the boundaries. As you can see when \code{vct6} is printed in the example below, special characters can be represented using ``escape codes''. There are several of them, and here I show just four: \verb|\n| for a new line, \verb|\t| for a tab, \verb|\"| for a quotation mark within a string, and \verb|\\| for a single backslash \verb|\|. I also show the different behaviour of \Rfunction{print()} and \Rfunction{cat()}, with \Rfunction{cat()} \emph{interpreting} the escape sequences and \Rfunction{print()} displaying them as entered.

<<char-4>>=
vct6 <- "abc\ndef\tx\"yz\"\\\tm"
print(vct6)
cat(vct6)
@

The \textit{escape codes}\index{character escape codes} are expanded only in some contexts, such as when using \Rfunction{cat()} to display text output.
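
Each escape code stands for a single character, even if it is typed as two. A quick way of checking this, sketched below, is to count characters with function \Rfunction{nchar()}, which is described later in this section.

<<char-sketch-02>>=
nchar("\n")
nchar("\\")
nchar("a\tb")
@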

%\subsection{Character operations}\label{sec:calc:character:oper}

\begin{faqbox}{How to find the length of a character string?}
  While\index{character strings!number of characters} function \code{length()} returns the number of member \code{character} strings in a vector, function \Rfunction{nchar()} returns the number of characters in each string in the vector (see below for examples).
\end{faqbox}

In the example below, function \Rfunction{nchar()} returns the number of characters in each member string.

<<char-nchar-01>>=
nchar(x = "abracadabra")
nchar(x = c("abracadabra", "workaholic", ""))
@

To convert a \code{character} string into upper case or lower case we use functions \Rfunction{toupper()} and \Rfunction{tolower()}, respectively.

<<char-toupper-01>>=
toupper(x = "aBcD")
tolower(x = "aBcD")
@

Function \Rfunction{strtrim()} trims a string to a maximum number of characters or width.

<<char-trim-01>>=
strtrim(x = "abracadabra", width = 6)
strtrim(x = "abra", width = 6)
strtrim(x = c("abracadabra", "workaholic"), 6)
strtrim(x = c("abracadabra", "workaholic"), c(6, 3))
@

\begin{faqbox}{How to wrap long character strings?}
  Use \Rlang function \Rfunction{strwrap()} (see below for examples).
\end{faqbox}

Function \Rfunction{strwrap()} edits a string to a maximum number of characters or width, by splitting it into a vector of shorter character strings. It can additionally insert a character string at the start of each of these new shorter strings.

<<char-wrap-01>>=
strwrap(x = "This is a long sentence used to show how line wrapping works.", width = 20)
@
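
As a sketch of how a string can be inserted at the start of each line, the chunk below passes an argument to parameter \code{prefix}. The prefix used is an arbitrary choice; parameters \code{indent} and \code{exdent} can be used instead when only whitespace is needed.

<<char-wrap-sketch-01>>=
strwrap(x = "This is a long sentence used to show how line wrapping works.",
        width = 20,
        prefix = "> ")
@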

\begin{advplayground}
  Function \Rfunction{cat()} prints a character vector respecting the embedded special characters, such as new line (encoded as \verb|\n| in \code{character} strings), and without issuing any additional new lines. Study the code below and the output it generates, consult the documentation of the two functions, and modify the example code until you are confident that you understand in detail how these two functions work.

<<char-wrap-02, eval=eval_playground>>=
wrapped_sentence <-
  strwrap(x = "This is a very long sentence used to show how line wrapping works.",
          width = 10,
          prefix = "\n")
print(wrapped_sentence)
cat(wrapped_sentence, "\n")
@

\end{advplayground}

\begin{faqbox}{How to create a single character string from multiple shorter strings?}
  While function \code{c()} is used to concatenate \code{character} vectors into longer vectors, function \Rfunction{paste()} is used to concatenate character strings into a single longer string (see below for examples).
\end{faqbox}

Pasting together \code{character} strings has many uses, e.g., assembling informative messages to be printed, programmatically creating file names or file paths, etc. If we pass numbers, they are converted to \code{character} before pasting. The default separator is a space character, but this can be changed by passing a \code{character} string as an argument for parameter \code{sep}.

<<char-paste-01>>=
paste("n =", 3)
paste("n", 3, sep = " = ")
@

Pasting constants, as shown above,  is of little practical use. In contrast, combining values stored in different variables is a very frequent operation when working with data. A simple use example follows. Assuming vector \code{friends} contains the names of friends and vector \code{fruits} the fruits they like to eat we can paste these values together into short sentences.

<<char-paste-02>>=
friends <- c("John ", "Yan ", "Juana ", "Mary ")
fruits <- c("apples", "lichees", "oranges", "strawberries")
paste(friends, "eats ", fruits, ".", sep = "")
@

\begin{playground}
 Why was it necessary to pass \code{sep = ""} in the call to \Rfunction{paste()} in the example above? First try to predict what will happen and then remove \code{, sep = ""} from the statement above and run it to learn the answer. Try your own variations of the code until you understand the role of the separator string.
\end{playground}

We can pass an additional argument to tell that the vector resulting from the paste operation is to be collapsed into a single \code{character} string. The argument passed to collapse is used as the separator. I use here \code{cat()} so that the newline character is obeyed in the display of the single character string.

<<char-paste-03>>=
cat(paste(friends, "eats ", fruits, collapse = ".\n", sep = ""))
@
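
When no separator is wanted, base \Rlang also provides function \Rfunction{paste0()}, which behaves like \Rfunction{paste()} with \code{sep = ""}. A brief sketch with made-up file names follows.

<<char-paste-sketch-01>>=
paste("data-", 1:3, ".csv", sep = "")
paste0("data-", 1:3, ".csv")
@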

\begin{explainbox}
When the vectors are of different length, as in the last example above, the shorter one is recycled as many times as needed, which is not always what we want. To avoid the recycling, we need to first collapse the members of the long vector \code{fruits} into a vector of length one. We can achieve this by nesting two calls to \Rfunction{paste()}, and passing an argument to \code{collapse} in the inner function call.

<<char-paste-04>>=
collapsed_fruits <- paste(fruits, collapse = ", ")
paste("My friends eat", collapsed_fruits, "and other fruits.")
@

The nesting of function calls is explained in section \ref{sec:script:pipes} on page \pageref{sec:script:pipes}. However, as the two statements above would in most cases be written as nested function calls, I add this example for reference.

<<char-paste-05>>=
paste("My friends eat", paste(fruits, collapse = ", "), "and other fruits.")
@

\end{explainbox}

Function \Rfunction{strrep()} repeats and pastes \code{character} strings into a new longer \code{character} string, while function \Rfunction{rep()} repeats character strings without pasting them together, returning a longer vector with each repeat of the string as a separate member.

<<char-strrep-01>>=
rep(x = "ABC", times = 3)
strrep(x = "ABC", times = 3)
strrep(x = "ABC", times = c(2, 4))
strrep(x = c("ABC", "X"), times = 2)
strrep(x = c("ABC", "X"), times = c(2, 5))
@

\begin{faqbox}{How to trim leading and/or trailing whitespace in character strings?}
Use function \Rfunction{trimws()} (see below for examples).
\end{faqbox}

Trimming\index{character strings!whitespace trimming} leading and trailing whitespace is a frequent operation. \Rlang function \Rfunction{trimws()} implements this operation as shown below.

<<char-str-00a>>=
trimws(x = " two words ")
trimws(x = c("  eight words and a newline at the end\n", " two words "))
@

\begin{playground}
Function \Rfunction{trimws()} has additional parameters that make it possible to select which end of the string is trimmed and which characters are considered whitespace. Use \code{help(trimws)} to access the help and study this documentation. Modify the example above so that only trailing whitespace is removed, and so that the newline character \verb!\n! is not considered whitespace, and thus not trimmed away.
\end{playground}

Within\index{character strings!position-based operations} \Rclass{character} strings, substrings can be extracted and replaced \emph{by position} using \Rfunction{substring()} or \Rfunction{substr()}.

For extraction, we can pass to \code{x} a constant as shown below or a variable.

<<char-str-01>>=
substr(x = "abracadabra", start = 5, stop = 9)
substr(x = c("abracadabra", "workaholic"), start = 5, stop = 11)
@
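
Function \Rfunction{substring()} differs in the names of its parameters, \code{text}, \code{first}, and \code{last}, and in that \code{last} has a default value, so that a substring extending to the end of the string can be extracted by passing only an argument to \code{first}. A minimal sketch follows.

<<char-str-sketch-01>>=
substring(text = "abracadabra", first = 5)
substring(text = "abracadabra", first = 5, last = 9)
@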

Replacement is done \emph{in place}, by having function \code{substr()} on the left-hand side (lhs) of the assignment operator \code{<-}. Thus, the argument passed to parameter \code{x} of \code{substr()} must in this case be a variable rather than a constant. This is a substitution character by character, not insertion, so the number of characters in the string passed as the argument to \code{x} remains unchanged, i.e., the value returned by \code{nchar()} does not change.

<<char-str-02>>=
vct7 <- c("abracadabra", "workaholic")
substr(x = vct7, start = 5, stop = 9) <- "xxx"
vct7
@

If we pass values to both \code{start} and \code{stop} then only part of the value on the \emph{rhs} of the assignment operator \code{<-} may be used.

<<char-str-03>>=
vct8 <- c("abracadabra", "workaholic")
substr(x = vct8, start = 5, stop = 6) <- "xxx"
vct8
@

\begin{playground}
Frequently, a very effective way of learning how a function behaves is to experiment. In the example below, we set \code{start} and \code{stop} delimiting more characters than those in \code{"xxx"}. In this case, is \code{"xxx"} extended,
or \code{start} or \code{stop} ignored? Run this ``toy example'' to find out the answer.

<<char-str-04, eval=eval_playground>>=
VCT1 <- c("abracadabra", "workaholic")
substr(x = VCT1, start = 5, stop = 11) <- "xxx"
VCT1
remove(VCT1) # clean up
@

\end{playground}

As\index{character strings!partial matching and substitution} in \Rlang each character value is a string comprised of zero to many characters, in addition to comparisons based on whole strings or values, partial matches among them are of interest.

To substitute part of a \code{character} string \emph{by matching a pattern}, we can use functions \Rfunction{sub()} or \Rfunction{gsub()}. The first example uses three \code{character} constants, but values stored in variables can also be passed as arguments.

<<char-regex-01>>=
sub(pattern = "ab", replacement = "AB", x = "about")
@

The difference between \Rfunction{sub()} (substitution) and \Rfunction{gsub()} (global substitution) is that the first replaces only the first match found while the second replaces all matches.

<<char-regex-02>>=
sub(pattern = "ab", replacement = "x", x = "abracadabra")
gsub(pattern = "ab", replacement = "x", x = "abracadabra")
@

\begin{playground}
Functions \Rfunction{sub()} and \Rfunction{gsub()} accept character vectors as the argument for parameter \code{x}. Run the two statements below and study how the values returned differ.

<<char-regex-03, eval=eval_playground>>=
sub(pattern = "ab", replacement = "x", x = c("abra", "cadabra"))
gsub(pattern = "ab", replacement = "x", x = c("abra", "cadabra"))
@

\end{playground}

Function \Rfunction{grep()} returns indices to the values in a vector matching a pattern, or alternatively, the matching values themselves.

<<char-regex-04>>=
grep(pattern = "C", x = c("R", "C++", "C", "Perl", "Pascal"))
grep(pattern = "C", x = c("R", "C++", "C", "Perl", "Pascal"), value = TRUE)
grep(pattern = "C", x = c("R", "C++", "C", "Perl", "Pascal"), ignore.case = TRUE)
@

Function \Rfunction{grepl()} is a variation of \Rfunction{grep()} that returns a vector of \code{logical} values instead of numeric indices to the matching values in \code{x}.

<<char-regex-05>>=
grepl(pattern = "C", x = c("R", "C++", "C", "Perl", "Pascal"))
grepl(pattern = "C", x = c("R", "C++", "C", "Perl", "Pascal"), ignore.case = TRUE)
@

\index{regular expressions|(}%
In\label{sec:calc:regex} the examples above, the strings passed as arguments to \code{pattern} matched their targets exactly. In \Rlang and other languages, \emph{regular expressions} are used to concisely describe more elaborate and conditional patterns. Regular expressions themselves are encoded as character strings, where some characters and character sequences have special meaning. This means that when a pattern should be interpreted literally rather than specially, \code{fixed = TRUE} should be passed in the call. This, in addition, ensures faster computation. In the examples above, the patterns used contained no characters with special meaning, thus, the returned value is not affected by passing \code{fixed = TRUE} as done here.

<<char-regex-06>>=
sub(pattern = "ab", replacement = "AB", x = "about", fixed = TRUE)
@
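
The difference becomes visible when the pattern contains a character with special meaning. In this sketch, which uses a made-up string, the pattern is a dot: in a regular expression a dot matches any character, as described below, while with \code{fixed = TRUE} it matches only a literal dot.

<<char-regex-sketch-01>>=
sub(pattern = ".", replacement = "!", x = "a.b")
sub(pattern = ".", replacement = "!", x = "a.b", fixed = TRUE)
@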

\begin{warningbox}
Regular expressions are used in Unix and Linux shell scripts and programs, and are part of \perllang, \Cpplang and other languages in addition to \Rlang. This means that variations exist on the same idea, with \Rlang supporting two variations of the syntax. A description of \Rlang regular expressions can be accessed with \code{help(regex)}. We here describe \Rlang's default syntax.
\end{warningbox}

Regular expressions are concise, terse, and extremely powerful. They are a language in themselves. However, the effort needed to learn their use more than pays back. I will show examples of the use, rather than systematically describe them. I will use \Rfunction{gsub()} for these examples, but several other \Rlang functions including \Rfunction{grep()} and \Rfunction{grepl()} accept regular expressions as patterns.

In a regular expression, \code{|} separates alternative matching patterns.

<<char-regex-07>>=
gsub(pattern = "ab|t", replacement = "123", x = "about")
@

Within a regular expression, we can group characters within \code{[ ]} as alternatives, e.g., \code{[0123456789]} or \code{[0-9]} matches any digit.

<<char-regex-08>>=
gsub(pattern = "a[0123456789]",
     replacement = "ab",
     x = c("a1out", "a9out", "a3out"))
@

Character \code{\textasciicircum} indicates that the match must be at the ``head'' of the string, and \code{\$} that the match should be at its ``tail''.

<<char-regex-09>>=
gsub(pattern = "^a[0123456789]",
     replacement = "ab",
     x = c("a1out", "a9out", " a3out"))
@

The replacement can be an empty string.

<<char-regex-10>>=
gsub(pattern = "out$",
     replacement = "",
     x = c("about", "a9out", "a3outx"))
@

A dot (\code{.}) matches any character. In this example, we replace the last character with \code{""}.

<<char-regex-11>>=
gsub(pattern = ".$",
     replacement = "",
     x = c("about", "a9out", "a3outx"))
@

\begin{playground}
  How would you modify the last code example above to edit \code{c("about", "axout", "a3outx")} into \code{c("about", "axout", "a3out")}? Think of different ways of doing this using regular expressions.
\end{playground}

The number of matching characters can be indicated with \code{+} (match 1 or more times), \code{?} (match 0 or 1 times), \code{*} (match 0 or more times) or even numerically. Matching is in most cases ``greedy''.

<<char-regex-12>>=
gsub(pattern = "^.[0-9][a-z]*$",
     replacement = "gone",
     x = c("about", "a9out", "a3outx"))
@
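
As a brief sketch of the numeric form and of greediness (the strings used are arbitrary illustrations): a count within braces, such as \verb|{2}|, requests an exact number of repeats, and appending \code{?} to a quantifier requests a minimal instead of a greedy match.

<<char-regex-quantifier-sketch>>=
# {2} matches exactly two consecutive digits
gsub(pattern = "[0-9]{2}",
     replacement = "#",
     x = c("a1", "a12", "a123"))
# greedy: .+ extends up to the last ">" in the string
gsub(pattern = "<.+>", replacement = "", x = "<b>bold</b>")
# minimal: .+? stops at the first ">" it can
gsub(pattern = "<.+?>", replacement = "", x = "<b>bold</b>")
@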

Several named classes of characters are predefined, for example \code{[:lower:]} for lower case alphabetic characters according to the current locale (see page \pageref{box:calc:locale}). In the regular expression in the example below, \code{[:lower:]} replaces only \code{a-z}, thus we need to keep the outer square brackets. While \code{a-z} includes only the unaccented letters, \code{[:lower:]} does include additional characters such as \texttt{ä}, \texttt{ö}, or \texttt{é} if they are in use in the current locale. In the case of \code{[:digit:]} and \code{0-9}, they are equivalent.

<<char-regex-13a>>=
gsub(pattern = "^.([[:digit:]])[[:lower:]]*$",
     replacement = "gone with \\1",
     x = c("about", "a9out", "a3outx"))
@

With parentheses we can isolate part of the matched string and reuse it in the replacement with a numeric back-reference. Up to a maximum of nine pairs of parentheses can be used.

<<char-regex-13>>=
gsub(pattern = "^.([0-9])[a-z]*$",
     replacement = "gone with \\1",
     x = c("about", "a9out", "a3outx"))
@

\begin{playground}
Run the two statements below, study the returned values by creating variations of the patterns and explain why the returned values differ.

<<char-regex-14, eval=eval_playground>>=
gsub(pattern = "^.+$",
     replacement = "",
     x = c("about", "a9out", "a3outx"))
gsub(pattern = "^.?$",
     replacement = "",
     x = c("about", "a9out", "a3outx"))
@
\end{playground}

Splitting\index{character strings!splitting of} of character strings based on pattern matching is a frequently used operation, e.g., treatment labels containing information about two different treatment factors need to be split into their components before data analysis. Function \Rfunction{strsplit()} has an interface consistent with \code{grep()}. In the examples we will split strings containing date and time of day information in different ways.

<<char-regex-20>>=
strsplit(x = "2023-07-29 10:30", split = " ")
@

Using a simple regular expression we can extract individual strings representing the numbers.

<<char-regex-21>>=
strsplit(x = "2023-07-29 10:30", split = " |-|:")
@

The argument to \code{split} is by default interpreted as a regular expression, but as discussed above we can pass \code{fixed = TRUE} to prevent this.
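
The difference matters when the separator is a character with a special meaning in regular expressions. In the sketch below, which uses an arbitrary date string with dots as separators, the dot is first interpreted as ``any character'', so that every character is treated as a separator, and then as a literal dot.

<<char-regex-split-fixed-sketch>>=
# "." as a regular expression matches every character
strsplit(x = "2023.07.29", split = ".")
# "." as a fixed string matches only the dots
strsplit(x = "2023.07.29", split = ".", fixed = TRUE)
@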

\begin{warningbox}
One needs to be aware that the part of the string matched by the regular expression is not included in the returned vectors. If the regular expression matches more than what we consider a separator, the returned values may be surprising.

<<char-regex-23>>=
strsplit(x = "2023-07-29", split = "-[0-9]+$")
@

\end{warningbox}

\begin{explainbox}
When the argument passed to \code{x} is a vector with multiple member strings, the returned value is a list of \code{character} vectors. This list contains as many character vectors as there are members in the vector passed as argument to \code{x}, each vector being the result of splitting one character string in the input. (Lists are described in section \ref{sec:calc:lists} on page \pageref{sec:calc:lists}.)

<<char-regex-22>>=
strsplit(x = c("2023-07-29 10:30", "2023-07-29 19:17"), split = " ")
@

\end{explainbox}
\index{regular expressions|)}
\index{classes and modes!character|)}

\begin{warningbox}\label{box:calc:locale}
  The ASCII character set is the oldest and simplest in use. It contains only 128 characters including non-printable characters. These characters support the English language. Several different extended versions with 256 characters provided support for other languages, mostly by adding accented letters and some symbols. The 128 ASCII characters were for a long time the only ones consistently available across computers set up for different languages and countries (or \emph{locales}). Recently the use of much larger character sets like UTF-8 has become common. Since \Rlang version 4.2.0 support for UTF-8 is available under Windows 10. This makes possible the processing of text data for many more languages than in the past. Even though it is now possible to use non-ASCII characters as part of object names, it is still safer to use only ASCII characters, as this support is recent.

  The extended character sets include additional characters that are distinct but may produce glyphs that look very similar to those in the ASCII set. One example is the em-dash (---), en-dash (--), minus sign ($-$) and regular dash (-), which are all different characters, with only the last one recognised by \Rlang as the minus operator. For those copying and pasting text from a word-processor into \Rpgrm or \RStudio, a frequent difficulty is that even if one types in an ASCII quote character (\verb|"|), the opening and closing quotes in many languages are automatically replaced with non-ASCII ones (`` and ''), which \Rlang does not accept as character string delimiters. The best solution is to use a plain text editor instead of a word processor when writing scripts or editing text files containing data to be read as code statements or numerical data.

  A locale definition determines not only the language, and character set, but also date, time, and currency formats.
\end{warningbox}

<<include=FALSE>>=
rm(list = setdiff(ls(pattern="*"), to.keep))
@

\section{Logical Values and Boolean Algebra}\label{sec:calc:boolean}
\index{classes and modes!logical|(}\index{logical operators}\index{logical values and their algebra|(}\index{Boolean arithmetic}
What in mathematics are usually called Boolean values are called \Rclass{logical} values in \Rlang. They can have only two values, \code{TRUE} and \code{FALSE}, in addition to \code{NA} (not available). Logical values \code{TRUE} and \code{FALSE} should not be confused with text strings; they are names for the two conditions that can be stored. Logical values are always vectors, like all other atomic types in \Rlang (by \emph{atomic} we mean that each value is not composed of ``parts'').

Logical values are rarely used to store data from experiments or surveys. They are used mostly to keep track of binary conditions, like results from comparisons in a script, and to operate on them. The most frequent uses of \code{logical} values do not involve their storage in user-created variables. Most comparisons or tests return a \code{logical} value and Boolean algebra makes it possible to combine the results from multiple tests or conditions into a single combined outcome or binary decision, i.e., \code{TRUE} or \code{FALSE}, Yes or No. (See section \ref{sec:calc:comparison} on page \pageref{sec:calc:comparison} for examples.)

In mathematics, Boolean algebra provides the rules of the logic used to combine multiple logical values. Boolean operators like AND and OR take as operands logical values and return a logical value as a result. In \Rlang there are two ``families'' of Boolean operators, vectorised and not vectorised. Vectorised operators accept logical vectors of any length as operands, while non-vectorised ones accept only logical vectors of length one as operands. In the chunk below we use non-vectorised operators with \Rclass{logical} vectors of length one, the constants \code{TRUE} and \code{FALSE}, as operands.

<<logical-1>>=
vct1 <- TRUE
mode(vct1)
vct1
!TRUE # negation
TRUE && FALSE # logical AND
TRUE || FALSE # logical OR
xor(TRUE, FALSE) # exclusive OR
@

%%%% index operators using verb!!
The availability of two kinds of logical operators can be troublesome for those new to \Rlang. Pairs of ``equivalent'' logical operators behave differently, yet use similar syntax and similar symbols! The vectorised operators have single-character names, \Roperator{\&} and \Roperator{\textbar} (like the vectorised arithmetic operators, such as \code{+}), while the non-vectorised ones have double-character names, \Roperator{\&\&} and \Roperator{\textbar\textbar}. There is only one version of the negation operator, \Roperator{!}, and it is vectorised. In recent versions of \Rlang, an error is triggered when a non-vectorised operator is used with a vector with length $> 1$, which helps prevent mistakes. In some situations, vectorised \code{logical} operators can replace non-vectorised ones, but it is important to use the ones that match the intention of the code, as this enables relevant checks for mistakes. Once the distinction is learnt, using the most appropriate operators also contributes to making code easier to read.

<<logical-2>>=
c(TRUE, FALSE) & c(TRUE,TRUE) # vectorised AND
c(TRUE, FALSE) | c(TRUE,TRUE) # vectorised OR
@

Functions \Rfunction{any()} and \Rfunction{all()} take zero or more logical vectors as their arguments, and return a single logical value ``summarising'' the logical values in the vectors. Function \Rfunction{all()} returns \code{TRUE} only if all values in the vectors passed as arguments are \code{TRUE}, and \Rfunction{any()} returns \code{TRUE} if at least one of the values in the vectors is \code{TRUE}.

<<>>=
vct2 <- c(TRUE, FALSE, FALSE)
any(vct2)
all(vct2)
any(c(TRUE, FALSE) & c(TRUE,TRUE))
all(c(TRUE, FALSE) & c(TRUE,TRUE))
any(c(TRUE, FALSE) | c(TRUE,TRUE))
all(c(TRUE, FALSE) | c(TRUE,TRUE))
@

Another important thing to know about logical operators is that they ``short-cut'' evaluation. If the result is known from the first part of the statement, the rest of the statement is not evaluated. Try to understand what happens when you enter the following commands. Short-cut evaluation is useful, as the first condition can be used as a guard protecting a later condition from being evaluated when it would trigger an error.\label{par:calc:shortcut:eval}

<<logical-3>>=
TRUE || NA
FALSE || NA
TRUE && NA
FALSE && NA
TRUE && FALSE && NA
TRUE && TRUE && NA
@
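
The use of a first condition as a guard can be sketched as follows. The variable name \code{val1} and its values are only illustrative. Evaluating \code{sqrt(val1)} on its own would trigger an error while \code{val1} is a character string, but because the first operand is \code{FALSE}, the second operand is never evaluated.

<<logical-guard-sketch>>=
val1 <- "16" # a character string, not a number
# the first test is FALSE, so sqrt(val1) is not evaluated
is.numeric(val1) && sqrt(val1) > 2
val1 <- 16 # a number
# the first test is TRUE, so the second test decides the result
is.numeric(val1) && sqrt(val1) > 2
@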

\begin{playground}
  Investigate how swapping the order of the operands in the code chunk above affects the values returned, e.g., the first statement becomes \code{NA || TRUE}.
\end{playground}

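The vectorised operators always evaluate both operands, but the same logic applies to the result obtained at each index position: when the value of one operand is enough to determine the result at a position, an \code{NA} at that position in the other operand does not propagate to the result.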

<<logical-4>>=
c(TRUE, FALSE) & c(TRUE,TRUE) & NA
c(TRUE, FALSE) & c(TRUE,TRUE) & c(NA, NA)
c(TRUE, FALSE) | c(TRUE,TRUE) | c(NA, NA)
@

\begin{playground}
Based on the description of ``recycling'' presented on page \pageref{par:recycling:numeric} for \code{numeric} operators, explore how ``recycling'' works with vectorised logical operators. Create logical vectors of different lengths (including length one) and \emph{play} by writing several code statements with operations on them. To get you started, one example is given below. Execute this example, and then create and run your own, making sure that you understand why the values returned are what they are. Sometimes, you will need to devise several examples or test cases to tease out of \Rlang an understanding of how a certain feature of the language works, so do not give up early, and make use of your imagination!

<<logical-PG01,eval=eval_playground>>=
c(TRUE, FALSE, TRUE, NA) & FALSE
c(TRUE, FALSE, TRUE, NA) | c(TRUE, FALSE)
@

\end{playground}

\begin{faqbox}{How to test if a vector contains no values other than \code{NA} (or \code{NaN}) values?}
A call to \Rfunction{is.na()} returns a \code{logical} vector that we can pass to \Rfunction{all()}. We can save the intermediate vector as \code{tmp} and pass it as argument to \Rfunction{all()}, or alternatively nest the function calls. The name \code{tmp}, for \emph{temporary}, is frequently used for variables whose value is retrieved only once.

<<faq-vectors-01>>=
vct2 <- rep(NA, 5) # toy data
tmp <- is.na(vct2) # tmp for temporary
all(tmp)
@

<<faq-vectors-01a>>=
all(is.na(vct2)) # nested call
@

\end{faqbox}

\begin{faqbox}{How to test if a vector contains one or more \code{NA} (or \code{NaN}) values?}
See previous question. We only need to replace \code{all()} by \Rfunction{any()} to obtain the answer.

<<faq-vectors-02>>=
vct2 <- rep(NA, 5)
any(is.na(vct2))
@

\end{faqbox}

\index{logical values and their algebra|)}
\index{classes and modes!logical|)}

<<include=FALSE>>=
rm(list = setdiff(ls(pattern="*"), to.keep))
@

\section{Comparison Operators and Operations}\label{sec:calc:comparison}
\index{comparison operators|(}\index{operators!comparison|(}\qRoperator{>}\qRoperator{<}\qRoperator{>=}\qRoperator{<=}\qRoperator{==}\qRoperator{!=}
Comparison operators return vectors of \code{logical} values (see section \ref{sec:calc:boolean} on page \pageref{sec:calc:boolean}), with values \code{TRUE} or \code{FALSE} depending on the outcome.

Equality (\code{==}) and inequality (\code{!=}) operators are defined not only for \code{numeric} values but also for \code{character} values, most other atomic values, and many other kinds of objects. Be aware that operator \code{=} is an infrequently used synonym of the assignment operator \code{<-} rather than a comparison operator!

<<comparison-0>>=
# be aware that we use two = symbols
"abc" == "ab"
"ABC" == "abc"
"abc" != "ab"
"ABC" != "abc"
@

In the case of \code{numeric} values additional comparisons are meaningful and additional operators are defined.

<<comparison-1>>=
1.2 > 1.0
1.2 >= 1.0
1.2 == 1.0
1.2 != 1.0
1.2 <= 1.0
1.2 < 1.0
@

These operators can be used on vectors of any length, returning as a result a logical vector as long as the longest operand. In other words, they behave in the same way as the arithmetic operators described on page \pageref{par:vectorised:numeric}: their arguments are recycled when needed. Hint: if you do not know what value is stored in numeric vector \code{vct3}, use \code{print(vct3)} after the first code statement below to see its contents.

<<comparison-2>>=
vct3 <- 1:10
vct3 > 5
vct3 < 5
vct3 == 5
all(vct3 > 5)
any(vct3 > 5)
vct4 <- vct3 > 5
vct4
any(vct4)
all(vct4)
@

Individual comparisons can be useful, but their full role in data analysis and programming is realised when we combine multiple tests using the operations of the Boolean algebra described in section \ref{sec:calc:boolean} on page \pageref{sec:calc:boolean}.

For example, to test if members of a numeric vector are within a range, in our example, $-1$ to $+1$, we can combine the results from two comparisons using the vectorised logical \emph{AND} operator \Roperator{\&}. As comparison operators have higher precedence than \code{\&}, no parentheses are needed.

<<logical-2a>>=
vct5 <- -2:3
vct5 >= -1 & vct5 <= 1
@

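If we want to find those values outside this same range, we can negate the test, using parentheses so that the negation operator applies to the combined result rather than only to the first comparison.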

<<logical-2b>>=
!(vct5 >= -1 & vct5 <= 1)
@

Or, we can combine another two comparisons using the vectorised logical \emph{OR} operator \Roperator{\textbar}.

<<logical-2c>>=
vct5 < -1 | vct5 > 1
@

In some cases, an additional advantage of storing the outcome of tests as \Rclass{logical} values is that they require less space in memory than \code{numeric} values.

\begin{playground}
Use the statement below as a starting point in exploring how precedence works when logical and arithmetic operators are part of the same statement. \emph{Play} with the example by adding parentheses at different positions and based on the returned values, work out the default order of operator precedence used for the evaluation of the example given below.

<<comparison-PG00, eval=eval_playground>>=
vct6 <- 1:10
vct6 > 3 | vct6 + 2 < 3
@
\end{playground}

It is important to be aware of the consequences of ``short-cut evaluation'' (described on page \pageref{par:calc:shortcut:eval}).
The behaviour of many of base-\Rlang's functions when \code{NA}s are present in their input arguments can be modified. If \code{TRUE} is passed as an argument to parameter \code{na.rm}, \code{NA} values are \emph{removed} from the input \textbf{before} the function is applied.

<<comparison-4>>=
vct7 <- c(1:10, NA)
all(vct7 < 20)
any(vct7 > 20)
all(vct7 < 20, na.rm=TRUE)
any(vct7 > 20, na.rm=TRUE)
@

\begin{warningbox}
\index{comparison of floating point numbers|(}\index{inequality and equality tests|(}\index{loss of numeric precision}In many situations, when writing programs one should avoid testing for equality of floating point numbers (`floats'). This is because of how numbers are stored in computers (see the box on page \pageref{box:floats} for an in-depth explanation). Here I show how to gracefully handle rounding errors when using comparison operators. As rounding errors may accumulate, in practice \code{.Machine\$double.eps} is frequently too small a value to safely use in tests for ``zero''. Whenever possible according to the logic of the calculations, it is best to test for inequalities, for example using \verb|x <= 1.0| instead of \verb|x == 1.0|. If this is not possible, then equality tests should be done by replacing tests like \verb|x == 1.0| with \verb|abs(x - 1.0) < k|, where \verb|k| is a number larger than \code{eps}. Function \Rfunction{abs()} returns the absolute value, in simpler words, makes all values positive or zero, by changing the sign of negative values, or in mathematical notation $|x| = |-x|$.

<<machine-eps-06>>=
sin(pi) == 0 # angle in radians, not degrees!
sin(2 * pi) == 0
abs(sin(pi)) < 1e-15
abs(sin(2 * pi)) < 1e-15
sin(pi)
sin(2 * pi)
@
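
Base \Rlang also provides function \Rfunction{all.equal()}, which compares numbers within a tolerance. When the values differ it returns a character string describing the difference instead of \code{FALSE}, so in tests it is usually wrapped in \Rfunction{isTRUE()}. A brief sketch, using arbitrary values and the default tolerance:

<<machine-eps-all-equal-sketch>>=
0.1 + 0.2 == 0.3 # exact comparison, affected by rounding
isTRUE(all.equal(0.1 + 0.2, 0.3)) # comparison within a tolerance
isTRUE(all.equal(sin(pi), 0))
@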

\index{comparison of floating point numbers|)}\index{inequality and equality tests|)}
\end{warningbox}

\index{comparison operators|)}\index{operators!comparison|)}

<<include=FALSE>>=
rm(list = setdiff(ls(pattern="*"), to.keep))
@

\section{Sets and Set Operations}
\index{sets|(}\index{algebra of sets}\index{operators!set|(}

The \Rlang language supports set operations on vectors. They can be useful in many different contexts when manipulating and comparing vectors of values. In Bioinformatics, it is usual, for example, to make use of character vectors of gene tags. The algebra of sets is implemented with functions \code{union()}, \code{intersect()}, \code{setdiff()}, \code{setequal()}, \code{is.element()} and operator \code{\%in\%} (Figure \ref{fig:set:opers}). The first three operations return a vector of the same mode as their inputs, and the last three a \code{logical} vector. The action of the first three operations is most easily illustrated with Venn diagrams, where the returned value (or result of the operation) is depicted in darker grey.\vspace{1ex}

\begin{figure}
\begin{footnotesize}
\hfill%
\begin{tikzpicture}[thick,
    set/.style = {circle,
        minimum size = 3cm,
        fill=black!15}]

% Set A
\node[set,label={135:$A$}] (A) at (0,0) {};

% Set B
\node[set,label={45:$B$}] (B) at (1.8,0) {};

% Intersection
\begin{scope}
    \clip (0,0) circle(1.5cm);
    \clip (1.8,0) circle(1.5cm);
    \fill[black!15](0,0) circle(1.5cm);
\end{scope}

% Circles outline
\draw (0,0) circle(1.5cm);
\draw (1.8,0) circle(1.5cm);

% Set intersection label
\node at (0.9,0) {$A\cup B$};

\end{tikzpicture}%
\hfill%
\begin{tikzpicture}[thick,
    set/.style = {circle,
        minimum size = 3cm,
        fill = black!5}]

% Set A
\node[set,label={135:$A$}] (A) at (0,0) {};

% Set B
\node[set,label={45:$B$}] (B) at (1.8,0) {};

% Intersection
\begin{scope}
    \clip (0,0) circle(1.5cm);
    \clip (1.8,0) circle(1.5cm);
    \fill[black!15](0,0) circle(1.5cm);
\end{scope}
% Circles outline
\draw (0,0) circle(1.5cm);
\draw (1.8,0) circle(1.5cm);

% Set intersection label
\node at (0.9,0) {$A\cap B$};

\end{tikzpicture}%
\hfill%
\vspace{2ex}

\hfill%
\begin{tikzpicture}[thick,
    set/.style = {circle,
        minimum size = 3cm}]

% Set A
\node[set,label={135:$A$},fill=black!15] (A) at (0,0) {};

% Set B
\node[set,label={45:$B$},fill=black!5] (B) at (1.8,0) {};

% Circles outline
\draw (0,0) circle(1.5cm);
\draw (1.8,0) circle(1.5cm);

% Set intersection label
\node at (-0.4,0) {$A - B$};

\end{tikzpicture}%
\hfill%
\begin{tikzpicture}[thick,
    set/.style = {circle,
        minimum size = 3cm}]

% Set B
\node[set,label={45:$B$},fill=black!15] (B) at (1.8,0) {};

% Set A
\node[set,label={135:$A$},fill=black!5] (A) at (0,0) {};

% Circles outline
\draw (0,0) circle(1.5cm);
\draw (1.8,0) circle(1.5cm);

% Set intersection label
\node at (2.2,0) {$B - A$};

\end{tikzpicture}%
\hfill%
\end{footnotesize}
\vspace{1ex}
  \caption[Boolean algebra]{Boolean algebra. Venn diagrams for algebra of sets operations: \emph{union}, $\cup$, \code{union()}; \emph{intersection}, $\cap$, \code{intersect()}; \emph{difference (asymmetrical)}, $-$, \code{setdiff()}; \emph{equality test} \code{setequal()}; \emph{membership}, \code{is.element()} and operator \code{\%in\%}}\label{fig:set:opers}
\end{figure}

A mundane example, grocery shopping, demonstrates set operations applied to vectors.

<<sets-00>>=
fruits <- c("apple", "pear", "orange", "lemon", "tangerine")
bakery <- c("bread", "buns", "cake", "cookies")
dairy <- c("milk", "butter", "cheese")
shopping <- c("bread", "butter", "apple", "cheese", "orange")
intersect(fruits, shopping)
intersect(bakery, shopping)
intersect(dairy, shopping)
"lemon" %in% dairy
"lemon" %in% fruits
dairy %in% shopping
union(bakery, dairy)
setdiff(union(bakery, dairy), shopping) # nested call
@

\begin{warningbox}
Sets describe membership as a binary property, thus when vectors are interpreted as sets, duplicate members are redundant. Duplicate members, although accepted as input, are always removed from the returned values.

<<sets-00a>>=
union(c("a", "a", "b"), c("b", "a", "b")) # set operation
@

<<sets-0ba>>=
setequal(c("a", "a", "b"), c("b", "a", "b")) # sets compared
all.equal(c("a", "a", "b"), c("b", "a", "b")) # vectors compared
identical(c("a", "a", "b"), c("b", "a", "b")) # vectors compared
@
\end{warningbox}

We construct and save a character vector to use in the next examples.

<<sets-01>>=
vct1 <- c("a", "b", "c", "b")
@

To test if a given value belongs to a set, we use operator \Roperator{\%in\%} or its function equivalent \Rfunction{is.element()}. In the algebra of sets notation, this is written $a \in A$, where $A$ is a set and $a$ a member. The last statement shows that the \code{\%in\%} operator is vectorised on its left-hand-side (lhs) operand, returning a logical vector.

<<sets-02>>=
is.element("a", vct1)
"a" %in% vct1
c("a", "a", "z") %in% vct1
@

\begin{explainbox}
Keep in mind that inclusion, implemented in operator \verb|%in%|, is an asymmetrical (not reflective) operation between a vector and a set. The right-hand-side (rhs) argument is interpreted as a set, while the left-hand-side (lhs) argument is interpreted as a vector of values to test for membership in the set. In other words, any duplicate member in the lhs operand is retained and tested while the rhs operand is interpreted as a set of unique values. The returned logical vector has the same length as the lhs operand.

<<sets-02a>>=
vct1 %in% "a"
@
\end{explainbox}

The negation of inclusion is $a \not\in A$, and coded in \Rlang by applying the negation operator \Roperator{!} to the result of the test done with \Roperator{\%in\%} or function \Rfunction{is.element()}.

<<sets-02b>>=
!is.element("a", vct1)
!"a" %in% vct1
!c("a", "a", "z") %in% vct1
@

Although inclusion is a set operation, it is also very useful for the simplification of \code{if () \ldots\ else} statements by replacing multiple tests for alternative constant values of the same \code{mode} chained by multiple \Roperator{\textbar} operators. A useful property of \Roperator{\%in\%} and \Rfunction{is.element()} is that they never return \code{NA}.
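
The absence of \code{NA} in the returned values can be seen in a brief sketch that contrasts a comparison with a membership test, using arbitrary constant values.

<<sets-in-na-sketch>>=
NA == "a" # a comparison with NA returns NA
NA %in% "a" # a membership test returns FALSE
NA %in% c("a", NA) # NA can itself be a member of a set
@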

\begin{explainbox}
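Operator \Roperator{\%in\%} is a simpler interface to function \Rfunction{match()}: while \Rfunction{match()} returns the positions of the first matches, \Roperator{\%in\%} returns only whether a match exists. The additional parameters of \Rfunction{match()} provide additional flexibility.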

In some cases, such as when accepting partial character strings as input, the aim is not an exact match, but a partial match to target character strings. In this case, either \Rfunction{charmatch()} or \Rfunction{pmatch()} is the correct tool to use depending on the desired handling of partial, ambiguous and exact matches. Use \code{help()} to find the details if you need to use one of them.
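
A brief sketch of these functions follows. The vector \code{set.vct} and the strings searched for are only illustrative.

<<sets-match-sketch>>=
set.vct <- c("a", "b", "c", "b")
# positions of the first matches, NA when there is no match
match(c("b", "z"), set.vct)
# the value returned for no match can be changed
match(c("b", "z"), set.vct, nomatch = 0)
# only whether a match exists
c("b", "z") %in% set.vct
# partial matching: position of the unique partial match
pmatch("ora", c("apple", "orange", "lemon"))
@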

\end{explainbox}

\begin{playground}
Use operator \Roperator{\%in\%} to write more concisely the following comparisons. Hint: see section \ref{sec:calc:boolean} on page \pageref{sec:calc:boolean} for the difference between \code{|} and \code{||} operators.

<<sets-PG00, eval=FALSE>>=
vct2 <- c("a", "a", "z")
vct2 == "a" | vct2 == "b" | vct2 == "c" | vct2 == "d"
@

Convert the \code{logical} vectors of length 3 into a vector of length one. Hint: see help for functions \code{all()} and \code{any()}.
\end{playground}

With \Rfunction{unique()} we convert a vector of possibly repeated values into a set of unique values. In the algebra of sets, an object either belongs or does not belong to a set. Consequently, in a set, multiple copies of the same object or value are meaningless.

<<sets-03>>=
unique(vct1)
@

Function \Rfunction{unique()} is frequently useful, for example when we want to determine the number of distinct values in a vector.

<<sets-03a>>=
length(unique(vct1))
@

\begin{playground}
  Do the values returned by these two statements differ?

<<sets-03b, eval=eval_playground>>=
c("a", "a", "z") %in% vct1
c("a", "a", "z") %in% unique(vct1)
@

\end{playground}

\begin{explainbox}
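Function \Rfunction{duplicated()} is the counterpart of \Rfunction{unique()}, returning a logical vector, indicating which values in a vector are duplicates of values already present at positions with a lower index. Function \Rfunction{anyDuplicated()} returns the index of the first duplicate found, or zero if there are no duplicates.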

<<sets-expl-01>>=
duplicated(vct1)
anyDuplicated(vct1)
@

The \Rlang language includes many functions that simplify tasks related to data analysis. Some are well known like \code{unique()}, but others may need to be searched for in the documentation.
\end{explainbox}

\begin{playground}
What do you expect to be the difference between the values returned by the three statements in the code chunk below? Before running them, write down your expectations about the value each one will return. Only then run the code. Independently of whether your predictions were correct or not, write down an explanation of what each statement's operation is.

<<sets-PG01, eval=eval_playground>>=
union(c("a", "a", "z"), vct1)
c(c("a", "a", "z"), vct1)
c("a", "a", "z", vct1)
@

Are set union and concatenation of vectors equivalent operations? Why or why not?

\end{playground}

\begin{explainbox}
All set algebra examples above use character vectors and character constants. This is just the most frequent use case. Set operations are valid on vectors of any atomic class, including \code{integer}, and computed values can be part of statements. In the second and third statements in the next chunk, we need to use additional parentheses to alter the default order of precedence between arithmetic and set operators.

<<sets-EB01>>=
9 %in% 2:4
9 %in% ((2:4) * (2:4))
c(1, 16) %in% ((2:4) * (2:4))
@

\emph{Empty sets} are an important component of the algebra of sets; in \Rlang they are represented as vectors of zero length. These vectors do belong to a class such as \Rclass{numeric} or \Rclass{character} and must be compatible with other operands in an expression.

<<sets-EB02a>>=
c("ab", "xy") %in% character()
character() %in% c("a", "b", "c")
union("ab", character())
@

\end{explainbox}

\begin{warningbox}
  Although set operators are defined for \Rclass{numeric} vectors, rounding errors in `floats' can lead to unexpected results (see the box on page \pageref{box:floats}).

<<sets-warn-flt1>>=
c(cos(pi), sin(pi)) %in% c(0, -1)
c(cos(pi), sin(pi))
@

\end{warningbox}

\begin{advplayground}
In the algebra of sets notation $A \subseteq B$, where $A$ and $B$ are sets, indicates that $A$ is a subset of or equal to $B$. For a true (proper) subset, the notation is $A \subset B$. The operators with the reverse direction are $\supseteq$ and $\supset$. Implement these four operations in four \Rlang statements, and test them on sets (represented by \Rlang vectors) with different ``overlap'' among set members.
\end{advplayground}
\index{operators!set|)}
\index{sets|)}

<<include=FALSE>>=
rm(list = setdiff(ls(pattern="*"), to.keep))
@

\section{The Mode and Class of Objects}\label{sec:rlang:mode}
\index{objects!mode}
Classes are abstractions; they determine the ``meaning'' and behaviour of objects belonging to them. New classes can be defined in user code as well as new methods, i.e., functions or operators tailored to fit them. The \emph{class} is like a ``tag'' that tells how the value in an object should be interpreted and operated upon.

Variables (names given to objects) have a \emph{class} that depends on the object stored in them. In contrast to some other languages, in \Rlang assignment to a variable already in use to store an object belonging to a different class is allowed. There is a restriction that all elements in a vector, array or matrix must be of the same mode (these are called atomic, as they contain homogeneous members). Lists and data frames can be heterogeneous (to be described in chapter \ref{chap:R:collective}). In practice, this means that we can assign an object, such as a vector, with a different \code{class} to a name already in use, but we cannot use indexing to assign an object of a different mode to individual members of a vector, matrix or array.
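
A brief sketch of both cases, using the illustrative names \code{obj1} and \code{obj2}: assignment to a name replaces the whole object, while assignment of a value of a different mode to a single member does not trigger an error but instead converts the whole vector, so that all its members still share one mode.

<<mode-assign-sketch>>=
obj1 <- 1:3
class(obj1)
obj1 <- "abc" # the same name now holds a character vector
class(obj1)
obj2 <- 1:3
obj2[2] <- "b" # the whole vector is converted to character
obj2
class(obj2)
@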

Function \Rfunction{class()} is used to query the class of an object, and function \Rfunction{inherits()} is used to test if an object belongs to a specific class or not (including ``parent'' classes, to be later described).

<<mode-1>>=
vct1 <- 1:5
class(vct1)
inherits(vct1, "character")
inherits(vct1, "numeric")
@

Functions with names starting with \code{is.} are tests returning a logical value, \code{TRUE}, \code{FALSE} or \code{NA}.\qRfunction{is.character()}\qRfunction{is.numeric()}\qRfunction{is.logical()}

<<mode-2>>=
is.numeric(vct1) # no distinction of integer or double
is.double(vct1)
is.integer(vct1)
is.logical(vct1)
is.character(vct1)
@

\begin{explainbox}
Functions starting with \code{is.} have to be individually defined and are available only for some classes. Function \Rfunction{inherits()} takes as its second argument a character vector containing strings to be tested against the \code{class} attribute of the object passed as its first argument.

<<mode-1b>>=
inherits(vct1, c("numeric", "character", "logical"), which = TRUE)
@
\end{explainbox}

\begin{explainbox}
The \emph{mode} of an object is a fundamental property, and limited to those modes defined as part of the \Rlang language. In particular, different \Rlang objects of a given mode, such as \code{numeric}, can belong to different \code{class}es. Classes and the dispatch of methods are discussed in section \ref{sec:script:objects:classes:methods} on page \pageref{sec:script:objects:classes:methods}, together with object-oriented programming.

<<mode-3a>>=
mode(c(1, 2, 3)) # no distinction of integer or double
typeof(c(1, 2, 3))
class(c(1, 2, 3))
mode(c(1L, 2L, 3L)) # no distinction of integer or double
typeof(c(1L, 2L, 3L))
class(c(1L, 2L, 3L))
@

<<mode-3b>>=
mode(factor(c("a", "b", "c"))) # factors are stored as integer codes
typeof(factor(c("a", "b", "c")))
class(factor(c("a", "b", "c")))
@

<<mode-3c>>=
mode(c("a", "b", "c"))
typeof(c("a", "b", "c"))
class(c("a", "b", "c"))
@

<<mode-3d>>=
mode(c(TRUE, FALSE))
typeof(c(TRUE, FALSE))
class(c(TRUE, FALSE))
@
\end{explainbox}

<<include=FALSE>>=
rm(list = setdiff(ls(pattern="*"), to.keep))
@

\section{Type Conversions}\label{sec:calc:type:conversion}
\index{type conversion|(}
By type conversion we mean converting a value from one class into a value expressed in a different class. Usually the meaning can be retained, at least in part. We can, for example, convert character strings into numeric values, but this conversion is possible only for character strings that represent numbers, like \code{"100"}. Most conversions, such as the conversion of \code{character} value \code{"100"} into \code{numeric} value \code{100}, are obvious. Type conversions involving logical values are less intuitive. By convention, functions used to convert objects from one mode or class to a different one have names starting with \code{as.}\footnote{Except for some packages in the \pkgnameNI{tidyverse} that use names starting with \code{as\_} instead of \code{as.}.}.\qRfunction{as.character()}\qRfunction{as.numeric()}\qRfunction{as.logical()}

<<convert-1>>=
as.character(102)
as.character(TRUE)
as.character(3.0e10)
as.numeric("203")
as.logical("TRUE")
as.logical(100)
as.logical(0)
as.logical(-1)
@

Some conversions take place automatically in expressions involving both \code{numeric} and \code{logical} values.

<<convert-1a>>=
TRUE + 10
1 || 0
FALSE | -2:2
@
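
A frequent practical use of this automatic conversion is counting how many members of a vector fulfil a condition, as \code{TRUE} is treated as \code{1} and \code{FALSE} as \code{0} in arithmetic. The sketch below assumes a made-up vector \code{obs}, which is not part of the examples above.

<<convert-1b-count>>=
obs <- c(2.5, 7.1, 0.3, 9.8) # hypothetical data
obs > 2       # a logical vector
sum(obs > 2)  # number of values larger than 2
mean(obs > 2) # proportion of values larger than 2
@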

\begin{playground}
There is flexibility in the conversion from character strings into \code{numeric} and \code{logical} values. Use the examples below plus your own variations to get an idea of what strings are acceptable and correctly converted and which are not. Do also pay attention to the conversion between \code{numeric} and \code{logical} values.\qRfunction{as.character()}\qRfunction{as.numeric()}\qRfunction{as.logical()}

<<convert-PG1, eval=eval_playground>>=
as.numeric("5E+5")
as.numeric("50e+4")
as.numeric(".12")
as.numeric("0.12")
as.numeric("A")
as.logical("TRUE")
as.logical("FALSE")
as.logical("T")
as.logical("t")
as.logical("true")
as.logical("NA")
@

\end{playground}

\begin{playground}
Conversion of fractional numbers into whole numbers can be achieved in different ways, by truncation of the fractional part or rounding it up or down. If we consider both negative and positive numbers, how each of them is handled creates additional possibilities. All these approaches, as defined in mathematics, are available through different \Rlang functions. These functions are not conversion functions, as they return a \code{numeric} value of type \code{double} (see page \pageref{par:calc:round}). In contrast, \Rfunction{as.integer()} is a conversion function for type \code{double} into type \code{integer}, both with mode \code{numeric}.

Compare the values returned by \Rfunction{trunc()} and \Rfunction{as.integer()} when applied to a floating point number, such as \code{12.34}. Check for the equality of values, and for the \emph{class} and \emph{type} of the returned objects.
\end{playground}

\begin{explainbox}
Using conversions, the difference between the length of a \code{character} vector and the number of characters composing each member ``string'' within a vector becomes clear.\qRfunction{length()}\qRfunction{as.numeric()}

<<convert-2a>>=
vct1 <- c("1", "2", "3")
length(vct1)
@

<<convert-2b>>=
vct2 <- "123.1"
length(vct2)
@

<<convert-2c>>=
as.numeric(vct1)
as.numeric(vct2)
as.integer(vct1)
as.integer(vct2)
@
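
Function \Rfunction{nchar()} returns the number of characters in each member string, while \Rfunction{length()} returns the number of members in the vector. A minimal sketch, using a made-up vector \code{strs}:

<<convert-2d-nchar>>=
strs <- c("1", "22", "333.3") # hypothetical strings
length(strs) # number of members in the vector
nchar(strs)  # number of characters in each member
@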

\end{explainbox}

\sloppy
Other\index{formatted character strings from numbers} functions relevant to the ``conversion'' of numbers and other values are \Rfunction{format()} and \Rfunction{sprintf()}. Their use is sometimes informally called ``pretty printing''. These two functions return \Rclass{character} strings, instead of \code{numeric} or other values, and are useful for printed output. One could think of these functions as advanced conversion functions returning formatted, and possibly combined and annotated, character strings. However, they are usually not considered normal conversion functions, as they are very rarely used in a way that preserves the original precision of the input values. We show here the use of \Rfunction{format()} and \Rfunction{sprintf()} with \code{numeric} values, but they can also be used with values of other classes like \code{character}, \code{logical}, etc.

When using \Rfunction{format()}, the format used to display numbers is set by passing arguments to several different parameters. As \Rfunction{print()} calls \Rfunction{format()} to convert \code{numeric} values into \code{character} strings, it accepts the same options.

<<convert-5>>=
vct2 <- c(123.4567890, 1.0)
format(vct2) # using defaults
format(123.4567890) # using defaults
format(1.0) # using defaults
format(vct2, digits = 3, nsmall = 1)
format(vct2, digits = 3, scientific = TRUE)
@
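
As a rough illustration of why \Rfunction{format()} is usually not regarded as a normal conversion function, the sketch below converts a number into a formatted string and back into a number. The names \code{val} and \code{back} are arbitrary and used only here.

<<convert-5a-roundtrip>>=
val <- 123.4567890 # hypothetical value
back <- as.numeric(format(val, digits = 3))
back
back == val # FALSE: precision was lost in the round trip
@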

Function \Rfunction{sprintf()} is similar to \Clang's function of the same name. The user interface is rather unusual, but very powerful, once one learns the syntax. All the formatting is specified using a \code{character} string as template. In this template, placeholders for data and the formatting instructions are embedded using special codes. These codes start with a percent character. We show in the example below the use of some of these: \code{f} is used for \code{numeric} values to be formatted in ``fixed point'' notation, while \code{g} is used when we set the number of significant digits and \code{e} for exponential or \emph{scientific} notation.

<<convert-6>>=
x <- c(123.4567890, 1.0)
sprintf("The numbers are: %4.2f and %.0f", x[1], x[2])
sprintf("The numbers are: %.4g and %.2g", x[1], x[2])
sprintf("The numbers are: %4.2e and %.0e", x[1], x[2])
@

In the template \code{"The numbers are: \%4.2f and \%.0f"}, there are two placeholders for \code{numeric} values, \code{\%4.2f} and \code{\%.0f}; so, in addition to the template, we pass two values extracted from the first two positions of vector \code{x}. These could have been two different vectors of length one, or even numeric constants. The template itself does not need to be a \code{character} constant as in these examples, as a variable can also be passed as argument.
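
Like most \Rlang functions, \Rfunction{sprintf()} is vectorised: when vectors longer than one are passed as data, the template is applied to each set of members in turn, returning one string per set. A small sketch with made-up vectors \code{ids} and \code{sizes}, where \code{\%s} is the placeholder for \code{character} values:

<<convert-6a-vectorised>>=
ids <- c("a", "b")      # hypothetical labels
sizes <- c(1.234, 22.5) # hypothetical values
sprintf("%s: %5.1f", ids, sizes)
@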

\begin{playground}
Function \Rfunction{format()} may be easier to use, in some cases, but \Rfunction{sprintf()} is more flexible and powerful. Those with experience in the use of the \Clang language will already know about \Rfunction{sprintf()} and its use of templates for formatting output. Even if you are familiar with \Clang, look up the help pages for both functions, and practise by trying to create the same formatted output by means of the two functions. Do also play with these functions with other types of data like \code{integer} and \code{character}.
\end{playground}

\begin{explainbox}
I have described above \Rconst{NA} as a single value ignoring modes, but in reality \Rconst{NA} values come in various flavours: \Rconst{NA\_real\_}, \Rconst{NA\_character\_}, etc. and \Rconst{NA} defaults to an \Rconst{NA} of class \Rclass{logical}. \Rconst{NA} is normally converted on the fly to other modes when needed, so in general \Rconst{NA} is all we need to use. The examples below use the extraction operator to demonstrate automatic conversion on assignment. This operator is described in section \ref{sec:calc:indexing} below.

<<nas-01>>=
vct3 <- c(1, NA)
is.numeric(vct3[2])
is.numeric(NA)
@

<<nas-01a>>=
vct4 <- c("abc", NA)
is.character(vct4[2])
class(NA_character_)
@

<<nas-01b>>=
is.character(NA)
class(NA)
@

<<nas-01c>>=
vct5 <- NA
c(vct5, 2:3)
@

However, even the statement below works transparently.

<<nas-02>>=
vct3[3] <- vct4[2]
@
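
Transparent does not mean without consequences. In the sketch below, which uses a made-up vector \code{num.vec}, assigning a \code{character} \code{NA} to one member of a \code{numeric} vector converts the whole vector into a \code{character} vector, because all members of a vector share the same mode.

<<nas-02a-coercion>>=
num.vec <- c(1, NA) # hypothetical numeric vector
class(num.vec)
num.vec[3] <- NA_character_
class(num.vec) # the whole vector has been converted
num.vec
@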
\end{explainbox}

\index{type conversion|)}

<<include=FALSE>>=
rm(list = setdiff(ls(pattern="*"), to.keep))
@

\section{Vector Manipulation}\label{sec:vectors}\label{sec:calc:indexing}
\index{vectors!indexing|(}\index{vectors!member extraction}
If you have read earlier sections of this chapter, you already know how to create a vector. If not, see pages \pageref{par:numeric:vectors:start}--\pageref{par:numeric:vectors:end} before continuing.

In this section, we are going to see how to extract or retrieve, replace, and move elements such as $a_2$ from a vector $a_{i = 1\ldots n}$. Elements are extracted using an index enclosed in single square brackets. The index indicates the position in the vector, starting from one, following the usual mathematical tradition. While in maths notation $a_1$ represents the first, or leftmost, member of vector $a_{i = 1\ldots n}$, in \Rpgrm the equivalent notation is \code{a[1]} for the member and \code{a} for the vector.

We extract the first 10 elements of the vector \code{letters}, by passing an \code{integer} vector as argument to operator \Roperator{[ ]}.

<<vectors-1a>>=
vct1 <- letters[1:10]
vct1
@

\begin{center}
\begin{footnotesize}
\begin{tikzpicture}[font=\sffamily,
array/.style={matrix of nodes,nodes={draw, minimum size=7mm, fill=red!20},column sep=-\pgflinewidth, row sep=0.5mm, nodes in empty cells,
row 1/.style={nodes={draw=none, fill=none, minimum size=5mm}},
row 1 column 2/.style={nodes={draw}}}]

\matrix[array] (array) {
1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
``a''\strut & ``b''\strut & ``c''\strut & ``d''\strut & ``e''\strut & ``f''\strut & ``g''\strut &``h''\strut & ``i''\strut & ``j''\strut \\};
\node[draw, minimum size=4mm] at (array-2-2) (box) {};

\begin{scope}[on background layer]
\fill[blue!10] (array-1-1.north west) rectangle (array-1-10.south east);
\end{scope}

\draw (array-2-1.west) node [signal, draw, fill=codeshadecolor, minimum size=6mm, line width=1.5pt, left] (first) {\code{\strut\ vct1\phantom{mm}}};
%\draw (array-1-2.north)--++(90:3mm) node [above] (first) {Index};
\draw (array-1-10.east)--++(0:3mm) node [right]{\code{integer} positional indices};
\draw (array-2-10.east)--++(0:3mm) node [right]{\code{character} values};
\node [align=center, anchor=south] at (array-2-2.south west|-first.south) (1) [below, yshift=-5mm]{\code{vct1[2]}};
\draw (1)--(box);
%
\end{tikzpicture}
\end{footnotesize}
\end{center}

<<vectors-1>>=
vct1[2]
@

\begin{explainbox}
Four constant \code{character} vectors are available in base \Rlang: \Rconst{letters}, \Rconst{LETTERS}, \Rconst{month.name} and \Rconst{month.abb}, of which I used \code{letters} in the example above. These vectors are always in English, irrespective of the locale.

<<vectors-eb-01>>=
month.name
month.name[6]
@
\end{explainbox}

\begin{warningbox}
In \Rlang, indexes always start from one, while in some other programming languages such as \Clang and \Cpplang, indexes start from zero. It is important to be aware of this difference, as many computation algorithms are valid only under a given indexing convention.
\end{warningbox}

\begin{faqbox}{How to access the last value in a vector?}
<<faq-vectors-1>>=
month.name[length(month.name)]
@
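
An alternative, sketched here, is function \Rfunction{tail()}, which returns the last \code{n} members of a vector.

<<faq-vectors-1a-tail>>=
tail(month.name, n = 1) # last member only
@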

\end{faqbox}

It is possible to extract a subset of the elements of a vector in a single operation, using a vector of indexes. The positions of the extracted elements in the result (``returned value'') are determined by the ordering of the members of the vector of indexes---easier to demonstrate than to explain.

<<vectors-2>>=
vct1[c(3, 2)]
vct1[10:1]
@

\begin{playground}
The length of the indexing vector is \emph{not} restricted by the length of the indexed vector. However, only numerical indexes that match positions present in the indexed vector can extract values. Values in the indexing vector that point to positions not present in the indexed vector result in \code{NA} values. This is easier to learn by \emph{playing} with \Rlang than from explanations. Play with \Rlang, using the following examples as a starting point.

<<vectors-PG1, eval=eval_playground>>=
length(vct1)
vct1[c(3, 3, 3, 3)]
vct1[c(10:1, 1:10)]
vct1[c(1, 11)]
vct1[11]
@

Have you tried some of your own examples? If not yet, do \emph{play} with additional variations of your own before continuing.

\end{playground}

Negative indexes have a special meaning; they indicate the positions at which values should be excluded. Be aware that it is \emph{illegal} to mix positive and negative values in the same indexing operation.

<<vectors-3>>=
vct1[-2]
vct1[-c(3,2)]
vct1[-3:-2]
@
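
The restriction on mixing positive and negative indexes can be checked safely by wrapping the indexing operation in a call to \code{try()}, so that the error does not interrupt the session. In this sketch, \code{abc} and \code{res} are names made up for the example, and the wording of the error message may differ among \Rlang versions.

<<vectors-3a-mixed>>=
abc <- letters[1:5] # hypothetical vector
res <- try(abc[c(-1, 2)], silent = TRUE)
inherits(res, "try-error") # TRUE if indexing triggered an error
@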

\begin{advplayground}
Results from indexing with special values and zero may be surprising. Try to build a rule from the examples below, a rule that will help you remember what to expect next time you are confronted with similar statements using special values as ``subscripts'' instead of integers larger than or equal to one. This is likely to happen sooner or later, as these special values can be returned by different \Rlang expressions depending on the value of operands or function arguments, some of them described earlier in this chapter.

<<vectors-5, eval=eval_playground>>=
vct1[ ]
vct1[0]
vct1[numeric(0)]
vct1[NA]
vct1[c(1, NA)]
vct1[NULL]
vct1[c(1, NULL)]
@
\end{advplayground}

Another way of indexing, which is very handy, but not available in most other programming languages, is indexing with a vector of \code{logical} values. The \code{logical} vector used for indexing is usually of the same length as the vector from which elements are going to be selected. However, this is not a requirement, because if the \code{logical} vector of indexes is shorter than the indexed vector, it is ``recycled'' as discussed on page \pageref{par:recycling:numeric} in relation to other operators.

<<vectors-6>>=
vct1[TRUE]
vct1[FALSE]
vct1[c(TRUE, FALSE)]
vct1[c(FALSE, TRUE)]
vct1 > "c"
vct1[vct1 > "c"]
@

Indexing with logical vectors is very frequently used in \Rlang because comparison operators are vectorised. Comparison operators, when applied to a vector, return a \code{logical} vector, a vector that can be used to extract the elements for which the result of the comparison test was \code{TRUE}.
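
Comparisons can be combined using the vectorised logical operators \code{\&} and \code{|} to select members that fulfil more than one condition. A minimal sketch with a made-up \code{numeric} vector \code{nums}:

<<vectors-6a-combined>>=
nums <- c(3, 8, 1, 12, 7)  # hypothetical data
nums > 2 & nums < 10       # both conditions fulfilled
nums[nums > 2 & nums < 10]
nums[nums < 2 | nums > 10] # at least one condition fulfilled
@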

\begin{playground}
The examples in this text box demonstrate additional uses of logical vectors: the logical vector returned by a vectorised comparison can be stored in a variable, and the variable used as a ``selector'' for extracting a subset of values from the same vector, or from a different vector.

<<vectors-PG6, eval=eval_playground>>=
vct1 <- letters[1:10]
vct2 <- 1:10
selector <- vct1 > "c"
selector
vct1[selector]
vct2[selector]
@

Positional indexes can be obtained from a \code{logical} vector by means of function \code{which()}, as it returns an \code{integer} vector with the positions of the \code{TRUE} values in the \code{logical} vector.

<<vectors-PG6a, eval=eval_playground>>=
indexes <- which(vct1 > "c")
indexes
vct1[indexes]
@

Make sure to understand the examples above. These constructs are very widely used in \Rlang because they allow for concise code that is easy to understand once one is familiar with the indexing rules.
\end{playground}

\begin{explainbox}\label{par:calc:vector:map}
\index{vectors!named elements}
Above, \code{integer} or \code{logical} vectors were used as indices for extraction of anonymous elements, or members, from \code{character} vectors. In \Rlang, elements can be assigned names, and these names used in place of \code{numeric} indices to access them. One situation where this is very useful is the mapping of values between two representations. Let's assume we have a long vector encoding treatments using single letter codes and that we want to replace these codes with self-explanatory names.

<<vectors-named-01>>=
treat <- c("H", "C", "H", "W", "C", "H", "H", "W", "W")
@

We can create a named vector to \emph{map} the single letter codes onto full words. Above, we used function \Rfunction{c()} to concatenate several \code{character} strings, without assigning any names to them, thus they have to be extracted from the vector using \code{numeric} values, indexing by position. Below, we assign a name to each string. Using operator \Roperator{=} we assign the name on the left-hand side (\emph{lhs}) to the member of the vector on the right-hand side (\emph{rhs}).

<<vectors-named-02>>=
treat.map <- c(H = "hot", C = "cold", W = "warm")
treat.map
names(treat.map)
@

\begin{center}
\begin{footnotesize}
\begin{tikzpicture}[font=\sffamily,
array/.style={matrix of nodes,nodes={draw, minimum size=7mm, fill=red!20},column sep=-\pgflinewidth, row sep=0.5mm, nodes in empty cells,
row 1/.style={nodes={draw=none, fill=none, minimum size=5mm}}}]

\matrix[array] (array) {
1 & 2 & 3 \\
``hot''\strut & ``cold''\strut & ``warm''\strut \\};

\begin{scope}[on background layer]
\fill[blue!10] (array-1-1.north west) rectangle (array-1-3.south east);
\end{scope}

\draw (array-2-1.west) node [signal, draw, fill=codeshadecolor, minimum size=6mm, line width=1.5pt, left] (first) {\code{\strut treat.map}};
\draw (array-2-1.north) node [signal, draw, fill=codeshadecolor, minimum size=5mm, rotate=-90, xshift=-11.5mm, yshift=-2.7mm, above] (nameh) {\rotatebox{90}{H\strut}};
\draw (array-2-2.north) node [signal, draw, fill=codeshadecolor, minimum size=5mm, rotate=-90, xshift=-11.5mm, yshift=-2.7mm, above] (namec) {\rotatebox{90}{C\strut}};
\draw (array-2-3.north) node [signal, draw, fill=codeshadecolor, minimum size=5mm, rotate=-90, xshift=-11.5mm, yshift=-2.7mm, above] (namew) {\rotatebox{90}{W\strut}};
%\draw (array-1-2.north)--++(90:3mm) node [above] (first) {Index};
\draw (array-1-3.east)--++(0:6.5mm) node [right]{\code{integer} positional indices};
\draw (array-2-3.east)--++(0:3mm) node [right]{\code{character} values};
\draw (namew)--++(0:9mm) node [right]{\code{character} member names};
%
\end{tikzpicture}
\end{footnotesize}
\end{center}

As \code{treat.map} is a named vector, we can use the element names, in addition to \code{numeric} values, as indices for element extraction.

<<vectors-named-03>>=
treat.map["H"]
@

The indexing vector can be of a different length than the indexed vector, and the returned value is a new vector of the same length as the indexing vector.

<<vectors-named-04>>=
treat.new <- treat.map[treat]
treat.new
@

\noindent
where \code{treat.new} is a named vector, from which we will frequently want to remove the members' names.

<<vectors-named-05>>=
treat.new <- unname(treat.new)
treat.new
@

It is more common to use named members with lists than with vectors, but in \Rlang, in both cases it is possible to use both numeric positional indices and names.
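
Names can also be added to an existing vector by using function \code{names()} on the left-hand side of an assignment. In this sketch, the vector \code{temps} and its values are made up for illustration.

<<vectors-named-06-assign>>=
temps <- c(21.5, 4.0, 12.3) # hypothetical values
names(temps) <- c("hot", "cold", "warm")
temps["cold"]
temps[c("warm", "hot")] # order set by the indexing vector
@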
\end{explainbox}

Indexing can be used on either side of an assignment expression. In the code chunk below, we use the extraction operator on the left-hand side of the assignments to replace values only at selected positions in the vector. This may look rather esoteric at first sight, but it is just a simple extension of the logic of indexing described above. It works, because the low precedence of the \Roperator{<-} operator results in both the left- and the right-hand side being fully evaluated before the assignment takes place. To make the changes to the vectors easier to compare, identical vectors are used in each of the examples below.

<<vectors-7>>=
vct2 <- 1:10
vct2
vct2[1] <- 99
vct2

vct2 <- 1:10
vct2[c(2,4)] <- -99 # recycling
vct2

vct2 <- 1:10
vct2[c(2,4)] <- c(-99, 99)
vct2

vct2 <- 1:10
vct2[TRUE] <- 1 # recycling
vct2

vct2 <- 1:10
vct2 <- 1  # no recycling
vct2
@

Indexing can be used simultaneously on both sides of the assignment operator, for example, to swap two elements.

<<vectors-8>>=
vct3 <- letters[1:10]
vct3[1:2] <- vct3[2:1]
vct3
@
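
Logical indexing can also be used on the left-hand side of an assignment, which makes it possible to replace all members that fulfil a condition in a single statement. In this sketch, with a made-up vector \code{vals}, negative values are replaced by zero.

<<vectors-8b-logical>>=
vals <- c(-2, 5, -1, 3) # hypothetical data
vals[vals < 0] <- 0     # recycling
vals
@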

\begin{playground}
Do play with subscripts to your heart's content, as really grasping how they work and how they can be used will be very useful in anything you do in the future with \Rlang. Even the contrived example below follows the same simple rules; just study it bit by bit. Hint: the second statement in the chunk below modifies \code{VCT1}, so, when studying variations of this example, you will need to recreate \code{VCT1} by executing the first statement each time you run a variation of the second statement.

<<vectors-8a, eval=eval_playground>>=
VCT1 <- letters[1:10]
VCT1[5:1] <- VCT1[c(TRUE,FALSE)]
VCT1
@

\end{playground}

\begin{explainbox}\label{box:vec:sort}
In \Rlang, indexing with positional indexes can be done with \Rclass{integer} or \Rclass{numeric} values. Numeric values can be floats, but for indexing, only integer values are meaningful. Consequently, \Rclass{double} values are converted into \code{integer} values when used as indexes. The conversion is done invisibly, but it does slow down computations slightly. When working on big data sets, explicitly using \code{integer} values can improve performance.

<<vectors-9>>=
vct4 <- LETTERS[1:10]
vct4
vct4[1]
vct4[1.1]
vct4[1.9999] # surprise!!
vct4[2]
@

From this experiment, we can learn that if positive indexes are not whole numbers, they are truncated to the next smaller integer.

<<vectors-9a>>=
vct4 <- LETTERS[1:10]
vct4
vct4[-1]
vct4[-1.1]
vct4[-1.9999]
vct4[-2]
@

From this experiment, we can learn that if negative indexes are not whole numbers, they are truncated to the next larger (less negative) integer. In conclusion, \code{double} index values behave as if they were sanitised using function \code{trunc()}.

This example also shows how one can tease out of \Rlang its rules through experimentation.
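
The conclusion can also be tested directly, by comparing the values extracted with and without an explicit call to \code{trunc()}. The vectors \code{idx} and \code{neg.idx} are made up for this check.

<<vectors-9b-trunc>>=
idx <- c(1.1, 1.9999, 2) # hypothetical positive indexes
identical(LETTERS[idx], LETTERS[trunc(idx)])
neg.idx <- c(-1.1, -1.9999) # hypothetical negative indexes
identical(LETTERS[neg.idx], LETTERS[trunc(neg.idx)])
@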

\end{explainbox}

A\index{vectors!sorting} frequent operation on vectors is sorting them into an increasing or decreasing order. The most direct approach is to use \Rfunction{sort()}.

<<vectors-10>>=
vct5 <- c(10, 4, 22, 1, 4)
sort(vct5)
sort(vct5, decreasing = TRUE)
@

An indirect way of sorting a vector, possibly based on a different vector, is to generate with \Rfunction{order()} a vector of numerical indexes that can be used to achieve the ordering.

<<vectors-11>>=
order(vct5)
vct5[order(vct5)]
vct6 <- c("ab", "aa", "c", "zy", "e")
vct6[order(vct5)]
@
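
Function \Rfunction{order()} accepts more than one vector as argument, in which case the later vectors are used to break ties in the earlier ones. A sketch with made-up vectors \code{score} and \code{label}, in which the value \code{4} appears twice in \code{score}:

<<vectors-11a-ties>>=
score <- c(10, 4, 22, 1, 4)            # hypothetical data
label <- c("ab", "e", "c", "zy", "aa") # hypothetical data
order(score)        # ties keep their original relative order
order(score, label) # ties in score broken using label
label[order(score, label)]
@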

\begin{explainbox}
A problem linked to sorting that we may face is counting how many copies of each value are present in a vector. We need to use two functions, \Rfunction{sort()} and \Rfunction{rle()}\index{vectors!run length encoding}. The second of these functions computes \emph{run length} as used in \emph{run length encoding} for which \emph{rle} is an abbreviation. A \emph{run} is a series of consecutive identical values. As the objective is to count the number of copies of each value present, we need first to sort the vector.

<<vectors-EB21>>=
vct7 <- letters[c(1, 5, 10, 3, 1, 4, 21, 1, 10)]
vct7
sort(vct7)
rle(sort(vct7))
@

The second and third statements are only to demonstrate the effect of each step. The last statement uses nested function calls to compute the number of copies of each value in the vector.
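
For simple counting, function \code{table()} is an alternative that does not require sorting the vector first. A brief sketch, with a made-up vector \code{codes}:

<<vectors-EB21a-table>>=
codes <- c("a", "e", "j", "c", "a", "d", "u", "a", "j") # hypothetical data
table(codes) # number of copies of each distinct value
@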
\end{explainbox}
\index{vectors!indexing|)}

<<include=FALSE>>=
rm(list = setdiff(ls(pattern="*"), to.keep))
@

\section{Matrices and Multidimensional Arrays}\label{sec:matrix:array}
\index{matrices|(}\index{arrays|(}\qRclass{matrix}\qRclass{array}

Matrices have two dimensions, rows and columns, and like vectors all their members share the same mode, and are atomic, i.e., they are homogeneous (Figure \ref{fig:matrix:margins}). Most commonly, matrices are used to store \code{numeric}, \code{integer} or \code{logical} values. The number of rows and columns can differ, so matrices can be either square or rectangular in shape, but never ragged.

\begin{figure}
  \centering
\begin{footnotesize}
\begin{tikzpicture}[auto matrix/.style={matrix of nodes,
  draw,thick,inner sep=0pt,
  nodes in empty cells,column sep=-0.2pt,row sep=-0.2pt,
  cells={nodes={minimum width=3em,minimum height=3em,
   draw,very thin,anchor=center,fill=codeshadecolor,
   execute at begin node={%
   $\vphantom{a_|}\ifnum\the\pgfmatrixcurrentrow<4
     \ifnum\the\pgfmatrixcurrentcolumn<4
      {#1}_{\the\pgfmatrixcurrentrow,\the\pgfmatrixcurrentcolumn}
     \else
      \ifnum\the\pgfmatrixcurrentcolumn=5
       {#1}_{\the\pgfmatrixcurrentrow,n}
      \fi
     \fi
    \else
     \ifnum\the\pgfmatrixcurrentrow=5
      \ifnum\the\pgfmatrixcurrentcolumn<4
       {#1}_{m, \the\pgfmatrixcurrentcolumn}
      \else
       \ifnum\the\pgfmatrixcurrentcolumn=5
        {#1}_{m,n}
       \fi
      \fi
     \fi
    \fi
    \ifnum\the\pgfmatrixcurrentrow\the\pgfmatrixcurrentcolumn=14
     \cdots
    \fi
    \ifnum\the\pgfmatrixcurrentrow\the\pgfmatrixcurrentcolumn=41
     \vdots
    \fi
    \ifnum\the\pgfmatrixcurrentrow\the\pgfmatrixcurrentcolumn=44
     \ddots
    \fi$
    }
  }}}]
 \matrix[auto matrix=a](matx){
  & & & & \\
  & & & & \\
  & & & & \\
  & & & & \\
  & & & & \\
 };
 \draw[thick,-stealth] ([yshift=-2ex]matx.south west) --
  ([yshift=-2ex]matx.south east) node[midway,below] {Columns or margin 2: $j = 1$ to $j = n$};
 \draw[thick,-stealth] ([xshift=-2ex]matx.north west)
   -- ([xshift=-2ex]matx.south west) node[midway,above,rotate=90] {Rows or margin 1: $i = 1$ to $i = m$};
\end{tikzpicture}
\end{footnotesize}\vspace{-1ex}
  \caption[Diagram of an \Rlang matrix.]{Diagram of an \Rlang matrix showing indexing of members.}\label{fig:matrix:margins}
\end{figure}

In \Rlang, the first index always denotes rows and the second index always denotes columns. Figure \ref{fig:matrix:margins} depicts a matrix, $A$, with $m$ rows and $n$ columns and size equal to $m \times n$ ``cells'', with individual values denoted by $a_{i,j}$. Here we use a simpler representation than that used for vectors on page \pageref{par:calc:vectors:diag} above, but the same concepts apply.

\begin{warningbox}
  In \Rlang documentation, the individual dimensions of matrices and arrays are frequently called \emph{margins}, numbered in the same order as the indices are given. Thus, in a matrix the first margin corresponds to rows and the second one to columns.
\end{warningbox}

In mathematical notation the same generic matrix is represented as
\begin{equation*}
  A_{m\times n} =
  \begin{bmatrix}
    a_{1,1} & a_{1,2} & \cdots & a_{1,j} & \cdots & a_{1,n}\\
    a_{2,1} & a_{2,2} & \cdots & a_{2,j} & \cdots & a_{2,n}\\
    \vdots & \vdots & \ddots & \vdots &        & \vdots \\
   a_{i,1} & a_{i,2} & \cdots & a_{i,j} & \cdots & a_{i,n}\\
     \vdots & \vdots &      & \vdots &  \ddots & \vdots \\
   a_{m,1} & a_{m,2} & \cdots & a_{m,j} & \cdots & a_{m,n}
  \end{bmatrix}
\end{equation*}
where $A$ represents the whole matrix, $m \times n$ its dimensions, and $a_{i,j}$ its elements, with $i$ indexing rows and $j$ indexing columns. The lengths of the two dimensions of the matrix are given by $m$ and $n$, for rows and columns.

Vectors have a single dimension, and, as described on page \pageref{par:calc:vectors:diag}, we can query this dimension, their length, with function \Rfunction{length()}. Matrices have two dimensions, which can be queried individually with \Rfunction{ncol()} and \Rfunction{nrow()}, and jointly with \Rfunction{dim()}. As expected, \Rfunction{is.matrix()} can be used to query the class.

We can create a matrix using the \Rfunction{matrix()} or \Rfunction{as.matrix()} constructors. The first argument of \Rfunction{matrix()} must be a vector. Function \Rfunction{as.matrix()} is a conversion constructor, with specialisations accepting as argument objects belonging to a few other classes. The shape of the \code{matrix} is controlled by passing an argument to either \code{ncol} or \code{nrow}.

<<matrix-01>>=
matrix(1:15, ncol = 3)
matrix(1:15, nrow = 3)
@

When a \code{matrix} is printed at the \Rlang console, the row and column indexes are indicated on the left and top margins, in the same way as they would be used to extract whole rows and columns.
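
The query functions described above can be applied to a matrix created in this way. In the sketch below, \code{mat.a} is a name made up for the example.

<<matrix-01a-query>>=
mat.a <- matrix(1:15, ncol = 3) # hypothetical matrix
dim(mat.a)    # rows, then columns
nrow(mat.a)
ncol(mat.a)
length(mat.a) # total number of members
is.matrix(mat.a)
@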

\begin{explainbox}
  Matrices are most useful for the storage of numeric values as matrix algebra plays an important role in statistical computations. This notwithstanding, it is possible to create matrices (and arrays) from atomic vectors of other classes such as \Rclass{logical} or \Rclass{character}. The only difference is the scarcity of meaningful operations other than retrieval of members using two indices.

<<matrix-character-01>>=
matrix(letters[1:15], nrow = 3)
@
\end{explainbox}

When a vector is converted to a matrix, \Rlang's default is to allocate the values in the vector to the matrix starting from the leftmost column, and within the column, down from the top. Once the first column is filled, the process continues from the top of the next column, as can be seen above. This order can be changed as you will discover in the playground below.

\begin{playground}
Check in the help page for the \code{matrix}\qRfunction{matrix()} constructor how to use the \code{byrow} parameter to alter the default order in which the elements of the vector are allocated to columns and rows of the new matrix.

<<matrix-PG00, eval=FALSE>>=
help(matrix)
@

While you are looking at the help page, also consider the default number of columns and rows.

<<matrix-PG00a, eval=eval_playground>>=
matrix(1:15)
@

And to start getting a sense of how to interpret error and warning messages, run the code below and make sure you understand which problem is being reported. Before executing the statement, analyse it and predict what the returned value will be. Afterwards, compare your prediction with the value actually returned.

<<matrix-PG00b, eval=FALSE>>=
matrix(1:15, ncol = 2)
@

\end{playground}

Subscripting of matrices and arrays is consistent with that used for vectors; we only need to supply an indexing vector, or leave a blank space, for each dimension. A matrix has two dimensions, so to access an element or group of elements, we use two indices. The first index value selects rows, and the second one, columns.

<<matrix-10>>=
mat1 <- matrix(1:20, ncol = 4)
mat1
mat1[1, 2]
mat1[2, 1]
@

Remind yourself of how indexing of vectors works in \Rlang (see section \ref{sec:vectors} on page \pageref{sec:vectors}). We will now apply the same rules in two dimensions to extract and replace values. The first or leftmost indexing vector corresponds to rows and the second one to columns, so \Rlang uses a rows-first convention for indexing. Missing indexing vectors are interpreted as meaning \emph{extract all rows} and \emph{extract all columns}, respectively.

<<matrix-11>>=
mat1[1, ]
mat1[ , 1]
mat1[2:3, c(1,3)]
mat1[3, 4] <- 99
mat1
mat1[4:3, 2:1] <- mat1[3:4, 1:2]
mat1
@

\begin{explainbox}
Vectors are simpler than matrices, and by default when possible the ``slice'' extracted from a matrix is simplified into a vector by dropping one dimension. By passing \code{drop = FALSE}, we can prevent this.

<<matrix-11a>>=
is.matrix(mat1[1, ])
is.matrix(mat1[1:2, 1:2])
@

<<matrix-11b>>=
is.vector(mat1[1, ])
is.vector(mat1[1:2, 1:2])
@

<<matrix-11c>>=
is.matrix(mat1[1, , drop = FALSE])
is.matrix(mat1[1:2, 1:2, drop = FALSE])
@

\end{explainbox}

Matrices, like vectors, can be assigned names that function as ``nicknames'' for indices for assignment and extraction. Matrices can have row names and/or column names.

<<matrix-12>>=
colnames(mat1)
rownames(mat1)
colnames(mat1) <- c("a", "b", "c", "d")
mat1
rownames(mat1) <- c("A", "B", "C", "D", "E")
mat1
mat1[c("E", "A", "D"), c("b", "a")]
colnames(mat1) <- NULL
mat1
@

\begin{warningbox}
Matrices can be indexed as vectors, without triggering an error or warning.

<<matrix-13>>=
mat1 <- matrix(1:20, ncol = 4)
mat1
dim(mat1)
mat1[10]
mat1[5, 2]
@

The next code example demonstrates that indexing as a vector with a single index always works column-wise, even if matrix \code{mat2} was created by assigning vector elements by row.

<<matrix-14>>=
mat2 <- matrix(1:20, ncol = 4, byrow = TRUE)
mat2
dim(mat2)
mat2[10]
mat2[5, 2]
@
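
The position used when indexing a matrix as a vector can be computed from the row and column indexes, because values are stored column by column: for row $i$ and column $j$ of a matrix with $m$ rows, the position is $(j - 1) \times m + i$. A sketch of this arithmetic, using names made up for the example:

<<matrix-14a-linear>>=
mat.b <- matrix(1:20, ncol = 4) # hypothetical matrix
i <- 5 # row index
j <- 2 # column index
lin.idx <- (j - 1) * nrow(mat.b) + i
lin.idx
mat.b[lin.idx] == mat.b[i, j]
@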
\end{warningbox}

\begin{explainbox}
In \Rlang, a \Rclass{matrix} can have a single row, a single column, a single element, or no elements. However, in all cases, a \code{matrix} will have as \emph{dimensions} attribute an \code{integer} vector of length two.

<<dimensions-box-01>>=
vct1 <- 1:6
dim(vct1)
@

<<dimensions-box-02a>>=
one.col.matrix <- matrix(1:6, ncol = 1)
dim(one.col.matrix)
@

<<dimensions-box-02b>>=
two.col.matrix <- matrix(1:6, ncol = 2)
dim(two.col.matrix)
@

<<dimensions-box-02c>>=
one.elem.matrix <- matrix(1, ncol = 1)
dim(one.elem.matrix)
@

<<dimensions-box-02d>>=
no.elem.matrix <- matrix(numeric(), ncol = 0)
dim(no.elem.matrix)
@

\end{explainbox}

\begin{figure}
  \centering
%\usetikzlibrary{matrix}
\newcounter{kmargincount}
\begin{footnotesize}
\begin{tikzpicture}[auto matrix/.style={matrix of nodes,
  draw,thick,inner sep=0pt,
  nodes in empty cells,column sep=-0.2pt,row sep=-0.2pt,
  cells={nodes={minimum width=3.5em,minimum height=3.5em,
   draw,very thin,anchor=center,fill=codeshadecolor,
   execute at begin node={%
   $\vphantom{a_|}\ifnum\the\pgfmatrixcurrentrow<4
     \ifnum\the\pgfmatrixcurrentcolumn<4
      {#1}_{\the\pgfmatrixcurrentrow,\the\pgfmatrixcurrentcolumn,\arabic{kmargincount}}
     \else
      \ifnum\the\pgfmatrixcurrentcolumn=5
       {#1}_{\the\pgfmatrixcurrentrow,m,\arabic{kmargincount}}
      \fi
     \fi
    \else
     \ifnum\the\pgfmatrixcurrentrow=5
      \ifnum\the\pgfmatrixcurrentcolumn<4
       {#1}_{l, \the\pgfmatrixcurrentcolumn,\arabic{kmargincount}}
      \else
       \ifnum\the\pgfmatrixcurrentcolumn=5
        {#1}_{l,m,\arabic{kmargincount}}
       \fi
      \fi
     \fi
    \fi
    \ifnum\the\pgfmatrixcurrentrow\the\pgfmatrixcurrentcolumn=14
     \cdots
    \fi
    \ifnum\the\pgfmatrixcurrentrow\the\pgfmatrixcurrentcolumn=41
     \vdots
    \fi
    \ifnum\the\pgfmatrixcurrentrow\the\pgfmatrixcurrentcolumn=44
     \ddots
    \fi$
    }
  }}}]
\setcounter{kmargincount}{4}
 \matrix[auto matrix=a,xshift=7.5em,yshift=7.5em](matback){
  & & & & \\
  & & & & \\
  & & & & \\
  & & & & \\
  & & & & \\
 };
\setcounter{kmargincount}{3}
 \matrix[auto matrix=a,xshift=5em,yshift=5em](matz){
  & & & & \\
  & & & & \\
  & & & & \\
  & & & & \\
  & & & & \\
 };
\setcounter{kmargincount}{2}
 \matrix[auto matrix=a,xshift=2.5em,yshift=2.5em](maty){
  & & & & \\
  & & & & \\
  & & & & \\
  & & & & \\
  & & & & \\
 };
\setcounter{kmargincount}{1}
 \matrix[auto matrix=a](matx){
  & & & & \\
  & & & & \\
  & & & & \\
  & & & & \\
  & & & & \\
 };
 \draw[thick,-stealth] ([xshift=2ex]matx.south east) -- ([xshift=2ex]matback.south east)
  node[midway,below,rotate=45] {Margin 3: $k = 1$ to $k = n$};
 \draw[thick,-stealth] ([yshift=-2ex]matx.south west) --
  ([yshift=-2ex]matx.south east) node[midway,below] {Margin 2: $j = 1$ to $j = m$};
 \draw[thick,-stealth] ([xshift=-2ex]matx.north west)
   -- ([xshift=-2ex]matx.south west) node[midway,above,rotate=90] {Margin 1: $i = 1$ to $i = l$};
\end{tikzpicture}
\end{footnotesize}\vspace{-1ex}
  \caption[Diagram of an \Rlang array.]{Diagram of an \Rlang array with three dimensions showing indexing of members.}\label{fig:array:margins}
\end{figure}

Arrays\index{matrix!dimensions}\index{arrays!dimensions} are similar to matrices, but can have one or more dimensions (Figure \ref{fig:array:margins}). The dimensions of an array can be queried with \Rfunction{dim()}, as with matrices. Whether an \Rlang object is an array can be found out with \Rfunction{is.array()}. The diagram in Figure \ref{fig:array:margins} depicts an array, $A$, with three dimensions giving a size equal to $l\times m \times n$, and individual values denoted by $a_{i,j,k}$.

When calling the constructor \Rfunction{array()}, dimensions are specified with the argument passed to parameter \code{dim}.

<<matrix-21>>=
ary1 <- array(1:27, dim = c(3, 3, 3))
ary1
ary1[2, 2, 2]
@

In the chunk above, the length of the supplied vector is the product of the dimensions, $27 = 3 \times 3 \times 3 = 3^3$. Arrays are printed in slices, where slices across 3rd and higher dimensions are shown separately, with their corresponding indexes above each slice and the first two dimensions on the margins of the individual slices, similarly to how matrices are displayed.
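
As a brief sketch of these ideas, the chunk below uses a small array, \code{ary.ex}, created here only for illustration. Leaving an index empty selects all positions along that dimension, so that a slice across the third dimension is a matrix. As with matrices, indexing with a single index also works, with the first dimension varying fastest.

<<array-slices-sketch>>=
ary.ex <- array(1:24, dim = c(2, 3, 4))
is.array(ary.ex)
dim(ary.ex)
ary.ex[ , , 2] # one slice, a matrix with 2 rows and 3 columns
dim(ary.ex[ , , 2])
# position of member [2, 3, 4] when indexing as a vector
ary.ex[2, 3, 4] == ary.ex[2 + (3 - 1) * 2 + (4 - 1) * 2 * 3]
@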

\begin{playground}
  How do you use indexes to extract the second element of the original vector, in each of the following matrices and arrays?

<<matrix-PG01, eval=eval_playground>>=
VCT2 <- 1:10
MAT1 <- matrix(VCT2, ncol = 2)
MAT2 <- matrix(VCT2, ncol = 2, byrow = TRUE)
MAT3 <- matrix(VCT2, nrow = 2)
MAT4 <- matrix(VCT2, nrow = 2, byrow = TRUE)
@

<<matrix-PG02, eval=eval_playground>>=
ARY1 <- array(VCT2, dim = c(5, 2))
ARY2 <- array(VCT2, dim = c(5, 2), dimnames = list(NULL, c("c1", "c2")))
ARY3 <- array(VCT2, dim = c(2, 5))
@

Be aware that vectors and one-dimensional arrays are not the same thing, while two-dimensional arrays are matrices.
\begin{enumerate}
  \item Use the different constructors and query functions to explore this, and its consequences.
  \item Convert a matrix into a vector using \Rfunction{as.vector()} and compare the returned values to those in the matrix. Are values extracted by columns or by rows first?
\end{enumerate}
\end{playground}
\index{arrays|)}

\index{matrix!operators|(}
Operators and functions for matrix algebra are available in \Rlang, as matrices are widely used in statistical algorithms. I describe below only some of these matrix-specific functions and operators. I also give examples of the use of some of the usual arithmetic operators together with objects of class \Rclass{matrix}.

Recycling applies to the usual arithmetic operators when applied to matrices. This is similar to their behaviour when all operands are vectors (see page \pageref{par:recycling:numeric}).\index{matrix!operations with vectors}

<<matrix-32>>=
mat3 <- matrix(1:20, ncol = 4)
mat3 + 2
mat3 * 0:1
mat3 * 1:0
@

\begin{playground}
  When a \code{matrix} and a \code{vector} are operands in an arithmetic operation, how the positions of the \code{vector} are mapped to positions in the \code{matrix} affects the result of the operation. Run the code below to find out. What is the logic behind the result?

<<matrix32a, eval=eval_playground>>=
matrix(rep(1, 6)) * 1:6
@
\end{playground}

Function \Rfunction{t()} transposes\index{matrix!transpose} a matrix, by swapping columns and rows.

<<matrix-31>>=
mat3
t(mat3)
@

In the examples above with the usual multiplication operator \code{*}, the operation computed is not a matrix product but, instead, the element-by-element product of the matrix and the recycled vector. Operators and functions implementing the operations of matrix algebra are distinct. Matrix algebra gives the rules for operations where both operands are matrices. For example, matrix multiplication is indicated by the operator \Roperator{\%*\%}. \index{matrix!multiplication}

<<matrix-33>>=
mat4 <- matrix(1:16, ncol = 4)
mat4 * mat4
mat4 %*% mat4
@
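
To make the difference concrete, member $c_{i,j}$ of the matrix product $C = A B$ is $\sum_k a_{i,k}\, b_{k,j}$, the sum of the products of the members of row $i$ of $A$ and column $j$ of $B$. The sketch below, with matrices \code{mat.a} and \code{mat.b} created only for this illustration, checks one member by hand. The number of columns of the first operand must match the number of rows of the second one.

<<matrix-product-sketch>>=
mat.a <- matrix(1:6, nrow = 2)  # 2 rows, 3 columns
mat.b <- matrix(1:12, nrow = 3) # 3 rows, 4 columns
mat.ab <- mat.a %*% mat.b
dim(mat.ab)
mat.ab[1, 2]
sum(mat.a[1, ] * mat.b[ , 2]) # row 1 of mat.a times column 2 of mat.b
@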

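Function \Rfunction{diag()} makes it possible to easily create a diagonal matrix. When passed a single positive integer as argument, as in the next example, it returns an identity matrix of that size.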

<<matrix-34>>=
mat5 <- diag(4)
mat5
mat4 %*% mat5
@
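
Function \Rfunction{diag()} behaves differently depending on its argument. As a short sketch, with made-up values: a vector of length two or more gives a diagonal matrix with these values on the diagonal, while a matrix passed as argument has its diagonal returned as a vector.

<<matrix-diag-sketch>>=
diag(c(2, 5, 7))            # diagonal matrix built from a vector
diag(matrix(1:9, ncol = 3)) # diagonal extracted from a matrix
@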

The inverse of a square, non-singular matrix can be found by means of function \Rfunction{solve()}.

<<matrix-35>>=
mat6 <- matrix(c(3, 2, 0, 1, 3, 2, 7, 2, 4), ncol = 3)
solve(mat6)
@
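
A quick check of a computed inverse is to multiply it by the original matrix, which should return the identity matrix except for small rounding errors. The sketch below repeats the matrix from the chunk above under a new name, \code{mat.sq}, and uses \Rfunction{all.equal()} to compare values with a tolerance. When called with a second argument, here an arbitrary vector \code{rhs}, \Rfunction{solve()} solves the corresponding system of linear equations.

<<matrix-solve-sketch>>=
mat.sq <- matrix(c(3, 2, 0, 1, 3, 2, 7, 2, 4), ncol = 3)
mat.inv <- solve(mat.sq)
all.equal(mat.sq %*% mat.inv, diag(3))
rhs <- c(1, 2, 3)
sol <- solve(mat.sq, rhs) # solves mat.sq %*% sol = rhs
all.equal(as.vector(mat.sq %*% sol), rhs)
@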

Additional operators and functions for matrix algebra like cross-product (\code{crossprod()}) and Cholesky root (\code{chol()}) are available in base \Rlang. Packages, including \pkgname{matrixStats}, provide additional functions and operators for matrices.
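
As a hedged illustration of these two functions, with a small made-up matrix \code{mat.x}: \code{crossprod(mat.x)} computes the same product as \code{t(mat.x) \%*\% mat.x}, and \code{chol()} returns an upper triangular matrix whose cross-product recovers its argument, which must be symmetric and positive definite.

<<matrix-crossprod-chol-sketch>>=
mat.x <- matrix(c(1, 2, 3, 4, 5, 7), ncol = 2)
mat.xtx <- crossprod(mat.x)
all.equal(mat.xtx, t(mat.x) %*% mat.x)
mat.root <- chol(mat.xtx) # upper triangular
mat.root
all.equal(crossprod(mat.root), mat.xtx)
@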

<<include=FALSE>>=
rm(list = setdiff(ls(pattern="*"), to.keep))
@

\index{matrices|)}

\section{Factors}\label{sec:calc:factors}
\index{factors|(}
\index{categorical variables|see{factors}}\qRclass{factor}

In data analysis and Statistics, the distinction between values measured on continuous vs.\ discrete \emph{scales} is crucial. On a continuous scale, any value within a range is in theory possible. On a discrete scale, the observations take values from a limited set of categories.

In contrast to other statistical software in which a variable is set as continuous or discrete when defining a model to be fitted or when setting up a test, in \Rlang this distinction is based on whether the explanatory variable is \code{numeric} (continuous) or a \code{factor} (discrete). This approach makes sense because, in most cases, whether an explanatory variable should be considered categorical depends on the quantity stored and/or the design of the experiment or survey. In other words, being categorical is a property of the data. The order of the levels in an unordered \code{factor} does not affect simple calculations or the values plotted, but as we will see in chapters \ref{chap:R:statistics} and \ref{chap:R:plotting}, it can affect the contrasts used by some tests of significance, and the arrangement or positions of the levels along axes and keys in plots.

In an \Rlang \code{factor}, values indicate discrete unordered categories, most frequently the treatments in an experiment, or categories in a survey. Factors can be created either from numerical or character vectors. The different possible values are called \emph{levels}. Factors created with \Rfunction{factor()} are by default unordered or categorical. \Rlang also supports \code{ordered} factors, created with function \Rfunction{ordered()}, which has an identical user interface, or by passing \code{ordered = TRUE} in a call to \Rfunction{factor()}. The distinction, however, mainly affects how they are interpreted in statistical tests as discussed in chapter \ref{chap:R:statistics}.\index{factors!ordered}

When using \Rfunction{factor()} or \Rfunction{ordered()} we create a factor from a vector, but this vector can be created on-the-fly and remain anonymous, as shown in this example. When the vector is \code{numeric} and no labels are supplied, level labels are character strings matching the numbers. By default, the levels are ordered according to the sorted unique values of the vector: numerically for \code{numeric} vectors and alphabetically for \code{character} vectors.

<<factors-1>>=
factor(x = c(1, 2, 2, 1, 2, 1, 1))
ordered(x = c(1, 2, 2, 1, 2, 1, 1))
factor(x = c(1, 2, 2, 1, 2, 1, 1), ordered = TRUE)
@
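
One practical difference, shown in the sketch below with made-up values, is that the levels of an \code{ordered} factor have a rank, so comparison operators such as \code{<} can be used with them. Function \Rfunction{is.ordered()} queries whether a factor is ordered.

<<factors-ordered-sketch>>=
fct.unord <- factor(c("low", "high", "low"), levels = c("low", "high"))
fct.ord <- ordered(c("low", "high", "low"), levels = c("low", "high"))
is.ordered(fct.unord)
is.ordered(fct.ord)
class(fct.ord)
fct.ord < "high" # compares positions in the vector of levels
@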

\begin{explainbox}
When the pattern of levels is regular, it is possible to use function \Rfunction{gl()}, \emph{generate levels}, to construct a factor. Nowadays, it is usual to read data into \Rlang from files in which the treatment codes are already available as character strings or numeric values. However, when we need to create a factor within \Rlang, \Rfunction{gl()} can save some typing. In this case, instead of passing a vector as argument, we pass a \emph{recipe} to create it: \code{n} is the number of levels, \code{k} the number of contiguous repeats (called ``replicates'' in \Rlang documentation), and \code{length} the length of the factor to be created.

<<factors-bx-01>>=
gl(n = 2, k = 5, labels = c("A", "B"))
@

<<factors-bx-01a>>=
gl(n = 2, k = 1, length = 10, labels = c("A", "B"))
@
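
A call to \Rfunction{gl()} can be thought of as shorthand for building the vector with \Rfunction{rep()} and passing it to \Rfunction{factor()} together with the levels. The sketch below, using labels chosen only for illustration, compares the two approaches.

<<factors-gl-rep-sketch>>=
fct.gl <- gl(n = 3, k = 2, labels = c("a", "b", "c"))
fct.rep <- factor(rep(c("a", "b", "c"), each = 2),
                  levels = c("a", "b", "c"))
fct.gl
all(fct.gl == fct.rep)
identical(levels(fct.gl), levels(fct.rep))
@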
\end{explainbox}

It is always preferable to use meaningful labels for levels, even if \Rlang does not require it. Here the vector is stored in a variable named \code{vct1}. In a real data analysis situation, in most cases, the vector would have been read from a file on disk and would be longer.

<<factors-2>>=
vct1 <- c("treated", "treated", "control", "control", "control", "treated")
factor(vct1)
@

The ordering of levels is established at the time a factor is created and by default it is alphabetical. This default ordering of levels is frequently not the one needed. We can pass an argument to parameter \code{levels} of function \Rfunction{factor()} to set a different ordering of the levels.

<<factors-3>>=
factor(x = vct1, levels = c("treated", "control"))
@
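
A word of caution, sketched below with a deliberately mistyped level: values in the vector that match none of the values passed to \code{levels} are silently converted into \code{NA}. The names used are only for illustration.

<<factors-levels-typo-sketch>>=
vct.trt <- c("treated", "control", "treated")
# "Control" with a capital C does not match "control"
fct.typo <- factor(vct.trt, levels = c("treated", "Control"))
fct.typo
is.na(fct.typo)
@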

The\index{factors!labels}\index{factors!levels} labels (``names'') of the levels can be set when calling \Rfunction{factor()}. Two vectors are passed as arguments to parameters \code{levels} and \code{labels} with levels and matching labels in the same position. The argument passed to \code{levels} determines the order of the levels based on their old names or values, and the argument passed to \code{labels} gives new names to the levels.\label{par:calc:factor:rename:levels}

<<factors-4>>=
factor(x = c("a", "a", "b", "b", "b", "a"), levels = c("a", "b"), labels = c("treated", "control"))
@

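The argument passed to \code{labels} can be a named vector whose names indicate which of the values stored in the vector passed as the argument to \code{x} each new label replaces (see named vectors and mapping on page \pageref{par:calc:vector:map}). As labels are matched to levels by position, the members of the named vector should be listed in the same order as the levels.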

<<factors-4a>>=
factor(x = c("a", "a", "b", "b", "b", "a"), labels = c(a = "treated", b = "control"))
@
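
Be aware that, as far as I can tell from the behaviour of \Rfunction{factor()}, labels are assigned to levels by position and the names are not used for matching. In the sketch below, which reuses the values from the example above, the named vector is also given in the reverse order, and the labels are swapped accordingly. Thus, the names are a useful annotation only when the members are listed in the same order as the levels.

<<factors-labels-position-sketch>>=
vct.ab <- c("a", "a", "b", "b", "b", "a")
factor(x = vct.ab, labels = c(a = "treated", b = "control"))
# same names in reverse order: "a" is now labelled "control"
factor(x = vct.ab, labels = c(b = "control", a = "treated"))
@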

In the examples above, we passed a numeric vector or a character vector as an argument for parameter \code{x} of function \Rfunction{factor()}. It is also possible to pass a \code{factor} as an argument to parameter \code{x}. This makes it possible to modify the ordering of levels or replace the labels in a factor.

<<factors-5>>=
fct1 <- factor(x = vct1)
fct1
factor(x = fct1, levels = c("treated", "control"))
factor(x = fct1, labels = c(control = "cooled", treated = "heated"))
factor(x = fct1,
       levels = c("treated", "control"),
       labels = c("heated", "cooled"))
@

\textbf{Merging factor levels.}\index{factors!merge levels} We use \Rfunction{factor()} as shown below, setting the same label for the levels we want to merge.

<<factors-eb3>>=
fct2 <- gl(4, 3, labels = c("A", "F", "B", "Z"))
fct2
factor(fct2,
       levels = c("A", "B", "F", "Z"),
       labels = c("A", "B", "C", "C"))
@

\begin{playground}
  Edit the code in the chunk above to use only a named vector for \code{labels} instead of separate vectors passed to \code{levels} and \code{labels}.
\end{playground}
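
An alternative way of merging levels is to assign a named list to the levels of an existing factor, where each name is a new label and each member lists the old levels to be merged. This is a sketch based on the behaviour documented in the help page for \Rfunction{levels()}; the factor \code{fct.mrg} is created here only for illustration.

<<factors-merge-list-sketch>>=
fct.mrg <- gl(4, 3, labels = c("A", "F", "B", "Z"))
levels(fct.mrg) <- list(A = "A", B = "B", C = c("F", "Z"))
fct.mrg
levels(fct.mrg)
@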

We can use indexing on factors in the same way as with vectors. In the next example, we use a test returning a logical vector to extract all ``controls''. We use function \Rfunction{levels()} to look at the levels of the factor and, as with vectors, \code{length()} to query the number of values stored.

<<factors-6>>=
fct1
levels(fct1)
length(fct1)
fct1.control <- fct1[fct1 == "control"]
fct1.control
levels(fct1.control) # same as in fct1
length(fct1.control) # shorter than fct1
@

\begin{faqbox}{How to drop unused levels in a factor?}
  It can be seen above that subsetting does not drop unused factor levels. Constructor function \code{factor()} can be used to explicitly drop the unused factor levels.\index{factors!drop unused levels}

<<factor-drop-faq>>=
fct1.control <- factor(fct1.control)
levels(fct1.control) # the unused level was dropped
@
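
Two alternatives from base \Rlang are sketched below with a small factor created for illustration: function \Rfunction{droplevels()}, and passing \code{drop = TRUE} when subsetting.

<<factor-droplevels-sketch>>=
fct.all <- factor(c("treated", "control", "control"))
fct.sub <- fct.all[fct.all == "control"]
levels(fct.sub)             # the unused level is retained
levels(droplevels(fct.sub)) # the unused level is dropped
levels(fct.all[fct.all == "control", drop = TRUE])
@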
\end{faqbox}

\begin{faqbox}{How to convert a factor into a vector with matching values?}
This operation is not obvious, especially when the factor was created from a \code{numeric} vector.

<<factors-7>>=
vct3 <- rep(3:5, 4)
vct3
fct3 <- factor(vct3)
fct3
as.numeric(fct3)
as.numeric(as.character(fct3))
@
\end{faqbox}

\begin{explainbox}
\textbf{Why is a double conversion needed?}\index{factors!convert to numeric} Internally, factor values are stored as running integers starting from one, each distinct integer value corresponding to a level. These underlying integer values are returned by \Rfunction{as.numeric()} when applied to a factor. The labels of the factor levels are always stored as character strings, even when these characters are digits. In contrast to \Rfunction{as.numeric()}, \Rfunction{as.character()} returns the character labels of the levels for each of the values stored in the factor. If these character strings represent numbers, they can be converted, in a second step, using \Rfunction{as.numeric()} into the original numeric values. Use of \code{class} and \code{mode} is described in section \ref{sec:rlang:mode} on page \pageref{sec:rlang:mode}, and \code{str()} on page \pageref{par:calc:str}.

<<factors-eb2>>=
class(fct3)
mode(fct3)
str(fct3)
@

\end{explainbox}
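
The same conversion can be done by indexing the numeric version of the levels with the factor, an idiom suggested in the help page for \Rfunction{factor()}. The sketch below, with made-up values, compares the three approaches.

<<factors-to-numeric-sketch>>=
fct.num <- factor(c(10, 20, 20, 5))
levels(fct.num)
as.numeric(fct.num)                  # the underlying integers
as.numeric(as.character(fct.num))    # the original values
as.numeric(levels(fct.num))[fct.num] # the original values
@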

\begin{playground}
Create a factor with levels labelled with words. Create another factor with the levels labelled with the same words, but ordered differently. After this convert both factors to numeric vectors using \Rfunction{as.numeric()}. Explain why the two numeric vectors differ or not from each other.
\end{playground}

\begin{explainbox}
\textbf{Safely reordering and renaming factor levels.}\index{factors!reorder levels} The simplest approach is to use \Rfunction{factor()} and its \code{levels} parameter as shown on page \pageref{par:calc:factor:rename:levels}. In these more advanced examples, we use \Rfunction{levels()} to retrieve the names of the levels from the factor itself to protect from possible bugs due to typing mistakes, or to changes in the naming conventions used.

Reverse previous order using \Rfunction{rev()}.

<<factors-10>>=
fct4 <- factor(c("treated", "treated", "control", "control", "control", "treated"))
levels(fct4)
fct4 <- factor(fct4, levels = rev(levels(fct4)))
levels(fct4)
@%
\pagebreak

Sort in decreasing order, i.e., opposite to default.

<<factors-11>>=
fct5 <- factor(fct4,
               levels = sort(levels(fct4), decreasing = TRUE))
levels(fct5)
@

Alter ordering using subscripting; especially useful with three or more levels.

<<factors-12>>=
fct6 <- factor(fct4, levels = levels(fct4)[c(2, 1)])
levels(fct6)
@

Reordering the levels of a factor based on summary quantities from data stored in a numeric vector is very useful, especially when plotting. Function \Rfunction{reorder()} can be used in this case. It defaults to using \code{mean()} for summaries, but other suitable summary functions, such as \code{median()}, can be supplied in its place.

<<factors-13>>=
fct7 <- gl(2, 5, labels = c("A", "B"))
vct4 <- c(5.6, 7.3, 3.1, 8.7, 6.9, 2.4, 4.5, 2.1, 1.4, 2.0)
fct7
fct7ord <- reorder(fct7, vct4)
levels(fct7ord)
fct7rev <- reorder(fct7, -vct4) # a simple trick: change sign
levels(fct7rev)
@

In the last statement, using the unary negation operator, which is vectorised, allows us to easily reverse the ordering of the levels, while still using the default function, \code{mean()}, to summarise the data.
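
As a sketch of how the choice of summary function can matter, the made-up data below include one extreme value, so that ordering the levels by \code{mean()} and by \code{median()} gives different results. The summary function is passed as an argument to parameter \code{FUN}.

<<factors-reorder-median-sketch>>=
fct.grp <- gl(2, 3, labels = c("A", "B"))
vct.val <- c(1, 2, 30, 4, 5, 6)
tapply(vct.val, fct.grp, mean)
tapply(vct.val, fct.grp, median)
levels(reorder(fct.grp, vct.val))
levels(reorder(fct.grp, vct.val, FUN = median))
@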

\end{explainbox}

\begin{advplayground}\label{calc:ADVPG:order:sort}
\textbf{Reordering factor values.}\index{factors!reorder values}\index{factors!arrange values} It is possible to arrange the values stored in a factor either alphabetically according to the labels of the levels or according to the order of the levels. (The use of \code{rep()} is explained on page \pageref{pg:seq:rep}.)

<<factors-ADVPG-11a, eval=eval_playground>>=
# gl() keeps order of levels
FCT1 <- gl(4, 3, labels = c("A", "F", "B", "Z"))
FCT1
as.integer(FCT1)
@

<<factors-ADVPG-11b, eval=eval_playground>>=
# factor() orders levels alphabetically
FCT2 <- factor(rep(c("A", "F", "B", "Z"), times = rep(3, times = 4))) # nested calls
FCT2
as.integer(FCT2)
levels(FCT2)[as.integer(FCT2)]
@

We see above that the integer values by which levels in a factor are stored, are equivalent to indices or ``subscripts'' referencing the vector of labels. Function \Rfunction{sort()} operates on the values' underlying integers and sorts according to the order of the levels while \Rfunction{order()} operates on the values' labels and returns a vector of indices that arrange the values alphabetically.

<<factors-ADVPG-12, eval=eval_playground>>=
sort(FCT2)
FCT2[order(FCT2)]
FCT2[order(as.integer(FCT2))]
@

Run the examples in the chunk above and work out why the results differ.
\end{advplayground}

\begin{explainbox}
  Factors encode levels as \code{integer} values in a vector. In many cases, statistical computations require the same information to be encoded as binary values using multiple \emph{dummy variables}. Factors are much friendlier for the user to manage. They are converted into the equivalent dummy variables when a model formula is translated into a \emph{model matrix}. This is handled transparently by most functions implementing fitting of statistical models to data (see sections \ref{sec:stat:mf} and \ref{sec:stat:formulas} on pages \pageref{sec:stat:mf} and \pageref{sec:stat:formulas}).
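
  As a glimpse of this conversion, the sketch below calls \Rfunction{model.matrix()} directly on a one-sided formula, using a small factor created only for illustration. With the default contrasts, the first level is absorbed into the intercept and the second level is encoded by a column of zeros and ones.

<<factors-dummy-sketch>>=
fct.trt <- factor(c("control", "treated", "control", "treated"))
mat.model <- model.matrix(~ fct.trt)
mat.model
@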
\end{explainbox}
\index{factors|)}

<<factors-cleanup, include=FALSE>>=
rm(list = setdiff(ls(pattern="*"), to.keep))
@
% !Rnw root = appendix.main.Rnw

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_wide)
opts_knit$set(concordance=TRUE)
# opts_knit$set(unnamed.chunk.label = 'calculator-chunk')
@

\section{Further Reading}
For\index{further reading!using the R language} further reading on the aspects of \Rlang discussed in the current chapter, I suggest the book \citetitle{Matloff2011} \autocite{Matloff2011}.

<<calculator-chapter-cleanup, include=FALSE>>=
rm(list = setdiff(ls(pattern="*"), to.keep))
@

<<eval=eval_diag, include=eval_diag, echo=eval_diag, cache=FALSE>>=
knitter_diag()
R_diag()
other_diag()
@