R.stats.rnw

% !Rnw root = appendix.main.Rnw
<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_narrow_square)
opts_knit$set(concordance=TRUE)
opts_knit$set(unnamed.chunk.label = 'functions-chunk')
rm(plot)
@

\chapter{Base \Rlang: ``Verbs'' and ``Nouns'' for Statistics}\label{chap:R:statistics}

\begin{VF}
The purpose of computing is insight, not numbers.

\VA{Richard W. Hamming}{\emph{Numerical Methods for Scientists and Engineers}, 1987}\nocite{Hamming1987}
\end{VF}

\section{Aims of This Chapter}

This chapter aims to give the reader an introduction to the approach used in base \Rlang for the computation of statistical summaries, the fitting of models to observations and tests of hypothesis. This chapter does \emph{not} explain data analysis methods, statistical principles or experimental designs. There are many good books on the use of \Rpgrm for different kinds of statistical analyses (see further reading on page \pageref{sec:stat:further:reading}) but most of them tend to focus on specific statistical methods rather than on the commonalities among them. Although base \R's model fitting functions target specific statistical procedures, they use a common approach to model specification and for returning the computed estimates and test outcomes. This approach, also followed by many contributed extension packages, can be considered as part of the philosophy behind the \Rlang language. In this chapter, you will become familiar with the approaches used in \Rlang for calculating statistical summaries, generating (pseudo-)random numbers, sampling, fitting models, and carrying out tests of significance. We will use linear correlation, \emph{t}-test, linear models, generalised linear models, non-linear models, and some simple multivariate methods as examples. The focus is on how to specify statistical models, contrasts and observations, how to access different components of the objects returned by the corresponding fit and summary functions, and how to use these extracted components in further computations or for customised printing and formatting.

%\emph{At present I use several examples adapted from the help pages for the functions described. I may revise this before publication.}
\section{Statistical Summaries}
\index{summaries!statistical|(}
Being the main focus of the \Rlang language in data analysis and statistics, \Rlang provides functions both for simple and complex calculations, going from means and variances to fitting very complex models. Table \ref{tab:stat:summaries} lists some frequently used functions.  All these methods accept numeric vectors and/or matrices as arguments. In addition, function \Rfunction{quantile()} can be used to simultaneously compute multiple arbitrary quantiles for a vector of observations, and method \Rfunction{summary()} produces a summary that depends on the class of the argument passed to it. (See section \ref{sec:functions:sem} on page \pageref{sec:functions:sem} for how to define your own functions.)

\begin{table}
  \centering
  \caption[Simple statistical summaries.]{Frequently used simple statistical summaries and the corresponding \Rlang functions.\vspace{1ex}}\label{tab:stat:summaries}
  \begin{tabular}{llll}
     \toprule
     Function & Symbol & Formulation & Name \\
     \midrule
     \Rfunction{mean()} & $\bar{x}$ & $\sum x / n$ & mean \\
     \Rfunction{var()} & $s^2$ & $\sum (x_i - \hat{x})^2 / (n - 1)$ & sample variance \\
     \Rfunction{sd()} & $s$ & $\sqrt[2]{s^2}$ & sample standard deviation \\
     \Rfunction{median()} & M or $\tilde{x}$ &  & median \\
     \Rfunction{mad()} & MAD & median $|x_i - \hat{x}|$ & median absolute deviation \\
     \Rfunction{mode()} & MOD &  & mode \\
     \Rfunction{max()} & $x_\mathrm{max}$ &  & maximum \\
     \Rfunction{min()} & $x_\mathrm{min}$ &  & minimum \\
     \Rfunction{range()} & $x_\mathrm{min}, x_\mathrm{max}$ &  & range \\
     \bottomrule
   \end{tabular}
\end{table}

By default, if the argument contains \code{NAs} these functions return \code{NA}. The logic behind this is that if one value exists but is unknown, the true result of the computation is unknown (see page \pageref{par:special:values} for details on the role of \code{NA} in \Rlang). However, an additional parameter called \code{na.rm} allows us to override this default behaviour by requesting any \code{NA} in the input to be removed (or discarded) before calculating the summary,

<<stat-fun-1>>=
x <- c(1:20, NA)
mean(x)
mean(x, na.rm = TRUE)
@

Function \Rfunction{mean()} can be used to compute the mean from all values, as in the example above, as well as trimmed means, i.e., means computed after discarding extreme values. The argument passed to parameter \code{trim} decides the fraction of the observations to discard at \emph{each extreme} of the vector of values after ordering them from smallest to largest.

<<stat-fun-2>>=
x <- c(1:20, 100)
mean(x)
mean(x, trim = 0.05)
@

\begin{playground}
  In contrast to the use of other functions, I do not provide examples of the use of all the functions listed in Table \ref{tab:stat:summaries}. Construct \code{numeric} vectors with artificial data or use real data to play with the remaining functions. Study the help pages to learn about the different parameters and their uses.% Later in the book, only the output from certain examples will be shown, with the expectation, that other examples will be run by readers.
\end{playground}

Other more advanced functions are also available in \Rlang, such as \Rfunction{boxplot.stats()} that computes the values needed to draw boxplots (see section \ref{sec:boxplot} on page \pageref{sec:boxplot}).

In many cases, you will want to compute statistical summaries by group or treatment in addition or instead of for a whole data set or vector. See section \ref{sec:calc:df:aggregate} on page \pageref{sec:calc:df:aggregate} for details on how to compute summaries of data stored in data frames using base \Rlang functions, and section \ref{sec:dplyr:manip} on page \pageref{sec:dplyr:manip} for alternative functions from contributed packages.
\index{summaries!statistical|)}

\section{Standard Probability Distributions}\label{sec:prob:dist}
\index{probability distributions!standard|(}%
\index{probability distributions!theoretical|see{--- standard}}%
\index{Normal distribution}%
Density functions, probability distribution functions, quantile functions, and functions for pseudo-random draws are available in \Rlang for several different standard (theoretical) probability distributions. Entering \code{help(Distributions)} at the \Rlang prompt will open a help page describing all the distributions available in base \Rlang. For each distribution, the different functions contain the same ``root'' in their names: \code{norm} for the normal distribution, \code{unif} for the uniform distribution, and so on. The ``head'' of the name indicates the type of values returned: ``\code{d}'' for density, ``\code{q}'' for quantile, ``\code{r}'' (pseudo-)random draws, and ``\code{p}'' for probability (Table \ref{tab:prob:funs}).

\begin{table}
  \centering
    \caption[Standard probability distributions]{Standard probability distributions in \Rlang. Partial list of base \Rlang functions related to probability distributions. The full list can be obtained by executing the command \code{help(Distributions)}.\vspace{1ex}}\label{tab:prob:funs}

  \begin{tabular}{llllll}
     \toprule
     Distribution & Symbol & Density & $P$-value & Quantiles & Draws \\
     \midrule
     Normal & $N$ & \Rfunction{dnorm()} & \Rfunction{pnorm()} & \Rfunction{qnorm()} & \Rfunction{rnorm()} \\
     Student's & $t$ & \Rfunction{dt()} & \Rfunction{pt()} & \Rfunction{qt()} & \Rfunction{rt()}\\
     F & $F$ & \Rfunction{df()} & \Rfunction{pf()} & \Rfunction{qf()} & \Rfunction{rf()} \\
     binomial & $B$ & \Rfunction{dbinom()} & \Rfunction{pbinom()} & \Rfunction{qbinom()} & \Rfunction{rbinom()} \\
     multinomial & $M$ & \Rfunction{dmultinom()} & \Rfunction{pmultinom()} & \Rfunction{qmultinom()} & \Rfunction{rmultinom()} \\
     Poisson &   & \Rfunction{dpois()} & \Rfunction{ppois()} & \Rfunction{qpois()} & \Rfunction{rpois()} \\
     $\Chi$-squared & $\Chi^2$ & \Rfunction{dchisq()} & \Rfunction{pchisq()} & \Rfunction{qchisq()} & \Rfunction{rchisq()} \\
     lognormal &  & \Rfunction{dlnorm()} & \Rfunction{plnorm()} & \Rfunction{qlnorm()} & \Rfunction{rlnorm()} \\
     uniform &  & \Rfunction{dunif()} & \Rfunction{punif()} & \Rfunction{qunif()} & \Rfunction{runif()} \\
     \bottomrule
   \end{tabular}
\end{table}

Theoretical distributions are defined by mathematical functions that accept parameters that control the exact shape and location. In the case of the Normal distribution, these parameters are the \emph{mean} (\code{mean}) controlling the location center and \emph(standard deviation) (\code{sd}) controlling the spread away from the center of the distribution. The four different functions differ in which values are calculated (the unknowns) and which values are supplied as arguments (the known inputs).

In what follows, I use the Normal distribution as an example, but with differences in their parameters, the functions for other theoretical distributions follow a similar naming pattern.

\subsection{Density from parameters}\label{sec:prob:dens}
\index{probability distributions!density from parameters}
To obtain a single point from the distribution curve we pass a vector of length one as an argument for \code{x}.
<<distrib-01>>=
dnorm(x = 1.5, mean = 1, sd = 0.5)
@

To obtain multiple values we can pass a longer vector as an argument.

<<distrib-01a>>=
dnorm(x = seq(from = -1, to = 1, length.out = 5), mean = 1, sd = 0.5)
@

With 50 equally spaced values for $x$ we can plot a line (\code{type = "l"}) that shows that the 50 generated data points give the illusion of a continuous curve. We also add a point showing the value for $x = 1.5$ calculated above.

<<distrib-01b>>=
vct1 <- seq(from = -1, to = 3, length.out = 50)

df1 <- data.frame(x = vct1,
                  y = dnorm(x = vct1, mean = 1, sd = 1))
plot(y~x, data = df1, type = "l", xlab = "z", ylab = "f(z)")
points(x = 2, y = dnorm(x = 2, mean = 1, sd = 1))
@

\subsection{Probabilities from parameters and quantiles}\label{sec:prob:quant}
\index{probability distributions!probabilities from quantiles}

With a known quantile value, it is possible to look up the corresponding $P$-value from the Normal distribution, i.e., the area under the curve, either to the right or to the left of a given value of \code{q} (by default, integrating the lower or left tail). When working with observations, the quantile, mean and standard deviation are in most cases computed from the same observations under the null hypothesis. In the example below, we use invented values for all parameters \code{q}, the quantile, \code{mean}, and \code{sd}, the standard deviation.

<<distrib-02>>=
pnorm(q = 2, mean = 1, sd = 1)
pnorm(q = 2, mean = 1, sd = 1, lower.tail = FALSE)
pnorm(q = 2, mean = 1, sd = 4, lower.tail = FALSE)
pnorm(q = c(2, 4), mean = 1, sd = 1, lower.tail = FALSE)
@

\begin{explainbox}
  In tests of significance, empirical $z$-values and $t$-values are computed by subtracting from the observed mean for one group or raw quantile, the ``expected'' mean (a hypothesised theoretical value, the mean of a control condition used as a reference, or the mean computed over all treatments under the assumption of no effect of treatments) and then dividing this difference by the standard deviation. Consequently, the $p$-values corresponding to these empirical $z$-values and $t$-values need to be looked up using \code{mean = 0} and \code{sd = 1} when calling \Rfunction{pnorm()} or \Rfunction{pt()} respectively. These frequently used values are the defaults.
\end{explainbox}

\subsection{Quantiles from parameters and probabilities}\label{sec:quant:prob}
\index{probability distributions!quantiles from probabilities}

The reverse computation from that in the previous section is to obtain the quantile corresponding to a known $P$-value or area under one of the tails of the distribution curve. These quantiles are equivalent to the values in the tables of precalculated quantiles used in earlier times to assess significance with statistical tests.

<<distrib-03>>=
qnorm(p = 0.01, mean = 0, sd = 1)
qnorm(p = 0.05, mean = 0, sd = 1)
qnorm(p = 0.05, mean = 0, sd = 1, lower.tail = FALSE)
@

\begin{warningbox}
Quantile functions like \Rfunction{qnorm()} and probability functions like \Rfunction{pnorm()} always do computations based on a single tail of the distribution, even though it is possible to specify which tail we are interested in. If we are interested in obtaining simultaneous quantiles for both tails, we need to do this manually. If we are aiming at quantiles for $P = 0.05$, we need to find the quantile for each tail based on $P / 2 = 0.025$.

<<distrib-WB01>>=
qnorm(p = 0.025, mean = 0, sd = 1)
qnorm(p = 0.025, mean = 0, sd = 1, lower.tail = FALSE)
@

When calculating a $P$-value from a quantile computed from observations in a test of significance, we need to first decide whether a two-sided or single-sided test is relevant, and in the case of a single sided test, which tail is of interest. In a two-sided test we need to multiply the returned $P$-value by 2. This works in the case of a symmetric distribution like the Normal, because the quantiles in the two tails differ only in sign. However, this is not the case for asymmetric distributions.

<<distrib-WB02>>=
pnorm(q = 4, mean = 0, sd = 1) * 2
@

\end{warningbox}

\subsection{``Random'' draws from a distribution}\label{sec:stat:random}
\index{random draws|see{probability distributions, pseudo-random draws}}\index{probability distributions!pseudo-random draws}

True random sequences can only be generated by physical processes. All ``pseudo-random'' sequences of numbers generated by computation are really deterministic although they share many properties with true random sequences (e.g., in relation to autocorrelation).

It is possible to compute not only pseudo-random draws from a uniform distribution but also from the Normal, $t$, $F$, and other distributions. In each case, the probability with which different values are ``drawn'' approximates the probabilities set by the corresponding theoretical distribution. Parameter \code{n} indicates the number of values to be drawn, or its equivalent, the length of the vector returned (see section \ref{sec:plot:histogram} on page \pageref{sec:plot:histogram} for example plots).\qRfunction{rnorm()}\qRfunction{runif()}%

<<distrib-04,echo=2:3>>=
set.seed(12234)
rnorm(5)
rnorm(n = 10, mean = 10, sd = 2)
@

\begin{playground}
Edit the examples in sections \ref{sec:prob:quant}, \ref{sec:quant:prob}, and \ref{sec:stat:random} to do computations based on different distributions, such as Student's \emph{t}, \emph{F} or uniform.
\end{playground}

\begin{explainbox}
\index{random numbers|see{pseudo-random numbers}}\index{pseudo-random numbers}
It is impossible to generate truly random sequences of numbers by means of a deterministic process such as a mathematical computation. ``Random numbers'' as generated by \Rpgrm and other computer programs are \emph{pseudo-random numbers}, long deterministic series of numbers that resemble random draws. Random number generation uses a \emph{seed} value that determines where in the series the first value is fetched. The usual way of automatically setting the value of the seed is to take the milliseconds or a similar rapidly changing set of digits from the real-time clock of the computer. However, in cases when we wish to repeat a calculation using the same series of pseudo-random values, we can use \Rfunction{set.seed()} with an arbitrary integer as an argument to reset the generator to the same point in the underlying (deterministic) sequence.
\end{explainbox}

\begin{advplayground}
Execute the statement \code{rnorm(3)}\qRfunction{rnorm()} by itself several times, paying attention to the values obtained. Repeat the exercise, but now executing \code{set.seed(98765)}\qRfunction{set.seed()} immediately before each call to \code{rnorm(3)}, again paying attention to the values obtained. Next execute \code{set.seed(98765)}, followed by \code{c(rnorm(3), rnorm(3))}, and then execute \code{set.seed(98765)}, followed by \code{rnorm(6)} and compare the output. Repeat the exercise using a different argument in the call to \code{set.seed()}. analyse the results and explain how \code{setseed()} affects the generation of pseudo-random numbers in \Rlang.
\end{advplayground}
\index{probability distributions!standard|)}

\section{Observed Probability Distributions}
\index{probability distributions!observed|(}%
\index{empirical probability distributions|see{probability distributions, observed}}%
It is common to estimate the value of the parameters for a standard distribution like Student's $t$ or Normal distributions from observational data, assuming a priori the suitability of the distribution. If we compute the mean and standard deviation of a large sample, these two parameters define a specific Normal distribution curve. If we add the estimate of the degrees of freedom, $v = n - 1$, the three parameters define a specific $t$-distribution curve. Thus it is possible to use the functions described in section \ref{sec:prob:dist} on page \ref{sec:prob:dist}, in statistical inference.

\begin{explainbox}
Package \pkgname{mixtools} provides tools for fitting and analysing \emph{mixture models} such as the mix of two or more univariate Normal distributions. An example of its use could be to estimate mean and standard deviations for males and females in a dataset where the gender was not recorded at the time of observation.
\end{explainbox}

It is also possible to describe the observed shape of the distribution, or empirical distribution, for a data set without relying on a standard distribution. The fitted empirical distribution can later be used to compute probabilities, quantiles, and random draws as from standard distributions. This also allows statistical inference, using methods such as the bootstrap or some additive models.

Function \Rfunction{density()} computes kernel density estimates, using different methods. A curve is used to describe the shape, and the bandwidth determines how flexible this curve is. The curve is a smoother that adapts to the observed shape of the distribution of observations. The object returned is a complex list that can be used to plot the estimated curve.

The code below estimates the empirical distribution for the waiting time in minutes between eruptions of the Old Faithful geyser at Yellowstone, a dataset from \Rlang.

<<density-01>>=
d <- density(faithful$waiting, bw = "sj")
@

\begin{explainbox}
Using \Rfunction{str()} we can explore the structure of the object returned by function \Rfunction{density()}.

<<density-02>>=
str(d)
@

The object saved as \code{d} is a \code{list} with seven members. The two numeric vectors, \code{x} and \code{y} describe the estimated probability distribution and produce the curve in the plot below. The numerical bandwidth estimated using method \code{"sj"} is in \code{bw}, and the length of vector \code{faithful\$waiting}, the data used, is in \code{n}. Member \code{call} is the command used to call the function, the remaining two members have self-explanatory names. The returned object belongs to class \Rclass{density}. The overall pattern is similar, but simpler than for the model fitting functions that we will see later in the chapter. The class name of the object is the same as the name of the function that created it, \code{call} provides a \emph{trace} of how the object was created. Other members, facilitate computation of derived quantities and plotting. Being a list, the individual members can be extracted by name.

<<density-03>>=
d$n
@
\end{explainbox}

As a \Rmethod{plot()} method is available for class \Rclass{density} we can easily produce a plot of the estimated empirical density distribution. In this case, the fitted bimodal curve, with two maxima, is very different to the Normal.

<<desnity-04>>=
plot(d)
@

Observed probability distributions, especially empirical ones, nowadays play a central role in data visualisation including 1D and 2D empirical density plots based on the use of functions like \Rfunction{density()}, as well as traditional histograms (see section \ref{sec:plot:density} on page \pageref{sec:plot:density} for examples of more elaborate and elegant plots).
\index{probability distributions!observed|)}

\section{``Random'' Sampling}
\index{random sampling|see{pseudo-random sampling}}%
\index{pseudo-random sampling|(}%

In addition to drawing values from a theoretical distribution, we can draw values from an existing set or collection of values. We call this operation (pseudo-)random sampling. The draws can be done either with replacement or without replacement. In the second case, all draws are taken from the whole set of values, making it possible for a given value to be drawn more than once. In the default case of not using replacement, subsequent draws are taken from the values remaining after removing the values chosen in earlier draws.

<<sampling-01>>=
sample(x = LETTERS)
sample(x = LETTERS, size = 12)
sample(x = LETTERS, size = 12, replace = TRUE)
@

In practice, pseudo-random sampling is useful when we need to select subsets of observations. One such case is assigning treatments to experimental units in an experiment or selecting persons to interview in a survey. Another use is in bootstrapping to estimate variation in parameter estimates using empirical distributions.

\begin{faqbox}{How to sample random rows from a data frame?}
As described in section \ref{sec:R:data:frames} on page \pageref{sec:R:data:frames}, data frames are commonly used to store one observation per row. To sample a subset of rows, we need to generate a random set of indices to use with the extraction operator (\Roperator{[ ]}). Here we sample four rows from data frame \code{cars} included in \Rlang. These data consist of stopping distances for cars moving at different speeds as described in the documentation available by entering \code{help(cars)}).

<<sampling-02>>=
cars[sample(x = 1:nrow(cars), size = 4), ]
@

\end{faqbox}

\begin{advplayground}
Consult the documentation of \Rfunction{sample()} and explain why the code below is equivalent to that in the example immediately above.

<<sampling-03, eval=eval_playground>>=
cars[sample(x = nrow(cars), size = 4), ]
@

\end{advplayground}
\index{pseudo-random sampling|)}%

\section{Correlation}\label{sec:stats:correlation}
\index{correlation|(}
Both parametric (Pearson's) and non-parametric robust (Spearman's and Kendall's) methods for the estimation of the (linear) correlation between pairs of variables are available in base \Rlang. The different methods are selected by passing arguments to a single function. While Pearson's method is based on the actual values of the observations, non-parametric methods are based on the ordering or rank of the observations, and consequently less affected by observations with extreme values.

\subsection{Pearson's $r$}
\index{correlation!parametric}
\index{correlation!Pearson}

Function \Rfunction{cor()} can be called with two vectors of the same length as arguments. In the case of the parametric Pearson method, we do not need to provide further arguments as this method is the default one.  We use data set \code{cars}.

<<cor-01>>=
cor(x = cars$speed, y = cars$dist)
@

It is also possible to pass a data frame (or a matrix) as the only argument. When the data frame (or matrix) contains only two columns, the returned correlation estimate is equivalent to that of passing the two columns individually as vectors. The object returned is a $2 \times 2$ \code{matrix} instead of a vector of length one.

<<cor-02>>=
cor(cars)
@

When the data frame or matrix contains more than two numeric vectors, the returned value is a matrix of estimates of pairwise correlations between columns. We here use \Rfunction{rnorm()} described above to create a long vector of pseudo-random values drawn from the Normal distribution and \Rfunction{matrix()} to convert it into a matrix with three columns (see page \pageref{sec:matrix:array} for details about \Rlang matrices).

<<cor-02a>>=
mat1 <- matrix(rnorm(54), ncol = 3,
               dimnames = list(rows = 1:18, cols = c("A", "B", "C")))
cor(mat1)
@

\begin{playground}
Modify the code in the chunk immediately above constructing a matrix with six columns and then computing the correlations.
\end{playground}

While \Rfunction{cor()} returns an estimate of $r$, the correlation coefficient, \Rfunction{cor.test()} computes, in addition, the $t$-value, $P$-value, and confidence interval for the estimate.

<<cor-03>>=
cor.test(x = cars$speed, y = cars$dist)
@

Above we passed two numeric vectors as arguments, one to parameter \code{x} and one to parameter \code{y}. Alternatively, we can pass a data frame as an argument to \code{data}, and a \emph{model formula} to \code{formula}. The argument passed to \code{formula} determines which variables from \code{data} are used, and in which role. Briefly, the variable(s) to the left of the tilde (\code{~}) are response variables, and those to the right are independent, or explanatory, variables. In the case of correlation, no assumption is made on cause and effect, and both variables appear to the right of the tilde. The code below is equivalent to that above. See section \ref{sec:stat:formulas} on page \pageref{sec:stat:formulas} for details on the use of model formulas and section \ref{sec:stat:mf} on page \pageref{sec:stat:mf} for examples of their use in model fitting.

<<cor-03a, eval=eval_playground>>=
cor.test(formula = ~ speed + dist, data = cars)
@

\begin{playground}
Functions \Rfunction{cor()} and \Rfunction{cor.test()} return \Rlang objects, that when using \Rlang interactively get automatically ``printed'' on the screen. One should be aware that \Rfunction{print()} methods do not necessarily display all the information contained in an \Rlang object. This is almost always the case for complex objects like those returned by \Rlang functions implementing statistical tests. As with any \Rlang object, we can save the result of an analysis into a variable. As described in section \ref{sec:calc:lists} on page \pageref{sec:calc:lists} for lists, we can peek into the structure of an object with method \Rfunction{str()}. We can use \Rfunction{class()} and \Rfunction{attributes()} to extract further information. Run the code in the chunk below to discover what is actually returned by \Rfunction{cor()}.

<<cor-PG01, eval=eval_playground>>=
MAT1 <- cor(cars)
class(MAT1)
attributes(MAT1)
str(MAT1)
@

Methods \Rfunction{class()}, \Rfunction{attributes()} and \Rfunction{str()} are very powerful tools that can be used when we are in doubt about the data contained in an object and/or how it is structured. Knowing the structure allows us to retrieve the data members directly from the object when predefined extractor methods are not available.
\end{playground}

\subsection{Kendall's $\tau$ and Spearman's $\rho$}
\index{correlation!non-parametric}
\index{correlation!Kendall}
\index{correlation!Spearman}

We use the same functions as for Pearson's $r$ but explicitly request the use of one of these methods by passing an argument.

<<cor-04>>=
cor(x = cars$speed, y = cars$dist, method = "kendall")
cor(x = cars$speed, y = cars$dist, method = "spearman")
@

Function \Rfunction{cor.test()}, described above, also allows the choice of method with the same syntax as shown for \Rfunction{cor()}.

\begin{playground}
Repeat the exercise in the playground immediately above, but now using non-parametric methods. How does the information stored in the returned \code{matrix} differ depending on the method, and how can we extract from the returned object information about the method used for the calculation of the correlation?
\end{playground}
\index{correlation|)}

\section{$t$-test}\label{sec:stats:ttest}
\index{t-test@$t$-test|(}%
\index{Student's t-test@Student's $t$-test|see{$t$-test}}%
The $t$-test is based on Student's $t$-distribution. It can be applied to any parameter estimate for which its standard deviation is available, and the $t$-distribution is a plausible assumption. It is most frequently used to compare an estimate of the mean against a constant value, or the estimate of a difference between two means and a target difference, usually no difference. In \Rlang these can be computed manually using functions \Rfunction{mean()}, \Rfunction{sd()}, and \Rfunction{pt()} or with \Rfunction{t.test()}.

Although rarely presented in such a way, the $t$-test can be thought of as a special case of a linear model fit. Consistently with functions used to fit models to observations, we can use a \emph{formula} to describe a $t$-test. A formula such as \code{y\,\char"007E\,x} is read as $y$ is explained by $x$. We use \emph{lhs} (left-hand-side) and \emph{rhs} (right-hand-side) to signify all terms to the left and right of the tilde (\code{\,\char"007E\,}), respectively (\code{<lhs>\,\char"007E\,<rhs>}). (See section \ref{sec:stat:formulas} on page \pageref{sec:stat:formulas} for a detailed discussion of model formulas, and section \ref{sec:stat:mf} on page \pageref{sec:stat:mf} for examples of their use in model fitting.)

<<ttest-01>>=
df1 <- data.frame(some.size = c(rnorm(10, mean = 2.5), rnorm(10, mean = 2.0)),
                  group = factor(rep(c("A", "B"), each = 10)))
@

The formula \code{some.size\,\char"007E\,1} is read as ``the mean of variable \code{some.size} is explained by a constant value''. The value estimated from observations, $\bar{x}$, is compared against the value of $\mu$ set as the null hypothesis, where $\mu$ is the \emph{unknown} mean of the sampled population. By default, \Rfunction{t.test()} applies a two-sided test (\code{alternative = "two.sided"}) against \code{mu = 0}, but here we use \code{mu = 2} instead.

<<ttest-02>>=
t.test(some.size ~ 1, mu = 2, data = df1)
@

The same test can be calculated step by step. In this case, this approach is not needed, but it is useful when we have a parameter estimate (not just mean) and its standard error available, as in model fits (see the advanced playground on page \pageref{box:stats:slope:ttest} for an example).

<<ttest-02a>>=
sem = sqrt(var(df1$some.size) / nrow(df1))
t.value = (mean(df1$some.size) - 2) / sem # Ho: mu = 2
p.value <- pt(t.value, df = nrow(df1) - 1, lower.tail = FALSE) * 2 # two tails
signif(c(t = t.value, df = nrow(df1) - 1, P = p.value), 4) # 4 digits
@

The same function, with a different formula, tests for the difference between the means of two groups or treatments, $H_o: {\mu}_A - {\mu}_B = 0$. We read the formula \code{some.size\,\char"007E\,group} as ``differences in \code{some.size} are explained by factor \code{group}''. The difference between the means for the two groups is estimated and compared against the hypothesis. (In this case, the value of the argument passed to \code{mu}, zero by default, describes this difference.) By default, variances in the two groups are not to assumed equal,

<<ttest-03>>=
t.test(some.size ~ group, data = df1)
@

\noindent
but with \code{var.equal = TRUE}, the variances in the populations from which observations in  groups A and B were sampled are assumed equal, and pooled into a single estimate.

<<ttest-03a>>=
t.test(some.size ~ group, var.equal = TRUE, data = df1)
@

The $t$-test serves as an example of how statistical tests are usually carried out in \Rlang. Table \ref{tab:stats:tests} lists \Rlang functions for frequently used statistical tests.

\begin{table}
  \caption[Statistical tests]{\Rlang functions implementing frequently used statistical tests. Student's $t$-test and correlation tests are described on pages \pageref{sec:stats:ttest} and \pageref{sec:stats:correlation}, respectively.}\vspace{1ex}\label{tab:stats:tests}

  \centering\noindent
   \begin{tabular}{ll}
     \toprule
     Statistical test & Function name \\
     \midrule
     Student's $t$-test (1 and 2 samples) & \code{t.test()} \\
     Wilcoxon rank sum and signed rank tests & \code{wilcox.test} \\
     Kolmogorov-Smirnov tests & \code{ks.test()} \\
     Correlation tests (Pearson, Kendall, Spearman) & \code{cor.test()} \\
     $F$-test to compare two variances & \code{var.test()} \\
     Fisher's exact test for count data & \code{fisher.test()} \\
     Pearson's Chi-squared ($\chi^2$) test for count data & \code{chisq.test()} \\
     Exact Binomial Test & \code{binom.test()} \\
     Test of equal or given proportions & \code{prop.test()} \\
     \bottomrule
   \end{tabular}
\end{table}

\index{t-test@$t$-test|)}

\section{Model Fitting in \Rlang}\label{sec:stat:mf}
\index{models!fitting|(}
The general approach to model fitting in \Rlang is to separate the actual fitting of a model from the inspection of the fitted model. A model fitting function minimally requires a description of the model to fit, as a model \code{formula} and a data frame or vectors with the data or observations to which to fit the model. These functions in \Rlang return a model-fit object. This object contains the data, the model formula, the call, and the result of fitting the model. Several methods are available for querying it. The diagram in Figure \ref{fig:model:fit:diagram} summarises the approach used in \Rlang for data analysis based on fitted models.

\begin{figure}
  \centering
\begin{small}
\begin{tikzpicture}[node distance=1.4cm, scale=0.5]
\node (model) [tprocess] {\textsl{model specification}};
\node (data) [tprocess, below of=model] {\textsl{observations}};
\node (fitfun) [tprocess, right of=model, yshift=-0.7cm, xshift=2.5cm] {\textsl{fitting function}};
\node (fm) [tprocess, color = black, right of=fitfun, xshift=1.5cm] {\textsl{fitted model}};
\node (summary) [tprocess, color = black, right of=fm, xshift=1.7cm] {\textsl{query methods}};
\draw [arrow] (model) -- (fitfun);
\draw [arrow] (data) -- (fitfun);
\draw [arrow] (fitfun) -- (fm);
\draw [arrow] (fm) -- (summary);
\end{tikzpicture}
\end{small}
  \caption[Model fitting in \Rlang]{Model fitting in \Rlang is done in steps, and can be represented schematically as a flow of information.}\label{fig:model:fit:diagram}
\end{figure}

Models are described using model formulas such as \code{y\,\char"007E\,x} which we read as $y$ is explained by $x$. We use \emph{lhs} (left-hand-side) and \emph{rhs} (right-hand-side) to signify all terms to the left and right of the tilde (\code{\,\char"007E\,}), respectively (\code{<lhs>\,\char"007E\,<rhs>}). Model formulas are used in different contexts: fitting of models, plotting, and tests like $t$-test. The syntax of model formulas is consistent throughout base \Rlang and numerous independently developed packages. However, their use is not universal, and several packages extend the basic syntax to allow the description of specific types of models. As most things in \Rlang, model formulas are objects and can be stored in variables. See section \ref{sec:stat:formulas} on page \pageref{sec:stat:formulas} for a detailed discussion of model formulas.

Although there is some variation, especially for fitted model classes defined in extension packages, in most cases, the \textsl{query functions} bulked together in the rightmost box in the diagram in Figure \ref{fig:model:fit:diagram} include \Rfunction{summary()}, \Rfunction{anova()} and \Rfunction{plot()}, with other methods such as \Rfunction{coef()}, \Rfunction{residuals()}, \Rfunction{fitted()}, \Rfunction{predict()}, \Rfunction{AIC()}, and \Rfunction{BIC()} usually also available. Additional methods may be available. However, as model fit objects are \code{list}-like, these and other values can be extracted and/or computed programmatically when needed. The examples in this chapter can be adapted to the fitting of types of models not described in this book.

\begin{explainbox}
  Fitted model objects in \Rlang are self contained and include a copy of the data to which the model was fit, as well as residuals and possibly even intermediate results of computations. Although this can make the size of these objects large, it allows querying and even updating them in the absence of the data in the current \Rlang workspace.
\end{explainbox}

\section{Fitting Linear Models}\label{sec:stat:LM}
\index{models!linear|see{linear models}}
\index{linear models|(}
\index{LM|see{linear models}}

Regression, analysis of variance (ANOVA) and analysis of covariance (ANCOVA) are all linear models, differing only on the type of explanatory variables included in the statistical model fitted. If in the fitted model all explanatory variables are continuous, i.e., \code{numeric}, vectors, the model is a regression model. If all explanatory variables are discrete, i.e., \code{factors}, the model is ANOVA. Finally, if the model contains but \code{numeric} variables and \code{factors} it is named ANCOVA. As in all cases the fitting approach is the same, based on ordinary least squares (OLS), in \Rlang, they are all implemented in function \Rfunction{lm()}.

There is another meaning of ANOVA, referring only to the tests of significance rather than to an approach to model fitting. Consequently, rather confusingly, results for tests of significance can both in the case of regression, ANOVA and ANCOVA, be presented in an ANOVA table. In this second, stricter meaning, ANOVA means a test of significance based on the ratios between pairs of variances.

\begin{warningbox}
If you do not clearly remember the difference between numeric vectors and factors, or how they can be created, please, revisit chapter \ref{chap:R:as:calc} on page \pageref{chap:R:as:calc}.
\end{warningbox}

\begin{figure}
  \centering
\begin{small}
\begin{tikzpicture}[node distance=1.4cm, scale=0.5]
\node (model) [tprocess] {\textsl{model} $\to$ \code{formula}};
\node (data) [tprocess, below of=model, yshift = 0.4cm] {\textsl{observations} $\to$ \code{data}};
\node (weights) [tprocess, dashed, below of=data, fill=blue!1, yshift = 0.4cm] {\textsl{weights} $\to$ \code{weights}};
\node (fitfun) [tprocess, right of=data, xshift=2.5cm, fill=blue!5] {\code{lm()}};
\node (fm) [tprocess, color = black, right of=fitfun, xshift=1.5cm, fill=blue!5] {\code{lm} \textsl{object}};
\node (summary) [tprocess, color = black, right of=fm, xshift=1.7cm] {\code{summary()}};
\node (anova) [tprocess, color = black, below of=summary, yshift = 0.4cm] {\code{anova()}};
\node (plot) [tprocess, color = black, above of=summary, yshift = -0.4cm] {\code{plot()}};
\draw [arrow] (model) -- (fitfun);
\draw [arrow] (data) -- (fitfun);
\draw [arrow, dashed] (weights) -- (fitfun);
\draw [arrow] (fitfun) -- (fm);
\draw [arrow] (fm) -- (plot);
\draw [arrow] (fm) -- (anova);
\draw [arrow] (fm) -- (summary);
\end{tikzpicture}
\end{small}
  \caption[Linear model fitting in \Rlang]{Linear model fitting in \Rlang is done in steps. The generic diagram from Figure \ref{fig:model:fit:diagram} redrawn to show a linear model fit. Non-filled boxes are shared with the fitting of other types of models, and filled ones are specific to \Rfunction{lm()}. Only the three most frequently used query methods are shown, while  both response and explanatory variables are under \textsl{observations}. Dashed boxes and arrows are optional as defaults are provided.}\label{fig:lm:fit:diagram}
\end{figure}

Figure \ref{fig:lm:fit:diagram} shows the steps needed to fit a linear model and extract the estimates and test results. The observations are stored in a data frame, one case or event per row, with values for both response and explanatory variables in variables or columns. The model formula is used to indicate which variables in the data frame are to be used and in which role: either
response or explanatory, and when explanatory how they contribute to the estimated response. The object containing the results from the fit is queried to assess validity and make conclusions or predictions.

\begin{explainbox}
Weights are multiplicative factors used to alter the \emph{weight} given to individual residuals when fitting a model to observations that are not equally informative. A frequent case is fitting a model to observations that are means of drastically different numbers of individual measurements. Some model fit functions compute the weights, but in most cases they are supplied as an argument to parameter \code{weights}. By default, \code{weights} have a value of \code{1} and thus do not affect the resulting model fit, when supplied or computed, the weights are saved to the model fit object.
\end{explainbox}

\subsection{Regression}\label{sec:stat:LM:regression}
%\index{linear regression}
\index{linear regression|see{linear models, linear regression}}%
\index{linear models!linear regression|(}%
The \Rdata{cars} data set, containing two \code{numeric} variables, is used in the examples.
A simple linear model $y = \alpha \cdot 1 + \beta \cdot x$ where $y$ corresponds to stopping distance (\code{dist}) and $x$ to initial speed (\code{speed}) is formulated in \Rlang as \code{dist \char"007E\ 1 + speed}. The fitted model object is assigned to variable \code{fm1} (a mnemonic for fitted-model one).\label{chunk:lm:models1}

<<models-1>>=
fm1 <- lm(dist ~ 1 + speed, data=cars)
class(fm1)
@

The next step is diagnosis of the fit. Are assumptions of the linear model procedure used reasonably close to being fulfilled? In \Rlang it is most common to use plots to this end. We show here only one of the plots normally produced. This quantile vs.\ quantile plot is used to assess how much the distribution of the residuals deviates from the assumed Normal distribution.

<<models-1a>>=
plot(fm1, which = 2)
@

In the case of a regression, calling \Rfunction{summary()} with the fitted model object as argument is most useful as it provides a table of coefficient estimates and their errors. Remember that as is the case for most \Rlang functions, the value returned by \Rfunction{summary()} is printed when we call this method at the \Rlang prompt.

<<models-1b>>=
summary(fm1)
@

The summary\index{linear models!summary table} is organised in sections. ``Call:'' shows \code{dist\ \char"007E\ 1 + speed} or the specification of the model fitted, plus the data used. ``Residuals:'' displays the extremes, quartiles and median of the residuals, or deviations between observations and the fitted line. ``Coefficients:'' contains estimates of the model parameters and their variation plus corresponding $t$-tests. In the last three lines, there is information on overall standard error and its degrees of freedom and overall coefficient of determination ($R^2$) and $F$-statistic.

Replacing $\alpha$ and $\beta$ in $y = \alpha \cdot 1 + \beta \cdot x$ by the estimates for the intercept, $a = -17.6$, and slope, $b = 3.93$, we obtain an estimate for the regression line $y = -17.6 + 3.93 x$. However, given the nature of the problem, we \emph{know based on first principles} that the stopping distance must be zero when speed is zero. This suggests that we should not estimate the value of $\alpha$ but instead set $\alpha = 0$, or in other words, fit the model $y = \beta \cdot x$.

In \Rlang models, the intercept is included by default, so the model fitted above can be formulated as \code{dist\ \char"007E\ speed}---i.e., the missing \code{+ 1} does not change the model. To exclude the intercept, we need to specify the model as \code{dist\ \char"007E\ speed - 1} (or its equivalent \code{dist\ \char"007E\ speed + 0}), for a straight line passing through the origin ($x = 0$, $y = 0$). In the summary for this model there is an estimate for the slope but not for the intercept.

<<models-2>>=
fm2 <- lm(dist ~ speed - 1, data = cars)
summary(fm2)
@

The equation for \code{fm2} is $y = 2.91 x$. From the residuals, it can be seen that it is inadequate, as the straight line does not follow the curvature of the cloud of observations.

\begin{playground}
You will now fit a second-degree polynomial\index{linear models!polynomial regression}\index{polynomial regression}, a different linear model: $y = \alpha \cdot 1 + \beta_1 \cdot x + \beta_2 \cdot x^2$. The function used is the same as for linear regression, \Rfunction{lm()}. We only need to alter the formulation of the model. The identity function \Rfunction{I()} is used to protect its argument from being interpreted as part of the model formula. Instead, its argument is evaluated beforehand and the result is used as the, in this case second, explanatory variable.\label{chunk:stats:fm3}

<<models-3, eval=eval_playground>>=
fm3 <- lm(dist ~ speed + I(speed^2), data = cars)
plot(fm3, which = 3)
summary(fm3)
@

The ``same'' fit using an orthogonal polynomial can be specified using function \Rfunction{poly()}. Polynomials of different degrees can be obtained by supplying as the second argument to \Rfunction{poly()} the corresponding positive integer value. In this case, the different terms of the polynomial are bulked together in the summary.

<<models-3a, eval=eval_playground>>=
fm3a <- lm(dist ~ poly(speed, 2), data = cars)
summary(fm3a)
@

It is possible to compare two model fits using \Rfunction{anova()}, testing whether one of the models describes the data better than the other. It is important in this case to take into consideration the nature of the difference between the model formulas, most importantly if they can be interpreted as nested---i.e., interpreted as a base model vs. the same model with additional terms.

<<models-4, eval=eval_playground>>=
anova(fm2, fm1)
@

Three or more models can also be compared in a single call to \Rfunction{anova()}. However, care is needed, as the order of the arguments matters.

<<models-5, eval=eval_playground>>=
anova(fm2, fm3, fm3a)
anova(fm2, fm3a, fm3)
@

\label{par:stats:AIC}%
\index{Akaike's An Information Criterion@Akaike's \emph{An Information Criterion}}%
\index{AIC|see{\emph{An Information Criterion}}}%
\index{Bayesian Information Criterion@\emph{Bayesian Information Criterion}}%
\index{BIC|see{\emph{Bayesian Information Criterion}}}%
\index{Schwarz's Bayesian Criterion@Schwarz's \emph{Bayesian criterion}|see{\emph{Bayesian Information Criterion}}}%
Different criteria can be used to choose the ``best'' model: significance based on $p$-values or information criteria (AIC, BIC). AIC (Akaike's ``An Information Criterion'') and BIC (``Bayesian Information Criterion'' = SBC, ``Schwarz's Bayesian criterion'') that penalise the resulting ``goodness'' based on the number of parameters in the fitted model. In the case of AIC and BIC, a smaller value is better, and values returned can be either positive or negative, in which case more negative is better. Estimates for both BIC and AIC are returned by \Rfunction{anova()}, and on their own by \Rfunction{BIC()} and \Rfunction{AIC()}

<<models-5a, eval=eval_playground>>=
BIC(fm2, fm1, fm3, fm3a)
AIC(fm2, fm1, fm3, fm3a)
@

Once you have run the code in the chunks above, you will be able see that these three criteria do not necessarily agree on which is the ``best'' model. Find in the output $P$-value, BIC and AIC estimates, for the different models and conclude which model is favoured by each of the three criteria. In addition, you will notice that the two different formulations of the quadratic polynomial are equivalent.

\end{playground}

Additional query methods give easy access to different aspects of fitted models: \Rfunction{vcov()} returns the variance-covariance matrix, \Rfunction{coef()} and its alias \Rfunction{coefficients()} return the estimates for the fitted model coefficients, \Rfunction{fitted()} and its alias \Rfunction{fitted.values()} extract the fitted values, and \Rfunction{resid()} and its alias \Rfunction{residuals()} the corresponding residuals (or deviations) (Figure \ref{fig:lm:fit:query:more}). Less frequently used accessors are \Rfunction{getCall()}, \Rfunction{effects()}, \Rfunction{terms()}, \Rfunction{model.frame()}, and \Rfunction{model.matrix()}.

\begin{figure}
  \centering
\begin{small}
\begin{tikzpicture}[node distance=1.4cm, scale=0.5]
\node (model) [tprocess] {\textsl{model} $\to$ \code{formula}};
\node (data) [tprocess, below of=model, yshift = 0.4cm] {\textsl{observations} $\to$ \code{data}};
\node (weights) [tprocess, dashed, below of=data, fill=blue!1, yshift = 0.4cm] {\textsl{weights} $\to$ \code{weights}};
\node (fitfun) [tprocess, right of=data, xshift=2.5cm, fill=blue!5] {\code{lm()}};
\node (fm) [tprocess, color = black, right of=fitfun, xshift=1.5cm, fill=blue!5] {\code{lm} \textsl{object}};
\node (fitted) [tprocess, color = black, right of=fm, xshift=1.7cm] {\code{fitted()}};
\node (resid) [tprocess, color = black, above of=fitted, yshift = -0.4cm] {\code{residuals()}};
\node (coef) [tprocess, color = black, above of=resid, yshift = -0.4cm] {\code{coef()}};
\node (aic) [tprocess, color = black, below of=fitted, yshift = 0.4cm] {\code{AIC()}; \code{BIC()}};
\node (predict) [tprocess, color = black, below of=aic, yshift = 0.4cm] {\code{predict()}};
\node (newdata) [tprocess, color = black, left of=predict, xshift = -2.4cm] {\textsl{expl.\ vars.} $\to$ \code{newdata}};
\draw [arrow] (model) -- (fitfun);
\draw [arrow] (data) -- (fitfun);
\draw [arrow, dashed] (weights) -- (fitfun);
\draw [arrow] (fitfun) -- (fm);
\draw [arrow] (fm) -- (coef);
\draw [arrow] (fm) -- (resid);
\draw [arrow] (fm) -- (fitted);
\draw [arrow] (fm) -- (aic);
\draw [arrow] (fm) -- (predict);
\draw [arrow] (newdata) -- (predict);
\end{tikzpicture}
\end{small}
  \caption[Linear model fitting, more query methods ]{Diagram including additional methods used to query fitted model objects using linear models as an example. For other details see the legend of Figure \ref{fig:lm:fit:diagram}.}\label{fig:lm:fit:query:more}
\end{figure}

\begin{playground}
Familiarise yourself with these extraction and summary methods by reading their documentation and use them to explore \code{fm1} fitted above or model fits to other data of your interest.
\end{playground}

It is usual\index{linear models!structure of summary object} to only look at the values returned by \Rfunction{anova()} and \Rfunction{summary()} as implicitly displayed by \code{print()}. However, both \Rfunction{anova()} and \Rfunction{summary()} return complex objects, derived from \code{list}, containing some members not displayed by \code{print()}. Access to members of these objects can be necessary to use them in further calculations or to print them in a format different to the default.

\begin{explainbox}\label{box:LM:fit:summary:object}
The class and structure of the objects returned by \Rmethod{summary()} depends on the class of the model fit object, i.e., \Rmethod{summary()} is a generic method with multiple specialisations.

<<models-EB2ca>>=
class(summary(fm1))
@

One case where we need to extract individual members is when adding annotations to plots. Another case is when writing reports to include programmatically the computed values within the text. \Rfunction{str()} can be used to display the structure. Calling \Rfunction{str()} with \code{no.list = TRUE}, \code{give.attr = FALSE} and \code{vec.len = 2} as arguments restricts the output to an overview of the structure of \code{fm1}.

<<models-EB3>>=
str(summary(fm1), no.list = TRUE, give.attr = FALSE, vec.len = 2)
@

Extraction of members follows the usual \Rlang rules using \Roperator{\$}, \Roperator{[ ]}, or \Roperator{[[ ]]}.

<<models-EB4>>=
summary(fm1)$adj.r.squared
@

Estimates of \code{coefficients} are accompanied by estimates of their standard errors, \emph{t}-values and \emph{P}-values, while in the model object \code{fm1} these are not included.

<<models-EB3b>>=
coef(fm1)
str(fm1$coefficients)
print(summary(fm1)$coefficients)
str(summary(fm1)$coefficients)
@
\end{explainbox}

\begin{explainbox}\label{box:LM:anova:object}
The class of the object returned by method \code{anova()} does not depend on the class of the model fit object, while its structure does depend.

<<models-EB2>>=
anova(fm1)
@

<<models-EB2a>>=
class(anova(fm1))
@

<<models-EB2b>>=
str(anova(fm1))
@
\end{explainbox}

\begin{explainbox}\label{box:stats:slope:ttest}
As\index{linear models!ad-hoc tests for parameters}\index{t-test@$t$-test|(}\index{calibration curves} an example of the use of values extracted from the \code{summary.lm} object, I show how to test if the slope from a linear regression fit deviates significantly from a constant value different from the usual zero, which tests for the presence of an ``effect'' of the explanatory variable. When testing for deviations from a calibration by comparing two instruments or an instrument and a reference, a null hypothesis of one for the slope tests for deviations from the true readings. In some cases, when comparing the effectiveness of interventions we may be interested to test if a new approach surpasses that in current use by at least a specific margin. There exist practical situations where testing if a response exceeds a threshold is of interest.

A \emph{t}-value can be computed for the slope as for the mean. When using \Rfunction{anova()} and \Rfunction{summary()} the null hypothesis is no effect or no response, i.e., slope = 0. The equivalent test with a null hypothesis of slope = 1 is easy to implement if we consider how we calculate a $t$-value (see section \ref{sec:stats:ttest} on page \pageref{sec:stats:ttest}). To compute the \emph{t}-value we need an estimate for the slope and an estimate of its standard error. To look up the $P$-value, we need the degrees of freedom. All these values are available as members of the summary object of a fitted model.

<<models-EB5>>=
est.slope.value <- summary(fm1)$coefficients["speed", "Estimate"]
est.slope.se <- summary(fm1)$coefficients["speed", "Std. Error"]
degrees.of.freedom <- summary(fm1)$df[2]
@

The estimate of the $t$-value, or quantile, is computed based on the difference between the estimate for the slope and a null hypothesis used as reference, and the standard error of the estimated slope. A probability is obtained based on the computed $t$-value, or quantile, and the $t$ distribution with matching degrees of freedom with a call to \Rfunction{pt()} (see section \ref{sec:prob:dist} on page \pageref{sec:prob:dist}.) For a two-tail test we multiply by two the one-tail $P$ estimate.

<<models-EB6>>=
hyp.null <- 1
t.value <- (est.slope.value - hyp.null) / est.slope.se
p.value <- 2 * pt(q = t.value, df = degrees.of.freedom, lower.tail = FALSE)
cat("slope =", signif(est.slope.value, 3),
    "with s.e. =", signif(est.slope.se, 3),
    "\nt.value =", signif(t.value, 3),
    "and P-value =", signif(p.value, 3))
@

This example is for a linear model fitted with function \Rfunction{lm()} but the same approach can be applied to other model fit procedures for which parameter estimates and their corresponding standard error estimates can be extracted or computed.
\end{explainbox}

\begin{advplayground}
Check that the computations above after replacing \code{hyp.null <- 1} by \code{hyp.null <- 0} agree with the output of printing \code{summary()}.

Modify the example above so as to test whether the intercept is significantly larger than 5 feet, doing a one-sided test.

\end{advplayground}
\index{t-test@$t$-test|)}

Method \Rfunction{predict()} uses the fitted model together with new data for the independent variables to compute predictions. As \Rfunction{predict()} accepts new data as input, it allows interpolation and extrapolation to values of the independent variables not present in the original data. In the case of fits of linear and some other models, method \Rfunction{predict()} returns, in addition to the prediction, estimates of the confidence and/or prediction intervals. The new data must be stored in a data frame with columns using the same names for the explanatory variables as in the data used for the fit, a response variable is not needed and additional columns are ignored. (The explanatory variables in the new data can be either continuous or factors, but they must match in this respect those in the original data.)

\begin{playground}
Predict using both \code{fm1} and \code{fm2} the distance required to stop cars moving at 0, 5, 10, 20, 30, and 40~mph. Study the help page for the \Rfunction{predict()} method for linear models (using \code{help(predict.lm)}). Explore the difference between \code{"prediction"} and \code{"confidence"} bands: why are they so different?
\end{playground}
\index{linear models!linear regression|)}%

\begin{explainbox}\label{box:LM:fit:object}
The\index{linear models!structure of model fit object} objects returned by model fitting functions contain the full information, including the data to which the model was fit to. Their structure resembles a nested list. In most cases, the class of the objects returned by model fit functions agrees in name with the name of the function (\code{"lm"} in this example) but is not derived from \code{"list"}. The query functions, either extract parts of the object or do additional calculations and formatting based on them. Different specialisations of these methods are called depending on the class of the model fit object. (See section \ref{sec:methods} on page \pageref{sec:methods}.)

<<models-EB0>>=
class(fm1)
names(fm1)
@

The structure of model fit objects is of interest only when the query or accessor functions do not provide the needed information and components have to be extracted using operator \Roperator{[[ ]]}. Exploring these objects is also a way of learning how model fitting works in \Rlang. As with any other objects, \Rfunction{str()} shows the structure.

<<models-EB1>>=
str(fm1, no.list = TRUE, give.attr = FALSE, vec.len = 2)
@

Member \code{call} contains the function call and arguments used to create object \code{fm1}.

<<models-EB1b>>=
str(fm1$call)
@
\end{explainbox}

\subsection{Analysis of variance, ANOVA}\label{sec:anova}
%\index{analysis of variance}
\index{analysis of variance|see{linear models, analysis of variance}}%
\index{ANOVA|see{linear models, analysis of variance}}%
\index{linear models!analysis of variance|(}%
In ANOVA, the explanatory variable is categorical, and in \Rlang, must be a \code{factor} or \code{ordered} factor (see section \ref{sec:calc:factors} on page \pageref{sec:calc:factors}). As a linear model, the fitting approach is the same as for linear and polynomial regression (Figure \ref{fig:lm:fit:diagram}).
The \Rdata{InsectSprays} data set used in the next example gives insect counts in plots sprayed with different insecticides. In these data, \code{spray} is a factor with six levels.%
\label{xmpl:fun:lm:fm4}

What determines that this is an ANOVA is that \code{spray}, the explanatory variable, is a \code{factor}.

<<models-6z>>=
data(InsectSprays)
is.numeric(InsectSprays$spray)
is.factor(InsectSprays$spray)
levels(InsectSprays$spray)
@

By using a factor instead of a numeric vector, a different model matrix is built from an equivalent formula.\label{chunk:stat:fm4}

<<models-6>>=
fm4 <- lm(count ~ spray, data = InsectSprays)
@

Diagnostic plots are obtained in the same way as for linear regression. We show only the quantile-quantile plot for simplicity, but during data analysis it is very important to check all the diagnostics plots. As many of the residuals deviate from the one-to-one line we have to conclude the residuals do not follow the Normal distribution, and a different approach to model fitting should be used (see section \ref{sec:stat:GLM} on page \pageref{sec:stat:GLM}).

<<model-6a>>=
plot(fm4, which = 2)
@

In ANOVA,\index{F-test@$F$-test} most frequently the interest is in testing hypotheses with function \Rfunction{anova()}, which implements the $F$-test for the main effects of factors and their interactions. In this example, with a single explanatory variable, there is only one effect of interest, that of \code{sprays}.

<<model-6b>>=
anova(fm4)
@

\index{models!contrasts|(}
\begin{warningbox}
Function \Rfunction{summary()} can be used to extract parameter estimates informing of the size of the effects, but meaningfully only by using contrasts different to the default ones.
Function \Rfunction{aov()} is a wrapper on \Rfunction{lm()} that returns an object that by default displays the output of \Rfunction{anova()} also with \Rfunction{summary()}, but even in this case it can be preferable to change the default contrasts (see \code{help(aov)}).

The contrasts used affect the estimates returned by \Rfunction{coef()} and \Rfunction{summary()} applied to an ANOVA model fit. The default used in \Rlang, \Rfunction{contr.treatment()}, is different to that used in \Slang, \Rfunction{contr.helmert}. With \Rfunction{contr.treatment} the first level of the factor (assumed to be a control) is used as a reference for the estimation of coefficients for the remaining factor levels and testing of their significance. With \Rfunction{contr.helmert} the contrasts are of the second level with the first, the third with the average of the first two, and so on. These contrasts depend on the order of factor levels. Instead, \code{contr.sum} uses as reference the mean of all levels, i.e., using as a condition that the coefficient estimates add up to zero. Obviously what type of contrast is used changes what the coefficient estimates describe, and, consequently, how the $p$-values should be interpreted.
\end{warningbox}

\begin{explainbox}
The approach used by default for model fits and ANOVA calculations varies among programs. There exist different so-called ``types'' of sums of squares, usually called I, II, and III. In orthogonal designs, the choice is of no consequence, but differences can be important for unbalanced designs, even leading to different conclusions. \Rlang's default, type~I, is usually considered to suffer milder problems than type~III, the default used by \pgrmname{SPSS} and \pgrmname{SAS}. In any case, for unbalanced data it is preferable to use the approach implemented in package \pkgname{nlme}.
\end{explainbox}

\begin{explainbox}
The most straightforward way of setting a different default for contrasts for a whole series of model fits is by setting \Rlang option \code{contrasts}, which we here only print.

<<contrasts-01>>=
options("contrasts")
@

The option is set to a named character vector of length two, with the first value, named \code{unordered} giving the name of the function used when the explanatory variable is an unordered \code{factor} (created with \Rfunction{factor()}) and the second value, named \code{ordered}, giving the name of the function used when the explanatory variable is an \code{ordered} factor (created with \Rfunction{ordered()}).

It is also possible to select the contrast to be used in the call to \code{aov()} or \code{lm()}.

<<models-6c>>=
fm4trea <- lm(count ~ spray, data = InsectSprays,
              contrasts = list(spray = contr.treatment))
fm4sum  <- lm(count ~ spray, data = InsectSprays,
              contrasts = list(spray = contr.sum))
@

In \code{fm4trea} we used \Rfunction{contr.treatment()}, thus contrasts for individual treatments are done against \code{Spray1} taking it as the control or reference, as can be inferred from the generated contrasts matrix. For this reason, there is no row for \code{Spray1} in the summary table. Each of the rows \code{Spray2} to \code{Spray6} is a test comparing these treatments individually against \code{Spray1}.

<<contrasts-02>>=
contr.treatment(length(levels(InsectSprays$spray)))
@

<<models-6d>>=
summary(fm4trea)
@

In \code{fm4sum} we used \Rfunction{contr.sum()} with the sum constrained to be zero, thus estimates for the last treatment level are determined by the sum of the previous ones, and not tested for significance.

<<contrasts-03>>=
contr.sum(length(levels(InsectSprays$spray)))
@

<<models-6e>>=
summary(fm4sum)
@

\end{explainbox}

\begin{advplayground}
  Explore how taking the last level as reference in \Rfunction{contr.SAS()} instead of the first one as in \Rfunction{contr.treatment()} affects the estimates. Reorder the levels of factor \code{spray} so that the test using \Rfunction{contr.SAS()} becomes equivalent to that obtained above with \Rfunction{contr.treatment()}. Consider why \Rfunction{contr.poly()} is the default for ordered factors and when \Rfunction{contr.helmert()} could be  most useful.
\end{advplayground}

Contrasts, on the other hand, do not affect the table returned by \Rfunction{anova()} as this table does not deal with the effects of individual factor levels. The overall estimates shown at the bottom of the summary table remain unchanged. In other words, when using different contrasts what changes is how the total variation explained by the fitted model is partitioned into components to be tested for specific contributions to the overall model fit.

\begin{explainbox}
\index{tests!adjusted P-values@{adjusted $P$-values}}%
  Post-hoc tests based on specific contrasts\index{tests!post-hoc} and multiple comparisons\index{tests!multiple comparisons} tests are most frequently applied after an ANOVA to test for differences among pairs of treatments or specific combinations of treatments. \Rlang function \Rfunction{Tukey.test()} implements Tukey's HSD\index{tests!Tukey's HSD} (honestly significant difference) test for pairwise tests. Function \Rfunction{pairwise.t.test()} supports different correction methods for the $P$-values from simultaneous $t$-tests. Function \Rfunction{p.adjust()} applies adjustments to $P$-values and can be used when the test procedure does not apply them. The most comprehensive implementation of multiple comparisons is available in package \pkgname{multcomp}. Function \Rfunction{glht()} (general linear hypothesis testing) from this package supports different contrasts and adjustment methods.
\end{explainbox}

Contrasts and their interpretation are discussed in detail by \citeauthor{Venables2002} (\citeyear{Venables2002}) and \citeauthor{Crawley2012} (\citeyear{Crawley2012}).
\index{models!contrasts|)}
\index{linear models!analysis of variance|)}%

\subsection{Analysis of covariance, ANCOVA}
%\index{analysis of covariance}
\index{analysis of covariance|see{linear models, analysis of covariance}}
\index{linear models!analysis of covariance}
\index{ANCOVA|see{linear models, analysis of covariance}}

When a linear model includes both explanatory factors and continuous explanatory variables, we may call it \emph{analysis of covariance} (ANCOVA). The formula syntax is the same for all linear models and, as mentioned in previous sections, what determines the type of analysis is the nature of the explanatory variable(s). As the formulation remains the same, no specific example is given. The main difficulty of ANCOVA is in the selection of the covariate and the interpretation of the results of the analysis, especially, when the covariate is not independent of the treatment described by the factor \autocite[e.g.][]{Smith1957}.
\index{linear models|)}

\subsection{Model update and selection}\label{sec:stat:update:step}
\index{models!updating|(}
Model fit objects can be updated, i.e., modified, because they contain not only the results of the fit but also the data to which the model was fit (see page \pageref{box:LM:fit:object}). Given that the call is also stored, all the information needed to recalculate the same fit is available. Method \Rfunction{update()} makes it possible to recalculate the fit with changes to the call, without passing again all the arguments to a new call to \Rfunction{lm()} (Figure \ref{fig:lm:update:diagram}). We can modify different arguments, including selecting part of the data by passing a new argument to formal parameter \code{subset}.

\begin{figure}
  \centering
 \begin{small}
\begin{tikzpicture}[node distance=1.4cm, scale=0.5]
\node (model) [tprocess] {\textsl{model}};
\node (data) [tprocess, below of=model, yshift=-0.4cm] {\textsl{observations}};
\node (fitfun) [tprocess, right of=model, yshift=-0.9cm, xshift=1cm] {\code{lm()}};
\node (fm) [tprocess, color = black, right of=fitfun, xshift=1.8cm] {\code{lm} \textsl{object} $\to$ \code{object}};
\node (update) [tprocess, color = black, right of=fm, xshift = 1.8cm, fill=blue!5] {\code{update()}};
\node (newmodel) [tprocess, above of=fm, fill=blue!5, yshift=-0.3cm] {\textsl{new model} $\to$ \code{formula}};
\node (newdata) [tprocess, below of=fm, fill=blue!5, yshift=0.3cm] {\textsl{new observs.} $\to$ \code{data}};
\node (newfm) [tprocess, color = black, right of=update, xshift=1.5cm, fill=blue!5] {\code{lm} \textsl{object}};
\draw [arrow] (model) -- (fitfun);
\draw [arrow] (data) -- (fitfun);
\draw [arrow] (fitfun) -- (fm);
\draw [arrow] (fm) -- (update);
\draw [arrow] (newmodel) -- (update);
\draw [arrow] (newdata) -- (update);
\draw [arrow] (update) -- (newfm);
\end{tikzpicture}
\end{small}
  \caption[Updating a fitted model]{Diagram showing the steps for updating a fitted model (in filled boxes) together with the previous steps in unfilled boxes. Please, see Figure \ref{fig:lm:fit:diagram} for other details.}\label{fig:lm:update:diagram}
\end{figure}

Method \Rfunction{update()} retrieves the call from the model fit object using \Rfunction{getCall()}, modifies it and, by default, evaluates it. The default \Rfunction{update()} method works as long as the model-fit object contains a member named \code{call} or if a specialisation of \Rfunction{getCall()} is available. Thus, method \Rfunction{update()} can be used with models fitted with other functions in addition to \Rfunction{lm()}.

For the next example, we recreate the model fit object \code{fm4} from page \pageref{chunk:stat:fm4}.

<<models-update-01>>=
fm4 <- lm(count ~ spray, data = InsectSprays)
anova(fm4)
fm4a <- update(fm4, formula = log10(count + 1) ~ spray)
anova(fm4a)
@

\begin{playground}
Print \code{fm4\$call} and \code{fm4a\$call}. These two calls differ in the argument to \code{formula}. What other members have been updated in \code{fm4a} compared to \code{fm4}?
\end{playground}

In the chunk above we replaced the argument passed to \code{formula}. This is a frequent use, but, for example, to fit the same model to a subset of the data, we can pass a suitable argument to parameter \code{subset}.

<<models-update-02>>=
fm4b <- update(fm4, subset = !spray %in% c("A", "B"))
anova(fm4b)
@

\begin{advplayground}
When having many treatments with long names, which is not the case here, instead of listing the factor levels for which to subset the data, it can be convenient to use regular expressions for pattern matching (see section \ref{sec:calc:regex} on page \pageref{sec:calc:regex}). Run the code below, and investigate why \code{anova(fm4b)} and \code{anova(fm4c)} produce the same ANOVA table printout, but the fit model objects are not identical. You can use \code{str()} to explore if any members differ between the two objects.

<<models-update-03, eval=eval_playground>>=
fm4c <- update(fm4, subset = !grepl("[AB]", spray))
anova(fm4c)
identical(fm4b, fm4c)
@

\end{advplayground}

\begin{explainbox}
Method \Rfunction{update()} plays an additional role when the fitting is done by numerical approximation, as the previously computed estimates are used as the starting values for the numerical calculations required for fitting the updated model (see section \ref{sec:stat:NLS} on page \pageref{sec:stat:NLS} as an example). This can drastically decrease computation time, or even easy the task of finding suitable starting values for parameter estimates by fitting increasingly more complex nested models.
\end{explainbox}

\index{models!updating|)}

\index{models!stepwise selection|(}\index{linear models!stepwise model selection|(}
Method \Rfunction{update()} used together with \Rfunction{AIC()} (or \Rfunction{anova()}) gives us the tools to compare nested models, and select one out of a group as shown above. When comparing several models doing the comparisons manually is tedious, and in scripts, in many cases difficult to write code that is flexible (or abstract) enough.  Method \Rfunction{step()} automates stepwise selection of nested models such as the selection among polynomials of different degrees or which variables to retain in multiple regression. After fitting a model, method \Rfunction{step()} is used to update this model using an automatic stopping criterion (Figure \ref{fig:lm:step:diagram}).

\begin{figure}
  \centering
\begin{small}
\begin{tikzpicture}[node distance=1.4cm, scale=0.5]
\node (model) [tprocess] {\textsl{model}};
\node (data) [tprocess, below of=model, yshift=-0.4cm] {\textsl{observations}};
\node (fitfun) [tprocess, right of=model, yshift=-0.9cm, xshift=1cm] {\code{lm()}};
\node (fm) [tprocess, color = black, right of=fitfun, xshift=1.8cm] {\code{lm} \textsl{object} $\to$ \code{object}};
\node (step) [tprocess, color = black, right of=fm, xshift = 1.8cm, fill=blue!5] {\code{step()}};
\node (newmodels) [tprocess, dashed, above of=fm, fill=blue!5, yshift=-0.3cm] {\textsl{new model(s)} $\to$ \code{scope}};
\node (newfm) [tprocess, color = black, right of=update, xshift=1.5cm, fill=blue!5] {\code{lm} \textsl{object}};
\draw [arrow] (model) -- (fitfun);
\draw [arrow] (data) -- (fitfun);
\draw [arrow] (fitfun) -- (fm);
\draw [arrow] (fm) -- (update);
\draw [arrow, dashed] (newmodels) -- (step);
\draw [arrow] (step) -- (newfm);
\end{tikzpicture}
\end{small}
  \caption[Stepwise model selection]{Diagram showing the steps used for stepwise model selection among nested models (in filled boxes) together with the previous steps in unfilled boxes. The range of models to select from can be set by the user. See Figure \ref{fig:lm:fit:diagram} for other details.}\label{fig:lm:step:diagram}
\end{figure}

Stepwise model selection---either in the \emph{forward} direction from simpler to more complex models, in the backward direction from more complex to simpler models or in both directions---is implemented in base \Rlang's \emph{method} \Rfunction{step()} using Akaike's information criterion (AIC)\index{Akaike's An Information Criterion@Akaike's \emph{An Information Criterion}} as the selection criterion. Use of method \Rfunction{step()} from \Rlang is possible, for example, with \code{lm()} and \code{glm} fits. AIC is described on page \pageref{par:stats:AIC}.

For the next example, we use \code{fm3} from page \pageref{chunk:stats:fm3}, a linear model for a polynomial regression. If, as shown here, no models are passed through formal parameter \code{scope}, the previously fit model will be simplified, if possible. Method \Rfunction{step()} by default prints to the console a trace of the models tried and the corresponding AIC estimates.

<<models-step-01>>=
fm3 <- lm(dist ~ speed + I(speed^2), data = cars)
fm3a <- step(fm3)
@

Method \Rfunction{summary()} reveals the differences between the original and updated models.

<<models-step-01a>>=
summary(fm3)
summary(fm3a)
@

If we pass a model with additional terms through parameter \code{scope} this model will be taken as the most complex model to be assessed. If, instead of one model, we pass two nested models in a list and name them \code{lower} and \code{upper}, they will delimit the scope of the stepwise search. In the next example, we see that first a backward search is done and term \code{speed} is removed because it removal decreases (= improves) AIC. Subsequently, a forward search is done unsuccessfully, as AIC increases.

<<models-step-02>>=
fm3b <-
  step(fm3,
       scope = dist ~ speed + I(speed^2) + I(speed^3) + I(speed^4))
summary(fm3b)
@

\begin{playground}
Explain why the stepwise model selection in the code below differs from those in the two previous examples. Consult \code{help(step)} is necessary.

<<models-step-02a, eval=eval_playground>>=
fm3c <-
  step(fm3,
       scope = list(lower = dist ~ speed,
                    upper = dist ~ speed + I(speed^2) + I(speed^3) + I(speed^4)))
summary(fm3c)
@

\end{playground}

Functions \Rfunction{update()} and \Rfunction{step()} are \emph{convenience functions} as they provide direct and/or simpler access to operations available through other functions or combined use of multiple functions.
\index{linear models!stepwise model selection|)}\index{models!stepwise selection|)}

\section{Generalised Linear Models}\label{sec:stat:GLM}
\index{generalised linear models|(}\index{models!generalised linear|see{generalised linear models}}
\index{GLM|see{generalised linear models}}

Linear models make the assumption of normally distributed residuals. Generalised linear models, fitted with function \Rfunction{glm()}, are more flexible, and allow the assumed distribution to be selected as well as the link function (defaults are as in \code{lm()}). Figure \ref{fig:glm:fit:diagram} shows that the steps used to fit a model with \Rfunction{glm()} are the same as with \code{lm()} except that we can select the probability distribution assumed to describe the variation among observations. Frequently used probability distributions are binomial and Poisson (see \code{help(family)} for the variations and additional ones).\index{models!for binomial outcomes data}\index{models!for counts data}

\begin{figure}
  \centering
\begin{small}
\begin{tikzpicture}[node distance=1.4cm, scale=0.5]
\node (model) [tprocess] {\textsl{model} $\to$ \code{formula}};
\node (data) [tprocess, below of=model, yshift = 0.4cm] {\textsl{observations} $\to$ \code{data}};
\node (weights) [tprocess, dashed, below of=data, fill=blue!1, yshift = 0.4cm] {\textsl{weights} $\to$ \code{weights}};
\node (family) [tprocess, below of=weights, fill=blue!5, yshift = 0.4cm] {\textsl{distribution} $\to$ \code{family}};
\node (fitfun) [tprocess, right of=data, xshift=2.5cm, yshift = -0.4cm,fill=blue!5] {\code{glm()}};
\node (fm) [tprocess, color = black, right of=fitfun, xshift=1.5cm, fill=blue!5] {\code{glm} \textsl{object}};
\node (summary) [tprocess, color = black, right of=fm, xshift=1.7cm] {\code{summary()}};
\node (anova) [tprocess, color = black, below of=summary, yshift = 0.4cm] {\code{anova()}};
\node (plot) [tprocess, color = black, above of=summary, yshift = -0.4cm] {\code{plot()}};
\draw [arrow] (model) -- (fitfun);
\draw [arrow] (data) -- (fitfun);
\draw [arrow, dashed] (weights) -- (fitfun);
\draw [arrow] (family) -- (fitfun);
\draw [arrow] (fitfun) -- (fm);
\draw [arrow] (fm) -- (plot);
\draw [arrow] (fm) -- (anova);
\draw [arrow] (fm) -- (summary);
\end{tikzpicture}
\end{small}
  \caption[Generalised linear model fitting in \Rlang]{Generalised linear model fitting in \Rlang is done in steps similar to those used for linear models. Generic diagram from Figure \ref{fig:model:fit:diagram} redrawn to show a generalised linear model fit. Non-filled boxes are shared with fitting of other types of models, and filled ones are specific to \Rfunction{glm()}. Only the three most frequently used query methods are shown, while  both response and explanatory variables are under \textsl{observations}. Dashed boxes and arrows are optional as defaults are provided.}\label{fig:glm:fit:diagram}
\end{figure}

For count data, GLMs are preferred over LMs. In the example below, we fit the same model as above, but  assuming a quasi-Poisson distribution instead of the Normal. An argument passed to \code{family} selects the assumed error distribution. The \Rdata{InsectSprays} data set used in the next example, gives insect counts in plots sprayed with different insecticides. In these data, spray is a factor with six levels.

<<model-10>>=
fm10 <- glm(count ~ spray, data = InsectSprays, family = quasipoisson)
@

Method \Rfunction{plot()} as for linear-model fits, produces diagnosis plots. We show, as before, the quantile--quantile plot of residuals. The Normal distribution assumed above in the linear model fit was not a good approximation (section \ref{sec:anova} on page \pageref{sec:anova}), as count data are known to follow a different distribution. This is clear by comparing the quantile--quantile plot for \code{fm4} (page \pageref{sec:anova}) and the plot below for the model fit under the assumption of a Quasi-Poisson distribution.

<<model-11>>=
plot(fm10, which = 2)
@

The printout from the \Rfunction{anova()} method for GLM fits has some differences to that for LM fits. In \Rlang versions previous to 4.4.0, no test statistics or $P$-values were computed unless requested by passing an argument to parameter \code{test}. In later versions of \Rlang, either a chi-squared test or an $F$-test are computed by default depending on whether the dispersion is fixed or free. We here use \code{"F"} as an argument to request an $F$-test.

<<model-10a>>=
anova(fm10, test = "F")
@

We can extract different components similarly as described for linear models (see section \ref{sec:stat:LM} on page \pageref{sec:stat:LM}).

<<model-12>>=
class(fm10)
summary(fm10)
head(residuals(fm10))
head(fitted(fm10))
@

\begin{explainbox}
If we use \code{str()} or \code{names()} we can see that there are some differences with respect to linear model fits. The returned object is of a different class and contains some members not present in linear models. Two of these have to do with the iterative approximation method used, \code{iter} contains the number of iterations used  and \code{converged} the success or not in finding a solution.

<<model-13>>=
class(fm10)
names(fm10)
fm10$converged
fm10$iter
@
\end{explainbox}

Methods \code{update()} and \code{step()}, described for \code{lm()} in section \ref{sec:stat:update:step} on page \pageref{sec:stat:update:step}, can be also used with models fitted with \code{glm()}.

\index{generalised linear models|)}

\section{Non-Linear Regression}\label{sec:stat:NLS}
\index{non-linear models|(}%
\index{models!non-linear|see{non-linear models}}%
\index{NLS|see{non-linear models}}

By \emph{non-linear} it is meant non-linear \emph{in the parameters} whose values are being estimated through fitting the model to observations. This is different from the shape of the function when plotted---i.e., polynomials of any degree are linear models. In contrast, the Michaelis-Menten equation used in chemistry and the Gompertz equation used to describe growth are models that are non-linear in their parameters.

While analytical algorithms exist for finding estimates for the parameters of linear models, in the case of non-linear models, the estimates are obtained by approximation. For analytical solutions, estimates can always be obtained (except in pathological cases affected by the limitations of floating point numbers described on page \pageref{box:integer:float}). For approximations obtained through iteration, cases when the algorithm fails to \emph{converge} onto an answer are relatively common. Iterative algorithms attempt to improve an initial guess for the values of the parameters to be estimated, a guess frequently supplied by the user. In each iteration, the estimate obtained in the previous iteration is used as the starting value, and this process is repeated one time after another. The expectation is that after a finite number of iterations the algorithm will converge into a solution that ``cannot'' be improved further. In real life, we stop iteration when the improvement in the fit is smaller than a certain threshold, or when no convergence has been achieved after a certain maximum number of iterations. In the first case, we usually obtain good estimates; in the second case, we do not obtain usable estimates and need to look for different ways of obtaining them.

When convergence fails, the first thing to do is to try different starting values and if this also fails, switch to a different computational algorithm. These steps usually help, but not always. Good starting values are in many cases crucial and in some cases ``guesses'' can be obtained using either graphical or analytical approximations.

Function \Rfunction{nls()} is \Rlang's workhorse for fitting non-linear models. The steps for its use are similar to those for LM and GLM (Figure \ref{fig:nls:fit:diagram}). One difference is that starting values are needed, and another difference is in how the model to be fitted is specified: the user provides the names of the parameters and a model equation that includes in the \emph{rhs} a call to an \Rlang function.

\begin{figure}
  \centering
\begin{small}
\begin{tikzpicture}[node distance=1.4cm, scale=0.5]
\node (model) [tprocess] {\textsl{model} $\to$ \code{formula}};
\node (data) [tprocess, below of=model, yshift = 0.4cm] {\textsl{observations} $\to$ \code{data}};
\node (weights) [tprocess, dashed, below of=data, yshift = 0.4cm, fill=blue!1] {\textsl{weights} $\to$ \code{weights}};
\node (guess) [tprocess, below of=weights, fill=blue!5, yshift = 0.4cm] {\textsl{guesses} $\to$ \code{start}};
\node (fitfun) [tprocess, right of=data, xshift=2.5cm, yshift = -0.4cm, fill=blue!5] {\code{nls()}};
\node (fm) [tprocess, color = black, right of=fitfun, xshift=1.5cm, fill=blue!5] {\code{nls} \textsl{object}};
\node (summary) [tprocess, color = black, right of=fm, xshift=1.7cm] {\code{summary()}};
\node (anova) [tprocess, color = black, below of=summary, yshift = 0.4cm] {\code{anova()}};
\node (plot) [tprocess, color = black, above of=summary, yshift = -0.4cm] {\code{plot()}};
\draw [arrow] (model) -- (fitfun);
\draw [arrow] (data) -- (fitfun);
\draw [arrow, dashed] (weights) -- (fitfun);
\draw [arrow] (guess) -- (fitfun);
\draw [arrow] (fitfun) -- (fm);
\draw [arrow] (fm) -- (plot);
\draw [arrow] (fm) -- (anova);
\draw [arrow] (fm) -- (summary);
\end{tikzpicture}
\end{small}
  \caption[Non-linear model fitting in \Rlang]{Non-linear model fitting in \Rlang is done in steps. Generic diagram from Figure \ref{fig:model:fit:diagram} redrawn to show a non-linear model fit. Non-filled boxes are shared with fitting of other types of models, and filled ones are specific to \Rfunction{nls()}. Only the three most frequently used query methods are shown, while  both response and explanatory variables are under \textsl{observations}. Dashed boxes and arrows are optional as defaults are provided.}\label{fig:nls:fit:diagram}
\end{figure}

In cases when algorithms exist for ``guessing'' suitable starting values, \Rlang provides a mechanism for packaging the \Rlang function to be fitted together with the \Rlang function generating the starting values. These functions go by the name of \emph{self-starting functions} and relieve the user from the burden of guessing and supplying suitable starting values. The\index{self-starting functions} self-starting functions available in \Rlang are \code{SSasymp()}, \code{SSasympOff()}, \code{SSasympOrig()}, \code{SSbiexp()}, \code{SSfol()}, \code{SSfpl()}, \code{SSgompertz()}, \code{SSlogis()}, \code{SSmicmen()}, and \code{SSweibull()}. Function \code{selfStart()} can be used to define new ones. All these functions can be used when fitting models with \Rfunction{nls} or \Rfunction{nlme}. The respective help pages give the details.

\begin{warningbox}
In calls to \Rfunction{nls()}, the rhs of the model \code{formula} is a function call. The names of its arguments if not present in \code{data} are assumed to be parameters to be fitted. Below, a named function
\end{warningbox}

As example the Michaelis-Menten equation\index{Michaelis-Menten equation} describing reaction kinetics\index{chemical reaction kinetics} in biochemistry and chemistry is fitted to the \Rdata{Puromycin} data set. The mathematical formulation is given by

\begin{equation}\label{eq:michaelis:menten}
v = \frac{\mathrm{d} [P]}{\mathrm{d} t} = \frac{V_{\mathrm{max}} [S]}{K_{\mathrm{M}} + [S]}
\end{equation}

\noindent
and is implemented in \Rlang under the name \Rfunction{SSmicmen()} as a self-starting function.

<<nls-01>>=
data(Puromycin)
names(Puromycin)
@

<<nls-02>>=
fm21 <- nls(rate ~ SSmicmen(conc, Vm, K), data = Puromycin,
            subset = state == "treated")
@

As for other fitted models we use query methods (see section \ref{sec:stat:LM} on page \pageref{sec:stat:LM}).

<<nls-03>>=
class(fm21)
summary(fm21)
residuals(fm21)
fitted(fm21)
@

\begin{explainbox}
Methods \code{str()} and \code{names()} can reveal differences with respect to linear and generalised linear models. The fitted model object is of class \code{nls} contains additional members but lacks others. Two members are related to the iterative approximation method used, \code{control} containing nested members holding iteration settings, and \code{convInfo} (convergence information) with nested members with information on the outcome of the iterative algorithm.

<<nls-04>>=
class(fm21)
names(fm21)
@

<<nls-04a>>=
fm21$convInfo
@
\end{explainbox}

Method \code{update()}, described for \code{lm()} in section \ref{sec:stat:update:step} on page \pageref{sec:stat:update:step}, can be also used with models fitted with \code{nls()}. The use of previous estimates as guesses for updates is an important feature.

\index{non-linear models|)}

\section{Splines and Local Regression}\label{sec:stat:splines}
\index{smoothing splines|(}\index{local polynomial regression|(}%
\index{LOESS|see{local polynomial regression}}%
The name ``spline'' derives from the tool used by draftsmen to draw smooth curves. Originally, a spline of soft wood was used as a flexible guide to draw arbitrary curves. Later the wood splines were replaced by a rod of flexible metal, such as lead, encased in plastic or similar material but the original name persisted. In mathematics, splines are functions that describe smooth and flexible curves.

Most of the model fits given above as examples produce estimates for parameters that are interpretable in the real world, directly in the case of mechanistic models like the estimate of reaction constants or at least indicating broadly a relationship between two variables as in the case of linear regression. In the case of polynomials with degree higher than 2, parameter estimates no longer directly describe features of the data.

Splines take this a step farther and parameter estimates have no practical interest. The interest resides in the overall shape and position of the predicted curve. Splines consist of knots (or connection points) joined by straight or curved fitted lines, i.e., they are functions that are \emph{piecewise}. The simplest splines, are piece-wise linear, given by chained straight line segments connecting knots.

In more complex splines, the segments are polynomials, frequently cubic polynomials, that fulfil certain constraints at the knots. For example, that the slope or first derivative is the same for the two connected curve ``pieces'' at the knot where they are connected. This constraint ensures that the curve is smooth. In some cases, similar constraints are imposed on higher order derivatives, for example, to the second derivative to ensure that the curve of the first derivative is also smooth at the knots.

Splines are used in free-hand drawing with computers to draw arbitrary smooth curves. They are also be used for interpolation, in which case observations, assumed to be error-free, become the knots of a spline used to approximate intermediate values. Finally, splines can be used as models to be fit to observations subject to random variation. In this case splines fulfil the role of smoothers, as a curve that broadly describes a relationship among variables.

Splines are frequently used as smooth curves in plots as described in section \ref{sec:plot:smoothers} on page \pageref{sec:plot:smoothers}. Function \Rfunction{spline()} is used for interpolation and function \Rfunction{smooth.spline()} for smoothing by fitting a cubic spline (a spline where the knots are connected by third degree polynomials). Function \Rfunction{smooth.spline()} has a different user interface than that we used for model fit functions described above, as it only accepts \code{numeric} vectors as arguments to parameters \code{x} and \code{y} (Figure \ref{fig:spline:fit:diagram}). Additional parameters make it possible to override the defaults for number of knots and adjust the stiffness or tendency towards a straight line. The \code{plot()} method for splines, differently to the methods for other fit functions, produces a plot of the prediction. As no model formula is used, only one curve at a time is fitted and no statistical tests involving groups are possible. The most commonly used query functions are thus not the same as for linear and non-linear models.

\begin{figure}
  \centering
\begin{small}
\begin{tikzpicture}[node distance=1.4cm, scale=0.5]
\node (x) [tprocess, fill=blue!5] {\textsl{obs.} $\to$ \code{x}};
\node (y) [tprocess, below of=model, fill=blue!5] {\textsl{obs.} $\to$ \code{y}};
\node (fitfun) [tprocess, right of=model, yshift=-0.7cm, xshift=2cm, fill=blue!5] {\code{smooth.spline()}};
\node (fm) [tprocess, color = black, right of=fitfun, xshift=2.3cm, fill=blue!5] {\code{smooth.spline} \textsl{obj.}};
\node (pred) [tprocess, color = black, right of=fm, xshift=2.3cm] {\code{predict()}};
\node (fitted) [tprocess, color = black, above of=pred, yshift = -0.4cm] {\code{fitted()}};
\node (resid) [tprocess, color = black, below of=pred, yshift = +0.4cm] {\code{residuals()}};
\draw [arrow] (x) -- (fitfun);
\draw [arrow] (y) -- (fitfun);
\draw [arrow] (fitfun) -- (fm);
\draw [arrow] (fm) -- (pred);
\draw [arrow] (fm) -- (fitted);
\draw [arrow] (fm) -- (resid);
\end{tikzpicture}
\end{small}
  \caption[Fitting smooth splines in \Rlang]{Fitting of smooth splines in \Rlang. Generic diagram from Figure \ref{fig:model:fit:diagram} redrawn to show the fitting of splines. Non-filled boxes are shared with fitting of other types of models, and filled ones are specific to \Rfunction{smooth.spline()}. Only the three most frequently used query methods are shown, while response and explanatory variables are passed separately to \textsl{x} and \textsl{y}.}\label{fig:spline:fit:diagram}
\end{figure}

<<spline-01>>=
fs1 <- smooth.spline(x = cars$speed, y = cars$dist)
print(fs1)
plot(fs1, type = "l")
points(x = cars$speed, y = cars$dist)
@

Function \Rfunction{loess} implements \emph{local polynomial regression}. It fits a polynomial curve or surface (i.e., more than one explanatory variable can be included in the model formula) using local-weighted fitting. Its user interface is rather similar to that of \code{glm()} with \code{formula}, \code{family} and \code{data} formal parameters (Figure \ref{fig:loess:fit:diagram}). Additional parameters control ``stiffness'' or the extent of the local data used for fitting (how much weight is given to observations as a function of their distance). The type of fit local or not used for individual explanatory variables can be controlled through parameter \code{parametric}.

\begin{figure}
  \centering
\begin{small}
\begin{tikzpicture}[node distance=1.4cm, scale=0.5]
\node (model) [tprocess] {\textsl{model} $\to$ \code{formula}};
\node (data) [tprocess, below of=model, yshift = 0.4cm] {\textsl{observations} $\to$ \code{data}};
\node (weights) [tprocess, dashed, below of=data, fill=blue!1, yshift = 0.4cm] {\textsl{weights} $\to$ \code{weights}};
\node (family) [tprocess, below of=weights, fill=blue!5, yshift = 0.4cm] {\textsl{distribution} $\to$ \code{family}};
\node (fitfun) [tprocess, right of=data, xshift=2.5cm, yshift = -0.4cm, fill=blue!5] {\code{loess()}};
\node (fm) [tprocess, color = black, right of=fitfun, xshift=1.5cm, fill=blue!5] {\code{loess} \textsl{object}};
\node (pred) [tprocess, color = black, right of=fm, xshift=1.7cm] {\code{predict()}};
\node (fitted) [tprocess, color = black, above of=pred, yshift = -0.4cm] {\code{fitted()}};
\node (resid) [tprocess, color = black, below of=pred, yshift = +0.4cm] {\code{residuals()}};
\draw [arrow] (model) -- (fitfun);
\draw [arrow] (data) -- (fitfun);
\draw [arrow, dashed] (weights) -- (fitfun);
\draw [arrow] (family) -- (fitfun);
\draw [arrow] (fitfun) -- (fm);
\draw [arrow] (fm) -- (pred);
\draw [arrow] (fm) -- (fitted);
\draw [arrow] (fm) -- (resid);
\end{tikzpicture}
\end{small}
  \caption[Loess model fitting in \Rlang]{Loess model fitting in \Rlang is done in steps. Generic diagram from Figure \ref{fig:model:fit:diagram} redrawn to show local polynomial regression model fitting. Non-filled boxes are shared with fitting of other types of models, and filled ones are specific to \Rfunction{loess()}. Only the three most frequently used query methods are shown, while  both response and explanatory variables are under \textsl{observations}. Dashed boxes and arrows are optional as defaults are provided.}\label{fig:loess:fit:diagram}
\end{figure}

<<spline-02>>=
floc <- loess(dist ~ speed, data = cars)
class(floc)
summary(floc)
@

\begin{warningbox}
  Function \Rfunction{anova()} can be used to compare two or more loess fits, but not on a single one.
\end{warningbox}

\begin{explainbox}
  Several modern approaches to data analysis, which do provide estimates of effects' significance and sizes, are based on the use of splines to describe the responses and even variance. Among them are additive models such as GAM and related methods \autocite[see][]{Wood2017} and functional data analysis (FDA) \autocite{Ramsay2009}. These methods are implemented in specialised extension packages and fall outside the scope of this book.
\end{explainbox}
\index{smoothing splines|)}\index{local polynomial regression|)}\index{models!fitting|)}%

\section{Model Formulas}\label{sec:stat:formulas}
\index{model formulas|(}\index{models!specification|see{model formulas}}%
Model formulas, such as \code{y\,\char"007E\,x} are widely used in \Rlang, both in model fitting as exemplified in previous sections of this chapter and in plotting when using base \Rlang \Rmethod{plot()} methods.

\Rlang is consistent and flexible in how it treats various objects, to an extent that can be surprising to those familiar with other computer languages. Model formulas are objects of class \Rclass{formula} and mode \Rclass{call} and can be manipulated and stored similarly to objects of other \Rlang classes.

<<mformulas-intro-1>>=
class(y ~ x)
mode(y ~ x)
@

Like any other \Rlang object formulas can be assigned to variables and be members of lists and vectors. Consequently, the first linear model fit example from page \pageref{chunk:lm:models1} can be rewritten as follows.

<<mformulas-intro-2, eval=FALSE>>=
my.formula <- dist ~ 1 + speed
fm1 <- lm(my.formula, data=cars)
@

In some situations, e.g., calculation of correlations, models lacking a \emph{lhs} term (a term on the left-hand side of \code{\,\char"007E\,}) are used. At least one term must be present in the \emph{rhs} of model formulas, as an expression ending in \code{\,\char"007E\,} is syntactically incomplete.

<<mformulas-intro-5>>=
class(~ x + y)
mode(~ x + y)
is.empty.model(~ x + y)
@

\begin{explainbox}
Some details of \Rlang formulas can be important in advanced scripts. Two kinds of ``emptiness'' are possible for formulas. As with other classes, empty objects or vectors of length zero are valid and can be created with the class constructor. In the case of formulas, there is an additional kind of emptiness, a formula describing a model with no explanatory terms on its \emph{rhs}.

An ``empty'' object of class \Rclass{formula} can be created by a call to \code{formula()} with no arguments, similarly as a numeric vector of length zero is created by the call \code{numeric()}. The last, commented out, statement in the code below triggers an error as the argument passed to \Rfunction{is.empty.model()} is of length zero. (This behaviour is not consistent with \Rclass{numeric} vectors of length zero; see for example the value returned by \code{is.finite(numeric())}.)

<<mformulas-intro-3>>=
class(formula())
mode(formula())
length(formula())
# is.empty.model(formula())
@

A model formula describing a model with no explanatory terms on the rhs, is considered empty even if it is a valid object of class \Rclass{formula} and, thus, not missing. While \code{y\ \char"007E\ 1} describes a model with only an intercept (estimating $a = \bar{x}$), \code{y\ \char"007E\ 0} or its equivalent \code{y\ \char"007E\ -1}, describes an empty model that cannot be fitted to data.

<<mformulas-intro-4>>=
class(y ~ 0)
mode(y ~ 0)
is.empty.model(y ~ 0)
is.empty.model(y ~ 1)
is.empty.model(y ~ x)
@

The value returned by \Rmethod{length()} on a single formula is not always 1, the number of formulas in the vector of formulas, but instead the number of components in the formula. For longer vectors, it does return the number of member formulae. Because of this, it is better to store model formulas in objects of class \Rclass{list} than in vectors, as \Rfunction{length()} consistently returns the expected value on lists.

<<mformulas-intro-6>>=
length(formula())
length(y ~ 0)
length(y ~ 1)
length(y ~ x)
length(c(y ~ 1, y ~ x))
length(list(y ~ 1))
length(list(y ~ 1, y ~ x))
@

As described above, \Rfunction{length()} applied to a single formula and to a list of formulas behaves differently. To call \Rfunction{length()} on each member of a list of formulas, we can use \code{sapply()} (see section \ref{sec:data:apply} on page \pageref{sec:data:apply}). As function \Rfunction{is.empty.model()} is not vectorised, we also have to use \code{sapply()} with a list of formulas.

<<mformulas-intro-6b>>=
sapply(list(y ~ 0, y ~ 1, y ~ x), length)
sapply(list(y ~ 0, y ~ 1, y ~ x), is.empty.model)
@

\end{explainbox}

In the examples in previous sections, we fitted simple models. More complex ones can be easily formulated using the same syntax. First of all, one can avoid using of operator \code{*} by explicitly defining all individual main effects and interactions using operators \code{+} and \code{:}\,. The syntax implemented in base \Rlang allows grouping by means of parentheses, so it is also possible to exclude some interactions by combining the use of \code{*} and parentheses.

The same symbols as for arithmetic operators are used for model formulas. Within a formula, symbols are interpreted according to formula syntax. When we mean an arithmetic operation that could be interpreted as being part of the model formula we need to ``protect'' it by means of the identity function \Rfunction{I()}. The next two examples define formulas for models with only one explanatory variable. With formulas like these, the explanatory variable will be computed on the fly when fitting the model to data. In the first case below, we need to explicitly protect the addition of the two variables into their sum, because otherwise they would be interpreted as two separate explanatory variables in the model. In the second case, \Rfunction{log()} cannot be interpreted as part of the model formula, and consequently does not require additional protection, neither does the expression passed as its argument.

<<mformulas-intro-7,eval=FALSE>>=
y ~ I(x1 + x2)
y ~ log(x1 + x2)
@

\Rlang formula syntax allows alternative ways for specifying interaction terms. They allow ``abbreviated'' ways of entering formulas, which for complex experimental designs saves typing and can improve clarity. As seen above, operator \code{*} saves us from having to explicitly indicate all the interaction terms in a full factorial model.

<<mformulas-intro-8,eval=FALSE>>=
y ~ x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3 + x1:x2:x3
@

Can be replaced by a concise equivalent.

<<mformulas-00,eval=FALSE>>=
y ~ x1 * x2 * x3
@

When the model to be specified does not include all possible interaction terms, we can combine the concise notation with parentheses. Below, equivalent formulas are shown using concise and verbose notation.

<<mformulas-01, eval=FALSE>>=
y ~ x1 + (x2 * x3)
y ~ x1 + x2 + x3 + x2:x3
@

<<mformulas-02,eval=FALSE>>=
y ~ x1 * (x2 + x3)
y ~ x1 + x2 + x3 + x1:x2 + x1:x3
@

The \code{\textasciicircum{}} operator provides a concise notation to limit the order of the interaction terms included in a formula.

<<mformulas-03,eval=FALSE>>=
y ~ (x1 + x2 + x3)^2
y ~ x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3
@

Operator \code{\%in\%} can also be used as a shortcut for including only some of all the possible interaction terms in a formula.

<<mformulas-04,eval=FALSE>>=
y ~ x1 + x2 + x1 %in% x2
@

\begin{playground}
Whether the two model formulas above are equivalent or not, can be investigated using function \Rfunction{terms()}.

% knitr bug??
<<mformulas-01a, eval=FALSE>>=
terms(y ~ x1 + (x2 * x3))
terms(y ~ x1 * (x2 + x3))
terms(y ~ (x1 + x2 + x3)^2)
terms(y ~ x1 + x2 + x1 %in% x2)
@
\end{playground}

\begin{advplayground}
For operator \code{\textasciicircum{}} to behave as expected, its first operand should be a formula with no interactions!  Compare the result of expanding these two formulas with \Rfunction{terms()}.

<<mformulas-PG01, eval=eval_playground>>=
y ~ (x1 + x2 + x3)^2
y ~ (x1 * x2 * x3)^2
@
\end{advplayground}

\begin{advplayground}
Run the code examples below using the \Rdata{npk} data set from \Rlang. They demonstrate the use of different model formulas in ANOVA\index{analysis of variance!model formula}. Use these examples plus your own variations on the same theme to build your understanding of the syntax of model formulas. Based on the terms displayed in the ANOVA tables, first work out what models are being fitted in each case. In a second step, write the mathematical formulation of each of the models. Finally, think how model choice may affect the conclusions from an analysis of variance.

% runs fine but crashes LaTeX
<<mformulas-PG02, eval=FALSE>>=
data(npk)
anova(lm(yield ~ N * P * K, data = npk))
anova(lm(yield ~ (N + P + K)^2, data = npk))
anova(lm(yield ~ N + P + K + P %in% N + K %in% N, data = npk))
anova(lm(yield ~ N + P + K + N %in% P + K %in% P, data = npk))
@
\end{advplayground}

Nesting of factors in experiments using hierarchical designs such as split-plots or repeated measures, results in the need to compute additional error terms, differing in their degrees of freedom. In a nested design with fixed effects, effects are tested based on different error terms depending on the design of an experiment, i.e., depending on the randomisation of the assignment of treatments to experimental units. In base-\Rlang model-formulas, nesting is described by explicit definition of error terms by means of \code{Error()} within the formula.

The syntax described above does not support complex statistical models as implemented in extension packages. For example, Nowadays, fitting linear mixed-effects (LME) models is the preferred approach for the analysis of data from experiments and surveys based on hierarchical designs. These methods are implemented in packages \pkgname{nlme} \autocite{Pinheiro2000} and \pkgname{lme4} \autocite{Bates2015} that define extensions to the model formula syntax. The extensions make it possible to describe nesting and distinguish fixed and random effects. Packages implementing fitting of additive models have needed other extensions to the formula syntax. Additive model methods are described by \citeauthor{Wood2017} (\citeyear{Wood2017}) and \citeauthor{Zuur2012} (\citeyear{Zuur2012}). Although the overall approach and syntax are followed in most contributed packages, different packages have extended the formula syntax in different ways. These extensions fall outside the scope of this book.

\begin{warningbox}
  \Rlang will accept any syntactically correct model formula, even when the results of the fit are not interpretable. It is \emph{the responsibility of the user to ensure that models are meaningful}\index{models!nesting of factors}. The most common, and dangerous, mistake is specifying for factorial experiments, models that are missing lower-order terms.

  Fitting models like those below to data from an experiment based on a three-way factorial design should be avoided. In both cases, simpler terms are missing, while higher-order interaction(s) that include the missing term are included in the model. Such models are not interpretable, as the variation from the missing term(s) ends being ``disguised'' within the remaining terms, distorting their apparent significance and parameter estimates.

<<mformulas-W01,eval=FALSE>>=
y ~ A + B + A:B + A:C + B:C
y ~ A + B + C + A:B + A:C + A:B:C
@

  In contrast to those above, the models below are interpretable, even if not ``full'' models (not including all possible interactions).

<<mformulas-W02,eval=FALSE>>=
y ~ A + B + C + A:B + A:C + B:C
y ~ (A + B + C)^2
y ~ A + B + C + B:C
y ~ A + B * C
@

\end{warningbox}

\begin{explainbox}
\index{model formulas!manipulation}\textbf{Manipulation of model formulas.} Because this is a book about the \Rlang language, it is pertinent to describe how formulas can be manipulated. Formulas, as any other \Rlang objects, can be saved in variables including lists. Why is this useful? For example, if we want to fit several different models to the same data, we can write a \code{for} loop that walks through a list of model formulas (see section \ref{sec:R:faces:of:loops} on page \pageref{sec:R:faces:of:loops}). Obviously, user-defined functions can accept formulas as arguments as \code{lm()} and other model-fitting functions do. In addition, it is relatively simple for user code to programmatically create and edit \Rlang formulas, in the same way as functions \code{update()} and \code{step()} do under the hood.

A conversion constructor is available under the name \Rfunction{as.formula()}. It is useful when formulas are input interactively by the user or read from text files. With \Rfunction{as.formula()} we can convert a character string into a formula.

<<mformulas-EB03>>=
as.formula("y ~ x")
@

As\index{model formulas!conversion from character strings} there are many functions for the manipulation of \code{character} strings available in base \Rlang and through extension packages, it is easiest to build model formulas as strings. We can use functions like \code{paste()} to assemble a formula as text, and then use \Rfunction{as.formula()} to convert it to an object of class \code{formula}, usable for fitting a model.

<<mformulas-EB04>>=
paste("y", "x", sep = "~") |> as.formula()
@

For\index{model formulas!conversion into character strings} the reverse operation of converting a formula into a string, we have available methods \code{as.character()} and \code{format()}. The first of these methods returns a character vector containing the components of the formula as individual strings, while \code{format()} returns a single character string with the formula formatted for printing.

<<mformulas-EB05>>=
as.character(y ~ x)
format(y ~ x)
@

This conversion makes it possible to edit a formula as a character string.

<<mformulas-EB05a>>=
format(y ~ x) |> gsub("x", "x + z", x = _) |> as.formula()
@

It\index{model formulas!updating} is also possible to \emph{edit} formula objects with method \Rfunction{update()}. In the replacement formula, a dot can replace either the left-hand side (lhs) or the right-hand side (rhs) of the existing formula. We can also remove terms as can be seen below. In some cases, the dot corresponding to the lhs can be omitted, but including it makes the syntax clearer.

<<mformulas-EB06>>=
my.formula <- y ~ x1 + x2
update(my.formula, . ~ . + x3)
update(my.formula, . ~ . - x1)
update(my.formula, . ~ x3)
update(my.formula, z ~ .)
update(my.formula, . + z ~ .)
@

As \Rlang provides high-level functions for model selection editing model formulas is not very frequently needed for model fitting.
\end{explainbox}

A model matrix of dummy coefficients is used in the actual computations. This matrix can be derived from a model formula, a contrast name, and the data for the explanatory variables using function \Rfunction{model.matrix()}.
\index{model formulas|)}

\section{Time Series}\label{sec:stat:time:series}
\index{time series|(}
Longitudinal data consist of repeated measurements, usually done over time, on the same experimental units. Longitudinal data, when replicated on several experimental units at each time point, are called repeated measurements, while when not replicated, they are called time series. Base \Rlang provides special support for the analysis of time series data, while repeated measurements can be analysed with nested linear models, mixed-effects models, and additive models.

Time series data are data collected in such a way that there is only one observation, possibly of multiple variables, available at each point in time. This brief section introduces only the most basic aspects of time-series analysis. In most cases, time steps are of uniform duration and occur regularly, which simplifies data handling and storage. \Rlang not only provides methods for the analysis and manipulation of time-series, but also a specialised class for their storage, \Rclass{"ts"}. Regular time steps allow more compact storage---e.g., a \code{ts} object does not need to store time values for each observation but instead a combination of two of start time, step size and end time. When analysing time-series data, it is frequently necessary to convert time data between one of the special \Rlang classes and \Rclass{character} strings, and to operate on dates and times (see section \ref{sec:data:datetime} on page \pageref{sec:data:datetime}).

By now, you surely guessed that to create an object of class \Rclass{"ts"} one needs to use a constructor called \Rfunction{ts()} or a conversion constructor called \Rfunction{as.ts()} and that you can look up the arguments they accept by consulting help using \code{help(ts)}.

<<ts-00>>=
my.ts <- ts(1:10, start = 2019, deltat = 1/12)
@

The \code{print()} method for \code{ts} objects is special, and adjusts the printout according to the time step or \code{deltat} of the series.

<<ts-00aa>>=
print(my.ts)
@

The structure of the \code{ts} object is simple. Its mode is \code{numeric} but its class is \code{ts}. It is similar to a numeric vector with the addition of one attribute named \code{tsp} describing the time steps, as a numeric vector of length 3, giving start and end time and the size of the steps.

<<ts-00a>>=
mode(my.ts)
class(my.ts)
is.ts(my.ts)
str(my.ts)
attributes(my.ts)
@

Data set \Rdata{nottem}, included in \Rlang, contains meteorological data for Nottingham. The annual cycle of mean air temperatures (in degrees Fahrenheit) as well as variation among years are clear when data are plotted.

\begin{explainbox}
Reexpression of the temperatures in the time-series from degrees Fahrenheit into degrees Celsius can be achieved as in \code{numeric} vectors using vectorised arithmetic and recycling.

<<ts-04>>=
nottem.celcius <- (nottem - 32) * 5 / 9
@
\end{explainbox}

<<ts-03>>=
is.ts(nottem.celcius)
@

<<echo=FALSE, cache=FALSE>>=
opts_chunk$set(opts_fig_wide)
@

<<ts-03plot>>=
plot(nottem.celcius)
@

\begin{playground}
Explore the structure of the \code{nottem.celcius} object (or the \code{nottem} object), and consider how and why it differs or not from that of the object \code{my.ts} that we created above. Similarly explore time series \code{ausres}, another of the data sets included in \Rlang.

<<ts-03a, eval=eval_playground>>=
str(nottem.celcius)
attributes(nottem.celcius)
@

\end{playground}

Many time series of observations display cyclic variation at different frequencies. Outdoors, air temperature varies cyclically between day and night and throughout the year. Superimposed on these regular cycles there can be faster random variation and long-term trends. One approach to the analysis of time series data is to estimate the separate contribution of these components.

An efficient approach to time series decomposition, based on LOESS\index{time series!decomposition}\index{STL|see{time series decomposition}} (see section \ref{sec:stat:splines} on \pageref{sec:stat:splines}), is STL (Seasonal and Trend decomposition using Loess).\qRfunction{decompose()}\qRfunction{stl()} A seasonal window of 7 months, the minimum accepted, allows the extraction of the annual cycles and a long-term trend leaving  as a remainder some unexplained variation. In the plot, is important to be aware that the scale limits in the different panels are different, and re-set for each plot.

<<echo=FALSE, cache=FALSE>>=
opts_chunk$set(opts_fig_wide_square)
@

<<ts-05, fig.height=5>>=
nottem.stl <- stl(nottem.celcius, s.window = 7)
plot(nottem.stl)
@

It is interesting to explore the class and structure of the object returned by \Rfunction{stl()}, as we may want to extract components. We can see that the structure of this object is rather similar to model-fit objects of classes \code{lm} and \code{glm}.

<<ts-06>>=
class(nottem.stl)
str(nottem.stl, no.list = TRUE, give.attr = FALSE, vec.len = 2)
@

As with other fit methods, method \Rfunction{summary()} is available. However, this method for class \code{stl} returns unchanged the \code{stl} object received as an argument and displays a summary. In other words, it behaves similarly to \code{print()} methods with respect to the returned object, but produces a different printout than \code{print()} as its side effect.

<<ts-07>>=
summary(nottem.stl)
@

\begin{playground}
Consult \code{help(stl)} and \code{help(plot.stl)} and create different plots and decompositions by passing different arguments to the formal parameters of these methods.

Method \code{print()} shows the different components. Extract the seasonal component and plot is on its own against time.
\end{playground}

In the Nottingham temperature time series, the period of the variation is clearly annual, but for many other time series an interesting feature to characterise is autocorrelation and its periodicity. Function \Rfunction{acf()} computes and plots the autocorrelation function (ACF) vs. the lag. The time series has monthly data, while the scale for lag in the plot below is in years. The autocorrelation is one at zero lag, and slightly less with a lag of one year, while its is negative between winter and summer temperatures.

<<echo=FALSE, cache=FALSE>>=
opts_chunk$set(opts_fig_narrow)
@

<<ts-08>>=
acf(nottem)
@

More advanced time-series analysis and forecasting methods are implemented in contributed packages and fall outside the scope of this book. The textbook \citetitle{Hyndman2021} \autocite{Hyndman2021} is comprehensive, starting with an introduction to time series and continuing all the way to the description of modern forecasting methods, using \Rlang throughout.
\index{time series|)}

\section{Multivariate Statistics}\label{sec:stat:MV}
\index{multivariate statistics|(}
All the methods presented above are univariate, as even if in some cases we considered multiple explanatory variables on the rhs of model formulas, the lhs contained at most one response variable. There are many different multivariate methods available, and a few of them are implemented in base \Rlang functions. The current section does not describe these methods in depth, it only provides a few simple examples for some of the frequently used ones.

\subsection{Multivariate analysis of variance}
\index{multivariate analysis of variance|(}
\index{MANOVA|see{multivariate analysis of variance}}
Multivariate methods take into account several response variables simultaneously, as part of a single analysis. In practice, it is usual to use contributed packages for multivariate data analysis in \Rlang, except for simple cases. We will look first at \emph{multivariate} ANOVA or MANOVA. In the same way as \Rfunction{aov()} is a wrapper that uses internally \Rfunction{lm()}, \Rfunction{manova()} is a wrapper that uses internally \Rfunction{aov()}.

Multivariate model formulas in base \Rlang require the use of column binding (\code{cbind()}) on the left-hand side (lhs) of the model formula. The well-known \Rdata{iris} data set, containing size measurements for flowers of three species of \emph{Iris}, is used in the examples below.

<<manova-02>>=
mmf2 <- manova(cbind(Petal.Length, Petal.Width) ~  Species, data = iris)
anova(mmf2)
summary(mmf2)
@

\begin{advplayground}
Modify the example above to use \code{aov()} instead of \code{manova()} and save the result to a variable named \code{mmf3}.
Use \code{class()}, \code{attributes()}, \code{names()}, \code{str()} and extraction of members to explore objects \code{mmf1}, \code{mmf2} and \code{mmf3}. Are they different?
\end{advplayground}

\index{multivariate analysis of variance|)}

\subsection{Principal components analysis}\label{sec:stat:PCA}
\index{principal components analysis|(}\index{PCA|see {principal components analysis}}

Principal components analysis (PCA) is used to simplify a data set by combining variables with similar and ``mirror'' behaviour into principal components. At a later stage, we frequently try to interpret these components in relation to known and/or assumed independent variables. Base \Rlang's function \Rfunction{prcomp()} computes the principal components and accepts additional arguments for centring and scaling.

<<pca-01>>=
pc <- prcomp(iris[c("Sepal.Length", "Sepal.Width",
                    "Petal.Length", "Petal.Width")],
             center = TRUE,
             scale = TRUE)
@

By printing the returned object, we can see the loadings of each variable in the principal components \code{PC1} to \code{PC4}.
<<pca-02>>=
class(pc)
pc
@

In the summary, the rows ``Proportion of Variance'' and ``Cumulative Proportion'' are most informative of the contribution of each principal component (PC) to explaining the variation among observations.

<<pca-03>>=
summary(pc)
@

Method \code{plot()} generates a bar plot of variances corresponding to the different components.

<<pca-04>>=
plot(pc)
@

<<include=FALSE, cache=FALSE>>=
opts_chunk$set(opts_fig_wide_square)
@

Method \Rfunction{biplot()} produces a plot with one principal component (PC) on each axis, plus arrows for the loadings.

<<pca-05>>=
biplot(pc)
@

<<include=FALSE, cache=FALSE>>=
opts_chunk$set(opts_fig_narrow)
@

Visually more elaborate plots of the principal components and their loadings can be obtained using package \pkgnameNI{ggplot} described in chapter \ref{chap:R:plotting} on page \pageref{chap:R:plotting}. Package \pkgnameNI{ggfortify} extends \pkgnameNI{ggplot} so as to make it easy to plot principal components and their loadings.

\begin{playground}
For growth and morphological data, a log-transformation can be suitable given that variance is frequently proportional to the magnitude of the values measured. We leave it as an exercise to repeat the above analysis using transformed values for the dimensions of petals and sepals. How much does the use of transformations change the outcome of the analysis?
\end{playground}

\begin{advplayground}
As for other fitted models, the object returned by function \Rfunction{prcomp()} is list-like with multiple components and belongs to a class of the same name as the function, not derived from class \code{"list"}.

<<pca-EB-01, eval=eval_playground>>=
class(pc)
str(pc, max.level = 1)
@
\end{advplayground}

\index{principal components analysis|)}

\subsection{Multidimensional scaling}\label{sec:stat:MDS}
\index{multidimensional scaling|(}\index{MDS|see {multidimensional scaling}}

The aim of multidimensional scaling (MDS) is to visualise in 2D space the similarity between pairs of observations. The values for the observed variable(s) are used to compute a measure of distance among pairs of observations. The nature of the data will influence what distance metric is most informative.
For MDS we start with a matrix of distances among observations. We will use, for the next examples, distances in kilometres between geographic locations in Europe from the data set \Rdata{eurodist}.

<<mds-01>>=
loc <- cmdscale(eurodist)
@

We can see that the returned object \code{loc} is a \code{matrix}, with names for one of the dimensions.

<<mds-02>>=
class(loc)
dim(loc)
head(loc, 8)
@

To make the code easier to read, two vectors are first extracted from the matrix and named \code{x} and \code{y}. We force aspect to equality so that distances on both axes are comparable.

<<include=FALSE, cache=FALSE>>=
opts_chunk$set(opts_fig_wide_square)
@

<<mds-03>>=
x <- loc[, 1]
y <- -loc[, 2] # change sign so North is at the top
plot(x, y, type = "n", asp = 1,
     main = "cmdscale(eurodist)")
text(x, y, rownames(loc), cex = 0.6)
@%
\pagebreak

\begin{advplayground}
  Find data on the mean annual temperature, mean annual rainfall and mean number of sunny days at each of the locations in the \code{eurodist} data set. Next, compute suitable distance metrics, for example, using function \Rfunction{dist}. Finally, use MDS to visualise how similar the locations are with respect to each of the three variables. Devise a measure of distance that takes into account the three climate variables and use MDS to find how distant the different locations are.
\end{advplayground}

\index{multidimensional scaling|)}

\subsection{Cluster analysis}\label{sec:stat:cluster}
\index{cluster analysis|(}

In cluster analysis, the aim is to group observations into discrete groups with maximal internal homogeneity and maximum group-to-group differences. In the next example, we use function \Rfunction{hclust()} from the base-\Rlang package \pkgname{stats}. We use, as above, the \Rdata{eurodist} data which directly provides distances. In other cases, a matrix of distances between pairs of observations needs to be first calculated with function \Rfunction{dist} which supports several methods.

<<cluster-01>>=
hc <- hclust(eurodist)
print(hc)
@

<<cluster-02>>=
plot(hc)
@

We can use \Rfunction{cutree()} to limit the number of clusters by directly passing as an argument the desired number of clusters or the height at which to cut the tree.

<<cluster-03>>=
cutree(hc, k = 5)
@

The object returned by \Rfunction{hclust()} contains details of the result of the clustering, which allows further manipulation and plotting.
<<cluster-04>>=
str(hc)
@

\index{cluster analysis|)}

%\subsection{Discriminant analysis}\label{sec:stat:DA}
%\index{discriminant analysis|(}
%
%In discriminant analysis the categories or groups to which objects belong are known \emph{a priori} for a training data set. The aim is to fit/build a classifier that will allow us to assign future observations to the different non-overlapping groups with as few mistakes as possible.
%
%
%\index{discriminant analysis|)}
\index{multivariate statistics|)}

\section{Further Reading}\label{sec:stat:further:reading}

\Rlang\index{further reading!statistics with R} and its extension packages provide implementations of most known statistical methods. For some methods, alternative implementations exist in different packages. The present chapter only attempts to show how some of the most frequently used implementations are used, as this knowledge is frequently taken for granted in specialised books, several of which I list here. Two recent text books on statistics, following a modern approach, and using \Rlang for examples, are \citetitle{Diez2019} \autocite{Diez2019} and \citetitle{Holmes2019} \autocite{Holmes2019}. They differ in the subjects emphasised, with the second one focusing more on genetic and molecular biology. Three examples of books introducing statistical computations in \Rlang are \citetitle{Dalgaard2008} \autocite{Dalgaard2008}, \citetitle{Everitt2010} \autocite{Everitt2010} and \citetitle{Zuur2009} \autocite{Zuur2009}. The book \citetitle{Mehtatalo2020} \autocite{Mehtatalo2020} presents both the statistical theory and code examples. The comprehensive \citebooktitle{Crawley2012} \autocite{Crawley2012} and the classic reference \citebooktitle{Venables2002} \autocite{Venables2002} both present statistical theory in parallel with the \Rlang code examples. More specific books are also available from which a few suggestions for further reading are \citebooktitle{Everitt2011} \autocite{Everitt2011}, \citebooktitle{Faraway2004} \autocite{Faraway2004}, \citebooktitle{Faraway2006} \autocite{Faraway2006}, \citebooktitle{Hyndman2021} \autocite{Hyndman2021}, \citebooktitle{James2013} \autocite{James2013}, \citebooktitle{Pinheiro2000} \autocite{Pinheiro2000} and \citebooktitle{Wood2017} \autocite{Wood2017}.

<<eval=eval_diag, include=eval_diag, echo=eval_diag, cache=FALSE>>=
knitter_diag()
R_diag()
other_diag()
@