% !Rnw root = appendix.main.Rnw
<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_narrow_square)
opts_knit$set(concordance=TRUE)
opts_knit$set(unnamed.chunk.label = 'functions-chunk')
rm(plot)
@
\chapter{Base \Rlang: ``Verbs'' and ``Nouns'' for Statistics}\label{chap:R:statistics}
\begin{VF}
The purpose of computing is insight, not numbers.
\VA{Richard W. Hamming}{\emph{Numerical Methods for Scientists and Engineers}, 1987}\nocite{Hamming1987}
\end{VF}
\section{Aims of This Chapter}
This chapter aims to give the reader an introduction to the approach used in base \Rlang for the computation of statistical summaries, the fitting of models to observations, and tests of hypotheses. This chapter does \emph{not} explain data analysis methods, statistical principles or experimental designs. There are many good books on the use of \Rpgrm for different kinds of statistical analyses (see further reading on page \pageref{sec:stat:further:reading}), but most of them tend to focus on specific statistical methods rather than on the commonalities among them. Although base \R's model fitting functions target specific statistical procedures, they use a common approach to model specification and a common way of returning the computed estimates and test outcomes. This approach, also followed by many contributed extension packages, can be considered part of the philosophy behind the \Rlang language. In this chapter, you will become familiar with the approaches used in \Rlang for calculating statistical summaries, generating (pseudo-)random numbers, sampling, fitting models, and carrying out tests of significance. We will use linear correlation, the \emph{t}-test, linear models, generalised linear models, non-linear models, and some simple multivariate methods as examples. The focus is on how to specify statistical models, contrasts and observations, how to access different components of the objects returned by the corresponding fit and summary functions, and how to use these extracted components in further computations or for customised printing and formatting.
%\emph{At present I use several examples adapted from the help pages for the functions described. I may revise this before publication.}
\section{Statistical Summaries}
\index{summaries!statistical|(}
Statistics and data analysis being a main focus of the \Rlang language, \Rlang provides functions both for simple and complex calculations, ranging from means and variances to the fitting of very complex models. Table \ref{tab:stat:summaries} lists some frequently used functions. All these functions accept numeric vectors and/or matrices as arguments. In addition, function \Rfunction{quantile()} can be used to simultaneously compute multiple arbitrary quantiles for a vector of observations, and method \Rfunction{summary()} produces a summary that depends on the class of the argument passed to it, as exemplified below the table. (See section \ref{sec:functions:sem} on page \pageref{sec:functions:sem} for how to define your own functions.)
\begin{table}
\centering
\caption[Simple statistical summaries.]{Frequently used simple statistical summaries and the corresponding \Rlang functions.\vspace{1ex}}\label{tab:stat:summaries}
\begin{tabular}{llll}
\toprule
Function & Symbol & Formulation & Name \\
\midrule
\Rfunction{mean()} & $\bar{x}$ & $\sum x_i / n$ & mean \\
\Rfunction{var()} & $s^2$ & $\sum (x_i - \bar{x})^2 / (n - 1)$ & sample variance \\
\Rfunction{sd()} & $s$ & $\sqrt{s^2}$ & sample standard deviation \\
\Rfunction{median()} & M or $\tilde{x}$ & & median \\
\Rfunction{mad()} & MAD & median $|x_i - \tilde{x}|$ & median absolute deviation \\
\Rfunction{max()} & $x_\mathrm{max}$ & & maximum \\
\Rfunction{min()} & $x_\mathrm{min}$ & & minimum \\
\Rfunction{range()} & $x_\mathrm{min}, x_\mathrm{max}$ & & range \\
\bottomrule
\end{tabular}
\end{table}
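For example, \Rfunction{quantile()} accepts a vector of probabilities through its \code{probs} parameter and returns the corresponding quantiles, while \Rfunction{summary()} applied to a \code{numeric} vector computes several of these summaries in one go.
<<stat-fun-0>>=
x <- c(1:20, 100)
quantile(x, probs = c(0.25, 0.5, 0.75, 0.95))
summary(x)
@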
By default, if the argument contains \code{NA}s these functions return \code{NA}. The logic behind this is that if one value exists but is unknown, the true result of the computation is also unknown (see page \pageref{par:special:values} for details on the role of \code{NA} in \Rlang). However, an additional parameter called \code{na.rm} allows us to override this default behaviour by requesting any \code{NA} in the input to be removed (or discarded) before calculating the summary.
<<stat-fun-1>>=
x <- c(1:20, NA)
mean(x)
mean(x, na.rm = TRUE)
@
Function \Rfunction{mean()} can be used to compute the mean from all values, as in the example above, as well as trimmed means, i.e., means computed after discarding extreme values. The argument passed to parameter \code{trim} decides the fraction of the observations to discard at \emph{each extreme} of the vector of values after ordering them from smallest to largest.
<<stat-fun-2>>=
x <- c(1:20, 100)
mean(x)
mean(x, trim = 0.05)
@
\begin{playground}
In contrast to the use of other functions, I do not provide examples of the use of all the functions listed in Table \ref{tab:stat:summaries}. Construct \code{numeric} vectors with artificial data or use real data to play with the remaining functions. Study the help pages to learn about the different parameters and their uses.% Later in the book, only the output from certain examples will be shown, with the expectation, that other examples will be run by readers.
\end{playground}
Other more advanced functions are also available in \Rlang, such as \Rfunction{boxplot.stats()} that computes the values needed to draw boxplots (see section \ref{sec:boxplot} on page \pageref{sec:boxplot}).
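For example, the list returned by \Rfunction{boxplot.stats()} contains in member \code{stats} the five values that define the box and whiskers, and in member \code{out} the observations flagged as outliers.
<<stat-fun-3>>=
boxplot.stats(c(1:20, 100))
@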
In many cases, you will want to compute statistical summaries by group or treatment, in addition to, or instead of, summaries for a whole data set or vector. See section \ref{sec:calc:df:aggregate} on page \pageref{sec:calc:df:aggregate} for details on how to compute summaries of data stored in data frames using base \Rlang functions, and section \ref{sec:dplyr:manip} on page \pageref{sec:dplyr:manip} for alternative functions from contributed packages.
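As a minimal preview, base \Rlang's \Rfunction{aggregate()} accepts a model formula and a data frame; here it computes the mean chick weight for each feed in the \code{chickwts} data set included in \Rlang.
<<stat-fun-4>>=
aggregate(weight ~ feed, data = chickwts, FUN = mean)
@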
\index{summaries!statistical|)}
\section{Standard Probability Distributions}\label{sec:prob:dist}
\index{probability distributions!standard|(}%
\index{probability distributions!theoretical|see{--- standard}}%
\index{Normal distribution}%
Density functions, probability distribution functions, quantile functions, and functions for pseudo-random draws are available in \Rlang for several different standard (theoretical) probability distributions. Entering \code{help(Distributions)} at the \Rlang prompt will open a help page describing all the distributions available in base \Rlang. For each distribution, the different functions contain the same ``root'' in their names: \code{norm} for the normal distribution, \code{unif} for the uniform distribution, and so on. The ``head'' of the name indicates the type of values returned: ``\code{d}'' for density, ``\code{q}'' for quantile, ``\code{r}'' for (pseudo-)random draws, and ``\code{p}'' for probability (Table \ref{tab:prob:funs}).
\begin{table}
\centering
\caption[Standard probability distributions]{Standard probability distributions in \Rlang. Partial list of base \Rlang functions related to probability distributions. The full list can be obtained by executing the command \code{help(Distributions)}.\vspace{1ex}}\label{tab:prob:funs}
\begin{tabular}{llllll}
\toprule
Distribution & Symbol & Density & $P$-value & Quantiles & Draws \\
\midrule
Normal & $N$ & \Rfunction{dnorm()} & \Rfunction{pnorm()} & \Rfunction{qnorm()} & \Rfunction{rnorm()} \\
Student's & $t$ & \Rfunction{dt()} & \Rfunction{pt()} & \Rfunction{qt()} & \Rfunction{rt()}\\
F & $F$ & \Rfunction{df()} & \Rfunction{pf()} & \Rfunction{qf()} & \Rfunction{rf()} \\
binomial & $B$ & \Rfunction{dbinom()} & \Rfunction{pbinom()} & \Rfunction{qbinom()} & \Rfunction{rbinom()} \\
multinomial & $M$ & \Rfunction{dmultinom()} & & & \Rfunction{rmultinom()} \\
Poisson & & \Rfunction{dpois()} & \Rfunction{ppois()} & \Rfunction{qpois()} & \Rfunction{rpois()} \\
chi-squared & $\chi^2$ & \Rfunction{dchisq()} & \Rfunction{pchisq()} & \Rfunction{qchisq()} & \Rfunction{rchisq()} \\
lognormal & & \Rfunction{dlnorm()} & \Rfunction{plnorm()} & \Rfunction{qlnorm()} & \Rfunction{rlnorm()} \\
uniform & & \Rfunction{dunif()} & \Rfunction{punif()} & \Rfunction{qunif()} & \Rfunction{runif()} \\
\bottomrule
\end{tabular}
\end{table}
Theoretical distributions are defined by mathematical functions that accept parameters that control the exact shape and location. In the case of the Normal distribution, these parameters are the \emph{mean} (\code{mean}), controlling the location of the center, and the \emph{standard deviation} (\code{sd}), controlling the spread away from the center of the distribution. The four different functions differ in which values are calculated (the unknowns) and which values are supplied as arguments (the known inputs).
In what follows, I use the Normal distribution as an example; the functions for the other theoretical distributions follow the same naming pattern, differing only in the parameters they accept.
\subsection{Density from parameters}\label{sec:prob:dens}
\index{probability distributions!density from parameters}
To obtain a single point from the distribution curve we pass a vector of length one as an argument for \code{x}.
<<distrib-01>>=
dnorm(x = 1.5, mean = 1, sd = 0.5)
@
To obtain multiple values we can pass a longer vector as an argument.
<<distrib-01a>>=
dnorm(x = seq(from = -1, to = 1, length.out = 5), mean = 1, sd = 0.5)
@
With 50 equally spaced values for $x$ we can plot a line (\code{type = "l"}) showing that the 50 generated data points give the illusion of a continuous curve. We also add a point highlighting the density at $x = 2$, computed with \Rfunction{dnorm()} as above.
<<distrib-01b>>=
vct1 <- seq(from = -1, to = 3, length.out = 50)
df1 <- data.frame(x = vct1,
y = dnorm(x = vct1, mean = 1, sd = 1))
plot(y~x, data = df1, type = "l", xlab = "z", ylab = "f(z)")
points(x = 2, y = dnorm(x = 2, mean = 1, sd = 1))
@
\subsection{Probabilities from parameters and quantiles}\label{sec:prob:quant}
\index{probability distributions!probabilities from quantiles}
With a known quantile value, it is possible to look up the corresponding $P$-value from the Normal distribution, i.e., the area under the curve, either to the right or to the left of a given value of \code{q} (by default, integrating the lower or left tail). When working with observations, the quantile, mean and standard deviation are in most cases computed from the same observations under the null hypothesis. In the example below, we instead use invented values for all the arguments: \code{q}, the quantile, \code{mean}, the mean, and \code{sd}, the standard deviation.
<<distrib-02>>=
pnorm(q = 2, mean = 1, sd = 1)
pnorm(q = 2, mean = 1, sd = 1, lower.tail = FALSE)
pnorm(q = 2, mean = 1, sd = 4, lower.tail = FALSE)
pnorm(q = c(2, 4), mean = 1, sd = 1, lower.tail = FALSE)
@
\begin{explainbox}
In tests of significance, empirical $z$-values and $t$-values are computed by subtracting the ``expected'' mean (a hypothesised theoretical value, the mean of a control condition used as a reference, or the mean computed over all treatments under the assumption of no effect of treatments) from the observed mean for one group, or from a raw quantile, and then dividing this difference by the standard deviation. Consequently, the $P$-values corresponding to these empirical $z$-values and $t$-values need to be looked up using \code{mean = 0} and \code{sd = 1} when calling \Rfunction{pnorm()} or \Rfunction{pt()}, respectively. These frequently used values are the defaults.
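For example, looking up an empirical $z$-value of 1.96 in a two-sided test, relying on these defaults, returns the familiar $P \approx 0.05$.
<<distrib-EB01>>=
pnorm(q = 1.96, lower.tail = FALSE) * 2
@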
\end{explainbox}
\subsection{Quantiles from parameters and probabilities}\label{sec:quant:prob}
\index{probability distributions!quantiles from probabilities}
The reverse computation from that in the previous section is to obtain the quantile corresponding to a known $P$-value or area under one of the tails of the distribution curve. These quantiles are equivalent to the values in the tables of precalculated quantiles used in earlier times to assess significance with statistical tests.
<<distrib-03>>=
qnorm(p = 0.01, mean = 0, sd = 1)
qnorm(p = 0.05, mean = 0, sd = 1)
qnorm(p = 0.05, mean = 0, sd = 1, lower.tail = FALSE)
@
\begin{warningbox}
Quantile functions like \Rfunction{qnorm()} and probability functions like \Rfunction{pnorm()} always do computations based on a single tail of the distribution, even though it is possible to specify which tail we are interested in. If we are interested in obtaining simultaneous quantiles for both tails, we need to do this manually. If we are aiming at quantiles for $P = 0.05$, we need to find the quantile for each tail based on $P / 2 = 0.025$.
<<distrib-WB01>>=
qnorm(p = 0.025, mean = 0, sd = 1)
qnorm(p = 0.025, mean = 0, sd = 1, lower.tail = FALSE)
@
When calculating a $P$-value from a quantile computed from observations in a test of significance, we need to first decide whether a two-sided or single-sided test is relevant, and in the case of a single sided test, which tail is of interest. In a two-sided test we need to multiply the returned $P$-value by 2. This works in the case of a symmetric distribution like the Normal, because the quantiles in the two tails differ only in sign. However, this is not the case for asymmetric distributions.
<<distrib-WB02>>=
pnorm(q = 4, mean = 0, sd = 1, lower.tail = FALSE) * 2
@
\end{warningbox}
\subsection{``Random'' draws from a distribution}\label{sec:stat:random}
\index{random draws|see{probability distributions, pseudo-random draws}}\index{probability distributions!pseudo-random draws}
True random sequences can only be generated by physical processes. All ``pseudo-random'' sequences of numbers generated by computation are really deterministic although they share many properties with true random sequences (e.g., in relation to autocorrelation).
It is possible to compute not only pseudo-random draws from a uniform distribution but also from the Normal, $t$, $F$, and other distributions. In each case, the probability with which different values are ``drawn'' approximates the probabilities set by the corresponding theoretical distribution. Parameter \code{n} indicates the number of values to be drawn, or equivalently, the length of the vector returned (see section \ref{sec:plot:histogram} on page \pageref{sec:plot:histogram} for example plots).\qRfunction{rnorm()}\qRfunction{runif()}%
<<distrib-04,echo=2:3>>=
set.seed(12234)
rnorm(5)
rnorm(n = 10, mean = 10, sd = 2)
@
\begin{playground}
Edit the examples in sections \ref{sec:prob:quant}, \ref{sec:quant:prob}, and \ref{sec:stat:random} to do computations based on different distributions, such as Student's \emph{t}, \emph{F} or uniform.
\end{playground}
\begin{explainbox}
\index{random numbers|see{pseudo-random numbers}}\index{pseudo-random numbers}
It is impossible to generate truly random sequences of numbers by means of a deterministic process such as a mathematical computation. ``Random numbers'' as generated by \Rpgrm and other computer programs are \emph{pseudo-random numbers}, long deterministic series of numbers that resemble random draws. Random number generation uses a \emph{seed} value that determines where in the series the first value is fetched. The usual way of automatically setting the value of the seed is to take the milliseconds or a similar rapidly changing set of digits from the real-time clock of the computer. However, in cases when we wish to repeat a calculation using the same series of pseudo-random values, we can use \Rfunction{set.seed()} with an arbitrary integer as an argument to reset the generator to the same point in the underlying (deterministic) sequence.
\end{explainbox}
\begin{advplayground}
Execute the statement \code{rnorm(3)}\qRfunction{rnorm()} by itself several times, paying attention to the values obtained. Repeat the exercise, but now executing \code{set.seed(98765)}\qRfunction{set.seed()} immediately before each call to \code{rnorm(3)}, again paying attention to the values obtained. Next execute \code{set.seed(98765)}, followed by \code{c(rnorm(3), rnorm(3))}, and then execute \code{set.seed(98765)}, followed by \code{rnorm(6)} and compare the output. Repeat the exercise using a different argument in the call to \code{set.seed()}. Analyse the results and explain how \code{set.seed()} affects the generation of pseudo-random numbers in \Rlang.
\end{advplayground}
\index{probability distributions!standard|)}
\section{Observed Probability Distributions}
\index{probability distributions!observed|(}%
\index{empirical probability distributions|see{probability distributions, observed}}%
It is common to estimate the values of the parameters of a standard distribution, like Student's $t$ or the Normal, from observational data, assuming a priori the suitability of the distribution. If we compute the mean and standard deviation of a large sample, these two parameters define a specific Normal distribution curve. If we add the estimate of the degrees of freedom, $\nu = n - 1$, the three parameters define a specific $t$-distribution curve. Thus it is possible to use the functions described in section \ref{sec:prob:dist} on page \pageref{sec:prob:dist} in statistical inference.
\begin{explainbox}
Package \pkgname{mixtools} provides tools for fitting and analysing \emph{mixture models} such as the mix of two or more univariate Normal distributions. An example of its use could be to estimate mean and standard deviations for males and females in a dataset where the gender was not recorded at the time of observation.
\end{explainbox}
It is also possible to describe the observed shape of the distribution, or empirical distribution, for a data set without relying on a standard distribution. The fitted empirical distribution can later be used to compute probabilities, quantiles, and random draws as from standard distributions. This also allows statistical inference, using methods such as the bootstrap or some additive models.
Function \Rfunction{density()} computes kernel density estimates, using different methods. A curve is used to describe the shape, and the bandwidth determines how flexible this curve is. The curve is a smoother that adapts to the observed shape of the distribution of observations. The object returned is a complex list that can be used to plot the estimated curve.
The code below estimates the empirical distribution for the waiting time in minutes between eruptions of the Old Faithful geyser at Yellowstone, a dataset from \Rlang.
<<density-01>>=
d <- density(faithful$waiting, bw = "sj")
@
\begin{explainbox}
Using \Rfunction{str()} we can explore the structure of the object returned by function \Rfunction{density()}.
<<density-02>>=
str(d)
@
The object saved as \code{d} is a \code{list} with seven members. The two numeric vectors, \code{x} and \code{y}, describe the estimated probability distribution and produce the curve in the plot below. The numerical bandwidth estimated using method \code{"sj"} is in \code{bw}, and the length of vector \code{faithful\$waiting}, the data used, is in \code{n}. Member \code{call} is the command used to call the function, and the remaining two members have self-explanatory names. The returned object belongs to class \Rclass{density}. The overall pattern is similar to, but simpler than, that for the model fitting functions that we will see later in the chapter. The class name of the object is the same as the name of the function that created it, and \code{call} provides a \emph{trace} of how the object was created. Other members facilitate the computation of derived quantities and plotting. Being a list, the individual members can be extracted by name.
<<density-03>>=
d$n
@
\end{explainbox}
As a \Rmethod{plot()} method is available for class \Rclass{density}, we can easily produce a plot of the estimated empirical density distribution. In this case, the fitted bimodal curve, with two maxima, is very different to the Normal.
<<density-04>>=
plot(d)
@
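The members of \code{d} can also be used to approximate, for the empirical distribution, the kinds of values that the \code{p} and \code{r} functions return for standard distributions. A minimal sketch: numerical integration over the equally spaced grid in \code{d\$x} approximates a probability, and sampling the grid weighted by \code{d\$y} approximates random draws.
<<density-05>>=
dx <- d$x[2] - d$x[1] # spacing of the equally spaced grid
sum(d$y[d$x <= 70]) * dx # approximates P(waiting <= 70)
sample(d$x, size = 5, replace = TRUE, prob = d$y) # approximate random draws
@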
Observed probability distributions, especially empirical ones, nowadays play a central role in data visualisation including 1D and 2D empirical density plots based on the use of functions like \Rfunction{density()}, as well as traditional histograms (see section \ref{sec:plot:density} on page \pageref{sec:plot:density} for examples of more elaborate and elegant plots).
\index{probability distributions!observed|)}
\section{``Random'' Sampling}
\index{random sampling|see{pseudo-random sampling}}%
\index{pseudo-random sampling|(}%
In addition to drawing values from a theoretical distribution, we can draw values from an existing set or collection of values. We call this operation (pseudo-)random sampling. The draws can be done either with replacement or without replacement. In the first case, every draw is taken from the whole set of values, making it possible for a given value to be drawn more than once. In the default case of not using replacement, subsequent draws are taken from the values remaining after removing the values chosen in earlier draws.
<<sampling-01>>=
sample(x = LETTERS)
sample(x = LETTERS, size = 12)
sample(x = LETTERS, size = 12, replace = TRUE)
@
In practice, pseudo-random sampling is useful when we need to select subsets of observations. One such case is assigning treatments to experimental units in an experiment or selecting persons to interview in a survey. Another use is in bootstrapping to estimate variation in parameter estimates using empirical distributions.
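As a minimal sketch of the bootstrap idea, the code below estimates the standard error of the mean stopping distance in the \code{cars} data set (described in the box below) from the spread of the means of many resamples drawn with replacement; the use of 1000 resamples is an arbitrary choice.
<<sampling-01a>>=
boot.means <- numeric(1000)
for (i in seq_along(boot.means)) {
  boot.means[i] <- mean(sample(cars$dist, replace = TRUE))
}
sd(boot.means) # bootstrap estimate of the standard error of the mean
@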
\begin{faqbox}{How to sample random rows from a data frame?}
As described in section \ref{sec:R:data:frames} on page \pageref{sec:R:data:frames}, data frames are commonly used to store one observation per row. To sample a subset of rows, we need to generate a random set of indices to use with the extraction operator (\Roperator{[ ]}). Here we sample four rows from data frame \code{cars} included in \Rlang. These data consist of stopping distances for cars moving at different speeds, as described in the documentation available by entering \code{help(cars)}.
<<sampling-02>>=
cars[sample(x = 1:nrow(cars), size = 4), ]
@
\end{faqbox}
\begin{advplayground}
Consult the documentation of \Rfunction{sample()} and explain why the code below is equivalent to that in the example immediately above.
<<sampling-03, eval=eval_playground>>=
cars[sample(x = nrow(cars), size = 4), ]
@
\end{advplayground}
\index{pseudo-random sampling|)}%
\section{Correlation}\label{sec:stats:correlation}
\index{correlation|(}
Both parametric (Pearson's) and non-parametric robust (Spearman's and Kendall's) methods for the estimation of the (linear) correlation between pairs of variables are available in base \Rlang. The different methods are selected by passing arguments to a single function. While Pearson's method is based on the actual values of the observations, non-parametric methods are based on the ordering or rank of the observations, and are consequently less affected by observations with extreme values.
\subsection{Pearson's $r$}
\index{correlation!parametric}
\index{correlation!Pearson}
Function \Rfunction{cor()} can be called with two vectors of the same length as arguments. In the case of the parametric Pearson method, we do not need to provide further arguments as this method is the default one. We use data set \code{cars}.
<<cor-01>>=
cor(x = cars$speed, y = cars$dist)
@
It is also possible to pass a data frame (or a matrix) as the only argument. When the data frame (or matrix) contains only two columns, the returned correlation estimate is equivalent to that of passing the two columns individually as vectors. The object returned is a $2 \times 2$ \code{matrix} instead of a vector of length one.
<<cor-02>>=
cor(cars)
@
When the data frame or matrix contains more than two numeric vectors, the returned value is a matrix of estimates of pairwise correlations between columns. Here we use \Rfunction{rnorm()}, described above, to create a long vector of pseudo-random values drawn from the Normal distribution and \Rfunction{matrix()} to convert it into a matrix with three columns (see page \pageref{sec:matrix:array} for details about \Rlang matrices).
<<cor-02a>>=
mat1 <- matrix(rnorm(54), ncol = 3,
dimnames = list(rows = 1:18, cols = c("A", "B", "C")))
cor(mat1)
@
\begin{playground}
Modify the code in the chunk immediately above constructing a matrix with six columns and then computing the correlations.
\end{playground}
While \Rfunction{cor()} returns an estimate of $r$, the correlation coefficient, \Rfunction{cor.test()} computes, in addition, the $t$-value, $P$-value, and confidence interval for the estimate.
<<cor-03>>=
cor.test(x = cars$speed, y = cars$dist)
@
Above we passed two numeric vectors as arguments, one to parameter \code{x} and one to parameter \code{y}. Alternatively, we can pass a data frame as an argument to \code{data}, and a \emph{model formula} to \code{formula}. The argument passed to \code{formula} determines which variables from \code{data} are used, and in which role. Briefly, the variable(s) to the left of the tilde (\code{~}) are response variables, and those to the right are independent, or explanatory, variables. In the case of correlation, no assumption is made on cause and effect, and both variables appear to the right of the tilde. The code below is equivalent to that above. See section \ref{sec:stat:formulas} on page \pageref{sec:stat:formulas} for details on the use of model formulas and section \ref{sec:stat:mf} on page \pageref{sec:stat:mf} for examples of their use in model fitting.
<<cor-03a, eval=eval_playground>>=
cor.test(formula = ~ speed + dist, data = cars)
@
\begin{playground}
Functions \Rfunction{cor()} and \Rfunction{cor.test()} return \Rlang objects that, when using \Rlang interactively, get automatically ``printed'' on the screen. One should be aware that \Rfunction{print()} methods do not necessarily display all the information contained in an \Rlang object. This is almost always the case for complex objects like those returned by \Rlang functions implementing statistical tests. As with any \Rlang object, we can save the result of an analysis into a variable. As described in section \ref{sec:calc:lists} on page \pageref{sec:calc:lists} for lists, we can peek into the structure of an object with method \Rfunction{str()}. We can use \Rfunction{class()} and \Rfunction{attributes()} to extract further information. Run the code in the chunk below to discover what is actually returned by \Rfunction{cor()}.
<<cor-PG01, eval=eval_playground>>=
MAT1 <- cor(cars)
class(MAT1)
attributes(MAT1)
str(MAT1)
@
Methods \Rfunction{class()}, \Rfunction{attributes()} and \Rfunction{str()} are very powerful tools that can be used when we are in doubt about the data contained in an object and/or how it is structured. Knowing the structure allows us to retrieve the data members directly from the object when predefined extractor methods are not available.
\end{playground}
\subsection{Kendall's $\tau$ and Spearman's $\rho$}
\index{correlation!non-parametric}
\index{correlation!Kendall}
\index{correlation!Spearman}
We use the same functions as for Pearson's $r$ but explicitly request the use of one of these methods by passing an argument.
<<cor-04>>=
cor(x = cars$speed, y = cars$dist, method = "kendall")
cor(x = cars$speed, y = cars$dist, method = "spearman")
@
Function \Rfunction{cor.test()}, described above, also allows the choice of method with the same syntax as shown for \Rfunction{cor()}.
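Like many of the functions implementing statistical tests in base \Rlang, \Rfunction{cor.test()} returns a list-like object of class \code{"htest"}, whose members can be extracted by name. A brief example:
<<cor-05>>=
ct <- cor.test(x = cars$speed, y = cars$dist)
class(ct)
ct$estimate
ct$p.value
@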
\begin{playground}
Repeat the exercise in the playground immediately above, but now using non-parametric methods. How does the information stored in the returned \code{matrix} differ depending on the method, and how can we extract from the returned object information about the method used for the calculation of the correlation?
\end{playground}
\index{correlation|)}
\section{$t$-test}\label{sec:stats:ttest}
\index{t-test@$t$-test|(}%
\index{Student's t-test@Student's $t$-test|see{$t$-test}}%
The $t$-test is based on Student's $t$-distribution. It can be applied to any parameter estimate for which its standard deviation is available and for which the $t$-distribution is a plausible assumption. It is most frequently used to compare an estimate of the mean against a constant value, or an estimate of the difference between two means against a target difference, usually no difference. In \Rlang these tests can be computed manually using functions \Rfunction{mean()}, \Rfunction{sd()}, and \Rfunction{pt()}, or with \Rfunction{t.test()}.
Although rarely presented in such a way, the $t$-test can be thought of as a special case of a linear model fit. Consistent with the functions used to fit models to observations, we can use a \emph{formula} to describe a $t$-test. A formula such as \code{y\,\char"007E\,x} is read as $y$ is explained by $x$. We use \emph{lhs} (left-hand-side) and \emph{rhs} (right-hand-side) to signify all terms to the left and right of the tilde (\code{\,\char"007E\,}), respectively (\code{<lhs>\,\char"007E\,<rhs>}). (See section \ref{sec:stat:formulas} on page \pageref{sec:stat:formulas} for a detailed discussion of model formulas, and section \ref{sec:stat:mf} on page \pageref{sec:stat:mf} for examples of their use in model fitting.)
<<ttest-01>>=
df1 <- data.frame(some.size = c(rnorm(10, mean = 2.5), rnorm(10, mean = 2.0)),
group = factor(rep(c("A", "B"), each = 10)))
@
The formula \code{some.size\,\char"007E\,1} is read as ``the mean of variable \code{some.size} is explained by a constant value''. The value estimated from observations, $\bar{x}$, is compared against the value of $\mu$ set as the null hypothesis, where $\mu$ is the \emph{unknown} mean of the sampled population. By default, \Rfunction{t.test()} applies a two-sided test (\code{alternative = "two.sided"}) against \code{mu = 0}, but here we use \code{mu = 2} instead.
<<ttest-02>>=
t.test(some.size ~ 1, mu = 2, data = df1)
@
The same test can be calculated step by step. In this case, this approach is not needed, but it is useful when we have a parameter estimate (not just a mean) and its standard error available, as in model fits (see the advanced playground on page \pageref{box:stats:slope:ttest} for an example).
<<ttest-02a>>=
sem <- sqrt(var(df1$some.size) / nrow(df1))
t.value <- (mean(df1$some.size) - 2) / sem # Ho: mu = 2
p.value <- pt(t.value, df = nrow(df1) - 1, lower.tail = FALSE) * 2 # two tails
signif(c(t = t.value, df = nrow(df1) - 1, P = p.value), 4) # 4 digits
@
The same function, with a different formula, tests for the difference between the means of two groups or treatments, $H_0: {\mu}_A - {\mu}_B = 0$. We read the formula \code{some.size\,\char"007E\,group} as ``differences in \code{some.size} are explained by factor \code{group}''. The difference between the means for the two groups is estimated and compared against the hypothesis. (In this case, the value of the argument passed to \code{mu}, zero by default, describes this difference.) By default, the variances in the two groups are not assumed to be equal,
<<ttest-03>>=
t.test(some.size ~ group, data = df1)
@
\noindent
but with \code{var.equal = TRUE}, the variances in the populations from which observations in groups A and B were sampled are assumed equal, and pooled into a single estimate.
<<ttest-03a>>=
t.test(some.size ~ group, var.equal = TRUE, data = df1)
@
The $t$-test serves as an example of how statistical tests are usually carried out in \Rlang. Table \ref{tab:stats:tests} lists \Rlang functions for frequently used statistical tests.
\begin{table}
\caption[Statistical tests]{\Rlang functions implementing frequently used statistical tests. Student's $t$-test and correlation tests are described on pages \pageref{sec:stats:ttest} and \pageref{sec:stats:correlation}, respectively.}\vspace{1ex}\label{tab:stats:tests}
\centering\noindent
\begin{tabular}{ll}
\toprule
Statistical test & Function name \\
\midrule
Student's $t$-test (1 and 2 samples) & \code{t.test()} \\
Wilcoxon rank sum and signed rank tests & \code{wilcox.test()} \\
Kolmogorov-Smirnov tests & \code{ks.test()} \\
Correlation tests (Pearson, Kendall, Spearman) & \code{cor.test()} \\
$F$-test to compare two variances & \code{var.test()} \\
Fisher's exact test for count data & \code{fisher.test()} \\
Pearson's Chi-squared ($\chi^2$) test for count data & \code{chisq.test()} \\
Exact Binomial Test & \code{binom.test()} \\
Test of equal or given proportions & \code{prop.test()} \\
\bottomrule
\end{tabular}
\end{table}
\index{t-test@$t$-test|)}
\section{Model Fitting in \Rlang}\label{sec:stat:mf}
\index{models!fitting|(}
The general approach to model fitting in \Rlang is to separate the actual fitting of a model from the inspection of the fitted model. A model fitting function minimally requires a description of the model to fit, as a model \code{formula} and a data frame or vectors with the data or observations to which to fit the model. These functions in \Rlang return a model-fit object. This object contains the data, the model formula, the call, and the result of fitting the model. Several methods are available for querying it. The diagram in Figure \ref{fig:model:fit:diagram} summarises the approach used in \Rlang for data analysis based on fitted models.
\begin{figure}
\centering
\begin{small}
\begin{tikzpicture}[node distance=1.4cm, scale=0.5]
\node (model) [tprocess] {\textsl{model specification}};
\node (data) [tprocess, below of=model] {\textsl{observations}};
\node (fitfun) [tprocess, right of=model, yshift=-0.7cm, xshift=2.5cm] {\textsl{fitting function}};
\node (fm) [tprocess, color = black, right of=fitfun, xshift=1.5cm] {\textsl{fitted model}};
\node (summary) [tprocess, color = black, right of=fm, xshift=1.7cm] {\textsl{query methods}};
\draw [arrow] (model) -- (fitfun);
\draw [arrow] (data) -- (fitfun);
\draw [arrow] (fitfun) -- (fm);
\draw [arrow] (fm) -- (summary);
\end{tikzpicture}
\end{small}
\caption[Model fitting in \Rlang]{Model fitting in \Rlang is done in steps, and can be represented schematically as a flow of information.}\label{fig:model:fit:diagram}
\end{figure}
Models are described using model formulas such as \code{y\,\char"007E\,x} which we read as $y$ is explained by $x$. We use \emph{lhs} (left-hand-side) and \emph{rhs} (right-hand-side) to signify all terms to the left and right of the tilde (\code{\,\char"007E\,}), respectively (\code{<lhs>\,\char"007E\,<rhs>}). Model formulas are used in different contexts: fitting of models, plotting, and tests like $t$-test. The syntax of model formulas is consistent throughout base \Rlang and numerous independently developed packages. However, their use is not universal, and several packages extend the basic syntax to allow the description of specific types of models. As most things in \Rlang, model formulas are objects and can be stored in variables. See section \ref{sec:stat:formulas} on page \pageref{sec:stat:formulas} for a detailed discussion of model formulas.
Although there is some variation, especially for fitted model classes defined in extension packages, in most cases the \textsl{query functions} grouped together in the rightmost box in the diagram in Figure \ref{fig:model:fit:diagram} include \Rfunction{summary()}, \Rfunction{anova()} and \Rfunction{plot()}, with other methods such as \Rfunction{coef()}, \Rfunction{residuals()}, \Rfunction{fitted()}, \Rfunction{predict()}, \Rfunction{AIC()}, and \Rfunction{BIC()} usually also available. Additional methods may be available. However, as model fit objects are \code{list}-like, these and other values can be extracted and/or computed programmatically when needed. The examples in this chapter can be adapted to the fitting of types of models not described in this book.
\begin{explainbox}
Fitted model objects in \Rlang are self-contained and include a copy of the data to which the model was fit, as well as residuals and possibly even intermediate results of computations. Although this can make the size of these objects large, it allows querying and even updating them in the absence of the data in the current \Rlang workspace.
\end{explainbox}
\section{Fitting Linear Models}\label{sec:stat:LM}
\index{models!linear|see{linear models}}
\index{linear models|(}
\index{LM|see{linear models}}
Regression, analysis of variance (ANOVA) and analysis of covariance (ANCOVA) are all linear models, differing only in the type of explanatory variables included in the statistical model fitted. If all explanatory variables in the fitted model are continuous, i.e., \code{numeric} vectors, the model is a regression model. If all explanatory variables are discrete, i.e., \code{factors}, the model is an ANOVA. Finally, if the model contains both \code{numeric} variables and \code{factors}, it is named ANCOVA. As in all cases the fitting approach is the same, based on ordinary least squares (OLS), in \Rlang they are all implemented in function \Rfunction{lm()}.
There is another meaning of ANOVA, referring only to the tests of significance rather than to an approach to model fitting. Consequently, and rather confusingly, results of tests of significance for regression, ANOVA and ANCOVA fits can all be presented in an ANOVA table. In this second, stricter meaning, ANOVA means a test of significance based on the ratios between pairs of variances.
\begin{warningbox}
If you do not clearly remember the difference between numeric vectors and factors, or how they can be created, please, revisit chapter \ref{chap:R:as:calc} on page \pageref{chap:R:as:calc}.
\end{warningbox}
\begin{figure}
\centering
\begin{small}
\begin{tikzpicture}[node distance=1.4cm, scale=0.5]
\node (model) [tprocess] {\textsl{model} $\to$ \code{formula}};
\node (data) [tprocess, below of=model, yshift = 0.4cm] {\textsl{observations} $\to$ \code{data}};
\node (weights) [tprocess, dashed, below of=data, fill=blue!1, yshift = 0.4cm] {\textsl{weights} $\to$ \code{weights}};
\node (fitfun) [tprocess, right of=data, xshift=2.5cm, fill=blue!5] {\code{lm()}};
\node (fm) [tprocess, color = black, right of=fitfun, xshift=1.5cm, fill=blue!5] {\code{lm} \textsl{object}};
\node (summary) [tprocess, color = black, right of=fm, xshift=1.7cm] {\code{summary()}};
\node (anova) [tprocess, color = black, below of=summary, yshift = 0.4cm] {\code{anova()}};
\node (plot) [tprocess, color = black, above of=summary, yshift = -0.4cm] {\code{plot()}};
\draw [arrow] (model) -- (fitfun);
\draw [arrow] (data) -- (fitfun);
\draw [arrow, dashed] (weights) -- (fitfun);
\draw [arrow] (fitfun) -- (fm);
\draw [arrow] (fm) -- (plot);
\draw [arrow] (fm) -- (anova);
\draw [arrow] (fm) -- (summary);
\end{tikzpicture}
\end{small}
\caption[Linear model fitting in \Rlang]{Linear model fitting in \Rlang is done in steps. The generic diagram from Figure \ref{fig:model:fit:diagram} redrawn to show a linear model fit. Non-filled boxes are shared with the fitting of other types of models, and filled ones are specific to \Rfunction{lm()}. Only the three most frequently used query methods are shown, while both response and explanatory variables are under \textsl{observations}. Dashed boxes and arrows are optional as defaults are provided.}\label{fig:lm:fit:diagram}
\end{figure}
Figure \ref{fig:lm:fit:diagram} shows the steps needed to fit a linear model and extract the estimates and test results. The observations are stored in a data frame, one case or event per row, with values for both response and explanatory variables in variables or columns. The model formula is used to indicate which variables in the data frame are to be used and in which role: either
response or explanatory, and when explanatory how they contribute to the estimated response. The object containing the results from the fit is queried to assess validity and make conclusions or predictions.
\begin{explainbox}
Weights are multiplicative factors used to alter the \emph{weight} given to individual residuals when fitting a model to observations that are not equally informative. A frequent case is fitting a model to observations that are means of drastically different numbers of individual measurements. Some model fit functions compute the weights, but in most cases they are supplied as an argument to parameter \code{weights}. By default, \code{weights} have a value of \code{1} and thus do not affect the resulting model fit. When supplied or computed, the weights are saved to the model fit object.
\end{explainbox}
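A minimal sketch of this frequent case, using the \code{cars} data set analysed below: after collapsing the observations to one mean stopping distance per speed, weighting each mean by the number of observations behind it recovers the coefficient estimates obtainable from a fit to the individual observations.
<<models-0>>=
cars.means <- aggregate(dist ~ speed, data = cars, FUN = mean)
cars.means$n.obs <- aggregate(dist ~ speed, data = cars, FUN = length)$dist
fm0 <- lm(dist ~ speed, data = cars.means, weights = n.obs)
coef(fm0)
@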
\subsection{Regression}\label{sec:stat:LM:regression}
%\index{linear regression}
\index{linear regression|see{linear models, linear regression}}%
\index{linear models!linear regression|(}%
The \Rdata{cars} data set, containing two \code{numeric} variables, is used in the examples.
A simple linear model $y = \alpha \cdot 1 + \beta \cdot x$ where $y$ corresponds to stopping distance (\code{dist}) and $x$ to initial speed (\code{speed}) is formulated in \Rlang as \code{dist \char"007E\ 1 + speed}. The fitted model object is assigned to variable \code{fm1} (a mnemonic for fitted-model one).\label{chunk:lm:models1}
<<models-1>>=
fm1 <- lm(dist ~ 1 + speed, data=cars)
class(fm1)
@
The next step is the diagnosis of the fit. Are the assumptions of the linear model procedure used reasonably close to being fulfilled? In \Rlang it is most common to use plots to this end. We show here only one of the plots normally produced. This quantile vs.\ quantile plot is used to assess how much the distribution of the residuals deviates from the assumed Normal distribution.
<<models-1a>>=
plot(fm1, which = 2)
@
In the case of a regression, calling \Rfunction{summary()} with the fitted model object as argument is most useful as it provides a table of coefficient estimates and their errors. Remember that as is the case for most \Rlang functions, the value returned by \Rfunction{summary()} is printed when we call this method at the \Rlang prompt.
<<models-1b>>=
summary(fm1)
@
The summary\index{linear models!summary table} is organised in sections. ``Call:'' shows \code{dist\ \char"007E\ 1 + speed}, the specification of the model fitted, plus the data used. ``Residuals:'' displays the extremes, quartiles and median of the residuals, or deviations between observations and the fitted line. ``Coefficients:'' contains estimates of the model parameters and their variation plus corresponding $t$-tests. In the last three lines, there is information on the residual standard error and its degrees of freedom, and the overall coefficient of determination ($R^2$) and $F$-statistic.
Replacing $\alpha$ and $\beta$ in $y = \alpha \cdot 1 + \beta \cdot x$ by the estimates for the intercept, $a = -17.6$, and slope, $b = 3.93$, we obtain an estimate for the regression line $y = -17.6 + 3.93 x$. However, given the nature of the problem, we \emph{know based on first principles} that the stopping distance must be zero when speed is zero. This suggests that we should not estimate the value of $\alpha$ but instead set $\alpha = 0$, or in other words, fit the model $y = \beta \cdot x$.
In \Rlang models, the intercept is included by default, so the model fitted above can be formulated as \code{dist\ \char"007E\ speed}---i.e., the missing \code{+ 1} does not change the model. To exclude the intercept, we need to specify the model as \code{dist\ \char"007E\ speed - 1} (or its equivalent \code{dist\ \char"007E\ speed + 0}), for a straight line passing through the origin ($x = 0$, $y = 0$). In the summary for this model there is an estimate for the slope but not for the intercept.
<<models-2>>=
fm2 <- lm(dist ~ speed - 1, data = cars)
summary(fm2)
@
The equation for \code{fm2} is $y = 2.91 x$. From the residuals, it can be seen that this model is inadequate, as the straight line does not follow the curvature of the cloud of observations.
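Because fitted model objects are self-contained, method \Rfunction{update()} can refit a saved model with a modified formula, providing a shorthand for small changes. A sketch equivalent to the fit of \code{fm2} above:
<<models-2a>>=
fm2a <- update(fm1, . ~ . - 1) # refit fm1 without the intercept
coef(fm2a)
@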
\begin{playground}
You will now fit a second-degree polynomial\index{linear models!polynomial regression}\index{polynomial regression}, a different linear model: $y = \alpha \cdot 1 + \beta_1 \cdot x + \beta_2 \cdot x^2$. The function used is the same as for linear regression, \Rfunction{lm()}. We only need to alter the formulation of the model. The identity function \Rfunction{I()} is used to protect its argument from being interpreted as part of the model formula. Instead, its argument is evaluated beforehand and the result is used as the, in this case second, explanatory variable.\label{chunk:stats:fm3}
<<models-3, eval=eval_playground>>=
fm3 <- lm(dist ~ speed + I(speed^2), data = cars)
plot(fm3, which = 3)
summary(fm3)
@
The ``same'' fit using an orthogonal polynomial can be specified using function \Rfunction{poly()}. Polynomials of different degrees can be obtained by supplying as the second argument to \Rfunction{poly()} the corresponding positive integer value. In this case, the different terms of the polynomial are bulked together in the summary.
<<models-3a, eval=eval_playground>>=
fm3a <- lm(dist ~ poly(speed, 2), data = cars)
summary(fm3a)
@
It is possible to compare two model fits using \Rfunction{anova()}, testing whether one of the models describes the data better than the other. It is important in this case to take into consideration the nature of the difference between the model formulas, most importantly whether they can be interpreted as nested---i.e., as a base model vs.\ the same model with additional terms.
<<models-4, eval=eval_playground>>=
anova(fm2, fm1)
@
Three or more models can also be compared in a single call to \Rfunction{anova()}. However, care is needed, as the order of the arguments matters.
<<models-5, eval=eval_playground>>=
anova(fm2, fm3, fm3a)
anova(fm2, fm3a, fm3)
@
\label{par:stats:AIC}%
\index{Akaike's An Information Criterion@Akaike's \emph{An Information Criterion}}%
\index{AIC|see{\emph{An Information Criterion}}}%
\index{Bayesian Information Criterion@\emph{Bayesian Information Criterion}}%
\index{BIC|see{\emph{Bayesian Information Criterion}}}%
\index{Schwarz's Bayesian Criterion@Schwarz's \emph{Bayesian criterion}|see{\emph{Bayesian Information Criterion}}}%
Different criteria can be used to choose the ``best'' model: significance based on $P$-values or information criteria (AIC, BIC). AIC (Akaike's ``An Information Criterion'') and BIC (``Bayesian Information Criterion'', also known as SBC, ``Schwarz's Bayesian criterion'') penalise the ``goodness'' of the fit by the number of parameters in the fitted model. For both AIC and BIC, a smaller value is better, and the values returned can be either positive or negative, in which case more negative is better. Estimates of these criteria are returned by \Rfunction{BIC()} and \Rfunction{AIC()}.
<<models-5a, eval=eval_playground>>=
BIC(fm2, fm1, fm3, fm3a)
AIC(fm2, fm1, fm3, fm3a)
@
Once you have run the code in the chunks above, you will be able to see that these three criteria do not necessarily agree on which is the ``best'' model. Find in the output the $P$-value, BIC, and AIC estimates for the different models, and conclude which model is favoured by each of the three criteria. In addition, you will notice that the two different formulations of the quadratic polynomial are equivalent.
\end{playground}
Additional query methods give easy access to different aspects of fitted models: \Rfunction{vcov()} returns the variance-covariance matrix, \Rfunction{coef()} and its alias \Rfunction{coefficients()} return the estimates for the fitted model coefficients, \Rfunction{fitted()} and its alias \Rfunction{fitted.values()} extract the fitted values, and \Rfunction{resid()} and its alias \Rfunction{residuals()} the corresponding residuals (or deviations) (Figure \ref{fig:lm:fit:query:more}). Less frequently used accessors are \Rfunction{getCall()}, \Rfunction{effects()}, \Rfunction{terms()}, \Rfunction{model.frame()}, and \Rfunction{model.matrix()}.
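For example, continuing with the fit in \code{fm1}:
<<models-6>>=
coef(fm1)
head(fitted(fm1), 3)
head(residuals(fm1), 3)
vcov(fm1)
@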
\begin{figure}
\centering
\begin{small}
\begin{tikzpicture}[node distance=1.4cm, scale=0.5]
\node (model) [tprocess] {\textsl{model} $\to$ \code{formula}};
\node (data) [tprocess, below of=model, yshift = 0.4cm] {\textsl{observations} $\to$ \code{data}};
\node (weights) [tprocess, dashed, below of=data, fill=blue!1, yshift = 0.4cm] {\textsl{weights} $\to$ \code{weights}};
\node (fitfun) [tprocess, right of=data, xshift=2.5cm, fill=blue!5] {\code{lm()}};
\node (fm) [tprocess, color = black, right of=fitfun, xshift=1.5cm, fill=blue!5] {\code{lm} \textsl{object}};
\node (fitted) [tprocess, color = black, right of=fm, xshift=1.7cm] {\code{fitted()}};
\node (resid) [tprocess, color = black, above of=fitted, yshift = -0.4cm] {\code{residuals()}};
\node (coef) [tprocess, color = black, above of=resid, yshift = -0.4cm] {\code{coef()}};
\node (aic) [tprocess, color = black, below of=fitted, yshift = 0.4cm] {\code{AIC()}; \code{BIC()}};
\node (predict) [tprocess, color = black, below of=aic, yshift = 0.4cm] {\code{predict()}};
\node (newdata) [tprocess, color = black, left of=predict, xshift = -2.4cm] {\textsl{expl.\ vars.} $\to$ \code{newdata}};
\draw [arrow] (model) -- (fitfun);
\draw [arrow] (data) -- (fitfun);
\draw [arrow, dashed] (weights) -- (fitfun);
\draw [arrow] (fitfun) -- (fm);
\draw [arrow] (fm) -- (coef);
\draw [arrow] (fm) -- (resid);
\draw [arrow] (fm) -- (fitted);
\draw [arrow] (fm) -- (aic);
\draw [arrow] (fm) -- (predict);
\draw [arrow] (newdata) -- (predict);
\end{tikzpicture}
\end{small}
\caption[Linear model fitting, more query methods ]{Diagram including additional methods used to query fitted model objects using linear models as an example. For other details see the legend of Figure \ref{fig:lm:fit:diagram}.}\label{fig:lm:fit:query:more}
\end{figure}
\begin{playground}
Familiarise yourself with these extraction and summary methods by reading their documentation and use them to explore \code{fm1} fitted above or model fits to other data of your interest.
\end{playground}
It is usual\index{linear models!structure of summary object} to only look at the values returned by \Rfunction{anova()} and \Rfunction{summary()} as implicitly displayed by \code{print()}. However, both \Rfunction{anova()} and \Rfunction{summary()} return complex objects, derived from \code{list}, containing some members not displayed by \code{print()}. Access to members of these objects can be necessary to use them in further calculations or to print them in a format different to the default.
\begin{explainbox}\label{box:LM:fit:summary:object}
The class and structure of the objects returned by \Rmethod{summary()} depends on the class of the model fit object, i.e., \Rmethod{summary()} is a generic method with multiple specialisations.
<<models-EB2ca>>=
class(summary(fm1))
@
One case where we need to extract individual members is when adding annotations to plots. Another case is when writing reports, to programmatically include the computed values within the text. \Rfunction{str()} can be used to display the structure. Calling \Rfunction{str()} with \code{no.list = TRUE}, \code{give.attr = FALSE} and \code{vec.len = 2} as arguments restricts the output to an overview of the structure of \code{summary(fm1)}.
<<models-EB3>>=
str(summary(fm1), no.list = TRUE, give.attr = FALSE, vec.len = 2)
@
Extraction of members follows the usual \Rlang rules using \Roperator{\$}, \Roperator{[ ]}, or \Roperator{[[ ]]}.
<<models-EB4>>=
summary(fm1)$adj.r.squared
@
In the summary object, the estimates of the \code{coefficients} are accompanied by estimates of their standard errors, \emph{t}-values and \emph{P}-values, while in the model object \code{fm1} these are not included.
<<models-EB3b>>=
coef(fm1)
str(fm1$coefficients)
print(summary(fm1)$coefficients)
str(summary(fm1)$coefficients)
@
\end{explainbox}
\begin{explainbox}\label{box:LM:anova:object}
The class of the object returned by method \code{anova()} does not depend on the class of the model fit object, while its structure does.
<<models-EB2>>=
anova(fm1)
@
<<models-EB2a>>=
class(anova(fm1))
@
<<models-EB2b>>=
str(anova(fm1))
@
\end{explainbox}
\begin{explainbox}\label{box:stats:slope:ttest}
As\index{linear models!ad-hoc tests for parameters}\index{t-test@$t$-test|(}\index{calibration curves} an example of the use of values extracted from the \code{summary.lm} object, I show how to test whether the slope from a linear regression fit deviates significantly from a constant value different from zero (the usual null hypothesis of zero tests for the presence of an ``effect'' of the explanatory variable). When testing a calibration by comparing two instruments, or an instrument and a reference, a null hypothesis of one for the slope tests for deviations from the true readings. In other cases, when comparing the effectiveness of interventions, we may be interested in testing whether a new approach surpasses the one in current use by at least a specific margin. There exist many practical situations where testing whether a response exceeds a threshold is of interest.
A \emph{t}-value can be computed for the slope as for the mean. When using \Rfunction{anova()} and \Rfunction{summary()} the null hypothesis is no effect or no response, i.e., slope = 0. The equivalent test with a null hypothesis of slope = 1 is easy to implement if we consider how we calculate a $t$-value (see section \ref{sec:stats:ttest} on page \pageref{sec:stats:ttest}). To compute the \emph{t}-value we need an estimate for the slope and an estimate of its standard error. To look up the $P$-value, we need the degrees of freedom. All these values are available as members of the summary object of a fitted model.
<<models-EB5>>=
est.slope.value <- summary(fm1)$coefficients["speed", "Estimate"]
est.slope.se <- summary(fm1)$coefficients["speed", "Std. Error"]
degrees.of.freedom <- summary(fm1)$df[2]
@
The $t$-value, or quantile, is computed from the difference between the estimated slope and the null hypothesis used as reference, divided by the standard error of the estimated slope. A probability is obtained from the computed $t$-value and the $t$ distribution with matching degrees of freedom with a call to \Rfunction{pt()} (see section \ref{sec:prob:dist} on page \pageref{sec:prob:dist}). For a two-tail test we multiply the one-tail $P$ estimate by two.
<<models-EB6>>=
hyp.null <- 1
t.value <- (est.slope.value - hyp.null) / est.slope.se
p.value <- 2 * pt(q = t.value, df = degrees.of.freedom, lower.tail = FALSE)
cat("slope =", signif(est.slope.value, 3),
"with s.e. =", signif(est.slope.se, 3),
"\nt.value =", signif(t.value, 3),
"and P-value =", signif(p.value, 3))
@
This example is for a linear model fitted with function \Rfunction{lm()} but the same approach can be applied to other model fit procedures for which parameter estimates and their corresponding standard error estimates can be extracted or computed.
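A cross-check of the two-tail test is possible with \Rfunction{confint()}: the null-hypothesis value for the slope falls outside the 95\% confidence interval exactly when the two-tail $P$-value is smaller than 0.05. (This minimal sketch assumes \code{fm1} as fitted above.)
<<models-EB6a>>=
confint(fm1, parm = "speed", level = 0.95)
@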
\end{explainbox}
\begin{advplayground}
Check that, after replacing \code{hyp.null <- 1} by \code{hyp.null <- 0}, the computations above agree with the output of printing \code{summary(fm1)}.
Then modify the example so as to test whether the intercept is significantly larger than 5 feet, using a one-sided test.
\end{advplayground}
\index{t-test@$t$-test|)}
Method \Rfunction{predict()} uses the fitted model together with new data for the independent variables to compute predictions. As \Rfunction{predict()} accepts new data as input, it allows interpolation and extrapolation to values of the independent variables not present in the original data. In the case of linear and some other fitted models, method \Rfunction{predict()} returns, in addition to the prediction, estimates of the confidence and/or prediction intervals. The new data must be stored in a data frame with columns using the same names for the explanatory variables as in the data used for the fit; a response variable is not needed, and additional columns are ignored. (The explanatory variables in the new data can be either continuous or factors, but they must match in this respect those in the original data.)
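As a minimal sketch (with two arbitrarily chosen speeds), predictions from \code{fm1} with a 95\% confidence band can be computed as follows.
<<models-predict-sketch>>=
predict(fm1, newdata = data.frame(speed = c(15, 25)),
        interval = "confidence")
@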
\begin{playground}
Predict using both \code{fm1} and \code{fm2} the distance required to stop cars moving at 0, 5, 10, 20, 30, and 40~mph. Study the help page for the \Rfunction{predict()} method for linear models (using \code{help(predict.lm)}). Explore the difference between \code{"prediction"} and \code{"confidence"} bands: why are they so different?
\end{playground}
\index{linear models!linear regression|)}%
\begin{explainbox}\label{box:LM:fit:object}
The\index{linear models!structure of model fit object} objects returned by model fitting functions contain the full information, including the data to which the model was fitted. Their structure resembles a nested list. In most cases, the class of the object returned by a model fit function agrees in name with the name of the function (\code{"lm"} in this example) but is not derived from \code{"list"}. The query functions either extract parts of the object or do additional calculations and formatting based on them. Different specialisations of these methods are called depending on the class of the model fit object. (See section \ref{sec:methods} on page \pageref{sec:methods}.)
<<models-EB0>>=
class(fm1)
names(fm1)
@
The structure of model fit objects is of interest only when the query or accessor functions do not provide the needed information and components have to be extracted using operator \Roperator{[[ ]]}. Exploring these objects is also a way of learning how model fitting works in \Rlang. As with any other objects, \Rfunction{str()} shows the structure.
<<models-EB1>>=
str(fm1, no.list = TRUE, give.attr = FALSE, vec.len = 2)
@
Member \code{call} contains the function call and arguments used to create object \code{fm1}.
<<models-EB1b>>=
str(fm1$call)
@
\end{explainbox}
\subsection{Analysis of variance, ANOVA}\label{sec:anova}
%\index{analysis of variance}
\index{analysis of variance|see{linear models, analysis of variance}}%
\index{ANOVA|see{linear models, analysis of variance}}%
\index{linear models!analysis of variance|(}%
In ANOVA, the explanatory variable is categorical and, in \Rlang, must be a \code{factor} or \code{ordered} factor (see section \ref{sec:calc:factors} on page \pageref{sec:calc:factors}). As ANOVA is based on a linear model, the fitting approach is the same as for linear and polynomial regression (Figure \ref{fig:lm:fit:diagram}).
The \Rdata{InsectSprays} data set used in the next example gives insect counts in plots sprayed with different insecticides. In these data, \code{spray} is a factor with six levels.%
\label{xmpl:fun:lm:fm4}
What determines that the fitted model is an ANOVA is that \code{spray}, the explanatory variable, is a \code{factor}.
<<models-6z>>=
data(InsectSprays)
is.numeric(InsectSprays$spray)
is.factor(InsectSprays$spray)
levels(InsectSprays$spray)
@
Because \code{spray} is a \code{factor} instead of a numeric vector, a different model matrix is built from an otherwise equivalent formula.\label{chunk:stat:fm4}
<<models-6>>=
fm4 <- lm(count ~ spray, data = InsectSprays)
@
Diagnostic plots are obtained in the same way as for linear regression. Only the quantile--quantile plot is shown here for simplicity, but during data analysis it is very important to check all the diagnostic plots. As many of the residuals deviate from the one-to-one line, we have to conclude that the residuals do not follow the Normal distribution, and that a different approach to model fitting should be used (see section \ref{sec:stat:GLM} on page \pageref{sec:stat:GLM}).
<<model-6a>>=
plot(fm4, which = 2)
@
In ANOVA,\index{F-test@$F$-test} most frequently the interest is in testing hypotheses with function \Rfunction{anova()}, which implements the $F$-test for the main effects of factors and their interactions. In this example, with a single explanatory variable, there is only one effect of interest, that of \code{spray}.
<<model-6b>>=
anova(fm4)
@
\index{models!contrasts|(}
\begin{warningbox}
Function \Rfunction{summary()} can be used to extract parameter estimates that inform about the size of the effects, but meaningfully only by using contrasts different to the default ones.
Function \Rfunction{aov()} is a wrapper on \Rfunction{lm()} that returns an object for which, by default, \Rfunction{summary()} displays an ANOVA table like that from \Rfunction{anova()}; even in this case it can be preferable to change the default contrasts (see \code{help(aov)}).
The contrasts used affect the estimates returned by \Rfunction{coef()} and \Rfunction{summary()} applied to an ANOVA model fit. The default used in \Rlang, \Rfunction{contr.treatment()}, is different to that used in \Slang, \Rfunction{contr.helmert()}. With \Rfunction{contr.treatment()}, the first level of the factor (assumed to be a control) is used as reference for the estimation of the coefficients for the remaining factor levels and the testing of their significance. With \Rfunction{contr.helmert()}, the contrasts are of the second level with the first, the third with the average of the first two, and so on; these contrasts depend on the order of the factor levels. Instead, \Rfunction{contr.sum()} uses as reference the mean of all levels, i.e., it constrains the coefficient estimates to add up to zero. Obviously, the type of contrasts used changes what the coefficient estimates describe and, consequently, how the $p$-values should be interpreted.
\end{warningbox}
\begin{explainbox}
The approach used by default for model fits and ANOVA calculations varies among programs. There exist different so-called ``types'' of sums of squares, usually called I, II, and III. With orthogonal (balanced) designs the choice is of no consequence, but with unbalanced designs the differences can be important, even leading to different conclusions. \Rlang's default, type~I, is usually considered to suffer from milder problems than type~III, the default used by \pgrmname{SPSS} and \pgrmname{SAS}. In any case, for unbalanced data it is preferable to use the approach implemented in package \pkgname{nlme}.
\end{explainbox}
\begin{explainbox}
The most straightforward way of setting a different default for contrasts for a whole series of model fits is to set \Rlang option \code{contrasts}, whose current value we print here.
<<contrasts-01>>=
options("contrasts")
@
The option is set to a named character vector of length two, with the first value, named \code{unordered} giving the name of the function used when the explanatory variable is an unordered \code{factor} (created with \Rfunction{factor()}) and the second value, named \code{ordered}, giving the name of the function used when the explanatory variable is an \code{ordered} factor (created with \Rfunction{ordered()}).
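As a minimal sketch (choosing \Rfunction{contr.sum} arbitrarily), the option can be set and the previous value later restored, as \Rfunction{options()} invisibly returns the old values of the options it sets.
<<contrasts-01a>>=
old <- options(contrasts = c(unordered = "contr.sum", ordered = "contr.poly"))
options("contrasts") # the new value
options(old)         # restore the previous setting
@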
It is also possible to select the contrasts to be used in an individual call to \code{aov()} or \code{lm()}.
<<models-6c>>=
fm4trea <- lm(count ~ spray, data = InsectSprays,
contrasts = list(spray = contr.treatment))
fm4sum <- lm(count ~ spray, data = InsectSprays,
contrasts = list(spray = contr.sum))
@
In \code{fm4trea} we used \Rfunction{contr.treatment()}, thus contrasts for individual treatments are computed against \code{Spray1}, taken as the control or reference, as can be seen in the generated contrasts matrix. For this reason, there is no row for \code{Spray1} in the summary table: each of the rows \code{Spray2} to \code{Spray6} reports a test comparing one of these treatments against \code{Spray1}.
<<contrasts-02>>=
contr.treatment(length(levels(InsectSprays$spray)))
@
<<models-6d>>=
summary(fm4trea)
@
In \code{fm4sum} we used \Rfunction{contr.sum()}, which constrains the coefficient estimates to add up to zero; thus the estimate for the last treatment level is determined by those for the previous ones, and is not tested for significance.
<<contrasts-03>>=
contr.sum(length(levels(InsectSprays$spray)))
@
<<models-6e>>=
summary(fm4sum)
@
\end{explainbox}
\begin{advplayground}
Explore how taking the last level as reference in \Rfunction{contr.SAS()} instead of the first one as in \Rfunction{contr.treatment()} affects the estimates. Reorder the levels of factor \code{spray} so that the test using \Rfunction{contr.SAS()} becomes equivalent to that obtained above with \Rfunction{contr.treatment()}. Consider why \Rfunction{contr.poly()} is the default for ordered factors and when \Rfunction{contr.helmert()} could be most useful.
\end{advplayground}
Contrasts, on the other hand, do not affect the table returned by \Rfunction{anova()}, as this table does not deal with the effects of individual factor levels. The overall estimates shown at the bottom of the summary table also remain unchanged. In other words, when different contrasts are used, what changes is how the total variation explained by the fitted model is partitioned into components to be tested for specific contributions, while the overall model fit stays the same.
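This can be verified, for example, by comparing the ANOVA tables computed from the two fits above, which differ only in the contrasts used.
<<contrasts-04>>=
all.equal(anova(fm4trea), anova(fm4sum))
@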
\begin{explainbox}
\index{tests!adjusted P-values@{adjusted $P$-values}}%
Post-hoc tests based on specific contrasts\index{tests!post-hoc} and multiple comparisons\index{tests!multiple comparisons} tests are most frequently applied after an ANOVA, to test for differences among pairs of treatments or specific combinations of treatments. \Rlang function \Rfunction{TukeyHSD()} implements Tukey's HSD\index{tests!Tukey's HSD} (honestly significant difference) test for pairwise comparisons, and expects a model fitted with \Rfunction{aov()}. Function \Rfunction{pairwise.t.test()} supports different correction methods for the $P$-values from simultaneous $t$-tests. Function \Rfunction{p.adjust()} applies adjustments to $P$-values and can be used when the test procedure does not apply them. The most comprehensive implementation of multiple comparisons is available in package \pkgname{multcomp}; function \Rfunction{glht()} (general linear hypothesis testing) from this package supports different contrasts and adjustment methods.
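As a minimal sketch, Tukey's HSD test can be applied to the insect counts example; because \Rfunction{TukeyHSD()} expects a model fitted with \Rfunction{aov()}, we refit the model here.
<<posthoc-sketch>>=
fm4aov <- aov(count ~ spray, data = InsectSprays)
TukeyHSD(fm4aov)
@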
\end{explainbox}
Contrasts and their interpretation are discussed in detail by \citeauthor{Venables2002} (\citeyear{Venables2002}) and \citeauthor{Crawley2012} (\citeyear{Crawley2012}).
\index{models!contrasts|)}
\index{linear models!analysis of variance|)}%
\subsection{Analysis of covariance, ANCOVA}
%\index{analysis of covariance}
\index{analysis of covariance|see{linear models, analysis of covariance}}
\index{linear models!analysis of covariance}
\index{ANCOVA|see{linear models, analysis of covariance}}
When a linear model includes both explanatory factors and continuous explanatory variables, we may call it \emph{analysis of covariance} (ANCOVA). The formula syntax is the same for all linear models and, as mentioned in previous sections, what determines the type of analysis is the nature of the explanatory variable(s). As the formulation remains the same, no worked example is given, only a schematic call below. The main difficulty of ANCOVA is in the selection of the covariate and in the interpretation of the results of the analysis, especially when the covariate is not independent of the treatment described by the factor \autocite[e.g.][]{Smith1957}.
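The sketch below is purely schematic and not evaluated: \code{response}, \code{covariate}, \code{treatment} and \code{my.data} are hypothetical names standing for a numeric response, a continuous covariate, a factor, and a data frame containing them.
<<ancova-sketch, eval=FALSE>>=
# schematic ANCOVA: a continuous covariate plus a factor
fmx <- lm(response ~ covariate + treatment, data = my.data)
anova(fmx)
@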
\index{linear models|)}
\subsection{Model update and selection}\label{sec:stat:update:step}
\index{models!updating|(}
Model fit objects can be updated, i.e., modified, because they contain not only the results of the fit but also the data to which the model was fit (see page \pageref{box:LM:fit:object}). Given that the call is also stored, all the information needed to recalculate the same fit is available. Method \Rfunction{update()} makes it possible to recalculate the fit with changes to the call, without passing again all the arguments to a new call to \Rfunction{lm()} (Figure \ref{fig:lm:update:diagram}). We can modify different arguments, including selecting part of the data by passing a new argument to formal parameter \code{subset}.
\begin{figure}
\centering
\begin{small}
\begin{tikzpicture}[node distance=1.4cm, scale=0.5]
\node (model) [tprocess] {\textsl{model}};
\node (data) [tprocess, below of=model, yshift=-0.4cm] {\textsl{observations}};
\node (fitfun) [tprocess, right of=model, yshift=-0.9cm, xshift=1cm] {\code{lm()}};
\node (fm) [tprocess, color = black, right of=fitfun, xshift=1.8cm] {\code{lm} \textsl{object} $\to$ \code{object}};
\node (update) [tprocess, color = black, right of=fm, xshift = 1.8cm, fill=blue!5] {\code{update()}};
\node (newmodel) [tprocess, above of=fm, fill=blue!5, yshift=-0.3cm] {\textsl{new model} $\to$ \code{formula}};
\node (newdata) [tprocess, below of=fm, fill=blue!5, yshift=0.3cm] {\textsl{new observs.} $\to$ \code{data}};
\node (newfm) [tprocess, color = black, right of=update, xshift=1.5cm, fill=blue!5] {\code{lm} \textsl{object}};
\draw [arrow] (model) -- (fitfun);
\draw [arrow] (data) -- (fitfun);
\draw [arrow] (fitfun) -- (fm);
\draw [arrow] (fm) -- (update);
\draw [arrow] (newmodel) -- (update);
\draw [arrow] (newdata) -- (update);
\draw [arrow] (update) -- (newfm);
\end{tikzpicture}
\end{small}
\caption[Updating a fitted model]{Diagram showing the steps for updating a fitted model (in filled boxes) together with the previous steps in unfilled boxes. Please, see Figure \ref{fig:lm:fit:diagram} for other details.}\label{fig:lm:update:diagram}
\end{figure}
Method \Rfunction{update()} retrieves the call from the model fit object using \Rfunction{getCall()}, modifies it and, by default, evaluates it. The default \Rfunction{update()} method works as long as the model-fit object contains a member named \code{call} or if a specialisation of \Rfunction{getCall()} is available. Thus, method \Rfunction{update()} can be used with models fitted with other functions in addition to \Rfunction{lm()}.
For the next example, we recreate the model fit object \code{fm4} from page \pageref{chunk:stat:fm4}.
<<models-update-01>>=
fm4 <- lm(count ~ spray, data = InsectSprays)
anova(fm4)
fm4a <- update(fm4, formula = log10(count + 1) ~ spray)
anova(fm4a)
@
\begin{playground}
Print \code{fm4\$call} and \code{fm4a\$call}. These two calls differ in the argument to \code{formula}. What other members have been updated in \code{fm4a} compared to \code{fm4}?
\end{playground}
In the chunk above we replaced the argument passed to \code{formula}. This is a frequent use but, for example, to fit the same model to a subset of the data, we can instead pass a suitable argument to parameter \code{subset}.
<<models-update-02>>=
fm4b <- update(fm4, subset = !spray %in% c("A", "B"))
anova(fm4b)
@
\begin{advplayground}
When having many treatments with long names, which is not the case here, instead of listing the factor levels for which to subset the data, it can be convenient to use regular expressions for pattern matching (see section \ref{sec:calc:regex} on page \pageref{sec:calc:regex}). Run the code below, and investigate why \code{anova(fm4b)} and \code{anova(fm4c)} produce the same ANOVA table printout although the fitted model objects are not identical. You can use \code{str()} to explore which members differ between the two objects.
<<models-update-03, eval=eval_playground>>=
fm4c <- update(fm4, subset = !grepl("[AB]", spray))
anova(fm4c)
identical(fm4b, fm4c)
@
\end{advplayground}
\begin{explainbox}
Method \Rfunction{update()} plays an additional role when the fitting is done by numerical approximation, as the previously computed estimates are used as the starting values for the numerical calculations required for fitting the updated model (see section \ref{sec:stat:NLS} on page \pageref{sec:stat:NLS} for an example). This can drastically decrease computation time, or even ease the task of finding suitable starting values for parameter estimates, by fitting increasingly complex nested models.
\end{explainbox}
\index{models!updating|)}
\index{models!stepwise selection|(}\index{linear models!stepwise model selection|(}
Method \Rfunction{update()} used together with \Rfunction{AIC()} (or \Rfunction{anova()}) gives us the tools to compare nested models and select one out of a group, as shown above. When comparing several models, doing the comparisons manually is tedious and, in scripts, it is in many cases difficult to write code that is flexible (or abstract) enough. Method \Rfunction{step()} automates the stepwise selection of nested models, such as the selection among polynomials of different degrees or of which variables to retain in multiple regression. After fitting a model, method \Rfunction{step()} is used to update this model using an automatic stopping criterion (Figure \ref{fig:lm:step:diagram}).
\begin{figure}
\centering
\begin{small}
\begin{tikzpicture}[node distance=1.4cm, scale=0.5]
\node (model) [tprocess] {\textsl{model}};
\node (data) [tprocess, below of=model, yshift=-0.4cm] {\textsl{observations}};
\node (fitfun) [tprocess, right of=model, yshift=-0.9cm, xshift=1cm] {\code{lm()}};
\node (fm) [tprocess, color = black, right of=fitfun, xshift=1.8cm] {\code{lm} \textsl{object} $\to$ \code{object}};
\node (step) [tprocess, color = black, right of=fm, xshift = 1.8cm, fill=blue!5] {\code{step()}};
\node (newmodels) [tprocess, dashed, above of=fm, fill=blue!5, yshift=-0.3cm] {\textsl{new model(s)} $\to$ \code{scope}};
\node (newfm) [tprocess, color = black, right of=step, xshift=1.5cm, fill=blue!5] {\code{lm} \textsl{object}};
\draw [arrow] (model) -- (fitfun);
\draw [arrow] (data) -- (fitfun);
\draw [arrow] (fitfun) -- (fm);
\draw [arrow] (fm) -- (step);
\draw [arrow, dashed] (newmodels) -- (step);
\draw [arrow] (step) -- (newfm);
\end{tikzpicture}
\end{small}
\caption[Stepwise model selection]{Diagram showing the steps used for stepwise model selection among nested models (in filled boxes) together with the previous steps in unfilled boxes. The range of models to select from can be set by the user. See Figure \ref{fig:lm:fit:diagram} for other details.}\label{fig:lm:step:diagram}
\end{figure}
Stepwise model selection---either in the \emph{forward} direction from simpler to more complex models, in the \emph{backward} direction from more complex to simpler models, or in both directions---is implemented in base \Rlang's \emph{method} \Rfunction{step()} using Akaike's information criterion (AIC)\index{Akaike's An Information Criterion@Akaike's \emph{An Information Criterion}} as the selection criterion. Method \Rfunction{step()} can be used, for example, with \code{lm()} and \code{glm()} fits. AIC is described on page \pageref{par:stats:AIC}.
For the next example, we use \code{fm3} from page \pageref{chunk:stats:fm3}, a linear model for a polynomial regression. If, as shown here, no models are passed through formal parameter \code{scope}, the previously fitted model is simplified, if possible. By default, method \Rfunction{step()} prints to the console a trace of the models tried and the corresponding AIC estimates.
<<models-step-01>>=
fm3 <- lm(dist ~ speed + I(speed^2), data = cars)
fm3a <- step(fm3)
@
Method \Rfunction{summary()} reveals the differences between the original and updated models.
<<models-step-01a>>=
summary(fm3)
summary(fm3a)
@
If we pass a model with additional terms through parameter \code{scope}, this model is taken as the most complex one to be assessed. If, instead of a single model, we pass a list of two nested models named \code{lower} and \code{upper}, they delimit the scope of the stepwise search. In the next example, a backward search is done first and term \code{speed} is removed because its removal decreases (= improves) the AIC. Subsequently, a forward search is done, unsuccessfully, as the AIC increases.
<<models-step-02>>=
fm3b <-
step(fm3,
scope = dist ~ speed + I(speed^2) + I(speed^3) + I(speed^4))
summary(fm3b)
@
\begin{playground}
Explain why the stepwise model selection in the code below differs from those in the two previous examples. Consult \code{help(step)} if necessary.
<<models-step-02a, eval=eval_playground>>=
fm3c <-
step(fm3,
scope = list(lower = dist ~ speed,
upper = dist ~ speed + I(speed^2) + I(speed^3) + I(speed^4)))
summary(fm3c)
@
\end{playground}
Functions \Rfunction{update()} and \Rfunction{step()} are \emph{convenience functions}: they provide direct and simpler access to operations that are also possible through the combined use of other functions.
\index{linear models!stepwise model selection|)}\index{models!stepwise selection|)}
\section{Generalised Linear Models}\label{sec:stat:GLM}
\index{generalised linear models|(}\index{models!generalised linear|see{generalised linear models}}
\index{GLM|see{generalised linear models}}
Linear models assume normally distributed residuals. Generalised linear models, fitted with function \Rfunction{glm()}, are more flexible: the assumed distribution can be selected as well as the link function (the defaults match the assumptions of \code{lm()}). Figure \ref{fig:glm:fit:diagram} shows that the steps used to fit a model with \Rfunction{glm()} are the same as with \code{lm()}, except that we can select the probability distribution assumed to describe the variation among observations. Frequently used probability distributions are the binomial and the Poisson (see \code{help(family)} for variations and additional ones).\index{models!for binomial outcomes data}\index{models!for counts data}
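As a minimal check of these defaults (an aside, not used in the examples that follow), a \Rfunction{glm()} fit with the default \code{gaussian} family and identity link estimates the same coefficients as \Rfunction{lm()}.
<<glm-default-sketch>>=
# default family is gaussian(link = "identity"), matching lm()
fm.glm <- glm(dist ~ speed, data = cars)
all.equal(coef(fm.glm), coef(lm(dist ~ speed, data = cars)))
@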
\begin{figure}
\centering
\begin{small}
\begin{tikzpicture}[node distance=1.4cm, scale=0.5]
\node (model) [tprocess] {\textsl{model} $\to$ \code{formula}};
\node (data) [tprocess, below of=model, yshift = 0.4cm] {\textsl{observations} $\to$ \code{data}};
\node (weights) [tprocess, dashed, below of=data, fill=blue!1, yshift = 0.4cm] {\textsl{weights} $\to$ \code{weights}};
\node (family) [tprocess, below of=weights, fill=blue!5, yshift = 0.4cm] {\textsl{distribution} $\to$ \code{family}};
\node (fitfun) [tprocess, right of=data, xshift=2.5cm, yshift = -0.4cm,fill=blue!5] {\code{glm()}};
\node (fm) [tprocess, color = black, right of=fitfun, xshift=1.5cm, fill=blue!5] {\code{glm} \textsl{object}};
\node (summary) [tprocess, color = black, right of=fm, xshift=1.7cm] {\code{summary()}};
\node (anova) [tprocess, color = black, below of=summary, yshift = 0.4cm] {\code{anova()}};
\node (plot) [tprocess, color = black, above of=summary, yshift = -0.4cm] {\code{plot()}};
\draw [arrow] (model) -- (fitfun);
\draw [arrow] (data) -- (fitfun);
\draw [arrow, dashed] (weights) -- (fitfun);
\draw [arrow] (family) -- (fitfun);
\draw [arrow] (fitfun) -- (fm);
\draw [arrow] (fm) -- (plot);
\draw [arrow] (fm) -- (anova);
\draw [arrow] (fm) -- (summary);
\end{tikzpicture}
\end{small}
\caption[Generalised linear model fitting in \Rlang]{Generalised linear model fitting in \Rlang is done in steps similar to those used for linear models. Generic diagram from Figure \ref{fig:model:fit:diagram} redrawn to show a generalised linear model fit. Non-filled boxes are shared with fitting of other types of models, and filled ones are specific to \Rfunction{glm()}. Only the three most frequently used query methods are shown, while both response and explanatory variables are under \textsl{observations}. Dashed boxes and arrows are optional as defaults are provided.}\label{fig:glm:fit:diagram}
\end{figure}
For count data, GLMs are preferred over LMs. In the example below, we fit the same model as above to the \Rdata{InsectSprays} data (see page \pageref{chunk:stat:fm4}), but assuming a quasi-Poisson distribution instead of the Normal. The argument passed to \code{family} selects the assumed error distribution.
<<model-10>>=
fm10 <- glm(count ~ spray, data = InsectSprays, family = quasipoisson)
@
Method \Rfunction{plot()}, as for linear-model fits, produces diagnostic plots. We show, as before, only the quantile--quantile plot of the residuals. The Normal distribution assumed in the linear model fit above (section \ref{sec:anova} on page \pageref{sec:anova}) was not a good approximation, as count data are known to follow a different distribution. This becomes clear when comparing the quantile--quantile plot for \code{fm4} (page \pageref{sec:anova}) with the plot below, for the model fitted under the assumption of a quasi-Poisson distribution.
<<model-11>>=
plot(fm10, which = 2)