% !Rnw root = learnR.Rnw
\input{preamble}
\noindent
\fbox{\parbox{\textwidth}{
\vspace*{-2pt}
\begin{tabular}{ll}
Science, & What is the role of statistics and statistical\\
statistics \& R & analysis in the wider scientific enterprise?\\[6pt]
Scatterplot & Scatterplot matrices can give useful insights on\\
matrices & data that will be used for regression or related\\
& calculations.\\[6pt]
Transformation & Data often require transformation prior to entry\\
& into a regression model.\\[6pt]
Model & Fitting a regression or other such model gives,\\
objects & in the first place, a model object.\\[6pt]
Generic & \code{plot()}, \code{print()} and \code{summary()} are examples\\
functions & of {\em generic} functions. With a dataframe as \\
& argument \code{plot()} gives a scatterplot matrix.\\
& With an \code{lm} object, it gives diagnostic plots.\\[6pt]
Extractor & Use an extractor function to extract output from\\
function & a model object. Extractor functions are {\em generic}\\
& functions.\\[6pt]
List objects & An \code{lm} model object is a list object. Lists are\\
& used extensively in R.\\
\end{tabular}
}}
\vspace*{8pt}
\marginnote[12pt]{Issues that will be noted include the use of
{\em generic} functions such as \code{plot()} and \code{print()},
the way that regression model objects are structured, and the use
of extractor functions to extract information from model objects.}
This chapter will use examples to illustrate common issues in the
exploration of data and the fitting of regression models. It will
round out the discussion of Chapters \ref{ch:getStart} and
\ref{ch:workenv} by adding some further important technical details.
The notes that follow assume a knowledge of basic statistical
ideas and methods, to the extent that readers will be comfortable
with output that includes details of standard errors, $t$-statistics,
and $p$-values. At one point, there is a mention of Bayesian
prior probability.
There is use, in several places, of
cross-validation for assessing predictive accuracy. For this, the data
are split into several subsets (10 is a common choice). Each
subset is then left out in turn, and used to check predictive accuracy
when the model is fitted to the remaining data. Once the process is
complete, predictions made independently of the observed values are
available for all observations. Differences between observed values
and predictions made independently of those values can then be used
to assess predictive accuracy.
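As an illustration (a minimal sketch only, coded in base R, with the
\code{MASS::mammals} data and a log-log regression chosen purely for
the example), 10-fold cross-validation might be coded as:
\begin{Schunk}
\begin{Sinput}
## 10-fold cross-validation, coded directly in base R
set.seed(17)              # make the random split reproducible
df <- MASS::mammals       # example data; any data frame will do
k <- 10
fold <- sample(rep(1:k, length.out=nrow(df)))
cvpred <- numeric(nrow(df))
for (i in 1:k) {
  fit <- lm(log(brain) ~ log(body), data=df[fold != i, ])
  cvpred[fold == i] <- predict(fit, newdata=df[fold == i, ])
}
## Mean squared difference between observed values and predictions
## made independently of those values
with(df, mean((log(brain) - cvpred)^2))
\end{Sinput}
\end{Schunk}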
In bootstrap resampling, repeated samples are taken, with replacement,
from the data. The model can then be refitted to each such resample,
generating multiple estimates of each coefficient or other statistic
of interest. Standard error estimates can then be based on variation
between the multiple estimates.
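The following is a minimal sketch of the idea (again using
\code{MASS::mammals} and a log-log regression purely for illustration):
\begin{Schunk}
\begin{Sinput}
## Bootstrap resampling: resample rows with replacement, refit,
## and use the spread of the estimates as a standard error estimate
set.seed(29)
df <- MASS::mammals
B <- 1000
slope <- numeric(B)
for (b in 1:B) {
  rows <- sample(nrow(df), replace=TRUE)
  slope[b] <- coef(lm(log(brain) ~ log(body), data=df[rows, ]))[2]
}
sd(slope)    # bootstrap estimate of the standard error of the slope
\end{Sinput}
\end{Schunk}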
All the approaches that have been noted assume that data can be treated
as a random sample from the population to which results will be applied.
That may be a heroic assumption.
\subsection*{Notation, when referring to datasets}
The datasets used are taken from several different R packages.
The notation \code{MASS::mammals}, which can be used in code as well
as in the textual description, makes it clear that the dataset
\code{mammals} that is required is from the {\em MASS} package.
Should another attached package happen to have a dataset
\code{mammals}, there is no risk of confusion.
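For example:
\begin{Schunk}
\begin{Sinput}
## The package::dataset notation can be used directly in code,
## without first attaching the MASS package
head(MASS::mammals, 3)
\end{Sinput}
\end{Schunk}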
\section{Science, statistics, and R}
How does statistical analysis fit into the wider scientific
enterprise? While not a central focus of the present notes,
the issues that will now be noted are too important to be
ignored.
\marginnote[12pt]{Note, however, the chapters on map overlays
and on text mining --- these extend into areas that would not
ordinarily be described as ``data analysis''.}
The R system is an enabler that allows users to do effective
data analysis, and much else besides. These notes hint at
its scope, primarily for data manipulation, for data analysis,
and for graphics.
For the purposes of this text, the terms ``data science'' and
``statistics'' are different names for an endeavour whose concern
is to extract meaning from data, leading for example to results
that might be reported in a scientific paper, or that might form
the basis for a business or government policy. Statistical
issues and ideas are fundamental to any use of data to make
generalizations that extend beyond the particular data that
have been used for analysis. They set strict limits to the
range of generalizations that are scientifically legitimate.
Statistical analysis is a partner to, and not a substitute
for, robust scientific processes. The use of experimental data
provides the simplest context in which to explore this point
further.
For experimental work, over and above what may emerge from a
statistical analysis, the demand is that results be replicable.
Laboratory studies have repeatedly shown that drinking
regular caffeinated coffee increases blood pressure, though
with minimal long term effects.\footnote{\citet{green_kirby_suls_1996}
} It is accepted that there is
this effect, not from the statistical analysis of results from
any individual trial, but because the effect has been demonstrated
in repeated trials. The role of statistical analysis has been:
\begin{enumerate}
\tightlist
\item to demonstrate that, collating the evidence from repeated
trials, the effect does appear real;
\item to assess the magnitude of the effect.
\end{enumerate}
Worldwide, hundreds of thousands of randomised trials are
conducted annually. What do they tell us? In clinical medicine,
follow-up trials are common, and clear conclusions will often
emerge from the careful collation of evidence that, in important
cases, is likely to follow. In many other areas follow-up trials
have until recently been uncommon. This is now changing, and for
good reason. Independent replication of the experimental process
provides checks on the total experimental process, including the
statistical analysis. It is unlikely that the same mistakes in
experimental procedure and/or statistical analysis will be repeated.
\marginnote[12pt]{These replication rates are so low, in the areas
to which these papers relate, that they make nonsense of citations
to published individual trial results as evidence
that a claimed effect has been scientifically demonstrated.}
Papers that had a key role in drawing attention to reproducibility
concerns have been \citet{prinz2011believe} and
\citet{begley2012drug}. The first (6 out of 53 ``landmark'' studies
reproduced) related to drug trials, and the second (19 out of 65
``seminal'' studies) to cancer drug trials. Since then,
results have appeared from systematic attempts to
reproduce published work in psychology (around 40\%), in
laboratory economics (11 of 18), and in social science (12 of 18).
For research and associated analyses with observational data,
the absence of experimental control offers serious challenges.
In a case where the aim is to compare two or more groups, there
are certain to be more differences than the difference that is
of interest. Commonly, regression methods are used to apply
"covariate adjustments". It is then crucial that all relevant
covariates and covariate interactions are accounted for, and that
covariates are measured with adequate accuracy. Do covariates
require transformation (e.g., $x$, or $\log(x)$, or $x^2$) for
purposes of use in the regression model?
In a hard-hitting paper titled ``Cargo-cult statistics and
scientific crisis'', \citet{stark_saltelli_2018} comment,
quoting also from \citet{edwards_roy_2017}:
\begin{quote}
While some argue that there is no crisis (or at least not a systemic problem), bad incentives, bad scientific practices, outdated methods of vetting and disseminating results, and techno-science appear to be producing misleading and incorrect results. This might produce a crisis
of biblical proportions: as Edwards and Roy write: ``If a critical mass
of scientists become untrustworthy, a tipping point is possible in
which the scientific enterprise itself becomes inherently corrupt and public trust is lost, risking a new dark age with devastating
consequences to humanity.''
\end{quote}
Statistical issues are then front and centre in what many are identifying
as a crisis, but are not the whole story. The crisis is one that
scientists and statisticians need to tackle in partnership.
In a paper that deserves much more attention than it has received
\citep{tukey_1997}, John W. Tukey
argued that, as part of the process of fitting a model and forming a
conclusion, there should be incisive and informed critique of the data
used, of the model, and of the inferences made. It is important that
analysts search out available information about the processes that
generated the data, and consider critically how this may affect the
reliance placed on it. Other specific types of challenge (this list is
longer than Tukey's) may include:
\begin{itemize}
\tightlist
\item For experiments, is the design open to criticism?
\item Look for biases in processes that generated the data.
\item Look for inadequacies in laboratory procedure.
\item Use all relevant graphical or other summary checks to
critique the model that underpins the analysis.
\item Where possible, check the performance of the model on
test data that reflects the manner of use of results.
(If for example predictions are made that will be applied a
year into the future, check how predictions made a year
ahead panned out for historical data.)
\item For experimental data, have the work replicated
independently by another research group, from generation of
data through to analysis.
\item Have analysis results been correctly interpreted,
in the light of subject area knowledge?
\end{itemize}
Exposure to diverse challenges will build (or destroy!) confidence
in model-based inferences. We should trust those results that have
withstood thorough and informed challenge.
Data do not stand on their own. An understanding of the processes
that generated the data is crucial to judging how data can and cannot
reasonably be used. So also is application area insight. Numerous
studies found that women taking combined hormone replacement therapy
(HRT) also had a lower incidence of coronary heart disease (CHD),
relative to women not taking HRT. Randomized trials showed a small
but clear increase in risk. The risk was lower in the population
as a whole because, for reasons associated with socio-economic status,
women taking HRT were on the whole eating healthier food and getting
more exercise. In analyses of the population data, account had not
been taken of lifestyle effects.
\subsection{What does R add to the mix?}
R clearly has a huge range of abilities for manipulating data, fitting
and checking statistical models, and for using graphs and tables to
present results. More than this, it has extensive reproducible
reporting abilities that can be used to allow others to repeat the
data manipulation, analysis, and steps in the processing of output
that have led to an eventual paper or report. A file is provided
that mixes code with the text for the eventual document, and that
is then processed ("woven" or "knitted") to provide the final
document, complete with analysis output, tables, graphs, and code
(if any) that is to be included in the final document. This makes
it straightforward for referees, or for anyone with an interest in
the work, to check the analysis and/or try modifications.
The publication of data and code is an important step on the way
to making results more open and transparent, and to encouraging
informed post-publication critique.
\section{Generalization beyond the available data}
A common statistical analysis aim is to assess the extent to which
available data supports conclusions that extend beyond the
circumstances that generated the data. The hope is that available
data -- the sample values -- can be used as a window into a wider
population.
\subsection{Models for the random component}
Introductory statistics
courses are likely to focus on models that assume that \textit{error}
terms are independently and identically normally distributed ---
they make the \textit{iid normal assumption}. This involves the
separate assumptions of independence, assumptions of homogeneity
of variance (i.e., the standard deviations of all measurements
are the same), and normality.
Depending on the use that will be made of model results,
strict normality is often unnecessary.
Commonly, the interest is in parameter estimates and/or in model
predictions. Standard forms of statistical inference then rely
on the normality of the relevant sampling distributions.
Central Limit Theorem type effects will then, given enough data,
often work to bring these distributions close enough to normality
for purposes of the required inferences.
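A small simulation (a sketch only; the exponential distribution and
the sample size are arbitrary choices) illustrates the point:
\begin{Schunk}
\begin{Sinput}
## Means of 1000 samples, each of size 30, from a skew
## (exponential) distribution are close to normally distributed
means <- replicate(1000, mean(rexp(30)))
qqnorm(means); qqline(means)
\end{Sinput}
\end{Schunk}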
Much of the art of applied statistics lies in recognizing those
assumptions that, in any given context, are important and need
careful checking. Models are said to be \textit{robust} against
those assumptions that are of relatively minor consequence.
Most of the standard elementary methods assume that all population
values are chosen with equal probability, independently between
sample values. Or, if this is not strictly the case, departures
from this assumption will not be serious enough to compromise
results. In designed experiments and in sample surveys,
randomization mechanisms are used to ensure the required
independence. Failures in the randomization process will
compromise results.
Where there has not been explicit use of a randomization mechanism,
it is necessary to consider carefully how this may have affected the
data. Is some form of dependence structure likely? Temporal and
spatial dependence arise because values that are close together in
time or space are relatively more similar. Is there clustering that
arises because all individuals within chosen streets or within chosen
families have been included? Two individuals in the same family or in
the same street may be more similar than two individuals chosen at
random from the same city suburb. Model choice and model fitting
then need to account for such effects.
Models account for both
\textit{fixed} and \textit{random} effects. In a straight line
model, the fixed effect is the line, while the random effect
accounts for variation about the line. The random part of the
model can matter a great deal.
\subsection{R functions for working with distributions}
For each distribution, there are four functions, with names
whose first letter is, respectively, \texttt{d} (probability or
probability \textbf{d}ensity),
\texttt{p} (cumulative \textbf{p}robability),
\texttt{q} (\textbf{q}uantile), and \texttt{r} (generate a
\textbf{r}andom sample). Functions that have \texttt{d} as
their first letter give probabilities for distributions that are
discrete, or \textbf{d}ensities for continuous distributions.
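For example, for the normal distribution (and, in the final line,
the binomial):
\begin{Schunk}
\begin{Sinput}
dnorm(1.96)     # normal density at x = 1.96
pnorm(1.96)     # cumulative probability P(X <= 1.96), close to 0.975
qnorm(0.975)    # quantile below which 97.5% of values lie
rnorm(5)        # a random sample of size 5
## For a discrete distribution, the 'd' function gives probabilities
dbinom(2, size=10, prob=0.5)  # P(exactly 2 successes in 10 trials)
\end{Sinput}
\end{Schunk}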
Common discrete distributions are the binomial and the Poisson.
The betabinomial is often used to model binomial type data that
have a larger than binomial variance. The negative binomial
is widely used to model count data that have a larger than
Poisson variance. Flexible implementations of the betabinomial,
including the ability to model the scale parameter that controls
the variance, have appeared only recently. Alternatives to the
betabinomial have been little explored. By contrast, for count
data, there has been extensive investigation of the negative
binomial, as well as of other alternatives to the Poisson.
\marginnote[12pt]{Another name for the normal distribution is
the Gaussian distribution.}
Easily the most widely used distribution for continuous data is
the normal. Data
where the error term is not normal will often come close to
normal after transformation. The most widely used transformation
for this purpose is the logarithmic.
Other continuous distributions are common in the context of
special types of model, e.g., the exponential distribution, or
the Weibull of which it is a special case, in waiting time models.
\section{The Uses of Scatterplots}
\begin{Schunk}
\begin{Sinput}
## Below, the dataset MASS::mammals will be required
library(MASS, quietly=TRUE)
\end{Sinput}
\end{Schunk}
\subsection{Transformation to an appropriate scale}
\marginnote[12pt]{Among other issues, is there a wide enough
spread of distinct values that the data can be treated as continuous?}
A first step is to elicit basic information on the columns in the
data, including information on relationships between explanatory
variables. Is it desirable to transform one or more variables?
Transformations are helpful if they ensure, as far as possible, that:
\begin{itemizz}
\item All columns have a distribution that is reasonably well spread
out over the whole range of values, i.e., it is unsatisfactory to
have most values squashed together at one end of the range, with a
small number of very small or very large values occupying the
remaining part of the range.
\item Relationships between columns are roughly linear.
\item The scatter about any relationship is similar across the whole
range of values.
\end{itemizz}
It may happen that one transformation, often a logarithmic
transformation, will achieve all of these at the same time.
The scatterplot in Figure \ref{fig:Animals}A, showing data from the
dataset \code{MASS::mammals}, is an extreme version of the
common situation where positive (or non-zero) values are squashed
together in the lower part of the range, with a tail out to the right.
Such a distribution is said to be ``skewed to the right''.
Code for Figure \ref{fig:Animals}A is:
\begin{Schunk}
\begin{Sinput}
plot(brain ~ body, data=mammals)
mtext(side=3, line=0.5, adj=0,
"A: Unlogged data", cex=1.1)
\end{Sinput}
\end{Schunk}
\begin{marginfigure}
\begin{Schunk}
\centerline{\includegraphics[width=\textwidth]{figs/03-bbAB-1} }
\end{Schunk}
\caption{Brain weight (g) versus Body weight (kg), for 62 species of mammal.
Panel A plots the unlogged data, while Panel B has log scales for
both axes, but with axis labels in the original (unlogged) units.
\label{fig:Animals}}
\end{marginfigure}
Figure \ref{fig:Animals}B shows the scatterplot for the logged data.
Code for Figure \ref{fig:Animals}B is:
\begin{Schunk}
\begin{Sinput}
plot(brain ~ body, data=mammals, log="xy")
mtext(side=3, line=0.5, adj=0,
"B: Log scales (both axes)", cex=1.1)
\end{Sinput}
\end{Schunk}
Where, as in Figure \ref{fig:Animals}A, values are concentrated at one
end of the range, the small number (perhaps one or two) of values that
lie at the other end of the range will, in a straight line regression
with that column as the only explanatory variable, be a leverage
point. When it is one explanatory variable among several, those
values will have an overly large say in determining the coefficient
for that variable.
As happened here, a logarithmic transformation will often remove much
or all of the skew. Also, as happened here, such transformations often
bring the added bonus that relationships between the resulting
variables are approximately linear.
\subsection{The Uses of Scatterplot Matrices}\label{sec:spm}
Subsequent chapters will make extensive use of scatterplot matrices.
A scatterplot matrix plots every column against every other column,
with the result in the layout used for correlation matrices. Figure
\ref{fig:trees} shows a scatterplot matrix for the
\code{datasets::trees} dataset. \marginnote[-24pt]{The
\margtt{datasets} package is, in an off-the-shelf installation,
attached when R starts.}
\begin{figure*}
\vspace*{-6pt}
\begin{minipage}[c]{0.55\linewidth}
\fbox{\parbox{\textwidth}{{\bf Interpreting Scatterplot Matrices:}\\[4pt]
For identifying the axes for each panel
\begin{itemizz}
\item[-] look across the row to the diagonal to identify the variable on
the vertical axis.
\item[-] look up or down the column to the diagonal for the
variable on the horizontal axis.
\end{itemizz}
Each below diagonal panel is the mirror image of the
corresponding above diagonal panel.
}}
\vspace*{6pt}
\begin{Schunk}
\begin{Sinput}
## Code used for the plot
plot(trees, cex.labels=1.5)
# Calls pairs(trees)
\end{Sinput}
\end{Schunk}
\end{minipage}
\begin{minipage}[c]{0.465\linewidth}
\begin{Schunk}
\centerline{\includegraphics[width=\textwidth]{figs/03-plot-trees-1} }
\end{Schunk}
\vspace*{-15pt}
\end{minipage}
\caption{Scatterplot matrix for the \code{trees} data, obtained
using the default \code{plot()} method for data frames. The
scatterplot matrix is a graphical counterpart of the correlation
matrix.\label{fig:trees}}
\end{figure*}
Notice that \code{plot()}, called with the dataframe \code{trees} as
argument, has dispatched to the plot method for a data frame, i.e.,
it has called \code{plot.data.frame()}, which has in turn called
the function \code{pairs()}.
The scatterplot matrix may be examined, if there are enough
points, for evidence of:
\begin{enumerate}
\marginnote[18pt]{The scatterplot matrix is best used as an initial
coarse screening device. Skewness in the individual distributions
is better checked using plots of density estimates.}
\item Strong clustering in the data, and/or
obvious outliers;
\item Clear non-linear relationships, so that a
correlation will underestimate the strength of any relationship;
\item Severely skewed distributions, so that the correlation is a biased
measure of the strength of relationship.
\end{enumerate}
\section{World record times for track and road events}\label{sec:wr}
The first example is for world track and road record times,
\marginnote{Note also the use of these data in the exercise at the end
of Chapter 2 (Section \ref{ss:ch2ex})} as at 9th August 2006. Data,
copied down from the web page
\url{http://www.gbrathletics.com/wrec.htm}, are in the dataset
\code{DAAG::worldRecords}.
\subsection*{Data exploration}
First, use \code{str()} to get information on the data frame columns:
\begin{Schunk}
\begin{Sinput}
library(DAAG, quietly=TRUE)
\end{Sinput}
\end{Schunk}
\begin{Schunk}
\begin{Sinput}
str(worldRecords, vec.len=3)
\end{Sinput}
\end{Schunk}
\begin{fullwidth}
\begin{Schunk}
\begin{Soutput}
'data.frame': 40 obs. of 5 variables:
$ Distance : num 0.1 0.15 0.2 0.3 0.4 0.5 0.6 0.8 ...
$ roadORtrack: Factor w/ 2 levels "road","track": 2 2 2 2 2 2 2 2 ...
$ Place : chr "Athens" "Cassino" "Atlanta" ...
$ Time : num 0.163 0.247 0.322 0.514 ...
$ Date : Date, format: "2005-06-14" "1983-05-22" ...
\end{Soutput}
\end{Schunk}
\end{fullwidth}
%$
Distinguishing points for track events from those for road events is
easiest if we use lattice graphics, as in Figure \ref{fig:wrnolog}.
\begin{Schunk}
\begin{Sinput}
## Code
library(lattice)
xyplot(Time ~ Distance, scales=list(tck=0.5),
groups=roadORtrack, data=worldRecords,
auto.key=list(columns=2), aspect=1)
## On a colour device the default is to use
## different colours, not different symbols,
## to distinguish groups.
\end{Sinput}
\end{Schunk}
\begin{marginfigure}
\begin{Schunk}
\centerline{\includegraphics[width=0.98\textwidth]{figs/03-trackVSroad-1} }
\end{Schunk}
\caption{World record times versus distance, for track and road
events.\label{fig:wrnolog}}
\end{marginfigure}
Clearly increases in \code{Time} are not proportional to increases in
\code{Distance}. Indeed, such a model does not make sense; velocity
decreases as the length of the race increases. Proportionality when
logarithmic scales are used for the two variables does make sense.
Figure \ref{fig:wrlog} uses logarithmic scales on both axes. The two
panels differ only in the labeling of the scales. The left panel uses
labels on scales of $\log_e$, while the right panel has labels in the
original units. Notice the use of
\code{auto.key} to obtain a key.
\begin{figure}
\begin{Schunk}
\centerline{\includegraphics[width=0.47\textwidth]{figs/03-wrTimesAB-1} \includegraphics[width=0.47\textwidth]{figs/03-wrTimesAB-2} }
\end{Schunk}
\caption{World record times versus distance, for track and road
events, using logarithmic scales. The left panel uses labels on
scales of $\log_e$, while in the right panel, labeling is in the
original units, expressed as powers of 10.}
\label{fig:wrlog}
\end{figure}
\begin{Schunk}
\begin{Sinput}
## Code for Left panel
xyplot(log(Time) ~ log(Distance),
groups=roadORtrack, data=worldRecords,
scales=list(tck=0.5),
auto.key=list(columns=2), aspect=1)
## Right panel
xyplot(Time ~ Distance, groups=roadORtrack,
data=worldRecords,
scales=list(log=10, tck=0.5),
auto.key=list(columns=2), aspect=1)
\end{Sinput}
\end{Schunk}
\vspace*{-9pt}
\subsection*{Fitting a regression line}
The plots suggest that a line is a good fit. Note however that the data span a
huge range of distances. The ratio of longest to shortest distance is
almost 3000:1. Departures from the line are of the order of 15\% at
the upper end of the range, but are so small relative to
this huge range that they are not obvious.
The following \marginnote{The name \margtt{lm} is a mnemonic for
{\underline{l}}inear {\underline{m}}odel.} uses the function
\code{lm()} to fit a straight line to the logged data,
then extracts the regression coefficients:
\marginnote[1.0cm]{The equation gives predicted times:
\begin{eqnarray*}
\widehat{\mbox{Time}} &=& e^{0.7316} \times \mbox{Distance}^{1.1248}\\
&=& 2.08 \times \mbox{Distance}^{1.1248}
\end{eqnarray*}
This implies, as would be expected, that minutes per kilometer
increase with increasing distance. Fitting a line to points that are
on a log scale thus allows an immediate interpretation.}
\begin{Schunk}
\begin{Sinput}
worldrec.lm <- lm(log(Time) ~ log(Distance),
data=worldRecords)
coef(worldrec.lm)
\end{Sinput}
\begin{Soutput}
(Intercept) log(Distance)
0.7316 1.1248
\end{Soutput}
\end{Schunk}
There is no difference that can be detected visually between the track
races and the road races. Careful analysis will in fact find no
difference.
\subsection{Summary information from model objects}\label{ss:sum-modobj}
In order to avoid recalculation of the model information
\marginnote{The name \code{worldrec.lm} is used to indicate
that this is an \margtt{lm} object, with data from \code{worldRecords}.
Use any name that seems helpful!}
each time that some different information is required, we store the
result from the \code{lm()} calculation in the model object
\code{worldrec.lm}.
\begin{marginfigure}
Plot points; add line:
\begin{Schunk}
\begin{Sinput}
plot(log(Time) ~ log(Distance),
data = worldRecords)
abline(worldrec.lm)
\end{Sinput}
\end{Schunk}
\end{marginfigure}
Note that the function \code{abline()} can be used
with the model object as argument to add a line to the plot of
\code{log(Time)} against \code{log(Distance)}.
\subsection*{Diagnostic plots}
\begin{figure}
\begin{Schunk}
\centerline{\includegraphics[width=\textwidth]{figs/03-diag12-1} }
\end{Schunk}
\caption{First and last of the default diagnostic plots, from the
linear model for log(record time) versus log(distance), for
track and road events.}
\label{fig:wr-diag}
\setfloatalignment{t}% forces caption to be top-aligned
\end{figure}
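The code used for Figure \ref{fig:wr-diag} is not shown above; a
sketch that should give much the same result (assuming that the two
panels correspond to \code{which=1} and \code{which=5} in the default
\code{plot()} method for \code{lm} objects) is:
\begin{Schunk}
\begin{Sinput}
## which=1: Residuals vs Fitted; which=5: Residuals vs Leverage,
## with contours of Cook's distance added
plot(worldrec.lm, which=c(1, 5))
\end{Sinput}
\end{Schunk}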
Panel A is designed
to give an indication of whether the relationship really is linear, or
whether there is some further systematic component that should perhaps
be modeled. It does show systematic differences from a line.
The largest difference is more than 15\%.\sidenote{A
difference of 0.05 on a scale of $\log_e$ translates to a difference
of just over 5\%. A difference of 0.15 translates to a difference
of just over 16\%, i.e., slightly more than 15\%.} There are
mechanisms for using a smooth curve to account for the differences
from a line, if these are thought important enough to model.
The plot in panel B allows an assessment of the extent to which individual
points are influencing the fitted line. Observation 40 does have both
a very large leverage and a large Cook's distance. The plot on the
left makes it clear that this is the point with the largest fitted
time. Observation 40 is for a 24h race, or 1440 min. Examine
\begin{Schunk}
\begin{Sinput}
worldRecords["40", ]
\end{Sinput}
\begin{Soutput}
Distance roadORtrack Place Time Date
40 290.2 road Basle 1440 1998-05-03
\end{Soutput}
\end{Schunk}
\subsection{The model object}\label{ss:modobj}
Functions that are commonly used to get information
about model objects are: \code{print()}, \code{summary()} and
\code{plot()}. These are all \textit{generic} functions. The effect
of the function depends on the class of object that is printed (i.e., by
default, displayed on the screen), plotted, or summarized.
The function \code{print()} may display relatively terse
output, while \code{summary()} may display more extensive output.
This varies from one type of model object to another.
Compare the outputs from the following:
\begin{fullwidth}
\begin{Schunk}
\begin{Sinput}
print(worldrec.lm) # Alternatively, type worldrec.lm
\end{Sinput}
\begin{Soutput}
Call:
lm(formula = log(Time) ~ log(Distance), data = worldRecords)
Coefficients:
(Intercept) log(Distance)
0.732 1.125
\end{Soutput}
\begin{Sinput}
summary(worldrec.lm)
\end{Sinput}
\begin{Soutput}
Call:
lm(formula = log(Time) ~ log(Distance), data = worldRecords)
Residuals:
Min 1Q Median 3Q Max
-0.0807 -0.0497 0.0028 0.0377 0.1627
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.73160 0.01241 59 <2e-16
log(Distance) 1.12475 0.00437 257 <2e-16
Residual standard error: 0.0565 on 38 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 6.63e+04 on 1 and 38 DF, p-value: <2e-16
\end{Soutput}
\end{Schunk}
\end{fullwidth}
\marginnote[10pt]{Internally, \margtt{summary(worldrec.lm)} calls
\margtt{UseMethod("summary")}. As \margtt{worldrec.lm} is an
lm object, this calls \margtt{summary.lm()}.} Used with
\code{lm} objects, \code{print()} calls \code{print.lm()}, while
\code{summary()} calls \code{summary.lm()}.
Note that typing \code{worldrec.lm} has the same effect as
\code{print(worldrec.lm)}.
\subsection{The \texttt{lm} model object is a list}
The model object is actually a list. Here are the names of the list
elements:
\begin{Schunk}
\begin{Sinput}
names(worldrec.lm)
\end{Sinput}
\begin{Soutput}
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
\end{Soutput}
\end{Schunk}
These different list elements hold very different classes and
dimensions (or lengths) of object. Hence the use of a list; any
collection of different R objects can be brought together into a
list.
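A small example (constructed here for illustration only) makes
the point:
\begin{Schunk}
\begin{Sinput}
## Objects of quite different classes can be collected into a list
info <- list(model=worldrec.lm, coef=coef(worldrec.lm),
             n=nrow(worldRecords))
sapply(info, class)    # class of each list element
\end{Sinput}
\end{Schunk}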
The following is a check on the model call:
\begin{fullwidth}
\begin{Schunk}
\begin{Sinput}
worldrec.lm$call
\end{Sinput}
\begin{Soutput}
lm(formula = log(Time) ~ log(Distance), data = worldRecords)
\end{Soutput}
\end{Schunk}
\end{fullwidth}
% $
\begin{marginfigure}[40pt]
Use extractor function \margtt{coef()}:\\[-3pt]
\begin{Schunk}
\begin{Sinput}
coef(worldrec.lm)
\end{Sinput}
\end{Schunk}
\end{marginfigure}
Commonly required information is best accessed using generic
extractor functions. Above, attention was drawn to \code{print()},
\code{summary()} and \code{plot()}. Other commonly used extractor
functions are \code{residuals()}, \code{coefficients()}, and
\code{fitted.values()}. These can be abbreviated to \code{resid()},
\code{coef()}, and \code{fitted()}.
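For example:
\begin{Schunk}
\begin{Sinput}
## Extractor functions, applied to the world records model object
coef(worldrec.lm)           # regression coefficients
head(resid(worldrec.lm))    # first few residuals
head(fitted(worldrec.lm))   # first few fitted values
\end{Sinput}
\end{Schunk}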
\section{Regression with two explanatory variables}\label{sec:nihills}
The dataset \code{nihills} in the \pkg{DAAG} package will be used for
a regression fit in Section \ref{sec:nihills-reg}. This has record
times for Northern Ireland mountain races. Overview details of the
data are:
\begin{fullwidth}
\begin{Schunk}
\begin{Sinput}
str(nihills)
\end{Sinput}
\begin{Soutput}
'data.frame': 23 obs. of 5 variables:
$ dist : num 7.5 4.2 5.9 6.8 5 4.8 4.3 3 2.5 12 ...
$ climb : int 1740 1110 1210 3300 1200 950 1600 1500 1500 5080 ...
$ time : num 0.858 0.467 0.703 1.039 0.541 ...
$ timef : num 1.064 0.623 0.887 1.214 0.637 ...
$ gradient: num 232 264 205 485 240 ...
\end{Soutput}
\end{Schunk}
\end{fullwidth}
\marginnote[12pt]{The function \margtt{splom()} is a \pkg{lattice}
alternative to \code{pairs()}, giving a different panel layout.}
Figure \ref{fig:nimra} uses the function
\code{lattice::splom()} to give two scatterplot
matrices: the left panel shows the unlogged data, while the right panel
shows the logged data.
\begin{figure*}
\vspace*{-6pt}
\begin{Schunk}
\centerline{\includegraphics[width=0.48\textwidth]{figs/03-nihills-spmAB-1} \includegraphics[width=0.48\textwidth]{figs/03-nihills-spmAB-2} }
\end{Schunk}
\caption{Scatterplot matrices for the Northern Ireland mountain racing
data. The left panel is for the unlogged data, while the right panel is
for the logged data. Code has been added that shows the correlations,
in the lower panel.\label{fig:nimra}}
\end{figure*}
\vspace*{15pt}
The following panel function was used to show the correlations:
\begin{Schunk}
\begin{Sinput}
showcorr <- function(x, y, ...){
    ## Plot the points, then add the correlation (rounded to 3
    ## decimal places) near the lower right corner of the panel
    panel.xyplot(x, y, ...)
    rho <- paste(round(cor(x, y), 3))
    eps <- 0.035*diff(range(y))
    panel.text(max(x), min(y)+eps, rho,
               pos=2, offset=-0.2)
}
\end{Sinput}
\end{Schunk}
Code for the scatterplot matrix in the left panel is:
\begin{Schunk}
\begin{Sinput}
## Scatterplot matrix; unlogged data
library(lattice)
splom(~nihills, xlab="",
main=list("A: Untransformed data", x=0,
just="left", fontface="plain"))
\end{Sinput}
\end{Schunk}
For the right panel, create a data frame from the logged data:
\begin{Schunk}
\begin{Sinput}
lognihills <- log(nihills)
names(lognihills) <- paste0("l", names(nihills))
## Scatterplot matrix; log scales
splom(~ lognihills, lower.panel=showcorr, xlab="",
main=list("B: Log transformed data", x=0,
just="left", fontface="plain"))
\end{Sinput}
\end{Schunk}
\marginnote[15pt]{Unlike \margtt{paste()}, the function \margtt{paste0()} does
not leave spaces between text strings that it pastes together.}
Note that the data are positively skewed, i.e., there is a long tail
to the right, for all variables. For such data, a logarithmic
transformation often gives more nearly linear relationships. The
relationships between explanatory variables, and between the dependent
variable and explanatory variables, are closer to linear when
logarithmic scales are used. Just as importantly, issues with large
leverage, so that the point for the largest data values has a much
greater leverage and hence much greater influence than other points on
the fitted regression, are greatly reduced.
Notice also that the correlation of 0.913 between \code{climb} and
\code{dist} in the left panel of Figure \ref{fig:nimra} is very
different from the correlation of 0.78 between \code{lclimb} and
\code{ldist} in the right panel. Correlations where distributions are
highly skew are not comparable with correlations where distributions
are more nearly symmetric. The statistical properties are different.
The following regresses \code{log(time)} on \code{log(climb)} and
\code{log(dist)}:
\begin{Schunk}
\begin{Sinput}
nihills.lm <- lm(ltime ~ lclimb + ldist,
data=lognihills)
\end{Sinput}
\end{Schunk}
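Coefficients, with their standard errors, can then be extracted in
the same way as for \code{worldrec.lm} above, e.g.:
\begin{Schunk}
\begin{Sinput}
## Coefficients, standard errors, t-statistics and p-values
round(coef(summary(nihills.lm)), 4)
\end{Sinput}
\end{Schunk}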
\section{One-way Comparisons}
\marginnote[12pt]{A common strategy for getting a valid comparison is to
grow the plants in separate pots, with a random arrangement of pots.}
The dataset \code{tomato} has weights of plants that were grown
under one of four different sets of experimental conditions.
Five plants were grown under each of the treatments:
\begin{itemizz}
\item[-] `\code{water only}'
\item[-] `\code{conc nutrient}'
\item[-] `\code{2-4-D + conc nutrient}'
\item[-] `\code{3x conc nutrient}'
\end{itemizz}
Figure \ref{fig:Tomato}, created
using the function \code{quickplot()} from the \pkg{ggplot2} package,
shows the plant weights. Are the apparent differences between
treatments large enough that they can be distinguished statistically?
\begin{figure}
\vspace*{-6pt}
\begin{Schunk}
\centerline{\includegraphics[width=0.98\textwidth]{figs/03-gg-tomato-1} }
\end{Schunk}
\caption{Weights (g) of tomato plants grown under four different
treatments.\label{fig:Tomato}}
\end{figure}
\marginnote{Notice that `\code{water only}' is made the reference level.
This choice makes best sense for the analysis of
variance calculations that appear below.}
\begin{Schunk}
\begin{Sinput}
## Code
library(ggplot2)
tomato <- within(DAAG::tomato,
trt <- relevel(trt, ref="water only"))
quickplot(weight, trt, data=tomato,
xlab="Weight (g)", ylab="")
\end{Sinput}
\end{Schunk}
\marginnote[12pt]{Observe that, to get estimates and SEs of treatment
effects, \margtt{tomato.aov} can be treated as an \txtt{lm}
(regression) object.}
The command \code{aov()}, followed by a call to \code{summary.lm()},
can be used to analyse these data, thus:
\begin{Schunk}
\begin{Sinput}
tomato.aov <- aov(weight ~ trt, data=tomato)
round(coef(summary.lm(tomato.aov)), 3)
\end{Sinput}
\end{Schunk}
\begin{fullwidth}
\begin{Schunk}
\begin{Soutput}
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.683 0.187 9.019 0.000
trt2-4-D + conc nutrient -0.358 0.264 -1.358 0.190
trt3x conc nutrient -0.700 0.264 -2.652 0.015
trtconc nutrient 0.067 0.264 0.253 0.803
\end{Soutput}
\end{Schunk}
\end{fullwidth}
Because we made `\code{water only}' the reference level,
`\code{(Intercept)}' is the mean for \code{water only}, and the
other coefficients are for differences from \code{water only}.
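As a check (a sketch only; output is not shown), the treatment means
can be calculated directly, and the differences from the
`\code{water only}' mean compared with the coefficients above:
\begin{Schunk}
\begin{Sinput}
## Mean weight for each treatment
with(tomato, sapply(split(weight, trt), mean))
\end{Sinput}
\end{Schunk}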
\subsection*{A randomized block comparison}
Growing conditions in a glasshouse or growth chamber ---
temperature, humidity and air movement --- will not be totally
uniform. This makes it desirable to do several repeats of the
comparison between treatments\sidenote{In language used
originally in connection with agricultural field trials,
where the comparison was repeated on different blocks of land,
each different location is a ``block''.}, with conditions
within each repeat (``block'') kept as uniform as possible.
Each different ``block'' may for example be a different part of
the glasshouse or growth chamber.
The dataset \code{DAAG::rice} is from an experiment
where there were six treatment combinations ---
three types of fertilizer were applied to each of two varieties