-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path0intro.tex
196 lines (173 loc) · 7.13 KB
/
0intro.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
\begin{fullwidth}
This is an introduction to R that emphasises statistical applications.
It provides relatively terse overviews, with worked examples, of a
wide range of the abilities that are available from R and from R
packages. The R system has been in the front line of the huge
advances in software for data analysis from around the end of the
last century.
A strong emphasis is on good research practice. Always, whether
or not one chooses to use the word `science', what is in mind is
the used of defendable scientific processes. Results from the
processing of data through a series of analysis steps achieves
nothing, unless there is some wider context of understanding that
gives it meaning. What do we hope to learn? What are the processes
that generated the data? Is there a wider context (typically, there
is) to which we hope that results will apply?
Any use of data that extends beyond providing statistical or
graphical summary relies on the use of a mathematical model,
whether implicitly or explicitly. Models must be able to
survive reasonable challenges to their use and relevance to
the task in hand, if claims that result from their use are
to be credible. What is being called the `reproducibility
crisis' in science is in large part a consequence of the neglect
of this key point.
The scientific context has crucial implications for the experiments
that it is useful to do, and for the analyses that are meaningful.
Available statistical methodology, and statistical and computing
software and hardware, bring their own constraints and opportunities.
Effective data analysis requires the critical resources of
well-trained and well-informed minds, with software that is of high
quality handling the needed calculations.
\subsection*{Good planning, and reliable software}
The \textit{Statistics of data collection} encompasses statistical
\textit{experimental design}, sampling design, and more besides.
Planning will be most effective if based on sound knowledge of the
materials and procedures available to experimenters.
The same general issues arise in all fields --- industrial, medical,
biological and laboratory experimentation more generally. The aim
is, always, is to get maximum value from resources used.
While the R system is unique in the extent of close scrutiny that it
receives from highly expert users, the same warnings apply as to any
statistical system. The base system and the recommended
packages get unusually careful scrutiny. Take particular care
with newer or little-used abilities
in contributed packages. These may not have been much tested,
unless by their developers. The greatest risks arise from
inadequate understanding of the statistical issues.
Statisticians who work with other scientists, as the author has
for most of a long career, have the satisfaction, and challenge, of
learning about the areas of science to which they have been exposed.
In the author's case, that experience has ranged widely. It is
reflected in the wide-ranging examples that are presented in the
pages that follow.
\end{fullwidth}
\newpage
\section*{Important R resources}
\marginnote[12pt]{CRAN is the primary R `repository'. For
preference, obtain R and R packages from a CRAN mirror in
the local region, e.g.,\\
Australia: \url{cran.csiro.au/}\\
NZ: \url{cran.stat.auckland.ac.nz/}}
\noindent
\fbox{\parbox{\textwidth}{
{\bf Note the following web sites:}\\[4pt]
CRAN (Comprehensive R Archive Network):\newline
\url{http://cran.r-project.org}\\[3pt]
% For Sweden, use \url{http://ftp.sunet.se/pub/lang/CRAN/}\\
% For Denmark, use \url{http://cran.dk.r-project.org/}
The Bioconductor repository
(\url{http://www.bioconductor.org}), with packages that cater for
high throughput genomic data, is one of several
repositories that supplement CRAN.\\[3pt]
R homepage: \url{http://www.r-project.org/}\\[3pt]
Links to useful web pages: From an R session that uses the
GUI, click on \underline{Help}, then on \underline{R Help}. On the
window that pops up, click on \underline{Resources}
}}
\subsection*{The {\em DAAGviz} package}
\marginnote[8pt]{Note that {\em DAAGviz} is not, currently at least,
available on CRAN.}
This package is an optional companion to these notes.
It collects scripts and datasets together in a way that may be
useful to users of this text.
You can install it, assuming a live internet connection,
by typing:
\begin{fullwidth}
\begin{Schunk}
\begin{Sinput}
remotes::install_github('jhmaindonald/DAAGviz',
build_opts=c("--no-resave-data",
"--no-manual"))
\end{Sinput}
\end{Schunk}
\end{fullwidth}
Assuming that the {\em DAAGviz} package has been installed, it can be attached thus:
\begin{Schunk}
\begin{Sinput}
library(DAAGviz)
\end{Sinput}
\end{Schunk}
Once attached, this package gives access to:
\begin{itemizz}
\item[-]
Scripts that include all the code. To access these scripts do, e.g.
\begin{marginfigure}[66pt]
More succinctly, use the function \margtt{getScript()}:\\[-3pt]
\begin{Schunk}
\begin{Sinput}
## Place Ch 5 script in
## working directory
getScript(5)
\end{Sinput}
\end{Schunk}
\end{marginfigure}
\begin{Schunk}
\begin{Sinput}
## Check available scripts
dir(system.file('scripts', package='DAAGviz'))
## Show chapter 5 script
script5 <- system.file('scripts/5examples-code.R',
package='DAAGviz')
file.show(script5)
\end{Sinput}
\end{Schunk}
\item[-]
Source files (also scripts) for functions that can be used to
reproduce the graphs. These are available for Chapters 5 to 15
only. To load the Chapter 5 functions into the workspace,
use the command:
\begin{marginfigure}[54pt]
More succinctly, use the function \margtt{sourceFigFuns()}:\\[-3pt]
\begin{Schunk}
\begin{Sinput}
## Load Ch 5 functions
## into workspace
sourceFigFuns(5)
\end{Sinput}
\end{Schunk}
\end{marginfigure}
\begin{Schunk}
\begin{Sinput}
path2figs5 <- system.file('doc/figs5.R',
package='DAAGviz')
source(path2figs5)
\end{Sinput}
\end{Schunk}
\item[-] The datasets \code{bronchit}, \code{eyeAmp}, and
\code{Crimean}, which feature later in these notes.
\end{itemizz}
\newpage
\subsection*{Alternative sources for some datasets}
\marginnote{At courses where these notes are used, these will
be provided on a memory stick.}
{\color{gray40}
The web page \url{http://maths-people.anu.edu.au/~johnm/} may in a few cases
be a convenient source for datasets that are referred to in this text:
\begin{itemizz}
\item[-]
Look in \url{http://maths-people.anu.edu.au/~johnm/r/rda} for
various image ({\bf .RData}) files.\sidenote{{\em Image} files hold
copies of one or more R objects, in a format that facilitates rapid
access from R.} Use the function \code{load()} to bring any of these
into R.
\item[-] Look in \url{http://maths-people.anu.edu.au/~johnm/datasets/text}
for the files {\bf bestTimes.txt}, {\bf molclock.txt}, and other such
text files.
\item[=] Look in \url{http://maths-people.anu.edu.au/~johnm/datasets/csv}
for several {\bf .csv} files.
\end{itemizz}
\vspace*{-9pt}
}
\subsection*{Asterisked Sections or Subsections}
Asterisks are used to identify material that is more technical or
specialized, and that might be omitted at a first reading.