\chapter*{Preface}
\begin{VF}
``Suppose that you want to teach the `cat' concept to a very young child. Do you explain that a cat is a relatively small, primarily carnivorous mammal with retractible claws, a distinctive sonic output, etc.? I'll bet not. You probably show the kid a lot of different cats, saying `kitty' each time, until it gets the idea. To put it more generally, generalizations are best made by abstraction from experience.''
\VA{R. P. Boas}{\emph{Can we make mathematics intelligible?}, 1981}\nocite{Boas1981}
\end{VF}
\noindent
Why have I chosen the title ``\emph{Learn R: As a Language}''? This book is based on exploration and practice that aims at teaching how to express various operations on data using the \Rlang language. It focuses on the language, rather than on specific types of data analysis, and exposes the reader to current usage without sparing the quirks of the language. When we use our native language in everyday life, we do not think about grammar rules or sentence structure, except in trickier or unfamiliar situations. My aim is for this book to help readers grow to use \Rlang in this same way, i.e., to become fluent in \Rlang. The book is structured around the elements of languages with chapter titles that highlight the parallels between natural languages like English and the \Rlang language.
Nowadays, many students of biological and environmental sciences learn \Rlang in courses about statistics or data analysis. However, they frequently do not learn it in enough depth to use it effectively in scripts for automating data analyses or documenting the whole data analysis workflow to ensure reproducibility. Students in the humanities, and also in other fields, may find it easier to learn the R language separately from data analysis and statistics. There are also many who are already familiar with statistical principles and willing to switch from other software to \Rlang. \emph{Learn R: As a Language} is written with these readers in mind to serve both as a textbook and as a reference.
A language is a system of communication. Basic concepts and operations are based on abstractions that are shared across programming languages and relevant to programs of all sizes and complexities; these abstractions are explained in the book together with their implementation in the R language. Other abstractions and programming concepts, outside the scope of this book, are relevant to large and complex pieces of software meant to be widely distributed. In other words, \emph{Learn R: As a Language} aims at teaching and supporting \emph{programming in the small}: the use of \Rlang to automate the drudgery of data manipulation, including the different steps spanning from data input and exploration to the production of publication-quality illustrations and their documentation.
Using a language actively is the most efficient way of learning it. By using it, I mean actually reading, writing, and running scripts or programs. \emph{Learn R: As a Language} supports learning the \Rlang language in a way comparable to how children learn to speak: they work out what the rules are, simply by listening to people speak and trying to utter what they want to tell their parents. Of course, small children receive some guidance, but they are not taught a prescriptive set of rules like when learning a second language at school. Instead of listening, readers will read and run code, and instead of speaking, readers will write and try to execute \Rlang code statements on a computer. I do provide explanations and guidance, but the idea of this book is for readers to play with the numerous examples by creating new variations upon them, to find out by themselves the patterns behind the \Rlang language. Instead of parents being the sounding board for the first utterances of readers new to \Rlang, the computer will play this role.
This revised second edition reflects changes that took place in \Rlang and the packages described. Very few code chunks from the first edition had stopped working, but deprecations meant that some examples triggered messages or warnings, and would eventually fail. Recent ($>$ 4.0.0) versions of \Rlang have significant enhancements such as the new pipe operator. Packages have also evolved, acquiring new features. Feedback from readers and reviewers has highlighted some gaps in the contents and unclear explanations. Re-reading the book myself after some time allowed me to think of other improvements. I have updated the book accordingly. I have added diagrams and flowcharts to facilitate comprehension of programming concepts. I edited the text from the first edition to fix all errors and outdated examples or explanations known to me and to improve the clarity of previously unclear explanations.
\emph{I encourage you to approach \Rlang like a child approaches his or her mother tongue when first learning to speak: do not struggle, just play, and fool around with \Rlang! If the going gets difficult and frustrating, take a break! If you get a new insight, take a break to enjoy the victory!
}%\end{framed}
\section*{Acknowledgements}
I thank Jaakko Heinonen for introducing me to the then new \Rlang. Along the way many well known and not so famous experts have answered my questions in Usenet and more recently in Stack Overflow. I wish to warmly thank members of my own research group, students participating in the courses I have taught, colleagues I have collaborated with, authors of the books I have read and people I have only met online or at conferences. All of them have made it possible for me to write this book. I am indebted to Tarja Lehto, Titta Kotilainen, Tautvydas Zalnierius, Fang Wang, Yan Yan, Neha Rai, Markus Laurel, Brett Cooper, colleagues, students and anonymous reviewers for many very helpful comments on the draft manuscript and/or the published first edition. I thank Rob Calver, editor of both editions, for his encouragement and great patience, and Lara Spieker, Vaishali Singh, and Paul Boyd for their help with different aspects of this project.
In many ways this text owes much more to people who are not authors than to myself. However, as I am the one who has written \emph{Learn R: As a Language} and decided what to include and exclude, I take full responsibility for any errors and inaccuracies.
\\[1cm]
Helsinki, \today
\chapter*{Using the book to learn \Rlang}
\begin{VF}
Few people like problems. Hence the natural tendency in problem-solving is to pick the first solution that comes to mind and run with it.\ \ldots\ A better strategy \ldots\ is to select the most attractive path from many ideas, or concepts.
\VA{J. L. Adams}{\emph{Conceptual blockbusting}, 1987}\nocite{Adams1987}
\end{VF}
\section*{Approach and structure}
Depending on previous experience, reading \emph{Learn R: As a Language} will be about exploring a new world or revisiting a familiar one. In both cases \emph{Learn R: As a Language} aims to be a travel guide, neither a traveler's account, nor a cookbook of \Rlang recipes. It can be used as a course book, supplementary reading or for self instruction, and also as a reference.
In \Rlang, like in most ``rich'' languages, there are multiple ways of coding the same operations. I have included code examples that aim to strike a balance between execution speed and readability. One could write equivalent \Rlang books using substantially different code examples. Keep this in mind when reading the book and using \Rlang. Keep also in mind that it is impossible to remember everything about \Rlang and as a user you will frequently need to consult the documentation, even while doing the exercises in this book. The \Rlang language, in a broad sense, is vast because it can be expanded with independently developed packages. Learning to use \Rlang mainly consists of learning the basics plus developing the skill of finding your way in \Rlang, its documentation and on-line question and answer forums.
The contents of the book are organized so that it can be used both as a textbook for learning \Rlang and as a reference. It starts with simple concepts and language elements, progressing towards more complex language structures and uses. Along the way readers will find, in each chapter, descriptions and examples for the common (usual) cases and the exceptions. Some books hide the exceptions and counterintuitive features from learners to make the learning easier; I have instead included them, marked with icons and marginal bars. There are two reasons for choosing this approach. First, the boundary between boringly easy and frustratingly challenging is different for each of us, and varies depending on the subject dealt with. So, I hope the marks will help readers predict what to expect, how much effort to put into each section and even what to read or skip. Second, if I had hidden the tricky bits of the \Rlang language, I would have made readers' later use of \Rlang more difficult. It would have also made the book less useful as a reference.
The key to the marginal bars and icons is given next. They inform about what content is advanced or included with a specific aim.
\begin{infobox}
Signals text providing general information not directly related to the \Rlang language.
\end{infobox}
\begin{explainbox}
Signals in-depth explanations of specific points that may require you to spend time thinking, which in general can be skipped on first reading, but to which you should return at a later peaceful time, preferably with a cup of coffee or tea.
\end{explainbox}
\begin{warningbox}
Signals important bits of information that must be remembered when using \Rlang, i.e., explanations of unusual features of the language.
\end{warningbox}
\begin{playground}
Signals \emph{playground} sections which contain open-ended exercises---ideas and pieces of \Rlang code to play with at the \Rlang console.
\end{playground}
\begin{advplayground}
Signals \emph{advanced playground} sections which will require more time to play with before grasping concepts than regular \emph{playground} sections.
\end{advplayground}
Readers new to \Rlang should read at least chapters \ref{chap:R:introduction} to \ref{chap:R:functions} sequentially, possibly starting by reading the parts and doing the exercises not marked as advanced. However, I expect it to be most useful for these readers not to skip the descriptions of unusual features and special cases completely, but rather to skim them enough to get an idea of what special situations they may face as \Rlang users. Playground exercises should not be skipped, as they are a key component of the didactic approach used.
Readers already familiar with \Rlang will be able to read the chapters in the book in any order, as need arises. In the long run, I expect \emph{Learn R: As a Language} to remain useful as a reference to those using it as a textbook, both for refreshing the mainstream features and for dealing with the oddities and quirks of the language. To make its use as a reference easy, I have been thorough with indexing, including many carefully chosen terms, their synonyms and the names of all R objects and constructs discussed, collecting them in three alphabetical indexes: \emph{General index}, \emph{Index of R names by category}, and \emph{Alphabetic index of R names} starting at pages \pageref{idx:general}, \pageref{idx:rcats} and \pageref{idx:rindex}, respectively. I have also included cross references that link related sections.
Readers should not aim at remembering all the details presented in the book. Using this and other books and documentation effectively as references depends on a good grasp of the larger picture of \Rlang and how to navigate the documentation; i.e., it is more important to remember abstractions and in what situations they are used, and function names, than the details of how to use them. Developing a sense of when one needs to be careful not to fall into a ``language trap'' is also important.
\section*{Typography and syntax highlighting}
I use the notation \textcolor{blue}{\code{<value>}}, \textcolor{blue}{\code{<statement>}}, etc., as a generic placeholder in diagrams indicating \emph{any valid value}, \emph{any valid R statement}, etc.
R code chunks are typeset in a typewriter font, using colour to highlight the different elements of the syntax, such as variables, functions, constant values, etc. R code elements embedded in the text are similarly typeset but always black. For example, in the ``code chunk'' below \code{mean()} and \code{print()} are functions; 1, 5 and 3 are constant numeric values, and \code{z} is the name of a variable where the result of the computation done in the first line of code is stored. The line starting with \code{\#\# } shows what is printed to the screen when executing the second statement: \code{[1] 1}.
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{z} \hlkwb{<-} \hlkwd{mean}\hlstd{(}\hlnum{1}\hlstd{,} \hlnum{5}\hlstd{,} \hlnum{3}\hlstd{)}
\hlkwd{print}\hlstd{(z)}
\end{alltt}
\begin{verbatim}
## [1] 1
\end{verbatim}
\end{kframe}
\end{knitrout}
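A variation to try at the \Rlang console, offered as a sketch rather than as part of the original example: function \code{mean()} takes the values to be averaged as a single vector passed as its first argument, so in the call above only the first value, 1, is treated as data. Combining the three values with \code{c()} first gives their mean.

\begin{verbatim}
# c() combines the three values into one numeric vector
z <- mean(c(1, 5, 3))
print(z)
## [1] 3
\end{verbatim}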
%\newpage
%\newpage
%\begin{infobox}
%\noindent
%\textbf{Status as of 2016-11-23.} I have updated the manuscript to track package updates since the previous version uploaded six months ago, and added several examples of the new functionality added to packages \ggpmisc, \ggrepel, and \ggplot. I have written new sections on packages \viridis, \pkgname{gganimate}, \pkgname{ggstance}, \pkgname{ggbiplot}, \pkgname{ggforce}, \pkgname{ggtern} and \pkgname{ggalt}. Some of these sections are to be expanded, and additional sections are planned for other recently released packages.
%
%With respect to the chapter \textit{Storing and manipulating data with R} I have put it on hold, except for the introduction, until I can see a soon to be published book covering the same subject. Hadley Wickham has named the set of tools developed by him and his collaborators as \textit{tidyverse} to be described in the book titled \textit{R for Data Science} by Grolemund and Wickham (O'Reilly).
%
%An important update to \ggplot was released last week, and it includes changes to the behavior of some existing functions, specially faceting has become extensible through other packages. Several of the new facilities are described in the updated text and code included in this book and this pdf has been generated with up-to-date version of \ggplot and packages as available today from CRAN, except for \pkgname{ggtern} which was downloaded from Bitbucket minutes ago.
%
%The present update adds about 100 pages to the previous versions. I expect to upload a new update to this manuscript in one or two months time.
%
%\textbf{Status as of 2017-01-17.} Added ``playground'' exercises to the chapter describing \ggplot, and converted some of the examples earlier part of the main text into these playground items. Added icons to help readers quickly distinguish playground sections (\textcolor{blue}{\noticestd{"0055}}), information sections (\textcolor{blue}{\modpicts{"003D}}), warnings about things one needs to be specially aware of (\colorbox{yellow}{\typicons{"E136}}) and boxes with more advanced content that may require longer time/more effort to grasp (\typicons{"E04E}). Added to the sections \code{scales} and examples in the \ggplot chapter details about the use of colors in \Rlang and \ggplot2. Removed some redundant examples, and updated the section on \code{plotmath}. Added terms to the alphabetical index. Increased line-spacing to avoid uneven spacing with inline code bits.
%
%\textbf{Status as of 2017-02-09.} Wrote section on ggplot2 themes, and on using system- and Google fonts in ggpplots with the help of package \pkgname{showtext}. Expanded section on \ggplot's \code{annotation}, and revised some sections in the ``R scripts and Programming'' chapter. Started writing the data chapter. Wrote draft on writing and reading text files. Several other smaller edits to text and a few new examples.
%
%\textbf{Status as of 2017-02-14.} Wrote sections on reading and writing MS-Excel files, files from statistical programs such as SPSS, SyStat, etc., and NetCDF files. Also wrote sections on using URLs to directly read data, and on reading HTML and XML files directly, as well on using JSON to retrieve measured/logged data from IoT (internet of things) and similar intelligent physical sensors, micro-controller boards and sensor hubs with network access.
%
%\textbf{Status as of 2017-03-25.} Revised and expanded the chapter on plotting maps, adding a section on the manipulation and plotting of image data. Revised and expanded the chapter on extensions to \pkgname{ggplot2}, so that there are no longer empty sections. Wrote short chapter ``If and when \Rlang needs help.'' Revised and expanded the ``Introduction'' chapter. Added index entries, and additional citations to literature.
%
%\textbf{Status as of 2017-04-04.} Revised and expanded the chapter on using \Rpgrm as a calculator. Revised and expanded the ``Scripts'' chapter. Minor edits to ``Functions'' chapter. Continued writing chapter on data, writing a section on \Rlang native apply functions and added preliminary text for a pipes and tees section. Write intro to `tidyverse' and grammar of data manipulation. Added index entries, and a few additional citations to the literature. Spell checking.
%
%\textbf{Status as of 2017-04-08.} Completed writing first draft of chapter on data, writing all the previously missing sections on the ``grammar of data manipulation.'' Wrote two extended examples in the same chapter. Add table listing several extensions to \pkgname{ggplot2} not described in the book.
%
%\textbf{Status as of 2017-04-13.} Revised all chapters correcting some spelling mistakes, adding some explanatory text and indexing all functions and operators used. Thoroughly revised the Introduction chapter and the Preface. Expanded section on bar plots (now bar and column plots). Revised section on tile plots. Expanded section on factors in chapter 2, adding examples of reordering of factor labels, and making clearer the difference between the labels of the levels and the levels themselves.
%
%\textbf{Status as of 2017-04-29.} Tested with R 3.4.0. Package \pkgname{gganimate} needs to be installed from GitHub as the updated version is not yet in CRAN. Function \code{gg\_animate()} has been renamed \code{gganimate().}
%
%\textbf{Status as of 2017-05-14.} Submitted package \pkgname{learnrbook} to CRAN. Revised code in the book
%to use this new package. Small fixes after more testing. Added examples of plotting and labeling based on fits with \code{method = "nls"}, including use of the new \code{ggpmisc::stat\_fit\_tidy()}.
%
%\textbf{Status as of 2017-06-11.} Added sections on R-code bench marking and profiling for performance optimization. Added also an example of explicit compilation of a function defined in the R language. Added section on functions \code{assign()}, \code{get()} and \code{mget()}.
%
%\textbf{Status as of 2017-08-12.} Various edits to all chapters. Expanded section on \pkgname{ggpmisc} to include the new functionality added in version 0.2.15.9002: \code{geom\_table} and \code{stat\_fit\_tb}. Added section on package \pkgname{ggbeeswarm}. Added sections on packages \pkgname{magick} and on using \pgrmname{ImageJ} from \Rpgrm. Improved indexing and cross references.
%
%\textbf{Status as of 2017-10-25.} Edited the chapter on using R as a calculator, adding examples on insertion and deletion of members of lists and vectors, and also of use of \code{gl()} and \code{reorder()}. Edited sections on scale limits and added new section on coordinate limits to explain more thoroughly their differences and uses in chapter on plotting with \pkgname{ggplot2}. Added a section on package \pkgname{ggsignif} to the chapter on extensions to \pkgname{ggplot2}. Expanded section on \pkgname{ggpmisc} in the same chapter describing new functionality added in version 0.2.16.
%\pkgname{ggplo2} $>=$ 2.2.1.9000 is required by the current development version of \pkgname{ggpmisc}.
%
%\textbf{Status as of 2017-10-30.} Add section on using pipes with \code{ggplot()} and layers.
%\end{infobox}