forked from hadley/adv-r
-
Notifications
You must be signed in to change notification settings - Fork 1
/
OO.Rmd
169 lines (118 loc) · 9.52 KB
/
OO.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
# (PART) Object-oriented programming {-}
\index{object-oriented programming}
```{r setup, include = FALSE}
source("common.R")
library(sloop)
```
# Introduction {#oo .unnumbered}
In the following five chapters you'll learn about __object-oriented programming__ (OOP). OOP is a little more challenging in R than in other languages because:
* There are multiple OOP systems to choose from. In this book, I'll focus
on the three that I believe are most important: __S3__, __R6__, and __S4__.
S3 and S4 are provided by base R. R6 is provided by the R6 package, and is
similar to the Reference Classes, or __RC__ for short, from base R.
* There is disagreement about the relative importance of the OOP systems.
I think S3 is most important, followed by R6, then S4. Others believe that
S4 is most important, followed by RC, and that S3 should be avoided. This
means that different R communities use different systems.
* S3 and S4 use generic function OOP which is rather different from the
encapsulated OOP used by most languages popular today[^julia]. We'll come
back to precisely what those terms mean shortly, but basically, while the
underlying ideas of OOP are the same across languages, their expressions are
rather different. This means that you can't immediately transfer your
existing OOP skills to R.
[^julia]: The exception is Julia, which also uses generic function OOP. Compared to R, Julia's implementation is fully developed and extremely performant.
Generally in R, functional programming is much more important than object-oriented programming, because you typically solve complex problems by decomposing them into simple functions, not simple objects. Nevertheless, there are important reasons to learn each of the three systems:
* S3 allows your functions to return rich results with user-friendly display
and programmer-friendly internals. S3 is used throughout base R, so it's
important to master if you want to extend base R functions to work with new
types of input.
* R6 provides a standardised way to escape R's copy-on-modify semantics.
This is particularly important if you want to model objects that exist
independently of R. Today, a common need for R6 is to model data that comes
from a web API, and where changes come from inside or outside of R.
* S4 is a rigorous system that forces you to think carefully about program
design. It's particularly well-suited for building large systems that evolve
over time and will receive contributions from many programmers. This is
why it is used by the Bioconductor project, so another reason to learn S4
is to equip you to contribute to that project.
The goal of this brief introductory chapter is to give you some important vocabulary and some tools to identify OOP systems in the wild. The following chapters then dive into the details of R's OOP systems:
1. Chapter \@ref(base-types) details the base types which form the foundation
underlying all other OO system.
1. Chapter \@ref(s3) introduces S3, the simplest and most commonly used
OO system.
1. Chapter \@ref(r6) discusses R6, an encapsulated OO system built on
top of environments.
1. Chapter \@ref(s4) introduces S4, which is similar to S3 but more formal and
more strict.
1. Chapter \@ref(oo-tradeoffs) compares these three main OO systems. By
understanding the trade-offs of each system you can appreciate when to use
one or the other.
This book focusses on the mechanics of OOP, not its effective use, and it may be challenging to fully understand if you have not done object-oriented programming before. You might wonder why I chose not to provide more immediately useful coverage. I have focused on mechanics here because they need to be well described somewhere (writing these chapters required a considerable amount of reading, exploration, and synthesis on my behalf), and using OOP effectively is sufficiently complex to require a book-length treatment; there's simply not enough room in *Advanced R* to cover it in the depth required.
## OOP systems {-}
Different people use OOP terms in different ways, so this section provides a quick overview of important vocabulary. The explanations are necessarily compressed, but we will come back to these ideas multiple times.
The main reason to use OOP is __polymorphism__ (literally: many shapes). Polymorphism means that a developer can consider a function's interface separately from its implementation, making it possible to use the same function form for different types of input. This is closely related to the idea of __encapsulation__: the user doesn't need to worry about details of an object because they are encapsulated behind a standard interface.
To be concrete, polymorphism is what allows `summary()` to produce different outputs for numeric and factor variables:
```{r}
diamonds <- ggplot2::diamonds
summary(diamonds$carat)
summary(diamonds$cut)
```
You could imagine `summary()` containing a series of if-else statements, but that would mean only the original author could add new implementations. An OOP system makes it possible for any developer to extend the interface with implementations for new types of input.
To be more precise, OO systems call the type of an object its __class__, and an implementation for a specific class is called a __method__. Roughly speaking, a class defines what an object _is_ and methods describe what that object can _do_.
The class defines the __fields__, the data possessed by every instance of that class. Classes are organised in a hierarchy so that if a method does not exist for one class, its parent's method is used, and the child is said to __inherit__ behaviour. For example, in R, an ordered factor inherits from a regular factor, and a generalised linear model inherits from a linear model. The process of finding the correct method given a class is called __method dispatch__.
There are two main paradigms of object-oriented programming which differ in how methods and classes are related. In this book, we'll borrow the terminology of _Extending R_ [@extending-R] and call these paradigms encapsulated and functional:
* In __encapsulated__ OOP, methods belong to objects or classes, and method
calls typically look like `object.method(arg1, arg2)`. This is called
encapsulated because the object encapsulates both data (with fields) and
behaviour (with methods), and is the paradigm found in most popular
languages.
* In __functional__ OOP, methods belong to __generic__ functions, and method
calls look like ordinary function calls: `generic(object, arg2, arg3)`.
This is called functional because from the outside it looks like a regular
function call, and internally the components are also functions.
With this terminology in hand, we can now talk precisely about the different OO systems available in R.
## OOP in R {-}
Base R provides three OOP systems: S3, S4, and reference classes (RC):
* __S3__ is R's first OOP system, and is described in _Statistical Models
in S_ [@white-book]. S3 is an informal implementation of functional OOP
and relies on common conventions rather than ironclad guarantees.
This makes it easy to get started with, providing a low cost way of
solving many simple problems.
* __S4__ is a formal and rigorous rewrite of S3, and was introduced in
_Programming with Data_ [@programming-with-data]. It requires more upfront
work than S3, but in return provides more guarantees and greater
encapsulation. S4 is implemented in the base __methods__ package, which is
always installed with R.
(You might wonder if S1 and S2 exist. They don't: S3 and S4 were named
according to the versions of S that they accompanied. The first two
versions of S didn't have any OOP framework.)
* __RC__ implements encapsulated OO. RC objects are a special type of S4
objects that are also __mutable__, i.e., instead of using R's usual
copy-on-modify semantics, they can be modified in place. This makes them
harder to reason about, but allows them to solve problems that are difficult
to solve in the functional OOP style of S3 and S4.
A number of other OOP systems are provided by CRAN packages:
* __R6__ [@R6] implements encapsulated OOP like RC, but resolves some
important issues. In this book, you'll learn about R6 instead of RC, for
reasons described in Section \@ref(why-r6).
* __R.oo__ [@R.oo] provides some formalism on top of S3, and makes it
possible to have mutable S3 objects.
* __proto__ [@proto] implements another style of OOP based on the idea of
__prototypes__, which blur the distinctions between classes and instances
of classes (objects). I was briefly enamoured with prototype based
programming [@mutatr] and used it in ggplot2, but now think it's better to
stick with the standard forms.
Apart from R6, which is widely used, these systems are primarily of theoretical interest. They do have their strengths, but few R users know and understand them, so it is hard for others to read and contribute to your code.
## sloop {-}
Before we go on I want to introduce the sloop package:
```{r}
library(sloop)
```
The sloop package (think "sail the seas of OOP") provides a number of helpers that fill in missing pieces in base R. The first of these is `sloop::otype()`. It makes it easy to figure out the OOP system used by a wild-caught object:
```{r}
otype(1:10)
otype(mtcars)
mle_obj <- stats4::mle(function(x = 1) (x - 2) ^ 2)
otype(mle_obj)
```
Use this function to figure out which chapter to read to understand how to work with an existing object.