---
title: "Stat 33A - Lecture Notes 6"
date: September 27, 2020
output: pdf_document
---
R Graphics Overview
===================
There are three main systems for creating visualizations in R:

1. The base R functions, including `plot()`.
2. The `lattice` package. The interface is similar to the base R
   functions, but uses lists of parameters to control plot details.
3. The `ggplot2` package. The interface is a "grammar of graphics"
   where plots are assembled from layers.

Both `lattice` and `ggplot2` are based on R's low-level `grid`
graphics system.

It is usually easier to customize visualizations made with base R.
Both `lattice` and `ggplot2` are better at handling grouped data and
generally require less code to create a nice-looking visualization.

We'll learn `ggplot2`.
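
As a rough illustration of how the three interfaces differ, the sketch below draws the same scatter plot three times. It uses R's built-in `mtcars` data set rather than any data from these notes, and it assumes the `lattice` and `ggplot2` packages are installed. The variable names `p_lattice` and `p_gg` are only for illustration.

```{r}
library(lattice)
library(ggplot2)

# 1. Base R: one function call draws directly on the graphics device.
plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg")

# 2. lattice: a formula interface that returns a "trellis" object.
p_lattice = xyplot(mpg ~ wt, data = mtcars)
print(p_lattice)

# 3. ggplot2: layers combined with `+`, returning a "ggplot" object.
p_gg = ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
print(p_gg)
```

Note that `lattice` and `ggplot2` both return plot objects that are drawn when printed, while base R draws as a side effect of the call.
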
The Tidyverse
=============
`ggplot2` is part of a popular collection of packages for doing data science in
R called the Tidyverse (<https://www.tidyverse.org/>).
The Tidyverse packages are made by many of the same people that make RStudio.
They provide alternatives to R's built-in tools for:

* Reading files (package `readr`)
* Manipulating data frames (packages `dplyr`, `tidyr`, `tibble`)
* Manipulating strings (package `stringr`)
* Manipulating factors (package `forcats`)
* Functional programming (package `purrr`)
* Making visualizations (package `ggplot2`)

The Tidyverse packages are popular but controversial, because some of them use
a syntax different from base R.

RStudio cheat sheets (mostly for Tidyverse packages):
<https://rstudio.com/resources/cheatsheets/>
Tidy Data
---------
Most Tidyverse packages, including `ggplot2`, are designed for working with tidy
data sets.

A data set is **tidy** if (and only if):

1. Each observation has its own row.
2. Each feature has its own column.
3. Each value has its own cell.

Tidy data sets are convenient for data analysis in general.

The `tidyr` package has tools to clean up untidy data sets, and also examples
of untidy data.
```{r}
# install.packages("tidyr")
library(tidyr)
```
An example of untidy data:
```{r}
table2
```
Another example:
```{r}
table4a
```
We'll learn how to clean up untidy data later.
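
As a preview, here is one way `table4a` might be tidied with `pivot_longer()` from `tidyr`. This is only a sketch: the result name `tidy4a` and the new column names `year` and `cases` are chosen here for illustration.

```{r}
library(tidyr)

# In table4a the years 1999 and 2000 are column names, but they are really
# values of a "year" feature. Gather them into a single column.
tidy4a = pivot_longer(
  table4a,
  cols = c(`1999`, `2000`),
  names_to = "year",    # new column holding the old column names
  values_to = "cases"   # new column holding the old cell values
)
tidy4a
```

After this step each row is one observation (a country in a given year), which matches the three rules above.
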
Tibbles
=======
**Tibbles** are Tidyverse's improved version of data frames.

Compared to an ordinary data frame, tibbles:

* Print differently
* Default to `drop = FALSE` for the subset operator `[`
* Don't allow partial matching for the dollar operator `$`

If you never load a Tidyverse package with `library()`, tibbles behave like
ordinary data frames.
Let's load some data to test these properties:
```{r}
dogs = readRDS("data/dogs/dogs.rds")
dogs_tbl = readRDS("data/dogs/dogs_tibble.rds")
```
Tibbles have the classes "tbl" and "tbl_df" in addition to "data.frame":
```{r}
class(dogs)
class(dogs_tbl)
```
For the tibble, using `[` to select a single column DOES NOT drop the result to a vector; it stays a tibble:
```{r}
class(dogs[, 1])
class(dogs[, 1, drop = FALSE])
class(dogs_tbl[, 1])
```
For the tibble, `$` DOES NOT allow partial matches:
```{r, error = TRUE}
dogs$bree
dogs_tbl$bree
```
There are `as` functions to convert from/to tibbles:
```{r}
# Convert tibble to data frame
class(as.data.frame(dogs_tbl))
# Convert data frame to tibble
library(tibble)
class(as_tibble(dogs))
```
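
If the `dogs` files are not at hand, the same two differences can be checked with any data frame. The sketch below uses the built-in `mtcars` data set, and the names `cars_df` and `cars_tbl` are only for illustration.

```{r}
library(tibble)

cars_df = mtcars
cars_tbl = as_tibble(mtcars)

# `[` with one column: the data frame drops to a vector, the tibble does not.
is.data.frame(cars_df[, 1])
is.data.frame(cars_tbl[, 1])

# `$` with a partial name: the data frame matches the `mpg` column, while the
# tibble returns NULL (with a warning).
head(cars_df$mp)
cars_tbl$mp
```
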
The Grammar of Graphics
=======================
The fundamental idea of `ggplot2` is that all graphics are composed of layers.
As an example, let's create a simplified version of the Best in Show
visualization:
<https://informationisbeautiful.net/visualizations/best-in-show-whats-the-top-data-dog/>
First load ggplot2:
```{r}
# install.packages("ggplot2")
library(ggplot2)
```
And also the dogs data:
```{r}
dogs = readRDS("data/dogs/dogs.rds")
```
This data set is tidy!
Layer 1: Data
-------------
Use the data layer to select the data to plot.
Call the `ggplot()` function to set the data layer:
```{r}
ggplot(dogs)
```
Layer 2: GEOMetry
-----------------
Use the geometry layer to select the shape to plot.
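Add a `geom_` function to set the geometry layer.
On its own this raises an error, because `geom_point()` does not yet know which
columns to use for `x` and `y`:
```{r, error = TRUE}
ggplot(dogs) + geom_point()
```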
Layer 3: AESthetics
-------------------
Use the AESthetic layer to select how data columns correspond to shapes.
Call the `aes()` function in `ggplot()` to set the aesthetic layer:
```{r}
ggplot(dogs, aes(x = datadog, y = popularity)) + geom_point()
```
This applies to the entire plot.
You can also set the aesthetic layer for an individual geometry:
```{r}
ggplot(dogs) + geom_point(aes(x = datadog, y = popularity))
```
This only applies to that geometry.
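
Because layers are combined with `+`, a partially built plot can be stored in a variable and extended later. A minimal sketch, using the built-in `mtcars` data set and the illustrative names `base` and `scatter`:

```{r}
library(ggplot2)

# Data and aesthetic layers only: printing this shows axes but no points.
base = ggplot(mtcars, aes(x = wt, y = mpg))

# Adding a geometry layer returns a new plot; `base` itself is unchanged.
scatter = base + geom_point()
scatter
```

This can be handy when several plots share the same data and aesthetics but use different geometries.
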
Extended Example
----------------
For example, let's add breed labels to the plot:
```{r}
ggplot(dogs, aes(x = datadog, y = popularity, label = breed)) + geom_point() +
geom_text()
```
Now let's color the dogs by group:
```{r}
ggplot(dogs, aes(x = datadog, y = popularity, label = breed, color = group)) +
geom_point() + geom_text()
```
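Where we put the aesthetics matters.
Here `color` is set only in `geom_text()`, so the labels are colored by group
but the points are not: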
```{r}
ggplot(dogs, aes(x = datadog, y = popularity, label = breed)) +
geom_point() + geom_text(aes(color = group))
```
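
One way to check which layers picked up an aesthetic is to inspect the computed layer data with `layer_data()` from `ggplot2`. The sketch below repeats the comparison on the built-in `mtcars` data set. The names `cars_df`, `p_global`, and `p_local` are only for illustration.

```{r}
library(ggplot2)

# Keep the car names as an ordinary column so they can be used as labels.
cars_df = data.frame(model = rownames(mtcars), mtcars)

# Color set in ggplot(): inherited by both the points and the text.
p_global = ggplot(cars_df, aes(x = wt, y = mpg, label = model,
    color = factor(cyl))) +
  geom_point() + geom_text(size = 2)

# Color set in geom_text(): only the text is colored.
p_local = ggplot(cars_df, aes(x = wt, y = mpg, label = model)) +
  geom_point() + geom_text(aes(color = factor(cyl)), size = 2)

# Layer 1 is the points layer in both plots. Count its distinct colors.
length(unique(layer_data(p_global, 1)$colour))
length(unique(layer_data(p_local, 1)$colour))
```

The points layer should report one color per value of `cyl` in the first plot and a single default color in the second.
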
Saving Plots
============
Recall the plot we made of the dogs data:
```{r}
library(ggplot2)
dogs = readRDS("data/dogs/dogs.rds")
ggplot(dogs) + geom_point(aes(x = datadog, y = popularity))
```
In ggplot2, use `ggsave()` to save the most recent plot you created:
```{r}
ggsave("dogs.png")
```
The file format is selected automatically based on the extension.
Common formats are PNG, JPEG, and PDF.
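
Rather than relying on the most recent plot, you can pass a plot to `ggsave()` explicitly and set the output size. A small sketch, using the built-in `mtcars` data set and a temporary file (the names `p` and `out` are illustrative):

```{r}
library(ggplot2)

p = ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# Write to a temporary file so the sketch does not clutter the project.
out = tempfile(fileext = ".png")

# `plot` selects which plot to save; width and height are in inches by default.
ggsave(out, plot = p, width = 6, height = 4, dpi = 300)
file.exists(out)
```
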
You can also save a plot with one of R's "plot device" functions.
The steps are:

1. Call a plot device function: `png()`, `jpeg()`, `pdf()`, `bmp()`, `tiff()`,
   or `svg()`.
2. Run your code to make the plot.
3. Call `dev.off()` to indicate that you're done plotting.

This will only work in the console!

For example:
```{r, eval=FALSE}
# Run these lines in the console, not the notebook!
jpeg("dogs.jpeg")
ggplot(dogs) + geom_point(aes(x = datadog, y = popularity))
dev.off()
```
This strategy is more general than `ggsave()` -- it works for any of R's
graphics systems.
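
To illustrate that generality, the sketch below writes a base R plot and a `ggplot2` plot to the same PDF file, one per page. It uses the built-in `mtcars` data set and a temporary file. The explicit `print()` matters when the code runs inside a function or loop, where plot objects are not printed automatically.

```{r, eval=FALSE}
# Run these lines in the console, not the notebook!
library(ggplot2)

out = tempfile(fileext = ".pdf")
pdf(out, width = 6, height = 4)   # open the device; sizes are in inches

# A base R plot draws on the open device immediately.
plot(mtcars$wt, mtcars$mpg)

# A ggplot2 plot is drawn when it is printed.
p = ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
print(p)

dev.off()                         # close the device and finish the file
file.exists(out)
```
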
Customizing Plots
=================

Layer       | Description
----------- | -----------
data        | A data frame to visualize
aesthetics  | The map or "wires" between data and geometry
geometry    | Geometry to represent the data visually
labels      | Titles and axis labels
scales      | How numbers in data are converted to numbers on screen
guides      | Legend settings
annotations | Additional geoms that are not mapped to data
facets      | Side-by-side panels
coordinates | Coordinate systems (Cartesian, logarithmic, polar)
statistics  | An alternative to geometry

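
As a quick sketch of how several of these layers combine, the example below adds a facet layer and a labels layer to a scatter plot. It uses the built-in `mtcars` data set, and the name `p_facets` and the label text are only for illustration.

```{r}
library(ggplot2)

p_facets = ggplot(mtcars, aes(x = wt, y = mpg)) +  # data and aesthetics
  geom_point() +                                    # geometry
  facet_wrap(~ cyl) +                               # facets: one panel per cyl
  labs(title = "Fuel economy by weight",            # labels
       x = "Weight", y = "Miles per gallon")
p_facets
```
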
Recall the plot we made earlier:
```{r}
library(ggplot2)
dogs = readRDS("data/dogs/dogs.rds")
ggplot(dogs, aes(x = datadog, y = popularity)) + geom_point()
```
How else can we make our plot look more like the Best in Show plot?
1. Add the dog breeds as text.
Add more geometries to add additional details to a plot:
```{r}
ggplot(dogs, aes(x = datadog, y = popularity, label = breed)) +
geom_point() +
geom_text(size = 2, hjust = 1, vjust = 1, nudge_x = -0.05)
```
See the `ggrepel` package for automatic label positioning.
2. Color the points by type of dog.
```{r}
ggplot(dogs, aes(x = datadog, y = popularity, label = breed)) +
geom_point(aes(color = group)) +
geom_text(size = 2, hjust = 1, vjust = 1, nudge_x = -0.05)
```
We can also set parameters outside of the aesthetics.
Doing so sets a constant value instead of mapping to a feature in the
data.
Set size to 10 for all points:
```{r}
ggplot(dogs, aes(x = datadog, y = popularity, label = breed)) +
geom_point(aes(color = group), size = 10) +
geom_text(size = 2, hjust = 1, vjust = 1, nudge_x = -0.05)
```
Note that if you want to set a constant color for all points, you need to do so
outside of `aes()`:
```{r}
ggplot(dogs, aes(x = datadog, y = popularity)) + geom_point(color = "blue")
```
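If you put the constant inside `aes()` instead, ggplot2 treats `"blue"` as a
data value rather than a color, so the points get a default color and a legend
entry labeled "blue":
```{r}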
ggplot(dogs, aes(x = datadog, y = popularity)) +
geom_point(aes(color = "blue"))
```
You can also use the scales layer to customize the color choices.
Read the documentation for details about parameters.
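
For instance, `scale_color_manual()` lets you choose the color for each group yourself. The sketch below uses the built-in `mtcars` data set. The palette `cyl_colors` and the legend title are hypothetical choices, not part of the Best in Show example.

```{r}
library(ggplot2)

# Hypothetical palette: one named color per value of cyl.
cyl_colors = c("4" = "steelblue", "6" = "orange", "8" = "firebrick")

p_colors = ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  scale_color_manual(values = cyl_colors, name = "Cylinders")
p_colors
```

The mapping in `aes()` still decides which feature controls the color. The scale only decides which colors are used.
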
3. Reverse the y-axis.
Use the scale layer to change axes.
```{r}
ggplot(dogs, aes(x = datadog, y = popularity, label = breed)) +
geom_point(aes(color = group), size = 1) +
geom_text(size = 2, hjust = 1, vjust = 1, nudge_x = -0.05) +
scale_y_reverse()
```
4. Add titles and labels.
```{r}
ggplot(dogs, aes(x = datadog, y = popularity, label = breed)) +
geom_point(aes(color = group), size = 1) +
geom_text(size = 2, hjust = 1, vjust = 1, nudge_x = -0.05) +
scale_y_reverse() +
labs(title = "Best in Show", x = "Datadog Score", y = "Popularity")
```
You could also use the `ggimage` package to replace the points with images of dogs.