-
Notifications
You must be signed in to change notification settings - Fork 72
/
cm103.Rmd
568 lines (383 loc) · 16.1 KB
/
cm103.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
# (3) Functional programming in R: Part I
```{r include=FALSE}
knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)
```
```{r}
library(tidyverse)
library(glue)
library(tictoc)
```
## Today's Agenda
- Announcements:
- Reminder about Assignment 1 and 2
- Milestone 2 is now posted
- Office hours are on Tuesday & Thursday after class in this room, and on Fridays from 10-12 in ESB 1045.
- See [here for the schedule](https://stat545.stat.ubc.ca/officehours/)
- [Download today's participation file](https://github.com/STAT547-UBC-2019-20/Discussions/blob/master/participation/cm103/cm103_participation.Rmd) (and commit into your participation repo)
- Part 1: Introduction to functional programming (FP) (10 mins)
- Motivation for functions and for vectorizing operations
- Anatomy of a function in R
- Comments on RScript vs. RMarkdown vs. RNotebook
- Part 2: Vectorization
- What is vectorization?
- Why do we use vectorization?
- Examples of vectorized operations in R
- Part 3: Functional programming using the `purrr` package
- `purrr::map`
- Use the right `purrr::map*` function based on your desired output
- Specify some arguments of the function
- Mapping with two data objects
- Mapping with more than two data objects
## Learning outcomes for this lecture
1. Define the philosophy of functional programming in R.
1. Describe the benefits of vectorizing R code.
1. Apply vectorization to tasks in R.
1. List and describe the `map` functions from the `purrr` package.
1. Apply functions from the `purrr` package to vectorize tasks in R.
1. Describe anonymous functions, apply them, and use the shorthand notation in `purrr` functions.
## Part 1: Introduction to functional programming (FP) (10 mins)
### Motivation for functions and for vectorizing operations
- [Hadley Wickham's cupcake recipes](https://speakerdeck.com/hadley/the-joy-of-functional-programming?slide=13)
### Anatomy of a function in R
![](img/anatomy_of_function.png)
```{r}
func_name <- function (arg1, arg2 = 5, arg3 = TRUE) {
if (arg3 == TRUE) {
print(glue::glue(arg1," will be raised to the power of ", arg2))
}
arg1^arg2
}
func_name(arg1 = 4, arg2 = 3, arg3 = FALSE)
func_name()
```
### Rscript (.R) vs. RMarkdown (.Rmd) vs. RNotebook (Rmd + special YAML Header)
[This site](http://uc-r.github.io/r_notebook) has some nice visuals that show you differences between an Rmarkdown document and an RNotebook.
Rscripts are not interactive and designed to be run from the command line.
## Part 2: Vectorization
Many thanks to one of our teaching assistants [Sirine Chahma](https://github.com/sirine-chahma) for the first draft of this lecture!
### What is vectorization?
There are several ways of applying the same operation to all the elements of a given vector.
You can "brute force" it:
```{r, bruteforce_multiplication}
x <- c(1, 2, 3, 4)
y <- c()
y[1] = x[1]*2
y[2] = x[2]*2
y[3] = x[3]*2
y[4] = x[4]*2
y
```
But it's very easy to make mistakes when you're copy/pasting code like this so it's a good rule of thumb to think of better ways to do things when you have to copy and paste the same block of code more than about once.
Let's try this again:
```{r, bruteforce_multiplication2}
x <- c(1, 2, 3, 4)
y <- c()
y[1] = x[1]*2
y[2] = x[2]*2
y[3] = x[3]*2
y[4] = x[4]*2
y
```
Rats! We made another mistake.
Find and fix the mistake in the code above please!
Okay, let's get to the better way of doing things.
You can use a loop :
```{r}
1:length(x)
```
```{r, loop_multiplication}
x <- c(1, 2, 3, 4)
y <- c()
for (i in 1:length(x)){
y[i] <- x[i]*2
}
y
```
There is a function called `seq_along` that essentially replaces `1:length(x)` in the code chunk above:
```{r, loop_multiplication2}
x <- c(1, 2, 3, 4)
y <- c()
for (i in seq_along(x)){
y[i] <- x[i]*2
}
y
```
We will use `seq_along(x)` and `1:length(x)` interchangeably.
So, this works and is much less error-prone, but in this case - there is actually an even better option, - vectorized operations!
Let's see an example of it:
```{r, vectorized_multiplication}
x <- c(1, 2, 3, 4)
y <- x*2
y
```
You might have thought this was an obvious thing to try, and you'd be right - R has some built in functions to handle vectorization "behind the scenes".
For example, we can sum the values of two vectors :
```{r, loop_addition}
x1 <- c(1, 2, 3, 4)
x2 <- c(10, 20, 30, 40)
y <- c()
for (i in 1:length(x1)){
y[i] <- x1[i] + x2[i]
}
y
```
but built-in vectorization in R allows us to do this:
```{r, vectorized_addition}
x1 <- c(1, 2, 3, 4)
x2 <- c(10, 20, 30, 40)
y <- x1 + x2
y
```
### Why do we use vectorization?
Let's come back to the first example we saw (multiply the values of a vector by 2), but let's use a bigger vector this time.
```{r, create_big_vector}
x <- 1:100000000
print(glue('The length of x is ', length(x)))
```
Take a guess at how long the loop below is going to take to run (Hint: the answer is "in the seconds")?
```
# Guess at how long this loop takes
x <- 1:100000000
for (i in 1:length(x)){
y[i] <- 2*x[i]
}
## YOUR GUESS HERE
```
Let's try using the `tictoc` package to time how long this operation takes.
`tic` starts the clock, and `toc` stops the clock and prints out the total time.
```{r, long_loop, eval = FALSE}
y <- c()
#start timing
tic()
for (i in 1:length(x)){
y[i] <- 2*x[i]
}
#end timing
toc()
```
Let's take a look at the time taken by the vectorized operation now :
```{r, long_vectorization}
#start timming
tic()
y <- x*2
#end timming
toc()
```
Wow! That is amazing - see how much faster the vectorized operation is compared to the `for` loop.
It's usually recommended to use vectorized operation rather than regular loops for several reasons, including memory efficiency, speed, readability, "debugability", and easily being able to add tests (more on this next week).
### Examples of vectorized operations
Here are a few examples of other operations that are vectorized.
- Check if the values of two vectors are the same :
```{r, boolean_equal}
x1 <- c(1, 2, 3, 4)
x2 <- c(1, 2, 1, 2)
y <- x1 == x2
# Can you guess the values of `y`?
print(c(TRUE, TRUE, FALSE, FALSE))
```
And the answer is (run in RStudio):
```{r, answer_boolean_equal, eval = FALSE}
y
```
- Compare the values of two vectors :
```{r, boolean_greater}
x1 <- c(1, 2, 3, 4)
x2 <- c(1, 2, 1, 2)
y <- x1 > x2
# Can you guess the values of `y`?
print("YOUR SOLUTION HERE")
```
And the answer is:
```{r, answer_boolean_greater, eval = FALSE}
y
```
- Logical comparaisons can also be used:
```{r, logical}
# compares each elements of each vector by position
y <- c(TRUE, TRUE, TRUE) & c(FALSE, TRUE, TRUE)
y
```
There are a lot of other operations that are vectorized!
Here is a list of vector operators : [R Operators cheat sheet](https://cran.r-project.org/doc/contrib/Baggott-refcard-v2.pdf)
## Part 3: Functional programming using the `purrr` package
Until now, we have just applied simple operations to vectors.
The functions were only applied to a single element of the vector, which were doubles.
What if we want to use data frames (as you likely will in your projects)?
In this case, one "element" becomes a whole vector (a column of the data frame), and the functions have to accept a vector as an input.
Let's now try to work with data frames.
How do we apply a function to all the columns of a data frame?
We are going to work with the `iris` data frame :
```{r, load_df}
#select only the columns that represents a numerical variable
iris_df <- iris %>%
select(-Species)
head(iris_df)
```
Let's compute the mean of each column using a for loop :
```{r,, mean_loop}
means <- vector("double", ncol(iris_df))
## YOUR SOLUTION HERE
for (i in seq_along(iris_df)) {
means[i] <- mean(iris_df[[i]], na.rm = TRUE)
}
```
`means` contains the means of each column :
```{r, answer_mean_loop, eval = FALSE}
means
```
We can do the same to find the minimum of each column :
```{r, min_loop}
mins <- vector("double", ncol(iris_df))
## YOUR SOLUTION HERE
for (i in seq_along(iris_df)) {
mins[i] <- min(iris_df[[i]], na.rm = TRUE)
}
```
`mins` contains the minimums of each column :
```{r, answer_min_loop, eval = FALSE}
mins
```
The two loops we just wrote seem to very similar to each other, we should try to write a function that takes the function we want to apply and a data frame as its inputs.
```{r, my_map_function}
my_function <- function(x, fun) {
value <- vector("double", ncol(x))
for (i in seq_along(x)) {
value[i] <- fun(x[[i]], na.rm = TRUE)
}
value
}
```
Let's check if we find the same values as before.
Try calling `my_function` to compute the `mean` and `min` of `iris_df`:
```{r, my_function_mean}
## YOUR SOLUTION HERE
my_function(iris_df, mean)
```
```{r, my_function_min}
## YOUR SOLUTION HERE
my_function(iris_df, min)
```
We find exactly the same values as when we were using the for loop!
*Note*: We have just written a functional, which is a *function* that takes *another function* as an input, and returns a vector as an output.
Just as a preview, the `purrr` package has some really convenient function(al)s that allow us to pass in other functions to apply to data frame.
### The most general `purrr` function: `map`
The `purrr:map` function takes at least two arguments : a data frame and a function.
`map(.x, .f, ...)`
This means that we are going to apply the function `f` for every element of `x`.
This image may help you to better understand what does the `purrr:map` function does :
<img src="https://d33wubrfki0l68.cloudfront.net/12f6af8404d9723dff9cc665028a35f07759299d/d0d9a/diagrams/functionals/map-list.png" width=500>
Source: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham.
*Note* : In this image, the elements of the object that are used as an input seem to be the rows, but when we use a data frame as the input, they actually correspond to the columns of the data frame.
Let's calculate the mean of the columns of the iris data frame :
```{r, purrr_mean_map}
library(purrr)
map(iris_df, mean)
```
The only difference with our `my_function` function we created above is that the output is a list!
### Use the right `purrr::map*` function based on your desired output
Now, let's take a look at the other functions that exist in the `purrr` library.
Here is a [cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/purrr.pdf) that contains a list of all the functions, and how to use them.
We have `map_chr` (character vector), `map_dbl` (double/numeric vector), `map_dfc` (dfc for dataframe columns and dfr for dataframe rows), `map_int` (integer) and `map_lgl` (logical).
Let's practice with `purrr:map_dbl`:
```{r, purrr_mean_map_dbl}
library(purrr)
map_dbl(iris_df, mean)
```
This time, the output is a vector containing doubles!
This is exactly what we had with the function we created.
What if we want to specify some arguments of our function (ignore the NAs when we compute the mean for instance)?
We need to do a bit of work to do that - essentially we need to tell the `map` functional to also consider the `na.rm` argument of the mean function.
Let's see how...
### Specify some arguments of the function
Let's introduce some missing data in our data frame :
```{r, missing_data}
iris_NA <- iris_df
iris_NA[1, 1] <- NA
```
What happens if we use `purrr:map_dbl`?
```{r, answer_map_dbl_NA}
map_dbl(iris_NA, mean)
```
The mean of the first column is now equal to NA. To solve this issue, we can use `na.rm = TRUE` as an argument of the `mean` function. But how do we add this to our `map_dbl` call?
We have to create what we call an *anonymous function*.
```{r, answer_map_dbl_anonymous_fun}
map_dbl(iris_NA, function(df) mean(df, na.rm = TRUE))
```
The general format of an anonymous function is `function(x) body of the function`.
For example, if you want to compute $4^2$ using an anonymous function, it would be :
```{r, anonymous_fun_example}
(function(x) x**2)(4)
```
The anonymous function is surrounded by round brackets, and so is the input of the anonymous function.
*Note* : There is a shorter way to write anonymous functions :
```{r, short_anonymous_fun}
map_dbl(iris_NA, ~ mean(., na.rm = TRUE))
```
The `function(df)` is replaced by `~` and the argument of the function is replaced by a `.`.
### Mapping with two data objects
So far, we have only used the `purrr:map` function that only takes one data object and one function as an argument.
What if we wanted to do more complicated operations, that use a function that needs more than one input?
For example, how would you calculate the weighted means (using `weighted.mean`) of the columns of a given data frame, where the weights are in another data frame?
Let's create a data frame that contains the weights picking some randomly generated values from the `iris_NA` dataset (according to the poisson distribution using the `rpois` function) :
```{r, create_weights}
weights <- tibble(weight_sepal_legth = rpois(nrow(iris_NA), 3),
weight_sepal_width = rpois(nrow(iris_NA), 3),
weight_petal_legth = rpois(nrow(iris_NA), 3),
weight_petal_width = rpois(nrow(iris_NA), 3),)
```
First, let's see what are the parameters of `weighted.mean`
```{r, weighted_mean_parameters}
?weighted.mean
```
In order to know which `purrr:map*` function we have to use, you can consult the handy table where each row is the table corresponds to "the thing you want to map".
Each column represents the type you want the "output" of the map function to be, either a list, an atomic (vector), the same type as the input, and no output (useful if you want to modify things in place).
<img src="img/map_family.png" width=900>
Source: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham.
As we have two arguments, we should use the `purrr:map2*` function. As we want the output of the function to be a data frame, we are goint to use `purrr:map2_df`.
```{r, map2_NA}
map2_df(iris_NA, weights, weighted.mean)
```
We have the same issue as before because of the NAs... We should use an anonymous function!
```{r, map2_NA_anonymous_function}
map2_df(iris_NA, weights, function(x, y) weighted.mean(x, y, na.rm = TRUE))
```
What would be the short form of this anonymous function?
```{r, answer_map2_NA_anonymous_function_short}
## YOUR SOLUTION HERE
map2_df(iris_NA, weights, ~ weighted.mean(.x, .y, na.rm = TRUE))
```
*WARNING* : if `y` has less elements than `x`, the elements of `y` will be used several times.
This could have some nasty side-effects, but is also quite useful!
<img src="https://d33wubrfki0l68.cloudfront.net/55032525ec77409e381dcd200a47e1787e65b964/dcaef/diagrams/functionals/map2-recycle.png" width=400>
Source: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham.
### Mapping with more than two data objects
When we have more than two arguments, we should use the `purrr:pmap*` function.
```{r, pmap_example}
f <- function(x, y, z, arg = 0){
(x+y+z)/3 + arg
}
pmap(list(c(1, 1), c(1, 2), c(1, 3)), f)
```
If we want to use an anonymous function, we have to us `..1, ..2, ..3`.
```{r, pmap_anonymous_function}
pmap(list(c(1, 1), c(1, 2), c(1, 3)), ~ f(..1, ..2, ..3, arg=2))
```
*Note* : if you use `purrr:pmap*` on a single data frame, it will iterate row-wise!
Example : Try to find the mean of all the rows of the `iris_df` dataset (which doesn't really make sense, but let's do it anyway).
```{r, answer_pmap_row}
## YOUR SOLUTION HERE
pmap(iris_df, ~ mean(.x)) %>%
head()
```
### Summary and key points
- Cupcakes (vanilla and chocolate and espresso) as motivation for writing functions in R
- The anatomy of an R function
- Vectorize - Which R functions are vectorized? What vectorization means.
- The benefits of vectorization (100M size vector takes 32s in a for loop, and <1 s in a vectorized function)
- Use the `purrr` package to "map" over dataframes, vectors, etc...
- There were more than 1 map_* ; choose the one that's appropriate for your use case
- map2 and pmap! - as well as how to write anonymous functions in R
### Additional Resources
1. Chapter 21 of [R for Data Science](https://r4ds.had.co.nz/iteration.html).
2. [Learn to purr](http://www.rebeccabarter.com/blog/2019-08-19_purrr/) blog post.
3. Chapter 9 of [Advanced R for Data Science](https://adv-r.hadley.nz/functionals.html).