forked from bcb420-2022/R_basics
-
Notifications
You must be signed in to change notification settings - Fork 1
/
10-R-control-structures.Rmd
508 lines (380 loc) · 16.5 KB
/
10-R-control-structures.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
# Control structures of R {#r-control-structures}
(if, else if, ifelse, for, and while; vectorized commands as alternatives)
## Overview
### Abstract:
Introducing control structures: if, else if, ifelse, for, and while.
### Objectives:
This unit will:
* introduce the main control structures of R;
### Outcomes:
After working through this unit you:
* can read, analyze and write conditional expressions using if, else, and the ifelse() function;
* can read, analyze and write for loops using the range operator and the seq_along() function;
* can construct while loops with a termination condition.
### Deliverables:
**Time management**: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
**Journal**: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
**Insights**: If you find something particularly noteworthy about this unit, make a note in your insights! page.
### Prerequisites
[RPR-Subsetting (Subsetting and filtering R objects)](#r-subsetting)
## Control structures
"Control structures" are parts of the syntax that execute code according to whether some condition is met. Let's look at this with some simple examples:
### if and else
```{r, eval=FALSE}
# Code pattern
if (<conditional expression evaluates to TRUE>) {
<execute this block>
} else if (<expression evaluates to TRUE>) {
<execute this block>
} else {
<execute this block>
}
```
### conditional expressions
* anything that evaluates to TRUE or FALSE, or can be coerced to a logical.
* Obviously the operators:
* ! - not equals
* == - equals
* < - less than
* \\> - greater than
* <= - less than or equal
* \\>= - greater than or equal
* but there are also a number of in-built functions that are useful for this purpose:
* all - are all values in the set TRUE
* any - are any of the values in the set TRUE
* exists - does the object exist
* is.character - is the object of type character
* is.factor - is the object of type factor
* is.integer - is the object of type integer
* is.null - is the object null
* is.na - is the object NA
* is.nan - is the object infinite.
* is.numeric - is the object of type numeric
* is.unsorted - is the object sorted
* is.vector - is the object of type vector
* Simple "if" statement:
* Rolling a die. If you get a "six", you get to roll again.
```{r}
x <- sample(1:6, 1)
if (x == 6) {
x <- c(x, sample(1:6, 1))
}
print(x)
```
### "if", "else if", and "else"
Here is a popular dice game called high-low.
```{r}
a <- sample(1:6, 1)
b <- sample(1:6, 1)
if (a + b > 7) {
print("high")
} else if (a + b < 7) {
print("low")
} else {
print("seven")
}
```
We need to talk about conditional expressions that test for more than one condition for a moment, things like: "if this is true OR that is true OR my birthday is a Sunday this year ...". To join logical expressions, R has two distinct sets of operators:
* | and ||,
* and & and &&.
**| is for "or" and & is for "and". But what about && and ||?**
The single operators are "vectorized" whereas the doubled operators short-circuit.
This means if you apply the single operators to a vector, you get a vector of results:
```{r}
x <- c(1, 3, 5, 7, 11, 13, 17)
x > 3 & x < 17 # FALSE FALSE TRUE TRUE TRUE TRUE FALSE: all comparisons
x [x > 3 & x < 17] # 5 7 11 13
```
whereas this stop at the first FALSE
```{r}
x > 3 && x < 17 # FALSE:
```
The vectorized version is usually what you want, e.g. for subsetting, as above. But it is usually not the right way in control structures: there, you usually want never to evaluate unnecessary expressions. Chained "OR" expressions can be aborted after the first TRUE is encountered, and chained "AND" expressions can be aborted after the first FALSE. Which is what the double operators do.
no error: length test is TRUE so is.na() never gets evaluated
```{r}
x <- numeric()
if (length(x) == 0 || is.na(x)) { print("zero") } # .
```
whereas this throws an error, because is.na() is evaluated even though x has length zero.
```{r, eval = FALSE}
if (length(x) == 0 | is.na(x)) { print("zero") }
```
**Bottom line: always use || and && in control structures.**
### ifelse
The ifelse() function deserves special mention: its arguments work like an if / else construct ...
ifelse(5 > 7, TRUE, FALSE)
```{r}
ifelse(runif(1) > 0.5, "pickles", "cucumbers")
```
i.e. ifelse(<condition is true>, <evaluate this>, <evaluate that> )
But the cool thing about ifelse() is that it's vectorized! You can apply it to a whole vector of conditions at once:
```{r}
set.seed(27); runif(10)
```
```{r}
set.seed(27); runif(10) > 0.2
```
```{r}
set.seed(27); ifelse(runif(10) > 0.2, "caution", " to the wind")
```
```{r}
set.seed(NULL)
```
### for
"for" loops are the workhorse of innumerable R scripts. They are controlled by a sequence, and a variable. The body of the loop is executed once for each element in the sequence. Most often, the sequence is a sequence of integers, created with the colon - the range operator. The pattern is:
for (<element> in <vector>) {
<expressions using element...>
}
```{r}
# simple for loop
for (i in 1:10) {
print(c(i, i^2, i^3))
}
```
Let's stay with the high-low game for a moment:
* What are the odds of winning?
* Let's simulate some runs with a "for" loop.
```{r}
N <- 25000
outcomes <- character(N) # initialize an empty vector
for (i in 1:N) { # repeat, assign each element of 1:N to
# the variable "i" in turn
a <- sample(1:6, 1)
b <- sample(1:6, 1)
if (a + b > 7) {
outcomes[i] <- "high"
} else if (a + b < 7) {
outcomes[i] <- "low"
} else {
outcomes[i] <- "seven"
}
}
head(outcomes, 36)
table(outcomes) # the table() function tabulates the elements
# of a vector
```
round((36 * table(outcomes))/N) # Can you explain this expression?
Note that there is nothing special about the expression for (i in 1:N) { ... . Any expression that generates a sequence of items will do; I write a lot of code like for (fileName in dir()) { ... or for (gene in data$name) {... , or for (column in colnames(expressionTable)) {... etc.
Loops in **R** can be slow if you are not careful how you write them. The reason is usually related to dynamically managing memory. If you can, you should always pre-define objects of sufficient size to hold your results. Even better, use a vectorized approach.
### Compare excution times: one million square roots from a vector of random numbers ...
Version 1: Naive for-loop: grow result object as required
```{r}
N <- 1000000 # Set N to a large number
x <- runif(N) # get N uniformily distributed random numbers
y <- numeric() # create a variable to assign to
startTime <- Sys.time() # save start time
for (i in 1:N) { # loop N-times
y[i] <- sqrt(x[i]) # calculate one square root, grow y to store it
}
Sys.time() - startTime # time it took
rm(x) # clean up
rm(y)
```
Version 2: Define result object to be large enough
```{r}
N <- 1000000 # Set N to a large number
x <- runif(N) # get N uniformily distributed random numbers
y <- numeric(N) # create a variable with N slots
startTime <- Sys.time() # save start time
for (i in 1:N) { # loop N-times
y[i] <- sqrt(x[i]) # calculate one square root, store in Y
}
Sys.time() - startTime # time it took
rm(x) # clean up
rm(y)
```
Version 3: vectorized
```{r}
N <- 1000000 # Set N to a large number
x <- runif(N) # get N uniformily distributed random numbers
startTime <- Sys.time() # save start time
y <- sqrt(x) # sqrt() is vectorized!
Sys.time() - startTime # time it took
rm(x) # clean up
rm(y)
```
**The tiny change of pre-allocating memory for the result object y, rather than dynamically growing the vector has made a huge difference. But using the vectorized version of the sqrt() function directly is the fastest approach.**
### seq_along() vs. range
Consider the following carefully:
Assume we write a loop to iterate over vectors of variable length for example going from e to pi with a given number of elements:
```{r}
( v5 <- seq(exp(1), pi, length.out = 5) )
```
```{r}
( v2 <- seq(exp(1), pi, length.out = 2) )
```
```{r}
( v1 <- seq(exp(1), pi, length.out = 1) )
```
```{r}
( v0 <- seq(exp(1), pi, length.out = 0) )
```
The idiom we will probably find most commonly for this task is uses the range operator ":"
```{r}
1:length(v5)
1:length(v2)
```
```{r}
for (i in 1:length(v5)) {
print(v5[i])
}
```
```{r}
for (i in 1:length(v2)) {
print(v2[i])
}
```
```{r}
for (i in 1:length(v1)) {
print(v1[i])
}
```
```{r}
for (i in 1:length(v0)) {
print(v0[i])
}
```
The problem with the last iteration is: we probably didn't want to execute the loop if the vector has length 0. But since 1:length(v0) is the same as 1:0, we get an erroneous execution.
This is why we should always use the following idiom instead, when iterating over a vector: the function seq_along().
**seq_along() builds a vector of indices over its argument.**
```{r}
seq_along(v5)
```
```{r}
seq_along(v2)
```
```{r}
seq_along(v1)
```
```{r}
seq_along(v0)
```
```{r}
for (i in seq_along(v5)) {
print(v5[i])
}
```
```{r}
for (i in seq_along(v2)) {
print(v2[i])
}
```
```{r}
for (i in seq_along(v1)) {
print(v1[i])
}
```
```{r}
for (i in seq_along(v0)) {
print(v0[i])
}
```
Now we get the expected behaviour: no output if the vector is empty.
### loops vs. vectorized expressions
If you can achieve your result with an **R** vector expression, it will be faster than using a loop. But sometimes you need to do things explicitly, for example if you need to access intermediate results.
Here is an example to play some more with loops: a password generator.
Passwords are a pain. We need them everywhere, they are frustrating to type, to remember and since the cracking programs are getting smarter they become more and more likely to be broken.
Here is a simple password generator that creates random strings with consonant/vowel alterations. These are melodic and easy to memorize, but actually as strong as an 8-character, fully random password that uses all characters of the keyboard such as "!He.%2jJ" or "#hb,B2X^" (which is pretty much unmemorizable).
The former is taken from 20^7 X 7^7 10^15 possibilities, the latter is from 94^8 ~ 6 X 10^15 possibilities. High-end GPU supported password crackers can test about 109 passwords a second, the passwords generated by this little algorithm would thus take on the order of 10^6 seconds or eleven days to crack^[That's assuming the worst case in that the attacker needs to know the pattern with which the password is formed, i.e. the number of characters and the alphabet that we chose from. But note that there is an even worse case: if the attacker had access to our code and the seed to our random number generator. If you start the random number generator e.g. with a new seed that is generated from Sys.time(), the possible space of seeds can be devastatingly small. But even if a seed is set explicitly with the set.seed(<number>) function, the <number> seed is a 32-bit integer (check this with .Machine$integer.max) and thus can take only a bit more than 4 X 10^9 values, six orders of magnitude less than the 10^15 password complexity we thought we had! It turns out that the code may be a much greater vulnerability than the password itself. Keep that in mind. Keep it secret. Keep it safe.]. This is probably good enough to deter a casual attack.
`r task_counter <- task_counter + 1`
## Task `r task_counter`
```{block, type="rmd-task"}
Copy, study and run the below code
```
```{r, eval = FALSE}
# Suggest memorizable passwords
# Below we use the functions:
?nchar
?sample
?substr
?paste
?print
#define a string of consonants ...
con <- "bcdfghjklmnpqrstvwxz"
# ... and a string of of vowels
vow <- "aeiouy"
for (i in 1:10) { # ten sample passwords to choose from ...
pass = rep("", 14) # make an empty character vector
for (j in 1:7) { # seven consonant/vowel pairs to be created ...
k <- sample(1:nchar(con), 1) # pick a random index for consonants ...
ch <- substr(con,k,k) # ... get the corresponding character ...
idx <- (2*j)-1 # ... compute the position (index) of where to put the consonant ...
pass[idx] <- ch # ... and put it in the right spot
# same thing for the vowel, but coded with fewer intermediate assignments
# of results to variables
k <- sample(1:nchar(vow), 1)
pass[(2*j)] <- substr(vow,k,k)
}
print( paste(pass, collapse="") ) # collapse the vector in to a string and print
}
```
**Try this a few times.**
### while
Whereas a for-loop runs for a fixed number of times, a "while" loop runs as long as a condition is true, possibly forever. Here is an example, again our high-low game: this time we simulate what happens when we play it more than once with a strategy that compensates us for losing.
Let's assume we are playing high-low in a casino. You can bet high or low. You get two dollars for one if you win, nothing if you lose. If you bet "high", you lose if we roll "low" or "seven". Thus your chances of winning are 15/36 = 42%. You play the following strategy: start with 33 dollars. Bet one dollar. If you win, good. If you loose, triple your bet. Stop the game when your funds are gone (bad), or if you have more than 100 dollars (good) - i.e. you have tripled the funds you risked. Also stop if you've played more than 100 rounds and start getting bored.
```{r}
funds <- 33
bet <- 1 # our first bet
nPlays <- 0 # this counts how often we've played
MAXPLAYS <- 100
set.seed(1234567)
while (funds > 0 && funds < 100 && nPlays < MAXPLAYS) {
bet <- min(bet, funds) # can't bet more than we have.
funds <- funds - bet # place the bet
a <- sample(1:6, 1) # roll the dice
b <- sample(1:6, 1)
# we always play "high"
if (a + b > 7) { # we win :-)
result <- "Win! "
funds <- funds + (2 * bet)
bet <- 1 # reset the bet to one dollar
} else { # we lose :-(
result <- "Lose."
bet <- 3 * bet # increase the bet to 3 times previous
}
print(paste("Round", nPlays, result,
"Funds now:", funds,
"Next bet:", bet))
nPlays <- nPlays + 1
}
set.seed(NULL)
```
**Now before you get carried away - try this with different seeds and you'll quickly figure out that the odds of beating the game are not all that great...**
`r task_counter <- task_counter + 1`
## Task `r task_counter`
```{block, type="rmd-task"}
A rocket ship has to sequence a countdown for the rocket to launch. You are starting the countdown from 3. You want to print the value of variable named txt that outputs:
[1] "3" "2" "1" "0" "Lift Off!"
Using what you learned above, write a while loop that gives the output above.
Sample Solution:
start <- 3<br>
txt <- as.character(start)<br>
countdown <- start<br>
while (countdown > 0) { <br>
countdown <- countdown - 1 <br>
txt <- c(txt, countdown) <br>
} <br>
txt <- c(txt, "Lift Off!") <br>
txt<br>
```
## Self-evaluation
## Further reading, links and resources
**If in doubt, ask!**<br>
If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
```{block2, type="rmd-original-history"}
<br>**Author**: Boris Steipe <[email protected]> <br>
**Created**: 2017-08-05<br>
**Modified**: 2019-01-07<br>
Version: 1.1<br>
**Version history**:<br>
1.0 Udate set.seed() usage<br>
1.0.1 Maintenance; clarify for-loop comparison<br>
1.0 Completed to first live version<br>
0.1 Material collected from previous tutorial<br>
```
### Updated Revision history
```{r echo=FALSE}
source("./bcb420_books_helper_functions.R")
knitr::kable(githistory2table(git2r::commits(repo=".",path=knitr::current_input())))
```
### Footnotes: