# R Notation
Now that you have a deck of cards, you need a way to do card-like things with it. First, you'll want to reshuffle the deck from time to time. And next, you'll want to deal cards from the deck (one card at a time, whatever card is on top—we're not cheaters).
To do these things, you'll need to work with the individual values inside your data frame, a task essential to data science. For example, to deal a card from the top of your deck, you'll need to write a function that selects the first row of values in your data frame, like this:
```r
deal(deck)
## face suit value
## king spades 13
```
You can select values within an R object with R's notation system.
## Selecting Values
R has a notation system that lets you extract values from R objects. To extract a value or set of values from a data frame, write the data frame's name followed by a pair of hard brackets:
```r
deck[ , ]
```
Between the brackets will go two indexes separated by a comma. The indexes tell R which values to return. R will use the first index to subset the rows of the data frame and the second index to subset the columns.
You have a choice when it comes to writing indexes. There are six different ways to write an index for R, and each does something slightly different. They are all very simple and quite handy, so let's take a look at each of them. You can create indexes with:
* Positive integers
* Negative integers
* Zero
* Blank spaces
* Logical values
* Names
The simplest of these to use is positive integers.
### Positive Integers
R treats positive integers just like _ij_ notation in linear algebra: `deck[i,j]` will return the value of `deck` that is in the _ith_ row and the _jth_ column (Figure \@ref(fig:positive)). Notice that _i_ and _j_ only need to be integers in the mathematical sense. They can be saved as numerics in R:
```r
head(deck)
## face suit value
## king spades 13
## queen spades 12
## jack spades 11
## ten spades 10
## nine spades 9
## eight spades 8
deck[1, 1]
## "king"
```
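
As a small sketch of the point about numerics: `1` is stored as a double and `1L` as an integer, and both select the same cell. The `deck` below is a stand-in rebuilt so the example runs on its own; its card order (kings first, spades first) is assumed to match the deck from the previous chapter.

```r
# Stand-in deck, assumed to match the one built earlier in the book
faces <- c("king", "queen", "jack", "ten", "nine", "eight", "seven",
           "six", "five", "four", "three", "two", "ace")
suits <- c("spades", "clubs", "diamonds", "hearts")
deck <- data.frame(
  face = rep(faces, times = 4),
  suit = rep(suits, each = 13),
  value = rep(13:1, times = 4),
  stringsAsFactors = FALSE
)

typeof(1)     # a plain number is stored as a double
## "double"
typeof(1L)    # the L suffix stores it as an integer
## "integer"

# Both kinds of index select the same value
identical(deck[1, 1], deck[1L, 1L])
## TRUE
```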
To extract more than one value, use a vector of positive integers. For example, you can return the first row of `deck` with `deck[1, c(1, 2, 3)]` or `deck[1, 1:3]`:
```r
deck[1, c(1, 2, 3)]
## face suit value
## king spades 13
```
R will return the values of `deck` that are in both the first row and the first, second, and third columns. Note that R won't actually remove these values from `deck`. R will give you a new set of values which are copies of the original values. You can then save this new set to an R object with R's assignment operator:
```r
new <- deck[1, c(1, 2, 3)]
new
## face suit value
## king spades 13
```
```{block2, type = "rmdtip"}
**Repetition**
If you repeat a number in your index, R will return the corresponding value(s) more than once in your "subset." This code will return the first row of `deck` twice:

    deck[c(1, 1), c(1, 2, 3)]
    ## face suit value
    ## king spades 13
    ## king spades 13
```
```{r positive, echo = FALSE, fig.cap = "R uses the _ij_ notation system of linear algebra. The commands in this figure will return the shaded values."}
knitr::include_graphics("images/hopr_0401.png")
```
R's notation system is not limited to data frames. You can use the same syntax to select values in any R object, as long as you supply one index for each dimension of the object. So, for example, you can subset a vector (which has one dimension) with a single index:
```r
vec <- c(6, 1, 3, 6, 10, 5)
vec[1:3]
## 6 1 3
```
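
The "one index for each dimension" rule can be sketched with objects of one, two, and three dimensions. The names `mat` and `arr` are hypothetical and used only for illustration:

```r
vec <- c(6, 1, 3, 6, 10, 5)
vec[2]                          # one dimension, one index
## 1

mat <- matrix(1:6, nrow = 2)    # 2 rows, 3 columns, filled column by column
mat[2, 3]                       # two dimensions, two indexes
## 6

arr <- array(1:24, dim = c(2, 3, 4))
arr[2, 3, 4]                    # three dimensions, three indexes
## 24
```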
```{block2, type = "rmdtip"}
**Indexing begins at 1**
In some programming languages, indexing begins with 0. This means that 0 returns the first element of a vector, 1 returns the second element, and so on.
This isn't the case with R. Indexing in R behaves just like indexing in linear algebra. The first element is always indexed by 1. Why is R different? Maybe because it was written for mathematicians. Those of us who learned indexing from a linear algebra course wonder why computer programmers start with 0.
```
```{block2, type = "rmdtip"}
**drop = FALSE**
If you select two or more columns from a data frame, R will return a new data frame:

    deck[1:2, 1:2]
    ## face suit
    ## king spades
    ## queen spades

However, if you select a single column, R will return a vector:

    deck[1:2, 1]
    ## "king" "queen"

If you would prefer a data frame instead, you can add the optional argument `drop = FALSE` between the brackets:

    deck[1:2, 1, drop = FALSE]
    ## face
    ## king
    ## queen

This method also works for selecting a single column from a matrix or an array.
```
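
Here is a minimal sketch of the same `drop = FALSE` behavior on a matrix. The matrix `mat` is hypothetical and stands in for any two-dimensional object:

```r
mat <- matrix(1:6, nrow = 2)

col <- mat[1:2, 1]                 # dimensions dropped: a plain vector
is.matrix(col)
## FALSE

kept <- mat[1:2, 1, drop = FALSE]  # dimensions kept: a 2 x 1 matrix
is.matrix(kept)
## TRUE
dim(kept)
## 2 1
```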
### Negative Integers
Negative integers do the exact opposite of positive integers when indexing. R will return every element _except_ the elements in a negative index. For example, `deck[-1, 1:3]` will return everything _but_ the first row of `deck`. `deck[-(2:52), 1:3]` will return the first row (and exclude everything else):
```r
deck[-(2:52), 1:3]
## face suit value
## king spades 13
```
Negative integers are a more efficient way to subset than positive integers if you want to include the majority of a data frame's rows or columns.
R will return an error if you try to pair a negative integer with a positive integer in the _same_ index:
```r
deck[c(-1, 1), 1]
## Error in xj[i] : only 0's may be mixed with negative subscripts
```
However, you can use both negative and positive integers to subset an object if you use them in _different_ indexes (e.g., if you use one in the rows index and one in the columns index, like `deck[-1, 1]`).
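
The same rules can be sketched on a plain vector. One detail worth seeing in code is why `-(2:52)` needs its parentheses: without them, R applies the minus sign before the colon, so `-1:3` is the sequence from -1 to 3 rather than the negatives of 1 to 3.

```r
vec <- c(6, 1, 3, 6, 10, 5)

vec[-1]           # everything except the first element
## 1 3 6 10 5
vec[-(1:3)]       # everything except the first three elements
## 6 10 5
vec[c(-1, -6)]    # several negative values can share one index
## 1 3 6 10

-1:3              # not the same as -(1:3)
## -1 0 1 2 3
```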
### Zero
What would happen if you used zero as an index? Zero is neither a positive integer nor a negative integer, but R will still use it to do a type of subsetting. R will return nothing from a dimension when you use zero as an index. This creates an empty object:
```r
deck[0, 0]
## data frame with 0 columns and 0 rows
```
To be honest, indexing with zero is not very helpful.
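
Still, a short sketch shows what "nothing" looks like. The three-card data frame `cards` is hypothetical and used only for illustration. A zero index empties the dimension it is applied to, while the other dimension keeps its structure:

```r
vec <- c(6, 1, 3, 6, 10, 5)
empty <- vec[0]           # a vector with no elements
length(empty)
## 0

# Hypothetical three-card data frame, for illustration only
cards <- data.frame(face = c("king", "queen", "jack"),
                    value = c(13, 12, 11))
none <- cards[0, 1:2]     # zero rows, but both columns survive
nrow(none)
## 0
names(none)
## "face" "value"
```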
### Blank Spaces
You can use a blank space to tell R to extract _every_ value in a dimension. This lets you subset an object on one dimension but not the others, which is useful for extracting entire rows or columns from a data frame:
```r
deck[1, ]
## face suit value
## king spades 13
```
### Logical Values {#logic}
If you supply a vector of `TRUE`s and `FALSE`s as your index, R will match each `TRUE` and `FALSE` to a row in your data frame (or a column depending on where you place the index). R will then return each row that corresponds to a `TRUE` (Figure \@ref(fig:logicals)).
It may help to imagine R reading through the data frame and asking, "Should I return the _i_th row of the data structure?" and then consulting the _i_th value of the index for its answer. For this system to work, your vector must be as long as the dimension you are trying to subset:
```r
deck[1, c(TRUE, TRUE, FALSE)]
## face suit
## king spades
rows <- c(TRUE, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F,
F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F,
F, F, F, F, F, F, F, F, F, F, F, F, F, F)
deck[rows, ]
## face suit value
## king spades 13
```
```{r logicals, echo = FALSE, fig.cap = "You can use vectors of TRUEs and FALSEs to tell R exactly which values you want to extract and which you do not. The command would return just the numbers 1, 6, and 5."}
knitr::include_graphics("images/hopr_0402.png")
```
This system may seem odd—who wants to type so many `TRUE`s and `FALSE`s?—but it will become very powerful in [Modifying Values].
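
In the meantime, you do not have to type a long logical index by hand. As a minimal sketch, base R's `rep` can build the same 52-value vector. The `deck` here is a stand-in, assumed to match the one from the previous chapter:

```r
# Stand-in deck, assumed to match the one built earlier in the book
faces <- c("king", "queen", "jack", "ten", "nine", "eight", "seven",
           "six", "five", "four", "three", "two", "ace")
suits <- c("spades", "clubs", "diamonds", "hearts")
deck <- data.frame(
  face = rep(faces, times = 4),
  suit = rep(suits, each = 13),
  value = rep(13:1, times = 4),
  stringsAsFactors = FALSE
)

rows <- c(TRUE, rep(FALSE, 51))   # one TRUE followed by 51 FALSEs
length(rows)
## 52

deck[rows, ]
## face suit value
## king spades 13
```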
### Names
Finally, you can ask for the elements you want by name—if your object has names (see [Names](#names)). This is a common way to extract the columns of a data frame, since columns almost always have names:
```r
deck[1, c("face", "suit", "value")]
## face suit value
## king spades 13
# the entire value column
deck[ , "value"]
## 13 12 11 10 9 8 7 6 5 4 3 2 1 13 12 11 10 9 8
## 7 6 5 4 3 2 1 13 12 11 10 9 8 7 6 5 4 3 2
## 1 13 12 11 10 9 8 7 6 5 4 3 2 1
```
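
Names work the same way on other named objects. A small sketch with a hypothetical named vector, `values`:

```r
# Hypothetical named vector, for illustration only
values <- c(king = 13, queen = 12, jack = 11)

values["queen"]
## queen 
##    12 

# Elements come back in the order you ask for them
values[c("jack", "king")]
## jack king 
##   11   13 
```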
## Deal a Card
Now that you know the basics of R's notation system, let's put it to use.
```{exercise, name = "Deal a Card"}
Complete the following code to make a function that returns the first row of a data frame:
```
```r
deal <- function(cards) {
  # ?
}
```
```{solution}
You can use any of the systems that return the first row of your data frame to write a `deal` function. I'll use positive integers and blanks because I think they are easy to understand:
```
```r
deal <- function(cards) {
  cards[1, ]
}
```
The function does exactly what you want: it deals the top card from your data set. However, the function becomes less impressive if you run `deal` over and over again:
```r
deal(deck)
## face suit value
## king spades 13
deal(deck)
## face suit value
## king spades 13
deal(deck)
## face suit value
## king spades 13
```
`deal` always returns the king of spades because `deck` doesn't know that we've dealt the card away. Hence, the king of spades stays where it is, at the top of the deck ready to be dealt again. This is a difficult problem to solve, and we will _deal_ with it in [Environments]. In the meantime, you can fix the problem by shuffling your deck after every deal. Then a new card will always be at the top.
Shuffling is a temporary compromise: the probabilities at play in your deck will not match the probabilities that occur when you play a game with a single deck of cards. For example, there will still be a probability that the king of spades appears twice in a row. However, things are not as bad as they may seem. Most casinos use five or six decks at a time in card games to prevent card counting. The probabilities that you would encounter in those situations are very close to the ones we will create here.
## Shuffle the Deck
When you shuffle a real deck of cards, you randomly rearrange the order of the cards. In your virtual deck, each card is a row in a data frame. To shuffle the deck, you need to randomly reorder the rows in the data frame. Can this be done? You bet! And you already know everything you need to do it.
This may sound silly, but start by extracting every row in your data frame:
```r
deck2 <- deck[1:52, ]
head(deck2)
## face suit value
## king spades 13
## queen spades 12
## jack spades 11
## ten spades 10
## nine spades 9
## eight spades 8
```
What do you get? A new data frame whose order hasn't changed at all. What if you asked R to extract the rows in a different order? For example, you could ask for row 2, _then_ row 1, and then the rest of the cards:
```r
deck3 <- deck[c(2, 1, 3:52), ]
head(deck3)
## face suit value
## queen spades 12
## king spades 13
## jack spades 11
## ten spades 10
## nine spades 9
## eight spades 8
```
R complies. You'll get all the rows back, and they'll come in the order you ask for them. If you want the rows to come in a random order, then you need to sort the integers from 1 to 52 into a random order and use the results as a row index. How could you generate such a random collection of integers? With our friendly neighborhood `sample` function:
```r
random <- sample(1:52, size = 52)
random
## 35 28 39 9 18 29 26 45 47 48 23 22 21 16 32 38 1 15 20
## 11 2 4 14 49 34 25 8 6 10 41 46 17 33 5 7 44 3 27
## 50 12 51 40 52 24 19 13 42 37 43 36 31 30
deck4 <- deck[random, ]
head(deck4)
## face suit value
## five diamonds 5
## queen diamonds 12
## ace diamonds 1
## five spades 5
## nine clubs 9
## jack diamonds 11
```
Now the new set is truly shuffled. You'll be finished once you wrap these steps into a function.
```{exercise, name = "Shuffle a Deck"}
Use the preceding ideas to write a `shuffle` function. `shuffle` should take a data frame and return a shuffled copy of the data frame.
```
```{solution}
Your `shuffle` function will look like the one that follows:
```
```r
shuffle <- function(cards) {
  random <- sample(1:52, size = 52)
  cards[random, ]
}
```
Nice work! Now you can shuffle your cards between each deal:
```r
deal(deck)
## face suit value
## king spades 13
deck2 <- shuffle(deck)
deal(deck2)
## face suit value
## jack clubs 11
```
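
The `shuffle` function above assumes a deck of exactly 52 rows. As a sketch of one possible variation (not the version used in the rest of the book), you could ask the data frame how many rows it has. The name `shuffle_any` is hypothetical, and the `deck` is a stand-in assumed to match the one from the previous chapter:

```r
# Stand-in deck, assumed to match the one built earlier in the book
faces <- c("king", "queen", "jack", "ten", "nine", "eight", "seven",
           "six", "five", "four", "three", "two", "ace")
suits <- c("spades", "clubs", "diamonds", "hearts")
deck <- data.frame(
  face = rep(faces, times = 4),
  suit = rep(suits, each = 13),
  value = rep(13:1, times = 4),
  stringsAsFactors = FALSE
)

# Hypothetical variant of shuffle that works for any number of rows
shuffle_any <- function(cards) {
  n <- nrow(cards)                        # how many rows to reorder
  random <- sample(seq_len(n), size = n)  # 1 to n in a random order
  cards[random, ]
}

hand <- deck[1:5, ]              # a five-card data frame
shuffled <- shuffle_any(hand)
nrow(shuffled)                   # same cards, new order
## 5
```

Because the row count comes from the data frame itself, the same function can reorder a full deck or a partial one.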
## Dollar Signs and Double Brackets
Two types of object in R obey an optional second system of notation. You can extract values from data frames and lists with the `$` syntax. You will encounter the `$` syntax again and again as an R programmer, so let's examine how it works.
To select a column from a data frame, write the data frame's name and the column name separated by a `$`. Notice that no quotes should go around the column name:
```r
deck$value
## 13 12 11 10 9 8 7 6 5 4 3 2 1 13 12 11 10 9 8 7
## 6 5 4 3 2 1 13 12 11 10 9 8 7 6 5 4 3 2 1 13
## 12 11 10 9 8 7 6 5 4 3 2 1
```
R will return all of the values in the column as a vector. This `$` notation is incredibly useful because you will often store the variables of your data sets as columns in a data frame. From time to time, you'll want to run a function like `mean` or `median` on the values in a variable. In R, these functions expect a vector of values as input, and `deck$value` delivers your data in just the right format:
```r
mean(deck$value)
## 7
median(deck$value)
## 7
```
You can use the same `$` notation with the elements of a list, if they have names. This notation has an advantage with lists, too. If you subset a list in the usual way, R will return a _new_ list that has the elements you requested. This is true even if you only request a single element.
To see this, make a list:
```r
lst <- list(numbers = c(1, 2), logical = TRUE, strings = c("a", "b", "c"))
lst
## $numbers
## [1] 1 2
## $logical
## [1] TRUE
## $strings
## [1] "a" "b" "c"
```
And then subset it:
```r
lst[1]
## $numbers
## [1] 1 2
```
The result is a smaller _list_ with one element. That element is the vector `c(1, 2)`. This can be annoying because many R functions do not work with lists. For example, `sum(lst[1])` will return an error. It would be horrible if once you stored a vector in a list, you could only ever get it back as a list:
```r
sum(lst[1])
## Error in sum(lst[1]) : invalid 'type' (list) of argument
```
When you use the `$` notation, R will return the selected values as they are, with no list structure around them:
```r
lst$numbers
## 1 2
```
You can then immediately feed the results to a function:
```r
sum(lst$numbers)
## 3
```
If the elements in your list do not have names (or you do not wish to use the names), you can use two brackets, instead of one, to subset the list. This notation will do the same thing as the `$` notation:
```r
lst[[1]]
## 1 2
```
In other words, if you subset a list with single-bracket notation, R will return a smaller list. If you subset a list with double-bracket notation, R will return just the values that were inside an element of the list. You can combine this feature with any of R's indexing methods:
```r
lst["numbers"]
## $numbers
## [1] 1 2
lst[["numbers"]]
## 1 2
```
This difference is subtle but important. In the R community, there is a popular, and helpful, way to think about it (Figure \@ref(fig:trains)). Imagine that each list is a train and each element is a train car. When you use single brackets, R selects individual train cars and returns them as a new train. Each car keeps its contents, but those contents are still inside a train car (i.e., a list). When you use double brackets, R actually unloads the car and gives you back the contents.
```{r trains, echo = FALSE, fig.cap = "It can be helpful to think of your list as a train. Use single brackets to select train cars, double brackets to select the contents inside of a car."}
knitr::include_graphics("images/hopr_0403.png")
```
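
You can check the train analogy directly. The sketch below reuses the `lst` defined earlier and asks R what each kind of subset is:

```r
lst <- list(numbers = c(1, 2), logical = TRUE, strings = c("a", "b", "c"))

is.list(lst[1])       # single brackets: a shorter train, still a list
## TRUE
is.list(lst[[1]])     # double brackets: the contents of the car
## FALSE

length(lst[c(1, 3)])  # single brackets can select several cars at once
## 2

identical(lst[[1]], lst$numbers)  # double brackets and $ return the same vector
## TRUE
sum(lst[[1]])
## 3
```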
```{block2, type = "rmdimportant"}
**Never attach**
In R's early days, it became popular to use `attach()` on a data set once you had it loaded. Don't do this! `attach` recreates a computing environment similar to those used in other statistics applications like Stata and SPSS, which crossover users liked. However, R is not Stata or SPSS. R is optimized to use the R computing environment, and running `attach()` can cause confusion with some R functions.
What does `attach()` do? On the surface, `attach` saves you typing. If you attach the `deck` data set, you can refer to each of its variables by name; instead of typing `deck$face`, you can just type `face`. But typing isn't bad. It gives you a chance to be explicit, and in computer programming, explicit is good. Attaching a data set creates the possibility that R will confuse two variable names. If this occurs within a function, you're likely to get unusable results and an unhelpful error message to explain what happened.
```
Now that you are an expert at retrieving values stored in R, let's summarize what you've accomplished.
## Summary
You have learned how to access values that have been stored in R. You can retrieve a copy of values that live inside a data frame and use the copies for new computations.
In fact, you can use R's notation system to access values in any R object. To use it, write the name of an object followed by brackets and indexes. If your object is one-dimensional, like a vector, you only need to supply one index. If it is two-dimensional, like a data frame, you need to supply two indexes separated by a comma. And, if it is _n_-dimensional, you need to supply _n_ indexes, each separated by a comma.
In [Modifying Values], you'll take this system a step further and learn how to change the actual values that are stored inside your data frame. This is all adding up to something special: complete control of your data. You can now store your data in your computer, retrieve individual values at will, and use your computer to perform correct calculations with those values.
Does this sound basic? It may be, but it is also powerful and essential for efficient data science. You no longer need to memorize everything in your head, nor worry about doing mental arithmetic wrong. This low-level control over your data is also a prerequisite for more efficient R programs, the subject of [Project 3: Slot Machine].