Preface, Page xxvii
Zofia Gajdos instead of Zzofia Gajdos
Introduction, Page xxx
In the intro "scrapping" should be "scraping"
Page 3, 31 in pdf
the New File
Should be
then New File
Page 24, 52 in pdf
Change this paragraph
Note that the levels have an order that is different from the order of appearance in the factor object. The default is for the levels to follow alphabetical order. However, often we want the levels to follow a different order. We will see several examples of this in the Data Visualization part of the book. The function reorder lets us change the order of the levels of a factor variable based on a summary computed on a numeric vector. We will demonstrate this with a simple example.
To
Note that the levels have an order that is different from the order of appearance in the factor object. The default in R is for the levels to follow alphabetical order. However, often we want the levels to follow a different order. You can specify an order through the `levels` argument when creating the factor with the `factor` function. For example, in the murders dataset regions are ordered from east to west. The function `reorder` lets us change the order of the levels of a factor variable based on a summary computed on a numeric vector. We will demonstrate this with a simple example, and will see more advanced ones in the Data Visualization part of the book.
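For reference, a minimal sketch (toy vectors, not part of the book text) of the `levels` and `reorder` behavior the new paragraph describes:
```{r}
x <- c("b", "a", "c", "a")
f <- factor(x)
levels(f)                           # alphabetical by default: "a" "b" "c"
f2 <- factor(x, levels = c("c", "b", "a"))
levels(f2)                          # the order we specified
value <- c(10, 30, 20, 30)
f3 <- reorder(f, value, FUN = mean) # levels ordered by the mean of value
levels(f3)                          # "b" "c" "a"
```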
Page 48, 77 in pdf
name space
Should be
namespace
Page 49, 77 in pdf
If we want to be absolutely sure we use the __dplyr__ `filter` we can use
Should be
If we want to be absolutely sure that we use the __dplyr__ `filter`, we can use
page 52-53 in pdf
The list subsection (2.4.6) changed quite a bit. Better to look at the new version. Here is what it should be now:
Data frames are a special case of _lists_. Lists are useful because you can store any combination of different types. You can create a list using the `list` function like this:
```{r}
record <- list(name = "John Doe",
               student_id = 1234,
               grades = c(95, 82, 91, 97, 93),
               final_grade = "A")
```
The function `c` is described in Section \@ref(vectors).
This list includes a character, a number, a vector with five numbers, and another character.
```{r}
record
class(record)
```
As with data frames, you can extract the components of a list with the accessor `$`.
```{r}
record$student_id
```
We can also use double square brackets (`[[`) like this:
```{r}
record[["student_id"]]
```
You should get used to the fact that in R, there are often several ways to do the same thing, such as accessing entries.
You might also encounter lists without variable names.
```{r}
record2 <- list("John Doe", 1234)
record2
```
If a list does not have names, you cannot extract the elements with `$`, but you can still use the brackets method and instead of providing the variable name, you provide the list index, like this:
```{r}
record2[[1]]
```
We won't be using lists until later, but you might encounter one in your own exploration of R. For this reason, we show you some basics here.
page 56
This is not a visible change, but because we added a reference we need to add a label to the section:
## Vectors {#vectors}
Page 63, 91 in pdf
this is now a special data frame called a grouped data frame and dplyr functions
Add comma:
this is now a special data frame called a grouped data frame, and dplyr functions
Page 64, 92 in pdf
To see the states by population, from smallest to largest, we arrange by `rate` instead
Should be
To see the states by murder rate, from lowest to highest, we arrange by `rate` instead:
Page 65, 94 in pdf
"Remember that the main summarization function"
Change to
"The `mean` and `sd` functions"
Page 65, 94 in pdf
if left blank, top_n, filters
Remove the comma after `top_n`:
if left blank, top_n filters
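For reference, a quick sketch (toy data, not from the book) of the behavior this sentence describes, where `wt` left blank makes `top_n` select by the last column:
```{r}
library(dplyr)
df <- data.frame(state = c("A", "B", "C"), rate = c(3.1, 1.2, 5.4))
top_n(df, 2)        # wt left blank: selects the top 2 rows by `rate`, the last column
top_n(df, 2, rate)  # equivalent, with wt given explicitly
```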
Page 70, 98 in pdf
For example to access
Add comma
For example, to access
Page 71, 99 in pdf
__no argument
The `__` should not appear and the phrase should be bold. So in markdown, write:
__no argument__
Page 73, 101 in pdf
three groups
Should be
four groups
Page 73, 101 in pdf
for example to check
Should be
for example, to check
Page 74, 102 in pdf
In exercise 3, murders is in the wrong font. It should be in code font:
murders
Should be
`murders`
Page 82, 110 in pdf
Remove:
Two other functions that are helpful are `scan`.
Page 84, 112 in pdf
Although there are R packages designed to read this format, if you are choosing a file format
to save your own data, you generally want to avoid Microsoft Excel. We recommend Google
Sheets as a free software tool for organizing data. We provide more recommendations in
the section Data Organization with Spreadsheets. This book focuses on data analysis. Yet
often a data scientist needs to collect data or work with others collecting data. Filling out
a spreadsheet by hand is a practice we highly discourage and instead recommend that the
process be automatized as much as possible. But sometimes you just have to do it. In this
section, we provide recommendations on how to store data in a spreadsheet.
Should be
Although this book focuses almost exclusively on data analysis, data management is also an important part of data science. As explained in the introduction, we do not cover this topic. However, quite often data analysts need to collect data, or work with others collecting data, in a way that is most conveniently stored in a spreadsheet. Although filling out a spreadsheet by hand is a practice we highly discourage, and we instead recommend the process be automatized as much as possible, sometimes you just have to do it. Therefore, in this section, we provide recommendations on how to organize data in a spreadsheet. Although there are R packages designed to read Microsoft Excel spreadsheets, we generally want to avoid this format. Instead, we recommend Google Sheets as a free software tool. Below we summarize the recommendations made in a paper by Karl Broman and Kara Woo^[https://www.tandfonline.com/doi/abs/10.1080/00031305.2017.1375989]. Please read the paper for important details.
page 97 in pdf
Replace
One other important difference is that by default `data.frame` coerces characters into factors without providing a warning or message:
```{r}
grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"),
                     exam_1 = c(95, 80, 90, 85),
                     exam_2 = c(90, 85, 85, 90))
class(grades$names)
```
To avoid this, we use the rather cumbersome argument `stringsAsFactors`:
```{r}
grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"),
                     exam_1 = c(95, 80, 90, 85),
                     exam_2 = c(90, 85, 85, 90),
                     stringsAsFactors = FALSE)
class(grades$names)
```
with
```{r}
grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"),
                     exam_1 = c(95, 80, 90, 85),
                     exam_2 = c(90, 85, 85, 90))
```
page 110 in pdf
Change
However, there are a couple of important differences. To show this we read-in the data with an R-base function:
```{r}
dat2 <- read.csv(filename)
```
An important difference is that the characters are converted to factors:
```{r}
class(dat2$abb)
class(dat2$region)
```
This can be avoided by setting the argument `stringsAsFactors` to `FALSE`.
```{r}
dat <- read.csv("murders.csv", stringsAsFactors = FALSE)
class(dat$state)
```
In our experience this can be a cause for confusion since a variable that was saved as characters in file is converted to factors regardless of what the variable represents. In fact, we **highly** recommend setting `stringsAsFactors=FALSE` to be your default approach when using the R-base parsers. You can easily convert the desired columns to factors after importing data.
to
You can obtain a data frame like `dat` using:
```{r}
dat2 <- read.csv(filename)
```
page 100 in pdf
Replace the scan subsection.
Change
### `scan`
to
An often useful R-base importing function is `scan`, as it provides much flexibility.
also add spaces around `=` here:
x <- scan(file.path(path, filename), sep=",", what = "c")
x <- scan(file.path(path, filename), sep = ",", what = "c")
Page 132, 160 in pdf
10\. Notice that the approximation calculated in question *two* is very close to the exact calculation in the first question. Now perform the same task for more extreme values. Compare the exact calculation and the normal approximation for the interval (79,81]. How many times bigger is the actual proportion than the approximation?
Two should be nine
page 108, 136 in pdf
9. Rewrite the code above to abbreviation as the label through aes
should be
9. Rewrite the code above to use abbreviations as the label through aes
page 108, 136 in pdf
10. Change the color of the labels through blue. How will we do this?
should be
10. Change the color of the labels to blue. How will we do this?
Page 132, 160 in pdf
15. In answering the previous questions, we found that it is not at all rare for a seven footer to become an NBA player. What would be a fair critique of our calculations:
a. Practice and talent are what make a great basketball player, not height.
b. The normal approximation is not appropriate for heights.
c. As seen in question 3, the normal approximation tends to underestimate the extreme values. It’s possible that there are more seven footers than we predicted.
d. As seen in question 3, the normal approximation tends to overestimate the extreme values. It’s possible that there are fewer seven footers than we predicted.
Question 3 should be question 10
Page 171, 199 in pdf
For each year, we are simply comparing four quantities – the four percentages. A widely used graphical representation of percentages, popularized by Microsoft Excel, is the pie chart:
Should be (four should be five)
For each year, we are simply comparing five quantities – the five percentages. A widely used graphical representation of percentages, popularized by Microsoft Excel, is the pie chart:
page 182 in pdf
----1----x----10------100---
should be
`----10---x---100------1000---`
Page 199, 147 in pdf
closet should be closest
302 in pdf
There is a - instead of + in CI:
$[\bar{X} - 2\hat{\mbox{SE}}(\bar{X}), \bar{X} - 2\hat{\mbox{SE}}(\bar{X})]$
Should be
$[\bar{X} - 2\hat{\mbox{SE}}(\bar{X}), \bar{X} + 2\hat{\mbox{SE}}(\bar{X})]$
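For reference, a quick sketch (simulated poll data, not the book's) of the corrected interval:
```{r}
set.seed(1)
x <- sample(c(0, 1), 1000, replace = TRUE, prob = c(0.55, 0.45))
x_hat <- mean(x)
se_hat <- sqrt(x_hat * (1 - x_hat) / length(x))
c(x_hat - 2 * se_hat, x_hat + 2 * se_hat)  # note the + in the upper limit
```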
304 in pdf
$1 - (1 - p)/2 + (1 - p)/2 = p$
should be
$1 - (1 - p)/2 - (1 - p)/2 = p$
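A quick numeric check of the corrected identity, for example with $p = 0.95$:
```{r}
p <- 0.95
1 - (1 - p)/2 - (1 - p)/2  # equals p, as the corrected formula states
```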
Page 344, 372 in pdf
get_slope <- function(x, y) cor(x, y) / (sd(x) * sd(y))
Should be
get_slope <- function(x, y) cor(x, y) * sd(y) / sd(x)
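A sketch with simulated data (not the book's) verifying the corrected formula against `lm`:
```{r}
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
get_slope <- function(x, y) cor(x, y) * sd(y) / sd(x)
get_slope(x, y)    # matches the least squares slope below
lm(y ~ x)$coef[2]
```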
Page 344, 372 in pdf
summarize(slope = get_slope(R_per_game, BB_per_game))
Should be:
summarize(slope = get_slope(BB_per_game, R_per_game))
Page 344, 372 in pdf
So does this mean that if we go and hire low salary players with many BB, and who therefore increase the number of walks per game by 2, our team will score 4.2 more runs per game?
Should be (4.2 is now 1.5)
So does this mean that if we go and hire low salary players with many BB, and who therefore increase the number of walks per game by 2, our team will score 1.5 more runs per game?
Page 345, 373 in pdf
#> 1 1.3
Should be
#> 1 0.449
Page 346, 374 in pdf
#> 1 0.4 4.19
#> 2 0.5 3.30
#> 3 0.6 2.55
#> 4 0.7 2.03
#> 5 0.8 2.47
Should be
#> 1 0.4 0.734
#> 2 0.5 0.566
#> 3 0.6 0.412
#> 4 0.7 0.285
#> 5 0.8 0.365
Page 346, 374 in pdf
In fact, the values above are closer to the slope we obtained from singles, 1.3, which is more consistent with our intuition.
Should be (1.3 should be 0.45)
In fact, the values above are closer to the slope we obtained from singles, 0.45, which is more consistent with our intuition.
Page 347, 375 in pdf
#> 1 2.8 6.47
#> 2 2.9 8.87
#> 3 3 7.40
#> 4 3.1 7.34
#> 5 3.2 5.89
Should be:
#> 1 2.8 1.52
#> 2 2.9 1.57
#> 3 3 1.52
#> 4 3.1 1.49
#> 5 3.2 1.58
Page 355 in pdf
72 - 69.1) / 2.5
Should be
(72.0 - 69.1) / 2.5
Page 347, 375 in pdf
#> 1 5.33
Should be:
#> 1 1.84
Page 355, 383 in pdf
get_slope <- function(x, y) cor(x, y) / (sd(x) * sd(y))
Should be
get_slope <- function(x, y) cor(x, y) * sd(y) / sd(x)
Page 373, 401 in pdf
over interpret should be over-interpret
three common ways should be four common ways
Page 391, 419 in pdf
Add a line break after the `+` here:
new_tidy_data %>% ggplot(aes(year, fertility, color = country)) +
  geom_point()
page 453, 481 in pdf
The make_date function can be used to quickly create a date object. It takes three arguments: year, month, day, hour, minute, seconds, and time zone defaulting to the epoch values on UTC time. So create an date object representing, for example, July 6, 2019 we write:
make_date(1970, 7, 6)
#> [1] "1970-07-06"
Should be
make_date(2019, 7, 6)
#> [1] "2019-07-06"
Page 449, 477 in pdf
"It turns them into days since the epoch.The as.Date function can convert a character into a date. So to see that the epoch is day 0 we can type"
There should be a space between "epoch" and "The".
page 472 in pdf
Remove the `stringsAsFactors` argument:
```{r}
new_research_funding_rates <- tab[6:14] %>%
str_trim %>%
str_split("\\s{2,}", simplify = TRUE) %>%
data.frame(stringsAsFactors = FALSE) %>%
setNames(the_names) %>%
mutate_at(-1, parse_number)
new_research_funding_rates %>% as_tibble()
```
should be
```{r}
new_research_funding_rates <- tab[6:14] %>%
str_trim %>%
str_split("\\s{2,}", simplify = TRUE) %>%
data.frame() %>%
setNames(the_names) %>%
mutate_at(-1, parse_number)
new_research_funding_rates %>% as_tibble()
```
Page 485, 513 in pdf
If the outcomes are binary, both RMSE and MSE are equivalent to accuracy, since
Should be
If the outcomes are binary, both RMSE and MSE are equivalent to one minus accuracy, since
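A quick check with toy vectors (not from the book): for binary outcomes and binary predictions, the MSE equals the error rate, i.e., one minus accuracy.
```{r}
y     <- c(1, 0, 1, 1, 0)
y_hat <- c(1, 1, 1, 0, 0)
mean((y - y_hat)^2)   # MSE
1 - mean(y == y_hat)  # one minus accuracy: the same value, 0.4
```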
Page 538, 566 in pdf
fit_glm <- glm(y ~ x_1 + x_2, data=mnist_27$train, family = "binomial")
p_hat_glm <- predict(fit_glm, mnist_27$test)
y_hat_glm <- factor(ifelse(p_hat_glm > 0.5, 7, 2))
confusionMatrix(y_hat_glm, mnist_27$test$y)$overall["Accuracy"]
#> Accuracy
#> 0.76
Should be (type response, and accuracy changes a bit)
fit_glm <- glm(y ~ x_1 + x_2, data=mnist_27$train, family = "binomial")
p_hat_glm <- predict(fit_glm, mnist_27$test, type = "response")
y_hat_glm <- factor(ifelse(p_hat_glm > 0.5, 7, 2))
confusionMatrix(y_hat_glm, mnist_27$test$y)$overall["Accuracy"]
#> Accuracy
#> 0.75
page 518 in pdf
"These are the images of the digits with the largest and smallest values for $X_1$:
And here are the original images corresponding to the largest and smallest value of $X_2$:"
should be
"To illustrate how to interpret $X_1$ and $X_2$, we include four example images. On the left are the original images of the two digits with the largest and smallest values for $X_1$ and on the right we have the images corresponding to the largest and smallest values of $X_2$:"
Page 621 in pdf
The first two numbers are seven and the third one is a 2.
should be
The first and third numbers are sevens and the second one is a two.
Page 667, 695 in pdf
Remove sentence "We have already shown how we can do this with RStudio."
716 in pdf
```{r echo=FALSE}
summary(pressure)
```
Should be
```{r, echo=FALSE}
summary(pressure)
```
639 in pdf
"seven users"
should be
"six users"
687 in pdf
mv ../cv.tex ./
Should be
mv ../resumes/cv.tex ./
page 648
First formula should not have 1/N; it should be:
$$ \sum_{u,i} \left(y_{u,i} - \mu - b_i\right)^2 + \lambda \sum_{i} b_i^2 $$
and "The first term is just least squares"
should be "The first term is just the sum of squares"
"
page 643 in pdf
Change
Let's compute the average rating for user $u$ for those that have rated over 100 movies:
```{r user-effect-hist}
train_set %>%
  group_by(userId) %>%
  summarize(b_u = mean(rating)) %>%
  filter(n()>=100) %>%
  ggplot(aes(b_u)) +
  geom_histogram(bins = 30, color = "black")
```
to
Let's compute the average rating for user $u$ for those that have rated 100 or more movies:
```{r user-effect-hist}
train_set %>%
  group_by(userId) %>%
  filter(n()>=100) %>%
  summarize(b_u = mean(rating)) %>%
  ggplot(aes(b_u)) +
  geom_histogram(bins = 30, color = "black")
```
page 651 in pdf
first formula should not have 1/N
should be:
$$
\sum_{u,i} \left(y_{u,i} - \mu - b_i - b_u \right)^2 +
\lambda \left(\sum_{i} b_i^2 + \sum_{u} b_u^2\right)
$$
NEED TO FIND PAGE:
This package is intended for relatively simple tasks such as converging data into tables.
should be
This package is intended for relatively simple tasks such as converting data into tables.
In section 28.3.2 change
geom_smooth(span = 0.15, method.args = list(degree=1))
To
geom_smooth(method = "loess", span = 0.15, method.args = list(degree=1))
Section 27.4.5
Change
A cutoff of 65 makes more sense than 64. Furthermore, it balances the specificity and sensitivity of our confusion matrix:
To
A cutoff of `r best_cutoff` makes more sense than 64. Furthermore, it balances the specificity and sensitivity of our confusion matrix:
page 502 in pdf
Exercise 3 code needs to read in mnist, so it should be:
library(dslabs)
mnist <- read_mnist()
y <- mnist$train$labels
Page 626 in the pdf
The average distance formula has i,j in both terms and instead should have 1,j and 2,j:
$$\sqrt{ \frac{1}{2} \sum_{j=1}^2 (X_{i,j}-X_{i,j})^2 },$$
Should be
$$\sqrt{ \frac{1}{2} \sum_{j=1}^2 (X_{1,j}-X_{2,j})^2 },$$
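A toy check of the corrected formula (hypothetical 2-column matrix `X`, observations in rows):
```{r}
X <- matrix(c(1, 2, 4, 6), nrow = 2)  # rows are observations 1 and 2
sqrt(1/2 * sum((X[1, ] - X[2, ])^2))  # the corrected average distance
sqrt(mean((X[1, ] - X[2, ])^2))       # same value, since there are 2 columns
```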
page 659 in pdf
The m should be an n at the end of the first equation:
$$
r_{u,i} = p_{u,1} q_{1,i} + p_{u,2} q_{2,i} + \dots + p_{u,m} q_{m,i}
$$
should be
$$
r_{u,i} = p_{u,1} q_{1,i} + p_{u,2} q_{2,i} + \dots + p_{u,n} q_{n,i}
$$
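A small sketch (simulated residual matrix, not the book's data) showing the corrected formula holds when $p$ and $q$ come from the SVD:
```{r}
set.seed(1)
r <- matrix(rnorm(30), nrow = 5)  # toy 5-user by 6-movie residual matrix
s <- svd(r)
p <- s$u %*% diag(s$d)            # user factors p_{u,1}, ..., p_{u,n}
q <- t(s$v)                       # movie factors q_{1,i}, ..., q_{n,i}
max(abs(r - p %*% q))             # ~0: r_{u,i} = sum_k p_{u,k} q_{k,i}
```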
NEED to FIX JSON PART. Search for citi_bike
ALSO IN SPANISH VERSION
=====================================
Improvements (not errors)
Page 67
describe in the next
Should be
Describe next
Page 73, 101 in pdf
case_when(x < 0 ~ "Negative", x > 0 ~ "Positive", TRUE ~ "Zero")
Should be
case_when(x < 0 ~ "Negative",
          x > 0 ~ "Positive",
          TRUE ~ "Zero")
Page 81, 109 in pdf
it will overwrite existing files without warning.
Should be bolded:
__it will overwrite existing files without warning__.
page 442, 470 in pdf
paste0("http://www.pnas.org/content/suppl/2015/09/16/",
should be
paste0("https://www.pnas.org/content/suppl/2015/09/16/",