-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path7 Exploratory Data Analysis.Rmd
581 lines (410 loc) · 16.5 KB
/
7 Exploratory Data Analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
---
title: "7 Exploratory Data Analysis"
author: "Russ Conte"
date: "8/19/2018"
output: html_document
---
Hadley and Garrett state that in Exploratory Data Analysis (EDA) you:
Generate questions about your data.
Search for answers by visualising, transforming, and modelling your data.
Use what you learn to refine your questions and/or generate new questions.
##7.1.1 Prerequisites for EDA:
```{r}
library(tidyverse)
```
From the text: Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to <i><b>use questions as tools to guide your investigation</i></b>. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.
Two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:
What type of <i>variation</i> occurs <i>within</i> my variables?
What type of <i>covariation<i/> occurs <i>between</i> my variables?
##7.3 Variation
From the text (this is another fantastic truth in the text!): "The best way to understand that pattern is to visualise the distribution of the variable’s values."
##7.3.1 Visualizing distributions
An example of the distribution of a categorical variable:
```{r}
ggplot(data=diamonds) +
geom_bar(mapping=aes(x=cut))
```
Note the height of the bars = count by type of cut. These values can be shown directly:
```{r}
diamonds %>%
count(cut)
```
One way to look at <i>continuous</i> values is via a histogram:
```{r}
diamonds
ggplot(data=diamonds) +
geom_histogram(mapping = aes(x=price),binwidth = 10)
```
Interesting idea: Let's look at count of diamonds with carat<3:
```{r}
smaller <- filter(diamonds, carat<3)
bigger <- filter(diamonds, carat>=3)
ggplot(data=smaller)+
geom_histogram(mapping = aes(x=carat), binwidth = 0.01)
ggplot(data=bigger)+
geom_histogram(mapping = aes(x=carat), binwidth = 0.001)
```
One way to put multiple plots on the same graph is to use freqploy:
```{r}
smaller
ggplot(data=smaller, mapping=aes(x=carat, color=cut))+
geom_freqpoly(binwidth=0.1)
ggplot(data=smaller, mapping=aes(x=carat, color=color))+
geom_freqpoly(binwidth=0.1)
ggplot(data=smaller, mapping=aes(x=carat, color=price))+
geom_freqpoly(binwidth=0.1)
```
Another great line from Hadley and Garrett:
"The key to asking good follow-up questions will be to rely on your curiosity (What do you want to learn more about?) as well as your skepticism (How could this be misleading?)."
##7.3.2 Typical Values
As an example of typical questions, look at this histogram:
```{r}
ggplot(data=smaller) +
geom_histogram(mapping = aes(x=carat),binwidth = 0.01)
```
The authors ask a few good questions, here are some answers:
Why are there more diamonds at whole carats and common fractions of carats?
I don't know, it is a testable question that diamonds in this set were cut to those specifications, but that is a hypothesis not a fact
Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?
Possibly because those diamonds sell for a higher cost than diamonds slightly below whole value of carats. That is a hypothesis not a fact
Why are there no diamonds bigger than 3 carats?
Because the data set removed all diamonds >3 carats.
Let's practice wit data of eruptions of Old Faithful in Yellowstone:
```{r}
faithful
ggplot(data=faithful) +
geom_histogram(mapping = aes(x=eruptions),binwidth = 0.25)
```
##7.3.3 Unusual values (outliers)
Notice there are no visible values >10, but the x-axis extends out to 60. There are some values >10, but not many. Let's investigate!
```{r}
ggplot(diamonds) +
geom_histogram(mapping = aes(x=y), binwidth = 0.5)
```
```{r}
ggplot(diamonds)+
geom_histogram(mapping = aes(x=y), binwidth = 0.5)+
coord_cartesian(ylim=c(0,50))
```
Let's use dplyr to find the three values that are outliers:
```{r}
unusual <- diamonds %>%
filter(y<3 | y>20) %>%
select(price, x,y,z) %>%
arrange(y)
unusual
```
More great advice from Hadley and Garrett:
"It’s good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results, and you can’t figure out why they’re there, it’s reasonable to replace them with missing values, and move on. However, if they have a substantial effect on your results, you shouldn’t drop them without justification. You’ll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up."
##7.3.4 Exercises
Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.
```{r}
ggplot(data=diamonds)+
geom_histogram(mapping = aes(x=x),binwidth = 0.25)
ggplot(data=diamonds)+
geom_histogram(mapping = aes(x=y),binwidth = 0.25)
ggplot(data=diamonds)+
geom_histogram(mapping = aes(x=z),binwidth = 0.25)
```
Note: The x and y values range from 0 through 60, z values range from 0 to approximately 30.
2. Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
```{r}
ggplot(data=diamonds) +
geom_histogram(mapping = aes(x=price), binwidth = 1)
ggplot(data=diamonds) +
geom_histogram(mapping = aes(x=price), binwidth = 10)
ggplot(data=diamonds) +
geom_histogram(mapping = aes(x=price), binwidth = 100)
ggplot(data=diamonds) +
geom_histogram(mapping = aes(x=price), binwidth = 1000)
ggplot(data=diamonds) +
geom_histogram(mapping = aes(x=price), binwidth = 10000)
```
The number of diamonds goes down as price goes up. The vast majority of diamonds are <$10,000, a few are >$15,000
3. How many diamonds are 0.99 carats? How many are 1 carat? What do you think accounts for this discrepancy?
```{r}
diamonds %>%
count(carat==0.99) #23 diamonds are 0.99 carat
diamonds %>%
count(carat==1.00) #1558 diamonds are 1.00 carat
```
My hypothesis is that diamonds that are 1.0 carat sell easier than diamonds that are 0.99 carats.
4. Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?
```{r}
ggplot(diamonds)+
geom_histogram(mapping = aes(x=y), binwidth = 0.5)+
coord_cartesian(ylim=c(0,12000))
ggplot(diamonds)+
geom_histogram(mapping = aes(x=y), binwidth = 0.5) +
xlim(0,100)
ggplot(diamonds)+
geom_histogram(mapping = aes(x=y)) +
xlim(0,100)
```
## 7.4 Missing Values
One way to analyze unusual values is the replace them with NA, here we replace y<3 or y>20 with NA:
```{r}
diamonds2 <- diamonds %>%
mutate(y=ifelse(y<3 | y>20, NA, y))
```
Note that ggplot (like R) will report on missing values:
```{r}
ggplot(data=diamonds2,mapping = aes(x=x, y=y)) +
geom_point()
```
If we want to supress that warning, use na.rm=TRUE:
```{r}
ggplot(data=diamonds2, mapping = aes(x=x, y=y)) +
geom_point(na.rm = TRUE)
```
Let's look at what makes missing values different from observed values, using the NYCFlights13 data set:
```{r}
nycflights13::flights %>%
mutate(
cancelled=is.na(dep_time),
sched_hour=sched_dep_time%/%100,
sched_min=sched_dep_time%%100,
sched_dep_time=sched_hour+sched_min/60
) %>%
ggplot(mapping = aes(sched_dep_time)) +
geom_freqpoly(mapping = aes(color=cancelled), binwidth=0.25)
```
The authors say (and I totally agree) this plot does not tell us a lot, mainly because there are few missing flights compared to
scheduled flights. They write that they will address in a future section of the book.
## 7.4.1 Exercises
1. What happens to missing values in a histogram? Barchart? Why is there a difference?
```{r}
miss=data.frame(a1=c(1:10,NA,NA), b=c(1:12))
miss
ggplot(miss) +
geom_histogram(mapping = aes(x=a1)) #histogram removes the rows with missing data
ggplot(miss) +
geom_bar(mapping = aes(x=a1)) # bar also removes one row containing missing data
```
2. What does na.rm = TRUE do in mean() and sum()?
```{r}
miss
mean(miss$a1) # with NA it returns NA
mean(miss$a1,na.rm = TRUE) #it returns the value (5.5)
sum(miss$a1) #returns NA
sum(miss$a1, na.rm=TRUE) #returns the sum of 55
```
## 7.5 Covariation
Covariation describes variables that vary together (such as height and weight in humans). It's not always easy to see:
```{r}
ggplot(diamonds, mapping = aes(x=price)) +
geom_freqpoly(mapping = aes(color=cut), binwidth=500)
```
The differences are difficult to see because the counts differ so much:
```{r}
ggplot(diamonds,aes(x=cut)) +
geom_bar()
```
What we'll do is standardize the values, so the area under each curve=1.
```{r}
ggplot(diamonds, mapping = aes(x=price, y=..density..)) +
geom_freqpoly(mapping = aes(color=cut), binwidth=500)
```
let's look at another way to visualize the data - a boxplot.
```{r}
ggplot(diamonds, aes(x=cut, y=price)) +
geom_boxplot(mapping = aes(color=cut))
```
Another example is to look at how highway mileage varies across class of car:
```{r}
ggplot(mpg, aes(x=class, y=hwy)) +
geom_boxplot(mapping = aes(color=class))
```
Let's reorder the data to make it easier to understand:
```{r}
ggplot(mpg) +
geom_boxplot(mapping = aes(x=reorder(class, hwy, FUN=mean), y=hwy))
```
## 7.5.1 Exercises
1. Use what you’ve learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.
```{r}
library(nycflights13)
nycflights13::flights %>%
mutate(
cancelled=is.na(dep_time),
sched_hour=sched_dep_time%/%100,
sched_min=sched_dep_time%%100,
sched_dep_time=sched_hour+sched_min/60
) %>%
ggplot(mapping = aes(x=cancelled, y=sched_dep_time)) +
geom_boxplot()
```
** note** the on-time flights have an ealier departure time EXCEPT for one outlier under TRUE that is close to 0
2. What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?
```{r}
diamonds.lm=lm(price~., data=diamonds)
summary(diamonds.lm) #Carat has highest beta with price (11,256.978)
#How is that variable correlated with cut?
cor(diamonds$price, diamonds$carat) # the correlation is 0.9215913, quite high
#Why does the combination of those two relationships lead to lower quality diamonds being more expensive?
#Because all of the quality variables in the linear model have a strongly negative coefficient
```
3. Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?
```{r}
library(ggstance)
library(ggplot2)
# using ggplot2:
ggplot(diamonds, aes(price, carat, fill=factor(cut))) +
geom_boxplot() +
coord_flip()
#using ggstance:
ggplot(diamonds, aes(price, carat, fill = factor(cut))) +
geom_boxploth()
```
Note that the prior boxplots have a lot of outlier points!
From the text: "One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots?"
```{r}
diamonds
library(ggplot2)
library(lvplot)
p <- ggplot(diamonds, aes(x=cut, y=price))
p + geom_lv(aes(fill=..LV..))
```
These plots are stacked, unlike the former boxplots, and decrease in size as the price goes up.
5. Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?
```{r}
ggplot(diamonds, aes(price, carat, fill=factor(cut))) +
geom_violin()
```
```{r}
diamonds
library(ggplot2)
ggplot(diamonds, aes(x=price)) +
geom_histogram() +
facet_wrap(~color)
ggplot(diamonds, aes(x=price)) +
geom_freqpoly() +
facet_wrap(~color)
```
I find the faceted histogram to be the easiest to understand and interpret.
Let's look a ggbeeswarm:
```{r}
diamonds
library(tidyverse)
library(ggbeeswarm)
ggplot(diamonds, aes(x=cut, y=price)) +
geom_beeswarm()
```
## 7.5.2 Two categorical variables
We can visualize covariation between two categorical variables. One way is to use the built in count function:
```{r}
ggplot(data=diamonds) +
geom_count(mapping = aes(x=cut, y=color))
```
We can also use dplyr to calculate the count:
```{r}
library(tidyverse)
diamonds %>%
count(cut,color)
```
Now we can visualize the results with ggplot :)
```{r}
diamonds
library(tidyverse)
diamonds %>%
count(cut,color) %>%
ggplot(mapping=aes(x=cut, y=color)) +
geom_tile(mapping = aes(fill=n))
```
## 7.5.3 Two Continuous Variables
One way to see the relationship between two continuous variables is with a scatterplot:
```{r}
library(tidyverse)
ggplot(diamonds, aes(x=carat, y=price)) +
geom_point()
```
Let's see how this looks using the package 'hexbin':
```{r}
library(tidyverse)
ggplot(smaller) +
geom_bin2d(aes(x=carat, y=price))
```
```{r}
library(hexbin)
ggplot(smaller) +
geom_hex(aes(x=carat, y=price))
```
Another very interesting example is to bin a continuous variable (such as carat), and then use the tools
to plot categorical variables. The first soluation puts approximately the same value of carate, the width is independent of the number of observations (note 'cut_width'):
```{r}
ggplot(smaller, aes(x=carat, y=price)) +
geom_boxplot(aes(group=cut_width(carat, 0.1)))
```
To plot depending on the number of observations (note 'cut_number'):
```{r}
library(tidyverse)
ggplot(smaller, aes(x=carat, y=price)) +
geom_boxplot(aes(group=cut_number(carat, 20)))
```
## 7.5.3.1 Exercises:
1. Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price?
cut_number puts the same number of observations into each bin, cut_width makes each bin the same size.
2. Visualize carat, partitioned by price.
```{r}
library(tidyverse)
ggplot(diamonds, aes(x=carat, y=price)) +
geom_boxplot(aes(group=cut_number(price, 10)))
```
3. How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?
```{r}
very_large <- filter(diamonds, carat>2)
small <- filter(diamonds, carat<1)
library(tidyverse)
ggplot(very_large, aes(x=carat, y=price)) +
geom_boxplot()
ggplot(small, aes(x=carat, y=price)) +
geom_boxplot(aes(group=cut_number(price, 10)))
```
Analysis: There are fewer large diamonds than small diamonds. The price distribution of very large diamonds is
much narrower than small diamonds. The few large diamonds are in one boxplot, but the many more smaller diamonds
are in 10 boxplots.
4. Combine two of the techniques you’ve learned to visualise the combined distribution of cut, carat, and price.
```{r}
library(tidyverse)
ggplot(diamonds, aes(x=cut, y=carat)) +
geom_boxplot(aes(group=cut_number(price, 10)))
```
5. Why is a scatterplot better than a binned plot for this case:
ggplot(data = diamonds) +
geom_point(mapping = aes(x = x, y = y)) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
```{r}
ggplot(data = diamonds) +
geom_point(mapping = aes(x = x, y = y)) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
```
Because the binned plots would obscure some of the points
## 7.6 Patterns and Models
From the text: "Once you’ve removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive."
```{r}
library(tidyverse)
library(modelr)
mod <-lm(log(price)~log(carat), data=diamonds)
diamonds2 <- diamonds %>%
add_residuals(mod) %>%
mutate(resid=exp(resid))
ggplot(diamonds2) +
geom_point(aes(x=carat, y=resid))
```
From the text: "Once you’ve removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive."
```{r}
ggplot(diamonds2) +
geom_boxplot(aes(x=cut, y=resid))
```
## 7.7 gggplot2 calls
It's possible to condense a lot of the data in ggplot 2 calls:
```{r}
ggplot(data=faithful, mapping=aes(x=eruptions)) +
geom_freqpoly(binwidth=0.25)
```
is the same as (note the condensed code):
```{r}
ggplot(faithful, aes(x=eruptions)) +
geom_freqpoly(binwidth=0.25)
```
This is a test.