-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path26th_and_California_example_visualizations.Rmd
498 lines (355 loc) · 13.3 KB
/
26th_and_California_example_visualizations.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
# Examples of data.table and visualizations
### by Gene Leynes
This content was originally developed for the meetup "Workshop: Develop a Data App using Raphael, D3, HTML and Backbone (Week 3)", on January 28, 2013, which was held at 1871 which is located in Chicago's Merchandise Mart. http://www.meetup.com/The-Chicago-Data-Visualization-Group/events/97690792/
This content continues to be a useful example for demonstrating the some simple data.table examples.
The main point of this report is to show:
- Markdown examples
- data.table aggregation examples
- some ggplot and base R plot examples
This report may also rely on my personal library, `geneorama` which can be downloaded from github using the `devtools` library (which is available on CRAN using `install.packages`). To install geneorama using devtools run: `devtools::install_github("geneorama/geneorama")`
If you run into problems running this report, please check sessionInfo (shown in the HTML after initialization) to see if you're using compatible package versions.
## INITIALIZATION
```{r, message=FALSE, results='hide'}
## Remove all objects; perform garbage collection
rm(list=ls())
gc(reset=TRUE)
## Load libraries
geneorama::detach_nonstandard_packages()
geneorama::loadinstall_libraries(c("geneorama", "knitr", "mgcv", "ggplot2",
"reshape2"))
geneorama::sourceDir("functions/")
```
### sessionInfo
```{r}
sessionInfo()
```
Set knitr options tidy=FALSE globally, to retain indenting.
```{r, cache=FALSE}
opts_chunk$set(tidy=FALSE)
```
## IMPORT DATA
Read in the csv file and look at the structure:
```{r}
rawdat = read.table(
file = 'data/database_2013-01-21_8zQ4cW7T.csv', sep=',', quote='"',
flush=FALSE, header=TRUE, nrows=-1, fill=FALSE, stringsAsFactors=FALSE,
na.strings=c('None', ''))
str(rawdat)
head(rawdat)
```
By converting to data.table we can alaredy reap some benefits, we no longer have to use the `head`, we can no simply print `dat` and veiew the head and tail.
```{r}
dat <- as.data.table(rawdat)
```
### Fix the dates
We need to convert the dates from character to date objects, here we'll use data.table's `IDate` function, which is nice because it doesn't keep track of (and get confused by) timezone information. It also strips off the time, which is fine here because it's not meaningful. We could keep the time by using `as.POSIXct` or in a separate column using `ITime`, but that would require a tedious conversion (see `?strptime` for specifics).
Note the `:=` data.table assignmet:
```{r, results='hide'}
## Convert booking and discharge dates to date time objects
## EXAMPLE FORMAT: 2012-12-30T20:57:19.616186
dat[ , booking_date := as.IDate(booking_date)]
dat[ , discharge_date_earliest := as.IDate(discharge_date_earliest)]
```
## Examples of grouping with data.table
```{r}
## Summary by bail amount the old way:
## (Don't forget the useNA argument!!)
table(dat$bail_amount)
table(dat$bail_amount, useNA='ifany') # almost forgot
## Summary by bail amount Data Table
dat[ , .N, by = bail_amount]
dat[ , .N, keyby = bail_amount]
## Summary by race
dat[ , .N, by=race]
```
## Examples of subsetting with data.table
This is probably the most confusing thing at first
```{r}
## WRONG WAY:
dat[ , 3]
dat[ , 'race']
## RIGHT WAY:
dat[1:10 , race]
dat[1:10 , 3, with = F]
## Indexing works differently
dat[1]
## DF:
df = as.data.frame(dat)
# df[1]
df[1,]
```
## Examples of grouping and aggregating with data.table
This is very fast, and very useful for generating any kind of summary statistics.
```{r}
## Grouping is simple
dat[ , mean(age_at_booking), by=race]
## you can name your grouped results with lists
dat[i = TRUE,
j = list(mean = mean(age_at_booking),
sd = sd(age_at_booking)),
by = race]
```
## Examples of grouping and aggregating with data.table
But look at what happens here!
Sometimes things happen that you may not expect. It's good, but possibly a surprise. By including `age_at_booking` without any aggregating function, it automatically expands the data.table result
```{r}
dat[i = TRUE,
j = list(mean = mean(age_at_booking),
sd = sd(age_at_booking),
age_at_booking),
by = race]
```
### Complex query example
```{r, tidy=FALSE}
mysummary = dat[i = !is.na(charges) &
!is.na(booking_date) &
!is.na(bail_amount),
j = list(count = .N,
coverage = diff(range(booking_date)),
bailave = mean(bail_amount),
bailsd = sd(bail_amount),
bailmin = min(bail_amount),
bailmax = max(bail_amount)),
by = list(race,gender,
age_at_booking)]
mysummary
## Uncomment to open summary as csv file, in Excel probably
# wtf(mysummary)
```
### Check Categories
Let's use the `NAsummary` from geneorama to dig into the structure of dat:
```{r}
NAsummary(dat)
```
Let's look at the tables of categorical values (with smallish unique values)
```{r}
table(dat$race)
table(dat$gender)
table(dat$bail_status)
```
What's going on with `bail_status`? There are a couple of numeric entries (probbaly data entry errors), but there are also no NA values, and table results don't add up to the total number of rows (`r comma(nrow(dat))`).
The problem is that we didn't use the fussy `useNA` argument in table
```{r}
table(dat$bail_status, useNA = 'ifany')
```
This is another place where data.table can help. The `.N` in data.table gives a count, returns it as a useful data.table rather than a table (which can have some tricky properties), and if you use multiple `by` variables data.table will normalize (in the database sense) your data for you automatically. Oh, and it includes all the current categories, including NA.
```{r}
dat[ , .N, bail_status]
```
#### Clean up race variable
Issues with race can get thorny, fortunately we just need to fix some miscoded variables here.
Recall the `race` tablulation;
```{r, echo=FALSE}
dat[ , .N, race]
```
Let's convert W to WH and B to BK, these are probably just miscoded.
```{r, results='hide'}
## This data.frame syntax would still work:
# dat$race = sub('^W$', 'WH', dat$race)
# dat$race = sub('^B$', 'BK', dat$race)
## However, this is the same thing using data.table
dat[ , race := sub('^W$', 'WH', race)]
dat[ , race := sub('^B$', 'BK', race)]
```
That's more managable:
```{r, echo=FALSE}
dat[ , .N, race]
```
## SIMPLE DATA EXPLORATION
```{r}
## The summary of charges is too big to print:
dat[ , .N, charges]
## So, let's look at the top 20 charges
dat[ , .N, charges][order(-N)][1:20]
```
This code `dat[ , .N, charges][order(-N)][1:20]`
Is roughly equivalant to this code, but the data.frame way takes a lot more steps and requires a temporary variable:
```{r}
df <- as.data.frame(dat)
mytable <- table(df$charges)
mytable <- as.data.frame(mytable)
mytable <- mytable[order(-mytable$Freq), ]
mytable[1:20, ]
rm(df, mytable)
```
#### Decomposing the "chained" data.table command
At first this code `dat[ , .N, charges][order(-N)][1:20]` is harder to read, but it's easy to interpret in steps:
The first part creates the summary. This command basically "give me the row counts, grouped by charges":
* `dat[ , .N, charges]`
The next part says, take those results and order them by `-N`:
* `[order(-N)]`
Then, from those results just return the top twenty values:
* `[1:20]`
<br>
Let's do the same thing for citations:
```{r}
## Just the top values of charges_citation
dat[ , .N, charges_citation][order(-N)][1:20]
```
Notice that the number of NA values is much lower.
You can also if the citations / charges line up in any particularly useful way, by grouping by two variables:
```{r}
dat[ , .N, list(charges_citation, charges)][order(-N)][1:20]
```
Good luck with finding those connections...
(In a real analysis you may find it helpful to write your current version of dat to a csv file and open it in the default application, which is often Excel)
```{r}
## Write Temp File using geneorama package's function
## Uncomment to run or run manually at the console
# wtf(dat)
```
## GRAPHICAL ANALYSIS
Tables and aggregates are great, but let's check out some graphs.
```{r, fig.width=15}
## Check the range of the two date values
range(dat$discharge_date_earliest, na.rm=TRUE)
range(dat$booking_date, na.rm=TRUE)
## Check number of NA's in each date
dat[ , .N, is.na(discharge_date_earliest)]
dat[ , .N, is.na(booking_date)]
## Does the NA in discharge have anything to do with the age of the record?
dat[i = TRUE,
j = .N,
keyby = list(missing_discharge_date = is.na(discharge_date_earliest),
booking_year = year(booking_date))]
## Does the NA in discharge have anything to do with the age of the record?
dat[i = TRUE,
j = .N,
keyby = list(discharge_year = year(discharge_date_earliest))]
## One last check...
## What is the difference between discharge date and booking date?
date_diff_table <- dat[
i = !is.na(discharge_date_earliest),
j = list(day_diff = as.integer(discharge_date_earliest-booking_date),
booking_date)]
date_diff_table
## intersting, let's plot that...
plot(day_diff ~ booking_date, date_diff_table)
## Some basic histograms
hist(dat$bail_amount, 100)
hist(pmin(dat$bail_amount, 1000000), 100)
hist(dat$age_at_booking, 100)
## Some basic plots
plot(dat$age_at_booking, dat$bail_amount, main='Age by bail amount')
plot(log(bail_amount) ~ age_at_booking, dat, main='Age by log(bail amount)')
boxplot(bail_amount~age_at_booking, dat, main='age by bail amount (much better)')
boxplot(pmin(bail_amount, 5e5)~age_at_booking, dat,
main='age by bail amount\n(changing the limits)')
boxplot(log(bail_amount)~age_at_booking, dat, main='age by log(bail amount)')
```
### Is there a relationship between age and bail amount?
Make a linear model
```{r}
bailbyage_lm <- lm(log(bail_amount) ~ age_at_booking, data=dat)
```
Plot the fit
```{r}
plot(log(bail_amount) ~ age_at_booking, dat, main='Age by log(bail amount)')
lines(fitted(bailbyage_lm) ~
dat[!is.na(bail_amount) & !is.na(age_at_booking), age_at_booking],
col = "blue",
lwd=3)
```
Check the model diagnostics...
```{r, fig.width=12, fig.height=12}
par(mfrow=c(2,2))
plot(bailbyage_lm)
par(mfrow=c(1,1))
```
Q: How can the lm result do all that?
A: It has _lots_ of parts.
Looking at the structure gives us some idea of what's inside:
```{r}
str(bailbyage_lm)
```
### Is there a relationship between age and bail amount?
Let's try a model that can perform smoothing.
```{r}
library(mgcv)
bailbyage_gam <- gam(log(bail_amount) ~ s(age_at_booking), data=dat)
plot(bailbyage_gam, main='fitted age by bail amount')
```
```{r}
# Plot of fitted on the data:
plot(log(bail_amount) ~ age_at_booking,
data = dat,
main='fitted age by bail amount\n(another way)')
points(fitted(bailbyage_gam) ~
dat[!is.na(bail_amount) & !is.na(age_at_booking), age_at_booking],
col = "blue",
pch = 16)
```
## SOME GGPLOTS
```{r, fig.width=15, fig.height=15}
ggplot(dat, aes(x=age_at_booking, fill=factor(race))) +
geom_density(alpha=.5) +
facet_grid(race ~ .)+
xlab("Age at booking") +
ylab("Density by race") +
theme(plot.title = element_text(size = 20)) +
labs(title='Age at booking summary\nby race\n')
ggplot(dat, aes(x=bail_amount, fill=factor(race))) +
geom_density(alpha=.5) +
facet_grid(race ~ .) +
xlab("Age at booking") +
ylab("bail amount by race") +
theme(plot.title = element_text(size = 20)) +
labs(title='Bond by race summary\n')
ggplot(dat, aes(x=age_at_booking, fill=factor(gender))) +
geom_density(alpha=.7) +
facet_grid(gender ~ .)+
xlab("Age at booking") +
ylab("Density by race") +
theme(plot.title = element_text(size = 20)) +
labs(title='Age at booking summary\nby gender\n')
ggplot(dat, aes(x=age_at_booking, fill=factor(gender))) +
geom_density(alpha=.7) +
facet_grid(race ~ .)+
xlab("Age at booking") +
ylab("Density by race") +
theme(plot.title = element_text(size = 20)) +
labs(title='Age at booking summary\nby gender and race\n')
ggplot(dat, aes(x=bail_amount, fill=factor(gender))) +
geom_density() +
facet_grid(gender ~ .)+
xlab("Age at booking") +
ylab("bail amount by race") +
theme(plot.title = element_text(size = 20)) +
labs(title='Bond by race summary\n')
```
### Fancy ggplot example
Fancy ggplot example (with made up data, the jail data is hard to use in this example)
```{r, fig.width=15, fig.height=15}
N = 1e3
temp = data.table(
loc = sample(letters, N, replace=T),
year = rep(1999:2003, length.out=N),
mag = rpois(N, lambda=5),
x = 1:N,
matrix(rnorm(N*10), N, 10))
temp
temp_melted <- melt(
temp[loc %in% letters[1:15]],
id.vars=c("loc", "year", "mag"),
measure.vars = c("V1", "V2", "V7"))
temp_melted = as.data.table(temp_melted)
temp_melted
mycolors = c(
'red','red','darkred','darkred','darkred','red','red',
'goldenrod','yellow','goldenrod',
'green','green','darkgreen','darkgreen','darkgreen','green','green')
ggplot(temp_melted[value>.53 | value < .47],
aes(x=as.numeric(year), y=value, size=mag, shape=loc,
colour=value)) +
geom_point() +
xlab("My Custom X Label (LOG SCALED)") +
ylab("My Custom Y Label") +
theme(plot.title = element_text(size = 20)) +
labs(title='My Amazing Plot \n (note the newline --->)\n') +
scale_colour_gradientn(colours = mycolors) +
scale_x_log10() +
theme(panel.background = element_rect(fill = "gray60", colour = "black")) +
theme(panel.grid.major = element_line(colour = "gray40")) +
theme(panel.grid.minor = element_line(colour = "gray70", linetype = "dotted"))
rm(temp, temp_melted, N)
```