-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathlive_session_answers.Rmd
287 lines (165 loc) · 7.49 KB
/
live_session_answers.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
---
title: 'Live Session: working with JSON and jsonlite'
---
########################################################################################
# 1. PREPARING YOUR WORKSPACE
```{r packages_install, eval=FALSE, include=FALSE}
# run these to install the required packages for this session
install.packages("jsonlite")
install.packages("listviewer")
# if you don't have tidyverse installed, also run
install.packages("tidyverse")
```
```{r, message=FALSE}
# load the packages
library(tidyverse)
library(jsonlite) # make sure to load this **after** loading tidyverse
library(listviewer)
```
########################################################################################
# 2. IMPORTING THE JSON DATA
We will be using some json files that should be found under the "data/" folder in this repository.
```{r}
# inspect existing files in folder "data/"
list.files("data/")
```
Load the data file with information on 500 businesses on the Yelp platform:
```{r}
yelp_data <- fromJSON("data/yelp_500_businesses.json",
simplifyVector = F,
encoding = "utf8")
# if you get an encoding error, run the code below
# options(encoding = "utf8")
# getOption("encoding")
```
########################################################################################
# 3. VIEWING THE CONTENTS OF THE JSON FILE
A quick but non-intuitive way of seeing the data is by using the `str()` function with `max.level` set to a pre-defined value to avoid printing out very long outputs.
```{r}
str(yelp_data[[1]], max.level = 1)
```
A more interactive and intuitive way of looking at the structure of a JSON file is by using the `listviewer::jsonedit()` function.
It allows you to interactively dig deeper into the individual elements present in the JSON data.
```{r}
listviewer::jsonedit(yelp_data, height = "800px", mode = "view")
```
########################################################################################
# 4. TRANSFORMING JSON DATA INTO TIDY TABLES
## 4.1 The art and craft of rectangling
As explained in this very helpful tidyr article (https://tidyr.tidyverse.org/articles/rectangle.html), "rectangling is the art and craft of taking a deeply nested list (often sourced from wild caught JSON or XML) and taming it into a tidy data set of rows and columns".
The most important functions for rectangling are:
- `unnest_longer()` takes each element of a list-column and makes a new **row**
- `unnest_wider()` takes each element of a list-column and makes a new **column**
- `hoist()` is similar to `unnest_wider()` but only plucks out selected components, and can reach down multiple levels
## 4.2 Using `hoist()` to select desired columns
First, let's convert the JSON data to a tibble.
```{r}
# "____" will be the column name
yelp_tibble <- tibble(business_info = yelp_data)
yelp_tibble
```
The we need to select some specific elements within the nested lists. The function `hoist()` is very handy here as it allows us to pluck out only the desired elements within (deeply) nested lists.
```{r}
yelp_df <- yelp_tibble %>%
hoist(business_info, # unused sub-lists remain as last column in data frame
"name",
"city",
"neighborhood" = list("neighborhoods", 1), # selects first element in "____" list, similar to [["____"]][[1]]
"hours"
)
yelp_df
```
## 4.3 Using `unnest_longer()` and `unnest_wider()` to unnest lists
Let's take a look at the `hours` column.
```{r}
str(yelp_df[['hours']][[1]])
```
First, we want to have a column named `day` with all available days of the week for each business.
```{r}
yelp_df1 <- yelp_df %>%
unnest_longer(hours) %>%
rename("day" = hours_id)
yelp_df1
```
Let's take a look at the `hours` column again to see what happened.
```{r}
str(yelp_df1[['hours']][[1]])
```
Now that we have new rows with each day of the week, we want to put the information on the closing and opening time on a column each.
```{r}
yelp_df2 <- yelp_df1 %>%
unnest_wider(hours)
yelp_df2
```
# 5. EXPORTING TO A .JSON FILE
First, let's select the desired columns from the dataframe.
```{r}
yelp_df_export <- yelp_df2 %>%
select("name",
"city",
"day",
"open",
"close"
) %>%
head(5)
yelp_df_export
```
Now let's convert it to a JSON element.
```{r}
yelp_json <- toJSON(yelp_df_export,
pretty=T,
auto_unbox=T) # see EXTRAS below for more info on what 'unboxing" does
yelp_json
```
Finally, let's write it to a file.
```{r}
write(yelp_json, "data/my_json_file.json")
```
See if it worked:
```{r}
# inspect existing files in folder "data/"
list.files("data/")
```
########################################################################################
# **6. EXTRAS**
## 6.1 Further readings:
Rectangling: https://tidyr.tidyverse.org/articles/rectangle.html
Step-by-step tutorials on parsing JSON data with jsonlite and tidyverse:
- https://themockup.blog/posts/2020-05-22-parsing-json-in-r-with-jsonlite/
- https://hendrikvanb.gitlab.io/2018/07/nested_data-json_to_tibble/
- https://robotwealth.com/how-to-wrangle-json-data-in-r-with-jsonlite-purr-and-dplyr/
Documentation on jsonlite:
- Functions: https://www.rdocumentation.org/packages/jsonlite/versions/1.7.2
- CRAN: https://cran.r-project.org/web/packages/jsonlite/index.html
## 6.2 Working with atomic vectors / data frames of length 1
The jsonlite function `unbox()` prevents atomic vector of length 1 or data frames with only 1 row turning into an `array` when encoded into JSON.
For example, it can be beneficial to set `auto_unbox = TRUE` in the `toJSON()` function to make importing the data by others easier, as it prevents especially vectors of length 1 from being unnecessarily read as lists when being imported with `fromJSON`. This means users will not have to actively unlist all these elements of length 1.
```{r}
# create dataframe with one row only
x <- iris[1,]
# create two JSON strings, one with and one without unnbox()
x_boxed <- toJSON(list(rec=x))
x_unboxed <-toJSON(list(rec=unbox(x)))
# see how the boxed version looks like and how it gets imported with fromJSON()
x_boxed
fromJSON((x_boxed), simplifyVector = F)
# see how the unboxed version looks like and how it gets imported with fromJSON()
x_unboxed
fromJSON((x_unboxed), simplifyVector = F)
```
## 6.3 Flattening nested dataframes
If you do not turn off simplification by using `simplifyVector = FALSE` when importing JSON data, you can do something similar to `unnest()` by using jsonlite's `flatten()` (https://www.rdocumentation.org/packages/jsonlite/versions/1.7.2/topics/flatten).
The difference is that `flatten()` specifies the names of the parent data frames in the flattened columns. This allows better tracking of the original structure of the JSON file.
This can be useful if you want to quickly have all the data as columns (especially with `recursive = TRUE`) to then select only the desired columns.
```{r}
# create a data frame
x <- data.frame(driver = c("Bowser", "Peach"), occupation = c("Koopa", "Princess"), stringsAsFactors=FALSE)
x$vehicle <- data.frame(model = c("Piranha Prowler", "Royal Racer"), stringsAsFactors=FALSE)
x$vehicle$stats <- data.frame(speed = c(55, 34), weight = c(67, 24), drift = c(35, 32))
# explore the difference between unnest() and flatten()
x # original data frame
unnest(x) # unnests first level
flatten(x, recursive = FALSE) # flattens only first level
flatten(x) # flattens all levels
```
########################################################################################