-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path11 Data Import with Readr.Rmd
337 lines (237 loc) · 8.72 KB
/
11 Data Import with Readr.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
---
title: "11 Data Import"
author: "Russ Conte"
date: "9/1/2018"
output: html_document
---
## 11.2 Getting Started
From the text: "When you run read_csv() it prints out a column specification that gives the name and type of each column. That’s an important part of readr, which we’ll come back to in parsing a file."
```{r}
library(tidyverse)
heights <- read_csv(file = 'https://raw.githubusercontent.com/hadley/r4ds/master/data/heights.csv') #yay!
heights
```
Note it is possible to read an inline CSV file:
```{r}
# note no comm at at the end of each ROW of data
read_csv("a,b,c
1,2,3
4,5,6")
```
Here is how to skip the top rows if they do not have data:
```{r}
read_csv("The first line of metadata
The second line of metadata
x,y,z
1,2,3", skip=2)
read_csv("do not import this line,
a,b,c
4,5,6", skip=1)
```
If the first row does NOT have column names, here is how to tell R to NOT treat the first row as column names. R will create column names.
```{r}
read_csv("x,y,z\n1,2,3",col_names = FALSE)
```
It's possible to assign column names, too. Here's how that is done:
```{r}
read_csv("x,y,z\n1,2,3",col_names = c("First", "Second", "Third"))
```
It' virtually guarantted that data will have NAs, here is one way to deal with NAs.
This solution treats the period as an NA:
```{r}
read_csv("a,b,c\n1,2,3\n4,na,6", na=".") # note first line is column names
```
Exercises
1. What funciton would you use to read a file where fields are separated with "|"?
read_delim and then set the delimiter to "|", thus:
```{r}
library(readr)
read_delim("1|2|3\nx|y|z",delim="|")
```
2. Apart from file, skip and comment, what other arguments to read_csv and read_tsv have in common?
read_csv(file, col_names = TRUE, col_types = NULL,
locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
quote = "\"", comment = "", trim_ws = TRUE, skip = 0, n_max = Inf,
guess_max = min(1000, n_max), progress = show_progress())
read_tsv(file, col_names = TRUE, col_types = NULL,
locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
quote = "\"", comment = "", trim_ws = TRUE, skip = 0, n_max = Inf,
guess_max = min(1000, n_max), progress = show_progress())
Common elements:
col_names=TRUE
col_types = NULL
locale = default_locale()
na=c("", "NA")
quoted_na = TRUE
quote = "\"
comment = ""
trim_ws=TRUE
skip = 0
n_max = inf
guess_max = min(1000, nmax)
progress = show_progress()
3. What are the most important arguments to read_fwf?
the file itself, columns widths/positions and types.
```{r}
cat(read_lines(fwf_sample))
read_fwf(fwf_sample, fwf_empty(fwf_sample, col_names = c("first", "last", "state", "ssn")))
```
4.Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like " or '. By convention, read_csv() assumes that the quoting character will be ", and if you want to change it you’ll need to use read_delim() instead. What arguments do you need to specify to read the following text into a data frame?
"x,y\n1,'a,b'"
Use the 'quote' function:
```{r}
read_csv("x,y\n1,'a,b'", quote = "'")
```
5. What is wrong with each of the following inline CSV files?
```{r}
read_csv("a,b,\n1,2,3\n4,5,6") # note that column 3 is not named, so R gives it the name X3
```
```{r}
read_csv("a,b,c\n1,2\n1,2,3,4") #unequal number of columns in the data as parsed results in NA
```
Another example with an NA.
## 11.3 parsing a vector
Note - parsing a vector converts into another type of vector, such as a logical, integer, string, etc.
Example: Convert string to logical
```{r}
str(parse_logical(c("TRUE", "FALSE", "NA")))
```
Convert string to interger
```{r}
str(parse_integer(c("1", "2", "3")))
```
Convert character to actual date format:
```{r}
str(parse_date(c("2018-09-02", "0100-12-25")))
```
parse interger:
```{r}
parse_integer(c("1","333",".","456"),na=".") # note the NA goes OUTSIDE of the data
```
If parsing failes, R generates an error:
```{r}
x <- parse_integer(c("1","2","abc","."), na=".")
```
It's possible to use problems() to address parsing issues:
```{r}
problems(x)
```
## 11.3.1 Parsing numbers
There exist three issues with numbers:
1. People in different parts of the world write the same numbers differently:
12,38 or 12.38, for example
```{r}
parse_double(c("12.38"))
parse_double(c("12,30")) # returns an error because R (my version) is coded for the USA.
```
2. Numbers are often surrounded by symbols:
$40.00, 45%, -$5.00
parse_number handles this quite easily:
```{r}
parse_number(c("$1000"))
parse_number(c("$345.78"))
parse_number(c("45%"))
parse_number(c("We had 78 people in attendance")) # wow! This is amazing! Wowsie!
```
3. Numbers often contain grouping characters that make them easier for humans to read, but more difficult for computers:
123,456,789
```{r}
parse_number(c("123,456,789"))
```
## 11.3.2 Strings
```{r}
x1 <- "El Ni\xf1o was particularly bad this year"
x2 <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
x1
x2
```
To fix these issues, put the encoding into the statement:
```{r}
parse_character(x1,locale = locale(encoding = "Latin1"))
parse_character(x2, locale = locale(encoding = "Shift-JIS"))
```
It's possible to have R guess the encoding, here's what that looks like:
```{r}
guess_encoding(charToRaw(x2)) # it guessed OI8-R, so let's try that one:
parse_character(x2, locale=locale(encoding = "KOI8-R"))
```
##11.3.3 Factors
```{r}
fruit <- c("Apple", "Orange", "Banana")
parse_factor(c("Apple", "Orange", "Banana"),levels = fruit)
```
## 11.3.4 Dates, date-time and times
This will parse date-time according to ISO 8601
```{r}
parse_datetime(c("2018-09-02T1310"))
```
To parse time, we will use the hms package:
```{r}
library(hms)
parse_time("15:45")
```
## 11.3.5 Exercises
1. What are the most important arguments to locale()?
```{r}
?locale
```
From the readme: "A locale object tries to capture all the defaults that can vary between countries." Thus I would conclude
the most important arguments are the ones that specify the country.
2. What happens if you try and set decimal_mark and grouping_mark to the same character?
```{r}
#parse_character(x1,locale = locale(encoding = "Latin1",decimal_mark = ".", grouping_mark = "."))
```
<font color="red">Error: `decimal_mark` and `grouping_mark` must be different</font>
What happens to the default value of grouping_mark when you set decimal_mark to “,”? What happens to the default value of decimal_mark when you set the grouping_mark to “.”?
3. I didn’t discuss the date_format and time_format options to locale(). What do they do? Construct an example that shows when they might be useful.
```{r}
x1=parse_date(c("2018-09-02"))
x1
parse_character(x1,locale = locale(encoding = "Latin1"))
parse_character(x1, locale = locale(encoding = "Shift-JIS"))
```
5. What's the difference between read_csv and read_csv2?
read_csv2() uses ; for separators, instead of ,. This is common in European countries which use , as the decimal separator.
6. What are the most common encodings used in Europe? What are the most common encodings used in Asia? Do some googling to find out.
From Wikipedia:
ISO 8859-1 Western Europe.
ISO 8859-2 Western and Central Europe.
ISO 8859-3 Western Europe and South European (Turkish, Maltese plus Esperanto)
ISO 8859-4 Western Europe and Baltic countries (Lithuania, Estonia, Latvia and Lapp)
ISO 8859-5 Cyrillic alphabet.
ISO 8859-6 Arabic.
ISO 8859-7 Greek.
Source: https://en.wikipedia.org/wiki/Character_encoding
7. Generate the correct format to parse each of the following correctly:
```{r}
d1 <- "January 1, 2010"
parse_datetime("January 1, 2010", "%B %d, %Y") # note MUST have comma in the format section for this to work
d2 <- "2015-Mar-07"
parse_datetime("2015-Mar-07", "%Y-%b-%d")
d3 <- "06-Jun-2017"
parse_datetime("06-jun-2017", "%d-%b-%Y")
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014
t1 <- "1705"
t2 <- "11:15:10.12 PM"
```
## 11.4.4 Parsing a file, (not just one line of a document)
From the text:
The heuristic tries each of the following types, stopping when it finds a match:
logical: contains only “F”, “T”, “FALSE”, or “TRUE”.
integer: contains only numeric characters (and -).
double: contains only valid doubles (including numbers like 4.5e-5).
number: contains valid doubles with the grouping mark inside.
time: matches the default time_format.
date: matches the default date_format.
date-time: any ISO8601 date.
If none of these rules apply, then the column will stay as a vector of strings.
##11.4.3 Other Challenges:
```{r}
challenge2 <- read_csv(readr_example("challenge.csv"), guess_max = 1001)
challenge2
```
## 11.5, writing to a file
```{r}
write_csv(challenge2, "challenge.csv")
```