---
title: "14 Strings.Rmd"
author: "Russ Conte"
date: "9/23/2018"
output: html_document
---
# 14 Strings
## 14.1.1 Prerequisites
```{r}
library(tidyverse)
library(stringr)
```
## 14.2 String Basics
Note: Hadley and Garrett recommend always using double quotes for strings, unless the string itself contains double quotes, in which case single quotes are the better choice.
```{r}
string1 <- "Four score and seven years ago"
string2 <- 'If I want to include a "quote" in a string, I\'ll use single quotes for the string and double quotes for the quote'
string1
string2
```
Note that the printed representation of a string is *not* the same as the string itself, because the printed version shows the escapes. To see the raw contents of the string, use writeLines(x):
```{r}
x <- c("\"", "\\")
x
writeLines(x) # shows the raw contents, without the escaping backslashes
```
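One way to check that the backslashes are only part of the printed representation is to count characters. This is a minimal sketch, assuming stringr is loaded; the variable names `escaped` and `escaped_lengths` are made up for illustration:

```{r}
library(stringr)

escaped <- c("\"", "\\")
# Each element holds exactly one character: a double quote and a backslash.
# The extra backslash seen when printing is only the escape.
escaped_lengths <- str_length(escaped)
escaped_lengths
```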
Unicode escapes let you write non-English characters in a way that works on all platforms. For example:
```{r}
x <- "\u00b5"
x # the micro sign (U+00B5), which looks like the Greek letter mu
```
Note that multiple strings are often stored as a character vector:
```{r}
c("First", "Second", "Third")
```
## 14.2.1 String Length
We will be using stringr, whose functions all start with str_. For example, str_length() returns the number of characters in each string:
```{r}
str_length(c("Now is the time", "for all good men", "to come to the aid of their country"))
```
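A few edge cases are worth seeing once. The sketch below assumes stringr is loaded, and `lengths_demo` is just an illustrative name:

```{r}
library(stringr)

# Spaces count as characters, an empty string has length 0, and NA stays NA
lengths_demo <- str_length(c("to come", "", NA))
lengths_demo
```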
## 14.2.2 Combining Strings
To combine two or more strings, use str_c():
```{r}
str_c("a", "b", "c")
str_c("xyz", " abc")
```
We can use the sep argument to control which separator is placed between the strings when they are combined:
```{r}
str_c("abc", "xyz", sep = ",")
```
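The separator goes between every pair of arguments, not only the first two. A small sketch, assuming stringr is loaded; `iso_date` is an illustrative name and the date pieces are arbitrary:

```{r}
library(stringr)

# sep is inserted between each of the three arguments
iso_date <- str_c("2018", "09", "23", sep = "-")
iso_date
```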
Missing values are contagious: combining anything with NA gives NA. If you want missing values to appear as the string "NA" instead, use str_replace_na():
```{r}
x <- c("x", "y", NA)
str_c("|-", x, "-|")
str_c("|-", str_replace_na(x), "-|")
```
Note that str_c() is vectorized and, like most vectorized functions in R, automatically recycles shorter vectors to the length of the longest:
```{r}
str_c("prefix-", c("a", "b", "c"), "-suffix")
```
To collapse a vector of strings into a single string, use the collapse argument:
```{r}
str_c(c("x", "y", "z"), collapse=", ")
```
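The two arguments can also be used together: sep joins the inputs element by element, and collapse then joins the resulting vector into one string. A sketch, assuming stringr is loaded; `pairs` is an illustrative name:

```{r}
library(stringr)

# Step 1 (sep): "x=1", "y=2", "z=3"
# Step 2 (collapse): the three pieces are joined with ", "
pairs <- str_c(c("x", "y", "z"), c("1", "2", "3"), sep = "=", collapse = ", ")
pairs
```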
## 14.2.3 Subsetting Strings
It's possible to subset strings using str_sub():
```{r}
x <- c("Apple", "Banana", "Orange")
str_sub(x, start = 2, end = 5) # characters 2 through 5, inclusive
```
Negative numbers count backwards from the end of the string:
```{r}
str_sub(x, -3, -1) # the last three characters
```
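Negative positions work for start and end alike, and leaving end out means "through the end of the string". A minimal sketch, assuming stringr is loaded; the vector `fruit_pair` and the result names are made up for illustration:

```{r}
library(stringr)

fruit_pair <- c("Apple", "Orange")
# Third-from-last through last character
last_three <- str_sub(fruit_pair, -3, -1)
# Fourth-from-last through the end (end defaults to the last character)
last_four <- str_sub(fruit_pair, -4)
last_three
last_four
```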
If the string is too short, str_sub() does not fail; it returns as much as it can:
```{r}
str_sub(x, 3, 24) # ask for the 3rd through the 24th characters
```
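str_sub() also has an assignment form, which modifies part of each string in place. The sketch below assumes stringr is loaded; `fruit_names` is an illustrative vector, and str_to_lower() is the stringr function for lower-casing:

```{r}
library(stringr)

fruit_names <- c("Apple", "Banana", "Orange")
# Replace the first character of each string with its lower-case version
str_sub(fruit_names, 1, 1) <- str_to_lower(str_sub(fruit_names, 1, 1))
fruit_names
```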
## 14.2.5 Exercises
1. In code that doesn’t use stringr, you’ll often see paste() and paste0(). What’s the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA?
2. In your own words, describe the difference between the sep and collapse arguments to str_c().
3. Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?
4. What does str_wrap() do? When might you want to use it?
5. What does str_trim() do? What’s the opposite of str_trim()?
6. Write a function that turns (e.g.) a vector c("a", "b", "c") into the string a, b, and c. Think carefully about what it should do if given a vector of length 0, 1, or 2.
## 14.3 Regular Expressions
## 14.3.1 Basic matches using Regular Expressions
We'll use str_view() and str_view_all() to see how a regular expression matches each string:
```{r}
x <- c("Apple", "Orange", "Pear", "Lemon", "Banana")
str_view(x, "e")
```
The next step up in complexity is the . (period), which matches any character except a newline:
```{r}
x
str_view(x, ".a.") # an "a" with any one character on each side
```
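Because . matches any character, matching a literal period needs an escape, and the backslash itself has to be escaped inside an R string. A minimal sketch, assuming stringr is loaded; the vector `candidates` is made up for illustration, and str_detect() (which returns TRUE or FALSE per string) is used here instead of str_view() so the result is easy to check:

```{r}
library(stringr)

candidates <- c("abc", "a.c", "bef")
# The R string "a\\.c" is the regex a\.c, which matches a literal period
has_literal_dot <- str_detect(candidates, "a\\.c")
# An unescaped . matches any character, so "abc" matches as well
has_any_char <- str_detect(candidates, "a.c")
has_literal_dot
has_any_char
```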
## 14.3.1.1 Exercises
1. Explain why each of these strings don’t match a `\`: `"\"`, `"\\"`, `"\\\"`.
2. How would you match the sequence `"'\`?
3. What patterns will the regular expression `\..\..\..` match? How would you represent it as a string?
## 14.3.2 Anchors
From the text:

You can use:

- `^` to match the start of the string.
- `$` to match the end of the string. (Note that the `$` goes *after* the last letter in the regular expression.)
```{r}
str_view(x, "^P")
```
```{r}
str_view(x, "e$")
```
From the text: "To force a regular expression to only match a complete string, anchor it with both ^ and $:"
```{r}
x <- c("Apple iPod", "Apple", "Apple Macintosh")
str_view(x,"^Apple$")
```
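The effect of each anchor can also be checked programmatically. This is a sketch, assuming stringr is loaded; `products` is an illustrative copy of the vector above, and str_detect() returns TRUE for each string the pattern matches:

```{r}
library(stringr)

products <- c("Apple iPod", "Apple", "Apple Macintosh")
# ^ alone: every string starts with "Apple"
starts_with_apple <- str_detect(products, "^Apple")
# ^ and $ together: only the string that is exactly "Apple"
exactly_apple <- str_detect(products, "^Apple$")
# $ alone: only "Apple Macintosh" ends with "h"
ends_with_h <- str_detect(products, "h$")
starts_with_apple
exactly_apple
ends_with_h
```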
## 14.3.2.1 Exercises
1. How would you match the literal string `"$^$"`?
2. Given the corpus of common words in stringr::words, create regular expressions that find all words that:
    1. Start with “y”.
    2. End with “x”.
    3. Are exactly three letters long. (Don’t cheat by using str_length()!)
    4. Have seven letters or more.

    Since this list is long, you might want to use the match argument to str_view() to show only the matching or non-matching words.
## 14.3.3 Character Classes and Alternatives
From the text:

There are four other useful tools:

- `\d`: matches any digit.
- `\s`: matches any whitespace (e.g. space, tab, newline).
- `[abc]`: matches a, b, or c.
- `[^abc]`: matches anything except a, b, or c.
```{r}
# Look for a literal character that normally has special meaning in a regex
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c") # will match the period
```
```{r}
str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c") # matches any character, then a literal asterisk, then "c"
```
```{r}
str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]") # matches "a" followed by a literal space (not other whitespace)
```
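The four tools can be tried out with str_detect(). Note that inside an R string the backslash must be doubled, so the regex \d is written "\\d". This is a minimal sketch, assuming stringr is loaded; the vectors `samples` and `letters_demo` are made up for illustration:

```{r}
library(stringr)

samples <- c("room 101", "no-digits", "tab\there")
# \d: only the first string contains a digit
has_digit <- str_detect(samples, "\\d")
# \s: a space and a tab both count as whitespace
has_space <- str_detect(samples, "\\s")

letters_demo <- c("cab", "abcd", "xyz")
# [abc]: at least one of a, b, or c
has_abc <- str_detect(letters_demo, "[abc]")
# [^abc]: at least one character that is not a, b, or c
has_other <- str_detect(letters_demo, "[^abc]")
has_digit
has_space
has_abc
has_other
```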
## 14.3.3.1 Exercises
1. Create regular expressions to find all words that:
    1. Start with a vowel.
    2. That only contain consonants. (Hint: thinking about matching “not”-vowels.)
    3. End with ed, but not with eed.
    4. End with ing or ise.
2. Empirically verify the rule “i before e except after c”.
3. Is “q” always followed by a “u”?
4. Write a regular expression that matches a word if it’s probably written in British English, not American English.
5. Create a regular expression that will match telephone numbers as commonly written in your country.