14 Strings.Rmd

---
title: "14 Strings.Rmd"
author: "Russ Conte"
date: "9/23/2018"
output: html_document
---

#14. Strings

## 14.1.1 Prerequisites

```{r}
library(tidyverse)
library(stringr)
```

##14.2 String Basics

Note - Hadley and Garrett recommend always using double quotes when quoting, not single quotes, unless quoting within a quote

```{r}
string1 <- "Four score and seven years ago"
string2 <- 'If I want to include a "quote" in a string, I\'ll use single quotes for the string and double quotes for the quote'
string1
string2
```

Note that the printed version of the string is *not* the same as the string itself. So see the string itself use writeLines(x)

```{r}
x <- c("\"", "\\")
x

writeLines(x) #interesting!
```

There are ways of writing non-English characters on all platforms. For example:

```{r}
x <- "\u00b5"
x #the Greek letter mu
```

Note that multiple strings are often stored as a character vector:

```{r}
c("First", "Second", "Third")
```

## 14.2.1 String Length

Note that we will be using stringr, and all functions start with str_.

```{r}
str_length(c("Now is the time", "for all good men", "to come to the aid of their country"))
```

## 14.2.2 Combining Strings

To combine two or more strings, use str_c

```{r}
str_c("a", "b", "c")
str_c("xyz", " abc")
```

We can use the sep argument to control how the strings are separated when printed out:

```{r}
str_c("abc", "xyz", sep=',')
```

If you want to print missing values as NA, use str_replace_na():

```{r}
x <-c("x", "y", NA)
str_c("|-",x,"-|")
str_c("|-",str_replace_na(x), "-|")
```

Note that str_c is vectorized, and (like other vectors in R) will automatically recycle shorter vectors to match longer vectors

```{r}
str_c("prefix-", c("a", "b", "c"), "-suffix")
```

To collapse a vector of strings into a single string, use collapse():

```{r}
str_c(c("x", "y", "z"), collapse=", ")
```

## 14.2.3, Subsetting strings

It's possible to subset strings using str_sub():

```{r}
x <- c("Apple", "Bannana", "Orange")
str_sub(x,start = 2,5)
```

Note that negative numbers count from the end going backwards:

```{r}
str_sub(x,-3-1)
```

If the string is too short:

```{r}
str_sub(x,3,24) #extract the 3rd through the 24th characters
```


14.2.5 Exercises

    In code that doesn’t use stringr, you’ll often see paste() and paste0(). What’s the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA?

    In your own words, describe the difference between the sep and collapse arguments to str_c().

    Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?

    What does str_wrap() do? When might you want to use it?

    What does str_trim() do? What’s the opposite of str_trim()?

    Write a function that turns (e.g.) a vector c("a", "b", "c") into the string a, b, and c. Think carefully about what it should do if given a vector of length 0, 1, or 2.

## 14.3.3 Regular Expressions

## 14.3.1 Basic matches using Regular Expressions

We'll use str_view() and str_view_all() to begin our regular expressions lessons

```{r}
x <- c("Apple", "Orange", "Pear", "Lemon", "Bannana")
str_view(x, "e")
```

The next step up in complexity is the . (period) that matches any character except a new line:

```{r}
x
str_view(x,".nn.")
```


14.3.1.1 Exercises

    Explain why each of these strings don’t match a \: "\", "\\", "\\\".

    How would you match the sequence "'\?

    What patterns will the regular expression \..\..\.. match? How would you represent it as a string?


## 14.3.2 Anchors

From the text:

ou can use:

    ^ to match the start of the string.
    $ to match the end of the string. (note the $ goes *after* the last letter in the regular expression)

```{r}
str_view(x, "^P")
```

```{r}
str_view(x, "e$")
```

From the text: "To force a regular expression to only match a complete string, anchor it with both ^ and $:"

```{r}
x <- c("Apple iPod", "Apple", "Apple Macintosh")
str_view(x,"^Apple$")
```


14.3.2.1 Exercises

    How would you match the literal string "$^$"?

    Given the corpus of common words in stringr::words, create regular expressions that find all words that:
        Start with “y”.
        End with “x”
        Are exactly three letters long. (Don’t cheat by using str_length()!)
        Have seven letters or more.

    Since this list is long, you might want to use the match argument to str_view() to show only the matching or non-matching words.

## 14.3.3 Character Classes and Alternatives

From the text:
There are four other useful tools:

    \d: matches any digit.
    \s: matches any whitespace (e.g. space, tab, newline).
    [abc]: matches a, b, or c.
    [^abc]: matches anything except a, b, or c.

```{r}
# Look for a literal character that normally has special meaning in a regex
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c") # will match the period
```

```{r}
str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c") #will match the asterisk
```

```{r}
str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]") # will match white space
```


14.3.3.1 Exercises

    Create regular expressions to find all words that:

        Start with a vowel.

        That only contain consonants. (Hint: thinking about matching “not”-vowels.)

        End with ed, but not with eed.

        End with ing or ise.

    Empirically verify the rule “i before e except after c”.

    Is “q” always followed by a “u”?

    Write a regular expression that matches a word if it’s probably written in British English, not American English.

    Create a regular expression that will match telephone numbers as commonly written in your country.