diff --git a/_posts/2023-11-30-class-3-data-wrangling-with-the-tidyverse/class-2.html b/_posts/2023-11-30-class-3-data-wrangling-with-the-tidyverse/class-2.html deleted file mode 100644 index 0a0ebe2..0000000 --- a/_posts/2023-11-30-class-3-data-wrangling-with-the-tidyverse/class-2.html +++ /dev/null @@ -1,2347 +0,0 @@ - - - - -
-The Rmarkdown for this class is on GitHub.
-The tidyverse is a collection of packages that share a similar design philosophy, syntax, and data structures. The packages are largely developed by the same team that builds RStudio.
-Some key packages that we will touch on in this course:
-ggplot2: plotting based on the “grammar of graphics”
-dplyr: functions to manipulate tabular data
-tidyr: functions to help reshape data into a tidy format
-readr: functions for data import and export
-stringr: functions for working with strings
-tibble: a redesigned data.frame
To use an R package in an analysis we need to load the package using the library() function. This needs to be done once in each R session, and it is a good idea to do it at the beginning of your Rmarkdown. For teaching purposes, however, I will sometimes load a package at the point where I introduce one of its functions.
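As a minimal sketch of that pattern, the chunk below loads the packages used in this class at the top of a document. It assumes the packages are already installed, and the object name tbl is only for illustration.

```r
# Load the packages once, at the top of the Rmarkdown.
# Assumes they were installed beforehand, e.g. install.packages("tidyverse").
library(readr)
library(dplyr)
library(tibble)
library(stringr)

# A function can also be called without attaching its package
# by using the package::function() form.
tbl <- tibble::tibble(x = 1:3)
tbl
```

Loading everything up front makes the dependencies of the document visible in one place.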
A tibble is a reimagining of the base R data.frame. It differs from the data.frame in a few ways; the biggest differences are that it doesn’t have row.names and it has an enhanced print method. If interested in learning more, see the tibble vignette.
Compare data to data_tbl.
Note: by default RStudio displays data.frames in a tibble-like format.
-data <- data.frame(a = 1:3,
- b = letters[1:3],
- c = Sys.Date() - 1:3,
- row.names = c("a", "b", "c"))
-data_tbl <- as_tibble(data)
-data_tbl
-When you work with tidyverse functions it is a good practice to convert data.frames to tibbles.
-If a data.frame has rownames, you can preserve them by moving them into a column before converting to a tibble, using rownames_to_column() from tibble.
mtcars # built in dataset, a data.frame with information about vehicles
- mpg cyl disp hp drat wt qsec vs am gear carb
-Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
-Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
-Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
-Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
-Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
-Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
-Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
-Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
-Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
-Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
-Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
-Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
-Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
-Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
-Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
-Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
-Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
-Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
-Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
-Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
-Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
-Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
-AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
-Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
-Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
-Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
-Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
-Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
-Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
-Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
-Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
-Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
-mtcars_tbl <- rownames_to_column(mtcars, "vehicle")
-mtcars_tbl <- as_tibble(mtcars_tbl)
-mtcars_tbl
-# A tibble: 32 × 12
- vehicle mpg cyl disp hp drat wt qsec vs am gear carb
- <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
- 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
- 2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
- 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
- 4 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
- 5 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
- 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
- 7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
- 8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
- 9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
-10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
-# … with 22 more rows
-If you don’t need the rownames, then you can use the as_tibble() function directly.
mtcars_tbl <- as_tibble(mtcars)
-View() can be used to open an Excel-like view of a data.frame. This is a good way to quickly look at the data. glimpse() and str() give additional views of the data.
View(mtcars)
-glimpse(mtcars)
-str(mtcars)
Additional R functions to help with exploring data.frames (and tibbles):
-Useful base R functions for exploring values
-mtcars$gear # extract gear column data as a vector
-mtcars[, "gear"] # extract gear column data as a vector
-mtcars[["gear"]] # extract gear column data as a vector
-
-summary(mtcars$gear) # get summary stats on column
-
-unique(mtcars$cyl) # find unique values in column cyl
-length(mtcars$cyl) # length of values in a vector
-
-table(mtcars$cyl) # get frequency of each value in column cyl
-table(mtcars$gear, mtcars$cyl) # get frequency of each combination of values
-The readr package provides a series of functions for importing or writing data in common text formats.
read_csv(): comma-separated values (CSV) files
-read_tsv(): tab-separated values (TSV) files
-read_delim(): delimited files (CSV and TSV are important special cases)
-read_fwf(): fixed-width files
-read_table(): whitespace-separated files
These functions are faster and have better defaults than the base R equivalents (e.g. read.table). These functions also directly output tibbles compatible with the tidyverse.
The readr cheatsheet provides a concise overview of the functionality in the package.
-To illustrate how to use readr we will load a .csv file containing information about flights from 2014.
First we will download the data. You could download the file manually from GitHub; instead, we will use the base R function download.file() to download the dataset.
if(!file.exists("flights14.csv")) {
- url <- "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
- download.file(url, "flights14.csv")
-}
-You should now have a file called “flights14.csv” in your working directory (the same directory as the Rmarkdown). To read this data into R, we can use the read_csv() function. The defaults for this function work for many datasets.
flights <- read_csv("flights14.csv")
-flights
-# A tibble: 253,316 × 11
- year month day dep_de…¹ arr_d…² carrier origin dest air_t…³ dista…⁴ hour
- <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
- 1 2014 1 1 14 13 AA JFK LAX 359 2475 9
- 2 2014 1 1 -3 13 AA JFK LAX 363 2475 11
- 3 2014 1 1 2 9 AA JFK LAX 351 2475 19
- 4 2014 1 1 -8 -26 AA LGA PBI 157 1035 7
- 5 2014 1 1 2 1 AA JFK LAX 350 2475 13
- 6 2014 1 1 4 0 AA EWR LAX 339 2454 18
- 7 2014 1 1 -2 -18 AA JFK LAX 338 2475 21
- 8 2014 1 1 -3 -14 AA JFK LAX 356 2475 15
- 9 2014 1 1 -1 -17 AA JFK MIA 161 1089 15
-10 2014 1 1 -2 -14 AA JFK SEA 349 2422 18
-# … with 253,306 more rows, and abbreviated variable names ¹dep_delay,
-# ²arr_delay, ³air_time, ⁴distance
-There are a few commonly used arguments:
-col_names: if the data doesn’t have column names, you can provide them (or skip them).
col_types: set this if the data type of a column is incorrectly inferred by readr.
comment: if the file has comment lines you want to skip, such as a header line prefixed with #, set this to #.
skip: number of lines to skip before reading in the data.
n_max: maximum number of lines to read, useful for testing reading in large datasets.
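The arguments listed above can be sketched on a small in-memory example. The file contents here are hypothetical, and wrapping the string in I() (which assumes readr 2.0 or later) tells readr to treat it as literal data rather than a path.

```r
library(readr)

# Hypothetical file contents: three records without a header row,
# preceded by one comment line.
clean_text <- "JFK,LAX,359\nLGA,PBI,157\nJFK,MIA,161\n"
raw_text <- paste0("# exported by a hypothetical tool\n", clean_text)

small <- read_csv(
  I(raw_text),
  comment = "#",                               # drop lines starting with '#'
  col_names = c("origin", "dest", "air_time"), # the data has no header row
  col_types = cols(
    origin = col_character(),
    dest = col_character(),
    air_time = col_integer()                   # override the default double
  )
)
small

# n_max limits how many records are read, which is handy for a quick look
first_two <- read_csv(
  I(clean_text),
  col_names = c("origin", "dest", "air_time"),
  n_max = 2,
  show_col_types = FALSE
)
first_two
```

Setting col_types explicitly also silences the column specification message that readr prints when it has to guess.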
The readr functions will also automatically uncompress gzipped or zipped datasets, and additionally can read data directly from a URL.
-read_csv("https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv")
There are equivalent functions for writing data from R to files:
-write_csv, write_tsv, write_delim.
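A short round trip illustrates the writing functions. This is a sketch using a hypothetical two-row table (routes) and a temporary file, so nothing is written to the working directory.

```r
library(readr)
library(tibble)

# Hypothetical example table; any data.frame or tibble works.
routes <- tibble(origin = c("JFK", "LGA"), dest = c("LAX", "PBI"))

# Write to a temporary file so the working directory is left untouched.
out_file <- tempfile(fileext = ".csv")
write_csv(routes, out_file)

# Read the file back in to confirm the round trip.
routes_again <- read_csv(out_file, show_col_types = FALSE)
routes_again
```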
The readxl package can read data from Excel files. It is installed with the tidyverse but is not attached by library(tidyverse), so it needs its own library(readxl) call. The read_excel() function is the main function for reading data.
-The openxlsx package, which is not part of the tidyverse but is on CRAN, can write Excel files. The write.xlsx() function is the main function for writing data to Excel spreadsheets.
Often it is useful to store R objects on disk. These could be large processed datasets, intermediate results, or complex data structures that are not easily stored in rectangular text formats.
-R provides the saveRDS() and readRDS() functions for writing a single R object to a binary file and reading it back.
saveRDS(flights, "flights.rds") # save single object into a file
-df <- readRDS("flights.rds") # read object back into R
-df
-# A tibble: 253,316 × 11
- year month day dep_de…¹ arr_d…² carrier origin dest air_t…³ dista…⁴ hour
- <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
- 1 2014 1 1 14 13 AA JFK LAX 359 2475 9
- 2 2014 1 1 -3 13 AA JFK LAX 363 2475 11
- 3 2014 1 1 2 9 AA JFK LAX 351 2475 19
- 4 2014 1 1 -8 -26 AA LGA PBI 157 1035 7
- 5 2014 1 1 2 1 AA JFK LAX 350 2475 13
- 6 2014 1 1 4 0 AA EWR LAX 339 2454 18
- 7 2014 1 1 -2 -18 AA JFK LAX 338 2475 21
- 8 2014 1 1 -3 -14 AA JFK LAX 356 2475 15
- 9 2014 1 1 -1 -17 AA JFK MIA 161 1089 15
-10 2014 1 1 -2 -14 AA JFK SEA 349 2422 18
-# … with 253,306 more rows, and abbreviated variable names ¹dep_delay,
-# ²arr_delay, ³air_time, ⁴distance
-If you want to save/load multiple objects you can use save() and load().
save(flights, df, file = "robjs.rda") # save flights and df
-load() will load the data into the environment with the same object names used when saving the objects.
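The name-restoring behaviour of load() can be sketched with two small objects. The object names (cities, counts) are hypothetical and a temporary file is used.

```r
# Two small objects with hypothetical names
cities <- c("Denver", "Boulder")
counts <- c(a = 1, b = 2)

rda_file <- tempfile(fileext = ".rda")
save(cities, counts, file = rda_file)

# Remove the objects, then restore them. load() recreates them under
# their original names and invisibly returns those names.
rm(cities, counts)
restored <- load(rda_file)
restored
cities
```

Because load() silently overwrites existing objects with the same names, readRDS() is often the safer choice when only one object is needed.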
dplyr provides a suite of functions for manipulating data in tibbles.
Rows:
-- filter() chooses rows based on column values
-- slice() chooses rows based on location
-- arrange() changes the order of the rows
-- distinct() selects distinct/unique rows
Columns:
-- select() changes whether or not a column is included
-- rename() changes the name of columns
-- mutate() changes the values of columns and creates new columns
Groups of rows:
-- summarise() collapses a group into a single row
The magrittr package provides the pipe operator %>%. This operator allows you to pass data from one function to another. The pipe takes data from the left-hand operation and passes it to the first argument of the right-hand operation. x %>% f(y) is equivalent to f(x, y). There is now also a pipe operator in base R (|>), which is becoming more widely used.
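The equivalence can be checked directly. The sketch below uses a small hypothetical vector and compares a nested call with the magrittr pipe and the base R pipe (the latter assumes R 4.1 or later).

```r
library(magrittr)  # provides %>%; dplyr also re-exports it

x <- c(4, 9, 16)

# These three forms are equivalent: the left-hand side becomes
# the first argument of the function on the right.
nested       <- round(sqrt(x), 1)
via_magrittr <- x %>% sqrt() %>% round(1)
via_base     <- x |> sqrt() |> round(1)   # base R pipe, requires R >= 4.1

identical(nested, via_magrittr)
identical(nested, via_base)
```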
The pipe allows complex operations to be conducted without having many intermediate variables. Chaining multiple dplyr commands is a very powerful and readable way to express an analysis.
-nrow(flights)
-[1] 253316
-flights %>% nrow()
-[1] 253316
-flights %>% nrow(x = .) # the `.` is a placeholder for the data moving through the pipe and is implied
-[1] 253316
-flights %>% colnames() %>% sort()
- [1] "air_time" "arr_delay" "carrier" "day" "dep_delay" "dest"
- [7] "distance" "hour" "month" "origin" "year"
-# you still need to assign the output if you want to use it later
-number_of_rows <- flights %>% nrow()
-number_of_rows
-[1] 253316
-Returning to our flights data, let’s use filter() to select certain rows.
filter(tibble, conditional_expression, ...)
filter(flights, dest == "LAX") # select rows where the `dest` column is equal to `LAX`
-# A tibble: 14,434 × 11
- year month day dep_de…¹ arr_d…² carrier origin dest air_t…³ dista…⁴ hour
- <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
- 1 2014 1 1 14 13 AA JFK LAX 359 2475 9
- 2 2014 1 1 -3 13 AA JFK LAX 363 2475 11
- 3 2014 1 1 2 9 AA JFK LAX 351 2475 19
- 4 2014 1 1 2 1 AA JFK LAX 350 2475 13
- 5 2014 1 1 4 0 AA EWR LAX 339 2454 18
- 6 2014 1 1 -2 -18 AA JFK LAX 338 2475 21
- 7 2014 1 1 -3 -14 AA JFK LAX 356 2475 15
- 8 2014 1 1 142 133 AA JFK LAX 345 2475 19
- 9 2014 1 1 -4 11 B6 JFK LAX 349 2475 9
-10 2014 1 1 3 -10 B6 JFK LAX 349 2475 16
-# … with 14,424 more rows, and abbreviated variable names ¹dep_delay,
-# ²arr_delay, ³air_time, ⁴distance
-Multiple conditions can be used to select rows. For example, we can select rows where the dest column is equal to LAX and the origin is equal to EWR. You can either use the & operator, or supply multiple arguments.
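Both forms can be sketched on a miniature table. toy_flights is a hypothetical stand-in for flights with only a few rows and columns, used so the chunk runs on its own.

```r
library(dplyr)
library(tibble)

# Hypothetical miniature version of the flights table
toy_flights <- tribble(
  ~origin, ~dest, ~dep_delay,
  "EWR",   "LAX",  4,
  "JFK",   "LAX", 14,
  "EWR",   "DEN",  2,
  "LGA",   "PBI", -8
)

# Both forms keep only rows where every condition is TRUE
with_and  <- filter(toy_flights, dest == "LAX" & origin == "EWR")
with_args <- filter(toy_flights, dest == "LAX", origin == "EWR")

all.equal(with_and, with_args)
with_and
```

The same calls should work unchanged on the full flights tibble.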
We can select rows where the dest column is equal to LAX or the origin is equal to EWR using the | operator.
filter(flights, dest == "LAX" | origin == "EWR")
-The %in% operator is useful for identifying rows with entries matching those in a vector of possibilities.
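A minimal sketch of %in% inside filter(), again on a hypothetical toy table; the vector west_coast is an illustrative name, not part of the dataset.

```r
library(dplyr)
library(tibble)

# Hypothetical miniature version of the flights table
toy_flights <- tribble(
  ~origin, ~dest,
  "JFK",   "LAX",
  "JFK",   "SEA",
  "LGA",   "PBI",
  "EWR",   "DEN"
)

west_coast <- c("LAX", "SEA", "SFO")

# Keep rows whose dest matches any entry in the vector
to_west <- filter(toy_flights, dest %in% west_coast)
to_west

# Negate the test to keep everything else
not_west <- filter(toy_flights, !dest %in% west_coast)
not_west
```

This is shorter than chaining several == comparisons with |.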
Try it out:
-Use filter() to find flights to DEN with a delayed departure (dep_delay).
-filter(flights, dest == "DEN", dep_delay > 0)
-# A tibble: 3,060 × 11
- year month day dep_de…¹ arr_d…² carrier origin dest air_t…³ dista…⁴ hour
- <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
- 1 2014 1 1 45 37 B6 JFK DEN 237 1626 22
- 2 2014 1 1 6 -13 DL JFK DEN 235 1626 20
- 3 2014 1 1 13 16 DL LGA DEN 242 1620 18
- 4 2014 1 1 35 47 F9 LGA DEN 246 1620 18
- 5 2014 1 1 2 19 WN EWR DEN 259 1605 12
- 6 2014 1 1 17 60 WN LGA DEN 245 1620 17
- 7 2014 1 1 3 12 WN LGA DEN 260 1620 11
- 8 2014 1 1 10 3 UA EWR DEN 224 1605 17
- 9 2014 1 1 46 43 UA LGA DEN 235 1620 18
-10 2014 1 1 22 8 UA EWR DEN 237 1605 9
-# … with 3,050 more rows, and abbreviated variable names ¹dep_delay,
-# ²arr_delay, ³air_time, ⁴distance
-arrange() can be used to sort the data based on values in a single or multiple columns.
arrange(tibble, <columns_to_sort_by>)
For example, let’s find the flight with the shortest amount of air time by arranging the table based on the air_time column (flight time in minutes).
arrange(flights, air_time)
-# A tibble: 253,316 × 11
- year month day dep_de…¹ arr_d…² carrier origin dest air_t…³ dista…⁴ hour
- <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
- 1 2014 2 21 46 40 EV EWR BDL 20 116 9
- 2 2014 6 20 -6 -2 US LGA BOS 20 184 14
- 3 2014 1 16 -3 -12 EV EWR BDL 21 116 11
- 4 2014 1 16 10 14 EV EWR BDL 21 116 8
- 5 2014 2 19 19 0 EV EWR BDL 21 116 8
- 6 2014 2 26 38 20 EV EWR BDL 21 116 23
- 7 2014 3 4 17 -4 EV EWR BDL 21 116 22
- 8 2014 6 5 105 93 EV EWR BDL 21 116 14
- 9 2014 6 5 16 4 EV EWR BDL 21 116 22
-10 2014 6 26 19 13 EV EWR BDL 21 116 13
-# … with 253,306 more rows, and abbreviated variable names ¹dep_delay,
-# ²arr_delay, ³air_time, ⁴distance
-Try it out:
-Use arrange() to rank the data by flight distance (distance), rank in ascending order. What flight has the shortest distance?
-# A tibble: 1 × 11
- year month day dep_delay arr_d…¹ carrier origin dest air_t…² dista…³ hour
- <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
-1 2014 1 30 9 17 US EWR PHL 46 80 15
-# … with abbreviated variable names ¹arr_delay, ²air_time, ³distance
-select() is a simple function that subsets the tibble to keep certain columns.
select(tibble, <columns_to_keep>)
select(flights, origin, dest)
-# A tibble: 253,316 × 2
- origin dest
- <chr> <chr>
- 1 JFK LAX
- 2 JFK LAX
- 3 JFK LAX
- 4 LGA PBI
- 5 JFK LAX
- 6 EWR LAX
- 7 JFK LAX
- 8 JFK LAX
- 9 JFK MIA
-10 JFK SEA
-# … with 253,306 more rows
-The : operator can select a range of columns, such as the columns from air_time to hour. The ! operator selects columns not listed.
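Both operators can be sketched on a hypothetical two-row table that reuses some of the flights column names.

```r
library(dplyr)
library(tibble)

# Hypothetical miniature version of the flights table
toy_flights <- tribble(
  ~year, ~origin, ~dest, ~air_time, ~distance, ~hour,
  2014,  "JFK",   "LAX", 359,       2475,      9,
  2014,  "LGA",   "PBI", 157,       1035,      7
)

# A range of adjacent columns, from air_time through hour
range_cols <- select(toy_flights, air_time:hour)
names(range_cols)

# Everything except the listed columns
without_cols <- select(toy_flights, !c(year, hour))
names(without_cols)
```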
There is a suite of utilities in the tidyverse to help with selecting columns based on conditions: matches(), starts_with(), ends_with(), contains(), any_of(), and all_of(). everything() is also useful as a placeholder for all columns not explicitly listed. See help ?select.
In general, when working with the tidyverse, you don’t need to quote the names of columns. A string passed to a helper such as contains() does need quotes, because a pattern like “delay” is not itself a column name in the flights tibble.
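A few of these helpers are sketched here on a hypothetical toy table. Note that the patterns are quoted strings, while bare column names such as hour are not.

```r
library(dplyr)
library(tibble)

# Hypothetical miniature version of the flights table
toy_flights <- tribble(
  ~dep_delay, ~arr_delay, ~origin, ~dest, ~hour,
  14,         13,         "JFK",   "LAX", 9,
  -8,         -26,        "LGA",   "PBI", 7
)

delay_cols <- select(toy_flights, contains("delay"))   # name contains "delay"
dep_cols   <- select(toy_flights, starts_with("dep"))  # name starts with "dep"
hour_first <- select(toy_flights, hour, everything())  # move hour to the front

# any_of() ignores names that are not present instead of raising an error;
# "tail_number" is a made-up name used to show this.
some_cols  <- select(toy_flights, any_of(c("origin", "tail_number")))

names(delay_cols)
names(hour_first)
names(some_cols)
```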
-mutate() allows you to add new columns to the tibble.
mutate(tibble, new_column_name = expression, ...)
mutate(flights, total_delay = dep_delay + arr_delay)
-# A tibble: 253,316 × 12
- year month day dep_de…¹ arr_d…² carrier origin dest air_t…³ dista…⁴ hour
- <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
- 1 2014 1 1 14 13 AA JFK LAX 359 2475 9
- 2 2014 1 1 -3 13 AA JFK LAX 363 2475 11
- 3 2014 1 1 2 9 AA JFK LAX 351 2475 19
- 4 2014 1 1 -8 -26 AA LGA PBI 157 1035 7
- 5 2014 1 1 2 1 AA JFK LAX 350 2475 13
- 6 2014 1 1 4 0 AA EWR LAX 339 2454 18
- 7 2014 1 1 -2 -18 AA JFK LAX 338 2475 21
- 8 2014 1 1 -3 -14 AA JFK LAX 356 2475 15
- 9 2014 1 1 -1 -17 AA JFK MIA 161 1089 15
-10 2014 1 1 -2 -14 AA JFK SEA 349 2422 18
-# … with 253,306 more rows, 1 more variable: total_delay <dbl>, and abbreviated
-# variable names ¹dep_delay, ²arr_delay, ³air_time, ⁴distance
-We can’t see the new column, so we add a select command to examine the columns of interest.
-mutate(flights, total_delay = dep_delay + arr_delay) %>%
- select(dep_delay, arr_delay, total_delay)
-# A tibble: 253,316 × 3
- dep_delay arr_delay total_delay
- <dbl> <dbl> <dbl>
- 1 14 13 27
- 2 -3 13 10
- 3 2 9 11
- 4 -8 -26 -34
- 5 2 1 3
- 6 4 0 4
- 7 -2 -18 -20
- 8 -3 -14 -17
- 9 -1 -17 -18
-10 -2 -14 -16
-# … with 253,306 more rows
-Multiple new columns can be made, and you can refer to columns made in preceding statements.
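As a sketch of that behaviour on a hypothetical toy table, the second new column below is computed from the first within the same mutate() call. The column name total_delay_hours is illustrative.

```r
library(dplyr)
library(tibble)

# Hypothetical miniature version of the flights table
toy_flights <- tribble(
  ~dep_delay, ~arr_delay,
  14,         13,
  -3,         13
)

with_delays <- mutate(
  toy_flights,
  total_delay = dep_delay + arr_delay,
  total_delay_hours = total_delay / 60   # uses the column created just above
)
with_delays
```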
-Try it out:
-Calculate the flight time (air_time) in hours rather than in minutes, add as a new column.
-mutate(flights, flight_time = air_time / 60)
-# A tibble: 253,316 × 12
- year month day dep_de…¹ arr_d…² carrier origin dest air_t…³ dista…⁴ hour
- <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
- 1 2014 1 1 14 13 AA JFK LAX 359 2475 9
- 2 2014 1 1 -3 13 AA JFK LAX 363 2475 11
- 3 2014 1 1 2 9 AA JFK LAX 351 2475 19
- 4 2014 1 1 -8 -26 AA LGA PBI 157 1035 7
- 5 2014 1 1 2 1 AA JFK LAX 350 2475 13
- 6 2014 1 1 4 0 AA EWR LAX 339 2454 18
- 7 2014 1 1 -2 -18 AA JFK LAX 338 2475 21
- 8 2014 1 1 -3 -14 AA JFK LAX 356 2475 15
- 9 2014 1 1 -1 -17 AA JFK MIA 161 1089 15
-10 2014 1 1 -2 -14 AA JFK SEA 349 2422 18
-# … with 253,306 more rows, 1 more variable: flight_time <dbl>, and abbreviated
-# variable names ¹dep_delay, ²arr_delay, ³air_time, ⁴distance
-summarize() is a function that will collapse the data from a column into a summary value based on a function that takes a vector and returns a single value (e.g. mean(), sum(), median()). It is not very useful yet, but will be very powerful when we discuss grouped operations.
# A tibble: 1 × 2
- avg_arr_delay med_air_time
- <dbl> <dbl>
-1 8.15 134
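A summary table with the shape shown above could come from a call like the following. This is a sketch on a hypothetical toy table, and the expressions are inferred from the output's column names rather than taken from the original code.

```r
library(dplyr)
library(tibble)

# Hypothetical miniature version of the flights table
toy_flights <- tribble(
  ~arr_delay, ~air_time,
  13,         359,
  13,         363,
  9,          351,
  -26,        157
)

# Each expression takes a whole column and returns a single value,
# so the result has exactly one row.
overall <- summarize(
  toy_flights,
  avg_arr_delay = mean(arr_delay),
  med_air_time = median(air_time)
)
overall
```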
-All of the functionality described above can be easily expressed in base R syntax (see examples here). However, where dplyr really shines is the ability to apply the functions above to groups of data within each data frame.
-We can establish groups within the data using group_by(). The functions mutate(), summarize(), and optionally arrange() will then operate on each group independently rather than on all of the rows.
Common approaches:
-- group_by -> summarize: calculate summaries per group
-- group_by -> mutate: calculate summaries per group and add as new column to original tibble
-group_by(tibble, <columns_to_establish_groups>)
group_by(flights, carrier) # notice the new "Groups:" metadata.
-
-# calculate average dep_delay per carrier
-group_by(flights, carrier) %>%
- summarize(avg_dep_delay = mean(dep_delay))
-
-# calculate average dep_delay per carrier at each origin airport
-group_by(flights, carrier, origin) %>%
- summarize(avg_dep_delay = mean(dep_delay))
-
-# calculate # of flights between each origin and destination city, per carrier, and average air time.
- # n() is a special function that returns the # of rows per group
-group_by(flights, carrier, origin, dest) %>%
- summarize(n_flights = n(),
- mean_air_time = mean(air_time))
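The examples above all follow the group_by -> summarize pattern. The group_by -> mutate pattern can be sketched in the same way; the toy table and the new column names here are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical miniature version of the flights table
toy_flights <- tribble(
  ~carrier, ~dep_delay,
  "AA",     14,
  "AA",     -3,
  "B6",     -4,
  "B6",      3
)

# mutate() on a grouped tibble keeps every row and computes the
# summary within each group, here the per-carrier mean delay.
with_avg <- toy_flights %>%
  group_by(carrier) %>%
  mutate(
    carrier_avg_delay = mean(dep_delay),
    delay_vs_avg = dep_delay - carrier_avg_delay
  ) %>%
  ungroup()   # drop the grouping so later steps see all rows again
with_avg
```

Unlike summarize(), the result has the same number of rows as the input, which makes it convenient for comparing each flight with its group.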
-Here are some questions that we can answer using grouped operations in a few lines of dplyr code. Use pipes.
-What is the average air_time between each origin airport and destination airport?
-# A tibble: 221 × 3
-# Groups: origin [3]
- origin dest avg_air_time
- <chr> <chr> <dbl>
- 1 EWR ALB 31.4
- 2 EWR ANC 424.
- 3 EWR ATL 111.
- 4 EWR AUS 210.
- 5 EWR AVL 89.7
- 6 EWR AVP 25
- 7 EWR BDL 25.4
- 8 EWR BNA 115.
- 9 EWR BOS 40.1
-10 EWR BQN 197.
-# … with 211 more rows
-group_by(flights, origin, dest) %>%
- summarize(avg_air_time = mean(air_time)) %>%
- arrange(avg_air_time) %>%
- head(1)
-# A tibble: 1 × 3
-# Groups: origin [1]
- origin dest avg_air_time
- <chr> <chr> <dbl>
-1 EWR AVP 25
-group_by(flights, origin, dest) %>%
- summarize(avg_air_time = mean(air_time)) %>%
- arrange(desc(avg_air_time)) %>%
- head(1)
-# A tibble: 1 × 3
-# Groups: origin [1]
- origin dest avg_air_time
- <chr> <chr> <dbl>
-1 JFK HNL 625.
-Try it out:
-Which carrier has the fastest flight (air_time) on average from JFK to LAX?
-# A tibble: 5 × 2
- carrier flight_time
- <chr> <dbl>
-1 DL 328.
-2 UA 328.
-3 B6 328.
-4 AA 330.
-5 VX 333.
-# A tibble: 10 × 2
- month mean_dep_delay
- <dbl> <dbl>
- 1 2 52.9
- 2 1 41.2
- 3 7 2.48
- 4 9 1.04
- 5 8 1.03
- 6 3 -0.130
- 7 10 -1.73
- 8 6 -1.76
- 9 5 -3.52
-10 4 -4.5
-stringr is a package for working with strings (i.e. character vectors). It provides a consistent syntax for string manipulation and can perform many routine tasks:
str_c: concatenate strings (similar to paste() in base R)
-str_count: count occurrences of a substring in a string
-str_subset: keep strings with a substring
-str_replace: replace a string with another string
-str_split: split a string into multiple pieces based on a string
library(stringr)
-some_words <- c("a sentence", "with a ", "needle in a", "haystack")
-str_detect(some_words, "needle") # use with dplyr::filter
-str_subset(some_words, "needle")
-
-str_replace(some_words, "needle", "pumpkin")
-str_replace_all(some_words, "a", "A")
-
-str_c(some_words, collapse = " ")
-
-str_c(some_words, " words words words", " anisfhlsdihg")
-
-str_count(some_words, "a")
-str_split(some_words, " ")
-stringr uses regular expressions to pattern match strings. This means that you can perform complex matching to the strings of interest. Additionally this means that there are special characters with behaviors that may be surprising if you are unaware of regular expressions.
-A useful resource when using regular expressions is https://regex101.com
-complex_strings <- c("10101-howdy", "34-world", "howdy-1010", "world-.")
-# keep words with a series of #s followed by a dash, + indicates one or more occurrences.
-str_subset(complex_strings, "[0-9]+-")
-
-# keep words with a dash followed by a series of #s
-str_subset(complex_strings, "-[0-9]+")
-
-str_subset(complex_strings, "^howdy") # keep words starting with howdy
-str_subset(complex_strings, "howdy$") # keep words ending with howdy
-str_subset(complex_strings, ".") # . signifies any character
-str_subset(complex_strings, "\\.") # need to use backslashes to match a literal special character
-Let’s use dplyr and stringr together.
-Which destinations contain an “LL” in their 3 letter code?
-# A tibble: 1 × 1
- dest
- <chr>
-1 FLL
-Which 3-letter destination codes start with H?
-# A tibble: 4 × 1
- dest
- <chr>
-1 HOU
-2 HNL
-3 HDN
-4 HYA
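Answers of this shape can be produced by combining str_detect() with filter() and distinct(). The sketch below runs on a hypothetical toy table, so its results cover only the codes listed there.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical miniature version of the flights table
toy_flights <- tribble(
  ~origin, ~dest,
  "JFK",   "FLL",
  "JFK",   "LAX",
  "JFK",   "HNL",
  "EWR",   "HOU",
  "LGA",   "FLL"
)

# Destinations containing "LL"
has_ll <- toy_flights %>%
  filter(str_detect(dest, "LL")) %>%
  distinct(dest)
has_ll

# Destinations starting with "H"; ^ anchors the match to the start
starts_h <- toy_flights %>%
  filter(str_detect(dest, "^H")) %>%
  distinct(dest)
starts_h
```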
-Let’s make a new column that combines the origin and dest columns.
mutate(flights, new_col = str_c(origin, ":", dest)) %>%
- select(new_col, everything())
-# A tibble: 253,316 × 12
- new_col year month day dep_delay arr_delay carrier origin dest air_time
- <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl>
- 1 JFK:LAX 2014 1 1 14 13 AA JFK LAX 359
- 2 JFK:LAX 2014 1 1 -3 13 AA JFK LAX 363
- 3 JFK:LAX 2014 1 1 2 9 AA JFK LAX 351
- 4 LGA:PBI 2014 1 1 -8 -26 AA LGA PBI 157
- 5 JFK:LAX 2014 1 1 2 1 AA JFK LAX 350
- 6 EWR:LAX 2014 1 1 4 0 AA EWR LAX 339
- 7 JFK:LAX 2014 1 1 -2 -18 AA JFK LAX 338
- 8 JFK:LAX 2014 1 1 -3 -14 AA JFK LAX 356
- 9 JFK:MIA 2014 1 1 -1 -17 AA JFK MIA 161
-10 JFK:SEA 2014 1 1 -2 -14 AA JFK SEA 349
-# … with 253,306 more rows, and 2 more variables: distance <dbl>, hour <dbl>
-sessionInfo()
-R version 4.2.0 (2022-04-22)
-Platform: x86_64-apple-darwin17.0 (64-bit)
-Running under: macOS Big Sur/Monterey 10.16
-
-Matrix products: default
-BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
-LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
-
-locale:
-[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
-
-attached base packages:
-[1] stats graphics grDevices utils datasets methods base
-
-other attached packages:
-[1] stringr_1.4.1 tibble_3.1.8 dplyr_1.0.10 readr_2.1.2
-
-loaded via a namespace (and not attached):
- [1] bslib_0.3.1 compiler_4.2.0 pillar_1.8.1 jquerylib_0.1.4
- [5] tools_4.2.0 bit_4.0.4 digest_0.6.30 downlit_0.4.2
- [9] jsonlite_1.8.3 evaluate_0.16 memoise_2.0.1 lifecycle_1.0.3
-[13] pkgconfig_2.0.3 rlang_1.0.6 DBI_1.1.3 cli_3.4.1
-[17] rstudioapi_0.13 parallel_4.2.0 distill_1.5 yaml_2.3.6
-[21] xfun_0.32 fastmap_1.1.0 withr_2.5.0 knitr_1.39
-[25] generics_0.1.3 vctrs_0.4.1 sass_0.4.1 hms_1.1.2
-[29] bit64_4.0.5 tidyselect_1.2.0 glue_1.6.2 R6_2.5.1
-[33] fansi_1.0.3 vroom_1.5.7 rmarkdown_2.14 tzdb_0.3.0
-[37] magrittr_2.0.3 ellipsis_0.3.2 htmltools_0.5.2 assertthat_0.2.1
-[41] utf8_1.2.2 stringi_1.7.8 cachem_1.0.6 crayon_1.5.2
-The content of this class borrows heavily from previous tutorials:
-R code style guide: -http://adv-r.had.co.nz/Style.html
-Tutorial organization: -https://github.com/sjaganna/molb7910-2019
-Other R tutorials: -https://github.com/matloff/fasteR -https://r4ds.had.co.nz/index.html -https://bookdown.org/rdpeng/rprogdatascience/
-