diff --git a/_posts/2023-11-27-class-2/class-2.Rmd b/_posts/2023-11-27-class-2/class-2.Rmd
index bf77fb3..31579e4 100644
--- a/_posts/2023-11-27-class-2/class-2.Rmd
+++ b/_posts/2023-11-27-class-2/class-2.Rmd
@@ -153,9 +153,9 @@ any(is.na(state.region))
 
 ### Factors
 
-When printing the `state.name` object you may have noticed the `Levels: Northeast South North Central West`. What is this?
+When printing the `state.region` object you may have noticed the `Levels: Northeast South North Central West`. What is this?
 
-`state.name` is a special type of integer vector called a `factor`. These are commonly used to represent categorical data, and allow one to define a custom order for a category. In various statistical models factors are treated differently from numeric data. In our class you will use them mostly when you are plotting.
+`state.region` is a special type of integer vector called a `factor`. These are commonly used to represent categorical data, and allow one to define a custom order for a category. In various statistical models factors are treated differently from numeric data. In our class you will use them mostly when you are plotting.
 
 Internally they are represented as integers, with levels that map a value to each integer value.
 
@@ -435,12 +435,16 @@ mtcars[c("Duster 360", "Datsun 710"), c("cyl", "hp")]
 
 For cars with miles per gallon (`mpg`) of at least 30, how many cylinders (`cyl`) do they have?
 
 ```{r}
-
+n_cyl <- mtcars[mtcars$mpg > 30, "cyl"]
+n_cyl
+unique(n_cyl)
 ```
 
 Which car has the highest horsepower (`hp`)?
 
 ```{r}
+top_hp_car <- mtcars[mtcars$hp == max(mtcars$hp), ]
+rownames(top_hp_car)
 
 ```
 
diff --git a/_posts/2023-11-27-class-2/class-2.html b/_posts/2023-11-27-class-2/class-2.html
index d17ba97..7e2d470 100644
--- a/_posts/2023-11-27-class-2/class-2.html
+++ b/_posts/2023-11-27-class-2/class-2.html
@@ -93,8 +93,8 @@
-
-
+
+
@@ -110,7 +110,7 @@
@@ -1518,7 +1518,7 @@
@@ -1538,7 +1538,7 @@
[1] FALSE
-When printing the `state.name` object you may have noticed the `Levels: Northeast South North Central West`. What is this?
-`state.name` is a special type of integer vector called a `factor`. These are commonly used to represent categorical data, and allow one to define a custom order for a category. In various statistical models factors are treated differently from numeric data. In our class you will use them mostly when you are plotting.
+When printing the `state.region` object you may have noticed the `Levels: Northeast South North Central West`. What is this?
+`state.region` is a special type of integer vector called a `factor`. These are commonly used to represent categorical data, and allow one to define a custom order for a category. In various statistical models factors are treated differently from numeric data. In our class you will use them mostly when you are plotting.
Internally they are represented as integers, with levels that map a label to each integer value.
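As a minimal sketch of the integer-plus-levels representation, the example below builds a small factor with a custom level order. The `sizes` vector is made up for illustration and is not part of the class data.

```r
# a character vector of categories (hypothetical values for illustration)
sizes <- c("small", "large", "medium", "small")

# `levels` sets a custom order instead of the default alphabetical order
size_fct <- factor(sizes, levels = c("small", "medium", "large"))

levels(size_fct)      # the labels, in the order we defined
as.integer(size_fct)  # the integer codes stored internally
typeof(size_fct)      # the underlying storage type of a factor
```

The integer codes index into `levels(size_fct)`, which is why changing the level order changes how the categories are sorted and plotted.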
For cars with miles per gallon (`mpg`) of at least 30, how many cylinders (`cyl`) do they have?
Which car has the highest horsepower (`hp`)?
[1] "Maserati Bora"
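One possible way to answer both questions with base R subsetting is sketched below. It uses `>=` to match the wording "at least 30" and `which.max()` as an alternative to comparing against `max()`; the variable name `high_mpg_cyl` is chosen here for illustration.

```r
# mtcars ships with base R (datasets package)

# ">=" matches "at least 30"; the logical vector selects rows, "cyl" selects the column
high_mpg_cyl <- mtcars[mtcars$mpg >= 30, "cyl"]
unique(high_mpg_cyl)

# which.max() returns the position of the (first) largest value,
# which we use to look up the corresponding row name
rownames(mtcars)[which.max(mtcars$hp)]
```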
The `data.frame` and related variants (e.g. tibble or data.table) are a workhorse data structure that we will return to again and again in the next classes.
We have already used many functions, e.g. `seq`, `typeof`, `matrix`, `as.data.frame`. Functions have rules for how arguments are specified.
round(x, digits = 0)
`round`: function name
`x`: required argument
`digits`: optional argument (defaults to 0)
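The rules above can be illustrated with a few equivalent ways of calling `round()`. This is a small sketch; the value of `x` is arbitrary.

```r
x <- 3.14159

round(x)                  # digits is omitted, so the default (0) is used
round(x, digits = 2)      # optional argument supplied by name
round(x, 2)               # arguments matched by position
round(digits = 2, x = x)  # named arguments can be given in any order
```

Naming optional arguments usually makes a call easier to read than relying on position alone.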
ls()
[1] "add_stuff" "animals" "df" "is_in_regions"
- [5] "lst" "m" "nums" "res"
- [9] "state.name" "total_area" "ww" "x"
+ [5] "lst" "m" "n_cyl" "nums"
+ [9] "res" "state.name" "top_hp_car" "total_area"
+[13] "ww" "x"
Objects can be removed from the environment, which can be helpful if you have a large memory object that is no longer needed.
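A minimal sketch of removing an object with `rm()`; the object name `big_object` is hypothetical.

```r
# create a reasonably large object (one million random numbers)
big_object <- rnorm(1e6)
exists("big_object")   # check that the object is in the environment

# remove it once it is no longer needed
rm(big_object)
exists("big_object")   # check again after removal
```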
Noble WS. A quick guide to organizing computational biology projects. PLoS Comput Biol. 2009 Jul;5(7):e1000424. doi: 10.1371/journal.pcbi.1000424.
Provide meaningful names for your files. Consider including ordinal values (e.g. 01, 02, 03) if analyses depend on previous results to indicate ordering of execution.
# bad
models.R
analysis.R
explore.R
analysis-redo-final-v2.R
# good
clean-data.R
fit-model.R
plot-data.R
# better
01_clean-data.R
02_fit-model.R
03_plot-data.R
“Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.”
@@ -2345,62 +2358,62 @@Organizing your code
- Use comments to remind yourself what the code does. The `#` character tells R to ignore the rest of that line.
# convert x to zscores
zs <- (x - mean(x)) / sd(x)
- Use comments to break up long scripts into logical blocks
# Load data ---------------------------
dat <- read_csv("awesome-data.csv")
colnames(dat) <- c("sample", "color", "score", "prediction")
...
...
# modify data -------------------------
dat <- mutate(dat, result = score + prediction)
...
...
# Plot data ---------------------------
ggplot(dat, aes(sample, score)) +
  geom_point()
- Use sensible names for variables. Keep them short, but meaningful. Separate words using snake_case (e.g. `plot_df`) or camelCase (`plotDf`).
# good
a <- width * height
p <- 2 * width + 2 * height
measurement_df <- data.frame(area = a, perimeter = p)
# bad
y <- x1 * x2
yy <- 2*x1 + 2*x2
tmp <- data.frame(a = y, b = yy)
- Space is free in code, use it liberally. Add spaces around operators.
# Good
average <- mean(feet / 12 + inches, na.rm = TRUE)
# Bad
average<-mean(feet/12+inches,na.rm=TRUE)
- Split up complicated operations or long function calls into multiple lines. In general you can add a newline after a comma or a pipe operation (`%>%`). Indenting the code can also help with readability.
# good
data <- complicated_function(x,
                             minimizer = 1.4,
                             sigma = 100,
                             scale_values = FALSE,
                             verbose = TRUE,
                             additional_args = list(x = 100,
                                                    fun = rnorm))
# bad
data <- complicated_function(x, minimizer = 1.4, sigma = 100, scale_values = FALSE, verbose = TRUE, additional_args = list(x = 100, fun = rnorm))
# good
plot_df <- read_csv("awesome_data.csv") %>%
  select(sample, scores, condition) %>%
  mutate(norm_scores = scores / sum(scores))

# bad
plot_df <- read_csv("awesome_data.csv") %>% select(sample, scores, condition) %>% mutate(norm_scores = scores / sum(scores))
RStudio has shortcuts to help format code:
Code -> Reformat code
Code -> Reindent lines
@@ -2411,7 +2424,7 @@ Acknowledge
https://r4ds.had.co.nz/index.html
https://bookdown.org/rdpeng/rprogdatascience/
http://adv-r.had.co.nz/Style.html
The Rmarkdown for this class is on github
+The tidyverse is a collection of packages that share a similar design philosophy, syntax, and data structures. The packages are largely developed by the same team that builds RStudio.
+Some key packages that we will touch on in this course:
+`readr`: functions for data import and export
+`ggplot2`: plotting based on the “grammar of graphics”
+`dplyr`: functions to manipulate tabular data
+`tidyr`: functions to help reshape data into a tidy format
+`stringr`: functions for working with strings
+`tibble`: a redesigned data.frame
To use an R package in an analysis we need to load the package using the `library()` function. This needs to be done once in each R session, and it is a good idea to do this at the beginning of your Rmarkdown. For teaching purposes, however, I will sometimes load a package when I introduce a function from it.
A `tibble` is a re-imagining of the base R `data.frame`. It has a few differences from the `data.frame`. The biggest differences are that it doesn’t have `row.names` and it has an enhanced `print` method. If interested in learning more, see the tibble vignette.
Compare `data` to `data_tbl`.
Note, by default RStudio displays base R data.frames in a tibble-like format.
+data <- data.frame(a = 1:3,
+ b = letters[1:3],
+ c = Sys.Date() - 1:3,
+ row.names = c("a", "b", "c"))
+data_tbl <- as_tibble(data)
+data_tbl
+When you work with tidyverse functions it is a good practice to convert data.frames to tibbles.
+If a data.frame has rownames, you can preserve these by moving them into a column before converting to a tibble, using the `rownames_to_column()` function from `tibble`.
head(mtcars)
+ mpg cyl disp hp drat wt qsec vs am gear carb
+Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
+Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
+Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
+Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
+Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
+Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
+mtcars_tbl <- rownames_to_column(mtcars, "vehicle")
+mtcars_tbl <- as_tibble(mtcars_tbl)
+mtcars_tbl
+# A tibble: 32 × 12
+ vehicle mpg cyl disp hp drat wt qsec vs am gear carb
+ <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
+ 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
+ 2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
+ 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
+ 4 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
+ 5 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
+ 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
+ 7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
+ 8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
+ 9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
+10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
+# ℹ 22 more rows
+If you don’t need the rownames, then you can use the `as_tibble()` function directly.
mtcars_tbl <- as_tibble(mtcars)
+The `readr` package provides a series of functions for importing or writing data in common text formats.
`read_csv()`: comma-separated values (CSV) files
+`read_tsv()`: tab-separated values (TSV) files
+`read_delim()`: delimited files (CSV and TSV are important special cases)
+`read_fwf()`: fixed-width files
+`read_table()`: whitespace-separated files
These functions are faster and have better defaults than the base R equivalents (e.g. `read.table`). They also directly output tibbles compatible with the tidyverse.
The readr cheatsheet provides a concise overview of the functionality in the package.
+To illustrate how to use readr we will load a `.csv` file containing information about flights from 2014.
First we will download the data. You could download this data manually from GitHub; instead, we will use R to download the dataset using the base R function `download.file()`.
if(!file.exists("flights14.csv")) {
+ url <- "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
+ download.file(url, "flights14.csv")
+}
+You should now have a file called “flights14.csv” in your working directory (the same directory as the Rmarkdown). To read this data into R, we can use the `read_csv()` function. The defaults for this function work for many datasets.
flights <- read_csv("flights14.csv")
+flights
+# A tibble: 253,316 × 11
+ year month day dep_delay arr_delay carrier origin dest air_time distance
+ <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl>
+ 1 2014 1 1 14 13 AA JFK LAX 359 2475
+ 2 2014 1 1 -3 13 AA JFK LAX 363 2475
+ 3 2014 1 1 2 9 AA JFK LAX 351 2475
+ 4 2014 1 1 -8 -26 AA LGA PBI 157 1035
+ 5 2014 1 1 2 1 AA JFK LAX 350 2475
+ 6 2014 1 1 4 0 AA EWR LAX 339 2454
+ 7 2014 1 1 -2 -18 AA JFK LAX 338 2475
+ 8 2014 1 1 -3 -14 AA JFK LAX 356 2475
+ 9 2014 1 1 -1 -17 AA JFK MIA 161 1089
+10 2014 1 1 -2 -14 AA JFK SEA 349 2422
+# ℹ 253,306 more rows
+# ℹ 1 more variable: hour <dbl>
+There are a few commonly used arguments:
+`col_names`: if the data doesn’t have column names, you can provide them (or skip them).
`col_types`: set this if the data type of a column is incorrectly inferred by readr.
`comment`: if the file contains comment lines that you want to skip, such as a header line prefixed with `#`, set this to `#`.
`skip`: number of lines to skip before reading in the data.
`n_max`: maximum number of lines to read; useful for testing when reading in large datasets.
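A small sketch of how some of these arguments might be combined. The CSV text below is made up for illustration and is passed as literal data with `I()` (available in readr 2.0 and later) so that no file is needed.

```r
library(readr)

# hypothetical CSV content with a leading comment line
csv_text <- paste(
  "# exported for class",
  "carrier,origin,dep_delay",
  "AA,JFK,14",
  "AA,LGA,-8",
  "DL,EWR,2",
  sep = "\n"
)

# skip lines starting with "#" and set the column types explicitly
flights_small <- read_csv(
  I(csv_text),
  comment = "#",
  col_types = cols(
    carrier = col_character(),
    origin = col_character(),
    dep_delay = col_integer()
  )
)
flights_small

# read only the first data row, handy for a quick look at a large file
first_row <- read_csv(
  I("carrier,origin\nAA,JFK\nDL,EWR\n"),
  n_max = 1,
  col_types = "cc"
)
first_row
```

Setting `col_types` also silences the column specification message that readr prints when it has to guess the types.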
The readr functions will also automatically uncompress gzipped or zipped datasets, and additionally can read data directly from a URL.
+read_csv("https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv")
There are equivalent functions for writing data from R to files: `write_csv`, `write_tsv`, `write_delim`.
The `readxl` package can read data from Excel files and is installed with the tidyverse, although it has to be loaded separately with `library(readxl)`. The `read_excel()` function is the main function for reading data.
The `openxlsx` package, which is not part of the tidyverse but is on CRAN, can write Excel files. The `write.xlsx()` function is the main function for writing data to Excel spreadsheets.
Often it is useful to store R objects as files on disk. These could be large processed datasets, intermediate results, or complex data structures that are not easily stored in rectangular text formats.
+R provides the `readRDS()` and `saveRDS()` functions for storing data in binary formats.
saveRDS(flights, "flights.rds") # save single object into a file
+df <- readRDS("flights.rds") # read object back into R
+df
+# A tibble: 253,316 × 11
+ year month day dep_delay arr_delay carrier origin dest air_time distance
+ <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl>
+ 1 2014 1 1 14 13 AA JFK LAX 359 2475
+ 2 2014 1 1 -3 13 AA JFK LAX 363 2475
+ 3 2014 1 1 2 9 AA JFK LAX 351 2475
+ 4 2014 1 1 -8 -26 AA LGA PBI 157 1035
+ 5 2014 1 1 2 1 AA JFK LAX 350 2475
+ 6 2014 1 1 4 0 AA EWR LAX 339 2454
+ 7 2014 1 1 -2 -18 AA JFK LAX 338 2475
+ 8 2014 1 1 -3 -14 AA JFK LAX 356 2475
+ 9 2014 1 1 -1 -17 AA JFK MIA 161 1089
+10 2014 1 1 -2 -14 AA JFK SEA 349 2422
+# ℹ 253,306 more rows
+# ℹ 1 more variable: hour <dbl>
+If you want to save/load multiple objects you can use `save()` and `load()`.
save(flights, df, file = "robjs.rda") # save flights and df
+`load()` will load the data into the environment with the same object names used when saving the objects.
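The round trip can be sketched as follows. The objects `nums` and `animals` and the temporary file are made up for illustration, so nothing in your working directory is touched.

```r
# write to a temporary file rather than the working directory
rda_file <- tempfile(fileext = ".rda")

nums <- 1:5
animals <- c("cat", "dog")

# save both objects into one file, then remove them from the environment
save(nums, animals, file = rda_file)
rm(nums, animals)

# load() restores the objects under their original names and
# (invisibly) returns those names as a character vector
loaded_names <- load(rda_file)
loaded_names
nums
```

Because `load()` reuses the saved names, it will silently overwrite any existing objects with the same names; `readRDS()` avoids this by letting you choose the name on assignment.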
`View()` can be used to open an Excel-like view of a data.frame. This is a good way to quickly look at the data. `glimpse()` or `str()` give an additional view of the data.
View(mtcars)
+str(mtcars)
+glimpse(mtcars)
Additional R functions to help with exploring data.frames (and tibbles):
+Useful base R functions for exploring values
+In the first two lectures we introduced how to subset vectors, data.frames, and matrices
+using base R functions. These approaches are flexible, succinct, and stable, meaning that
+these approaches will likely be supported by R in the future.
+Some criticisms of using base R are that the syntax is hard to read, tends to be verbose, and is difficult to learn. dplyr, and other tidyverse packages, offer alternative approaches which many find easier to use. It is however necessary to know some base R in order to effectively use R.
+Some key differences between base R and the approaches in dplyr (and tidyverse)
+[
`dplyr` provides a suite of functions for manipulating data in tibbles.
Rows:
+- `filter()` chooses rows based on column values
+- `slice()` chooses rows based on location
+- `arrange()` changes the order of the rows
+- `distinct()` selects distinct/unique rows
Columns:
+- `select()` changes whether or not a column is included
+- `rename()` changes the name of columns
+- `mutate()` changes the values of columns and creates new columns
Groups of rows:
+- `summarise()` collapses a group into a single row
Returning to our `flights` data. Let’s use `filter()` to select certain rows.
filter(tibble, conditional_expression, ...)
filter(flights, dest == "LAX") # select rows where the `dest` column is equal to `LAX`
+# A tibble: 14,434 × 11
+ year month day dep_delay arr_delay carrier origin dest air_time distance
+ <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl>
+ 1 2014 1 1 14 13 AA JFK LAX 359 2475
+ 2 2014 1 1 -3 13 AA JFK LAX 363 2475
+ 3 2014 1 1 2 9 AA JFK LAX 351 2475
+ 4 2014 1 1 2 1 AA JFK LAX 350 2475
+ 5 2014 1 1 4 0 AA EWR LAX 339 2454
+ 6 2014 1 1 -2 -18 AA JFK LAX 338 2475
+ 7 2014 1 1 -3 -14 AA JFK LAX 356 2475
+ 8 2014 1 1 142 133 AA JFK LAX 345 2475
+ 9 2014 1 1 -4 11 B6 JFK LAX 349 2475
+10 2014 1 1 3 -10 B6 JFK LAX 349 2475
+# ℹ 14,424 more rows
+# ℹ 1 more variable: hour <dbl>
+Multiple conditions can be used to select rows. For example we can select rows where the `dest` column is equal to `LAX` and the `origin` is equal to `EWR`. You can either use the `&` operator, or supply multiple arguments.
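A minimal sketch of the two equivalent forms, using a tiny made-up tibble (`toy_flights`) that borrows the `origin` and `dest` column names from `flights` so it runs without downloading the data.

```r
library(dplyr)

# hypothetical data with the same column names as `flights`
toy_flights <- tibble(
  origin = c("EWR", "JFK", "EWR", "LGA"),
  dest   = c("LAX", "LAX", "DEN", "LAX")
)

# combine conditions with the & operator
with_and <- filter(toy_flights, dest == "LAX" & origin == "EWR")

# or supply each condition as a separate argument
with_comma <- filter(toy_flights, dest == "LAX", origin == "EWR")

with_and
with_comma
```

Both calls keep only the rows where every condition is TRUE.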
We can select rows where the `dest` column is equal to `LAX` or the `origin` is equal to `EWR` using the `|` operator.
filter(flights, dest == "LAX" | origin == "EWR")
+The `%in%` operator is useful for identifying rows with entries matching those in a vector of possibilities.
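As a sketch of `%in%` inside `filter()`, again on a small made-up tibble; the vector name `west_coast` is chosen here for illustration.

```r
library(dplyr)

# hypothetical data with the same column names as `flights`
toy_flights <- tibble(
  dest      = c("LAX", "DEN", "SEA", "MIA"),
  dep_delay = c(14, -3, 2, 0)
)

# the destinations we want to keep
west_coast <- c("LAX", "SEA")

# keep rows whose dest matches any entry in the vector
in_result <- filter(toy_flights, dest %in% west_coast)

# the same selection written with the | operator
or_result <- filter(toy_flights, dest == "LAX" | dest == "SEA")

in_result
```

`%in%` scales better than chaining `|` when the list of possibilities is long or stored in a variable.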
Try it out:
+Use `filter()` to find flights to DEN with a delayed departure (`dep_delay`).
filter(flights, dest == "DEN", dep_delay > 0)
+# A tibble: 3,060 × 11
+ year month day dep_delay arr_delay carrier origin dest air_time distance
+ <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl>
+ 1 2014 1 1 45 37 B6 JFK DEN 237 1626
+ 2 2014 1 1 6 -13 DL JFK DEN 235 1626
+ 3 2014 1 1 13 16 DL LGA DEN 242 1620
+ 4 2014 1 1 35 47 F9 LGA DEN 246 1620
+ 5 2014 1 1 2 19 WN EWR DEN 259 1605
+ 6 2014 1 1 17 60 WN LGA DEN 245 1620
+ 7 2014 1 1 3 12 WN LGA DEN 260 1620
+ 8 2014 1 1 10 3 UA EWR DEN 224 1605
+ 9 2014 1 1 46 43 UA LGA DEN 235 1620
+10 2014 1 1 22 8 UA EWR DEN 237 1605
+# ℹ 3,050 more rows
+# ℹ 1 more variable: hour <dbl>
+`arrange()` can be used to sort the data based on values in a single column or in multiple columns.
arrange(tibble, <columns_to_sort_by>)
For example, let’s find the flight with the shortest amount of air time by arranging the table based on the `air_time` (flight time in minutes).
arrange(flights, air_time)
+# A tibble: 253,316 × 11
+ year month day dep_delay arr_delay carrier origin dest air_time distance
+ <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl>
+ 1 2014 2 21 46 40 EV EWR BDL 20 116
+ 2 2014 6 20 -6 -2 US LGA BOS 20 184
+ 3 2014 1 16 -3 -12 EV EWR BDL 21 116
+ 4 2014 1 16 10 14 EV EWR BDL 21 116
+ 5 2014 2 19 19 0 EV EWR BDL 21 116
+ 6 2014 2 26 38 20 EV EWR BDL 21 116
+ 7 2014 3 4 17 -4 EV EWR BDL 21 116
+ 8 2014 6 5 105 93 EV EWR BDL 21 116
+ 9 2014 6 5 16 4 EV EWR BDL 21 116
+10 2014 6 26 19 13 EV EWR BDL 21 116
+# ℹ 253,306 more rows
+# ℹ 1 more variable: hour <dbl>
+Try it out:
+Use `arrange()` to sort the flights by distance (`distance`), rank in ascending order. What flight has the shortest distance?
# A tibble: 1 × 11
+ year month day dep_delay arr_delay carrier origin dest air_time distance
+ <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl>
+1 2014 1 30 9 17 US EWR PHL 46 80
+# ℹ 1 more variable: hour <dbl>
+`select()` is a simple function that subsets the tibble to keep certain columns.
select(tibble, <columns_to_keep>)
select(flights, origin, dest)
+# A tibble: 253,316 × 2
+ origin dest
+ <chr> <chr>
+ 1 JFK LAX
+ 2 JFK LAX
+ 3 JFK LAX
+ 4 LGA PBI
+ 5 JFK LAX
+ 6 EWR LAX
+ 7 JFK LAX
+ 8 JFK LAX
+ 9 JFK MIA
+10 JFK SEA
+# ℹ 253,306 more rows
+The `:` operator can select a range of columns, such as the columns from `air_time` to `hour`. The `!` operator selects columns not listed.
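Both operators can be sketched on a one-row made-up tibble that borrows a few column names from `flights`.

```r
library(dplyr)

# hypothetical single-row table with some of the `flights` columns
toy_flights <- tibble(
  carrier = "AA", origin = "JFK", dest = "LAX",
  air_time = 359, distance = 2475, hour = 9
)

# range of adjacent columns, from air_time through hour
range_cols <- select(toy_flights, air_time:hour)

# everything except the listed columns
not_cols <- select(toy_flights, !c(origin, dest))

names(range_cols)
names(not_cols)
```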
There is a suite of utilities in the tidyverse to help select columns based on conditions: `matches()`, `starts_with()`, `ends_with()`, `contains()`, `any_of()`, and `all_of()`. `everything()` is also useful as a placeholder for all columns not explicitly listed. See the help page `?select`.
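A few of these helpers are sketched here on a made-up tibble; the name `tail_number` is deliberately not a column, to show how `any_of()` behaves with missing names.

```r
library(dplyr)

# hypothetical table with some of the `flights` columns
toy_flights <- tibble(
  carrier = "AA", dep_delay = 14, arr_delay = 13, dest = "LAX"
)

# columns whose names contain the text "delay" (a quoted string, not a column name)
delay_cols <- select(toy_flights, contains("delay"))

# move dest to the front, then keep all remaining columns
reordered <- select(toy_flights, dest, everything())

# any_of() takes a character vector and ignores names that are absent
wanted <- c("carrier", "tail_number")
present <- select(toy_flights, any_of(wanted))

names(delay_cols)
names(reordered)
names(present)
```

`all_of()` takes the same kind of character vector but raises an error when a name is missing, which is useful when a missing column should not pass silently.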
In general, when working with the tidyverse, you don’t need to quote the names of columns. In the example above, we needed quotes because “delay” is not a column name in the flights tibble.
+`mutate()` allows you to add new columns to the tibble.
mutate(tibble, new_column_name = expression, ...)
mutate(flights, total_delay = dep_delay + arr_delay)
+# A tibble: 253,316 × 12
+ year month day dep_delay arr_delay carrier origin dest air_time distance
+ <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl>
+ 1 2014 1 1 14 13 AA JFK LAX 359 2475
+ 2 2014 1 1 -3 13 AA JFK LAX 363 2475
+ 3 2014 1 1 2 9 AA JFK LAX 351 2475
+ 4 2014 1 1 -8 -26 AA LGA PBI 157 1035
+ 5 2014 1 1 2 1 AA JFK LAX 350 2475
+ 6 2014 1 1 4 0 AA EWR LAX 339 2454
+ 7 2014 1 1 -2 -18 AA JFK LAX 338 2475
+ 8 2014 1 1 -3 -14 AA JFK LAX 356 2475
+ 9 2014 1 1 -1 -17 AA JFK MIA 161 1089
+10 2014 1 1 -2 -14 AA JFK SEA 349 2422
+# ℹ 253,306 more rows
+# ℹ 2 more variables: hour <dbl>, total_delay <dbl>
+We can’t see the new column, so we add a select command to examine the columns of interest.
+mutate(flights, total_delay = dep_delay + arr_delay) |>
+ select(dep_delay, arr_delay, total_delay)
+# A tibble: 253,316 × 3
+ dep_delay arr_delay total_delay
+ <dbl> <dbl> <dbl>
+ 1 14 13 27
+ 2 -3 13 10
+ 3 2 9 11
+ 4 -8 -26 -34
+ 5 2 1 3
+ 6 4 0 4
+ 7 -2 -18 -20
+ 8 -3 -14 -17
+ 9 -1 -17 -18
+10 -2 -14 -16
+# ℹ 253,306 more rows
+Multiple new columns can be made, and you can refer to columns made in preceding statements.
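A minimal sketch of creating several columns in one `mutate()` call, where the second column refers to the first. The tibble and the column name `avg_delay` are made up for illustration.

```r
library(dplyr)

# hypothetical table with two of the `flights` delay columns
toy_flights <- tibble(
  dep_delay = c(14, -3, 2),
  arr_delay = c(13, 13, 9)
)

# total_delay is created first and then reused to compute avg_delay
delays <- mutate(
  toy_flights,
  total_delay = dep_delay + arr_delay,
  avg_delay = total_delay / 2
)
delays
```

Columns are created in the order they are written, so a column can only refer to those defined before it in the same call.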
+Try it out:
+Calculate the flight time (`air_time`) in hours rather than in minutes, add as a new column.
mutate(flights, flight_time = air_time / 60)
+# A tibble: 253,316 × 12
+ year month day dep_delay arr_delay carrier origin dest air_time distance
+ <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl>
+ 1 2014 1 1 14 13 AA JFK LAX 359 2475
+ 2 2014 1 1 -3 13 AA JFK LAX 363 2475
+ 3 2014 1 1 2 9 AA JFK LAX 351 2475
+ 4 2014 1 1 -8 -26 AA LGA PBI 157 1035
+ 5 2014 1 1 2 1 AA JFK LAX 350 2475
+ 6 2014 1 1 4 0 AA EWR LAX 339 2454
+ 7 2014 1 1 -2 -18 AA JFK LAX 338 2475
+ 8 2014 1 1 -3 -14 AA JFK LAX 356 2475
+ 9 2014 1 1 -1 -17 AA JFK MIA 161 1089
+10 2014 1 1 -2 -14 AA JFK SEA 349 2422
+# ℹ 253,306 more rows
+# ℹ 2 more variables: hour <dbl>, flight_time <dbl>
+`summarize()` is a function that will collapse the data from a column into a summary value based on a function that takes a vector and returns a single value (e.g. `mean()`, `sum()`, `median()`). It is not very useful yet, but will be very powerful when we discuss grouped operations.
# A tibble: 1 × 2
+ avg_arr_delay med_air_time
+ <dbl> <dbl>
+1 8.15 134
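The call that produced the output above is not shown, so the following is only a sketch of the general shape such a call might take, run on a small made-up tibble rather than the full `flights` data.

```r
library(dplyr)

# hypothetical table with two of the `flights` columns
toy_flights <- tibble(
  arr_delay = c(13, 9, -26, 0),
  air_time  = c(359, 351, 157, 339)
)

# each name = expression pair becomes one column in a one-row result
toy_summary <- summarize(
  toy_flights,
  avg_arr_delay = mean(arr_delay),
  med_air_time = median(air_time)
)
toy_summary
```

If a column contains missing values, functions such as `mean()` return `NA` unless `na.rm = TRUE` is supplied.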
+All of the functionality described above can be easily expressed in base R syntax (see examples here). However, where dplyr really shines is the ability to apply the functions above to groups of data within each data frame.
+We can establish groups within the data using `group_by()`. The functions `mutate()`, `summarize()`, and optionally `arrange()` will then operate on each group independently rather than on all of the rows.
Common approaches:
+group_by -> summarize: calculate summaries per group
+group_by -> mutate: calculate summaries per group and add as new column to original tibble
+group_by(tibble, <columns_to_establish_groups>)
group_by(flights, carrier) # notice the new "Groups:" metadata.
+
+# calculate average dep_delay per carrier
+group_by(flights, carrier) |>
+ summarize(avg_dep_delay = mean(dep_delay))
+
+# calculate average dep_delay per carrier at each origin airport
+group_by(flights, carrier, origin) |>
+ summarize(avg_dep_delay = mean(dep_delay))
+
+# calculate # of flights between each origin and destination city, per carrier, and average air time.
+ # n() is a special function that returns the # of rows per group
+group_by(flights, carrier, origin, dest) |>
+ summarize(n_flights = n(),
+ mean_air_time = mean(air_time))
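The examples above all follow the group_by -> summarize pattern. A sketch of the group_by -> mutate pattern is shown below on a made-up tibble; the column names `carrier_avg` and `diff_from_avg` are chosen for illustration.

```r
library(dplyr)

# hypothetical table with two of the `flights` columns
toy_flights <- tibble(
  carrier   = c("AA", "AA", "DL", "DL"),
  dep_delay = c(10, 20, 0, 4)
)

# mutate() keeps every row and adds the per-carrier average alongside it
with_group_avg <- group_by(toy_flights, carrier) |>
  mutate(
    carrier_avg = mean(dep_delay),
    diff_from_avg = dep_delay - carrier_avg
  ) |>
  ungroup()  # drop the grouping so later steps act on all rows

with_group_avg
```

Unlike `summarize()`, the result has the same number of rows as the input, which makes this pattern handy for normalizing values within each group.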
+Here are some questions that we can answer using grouped operations in a few lines of dplyr code. Use pipes.
+What is the average `air_time` between each origin airport and destination airport?
# A tibble: 221 × 3
+# Groups: origin [3]
+ origin dest avg_air_time
+ <chr> <chr> <dbl>
+ 1 EWR ALB 31.4
+ 2 EWR ANC 424.
+ 3 EWR ATL 111.
+ 4 EWR AUS 210.
+ 5 EWR AVL 89.7
+ 6 EWR AVP 25
+ 7 EWR BDL 25.4
+ 8 EWR BNA 115.
+ 9 EWR BOS 40.1
+10 EWR BQN 197.
+# ℹ 211 more rows
+group_by(flights, origin, dest) |>
+ summarize(avg_air_time = mean(air_time)) |>
+ arrange(avg_air_time) |>
+ head(1)
+# A tibble: 1 × 3
+# Groups: origin [1]
+ origin dest avg_air_time
+ <chr> <chr> <dbl>
+1 EWR AVP 25
+group_by(flights, origin, dest) |>
+ summarize(avg_air_time = mean(air_time)) |>
+ arrange(desc(avg_air_time)) |>
+ head(1)
+# A tibble: 1 × 3
+# Groups: origin [1]
+ origin dest avg_air_time
+ <chr> <chr> <dbl>
+1 JFK HNL 625.
+Try it out:
+Which carrier has the fastest flight (`air_time`) on average from JFK to LAX?
# A tibble: 5 × 2
+ carrier flight_time
+ <chr> <dbl>
+1 DL 328.
+2 UA 328.
+3 B6 328.
+4 AA 330.
+5 VX 333.
+# A tibble: 10 × 2
+ month mean_dep_delay
+ <dbl> <dbl>
+ 1 2 52.9
+ 2 1 41.2
+ 3 7 2.48
+ 4 9 1.04
+ 5 8 1.03
+ 6 3 -0.130
+ 7 10 -1.73
+ 8 6 -1.76
+ 9 5 -3.52
+10 4 -4.5
+`stringr` is a package for working with strings (i.e. character vectors). It provides a consistent syntax for string manipulation and can perform many routine tasks:
`str_c`: concatenate strings (similar to `paste()` in base R)
+`str_count`: count occurrences of a substring in a string
+`str_subset`: keep strings with a substring
+`str_replace`: replace a string with another string
+`str_split`: split a string into multiple pieces based on a string
library(stringr)
+some_words <- c("a sentence", "with a ", "needle in a", "haystack")
+str_detect(some_words, "needle") # use with dplyr::filter
+str_subset(some_words, "needle")
+
+str_replace(some_words, "needle", "pumpkin")
+str_replace_all(some_words, "a", "A")
+
+str_c(some_words, collapse = " ")
+
+str_c(some_words, " words words words", " anisfhlsdihg")
+
+str_count(some_words, "a")
+str_split(some_words, " ")
+stringr uses regular expressions to pattern match strings. This means that you can perform complex matching to the strings of interest. Additionally this means that there are special characters with behaviors that may be surprising if you are unaware of regular expressions.
+A useful resource when using regular expressions is https://regex101.com
+complex_strings <- c("10101-howdy", "34-world", "howdy-1010", "world-.")
+# keep words with a series of #s followed by a dash, + indicates one or more occurrences.
+str_subset(complex_strings, "[0-9]+-")
+
+# keep words with a dash followed by a series of #s
+str_subset(complex_strings, "-[0-9]+")
+
+str_subset(complex_strings, "^howdy") # keep words starting with howdy
+str_subset(complex_strings, "howdy$") # keep words ending with howdy
+str_subset(complex_strings, ".") # . signifies any character
+str_subset(complex_strings, "\\.") # use a double backslash to escape a special character and match it literally
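As an alternative to escaping, stringr's `fixed()` wrapper treats the pattern as literal text instead of a regular expression. A short sketch using the same `complex_strings` vector:

```r
library(stringr)

complex_strings <- c("10101-howdy", "34-world", "howdy-1010", "world-.")

# as a regular expression, "." matches any character, so every string is kept
any_char <- str_subset(complex_strings, ".")

# escaping with a double backslash matches a literal period
escaped <- str_subset(complex_strings, "\\.")

# fixed() matches the pattern literally, so no escaping is needed
literal <- str_subset(complex_strings, fixed("."))

any_char
escaped
literal
```

`fixed()` is a convenient choice when the pattern comes from data and may contain characters that have special meaning in regular expressions.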
+Let’s use dplyr and stringr together.
+Which destinations contain an “LL” in their 3 letter code?
+# A tibble: 1 × 1
+ dest
+ <chr>
+1 FLL
+Which 3-letter destination codes start with H?
+filter(flights, str_detect(dest, "^H")) |>
+ select(dest) |>
+ unique()
+# A tibble: 4 × 1
+ dest
+ <chr>
+1 HOU
+2 HNL
+3 HDN
+4 HYA
+Let’s make a new column that combines the `origin` and `dest` columns.
mutate(flights, new_col = str_c(origin, ":", dest)) |>
+ select(new_col, everything())
+# A tibble: 253,316 × 12
+ new_col year month day dep_delay arr_delay carrier origin dest air_time
+ <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl>
+ 1 JFK:LAX 2014 1 1 14 13 AA JFK LAX 359
+ 2 JFK:LAX 2014 1 1 -3 13 AA JFK LAX 363
+ 3 JFK:LAX 2014 1 1 2 9 AA JFK LAX 351
+ 4 LGA:PBI 2014 1 1 -8 -26 AA LGA PBI 157
+ 5 JFK:LAX 2014 1 1 2 1 AA JFK LAX 350
+ 6 EWR:LAX 2014 1 1 4 0 AA EWR LAX 339
+ 7 JFK:LAX 2014 1 1 -2 -18 AA JFK LAX 338
+ 8 JFK:LAX 2014 1 1 -3 -14 AA JFK LAX 356
+ 9 JFK:MIA 2014 1 1 -1 -17 AA JFK MIA 161
+10 JFK:SEA 2014 1 1 -2 -14 AA JFK SEA 349
+# ℹ 253,306 more rows
+# ℹ 2 more variables: distance <dbl>, hour <dbl>
+sessionInfo()
+R version 4.3.1 (2023-06-16)
+Platform: aarch64-apple-darwin20 (64-bit)
+Running under: macOS Monterey 12.2.1
+
+Matrix products: default
+BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
+LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
+
+locale:
+[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
+
+time zone: America/Denver
+tzcode source: internal
+
+attached base packages:
+[1] stats graphics grDevices utils datasets methods base
+
+other attached packages:
+[1] stringr_1.5.1 tibble_3.2.1 dplyr_1.1.3 readr_2.1.4
+
+loaded via a namespace (and not attached):
+ [1] bit_4.0.5 jsonlite_1.8.7 compiler_4.3.1 crayon_1.5.2
+ [5] tidyselect_1.2.0 parallel_4.3.1 jquerylib_0.1.4 yaml_2.3.7
+ [9] fastmap_1.1.1 R6_2.5.1 generics_0.1.3 knitr_1.45
+[13] distill_1.6 bslib_0.5.1 pillar_1.9.0 tzdb_0.4.0
+[17] rlang_1.1.2 utf8_1.2.4 cachem_1.0.8 stringi_1.8.1
+[21] xfun_0.41 sass_0.4.7 bit64_4.0.5 memoise_2.0.1
+[25] cli_3.6.1 withr_2.5.2 magrittr_2.0.3 digest_0.6.33
+[29] vroom_1.6.4 rstudioapi_0.15.0 hms_1.1.3 lifecycle_1.0.4
+[33] vctrs_0.6.4 downlit_0.4.3 evaluate_0.23 glue_1.6.2
+[37] fansi_1.0.5 rmarkdown_2.25 tools_4.3.1 pkgconfig_2.0.3
+[41] htmltools_0.5.7
+The content of this class borrows heavily from previous tutorials:
+R code style guide:
+http://adv-r.had.co.nz/Style.html
+Tutorial organization:
+https://github.com/sjaganna/molb7910-2019
+Other R tutorials:
+https://github.com/matloff/fasteR
+https://r4ds.had.co.nz/index.html
+https://bookdown.org/rdpeng/rprogdatascience/
+
d,r=0;d=n[r++];)d(t)}}},i=a.Token=function(e,t,n,i,a){this.type=e,this.content=t,this.alias=n,this.length=0|(i||'').length,this.greedy=!!a};if(i.stringify=function(e,t,n){if('string'==typeof e)return e;if('Array'===a.util.type(e))return e.map(function(n){return i.stringify(n,t,e)}).join('');var d={type:e.type,content:i.stringify(e.content,t,n),tag:'span',classes:['token',e.type],attributes:{},language:t,parent:n};if('comment'==d.type&&(d.attributes.spellcheck='true'),e.alias){var r='Array'===a.util.type(e.alias)?e.alias:[e.alias];Array.prototype.push.apply(d.classes,r)}a.hooks.run('wrap',d);var l=Object.keys(d.attributes).map(function(e){return e+'="'+(d.attributes[e]||'').replace(/"/g,'"')+'"'}).join(' ');return'<'+d.tag+' class="'+d.classes.join(' ')+'"'+(l?' '+l:'')+'>'+d.content+''+d.tag+'>'},!t.document)return t.addEventListener?(t.addEventListener('message',function(e){var n=JSON.parse(e.data),i=n.language,d=n.code,r=n.immediateClose;t.postMessage(a.highlight(d,a.languages[i],i)),r&&t.close()},!1),t.Prism):t.Prism;var d=document.currentScript||[].slice.call(document.getElementsByTagName('script')).pop();return d&&(a.filename=d.src,document.addEventListener&&!d.hasAttribute('data-manual')&&('loading'===document.readyState?document.addEventListener('DOMContentLoaded',a.highlightAll):window.requestAnimationFrame?window.requestAnimationFrame(a.highlightAll):window.setTimeout(a.highlightAll,16))),t.Prism}();e.exports&&(e.exports=n),'undefined'!=typeof 
Ti&&(Ti.Prism=n),n.languages.markup={comment://,prolog:/<\?[\w\W]+?\?>/,doctype://i,cdata://i,tag:{pattern:/<\/?(?!\d)[^\s>\/=$<]+(?:\s+[^\s>\/=]+(?:=(?:("|')(?:\\\1|\\?(?!\1)[\w\W])*\1|[^\s'">=]+))?)*\s*\/?>/i,inside:{tag:{pattern:/^<\/?[^\s>\/]+/i,inside:{punctuation:/^<\/?/,namespace:/^[^\s>\/:]+:/}},"attr-value":{pattern:/=(?:('|")[\w\W]*?(\1)|[^\s>]+)/i,inside:{punctuation:/[=>"']/}},punctuation:/\/?>/,"attr-name":{pattern:/[^\s>\/]+/,inside:{namespace:/^[^\s>\/:]+:/}}}},entity:/?[\da-z]{1,8};/i},n.hooks.add('wrap',function(e){'entity'===e.type&&(e.attributes.title=e.content.replace(/&/,'&'))}),n.languages.xml=n.languages.markup,n.languages.html=n.languages.markup,n.languages.mathml=n.languages.markup,n.languages.svg=n.languages.markup,n.languages.css={comment:/\/\*[\w\W]*?\*\//,atrule:{pattern:/@[\w-]+?.*?(;|(?=\s*\{))/i,inside:{rule:/@[\w-]+/}},url:/url\((?:(["'])(\\(?:\r\n|[\w\W])|(?!\1)[^\\\r\n])*\1|.*?)\)/i,selector:/[^\{\}\s][^\{\};]*?(?=\s*\{)/,string:{pattern:/("|')(\\(?:\r\n|[\w\W])|(?!\1)[^\\\r\n])*\1/,greedy:!0},property:/(\b|\B)[\w-]+(?=\s*:)/i,important:/\B!important\b/i,function:/[-a-z0-9]+(?=\()/i,punctuation:/[(){};:]/},n.languages.css.atrule.inside.rest=n.util.clone(n.languages.css),n.languages.markup&&(n.languages.insertBefore('markup','tag',{style:{pattern:/(
+
+
+ ${e.map(l).map((e)=>`
`)}}const Mi=`
+d-citation-list {
+ contain: layout style;
+}
+
+d-citation-list .references {
+ grid-column: text;
+}
+
+d-citation-list .references .title {
+ font-weight: 500;
+}
+`;class Oi extends HTMLElement{static get is(){return'd-citation-list'}connectedCallback(){this.hasAttribute('distill-prerendered')||(this.style.display='none')}set citations(e){x(this,e)}}var Ui=f(function(e){var t='undefined'==typeof window?'undefined'!=typeof WorkerGlobalScope&&self instanceof WorkerGlobalScope?self:{}:window,n=function(){var e=/\blang(?:uage)?-(\w+)\b/i,n=0,a=t.Prism={util:{encode:function(e){return e instanceof i?new i(e.type,a.util.encode(e.content),e.alias):'Array'===a.util.type(e)?e.map(a.util.encode):e.replace(/&/g,'&').replace(/e.length)break tokenloop;if(!(y instanceof n)){c.lastIndex=0;var v=c.exec(y),w=1;if(!v&&f&&x!=d.length-1){if(c.lastIndex=i,v=c.exec(e),!v)break;for(var S=v.index+(g?v[1].length:0),C=v.index+v[0].length,T=x,k=i,p=d.length;T
+
+`);class Ni extends ei(Ii(HTMLElement)){renderContent(){if(this.languageName=this.getAttribute('language'),!this.languageName)return void console.warn('You need to provide a language attribute to your
Footnotes
+
+`,!1);class Fi extends qi(HTMLElement){connectedCallback(){super.connectedCallback(),this.list=this.root.querySelector('ol'),this.root.style.display='none'}set footnotes(e){if(this.list.innerHTML='',e.length){this.root.style.display='';for(const t of e){const e=document.createElement('li');e.id=t.id+'-listing',e.innerHTML=t.innerHTML;const n=document.createElement('a');n.setAttribute('class','footnote-backlink'),n.textContent='[\u21A9]',n.href='#'+t.id,e.appendChild(n),this.list.appendChild(e)}}else this.root.style.display='none'}}const Pi=ti('d-hover-box',`
+
+
+