diff --git a/episodes/02-starting-with-data.Rmd b/episodes/02-starting-with-data.Rmd index 5022e339..65e653b7 100644 --- a/episodes/02-starting-with-data.Rmd +++ b/episodes/02-starting-with-data.Rmd @@ -6,25 +6,24 @@ exercises: 30 ::::::::::::::::::::::::::::::::::::::: objectives -- Describe what a data frame is. +- Set up the working directory and sub-directories. - Load external data from a .csv file into a data frame. +- Describe what a data frame is. - Summarize the contents of a data frame. -- Describe the difference between a factor and a string. -- Convert between strings and factors. -- Reorder and rename factors. -- Change how character strings are handled in a data frame. -- Examine and change date formats. +- Indexing and subsetting data frames. +- Explore missing values in data frames. +- Use logical operators in data frames. :::::::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: questions -- What is a data.frame? +- What is a working directory? +- How can I create new sub-directories in R? - How can I read a complete csv file into R? - How can I get basic summary information about my dataset? -- How can I change the way R treats strings in my dataset? -- Why would I want strings to be treated differently? -- How are dates represented in R and how can I change the format? +- How can I extract certain rows and columns from my data frame? +- How can I deal with missing values in R? :::::::::::::::::::::::::::::::::::::::::::::::::: @@ -34,54 +33,35 @@ library(readr) books <- read_csv("./data/books.csv") ``` -## Open your Rproj file -First, open your R Project file (`library_carpentry.Rproj`) created in the `Before We Start` lesson. +## Create a new R project -If you did not complete that step, do the following: +Let's create a new project `library_carpentry.Rproj` in RStudio to play with our dataset. -- Under the `File` menu, click on `New project`, choose `New directory`, then - `New project` -- Enter the name `library_carpentry` for this new folder (or "directory"). This - will be your **working directory** for the rest of the day. -- Click on `Create project` -- Create a new file where we will type our scripts. Go to File > New File > R - script. Click the save icon on your toolbar and save your script as - "`script.R`". +:::::::::::::::::::::::::::::::::::::::::: prereq -## Presentation of the data +## Reminder -This data was downloaded from the University of Houston--Clear Lake Integrated -Library System in 2018. It is a relatively random sample of books from the -catalog. It consists of 10,000 observations of 11 variables. +In the previous episode [`Before We Start`](https://librarycarpentry.github.io/lc-r/00-before-we-start.html), we briefly walk through how to create an R project to keep +all your scripts, data and analysis. If you need a refreshment, go back +to this episode and follow the instructions on creating an R project folder. -These variables are: +:::::::::::::::::::::::::::::::::::::::::: + + +Once you have created the project, open the project in RStudio. -- `CALL...BIBLIO.` : Bibliographic call number. Most of these are cataloged with - the Library of Congress classification, but there are also items cataloged in - the Dewey Decimal System (including fiction and non-fiction), and Superintendent - of Documents call numbers. Character. -- `X245.ab` : The title and remainder of title. Exported from MARC tag 245|ab - fields. Separated by a `|` pipe character. Character. -- `X245.c` : The author (statement of responsibility). Exported from MARC tag 245|c. Character. -- `TOT.CHKOUT` : The total number of checkouts. Integer. -- `LOUTDATE` : The last date the item was checked out. Date. YYYY-MM-DDThh:mmTZD -- `SUBJECT` : Bibliographic subject in Library of Congress Subject Headings. - Separated by a `|` pipe character. Character. -- `ISN` : ISBN or ISSN. Exported from MARC field 020|a. Character -- `CALL...ITEM` : Item call number. Most of these are `NA` but there are some - secondary call numbers. -- `X008.Date.One` : Date of publication. Date. YYYY -- `BCODE2` : Item format. Character. -- `BCODE1` Sub-collection. Character. ## Getting data into R -### Ways to get data into R +We will be using a demo dataset from the University of Houston--Clear Lake +Integrated Library System in 2018. It contains a relatively random sample of books +from the catalog. In order to use your data in R, you must import it and turn it into an R *object*. There are many ways to get data into R. - **Manually**: You can manually create it using the `data.frame()` function in Base R, or the `tibble()` function in the tidyverse. + - **Import it from a file** Below is a very incomplete list - Text: TXT (`readLines()` function) - Tabular data: CSV, TSV (`read.table()` function or `readr` package) @@ -89,6 +69,7 @@ In order to use your data in R, you must import it and turn it into an R *object - Google sheets: (`googlesheets` package) - Statistics program: SPSS, SAS (`haven` package) - Databases: MySQL (`RMySQL` package) + - **Gather it from the web**: You can connect to webpages, servers, or APIs directly from within R, or you can create a data scraped from HTML webpages using the `rvest` package. For example @@ -99,7 +80,14 @@ In order to use your data in R, you must import it and turn it into an R *object - World Bank's World Development Indicators with [`WDI`](https://cran.r-project.org/web/packages/WDI/WDI.pdf). -### Organizing your working directory + +### Set up your working directory + +The working directory is an important concept to understand. It is the place +on your computer where R will look for and save files. When you write code for +your project, your scripts should refer to files in relation to the root of your working +directory and only to files within this structure. Using RStudio projects makes this +easy and ensures that your working directory is set up properly. Using a consistent folder structure across your projects will help keep things organized and make it easy to find/file things in the future. This @@ -129,36 +117,35 @@ needs, but these should form the backbone of your working directory. knitr::include_graphics("fig/working-directory-structure.png") ``` -### The working directory +:::::::::::::::::::::::::::::::::::::::::: callout -The working directory is an important concept to understand. It is the place -on your computer where R will look for and save files. When you write code for -your project, your scripts should refer to files in relation to the root of your working -directory and only to files within this structure. +## Using the `getwd()` and `setwd()` commands + +Knowing your current `directory` is important so that you save your files, scripts +and output in the right location. You can check your current directory by running +`getwd()` in the RStudio interface. If for some reason your working directory is +not what it should be, you can change it manually by navigating in the file browser +to where your working directory should be, then by clicking on the blue gear icon +"More", and selecting "Set As Working Directory". Alternatively, you can use `setwd("/path/to/working/directory")` to reset your working directory. However, your +scripts should not include this line, because it will fail on someone else's computer. + +:::::::::::::::::::::::::::::::::::::::::: -Using RStudio projects makes this easy and ensures that your working directory -is set up properly. If you need to check it, you can use `getwd()`. If for some -reason your working directory is not what it should be, you can change it in the -RStudio interface by navigating in the file browser to where your working directory -should be, then by clicking on the blue gear icon "More", and selecting "Set As Working -Directory". Alternatively, you can use `setwd("/path/to/working/directory")` to -reset your working directory. However, your scripts should not include this line, -because it will fail on someone else's computer. ::::::::::::::::::::::::::::::::::::::::: callout -## Setting your working directory with `setwd()` +## Tips on Using the `setwd()` command Some points to note about setting your working directory: -The directory must be in quotation marks. +- The directory must be in quotation marks. -On Windows computers, directories in file paths are separated with a +- For Windows users, directories in file paths are separated with a backslash `\`. However, in R, you must use a forward slash `/`. You can copy and paste from the Windows Explorer window directly into R and use find/replace (Ctrl/Cmd + F) in R Studio to replace all backslashes with forward slashes. -On Mac computers, open the Finder and navigate to the directory you wish to +- For Mac users, open the Finder and navigate to the directory you wish to set as your working directory. Right click on that folder and press the options key on your keyboard. The 'Copy "Folder Name"' option will transform into 'Copy "Folder Name" as Pathname. It will copy the path to the folder to the @@ -169,16 +156,16 @@ After you set your working directory, you can use `./` to represent it. So if you have a folder in your directory called `data`, you can use read.csv("./data") to represent that sub-directory. - :::::::::::::::::::::::::::::::::::::::::::::::::: -### Downloading the data and getting set up -Now that you have set your working directory, we will create our folder +## Downloading the data + +Once you have set your working directory, we will create our folder structure using the `dir.create()` function. For this lesson we will use the following folders in our working directory: -**`data/`**, **`data_output/`** and **`fig_output/`**. Let's write them all in +data/, data_output/ and fig_output/. Let's write them all in lowercase to be consistent. We can create them using the RStudio interface by clicking on the "New Folder" button in the file pane (bottom right), or directly from R by typing at console: @@ -189,10 +176,11 @@ dir.create("data_output") dir.create("fig_output") ``` -Go to the Figshare page for this curriculum and download the dataset called -"`books.csv`". The direct download link is: -[https://ndownloader.figshare.com/files/22031487](https://ndownloader.figshare.com/files/22031487). Place this downloaded file in -the `data/` you just created. *Alternatively,* you can do this directly from R by copying and +To download the dataset, go to the Figshare page for this curriculum and +download the dataset called "`books.csv`". The direct download link is: +[https://ndownloader.figshare.com/files/22031487](https://ndownloader.figshare.com/files/22031487). +Place this downloaded file in the `data/` directory that you just created. +*Alternatively,* you can do this directly from R by copying and pasting this in your terminal (your instructor can place this chunk of code in the Etherpad): @@ -204,7 +192,8 @@ download.file("https://ndownloader.figshare.com/files/22031487", Now if you navigate to your `data` folder, the `books.csv` file should be there. We now need to load it into our R session. -## `tidyverse` + +## Loading the data into R R has some base functions for reading a local data file into your R session--namely `read.table()` and `read.csv()`, but these have some @@ -215,6 +204,14 @@ installed and loaded with `tidyverse`. library(tidyverse) # loads the core tidyverse, including dplyr, readr, ggplot2, purrr ``` +:::::::::::::::::::::::::::::::::::::::::: prereq + +Make sure you have the `tidyverse` package installed in R. If not, refer back to +the episode `Before we start` on how to install R packages. + +:::::::::::::::::::::::::::::::::::::::::: + + To get our sample data into our R session, we will use the `read_csv()` function and assign it to the `books` value. @@ -230,13 +227,11 @@ each column as it reads it into R. For example, in this dataset, it reads the option to specify the data type for a column manually by using the `col_types` argument in `read_csv`. -You should now have an R object called `books` in the Environment pane: 10000 -observations of 12 variables. We will be using this data file in the next -module. +You should now have an R object called `books` in the Environment pane. ::::::::::::::::::::::::::::::::::::::::: callout -## Note +## Reading tabular data `read_csv()` assumes that fields are delineated by commas, however, in several countries, the comma is used as a decimal separator and the semicolon (;) is @@ -248,19 +243,52 @@ the help for `read_csv()` by typing `?read_csv` to learn more. There is also the `read_tsv()` for tab-separated data files, and `read_delim()` allows you to specify more details about the structure of your file. - :::::::::::::::::::::::::::::::::::::::::::::::::: ```{r, echo=FALSE, fig.cap="The books CSV loaded as a tibble in your R environment", fig.alt="RStudio environment pane showing one object 'books' with 10000 observations of 12 variables"} knitr::include_graphics("fig/booksImport.PNG") ``` -## What are data frames and tibbles? + +:::::::::::::::::::::::::::::::::::::::::: challenge + +## Discussion: Examine the data + +Open and examine the data in R. How many observations and variables are there? + +::::::::::::::: solution + +The data contains 10,000 observations and 11 variables. + +- `CALL...BIBLIO.` : Bibliographic call number. Most of these are cataloged with + the Library of Congress classification, but there are also items cataloged in + the Dewey Decimal System (including fiction and non-fiction), and Superintendent + of Documents call numbers. Character. +- `X245.ab` : The title and remainder of title. Exported from MARC tag 245|ab + fields. Separated by a `|` pipe character. Character. +- `X245.c` : The author (statement of responsibility). Exported from MARC tag 245|c. Character. +- `TOT.CHKOUT` : The total number of checkouts. Integer. +- `LOUTDATE` : The last date the item was checked out. Date. YYYY-MM-DDThh:mmTZD +- `SUBJECT` : Bibliographic subject in Library of Congress Subject Headings. + Separated by a `|` pipe character. Character. +- `ISN` : ISBN or ISSN. Exported from MARC field 020|a. Character +- `CALL...ITEM` : Item call number. Most of these are `NA` but there are some + secondary call numbers. +- `X008.Date.One` : Date of publication. Date. YYYY +- `BCODE2` : Item format. Character. +- `BCODE1` Sub-collection. Character. + +::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::: + + +## Data frames and tibbles Data frames are the *de facto* data structure for tabular data in `R`, and what we use for data processing, statistics, and plotting. -A data frame is the representation of data in the format of a table where the +A **data frame** is the representation of data in the format of a table where the columns are vectors that all have the same length. Because columns are vectors, each column must contain a single type of data (e.g., characters, integers, factors). For example, here is a figure depicting a data frame comprising a @@ -274,39 +302,58 @@ A data frame can be created by hand, but most commonly they are generated by the functions `read_csv()` or `read_table()`; in other words, when importing spreadsheets from your hard drive (or the web). -A tibble is an extension of `R` data frames used by the *tidyverse*. When the +A **tibble** is an extension of `R` data frames used by the *tidyverse*. When the data is read using `read_csv()`, it is stored in an object of class `tbl_df`, `tbl`, and `data.frame`. You can see the class of an object with `class()`. + ## Inspecting data frames When calling a `tbl_df` object (like `books` here), there is already a lot of information about our data frame being displayed such as the number of rows, the number of columns, the names of the columns, and as we just saw the class of data stored in each column. However, there are functions to extract this information from data frames. Here is a non-exhaustive list of some of these functions. Let's try them out! -- Size: - - - `dim(books)` - returns a vector with the number of rows in the first element, - and the number of columns as the second element (the **dim**ensions of - the object) - - `nrow(books)` - returns the number of rows - - `ncol(books)` - returns the number of columns +
-- Content: - - - `head(books)` - shows the first 6 rows - - `tail(books)` - shows the last 6 rows +### **Size and dimensions** -- Names: +```{r, purl=FALSE, eval=FALSE} +dim(books) # returns a vector with the number of rows in the first element, +and the number of columns as the second element (the **dim**ensions of the object) +nrow(books) # returns the number of rows +ncol(books) # returns the number of columns +``` - - `names(books)` - returns the column names (synonym of `colnames()` for - `data.frame` objects) +
-- Summary: - - - `View(books)` - look at the data in the viewer - - `str(books)` - structure of the object and information about the class, - length and content of each column - - `summary(books)` - summary statistics for each column +### **Content** + +To examine the contents of a data frame. + +```{r, purl=FALSE, eval=FALSE} +head(books) # shows the first 6 rows +tail(books) # shows the last 6 rows +``` + +
+ +### **Names** + +```{r, purl=FALSE, eval=FALSE} +names(books) # returns the column names (synonym of `colnames()` for +`data.frame` objects) +glimpse(books) # print names of the books data frame to the console +``` + +
+ +### **Summary** + +```{r, purl=FALSE, eval=FALSE} +View(books) # look at the data in the viewer +str(books) # structure of the object and information about the class, +length and content of each column +summary(books) # summary statistics for each column +``` Note: most of these functions are "generic", they can be used on other types of objects besides data frames. @@ -321,15 +368,17 @@ for each variable. map_chr(books, class) ``` + ## Indexing and subsetting data frames Our `books` data frame has 2 dimensions: rows (observations) and columns (variables). If we want to extract some specific data from it, we need to -specify the "coordinates" we want from it. In the last session, we used square -brackets `[ ]` to subset values from vectors. Here we will do the same thing for -data frames, but we can now add a second dimension. Row numbers come first, -followed by column numbers. However, note that different ways of specifying -these coordinates lead to results with different classes. +specify the "coordinates" we want from it. + +In the last session, we used square brackets `[ ]` to subset values from vectors. +Here we will do the same thing for data frames, but we can now add a second dimension. +Row numbers come first, followed by column numbers. However, note that different ways +of specifying these coordinates lead to results with different classes. ```{r, purl=FALSE, eval=FALSE} ## first element in the first column of the data frame (as a vector) @@ -348,7 +397,10 @@ books[3, ] head_books <- books[1:6, ] ``` -### Dollar sign +
+ + +### **Dollar sign** The dollar sign `$` is used to distinguish a specific variable (column, in Excel-speak) in a data frame: @@ -360,7 +412,10 @@ head(books$X245.ab) # print the first six book titles mean(books$TOT.CHKOUT) ``` -### `unique()`, `table()`, and `duplicated()` +
+ + +### **`unique()`, `table()`, and `duplicated()`** Use `unique()` to see all the distinct values in a variable: @@ -368,7 +423,7 @@ Use `unique()` to see all the distinct values in a variable: unique(books$BCODE2) ``` -Take that one step further with `table()` to get quick frequency counts on a variable: +Take one step further with `table()` to get quick frequency counts on a variable: ```{r expl11, comment=NA} table(books$BCODE2) # frequency counts on a variable @@ -389,7 +444,8 @@ table(duplicated(books$ISN)) # run a table of duplicated values which(duplicated(books$ISN)) # get row numbers of duplicated values ``` -## Exploring missing values + +## Explore missing values You may also need to know the number of missing values: @@ -400,7 +456,13 @@ table(is.na(books$ISN)) # use table() and is.na() in combination booksNoNA <- na.omit(books) # Return only observations that have no missing values ``` -*** +::::::::::::::::::::::::::::::::::::::: callout + +Recall how we use `na.rm`, `is.na()`, `na.omit()`, and `complete.cases()` when +dealing with vectors. + +::::::::::::::::::::::::::::::::::::::: + ::::::::::::::::::::::::::::::::::::::: challenge @@ -439,12 +501,11 @@ booksNoNA <- na.omit(books) # Return only observations that have no missing val :::::::::::::::::::::::::::::::::::::::::::::::::: + ## Logical tests R contains a number of operators you can use to compare -values. Use `help(Comparison)` to read the R help file. Note that **two equal -signs** (`==`) are used for evaluating equality (because one equals sign (`=`) -is used for assigning variables). +values. Use `help(Comparison)` to read the R help file. | operator | function | | -------- | ------------------------ | @@ -458,6 +519,24 @@ is used for assigning variables). | `is.na()` | Is NA | | `!is.na()` | Is Not NA | + +::::::::::::::::::::::::::::::::::::::: callout + +Note that the **two equal signs** (`==`) are used for evaluating equality +because one equals sign (`=`) is used for assigning variables. + +::::::::::::::::::::::::::::::::::::::: + + +A simple logical test using numeric comparison: + +```{r, purl=FALSE} +1 < 2 + + +1 > 2 +``` + Sometimes you need to do multiple logical tests (think Boolean logic). Use `help(Logic)` to read the help file. @@ -469,10 +548,41 @@ Sometimes you need to do multiple logical tests (think Boolean logic). Use | `any()` | Are some values true? | | `all()` | Are all values true? | +
+ +### **Logical Subsetting** + +We can use logical operators to subset our data, just like how we use the square +brackets `[]` for subsetting. + +For instance, if we want to extract rows with *Total Checkout Number* of more than 5: + +```{r, purl=FALSE, eval=FALSE} +books[books$TOT.CHKOUT > 5, ] +``` + +::::::::::::::::::::::::::::::::::::::: discussion + +Compare the output of the following codes to the previous one: + +```{r, purl=FALSE, eval=FALSE} +books$TOT.CHKOUT > 5 +books[books$TOT.CHKOUT > 5] +``` + +What differences did you see in the output? + +::::::::::::::::::::::::::::::::::::::: + + :::::::::::::::::::::::::::::::::::::::: keypoints -- Use read.csv to read tabular data in R. -- Use factors to represent categorical data in R. +- Use `getwd()` and `setwd()` to navigate between directories. +- Use `read_csv()` from tidyverse to read tabular data into R. +- Data frames are made up of vectors of equal length, with each vector +representing each column of the data frame. +- Summarise the dimension, content and variables in a data frame. +- Using the square brackets `[]` and logical operators to subset data frames. ::::::::::::::::::::::::::::::::::::::::::::::::::