# Getting Started with Data in R {#getting-started}
```{r setup_getting_started, include=FALSE, purl=FALSE}
chap <- 2
lc <- 0
rq <- 0
# **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`**
# **`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`**
knitr::opts_chunk$set(
tidy = FALSE,
out.width = '\\textwidth',
fig.height = 4,
fig.align = 'center',
warning = FALSE
)
options(scipen = 99, digits = 3)
# Set random number generator seed value for replicable pseudorandomness. Why 76?
# https://www.youtube.com/watch?v=xjJ7FheCkCU
set.seed(76)
```
Before we can start exploring data in R, there are some key concepts to understand first:
1. What are R and RStudio?
2. How do I code in R?
3. What are R packages?
We'll introduce these concepts in upcoming Sections \@ref(r-rstudio)-\@ref(packages). If you are already somewhat familiar with these concepts, feel free to skip to Section \@ref(nycflights13) where we'll introduce our first data set: all domestic flights departing a New York City airport in 2013. This is a dataset we will explore in depth in this book.
## What are R and RStudio? {#r-rstudio}
For much of this book, we will assume that you are using R via RStudio. First-time users often confuse the two. At its simplest, R is like a car's engine\index{R} while RStudio is like a car's dashboard\index{RStudio}.
<!--
R: Engine | RStudio: Dashboard
:-------------------------:|:-------------------------:
{ height=1.7in } | { height=1.7in }
-->
```{r R-vs-RStudio-1, echo=FALSE, fig.align='center', fig.cap="Analogy of difference between R and RStudio.", out.width='95%', purl=FALSE}
knitr::include_graphics("images/R_vs_RStudio_1.png")
```
More precisely, R is a programming language that runs computations, while RStudio is an *integrated development environment (IDE)* that provides an interface by adding many convenient features and tools. So just as having access to a speedometer, rearview mirrors, and a navigation system makes driving much easier, using RStudio's interface makes using R much easier as well.
### Installing R and RStudio
> **Note about RStudio Server**: If your instructor has provided you with a link and access to RStudio Server, then you can skip this section. We do recommend though after a few months of working on the RStudio Server that you return to these instructions.
You will first need to download and install both R and RStudio (Desktop version) on your computer.
1. **You must do this first:** [Download and install R](https://cran.r-project.org/).
+ Click on the download link corresponding to your computer's operating system. \index{R!installation}
1. **You must do this second:** [Download and install RStudio](https://www.rstudio.com/products/rstudio/download3/).
+ Scroll down to "Installers for Supported Platforms"
+ Click on the download link corresponding to your computer's operating system. \index{RStudio!installation}
### Using R via RStudio
Recall our car analogy from earlier. Much as we don't drive a car by interacting directly with the engine but rather by interacting with elements on the car's dashboard, we won't be using R directly but rather we will use RStudio's interface. After you install R and RStudio on your computer, you'll have two new programs (also known as applications) you can open. We will always work in RStudio and not R. Figure \@ref(fig:R-vs-RStudio-2) shows which icon you should be clicking on your computer.
<!--
R: Do not open this | RStudio: Open this
:-------------------------:|:-------------------------:
`r include_image("images/Rlogo.png", html_opts = "width=25%")` | `r include_image("images/RStudio-Ball.png", html_opts = "width=20%")`
-->
```{r R-vs-RStudio-2, echo=FALSE, fig.align='center', fig.cap="Icons of R versus RStudio on your computer.", out.width='90%', purl=FALSE}
knitr::include_graphics("images/R_vs_RStudio_2.png")
```
After you open RStudio, you should see the following in Figure \@ref(fig:RStudio-interface).
```{r RStudio-interface, echo=FALSE, fig.align='center', fig.cap="RStudio interface to R.", out.width='100%', purl=FALSE}
knitr::include_graphics("images/rstudio.png")
```
Note the three panes, which are three panels dividing the screen: the *Console pane*, the *Files pane*, and the *Environment pane*. Over the course of this chapter, you'll come to learn what purpose each of these panes serves.
## How do I code in R? {#code}
Now that you're set up with R and RStudio, you are probably asking yourself "OK. Now how do I use R?" The first thing to note is that unlike other statistical software programs like Excel, STATA, or SAS that provide [point and click](https://en.wikipedia.org/wiki/Point_and_click) interfaces, R is an [interpreted language](https://en.wikipedia.org/wiki/Interpreted_language). This means you have to enter R commands written in R code. In other words, you have to code/program in R. Note that we'll use the terms "coding" and "programming" interchangeably in this book.
While it is not required to be a seasoned coder/computer programmer to use R, there is still a set of basic programming concepts that R users need to understand. Consequently, while this book is not a book on programming, you will still learn just enough of these basic programming concepts needed to explore and analyze data effectively.
### Basic programming concepts and terminology {#programming-concepts}
We now introduce some basic programming concepts and terminology that you'll learn as you go. Note that in this book, we will use a different font to distinguish regular text from `computer_code`.
It is important to note that while the summaries below serve as a useful introduction, a single pass through them is insufficient for long-term learning and retention. The ultimate tools for long-term learning and retention are "learning by doing" and repetition, which we will have you practice over the course of the entire book and which we encourage as you learn any new skill.
* Basics: \index{programming language basics}
+ Console: where you enter in commands \index{console}
+ Objects: where values are saved, how to assign values to objects. \index{objects}
+ Data types: integers, doubles/numerics, logicals, characters. \index{data types}
* Vectors: a series of values. These are created using the `c()` function where `c()` stands for "combine" or "concatenate". For example: `c(6, 11, 13, 31, 90, 92)`. \index{vectors}
* Factors: *Categorical data* (as opposed to *numerical data*) are represented in R as `factor`s. \index{factors}
* Data frames: Data frames are analogous to rectangular spreadsheets: they are representations of datasets in R where the rows correspond to *observations* and the columns correspond to *variables* that describe the observations. \index{data frames} We will revisit this later in Section \@ref(nycflights13).
* Conditionals: \index{conditionals}
+ Testing for equality in R using `==` (and not `=`, which is used for assignment). Ex: `2 + 1 == 3` compares `2 + 1` to `3` and is correct R syntax, while `2 + 1 = 3` is not: R reads it as an attempt to assign a value and returns an error.
+ Boolean algebra: `TRUE/FALSE` statements and mathematical operators such as `<` (less than), `<=` (less than or equal), and `!=` (not equal to). \index{Boolean algebra}
+ Logical operators: `&` representing "and", `|` representing "or". Ex: `(2 + 1 == 3) & (2 + 1 == 4)` returns `FALSE` while `(2 + 1 == 3) | (2 + 1 == 4)` returns `TRUE`. \index{operators!logical}
* Functions: Functions take in inputs (called *arguments*) and return outputs. You either manually specify a function's arguments or use the function's *defaults*. \index{functions}
This list is by no means an exhaustive list of all the programming concepts and terminology needed to become a savvy R user; such a list would be so large it wouldn't be very useful, especially for novices. Rather, we feel this is the bare minimum you need to know before you get started; the rest we feel you can learn as you go. Remember that your knowledge of all of these concepts will build as you get better and better at "speaking R" and getting used to its syntax.
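As a minimal sketch of the concepts listed above, you can type the following lines into the Console one at a time. The object names (`n_flights`, `values`, `origin`, `toy`) and the contents of the small data frame are made up purely for illustration; only the vector `c(6, 11, 13, 31, 90, 92)` comes from the list above.

```r
# Objects: save a value under a name with the assignment operator <-
n_flights <- 3L                      # the L suffix makes this an integer
typeof(n_flights)                    # "integer"
typeof(2.5)                          # "double"
typeof(TRUE)                         # "logical"
typeof("JFK")                        # "character"

# Vectors: a series of values combined with c()
values <- c(6, 11, 13, 31, 90, 92)
length(values)                       # 6
mean(values)                         # 40.5

# Factors: categorical data
origin <- factor(c("EWR", "JFK", "LGA", "JFK"))
levels(origin)                       # "EWR" "JFK" "LGA"

# Data frames: rows are observations, columns are variables
toy <- data.frame(
  origin = c("EWR", "JFK", "LGA"),
  value  = c(6, 11, 13)
)
nrow(toy)                            # 3
ncol(toy)                            # 2

# Conditionals and logical operators
2 + 1 == 3                           # TRUE
(2 + 1 == 3) & (2 + 1 == 4)          # FALSE
(2 + 1 == 3) | (2 + 1 == 4)          # TRUE

# Functions: round() uses its default digits = 0 unless you specify otherwise
round(3.14159)                       # 3
round(3.14159, digits = 2)           # 3.14
```

Try the copy, paste, and tweak approach here: change a value or two and see how the output changes.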
### Errors, warnings, and messages
One slightly confusing part of R is how it reports errors, warnings, and messages. The default theme in RStudio colors errors, warnings, and messages in red, which makes them seem like you did something wrong. However, seeing red text in the console *is not always bad.*
R will show red text in the console in three different situations:
* **Errors**: \index{R!errors} When the red text is a legitimate error, it will be prefaced with "Error in…" and try to explain what went wrong. Generally when there's an error, the code will not run. For example, as shown in Subsection \@ref(package-use) if you see `Error in ggplot(...) : could not find function "ggplot"`, it means that the `ggplot()` function is not accessible because the package was not loaded with `library(ggplot2)`, and thus you cannot use it.
* **Warnings**: \index{R!warnings} When the red text is a warning, it will be prefaced with "Warning:" and try to explain why there's a warning. Generally your code will still work, but with some caveats. For example, as you will see in Chapter \@ref(viz), if you plot a scatterplot and one of the rows in your data frame is missing a value, you will see this warning: `Warning: Removed 1 rows containing missing values (geom_point)`. R will still make the scatterplot with all the remaining values, but it's warning you that one of the points isn't there.
* **Messages**: \index{R!messages} When the red text doesn't start with either "Error" or "Warning", it's *just a friendly message*. You'll see these messages when you load some packages like the `dplyr` package in Subsection \@ref(package-loading), or when you read data saved in spreadsheet files with `read_csv()` as you'll see in Chapter \@ref(tidy). These are helpful diagnostic messages and they don't stop your code from working.
Remember, when you see red text in the console, *don't panic*. It doesn't necessarily mean anything is wrong.
* If the text starts with "Error", figure out what's causing it. <span style="color:red">Think of errors as a red traffic light: something is wrong!</span>
* If the text starts with "Warning", figure out if it's something to worry about. For instance, if you get a warning about missing values in a scatterplot and you know there are missing values, you're fine. If that's surprising, look at your data and see what's missing. <span style="color:gold">Think of warnings as a yellow traffic light: everything is working fine, but watch out/pay attention.</span>
* Otherwise the text is just a message. Read it, wave back at R, and thank it for talking to you. <span style="color:green">Think of messages as a green traffic light: everything is working fine.</span>
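To see the three kinds of red text side by side, here is a small sketch using only functions that come with R. The helper `classify()` is a hypothetical name we made up for this illustration; it runs a piece of code and reports which kind of signal, if any, R raised.

```r
# Hypothetical helper: run some code and report which kind of condition occurred.
# The argument is only evaluated inside tryCatch(), so its conditions are caught.
classify <- function(expr) {
  tryCatch(
    {
      expr
      "ran without any red text"
    },
    error   = function(e) "error",
    warning = function(w) "warning",
    message = function(m) "message"
  )
}

classify(message("Just saying hello!"))  # a friendly message
classify(as.numeric("three"))            # a warning: text cannot become a number
classify(log("a"))                       # an error: log() needs a number
classify(1 + 1)                          # nothing to report
```

Run the inner expressions on their own, for example `as.numeric("three")`, to see what the actual red text looks like in your Console.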
### Tips on learning to code
Learning to code/program is very much like learning a foreign language: it can be very daunting and frustrating at first. Such frustrations are very common and it is very normal to feel discouraged as you learn. However, just as with learning a foreign language, if you put in the effort and are not afraid to make mistakes, anybody can learn.
Here are a few useful tips to keep in mind as you learn to program:
* **Remember that computers are not actually that smart**: You may think your computer or smartphone is "smart," but really people spent a lot of time and energy designing it to appear "smart." In reality, you have to tell a computer everything it needs to do. Furthermore, the instructions you give your computer can't have any mistakes in them, nor can they be ambiguous in any way.
* **Take the "copy, paste, and tweak" approach**: Especially when learning your first programming language, it is often much easier to take existing code that you know works and modify it to suit your ends, rather than trying to write new code from scratch. We call this the *copy, paste, and tweak* approach. So early on, we suggest not trying to write code from memory, but rather taking existing examples we have provided you, then copying, pasting, and tweaking them to suit your goals. Don't be afraid to play around!
* **The best way to learn to code is by doing**: Rather than learning to code for its own sake, we feel that learning to code goes much smoother when you have a goal in mind or when you are working on a particular project, like analyzing data that you are interested in.
* **Practice is key**: Just as the only way to improve your foreign language skills is through practice, practice, and practice, the only way to improve your coding is through practice, practice, and practice. Don't worry, however; we'll give you plenty of opportunities to do so!
## What are R packages? {#packages}
Another point of confusion with many new R users is the idea of an R package. R packages \index{R!packages} extend the functionality of R by providing additional functions, data, and documentation. They are written by a world-wide community of R users and can be downloaded for free from the internet. For example, among the many packages we will use in this book are:
* The `ggplot2` package for data visualization in Chapter \@ref(viz). \index{R packages!ggplot2}
* The `dplyr` package for data wrangling in Chapter \@ref(wrangling). \index{R packages!dplyr}
* The `moderndive` package that accompanies this book. \index{R packages!moderndive}
* The `infer` package \index{R packages!infer} for "tidy" and transparent statistical inference in Chapters \@ref(confidence-intervals), \@ref(hypothesis-testing), and \@ref(inference-for-regression).
A good analogy for R packages \index{R packages} is they are like apps you can download onto a mobile phone:
<!--
R: A new phone | R Packages: Apps you can download
:-------------------------:|:-------------------------:
{ height=1.5in } | { height=1.5in }
-->
```{r R-vs-R-packages, echo=FALSE, fig.align='center', fig.cap="Analogy of R versus R packages.", out.width='90%', purl=FALSE}
knitr::include_graphics("images/R_vs_R_packages.png")
```
So R is like a new mobile phone: while it has a certain number of features when you use it for the first time, it doesn't have everything. R packages are like the apps you can download onto your phone from Apple's App Store or Android's Google Play.
Let's continue this analogy by considering the Instagram app for editing and sharing pictures. Say you have purchased a new phone and you would like to share a recent photo you have taken on Instagram. You need to:
1. *Install the app*: Since your phone is new and does not include the Instagram app, you need to download the app from either the App Store or Google Play. You do this once and you're set. You might do this again in the future any time there is an update to the app.
1. *Open the app*: After you've installed Instagram, you need to open the app.
Once Instagram is open on your phone, you can then proceed to share your photo with your friends and family. The process is very similar for using an R package. You need to:
1. *Install the package*: This is like installing an app on your phone. Most packages are not installed by default when you install R and RStudio. Thus if you want to use a package for the first time, you need to install it first. Once you've installed a package, you likely won't install it again unless you want to update it to a newer version.
1. *"Load" the package*: "Loading" a package is like opening an app on your phone. Packages are not "loaded" by default when you start RStudio on your computer; you need to "load" each package you want to use every time you start RStudio.
Let's now show you how to perform these two steps for the `ggplot2` package for data visualization.
### Package installation {#package-installation}
> **Note about RStudio Server**: If your instructor has provided you with a link and access to RStudio Server, you probably will not need to install packages, as they have likely been pre-installed for you by your instructor. That being said, it is still a good idea to know this process for later on when you are not using RStudio Server, but rather RStudio Desktop on your own computer.
There are two ways to install an R package. \index{R packages!installation} For example, to install the `ggplot2` package:
1. **Easy way**: In the Files pane of RStudio:
a) Click on the "Packages" tab.
a) Click on "Install" next to Update.
a) Type the name of the package under "Packages (separate multiple with space or comma):" In this case, type `ggplot2`.
a) Click "Install".
```{r echo=FALSE, fig.align='center', fig.cap="Installing packages in R the easy way.", out.width=ifelse(knitr:::is_latex_output(), '5in', '70%'), purl=FALSE}
knitr::include_graphics("images/install_packages_easy_way.png")
```
1. **Slightly harder way**: An alternative but slightly less convenient way to install a package is by typing `install.packages("ggplot2")` in the Console pane of RStudio and pressing Return/Enter on your keyboard. Note you must include the quotation marks.
Much like an app on your phone, you only have to install a package once. However, if you want to update an already installed package to a newer version, you need to re-install it by repeating the earlier steps.
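Because you only need to install a package once, it can be handy to check what is already installed before installing anything. The following is one possible sketch, not the only way to do it. The helper name `not_installed()` is our own invention, while `requireNamespace()` and `install.packages()` are standard R functions. The installation line is commented out so that nothing is downloaded until you choose to run it.

```r
# Hypothetical helper: return the packages in `pkgs` that are not yet installed
not_installed <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages used in this chapter
needed <- c("ggplot2", "dplyr", "nycflights13", "knitr")
to_install <- not_installed(needed)
to_install

# Uncomment the next line to install whatever is missing (needs internet access)
# if (length(to_install) > 0) install.packages(to_install)
```

If `to_install` prints `character(0)`, all four packages are already installed.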
```{block lc2-0, type='learncheck'}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Repeat the earlier installing steps, but for the `dplyr`, `nycflights13`, and `knitr` packages. This will install the earlier mentioned `dplyr` package, the `nycflights13` package containing data on all domestic flights leaving a NYC airport in 2013, and the `knitr` package for writing reports in R.
```{block, type='learncheck'}
```
### Package loading {#package-loading}
Recall that after you've installed a package, you need to "load" it. In other words, open it. We do this by using the `library()` command. \index{R packages!loading} For example, to load the `ggplot2` package, run the following code in the Console pane. What do we mean by "run the following code"? Either type or copy & paste the following code into the Console pane and then hit the enter key.
```{r, eval=FALSE}
library(ggplot2)
```
If after running the earlier code, a blinking cursor returns next to the `>` "prompt" sign, it means you were successful and the `ggplot2` package is now loaded and ready to use. If however, you get a red "error message" that reads... \index{R packages!loading error}
```
Error in library(ggplot2) : there is no package called ‘ggplot2’
```
... it means that you didn't successfully install it. In that case, go back to the previous subsection "Package installation" and install it.
```{block lc2-1, type='learncheck'}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** "Load" the `dplyr`, `nycflights13`, and `knitr` packages as well by repeating the earlier steps.
```{block, type='learncheck'}
```
### Package use {#package-use}
One extremely common mistake new R users make when wanting to use particular packages is they forget to "load" them first by using the `library()` command we just saw. Remember: *you have to load each package you want to use every time you start RStudio.* If you don't first "load" a package, but attempt to use one of its features, you'll see an error message similar to:
```
Error: could not find function
```
R is telling you that you are trying to use a function in a package that has not yet been "loaded." R doesn't know where to find the function you are using. Almost all new users forget to do this when starting out, and it is a little annoying to get used to doing it. However, you'll remember with practice.
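You can reproduce this situation safely with a package that ships with R itself. In the sketch below we use the `tools` package and its `toTitleCase()` function as a stand-in for `ggplot2`; that choice is ours, made only because `tools` is already installed with R but is normally not loaded when R starts. We wrap the first call in `tryCatch()` so that the error text is saved instead of stopping the code.

```r
# Before loading: R cannot find a function from a package that is not loaded
before <- tryCatch(
  toTitleCase("getting started"),
  error = function(e) conditionMessage(e)
)
before

# Load the package, then try again
library(tools)
after <- toTitleCase("getting started")
after

# search() lists everything currently loaded and available by name
"package:tools" %in% search()
```

The same pattern applies to `ggplot2`: `ggplot()` becomes available only after you run `library(ggplot2)` in your current session.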
## Explore your first datasets {#nycflights13}
Let's put everything we've learned so far into practice and start exploring some real data! Data comes to us in a variety of formats, from pictures to text to numbers. Throughout this book, we'll focus on datasets that are saved in "spreadsheet"-type format; this is probably the most common way data are collected and saved in many fields. Remember from Subsection \@ref(programming-concepts) that these "spreadsheet"-type datasets are called _data frames_ in R; \index{data frames} we will focus on working with data saved as data frames throughout this book.
Let's first load all the packages needed for this chapter, assuming you've already installed them. Read Section \@ref(packages) for information on how to install and load R packages if you haven't already.
```{r message=FALSE}
library(nycflights13)
library(dplyr)
library(knitr)
```
At the beginning of all subsequent chapters in this text, we'll always have a list of packages that you should have installed and loaded to work with that chapter's R code.
### `nycflights13` package
Many of us have flown on airplanes or know someone who has. Air travel has become an ever-present aspect in many people's lives. If you live in or are visiting a relatively large city and you walk around that city's airport, you see gates showing flight information from many different airlines. And you will frequently see that some flights are delayed because of a variety of conditions. Are there ways that we can avoid having to deal with these flight delays?
We'd all like to arrive at our destinations on time whenever possible. (Unless you secretly love hanging out at airports. If you are one of these people, pretend for the moment that you are very much anticipating being at your final destination.) Throughout this book, we're going to analyze data related to flights contained in the `nycflights13` \index{R packages!nycflights13} package. Specifically, this package contains five data sets saved in five separate data frames with information about all domestic flights that departed in 2013 from New York City's three airports: Newark Liberty International (EWR), John F. Kennedy International (JFK), and LaGuardia (LGA). The five data frames are:
* `flights`: Information on all `r scales::comma(nrow(nycflights13::flights))` flights
* `airlines`: A table matching airline names and their two letter IATA airline codes (also known as carrier codes) for `r nrow(nycflights13::airlines)` airline companies
* `planes`: Information about each of `r scales::comma(nrow(nycflights13::planes))` physical aircraft used.
* `weather`: Hourly meteorological data for each of the three NYC airports. This data frame has `r scales::comma(nrow(nycflights13::weather))` rows, roughly corresponding to the 365 $\times$ 24 $\times$ 3 = 26,280 possible hourly measurements one can observe at three locations over the course of a year.
* `airports`: Airport names, codes, and locations for `r scales::comma(nrow(nycflights13::airports))` destination airports.
### `flights` data frame
We will begin by exploring the `flights` data frame that is included in the `nycflights13` package and getting an idea of its structure. Run the following code in your console (either by typing it or copying & pasting it): it prints the contents of the `flights` data frame in your Console. Note that depending on the size of your monitor, the output may vary slightly.
```{r load_flights}
flights
```
Let's unpack this output:
* `A tibble: 336,776 x 19`: A `tibble` is a kind of data frame used in R. This particular data frame has
+ `336,776` rows
+ `19` columns corresponding to 19 variables describing each observation
* `year month day dep_time sched_dep_time dep_delay arr_time` are different columns, in other words variables, of this data frame.
* We then have the first 10 rows of observations corresponding to 10 flights.
* `... with 336,766 more rows, and 11 more variables:` indicating to us that 336,766 more rows of data and 11 more variables could not fit in this screen.
Unfortunately, this output does not allow us to explore the data very well. Let's look at different tools to explore data frames.
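If you only want the numbers from the first line of that output, a few base R functions return them directly. This is a small sketch assuming the `nycflights13` package is installed; `dim()`, `nrow()`, `ncol()`, and `names()` all come with R.

```r
library(nycflights13)

dim(flights)        # number of rows, then number of columns
nrow(flights)       # number of rows (flights)
ncol(flights)       # number of columns (variables)
names(flights)      # the names of all the variables

# Rows left over when only the first 10 are printed
nrow(flights) - 10
```

Note that `names(flights)` shows every variable name, including the ones that did not fit on your screen.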
### Exploring data frames {#exploredataframes}
Among the many ways of getting a feel for the data contained in a data frame such as `flights`, we present three functions that take as their "argument", in other words their input, the data frame in question. We also include a fourth method for exploring one particular column of a data frame:
1. Using the `View()` function built for use in RStudio. We will use this the most.
1. Using the `glimpse()` function, which is included in the `dplyr` package.
1. Using the `kable()` function, which is included in the `knitr` package.
1. Using the `$` extraction operator to view a single variable in a data frame.
**1. `View()`**:
Run `View(flights)` \index{R packages!utils!View()} in your Console in RStudio, either by typing it or copying & pasting it into the Console pane, and explore this data frame in the resulting pop-up viewer. You should get into the habit of always `View`ing any data frames that come your way. Note the capital "V" in `View`. R is case-sensitive so you'll receive an error if you run `view(flights)` instead of `View(flights)`.
```{block lc2-2, type='learncheck'}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What does any *ONE* row in this `flights` dataset refer to?
- A. Data on an airline
- B. Data on a flight
- C. Data on an airport
- D. Data on multiple flights
```{block, type='learncheck'}
```
By running `View(flights)`, we see the different *variables* listed in the columns and we see that there are different types of variables. Some of the variables like `distance`, `day`, and `arr_delay` are what we will call *quantitative* variables. \index{quantitative} These variables are numerical in nature. Other variables here are \index{categorical} *categorical*.
Note that if you look in the leftmost column of the `View(flights)` output, you will see a column of numbers. These are the row numbers of the dataset. If you glance across a row with the same number, say row 5, you can get an idea of what each row corresponds to. In other words, this will allow you to identify what object is being referred to in a given row. This is often called the *observational unit*. The observational unit in this example is an individual flight departing New York City in 2013. You can identify the observational unit by determining what "thing" is being measured or described by each of the variables. We'll talk more about observational units in Section \@ref(identification-vs-measurement-variables) on *identification* and *measurement* variables.
**2. `glimpse()`**:
The second way to explore a data frame is using the `glimpse()` function \index{dplyr!glimpse()} included in the \index{dplyr|seealso{R packages!dplyr}} `dplyr` package. Thus, you can only use the `glimpse()` function after you've loaded the `dplyr` package. This function provides us with an alternative to the `View()` function for exploring a data frame:
```{r}
glimpse(flights)
```
We see that `glimpse()` will give you the first few entries of each variable in a row after the variable name. In addition, the *data type* (see Subsection \@ref(programming-concepts)) of the variable is given immediately after each variable's name inside `< >`. Here, `int` and `dbl` refer to "integer" and "double", which are computer coding terminology for quantitative/numerical variables. In contrast, `chr` refers to "character", which is computer terminology for text data. Text data, such as the `carrier` or `origin` of a flight, are categorical variables. The `time_hour` variable is an example of one more data type: `dttm`. As you may suspect, this variable corresponds to a specific date and time of day. However, we won't work with dates in this book and leave that topic to a more advanced book on data science.
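You can also ask R for the type of one variable at a time. The following sketch assumes the `nycflights13` package is installed and uses the base R function `class()`. Note that `class()` uses somewhat different labels than `glimpse()`: for example, it reports `dbl` columns as `"numeric"`.

```r
library(nycflights13)

class(flights$day)        # an int column
class(flights$distance)   # a dbl column, reported as "numeric"
class(flights$carrier)    # a chr column
class(flights$time_hour)  # a dttm column, reported with two labels

# The first class label of every column at once
vapply(flights, function(column) class(column)[1], character(1))
```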
```{block lc2-3, type='learncheck'}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What are some examples in this dataset of **categorical** variables? What makes them different than **quantitative** variables?
```{block, type='learncheck'}
```
**3. `kable()`**:
The final way to explore the entirety of a data frame is using the `kable()` \index{knitr!kable()} function from the \index{knitr|seealso{R packages!knitr}} `knitr` package. Let's explore the different carrier codes for all the airlines in our dataset two ways. Run both of these lines of code in your Console:
```{r eval=FALSE}
airlines
kable(airlines)
```
At first glance, it may not appear that there is much difference in the outputs. However when using tools for document production such as [R Markdown](http://rmarkdown.rstudio.com/lesson-1.html), the latter code produces output that is much more legible and reader-friendly.
**4. `$` operator**
Lastly, the `$` operator \index{operators!dollar sign} allows us to extract and then explore a single variable within a data frame. For example, run the following in your console:
```{r eval=FALSE}
airlines
airlines$name
```
We used the `$` operator to extract only the `name` variable and return it as a vector of length 16. We will only be occasionally exploring data frames using this operator, instead favoring the `View()` and `glimpse()` functions.
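Once a variable has been extracted with `$`, you can hand it to other functions just like any vector. Here is a short sketch, assuming `nycflights13` is installed; the object name `carrier_names` is our own choice.

```r
library(nycflights13)

# Extract one variable and save it as its own object
carrier_names <- airlines$name

length(carrier_names)     # how many airline names there are
head(carrier_names, 3)    # the first three names
class(carrier_names)      # a character vector

# The distinct departure airports in the flights data frame
sort(unique(flights$origin))
```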
### Identification & measurement variables {#identification-vs-measurement-variables}
There is a subtle difference between the kinds of variables that you will encounter in data frames: *identification variables* and *measurement variables*. For example, let's explore the `airports` data frame by showing the output of `glimpse(airports)`:
```{r}
glimpse(airports)
```
The variables `faa` and `name` are what we will call *identification variables*: variables that uniquely identify each observational unit. They are mainly used to provide a unique name to each observational unit (i.e., each row), thereby allowing us to uniquely identify them. `faa` gives the unique code provided by the FAA for that airport, while the `name` variable gives the longer, more natural name of the airport. The remaining variables (`lat`, `lon`, `alt`, `tz`, `dst`, `tzone`) are often called *measurement* or *characteristic* variables: variables that describe properties of each observational unit, in other words each observation in each row. For example, `lat` and `lon` describe the latitude and longitude of each airport.
Furthermore, sometimes a single variable might not be enough to uniquely identify each observational unit: combinations of variables might be needed. While it is not an absolute rule, for organizational purposes it is considered good practice to have your identification variables in the left-most columns of your data frame.
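One way to check whether a variable, or a combination of variables, uniquely identifies each row is to look for duplicates. The sketch below uses a tiny made-up data frame (`toy_flights`, with invented values) and a hypothetical helper `is_unique_id()` built on the base R function `anyDuplicated()`, which returns `0` when there are no repeated rows.

```r
# A made-up data frame: identification variables in the left-most columns
toy_flights <- data.frame(
  carrier  = c("AA", "AA", "UA", "UA"),
  flight   = c(1, 2, 1, 2),
  distance = c(1089, 2475, 1400, 719)
)

# Hypothetical helper: TRUE if the chosen columns never repeat across rows
is_unique_id <- function(df, cols) {
  anyDuplicated(df[cols]) == 0
}

is_unique_id(toy_flights, "carrier")               # FALSE: "AA" appears twice
is_unique_id(toy_flights, "flight")                # FALSE: 1 appears twice
is_unique_id(toy_flights, c("carrier", "flight"))  # TRUE: the pair is unique
```

In this toy example neither `carrier` nor `flight` works as an identification variable on its own, but the two together do.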
```{block lc3-3c, type='learncheck'}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What properties of the observational unit do each of `lat`, `lon`, `alt`, `tz`, `dst`, and `tzone` describe for the `airports` data frame? Note that you may want to use `?airports` to get more information.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Provide the names of variables in a data frame with at least three variables in which one of them is an identification variable and the other two are not. In other words, create your own tidy data frame that matches these conditions.
```{block, type='learncheck', purl=FALSE}
```
### Help files
Another nice feature of R is the help system. You can get help in R by entering a `?` \index{operators!?} before the name of a function or data frame in question and you will be presented with a page showing the documentation. For example, let's look at the help file for the `flights` data frame:
```{r eval=FALSE}
?flights
```
A help file should pop up in the Help pane of RStudio. If you have questions about a function or data frame included in an R package, you should get in the habit of consulting the help file right away.
## Conclusion
We've given you what we feel are the most essential concepts to know before you can start exploring data in R. Is this chapter exhaustive? Absolutely not. To try to include everything in this chapter would make the chapter so large it wouldn't be useful!
### Additional resources
If you are completely new to the world of coding, R, and RStudio and feel you could benefit from a more detailed introduction, we suggest you check out ModernDive co-author Chester Ismay's ["Getting used to R, RStudio, and R Markdown"](https://rbasics.netlify.com/) short book [@usedtor2016], which includes screencast recordings that you can follow along and pause as you learn. Furthermore, there is an introduction to R Markdown, a tool used for reproducible research in R.
```{r echo=FALSE, fig.align='center', fig.cap="Preview of 'Getting Used to R, RStudio, and R Markdown'.", out.height=if(knitr:::is_latex_output()) '2.5in', purl=FALSE}
knitr::include_graphics("images/getting-used-to-R.png")
```
### What's to come?
As we stated earlier, however, the best way to learn R is to learn by doing. We now start the "data science" portion of the book in Chapter \@ref(viz) with what we feel is the most important tool in a data scientist's toolbox: data visualization. We will continue to explore the data included in the `nycflights13` package through data visualization. We'll see that data visualization is a powerful tool for exploring data that provides insight beyond what the `View()` and `glimpse()` functions can provide.
```{r echo=FALSE, fig.cap="ModernDive flowchart.", out.width='110%', fig.align='center'}
knitr::include_graphics("images/flowcharts/flowchart/flowchart.004.png")
```