Getting ready to release Version 0.1.1
ismayc committed Jan 11, 2017
1 parent ef633bf commit 42f755d
Showing 5 changed files with 119 additions and 21 deletions.
75 changes: 64 additions & 11 deletions 03-tidy.Rmd
@@ -12,7 +12,7 @@ lc <- 0
rq <- 0
# **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`**
# **`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`**
knitr::opts_chunk$set(tidy = FALSE, out.width='\\textwidth')
knitr::opts_chunk$set(tidy = FALSE, out.width = '\\textwidth')
```


@@ -76,6 +76,20 @@ Reading over this definition, you can begin to think about datasets that won't f
+ What features of this dataset might make it difficult to visualize?
+ How could the dataset be tweaked to make it **tidy**?

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Say the following table contains stock prices. How would you make it tidy?

```{r echo=FALSE}
library(dplyr)
data.frame(
time = as.Date('2009-01-01') + 0:4,
x = round(rnorm(5, 0, 1), 3),
y = round(rnorm(5, 0, 2), 3),
z = round(rnorm(5, 0, 4), 3)
) %>%
knitr::kable()
```
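
One possible way to approach this question is sketched below. It rests on several assumptions: the chunk above does not name its table, so `stocks` is a hypothetical name, and the new column names `stock` and `price` are illustrative choices. The sketch uses base R only and stacks the three price columns so that each row holds one stock on one date.

```r
# Hypothetical wide table shaped like the one in the learning check:
# one row per date, one column per stock.
set.seed(76)
stocks <- data.frame(
  time = as.Date("2009-01-01") + 0:4,
  x = round(rnorm(5, 0, 1), 3),
  y = round(rnorm(5, 0, 2), 3),
  z = round(rnorm(5, 0, 4), 3)
)

# Stack the price columns: each row becomes one stock on one date.
price_cols <- c("x", "y", "z")
tidy_stocks <- data.frame(
  time  = rep(stocks$time, times = length(price_cols)),
  stock = rep(price_cols, each = nrow(stocks)),
  price = unlist(stocks[price_cols], use.names = FALSE)
)

head(tidy_stocks)
```

In the long version, `time` and `stock` together identify each row, and `price` is the single measured value, which is the shape that plotting and grouping functions typically expect.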


***


@@ -154,12 +168,12 @@ glimpse(flights)

***

Another way to view the properties of a dataset is to use the `str` function ("str" is short for "structure"). The `str` function is expecting an object for its argument. In this case, the object is a data frame named `flights`. You can use the `str` function on other objects and data frames using the syntax `str(object)` where `object` is the name of an object in R. This will give you the first few entries of each variable in a row after the variable. In addition, the type of the variable is given immediately after the `:` following each variable's name. Here, `int` and `num` refer to quantitative variables. In contrast, `chr` refers to categorical variables. One more type of variable is given here with the `time_hour` variable: **POSIXct**. As you may suspect, this variable corresponds to a specific date and time of day.
We see that `glimpse` will give you the first few entries of each variable in a row after the variable. In addition, the type of the variable is given immediately after each variable's name inside `< >`. Here, `int` and `dbl` refer to quantitative variables. In contrast, `chr` refers to categorical variables. One more type of variable is given here with the `time_hour` variable: **dttm**. As you may suspect, this variable corresponds to a specific date and time of day.

Another nice feature of R is the help system. You can get help in R by simply entering a question mark before the name of a function or an object and you will be presented with a page showing the documentation. Note that this output help file is omitted here but can be accessed [here](https://cran.r-project.org/web/packages/nycflights13/nycflights13.pdf) on page 3 of the PDF document.

```{r eval=FALSE}
?str
?glimpse
?flights
```

@@ -187,7 +201,7 @@ By contrast, also included in the `nycflights13` package are datasets with diffe
- `airports`: airport names and locations
- `airlines`: translation between two letter carrier codes and names

You may have been asking yourself what `carrier` refers to in the `str(flights)` output above. The `airlines` dataset provides a description of this with each airline being the observational unit:
You may have been asking yourself what `carrier` refers to in the `glimpse(flights)` output above. The `airlines` dataset provides a description of this with each airline being the observational unit:

```{r}
data(airlines)
@@ -196,17 +210,58 @@ airlines

As can be seen here when you just enter the name of an object in R, by default it will print the contents of that object to the screen. Be careful! It's usually better to use the `View()` function in RStudio since larger objects may take a while to print to the screen, and hundreds of lines of output are unlikely to be helpful.

***

```{block lc3-3b, type='learncheck'}
**_Learning check_**
```

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Run the following block of code in R to load and view each of the four data frames in the `nycflights13` package. Switch between the different tabs that have opened to view each of the four data frames. Describe in two sentences for each data frame what stands out to you and what the most important features are of each.

```{r eval=FALSE}
data(weather)
data(planes)
data(airports)
data(airlines)
View(weather)
View(planes)
View(airports)
View(airlines)
```

***

### Identification variables

There is a subtle difference between the kinds of variables that you will encounter in data frames. The `airports` data frame you worked with above contains data in these different kinds. Let's pull them apart using the `glimpse` function:

```{r}
glimpse(airports)
```

The variables `faa` and `name` are what we will call *identification variables*. They are mainly used to provide a name to the observational unit. Here the observational unit is an airport and the `faa` gives the code provided by the FAA for that airport while the `name` variable gives the longer more natural name of the airport. These ID variables differ from the other variables that are often called *measurement* or *characteristic* variables. The remaining variables (aside from `faa` and `name`) are of this type in `airports`. They don't uniquely identify the observational unit, but instead describe properties of the observational unit. For organizational purposes, it is best practice to have your identification variables in the far leftmost columns of your data frame.
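
As a rough illustration of this distinction, the sketch below checks whether a column's values are all distinct. The data frame `mini_airports` and every value in it are hypothetical stand-ins, not rows from the real `airports` data; only the column names `faa`, `name`, and `tz` are borrowed from it.

```r
# Hypothetical miniature of an airports-style table (values are illustrative).
mini_airports <- data.frame(
  faa  = c("AAA", "BBB", "CCC"),
  name = c("Airport A", "Airport B", "Airport C"),
  tz   = c(-5, -5, -8)
)

# anyDuplicated() returns 0 when no value repeats.
# An identification variable should have no repeated values ...
id_is_unique <- anyDuplicated(mini_airports$faa) == 0

# ... whereas a measurement variable may repeat across observational units.
tz_is_unique <- anyDuplicated(mini_airports$tz) == 0

id_is_unique
tz_is_unique
```

A check like this only shows that a column could serve as an identifier in the data at hand; whether it is meant to identify the observational unit is still a question about how the data were collected.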

***

```{block lc3-3c, type='learncheck'}
**_Learning check_**
```

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What properties of the observational unit does each of `lat`, `lon`, `alt`, `tz`, `dst`, and `tzone` describe for the `airports` data frame?

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Provide the names of variables in a data frame with at least three variables in which one of them is an identification variable and the other two are not.

***

## Normal forms of data

The datasets included in the `nycflights13` package are in a form that minimizes redundancy of data. We will see that there are ways to _merge_ (or _join_) the different tables together easily. We are capable of doing so because each of the tables has _keys_ in common to relate one to another. This is an important property of **normal forms** of data. The process of decomposing data frames into less redundant tables without losing information is called **normalization**. More information is available on [Wikipedia](https://en.wikipedia.org/wiki/Database_normalization).

We saw an example of this above with the `airlines` dataset. While the `flights` data frame could also include a column with the names of the airlines instead of the carrier code, this would be repetitive since there is a unique mapping of the carrier code to the name of the airline/carrier.

Below an example is given showing how to **join** the `airlines` data frame together with the `flights` data frame by linking together the two datasets via a common **key** of `"carrier"`. Note that this "joined" data frame is assigned to a new data frame called `joined_flights`.
Below an example is given showing how to **join** the `airlines` data frame together with the `flights` data frame by linking together the two datasets via a common **key** of `"carrier"`. Note that this "joined" data frame is assigned to a new data frame called `joined_flights`. The **key** variable that we frequently join by is one of the *identification variables* mentioned above.

```{r message=FALSE}
if(!require(nycflights13))
install.packages("nycflights13", repos = "http://cran.rstudio.org")
library(dplyr)
joined_flights <- inner_join(x = flights, y = airlines, by = "carrier")
```
@@ -215,9 +270,7 @@ joined_flights <- inner_join(x = flights, y = airlines, by = "carrier")
View(joined_flights)
```

If we `View` this dataset, we see a new variable has been created called (We will see in Subsection 5.1.1 ways to change `name` to a more descriptive variable name.)

More discussion about joining data frames together will be given in Chapter \@ref(manip). We will see there that the names of the columns to be linked need not match as they did here with `"carrier"`.
If we `View` this dataset, we see a new variable has been created called `name`. (We will see in Subsection \@ref(rename) ways to change `name` to a more descriptive variable name.) More discussion about joining data frames together will be given in Chapter \@ref(manip). We will see there that the names of the columns to be linked need not match as they did here with `"carrier"`.
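
To give a rough idea of what a join on differently named columns can look like, here is a minimal sketch using `dplyr`. The two small data frames and all of their values are hypothetical; only the column names `dest` and `faa` are chosen to resemble the package's tables. A named character vector passed to `by` pairs a column in `x` with a column in `y`.

```r
library(dplyr)

# Hypothetical miniature tables; values are illustrative only.
mini_flights <- data.frame(
  flight = c(1, 2, 3),
  dest   = c("AAA", "BBB", "AAA"),
  stringsAsFactors = FALSE
)
mini_airports <- data.frame(
  faa  = c("AAA", "BBB"),
  name = c("Airport A", "Airport B"),
  stringsAsFactors = FALSE
)

# The named vector says: match `dest` in x against `faa` in y.
joined <- inner_join(x = mini_flights, y = mini_airports,
                     by = c("dest" = "faa"))
joined
```

The result keeps the key under its name from `x` (`dest`) and adds the remaining columns of `y`, so each flight row gains the matching airport `name`.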

***
***
@@ -265,5 +318,5 @@ kable(data_frame("role" = role, `Sociology?` = sociology,

## What's to come?

In Chapter \@ref(viz), we will further explore the distribution of a variable in a related dataset to `flights`: the `temp` variable in the `weather` dataset. We'll be interested in understanding how this variable varies in relation to the values of other variables in the dataset. We will see that visualization is often a powerful tool in helping us see what is going on in a dataset. It will be a useful way to expand on the `str` function we have seen here for tidy data.
In Chapter \@ref(viz), we will further explore the distribution of a variable in a related dataset to `flights`: the `temp` variable in the `weather` dataset. We'll be interested in understanding how this variable varies in relation to the values of other variables in the dataset. We will see that visualization is often a powerful tool in helping us see what is going on in a dataset. It will be a useful way to expand on the `glimpse` function we have seen here for tidy data.

8 changes: 5 additions & 3 deletions NEWS.md
@@ -1,9 +1,11 @@
# ModernDive 0.1.0.9000
# ModernDive 0.1.1.9000

# ModernDive 0.1.1

* Fixed the problems of chapter cross-references not working by removing the backticks in chapter names
+ Issue created on `bookdown` [here](https://github.com/rstudio/bookdown/issues/294)
* Looked for typos throughout all chapters EXCEPT Chapter 3
* Added [Inference Coggle diagram](https://coggle.it/diagram/Vxlydu1akQFeqo6-) to Appendix B
* Looked for typos throughout all chapters
* Added coggle diagrams to Chapter 4 and Appendix B
* Followed the same format of having a Conclusion section at the end of each chapter
* Fixed $T$ distribution plot with histogram in Chapter 7
+ May be weird issue with `cache = TRUE` that incorrectly plotted values on 1/10^th^ the correct scale
2 changes: 1 addition & 1 deletion _bookdown.yml
@@ -1,3 +1,3 @@
book_filename: "ismaykim"
output_dir: "docs-devel"
output_dir: "docs"
#chapter_name: "Chapter "
53 changes: 48 additions & 5 deletions bib/packages.bib
@@ -13,12 +13,19 @@ @Manual{R-bookdown
note = {R package version 0.3},
url = {https://CRAN.R-project.org/package=bookdown},
}
@Manual{R-broom,
title = {broom: Convert Statistical Analysis Objects into Tidy Data Frames},
author = {David Robinson},
year = {2016},
note = {R package version 0.4.1},
url = {https://CRAN.R-project.org/package=broom},
}
@Manual{R-devtools,
title = {devtools: Tools to Make Developing R Packages Easier},
author = {Hadley Wickham and Winston Chang},
note = {R package version 1.12.0.9000},
url = {https://github.com/hadley/devtools},
year = {2016},
note = {R package version 1.12.0},
url = {https://CRAN.R-project.org/package=devtools},
}
@Manual{R-dplyr,
title = {dplyr: A Grammar of Data Manipulation},
@@ -63,12 +70,40 @@ @Manual{R-knitr
note = {R package version 1.15.1},
url = {https://CRAN.R-project.org/package=knitr},
}
@Manual{R-lattice,
title = {lattice: Trellis Graphics for R},
author = {Deepayan Sarkar},
year = {2016},
note = {R package version 0.20-34},
url = {https://CRAN.R-project.org/package=lattice},
}
@Manual{R-Matrix,
title = {Matrix: Sparse and Dense Matrix Classes and Methods},
author = {Douglas Bates and Martin Maechler},
year = {2016},
note = {R package version 1.2-7.1},
url = {https://CRAN.R-project.org/package=Matrix},
}
@Manual{R-mosaic,
title = {mosaic: Project MOSAIC Statistics and Mathematics Teaching Utilities},
author = {Randall Pruim and Daniel T. Kaplan and Nicholas J. Horton},
year = {2016},
note = {R package version 0.14.4},
url = {https://github.com/ProjectMOSAIC/mosaic},
url = {https://CRAN.R-project.org/package=mosaic},
}
@Manual{R-mosaicData,
title = {mosaicData: Project MOSAIC Data Sets},
author = {Randall Pruim and Daniel Kaplan and Nicholas Horton},
year = {2016},
note = {R package version 0.14.0},
url = {https://CRAN.R-project.org/package=mosaicData},
}
@Manual{R-mvtnorm,
title = {mvtnorm: Multivariate Normal and t Distributions},
author = {Alan Genz and Frank Bretz and Tetsuhisa Miwa and Xuefei Mi and Torsten Hothorn},
year = {2016},
note = {R package version 1.0-5},
url = {https://CRAN.R-project.org/package=mvtnorm},
}
@Manual{R-nycflights13,
title = {nycflights13: Flights that Departed NYC in 2013},
@@ -78,11 +113,19 @@ @Manual{R-nycflights13
url = {https://CRAN.R-project.org/package=nycflights13},
}
@Manual{R-okcupiddata,
title = {okcupiddata: OkCupid Profile Data for Introductory Statistics and Data Science Courses},
title = {okcupiddata: OkCupid Profile Data for Introductory Statistics and Data
Science Courses},
author = {Albert Y. Kim and Adriana Escobedo-Land},
year = {2016},
note = {R package version 0.1.0},
url = {http://github.com/rudeboybert/okcupiddata},
url = {https://CRAN.R-project.org/package=okcupiddata},
}
@Manual{R-readr,
title = {readr: Read Tabular Data},
author = {Hadley Wickham and Jim Hester and Romain Francois},
year = {2016},
note = {R package version 1.0.0},
url = {https://CRAN.R-project.org/package=readr},
}
@Manual{R-rmarkdown,
title = {rmarkdown: Dynamic Documents for R},
2 changes: 1 addition & 1 deletion index.Rmd
@@ -17,7 +17,7 @@ description: "Combining statistical and computational thinking to make sense of
options(width = 72)
knitr::opts_chunk$set(tidy = FALSE, out.width='\\textwidth',
fig.align = "center")
version <- "0.1.0.9000"
version <- "0.1.1"
```

