Gaston Sanchez
- Get started with "dplyr"
- Get to know the basic dplyr verbs: slice(), filter(), select(), mutate(), arrange(), summarise(), group_by()
- Get started with "ggplot2"
- Produce basic plots with ggplot()
- Write your descriptions, explanations, and code in an Rmd (R markdown) file.
- Name this file as lab04-first-last.Rmd, where first and last are your first and last names (e.g. lab04-gaston-sanchez.Rmd).
- If you attend lab, to achieve substantial progress you have to complete the “Your Turn*” sections marked with a star.
- Knit your Rmd file as an html document (default option).
- Submit your Rmd and html files to bCourses, in the corresponding lab assignment.
So far you’ve been manipulating data tables (under the class of
"data.frame" objects) using bracket notation, dat[ , ], and the
dollar operator, dat$name, in order to select specific rows, columns,
or cells. In addition, you’ve created charts with functions like
plot(), boxplot(), barplot(), and hist(), which are part of the
"graphics" package.
In this lab, you will start learning about other approaches to
manipulate tables and create statistical graphics. We are going to use
the functionality of the package "dplyr" to work with tabular data in
a more syntactic way. This is a fairly recent package introduced a
couple of years ago, but it is based on more than a decade of research
and work led by Hadley Wickham.
Likewise, to create graphics in a more consistent and visually pleasing
way, we are going to use the package "ggplot2", also originally
authored by Hadley Wickham, and developed as part of his PhD more than a
decade ago.
Use the first hour of the lab to get as far as possible with the
material associated with "dplyr". Then use the second hour of the lab to
work on graphics with "ggplot2".
While you follow this lab, you may want to open these cheat sheets:
I’m assuming that you already installed the packages "dplyr" and
"ggplot2". If that’s not the case then run on the console the
command below (do NOT include this command in your Rmd):
# don't include this command in your Rmd file
# don't worry too much if you get a warning message
install.packages(c("dplyr", "ggplot2"))
Remember that you only need to install a package once! After a package
has been installed on your machine, there is no need to call
install.packages() again on the same package. What you should always
invoke in order to use the functions in a package is the library()
function:
# (include these commands in your Rmd file)
# don't forget to load the packages
library(dplyr)
library(ggplot2)
About loading packages: Another rule to keep in mind is to always
load any required packages at the very top of your script files (.R
or .Rmd or .Rnw files). Avoid calling the library() function in the
middle of a script. Instead, load all the packages before anything else.
The data file for this lab is: nba2018-players.csv.
To import the data in R you can use the base function read.csv(), or
you can also use read_csv() from the package "readr":
# with "base" read.csv()
dat <- read.csv('nba2018-players.csv', stringsAsFactors = FALSE)
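# with "readr" read_csv(); the package "readr" must be installed
# (the readr:: prefix calls the function without loading the package)
dat <- readr::read_csv('nba2018-players.csv')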
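If you want to see what read.csv() returns before pointing it at the real file, here is a minimal self-contained sketch. It writes three rows (values copied from the output printed in this lab, with only some of the columns) to a temporary file and reads them back. The names csv_text, tmp_csv, and small are made up for this illustration.

```r
# three rows copied from the lab's printed output (subset of the columns)
csv_text <- c(
  "player,team,position,height,weight",
  "Al Horford,BOS,C,82,245",
  "Amir Johnson,BOS,PF,81,240",
  "Avery Bradley,BOS,SG,74,180"
)

# hypothetical temporary file, used only in this example
tmp_csv <- tempfile(fileext = ".csv")
writeLines(csv_text, tmp_csv)

# same call as in the lab, but on the temporary file
small <- read.csv(tmp_csv, stringsAsFactors = FALSE)

# inspect the result: text columns stay as character vectors
str(small)
dim(small)
```

Running str() and dim() right after importing is a quick way to confirm that the number of rows and the column types are what you expect.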
To make the learning process of "dplyr" gentler, Hadley Wickham
proposes beginning with a set of five basic verbs or operations for
data frames (each verb corresponds to a function in "dplyr"):
- filter: keep rows matching criteria
- select: pick columns by name
- mutate: add new variables
- arrange: reorder rows
- summarise: reduce variables to values
These are the verbs or actions you were introduced to in last week’s lab. I’ve slightly modified Hadley’s list of verbs:
- filter(), slice(), and select(): subsetting and selecting rows and columns
- mutate(): add new variables
- arrange(): reorder rows
- summarise(): reduce variables to values
- group_by(): grouped (aggregate) operations
slice() allows you to select rows by position:
# first three rows
three_rows <- slice(dat, 1:3)
three_rows
## player team position height weight age experience
## 1 Al Horford BOS C 82 245 30 9
## 2 Amir Johnson BOS PF 81 240 29 11
## 3 Avery Bradley BOS SG 74 180 26 6
## college salary games minutes points points3
## 1 University of Florida 26540100 68 2193 952 86
## 2 12000000 80 1608 520 27
## 3 University of Texas at Austin 8269663 55 1835 894 108
## points2 points1
## 1 293 108
## 2 186 67
## 3 251 68
filter() allows you to select rows by condition:
# subset rows given a condition
# (height greater than 85 inches)
gt_85 <- filter(dat, height > 85)
gt_85
## player team position height weight age experience
## 1 Edy Tavares CLE C 87 260 24 1
## 2 Boban Marjanovic DET C 87 290 28 1
## 3 Kristaps Porzingis NYK PF 87 240 21 1
## 4 Roy Hibbert DEN C 86 270 30 8
## 5 Alexis Ajinca NOP C 86 248 28 6
## college salary games minutes points points3 points2
## 1 5145 1 24 6 0 3
## 2 7000000 35 293 191 0 72
## 3 4317720 66 2164 1196 112 331
## 4 Georgetown University 5000000 6 11 4 0 2
## 5 4600000 39 584 207 0 89
## points1
## 1 0
## 2 47
## 3 198
## 4 0
## 5 29
select() allows you to select columns by name:
# columns by name
player_height <- select(dat, player, height)
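The three verbs above can be combined. The sketch below is only an illustration on a tiny data frame typed by hand (five Warriors players, with values copied from the gsw table printed in this lab); the object names warriors, odd_rows, tall_light, and tall_light_names are made up for the example.

```r
library(dplyr)

# five Warriors players; values copied from the gsw table in this lab
warriors <- data.frame(
  player = c("David West", "JaVale McGee", "Kevon Looney",
             "Shaun Livingston", "Stephen Curry"),
  height = c(81, 84, 81, 79, 75),
  weight = c(250, 270, 220, 192, 190),
  stringsAsFactors = FALSE
)

# slice() accepts any vector of positions, e.g. rows 1, 3, 5
odd_rows <- slice(warriors, seq(1, 5, by = 2))

# conditions separated by commas are combined with "and"
tall_light <- filter(warriors, height > 80, weight < 260)

# filter first, then keep a single column
tall_light_names <- select(tall_light, player)
tall_light_names
```

Note that each verb returns a new data frame, so the output of one call can be passed as the first argument of the next.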
- use slice() to subset the data by selecting the first 5 rows.
- use slice() to subset the data by selecting rows 10, 15, 20, …, 50.
- use slice() to subset the data by selecting the last 5 rows.
- use filter() to subset those players with height less than 70 inches tall.
- use filter() to subset rows of Golden State Warriors ('GSW').
- use filter() to subset rows of GSW centers ('C').
- use filter() and then select(), to subset rows of Lakers ('LAL'), and then display their names.
- use filter() and then select(), to display the name and salary, of GSW point guards.
- find how to select the name, age, and team, of players with more than 10 years of experience, making 10 million dollars or less.
- find how to select the name, team, height, and weight, of rookie players, 20 years old, displaying only the first five occurrences (i.e. rows).
Another basic verb is mutate() which allows you to add new variables.
Let’s create a small data frame for the warriors with three columns:
player, height, and weight:
# creating a small data frame step by step
gsw <- filter(dat, team == 'GSW')
gsw <- select(gsw, player, height, weight)
gsw <- slice(gsw, c(4, 8, 10, 14, 15))
gsw
## player height weight
## 1 David West 81 250
## 2 JaVale McGee 84 270
## 3 Kevon Looney 81 220
## 4 Shaun Livingston 79 192
## 5 Stephen Curry 75 190
Now, let’s use mutate() to (temporarily) add a column with the ratio
height / weight:
mutate(gsw, height / weight)
## player height weight height/weight
## 1 David West 81 250 0.3240000
## 2 JaVale McGee 84 270 0.3111111
## 3 Kevon Looney 81 220 0.3681818
## 4 Shaun Livingston 79 192 0.4114583
## 5 Stephen Curry 75 190 0.3947368
You can also give a new name, like: ht_wt = height / weight:
mutate(gsw, ht_wt = height / weight)
## player height weight ht_wt
## 1 David West 81 250 0.3240000
## 2 JaVale McGee 84 270 0.3111111
## 3 Kevon Looney 81 220 0.3681818
## 4 Shaun Livingston 79 192 0.4114583
## 5 Stephen Curry 75 190 0.3947368
In order to permanently change the data, you need to assign the changes to an object:
gsw2 <- mutate(gsw, ht_m = height * 0.0254, wt_kg = weight * 0.4536)
gsw2
## player height weight ht_m wt_kg
## 1 David West 81 250 2.0574 113.4000
## 2 JaVale McGee 84 270 2.1336 122.4720
## 3 Kevon Looney 81 220 2.0574 99.7920
## 4 Shaun Livingston 79 192 2.0066 87.0912
## 5 Stephen Curry 75 190 1.9050 86.1840
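The difference between a temporary and a permanent change can be checked directly. The following sketch rebuilds gsw by hand (same values as the table above) so that it runs on its own. The column bmi and the object gsw_metric are hypothetical names used only here, to show that a column created inside mutate() can be reused later in the same call.

```r
library(dplyr)

# same five players as the gsw data frame of this lab, typed by hand
gsw <- data.frame(
  player = c("David West", "JaVale McGee", "Kevon Looney",
             "Shaun Livingston", "Stephen Curry"),
  height = c(81, 84, 81, 79, 75),
  weight = c(250, 270, 220, 192, 190),
  stringsAsFactors = FALSE
)

# mutate() returns a new data frame; gsw itself is not modified
mutate(gsw, ht_m = height * 0.0254)
names(gsw)   # still: player, height, weight

# columns created in a mutate() call can be used later in the same call
# (bmi is a hypothetical extra column, not part of the lab)
gsw_metric <- mutate(
  gsw,
  ht_m  = height * 0.0254,
  wt_kg = weight * 0.4536,
  bmi   = wt_kg / ht_m^2
)
gsw_metric
```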
The next basic verb of "dplyr" is arrange() which allows you to
reorder rows. For example, here’s how to arrange the rows of gsw by
height:
# order rows by height (increasingly)
arrange(gsw, height)
## player height weight
## 1 Stephen Curry 75 190
## 2 Shaun Livingston 79 192
## 3 David West 81 250
## 4 Kevon Looney 81 220
## 5 JaVale McGee 84 270
By default arrange() sorts rows in increasing order. To arrange rows
in descending order you need to use the auxiliary function desc().
# order rows by height (decreasingly)
arrange(gsw, desc(height))
## player height weight
## 1 JaVale McGee 84 270
## 2 David West 81 250
## 3 Kevon Looney 81 220
## 4 Shaun Livingston 79 192
## 5 Stephen Curry 75 190
# order rows by height, and then weight
arrange(gsw, height, weight)
## player height weight
## 1 Stephen Curry 75 190
## 2 Shaun Livingston 79 192
## 3 Kevon Looney 81 220
## 4 David West 81 250
## 5 JaVale McGee 84 270
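Sorting keys can also be mixed. As a small sketch (again with gsw typed by hand so that the code is self-contained, and with the made-up name mixed_order), here is how to sort by height in descending order while breaking ties by weight in increasing order:

```r
library(dplyr)

# same five players as the gsw data frame of this lab, typed by hand
gsw <- data.frame(
  player = c("David West", "JaVale McGee", "Kevon Looney",
             "Shaun Livingston", "Stephen Curry"),
  height = c(81, 84, 81, 79, 75),
  weight = c(250, 270, 220, 192, 190),
  stringsAsFactors = FALSE
)

# tallest first; players with the same height are ordered by weight
mixed_order <- arrange(gsw, desc(height), weight)
mixed_order
```

The two players listed at 81 inches are the only tie, so weight decides their relative order.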
- using the data frame gsw, add a new variable product with the product of height and weight.
- create a new data frame gsw3, by adding columns log_height and log_weight with the log transformations of height and weight.
- use the original data frame to filter() and arrange() those players with height less than 71 inches tall, in increasing order.
- display the name, team, and salary, of the top-5 highest paid players
- display the name, team, and points3, of the top 10 three-point players
- create a data frame gsw_mpg of GSW players, that contains variables for player name, experience, and min_per_game (minutes per game), sorted by min_per_game (in descending order)
The next verb is summarise(). Conceptually, this involves applying a
function on one or more columns, in order to summarize values. This is
probably easier to understand with one example.
Say you are interested in calculating the average salary of all NBA
players. To do this “a la dplyr” you use summarise(), or its synonym
function summarize():
# average salary of NBA players
summarise(dat, avg_salary = mean(salary))
## avg_salary
## 1 5804697
Calculating an average like this seems a bit verbose, especially when
you can directly use mean() like this:
mean(dat$salary)
## [1] 5804697
So let’s make things a bit more interesting. What if you want to
calculate some summary statistics for salary: min, median, mean, and
max?
# some stats for salary (dplyr)
summarise(
dat,
min = min(salary),
median = median(salary),
avg = mean(salary),
max = max(salary)
)
## min median avg max
## 1 5145 3e+06 5804697 30963450
Well, this may still not look like much. You can do the same in base R (there are actually better ways to do this):
# some stats for salary (base R)
c(min = min(dat$salary),
median = median(dat$salary),
avg = mean(dat$salary),
max = max(dat$salary))
## min median avg max
## 5145 3000000 5804697 30963450
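The lab does not say which "better ways" it has in mind. One candidate (an assumption on my part) is the base function summary(), sketched here on the five heights of the gsw table rather than on the full salary column, so that the code runs without the data file. The names height and stats are made up for the example.

```r
# heights of the five players in the gsw table of this lab
height <- c(81, 84, 81, 79, 75)

# summary() returns min, quartiles, median, mean, and max in one call
stats <- summary(height)
stats

# individual values can be extracted by name
stats[["Median"]]
stats[["Mean"]]
```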
To actually appreciate the power of summarise(), we need to introduce
the other major basic verb in "dplyr": group_by(). This is the
function that allows you to perform data aggregations, or grouped
operations.
Let’s see the combination of summarise() and group_by() to calculate
the average salary by team:
# average salary, grouped by team
summarise(
group_by(dat, team),
avg_salary = mean(salary)
)
## # A tibble: 30 x 2
## team avg_salary
## <chr> <dbl>
## 1 ATL 5494447.
## 2 BOS 6127673.
## 3 BRK 4011351.
## 4 CHI 5781368.
## 5 CHO 5531548.
## 6 CLE 7069699.
## 7 DAL 5157128.
## 8 DEN 4648719.
## 9 DET 6871632.
## 10 GSW 6265160.
## # … with 20 more rows
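To see what the grouping does on a scale small enough to check by hand, here is a sketch with eight players copied from the outputs printed earlier in this lab. The object names players and by_position are made up for the example; n() is the dplyr helper that counts the rows in each group.

```r
library(dplyr)

# eight players copied from the outputs printed earlier in this lab
players <- data.frame(
  player = c("Al Horford", "Amir Johnson", "Avery Bradley", "Edy Tavares",
             "Boban Marjanovic", "Kristaps Porzingis", "Roy Hibbert",
             "Alexis Ajinca"),
  position = c("C", "PF", "SG", "C", "C", "PF", "C", "C"),
  height = c(82, 81, 74, 87, 87, 87, 86, 86),
  stringsAsFactors = FALSE
)

# one output row per position: number of players and average height
by_position <- summarise(
  group_by(players, position),
  n_players  = n(),
  avg_height = mean(height)
)
by_position
```

With five centers, two power forwards, and one shooting guard, the result has three rows, one per group.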
Here’s a similar example with the average salary by position:
# average salary, grouped by position
summarise(
group_by(dat, position),
avg_salary = mean(salary)
)
## # A tibble: 5 x 2
## position avg_salary
## <chr> <dbl>
## 1 C 6529906.
## 2 PF 5801127.
## 3 PG 5601217.
## 4 SF 6042455.
## 5 SG 5114178.
Here’s a fancier example: average weight and height, by position, displayed in descending order by average height:
arrange(
summarise(
group_by(dat, position),
avg_height = mean(height),
avg_weight = mean(weight)),
desc(avg_height)
)
## # A tibble: 5 x 3
## position avg_height avg_weight
## <chr> <dbl> <dbl>
## 1 C 83.2 251.
## 2 PF 81.4 235.
## 3 SF 79.5 220.
## 4 SG 77.0 204.
## 5 PG 74.3 189.
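Nested calls like the one above have to be read from the inside out. An alternative that this lab does not use, shown here only as a sketch, is the pipe operator %>% that "dplyr" makes available: it passes the result on its left as the first argument of the function on its right. The example uses eight players copied from the outputs printed earlier in this lab; the names players, nested, and piped are made up.

```r
library(dplyr)

# eight players copied from the outputs printed earlier in this lab
players <- data.frame(
  position = c("C", "PF", "SG", "C", "C", "PF", "C", "C"),
  height = c(82, 81, 74, 87, 87, 87, 86, 86),
  weight = c(245, 240, 180, 260, 290, 240, 270, 248),
  stringsAsFactors = FALSE
)

# nested version, as written in the lab
nested <- arrange(
  summarise(
    group_by(players, position),
    avg_height = mean(height),
    avg_weight = mean(weight)),
  desc(avg_height)
)

# piped version: same steps, written in the order they are applied
piped <- players %>%
  group_by(position) %>%
  summarise(avg_height = mean(height), avg_weight = mean(weight)) %>%
  arrange(desc(avg_height))

piped
```

Both versions perform the same three steps (group, summarise, arrange), so they produce the same table.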
- use summarise() to get the largest height value.
- use summarise() to get the standard deviation of points3.
- use summarise() and group_by() to display the median of points3 (three-pointers), by team.
- display the average points3 by team, in ascending order, of the bottom-5 teams (worst three-point teams)
- obtain the mean and standard deviation of age, for Power Forwards, with between 5 and 10 years of experience (including 5 and 10).
The package "ggplot2" is probably the most popular package in R to
create beautiful static graphics. Compared to the functions in the
base package "graphics", the package "ggplot2" follows a somewhat
different philosophy, and it tries to be as consistent and modular as
possible.
- The main function in "ggplot2" is ggplot()
- The main input to ggplot() is a data frame object.
- You can use the internal function aes() to specify what columns of the data frame will be used for the graphical elements of the plot.
- You must specify what kind of geometric objects or geoms will be displayed: e.g. geom_point(), geom_bar(), geom_boxplot().
- Pretty much anything else that you want to add to your plot is controlled by auxiliary functions, especially those things that have to do with the format, rather than the underlying data.
- The construction of a ggplot is done by adding layers with the + operator.
Let’s start with a scatterplot of salary and points:
# scatterplot (option 1)
ggplot(data = dat) +
geom_point(aes(x = points, y = salary))
- ggplot() creates an object of class "ggplot"
- the main input for ggplot() is data which must be a data frame
- then we use the "+" operator to add a layer
- the geometric objects (geoms) are points: geom_point()
- aes() is used to specify the x and y coordinates, by taking columns points and salary from the data frame
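The steps listed above can be taken one at a time. The sketch below uses a three-row data frame (the first three players printed in this lab) so that it runs without the data file; the names bos, p, and p_points are made up for the example.

```r
library(ggplot2)

# three players copied from the first rows printed in this lab
bos <- data.frame(
  player = c("Al Horford", "Amir Johnson", "Avery Bradley"),
  position = c("C", "PF", "SG"),
  points = c(952, 520, 894),
  salary = c(26540100, 12000000, 8269663),
  stringsAsFactors = FALSE
)

# step 1: ggplot() alone stores the data; it has no layers yet
p <- ggplot(data = bos)

# step 2: adding a geom with "+" returns a new, extended object
p_points <- p + geom_point(aes(x = points, y = salary))

# both are ggplot objects; the plot is drawn when the object is printed
inherits(p, "ggplot")
inherits(p_points, "ggplot")
print(p_points)
```

Because a plot is just an object, you can store a base plot in a variable and add different layers to it later.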
The same scatterplot can also be created with this alternative, and more
common use of ggplot():
# scatterplot (option 2)
ggplot(data = dat, aes(x = points, y = salary)) +
geom_point()
When including code for plots and graphics, we strongly recommend that you create an individual code chunk for each plot, and that you give a label to that chunk. This is illustrated in the following screenshot.
Note that the code chunk has a label scatterplot1; moreover, the code
is exclusively dedicated to this plot. Why should you care? Because when
"knitr" creates the file of the plot, it will use the chunk label to
name the graph. So it’s better to give meaningful names to those chunks
containing graphics.
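In the Rmd source, a labeled chunk dedicated to a single plot could look like the sketch below. The label scatterplot1 is the one mentioned above, and the plotting code is the option 2 scatterplot from this lab. With knitr's default settings the image file is typically named after the label (something like scatterplot1-1.png), although the exact name and location depend on your output options.

````
```{r scatterplot1}
# scatterplot of points and salary
ggplot(data = dat, aes(x = points, y = salary)) +
  geom_point()
```
````

The label goes right after the letter r in the chunk header, separated by a space, and each label can be used only once per document.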
Say you want to color code the points in terms of position:
# colored scatterplot
ggplot(data = dat, aes(x = points, y = salary)) +
geom_point(aes(color = position))
Maybe you want to modify the size of the dots in terms of points3:
# sized and colored scatterplot
ggplot(data = dat, aes(x = points, y = salary)) +
geom_point(aes(color = position, size = points3))
To add some transparency effect to the dots, you can use the alpha
parameter.
# sized and colored scatterplot, with transparency
ggplot(data = dat, aes(x = points, y = salary)) +
geom_point(aes(color = position, size = points3), alpha = 0.7)
Notice that alpha was specified outside aes(). This is because we
are not using any column for the alpha transparency values.
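The same rule applies to every aesthetic: values that come from a column go inside aes(), while fixed values go outside. Here is a small sketch contrasting the two, using three players copied from the first rows printed in this lab. The names bos, p_mapped, and p_fixed are made up; layer_data() is a "ggplot2" helper that returns the values computed for a layer.

```r
library(ggplot2)

# three players copied from the first rows printed in this lab
bos <- data.frame(
  position = c("C", "PF", "SG"),
  points = c(952, 520, 894),
  salary = c(26540100, 12000000, 8269663),
  stringsAsFactors = FALSE
)

# mapping: color depends on a column, so it goes inside aes()
p_mapped <- ggplot(data = bos, aes(x = points, y = salary)) +
  geom_point(aes(color = position))

# setting: one fixed value for all points, so it goes outside aes()
p_fixed <- ggplot(data = bos, aes(x = points, y = salary)) +
  geom_point(color = "blue", alpha = 0.7)

# layer_data() shows the values that ggplot2 computed for each point
layer_data(p_mapped)$colour   # one color per position
layer_data(p_fixed)$colour    # the same fixed color for every point
layer_data(p_fixed)$alpha     # the same transparency for every point
```

A common mistake is writing aes(color = "blue"): that treats "blue" as a one-value category instead of a color name.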
- Open the ggplot2 cheatsheet
- Use the data frame gsw to make a scatterplot of height and weight.
- Find out how to make another scatterplot of height and weight, using geom_text() to display the names of the players.
- Get a scatter plot of height and weight, for ALL the warriors, displaying their names with geom_label().
- Get a density plot of salary (for all NBA players).
- Get a histogram of points2 with binwidth of 50 (for all NBA players).
- Get a barchart of the position frequencies (for all NBA players).
- Make a scatterplot of experience and salary of all Centers, and use geom_smooth() to add a regression line.
- Repeat the same scatterplot of experience and salary of all Centers, but now use geom_smooth() to add a loess line (i.e. smooth line).
One of the most attractive features of "ggplot2" is the ability to
display multiple facets. The idea of facets is to divide a plot into
subplots based on the values of one or more categorical (or discrete)
variables.
Here’s an example. What if you want to get scatterplots of points and
salary separated (or grouped) by position? This is where faceting
comes in handy, and you can use facet_wrap() for this purpose:
# scatterplot by position
ggplot(data = dat, aes(x = points, y = salary)) +
geom_point() +
facet_wrap(~ position)
The other faceting function is facet_grid(), which allows you to
control the layout of the facets (by rows, by columns, etc):
# scatterplot by position (facets arranged in columns)
ggplot(data = dat, aes(x = points, y = salary)) +
geom_point(aes(color = position), alpha = 0.7) +
facet_grid(~ position) +
geom_smooth(method = loess)
# scatterplot by position (facets arranged in rows)
ggplot(data = dat, aes(x = points, y = salary)) +
geom_point(aes(color = position), alpha = 0.7) +
facet_grid(position ~ .) +
geom_smooth(method = loess)
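The formula given to facet_grid() reads as rows ~ columns, and a dot means no faceting in that direction. The sketch below shows the three layouts side by side on eight players copied from the outputs printed earlier in this lab. The names players, base_plot, p_cols, p_rows, and p_wrap are made up for the example.

```r
library(ggplot2)

# eight players copied from the outputs printed earlier in this lab
players <- data.frame(
  position = c("C", "PF", "SG", "C", "C", "PF", "C", "C"),
  points = c(952, 520, 894, 6, 191, 1196, 4, 207),
  salary = c(26540100, 12000000, 8269663, 5145,
             7000000, 4317720, 5000000, 4600000),
  stringsAsFactors = FALSE
)

base_plot <- ggplot(data = players, aes(x = points, y = salary)) +
  geom_point()

# rows ~ columns: a dot means "no faceting in this direction"
p_cols <- base_plot + facet_grid(. ~ position)   # one column per position
p_rows <- base_plot + facet_grid(position ~ .)   # one row per position

# facet_wrap() fills a grid instead; ncol sets the number of columns
p_wrap <- base_plot + facet_wrap(~ position, ncol = 2)

print(p_rows)
```

facet_grid() is usually the better fit when you facet by two variables (one for rows, one for columns), while facet_wrap() is convenient for a single variable with many levels.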
- Make scatterplots of experience and salary faceting by position
- Make scatterplots of experience and salary faceting by team
- Make density plots of age faceting by team
- Make scatterplots of height and weight faceting by position
- Make scatterplots of height and weight, with a 2-dimensional density, geom_density2d(), faceting by position
- Make a scatterplot of experience and salary for the Warriors, but this time add a layer with theme_bw() to get a simpler background
- Repeat any of the previous plots but now adding a layer with another theme e.g. theme_minimal(), theme_dark(), theme_classic()