Michael J. Steinbaugh 2017-07-07
The tidyverse is a suite of integrated packages designed to make common operations performed in R more user friendly. This was initially developed by Hadley Wickham, Chief Scientist at RStudio, but is now maintained by a number of talented developers moving the R language forward.
The core collection of tidyverse packages are managed by the tidyverse CRAN package.
install.packages("tidyverse")
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
When you load the tidyverse
library, these core packages will be loaded into your environment:
There are a number of additional tidyverse packages that we highly recommend for performing data analysis, including:
- magrittr: Defines the pipe operator (
%>%
), which is used to write left-to-right chain operations. We'll cover this below. - stringr: Enables easier manipulation of vectors ("strings").
- readxl: Current recommended best practice for import of Excel workbooks.
When you load the tidyverse, you'll see messages about tidy package conflicts at the end. This is normal. Currently, dplyr masks stats::filter()
and stats::lag()
by default. What's happening here is that the tidyverse has some functions with the same names as base R packages.
Note: This remains a common issue when loading multiple libraries in a single script. For example, many Bioconductor packages have generic functions with the same name as base R and tidyverse packages. For example, biomaRt::select()
and dplyr::select()
have the same function name but require different arguments. Whichever library you load last will define the function name (select()
). If you need to use two packages with the same function name at the same time, you can reference them explicitly (e.g. dplyr::select()
). Therefore, when starting a new analysis using tidyverse packages, we highly recommend slotting library(tidyverse)
at the end of your list of libraries.
One problem with R is a lack of consistency across packages in how functions and arguments are named.
- Base R functions are formatted in dotted case:
read.csv()
. - tidyverse functions are formatted in snake_case:
read_csv()
. - Bioconductor functions are generally formatted in lowerCamelCase (and sometimes UpperCamelCase).
The tidyverse collection of packages are very opinionated in this regard and consistently use snake_case
formatting for all function names and arguments. When using these functions, we recommend that you follow the tidy style guide.
A core component of the tidyverse is the tibble. Tibbles are a modern rework of the standard data.frame
, with some internal improvements to make code more reliable. Most notably, tibbles will return a more reasonable number of rows to the console.
Before we begin the tidy workflow, coerce your counts, metadata, and DESeqResults to data frames:
counts <- as.data.frame(data)
meta <- as.data.frame(meta)
results <- as.data.frame(res_tableOE)
Now let's create tibbles from our data frames:
counts_tbl <- counts %>%
rownames_to_column("ensgene") %>%
as_tibble
meta_tbl <- meta %>%
rownames_to_column("sample_name") %>%
as_tibble %>%
rename(sample_type = sampletype,
mov_expression = MOVexpr) %>%
mutate(sample_name = tolower(sample_name))
results_tbl <- results %>%
rownames_to_column("symbol") %>%
as_tibble
First, try returning the counts
data frame in your console.
counts
Next, try returning the counts tibble.
counts_tbl
## # A tibble: 38,828 x 13
## ensgene sample1 sample2 sample3 sample4 sample5
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSMUSG00000000001 19.7848000 19.265000 20.889500 24.076700 23.7222000
## 2 ENSMUSG00000000003 0.0000000 0.000000 0.000000 0.000000 0.0000000
## 3 ENSMUSG00000000028 0.9377920 1.032290 0.892183 0.827891 0.8269540
## 4 ENSMUSG00000000031 0.0359631 0.000000 0.000000 0.000000 0.0000000
## 5 ENSMUSG00000000037 0.1514170 0.056033 0.146196 0.180883 0.0473238
## 6 ENSMUSG00000000049 0.2567330 0.258134 0.421286 2.191720 1.0730200
## 7 ENSMUSG00000000056 5.9998100 6.047990 6.020250 5.620120 6.4116300
## 8 ENSMUSG00000000058 5.2784700 3.971810 6.161450 7.045910 5.2136600
## 9 ENSMUSG00000000078 37.2141000 32.303000 31.249600 4.260570 3.5634100
## 10 ENSMUSG00000000085 25.4044000 22.950700 21.415600 16.825400 17.9712000
## # ... with 38,818 more rows, and 7 more variables: sample6 <dbl>,
## # sample7 <dbl>, sample8 <dbl>, sample9 <dbl>, sample10 <dbl>,
## # sample11 <dbl>, sample12 <dbl>
See how R only prints 10 rows instead of returning all 38k? This is much more user friendly.
Internally, a tibble is essentially a class variant of data.frame
, with some extra tibble (tbl
) magic baked in:
class(meta_tbl)
## [1] "tbl_df" "tbl" "data.frame"
glimpse()
is a modern rework of str()
, optimized for tibbles:
glimpse(counts_tbl)
glimpse(meta_tbl)
glimpse(results_tbl)
Important: tidyverse is very opininationed about row names. These packages insist that all column data (e.g. data.frame
) be treated equally, and that special designation of a column as rownames
should be deprecated. tibble provides simple utility functions to to handle rownames: rownames_to_column()
and column_to_rownames()
.
help("rownames", "tibble")
tidyverse packages improve code readability by changing how functions interpret object names. This is achieved through the use of "non-standard evaluation" instead of base R's "standard evaluation". This probably sounds confusing but is actually pretty simple. In fact, we've already used a function (subset()
) in the class that works with non-standard evaluation.
subset(meta_tbl, mov_expression == "high")
## # A tibble: 3 x 3
## sample_name sample_type mov_expression
## <chr> <chr> <chr>
## 1 mov10_oe_1 MOV10_overexpression high
## 2 mov10_oe_2 MOV10_overexpression high
## 3 mov10_oe_3 MOV10_overexpression high
Here's the tidyverse variant:
filter(meta_tbl, mov_expression == "high")
## # A tibble: 3 x 3
## sample_name sample_type mov_expression
## <chr> <chr> <chr>
## 1 mov10_oe_1 MOV10_overexpression high
## 2 mov10_oe_2 MOV10_overexpression high
## 3 mov10_oe_3 MOV10_overexpression high
See how both functions refer to the mov_expression
column directly and not in quotations? That's non-standard evaluation. It makes code easier to read.
The most useful tool in the tidyverse is dplyr. It's a swiss-army knife for data manipulation. dplyr has 5 core functions that we recommend incorporating into your analysis:
filter()
picks cases based on their values.arrange()
changes the ordering of the rows.select()
picks variables based on their names.mutate()
adds new variables that are functions of existing variablessummarise()
reduces multiple values down to a single summary.
Note: dplyr underwent a massive revision this year, switching versions from 0.5 to 0.7. If you consult other dplyr tutorials online, note that many materials developed prior to 2017 are no longer correct. In particular, this applies to writing functions with dplyr (see Notes section below).
Let's make a report tibble of our DESeqResults
.
First, we only need a few columns of interest from our results tibble. We can do this easily in dplyr with select()
.
report <- results_tbl %>%
select(symbol, baseMean, log2FoldChange, padj)
Conversely, you can remove columns you don't want with negative selection.
results_tbl %>%
select(-c(lfcSE, stat, pvalue))
## # A tibble: 23,368 x 4
## symbol baseMean log2FoldChange padj
## <chr> <dbl> <dbl> <dbl>
## 1 1/2-SBSRNA4 45.6520399 0.266586547 2.708964e-01
## 2 A1BG 61.0931017 0.208057615 3.638671e-01
## 3 A1BG-AS1 175.6658069 -0.051825739 7.837586e-01
## 4 A1CF 0.2376919 0.012557390 NA
## 5 A2LD1 89.6179845 0.343006364 7.652553e-02
## 6 A2M 5.8600841 -0.270449534 2.318666e-01
## 7 A2ML1 2.4240553 0.236041349 NA
## 8 A2MP1 1.3203237 0.079525469 NA
## 9 A4GALT 64.5409534 0.795049160 2.875565e-05
## 10 A4GNT 0.1912781 0.009458374 NA
## # ... with 23,358 more rows
Note that the rows are sorted by the gene symbol. Let's fix that and sort them by adjusted P value instead with arrange()
.
report <- arrange(report, padj)
report
## # A tibble: 23,368 x 4
## symbol baseMean log2FoldChange padj
## <chr> <dbl> <dbl> <dbl>
## 1 MOV10 21681.7998 4.7695983 0.000000e+00
## 2 H1F0 7881.0811 1.5250811 2.007733e-162
## 3 HSPA6 168.2522 4.4993734 1.969313e-134
## 4 HIST1H1C 1741.3830 1.4868361 5.116720e-101
## 5 TXNIP 5133.7486 1.3868320 4.882246e-90
## 6 NEAT1 21973.7061 0.9087853 2.269464e-83
## 7 KLF10 1694.2109 1.2093969 9.257431e-78
## 8 INSIG1 11872.5106 1.2260848 8.853278e-70
## 9 NR1D1 969.9119 1.5236259 1.376753e-64
## 10 WDFY1 1422.7361 1.0629160 1.298076e-61
## # ... with 23,358 more rows
Let's keep only genes that are expressed (baseMean
above 0) with an adjusted P value below 0.01. You can perform multiple filter()
operations together in a single command.
report <- report %>%
filter(baseMean > 0,
padj < 0.01)
mutate()
enables you to create a new column from an existing column. Let's generate log10 calculations of our baseMeans for each gene.
report %>%
mutate(log10BaseMean = log10(baseMean)) %>%
select(symbol, baseMean, log10BaseMean)
## # A tibble: 4,909 x 3
## symbol baseMean log10BaseMean
## <chr> <dbl> <dbl>
## 1 MOV10 21681.7998 4.336095
## 2 H1F0 7881.0811 3.896586
## 3 HSPA6 168.2522 2.225961
## 4 HIST1H1C 1741.3830 3.240894
## 5 TXNIP 5133.7486 3.710435
## 6 NEAT1 21973.7061 4.341903
## 7 KLF10 1694.2109 3.228967
## 8 INSIG1 11872.5106 4.074543
## 9 NR1D1 969.9119 2.986732
## 10 WDFY1 1422.7361 3.153124
## # ... with 4,899 more rows
You can quickly rename an existing column with rename()
. The syntax is new_name
= old_name
.
report %>%
rename(gene = symbol)
## # A tibble: 4,909 x 4
## gene baseMean log2FoldChange padj
## <chr> <dbl> <dbl> <dbl>
## 1 MOV10 21681.7998 4.7695983 0.000000e+00
## 2 H1F0 7881.0811 1.5250811 2.007733e-162
## 3 HSPA6 168.2522 4.4993734 1.969313e-134
## 4 HIST1H1C 1741.3830 1.4868361 5.116720e-101
## 5 TXNIP 5133.7486 1.3868320 4.882246e-90
## 6 NEAT1 21973.7061 0.9087853 2.269464e-83
## 7 KLF10 1694.2109 1.2093969 9.257431e-78
## 8 INSIG1 11872.5106 1.2260848 8.853278e-70
## 9 NR1D1 969.9119 1.5236259 1.376753e-64
## 10 WDFY1 1422.7361 1.0629160 1.298076e-61
## # ... with 4,899 more rows
You can perform column summarization operations with summarise()
.
report %>%
summarise(avgBaseMean = mean(baseMean))
## # A tibble: 1 x 1
## avgBaseMean
## <dbl>
## 1 1911.6
Advanced: summarise()
is particularly powerful in combination with the group_by()
function, which allows you to group related rows together.
Note: summarize()
also works if you prefer to use American English. This applies across the board to any tidy functions, including in ggplot2 (e.g. color
in place of colour
).
In the recent dplyr 0.7 update, pull()
was added as a quick way to access column data as a vector. This is very handy in chain operations with the pipe operator.
pull(report, symbol) %>% .[1:10]
To demonstrate dplyr's powerful suite of join operations, let's import Ensembl gene annotations from the annotables package and add them to our report.
install.packages("devtools")
devtools::install_github("stephen_turner/annotables")
library(annotables)
annotable <- grch37 %>%
select(symbol, biotype, description) %>%
distinct
report <- left_join(report, annotable, by = "symbol")
Underneath the hood, tidyverse packages build upon the base R language using rlang, which is a complete rework of how functions handle variable names and evaluate arguments. This is achieved through the tidyeval
framework, which interprates command operations using tidy evaluation
. This is outside of the scope of the course, but explained in detail in the Programming with dplyr vignette, in case you'd like to understand how these new tools behave differently from base R.