Skip to content

Latest commit

 

History

History
388 lines (292 loc) · 16.1 KB

File metadata and controls

388 lines (292 loc) · 16.1 KB

The tidyverse

Michael J. Steinbaugh 2017-07-07

The tidyverse is a suite of integrated packages designed to make common operations performed in R more user friendly. This was initially developed by Hadley Wickham, Chief Scientist at RStudio, but is now maintained by a number of talented developers moving the R language forward.


Installation

The core collection of tidyverse packages are managed by the tidyverse CRAN package.

install.packages("tidyverse")
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

When you load the tidyverse library, these core packages will be loaded into your environment:

Recommended optional packages

There are a number of additional tidyverse packages that we highly recommend for performing data analysis, including:

  • magrittr: Defines the pipe operator (%>%), which is used to write left-to-right chain operations. We'll cover this below.
  • stringr: Enables easier manipulation of vectors ("strings").
  • readxl: Current recommended best practice for import of Excel workbooks.

Function name conflicts

When you load the tidyverse, you'll see messages about tidy package conflicts at the end. This is normal. Currently, dplyr masks stats::filter() and stats::lag() by default. What's happening here is that the tidyverse has some functions with the same names as base R packages.

Note: This remains a common issue when loading multiple libraries in a single script. For example, many Bioconductor packages have generic functions with the same name as base R and tidyverse packages. For example, biomaRt::select() and dplyr::select() have the same function name but require different arguments. Whichever library you load last will define the function name (select()). If you need to use two packages with the same function name at the same time, you can reference them explicitly (e.g. dplyr::select()). Therefore, when starting a new analysis using tidyverse packages, we highly recommend slotting library(tidyverse) at the end of your list of libraries.

Code style

One problem with R is a lack of consistency across packages in how functions and arguments are named.

  • Base R functions are formatted in dotted case: read.csv().
  • tidyverse functions are formatted in snake_case: read_csv().
  • Bioconductor functions are generally formatted in lowerCamelCase (and sometimes UpperCamelCase).

The tidyverse collection of packages are very opinionated in this regard and consistently use snake_case formatting for all function names and arguments. When using these functions, we recommend that you follow the tidy style guide.

tibbles

A core component of the tidyverse is the tibble. Tibbles are a modern rework of the standard data.frame, with some internal improvements to make code more reliable. Most notably, tibbles will return a more reasonable number of rows to the console.

Before we begin the tidy workflow, coerce your counts, metadata, and DESeqResults to data frames:

counts <- as.data.frame(data)
meta <- as.data.frame(meta)
results <- as.data.frame(res_tableOE)

Now let's create tibbles from our data frames:

counts_tbl <- counts %>%
    rownames_to_column("ensgene") %>%
    as_tibble
meta_tbl <- meta %>%
    rownames_to_column("sample_name") %>%
    as_tibble %>%
    rename(sample_type = sampletype,
           mov_expression = MOVexpr) %>%
    mutate(sample_name = tolower(sample_name))
results_tbl <- results %>%
    rownames_to_column("symbol") %>%
    as_tibble

First, try returning the counts data frame in your console.

counts

Next, try returning the counts tibble.

counts_tbl
## # A tibble: 38,828 x 13
##               ensgene    sample1   sample2   sample3   sample4    sample5
##                 <chr>      <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
##  1 ENSMUSG00000000001 19.7848000 19.265000 20.889500 24.076700 23.7222000
##  2 ENSMUSG00000000003  0.0000000  0.000000  0.000000  0.000000  0.0000000
##  3 ENSMUSG00000000028  0.9377920  1.032290  0.892183  0.827891  0.8269540
##  4 ENSMUSG00000000031  0.0359631  0.000000  0.000000  0.000000  0.0000000
##  5 ENSMUSG00000000037  0.1514170  0.056033  0.146196  0.180883  0.0473238
##  6 ENSMUSG00000000049  0.2567330  0.258134  0.421286  2.191720  1.0730200
##  7 ENSMUSG00000000056  5.9998100  6.047990  6.020250  5.620120  6.4116300
##  8 ENSMUSG00000000058  5.2784700  3.971810  6.161450  7.045910  5.2136600
##  9 ENSMUSG00000000078 37.2141000 32.303000 31.249600  4.260570  3.5634100
## 10 ENSMUSG00000000085 25.4044000 22.950700 21.415600 16.825400 17.9712000
## # ... with 38,818 more rows, and 7 more variables: sample6 <dbl>,
## #   sample7 <dbl>, sample8 <dbl>, sample9 <dbl>, sample10 <dbl>,
## #   sample11 <dbl>, sample12 <dbl>

See how R only prints 10 rows instead of returning all 38k? This is much more user friendly.

Internally, a tibble is essentially a class variant of data.frame, with some extra tibble (tbl) magic baked in:

class(meta_tbl)
## [1] "tbl_df"     "tbl"        "data.frame"

glimpse() is a modern rework of str(), optimized for tibbles:

glimpse(counts_tbl)
glimpse(meta_tbl)
glimpse(results_tbl)

Row names

Important: tidyverse is very opininationed about row names. These packages insist that all column data (e.g. data.frame) be treated equally, and that special designation of a column as rownames should be deprecated. tibble provides simple utility functions to to handle rownames: rownames_to_column() and column_to_rownames().

help("rownames", "tibble")

Non-standard evaluation

tidyverse packages improve code readability by changing how functions interpret object names. This is achieved through the use of "non-standard evaluation" instead of base R's "standard evaluation". This probably sounds confusing but is actually pretty simple. In fact, we've already used a function (subset()) in the class that works with non-standard evaluation.

subset(meta_tbl, mov_expression == "high")
## # A tibble: 3 x 3
##   sample_name          sample_type mov_expression
##         <chr>                <chr>          <chr>
## 1  mov10_oe_1 MOV10_overexpression           high
## 2  mov10_oe_2 MOV10_overexpression           high
## 3  mov10_oe_3 MOV10_overexpression           high

Here's the tidyverse variant:

filter(meta_tbl, mov_expression == "high")
## # A tibble: 3 x 3
##   sample_name          sample_type mov_expression
##         <chr>                <chr>          <chr>
## 1  mov10_oe_1 MOV10_overexpression           high
## 2  mov10_oe_2 MOV10_overexpression           high
## 3  mov10_oe_3 MOV10_overexpression           high

See how both functions refer to the mov_expression column directly and not in quotations? That's non-standard evaluation. It makes code easier to read.

dplyr

The most useful tool in the tidyverse is dplyr. It's a swiss-army knife for data manipulation. dplyr has 5 core functions that we recommend incorporating into your analysis:

  • filter() picks cases based on their values.
  • arrange() changes the ordering of the rows.
  • select() picks variables based on their names.
  • mutate() adds new variables that are functions of existing variables
  • summarise() reduces multiple values down to a single summary.

Note: dplyr underwent a massive revision this year, switching versions from 0.5 to 0.7. If you consult other dplyr tutorials online, note that many materials developed prior to 2017 are no longer correct. In particular, this applies to writing functions with dplyr (see Notes section below).

Let's make a report tibble of our DESeqResults.

select()

First, we only need a few columns of interest from our results tibble. We can do this easily in dplyr with select().

report <- results_tbl %>%
    select(symbol, baseMean, log2FoldChange, padj)

Conversely, you can remove columns you don't want with negative selection.

results_tbl %>%
    select(-c(lfcSE, stat, pvalue))
## # A tibble: 23,368 x 4
##         symbol    baseMean log2FoldChange         padj
##          <chr>       <dbl>          <dbl>        <dbl>
##  1 1/2-SBSRNA4  45.6520399    0.266586547 2.708964e-01
##  2        A1BG  61.0931017    0.208057615 3.638671e-01
##  3    A1BG-AS1 175.6658069   -0.051825739 7.837586e-01
##  4        A1CF   0.2376919    0.012557390           NA
##  5       A2LD1  89.6179845    0.343006364 7.652553e-02
##  6         A2M   5.8600841   -0.270449534 2.318666e-01
##  7       A2ML1   2.4240553    0.236041349           NA
##  8       A2MP1   1.3203237    0.079525469           NA
##  9      A4GALT  64.5409534    0.795049160 2.875565e-05
## 10       A4GNT   0.1912781    0.009458374           NA
## # ... with 23,358 more rows

arrange()

Note that the rows are sorted by the gene symbol. Let's fix that and sort them by adjusted P value instead with arrange().

report <- arrange(report, padj)
report
## # A tibble: 23,368 x 4
##      symbol   baseMean log2FoldChange          padj
##       <chr>      <dbl>          <dbl>         <dbl>
##  1    MOV10 21681.7998      4.7695983  0.000000e+00
##  2     H1F0  7881.0811      1.5250811 2.007733e-162
##  3    HSPA6   168.2522      4.4993734 1.969313e-134
##  4 HIST1H1C  1741.3830      1.4868361 5.116720e-101
##  5    TXNIP  5133.7486      1.3868320  4.882246e-90
##  6    NEAT1 21973.7061      0.9087853  2.269464e-83
##  7    KLF10  1694.2109      1.2093969  9.257431e-78
##  8   INSIG1 11872.5106      1.2260848  8.853278e-70
##  9    NR1D1   969.9119      1.5236259  1.376753e-64
## 10    WDFY1  1422.7361      1.0629160  1.298076e-61
## # ... with 23,358 more rows

filter()

Let's keep only genes that are expressed (baseMean above 0) with an adjusted P value below 0.01. You can perform multiple filter() operations together in a single command.

report <- report %>%
    filter(baseMean > 0,
           padj < 0.01)

mutate()

mutate() enables you to create a new column from an existing column. Let's generate log10 calculations of our baseMeans for each gene.

report %>%
    mutate(log10BaseMean = log10(baseMean)) %>%
    select(symbol, baseMean, log10BaseMean)
## # A tibble: 4,909 x 3
##      symbol   baseMean log10BaseMean
##       <chr>      <dbl>         <dbl>
##  1    MOV10 21681.7998      4.336095
##  2     H1F0  7881.0811      3.896586
##  3    HSPA6   168.2522      2.225961
##  4 HIST1H1C  1741.3830      3.240894
##  5    TXNIP  5133.7486      3.710435
##  6    NEAT1 21973.7061      4.341903
##  7    KLF10  1694.2109      3.228967
##  8   INSIG1 11872.5106      4.074543
##  9    NR1D1   969.9119      2.986732
## 10    WDFY1  1422.7361      3.153124
## # ... with 4,899 more rows

rename()

You can quickly rename an existing column with rename(). The syntax is new_name = old_name.

report %>%
    rename(gene = symbol)
## # A tibble: 4,909 x 4
##        gene   baseMean log2FoldChange          padj
##       <chr>      <dbl>          <dbl>         <dbl>
##  1    MOV10 21681.7998      4.7695983  0.000000e+00
##  2     H1F0  7881.0811      1.5250811 2.007733e-162
##  3    HSPA6   168.2522      4.4993734 1.969313e-134
##  4 HIST1H1C  1741.3830      1.4868361 5.116720e-101
##  5    TXNIP  5133.7486      1.3868320  4.882246e-90
##  6    NEAT1 21973.7061      0.9087853  2.269464e-83
##  7    KLF10  1694.2109      1.2093969  9.257431e-78
##  8   INSIG1 11872.5106      1.2260848  8.853278e-70
##  9    NR1D1   969.9119      1.5236259  1.376753e-64
## 10    WDFY1  1422.7361      1.0629160  1.298076e-61
## # ... with 4,899 more rows

summarise()

You can perform column summarization operations with summarise().

report %>%
    summarise(avgBaseMean = mean(baseMean))
## # A tibble: 1 x 1
##   avgBaseMean
##         <dbl>
## 1      1911.6

Advanced: summarise() is particularly powerful in combination with the group_by() function, which allows you to group related rows together.

Note: summarize() also works if you prefer to use American English. This applies across the board to any tidy functions, including in ggplot2 (e.g. color in place of colour).

pull()

In the recent dplyr 0.7 update, pull() was added as a quick way to access column data as a vector. This is very handy in chain operations with the pipe operator.

pull(report, symbol) %>% .[1:10]

Joins

To demonstrate dplyr's powerful suite of join operations, let's import Ensembl gene annotations from the annotables package and add them to our report.

install.packages("devtools")
devtools::install_github("stephen_turner/annotables")
library(annotables)
annotable <- grch37 %>%
    select(symbol, biotype, description) %>%
    distinct
report <- left_join(report, annotable, by = "symbol")

Notes

Programming

Underneath the hood, tidyverse packages build upon the base R language using rlang, which is a complete rework of how functions handle variable names and evaluate arguments. This is achieved through the tidyeval framework, which interprates command operations using tidy evaluation. This is outside of the scope of the course, but explained in detail in the Programming with dplyr vignette, in case you'd like to understand how these new tools behave differently from base R.

Additional resources