ADD: added highlights and links to various places
chris-greening committed Jan 28, 2023
1 parent 1beba32 commit ea3a2b1
Showing 1 changed file with 23 additions and 25 deletions.
@@ -1,8 +1,8 @@
# Joining multiple datasets on the same column in R using dplyr and purrr

-Joining multiple datasets on the same column is a common pattern in data preparation
+Joining **multiple datasets** on the same column is a common pattern in data preparation

-So let's jump in and explore how we can leverage R and the tidyverse to join an arbitrary number of datasets on a shared column with elegant, readable code!
+So let's jump in and explore how we can **leverage R** and the **tidyverse** to join an arbitrary number of datasets on a shared column with **elegant, readable code**!

## Table of Contents
- [Installing prerequisite packages](#installing-prerequisite-packages)
@@ -20,9 +20,9 @@ So let's jump in and explore how we can leverage R and the tidyverse to join an
## Installing prerequisite packages
<a src="#installing-prerequisite-packages"></a>

-In this tutorial we'll be using dplyr and purrr from the popular tidyverse collection of packages
+In this tutorial we'll be using [`dplyr`](https://dplyr.tidyverse.org/) and [`purrr`](https://purrr.tidyverse.org/) from the popular [tidyverse](https://www.tidyverse.org/) collection of packages

-The following line of code will install them on your machine if they aren't already:
+The following line of code will **install** them on your machine if they aren't already:

```R
install.packages(c("dplyr", "purrr"))
@@ -33,7 +33,7 @@ install.packages(c("dplyr", "purrr"))
## Examining our sample datasets
<a src="#examining-our-sample-datasets"></a>

-For the following examples, we'll be using real-world agricultural data sourced via Eurostat containing the number of specific livestock animals (`swine`, `bovine`, `sheep`, and `goats`) in a `country` during a given `year`
+For the following examples, we'll be using **real-world agricultural data** sourced via [Eurostat](https://ec.europa.eu/eurostat) containing the number of specific livestock animals (`swine`, `bovine`, `sheep`, and `goats`) in a `country` during a given `year`

For example, here is the `goats` dataset
```R
@@ -54,7 +54,7 @@ For example, here is the `goats` dataset
# … with 1,312 more rows
```

-Our goal is to join these datasets by `country` and `year` into a single `livestock.data` variable containing all the animals like so:
+Our goal is to **join these datasets** by `country` and `year` into a single `livestock.data` variable containing all the animals like so:

```R
> livestock.data
@@ -79,7 +79,7 @@ Our goal is to join these datasets by `country` and `year` into a single `livest
## Using dplyr::full_join to manually join two datasets at a time
<a src="#using-dplyr"></a>

-Let's start with a naive approach and manually join our datasets one-by-one on the `country` and `year` columns
+Let's start with a naive approach and **manually** join our datasets one-by-one on the `country` and `year` columns

```R
by = c("country", "year")
@@ -89,27 +89,27 @@ livestock.data <- dplyr::full_join(livestock.data, sheep, by=by)
```
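
To make a single join step concrete, here is a minimal sketch of one `dplyr::full_join` call. The two small data frames and their values are made up for illustration and are not the Eurostat figures; only the column names follow the tutorial.

```R
library(dplyr)

# Hypothetical toy data (made-up values, not the Eurostat figures)
bovine <- data.frame(country = c("AT", "BE"), year = c(2020, 2020), bovine = c(10, 20))
goats  <- data.frame(country = c("BE", "CY"), year = c(2020, 2020), goats  = c(3, 4))

# A full join keeps every country/year pair found in either table
# and fills the missing animal counts with NA
joined <- full_join(bovine, goats, by = c("country", "year"))
print(joined)
```

Rows that appear in only one table are kept, which is why `full_join` suits this exercise: no country or year is dropped just because one animal has no record for it.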

The above code accomplishes the exercise by:
-1. Manually stepping through each animal
-2. applying a function that takes two arguments (in this case `dplyr::full_join`)
-3. and chaining the output of one step (`livestock.data`) as the input for the next step
+1. **Manually** stepping through each animal
+2. **applying a function** that takes two arguments (in this case `dplyr::full_join`)
+3. and chaining the **output** of one step (`livestock.data`) as the **input** for the next step

-While this might work for four datasets, what if we had 100 datasets? 1000 datasets? _n datasets?!_ Suddenly not a great solution!
+While this might work for four datasets, what if we had 100 datasets? 1000 datasets? _n datasets?!_ Suddenly **not** a great solution!

-Let's investigate how we can improve, automate, and scale this
+Let's investigate how we can **improve, automate, and scale** this

---

## Understanding the reduce operation
<a src="#understanding-the-reduce-operation"></a>

-The reduce operation is a technique that combines all the elements of an array (i.e. an array containing our individual livestock datasets) into a single value (i.e. the final joined table).
+The **reduce** operation is a technique that **combines** all the elements of an **array** (i.e. an array containing our individual livestock datasets) into a **single value** (i.e. the final joined table).

The reduce operation accomplishes this by:
-1. Looping over an array
-2. applying a function that takes two arguments (such as `dplyr::full_join`)
-3. and chaining the output of one step as the input for the next step
+1. **Looping** over an array
+2. **applying a function** that takes two arguments (such as `dplyr::full_join`)
+3. and chaining the **output** of one step as the **input** for the next step

-Sound familiar? This is exactly what we just performed manually in the previous section except this time we'll be leveraging R to do it for us!
+Sound familiar? This is *exactly* what we just performed manually in the previous section except this time we'll be **leveraging R** to do it for us!
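
Before applying this to tables, the three steps can be sketched on plain numbers. This is an illustrative example only: the vector `values` and the helper `add_two` are hypothetical names, and addition stands in for the join.

```R
library(purrr)

# Hypothetical input: four numbers standing in for four datasets
values <- c(1, 2, 3, 4)

# A function of two arguments: the running result and the next element
add_two <- function(acc, x) acc + x

# reduce() chains the calls: add_two(add_two(add_two(1, 2), 3), 4)
total <- reduce(values, add_two)

# accumulate() keeps every intermediate result instead of only the last one
running <- accumulate(values, add_two)

print(total)    # 10
print(running)  # 1 3 6 10
```

Swapping `add_two` for a join function and `values` for a list of tables gives the pattern used in the next section.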

So let's see in practice how we can apply the reduce operation to elegantly join our `livestock.data`

@@ -118,9 +118,9 @@ So let's see in practice how we can apply the reduce operation to elegantly join
## Leveraging purrr::reduce to join multiple datasets
<a src="#leveraging-purrr-reduce"></a>

-`purrr` is a package that enhances R's functional programming toolkit for working with functions and vectors (i.e. reducing, mapping, filtering, etc.)
+`purrr` is a package that enhances R's **functional programming toolkit** for working with functions and vectors (i.e. reducing, mapping, filtering, etc.)

-In this case, we're going to use `purrr::reduce` in conjunction with `dplyr::full_join` to join all of our datasets in one line of concise, readable code
+In this case, we're going to use `purrr::reduce` in conjunction with `dplyr::full_join` to join all of our datasets in one line of **concise, readable code**

```R
livestock.data <- purrr::reduce(
@@ -131,23 +131,21 @@ livestock.data <- purrr::reduce(
)
```

-And that's it! We've joined all of our datasets in what's essentially a single line of code
+And that's it! We accomplished this by:

-We accomplished this by:
-
-1. Looping over a list of our livestock
+1. **Looping** over a list of our livestock
```R
list(bovine, goats, swine, sheep)
```
-2. applying `dplyr::full_join` which takes two arguments
+2. **applying** `dplyr::full_join` which takes two arguments

```R
function(left, right) {
dplyr::full_join(left, right, by=c("country", "year"))
}
```

-3. and chaining the output of one step as the input for the next step
+3. and chaining the **output** of one step as the **input** for the next step

![Image showing the different datasets joining together in a hierarchical chain that starts with bovine and goats joining into livestock.data, livestock.data joining with swine, and livestock.data finally joining with sheep](media/join_image.PNG)
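
Because the diff only shows part of the `purrr::reduce` call, here is a self-contained sketch of the whole pattern. It assumes four small made-up data frames in place of the Eurostat tables, and the helper name `join_by_country_year` is hypothetical; the join itself follows the two-argument function shown in step 2.

```R
library(dplyr)
library(purrr)

# Hypothetical toy data standing in for the Eurostat tables
bovine <- data.frame(country = c("AT", "BE"), year = c(2020, 2020), bovine = c(10, 20))
goats  <- data.frame(country = c("BE", "CY"), year = c(2020, 2020), goats  = c(3, 4))
swine  <- data.frame(country = "AT", year = 2020, swine = 7)
sheep  <- data.frame(country = "CY", year = 2020, sheep = 5)

# Two-argument function applied at every step of the reduction
join_by_country_year <- function(left, right) {
  full_join(left, right, by = c("country", "year"))
}

# The list can hold any number of tables; the call stays the same
livestock.data <- reduce(list(bovine, goats, swine, sheep), join_by_country_year)
print(livestock.data)
```

Adding a fifth or a hundredth table only means adding it to the list, which is the scaling benefit the manual approach lacked.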
