diff --git a/posts/Performing multiple joins in R using dplyr and purrr/README.md b/posts/Performing multiple joins in R using dplyr and purrr/README.md
index 19da0f2..c52c1cf 100644
--- a/posts/Performing multiple joins in R using dplyr and purrr/README.md
+++ b/posts/Performing multiple joins in R using dplyr and purrr/README.md
@@ -1,8 +1,8 @@
 # Joining multiple datasets on the same column in R using dplyr and purrr

-Joining multiple datasets on the same column is a common pattern in data preparation
+Joining **multiple datasets** on the same column is a common pattern in data preparation

-So let's jump in and explore how we can leverage R and the tidyverse to join an arbitrary number of datasets on a shared column with elegant, readable code!
+So let's jump in and explore how we can **leverage R** and the **tidyverse** to join an arbitrary number of datasets on a shared column with **elegant, readable code**!

 ## Table of Contents
 - [Installing prerequisite packages](#installing-prerequisite-packages)
@@ -20,9 +20,9 @@ So let's jump in and explore how we can leverage R and the tidyverse to join an

 ## Installing prerequisite packages

-In this tutorial we'll be using dplyr and purrr from the popular tidyverse collection of packages
+In this tutorial we'll be using [`dplyr`](https://dplyr.tidyverse.org/) and [`purrr`](https://purrr.tidyverse.org/) from the popular [tidyverse](https://www.tidyverse.org/) collection of packages

-The following line of code will install them on your machine if they aren't already:
+The following line of code will **install** them on your machine if they aren't already:

 ```R
 install.packages(c("dplyr", "purrr"))
@@ -33,7 +33,7 @@ install.packages(c("dplyr", "purrr"))

 ## Examining our sample datasets

-For the following examples, we'll be using real-world agricultural data sourced via Eurostat containing the number of specific livestock animals (`swine`, `bovine`, `sheep`, and `goats`) in a `country` during a given `year`
+For the following examples, we'll be using **real-world agricultural data** sourced via [Eurostat](https://ec.europa.eu/eurostat) containing the number of specific livestock animals (`swine`, `bovine`, `sheep`, and `goats`) in a `country` during a given `year`

 For example, here is the `goats` dataset
 ```R
@@ -54,7 +54,7 @@ For example, here is the `goats` dataset
 # … with 1,312 more rows
 ```

-Our goal is to join these datasets by `country` and `year` into a single `livestock.data` variable containing all the animals like so:
+Our goal is to **join these datasets** by `country` and `year` into a single `livestock.data` variable containing all the animals like so:

 ```R
 > livestock.data
@@ -79,7 +79,7 @@ Our goal is to join these datasets by `country` and `year` into a single `livest

 ## Using dplyr::full_join to manually join two datasets at a time

-Let's start with a naive approach and manually join our datasets one-by-one on the `country` and `year` columns
+Let's start with a naive approach and **manually** join our datasets one-by-one on the `country` and `year` columns

 ```R
 by = c("country", "year")
@@ -89,27 +89,27 @@ livestock.data <- dplyr::full_join(livestock.data, sheep, by=by)
 ```

 The above code accomplishes the exercise by:
-1. Manually stepping through each animal
-2. applying a function that takes two arguments (in this case `dplyr::full_join`)
-3. and chaining the output of one step (`livestock.data`) as the input for the next step
+1. **Manually** stepping through each animal
+2. **applying a function** that takes two arguments (in this case `dplyr::full_join`)
+3. and chaining the **output** of one step (`livestock.data`) as the **input** for the next step

-While this might work for four datasets, what if we had 100 datasets? 1000 datasets? _n datasets?!_ Suddenly not a great solution!
+While this might work for four datasets, what if we had 100 datasets? 1000 datasets? _n datasets?!_ Suddenly **not** a great solution!

-Let's investigate how we can improve, automate, and scale this
+Let's investigate how we can **improve, automate, and scale** this


 ---

 ## Understanding the reduce operation

-The reduce operation is a technique that combines all the elements of an array (i.e. an array containing our individual livestock datasets) into a single value (i.e. the final joined table).
+The **reduce** operation is a technique that **combines** all the elements of an **array** (i.e. an array containing our individual livestock datasets) into a **single value** (i.e. the final joined table).

 The reduce operation accomplishes this by:
-1. Looping over an array
-2. applying a function that takes two arguments (such as `dplyr::full_join`)
-3. and chaining the output of one step as the input for the next step
+1. **Looping** over an array
+2. **applying a function** that takes two arguments (such as `dplyr::full_join`)
+3. and chaining the **output** of one step as the **input** for the next step

-Sound familiar? This is exactly what we just performed manually in the previous section except this time we'll be leveraging R to do it for us!
+Sound familiar? This is *exactly* what we just performed manually in the previous section except this time we'll be **leveraging R** to do it for us!

 So let's see in practice how we can apply the reduce operation to elegantly join our `livestock.data`

@@ -118,9 +118,9 @@ So let's see in practice how we can apply the reduce operation to elegantly join

 ## Leveraging purrr::reduce to join multiple datasets

-`purrr` is a package that enhances R's functional programming toolkit for working with functions and vectors (i.e. reducing, mapping, filtering, etc.)
+`purrr` is a package that enhances R's **functional programming toolkit** for working with functions and vectors (i.e. reducing, mapping, filtering, etc.)

-In this case, we're going to use `purrr::reduce` in conjunction with `dplyr::full_join` to join all of our datasets in one line of concise, readable code
+In this case, we're going to use `purrr::reduce` in conjunction with `dplyr::full_join` to join all of our datasets in one line of **concise, readable code**

 ```R
 livestock.data <- purrr::reduce(
@@ -131,15 +131,13 @@ livestock.data <- purrr::reduce(
 )
 ```

-And that's it! We've joined all of our datasets in what's essentially a single line of code
+And that's it! We accomplished this by:

-We accomplished this by:
-
-1. Looping over a list of our livestock
+1. **Looping** over a list of our livestock
 ```R
 list(bovine, goats, swine, sheep)
 ```
-2. applying `dplyr::full_join` which takes two arguments
+2. **applying** `dplyr::full_join` which takes two arguments

 ```R
 function(left, right) {
@@ -147,7 +145,7 @@ function(left, right) {
 }
 ```

-3. and chaining the output of one step as the input for the next step
+3. and chaining the **output** of one step as the **input** for the next step

 ![Image showing the different datasets joining together in a hierarchical chain that starts with bovine and goats joining into livestock.data, livestock.data joining with swine, and livestock.data finally joining with sheep](media/join_image.PNG)
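
For anyone who wants to try the pattern from this README without the Eurostat files, the loop, apply, and chain steps can be sketched end to end on toy data. This is only a minimal sketch: the four small data frames and their values are hypothetical stand-ins for the real datasets, and the body of the `purrr::reduce` call is an assumption, since the hunks above do not show it in full.

```R
library(dplyr)
library(purrr)

# Hypothetical toy tables standing in for the real livestock datasets
bovine <- data.frame(country = c("AT", "BE"), year = c(2020, 2020),
                     bovine = c(10, 20), stringsAsFactors = FALSE)
goats  <- data.frame(country = c("AT", "FR"), year = c(2020, 2020),
                     goats = c(1, 3), stringsAsFactors = FALSE)
swine  <- data.frame(country = c("BE", "FR"), year = c(2020, 2020),
                     swine = c(5, 7), stringsAsFactors = FALSE)
sheep  <- data.frame(country = "AT", year = 2020,
                     sheep = 4, stringsAsFactors = FALSE)

by <- c("country", "year")

# Manual approach: join two tables at a time, feeding each result forward
manual <- dplyr::full_join(bovine, goats, by = by)
manual <- dplyr::full_join(manual, swine, by = by)
manual <- dplyr::full_join(manual, sheep, by = by)

# Reduce approach: the same chain, driven by the list of tables
livestock.data <- purrr::reduce(
  list(bovine, goats, swine, sheep),
  function(left, right) dplyr::full_join(left, right, by = by)
)

print(livestock.data)
```

Because `full_join` keeps unmatched rows from both sides, countries missing from a given table show up with `NA` in that table's column rather than being dropped.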