From 5dbd7ed512fccc2e1faeb5a7a3ea35513684e14b Mon Sep 17 00:00:00 2001 From: jmledford3115 Date: Sun, 28 Jan 2024 19:32:54 -0800 Subject: [PATCH] lab 6 update --- lab6/lab6_1.Rmd | 10 +- lab6/lab6_2.Rmd | 6 +- lab6_1.html | 1797 +++++++++++++++++++++++++++++++++++++++++++++++ lab6_2.html | 1717 ++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 3525 insertions(+), 5 deletions(-) create mode 100644 lab6_1.html create mode 100644 lab6_2.html diff --git a/lab6/lab6_1.Rmd b/lab6/lab6_1.Rmd index 606ae4c..74e6665 100644 --- a/lab6/lab6_1.Rmd +++ b/lab6/lab6_1.Rmd @@ -22,12 +22,18 @@ library("janitor") ``` ## Load the data -For this lab, we will use the following dataset: +For this lab, we will use the following two datasets: -1. S. K. Morgan Ernest. 2003. Life history characteristics of placental non-volant mammals. Ecology 84:3402. [link](http://esapubs.org/archive/ecol/E084/093/) +1. 1. Gaeta J., G. Sass, S. Carpenter. 2012. Biocomplexity at North Temperate Lakes LTER: Coordinated Field Studies: Large Mouth Bass Growth 2006. Environmental Data Initiative. [link](https://portal.edirepository.org/nis/mapbrowse?scope=knb-lter-ntl&identifier=267) + +2. S. K. Morgan Ernest. 2003. Life history characteristics of placental non-volant mammals. Ecology 84:3402. [link](http://esapubs.org/archive/ecol/E084/093/) ## Pipes `%>%` Recall that we use pipes to connect the output of code to a subsequent function. This makes our code cleaner and more efficient. One way we can use pipes is to attach the `clean_names()` function from janitor to the `read_csv()` output. +```{r} +fish <- readr::read_csv("data/Gaeta_etal_CLC_data.csv") +``` + ```{r} mammals <- read_csv("data/mammal_lifehistories_v2.csv") %>% clean_names() ``` diff --git a/lab6/lab6_2.Rmd b/lab6/lab6_2.Rmd index 412da3a..fa303f4 100644 --- a/lab6/lab6_2.Rmd +++ b/lab6/lab6_2.Rmd @@ -27,8 +27,8 @@ These are data taken from comic books and assembled by fans. 
The include a good Check out the way I am loading these data. If I know there are NAs, I can take care of them at the beginning. But, we should do this very cautiously. At times it is better to keep the original columns and data intact. ```{r} -superhero_info <- read_csv("data/heroes_information.csv", na = c("", "-99", "-")) -superhero_powers <- read_csv("data/super_hero_powers.csv", na = c("", "-99", "-")) +#superhero_info <- read_csv("data/heroes_information.csv", na = c("", "-99", "-")) +#superhero_powers <- read_csv("data/super_hero_powers.csv", na = c("", "-99", "-")) ``` ## Data tidy @@ -37,7 +37,7 @@ superhero_powers <- read_csv("data/super_hero_powers.csv", na = c("", "-99", "-" ## `tabyl` The `janitor` package has many awesome functions that we will explore. Here is its version of `table` which not only produces counts but also percentages. Very handy! Let's use it to explore the proportion of good guys and bad guys in the `superhero_info` data. ```{r} -tabyl(superhero_info, alignment) +#tabyl(superhero_info, alignment) ``` 1. Who are the publishers of the superheros? Show the proportion of superheros from each publisher. Which publisher has the highest number of superheros? diff --git a/lab6_1.html b/lab6_1.html new file mode 100644 index 0000000..edcef91 --- /dev/null +++ b/lab6_1.html @@ -0,0 +1,1797 @@ + + + + + + + + + + + + + + +mutate(), and if_else() + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+
+
+
+
+ +
+ + + + + + + +
+

Learning Goals

+

At the end of this exercise, you will be able to:
+1. Use mutate() to add columns in a dataframe.
+2. Use mutate() and if_else() to replace +values in a dataframe.

+
+
+

Load the libraries

+
library("tidyverse")
+library("janitor")
+
+
+

Load the data

+

For this lab, we will use the following two datasets:

+
    +
  1. Gaeta J., G. Sass, S. Carpenter. 2012. Biocomplexity at North +Temperate Lakes LTER: Coordinated Field Studies: Large Mouth Bass Growth +2006. Environmental Data Initiative. link
  2. S. K. Morgan Ernest. 2003. Life history characteristics of placental +non-volant mammals. Ecology 84:3402. link
  3. +
+
+
+

Pipes %>%

+

Recall that we use pipes to connect the output of code to a +subsequent function. This makes our code cleaner and more efficient. One +way we can use pipes is to attach the clean_names() +function from janitor to the read_csv() output.

+
fish <- readr::read_csv("data/Gaeta_etal_CLC_data.csv")
+
## Rows: 4033 Columns: 6
+## ── Column specification ────────────────────────────────────────────────────────
+## Delimiter: ","
+## chr (2): lakeid, annnumber
+## dbl (4): fish_id, length, radii_length_mm, scalelength
+## 
+## ℹ Use `spec()` to retrieve the full column specification for this data.
+## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
+
mammals <- read_csv("data/mammal_lifehistories_v2.csv") %>% clean_names()
+
## Rows: 1440 Columns: 13
+## ── Column specification ────────────────────────────────────────────────────────
+## Delimiter: ","
+## chr (4): order, family, Genus, species
+## dbl (9): mass, gestation, newborn, weaning, wean mass, AFR, max. life, litte...
+## 
+## ℹ Use `spec()` to retrieve the full column specification for this data.
+## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
+
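To see what the pipe into `clean_names()` does without needing the lab's CSV files, here is a minimal sketch on a made-up tibble. The object names `messy` and `tidy` and the values are hypothetical; the column names imitate the raw mammal headers shown in the column specification above.

```r
library(dplyr)    # provides %>% and tibble()
library(janitor)  # provides clean_names()

# Hypothetical toy data with the kind of untidy headers seen in the raw file
messy <- tibble(
  `max. life` = c(10, 20),
  `wean mass` = c(1.5, 2.5),
  Genus       = c("Bison", "Addax")
)

# Pipe the data frame straight into clean_names()
tidy <- messy %>% clean_names()

names(tidy)
```

The cleaned names come back in lowercase snake case, which is why the `mammals` columns print as `max_life`, `wean_mass`, and `genus`.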
+
+

mutate()

+

Mutate allows us to create a new column from existing columns in a +data frame. We are doing a small introduction here and will add some +additional functions later. Let’s convert the length variable from cm to +millimeters and create a new variable called length_mm.

+
fish %>% 
+  mutate(length_mm = length*10) %>% 
+  select(fish_id, length, length_mm)
+
## # A tibble: 4,033 × 3
+##    fish_id length length_mm
+##      <dbl>  <dbl>     <dbl>
+##  1     299    167      1670
+##  2     299    167      1670
+##  3     299    167      1670
+##  4     300    175      1750
+##  5     300    175      1750
+##  6     300    175      1750
+##  7     300    175      1750
+##  8     301    194      1940
+##  9     301    194      1940
+## 10     301    194      1940
+## # ℹ 4,023 more rows
+
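The same pattern can be tried on a small hand-made tibble. This is only a sketch: `toy_fish` and its values are invented for illustration and are not part of the Gaeta et al. data.

```r
library(dplyr)

# Hypothetical toy data standing in for the fish data
toy_fish <- tibble(
  fish_id = c(1, 2, 3),
  length  = c(167, 175, 194)
)

# mutate() adds length_mm and leaves the existing columns untouched
toy_fish_mm <- toy_fish %>%
  mutate(length_mm = length * 10)

toy_fish_mm
```

Note that the result is only kept if we assign it to an object, as done here with `toy_fish_mm`. Without the assignment, `toy_fish` itself is unchanged.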
+
+

Practice

+
    +
  1. Use mutate() to make a new column that is the half +length of each fish: length_half = length/2. Select only fish_id, +length, and length_half.
  2. +
+
+
+

mutate_all()

+

This last function is super helpful when cleaning data. With “wild” +data, there are often mixed entries (upper and lowercase), blank spaces, +odd characters, etc. These all need to be dealt with before +analysis.

+

Here is an example that changes all entries to lowercase (if +present).

+
mammals
+
## # A tibble: 1,440 × 13
+##    order  family genus species   mass gestation newborn weaning wean_mass    afr
+##    <chr>  <chr>  <chr> <chr>    <dbl>     <dbl>   <dbl>   <dbl>     <dbl>  <dbl>
+##  1 Artio… Antil… Anti… americ… 4.54e4      8.13   3246.    3         8900   13.5
+##  2 Artio… Bovid… Addax nasoma… 1.82e5      9.39   5480     6.5       -999   27.3
+##  3 Artio… Bovid… Aepy… melamp… 4.15e4      6.35   5093     5.63     15900   16.7
+##  4 Artio… Bovid… Alce… busela… 1.5 e5      7.9   10167.    6.5       -999   23.0
+##  5 Artio… Bovid… Ammo… clarkei 2.85e4      6.8    -999  -999         -999 -999  
+##  6 Artio… Bovid… Ammo… lervia  5.55e4      5.08   3810     4         -999   14.9
+##  7 Artio… Bovid… Anti… marsup… 3   e4      5.72   3910     4.04      -999   10.2
+##  8 Artio… Bovid… Anti… cervic… 3.75e4      5.5    3846     2.13      -999   20.1
+##  9 Artio… Bovid… Bison bison   4.98e5      8.93  20000    10.7     157500   29.4
+## 10 Artio… Bovid… Bison bonasus 5   e5      9.14  23000.    6.6       -999   30.0
+## # ℹ 1,430 more rows
+## # ℹ 3 more variables: max_life <dbl>, litter_size <dbl>, litters_year <dbl>
+
mammals %>%
+  mutate_all(tolower)
+
## # A tibble: 1,440 × 13
+##    order    family genus species mass  gestation newborn weaning wean_mass afr  
+##    <chr>    <chr>  <chr> <chr>   <chr> <chr>     <chr>   <chr>   <chr>     <chr>
+##  1 artioda… antil… anti… americ… 45375 8.13      3246.36 3       8900      13.53
+##  2 artioda… bovid… addax nasoma… 1823… 9.39      5480    6.5     -999      27.27
+##  3 artioda… bovid… aepy… melamp… 41480 6.35      5093    5.63    15900     16.66
+##  4 artioda… bovid… alce… busela… 1500… 7.9       10166.… 6.5     -999      23.02
+##  5 artioda… bovid… ammo… clarkei 28500 6.8       -999    -999    -999      -999 
+##  6 artioda… bovid… ammo… lervia  55500 5.08      3810    4       -999      14.89
+##  7 artioda… bovid… anti… marsup… 30000 5.72      3910    4.04    -999      10.23
+##  8 artioda… bovid… anti… cervic… 37500 5.5       3846    2.13    -999      20.13
+##  9 artioda… bovid… bison bison   4976… 8.93      20000   10.71   157500    29.45
+## 10 artioda… bovid… bison bonasus 5e+05 9.14      23000.… 6.6     -999      29.99
+## # ℹ 1,430 more rows
+## # ℹ 3 more variables: max_life <chr>, litter_size <chr>, litters_year <chr>
+

Using the across function we can specify individual columns.

+
mammals %>% 
+  mutate(across(c("order", "family"), tolower))
+
## # A tibble: 1,440 × 13
+##    order  family genus species   mass gestation newborn weaning wean_mass    afr
+##    <chr>  <chr>  <chr> <chr>    <dbl>     <dbl>   <dbl>   <dbl>     <dbl>  <dbl>
+##  1 artio… antil… Anti… americ… 4.54e4      8.13   3246.    3         8900   13.5
+##  2 artio… bovid… Addax nasoma… 1.82e5      9.39   5480     6.5       -999   27.3
+##  3 artio… bovid… Aepy… melamp… 4.15e4      6.35   5093     5.63     15900   16.7
+##  4 artio… bovid… Alce… busela… 1.5 e5      7.9   10167.    6.5       -999   23.0
+##  5 artio… bovid… Ammo… clarkei 2.85e4      6.8    -999  -999         -999 -999  
+##  6 artio… bovid… Ammo… lervia  5.55e4      5.08   3810     4         -999   14.9
+##  7 artio… bovid… Anti… marsup… 3   e4      5.72   3910     4.04      -999   10.2
+##  8 artio… bovid… Anti… cervic… 3.75e4      5.5    3846     2.13      -999   20.1
+##  9 artio… bovid… Bison bison   4.98e5      8.93  20000    10.7     157500   29.4
+## 10 artio… bovid… Bison bonasus 5   e5      9.14  23000.    6.6       -999   30.0
+## # ℹ 1,430 more rows
+## # ℹ 3 more variables: max_life <dbl>, litter_size <dbl>, litters_year <dbl>
+
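Notice in the `mutate_all(tolower)` output above that every column, including the numeric ones, became `<chr>`. One possible way to avoid that, sketched here on a made-up tibble (`toy` and its values are hypothetical), is to let `across()` pick only the character columns with `where(is.character)`.

```r
library(dplyr)

# Hypothetical toy data: two character columns and one numeric column
toy <- tibble(
  order  = c("Artiodactyla", "Carnivora"),
  family = c("Bovidae", "Felidae"),
  mass   = c(45375, 182375)
)

# tolower() is applied to every column, so mass is coerced to character
all_cols <- toy %>% mutate_all(tolower)

# tolower() is applied only to character columns, so mass stays numeric
chr_only <- toy %>% mutate(across(where(is.character), tolower))

class(all_cols$mass)
class(chr_only$mass)
```

This keeps the numeric columns usable for calculations after cleaning the text columns.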
+
+

if_else()

+

We will briefly introduce if_else() here because it +allows us to use mutate() but not have the entire column +affected in the same way. In a sense, this can function like find and +replace in a spreadsheet program. With if_else(), you first +specify a logical statement, then what needs to happen if the +statement returns TRUE, and lastly what needs to happen if +it’s FALSE.

+

Have a look at the data from mammals below. Notice that the values +for newborn include -999.00. This is sometimes used as a +placeholder for NA (but this is a really bad idea). We can use +if_else() to replace -999.00 with +NA.

+
mammals %>% 
+  select(genus, species, newborn) %>% 
+  arrange(newborn)
+
## # A tibble: 1,440 × 3
+##    genus       species        newborn
+##    <chr>       <chr>            <dbl>
+##  1 Ammodorcas  clarkei           -999
+##  2 Bos         javanicus         -999
+##  3 Bubalus     depressicornis    -999
+##  4 Bubalus     mindorensis       -999
+##  5 Capra       falconeri         -999
+##  6 Cephalophus niger             -999
+##  7 Cephalophus nigrifrons        -999
+##  8 Cephalophus natalensis        -999
+##  9 Cephalophus leucogaster       -999
+## 10 Cephalophus ogilbyi           -999
+## # ℹ 1,430 more rows
+
mammals %>% 
+  select(genus, species, newborn) %>%
+  mutate(newborn_new = if_else(newborn == -999.00, NA, newborn)) %>% 
+  arrange(newborn)
+
## # A tibble: 1,440 × 4
+##    genus       species        newborn newborn_new
+##    <chr>       <chr>            <dbl>       <dbl>
+##  1 Ammodorcas  clarkei           -999          NA
+##  2 Bos         javanicus         -999          NA
+##  3 Bubalus     depressicornis    -999          NA
+##  4 Bubalus     mindorensis       -999          NA
+##  5 Capra       falconeri         -999          NA
+##  6 Cephalophus niger             -999          NA
+##  7 Cephalophus nigrifrons        -999          NA
+##  8 Cephalophus natalensis        -999          NA
+##  9 Cephalophus leucogaster       -999          NA
+## 10 Cephalophus ogilbyi           -999          NA
+## # ℹ 1,430 more rows
+
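Here is a small, self-contained sketch of the same replacement on invented data (`toy` and its values are hypothetical). It uses `NA_real_` so the missing value has the same type as the numeric column, which works across dplyr versions, and it shows `na_if()` as one alternative when the only goal is turning a placeholder into `NA`.

```r
library(dplyr)

# Hypothetical toy data using -999 as a placeholder for missing values
toy <- tibble(
  species = c("a", "b", "c"),
  newborn = c(-999, 3246.36, 5480)
)

# if_else(condition, value if TRUE, value if FALSE)
with_if_else <- toy %>%
  mutate(newborn_new = if_else(newborn == -999, NA_real_, newborn))

# na_if() replaces one specific value with NA
with_na_if <- toy %>%
  mutate(newborn_new = na_if(newborn, -999))

with_if_else
```

Both versions write to a new column, so the original `newborn` values stay intact for comparison.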
+
+

Practice

+
    +
  1. We are interested in the family, genus, species and max life +variables. Because the max life span for several mammals is unknown, the +authors have used -999 in place of NA. Replace all of these values with +NA in a new column titled max_life_new. Finally, sort the +data in descending order by max_life_new. Which mammal has the oldest +known life span?
  2. +
+
+
+

That’s it! Let’s take a break and then move on to part 2!

+

–>Home

+
+ + + +
+
+ +
+ + + + + + + + + + + + + + + + diff --git a/lab6_2.html b/lab6_2.html new file mode 100644 index 0000000..b814c6d --- /dev/null +++ b/lab6_2.html @@ -0,0 +1,1717 @@ + + + + + + + + + + + + + + +dplyr Superhero + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+
+
+
+
+ +
+ + + + + + + +
+

Learning Goals

+

At the end of this exercise, you will be able to:
+1. Develop your dplyr superpowers so you can easily and confidently +manipulate dataframes.
+2. Learn helpful new functions that are part of the janitor +package.

+
+
+

Instructions

+

For the second part of lab today, we are going to spend time +practicing the dplyr functions we have learned and add a few new ones. +This lab doubles as your homework. Please complete the lab and push your +final code to GitHub.

+
+
+

Load the libraries

+
library("tidyverse")
+
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
+## ✔ dplyr     1.1.4     ✔ readr     2.1.5
+## ✔ forcats   1.0.0     ✔ stringr   1.5.1
+## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
+## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
+## ✔ purrr     1.0.2     
+## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
+## ✖ dplyr::filter() masks stats::filter()
+## ✖ dplyr::lag()    masks stats::lag()
+## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
+
library("janitor")
+
## 
+## Attaching package: 'janitor'
+## 
+## The following objects are masked from 'package:stats':
+## 
+##     chisq.test, fisher.test
+
+
+

Load the superhero data

+

These are data taken from comic books and assembled by fans. They +include a good mix of categorical and continuous data. Data taken from: +https://www.kaggle.com/claudiodavi/superhero-set

+

Check out the way I am loading these data. If I know there are NAs, I +can take care of them at the beginning. But, we should do this very +cautiously. At times it is better to keep the original columns and data +intact.

+
#superhero_info <- read_csv("data/heroes_information.csv", na = c("", "-99", "-"))
+#superhero_powers <- read_csv("data/super_hero_powers.csv", na = c("", "-99", "-"))
+
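To make the effect of the `na =` argument concrete, here is a sketch that reads a tiny inline CSV twice. The CSV text, the hero names, and the object names are all made up for illustration; `I()` tells `read_csv()` to treat the string as literal data rather than a file path (this assumes readr 2.0 or later).

```r
library(readr)

# Hypothetical CSV text that uses "-99", "-", and blanks for missing values
csv_text <- "name,height,publisher\nHero A,-99,Marvel Comics\nHero B,180,-\nHero C,,DC Comics\n"

# Default behavior: only "" and "NA" are treated as missing
as_loaded <- read_csv(I(csv_text), show_col_types = FALSE)

# Declare the placeholders as NA while loading
cleaned <- read_csv(I(csv_text), na = c("", "-99", "-"), show_col_types = FALSE)

as_loaded
cleaned
```

One reason for caution: supplying `na =` replaces the default set, so a literal `"NA"` string would no longer be read as missing unless you add it to the vector yourself.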
+
+

Data tidy

+
    +
  1. Some of the names used in the superhero_info data are +problematic so you should rename them here. Before you do anything, +first have a look at the names of the variables. You can use +rename() or clean_names().
  2. +
+
+
+

tabyl

+

The janitor package has many awesome functions that we +will explore. Here is its version of table which not only +produces counts but also percentages. Very handy! Let’s use it to +explore the proportion of good guys and bad guys in the +superhero_info data.

+
#tabyl(superhero_info, alignment)
+
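Since the chunk above is commented out, here is a sketch of what `tabyl()` returns, run on a made-up data frame (`toy` and its values are hypothetical, not the superhero data).

```r
library(janitor)

# Hypothetical toy data with one categorical column
toy <- data.frame(alignment = c("good", "good", "bad", "neutral"))

# tabyl() returns one row per category with a count and a proportion
counts <- tabyl(toy, alignment)

counts
```

The `percent` column holds proportions between 0 and 1. If you prefer formatted percentages, janitor's `adorn_pct_formatting()` can be piped onto the result.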
    +
  1. Who are the publishers of the superheroes? Show the proportion of +superheroes from each publisher. Which publisher has the highest number +of superheroes?

  2. +
  3. Notice that we have some neutral superheroes! Who are they? List +their names below.

  4. +
+
+
+

superhero_info

+
    +
  1. Let’s say we are only interested in the variables name, alignment, +and “race”. How would you isolate these variables from +superhero_info?
  2. +
+
+
+

Not Human

+
    +
  1. List all of the superheroes that are not human.
  2. +
+
+
+

Good and Evil

+
    +
  1. Let’s make two different data frames, one focused on the “good +guys” and another focused on the “bad guys”.

  2. +
  3. For the good guys, use the tabyl function to +summarize their “race”.

  4. +
  5. Among the good guys, who are the Vampires?

  6. +
  7. Among the bad guys, who are the male humans over 200 inches in +height?

  8. +
  9. Are there more good guys or bad guys with green hair?

  10. +
  11. Let’s explore who the really small superheroes are. In the +superhero_info data, which have a weight less than 50? Be +sure to sort your results by weight lowest to highest.

  12. +
  13. Let’s make a new variable that is the ratio of height to weight. +Call this variable height_weight_ratio.

  14. +
  15. Who has the highest height to weight ratio?

  16. +
+
+
+

superhero_powers

+

Have a quick look at the superhero_powers data +frame.

+
    +
  1. How many superheroes have a combination of agility, stealth, +super_strength, stamina?
  2. +
+
+
+

Your Favorite

+
    +
  1. Pick your favorite superhero and let’s see their powers!

  2. +
  3. Can you find your hero in the superhero_info data? Show their +info!

  4. +
+
+
+

Push your final code to GitHub!

+

Please be sure that you check the keep md file option in the +knit preferences.

+
+ + + +
+
+ +
+ + + + + + + + + + + + + + + +