forked from jmledford3115/datascibiol
-
Notifications
You must be signed in to change notification settings - Fork 0
/
lab4_2.Rmd
169 lines (129 loc) · 7.1 KB
/
lab4_2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
---
title: "Dealing with NA's"
author: "Joel Ledford"
date: "Winter 2019"
output:
html_document:
theme: spacelab
toc: yes
toc_float: yes
pdf_document:
toc: yes
---
## Review
In the last section we practiced wrangling untidy data using `tidyr`. We also learned the `summarize()` function and `group_by()` to produce useful summaries of our data. But, we ended the last session with the discovery of NA's and how they can impact analyses. This is a big issue in data science and we will spend the remainder of this lab learning how to manage data with missing values.
## Learning Goals
*At the end of this exercise, you will be able to:*
1. Produce summaries of the number of NA's in a data set.
2. Replace values with `NA` in a data set as appropriate.
## Load the tidyverse
```{r}
library("tidyverse")
```
## Dealing with NA's
In almost all scientific data, there are missing observations. These can be tricky to deal with, partly because you first need to determine how missing values were treated in the original study. Scientists use different conventions in showing missing data; some use blank spaces, others use "-", etc. Unfourtunately, sometimes **missing data is indicated with numerics like -999.0 or zero!**, though this can be required in some situations (for example raster data). Often, a combination of methods are used. It is up to the data analyst to find out how missing values are represented in the data set and to deal with them appropriately.
## Practice
1. What are some possible problems if missing data are indicated by "-999.0" or "0"?
## Load the `msleep` data into a new object
```{r}
msleep <-
msleep
```
## Are there any NA's?
Let's first check to see if our data has any NA's. is.na() is a function that determines whether a value in a data frame is or is not an NA. This is evaluated logically as TRUE or FALSE.
```{r}
is.na(msleep)
```
OK, what are we supposed to do with that? Unless you have a small data frame, applying the is.na function universally is not helpful but we can use the code in another way. Let's incorporate it into the `summarize()` function.
```{r}
msleep %>%
summarize(number_nas = sum(is.na(msleep)))
```
This is better, but we still don't have any idea of where those NA's are in our data frame. If there were a systemic problem in the data it would be hard to determine. In order to do this, we need to apply `is.na` to each variable of interest.
```{r}
msleep %>%
summarize(number_nas = sum(is.na(conservation)))
```
The common `summary()` function in base R will tally NAs by variable, but only for categorical/factor, numeric, integer, or logical columns. What about any NA values in the character columns?
```{r}
summary(msleep)
```
What if we are working with hundreds or thousands (or more!) variables?! In order to deal with this problem efficiently we can use another package in the tidyverse called `purrr`.
```{r}
msleep_na <-
msleep %>%
purrr::map_df(~ sum(is.na(.))) #map to a new data frame the sum results of the is.na function for all columns
msleep_na
```
Don't forget that we can use `gather()` to make viewing this output easier.
```{r}
msleep %>%
purrr::map_df(~ sum(is.na(.))) %>%
tidyr::gather(key = "variables", value = "num_nas") %>%
arrange(desc(num_nas))
```
This is much better, but we need to be careful. R can have difficulty interpreting missing data. This is especially true for categorical variables. Always do a reality check if the output doesn't make sense to you. A quick check never hurts and can save you from waisting time in the future or from drawing false conclusions.
You can explore a specific variable more intently using `count()`. This operates similarly to `group_by()`.
```{r}
msleep %>%
count(conservation)
```
Adding the `sort=TRUE` option automatically makes a descending list.
```{r}
msleep %>%
count(conservation, sort = TRUE)
```
It is true that all of this is redundant, but you want to be able to run multiple checks on the data. Remember, just because your code runs without errors doesn't mean it is doing what you intended.
## Replacing NA's
The bit of code below is very handy, especially if the data has NA's represented as **actual values that you want replaced** or if you want to make sure any blanks are treated as NA. As was mentioned in Lab 1, NAs have special properties in R, so if your data has missing observations, it is important all missing values are represented with NA.
```{r}
msleep_na2 <-
msleep %>%
na_if("") #replace x data with NA
msleep_na2
```
## Practice
1. Did replacing blanks with NA have any effect on the msleep data? Demonstrate this using some code.
## Practice and Homework
We will work together on the next part (time permitting) and this will end up being your homework. Make sure that you save your work and copy and paste your responses into the RMarkdown homework file.
Aren't mammals fun? Let's load up some more mammal data. This will be life history data for mammals. The [data](http://esapubs.org/archive/ecol/E084/093/) are from: *S. K. Morgan Ernest. 2003. Life history characteristics of placental non-volant mammals. Ecology 84:3402.*
```{r}
life_history <-
readr::read_csv("data/mammal_lifehistories_v2.csv")
```
Rename some of the variables. Notice that I am replacing the old `life_history` data.
```{r}
life_history <-
life_history %>%
dplyr::rename(
genus = Genus,
wean_mass = `wean mass`,
max_life = `max. life`,
litter_size = `litter size`,
litters_yr = `litters/year`
)
```
1. Explore the data using the method that you prefer. Below, I am using a new package called `skimr`. If you want to use this, make sure that it is installed.
```{r}
#install.packages("skimr")
```
```{r}
library("skimr")
```
```{r}
life_history %>%
skimr::skim()
```
2. Run the code below. Are there any NA's in the data? Does this seem likely? Do the mean values above make sense for this data?
```{r}
msleep %>%
summarize(number_nas = sum(is.na(life_history)))
```
3. Are there any missing data (NA's) represented by different values? How much and where? In which variables do we have the most missing data? Can you think of a reason why so much data are missing in this variable?
4. Compared to the `msleep` data, we have better representation among taxa. Produce a summary that shows the number of observations by taxonomic order.
5. Mammals have a range of life histories, including lifespan. Produce a summary of lifespan in years by order. Be sure to include the minimum, maximum, mean, standard deviation, and total n.
6. Let's look closely at gestation and newborns. Summarize the mean gestation, newborn mass, and weaning mass by order. Add a new column that shows mean gestation in years and sort in descending order. Which group has the longest mean gestation? What is the common name for these mammals?
7. To save us the hassle in the future, how could we tell R how NA values are represented in our data file when we read it in?
## Wrap-up
Please review the learning goals and be sure to use the code here as a reference when completing the homework.
-->[Home](https://jmledford3115.github.io/datascibiol/)