15 Factors.Rmd

---
title: "Factors"
author: "Russ Conte"
date: "10/21/2018"
output: html_document
---
## 15.1 Prerequisites
```{r}
library(tidyverse)
library(forcats)
```

Let's look at an example where we have some variable that records months of the year:

```{r}
months <- c("Jan", "Feb", "Mar")
# it is possible to mistype a month, and this is bad:
months1 <- c("Jann", "Deb", "Car")
#also note that these don't sort in the order we want:
sort(months)
```

To fix this we create <i>levels</i> for our factor:

```{r}
months_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                   "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
```

Now it's possible to create a factor with levels:

```{r}
y1 <- factor(months, levels=months_levels)
y1
```

Note that any factor not in the levels will be converted to NA:

```{r}
y2 <- factor(months1, levels=months_levels)
y2
levels(y2)
```

Note that it is always possible to access the levels of a factor: levels(factor_name)

From here to the end of this chapter we'll look at the General Social Survey, and address some of the
challenges working with factors:

```{r}
View(gss_cat)
summary(gss_cat)
mean(gss_cat$tvhours,na.rm = TRUE)
```

```{r}
gss_cat %>% 
  count(race)
```

```{r}
gss_cat %>% 
  count(marital)
```

Let's plot the results by race:

```{r}
ggplot(gss_cat, aes(race)) +
  geom_bar() +
  scale_x_discrete(drop=FALSE)
```

## 15.3.1 Exercises

1. Explore the distribution of rincome (reported income). What makes the default bar chart hard to understand? How could you improve the plot?

```{r}
ggplot(gss_cat, aes(rincome)) +
  geom_bar()+
  scale_x_discrete(drop=FALSE)
```

The thing that makes is difficult is the text on the x-axis. It's easy to rotate the text so it's easier to read.

```{r}
ggplot(gss_cat, aes(rincome)) +
  geom_bar()+
  scale_x_discrete(drop=FALSE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```

2. What is the most common relig in this survey? Protestant

```{r}
ggplot(gss_cat, aes(relig)) +
  geom_bar()+
  scale_x_discrete(drop=FALSE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```

2a. What’s the most common partyid? Looking at it graphically, Independent

```{r}
ggplot(gss_cat, aes(partyid)) +
  geom_bar()+
  scale_x_discrete(drop=FALSE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```

Looking at it numerically:

```{r}
gss_cat %>% 
  count(partyid)
```

3. Which relig does denom (denomination) apply to? How can you find out with a table? How can you find out with a visualisation?

```{r}
ggplot(gss_cat, aes(relig)) +
  geom_bar()+
  scale_x_discrete(drop=FALSE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```

```{r}
levels(gss_cat$relig)
```

```{r}
gss_cat %>% 
  count(relig)
```

## 15.4 Modifying Factor Order

Example: Let's look at hours watching TV measured across regions:

```{r}
relig <- gss_cat %>% 
  group_by(relig) %>% 
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours=mean(tvhours, na.rm = TRUE),
    n=n()
    )
ggplot(relig, aes(tvhours, relig)) + geom_point()
```

Let's improve on this chart:

```{r}
ggplot(relig, aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point()
```

Hadley and Garrett recommend moving transformations as above out of aes() and into mutate. Same as above, using mutate:

```{r}
relig %>% 
  mutate(relig = fct_reorder(relig, tvhours)) %>% 
  ggplot(aes(tvhours, relig))+
  geom_point()
```

What does the average amount of tv time look like when measured against income?

```{r}
rincome <- gss_cat %>% 
  group_by(rincome) %>% 
  summarise(
    age = mean(age, na.rm=TRUE),
    tvhours = mean(tvhours, na.rm=TRUE),
    n=n()
  )

ggplot(rincome, aes(age, fct_reorder(rincome, tvhours))) +
         geom_point()
```

```{r}
ggplot(
  rincome,
  aes(age, fct_relevel(rincome, "Not applicable"))
) +
  geom_point()
```

(from the text) Let's use fct_reorder2() to reorder the factor by the y values associated with the largest x values:

```{r}
by_age <- gss_cat %>% 
  filter(!is.na(age)) %>% 
  count(age, marital) %>% 
  group_by(age) %>% 
  mutate(prop = n / sum(n))

ggplot(by_age, aes(age, prop, color=marital)) +
  geom_line(na.rm = TRUE)
```


Let's look at how to do this with bar plots:

```{r}
gss_cat %>% 
  mutate(marital = marital %>% 
           fct_infreq() %>% 
           fct_rev()) %>% 
           ggplot(aes(marital)) +
           geom_bar()
```


## 15.4 1 Exercises

There are some suspiciously high numbers in tvhours. Is the mean a good summary?

I trust median more than mean, the median will be less impacted by outliers:

```{r}
mean(gss_cat$tvhours, na.rm = TRUE)
median(gss_cat$tvhours, na.rm = TRUE)
```

2. For each factor in gss_cat identify whether the order of the levels is arbitrary or principled.

```{r}
sapply(gss_cat, levels)
```

Only income is principled, the rest are arbitrary.

3. Why did moving "Not Applicable" to the front of the levels move it to the bottom of the plot? Note - there were 7,043 "not applicable" responses in income:

```{r}
gss_cat %>% 
  count(rincome)
```

The position on the plot is determined by factor levels (of course!) and fct_relevel moves "Not Applicable" to the bottom of the plot. Problem solved :)