---
title: 'Polynomials and bootstrapping'
output:
  xaringan::moon_reader:
    lib_dir: libs
    css: [default, rladies, rladies-fonts, "my-theme.css"]
    incremental: true
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
```{r, echo=F, message=FALSE, warning=FALSE}
options(scipen = 999)
library(tidyverse)
library(knitr)
library(broom)
# function to display only part of the output
hook_output <- knit_hooks$get("output")
knit_hooks$set(output = function(x, options) {
  lines <- options$output.lines
  if (is.null(lines)) {
    return(hook_output(x, options))  # pass to default hook
  }
  x <- unlist(strsplit(x, "\n"))
  more <- "..."
  if (length(lines) == 1) {  # first n lines
    if (length(x) > lines) {
      # truncate the output, but add ...
      x <- c(head(x, lines), more)
    }
  } else {
    x <- c(more, x[lines], more)
  }
  # paste these lines together
  x <- paste(c(x, ""), collapse = "\n")
  hook_output(x, options)
})
knitr::opts_chunk$set(message = FALSE) # suppress messages
```
class: inverse
## Non-linear relationships
Straight lines often make bad predictions -- very few of the processes we study are actually linear. For example, effort often has diminishing returns (e.g., log functions), and small advantages early in life can have outsized effects on mid-life outcomes (e.g., exponential functions). When the direction of the effect is constant but its magnitude changes, the best way to handle the data is to transform a variable (usually the outcome) and run linear analyses.
```{r, eval = F}
log_y = log(y)  # natural log; R has no ln(), and log() defaults to base e
lm(log_y ~ x)
```
---
## Polynomial relationships
Other processes involve changes in the direction of a relationship -- for example, a small amount of anxiety is believed to be beneficial for performance on some tasks (positive direction), but too much is detrimental (negative direction). When the shape of the effect includes one or more changes in direction, polynomial terms may be more appropriate.
It should be noted that polynomials are often a poor approximation for a non-linear effect of X on Y. However, correctly testing for non-linear effects usually requires (a) a lot of data and (b) making a number of assumptions about the data. Polynomial regression can be a useful tool for exploratory analysis and in cases when data are limited in terms of quantity and/or quality.
---
## Polynomial regression
Polynomial regression is most often a form of hierarchical regression that systematically tests a series of higher-order functions of a single variable; each model can be compared to the one before it, as sketched on the next slide.
$$
\begin{aligned}
\large \textbf{Linear: } \hat{Y} &= b_0 + b_1X \\
\large \textbf{Quadratic: } \hat{Y} &= b_0 + b_1X + b_2X^2 \\
\large \textbf{Cubic: } \hat{Y} &= b_0 + b_1X + b_2X^2 + b_3X^3\\
\end{aligned}
$$
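---
### Testing polynomial terms

Each model can be compared to the one before it with an incremental F-test. A minimal sketch, assuming a data frame `df` with predictor `x` and outcome `y`:

```{r, eval = F}
m1 = lm(y ~ x, data = df)                    # linear
m2 = lm(y ~ x + I(x^2), data = df)           # quadratic
m3 = lm(y ~ x + I(x^2) + I(x^3), data = df)  # cubic
anova(m1, m2, m3)  # does each higher-order term significantly improve fit?
```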
---
### Example
Can a team have too much talent? Researchers hypothesized that teams with too many talented players have poor intrateam coordination and therefore perform worse than teams with a moderate amount of talent. To test this hypothesis, they looked at 208 international football teams. Talent was the percentage of players during the 2010 and 2014 World Cup Qualifications phases who also had contracts with elite club teams. Performance was the number of points the team earned during these same qualification phases.
```{r, message=F}
football = read.csv("https://raw.githubusercontent.com/uopsych/psy612/master/data/swaab.csv")
```
.small[Swaab, R. I., Schaerer, M., Anicich, E. M., Ronay, R., and Galinsky, A. D. (2014). [The too-much-talent effect: Team interdependence determines when more talent is too much or not enough.](https://www8.gsb.columbia.edu/cbs-directory/sites/cbs-directory/files/publications/Too%20much%20talent%20PS.pdf) _Psychological Science, 25_(8), 1581-1591.]
---
.pull-left[
```{r, message = F, warning = F}
head(football)
```
]
.pull-right[
```{r, warning = F}
ggplot(football, aes(x = talent, y = points)) +
  geom_point() +
  geom_smooth(se = F) +
  theme_bw(base_size = 20)
```
]
---
```{r mod1, warning = F, message = F, fig.width = 10, fig.height = 5}
mod1 = lm(points ~ talent, data = football)
library(broom)
aug1 = augment(mod1)
ggplot(aug1, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_smooth(se = F) +
  theme_bw(base_size = 20)
```
---
```{r mod2}
mod2 = lm(points ~ talent + I(talent^2), data = football)
summary(mod2)
```
---
```{r, message = F}
library(sjPlot)
plot_model(mod2, type = "pred",
terms = c("talent"))
```
---
## Interpretation
The intercept is the predicted value of Y when X = 0 -- this is always the interpretation of the intercept, no matter what kind of regression model you're running.
The $b_1$ coefficient is the slope of the tangent to the curve at X = 0. In other words, this is the rate of change when X is equal to 0. If 0 is not a meaningful value of X, you may want to center X at a meaningful value (such as its mean), so that $b_1$ gives the rate of change there.
```{r}
football$talent_c50 = football$talent - 50
mod2_50 = lm(points ~ talent_c50 + I(talent_c50^2), data = football)
```
---
```{r}
summary(mod2_50)
```
---
```{r, echo = FALSE, warning= FALSE}
plot_model(mod2_50, type = "pred", terms = c("talent_c50"))
```
---
Or you can choose another value to center your predictor on, if there's a value that has a particular meaning or interpretation.
```{r}
football$talent_c = football$talent - mean(football$talent)
mod2_c = lm(points ~ talent_c + I(talent_c^2), data = football)
```
---
```{r}
mean(football$talent)
summary(mod2_c)
```
---
```{r, echo = FALSE, warning= FALSE}
plot_model(mod2_c, type = "pred", terms = "talent_c")
```
---
## Interpretation
The $b_2$ coefficient indexes the acceleration -- how quickly the slope changes. More specifically, $2 \times b_2$ is the acceleration: the rate of change in the slope ($b_1$) for a 1-unit change in X.
You can use this to calculate the slope of the tangent line at any value of X you're interested in:
$$\large b_1 + (2\times b_2\times X)$$
---
```{r}
tidy(mod2)
```
.pull-left[
**At X = 10**
```{r}
54.9 + (2*-.570*10)
```
]
.pull-right[
**At X = 70**
```{r}
54.9 + (2*-.570*70)
```
]
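---
The same tangent slopes can be computed from the fitted model rather than from rounded coefficients -- a small sketch, using `mod2` from above:

```{r}
b = coef(mod2)                           # b[2] is b1, b[3] is b2
tangent = function(x) b[2] + 2*b[3]*x    # slope of the tangent line at x
tangent(10)
tangent(70)
```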
---
## Polynomials are interactions
A term for $X^2$ is a term for $X \times X$: the multiplication of two predictors that hold the same values.
```{r}
football$talent_2 = football$talent*football$talent
tidy(lm(points ~ talent + talent_2, data = football))
```
---
## Polynomials are interactions
Put another way:
$$\large \hat{Y} = b_0 + b_1X + b_2X^2$$
$$\large \hat{Y} = b_0 + \frac{b_1}{2}X + \frac{b_1}{2}X + b_2(X \times X)$$
The interaction term in another model would be interpreted as "how does the slope of X change as I move up in Z?" -- here, we ask "how does the slope of X change as we move up in X?"
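---
## Polynomials are interactions

An equivalent way to fit the quadratic model is with `poly()`; setting `raw = TRUE` uses raw (rather than orthogonal) polynomial terms, so the coefficients match `mod2`:

```{r}
tidy(lm(points ~ poly(talent, 2, raw = TRUE), data = football))
```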
---
## When should you use polynomial terms?
You may choose to fit a polynomial term after looking at a scatterplot of the data or at residual plots. A U-shaped curve may indicate that you need to fit a quadratic form -- although, as we discussed before, you may actually be measuring a different kind of non-linear relationship.
Polynomial terms should mostly be dictated by theory -- if you don't have a good reason for thinking there will be a change in sign, then a polynomial is not right for you.
And, of course, if you fit a polynomial regression, be sure to once again check your diagnostics before interpreting the coefficients.
---
```{r, warning = F, message = F, fig.width=10, fig.height=5}
aug2 = augment(mod2)
ggplot(aug2, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_smooth(se = F) +
theme_bw(base_size = 20)
```
---
```{r, warning = F, message = F, fig.width=10, fig.height=7}
plot_model(mod2, type = "pred", show.data = T)
```
---
## Bootstrapping
In bootstrapping, the theoretical sampling distribution is assumed to be unknown or unverifiable. Under the weak assumption that the sample in hand is representative of some population, then that population sampling distribution can be built empirically by randomly sampling with replacement from the sample.
The resulting empirical sampling distribution can be used to construct confidence intervals and make inferences.
---
### Illustration
Imagine you had a sample of 6 people: Rachel, Monica, Phoebe, Joey, Chandler, and Ross. To bootstrap their heights, you would draw from this group many samples of 6 people *with replacement*, each time calculating the average height of the sample.
```{r, echo = F}
friends = c("Rachel", "Monica", "Phoebe", "Joey", "Chandler", "Ross")
heights = c(65, 65, 68, 70, 72, 73)
names(heights) = friends
(sample1 = sample(friends, size = 6, replace = T)); mean(heights[sample1])
(sample1 = sample(friends, size = 6, replace = T)); mean(heights[sample1])
(sample1 = sample(friends, size = 6, replace = T)); mean(heights[sample1])
(sample1 = sample(friends, size = 6, replace = T)); mean(heights[sample1])
(sample1 = sample(friends, size = 6, replace = T)); mean(heights[sample1])
(sample1 = sample(friends, size = 6, replace = T)); mean(heights[sample1])
(sample1 = sample(friends, size = 6, replace = T)); mean(heights[sample1])
(sample1 = sample(friends, size = 6, replace = T)); mean(heights[sample1])
```
???
```{r}
heights
```
---
### Illustration
```{r}
boot = 10000
friends = c("Rachel", "Monica", "Phoebe", "Joey", "Chandler", "Ross")
heights = c(65, 65, 68, 70, 72, 73)
sample_means = numeric(length = boot)
for(i in 1:boot){
  this_sample = sample(heights, size = length(heights), replace = T)
  sample_means[i] = mean(this_sample)
}
```
---
## Illustration
```{r, echo = F, message = F, fig.width = 10, fig.height=6, warning = F}
library(ggpubr)
mu = mean(heights)
sem = sd(heights)/sqrt(length(heights))
cv_t = qt(p = .975, df = length(heights) - 1)
bootstrapped = data.frame(means = sample_means) %>%
  ggplot(aes(x = means)) +
  geom_histogram(color = "white") +
  geom_density() +
  geom_vline(aes(xintercept = mean(sample_means), color = "mean"), size = 2) +
  geom_vline(aes(xintercept = median(sample_means), color = "median"), size = 2) +
  geom_vline(aes(xintercept = quantile(sample_means, probs = .025), color = "Lower 2.5%"), size = 2) +
  geom_vline(aes(xintercept = quantile(sample_means, probs = .975), color = "Upper 2.5%"), size = 2) +
  scale_x_continuous(limits = c(mu - 3*sem, mu + 3*sem)) +
  ggtitle("Bootstrapped distribution") +
  cowplot::theme_cowplot()
from_prob = data.frame(means = seq(from = min(sample_means), to = max(sample_means))) %>%
  ggplot(aes(x = means)) +
  stat_function(fun = function(x) dnorm(x, mean = mu, sd = sem)) +
  geom_vline(aes(xintercept = mean(heights), color = "mean"), size = 2) +
  geom_vline(aes(xintercept = median(heights), color = "median"), size = 2) +
  geom_vline(aes(xintercept = mu - cv_t*sem, color = "Lower 2.5%"), size = 2) +
  geom_vline(aes(xintercept = mu + cv_t*sem, color = "Upper 2.5%"), size = 2) +
  scale_x_continuous(limits = c(mu - 3*sem, mu + 3*sem)) +
  ggtitle("Distribution from probability theory") +
  cowplot::theme_cowplot()
ggarrange(bootstrapped, from_prob, ncol = 1)
```
---
### Example
Consider a sample of 216 response times. What are its central tendency and variability?
There are several candidates for central tendency (e.g., mean, median) and for variability (e.g., standard deviation, interquartile range). Some of these do not have well-understood theoretical sampling distributions.
For the mean and standard deviation, we have theoretical sampling distributions to help us, provided we think the mean and standard deviation are the best indices. For the others, we can use bootstrapping.
---
```{r, echo = F, message=F, fig.width=10, fig.height=8}
library(tidyverse)
set.seed(03102020)
response = rf(n = 216, 3, 50)
response = response*500 + rnorm(n = 216, mean = 200, sd = 100)
values = quantile(response, probs = c(.025, .5, .975))
mean_res = mean(response)
data.frame(x = response) %>%
  ggplot(aes(x)) +
  geom_histogram(aes(y = ..density..), binwidth = 150,
                 fill = "lightgrey", color = "black") +
  geom_density() +
  geom_vline(aes(xintercept = values[1], color = "Lower 2.5%"), size = 2) +
  geom_vline(aes(xintercept = values[2], color = "Median"), size = 2) +
  geom_vline(aes(xintercept = values[3], color = "Upper 2.5%"), size = 2) +
  geom_vline(aes(xintercept = mean_res, color = "Mean"), size = 2) +
  labs(x = "Response time (ms)", title = "Response Time Distribution") +
  cowplot::theme_cowplot(font_size = 20)
```
---
### Bootstrapping
Before now, if we wanted to estimate the mean and the 95% confidence interval around the mean, we would find the theoretical sampling distribution by scaling a t-distribution to be centered on the mean of our sample and have a standard deviation equal to $\frac{s}{\sqrt{N}}.$ But we have to make many assumptions to use this sampling distribution, and we may have good reason not to.
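For reference, that theoretical interval can be computed directly from the sample (a quick sketch, using the `response` vector from the previous slide):

```{r}
m = mean(response)
se = sd(response)/sqrt(length(response))
m + qt(c(.025, .975), df = length(response) - 1)*se  # theoretical 95% CI
```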
Instead, we can build a population sampling distribution empirically by randomly sampling with replacement from the sample.
---
## Response time example
```{r}
boot = 10000
response_means = numeric(length = boot)
for(i in 1:boot){
  sample_response = sample(response, size = 216, replace = T)
  response_means[i] = mean(sample_response)
}
```
```{r, echo = F, message = F, fig.width = 10, fig.height = 5}
data.frame(means = response_means) %>%
  ggplot(aes(x = means)) +
  geom_histogram(color = "white") +
  geom_density() +
  geom_vline(aes(xintercept = mean(response_means), color = "mean"), size = 2) +
  geom_vline(aes(xintercept = median(response_means), color = "median"), size = 2) +
  geom_vline(aes(xintercept = quantile(response_means, probs = .025), color = "Lower 2.5%"), size = 2) +
  geom_vline(aes(xintercept = quantile(response_means, probs = .975), color = "Upper 2.5%"), size = 2) +
  cowplot::theme_cowplot()
```
---
```{r}
mean(response_means)
median(response_means)
quantile(response_means, probs = c(.025, .975))
```
What about something like the median?
---
### Bootstrapped distribution of the median
```{r}
boot = 10000
response_med = numeric(length = boot)
for(i in 1:boot){
  sample_response = sample(response, size = 216, replace = T)
  response_med[i] = median(sample_response)
}
```
.pull-left[
```{r echo=F, message=FALSE}
data.frame(medians = response_med) %>%
  ggplot(aes(x = medians)) +
  geom_histogram(aes(y = ..density..),
                 color = "white", fill = "grey") +
  geom_density() +
  geom_vline(aes(xintercept = mean(response_med), color = "mean"), size = 2) +
  geom_vline(aes(xintercept = median(response_med), color = "median"), size = 2) +
  geom_vline(aes(xintercept = quantile(response_med, probs = .025), color = "Lower 2.5%"), size = 2) +
  geom_vline(aes(xintercept = quantile(response_med, probs = .975), color = "Upper 2.5%"), size = 2) +
  cowplot::theme_cowplot()
```
]
.pull-right[
```{r}
mean(response_med)
median(response_med)
quantile(response_med,
probs = c(.025, .975))
```
]
---
### Bootstrapped distribution of the standard deviation
```{r}
boot = 10000
response_sd = numeric(length = boot)
for(i in 1:boot){
  sample_response = sample(response, size = 216, replace = T)
  response_sd[i] = sd(sample_response)
}
```
.pull-left[
```{r echo=F, message=FALSE}
data.frame(sds = response_sd) %>%
  ggplot(aes(x = sds)) +
  geom_histogram(aes(y = ..density..), color = "white", fill = "grey") +
  geom_density() +
  geom_vline(aes(xintercept = mean(response_sd), color = "mean"), size = 2) +
  geom_vline(aes(xintercept = median(response_sd), color = "median"), size = 2) +
  geom_vline(aes(xintercept = quantile(response_sd, probs = .025), color = "Lower 2.5%"), size = 2) +
  geom_vline(aes(xintercept = quantile(response_sd, probs = .975), color = "Upper 2.5%"), size = 2) +
  cowplot::theme_cowplot()
```
]
.pull-right[
```{r}
mean(response_sd)
median(response_sd)
quantile(response_sd,
probs = c(.025, .975))
```
]
---
You can bootstrap estimates and 95% confidence intervals for *any* statistic you need to estimate.
The `boot` package provides functions to speed this process along.
```{r}
library(boot)
# function to obtain R-squared from the data
rsq <- function(data, indices) {
  d <- data[indices, ]                  # allows boot to select a sample
  fit <- lm(mpg ~ wt + disp, data = d)  # the model you would have run anyway
  return(summary(fit)$r.squared)
}
# bootstrapping with 10000 replications
results <- boot(data = mtcars, statistic = rsq, R = 10000)
```
---
.pull-left[
```{r}
data.frame(rsq = results$t) %>%
  ggplot(aes(x = rsq)) +
  geom_histogram(color = "white", bins = 30)
```
]
.pull-right[
```{r}
median(results$t)
boot.ci(results, type = "perc")
```
]
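---
`boot.ci()` supports several interval types; the bias-corrected and accelerated (BCa) interval is a common alternative to the simple percentile interval because it adjusts for bias and skew in the bootstrapped distribution:

```{r}
boot.ci(results, type = "bca")
```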
---
### Example 2
Samples of service waiting times for Verizon's own (ILEC) customers versus customers of competing carriers (CLEC). In this district, Verizon must provide line service to all customers or else face a fine. The question is whether the non-Verizon customers are being ignored, facing longer or more variable waiting times.
```{r, message = F, warning = F, echo = 2}
library(here)
Verizon = read.csv(here("data/Verizon.csv"))
```
```{r, echo = F, fig.width = 10, fig.height = 4}
Verizon %>%
  ggplot(aes(x = Time, fill = Group)) +
  geom_histogram(bins = 30) +
  guides(fill = F) +
  facet_wrap(~Group, scales = "free_y")
```
---
```{r, echo = F, fig.width = 10, fig.height = 6}
Verizon %>%
  ggplot(aes(x = Time, fill = Group)) +
  geom_histogram(bins = 50, position = "dodge") +
  guides(fill = F)
table(Verizon$Group)
```
---
There's no world in which these data meet the typical assumptions of an independent-samples t-test. To estimate mean differences we can use bootstrapping. Here, we'll resample with replacement separately from the two samples.
```{r}
boot = 10000
difference = numeric(length = boot)
subsample_CLEC = Verizon %>% filter(Group == "CLEC")
subsample_ILEC = Verizon %>% filter(Group == "ILEC")
for(i in 1:boot){
  sample_CLEC = sample(subsample_CLEC$Time,
                       size = nrow(subsample_CLEC),
                       replace = T)
  sample_ILEC = sample(subsample_ILEC$Time,
                       size = nrow(subsample_ILEC),
                       replace = T)
  difference[i] = mean(sample_CLEC) - mean(sample_ILEC)
}
```
---
```{r echo=F, message=FALSE, fig.width=10, fig.height=5}
data.frame(differences = difference) %>%
  ggplot(aes(x = differences)) +
  geom_histogram(aes(y = ..density..), color = "white", fill = "grey") +
  geom_density() +
  geom_vline(aes(xintercept = mean(differences), color = "mean"), size = 2) +
  geom_vline(aes(xintercept = median(differences), color = "median"), size = 2) +
  geom_vline(aes(xintercept = quantile(differences, probs = .025), color = "Lower 2.5%"), size = 2) +
  geom_vline(aes(xintercept = quantile(differences, probs = .975), color = "Upper 2.5%"), size = 2) +
  cowplot::theme_cowplot()
```
The difference in means is `r round(median(difference),2)` $[`r round(quantile(difference, probs = .025),2)`,`r round(quantile(difference, probs = .975),2)`]$.
---
### Bootstrapping Summary
Bootstrapping can be a useful tool to estimate parameters when
1. you've violated assumptions of the test (e.g., normality, homoskedasticity)
2. you have good reason to believe the sampling distribution is not normal, but don't know what it is
3. there are other oddities in your data, like very unbalanced samples
This allows you to create a confidence interval around any statistic you want -- Cronbach's alpha, ICC, Mahalanobis Distance, $R^2$, AUC, etc.
* You can test whether these statistics are significantly different from any other value -- how? (one approach is sketched on the next slide)
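---
### Testing against a value

One approach, sketched here with the bootstrapped medians from earlier: check whether the hypothesized value falls outside the bootstrapped 95% confidence interval (the percentile interval; `null_value` is just a hypothetical comparison value).

```{r}
null_value = 1000  # hypothetical value to test against
ci = quantile(response_med, probs = c(.025, .975))
null_value < ci[1] | null_value > ci[2]  # TRUE = significantly different at alpha = .05
```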
---
### Bootstrapping Summary
Bootstrapping will NOT help you deal with:
* dependence between observations -- for this, you'll need to explicitly model the dependence (e.g., multilevel model, repeated-measures ANOVA)
* improperly specified models or forms -- use theory to guide you here
* measurement error -- why bother?
---
class: inverse
## Next time...