---
title: 'Polynomials and bootstrapping'
output:
  xaringan::moon_reader:
    lib_dir: libs
    css: [default, rladies, rladies-fonts, "my-theme.css"]
    incremental: true
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
```{r, echo=F, message=FALSE, warning=FALSE}
options(scipen = 999)
library(tidyverse)
library(knitr)
library(broom)
# function to display only part of the output
hook_output <- knit_hooks$get("output")
knit_hooks$set(output = function(x, options) {
  lines <- options$output.lines
  if (is.null(lines)) {
    return(hook_output(x, options))  # pass to default hook
  }
  x <- unlist(strsplit(x, "\n"))
  more <- "..."
  if (length(lines) == 1) {  # first n lines
    if (length(x) > lines) {
      # truncate the output, but add ...
      x <- c(head(x, lines), more)
    }
  } else {
    x <- c(more, x[lines], more)
  }
  # paste these lines together
  x <- paste(c(x, ""), collapse = "\n")
  hook_output(x, options)
})
knitr::opts_chunk$set(message = FALSE) # suppress messages
```
class: inverse
## Non-linear relationships
Straight lines often make bad predictions -- very few of the processes we study are actually linear. For example, effort often has diminishing returns (e.g., log functions), and small advantages early in life can have outsized effects on mid-life outcomes (e.g., exponential functions). When the direction of the effect is constant but its magnitude changes, the best way to handle the data is to transform a variable (usually the outcome) and run linear analyses.
```{r, eval = F}
log_y = log(y)  # natural log; R has no ln(), and log() defaults to base e
lm(log_y ~ x)
```
---
## Polynomial relationships
Other processes involve changes in the direction of a relationship -- for example, a small amount of anxiety is believed to be beneficial for performance on some tasks (positive direction), but too much is detrimental (negative direction). When the shape of the effect includes one or more changes in direction, polynomial terms may be more appropriate.
It should be noted that polynomials are often a poor approximation for a non-linear effect of X on Y. However, correctly testing for non-linear effects usually requires (a) a lot of data and (b) making a number of assumptions about the data. Polynomial regression can be a useful tool for exploratory analysis and in cases when data are limited in terms of quantity and/or quality.
---
## Polynomial regression
Polynomial regression is most often a form of hierarchical regression that systematically tests a series of higher-order functions of a single variable; each model can be compared to the one before it, as sketched on the next slide.
$$
\begin{aligned}
\large \textbf{Linear: } \hat{Y} &= b_0 + b_1X \\
\large \textbf{Quadratic: } \hat{Y} &= b_0 + b_1X + b_2X^2 \\
\large \textbf{Cubic: } \hat{Y} &= b_0 + b_1X + b_2X^2 + b_3X^3\\
\end{aligned}
$$
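---
### Testing polynomial terms

Each model can be compared to the one before it with an incremental F-test. A minimal sketch, assuming a data frame `df` with predictor `x` and outcome `y`:

```{r, eval = F}
m1 = lm(y ~ x, data = df)                    # linear
m2 = lm(y ~ x + I(x^2), data = df)           # quadratic
m3 = lm(y ~ x + I(x^2) + I(x^3), data = df)  # cubic
anova(m1, m2, m3)  # does each higher-order term significantly improve fit?
```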
---
### Example
Can a team have too much talent? Researchers hypothesized that teams with too many talented players have poor intrateam coordination and therefore perform worse than teams with a moderate amount of talent. To test this hypothesis, they looked at 208 international football teams. Talent was the percentage of players during the 2010 and 2014 World Cup Qualifications phases who also had contracts with elite club teams. Performance was the number of points the team earned during these same qualification phases.
```{r, message=F}
football = read.csv("https://raw.githubusercontent.com/uopsych/psy612/master/data/swaab.csv")
```
.small[Swaab, R. I., Schaerer, M., Anicich, E. M., Ronay, R., and Galinsky, A. D. (2014). [The too-much-talent effect: Team interdependence determines when more talent is too much or not enough.](https://www8.gsb.columbia.edu/cbs-directory/sites/cbs-directory/files/publications/Too%20much%20talent%20PS.pdf) _Psychological Science, 25_(8), 1581-1591.]
---
.pull-left[
```{r, message = F, warning = F}
head(football)
```
]
.pull-right[
```{r, warning = F}
ggplot(football, aes(x = talent, y = points)) +
  geom_point() +
  geom_smooth(se = F) +
  theme_bw(base_size = 20)
```
]
---
```{r mod1, warning = F, message = F, fig.width = 10, fig.height = 5}
mod1 = lm(points ~ talent, data = football)
library(broom)
aug1 = augment(mod1)
ggplot(aug1, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_smooth(se = F) +
  theme_bw(base_size = 20)
```
---
```{r mod2}
mod2 = lm(points ~ talent + I(talent^2), data = football)
summary(mod2)
```
---
```{r, message = F}
library(sjPlot)
plot_model(mod2, type = "pred",
terms = c("talent"))
```
---
## Interpretation
The intercept is the predicted value of Y when X = 0 -- this is always the interpretation of the intercept, no matter what kind of regression model you're running.
The $b_1$ coefficient is the slope of the tangent to the curve at X = 0. In other words, this is the rate of change when X is equal to 0. If 0 is not a meaningful value of X, you may want to center X at a meaningful value (such as its mean), so that $b_1$ gives the rate of change there.
```{r}
football$talent_c50 = football$talent - 50
mod2_50 = lm(points ~ talent_c50 + I(talent_c50^2), data = football)
```
---
```{r}
summary(mod2_50)
```
---
```{r, echo = FALSE, warning= FALSE}
plot_model(mod2_50, type = "pred", terms = c("talent_c50"))
```
---
Or you can choose another value to center your predictor on, if there's a value that has a particular meaning or interpretation.
```{r}
football$talent_c = football$talent - mean(football$talent)
mod2_c = lm(points ~ talent_c + I(talent_c^2), data = football)
```
---
```{r}
mean(football$talent)
summary(mod2_c)
```
---
```{r, echo = FALSE, warning= FALSE}
plot_model(mod2_c, type = "pred", terms = "talent_c")
```
---
## Interpretation
The $b_2$ coefficient indexes the acceleration -- how quickly the slope changes. More specifically, $2 \times b_2$ is the acceleration: the rate of change in the slope ($b_1$) for a 1-unit change in X.
You can use this to calculate the slope of the tangent line at any value of X you're interested in:
$$\large b_1 + (2\times b_2\times X)$$
---
```{r}
tidy(mod2)
```
.pull-left[
**At X = 10**
```{r}
54.9 + (2*-.570*10)
```
]
.pull-right[
**At X = 70**
```{r}
54.9 + (2*-.570*70)
```
]
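---
The same tangent slopes can be computed from the fitted model rather than from rounded coefficients -- a small sketch, using `mod2` from above:

```{r}
b = coef(mod2)                           # b[2] is b1, b[3] is b2
tangent = function(x) b[2] + 2*b[3]*x    # slope of the tangent line at x
tangent(10)
tangent(70)
```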
---
## Polynomials are interactions
A term for $X^2$ is a term for $X \times X$: the multiplication of two predictors that hold the same values.
```{r}
football$talent_2 = football$talent*football$talent
tidy(lm(points ~ talent + talent_2, data = football))
```
---
## Polynomials are interactions
Put another way:
$$\large \hat{Y} = b_0 + b_1X + b_2X^2$$
$$\large \hat{Y} = b_0 + \frac{b_1}{2}X + \frac{b_1}{2}X + b_2(X \times X)$$
The interaction term in another model would be interpreted as "how does the slope of X change as I move up in Z?" -- here, we ask "how does the slope of X change as we move up in X?"
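---
## Polynomials are interactions

An equivalent way to fit the quadratic model is with `poly()`; setting `raw = TRUE` uses raw (rather than orthogonal) polynomial terms, so the coefficients match `mod2`:

```{r}
tidy(lm(points ~ poly(talent, 2, raw = TRUE), data = football))
```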
---
## When should you use polynomial terms?
You may choose to fit a polynomial term after looking at a scatterplot of the data or at residual plots. A U-shaped curve may indicate that you need to fit a quadratic form -- although, as we discussed before, you may actually be measuring a different kind of non-linear relationship.
Polynomial terms should mostly be dictated by theory -- if you don't have a good reason for thinking there will be a change in sign, then a polynomial is not right for you.
And, of course, if you fit a polynomial regression, be sure to once again check your diagnostics before interpreting the coefficients.
---
```{r, warning = F, message = F, fig.width=10, fig.height=5}
aug2 = augment(mod2)
ggplot(aug2, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_smooth(se = F) +
theme_bw(base_size = 20)
```
---
```{r, warning = F, message = F, fig.width=10, fig.height=7}
plot_model(mod2, type = "pred", show.data = T)
```
---
## Bootstrapping
In bootstrapping, the theoretical sampling distribution is assumed to be unknown or unverifiable. Under the weak assumption that the sample in hand is representative of some population, then that population sampling distribution can be built empirically by randomly sampling with replacement from the sample.
The resulting empirical sampling distribution can be used to construct confidence intervals and make inferences.
---
### Illustration
Imagine you had a sample of 6 people: Rachel, Monica, Phoebe, Joey, Chandler, and Ross. To bootstrap their heights, you would draw from this group many samples of 6 people *with replacement*, each time calculating the average height of the sample.
```{r, echo = F}
friends = c("Rachel", "Monica", "Phoebe", "Joey", "Chandler", "Ross")
heights = c(65, 65, 68, 70, 72, 73)
names(heights) = friends
(sample1 = sample(friends, size = 6, replace = T)); mean(heights[sample1])
(sample1 = sample(friends, size = 6, replace = T)); mean(heights[sample1])
(sample1 = sample(friends, size = 6, replace = T)); mean(heights[sample1])
(sample1 = sample(friends, size = 6, replace = T)); mean(heights[sample1])
(sample1 = sample(friends, size = 6, replace = T)); mean(heights[sample1])
(sample1 = sample(friends, size = 6, replace = T)); mean(heights[sample1])
(sample1 = sample(friends, size = 6, replace = T)); mean(heights[sample1])
(sample1 = sample(friends, size = 6, replace = T)); mean(heights[sample1])
```
???
```{r}
heights
```
---
### Illustration
```{r}
boot = 10000
friends = c("Rachel", "Monica", "Phoebe", "Joey", "Chandler", "Ross")
heights = c(65, 65, 68, 70, 72, 73)
sample_means = numeric(length = boot)
for(i in 1:boot){
  this_sample = sample(heights, size = length(heights), replace = T)
  sample_means[i] = mean(this_sample)
}
```
---
## Illustration
```{r, echo = F, message = F, fig.width = 10, fig.height=6, warning = F}
library(ggpubr)
mu = mean(heights)
sem = sd(heights)/sqrt(length(heights))
cv_t = qt(p = .975, df = length(heights) - 1)
bootstrapped = data.frame(means = sample_means) %>%
  ggplot(aes(x = means)) +
  geom_histogram(color = "white") +
  geom_density() +
  geom_vline(aes(xintercept = mean(sample_means), color = "mean"), size = 2) +
  geom_vline(aes(xintercept = median(sample_means), color = "median"), size = 2) +
  geom_vline(aes(xintercept = quantile(sample_means, probs = .025), color = "Lower 2.5%"), size = 2) +
  geom_vline(aes(xintercept = quantile(sample_means, probs = .975), color = "Upper 2.5%"), size = 2) +
  scale_x_continuous(limits = c(mu - 3*sem, mu + 3*sem)) +
  ggtitle("Bootstrapped distribution") +
  cowplot::theme_cowplot()
from_prob = data.frame(means = seq(from = min(sample_means), to = max(sample_means))) %>%
  ggplot(aes(x = means)) +
  stat_function(fun = function(x) dnorm(x, mean = mu, sd = sem)) +
  geom_vline(aes(xintercept = mean(heights), color = "mean"), size = 2) +
  geom_vline(aes(xintercept = median(heights), color = "median"), size = 2) +
  geom_vline(aes(xintercept = mu - cv_t*sem, color = "Lower 2.5%"), size = 2) +
  geom_vline(aes(xintercept = mu + cv_t*sem, color = "Upper 2.5%"), size = 2) +
  scale_x_continuous(limits = c(mu - 3*sem, mu + 3*sem)) +
  ggtitle("Distribution from probability theory") +
  cowplot::theme_cowplot()
ggarrange(bootstrapped, from_prob, ncol = 1)
```
---
### Example
Consider a sample of 216 response times. What are its central tendency and variability?
There are several candidates for central tendency (e.g., mean, median) and for variability (e.g., standard deviation, interquartile range). Some of these do not have well-understood theoretical sampling distributions.
For the mean and standard deviation, we have theoretical sampling distributions to help us, provided we think the mean and standard deviation are the best indices. For the others, we can use bootstrapping.
---
```{r, echo = F, message=F, fig.width=10, fig.height=8}
library(tidyverse)
set.seed(03102020)
response = rf(n = 216, 3, 50)
response = response*500 + rnorm(n = 216, mean = 200, sd = 100)
values = quantile(response, probs = c(.025, .5, .975))
mean_res = mean(response)
data.frame(x = response) %>%
  ggplot(aes(x)) +
  geom_histogram(aes(y = ..density..), binwidth = 150,
                 fill = "lightgrey", color = "black") +
  geom_density() +
  geom_vline(aes(xintercept = values[1], color = "Lower 2.5%"), size = 2) +
  geom_vline(aes(xintercept = values[2], color = "Median"), size = 2) +
  geom_vline(aes(xintercept = values[3], color = "Upper 2.5%"), size = 2) +
  geom_vline(aes(xintercept = mean_res, color = "Mean"), size = 2) +
  labs(x = "Response time (ms)", title = "Response Time Distribution") +
  cowplot::theme_cowplot(font_size = 20)
```
---
### Bootstrapping
Before now, if we wanted to estimate the mean and the 95% confidence interval around the mean, we would find the theoretical sampling distribution by scaling a t-distribution to be centered on the mean of our sample and have a standard deviation equal to $\frac{s}{\sqrt{N}}.$ But we have to make many assumptions to use this sampling distribution, and we may have good reason not to.
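For reference, that theoretical interval can be computed directly from the sample (a quick sketch, using the `response` vector from the previous slide):

```{r}
m = mean(response)
se = sd(response)/sqrt(length(response))
m + qt(c(.025, .975), df = length(response) - 1)*se  # theoretical 95% CI
```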
Instead, we can build a population sampling distribution empirically by randomly sampling with replacement from the sample.
---
## Response time example
```{r}
boot = 10000
response_means = numeric(length = boot)
for(i in 1:boot){
  sample_response = sample(response, size = 216, replace = T)
  response_means[i] = mean(sample_response)
}
```
```{r, echo = F, message = F, fig.width = 10, fig.height = 5}
data.frame(means = response_means) %>%
  ggplot(aes(x = means)) +
  geom_histogram(color = "white") +
  geom_density() +
  geom_vline(aes(xintercept = mean(response_means), color = "mean"), size = 2) +
  geom_vline(aes(xintercept = median(response_means), color = "median"), size = 2) +
  geom_vline(aes(xintercept = quantile(response_means, probs = .025), color = "Lower 2.5%"), size = 2) +
  geom_vline(aes(xintercept = quantile(response_means, probs = .975), color = "Upper 2.5%"), size = 2) +
  cowplot::theme_cowplot()
```
---
```{r}
mean(response_means)
median(response_means)
quantile(response_means, probs = c(.025, .975))
```
What about something like the median?
---
### Bootstrapped distribution of the median
```{r}
boot = 10000
response_med = numeric(length = boot)
for(i in 1:boot){
  sample_response = sample(response, size = 216, replace = T)
  response_med[i] = median(sample_response)
}
```
.pull-left[
```{r echo=F, message=FALSE}
data.frame(medians = response_med) %>%
  ggplot(aes(x = medians)) +
  geom_histogram(aes(y = ..density..),
                 color = "white", fill = "grey") +
  geom_density() +
  geom_vline(aes(xintercept = mean(response_med), color = "mean"), size = 2) +
  geom_vline(aes(xintercept = median(response_med), color = "median"), size = 2) +
  geom_vline(aes(xintercept = quantile(response_med, probs = .025), color = "Lower 2.5%"), size = 2) +
  geom_vline(aes(xintercept = quantile(response_med, probs = .975), color = "Upper 2.5%"), size = 2) +
  cowplot::theme_cowplot()
```
]
.pull-right[
```{r}
mean(response_med)
median(response_med)
quantile(response_med,
probs = c(.025, .975))
```
]
---
### Bootstrapped distribution of the standard deviation
```{r}
boot = 10000
response_sd = numeric(length = boot)
for(i in 1:boot){
  sample_response = sample(response, size = 216, replace = T)
  response_sd[i] = sd(sample_response)
}
```
.pull-left[
```{r echo=F, message=FALSE}
data.frame(sds = response_sd) %>%
  ggplot(aes(x = sds)) +
  geom_histogram(aes(y = ..density..), color = "white", fill = "grey") +
  geom_density() +
  geom_vline(aes(xintercept = mean(response_sd), color = "mean"), size = 2) +
  geom_vline(aes(xintercept = median(response_sd), color = "median"), size = 2) +
  geom_vline(aes(xintercept = quantile(response_sd, probs = .025), color = "Lower 2.5%"), size = 2) +
  geom_vline(aes(xintercept = quantile(response_sd, probs = .975), color = "Upper 2.5%"), size = 2) +
  cowplot::theme_cowplot()
```
]
.pull-right[
```{r}
mean(response_sd)
median(response_sd)
quantile(response_sd,
probs = c(.025, .975))
```
]
---
You can bootstrap estimates and 95% confidence intervals for *any* statistic you need to estimate.
The `boot` package provides functions to speed this process along.
```{r}
library(boot)
# function to obtain R-squared from the data
rsq <- function(data, indices) {
  d <- data[indices, ]                  # allows boot to select a sample
  fit <- lm(mpg ~ wt + disp, data = d)  # the model you would have run anyway
  return(summary(fit)$r.squared)
}
# bootstrapping with 10000 replications
results <- boot(data = mtcars, statistic = rsq, R = 10000)
```
---
.pull-left[
```{r}
data.frame(rsq = results$t) %>%
  ggplot(aes(x = rsq)) +
  geom_histogram(color = "white", bins = 30)
```
]
.pull-right[
```{r}
median(results$t)
boot.ci(results, type = "perc")
```
]
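---
`boot.ci()` supports several interval types; the bias-corrected and accelerated (BCa) interval is a common alternative to the simple percentile interval because it adjusts for bias and skew in the bootstrapped distribution:

```{r}
boot.ci(results, type = "bca")
```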
---
### Example 2
Samples of service waiting times for Verizon's own (ILEC) customers versus customers of competing carriers (CLEC). In this district, Verizon must provide line service to all customers or else face a fine. The question is whether the non-Verizon customers are being ignored, facing longer or more variable waiting times.
```{r, message = F, warning = F, echo = 2}
library(here)
Verizon = read.csv(here("data/Verizon.csv"))
```
```{r, echo = F, fig.width = 10, fig.height = 4}
Verizon %>%
  ggplot(aes(x = Time, fill = Group)) +
  geom_histogram(bins = 30) +
  guides(fill = F) +
  facet_wrap(~Group, scales = "free_y")
```
---
```{r, echo = F, fig.width = 10, fig.height = 6}
Verizon %>%
  ggplot(aes(x = Time, fill = Group)) +
  geom_histogram(bins = 50, position = "dodge") +
  guides(fill = F)
table(Verizon$Group)
```
---
There's no world in which these data meet the typical assumptions of an independent-samples t-test. To estimate mean differences we can use bootstrapping. Here, we'll resample with replacement separately from the two samples.
```{r}
boot = 10000
difference = numeric(length = boot)
subsample_CLEC = Verizon %>% filter(Group == "CLEC")
subsample_ILEC = Verizon %>% filter(Group == "ILEC")
for(i in 1:boot){
  sample_CLEC = sample(subsample_CLEC$Time,
                       size = nrow(subsample_CLEC),
                       replace = T)
  sample_ILEC = sample(subsample_ILEC$Time,
                       size = nrow(subsample_ILEC),
                       replace = T)
  difference[i] = mean(sample_CLEC) - mean(sample_ILEC)
}
```
---
```{r echo=F, message=FALSE, fig.width=10, fig.height=5}
data.frame(differences = difference) %>%
  ggplot(aes(x = differences)) +
  geom_histogram(aes(y = ..density..), color = "white", fill = "grey") +
  geom_density() +
  geom_vline(aes(xintercept = mean(differences), color = "mean"), size = 2) +
  geom_vline(aes(xintercept = median(differences), color = "median"), size = 2) +
  geom_vline(aes(xintercept = quantile(differences, probs = .025), color = "Lower 2.5%"), size = 2) +
  geom_vline(aes(xintercept = quantile(differences, probs = .975), color = "Upper 2.5%"), size = 2) +
  cowplot::theme_cowplot()
```
The difference in means is `r round(median(difference),2)` $[`r round(quantile(difference, probs = .025),2)`,`r round(quantile(difference, probs = .975),2)`]$.
---
### Bootstrapping Summary
Bootstrapping can be a useful tool to estimate parameters when
1. you've violated assumptions of the test (e.g., normality, homoskedasticity)
2. you have good reason to believe the sampling distribution is not normal, but don't know what it is
3. there are other oddities in your data, like very unbalanced samples
This allows you to create a confidence interval around any statistic you want -- Cronbach's alpha, ICC, Mahalanobis Distance, $R^2$, AUC, etc.
* You can test whether these statistics are significantly different from any other value -- how? (one approach is sketched on the next slide)
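---
### Testing against a value

One approach, sketched here with the bootstrapped medians from earlier: check whether the hypothesized value falls outside the bootstrapped 95% confidence interval (the percentile interval; `null_value` is just a hypothetical comparison value).

```{r}
null_value = 1000  # hypothetical value to test against
ci = quantile(response_med, probs = c(.025, .975))
null_value < ci[1] | null_value > ci[2]  # TRUE = significantly different at alpha = .05
```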
---
### Bootstrapping Summary
Bootstrapping will NOT help you deal with:
* dependence between observations -- for this, you'll need to explicitly model the dependence (e.g., multilevel model, repeated-measures ANOVA)
* improperly specified models or forms -- use theory to guide you here
* measurement error -- why bother?
---
class: inverse
## Next time...