---
title: "Binary Variables and Functional Form"
subtitle: "i.e. what you actually do most of the time"
date: "Updated `r Sys.Date()`"
output:
  xaringan::moon_reader:
    self_contained: TRUE
    css: [default, metropolis, metropolis-fonts]
    lib_dir: libs
    # Run xaringan::summon_remark() for this
    #chakra: libs/remark-latest.min.js
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
```{r setup, include=FALSE}
options(htmltools.dir.version = FALSE)
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, fig.width = 8, fig.height = 6)
library(tidyverse)
library(dagitty)
library(ggdag)
library(fixest)
library(scales)
library(Cairo)
theme_metro <- function(x) {
  theme_classic() +
    theme(panel.background = element_rect(color = '#FAFAFA', fill = '#FAFAFA'),
          plot.background = element_rect(color = '#FAFAFA', fill = '#FAFAFA'),
          text = element_text(size = 16),
          axis.title.x = element_text(hjust = 1),
          axis.title.y = element_text(hjust = 1, angle = 0))
}
theme_void_metro <- function(x) {
  theme_void() +
    theme(panel.background = element_rect(color = '#FAFAFA', fill = '#FAFAFA'),
          plot.background = element_rect(color = '#FAFAFA', fill = '#FAFAFA'),
          text = element_text(size = 16))
}
theme_metro_regtitle <- function(x) {
  theme_classic() +
    theme(panel.background = element_rect(color = '#FAFAFA', fill = '#FAFAFA'),
          plot.background = element_rect(color = '#FAFAFA', fill = '#FAFAFA'),
          text = element_text(size = 16))
}
```
```{css, echo=FALSE}
pre {
max-height: 350px;
overflow-y: auto;
}
pre[class] {
max-height: 100px;
}
```
# Check-in
- Where are we at?
- Next week we have a midterm. The format will be very similar to the homeworks - multiple choice, some short answers, a little programming
- It will cover everything up through the day of the midterm
- This lecture will cover the last "new" material before the midterm
- This is a pretty packed lecture, but it's structured to be easy to come back to and review. There will be plenty of practice next time, and this might end up broken across both days
- The latter half of this week will just be practice, especially for this material but also for stuff we've done
---
# The Right Hand Side
- Today we'll be focusing on the *right hand side* of a regression
- Economists generally refer to a regression as having a "left-hand-side" of the dependent variable $Y$, and a "right-hand-side" of all the independent stuff, like $\beta_0 + \beta_1X + \beta_2Z + \varepsilon$.
- So far, we've just tossed stuff on the right-hand side and called it our treatment variable or a control variable without thinking much harder about it
- Today we will think harder about it!
---
# The Right Hand Side
We will look at three features of the right-hand side
- What if the variable is *categorical* or *binary*? (binary variables)
- What if the variable has a *nonlinear effect* on $Y$? (polynomials and logarithms)
- What if the effect of one variable *depends on the value of another variable?* (interaction terms)
---
# The Overriding Key Concept
**The derivative** is magical here
- $\partial Y/\partial X$ will give you an expression that tells you the effect of a change in $X$ on $Y$
- Whatever form $X$ enters the equation as!
- (then if it's binary, imagine that as a change from 0 to 1, or off to on)
- If you're confused **try taking the derivative, I'm serious, it helps**
- Your options with this stuff: EITHER commit to memory a bunch of different interpretation rules for each different case, OR just take the darn derivative
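- For example, previewing the cases below: if $Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X\times Z$, then
$$ \partial Y/\partial X = \beta_1 + 2\beta_2X + \beta_3Z $$
- One derivative, and the interpretation rules for polynomials *and* interactions both fall right out of it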
---
# Binary Data
- A variable is binary if it only has two values - 0 or 1 (or "No" or "Yes", etc.)
- Binary variables are super common in econometrics!
- Did you get the treatment? Yes / No
- Do you live in the US? Yes / No
- Is a floating exchange rate in effect? Yes / No
---
# Comparison of Means
- When a binary variable is an independent variable, what we are often interested in doing is *comparing means*
- Is mean income higher inside the US or outside?
- Is mean height higher for kids who got a nutrition supplement or those who didn't?
- Is mean GDP growth higher with or without a floating exchange rate?
---
# Comparison of Means
- Let's compare log earnings in 1993 between married people 30 or older vs. never-married people 30 or older
```{r, echo = TRUE}
data(PSID, package = 'Ecdat')
PSID <- PSID %>%
  filter(age >= 30, married %in% c('married','never married'), earnings > 0) %>%
  mutate(married = married == 'married')
PSID %>%
  group_by(married) %>%
  summarize(log_earnings = mean(log(earnings)))
```
---
# Comparison of Means
- Earnings look slightly higher for the married group
```{r}
PSID %>%
group_by(married) %>%
mutate(mean_earnings = exp(mean(log(earnings), na.rm = TRUE))) %>%
ggplot(aes(x = married, y = earnings)) +
geom_jitter() +
scale_x_discrete(labels = c('Not Married', 'Married')) +
scale_y_log10(labels = scales::dollar) +
labs(x = "Marital Status", y = "Log Labor Income") +
geom_point(aes(x = married, y = mean_earnings), size = 6, color = 'red') +
geom_segment(aes(x = FALSE, y = PSID %>% filter(!married) %>% pull(earnings) %>% log() %>% mean() %>% exp(),
xend = TRUE, yend = PSID %>% filter(married) %>% pull(earnings) %>% log() %>% mean() %>% exp()), size = 2, color = 'red') +
theme_metro_regtitle()
```
---
# Comparison of Means
- The *difference between the means* follows a t-distribution under the null that they're identical
- So of course we can do a hypothesis test of whether they're different
```{r, echo = TRUE}
t.test(log(earnings) ~ married, data = PSID, var.equal = TRUE)
```
---
# Comparison of Means
- But why bother trotting out a specific test when we can just do a regression?
- (In fact, a lot of specific tests can be replaced with basic regression, see [this explainer](https://lindeloev.github.io/tests-as-linear/))
```{r, echo = TRUE}
feols(log(earnings) ~ married, data = PSID)
```
---
# Comparison of Means
Notice:
- The intercept gives the mean for the *non-married* group
- The coefficient on *marriedTRUE* gives the married minus non-married difference
- The t-stat and p-value on that coefficient are exactly the same as that `t.test` (except the t is reversed; same deal)
- i.e. *the coefficient on a binary variable in a regression gives the difference in means*
- If we'd defined it the other way, with "not married" as the independent variable, the intercept would be the mean for the *married* group (i.e. "not married = 0"), and the coefficient would be the exact same but times $-1$ (same difference, just opposite direction!)
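A quick sketch of that flip, if you want to see it (using `I()` to define the flipped dummy in-model, as covered below):
```{r, echo = TRUE, eval = FALSE}
# Flipping the dummy: the intercept becomes the married-group mean, and
# the coefficient is the same difference with the opposite sign
feols(log(earnings) ~ I(!married), data = PSID)
```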
---
# Comparison of Means
Why does OLS give us a comparison of means when you give it a binary variable?
- The only $X$ values are 0 (FALSE) and 1 (TRUE)
- Because of this, OLS no longer really fits a *line*, it's more like two separate means
- And when you're estimating to minimize the sum of squared errors separately for each group, you can't do any better than to predict the mean!
- So you get the mean of each group as each group's prediction
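A minimal sketch to check this, assuming the `PSID` data from earlier (`predict()` works on `feols` models):
```{r, echo = TRUE, eval = FALSE}
# Fitted values from a binary-X regression are exactly the group means
m <- feols(log(earnings) ~ married, data = PSID)
PSID %>%
  mutate(fitted = predict(m)) %>%
  group_by(married) %>%
  summarize(mean_log_earnings = mean(log(earnings)),
            fitted_value = first(fitted))
```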
---
# Binary with Controls
- Obviously this is handy for including binary controls, but why use a regression instead of a t-test for a binary treatment? Because we can add controls!
```{r}
PSID <- filter(PSID, kids < 98)
etable(feols(log(earnings) ~ married + kids + age , data = PSID))
```
---
# Multicollinearity
- Why is just one side of it in the regression? Why aren't "married" and "not married" BOTH included?
- Because regression couldn't give an answer!
- Mean of married is $9.47$ and of non-married is $9.26$.
$$ \log(Earnings) = 0 + 9.47Married + 9.26NonMarried $$
$$ \log(Earnings) = 9.26 + .21Married + 0NonMarried $$
$$ \log(Earnings) = 3 + 6.47Married + 6.26NonMarried $$
- These (and infinitely many other options) all give the exact same predictions! OLS can't pick between them. There's no single best way to minimize squared residuals
- So we pick one with convenient properties, setting one of the categories to have a coefficient of 0 (dropping it) and making the coefficient on the other *the difference relative to* the one we left out
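A sketch of what happens if you try anyway - `fixest` (like most software) will detect the collinearity and drop one of the variables for you:
```{r, echo = TRUE, eval = FALSE}
# married + not-married always sums to 1, which is perfectly collinear
# with the intercept, so one of them gets dropped automatically
feols(log(earnings) ~ married + I(!married), data = PSID)
```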
---
# More than Two Categories
- That interpretation - dropping one and making the others relative to *that* - conveniently extends to *multi-category variables*
- Why stop at binary categorical variables? There are plenty of categorical variables with more than two values
- What is your education level? What is your religious denomination? What continent are you on?
- We can put these in a regression by turning *each value* into its own binary variable
- (and then dropping one so the coefficients on the others give you the difference with the omitted one)
---
# More than Two Categories
- Make the mean of group A be 1, of group B be 2, etc.
```{r, echo = TRUE}
tib <- tibble(group = sample(LETTERS[1:4], 10000, replace = TRUE)) %>%
  mutate(Y = rnorm(10000) + (group == "A") + 2*(group == "B") + 3*(group == "C") + 4*(group == "D"))
feols(Y ~ group, data = tib)
```
---
# More than Two Categories
- By changing the reference group, the coefficients change because they're "different from" a different group!
- And notice that, as before, the intercept is the mean of the omitted group (although this changes once you add controls; the intercept is the predicted mean when all right-hand-side variables are 0)
```{r, echo = TRUE}
tib <- tib %>% mutate(group = factor(group, levels = c('B','A','C','D')))
feols(Y ~ group, data = tib)
```
---
# More than Two Categories
- Some Interpretations: Controlling for number of kids and age, people with a high school degree have log earnings .324 higher than those without a high school degree (earnings 32.4% higher). BA-holders have earnings 84.8% higher than those without a HS degree
- Controlling for kids and age, a graduate degree earns (.976 - .848 =) 12.8% more than someone with a BA (`glht()` could help, or see the re-leveling sketch after the output!)
```{r}
PSID <- PSID %>%
  mutate(education = case_when(
    educatn < 12 ~ 'No High School Degree',
    educatn == 12 ~ 'High School Degree',
    educatn < 16 ~ 'Some College',
    educatn == 16 ~ 'Bachelor\'s Degree',
    TRUE ~ 'Graduate Degree'
  )) %>%
  mutate(education = factor(education, levels = c('No High School Degree',
                                                  'High School Degree',
                                                  'Some College',
                                                  'Bachelor\'s Degree',
                                                  'Graduate Degree')))
table(PSID$education)
feols(log(earnings) ~ education + kids + age , data = PSID)
```
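One simple way to get that graduate-vs-BA difference, with its own standard error, is re-leveling - a sketch re-using the `education` factor built above:
```{r, echo = TRUE, eval = FALSE}
# With Bachelor's Degree as the reference group, the Graduate Degree
# coefficient *is* the grad-vs-BA difference, standard error included
PSID_ba <- PSID %>%
  mutate(education = relevel(education, ref = 'Bachelor\'s Degree'))
feols(log(earnings) ~ education + kids + age, data = PSID_ba)
```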
---
# Concept Checks
- If $X$ is binary, in sentences interpret the coefficients from the estimated OLS equation $Y = 4 + 3X + 2Z$
- How might a comparison of means come in handy if you wanted to analyze the results of a randomized policy experiment?
- If you had a data set of people from every continent and added "continent" as a control, how many coefficients would this add to your model?
- If in that regression you really wanted to compare Europe to Asia specifically, what might you do so that the regression made this easy?
---
# Interpreting OLS
- To think more about the right-hand-side, let's go back to our original interpretation of an OLS coefficient
$$ Y = \beta_0 + \beta_1X + \varepsilon $$
- A one-unit change in $X$ is associated with a $\beta_1$-unit change in $Y$
- This logic still works with binary variables since "a one-unit change in $X$" means "changing $X$ from No to Yes"
- We can also think of this as $\partial Y/\partial X = \beta_1$ in calculus terms
- Notice that this assumes that a one-unit change in $X$ *always has the same effect*, $\beta_1$, on $Y$ no matter what else is going on
- What if that's not true?
---
# Functional Form
- We talked before about times when a *linear model* like standard OLS might not be sufficient
- However, as long as those *non*-linearities are on the right hand side, we can fix the problem easily by just having $X$ enter non-linearly! Run it through a *transformation*!
- The most common transformations by far are *polynomials* and *logarithms*
---
# Functional Form
- Why do this? Because sometimes a straight line is clearly not going to do the trick!
```{r}
set.seed(500)
df <- tibble(X = 5*runif(200)) %>%
mutate(Y = X - 2*X^2 + 2*rnorm(200))
ggplot(df, aes(x = X, y = Y)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE) +
theme_metro() +
annotate(geom = 'text', x = 2.5, y = -20, label = 'Straight Line', color = 'blue', size = 15/.pt)
```
---
# Polynomials
- $\beta_1X$ is a "first order polynomial" - there's one term
- $\beta_1X + \beta_2X^2$ is a "second order polynomial" or a "quadratic" - two terms (note both included, it's not just $X^2$)
- $\beta_1X + \beta_2X^2 + \beta_3X^3$ is a third-order or cubic, etc.
What do they do?
- The more polynomial terms, the more flexible the line can be. With enough terms you can mimic *any* shape of relationship
- Of course, if you just add a whole buncha terms, it gets very noisy, and prediction out-of-sample gets very bad
- Keep it minimal - quadratics are almost always enough, unless you have reason to believe there's a true more-complex relationship. You can try adding higher-order terms and see if they make a difference
---
# Polynomials
- The true relationship is quadratic
```{r}
ggplot(df, aes(x = X, y = Y)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE) +
geom_line(aes(x = X, y = predict(lm(Y~X+I(X^2), data = df))), color = 'red', size = 1) +
theme_metro() +
annotate(geom = 'text', x = 2.5, y = -20, label = 'Straight Line', color = 'blue', size = 15/.pt) +
annotate(geom = 'text', x = 3, y = -9, label = 'Quadratic', color = 'red', size = 15/.pt)
```
---
# Polynomials
- Higher-order terms don't do anything for us here (because a quadratic is sufficient!)
```{r}
ggplot(df, aes(x = X, y = Y)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE) +
geom_line(aes(x = X, y = predict(lm(Y~X+I(X^2), data = df))), color = 'red', size = 1) +
geom_line(aes(x = X, y = predict(lm(Y~X+I(X^2) + I(X^3), data = df))), color = 'purple', size = 1) +
theme_metro() +
annotate(geom = 'text', x = 2.5, y = -20, label = 'Straight Line', color = 'blue', size = 15/.pt) +
annotate(geom = 'text', x = 3, y = -9, label = 'Quadratic', color = 'red', size = 15/.pt) +
annotate(geom = 'text', x = 3.5, y = -12, label = 'Cubic', color = 'purple', size = 15/.pt)
```
---
# Polynomials
- Interpret polynomials using the derivative
- $\partial Y/\partial X$ will be different depending on the value of $X$ (as it should! Notice in the graph that the slope changes for different values of $X$)
$$ Y = \beta_1X + \beta_2X^2 $$
$$ \partial Y/\partial X = \beta_1 + 2\beta_2X $$
So at $X = 0$, the effect of a one-unit change in $X$ is $\beta_1$. At $X = 1$, it's $\beta_1 + 2\beta_2$. At $X = 5$ it's $\beta_1 + 10\beta_2$.
- **IMPORTANT**: when you have a polynomial, *the coefficients on each individual term mean very little on their own*. You have to consider them alongside the other coefficients from the polynomial! **Never** interpret $\beta_1$ here without thinking about $\beta_2$ alongside. Also, the significance of the individual terms doesn't really matter - consider doing an F-test of all of them at once.
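Here's a quick sketch of computing those marginal effects from a fitted quadratic (assuming the simulated `df` from the plots above; `X` and `I(X^2)` are how `feols` names these coefficients):
```{r, echo = TRUE, eval = FALSE}
# dY/dX = b1 + 2*b2*X, evaluated at X = 0, 1, and 5
m <- feols(Y ~ X + I(X^2), data = df)
b <- coef(m)
b['X'] + 2 * b['I(X^2)'] * c(0, 1, 5)
```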
---
# Polynomials in R
- We can add an `I()` function to our regression to do a calculation on a variable before including it. So `I(X^2)` adds a squared term
- There's also a `poly()` function but you **must** set `raw=TRUE` if you do
```{r, echo = TRUE, eval = FALSE}
# Linear
feols(Y ~ X, data = df)
# Quadratic
feols(Y ~ X + I(X^2), data = df)
# Cubic
feols(Y ~ X + I(X^2) + I(X^3), data = df)
feols(Y ~ poly(X, 3, raw = TRUE), data = df)
```
---
# Concept Check
- What's the effect of a one-unit change in $X$ at $X = 0$, $X = 1$, and $X = 2$ for each of these?
```{r}
etable(feols(Y~X, data = df),
feols(Y~X + I(X^2), data = df),
feols(Y ~ X + I(X^2) + I(X^3), data = df))
```
---
# Logarithms
- Another common transformation, both for dependent and independent variables, is to take the logarithm
- This has the effect of pulling in extreme values from strongly right-skewed data and making linear relationships pop out
- Income, for example, is almost always used with a logarithm
- It also gives the coefficients a nice percentage-based interpretation
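A quick sketch of that skew-taming effect, using the `PSID` earnings data from earlier:
```{r, echo = TRUE, eval = FALSE}
# Raw earnings are strongly right-skewed; logging makes them roughly symmetric
ggplot(PSID, aes(x = earnings)) + geom_histogram()
ggplot(PSID, aes(x = log(earnings))) + geom_histogram()
```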
---
# Logarithms
```{r}
set.seed(500)
logdf <- tibble(X = 10*runif(200)) %>%
mutate(Y = log(X) + .3*rnorm(200))
ggplot(logdf, aes(x = X, y = Y)) + geom_point() +
theme_metro() +
geom_line(aes(x = X, y = predict(lm(Y~X, data = logdf))), color = 'blue', size = 1) +
geom_line(aes(x = X, y = predict(lm(Y~log(X), data = logdf))), color = 'red', size = 1) +
annotate(geom = 'text', x = 4.1, y = .5, color = 'blue', label = 'Straight Line', size = 15/.pt) +
annotate(geom = 'text', x = 1.5, y = 1.5, color = 'red', label = 'Y regressed on log(X)', size = 15/.pt)
```
---
# Or if you prefer...
- Notice the change in axes
```{r}
ggplot(logdf, aes(x = log(X), y = Y)) + geom_point() +
theme_metro() +
geom_line(aes(x = log(X), y = predict(lm(Y~X, data = logdf))), color = 'blue', size = 1) +
geom_line(aes(x = log(X), y = predict(lm(Y~log(X), data = logdf))), color = 'red', size = 1) +
annotate(geom = 'text', x = 1.25, y = -.25, color = 'blue', label = '"Straight Line" (Y regressed on X)', size = 15/.pt) +
annotate(geom = 'text', x = -.5, y = 1, color = 'red', label = 'Y regressed on log(X)', size = 15/.pt)
```
---
# Logarithms
- How can we interpret them?
- The key is to remember that $\log(X) + a \approx \log((1+a)X)$, meaning that an $a$-unit change in $\log(X)$ is similar to an $a\times100\%$ change in $X$
- So, walk through our "one-unit change in the variable" logic from before, but whenever we hit a log, change that into a percentage!
- Note this *only works for a small $a$*! With natural logs, the approximation breaks down above $a = .1$ or so
- Can interpret exactly instead: an $a$-unit change in $\log(X)$ is really a $(e^a - 1)\times100\%$ change in $X$
- Or you can change the base - a 1-unit change in $\log_{1+a}(X)$ is exactly an $a\times100\%$ change in $X$.
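A quick numerical sketch of the approximation, and of the exact version:
```{r, echo = TRUE, eval = FALSE}
a <- 0.05
log(2) + a         # shifting log(X) up by a...
log((1 + a) * 2)   # ...is nearly the same as a 5% increase in X
(exp(a) - 1) * 100 # exact percent change in X from an a-unit change in log(X)
```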
---
# Logarithms
- $Y = \beta_0 + \beta_1\log(X)$ A one-unit change in $\log(X)$, or a 100% change in $X$, is associated with a $\beta_1$-unit change in $Y$
- $\log(Y) = \beta_0 + \beta_1X$ a one-unit change in $X$ is associated with a $\beta_1\times 100$% change in $Y$
- $\log(Y) = \beta_0 + \beta_1\log(X)$ A one-unit change in $\log(X)$, or a 100% change in $X$, is associated with a $\beta_1$-unit change in $\log(Y)$, or a $\beta_1\times100$% change in $Y$.
- (Try also with changes smaller than one unit - that's usually more reasonable)
---
# Logarithms
Downsides:
- Logarithms require that all data be positive. No negatives or zeroes!
- Fairly rare that a variable with negative values wants a log anyway
- But zeroes are common! A common practice is to just do $\log(X+1)$ or $\text{asinh}(X)=\ln(X+\sqrt{X^2+1})$ but these are pretty arbitrary
- On the left-hand side at least, better to deal with 0s with Poisson regression
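A sketch of that Poisson approach using fixest's `fepois()` (the data and variable names here are hypothetical, just to show the shape of the call):
```{r, echo = TRUE, eval = FALSE}
# Y_with_zeroes can contain 0s - no log(Y+1) fudge needed, and the
# coefficients still read roughly like log-Y effects
fepois(Y_with_zeroes ~ X + Z, data = some_df)
```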
---
# Functional Form
- In general, you want the shape of your function to match the shape of the relationship in the data (or, even better, the true relationship)
- Polynomials and logs can usually get you there!
- Which to use? Use logs for highly skewed data or variables with exponential relationships
- Use polynomials if it doesn't look straight! **Check that scatterplot and see how not-straight it is!**
---
# Concept Checks
- Which of the following variables would you likely want to log before using them? Income, height, wealth, company size, home square footage
- In each of the following estimated OLS lines, interpret the coefficient by filling in "A [blank] change in X is associated with a [blank] change in Y":
$$ Y = 1 + 2\log(X) $$
$$ \log(Y) = 3 + 2\log(X) $$
$$ \log(Y) = 4 + 3X $$
---
# Interactions
- For both polynomials and logarithms, the effect of a one-unit change in $X$ differs depending on its current value (for logarithms, because a one-unit change in $X$ represents a different percentage change in $X$ depending on its current value)
- But why stop there? Maybe the effect of $X$ differs depending on the current value of *other variables!*
- Enter interaction terms!
$$ Y = \beta_0 + \beta_1X + \beta_2Z + \beta_3X\times Z + \varepsilon $$
- Interaction terms are *a little tough* but also **extremely important**. Expect to come back to these slides, as you're almost certainly going to use interaction terms in both of your major projects this term
---
# Interactions
- Change in the value of a *control* can shift a regression line up and down
- Using the model $Y = \beta_0 + \beta_1X + \beta_2Z$, estimated as $Y = .01 + 1.2X + .95Z$:
```{r}
set.seed(500)
df <- tibble(X = runif(200), Z = 5*runif(200)) %>%
mutate(Y = X + Z + rnorm(200))
m <- lm(Y~X+Z, data = df)
mc <- coef(m)[1] %>% unname()
mx <- coef(m)[2] %>% unname()
mz <- coef(m)[3] %>% unname()
ggplot(df, aes(x = X, y = Y)) +
geom_point() +
theme_metro() +
scale_x_continuous(limits = c(0,1.1)) +
geom_line(aes(x = X, y = mc + mx*X + mz), color = 'red', size = 1) +
geom_line(aes(x = X, y = mc + mx*X + 3*mz), color = 'red', size = 1) +
geom_line(aes(x = X, y = mc + mx*X + 5*mz), color = 'red', size = 1) +
annotate(geom = 'text', x = 1.01, y = mc + mx*1.05 + mz, color = 'red', label = 'Prediction\nat Z = 1', hjust = 0, size = 15/.pt) +
annotate(geom = 'text', x = 1.01, y = mc + mx*1.05 + 3*mz, color = 'red', label = 'Prediction\nat Z = 3', hjust = 0, size = 15/.pt) +
annotate(geom = 'text', x = 1.01, y = mc + mx*1.05 + 5*mz, color = 'red', label = 'Prediction\nat Z = 5', hjust = 0, size = 15/.pt)
```
---
# Interactions
- But an interaction can both shift the line up and down AND change its slope
- Using the model $Y = \beta_0 + \beta_1X + \beta_2Z + \beta_3X\times Z$, estimated as $Y = .035 + 1.14X + .94Z + 1.02X\times Z$:
```{r}
set.seed(500)
df <- tibble(X = runif(200), Z = 5*runif(200)) %>%
mutate(Y = X+Z+X*Z + rnorm(200))
m <- lm(Y~X*Z, data = df)
mc <- coef(m)[1] %>% unname()
mx <- coef(m)[2] %>% unname()
mz <- coef(m)[3] %>% unname()
mxz <- coef(m)[4] %>% unname()
ggplot(df, aes(x = X, y = Y)) +
geom_point() +
theme_metro() +
scale_x_continuous(limits = c(0,1.2)) +
geom_line(aes(x = X, y = mc + mx*X + mz + mxz*X), color = 'red', size = 1) +
geom_line(aes(x = X, y = mc + mx*X + 3*mz + 3*mxz*X), color = 'red', size = 1) +
geom_line(aes(x = X, y = mc + mx*X + 5*mz + 5*mxz*X), color = 'red', size = 1) +
annotate(geom = 'text', x = 1.01, y = mc + mx*1.05 + mz + mxz*1.05, color = 'red', label = 'Prediction\nat Z = 1', hjust = 0, size = 15/.pt) +
annotate(geom = 'text', x = 1.01, y = mc + mx*1.05 + 3*mz + 3*mxz*1.05, color = 'red', label = 'Prediction\nat Z = 3', hjust = 0, size = 15/.pt) +
annotate(geom = 'text', x = 1.01, y = mc + mx*1.05 + 5*mz + 5*mxz*1.05, color = 'red', label = 'Prediction\nat Z = 5', hjust = 0, size = 15/.pt)
```
---
# Interactions
- How can we interpret an interaction?
- The idea is that the interaction shows how *the effect of one variable changes as the value of the other changes*
- The derivative helps!
$$ Y = \beta_0 + \beta_1X + \beta_2Z + \beta_3X\times Z $$
$$ \partial Y/\partial X = \beta_1 + \beta_3 Z $$
- The effect of $X$ is $\beta_1$ when $Z = 0$, or $\beta_1 + \beta_3$ when $Z = 1$, or $\beta_1 + 3\beta_3$ if $Z = 3$!
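Here's a sketch of pulling those effects out of a fitted model (assuming the simulated `df` from the previous slide; `X:Z` is how `feols` names the interaction coefficient):
```{r, echo = TRUE, eval = FALSE}
# Effect of X at Z = 0, 1, and 3: b1 + b3*Z
m <- feols(Y ~ X*Z, data = df)
b <- coef(m)
b['X'] + b['X:Z'] * c(0, 1, 3)
```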
---
# Interactions
- Often we are doing interactions with binary variables to see how an effect differs across groups
- In these cases, we combine what we know about binary variables with what we know about interactions!
- Now, instead of the intercept giving the baseline and the binary coefficient giving the difference, the coefficient on $X$ is the baseline effect of $X$ and the interaction is the difference in the effect of $X$
- The interaction coefficient becomes "the difference in the effect of $X$ between the $Z$ = "No" and $Z$ = "Yes" groups"
- (What if it's continuous? Mathematically the same but the thinking changes - the interaction term is the difference in the effect of $X$ you get when increasing $Z$ by one unit)
---
# Interactions
- Marriage for those without a college degree raises earnings by 24%. A college degree reduces the marriage premium by 25 percentage points. So marriage for those with a college degree changes earnings by .24 - .25 = -.01, i.e. *lowers* them by about 1%
```{r}
PSID <- PSID %>%
mutate(college = educatn >= 16)
etable(feols(log(earnings) ~ married*college, data = PSID))
```
---
# Notes on Interactions
- Like with polynomials, the coefficients on their own now have little meaning and must be evaluated alongside each other. $\beta_1$ by itself is just "the effect of $X$ when $Z = 0$", not "the effect of $X$"
- Yes, you *do* almost always want to include both variables in un-interacted form and interacted form. Otherwise the interpretation gets very thorny
- Interaction effects are *poorly powered*. You need a *lot* of data to be able to tell whether an effect is different in two groups. If $N$ observations give you adequate power to see whether the effect itself is different from zero, you need a sample of roughly $16\times N$ to see whether the difference in effects is nonzero. Sixteen times!!
- It's tempting to try interacting your effect with everything to see if it's bigger/smaller/nonzero in some groups, but because it's poorly powered, this is a bad idea! You'll get a lot of false positives
---
# In R!
- In R, binary variables (on the right-hand side) can just be treated as normal variables
- Categorical variables too (although if it's numeric you may need to run it through `factor()` first, or `i()` in `feols()`)
- In `feols()` you can specify which group gets dropped using `i()` and setting `ref` in it
```{r, echo = TRUE, eval = FALSE}
# drops married = FALSE
feols(log(earnings) ~ married, data = PSID)
# drops married = TRUE
feols(log(earnings) ~ i(married, ref = 'TRUE'), data = PSID)
```
---
# Binary variables
- You can also use `I()` to specify binary variables in-model, or `case_when()` to create categorical variables
- `case_when` works in steps - the first one that applies to you, you get, so that `TRUE` at the end catches "everyone else"
```{r, echo = TRUE, eval = FALSE}
PSID <- PSID %>%
  mutate(education = case_when(
    educatn < 12 ~ 'No High School Degree',
    educatn == 12 ~ 'High School Degree',
    educatn < 16 ~ 'Some College',
    educatn == 16 ~ 'Bachelor\'s Degree',
    TRUE ~ 'Graduate Degree'
  ))
feols(log(earnings) ~ education + I(kids > 0), data = PSID) |> etable()
```
---
# Binary Variables
```{r, echo = FALSE, eval = TRUE}
PSID <- PSID %>%
  mutate(education = case_when(
    educatn < 12 ~ 'No High School Degree',
    educatn == 12 ~ 'High School Degree',
    educatn < 16 ~ 'Some College',
    educatn == 16 ~ 'Bachelor\'s Degree',
    TRUE ~ 'Graduate Degree'
  ))
etable(feols(log(earnings) ~ education + I(kids > 0), data = PSID))
```
---
# Polynomials and Logarithms
- As previously discussed, `I()` will let us do functions like $X^2$
- We can also do `log()` straight in the regression.
```{r, echo = TRUE, eval = FALSE}
lm(Y ~ X + I(X^2) + I(X^3), data = df)
lm(log(Y) ~ log(X), data = df)
```
---
# Interactions
- `X*Z` will include `X`, `Z`, and also their interaction
- If necessary, `X:Z` is the interaction only, but you rarely need this. However, it's handy for referring to the interaction term in `linearHypothesis`!
- In `feols()` the `i()` function is a very powerful way of doing interactions
```{r, echo = TRUE, eval = FALSE}
feols(Y ~ X*Z, data = df)
feols(Y ~ X + X:Z, data = df)
feols(Y ~ i(Z, X), data = df)
feols(Y ~ i(Z, X), data = df) |> iplot()
```
---
# Tests
- `wald()` can be handy for testing groups of binary variables for a categorical
- Also good for testing all the polynomial terms, or testing if the effect of $X$ is significant at a certain value of $Z$
```{r, echo = TRUE, eval = FALSE}
# Is the education effect zero overall?
m1 <- feols(log(earnings) ~ education, data = PSID)
wald(m1, 'education')
# Does X have any effect?
m2 <- feols(Y ~ X + I(X^2) + I(X^3), data = df)
wald(m2, 'X') # Gets all coefficients with an 'X' anywhere in the name - check this is right!
# Is the effect of X significant when Z = 5?
library(multcomp)
m3 <- lm(Y ~ X*Z, data = df)
glht(m3, 'X + 5*X:Z = 0') %>% summary()
```