---
title: "Intro to R stats, linear regression, MiCM Fall 2022 Workshop"
author: "Alex Diaz-Papkovich"
date: "October 11, 2022"
output:
  html_document:
  #github_document:
    toc: true
    toc_depth: 3
    toc_float: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
Regression is a set of methods to measure and estimate the relationship between variables. The idea is your data will "regress" to some function. The focus of this workshop is linear regression, one of the most effective and widely-used statistical tools. The idea is that for each observation, you have some outcome that depends on other variables, and this relationship can be expressed as a linear equation, $y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_{p}x_{p} + noise$.
Why linear regression? Because it's:
* interpretable
* easy to implement
* backed by good statistical properties
* robust
Linear regression works when you are modeling continuous data (i.e. not binary or categorical data), and when you have a handful of predictors. If you have $p$ predictors and $n$ samples, we assume that $p << n$. If your data does not fall into these regimes, linear regression will not work. To deal with these scenarios, we can extend the ideas of linear regression to formulate *generalized linear models* (a family of models containing logistic regression and Poisson regression) and penalized/regularized regression like LASSO and ridge regression. Here, we will focus on simple cases where linear regression works.
Linear regression is extremely powerful and is regularly used in every type of analysis across scientific fields. A GWAS, for example, is just a series of linear regression models carried out on a set of SNPs, where we assume a phenotype $y$ can be modelled by the number of alleles at each SNP $S$, $x$, i.e. $y = \beta_{0} + \beta_{S}x$. Note that this is simply a measurement of association between variables -- linear regression will **not** determine whether there is a causal relationship between variables, though it is one of the tools that can add to evidence of causality.
### Formulation
There are a few basic assumptions that you must make when doing linear regression.
The first assumption is that your data are related by a _linear relationship_, that is, it can be expressed as a linear equation. In the simplest case, we take the equation of a line, $y=mx + b$, where we know what the $y$ and $x$ values are and estimate the values of the slope $m$ and intercept $b$. For easier notation, we typically denote the intercept as $\beta_0$ and the slope as $\beta_1$ and rearrange the equation as: $$y=\beta_{0} + \beta_{1}x$$
```{r, echo=FALSE}
## Set up coordinate system (with x == y aspect ratio):
plot(c(-2,3), c(-1,5), type = "n", xlab = "x", ylab = "y", asp = 1)
## the x- and y-axis, and an integer grid
abline(h = -5:5, v = -5:6, col = "lightgray", lty = 3)
abline(a = 1, b = 2, col = 2)
text(2,2, "y = 2x + 1", col = 2, adj = c(-.1, -.1))
```
We will (hopefully!) have more than one value for each $(x,y)$ pair, so we denote each observation $(x_i,y_i)$. If we have $N$ samples, then $i$ takes the values $1 \dots N$, so our data are described as the collection of observations $(x_1,y_1)\dots(x_N,y_N)$. Given this set of data, our model is now:
$$y_i = \beta_{0} + \beta_{1}x_{i} \text{ , for } i = 1 \dots N$$
```{r, echo=FALSE}
## Set up coordinate system (with x == y aspect ratio):
x <- seq(-1,3,0.1)
y <- 2*x + 1
plot(c(-2,3), c(-1,5), type = "n", xlab = "x", ylab = "y", asp = 1)
## the x- and y-axis, and an integer grid
abline(h = -5:5, v = -5:6, col = "lightgray", lty = 3)
abline(a = 1, b = 2, col = 2)
points(x,y)
text(2,2, "y = 2x + 1", col = 2, adj = c(-.1, -.1))
```
Finally, we need to add noise to our model. Our observations will never be perfect, so we have to build this error into the model. The errors are denoted by $\epsilon_{i}$, and we assume that they independently follow the same Normal distribution centred around $0$ and with a constant variance, $\sigma^{2}$. This is written as $\epsilon_{i} \stackrel{iid}{\sim}N(0,\sigma^{2})$. Now, our model looks like:
$$y_i = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i} \text{ , for } i = 1 \dots N \text{ , } \epsilon_{i} \stackrel{iid}{\sim}N(0,\sigma^{2})$$
```{r, echo=FALSE}
## Set up coordinate system (with x == y aspect ratio):
x <- seq(-1,3,0.1)
e <- rnorm(length(x), mean=0, sd=1)
y <- 2*x + 1 + e
plot(c(-2,3), c(-1,5), type = "n", xlab = "x", ylab = "y", asp = 1)
## the x- and y-axis, and an integer grid
abline(h = -5:5, v = -5:6, col = "lightgray", lty = 3)
abline(a = 1, b = 2, col = 2)
points(x,y)
text(2,2,"y = 2x + 1 + error", col = 2, adj = c(-.1, -.1))
```
There are several assumptions with this approach:
* The errors are *random*
* The errors are *independent* ($\implies$ uncorrelated)
* The errors are *normally distributed*
* The errors have *mean zero*
* The errors have *the same variance* (homoscedasticity)
#### How do we estimate $\beta$?
Here we started with a linear equation and generated our data. In real life, we have our data and need to estimate the linear equation. Given our observations $(x_i,y_i)$, we want to estimate $\beta_0$ and $\beta_1$. Estimates of $\beta$ values are denoted as $\hat{\beta_j}$ for every parameter $j$. The value of $\hat\beta_1$ is the estimate of the unit-change between $x_i$ and $y_i$. For example, if our $y$ values are a person's weight (in kg) and our $x$ values are height (in cm), then $\hat{\beta_1}$ is the estimate of how much somebody's weight in kg increases for a 1cm change in height.
One logical approach is to estimate the values of the $\beta$s in a way that makes the errors, or equivalently the sum of their squares, as small as possible. This is the **ordinary least squares** estimator.
$$y_i = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i}
\\ \text{rearranged as: } \epsilon_i = y_i - \beta_{0} - \beta_{1}x_{i}
$$
If we have $N$ observations, then the sum of the squared errors is
$$\sum_{i}^{N}\epsilon_{i}^{2} = \sum_{i}^{N}(y_{i} - \beta_{0} - \beta_{1}x_{i})^{2} = SSE$$
Geometrically, we are drawing squares on our regression plot: for each observation $i$, the square has sides of length $\epsilon_i$ (the distance between the point and the regression line) and area $\epsilon_i^2$. The goal of the least squares estimator is to make the sum of the areas of these squares as small as possible. For an intuitive animation, [see this web page](http://setosa.io/ev/ordinary-least-squares-regression/).
Using some calculus, the estimates of the $\beta$s turn out to be:
$$\hat{\beta_1} = \frac{\sum^{N}_{i=1}(x_{i} - \bar{x})(y_{i} - \bar{y})}
{\sum_{i=1}^{N}(x_{i} - \bar{x})^{2}} = \frac{SS_{xy}}{SS_{xx}}$$
$$\hat{\beta_0} = \bar{y} - \hat{\beta_{1}}\bar{x}$$
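As a quick sanity check, these formulas are easy to compute by hand. The sketch below simulates a small dataset (the variable names are made up for illustration) and compares the closed-form estimates with what `lm` returns; the two should agree.
```{r Closed-form OLS estimates}
set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 3 * x + rnorm(30)   # true beta0 = 2, beta1 = 3
b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # SS_xy / SS_xx
b0_hat <- mean(y) - b1_hat * mean(x)
c(b0_hat, b1_hat)            # closed-form estimates
coef(lm(y ~ x))              # lm's estimates, which should match
```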
Because of the assumptions we made earlier, our estimators have some very nice statistical properties. If our errors are uncorrelated, have mean zero, and are homoscedastic, then our estimates are the best linear unbiased estimator (BLUE). Being unbiased means that the estimator of $\beta$ is, on average, equal to the true value of $\beta$. Being the "best" estimator means it has the lowest variance of all linear unbiased estimators. There are numerous websites and textbooks that discuss derivations of these estimates in great detail.
### Notation and terminology
Models are usually written in matrix notation for convenience. So far we have worked with one continuous $y$ variable and one continuous $x$ variable. This is called **simple linear regression**. Since we are working with $N$ observations (indexed by $i$), we have a system of linear equations.
$$
y_1 = \beta_{0} + \beta_{1}x_{1} + \epsilon_{1} \\
y_2 = \beta_{0} + \beta_{1}x_{2} + \epsilon_{2} \\
\dots \\
y_N = \beta_{0} + \beta_{1}x_{N} + \epsilon_{N}
$$
Rather than writing this out repeatedly, we work with matrix notation:
$$Y = \begin{pmatrix}
y_1 \\ y_2 \\ \dots \\ y_N
\end{pmatrix}, X =
\begin{pmatrix}
1 & x_{1} \\ 1 & x_{2} \\ \dots & \dots \\ 1 & x_{N}
\end{pmatrix},
\beta = \begin{pmatrix}
\beta_{0} \\ \beta_{1}
\end{pmatrix},
\epsilon = \begin{pmatrix}
\epsilon_{1} \\ \epsilon_{2} \\ \dots \\ \epsilon_{N}
\end{pmatrix}$$
$$Y = X\beta + \epsilon, \epsilon_{i} \stackrel{iid}{\sim}N(0,\sigma^{2})$$
When you extend your model to include multiple predictors (say, $p$ of them), your model will look the same except for the matrix $X$ and vector $\beta$, which will be:
$$X =
\begin{pmatrix}
1 & x_{11} & x_{12} & \dots & x_{1p} \\
1 & x_{21} & x_{22} & \dots & x_{2p}\\
\dots & \dots & \dots & \dots & \dots \\
1 & x_{N1} & x_{N2} & \dots & x_{Np}
\end{pmatrix},
\beta = \begin{pmatrix}
\beta_{0} \\ \beta_{1} \\ \dots \\ \beta_{p}
\end{pmatrix}$$
The estimate for the $\beta$s in matrix notation is written as:
$$\hat{\beta} = (X^{T}X)^{-1}X^{T}Y$$
When doing inference, the $\beta$s follow a normal distribution:
$$\hat{\beta} \sim N(\beta, (X^{T}X)^{-1}\sigma^{2})$$
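This matrix formula can also be checked directly in R using `model.matrix` and `solve`; note this is only a sketch for intuition (internally, `lm` uses a more numerically stable QR decomposition rather than inverting $X^{T}X$).
```{r Matrix form of the estimator}
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)
X <- model.matrix(~ x1 + x2)              # design matrix: a column of 1s plus the predictors
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
cbind(beta_hat, coef(lm(y ~ x1 + x2)))    # the two columns should agree
```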
There are many terms used to describe the variables and types of models used. This is a brief reference.
* Independent variable, regressor, feature, explanatory variable, predictor, parameter, etc: These are all terms for your $x$ variables.
* Dependent variable, regressand, target, outcome, response, etc: These are all terms for your $y$ variable, which you are modeling using your $x$ variables.
* Simple/single linear regression: The simplest regression model, $y_i = \beta_0 + x_i \beta_1 + \epsilon_i$
* Multiple linear regression: Linear regression with multiple independent variables, $y_i = \beta_0 + x_{i,1}\beta_{1} + x_{i,2}\beta_{2} + \dots + x_{i,p}\beta_{p} + \epsilon_i$.
* Multivariate linear regression: Linear regression with multiple dependent (response) variables that share a covariance structure, $Y_{ij} = \beta_{0j} + \beta_{1j}X_{i1} + \dots + \beta_{pj}X_{ip} + \epsilon_{ij}$ for observations $i=1, \dots, N$ and dependent variables $j = 1, \dots,m$. We won't go into this in this workshop, but it's useful for when you want to simultaneously model related outcomes, such as multiple phenotypes given a shared input, like gene expression levels.
As an aside, this is why it's important to know, mathematically, what your model looks like. Using ambiguous terms makes it tedious to understand and/or replicate research and results.
## Linear regression in R
### Simple linear regression
Most scientific programming languages have built-in linear regression. In R, linear models are constructed with the `lm` function. The general format is:
``my_model <- lm(y ~ x1 + x2 + ... + xp, data = my_data)``
As an example, let's generate some data and run a regression on it.
```{r Generate some simulation data}
set.seed(20191203) # set a seed for reproducibility
# let's set up the true parameter values
b0 <- -300 # true beta_0
b1 <- 7 # true beta_1
N <- 40 # sample size
# generate some x values, let's pretend it's height in inches
# a sample of size N bounded between 50 and 80
x_data <- runif(N,50,80)
# generate some random normal errors
errs <- rnorm(N, mean = 0, sd = 30)
y_data <- b0 + x_data*b1 + errs
```
The first thing you should do is **look at your data**. This should give you some idea of whether your data are linearly related.
```{r Plotting our generated data}
plot(x_data, y_data)
```
Note that the data is still kind of noisy -- this is fine, as data is not always clearly linearly related. However, if you saw other shapes (like parabolas, exponential growth, or periodicity), then the linear model would need some changes. Moving on, we code our model as:
```{r Setting up our linear model}
my_model <- lm(y_data ~ x_data)
```
We can visualize what the model looks like:
```{r Visualizing simple linear regression}
plot(x_data, y_data)
abline(my_model)
```
To look at the results of our model, we use the `summary` function.
```{r Summary of simple linear regression}
summary(my_model)
```
What's going on in this output?
```
Call:
lm(formula = y_data ~ x_data)
```
The first part shows the formula R used to run the analysis.
```
Residuals:
Min 1Q Median 3Q Max
-58.154 -13.992 1.414 24.988 49.085
```
This portion is a five-number summary of the residuals (errors); we'll come back to these when we evaluate the model.
```
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -310.4223 33.3369 -9.312 2.38e-11 ***
x_data 7.1091 0.5088 13.972 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```
The coefficients are the meat of the analysis. They are our slopes and intercept. The estimate for our intercept is $\hat{\beta_{0}} = -310.4223$ and for the slope is $\hat{\beta_1} = 7.1091$. Recall our original values when generating the data were $-300$ and $7$, respectively, so this is pretty good!
The standard error is the standard error of the estimate of a particular $\beta$ value. In this case, the standard error around the slope is $0.5088$, and the standard error around the intercept is $33.3369$. Next we have the t-statistic and p-value for each $\beta$. The t-statistic is the estimate divided by its standard error, and the p-value is the probability of drawing a t-statistic at least this extreme from a Student's t-distribution with $N-p-1$ degrees of freedom, where $p$ is the number of independent variables. In principle you could calculate these values yourself... but that's what software is for.
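For the curious, here is a minimal sketch of that calculation for the slope, reusing `my_model` from above; the numbers should reproduce the `t value` and `Pr(>|t|)` columns of the summary.
```{r t-statistic by hand}
est <- summary(my_model)$coefficients["x_data", "Estimate"]
se  <- summary(my_model)$coefficients["x_data", "Std. Error"]
t_stat <- est / se                                                  # t value: estimate / standard error
p_val  <- 2 * pt(abs(t_stat), df = 40 - 1 - 1, lower.tail = FALSE)  # two-sided p-value on 38 df
c(t_stat, p_val)
```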
Moving on, although we have estimates of our $\beta$ values, we'd also like to know whether each one could plausibly be zero (perhaps the standard errors are so large that $\beta$ could plausibly be $0$). So for each slope, using the aforementioned t-statistic, we are testing:
$$H_{0}: \beta_{i} = 0 \text{ vs. } H_{A}: \beta_i \ne 0$$
In this case our p-value falls below $0.05$ for both $\beta_0$ and $\beta_1$, so we reject the null hypothesis in each case. That is, our slopes are non-zero. If you run an analysis with multiple $\beta$s, you will need to assess the significance of each slope and remove those that do not appear to be significant.
```
Residual standard error: 26.89 on 38 degrees of freedom
Multiple R-squared: 0.8371, Adjusted R-squared: 0.8328
F-statistic: 195.2 on 1 and 38 DF, p-value: < 2.2e-16
```
The residual standard error estimates the typical amount by which the response deviates from the fitted regression line. So, roughly speaking, our $y$ values will be about $26.89$ away from the line on average.
The R-squared ($R^2$) is the **coefficient of determination**. It is defined as $R^2 = 1 - \frac{\sum_{i=1}^{N}\epsilon_{i}^{2}}{\sum_{i=1}^{N}(y_{i} - \bar{y})^{2}} = 1 - \frac{SSE}{SST}$. It is bounded between $0$ and $1$ and is the proportion of variance explained by your model: the closer it is to $1$, the better your model fits the data. If your $R^2$ is $0$, the model explains none of the variance; if it is $1$, then all of your points fall exactly on the fitted line. In simple linear regression, $R^2$ is the square of the Pearson correlation coefficient $\rho$. Note that it is a measure of the *linear* relationship between variables. It's possible to have an explicit non-linear relationship (such as $y = x^2$) between $x$ and $y$; in these cases, a linear model could have $R^2 = 0$ even though the variables are very clearly related. Adding extra parameters will artificially inflate the value of $R^2$ despite not adding any explanatory value; the adjusted $R^2$ corrects for this. Remember, the goal is to explain relationships in your data, not to meet some arbitrary threshold!
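As an aside, the $R^2$ in the summary can be reproduced directly from this definition (reusing `my_model`, `x_data`, and `y_data` from above):
```{r R-squared by hand}
sse <- sum(resid(my_model)^2)          # sum of squared errors
sst <- sum((y_data - mean(y_data))^2)  # total sum of squares
1 - sse / sst                          # should match "Multiple R-squared"
cor(x_data, y_data)^2                  # equivalently, the squared Pearson correlation
```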
The F-statistic tests the hypothesis that *all* of your non-intercept $\beta$ values are $0$, ie:
$$H_0: \beta_1 = \dots = \beta_p = 0 \text{ vs. } H_A:\text{At least one slope is not zero}$$
If your F-statistic is non-significant, none of your predictors are useful! In this case it's a moot point as we only have one slope anyway.
Because of the assumptions we made earlier, we can construct confidence intervals for our $\beta$ values. We can do this in R using `confint`, or construct them ourselves using the equation $\hat{\beta_{k}} \pm t_{1 - \alpha/2,\,n-p-1}\,s.e.(\hat{\beta_{k}})$, where $t_{1 - \alpha/2,\,n-p-1}$ is the $1-\alpha/2$ quantile of the Student's t-distribution with $n-p-1$ degrees of freedom. For a $95\%$ confidence interval, the relevant quantiles are $t_{0.975, n-p-1}$ and $t_{0.025, n-p-1}$.
```{r Confidence intervals for betas, echo=F}
print("Confidence intervals with confint function")
print(confint(my_model))
```
```{r Manually calculate confidence intervals}
t_1 <- qt(0.025,38) # n - p - 1 = 40 - 1 - 1 = 38
# t_2 <- qt(0.975,38) # the t-distribution is symmetric so this is just t_1*(-1)
beta_est <- summary(my_model)$coefficients[2,1] # retrieve the beta1 estimate from the model
beta_sd <- summary(my_model)$coefficients[2,2] # retrieve the standard error of the beta1 estimate from the model
print(paste("(",(beta_est + t_1*beta_sd), ", " , (beta_est - t_1*beta_sd), ")",sep=""))
```
### Multiple variables
Often we want to test multiple variables. Luckily, this is straightforward to do -- we simply add more variables to our model and carry out the rest of the tests as intended. Mathematically for each observation we have:
$$y_i = \beta_0 + \beta_1x_{i1} + \beta_{2}x_{i2} + \epsilon_{i}$$
```{r Multiple regression}
set.seed(20191203) # set a seed for reproducibility
# let's set up the true parameter values
b0 <- -300 # true beta_0
b1 <- 7 # true beta_1
b2 <- -12 # true beta_2
N <- 40 # sample size
# generate some x values again
x1_data <- runif(N,50,80)
x2_data <- runif(N,10,20)
# generate some random normal errors
errs <- rnorm(N, mean = 0, sd = 30)
y_data <- b0 + x1_data*b1 + x2_data*b2 + errs
```
In this case, we've created our $y$ variable as a linear combination of two variables ($x_1$ and $x_2$). Let's code this in R and see what the results are. To begin, we'll put our data into a dataframe -- so far we've only worked with vectors. Generally you'll be working with dataframes (or some equivalent if you're working in another language), which are easier to manage. The results will be the same, just slightly easier to follow.
```{r Create the linear model and show the output}
# put the data into a dataframe
my_data <- data.frame(y_data, x1_data, x2_data) # put the data into a frame
colnames(my_data) <- c("y","x1","x2") # give names to our variables
my_model_2 <- lm(y ~ x1 + x2, data = my_data) # set up the model
summary(my_model_2) # output the analysis
```
The estimates are:
* $\hat{\beta_0} = -253.3287$ (true value: $\beta_0 = -300$)
* $\hat{\beta_{1}} = 6.0382$ (true value: $\beta_1 = 7$)
* $\hat{\beta_{2}} = -11.0692$ (true value: $\beta_2 = -12$)
We interpret these results the same as in the simple regression model. If we increase $x_{i2}$ by $1$, then our estimate for $y_{i}$ will change by $-11.0692$. If we increase $x_{i1}$ by $1$, then our estimate for $y_{i}$ will change by $6.0382$. Note these estimated slopes are close to the true values. As an exercise, let's see what happens if we raise the sample size to something huge, say, $10,000$.
```{r Multiple regression with large sample size}
set.seed(20191203) # set a seed for reproducibility
# let's set up the true parameter values
b0 <- -300 # true beta_0
b1 <- 7 # true beta_1
b2 <- -12 # true beta_2
N <- 10000 # sample size
# generate some x values again
x1_data <- runif(N,50,80)
x2_data <- runif(N,10,20)
# generate some random normal errors
errs <- rnorm(N, mean = 0, sd = 30)
y_data <- b0 + x1_data*b1 + x2_data*b2 + errs
my_model_3 <- lm(y_data ~ x1_data + x2_data)
summary(my_model_3)
```
So as the sample size gets bigger, the estimates of the $\beta$ values get much closer to the true value!
We can also visualize this regression, although since we've got an extra dimension, we are now fitting a plane in 3D space, as opposed to a line in 2D space.
```{r 3D regression}
set.seed(20191203) # set a seed for reproducibility
# let's set up the true parameter values
b0 <- -300 # true beta_0
b1 <- 7 # true beta_1
b2 <- -12 # true beta_2
N <- 100 # sample size
# generate some x values again
x1_data <- runif(N,50,80)
x2_data <- runif(N,10,20)
# generate some random normal errors
errs <- rnorm(N, mean = 0, sd = 30)
y_data <- b0 + x1_data*b1 + x2_data*b2 + errs
my_model_3 <- lm(y_data ~ x1_data + x2_data)
library(scatterplot3d) # load the library for visualization - you may need to install this
plot3d <- scatterplot3d(cbind.data.frame(x1_data,x2_data,y_data), angle=55)
plot3d$plane3d(my_model_3)
```
Unfortunately we can't perceive beyond three dimensions. But for now, this will do.
### Categorical data
Categorical data is handled through *dummy variables*. These take on binary values, $\{0,1\}$, and are mutually exclusive. Common examples include sex and treatment/control. When a categorical variable has multiple levels (e.g. Control, Treatment1, Treatment2, etc.), you need one dummy variable for each level beyond a reference level. In R, the `factor` function handles this coding automatically. The regression results are interpreted relative to the reference group: for example, if you code for sex, then the estimated $\beta$ is the difference in averages for males relative to females (or vice versa, depending on which level is the reference). Our model is:
$$y_i = \beta_0 + \beta_1x_{i} + \epsilon_{i} \text{ ,where } x_{i}=\{0,1\}$$
Let's simulate some more data, this time from two groups.
```{r Regression with categorical data}
set.seed(20191203) # set a seed for reproducibility
ng <- 2 # number of groups
N <- 40 # sample size
# values of the true betas
b0 <- 10
b1 <- -5
group <- rep( c("group1", "group2"), each = N) # create the two sets of groups
errs <- rnorm(ng*N, 0, sd=10) # generate the random errors
y_data <- b0 + b1*(group=="group2") + errs # observations in group 2 have b1 added to their mean
my_model_cat <- lm(y_data ~ as.factor(group)) # define the linear model
summary(my_model_cat)
```
Once again, we retrieve pretty good estimates of our true $\beta$ values.
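If you want to see exactly how R encodes the dummy variable, or change which level is the reference, `model.matrix` and `relevel` are handy. A short sketch reusing `group` and `y_data` from the chunk above (the helper name `group_releveled` is just for illustration):
```{r Inspecting dummy coding}
head(model.matrix(~ factor(group)))              # the design matrix: an intercept column plus a 0/1 dummy
table(model.matrix(~ factor(group))[, 2], group) # the dummy is 1 exactly for group2
# change the reference level: the slope now estimates group1 relative to group2
group_releveled <- relevel(factor(group), ref = "group2")
coef(lm(y_data ~ group_releveled))
```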
### Relationship to other statistical tests
Pausing to think for a moment... what are we testing? We assume our data comes from two distributions, and that the means between these distributions are different, i.e., their difference is not zero. We can visualize this via (for example) a boxplot:
```{r Visualizing difference between groups}
plot(factor(group), y_data)
```
In fact, if we run a t-test instead of a regression model, we actually retrieve the same estimates we had before:
```{r t-test comparisons}
group1_data <- y_data[1:40]
group2_data <- y_data[41:80]
t.test(group1_data, group2_data, var.equal = TRUE) # pooled-variance t-test; var.equal = TRUE makes it match the linear model exactly
print("Estimated difference between groups:")
print((t.test(y_data[41:80], y_data[1:40])$estimate[1] -
t.test(y_data[41:80], y_data[1:40])$estimate[2])) # estimated difference
```
Basically every statistical test can be rewritten in terms of a linear model! There is an excellent intro available [here](https://lindeloev.github.io/tests-as-linear/).
### Continuous and categorical data
Using the above, we can combine categorical and continuous data into one model. We will treat $x_1$ as continuous and $x_2$ as categorical.
```{r Categorical and mixed data}
set.seed(20191203) # set a seed for reproducibility
# let's set up the true parameter values
b0 <- -300 # true beta_0
b1 <- 7 # true beta_1
b2 <- -30 # true beta_2
N <- 100 # sample size
# generate some x values again
x1_data <- runif(N,50,80) # generate continuous data
x2_data <- c(rep(1,N/2), rep(0,N/2)) # generate binary data
# generate some random normal errors
errs <- rnorm(N, mean = 0, sd = 30)
y_data <- b0 + x1_data*b1 + x2_data*b2 + errs
my_model_cont_cat <- lm(y_data ~ x1_data + x2_data)
summary(my_model_cont_cat)
```
We would read this output as before. Let's visualize what we're looking at.
```{r Visualizing categorical and continuous data}
library(ggplot2)# we'll use ggplot for plotting here
set.seed(20191203) # set a seed for reproducibility
# let's set up the true parameter values
b0 <- -300 # true beta_0
b1 <- 7 # true beta_1
b2 <- -30 # true beta_2
N <- 100 # sample size
# generate some x values again
x1_data <- runif(N,50,80) # generate continuous data
x2_data <- c(rep(1,N/2), rep(0,N/2)) # generate binary data
# generate some random normal errors
errs <- rnorm(N, mean = 0, sd = 30)
y_data <- b0 + x1_data*b1 + x2_data*b2 + errs
# put together a data frame
my_data <- data.frame(y_data, x1_data, factor(x2_data))
colnames(my_data) <- c("y", "x1", "x2")
# generate a linear model
my_model_cont_cat <- lm(y ~ x1 + x2, data = my_data)
p <- ggplot(data = my_data, aes(y = y, x = x1, colour = x2)) +
geom_point() +
geom_abline(intercept = coef(my_model_cont_cat)[1], slope = coef(my_model_cont_cat)[2], colour="#F8766D") +
geom_abline(intercept = coef(my_model_cont_cat)[1] + coef(my_model_cont_cat)[3],
slope = coef(my_model_cont_cat)[2], colour = "#00BFC4")
p
```
Depending on whether the data belongs to one category or another (sex or treatment, for example), the regression line is pushed up or down depending on the estimate of the appropriate $\beta$.
### Prediction
Imagine we have some new input data and want to predict its value according to our model. Mathematically, this is straightforward: given our estimates of the $\beta$ values, we just plug the inputs in. That is, we take our inputs and map them onto the regression line we've drawn. The equation is:
$$y_{predicted} = \hat{\beta_{0}} + \hat{\beta_{1}}x_{input_1} + \dots + \hat{\beta_{p}}x_{input_p}$$
In R, we can use the `predict` function. All of this holds when you have multiple parameters as well. Let's look at the case of simple linear regression for illustration.
```{r}
set.seed(20191203) # set a seed for reproducibility
# let's set up the true parameter values
b0 <- -300 # true beta_0
b1 <- 7 # true beta_1
N <- 40 # sample size
# generate some x values, let's pretend it's height in inches
# a sample of size N bounded between 50 and 80
x_data <- runif(N,50,80)
# generate some random normal errors
errs <- rnorm(N, mean = 0, sd = 30)
y_data <- b0 + x_data*b1 + errs
my_model <- lm(y_data ~ x_data)
# Create new points at x=70 and x=72
# predict where the data will fall and plot it
new_points <- data.frame(x_data=c(72,70))
new_prediction <- predict(my_model, new_points)
plot(x_data, y_data)
abline(my_model)
points(x = new_points$x_data, y = new_prediction, col = "red", pch="X")
```
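As a side note, `predict` can also return an interval around each prediction: `interval = "confidence"` gives an interval for the mean response at that input, while `interval = "prediction"` gives a wider interval for a single new observation. A short sketch reusing `my_model` and `new_points` from the chunk above:
```{r Prediction intervals}
predict(my_model, new_points, interval = "confidence")  # interval for the mean response
predict(my_model, new_points, interval = "prediction")  # wider interval for a new observation
```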
### Model evaluation
Now that we've constructed our models, we need to check how good they are. To do this, we carry out residual diagnostics. Recall the assumptions of our residuals:
* They are independent and identically distributed according to a normal distribution
The primary way to diagnose issues is to look at plots of the error terms. If we look at a scatterplot of errors, they should be random and distributed around the line $y=0$ regardless of any other factors (e.g. order of the points or the value of the $x$ variables). There should be no patterns. Next, if we look at a QQ plot for a normal distribution, they should fall along a straight line (or at least something close to it). It's quite easy to generate these plots in R -- you just put your model into the `plot` function. It generates four plots:
1. **Residuals vs fitted**. There should be no non-linear patterns, and data should be randomly distributed about the line $y=0$.
2. **The Normal QQ plot**. The residuals should follow a straight line that runs from corner to corner.
3. **Scale-location**. This looks at whether the residuals change as your input data gets more extreme. If your errors have the same variation across fitted values, they are homoscedastic (which is what we want). Otherwise, they are heteroscedastic.
4. **Residuals vs leverage**. This looks at influential points (outliers). Sometimes having a few extreme values can tilt the estimates dramatically.
A thorough discussion of these diagnostics with examples can be found [here](https://data.library.virginia.edu/diagnostic-plots/).
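If you prefer to build these plots yourself, the residuals and fitted values can be pulled out with `resid` and `fitted`; a minimal sketch (assuming `my_model` from the simple regression above is still in your workspace):
```{r Manual residual plots}
plot(fitted(my_model), resid(my_model),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)        # residuals should scatter randomly around zero
qqnorm(resid(my_model))       # points should fall near the reference line
qqline(resid(my_model))
```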
```{r}
set.seed(20191203) # set a seed for reproducibility
# let's set up the true parameter values
b0 <- -300 # true beta_0
b1 <- 7 # true beta_1
N <- 40 # sample size
# generate some x values, let's pretend it's height in inches
# a sample of size N bounded between 50 and 80
x_data <- runif(N,50,80)
# generate some random normal errors
errs <- rnorm(N, mean = 0, sd = 30)
y_data <- b0 + x_data*b1 + errs
my_model <- lm(y_data ~ x_data)
plot(my_model)
```
Does this look random? We generated the data randomly, but with small samples, patterns can appear by chance. Let's increase the sample size and see what happens.
```{r}
set.seed(20191203) # set a seed for reproducibility
# let's set up the true parameter values
b0 <- -300 # true beta_0
b1 <- 7 # true beta_1
N <- 1000 # sample size
# generate some x values, let's pretend it's height in inches
# a sample of size N bounded between 50 and 80
x_data <- runif(N,50,80)
# generate some random normal errors
errs <- rnorm(N, mean = 0, sd = 30)
y_data <- b0 + x_data*b1 + errs
my_model <- lm(y_data ~ x_data)
plot(my_model)
```
These plots are much more satisfying. Now, what do bad plots look like?
```{r}
set.seed(20191203) # set a seed for reproducibility
# let's set up the true parameter values
b0 <- -300 # true beta_0
b1 <- 7 # true beta_1
N <- 40 # sample size
x_data <- runif(N,50,60)
# generate some random normal errors
errs <- rnorm(N, mean = 0, sd = 30)
y_data <- b0 + exp(x_data)*b1 + errs
my_model_bad <- lm(y_data ~ x_data)
summary(my_model_bad)
```
In this model we generate our `y_data` through an exponential relationship to `x_data`: $y = \beta_{0} + \beta_{1}e^{x}$. The estimates of $\beta$ are still significant and the $R^2$ is decent. What happens when we look at the residuals?
```{r Residual analysis of a poor model}
plot(my_model_bad)
```
As we can see, the residual diagnostics look poor. There are clear patterns and influential outliers -- we have violated several of our assumptions so we question the accuracy of our estimates. Plotting the data with the linear fit is also illuminating.
```{r Plotting the linear fit of a poor model}
plot(x_data, y_data)
abline(my_model_bad)
```
The lesson here is **always look at your data**! Often you can salvage your model by transforming the variables. Since the relationship here is exponential in $x$, we could transform the response and fit $\log(y) = \beta_0 + \beta_{1}x$ instead; in other situations, transforming a predictor (for example with $\log(x)$) is the appropriate fix.
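A minimal sketch of that refit, reusing `x_data` and `y_data` from the poor-model chunk above (here $y$ is strictly positive, so taking its log is safe):
```{r Refitting on the log scale}
my_model_log <- lm(log(y_data) ~ x_data)  # log(y) is approximately linear in x here
summary(my_model_log)
plot(x_data, log(y_data))
abline(my_model_log)
```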
## Pitfalls
Linear regression is not appropriate for predicting categorical or binary outcomes, or probabilities such as the probability of death. Generalized linear models are better suited to these cases. If your $x$ variables are strongly correlated (*multicollinearity*), the estimates from the model will be unreliable; in this case you should remove redundant parameters. If you have many parameters relative to your sample size (or more parameters than samples), you should look at regularized regression methods or dimension reduction approaches, such as principal components analysis.
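To see why multicollinearity is a problem, here is a small simulated illustration (all variable names are made up for the example): when a second predictor is nearly a copy of the first, the standard errors on both slopes blow up.
```{r Multicollinearity illustration}
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)            # x2 is almost an exact copy of x1
y  <- 1 + 2 * x1 + rnorm(n)
summary(lm(y ~ x1 + x2))                  # note the inflated standard errors on both slopes
summary(lm(y ~ x1))                       # dropping the redundant predictor stabilizes things
1 / (1 - summary(lm(x1 ~ x2))$r.squared)  # a rough variance inflation factor for x1
```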
### Anscombe's quartet
Anscombe's quartet is a collection of four datasets that produce essentially identical regression results (the same fitted line, $R^2$, and so on). If we run summaries on them, this is what we get.
```{r Anscombes quartet, echo=FALSE}
# taken from the R documentation on CRAN.
## So regression on each dataset
ff <- y ~ x
mods <- setNames(as.list(1:4), paste0("lm", 1:4))
for(i in 1:4) {
ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
## or ff[[2]] <- as.name(paste0("y", i))
## ff[[3]] <- as.name(paste0("x", i))
mods[[i]] <- lmi <- lm(ff, data = anscombe)
#print(anova(lmi))
print(summary(mods[[i]]))
}
```
But if we plot them, we see that all of these are **completely different**.
```{r Anscombe plots, echo=FALSE}
## Now, do what you should have done in the first place: PLOTS
op <- par(mfrow = c(2, 2), mar = 0.1+c(4,4,1,1), oma = c(0, 0, 2, 0))
for(i in 1:4) {
ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
plot(ff, data = anscombe, col = "red", pch = 21, bg = "orange", cex = 1.2,
xlim = c(3, 19), ylim = c(3, 13))
abline(mods[[i]], col = "blue")
}
mtext("Anscombe\'s 4 Regression data sets", outer = TRUE, cex = 1.5)
par(op)
```
### Patterns in noise
Even if we generate totally random data, we may get "significant" results. Let's run 100 trials, each with 20 $(x, y)$ pairs chosen at random from $[0,1]$, and count how often $\hat{\beta}_1$ is statistically significant.
```{r Random noise linear model}
# generate 20 random values between 0 and 1 and run a regression between them. Do this 100 times and see how many times b1 is "significant"
N <- 20
trials <- 100
num_sig <- 0
for (t in 1:trials){
x_random <- runif(N)
y_random <- runif(N)
random_model <- lm(y_random~x_random)
if (summary(random_model)$coefficients[2,4] < 0.05){
num_sig <- num_sig + 1
plot(x_random, y_random)
abline(random_model)
}
}
print(paste("We have",num_sig,"significant results!"))
```
### Confounders and Simpson's paradox
Confounding variables mask the true relationships between the variables you're measuring. Simpson's paradox occurs when, after adding confounders to a model, the original trends disappear or even reverse.
```{r Plotting Simpsons paradox}
# Example code modified from https://www.data-to-viz.com/caveat/simpson.html
library(ggplot2)
library(tidyverse)
# create our data
a <- data.frame( x = rnorm(100), y = rnorm(100)) %>% mutate(y = y-x/2)
b <- a %>% mutate(x=x+2) %>% mutate(y=y+2)
c <- a %>% mutate(x=x+4) %>% mutate(y=y+4)
simpsons_data <- do.call(rbind, list(a,b,c))
simpsons_data <- simpsons_data %>% mutate(group=rep(c("A", "B", "C"), each=100))
rm(a,b,c) # cleanup
# the function geom_smooth carries out the linear fit automatically, and then plots it
ggplot(simpsons_data, aes(x=x, y=y)) +
geom_point( size=2) +
geom_smooth(method = "lm") +
ggtitle("Linear fit without confounders")
ggplot(simpsons_data, aes(x=x, y=y, colour=group)) +
geom_point(size=2) +
geom_smooth(method = "lm") +
ggtitle("Linear fit with confounders")
```
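Fitting the two models confirms this numerically: the estimated slope on `x` should flip sign once `group` is added to the model. (This reuses `simpsons_data` from the chunk above.)
```{r Simpsons paradox coefficients}
coef(lm(y ~ x, data = simpsons_data))          # pooled fit: the slope on x is positive
coef(lm(y ~ x + group, data = simpsons_data))  # adjusting for group: the slope on x turns negative
```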
## Extensions
This introduction provides a good basis for pursuing more complex analyses. Some of these include:
* **Interactions**. In some cases independent variables interact, so that the slope of one variable depends on the value of another. For example, it's possible that a drug response is strongly related to sex and age, and also to the combination of the two (e.g. it gets weaker for males as they get older). This can be coded in R as `lm(y ~ x1 * x2)`, which R expands to `x1 + x2 + x1:x2` (see the short sketch after this list).
* **(Multiple) Analysis of (co)variance (ANOVA, ANCOVA, MANOVA, MANCOVA)**. Used for testing the differences between multiple groups (ANOVA), possibly controlling for continuous covariates (ANCOVA), or using multiple response variables (MANOVA/MANCOVA). These can be thought of as specific types of linear regression models.
* **Time series**. Longitudinal data often has correlations in the error terms. A common example is retail sales, which follow annual cycles. In biology, you may encounter measurements of phenotypes that vary systematically by season or time of day. Time series analysis can control for cyclical measures and examine overall trends.
* **Generalized linear models**, including logistic regression, to model categorical or binary data. These are used to calculate odds ratios and are common in estimating probabilities (e.g. probability of dying, contracting a disease, etc)
* **Linear mixed models**. So far we've treated the $\beta$s as fixed effects, i.e. the same coefficients apply to every observation. In a mixed model, some coefficients are treated as random effects that vary across groups (e.g. individuals, families, or sites).
* **Regularized regression**. In cases where we have many explanatory variables, regularization can improve our estimates. Common methods include Lasso, Ridge, and Elastic Net.
* **High dimensional data**. Sometimes we have massive numbers of explanatory variables and/or samples. This is common in gene expression data, single cell sequencing, neuroinformatics, etc. There are several approaches both for dimension reduction and modeling, and it is a very active area of research.
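As a small sketch of the interaction idea (simulated data; all names are made up for the example), the coefficient on `x1:x2` estimates how the slope of `x1` changes as `x2` changes:
```{r Interaction sketch}
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + 1.5 * x1 * x2 + rnorm(n)  # true interaction coefficient is 1.5
summary(lm(y ~ x1 * x2))                          # x1 * x2 expands to x1 + x2 + x1:x2
```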
## References
In addition to the links provided, I often reference *Introduction to regression modeling* by Bovas Abraham and Johannes Ledolter (2006). It is more focused on the mathematical and statistical side of things, but is not overly technical and helps to clearly lay out why we see certain results.