---
title: "Intro to logistic regression, MiCM Spring 2023 Workshop"
author: "Alex Diaz-Papkovich"
date: "March 17, 2023"
output:
  html_document:
  #github_document:
    toc: true
    toc_depth: 3
    toc_float: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## The log function
As a basic review of the $\log$ function (each identity is checked numerically in the chunk after this list):
* It is the inverse of exponentiation
* Unless otherwise stated, assume $\log$ refers to the natural logarithm with base $e$, sometimes written $\ln$
* $\log(1) = 0$
* $\log(0)$ is undefined
* $\log(x*y) = \log(x) + \log(y)$
* $\log(x/y) = \log(x) - \log(y)$
* $\exp(\log(x)) = x$
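These identities are easy to verify numerically (a quick sketch added for illustration; the values of `x` and `y` are arbitrary):
```{r Checking log identities}
x <- 4; y <- 2  # arbitrary positive values
log(1)                          # 0
log(x * y) - (log(x) + log(y))  # 0: the log of a product is the sum of logs
log(x / y) - (log(x) - log(y))  # 0: the log of a quotient is the difference of logs
exp(log(x))                     # recovers x, since exp inverts log
```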
## The binomial distribution
The binomial distribution models a situation where we have multiple independent Bernoulli trials---trials with binary success/failure outcomes. It involves three quantities:
* $y$, the observed number of successes
* $n$, the total number of trials
* $p$, the probability of success
The probability mass function is:
$$
P(Y=y) = {n \choose y}p^y(1-p)^{n-y}
$$
For example, if a drug has a $90\%$ chance of clearing an infection and we have $100$ patients, what is the probability that exactly $90$ patients are cleared of their infection? This is expressed mathematically as:
$$
P(Y=90) = {100 \choose 90}(0.9)^{90}(0.1)^{10}
$$
In R, we can calculate the value directly with `choose(100,90)*(0.9)^(90)*(0.1)^(10)` or (more usefully) with `dbinom(90, 100, 0.9)`. The value should be `0.1318653`.
We can also calculate the probability that $90$ or more patients are cleared by summing up these values. We can do this either by summing the probability that $90$, ..., $100$ are cleared, or using the `pbinom` function.
```{r Binomial distribution}
# Calculate the probability of exactly 90 patients clearing an infection
dbinom(90, 100, 0.9)
# Calculate the probability of 90 or more (i.e. 90, 91, ..., 100) patients clearing an infection
sum(dbinom(x = 90:100, size = 100, prob = 0.9))
# Equivalently, calculate the probability that 90 or more (i.e. more than 89) clear an infection
pbinom(89, 100, 0.9, lower.tail = FALSE)
```
One thing to consider is that even when we know the probability of an event happening, we don't know how many times it **will** happen. This is the fundamental nature of probability. If we simulate many runs of the same binomial probability, we'll get this:
```{r Running the binomial many times, echo=F}
set.seed(20230317)
nruns <- 500
nobs <- 100
p_success <- 0.5
# our simulation data
sim_data <- rbinom(nruns, size = nobs, prob = p_success)
hist(sim_data,
main = paste("Histogram of ",
nruns,
" runs of ",
nobs,
" trials with probability of success of ",
p_success))
print("Minimum of simulation data:")
print(min(sim_data))
print("Maximum of simulation data:")
print(max(sim_data))
```
In this case, we have a probability of success of $0.5$ with $100$ trials---even odds between success and failure. Even though we expect $50$ successes, we have as few as $35$ and as many as $63$.
We will not go deeply into the binomial distribution, though it is very useful. What we are interested in is estimating the value $p$, which we don't know, in the context of binary or count data. We typically have either a count of all events summarized in a table (e.g. in the form of a [contingency table](https://en.wikipedia.org/wiki/Contingency_table)) or a list of binary events (e.g. survival data with associated variables like drug treatment, sex, age, etc.).
## Logistic regression
Assume that we have binary outcome data ($y$) and explanatory variables ($\mathbf{x}$) for each observation. Each observation is a **Bernoulli trial**: $y=1$ with probability $\pi$ and $y=0$ with probability $(1-\pi)$. We believe that the probability of success ($y=1$) is related to $\mathbf{x}$, so we write $\pi$ as $\pi(\mathbf{x})$. Since we have many Bernoulli trials, which we assume are independent, our data follow the binomial distribution.
As with linear regression, we have $n$ observations and $p$ covariates. An individual observation is called a **case**, and when all explanatory covariates match among cases, we call it a **constellation** (personal note: I haven't heard this term used often---I've heard "class", "group", "crosstab", etc.). Some examples:
* Survival data ($y=1$ if a patient lived, $y=0$ if they died) by sex ($x_1=1$ for male, $x_1=0$ for female) and treatment ($x_2=1$ for treatment, $x_2=0$ for placebo)
* Vaccine data ($y=1$ if a patient becomes infected) by vaccination status ($x_1=1$ if they were vaccinated)
We write the explanatory variables as $\mathbf{x}=(x_{1},x_{2},\dots,x_{p})'$, and our function of the probability of success as $\pi=\pi(\mathbf{x})$.
The specifics of the notation (i.e. subscripts) will change depending on how the problem is formulated, but generally our binomial probability looks like this:
$$
P(Y=y) = {n \choose y}[\pi(\mathbf{x})]^{y}[1-\pi(\mathbf{x})]^{n-y}
$$
We are interested in estimating $\pi(\mathbf{x})$; we wish to link the probability $\pi$ to the covariates $\mathbf{x} = x_1, \dots, x_p$. We do this by selecting a **link function**. We would like a function in which:
* some transformation of $\pi$ depends on $\mathbf{x}$ in a linear manner, because linear functions have nice properties
* $\pi$ falls within the values $[0,1]$, since it is a probability.
There are many link functions. We will use the **logit** link; its inverse, the **logistic function**, maps the linear predictor back to a probability and takes the form:
$$
\pi(\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{x}'\mathbf{\beta}}}
$$
It looks like this:
```{r The logistic curve}
curve(1/(1+exp(-x)), from=-5, to=5, xlab="Explanatory variable x", ylab=expression(pi(x)~"=P(Y=1)"))
```
There are other link functions, such as the probit, the complementary log-log (cloglog), and the log-log (a visual comparison follows this list). The logit has some nice properties:
* It is symmetric
* It has simpler theoretical properties
* It is easier to interpret as the log of odds ratios
* It can be used in estimates either prospectively or retrospectively
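As a rough sketch (added for illustration; `plogis` and `pnorm` are R's built-in inverse logit and standard normal CDF), we can plot the inverse of each link to compare their shapes:
```{r Comparing inverse link functions}
# Each inverse link maps a linear predictor to a probability in [0,1]
curve(plogis(x), from = -5, to = 5,
      xlab = "Linear predictor", ylab = "P(Y=1)")  # logit (the logistic curve)
curve(pnorm(x), add = TRUE, lty = 2)               # probit
curve(1 - exp(-exp(x)), add = TRUE, lty = 3)       # complementary log-log
legend("topleft", legend = c("logit", "probit", "cloglog"), lty = 1:3)
```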
## Interpreting logistic regression
Through some algebra, we can re-arrange the logit equation to get:
$$
\ln\frac{\pi(\mathbf{x})}{1-\pi(\mathbf{x})} = \mathbf{x}'\mathbf{\beta} = \beta_0 + \beta_1x_1 + ... + \beta_px_p
$$
Here $\ln\frac{\pi(\mathbf{x})}{1-\pi(\mathbf{x})}$ is our **log-odds**. We can also exponentiate both sides to get our **odds**:
$$
\frac{\pi(\mathbf{x})}{1-\pi(\mathbf{x})} = \exp(\mathbf{x}'\mathbf{\beta}) = \exp(\beta_0 + \beta_1x_1 + ... + \beta_px_p)
$$
The odds are the ratio of the probability of something happening versus not happening. If $\pi=0.5$, both events are equally likely and the odds are $1$. If the probability is higher, then the odds will be greater than $1$; if it is lower, the odds are below $1$. For example, if $\pi=0.75$, then our ratio is $0.75/0.25 = 3$.
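A quick numerical check (probabilities chosen arbitrarily):
```{r Odds from probabilities}
p <- c(0.25, 0.5, 0.75)
p / (1 - p)  # odds of 1/3, 1, and 3
```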
Often we are interested in how a change in the explanatory variables impacts the value of $\pi$. Usually we do this by keeping all variables fixed and incrementing one of them. For illustration, assume we have a simple case with two groups: those who received a drug (indicated with $i=1$), and those who did not (indicated with $i=0$). We are interested in $\pi_i$, the probability of survival of each group.
Our logistic regression model is:
$$
\log(\frac{\pi_i}{1-\pi_i}) = \beta_0 + \beta_1x_{i1}, \text{ for groups } i \in \{0,1\}
$$
If someone did not receive the drug, then we have:
$$
\begin{aligned}
&\log(\frac{\pi_0}{1-\pi_0}) \\
&= \beta_0 + \beta_1(0) \\
&= \beta_0
\end{aligned}
$$
If they did, then we have:
$$
\log(\frac{\pi_1}{1-\pi_1}) = \beta_0 + \beta_1
$$
In this case we have:
* $\beta_0$, the log-odds of survival for someone who did not receive the drug
* $\beta_0 + \beta_1$, the log-odds of survival for someone who did receive the drug
We can compare these two and use some algebra of logarithms to get our log odds ratio:
$$
\begin{aligned}
& \log(\frac{\pi_1}{1-\pi_1}) - \log(\frac{\pi_0}{1-\pi_0}) \\
&=\log(\frac{\pi_1/(1-\pi_1)}{\pi_0/(1-\pi_0)}) \\
&= (\beta_0 + \beta_1) - \beta_0 \\
&= \beta_1 \\
\end{aligned}
$$
In this case, $\beta_1$ is the log-odds ratio of survival for those given the drug versus those not given the drug. To calculate the odds ratio, we use $\exp(\beta_1)$. **We are usually most interested in this type of analysis**.
Finally, we can re-arrange everything to get our **probability of success**:
$$
\pi(\mathbf{x}) = \frac{\exp(\beta_0 + \beta_1x_1 + ... + \beta_px_p)}{1 + \exp(\beta_0 + \beta_1x_1 + ... + \beta_px_p)}
$$
Unlike odds, this value gives us a direct interpretation of how likely something is to happen. This can be useful for putting a large odds ratio in context: for a rare event, even a large odds ratio may correspond to a small absolute probability.
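To illustrate the conversion from log-odds back to a probability (the coefficient values below are made up for illustration):
```{r From log-odds to probability}
log_odds <- -2 + 1.5 * 1  # hypothetical beta_0 = -2, beta_1 = 1.5, x_1 = 1
exp(log_odds) / (1 + exp(log_odds))  # probability of success
plogis(log_odds)                     # same value via R's built-in inverse logit
```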
For brevity, we will omit the derivations of various estimators. We use similar notation to linear regression, where $\hat{\beta}$ represents the estimate of $\beta$.
## The `glm` function
Logistic regression in R is done with the `glm` function. Its syntax and output are similar to the `lm` function, although we have to specify the family and link function. In this case, we will use the `binomial` family and `logit` link function. The typical setup is:
`model.name <- glm(RESPONSE ~ COVARIATES, family = binomial(link=logit), data = dataframename)`
As with linear regression, we can use the `summary` function, as well as `predict`.
### Example
Consider a toy example where we simulate two populations: one that is given a drug that increases the chance of survival, and another that is given a placebo and has a lower chance of survival.
```{r Toy example of logistic regression}
# Simulate a study where 50 people are given a drug and 50 people are given a placebo
# Those given the treatment have a 70% chance of survival
# Those given the placebo have a 20% chance of survival
# We randomly draw values of 1/0 from the binomial distribution
# Here, 1 indicates survival and 0 indicates death
set.seed(20230317)
pop1_size <- 50
pop2_size <- 50
pop_place <- rbinom(pop1_size,1,0.2)
pop_treat <- rbinom(pop2_size,1,0.7)
pop_sim <- data.frame(c(rep("placebo",pop1_size),rep("drug",pop2_size)),c(pop_place, pop_treat))
colnames(pop_sim) <- c("treatment","survived")
sim_glm <- glm(survived ~ treatment, family = binomial(link="logit"), data = pop_sim)
```
```{r Examine the results of our toy model}
summary(sim_glm)
#confint(sim_glm)
#exp(confint(sim_glm))
# The values of our betas
#beta_0 <- summary(sim_glm)$coefficients[1,1]
#beta_1 <- summary(sim_glm)$coefficients[2,1]
# Our odds of survival for the placebo:
#print(paste("Odds of survival for placebo",exp(beta_0 + beta_1)))
# Our odds of survival for the drug:
#print(paste("Odds of survival for treatment",exp(beta_0)))
# Find the probability of survival for the placebo:
#print(paste("Probability of survival for placebo",(exp(beta_0 + beta_1)/(1 + exp(beta_0 + beta_1)))))
# Find the probability of survival for the treatment:
#print(paste("Probability of survival for treatment",exp(beta_0)/(1 + exp(beta_0))))
# Test the model against the null (intercept-only) model
#anova(sim_glm, test = "Chisq")
#with(sim_glm, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
```
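As a brief supplement (not part of the original chunk), the odds ratios and fitted probabilities can also be pulled directly from the model object:
```{r Odds ratios and predicted probabilities}
exp(coef(sim_glm))  # exponentiated coefficients: baseline odds and the odds ratio
# Fitted probabilities of survival for each treatment group
predict(sim_glm, newdata = data.frame(treatment = c("drug", "placebo")),
        type = "response")
```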
There are a few things to note when comparing this output to that of linear regression:
* We have a z-statistic instead of a t-statistic like in linear regression
* We do not have an $R^2$ value or F-statistic, but instead null and residual deviances (used in the test sketched after this list)
* We also have an AIC
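The deviances can be used for a likelihood-ratio test against the null model; this mirrors the commented-out test in the chunk above:
```{r Likelihood-ratio test from deviances}
# The null deviance minus the residual deviance is asymptotically chi-squared
# with df.null - df.residual degrees of freedom under the null model
with(sim_glm, pchisq(null.deviance - deviance, df.null - df.residual,
                     lower.tail = FALSE))
```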
We can reformat these data into a contingency table:
```{r Format our simulation data in a contingency table}
pop_sim_cont <- data.frame(table(pop_sim)) # Aggregate our data into counts
# Get our marginal counts.
# We know we have 50 of each but it's good practice for coding
total_drug <- sum(pop_sim_cont[pop_sim_cont["treatment"]=="drug",]$Freq)
total_plac <- sum(pop_sim_cont[pop_sim_cont["treatment"]=="placebo",]$Freq)
# Drop our death observations and attach the counts we just got instead
pop_sim_cont <- subset(pop_sim_cont, survived==1)
pop_sim_cont$survived <- pop_sim_cont$Freq # Set the variable "survived" to a count
pop_sim_cont <- pop_sim_cont[,c(1,2)]
pop_sim_cont$patients <- with(pop_sim_cont,
                              ifelse(treatment=="drug", total_drug, total_plac))
# Add a variable of the response (the number of survived vs died)
pop_sim_cont$resp <- cbind(pop_sim_cont$survived, pop_sim_cont$patients - pop_sim_cont$survived)
```
Now that our data is formatted as a contingency table, we can run the same analysis:
```{r Run the GLM on the contingency table}
sim_glm_cont <- glm(resp ~ treatment, family = binomial(link = "logit"), data = pop_sim_cont)
summary(sim_glm_cont)
```
### Haberman data
We will use Haberman's survival data.^[Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.] It is a record of breast cancer patients who underwent surgery. The attributes are:
1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute)
| 1 = the patient survived 5 years or longer
| 2 = the patient died within 5 years
```{r Read in Haberman data}
haberman.data <- read.csv("haberman.data", header = FALSE)
colnames(haberman.data) <- c("age","surgery_year","nodes","survival_status")
# recode the survival status to 1 (lived 5+ years) and 0 (died within 5 years)
haberman.data$survival_status <- ifelse(haberman.data$survival_status==2,0,1)
```
The first thing we may be interested in is plotting our data.
```{r Plot Haberman data against variables}
plot(haberman.data$nodes, haberman.data$survival_status)
plot(haberman.data$age, haberman.data$nodes)
plot(haberman.data$age, haberman.data$survival_status)
```
#### Exercises:
Create a logistic model in R that models survival status against each of the variables in the data.
```{r}
#haberman.model1 <- glm(...,
# family=binomial(link = "logit"),
# data = haberman.data)
#summary(haberman.model1)
```
```{r}
#plot(haberman.data$nodes, haberman.data$survival_status, xlim = c(0,60))
#newdata <- data.frame(nodes=seq(min(haberman.data$nodes), max(haberman.data$nodes),len=500))
#newdata$pred = predict(haberman.model1, newdata, type="response")
#lines(pred ~ nodes, newdata)
```
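For reference, a minimal sketch of one possible answer, using `nodes` as the covariate to match the plotting code above (the exercise asks for each variable in turn):
```{r Possible solution sketch}
haberman.model1 <- glm(survival_status ~ nodes,
                       family = binomial(link = "logit"),
                       data = haberman.data)
summary(haberman.model1)
```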
### Exercises: Other datasets
The following data are available in the `Sleuth3` package, which you can install using `install.packages("Sleuth3")`:
* `Sleuth3::case2001`: In 1846 the Donner and Reed families left Springfield, Illinois, for California by covered wagon. In July, the Donner Party, as it became known, reached Fort Bridger, Wyoming. There its leaders decided to attempt a new and untested route to the Sacramento Valley. Having reached its full size of 87 people and 20 wagons, the party was delayed by a difficult crossing of the Wasatch Range and again in the crossing of the desert west of the Great Salt Lake. The group became stranded in the eastern Sierra Nevada mountains when the region was hit by heavy snows in late October. By the time the last survivor was rescued on April 21, 1847, 40 of the 87 members had died from famine and exposure to extreme cold.
* `Sleuth3::case2002`: A 1972–1981 health survey in The Hague, Netherlands, discovered an association between keeping pet birds and increased risk of lung cancer. To investigate birdkeeping as a risk factor, researchers conducted a case–control study of patients in 1985 at four hospitals in The Hague (population 450,000). They identified 49 cases of lung cancer among the patients who were registered with a general practice, who were age 65 or younger and who had resided in the city since 1965. They also selected 98 controls from a population of residents having the same general age structure.
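A minimal sketch for loading and inspecting these datasets (set the chunk to evaluate once the package is installed):
```{r Load the Sleuth3 datasets, eval=FALSE}
# install.packages("Sleuth3")  # run once
library(Sleuth3)
head(case2001)  # Donner Party survival data
head(case2002)  # birdkeeping and lung cancer case-control data
```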