-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path04-graphs-models.Rmd
206 lines (138 loc) · 6.59 KB
/
04-graphs-models.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
```{r 04_setup, include=FALSE}
knitr::opts_chunk$set(echo=TRUE, eval=FALSE, fig.align="center")
```
# (PART) Causal Graphs {-}
# Causal Graphs as Statistical Models
## Learning Goals {-}
- Apply the Causal Markov assumption to express the joint distribution of data.
- Simulate data from causal graphs under linear and logistic regression structural equation models.
- Formulate use cases of simulation to understand causal inference concepts
<br>
Slides from today are available [here](https://docs.google.com/presentation/d/1wfAzgLke1QJUHZsTomJ3nosAqhsvXUZTXLRqNOFlbeM/edit?usp=sharing).
<br><br><br>
## Simulating data in R {-}
**You can download a template RMarkdown file to start from [here](template_rmds/04-graphs-models.Rmd).**
### Why simulate? {-}
Simulation is a very powerful tool for understanding statistical ideas. We can simulate (generate) data where we know the true underlying distribution, and we can then see how statistical methods perform on this data. For example:
- Do 95% confidence intervals really contain the true population value in 95% of samples? When is this not true?
- Does regression work to estimate causal effects when we have conditional exchangeability?
<br>
### Simulation functions in R {-}
R has functions to work with several probability distributions. For example, the following 4 functions work with the normal distribution:
- `rnorm()`: Generate a **r**andom number from a normal distribution
- `pnorm()`: Tail **p**robability from a normal distribution
- `dnorm()`: Get **d**ensity value from a normal distribution
- `qnorm()`: Get a **q**uantile from a normal distribution
There are `r`, `p`, `d`, and `q` functions for other distributions too (e.g., `runif()`, `rbinom()`.) We'll primarily use `r` functions to generate random numbers.
**`rbinom()`**:
```{r}
# 4 different people each flip a fair coin once
rbinom(4, size = 1, prob = 0.5)
# 4 different people flip unfair coins
# First 2 flip a coin with P(Heads) = 0.9
# Second 2 flip a coin with P(Heads) = 0.2
rbinom(4, size = 1, prob = c(0.9, 0.9, 0.2, 0.2))
# Write a command with rbinom() so that the result will definitively be 1,0,1
```
**`rnorm()`**:
```{r}
# 5 numbers from a normal distribution with mean 10 and SD 2
rnorm(5, mean = 10, sd = 2)
# 3 numbers from 3 different normal distributions
# Means = 10, 100, 1000. SD = 1
rnorm(3, mean = c(10, 100, 1000), sd = 1)
# Write a command with rnorm() so that the result will definitively be 10,20
```
<br>
### Simulating data from graphs+SEMs {-}
In this course, we'll simulate data that come from graphs and their corresponding structural equation models. We'll step through how to simulate data that come from this causal graph.
```{r 04_dag_example, fig.width=6, fig.height=2, fig.align="center", echo=FALSE, eval=TRUE}
par(mar = rep(0,4))
plot(1, type = "n", xaxt = "n", yaxt = "n", bty = "n", xlab = "", ylab = "", xlim = c(0,6), ylim = c(1,4))
text(c("A", "Z", "Y"), x = c(1,3,5), y = c(1,4,1), cex = 1.1)
arrows(x0 = c(3,3), y0 = c(4,4)-0.2, x1 = c(1,5), y1 = c(1,1)+0.2, angle = 25, lwd = 4)
arrows(x0 = 1.7, y0 = 1, x1 = 4.5, y1 = 1, angle = 25, lwd = 4)
```
**Question:** Using the Causal Markov Assumption, how can we express the joint distribution of this data?
<br>
Consider the structural equation model (SEM) below that is associated with our causal graph:
$$
\begin{align*}
Z &\sim N(\mu_Z = 40, \sigma_Z = 5) \\
A &\sim \mathrm{Binomial}(n = 1, p_A) \\
\qquad &\log\left(\frac{p_A}{1-p_A}\right) = -1 + 0.05Z \\
Y &\sim N(\mu_Y, \sigma_Y = 1) \\
\mu_Y &= 30 + 5A + 2Z
\end{align*}
$$
- $Z$ is an exogenous variable that follows a normal distribution with mean 40 and standard deviation 5.
- $A$ is binary (binomial/Bernoulli) with probability of success $p_A$
- $p_A$ depends on $Z$ via a logistic regression model.
- $Y$ is quantitative and follows a normal distribution.
- The mean $\mu_Y$ depends on $A$ and $Z$ via a linear regression model.
<br>
Complete the code below to simulate data that follow this structural equation model.
```{r}
set.seed(451) # For reproducibility
# Sample size
n <- 10000
# Simulate the Z variable
Z <- rnorm(?)
# Simulate the A variable
log_odds_A <- -1 + 0.05*Z
odds_A <- exp(log_odds_A)
p_A <- odds_A/(1+odds_A)
A <- rbinom(?)
# Simulate the Y variable
mean_Y <- 30 + ?
noise_Y <- rnorm(n, mean = 0, sd = 1)
Y <- mean_Y + noise_Y
# Store all simulated variables in a dataset called sim_data
sim_data <- data.frame(Z, A, Y)
```
After generating simulated data, it's best to use plots and summary measures to make sure that marginal distributions and relationships between variables look sensible.
```{r}
# Explorations to check sensibility of simulated data
```
<br><br><br><br>
## Exercises {-}
You'll need to install the `dagitty` package before getting started.
### Exercise 1 {-}
a. Write the joint distribution of the data from the causal graph below using the Causal Markov Assumption.
```{r 04_ex1, echo=FALSE, eval=TRUE, fig.height=4, fig.align="center"}
library(dagitty)
dag <- dagitty("dag {
bb=\"0,0,1,1\"
A [exposure,pos=\"0.311,0.400\"]
W [pos=\"0.541,0.200\"]
Y [outcome,pos=\"0.640,0.400\"]
Z [pos=\"0.398,0.200\"]
A -> Y
W -> A
W -> Y
Z -> A
Z -> Y
}
")
plot(dag)
```
b. Simulate data from the graph. You are free to choose your own structural equation model, but let $Z$, $W$, and $Y$ be quantitative. Let $A$ be binary. Use a sample size of 10,000 for your simulation. Use plots and summary measures to ensure that the simulated data have reasonable properties.
c. Based on the structural equation model you chose, what would you guess is the average causal effect $E[Y^{a=1} - Y^{a=0}]$?
d. Let's consider *simulating interventions* using your SEM. If you were to force all study units to receive a certain value of treatment (either 0 or 1), how do you think the causal graph would change?
e. Modify your simulation to reflect your thoughts from part d. Incorporate the code below into your simulation.
```{r}
# Scenario 1: Force all to be treated
A_force_1 <- rep(1, n) # Repeat the number 1 n times
# Scenario 2: Force all to be treated
A_force_0 <- rep(0, n) # Repeat the number 0 n times
# The treated potential outcome Y^{a=1}
Y1 <- ? # something with A_force_1
# The untreated potential outcome Y^{a=0}
Y0 <- ? # something with A_force_0
# Estimate the average causal effect
mean(?)
```
f. Was your intuition from part b close to the ACE that you estimated in the simulation?
<br><br>
### Extra {-}
Try organizing the simulation code from today into general-purpose functions that can be used in subsequent simulations.