---
title: "Lecture 1: Describing Data"
author: "Nick Huntington-Klein"
date: "`r Sys.Date()`"
output:
  revealjs::revealjs_presentation:
    theme: solarized
    transition: slide
    self_contained: true
    smart: true
    fig_caption: true
    reveal_options:
      slideNumber: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
library(tidyverse)
library(purrr)
library(patchwork)
library(ggpubr)
theme_set(theme_gray(base_size = 15))
```
## Causality
- This class is about causality
- Welcome!
- This class exists to answer one question: *how can we use statistics to figure out how $X$ causes $Y$?*
- It's a short question but an extremely difficult one
## This Class
- We'll be covering the purpose of statistical research and how it works
- Then the concepts underlying causality and research design
- And then some standard research designs for uncovering causality in observational data
## Housekeeping
- Let's go over the syllabus and projects!
- The textbook: The Effect
## This Week
- We'll be discussing how we *describe variables* and *describe relationships*
- With a bit of an R reminder
- We'll cover a bit of regression review, but...!
- This class is much more concerned with *design* than any particular estimator. *How is the data utilized?*
- Regression is *one way* of doing this stuff, but it's only one implementation, so we won't be focusing solely on it
## This Week
- We'll start with ways of discussing how we can *describe variables*
- And then move on to ways of discussing how we can *describe relationships*
- Secretly, pretty much all statistical analysis is just about doing one of those two things
- *Causal* analysis is *purely* about *knowing exactly which variables and relationships to describe*
## Variables
- A statistical variable is a recorded observation, repeated many times
- "Number of calories I ate for breakfast this morning" is one observation
- "The number of calories I ate each breakfast in the past week" is a variable with seven observations
## The Distribution of a Variable
- Variables have *distributions*
- The distribution of a variable is simply the description of *how often each value of the variable comes up*
- So for example, the statement "10% of people are left-handed" is just a partial description of the distribution of the handedness variable.
- If you observe a bunch of people and record what their dominant hand is, 10% of the time you'll write down "left-handed," 1% of the time you'll write down "ambidextrous," and 89% of the time you'll write down "right-handed." That's the full description of the distribution
## Looking Straight at a Distribution
- The distribution of a variable contains *everything we know* about that variable from empirical observation
- Any description we make will be a *summary* of that distribution
- So we may as well look at it directly!
## Distributions of Kinds of Variables
- There are two main kinds of variables for which the distributions look different: discrete and continuous
- Discrete variables take a finite set of values: left-handed, right-handed, ambidextrous. Or "lives in Seattle" vs. "Doesn't" or "Number of kids"
- Continuous variables take any value: income, height, kWh of electricity used each day
- (Sometimes, "ordinal" discrete variables with many values are treated as continuous for simplicity)
## Discrete Distributions
- To fully describe the distribution of a discrete variable, just give the proportion of time it takes each value. That's it!
- Give a table with the proportions (or counts), or show a graph with the proportions
```{r, echo = FALSE}
handedness_data <- data.frame(hand = c(rep('Left', 9*452),rep('Ambidextrous',.5*452),rep('Right',90.5*452)))
```
```{r, echo = TRUE}
library(tidyverse)
handedness_data %>%
pull(hand) %>%
table() %>%
prop.table()
```
## Discrete Distributions
```{r, echo = TRUE, eval = FALSE}
ggplot(handedness_data, aes(x = hand)) +
geom_bar(fill = 'white', color = 'black') + # These two lines are important
stat_count(geom = "text", size = 5,
aes(label = scales::percent(..count../nrow(handedness_data)),
y = ..count.. + 1300)) +
ggpubr::theme_pubr() +
labs(x = 'Handedness', y = 'Count') # The rest is just decoration
```
## Discrete Distributions
```{r, echo = FALSE, eval = TRUE}
ggplot(handedness_data, aes(x = hand)) +
geom_bar(fill = 'white', color = 'black') + # These two lines are important
stat_count(geom = "text", size = 5,
aes(label = scales::percent(..count../nrow(handedness_data)),
y = ..count.. + 1300)) +
ggpubr::theme_pubr() +
labs(x = 'Handedness', y = 'Count') # The rest is just decoration
```
## Using Discrete Distributions
- What can we use a discrete distribution to say?
- X% of observations are in category A
- (X+Y)% of observations are in category (A or B)
- If it's "ordinal" (the values have an order), we can describe the median, max, min, etc.
- There are also dispersion measures describing how evenly distributed the categories are but we won't be going into that
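## Using Discrete Distributions
- A minimal sketch of reading those statements off the handedness data from before:
```{r, echo = TRUE}
props <- handedness_data %>%
  pull(hand) %>%
  table() %>%
  prop.table()
props['Left'] # X% of observations are in category A
props['Left'] + props['Ambidextrous'] # (X+Y)% of observations are in (A or B)
```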
## Continuous Distributions
- Variables that are numeric in nature and take *many* values have a continuous distribution
- Their distributions can be presented in one of two main ways - using a *histogram* or using a *density distribution*
- A *histogram* splits the range of the data up into bins and then just treats it like an ordinal discrete distribution
- A *density distribution* uses a rolling average of the proportion of observations within each window
## Continuous Distributions
```{r, echo = FALSE}
data(Scorecard, package = 'pmdplyr')
```
```{r, echo = TRUE, eval = FALSE}
ggplot(Scorecard, aes(x = repay_rate)) +
geom_histogram(bins = 5, fill = 'white', color = 'black') +
ggpubr::theme_pubr() +
labs(x = 'Proportion of Grads on-track to Repay Loans', y = 'Count', title = 'Loan Repayment by College')
```
## Continuous Distributions
```{r, echo = FALSE, eval = TRUE}
ggplot(Scorecard, aes(x = repay_rate)) +
geom_histogram(bins = 5, fill = 'white', color = 'black') +
ggpubr::theme_pubr() +
labs(x = 'Proportion of Grads on-track to Repay Loans', y = 'Count', title = 'Loan Repayment by College')
```
## Continuous Distributions
```{r}
ggplot(Scorecard, aes(x = repay_rate)) +
geom_histogram(bins = 10, fill = 'white', color = 'black') +
ggpubr::theme_pubr() +
labs(x = 'Proportion of Grads on-track to Repay Loans', y = 'Count', title = 'Loan Repayment by College')
```
## Continuous Distributions
```{r}
ggplot(Scorecard, aes(x = repay_rate)) +
geom_histogram(bins = 20, fill = 'white', color = 'black') +
ggpubr::theme_pubr() +
labs(x = 'Proportion of Grads on-track to Repay Loans', y = 'Count', title = 'Loan Repayment by College')
```
## Continuous Distributions
```{r}
ggplot(Scorecard, aes(x = repay_rate)) +
geom_histogram(bins = 100, fill = 'white', color = 'black') +
ggpubr::theme_pubr() +
labs(x = 'Proportion of Grads on-track to Repay Loans', y = 'Count', title = 'Loan Repayment by College')
```
## Continuous Distributions
```{r}
ggplot(Scorecard, aes(x = repay_rate)) +
geom_density(color = 'black') +
ggpubr::theme_pubr() +
labs(x = 'Proportion of Grads on-track to Repay Loans', y = 'Density', title = 'Loan Repayment by College')
```
## Continuous Distributions
- We can describe these distributions fully using *percentiles*
- The Xth percentile is the value for which X% of the sample is less than that value
- Taken together, you can describe the entire sample by just going through percentiles
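## Continuous Distributions
- In R, `quantile()` reads percentiles straight off the sample. A quick sketch with the repayment-rate data:
```{r, echo = TRUE}
quantile(Scorecard$repay_rate, probs = c(.1, .25, .5, .75, .9), na.rm = TRUE)
```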
## Continuous Distributions
```{r, echo = FALSE}
ggplot(Scorecard, aes(x = repay_rate)) +
geom_density(color = 'black') +
geom_vline(aes(xintercept = quantile(repay_rate, .1, na.rm = TRUE))) +
annotate(geom = 'label', x = quantile(Scorecard$repay_rate, .1, na.rm = TRUE), y =.5, label = '10th percentile') +
annotate(geom = 'text', x = .25, y = .25, label = '10%') +
annotate(geom = 'text', x = .6, y = .25, label = '90%') +
ggpubr::theme_pubr() +
labs(x = 'Proportion of Grads on-track to Repay Loans', y = 'Density', title = 'Loan Repayment by College')
```
## Summarizing Continuous Data
- Commonly we want to describe these distributions much more compactly, while still telling us something about them
```{r, echo = TRUE}
library(vtable)
Scorecard %>%
select(repay_rate) %>%
sumtable()
```
## Summarizing Continuous Data
- Every "summary statistic" of a given variable is just a way of describing some aspect of these distributions
- Commonly we are focused on just a few important features of the distribution:
- The central tendency
- Dispersion
## The Central Tendency
- Central tendencies are ways of picking a single number that represents the variable best
- Often: the mean
- The median (50th percentile)
- For categorical data, sometimes the mode
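## The Central Tendency
- A minimal sketch of those calculations (R has no built-in mode function, so we read it off a table):
```{r, echo = TRUE}
mean(Scorecard$repay_rate, na.rm = TRUE)
median(Scorecard$repay_rate, na.rm = TRUE)
# The mode of the (categorical) handedness variable: its most common value
names(which.max(table(handedness_data$hand)))
```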
## The Central Tendency
- The median is good at being representative of *a typical observation*, and is not sensitive to outliers
- The mean can be better thought of as a betting average. If you "bet the mean" and drew an infinite number of observations, you'd break even
- If Jeff Bezos walks in the room, mean income shoots through the roof (because if you're randomly drawing people in the room, sometimes you're Jeff Bezos!), but the median largely remains unchanged (because Jeff Bezos isn't anywhere near being a typical person)
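## The Central Tendency
- A quick sketch of that outlier sensitivity, with made-up incomes:
```{r, echo = TRUE}
incomes <- c(40000, 52000, 61000, 45000, 58000)
c(mean = mean(incomes), median = median(incomes))
# One enormous outlier walks into the room
incomes_with_outlier <- c(incomes, 100000000000)
c(mean = mean(incomes_with_outlier), median = median(incomes_with_outlier))
```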
## The Central Tendency
- So why use the mean at all? It makes sense to think about those betting odds if you are, say, trying to predict something
- It also has a bunch of nice statistical properties
- Meaning, we *understand the mean* fairly well, and *we know how the mean changes as we go from sample to sample*
- In other words, it's handy when we're trying to learn about the *theoretical distribution* our data comes from (more on that in a bit!)
## Dispersion
- Measures of dispersion tell us how *spread out* the data is
- Some of these are percentile-based measures, like the inter-quartile range (75th percentile minus 25th) or the range (Max - Min, or 100th percentile minus 0th)
- Most commonly we will use standard deviation and variance
```{r, echo = FALSE}
Scorecard %>%
select(repay_rate) %>%
sumtable(out = 'kable')
```
## Dispersion
- Variance is *average squared deviation from the mean*.
- Take each observation, subtract the mean, square the result, and take the mean of *that* (times $n/(n-1)$ )
- Standard deviation is the square root of the variance
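## Dispersion
- A minimal sketch checking that definition against R's built-in `var()` and `sd()`:
```{r, echo = TRUE}
x <- Scorecard$repay_rate[!is.na(Scorecard$repay_rate)]
n <- length(x)
# Average squared deviation from the mean, with the n/(n-1) correction
by_hand <- mean((x - mean(x))^2) * n / (n - 1)
c(by_hand = by_hand, built_in = var(x), sd = sqrt(by_hand))
```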
## Dispersion
```{r}
names <- c('Very Low Variation','A Little More Variation','Even More Variation','Lots of Variation')
set.seed(1000)
dfs <- c(.2,.25,.5,.75) %>%
map(function(x)
data.frame(x = rnorm(100,sd=x)))
p <- 1:4 %>%
map(function(x)
ggplot(dfs[[x]],aes(x=x))+
geom_density()+
scale_x_continuous(limits=c(min(dfs[[4]]$x),max(dfs[[4]]$x)))+
scale_y_continuous(limits=c(0,
max(density(dfs[[1]]$x)$y)))+
labs(x='Observed Value',y='Density',title=names[x])+
theme_pubr()+
theme(text = element_text(size = 13)))
(p[[1]] + p[[2]]) / (p[[3]] + p[[4]])
```
## Skew
- One other aspect of a distribution we sometimes consider is *skew*
- Skew is sort of "how much the distribution *leans to one side or the other*"
- This can be a problem if the skew is extreme
- Extreme right skew can make means highly unrepresentative as a few big observations pull the mean upwards
- This can sometimes be helped by taking a log of the data
## Skew
```{r}
set.seed(500)
df <- data.frame(expy = exp(rnorm(10000)+1))
p1 <- ggplot(df, aes(x = expy)) + geom_density() +
geom_vline(aes(xintercept = mean(df$expy)), linetype = 'dashed') +
annotate(x = mean(df$expy), y = .25, geom = 'label', label = 'Mean') +
labs(x = 'Value',y = 'Density', title = 'Skewed Data') +
theme_pubr() +
theme(text = element_text(size = 13),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
p2 <- ggplot(df, aes(x = log(expy))) + geom_density() +
geom_vline(aes(xintercept = mean(log(df$expy))), linetype = 'dashed') +
annotate(x = mean(log(df$expy)), y = .25, geom = 'label', label = 'Mean') +
labs(x = 'Value',y = 'Density', title = 'Logged Skewed Data') +
theme_pubr() +
theme(text = element_text(size = 13),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
p1 + p2
```
## Theoretical Distributions
- Now for the good stuff!
- We *rarely actually care what our data is* or what its distribution looks like!
- What we actually care about is *what broader inferences we can draw from the data we see!*
- The mean of your variable is just *the mean of the observations you happened to sample*
- But what can we learn about *how that variable works overall* from that?
## Theoretical Distributions
- There is a "population distribution" that we can't see - it's theoretical - we just get a sample
- If we had infinite data, we'd see the theoretical distribution
- To learn about that theoretical distribution, we take what we know about *sampling variation* and use it to rule out certain theoretical distributions as unlikely
## Theoretical Distributions
- For example, if we flip a coin 1000 times and get heads 999 times, can we rule out that it's a fair coin?
- (the "theoretical distribution" here is a discrete one: the coin is heads 50% of the time and tails 50%)
- We *assume that the coin is fair* (null/prior hypothesis) and see how unlikely the data is. If the coin is fair, we take what we know about sampling variation for a binary variable and calculate that 999/1000 heads has a `dbinom(999, 1000, .5)` chance of happening
- So that's probably not the real theoretical distribution!
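## Theoretical Distributions
- That calculation in R (a minimal sketch; `dbinom()` gives the chance of that exact outcome, `pbinom()` the chance of 999 heads *or more*):
```{r, echo = TRUE}
dbinom(999, size = 1000, prob = .5) # exactly 999 heads in 1000 fair flips
pbinom(998, size = 1000, prob = .5, lower.tail = FALSE) # 999 or more heads
```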
## Theoretical Distributions
Reminders:
- All we've shown is that *a particular theoretical distribution is unlikely*, not *anything else*
- We don't know what the proper theoretical distribution *is*
- We haven't shown that our result is *important*
- We have effectively calculated a p-value here - if it's low enough, we say "Statistically significant" but please don't get fooled into thinking that means anything other than what we've said here - a particular theoretical distribution is statistically unlikely to have generated this data
## Sampling Variation
- Often when trying to generalize from a sample to a theoretical distribution we will focus on the mean
- This is because the sampling variation of the mean is very well-understood. It follows a normal distribution with a mean at the population mean, and a standard deviation of the population standard deviation, scaled by $1/\sqrt{n}$
## Sampling Variation
$$ \bar{X} = \hat{\mu} \sim N(\mu, \sigma/\sqrt{n}) $$
- Latin letters ( $X, n$ ) are data
- Modified Latin letters ( $\bar{X}$ ) are calculations made with data
- Greek letters ( $\mu$ ) are population values/"the truth"
- Modified Greek letters ( $\hat{\mu}$ ) are *our estimate of the truth*
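## Sampling Variation
- A quick simulation sketch of that result, with a true distribution we pick ourselves (it doesn't even need to be normal):
```{r, echo = TRUE}
set.seed(123)
# 1000 samples of n = 100 from a uniform distribution with mean .5 and sd 1/sqrt(12)
sample_means <- replicate(1000, mean(runif(100)))
c(mean = mean(sample_means), sd = sd(sample_means), theory_sd = (1/sqrt(12))/sqrt(100))
```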
## Sampling Variation
- With `r sum(!is.na(Scorecard$repay_rate))` observations, the average of that repayment rate is `r scales::number(mean(Scorecard$repay_rate, na.rm = TRUE), accuracy = .001)`, standard deviation is `r scales::number(sd(Scorecard$repay_rate, na.rm = TRUE), accuracy = .001)`
- Does the average college have a repayment rate of 50\%?
- If it does, then the mean of a sample of `r sum(!is.na(Scorecard$repay_rate))` observations should follow a distribution of $N(.5, \sigma/\sqrt{20890})$
- Estimate $\hat{\sigma}$ using sample standard deviation $s$ (with a correction)
## Sampling Variation
- So if the true distribution has a mean of 50% (whatever kind of distribution it is, as long as it has a mean - we don't need to assume the distribution is normal, the sampling variation of the mean will be normal anyway), then $\bar{X} \sim N(.5, .0013)$
- The probability of getting $\bar{X} =$ `r scales::number(mean(Scorecard$repay_rate, na.rm = TRUE), accuracy = .001)` or something even higher is `1-pnorm(.575, .5, .0013)` = `r 1-pnorm(.575, .5, .0013)`. It rounds to 0 here but it's just an extremely small number. This is a *one-tailed p-value*
- The probability of getting $\bar{X} =$ `r scales::number(mean(Scorecard$repay_rate, na.rm = TRUE), accuracy = .001)` or something *equally far away or farther from .5 in either direction* is `2*(1-pnorm(.575, .5, .0013))`, a *two-tailed p-value*
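## Sampling Variation
- The same idea in a few lines of R (a sketch; `t.test()` uses the $t$ distribution rather than the normal, which makes almost no difference with this many observations):
```{r, echo = TRUE}
rate <- Scorecard$repay_rate[!is.na(Scorecard$repay_rate)]
z <- (mean(rate) - .5) / (sd(rate) / sqrt(length(rate)))
2 * (1 - pnorm(abs(z))) # two-tailed p-value by hand
t.test(rate, mu = .5)$p.value # or let R do it
```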
## Sampling Variation
- Another, possibly better way to think about it is what range of theoretical distributions we *wouldn't* find unlikely
- We have to first define "unlikely" for this, often with a p-value cutoff
- Then, a confidence interval around the actual sample mean tells us which theoretical means would not find the existing data "too unlikely"
- $C.I. = [\bar{X} \pm z_\alpha\hat{\sigma}/\sqrt{n}]$, where $z_\alpha$ is based on our "too unlikely" definition. For a 95% confidence interval ("too unlikely" is a two-tailed p-value below .05), it's $z_{.05} \approx 1.96$
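## Sampling Variation
- A minimal sketch of that confidence interval for the repayment rate:
```{r, echo = TRUE}
rate <- Scorecard$repay_rate[!is.na(Scorecard$repay_rate)]
se <- sd(rate) / sqrt(length(rate))
mean(rate) + c(-1, 1) * qnorm(.975) * se # 95% CI by hand
t.test(rate)$conf.int # t.test's version, using the t distribution
```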
## Statistical Inference
- By using *what we know about sampling variation*, we can *make inferences about a variable's theoretical distribution* (i.e., what mean that distribution is likely to have)
- In this way we can use what we have - data - to learn about what we actually care about - population distributions
- We have to leverage what we know, and what we have, to make that theoretical inference that we really care about
- This will echo very strongly once we start talking about causality!
## Next Time
- Not just single variables, but relationships!
- What are those *population relationships?*
- That's the real juicy stuff