---
jupyter:
  jupytext:
    formats: ipynb,Rmd
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
      jupytext_version: 1.4.2
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
---
# Statistical analysis
## Introduction
Statistical analysis usually encompasses three activities in a data science workflow: (a) descriptive analysis, (b) hypothesis testing, and (c) statistical modeling. Descriptive analysis is a description of the data, which includes computing summary statistics and drawing plots. Hypothesis testing asks whether two (or more) groups differ from each other on some metric. Modeling fits a curve or other functional form to the data to describe the relationships among the variables in a data set.
In terms of Python packages that can address these three tasks:
| Task | Packages |
| :---------------------- | ---------------------- |
| Descriptive statistics | pandas, numpy, matplotlib, seaborn |
| Hypothesis testing | scipy, statsmodels |
| Modeling | statsmodels, lifelines, scikit-learn |
## Descriptive statistics
Descriptive statistics that are often computed include the mean, median, standard deviation, inter-quartile range, pairwise correlations, and the like. Most of these functions are available in `numpy`, and hence in `pandas`. We have already seen how to compute these statistics, including grouped versions. For example, we can compute grouped summaries using the diamonds dataset.
```{python 04-python-stat-1}
import numpy as np
import scipy as sc
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```
```{python 04-python-stat-2}
diamonds = pd.read_csv('data/diamonds.csv.gz')
```
```{python 04-python-stat-3}
diamonds.groupby('color')['price'].agg([np.mean, np.median, np.std])
```
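The other quantities mentioned above, the inter-quartile range and pairwise correlations, are just as direct in `pandas`; here is a minimal sketch on the numeric columns of `diamonds` (this chunk is an added illustration):
```{python}
num = diamonds.select_dtypes('number')            # keep only the numeric columns
iqr = num.quantile(0.75) - num.quantile(0.25)     # inter-quartile range, column by column
print(iqr)
num.corr()                                        # pairwise Pearson correlations
```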
There were other examples along these lines yesterday. Refer to both the `python_tools_ds` and `python_pandas` documents.
## Simulation and inference
Hypothesis testing is one of the areas where statistics is most often used. Functions for many of the standard statistical tests are available in `scipy` and `statsmodels`. However, I'm going to take a little detour first to build some understanding of hypothesis tests using the powerful simulation capabilities of Python. We'll visit the built-in functions available in `scipy` and `statsmodels` as well.
### Simulation and hypothesis testing
**Question:** You have a coin and you flip it 100 times. You get 54 heads. How likely is it that you have a fair coin?
We can simulate this random process using Python. The number of heads in a series of coin tosses can be modeled using the [**binomial** distribution](https://en.wikipedia.org/wiki/Binomial_distribution). We can repeat this experiment many, many times on our computer under the assumption that the coin is fair, and then see how likely what we observed is under that assumption.
> Simulation under reasonable assumptions is a great way to understand our data and the underlying data-generating processes. In the modern era, it has most famously been used by Nate Silver of FiveThirtyEight to simulate national elections in the US. There are many examples in engineering where simulations are run to understand a technology and figure out its tolerances and weaknesses, as in aircraft testing. Simulation is also commonly used in epidemic modeling to understand how an epidemic would spread under different conditions.
```{python 04-python-stat-4}
rng = np.random.RandomState(205) # Seed the random number generator
x = rng.binomial(100, 0.5, 100000) # Simulate 100,000 experiments of tossing a fair coin 100 times
sns.distplot(x, kde=True, rug=False)
plt.axvline(54, color = 'r'); # What we observed
plt.xlabel('Number of heads');
```
```{python 04-python-stat-5}
pd.Series(x).describe() # We convert to pd.Series to take advantage of the `describe` function.
```
The histogram and the summary above show the pattern of results we would expect if we repeated this random experiment many times. We can already make some observations. First, the average number of heads is 50, which is consistent with a fair coin. Second, we could reasonably get as few as 27 heads or as many as 72 heads even with a fair coin. In fact, we can look at the range of values we would expect to see 95% of the time.
```{python 04-python-stat-6}
np.quantile(x, [0.025, 0.975])
```
This says that 95% of the time we'll see between 40 and 60 heads. (This is **not** a confidence interval; it is the actual spread of results from a simulation study. A confidence interval would be computed from a **single** experiment, assuming a binomial distribution; we'll return to interval estimates later.)
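As a quick preview of that, here is what such a model-based interval looks like for a single experiment yielding 54 heads in 100 tosses; a minimal sketch using `proportion_confint` from `statsmodels` (a helper not used in the original workflow):
```{python}
from statsmodels.stats.proportion import proportion_confint

# Normal-approximation 95% confidence interval for the probability of heads,
# based on a single observed experiment of 54 heads in 100 tosses
proportion_confint(54, 100, alpha=0.05, method='normal')
```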
So how likely are we to see 54 heads in 100 tosses, assuming a fair coin? This can be computed as the proportion of simulated experiments that gave a result at least as extreme as the one we observed.
```{python 04-python-stat-7}
np.mean(x > 54) # convince yourself of this
```
This is what would be considered the *p-value* for the test that the coin is fair.
> The p-value of a statistical hypothesis test is the probability that we would see an outcome at least as extreme as the one we observed, under the assumption that the null hypothesis (H<sub>0</sub>) we chose is actually true.
>
> In our case, that null hypothesis is that the coin we're tossing is fair. The p-value **only** gives evidence against the null hypothesis; it does **not** give evidence for it. In other words, if the p-value is small (smaller than some threshold we deem reasonable), we can claim evidence against the null hypothesis, but if the p-value is large, we cannot conclude that the null hypothesis is true.
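The same p-value can also be obtained from a classical exact binomial test; here is a sketch, assuming a reasonably recent `scipy` (1.7 or later) where `binomtest` is available. Note that it counts outcomes of 54 heads *or more*, so it will be a little larger than the strict-inequality proportion computed above.
```{python}
from scipy.stats import binomtest

# Exact one-sided test of H0: p = 0.5 against the alternative p > 0.5
binomtest(54, 100, p=0.5, alternative='greater').pvalue
```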
What happens if we increase the number of tosses and look at the proportion of heads? Suppose we again observe 54% heads.
```{python 04-python-stat-8}
rng = np.random.RandomState(205)
x = rng.binomial(10000, 0.5, 100000)/10000
sns.distplot(x)
plt.axvline(0.54, color = 'r')
plt.xlabel('Proportion of heads');
```
```{python 04-python-stat-9}
pd.Series(x).describe()
```
Well, that changed the game significantly. If we increase the number of coin tosses per experiment to 10,000, a 100-fold increase, we no longer see very much variation in the proportion of tosses that are heads.
>This is expected behavior because of a statistical theorem called the *Law of Large Numbers*, which essentially says that as you run larger and larger random experiments with the same experimental setup, your estimate of the true population parameter (in this case the true chance of getting a head, or 0.5 for a fair coin) becomes more and more precise.
Now we see that for a fair coin, we should reasonably expect between 47.8% and 52% of tosses to be heads. This is quite an improvement on the 27%-72% range we saw with 100 tosses.
We can compute our p-value in the same way as before.
```{python 04-python-stat-10}
np.mean(x > 0.54)
```
So we would essentially never see 54% heads if we tossed a fair coin 10,000 times. Now, with the larger experiment, we would **reject** our null hypothesis H<sub>0</sub> that we have a fair coin.
So the same observed proportion, backed by more data, changes our *inference*: we go from not having sufficient evidence to say the coin is unfair to saying quite definitively that it is. This is a direct consequence of the increased precision of our estimates, and hence our ability to distinguish much smaller differences from the truth.
Let's look a bit more at what's going on here. Suppose the coin's true probability of coming up heads is really 0.55, a small bias towards heads.
> Food for thought: Is the difference between 0.50 and 0.54 worth worrying about? It probably depends.
We're going to compare what we would reasonably see over many repeated experiments given the coin has a 0.50 (fair) and a 0.55 (slightly biased) chance of a head. First, we'll do experiments of 100 tosses of a coin.
```{python 04-python-stat-11}
rng = np.random.RandomState(205)
x11 = rng.binomial(100, 0.5, 100000)/100 # Getting proportion of heads
x12 = rng.binomial(100, 0.55, 100000)/100
sns.distplot(x11, label = 'Fair')
sns.distplot(x12, label = 'Biased')
plt.xlabel('Proportion of heads')
plt.legend();
```
We see that there is a great deal of overlap in the potential outcomes over 100,000 repetitions of these experiments, so we have a lot of uncertainty about which model (fair or biased) is the truth.
Now, if we up our experiment to 10,000 tosses of each coin, and again repeat the experiment 100,000 times,
```{python 04-python-stat-12}
rng = np.random.RandomState(205)
x21 = rng.binomial(10000, 0.5, 100000)/10000
x22 = rng.binomial(10000, 0.55, 100000)/10000
sns.distplot(x21, label = 'Fair')
sns.distplot(x22, label = 'Biased')
plt.xlabel('Proportion of heads')
plt.legend();
```
We now find almost no overlap between the potential outcomes, so we can very easily distinguish the two models. This is part of what gathering more data (number of tosses) buys you.
We typically measure this ability to distinguish between two models using *statistical power*: the probability, under the **alternative** model (in this case, the biased coin), of seeing an observation at least as extreme as the one we observed. We can calculate the statistical power quite easily for the two sets of simulated experiments. Remember, we observed 54% heads in the one instance of each experiment that we actually ran. By simulating, we're "playing God" and seeing what could have happened; in practice we only do the experiment once (how many clinical trials of an expensive drug would you really want to run?).
```{python 04-python-stat-13}
pval1 = np.mean(x11 > 0.54)
pval2 = np.mean(x21 > 0.54)
power1 = np.mean(x12 > 0.54)
power2 = np.mean(x22 > 0.54)
print('The p-value when n=100 is ', np.round(pval1, 2))
print('The p-value when n=10,000 is ', np.round(pval2, 2))
print('Statistical power when n=100 is ', np.round(power1, 2))
print('Statistical power when n=10,000 is ', np.round(power2, 2))
```
So as *n* goes up, the p-value for the same observed outcome goes down and the statistical power goes up. This is a general rule as sample size increases, as the sketch below illustrates.
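A quick way to make that rule concrete is to repeat the calculation at a few sample sizes. The loop below is an added illustration using the same fair (0.50) versus biased (0.55) setup and the observed 54% heads:
```{python}
rng = np.random.RandomState(205)
for n in [100, 1000, 10000]:
    fair = rng.binomial(n, 0.50, 100000) / n      # proportion of heads under the null (fair coin)
    biased = rng.binomial(n, 0.55, 100000) / n    # proportion of heads under the alternative
    print(n, 'p-value:', np.round(np.mean(fair > 0.54), 3),
          'power:', np.round(np.mean(biased > 0.54), 3))
```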
This idea can be used to design a two-armed experiment. Suppose we are looking at the difference in the proportion of mice that gain weight between wild-type (WT) mice and a knockout (KO) variant. Since mice are expensive, let's limit the number of mice in each arm to 10. We expect 30% of the wild-type mice to gain weight, and expect a higher proportion of the knockouts to do so. This is again the setup for a binomial experiment, with the number of "coin tosses" being 10 in each arm. We're going to run two sets of experiments, one for the WT and one for the KO, look at the difference in the proportions gaining weight ('heads') between them, and repeat this 100,000 times.
```{python 04-python-stat-14}
rng = np.random.RandomState(304)
N = 10
weight_gain_wt0 = rng.binomial(N, 0.3, 100000)/N # Get proportion
weight_gain_ko0 = rng.binomial(N, 0.3, 100000)/N # Assume first (null hypothesis) that there is no difference
diff_weight_gain0 = weight_gain_ko0 - weight_gain_wt0
sns.distplot(diff_weight_gain0, kde=False); # Since we only have 10 mice each, this histogram is not very smooth.
# No matter!
```
We usually design the actual test by choosing a cutoff for the difference in proportions and stating that we will reject the null hypothesis if our observed difference exceeds that cutoff. We choose the cutoff so that its p-value equals some pre-determined error rate, typically 0.05 or 5% (this threshold is not golden or set in stone; we'll discuss it later). Let's find that cutoff from this simulation. It corresponds to the 95th percentile of the simulated distribution.
```{python 04-python-stat-15}
np.round(np.quantile(diff_weight_gain0, 0.95), 2)
```
This means that at least 5% of the values will be 0.3 or bigger. In fact, this proportion is
```{python 04-python-stat-16}
np.mean(diff_weight_gain0 > 0.3)
```
So we'll take 0.3 as the cutoff for our test (it's fine that the Type 1 error is a bit more than 0.05; if we took the next largest value in the simulation, we would dip below 0.05). We're basically done specifying the testing rule.
What we (and reviewers) want to know at this point is: for what true difference would this test have 80% power? The thinking is that if the true difference were some *p* > 0 rather than 0 (the null hypothesis), we would want to reject the null hypothesis, i.e., see an observed difference greater than 0.3, at least 80% of the time, and we want to find out how big that *p* has to be. In other words, how large does the true difference in proportions have to be before we can be reasonably confident, at our chosen sample size, of statistically distinguishing the two groups, i.e., of having fairly small overlap between the two competing distributions?
We can also do this using simulation: keep the WT group at 0.3, increase the KO group's proportion gradually, simulate the distribution of the difference in proportions, and see at what point we reach a statistical power of about 80%. Recall, we've already determined that our test will reject H<sub>0</sub> when the observed difference is greater than 0.3.
```{python 04-python-stat-17}
p1 = np.linspace(0.3, 0.9, 100)
power = np.zeros(len(p1))
for i, p in enumerate(p1):
    weight_gain_wt1 = rng.binomial(N, 0.3, 100000)/N
    weight_gain_ko1 = rng.binomial(N, p, 100000)/N
    diff_weight_gain1 = weight_gain_ko1 - weight_gain_wt1
    power[i] = np.mean(diff_weight_gain1 > 0.3)
```
```{python 04-python-stat-18}
sns.lineplot(p1, power)
plt.axhline(0.8, color = 'black', linestyle = '--');
plt.ylabel('Statistical power')
plt.xlabel('Proportion in KO mice');
```
```{python 04-python-stat-19}
np.round(p1[np.argmin(np.abs(power - 0.8))] - 0.3, 2) # Find the location in the p1 array where power is closest to 0.8
```
So to get to 80% power, we would need the true difference in proportions to be about 0.48, i.e., at least 78% of KO mice would have to gain weight on average. This is quite a big difference, and it's probably not very interesting scientifically to look for such a big difference, since it's quite unlikely.
If we could afford 100 mice per arm, what would this look like?
```{python 04-python-stat-20}
rng = np.random.RandomState(304)
N = 100
weight_gain_wt0 = rng.binomial(N, 0.3, 100000)/N # Get proportion
weight_gain_ko0 = rng.binomial(N, 0.3, 100000)/N # Assume first (null hypothesis) that there is no difference
diff_weight_gain0 = weight_gain_ko0 - weight_gain_wt0
cutoff = np.quantile(diff_weight_gain0, 0.95)
p1 = np.linspace(0.3, 0.9, 100)
power = np.zeros(len(p1))
for i, p in enumerate(p1):
    weight_gain_wt1 = rng.binomial(N, 0.3, 100000)/N
    weight_gain_ko1 = rng.binomial(N, p, 100000)/N
    diff_weight_gain1 = weight_gain_ko1 - weight_gain_wt1
    power[i] = np.mean(diff_weight_gain1 > cutoff)
sns.lineplot(p1, power)
plt.axhline(0.8, color = 'black', linestyle = '--');
plt.ylabel('Statistical power')
plt.xlabel('Proportion in KO mice');
```
```{python 04-python-stat-21}
np.round(p1[np.argmin(np.abs(power - 0.8))] - 0.3, 2)
```
The minimum detectable difference for 80% power is now down to about 0.17, so the KO mice would in truth need to gain weight 47% of the time, compared to 30% for the WT mice. This is a much more reasonable scientific question to ask.
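For comparison, a closed-form (normal-approximation) version of this power calculation is available in `statsmodels`. The sketch below is not the simulation-based rule above; it checks the power of the standard two-proportion test for 0.30 versus 0.47 with 100 mice per arm, which should land reasonably close to the simulated 80%:
```{python}
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Cohen's h effect size for proportions 0.47 vs 0.30, then power for n = 100 per arm
es = proportion_effectsize(0.47, 0.30)
NormalIndPower().solve_power(effect_size=es, nobs1=100, alpha=0.05, alternative='larger')
```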
### A permutation test
A permutation test is a two-group test that asks whether two groups differ with respect to some metric. Let's use a breast cancer proteomics experiment to illustrate it. The data contain expression values for over 12,000 proteins, along with clinical data. We can ask, for example, whether a particular protein's expression differs by ER status.
```{python 04-python-stat-22}
brca = pd.read_csv('data/brca.csv')
brca.head()
```
We will first do the classical t-test, which is available in the `scipy` package.
```{python 04-python-stat-23}
import scipy as sc
import statsmodels as sm
test_probe = 'NP_001193600'
tst = sc.stats.ttest_ind(brca[brca['ER Status']=='Positive'][test_probe], # Need [] since names have spaces
brca[brca['ER Status']=='Negative'][test_probe],
nan_policy = 'omit')
np.round(tst.pvalue, 3)
```
The idea behind a permutation test is that, if there is truly no difference, then it shouldn't matter if we shuffle the ER status labels across the study individuals. That's literally what we will do: shuffle the labels many times and compute the difference in mean expression each time. This forms the null distribution under our assumption of no difference by ER status. We then see where our observed difference falls, which lets us compute a p-value.
The difference between the simulations we did earlier and a permutation test is that the permutation test is based only on the observed data. No particular model is assumed and no new data are simulated; all we do is shuffle the labels among the subjects while keeping their actual data intact.
```{python 04-python-stat-24}
nsim = 10000
rng = np.random.RandomState(294)
x = np.where(brca['ER Status']=='Positive', -1, 1)
y = brca[test_probe].to_numpy()
obs_diff = np.nanmean(y[x==1]) - np.nanmean(y[x==-1])
diffs = np.zeros(nsim)
for i in range(nsim):
    x1 = rng.permutation(x)
    diffs[i] = np.nanmean(y[x1==1]) - np.nanmean(y[x1 == -1])
```
```{python 04-python-stat-25}
sns.distplot(diffs)
plt.axvline(x = obs_diff, color ='r');
```
```{python 04-python-stat-26}
pval = np.mean(np.abs(diffs) > np.abs(obs_diff))
f"P-value from permutation test is {pval}"
```
This is pretty close to what we got from the t-test.
### Testing many proteins
We could do the permutation test for all the proteins using array operations in `numpy`.
```{python 04-python-stat-27}
expr_names = [u for u in list(brca.columns) if u.find('NP') > -1] # Find all column names with NP
exprs = brca[expr_names] # Extract the protein data
```
```{python 04-python-stat-28}
x = np.where(brca['ER Status']=='Positive', -1, 1)
obs_diffs = exprs[x==1].mean(axis=0)-exprs[x==-1].mean(axis=0)
```
```{python 04-python-stat-29}
nsim = 1000
diffs = np.zeros((nsim, exprs.shape[1]))
for i in range(nsim):
    x1 = rng.permutation(x)
    diffs[i,:] = exprs[x1==1].mean(axis=0) - exprs[x1==-1].mean(axis=0)
```
```{python 04-python-stat-30}
pvals = np.zeros(exprs.shape[1])
len(pvals)
```
```{python 04-python-stat-31}
for i in range(len(pvals)):
    pvals[i] = np.mean(diffs[:,i] > obs_diffs.iloc[i])
sns.distplot(pvals);
```
This plot shows that there are probably some proteins that are differentially expressed between ER+ and ER- patients. (If no proteins showed any difference, this histogram would be flat, since the p-values would be uniformly distributed.) The ideas behind Gene Set Enrichment Analysis (GSEA) can also be applied here.
```{python 04-python-stat-32}
exprs_shortlist = [u for i, u in enumerate(list(exprs.columns)) if pvals[i] < 0.0001 ]
len(exprs_shortlist)
```
This means that, if we used a p-value cutoff of 0.0001 for screening, we would select 486 of the 12,395 proteins for further study.
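With over 12,000 simultaneous tests, a multiple-testing correction is usually applied before shortlisting. A minimal sketch using `multipletests` from `statsmodels` on the permutation p-values computed above (an addition to the original workflow; note that permutation p-values of exactly 0 are an artifact of the finite number of shuffles):
```{python}
from statsmodels.stats.multitest import multipletests

# Benjamini-Hochberg false discovery rate control at 5%
reject, pvals_adj, _, _ = multipletests(pvals, alpha=0.05, method='fdr_bh')
reject.sum()   # number of proteins surviving the FDR threshold
```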
### Getting a confidence interval using the bootstrap
We can use simulation to obtain a model-free confidence interval for a parameter of interest based on our observed data. The technique we will demonstrate is called the bootstrap. The idea is that if we sample with replacement from our observed data to get another data set of the same size, compute our statistic of interest, and repeat this process many times, then the distribution of the statistic obtained this way will be very similar to the true sampling distribution of the statistic we would see if we could "play God". This has strong theoretical foundations from work done by several researchers in the 1980s and 1990s.
1. Choose the number of simulations `nsim`
1. For each iteration (1, ..., nsim):
    - Simulate a dataset with replacement from the original data
    - Compute and store the statistic
1. Compute the 2.5th and 97.5th percentiles of the distribution of the statistic. This is your confidence interval.
Let's see this in action. Suppose we tossed a coin 100 times. We're going to find a confidence interval for the proportion of heads from this coin.
```{python 04-python-stat-33}
rng = np.random.RandomState(304)
x = rng.binomial(1, 0.7, 100)
x
```
This gives the sequence of heads (1) and tails (0), assuming the true probability of heads is 0.7.
We now create 100,000 bootstrap samples from it.
```{python 04-python-stat-34}
nsim = 100000
boots = np.random.choice(x, (len(x), nsim), replace = True) # sample from the data
boot_estimates = boots.mean(axis = 0) # compute mean of each sample, i.e proportion of heads
sns.distplot(boot_estimates);
```
```{python 04-python-stat-35}
np.quantile(boot_estimates, (0.025, 0.975)) # Find 2.5 and 97.5-th percentiles
```
So our 95% bootstrap confidence interval is (0.66, 0.83). Our true value of 0.7 certainly falls in it.
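For comparison, here is the usual normal-approximation (Wald) interval computed directly from the single observed sample; a small added sketch:
```{python}
phat = x.mean()                              # observed proportion of heads
se = np.sqrt(phat * (1 - phat) / len(x))     # standard error under the binomial model
np.round([phat - 1.96 * se, phat + 1.96 * se], 3)
```
The two intervals should be similar here, which is typical when the assumed model is reasonable; the bootstrap earns its keep when no convenient model-based formula exists.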
## Classical hypothesis testing
Python has the tools to do classic hypothesis testing as well. Several functions are available in the `scipy.stats` module. The commonly used tests that are available are as follows:
| Function | Test |
| :----------------- | ------------------------------------------------------- |
| `ttest_1samp` | One-sample t-test |
| `ttest_ind` | Two-sample t-test |
| `ttest_rel` | Paired t-test |
| `wilcoxon` | Wilcoxon signed-rank test (nonparametric paired t-test) |
| `ranksums` | Wilcoxon rank-sum test (nonparametric 2-sample t-test) |
| `chi2_contingency` | Chi-square test for independence |
| `fisher_exact` | Fisher's exact test on a 2x2 contingency table |
| `f_oneway` | One-way ANOVA |
| `pearsonr` | Testing for correlation |
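As a quick illustration of this interface, here is a sketch of the Wilcoxon rank-sum test applied to the same protein and grouping used in the permutation test (it assumes the `brca` data frame and `test_probe` from the earlier chunks are still in memory):
```{python}
from scipy import stats

pos = brca.loc[brca['ER Status'] == 'Positive', test_probe].dropna()
neg = brca.loc[brca['ER Status'] == 'Negative', test_probe].dropna()
stats.ranksums(pos, neg)
```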
There are also several tests in `statsmodels.stats`
| Functions | Tests |
| :------------------ | ------------------------------------- |
| `proportions_ztest` | Test for difference in proportions |
| `mcnemar` | McNemar's test |
| `sign_test` | Sign test |
| `multipletests` | p-value correction for multiple tests |
| `fdrcorrection` | p-value correction by FDR |
We've already seen examples of some of these kinds of tests earlier, such as the two-sample t-test on the proteomics data.
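As one more illustration, here is a sketch applying the one-sample proportion z-test to the coin-toss example from earlier (54 heads in 100 tosses, tested against a fair coin); this chunk is an addition to the original notes:
```{python}
from statsmodels.stats.proportion import proportions_ztest

# One-sample z-test of the observed proportion 54/100 against the null value 0.5
stat, pval = proportions_ztest(54, 100, value=0.5, alternative='larger')
np.round(pval, 3)
```
This should be broadly in line with the simulation-based answer we computed before.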
## Regression analysis
The regression modeling frameworks in Python are mainly in `statsmodels`, though some can also be found in `scikit-learn`, which we will see tomorrow. We will use the diamonds dataset for demonstration purposes and attempt to model diamond price against several of the other diamond characteristics.
```{python 04-python-stat-36}
import statsmodels.api as sm
import statsmodels.formula.api as smf # Use the formula interface to statsmodels
mod1 = smf.glm('price ~ carat + clarity + depth + cut + color', data = diamonds).fit()
```
```{python 04-python-stat-37}
mod1.summary()
```
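Before unpacking the syntax, note that the full summary is verbose; if you only need pieces of it, the fitted result object exposes them directly. A small added sketch:
```{python}
print(mod1.params.head())      # estimated coefficients
print(mod1.conf_int().head())  # 95% confidence intervals for the coefficients
print(np.round(mod1.aic, 1))   # AIC, a measure of model fit
```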
This is the basic syntax for modeling in statsmodels. Let's go through and parse it.
One thing you'll notice is that we've written a formula inside the model call
```
mod1 = smf.glm('price ~ carat + clarity + depth + cut + color', data = diamonds).fit()
```
This is based on another Python package, `patsy`, which allows us to specify the model this way. The formula reads as "price depends on carat, clarity, depth, cut and color". Underneath, a lot is going on:
1. color, clarity and cut are all categorical variables. They need to be expanded into dummy variables, one column for each category level, which is 1 when the diamond is of that level and 0 otherwise.
1. An intercept term is added.
1. The dummy variables are concatenated to the continuous variables.
1. The model is fit.
We can see the dummy variables using `pandas`.
```{python 04-python-stat-38}
pd.get_dummies(diamonds)
```
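Note that `pd.get_dummies` keeps every category level, whereas the design matrix that `patsy` builds for the model drops one reference level per categorical variable and adds an intercept column. Here is a sketch of how to inspect that design matrix directly (it assumes `patsy` is importable, which it is whenever the formula interface of `statsmodels` works):
```{python}
from patsy import dmatrices

y, X = dmatrices('price ~ carat + clarity + depth + cut + color',
                 data=diamonds, return_type='dataframe')
X.head()
```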