-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathFisherVsStouffer.Rmd
209 lines (163 loc) · 11 KB
/
FisherVsStouffer.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
---
title: "Fisher vs Stouffer"
author: "Timothy Daley"
date: "11/29/2020"
output:
html_document:
fig_height: 7
fig_width: 7
toc: yes
code_folding: hide
toc_float: yes
---
The most common way to coalesce $p$-values is Fisher's method (https://en.wikipedia.org/wiki/Fisher%27s_method). The basic idea is to log transform your $p$-values, then $-2$ times the sum is chi-square distributed. Specifically, for $p$-values $p_{1}, \ldots, p_{n}$, Fisher's method calculates the combined test statistic as
$$
t_{\text{Fisher}} = -2 \sum_{i = 1}^{n} \log p_{i}.
$$
Under the null that the $p$-values are independent uniformly distributed, the test statistic $t_{\text{Fisher}}$ will be $\chi^{2}_{2n}$ distributed. The common wisdom is that Fisher's method tend to values low p-values, instead of consistently low $p$-values.
An alternative to Fisher's method is Stouffer's method. The idea is to transform the $p$-values to $z$-scores, then compute a combined $z$-score by averaging the individual $z$-scores. E.g. if $\Phi(\cdot)$ is the standard normal cdf, then the test statistic is
$$
t_{\text{Stouffer}} = \frac{1}{\sqrt{k}} \sum_{i = 1}^{n} \Phi^{-1} (p_{i}).
$$
Under the null this is distributed as a negative standard normal variable (negative because $\Phi^{-1}$ of small $p$-values will be less than zero). The prevailing wisdom is that this method values consistently small $p$-values. To ensure that effects have the same sign, we can define
$$
t_{\text{Stouffer}} = \frac{1}{\sqrt{k}} \sum_{i = 1}^{n} \text{sign}(x_{i}) \Phi^{-1} (p_{i})
$$
as the test statistic so that effect sizes ($x_{i}$) of the same sign are valued. This has a null distribution of standard normal.
I first learned about Stouffer's method through a [paper by Art Owen](https://projecteuclid.org/download/pdfview_1/euclid.aos/1256303530) which discusses different meta-analysis techniques. In particular, he says "In the Fisher test, if the first
$m−1$ $p$-values already yield a test statistic exceeding the $\chi^2_{2m}$ significance threshold, then the $m$th test statistic cannot undo it. The Stouffer test is different. Any
large but finite value of $\sum_{j = 1}^{m-1} \Phi^{-1}(p_{j})$ can be canceled by an opposing value of $\Phi^{-1}(p_{m})$." Thus indicating that Stouffer's method will tend to pick up more small but consistent effects, and will be more robust in the presence of rare outliers. On the other hand, because Fisher's method can be significant due to just one very small $p$-value then it is likely not as robust to outliers. To test this let's compare the methods in two situations: small-tailed null and long-tailed null, where the latter will simulate the presence of outlier effects. I expect that Fisher's method will have higher false positive in the presence of long tails.
# Simulation
The simulation I'll run is the standard 2-groups model (e.g. see [Efron 2008](https://projecteuclid.org/download/pdfview_1/euclid.ss/1215441276)). For aggregation purposes, we'll have 5 effect sizes (I'm leaving effect size undefined on purpose) to merge. In the motivating application, the individual effect sizes and $p$-values arise from differential expression analysis of guide RNAs, with aggregation done at the gene level (as guide RNAs target specific genes). Consistent effects are not guaranteed because of variable guide efficiency (e.g. see [my previous paper on this topic](https://link.springer.com/article/10.1186/s13059-018-1538-6)).
```{r setup}
library(reticulate)
# you need to edit RETICULATE_PYTHON in .Renviron
# for some reason I need to make a venv for each project
Sys.setenv(RETICULATE_PYTHON = "/Users/tim.daley/blog/FisherVsStouffer/venv/bin/python3")
use_python('/Users/tim.daley/blog/FisherVsStouffer/venv/bin/python3')
#reticulate::py_config()
```
```{python}
import numpy as np
# set seed
np.random.seed(12345)
n_genes = 20000
# 2% of genes are non-null
p = 0.05
gene_labels = np.random.binomial(n = 1, p = p, size = (n_genes, ))
print("number of genes: ", n_genes)
print("number of non-null genes: ", sum(gene_labels))
```
## Small tail
For the small-tailed case, let's assume that null gene-effects are equal to $0$ and the non-null gene effects are distributed $\mathcal{N}(3, 1)$.
```{python}
import seaborn
gene_effects = gene_labels*np.random.normal(loc = 3, scale = 1, size = (n_genes, ))
seaborn.histplot(gene_effects, color = 'black')
```
You can see the long tail on the right is the gene effects. Now let's assume 5 observations (guide RNAs) per gene. These are gonna be normally distributed with mean equal to gene effect size, and standard deviation 1.
```{python}
import itertools
import pandas as pd
lst = range(n_genes)
guide2gene_map = list(itertools.chain.from_iterable(itertools.repeat(x, 5) for x in lst))
n_guides = 5*n_genes
assert(len(guide2gene_map) == n_guides)
guide_effects = pd.DataFrame({'guide2gene': guide2gene_map})
guide_effects['guide_effect'] = guide_effects['guide2gene'].map(lambda i: np.random.normal(loc = gene_effects[i], scale = 1))
guide_effects['label'] = guide_effects['guide2gene'].map(lambda i: gene_labels[i])
seaborn.histplot(x = 'guide_effect', hue = 'label', data = guide_effects).set_title('guide effect distribution')
```
Now let's look at combining $p$-values with Fisher vs Stouffer.
```{python}
from scipy.stats import norm, chi2
from math import sqrt
def stouffer_pval(z):
stouffer_zval = sum(z)/sqrt(len(z))
stouffer_pval = norm.cdf(-stouffer_zval)
return stouffer_pval
def fisher_pval(z):
# 1-sided p-val
pvals = norm.cdf(-z)
fisher_val = -2*sum(np.log(pvals))
fisher_pval = 1.0 - chi2.cdf(fisher_val, df = 2*len(z))
return fisher_pval
gene_pvals = pd.DataFrame({'gene': range(n_genes),
'gene_label': gene_labels})
gene_pvals['fisher'] = gene_pvals['gene'].map(lambda i: fisher_pval(guide_effects[guide_effects['guide2gene'] == gene_pvals['gene'][i]]['guide_effect']))
gene_pvals['stouffer'] = gene_pvals['gene'].map(lambda i: stouffer_pval(guide_effects[guide_effects['guide2gene'] == gene_pvals['gene'][i]]['guide_effect']))
gene_pvals.head()
```
```{python}
import matplotlib.pyplot as plt
seaborn.histplot(gene_pvals['fisher'], bins = 30, color = 'black', label = 'Fisher p-vals')
plt.show()
seaborn.histplot(gene_pvals['stouffer'], bins = 30, color = 'black', label = 'Stouffer p-vals')
plt.show()
```
Now let's compute Benjamini-Hochberg FDRs and compare them to the empirical FDR.
```{python}
from statsmodels.stats.multitest import fdrcorrection
_, gene_pvals['fisher_fdr'] = fdrcorrection(gene_pvals['fisher'], method = 'indep')
_, gene_pvals['stouffer_fdr'] = fdrcorrection(gene_pvals['stouffer'], method = 'indep')
gene_pvals.head()
```
Now let's compute the empirical FDRs.
```{python}
def compute_empirical_fdr(labels, fdrs, fdr_thresh):
mask = (fdrs < fdr_thresh)
R = np.sum(mask)
if R > 0:
return np.sum(1 - labels[mask])/float(R)
else:
return 0
empirical_fdr = pd.DataFrame({'fdr_thresh': np.linspace(0, 1, num = 1000)})
empirical_fdr['fisher'] = empirical_fdr['fdr_thresh'].map(lambda x: compute_empirical_fdr(gene_pvals['gene_label'], gene_pvals['fisher_fdr'], x))
empirical_fdr['stouffer'] = empirical_fdr['fdr_thresh'].map(lambda x: compute_empirical_fdr(gene_pvals['gene_label'], gene_pvals['stouffer_fdr'], x))
empirical_fdr.head()
```
```{python}
empirical_fdr.tail()
```
```{python}
df = pd.DataFrame({'estimated_fdr': empirical_fdr['fdr_thresh'].tolist() + empirical_fdr['fdr_thresh'].tolist(),
'empirical_fdr': empirical_fdr['fisher'].tolist() + empirical_fdr['stouffer'].tolist(),
'condition': ['fisher']*empirical_fdr.shape[0] + ['stouffer']*empirical_fdr.shape[0]})
plt.plot([0, 1], [0, 1], '--', color = 'black')
seaborn.lineplot(x = 'estimated_fdr', y = 'empirical_fdr', hue = 'condition', data = df)
plt.show()
```
## Long tail
Now what happens if the null is a long-tailed distribution? We'll use a t-distribution with a finite variance as the long tailed, and we'll normalize it so that the standard deviation is equal to 1.
```{python}
import pandas as pd
from math import sqrt
long_tailed_guide_effects = pd.DataFrame({'guide2gene': guide2gene_map})
deg_fred = 5
sd = sqrt(deg_fred/float(deg_fred - 2))
# divide by sd to normalize
long_tailed_guide_effects['guide_effect'] = long_tailed_guide_effects['guide2gene'].map(lambda i: gene_effects[i] + np.random.standard_t(df = deg_fred)/sd)
long_tailed_guide_effects['label'] = long_tailed_guide_effects['guide2gene'].map(lambda i: gene_labels[i])
seaborn.histplot(x = 'guide_effect', hue = 'label', data = long_tailed_guide_effects)
plt.show()
```
```{python}
long_tailed_gene_pvals = pd.DataFrame({'gene': range(n_genes),
'gene_label': gene_labels})
long_tailed_gene_pvals['fisher'] = long_tailed_gene_pvals['gene'].map(lambda i: fisher_pval(long_tailed_guide_effects[long_tailed_guide_effects['guide2gene'] == long_tailed_gene_pvals['gene'][i]]['guide_effect']))
long_tailed_gene_pvals['stouffer'] = long_tailed_gene_pvals['gene'].map(lambda i: stouffer_pval(long_tailed_guide_effects[long_tailed_guide_effects['guide2gene'] == long_tailed_gene_pvals['gene'][i]]['guide_effect']))
_, long_tailed_gene_pvals['fisher_fdr'] = fdrcorrection(long_tailed_gene_pvals['fisher'], method = 'indep')
_, long_tailed_gene_pvals['stouffer_fdr'] = fdrcorrection(long_tailed_gene_pvals['stouffer'], method = 'indep')
long_tailed_empirical_fdr = pd.DataFrame({'fdr_thresh': np.linspace(0, 1, num = 1000)})
long_tailed_empirical_fdr['fisher'] = long_tailed_empirical_fdr['fdr_thresh'].map(lambda x: compute_empirical_fdr(long_tailed_gene_pvals['gene_label'], long_tailed_gene_pvals['fisher_fdr'], x))
long_tailed_empirical_fdr['stouffer'] = long_tailed_empirical_fdr['fdr_thresh'].map(lambda x: compute_empirical_fdr(long_tailed_gene_pvals['gene_label'], long_tailed_gene_pvals['stouffer_fdr'], x))
df = pd.DataFrame({'estimated_fdr': long_tailed_empirical_fdr['fdr_thresh'].tolist() + long_tailed_empirical_fdr['fdr_thresh'].tolist(),
'empirical_fdr': long_tailed_empirical_fdr['fisher'].tolist() + long_tailed_empirical_fdr['stouffer'].tolist(),
'condition': ['fisher']*long_tailed_empirical_fdr.shape[0] + ['stouffer']*long_tailed_empirical_fdr.shape[0]})
```
```{python}
plt.plot([0, 1], [0, 1], '--', color = 'black')
seaborn.lineplot(x = 'estimated_fdr', y = 'empirical_fdr', hue = 'condition', data = df)
plt.show()
```
We see here that Fisher's method has a higher false positive rate, particularly in the lower region where we usually set the false discovery rate. This inidicates that Fisher's method is much more susceptible to false positives in the presence of long tails. If we correctly specify the null model, then this will be less of an issue, but we saw above that both methods perform similarly when the model is correctly specified. Therefore, this simple experiment indicates that there can be significant benefits to using Stouffer's method in place of Fisher's method, particularly when long-tailed or off-target effects can be present in the null distribution and are difficult to accurately model.