NGS & statistics

Abbreviations

Abbreviation	Meaning
NGS	Next Generation Sequencing
DEG	Differentially expressed gene(s)
padj	Adjusted p-value
FDR	False Discovery Rate

Symbols

Symbol	Meaning
$m$	Number of tests in a multiple-testing schema (e.g. number of genes in differential expression analysis)
$P$	p-value
$E$	e-value

Multiple testing corrections

The problem with multiple tests

The multiple testing problem arises when a given statistical test is applied to a large number of cases. For example, in differential expression analysis, each gene or transcript is tested for equality between two conditions. A single analysis thus typically involves several tens of thousands of tests.

The general problem of multiple testing is that the risk of a false positive indicated by the nominal p-value is incurred separately for each tested element, so the expected number of false positives grows with the number of tests.
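To make the scale of the problem concrete, here is a minimal simulation sketch. It assumes that every gene is under the null hypothesis and that p-values under the null are uniformly distributed between 0 and 1 (which holds for continuous test statistics). The number of genes (20000), the threshold (0.05) and the function name `count_false_positives` are illustrative choices, not values taken from a real analysis.

```python
import random


def count_false_positives(m, alpha, seed=0):
    """Simulate m tests under the null hypothesis and count how many
    p-values fall below alpha (each such case is a false positive)."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    # Assumption: under the null, p-values are uniform on [0, 1)
    p_values = [rng.random() for _ in range(m)]
    return sum(p < alpha for p in p_values)


m = 20000     # hypothetical number of genes
alpha = 0.05  # nominal p-value threshold
n_fp = count_false_positives(m, alpha)
print(f"expected false positives: {alpha * m:.0f}")
print(f"false positives in this simulation: {n_fp}")
```

Even though no gene is regulated in this simulation, about a thousand genes pass the nominal threshold, which is why an uncorrected threshold cannot be used for the whole gene set.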

P-value and derived multiple testing corrections

P-value (nominal p-value)

The nominal p-value is the p-value attached to one particular element in a series of multiple tests. For example, in differential analysis, one nominal p-value is computed for each gene. This p-value indicates the probability of obtaining, under the null hypothesis (i.e. in the absence of regulation), an effect at least as extreme as the one observed.

Bonferroni correction

E-value

The e-value indicates the number of false positives expected by chance, for a given threshold of p-value.

$E = <FP> = P \cdot m$

Where $m$ is the number of tests (e.g. genes), $FP$ the number of false positives, the notation $< >$ denotes the random expectation, and $P$ is the nominal p-value of the considered gene.

Note that the e-value is a non-negative number ranging from $0$ to $m$ (the number of tests). It is thus not a p-value, since probabilities are by definition comprised between 0 and 1.
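The e-value formula can be sketched in a few lines. The function name `e_value` and the value of $m$ (20000 genes) are hypothetical and only serve the illustration.

```python
def e_value(p, m):
    """Expected number of false positives when the nominal p-value p
    is used as threshold across m tests: E = P * m."""
    return p * m


m = 20000  # hypothetical number of tests (genes)
for p in (1e-6, 1e-4, 0.05, 1.0):
    # The e-value ranges from 0 to m, so it can exceed 1
    print(f"P = {p:g}  ->  E = {e_value(p, m):g}")
```

With these numbers, a nominal p-value of 0.05 corresponds to an e-value of 1000, which illustrates why the e-value cannot be read as a probability.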

Family-wise error rate (FWER)

The Family-Wise Error Rate (FWER) indicates the probability of observing at least one false positive among the multiple tests.

$FWER = P(FP \geq 1)$
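As an illustration, the FWER has a simple closed form under one additional assumption that the text above does not make: that the $m$ tests are independent and all under the null hypothesis. In that case the probability of no false positive at threshold $\alpha$ is $(1 - \alpha)^m$, so $FWER = 1 - (1 - \alpha)^m$. The sketch below uses this formula; the function name `fwer_independent` is hypothetical.

```python
def fwer_independent(alpha, m):
    """Probability of at least one false positive among m tests, each
    run at threshold alpha.

    Assumption: the m tests are independent and all under the null."""
    return 1.0 - (1.0 - alpha) ** m


alpha = 0.05
for m in (1, 20, 100, 20000):
    print(f"m = {m:>5}  ->  FWER = {fwer_independent(alpha, m):.4f}")
```

Under this assumption the FWER already exceeds 0.6 with 20 tests, and it is practically 1 for the tens of thousands of tests of a differential expression analysis.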

False Discovery Rate (FDR)

The False Discovery Rate (FDR) indicates the expected proportion of false positives among the cases declared positive. For example, if a differential analysis reports 200 differentially expressed genes with an FDR threshold of 0.05, we should expect about $0.05 \cdot 200=10$ false positives among them.

What is an adjusted p-value?

An adjusted p-value is a statistic derived from the nominal p-value in order to correct for the effects of multiple testing.

Various types of corrections for multiple testing have been defined (Bonferroni, e-value, FWER, FDR). Note that some of these corrections are not actual "adjusted p-values":

The original Bonferroni correction consists in adapting the $\alpha$ threshold rather than correcting the p-value.
The e-value is a number that can exceed 1; it is thus not a probability, and hence not a p-value.
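The distinction between adapting the threshold and adjusting the p-value can be sketched as follows. The first function divides $\alpha$ by the number of tests, as in the original Bonferroni correction. The second one is the commonly used "Bonferroni-adjusted p-value", i.e. the e-value capped at 1. The function names and the example p-values are hypothetical.

```python
def bonferroni_threshold(alpha, m):
    """Original Bonferroni correction: lower the threshold to alpha / m."""
    return alpha / m


def bonferroni_adjust(p, m):
    """Bonferroni-adjusted p-value: P * m, capped at 1 so that the
    result stays within the range of a probability."""
    return min(1.0, p * m)


m = 20000     # hypothetical number of tests (genes)
alpha = 0.05
p_values = [1e-7, 2e-6, 1e-4, 0.03]  # made-up nominal p-values

threshold = bonferroni_threshold(alpha, m)
by_threshold = [p <= threshold for p in p_values]
by_adjusted = [bonferroni_adjust(p, m) <= alpha for p in p_values]
print(by_threshold)
print(by_adjusted)
```

Both approaches declare the same genes significant here (the first two); they differ only in whether the correction is applied to the threshold or to the p-value.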

The most usual correction is the FDR, which can be estimated in various ways.
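As one example of such an estimate, here is a minimal sketch of the Benjamini-Hochberg procedure, a widely used way to derive FDR-adjusted p-values. This is only one of the possible methods, chosen here for illustration; the function name and the example p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the same order
    as the input list."""
    m = len(p_values)
    # Indices of the p-values sorted by increasing value
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value (rank m) down to the smallest (rank 1),
    # scaling each p-value by m / rank and keeping the running minimum so
    # that adjusted values never decrease with increasing rank
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
padj = benjamini_hochberg(p_values)
significant = [p <= 0.05 for p in padj]
for p, q, s in zip(p_values, padj, significant):
    print(f"P = {p:<6g} padj = {q:.4f} significant = {s}")
```

With these made-up values, five nominal p-values are below 0.05, but only the first two remain significant once the adjusted p-value is compared to the same threshold.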
