
Using the model for a general outlier detection

samsongourevitch edited this page Jun 13, 2024 · 7 revisions

Introduction

Now that we have adopted the model $Y(II)^m_{obs} = Y(II)^m_{true} + \delta_b + \epsilon^m$,

with $\epsilon^m$ a Gaussian vector, we need to understand the structure of $\epsilon$ a bit better in order to determine which mutants behave significantly differently from the WT, and which differences can be attributed to the error modeled by $\epsilon$.
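To make the model concrete, here is a minimal simulation sketch in Python. The trajectory shape, the AR(1)-style covariance structure, and all parameter values are illustrative assumptions, not taken from the real data:

```python
import numpy as np

rng = np.random.default_rng(0)
n_timepoints = 20   # hypothetical number of time points
n_replicates = 30   # hypothetical number of replicates

# Illustrative "true" Y(II) trajectory (a decaying curve, for the sketch only)
y_true = 0.6 * np.exp(-np.linspace(0.0, 1.0, n_timepoints))

# Correlated noise epsilon: an AR(1)-style covariance encodes the idea that a
# trajectory starting above average tends to stay above average
rho, sigma = 0.8, 0.02
idx = np.arange(n_timepoints)
Sigma = sigma**2 * rho ** np.abs(idx[:, None] - idx[None, :])

# Batch offset delta_b (one scalar per replicate) plus correlated noise
delta_b = rng.normal(0.0, 0.01, size=(n_replicates, 1))
eps = rng.multivariate_normal(np.zeros(n_timepoints), Sigma, size=n_replicates)

# Observed trajectories: Y_obs = Y_true + delta_b + epsilon
Y_obs = y_true + delta_b + eps   # shape (n_replicates, n_timepoints)
```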

Estimating $\epsilon$

We can visualize $\epsilon$ by plotting $Y(II)^{WT}_{norm}$:

*(figures: plots of the normalized WT trajectories $Y(II)^{WT}_{norm}$)*

These plots show that $\epsilon$ cannot be accurately modeled (as is often done) by independent Gaussian errors with the same variance, i.e. with a covariance matrix $\Sigma = \sigma^2 I_n$, where $n$ is the number of time points.

Indeed, the trajectories show clear patterns: although $\epsilon$ is centered around 0, its components are correlated with each other, i.e. if a WT trajectory starts higher than the average, it is likely to stay higher.

We can visualize the covariance matrix by estimating it with the empirical estimator, which yields the following matrices:

*(figures: empirical covariance matrices estimated from the WT trajectories)*
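This estimation step can be sketched with NumPy as follows; the array `Y_wt_norm` stands in for the normalized WT trajectories, and its contents and dimensions are simulated assumptions:

```python
import numpy as np

# Stand-in for the normalized WT trajectories: each row is one replicate's
# realisation of epsilon, each column one time point (simulated here)
rng = np.random.default_rng(1)
n_timepoints = 10
Y_wt_norm = rng.multivariate_normal(
    np.zeros(n_timepoints), 0.01 * np.eye(n_timepoints), size=50
)

# Empirical covariance estimator across replicates (rows = observations)
Sigma_hat = np.cov(Y_wt_norm, rowvar=False)   # shape (n_timepoints, n_timepoints)
```

The off-diagonal entries of `Sigma_hat` are what quantify the between-time-point correlations visible in the trajectory plots.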

If we model $\epsilon$ with a diagonal covariance matrix, meaning that we neglect the correlations between different time points, and try to reduce the data accordingly, we get the following figure:

*(figure: data reduced with the diagonal covariance matrix)*

This figure still looks quite different from a standard $n$-dimensional Gaussian vector.

However, when we take the whole estimated covariance matrix into account, we get:

*(figure: data reduced with the full estimated covariance matrix)*

We see that the reduced data now looks relatively similar to a standard Gaussian vector.

Designing the test

We now assume that $\epsilon$ follows $N(0_n, \Sigma)$. Let $C$ be a matrix square root of $\Sigma$, meaning that $C \cdot C^t = \Sigma$; such a $C$ exists because $\Sigma$ is symmetric and positive definite. By classical results about Gaussian vectors, $C^{-1} \epsilon$ follows $N(0_n, I_n)$.
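This whitening step can be sketched as follows. The Cholesky factorization is one convenient choice of matrix square root; the AR(1)-style `Sigma` below is an illustrative assumption standing in for the estimated covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10

# Hypothetical correlated covariance matrix (AR(1) structure, for illustration)
idx = np.arange(n)
Sigma = 0.8 ** np.abs(idx[:, None] - idx[None, :])

# Cholesky factor C satisfies C @ C.T == Sigma
# (valid because Sigma is symmetric positive definite)
C = np.linalg.cholesky(Sigma)

# Draw correlated samples eps ~ N(0, Sigma), then whiten: rows become C^{-1} eps
eps = rng.multivariate_normal(np.zeros(n), Sigma, size=5000)
white = np.linalg.solve(C, eps.T).T

# The empirical covariance of the whitened samples is close to the identity,
# matching the claim that C^{-1} eps follows N(0_n, I_n)
cov_white = np.cov(white, rowvar=False)
```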

Therefore, $C^{-1} Y_{norm}^{m} = C^{-1}(Y_{true}^{m} - Y_{true}^{WT}) + C^{-1} \epsilon$ and under the null hypothesis that $Y_{true}^{m} = Y_{true}^{WT}$, $C^{-1} Y_{norm}^{m}$ follows $N(0_n, I_n)$.

In this setting, under the null hypothesis, the $n$ components of $C^{-1} Y_{norm}^{m}$ are i.i.d. standard Gaussian variables. Their average therefore follows $N(0, 1/n)$, which gives a 95% confidence interval around 0 of half-width $1.96/\sqrt{n}$. The probability that a mutant with the same phenotype as the WT falls outside this interval is 5%, so we use this average to set apart outliers. As a sanity check, we verify that the false positive rate of our method on the WT is around 5%, which is the case.
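The test and its false-positive-rate sanity check can be sketched as follows; the function name, the covariance matrix, and the sample sizes are illustrative assumptions:

```python
import numpy as np

def is_outlier(y_norm, C, z_crit=1.96):
    """Flag a trajectory whose whitened average leaves the 95% interval.

    y_norm : normalized trajectory Y_norm^m, length n
    C      : matrix square root of Sigma (C @ C.T == Sigma)
    """
    n = y_norm.shape[0]
    z = np.linalg.solve(C, y_norm)        # C^{-1} y_norm ~ N(0, I_n) under H0
    return abs(z.mean()) > z_crit / np.sqrt(n)

# Sanity check on simulated WT-like data (hypothetical Sigma):
# under H0 the flagging rate should be close to the nominal 5%
rng = np.random.default_rng(3)
n = 12
idx = np.arange(n)
Sigma = 0.7 ** np.abs(idx[:, None] - idx[None, :])
C = np.linalg.cholesky(Sigma)

wt_like = rng.multivariate_normal(np.zeros(n), Sigma, size=20000)
fpr = np.mean([is_outlier(y, C) for y in wt_like])
print(round(fpr, 3))   # empirical false positive rate, roughly 0.05 by construction
```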