PPCA is a probabilistic latent variable model, whose maximum likelihood solution corresponds to PCA. For an introduction to PPCA, see [1].
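For reference, the underlying generative model (as introduced in [1] and [2]) is

$$\mathbf{z}_n \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad \mathbf{x}_n \mid \mathbf{z}_n \sim \mathcal{N}\left(\mathbf{W}\mathbf{z}_n + \boldsymbol{\mu},\ \sigma^2 \mathbf{I}\right),$$

where $\mathbf{x}_n$ is a $D$-dimensional observation, $\mathbf{z}_n$ its lower-dimensional latent representation, $\mathbf{W}$ the loading matrix, $\boldsymbol{\mu}$ the data mean, and $\sigma^2$ the isotropic noise variance.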
This implementation uses the expectation-maximization (EM) algorithm to find maximum-likelihood estimates of the PPCA model parameters. This enables a principled handling of missing values in the dataset, assuming that the values are missing at random (see equations further below).
## Requirements
This implementation requires Python >= 3.9 and can be installed as a package with:
```
pip install -e .
```
To run the demo.ipynb notebook, the following packages are additionally required:
```
pip install notebook scikit-learn matplotlib
```
## Demo
In demo.ipynb we show basic usage and compare this implementation to the one in sklearn. In short, PPCA can be used similarly to its sklearn counterpart:
```python
from ppca import PPCA

...

ppca = PPCA(n_components=2)
# X contains data with possibly missing values (= np.nan)
# Z are the transformed values
Z = ppca.fit_transform(X)
print("explained variance: ", ppca.explained_variance_)

...
```
However, going beyond the sklearn implementation, PPCA can also output distributions, handle missing values, etc.
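As a minimal end-to-end sketch of the missing-value handling (using only the calls shown above; the toy data and the masking below are made up for illustration):

```python
import numpy as np
from ppca import PPCA

# Toy data: 200 samples in 5 dimensions, with roughly 20% of the
# entries removed completely at random and marked as np.nan.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X[rng.random(X.shape) < 0.2] = np.nan

ppca = PPCA(n_components=2)
Z = ppca.fit_transform(X)  # works despite the missing entries
print("latent representation shape:", Z.shape)
print("explained variance:", ppca.explained_variance_)
```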
## EM for PPCA with Missing Values
Most implementations of PCA with missing values on GitHub use the EM imputation algorithm described by Roweis [3]. However, without formulating a probabilistic model there is, for example, no obvious way to transform unseen data with missing values. Instead, this repository implements the full probabilistic PCA model.
References [1] and [2] provide derivations and detailed discussions of the model and its optimization with EM; however, the missing-value case is not explained there in detail. The necessary equations are therefore provided here in compact form. Familiarity with [1] and [2] is assumed.
First, note that we can simply integrate out missing values from the marginal likelihood. Let $\mathbf{x}_n^\text{m}$ denote the missing entries of observation $\mathbf{x}_n$ and $\mathbf{x}_n^\text{o}$ the observed entries.
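Concretely (a compact sketch following the notation of [1] and [2]), integrating over the missing entries simply drops the corresponding Gaussian factors from the conditional likelihood:

$$p\left(\mathbf{x}_n^\text{o} \mid \mathbf{z}_n\right) = \int p\left(\mathbf{x}_n^\text{o}, \mathbf{x}_n^\text{m} \mid \mathbf{z}_n\right) \mathrm{d}\mathbf{x}_n^\text{m} = \prod_{i:\, x_{ni}\ \text{observed}} \mathcal{N}\left(x_{ni} \mid \mathbf{w}_i^\text{T}\mathbf{z}_n + \mu_i,\ \sigma^2\right).$$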
Note that $\mathbf{w}_i^\text{T}$ is the $i$-th row of $\mathbf{W}$. From this equation we can already see that the maximum likelihood estimate of $\boldsymbol{\mu}$ is given by the per-dimension mean of the observed values:

$$\mu_{\text{ML},i} = \frac{\sum_{n=1}^{N} \iota_{ni}\, x_{ni}}{\sum_{n=1}^{N} \iota_{ni}},$$

where $\iota_{ni}$ are indicator variables that are 1 if $x_{ni}$ is observed and 0 otherwise. Estimating $\mathbf{W}_\text{ML}$ and $\sigma_\text{ML}^2$ is more complicated, and we have to resort to the EM algorithm.
The expectation of the complete-data log likelihood with respect to the posterior over the latent variables is then

$$\mathbb{E}\left[\ln p\left(\mathbf{X}^\text{o}, \mathbf{Z} \mid \boldsymbol{\mu}_\text{ML}, \mathbf{W}, \sigma^2\right)\right] = -\sum_{n=1}^{N} \left\{ \frac{d_n}{2}\ln\left(2\pi\sigma^2\right) + \frac{1}{2}\operatorname{Tr}\left(\mathbb{E}\left[\mathbf{z}_n\mathbf{z}_n^\text{T}\right]\right) + \frac{1}{2\sigma^2}\lVert\mathbf{y}_n\rVert^2 - \frac{1}{\sigma^2}\mathbb{E}[\mathbf{z}_n]^\text{T}\mathbf{W}_n^\text{T}\mathbf{y}_n + \frac{1}{2\sigma^2}\operatorname{Tr}\left(\mathbb{E}\left[\mathbf{z}_n\mathbf{z}_n^\text{T}\right]\mathbf{W}_n^\text{T}\mathbf{W}_n\right) \right\} + \text{const},$$

where $d_n$ is the number of observed entries of $\mathbf{x}_n$, $\mathbf{y}_n$ contains the observed elements of $\mathbf{x}_n$ minus the corresponding elements of $\boldsymbol{\mu}_\text{ML}$, $\mathbf{W}_n$ is the matrix formed by the rows of $\mathbf{W}$ corresponding to the observed elements of $\mathbf{x}_n$, and $\mathbf{M}_n = \mathbf{W}_n^\text{T} \mathbf{W}_n + \sigma^2 \mathbf{I}$.
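The required E-step statistics take the same form as in the fully observed case (cf. [1]), with $\mathbf{W}$ and $\mathbf{x}_n - \boldsymbol{\mu}$ replaced by their observed-row counterparts $\mathbf{W}_n$ and $\mathbf{y}_n$:

$$\mathbb{E}[\mathbf{z}_n] = \mathbf{M}_n^{-1}\mathbf{W}_n^\text{T}\mathbf{y}_n, \qquad \mathbb{E}\left[\mathbf{z}_n\mathbf{z}_n^\text{T}\right] = \sigma^2\mathbf{M}_n^{-1} + \mathbb{E}[\mathbf{z}_n]\,\mathbb{E}[\mathbf{z}_n]^\text{T}.$$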
In the M-step, we maximize the expected complete-data log likelihood while keeping the latent variable posterior fixed.
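Setting the derivatives of this expectation with respect to the rows of $\mathbf{W}$ and with respect to $\sigma^2$ to zero gives updates of the following form (a compact sketch of the standard result, restricted to the observed entries; cf. [1], [2]):

$$\mathbf{w}_i^\text{new} = \left[\sum_{n=1}^{N} \iota_{ni}\, \mathbb{E}\left[\mathbf{z}_n\mathbf{z}_n^\text{T}\right]\right]^{-1} \sum_{n=1}^{N} \iota_{ni}\, y_{ni}\, \mathbb{E}[\mathbf{z}_n],$$

$$\left(\sigma^2\right)^\text{new} = \frac{1}{\sum_{n,i}\iota_{ni}} \sum_{n=1}^{N} \sum_{i:\,\iota_{ni}=1} \left\{ y_{ni}^2 - 2\, y_{ni}\, \left(\mathbf{w}_i^\text{new}\right)^\text{T}\mathbb{E}[\mathbf{z}_n] + \operatorname{Tr}\left(\mathbb{E}\left[\mathbf{z}_n\mathbf{z}_n^\text{T}\right] \mathbf{w}_i^\text{new}\left(\mathbf{w}_i^\text{new}\right)^\text{T}\right) \right\},$$

where $y_{ni} = x_{ni} - \mu_{\text{ML},i}$ denotes an observed, centered entry. E- and M-step are iterated until convergence.

For illustration only, a minimal NumPy sketch of one such EM iteration could look as follows (this is not the code used in this repository; the function and variable names are hypothetical):

```python
import numpy as np


def em_step(X, mu, W, sigma2):
    """One EM iteration for PPCA with missing values (np.nan) in X of shape (N, D)."""
    N, D = X.shape
    q = W.shape[1]
    obs = ~np.isnan(X)  # iota_{ni}: True where x_{ni} is observed

    # E-step: posterior moments of z_n, using only the observed rows of W
    Ez = np.zeros((N, q))
    Ezz = np.zeros((N, q, q))
    for n in range(N):
        o = obs[n]
        Wn = W[o]                     # rows of W for the observed dimensions
        yn = X[n, o] - mu[o]          # observed entries minus the mean
        Mn_inv = np.linalg.inv(Wn.T @ Wn + sigma2 * np.eye(q))
        Ez[n] = Mn_inv @ Wn.T @ yn
        Ezz[n] = sigma2 * Mn_inv + np.outer(Ez[n], Ez[n])

    # M-step: update each row of W from the samples where dimension i is observed
    W_new = np.zeros_like(W)
    for i in range(D):
        o = obs[:, i]
        A = Ezz[o].sum(axis=0)        # sum_n iota_ni E[z_n z_n^T]
        b = (X[o, i] - mu[i]) @ Ez[o]  # sum_n iota_ni y_ni E[z_n]
        W_new[i] = np.linalg.solve(A, b)

    # M-step: update sigma^2, averaging over all observed entries
    total = 0.0
    for n in range(N):
        o = obs[n]
        yn = X[n, o] - mu[o]
        Wn = W_new[o]
        total += yn @ yn - 2 * Ez[n] @ (Wn.T @ yn) + np.trace(Ezz[n] @ Wn.T @ Wn)
    sigma2_new = total / obs.sum()

    return W_new, sigma2_new
```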
[1] Bishop, C. M., Pattern Recognition and Machine Learning. New York: Springer, 2006.
[2] Tipping, M. E. and Bishop, C. M., Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 1999.
[3] Roweis, S., EM algorithms for PCA and SPCA. In Proceedings of the 1997 conference on Advances in neural information processing systems 10 (NIPS '97), 1998, 626-632.