[TOC]
The goal of this document is to understand Principal Components Analysis (PCA) in sklearn.decomposition.
I have done this by writing a sequence of tests that helped me understand what was going on. Take a look at them.
This document is a work in progress and sections will improve as my understanding improves.
I worked on a presentation of Dirac Notation that I will eventually use for this document; it is available here: https://www.icloud.com/keynote/0a7T8YWw1ZTV5_LU2YuABNHOQ#Notation_de_Dirac
Use Typora to read this document to see the equations.
Spectroscopy is the optical method of choice to detect substances or identify tissues. We learn very early on that identifying the peaks in a spectrum allows us to infer the substances in our sample. For instance, if we have pure substances, it is relatively easy to identify Ethanol and Methanol with their Raman spectra, because their shapes are significantly different and many peaks do not overlap:
It may even be possible to separate both substances if we have a mixture of the two, by fitting $c_e \mathbf{I}(\nu)_{e} + c_m \mathbf{I}(\nu)_{m}$ to find the appropriate concentrations that can explain our final combined spectrum. However, what do we do if we have a mixture of several solutions? What if several peaks overlap? What if we don't know the original spectra that are being mixed?
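This two-component fit can be sketched numerically. In this minimal example, the "pure" spectra are synthetic Gaussian peaks with made-up positions (they are not real Ethanol or Methanol data), mixed with known concentrations and recovered by least squares:

```python
import numpy as np

nu = np.linspace(400, 1800, 1024)          # wavenumber axis (cm^-1)

def peak(center, width, amplitude=1.0):
    # Gaussian stand-in for a Raman peak
    return amplitude * np.exp(-((nu - center) / width) ** 2)

I_e = peak(880, 20) + peak(1450, 30)       # stand-in "ethanol" spectrum
I_m = peak(1030, 15) + peak(1450, 30)      # stand-in "methanol" spectrum

mixture = 0.7 * I_e + 0.3 * I_m            # known concentrations c_e, c_m

# Solve mixture = c_e * I_e + c_m * I_m in the least-squares sense
A = np.column_stack([I_e, I_m])
(c_e, c_m), *_ = np.linalg.lstsq(A, mixture, rcond=None)
print(c_e, c_m)                            # recovers approximately 0.7 and 0.3
```

Note that the two profiles share a peak at 1450 cm$^{-1}$: the fit still works because the two spectra remain linearly independent, which foreshadows the discussion of bases below.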
We know intuitively that if peaks belong to the same molecule, they should vary together. If by chance none of the peaks from the different analytes overlap, then it becomes trivial: we only need to identify the peaks, find their amplitudes, and we will quickly get the concentrations of the respective analytes. But things get complicated if they have overlapping peaks, and even worse if we have more than a few components. We will discuss how we can use the very general nomenclature of a generalized vector space with Principal Components Analysis to extract this information.
From a mathematical point of view, we can consider a spectrum as a vector of intensities:
$$
\mathbf{I} = \sum_{i=1}^{n} I_i\mathbf{\hat{\nu}}_i,
$$
where each individual frequency $\nu_i$ is in its own dimension, with $\hat{\nu}_i$ the base vectors and $I_i$ the intensity at that frequency. Therefore, if we have $N=1024$ points in our intensity spectrum $\mathbf{I}$, we are in an $N$-dimensional space, with the components being $(I_1, I_2, \dots, I_N)$, and we should assume (at least for now) that these components are all independent. If we define the norm of a vector from the dot product, we can say that the norm is equal to:
$$
\left|\mathbf{I} \right| = \sqrt{\sum_{i=1}^{n} I_i\mathbf{\hat{\nu}}_i \cdot \sum_{j=1}^{n} I_j\mathbf{\hat{\nu}}_j} = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n} I_i I_j\, \hat{\nu}_i \cdot \hat{\nu}_j} = \sqrt{\sum_{i=1}^{n} \left|I_i\right|^2},
$$
since the spectral base vectors $\hat{\nu}_i$ are all orthonormal, which we can use to normalize a spectrum (or a vector). Finally, it is very convenient to work with matrix notation to express many of these things. We can express the spectrum $\mathbf{I}$ in the basis $\left\{ \hat{\nu}_i \right\}$ with:
$$
\mathbf{I} = \sum_{i=1}^{n} I_i\mathbf{\hat{\nu}}_i
= \left( \mathbf{\hat{\nu}}_1 \ \mathbf{\hat{\nu}}_2\ \dots\ \mathbf{\hat{\nu}}_n \right) \left( I_1 \ I_2 \ \dots \ I_n \right)^T
= \left( \mathbf{\hat{\nu}}_1 \ \mathbf{\hat{\nu}}_2\ \dots\ \mathbf{\hat{\nu}}_n \right)
\left(
\begin{matrix}
I_1 \\
I_2 \\
\vdots \\
I_n
\end{matrix}
\right).
$$
Note that the vector itself $\mathbf{I}$ is different from the components of that vector in a given basis $[I]_\nu$, just as the position vector $\mathbf{r}$ is different from its components $(x,y,z)$ in Cartesian coordinates. If we consider these matrices as partitions, we can write this in an even more compact form:
$$
\mathbf{I} = \hat{\nu} \left[ I \right]_\nu,
$$
where the notation $\left[ I\right]_\nu$ means "the intensity coefficients in the base $\left\{ \hat{\nu}_i \right\}$".
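The norm and normalization above are easy to check numerically. A minimal sketch, with random numbers standing in for a real spectrum:

```python
import numpy as np

rng = np.random.default_rng(0)
I = rng.random(1024)                  # N = 1024 intensity components

norm = np.sqrt(np.sum(I ** 2))        # |I| = sqrt(sum_i |I_i|^2)
I_normalized = I / norm               # unit-length spectrum
```

Since the spectral base vectors are orthonormal, this component-wise formula agrees with `np.linalg.norm`, and the normalized spectrum has norm 1.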
For more information about the notation for vector, base vectors and coefficients:
- Read [Greenberg Section 10.7](./Greenberg base change.pdf) on bases and base changes.
- Watch the video (in French) that explains in even more detail where this comes from.
- Watch an example (in French) for problem 10.7.1 that discusses an application of this notation and formalism to perform a base change.
This document will always discuss things in terms of bases and base changes: a vector can be expressed in an infinite number of bases. As a reminder, the definition of a base:

- A base set is complete: it spans the space for which it is a base: you must be able to get every vector in that space with $\mathbf{v} = \sum c_i \mathbf{e}_i$. We call the $c_i$ the components of a vector in that base.
- A base set is linearly independent: all base vectors are independent, and the only way to combine the base vectors to obtain the null vector $\sum c_i \mathbf{e}_i = \mathbf{0}$ is with $c_i = 0$ for all $i$. Another way of saying this is "you cannot combine vectors from the base set to obtain another vector from the base set".
- The number of base vectors in the set is the dimension of the space.
Notice that:

- The base vectors do not have to be unitary: they can have any length. A normalized base set will be labelled with a hat on the vector, $\left\{ \mathbf{\hat{e}}_i \right\}$, and an arbitrary set will be $\left\{ \mathbf{e}_i \right\}$.
- The base vectors do not have to be orthogonal: as long as they are independent, that is fine. There is no notation to differentiate orthogonal and non-orthogonal bases because orthogonality is not a property of a single vector, it is a property of a pair of vectors, therefore we cannot label a vector as "orthogonal".
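These two defining properties can be checked numerically. A sketch on a made-up set of three vectors in $\mathbb{R}^3$ that are independent but neither orthogonal nor unit length (which, as noted above, is allowed):

```python
import numpy as np

E = np.column_stack([[1.0, 0.0, 0.0],
                     [1.0, 1.0, 0.0],
                     [0.0, 2.0, 1.0]])   # columns are the base vectors e_i

# Linear independence: E @ c = 0 only for c = 0, i.e. E has full rank
assert np.linalg.matrix_rank(E) == 3

# Completeness: any vector v in R^3 has components c with v = E @ c
v = np.array([2.0, -1.0, 0.5])
c = np.linalg.solve(E, v)
assert np.allclose(E @ c, v)
```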
We know from experience that in a spectrum, intensities are not completely independent: for instance, in the methanol spectrum above, the peak around 1000 cm$^{-1}$ has a certain width and therefore those intensities are related, not independent. In fact, for the spectrum of a single substance, all intensities are related because they will come from a scaled version of the original spectrum. Therefore, if we have a methanol spectrum:
$$
\mathbf{I}_M = \sum_{i=1}^{n} I_{M,i}\mathbf{\hat{\nu}}_i,
$$
where $I_{M,i}$ is the intensity of the methanol spectrum at frequency $\nu_i$.
Finally, if we want to describe a collection of $m$ spectra in terms of a set of base spectra $\left\{ \mathbf{\hat{s}}_j \right\}$, each spectrum is written:
$$
\mathbf{I}_i = \sum_{j} c_{ij}\mathbf{\hat{s}}_j.
$$
This can be rewritten in matrix notation:
$$
\left( \mathbf{I}_1, \mathbf{I}_2, \dots, \mathbf{I}_m \right) = \left( \mathbf{\hat{s}}_1, \mathbf{\hat{s}}_2, \dots, \mathbf{\hat{s}}_n \right)
\left(
\begin{matrix}
c_{11} & c_{21} & \dots & c_{m1} \\
c_{12} & c_{22} & \dots & c_{m2} \\
\vdots & \vdots &  & \vdots \\
c_{1n} & c_{2n} & \dots & c_{mn}
\end{matrix}
\right)
$$
to yield this very compact form:
$$
\mathbf{I} = \mathbf{\hat{s}} \left[\mathbf{C}\right]_\mathbf{\hat{s}}
$$
with $\left[\mathbf{C}\right]_\mathbf{\hat{s}}$ the matrix of coefficients of all the spectra in the base $\left\{ \mathbf{\hat{s}}_j \right\}$.
If we have several components (i.e. methanol, ethanol, etc.) and there is no overlap whatsoever between the spectra (i.e. the peaks are all distinct), then the base vectors $\left\{ \mathbf{\hat{s}}_j \right\}$ are orthogonal and the coefficients are straightforward to obtain.
The general expression for a vector as a function of its basis and its components in that basis obviously stands correct for any basis:
$$
\mathbf{I} = \mathbf{e} \left[\mathbf{C}\right]_\mathbf{e} = \mathbf{e}^\prime \left[\mathbf{C}^\prime\right]_{\mathbf{e}^\prime}.
\label{eq:vectorBasis}
$$
It is the purpose of the present section to show how to go from a basis $\mathbf{e}$ to a basis $\mathbf{e}^\prime$, that is, how to transform the coefficients $\left[\mathbf{C}\right]_\mathbf{e}$ into the coefficients $\left[\mathbf{C}^\prime\right]_{\mathbf{e}^\prime}$. For more information, you can look at the YouTube video on base changes.
Since we can express any vector in a basis, we can choose to express the basis vectors $\mathbf{e}$ in the $\mathbf{e}^\prime$ basis, with coefficients $\left[ \mathbf{Q} \right]_{\mathbf{e}^\prime}$ we do not know yet:
$$
\mathbf{e} = \mathbf{e}^\prime \left[ \mathbf{Q} \right]_{\mathbf{e}^\prime},
\label{eq:bprimetob}
$$
where each column of the matrix $\left[ \mathbf{Q} \right]_{\mathbf{e}^\prime}$ holds the components of the vector $\mathbf{e}_i$ in the $\mathbf{e}^\prime$ basis. By definition, a basis set has enough vectors to span the vector space, therefore both basis sets must have the same number of vectors, and the matrix $\left[ \mathbf{Q} \right]_{\mathbf{e}^\prime}$ is necessarily square; since its columns are the components of independent vectors, it can be inverted. We can therefore use $(\ref{eq:bprimetob})$ in $(\ref{eq:vectorBasis})$ and obtain simply:
$$
\mathbf{I} = \mathbf{e} \left[\mathbf{C}\right]_\mathbf{e}
= \left( \mathbf{e}^\prime \left[ \mathbf{Q} \right]_{\mathbf{e}^\prime} \right) \left[\mathbf{C}\right]_\mathbf{e}
= \mathbf{e}^\prime \left( \left[ \mathbf{Q} \right]_{\mathbf{e}^\prime} \left[\mathbf{C}\right]_\mathbf{e} \right)
= \mathbf{e}^\prime \left[\mathbf{C}^\prime\right]_{\mathbf{e}^\prime}.
\label{eq:base}
$$
This means that, when the vectors in different bases are expressed by $(\ref{eq:vectorBasis})$, the components in the basis $\mathbf{e}^\prime$ can be obtained from the components in the basis $\mathbf{e}$ by this simple transformation:
$$
\left[\mathbf{C}^\prime\right]_{\mathbf{e}^\prime} \equiv \left[ \mathbf{Q} \right]_{\mathbf{e}^\prime} \left[\mathbf{C}\right]_\mathbf{e}
$$
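A minimal numerical sketch of this base change: the columns of $Q$ hold the old base vectors expressed in the new basis, and components transform as $c^\prime = Q c$. A rotation by 45 degrees serves as the example, with the old basis taken to be the standard basis of $\mathbb{R}^2$:

```python
import numpy as np

theta = np.pi / 4
Q = np.array([[np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])   # old base vectors in the new basis

c = np.array([1.0, 0.0])      # components in the old (standard) basis
c_prime = Q @ c               # components of the same vector in the new basis

# The vector itself is unchanged: rebuilding it from the new basis
# (whose vectors, in standard coordinates, are the columns of Q^-1)
# gives back the original components.
E_prime = np.linalg.inv(Q)
assert np.allclose(E_prime @ c_prime, c)
```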
The goal of Principal Component Analysis (PCA) is to obtain an orthogonal basis for a much smaller subspace than the original (it is a dimensionality reduction technique). We will identify this orthonormal PCA base as $\left\{ \mathbf{\hat{p}} \right\}$.
At this point, it is fairly simple to describe the process without worrying about the details: PCA takes a large number of sample spectra, and will:
- Find the principal components, an orthonormal basis $\left\{ \mathbf{\hat{p}} \right\}$ that best explains the variance of the data, together with a value that expresses how important each component is.
- Given a spectrum $\mathbf{I}$, return (fit) the coefficients $\left[ \mathbf{I} \right]_\mathbf{\hat{p}}$ in the PCA basis $\left\{ \mathbf{\hat{p}} \right\}$, with $\mathbf{I}=\mathbf{\hat{p}} \left[ \mathbf{I} \right]_\mathbf{\hat{p}}$.
So, because the present document is about the mathematical formalism first and foremost, and I do not want to dive too much into the Python details, let us just say that the following code will give us the principal components and all the coefficients for our spectra in that base:
```python
from sklearn.decomposition import PCA
# [...]
pca = PCA(n_components=componentsToKeep)
pca.fit(dataSet)  # find the principal components
# The principal components are available in the variable pca.components_
# They form an orthonormal basis set
pcaDataCoefficients = pca.transform(dataSet)  # express our spectra in the PCA basis
# pcaDataCoefficients are the coefficients for each spectrum in the PCA basis
# Note: it is (c1*PC1 + c2*PC2 + ...) + meanSpectrum = Spectrum
# as in equation (16) below.
```
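To convince ourselves of these claims, here is a self-contained sketch on synthetic data (the profiles and sizes are made up for the example): the components form an orthonormal set, and each spectrum is recovered as the coefficients times the components plus the mean spectrum.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
nu = np.linspace(0, 1, 256)
# 100 fake "spectra": random mixtures of two fixed profiles plus noise
s1 = np.exp(-((nu - 0.3) / 0.05) ** 2)
s2 = np.exp(-((nu - 0.7) / 0.08) ** 2)
dataSet = (rng.random((100, 1)) * s1 + rng.random((100, 1)) * s2
           + 0.01 * rng.standard_normal((100, 256)))

pca = PCA(n_components=2)
pca.fit(dataSet)
coeffs = pca.transform(dataSet)

P = pca.components_                                  # rows are the basis vectors
assert np.allclose(P @ P.T, np.eye(2), atol=1e-8)    # orthonormal basis set

# (c1*PC1 + c2*PC2) + meanSpectrum reproduces pca.inverse_transform exactly
reconstructed = coeffs @ P + pca.mean_
assert np.allclose(reconstructed, pca.inverse_transform(coeffs))
```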
Equation $(\ref{eq:vectorBasis})$ is not the only possibility to express a vector in a different basis. We will see later that PCA often translates the sample vectors (i.e. the intensity spectra) to the "origin" by subtracting the mean spectrum from all spectra. This means that we have a more general transformation than $(\ref{eq:vectorBasis})$ in that we do not express $\mathbf{I}$ in a different set of coordinates but rather $\mathbf{I} - \bar{\mathbf{I}}$:
$$
\mathbf{I}-\mathbf{\bar{I}}= \mathbf{\hat{p}} \left[\mathbf{C}\right]_\mathbf{\hat{p}},
\label{eq:vectorBasis2}
$$
with
$$
\mathbf{\bar{I}} = \frac{1}{m} \sum_{j=1}^{m} \mathbf{I}_{j} \equiv
\left( \hat{\nu}_1, \hat{\nu}_2, \dots, \hat{\nu}_n \right)
\left( \bar{I}_{1}, \bar{I}_{2}, \dots, \bar{I}_{n} \right)^T
\equiv
\hat{\nu} \left[ \mathbf{\bar{I}} \right]_\nu
\label{eq:average}
$$
This average is computed in the "intensity" basis (or original basis $\left\{\mathbf{\hat{\nu}}\right\}$) because that is the only basis we know when we start (i.e. we average all spectra). This small change where we subtract the mean is important, because to return to another basis used to generate $\mathbf{I}$, we need to write:
$$
\mathbf{I} = \mathbf{e} \left[\mathbf{C}\right]_\mathbf{e},
$$
and if we try to follow the same development as in $(\ref{eq:base})$, we quickly get stuck because we have $\mathbf{\bar{I}}$ neither in $\left\{\mathbf{e}\right\}$ nor in $\left\{\mathbf{\hat{p}}\right\}$ coordinates; we have it in $\left\{\mathbf{\hat{\nu}}\right\}$ coordinates:
$$
\mathbf{I} = \mathbf{\hat{p}} \left[\mathbf{C}\right]_\mathbf{\hat{p}} + \hat{\nu} \left[ \mathbf{\bar{I}} \right]_\nu = \mathbf{e} \left[\mathbf{C}\right]_\mathbf{e}.
\label{eq:translated}
$$
Yet, this is the situation we will encounter later:
- $\mathbf{I}$ is the many spectra we have acquired in the lab. They are in the $\left\{\mathbf{\hat{\nu}}\right\}$ basis (i.e. simple intensity spectra).
- We can compute $\mathbf{\bar{I}}$ with $(\ref{eq:average})$, also in the spectral component basis $\left\{\mathbf{\hat{\nu}}\right\}$.
- The $\left\{\mathbf{\hat{p}}_i\right\}$ basis is the Principal Component Analysis (PCA) basis that will be obtained from the module together with the coefficients $\left[\mathbf{C}\right]_\mathbf{\hat{p}}$. It comes from a singular value decomposition, and at this point we do not worry ourselves with how it is obtained: we know we can obtain $\left\{\mathbf{\hat{p}}\right\}$ and $\left[\mathbf{C}\right]_\mathbf{\hat{p}}$ from sklearn and PCA.
- Finally, the $\left\{\mathbf{e}_i\right\}$ basis will be our "solution" basis (or the physically meaningful basis) for which we would like to get the concentrations $\left[\mathbf{C}\right]_\mathbf{e}$ for our lab measurements. In Raman, this could be the lipid spectrum, DNA spectrum, protein spectrum, etc. We know some $\left\{\mathbf{e}_i\right\}$ (from insight), but we certainly do not know them all. We want the coefficients to try to determine the concentrations of these molecules and get insight (or answers) from our experimental spectra, but we may not have all the components (i.e. we may not have the full basis set).
There is mathematically not much we can do with these three coordinate systems in $(\ref{eq:translated})$ as written. We have two options:

- Express $\mathbf{\bar{I}}$ in the base $\left\{\mathbf{e}\right\}$
- Express $\mathbf{\bar{I}}$ in the PCA base $\left\{\mathbf{\hat{p}}\right\}$
For reasons that should become clear later, we will choose to express $\mathbf{\bar{I}}$ in the base $\left\{\mathbf{\hat{p}}\right\}$ because, in fact, we do not know $\left\{\mathbf{e}\right\}$ completely, we only know part of it. If we knew $\mathbf{\hat{p}} \left[ \mathbf{\bar{I}} \right]_\mathbf{\hat{p}}$, we could write:
$$
\mathbf{\hat{p}} \left[\mathbf{C}\right]_\mathbf{\hat{p}} + \mathbf{\hat{p}} \left[ \mathbf{\bar{I}} \right]_\mathbf{\hat{p}} = \mathbf{e} \left[\mathbf{C}\right]_\mathbf{e},
$$

$$
\mathbf{\hat{p}} \left( \left[\mathbf{C}\right]_\mathbf{\hat{p}} + \left[ \mathbf{\bar{I}} \right]_\mathbf{\hat{p}} \right) = \mathbf{e} \left[\mathbf{C}\right]_\mathbf{e}.
$$
If we define for clarity:
$$
\left[\mathbf{C_+} \right]_\mathbf{\hat{p}} \equiv \left[\mathbf{C} \right]_\mathbf{\hat{p}} + \left[ \mathbf{\bar{I}} \right]_\mathbf{\hat{p}},
$$
we can write:
$$
\mathbf{\hat{p}} \left[\mathbf{C_+} \right]_\mathbf{\hat{p}} = \mathbf{e} \left[\mathbf{C}\right]_\mathbf{e}.
$$
We obtain $\left[ \mathbf{\bar{I}} \right]_\mathbf{\hat{p}}$ by transforming the null spectrum $\mathbf{0}$ in equation $(\ref{eq:vectorBasis2})$:
$$
\mathbf{0} = \mathbf{\hat{p}} \left[\mathbf{C_0}\right]_\mathbf{\hat{p}} + \mathbf{\bar{I}},
$$

$$
\mathbf{\bar{I}} = -\mathbf{\hat{p}} \left[\mathbf{C_0}\right]_\mathbf{\hat{p}} = \mathbf{\hat{p}} \left( - \left[\mathbf{C_0}\right]_\mathbf{\hat{p}}\right) \equiv \mathbf{\hat{p}} \left[ \mathbf{\bar{I}} \right]_\mathbf{\hat{p}}
$$
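This trick translates directly into code: since `pca.transform` subtracts the mean before projecting, transforming the null spectrum gives $\left[\mathbf{C_0}\right]_\mathbf{\hat{p}}$, and its negative is $\left[ \mathbf{\bar{I}} \right]_\mathbf{\hat{p}}$. A sketch on synthetic data (the sizes are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
dataSet = rng.random((50, 128))            # stand-in for measured spectra

pca = PCA(n_components=5).fit(dataSet)

C0 = pca.transform(np.zeros((1, 128)))     # coefficients of the null spectrum
Ibar_p = -C0                               # [Ibar]_p = -[C_0]_p

# Same result as projecting the mean spectrum onto the components
assert np.allclose(Ibar_p, pca.mean_ @ pca.components_.T)
```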