diff --git a/report_thesis/src/sections/background/preprocessing/pca.tex b/report_thesis/src/sections/background/preprocessing/pca.tex
index 3b0136d9..1d55fd88 100644
--- a/report_thesis/src/sections/background/preprocessing/pca.tex
+++ b/report_thesis/src/sections/background/preprocessing/pca.tex
@@ -1,47 +1,23 @@
 \subsubsection{Principal Component Analysis (PCA)}\label{subsec:pca}
-\gls{pca} is a dimensionality reduction technique that transforms a set of possibly correlated variables into a smaller set of uncorrelated variables called \textit{principal components}.
-We give an overview of the \gls{pca} algorithm based on \citet{James2023AnIS}.
+\gls{pca} is a dimensionality reduction technique that reduces the number of features in a dataset while retaining as much information as possible.
+In this section, we provide an overview of \gls{pca} based on \citet{dataminingConcepts} and \citet{Vasques2024}.
 
-First, the data matrix $\mathbf{X}$ is centered by subtracting the mean of each variable to ensure that the data is centered at the origin:
+\gls{pca} works by identifying the directions in which the $n$-dimensional data varies the most and projecting the data onto $k$ of these directions, where $k \leq n$.
+This projection results in a lower-dimensional representation of the data.
+\gls{pca} can reveal the underlying structure of the data, which enables interpretation that would not be possible with the original high-dimensional data.
 
-$$
-\mathbf{\bar{X}} = \mathbf{X} - \mathbf{\mu},
-$$
+The \gls{pca} algorithm proceeds as follows.
+First, the input data are normalized, which prevents features with larger scales from dominating the analysis.
 
-where $\mathbf{\bar{X}}$ is the centered data matrix and $\mathbf{\mu}$ is the mean of each variable.
+Then, the covariance matrix of the normalized data is computed.
+The covariance matrix captures how each pair of features in the dataset varies together.
+From this covariance matrix, $k$ orthogonal unit vectors, called \textit{principal components}, are then computed.
+These vectors are perpendicular to each other and capture the directions of maximum variance in the data.
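+As an illustrative sketch (using notation chosen here for exposition rather than taken from \citet{dataminingConcepts} or \citet{Vasques2024}), let $\mathbf{Z}$ denote the normalized data matrix with $m$ samples as its rows; the covariance matrix can then be written as
+
+$$
+\mathbf{\Sigma} = \frac{1}{m-1} \mathbf{Z}^T \mathbf{Z},
+$$
+
+where each entry $\Sigma_{ij}$ quantifies how features $i$ and $j$ vary together.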
 
-The covariance matrix of the centered data is then computed:
+The principal components are then sorted such that the first component captures the most variance, the second component captures the second most variance, and so on.
+\gls{pca} assumes that variance is a measure of information.
+In other words, the principal components are sorted based on the amount of information they capture.
 
-$$
-\mathbf{C} = \frac{1}{n-1} \mathbf{\bar{X}}^T \mathbf{\bar{X}},
-$$
-
-where $n$ is the number of samples.
-
-Then, the covariance matrix $\mathbf{C}$ is decomposed into its eigenvectors $\mathbf{V}$ and eigenvalues $\mathbf{D}$:
-
-$$
-\mathbf{C} = \mathbf{V} \mathbf{D} \mathbf{V}^T,
-$$
-
-where $\mathbf{V}$ contains the eigenvectors of $\mathbf{C}$.
-These eigenvectors represent the principal components, indicating the directions of maximum variance in $\mathbf{X}$.
-The interpretation of the principal components is that the first captures the most variance, the second captures the next most variance, and so on.
-The matrix $\mathbf{D}$ is diagonal and contains the eigenvalues, each quantifying the variance captured by its corresponding principal component.
-
-These components are the scores $\mathbf{T}$, calculated as follows:
-
-$$
-\mathbf{T} = \mathbf{\bar{X}} \mathbf{V}_n,
-$$
-
-where $\mathbf{V}_n$ includes only the top $n$ eigenvectors.
-The scores $\mathbf{T}$ are the new, uncorrelated features that reduce the dimensionality of the original data, capturing the most significant patterns and trends.
-
-Finally, the original data points are projected onto the space defined by the top $n$ principal components, which transforms $X$ into a lower-dimensional representation:
-
-$$
-\mathbf{X}_{\text{reduced}} = \mathbf{\bar{X}} \mathbf{V}_n,
-$$
-
-where $\mathbf{V}_n$ is the matrix that only contains the top $n$ eigenvectors.
\ No newline at end of file
+After computing and sorting the principal components, the data can be projected onto the most informative of these components.
+This projection results in a lower-dimensional approximation of the original data.
+The number of principal components to keep is a hyperparameter that can be tuned to balance the amount of information retained against the dimensionality of the data.
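+
+To make the projection step concrete (again using notation chosen here for exposition rather than taken from the cited sources), let $\mathbf{W}_k$ be the matrix whose columns are the $k$ most informative principal components; the lower-dimensional representation of the normalized data matrix $\mathbf{Z}$ is then
+
+$$
+\mathbf{Z}_k = \mathbf{Z} \mathbf{W}_k,
+$$
+
+where $\mathbf{Z}_k$ has one column per retained component, so increasing $k$ retains more information at the cost of a higher-dimensional representation.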