diff --git a/report_thesis/src/index.tex b/report_thesis/src/index.tex index d7c860db..531a233e 100644 --- a/report_thesis/src/index.tex +++ b/report_thesis/src/index.tex @@ -3,7 +3,11 @@ \newpage \begin{abstract} - Abstract +This thesis advances the analysis of \gls{libs} data for predicting major oxide compositions in geological samples. +By integrating machine learning techniques and ensemble regression models, the study addresses challenges like high dimensionality, multicollinearity, and limited data availability. +Key innovations include the use of stacked generalization for improved model performance and an automated hyperparameter optimization framework. +The research contributes a comprehensive catalog of models and preprocessing techniques, and integrates findings into the \gls{pyhat} by the \gls{usgs}, enhancing its scientific capabilities. +This work lays a robust foundation for future advancements in geochemical analysis and planetary exploration using \gls{libs} data. \end{abstract} \maketitle diff --git a/report_thesis/src/sections/background/preprocessing/index.tex b/report_thesis/src/sections/background/preprocessing/index.tex index a0612dfb..7398888c 100644 --- a/report_thesis/src/sections/background/preprocessing/index.tex +++ b/report_thesis/src/sections/background/preprocessing/index.tex @@ -1,6 +1,6 @@ \subsection{Preprocessing} In this subsection, we discuss the preprocessing methods used in our machine learning pipeline. -We cover the following normalization techniques: Z-score normalization, max absolute scaling, min-max normalization, robust scaling, Norm 3, power transformation, and quantile transformation. +We cover the following normalization techniques: Z-Score standardization, Max Absolute scaling, Min-Max normalization, robust scaling, Norm 3, power transformation, and quantile transformation. These techniques are essential for standardizing data, handling different scales, and improving the performance of machine learning models. 
For the purposes of this discussion, let $\mathbf{x}$ be a feature vector with values $x_1, x_2, \ldots, x_n$. diff --git a/report_thesis/src/sections/background/preprocessing/max_abs.tex b/report_thesis/src/sections/background/preprocessing/max_abs.tex index b615d143..bac86aec 100644 --- a/report_thesis/src/sections/background/preprocessing/max_abs.tex +++ b/report_thesis/src/sections/background/preprocessing/max_abs.tex @@ -1,12 +1,12 @@ \subsubsection{Max Absolute Scaler} -Max absolute scaling is a normalization technique that scales each feature individually so that the maximum absolute value of each feature is 1. +Max Absolute Scaling is a normalization technique that scales each feature individually so that the maximum absolute value of each feature is 1. This results in the data being normalized to a range between -1 and 1. -The formula for max absolute scaling is given by: +The formula for Max Absolute Scaling is given by: $$ x'_i = \frac{x_i}{\max(|\mathbf{x}|)}, $$ where $x_i$ is the original feature value, $\max(|\mathbf{x}|)$ is the maximum absolute value of the feature vector $\mathbf{x}$, and $x'_i$ is the normalized feature value. -This scaling method is particularly useful for data that has been centered at zero or is sparse, as max absolute scaling does not alter the mean of the data. +This scaling method is particularly useful for data that has been centered at zero or is sparse, as Max Absolute Scaling does not shift or center the data. Additionally, it preserves the sparsity of the data by ensuring that zero entries remain zero, thereby not introducing any non-zero values~\cite{Vasques2024}. 
\ No newline at end of file diff --git a/report_thesis/src/sections/background/preprocessing/min-max.tex b/report_thesis/src/sections/background/preprocessing/min-max.tex index 176e981c..54905f01 100644 --- a/report_thesis/src/sections/background/preprocessing/min-max.tex +++ b/report_thesis/src/sections/background/preprocessing/min-max.tex @@ -1,7 +1,7 @@ \subsubsection{Min-Max Normalization}\label{subsec:min-max} -Min-max normalization rescales the range of features to a specific range $[a, b]$, where $a$ and $b$ represent the new minimum and maximum values, respectively. +Min-Max normalization rescales the range of features to a specific range $[a, b]$, where $a$ and $b$ represent the new minimum and maximum values, respectively. The goal is to normalize the range of the data to a specific scale, typically 0 to 1. -The min-max normalization of a feature vector $\mathbf{x}$ is given by: +The Min-Max normalization of a feature vector $\mathbf{x}$ is given by: $$ x'_i = \frac{x_i - \min(\mathbf{x})}{\max(\mathbf{x}) - \min(\mathbf{x})}(b - a) + a, diff --git a/report_thesis/src/sections/background/preprocessing/z-score.tex b/report_thesis/src/sections/background/preprocessing/z-score.tex index 0cd42120..eba5ae2c 100644 --- a/report_thesis/src/sections/background/preprocessing/z-score.tex +++ b/report_thesis/src/sections/background/preprocessing/z-score.tex @@ -1,7 +1,7 @@ -\subsubsection{Z-score Normalization} -Z-score normalization, also known as zero-mean normalization, transforms data to have a mean of zero and a standard deviation of one. +\subsubsection{Z-Score Standardization} +Z-Score Standardization, also known as zero-mean normalization, transforms data to have a mean of zero and a standard deviation of one. This technique is useful when the actual minimum and maximum of a feature are unknown or when outliers may significantly skew the distribution. 
-The z-score normalization of a feature vector \(\mathbf{x}\) is given by: +The Z-Score Standardization of a feature vector \(\mathbf{x}\) is given by: $$ x'_i = \frac{x_i - \overline{\mathbf{x}}}{\sigma_\mathbf{x}}, @@ -9,4 +9,4 @@ \subsubsection{Z-score Normalization} where \(x_i\) is the original value, \(\overline{\mathbf{x}}\) is the mean of the feature vector \(\mathbf{x}\), \(\sigma_\mathbf{x}\) is the standard deviation of the feature vector \(\mathbf{x}\), and \(x'_i\) is the normalized feature value. By transforming the data using the Z-score, each value reflects its distance from the mean in terms of standard deviations. -Z-score normalization is particularly advantageous in scenarios where data features have different units or scales, or when preparing data for algorithms that assume normally distributed inputs~\cite{dataminingConcepts}. +Z-Score Standardization is particularly advantageous in scenarios where data features have different units or scales, or when preparing data for algorithms that assume normally distributed inputs~\cite{dataminingConcepts}. diff --git a/report_thesis/src/sections/conclusion.tex b/report_thesis/src/sections/conclusion.tex index faf551e3..a4c73d39 100644 --- a/report_thesis/src/sections/conclusion.tex +++ b/report_thesis/src/sections/conclusion.tex @@ -1 +1,33 @@ -\section{Conclusion}\label{sec:conclusion} \ No newline at end of file +\section{Conclusion}\label{sec:conclusion} +This thesis set out to advance the analysis of \gls{libs} data for predicting major oxide compositions in geological samples. +By integrating sophisticated machine learning techniques and ensemble regression models, we aimed to tackle the substantial challenges posed by the high-dimensional, nonlinear nature of \gls{libs} data. + +Our research confronted and addressed critical challenges, including the complexities of high dimensionality, non-linearity, multicollinearity, and the limited availability of data. 
+These issues traditionally hinder the accurate prediction of major oxides from spectral data, necessitating the development of robust and adaptive computational methodologies. + +Throughout our study, we systematically explored a diverse range of machine learning models, categorized into ensemble learning models, linear and regularization models, and neural network models. +Using the developed evaluation framework, we identified the strengths and limitations of each model in relation to predicting major oxides within the context of \gls{libs} data analysis. + +Normalization and transformation techniques played a crucial role in our approach. +We investigated and employed various methods such as Z-Score standardization, Max Absolute scaling, Min-Max normalization, robust scaling, Norm 3, power transformation, and quantile transformation. +These techniques proved vital for standardizing the data, managing different scales, and ultimately enhancing the performance of our models. + +Dimensionality reduction techniques such as \gls{pca} and \gls{kernel-pca} showed potential in managing the high dimensionality of the spectral data; however, their efficacy was not conclusively demonstrated. + +One of the key innovations in our approach was the use of stacked generalization. +Such an approach has seen limited use in the field of \gls{libs} data analysis, and our work demonstrated its potential in this context. +This ensemble method trains multiple base models on the same data and uses their predictions as inputs to a meta-learner, which produces the final prediction. +By leveraging the strengths of various models and mitigating their individual weaknesses, this technique significantly improved generalization on unseen data. + +We also designed and implemented a framework, using the automated hyperparameter optimization tool Optuna as its foundation. 
+This framework allowed us to identify the most effective combinations of preprocessing methods and models tailored to the specific characteristics of each oxide, ensuring highly effective performance. + +Finally, we designed and implemented a data partitioning method that addresses the challenges of data leakage and uneven distribution of extreme values, ensuring robust and reliable model evaluation. + +The outcome of our work is a comprehensive catalog of machine learning models and preprocessing techniques for predicting major oxide compositions in \gls{libs} data. +This catalog, featuring highly effective configurations, provides a resource for future research and model and preprocessor selection. + +Moreover, our contributions extend beyond this thesis. +We integrated our findings into the \gls{pyhat} library developed by the \gls{usgs}, thereby enhancing its capabilities for the scientific community. + +In conclusion, by addressing the inherent challenges and developing a robust computational framework, this thesis has laid groundwork for future advancements in geochemical analysis and planetary exploration using \gls{libs} data. diff --git a/report_thesis/src/sections/proposed_approach/model_selection.tex b/report_thesis/src/sections/proposed_approach/model_selection.tex index 2dd1004f..81907fb2 100644 --- a/report_thesis/src/sections/proposed_approach/model_selection.tex +++ b/report_thesis/src/sections/proposed_approach/model_selection.tex @@ -3,7 +3,7 @@ \subsection{Model and Preprocessing Selection}\label{sec:model_selection} We had several considerations to guide our selection of preprocessing techniques. Firstly, our review of the literature revealed that there seems to be no consensus on a single, most effective normalization method for \gls{libs} data. -Therefore, we included traditional normalization methods in our experiments, such as z-score normalization, Min-Max scaling, and Max Absolute scaling. 
+Therefore, we included traditional normalization methods in our experiments, such as Z-Score Standardization, Min-Max normalization, and Max Absolute Scaling. This approach allowed us to determine which normalization method was most effective for our dataset. Additionally, dimensionality reduction techniques are considered by the literature to be effective techniques for \gls{libs} data due to its high dimensionality. Specifically, \gls{pca} has been widely adopted by the spectroscopic community as an established dimensionality reduction technique~\cite{pca_review_paper}. @@ -50,7 +50,7 @@ \subsection{Model and Preprocessing Selection}\label{sec:model_selection} \toprule \textbf{Normalization / Scaling:} \\ \midrule -Z-Score Normalization \\ +Z-Score Standardization \\ Min-Max Normalization \\ Max Absolute Scaling \\ Robust Scaling \\ diff --git a/report_thesis/src/sections/results/optimization_results.tex b/report_thesis/src/sections/results/optimization_results.tex index 8a313b43..94ada51b 100644 --- a/report_thesis/src/sections/results/optimization_results.tex +++ b/report_thesis/src/sections/results/optimization_results.tex @@ -59,7 +59,7 @@ \subsection{Optimization Results}\label{sec:optimization_results} This indicates that they are indeed used in some top-performing configurations. However, based on the results in Table~\ref{tab:pca_comparison}, we did not expect them to be as prevalent as they are, suggesting that while they are not the most frequently used, they can still be highly effective in specific scenarios. Interestingly, Figure~\ref{fig:top100_scalers} shows that, although \texttt{Norm3Scaler} is the most frequently used and best-performing scaler, this is not always the case. -Min-max scaling appears to yield better results for \ce{SiO2} and \ce{CaO}, while robust scaling seems more effective for \ce{MgO}. +Min-Max normalization appears to yield better results for \ce{SiO2} and \ce{CaO}, while robust scaling seems more effective for \ce{MgO}. 
For \ce{Al2O3}, Norm 3 scaling exhibits the lowest \gls{rmsecv} values but a higher mean \gls{rmsecv} value compared to the other scalers. Finally, Figure~\ref{fig:top100_transformers} reveals another nuanced finding. Power transformations appear to most frequently yield the best results across oxides, while quantile transformation or no transformation show the lowest \gls{rmsecv} values for the remaining oxides.
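
Reviewer note: the three scaling formulas touched by this change (Max Absolute Scaling, Min-Max normalization, Z-Score Standardization) can be checked against a small numeric example. The sketch below is illustrative only and is not the thesis pipeline. The function names, the handling of constant features, and the use of the population standard deviation are assumptions made for this example.

```python
from statistics import fmean, pstdev


def max_abs_scale(x):
    """x'_i = x_i / max(|x|). Zero entries stay zero."""
    m = max(abs(v) for v in x)
    if m == 0:
        # Assumption: an all-zero feature is returned unchanged.
        return [0.0 for _ in x]
    return [v / m for v in x]


def min_max_scale(x, a=0.0, b=1.0):
    """x'_i = (x_i - min(x)) / (max(x) - min(x)) * (b - a) + a."""
    lo, hi = min(x), max(x)
    if hi == lo:
        # Assumption: a constant feature is mapped to the lower bound a.
        return [a for _ in x]
    return [(v - lo) / (hi - lo) * (b - a) + a for v in x]


def z_score(x):
    """x'_i = (x_i - mean(x)) / std(x), using the population std (assumption)."""
    mu = fmean(x)
    sigma = pstdev(x)
    if sigma == 0:
        # Assumption: a constant feature is mapped to zeros.
        return [0.0 for _ in x]
    return [(v - mu) / sigma for v in x]


x = [-2.0, 0.0, 1.0, 4.0]
scaled = max_abs_scale(x)    # values lie in [-1, 1]; the zero entry stays zero
rescaled = min_max_scale(x)  # values lie in [0, 1]
standardized = z_score(x)    # mean 0 and standard deviation 1
```

On this input, Max Absolute Scaling leaves the zero entry at zero but rescales the mean along with every other value, which is why it is better described as not shifting the data than as leaving the mean unchanged.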