Conclusion & Abstract? #215

Merged
merged 7 commits into from
Jun 12, 2024
Changes from 3 commits
6 changes: 5 additions & 1 deletion report_thesis/src/index.tex
@@ -3,7 +3,11 @@
\newpage

\begin{abstract}
Abstract
This thesis advances the analysis of \gls{libs} data for predicting major oxide compositions in geological samples.
By integrating machine learning techniques and ensemble regression models, the study addresses challenges like high dimensionality, non-linearity, multicollinearity, and limited data availability.
Key innovations include the use of stacked generalization for improved model performance and automated hyperparameter optimization with Optuna.
The research contributes a comprehensive catalog of models and preprocessing techniques, and integrates findings into the \gls{pyhat} by the \gls{usgs}, enhancing its scientific capabilities.
This work lays a robust foundation for future advancements in geochemical analysis and planetary exploration using \gls{libs} data.
\end{abstract}

\maketitle
@@ -1,6 +1,6 @@
\subsection{Preprocessing}
In this subsection, we discuss the preprocessing methods used in our machine learning pipeline.
We cover the following normalization techniques: Z-score normalization, max absolute scaling, min-max normalization, robust scaling, Norm 3, power transformation, and quantile transformation.
We cover the following normalization techniques: Z-Score standardization, Max Absolute scaling, Min-Max normalization, robust scaling, Norm 3, power transformation, and quantile transformation.
These techniques are essential for standardizing data, handling different scales, and improving the performance of machine learning models.
For the purposes of this discussion, let $\mathbf{x}$ be a feature vector with values $x_1, x_2, \ldots, x_n$.

@@ -1,12 +1,12 @@
\subsubsection{Max Absolute Scaler}
Max absolute scaling is a normalization technique that scales each feature individually so that the maximum absolute value of each feature is 1.
Max Absolute Scaling is a normalization technique that scales each feature individually so that the maximum absolute value of each feature is 1.
This results in the data being normalized to a range between -1 and 1.
The formula for max absolute scaling is given by:
The formula for Max Absolute Scaling is given by:

$$
x'_i = \frac{x_i}{\max(|\mathbf{x}|)},
$$

where $x_i$ is the original feature value, $\max(|\mathbf{x}|)$ is the maximum absolute value of the feature vector $\mathbf{x}$, and $x'_i$ is the normalized feature value.
This scaling method is particularly useful for data that has been centered at zero or is sparse, as max absolute scaling does not alter the mean of the data.
This scaling method is particularly useful for data that has been centered at zero or is sparse, as Max Absolute Scaling only divides by a constant and therefore does not shift or center the data.
Additionally, it preserves the sparsity of the data by ensuring that zero entries remain zero, thereby not introducing any non-zero values~\cite{Vasques2024}.
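As a brief illustration, using made-up values rather than measurements from our data, consider $\mathbf{x} = (-4, 2, 0, 1)$.
Then $\max(|\mathbf{x}|) = 4$ and

$$
\mathbf{x}' = \left(\frac{-4}{4}, \frac{2}{4}, \frac{0}{4}, \frac{1}{4}\right) = (-1, 0.5, 0, 0.25),
$$

so all values lie in $[-1, 1]$ and the zero entry remains zero.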
@@ -1,7 +1,7 @@
\subsubsection{Min-Max Normalization}\label{subsec:min-max}
Min-max normalization rescales the range of features to a specific range $[a, b]$, where $a$ and $b$ represent the new minimum and maximum values, respectively.
Min-Max normalization rescales the range of features to a specific range $[a, b]$, where $a$ and $b$ represent the new minimum and maximum values, respectively.
The goal is to normalize the range of the data to a specific scale, typically 0 to 1.
The min-max normalization of a feature vector $\mathbf{x}$ is given by:
The Min-Max normalization of a feature vector $\mathbf{x}$ is given by:

$$
x'_i = \frac{x_i - \min(\mathbf{x})}{\max(\mathbf{x}) - \min(\mathbf{x})}(b - a) + a,
@@ -1,12 +1,12 @@
\subsubsection{Z-score Normalization}
Z-score normalization, also known as zero-mean normalization, transforms data to have a mean of zero and a standard deviation of one.
\subsubsection{Z-Score Standardization}
Z-Score Standardization, also known as zero-mean normalization, transforms data to have a mean of zero and a standard deviation of one.
This technique is useful when the actual minimum and maximum of a feature are unknown or when outliers may significantly skew the distribution.
The z-score normalization of a feature vector \(\mathbf{x}\) is given by:
The Z-Score Standardization of a feature vector \(\mathbf{x}\) is given by:

$$
x'_i = \frac{x_i - \overline{\mathbf{x}}}{\sigma_\mathbf{x}},
$$

where \(x_i\) is the original value, \(\overline{\mathbf{x}}\) is the mean of the feature vector \(\mathbf{x}\), \(\sigma_\mathbf{x}\) is the standard deviation of the feature vector \(\mathbf{x}\), and \(x'_i\) is the normalized feature value.
By transforming the data using the Z-score, each value reflects its distance from the mean in terms of standard deviations.
Z-score normalization is particularly advantageous in scenarios where data features have different units or scales, or when preparing data for algorithms that assume normally distributed inputs~\cite{dataminingConcepts}.
Z-Score Standardization is particularly advantageous in scenarios where data features have different units or scales, or when preparing data for algorithms that assume normally distributed inputs~\cite{dataminingConcepts}.
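As a brief illustration, using made-up values rather than measurements from our data, consider $\mathbf{x} = (2, 4, 4, 4, 5, 5, 7, 9)$.
We assume here that $\sigma_\mathbf{x}$ denotes the population standard deviation (dividing by $n$); the definition above does not fix this choice, and using the sample standard deviation would give slightly different values.
Under this assumption, $\overline{\mathbf{x}} = 5$ and $\sigma_\mathbf{x} = \sqrt{32/8} = 2$, so

$$
\mathbf{x}' = (-1.5, -0.5, -0.5, -0.5, 0, 0, 1, 2),
$$

which has a mean of zero and a standard deviation of one.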
33 changes: 32 additions & 1 deletion report_thesis/src/sections/conclusion.tex
@@ -1 +1,32 @@
\section{Conclusion}\label{sec:conclusion}
This thesis set out to advance the analysis of \gls{libs} data for predicting major oxide compositions in geological samples.
By integrating sophisticated machine learning techniques and ensemble regression models, we aimed to tackle the substantial challenges posed by the high-dimensional, non-linear nature of \gls{libs} data.

Our research confronted and addressed critical challenges, including the complexities of high dimensionality, non-linearity, multicollinearity, and the limited availability of data.
These issues traditionally hinder the accurate prediction of major oxides from spectral data, necessitating the development of robust and adaptive computational methodologies.

Throughout our study, we systematically explored a diverse range of machine learning models, categorized into ensemble learning models, linear and regularization models, and neural network models.
By implementing a rigorous evaluation framework, we identified the strengths and limitations of each model within the context of \gls{libs} data analysis.

Normalization and transformation techniques played a crucial role in our approach.
We investigated and employed various methods such as Z-Score standardization, Max Absolute scaling, Min-Max normalization, robust scaling, Norm 3, power transformation, and quantile transformation.
These techniques were vital for standardizing the data, managing different scales, and ultimately enhancing the performance of our models.

Dimensionality reduction methods like \gls{pca} and \gls{kernel-pca} seemed promising in addressing the high dimensionality of the spectral data, but did not yield conclusive results regarding their efficacy.

One of the key innovations in our approach was the use of stacked generalization.
This ensemble method used the predictions of multiple base models, each trained on the same data, as inputs to a meta-learner that produced the final prediction.
This technique leveraged the strengths of various models, mitigated their individual weaknesses, and significantly improved generalization on unseen data.

We also designed and implemented a framework using the automated hyperparameter optimization tool, Optuna.
This framework allowed us to identify the most effective combinations of preprocessing methods and models tailored to the specific characteristics of each oxide, ensuring optimal performance.

Finally, we designed and implemented a method for data partitioning that addresses the challenges of data leakage and uneven distribution of extreme values, ensuring robust and reliable model evaluation.

The outcome of our work is a comprehensive catalog of machine learning models and preprocessing techniques for predicting major oxide compositions from \gls{libs} data.
This catalog, featuring highly effective configurations, provides a resource for future research and model selection.

Moreover, our contributions extend beyond this thesis.
We integrated our findings into the \gls{pyhat} developed by the \gls{usgs}, enhancing its capabilities for the scientific community.

In conclusion, by addressing the inherent challenges and developing a robust computational framework, this thesis has laid the groundwork for future advancements in geochemical analysis and planetary exploration using \gls{libs} data.
@@ -3,7 +3,7 @@ \subsection{Model and Preprocessing Selection}\label{sec:model_selection}

We had several considerations to guide our selection of preprocessing techniques.
Firstly, our review of the literature revealed that there seems to be no consensus on a single, most effective normalization method for \gls{libs} data.
Therefore, we included traditional normalization methods in our experiments, such as z-score normalization, Min-Max scaling, and Max Absolute scaling.
Therefore, we included traditional normalization methods in our experiments, such as Z-Score Standardization, Min-Max normalization, and Max Absolute Scaling.
This approach allowed us to determine which normalization method was most effective for our dataset.
Additionally, the literature considers dimensionality reduction techniques effective for \gls{libs} data due to its high dimensionality.
Specifically, \gls{pca} has been widely adopted by the spectroscopic community as an established dimensionality reduction technique~\cite{pca_review_paper}.
@@ -50,7 +50,7 @@ \subsection{Model and Preprocessing Selection}\label{sec:model_selection}
\toprule
\textbf{Normalization / Scaling:} \\
\midrule
Z-Score Normalization \\
Z-Score Standardization \\
Min-Max Normalization \\
Max Absolute Scaling \\
Robust Scaling \\
@@ -59,7 +59,7 @@ \subsection{Optimization Results}\label{sec:optimization_results}
This indicates that they are indeed used in some top-performing configurations.
However, based on the results in Table~\ref{tab:pca_comparison}, we did not expect them to be as prevalent as they are, suggesting that while they are not the most frequently used, they can still be highly effective in specific scenarios.
Interestingly, Figure~\ref{fig:top100_scalers} shows that, although \texttt{Norm3Scaler} is the most frequently used and best-performing scaler overall, it is not the best choice for every oxide.
Min-max scaling appears to yield better results for \ce{SiO2} and \ce{CaO}, while robust scaling seems more effective for \ce{MgO}.
Min-Max normalization appears to yield better results for \ce{SiO2} and \ce{CaO}, while robust scaling seems more effective for \ce{MgO}.
For \ce{Al2O3}, Norm 3 scaling exhibits the lowest \gls{rmsecv} values but a higher mean \gls{rmsecv} value compared to the other scalers.
Finally, Figure~\ref{fig:top100_transformers} reveals another nuanced finding.
Power transformations appear to yield the best results most frequently across oxides, while quantile transformation or no transformation shows the lowest \gls{rmsecv} values for the remaining oxides.