Merge pull request #215 from chhoumann/conclusion
Conclusion & Abstract?
Showing 8 changed files with 51 additions and 15 deletions.
report_thesis/src/sections/background/preprocessing/max_abs.tex (6 changes: 3 additions & 3 deletions)
@@ -1,12 +1,12 @@
 \subsubsection{Max Absolute Scaler}
-Max absolute scaling is a normalization technique that scales each feature individually so that the maximum absolute value of each feature is 1.
+Max Absolute Scaling is a normalization technique that scales each feature individually so that the maximum absolute value of each feature is 1.
 This results in the data being normalized to a range between -1 and 1.
-The formula for max absolute scaling is given by:
+The formula for Max Absolute Scaling is given by:

 $$
 x'_i = \frac{x_i}{\max(|\mathbf{x}|)},
 $$

 where $x_i$ is the original feature value, $\max(|\mathbf{x}|)$ is the maximum absolute value of the feature vector $\mathbf{x}$, and $x'_i$ is the normalized feature value.
-This scaling method is particularly useful for data that has been centered at zero or is sparse, as max absolute scaling does not alter the mean of the data.
+This scaling method is particularly useful for data that has been centered at zero or is sparse, as Max Absolute Scaling does not alter the mean of the data.
 Additionally, it preserves the sparsity of the data by ensuring that zero entries remain zero, thereby not introducing any non-zero values~\cite{Vasques2024}.
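The change in this hunk is purely a capitalization fix, but the formula maps directly onto standard tooling. A minimal, illustrative sketch of Max Absolute Scaling in Python, assuming NumPy and scikit-learn are available (this is not the thesis code):

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Toy matrix: rows are samples, columns are features (e.g. wavelength channels).
X = np.array([
    [ 2.0, -1.0,  0.0],
    [ 4.0,  0.5,  0.0],
    [-8.0,  0.25, 3.0],
])

# Manual max absolute scaling: divide each column by its maximum absolute value,
# so every feature lands in [-1, 1] and zero entries stay zero.
X_manual = X / np.abs(X).max(axis=0)

# Equivalent result using scikit-learn's MaxAbsScaler.
X_sklearn = MaxAbsScaler().fit_transform(X)

assert np.allclose(X_manual, X_sklearn)
print(X_sklearn)
```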
report_thesis/src/sections/background/preprocessing/min-max.tex (4 changes: 2 additions & 2 deletions)
report_thesis/src/sections/background/preprocessing/z-score.tex (8 changes: 4 additions & 4 deletions)
@@ -1,12 +1,12 @@
-\subsubsection{Z-score Normalization}
-Z-score normalization, also known as zero-mean normalization, transforms data to have a mean of zero and a standard deviation of one.
+\subsubsection{Z-Score Standardization}
+Z-Score Standardization, also known as zero-mean normalization, transforms data to have a mean of zero and a standard deviation of one.
 This technique is useful when the actual minimum and maximum of a feature are unknown or when outliers may significantly skew the distribution.
-The z-score normalization of a feature vector \(\mathbf{x}\) is given by:
+The Z-Score Standardization of a feature vector \(\mathbf{x}\) is given by:

 $$
 x'_i = \frac{x_i - \overline{\mathbf{x}}}{\sigma_\mathbf{x}},
 $$

 where \(x_i\) is the original value, \(\overline{\mathbf{x}}\) is the mean of the feature vector \(\mathbf{x}\), \(\sigma_\mathbf{x}\) is the standard deviation of the feature vector \(\mathbf{x}\), and \(x'_i\) is the normalized feature value.
 By transforming the data using the Z-score, each value reflects its distance from the mean in terms of standard deviations.
-Z-score normalization is particularly advantageous in scenarios where data features have different units or scales, or when preparing data for algorithms that assume normally distributed inputs~\cite{dataminingConcepts}.
+Z-Score Standardization is particularly advantageous in scenarios where data features have different units or scales, or when preparing data for algorithms that assume normally distributed inputs~\cite{dataminingConcepts}.
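Likewise, the Z-Score Standardization formula in this hunk corresponds to the usual per-feature standardization. A minimal, illustrative Python sketch, assuming NumPy and scikit-learn (not the thesis implementation):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix: rows are samples, columns are features with very different scales.
X = np.array([
    [1.0, 200.0],
    [2.0, 400.0],
    [3.0, 600.0],
    [4.0, 800.0],
])

# Manual z-score: subtract the per-feature mean, divide by the per-feature standard deviation.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Equivalent result using scikit-learn's StandardScaler
# (both use the population standard deviation, ddof=0).
X_sklearn = StandardScaler().fit_transform(X)

assert np.allclose(X_manual, X_sklearn)
print(X_manual.mean(axis=0), X_manual.std(axis=0))  # approximately 0 and 1 per feature
```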
@@ -1 +1,33 @@
 \section{Conclusion}\label{sec:conclusion}
+This thesis set out to advance the analysis of \gls{libs} data for predicting major oxide compositions in geological samples.
+By integrating sophisticated machine learning techniques and ensemble regression models, we aimed to tackle the substantial challenges posed by the high-dimensional, nonlinear nature of \gls{libs} data.
+
+Our research confronted and addressed critical challenges, including the complexities of high dimensionality, non-linearity, multicollinearity, and the limited availability of data.
+These issues traditionally hinder the accurate prediction of major oxides from spectral data, necessitating the development of robust and adaptive computational methodologies.
+
+Throughout our study, we systematically explored a diverse range of machine learning models, categorized into ensemble learning models, linear and regularization models, and neural network models.
+Using the developed evaluation framework, we identified the strengths and limitations of each model in relation to predicting major oxides within the context of \gls{libs} data analysis.
+
+Normalization and transformation techniques played a crucial role in our approach.
+We investigated and employed various methods such as Z-Score Standardization, Max Absolute Scaling, Min-Max normalization, robust scaling, Norm 3, power transformation, and quantile transformation.
+These techniques proved vital for standardizing the data, managing different scales, and ultimately enhancing the performance of our models.
+
+Dimensionality reduction techniques such as \gls{pca} and \gls{kernel-pca} showed potential in managing the high dimensionality of the spectral data; however, their efficacy was not conclusively demonstrated.
+
+One of the key innovations in our approach was the use of stacked generalization.
+Such an approach has seen limited use in the field of \gls{libs} data analysis, and our work demonstrated its potential in this context.
+This ensemble method combined the predictions of multiple base models, each trained on the same data, through a meta-learner.
+By leveraging the strengths of various models and mitigating their individual weaknesses, this technique significantly improved generalization on unseen data.
+
+We also designed and implemented a framework built on the automated hyperparameter optimization tool Optuna.
+This framework allowed us to identify the most effective combinations of preprocessing methods and models tailored to the specific characteristics of each oxide, ensuring highly effective performance.
+
+Finally, we designed and implemented a data partitioning method that addresses the challenges of data leakage and uneven distribution of extreme values, ensuring robust and reliable model evaluation.
+
+The outcome of our work is a comprehensive catalog of machine learning models and preprocessing techniques for predicting major oxide compositions in \gls{libs} data.
+This catalog, featuring highly effective configurations, provides a resource for future research and for model and preprocessor selection.
+
+Moreover, our contributions extend beyond this thesis.
+We integrated our findings into the \gls{pyhat} library developed by the \gls{usgs}, thereby enhancing its capabilities for the scientific community.
+
+In conclusion, by addressing the inherent challenges and developing a robust computational framework, this thesis has laid the groundwork for future advancements in geochemical analysis and planetary exploration using \gls{libs} data.
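The stacked generalization described in the new conclusion (heterogeneous base models whose predictions feed a meta-learner) can be sketched with off-the-shelf tooling. The following is a hedged illustration only: the base models, meta-learner, preprocessing, and synthetic data are stand-ins, not the configurations used in the thesis.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import ElasticNet, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for a single-oxide regression problem (the real data is LIBS spectra).
X, y = make_regression(n_samples=300, n_features=50, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Heterogeneous base models, each fitted on the same training data.
base_models = [
    ("gbr", GradientBoostingRegressor(random_state=0)),
    ("enet", make_pipeline(StandardScaler(), ElasticNet(alpha=0.1, random_state=0))),
    ("svr", make_pipeline(StandardScaler(), SVR(C=10.0))),
]

# The meta-learner (here a Ridge regression) is trained on out-of-fold predictions
# of the base models, which is the core idea of stacked generalization.
stack = StackingRegressor(estimators=base_models, final_estimator=Ridge(), cv=5)
stack.fit(X_train, y_train)
print("Held-out R^2:", stack.score(X_test, y_test))
```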
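Similarly, the Optuna-based framework mentioned in the conclusion (identifying effective combinations of preprocessing methods and models per oxide) follows a common pattern that can be sketched as below. The search space, scalers, model, and objective here are illustrative assumptions, not the thesis's actual setup.

```python
import optuna
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler

# Synthetic stand-in for one oxide's regression target.
X, y = make_regression(n_samples=300, n_features=50, noise=0.1, random_state=0)

SCALERS = {
    "zscore": StandardScaler,
    "maxabs": MaxAbsScaler,
    "minmax": MinMaxScaler,
}

def objective(trial: optuna.Trial) -> float:
    # Jointly search over a preprocessing method and a model hyperparameter.
    scaler = SCALERS[trial.suggest_categorical("scaler", list(SCALERS))]()
    alpha = trial.suggest_float("alpha", 1e-3, 1e2, log=True)
    pipeline = make_pipeline(scaler, Ridge(alpha=alpha))
    # Minimize cross-validated RMSE.
    scores = cross_val_score(pipeline, X, y, cv=5, scoring="neg_root_mean_squared_error")
    return -scores.mean()

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```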