Merge pull request #227 from chhoumann/baseline-vs-stacking
[KB-311, KB-306] Baseline vs stacking - justification & further comparison
Pattrigue authored Jun 12, 2024
2 parents 4b6f879 + 88b034d commit 3cfb1ac
Showing 4 changed files with 234 additions and 4 deletions.
186 changes: 186 additions & 0 deletions baseline/eda/compare_old_vs_new_test_sets.ipynb

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion report_thesis/src/sections/baseline_replica.tex
@@ -34,7 +34,7 @@ \section{Baseline \& Replica}\label{sec:baseline_replica}
Following this discovery, we modified the replica to instead use the datasets in the same way as in the \gls{pls1-sm} phase, which yielded results aligning more closely with the original model.

Furthermore, our initial replica used a random train/test split for training, in contrast to the original model's manual curation to ensure representation of extreme compositions in both sets.
- This difference stemmed from the original authors' application of domain expertize in their dataset curation --- a process we could not directly replicate.
+ This difference stemmed from the original authors' application of domain expertise in their dataset curation --- a process we could not directly replicate.
Nevertheless, we found that automatically identifying extreme compositions and ensuring that they were present in both the training and testing sets brought us closer to the original model.
We chose to pull out the $n$ largest and smallest samples by concentration range, for each oxide, and reserve them for the training set.
Then we would do a random split on the remaining dataset, such that the final train/test split would be a $80\%/20\%$ split.
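For concreteness, a minimal sketch of this split in Python, assuming a pandas DataFrame with one row per sample and one column of concentrations per oxide; the column names, n, and the 80/20 target are illustrative assumptions, not the exact values used in the replica:

import pandas as pd

def extreme_aware_split(df, oxides, n=2, test_frac=0.2, seed=42):
    # Reserve the n largest and n smallest samples per oxide for training.
    extreme_idx = set()
    for oxide in oxides:
        extreme_idx.update(df[oxide].nlargest(n).index)
        extreme_idx.update(df[oxide].nsmallest(n).index)
    extremes = df.loc[sorted(extreme_idx)]
    remainder = df.drop(index=list(extreme_idx))
    # Randomly split the remainder so the overall split is ~80/20.
    test = remainder.sample(frac=len(df) * test_frac / len(remainder),
                            random_state=seed)
    train = pd.concat([extremes, remainder.drop(index=test.index)])
    return train, test

# e.g. train, test = extreme_aware_split(df, ["SiO2", "TiO2", "Al2O3"])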
41 changes: 38 additions & 3 deletions report_thesis/src/sections/experiments/stacking_ensemble.tex
@@ -72,14 +72,27 @@ \subsubsection{Results}\label{subsec:stacking_ensemble_results}
The 1:1 plot in Figure~\ref{fig:elasticnet_one_to_one} shows the near-constant predictions for \ce{TiO2} when using a \gls{enet} meta-learner, and Figure~\ref{fig:enetalpha01_one_to_one} shows the improved predictions with \texttt{alpha} = 0.1.
This leads us to conclude that the meta-learner's choice significantly impacts the \gls{rmsecv} and prediction outcomes.
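To make the meta-learner's role concrete, here is a minimal scikit-learn sketch of such a stacking setup; the base learners and their hyperparameters are illustrative assumptions, not the exact configuration evaluated in this section:

from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import ElasticNet
from sklearn.svm import SVR

base_learners = [
    ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
    ("svr", SVR(C=10.0)),
]
# Meta-learner: ElasticNet. Lowering alpha from the default of 1.0 weakens
# the regularization behind the near-constant TiO2 predictions noted above.
meta = ElasticNet(alpha=0.1)
stack = StackingRegressor(estimators=base_learners, final_estimator=meta, cv=5)
# stack.fit(X_train, y_train); y_pred = stack.predict(X_test)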

- The stacking approach demonstrated significant improvements in prediction accuracy as compared to the baseline we described in Section~\ref{subsec:baseline_results}, validating the efficacy of our methodology.
- We measured this by \gls{rmsep}, as it provides the fairest comparison between the baseline and the stacking approach.
+ The stacking approach demonstrated strong improvements in prediction accuracy compared to the baseline described in Section~\ref{sec:baseline_replica}, validating the efficacy of our methodology.
+ We measured this improvement using \gls{rmsep}, which provides the fairest comparison between the baseline and the stacking approach.
As mentioned, \gls{rmsep} evaluates the model's performance on the test set.
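For reference, \gls{rmsep} is conventionally defined over the $n$ samples of the test set as
\begin{equation*}
\mathrm{RMSEP} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\end{equation*}
where $y_i$ is the reference oxide concentration and $\hat{y}_i$ the model's prediction.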
In Section~\ref{sec:baseline_replica}, we described how the baseline test set was constructed by sorting extreme concentration values into the training set, and then performing a random split.
As noted in Section~\ref{subsec:validation_testing_procedures}, this work required a more sophisticated procedure to support our testing and validation strategy.
Despite the differences in test set construction, the test sets remained similar in composition\footnote{The analysis of this can be found on our GitHub repository: \url{https://github.com/chhoumann/thesis-chemcam}}, which allowed us to use \gls{rmsep} as a fair comparison metric.
Table~\ref{tab:stacking_ensemble_vs_moc} compares the \gls{rmsep} values of different oxides for the \gls{moc} (replica) model with three stacking ensemble models: \gls{enet} with $\alpha = 1$, \gls{enet} with $\alpha = 0.1$, and \gls{svr}.
Overall, the stacking ensemble models tend to produce lower \gls{rmsep} values compared to the \gls{moc} (replica) model.
Notably, \ce{SiO2}, \ce{TiO2}, \ce{Na2O}, and \ce{K2O} show large improvements across all stacking ensemble models.
For instance, the \gls{rmsep} for \ce{SiO2} is reduced from 5.61 (\gls{moc} (replica)) to around 3.59 (\gls{enet} with $\alpha = 1$) and further to 3.47 (\gls{svr}).
Similarly, \ce{TiO2} shows a reduction from 0.61 (\gls{moc} (replica)) to 0.32 (\gls{enet} with $\alpha = 0.1$).
The improvements are consistent across most oxides, with \gls{enet} and \gls{svr} models both outperforming the \gls{moc} (replica) model.
This shows that the ensemble approach, particularly with these meta-learners, enhances prediction accuracy for the oxides we tested.

The results presented above indicate a strong performance from the stacking ensemble approach.
However, it is important to note that some evaluation metrics are worse in the stacking approach than in certain individual configurations.
We believe that further tuning, particularly of the meta-learner's hyperparameters, could substantially improve these results.
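As a sketch of the kind of tuning we have in mind, the meta-learner's hyperparameters could be searched with scikit-learn's grid search over the stacked model; the grid values are hypothetical, and stack refers to the StackingRegressor sketched earlier:

from sklearn.model_selection import GridSearchCV

param_grid = {
    "final_estimator__alpha": [0.01, 0.1, 1.0],
    "final_estimator__l1_ratio": [0.2, 0.5, 0.8],
}
search = GridSearchCV(stack, param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
# search.fit(X_train, y_train); search.best_params_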

\begin{table}
\centering
- \caption{Stacking ensemble results using the \gls{enet} model as the meta-learner with default hyperparameters.}
+ \caption{Stacking ensemble results using the \gls{enet} model as the meta-learner with $\alpha = 1$.}
\begin{tabular}{lcccc}
\toprule
Oxide & \gls{rmsep} & Std. Dev. & \gls{rmsecv} & Std. Dev. CV \\
@@ -171,6 +184,28 @@ \subsubsection{Results}\label{subsec:stacking_ensemble_results}
\label{fig:elasticnet_one_to_one}
\end{figure*}

\begin{table}
\centering
\caption{Comparison of \gls{rmsep} values for the \gls{moc} (replica) model and various stacking ensemble models.}
\resizebox{0.45\textwidth}{!}{
\begin{tabular}{lcccc}
\toprule
Oxide & \gls{moc} (replica) & \gls{enet} ($\alpha = 1$) & \gls{enet} ($\alpha = 0.1$) & \gls{svr} \\
\midrule
\ce{SiO2} & 5.61 & 3.59 & 3.60 & \textbf{3.47} \\
\ce{TiO2} & 0.61 & 0.57 & \textbf{0.32} & 0.34 \\
\ce{Al2O3} & 2.47 & \textbf{1.66} & 1.66 & 1.73 \\
\ce{FeO_T} & 1.82 & 1.79 & 1.84 & \textbf{1.69} \\
\ce{MgO} & 1.56 & \textbf{0.71} & 0.77 & 0.82 \\
\ce{CaO} & 2.09 & 1.64 & 1.65 & \textbf{1.59} \\
\ce{Na2O} & 1.33 & 0.47 & 0.44 & \textbf{0.37} \\
\ce{K2O} & 1.91 & \textbf{0.48} & 0.49 & 0.51 \\
\bottomrule
\end{tabular}
}
\label{tab:stacking_ensemble_vs_moc}
\end{table}

\begin{figure*}
\centering
\resizebox{0.75\textwidth}{!}{
@@ -58,6 +58,15 @@ \subsection{Validation and Testing Procedures for Model Evaluation}\label{subsec:validation_testing_procedures}

This necessitates careful dataset partitioning to ensure that the model training process accounts for these challenges, improving the generalizability and robustness of the models.

In Section~\ref{sec:baseline_replica}, we described how we ensured representation of extreme compositions in both the training and testing sets by automatically identifying the $n$ largest and smallest samples by concentration range for each oxide and reserving them for the training set.
We then performed a random split on the remaining dataset, resulting in a final train/test split of 80\%/20\%.
In this process, we also employed a rudimentary procedure to prevent data leakage, ensuring that each target appeared only once in the training set.
The baseline did not employ cross-validation, as our goal was to replicate the \gls{moc} model that was presented in \citet{cleggRecalibrationMarsScience2017}.
We note that this procedure is insufficient to support the testing and validation strategy we have laid out above, as it does not support $k$-fold cross-validation.
A random $k$-fold split of the training data would not account for the uneven distribution of extreme values across the folds, and would furthermore cause data leakage between the folds.
Moreover, the procedure failed to consider the concentration of each oxide individually, instead aggregating concentrations across all oxides. This is a significant limitation: it produces a single test set that is treated as uniform across all oxides, thereby neglecting the unique distribution characteristics of individual oxides.
Therefore, a more sophisticated procedure is needed to ensure that the data partitioning accounts for these challenges.
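As one illustration of the direction such a procedure could take, scikit-learn's GroupKFold keeps all spectra from the same target in a single fold, which addresses the leakage issue, though not, by itself, the per-oxide distribution of extreme values; the "target" column name is a hypothetical assumption:

from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=df["target"]):
    # All rows sharing a target land in exactly one fold, so no target
    # appears on both sides of a train/validation boundary.
    ...  # fit on train_idx, evaluate on val_idx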

\subsubsection{Dataset Partitioning}\label{subsubsec:dataset_partitioning}
To ensure rigorous evaluation of our models and to address the challenges of data leakage and uneven distribution of extreme values, we have implemented a customized $k$-fold data partitioning procedure.
This approach divides the dataset into $k$ folds, which are used to define cross-validation datasets, as well as a training set and a test set.
