Merge pull request #235 from chhoumann/clarify
Clarify role of std.dev, clarify cv metrics
Pattrigue authored Jun 12, 2024
2 parents 172042b + 58b83de commit 38d7827
Showing 4 changed files with 47 additions and 9 deletions.
14 changes: 13 additions & 1 deletion report_thesis/src/references.bib
@@ -866,4 +866,16 @@ @article{sirven_pca_ann_plsr
note = {PMID: 16503595},
url = {https://doi.org/10.1021/ac051721p},
eprint = {https://doi.org/10.1021/ac051721p}
-}
+}

@book{geronHandsonMachineLearning2023,
title = {Hands-on Machine Learning with {{Scikit-Learn}}, {{Keras}}, and {{TensorFlow}}: Concepts, Tools, and Techniques to Build Intelligent Systems},
shorttitle = {Hands-on Machine Learning with {{Scikit-Learn}}, {{Keras}}, and {{TensorFlow}}},
author = {Géron, Aurélien},
date = {2023},
edition = {Third edition},
publisher = {O'Reilly},
isbn = {978-1-09-812597-4},
langid = {english},
pagetotal = {834},
}
23 changes: 20 additions & 3 deletions report_thesis/src/sections/experiments/stacking_ensemble.tex
@@ -59,6 +59,23 @@ \subsubsection{Results for Stacking Ensemble}\label{subsec:stacking_ensemble_res
The evaluation metrics are shown in Table~\ref{tab:stacking_ensemble_results_enet}, Table~\ref{tab:stacking_ensemble_results_enet_01}, and Table~\ref{tab:stacking_ensemble_results_svr}.
Additionally, we provide 1:1 plots for each ensemble in Figures~\ref{fig:elasticnet_one_to_one}, \ref{fig:enetalpha01_one_to_one}, and \ref{fig:svr_one_to_one}, showing the actual versus predicted values for each oxide.

For the \gls{enet} meta-learner with $\alpha = 1$, the \gls{rmsep} values range from 0.470 for \ce{Na2O} to 3.588 for \ce{SiO2}.
The \gls{rmsecv} values are generally higher, which could initially suggest overfitting.
However, considering our testing and validation strategy, this discrepancy is expected.
Our method for partitioning ensures that extreme values are included in the training folds but not in the test set, making the test set easier to predict.
This results in lower \gls{rmsep} values compared to \gls{rmsecv} values, which is a deliberate trade-off to provide a fairer assessment of the model's generalization performance.
The standard deviations of the \gls{rmsecv} are relatively low, suggesting consistent performance across folds.

When the \gls{enet} meta-learner's $\alpha$ is reduced to 0.1, there is a noticeable improvement in the \gls{rmsep} for \ce{TiO2}, dropping from 0.571 to 0.319.
This suggests that reducing the regularization strength allows the meta-learner to better capture the variance in the data.
The \gls{rmsecv} values also show a slight improvement, indicating better generalization.
However, the standard deviations remain similar, suggesting that the model's consistency across folds is maintained.
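To give a concrete, hedged sense of this effect, the following sketch compares two regularization strengths for an ElasticNet meta-learner on synthetic data; the dataset, features, and settings are illustrative assumptions, not our actual pipelines or results.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the base-estimator predictions fed to the meta-learner;
# the thesis pipelines and data are not reproduced here.
X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rmse = {}
for alpha in (1.0, 0.1):
    meta = ElasticNet(alpha=alpha, random_state=0).fit(X_train, y_train)
    pred = meta.predict(X_test)
    rmse[alpha] = mean_squared_error(y_test, pred) ** 0.5

# Weaker regularization shrinks the coefficients less, which on a strongly
# linear signal typically lowers the held-out RMSE.
print(rmse)
```

On data with a strong linear signal, the lighter penalty generally tracks the targets more closely, mirroring the drop in \gls{rmsep} observed when $\alpha$ is reduced from 1 to 0.1.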

The \gls{svr} meta-learner shows the best performance for several oxides, particularly \ce{SiO2} and \ce{Na2O}, with \gls{rmsep} values of 3.473 and 0.369, respectively.

We generally observe that the standard deviation metrics are close to the corresponding \gls{rmse} values, indicating that the mean prediction error is small and that the errors are not dominated by systematic bias.
This suggests that the predictions are robust rather than skewed in one direction.
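Assuming the reported standard deviation is that of the prediction errors, this observation follows from a standard identity: with errors $e_i = \hat{y}_i - y_i$ and mean error $\bar{e}$,

```latex
\[
\mathrm{RMSE}^2
= \frac{1}{N}\sum_{i=1}^{N} e_i^2
= \underbrace{\frac{1}{N}\sum_{i=1}^{N}\left(e_i - \bar{e}\right)^2}_{\text{error variance}}
+ \underbrace{\bar{e}^{\,2}}_{\text{squared mean error}},
\]
```

so the standard deviation can only approach the \gls{rmse} when the mean error $\bar{e}$ is close to zero.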

A notable observation from our results is that different meta-learners exhibited varying performance levels across oxides.
We observed that the final predictions were strongly affected by the meta-learner, going as far as rendering some predictions nonsensical if the wrong meta-learner was chosen.
Specifically, for \ce{TiO2}, we observed that predictions remained near-constant values despite varying the combination of model configurations in the \ce{TiO2} ensemble.
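One way to illustrate how strongly the meta-learner shapes the combined predictions is with scikit-learn's StackingRegressor; the base estimators, data, and hyperparameters below are placeholder assumptions for the sketch, not our actual oxide-specific ensemble configurations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import ElasticNet
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=10, noise=3.0, random_state=1)

# Placeholder base estimators; the thesis uses oxide-specific pipelines.
base_estimators = [
    ("gbr", GradientBoostingRegressor(random_state=1)),
    ("enet", ElasticNet(alpha=0.1, random_state=1)),
]

# The final_estimator (meta-learner) decides how the base predictions are
# combined, which is why its choice can dominate the final output.
for name, meta in [("enet", ElasticNet(alpha=0.1)), ("svr", SVR())]:
    stack = StackingRegressor(estimators=base_estimators,
                              final_estimator=meta, cv=5)
    stack.fit(X, y)
    print(name, stack.predict(X[:3]))
```

Swapping only `final_estimator` while holding the base estimators fixed is the same experimental axis varied across the three result tables.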
@@ -95,7 +112,7 @@ \subsubsection{Results for Stacking Ensemble}\label{subsec:stacking_ensemble_res
\caption{Stacking ensemble results using the \gls{enet} model as the meta-learner with $\alpha = 1$.}
\begin{tabular}{lcccc}
\toprule
-Oxide & \gls{rmsep} & STDDEV & \gls{rmsecv} & Std. Dev. CV \\
+Oxide & \gls{rmsep} & Std. Dev. & \gls{rmsecv} & Std. Dev. CV \\
\midrule
\ce{SiO2} & 3.588 & 3.582 & 4.680 $\pm$ 0.500 & 4.670 $\pm$ 0.516 \\
\ce{TiO2} & 0.571 & 0.565 & 0.818 $\pm$ 0.111 & 0.814 $\pm$ 0.117 \\
@@ -115,7 +132,7 @@ \subsubsection{Results for Stacking Ensemble}\label{subsec:stacking_ensemble_res
\caption{Stacking ensemble results using the \gls{enet} model as the meta-learner with $\alpha = 0.1$.}
\begin{tabular}{lcccc}
\toprule
-Oxide & \gls{rmsep} & STDDEV & \gls{rmsecv} & Std. Dev. CV \\
+Oxide & \gls{rmsep} & Std. Dev. & \gls{rmsecv} & Std. Dev. CV \\
\midrule
\ce{SiO2} & 3.598 & 3.591 & 4.686 $\pm$ 0.489 & 4.677 $\pm$ 0.505 \\
\ce{TiO2} & 0.319 & 0.310 & 0.450 $\pm$ 0.083 & 0.448 $\pm$ 0.083 \\
@@ -135,7 +152,7 @@ \subsubsection{Results for Stacking Ensemble}\label{subsec:stacking_ensemble_res
\caption{Stacking ensemble results using the \gls{svr} model as the meta-learner with default hyperparameters.}
\begin{tabular}{lcccc}
\toprule
-Oxide & \gls{rmsep} & STDDEV & \gls{rmsecv} & Std. Dev. CV \\
+Oxide & \gls{rmsep} & Std. Dev. & \gls{rmsecv} & Std. Dev. CV \\
\midrule
\ce{SiO2} & 3.473 & 3.478 & 5.064 $\pm$ 0.932 & 5.061 $\pm$ 0.926 \\
\ce{TiO2} & 0.340 & 0.333 & 0.442 $\pm$ 0.087 & 0.442 $\pm$ 0.087 \\
@@ -2,8 +2,8 @@ \subsection{Optimization Framework}\label{sec:optimization_framework}
One of the primary challenges in developing a stacking ensemble is determining the optimal choice of base estimators. \citet{wolpertstacked_1992} highlighted that this can be considered a 'black art' and that the choice usually relies on intelligent guesses.
In our case, this problem is further exacerbated by the fact that the optimal choice of base estimator may vary depending on the target oxide.
The complexity of the problem is increased because different oxides require different models, and the optimal preprocessing techniques will depend on both the model and the specific oxide being predicted.
-Due to the challenges highligted in \ref{subsec:challenges}, namely high dimensionality, multicollinearity, and matrix effects, it is difficult to determine which configuration is optimal.
-Selecting the appropriate preprocessing steps for each base estimator is essential, as incorrect preprocessing can significantly degrade performance and undermine the model's effectiveness
+Due to the challenges highlighted in \ref{subsec:challenges}, namely high dimensionality, multicollinearity, and matrix effects, it is difficult to determine which configuration is optimal.
+Selecting the appropriate preprocessing steps for each base estimator is essential, as incorrect preprocessing can significantly degrade performance and undermine the model's effectiveness.
Furthermore, choosing the right hyperparameters for each base estimator introduces additional complexity, as these decisions also significantly impact model performance and must be carefully tuned for each specific oxide.
Some estimators might require very little tuning to achieve accurate and robust predictions, while others might require extensive tuning, depending on the target oxide.
For instance, simpler approaches like \gls{enet} and ridge regression may quickly reach their optimal performance with minimal hyperparameter adjustments. However, due to their simplicity, they often fail to capture the complex patterns in the data that more advanced models can, making them less competitive despite their ease of tuning.
@@ -16,14 +16,14 @@ \subsection{Optimization Framework}\label{sec:optimization_framework}
To guide this process, we have developed a working assumption.
Specifically, we assume that selecting the top-$n$ best pipelines for each oxide, considering different preprocessors and models for each pipeline, will result in the best pipelines for a given oxide in our stacking ensemble.
Here, $n$ is a heuristic based on the results and \textit{best} is evaluated in terms of the metrics outlined in Section~\ref{subsec:evaluation_metrics}.
-Additionaly, each permutation will utilize our proposed data partitioning and cross-validation strategy outlined in Section~\ref{subsec:validation_testing_procedures}.
-Utilizing our proposed data partitioning and cross-validation strategy, along with the aformentioned evaluation metrics, will ensure that the top-$n$ pipelines align with our goals of generalization, robustness, and accuracy outlined in Section~\ref{sec:problem_definition}.
+Additionally, each permutation will utilize our proposed data partitioning and cross-validation strategy outlined in Section~\ref{subsec:validation_testing_procedures}.
+Utilizing our proposed data partitioning and cross-validation strategy, along with the aforementioned evaluation metrics, will ensure that the top-$n$ pipelines align with our goals of generalization, robustness, and accuracy outlined in Section~\ref{sec:problem_definition}.
This narrows our focus to three key tasks: selecting suitable preprocessors and models, finding the optimal hyperparameters, and devising a guided search strategy to evaluate various permutations and identify the top-$n$ pipelines for each oxide.
First, we curated a diverse set of models and preprocessing techniques, as detailed in Section~\ref{sec:model_selection}.
Next, we developed an optimization framework to systematically explore and optimize these pipeline configurations, which will be described in the following section.
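The working assumption above amounts to a simple ranking step, sketched here with hypothetical pipeline names and scores; none of these values come from our experiments.

```python
# Hypothetical (preprocessor, model, rmse_cv) results for a single oxide;
# real scores would come from the cross-validation procedure described above.
candidates = [
    ("norm3", "gbr", 4.81),
    ("pca", "svr", 4.52),
    ("norm1", "enet", 5.10),
    ("robust_scaler", "pls", 4.63),
]

def top_n_pipelines(results, n):
    """Rank candidate pipelines by cross-validated RMSE and keep the best n."""
    return sorted(results, key=lambda r: r[2])[:n]

best = top_n_pipelines(candidates, n=2)
print(best)  # → [('pca', 'svr', 4.52), ('robust_scaler', 'pls', 4.63)]
```

In practice the ranking would use all of the evaluation metrics, not RMSE alone, but the selection step itself is this simple once the scores exist.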

\subsubsection{The Framework}
-To systematically explore and optimize pipeline configurations, the search process should be guided by an ojective function.
+To systematically explore and optimize pipeline configurations, the search process should be guided by an objective function.
Based on the evaluation process outlined in Section~\ref{subsec:validation_testing_procedures}, where we argue that evaluating solely on the \gls{rmsep} may yield misleading results, we define the optimization as a multi-objective problem: jointly minimizing \texttt{rmse\_cv} and \texttt{std\_dev\_cv}.
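A minimal sketch of these two objectives, under the assumption that \texttt{rmse\_cv} and \texttt{std\_dev\_cv} are the per-fold RMSE and per-fold error standard deviation averaged over folds; the model and data are illustrative placeholders.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def cv_objectives(model, X, y, n_splits=5):
    """Per-fold RMSE and per-fold error standard deviation, averaged over
    folds. One plausible reading of rmse_cv / std_dev_cv, for illustration."""
    rmses, stds = [], []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        errors = model.predict(X[test_idx]) - y[test_idx]
        rmses.append(float(np.sqrt(np.mean(errors ** 2))))
        stds.append(float(np.std(errors)))
    return float(np.mean(rmses)), float(np.mean(stds))

X, y = make_regression(n_samples=300, n_features=12, noise=4.0, random_state=0)
rmse_cv, std_dev_cv = cv_objectives(Ridge(alpha=1.0), X, y)
print(rmse_cv, std_dev_cv)  # the two objectives to minimize jointly
```

An optimizer then searches the pipeline and hyperparameter space for configurations that drive both values down simultaneously.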

Given these goals, traditional methods like grid search and random search could be used, but they often fall short due to several inherent limitations.
@@ -208,3 +208,12 @@ \subsubsection{Discussion of Testing and Validation Strategy}
By evaluating with both cross-validation and a separate test set, we ensure that the model both generalizes well and performs well under typical conditions.
Cross-validation allows us to evaluate the model's performance across the entire dataset, including extreme values, while the test set provides a measure of the model's performance on unseen, typical data.
This combination of cross-validation and a separate test set provides a comprehensive assessment of the model's performance, ultimately helping to ensure that the model is both robust and accurate.
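A minimal sketch of the partitioning idea, assuming "extreme" means the largest and smallest target values; the fractions and random selection are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def partition_extremes_to_train(y, test_frac=0.2, extreme_frac=0.1, seed=0):
    """Keep the most extreme target values in the training pool and draw the
    test set only from the remaining, more typical samples."""
    rng = np.random.default_rng(seed)
    order = np.argsort(y)
    k = int(len(y) * extreme_frac / 2)
    typical = order[k:len(y) - k]  # middle of the target distribution
    test_idx = rng.choice(typical, size=int(len(y) * test_frac), replace=False)
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    return train_idx, test_idx

y = np.random.default_rng(42).normal(size=100)
train_idx, test_idx = partition_extremes_to_train(y)
# The k smallest and k largest targets always land in the training indices,
# so cross-validation sees the extremes while the test set stays typical.
```

This is why \gls{rmsecv} can legitimately exceed \gls{rmsep} under this strategy: the folds must predict extreme samples that the test set never contains.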

In our initial and optimization experiments, we prioritize cross-validation metrics to evaluate the models.
This strategy mitigates the risk of overfitting to the test set by avoiding a bias towards lower \gls{rmsep} values.
Conversely, for the stacking ensemble experiment, we emphasize test set metrics to comprehensively assess the ensemble's performance, while still considering cross-validation metrics.
This approach aligns with standard machine learning conventions~\cite{geronHandsonMachineLearning2023}.
In the initial experiment, cross-validation metrics serve as thresholds for model selection.
During the optimization phase, only cross-validation metrics guide the search for optimal hyperparameters.
For the stacking ensemble experiment, both cross-validation and test set metrics are evaluated, with a primary focus on the \gls{rmsep} metric.
This approach aims to make our final model accurate, robust, and generalizable to unseen data, providing a balanced evaluation through both cross-validation and test set metrics.
