From 9f511b4facad756f1eaefc1117bc68d05b2fac3f Mon Sep 17 00:00:00 2001
From: Ivikhostrup
Date: Wed, 12 Jun 2024 15:07:27 +0200
Subject: [PATCH 01/12] Contributions written

---
 .../src/sections/pyhat_contribution.tex | 21 +++++++++++++++++++
 1 file changed, 21 insertions(+)
 create mode 100644 report_thesis/src/sections/pyhat_contribution.tex

diff --git a/report_thesis/src/sections/pyhat_contribution.tex b/report_thesis/src/sections/pyhat_contribution.tex
new file mode 100644
index 00000000..01c7fb6b
--- /dev/null
+++ b/report_thesis/src/sections/pyhat_contribution.tex
@@ -0,0 +1,21 @@
+\section{PyHAT Contribution}\label{sec:pyhat_contribution}
+As part of our research, we have made several contributions to \gls{pyhat}, which offers a user-friendly interface designed for performing machine learning and data analysis tasks specifically for hyperspectral data.
+This collaboration was initiated through a series of discussions with two members \gls{usgs} responsible for managing \gls{pyhat}, wherein we identified mutual challenges and opportunities for integrating our solutions into the library.
+
+The largest contribution involved the integration of an automatic outlier detection method into the library's \gls{pls} implementation.
+This method calculates the Mahalanobis distance for each data point and uses the chi-squared distribution to establish a threshold.
+Any datapoint exceeding this threshold is considered an out and removed from the dataset.
+Utilizing two intermediary \gls{pls} models, one as a reference and the other to evaluate the impact of outlier removal, the method iteratively identifies and eliminates outliers while assessing the performance of the second model against the reference model.
+If the second model demonstrates improved performance compared to the reference model, it replaces the reference model, and the process continues until no further significant improvement is detected.
+To conserve computational resources, the method halts if the error of the second model increases relative to the reference model, thus providing an early stopping mechanism.
+
+This contribution also included the development of a graphical user interface (GUI) component for the existing \gls{pyhat} GUI to visualize the outlier removal process in real-time.
+This included utilities to select a threshold, select a given oxide for which to perform outlier removal, and a logging mechanism to display the number of outliers removed at each iteration in the GUI.
+
+Another contribution made to \gls{pyhat} involved a fix of an important functionality in their Joint Approximation Diagonalization of Eigen-matrices (JADE) implementation.
+The fix provided the ability to properly identify which of the original data points has the highest correlation with each independent component produced by JADE.
+The correlation scores produced by this functionality can be used in a regression context, where a linear model learns the coefficients that best fit the relationship between the independent components and the original data points.
+
+Finally, contribution were made to improve the performance of various processes in \gls{pyhat}.
+
+
\ No newline at end of file
From d1c5d81a60506d1dbf12dbe29dba4e226c70a5d6 Mon Sep 17 00:00:00 2001
From: Ivikhostrup
Date: Wed, 12 Jun 2024 15:09:31 +0200
Subject: [PATCH 02/12] Being reviewed

---
 report_thesis/src/sections/pyhat_contribution.tex | 1 +
 1 file changed, 1 insertion(+)

diff --git a/report_thesis/src/sections/pyhat_contribution.tex b/report_thesis/src/sections/pyhat_contribution.tex
index 01c7fb6b..895189b5 100644
--- a/report_thesis/src/sections/pyhat_contribution.tex
+++ b/report_thesis/src/sections/pyhat_contribution.tex
@@ -17,5 +17,6 @@ \section{PyHAT Contribution}\label{sec:pyhat_contribution}
 The correlation scores produced by this functionality can be used in a regression context, where a linear model learns the coefficients that best fit the relationship between the independent components and the original data points.
 
 Finally, contribution were made to improve the performance of various processes in \gls{pyhat}.
+At the time of writing, all contributions has been demonstrated to work as intended to the two \gls{usgs} members responsible for managing \gls{pyhat} and are undergoing final review.
\ No newline at end of file
From 0646faac95c02446c8208e285b6a748d70a23bd0 Mon Sep 17 00:00:00 2001
From: Ivikhostrup <56341364+Ivikhostrup@users.noreply.github.com>
Date: Wed, 12 Jun 2024 15:56:55 +0200
Subject: [PATCH 03/12] Update report_thesis/src/sections/pyhat_contribution.tex

Co-authored-by: Christian Bager Bach Houmann

---
 report_thesis/src/sections/pyhat_contribution.tex | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/report_thesis/src/sections/pyhat_contribution.tex b/report_thesis/src/sections/pyhat_contribution.tex
index 895189b5..a011764d 100644
--- a/report_thesis/src/sections/pyhat_contribution.tex
+++ b/report_thesis/src/sections/pyhat_contribution.tex
@@ -1,5 +1,7 @@
 \section{PyHAT Contribution}\label{sec:pyhat_contribution}
-As part of our research, we have made several contributions to \gls{pyhat}, which offers a user-friendly interface designed for performing machine learning and data analysis tasks specifically for hyperspectral data.
+As part of our work, we have made several contributions to \gls{pyhat}.
+We describe these contributions here.
+\gls{pyhat} offers a user-friendly interface designed for performing machine learning and data analysis tasks specifically for hyperspectral data.
 This collaboration was initiated through a series of discussions with two members \gls{usgs} responsible for managing \gls{pyhat}, wherein we identified mutual challenges and opportunities for integrating our solutions into the library.
 
 The largest contribution involved the integration of an automatic outlier detection method into the library's \gls{pls} implementation.
From 56ac672dd9ef8b410edefdf31b4e0ef8fb87aa69 Mon Sep 17 00:00:00 2001
From: Ivikhostrup <56341364+Ivikhostrup@users.noreply.github.com>
Date: Wed, 12 Jun 2024 15:57:07 +0200
Subject: [PATCH 04/12] Update report_thesis/src/sections/pyhat_contribution.tex

Co-authored-by: Christian Bager Bach Houmann

---
 report_thesis/src/sections/pyhat_contribution.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/report_thesis/src/sections/pyhat_contribution.tex b/report_thesis/src/sections/pyhat_contribution.tex
index a011764d..637ded35 100644
--- a/report_thesis/src/sections/pyhat_contribution.tex
+++ b/report_thesis/src/sections/pyhat_contribution.tex
@@ -2,7 +2,7 @@ \section{PyHAT Contribution}\label{sec:pyhat_contribution}
 As part of our work, we have made several contributions to \gls{pyhat}.
 We describe these contributions here.
 \gls{pyhat} offers a user-friendly interface designed for performing machine learning and data analysis tasks specifically for hyperspectral data.
-This collaboration was initiated through a series of discussions with two members \gls{usgs} responsible for managing \gls{pyhat}, wherein we identified mutual challenges and opportunities for integrating our solutions into the library.
+Our collaboration was initiated through a series of discussions with two members from \gls{usgs} that are responsible for \gls{pyhat}, wherein we identified mutual challenges and opportunities for integrating our solutions into the tool.
 
 The largest contribution involved the integration of an automatic outlier detection method into the library's \gls{pls} implementation.
 This method calculates the Mahalanobis distance for each data point and uses the chi-squared distribution to establish a threshold.
From d6dbc108527340c1fc18d85531d2e652e626dc52 Mon Sep 17 00:00:00 2001
From: Ivikhostrup <56341364+Ivikhostrup@users.noreply.github.com>
Date: Wed, 12 Jun 2024 15:57:24 +0200
Subject: [PATCH 05/12] Update report_thesis/src/sections/pyhat_contribution.tex

Co-authored-by: Christian Bager Bach Houmann

---
 report_thesis/src/sections/pyhat_contribution.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/report_thesis/src/sections/pyhat_contribution.tex b/report_thesis/src/sections/pyhat_contribution.tex
index 637ded35..c7e20e96 100644
--- a/report_thesis/src/sections/pyhat_contribution.tex
+++ b/report_thesis/src/sections/pyhat_contribution.tex
@@ -4,7 +4,7 @@ \section{PyHAT Contribution}\label{sec:pyhat_contribution}
 \gls{pyhat} offers a user-friendly interface designed for performing machine learning and data analysis tasks specifically for hyperspectral data.
 Our collaboration was initiated through a series of discussions with two members from \gls{usgs} that are responsible for \gls{pyhat}, wherein we identified mutual challenges and opportunities for integrating our solutions into the tool.
 
-The largest contribution involved the integration of an automatic outlier detection method into the library's \gls{pls} implementation.
+The largest contribution involved the integration of an automatic outlier detection method into \gls{pyhat}.
 This method calculates the Mahalanobis distance for each data point and uses the chi-squared distribution to establish a threshold.
 Any datapoint exceeding this threshold is considered an out and removed from the dataset.
 Utilizing two intermediary \gls{pls} models, one as a reference and the other to evaluate the impact of outlier removal, the method iteratively identifies and eliminates outliers while assessing the performance of the second model against the reference model.
From b1f983caff27543dae2ba69df647795c82a2fbbf Mon Sep 17 00:00:00 2001
From: Ivikhostrup <56341364+Ivikhostrup@users.noreply.github.com>
Date: Wed, 12 Jun 2024 15:58:11 +0200
Subject: [PATCH 06/12] Update report_thesis/src/sections/pyhat_contribution.tex

Co-authored-by: Christian Bager Bach Houmann

---
 report_thesis/src/sections/pyhat_contribution.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/report_thesis/src/sections/pyhat_contribution.tex b/report_thesis/src/sections/pyhat_contribution.tex
index c7e20e96..dfd21748 100644
--- a/report_thesis/src/sections/pyhat_contribution.tex
+++ b/report_thesis/src/sections/pyhat_contribution.tex
@@ -6,7 +6,7 @@ \section{PyHAT Contribution}\label{sec:pyhat_contribution}
 
 The largest contribution involved the integration of an automatic outlier detection method into \gls{pyhat}.
 This method calculates the Mahalanobis distance for each data point and uses the chi-squared distribution to establish a threshold.
-Any datapoint exceeding this threshold is considered an out and removed from the dataset.
+Any datapoint exceeding this threshold is considered an outlier and removed from the dataset.
 Utilizing two intermediary \gls{pls} models, one as a reference and the other to evaluate the impact of outlier removal, the method iteratively identifies and eliminates outliers while assessing the performance of the second model against the reference model.
 If the second model demonstrates improved performance compared to the reference model, it replaces the reference model, and the process continues until no further significant improvement is detected.
 To conserve computational resources, the method halts if the error of the second model increases relative to the reference model, thus providing an early stopping mechanism.
From 570e2ee92658b58ed374befa1066cee496f55944 Mon Sep 17 00:00:00 2001
From: Ivikhostrup <56341364+Ivikhostrup@users.noreply.github.com>
Date: Wed, 12 Jun 2024 15:58:25 +0200
Subject: [PATCH 07/12] Update report_thesis/src/sections/pyhat_contribution.tex

Co-authored-by: Christian Bager Bach Houmann

---
 report_thesis/src/sections/pyhat_contribution.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/report_thesis/src/sections/pyhat_contribution.tex b/report_thesis/src/sections/pyhat_contribution.tex
index dfd21748..6b112d60 100644
--- a/report_thesis/src/sections/pyhat_contribution.tex
+++ b/report_thesis/src/sections/pyhat_contribution.tex
@@ -11,7 +11,7 @@ \section{PyHAT Contribution}\label{sec:pyhat_contribution}
 If the second model demonstrates improved performance compared to the reference model, it replaces the reference model, and the process continues until no further significant improvement is detected.
 To conserve computational resources, the method halts if the error of the second model increases relative to the reference model, thus providing an early stopping mechanism.
 
-This contribution also included the development of a graphical user interface (GUI) component for the existing \gls{pyhat} GUI to visualize the outlier removal process in real-time.
+This contribution also included the development of a graphical user interface (GUI) component for the existing \gls{pyhat} GUI to configure and visualize the outlier removal process.
 This included utilities to select a threshold, select a given oxide for which to perform outlier removal, and a logging mechanism to display the number of outliers removed at each iteration in the GUI.
 
 Another contribution made to \gls{pyhat} involved a fix of an important functionality in their Joint Approximation Diagonalization of Eigen-matrices (JADE) implementation.
From dfc180a4dec134c06a9dbf4587931ae8f492c818 Mon Sep 17 00:00:00 2001
From: Ivikhostrup <56341364+Ivikhostrup@users.noreply.github.com>
Date: Wed, 12 Jun 2024 15:58:41 +0200
Subject: [PATCH 08/12] Update report_thesis/src/sections/pyhat_contribution.tex

Co-authored-by: Christian Bager Bach Houmann

---
 report_thesis/src/sections/pyhat_contribution.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/report_thesis/src/sections/pyhat_contribution.tex b/report_thesis/src/sections/pyhat_contribution.tex
index 6b112d60..b97eece0 100644
--- a/report_thesis/src/sections/pyhat_contribution.tex
+++ b/report_thesis/src/sections/pyhat_contribution.tex
@@ -18,7 +18,7 @@ \section{PyHAT Contribution}\label{sec:pyhat_contribution}
 The fix provided the ability to properly identify which of the original data points has the highest correlation with each independent component produced by JADE.
 The correlation scores produced by this functionality can be used in a regression context, where a linear model learns the coefficients that best fit the relationship between the independent components and the original data points.
 
-Finally, contribution were made to improve the performance of various processes in \gls{pyhat}.
+Finally, we made some contributions to improve the performance of various processes in \gls{pyhat}.
 At the time of writing, all contributions has been demonstrated to work as intended to the two \gls{usgs} members responsible for managing \gls{pyhat} and are undergoing final review.
\ No newline at end of file
From 06d30c3d50518e1bb2ca9b063f7635b20b430079 Mon Sep 17 00:00:00 2001
From: Christian Bager Bach Houmann
Date: Wed, 12 Jun 2024 20:22:24 +0200
Subject: [PATCH 09/12] jade gls

---
 report_thesis/src/glossary.tex | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/report_thesis/src/glossary.tex b/report_thesis/src/glossary.tex
index 33308ddf..83553a70 100644
--- a/report_thesis/src/glossary.tex
+++ b/report_thesis/src/glossary.tex
@@ -58,4 +58,5 @@
 \newacronym{rss}{RSS}{Residual Sum of Squares}
 \newacronym{tpe}{TPE}{Tree-structured Parzen Estimator}
 \newacronym{usgs}{USGS}{United States Geological Survey}
-\newacronym{pyhat}{PyHAT}{Python Hyperspectral Analysis Tool}
\ No newline at end of file
+\newacronym{pyhat}{PyHAT}{Python Hyperspectral Analysis Tool}
+\newacronym{jade}{JADE}{Joint Approximation Diagonalization of Eigen-matrices}
From 9b3b1fb16a87c323c8e6cd2a716549b662cd0a64 Mon Sep 17 00:00:00 2001
From: Christian Bager Bach Houmann
Date: Wed, 12 Jun 2024 20:22:30 +0200
Subject: [PATCH 10/12] include section

---
 report_thesis/src/index.tex | 1 +
 1 file changed, 1 insertion(+)

diff --git a/report_thesis/src/index.tex b/report_thesis/src/index.tex
index 9f5fba4c..41a0bc56 100644
--- a/report_thesis/src/index.tex
+++ b/report_thesis/src/index.tex
@@ -26,5 +26,6 @@ \subsubsection*{Acknowledgements:}
 \input{sections/proposed_approach/proposed_approach.tex}
 \input{sections/methodology.tex}
 \input{sections/experiments/index.tex}
+\input{sections/pyhat_contribution.tex}
 \input{sections/conclusion.tex}
 \input{sections/future_work.tex}
From 654cfa6e321e00086f5be63bea54a4562a460e15 Mon Sep 17 00:00:00 2001
From: Christian Bager Bach Houmann
Date: Wed, 12 Jun 2024 20:22:43 +0200
Subject: [PATCH 11/12] wip: rewrite

---
 .../src/sections/pyhat_contribution.tex | 27 +++++++++++++------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/report_thesis/src/sections/pyhat_contribution.tex b/report_thesis/src/sections/pyhat_contribution.tex
index b97eece0..a13f51bf 100644
--- a/report_thesis/src/sections/pyhat_contribution.tex
+++ b/report_thesis/src/sections/pyhat_contribution.tex
@@ -4,18 +4,29 @@ \section{PyHAT Contribution}\label{sec:pyhat_contribution}
 \gls{pyhat} offers a user-friendly interface designed for performing machine learning and data analysis tasks specifically for hyperspectral data.
 Our collaboration was initiated through a series of discussions with two members from \gls{usgs} that are responsible for \gls{pyhat}, wherein we identified mutual challenges and opportunities for integrating our solutions into the tool.
 
-The largest contribution involved the integration of an automatic outlier detection method into \gls{pyhat}.
-This method calculates the Mahalanobis distance for each data point and uses the chi-squared distribution to establish a threshold.
-Any datapoint exceeding this threshold is considered an outlier and removed from the dataset.
-Utilizing two intermediary \gls{pls} models, one as a reference and the other to evaluate the impact of outlier removal, the method iteratively identifies and eliminates outliers while assessing the performance of the second model against the reference model.
-If the second model demonstrates improved performance compared to the reference model, it replaces the reference model, and the process continues until no further significant improvement is detected.
-To conserve computational resources, the method halts if the error of the second model increases relative to the reference model, thus providing an early stopping mechanism.
+% The largest contribution involved the integration of an automatic outlier detection method into \gls{pyhat}.
+% This method calculates the Mahalanobis distance for each data point and uses the chi-squared distribution to establish a threshold.
+% Any datapoint exceeding this threshold is considered an outlier and removed from the dataset.
+% Utilizing two intermediary \gls{pls} models, one as a reference and the other to evaluate the impact of outlier removal, the method iteratively identifies and eliminates outliers while assessing the performance of the second model against the reference model.
+% If the second model demonstrates improved performance compared to the reference model, it replaces the reference model, and the process continues until no further significant improvement is detected.
+% To conserve computational resources, the method halts if the error of the second model increases relative to the reference model, thus providing an early stopping mechanism.
+
+We implemented an outlier detection method in \gls{pyhat} that uses the Mahalanobis distance and the chi-squared test.
+This statistical approach identifies outliers without relying on qualitative assessments.
+The process involves computing leverage and spectral residuals for each sample using a \gls{pls} model, combining these metrics into a two-dimensional dataset, and calculating the Mahalanobis distance for each sample.
+Samples are classified as outliers if their Mahalanobis distance exceeds a chi-squared critical value at a confidence level based on the threshold.
+Outliers are then excluded, and the model is retrained iteratively until no further performance improvement is observed.
+We developed this method as a part of our work on the \gls{moc} model replica presented in \citet{p9_paper}, where it served as an automated version of the one presented by \citet{andersonImprovedAccuracyQuantitative2017}.
+
+This method was integrated into \gls{pyhat}'s library and GUI, allowing users to configure the chi-squared threshold, number of PLS components, and maximum iterations.
+Users can select their dataset and regression target, configure the method, and run it through the GUI.
+
 
 This contribution also included the development of a graphical user interface (GUI) component for the existing \gls{pyhat} GUI to configure and visualize the outlier removal process.
 This included utilities to select a threshold, select a given oxide for which to perform outlier removal, and a logging mechanism to display the number of outliers removed at each iteration in the GUI.
 
-Another contribution made to \gls{pyhat} involved a fix of an important functionality in their Joint Approximation Diagonalization of Eigen-matrices (JADE) implementation.
-The fix provided the ability to properly identify which of the original data points has the highest correlation with each independent component produced by JADE.
+We also contributed by resolving a critical issue in the \gls{jade} implementation within \gls{pyhat}.
+The fix provided the ability to properly identify which of the original data points has the highest correlation with each independent component produced by \gls{jade}.
 The correlation scores produced by this functionality can be used in a regression context, where a linear model learns the coefficients that best fit the relationship between the independent components and the original data points.
 
 Finally, we made some contributions to improve the performance of various processes in \gls{pyhat}.
From 2e6a56164da629474a299b61b45ec120bb0d9879 Mon Sep 17 00:00:00 2001
From: Christian Bager Bach Houmann
Date: Wed, 12 Jun 2024 20:30:37 +0200
Subject: [PATCH 12/12] rewrite outlier removal method

---
 .../src/sections/pyhat_contribution.tex | 19 ++++++-------------
 1 file changed, 6 insertions(+), 13 deletions(-)

diff --git a/report_thesis/src/sections/pyhat_contribution.tex b/report_thesis/src/sections/pyhat_contribution.tex
index a13f51bf..92fabd52 100644
--- a/report_thesis/src/sections/pyhat_contribution.tex
+++ b/report_thesis/src/sections/pyhat_contribution.tex
@@ -4,24 +4,17 @@ \section{PyHAT Contribution}\label{sec:pyhat_contribution}
 \gls{pyhat} offers a user-friendly interface designed for performing machine learning and data analysis tasks specifically for hyperspectral data.
 Our collaboration was initiated through a series of discussions with two members from \gls{usgs} that are responsible for \gls{pyhat}, wherein we identified mutual challenges and opportunities for integrating our solutions into the tool.
 
-% The largest contribution involved the integration of an automatic outlier detection method into \gls{pyhat}.
-% This method calculates the Mahalanobis distance for each data point and uses the chi-squared distribution to establish a threshold.
-% Any datapoint exceeding this threshold is considered an outlier and removed from the dataset.
-% Utilizing two intermediary \gls{pls} models, one as a reference and the other to evaluate the impact of outlier removal, the method iteratively identifies and eliminates outliers while assessing the performance of the second model against the reference model.
-% If the second model demonstrates improved performance compared to the reference model, it replaces the reference model, and the process continues until no further significant improvement is detected.
-% To conserve computational resources, the method halts if the error of the second model increases relative to the reference model, thus providing an early stopping mechanism.
-
-We implemented an outlier detection method in \gls{pyhat} that uses the Mahalanobis distance and the chi-squared test.
-This statistical approach identifies outliers without relying on qualitative assessments.
-The process involves computing leverage and spectral residuals for each sample using a \gls{pls} model, combining these metrics into a two-dimensional dataset, and calculating the Mahalanobis distance for each sample.
-Samples are classified as outliers if their Mahalanobis distance exceeds a chi-squared critical value at a confidence level based on the threshold.
+We implemented an outlier detection method in \gls{pyhat} that uses the Mahalanobis distance and the chi-squared test.
+This statistical approach identifies outliers without relying on qualitative assessments.
+The process involves computing leverage, which measures a sample's influence, and spectral residuals, which are the differences between observed and predicted values, for each sample using a \gls{pls} model.
+These metrics are combined into a two-dimensional dataset, and the Mahalanobis distance for each sample is calculated.
+Samples are classified as outliers if their Mahalanobis distance exceeds a chi-squared critical value at a confidence level based on the threshold.
 Outliers are then excluded, and the model is retrained iteratively until no further performance improvement is observed.
 We developed this method as a part of our work on the \gls{moc} model replica presented in \citet{p9_paper}, where it served as an automated version of the one presented by \citet{andersonImprovedAccuracyQuantitative2017}.
 
-This method was integrated into \gls{pyhat}'s library and GUI, allowing users to configure the chi-squared threshold, number of PLS components, and maximum iterations.
+This method was integrated into \gls{pyhat}'s library and GUI, allowing users to configure the chi-squared threshold, number of \gls{pls} components, and maximum iterations.
 Users can select their dataset and regression target, configure the method, and run it through the GUI.
-
 
 This contribution also included the development of a graphical user interface (GUI) component for the existing \gls{pyhat} GUI to configure and visualize the outlier removal process.
 This included utilities to select a threshold, select a given oxide for which to perform outlier removal, and a logging mechanism to display the number of outliers removed at each iteration in the GUI.
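
The outlier test described in the rewritten section can be illustrated with a minimal sketch. This is not the code contributed to \gls{pyhat}; it only shows the flagging step under stated assumptions. The names `chi2_critical_2dof` and `flag_outliers` are hypothetical, the (leverage, residual) pairs are assumed to be precomputed from a PLS model, the squared Mahalanobis distance is assumed to be the quantity compared against the chi-squared critical value, and the default confidence level of 0.975 is an arbitrary choice for illustration. Because the combined dataset is two-dimensional, the chi-squared quantile with 2 degrees of freedom has the closed form -2 ln(1 - p), so only the standard library is needed.

```python
import math


def chi2_critical_2dof(confidence):
    """Chi-squared quantile for 2 degrees of freedom (closed form)."""
    return -2.0 * math.log(1.0 - confidence)


def flag_outliers(points, confidence=0.975):
    """Return one boolean per (leverage, residual) pair.

    A pair is flagged when its squared Mahalanobis distance from the
    sample mean exceeds the chi-squared critical value (assumption: the
    squared distance is what gets compared with the critical value).
    """
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n

    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)

    det = sxx * syy - sxy ** 2
    if det <= 0.0:
        raise ValueError("covariance matrix is singular")

    critical = chi2_critical_2dof(confidence)
    flags = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # d^2 = [dx dy] S^{-1} [dx dy]^T, using the explicit 2x2 inverse.
        d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        flags.append(d2 > critical)
    return flags
```

In the iterative procedure the section describes, a step like this would run once per iteration: flagged samples are dropped, the PLS model is refit on the remainder, and the loop stops once performance no longer improves.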