Fix/multilabel confusion matrix #51
Conversation
Function takes a binarized encoded target array and decodes it back. Useful for the confusion matrix.
The one-dimensional case wasn't working because some places expected lists of lists but only got plain lists. Fixed now.
The dataloader was added to the evaluation to label the confusion matrix. If this is too much mixing-in of the dataloader, we can of course also just pass the decoded y_pred and y_test.
Just noticed there is undefined behavior if the model chooses multiple elements for one row, for example THF and Et2O on the same molecule. It looks like the binarizer somehow chooses one, but I guess this shouldn't be an issue, as it will work out in the statistics.
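That tie-breaking can be reproduced in isolation. A minimal sketch, assuming the encoding uses sklearn's LabelBinarizer (the class names here are illustrative, not the dataset's actual labels):

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Illustrative classes; the real dataloader fits on the dataset's targets.
lb = LabelBinarizer()
lb.fit(["DMSO", "Et2O", "THF"])

# A row where the model set two classes at once (Et2O and THF).
ambiguous_row = np.array([[0, 1, 1]])

# inverse_transform resolves the tie deterministically (argmax takes the
# first maximum), so exactly one label comes back per row.
decoded = lb.inverse_transform(ambiguous_row)
print(decoded)
```

So the decoding never errors out on a multi-hot row; it just silently picks one class, which matches the "it will work out in the statistics" observation above.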
Thanks a lot for the contribution! Checking the code asap
Will do tomorrow 👍🏼
Some setup for adding the multi-dimensional support for the confusion matrix, so that it produces one CM per target from --targets.
This function returns a list of lists of the columns for each target. This is needed to make multiple confusion matrices.
Added support for confusion matrices with one target and a one-dimensional target array, multiple targets with a multi-dimensional target array, and a single target with a multi-dimensional target array.
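A minimal sketch of that column-grouping idea, assuming each target's one-hot encoding occupies a contiguous block of columns (the function name and signature are assumptions, not the PR's actual API):

```python
# Hypothetical helper: given how many classes each target encodes,
# return the list of column indices belonging to each target.
def columns_per_target(n_classes_per_target):
    columns_set, start = [], 0
    for n in n_classes_per_target:
        columns_set.append(list(range(start, start + n)))
        start += n
    return columns_set

# Two targets with 3 and 2 classes respectively:
print(columns_per_target([3, 2]))  # → [[0, 1, 2], [3, 4]]
```

Each inner list can then be used to slice out one target's sub-matrix and its class labels for a per-target confusion matrix.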
Current state
The code has now been refactored quite a bit, especially the training script. It can handle all sorts of dataset mixtures. ROC curves etc. are only generated for binary tasks, and depending on the target dimensionality, different things are executed.
Todo
Fix the algorithm to make the subplots and add that cool color bar, but I feel like this depends on how exactly we want to plot it. The technically painful part, generating all the sub-CMs and bringing along the right labels, is done.
Examples of generated confusion matrices
""" | ||
Plots the confusion matrix. | ||
Parameters: | ||
- cm (array-like): Confusion matrix data. | ||
- classes (list): List of classes for the axis labels. | ||
- title (str): Title of the plot. | ||
- full (bool): If true plots one big, else many smaller. |
I like the description hehe
Alternatively, for the final review round, we could rename it to "Plots single confusion matrix for all targets combined vs one CM per target" or something like that
plt.savefig(path)
plt.close()
elif not full:  # Plot many small cms of each target
Could rewrite this to else:, as full is either True or False
sub_classes = classes[
    slice(columns_set[i][0], columns_set[i][-1] + 1)
]
axs[i].imshow(sub_cm, interpolation="nearest", cmap=plt.cm.Blues)
Super nice! In the final review, we can add our Minecraft color scheme here
because we obtain the cmap from the setup_style() function in nmrcraft/analysis/plotting.py
""" | ||
y_decoded = self.binarized_target_decoder(y) | ||
flat_y_decoded = [y for ys in y_decoded for y in ys] | ||
return flat_y_decoded |
That must have been technically difficult to implement! Thanks for taking care of this. In the final review, we could add a bit more detailed documentation here so that the TAs know what’s going on in this function.
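For that documentation pass, a hedged sketch of what the decoder pair might look like once documented; the method name binarized_target_decoder mirrors the snippet above, but the class internals and decode_flat are assumptions:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

class DataLoader:
    """Hypothetical minimal dataloader keeping the binarizer used at encode time."""

    def __init__(self, targets):
        # Fit the same binarizer that encoded the categories.
        self.binarizer = LabelBinarizer().fit(targets)

    def binarized_target_decoder(self, y):
        """Invert the binarization; returns one list of labels per row."""
        return [[label] for label in self.binarizer.inverse_transform(np.asarray(y))]

    def decode_flat(self, y):
        """Decode y, then flatten the per-row lists into one flat label list."""
        y_decoded = self.binarized_target_decoder(y)
        return [label for row in y_decoded for label in row]
```

The flattening step is what feeds plain label lists into the confusion matrix, which is why the decoder returns lists of lists first.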
from typing import Any, Dict, Tuple

from sklearn.base import BaseEstimator
from sklearn.metrics import (
    accuracy_score,
    auc,
    # confusion_matrix,
Thanks for cleaning up the code alongside!
def model_evaluation_nD(
    model: BaseEstimator,
Very nice. Also a note for the final review (I'm creating an issue out of this): add type hints for all functions and classes. The purpose of this is to enforce that a function takes inputs only of a certain type, and also only outputs a certain type.
Instead of

    def confusion_matric_plotter_or_whatever(target, full=True):
        return 1 + 1

write

    def confusion_matric_plotter_or_whatever(target: np.ndarray, full: bool = True) -> int:
        return 1 + 1

This seems a bit useless at first, but it's a game changer when writing code that takes in many different inputs, and it saves some error messages!
X_test: Any,
y_test: Any,
y_labels: Any,
dataloader: dataset.DataLoader,
Here, for example, you have such type hints
print(f"Accuracy: {metrics['accuracy']}") | ||
mlflow.log_artifact(get_cm_path()) |
Thanks for cleaning up! We might need to adapt this to the script @strsamue is writing.
What does it do?
Gives the dataloader the capability to decode target arrays from the test and prediction sets.
Should fix issue #50.
How?
Added an internal function to the dataloader that uses the same label binarizer that was used to encode the different categories.
It then performs the inverse encoding and returns the result for an arbitrary target array.
In the current implementation the dataloader is passed to the evaluation script to do the translation there, but we can of course also translate inside the trainer script and pass the decoded arrays and labels.
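The decode-then-evaluate flow described above can be sketched roughly like this, assuming a LabelBinarizer stored by the dataloader (all names and labels here are illustrative, not the PR's actual API):

```python
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import confusion_matrix

y_test = ["THF", "DMSO", "Et2O", "THF"]
y_pred = ["THF", "Et2O", "Et2O", "THF"]

# The dataloader keeps the binarizer it used for encoding...
lb = LabelBinarizer().fit(y_test)

# ...so encoded arrays coming out of the model can be decoded again.
y_test_dec = lb.inverse_transform(lb.transform(y_test))
y_pred_dec = lb.inverse_transform(lb.transform(y_pred))

# Decoded labels let the confusion matrix carry readable axis labels.
cm = confusion_matrix(y_test_dec, y_pred_dec, labels=lb.classes_)
print(cm)
```

Passing the dataloader (and thus its fitted binarizer) into the evaluation is what makes the round trip above possible without refitting anything.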