Address inconsistencies in DataScaling infrastructure #598

RandomDefaultUser · 2024-10-25T13:47:28Z

This PR will close #482 and should also close #483.

Address Data scaling: row vs. column and naming of methods #482, i.e., fix docstrings and rename "normal" to "minmax"
Address Data scaling: API and documentation #483, i.e., make API consistent with sklearn scalers
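For readers unfamiliar with the sklearn scaler conventions referred to above, here is a minimal sketch of that interface shape: `fit` resets state, `partial_fit` accumulates statistics over batches, and `transform` / `inverse_transform` return new arrays. The class `MinMaxScalerSketch` is hypothetical and written with NumPy for illustration only; it is not MALA's `DataScaler` and makes no claim about how this PR implements things.

```python
import numpy as np


class MinMaxScalerSketch:
    """Hypothetical feature-wise min-max scaler with an sklearn-like interface."""

    def __init__(self):
        self.data_min_ = None
        self.data_max_ = None

    def partial_fit(self, X):
        """Update the running per-feature min/max with one batch."""
        X = np.asarray(X, dtype=float)
        if X.ndim != 2:
            raise ValueError(f"Expected 2D (n_samples, n_features), got {X.ndim}D")
        batch_min, batch_max = X.min(axis=0), X.max(axis=0)
        if self.data_min_ is None:
            self.data_min_, self.data_max_ = batch_min, batch_max
        else:
            self.data_min_ = np.minimum(self.data_min_, batch_min)
            self.data_max_ = np.maximum(self.data_max_, batch_max)
        return self

    def fit(self, X):
        """Discard previous statistics, then fit on X."""
        self.data_min_ = self.data_max_ = None
        return self.partial_fit(X)

    def transform(self, X):
        """Scale each feature column to [0, 1]; returns a new array.

        Note: a constant feature (max == min) would divide by zero here;
        this sketch does not handle that case.
        """
        X = np.asarray(X, dtype=float)
        return (X - self.data_min_) / (self.data_max_ - self.data_min_)

    def inverse_transform(self, X_scaled):
        """Undo transform(); returns a new array."""
        X_scaled = np.asarray(X_scaled, dtype=float)
        return X_scaled * (self.data_max_ - self.data_min_) + self.data_min_

    def fit_transform(self, X):
        return self.fit(X).transform(X)


X = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
scaled = MinMaxScalerSketch().fit_transform(X)

# Fitting in two batches should give the same statistics as a single fit.
batched = MinMaxScalerSketch().partial_fit(X[:2]).partial_fit(X[2:])
```

Returning `self` from `fit` and `partial_fit` is what allows the chained calls in the usage lines.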

RandomDefaultUser · 2024-10-29T10:34:10Z

@elcorto I hope I did not miss anything, but I think this should address the issues you have raised. I would appreciate your opinion/review.

elcorto · 2024-10-29T10:48:36Z

It seems like you are still working on this PR? If so, please let me know when it is ready for a review.

RandomDefaultUser · 2024-10-29T11:37:18Z

> It seems like you are still working on this PR? If so, please let me know when it is ready for a review.

As of today I was actually finished with the PR from my side, except of course if I missed something you wanted implemented?

elcorto · 2024-10-29T14:02:55Z

> > It seems like you are still working on this PR? If so, please let me know when it is ready for a review.
>
> As of today I was actually finished with the PR from my side, except of course if I missed something you wanted implemented?

I saw recent commits so I wanted to wait with the review, but OK then I'll look into it now.

elcorto

I have one large comment that I think should be discussed and two small changes, the rest looks good, thanks a lot.

elcorto · 2024-11-08T12:26:05Z

docs/source/basic_usage/trainingmodel.rst

-* ``feature-wise-standard``: Row Standardization (Scale to mean 0, standard deviation 1)
+* ``feature-wise-standard``: Standardization (Scale to mean 0, standard
+  deviation 1) is applied to each feature dimension individually. I.e., if your
+  training data has dimensions (x,y,z,f), then each of the f rows with (x,y,z)
+  entries is scaled indiviually.


As mentioned in #482, DataScaler silently assumes a 2D array of shape (x * y * z, f) since it does torch.mean(unscaled, 0, ...), so unless I'm overlooking something obvious, I think this explanation is actually misleading.

Consider this:

    from sklearn.preprocessing import StandardScaler
    from mala.datahandling.data_scaler import DataScaler
    import torch as T
    import einops
    import numpy as np
    from icecream import ic

    reshape_to_2d = lambda x: einops.rearrange(x, "x y z f -> (x y z) f")

    T.set_default_dtype(T.float64)


    # sklearn's StandardScaler uses a biased estimator np.std(..., ddof=0), while
    # torch uses T.std(..., correction=1) and there is no way to change this in
    # StandardScaler, so we roll our own.
    class MyStandardScaler:
        @staticmethod
        def fit_transform(X):
            assert X.ndim == 2
            x_mean = X.mean(0)[None, :]
            x_std = np.std(X.numpy(), axis=0, ddof=1)[None, :]
            return (X - x_mean) / x_std


    with T.no_grad():
        arr_4d = T.rand(2, 3, 4, 5)
        arr_2d = reshape_to_2d(arr_4d)

        ##arr_2d_scaled_skl = StandardScaler().fit_transform(arr_2d)
        arr_2d_scaled_skl = MyStandardScaler().fit_transform(arr_2d)

        scaler_mala = DataScaler("feature-wise-standard")
        arr_2d_scaled_mala = arr_2d.clone()
        scaler_mala.fit(arr_2d_scaled_mala)
        # in-place mod!
        scaler_mala.transform(arr_2d_scaled_mala)

        ic(arr_2d_scaled_skl[:5, :])
        ic(arr_2d_scaled_mala[:5, :])
        assert np.allclose(arr_2d_scaled_mala, arr_2d_scaled_skl, atol=1e-15)

        # ValueError: Found array with dim 4. StandardScaler expected <= 2.
        ##arr_2d_scaled_skl = StandardScaler().fit_transform(arr_2d)

        # This works but shouldn't
        scaler_mala = DataScaler("feature-wise-standard")
        arr_4d_scaled_mala = arr_4d.clone()
        scaler_mala.fit(arr_4d_scaled_mala)
        # in-place mod!
        scaler_mala.transform(arr_4d_scaled_mala)
        arr_2d_from_4d_scaled_mala = reshape_to_2d(arr_4d_scaled_mala)
        assert not np.allclose(
            arr_2d_from_4d_scaled_mala, arr_2d_scaled_skl, atol=1e-1
        )
        ic(arr_2d_from_4d_scaled_mala[:5, :])

results in:

    ic| arr_2d_scaled_skl[:5, :]: tensor([[-0.8424, -0.3500, -1.3310,  0.6888, -0.2759],
                                          [-0.1266, -1.1469, -0.2332, -0.3831, -0.7441],
                                          [-1.4691,  1.4810,  0.2890,  0.3123, -1.2043],
                                          [ 1.5169, -0.5278, -1.4609,  2.1362,  1.1646],
                                          [-0.6087,  0.2314,  0.9764, -1.0315,  0.6824]])
    ic| arr_2d_scaled_mala[:5, :]: tensor([[-0.8424, -0.3500, -1.3310,  0.6888, -0.2759],
                                           [-0.1266, -1.1469, -0.2332, -0.3831, -0.7441],
                                           [-1.4691,  1.4810,  0.2890,  0.3123, -1.2043],
                                           [ 1.5169, -0.5278, -1.4609,  2.1362,  1.1646],
                                           [-0.6087,  0.2314,  0.9764, -1.0315,  0.6824]])
    ic| arr_2d_from_4d_scaled_mala[:5, :]: tensor([[ 0.7071, -0.7071, -0.7071,  0.7071,  0.7071],
                                                   [-0.7071, -0.7071, -0.7071,  0.7071, -0.7071],
                                                   [-0.7071,  0.7071, -0.7071, -0.7071, -0.7071],
                                                   [ 0.7071, -0.7071, -0.7071,  0.7071,  0.7071],
                                                   [ 0.7071,  0.7071,  0.7071, -0.7071, -0.7071]])

So the code doesn't fail when given a 4D array but produces wrong results. Therefore, all docs mentioning (x,y,z,f) should be adapted. Also an array dimension check is probably a good idea.
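The ±0.7071 values in the 4D case can be reproduced without MALA. The sketch below assumes, as stated above, that the scaler reduces over axis 0 only and uses the unbiased standard deviation; it uses NumPy in place of torch, and `standardize_axis0` is a hypothetical stand-in, not MALA code. With shape (2, 3, 4, 5), axis 0 holds just two samples, and standardizing any two distinct values a and b with `ddof=1` always yields ±1/√2.

```python
import numpy as np

rng = np.random.default_rng(0)
arr_4d = rng.random((2, 3, 4, 5))  # (x, y, z, f)


def standardize_axis0(a):
    """Subtract the mean and divide by the unbiased std, along axis 0 only."""
    mean = a.mean(axis=0, keepdims=True)
    std = a.std(axis=0, ddof=1, keepdims=True)
    return (a - mean) / std


# Intended usage: flatten the grid dimensions first, shape (x * y * z, f).
arr_2d = arr_4d.reshape(-1, arr_4d.shape[-1])
scaled_2d = standardize_axis0(arr_2d)

# Unintended usage: the 4D array is reduced over x only, i.e. over 2 samples
# per (y, z, f) entry, so every scaled value has magnitude 1/sqrt(2).
scaled_4d = standardize_axis0(arr_4d)
```

In the 2D case each of the f columns ends up with mean 0 and standard deviation 1 over all x * y * z grid points, which is what "feature-wise" is meant to describe.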

RandomDefaultUser · 2024-11-14

Aaaaaah, I see what you mean now, sorry for the misunderstanding. I have adapted the documentation and added an array check. Let me know if there is still something missing!
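An array check of the kind discussed here could look roughly like the following. This is only a sketch: the helper name `check_is_2d` and the error message are hypothetical and not taken from the MALA code base. It relies only on `.ndim` and `.shape`, which both NumPy arrays and torch tensors provide.

```python
import numpy as np


def check_is_2d(array, name="data"):
    """Raise early if `array` is not of shape (d, f)."""
    if array.ndim != 2:
        raise ValueError(
            f"{name} must be 2D with shape (d, f), got shape "
            f"{tuple(array.shape)}; reshape (x, y, z, f) data to "
            "(x * y * z, f) first."
        )
    return array


snapshot = np.zeros((2, 3, 4, 5))  # (x, y, z, f)

# A 4D array is rejected instead of being scaled silently along axis 0.
try:
    check_is_2d(snapshot)
    rejected = False
except ValueError:
    rejected = True

# After flattening the grid dimensions, the check passes.
flat = check_is_2d(snapshot.reshape(-1, snapshot.shape[-1]))
```

Failing loudly at `fit` time is cheaper to debug than a model trained on wrongly scaled data.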

mala/datahandling/data_scaler.py

elcorto · 2024-11-08T12:34:18Z

docs/source/basic_usage/trainingmodel.rst

+* ``feature-wise-minmax``: Min-Max scaling (Scale to be in range 0...1) is
+  applied to each feature dimension individually. I.e., if your training data
+  has dimensions (x,y,z,f), then each of the f rows with (x,y,z) entries is
+  scaled indiviually.


elcorto · 2024-11-08T12:34:41Z

mala/common/parameters.py

+              standard deviation 1) is applied to each feature dimension
+              individually. I.e., if your training data has dimensions
+              (x,y,z,f), then each of the f rows with (x,y,z) entries is scaled
+              indiviually.
+            - "feature-wise-minmax": Row Min-Max scaling (Scale to be in range
+              0...1) is applied to each feature dimension individually.
+              I.e., if your training data has dimensions (x,y,z,f), then each
+              of the f rows with (x,y,z) entries is scaled indiviually.


elcorto · 2024-11-08T12:34:50Z

mala/common/parameters.py

+              individually. I.e., if your training data has dimensions
+              (x,y,z,f), then each of the f rows with (x,y,z) entries is scaled
+              indiviually.
+            - "feature-wise-minmax": Row Min-Max scaling (Scale to be in range
+              0...1) is applied to each feature dimension individually.
+              I.e., if your training data has dimensions (x,y,z,f), then each
+              of the f rows with (x,y,z) entries is scaled indiviually.


elcorto · 2024-11-08T12:35:01Z

mala/datahandling/data_scaler.py

+          I.e., if your training data has dimensions (x,y,z,f), then each
+          of the f rows with (x,y,z) entries is scaled indiviually.
+        - "feature-wise-minmax": Min-Max scaling (Scale to be in range
+          0...1) is applied to each feature dimension individually.
+          I.e., if your training data has dimensions (x,y,z,f), then each
+          of the f rows with (x,y,z) entries is scaled indiviually.


RandomDefaultUser · 2024-11-19T13:25:53Z

I fixed the pipeline @elcorto after addressing your comments, could you let me know if this PR looks good from your side now?

elcorto

Thanks for the doc update in DataScaler and the 2d array check.

Two things that I stumbled upon:

There are still places (see https://github.com/mala-project/mala/pull/598/files) that mention (x,y,z,f) and that you probably missed, namely in docs/source/basic_usage/trainingmodel.rst and mala/common/parameters.py.
The docs in DataScaler and mala/common/parameters.py are identical. Maybe just keep them in one place instead, so for instance in parameters.py link to :class:`~mala.datahandling.data_handler.DataScaler`?

mala/datahandling/data_scaler.py

RandomDefaultUser · 2024-11-22T17:46:03Z

Thanks for the feedback and suggestions @elcorto, I implemented the changes and corrected (x,y,z) to (d) in the places you mentioned. I would keep the full docstrings in both the Parameters and DataScaler classes though. I think they are there for different audiences: a regular user who applies a model does not necessarily interact with the DataScaler class, so they need to be in the Parameters class. If a developer, however, wants to adapt the DataScaler code, the proper docstring should be in the right place. I will add a note that changes in DataScaler should be propagated to the Parameters class though.

elcorto

OK then we're all set, I guess. Thanks a lot!

Renamed "normal" to "minmax" and fixed docstrings.

1c222ab

RandomDefaultUser requested a review from elcorto October 25, 2024 13:47

RandomDefaultUser added 3 commits October 25, 2024 16:13

Made DataScaler API consistent with sklearn

4312c07

Made interface more consistent with sklearn

f31f9b9

Also made partial_fit consistent with the sklearn, but have to test this in the CI to check that nothing breaks

dc6a8ff

RandomDefaultUser marked this pull request as ready for review October 25, 2024 14:28

Fixed docs

30768ee

elcorto reviewed Nov 8, 2024

RandomDefaultUser and others added 5 commits November 14, 2024 17:26

Update mala/datahandling/data_scaler.py

5765b48

Co-authored-by: Steve Schmerler <[email protected]>

Update mala/datahandling/data_scaler.py

1ddd3c7

Co-authored-by: Steve Schmerler <[email protected]>

Fixed dimensions as given in docstrings, added array check in DataScaler

c0f80ff

Merge remote-tracking branch 'fork_lenz/fix_data_scaling' into fix_data_scaling

993ba68

Fixed pipeline

c1dce0c

elcorto requested changes Nov 22, 2024

RandomDefaultUser and others added 4 commits November 22, 2024 18:38

Update mala/datahandling/data_scaler.py

d7087d3

Co-authored-by: Steve Schmerler <[email protected]>

Update mala/datahandling/data_scaler.py

ec4777b

Co-authored-by: Steve Schmerler <[email protected]>

Corrected (x,y,z) to (d) in two places

608ba39

Merge branch 'develop' into fix_data_scaling

1525529

Added note about propagating changes

a98830b

elcorto approved these changes Nov 22, 2024

RandomDefaultUser merged commit 6cb1d3f into mala-project:develop Nov 25, 2024
5 checks passed

RandomDefaultUser deleted the fix_data_scaling branch November 25, 2024 09:27
