Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address incosistencies in DataScaling infrastructure #598

Open
wants to merge 15 commits into
base: develop
Choose a base branch
from

Conversation

RandomDefaultUser
Copy link
Member

@RandomDefaultUser RandomDefaultUser commented Oct 25, 2024

This PR will close #482 and should also close #483.

@RandomDefaultUser RandomDefaultUser marked this pull request as ready for review October 25, 2024 14:28
@RandomDefaultUser
Copy link
Member Author

@elcorto I hope I did not miss anything, but I think this should address the issues you have raised. I would appreciate your opinion/review.

@elcorto
Copy link
Member

elcorto commented Oct 29, 2024

It seems like you are still working on this PR? If so, please let me know when it is ready for a review.

@RandomDefaultUser
Copy link
Member Author

It seems like you are still working on this PR? If so, please let me know when it is ready for a review.

As of today I was actually finished with the PR from my side, except of course if I missed something you wanted implemented?

@elcorto
Copy link
Member

elcorto commented Oct 29, 2024

It seems like you are still working on this PR? If so, please let me know when it is ready for a review.

As of today I was actually finished with the PR from my side, except of course if I missed something you wanted implemented?

I saw recent commits so I wanted to wait with the review, but OK then I'll look into it now.

Copy link
Member

@elcorto elcorto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have one large comment that I think should be discussed and two small changes, the rest looks good, thanks a lot.

Comment on lines 52 to 56
* ``feature-wise-standard``: Row Standardization (Scale to mean 0, standard deviation 1)
* ``feature-wise-standard``: Standardization (Scale to mean 0, standard
deviation 1) is applied to each feature dimension individually. I.e., if your
training data has dimensions (x,y,z,f), then each of the f rows with (x,y,z)
entries is scaled indiviually.
Copy link
Member

@elcorto elcorto Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in #482, DataScaler silently assumes a 2D array of shape (x * y * z, f) since it does torch.mean(unscaled, 0, ...), so unless I'm overlooking something obvious, I think this explanation is actually misleading.

Consider this:

from sklearn.preprocessing import StandardScaler
from mala.datahandling.data_scaler import DataScaler
import torch as T
import einops
import numpy as np
from icecream import ic

reshape_to_2d = lambda x: einops.rearrange(x, "x y z f -> (x y z) f")
T.set_default_dtype(T.float64)


# sklearn's StandardScaler uses a biased estimator np.std(..., ddof=0), while
# torch uses T.std(..., correction=1) and there is no way to change this in
# StandardScaler, so we roll our own.
class MyStandardScaler:
    @staticmethod
    def fit_transform(X):
        assert X.ndim == 2
        x_mean = X.mean(0)[None, :]
        x_std = np.std(X.numpy(), axis=0, ddof=1)[None, :]
        return (X - x_mean) / x_std


with T.no_grad():
    arr_4d = T.rand(2, 3, 4, 5)
    arr_2d = reshape_to_2d(arr_4d)

    ##arr_2d_scaled_skl = StandardScaler().fit_transform(arr_2d)
    arr_2d_scaled_skl = MyStandardScaler().fit_transform(arr_2d)

    scaler_mala = DataScaler("feature-wise-standard")
    arr_2d_scaled_mala = arr_2d.clone()
    scaler_mala.fit(arr_2d_scaled_mala)
    # in-place mod!
    scaler_mala.transform(arr_2d_scaled_mala)

    ic(arr_2d_scaled_skl[:5, :])
    ic(arr_2d_scaled_mala[:5, :])
    assert np.allclose(arr_2d_scaled_mala, arr_2d_scaled_skl, atol=1e-15)

    # ValueError: Found array with dim 4. StandardScaler expected <= 2.
    ##arr_2d_scaled_skl = StandardScaler().fit_transform(arr_2d)

    # This works but shouldn't
    scaler_mala = DataScaler("feature-wise-standard")
    arr_4d_scaled_mala = arr_4d.clone()
    scaler_mala.fit(arr_4d_scaled_mala)
    # in-place mod!
    scaler_mala.transform(arr_4d_scaled_mala)

    arr_2d_from_4d_scaled_mala = reshape_to_2d(arr_4d_scaled_mala)
    assert not np.allclose(
        arr_2d_from_4d_scaled_mala, arr_2d_scaled_skl, atol=1e-1
    )
    ic(arr_2d_from_4d_scaled_mala[:5, :])

results in:

ic| arr_2d_scaled_skl[:5, :]: tensor([[-0.8424, -0.3500, -1.3310,  0.6888, -0.2759],
                                      [-0.1266, -1.1469, -0.2332, -0.3831, -0.7441],
                                      [-1.4691,  1.4810,  0.2890,  0.3123, -1.2043],
                                      [ 1.5169, -0.5278, -1.4609,  2.1362,  1.1646],
                                      [-0.6087,  0.2314,  0.9764, -1.0315,  0.6824]])
ic| arr_2d_scaled_mala[:5, :]: tensor([[-0.8424, -0.3500, -1.3310,  0.6888, -0.2759],
                                       [-0.1266, -1.1469, -0.2332, -0.3831, -0.7441],
                                       [-1.4691,  1.4810,  0.2890,  0.3123, -1.2043],
                                       [ 1.5169, -0.5278, -1.4609,  2.1362,  1.1646],
                                       [-0.6087,  0.2314,  0.9764, -1.0315,  0.6824]])
ic| arr_2d_from_4d_scaled_mala[:5, :]: tensor([[ 0.7071, -0.7071, -0.7071,  0.7071,  0.7071],
                                               [-0.7071, -0.7071, -0.7071,  0.7071, -0.7071],
                                               [-0.7071,  0.7071, -0.7071, -0.7071, -0.7071],
                                               [ 0.7071, -0.7071, -0.7071,  0.7071,  0.7071],
                                               [ 0.7071,  0.7071,  0.7071, -0.7071, -0.7071]])

So the code doesn't fail when given a 4D array but produces wrong results. Therefore, all docs mentioning (x,y,z,f) should be adapted. Also an array dimension check is probably a good idea.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aaaaaah, I see what you mean now, sorry for misunderstanding. I have adapted the documentation and added an array check. Let me know if there is still something missing!

mala/datahandling/data_scaler.py Outdated Show resolved Hide resolved
mala/datahandling/data_scaler.py Outdated Show resolved Hide resolved
Comment on lines 58 to 61
* ``feature-wise-minmax``: Min-Max scaling (Scale to be in range 0...1) is
applied to each feature dimension individually. I.e., if your training data
has dimensions (x,y,z,f), then each of the f rows with (x,y,z) entries is
scaled indiviually.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

Comment on lines 582 to 589
standard deviation 1) is applied to each feature dimension
individually. I.e., if your training data has dimensions
(x,y,z,f), then each of the f rows with (x,y,z) entries is scaled
indiviually.
- "feature-wise-minmax": Row Min-Max scaling (Scale to be in range
0...1) is applied to each feature dimension individually.
I.e., if your training data has dimensions (x,y,z,f), then each
of the f rows with (x,y,z) entries is scaled indiviually.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

Comment on lines 605 to 611
individually. I.e., if your training data has dimensions
(x,y,z,f), then each of the f rows with (x,y,z) entries is scaled
indiviually.
- "feature-wise-minmax": Row Min-Max scaling (Scale to be in range
0...1) is applied to each feature dimension individually.
I.e., if your training data has dimensions (x,y,z,f), then each
of the f rows with (x,y,z) entries is scaled indiviually.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

Comment on lines 32 to 37
I.e., if your training data has dimensions (x,y,z,f), then each
of the f rows with (x,y,z) entries is scaled indiviually.
- "feature-wise-minmax": Min-Max scaling (Scale to be in range
0...1) is applied to each feature dimension individually.
I.e., if your training data has dimensions (x,y,z,f), then each
of the f rows with (x,y,z) entries is scaled indiviually.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

@RandomDefaultUser
Copy link
Member Author

I fixed the pipeline @elcorto after addressing your comments, could you let me know if this PR looks good from your side now?

Copy link
Member

@elcorto elcorto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the doc update in DataScaler and the 2d array check.

Two things that I stumbled upon:

  • There are still places (see https://github.com/mala-project/mala/pull/598/files) that mention (x,y,z,f) that you probably missed to catch, namely in docs/source/basic_usage/trainingmodel.rst and mala/common/parameters.py.

  • The docs in DataScaler and mala/common/parameters.py are identical. Maybe just keep them in one place instead, so for instance in parameters.py link to :class:~mala.datahandling.data_handler.DataScaler ?

mala/datahandling/data_scaler.py Outdated Show resolved Hide resolved
mala/datahandling/data_scaler.py Outdated Show resolved Hide resolved
@RandomDefaultUser
Copy link
Member Author

Thanks for the feedback and suggestions @elcorto , I implemented the changes and corrected (x,y,z) to (d) in the places you mentioned. I would keep the full doc strings in both the Parameters and DataScaler classes though. I think they are there for different audiences: a regular user who applies a model does not necessarily interact with the DataScaler class, so they need to be in the Parameters class. If a developer however wants to adapt the DataScaler code the proper doc string should be in the right place. I will add a note that changes in DataScaler should be propagated to the Parameters class though.

Copy link
Member

@elcorto elcorto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK then we're all set, I guess. Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Data scaling: API and documentation Data scaling: row vs. column and naming of methods
2 participants