
BoostAGroota works wrong with set_config(transform_output="pandas") #18

Closed
Tialo opened this issue May 3, 2023 · 1 comment · Fixed by #20

Tialo commented May 3, 2023

Hello, I've noticed that if you use set_config(transform_output="pandas"), the BoostAGroota.transform method works incorrectly: it shuffles the columns of the pandas DataFrame that remains after feature selection.

Here is a code snippet to reproduce the problem.

import warnings
warnings.filterwarnings('ignore')

from sklearn import set_config
from lightgbm import LGBMRegressor

import arfs.feature_selection.allrelevant as arfsgroot
from arfs.utils import load_data

set_config(transform_output='pandas')

boston = load_data(name="Boston")
X, y = boston.data, boston.target

fs = arfsgroot.BoostAGroota(LGBMRegressor(n_estimators=1, random_state=42))
X_transformed = fs.fit_transform(X, y)

print(X)
print(X_transformed)

As you will see, the column CRIM contains the values that were originally in the column AGE.

Requirements used in the code:

arfs==1.0.7
bleach==6.0.0
bokeh==2.4.3
certifi==2022.12.7
charset-normalizer==3.1.0
cloudpickle==2.2.1
colorcet==3.0.1
contourpy==1.0.7
cycler==0.11.0
fonttools==4.39.3
holoviews==1.15.4
idna==3.4
importlib-metadata==6.6.0
importlib-resources==5.12.0
Jinja2==3.1.2
joblib==1.2.0
kiwisolver==1.4.4
lightgbm==3.3.3
llvmlite==0.40.0
Markdown==3.4.3
MarkupSafe==2.1.2
matplotlib==3.7.1
numba==0.57.0
numpy==1.21.6
packaging==23.1
pandas==1.5.1
panel==0.14.4
param==1.13.0
Pillow==9.5.0
pyct==0.5.0
pyparsing==3.0.9
python-dateutil==2.8.2
pytz==2023.3
pyviz-comms==2.2.1
PyYAML==6.0
requests==2.30.0
scikit-learn==1.2.0
scipy==1.8.1
seaborn==0.12.2
shap==0.41.0
six==1.16.0
slicer==0.0.7
threadpoolctl==3.1.0
tornado==6.3.1
tqdm==4.65.0
typing_extensions==4.5.0
tzdata==2023.3
urllib3==2.0.1
webencodings==0.5.1
zipp==3.15.0

I also tried to understand why this happens and figured out that the behavior is caused by the implementation of the transform method.

As you can see, using return X[self.selected_features_] behaves strangely:

import pandas as pd
import numpy as np

from sklearn import set_config
from sklearn.feature_selection._base import SelectorMixin
from sklearn.base import BaseEstimator

set_config(transform_output="pandas")

class FeatureSelector_with_shuffled_output(SelectorMixin, BaseEstimator):
    def fit(self, X, y):
        self.feature_names_in_ = X.columns.to_numpy()
        random = np.random.RandomState(44)
        self.selected_features_ = random.choice(X.columns, X.shape[1] // 2, replace=False)
        self.support_ = np.array([c in self.selected_features_ for c in X.columns])
        return self
    
    def _get_support_mask(self):
        return self.support_
    
    def transform(self, X):
        if not isinstance(X, pd.DataFrame):
            raise ValueError("X needs to be pandas.DataFrame")
        return X[self.selected_features_]
    

class FeatureSelector(SelectorMixin, BaseEstimator):
    def fit(self, X, y):
        self.feature_names_in_ = X.columns.to_numpy()
        random = np.random.RandomState(44)
        self.selected_features_ = random.choice(X.columns, X.shape[1] // 2, replace=False)
        self.support_ = np.array([c in self.selected_features_ for c in X.columns])
        return self
    
    def _get_support_mask(self):
        return self.support_
    
    
X = pd.DataFrame({
    "a": np.random.randint(50, 100, 10),
    "b": np.random.randint(10, 20, 10),
    "c": np.random.randint(-100, -50, 10),
    "d": np.random.randint(-10, 0, 10)
})
y = pd.Series(np.random.rand(10))

fs = FeatureSelector()
print(fs.fit_transform(X, y))

fsw = FeatureSelector_with_shuffled_output()
print(fsw.fit_transform(X, y))

print(X)

Hope this is helpful! If you have any questions, I am open to discussing it further or adding more information.


Tialo commented May 4, 2023

I figured out why exactly this happens.
The transform method is wrapped by a special function in sklearn, in the class _SetOutputMixin, which is a base class of TransformerMixin, which is a base class of SelectorMixin, which is a base class of BoostAGroota. You can find the function _wrap_data_with_container in sklearn/utils/_set_output.py.

If you use set_config(transform_output="pandas"), the wrapper puts your data into a pandas DataFrame. It does this as follows: it takes the data returned by your transformer, which has shuffled columns, and assigns the column names from estimator.get_feature_names_out, which are not shuffled.
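The mismatch can be reproduced in isolation. This is a simplified sketch of what the wrapper effectively does (an assumption based on the description above, not the actual sklearn code): the values come from transform in shuffled order, but they get relabeled with the unshuffled names, so values end up under the wrong labels.

```python
import pandas as pd

X = pd.DataFrame({"CRIM": [1, 2], "AGE": [30, 40]})

# transform() returns the columns in a shuffled order...
shuffled = X[["AGE", "CRIM"]]

# ...but the set_output wrapper relabels the raw values with the names from
# get_feature_names_out(), which are in the original, unshuffled order:
wrapped = pd.DataFrame(shuffled.to_numpy(), columns=["CRIM", "AGE"], index=X.index)

print(wrapped["CRIM"].tolist())  # [30, 40] -- these are actually the AGE values
```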

  1. Why the columns from estimator.get_feature_names_out are not shuffled:

    if isinstance(X, pd.DataFrame):
        self.feature_names_in_ = X.columns.to_numpy()
    else:
        raise TypeError("X is not a dataframe")
    ...
    self.selected_features_ = self.selected_features_.values
    self.support_ = np.asarray(
        [
            True if c in self.selected_features_ else False
            for c in self.feature_names_in_
        ]
    )

    This code in BoostAGroota.fit assigns self.feature_names_in_, which is not shuffled, so self.support_ is a boolean mask over the unshuffled columns.
    SelectorMixin.get_feature_names_out then returns input_features[self.get_support()], still unshuffled.

  2. Why the columns from your transformer are shuffled:
    The line new_x, obj_feat, cat_idx = get_pandas_cat_codes(X) in the function _BoostARoota shuffles new_x. This happens because get_pandas_cat_codes does X = pd.concat([X[X.columns.difference(obj_feat)], cat], axis=1): it takes the numerical features and concatenates them with the categorical ones, which breaks the original column order.
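The reordering from that concat pattern is easy to see on a toy frame. Note that Index.columns.difference additionally sorts the remaining column names alphabetically, which compounds the problem:

```python
import pandas as pd

# Toy frame with an object column in the middle of the numeric ones.
X = pd.DataFrame({"CRIM": [0.1], "CHAS": ["yes"], "AGE": [65.2]})
obj_feat = ["CHAS"]

cat = X[obj_feat]
# columns.difference() returns the numerical columns sorted alphabetically,
# then the categorical columns are appended at the end:
reordered = pd.concat([X[X.columns.difference(obj_feat)], cat], axis=1)

print(list(reordered.columns))  # ['AGE', 'CRIM', 'CHAS'] -- original order lost
```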

How to fix it:

def get_pandas_cat_codes(X):
    dtypes_dic = create_dtype_dict(X, dic_keys="dtypes")
    obj_feat = dtypes_dic["cat"] + dtypes_dic["time"] + dtypes_dic["unk"]

    if obj_feat:
        for obj_column in obj_feat:
            column = X[obj_column].astype("str").astype("category")
            # performs label encoding
            _, inverse = np.unique(column, return_inverse=True)
            X[obj_column] = inverse
        cat_idx = [X.columns.get_loc(col) for col in obj_feat]
    else:
        obj_feat = None
        cat_idx = None

    return X, obj_feat, cat_idx

This change will not only fix my issue, it will also make the output keep the original column order.
Or any other workaround that solves the issue would work too, thanks!
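A quick self-contained check that the in-place encoding above preserves column order. Since create_dtype_dict is internal to arfs, this sketch substitutes a simplified object/category-column detection (an assumption, for illustration only); the encoding loop is the same as in the proposed fix:

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"CRIM": [0.1, 0.2], "CHAS": ["a", "b"], "AGE": [65.2, 78.9]})
original_order = list(X.columns)

# Simplified stand-in for create_dtype_dict's categorical detection:
obj_feat = list(X.select_dtypes(include=["object", "category"]).columns)
for obj_column in obj_feat:
    column = X[obj_column].astype("str").astype("category")
    # label-encode in place, so no column is moved
    _, inverse = np.unique(column, return_inverse=True)
    X[obj_column] = inverse

print(list(X.columns) == original_order)  # True
```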
