
BoostAGroota works wrong with set_config(transform_output="pandas") #18

Closed
Tialo opened this issue May 3, 2023 · 1 comment · Fixed by #20

Tialo commented May 3, 2023

Hello, I've noticed that if you use set_config(transform_output="pandas"), the BoostAGroota.transform method works incorrectly: it shuffles the columns of the pandas DataFrame that remains after feature selection.

Here is a code snippet to reproduce the problem.

import warnings
warnings.filterwarnings('ignore')

from sklearn import set_config
from lightgbm import LGBMRegressor

import arfs.feature_selection.allrelevant as arfsgroot
from arfs.utils import load_data

set_config(transform_output='pandas')

boston = load_data(name="Boston")
X, y = boston.data, boston.target

fs = arfsgroot.BoostAGroota(LGBMRegressor(n_estimators=1, random_state=42))
X_transformed = fs.fit_transform(X, y)

print(X)
print(X_transformed)

As you will see, the column CRIM contains the values that were originally in the column AGE.

Requirements used in the code:

arfs==1.0.7
bleach==6.0.0
bokeh==2.4.3
certifi==2022.12.7
charset-normalizer==3.1.0
cloudpickle==2.2.1
colorcet==3.0.1
contourpy==1.0.7
cycler==0.11.0
fonttools==4.39.3
holoviews==1.15.4
idna==3.4
importlib-metadata==6.6.0
importlib-resources==5.12.0
Jinja2==3.1.2
joblib==1.2.0
kiwisolver==1.4.4
lightgbm==3.3.3
llvmlite==0.40.0
Markdown==3.4.3
MarkupSafe==2.1.2
matplotlib==3.7.1
numba==0.57.0
numpy==1.21.6
packaging==23.1
pandas==1.5.1
panel==0.14.4
param==1.13.0
Pillow==9.5.0
pyct==0.5.0
pyparsing==3.0.9
python-dateutil==2.8.2
pytz==2023.3
pyviz-comms==2.2.1
PyYAML==6.0
requests==2.30.0
scikit-learn==1.2.0
scipy==1.8.1
seaborn==0.12.2
shap==0.41.0
six==1.16.0
slicer==0.0.7
threadpoolctl==3.1.0
tornado==6.3.1
tqdm==4.65.0
typing_extensions==4.5.0
tzdata==2023.3
urllib3==2.0.1
webencodings==0.5.1
zipp==3.15.0

I also tried to understand why this happens and figured out that the behavior is caused by the implementation of the transform method.

As you can see, using return X[self.selected_features_] behaves strangely:

import pandas as pd
import numpy as np

from sklearn import set_config
from sklearn.feature_selection._base import SelectorMixin
from sklearn.base import BaseEstimator

set_config(transform_output="pandas")

class FeatureSelector_with_shuffled_output(SelectorMixin, BaseEstimator):
    def fit(self, X, y):
        self.feature_names_in_ = X.columns.to_numpy()
        random = np.random.RandomState(44)
        self.selected_features_ = random.choice(X.columns, X.shape[1] // 2, replace=False)
        self.support_ = np.array([c in self.selected_features_ for c in X.columns])
        return self
    
    def _get_support_mask(self):
        return self.support_
    
    def transform(self, X):
        if not isinstance(X, pd.DataFrame):
            raise ValueError("X needs to be pandas.DataFrame")
        return X[self.selected_features_]
    

class FeatureSelector(SelectorMixin, BaseEstimator):
    def fit(self, X, y):
        self.feature_names_in_ = X.columns.to_numpy()
        random = np.random.RandomState(44)
        self.selected_features_ = random.choice(X.columns, X.shape[1] // 2, replace=False)
        self.support_ = np.array([c in self.selected_features_ for c in X.columns])
        return self
    
    def _get_support_mask(self):
        return self.support_
    
    
X = pd.DataFrame({
    "a": np.random.randint(50, 100, 10),
    "b": np.random.randint(10, 20, 10),
    "c": np.random.randint(-100, -50, 10),
    "d": np.random.randint(-10, 0, 10)
})
y = pd.Series(np.random.rand(10))

fs = FeatureSelector()
print(fs.fit_transform(X, y))

fsw = FeatureSelector_with_shuffled_output()
print(fsw.fit_transform(X, y))

print(X)

Hope this is helpful! If you have any questions, I am open to discussing it further or adding more information.


Tialo commented May 4, 2023

I figured out why exactly this happens.
The transform method is wrapped by a special function in sklearn, in the class _SetOutputMixin, which is a base class of TransformerMixin, which is a base class of SelectorMixin, which is a base class of BoostAGroota. You can find the function _wrap_data_with_container in sklearn/utils/_set_output.py.

If you use set_config(transform_output="pandas"), the wrapper puts your data into a pandas DataFrame. It does this as follows: it takes the data returned by your transformer, which has shuffled columns, and assigns the column names from estimator.get_feature_names_out, which are not shuffled.
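The mismatch can be reproduced in isolation. This is a simplified sketch of what the wrapper effectively does (an assumption based on the description above, not the actual sklearn code): the values come from transform in shuffled order, but they get relabeled with the unshuffled names, so values end up under the wrong labels.

```python
import pandas as pd

X = pd.DataFrame({"CRIM": [1, 2], "AGE": [30, 40]})

# transform() returns the columns in a shuffled order...
shuffled = X[["AGE", "CRIM"]]

# ...but the set_output wrapper relabels the raw values with the names from
# get_feature_names_out(), which are in the original, unshuffled order:
wrapped = pd.DataFrame(shuffled.to_numpy(), columns=["CRIM", "AGE"], index=X.index)

print(wrapped["CRIM"].tolist())  # [30, 40] -- these are actually the AGE values
```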

  1. Why the columns from estimator.get_feature_names_out are not shuffled:

    if isinstance(X, pd.DataFrame):
        self.feature_names_in_ = X.columns.to_numpy()
    else:
        raise TypeError("X is not a dataframe")
    ...
    self.selected_features_ = self.selected_features_.values
    self.support_ = np.asarray(
        [
            True if c in self.selected_features_ else False
            for c in self.feature_names_in_
        ]
    )

    This code in BoostAGroota.fit assigns self.feature_names_in_, which is not shuffled, so self.support_ is a boolean mask over the unshuffled columns.
    SelectorMixin.get_feature_names_out then returns input_features[self.get_support()], still unshuffled.

  2. Why the columns from your transformer are shuffled:
    The line new_x, obj_feat, cat_idx = get_pandas_cat_codes(X) in the function _BoostARoota shuffles new_x. This happens because get_pandas_cat_codes does X = pd.concat([X[X.columns.difference(obj_feat)], cat], axis=1): it takes the numerical features and concatenates them with the categorical ones, which breaks the original column order.
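The reordering from that concat pattern is easy to see on a toy frame. Note that Index.columns.difference additionally sorts the remaining column names alphabetically, which compounds the problem:

```python
import pandas as pd

# Toy frame with an object column in the middle of the numeric ones.
X = pd.DataFrame({"CRIM": [0.1], "CHAS": ["yes"], "AGE": [65.2]})
obj_feat = ["CHAS"]

cat = X[obj_feat]
# columns.difference() returns the numerical columns sorted alphabetically,
# then the categorical columns are appended at the end:
reordered = pd.concat([X[X.columns.difference(obj_feat)], cat], axis=1)

print(list(reordered.columns))  # ['AGE', 'CRIM', 'CHAS'] -- original order lost
```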

How to fix it:

def get_pandas_cat_codes(X):
    dtypes_dic = create_dtype_dict(X, dic_keys="dtypes")
    obj_feat = dtypes_dic["cat"] + dtypes_dic["time"] + dtypes_dic["unk"]

    if obj_feat:
        for obj_column in obj_feat:
            column = X[obj_column].astype("str").astype("category")
            # performs label encoding
            _, inverse = np.unique(column, return_inverse=True)
            X[obj_column] = inverse
        cat_idx = [X.columns.get_loc(col) for col in obj_feat]
    else:
        obj_feat = None
        cat_idx = None

    return X, obj_feat, cat_idx

This change will not only fix my issue, it will also make the output keep the original column order.
Or any other workaround that solves the issue would work too, thanks!
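A quick self-contained check that the in-place encoding above preserves column order. Since create_dtype_dict is internal to arfs, this sketch substitutes a simplified object/category-column detection (an assumption, for illustration only); the encoding loop is the same as in the proposed fix:

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"CRIM": [0.1, 0.2], "CHAS": ["a", "b"], "AGE": [65.2, 78.9]})
original_order = list(X.columns)

# Simplified stand-in for create_dtype_dict's categorical detection:
obj_feat = list(X.select_dtypes(include=["object", "category"]).columns)
for obj_column in obj_feat:
    column = X[obj_column].astype("str").astype("category")
    # label-encode in place, so no column is moved
    _, inverse = np.unique(column, return_inverse=True)
    X[obj_column] = inverse

print(list(X.columns) == original_order)  # True
```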
