[Enhancement] Better Regularization for Categorical features #1934
It seems the current split-finding algorithm for categorical features often results in over-fitting. We need a better solution to reduce the over-fitting.

Code of the current solution:
The process:
Current Regularizations:
New Regularizations (proposal):
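For context, here is a minimal sketch of the Fisher-style many-vs-many split search that LightGBM's documentation describes for categorical features with many categories: sort the categories by their gradient statistics, then scan the sorted order as if it were an ordinal feature. This is an illustration under stated assumptions, not the actual implementation; the function name is made up, and the exact formulas for `cat_smooth` and `cat_l2` (two of the existing regularization parameters) are approximate.

```python
# Illustrative sketch (not LightGBM's actual code) of the many-vs-many
# categorical split search: order categories by a smoothed average gradient,
# then scan the ordered prefix sums for the best L2-regularized split.

def best_categorical_split(grad_sum, hess_sum, cat_smooth=10.0, cat_l2=10.0):
    """grad_sum / hess_sum: dicts mapping category -> summed gradient / hessian."""
    # Fisher-style ordering: sort categories by smoothed average gradient.
    order = sorted(grad_sum, key=lambda c: grad_sum[c] / (hess_sum[c] + cat_smooth))
    total_g = sum(grad_sum.values())
    total_h = sum(hess_sum.values())

    def leaf_score(g, h):
        return g * g / (h + cat_l2)  # L2-regularized leaf objective

    best_gain, best_set = 0.0, None
    g_left = h_left = 0.0
    for i, cat in enumerate(order[:-1]):
        g_left += grad_sum[cat]
        h_left += hess_sum[cat]
        gain = (leaf_score(g_left, h_left)
                + leaf_score(total_g - g_left, total_h - h_left)
                - leaf_score(total_g, total_h))
        if gain > best_gain:
            best_gain, best_set = gain, set(order[:i + 1])
    return best_gain, best_set
```

Because the categories are re-sorted by their target statistics before scanning, the search effectively optimizes over orderings of the feature, which is the source of over-fitting discussed in the comments below.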
Your idea is very welcome here 😄.
The first thing that comes to my mind after the word "categorical" is CatBoost, with its target encoding and feature interactions: https://tech.yandex.com/catboost/doc/dg/concepts/algorithm-main-stages_cat-to-numberic-docpage/, https://tech.yandex.com/catboost/doc/dg/concepts/speed-up-training-docpage/#max-ctr-, and https://tech.yandex.com/catboost/doc/dg/concepts/cli-reference_train-model-docpage/#cli-reference_train-model__options (see the CTR settings sections).
As I remember, CatBoost's solutions are not tied to the GBDT algorithm itself, as they preprocess the features. Thus, these solutions could be used in other GBDT tools as well.
That's right. To be honest, I don't understand the current approach very well; however, I think fewer parameters are better, because more parameters lead to overfitting.
As you know, CART is one of the best-known algorithms, and it is the one used in scikit-learn (see https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart). However, scikit-learn doesn't support categorical variables 😢
In some experiments with private data, the regularization hyperparameters for the categorical variables (
If you look at https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html, it says "Before each split is selected in the tree...", so it's not really a preprocessing step: the encoding is performed before each split, so it is something that needs to be implemented inside the GBDT/RF algorithm itself. Because the encoding is computed on different data splits, this reduces overfitting.
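To make the "ordered" target-statistics idea from the linked docs concrete, here is a minimal sketch under stated assumptions: the function name, smoothing parameters, and the single random permutation are illustrative, not CatBoost's actual code (CatBoost averages over several permutations and recomputes the statistics during tree construction).

```python
import numpy as np

def ordered_target_encoding(categories, targets, prior=0.5, weight=1.0):
    """Encode each row using only the targets of rows that come *earlier*
    in a random permutation and share the same category, so a row's own
    target never leaks into its encoding."""
    n = len(categories)
    perm = np.random.permutation(n)
    sums, counts = {}, {}
    encoded = np.empty(n)
    for idx in perm:  # process rows in a random order
        c = categories[idx]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        encoded[idx] = (s + prior * weight) / (k + weight)  # smoothed running mean
        sums[c] = s + targets[idx]
        counts[c] = k + 1
    return encoded
```

The first occurrence of each category gets the smoothed prior, and later occurrences get a running average of earlier targets only; that per-row exclusion is where the overfitting reduction comes from.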
I think one of the reasons for overfitting on categorical features is that the categories are ordered by weight before finding splits. This means that the gain from a split on a categorical feature will in general be larger than for a numerical feature, so categorical features are disproportionately likely to be chosen for a split versus numerical features. For example, if we make only one split in some categorical feature (after this ordering has been done), then the feature is already optimally divided into top L2 regularization (
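To make this mechanism concrete, here is a small self-contained experiment (illustrative, not from the thread): on a pure-noise target, the best single split after sorting categories by their target means finds a much larger spurious gain than the same search under an arbitrary fixed encoding.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=1000)             # pure-noise target: no real signal
cat = rng.integers(0, 20, size=1000)  # 20-category feature, unrelated to y

def best_split_gain(order, cat, y):
    """Best variance reduction from one threshold split after mapping
    categories to positions via the given ordering."""
    x = np.argsort(order)[cat]        # position of each row's category
    base = y.var() * len(y)           # total SSE around the global mean
    best = 0.0
    for t in range(1, len(order)):
        left, right = y[x < t], y[x >= t]
        if len(left) and len(right):
            gain = base - left.var() * len(left) - right.var() * len(right)
            best = max(best, gain)
    return best

means = np.array([y[cat == c].mean() for c in range(20)])
sorted_gain = best_split_gain(np.argsort(means), cat, y)  # categories sorted by target mean
fixed_gain = best_split_gain(np.arange(20), cat, y)       # arbitrary fixed encoding
print(sorted_gain, fixed_gain)  # the sorted ordering finds a larger spurious gain
```

Sorting by target mean lets the split search pick the best of many implicit permutations, so even with no signal the apparent gain is inflated.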
@btrotta I agree with the "overfitting categorical features" part, and I also think it's because we already optimize the ordering for splitting. But the example is not clear yet. Comparing "ordering categorical features" with a "random integer encoding of the categories" is somewhat unfair; it should be a comparison between "ordering categorical features" and some true numerical features. I think numerical features themselves carry some inherent "ordering". For example, with an "age" feature, sorting by age makes sense and could be considered a form of ordering, though it is not optimized the way the categorical ordering is.
@chaupmcs Yes, that's a fair point. A true numerical feature would probably have bin weights with longer monotone intervals and fewer turning points, compared to a random encoding of a categorical feature. But still, unless the numerical feature is completely monotone, the categorical feature offers easier gains from splitting.
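A quick way to see the "turning points" contrast is the following sketch (my own illustration, with made-up data and an assumed smooth relationship between feature and target): bin a numerical feature in its natural order, then randomly relabel the same bins as if they were unordered categories, and count sign changes in the per-bin target means.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 5000)
y = np.sin(3 * x) + rng.normal(scale=0.1, size=5000)  # smooth dependence on x

# 20 equal-frequency bins over the numerical feature, in natural order.
edges = np.quantile(x, np.linspace(0, 1, 21))
idx = np.clip(np.digitize(x, edges[1:-1]), 0, 19)
num_means = np.array([y[idx == b].mean() for b in range(20)])

# The same bins under a random relabelling, as if they were unordered categories.
cat_means = num_means[rng.permutation(20)]

def turning_points(means):
    """Count sign changes in the consecutive differences of bin statistics."""
    d = np.sign(np.diff(means))
    d = d[d != 0]
    return int(np.sum(d[1:] != d[:-1]))

print(turning_points(num_means), turning_points(cat_means))
# The natural numerical ordering typically yields far fewer turning points.
```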
Closing in favor of #2302; we decided to keep all feature requests in one place. You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.