[Enhancement] Better Regularization for Categorical features #1934
It seems the current split-finding algorithm for categorical features often results in over-fitting. We need a better solution to reduce the over-fitting.

Code of the current solution:
The process:
Current Regularizations:
New Regularizations (proposal):
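For context, here is a minimal sketch of the Fisher-style many-vs-many split search that LightGBM's documentation describes for categorical features with many categories: sort the categories by their gradient statistics, then scan the sorted order as if it were an ordinal feature. This is an illustration under stated assumptions, not the actual implementation; the function name is made up, and the exact formulas for `cat_smooth` and `cat_l2` (two of the existing regularization parameters) are approximate.

```python
# Illustrative sketch (not LightGBM's actual code) of the many-vs-many
# categorical split search: order categories by a smoothed average gradient,
# then scan the ordered prefix sums for the best L2-regularized split.

def best_categorical_split(grad_sum, hess_sum, cat_smooth=10.0, cat_l2=10.0):
    """grad_sum / hess_sum: dicts mapping category -> summed gradient / hessian."""
    # Fisher-style ordering: sort categories by smoothed average gradient.
    order = sorted(grad_sum, key=lambda c: grad_sum[c] / (hess_sum[c] + cat_smooth))
    total_g = sum(grad_sum.values())
    total_h = sum(hess_sum.values())

    def leaf_score(g, h):
        return g * g / (h + cat_l2)  # L2-regularized leaf objective

    best_gain, best_set = 0.0, None
    g_left = h_left = 0.0
    for i, cat in enumerate(order[:-1]):
        g_left += grad_sum[cat]
        h_left += hess_sum[cat]
        gain = (leaf_score(g_left, h_left)
                + leaf_score(total_g - g_left, total_h - h_left)
                - leaf_score(total_g, total_h))
        if gain > best_gain:
            best_gain, best_set = gain, set(order[:i + 1])
    return best_gain, best_set
```

Because the categories are re-sorted by their target statistics before scanning, the search effectively optimizes over orderings of the feature, which is the source of over-fitting discussed in the comments below.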
Your idea is very welcome here 😄.
The first thing that comes to my mind after the word "categorical" is CatBoost, with its target encoding and feature interactions: https://tech.yandex.com/catboost/doc/dg/concepts/algorithm-main-stages_cat-to-numberic-docpage/, https://tech.yandex.com/catboost/doc/dg/concepts/speed-up-training-docpage/#max-ctr-, and https://tech.yandex.com/catboost/doc/dg/concepts/cli-reference_train-model-docpage/#cli-reference_train-model__options (see the CTR settings sections).
As I remember, CatBoost's solutions are not tied to the GBDT algorithm itself, as they preprocess the features. Thus, these solutions could be used in other GBDT tools as well.
That's right. To be honest, I don't understand the current approach very well; however, I think fewer parameters are better, because more parameters lead to overfitting.
As you know, CART is one of the best-known algorithms, and it is the one used in scikit-learn (see https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart). However, scikit-learn doesn't support categorical variables 😢
In some experiments with private data, the regularization hyperparameters for the categorical variables (
If you look at https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html, it says "Before each split is selected in the tree...", so it's not really a preprocessing step: the encoding is performed before each split, so it is something that needs to be implemented inside the GBDT/RF algorithm itself. Because the encoding is computed on different data splits, this reduces overfitting.
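To make the "ordered" target-statistics idea from the linked docs concrete, here is a minimal sketch under stated assumptions: the function name, smoothing parameters, and the single random permutation are illustrative, not CatBoost's actual code (CatBoost averages over several permutations and recomputes the statistics during tree construction).

```python
import numpy as np

def ordered_target_encoding(categories, targets, prior=0.5, weight=1.0):
    """Encode each row using only the targets of rows that come *earlier*
    in a random permutation and share the same category, so a row's own
    target never leaks into its encoding."""
    n = len(categories)
    perm = np.random.permutation(n)
    sums, counts = {}, {}
    encoded = np.empty(n)
    for idx in perm:  # process rows in a random order
        c = categories[idx]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        encoded[idx] = (s + prior * weight) / (k + weight)  # smoothed running mean
        sums[c] = s + targets[idx]
        counts[c] = k + 1
    return encoded
```

The first occurrence of each category gets the smoothed prior, and later occurrences get a running average of earlier targets only; that per-row exclusion is where the overfitting reduction comes from.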
I think one of the reasons for overfitting on categorical features is that the categories are ordered by weight before finding splits. This means that the gain from a split on a categorical feature will in general be larger than for a numerical feature, so categorical features are disproportionately likely to be chosen for a split versus numerical features. For example, if we make only one split in some categorical feature (after this ordering has been done), then the feature is already optimally divided into top L2 regularization (
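To make this mechanism concrete, here is a small self-contained experiment (illustrative, not from the thread): on a pure-noise target, the best single split after sorting categories by their target means finds a much larger spurious gain than the same search under an arbitrary fixed encoding.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=1000)             # pure-noise target: no real signal
cat = rng.integers(0, 20, size=1000)  # 20-category feature, unrelated to y

def best_split_gain(order, cat, y):
    """Best variance reduction from one threshold split after mapping
    categories to positions via the given ordering."""
    x = np.argsort(order)[cat]        # position of each row's category
    base = y.var() * len(y)           # total SSE around the global mean
    best = 0.0
    for t in range(1, len(order)):
        left, right = y[x < t], y[x >= t]
        if len(left) and len(right):
            gain = base - left.var() * len(left) - right.var() * len(right)
            best = max(best, gain)
    return best

means = np.array([y[cat == c].mean() for c in range(20)])
sorted_gain = best_split_gain(np.argsort(means), cat, y)  # categories sorted by target mean
fixed_gain = best_split_gain(np.arange(20), cat, y)       # arbitrary fixed encoding
print(sorted_gain, fixed_gain)  # the sorted ordering finds a larger spurious gain
```

Sorting by target mean lets the split search pick the best of many implicit permutations, so even with no signal the apparent gain is inflated.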
@btrotta I agree with the "overfitting categorical features" part, and I also think it's because we already optimize the ordering for splitting. But the example is not clear yet. Comparing "ordering categorical features" with a "random integer encoding of the categories" is somewhat unfair; it should be a comparison between "ordering categorical features" and some true numerical features. I think numerical features themselves carry some inherent "ordering". For example, with an "age" feature, sorting by age makes sense and could be considered a form of ordering, though it is not optimized the way the categorical ordering is.
@chaupmcs Yes, that's a fair point. A true numerical feature would probably have bin weights with longer monotone intervals and fewer turning points, compared to a random encoding of a categorical feature. But still, unless the numerical feature is completely monotone, the categorical feature offers easier gains from splitting.
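A quick way to see the "turning points" contrast is the following sketch (my own illustration, with made-up data and an assumed smooth relationship between feature and target): bin a numerical feature in its natural order, then randomly relabel the same bins as if they were unordered categories, and count sign changes in the per-bin target means.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 5000)
y = np.sin(3 * x) + rng.normal(scale=0.1, size=5000)  # smooth dependence on x

# 20 equal-frequency bins over the numerical feature, in natural order.
edges = np.quantile(x, np.linspace(0, 1, 21))
idx = np.clip(np.digitize(x, edges[1:-1]), 0, 19)
num_means = np.array([y[idx == b].mean() for b in range(20)])

# The same bins under a random relabelling, as if they were unordered categories.
cat_means = num_means[rng.permutation(20)]

def turning_points(means):
    """Count sign changes in the consecutive differences of bin statistics."""
    d = np.sign(np.diff(means))
    d = d[d != 0]
    return int(np.sum(d[1:] != d[:-1]))

print(turning_points(num_means), turning_points(cat_means))
# The natural numerical ordering typically yields far fewer turning points.
```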
Closing in favor of #2302; we decided to keep all feature requests in one place. You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.