I'm interested in having support for cost-efficient gradient boosting (CEGB) in XGBoost. Glossing over non-essential details, CEGB is the application of a GreedyMiser-like strategy to second-order, XGBoost-style optimization. LightGBM already supports this under the parameters beginning with "cegb" (see https://lightgbm.readthedocs.io/en/latest/Parameters.html).
To summarize the idea: we maintain a set of indicator variables recording whether each feature has already been used somewhere in the forest trained so far. Whenever a feature that has not yet been used anywhere in the forest is considered for a split, it has to pay a penalty (which can be specified per-feature or shared across all features) that effectively lowers the gain of the split. The goal is to nudge the trees towards re-using features they have already used, unless the unused features yield significantly better splits.
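To make the mechanism concrete, here is a minimal sketch of the split-selection rule I have in mind (the names and structure are purely illustrative, not XGBoost internals):

```python
# Illustrative sketch of the feature-use penalty during split finding.
# `used_features` is the set of feature indices already used anywhere in the forest;
# `penalty` maps feature index -> penalty for introducing that feature.

def penalized_gain(gain, feature, used_features, penalty):
    """Subtract the per-feature penalty if the feature has not been used yet."""
    if feature in used_features:
        return gain
    return gain - penalty.get(feature, 0.0)

def pick_best_split(candidate_splits, used_features, penalty):
    """candidate_splits: list of (feature, threshold, gain) tuples for one node."""
    best = max(
        candidate_splits,
        key=lambda s: penalized_gain(s[2], s[0], used_features, penalty),
    )
    feature, threshold, gain = best
    used_features.add(feature)  # once used, the feature is never penalized again
    return feature, threshold, gain
```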
The purpose of this feature is to enable the training of models that are parsimonious in terms of how many features they use. In industry, models using thousands of features are not uncommon, but many of the features likely have low importance and could be removed without much impact. Models that use fewer features are cheaper to serve in production and require smaller payloads, reducing networking and data marshaling costs.
There are many ways to do feature selection, including methods that operate independently of the model, but XGBoost itself is a decent feature selector. One approach to feature selection can therefore look like this: 1) train an XGBoost model on all features; 2) get the feature importances; 3) train another XGBoost model with only the top-K features; 4) check that the new model is not much worse than the model trained on all features (sketched below). However, this feature selection strategy is not ideal, because highly-correlated but important features are all likely to be used during training. With the proposed feature-use penalty, if features A and B are correlated and only A has been used so far, then B will only be used if gain(best split of B) > gain(best split of A) + penalty_term. So training with this feature turns XGBoost into a more powerful feature selector, on a fairly sound theoretical basis: a feature is admitted only when it contributes a reduction in loss that cannot be captured by the features already in use.
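For concreteness, that baseline top-K workflow could look roughly like the following (a sketch only; the parameter values and K are placeholders, and X is assumed to be a NumPy array whose columns match feature_names):

```python
import xgboost as xgb

def top_k_retrain(X, y, feature_names, k=50, params=None, rounds=200):
    """Baseline feature selection: train on all features, keep the top-K
    by gain importance, and retrain on that subset."""
    params = params or {"objective": "reg:squarederror", "max_depth": 6}

    # 1) Train an XGBoost model on all features.
    dtrain = xgb.DMatrix(X, label=y, feature_names=feature_names)
    full_model = xgb.train(params, dtrain, num_boost_round=rounds)

    # 2) Get the feature importance (total gain per feature).
    gain = full_model.get_score(importance_type="gain")
    top_k = sorted(gain, key=gain.get, reverse=True)[:k]
    top_idx = [feature_names.index(f) for f in top_k]

    # 3) Train another model with only the top-K features.
    dtrain_k = xgb.DMatrix(X[:, top_idx], label=y, feature_names=top_k)
    small_model = xgb.train(params, dtrain_k, num_boost_round=rounds)

    # 4) The caller compares the two models on a held-out set.
    return full_model, small_model, top_k
```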
I want to get feedback on this idea:
If I were to contribute this change, and it could be done neatly in the code, would it be accepted?
Is there any reason this would be especially difficult to implement? It is straightforward in a naive sequential implementation of XGBoost training, but I'm not sure whether maintaining state like these indicator variables would prove difficult in a distributed context.
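To illustrate why I hope the distributed case is manageable (purely a sketch, not based on XGBoost's actual distributed internals): the only extra state is a per-feature "already used" bitmask, which each worker can maintain locally and reconcile with a logical-OR reduction, similar in spirit to how split statistics are already synchronized.

```python
import numpy as np

# Each worker tracks which features it has used locally (one bool per feature).
worker_masks = [
    np.array([True, False, False, True]),   # worker 0
    np.array([True, False, True, False]),   # worker 1
    np.array([False, False, False, True]),  # worker 2
]

# A logical-OR reduction gives every worker the global "already used" set.
global_used = np.logical_or.reduce(worker_masks)
print(global_used)  # [ True False  True  True]
```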
Thank you for raising the issue and for the very nice summary. I will look into the paper (I think I have read it before but can't recall the details).
> If I were to contribute this change, and it could be done neatly in the code, would it be accepted?
I think so. Will update this once I go through the paper and see if there's an easy way to do it, at least for one tree method and for one hardware architecture.
> Is there any reason this would be especially difficult to implement?
I can't answer that until I fully understand the algorithm.