[FEA] Add support for computing feature_importances in RF #3531
Comments
Definitely agreed. Not sure we'll have enough bandwidth to get this into 0.19 (given the work going into the new backend), but it should be prioritized highly after that. |
Here's one use-case that requires this attribute to be present: https://github.com/willb/fraud-notebooks/blob/develop/03-model-random-forest.ipynb |
We are interested in using this feature in our use case too. |
This would also be useful for tools like Boruta, a popular feature selection library that's part of scikit-learn-contrib. There is a Boruta issue asking for support for cuML estimators |
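For context on the Boruta point above, here is a minimal sketch of why the attribute matters there: BorutaPy fits the wrapped estimator and then reads its feature_importances_ on every iteration, so an estimator that lacks the attribute (as cuML's RF does today) cannot be plugged in. The example uses sklearn's RF, which does expose it; the synthetic data is only there to make the snippet self-contained.

    import numpy as np
    from boruta import BorutaPy
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic data just for illustration.
    X = np.random.RandomState(0).normal(size=(500, 10))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # BorutaPy calls estimator.fit(...) and then reads estimator.feature_importances_,
    # which is exactly the attribute this issue asks cuML's RF to provide.
    selector = BorutaPy(RandomForestClassifier(n_jobs=-1), n_estimators='auto', random_state=0)
    selector.fit(X, y)
    print(selector.support_)  # boolean mask of the selected features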
Tagging @vinaydes and @venkywonka to see if we can have Venkat start on this? |
This is probably not the most efficient implementation, but in case anyone else needs it:

    import numpy as np

    def calculate_importances(nodes, n_features):
        """Gain-based feature importances, averaged over trees and normalized to sum to 1.

        `nodes` is expected to be a list with one root-node dict per tree, where each
        split node has "gain", "instance_count", "split_feature" and "children" keys.
        """
        importances = np.zeros((len(nodes), n_features))

        def accumulate_node_gains(node, feature_gains):
            # Leaf nodes carry no "gain" key and contribute nothing.
            if "gain" not in node:
                return
            # Weight the split gain by the number of samples reaching the node.
            feature_gains[node["split_feature"]] += node["gain"] * node["instance_count"]
            for child in node["children"]:
                accumulate_node_gains(child, feature_gains)

        for i, root in enumerate(nodes):
            # Reset the accumulator for each tree so the importances stay per-tree
            # (reusing one accumulator across trees would mix earlier trees' gains
            # into later ones).
            feature_gains = np.zeros(n_features)
            accumulate_node_gains(root, feature_gains)
            # Normalize so each tree's importances sum to 1.
            importances[i] = feature_gains / feature_gains.sum()

        # Average the per-tree importances, as sklearn does for forests.
        return np.mean(importances, axis=0)

You can see the logic behind it here: https://towardsdatascience.com/the-mathematics-of-decision-trees-random-forest-and-feature-importance-in-scikit-learn-and-spark-f2861df67e3 |
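A minimal usage sketch for the snippet above, assuming the fitted cuML forest can dump its trees as JSON (e.g. via get_json()) and that the parsed result is a list of per-tree root-node dicts with the keys used above; the exact method name and JSON layout may differ across cuML versions, and the synthetic data is only there to make the example runnable.

    import json
    import numpy as np
    from cuml.ensemble import RandomForestClassifier

    # Synthetic data just to make the sketch self-contained.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 8), dtype=np.float32)
    y = (X[:, 0] + X[:, 3] > 0).astype(np.int32)

    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X, y)

    # Assumed: get_json() returns the fitted trees as a JSON list of root-node dicts
    # with "gain" / "instance_count" / "split_feature" / "children" keys.
    trees = json.loads(model.get_json())
    print(calculate_importances(trees, n_features=X.shape[1]))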
Cross-linking an issue that asks for this feature and OOB support: #3361 |
This is an important issue worth a look. |
Commenting to reiterate the usefulness of this feature. I was trying to follow https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html using cuML, but it is not currently possible. |
A user shared a workflow today for which cuML's RF was 20x faster than their prior CPU-based RF. They wanted to use feature importance for feature selection, but weren't able to do so. |
Yeah, I'm missing this too. |
Same here. I switched to cuML for feature selection; this is a really needed feature. |
Same here. Using RF and need feature importance |
Upvoting this issue; it is worth looking at. |
Is your feature request related to a problem? Please describe.
The RF implementation should support computing the feature_importances_ property, just like how it is exposed in sklearn.

Describe the solution you'd like
Expose feature_importances_ (i.e. all the importances across the features sum to 1.0), computed from the split information already stored in each Node. We just need to, while building the tree, keep accumulating each feature's importance as we keep adding more nodes.
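A minimal sketch of the accumulation described above, using illustrative Python names rather than cuML's actual Node/builder code: each time the builder adds a split node, the node's weighted gain is added to the chosen feature's bucket, and the totals are normalized at the end so the importances sum to 1.0.

    import numpy as np

    class ImportanceAccumulator:
        """Illustrative accumulator; not the actual cuML Node/builder API."""

        def __init__(self, n_features):
            self.gains = np.zeros(n_features)

        def on_node_added(self, split_feature, gain, n_samples):
            # Called each time the tree builder creates a split node.
            self.gains[split_feature] += gain * n_samples

        def feature_importances(self):
            total = self.gains.sum()
            # Normalize so the importances across all features sum to 1.0.
            return self.gains / total if total > 0 else self.gains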