[FEA] Add support for computing feature_importances in RF #3531
Comments
Definitely agreed. Not sure we'll have enough bandwidth to get this into 0.19 (given the work going into the new backend), but it should be prioritized highly after that. |
Here's one use-case that requires this attribute to be present: https://github.com/willb/fraud-notebooks/blob/develop/03-model-random-forest.ipynb |
We are interested in using this feature in our use case too. |
This would also be useful for tools like Boruta, a popular feature selection library that's part of scikit-learn-contrib. There is a Boruta issue asking for support for cuML estimators |
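For context on the Boruta point above, here is a minimal sketch of why the attribute matters there: BorutaPy fits the wrapped estimator and then reads its feature_importances_ on every iteration, so an estimator that lacks the attribute (as cuML's RF does today) cannot be plugged in. The example uses sklearn's RF, which does expose it; the synthetic data is only there to make the snippet self-contained.

    import numpy as np
    from boruta import BorutaPy
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic data just for illustration.
    X = np.random.RandomState(0).normal(size=(500, 10))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # BorutaPy calls estimator.fit(...) and then reads estimator.feature_importances_,
    # which is exactly the attribute this issue asks cuML's RF to provide.
    selector = BorutaPy(RandomForestClassifier(n_jobs=-1), n_estimators='auto', random_state=0)
    selector.fit(X, y)
    print(selector.support_)  # boolean mask of the selected features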
Tagging @vinaydes and @venkywonka to see if we can have Venkat start on this? |
This is probably not the most efficient implementation, but in case anyone else needs it:

    import numpy as np

    def calculate_importances(nodes, n_features):
        """Gain-based feature importances, averaged over trees and normalized to sum to 1.

        `nodes` is expected to be a list with one root-node dict per tree, where each
        split node has "gain", "instance_count", "split_feature" and "children" keys.
        """
        importances = np.zeros((len(nodes), n_features))

        def accumulate_node_gains(node, feature_gains):
            # Leaf nodes carry no "gain" key and contribute nothing.
            if "gain" not in node:
                return
            # Weight the split gain by the number of samples reaching the node.
            feature_gains[node["split_feature"]] += node["gain"] * node["instance_count"]
            for child in node["children"]:
                accumulate_node_gains(child, feature_gains)

        for i, root in enumerate(nodes):
            # Reset the accumulator for each tree so the importances stay per-tree
            # (reusing one accumulator across trees would mix earlier trees' gains
            # into later ones).
            feature_gains = np.zeros(n_features)
            accumulate_node_gains(root, feature_gains)
            # Normalize so each tree's importances sum to 1.
            importances[i] = feature_gains / feature_gains.sum()

        # Average the per-tree importances, as sklearn does for forests.
        return np.mean(importances, axis=0)

You can see the logic behind it here: https://towardsdatascience.com/the-mathematics-of-decision-trees-random-forest-and-feature-importance-in-scikit-learn-and-spark-f2861df67e3 |
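A minimal usage sketch for the snippet above, assuming the fitted cuML forest can dump its trees as JSON (e.g. via get_json()) and that the parsed result is a list of per-tree root-node dicts with the keys used above; the exact method name and JSON layout may differ across cuML versions, and the synthetic data is only there to make the example runnable.

    import json
    import numpy as np
    from cuml.ensemble import RandomForestClassifier

    # Synthetic data just to make the sketch self-contained.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 8), dtype=np.float32)
    y = (X[:, 0] + X[:, 3] > 0).astype(np.int32)

    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X, y)

    # Assumed: get_json() returns the fitted trees as a JSON list of root-node dicts
    # with "gain" / "instance_count" / "split_feature" / "children" keys.
    trees = json.loads(model.get_json())
    print(calculate_importances(trees, n_features=X.shape[1]))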
Cross-linking an issue that asks for this feature and OOB support: #3361 |
This is an important issue worth a look. |
Commenting to reiterate the usefulness of this feature. I was trying to follow https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html using cuML, but it is not currently possible. |
A user shared a workflow today for which cuML's RF was 20x faster than their prior CPU-based RF. They wanted to use feature importance for feature selection, but weren't able to do so. |
Yeah, I'm missing this too. |
Same here. I switched to cuML for feature selection; this is a really needed feature. |
Same here. Using RF and need feature importance |
Upvoting this issue; it is worth looking at. |
Is your feature request related to a problem? Please describe.
The RF implementation should support computing the feature_importances_ property, just like how it is exposed in sklearn.

Describe the solution you'd like
Expose feature_importances_ (i.e. all the importances across the features sum to 1.0), computed from the split information already stored in each Node. We just need to, while building the tree, keep accumulating each feature's importance as we keep adding more nodes.
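A minimal sketch of the accumulation described above, using illustrative Python names rather than cuML's actual Node/builder code: each time the builder adds a split node, the node's weighted gain is added to the chosen feature's bucket, and the totals are normalized at the end so the importances sum to 1.0.

    import numpy as np

    class ImportanceAccumulator:
        """Illustrative accumulator; not the actual cuML Node/builder API."""

        def __init__(self, n_features):
            self.gains = np.zeros(n_features)

        def on_node_added(self, split_feature, gain, n_samples):
            # Called each time the tree builder creates a split node.
            self.gains[split_feature] += gain * n_samples

        def feature_importances(self):
            total = self.gains.sum()
            # Normalize so the importances across all features sum to 1.0.
            return self.gains / total if total > 0 else self.gains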