Sklearn implementation has an error #21

Open
jjuppe opened this issue Aug 5, 2021 · 0 comments

jjuppe commented Aug 5, 2021

Hi @chasedehan,

I think I found an error in the sklearn implementation.

At the moment, a new column is added to df2 in every iteration, and df2 is then merged into df again. Because df2 still carries the columns from all earlier iterations, each merge creates duplicate columns (pandas suffixes the clashing names with _x and _y), and these duplicates dilute the mean of the feature importance computed later on. You can see this by printing df after every iteration.

try:
    importance = clf.feature_importances_
    df2['fscore' + str(i)] = importance
except ValueError:
    print("this clf doesn't have the feature_importances_ method.  Only Sklearn tree based methods allowed")

# importance = sorted(importance.items(), key=operator.itemgetter(1))

# df2 = pd.DataFrame(importance, columns=['feature', 'fscore'+str(i)])
df2['fscore'+str(i)] = df2['fscore'+str(i)] / df2['fscore'+str(i)].sum()
df = pd.merge(df, df2, on='feature', how='outer')
if not silent:
    print("Round: ", this_round, " iteration: ", i)
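To make the duplication visible without a model, here is a minimal sketch of the same accumulate-and-merge pattern. The feature names and importance values are made up and stand in for new_x.columns and clf.feature_importances_; only the pandas calls mirror the snippet above.

```python
import pandas as pd

# Hypothetical feature names, standing in for new_x.columns.
features = ["a", "b", "c"]

df = pd.DataFrame({"feature": features})
df2 = df.copy()  # created once and reused, as in the current code

for i in range(1, 4):
    # Made-up values standing in for clf.feature_importances_.
    importance = [0.2, 0.3, 0.5]
    col = "fscore" + str(i)
    df2[col] = importance
    df2[col] = df2[col] / df2[col].sum()
    # df2 still holds the columns of all earlier iterations, so the merge
    # finds clashing names and suffixes them with _x / _y.
    df = pd.merge(df, df2, on="feature", how="outer")
    print(i, list(df.columns))
```

After the third iteration, df holds three copies of the first iteration's scores, two copies of the second and one of the third, so a row-wise mean over these columns weights the early iterations more heavily.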

Here is a suggestion for how to fix it:

if len(getattr(clf, 'feature_importances_', [])) == 0:
    raise ValueError(
        "this clf doesn't have the feature_importances_ method. Only Sklearn tree based methods allowed"
    )

if i == 1:
    df = pd.DataFrame({'feature': new_x.columns})

# importance = sorted(importance.items(), key=operator.itemgetter(1))

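importance = clf.feature_importances_
# Build df2 fresh each iteration so it only holds this iteration's column.
# A dict keeps the importances as floats; np.column_stack with the column
# names would produce an object-dtype array.
df2 = pd.DataFrame({'feature': new_x.columns, 'fscore'+str(i): importance})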
df2['fscore'+str(i)] = df2['fscore'+str(i)] / df2['fscore'+str(i)].sum()
df = pd.merge(df, df2, on='feature', how='outer')
if not silent:
    print("Round: ", this_round, " iteration: ", i)
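As a quick check of the idea, here is a self-contained sketch of the per-iteration rebuild. The helper name accumulate_importances, the feature names and the importance values are all hypothetical and only there for illustration; in the real loop the values would come from clf.feature_importances_.

```python
import numpy as np
import pandas as pd


def accumulate_importances(feature_names, importances_per_iteration):
    """Merge one normalized score column per iteration into a single frame.

    Hypothetical helper: it mirrors the suggested fix by building df2 from
    scratch in every iteration instead of reusing it.
    """
    feature_names = list(feature_names)
    df = pd.DataFrame({"feature": feature_names})
    for i, importance in enumerate(importances_per_iteration, start=1):
        col = "fscore" + str(i)
        # Fresh frame holding only this iteration's scores, kept as floats.
        df2 = pd.DataFrame(
            {"feature": feature_names, col: np.asarray(importance, dtype=float)}
        )
        df2[col] = df2[col] / df2[col].sum()
        df = pd.merge(df, df2, on="feature", how="outer")
    return df


# Made-up importances for three iterations over three features.
scores = accumulate_importances(
    ["a", "b", "c"],
    [[1, 1, 2], [2, 1, 1], [1, 2, 1]],
)
print(list(scores.columns))
print(scores.set_index("feature").mean(axis=1))
```

With this layout, each iteration contributes exactly one column, so a row-wise mean treats all iterations equally.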
