Hi @chasedehan,

I think I found an error in the sklearn implementation.

At the moment, a new column is added to `df2` on every iteration, and `df2` is then merged into `df` again. This creates many duplicate columns, which dilute the mean of the feature importances later on. You can see this by printing `df` after every iteration:
```python
try:
    importance = clf.feature_importances_
    df2['fscore' + str(i)] = importance
except ValueError:
    print("this clf doesn't have the feature_importances_ method. Only Sklearn tree based methods allowed")
# importance = sorted(importance.items(), key=operator.itemgetter(1))
# df2 = pd.DataFrame(importance, columns=['feature', 'fscore'+str(i)])
df2['fscore'+str(i)] = df2['fscore'+str(i)] / df2['fscore'+str(i)].sum()
df = pd.merge(df, df2, on='feature', how='outer')
if not silent:
    print("Round: ", this_round, " iteration: ", i)
```
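A minimal sketch of the problem (with made-up stand-in data, not the real pipeline): because `df2` keeps every previously added `fscore` column, each `pd.merge` copies the old columns into `df` again, and pandas disambiguates the clashes with `_x`/`_y` suffixes.

```python
import pandas as pd

# df2 accumulates one fscore column per iteration, so every merge
# re-imports the old columns into df with _x/_y suffixes.
df = pd.DataFrame({'feature': ['a', 'b']})
df2 = pd.DataFrame({'feature': ['a', 'b']})
for i in range(1, 4):
    df2['fscore' + str(i)] = [0.5, 0.5]  # stand-in for the normalized importances
    df = pd.merge(df, df2, on='feature', how='outer')

# After three iterations df holds far more than three fscore columns,
# e.g. fscore1_x, fscore1_y, fscore2_x, ... which later dilute the mean.
print(df.columns.tolist())
```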
Here is a suggestion for how to fix it:
```python
if len(getattr(clf, 'feature_importances_', [])) == 0:
    raise ValueError(
        "this clf doesn't have the feature_importances_ method. Only Sklearn tree based methods allowed"
    )
if i == 1:
    df = pd.DataFrame({'feature': new_x.columns})
# importance = sorted(importance.items(), key=operator.itemgetter(1))
importance = clf.feature_importances_
importance = np.column_stack([new_x.columns, importance])
df2 = pd.DataFrame(importance, columns=['feature', 'fscore'+str(i)])
df2['fscore'+str(i)] = df2['fscore'+str(i)] / df2['fscore'+str(i)].sum()
df = pd.merge(df, df2, on='feature', how='outer')
if not silent:
    print("Round: ", this_round, " iteration: ", i)
```
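To illustrate why this helps, here is a minimal sketch of the fixed loop with stand-in feature names and importances (not the real classifier): rebuilding `df2` from scratch each iteration means every merge adds exactly one new `fscore` column. (Note that `np.column_stack` on mixed strings and floats yields a string array, so the sketch casts the column back to float before normalizing.)

```python
import numpy as np
import pandas as pd

features = np.array(['a', 'b'])          # stand-in for new_x.columns
df = pd.DataFrame({'feature': features})
for i in range(1, 4):
    importance = np.array([0.7, 0.3])    # stand-in for clf.feature_importances_
    # df2 is rebuilt fresh each iteration: only 'feature' plus the new column
    df2 = pd.DataFrame(np.column_stack([features, importance]),
                       columns=['feature', 'fscore' + str(i)])
    # column_stack produced strings, so cast back to float before normalizing
    df2['fscore' + str(i)] = df2['fscore' + str(i)].astype(float)
    df2['fscore' + str(i)] = df2['fscore' + str(i)] / df2['fscore' + str(i)].sum()
    df = pd.merge(df, df2, on='feature', how='outer')

# df ends up with exactly one fscore column per iteration, no _x/_y duplicates
print(df.columns.tolist())
```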