
Fix handling of the min_samples_leaf hyperparameter #35

Merged: 11 commits merged into master from fix-min_samples_leaf on Nov 3, 2018

Conversation

@ogrisel (Owner) commented Nov 2, 2018

This is a tentative fix for #34.

However, the test in test_compare_lightgbm fails if I change the value of min_samples_leaf to anything other than 1.

In retrospect, I believe this is because we reject too many splits by doing the filtering only at the grower level; we should probably also do it inside the find_node_split* calls.
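For illustration, here is a minimal sketch of what per-feature filtering inside the histogram scan could look like. This is not the merged implementation: the function signature and the histogram field names ('count', 'sum_gradients', 'sum_hessians') are assumptions based on the hunks quoted below, and the gain formula is the usual second-order one.

```python
def find_best_bin_to_split(histogram, n_samples, sum_gradients,
                           sum_hessians, min_samples_leaf,
                           l2_regularization=0.):
    """Scan one feature's histogram, keeping only splits whose two
    children would each hold at least min_samples_leaf samples."""
    def negative_loss(g, h):
        # Assumes h + l2_regularization > 0 (the hessians are positive).
        return g * g / (h + l2_regularization)

    best_gain, best_bin_idx = -1., None
    n_left, g_left, h_left = 0, 0., 0.
    for bin_idx in range(histogram.shape[0]):
        n_left += int(histogram[bin_idx]['count'])
        g_left += histogram[bin_idx]['sum_gradients']
        h_left += histogram[bin_idx]['sum_hessians']
        n_right = n_samples - n_left
        if n_left < min_samples_leaf:
            continue  # left child too small: try a later bin
        if n_right < min_samples_leaf:
            break  # right child can only shrink from here on
        gain = (negative_loss(g_left, h_left)
                + negative_loss(sum_gradients - g_left,
                                sum_hessians - h_left)
                - negative_loss(sum_gradients, sum_hessians))
        if gain > best_gain:
            best_gain, best_bin_idx = gain, bin_idx
    return best_bin_idx, best_gain
```

Unlike a grower-level check, this still considers later bins for the same feature after a too-small left child, so fewer valid splits are rejected.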

@ogrisel (Owner) commented Nov 2, 2018

I don't have the time to work on this further today. @NicolasHug, feel free to take over if you wish.

Note that in the current state of this PR the Higgs boson benchmark is significantly slower than LightGBM, but this is probably because we reject too many splits by doing only the coarse node-level filtering (instead of per-feature filtering as well).

codecov bot commented Nov 2, 2018

Codecov Report

Merging #35 into master will decrease coverage by 0.12%.
The diff coverage is 97.5%.


@@            Coverage Diff             @@
##           master      #35      +/-   ##
==========================================
- Coverage   94.46%   94.34%   -0.13%     
==========================================
  Files           8        8              
  Lines         759      778      +19     
==========================================
+ Hits          717      734      +17     
- Misses         42       44       +2
Impacted Files       Coverage          Δ
pygbm/splitting.py   99.46% <100%>     (-0.54%) ⬇️
pygbm/grower.py      89.41% <91.66%>   (-0.29%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6da9cb9...1da31d7.

@ogrisel (Owner) commented Nov 2, 2018

@NicolasHug I sent you an invite to have commit rights to this repo.

This PR needs a rebase on top of master to fix the conflicts.

@NicolasHug (Collaborator) commented

OK, I'll take it up.

> we reject too many splits by doing the filtering only at the grower level; we should probably also do it inside the find_node_split* calls.

Yes, I think this should be done at the histogram splitting level, like in lightgbm.

Do we want to rename the current min_samples_leaf into min_samples_split to be more scikit-learnesque, or completely get rid of it?

@ogrisel (Owner) commented Nov 2, 2018

No, let's get rid of the old min_samples_split strategy to keep the code as simple as possible. It was not a good way to control overfitting; min_samples_leaf is a better strategy.
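To make the distinction concrete, a hedged sketch (the function names are illustrative): min_samples_split only gates whether a node may be split at all, while min_samples_leaf constrains each candidate split by the children it would produce.

```python
def can_consider_split(n_samples_node, min_samples_split):
    # min_samples_split: one coarse check on the node itself.
    return n_samples_node >= min_samples_split

def split_is_allowed(n_samples_left, n_samples_right, min_samples_leaf):
    # min_samples_leaf: checked for every candidate split, on both
    # children, which controls the final leaf sizes directly.
    return (n_samples_left >= min_samples_leaf
            and n_samples_right >= min_samples_leaf)
```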

@@ -307,12 +311,12 @@ def _parallel_find_split_subtraction(context, parent_histograms,
histograms by substraction.
"""
# Pre-allocate the results datastructure to be able to use prange
-    split_infos = [SplitInfo(0, 0, 0, 0., 0., 0., 0.)
+    split_infos = [SplitInfo(0, 0, 0, 0., 0., 0., 0., 0, 0)
for i in range(context.n_features)]
@ogrisel (Owner) commented

This data structure could probably also be stored as an attribute on the context to avoid reallocating it over and over again.
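A minimal sketch of that optimization, assuming SplitInfo field names guessed from the constructor call above (the real SplitInfo is a numba jitclass, which this sketch ignores):

```python
from collections import namedtuple

SplitInfo = namedtuple('SplitInfo', [
    'gain', 'feature_idx', 'bin_idx',
    'gradient_left', 'hessian_left', 'gradient_right', 'hessian_right',
    'n_samples_left', 'n_samples_right'])

class SplittingContext:
    def __init__(self, n_features):
        self.n_features = n_features
        # Allocated once here and reused by every find_node_split* call,
        # instead of being rebuilt on each invocation as in the hunk above.
        self.split_infos = [SplitInfo(0, 0, 0, 0., 0., 0., 0., 0, 0)
                            for _ in range(n_features)]
```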

pygbm/splitting.py: review thread resolved (outdated)
@NicolasHug (Collaborator) commented

I did the following changes:

  • min_samples_leaf is now checked at the histogram level, like in lightgbm.

  • min_gain_to_split as well.

  • I had to set decimal=3 in test_compare_lightgbm. Like for you, tests don't pass for min_samples_leaf > 1, but the predictions are still pretty close in general.

  • it doesn't run slower than on master. I also observed that your last commit was a lot slower and I don't understand why: fewer splits to consider = less work = faster, as far as I understand. Maybe some numba compilation thing?

  • tests fail on test_predictor because I set min_samples_leaf to 5 and slightly modified the code to plot the lightgbm model. I get this (lightgbm tree is on the left): Digraph.gv.pdf. I don't understand why our tree is going so deep on the same feature.

  • I get the exact same ROC AUC as on master: .7892

Things are not totally broken, but there's something fishy going on.

@NicolasHug (Collaborator) commented

The code from test_predictor run on master gives a much more reasonable tree: Digraph.gv.pdf

@ogrisel (Owner) commented Nov 2, 2018

> it doesn't run slower than on master. I also observed that your last commit was a lot slower and I don't understand why: fewer splits to consider = less work = faster, as far as I understand. Maybe some numba compilation thing?

Many useless splits get evaluated, but on the Higgs boson dataset n_samples is big enough that the other nodes still get split with only the node-level filtering. Hence the slowdown.

@ogrisel (Owner) left a review

Some comments:

pygbm/grower.py: review thread resolved
                and self.root.n_samples < self.min_samples_leaf):
            # Do not even bother computing any splitting statistics.
            self._finalize_leaf(self.root)
            return
@ogrisel (Owner) commented

It would be great to add a test for this case in test_grower.py.
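Such a test could look roughly like this; the TreeGrower arguments and the finalized_leaves attribute are assumptions for this sketch, not necessarily the final test:

```python
import numpy as np
from pygbm.grower import TreeGrower

def test_root_smaller_than_min_samples_leaf():
    # With fewer samples than min_samples_leaf, the root must be
    # finalized as a leaf without computing any split statistics.
    rng = np.random.RandomState(42)
    n_samples = 5
    X_binned = np.asfortranarray(
        rng.randint(0, 256, size=(n_samples, 2), dtype=np.uint8))
    gradients = rng.normal(size=n_samples).astype(np.float32)
    hessians = np.ones(shape=1, dtype=np.float32)  # constant hessian
    grower = TreeGrower(X_binned, gradients, hessians,
                        min_samples_leaf=n_samples + 1)
    grower.grow()
    assert len(grower.finalized_leaves) == 1
```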

pygbm/splitting.py: review thread resolved
@@ -76,6 +78,7 @@ def __init__(self, n_features, binned_features, n_bins,
         self.l2_regularization = l2_regularization
         self.min_hessian_to_split = min_hessian_to_split
         self.min_samples_leaf = min_samples_leaf
+        self.min_gain_to_split = min_gain_to_split
@ogrisel (Owner) commented

You should raise a ValueError if the user passes min_gain_to_split < 0.
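A minimal sketch of the requested check (the function name and the message format are illustrative):

```python
def _validate_min_gain_to_split(min_gain_to_split):
    if min_gain_to_split < 0:
        raise ValueError(
            'min_gain_to_split={} must be positive.'.format(
                min_gain_to_split))
```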

@NicolasHug (Collaborator) commented

There's already a check in the grower. BTW, min_gain_to_split is only a parameter of the grower, not of GradientBoostingMachine. We should move it there, right?

@ogrisel (Owner) commented

Right, this is fine for now. No need to expose it to the public API at this time.

est_lightgbm.fit(X_train_binned, y_train)

from pygbm.plotting import plot_tree
plot_tree(grower, est_lightgbm)
@ogrisel (Owner) commented

Please do not put such plots in the regular tests. Use the examples/ folder for visual debugging instead.

@ogrisel (Owner) commented

Also if you want to compare lightgbm and pygbm on the Boston dataset, please add a new test in test_compare_lightgbm.py instead.

@NicolasHug (Collaborator) commented

Sure, this is not meant to stay; I just left it so you can reproduce my plots if you want to.

@ogrisel (Owner) commented Nov 2, 2018

Indeed there is something fishy going on...

@ogrisel (Owner) commented Nov 2, 2018

test_predictor.py used to pass before 7e91c2c, right? Maybe this commit is causing the regression, but I am not sure why.

        else:
            hessian_left += histogram[bin_idx]['sum_hessians']
        if hessian_left < context.min_hessian_to_split:
            continue
        hessian_right = context.sum_hessians - hessian_left
        if hessian_right < context.min_hessian_to_split:
            continue
        # won't get any better
@ogrisel (Owner) commented

Maybe add a comment to say that the loss functions are all convex and therefore the hessians are positive.
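One possible wording, spliced into the hunk above (variables as in pygbm/splitting.py):

```python
hessian_right = context.sum_hessians - hessian_left
if hessian_right < context.min_hessian_to_split:
    # The loss functions are all convex, so the per-sample hessians are
    # positive. hessian_right therefore decreases monotonically as
    # bin_idx grows: it "won't get any better" for later bins.
    continue
```

If it truly can't get any better, this could arguably even be a break rather than a continue.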

        gradient_right = context.sum_gradients - gradient_left
        gain = _split_gain(gradient_left, hessian_left,
                           gradient_right, hessian_right,
                           context.sum_gradients, context.sum_hessians,
                           context.l2_regularization)
-        if gain > best_split.gain:
+        if gain > best_split.gain and gain > context.min_gain_to_split:
@ogrisel (Owner) commented

We should probably have gain >= context.min_gain_to_split.
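That is, assuming the same variables as the hunk above, something like:

```python
def split_gain_is_acceptable(gain, best_gain, min_gain_to_split):
    # Use >= so that a gain exactly equal to the threshold (in
    # particular the default min_gain_to_split=0) is not rejected.
    return gain > best_gain and gain >= min_gain_to_split
```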

@ogrisel (Owner) commented

Yes this is the cause of the bad trees.

@ogrisel (Owner) commented

Oops, I spoke too quickly: I had set min_samples_leaf=1 in test_predictor.py and forgot about it.

@ogrisel (Owner) commented Nov 2, 2018

Maybe the structure we observe is expected in cases where the most predictive feature is linearly correlated with the target value: the gain should then be constant for consecutive bin_idx values of a given feature. One way to mitigate this would be to detect those areas of constant gain and split in the middle of the plateau instead of at its right-hand side.
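A hedged sketch of that mitigation (names are illustrative; gains[i] is the gain of splitting right after bin i for one feature):

```python
import numpy as np

def pick_plateau_middle(gains):
    """Middle bin of the contiguous plateau of (near-)maximal gain."""
    gains = np.asarray(gains, dtype=np.float64)
    best = int(np.argmax(gains))
    lo = hi = best
    while lo > 0 and np.isclose(gains[lo - 1], gains[best]):
        lo -= 1
    while hi < len(gains) - 1 and np.isclose(gains[hi + 1], gains[best]):
        hi += 1
    # With a strictly unique maximum this reduces to the plain argmax.
    return (lo + hi) // 2
```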

@ogrisel (Owner) commented Nov 2, 2018

I am not sure about my theory; I cannot come up with a synthetic dataset that would trigger this case.

@NicolasHug (Collaborator) commented

> test_predictor.py used to pass before 7e91c2c, right? Maybe this commit is causing the regression, but I am not sure why.

Yes it passes in 1c6e62b with 5 leaves. The threshold was still lowered from .9 to .75.

@ogrisel (Owner) commented Nov 3, 2018

> Yes it passes in 1c6e62b with 5 leaves. The threshold was still lowered from .9 to .75.

min_samples_leaf=5, not 5 leaves.

The threshold on the training set is expected to drop when we control overfitting with a stricter min_samples_leaf. However there should be a value for min_samples_leaf where the test accuracy is comparable to the previous values. I just pushed one such value.

@ogrisel (Owner) commented Nov 3, 2018

OK, I pushed some small improvements. I would be in favor of merging this PR as is. There are still some discrepancies with LightGBM (see #32), but I think that the specific case of #34 is fixed.

@ogrisel ogrisel merged commit 154def0 into master Nov 3, 2018
@ogrisel ogrisel deleted the fix-min_samples_leaf branch November 3, 2018 00:38