Fix handling of the min_samples_leaf hyperparameter #35
Conversation
I don't have the time to work on this further today. @NicolasHug feel free to take over if you wish. Note that in the current state of this PR the Higgs boson benchmark is significantly slower than LightGBM, but this is probably because we reject too many splits by doing the coarse node-level filter only (instead of doing per-feature filtering as well).
Codecov Report
@@            Coverage Diff             @@
##           master      #35      +/-   ##
==========================================
- Coverage   94.46%   94.34%    -0.13%
==========================================
  Files           8        8
  Lines         759      778       +19
==========================================
+ Hits          717      734       +17
- Misses         42       44        +2
Continue to review full report at Codecov.
@NicolasHug I sent you an invite to have commit rights to this repo. This PR needs a rebase on top of master to fix the conflicts.
OK, I'll take it up.
Yes, I think this should be at the histogram splitting level, like in LightGBM. Do we want to rename the current
No, let's get rid of the old
@@ -307,12 +311,12 @@ def _parallel_find_split_subtraction(context, parent_histograms,
     histograms by substraction.
     """
     # Pre-allocate the results datastructure to be able to use prange
-    split_infos = [SplitInfo(0, 0, 0, 0., 0., 0., 0.)
+    split_infos = [SplitInfo(0, 0, 0, 0., 0., 0., 0., 0, 0)
                    for i in range(context.n_features)]
This data structure could probably be also stored as an attribute on the context to avoid reallocating it over and over again.
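For instance, something along these lines (a rough sketch of the suggestion, with a namedtuple standing in for the SplitInfo jitclass and an illustrative attribute name; not the actual pygbm code):

```python
from collections import namedtuple

# Stand-in for pygbm's SplitInfo jitclass, only to sketch the idea; the real
# constructor arguments may differ.
SplitInfo = namedtuple(
    'SplitInfo',
    'gain feature_idx bin_idx gradient_left hessian_left '
    'gradient_right hessian_right n_samples_left n_samples_right')

class SplittingContext:
    def __init__(self, n_features):
        self.n_features = n_features
        # Allocate the per-feature results once here instead of rebuilding
        # the list on every call to the split-finding routines. The
        # attribute name is illustrative.
        self.split_infos = [SplitInfo(0., 0, 0, 0., 0., 0., 0., 0, 0)
                            for _ in range(n_features)]

context = SplittingContext(n_features=28)
```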
I made the following changes:
Things are not totally broken, but there's something fishy going on.
The code from
Many useless splits are evaluated, but on the Higgs boson dataset n_samples is big enough that other nodes will still be split with the node-level filtering. Hence the slowdown.
Some comments:
                and self.root.n_samples < self.min_samples_leaf):
            # Do not even bother computing any splitting statistics.
            self._finalize_leaf(self.root)
            return
It would be great to add a test for this case in test_grower.py.
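Something along these lines might do (a sketch only; the exact TreeGrower signature, dtypes, and attribute names are assumptions based on the snippets in this PR):

```python
import numpy as np
from pygbm.grower import TreeGrower

def test_min_samples_leaf_root():
    # If the root node already has fewer samples than min_samples_leaf,
    # it should be finalized as a leaf without computing any split.
    rng = np.random.RandomState(42)
    n_samples = 5
    X_binned = rng.randint(0, 256, size=(n_samples, 2)).astype(np.uint8)
    X_binned = np.asfortranarray(X_binned)
    gradients = rng.normal(size=n_samples).astype(np.float32)
    hessians = np.ones(n_samples, dtype=np.float32)

    grower = TreeGrower(X_binned, gradients, hessians,
                        min_samples_leaf=n_samples + 1)
    grower.grow()
    assert len(grower.finalized_leaves) == 1  # the root became a leaf
```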
@@ -76,6 +78,7 @@ def __init__(self, n_features, binned_features, n_bins,
         self.l2_regularization = l2_regularization
         self.min_hessian_to_split = min_hessian_to_split
         self.min_samples_leaf = min_samples_leaf
+        self.min_gain_to_split = min_gain_to_split
You should raise a ValueError if the user passes min_gain_to_split < 0.
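That is, something like the following in the parameter validation (a sketch; the exact location and message are up to the author):

```python
def _validate_parameters(min_gain_to_split):
    # Illustrative check only; in pygbm this would live wherever the other
    # hyperparameters are validated.
    if min_gain_to_split < 0:
        raise ValueError(
            'min_gain_to_split={} must be non-negative.'.format(
                min_gain_to_split))
```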
There's already a check in the grower. BTW min_gain_to_split is only a parameter of the grower, not of GradientBoostingMachine. We should move it there, right?
Right, this is fine for now. No need to expose it to the public API at this time.
tests/test_predictor.py
     est_lightgbm.fit(X_train_binned, y_train)

+    from pygbm.plotting import plot_tree
+    plot_tree(grower, est_lightgbm)
Please do not put such plots in the regular tests. Use the examples/ folder for visual debugging instead.
Also, if you want to compare lightgbm and pygbm on the Boston dataset, please add a new test in test_compare_lightgbm.py instead.
Sure, this is not meant to stay, I just left it so you can reproduce my plots if you wanted to
Indeed there is something fishy going on...
test_predictor.py used to pass before 7e91c2c, right? Maybe this commit is causing the regression, but I am not sure why.
        else:
            hessian_left += histogram[bin_idx]['sum_hessians']
        if hessian_left < context.min_hessian_to_split:
            continue
        hessian_right = context.sum_hessians - hessian_left
        if hessian_right < context.min_hessian_to_split:
            continue
        # won't get any better
Maybe add a comment to say that the loss functions are all convex and therefore the hessians are positive.
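As a toy illustration of why this holds (my own sketch, not pygbm code): for a convex loss every per-sample hessian is non-negative, so the cumulative left-child hessian can only grow with bin_idx and the right-child hessian can only shrink.

```python
import numpy as np

rng = np.random.RandomState(0)
hessians = rng.uniform(size=100)      # h_i >= 0 for a convex loss
bins = rng.randint(0, 32, size=100)   # bin index of each sample

sum_hessians = hessians.sum()
hessian_left = np.array([hessians[bins <= b].sum() for b in range(32)])
hessian_right = sum_hessians - hessian_left

assert np.all(np.diff(hessian_left) >= 0)   # left sum never decreases
assert np.all(np.diff(hessian_right) <= 0)  # right sum never increases
```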
pygbm/splitting.py
     gradient_right = context.sum_gradients - gradient_left
     gain = _split_gain(gradient_left, hessian_left,
                        gradient_right, hessian_right,
                        context.sum_gradients, context.sum_hessians,
                        context.l2_regularization)
-    if gain > best_split.gain:
+    if gain > best_split.gain and gain > context.min_gain_to_split:
We should probably have gain >= context.min_gain_to_split.
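For context, the quantity being thresholded here is the usual second-order loss reduction; a sketch of the standard XGBoost/LightGBM-style formula (constant factors aside, and not necessarily pygbm's exact implementation):

```python
def split_gain(gradient_left, hessian_left, gradient_right, hessian_right,
               sum_gradients, sum_hessians, l2_regularization):
    """Loss reduction obtained by splitting a node into two children.

    Assumes hessian + l2_regularization > 0 for each term.
    """
    def negative_loss(gradient, hessian):
        return gradient ** 2 / (hessian + l2_regularization)

    return (negative_loss(gradient_left, hessian_left)
            + negative_loss(gradient_right, hessian_right)
            - negative_loss(sum_gradients, sum_hessians))
```

Assuming min_gain_to_split defaults to 0, a strict `>` rejects splits whose best achievable gain is exactly 0, while `>=` keeps them.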
Yes this is the cause of the bad trees.
Oops, I spoke too quickly: I had set min_samples_leaf=1 in test_predictor.py and forgot about it.
Maybe the structure we observe is expected in cases where the most predictive feature has a value that is linearly correlated with the target value: the gain should be constant for consecutive
I am not sure about my theory; I cannot come up with a synthetic dataset that would trigger this case.
The threshold on the training set is expected to drop when we control overfitting with a stricter
This is a tentative fix for #34.
However, the test in test_compare_lightgbm fails if I change the value of min_samples_leaf to something other than 1. In retrospect, I believe this is because we reject too many splits by doing the filtering at the grower level, while we should probably also do it inside the find_node_split* calls.
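A possible shape for that per-feature filtering, sketched with made-up names (per-bin sample counts of one feature's histogram and precomputed gains; not the actual find_node_split code): skip candidate bins whose left child is still too small, and stop scanning once the right child becomes too small, since it can only shrink further.

```python
def best_allowed_bin(counts, gains, min_samples_leaf):
    """Pick the best split of one feature whose children both contain at
    least min_samples_leaf samples (illustrative sketch, not pygbm's API)."""
    n_samples = sum(counts)
    best_bin, best_gain = None, float('-inf')
    n_samples_left = 0
    for bin_idx, (count, gain) in enumerate(zip(counts, gains)):
        n_samples_left += count
        n_samples_right = n_samples - n_samples_left
        if n_samples_left < min_samples_leaf:
            continue   # left child too small; later bins may still qualify
        if n_samples_right < min_samples_leaf:
            break      # right child only shrinks as bin_idx increases
        if gain > best_gain:
            best_bin, best_gain = bin_idx, gain
    return best_bin, best_gain

# Example: the highest-gain split (after bin 0) is rejected because its left
# child would only hold 1 sample; the best allowed split is after bin 2.
print(best_allowed_bin([1, 2, 3, 4], [0.9, 0.5, 0.7, 0.0], min_samples_leaf=3))
# -> (2, 0.7)
```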