
Fix handling of the min_samples_leaf hyperparameter #35

Merged: 11 commits merged into master from fix-min_samples_leaf on Nov 3, 2018

Conversation

@ogrisel (Owner) commented Nov 2, 2018

This is a tentative fix for #34.

However, the test in test_compare_lightgbm fails if I change the value of min_samples_leaf to anything other than 1.

In retrospect, I believe this is because we reject too many splits by doing the filtering only at the grower level; we should probably also do it inside the find_node_split* calls.
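For illustration, here is a minimal sketch of what per-feature filtering inside the histogram scan could look like. This is not the merged implementation: the function signature and the histogram field names ('count', 'sum_gradients', 'sum_hessians') are assumptions based on the hunks quoted below, and the gain formula is the usual second-order one.

```python
def find_best_bin_to_split(histogram, n_samples, sum_gradients,
                           sum_hessians, min_samples_leaf,
                           l2_regularization=0.):
    """Scan one feature's histogram, keeping only splits whose two
    children would each hold at least min_samples_leaf samples."""
    def negative_loss(g, h):
        # Assumes h + l2_regularization > 0 (the hessians are positive).
        return g * g / (h + l2_regularization)

    best_gain, best_bin_idx = -1., None
    n_left, g_left, h_left = 0, 0., 0.
    for bin_idx in range(histogram.shape[0]):
        n_left += int(histogram[bin_idx]['count'])
        g_left += histogram[bin_idx]['sum_gradients']
        h_left += histogram[bin_idx]['sum_hessians']
        n_right = n_samples - n_left
        if n_left < min_samples_leaf:
            continue  # left child too small: try a later bin
        if n_right < min_samples_leaf:
            break  # right child can only shrink from here on
        gain = (negative_loss(g_left, h_left)
                + negative_loss(sum_gradients - g_left,
                                sum_hessians - h_left)
                - negative_loss(sum_gradients, sum_hessians))
        if gain > best_gain:
            best_gain, best_bin_idx = gain, bin_idx
    return best_bin_idx, best_gain
```

Unlike a grower-level check, this still considers later bins for the same feature after a too-small left child, so fewer valid splits are rejected.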

@ogrisel (Owner) commented Nov 2, 2018

I don't have the time to work on this further today. @NicolasHug, feel free to take over if you wish.

Note that in the current state of this PR the Higgs boson benchmark is significantly slower than LightGBM, but this is probably because we reject too many splits by doing only the coarse node-level filtering (instead of per-feature filtering as well).

codecov bot commented Nov 2, 2018

Codecov Report

Merging #35 into master will decrease coverage by 0.12%.
The diff coverage is 97.5%.


@@            Coverage Diff             @@
##           master      #35      +/-   ##
==========================================
- Coverage   94.46%   94.34%   -0.13%     
==========================================
  Files           8        8              
  Lines         759      778      +19     
==========================================
+ Hits          717      734      +17     
- Misses         42       44       +2
Impacted Files       Coverage          Δ
pygbm/splitting.py   99.46% <100%>     (-0.54%) ⬇️
pygbm/grower.py      89.41% <91.66%>   (-0.29%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6da9cb9...1da31d7.

@ogrisel (Owner) commented Nov 2, 2018

@NicolasHug I sent you an invite to have commit rights to this repo.

This PR needs a rebase on top of master to fix the conflicts.

@NicolasHug (Collaborator) commented

OK, I'll take it up.

> we reject too many splits by doing the filtering only at the grower level; we should probably also do it inside the find_node_split* calls.

Yes, I think this should be done at the histogram splitting level, like in lightgbm.

Do we want to rename the current min_samples_leaf into min_samples_split to be more scikit-learnesque, or completely get rid of it?

@ogrisel (Owner) commented Nov 2, 2018

No, let's get rid of the old min_samples_split strategy to keep the code as simple as possible. It was not a good way to control overfitting; min_samples_leaf is a better strategy.
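To make the distinction concrete, a hedged sketch (the function names are illustrative): min_samples_split only gates whether a node may be split at all, while min_samples_leaf constrains each candidate split by the children it would produce.

```python
def can_consider_split(n_samples_node, min_samples_split):
    # min_samples_split: one coarse check on the node itself.
    return n_samples_node >= min_samples_split

def split_is_allowed(n_samples_left, n_samples_right, min_samples_leaf):
    # min_samples_leaf: checked for every candidate split, on both
    # children, which controls the final leaf sizes directly.
    return (n_samples_left >= min_samples_leaf
            and n_samples_right >= min_samples_leaf)
```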

@@ -307,12 +311,12 @@ def _parallel_find_split_subtraction(context, parent_histograms,
histograms by substraction.
"""
# Pre-allocate the results datastructure to be able to use prange
-    split_infos = [SplitInfo(0, 0, 0, 0., 0., 0., 0.)
+    split_infos = [SplitInfo(0, 0, 0, 0., 0., 0., 0., 0, 0)
for i in range(context.n_features)]
@ogrisel (Owner) commented

This data structure could probably also be stored as an attribute on the context to avoid reallocating it over and over again.
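A minimal sketch of that optimization, assuming SplitInfo field names guessed from the constructor call above (the real SplitInfo is a numba jitclass, which this sketch ignores):

```python
from collections import namedtuple

SplitInfo = namedtuple('SplitInfo', [
    'gain', 'feature_idx', 'bin_idx',
    'gradient_left', 'hessian_left', 'gradient_right', 'hessian_right',
    'n_samples_left', 'n_samples_right'])

class SplittingContext:
    def __init__(self, n_features):
        self.n_features = n_features
        # Allocated once here and reused by every find_node_split* call,
        # instead of being rebuilt on each invocation as in the hunk above.
        self.split_infos = [SplitInfo(0, 0, 0, 0., 0., 0., 0., 0, 0)
                            for _ in range(n_features)]
```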

pygbm/splitting.py: review thread resolved (outdated)
@NicolasHug (Collaborator) commented

I did the following changes:

  • min_samples_leaf is now checked at the histogram level, like in lightgbm.

  • min_gain_to_split as well.

  • I had to set decimal=3 in test_compare_lightgbm. Like for you, tests don't pass for min_samples_leaf > 1, but the predictions are still pretty close in general.

  • it doesn't run slower than on master. I also observed that your last commit was a lot slower and I don't understand why: fewer splits to consider = less work = faster, as far as I understand. Maybe some numba compilation thing?

  • tests fail on test_predictor because I set min_samples_leaf to 5 and slightly modified the code to plot the lightgbm model. I get this (lightgbm tree is on the left): Digraph.gv.pdf. I don't understand why our tree is going so deep on the same feature.

  • I get the exact same ROC AUC as on master: .7892

Things are not totally broken, but there's something fishy going on.

@NicolasHug (Collaborator) commented

The code from test_predictor run on master gives a much more reasonable tree: Digraph.gv.pdf

@ogrisel (Owner) commented Nov 2, 2018

> it doesn't run slower than on master. I also observed that your last commit was a lot slower and I don't understand why: fewer splits to consider = less work = faster, as far as I understand. Maybe some numba compilation thing?

Many useless splits get evaluated, but on the Higgs boson dataset n_samples is big enough that the other nodes still get split with only the node-level filtering. Hence the slowdown.

@ogrisel (Owner) left a review

Some comments:

pygbm/grower.py: review thread resolved
                and self.root.n_samples < self.min_samples_leaf):
            # Do not even bother computing any splitting statistics.
            self._finalize_leaf(self.root)
            return
@ogrisel (Owner) commented

It would be great to add a test for this case in test_grower.py.
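Such a test could look roughly like this; the TreeGrower arguments and the finalized_leaves attribute are assumptions for this sketch, not necessarily the final test:

```python
import numpy as np
from pygbm.grower import TreeGrower

def test_root_smaller_than_min_samples_leaf():
    # With fewer samples than min_samples_leaf, the root must be
    # finalized as a leaf without computing any split statistics.
    rng = np.random.RandomState(42)
    n_samples = 5
    X_binned = np.asfortranarray(
        rng.randint(0, 256, size=(n_samples, 2), dtype=np.uint8))
    gradients = rng.normal(size=n_samples).astype(np.float32)
    hessians = np.ones(shape=1, dtype=np.float32)  # constant hessian
    grower = TreeGrower(X_binned, gradients, hessians,
                        min_samples_leaf=n_samples + 1)
    grower.grow()
    assert len(grower.finalized_leaves) == 1
```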

pygbm/splitting.py: review thread resolved
@@ -76,6 +78,7 @@ def __init__(self, n_features, binned_features, n_bins,
         self.l2_regularization = l2_regularization
         self.min_hessian_to_split = min_hessian_to_split
         self.min_samples_leaf = min_samples_leaf
+        self.min_gain_to_split = min_gain_to_split
@ogrisel (Owner) commented

You should raise a ValueError if the user passes min_gain_to_split < 0.
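A minimal sketch of the requested check (the function name and the message format are illustrative):

```python
def _validate_min_gain_to_split(min_gain_to_split):
    if min_gain_to_split < 0:
        raise ValueError(
            'min_gain_to_split={} must be positive.'.format(
                min_gain_to_split))
```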

@NicolasHug (Collaborator) commented

There's already a check in the grower. BTW, min_gain_to_split is only a parameter of the grower, not of GradientBoostingMachine. We should move it there, right?

@ogrisel (Owner) commented

Right, this is fine for now. No need to expose it to the public API at this time.

est_lightgbm.fit(X_train_binned, y_train)

from pygbm.plotting import plot_tree
plot_tree(grower, est_lightgbm)
@ogrisel (Owner) commented

Please do not put such plots in the regular tests. Use the examples/ folder for visual debugging instead.

@ogrisel (Owner) commented

Also if you want to compare lightgbm and pygbm on the Boston dataset, please add a new test in test_compare_lightgbm.py instead.

@NicolasHug (Collaborator) commented

Sure, this is not meant to stay; I just left it so you can reproduce my plots if you want to.

@ogrisel (Owner) commented Nov 2, 2018

Indeed there is something fishy going on...

@ogrisel (Owner) commented Nov 2, 2018

test_predictor.py used to pass before 7e91c2c, right? Maybe this commit is causing the regression, but I am not sure why.

        else:
            hessian_left += histogram[bin_idx]['sum_hessians']
        if hessian_left < context.min_hessian_to_split:
            continue
        hessian_right = context.sum_hessians - hessian_left
        if hessian_right < context.min_hessian_to_split:
            continue
        # won't get any better
@ogrisel (Owner) commented

Maybe add a comment to say that the loss functions are all convex and therefore the hessians are positive.
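One possible wording, spliced into the hunk above (variables as in pygbm/splitting.py):

```python
hessian_right = context.sum_hessians - hessian_left
if hessian_right < context.min_hessian_to_split:
    # The loss functions are all convex, so the per-sample hessians are
    # positive. hessian_right therefore decreases monotonically as
    # bin_idx grows: it "won't get any better" for later bins.
    continue
```

If it truly can't get any better, this could arguably even be a break rather than a continue.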

        gradient_right = context.sum_gradients - gradient_left
        gain = _split_gain(gradient_left, hessian_left,
                           gradient_right, hessian_right,
                           context.sum_gradients, context.sum_hessians,
                           context.l2_regularization)
-        if gain > best_split.gain:
+        if gain > best_split.gain and gain > context.min_gain_to_split:
@ogrisel (Owner) commented

We should probably have gain >= context.min_gain_to_split.
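That is, assuming the same variables as the hunk above, something like:

```python
def split_gain_is_acceptable(gain, best_gain, min_gain_to_split):
    # Use >= so that a gain exactly equal to the threshold (in
    # particular the default min_gain_to_split=0) is not rejected.
    return gain > best_gain and gain >= min_gain_to_split
```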

@ogrisel (Owner) commented

Yes this is the cause of the bad trees.

@ogrisel (Owner) commented

Oops, I spoke too quickly: I had set min_samples_leaf=1 in test_predictor.py and forgot about it.

@ogrisel (Owner) commented Nov 2, 2018

Maybe the structure we observe is expected in cases where the most predictive feature is linearly correlated with the target value: the gain should then be constant for consecutive bin_idx values of a given feature. One way to mitigate this would be to detect those areas of constant gain and split in the middle of the plateau instead of at its right-hand side.
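A hedged sketch of that mitigation (names are illustrative; gains[i] is the gain of splitting right after bin i for one feature):

```python
import numpy as np

def pick_plateau_middle(gains):
    """Middle bin of the contiguous plateau of (near-)maximal gain."""
    gains = np.asarray(gains, dtype=np.float64)
    best = int(np.argmax(gains))
    lo = hi = best
    while lo > 0 and np.isclose(gains[lo - 1], gains[best]):
        lo -= 1
    while hi < len(gains) - 1 and np.isclose(gains[hi + 1], gains[best]):
        hi += 1
    # With a strictly unique maximum this reduces to the plain argmax.
    return (lo + hi) // 2
```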

@ogrisel (Owner) commented Nov 2, 2018

I am not sure about my theory; I cannot come up with a synthetic dataset that would trigger this case.

@NicolasHug (Collaborator) commented

> test_predictor.py used to pass before 7e91c2c, right? Maybe this commit is causing the regression, but I am not sure why.

Yes it passes in 1c6e62b with 5 leaves. The threshold was still lowered from .9 to .75.

@ogrisel (Owner) commented Nov 3, 2018

> Yes it passes in 1c6e62b with 5 leaves. The threshold was still lowered from .9 to .75.

min_samples_leaf=5, not 5 leaves.

The threshold on the training set is expected to drop when we control overfitting with a stricter min_samples_leaf. However there should be a value for min_samples_leaf where the test accuracy is comparable to the previous values. I just pushed one such value.

@ogrisel (Owner) commented Nov 3, 2018

OK, I pushed some small improvements. I would be in favor of merging this PR as is. There are still some discrepancies with LightGBM (see #32), but I think that the specific case of #34 is fixed.

@ogrisel ogrisel merged commit 154def0 into master Nov 3, 2018
@ogrisel ogrisel deleted the fix-min_samples_leaf branch November 3, 2018 00:38