-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clear split info buffer in cost efficient gradient boosting before every iteration (fix partially #3679) #5164
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you accidentally tag the wrong issue in the description? This doesn't seem related to #4969 at all.
Also, is it possible to create a test that reproduces the bug this PR addresses, to ensure this fix is working and prevent it from being accidentally re-introduced in the future?
n1 = 800000 Segmentation fault (core dumped) Thanks! @shiyu1994 |
@bluesummerv Thanks for using LightGBM. It seems that your example is not related with this PR. Could you please open a new issue for your bug report? |
@jameslamb Thanks for the suggestion. But TBH I don't think we need a test case for this fix. It just fixes a logical mistake in the code of CEGB, which is not a corner case that is likely to be introduced again by future modifications. I think it makes sense to add test cases to guarantee the code to work in some tricky scenarios. But perhaps it is not necessary to add a test case for obvious logical mistakes in programming. WDYT? |
I believe a test should be added in this PR. I strongly believe that PRs which fix bugs in software should also introduce tests confirming that the software no longer exhibits those bugs, unless the effort involved in creating the tests or the cost of running the tests is too large. Given that we have already have a clear reproducible example of a bug in LightGBM (#3679 (comment)), the effort required to convert it to a test case should not be too much, and I think it is worth it for the increased confidence that this bug won't be reintroduced. That test would also give us some additional code coverage of using CEGB, which isn't tested thoroughly in the project today. In my opinion, tests should not just be used for "corner cases" or "tricky scenarios". Tests should try to cover the ways that users might use the software, and check that the software behaves as those uses would expect it to. Testing gives us confidence that once a bug is fixed, it won't be reintroduced. This project is very large and there are many, many possible code paths based on e.g. different combinations of training parameters, compilation options, data characteristics, operating systems, etc. That set of combinations is far too large for us to just rely on maintainers' knowledge and pull request reviews to catch regressions. |
@jameslamb Done with adding the test case. Please check. |
@guolinke This is very important bugfix for our long living bug. Could you please help to review it? |
Note that this only fixes the case when CEGB is used (that's why the word partially is used in the title of this PR). I need further verification to see if the same bug (Check failed: (best_split_info.left_count) > (0)) still happens in normal cases. |
Thanks for the clarification! However, GitHub autoclosing mechanism doesn't understand this and will close #3679 after merging this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
Could you please remove the change requests so that we can merge this RP? Currently it is blocked by the change requests. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
This is to fix the case provided by @mshivers in #3679. It turns out that the bug is not related to the core algorithm of LightGBM, but due to the cost-efficient gradient boosting module.
In the code snippet below, the best split of
SerialTreeLearner
can be updated byCostEfficientGradientBoosting
,LightGBM/src/treelearner/cost_effective_gradient_boosting.hpp
Lines 85 to 110 in fc0c8fd
However,
CostEfficientGradientBoosting
did not clear its buffer of splitssplits_per_leaf_
before a new boosting iteration starts, which causes it to contain splits from previous trees. And these splits with wrong split information (sum of gradients and hessians, gain, etc.) will be mixed into the current tree.