Fix program stop when split data count equals zero #5087

GinkoBalboa · 2022-03-21T13:53:41Z

We recently discovered a couple of bugs that prevent us from training models. The first bug
abruptly stops training when a node split is produced in which the data count is
zero. The bug makes LightGBM exit, so the model is lost along with our training
progress, and we cannot use LightGBM in these cases. We also found that
other people have had similar problems, for example:

Check failed: (best_split_info.left_count) > (0) #4946

The first step was to produce data that causes the crashes. We did this by generating random
sequences and capturing the sequences that trigger the error. Because the numerics
differ somewhat between GPU and CPU training, we gathered
two sets of data, one that breaks on the CPU and one that breaks on the GPU:

tests/data/data_fail_leaf_count_zero_cpu.csv
tests/data/data_fail_leaf_count_zero_gpu.csv
tests/data/data_fail_num_machines_gt_one.csv

When these sequences are run through the test code in the additional test cases for this
problem:

tests/python_package_test/test_engine.py:test_training_leaf_count_zero
tests/python_package_test/test_engine.py:test_training_num_machines_gt_one

the training stops with one of the two following errors:

    Check failed: (best_split_info.left_count) > (0) at /home/user/LightGBM/src/treelearner/serial_tree_learner.cpp, line 687

    Check failed: (best_split_info.right_count) > (0) at /home/user/LightGBM/src/treelearner/serial_tree_learner.cpp, line 697

The solution

The simplest solution for us is not to stop the training at this point and instead log a
warning. At least this allows us to save the trained model and examine it later. After we
implemented this, another failure appeared while running the second test example
(test_training_num_machines_gt_one): the check that num_machines is greater than one failed.

    Check failed: (num_machines) > (1) at /home/user/LightGBM/src/treelearner/serial_tree_learner.cpp, line 740

So in this place, we implemented the same simple logic - just log a warning and continue the
training. Maybe the patch proposed here is not the most elegant one, but at least it gives us the
possibility not to lose the trained model.
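To make the proposed behavior concrete, here is a minimal Python sketch of the "warn and continue" idea. It is only an illustration: the actual patch is in the C++ file serial_tree_learner.cpp, and the function name `check_split_counts` and its `fatal` flag are hypothetical, not part of LightGBM.

```python
import warnings


def check_split_counts(left_count: int, right_count: int, fatal: bool = False) -> bool:
    """Return True if both children of a split are non-empty.

    Hypothetical helper, not LightGBM code. With fatal=True an empty child
    raises (the current behavior, which aborts training); with fatal=False
    it only warns, so the caller can skip the split and keep the model.
    """
    for name, count in (("left_count", left_count), ("right_count", right_count)):
        if count <= 0:
            message = f"Check failed: (best_split_info.{name}) > (0)"
            if fatal:
                raise RuntimeError(message)
            warnings.warn(message, RuntimeWarning)
            return False
    return True


# Usage: an empty left child produces a warning instead of an exception.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ok = check_split_counts(left_count=0, right_count=120)
print(ok, len(caught))
```

The trade-off is the one raised in this PR: the warning keeps the training progress, but it does not explain why an empty split was selected in the first place.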

ghost · 2022-03-21T13:53:55Z

All CLA requirements met.

jameslamb

Thanks for your interest in LightGBM!

Before we review this PR in more depth:

Please sign the CLA by clicking the link in Fix program stop when split data count equals zero #5087 (comment)
Please remove the CSVs added in tests/data, and instead use synthetic data created with numpy / pandas. See the existing tests for examples.
- adding such large files to the repo increases the time and network bandwidth required to clone this repo, and we are not willing to tolerate that for the benefit of one test

GinkoBalboa · 2022-03-21T16:34:57Z

Thanks for the suggestion. I managed to find seed numbers that produce errors when data is generated from them. Now I can remove the .csv files.
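A seed search of this kind can be sketched as follows. This is a hedged illustration rather than the test code from this PR: `make_dataset`, `find_failing_seed`, and `fake_train` are hypothetical names, and `fake_train` is a stand-in that only mimics a training call raising an error. In a real reproduction, the callable would wrap an actual LightGBM training run and the caught exception type would be the library's error class.

```python
import random
from typing import Callable, Iterable, List, Optional, Tuple

Dataset = Tuple[List[List[float]], List[float]]


def make_dataset(seed: int, n_rows: int = 50, n_cols: int = 3) -> Dataset:
    """Generate a small random dataset that is fully determined by the seed."""
    rng = random.Random(seed)
    features = [[rng.random() for _ in range(n_cols)] for _ in range(n_rows)]
    labels = [rng.random() for _ in range(n_rows)]
    return features, labels


def find_failing_seed(
    train_fn: Callable[[Dataset], object],
    seeds: Iterable[int],
    error_type: type = Exception,
) -> Optional[int]:
    """Return the first seed whose dataset makes train_fn raise error_type."""
    for seed in seeds:
        try:
            train_fn(make_dataset(seed))
        except error_type:
            return seed
    return None


def fake_train(data: Dataset) -> str:
    """Stand-in for a training call; fails on an arbitrary data condition."""
    _, labels = data
    if min(labels) < 0.01:
        raise RuntimeError("Check failed: (best_split_info.left_count) > (0)")
    return "model"


failing_seed = find_failing_seed(fake_train, range(1000), RuntimeError)
print("first failing seed:", failing_seed)
```

Because the dataset is regenerated from the seed, the test only needs to store an integer instead of a CSV file.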

jameslamb

@guolinke or @shiyu1994 I'm not sure about the changes in serial_tree_learner.cpp. I think you're better qualified to review that proposal.

If you agree with the proposed changes, I'll come back and give a more specific review about changes to the tests.

guolinke · 2022-03-23T03:27:26Z

Hi @shiyu1994 , is best_split_info.left_count == 0 still happening in the latest commits?
Although it is more like a bug, this PR provides a workaround.

shiyu1994 · 2022-03-24T02:29:58Z

> is best_split_info.left_count == 0 still happening in the latest commits?

Yes, since we are still getting these bug reports from users, see e.g. #3679 (comment).

I believe we need a fix instead of a workaround. I'll debug with the above reproducible example today.

jameslamb · 2022-04-01T00:56:15Z

> I believe we need a fix instead of a workaround. I'll debug with the above reproducible example today.

@shiyu1994 this comment means we should close this pull request, in favor of a fix, right?

guolinke · 2022-04-01T01:14:30Z

It depends on how soon we can fix it. If it is quick, we can have a fix; otherwise we can merge this workaround first.

StrikerRUS · 2022-04-21T00:54:32Z

I believe we should have a real fix for this error in v4.0.0, not a removed constraint.

shiyu1994 · 2022-04-21T02:54:58Z

@jameslamb

> this comment means we should close this pull request, in favor of a fix, right?

Definitely we want to fix this before 4.0.0. But maybe we can keep this open before the fix? Or even directly modify this PR for a fix since it provides a test case which should be included together with the fix.

shiyu1994 · 2022-04-21T09:08:12Z

> But maybe we can keep this open before the fix?

This PR can be closed. The root cause of the failure in the example provided in #3679 (@mshivers), which is the test case added in this PR, is cost-efficient gradient boosting. Please refer to #5164.

@GinkoBalboa Thank you for opening this PR and bringing the issue to us again!

github-actions · 2023-11-15T00:20:48Z

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

Fix program stop when split data count equals zero

a915259

GinkoBalboa requested review from guolinke, btrotta, shiyu1994, hzy46, tongwu-sh, StrikerRUS, jmoralez and jameslamb as code owners March 21, 2022 13:53

jameslamb requested changes Mar 21, 2022

jameslamb added the in progress label Mar 21, 2022

Added random seed seq that produces error (w/o patch)

4b925e1

GinkoBalboa added 2 commits March 21, 2022 17:51

Removed test for gpu

de9f508

Correct linting python errors

080f02b

jameslamb reviewed Mar 21, 2022

jameslamb added the awaiting review label Mar 21, 2022

GinkoBalboa added 2 commits March 21, 2022 18:20

Fix some c++ linting errors

e5d7b5b

Correct condition for num_machines

8737b85

GinkoBalboa requested a review from jameslamb April 20, 2022 07:02

shiyu1994 closed this Apr 21, 2022

jameslamb removed the awaiting review label Mar 16, 2023

jameslamb removed the in progress label Aug 13, 2023

github-actions bot locked as resolved and limited conversation to collaborators Nov 15, 2023
