Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix program stop when split data count equals zero #5087

Closed
wants to merge 6 commits into from

Conversation

GinkoBalboa
Copy link
Contributor

Recently we have discovered a couple of bugs preventing us to train models. The first bug is
connected with the abrupt stopping of the training because a node split is produced in which the data count is
zero. The bug makes the LightGBM exit so that the model is lost. The problem arises since we lose
our training progress, and we cannot use the LightGBM in these cases. We have also discovered that
other people had similar problems, for example:

The first thing was to produce the data that causes crashes. We've done this by generating random
sequences and capturing the sequence that causes the desired error. Because of the, somewhat
different numerics during the training processes when dealing with GPU or CPU we gathered
two sets of data, one that breaks on the CPU, and one that breaks on the GPU:

  • tests/data/data_fail_leaf_count_zero_cpu.csv
  • tests/data/data_fail_leaf_count_zero_gpu.csv
  • tests/data/data_fail_num_machines_gt_one.csv

When these sequences are ran through the test code given in the additional test cases for this
problem:

  • tests/python_package_test/test_engine.py:test_training_leaf_count_zero
  • tests/python_package_test/test_engine.py:test_training_num_machines_gt_one

the training stops with one of the two following errors:

    Check failed: (best_split_info.left_count) > (0) at /home/user/LightGBM/src/treelearner/serial_tree_learner.cpp, line 687
    Check failed: (best_split_info.right_count) > (0) at /home/user/LightGBM/src/treelearner/serial_tree_learner.cpp, line 697

The solution

The most simple solution for us is not to stop the training at this point and just return a
warning. At least, this allows us to save the trained model and examine it later. When we
implemented the solution there was another break that manifested by the num_machine greater
than one check failed, while running the second test example (test_training_num_machines_gt_one).

    Check failed: (num_machines) > (1) at /home/user/LightGBM/src/treelearner/serial_tree_learner.cpp, line 740

So in this place, we implemented the same simple logic - just log a warning and continue the
training. Maybe the patch proposed here is not the most elegant one, but at least it gives us the
possibility not to lose the trained model.

@ghost
Copy link

ghost commented Mar 21, 2022

CLA assistant check
All CLA requirements met.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your interest in LightGBM!

Before we review this PR in more depth:

  1. Please sign the CLA by clicking the link in Fix program stop when split data count equals zero #5087 (comment)
  2. Please remove the CSVs added in tests/data, and instead use synthetic data created with numpy / pandas. See the existing tests for examples.
    • adding such large files to the repo increases the time and network bandwidth required to train this repo, and we are not willing to tolerate that for the benefit of one test

@GinkoBalboa
Copy link
Contributor Author

Thanks for the suggestion. I managed to find seed numbers that produce errors when data is generated from them. Now I can remove the .csv files.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@guolinke or @shiyu1994 I'm not sure about the changes in serial_tree_learner.cpp. I think you're better qualified to review that proposal.

If you agree with the proposed changes, I'll come back and give a more specific review about changes to the tests.

@guolinke
Copy link
Collaborator

Hi @shiyu1994 , is best_split_info.left_count == 0 still happening in the latest commits?
Although it is more like a bug, this PR provides a workaround.

@shiyu1994
Copy link
Collaborator

is best_split_info.left_count == 0 still happening in the latest commits?

Yes, since we are still getting this bug reports from the users, see e.g., #3679 (comment).

I believe we need a fix instead of a workaround. I'll debug with the above reproducible example today.

@jameslamb
Copy link
Collaborator

I believe we need a fix instead of a workaround. I'll debug with the above reproducible example today.

@shiyu1994 this comment means we should close this pull request, in favor of a fix, right?

@guolinke
Copy link
Collaborator

guolinke commented Apr 1, 2022

it depends how soon we can fix it. If it is quick, we can have a fix, otherwise we can merge this workaround first.

@GinkoBalboa GinkoBalboa requested a review from jameslamb April 20, 2022 07:02
@StrikerRUS
Copy link
Collaborator

StrikerRUS commented Apr 21, 2022

I believe we should have a real fix for this error in v4.0.0, not a removed constraint.

@shiyu1994
Copy link
Collaborator

@jameslamb

this comment means we should close this pull request, in favor of a fix, right?

Definitely we want to fix this before 4.0.0. But maybe we can keep this open before the fix? Or even directly modify this PR for a fix since it provides a test case which should be included together with the fix.

@shiyu1994
Copy link
Collaborator

But maybe we can keep this open before the fix?

This PR can be closed. The root cause of the example provided in #3679 @mshivers (which is the test case added in this PR) is due to cost efficient gradient boosting. Please refer to #5164.

@GinkoBalboa Thank you for opening this PR and bring the issue to us again!

Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 15, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants