Fix number of features mismatching in continual training (fix #5156) #5157

Open · wants to merge 14 commits into base: master

Changes from 5 commits
36 changes: 30 additions & 6 deletions src/boosting/gbdt_model_text.cpp
@@ -104,7 +104,15 @@ std::string GBDT::DumpModel(int start_iteration, int num_iteration, int feature_
for (size_t i = 0; i < feature_importances.size(); ++i) {
size_t feature_importances_int = static_cast<size_t>(feature_importances[i]);
if (feature_importances_int > 0) {
pairs.emplace_back(feature_importances_int, feature_names_[i]);
if (i < feature_names_.size()) {
pairs.emplace_back(feature_importances_int, feature_names_[i]);
} else {
// with LibSVM format and continual training, the number of features in the dataset can be fewer than in the initial model
// in that case FeatureImportance returns a vector sized to the number of features in the initial model
std::stringstream str_buf;
str_buf << "Column_" << i << "_from_init_model";
pairs.emplace_back(feature_importances_int, str_buf.str());
}
}
}
str_buf << '\n' << "\"feature_importances\":" << "{";
@@ -377,7 +385,15 @@ std::string GBDT::SaveModelToString(int start_iteration, int num_iteration, int
for (size_t i = 0; i < feature_importances.size(); ++i) {
size_t feature_importances_int = static_cast<size_t>(feature_importances[i]);
if (feature_importances_int > 0) {
pairs.emplace_back(feature_importances_int, feature_names_[i]);
if (i < feature_names_.size()) {
pairs.emplace_back(feature_importances_int, feature_names_[i]);
} else {
// with LibSVM format and continual training, the number of features in the dataset can be fewer than in the initial model
// in that case FeatureImportance returns a vector sized to the number of features in the initial model
std::stringstream str_buf;
str_buf << "Column_" << i << "_from_init_model";
pairs.emplace_back(feature_importances_int, str_buf.str());
}
}
}
// sort the importance
@@ -636,21 +652,29 @@ std::vector<double> GBDT::FeatureImportance(int num_iteration, int importance_ty
for (int iter = 0; iter < num_used_model; ++iter) {
for (int split_idx = 0; split_idx < models_[iter]->num_leaves() - 1; ++split_idx) {
if (models_[iter]->split_gain(split_idx) > 0) {
const int real_feature_index = models_[iter]->split_feature(split_idx);
#ifdef DEBUG
CHECK_GE(models_[iter]->split_feature(split_idx), 0);
CHECK_GE(real_feature_index, 0);
#endif
feature_importances[models_[iter]->split_feature(split_idx)] += 1.0;
if (static_cast<size_t>(real_feature_index) >= feature_importances.size()) {
feature_importances.resize(real_feature_index + 1);
}
feature_importances[real_feature_index] += 1.0;
}
}
}
} else if (importance_type == 1) {
for (int iter = 0; iter < num_used_model; ++iter) {
for (int split_idx = 0; split_idx < models_[iter]->num_leaves() - 1; ++split_idx) {
if (models_[iter]->split_gain(split_idx) > 0) {
const int real_feature_index = models_[iter]->split_feature(split_idx);
#ifdef DEBUG
CHECK_GE(models_[iter]->split_feature(split_idx), 0);
CHECK_GE(real_feature_index, 0);
#endif
feature_importances[models_[iter]->split_feature(split_idx)] += models_[iter]->split_gain(split_idx);
if (static_cast<size_t>(real_feature_index) >= feature_importances.size()) {
feature_importances.resize(real_feature_index + 1);
}
feature_importances[real_feature_index] += models_[iter]->split_gain(split_idx);
}
}
}
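To illustrate the behavior the change above implements, here is a minimal sketch using the numpy-based setup discussed in the review thread below; parameter values are illustrative, and which placeholder names actually appear in the dump depends on the features the initial model happened to split on:

import numpy as np
import lightgbm as lgb

# initial model trained on 10 features; its trees may split on any of columns 0-9
rng = np.random.default_rng(0)
train_X = rng.random((100, 10))
train_y = train_X.sum(axis=1)
params = {"objective": "regression", "num_trees": 10, "verbose": -1,
          "predict_disable_shape_check": True}
init_model = lgb.train(params, lgb.Dataset(train_X, label=train_y))

# continue training on a dataset that has only 5 features
cont_X = rng.random((100, 5))
cont_y = cont_X.sum(axis=1)
booster = lgb.train(params, lgb.Dataset(cont_X, label=cont_y), init_model=init_model)

# the dumped model now uses a placeholder name for split features inherited from the
# initial model that lie beyond the new dataset's feature names, e.g. "Column_8_from_init_model"
model_str = booster.model_to_string()
print(model_str[model_str.find("feature_importances:"):])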
20 changes: 20 additions & 0 deletions tests/python_package_test/test_engine.py
@@ -966,6 +966,26 @@ def test_continue_train_multiclass():
assert evals_result['valid_0']['multi_logloss'][-1] == pytest.approx(ret)


def test_continue_train_different_feature_size():
np.random.seed(0)
train_X = np.random.randn(100, 10)
train_y = np.sum(train_X, axis=1)
Collaborator:

Hmmm, strange. With version 3.3.2 this test doesn't fail on my local machine every time; it only fails sometimes.
How about using a concrete trivial dataset instead of a random one, something like

train_X = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
train_y = np.array([0, 1, 2])

?

Collaborator Author:

Now a warning is provided, see https://github.com/microsoft/LightGBM/pull/5157/files#r868752144.
The test case now just checks whether that warning message is emitted, so it should always fail in 3.3.2.

Collaborator:

Now it should always fail in 3.3.2.

That isn't true for me 😕
I can sometimes successfully run the test on my Windows machine with LightGBM 3.3.2 installed from PyPI.

Output:

[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.000000
[LightGBM] [Debug] init for col-wise cost 0.000004 seconds, init for row-wise cost 0.000180 seconds
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000209 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 175
[LightGBM] [Info] Number of data points in the train set: 100, number of used features: 5
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
D:\Miniconda3\lib\site-packages\lightgbm\engine.py:177: UserWarning: Found `num_trees` in params. Will use it instead of argument
  _log_warning(f"Found `{alias}` in params. Will use it instead of argument")
<lightgbm.basic.Booster at 0x2b6dc985760>

Collaborator Author:

Sorry about that and thanks for checking.
I've updated the test case so that the initial model is guaranteed to split on feature 8 or 9, while the continual-training dataset contains only 5 features. Now the test case should always fail in previous versions.
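For reference, a minimal sketch of the construction described above, where the leading columns are constant so every positive-gain split in the initial model must land on feature 8 or 9 (this mirrors the test code posted further down in this thread; exact shapes and values are illustrative):

import numpy as np
import lightgbm as lgb

# columns 0-7 are constant and carry no signal; only columns 8 and 9 determine the target,
# so any split with positive gain in the initial model must use feature 8 or 9
train_X = np.hstack([np.ones((100, 8)), np.arange(200, 0, -1).reshape(-1, 2)])
train_y = train_X[:, -2:].sum(axis=1)
init_model = lgb.train({"objective": "regression", "num_trees": 10, "verbose": -1},
                       lgb.Dataset(train_X, label=train_y))

# the continual-training dataset then contains only 5 features (columns 0-4)
train_X_cont = np.random.rand(100, 5)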

Collaborator:

Thank you very much for working on the test!
However, now this test never fails (was run successfully 50 times in a row) on my machine with LightGBM 3.3.2 😄

Log:
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.000000
[LightGBM] [Debug] init for col-wise cost 0.000003 seconds, init for row-wise cost 0.000161 seconds
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000187 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 175
[LightGBM] [Info] Number of data points in the train set: 100, number of used features: 5
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
<lightgbm.basic.Booster at 0x1dadc11aa90>
Model:
tree
version=v3
num_class=1
num_tree_per_iteration=1
label_index=0
max_feature_idx=4
objective=regression
feature_names=Column_0 Column_1 Column_2 Column_3 Column_4
feature_infos=[0.015606064446828216:0.99033894739670436] [0.011714084185001972:0.99201124341447411] [0.01203622289765427:0.98837383805922618] [0.01323685775889949:0.99440078964767942] [0.0046954761925470656:0.99884700656786651]
tree_sizes=433 456 454 456 454 457 452 457 458 382 463 387 464 463 462 463 387 465 464 387

Tree=0
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=999600 132192 117600
threshold=103.00000000000001 55.000000000000007 151.00000000000003
decision_type=2 2 2
left_child=1 -1 -2
right_child=2 -3 -4
leaf_value=186.40000000000001 206.19999999999999 196.59999999999999 216
leaf_weight=27 24 24 25
leaf_count=27 24 24 25
internal_value=201 191.2 211.2
internal_weight=0 51 49
internal_count=100 51 49
is_linear=0
shrinkage=1


Tree=1
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=813246 114310 91614.6
threshold=97.000000000000014 145.00000000000003 43.000000000000007
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-14.339999643961589 3.6000000285605589 13.005000032697405 -5.5333332962459991
leaf_weight=21 24 28 27
leaf_count=21 24 28 27
internal_value=0 8.66423 -9.38625
internal_weight=0 52 48
internal_count=100 52 48
is_linear=0
shrinkage=0.1


Tree=2
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=661912 106022 66947.7
threshold=109.00000000000001 61.000000000000007 157.00000000000003
decision_type=2 2 2
left_child=1 -1 -2
right_child=2 -3 -4
leaf_value=-11.472199859619142 5.1623749971389774 -2.5549999723210934 12.799499893188477
leaf_weight=30 24 24 22
leaf_count=30 24 24 22
internal_value=0 -7.509 8.81491
internal_weight=0 54 46
internal_count=100 54 46
is_linear=0
shrinkage=0.1


Tree=3
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=541286 90395.6 51743.2
threshold=91.000000000000014 139.00000000000003 43.000000000000007
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-11.758779907226563 2.0473306971912582 10.221966577345327 -4.961771702766419
leaf_weight=21 24 31 24
leaf_count=21 24 31 24
internal_value=0 6.65485 -8.13371
internal_weight=0 55 45
internal_count=100 55 45
is_linear=0
shrinkage=0.1


Tree=4
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=439960 83077 36862.1
threshold=115.00000000000001 67.000000000000014 157.00000000000003
decision_type=2 2 2
left_child=1 -1 -2
right_child=2 -3 -4
leaf_value=-9.0168283057935312 4.6399736268179756 -1.2844118246187768 10.49735318964178
leaf_weight=33 21 24 22
leaf_count=33 21 24 22
internal_value=0 -5.76107 7.63677
internal_weight=0 57 43
internal_count=100 57 43
is_linear=0
shrinkage=0.1


Tree=5
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=359523 70011.3 29162.8
threshold=85.000000000000014 133.00000000000003 43.000000000000007
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-9.6812194460914256 0.9671219994624457 8.0214061737060547 -4.4111016545976911
leaf_weight=21 24 34 21
leaf_count=21 24 34 21
internal_value=0 5.10239 -7.04616
internal_weight=0 58 42
internal_count=100 58 42
is_linear=0
shrinkage=0.1


Tree=6
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=291965 51781 29631.8
threshold=91.000000000000014 157.00000000000003 49.000000000000007
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-8.374019781748455 2.382246724854816 8.645477294921875 -3.2303891863141745
leaf_weight=24 33 22 21
leaf_count=24 33 22 21
internal_value=0 4.88754 -5.97366
internal_weight=0 55 45
internal_count=100 55 45
is_linear=0
shrinkage=0.1


Tree=7
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=241666 44241.4 20265.2
threshold=115.00000000000001 49.000000000000007 157.00000000000003
decision_type=2 2 2
left_child=1 -1 -2
right_child=2 -3 -4
leaf_value=-7.5366175810496019 3.4379374277024044 -1.8938883086045584 7.7809295654296875
leaf_weight=24 21 33 22
leaf_count=24 21 33 22
internal_value=0 -4.26977 5.65993
internal_weight=0 57 43
internal_count=100 57 43
is_linear=0
shrinkage=0.1


Tree=8
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=196769 38913.2 15310.5
threshold=85.000000000000014 133.00000000000003 43.000000000000007
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-7.1220336823236385 0.69178846441209318 5.9509620666503906 -3.3034729072025848
leaf_weight=21 24 34 21
leaf_count=21 24 34 21
internal_value=0 3.77475 -5.21275
internal_weight=0 58 42
internal_count=100 58 42
is_linear=0
shrinkage=0.1


Tree=9
num_leaves=3
num_cat=0
split_feature=8 8
split_gain=160757 35733.3
threshold=121.00000000000001 73.000000000000014
decision_type=2 2
left_child=1 -1
right_child=-2 -3
leaf_value=-5.2662804179721405 4.9105544734001159 -0.28483681498716273
leaf_weight=36 40 24
leaf_count=36 40 24
internal_value=0 -3.2737
internal_weight=0 60
internal_count=100 60
is_linear=0
shrinkage=0.1


Tree=10
num_leaves=4
num_cat=0
split_feature=2 3 1
split_gain=13.801 10.6247 5.30422
threshold=0.39124460793750004 0.57209255545676052 0.51007976537911426
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-10.008088938395183 -9.9444599969046461 -9.8565416689272283 -9.9392709641229544
leaf_weight=24 28 27 21
leaf_count=24 28 27 21
internal_value=0 -9.9013 -9.97597
internal_weight=0 55 45
internal_count=100 55 45
is_linear=0
shrinkage=0.1


Tree=11
num_leaves=3
num_cat=0
split_feature=4 0
split_gain=11.5297 8.62398
threshold=0.67098217697241236 0.49642413305341432
decision_type=2 2
left_child=1 -1
right_child=-2 -3
leaf_value=-8.9968132420590052 -8.8941043404971847 -8.9236732210431793
leaf_weight=38 34 28
leaf_count=38 34 28
internal_value=0 -8.96578
internal_weight=0 66
internal_count=100 66
is_linear=0
shrinkage=0.1


Tree=12
num_leaves=4
num_cat=0
split_feature=1 0 3
split_gain=11.1402 5.70155 5.37138
threshold=0.4329138381487897 0.31488953069403935 0.48185106917748455
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-8.1240014648437491 -8.0604864211309515 -7.9952484955658791 -8.0523968436501256
leaf_weight=20 21 37 22
leaf_count=20 21 37 22
internal_value=0 -8.01887 -8.08649
internal_weight=0 58 42
internal_count=100 58 42
is_linear=0
shrinkage=0.1


Tree=13
num_leaves=4
num_cat=0
split_feature=2 3 1
split_gain=9.77513 6.994 3.37145
threshold=0.39124460793750004 0.57209255545676052 0.51007976537911426
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-7.3027134259541828 -7.2492815290178569 -7.1779496934678821 -7.2478478931245354
leaf_weight=24 28 27 21
leaf_count=24 28 27 21
internal_value=0 -7.21426 -7.27711
internal_weight=0 55 45
internal_count=100 55 45
is_linear=0
shrinkage=0.1


Tree=14
num_leaves=4
num_cat=0
split_feature=4 3 2
split_gain=8.72052 6.28816 2.9544
threshold=0.38168423736294915 0.39970451119142136 0.57474024719752059
decision_type=2 2 2
left_child=-1 -2 -3
right_child=1 2 -4
leaf_value=-6.5552221249311406 -6.540647659301758 -6.499760017395019 -6.4460565839494981
leaf_weight=39 20 20 21
leaf_count=39 20 20 21
internal_value=0 -6.49468 -6.47225
internal_weight=0 61 41
internal_count=100 61 41
is_linear=0
shrinkage=0.1


Tree=15
num_leaves=4
num_cat=0
split_feature=1 0 3
split_gain=7.48096 4.12722 3.46943
threshold=0.4329138381487897 0.39974028150985036 0.48185106917748455
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-5.9287466812133793 -5.8698616948621023 -5.8165104438518656 -5.8711990703235983
leaf_weight=20 29 29 22
leaf_count=20 29 29 22
internal_value=0 -5.84319 -5.8986
internal_weight=0 58 42
internal_count=100 58 42
is_linear=0
shrinkage=0.1


Tree=16
num_leaves=3
num_cat=0
split_feature=4 0
split_gain=7.09244 4.97742
threshold=0.67098217697241236 0.49642413305341432
decision_type=2 2
left_child=1 -1
right_child=-2 -3
leaf_value=-5.3225026983963817 -5.2427101247450887 -5.2669374602181573
leaf_weight=38 34 28
leaf_count=38 34 28
internal_value=0 -5.29893
internal_weight=0 66
internal_count=100 66
is_linear=0
shrinkage=0.1


Tree=17
num_leaves=4
num_cat=0
split_feature=2 3 0
split_gain=6.14628 3.95565 2.09787
threshold=0.39124460793750004 0.57209255545676052 0.31488953069403935
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-4.8033817863464359 -4.7557433400835309 -4.7020983519377531 -4.7599296722412108
leaf_weight=20 28 27 25
leaf_count=20 28 27 25
internal_value=0 -4.72941 -4.77924
internal_weight=0 55 45
internal_count=100 55 45
is_linear=0
shrinkage=0.1


Tree=18
num_leaves=4
num_cat=0
split_feature=1 4 3
split_gain=5.78174 3.21241 2.42224
threshold=0.4329138381487897 0.52674300595287049 0.48185106917748455
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-4.3300938415527339 -4.2781520597396359 -4.2309710891158501 -4.2820091941139919
leaf_weight=20 31 27 22
leaf_count=20 31 27 22
internal_value=0 -4.25619 -4.30491
internal_weight=0 58 42
internal_count=100 58 42
is_linear=0
shrinkage=0.1


Tree=19
num_leaves=3
num_cat=0
split_feature=4 2
split_gain=5.06465 3.34179
threshold=0.38168423736294915 0.60984868647320056
decision_type=2 2
left_child=-1 -2
right_child=1 -3
leaf_value=-3.8771305084228516 -3.8485698895576679 -3.7998270381580705
leaf_weight=39 39 22
leaf_count=39 39 22
internal_value=0 -3.83099
internal_weight=0 61
internal_count=100 61
is_linear=0
shrinkage=0.1


end of trees

feature_importances:
Column_3=7
Column_0=5
Column_1=5
Column_2=5
Column_4=5

parameters:
[boosting: gbdt]
[objective: regression]
[metric: l2]
[tree_learner: serial]
[device_type: cpu]
[data: ]
[valid: ]
[num_iterations: 10]
[learning_rate: 0.1]
[num_leaves: 31]
[num_threads: 0]
[deterministic: 0]
[force_col_wise: 0]
[force_row_wise: 0]
[histogram_pool_size: -1]
[max_depth: -1]
[min_data_in_leaf: 20]
[min_sum_hessian_in_leaf: 0.001]
[bagging_fraction: 1]
[pos_bagging_fraction: 1]
[neg_bagging_fraction: 1]
[bagging_freq: 0]
[bagging_seed: 3]
[feature_fraction: 1]
[feature_fraction_bynode: 1]
[feature_fraction_seed: 2]
[extra_trees: 0]
[extra_seed: 6]
[early_stopping_round: 0]
[first_metric_only: 0]
[max_delta_step: 0]
[lambda_l1: 0]
[lambda_l2: 0]
[linear_lambda: 0]
[min_gain_to_split: 0]
[drop_rate: 0.1]
[max_drop: 50]
[skip_drop: 0.5]
[xgboost_dart_mode: 0]
[uniform_drop: 0]
[drop_seed: 4]
[top_rate: 0.2]
[other_rate: 0.1]
[min_data_per_group: 100]
[max_cat_threshold: 32]
[cat_l2: 10]
[cat_smooth: 10]
[max_cat_to_onehot: 4]
[top_k: 20]
[monotone_constraints: ]
[monotone_constraints_method: basic]
[monotone_penalty: 0]
[feature_contri: ]
[forcedsplits_filename: ]
[refit_decay_rate: 0.9]
[cegb_tradeoff: 1]
[cegb_penalty_split: 0]
[cegb_penalty_feature_lazy: ]
[cegb_penalty_feature_coupled: ]
[path_smooth: 0]
[interaction_constraints: ]
[verbosity: 2]
[saved_feature_importance_type: 0]
[linear_tree: 0]
[max_bin: 255]
[max_bin_by_feature: ]
[min_data_in_bin: 3]
[bin_construct_sample_cnt: 200000]
[data_random_seed: 1]
[is_enable_sparse: 1]
[enable_bundle: 1]
[use_missing: 1]
[zero_as_missing: 0]
[feature_pre_filter: 1]
[pre_partition: 0]
[two_round: 0]
[header: 0]
[label_column: ]
[weight_column: ]
[group_column: ]
[ignore_column: ]
[categorical_feature: ]
[forcedbins_filename: ]
[precise_float_parser: 0]
[objective_seed: 5]
[num_class: 1]
[is_unbalance: 0]
[scale_pos_weight: 1]
[sigmoid: 1]
[boost_from_average: 1]
[reg_sqrt: 0]
[alpha: 0.9]
[fair_c: 1]
[poisson_max_delta_step: 0.7]
[tweedie_variance_power: 1.5]
[lambdarank_truncation_level: 30]
[lambdarank_norm: 1]
[label_gain: ]
[eval_at: ]
[multi_error_top_k: 1]
[auc_mu_weights: ]
[num_machines: 1]
[local_listen_port: 12400]
[time_out: 120]
[machine_list_filename: ]
[machines: ]
[gpu_platform_id: -1]
[gpu_device_id: -1]
[gpu_use_dp: 0]
[num_gpu: 1]

end of parameters

pandas_categorical:null

Collaborator Author:

@StrikerRUS
Did you run the test case from the latest commit of this PR?

I've tested v3.3.2 with the following test case:

def test_continue_train_different_feature_size(capsys):
    np.random.seed(0)
    train_X = np.hstack([np.ones(800).reshape(-1, 8), np.arange(200, 0, -1).reshape(-1, 2)])
    train_y = np.sum(train_X[:, -2:], axis=1)
    train_data = lgb.Dataset(train_X, label=train_y)
    params = {
        "objective": "regression",
        "num_trees": 10,
        "num_leaves": 31,
        "verbose": -1,
        'predict_disable_shape_check': True,
    }
    model = lgb.train(train_set=train_data, params=params)

    train_X_cont = np.random.rand(100, 5)
    train_y_cont = np.sum(train_X_cont, axis=1)
    train_data_cont = lgb.Dataset(train_X_cont, label=train_y_cont)
    params.update({"verbose": 2})
    lgb.train(train_set=train_data_cont, params=params, init_model=model)
    captured = capsys.readouterr()
    assert captured.out.find("features found in dataset for continual training, but at least") != -1

And the output is

FAILED test_engine.py::test_continue_train_different_feature_size - AssertionError: assert -1 != -1

which shows that the test case fails in v3.3.2.

Collaborator:

@shiyu1994 This test should segfault (or something like that, I don't remember exactly) in v3.3.2, but it doesn't.

The warning was added after this review thread was started, so it isn't what we are discussing here.

Collaborator Author:

@StrikerRUS Thanks. With the latest commit, it should now fail in v3.3.2: it will either segfault or raise an AssertionError at the line

assert captured.out.find("features found in dataset for continual training, but at least") != -1

in the latest version of the test case.

Collaborator:

With the latest commit, it should now fail in v3.3.2.

It still doesn't fail in v3.3.2.

Please use a fixed dataset, not a random one, so that the segfault is reproduced consistently in versions without the workaround from this PR.

train_data = lgb.Dataset(train_X, label=train_y)
params = {
"objective": "regression",
"num_trees": 10,
"num_leaves": 31,
"verbose": 2,
'predict_disable_shape_check': True,
Collaborator:

Why is it necessary to pass predict_disable_shape_check? This test isn't generating any predictions. Can this be removed?

Collaborator (StrikerRUS, May 11, 2022):

Ah, good observation! But predict() is called internally here when init scores are set from a previously trained model:

elif isinstance(init_model, Booster):
    predictor = init_model._to_predictor(dict(init_model.params, **params))

train_set._update_params(params) \
    ._set_predictor(predictor) \

if isinstance(predictor, _InnerPredictor):
    if self._predictor is None and init_score is not None:
        _log_warning("The init_score will be overridden by the prediction of init_model.")
    self._set_init_score_by_predictor(predictor, data)

def _set_init_score_by_predictor(self, predictor, data, used_indices=None):

    if predictor is not None:
        init_score = predictor.predict(data,
                                       raw_score=True,
                                       data_has_header=data_has_header)

Collaborator:

Ah, I see! So then is the bug addressed by this PR only present when you've also set predict_disable_shape_check = True? I expect that most LightGBM users will not be aware that that prediction parameter is relevant for training continuation.

Collaborator:

Yes, without setting predict_disable_shape_check = True users get the following error:

LightGBMError: The number of features in data (5) is not the same as it was in training data (10).
You can set ``predict_disable_shape_check=true`` to discard this error, but please be aware what you are doing.

I think the second sentence of the error, "You can set predict_disable_shape_check=true to discard this error ...", helps users understand this complicated relationship.
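As a hedged sketch of how that error surfaces during training continuation (the feature counts match those discussed in this thread; the snippet is illustrative and not copied from the PR's test):

import numpy as np
import lightgbm as lgb
from lightgbm.basic import LightGBMError

train_X = np.random.rand(100, 10)
init_model = lgb.train({"objective": "regression", "verbose": -1},
                       lgb.Dataset(train_X, label=train_X.sum(axis=1)))

cont_X = np.random.rand(100, 5)  # fewer features than the initial model saw
try:
    # init scores are computed internally via predictor.predict() on cont_X,
    # which triggers the shape check and the error quoted above
    lgb.train({"objective": "regression", "verbose": -1},
              lgb.Dataset(cont_X, label=cont_X.sum(axis=1)),
              init_model=init_model)
except LightGBMError as err:
    print(err)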

Collaborator Author:

predict_disable_shape_check is necessary because a numpy matrix is used as the dataset in the test case. If we used a LibSVM file as the dataset, we wouldn't need predict_disable_shape_check=True. I use a numpy matrix here because I think it is more convenient than LibSVM, in that it does not require generating files in test cases.
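A rough sketch of the LibSVM-file variant described above, assuming the data is written out with sklearn's dump_svmlight_file (the file names and the writer helper are assumptions made for illustration):

import numpy as np
import lightgbm as lgb
from sklearn.datasets import dump_svmlight_file

train_X = np.random.rand(100, 10)
dump_svmlight_file(train_X, train_X.sum(axis=1), "train_10_features.svm")
init_model = lgb.train({"objective": "regression", "verbose": -1},
                       lgb.Dataset("train_10_features.svm"))

cont_X = np.random.rand(100, 5)
dump_svmlight_file(cont_X, cont_X.sum(axis=1), "train_5_features.svm")
# with a file-based Dataset the init scores are predicted from the file itself,
# so predict_disable_shape_check=True is not needed even though the feature counts differ
lgb.train({"objective": "regression", "verbose": -1},
          lgb.Dataset("train_5_features.svm"), init_model=init_model)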

}
model = lgb.train(train_set=train_data, params=params)

train_X_cont = np.random.rand(100, 5)
train_y_cont = np.sum(train_X_cont, axis=1)
train_data_cont = lgb.Dataset(train_X_cont, label=train_y_cont)
lgb.train(train_set=train_data_cont, params=params, init_model=model)


def test_cv():
X_train, y_train = make_synthetic_regression()
params = {'verbose': -1}