Fix number of features mismatching in continual training (fix #5156) #5157

Open · wants to merge 14 commits into base: master
Conversation

shiyu1994
Collaborator

This is to fix #5156.

src/io/parser.cpp (outdated review thread, resolved)
@shiyu1994
Collaborator Author

I found that passing the number of features in the loaded model to the dataset constructors requires touching a lot of interfaces in Python/R/C++. I'll switch to a strategy that directly checks the length of feature_importance in gbdt_model_text.cpp, which should be much simpler.

@shiyu1994 shiyu1994 changed the title [WIP] Fix number of features mismatching in continual training (fix #5156) Fix number of features mismatching in continual training (fix #5156) Apr 21, 2022
@shiyu1994
Copy link
Collaborator Author

A branch v3.2.1-patched has been created based on v3.2.1 and includes this fix, because the fix is urgently needed by our product team.

@shiyu1994 shiyu1994 added the fix label Apr 21, 2022
@StrikerRUS
Collaborator

Could you please add a regression test for this issue?

@jameslamb
Collaborator

Because the fix is urgently needed by our product team.

This is an open source project, not a Microsoft commercial product. I feel strongly that choices we make in this repo should be based on what is best for the project's large community of users, not Microsoft's product priorities.

Instead of introducing a long-lived branch in this repo just for use by a product team at Microsoft, I think it would be preferable to just get this PR merged, and have whatever product team you're referring to pin to a specific commit of LightGBM in their builds.

If the fix is needed so urgently that that team cannot wait for it to go through the normal testing and review process, you should tell them to temporarily build from an internal fork of this project that contains this fix, and then later replace builds from that fork with builds pinned to a commit in this public repository once this PR is merged.

src/boosting/gbdt_model_text.cpp (outdated review thread, resolved)
@shiyu1994 shiyu1994 requested a review from jmoralez as a code owner May 9, 2022 04:28
@shiyu1994
Collaborator Author

Could you please add a regression test for this issue?

Done via bda5bdb

tests/python_package_test/test_engine.py (outdated review thread, resolved)
tests/python_package_test/test_engine.py (outdated review thread, resolved)
Comment on lines 970 to 972
np.random.seed(0)
train_X = np.random.randn(100, 10)
train_y = np.sum(train_X, axis=1)
Collaborator

Hmmm, strange. With version 3.3.2 this test doesn't fail on my local machine every time; it only fails sometimes.
How about using a concrete trivial dataset instead of a random one, something like

train_X = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
train_y = np.array([0, 1, 2])

?

Collaborator Author

A warning is now emitted, see https://github.com/microsoft/LightGBM/pull/5157/files#r868752144.
The test case just checks whether the warning message is present. Now it should always fail in 3.3.2.

Collaborator

Now it should always fail in 3.3.2.

That isn't true for me 😕
I can sometimes successfully run the test on my Windows machine with LightGBM 3.3.2 installed from PyPI.

Output:

[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.000000
[LightGBM] [Debug] init for col-wise cost 0.000004 seconds, init for row-wise cost 0.000180 seconds
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000209 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 175
[LightGBM] [Info] Number of data points in the train set: 100, number of used features: 5
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
D:\Miniconda3\lib\site-packages\lightgbm\engine.py:177: UserWarning: Found `num_trees` in params. Will use it instead of argument
  _log_warning(f"Found `{alias}` in params. Will use it instead of argument")
<lightgbm.basic.Booster at 0x2b6dc985760>

Collaborator Author

Sorry about that, and thanks for checking.
I've updated the test case so that the initial model is guaranteed to use feature 8 or 9, while the continual training dataset contains only 5 features. The test case should now always fail with previous versions.

Collaborator

Thank you very much for working on the test!
However, now this test never fails (it ran successfully 50 times in a row) on my machine with LightGBM 3.3.2 😄

Log:
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.000000
[LightGBM] [Debug] init for col-wise cost 0.000003 seconds, init for row-wise cost 0.000161 seconds
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000187 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 175
[LightGBM] [Info] Number of data points in the train set: 100, number of used features: 5
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
<lightgbm.basic.Booster at 0x1dadc11aa90>
Model:
tree
version=v3
num_class=1
num_tree_per_iteration=1
label_index=0
max_feature_idx=4
objective=regression
feature_names=Column_0 Column_1 Column_2 Column_3 Column_4
feature_infos=[0.015606064446828216:0.99033894739670436] [0.011714084185001972:0.99201124341447411] [0.01203622289765427:0.98837383805922618] [0.01323685775889949:0.99440078964767942] [0.0046954761925470656:0.99884700656786651]
tree_sizes=433 456 454 456 454 457 452 457 458 382 463 387 464 463 462 463 387 465 464 387

Tree=0
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=999600 132192 117600
threshold=103.00000000000001 55.000000000000007 151.00000000000003
decision_type=2 2 2
left_child=1 -1 -2
right_child=2 -3 -4
leaf_value=186.40000000000001 206.19999999999999 196.59999999999999 216
leaf_weight=27 24 24 25
leaf_count=27 24 24 25
internal_value=201 191.2 211.2
internal_weight=0 51 49
internal_count=100 51 49
is_linear=0
shrinkage=1


Tree=1
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=813246 114310 91614.6
threshold=97.000000000000014 145.00000000000003 43.000000000000007
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-14.339999643961589 3.6000000285605589 13.005000032697405 -5.5333332962459991
leaf_weight=21 24 28 27
leaf_count=21 24 28 27
internal_value=0 8.66423 -9.38625
internal_weight=0 52 48
internal_count=100 52 48
is_linear=0
shrinkage=0.1


Tree=2
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=661912 106022 66947.7
threshold=109.00000000000001 61.000000000000007 157.00000000000003
decision_type=2 2 2
left_child=1 -1 -2
right_child=2 -3 -4
leaf_value=-11.472199859619142 5.1623749971389774 -2.5549999723210934 12.799499893188477
leaf_weight=30 24 24 22
leaf_count=30 24 24 22
internal_value=0 -7.509 8.81491
internal_weight=0 54 46
internal_count=100 54 46
is_linear=0
shrinkage=0.1


Tree=3
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=541286 90395.6 51743.2
threshold=91.000000000000014 139.00000000000003 43.000000000000007
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-11.758779907226563 2.0473306971912582 10.221966577345327 -4.961771702766419
leaf_weight=21 24 31 24
leaf_count=21 24 31 24
internal_value=0 6.65485 -8.13371
internal_weight=0 55 45
internal_count=100 55 45
is_linear=0
shrinkage=0.1


Tree=4
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=439960 83077 36862.1
threshold=115.00000000000001 67.000000000000014 157.00000000000003
decision_type=2 2 2
left_child=1 -1 -2
right_child=2 -3 -4
leaf_value=-9.0168283057935312 4.6399736268179756 -1.2844118246187768 10.49735318964178
leaf_weight=33 21 24 22
leaf_count=33 21 24 22
internal_value=0 -5.76107 7.63677
internal_weight=0 57 43
internal_count=100 57 43
is_linear=0
shrinkage=0.1


Tree=5
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=359523 70011.3 29162.8
threshold=85.000000000000014 133.00000000000003 43.000000000000007
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-9.6812194460914256 0.9671219994624457 8.0214061737060547 -4.4111016545976911
leaf_weight=21 24 34 21
leaf_count=21 24 34 21
internal_value=0 5.10239 -7.04616
internal_weight=0 58 42
internal_count=100 58 42
is_linear=0
shrinkage=0.1


Tree=6
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=291965 51781 29631.8
threshold=91.000000000000014 157.00000000000003 49.000000000000007
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-8.374019781748455 2.382246724854816 8.645477294921875 -3.2303891863141745
leaf_weight=24 33 22 21
leaf_count=24 33 22 21
internal_value=0 4.88754 -5.97366
internal_weight=0 55 45
internal_count=100 55 45
is_linear=0
shrinkage=0.1


Tree=7
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=241666 44241.4 20265.2
threshold=115.00000000000001 49.000000000000007 157.00000000000003
decision_type=2 2 2
left_child=1 -1 -2
right_child=2 -3 -4
leaf_value=-7.5366175810496019 3.4379374277024044 -1.8938883086045584 7.7809295654296875
leaf_weight=24 21 33 22
leaf_count=24 21 33 22
internal_value=0 -4.26977 5.65993
internal_weight=0 57 43
internal_count=100 57 43
is_linear=0
shrinkage=0.1


Tree=8
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=196769 38913.2 15310.5
threshold=85.000000000000014 133.00000000000003 43.000000000000007
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-7.1220336823236385 0.69178846441209318 5.9509620666503906 -3.3034729072025848
leaf_weight=21 24 34 21
leaf_count=21 24 34 21
internal_value=0 3.77475 -5.21275
internal_weight=0 58 42
internal_count=100 58 42
is_linear=0
shrinkage=0.1


Tree=9
num_leaves=3
num_cat=0
split_feature=8 8
split_gain=160757 35733.3
threshold=121.00000000000001 73.000000000000014
decision_type=2 2
left_child=1 -1
right_child=-2 -3
leaf_value=-5.2662804179721405 4.9105544734001159 -0.28483681498716273
leaf_weight=36 40 24
leaf_count=36 40 24
internal_value=0 -3.2737
internal_weight=0 60
internal_count=100 60
is_linear=0
shrinkage=0.1


Tree=10
num_leaves=4
num_cat=0
split_feature=2 3 1
split_gain=13.801 10.6247 5.30422
threshold=0.39124460793750004 0.57209255545676052 0.51007976537911426
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-10.008088938395183 -9.9444599969046461 -9.8565416689272283 -9.9392709641229544
leaf_weight=24 28 27 21
leaf_count=24 28 27 21
internal_value=0 -9.9013 -9.97597
internal_weight=0 55 45
internal_count=100 55 45
is_linear=0
shrinkage=0.1


Tree=11
num_leaves=3
num_cat=0
split_feature=4 0
split_gain=11.5297 8.62398
threshold=0.67098217697241236 0.49642413305341432
decision_type=2 2
left_child=1 -1
right_child=-2 -3
leaf_value=-8.9968132420590052 -8.8941043404971847 -8.9236732210431793
leaf_weight=38 34 28
leaf_count=38 34 28
internal_value=0 -8.96578
internal_weight=0 66
internal_count=100 66
is_linear=0
shrinkage=0.1


Tree=12
num_leaves=4
num_cat=0
split_feature=1 0 3
split_gain=11.1402 5.70155 5.37138
threshold=0.4329138381487897 0.31488953069403935 0.48185106917748455
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-8.1240014648437491 -8.0604864211309515 -7.9952484955658791 -8.0523968436501256
leaf_weight=20 21 37 22
leaf_count=20 21 37 22
internal_value=0 -8.01887 -8.08649
internal_weight=0 58 42
internal_count=100 58 42
is_linear=0
shrinkage=0.1


Tree=13
num_leaves=4
num_cat=0
split_feature=2 3 1
split_gain=9.77513 6.994 3.37145
threshold=0.39124460793750004 0.57209255545676052 0.51007976537911426
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-7.3027134259541828 -7.2492815290178569 -7.1779496934678821 -7.2478478931245354
leaf_weight=24 28 27 21
leaf_count=24 28 27 21
internal_value=0 -7.21426 -7.27711
internal_weight=0 55 45
internal_count=100 55 45
is_linear=0
shrinkage=0.1


Tree=14
num_leaves=4
num_cat=0
split_feature=4 3 2
split_gain=8.72052 6.28816 2.9544
threshold=0.38168423736294915 0.39970451119142136 0.57474024719752059
decision_type=2 2 2
left_child=-1 -2 -3
right_child=1 2 -4
leaf_value=-6.5552221249311406 -6.540647659301758 -6.499760017395019 -6.4460565839494981
leaf_weight=39 20 20 21
leaf_count=39 20 20 21
internal_value=0 -6.49468 -6.47225
internal_weight=0 61 41
internal_count=100 61 41
is_linear=0
shrinkage=0.1


Tree=15
num_leaves=4
num_cat=0
split_feature=1 0 3
split_gain=7.48096 4.12722 3.46943
threshold=0.4329138381487897 0.39974028150985036 0.48185106917748455
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-5.9287466812133793 -5.8698616948621023 -5.8165104438518656 -5.8711990703235983
leaf_weight=20 29 29 22
leaf_count=20 29 29 22
internal_value=0 -5.84319 -5.8986
internal_weight=0 58 42
internal_count=100 58 42
is_linear=0
shrinkage=0.1


Tree=16
num_leaves=3
num_cat=0
split_feature=4 0
split_gain=7.09244 4.97742
threshold=0.67098217697241236 0.49642413305341432
decision_type=2 2
left_child=1 -1
right_child=-2 -3
leaf_value=-5.3225026983963817 -5.2427101247450887 -5.2669374602181573
leaf_weight=38 34 28
leaf_count=38 34 28
internal_value=0 -5.29893
internal_weight=0 66
internal_count=100 66
is_linear=0
shrinkage=0.1


Tree=17
num_leaves=4
num_cat=0
split_feature=2 3 0
split_gain=6.14628 3.95565 2.09787
threshold=0.39124460793750004 0.57209255545676052 0.31488953069403935
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-4.8033817863464359 -4.7557433400835309 -4.7020983519377531 -4.7599296722412108
leaf_weight=20 28 27 25
leaf_count=20 28 27 25
internal_value=0 -4.72941 -4.77924
internal_weight=0 55 45
internal_count=100 55 45
is_linear=0
shrinkage=0.1


Tree=18
num_leaves=4
num_cat=0
split_feature=1 4 3
split_gain=5.78174 3.21241 2.42224
threshold=0.4329138381487897 0.52674300595287049 0.48185106917748455
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-4.3300938415527339 -4.2781520597396359 -4.2309710891158501 -4.2820091941139919
leaf_weight=20 31 27 22
leaf_count=20 31 27 22
internal_value=0 -4.25619 -4.30491
internal_weight=0 58 42
internal_count=100 58 42
is_linear=0
shrinkage=0.1


Tree=19
num_leaves=3
num_cat=0
split_feature=4 2
split_gain=5.06465 3.34179
threshold=0.38168423736294915 0.60984868647320056
decision_type=2 2
left_child=-1 -2
right_child=1 -3
leaf_value=-3.8771305084228516 -3.8485698895576679 -3.7998270381580705
leaf_weight=39 39 22
leaf_count=39 39 22
internal_value=0 -3.83099
internal_weight=0 61
internal_count=100 61
is_linear=0
shrinkage=0.1


end of trees

feature_importances:
Column_3=7
Column_0=5
Column_1=5
Column_2=5
Column_4=5

parameters:
[boosting: gbdt]
[objective: regression]
[metric: l2]
[tree_learner: serial]
[device_type: cpu]
[data: ]
[valid: ]
[num_iterations: 10]
[learning_rate: 0.1]
[num_leaves: 31]
[num_threads: 0]
[deterministic: 0]
[force_col_wise: 0]
[force_row_wise: 0]
[histogram_pool_size: -1]
[max_depth: -1]
[min_data_in_leaf: 20]
[min_sum_hessian_in_leaf: 0.001]
[bagging_fraction: 1]
[pos_bagging_fraction: 1]
[neg_bagging_fraction: 1]
[bagging_freq: 0]
[bagging_seed: 3]
[feature_fraction: 1]
[feature_fraction_bynode: 1]
[feature_fraction_seed: 2]
[extra_trees: 0]
[extra_seed: 6]
[early_stopping_round: 0]
[first_metric_only: 0]
[max_delta_step: 0]
[lambda_l1: 0]
[lambda_l2: 0]
[linear_lambda: 0]
[min_gain_to_split: 0]
[drop_rate: 0.1]
[max_drop: 50]
[skip_drop: 0.5]
[xgboost_dart_mode: 0]
[uniform_drop: 0]
[drop_seed: 4]
[top_rate: 0.2]
[other_rate: 0.1]
[min_data_per_group: 100]
[max_cat_threshold: 32]
[cat_l2: 10]
[cat_smooth: 10]
[max_cat_to_onehot: 4]
[top_k: 20]
[monotone_constraints: ]
[monotone_constraints_method: basic]
[monotone_penalty: 0]
[feature_contri: ]
[forcedsplits_filename: ]
[refit_decay_rate: 0.9]
[cegb_tradeoff: 1]
[cegb_penalty_split: 0]
[cegb_penalty_feature_lazy: ]
[cegb_penalty_feature_coupled: ]
[path_smooth: 0]
[interaction_constraints: ]
[verbosity: 2]
[saved_feature_importance_type: 0]
[linear_tree: 0]
[max_bin: 255]
[max_bin_by_feature: ]
[min_data_in_bin: 3]
[bin_construct_sample_cnt: 200000]
[data_random_seed: 1]
[is_enable_sparse: 1]
[enable_bundle: 1]
[use_missing: 1]
[zero_as_missing: 0]
[feature_pre_filter: 1]
[pre_partition: 0]
[two_round: 0]
[header: 0]
[label_column: ]
[weight_column: ]
[group_column: ]
[ignore_column: ]
[categorical_feature: ]
[forcedbins_filename: ]
[precise_float_parser: 0]
[objective_seed: 5]
[num_class: 1]
[is_unbalance: 0]
[scale_pos_weight: 1]
[sigmoid: 1]
[boost_from_average: 1]
[reg_sqrt: 0]
[alpha: 0.9]
[fair_c: 1]
[poisson_max_delta_step: 0.7]
[tweedie_variance_power: 1.5]
[lambdarank_truncation_level: 30]
[lambdarank_norm: 1]
[label_gain: ]
[eval_at: ]
[multi_error_top_k: 1]
[auc_mu_weights: ]
[num_machines: 1]
[local_listen_port: 12400]
[time_out: 120]
[machine_list_filename: ]
[machines: ]
[gpu_platform_id: -1]
[gpu_device_id: -1]
[gpu_use_dp: 0]
[num_gpu: 1]

end of parameters

pandas_categorical:null

Collaborator Author

@StrikerRUS
Did you run the test case from the latest commit of this PR?

I've tested v3.3.2 with the following test case:

def test_continue_train_different_feature_size(capsys):
    np.random.seed(0)
    train_X = np.hstack([np.ones(800).reshape(-1, 8), np.arange(200, 0, -1).reshape(-1, 2)])
    train_y = np.sum(train_X[:, -2:], axis=1)
    train_data = lgb.Dataset(train_X, label=train_y)
    params = {
        "objective": "regression",
        "num_trees": 10,
        "num_leaves": 31,
        "verbose": -1,
        'predict_disable_shape_check': True,
    }
    model = lgb.train(train_set=train_data, params=params)

    train_X_cont = np.random.rand(100, 5)
    train_y_cont = np.sum(train_X_cont, axis=1)
    train_data_cont = lgb.Dataset(train_X_cont, label=train_y_cont)
    params.update({"verbose": 2})
    lgb.train(train_set=train_data_cont, params=params, init_model=model)
    captured = capsys.readouterr()
    assert captured.out.find("features found in dataset for continual training, but at least") != -1

And the output is

FAILED test_engine.py::test_continue_train_different_feature_size - AssertionError: assert -1 != -1

which shows that the test case fails in v3.3.2.

Collaborator

@shiyu1994 This test should segfault (or something like that, I don't remember exactly) in v3.3.2, but it doesn't.

The warning was added after this review thread was started, so it isn't what we're discussing here.

Collaborator Author

@StrikerRUS Thanks. With the latest commit, it should now fail in v3.3.2. It will either segfault, or the AssertionError will be raised by the line

assert captured.out.find("features found in dataset for continual training, but at least") != -1

in the latest version of the test case.

Collaborator

With the latest commit, it should now fail in v3.3.2.

It still doesn't fail in v3.3.2.

Please use a fixed dataset, not a random one, so that the segfault reproduces consistently in versions without the fix from this PR.

#endif
feature_importances[models_[iter]->split_feature(split_idx)] += 1.0;
if (static_cast<size_t>(real_feature_index) >= feature_importances.size()) {
warn_about_feature_number = true;
Collaborator Author

Provide a warning instead of enlarging the feature importance vector. Since GBDT::FeatureImportance is used by the Python API, and the Python code doesn't know the final size of the feature importance vector, dynamically enlarging the vector would cause out-of-bound access.
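
For context, here is a rough sketch of the caller-allocated-buffer pattern the Python wrapper relies on (simplified from memory, not the actual LightGBM source; function and variable names are approximate). It illustrates why the C++ vector has to keep the size the Python side expects:

import ctypes

import numpy as np

def feature_importance_sketch(lib, booster_handle, num_features):
    # The Python side pre-allocates the output buffer from the booster's own
    # feature count before asking the C++ side to fill it.
    result = np.empty(num_features, dtype=np.float64)
    # If GBDT::FeatureImportance silently enlarged its internal vector beyond
    # num_features, copying it into `result` would write past this buffer.
    lib.LGBM_BoosterFeatureImportance(
        booster_handle,
        ctypes.c_int(0),  # num_iteration: 0 means use all iterations
        ctypes.c_int(0),  # importance_type: 0 means "split"
        result.ctypes.data_as(ctypes.POINTER(ctypes.c_double)))
    return result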

@shiyu1994 shiyu1994 requested a review from StrikerRUS May 10, 2022 02:20
Collaborator

@jameslamb left a comment

I'm very confused by these changes, sorry.

My understanding of what was written in #5156 is that, given an initial model trained on a Dataset with n features, if you attempt to perform continued training on a Dataset with m features (m < n), LightGBM will not complain, but there may be some out-of-bound access.

I also believe that #5156 mentions LibSVM files as an example because construction of a Dataset from a file might drop unsplittable columns (like those with all missing values) if feature_pre_filter=True.

Is my interpretation correct? If it is, that'll help me review this more effectively.

"num_trees": 10,
"num_leaves": 31,
"verbose": -1,
'predict_disable_shape_check': True,
Collaborator

Why is it necessary to pass predict_disable_shape_check? This test isn't generating any predictions. Can this be removed?

Collaborator

@StrikerRUS commented May 11, 2022

Ah, good observation! But predict() is called internally here when init scores are set from a previously trained model:

elif isinstance(init_model, Booster):
    predictor = init_model._to_predictor(dict(init_model.params, **params))

train_set._update_params(params) \
         ._set_predictor(predictor) \

if isinstance(predictor, _InnerPredictor):
    if self._predictor is None and init_score is not None:
        _log_warning("The init_score will be overridden by the prediction of init_model.")
    self._set_init_score_by_predictor(predictor, data)

def _set_init_score_by_predictor(self, predictor, data, used_indices=None):

    if predictor is not None:
        init_score = predictor.predict(data,
                                       raw_score=True,
                                       data_has_header=data_has_header)

Collaborator

Ah, I see! So then is the bug addressed by this PR only present when you've also set predict_disable_shape_check = True? I expect that most LightGBM users will not be aware that that prediction parameter is relevant for training continuation.

Collaborator

Yes, without setting predict_disable_shape_check = True users get the following error:

LightGBMError: The number of features in data (5) is not the same as it was in training data (10).
You can set ``predict_disable_shape_check=true`` to discard this error, but please be aware what you are doing.

I think the second sentence of the error, "You can set predict_disable_shape_check=true to discard this error ...", helps users understand this complicated relationship.
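
For illustration, here is a minimal sketch of how that error shows up (assuming model is the 10-feature booster trained in the test case above; the variable is otherwise hypothetical):

import numpy as np
import lightgbm as lgb

train_X_cont = np.random.rand(100, 5)  # fewer features than the initial model was trained on
train_data_cont = lgb.Dataset(train_X_cont, label=train_X_cont.sum(axis=1))
try:
    # init scores are computed by predicting with `model`, which triggers the shape check
    lgb.train(params={"objective": "regression"}, train_set=train_data_cont, init_model=model)
except lgb.basic.LightGBMError as err:
    print(err)  # "The number of features in data (5) is not the same as it was in training data (10). ..."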

Collaborator Author

predict_disable_shape_check is necessary because a numpy matrix is used in the test case. If we used a LibSVM file as the dataset in the test case, we wouldn't need predict_disable_shape_check=True. I use a numpy matrix here because I think it is more convenient than LibSVM, in that it does not require generating files in test cases.
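
For reference, here is a rough sketch of the LibSVM-file alternative (the file name is illustrative and this is not part of this PR's test; model is assumed to be the previously trained 10-feature booster). Per the discussion above, loading the continuation dataset from a file bypasses the in-memory shape check, so predict_disable_shape_check=True would not be needed:

import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
with open("train_cont.svm", "w") as f:  # hypothetical temporary file in LibSVM format
    for _ in range(100):
        x = rng.random(5)  # only 5 features, fewer than in the initial model
        features = " ".join(f"{i + 1}:{v:.6f}" for i, v in enumerate(x))
        f.write(f"{x.sum():.6f} {features}\n")

train_data_cont = lgb.Dataset("train_cont.svm")
lgb.train(params={"objective": "regression", "verbose": 2}, train_set=train_data_cont, init_model=model)

The trade-off is exactly the one mentioned above: the file-based variant avoids predict_disable_shape_check but requires writing a temporary file in the test.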

}
}
}
} else {
Log::Fatal("Unknown importance type: only support split=0 and gain=1");
}
if (warn_about_feature_number) {
Log::Warning("Only %d features found in dataset for continual training, but at least %d features found in initial model.",
Collaborator

Is it possible to add some information here that can help users understand what they should do about this warning? Or is this warning just being raised for the purpose of the unit test?

Collaborator Author

I've added

Log::Warning("Please check the number of features used in continual training.");

in d719d78.

Do you think it is appropriate?

Collaborator

Thanks! But I don't think "please check" will be any more informative for users, and I think that providing two separate WARN-level log messages for the same problem might be confusing.

Would you consider this?

Log::Warning(
    "Only %d features found in dataset, but at least %d features found in initial model. "
    "If this is intentional and the features in dataset match those used to train the initial model, no action is required. "
    "If not, training continuation may fail or produce bad results. "
    "To suppress this warning, provide a dataset with at least %d features.",
    static_cast<int>(feature_importances.size()),
    max_feature_index_found + 1,
    max_feature_index_found + 1);

@shiyu1994
Collaborator Author

I also believe that #5156 mentions LibSVM files as an example because construction of a Dataset from a file might drop unsplittable columns (like those with all missing values) if feature_pre_filter=True.

Here we only focus on the original features, not the features preprocessed by LightGBM. This fixes cases like the following: first, a model A is trained with m features; then the user erroneously drops some features when continuing training from model A. So this has nothing to do with feature_pre_filter.

@shiyu1994 shiyu1994 requested a review from jameslamb June 1, 2022 14:16
@StrikerRUS
Collaborator

Just a reminder: merge-blocking thread is #5157 (comment).

@jameslamb
Collaborator

Ah, ok, thanks @shiyu1994, now I understand. I guess the phrase "When the input dataset is from LibSVM file" was used in #5156 because loading from a file bypasses predict_disable_shape_check? #5157 (comment)

@shiyu1994
Collaborator Author

@jameslamb

I guess the phrase "When the input dataset is from LibSVM file" was used in #5156 because loading from a file bypasses predict_disable_shape_check? #5157 (comment)

Exactly.

@shiyu1994
Collaborator Author

It seems the QEMU test fails due to environment issues.

@shiyu1994
Collaborator Author

Closing and reopening to retrigger CI tests.

Collaborator

@StrikerRUS left a comment

Just to prevent accidental merge.
#5157 (comment)

@shiyu1994
Collaborator Author

@guolinke @jameslamb @StrikerRUS Could you please help to verify again whether the latest test case fails with v3.3.2? I've tested on my Linux machine and it fails every time. Note that now we use a fixed dataset, without any randomness. Thanks!

@StrikerRUS
Collaborator

Could you please help to verify again whether the latest test case fails with v3.3.2?

Just checked again, the latest test doesn't fail on my Windows machine with v3.3.2.

@jameslamb
Collaborator

@shiyu1994 could you update this PR? Let me know if you need help with any part of it.

Successfully merging this pull request may close these issues:

Out of bound access when dataset in continual training has fewer features than in the loaded model