Fix number of features mismatching in continual training (fix #5156) #5157

Open · wants to merge 14 commits into base: master

Changes from 5 commits
36 changes: 30 additions & 6 deletions src/boosting/gbdt_model_text.cpp
@@ -104,7 +104,15 @@ std::string GBDT::DumpModel(int start_iteration, int num_iteration, int feature_
for (size_t i = 0; i < feature_importances.size(); ++i) {
size_t feature_importances_int = static_cast<size_t>(feature_importances[i]);
if (feature_importances_int > 0) {
pairs.emplace_back(feature_importances_int, feature_names_[i]);
if (i < feature_names_.size()) {
pairs.emplace_back(feature_importances_int, feature_names_[i]);
} else {
// with LibSVM format and continual training, the number of features in the dataset can be fewer than in the initial model
// in that case FeatureImportance returns a vector sized to the number of features in the initial model
std::stringstream str_buf;
str_buf << "Column_" << i << "_from_init_model";
pairs.emplace_back(feature_importances_int, str_buf.str());
}
}
}
str_buf << '\n' << "\"feature_importances\":" << "{";
@@ -377,7 +385,15 @@ std::string GBDT::SaveModelToString(int start_iteration, int num_iteration, int
for (size_t i = 0; i < feature_importances.size(); ++i) {
size_t feature_importances_int = static_cast<size_t>(feature_importances[i]);
if (feature_importances_int > 0) {
pairs.emplace_back(feature_importances_int, feature_names_[i]);
if (i < feature_names_.size()) {
pairs.emplace_back(feature_importances_int, feature_names_[i]);
} else {
// with LibSVM format and continual training, the number of features in the dataset can be fewer than in the initial model
// in that case FeatureImportance returns a vector sized to the number of features in the initial model
std::stringstream str_buf;
str_buf << "Column_" << i << "_from_init_model";
pairs.emplace_back(feature_importances_int, str_buf.str());
}
}
}
// sort the importance
@@ -636,21 +652,29 @@ std::vector<double> GBDT::FeatureImportance(int num_iteration, int importance_ty
for (int iter = 0; iter < num_used_model; ++iter) {
for (int split_idx = 0; split_idx < models_[iter]->num_leaves() - 1; ++split_idx) {
if (models_[iter]->split_gain(split_idx) > 0) {
const int real_feature_index = models_[iter]->split_feature(split_idx);
#ifdef DEBUG
CHECK_GE(models_[iter]->split_feature(split_idx), 0);
CHECK_GE(real_feature_index, 0);
#endif
feature_importances[models_[iter]->split_feature(split_idx)] += 1.0;
if (static_cast<size_t>(real_feature_index) >= feature_importances.size()) {
feature_importances.resize(real_feature_index + 1);
}
feature_importances[real_feature_index] += 1.0;
}
}
}
} else if (importance_type == 1) {
for (int iter = 0; iter < num_used_model; ++iter) {
for (int split_idx = 0; split_idx < models_[iter]->num_leaves() - 1; ++split_idx) {
if (models_[iter]->split_gain(split_idx) > 0) {
const int real_feature_index = models_[iter]->split_feature(split_idx);
#ifdef DEBUG
CHECK_GE(models_[iter]->split_feature(split_idx), 0);
CHECK_GE(real_feature_index, 0);
#endif
feature_importances[models_[iter]->split_feature(split_idx)] += models_[iter]->split_gain(split_idx);
if (static_cast<size_t>(real_feature_index) >= feature_importances.size()) {
feature_importances.resize(real_feature_index + 1);
}
feature_importances[real_feature_index] += models_[iter]->split_gain(split_idx);
}
}
}
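To illustrate the behavior the change above implements, here is a minimal sketch using the numpy-based setup discussed in the review thread below; parameter values are illustrative, and which placeholder names actually appear in the dump depends on the features the initial model happened to split on:

import numpy as np
import lightgbm as lgb

# initial model trained on 10 features; its trees may split on any of columns 0-9
rng = np.random.default_rng(0)
train_X = rng.random((100, 10))
train_y = train_X.sum(axis=1)
params = {"objective": "regression", "num_trees": 10, "verbose": -1,
          "predict_disable_shape_check": True}
init_model = lgb.train(params, lgb.Dataset(train_X, label=train_y))

# continue training on a dataset that has only 5 features
cont_X = rng.random((100, 5))
cont_y = cont_X.sum(axis=1)
booster = lgb.train(params, lgb.Dataset(cont_X, label=cont_y), init_model=init_model)

# the dumped model now uses a placeholder name for split features inherited from the
# initial model that lie beyond the new dataset's feature names, e.g. "Column_8_from_init_model"
model_str = booster.model_to_string()
print(model_str[model_str.find("feature_importances:"):])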
20 changes: 20 additions & 0 deletions tests/python_package_test/test_engine.py
@@ -966,6 +966,26 @@ def test_continue_train_multiclass():
assert evals_result['valid_0']['multi_logloss'][-1] == pytest.approx(ret)


def test_continue_train_different_feature_size():
np.random.seed(0)
train_X = np.random.randn(100, 10)
train_y = np.sum(train_X, axis=1)
Collaborator:

Hmmm, strange. With version 3.3.2 this test doesn't fail on my local machine every time; it only fails sometimes.
How about using a concrete trivial dataset instead of a random one, something like

train_X = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
train_y = np.array([0, 1, 2])

?

Collaborator Author:

Now a warning is provided, see https://github.com/microsoft/LightGBM/pull/5157/files#r868752144.
The test case now just checks whether that warning message is emitted, so it should always fail in 3.3.2.

Collaborator:

Now it should always fail in 3.3.2.

That isn't true for me 😕
I can sometimes successfully run the test on my Windows machine with LightGBM 3.3.2 installed from PyPI.

Output:

[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.000000
[LightGBM] [Debug] init for col-wise cost 0.000004 seconds, init for row-wise cost 0.000180 seconds
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000209 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 175
[LightGBM] [Info] Number of data points in the train set: 100, number of used features: 5
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
D:\Miniconda3\lib\site-packages\lightgbm\engine.py:177: UserWarning: Found `num_trees` in params. Will use it instead of argument
  _log_warning(f"Found `{alias}` in params. Will use it instead of argument")
<lightgbm.basic.Booster at 0x2b6dc985760>

Collaborator Author:

Sorry about that and thanks for checking.
I've updated the test case so that the initial model is guaranteed to split on feature 8 or 9, while the continual-training dataset contains only 5 features. Now the test case should always fail in previous versions.
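For reference, a minimal sketch of the construction described above, where the leading columns are constant so every positive-gain split in the initial model must land on feature 8 or 9 (this mirrors the test code posted further down in this thread; exact shapes and values are illustrative):

import numpy as np
import lightgbm as lgb

# columns 0-7 are constant and carry no signal; only columns 8 and 9 determine the target,
# so any split with positive gain in the initial model must use feature 8 or 9
train_X = np.hstack([np.ones((100, 8)), np.arange(200, 0, -1).reshape(-1, 2)])
train_y = train_X[:, -2:].sum(axis=1)
init_model = lgb.train({"objective": "regression", "num_trees": 10, "verbose": -1},
                       lgb.Dataset(train_X, label=train_y))

# the continual-training dataset then contains only 5 features (columns 0-4)
train_X_cont = np.random.rand(100, 5)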

Collaborator:

Thank you very much for working on the test!
However, now this test never fails (was run successfully 50 times in a row) on my machine with LightGBM 3.3.2 😄

Log:
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.000000
[LightGBM] [Debug] init for col-wise cost 0.000003 seconds, init for row-wise cost 0.000161 seconds
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000187 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 175
[LightGBM] [Info] Number of data points in the train set: 100, number of used features: 5
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
<lightgbm.basic.Booster at 0x1dadc11aa90>
Model:
tree
version=v3
num_class=1
num_tree_per_iteration=1
label_index=0
max_feature_idx=4
objective=regression
feature_names=Column_0 Column_1 Column_2 Column_3 Column_4
feature_infos=[0.015606064446828216:0.99033894739670436] [0.011714084185001972:0.99201124341447411] [0.01203622289765427:0.98837383805922618] [0.01323685775889949:0.99440078964767942] [0.0046954761925470656:0.99884700656786651]
tree_sizes=433 456 454 456 454 457 452 457 458 382 463 387 464 463 462 463 387 465 464 387

Tree=0
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=999600 132192 117600
threshold=103.00000000000001 55.000000000000007 151.00000000000003
decision_type=2 2 2
left_child=1 -1 -2
right_child=2 -3 -4
leaf_value=186.40000000000001 206.19999999999999 196.59999999999999 216
leaf_weight=27 24 24 25
leaf_count=27 24 24 25
internal_value=201 191.2 211.2
internal_weight=0 51 49
internal_count=100 51 49
is_linear=0
shrinkage=1


Tree=1
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=813246 114310 91614.6
threshold=97.000000000000014 145.00000000000003 43.000000000000007
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-14.339999643961589 3.6000000285605589 13.005000032697405 -5.5333332962459991
leaf_weight=21 24 28 27
leaf_count=21 24 28 27
internal_value=0 8.66423 -9.38625
internal_weight=0 52 48
internal_count=100 52 48
is_linear=0
shrinkage=0.1


Tree=2
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=661912 106022 66947.7
threshold=109.00000000000001 61.000000000000007 157.00000000000003
decision_type=2 2 2
left_child=1 -1 -2
right_child=2 -3 -4
leaf_value=-11.472199859619142 5.1623749971389774 -2.5549999723210934 12.799499893188477
leaf_weight=30 24 24 22
leaf_count=30 24 24 22
internal_value=0 -7.509 8.81491
internal_weight=0 54 46
internal_count=100 54 46
is_linear=0
shrinkage=0.1


Tree=3
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=541286 90395.6 51743.2
threshold=91.000000000000014 139.00000000000003 43.000000000000007
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-11.758779907226563 2.0473306971912582 10.221966577345327 -4.961771702766419
leaf_weight=21 24 31 24
leaf_count=21 24 31 24
internal_value=0 6.65485 -8.13371
internal_weight=0 55 45
internal_count=100 55 45
is_linear=0
shrinkage=0.1


Tree=4
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=439960 83077 36862.1
threshold=115.00000000000001 67.000000000000014 157.00000000000003
decision_type=2 2 2
left_child=1 -1 -2
right_child=2 -3 -4
leaf_value=-9.0168283057935312 4.6399736268179756 -1.2844118246187768 10.49735318964178
leaf_weight=33 21 24 22
leaf_count=33 21 24 22
internal_value=0 -5.76107 7.63677
internal_weight=0 57 43
internal_count=100 57 43
is_linear=0
shrinkage=0.1


Tree=5
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=359523 70011.3 29162.8
threshold=85.000000000000014 133.00000000000003 43.000000000000007
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-9.6812194460914256 0.9671219994624457 8.0214061737060547 -4.4111016545976911
leaf_weight=21 24 34 21
leaf_count=21 24 34 21
internal_value=0 5.10239 -7.04616
internal_weight=0 58 42
internal_count=100 58 42
is_linear=0
shrinkage=0.1


Tree=6
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=291965 51781 29631.8
threshold=91.000000000000014 157.00000000000003 49.000000000000007
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-8.374019781748455 2.382246724854816 8.645477294921875 -3.2303891863141745
leaf_weight=24 33 22 21
leaf_count=24 33 22 21
internal_value=0 4.88754 -5.97366
internal_weight=0 55 45
internal_count=100 55 45
is_linear=0
shrinkage=0.1


Tree=7
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=241666 44241.4 20265.2
threshold=115.00000000000001 49.000000000000007 157.00000000000003
decision_type=2 2 2
left_child=1 -1 -2
right_child=2 -3 -4
leaf_value=-7.5366175810496019 3.4379374277024044 -1.8938883086045584 7.7809295654296875
leaf_weight=24 21 33 22
leaf_count=24 21 33 22
internal_value=0 -4.26977 5.65993
internal_weight=0 57 43
internal_count=100 57 43
is_linear=0
shrinkage=0.1


Tree=8
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=196769 38913.2 15310.5
threshold=85.000000000000014 133.00000000000003 43.000000000000007
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-7.1220336823236385 0.69178846441209318 5.9509620666503906 -3.3034729072025848
leaf_weight=21 24 34 21
leaf_count=21 24 34 21
internal_value=0 3.77475 -5.21275
internal_weight=0 58 42
internal_count=100 58 42
is_linear=0
shrinkage=0.1


Tree=9
num_leaves=3
num_cat=0
split_feature=8 8
split_gain=160757 35733.3
threshold=121.00000000000001 73.000000000000014
decision_type=2 2
left_child=1 -1
right_child=-2 -3
leaf_value=-5.2662804179721405 4.9105544734001159 -0.28483681498716273
leaf_weight=36 40 24
leaf_count=36 40 24
internal_value=0 -3.2737
internal_weight=0 60
internal_count=100 60
is_linear=0
shrinkage=0.1


Tree=10
num_leaves=4
num_cat=0
split_feature=2 3 1
split_gain=13.801 10.6247 5.30422
threshold=0.39124460793750004 0.57209255545676052 0.51007976537911426
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-10.008088938395183 -9.9444599969046461 -9.8565416689272283 -9.9392709641229544
leaf_weight=24 28 27 21
leaf_count=24 28 27 21
internal_value=0 -9.9013 -9.97597
internal_weight=0 55 45
internal_count=100 55 45
is_linear=0
shrinkage=0.1


Tree=11
num_leaves=3
num_cat=0
split_feature=4 0
split_gain=11.5297 8.62398
threshold=0.67098217697241236 0.49642413305341432
decision_type=2 2
left_child=1 -1
right_child=-2 -3
leaf_value=-8.9968132420590052 -8.8941043404971847 -8.9236732210431793
leaf_weight=38 34 28
leaf_count=38 34 28
internal_value=0 -8.96578
internal_weight=0 66
internal_count=100 66
is_linear=0
shrinkage=0.1


Tree=12
num_leaves=4
num_cat=0
split_feature=1 0 3
split_gain=11.1402 5.70155 5.37138
threshold=0.4329138381487897 0.31488953069403935 0.48185106917748455
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-8.1240014648437491 -8.0604864211309515 -7.9952484955658791 -8.0523968436501256
leaf_weight=20 21 37 22
leaf_count=20 21 37 22
internal_value=0 -8.01887 -8.08649
internal_weight=0 58 42
internal_count=100 58 42
is_linear=0
shrinkage=0.1


Tree=13
num_leaves=4
num_cat=0
split_feature=2 3 1
split_gain=9.77513 6.994 3.37145
threshold=0.39124460793750004 0.57209255545676052 0.51007976537911426
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-7.3027134259541828 -7.2492815290178569 -7.1779496934678821 -7.2478478931245354
leaf_weight=24 28 27 21
leaf_count=24 28 27 21
internal_value=0 -7.21426 -7.27711
internal_weight=0 55 45
internal_count=100 55 45
is_linear=0
shrinkage=0.1


Tree=14
num_leaves=4
num_cat=0
split_feature=4 3 2
split_gain=8.72052 6.28816 2.9544
threshold=0.38168423736294915 0.39970451119142136 0.57474024719752059
decision_type=2 2 2
left_child=-1 -2 -3
right_child=1 2 -4
leaf_value=-6.5552221249311406 -6.540647659301758 -6.499760017395019 -6.4460565839494981
leaf_weight=39 20 20 21
leaf_count=39 20 20 21
internal_value=0 -6.49468 -6.47225
internal_weight=0 61 41
internal_count=100 61 41
is_linear=0
shrinkage=0.1


Tree=15
num_leaves=4
num_cat=0
split_feature=1 0 3
split_gain=7.48096 4.12722 3.46943
threshold=0.4329138381487897 0.39974028150985036 0.48185106917748455
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-5.9287466812133793 -5.8698616948621023 -5.8165104438518656 -5.8711990703235983
leaf_weight=20 29 29 22
leaf_count=20 29 29 22
internal_value=0 -5.84319 -5.8986
internal_weight=0 58 42
internal_count=100 58 42
is_linear=0
shrinkage=0.1


Tree=16
num_leaves=3
num_cat=0
split_feature=4 0
split_gain=7.09244 4.97742
threshold=0.67098217697241236 0.49642413305341432
decision_type=2 2
left_child=1 -1
right_child=-2 -3
leaf_value=-5.3225026983963817 -5.2427101247450887 -5.2669374602181573
leaf_weight=38 34 28
leaf_count=38 34 28
internal_value=0 -5.29893
internal_weight=0 66
internal_count=100 66
is_linear=0
shrinkage=0.1


Tree=17
num_leaves=4
num_cat=0
split_feature=2 3 0
split_gain=6.14628 3.95565 2.09787
threshold=0.39124460793750004 0.57209255545676052 0.31488953069403935
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-4.8033817863464359 -4.7557433400835309 -4.7020983519377531 -4.7599296722412108
leaf_weight=20 28 27 25
leaf_count=20 28 27 25
internal_value=0 -4.72941 -4.77924
internal_weight=0 55 45
internal_count=100 55 45
is_linear=0
shrinkage=0.1


Tree=18
num_leaves=4
num_cat=0
split_feature=1 4 3
split_gain=5.78174 3.21241 2.42224
threshold=0.4329138381487897 0.52674300595287049 0.48185106917748455
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-4.3300938415527339 -4.2781520597396359 -4.2309710891158501 -4.2820091941139919
leaf_weight=20 31 27 22
leaf_count=20 31 27 22
internal_value=0 -4.25619 -4.30491
internal_weight=0 58 42
internal_count=100 58 42
is_linear=0
shrinkage=0.1


Tree=19
num_leaves=3
num_cat=0
split_feature=4 2
split_gain=5.06465 3.34179
threshold=0.38168423736294915 0.60984868647320056
decision_type=2 2
left_child=-1 -2
right_child=1 -3
leaf_value=-3.8771305084228516 -3.8485698895576679 -3.7998270381580705
leaf_weight=39 39 22
leaf_count=39 39 22
internal_value=0 -3.83099
internal_weight=0 61
internal_count=100 61
is_linear=0
shrinkage=0.1


end of trees

feature_importances:
Column_3=7
Column_0=5
Column_1=5
Column_2=5
Column_4=5

parameters:
[boosting: gbdt]
[objective: regression]
[metric: l2]
[tree_learner: serial]
[device_type: cpu]
[data: ]
[valid: ]
[num_iterations: 10]
[learning_rate: 0.1]
[num_leaves: 31]
[num_threads: 0]
[deterministic: 0]
[force_col_wise: 0]
[force_row_wise: 0]
[histogram_pool_size: -1]
[max_depth: -1]
[min_data_in_leaf: 20]
[min_sum_hessian_in_leaf: 0.001]
[bagging_fraction: 1]
[pos_bagging_fraction: 1]
[neg_bagging_fraction: 1]
[bagging_freq: 0]
[bagging_seed: 3]
[feature_fraction: 1]
[feature_fraction_bynode: 1]
[feature_fraction_seed: 2]
[extra_trees: 0]
[extra_seed: 6]
[early_stopping_round: 0]
[first_metric_only: 0]
[max_delta_step: 0]
[lambda_l1: 0]
[lambda_l2: 0]
[linear_lambda: 0]
[min_gain_to_split: 0]
[drop_rate: 0.1]
[max_drop: 50]
[skip_drop: 0.5]
[xgboost_dart_mode: 0]
[uniform_drop: 0]
[drop_seed: 4]
[top_rate: 0.2]
[other_rate: 0.1]
[min_data_per_group: 100]
[max_cat_threshold: 32]
[cat_l2: 10]
[cat_smooth: 10]
[max_cat_to_onehot: 4]
[top_k: 20]
[monotone_constraints: ]
[monotone_constraints_method: basic]
[monotone_penalty: 0]
[feature_contri: ]
[forcedsplits_filename: ]
[refit_decay_rate: 0.9]
[cegb_tradeoff: 1]
[cegb_penalty_split: 0]
[cegb_penalty_feature_lazy: ]
[cegb_penalty_feature_coupled: ]
[path_smooth: 0]
[interaction_constraints: ]
[verbosity: 2]
[saved_feature_importance_type: 0]
[linear_tree: 0]
[max_bin: 255]
[max_bin_by_feature: ]
[min_data_in_bin: 3]
[bin_construct_sample_cnt: 200000]
[data_random_seed: 1]
[is_enable_sparse: 1]
[enable_bundle: 1]
[use_missing: 1]
[zero_as_missing: 0]
[feature_pre_filter: 1]
[pre_partition: 0]
[two_round: 0]
[header: 0]
[label_column: ]
[weight_column: ]
[group_column: ]
[ignore_column: ]
[categorical_feature: ]
[forcedbins_filename: ]
[precise_float_parser: 0]
[objective_seed: 5]
[num_class: 1]
[is_unbalance: 0]
[scale_pos_weight: 1]
[sigmoid: 1]
[boost_from_average: 1]
[reg_sqrt: 0]
[alpha: 0.9]
[fair_c: 1]
[poisson_max_delta_step: 0.7]
[tweedie_variance_power: 1.5]
[lambdarank_truncation_level: 30]
[lambdarank_norm: 1]
[label_gain: ]
[eval_at: ]
[multi_error_top_k: 1]
[auc_mu_weights: ]
[num_machines: 1]
[local_listen_port: 12400]
[time_out: 120]
[machine_list_filename: ]
[machines: ]
[gpu_platform_id: -1]
[gpu_device_id: -1]
[gpu_use_dp: 0]
[num_gpu: 1]

end of parameters

pandas_categorical:null

Collaborator Author:

@StrikerRUS
Did you run the test case from the latest commit of this PR?

I've tested v3.3.2 with the following test case:

def test_continue_train_different_feature_size(capsys):
    np.random.seed(0)
    train_X = np.hstack([np.ones(800).reshape(-1, 8), np.arange(200, 0, -1).reshape(-1, 2)])
    train_y = np.sum(train_X[:, -2:], axis=1)
    train_data = lgb.Dataset(train_X, label=train_y)
    params = {
        "objective": "regression",
        "num_trees": 10,
        "num_leaves": 31,
        "verbose": -1,
        'predict_disable_shape_check': True,
    }
    model = lgb.train(train_set=train_data, params=params)

    train_X_cont = np.random.rand(100, 5)
    train_y_cont = np.sum(train_X_cont, axis=1)
    train_data_cont = lgb.Dataset(train_X_cont, label=train_y_cont)
    params.update({"verbose": 2})
    lgb.train(train_set=train_data_cont, params=params, init_model=model)
    captured = capsys.readouterr()
    assert captured.out.find("features found in dataset for continual training, but at least") != -1

And the output is

FAILED test_engine.py::test_continue_train_different_feature_size - AssertionError: assert -1 != -1

which shows that the test case fails in v3.3.2.

Collaborator:

@shiyu1994 This test should segfault (or something like that, I don't remember exactly) in v3.3.2, but it doesn't.

The warning was added after this review thread was started, so it isn't what we are discussing here.

Collaborator Author:

@StrikerRUS Thanks. With the latest commit, it should now fail in v3.3.2: it will either segfault or raise an AssertionError at the line

assert captured.out.find("features found in dataset for continual training, but at least") != -1

in the latest version of the test case.

Collaborator:

With the latest commit, it should now fail in v3.3.2.

It still doesn't fail in v3.3.2.

Please use a fixed dataset, not a random one, so that the segfault is reproduced consistently in versions without the workaround from this PR.

train_data = lgb.Dataset(train_X, label=train_y)
params = {
"objective": "regression",
"num_trees": 10,
"num_leaves": 31,
"verbose": 2,
'predict_disable_shape_check': True,
Collaborator:

Why is it necessary to pass predict_disable_shape_check? This test isn't generating any predictions. Can this be removed?

Collaborator (StrikerRUS, May 11, 2022):

Ah, good observation! But predict() is called internally here when init scores are set from a previously trained model:

elif isinstance(init_model, Booster):
    predictor = init_model._to_predictor(dict(init_model.params, **params))

train_set._update_params(params) \
    ._set_predictor(predictor) \

if isinstance(predictor, _InnerPredictor):
    if self._predictor is None and init_score is not None:
        _log_warning("The init_score will be overridden by the prediction of init_model.")
    self._set_init_score_by_predictor(predictor, data)

def _set_init_score_by_predictor(self, predictor, data, used_indices=None):

    if predictor is not None:
        init_score = predictor.predict(data,
                                       raw_score=True,
                                       data_has_header=data_has_header)

Collaborator:

Ah, I see! So then is the bug addressed by this PR only present when you've also set predict_disable_shape_check = True? I expect that most LightGBM users will not be aware that that prediction parameter is relevant for training continuation.

Collaborator:

Yes, without setting predict_disable_shape_check = True users get the following error:

LightGBMError: The number of features in data (5) is not the same as it was in training data (10).
You can set ``predict_disable_shape_check=true`` to discard this error, but please be aware what you are doing.

I think the second sentence of the error, "You can set predict_disable_shape_check=true to discard this error ...", helps users understand this complicated relationship.
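As a hedged sketch of how that error surfaces during training continuation (the feature counts match those discussed in this thread; the snippet is illustrative and not copied from the PR's test):

import numpy as np
import lightgbm as lgb
from lightgbm.basic import LightGBMError

train_X = np.random.rand(100, 10)
init_model = lgb.train({"objective": "regression", "verbose": -1},
                       lgb.Dataset(train_X, label=train_X.sum(axis=1)))

cont_X = np.random.rand(100, 5)  # fewer features than the initial model saw
try:
    # init scores are computed internally via predictor.predict() on cont_X,
    # which triggers the shape check and the error quoted above
    lgb.train({"objective": "regression", "verbose": -1},
              lgb.Dataset(cont_X, label=cont_X.sum(axis=1)),
              init_model=init_model)
except LightGBMError as err:
    print(err)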

Collaborator Author:

predict_disable_shape_check is necessary because a numpy matrix is used as the dataset in the test case. If we used a LibSVM file as the dataset, we wouldn't need predict_disable_shape_check=True. I use a numpy matrix here because I think it is more convenient than LibSVM, in that it does not require generating files in test cases.
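A rough sketch of the LibSVM-file variant described above, assuming the data is written out with sklearn's dump_svmlight_file (the file names and the writer helper are assumptions made for illustration):

import numpy as np
import lightgbm as lgb
from sklearn.datasets import dump_svmlight_file

train_X = np.random.rand(100, 10)
dump_svmlight_file(train_X, train_X.sum(axis=1), "train_10_features.svm")
init_model = lgb.train({"objective": "regression", "verbose": -1},
                       lgb.Dataset("train_10_features.svm"))

cont_X = np.random.rand(100, 5)
dump_svmlight_file(cont_X, cont_X.sum(axis=1), "train_5_features.svm")
# with a file-based Dataset the init scores are predicted from the file itself,
# so predict_disable_shape_check=True is not needed even though the feature counts differ
lgb.train({"objective": "regression", "verbose": -1},
          lgb.Dataset("train_5_features.svm"), init_model=init_model)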

}
model = lgb.train(train_set=train_data, params=params)

train_X_cont = np.random.rand(100, 5)
train_y_cont = np.sum(train_X_cont, axis=1)
train_data_cont = lgb.Dataset(train_X_cont, label=train_y_cont)
lgb.train(train_set=train_data_cont, params=params, init_model=model)


def test_cv():
X_train, y_train = make_synthetic_regression()
params = {'verbose': -1}