Fix number of features mismatching in continual training (fix #5156) #5157

Open

wants to merge 14 commits into base: master

Changes from 12 commits
31 changes: 27 additions & 4 deletions src/boosting/gbdt_model_text.cpp
@@ -632,31 +632,54 @@ std::vector<double> GBDT::FeatureImportance(int num_iteration, int importance_type)
   }

   std::vector<double> feature_importances(max_feature_idx_ + 1, 0.0);
+  bool warn_about_feature_number = false;
+  int max_feature_index_found = -1;
   if (importance_type == 0) {
     for (int iter = 0; iter < num_used_model; ++iter) {
       for (int split_idx = 0; split_idx < models_[iter]->num_leaves() - 1; ++split_idx) {
         if (models_[iter]->split_gain(split_idx) > 0) {
+          const int real_feature_index = models_[iter]->split_feature(split_idx);
 #ifdef DEBUG
-          CHECK_GE(models_[iter]->split_feature(split_idx), 0);
+          CHECK_GE(real_feature_index, 0);
 #endif
-          feature_importances[models_[iter]->split_feature(split_idx)] += 1.0;
+          if (static_cast<size_t>(real_feature_index) >= feature_importances.size()) {
+            warn_about_feature_number = true;
Collaborator Author: Provide a warning instead of enlarging the feature importance vector. Since GBDT::FeatureImportance is used by the Python API and the Python code does not know the final size of the feature importance vector, dynamically enlarging the vector would cause out-of-bounds access.
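For reference, a rough sketch of why the buffer size matters on the Python side; the wrapper fragment below is paraphrased from memory and may not match the current lightgbm sources exactly.

# The Python wrapper allocates the output buffer before calling into C++,
# sized from Booster.num_feature(), roughly:
#
#     result = np.empty(self.num_feature(), dtype=np.float64)   # fixed-size buffer
#     _safe_call(_LIB.LGBM_BoosterFeatureImportance(
#         self.handle, ctypes.c_int(iteration), ctypes.c_int(importance_type_int),
#         result.ctypes.data_as(ctypes.POINTER(ctypes.c_double))))
#
# so a GBDT::FeatureImportance vector longer than num_feature() would be copied
# past the end of that caller-allocated buffer. The public contract can be
# checked directly:
import numpy as np
import lightgbm as lgb

X = np.random.rand(200, 10)
y = X.sum(axis=1)
booster = lgb.train({"objective": "regression", "num_trees": 5, "verbose": -1},
                    lgb.Dataset(X, label=y))
assert booster.feature_importance().shape[0] == booster.num_feature()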

+            if (real_feature_index > max_feature_index_found) {
+              max_feature_index_found = real_feature_index;
+            }
+          } else {
+            feature_importances[real_feature_index] += 1.0;
+          }
         }
       }
     }
   } else if (importance_type == 1) {
     for (int iter = 0; iter < num_used_model; ++iter) {
       for (int split_idx = 0; split_idx < models_[iter]->num_leaves() - 1; ++split_idx) {
         if (models_[iter]->split_gain(split_idx) > 0) {
+          const int real_feature_index = models_[iter]->split_feature(split_idx);
 #ifdef DEBUG
-          CHECK_GE(models_[iter]->split_feature(split_idx), 0);
+          CHECK_GE(real_feature_index, 0);
 #endif
-          feature_importances[models_[iter]->split_feature(split_idx)] += models_[iter]->split_gain(split_idx);
+          if (static_cast<size_t>(real_feature_index) >= feature_importances.size()) {
+            warn_about_feature_number = true;
+            if (real_feature_index > max_feature_index_found) {
+              max_feature_index_found = real_feature_index;
+            }
+          } else {
+            feature_importances[real_feature_index] += models_[iter]->split_gain(split_idx);
+          }
         }
       }
     }
   } else {
     Log::Fatal("Unknown importance type: only support split=0 and gain=1");
   }
+  if (warn_about_feature_number) {
+    Log::Warning("Only %d features found in dataset for continual training, but at least %d features found in initial model.",
Collaborator: Is it possible to add some information here that can help users understand what they should do about this warning? Or is this warning just being raised for the purpose of the unit test?

Collaborator Author: I've added

    Log::Warning("Please check the number of features used in continual training.");

in d719d78. Do you think it is appropriate?

Collaborator: Thanks! But I don't think "please check" will be any more informative for users, and I think that providing two separate WARN-level log messages for the same problem might be confusing. Would you consider this?

    Log::Warning(
        "Only %d features found in dataset, but at least %d features found in initial model. "
        "If this is intentional and the features in dataset match those used to train the initial model, no action is required. "
        "If not, training continuation may fail or produce bad results. "
        "To suppress this warning, provide a dataset with at least %d features.",
        static_cast<int>(feature_importances.size()),
        max_feature_index_found + 1,
        max_feature_index_found + 1);
+                 static_cast<int>(feature_importances.size()), max_feature_index_found + 1);
+    Log::Warning("Please check the number of features used in continual training.");
+  }
   return feature_importances;
 }

23 changes: 23 additions & 0 deletions tests/python_package_test/test_engine.py
@@ -970,6 +970,29 @@ def test_continue_train_multiclass():
     assert evals_result['valid_0']['multi_logloss'][-1] == pytest.approx(ret)


+def test_continue_train_different_feature_size(capsys):
+    np.random.seed(0)
+    train_X = np.hstack([np.ones(800).reshape(-1, 8), np.arange(200, 0, -1).reshape(-1, 2)])
+    train_y = np.sum(train_X[:, -2:], axis=1)
+    train_data = lgb.Dataset(train_X, label=train_y)
+    params = {
+        "objective": "regression",
+        "num_trees": 10,
+        "num_leaves": 31,
+        "verbose": -1,
+        'predict_disable_shape_check': True,
Collaborator: Why is it necessary to pass predict_disable_shape_check? This test isn't generating any predictions. Can this be removed?

Collaborator (@StrikerRUS, May 11, 2022): Ah, good observation! But predict() is called internally here when you set init scores by a previously trained model:

elif isinstance(init_model, Booster):
    predictor = init_model._to_predictor(dict(init_model.params, **params))

train_set._update_params(params) \
         ._set_predictor(predictor) \

if isinstance(predictor, _InnerPredictor):
    if self._predictor is None and init_score is not None:
        _log_warning("The init_score will be overridden by the prediction of init_model.")
    self._set_init_score_by_predictor(predictor, data)

def _set_init_score_by_predictor(self, predictor, data, used_indices=None):

    if predictor is not None:
        init_score = predictor.predict(data,
                                       raw_score=True,
                                       data_has_header=data_has_header)

Collaborator: Ah, I see! So then is the bug addressed by this PR only present when you've also set predict_disable_shape_check = True? I expect that most LightGBM users will not be aware that that prediction parameter is relevant for training continuation.

Collaborator: Yes, without setting predict_disable_shape_check = True users get the following error:

    LightGBMError: The number of features in data (5) is not the same as it was in training data (10).
    You can set ``predict_disable_shape_check=true`` to discard this error, but please be aware what you are doing.

I think the second sentence of the error ("You can set predict_disable_shape_check=true to discard this error ...") helps users understand this complicated relationship.
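For reference, a minimal sketch of the scenario discussed in this thread (the shapes and parameter values below are made up for illustration, not taken from this PR): training continuation with init_model triggers an internal predict() on the new data, which is where this check fires unless predict_disable_shape_check is set.

import numpy as np
import lightgbm as lgb

X_full = np.random.rand(100, 10)
base = lgb.train({"objective": "regression", "num_trees": 5, "verbose": -1},
                 lgb.Dataset(X_full, label=X_full.sum(axis=1)))

X_small = np.random.rand(100, 5)
params = {"objective": "regression", "num_trees": 5, "verbose": -1}
# Without the next line, the internal predict() raises LightGBMError:
# "The number of features in data (5) is not the same as it was in training data (10)."
params["predict_disable_shape_check"] = True
lgb.train(params, lgb.Dataset(X_small, label=X_small.sum(axis=1)), init_model=base)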

Collaborator Author: predict_disable_shape_check is necessary because a numpy matrix is used in the test case. If we used a LibSVM file as the dataset in the test case, we wouldn't need predict_disable_shape_check=True. I use a numpy matrix here because I think it is more convenient than LibSVM, in that it does not require generating files in test cases.

+    }
+    model = lgb.train(train_set=train_data, params=params)
+
+    train_X_cont = np.random.rand(100, 5)
+    train_y_cont = np.sum(train_X_cont, axis=1)
+    train_data_cont = lgb.Dataset(train_X_cont, label=train_y_cont)
+    params.update({"verbose": 2})
+    lgb.train(train_set=train_data_cont, params=params, init_model=model)
+    captured = capsys.readouterr()
+    assert captured.out.find("features found in dataset for continual training, but at least") != -1


 def test_cv():
     X_train, y_train = make_synthetic_regression()
     params = {'verbose': -1}