
python api can't continue train with binary file data #4311

Closed
papillonyi opened this issue May 21, 2021 · 4 comments
Labels
bug

Comments

@papillonyi
Description

When I tried to continue training with binary file data, it raised an error:
[screenshot of the error message]
I trained on a binary file for 10 iterations first, and it succeeded.
Then I tried to train with init_model=last_train_result, and it raised that error.

When I tried to do the same thing with the LightGBM CLI, it worked.

Reproducible example

Here is my code:
[screenshot of the code]

Environment info

The version of LightGBM I used is 3.1.1, on Linux.

Command(s) you used to install LightGBM

Additional Comments

@jameslamb jameslamb added the bug label May 24, 2021
@StrikerRUS StrikerRUS mentioned this issue Jul 12, 2021
@jameslamb
Collaborator

jameslamb commented Jan 2, 2022

@papillonyi thanks very much for using LightGBM, and sorry for the very long delay in responding to this!

In the future, please do not post logs, error messages, or code as screenshots. Post such things as text instead, so others facing the same challenges can find this discussion from search engines.

Providing a minimal, reproducible example that can be easily copied and run by others would also make it much more likely that issues will be addressed quickly.


I tried to create a minimal, reproducible example based on the provided code snippet, using the Python package as of the latest commit on master (af5b40e).

The code below produces the reported error.

import lightgbm as lgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1_000, n_informative=5)

data_file_name = "output.bin"
model_file_name = "model-temp.text"
params = {
    "boosting_type": "gbdt",
    "verbose": 1,
    "deterministic": True,
    "objective": "regression"
}

# create dataset and save it to binary file
lgb.Dataset(data=X, label=y).save_binary(data_file_name)

[LightGBM] [Info] Saving data to binary file output.bin

# train for 10 iterations on data from file
dtrain = lgb.Dataset(data_file_name)
booster = lgb.train(
    params=params,
    train_set=dtrain,
    num_boost_round=10
)
booster.save_model(model_file_name)

[LightGBM] [Info] Load from binary file output.bin
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002147 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 25500
[LightGBM] [Info] Number of data points in the train set: 1000, number of used features: 100
[LightGBM] [Info] Start training from score 0.505056

# clear booster and dataset
del booster
del dtrain

# try to continue training for 10 more rounds on
# model and Dataset from file
dtrain = lgb.Dataset(data_file_name)

booster = lgb.train(
    params=params,
    train_set=dtrain,
    init_model=model_file_name,
    num_boost_round=10
)

[LightGBM] [Info] Load from binary file output.bin
[LightGBM] [Fatal] Unknown format of training data. Only CSV, TSV, and LibSVM (zero-based) formatted text files are supported.

stack trace:
---------------------------------------------------------------------------
LightGBMError                             Traceback (most recent call last)
/tmp/ipykernel_91/4239506626.py in <module>
----> 1 booster = lgb.train(
      2     params=params,
      3     train_set=dtrain.construct(),
      4     init_model=model_file_name,
      5     num_boost_round=10

/opt/conda/lib/python3.8/site-packages/lightgbm/engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, keep_training_booster, callbacks)
    159         raise TypeError("Training only accepts Dataset object")
    160 
--> 161     train_set._update_params(params) \
    162              ._set_predictor(predictor) \
    163              .set_feature_name(feature_name) \

/opt/conda/lib/python3.8/site-packages/lightgbm/basic.py in _set_predictor(self, predictor)
   2068         elif self.data is not None:
   2069             self._predictor = predictor
-> 2070             self._set_init_score_by_predictor(self._predictor, self.data)
   2071         elif self.used_indices is not None and self.reference is not None and self.reference.data is not None:
   2072             self._predictor = predictor

/opt/conda/lib/python3.8/site-packages/lightgbm/basic.py in _set_init_score_by_predictor(self, predictor, data, used_indices)
   1385         num_data = self.num_data()
   1386         if predictor is not None:
-> 1387             init_score = predictor.predict(data,
   1388                                            raw_score=True,
   1389                                            data_has_header=data_has_header,

/opt/conda/lib/python3.8/site-packages/lightgbm/basic.py in predict(self, data, start_iteration, num_iteration, raw_score, pred_leaf, pred_contrib, data_has_header, is_reshape)
    782         if isinstance(data, (str, Path)):
    783             with _TempFile() as f:
--> 784                 _safe_call(_LIB.LGBM_BoosterPredictForFile(
    785                     self.handle,
    786                     c_str(str(data)),

/opt/conda/lib/python3.8/site-packages/lightgbm/basic.py in _safe_call(ret)
    125     """
    126     if ret != 0:
--> 127         raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
    128 
    129 

LightGBMError: Unknown format of training data. Only CSV, TSV, and LibSVM (zero-based) formatted text files are supported.

@jameslamb
Collaborator

Looking at the stack trace, this error comes from a call to LGBM_BoosterPredictForFile() inside `Booster.predict()`, here:

_safe_call(_LIB.LGBM_BoosterPredictForFile(

LGBM_BoosterPredictForFile() cannot be used with LightGBM Dataset binary files at the moment.

Related conversations:


I tried to do the same thing with the LightGBM CLI, and it worked.

I was able to replicate that behavior as well, and can confirm that the CLI does support training continuation using a Dataset binary file.

First, I installed the Python package and built the CLI.

cd python-package
python setup.py install
cd ..

mkdir build
cd build
cmake ..
make -j2
cd ..

Then I ran the following Python code to generate the Dataset binary file and initial model.

import lightgbm as lgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1_000, n_informative=5)

data_file_name = "output.bin"
model_file_name = "model-temp.text"
params = {
    "boosting_type": "gbdt",
    "verbose": 1,
    "deterministic": True,
    "objective": "regression"
}

# create dataset and save it to binary file
lgb.Dataset(data=X, label=y).save_binary(data_file_name)

# train for 10 iterations on data from file
dtrain = lgb.Dataset(data_file_name)
booster = lgb.train(
    params=params,
    train_set=dtrain,
    num_boost_round=10
)
booster.save_model(model_file_name)

Then I checked that the produced model had exactly 10 trees.

cat model-temp.text | grep 'Tree=' | tail -1

Tree=9

Next, I created a file train.conf with configuration for the CLI.

task = train
objective = regression
data = output.bin
num_trees = 7
output_model = model-from-cli.txt
input_model = model-temp.text

Next, I ran training with the CLI and checked that a new model file was produced with 17 total trees (10 from the initial training, plus 7 from the continuation).

./lightgbm config=train.conf
cat model-from-cli.txt | grep 'Tree=' | tail -1

Tree=16
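The same tree count can be checked without grep. A small helper for this (hypothetical, not part of LightGBM) just counts the Tree= section headers in a saved text model dump:

```python
def count_trees(model_text: str) -> int:
    """Count tree sections in a LightGBM text model dump."""
    return sum(1 for line in model_text.splitlines() if line.startswith("Tree="))

# usage, assuming the CLI run above produced model-from-cli.txt:
# with open("model-from-cli.txt") as f:
#     print(count_trees(f.read()))  # expected 17 after continuation
```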

@jameslamb
Collaborator

I think at this point we should convert this issue into a feature request like "[python] support training continuation using a text model file and binary Dataset file" (and probably [R-package]). What do you think, @shiyu1994 @StrikerRUS?

Solutions for supporting that might be one of the following:

@jameslamb
Collaborator

Going through old issues today, I realize that this and #6144 describe exactly the same problem.

Since #6144 is more recent, I'm cross-linking these two and closing this one.
