
[python-package] How do I use lgb.Dataset() with lgb.Predict() without using pandas df or np array? #6285

Closed
wil70 opened this issue Jan 22, 2024 · 5 comments


wil70 commented Jan 22, 2024

Description

I'm trying Optuna and FLAML. I'm able to train models (lgb.train) with Optuna using CSV and binary files as inputs for the training and validation datasets. This is great, as the speed is good.
The problem is with prediction (lgb.predict): I can't get good speed because I have to go through a pandas df or np array.
Is there a way to bypass those and use lgb.Dataset()?

Reproducible example

I have big datasets (CSV and binary). I would like to use those with lgb.Dataset('train.csv.bin') instead of a pandas df from pd.read_csv('train.csv'), 1) for speed and 2) for consistency with how LightGBM (CLI version) handles "na" and "+-inf", which pandas handles differently.

   
    import lightgbm as lgb
    from lightgbm import early_stopping, log_evaluation

    params = {
        "objective": "multiclass",
        # "metric": "multi_logloss,multi_error,auc_mu",
        "metric": "multi_error",
        "verbosity": -1,
        "boosting_type": "gbdt",
        "num_threads": 10,
        "num_class": 2,
        "ignore_column": 1,
        "label_column": 10,
        "categorical_feature": "8,9",
        "data": "train.csv.bin",
        "valid_data": "validate.csv.bin",
    }

    # training from the binary dataset files works fine and is fast
    dtrain = lgb.Dataset("train.csv.bin", params=params)
    dval = lgb.Dataset("validate.csv.bin", params=params, reference=dtrain)
    model = lgb.train(
        params,
        dtrain,
        valid_sets=[dval],
        callbacks=[early_stopping(1), log_evaluation(100)],
    )
    model.save_model("model.txt")

    # prediction: build the validation Dataset straight from the binary file
    dval = lgb.Dataset("validate.csv.bin", params=params)

    # load the model from file
    model = lgb.Booster(model_file="model.txt")

    # get the true labels
    y_true = dval.get_label()

    # get the predicted probabilities
    y_pred = model.predict(dval.get_data())
    # Error: Exception: Cannot get data before construct Dataset

    # y_pred = model.predict(dval.data)
    # Error: lightgbm.basic.LightGBMError: Unknown format of training data.
    # Only CSV, TSV, and LibSVM (zero-based) formatted text files are supported.

How can I achieve this? How do I specify that all columns are features except column 10 (the label), and that column 1 should be ignored?
I tried to feed the params to lgb.Dataset(), but that didn't do it; roughly what I tried is sketched below.
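
A sketch of the attempt (illustrative only; the text CSV is used here since the column parameters apply when a text file is parsed):

    import lightgbm as lgb

    ds_params = {
        "label_column": 10,        # column 10 holds the label
        "ignore_column": 1,        # drop column 1 from the features
        "categorical_feature": "8,9",
    }
    dval = lgb.Dataset("validate.csv", params=ds_params)
    dval.construct()               # parses the file inside the C++ engine
    # ...but there is still no way to hand `dval` itself to model.predict(...)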

Environment info

Win10 Pro + Python 3.12.0 + latest Optuna

LightGBM version or commit hash: Latest as of today

Command(s) you used to install LightGBM

pip install lightgbm

Additional Comments


wil70 commented Jan 29, 2024

If there is no reply to the question, then maybe this should be treated as a feature enhancement request?

This would be a great enhancement for large datasets. LightGBM is good at handling big datasets for training and validation with its C++ engine; keeping the same performance for the testing phase as well would be a big plus.

In my code, everything is fine up to the line model = lgb.Booster(model_file='model.txt')...
If we could predict directly from a LightGBM Dataset (model.predict(...)), that would solve the issue, as all the data would stay within the C++ engine and never be manipulated in Python. Roughly the usage sketched below.
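
A minimal sketch of the hoped-for usage (hypothetical; passing a Dataset to predict() is not supported today):

    import lightgbm as lgb

    dval = lgb.Dataset("validate.csv.bin")        # data would stay inside the C++ engine
    model = lgb.Booster(model_file="model.txt")
    y_pred = model.predict(dval)                  # hypothetical: not supported today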

@jameslamb
Collaborator

Thanks as always for your interest in LightGBM and for pushing the limits of what it can do with larger datasets and larger models.

As you've discovered, directly calling predict() on a LightGBM Dataset isn't supported today. We already have a feature request tracking it: see #2302.

The best way to get that functionality into LightGBM is to contribute it yourself. If that interests you, consider putting up a draft pull request and @-ing us for help on specific questions.

@jameslamb
Collaborator

pandas df pd.read_csv('train.csv')

If your data is large enough that loading it is a significant runtime and memory problem, and you're using Python, consider storing it in a format other than CSV. CSV is a text format, and pandas has to do a ton of type-guessing and type-conversion while reading it.

For example, consider storing it as a dense numpy array in the .npy file format (see the numpy docs) and then reading it back into a numpy matrix, roughly as sketched below.
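
A rough sketch of that approach (file names are made up; it assumes every column is numeric and that the label / ignored columns have been dropped so the matrix matches the trained features):

    import numpy as np
    import lightgbm as lgb

    # one-time conversion: parse the CSV once and store it as a binary .npy file
    X = np.loadtxt("validate_features.csv", delimiter=",", dtype=np.float64)
    np.save("validate_features.npy", X)

    # later runs: loading .npy is a plain binary read, with no type-guessing
    X = np.load("validate_features.npy")
    model = lgb.Booster(model_file="model.txt")
    y_pred = model.predict(X)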

Or store it in Parquet format and read that into pandas (to at least eliminate most of the type-conversion overhead of CSV), along the lines of the sketch below.
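
Again only a sketch (file and column names are made up; Parquet support in pandas requires pyarrow or fastparquet):

    import pandas as pd
    import lightgbm as lgb

    # one-time conversion: column types are stored in the Parquet file itself
    pd.read_csv("validate.csv").to_parquet("validate.parquet")

    # later runs: much cheaper to load than re-parsing CSV text
    df = pd.read_parquet("validate.parquet")
    model = lgb.Booster(model_file="model.txt")
    y_pred = model.predict(df.drop(columns=["label"]))  # "label" column name is assumed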

@jameslamb
Collaborator

stay within the C++ engine and not be manipulated in Python

LightGBM also supports predicting directly on a CSV file:

void Predict(int start_iteration, int num_iteration, int predict_type, const char* data_filename,

Have you tried that?

You could do that with the lightgbm CLI or with Booster.predict() in the Python package; Booster.predict() accepts a path to a CSV/TSV/LibSVM-formatted text file.

https://github.com/microsoft/LightGBM/blob/252828fd86627d7405021c3377534d6a8239dd69/python-package/lightgbm/basic.py#L1073-1075
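
For example, a minimal sketch (file names are made up; data_has_header is the relevant keyword argument when the file carries a header row):

    import lightgbm as lgb

    model = lgb.Booster(model_file="model.txt")

    # `data` can be a path to a CSV/TSV/LibSVM text file; LightGBM's own C++
    # parser reads it, so nothing is round-tripped through pandas or numpy
    y_pred = model.predict("validate.csv", data_has_header=False)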

@StrikerRUS
Collaborator

Closed in favor of #2302; we decided to keep all feature requests in one place.

You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.
