[python-package] make Dataset pickleable #5098
Comments
Thanks for the excellent report! We'll look into it as soon as possible. Can you please clarify what specifically you mean by "This becomes an issue"?
hi @jameslamb, thanks for the quick response. By "This becomes an issue" I mean that the pickler used by RayTune is unable to serialize the dataset object and raises a ValueError:
Indeed; see LightGBM/python-package/lightgbm/basic.py, lines 2671 to 2690 at commit c991b2b.
Closed in favor of #2302; we decided to keep all feature requests in one place. You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.
Description
ctypes object pointers are added to the training (and evaluation) dataset after it is used to train a LightGBM model in Python. This becomes an issue when I want to serialize the dataset with cloudpickle after using it for training.
cloudpickle is used by RayTune to serialize objects when distributing them across multiple trial runs. I did not expect any dataset attributes to change after using the dataset to train a LightGBM model. As a workaround, for now I create a copy of the dataset that I use for training, since otherwise I could not use it in the follow-up hyperparameter optimization steps run by RayTune.
Reproducible example
Environment info
LightGBM version or commit hash: lightgbm = "3.3.2"
Command(s) you used to install LightGBM:
Other package versions:
python = "3.9.7"
cloudpickle = "2.0.0"
pandas = "1.4.1"
numpy = "1.22.3"