[python-package] label attribute of training dataset changes from pandas series to numpy array after model training #5099

Closed
5uperpalo opened this issue Mar 26, 2022 · 4 comments
@5uperpalo

Description

As the output and code below show, the type of the label attribute on the Dataset changes from a pandas Series to a numpy array after I use the dataset to train a LightGBM model. This is a problem if I want to use the label attribute in follow-up evaluation/postprocessing steps. I would expect the algorithm not to change its input parameters 😄.

lgbtrain.label type before training:
 <class 'pandas.core.series.Series'>
lgbtrain.data type before training:
 <class 'pandas.core.frame.DataFrame'>
...
lgbtrain.label type after training:
 <class 'numpy.ndarray'>
lgbtrain.data type after training:
 <class 'pandas.core.frame.DataFrame'>

Reproducible example

import numpy as np
import pandas as pd
import lightgbm as lgb
from lightgbm import Dataset
import cloudpickle

train_df = pd.DataFrame(
    {
        "id": np.arange(0, 20),
        "cont_feature": np.arange(0, 20),
        "target": [0] * 5 + [1] * 15,
    },
)
lgbtrain = Dataset(
    train_df.drop(columns=["target", "id"]),
    train_df["target"],
    free_raw_data=False,
)

print(f"lgbtrain.label type before training:\n {type(lgbtrain.label)}")
print(f"lgbtrain.data type before training:\n {type(lgbtrain.data)}")

model = lgb.train(
    train_set=lgbtrain,
    params={},
)

print(f"lgbtrain.label type after training:\n {type(lgbtrain.label)}")
print(f"lgbtrain.data type after training:\n {type(lgbtrain.data)}")

Environment info

LightGBM version or commit hash:

lightgbm = "3.3.2"

Command(s) you used to install LightGBM

pip install lightgbm

python = "3.9.7"
cloudpickle = "2.0.0"
pandas = "1.4.1"
numpy = "1.22.3"

@jameslamb
Collaborator

I would expect the algorithm to not change input parameter

Thanks for an excellent write-up with clear reproducible example!

For both this issue and #5098 , I think you may have missed a detail of how lightgbm training works. It is not that "the algorithm" is making such changes.

A LightGBM Dataset object is an alternative representation of the provided raw data which has been passed through some pre-processing. You could reproduce this behavior by running lgbtrain.construct() directly in the provided code.

Please see these descriptions for more details

I don't recall if it's intentional that .label (the public attribute on Dataset) is stored as a numpy array. Maybe another maintainer will remember. Otherwise, I'll look through the package's code at some point in the future and respond.


Can you provide more specifics on the way you're using the Dataset object, and why .label being stored as a numpy array is problematic?

@jameslamb jameslamb changed the title Python LightGBM: label attribute of training dataset changes from pandas series to numpy array after model training [python-package] label attribute of training dataset changes from pandas series to numpy array after model training Mar 26, 2022
@StrikerRUS
Collaborator

Dataset attributes like weight, label, and init_score have to be re-created after the underlying data are passed to the C++ side. I think numpy was chosen as a universal format that should be sufficient in the majority of cases.

self.label = self.get_field('label') # original values can be modified at cpp side

Refer to #2390 for why the originally passed underlying data might be changed after constructing a Dataset object.
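The re-creation described in this comment can be pictured with a toy model. The sketch below is not LightGBM code: `ToyDataset` and its `_native` attribute are hypothetical, and a standard-library `array` stands in for the buffer owned by the C++ side. It only illustrates why an attribute that is re-read after construction no longer has the type the caller passed in.

```python
from array import array


class ToyDataset:
    """Hypothetical stand-in for a lazily constructed dataset (not LightGBM code)."""

    def __init__(self, label):
        self.label = label    # kept exactly as the caller passed it
        self._native = None   # stands in for the buffer owned by the C++ side

    def construct(self):
        if self._native is None:
            # Hand a float32 copy of the labels to the "native" side ...
            self._native = array("f", self.label)
            # ... then re-read it, so the attribute reflects what is stored there.
            self.label = array("f", self._native)
        return self


ds = ToyDataset([0, 0, 1, 1])
before = type(ds.label).__name__
ds.construct()
after = type(ds.label).__name__
print(before, "->", after)
```

The values survive the round trip, but the container type is whatever the re-read step produces, which is the same pattern as the `self.label = self.get_field('label')` line quoted above.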

@github-actions

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023