cannot construct a valid set refer to a train set with max_bin != 255 #6159

Closed
aslongaspossible opened this issue Oct 28, 2023 · 6 comments
@aslongaspossible

Description

When I try to construct a validation set that references a training set built with max_bin=15, it raises "Dataset max_bin 15 != config 255". It seems that I can never create a validation set with max_bin != 255?

Reproducible example

val = lgb.Dataset(val_dataframe, val_label, lgb.Dataset('train.bin'))
val.save_binary('val.bin')

Here val_dataframe holds the features of the validation set, val_label holds its labels, and 'train.bin' is the training set, an lgb.Dataset saved in binary format with max_bin=15.

Environment info

LightGBM version or commit hash: 3.3.5

Command(s) you used to install LightGBM

conda install -c conda-forge lightgbm
@jameslamb
Collaborator

Thanks for using LightGBM.

The reproducible example you provided isn't reproducible. For example:

  • it's missing import statements
  • you haven't provided code that could exactly reproduce the objects val_dataframe and val_label
  • you haven't provided the content of the file train.bin or described exactly how it was created
  • the name val_dataframe implies that you're using a data frame in memory, but you haven't mentioned what library (pandas? polars? dask?) or the version of that library

You could help reduce the effort required to answer this question by addressing those concerns and providing a minimal, reproducible example. If you haven't done that before and are unsure where to start, see:

Can you please provide such details or explain why that's not possible?

@aslongaspossible
Author

Sorry to have provided unreproducible pseudocode. Here is the reproducible one:

import numpy as np
import lightgbm as lgb
lgb_random = lgb.Dataset(np.random.rand(10000, 100), np.random.rand(10000, 1), params={'max_bin':15})
lgb_random.save_binary('random.bin')
del lgb_random
val = lgb.Dataset(np.random.rand(1000, 100), np.random.rand(1000, 1), reference=lgb.Dataset('random.bin'))
val.save_binary('val.bin')

In other words, any random data reproduces my problem. I wonder why the config doesn't change to match the reference dataset.

@shiyu1994
Collaborator

@aslongaspossible Thanks for reporting this issue.

It seems that lgb.Dataset assumes the default max_bin of 255 unless told otherwise. Even when loading from a binary dataset file, max_bin isn't read from the file.

A quick fix would be to add params={'max_bin':15} to the reference=lgb.Dataset('random.bin'), i.e. reference=lgb.Dataset('random.bin', params={'max_bin':15}).

Still, I agree that this is not convenient, since the binary file should ideally contain all the information needed to reconstruct the preprocessed dataset.

Will look into how to fix this later.

@jameslamb
Collaborator

Thanks @aslongaspossible for providing a reproducible example. Given that, I see the issue and agree with @shiyu1994's recommendation. Until LightGBM provides more convenient behavior, consider storing the Dataset parameters alongside the Dataset binary files, wherever you keep them.

will look into how to fix this

@shiyu1994 @aslongaspossible please see #4904 (comment) where I described this exact issue in detail. I think we have a path forward, but haven't as yet had anyone take up implementing it: #4904 (comment)
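
One way to keep Dataset parameters next to the binary files, as suggested above, is a small JSON sidecar file. The sketch below is only an illustration: the helper names (params_path_for, save_dataset_params, load_dataset_params) and the ".params.json" suffix are hypothetical and not part of LightGBM.

```python
import json
from pathlib import Path


def params_path_for(bin_path):
    """Return the sidecar path, e.g. 'random.bin' -> 'random.bin.params.json'.

    The '.params.json' suffix is an arbitrary convention chosen for this sketch.
    """
    return Path(str(bin_path) + ".params.json")


def save_dataset_params(bin_path, params):
    """Write the params used to build the Dataset next to its binary file."""
    sidecar = params_path_for(bin_path)
    sidecar.write_text(json.dumps(params, indent=2, sort_keys=True))
    return sidecar


def load_dataset_params(bin_path):
    """Read the params back; a missing sidecar file yields an empty dict."""
    sidecar = params_path_for(bin_path)
    if not sidecar.exists():
        return {}
    return json.loads(sidecar.read_text())
```

With these helpers, you could call save_dataset_params('random.bin', {'max_bin': 15}) right after lgb_random.save_binary('random.bin'), and later build the reference as lgb.Dataset('random.bin', params=load_dataset_params('random.bin')), which is the same workaround as passing params={'max_bin':15} by hand.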

@aslongaspossible
Author

A quick fix would be to add params={'max_bin':15} to the reference=lgb.Dataset('random.bin'), i.e. reference=lgb.Dataset('random.bin', params={'max_bin':15}).

This works. Thank you!


github-actions bot commented Nov 6, 2024

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 6, 2024