[python-package] How do I use lgb.Dataset() with lgb.Predict() without using pandas df or np array? #6285
Comments
If there is no reply to the "question", then maybe this is a feature enhancement request? This would be a great feature enhancement for large datasets. LightGBM is good at handling big datasets for training and validation with its C++ engine; keeping the same performance in the testing phase as well would be a big plus. In my code, all is good until after the line `model = lgb.Booster(model_file='model.txt')`...
Thanks as always for your interest in LightGBM and for pushing the limits of what it can do with larger datasets and larger models. As you've discovered, directly calling `predict()` on an `lgb.Dataset` is not supported.
The best way to get that functionality into LightGBM is to contribute it yourself. If that interests you, consider putting up a draft pull request.
If you have large enough data that loading it is a significant runtime and memory problem, and you're using Python, consider storing it in a different format than a CSV file. CSV is a text format, so every load pays the cost of parsing text into numbers. For example, consider storing it as a dense binary array, or in Parquet format and reading that in directly.
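As a minimal sketch of the "store it in a binary format" suggestion above (file name `features.npy` is hypothetical, and the random matrix stands in for your real data), NumPy's `.npy` format can be memory-mapped so later runs skip text parsing entirely:

```python
import numpy as np

# Stand-in for a large feature matrix; in practice you would convert
# your CSV to .npy once, up front.
X = np.random.default_rng(0).normal(size=(1000, 20)).astype(np.float32)
np.save("features.npy", X)

# Later runs memory-map the file: no text parsing, near-instant "load",
# and pages are only read from disk as predict() actually touches them.
X_loaded = np.load("features.npy", mmap_mode="r")
print(X_loaded.shape)  # (1000, 20)
```

The same idea applies to Parquet via `pandas.read_parquet`, which also preserves dtypes and NA handling across runs.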
LightGBM also supports predicting directly on a CSV file (see line 695 in commit 252828f).
Have you tried that? You could do that with the Python API.
Closing in favor of #2302. We decided to keep all feature requests in one place. You're welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.
Description
I'm trying Optuna and FLAML. I'm able to train models (lgb.train) with Optuna using CSV and bin files as input for the training and validation datasets. This is great, as the speed is good.
The problem is with prediction (Booster.predict): I'm not able to get good speed because I need to go through a pandas df or np array.
Is there a way to bypass those and use lgb.Dataset()?
Reproducible example
I have big datasets (CSV and bin). I would like to use those with lgb.Dataset('train.csv.bin') instead of pandas pd.read_csv('train.csv'), 1) for speed, and 2) for consistency in how LightGBM (the CLI version) handles "na" and "+-inf", which pandas handles differently.
How can I achieve this? How do I specify that all columns are features except column 10, and that column 1 should be ignored?
I tried feeding the params to lgb.Dataset(), but that didn't do it.
Environment info
Win10 Pro + Python 3.12.0 + latest Optuna
LightGBM version or commit hash: Latest as of today
Command(s) you used to install LightGBM
Additional Comments