[dask] DaskRegressor.predict() fails on DataFrame / Series input #3861
Comments
I marked this "good first issue" only because I think that for someone who's experienced with Dask, they might be able to fix this without needing too much LightGBM knowledge. |
Hi, James. There's a |
😱😱😱 good eye! I think that behavior is inconsistent and should change, but it still doesn't explain the bug, right? Because if that method returned a Dask DataFrame, presumably you'd get this same error calling .compute() on that result, right? |
Quick question, is |
Was going too fast, sorry. That was a very hastily-written issue and it needs a better reproducible example when I can. I just edited it to define |
I take it back, now I remember why there's a
import numpy as np
import pandas as pd
from lightgbm import LGBMRegressor
reg = LGBMRegressor()
np.random.random(10)
num_features = 20
num_rows = 1000
X = pd.DataFrame({
    "col" + str(i): np.random.random(num_rows)
    for i in range(num_features)
})
y = np.random.random(num_rows)
reg.fit(X, y)
preds = reg.predict(X)
print(f"input type: {type(X)}, \npred type: {type(preds)}")
|
That makes sense. I couldn't reproduce the error, I tried on local and remote clusters. |
I see some issues that suggest that this error might happen when an input contains NaNs:
I've also found the place where this happens (I think). There are some calls in
This is the internal function in Dask that raises the error in the original post here: https://github.com/dask/dask/blob/e54976954a4e983493923769c31d89b68e72fd6f/dask/dataframe/utils.py#L157
I'll try soon to create a clean reproducible example. I believe I know how to fix this, but without that repro we won't be able to test a fix. |
Not sure if it's entirely related, but predict also fails if there are categoricals.
import dask
import lightgbm as lgb
from dask.distributed import Client
client = Client()
dtypes = {
    'name': 'category',
    'id': int,
    'x': float,
    'y': float
}
ddf = dask.datasets.timeseries(freq='1H', dtypes=dtypes)
X, y = ddf.drop('y', 1), ddf.y
reg = lgb.dask.DaskLGBMRegressor().fit(X, y)
reg.predict(X)
If I do:
Xc = X.compute()
reg.to_local().predict(Xc)
It works as expected. I use categoricals a lot so I would really like to see this work, and I'd like to work on this. Do you think it should be a separate issue, or is it related? |
AH @jmoralez !!!! Maybe you found the secret to reproducing this! |
Thank you for the nice reproducible example, this could be the issue!
I'd love if you can fix this. Do you have time to work on it over the next few days? Sorry for the rush, but this is one of the issues I want to fix before we do a 3.2.0 release of |
Oh I meant it as in: if no one's taking it I'd like to check it out, haha. I'm not sure I'd be able to pull it off, I prefer maybe helping you with some findings or discussions. |
Haha ok, thanks! I actually get some dedicated time to work on LightGBM at work...so how about I try this tomorrow and open a draft PR, and maybe I'll Now that you found a small reproducible example, it should go quickly. |
Alright, I think I have a fix for this in #3908. I wanted to post some more debugging information here. Thanks to your huge help discovering that category cols was the issue, @jmoralez , I came up with the reproducible example below. I wanted something a little lower-level than using
```python
import dask
import dask.array as da
import dask.dataframe as dd
import pandas as pd
import numpy as np
import lightgbm as lgb
from dask.distributed import LocalCluster, Client

cluster = LocalCluster(n_workers=3)
client = Client(cluster)
client


def _create_data() -> pd.DataFrame:
    num_rows = 1000
    return pd.DataFrame({
        "float_col1": pd.Series(np.random.random(num_rows), dtype="float"),
        "float_col2": pd.Series(np.random.random(num_rows), dtype="float"),
        "cat_col": pd.Series(np.random.choice(["a", "b", "y", "z"], num_rows), dtype="category"),
    })


parts = [dask.delayed(_create_data)() for _ in range(5)]
ddf = dd.from_delayed(
    parts,
    meta={
        "float_col1": "float",
        "float_col2": "float",
        "cat_col": "category"
    }
)
label = da.random.random((5000, 1), (1000, 1)).to_dask_dataframe()[0]

reg = lgb.DaskLGBMRegressor()
reg.fit(X=ddf, y=label)

# this will fail
preds = reg.predict(ddf)
preds.compute()
```
That comes from the logic in
full error log
|
…3908)
* add support for pandas categorical columns
* remove commented code
* quotes
* syntax error
* fix shape for ranker test
* Apply suggestions from code review
Co-authored-by: Nikita Titov
* Update tests/python_package_test/test_dask.py
* trying
* fix tests
* remove unnecessary debugging stuff
* skip accuracy checks on categorical
* use category columns as categorical features
Co-authored-by: Nikita Titov
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
How are you using LightGBM?
LightGBM component: Python-package
Environment info
Operating System: Ubuntu 18.04
C++ compiler version: gcc 8.3.0
CMake version: 3.13.4
Python version:
output of 'conda info'
LightGBM version or commit hash: https://github.com/microsoft/LightGBM/tree/9f70e9685dfb5c82f2ee87176a8433a6b7a4b98f
Error message and / or logs
Training with `lightgbm.dask.DaskLGBMRegressor` succeeds, and `.predict()` fails with this error.
Reproducible example(s)
I'll update this with a better, smaller reproducible example soon. I'm rushing right now to finish something else for work, but wanted to be sure I document this so search engines return this issue if others google that error message.
I'm training and trying to `.predict()` on a Dask DataFrame. Something like this.
See the output of `conda env export` below for versions of Dask and its dependencies.
output of 'conda env export'
References
I think that changing the uses of `map_blocks()` and `map_partitions` based on this description from the Dask docs could fix this issue.
But I'm confused and concerned about this error showing up, since it does not show up in any of the tests at https://github.com/microsoft/LightGBM/blob/9f70e9685dfb5c82f2ee87176a8433a6b7a4b98f/tests/python_package_test/test_dask.py, and we test against Dask DataFrame inputs there.
For anyone new to LightGBM looking to help with this before I get to it, here's the place where we're using `_predict_part()` in `map_partitions()`:
LightGBM/python-package/lightgbm/dask.py
Lines 351 to 360 in 9f70e96