[Dask] Expected error randomly not raised in Dask test #4099
hmmm, interesting. In the logs mentioned in those comments, it looks like this has a different root cause from what was fixed in #4071. I think what's happening here is that the data is still all ending up on one worker somehow. This is possibly the same underlying problem as #4074, actually.

Error code 104 means "connection reset by peer", which could occur in distributed training if one of the Dask workers dies and is restarted. Similarly here, if one of the workers died before training started, then it's possible that Dask moved the training data back to the other worker, and that training then effectively ran on a single worker. It's possible that one of the workers died because the two previous …

There's no reason that this test has to be in the same test case as the other network params tests. I just did that to try to minimize the total runtime of the tests (the number of times we call …).
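For debugging runs like this, one quick way to confirm the "all data on one worker" theory is to ask the Dask client which workers actually hold the persisted partitions before training starts. Below is a minimal sketch of that check; the two-worker `LocalCluster` and the random array are stand-ins for the test's real fixtures, not code from this repository.

```python
from dask.distributed import Client, LocalCluster, wait
import dask.array as da

# hypothetical two-worker cluster, mirroring the test setup described above
cluster = LocalCluster(n_workers=2, threads_per_worker=1)
client = Client(cluster)

# persist a small collection and wait until its partitions are in worker memory
dX = da.random.random((1_000, 10), chunks=(100, 10)).persist()
wait(dX)

# Client.has_what() maps each worker address to the keys it currently holds
workers_with_parts = [w for w, keys in client.has_what().items() if keys]
print(workers_with_parts)
# If this list has length 1, all of the data ended up on a single worker,
# and "distributed" training would effectively be single-machine.
```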
I believe 16GB should be enough for the toy datasets we use for tests...
Yeah, sure. But it doesn't fix the underlying issue, unfortunately. I remember I asked this question before but didn't get a clear answer: does Dask have something like a "global option for reproducibility"? Similar to …
A couple points on this:
This would be incredibly difficult for Dask or any distributed system to achieve. If you want to write code of the form "move this exact data to this exact worker and then run this exact task on this exact worker", you can do it with Dask's low-level APIs, but at that point you're not really getting much benefit from Dask, because you are doing all the work that its higher-level APIs are intended to abstract away. Once you get into coordinating processes, and not just threads within one process, it becomes much more difficult to predict the exact behavior of the system. LightGBM is able to offer a …
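For illustration, here is a minimal sketch of those low-level APIs (`Client.scatter` and `Client.submit` with the `workers=` argument). The cluster setup and data are hypothetical, not taken from the test suite; this is the kind of explicit placement you would have to write yourself, which is exactly the work Dask's higher-level collections normally abstract away.

```python
from dask.distributed import Client, LocalCluster
import numpy as np

# hypothetical two-worker cluster; the names here are illustrative only
cluster = LocalCluster(n_workers=2, threads_per_worker=1)
client = Client(cluster)
worker_a, worker_b = list(client.scheduler_info()["workers"])

# move this exact data to this exact worker...
part_a = client.scatter(np.arange(100), workers=[worker_a])
part_b = client.scatter(np.arange(100, 200), workers=[worker_b])

# ...and then run this exact task on this exact worker
future = client.submit(np.sum, part_a, workers=[worker_a])
print(future.result())  # 4950
```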
Absolutely agree with this for the "real world" case. But I thought that with only two test workers and a deterministic data-partitioning algorithm (which, it seems, is a wrong assumption for Dask), given the same dataset we wouldn't have many variants.
I haven't seen this one at all in the last month. I hope that #4132 was the fix for it. I think this can be closed.
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
LightGBM/tests/python_package_test/test_dask.py, lines 1058 to 1060 at 77d54b3
Refer to #4068 (comment) and #4068 (comment) for full logs.