[dask] Shifting to Serial Tree Learner Despite Having multiple workers #4987
If you have a need for the MPI version that isn't met by the socket-based build, please comment on #3831 with an explanation of how adding support for MPI would help you. That might increase the likelihood of that feature being implemented in the future.
I suspect that, on each worker, a LightGBM model is being trained on only that worker's local data and that, at the end of training, you're getting only one of those boosters back. That is what I observed with the MPI-based build when we first integrated `lightgbm.dask`.
You can confirm that this is what's happening by inspecting the trained model with `Booster.trees_to_dataframe()`. For example: `lgbm_cls.booster_.trees_to_dataframe().head(5)`. Examine the `count` column: if the record counts reflect only one worker's share of the data rather than the full dataset, training was not actually distributed.
Hi, thanks for your response. I did not understand the part, "lightgbm.dask does not currently support the MPI-based build of LightGBM". Are you saying MPI-based LightGBM cannot handle distributed training? Because I already tried it and it works fine.
You should expect MPI-based LightGBM to work for distributed training in general, but today that requires that you run the training outside of `lightgbm.dask`, for example with the LightGBM CLI launched via `mpirun`.
When you say "works fine"...did you perform a test like the one I mentioned with `trees_to_dataframe()`?
Hi, is there a reason to fetch only the first 5 rows? Anyway, you are right. I did check with `trees_to_dataframe()`.
Every row in that table represents one tree node in the trained model. It just isn't necessary, for the purpose of this investigation, to look at the full structure of all trees.
But if we are not looking at the full tree, shouldn't the sum of the count column be incomplete?
Unless you are using bagging, the root node of any single tree already shows the total number of training records. You could, alternatively, figure out how many records are in a tree by summing the `count` column over only that tree's leaf nodes, since the leaves partition the training data.
Thanks a lot for the detailed explanation.
Sure, happy to help. I'll close this issue as I think we've reached a resolution.
This issue has been automatically locked since there has not been any recent activity since it was closed. |
In my current system, I have 2 nodes: the first node has 4 workers and the second node has 8 workers (as is also evident from the dask dashboard). I have specified `tree_learner='data'` when I initialize `DaskLGBMClassifier`. I am using the MPI version. In the scheduler log, I see the message "Only find one worker, will switch to serial tree learner", which I don't understand. In the dashboard, I have seen decent CPU utilization for all the workers.

Reproducible Example:
In Network.cpp, I found that `config.num_machines` is equal to the number of workers. However,

num_machines_ = linkers_->num_machines();

is setting `num_machines_` to 1 for some reason, which in turn causes the serial tree learner to be used instead of the data tree learner.

Environment Information:
OS: Linux
LightGBM: 3.3.2.99 (built from source using MPI, following instructions from https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html)
dask: 2021.9.1
distributed: 2021.9.1
Python: 3.8.6