
[dask] Shifting to Serial Tree Learner Despite Having multiple workers #4987

Closed
rudra0713 opened this issue Jan 28, 2022 · 10 comments

@rudra0713

In my current system, I have 2 nodes: the first node has 4 workers and the second has 8 workers (as is also evident from the Dask dashboard). I specified tree_learner='data' when initializing DaskLGBMClassifier. I am using the MPI version. In the scheduler log, I see the message "Only find one worker, will switch to serial tree learner", which I don't understand. In the dashboard, I have seen decent CPU utilization for all the workers.

Reproducible Example:

from dask.distributed import Client, get_client, wait
from sklearn.datasets import make_classification
import dask
import dask.array as da
import lightgbm as lgb

n_samples, n_features, num_test_samples = 10000000, 30, 100
dask.config.set({'distributed.worker.daemon': False})


def invoke_dis_lgbm(X, y, number_of_estimators):
    client_lgbm = get_client()
    lgbm_cls = lgb.DaskLGBMClassifier(
        client=client_lgbm, boosting_type='gbdt', objective='binary', num_leaves=50,
        learning_rate=0.1, n_estimators=number_of_estimators, max_depth=5,
        bagging_fraction=0.9, feature_fraction=0.9, reg_lambda=0.2, tree_learner='data',
    )
    lgbm_cls.fit(X, y)
    return


def main(client):
    num_of_workers = len(client.has_what())
    X_local, y_local = make_classification(n_samples=n_samples, n_features=n_features, random_state=12345)
    # split into roughly one chunk per worker so each worker holds an equal share of the data
    X = da.from_array(X_local, chunks=(n_samples // num_of_workers, n_features), name='train_feature')
    y = da.from_array(y_local, chunks=(n_samples // num_of_workers), name='train_label')

    X = client.persist(X)
    _ = wait(X)
    y = client.persist(y)
    _ = wait(y)

    _ = invoke_dis_lgbm(X, y, 100)
 
    return


if __name__ == '__main__':
    with Client('100.110.108.137:8786') as client:
        main(client)

In Network.cpp, I found that config.num_machines is equal to the number of workers. However, num_machines_ = linkers_->num_machines(); is setting num_machines_ to 1 for some reason, which in turn causes the serial tree learner to be used instead of the data tree learner.

Environment Information:
OS: Linux
LightGBM: 3.3.2.99 (built from source using MPI, following instructions from https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html)
dask: 2021.9.1
distributed: 2021.9.1
Python: 3.8.6

@jameslamb jameslamb added the dask label Jan 29, 2022
@jameslamb
Collaborator

I am using the MPI version. In the scheduler log.

lightgbm.dask does not currently support the MPI-based build of LightGBM. There is an open feature request but it isn't currently being worked on: #3831.

If you have a need for the MPI version that isn't met by the socket-based build, please comment on #3831 with an explanation of how adding support for MPI would help you. That might increase the likelihood of that feature being implemented in the future.

I see the message "Only find one worker, will switch to serial tree learner" which I don't understand. In the dashboard, I have seen decent CPU utilization for all the workers.

I suspect that, on each worker, a LightGBM model is being trained on only that worker's local data and that, at the end of training, you're getting only one of those boosters back. That is what I observed using the MPI version when we first integrated dask-lightgbm into this project: #3831 (comment)

lightgbm.dask collects Booster objects from every worker and then only returns the first one. With distributed training there shouldn't be any issue with this, since sync-ups during training ensure that every Booster should be identical.

https://github.com/microsoft/LightGBM/blob/0075814f02f898ae4070d848078b9d0d4b510ceb/python-package/lightgbm/dask.py#L789-791

You can confirm that this is what's happening by using the method .trees_to_dataframe() to check how many records are in the booster.

For example:

lgbm_cls.booster_.trees_to_dataframe().head(5)

Examine the count column of that output. I suspect you'll find that it's less than n_samples from your example code.
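
If it helps, here is a slightly fuller sketch of that check (assuming lgbm_cls is the fitted DaskLGBMClassifier from the reproducer above; the variable names below are just placeholders):

tree_df = lgbm_cls.booster_.trees_to_dataframe()

# the root node of each tree (node_depth == 1) records how many training
# rows that tree actually saw
root_counts = tree_df[tree_df['node_depth'] == 1].set_index('tree_index')['count']
print(root_counts.head())

# compare against the full dataset size the model was supposed to be trained on
print(n_samples)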

@rudra0713
Author

Hi, thanks for your response. I did not understand the part "lightgbm.dask does not currently support the MPI-based build of LightGBM". Are you saying MPI-based LightGBM cannot handle distributed training? I already tried lightgbm.DaskLGBMClassifier with the MPI version, and it seems to work fine.

@jameslamb
Collaborator

Are you saying MPI based Lightgbm cannot handle distributed training?

You should expect MPI-based LightGBM to work, but today that requires that you run the lightgbm CLI as an MPI application, for example using mpiexec (https://lightgbm.readthedocs.io/en/latest/Parallel-Learning-Guide.html#id2). You cannot use lightgbm.dask to run LightGBM as an MPI application.
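
To illustrate, a rough sketch of that kind of launch driven from Python (the config file name, machine list, and binary path below are placeholders, not taken from this issue):

import subprocess

# hypothetical example: 'train.conf' would set tree_learner=data, num_machines,
# and machine_list_filename; 'mlist.txt' lists the participating hosts
subprocess.run(
    ['mpiexec', '--machinefile', 'mlist.txt', './lightgbm', 'config=train.conf'],
    check=True,
)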

it seems to work fine.

When you say "works fine"...did you perform a test like the one I mentioned with .trees_to_dataframe() in my previous comment, ensuring that the model produced had a count for the first node indicating that it was trained on ALL of the training data?

@rudra0713
Author

Hi, is there a reason to fetch only the first 5 rows, lgbm_cls.booster_.trees_to_dataframe().head(5)?

Anyway, you are right. I did lgbm_cls.booster_.trees_to_dataframe().head(20), summed the count column, and it is still less than the total number of samples.

@jameslamb
Collaborator

Hi, is there a reason to fetch only the first 5 rows

Every row in that table represents one tree node in the trained model. It just isn't necessary, for the purpose of this investigation, to look at the full structure of all trees.

@rudra0713
Author

But if we are not looking at the full tree, won't the sum of the count column be incomplete?
Is our goal to find the number of samples that each tree node was trained on, or is the first node supposed to be trained on the full data?

@jameslamb
Collaborator

count in that output describes how many observations from the training data would end up in that node if it were a leaf node.

Unless you are using bagging_fraction < 1.0 (docs link), the count for the first node in a tree will be equal to the size of the dataset, since that node represents the tree before any splits have been made.

You could, alternatively, figure out how many records are in a tree by summing count over all leaf nodes in the tree. I just chose "look at the first row in the table" because I thought it was easier.
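
A quick sketch of both approaches, in case it's useful (assuming booster is any trained lightgbm.Booster; the variable names are just illustrative):

tree_df = booster.trees_to_dataframe()

# approach 1: the root node of each tree (node_depth == 1) carries the number
# of records that tree was trained on
root_counts = tree_df[tree_df['node_depth'] == 1].set_index('tree_index')['count']

# approach 2: sum 'count' over the leaf nodes of each tree
# (leaves are the rows with no children)
leaf_counts = tree_df[tree_df['left_child'].isnull()].groupby('tree_index')['count'].sum()

# with bagging_fraction = 1.0 both should equal the size of the training data
print(root_counts.head())
print(leaf_counts.head())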

@rudra0713
Author

Thanks a lot for the detailed explanation.

@jameslamb
Collaborator

Sure, happy to help. I'll close this issue as I think we've reached a resolution.

@jameslamb jameslamb changed the title Shifting to Serial Tree Learner Despite Having multiple workers [dask] Shifting to Serial Tree Learner Despite Having multiple workers Feb 1, 2022
@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 16, 2023