
[dask] Shifting to Serial Tree Learner Despite Having multiple workers #4987

Closed
rudra0713 opened this issue Jan 28, 2022 · 10 comments

@rudra0713

In my current system, I have 2 nodes: the first node has 4 workers and the second has 8 workers (as is also evident from the Dask dashboard). I specified tree_learner='data' when initializing DaskLGBMClassifier. I am using the MPI version. In the scheduler log, I see the message "Only find one worker, will switch to serial tree learner", which I don't understand. In the dashboard, I have seen decent CPU utilization for all the workers.

Reproducible Example:

from dask.distributed import Client, get_client, wait
from sklearn.datasets import make_classification
import dask
import dask.array as da
import lightgbm as lgb

n_samples, n_features, num_test_samples = 10000000, 30, 100
dask.config.set({'distributed.worker.daemon': False})


def invoke_dis_lgbm(X, y, number_of_estimators):
    client_lgbm = get_client()
    lgbm_cls = lgb.DaskLGBMClassifier(
        client=client_lgbm, boosting_type='gbdt', objective='binary', num_leaves=50,
        learning_rate=0.1, n_estimators=number_of_estimators, max_depth=5,
        bagging_fraction=0.9, feature_fraction=0.9, reg_lambda=0.2, tree_learner='data',
    )
    lgbm_cls.fit(X, y)
    return


def main(client):
    num_of_workers = len(client.has_what())
    X_local, y_local = make_classification(n_samples=n_samples, n_features=n_features, random_state=12345)
    # split into roughly one chunk per worker so each worker holds an equal share of the data
    X = da.from_array(X_local, chunks=(n_samples // num_of_workers, n_features), name='train_feature')
    y = da.from_array(y_local, chunks=(n_samples // num_of_workers), name='train_label')

    X = client.persist(X)
    _ = wait(X)
    y = client.persist(y)
    _ = wait(y)

    _ = invoke_dis_lgbm(X, y, 100)
 
    return


if __name__ == '__main__':
    with Client('100.110.108.137:8786') as client:
        main(client)

In Network.cpp, I found that config.num_machines is equal to the number of workers. However, num_machines_ = linkers_->num_machines(); is setting num_machines_ to 1 for some reason, which in turn causes the serial tree learner to be used instead of the data tree learner.

Environment Information:
OS: Linux
LightGBM: 3.3.2.99 (built from source using MPI, following instructions from https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html)
dask: 2021.9.1
distributed: 2021.9.1
Python: 3.8.6

@jameslamb jameslamb added the dask label Jan 29, 2022
@jameslamb
Collaborator

I am using the MPI version. In the scheduler log.

lightgbm.dask does not currently support the MPI-based build of LightGBM. There is an open feature request but it isn't currently being worked on: #3831.

If you have a need for the MPI version that isn't met by the socket-based build, please comment on #3831 with an explanation of how adding support for MPI would help you. That might increase the likelihood of that feature being implemented in the future.

I see the message "Only find one worker, will switch to serial tree learner" which I don't understand. In the dashboard, I have seen decent CPU utilization for all the workers.

I suspect that, on each worker, a LightGBM model is being trained on only that worker's local data and that, at the end of training, you're getting only one of those boosters back. That is what I observed using the MPI version when we first integrated dask-lightgbm into this project: #3831 (comment)

lightgbm.dask collects Booster objects from every worker and then only returns the first one. With distributed training there shouldn't be any issue with this, since sync-ups during training ensure that every Booster should be identical.

https://github.com/microsoft/LightGBM/blob/0075814f02f898ae4070d848078b9d0d4b510ceb/python-package/lightgbm/dask.py#L789-791

You can confirm that this is what's happening by using the method .trees_to_dataframe() to check how many records are in the booster.

For example:

lgbm_cls.booster_.trees_to_dataframe().head(5)

Examine the count column of that output. I suspect you'll find that it's less than n_samples from your example code.
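
If it helps, here is a slightly fuller sketch of that check (assuming lgbm_cls is the fitted DaskLGBMClassifier from the reproducer above; the variable names below are just placeholders):

tree_df = lgbm_cls.booster_.trees_to_dataframe()

# the root node of each tree (node_depth == 1) records how many training
# rows that tree actually saw
root_counts = tree_df[tree_df['node_depth'] == 1].set_index('tree_index')['count']
print(root_counts.head())

# compare against the full dataset size the model was supposed to be trained on
print(n_samples)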

@rudra0713
Author

Hi, thanks for your response. I did not understand the part "lightgbm.dask does not currently support the MPI-based build of LightGBM". Are you saying MPI-based LightGBM cannot handle distributed training? I already tried lightgbm.DaskLGBMClassifier with the MPI version, and it seems to work fine.

@jameslamb
Collaborator

Are you saying MPI based Lightgbm cannot handle distributed training?

You should expect MPI-based LightGBM to work, but today that requires that you run the lightgbm CLI as an MPI application, for example using mpiexec (https://lightgbm.readthedocs.io/en/latest/Parallel-Learning-Guide.html#id2). You cannot use lightgbm.dask to run LightGBM as an MPI application.
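
To illustrate, a rough sketch of that kind of launch driven from Python (the config file name, machine list, and binary path below are placeholders, not taken from this issue):

import subprocess

# hypothetical example: 'train.conf' would set tree_learner=data, num_machines,
# and machine_list_filename; 'mlist.txt' lists the participating hosts
subprocess.run(
    ['mpiexec', '--machinefile', 'mlist.txt', './lightgbm', 'config=train.conf'],
    check=True,
)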

it seems to work fine.

When you say "works fine"...did you perform a test like the one I mentioned with .trees_to_dataframe() in my previous comment, ensuring that the model produced had a count for the first node indicating that it was trained on ALL of the training data?

@rudra0713
Author

Hi, is there a reason to fetch only the first 5 rows, lgbm_cls.booster_.trees_to_dataframe().head(5)?

Anyway, you are right. I did lgbm_cls.booster_.trees_to_dataframe().head(20), summed the count column, and it is still less than the total number of samples.

@jameslamb
Collaborator

Hi, is there a reason to fetch only the first 5 rows

Every row in that table represents one tree node in the trained model. It just isn't necessary, for the purpose of this investigation, to look at the full structure of all trees.

@rudra0713
Author

But if we are not looking at the full tree, won't the sum of the count column be incomplete?
Is our goal to find the number of samples that each tree node was trained on, or is the first node supposed to be trained on the full data?

@jameslamb
Collaborator

count in that output describes how many observations from the training data would end up in that node if it were a leaf node.

Unless you are using bagging_fraction < 1.0 (docs link), the count for the first node in a tree will be equal to the size of the dataset, since that node represents the tree before any splits have been made.

You could, alternatively, figure out how many records are in a tree by summing count over all leaf nodes in the tree. I just chose "look at the first row in the table" because I thought it was easier.
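
A quick sketch of both approaches, in case it's useful (assuming booster is any trained lightgbm.Booster; the variable names are just illustrative):

tree_df = booster.trees_to_dataframe()

# approach 1: the root node of each tree (node_depth == 1) carries the number
# of records that tree was trained on
root_counts = tree_df[tree_df['node_depth'] == 1].set_index('tree_index')['count']

# approach 2: sum 'count' over the leaf nodes of each tree
# (leaves are the rows with no children)
leaf_counts = tree_df[tree_df['left_child'].isnull()].groupby('tree_index')['count'].sum()

# with bagging_fraction = 1.0 both should equal the size of the training data
print(root_counts.head())
print(leaf_counts.head())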

@rudra0713
Author

Thanks a lot for the detailed explanation.

@jameslamb
Collaborator

Sure, happy to help. I'll close this issue as I think we've reached a resolution.

@jameslamb jameslamb changed the title Shifting to Serial Tree Learner Despite Having multiple workers [dask] Shifting to Serial Tree Learner Despite Having multiple workers Feb 1, 2022
@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 16, 2023