
[dask] make random port search more resilient to random collisions (fixes #4057) #4133

Merged (7 commits) on Mar 31, 2021

Conversation

jameslamb (Collaborator)

Changes in this PR

Background

#4057 documents a class of error that can cause distributed training with lightgbm.dask to fail. If you do not provide machines or local_listen_port at training time, LightGBM will randomly search for ports to use on each worker.

To limit the overhead introduced by this random search, a function _find_random_open_port() is run once, at the same time, on every worker, using distributed.Client.run() (added in #3823). If you have multiple Dask worker processes on the same physical host (i.e. if you're using distributed.LocalCluster or nprocs > 1), there is a small but non-zero probability that this setup will choose the same random port for multiple workers on that host. When that happens, LightGBM training will fail with an error like

Exception: LightGBMError('Socket send error, code: 104',)
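For reference: the random port search just asks the operating system for any free port. A minimal sketch of that setup is below, assuming an already-running cluster; the scheduler address is illustrative, and this shows the idea rather than LightGBM's exact code.

  import socket

  from dask.distributed import Client


  def _find_random_open_port() -> int:
      """Find a random open port by binding to port 0 and letting the OS pick one (sketch of the idea)."""
      with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
          s.bind(("", 0))
          return s.getsockname()[1]


  # run the search once, concurrently, on every worker;
  # returns a dict like {'tcp://127.0.0.1:40212': 51734, ...}
  client = Client("tcp://127.0.0.1:8786")  # illustrative scheduler address
  worker_address_to_port = client.run(_find_random_open_port)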

Based on #4112 (comment), I think my original assumptions about how likely such conflicts were (#4057 (comment)) were not correct.

How this improves lightgbm

Removes a source of instability that could cause distributed training with Dask to fail.

Improves the stability of LightGBM's tests, which should reduce the maintainer effort spent re-running failed builds.

@jameslamb added the fix label on Mar 29, 2021
@jameslamb requested a review from StrikerRUS on Mar 29, 2021 at 05:48
@jameslamb (Collaborator, Author)

Linking #4132 (comment), the finding that inspired this pull request.

@@ -371,6 +383,18 @@ def _train(
_find_random_open_port,
workers=list(worker_addresses)
)
# handle the case where _find_random_open_port() produces duplicates
retries_left = 10
while _worker_map_has_duplicates(worker_address_to_port) and retries_left > 0:
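For context on the helper referenced in this hunk: a duplicate check could look roughly like the sketch below. The (host, port) parsing is an assumption for illustration and may not match the PR's exact implementation.

  from urllib.parse import urlparse


  def _worker_map_has_duplicates(worker_map):
      """Return True if two workers on the same host were assigned the same port (illustrative sketch)."""
      seen = set()
      for worker_address, port in worker_map.items():
          # Dask worker addresses look like 'tcp://127.0.0.1:40212'
          host = urlparse(worker_address).hostname
          if (host, port) in seen:
              return True
          seen.add((host, port))
      return False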
Collaborator

I'm afraid that multiple re-runs of the same function still have a high probability of producing the same values, especially on the same physical machine.

If a is omitted or None, the current system time is used. If randomness sources are provided by the operating system, they are used instead of the system time (see the os.urandom() function for details on availability).
https://docs.python.org/3/library/random.html#random.seed

I believe a more reliable way to handle the case of identical ports would be to resolve conflicts manually, by simply incrementing the port until the conflicts are resolved.

Collaborator Author

ok, I'll change this to something like that. I was worried about making this a lot more complicated, but maybe it's unavoidable.

Collaborator

I'm sorry for this, I'll try to come up with a solution as well. FWIW the "randomness" isn't related to python's random.seed, since we're asking the OS for an open port and it decides which one to give us. I believe the collisions happen when a worker has completed the function and freed the port and another one is just asking for it and the OS just returns the same one (kinda troll). I'll see if we can put the port in wait for a bit or something like that.

Collaborator Author

alright, I tried a different approach in 05303c8. I think this will be more reliable. Instead of re-running _find_random_open_port() all at once for all workers again, the process will be like:

  1. run _find_random_open_port() for every worker
  2. if any duplicate IP-port pairs were found, run _find_random_open_port() again only for those workers that need to be changed to eliminate duplicates

I think that pattern (only re-running for the workers that need to be changed to resolve duplicates) should give us confidence that duplicates will be resolved, because the system time will change each time that function is run.
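A rough sketch of that retry pattern is below. The function name and keyword arguments match the _possibly_fix_worker_map_duplicates() helper exercised in the test excerpt further down, but the body here is an assumption for illustration, not the exact code in 05303c8.

  from urllib.parse import urlparse

  from dask.distributed import Client
  from lightgbm.dask import _find_random_open_port  # private helper shown in the diff above


  def _possibly_fix_worker_map_duplicates(client: Client, worker_map: dict, max_retries: int = 10) -> dict:
      """Re-run the port search, but only for workers whose (host, port) collides with another worker's."""
      for _ in range(max_retries):
          seen = set()
          workers_to_retry = []
          for worker_address, port in worker_map.items():
              host = urlparse(worker_address).hostname
              if (host, port) in seen:
                  workers_to_retry.append(worker_address)
              else:
                  seen.add((host, port))
          if not workers_to_retry:
              break
          # only the colliding workers ask the OS for a new random port
          new_ports = client.run(_find_random_open_port, workers=workers_to_retry)
          worker_map.update(new_ports)
      return worker_map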

I think this approach, which still relies on _find_random_open_port(), is preferable to just incrementing and checking if the new port is open, because it should (on average) find a new open port more quickly than the incrementing approach. Consider the case where, for example, LightGBM tries to put multiple workers in a LocalCluster on port 8887 (1 less than the default port for Jupyter). Jupyter uses such an "increment by one until you find an open port" approach, so if someone has multiple Jupyter sessions running it's possible that they might have ports 8888, 8889, 8890, and 8891 all occupied (for example), which would mean LightGBM would need 5 attempts to find a new open port (if 8892 is open).

I think the existence of systems like this is why Dask also searches randomly (instead of incrementing) if the default port it prefers for its scheduler is occupied when you run dask-scheduler. You can see that in the logs for LightGBM's Dask tests that use distributed.LocalCluster, for example from this build:

  /opt/conda/envs/test-env/lib/python3.9/site-packages/distributed/node.py:151: UserWarning: Port 8787 is already in use.
  Perhaps you already have a cluster running?
  Hosting the HTTP server on port 37211 instead
    warnings.warn(

  /opt/conda/envs/test-env/lib/python3.9/site-packages/distributed/node.py:151: UserWarning: Port 8787 is already in use.
  Perhaps you already have a cluster running?
  Hosting the HTTP server on port 35045 instead
    warnings.warn(

  /opt/conda/envs/test-env/lib/python3.9/site-packages/distributed/node.py:151: UserWarning: Port 8787 is already in use.
  Perhaps you already have a cluster running?
  Hosting the HTTP server on port 35051 instead
    warnings.warn(

  /opt/conda/envs/test-env/lib/python3.9/site-packages/distributed/node.py:151: UserWarning: Port 8787 is already in use.
  Perhaps you already have a cluster running?
  Hosting the HTTP server on port 38031 instead

  /opt/conda/envs/test-env/lib/python3.9/site-packages/distributed/node.py:151: UserWarning: Port 8787 is already in use.
  Perhaps you already have a cluster running?
  Hosting the HTTP server on port 41941 instead

Collaborator Author

@jmoralez I don't think you need to do any new work. Your review on the change I just pushed is welcome if you see anything wrong with it, but I'm fairly confident it will get the job done.

assert retry_msg not in capsys.readouterr().out

# should handle worker maps with duplicates
map_without_duplicates = {
Collaborator

map_with_duplicates

Collaborator Author

fixed in 75cfbbb

}
patched_map = lgb.dask._possibly_fix_worker_map_duplicates(
client=client,
worker_map=map_without_duplicates
Collaborator

map_with_duplicates

Collaborator Author

fixed in 75cfbbb

tests/python_package_test/test_dask.py (resolved)
@StrikerRUS (Collaborator) left a comment

Thank you!

@StrikerRUS (Collaborator)

@jmoralez Are you OK to merge this?

@jmoralez (Collaborator)

@StrikerRUS Yes

@github-actions (bot)

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

github-actions bot locked as resolved and limited conversation to collaborators on Aug 23, 2023