[tests][cli] distributed training #4254

jmoralez · 2021-05-04T23:47:23Z

This ~~aims to close~~ contributes to #3841 by proposing a way to run tests on distributed training calling the CLI from python.

Here I define a class DistributedMockup (suggestions for a better name are more than welcome) that implements a fit method that creates the required configuration files, saves the data and then runs the lightgbm binary using a ThreadPoolExecutor. The training logs are taken from the first machine in mlist.txt and are piped to stdout.

There's also a predict method that computes the predictions using the trained model on a single call to the binary (don't think it's necessary here to make a call on each partition).

I tested this on my fork using github actions, however this will run on azure pipelines so I'll be working on that following the very detailed explanation here: #3841 (comment)

jmoralez · 2021-05-05T00:13:26Z

Hi @jameslamb. Will this:

LightGBM/.ci/test.sh

Line 202 in fcd2453

pytest $BUILD_DIRECTORY/tests || exit -1

trigger the tests?
In that case I think I have to fix the paths, everything right now is designed to work by calling pytest from tests/distributed/

jmoralez · 2021-05-06T00:33:45Z

This ran successfully (logs) so I think it's ready for review. Right now it's only running two simple tests but could be extended to perform the tests that are run for the dask interface, we could probably borrow most of the data creation and the tests themselves.

jameslamb

This is great, thank you for the effort to make this extensible and to fit it into this project's existing continuous integration setup!

Can you please add rules for the files that would be created by these tests to the top-level .gitignore?

I pulled this branch and ran the tests tonight (worked perfectly 🎉 ) , and several files were untracked.

On branch tests/distributed
Your branch is up to date with 'origin/tests/distributed'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	mlist.txt
	predictions.txt
	train.txt
	train0.conf
	train0.txt
	train1.conf
	train1.txt

nothing added to commit but untracked files present (use "git add" to track)

I see that right now, the logic for _write_data() is written like "evenly split the training data between the workers". In my opinion, I think it would be better to have DistributedMockup.fit() expect data to be a list of data chunks. In the few months since lightgbm.dask was first introduced, we've seen a lot of issues related to uneven distribution of the training data across workers (for example #3817, #3882, #4101 (comment), #4206).

If fit() took in List[np.ndarray], then it would be easy to write tests on the behavior of distributed training in situations where, for example, feature distributions are very different on different partitions, or where the dataset sizes are very different.

What do you think?

Can you please add docstrings and type hints (including return type hints) on the methods of DistributedMockup, create_data(), and run_and_log()? I think that will help others to be able to contribute changes to these tests in the future, and will give us a chance of finding subtle bugs with mypy.

As of #3756, we've been trying to encourage the use of type hints in this project's Python code.

Notes for Other Reviewers

I support merging this PR with only a single job on Linux_latest (this is what I asked for in #3841 (comment)) and with only minimal tests like those in test_classifier and test_regressor. I have a preference for setting up the structure here and then incrementally adding tests later, like @jmoralez has proposed.

… include docstrings

jmoralez · 2021-05-11T02:26:27Z

Hi @jameslamb, I've included your comments, I think sending the data already partitioned to the fit method makes a lot of sense.

I noticed that mypy is only ran for the python-package folder in the lint task so I included it in the cli-distributed task to check the _test_distributed.py file.

This job failed in the CI but seems to be testing on windows, so I don't think is related to this PR. Looking forward to your thoughts on these changes.

jameslamb · 2021-05-11T23:03:06Z

This job failed in the CI but seems to be testing on windows, so I don't think is related to this PR. Looking forward to your thoughts on these changes.

Yep, I don't think that's related to your changes. It's an error we see occasionally on Appveyor.

C:\Miniconda37-x64\envs\test-env\lib\site-packages\lightgbm\basic.py in _lazy_init(self, data, label, reference, weight, group, init_score, predictor, silent, feature_name, categorical_feature, params)
   1268             self.__init_from_csc(data, params_str, ref_dataset)
   1269         elif isinstance(data, np.ndarray):
-> 1270             self.__init_from_np2d(data, params_str, ref_dataset)
   1271         elif isinstance(data, list) and len(data) > 0 and all(isinstance(x, np.ndarray) for x in data):
   1272             self.__init_from_list_np2d(data, params_str, ref_dataset)
C:\Miniconda37-x64\envs\test-env\lib\site-packages\lightgbm\basic.py in __init_from_np2d(self, mat, params_str, ref_dataset)
   1318             c_str(params_str),
   1319             ref_dataset,
-> 1320             ctypes.byref(self.handle)))
   1321         return self
   1322 
OSError: exception: access violation writing 0xFFFFFFFF93070000
OSError: exception: access violation writing 0xFFFFFFFF93070000

I've manually restarted that build.

jameslamb

Thanks for the changes, looks great! I left a few more suggested changes, and one question that I think we need help with from other maintainers.

.ci/test.sh

tests/distributed/_test_distributed.py

jameslamb

This looks great, thank you!

@shiyu1994 could you also please review? I think you're more familiar with the CLI than I am (and @StrikerRUS is currently reviewing a lot of other PRs 😀 )

StrikerRUS

Thanks for addressing fundamental comments! Please now check some minor comments below.

tests/distributed/_test_distributed.py

tests/distributed/predict.conf

… subprocess.run and check returncode

StrikerRUS

Thanks for addressing my previous comments! Here are some new comments about outdated docstrings and .gitignore update.

tests/distributed/_test_distributed.py

.gitignore

StrikerRUS

Thank you very much for as always valuable contribution! LGTM, except one typo and one simplification.

tests/distributed/_test_distributed.py

jmoralez · 2021-06-23T23:49:49Z

@StrikerRUS I've added your latest comments, let me know what you think.

StrikerRUS · 2021-06-24T19:13:48Z

@jmoralez Yeah, I saw. Many thanks!
I've already approved this PR (#4254 (review)).

@jameslamb There have been several changes since your latest approval. Would you like to take another look before merging?

jameslamb · 2021-06-25T02:23:40Z

@jameslamb There have been several changes since your latest approval. Would you like to take another look before merging?

yes please, will look soon

jameslamb

Overall this looks good to me. I just left some minor suggestions.

And one more general one... Can you please replace all uses of pathlib with os.path?

This project does not use pathlib anywhere else (you can confirm this with git grep pathlib), and keeping such things consistent reduces the maintenance burden for maintainers and friction for new contributors.

tests/distributed/_test_distributed.py

jmoralez · 2021-06-25T23:20:46Z

I think pathlib should be considered instead of os.path, for example this:

LightGBM/python-package/lightgbm/__init__.py

Lines 26 to 29 in 0701a32

    
           dir_path = os.path.dirname(os.path.realpath(__file__)) 
        
           if os.path.isfile(os.path.join(dir_path, 'VERSION.txt')): 
        
               with open(os.path.join(dir_path, 'VERSION.txt')) as version_file:

becomes this:

dir_path = Path(__file__).parent
version_file = dir_path / 'VERSION.txt'
if version_file.is_file():
    with version_file.open() as f:

StrikerRUS · 2021-06-26T12:37:43Z

I think pathlib should be considered instead of os.path, for example this:

I believe we can start migrating from os.path to pathlib similarly to #4136. @jameslamb WDYT?

jameslamb · 2021-06-27T05:20:26Z

I believe we can start migrating from os.path to pathlib

Unless there is a functional reason to prefer pathlib (e.g. efficiency, compatibility), I don't support transitioning this project's Python code from os.path to pathlib. I don't personally think that the example in #4254 (comment) is very compelling.

Is there some functional reason to prefer it?

similarly to #4136

I've observed that some other "good first issue" types of efforts (like #4136) have attracted contributions that required a lot of maintainer effort. In my opinion, that is worthwhile if the project gets some functional benefit from the changes, but not something that we should encourage for a change whose primary benefit, as I understand it, is a style preference.

If you'd like to make this change, @StrikerRUS , I won't oppose that. But I think the issue to document it should not use the good first issue tag, and that maintainers like you should do some of the work and not worry about preserving it for new contributors.

StrikerRUS · 2021-06-27T13:33:44Z

Is there some functional reason to prefer it?

It makes code much clearer and it's portable across platforms. With os.path module, ideally, we should use os.pardir instead of .. and os.path.sep instead of / inside os.path.join(), for example. But we often forget about it:

LightGBM/tests/c_api_test/test_.py

Lines 17 to 20 in bb39bc9

    
           os.path.join(curr_path, '../../'), 
        
           os.path.join(curr_path, '../../python-package/lightgbm/compile'), 
        
           os.path.join(curr_path, '../../python-package/compile'), 
        
           os.path.join(curr_path, '../../lib/')]

One good article:
https://treyhunner.com/2018/12/why-you-should-be-using-pathlib/

I've observed that some other "good first issue" types of efforts (like #4136) have attracted contributions that required a lot of maintainer effort.

To be honest, I've never been a fan of "good first issue"s due to the reasons you've written above.

If you'd like to make this change, @StrikerRUS , I won't oppose that.

Great! I'll prepare an issue then.

jameslamb

Looks good, thanks very much for the help!

StrikerRUS · 2021-07-03T12:08:22Z

@jameslamb Can we merge this?

jameslamb · 2021-07-04T05:09:49Z

@jameslamb Can we merge this?

yep! Sorry, should have merged it right after my review. I got distracted by something else and forgot to come back to this PR. Thanks for the @.

github-actions · 2023-08-23T18:57:18Z

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

jmoralez added 3 commits May 4, 2021 18:19

include distributed tests

7356527

remove github action file

711009c

merge master

3b9aac0

jmoralez added 2 commits May 5, 2021 19:08

try CI

e3c96a8

build shared library and fix linting error

eabb3b7

jmoralez marked this pull request as ready for review May 6, 2021 00:29

jmoralez requested review from chivee, guolinke, henry0312, jameslamb, shiyu1994, StrikerRUS and wxchan as code owners May 6, 2021 00:29

jmoralez changed the title ~~WIP: [tests][cli] distributed training (fixes #3841)~~ [tests][cli] distributed training (fixes #3841) May 6, 2021

jameslamb added the maintenance label May 6, 2021

jameslamb requested changes May 10, 2021

View reviewed changes

jmoralez added 2 commits May 10, 2021 19:38

ignore files created for testing. add type hints and check with mypy.…

9a4ad3b

… include docstrings

lint

1d108e1

jameslamb added the in progress label May 11, 2021

jameslamb requested changes May 11, 2021

View reviewed changes

.ci/test.sh Outdated Show resolved Hide resolved

tests/distributed/_test_distributed.py Outdated Show resolved Hide resolved

tests/distributed/_test_distributed.py Outdated Show resolved Hide resolved

jmoralez added 2 commits May 11, 2021 22:14

use pre_partition and write separate model files. remove mypy

9f51ad8

update docs

17ee6ab

jameslamb self-requested a review May 12, 2021 04:08

jameslamb approved these changes May 12, 2021

View reviewed changes

jameslamb added awaiting review and removed in progress labels May 12, 2021

jameslamb requested a review from StrikerRUS May 25, 2021 03:42

StrikerRUS reviewed May 26, 2021

View reviewed changes

write predict.conf in the predict method. more robust port setup. use…

f5e9d49

… subprocess.run and check returncode

StrikerRUS reviewed May 30, 2021

View reviewed changes

jmoralez added 4 commits May 31, 2021 20:21

add paths to tests and binary. remove lgb dependency. update .igtignore.

f7c6fdd

lint

c964fdf

allow to pass executable dir as argument to pytest

56113de

pass execfile to pytest instead of execdir

83ccf6a

StrikerRUS approved these changes Jun 12, 2021

View reviewed changes

tests/distributed/_test_distributed.py Outdated Show resolved Hide resolved

tests/distributed/_test_distributed.py Outdated Show resolved Hide resolved

add suggestions

e84df75

jameslamb self-requested a review June 25, 2021 03:25

jameslamb requested changes Jun 25, 2021

View reviewed changes

tests/distributed/_test_distributed.py Outdated Show resolved Hide resolved

tests/distributed/_test_distributed.py Outdated Show resolved Hide resolved

use os.path and add type hint to predict_config

45bfc79

Update tests/distributed/_test_distributed.py

d4cbb7a

jameslamb approved these changes Jun 28, 2021

View reviewed changes

StrikerRUS mentioned this pull request Jun 28, 2021

Migrate to pathlib for the Python code #4416

Closed

38 tasks

jameslamb merged commit b699fa6 into microsoft:master Jul 4, 2021

StrikerRUS mentioned this pull request Jul 7, 2021

[tests] clarify RuntimeError in distributed tests #4452

Merged

jmoralez deleted the tests/distributed branch July 8, 2021 02:40

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[tests][cli] distributed training #4254

[tests][cli] distributed training #4254

jmoralez commented May 4, 2021 •

edited by jameslamb

Loading

jmoralez commented May 5, 2021

jmoralez commented May 6, 2021

jameslamb left a comment

jmoralez commented May 11, 2021

jameslamb commented May 11, 2021

jameslamb left a comment

jameslamb left a comment

StrikerRUS left a comment

StrikerRUS left a comment

StrikerRUS left a comment

jmoralez commented Jun 23, 2021

StrikerRUS commented Jun 24, 2021

jameslamb commented Jun 25, 2021

jameslamb left a comment

jmoralez commented Jun 25, 2021

StrikerRUS commented Jun 26, 2021

jameslamb commented Jun 27, 2021

StrikerRUS commented Jun 27, 2021

jameslamb left a comment

StrikerRUS commented Jul 3, 2021

jameslamb commented Jul 4, 2021

github-actions bot commented Aug 23, 2023

[tests][cli] distributed training #4254

[tests][cli] distributed training #4254

Conversation

jmoralez commented May 4, 2021 • edited by jameslamb Loading

jmoralez commented May 5, 2021

jmoralez commented May 6, 2021

jameslamb left a comment

Choose a reason for hiding this comment

Notes for Other Reviewers

jmoralez commented May 11, 2021

jameslamb commented May 11, 2021

jameslamb left a comment

Choose a reason for hiding this comment

jameslamb left a comment

Choose a reason for hiding this comment

StrikerRUS left a comment

Choose a reason for hiding this comment

StrikerRUS left a comment

Choose a reason for hiding this comment

StrikerRUS left a comment

Choose a reason for hiding this comment

jmoralez commented Jun 23, 2021

StrikerRUS commented Jun 24, 2021

jameslamb commented Jun 25, 2021

jameslamb left a comment

Choose a reason for hiding this comment

jmoralez commented Jun 25, 2021

StrikerRUS commented Jun 26, 2021

jameslamb commented Jun 27, 2021

StrikerRUS commented Jun 27, 2021

jameslamb left a comment

Choose a reason for hiding this comment

StrikerRUS commented Jul 3, 2021

jameslamb commented Jul 4, 2021

github-actions bot commented Aug 23, 2023

jmoralez commented May 4, 2021 •

edited by jameslamb

Loading