[tests][cli] distributed training #4254
Conversation
Hi @jameslamb. Will this (line 202 in fcd2453) trigger the tests? In that case I think I have to fix the paths; everything right now is designed to work by calling pytest from tests/distributed/.
This ran successfully (logs), so I think it's ready for review. Right now it's only running two simple tests, but it could be extended to perform the tests that are run for the Dask interface; we could probably borrow most of the data creation and the tests themselves.
This is great, thank you for the effort to make this extensible and to fit it into this project's existing continuous integration setup!
Can you please add rules for the files that would be created by these tests to the top-level `.gitignore`?
I pulled this branch and ran the tests tonight (worked perfectly 🎉), and several files were untracked.
On branch tests/distributed
Your branch is up to date with 'origin/tests/distributed'.
Untracked files:
(use "git add <file>..." to include in what will be committed)
mlist.txt
predictions.txt
train.txt
train0.conf
train0.txt
train1.conf
train1.txt
nothing added to commit but untracked files present (use "git add" to track)
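For reference, a minimal set of entries covering the files listed above might look like this (only a sketch; the patterns and any path prefix should be adjusted to wherever the tests actually write these files):

```
# artifacts written by the distributed CLI tests
mlist.txt
predictions.txt
train.txt
train*.conf
train*.txt
```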
I see that right now, the logic for `_write_data()` is written like "evenly split the training data between the workers". I think it would be better to have `DistributedMockup.fit()` expect `data` to be a list of data chunks. In the few months since `lightgbm.dask` was first introduced, we've seen a lot of issues related to uneven distribution of the training data across workers (for example #3817, #3882, #4101 (comment), #4206).
If `fit()` took in `List[np.ndarray]`, then it would be easy to write tests on the behavior of distributed training in situations where, for example, feature distributions are very different on different partitions, or where the dataset sizes are very different.
What do you think?
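For example, something along these lines would then become easy to write (this is only a sketch of the proposed signature; `DistributedMockup` accepting pre-partitioned chunks is an assumption, not the current behavior):

```python
from typing import List

import numpy as np


def make_uneven_partitions(n_features: int = 10, seed: int = 42) -> List[np.ndarray]:
    """Two partitions with very different sizes and feature distributions."""
    rng = np.random.default_rng(seed)
    big = rng.normal(loc=0.0, scale=1.0, size=(1_000, n_features))
    small = rng.normal(loc=5.0, scale=2.0, size=(50, n_features))
    return [big, small]


# hypothetical usage, assuming fit() takes a list of chunks:
# model = DistributedMockup()
# model.fit(data=make_uneven_partitions())
```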
Can you please add docstrings and type hints (including return type hints) on the methods of `DistributedMockup`, `create_data()`, and `run_and_log()`? I think that will help others to be able to contribute changes to these tests in the future, and will give us a chance of finding subtle bugs with `mypy`.
As of #3756, we've been trying to encourage the use of type hints in this project's Python code.
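As a concrete illustration of the requested style (the function signature and body here are made up for the example; this is not the actual helper from this PR):

```python
import numpy as np


def create_data(n_samples: int = 1_000, n_features: int = 4, seed: int = 0) -> np.ndarray:
    """Create a random binary-classification dataset for the distributed tests.

    The returned array has shape ``(n_samples, n_features + 1)``;
    the last column is the label.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_features))
    y = (X.sum(axis=1) > 0).astype(np.float64)
    return np.hstack([X, y.reshape(-1, 1)])
```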
Notes for Other Reviewers
I support merging this PR with only a single job on `Linux_latest` (this is what I asked for in #3841 (comment)) and with only minimal tests like those in `test_classifier` and `test_regressor`. I have a preference for setting up the structure here and then incrementally adding tests later, like @jmoralez has proposed.
… include docstrings
Hi @jameslamb, I've incorporated your comments; I think sending the data already partitioned to `fit()` works well. I noticed that this job failed in the CI, but it seems to be testing on Windows, so I don't think it's related to this PR. Looking forward to your thoughts on these changes.
Yep, I don't think that's related to your changes. It's an error we see occasionally on AppVeyor.
I've manually restarted that build.
Thanks for the changes, looks great! I left a few more suggested changes, and one question that I think we need help with from other maintainers.
This looks great, thank you!
@shiyu1994 could you also please review? I think you're more familiar with the CLI than I am (and @StrikerRUS is currently reviewing a lot of other PRs 😀 )
Thanks for addressing the fundamental comments! Please now check some minor comments below.
… subprocess.run and check returncode
Thanks for addressing my previous comments! Here are some new comments about outdated docstrings and the `.gitignore` update.
Thank you very much for another valuable contribution, as always! LGTM, except for one typo and one simplification.
@StrikerRUS I've addressed your latest comments; let me know what you think.
@jmoralez Yeah, I saw. Many thanks! @jameslamb There have been several changes since your latest approval. Would you like to take another look before merging?
yes please, will look soon
Overall this looks good to me. I just left some minor suggestions.
And one more general one... Can you please replace all uses of `pathlib` with `os.path`? This project does not use `pathlib` anywhere else (you can confirm this with `git grep pathlib`), and keeping such things consistent reduces the maintenance burden for maintainers and friction for new contributors.
I think LightGBM/python-package/lightgbm/__init__.py, lines 26 to 29 in 0701a32, becomes this:

    dir_path = Path(__file__).parent
    version_file = dir_path / 'VERSION.txt'
    if version_file.is_file():
        with version_file.open() as f:
I believe we can start migrating from `os.path` to `pathlib`.
Unless there is a functional reason to prefer `pathlib`, I'd rather keep using `os.path`. Is there some functional reason to prefer it?
I've observed that some other "good first issue" types of efforts (like #4136) have attracted contributions that required a lot of maintainer effort. In my opinion, that is worthwhile if the project gets some functional benefit from the changes, but not something that we should encourage for a change whose primary benefit, as I understand it, is a style preference. If you'd like to make this change, @StrikerRUS, I won't oppose that. But I think the issue to document it should not use the "good first issue" label.
It makes code much clearer and it's portable across platforms. See, for example, LightGBM/tests/c_api_test/test_.py, lines 17 to 20 in bb39bc9.
One good article:
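For concreteness, here is a small side-by-side of the two styles (my own illustration, not taken from the files referenced above); both resolve the same path:

```python
import os
from pathlib import Path

# current style in the project: os.path
version_file_os = os.path.join(os.path.dirname(os.path.realpath(__file__)), "VERSION.txt")
has_version_os = os.path.isfile(version_file_os)

# proposed style: pathlib
version_file_pl = Path(__file__).resolve().parent / "VERSION.txt"
has_version_pl = version_file_pl.is_file()
```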
To be honest, I've never been a fan of "good first issue" labels, for the reasons you've written above.
Great! I'll prepare an issue then.
Looks good, thanks very much for the help!
@jameslamb Can we merge this?
Yep! Sorry, I should have merged it right after my review. I got distracted by something else and forgot to come back to this PR. Thanks for the reminder.
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
This contributes to #3841 by proposing a way to run tests on distributed training by calling the CLI from Python.
Here I define a class `DistributedMockup` (suggestions for a better name are more than welcome) that implements a `fit` method that creates the required configuration files, saves the data, and then runs the lightgbm binary using a `ThreadPoolExecutor`. The training logs are taken from the first machine in `mlist.txt` and are piped to stdout. There's also a `predict` method that computes the predictions using the trained model on a single call to the binary (I don't think it's necessary here to make a call on each partition).
I tested this on my fork using GitHub Actions; however, this will run on Azure Pipelines, so I'll be working on that following the very detailed explanation here: #3841 (comment)