[tests][cli] distributed training #4254
Sorry for writing this so late!
Do you @jameslamb really think that parallel tests should be executed as a separate CI job? I believe that the environment initialization overhead is quite large in this case.
From my point of view, we should add new CI jobs only when the new job requires something special that we haven't done before as part of our CI scripts.
For example:
- `dask` is tested within our ordinary (mostly Python-package) tests (we just added some dependencies to the conda env); we don't have a separate `dask` CI job
- `mpi` requires MPI installation and is mutually exclusive with the ordinary installation, hence we have a separate `mpi` CI job
- `gpu` requires installation of `boost` and GPU drivers and is mutually exclusive with the ordinary installation, hence we have a separate `gpu` CI job
- `regular`, `sdist`, `bdist` jobs are present because they all test different ways of Python-package installation and its correctness
- `swig` only compiles the SWIG wrapper and doesn't run any tests (even for such a scenario we used to compile SWIG artifacts within the `sdist` job before we got self-hosted agents to balance CI loading)
- `cpp_tests` requires special CMake flags and doesn't require Python ecosystem initialization, hence we have a separate `cpp-tests` CI job

Also, please note that C API tests are run with each Python-package-related CI job.
https://github.com/microsoft/LightGBM/blob/master/tests/c_api_test/test_.py
I believe that basic distributed tests which do not require MPI installation or other special setup can be run within each Python-package-related CI job. Please share your opinion on this.
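For concreteness, here is a minimal sketch of what such a "basic", socket-only distributed CLI test could look like. This is illustrative only and not the code from this PR; the file names, port numbers, config values, and the assumption that a `lightgbm` CLI binary is on `PATH` are all hypothetical.

```python
# Hedged sketch: drive two socket-based CLI workers from an ordinary pytest test.
# Assumes a `lightgbm` CLI binary on PATH; all names/ports below are illustrative.
import random
import subprocess


def write_data(path, n_rows=200, n_cols=5):
    # Tiny synthetic CSV dataset: label in the first column, then features.
    rng = random.Random(42)
    rows = []
    for _ in range(n_rows):
        feats = [rng.random() for _ in range(n_cols)]
        label = sum(feats) + 0.1 * rng.random()
        rows.append(",".join(str(v) for v in [label] + feats))
    path.write_text("\n".join(rows) + "\n")


def write_config(path, rank, num_machines=2, base_port=12400):
    # One config file per worker; machine_list_file lists every worker's ip/port.
    path.write_text(
        "\n".join([
            "task = train",
            "objective = regression",
            "tree_learner = data",   # socket-based data-parallel learner
            "pre_partition = true",  # each worker reads its own shard
            f"num_machines = {num_machines}",
            f"local_listen_port = {base_port + rank}",
            "machine_list_file = mlist.txt",
            f"data = train{rank}.csv",
            "num_iterations = 5",
            f"output_model = model{rank}.txt",
        ])
    )


def test_basic_distributed_training(tmp_path):
    (tmp_path / "mlist.txt").write_text("127.0.0.1 12400\n127.0.0.1 12401\n")
    procs = []
    for rank in range(2):
        write_data(tmp_path / f"train{rank}.csv")
        write_config(tmp_path / f"train{rank}.conf", rank)
        procs.append(
            subprocess.Popen(["lightgbm", f"config=train{rank}.conf"], cwd=tmp_path)
        )
    # Both workers must finish successfully and produce a model file.
    assert all(p.wait(timeout=300) == 0 for p in procs)
    assert (tmp_path / "model0.txt").exists()
```

A test like this only needs the compiled CLI and pytest, which is why it could in principle ride along with any existing Python-package CI job.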
Specifically, the most recent run of this job took 4m15s, about 1m40s of which was spent compiling `lightgbm` and running the tests (based on timestamps in the logs).
Reference: https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=10053&view=logs&j=cd496dfc-df55-5b41-9968-b20563c04f1a
They are "basic" today because I've asked @jmoralez to focus this first PR on only setting up the structure and not on adding a lot of additional tests. But I expect that a lot more tests will be added in the future. Every time we see an issue with `lightgbm.dask` or from `mmlspark` and wonder "is this a problem at the level of those interfaces or something in LightGBM's core library?", I think we'll want to add tests here.
I think it's possible that a few months from now, these distributed training tests might take 3 or 4 minutes to run (just the test part, not all of the environment setup). Instead of adding 3 to 4 minutes to each Python CI job, I thought it made sense to add only a single job as a starting point.
If you don't find that convincing, I don't feel strongly enough about it to block the PR until we agree. If you still think these tests should be run on every Python job instead of as a standalone, please advise @jmoralez on specifically what changes should be made (for example).
I do support this.
For sure, that was the reason I created #3841.
You are right that separate CI jobs have their own advantages, but I think the main benefit of embedding distributed tests into the CI jobs with the main tests is that you get good coverage and balance across different OSes, compilers, and environments in general. You don't need to think about where to set up a new job (Azure, GitHub Actions, Appveyor), what compiler to choose on one CI service and what compiler on another, or invent new `if` rules for `setup.py` or anything else. It is already done and you get it for free.
I believe it's a bad idea to merge PRs that are "convenient" for only one of the maintainers.
We are not speaking about just theoretical things; we are discussing this particular PR and the changes it proposes to merge into the `master` branch. I don't agree that this is "delaying" the PR somehow. This is what PRs are designed for: discussing the changes they contain. With new info uncovered, some previous statements may become irrelevant, for example. But this should be discussed before the merge to prevent code reverts in the future, I suppose.
I don't disagree. Sorry if it sounded like I was complaining, that was not my intention.
In this case, @jmoralez WILL have to think about those things, all at once, right now. Since distributed training with the CLI isn't currently tested on Mac or Windows, for example, if failures show up when adding those tests to the jobs for those operating systems, then they'll have to be fixed (or ignored with `if` or `pytest.skip`; see the sketch after this comment) before this can be merged. The same goes for other combinations, like different Python versions.
So, said another way, I have a preference for taking smaller, incremental steps for adding this particular type of testing. I prefer adding this type of testing in a single job right now, because then two types of work (changing which configurations the job runs on, and adding new tests) can happen in parallel.
I think you are saying you'd prefer to add broader coverage in this first PR. I am ok with that and happy to support it.
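As a side note on the `if` / `pytest.skip` option mentioned above, a guard like the following could keep such tests from failing on operating systems where distributed CLI training isn't exercised yet. This is purely illustrative, not code from this PR; the platform list and reason text are assumptions.

```python
# Illustrative platform guard for distributed-CLI tests (not code from this PR).
import platform

import pytest

# Skip the whole module on platforms where these tests aren't expected to pass yet.
pytestmark = pytest.mark.skipif(
    platform.system() in ("Darwin", "Windows"),
    reason="distributed CLI training is currently only tested on Linux",
)


def test_distributed_training_runs():
    # ... launch and check the CLI workers here ...
    pass
```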
Oh no, definitely not! Only if he'd love to do it.
As an example, please remember adding cpp tests with GTest. First, the possibility of compiling the tests was added in #3555 and was checked for correctness: #3555 (comment). Then we actually added CI jobs with cpp tests: #4166. And right now different tests are proposed to be added in pending PRs. I think this is a good example of splitting such work into small steps.
As the core code of this PR is independent of the way we will execute it, I think we can go in the same direction here: leave only the testing code in this PR and add CI jobs calling it in a follow-up PR (or PRs). WDYT?
Oh ok, I guess I misunderstood. I thought that your comment at the beginning of this thread was saying that you wanted this PR to add these tests to all existing Python jobs. I think now that maybe you just meant "for us to consider #3841 fixed, it should be added to all of those jobs".
Sure! Let's do that, I agree. So then I think this PR would need the following changes:
- remove `fixes` from the PR title and `close` from the PR description
- remove the changes to `.vsts-ci.yml` and `.ci/test.sh`
@StrikerRUS do you agree?
Yeah, #3841 should be "fixed" only after adding some meaningful number of tests for distributed code.
Totally agree.
Great. @jmoralez when you have time, could you please remove the changes from `.vsts-ci.yml` and `.ci/test.sh`?
I've removed "fixes" from the PR title.