-
Notifications
You must be signed in to change notification settings - Fork 420
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for variable length dataloaders in DDP #3416
Add support for variable length dataloaders in DDP #3416
Conversation
6d35762
to
4647b54
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable to me! Would you mind adding a unit test intest_trainer.py
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Two minor nits and then we can merge :)
Co-authored-by: Mihir Patel <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! thanks for the PR
* Add support for variable length dataloaders in dist training * Remove test file * Fix typo * Fixed batch referenced before assignment * Replace sentinel with None * Add unit test * Update unit test * Reduce tensor creation to one line Co-authored-by: Mihir Patel <[email protected]> * Remove requirement for gpu in test --------- Co-authored-by: Mihir Patel <[email protected]>
* Add support for variable length dataloaders in dist training * Remove test file * Fix typo * Fixed batch referenced before assignment * Replace sentinel with None * Add unit test * Update unit test * Reduce tensor creation to one line Co-authored-by: Mihir Patel <[email protected]> * Remove requirement for gpu in test --------- Co-authored-by: Mihir Patel <[email protected]>
What does this PR do?
DDP currently requires dataloaders to be the same length on all ranks.
This PR adds support for dataloaders with rank-dependent lengths.
The solution terminates iteration for dataloaders on all ranks when the first dataloader finishes.
This does mean some batches on the ranks with longer dataloaders will be missed.
See details on testing below.
What issue(s) does this change relate to?
Fixes #3415.
Before submitting
pre-commit
on your change? (see thepre-commit
section of prerequisites)Testing
I used the the script below to test this new functionality. It would previously hang and then NCCL timeout without this change.
By checking the logs I can confirm that rank1 (with a longer dataloader) ends its iteration early at the same point that rank0 ends its iteration.
I'd be happy to incorporate this into the existing tests: please point me towards a similar test that uses DDP if you would like me to implement the test. Thank you!