Summary:
Pull Request resolved: #1797
The training benchmark was broken by multiprocessing issues on servicelab. This diff has been tested to confirm that world size 4 runs on servicelab.
World size 8 does not finish within the 40-minute time limit. More investigation is required, and there is also significant overhead in starting and tearing down processes for each sharding paradigm. However, this change allows some level of testing for future diffs to prevent training regressions.
Differential Revision: D54880542
fbshipit-source-id: e8b001471c316c3d1436f4a42a67ab0ae2b51502