NeuralLinearBandit deadlock avoidance
Summary: When running in a distributed environment, the fix in D65428925 for zero-weight batches can lead to deadlock. Some workers perform the LinearRegression update while others skip it, but the update internally calls `torch.distributed.all_reduce`, a collective that blocks until every rank participates, so the workers that entered the update hang waiting for the ones that skipped it. I had previously assumed that my remote jobs were hanging due to an unrelated bug, but this turned out to be the root cause; distributed jobs complete successfully with this modification.

Reviewed By: alexnikulkov

Differential Revision: D65556041

fbshipit-source-id: 1a4ea7eb5211622d452ea5b843a2307e73fe2523
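The failure mode can be illustrated with a minimal sketch. This is a hypothetical, pure-Python simulation of the pattern (all function names here are stand-ins, not the actual NeuralLinearBandit code, and the collective is simulated rather than calling `torch.distributed`): a collective like `all_reduce` only completes when every rank calls it, so the safe pattern is for every rank to always enter the collective, contributing a zero update when its batch has zero weight.

```python
# Hypothetical sketch of the deadlock-avoidance pattern described above.
# A collective such as all_reduce blocks until EVERY rank calls it; if
# ranks with a zero-weight batch skip the update entirely, the remaining
# ranks hang forever inside the collective. The fix: every rank always
# participates, sending a zero contribution when it has nothing to add.

def simulated_all_reduce(contributions):
    # Stand-in for torch.distributed.all_reduce with a sum op:
    # sums the per-rank values and gives every rank the same result.
    total = sum(contributions)
    return [total] * len(contributions)

def update_step(batch_weights):
    # Deadlock-prone variant (not shown): only ranks with weight > 0
    # enter the collective, so it never completes. Safe variant below:
    # all ranks participate, contributing 0.0 for a zero-weight batch.
    contributions = [w if w > 0 else 0.0 for w in batch_weights]
    return simulated_all_reduce(contributions)

# Ranks 1 and 2 have data; rank 0 has a zero-weight batch but still
# joins the collective, so all ranks see the same reduced sum.
print(update_step([0.0, 2.0, 3.0]))  # → [5.0, 5.0, 5.0]
```

In the real distributed setting the same principle applies: the branch that decides whether to run the LinearRegression update must not be allowed to skip a rank past a collective call that other ranks are blocked on.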