[RMP] T4R fixes: MultiGPU data parallel training for next-item prediction and fixed serving #522
Comments
Distributed Data Parallel training is something we want to do, but I don't think it's part of this effort to fix the immediate blockers. Does that match what y'all understand, @EvenOldridge @viswa-nvidia?
Binary classification DP training would unblock the issue, although DDP is preferred if it is more performant. More details are captured here
Either DistributedDataParallel training is part of the scope of the quick fixes or it isn't, and it sounds like it isn't, so we should track that work somewhere (but not here).
@gabrielspmoreira, please create a ticket for DP training for binary classification and link it here.
@viswa-nvidia @EvenOldridge
I don't think we should split the issue; let's just target this for 22.09.
Problem:
We have customers who would like to use multi-GPU Transformers4Rec but are blocked by issues with our existing support for session-based models.
Goal:
Constraints:
Starting Point:
Enable DataParallel/DistributedDataParallel training using HF Trainer for next-item prediction
DataParallel works if the model is wrapped manually by the user (i.e., model = torch.nn.DataParallel(model)) for training, but that wrapping should happen automatically by the HF Trainer here
Fix the serving sections of the existing T4R notebooks
[Task] Add multi-GPU example for Transformer4Rec PyTorch (Transformers4Rec#508)
[BUG] mutli-gpu example is broken when we pull the release-22.11 branch of NVTabular (Transformers4Rec#526)
Note: The multi-GPU training of the specific use cases of session binary classification / regression is addressed by RMP #708
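For reference, the manual DataParallel wrapping mentioned under Starting Point can be sketched roughly as below. This is only an illustration under stated assumptions: `maybe_wrap_data_parallel` is a hypothetical helper name, and the `nn.Linear` model is a toy stand-in rather than an actual Transformers4Rec model. It shows the kind of conditional wrapping that would ideally happen automatically inside the trainer instead of in user code.

```python
import torch
from torch import nn


def maybe_wrap_data_parallel(model: nn.Module) -> nn.Module:
    """Hypothetical helper: wrap the model only when more than one GPU is visible."""
    if torch.cuda.device_count() > 1 and not isinstance(model, nn.DataParallel):
        return nn.DataParallel(model)
    return model


# Toy stand-in for a next-item prediction model (NOT a Transformers4Rec model).
model = nn.Linear(8, 4)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = maybe_wrap_data_parallel(model.to(device))

# With DataParallel, the batch is split along dim 0 across the visible GPUs
# and the outputs are gathered back on the first device.
batch = torch.randn(16, 8, device=device)
logits = model(batch)
```

On a machine with zero or one GPU the helper returns the model unchanged, so the same script runs everywhere; only the multi-GPU case goes through `torch.nn.DataParallel`.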