
[RMP] T4R fixes: MultiGPU data parallel training for next-item prediction and fixed serving #522

Closed
9 of 10 tasks
karlhigley opened this issue Aug 8, 2022 · 6 comments
@karlhigley
Contributor

karlhigley commented Aug 8, 2022

Problem:

We have customers who would like to use Transformers4Rec with multiple GPUs but are blocked by issues with our existing support for session-based models.

Goal:

  • Unblock customer use cases so that customers can try out T4R and give us feedback

Constraints:

  • We don't yet have TorchScript support (which is out of scope for this issue)

Starting Point:

Note: Multi-GPU training for the specific use cases of session-level binary classification / regression is addressed by RMP #708
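
For context, the sketch below shows roughly what single-process data-parallel training looks like in plain PyTorch with `torch.nn.DataParallel`. It is a sketch only: the model and dataset are synthetic stand-ins, and it does not use T4R's Trainer API.

```python
# Minimal sketch (generic PyTorch, not the T4R Trainer) of
# single-process data parallelism with torch.nn.DataParallel.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic data standing in for a real session-based dataset.
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
dataloader = DataLoader(dataset, batch_size=256, shuffle=True)

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).cuda()
if torch.cuda.device_count() > 1:
    # DataParallel replicates the module on each visible GPU, splits every
    # batch along dim 0, and gathers the outputs on the default device.
    model = nn.DataParallel(model)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for features, labels in dataloader:
    features, labels = features.cuda(), labels.cuda()
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()
```

DataParallel keeps everything in one Python process, which is simpler to wire up but generally slower than DistributedDataParallel across multiple GPUs.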

@karlhigley
Contributor Author

Distributed Data Parallel (DDP) training is something we want to do, but I don't think it's part of this effort to fix the immediate blockers. Does that match what y'all understand, @EvenOldridge @viswa-nvidia?

@sohn21c

sohn21c commented Aug 9, 2022

DataParallel (DP) training for Binary Classification would unblock the issue, although DDP is preferred if it is more performant. More details are captured here
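
For comparison, the DDP alternative mentioned above runs one process per GPU. Here is a minimal generic-PyTorch sketch (again with synthetic data, not T4R's Trainer API), launched with something like `torchrun --nproc_per_node=<num_gpus> train_ddp.py`:

```python
# Minimal sketch (generic PyTorch, not the T4R Trainer) of
# DistributedDataParallel training with one process per GPU.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Synthetic data standing in for a real session-based dataset.
    dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)  # each rank sees a distinct shard
    dataloader = DataLoader(dataset, batch_size=128, sampler=sampler)

    model = torch.nn.Linear(128, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # keep shuffling consistent across ranks
        for features, labels in dataloader:
            features, labels = features.cuda(local_rank), labels.cuda(local_rank)
            optimizer.zero_grad()
            loss_fn(model(features), labels).backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

DDP avoids DataParallel's single-process GIL and gather overhead, which is why it is usually more performant once the process-group setup is in place.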

@karlhigley
Contributor Author

Either DistributedDataParallel training is part of the scope of the quick fixes or it isn't, and it sounds like it isn't, so we should track that work somewhere else (but not here).

@viswa-nvidia

@gabrielspmoreira, please create a ticket for DP training of Binary Classification and link it here.

@gabrielspmoreira
Member

@viswa-nvidia @EvenOldridge
@rnyak, @sararb, @nzarif, and I met today about the issues related to DataParallel.
We tested DataParallel for Next Item Prediction on one of the examples and it is not working, unlike what Sara found a few weeks ago in another example.
So currently neither Next Item Prediction nor Binary Classification works with DataParallel.
We have linked the issues for both in this RMP ticket's description.
Should we remove DataParallel from the scope of this RMP and create another RMP ticket focused on DataParallel support (targeted for release 22.09)?

@karlhigley
Contributor Author

I don't think we should split the issue; let's just target this for 22.09.

@EvenOldridge EvenOldridge changed the title [RMP] "Band-aid" fixes for Transformers4Rec (multi-GPU and Python serving) [RMP] MultiGPU data parallel training, multi-gpu .fit() and Python based serving for Transformers4Rec Sep 12, 2022
@karlhigley karlhigley changed the title [RMP] MultiGPU data parallel training, multi-gpu .fit() and Python based serving for Transformers4Rec [RMP] Quick fixes: MultiGPU data parallel training, multi-gpu .fit(), and Python based serving for Transformers4Rec Sep 13, 2022
@karlhigley karlhigley changed the title [RMP] Quick fixes: MultiGPU data parallel training, multi-gpu .fit(), and Python based serving for Transformers4Rec [RMP] T4R quick fixes: MultiGPU data parallel training, multi-gpu .fit(), and Python based serving for Transformers4Rec Sep 13, 2022
@gabrielspmoreira gabrielspmoreira changed the title [RMP] T4R quick fixes: MultiGPU data parallel training, multi-gpu .fit(), and Python based serving for Transformers4Rec [RMP] T4R fixes: MultiGPU data parallel training, multi-gpu .fit(), and Python based serving for Transformers4Rec Oct 25, 2022
@gabrielspmoreira gabrielspmoreira changed the title [RMP] T4R fixes: MultiGPU data parallel training, multi-gpu .fit(), and Python based serving for Transformers4Rec [RMP] T4R fixes: MultiGPU data parallel training for next-item prediction Nov 1, 2022
@gabrielspmoreira gabrielspmoreira changed the title [RMP] T4R fixes: MultiGPU data parallel training for next-item prediction [RMP] T4R fixes: MultiGPU data parallel training for next-item prediction and fixed serving Nov 1, 2022