
[RMP] T4R fixes: MultiGPU data parallel training for next-item prediction and fixed serving #522

Closed
9 of 10 tasks
karlhigley opened this issue Aug 8, 2022 · 6 comments
@karlhigley
Contributor

karlhigley commented Aug 8, 2022

Problem:

We have customers who would like to use Transformers4Rec with multiple GPUs but are blocked by issues with our existing support for session-based models.

Goal:

  • Unblock customer use cases so that customers can try out T4R and give us feedback

Constraints:

  • We don't yet have TorchScript support (which is out of scope for this issue)

Starting Point:

Note: Multi-GPU training for the specific use cases of session-level binary classification / regression is addressed by RMP #708
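
For context, the sketch below shows roughly what single-process data-parallel training looks like in plain PyTorch with `torch.nn.DataParallel`. It is a sketch only: the model and dataset are synthetic stand-ins, and it does not use T4R's Trainer API.

```python
# Minimal sketch (generic PyTorch, not the T4R Trainer) of
# single-process data parallelism with torch.nn.DataParallel.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic data standing in for a real session-based dataset.
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
dataloader = DataLoader(dataset, batch_size=256, shuffle=True)

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).cuda()
if torch.cuda.device_count() > 1:
    # DataParallel replicates the module on each visible GPU, splits every
    # batch along dim 0, and gathers the outputs on the default device.
    model = nn.DataParallel(model)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for features, labels in dataloader:
    features, labels = features.cuda(), labels.cuda()
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()
```

DataParallel keeps everything in one Python process, which is simpler to wire up but generally slower than DistributedDataParallel across multiple GPUs.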

@karlhigley
Contributor Author

Distributed Data Parallel (DDP) training is something we want to do, but I don't think it's part of this effort to fix the immediate blockers. Does that match what y'all understand, @EvenOldridge @viswa-nvidia?

@sohn21c

sohn21c commented Aug 9, 2022

DataParallel (DP) training for Binary Classification would unblock the issue, although DDP is preferred if it is more performant. More details are captured here
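
For comparison, the DDP alternative mentioned above runs one process per GPU. Here is a minimal generic-PyTorch sketch (again with synthetic data, not T4R's Trainer API), launched with something like `torchrun --nproc_per_node=<num_gpus> train_ddp.py`:

```python
# Minimal sketch (generic PyTorch, not the T4R Trainer) of
# DistributedDataParallel training with one process per GPU.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Synthetic data standing in for a real session-based dataset.
    dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)  # each rank sees a distinct shard
    dataloader = DataLoader(dataset, batch_size=128, sampler=sampler)

    model = torch.nn.Linear(128, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # keep shuffling consistent across ranks
        for features, labels in dataloader:
            features, labels = features.cuda(local_rank), labels.cuda(local_rank)
            optimizer.zero_grad()
            loss_fn(model(features), labels).backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

DDP avoids DataParallel's single-process GIL and gather overhead, which is why it is usually more performant once the process-group setup is in place.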

@karlhigley
Contributor Author

Either DistributedDataParallel training is part of the scope of the quick fixes or it isn't, and it sounds like it isn't, so we should track that work somewhere else (but not here).

@viswa-nvidia

@gabrielspmoreira, please create a ticket for DP training of Binary Classification and link it here.

@gabrielspmoreira
Member

@viswa-nvidia @EvenOldridge
@rnyak, @sararb, @nzarif, and I met today about the issues related to DataParallel.
We tested DataParallel for Next Item Prediction on one of the examples and it is not working, unlike what Sara found a few weeks ago in another example.
So currently neither Next Item Prediction nor Binary Classification works with DataParallel.
We have linked the issues for both in this RMP ticket's description.
Should we remove DataParallel from the scope of this RMP and create another RMP ticket focused on DataParallel support (targeted for release 22.09)?

@karlhigley
Contributor Author

I don't think we should split the issue; let's just target this for 22.09.

@EvenOldridge EvenOldridge changed the title [RMP] "Band-aid" fixes for Transformers4Rec (multi-GPU and Python serving) [RMP] MultiGPU data parallel training, multi-gpu .fit() and Python based serving for Transformers4Rec Sep 12, 2022
@karlhigley karlhigley changed the title [RMP] MultiGPU data parallel training, multi-gpu .fit() and Python based serving for Transformers4Rec [RMP] Quick fixes: MultiGPU data parallel training, multi-gpu .fit(), and Python based serving for Transformers4Rec Sep 13, 2022
@karlhigley karlhigley changed the title [RMP] Quick fixes: MultiGPU data parallel training, multi-gpu .fit(), and Python based serving for Transformers4Rec [RMP] T4R quick fixes: MultiGPU data parallel training, multi-gpu .fit(), and Python based serving for Transformers4Rec Sep 13, 2022
@gabrielspmoreira gabrielspmoreira changed the title [RMP] T4R quick fixes: MultiGPU data parallel training, multi-gpu .fit(), and Python based serving for Transformers4Rec [RMP] T4R fixes: MultiGPU data parallel training, multi-gpu .fit(), and Python based serving for Transformers4Rec Oct 25, 2022
@gabrielspmoreira gabrielspmoreira changed the title [RMP] T4R fixes: MultiGPU data parallel training, multi-gpu .fit(), and Python based serving for Transformers4Rec [RMP] T4R fixes: MultiGPU data parallel training for next-item prediction Nov 1, 2022
@gabrielspmoreira gabrielspmoreira changed the title [RMP] T4R fixes: MultiGPU data parallel training for next-item prediction [RMP] T4R fixes: MultiGPU data parallel training for next-item prediction and fixed serving Nov 1, 2022