
Support tensor parallel/pipeline parallel #397

Closed
gongel opened this issue Aug 25, 2023 · 8 comments
Labels
enhancement New feature or request

Comments

gongel commented Aug 25, 2023

Is tensor parallelism/pipeline parallelism currently supported?

gongel added the enhancement (New feature or request) label Aug 25, 2023
karan6181 (Collaborator) commented

Can you please share more details?

gongel (Author) commented Aug 30, 2023

The NVIDIA Megatron team proposed tensor parallelism. When training with tensor parallelism, all ranks in the same tensor-parallel group receive the same data.
Paper: https://arxiv.org/pdf/2205.05198.pdf
Repo: https://github.com/NVIDIA/Megatron-LM

But streaming currently only supports DDP/FSDP.
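
As a minimal illustration of the data mapping tensor parallelism implies (the contiguous Megatron-style grouping and all names/values here are assumptions for the sketch, not anything from Streaming or Megatron-LM):

# Minimal sketch, assuming a Megatron-style contiguous tensor-parallel grouping.
# All names and values are illustrative only.
world_size = 8
tensor_parallel_size = 2

for global_rank in range(world_size):
    # Ranks {0,1}, {2,3}, ... form tensor-parallel groups and must all
    # receive the same batch, so they share one data-parallel rank.
    data_parallel_rank = global_rank // tensor_parallel_size
    print(f"global rank {global_rank} -> data-parallel rank {data_parallel_rank}")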

andreamad8 commented Feb 9, 2024

Any plan to add this?

One easy solution, which does not seem to work, could be:

import os

# Environment variables are strings, so cast to int before dividing.
os.environ["WORLD_SIZE"] = str(int(os.environ["WORLD_SIZE"]) // model_parallel_size)
os.environ["RANK"] = str(int(os.environ["RANK"]) // model_parallel_size)

I tried it, but the code seems to get stuck after calling something like:

batch = next(batch_iterator)

where batch_iterator is a dataloader.

cc: @karan6181

karan6181 (Collaborator) commented

@snarayan21 Looks like this is being addressed. Is that right?

huxuan (Contributor) commented Jun 12, 2024

I would like to know if there is any example of a Megatron integration.

snarayan21 (Collaborator) commented

@andreamad8 @huxuan @gongel please see the replication argument detailed in our docs here.

@huxuan We don't have an explicit example of a Megatron integration, but since it's PyTorch-based, you can simply swap in the dataset/dataloader.
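
For illustration, a minimal sketch of swapping StreamingDataset into a tensor-parallel training setup; the remote/local paths, batch size, and tensor_parallel_size are placeholders, and this assumes the replication argument described in the docs:

from streaming import StreamingDataset
from torch.utils.data import DataLoader

tensor_parallel_size = 4  # placeholder: ranks in one TP group should see identical samples

dataset = StreamingDataset(
    remote='s3://my-bucket/my-dataset',  # placeholder remote path
    local='/tmp/my-dataset',             # placeholder local cache directory
    shuffle=True,
    batch_size=8,                        # per-device batch size
    replication=tensor_parallel_size,    # replicate samples across each TP group
)
dataloader = DataLoader(dataset, batch_size=8)

for batch in dataloader:
    ...  # feed into the Megatron-style training step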

casper-hansen commented

@snarayan21

We can compute replication based on TP and SP as follows, but it seems documentation on PP is lacking. I think part of this issue went unanswered because of that. How do we set the streaming parameters with pipeline parallelism?

TP and SP:

replication = tp * sp

Here are the various configurations I tried as a sanity check. None of them are correct (unexpectedly high loss):

replication = tp * sp * pp

replication = tp * sp * pp
num_canonical_nodes = world_size // replication

replication = tp * sp
num_canonical_nodes = world_size // (tp * sp * pp)
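
One way to sanity-check any of these configurations (a hedged debugging sketch, not something from the Streaming docs; the process-group plumbing and batch layout are assumptions) is to verify that ranks which are supposed to share data actually receive identical batches:

import hashlib
import torch
import torch.distributed as dist

def batch_fingerprint(batch: torch.Tensor) -> int:
    # Small integer fingerprint of the batch's raw bytes.
    return int(hashlib.sha256(batch.detach().cpu().numpy().tobytes()).hexdigest()[:8], 16)

def same_data_in_group(batch: torch.Tensor, group=None) -> bool:
    # All-gather the fingerprint within `group` (e.g. the TP/SP group that should
    # share samples) and check every rank saw the same batch. With the NCCL
    # backend, move `fp` to the GPU before the all_gather.
    fp = torch.tensor([batch_fingerprint(batch)], dtype=torch.long)
    gathered = [torch.zeros_like(fp) for _ in range(dist.get_world_size(group))]
    dist.all_gather(gathered, fp, group=group)
    return all(int(g.item()) == int(fp.item()) for g in gathered)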

casper-hansen commented

Opened an issue for this separately:
#827
