
Support tensor parallel/pipeline parallel #397

Closed
gongel opened this issue Aug 25, 2023 · 8 comments
Labels
enhancement New feature or request

Comments

gongel commented Aug 25, 2023

Is tensor parallelism/pipeline parallelism currently supported?

gongel added the enhancement (New feature or request) label Aug 25, 2023
karan6181 (Collaborator) commented

Can you please share more details?

gongel (Author) commented Aug 30, 2023

The NVIDIA Megatron team proposed tensor parallelism. When training with tensor parallelism, all ranks in the same tensor-parallel group receive the same data.
Paper: https://arxiv.org/pdf/2205.05198.pdf
Repo: https://github.com/NVIDIA/Megatron-LM

But streaming currently only supports DDP/FSDP.
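
As a minimal illustration of the data mapping tensor parallelism implies (the contiguous Megatron-style grouping and all names/values here are assumptions for the sketch, not anything from Streaming or Megatron-LM):

# Minimal sketch, assuming a Megatron-style contiguous tensor-parallel grouping.
# All names and values are illustrative only.
world_size = 8
tensor_parallel_size = 2

for global_rank in range(world_size):
    # Ranks {0,1}, {2,3}, ... form tensor-parallel groups and must all
    # receive the same batch, so they share one data-parallel rank.
    data_parallel_rank = global_rank // tensor_parallel_size
    print(f"global rank {global_rank} -> data-parallel rank {data_parallel_rank}")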

andreamad8 commented Feb 9, 2024

Any plan to add this?

One easy solution, which does not seem to work, could be:

import os

# Environment variables are strings, so cast to int before dividing.
os.environ["WORLD_SIZE"] = str(int(os.environ["WORLD_SIZE"]) // model_parallel_size)
os.environ["RANK"] = str(int(os.environ["RANK"]) // model_parallel_size)

I tried it, but the code seems to get stuck after calling something like:

batch = next(batch_iterator)

where batch_iterator is a dataloader.

cc: @karan6181

karan6181 (Collaborator) commented

@snarayan21 Looks like this is being addressed. Is that right?

huxuan (Contributor) commented Jun 12, 2024

I would like to know if there is any example of a Megatron integration.

snarayan21 (Collaborator) commented

@andreamad8 @huxuan @gongel please see the replication argument detailed in our docs here.

@huxuan We don't have an explicit example of a Megatron integration, but since it's PyTorch-based, you can simply swap in the dataset/dataloader.
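
For illustration, a minimal sketch of swapping StreamingDataset into a tensor-parallel training setup; the remote/local paths, batch size, and tensor_parallel_size are placeholders, and this assumes the replication argument described in the docs:

from streaming import StreamingDataset
from torch.utils.data import DataLoader

tensor_parallel_size = 4  # placeholder: ranks in one TP group should see identical samples

dataset = StreamingDataset(
    remote='s3://my-bucket/my-dataset',  # placeholder remote path
    local='/tmp/my-dataset',             # placeholder local cache directory
    shuffle=True,
    batch_size=8,                        # per-device batch size
    replication=tensor_parallel_size,    # replicate samples across each TP group
)
dataloader = DataLoader(dataset, batch_size=8)

for batch in dataloader:
    ...  # feed into the Megatron-style training step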

casper-hansen commented

@snarayan21

We can compute replication based on TP and SP as follows, but it seems documentation on PP is lacking. I think part of this issue went unanswered because of that. How do we set the streaming parameters with pipeline parallelism?

TP and SP:

replication = tp * sp

Here are the various configurations I tried as a sanity check. None of them are correct (unexpectedly high loss):

replication = tp * sp * pp

replication = tp * sp * pp
num_canonical_nodes = world_size // replication

replication = tp * sp
num_canonical_nodes = world_size // (tp * sp * pp)
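
One way to sanity-check any of these configurations (a hedged debugging sketch, not something from the Streaming docs; the process-group plumbing and batch layout are assumptions) is to verify that ranks which are supposed to share data actually receive identical batches:

import hashlib
import torch
import torch.distributed as dist

def batch_fingerprint(batch: torch.Tensor) -> int:
    # Small integer fingerprint of the batch's raw bytes.
    return int(hashlib.sha256(batch.detach().cpu().numpy().tobytes()).hexdigest()[:8], 16)

def same_data_in_group(batch: torch.Tensor, group=None) -> bool:
    # All-gather the fingerprint within `group` (e.g. the TP/SP group that should
    # share samples) and check every rank saw the same batch. With the NCCL
    # backend, move `fp` to the GPU before the all_gather.
    fp = torch.tensor([batch_fingerprint(batch)], dtype=torch.long)
    gathered = [torch.zeros_like(fp) for _ in range(dist.get_world_size(group))]
    dist.all_gather(gathered, fp, group=group)
    return all(int(g.item()) == int(fp.item()) for g in gathered)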

casper-hansen commented

Opened an issue for this separately:
#827
