Support tensor parallel/pipeline parallel #397
Comments
Can you please share more details?
The NVIDIA Megatron team proposed "Tensor Parallelism". When training with tensor parallelism, the ranks in the same tensor-parallel group must receive the same data, but streaming only supports DDP/FSDP.
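For context, a minimal sketch of how tensor-parallel ranks are typically grouped, which is why every rank in a group has to consume the same batch. This assumes `torch.distributed` is already initialized; the world size and TP degree below are only illustrative:

```python
import torch.distributed as dist

# Assumption for illustration: the default process group is initialized,
# e.g. 8 ranks total with tensor-parallel degree 2.
world_size = dist.get_world_size()
tp_degree = 2

# Consecutive ranks form one TP group, e.g. [0, 1], [2, 3], [4, 5], [6, 7].
# Every rank inside a group holds a shard of the layer weights, so all of
# them must be fed the *same* input batch; only different TP groups should
# see different data (data parallelism happens across groups).
tp_group = None
for start in range(0, world_size, tp_degree):
    ranks = list(range(start, start + tp_degree))
    # new_group must be called by all ranks with the same arguments.
    group = dist.new_group(ranks)
    if dist.get_rank() in ranks:
        tp_group = group
```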
Any plan to add this? One easy solution that does not seem to work could be:
I tried, but it seems the code gets stuck after calling something like:
where batch_iterator is a dataloader. cc: @karan6181
@snarayan21 Looks like this is being addressed. Is that right?
I would like to know if there is an example of a Megatron integration.
@andreamad8 @huxuan @gongel please see the above.

@huxuan We don't have an explicit example of a Megatron integration, but as it's PyTorch-based, you can simply swap in the dataset / dataloader.
We can compute replication based on TP and SP as follows, but it seems we are lacking documentation on PP; I think part of this issue went unanswered because of that. How do we modify the parameters to streaming with pipeline parallelism? TP and SP:
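The original snippet for this comment was not preserved, so the following is only a minimal sketch. It assumes streaming's `StreamingDataset` exposes a `replication` argument, and the parallelism degrees, bucket, and paths are illustrative:

```python
# A minimal sketch, not the commenter's original code. Assumes streaming's
# StreamingDataset accepts a `replication` argument; degrees and paths are
# hypothetical.
from streaming import StreamingDataset

tensor_parallel_degree = 4    # TP degree (illustrative)
sequence_parallel_degree = 1  # SP degree (illustrative)

# Every rank inside a TP/SP group must see the same samples, so each sample
# is replicated across tp * sp ranks; only the data-parallel groups differ.
replication = tensor_parallel_degree * sequence_parallel_degree

dataset = StreamingDataset(
    remote='s3://my-bucket/my-dataset',  # hypothetical remote path
    local='/tmp/my-dataset',             # hypothetical local cache
    batch_size=4,
    replication=replication,
)
```

With pipeline parallelism the open question in this thread remains: it is unclear from the docs how replication (or any other streaming parameter) should account for the PP degree.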
Here are various configurations I tried just to sanity check; none of them are correct (unexpectedly high loss).
Opened an issue for this separately:
Do you currently support tensor parallel/pipeline parallel?