Adding the new feature of FPDT #441
base: main
Conversation
…on for supporting batch size larger than 1
Hi @YJHMITWEB, is FPDT referring to this paper? https://ui.adsabs.harvard.edu/abs/2023JARS...17b6510H/abstract
@YJHMITWEB Do we need changes in …

megatron/initialize.py (Outdated)
@@ -349,9 +349,12 @@ def _warmup_jit_function():
        dtype = torch.float32

    # Warmup fused bias+gelu
    seq_length = args.seq_length
    if args.ds_sequence_parallel_fpdt:
        seq_length = 8192
Can you define this as another variable like "FPDT_SEQ_LEN" and add a comment describing why we have this setting?
This is fixed by setting it to ds_sequence_parallel_fpdt_chunk_size when FPDT is enabled.
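For readers following the thread, here is a minimal sketch of what the resulting selection logic could look like. The argument names mirror those used in this PR, but the helper function itself is hypothetical and only illustrative:

```python
# Hypothetical helper, not code from this PR: picks the sequence length used
# for the JIT warmup. With FPDT enabled, the model processes the sequence in
# fixed-size chunks, so warming up with a single chunk's length is sufficient
# instead of the full training sequence length.
def _select_warmup_seq_length(args):
    if getattr(args, 'ds_sequence_parallel_fpdt', False):
        return args.ds_sequence_parallel_fpdt_chunk_size
    return args.seq_length
```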
@@ -32,7 +35,9 @@ def forward(self, max_seq_len, offset=0):
         emb = torch.cat((freqs, freqs), dim=-1)
         # emb [seq_length, .., dim]
         from einops import rearrange
-        return rearrange(emb, 'n d -> n 1 1 d')
+        base = rearrange(emb, 'n d -> n 1 1 d')
Will this change the output when using --use-rotary-position-embeddings in a llama-style model?
FYI https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples_deepspeed/pretrain_llama2_distributed.sh
We have tested both GPT and Llama models; this works well with both.
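To illustrate why the rebinding itself is output-neutral, here is a standalone sketch; the inv_freq construction below is the standard RoPE formulation and is assumed rather than copied from this repository:

```python
import torch
from einops import rearrange

# Build a small rotary embedding table the same way as in the hunk above.
dim, seq_len = 64, 16
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, dim, 2).float() / dim))  # assumed standard RoPE frequencies
t = torch.arange(seq_len).float()
freqs = torch.einsum('i,j->ij', t, inv_freq)   # [seq_len, dim/2]
emb = torch.cat((freqs, freqs), dim=-1)        # [seq_len, dim]

# Binding the rearranged tensor to `base` produces exactly the value the old
# code returned directly; only what is done with `base` afterwards can differ.
base = rearrange(emb, 'n d -> n 1 1 d')        # [seq_len, 1, 1, dim]
assert torch.equal(base, rearrange(emb, 'n d -> n 1 1 d'))
print(base.shape)  # torch.Size([16, 1, 1, 64])
```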
@delock, no, FPDT refers to a different paper, aka Ulysses-Offload.
Thanks @samadejacobs for pointing this out.
@microsoft-github-policy-service agree |
FPDT only works with this version of DeepSpeed.