# [BUG] [Fix-Suggested] Model Training Stalls with FSDP when fsdp_use_orig_params=False due to inconsistent model-optimizer state (#3256)
Comments
Yes, it is the same issue! The proposed solution is reasonable. Thanks.
Can you check whether huggingface/transformers#35212 has solved the issue? If not, could you try whether additionally switching off flash attention helps?
@BenjaminBossan Thanks! I have tried huggingface/transformers#35212 and switched to this commit. The problem persists, and I do not have flash attention installed in my environment.
I think huggingface/transformers#35212 helps when users do not supply a custom optimizer and instead rely on the trainer to create one. If the user initialized the optimizer themselves, the trainer just respects whatever the user has done (see https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L1186-L1201).
Thanks for the feedback @traincheck-team. Your link appears to be out of date (tip: use permalinks), but I think it's clear what you mean. When the user creates the optimizer, I think it is reasonable to honor that and not re-initialize the optimizer. In this case, it clashes with the need to do delayed initialization. Honestly, I don't have a good idea how to consolidate the two needs. Maybe @muellerzr has an idea?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Bug Description
Users have experienced their model completely failing to learn after adapting their pipeline to FSDP (the loss stays constant across epochs), as reported in https://github.com/huggingface/accelerate/issues/2665.

Mitigation: setting `fsdp_use_orig_params` to `true` makes the model learn again. We are opening a new issue here because the original one has been closed and the root cause was never made clear there.
Environment
The bug is reproducible on the newest stable `accelerate` version. Below is the `accelerate env` output (the environment probably does not matter, per the root cause we have diagnosed):

`accelerate env` output (collapsed in the original issue)
To reproduce:

1. Install all the dependencies (torch and accelerate).
2. Run `bug.py` using `run.sh` (same as reported in FSDP Model not learning during training, loss stays constant #2665). The attached `run.sh`, `default.yml`, and `train.py` (contents collapsed in the original issue) run 2 epochs of 10 steps each with `fsdp_use_orig_params == False`. Notice that for every epoch the validation scores are identical and the loss reported for the same batch is identical.
3. Modify `default.yml` to have `fsdp_use_orig_params: true` and run step 2 again. Observe that the model now learns correctly, as reported in FSDP Model not learning during training, loss stays constant #2665 (comment).

Issue Root Cause
`FSDP(model)` flattens the model parameters and uses the flattened ones for training, as can be seen in the PyTorch codebase (e.g. https://github.com/pytorch/pytorch/blob/6a096a0b960b415e95b89efb6cc6eeaa9c0f48ab/torch/distributed/fsdp/_unshard_param_utils.py#L122).

In the buggy pipeline `train.py`, the `AdamW` optimizer was initialized before `model` was wrapped with FSDP, which happens inside `accelerator.prepare(model, optimizer, train_dataloader, val_dataloader, lr_scheduler)`. That optimizer therefore only references the original, unflattened model parameters, and `accelerator.prepare` does not update the optimizer when `fsdp_use_orig_params == False`.
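For illustration, a minimal sketch of the problematic ordering, assuming an `AdamW` optimizer as in `train.py` (the model and hyperparameters here are placeholders, not the exact code from the reproduction script):

```python
import torch
from accelerate import Accelerator

# FSDP is configured via default.yml with fsdp_use_orig_params: false
accelerator = Accelerator()

model = torch.nn.Linear(16, 2)  # stand-in for the real model in train.py

# The optimizer is created BEFORE accelerator.prepare, so it captures
# references to the original, unflattened parameter tensors.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# prepare() wraps the model in FSDP, which replaces its parameters with
# flattened FlatParameters. With fsdp_use_orig_params == False the optimizer
# keeps pointing at the old tensors, which FSDP no longer trains.
model, optimizer = accelerator.prepare(model, optimizer)
```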
Additional Evidence
We ran our bug detection tool against the problematic code's runtime trace and noticed two things at runtime:

- The `step` API of the optimizer was not updating the model (no model parameter changed and no computation op was invoked).
- The `zero_grad` API of the optimizer was not doing anything (every `grad` was already `None` before entering `zero_grad`).
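A hedged sketch of how this state mismatch can be confirmed by hand (this is not our bug detection tool; `model` and `optimizer` refer to the objects used in the training loop after `accelerator.prepare`):

```python
# Parameters the (FSDP-wrapped) model actually trains vs. parameters the
# optimizer holds: in the buggy run the two sets do not intersect.
model_param_ids = {id(p) for p in model.parameters()}
optim_param_ids = {id(p) for g in optimizer.param_groups for p in g["params"]}
print("shared parameters:", len(model_param_ids & optim_param_ids))  # 0 when buggy

# Gradients never reach the optimizer's (stale) parameters, so step() has
# nothing to update and zero_grad() has nothing to clear.
print("all grads None:",
      all(p.grad is None for g in optimizer.param_groups for p in g["params"]))
```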
Possible User-side Workaround
1. Set `fsdp_use_orig_params` to `True`, or
2. Initialize the optimizer after the model is wrapped, i.e. call `accelerator.prepare` on the model first and only then construct (and prepare) the optimizer. The original before/after snippets are collapsed in the issue; a sketch of the reordering follows below.
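A minimal sketch of the reordering, assuming the `AdamW` optimizer, dataloaders, and scheduler from `train.py` (variable names are illustrative):

```python
# Before (buggy): the optimizer is built from the unwrapped model, then both
# are passed to prepare() together.
#   optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
#   model, optimizer, ... = accelerator.prepare(model, optimizer, ...)

# After (workaround): wrap the model first, then build the optimizer from the
# FSDP-wrapped model's parameters, then prepare the remaining objects.
model = accelerator.prepare(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
optimizer, train_dataloader, val_dataloader, lr_scheduler = accelerator.prepare(
    optimizer, train_dataloader, val_dataloader, lr_scheduler
)
```

This way the optimizer references the parameters that FSDP actually trains, regardless of the `fsdp_use_orig_params` setting.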
Suggested `accelerate`-side Fix

Have `accelerator.prepare` check whether the user-supplied `optimizer`'s parameters and the prepared `model`'s parameters overlap, and warn (or error out) when they do not overlap and `fsdp_use_orig_params` is not `True`. A sketch of such a check is given below.
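A hedged sketch of what such a check could look like (the helper name and its placement inside `accelerator.prepare` are hypothetical, not existing accelerate code):

```python
import warnings


def _warn_on_stale_optimizer(model, optimizer, fsdp_use_orig_params):
    """Warn when a user-supplied optimizer no longer references any parameter
    of the (possibly FSDP-wrapped) model."""
    if fsdp_use_orig_params:
        return  # original parameters are preserved, the optimizer stays valid
    model_params = {id(p) for p in model.parameters()}
    optim_params = {id(p) for g in optimizer.param_groups for p in g["params"]}
    if not (model_params & optim_params):
        warnings.warn(
            "The optimizer does not reference any of the prepared model's "
            "parameters; with fsdp_use_orig_params=False training will silently "
            "not update the model. Re-create the optimizer after preparing the "
            "model, or set fsdp_use_orig_params=True."
        )
```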
We will be more than happy to provide a PR for this issue! Let me know how you want to proceed!