I followed the example for training a T5 model with FSDP on SageMaker: https://github.com/huggingface/notebooks/blob/main/sagemaker/25_pytorch_fsdp_model_parallelism/scripts/run_clm.py
I noticed that checkpointing is disabled there with save_strategy="no". Is this intentional (see https://github.com/huggingface/notebooks/blob/main/sagemaker/25_pytorch_fsdp_model_parallelism/scripts/run_clm.py#L93)? In my training I changed it to save_strategy="steps" and noticed two issues:
The best checkpoints, based on minimum validation loss, are not saved. If I set the limit to 2, for example, the last 2 checkpoints are saved instead.
I was not able to load the trained model from a checkpoint and got the error that is mentioned elsewhere in the issues: RuntimeError: Trying to resize storage that is not resizable. This does not happen when I load the final model. But it makes training hard, since I need to know when to stop training so that the final model saved is the one with the minimum loss. I tried with different versions:
PyTorch 1.13 with Transformers 4.26
PyTorch 2.0.0 with Transformers 4.28.1
I see the same issue with loading a model from a checkpoint in both.
Would appreciate any pointers
Thank you!
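For reference, here is a minimal sketch of the setup the first issue is about. It is not taken from the linked example script. The dictionary holds keyword arguments for transformers.TrainingArguments, with names as I understand them for Transformers 4.26 to 4.28; the step counts are hypothetical, and whether best-checkpoint retention behaves as documented under FSDP on SageMaker is exactly what is in question here. The helper functions are hypothetical and only read the trainer_state.json file that the Trainer writes into each checkpoint folder, to find the evaluation step with the lowest eval_loss.

```python
import json
from pathlib import Path

# Keyword arguments meant to be passed as TrainingArguments(**best_checkpoint_kwargs).
# Assumption: parameter names as in Transformers 4.26-4.28; step values are made up.
best_checkpoint_kwargs = {
    "evaluation_strategy": "steps",   # must match save_strategy for best-model tracking
    "save_strategy": "steps",
    "eval_steps": 500,                # hypothetical value
    "save_steps": 500,                # hypothetical value, a multiple of eval_steps
    "save_total_limit": 2,
    "load_best_model_at_end": True,   # asks the Trainer to track and keep the best checkpoint
    "metric_for_best_model": "eval_loss",
    "greater_is_better": False,       # lower loss is better
}


def load_log_history(trainer_state_path):
    """Read the log_history list from a checkpoint's trainer_state.json."""
    state = json.loads(Path(trainer_state_path).read_text())
    return state.get("log_history", [])


def best_eval_entry(log_history):
    """Return the log entry with the lowest eval_loss, or None if there is none."""
    evals = [entry for entry in log_history if "eval_loss" in entry]
    if not evals:
        return None
    return min(evals, key=lambda entry: entry["eval_loss"])
```

With the latest checkpoint folder at hand, best_eval_entry(load_log_history(path_to_trainer_state)) gives the step at which validation loss was lowest, which at least tells you which checkpoint you would want to keep.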