Training Stopped #4

Open
ThisIsForReview opened this issue Jan 3, 2024 · 4 comments

@ThisIsForReview

The training also stopped after 2 epochs with the following error:
FileNotFoundError: [Errno 2] No such file or directory: "lightning_logs\'mnist'\version_1\checkpoints\last.ckpt"

I'm not sure why last.ckpt was not saved.

Thanks

JB

@fdraxler
Contributor

fdraxler commented Jan 9, 2024

Hi, thanks for your question. Can you tell us which command you executed so that we can reproduce the issue? Thanks!

@ThisIsForReview
Author

I found that every_n_epochs: 1 with save_top_k: 2 works, but any every_n_epochs > 1 does not. It seems that when every_n_epochs > 1, last.ckpt is never saved.

JB

@fdraxler
Contributor

So you are referring to the model checkpointing, right? What behavior are you trying to achieve?

@ThisIsForReview
Author

This seems to be an issue in PyTorch Lightning: when save_last = True, every_n_epochs cannot be larger than 1. Lightning-trainable sets save_last = True, so every_n_epochs = 5 does not work.

JB
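
For anyone hitting the same FileNotFoundError, one possible workaround on the user side is to check whether last.ckpt exists before trying to load it, and to fall back to the newest checkpoint in the same folder otherwise. The snippet below is only a rough sketch: `resolve_resume_checkpoint` is a hypothetical helper name, not something provided by Lightning or lightning-trainable, and it assumes checkpoints are stored as `*.ckpt` files in a single directory.

```python
from pathlib import Path
from typing import Optional


def resolve_resume_checkpoint(checkpoint_dir: str) -> Optional[Path]:
    """Pick a checkpoint to resume from, or return None if there is none.

    Hypothetical helper (an assumption for illustration), not part of
    Lightning or lightning-trainable.
    """
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        # The checkpoints folder was never created.
        return None

    # Prefer last.ckpt when it was actually written.
    last = directory / "last.ckpt"
    if last.is_file():
        return last

    # Otherwise fall back to the most recently modified *.ckpt file.
    candidates = [p for p in directory.glob("*.ckpt") if p.is_file()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

The returned path (or None, meaning "start from scratch") could then be passed on to whatever loads the checkpoint, for example as the `ckpt_path` argument of `Trainer.fit`. This does not fix the underlying save_last behavior; it only avoids the crash when last.ckpt is missing.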
