Added metrics logging to checkpoint and separate yaml file #1562
Conversation
Force-pushed 6b3999c to fb8ae7d (…6/super-gradients into feature/saving_metrics_to_yaml)
After a discussion with @BloodAxe we decided to leave only saving metrics to a checkpoint for this pull request. Another pull request will come with proposals for sg_loggers refactoring and saving some trainer internals to yaml.
LGTM
LGTM
LGTM
Issue description
@BloodAxe said SG does not support saving metrics, environment, and train config to the checkpoint.
After getting more familiar with the codebase, I realized the task should be split into several parts / PRs.
PR description
This PR proposes a way to add metrics to a checkpoint.
Primarily, this is done for easier access to experiment metrics, so the user doesn't need to run TensorBoard or parse the checkpoint to access them.
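As a rough illustration of the idea (the key names "net" and "metrics" and the file name are assumptions for this sketch, not necessarily what the PR uses), the latest validation metrics would travel inside the checkpoint and be readable with a plain torch.load:

```python
# Rough illustration only: key names ("net", "metrics") and the file name are
# assumptions for this sketch, not necessarily the keys used by the PR.
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
valid_metrics = {"Accuracy": 0.913, "Top5": 0.987}

checkpoint = {
    "net": model.state_dict(),
    "metrics": valid_metrics,  # latest validation metrics stored alongside the weights
}
torch.save(checkpoint, "ckpt_best.pth")

# Later, the experiment metrics can be read back without launching TensorBoard:
print(torch.load("ckpt_best.pth")["metrics"])  # {'Accuracy': 0.913, 'Top5': 0.987}
```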
Some ideas and notes
Suppose the user needs to train a custom model with several classification heads and wants to track a different metric for each of them. Currently, all metrics are gathered in a MetricCollection and every metric receives the same model output, so it is not possible to track, for example, the accuracy of only some of the outputs. Moreover, metrics are stored in the results dictionary under their class names: if two metrics share a class, the later entry overwrites the earlier one and only one metric remains (see the sketch below).
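A minimal sketch of that collision, assuming a recent torchmetrics version (this is not SG code): keying two per-head accuracy metrics by class name keeps only one of them, while explicitly named entries in a MetricCollection survive.

```python
# Minimal sketch, not SG code: why keying metrics by class name loses entries
# when two heads track the same metric type (assumes torchmetrics >= 0.11).
from torchmetrics import Accuracy, MetricCollection

head1_acc = Accuracy(task="multiclass", num_classes=10)
head2_acc = Accuracy(task="multiclass", num_classes=10)

# Keying by class name, as the current results dict effectively does:
by_class_name = {}
for metric in (head1_acc, head2_acc):
    by_class_name[metric.__class__.__name__] = metric
print(list(by_class_name))  # ['MulticlassAccuracy'] -- the second entry overwrote the first

# Explicit, user-chosen names avoid the collision entirely:
named = MetricCollection({"head1_acc": head1_acc, "head2_acc": head2_acc})
print(list(named.keys()))   # ['head1_acc', 'head2_acc']
```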
It would be great to have the possibility to specify which model output should be used for every metric (and maybe for every loss). This could be configured somehow like the hypothetical mapping sketched below.
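None of the keys or output names in this sketch exist in SG today; they only illustrate mapping each metric to the model output it should consume.

```python
# Hypothetical configuration sketch -- "model_output", the output names and the
# metric names are made up for illustration; nothing here exists in SG today.
from torchmetrics import Accuracy

metrics_config = {
    "head1_accuracy": {
        "metric": Accuracy(task="multiclass", num_classes=10),
        "model_output": "classification_head_1",
    },
    "head2_accuracy": {
        "metric": Accuracy(task="multiclass", num_classes=5),
        "model_output": "classification_head_2",
    },
}

# A trainer could then route outputs explicitly instead of feeding the same
# tensor to every metric, e.g.:
#   for name, cfg in metrics_config.items():
#       cfg["metric"].update(outputs[cfg["model_output"]], targets[cfg["model_output"]])
```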
When someone needs metric and loss values separately after executing the evaluate() method, they currently have to split the single joined dict into losses and metrics themselves. In this PR, saving the metrics would be much easier if the two were already separated; for now, the metrics are selected from the dictionary by their names and copied into a new dictionary (see sg_trainer.py / _save_checkpoint, lines 671 and 679, and the sketch below). Also, treating losses and metrics as separate structures would make the code easier to understand, as they are logically different ;)
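A minimal sketch of that filtering step, assuming a joined results dict keyed by metric/loss names (the function name and the example dict are illustrative, not the actual SG implementation):

```python
# Illustrative sketch, not the actual SG implementation: picking the tracked
# metrics out of the joined losses-and-metrics results dict by name.
from typing import Dict, Iterable, Union

import torch

Scalar = Union[float, torch.Tensor]


def select_metrics(results: Dict[str, Scalar], metric_names: Iterable[str]) -> Dict[str, Scalar]:
    """Keep only the entries whose keys match the tracked metric names."""
    wanted = set(metric_names)
    return {name: value for name, value in results.items() if name in wanted}


# Example: a joined dict of losses and metrics, as returned after validation.
valid_results = {"CrossEntropyLoss": torch.tensor(0.42), "Accuracy": 0.913, "Top5": 0.987}
print(select_metrics(valid_results, ["Accuracy", "Top5"]))  # {'Accuracy': 0.913, 'Top5': 0.987}
```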
It looks like changing this would affect many places in the project, so if the code owners agree it is a problem worth addressing, it should become a separate PR :)
There are some places in the _save_checkpoint function where the type of a metric value has to be checked (for the values returned in validation/train_results_dict). Metrics there sometimes arrive as a float torch.Tensor and sometimes as a plain Python float. It would be much more convenient to expect a single value type there; checks of this kind would then disappear and the code would look cleaner 🪄
For now, this routine has been moved to a separate function.
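A hedged sketch of what that helper could look like; the actual name and signature in the PR may differ.

```python
# Sketch only -- the real helper in the PR may have a different name/signature.
from typing import Union

import torch


def metric_value_to_float(value: Union[float, torch.Tensor]) -> float:
    """Normalize a metric value to a plain Python float, whether it arrives as
    a 0-dim torch.Tensor or as a regular number."""
    if isinstance(value, torch.Tensor):
        return value.item()
    return float(value)


assert metric_value_to_float(torch.tensor(0.875)) == 0.875
assert metric_value_to_float(0.875) == 0.875
```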