Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added metrics logging to checkpoint and separate yaml file #1562

Merged
merged 24 commits into from
Oct 31, 2023

Conversation

philmarchenko
Copy link
Contributor

@philmarchenko philmarchenko commented Oct 23, 2023

Issue description

@BloodAxe said SG does not support saving metrics, environment, and train config to the checkpoint.

After getting more familiar with a codebase, I realized the task should be split into several parts / PRs.

PR description

This PR proposes a way to add metrics to a checkpoint.

  1. Metrics from train and validation steps are saved in all .pth files.
  2. Additionally, the metrics are stored in several .yml files (best, latest, and epoch if specified in config). The file structure is the following:
metrics:
  train:
   <metricClass1>: <float> 
   ...
   <metricClassn>: <float> 
  valid:
   <metricClass1>: <float> 
   ...
   <metricClassn>: <float> 
tracked_metric_name: <metricClassx>

Primarily, this is done for easier access to experiment metrics, so the user doesn't need to run TB or parse the checkpoint to access them.

Some ideas and notes

  1. The metric setup could be done in a bit more flexible way.
    Let's suppose the user needs to train a custom model with several classification heads and track different metrics. For now, all the metrics are gathered in a MetricCollection and each metric uses the same output of a model. Moreover, if the user wants to track the accuracy of some of the outputs, he can't do this. Now metrics are named exactly as the class is called and are stored in the dictionary under this name. It will be rewritten into the dictionary and one metric will remain.

It would be great to have the possibility to specify which model output should be used for every metric (and maybe loss). It also could be configured somehow like

metrics: 
    __target__: CustomMetricDict
        metric_head1:
            __target__: MetrciClass1
                param1: ...
                paramn: ...
           model_prediction_key: <str>
           target_key: <str>
        ...
        metric_headn: ...    
  1. Mixing metrics and losses in a single dict (sg_trainer.py, get_train_loop_description_dict method, used as a return from evaluate)
    When someone tries to access metric/loss values separately after executing the evaluate() method, some routine on splitting losses and metrics from single joined dict should be done. In this PR, saving metrics to the dict could be done much easier, if they were separated (for now the metrics are selected from the dictionary by their names and form a new dictionary, see sg_trainer.py/_save_checkpoint/lines 671, 679). Also, treating losses and metrics as separate structures adds some points to the code understanding, as they are logically different ;)
    It looks like changing this will affect a bunch of places in the project, so if code owners respond to this problem, then it should become a separate PR :)
  2. isinstance(metric, torch.Tensor)
    There are some places in _save_checkpoint function where a type of metric value should be checked wrt its type (values returned in validation/train_results_dict). It looks like metrics there sometimes appear to have different types (torch.Tensor(dtype=float) or a usual float). It would be much more convenient to expect a single type of value there, and checks of this kind will disappear and make the code look cleaner 🪄
    For now, this routine has been moved to a separate function.

@philmarchenko philmarchenko force-pushed the feature/saving_metrics_to_yaml branch from 6b3999c to fb8ae7d Compare October 23, 2023 11:01
@philmarchenko philmarchenko marked this pull request as ready for review October 24, 2023 17:54
@philmarchenko
Copy link
Contributor Author

philmarchenko commented Oct 26, 2023

After a discussion with @BloodAxe we decided to leave only saving metrics to a checkpoint for this pull request. Another pull request will come with proposals for sg_loggers refactoring and saving some trainer internals to yaml.

BloodAxe
BloodAxe previously approved these changes Oct 28, 2023
Copy link
Contributor

@BloodAxe BloodAxe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@BloodAxe BloodAxe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@Louis-Dupont Louis-Dupont left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@BloodAxe BloodAxe merged commit e91be8f into Deci-AI:master Oct 31, 2023
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants