Enable control over MLFlowLogger run_name str to match a pre-existing tag run_name in MLFlow and resume model training #3263

wolliq · 2024-05-07T14:21:57Z

🚀 Feature Request

We would like to resume model training with the MLFlowLogger by passing the run_name from the YAML config.

Motivation

Today MLFlowLogger receives the run_name string from the YAML config, but we have no control over it: a random string is automatically appended to it at runtime, e.g. my-test => my-test-sgftKr.

In the MLFlowLogger docs:

        run_name: (str, optional): MLflow run name. If not set it will be the same as the Trainer run name

but it always gets overridden by the random value after YAML parsing.

In the MLFlowLogger, the filter string is built from the passed run_name, which already carries the randomly generated suffix, so it is not possible to match a pre-existing run:

    def _start_mlflow_run(self, state):
        import mlflow

        env_run_id = os.getenv(
            mlflow.environment_variables.MLFLOW_RUN_ID.name,  # pyright: ignore[reportGeneralTypeIssues]
            None,
        )
        if env_run_id is not None:
            self._run_id = env_run_id
        elif self.resume:
            # Search for an existing run tagged with this Composer run if `self.resume=True`.
            assert self._experiment_id is not None
            run_name = self.tags['run_name']
            existing_runs = mlflow.search_runs(
                experiment_ids=[self._experiment_id],
                filter_string=f'tags.run_name = "{run_name}"',    # <<< HERE
                output_format='list',
            )
...

As explained above, {run_name} will always have a fresh random string appended to it for each new run, so the search never finds the earlier run.
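To make the mismatch concrete, here is a minimal sketch that does not call MLflow at all. The helpers `with_random_suffix` and `matching_runs` are hypothetical stand-ins: the first imitates the suffixing described above (the real suffix format and where it is generated are assumptions), and the second imitates the exact-match `=` comparison in the filter string.

```python
import random
import string


def with_random_suffix(run_name: str, length: int = 6) -> str:
    """Hypothetical stand-in for the suffixing, e.g. my-test => my-test-sgftKr."""
    suffix = ''.join(random.choices(string.ascii_letters, k=length))
    return f'{run_name}-{suffix}'


def build_filter(run_name: str) -> str:
    """Build the same filter string shape used in _start_mlflow_run."""
    return f'tags.run_name = "{run_name}"'


def matching_runs(tagged_names: list, run_name: str) -> list:
    """Emulate the exact-match comparison against existing run_name tags."""
    return [name for name in tagged_names if name == run_name]


# First launch: the run is tagged with a suffixed name.
existing_tags = [with_random_suffix('my-test')]

# Resume: a new suffix is generated, so the filter looks for a different name.
resumed_name = with_random_suffix('my-test')
print(build_filter(resumed_name))
print(matching_runs(existing_tags, resumed_name))
```

Unless the two random suffixes happen to coincide, the last line prints an empty list, which is the situation in which the logger starts a new run instead of resuming.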

[Optional] Implementation

A possible solution would be to disable the random string generation by defining another environment variable during YAML parsing, such as:

mlflow_tag_run_name=True

so that when the resume action is called, the run name is passed unchanged and matches the run_name tag in MLFlow,

or directly

mlflow_tag_run_name="my-run-asdasd"

so that the str run_name is passed as is to MLFlowLogger to handle the resume.
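A rough sketch of how the two variants proposed above could be resolved, under stated assumptions: `resolve_run_name` is a hypothetical helper, not an existing Composer or MLflow function, the variable name `mlflow_tag_run_name` is taken from this proposal, and the suffixing is only imitated here.

```python
import os
import random
import string
from typing import Mapping, Optional

ENV_VAR = 'mlflow_tag_run_name'  # name proposed in this issue, not an existing setting


def resolve_run_name(base_name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Decide which run_name to hand to MLFlowLogger (hypothetical helper)."""
    env = os.environ if env is None else env
    value = env.get(ENV_VAR)

    if value is None or value.lower() == 'false':
        # Current behaviour (imitated): append a random suffix.
        suffix = ''.join(random.choices(string.ascii_letters, k=6))
        return f'{base_name}-{suffix}'
    if value.lower() == 'true':
        # Variant 1: keep the YAML run_name exactly as written.
        return base_name
    # Variant 2: the variable carries the full run name to resume.
    return value


print(resolve_run_name('my-test', {ENV_VAR: 'True'}))
print(resolve_run_name('my-test', {ENV_VAR: 'my-run-asdasd'}))
```

With either variant set, the name reaching the `tags.run_name` filter is stable across launches, so a resumed job can find the earlier run.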

Additional context

This is for a use case where we run training on the MosaicML platform and log to MLFlow on the Databricks platform.
Checkpointing works fine, but the loss logging is wrong and split across runs, because the mismatched random run_name forces MLFlow to create a new run ID for the resumed training.


mvpatel2000 · 2024-05-07T21:20:27Z

@wolliq would you mind sharing your YAML? Are you saying you directly pass the run name to the MLflow logger but it is always overridden?

wolliq added the enhancement label May 7, 2024

wolliq changed the title ~~Enable control over MLFlowLogger run_name str to use a pre-existing run_name and resume a model training~~ Enable control over MLFlowLogger run_name str to match a pre-existing tag run_name in MLFlow and resume model training May 7, 2024
