Add Goodput & Badput recording and monitoring support. #783
base: main
Conversation
Thanks!
I don't have access to this link. Can you provide an example that we can take a look at?
axlearn/cloud/gcp/measurement.py
# Instantiate ml-goodput-measurement's GoodputMonitor
# to asynchronously calculate goodput and badput at
# the upload_interval and upload to the specified
# tensorboard directory.
Convert this into a docstring for the method? BTW, we use 100 line length.
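For instance, a possible shape for that docstring, kept within the 100-character limit (the method name and signature here are illustrative, not necessarily the PR's):

```python
def start_monitoring(self, *args, **kwargs):
    """Starts Goodput monitoring of the job.

    Instantiates ml-goodput-measurement's GoodputMonitor, which asynchronously computes
    goodput and badput at `upload_interval` and uploads the results to the configured
    TensorBoard directory.
    """
```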
fv.mark_as_parsed()

recorder = GoodputRecorder.from_flags(fv)
recorder._monitor = None  # Ensure _monitor is initially None
Suggested change:
recorder._monitor = None  # Ensure _monitor is initially None
self.assertIsNone(recorder._monitor)  # Ensure _monitor is initially None
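Spelled out, the test around this suggestion might look roughly like the following sketch; `define_flags` is an assumed helper for registering the recorder's flags and may not match the PR's exact setup:

```python
from absl import flags

def test_from_flags_monitor_starts_unset(self):
    fv = flags.FlagValues()
    GoodputRecorder.define_flags(fv)  # Assumption: a helper that registers the recorder's flags.
    fv.mark_as_parsed()

    recorder = GoodputRecorder.from_flags(fv)
    # The monitor should not be created eagerly; it starts as None until monitoring begins.
    self.assertIsNone(recorder._monitor)
```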
Please feel free to "re-request review" when ready. Thanks!
@@ -324,6 +325,7 @@ def __init__(
            model=self.model,
            model_param_partition_specs=model_param_partition_specs,
        )
        self._maybe_record_event(measurement.Event.END_ACCELERATOR_INIT)
Can you clarify what accelerator init is supposed to capture? E.g., would utils_spmd, where we call jax distributed init, be more appropriate?
This is supposed to capture device-related initialization such as device scanning, mesh initialization, device reinit/reset, security setup, initialization of pre-mapped buffers, etc. You are right, jax distributed init should be included here.
I would lean on your team to update/re-position the record calls to the locations that seem the best fit for this codebase.
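As a purely illustrative sketch of that re-positioning (the `START_ACCELERATOR_INIT` event, the module-level `record_event` helper, and the import path are assumptions, not confirmed by this PR):

```python
import jax
from axlearn.common import measurement


def setup_distributed(**kwargs):
    # Assumed names and placement: bracket jax distributed init, device scanning,
    # and mesh initialization with accelerator-init events.
    measurement.record_event(measurement.Event.START_ACCELERATOR_INIT)
    jax.distributed.initialize(**kwargs)
    # ... device scanning, mesh setup, pre-mapped buffer initialization ...
    measurement.record_event(measurement.Event.END_ACCELERATOR_INIT)
```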
Thanks @dipannita08! WDYT about removing the trainer.py changes for now, so that we can add them in a follow-up PR? This PR can focus on adding the scaffolding in measurement.py.
@@ -847,6 +850,7 @@ def _prepare_training(self, prng_key: Tensor) -> bool:
        with fs.open(os.path.join(cfg.dir, "model_analysis.txt"), "w") as f:
            f.write(model_analysis)

        self._maybe_record_event(measurement.Event.END_TRAINING_PREPARATION)
What's usually considered part of "training preparation"? Should we count the jit compilation below as a potentially substantial part of it? What about the checkpoint restoration above?
This would include the time spent on the creation of checkpoint managers, checkpoint loading, running mesh and model optimizers, etc.
JIT compilation is currently computed algorithmically from the entire timeline of events (the other recorded logs) and is included in the "program startup" badput, which is meant to measure the time spent on framework-specific function transformations (such as JAX tracing), compilation tasks, runtime initialization, etc.
For now, checkpoint restoration is not included in this bucket; the expectation is that the next version of the library (v0.0.5) will have more definitive recorder and calculator APIs for checkpoint save and restore badput.
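For illustration only, the bracketing described above might look like this inside `_prepare_training` (assuming a `START_TRAINING_PREPARATION` counterpart to the `END_TRAINING_PREPARATION` event shown in the diff; the helper name below is hypothetical):

```python
def _prepare_training(self, prng_key):
    self._maybe_record_event(measurement.Event.START_TRAINING_PREPARATION)
    # Checkpoint manager creation, checkpoint loading, mesh/model optimizer setup, etc.
    trainer_state = self._restore_or_init_state(prng_key)  # Hypothetical helper name.
    self._maybe_record_event(measurement.Event.END_TRAINING_PREPARATION)
    # JIT compilation is not bracketed explicitly here; the library attributes it to
    # "program startup" badput based on the timeline of the other recorded events.
    return trainer_state is not None
```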
@@ -883,6 +887,7 @@ def restore_checkpoint(self, restore_step: Optional[int] = None) -> Optional[int]:
        restore_input_iter = cfg.save_input_iterator
        try:
            # Try to restore with `input_iter`.
            self._maybe_record_event(measurement.Event.START_DATA_LOADING)
Hm, is data loading in relation to the input loading or the checkpoint or both? Here it seems to only capture the checkpoint restoration?
Data loading is in relation to the input data only. Checkpoint restore would be recorded, and its badput computed, separately with changes coming in the next version of the Goodput package. Please feel free to update the location of the record calls; specifically for data loading, I wasn't sure where the most appropriate place to put it would be.
Thanks for your help!
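One possible placement for the input-data events, sketched only as an illustration (the `END_DATA_LOADING` event and the loop/helper names here are assumptions, not the PR's final placement):

```python
for step in range(start_step, max_step):
    self._maybe_record_event(measurement.Event.START_DATA_LOADING)
    input_batch = next(input_iterator)  # Host-side input pipeline work only.
    self._maybe_record_event(measurement.Event.END_DATA_LOADING)
    self._run_step(input_batch)  # The compute step itself is not counted as data loading.
```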
Ack.
This change adds the following:
Tested: