add time units to throughput metrics and also secs_per_step metric. #3693

Open: wants to merge 1 commit into base: main

9 changes: 9 additions & 0 deletions composer/callbacks/speed_monitor.py
@@ -258,6 +258,7 @@ def __init__(
        self.history_flops: deque[float] = deque(maxlen=window_size + 1)

        self.gpu_flops_available = gpu_flops_available
        self.time_unit = time_unit

        self.divider = 1
        if time_unit == 'seconds':
@@ -360,10 +361,18 @@ def batch_end(self, state: State, logger: Logger):
        # Log the time
        # `state.timestamp` excludes any time spent in evaluation
        train_wct = state.timestamp.total_wct.total_seconds()
        secs_per_step = 0
        if len(self.history_wct) > 1:
            secs_per_step = self.history_wct[-1] - self.history_wct[-2]
Comment on lines +364 to +366
Contributor

This should probably average over history_wct, since all the other metrics respect window_size. Something like:
elapsed_wct = (self.history_wct[-1] - self.history_wct[0]) / (len(self.history_wct) - 1)
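
To make the difference concrete, here is a minimal standalone sketch of the windowed average suggested above, next to the last-interval value the PR currently logs. The helper name `windowed_secs_per_step` and the sample timestamps are made up for illustration; only `history_wct` and `window_size` come from the callback.

```python
from collections import deque


def windowed_secs_per_step(history_wct: deque) -> float:
    """Average seconds per step across the wall-clock samples in the window.

    Hypothetical helper for illustration; not part of composer.
    """
    # At least two samples are needed to measure one step interval.
    if len(history_wct) < 2:
        return 0.0
    # N samples span N - 1 step intervals.
    return (history_wct[-1] - history_wct[0]) / (len(history_wct) - 1)


window_size = 3
# Same sizing as the callback's history deques: window_size + 1 samples.
history_wct = deque(maxlen=window_size + 1)
for wct in [0.0, 1.0, 3.0, 6.0, 10.0]:
    history_wct.append(wct)

avg = windowed_secs_per_step(history_wct)   # averaged over the window
last = history_wct[-1] - history_wct[-2]    # what the PR logs today
print(avg, last)
```

With steps that get slower over time, the last-interval value jumps around more than the windowed one, which is the inconsistency with the other `window_size`-based metrics.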


        logger.log_metrics({
            'time/train': train_wct / self.divider,
            'time/val': self.total_eval_wct / self.divider,
            'time/total': (train_wct + self.total_eval_wct) / self.divider,
            f'time_{self.time_unit}/train': train_wct / self.divider,
            f'time_{self.time_unit}/val': self.total_eval_wct / self.divider,
            f'time_{self.time_unit}/total': (train_wct + self.total_eval_wct) / self.divider,
Comment on lines 369 to +374
Contributor

Let's just rename time/train -> time/train/hours?

Changing log keys is a little annoying for end users... but duplicate logging is more annoying imo.

Contributor

@dakinggg Nov 1, 2024

@mvpatel2000 I was a bit worried that would be a breaking change for any downstream pipelines that process metrics.

Contributor

Discussed offline: it's better to avoid overwhelming loggers.
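
For reference, a minimal sketch of the single-key option raised in this thread, where the unit lives in the key and nothing is logged twice. The helper `time_metrics`, the `SECONDS_PER_UNIT` table, and the `time/train/<unit>` key layout are assumptions for illustration, not what the callback does today.

```python
# Hypothetical helper, not part of composer. The divisors are assumed
# conversions from seconds to each supported time unit.
SECONDS_PER_UNIT = {'seconds': 1, 'minutes': 60, 'hours': 3600, 'days': 86400}


def time_metrics(train_wct: float, eval_wct: float, time_unit: str = 'hours') -> dict:
    """Build one set of time metrics whose keys carry the unit."""
    divider = SECONDS_PER_UNIT[time_unit]
    return {
        f'time/train/{time_unit}': train_wct / divider,
        f'time/val/{time_unit}': eval_wct / divider,
        f'time/total/{time_unit}': (train_wct + eval_wct) / divider,
    }


metrics = time_metrics(train_wct=7200.0, eval_wct=1800.0, time_unit='hours')
print(metrics)
```

The trade-off is the one discussed above: a single set of keys keeps logger output small, but any downstream pipeline reading `time/train` would need to be updated.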

            'time/secs_per_step': secs_per_step,
        })

    def eval_end(self, state: State, logger: Logger):