Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shuffle metrics 3/4: Capture background metrics #8366

Merged
merged 1 commit into from
Dec 20, 2023

Conversation

crusaderky
Copy link
Collaborator

@crusaderky crusaderky commented Nov 19, 2023

Please read: #7943 (comment)
There are three commits in this PR. All but the last are the previous PRs in the chain.

Design note

There were two orthogonal ways to deliver this.

The first one (which I tried and scrapped) was collecting the metrics locally in a ShuffleRun.metrics attribute and then used the heartbeat system to propagate them. ShuffleSchedulerPlugin.heartbeat became substantially more complicated, as it had to propagate the data locally to Scheduler.cumulative_worker_metrics as well as to the spans extension. The biggest flaw was that it was completely losing the metrics generated between a heartbeat and the shuffle being deleted on the scheduler side. In order to remediate I would have had to implement some sort of shuffle_id -> span_id mapping for recently deleted runs.

The second design is what you see in this PR. It feels to me a bit more coherent with what we already have (e.g. how Worker.rpc is already passed to the ShuffleRun).

TBH both designs feel a bit hacky in their own way. I don't have a strong preference for either.

Copy link
Contributor

github-actions bot commented Nov 20, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       27 files  ±  0         27 suites  ±0   11h 55m 31s ⏱️ + 17m 15s
  3 948 tests +  4    3 833 ✔️  -   2     110 💤 +1    5 +  5 
49 659 runs  +37  47 344 ✔️ +19  2 301 💤 +4  14 +14 

For more details on these failures, see this check.

Results for commit bc7e57f. ± Comparison against base commit 4d41d32.

♻️ This comment has been updated with latest results.

@crusaderky crusaderky force-pushed the shuffle/metrics3 branch 3 times, most recently from 0bbdbdc to b206f82 Compare November 30, 2023 16:29
@crusaderky crusaderky force-pushed the shuffle/metrics3 branch 2 times, most recently from 91fcf7d to 83d1a0d Compare December 12, 2023 18:18
@crusaderky crusaderky force-pushed the shuffle/metrics3 branch 2 times, most recently from 3986e89 to 7cc4088 Compare December 18, 2023 14:07
Copy link
Member

@hendrikmakait hendrikmakait left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @crusaderky!

@crusaderky crusaderky force-pushed the shuffle/metrics3 branch 6 times, most recently from 42f4cef to ce97a9e Compare December 20, 2023 00:08
@crusaderky crusaderky merged commit 9273186 into dask:main Dec 20, 2023
29 checks passed
@crusaderky crusaderky deleted the shuffle/metrics3 branch December 20, 2023 00:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use metering for P2P shuffling instrumentation
2 participants