Shuffle metrics 3/4: Capture background metrics #8366
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Please read: #7943 (comment)
There are three commits in this PR. All but the last are the previous PRs in the chain.
Design note
There were two orthogonal ways to deliver this.
The first one (which I tried and scrapped) was collecting the metrics locally in a
ShuffleRun.metrics
attribute and then used the heartbeat system to propagate them.ShuffleSchedulerPlugin.heartbeat
became substantially more complicated, as it had to propagate the data locally toScheduler.cumulative_worker_metrics
as well as to the spans extension. The biggest flaw was that it was completely losing the metrics generated between a heartbeat and the shuffle being deleted on the scheduler side. In order to remediate I would have had to implement some sort of shuffle_id -> span_id mapping for recently deleted runs.The second design is what you see in this PR. It feels to me a bit more coherent with what we already have (e.g. how
Worker.rpc
is already passed to the ShuffleRun).TBH both designs feel a bit hacky in their own way. I don't have a strong preference for either.