Instrumenting infrastructure with stall detection #832

akoshelev · 2023-11-03T06:47:07Z

Overview

This builds off #765 adopting it to the current state of IPA infrastructure.

This change brings some capabilities to the infrastructure to detect stalls. Because it has a non-trivial cost, it can be disabled via feature flag. This flag is not enabled by default for helper release builds because we use --no-default-features.

Because I touched some parts of the code, some unrelated changes are also included (sorry). I got rid of T inside Gateway because we don't actually need it and I wrote a small macro to make features mutually incompatible. I used that macro for stall-detection feature but later decided not to. I think macro can still be useful, so I thought it may be a good idea to keep it.

Implementation details

At the root level, we keep a counter that gets incremented each time there is a send or a receive request. Along with the actual Gateway instance, we spawn an observer task that periodically checks the value of this counter and compares it with the last value it read. If they are the same, it considers the gateway stalled and fetches its state.

Every object inside Gateway can report tasks that are currently awaiting completion. If there are no tasks, watcher does not print anything as it considers stall to occur somewhere else.

Shuttle tests do not require stall detection because Shuttle executor can catch those occurrences.

Testing

I only performed manual testing, I don't see a way to test it properly without something like https://github.com/tokio-rs/tracing/tree/master/tracing-mock which is currently not published. Alternative may be to plumb IO all the way down from test world/helper setup to gateway but this seems too clunky.

Here is an example of reported stall

2023-11-03T06:34:36.879502Z  WARN stall_detector{role=H3}: ipa::helpers::gateway::stall_detection::gateway: Helper is stalled sn=16474 state=
{
"step=protocol/binary_validator/row1/attributed_breakdown_key/bit0", from=H2. Waiting to receive records ["[0..914]"].
"step=protocol/binary_validator/row1/attributed_breakdown_key/bit1", from=H2. Waiting to receive records ["[0..914]"].
"step=protocol/binary_validator/row1/attributed_breakdown_key/bit2", from=H2. Waiting to receive records ["[0..914]"].
"step=protocol/binary_validator/row1/attributed_breakdown_key/bit3", from=H2. Waiting to receive records ["[0..914]"].
"step=protocol/binary_validator/row1/attributed_breakdown_key/bit4", from=H2. Waiting to receive records ["[0..914]"].
"step=protocol/binary_validator/row1/attributed_breakdown_key/bit5", from=H2. Waiting to receive records ["[0..914]"].
"step=protocol/binary_validator/row1/attributed_breakdown_key/bit6", from=H2. Waiting to receive records ["[0..914]"].
"step=protocol/binary_validator/row1/attributed_breakdown_key/bit7", from=H2. Waiting to receive records ["[0..914]"].
"step=protocol/binary_validator/row1/ever_encountered_source_event", from=H2. Waiting to receive records ["[0..914]"].
}

andyleiserson

Does it make sense to add the stall-detection config to CI?

andyleiserson · 2023-11-06T18:27:43Z

src/helpers/gateway/stall_detection.rs

+                |range| match (*range.end() - *range.start()).cmp(&U::from(1)) {
+                    std::cmp::Ordering::Less => format!("{}", range.start()),
+                    std::cmp::Ordering::Equal => format!("[{}, {}]", range.start(), range.end()),
+                    std::cmp::Ordering::Greater => format!("[{}..{}]", range.start(), range.end()),


It makes the output noisier, but using ..= here might avoid confusion.

I don't mind adding it, I used the math notation where ] means inclusive and ) exclusive. Let me know if you think it will still be worth adding =

src/helpers/gateway/mod.rs

andyleiserson · 2023-11-06T18:54:09Z

I was running into stalls when working on the vectorization stuff. I didn't diagnose it carefully (these changes would have helped), but my working theory was that it was due to problems where the sending queues didn't reach the flush threshold, before something got blocked waiting for the pending send data. It seems like there is a non-obvious property of the circuit, something like "cycle length", describing when outputs feed back to inputs, and if the flush threshold is set higher than the cycle length, evaluation will deadlock.

Requiring manual tuning to maximize performance without crossing the deadlock limit seems undesirable. I wonder if there is some kind of auto tuning that could manage the amount of work in flight. Not sure exactly what that would look like, though.

Co-authored-by: Andy Leiserson <[email protected]>

akoshelev · 2023-11-06T21:00:48Z

Does it make sense to add the stall-detection config to CI?

so it is already there because stall-detection is enabled by default. We also have some tests running with --no-default-features so I believe we are covered here

akoshelev · 2023-11-06T23:33:25Z

Requiring manual tuning to maximize performance without crossing the deadlock limit seems undesirable. I wonder if there is some kind of auto tuning that could manage the amount of work in flight. Not sure exactly what that would look like, though.

I agree that it is currently error-prone and highly cumbersome to configure buffers capacity. Similarly I don't see a clear way to auto tune it without running into performance issues. Setting active_work = 1 will avoid the deadlocks but will have bad performance profile which is hard to notice as you are working on the protocols.

akoshelev added 2 commits November 2, 2023 23:44

Infra stall detection

deceecf

Fix formatting

0b23471

andyleiserson approved these changes Nov 6, 2023

View reviewed changes

Update src/helpers/gateway/mod.rs

88d208d

Co-authored-by: Andy Leiserson <[email protected]>

Merge from main

2a50bd6

akoshelev merged commit 079b476 into private-attribution:main Nov 6, 2023
4 of 6 checks passed

akoshelev deleted the idle-infra-3 branch November 6, 2023 23:42

akoshelev mentioned this pull request Nov 8, 2023

Idle tracking in gate way #765

Closed

akoshelev mentioned this pull request Jan 3, 2024

Make Infra print internal state if stuck again #536

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Instrumenting infrastructure with stall detection #832

Instrumenting infrastructure with stall detection #832

akoshelev commented Nov 3, 2023 •

edited

Loading

andyleiserson left a comment

andyleiserson Nov 6, 2023

akoshelev Nov 6, 2023

andyleiserson commented Nov 6, 2023

akoshelev commented Nov 6, 2023

akoshelev commented Nov 6, 2023

Instrumenting infrastructure with stall detection #832

Instrumenting infrastructure with stall detection #832

Conversation

akoshelev commented Nov 3, 2023 • edited Loading

Overview

Implementation details

Testing

andyleiserson left a comment

Choose a reason for hiding this comment

andyleiserson Nov 6, 2023

Choose a reason for hiding this comment

akoshelev Nov 6, 2023

Choose a reason for hiding this comment

andyleiserson commented Nov 6, 2023

akoshelev commented Nov 6, 2023

akoshelev commented Nov 6, 2023

akoshelev commented Nov 3, 2023 •

edited

Loading