Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Instrumenting infrastructure with stall detection #832

Merged
merged 4 commits into from
Nov 6, 2023

Conversation

akoshelev
Copy link
Collaborator

@akoshelev akoshelev commented Nov 3, 2023

Overview

This builds off #765 adopting it to the current state of IPA infrastructure.

This change brings some capabilities to the infrastructure to detect stalls. Because it has a non-trivial cost, it can be disabled via feature flag. This flag is not enabled by default for helper release builds because we use --no-default-features.

Because I touched some parts of the code, some unrelated changes are also included (sorry). I got rid of T inside Gateway because we don't actually need it and I wrote a small macro to make features mutually incompatible. I used that macro for stall-detection feature but later decided not to. I think macro can still be useful, so I thought it may be a good idea to keep it.

Implementation details

At the root level, we keep a counter that gets incremented each time there is a send or a receive request. Along with the actual Gateway instance, we spawn an observer task that periodically checks the value of this counter and compares it with the last value it read. If they are the same, it considers the gateway stalled and fetches its state.

Every object inside Gateway can report tasks that are currently awaiting completion. If there are no tasks, watcher does not print anything as it considers stall to occur somewhere else.

Shuttle tests do not require stall detection because Shuttle executor can catch those occurrences.

Testing

I only performed manual testing, I don't see a way to test it properly without something like https://github.com/tokio-rs/tracing/tree/master/tracing-mock which is currently not published. Alternative may be to plumb IO all the way down from test world/helper setup to gateway but this seems too clunky.

Here is an example of reported stall

2023-11-03T06:34:36.879502Z  WARN stall_detector{role=H3}: ipa::helpers::gateway::stall_detection::gateway: Helper is stalled sn=16474 state=
{
"step=protocol/binary_validator/row1/attributed_breakdown_key/bit0", from=H2. Waiting to receive records ["[0..914]"].
"step=protocol/binary_validator/row1/attributed_breakdown_key/bit1", from=H2. Waiting to receive records ["[0..914]"].
"step=protocol/binary_validator/row1/attributed_breakdown_key/bit2", from=H2. Waiting to receive records ["[0..914]"].
"step=protocol/binary_validator/row1/attributed_breakdown_key/bit3", from=H2. Waiting to receive records ["[0..914]"].
"step=protocol/binary_validator/row1/attributed_breakdown_key/bit4", from=H2. Waiting to receive records ["[0..914]"].
"step=protocol/binary_validator/row1/attributed_breakdown_key/bit5", from=H2. Waiting to receive records ["[0..914]"].
"step=protocol/binary_validator/row1/attributed_breakdown_key/bit6", from=H2. Waiting to receive records ["[0..914]"].
"step=protocol/binary_validator/row1/attributed_breakdown_key/bit7", from=H2. Waiting to receive records ["[0..914]"].
"step=protocol/binary_validator/row1/ever_encountered_source_event", from=H2. Waiting to receive records ["[0..914]"].
}

Copy link
Collaborator

@andyleiserson andyleiserson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to add the stall-detection config to CI?

|range| match (*range.end() - *range.start()).cmp(&U::from(1)) {
std::cmp::Ordering::Less => format!("{}", range.start()),
std::cmp::Ordering::Equal => format!("[{}, {}]", range.start(), range.end()),
std::cmp::Ordering::Greater => format!("[{}..{}]", range.start(), range.end()),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes the output noisier, but using ..= here might avoid confusion.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind adding it, I used the math notation where ] means inclusive and ) exclusive. Let me know if you think it will still be worth adding =

src/helpers/gateway/mod.rs Outdated Show resolved Hide resolved
@andyleiserson
Copy link
Collaborator

I was running into stalls when working on the vectorization stuff. I didn't diagnose it carefully (these changes would have helped), but my working theory was that it was due to problems where the sending queues didn't reach the flush threshold, before something got blocked waiting for the pending send data. It seems like there is a non-obvious property of the circuit, something like "cycle length", describing when outputs feed back to inputs, and if the flush threshold is set higher than the cycle length, evaluation will deadlock.

Requiring manual tuning to maximize performance without crossing the deadlock limit seems undesirable. I wonder if there is some kind of auto tuning that could manage the amount of work in flight. Not sure exactly what that would look like, though.

@akoshelev
Copy link
Collaborator Author

Does it make sense to add the stall-detection config to CI?

so it is already there because stall-detection is enabled by default. We also have some tests running with --no-default-features so I believe we are covered here

@akoshelev
Copy link
Collaborator Author

Requiring manual tuning to maximize performance without crossing the deadlock limit seems undesirable. I wonder if there is some kind of auto tuning that could manage the amount of work in flight. Not sure exactly what that would look like, though.

I agree that it is currently error-prone and highly cumbersome to configure buffers capacity. Similarly I don't see a clear way to auto tune it without running into performance issues. Setting active_work = 1 will avoid the deadlocks but will have bad performance profile which is hard to notice as you are working on the protocols.

@akoshelev akoshelev merged commit 079b476 into private-attribution:main Nov 6, 2023
4 of 6 checks passed
@akoshelev akoshelev deleted the idle-infra-3 branch November 6, 2023 23:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants