Instrumenting infrastructure with stall detection #832
Conversation
Does it make sense to add the `stall-detection` config to CI?
|range| match (*range.end() - *range.start()).cmp(&U::from(1)) {
    std::cmp::Ordering::Less => format!("{}", range.start()),
    std::cmp::Ordering::Equal => format!("[{}, {}]", range.start(), range.end()),
    std::cmp::Ordering::Greater => format!("[{}..{}]", range.start(), range.end()),
It makes the output noisier, but using `..=` here might avoid confusion.
I don't mind adding it. I used the math notation where `]` means inclusive and `)` means exclusive. Let me know if you think it will still be worth adding the `=`.
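For concreteness, here is a hedged sketch of what the `..=` rendering could look like. It is illustrative only, not the code in this PR; `render_range` is a made-up helper over plain `u32` bounds rather than the generic range used above.

```rust
// Illustrative sketch: render an inclusive range with Rust's `..=` syntax
// instead of math-style brackets, so the reader does not need to remember
// which bracket means inclusive.
fn render_range(start: u32, end: u32) -> String {
    if start == end {
        format!("{start}")
    } else {
        format!("{start}..={end}")
    }
}

fn main() {
    assert_eq!(render_range(3, 3), "3");
    assert_eq!(render_range(3, 7), "3..=7");
}
```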
I was running into stalls when working on the vectorization stuff. I didn't diagnose it carefully (these changes would have helped), but my working theory was that it was due to problems where the sending queues didn't reach the flush threshold before something got blocked waiting for the pending send data. It seems like there is a non-obvious property of the circuit, something like "cycle length", describing when outputs feed back to inputs, and if the flush threshold is set higher than the cycle length, evaluation will deadlock. Requiring manual tuning to maximize performance without crossing the deadlock limit seems undesirable. I wonder if there is some kind of auto-tuning that could manage the amount of work in flight. Not sure exactly what that would look like, though.
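To make that failure mode concrete, here is a toy model of the theory above. It is hypothetical, not IPA code; `simulate`, `cycle_length`, and `flush_threshold` are made-up names for a single feedback loop whose send buffer only flushes once it holds `flush_threshold` records.

```rust
// Toy model, not IPA code: producing record i requires record
// i - cycle_length to have come back around the loop, and buffered sends
// are only flushed once `flush_threshold` records have accumulated.
fn simulate(cycle_length: u64, flush_threshold: u64, total: u64) -> bool {
    let mut produced = 0;  // records the circuit has produced so far
    let mut buffered = 0;  // records sitting in the send buffer, not yet flushed
    let mut delivered = 0; // records that have been flushed and fed back

    loop {
        if produced == total {
            return true; // all work finished
        }
        if produced < delivered + cycle_length {
            // The input this record depends on is available: produce and buffer it.
            produced += 1;
            buffered += 1;
        } else if buffered >= flush_threshold {
            // Threshold reached: flush, feeding the buffered records back.
            delivered += buffered;
            buffered = 0;
        } else {
            // Cannot produce (waiting on feedback) and cannot flush
            // (buffer below threshold): the circuit is stalled.
            return false;
        }
    }
}

fn main() {
    assert!(simulate(16, 16, 1_000));  // threshold <= cycle length: completes
    assert!(!simulate(16, 32, 1_000)); // threshold > cycle length: stalls
}
```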
Co-authored-by: Andy Leiserson <[email protected]>
so it is already there because
I agree that it is currently error-prone and highly cumbersome to configure buffer capacity. Similarly, I don't see a clear way to auto-tune it without running into performance issues. Setting
Overview
This builds off #765, adapting it to the current state of the IPA infrastructure.
This change brings some capabilities to the infrastructure to detect stalls. Because it has a non-trivial cost, it can be disabled via a feature flag. This flag is not enabled by default for helper release builds because we use `--no-default-features`.

Because I touched some parts of the code, some unrelated changes are also included (sorry). I got rid of `T` inside `Gateway` because we don't actually need it, and I wrote a small macro to make features mutually incompatible. I used that macro for the `stall-detection` feature but later decided not to. I think the macro can still be useful, so I thought it may be a good idea to keep it.
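For illustration, this is the kind of guard such a macro can expand to. It is a hedged sketch of the underlying `compile_error!` idea, not the PR's actual macro, and the `"shuttle"` feature name is an assumption used only as an example pair.

```rust
// Illustrative only: reject builds that enable two features that are not
// meant to be combined. A macro can wrap this pattern so each incompatible
// pair is declared in a single line.
#[cfg(all(feature = "stall-detection", feature = "shuttle"))]
compile_error!("`stall-detection` and `shuttle` cannot be enabled together");
```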
Implementation details
At the root level, we keep a counter that gets incremented each time there is a send or a receive request. Along with the actual `Gateway` instance, we spawn an observer task that periodically checks the value of this counter and compares it with the last value it read. If they are the same, it considers the gateway stalled and fetches its state.

Every object inside `Gateway` can report the tasks that are currently awaiting completion. If there are no tasks, the watcher does not print anything, as it considers the stall to have occurred somewhere else.

Shuttle tests do not require stall detection because the Shuttle executor can catch those occurrences.
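A minimal sketch of the counter-plus-observer pattern described above, assuming a tokio runtime and the tracing crate; the type and function names here are illustrative, not the actual IPA types.

```rust
// A minimal sketch, not the actual IPA types: a shared activity counter that
// send/receive paths bump, and an observer task that flags the gateway as
// stalled when the counter stops moving between checks.
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::time::Duration;

#[derive(Default)]
pub struct ActivityCounter(AtomicUsize);

impl ActivityCounter {
    /// Called on every send or receive request.
    pub fn touch(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }

    pub fn value(&self) -> usize {
        self.0.load(Ordering::Relaxed)
    }
}

/// Observer task spawned alongside the gateway.
pub async fn observe_stalls(counter: Arc<ActivityCounter>, check_interval: Duration) {
    let mut last_seen = counter.value();
    loop {
        tokio::time::sleep(check_interval).await;
        let current = counter.value();
        if current == last_seen {
            // In the real implementation this is where the gateway state
            // (tasks currently awaiting completion) would be fetched and printed.
            tracing::warn!("gateway appears stalled: no sends or receives observed");
        }
        last_seen = current;
    }
}
```

In this sketch, the only wiring assumed is spawning `observe_stalls` with `tokio::spawn` next to the gateway and calling `touch` from the send and receive paths.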
Testing
I only performed manual testing. I don't see a way to test this properly without something like https://github.com/tokio-rs/tracing/tree/master/tracing-mock, which is currently not published. An alternative may be to plumb IO all the way down from the test world/helper setup to the gateway, but this seems too clunky.
Here is an example of a reported stall: