High caf thread utilization with 512 workers and 96 loggers #352
Comments
The following supervisor setup, starting 80 workers and 12 loggers and setting …
On Slack, the user provided a new flamegraph that looks promising: there's ~55% of the time spent in … It can probably be assumed that …
Nice, thanks for doing all that digging. 👍
I can somewhat reproduce this on a bigger machine with the original reproducer, but also trigger a high percentage of …
Thanks, Arne. That's good to know! I'll ping you when I have a patch ready to try.
@awelzel, can you try …? I've written a small benchmark that pushes 1k messages (integers) to up to 1024 observers. With the patch, the runtime drops significantly. Hopefully the synthetic benchmark also translates to real results in Broker.
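The patch and the benchmark itself aren't shown in the thread. As a rough sketch of the shape of such a measurement, the following uses plain CAF actor messaging (not Broker's internal flow machinery, where `on_consumed_data()` lives); the message count and observer counts mirror the description above, everything else is illustrative:

```cpp
// observer_bench.cpp -- illustrative sketch, not the benchmark from this thread.
#include <chrono>
#include <cstdint>
#include <iostream>
#include <memory>
#include <vector>

#include "caf/actor_system.hpp"
#include "caf/caf_main.hpp"
#include "caf/event_based_actor.hpp"
#include "caf/scoped_actor.hpp"

using namespace caf;

// A trivial observer that quits after receiving `limit` integers.
behavior observer_impl(event_based_actor* self, int32_t limit) {
  auto received = std::make_shared<int32_t>(0);
  return {
    [self, received, limit](int32_t) {
      if (++*received == limit)
        self->quit();
    },
  };
}

void caf_main(actor_system& sys) {
  constexpr int32_t num_messages = 1000; // "1k messages (integers)"
  for (auto num_observers : {64, 256, 1024}) {
    std::vector<actor> observers;
    observers.reserve(num_observers);
    for (int i = 0; i < num_observers; ++i)
      observers.emplace_back(sys.spawn(observer_impl, num_messages));
    scoped_actor self{sys};
    auto start = std::chrono::steady_clock::now();
    // Fan the integers out to every observer.
    for (int32_t msg = 0; msg < num_messages; ++msg)
      for (auto& obs : observers)
        self->send(obs, msg);
    // Block until every observer has terminated.
    for (auto& obs : observers)
      self->wait_for(obs);
    auto stop = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    std::cout << num_observers << " observers: " << ms.count() << " ms\n";
  }
}

CAF_MAIN()
```

Running something like this before and after a candidate patch gives a quick A/B signal on per-observer dispatch overhead without standing up a full Zeek cluster.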
From Slack, the suspicion is that the loggers stop accepting log messages, causing the workers to run out of memory. The user reports that this is not happening with their vanilla 5.2.0 deployment.
This triggered in the user's environment only after ~20 hours with a pretty large cluster under production load, so it's unclear what the chances are of reproducing it in a testing environment, but we could give it a shot.
Thank you for taking care of playing this back to the reporter. 🙂 Let me try to reproduce this bug with a stress test. I'll ping you if I have a new version to test.
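The stress test isn't included in the thread. As one possible ingredient, a small flooding publisher built on Broker's public C++ API could emulate many workers hammering a logger-like subscriber; the host, port, and topic below are placeholders:

```cpp
// broker_flood.cpp -- hypothetical stress publisher; names and defaults are made up.
#include <cstdint>
#include <cstdlib>
#include <string>

#include <broker/data.hh>
#include <broker/endpoint.hh>
#include <broker/topic.hh>

int main(int argc, char** argv) {
  std::string host = argc > 1 ? argv[1] : "127.0.0.1";
  auto port = static_cast<uint16_t>(argc > 2 ? std::atoi(argv[2]) : 9999);
  broker::endpoint ep;
  // Blocks until the peering is established (or fails).
  if (!ep.peer(host, port))
    return 1;
  broker::topic stress{"zeek/stress"}; // placeholder topic
  // Publish small messages as fast as possible; running many instances of
  // this process approximates hundreds of workers pushing log writes.
  for (uint64_t seq = 0;; ++seq)
    ep.publish(stress, broker::data{seq});
}
```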
On the zeekorg Slack, users reported seeing high logger CPU usage in a cluster of 512 Zeek workers and 96 Zeek loggers (distributed over multiple physical systems).
A `perf top` and a flamegraph of a single logging process indicate that most time is spent in the CAF layer, roughly ~72% in `on_consumed_data()`. The stacks might not be fully valid. In contrast, Zeek's logging threads, librdkafka threads, and LZ4 compression represent ~10% of the profile on the right. This looks like a pathological case being triggered.