A single unresponsive peer brings Broker to a halt #426

Open
Neverlord opened this issue Sep 14, 2024 · 0 comments

Broker uses credit-based flow control to push data to connected peers. If a peer stops granting new credit, the pipeline stops. We have also found that a stopped pipeline seems to struggle to get "unblocked" again. The deeper issue, however, is that Zeek does not tap into the flow processing and always assumes it can publish data on its topics. We need some way of coping with overloaded (or otherwise unresponsive) nodes.
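
To make the stall concrete, here is a minimal toy model of credit-based flow control in Python. This is not Broker's implementation: `Peer`, `Publisher`, and all of their methods are hypothetical names made up for illustration, and the assumption that one shared queue feeds every peer is a simplification. The sketch only shows how a single peer that stops granting credit can halt delivery to everyone while the producer keeps enqueueing.

```python
from collections import deque


class Peer:
    """Hypothetical peer that accepts one message per unit of credit."""

    def __init__(self, name, credit):
        self.name = name
        self.credit = credit      # number of messages the peer will accept
        self.received = []

    def deliver(self, msg):
        self.credit -= 1
        self.received.append(msg)

    def grant(self, n):
        self.credit += n


class Publisher:
    """Toy pipeline: a message leaves the queue only if every peer has credit."""

    def __init__(self, peers):
        self.peers = peers
        self.pending = deque()

    def publish(self, msg):
        # The producer never blocks; it only enqueues and tries to flush.
        self.pending.append(msg)
        self.flush()

    def flush(self):
        while self.pending and all(p.credit > 0 for p in self.peers):
            msg = self.pending.popleft()
            for p in self.peers:
                p.deliver(msg)


def demo():
    fast = Peer("fast", credit=2)
    slow = Peer("slow", credit=2)
    pub = Publisher([fast, slow])
    for i in range(5):
        pub.publish(i)
        fast.grant(1)             # "fast" keeps granting credit, "slow" never does
    return len(fast.received), len(slow.received), len(pub.pending)
```

In this model, `demo()` delivers only the first two messages. After that, `slow` is out of credit, so `fast` receives nothing more even though it has credit to spare, and the pending queue keeps growing.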

In a recent discussion with @ckreibich, I think we agreed that Zeek can't stop reading incoming data or "slow down", but that Zeek should still do as much as it can when something runs into a wall. We have two options: drop data where we produce it (in Zeek) once the buffers reach a certain threshold, or disconnect peers that fail to keep up. Dropping data in Zeek would still mean that a single unresponsive or overloaded node blocks the entire cluster, although it would at least stop individual Zeek processes from eventually running out of memory. Having Broker auto-disconnect unresponsive peers, by contrast, is probably a safety measure we should implement regardless, since it would stop a single process from locking up the entire cluster.
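
The trade-off between the two options can be sketched with a small simulation. Everything here is an assumption made for illustration: the `Pipeline` class, the `limit` threshold, and the `"drop"` / `"disconnect"` policy names are hypothetical and do not correspond to Broker or Zeek APIs. The model uses one shared pending queue and one peer that runs out of credit.

```python
from collections import deque


class Peer:
    """Hypothetical peer with a fixed amount of credit."""

    def __init__(self, name, credit):
        self.name = name
        self.credit = credit
        self.received = []


class Pipeline:
    """Toy shared pipeline with an overflow policy (names are illustrative)."""

    def __init__(self, peers, limit, policy):
        assert policy in ("drop", "disconnect")
        self.peers = list(peers)
        self.limit = limit        # threshold for the pending buffer
        self.policy = policy
        self.pending = deque()
        self.dropped = 0

    def flush(self):
        # Forward messages only while every connected peer has credit.
        while self.pending and all(p.credit > 0 for p in self.peers):
            msg = self.pending.popleft()
            for p in self.peers:
                p.credit -= 1
                p.received.append(msg)

    def publish(self, msg):
        if len(self.pending) >= self.limit:
            if self.policy == "drop":
                # Producer-side drop: memory stays bounded, pipeline stays stalled.
                self.dropped += 1
                return
            # Disconnect: remove every peer that is out of credit.
            self.peers = [p for p in self.peers if p.credit > 0]
        self.pending.append(msg)
        self.flush()


def run(policy):
    fast = Peer("fast", credit=100)
    slow = Peer("slow", credit=1)   # stops accepting after one message
    pipe = Pipeline([fast, slow], limit=3, policy=policy)
    for i in range(10):
        pipe.publish(i)
    return len(fast.received), len(slow.received), len(pipe.pending), pipe.dropped
```

Under these assumptions, `run("drop")` keeps memory bounded but the healthy peer still receives only one message, whereas `run("disconnect")` lets the healthy peer receive all ten at the cost of the slow peer missing everything after its disconnect.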

From the reports we have, it seems Broker recovers when the problematic process is restarted manually. So ideally, we combine the auto-disconnect with a supervising concept that identifies problematic nodes in order to restart them automatically and/or collect status information for that process that helps users identify the underlying issue (like a script using too much CPU or taking too long to process events from certain topics).
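
One way such a supervising concept could look is a watchdog that tracks when each node last made progress and restarts nodes that exceed a timeout. The sketch below is purely illustrative: `NodeWatchdog`, its methods, the `timeout` value, and the shape of the status dictionary are all hypothetical, and nothing here reflects an existing Zeek or Broker interface. Time is passed in explicitly so the example is deterministic.

```python
class NodeWatchdog:
    """Hypothetical watchdog: restarts nodes that stop reporting progress."""

    def __init__(self, timeout, restart):
        self.timeout = timeout        # seconds without progress before a restart
        self.restart = restart        # callback taking the node name
        self.last_progress = {}
        self.status = {}              # last status info reported per node

    def report(self, node, now, status=None):
        """Record that a node made progress at time `now`."""
        self.last_progress[node] = now
        if status is not None:
            self.status[node] = status

    def check(self, now):
        """Restart every node whose last progress is older than the timeout."""
        stalled = [n for n, t in self.last_progress.items()
                   if now - t > self.timeout]
        for node in stalled:
            self.restart(node)
            self.last_progress[node] = now   # fresh window after the restart
        return stalled


def demo():
    restarted = []
    dog = NodeWatchdog(timeout=5.0, restart=restarted.append)
    dog.report("worker-1", now=0.0)
    dog.report("worker-2", now=0.0, status={"slow_topic": "example/topic"})
    dog.report("worker-1", now=4.0)   # worker-1 keeps making progress
    first = dog.check(now=6.0)        # worker-2 has been silent for 6 seconds
    second = dog.check(now=7.0)       # nothing is stalled right after the restart
    return first, second, restarted
```

The retained `status` entry is where diagnostic information for the user would go; what to collect, and how to obtain it from a stalled process, is left open here.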

This would mean we ultimately allow Broker to discard data. Once we auto-disconnect a peer, we discard all data that peer would have received. Even if the peer reconnects later, that data is lost. However, the current approach simply hopes for the best and eventually fails catastrophically (locking up the whole cluster, with processes running out of memory).

@Neverlord Neverlord self-assigned this Sep 29, 2024
ckreibich added a commit that referenced this issue Dec 3, 2024
* issue/gh-426:
  Fix log statement when disconnecting stalled peers
  Disconnect slow peers and WS clients by default
  Pick up backport for on_backpressure_buffer