A single unresponsive peer brings Broker to a halt #426

Neverlord · 2024-09-14T14:22:37Z

Broker uses credit-based flow control to push data to connected peers. If a peer no longer grants new credit, the pipeline stops. We have also found that the pipeline, once stopped, seems to struggle to get "unblocked" again. The deeper issue, however, is that Zeek does not tap into the flow control and always assumes it can publish data on its topics. We need some way of coping with overloaded (or otherwise unresponsive) nodes.
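To make the stall concrete, here is a toy model of credit-based delivery. It is only a sketch and not Broker's actual code: the `Peer` class and `push_all` function are made up for illustration, and it assumes every item goes to all peers in lockstep, so one peer without credit holds back everyone.

```python
from collections import deque


class Peer:
    """Hypothetical peer that spends one credit for every item it is sent."""

    def __init__(self, name, credit):
        self.name = name
        self.credit = credit
        self.received = []

    def grant(self, amount):
        """The peer signals that it can accept `amount` more items."""
        self.credit += amount


def push_all(queue, peers):
    """Deliver queued items to every peer until any peer is out of credit.

    Returns the number of items delivered. Undelivered items stay queued.
    """
    delivered = 0
    while queue and all(p.credit > 0 for p in peers):
        item = queue.popleft()
        for p in peers:
            p.credit -= 1
            p.received.append(item)
        delivered += 1
    return delivered


fast = Peer("fast", credit=10)
slow = Peer("slow", credit=2)
queue = deque(range(5))

first = push_all(queue, [fast, slow])   # slow runs out of credit here
second = push_all(queue, [fast, slow])  # still blocked, nothing moves
slow.grant(3)
third = push_all(queue, [fast, slow])   # the backlog drains again
```

In this model the fast peer still has credit left during the second call, but it receives nothing because the slow peer grants none. That is the behavior described above, reduced to its simplest form.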

In a recent discussion with @ckreibich, I think we agreed that Zeek can't stop reading incoming data or "slow down", but that Zeek should still do as much as it can when something runs into a wall. We could either drop data where we produce it (in Zeek) once the buffers reach a certain threshold, or disconnect peers that fail to keep up. Dropping data in Zeek would still mean that a single unresponsive/overloaded node blocks the entire cluster, although it would at least stop individual Zeek processes from eventually running out of memory. Having Broker auto-disconnect unresponsive peers, however, is probably a safety measure we should implement regardless: it would stop a single process from being able to lock up the entire cluster.
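The auto-disconnect idea could look roughly like the sketch below. It assumes one bounded output buffer per peer, which is a simplification. `PeerBuffer`, `publish` and the capacity value are hypothetical names chosen for this example and do not correspond to Broker's API or configuration options.

```python
from collections import deque


class PeerBuffer:
    """Hypothetical per-peer output buffer with a fixed capacity."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.items = deque()
        self.lost = 0          # items this peer will never receive
        self.connected = True


def publish(item, buffers):
    """Enqueue `item` for every connected peer.

    A peer whose buffer is already full is disconnected, and everything
    buffered for it (plus the new item) is counted as lost.
    """
    for buf in buffers:
        if not buf.connected:
            continue
        if len(buf.items) < buf.capacity:
            buf.items.append(item)
        else:
            buf.lost += len(buf.items) + 1
            buf.items.clear()
            buf.connected = False


stalled = PeerBuffer("stalled", capacity=3)
healthy = PeerBuffer("healthy", capacity=3)
for i in range(5):
    publish(i, [stalled, healthy])
    healthy.items.popleft()  # the healthy peer keeps draining its buffer
```

Here the stalled peer is cut off once its buffer is full, while the healthy peer keeps receiving every item. The cost is that data queued for the stalled peer is gone, even if it reconnects later.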

From the reports we have, it seems Broker recovers when the problematic process is restarted manually. So ideally, we combine the auto-disconnect with a supervising concept that identifies problematic nodes and restarts them automatically and/or collects status information for that process that helps users identify the underlying issue (like a script using too much CPU or taking too long to process events from certain topics).

This would mean we ultimately allow Broker to discard data. Once we auto-disconnect a peer, we discard all data that peer would have received. Even if the peer reconnects later, that data is lost. However, the current approach is simply to hope for the best and then eventually fail catastrophically (locking up the whole cluster, with processes running out of memory).


Neverlord mentioned this issue Sep 19, 2024

Broker drops websocket clients if they send events too quickly zeek/zeek#3939

Open

Neverlord self-assigned this Sep 29, 2024

ckreibich added a commit that referenced this issue Dec 3, 2024

Merge branch 'issue/gh-426'

7df7358

* issue/gh-426:
  Fix log statement when disconnecting stalled peers
  Disconnect slow peers and WS clients by default
  Pick up backport for on_backpressure_buffer
