Broker uses credit-based flow control to push data to connected peers. If a peer stops granting new credit, the pipeline stalls. We have also observed that once stalled, the pipeline struggles to become unblocked again. The deeper issue, however, is that Zeek does not tap into the flow control and always assumes it can publish data on its topics. Either way, we need some way of coping with overloaded (or otherwise unresponsive) nodes.
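To illustrate the stall, here is a minimal sketch of credit-based flow control. All names (`Consumer`, `Producer`, `grant`, `push`) are illustrative, not Broker's actual API: a producer may only push as many items as the consumer has granted credit for, and once credit hits zero the pipeline simply stops.

```python
# Hypothetical sketch of credit-based flow control; names are illustrative,
# not Broker's real interface.

class Consumer:
    def __init__(self, initial_credit):
        self.credit = initial_credit
        self.received = []

    def grant(self, n):
        """The consumer hands out more credit after processing items."""
        self.credit += n

    def deliver(self, item):
        self.received.append(item)


class Producer:
    def __init__(self, consumer):
        self.consumer = consumer
        self.stalled = 0

    def push(self, item):
        """Push an item only if the peer has granted credit for it."""
        if self.consumer.credit > 0:
            self.consumer.credit -= 1
            self.consumer.deliver(item)
            return True
        self.stalled += 1  # no credit left: the pipeline stops here
        return False


consumer = Consumer(initial_credit=3)
producer = Producer(consumer)
results = [producer.push(i) for i in range(5)]
# With 3 credits, the first three pushes succeed and the rest stall
# until the consumer grants new credit.
```

Note that nothing in this scheme unblocks the producer on its own: progress resumes only if the consumer calls `grant` again, which matches the observation that a stalled pipeline struggles to recover.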
In a recent discussion with @ckreibich, I think we agreed that Zeek can't stop reading incoming data or "slow down", but that it should still try to do as much as it can if something runs into a wall. We could drop data where we produce it (in Zeek) once buffers reach a certain threshold, or we could disconnect peers that fail to keep up. Dropping data in Zeek would still mean that a single unresponsive/overloaded node blocks the entire cluster, though it would at least stop individual Zeek processes from eventually running out of memory. Broker auto-disconnecting unresponsive peers is probably a safety measure we should implement regardless: it would prevent a single process from locking up the entire cluster.
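The "drop where we produce it" idea could look like the following sketch, which is an assumption about shape, not Zeek's implementation: each peer gets a bounded outbound buffer, and once the buffer reaches its threshold, new messages are shed at the producer instead of growing the buffer without bound.

```python
from collections import deque

# Illustrative sketch (not Zeek's actual code): shed load at the producer
# once the outbound buffer for a peer reaches a configured threshold.

class BoundedOutbox:
    def __init__(self, max_size):
        self.buf = deque()
        self.max_size = max_size
        self.dropped = 0

    def publish(self, msg):
        """Enqueue a message, or drop it if the buffer is full."""
        if len(self.buf) >= self.max_size:
            self.dropped += 1  # drop instead of running out of memory
            return False
        self.buf.append(msg)
        return True


outbox = BoundedOutbox(max_size=2)
outcomes = [outbox.publish(f"event-{i}") for i in range(4)]
# The first two messages are buffered; the last two are dropped.
```

A `dropped` counter like the one above would also give operators a cheap signal that a peer is falling behind, even before any auto-disconnect kicks in.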
From the reports we have, it seems Broker recovers when the problematic process is restarted manually. Ideally, then, we combine the auto-disconnect with a supervising concept that identifies problematic nodes in order to restart them automatically and/or collect status information that helps users identify the underlying issue (such as a script consuming too much CPU or taking too long to process events from certain topics).
This would mean we ultimately allow Broker to discard data: once we auto-disconnect a peer, we discard all data that peer would have received, and even if the peer re-connects later, that data is lost. However, the current approach amounts to hoping for the best and eventually failing catastrophically (locking up the whole cluster and processes running out of memory).
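An auto-disconnect policy could be as simple as the following sketch, where all names (`PeerState`, `stall_timeout`, the callbacks) are hypothetical: track how long a peer's buffer has remained full and cut the peering once it exceeds a stall timeout, so one unresponsive node cannot hold up the rest of the cluster.

```python
# Hypothetical stall-timeout sketch; names and callbacks are illustrative,
# not Broker's actual API.

class PeerState:
    def __init__(self, stall_timeout):
        self.stall_timeout = stall_timeout
        self.full_since = None  # timestamp when the buffer filled up
        self.connected = True

    def on_buffer_full(self, now):
        """Record when the peer's outbound buffer first became full."""
        if self.full_since is None:
            self.full_since = now

    def on_buffer_drained(self):
        """The peer caught up again; reset the stall clock."""
        self.full_since = None

    def check(self, now):
        """Disconnect if the buffer stayed full past the timeout."""
        if self.connected and self.full_since is not None:
            if now - self.full_since >= self.stall_timeout:
                self.connected = False  # data queued for this peer is lost
        return self.connected


peer = PeerState(stall_timeout=5.0)
peer.on_buffer_full(0.0)
# Still within the timeout at t=3, disconnected by t=6.
```

The key trade-off is visible in `check`: once the timeout fires, everything queued for that peer is discarded, which is exactly the "allow Broker to discard data" decision described above.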
* issue/gh-426:
  * Fix log statement when disconnecting stalled peers
  * Disconnect slow peers and WS clients by default
  * Pick up backport for on_backpressure_buffer