[BUG] Node disconnection for long duration due to Encrypting network mesh during Mesh deployment #17155
Comments
On digging further, here's what I found on why the node always disconnects for 3-4 seconds:
5. After the leader check fails continuously on the data node, it executes the leader failure handler. This handler does multiple things.
To bring the disconnection time down from 3-4 seconds to under 1 second, I propose the following.
[Triage Attendees - 1, 2, 3]
@rajiv-kv The node is not being taken down for deployment. Instead, its connection has been abruptly terminated while the node itself is up and healthy. The problem is that the primary master thinks the data node has been disconnected for much longer than it actually was.
@anuragrai16 - Yes, the node is not restarted, but the proxy fronting the node is deployed and the connections are reset. Each node maintains a connection pool to every other node in the cluster (ref: NodeConnectionsService). Connections from this pool are used for leader/follower checks as well as node-to-node transport communications. I think the deployment to the Envoy proxy is causing the existing connections to become stale. You would probably need some external coordination to drain the existing connections, shift the traffic away before deployment, and wait for the node to be back after deployment.
@rajiv-kv - Thanks for the details. While it makes sense to handle this gracefully externally, I also want to highlight a possible edge case/inefficiency in the leader check, as detailed in this comment. Basically, when the data node's connection gets dropped and the primary master (leader) has removed it from the cluster, the data node still continues its LeaderCheck, failing repeatedly with the same deterministic exception (CoordinationStateRejectedException). Instead, if we catch this exception and fail the leader check immediately, the node can begin its recovery and rejoin the cluster quickly. So, instead of the data node disconnecting for 3-4 seconds, it only disconnects for under 1 second. What do you think about this change? I can open a PR for it if you don't see any issues.
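For concreteness, here is a minimal sketch of the fail-fast idea. This is not the actual OpenSearch LeaderChecker code; `onCheckFailure`, `leaderFailed`, and `scheduleNextCheck` are hypothetical stand-ins for the real handler internals, which differ in detail:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the proposed fail-fast leader check, not the real LeaderChecker.
class LeaderCheckSketch {
    private final AtomicInteger failureCount = new AtomicInteger();
    private final int retryCount = 3; // cluster.fault_detection.leader_check.retry_count

    void onCheckFailure(Exception cause, boolean rejectedByLeader) {
        if (rejectedByLeader) {
            // CoordinationStateRejectedException case: the leader has already removed
            // this node from the cluster state, so every retry is guaranteed to fail.
            // Failing fast lets the node start its rejoin/recovery immediately.
            leaderFailed(cause);
            return;
        }
        if (failureCount.incrementAndGet() >= retryCount) {
            leaderFailed(cause);  // current behavior: fail only after N consecutive failures
        } else {
            scheduleNextCheck();  // current behavior: retry after leader_check.interval
        }
    }

    void leaderFailed(Exception cause) { /* become candidate and start discovery/rejoin */ }
    void scheduleNextCheck() { /* re-run the leader check after the configured interval */ }
}
```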
@anuragrai16 : In this case, when the connections are disrupted by the proxy deployment, shouldn't the failure be symmetric? That is, if the Cluster Manager (aka master) identified via the followerChecker that the follower had disconnected, then at the same time the leaderChecker on the data node should also have identified that the leader had disconnected.
Can you please explain further why that is happening?
I don't see a major side effect of failing fast when the error states the follower node is not part of the cluster anymore; let me think more. Also, another suggestion to prevent the disruption when the Envoy proxy is deployed on the active Cluster Manager (node1): is there a way to know externally that you are deploying to the active Cluster Manager host? If yes, then I would suggest adding node1 to the voting exclusions; this ensures a standby cluster manager node (node2) is elected as leader, and then you can deploy the proxy on node1 without causing red-cluster issues. Finally, remove the voting exclusions.
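If the active cluster manager host can be identified before the rollout, this exclusion dance could be scripted with the low-level REST client. A sketch, where "node1" and the endpoint address are illustrative:

```java
import org.apache.http.HttpHost;
import org.opensearch.client.Request;
import org.opensearch.client.RestClient;

// Sketch of wrapping a proxy deployment on the active cluster manager
// with voting-config exclusions ("node1" and "localhost" are placeholders).
public class VotingExclusionSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // 1. Exclude node1 from voting so a standby cluster manager gets elected.
            client.performRequest(new Request("POST", "/_cluster/voting_config_exclusions?node_names=node1"));

            // 2. ...deploy the Envoy proxy on node1 and wait for it to stabilize...

            // 3. Remove the exclusion so node1 becomes election-eligible again.
            client.performRequest(new Request("DELETE", "/_cluster/voting_config_exclusions?wait_for_removal=true"));
        }
    }
}
```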
@shwetathareja - Thanks for taking a look. As for your questions:
Based on my limited understanding of the custom inter-node TCP implementation, we might not be using the same TCP channel for the leader check and the follower check.
I believe we define a ConnectionProfile on every node that controls how many connections can be open for different request types, and we use a connection from a pool. For reference, TransportService opens connections to a node by taking in the connection profile here, and all the connections get created here. This means that these are two different TCP connections from the Envoy proxy's perspective. When a deployment happens, the network disconnect (essentially a TCP RST packet sent to each connection) happens in a phased way during the grace period. So the socket used for the leader check from the data node vs. the channel used for the follower check arriving at the data node may be disconnected at different times. A sketch of this channel layout follows below.
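To illustrate, here is a sketch of building a ConnectionProfile with per-type channel pools. The counts mirror the usual defaults but are illustrative, and the mapping of the fault-detection checks to the PING channel type is my understanding rather than something confirmed in this thread:

```java
import org.opensearch.transport.ConnectionProfile;
import org.opensearch.transport.TransportRequestOptions;

// Sketch of a per-node channel layout: each request type gets its own pool of
// TCP channels, so a proxy restart can reset, say, the PING channels (used by
// leader/follower checks) and the REG channels at different moments.
class ChannelLayoutSketch {
    static ConnectionProfile buildProfile() {
        ConnectionProfile.Builder builder = new ConnectionProfile.Builder();
        builder.addConnections(1, TransportRequestOptions.Type.PING);     // fault-detection checks
        builder.addConnections(1, TransportRequestOptions.Type.STATE);    // cluster state publication
        builder.addConnections(2, TransportRequestOptions.Type.RECOVERY); // shard recovery
        builder.addConnections(3, TransportRequestOptions.Type.BULK);     // bulk indexing
        builder.addConnections(6, TransportRequestOptions.Type.REG);      // regular transport requests
        return builder.build();
    }
}
```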
Right, this is the long-term solution we are looking at, but it is non-trivial for us to implement at this point due to some internal details of how Envoy is deployed vs. how the OpenSearch cluster is managed on hosts. As a result, we are looking for an application-side fix to reduce the data node disconnection as seen by the cluster manager (master). The change I mentioned above (fail-fast the LeaderCheck) seems to do that for us, hence we would like to make that change.
Also @shwetathareja, here are the full trace logs on the primary master and the data node during the deployment. I have tagged the timestamps of interest in both logs. (You might need to request permission; I'll approve.)
@shwetathareja - I added a PR. But there is an integration test, testPrimaryTermValidation, that seems to become flaky with this change. This looks to be because it simulates a very high leader check interval setting (and reduces the follower check interval) as a means of sustaining the network disconnect and testing the assumption of a disconnected data node. Do you think it makes sense to remove this IT, or perhaps tweak it to represent an actual longer disconnect?
Describe the bug
We have an OpenSearch cluster deployed with an encrypting network mesh (Envoy). The cluster is a standard one with 3 master nodes and a number of data nodes. Due to the encrypting mesh, instead of nodes connecting directly to each other, they connect via an Envoy proxy container that encrypts all outgoing TCP connections. In the steady state, the OpenSearch cluster seems to work fine.
But every time we deploy the Envoy container on a host, the network gets reset (the TCP RST is instant, and connections should be retried immediately and moved to the new container). What we observe instead is that the OpenSearch master-to-data connection ends up disconnecting the node for 3 seconds every time.
If the disconnected node hosts the primary copy of a shard, there are write failures for that index, whereas get/search calls are not impacted. If the node is the primary master itself, no writes go through for a few seconds and the cluster reports a Yellow/Red state (though no physical shard movement takes place).
Note that deployment of the Envoy container is blue/green with a grace period, similar to this.
Related component
Cluster Manager
To Reproduce
Expected behavior
When the Envoy container on the host is updated/deployed, the TCP connection reset is instant, and the OpenSearch cluster should therefore retry the connection from the master to the data node immediately, without a large delay (3 seconds in this case).
Additional Details
Plugins
Screenshots
Diagram denoting how traffic is forwarded first to an Envoy container which encrypts the outgoing traffic before sending it to a new host. The purple box represents the OS container, while the pink box represents the Envoy container.
Logs on the primary master during the 3 seconds when it disconnects
Error observed on the client if making Indexing calls to the index that was impacted
Host/Environment (please complete the following information):
Additional context
Things tried to resolve
- Set aggressive TCP keep_alive settings
- Set aggressive fault_detection settings
- Set aggressive ping settings
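For reference, these categories map to node settings in opensearch.yml along the following lines. This is a sketch only; the values below are illustrative, not the exact ones we tried:

```yaml
# Illustrative values only, not the exact settings that were tried.

# OS-level TCP keep-alive on transport connections
network.tcp.keep_alive: true
network.tcp.keep_idle: 60       # seconds of idle time before the first probe
network.tcp.keep_interval: 10   # seconds between probes
network.tcp.keep_count: 3       # unanswered probes before the connection is dropped

# Fault detection (leader/follower checks)
cluster.fault_detection.leader_check.interval: 1s
cluster.fault_detection.leader_check.timeout: 10s
cluster.fault_detection.leader_check.retry_count: 3
cluster.fault_detection.follower_check.interval: 1s
cluster.fault_detection.follower_check.timeout: 10s
cluster.fault_detection.follower_check.retry_count: 3

# Application-level transport pings
transport.ping_schedule: 5s
```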