
[BUG] Node disconnection for long duration due to Encrypting network mesh during Mesh deployment #17155

Open
anuragrai16 opened this issue Jan 28, 2025 · 10 comments · May be fixed by #17400
Labels
bug Something isn't working Cluster Manager

Comments


anuragrai16 commented Jan 28, 2025

Describe the bug

We have an OpenSearch cluster deployed with an encrypting network mesh (Envoy). The cluster is a standard one with 3 master nodes and a number of data nodes. Because of the encrypting mesh, instead of nodes connecting directly to each other, they connect via an Envoy proxy container that encrypts all outgoing TCP connections. In the steady state, the cluster works fine.

But every time we deploy the Envoy container on a host, the network connections are reset (the TCP RST is instant, so connections should immediately be retried and moved to the new container). What we observe instead is that the master-to-data-node connection ends up disconnecting the node for about 3 seconds every time.

If the disconnected node hosts the primary copy of a shard, writes to that index fail while get/search calls are not impacted. If the node is the primary master itself, no writes go through for a few seconds and the cluster reports a Yellow/Red state (though no physical shard movement takes place).
Note that deployment of the Envoy container is Blue/Green with a grace period, similar to this.

Related component

Cluster Manager

To Reproduce

  1. Create an OpenSearch cluster and deploy it with a network mesh (Envoy, Istio, or similar).
  2. Redeploy the underlying network mesh container (the Envoy container) and wait for the grace period to finish so the TCP connections are reset.
  3. Observe the master disconnecting from the data node (where the Envoy container was deployed) for 3-4 seconds before reconnecting.

Expected behavior

When the Envoy container on a host is updated/deployed, the TCP connection reset is instant, so the OpenSearch cluster should immediately retry the connection from the master to the data node without a large delay (3 seconds in this case).

Additional Details

Plugins

Screenshots

Diagram denoting how traffic is forwarded first to an Envoy container which encrypts the outgoing traffic before sending it to a new host. The purple box represents the OS container, while the pink box represents the Envoy container.


Logs on the primary master during the 3 seconds when it disconnects


Error observed on the client when making indexing calls to the impacted index


Host/Environment (please complete the following information):

  • OS: Linux
  • Version: OpenSearch v2.4.1

Additional context

Things tried to resolve

Set aggressive TCP keep-alive settings:

discovery.zen.fd.connect_on_network_disconnect: true
transport.tcp.keep_count: 20
transport.tcp.keep_interval: 30s

Set aggressive fault-detection settings:

cluster.fault_detection.follower_check.interval: 3s
cluster.fault_detection.follower_check.timeout: 13s
cluster.fault_detection.follower_check.retry_count: 15

Set aggressive ping settings:

discovery.zen.fd.ping_interval: 4s
discovery.zen.fd.ping_retries: 6
discovery.zen.fd.ping_timeout: 40s

anuragrai16 commented Feb 13, 2025

On digging further, here's what I found about why the node always disconnects for 3-4 seconds:

  1. OpenSearch fault detection uses FollowerChecks and LeaderChecks. The follower checker runs on the primary master (leader) and lets the leader verify that its followers are still connected and healthy. Similarly, the leader checker runs on every follower and lets it verify that the currently elected leader is still connected and healthy.

  2. The FollowerChecker running on the primary master registers a disconnection handler that removes the data node from the cluster and stops the follower checker for that node. This executes as soon as the data node's network is disrupted by the secure-proxy deployment.

  3. At this point the data node is still connected to the primary master (but not vice versa) and continues to send leader checks. However, the primary master responds with a [CoordinationStateRejectedException](https://github.com/opensearch-project/OpenSearch/blob/d0a65d3b457e11d514171ecc0a77450fdd7d25bb/server/src/main/java/org/opensearch/cluster/coordination/LeaderChecker.java#L230) to the disconnected data node, since it has effectively removed that node from the cluster.

  4. When the data node receives this exception as part of its leader check, it sees that the check has failed, but it needs to retry cluster.fault_detection.leader_check.retry_count times before it can consider the leader truly failed (see the timing note after this list).

  5. After the leader check has failed repeatedly, the data node executes the leader failure handler, which (among other things) restarts peer discovery so the node starts probing the cluster again.

  6. The primary master, upon receiving the probe from the data node, adds it back to the cluster, at which point the cluster is back in its steady state.
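
For reference: assuming the default leader-check settings (cluster.fault_detection.leader_check.interval of 1s and retry_count of 3, and given that the rejection is returned immediately rather than timing out), the data node spends roughly retry_count × interval ≈ 3 × 1s ≈ 3s in this loop before it declares the leader failed, which lines up with the observed 3-4 second disconnect window.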

@anuragrai16 (Author)

To bring the disconnection time down from 3-4 seconds to under 1 second, I propose the following:

  • In the current LeaderCheck code flow, data nodes are forced to retry at least 3 times against a known, deterministic exception thrown by the primary master, CoordinationStateRejectedException. Since the primary master has already kicked the data node out, it will keep throwing the same exception no matter how many times the node retries.
  • Instead, if we catch this exception and fail the leader check fast, the peer finder quickly kicks in and re-adds the data node to the cluster (see the sketch after this list).
  • This reduces the disconnect time from 3-4 seconds to 1 second or less. Furthermore, reducing cluster.fault_detection.leader_check.interval on top of this fix makes the data node reconnect even faster.
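
A minimal, self-contained sketch of the fail-fast decision. The class and method names below are illustrative stand-ins for the retry bookkeeping inside the leader checker, not the actual OpenSearch LeaderChecker code:

```java
// Illustrative stand-in for the leader-check retry bookkeeping; not the actual
// org.opensearch.cluster.coordination.LeaderChecker implementation.
final class LeaderCheckRetrySketch {

    // Stand-in for org.opensearch.cluster.coordination.CoordinationStateRejectedException.
    static final class CoordinationStateRejectedException extends RuntimeException {
        CoordinationStateRejectedException(String message) {
            super(message);
        }
    }

    private final int retryCount;          // cluster.fault_detection.leader_check.retry_count (default 3)
    private int failuresSinceLastSuccess;  // consecutive failed leader checks

    LeaderCheckRetrySketch(int retryCount) {
        this.retryCount = retryCount;
    }

    /** Returns true when the node should treat the leader as failed and start rediscovery now. */
    boolean onLeaderCheckFailure(Throwable cause) {
        failuresSinceLastSuccess++;

        // Proposed change: the leader has already removed this node and will reject every
        // retry with the same exception, so waiting out the retry budget only adds delay.
        if (cause instanceof CoordinationStateRejectedException) {
            return true; // fail fast -> the peer finder can re-add the node almost immediately
        }

        // Existing behaviour for transient errors: keep retrying until the budget is exhausted.
        return failuresSinceLastSuccess >= retryCount;
    }

    public static void main(String[] args) {
        LeaderCheckRetrySketch sketch = new LeaderCheckRetrySketch(3);
        Throwable rejected = new CoordinationStateRejectedException("node has been removed from the cluster");
        System.out.println("fail fast on rejection: " + sketch.onLeaderCheckFailure(rejected)); // prints true
    }
}
```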

@rajiv-kv (Contributor)

[Triage Attendees - 1, 2, 3]
Thanks @anuragrai16 for filing the issue. This looks to be expected behaviour, as the cluster manager does not have context that the node is being taken down for deployment. Please feel free to create separate GitHub issues for feature requests on supporting deployments in an OpenSearch cluster.


anuragrai16 commented Feb 13, 2025

This looks to be expected behaviour, as the cluster manager does not have context that the node is being taken down for deployment.

@rajiv-kv The node is not being taken down for deployment. Instead, its connection is abruptly terminated while the node itself stays up and healthy. The problem is that the primary master considers the data node disconnected for a much longer duration than it should.

@rajiv-kv (Contributor)

@anuragrai16 - Yes, the node is not restarted, but the proxy fronting the node is redeployed and the connections are reset. Each node maintains a connection pool to every other node in the cluster (ref: NodeConnectionsService).

Connections from this pool are used for leader/follower checks as well as node-to-node transport communication. I think the deployment of the Envoy proxy is causing the existing connections to become stale.

You might need some external coordination to drain the existing connections, weigh away the traffic before deployment, and wait for the node to be back after deployment.
You could also check whether the zonal deployment model fits your use case, to weigh away in-flight requests and decommission a zone: https://opensearch.org/docs/latest/api-reference/cluster-api/cluster-decommission/


anuragrai16 commented Feb 17, 2025

@rajiv-kv - Thanks for the details. While it makes sense to handle this gracefully externally, I also want to highlight a possible edge case/inefficiency in the leader check, as detailed in this comment.

Basically, when the data node's connection gets dropped and the primary master (leader) has removed it from the cluster, the data node still continues its LeaderCheck, failing with the same deterministic exception (CoordinationStateRejectedException). Instead, if we catch this exception and fail the leader check immediately, the node can begin its recovery and rejoin the cluster quickly. So, instead of the data node disconnecting for 3-4 seconds, it only disconnects for <1 second. What do you think about this change? I can open a PR for it if you don't see any issues.

@shwetathareja (Member)

@anuragrai16 : In this case, when the connections are disrupted due to the proxy being deployed, shouldn't the failure be symmetric? That is, if the Cluster Manager (aka master) identified that the follower is disconnected while running the followerChecker, then the LeaderChecker on the data node should have identified at the same time that the leader is disconnected?

At this point the data node is still connected to the primary master (but not vice versa) and continues to send leader checks.

Can you please explain this further, why is that happening?

On failing the LeaderChecker fast when it receives the error that the node is not part of the cluster:

I don't see a major side effect of failing fast when the error states that the follower node is not part of the cluster anymore; let me think about it more.

Another suggestion, to prevent disruption when the Envoy proxy is deployed on the active Cluster Manager (node1): is there a way to know externally that the deployment is targeting the active Cluster Manager host? If yes, I would suggest adding node1 to the voting exclusions; this ensures a standby cluster manager node (node2) is elected as leader, so you can deploy the proxy on node1 without causing red-cluster issues, and finally remove the voting exclusions afterwards.

@anuragrai16 (Author)

@shwetathareja - Thanks for taking a look. As for your questions,

Can you please explain this further, why is that happening?

Based on my limited understanding of the custom inter-node TCP implementation (node <> node), we might not be using the same TCP channel for:

  • leader check sent from data node -> primary master
  • follower check from primary master -> data node

I believe we define a ConnectionProfile on every node that controls how many connections can be open for each request type, and a connection is picked from that pool. For reference, TransportService opens connections to a node by taking in the connection profile here, and all the connections get created here. This means that the two checks use two different TCP connections from the Envoy proxy's perspective. When a deployment happens, the network disconnect (essentially a TCP RST sent on each connection) happens in a phased way during the grace period. So the socket used for the leader check from the data node, versus the channel used for the follower check arriving at the data node, may be disconnected at different times. A rough sketch of the per-request-type channel mapping follows below.
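
A simplified, self-contained model of that per-request-type channel mapping. The request types and per-type channel counts below are my assumption of the defaults, not values taken from the source:

```java
// Simplified model of a transport connection profile: each request type gets its own
// group of TCP channels, so traffic in the two directions can ride on different sockets.
// The per-type channel counts are assumed defaults, not verified against the source.
import java.util.EnumMap;
import java.util.Map;

final class ConnectionProfileSketch {

    enum RequestType { PING, STATE, RECOVERY, BULK, REG }

    private final Map<RequestType, Integer> channelsPerType = new EnumMap<>(RequestType.class);

    ConnectionProfileSketch() {
        channelsPerType.put(RequestType.PING, 1);     // lightweight ping-style traffic
        channelsPerType.put(RequestType.STATE, 1);    // cluster state publication
        channelsPerType.put(RequestType.RECOVERY, 2); // shard recovery
        channelsPerType.put(RequestType.BULK, 3);     // bulk indexing
        channelsPerType.put(RequestType.REG, 6);      // regular transport requests
    }

    int numChannels(RequestType type) {
        return channelsPerType.get(type);
    }

    public static void main(String[] args) {
        ConnectionProfileSketch profile = new ConnectionProfileSketch();
        // The check in each direction may be pinned to a different socket, so a proxy
        // redeployment can reset them at different moments during its grace period.
        System.out.println("channels for PING-type requests: " + profile.numChannels(RequestType.PING));
    }
}
```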

Also another suggestion, to prevent the disruption when envoy proxy is deployed in the active Cluster manager (node1), is there way to know from external that it is deploying to active Cluster Manager host?

Right, this is a long term solution that we are looking for but slightly non-trivial at this point to implement for us due to some internal details of how envoy is deployed v/s how the OS cluster is managed on hosts. As a result, we are looking for an application fix for reducing the data node disconnection as seen by the cluster manager (master). The change I mentioned above (fail-fast the LeaderCheck) seems to do that for us, hence we would like to make that change.

@anuragrai16 (Author)

Also @shwetathareja, here are the full trace logs on the primary master and the data node during the deployment. I have tagged the timestamps of interest in both logs. (You might need to request permission; I'll approve.)

anuragrai16 linked a pull request Feb 20, 2025 that will close this issue

anuragrai16 commented Feb 21, 2025

@shwetathareja - I added a PR, but there is an integration test, testPrimaryTermValidation, that seems to become flaky with this change. This looks to be because it simulates a very high leader check interval (and reduces the follower check interval) as a way of sustaining the network disconnect and testing the assumptions around a disconnected data node.
Whereas with my change, irrespective of the leader check interval, the PeerFinder kicks in quickly on a CoordinationStateRejection (which is what happens in this integration test), so the primary write fails in the test.

Do you think it makes sense to remove this IT, or perhaps tweak it somehow to represent an actual longer disconnect?
