Introduce seed-node based peer discovery option #13043
Comments
I vote for Option 3 as the least disruptive. What is specific to Kubernetes should be addressed in the Kubernetes plugin. Other peer discovery backends work fine in practice and we have years of evidence to back this claim.
We faced exactly the same problem four years ago in rabbitmq/cluster-operator#662. Back then, the root cause of malformed clusters was randomised startup delays within rabbitmq-server.
To solve this problem, I implemented two solutions:
I decided to go with solution 2. This solution uses Erlang locks and has worked extremely well. We have never seen any malformed clusters thereafter (that is, since solution 2 shipped in RabbitMQ 3.8.18). The problem described in this issue has started to come back in RabbitMQ 3.13.
Yes, understanding and fixing the issue would be great. However, we have recent reports of this problem with 3.13 and 4.0, and every single time we ask for debug logs from all nodes, we never get them. I can try to ping the users reporting the problem again. Otherwise we are kind of stuck with users reporting the problem every few months and not being able to do much about it. This issue has never happened to me on GKE since the locking was introduced. The third issue I linked to happened with classic config in our CI, so it's not strictly a Kubernetes problem.

I feel like changes to the k8s peer discovery plugin (option 3) can be justified on simplicity alone: the plugin currently queries the Kubernetes API for a list of pod names that belong to a StatefulSet, which is completely unnecessary since these names are predictable. Therefore all of the plugin's configuration options are redundant (they are all about the Kubernetes API endpoint, TLS settings, etc.). It's just a completely over-engineered design.
Can you provide a link to the 3.13 report?
I agree. (On the other hand, changing only the K8s peer discovery plugin won't fix our CI issue.)
The Discord link above is for 3.13: https://discord.com/channels/1092487794984755311/1092487853654687774/1324478177237536859
I just pinged the reporter and they said they would try to provide debug logs next week. 🤞
Sorry, I corrected my message above. As you wrote in rabbitmq/cluster-operator#662 (comment)
So, I'm sure these significant changes re-introduced the clustering issue described here.
I published a draft PR that shows what a redesigned k8s peer discovery plugin could look like:
While looking at some failures in CI with peer discovery today, I read about a change in Erlang 26 in its time correction handling that affects peer discovery seed node selection negatively. The code that computes the node start time can now return different results over time. In one CI failure, there is a difference of 22 seconds in the start time of a specific node when queried about 15 seconds apart. As a consequence, the sorting of nodes is unstable.

The start time is based on the Erlang monotonic time, so an arbitrary time reference. Erlang monotonic times can’t be compared between nodes because they won’t have the same arbitrary time reference. To convert it to a system time that can be used for comparison, you have to query the local time offset and add it to the monotonic time. That’s what the peer discovery code does. However, since Erlang 26, that time offset is volatile (that’s the documented word, not mine). This improves time performance and precision according to the docs. Unfortunately, it completely breaks my system-time-based start time calculation…

I will try to compute that start time once and cache it. I just need a stable value that allows peer discovery to determine which node is the oldest.
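A minimal sketch of that caching idea, assuming the start time is derived from `erlang:system_info(start_time)`; the module and function names below are illustrative, not the actual code from the fix, and an ETS table would work just as well as `persistent_term`:

```erlang
-module(start_time_cache).
-export([init/0, start_time/0]).

%% Illustrative sketch only: compute the node's start time as a system time
%% once, then cache it so repeated peer discovery queries see a stable value.
init() ->
    %% Monotonic time at which the runtime system started, converted to a
    %% system time using the current time offset. The offset may be adjusted
    %% later (it is "volatile" since Erlang/OTP 26), but the cached value
    %% will not change.
    StartTime = erlang:system_info(start_time) + erlang:time_offset(),
    persistent_term:put({?MODULE, start_time}, StartTime).

start_time() ->
    persistent_term:get({?MODULE, start_time}).
```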
Sounds good. Each node could compute its start-up system time during boot and write it to some local ETS table so that this value is stable. Other nodes can then query this ETS table. That said, if the computed start-up system time on node A is lower than the computed start-up system time of node B, it doesn't necessarily mean that node A started before node B. However, as I understand you, that's acceptable since the order is deterministic.
I went with a
Yes. To be clear, the start time is not the uptime. It’s the date/time at which the Erlang VM started. It is used to sort nodes that are not clustered yet. Once one of them is clustered, it will sort first anyway and be selected. That start time allows a stable sort even if the list grows with new nodes appearing.
Yes, that's clear. I looked at the current implementation. My understanding is that with both the current implementation and your fix in #13128 two independent clusters can be formed as follows:
Is my understanding correct?
If the

Do you think we should use another source of time?

Note that there is another race condition regardless of the fix: there is a time window between the moment node C queries nodes and decides to join node A and the moment node A returns a list of members with [A, C]. During that time window, if node B appears and is first in the list, node D might decide to join B instead of A. This one requires some kind of atomicity between the membership query and the actual join. Using a global lock is going to be tricky because nodes can’t be connected to each other at the time of the queries. This is not something that can be repaired at the end of peer discovery, because disbanding all clusters but one and adding disbanded nodes to the wanted cluster might cause data loss.
No, I think it would be best to eliminate clustering with the "oldest" node in the peer discovery algorithm, because without using Lamport clocks the order of events across multiple nodes is not clear, and therefore it's not defined who the "oldest" node really is. It can happen that a few seconds after a cluster has been formed, another node starts which is reported to be the oldest although it's not. By using any form of wall clock time (as currently done), it's not possible to tell which node really is the oldest.

In #3075 I used global locks, which made it impossible to form multiple clusters. That's because as soon as multiple nodes are returned from the peer discovery backend, a node needs to acquire the global lock on the returned "live" nodes to decide whether to proceed as a standalone node (i.e. become the "seed node") or join another node.
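As an illustration of that approach, here is a hedged sketch of a lock-guarded decision using `global:trans/3`; the module and helper names are made up, and this is not the actual #3075 code:

```erlang
-module(seed_lock_sketch).
-export([form_or_join/1]).

%% Hypothetical sketch: take a lock across all discovered peers before
%% deciding whether to become the seed node or join an existing member.
form_or_join(DiscoveredNodes) ->
    LockId = {peer_discovery_lock, node()},
    AllNodes = [node() | DiscoveredNodes],
    global:trans(
      LockId,
      fun() ->
              %% Only one node at a time can run this critical section on
              %% this set of nodes, so only one of them can decide to
              %% proceed as a standalone (seed) node.
              case [N || N <- DiscoveredNodes, is_member_of_cluster(N)] of
                  []           -> form_new_cluster();
                  [Member | _] -> join(Member)
              end
      end,
      AllNodes).

%% Placeholder helpers; the real checks live in the peer discovery machinery.
is_member_of_cluster(_Node) -> false.
form_new_cluster() -> ok.
join(_Node) -> ok.
```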
Setting a global lock will cause the nodes to be connected (i.e. to form an Erlang cluster). Why is that a problem?
The implementation at the time we worked on Khepri clustering did not prevent multiple clusters from being created. At least, Michal hit this problem many times while testing on Kubernetes and the

Perhaps that was because we had to get rid of the peer-discovery-specific clustering code path in

Also, if nodes are connected without being cluster members, this will affect other users of
We can definitely try a global lock again. I tend to document the reasoning and decision process in comments and commit messages, but I can’t find what was wrong with the previous global lock, besides that mention of the Feature flags controller…
I suppose it might be worth trying to set a global lock again. If this doesn't work out for some reason, we have to re-think how to best perform peer discovery.
I haven't fully thought it through, but one way of solving this problem could be to shift the decision of which node becomes the seed node from module
This is already possible. I think I remember one of the issues with the previous implementation: it happened that node A joined node B, which joined node C (leaving node A alone). So the use of a lock didn’t solve the whole issue: nodes must join the same seed node anyway, or at least verify whether something happened behind their back.
Is your feature request related to a problem? Please describe.
Currently there's no way to configure RabbitMQ by specifying the seed node that all other nodes have to join. All mechanisms assume that if they can't join other nodes, they should form a new cluster so that other nodes can join them. This works well in the vast majority of cases but still causes "mis-clustering" in some environments: for example, a 3-node cluster deployment to Kubernetes may start as 2 nodes clustered together and 1 node that formed a separate cluster. While we see this issue reported occasionally, we have never received full debug logs from all nodes to understand why this happens.
It's not a Kubernetes-specific issue, however. While each peer discovery backend has its own logic and is more or less prone to this issue, it also happened recently in our CI with classic peer discovery.
In many environments, including Kubernetes, all (or at least some) node names are known upfront. For example, a StatefulSet named `foo` on Kubernetes will always create pods with the names `foo-0`, `foo-1` and so on. Therefore, rather than querying the Kubernetes API for the list of endpoints behind a Service, we could just configure RabbitMQ to use `foo-0` as the seed node, since it has to exist, regardless of the number of nodes in the cluster.

Another recent report:
https://discord.com/channels/1092487794984755311/1092487853654687774/1324478177237536859
Describe the solution you'd like
There are a few options and I'd like to get some feedback on them.
OPTION 1: New Backend
Add a new peer discovery backend that simply takes one node name; only that node is allowed to form a new cluster, while all the other nodes need to join it. This will require some changes to the generic parts of the peer discovery mechanism, since it currently always falls back to forming a new cluster after the configured number of attempts. In this case, we would like all nodes apart from the seed node to just keep trying forever; they are not allowed to form a new cluster under any circumstances.
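For illustration, the non-seed behaviour could look roughly like this hedged sketch (function names and the retry interval are assumptions, not part of the proposal):

```erlang
-module(seed_retry_sketch).
-export([join_seed_forever/1]).

%% Illustrative only: a non-seed node keeps retrying to join the configured
%% seed node and never falls back to forming a new cluster of its own.
join_seed_forever(SeedNode) ->
    case net_adm:ping(SeedNode) of
        pong -> join_cluster(SeedNode);    %% placeholder for the real join
        pang ->
            timer:sleep(5000),             %% wait, then retry forever
            join_seed_forever(SeedNode)
    end.

join_cluster(_SeedNode) -> ok.             %% placeholder helper
```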
The downsides:
OPTION 2: New Behaviour in Classic Config Peer Discovery
Introduce a new behaviour within the classic config peer discovery backend. The desired logic is very close to what the classic backend already does. However, it currently:
The new behaviour could be: if the local node is not on the list, it's not allowed to form a new cluster and has to join one of the other nodes. With such changes, the following configuration should accomplish the desired behaviour:
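For illustration, a hedged sketch of such a configuration, expressed with the existing classic config setting in advanced.config (the node name is made up): listing only the seed node would mean that, under the proposed behaviour, every other node has to join it.

```erlang
%% advanced.config sketch (illustrative node name): only rabbit@foo-0 is
%% listed, so with the proposed behaviour any node not on this list would
%% have to join it rather than form a new cluster of its own.
[
  {rabbit, [
    {cluster_nodes, {['rabbit@foo-0'], disc}}
  ]}
].
```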
The downsides are:
We could introduce a dedicated configuration option for this behaviour, for example:
OPTION 3: Change k8s Peer Discovery
Change the way k8s peer discovery works. Currently it queries the Kubernetes API to receive the list of endpoints for a given Service. There are a lot of configuration options related to how to connect to the Kubernetes API (if the defaults don't work; fortunately they do work in most cases) to perform a query that we already know the answer to.
As mentioned above, a StatefulSet always uses consecutive 0-based suffixes. It is not possible to have a StatefulSet that does not have the `...-0` pod. If, for any reason, other nodes start successfully but pod `...-0` can't, it's totally acceptable in my opinion that the other nodes would just keep waiting for node `...-0` to start and then join it (note: peer discovery only happens on the initial cluster deployment, so the unavailability of pod `...-0` wouldn't affect any operations once the cluster is formed initially). There's little benefit in forming a cluster without it.

So instead of querying the Kubernetes API, the plugin could just take a single parameter, `cluster_formation.k8s.statefulset_name`, append `-0` to that value and treat the result as the seed node. If the configuration option is not present, just replace the `-ID` suffix of the local node with `-0`.
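As a rough sketch of that fallback derivation (module and function names are made up; when `cluster_formation.k8s.statefulset_name` is set, simply appending `-0` to it would be enough):

```erlang
-module(k8s_seed_sketch).
-export([seed_node/0]).

%% Hypothetical sketch: derive the seed node from the local node name by
%% replacing the StatefulSet ordinal with 0, e.g.
%% 'rabbit@foo-2.foo-nodes.ns' becomes 'rabbit@foo-0.foo-nodes.ns'.
seed_node() ->
    [Prefix, Host] = string:split(atom_to_list(node()), "@"),
    [Pod | Rest] = string:split(Host, ".", all),
    SeedPod = re:replace(Pod, "-\\d+$", "-0", [{return, list}]),
    SeedHost = string:join([SeedPod | Rest], "."),
    list_to_atom(Prefix ++ "@" ++ SeedHost).
```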
Benefits:
Drawbacks:
Describe alternatives you've considered
We can just do nothing and hope that one day someone provides sufficient data to understand why the k8s plugin occasionally fails in some environments. It hasn't happened in our GKE cluster even once in the last few years, so there seems to be something environment-specific going on.
Additional context
No response