
Introduce seed-node based peer discovery option #13043

Open
mkuratczyk opened this issue Jan 8, 2025 · 17 comments

@mkuratczyk
Contributor

Is your feature request related to a problem? Please describe.

Currently there's no way to configure RabbitMQ by specifying a seed node that all other nodes have to join. All mechanisms assume that if a node can't join other nodes, it should form a new cluster so that other nodes can join it. This works well in the vast majority of cases but still causes "mis-clustering" in some environments: for example, a 3-node cluster deployment to Kubernetes may start as 2 nodes clustered together and 1 node that formed a separate cluster. While we see this issue reported occasionally, we have never received full debug logs from all nodes to understand why it happens.

This is not a Kubernetes-specific issue, however. While each peer discovery backend has its own logic and is more or less prone to this issue, it also happened recently in our CI with classic peer discovery.

In many environments, including Kubernetes, all (or at least some) node names are known upfront. For example, a StatefulSet named foo on Kubernetes will always create pods named foo-0, foo-1 and so on. Therefore, rather than querying the Kubernetes API for the list of endpoints behind a Service, we could just configure RabbitMQ to use foo-0 as the seed node, since it has to exist regardless of the number of nodes in the cluster.

Another recent report:
https://discord.com/channels/1092487794984755311/1092487853654687774/1324478177237536859

Describe the solution you'd like

There are a few options and I'd like to get some feedback on them.

OPTION 1: New Backend

Add a new peer discovery backend that takes a single node name; only that node is allowed to form a new cluster, while all the other nodes need to join it. This will require some changes to the generic parts of the peer discovery mechanism, since it currently always falls back to forming a new cluster after the configured number of attempts. In this case, we would like all nodes apart from the seed node to just keep trying forever. They are not allowed to form a new cluster under any circumstances.
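
A hypothetical configuration sketch of what that could look like (the backend name and option key below are made up for illustration; they are not existing RabbitMQ settings):

cluster_formation.peer_discovery_backend = seed_node
cluster_formation.seed_node.node = [email protected]

Every node other than [email protected] would retry joining it indefinitely instead of ever falling back to forming a new cluster.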

The downsides:

  • completely opt-in - you need to start using the new backend to get the benefits
  • confusing - users expect k8s peer discovery to be the best option for Kubernetes for example

OPTION 2: New Behaviour in Classic Config Peer Discovery

Introduce a new behaviour within the classic peer discovery backend. The desired logic is very close to what the classic backend already does. However, it currently:

  • expects the local node to be on the list of configured nodes
  • falls back to forming a new cluster (see above)

The new behaviour could be: if the local node is not on the list, it's not allowed to form a new cluster and has to join one of the other nodes. With such changes, the following configuration should accomplish the desired behaviour:

cluster_formation.peer_discovery_backend = classic_config
cluster_formation.classic_config.nodes.1 = [email protected]

The downsides are:

  • confusing - k8s peer discovery is not the best backend on Kubernetes (would not be used by the Cluster Operator)
  • potentially confusing - different behaviour based on whether the local node is on the list or not; however, personally I was surprised it was not working like that. For me it feels pretty intuitive that a node not mentioned on the list has to join one of the nodes that are mentioned
  • it's not clear how the mechanism would work if the list contained multiple seed nodes but not all expected nodes of the cluster. I'd rather only allow a single seed node.

We could introduce a dedicated configuration option for this behaviour, for example:

cluster_formation.classic_config.seed_node = [email protected]        # or perhaps just "node"

OPTION 3: Change k8s Peer Discovery

Change the way k8s peer discovery works. Currently it queries the Kubernetes API to receive the list of endpoints for a given Service. There are a lot of configuration options for connecting to the Kubernetes API (needed only if the defaults don't work; fortunately they do work in most cases), all to perform a query that we already know the answer to.

As mentioned above, a StatefulSet always uses consecutive 0-based suffixes. It is not possible to have a StatefulSet that does not have the ...-0 pod. If, for any reason, other nodes start successfully but pod ...-0 can't, it's totally acceptable in my opinion that the other nodes would just keep waiting for node ...-0 to start and then join it (note: peer discovery only happens on the initial cluster deployment, so the unavailability of pod ...-0 wouldn't affect any operations once the cluster is formed). There's little benefit in forming a cluster without it.

So instead of querying the Kubernetes API, the plugin could just take a single parameter cluster_formation.k8s.statefulset_name, append -0 to that value and treat the result as the seed node. If the configuration option is not present, just replace the -ID suffix of the local node with -0.
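
A rough Erlang sketch of that fallback derivation (illustrative only; the function name and regex are mine, not the actual plugin code):

%% Hypothetical helper: derive the seed node by replacing the local node's
%% ordinal suffix with -0, e.g. rabbit@foo-2.foo-nodes.ns -> rabbit@foo-0.foo-nodes.ns.
seed_node_from_local_node() ->
    Local = atom_to_list(node()),
    Seed = re:replace(Local, "-[0-9]+(\\.|$)", "-0\\1", [{return, list}]),
    list_to_atom(Seed).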

Benefits:

  • we could make this a transparent change for Kubernetes users - the k8s plugin would just start working (even) more reliably

Drawbacks:

  • users might want this behaviour despite not using Kubernetes; using the k8s peer discovery plugin outside of Kubernetes is counter-intuitive

Describe alternatives you've considered

We can just do nothing and hope that one day someone provides sufficient data to understand why the k8s plugin occasionally fails in some environments. It hasn't happened in our GKE cluster even once in the last few years, so there seems to be something environment-specific.

Additional context

No response

@michaelklishin
Member

I vote for Option 3 as the least disruptive. What is specific to Kubernetes should be addressed in the Kubernetes plugin.

Other peer discovery backends work fine in practice and we have years of evidence to back this claim.

@ansd
Member

ansd commented Jan 9, 2025

We faced exactly the same problem four years ago in rabbitmq/cluster-operator#662

Back then, the root cause of malformed clusters was randomised startup delays within rabbitmq-server.
Randomised startup delays were a poor design choice because:

  1. a node might take a long time to start, and
  2. every now and then, there is a chance that two nodes out of multiple start roughly at the same time (just a few milliseconds apart), leading both nodes to form new clusters.

To solve this problem, I implemented two solutions:

  1. I delivered a proof of concept using a seed node in Kubernetes via "Explore: classic peer discovery without randomised startup delay" (cluster-operator#689). This is one of the options you describe in this issue. This approach worked very well.
  2. Remove randomised startup delays in rabbitmq-server via "Remove randomized startup delays" (#3075).

I decided to go with solution 2. This solution uses Erlang locks and has worked extremely well. We have never seen any malformed clusters thereafter (that is since solution 2 shipped in RabbitMQ 3.8.18).

The problem described in this issue has started to come back in RabbitMQ 3.13 within the last few months (with RabbitMQ 4.0 if I'm not mistaken). I know there were peer discovery changes made on the 3.13 code base. In my opinion, before coming up with a new solution, I suggest first debugging and understanding the root cause of why the recent peer discovery changes lead to malformed clusters, and then ideally fixing that root cause.

@mkuratczyk
Contributor Author

Yes, understanding and fixing the issue would be great. However, we have recent reports of this problem with 3.13 and 4.0, and every single time we ask for debug logs from all nodes we never get them. I can try to ping the users reporting the problem again. Otherwise we are kind of stuck with users reporting the problem every few months and not being able to do much about it. This issue has never happened to me on GKE since the locking was introduced.

The third issue I linked to happened with classic config in our CI. It's not strictly a Kubernetes problem.

I feel like changes to the k8s peer plugin (option 3) can be justified on simplicity alone: the plugin currently queries the Kubernetes API for a list of pod names that belong to a StatefulSet, which is completely unnecessary since these names are predictable. Therefore all the plugin's configuration options are redundant (since they are all about the Kubernetes API endpoint, TLS settings, etc.). It's just a completely over-engineered design.

@ansd
Member

ansd commented Jan 9, 2025

we have recent reports of this problem with 3.13 and 4.0

Can you provide a link to the 3.13 report?

It's just a completely over-engineered design.

I agree. (On the other hand, changing only the K8s peer discovery plugin won't fix our CI issue.)

@mkuratczyk
Contributor Author

The discord link above is for 3.13: https://discord.com/channels/1092487794984755311/1092487853654687774/1324478177237536859

I just pinged the reporter and they said they would try to provide debug logs next week. 🤞

@ansd
Member

ansd commented Jan 9, 2025

Sorry, I corrected my message above. As you wrote in rabbitmq/cluster-operator#662 (comment)

There were significant changes to peer discovery in 3.13

So, I'm sure these significant changes re-introduced the clustering issue described here.
It seems 3.13 came full circle and we're back to the same issue and symptoms we had four years ago, this time with another, yet to be identified, root cause.

@mkuratczyk
Contributor Author

I published a draft PR that shows what a redesigned k8s peer discovery plugin could look like:
#13050

@dumbbell
Member

While looking at some failures in CI with peer discovery today, I read about a change in Erlang 26 in its time correction handling that affects peer discovery seed node selection negatively. The code that computes the node start time can now return different results over time. In one CI failure, there is a difference of 22 seconds in the start time of a specific node when queried about 15 seconds apart. As a consequence, the sorting of nodes is unstable.

The start time is based on the Erlang monotonic time, so an arbitrary time reference. Erlang monotonic times can't be compared between nodes because they won't have the same arbitrary time reference. To convert it to a system time that can be used for comparison, you have to query the local time offset and add it to the monotonic time. That's what the peer discovery code does.

However, since Erlang 26, that time offset is volatile (that's the documented word, not mine). This improves time performance and precision according to the docs. Unfortunately, it completely breaks my system-time-based start time calculation…

I will try to compute that start time once and cache it. I just need a stable value that allows peer discovery to determine which node is the oldest.
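
For reference, a minimal sketch of the calculation described above, using standard Erlang APIs (not the actual rabbit code):

%% VM start time expressed as system time: the monotonic time at VM start plus
%% the current time offset. Since Erlang/OTP 26 the offset is volatile, so
%% calling this repeatedly may return different values over time.
start_system_time_ms() ->
    Native = erlang:system_info(start_time) + erlang:time_offset(),
    erlang:convert_time_unit(Native, native, millisecond).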

@ansd
Member

ansd commented Jan 22, 2025

Sounds good, each node could compute its startup system time during boot and write it to some local ETS table so that this value is stable. Other nodes can then query this ETS table.

That said, if the computed startup system time on node A is lower than the computed startup system time of node B, it doesn't necessarily mean that node A started before node B. However, as I understand you, that's acceptable since the order is deterministic.

@dumbbell
Member

I went with a persistent_term but the idea is the same. The code will try to call this function and if it is undefined, it will do the same thing with several RPCs as before (but with the added caching).

That said, if the computed startup system time on node A is lower than the computed startup system time of node B, it doesn't necessarily mean that node A started before node B. However, as I understand you, that's acceptable since the order is deterministic.

Yes. To be clear, the start time is not the uptime. It’s the date/time at which the Erlang VM started. It is used to sort nodes that are not clustered yet. Once one of them is clustered, it will sort first anyway and be selected. That start time allows a stable sort even if the list grows with new nodes appearing.
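
A minimal sketch of that caching approach (the key and function names are illustrative, not the actual implementation in #13128):

-define(START_TIME_KEY, {rabbit_peer_discovery, node_start_time}).

%% Compute the system-time-based start time once, early during boot, and keep
%% it in a persistent_term so that later queries always see the same value.
cache_node_start_time() ->
    Native = erlang:system_info(start_time) + erlang:time_offset(),
    persistent_term:put(?START_TIME_KEY, Native).

%% Returns the cached value, or undefined on nodes that have not cached it yet,
%% in which case the caller can fall back to the RPC-based computation.
node_start_time() ->
    persistent_term:get(?START_TIME_KEY, undefined).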

@ansd
Member

ansd commented Jan 22, 2025

To be clear, the start time is not the uptime. It’s the date/time at which the Erlang VM started.

Yes, that's clear. I looked at the current implementation. My understanding is that with both the current implementation and your fix in #13128 two independent clusters can be formed as follows:

  1. Node A starts.
  2. A queries peer discovery backend, which returns {A}.
  3. Node A starts as standalone node.
  4. (Very shortly after,) Node B starts.
  5. B queries peer discovery backend, which returns the set {A, B}.
  6. On B, rabbit_peer_discovery:query_node_props/1 returns the sorted list [B, A] because both A and B have a cluster size of 1 (each knowing only its own node) and B has a lower reported startup system time than A (i.e. B is reported to be older than A). This latter condition is rare, but it can happen, even with your fix, right?
  7. B also starts as a standalone node because B is at the head of the sorted list, resulting in two separate clusters.
  8. (Any nodes started thereafter will join B).

Is my understanding correct?

@dumbbell
Member

dumbbell commented Jan 23, 2025

If the start_time is incorrectly reported, then yes. The host clock could be imprecise as well; I guess that’s what happens in CI. I hesitate to cache the start time early on startup. Perhaps it can reduce the chance of inversion.

Do you think we should use another source of time?

Note that there is another race condition regardless of the fix: there is a time window between the moment node C queries nodes and decides to join node A and the moment node A returns a list of members with [A, C]. During that time window, if node B appears and is first in the list, node D might decide to join B instead of A. This one requires some kind of atomicity between the membership query and the actual join. Using a global lock is going to be tricky because nodes can’t be connected to each other at the time of the queries.

This is not something that can be repaired at the end of peer discovery because disbanding all clusters but one and adding disbanded nodes to the wanted cluster might cause data loss.

@ansd
Member

ansd commented Jan 23, 2025

Do you think we should use another source of time?

No, I think it would be best to eliminate clustering with the "oldest" node in the peer discovery algorithm, because without using Lamport clocks the order of events across multiple nodes is not clear, and therefore it's not defined who the "oldest" node really is. It can happen that after a cluster has been formed, a few seconds later another node starts which is reported to be the oldest although it's not.

By using any form of wall clock time (as currently done), it's not possible to tell which node really is the oldest.
Querying the node-local wall clock time, as returned by erlang:system_time() or erlang:monotonic_time() + erlang:time_offset(), is unreliable. As you correctly said, there could be clock drift across the different nodes, where the clock on one node runs several seconds ahead of or behind the clock of another node. Even if each node uses a precise hardware clock and erlang:system_info(start_time) is correctly reported within a very low confidence interval, it can happen that the node boot process halts arbitrarily for a few seconds before or during the different peer discovery steps.

In #3075 I used global locks, which made it impossible to form multiple clusters. That's because as soon as multiple nodes are returned from the peer discovery backend, a node needs to acquire the global lock on the returned "live" nodes to decide whether to proceed as a standalone node (i.e. become the "seed node") or to join another node.
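
Roughly, such a lock-based gate could look like this (a sketch using the standard global module; the lock id and function names are illustrative):

%% Take a cluster-wide lock across the discovered nodes before deciding whether
%% to become the seed node or to join an existing one. global:set_lock/2 blocks
%% (retries forever by default) until the lock is acquired on all listed nodes.
run_with_formation_lock(DiscoveredNodes, DecideAndJoinFun) ->
    LockId = {rabbitmq_cluster_formation, node()},
    true = global:set_lock(LockId, DiscoveredNodes),
    try
        DecideAndJoinFun()
    after
        global:del_lock(LockId, DiscoveredNodes)
    end.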

Using a global lock is going to be tricky because nodes can’t be connected to each other at the time of the queries.

Setting a global lock will cause the nodes to be connected (i.e. to form an Erlang cluster). Why is that a problem?

@dumbbell
Member

The implementation at the time we worked on Khepri clustering did not prevent multiple clusters from being created. At least, Michal hit this problem many times while testing on Kubernetes and the peer_discovery_classic_config_SUITE testsuite failed frequently. That’s why we had to rework the algorithm.

Perhaps that was because we had to get rid of the peer-discovery-specific clustering code path in rabbit_mnesia to use the join_cluster code path instead. And the behavior changed slightly.

Also, if nodes are connected without being cluster members, this will affect other users of global, like the feature flags controller. I don't remember the details, but that's what I added as a comment in the peer discovery code :-)

@dumbbell
Member

We can definitely try a global lock again.

I tend to document the reasoning and decision process in comments and commit messages, but I can't find what was wrong with the previous global lock, besides that mention of the feature flags controller…

@ansd
Member

ansd commented Jan 23, 2025

I suppose it might be worth trying to set a global lock again.

If this doesn't work out for some reason, we'll have to re-think how best to perform peer discovery.
Some of the requirements are:

  1. The list of nodes might be known ahead of time, or needs to be discovered dynamically.
  2. Nodes might be started and clustered one after the other, or started all in parallel.

I haven't fully thought it through, but one way of solving this problem could be to shift the decision of which node becomes the seed node from module rabbit_peer_discovery to the peer discovery backends. For example

  • k8s v2 will decide who the seed node is (pod with the lowest ordinal)
  • for classic peer discovery (and all other backends where the list of nodes is known ahead of time), I think we could mandate either starting all nodes in parallel or having the operator put the node that starts first at the head of the list of nodes, meaning the seed node is the first one listed in the config.
  • etcd backend already today decides who the seed node is
  • Consul also offers a mechanism for leader election.

@dumbbell
Member

I haven't fully thought it through, but one way of solving this problem could be to shift the decision of which node becomes the seed node from module rabbit_peer_discovery to the peer discovery backends. For example

This is already possible. rabbit_peer_discovery does its own selection only if the backend does not. As you said, kubernetes v2, etcd and consul select the seed node. Classic config, DNS and kubernetes v1 rely on rabbit_peer_discovery. Basically, we only need to find a solution for classic config and DNS.

I think I remember one of the issues with the previous implementation: it happened that node A joined node B, which joined node C (leaving node A alone). So the use of a lock didn't solve the whole issue: nodes must join the same seed node anyway, or at least verify whether something happened behind their back.
