Windows: one out of three cluster nodes stops while booting and does not join the cluster during cluster formation. #13126

BondByteBlaster · 2025-01-22T11:18:55Z

Community Support Policy

I have read RabbitMQ's Community Support Policy
I run RabbitMQ 4.x, the only series currently covered by community support
I promise to provide all relevant information (versions, logs from all nodes, rabbitmq-diagnostics output, detailed reproduction steps)

RabbitMQ version used

4.0.3

Erlang version used

26.2.x

Operating system (distribution) used

Windows

How is RabbitMQ deployed?

Windows installer

rabbitmq-diagnostics status output

See https://www.rabbitmq.com/docs/cli to learn how to use rabbitmq-diagnostics

Can't run anything against the RabbitMQ instance since it's not running properly, but here is a file with the CMD output:
[RabbitMQ CMD LOG.txt](https://github.com/user-attachments/files/18504439/RabbitMQ.CMD.LOG.txt)

Logs from node 1 (with sensitive values edited out)

See https://www.rabbitmq.com/docs/logging to learn how to collect logs

[RabbitMQ detailed log.txt](https://github.com/user-attachments/files/18504453/RabbitMQ.detailed.log.txt)

Logs from node 2 (if applicable, with sensitive values edited out)

See https://www.rabbitmq.com/docs/logging to learn how to collect logs

[RabbitMQ node 2 log.txt](https://github.com/user-attachments/files/18504457/RabbitMQ.node.2.log.txt)

Logs from node 3 (if applicable, with sensitive values edited out)

See https://www.rabbitmq.com/docs/logging to learn how to collect logs

[RabbitMQ node 3 log.txt](https://github.com/user-attachments/files/18504462/RabbitMQ.node.3.log.txt)

rabbitmq.conf

See https://www.rabbitmq.com/docs/configure#config-location to learn how to find rabbitmq.conf file location

listeners.ssl.default = 5671
ssl_options.cacertfile = C:/myappCertificates/odis1.myappsupport.se-chain.pem
ssl_options.certfile = C:/myappCertificates/odis1.myappsupport.se-crt.pem
ssl_options.keyfile = C:/myappCertificates/odis1.myappsupport.se-key.pem
ssl_options.verify = verify_peer
ssl_options.fail_if_no_peer_cert = false
log.console = true
log.console.level = warning
log.file = C:/ProgramData/RabbitMQ/log/rabbitmq.log
log.file.level = warning

Steps to deploy RabbitMQ cluster

Add an environment variable to install it for all users.
Check open ports in the Windows Firewall.
Verify communication with the other two nodes.
Ensure the correct certificate is in the custom folder.
Reinstall RabbitMQ on node 1.
Enable RabbitMQ Management Plugin with:
rabbitmq-plugins enable rabbitmq_management
Check the .erlang file to ensure it matches the cluster configuration.
Install the Microsoft Handler to avoid warnings.
Configure RabbitMQ using the configuration file.
Restart RabbitMQ and try accessing the web tool (login with guest/guest).
Run the following commands:

rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl join_cluster rabbit@mycompany09 # Also tried rabbit@mycompany08
rabbitmqctl start_app

Steps to reproduce the behavior in question

I don't know why it failed or how to reproduce it.

advanced.config

See https://www.rabbitmq.com/docs/configure#config-location to learn how to find advanced.config file location

EMPTY

Application code

I don't think this matters in this case, but I have several C# services working against the RabbitMQ cluster.

Kubernetes deployment file

None

What problem are you trying to solve?

First off, we run RabbitMQ 3.13.0 (with Erlang 26.2.2), which was installed for the first time at the beginning of last year (2024). We missed that it needs to be updated more often, so we will schedule updates for this, and I will push for a Rabbit license.

We are running RabbitMQ on three Windows Server nodes with a couple of quorum queues. This has been working fine for nearly a year now, but suddenly we noticed a large quantity of messages in a DLX queue as well as some errors from our services. RabbitMQ on node 2 had shut down and could not be restarted. We tried restarting the entire service, but that did not help.

By running RabbitMQ in CMD, I could see that it freezes right at the start. I enabled more logging and found out that it gets stuck on something I believe is syncing Feature Flags with the cluster. It just loops rapidly and indefinitely.

I reinstalled RabbitMQ and Erlang on node 1 several times, double-checked all settings, including features and add-ons, to ensure they match the rest of the cluster. There is no problem removing it from the cluster and then adding it back in, but when starting the RabbitMQ service, it always freezes.
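For reference, a minimal sketch of how such a settings comparison could be scripted rather than done by eye. It assumes each node's rabbitmq.conf uses plain `key = value` lines with `#` comments, as in the file shown earlier; the function names `parse_conf` and `diff_conf` are hypothetical and not part of any RabbitMQ tooling.

```python
def parse_conf(text):
    """Parse `key = value` lines, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            # Not a key/value line; ignore it in this simplified parser.
            continue
        settings[key.strip()] = value.strip()
    return settings


def diff_conf(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key that differs.

    A key missing on one side is reported with None for that side.
    """
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }
```

Some differences are expected and harmless, for example per-node certificate paths under `ssl_options.*`, so the output is a list of things to review, not a list of errors.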

I'm beginning to think that the problem lies within the RabbitMQ cluster that is still running. Maybe it's a split-brain problem? We do regular restarts of the cluster, but we always leave one node operational. In this case, we probably want to turn everything off and then start it back up to hopefully re-sync the environment. However, it is very important not to lose any data.

Please help us get back on track so we can get it running and update the environment.

Answered by michaelklishin

michaelklishin · 2025-01-22T12:23:10Z

michaelklishin
Jan 22, 2025
Maintainer

RabbitMQ v3.13.x is out of community support.

3.13.0 is seven patch releases behind the 3.13.x series and 13 releases overall (behind 4.0.5).

Time to upgrade to 4.0.5.

michaelklishin · 2025-01-22T15:46:41Z

michaelklishin
Jan 22, 2025
Maintainer

For our team's own needs: the logs on node 1 stop right after the feature flag controller tries to use the global registry:

2025-01-21 16:29:29.791000+01:00 [debug] <0.1741.0> == Prelaunch DONE ==
2025-01-21 16:29:29.791000+01:00 [info] <0.1741.0> 
2025-01-21 16:29:29.791000+01:00 [info] <0.1741.0>  Starting RabbitMQ 3.13.0 on Erlang 26.2.2 [jit]
2025-01-21 16:29:29.791000+01:00 [info] <0.1741.0>  Copyright (c) 2007-2024 Broadcom Inc and/or its subsidiaries
2025-01-21 16:29:29.791000+01:00 [info] <0.1741.0>  Licensed under the MPL 2.0. Website: https://rabbitmq.com
2025-01-21 16:29:29.792000+01:00 [debug] <0.1849.0> Feature flags: controller standing by
2025-01-21 16:29:29.792000+01:00 [debug] <0.1741.0> Register `rabbit` process (<0.1741.0>) for rabbit_node_monitor
2025-01-21 16:29:29.797000+01:00 [info] <0.1741.0> 
2025-01-21 16:29:29.797000+01:00 [info] <0.1741.0>  node           : rabbit@mycompany07
2025-01-21 16:29:29.797000+01:00 [info] <0.1741.0>  home dir       : c:/Users/UserFirstname.userseconname
2025-01-21 16:29:29.797000+01:00 [info] <0.1741.0>  config file(s) : c:/ProgramData/RabbitMQ/advanced.config
2025-01-21 16:29:29.797000+01:00 [info] <0.1741.0>                 : c:/ProgramData/RabbitMQ/rabbitmq.conf
2025-01-21 16:29:29.797000+01:00 [info] <0.1741.0>  cookie hash    : wqjO2IWgOuDy7cc3OgLURQ
2025-01-21 16:29:29.797000+01:00 [info] <0.1741.0>  log(s)         : <stdout>
2025-01-21 16:29:29.797000+01:00 [info] <0.1741.0>                 : c:/ProgramData/RabbitMQ/log/rabbitmq.log
2025-01-21 16:29:29.797000+01:00 [info] <0.1741.0>  data dir       : c:/ProgramData/RabbitMQ/db/rabbit@mycompany07-mnesia
2025-01-21 16:29:29.797000+01:00 [debug] <0.1741.0> 
2025-01-21 16:29:29.797000+01:00 [debug] <0.1741.0> == Plugins (prelaunch phase) ==
2025-01-21 16:29:29.797000+01:00 [debug] <0.1741.0> Setting plugins up
2025-01-21 16:29:29.967000+01:00 [debug] <0.1741.0> Plugins discovery: ignoring getopt, not a RabbitMQ plugin
2025-01-21 16:29:29.968000+01:00 [debug] <0.1741.0> Plugins discovery: ignoring quantile_estimator, not a RabbitMQ plugin
2025-01-21 16:29:30.043000+01:00 [debug] <0.1741.0> Loading the following plugins: [cowlib,oauth2_client,cowboy,amqp_client,
2025-01-21 16:29:30.043000+01:00 [debug] <0.1741.0>                                 rabbitmq_web_dispatch,
2025-01-21 16:29:30.043000+01:00 [debug] <0.1741.0>                                 rabbitmq_management_agent,rabbitmq_management,
2025-01-21 16:29:30.043000+01:00 [debug] <0.1741.0>                                 prometheus,amqp10_client,rabbitmq_shovel,
2025-01-21 16:29:30.043000+01:00 [debug] <0.1741.0>                                 accept,rabbitmq_shovel_management,
2025-01-21 16:29:30.043000+01:00 [debug] <0.1741.0>                                 rabbitmq_prometheus]
2025-01-21 16:29:30.044000+01:00 [debug] <0.1741.0> Feature flags: REFRESHING after applications load...
2025-01-21 16:29:30.044000+01:00 [debug] <0.1849.0> Feature flags: registering controller globally before proceeding with task: refresh_after_app_load
2025-01-21 16:29:30.044000+01:00 [debug] <0.1849.0> Feature flags: [global sync] @ rabbit@mycompany07
2025-01-21 16:29:30.044000+01:00 [debug] <0.1849.0> Feature flags: [global register] @ rabbit@mycompany07

(the same message is repeated N times)

2025-01-21 16:29:30.058000+01:00 [debug] <0.1849.0> Feature flags: controller NOT globally registered; need to wait for the current global controller's task to finish
2025-01-21 16:29:30.058000+01:00 [debug] <0.1849.0> Feature flags: current global controller's task finished; trying to take next turn
2025-01-21 16:29:30.058000+01:00 [debug] <0.1849.0> Feature flags: registering controller globally before proceeding with task: refresh_after_app_load
2025-01-21 16:29:30.058000+01:00 [debug] <0.1849.0> Feature flags: [global sync] @ rabbit@mycompany07
2025-01-21 16:29:30.058000+01:00 [debug] <0.1849.0> Feature flags: [global register] @ rabbit@mycompany07
2025-01-21 16:29:30.060000+01:00 [debug] <0.1849.0> Feature flags: controller NOT globally registered; need to wait for the current global controller's task to finish
2025-01-21 16:29:30.060000+01:00 [debug] <0.1849.0> Feature flags: current global controller's task finished; trying to take next turn
2025-01-21 16:29:30.060000+01:00 [debug] <0.1849.0> Feature flags: registering controller globally before proceeding with task: refresh_after_app_load
2025-01-21 16:29:30.060000+01:00 [debug] <0.1849.0> Feature flags: [global sync] @ rabbit@mycompany07
2025-01-21 16:29:30.060000+01:00 [debug] <0.1849.0> Feature flags: [global register] @ rabbit@mycompany07

I could not find anything immediately related (other than changes that strictly concern logging, such as #12444 for 4.1.0), but there were feature flag-related changes in 4.0.x and there will be more in 4.1.x.
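To gauge how tight the retry loop in the excerpt above is, a small script can count the registration attempts and the "NOT globally registered" failures in a debug log. This is only a sketch: the message fragments are copied from the log lines quoted above, the line layout (timestamp, level, pid, message) is assumed to match that excerpt, and `summarize_registration_loop` is a hypothetical helper, not a RabbitMQ tool.

```python
import re
from collections import Counter

# Message fragments copied from the debug log excerpt in this thread.
REGISTER_ATTEMPT = "registering controller globally before proceeding with task"
NOT_REGISTERED = "controller NOT globally registered"

# Assumed layout: "<date> <time+offset> [<level>] <pid> <message>".
LINE_RE = re.compile(
    r"^(?P<ts>\S+ \S+) \[(?P<level>\w+)\] (?P<pid><[\d.]+>) ?(?P<msg>.*)$"
)


def summarize_registration_loop(lines):
    """Count global registration attempts and failures in log lines."""
    attempts = 0
    failures = 0
    failures_per_second = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # Not a log line in the assumed layout.
        msg = match.group("msg")
        if REGISTER_ATTEMPT in msg:
            attempts += 1
        elif NOT_REGISTERED in msg:
            failures += 1
            # First 19 characters are "YYYY-MM-DD HH:MM:SS".
            failures_per_second[match.group("ts")[:19]] += 1
    return {
        "attempts": attempts,
        "failures": failures,
        "failures_per_second": dict(failures_per_second),
    }
```

It could be called with an open file object for the node's log, since iterating over a file yields lines. A large number of failures within a single second would be consistent with the rapid, indefinite loop described in the original report.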
