We ran into a leader re-election deadlock in a Consul cluster, caused by connectivity problems (ConnectEx timeouts) in the Raft layer between the quorum servers.
We usually run 3 Consul servers in a quorum with many clients around them. In a long-lived cluster, after the leader suddenly flapped (was lost and reappeared), the cluster failed to elect and agree on a new leader because of unexplained timeouts between those servers. However, all of the servers were up at the time and reachable from other machines, and no network connectivity issues were registered at that moment.
Restarting 1 Consul server did not help: connections to the servers that had not been restarted kept timing out. Only restarting all 3 Consul servers let them reach each other on :8300 again. This looks like a problem in Consul's Raft-level network stack.
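For anyone trying to reproduce this, here is a minimal sketch of how plain TCP reachability on :8300 could be checked from each server while the cluster is stuck. It is not part of Consul; the function names `can_reach` and `probe_peers` and the 2-second timeout are assumptions made for illustration. A successful connect only shows that the port accepts connections. It does not exercise the Raft RPC path, so it can succeed even while Raft traffic is timing out.

```python
import socket


def can_reach(host: str, port: int = 8300, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def probe_peers(peers, port: int = 8300, timeout: float = 2.0) -> dict:
    """Probe every peer once and map each host name to its reachability."""
    return {host: can_reach(host, port, timeout) for host in peers}


# Example call (host names are placeholders, adjust to your servers):
#   probe_peers(["server001", "server007", "server017"])
```

Running this on each of the 3 servers against the other 2 would show whether the timeouts are visible at the TCP connect level or only inside the Raft layer.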
sitano changed the title from "Consul 0.5.2/Raft internal bug/deadlock on leader lost, cluster failed to re-elect" to "Consul 0.5.2/Raft internal bug/deadlock on leader loss/flap, cluster fail to reelect new one" on Mar 3, 2016
Hi @sitano, thanks for the detailed report. We fixed a number of deadlock-type issues with yamux in Consul 0.6.0. Is it possible for you to try a newer version of Consul, and/or do you have a reproducible setup where you can experiment?
Similar issues
Environment:
Logs:
server007-consul-issue-0103-cut.txt
server001-consul-issue-0103-cut.txt
server017-consul-issue-0103-cut.txt