Consul 0.5.2/Raft internal bug/deadlock on leader loss/flap, cluster fail to reelect new one #1788

Closed
sitano opened this issue Mar 3, 2016 · 2 comments
Labels
type/bug Feature does not function as expected

Comments

@sitano

sitano commented Mar 3, 2016

We ran into a Consul cluster leader re-election deadlock triggered by connectivity issues (ConnectEx timeouts) in the Raft layer between the quorum servers.

We usually run 3 Consul servers in a quorum with a lot of clients around them. In a long-lived cluster, after the leader was suddenly lost and then reappeared (flapped), the cluster failed to re-elect and agree on a new leader because of unexplained timeouts between those servers. However, all of the servers were accessible at that time and reachable from other machines. No network connectivity issues were registered at that moment.

Restarting 1 Consul server did not help: connections to the servers that had not been restarted kept timing out. Restarting all 3 Consul servers allowed them to reach each other on :8300 again, which points to a problem in Consul's Raft-level network stack.

Similar issues

Environment:

  1. 64-bit Windows machines
  2. Consul 0.5.2, Go 1.4.2, revision = 9a9cc93 (standard distribution)
  3. 3 Consul servers form a quorum
  4. lots of clients
agent:
        check_monitors = 4
        check_ttls = 0
        checks = 4
        services = 2
build:
        prerelease =
        revision = 9a9cc934
        version = 0.5.2
consul:
        bootstrap = false
        known_datacenters > 5
        leader = false
        server = true
runtime:
        arch = 386
        cpu_count = 8
        goroutines = 185
        max_procs = 2
        os = windows
        version = go1.4.2
serf_lan:
        encrypted = true
serf_wan:
        encrypted = true

Logs:

server007-consul-issue-0103-cut.txt
server001-consul-issue-0103-cut.txt
server017-consul-issue-0103-cut.txt

@sitano sitano changed the title from "Consul 0.5.2/Raft internal bug/deadlock on leader lost, cluster failed to re-elect" to "Consul 0.5.2/Raft internal bug/deadlock on leader loss/flap, cluster fail to reelect new one" on Mar 3, 2016
@slackpad slackpad added the type/bug Feature does not function as expected label Mar 11, 2016
@slackpad
Contributor

Hi @sitano thanks for the detailed report. We fixed a number of deadlock-type issues with yamux in Consul 0.6.0 - is it possible for you to try a newer version of Consul and/or do you have a reproducible setup where you can experiment?

@sitano
Author

sitano commented Mar 21, 2016

Yes, we are considering moving to 0.6.4. Thx.

@sitano sitano closed this as completed Mar 21, 2016