
Single Consul server stop and start causes election even with established quorum #1674

Closed
glenwong opened this issue Feb 2, 2016 · 11 comments
Labels
type/bug Feature does not function as expected

Comments

@glenwong

glenwong commented Feb 2, 2016

I have a test case with 3 Vagrant boxes running Consul in server mode, and I noticed that if I kill one of the Consul processes and then start it again, it causes an election even though there is already an established leader. This causes a momentary HTTP API access outage for clients. I've attached the logs I see below; the machines are called proxy001, proxy002, and proxy003.

Leader logs:

2016/02/02 18:27:07 [INFO] serf: EventMemberJoin: proxy001 192.168.57.10
2016/02/02 18:27:07 [INFO] consul: adding LAN server proxy001 (Addr: 192.168.57.10:8300) (DC: dc1)
2016/02/02 18:27:07 [INFO] consul: member 'proxy001' joined, marking health alive
2016/02/02 18:27:09 [WARN] raft: Rejecting vote from 192.168.57.10:8300 since we have a leader: 192.168.57.11:8300
2016/02/02 18:27:09 [ERR] raft: peer 192.168.57.10:8300 has newer term, stopping replication
2016/02/02 18:27:09 [INFO] raft: Node at 192.168.57.11:8300 [Follower] entering Follower state
2016/02/02 18:27:09 [INFO] raft: aborting pipeline replication to peer 192.168.57.12:8300
2016/02/02 18:27:09 [INFO] consul: cluster leadership lost
2016/02/02 18:27:10 [WARN] raft: Rejecting vote from 192.168.57.10:8300 since our last index is greater (60502, 60500)
2016/02/02 18:27:11 [WARN] raft: Heartbeat timeout reached, starting election
2016/02/02 18:27:11 [INFO] raft: Node at 192.168.57.11:8300 [Candidate] entering Candidate state
2016/02/02 18:27:11 [INFO] raft: Election won. Tally: 2
2016/02/02 18:27:11 [INFO] raft: Node at 192.168.57.11:8300 [Leader] entering Leader state
2016/02/02 18:27:11 [INFO] consul: cluster leadership acquired
2016/02/02 18:27:11 [INFO] consul: New leader elected: proxy002
2016/02/02 18:27:11 [INFO] raft: pipelining replication to peer 192.168.57.12:8300
2016/02/02 18:27:11 [WARN] raft: AppendEntries to 192.168.57.10:8300 rejected, sending older logs (next: 60501)
2016/02/02 18:27:11 [INFO] raft: pipelining replication to peer 192.168.57.10:8300

Restarted Server logs:

2016/02/02 18:27:07 [INFO] raft: Restored from snapshot 1464-58513-1454369444178
2016/02/02 18:27:07 [INFO] serf: EventMemberJoin: proxy001 192.168.57.10
2016/02/02 18:27:07 [INFO] serf: EventMemberJoin: proxy001.dc1 192.168.57.10
2016/02/02 18:27:07 [INFO] raft: Node at 192.168.57.10:8300 [Follower] entering Follower state
2016/02/02 18:27:07 [INFO] serf: Attempting re-join to previously known node: proxy002: 192.168.57.11:8301
2016/02/02 18:27:07 [INFO] consul: adding LAN server proxy001 (Addr: 192.168.57.10:8300) (DC: dc1)
2016/02/02 18:27:07 [WARN] serf: Failed to re-join any previously known node
2016/02/02 18:27:07 [INFO] consul: adding WAN server proxy001.dc1 (Addr: 192.168.57.10:8300) (DC: dc1)
2016/02/02 18:27:07 [INFO] agent: (LAN) joining: [192.168.57.10 192.168.57.11 192.168.57.12]
2016/02/02 18:27:07 [INFO] serf: EventMemberJoin: proxy002 192.168.57.11
2016/02/02 18:27:07 [INFO] serf: EventMemberJoin: gwong.local 192.168.57.1
2016/02/02 18:27:07 [INFO] serf: EventMemberJoin: proxy003 192.168.57.12
2016/02/02 18:27:07 [INFO] serf: Re-joined to previously known node: proxy002: 192.168.57.11:8301
2016/02/02 18:27:07 [INFO] consul: adding LAN server proxy002 (Addr: 192.168.57.11:8300) (DC: dc1)
2016/02/02 18:27:07 [INFO] consul: adding LAN server proxy003 (Addr: 192.168.57.12:8300) (DC: dc1)
2016/02/02 18:27:07 [INFO] agent: (LAN) joined: 3 Err: <nil>
2016/02/02 18:27:07 [ERR] agent: failed to sync remote state: No cluster leader
2016/02/02 18:27:09 [WARN] raft: Heartbeat timeout reached, starting election
2016/02/02 18:27:09 [INFO] raft: Node at 192.168.57.10:8300 [Candidate] entering Candidate state
2016/02/02 18:27:10 [WARN] raft: Election timeout reached, restarting election
2016/02/02 18:27:10 [INFO] raft: Node at 192.168.57.10:8300 [Candidate] entering Candidate state
2016/02/02 18:27:10 [INFO] agent.rpc: Accepted client: 127.0.0.1:51017
2016/02/02 18:27:11 [INFO] raft: Node at 192.168.57.10:8300 [Follower] entering Follower state
2016/02/02 18:27:11 [WARN] raft: Failed to get previous log: 60502 log not found (last: 60500)
2016/02/02 18:27:11 [INFO] consul: New leader elected: proxy002
2016/02/02 18:27:11 [INFO] agent: Synced service 'consul'

From reading the docs, I was under the impression that with a cluster of 3, a single failure and subsequent rejoin of a non-leader node shouldn't cause a leader election. Is that not the case?

@slackpad
Contributor

Hi @glenwong - this should definitely not cause a leader election. This log line from your leader, emitted when the other node joins, is very strange:

2016/02/02 18:27:09 [ERR] raft: peer 192.168.57.10:8300 has newer term, stopping replication

Is it possible that node is being started in bootstrap mode, maybe leading to a split-brain situation?

@glenwong
Author

Hi @slackpad. Running consul info gives back this for two of the machines:

consul:
    bootstrap = false
    known_datacenters = 1
    leader = false
    server = true

And this for the third:

consul:
    bootstrap = false
    known_datacenters = 1
    leader = true
    server = true

And the config.json I'm using for the servers is:

{
  "bootstrap_expect": 3,
  "server": true,
  "data_dir": "/var/consul",
  "log_level": "INFO",
  "enable_syslog": true,
  "skip_leave_on_interrupt": true,
  "retry_join": ["proxy001", "proxy002", "proxy003"]
}

I've tried taking out the "bootstrap_expect" option as well as "skip_leave_on_interrupt", but I still continue to repro the issue.

@glenwong
Author

Also, to add more info: in order to repro this I have to leave the killed process down until I see this in the other Consul servers' logs:

2016/02/13 17:45:15 [INFO] serf: EventMemberFailed: proxy001 192.168.57.10
2016/02/13 17:45:15 [INFO] consul: removing LAN server proxy001 (Addr: 192.168.57.10:8300) (DC: dc1)
2016/02/13 17:45:16 [INFO] memberlist: Suspect proxy001 has failed, no acks received

If I kill the process for a shorter time and bring it up, I don't observe the re-election.

@slackpad
Contributor

Hmm - your configuration looks good. Just one sanity check - is it possible there's any sharing of state through /var/consul being mapped to the same place in the Vagrant boxes? (I want to make sure there's no possibility of corruption along those lines.) If not, I'll need to dig in and try to see why it has a newer term.

@slackpad
Contributor

Never mind on that last comment - I can reproduce this. Taking a look.

@slackpad added the type/bug label on Feb 17, 2016
@slackpad
Contributor

I think I know what's going on with this one. When you fail a server for several seconds, you run up the heartbeat/AppendEntries failure counter down in the Raft layer. Those failures cause subsequent replication attempts to back off, up to a limit of about 50 seconds:

https://github.com/hashicorp/raft/blob/master/replication.go#L311

This ends up isolating the newly-added server from heartbeat messages, so it doesn't hear from the leader and then starts an election, which is the right thing to do. It looks like the Raft implementation handles all this ok and settles back into a stable state, but it would be good to see if we can make this better. I think this isn't a big problem in practice because typically servers restart in under a second, or they die and are replaced later.
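
To make the backoff concrete, here is a minimal, illustrative sketch of a capped exponential backoff like the one linked above. This is not the actual hashicorp/raft code, and the base delay and cap are hypothetical values chosen for the example; the point is that each consecutive replication failure roughly doubles the wait until it hits the ceiling, so a follower that accumulated failures while it was down can be left waiting long enough to hit its own heartbeat timeout and start an election.

package main

import (
	"fmt"
	"time"
)

// backoffAfter returns the wait before the next replication attempt after
// `failures` consecutive failures: exponential growth capped at max.
// Illustrative only; not the hashicorp/raft implementation.
func backoffAfter(failures uint, base, max time.Duration) time.Duration {
	d := base
	for i := uint(1); i < failures; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	return d
}

func main() {
	base := 10 * time.Millisecond // hypothetical per-failure base delay
	max := 50 * time.Second       // roughly the ~50 second ceiling mentioned above

	// A server that is down for several seconds racks up failures, so once it
	// rejoins, the leader's next attempt to reach it may be delayed by many seconds.
	for _, failures := range []uint{1, 5, 10, 15} {
		fmt.Printf("failures=%2d -> wait %v\n", failures, backoffAfter(failures, base, max))
	}
}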

@glenwong
Author

Thanks for the explanation. My concern is that while Raft is settling down there is no cluster leader, so for Consul clients the service is basically out for a second or two. That isn't a huge deal, but it just doesn't seem like a non-leader dying and rejoining should ever cause any outage while a stable quorum exists. I'm also a bit worried about scenarios where a machine in the cluster is intermittent. For comparison, I tested this scenario with etcd and ZooKeeper using an identical setup and didn't experience any outage.

@slackpad
Contributor

@glenwong agreed - we should be able to fix this and avoid this situation. In practice this doesn't seem to come up very often - even an intermittent machine would have to hit kind of a sweet spot in the backoff timing.

You can set things like https://www.consul.io/docs/agent/options.html#allow_stale to allow other servers to service read requests when a leader isn't available (this option is for DNS, and there are similar controls for the HTTP endpoints). This is helpful for riding out leader elections in general, and it allows reads to scale better across all your servers.

@mrwilby

mrwilby commented Feb 26, 2016

@slackpad You mention that there are similar configuration controls for enabling/permitting stale responses via the HTTP endpoints. Sorry for the stupid question, but which settings are those? I see the DNS-related stale configuration settings, but nothing explicitly about HTTP responses. Thanks

@slackpad
Contributor

Hi @mrwilby - those are controlled per request rather than configured on the agent. Please take a look at https://www.consul.io/docs/agent/http.html, specifically the "Consistency Modes" section. This is the most relaxed:

stale - This mode allows any server to service the read regardless of whether it is the leader. This means reads can be arbitrarily stale; however, results are generally consistent to within 50 milliseconds of the leader. The trade-off is very fast and scalable reads with a higher likelihood of stale values. Since this mode allows reads without a leader, a cluster that is unavailable will still be able to respond to queries.
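
For a concrete example of the stale mode described above: it is requested per call by adding the stale query parameter to a read endpoint. Below is a minimal sketch using Go's standard library against a local agent on the default HTTP port (8500); the catalog endpoint is just one example of a read, the printed headers are the staleness metadata the HTTP API docs describe, and error handling is abbreviated.

package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// The ?stale parameter lets any server answer the read, even when there is
	// no leader, at the cost of potentially slightly out-of-date results.
	resp, err := http.Get("http://127.0.0.1:8500/v1/catalog/nodes?stale")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// X-Consul-LastContact reports how long (in ms) since the answering server
	// last heard from the leader; X-Consul-KnownLeader reports whether it
	// currently knows of a leader at all.
	fmt.Println("last contact (ms):", resp.Header.Get("X-Consul-LastContact"))
	fmt.Println("known leader:     ", resp.Header.Get("X-Consul-KnownLeader"))

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}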

@slackpad
Contributor

We did some additional work in #1782 (comment) to address this.
