Disconnecting 1 server (not leader) out of 3 causes the remaining 2 servers to lose the leadership #1782
Is it possible this happens because the disconnected node restarts elections many times, each time voting for itself and increasing its term (https://github.com/hashicorp/raft/blob/master/raft.go#L1612), so that it reaches a higher term than the 2 nodes that remain connected?
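For what it's worth, here is a minimal Go sketch of that behavior (simplified types, not the actual hashicorp/raft code): a partitioned node that keeps timing out and re-electing itself bumps its local term on every round, so by the time it reconnects its term can be well ahead of the rest of the cluster.

```go
package main

import "fmt"

// Minimal sketch (not the hashicorp/raft implementation): a partitioned node
// that never hears from a leader keeps timing out and starting new elections,
// bumping its local term each round.
type node struct {
	id       string
	term     uint64
	votedFor string
}

// startElection mirrors the Raft rule: on election timeout, increment the
// current term and vote for yourself.
func (n *node) startElection() {
	n.term++          // term grows on every failed election round
	n.votedFor = n.id // self-vote
}

func main() {
	isolated := &node{id: "stratonode12080", term: 10}
	// While disconnected for 30s, the node may run many election rounds.
	for round := 0; round < 20; round++ {
		isolated.startElection()
	}
	// When it reconnects, its term (30) is far ahead of the cluster's (10).
	fmt.Println("isolated node term after partition:", isolated.term)
}
```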
We ran into the same thing with a big cluster (3 servers, a lot of clients).
@arie-stratoscale any ideas on how the term increments from continuous self re-election are related to the i/o timeouts?
We are also facing this problem.
I tried to build the latest consul (with the latest raft) and test again. It seems that the stuck election was fixed, but re-elections are still triggered when a server that left joins again. Again, I suspect it is because the server keeps running elections, voting for itself and incrementing its Term (while the other servers' Term stays the same).
@arie-stratoscale yes, but it only increments its local Term while trying to self-elect, so that alone should not affect the convergence of the whole cluster. Why it does not see the other servers' leadership is the question... it looks like some race condition: something prevents it from becoming a Follower of someone else, or a good Candidate.
Once the disconnected server reconnects, it sends its new Term to the current leader. The current leader, seeing a Term newer than its own, steps down and a new election starts. Why that election gets stuck in version 0.6.3 I don't know, but compiling the latest consul and raft master seems to fix the stuck election (still, a disconnected server reconnecting causes an election).
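A simplified sketch of the rule being described (illustrative only, not the real hashicorp/raft code): any RPC carrying a higher term than the receiver's own forces the receiver, even the current leader, back to the follower state, which is why the reconnecting node's inflated term triggers a new election.

```go
package main

import "fmt"

// Simplified sketch of the "higher term wins" rule described above
// (not the actual hashicorp/raft code).
type raftState int

const (
	follower raftState = iota
	candidate
	leader
)

type server struct {
	state raftState
	term  uint64
}

// handleRPCTerm applies the rule to an incoming request's term.
func (s *server) handleRPCTerm(incomingTerm uint64) {
	if incomingTerm > s.term {
		s.term = incomingTerm
		s.state = follower // the leader steps down; a new election follows
	}
}

func main() {
	l := &server{state: leader, term: 10}
	// The reconnecting node shows up with the inflated term from the
	// earlier sketch.
	l.handleRPCTerm(30)
	fmt.Println("state:", l.state, "term:", l.term) // 0 (follower), 30
}
```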
It is similar to goraft/raft#161 or hashicorp/raft#69. What I see in my logs: one server was heavily calling AppendEntries (thinking it was the leader), and the second kept trying to elect itself. Both ended up in infinite retries.
We seem to be seeing the same issue with 0.6.3.
raft-net: Failed to flush response: write tcp 10.0.12.250:8300->10.0.18.20:37991: write: broken pipe
The consul cluster lost leader information, and on the leader side there was more error log output.
Could it be that there is an improvement on this with the 0.7.0 release?
Consul 0.7.0 added a new check during bootstrap-expect which helps prevent new servers from starting spurious leader elections (servers will confirm via RPC that none of the peers are part of an existing cluster). We've also done a lot of work on management of peer changes. Closing this one out - please open a new issue if you are still having trouble.
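For illustration, a hypothetical sketch of the bootstrap check idea described above (the types and function names here are made up, not Consul's actual code): a new server only bootstraps itself if none of its known peers reports membership in an existing cluster.

```go
package main

import "fmt"

// Hypothetical sketch of the idea, not Consul's implementation: before
// self-bootstrapping, a new server asks every known peer whether it already
// belongs to a cluster, and only bootstraps if none do.
type peerStatus struct {
	addr      string
	inCluster bool // in a real implementation this would be learned via RPC
}

// shouldBootstrap returns true only when no peer reports an existing cluster.
func shouldBootstrap(peers []peerStatus) bool {
	for _, p := range peers {
		if p.inCluster {
			return false // join the existing cluster instead of electing
		}
	}
	return true
}

func main() {
	peers := []peerStatus{
		{addr: "1.124.192.47:8300", inCluster: true},
		{addr: "1.124.245.220:8300", inCluster: false},
	}
	fmt.Println("bootstrap?", shouldBootstrap(peers)) // false: a cluster exists
}
```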
We have a test that disconnects a network interface on a consul server every 30 seconds for 30 seconds. I believe the expected behavior is for the 2 remaining consul servers to acquire and keep leadership between them, but instead they periodically lose leadership when the third server disconnects.
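A rough Go sketch of what such a failure-injection loop looks like (the interface name and the use of `ip link` are illustrative assumptions, not the exact test):

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

// Rough sketch of the failure-injection loop: take the server's NIC down for
// 30 seconds, bring it back up, wait 30 seconds, repeat. The interface name
// and commands are assumptions, not taken from the original test.
func main() {
	const iface = "eth0" // hypothetical interface name
	for {
		if err := exec.Command("ip", "link", "set", iface, "down").Run(); err != nil {
			log.Fatalf("link down failed: %v", err)
		}
		time.Sleep(30 * time.Second)
		if err := exec.Command("ip", "link", "set", iface, "up").Run(); err != nil {
			log.Fatalf("link up failed: %v", err)
		}
		time.Sleep(30 * time.Second)
	}
}
```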
Logs from the leader node:
2016/03/02 04:21:11 [INFO] consul: cluster leadership lost
2016/03/02 04:21:11 [INFO] raft: pipelining replication to peer 1.124.245.220:8300
2016/03/02 04:21:11 [INFO] raft: aborting pipeline replication to peer 1.124.245.220:8300
2016/03/02 04:21:11 [ERR] http: Request GET /v1/kv/cluster/nodestates?recurse, error: No cluster leader from=127.0.0.1:50835
2016/03/02 04:21:12 [ERR] raft: Failed to heartbeat to 1.124.239.48:8300: dial tcp 1.124.239.48:8300: i/o timeout
2016/03/02 04:21:12 [WARN] raft: Rejecting vote request from 1.124.239.48:8300 since our last index is greater (5064, 5010)
2016/03/02 04:21:12 [ERR] http: Request PUT /v1/kv/cluster/rabbit/keepalive, error: No cluster leader from=127.0.0.1:50539
2016/03/02 04:21:12 [ERR] http: Request GET /v1/kv/cluster/nodestates?recurse, error: No cluster leader from=127.0.0.1:50835
2016/03/02 04:21:12 [WARN] raft: Heartbeat timeout reached, starting election
2016/03/02 04:21:12 [INFO] raft: Node at 1.124.192.47:8300 [Candidate] entering Candidate state
2016/03/02 04:21:12 [INFO] raft: Election won. Tally: 2
2016/03/02 04:21:12 [INFO] raft: Node at 1.124.192.47:8300 [Leader] entering Leader state
2016/03/02 04:21:12 [INFO] consul: cluster leadership acquired
2016/03/02 04:21:12 [INFO] consul: New leader elected: stratonode0
2016/03/02 04:21:12 [INFO] raft: pipelining replication to peer 1.124.245.220:8300
2016/03/02 04:21:12 [WARN] raft: AppendEntries to 1.124.239.48:8300 rejected, sending older logs (next: 5011)
2016/03/02 04:21:12 [INFO] raft: pipelining replication to peer 1.124.239.48:8300
2016/03/02 04:21:23 [INFO] agent: Synced check 'service:influxdb'
2016/03/02 04:21:31 [INFO] serf: EventMemberJoin: stratonode12080 1.124.239.48
2016/03/02 04:21:31 [INFO] consul: adding LAN server stratonode12080 (Addr: 1.124.239.48:8300) (DC: stratocluster-14594902935824492167__1_2_1_0_04b35d92)
2016/03/02 04:21:31 [INFO] consul: member 'stratonode12080' joined, marking health alive
2016/03/02 04:21:43 [WARN] raft: Failed to contact 1.124.239.48:8300 in 500.155252ms
2016/03/02 04:21:43 [WARN] raft: Failed to contact 1.124.239.48:8300 in 978.369848ms
2016/03/02 04:21:44 [WARN] raft: Failed to contact 1.124.239.48:8300 in 1.476780564s
2016/03/02 04:21:46 [INFO] memberlist: Suspect stratonode12080 has failed, no acks received
2016/03/02 04:21:48 [INFO] memberlist: Suspect stratonode12080 has failed, no acks received
2016/03/02 04:21:48 [INFO] serf: EventMemberFailed: stratonode12080 1.124.239.48
2016/03/02 04:21:48 [INFO] consul: removing LAN server stratonode12080 (Addr: 1.124.239.48:8300) (DC: stratocluster-14594902935824492167__1_2_1_0_04b35d92)
2016/03/02 04:21:48 [INFO] consul: member 'stratonode12080' failed, marking health critical
2016/03/02 04:21:52 [ERR] raft: Failed to heartbeat to 1.124.239.48:8300: read tcp 1.124.192.47:53238->1.124.239.48:8300: i/o timeout
2016/03/02 04:21:52 [ERR] raft: Failed to pipeline AppendEntries to 1.124.239.48:8300: write tcp 1.124.192.47:53208->1.124.239.48:8300: use of closed network connection
2016/03/02 04:21:52 [INFO] raft: aborting pipeline replication to peer 1.124.239.48:8300
2016/03/02 04:21:53 [WARN] raft: Failed to contact 1.124.239.48:8300 in 500.120413ms
2016/03/02 04:21:53 [WARN] raft: Failed to contact 1.124.239.48:8300 in 991.282529ms
2016/03/02 04:21:54 [WARN] raft: Failed to contact 1.124.239.48:8300 in 1.483350039s
2016/03/02 04:22:02 [ERR] raft: Failed to AppendEntries to 1.124.239.48:8300: dial tcp 1.124.239.48:8300: i/o timeout
2016/03/02 04:22:02 [ERR] raft: Failed to heartbeat to 1.124.239.48:8300: dial tcp 1.124.239.48:8300: i/o timeout
2016/03/02 04:22:11 [ERR] yamux: keepalive failed: connection write timeout
2016/03/02 04:22:11 [ERR] yamux: Failed to write body: write tcp 1.124.192.47:8300->1.124.239.48:57512: use of closed network connection
2016/03/02 04:22:11 [ERR] consul.rpc: multiplex conn accept failed: keepalive timeout from=1.124.239.48:57512
2016/03/02 04:22:12 [ERR] raft: Failed to AppendEntries to 1.124.239.48:8300: dial tcp 1.124.239.48:8300: i/o timeout
2016/03/02 04:22:13 [ERR] raft: Failed to heartbeat to 1.124.239.48:8300: dial tcp 1.124.239.48:8300: i/o timeout
2016/03/02 04:22:14 [WARN] raft: Rejecting vote request from 1.124.239.48:8300 since we have a leader: 1.124.192.47:8300
2016/03/02 04:22:14 [ERR] raft: peer 1.124.239.48:8300 has newer term, stopping replication
2016/03/02 04:22:14 [INFO] raft: Node at 1.124.192.47:8300 [Follower] entering Follower state
2016/03/02 04:22:14 [INFO] raft: aborting pipeline replication to peer 1.124.245.220:8300
2016/03/02 04:22:14 [INFO] consul: cluster leadership lost
Logs from the failing node:
2016/03/02 04:21:11 [WARN] dns: Query results too stale, re-requesting
2016/03/02 04:21:11 [ERR] dns: rpc error: No cluster leader
2016/03/02 04:21:11 [WARN] dns: Query results too stale, re-requesting
2016/03/02 04:21:11 [ERR] dns: rpc error: No cluster leader
2016/03/02 04:21:11 [WARN] dns: Query results too stale, re-requesting
2016/03/02 04:21:11 [ERR] dns: rpc error: No cluster leader
2016/03/02 04:21:11 [WARN] dns: Query results too stale, re-requesting
2016/03/02 04:21:11 [ERR] dns: rpc error: No cluster leader
2016/03/02 04:21:11 [WARN] raft: Election timeout reached, restarting election
2016/03/02 04:21:11 [INFO] raft: Node at 1.124.239.48:8300 [Candidate] entering Candidate state
2016/03/02 04:21:11 [ERR] http: Request GET /v1/kv/consulServers, error: No cluster leader from=127.0.0.1:58941
2016/03/02 04:21:33 [ERR] raft-net: Failed to flush response: write tcp 1.124.239.48:8300->1.124.192.47:64849: write: connection reset by peer
2016/03/02 04:21:34 [INFO] agent: Synced check 'service:nrad'
2016/03/02 04:21:44 [WARN] raft: Heartbeat timeout reached, starting election
2016/03/02 04:21:44 [INFO] raft: Node at 1.124.239.48:8300 [Candidate] entering Candidate state
2016/03/02 04:21:44 [ERR] http: Request GET /v1/kv/cluster/nodestates?recurse, error: No cluster leader from=127.0.0.1:60377
2016/03/02 04:21:44 [ERR] http: Request GET /v1/kv/cluster/nodestates?recurse, error: No cluster leader from=127.0.0.1:60378
2016/03/02 04:21:44 [INFO] memberlist: Suspect stratonode11300 has failed, no acks received
2016/03/02 04:21:45 [WARN] raft: Election timeout reached, restarting election
2016/03/02 04:22:13 [WARN] raft: Election timeout reached, restarting election
2016/03/02 04:22:13 [INFO] raft: Node at 1.124.239.48:8300 [Candidate] entering Candidate state
2016/03/02 04:22:14 [ERR] raft: Failed to make RequestVote RPC to 1.124.192.47:8300: dial tcp 1.124.192.47:8300: i/o timeout
2016/03/02 04:22:14 [ERR] raft: Failed to make RequestVote RPC to 1.124.245.220:8300: dial tcp 1.124.245.220:8300: i/o timeout
2016/03/02 04:22:14 [ERR] raft: Failed to make RequestVote RPC to 1.124.192.47:8300: dial tcp 1.124.192.47:8300: i/o timeout
2016/03/02 04:22:14 [ERR] raft: Failed to make RequestVote RPC to 1.124.245.220:8300: dial tcp 1.124.245.220:8300: i/o timeout
2016/03/02 04:22:15 [INFO] raft: Node at 1.124.239.48:8300 [Follower] entering Follower state
2016/03/02 04:22:15 [WARN] raft: Failed to get previous log: 5155 log not found (last: 5105)
Logs from the third server:
2016/03/02 04:21:13 [INFO] consul: New leader elected: stratonode0
2016/03/02 04:21:31 [INFO] serf: EventMemberJoin: stratonode12080 1.124.239.48
2016/03/02 04:21:31 [INFO] consul: adding LAN server stratonode12080 (Addr: 1.124.239.48:8300) (DC: stratocluster-14594902935824492167__1_2_1_0_04b35d92)
2016/03/02 04:21:46 [INFO] memberlist: Suspect stratonode12080 has failed, no acks received
2016/03/02 04:21:48 [INFO] memberlist: Suspect stratonode12080 has failed, no acks received
2016/03/02 04:21:48 [INFO] memberlist: Marking stratonode12080 as failed, suspect timeout reached
2016/03/02 04:21:48 [INFO] serf: EventMemberFailed: stratonode12080 1.124.239.48
2016/03/02 04:21:48 [INFO] consul: removing LAN server stratonode12080 (Addr: 1.124.239.48:8300) (DC: stratocluster-14594902935824492167__1_2_1_0_04b35d92)
2016/03/02 04:22:14 [ERR] http: Request GET /v1/kv/cluster/rabbit/keepalive, error: rpc error: No cluster leader from=127.0.0.1:64410
2016/03/02 04:22:14 [ERR] http: Request GET /v1/kv/cluster/nodestates?recurse, error: rpc error: No cluster leader from=127.0.0.1:64484
2016/03/02 04:22:14 [ERR] http: Request GET /v1/kv/cluster/nodestates?recurse, error: rpc error: No cluster leader from=127.0.0.1:64484
2016/03/02 04:22:15 [ERR] agent: coordinate update error: rpc error: No cluster leader
2016/03/02 04:22:15 [ERR] http: Request GET /v1/kv/cluster/nodes/givenstates/stratonode13788.node.strato, error: rpc error: No cluster leader from=127.0.0.1:64412
2016/03/02 04:22:15 [ERR] http: Request GET /v1/kv/cluster/rabbit/keepalive, error: rpc error: No cluster leader from=127.0.0.1:64410
2016/03/02 04:22:15 [ERR] http: Request GET /v1/kv/cluster/nodestates?recurse, error: rpc error: No cluster leader from=127.0.0.1:64484
2016/03/02 04:22:15 [ERR] http: Request GET /v1/kv/cluster/nodestates?recurse, error: rpc error: No cluster leader from=127.0.0.1:64484
2016/03/02 04:22:15 [ERR] http: Request GET /v1/kv/cluster/nodestates?recurse, error: rpc error: No cluster leader from=127.0.0.1:64485
2016/03/02 04:22:15 [INFO] consul: New leader elected: stratonode0