etcd leader to follower heartbeat delay #16362
Replies: 5 comments 2 replies
-
It looks like a network issue between the leader and the follower etcd node. Is it a transient or a persistent issue? If you have set up a dashboard based on etcd/server/etcdserver/api/rafthttp/metrics.go (lines 154 to 158 in 4a5e9d1), please check it: the metric value is published every 30s and is a smoothed RTT measured by a prober that runs every 4s.
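For reference, the metric defined around those lines is the peer round-trip-time histogram. A rough sketch of its definition (paraphrased for illustration, not copied verbatim from the pinned commit; the exact bucket layout lives in the source file):

```go
package rafthttp

import "github.com/prometheus/client_golang/prometheus"

// Histogram of the round-trip time to each peer, labeled by the peer being
// probed ("To"). Bucket values here are illustrative.
var rttSec = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "etcd",
		Subsystem: "network",
		Name:      "peer_round_trip_time_seconds",
		Help:      "Round-Trip-Time histogram between peers",
		Buckets:   prometheus.ExponentialBuckets(0.0001, 2, 14),
	},
	[]string{"To"},
)
```

On a dashboard this surfaces as etcd_network_peer_round_trip_time_seconds, which is the first thing to check when a leader-to-follower delay is suspected.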
raft does use batching and pipelining to improve throughput, but that typically does not introduce delays on the order of seconds.
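As an illustration of the general sender-side batching pattern (a simplified sketch, not etcd's actual rafthttp code): a writer goroutine blocks for one message, drains anything else already queued, and sends the whole batch in a single flush.

```go
package main

import "fmt"

type message struct{ index uint64 }

// sendLoop reads messages from msgc and flushes them in batches: one blocking
// receive, then a non-blocking drain of whatever else has already queued up.
func sendLoop(msgc chan message, flush func([]message)) {
	for m := range msgc {
		batch := []message{m}
	drain:
		for {
			select {
			case next, ok := <-msgc:
				if !ok {
					break drain
				}
				batch = append(batch, next)
			default:
				break drain
			}
		}
		flush(batch) // one write carries the whole batch
	}
}

func main() {
	msgc := make(chan message, 8)
	for i := uint64(1); i <= 5; i++ {
		msgc <- message{index: i}
	}
	close(msgc)
	sendLoop(msgc, func(b []message) { fmt.Printf("flushed %d message(s)\n", len(b)) })
}
```

Batching like this amortizes write overhead, but on its own it should add at most milliseconds, not the multi-second gaps reported here.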
-
@chaochn47 - On longevity the issue is seen very frequently, which leads to apiserver liveness-probe failures and cluster instability. You mentioned that raft uses batching and pipelining to improve throughput - can you point me to the code where this happens?
-
batching: pipeline:
-
Could you please provide minimal steps to reproduce the issue without using longevity? If not, we can at least conclude that the issue is highly relevant to the longevity setup rather than an etcd issue.
-
Thank you for your question. This support issue will be moved to our Discussion Forums.
-
Bug report criteria
What happened?
We have been observing instability of etcd on longevity clusters (configuration provided below). We raised an issue on etcd and, per the suggestions, moved to v3.5.6; however, the issue is still observed and no improvement is seen.
Assuming it was a disk I/O delay issue, we moved etcd's /data/dir to tmpfs. Disk-related metrics such as WAL sync duration dropped drastically; however, the delay in etcd queries is still seen (as shown below):
{"level":"info","ts":"2023-07-28T04:51:51.896Z","caller":"traceutil/trace.go:171","msg":"trace[704844056] linearizableReadLoop","detail":"{readStateIndex:49273142; appliedIndex:49273142; }","duration":"4.307763075s","start":"2023-07-28T04:51:47.588Z","end":"2023-07-28T04:51:51.895Z","steps":["trace[704844056] 'read index received' (duration: 4.307752129s)","trace[704844056] 'applied index is now lower than readState.Index' (duration: 7.524µs)"],"step_count":2}
As per the above trace log, it took about 4s to receive the read index. We observed that the delay happened between the leader and the follower when sending a heartbeat.
Following are the heartbeats sent from the leader, logged in stream.go:
{MsgHeartbeat 1516085617212468964 14130861031026517270 112 0 0 [{0 0 EntryNormal [0 0 0 0 20 109 125 163]}] 49273142 {[] {{[] [] [] [] false} 0 0}} false 0 [229 22 137 154 210 120 206 104] 0} 2023-07-28 04:51:47.58917773 +0000 UTC m=+276.202110931
{MsgHeartbeat 1516085617212468964 14130861031026517270 112 0 0 [{0 0 EntryNormal [0 0 0 0 35 135 217 147]}] 49273142 {[] {{[] [] [] [] false} 0 0}} false 0 [229 22 137 154 210 120 206 104] 0} 2023-07-28 04:51:48.075661076 +0000 UTC m=+276.688594264
{MsgHeartbeat 1516085617212468964 14130861031026517270 112 0 0 [{0 0 EntryNormal [0 0 0 0 22 87 43 45]}] 49273142 {[] {{[] [] [] [] false} 0 0}} false 0 [229 22 137 154 210 120 206 104] 0} 2023-07-28 04:51:48.574961786 +0000 UTC m=+277.187895003
{MsgHeartbeat 1516085617212468964 14130861031026517270 112 0 0 [{0 0 EntryNormal [0 0 0 0 59 99 37 90]}] 49273142 {[] {{[] [] [] [] false} 0 0}} false 0 [229 22 137 154 210 120 206 104] 0} 2023-07-28 04:51:49.075885782 +0000 UTC m=+277.688818970
{MsgHeartbeat 1516085617212468964 14130861031026517270 112 0 0 [{0 0 EntryNormal [0 0 0 0 4 81 55 220]}] 49273142 {[] {{[] [] [] [] false} 0 0}} false 0 [229 22 137 154 210 120 206 104] 0} 2023-07-28 04:51:49.575413385 +0000 UTC m=+278.188346602
{MsgHeartbeat 1516085617212468964 14130861031026517270 112 0 0 [{0 0 EntryNormal [0 0 0 0 18 218 225 9]}] 49273142 {[] {{[] [] [] [] false} 0 0}} false 0 [229 22 137 154 210 120 206 104] 0} 2023-07-28 04:51:50.075246414 +0000 UTC m=+278.688179606
{MsgHeartbeat 1516085617212468964 14130861031026517270 112 0 0 [{0 0 EntryNormal [0 0 0 0 38 224 107 187]}] 49273142 {[] {{[] [] [] [] false} 0 0}} false 0 [229 22 137 154 210 120 206 104] 0} 2023-07-28 04:51:50.575974035 +0000 UTC m=+279.188907390
{MsgHeartbeat 1516085617212468964 14130861031026517270 112 0 0 [{0 0 EntryNormal [0 0 0 0 21 197 69 80]}] 49273142 {[] {{[] [] [] [] false} 0 0}} false 0 [229 22 137 154 210 120 206 104] 0} 2023-07-28 04:51:51.075443734 +0000 UTC m=+279.688376922
{MsgHeartbeat 1516085617212468964 14130861031026517270 112 0 0 [{0 0 EntryNormal [0 0 0 0 26 24 166 188]}] 49273142 {[] {{[] [] [] [] false} 0 0}} false 0 [229 22 137 154 210 120 206 104] 0} 2023-07-28 04:51:51.575599189 +0000 UTC m=+280.188532443
Following are the heartbeats received at the follower, logged in stream.go:
{MsgHeartbeat 1516085617212468964 14130861031026517270 112 0 0 [{0 0 EntryNormal [0 0 0 0 20 109 125 163]}] 49273142 {[] {{[] [] [] [] false} 0 0}} false 0 [229 22 137 154 210 120 206 104] 0} 2023-07-28 04:51:51.894079712 +0000 UTC m=+277.045465036
{MsgHeartbeat 1516085617212468964 14130861031026517270 112 0 0 [{0 0 EntryNormal [0 0 0 0 35 135 217 147]}] 49273142 {[] {{[] [] [] [] false} 0 0}} false 0 [229 22 137 154 210 120 206 104] 0} 2023-07-28 04:51:51.894542369 +0000 UTC m=+277.045927717
{MsgHeartbeat 1516085617212468964 14130861031026517270 112 0 0 [{0 0 EntryNormal [0 0 0 0 22 87 43 45]}] 49273142 {[] {{[] [] [] [] false} 0 0}} false 0 [229 22 137 154 210 120 206 104] 0} 2023-07-28 04:51:51.894678212 +0000 UTC m=+277.046063568
{MsgHeartbeat 1516085617212468964 14130861031026517270 112 0 0 [{0 0 EntryNormal [0 0 0 0 59 99 37 90]}] 49273142 {[] {{[] [] [] [] false} 0 0}} false 0 [229 22 137 154 210 120 206 104] 0} 2023-07-28 04:51:51.89520582 +0000 UTC m=+277.046591168
{MsgHeartbeat 1516085617212468964 14130861031026517270 112 0 0 [{0 0 EntryNormal [0 0 0 0 4 81 55 220]}] 49273142 {[] {{[] [] [] [] false} 0 0}} false 0 [229 22 137 154 210 120 206 104] 0} 2023-07-28 04:51:51.895351185 +0000 UTC m=+277.046736554
{MsgHeartbeat 1516085617212468964 14130861031026517270 112 0 0 [{0 0 EntryNormal [0 0 0 0 18 218 225 9]}] 49273142 {[] {{[] [] [] [] false} 0 0}} false 0 [229 22 137 154 210 120 206 104] 0} 2023-07-28 04:51:51.89575468 +0000 UTC m=+277.047140031
{MsgHeartbeat 1516085617212468964 14130861031026517270 112 0 0 [{0 0 EntryNormal [0 0 0 0 38 224 107 187]}] 49273142 {[] {{[] [] [] [] false} 0 0}} false 0 [229 22 137 154 210 120 206 104] 0} 2023-07-28 04:51:51.895936191 +0000 UTC m=+277.047321549
{MsgHeartbeat 1516085617212468964 14130861031026517270 112 0 0 [{0 0 EntryNormal [0 0 0 0 21 197 69 80]}] 49273142 {[] {{[] [] [] [] false} 0 0}} false 0 [229 22 137 154 210 120 206 104] 0} 2023-07-28 04:51:51.8963489 +0000 UTC m=+277.047734258
{MsgHeartbeat 1516085617212468964 14130861031026517270 112 0 0 [{0 0 EntryNormal [0 0 0 0 26 24 166 188]}] 49273142 {[] {{[] [] [] [] false} 0 0}} false 0 [229 22 137 154 210 120 206 104] 0} 2023-07-28 04:51:51.896495372 +0000 UTC m=+277.047880726
Comparing the timestamps, there is a gap of around 4s between the first heartbeat sent from the leader (04:51:47.589) and the same heartbeat received at the follower (04:51:51.894). Also, the leader sends the heartbeats at roughly 500ms intervals, whereas at the follower they all appear to arrive at almost the same timestamp, as if a queue/batch were released at once.
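As a quick sanity check (illustrative only), subtracting the two wall-clock timestamps from the logs above reproduces the roughly 4.3s gap, consistent with the 4.307s 'read index received' step in the trace:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Fractional seconds are parsed even though the layout omits them.
	const layout = "2006-01-02 15:04:05 -0700 MST"
	sent, err := time.Parse(layout, "2023-07-28 04:51:47.58917773 +0000 UTC")
	if err != nil {
		panic(err)
	}
	received, err := time.Parse(layout, "2023-07-28 04:51:51.894079712 +0000 UTC")
	if err != nil {
		panic(err)
	}
	// Prints roughly 4.3s: the delay between the leader sending the first
	// heartbeat and the follower receiving it.
	fmt.Println(received.Sub(sent))
}
```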
Is there any batching in etcd that releases multiple messages at once, or is there any other explanation for this delay?
What did you expect to happen?
The etcd follower should receive heartbeat messages from the leader without noticeable delay.
How can we reproduce it (as minimally and precisely as possible)?
It is reproduced on every longevity setup of
Anything else we need to know?
No response
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
Setup Configuration
Cluster Size - 3
Master Nodes - 3
Member Nodes - 0
Kubernetes version - v1.22.17
H/W of each node
80 CPUs - Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
512 GB RAM
1 Gig Network Adapter - This interface is only used for etcd
4 TB SSD Storage
Etcd Configuration
Version - 3.5.6
Service/Pod - Etcd running as systemd service
/data/dir - mounted on tmpfs
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output
No response