[Error] Context deadline exceeded on etcd v3.5.6 #15229
Unanswered
ajayudayagiri-hpe asked this question in Q&A
Replies: 2 comments · 8 replies
-
Pretty old thread, but here's what you can try:
Are you running anything special against your API server? I've seen kube-burner cause issues frequently (especially in namespace-deletion scenarios). Anything that attempts to list all pods, such as a CNI plugin, can also be a culprit.
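For what it's worth, here is a rough client-go sketch of the difference between an unpaginated "list all pods" call and a paginated one. The in-cluster config, the 500-item page size, and the surrounding setup are illustrative assumptions, not something taken from this cluster:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes this runs inside the cluster; adjust config loading as needed.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Unpaginated: one request that asks the apiserver (and ultimately etcd)
	// for every pod in the cluster at once.
	all, err := clientset.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("pods (single list):", len(all.Items))

	// Paginated: smaller chunks via Limit/Continue keep each individual
	// read against etcd bounded.
	opts := metav1.ListOptions{Limit: 500}
	total := 0
	for {
		page, err := clientset.CoreV1().Pods("").List(ctx, opts)
		if err != nil {
			panic(err)
		}
		total += len(page.Items)
		if page.Continue == "" {
			break
		}
		opts.Continue = page.Continue
	}
	fmt.Println("pods (paginated):", total)
}
```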
-
Hello, I faced the same issue.
@jberkus @tjungblu @ajayudayagiri-hpe Is there any update on this?
-
On a 5-node bare-metal cluster we are continuously seeing "context deadline exceeded" in the etcd logs, which results in liveness-probe failures for kube-apiserver. The kube-apiserver pod also restarts frequently because etcd does not respond in time.
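For background on where the error string comes from: every request the apiserver sends to etcd carries a Go context with a deadline, and a response that arrives too late surfaces as "context deadline exceeded". A minimal clientv3 sketch, assuming a local endpoint and illustrative timeouts that are not taken from this setup:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Endpoint and timeouts below are illustrative placeholders.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Each request gets a context with a deadline; if etcd cannot answer
	// within it (slow disk, overloaded leader, large range read), the
	// caller sees "context deadline exceeded".
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	if _, err := cli.Get(ctx, "health-check-key"); err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			fmt.Println("etcd did not respond within 2s:", err)
			return
		}
		panic(err)
	}
	fmt.Println("etcd responded in time")
}
```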
Below are the iterations we have performed to test this scenario.
Iteration 1 - Etcd as Pod
Iteration 2 - Etcd as System Service
The remaining setup configuration is similar in both iterations.
Setup Configuration
Cluster Size - 5
Master Nodes - 3
Member Nodes - 2
Kubernetes version - v1.22.17
Etcd version - v3.5.6
H/W of each node:
Log from etcd service
We have also used fio to check disk I/O performance; the results are similar on all three nodes and are provided below. The parameters used were --size=100m and --bs=2300.
As per the fio results above, the 99th percentile of fsync latency is around 265 µs (0.26 ms), far below the 10 ms generally required for good etcd performance. However, the context deadline exceeded issue is still seen.
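fio measures the disk in isolation; etcd also reports its own in-service fsync and backend-commit latency histograms on its Prometheus metrics endpoint. A small sketch that scrapes them, assuming the metrics are reachable on a local port (adjust to the cluster's --listen-metrics-urls or client URL):

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Assumes etcd exposes Prometheus metrics on this local endpoint.
	resp, err := http.Get("http://127.0.0.1:2381/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Print the WAL fsync and backend commit latency histograms, which
	// reflect what etcd itself observed while serving traffic, including
	// periods when a standalone fio run looks healthy.
	scanner := bufio.NewScanner(resp.Body)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "etcd_disk_wal_fsync_duration_seconds") ||
			strings.HasPrefix(line, "etcd_disk_backend_commit_duration_seconds") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}
```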
Temporary fixes tried
After a node reboot the context deadline exceeded issue goes away; however, it returns after a few days and then keeps occurring. It appears to be a longevity issue.
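Since the problem only shows up after days of uptime, one hypothetical way to narrow down when things start degrading is to periodically record per-endpoint status (DB size, leader, round-trip time of the status call) and compare the trend against when the errors begin. A rough clientv3 Maintenance sketch; the endpoints and polling interval are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Endpoints and polling interval are placeholders for this sketch.
	endpoints := []string{"10.0.0.1:2379", "10.0.0.2:2379", "10.0.0.3:2379"}
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	for {
		for _, ep := range endpoints {
			ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
			start := time.Now()
			st, err := cli.Status(ctx, ep)
			cancel()
			if err != nil {
				fmt.Printf("%s %s status error: %v\n", time.Now().Format(time.RFC3339), ep, err)
				continue
			}
			// Log DB size, current leader, raft term, and how long the
			// status call itself took on this endpoint.
			fmt.Printf("%s %s dbSize=%dMiB leader=%x raftTerm=%d rtt=%s\n",
				time.Now().Format(time.RFC3339), ep,
				st.DbSize/(1024*1024), st.Leader, st.RaftTerm, time.Since(start))
		}
		time.Sleep(10 * time.Minute)
	}
}
```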