[BUG] Failed to create the reconcile looper: failed to list all OverLappingIPs: client rate limiter Wait returned an error: context deadline exceeded #389
Comments
@dougbtv kindly check this issue.
Got a response from @dougbtv; the suggestion is to disable overlapping IP addresses. We are checking whether this issue can be fixed by disabling it.
@dougbtv @andreaskaris
I would like to explain the issue in steps for a better overview:
Link to the code where we get the error: code
If it's related to a timeout, we can fix it in two ways:
/cc @manuelbuil
Hi all, I am coming from k8s sig-scalability to help you with fixing this issue. This particular PR: #438 won't really help; it will probably even make things worse, as you will be issuing more calls. So how it works:
The timeout that you are now setting in the context is not just the request timeout, but the timeout for the sum of "waiting in a queue" and the "request timeout". On a high-level design note, I would recommend not using List at all and using an informer instead; see the sketch below. Hope this helps 🤞
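To illustrate the informer suggestion, here is a minimal sketch (not the whereabouts code; it assumes plain client-go and a kubeconfig in the default location). The informer does one initial List, then keeps a local cache updated via watch, so periodic reconciles read from memory instead of issuing large List requests that queue behind the client-side rate limiter.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: kubeconfig at the default location; the real reconciler
	// would use in-cluster config instead.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Shared informer factory with a 30s resync period (illustrative value).
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	podLister := factory.Core().V1().Pods().Lister()

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)
	// Block until the initial List+Watch has populated the cache.
	factory.WaitForCacheSync(stopCh)

	// Reads come from the local cache: no API request, no rate limiter wait.
	pods, err := podLister.List(labels.Everything())
	if err != nil {
		panic(err)
	}
	fmt.Printf("cached pods: %d\n", len(pods))
}
```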
Hi @smoshiur1237, I'll work on this.
@mlguerrero12 Thanks, I proposed a fix by increasing the RequestTimeout. Would you please take a look? It should fix the issue.
It might fix the issue for 500 pods, but as I mentioned in your PR, we have a customer reporting this issue with 100 nodes and 30k pods. I'll explore other options and let you know.
@smoshiur1237, I don't believe this issue can be solved by increasing the request timeout. Also, the reconciler job doesn't make a batch of requests before listing the cluster-wide reservations. What it does is list the pods and IP pools. The root of the issue is that the reconciler job is expected to finish in 30 seconds. A context is created with this timeout and is used as the parent for the contexts of all requests. So, in large clusters, this parent context expires by the time it has to list the cluster-wide reservations. If you check the logic for listing pods, it uses the same time that was set for the parent (30s). What I'm going to do is remove this parent context and use 30 seconds for all listing operations (supported by the SLO @marseel mentioned above). All other types of requests will continue using the …
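A rough sketch of the change described above (the function and constant names are illustrative, not the actual whereabouts code): each expensive listing operation derives its own 30s context from the background context, instead of inheriting whatever is left of a shared 30s parent.

```go
package reconciler

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listTimeout is an assumed value, matching the 30s mentioned in the comment above.
const listTimeout = 30 * time.Second

// listPodsWithOwnTimeout gives the List call a fresh 30s deadline instead of
// sharing a parent context that earlier listings may have already consumed.
func listPodsWithOwnTimeout(client kubernetes.Interface) error {
	ctx, cancel := context.WithTimeout(context.Background(), listTimeout)
	defer cancel()

	// Each List gets the full 30 seconds, regardless of how long the
	// previous listing operations took.
	_, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
	return err
}
```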
@smoshiur1237, @adilGhaffarDev, could you please share complete logs of the reconciler when this issue happens?
@mlguerrero12 here are the original logs we got when this issue first appeared, taken from a running whereabouts pod; I have trimmed similar repeated instances to keep it readable:
Fixes k8snetworkplumbingwg#389 Signed-off-by: Marcelo Guerrero <[email protected]>
Parent timeout context of 30s was removed. All listing operations used by the cronjob reconciler have a 30s timeout. Fixes k8snetworkplumbingwg#389 Signed-off-by: Marcelo Guerrero <[email protected]>
Opened a new issue to track the pod reference problem:
Describe the bug
A reconciler failure was reported when we tried to scale pods in/out.
The reconciler job is scheduled every 5 minutes but fails to execute with the following errors:
[error] failed to list all OverLappingIPs: client rate limiter Wait returned an error: context deadline exceeded.
[error] failed to create the reconcile looper: failed to list all OverLappingIPs: client rate limiter Wait returned an error: context deadline exceeded
[verbose] reconciler failure: failed to list all OverLappingIPs: client rate limiter Wait returned an error: context deadline exceeded.
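For context, a small standalone demo of the failure mode (an assumption, using client-go's flowcontrol package directly rather than the whereabouts code path): when the context handed to a request has already expired, the client-side rate limiter's Wait returns "context deadline exceeded" before the request is ever sent, which surfaces as the "client rate limiter Wait returned an error" message above.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/client-go/util/flowcontrol"
)

func main() {
	// Illustrative client-side limits: 5 requests/second with a burst of 10.
	limiter := flowcontrol.NewTokenBucketRateLimiter(5, 10)

	// Simulate a 30s parent context that earlier List calls have already used up.
	ctx, cancel := context.WithTimeout(context.Background(), time.Nanosecond)
	defer cancel()
	time.Sleep(time.Millisecond) // make sure the deadline has passed

	if err := limiter.Wait(ctx); err != nil {
		// Prints "context deadline exceeded"; client-go wraps this error
		// before the request is sent, as seen in the logs above.
		fmt.Println("Wait failed:", err)
	}
}
```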
Current Behavior
We scaled the deployment to 500 pod replicas, and the same number of IPAM pod references were created. But when we scale the replicas back in to 1, the pods are removed successfully while 130 pod references are left behind. I did 2 rounds of scale in/out, followed by a release uninstall and redeployment; the same issue occurs every time: 130 pod references are left undeleted after scaling in.
To Reproduce
Steps to reproduce the behavior:
Environment:
- Kubernetes version (use `kubectl version`): N/A
- Kernel (e.g. `uname -a`): N/A
Additional info / context
Add any other information / context about the problem here.