Unexpected GetRateLimits result from non-owning peer when using Global behaviour #208
Actually it looks to be related to https://github.com/mailgun/gubernator/blob/master/global.go#L225, which sets hits to zero in all cases on the broadcast, so when the broadcast arrives at the non-owning peer it will never get past the check added by #157 and so will never report over the limit.
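To illustrate, here is a minimal sketch of what that zeroing means for the receiving peer. The type and function names here are simplified stand-ins, not gubernator's actual API; the point is only that whatever the owner's local hit count is, the broadcast always carries zero:

```go
package main

import "fmt"

// RateLimitReq is a simplified stand-in for gubernator's request type
// (hypothetical name, for illustration only).
type RateLimitReq struct {
	Name string
	Hits int64
}

// broadcastToPeers mimics the behaviour at global.go#L225: the owner pushes
// its aggregated state out to peers, but every request it sends carries
// Hits = 0 regardless of the local count.
func broadcastToPeers(updates []RateLimitReq) []RateLimitReq {
	out := make([]RateLimitReq, len(updates))
	for i, u := range updates {
		u.Hits = 0 // hits are zeroed in all cases on the broadcast
		out[i] = u
	}
	return out
}

func main() {
	local := []RateLimitReq{{Name: "requests_per_sec", Hits: 30}}
	sent := broadcastToPeers(local)
	fmt.Println(sent[0].Hits) // the non-owning peer only ever sees 0
}
```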
@philipgough's findings seem pretty odd. Imagine this scenario:
I tried to understand why that happens. Both leaky bucket and token bucket algorithms show this problem, potentially for different reasons... I don't know.
@Baliedge would you mind taking a look at this if you can spare some time? Not sure if it's an issue with understanding the correct usage or a valid problem! TY
I stumbled upon this issue yesterday because I saw in our server logs that, for any given request, 429s were always returned by one specific host in our cluster (the owner host for that request). I spent a couple of hours looking into this and I believe I know what's going on. I wrote a test (in a private repo, sorry) that does this:
I think the reason why the assertion fails is this early return (added by #157):

```go
rl := &RateLimitResp{
	Status: Status_UNDER_LIMIT,
	// ...
}
// ...
if r.Hits == 0 {
	return rl, nil
}
```

When the owner of a key is done aggregating all the hits it got and makes a final broadcast, that broadcast reaches the non-owning peers with `Hits == 0`, so this early return fires and the peers always answer `UNDER_LIMIT`. Either we have to fix this in the algorithms themselves (risky, this code doesn't have unit tests!), or we need to change the code of the broadcast.
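Under that reading, the interaction can be sketched like this. This is a simplified model with hypothetical names, not the real code path in `algorithms.go`; it only demonstrates how an unconditional zero-hit early return masks an exhausted bucket:

```go
package main

import "fmt"

type Status int

const (
	StatusUnderLimit Status = iota
	StatusOverLimit
)

// checkRateLimit is a simplified stand-in for the token-bucket path after
// #157: a request with zero hits returns early with UNDER_LIMIT before any
// over-limit bookkeeping runs.
func checkRateLimit(remaining, hits int64) Status {
	if hits == 0 {
		return StatusUnderLimit // early return added by #157
	}
	if remaining-hits < 0 {
		return StatusOverLimit
	}
	return StatusUnderLimit
}

func main() {
	// The owner has exhausted the bucket (remaining == 0) and broadcasts
	// that state; the broadcast arrives at a non-owning peer with hits == 0,
	// so the peer reports UNDER_LIMIT even though the bucket is empty.
	fmt.Println(checkRateLimit(0, 0) == StatusUnderLimit)
	// A request carrying a real hit against the same empty bucket is rejected.
	fmt.Println(checkRateLimit(0, 1) == StatusOverLimit)
}
```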
I might have a fix: #216
PR #219 was merged recently. Did this fix the issue reported here?
@Baliedge: @philipgough kindly left a PR with a reproduction of the problem. So it would be cool to incorporate the test into the codebase (if it's not already done) and confirm whether the issue is fixed through the test.
@douglascamata I'm not finding any PRs to your name. Could you provide a link? |
@Baliedge: as I said, it is @philipgough's PR... mentioned in his first comment in this issue. Here it is: #207 |
@douglascamata @philipgough: Would you both please |
@Baliedge - sorry I was on vacation for a bit so missed this ping. |
I am trying to understand if we have hit a bug/regression or if this behaviour is somehow expected and I am missing something.
I have a reverse proxy that reaches out to a Gubernator cluster, all running on Kubernetes. Requests are load balanced over gRPC on the client side, in round-robin fashion, to each replica in the Gubernator cluster. I have observed that when a request lands on a replica that does not own a rate limit but fetches the result from cache, I always get a status of under limit in the response from that replica. Prometheus metrics show that broadcasts, peering, etc. are working as I would expect.
An example request might look like
Running, for example, 30 test requests against two different scenarios (the first with a single replica, so we can guarantee the owner; the second with two replicas, which should split requests 50/50 in a synchronised fashion) produces the following results:
I believe I have narrowed down the change of behaviour to #157 although I can't exactly explain why.
But running the above scenarios once again on the prior release returns the following, which is what I would expect as correct:
I've added a test case (which expectedly fails) to reproduce this exact scenario in #207 and reverting #157 on top of that PR allows the test to pass.
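The mismatch between the two scenarios can be sketched with a toy simulation. This is not gubernator code; the limit, request count, and ownership rule are illustrative. It only models the observed symptom: hits accumulate globally, but non-owning replicas always answer under limit, so with two replicas roughly half of the rejections disappear:

```go
package main

import "fmt"

// simulate models 30 requests against a limit of 10, round-robined across
// replicas. Hits are aggregated globally, but only the owning replica
// (replica 0 here) enforces the limit; non-owners always answer UNDER_LIMIT,
// matching the behaviour observed in the two-replica scenario.
func simulate(replicas int) (overLimit int) {
	const limit = 10
	total := 0
	for i := 0; i < 30; i++ {
		total++ // every request counts against the global limit
		if i%replicas == 0 && total > limit {
			overLimit++ // only the owner ever rejects
		}
	}
	return overLimit
}

func main() {
	fmt.Println(simulate(1)) // single replica: 20 of 30 requests rejected
	fmt.Println(simulate(2)) // two replicas: only 10 of 30 rejected
}
```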
Hits appears to be 0 for non-owning peers in algorithms.go when I step through with the debugger.