This repository has been archived by the owner on Apr 19, 2024. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 99
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Every call to `GetRateLimits` would reset the `ResetTime` and not the `Remaining` counter. This would cause counters to eventually deplete and never fully reset.
Request time is resolved at first call to `getLocalRateLimit()`, then is propagated across peer-to-peer for global behavior.
QueueUpdate() allowed for sending request/response when local ratelimits are updated. However, the order they get called to QueueUpdate() is not guaranteed to be chronological. This causes stale updates to propagate, causing lost hits. Instead, QueueUpdate() will only pass the request. The current ratelimit state will be retrieved immediately before propagation. Rigorous functional tests added around global behavior.
- Simplify passing of request time across layers. - Better handling of metrics in tests. - Better detection of global broadcasts, global updates, and idle. - Drop redundant metric `guberator_global_broadcast_counter`. - Fix metric `gubernator_global_queue_length` for global broadcast. - Add metric `gubernator_global_send_queue_length` for global send.
…on-owner" Better tracks the code path. Can exclude non-owner activity. Can get accurate total rate limit checks with query like `rate(gubernator_getratelimit_counter{calltype="local"}[1m])`
…rement a counter.
Non-owners shouldn't be persisting rate limit state.
thrawn01
approved these changes
Mar 13, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, We should do something about the global code in v3. With all the counters used for functional tests I feel like it would be simpler to just add a new API call which gets the current status of broadcasts and updates. Thoughts?
Agreed on both. |
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Every call to
GetRateLimits
would reset theResetTime
and not theRemaining
counter. This would cause counters to eventually deplete and never fully reset. The solution involved fixing two issues:Duration
value was never properly propagated in global behavior. This was added to the global broadcast logic.UpdatePeerGlobals
during a global broadcast but neglected to propagateDuration
.Duration
to zero and trigger a reset of theResetTime
. This code path does not reset theRemaining
counter because it's meant for cases where an existing rate limit had been extended or abbreviated in duration.CacheItem
struct and logic in algorithms.go would ignore it, causing it to short circuit around the logic that checksDuration
. Once the data type was corrected, theDuration
bug was revealed.ResetTime
generated by the owning and non-owning peers did not always match exactly.ResetTime
in multiple places based onclock.Now()
.ResetTime
doesn't change due to the above bug.GetRateLimits()
will set arequestTime
and pass it around so that any date/time computation to setResetTime
will always use the same base value instead ofclock.Now()
.Fix race condition in
QueueUpdate()
used by peers to propagate updates to rate limits that it owns.Remaining
counter. So, if the same key were updated multiple times it may get added in non-chronological order. The last update wins, potentially passing a staleRemaining
count, thereby dropping hits already applied.QueueUpdates()
. Then, when the timer calls to propagate the update, get the current ratelimit state of each queued update just before sending to the peers.Fix inconsistency with over limit status when calling
GetRateLimits
on a non-owner peer with global behavior.UNDER_LIMIT
no matter how many hits were applied.Optimize calls to
GetRateLimits
with zero hits to not trigger any global updates because nothing changed.Add rigorous functional tests around global behavior to verify full peer-to-peer propagation after a call to
GetRateLimits
.Fix doublecounting of metric
gubernator_over_limit_counter
on both non-owner and owner peers. Only count on owner peer.Fix metric doublecounting of
gubernator_getratelimit_counter
. When a non-owner uses Global behavior to process a request, do not increment the counter. After it global sends to the owner, the owner will increment the counter. This counter shall be the accurate count of rate limits checked.Remove redundant metric
gubernator_broadcast_counter
. Usegubernator_broadcast_duration_count
instead.Fix intermittent test error related to
TestHealthCheck
that causes the next test to fail because the Gubernator services were restarted and aren't always ready in time to accept requests.