only mark tokens as unsupported based on metrics for a limited time #3205
Conversation
```rust
let stats = self.counter.get(token)?;
if stats
    .flagged_unsupported_at
    .is_some_and(|t| now.duration_since(t) > self.token_freeze_time)
```
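Read in isolation, the condition can be sketched as follows; the struct and helper name are illustrative stand-ins for the PR's actual code, but the `is_some_and` check is the same: it is true exactly when the token was flagged and the freeze window has already elapsed.

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-in for the PR's per-token statistics.
struct TokenStats {
    flagged_unsupported_at: Option<Instant>,
}

// True exactly when the token was flagged as unsupported AND the
// freeze window has already elapsed, i.e. the token should get
// another chance instead of staying marked as unsupported.
fn freeze_expired(stats: &TokenStats, now: Instant, freeze_time: Duration) -> bool {
    stats
        .flagged_unsupported_at
        .is_some_and(|t| now.duration_since(t) > freeze_time)
}

fn main() {
    let freeze_time = Duration::from_secs(600);
    let base = Instant::now();
    let now = base + Duration::from_secs(3600);

    // Flagged 11 minutes before `now`: the 10 minute freeze has elapsed.
    let old = TokenStats {
        flagged_unsupported_at: Some(base + Duration::from_secs(2940)),
    };
    assert!(freeze_expired(&old, now, freeze_time));

    // Flagged 1 minute before `now`: still frozen.
    let recent = TokenStats {
        flagged_unsupported_at: Some(base + Duration::from_secs(3540)),
    };
    assert!(!freeze_expired(&recent, now, freeze_time));

    // Never flagged: nothing to expire.
    let never = TokenStats { flagged_unsupported_at: None };
    assert!(!freeze_expired(&never, now, freeze_time));
}
```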
Is this correct? If `flagged_unsupported_at` is some and the time between the freezing moment and now is bigger than the token freeze time, should it return `None`?
This is explained in the comment. I think the confusion might come from the interface: all but one strategy (the hardcoded list) can only really return whether a token should be dropped, not whether it needs to be kept.
The reason is that a single metric indicating that a token is bad is enough to drop it, but a single strategy saying the token is good is not enough to keep it.
This could maybe be improved in a follow-up PR by adjusting these functions to return `Quality` instead of `Option<Quality>` and having the wrapping detector only pay attention to `Quality::Unsupported` results in the short-circuiting logic.
Yeah, the follow-up PR you proposed would be really nice! Thanks for the explanation!
Actually, after thinking about this more, `Option<Quality>` seems correct to me. That way we can express:
- not enough information to make a decision
- information indicates good
- information indicates bad

I think I'll just adjust the comment and make it more explicit what gets returned, because the current code focuses only on whether or not we have enough information to mark the token as unsupported.
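The tri-state verdict and the short-circuiting combination discussed here could look roughly like the sketch below; the enum variants, the `combine` helper, and its semantics are illustrative assumptions, not the PR's actual API.

```rust
// Hypothetical sketch of the tri-state verdict; names are illustrative.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Quality {
    Supported,
    Unsupported,
}

// Each strategy returns:
//   None                       -> not enough information to decide
//   Some(Quality::Supported)   -> information indicates the token is good
//   Some(Quality::Unsupported) -> information indicates the token is bad
//
// A single `Unsupported` verdict is enough to drop the token, so the
// wrapping detector can short-circuit on the first one it sees.
fn combine(verdicts: impl IntoIterator<Item = Option<Quality>>) -> Option<Quality> {
    let mut best = None;
    for verdict in verdicts {
        match verdict {
            Some(Quality::Unsupported) => return Some(Quality::Unsupported),
            Some(Quality::Supported) => best = Some(Quality::Supported),
            None => {}
        }
    }
    best
}

fn main() {
    // One bad verdict dominates everything else.
    assert_eq!(
        combine([Some(Quality::Supported), Some(Quality::Unsupported)]),
        Some(Quality::Unsupported)
    );
    // Only "good" and "no data" verdicts: the token is kept.
    assert_eq!(
        combine([None, Some(Quality::Supported)]),
        Some(Quality::Supported)
    );
    // No strategy has enough information.
    assert_eq!(combine([None, None]), None);
}
```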
LG. Good unit test.
Force-pushed from 5a8fb19 to b3cadea
Thanks for the explanation, now it is clear. LGTM.
Good decision to remove `DetectorBuilder`.
```diff
@@ -742,3 +751,7 @@ fn default_settle_queue_size() -> usize {
 fn default_metrics_bad_token_detector_log_only() -> bool {
     true
 }
+
+fn default_metrics_bad_token_detector_freeze_time() -> Duration {
+    Duration::from_secs(60 * 10)
```
Why not use `from_mins(10)`?
AFAIK, it doesn't exist.
Ah, you're right, it's a nightly API.
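For context: `Duration::from_mins` is indeed gated behind the nightly-only `duration_constructors` feature, so stable code has to spell the value out, as in this minimal sketch:

```rust
use std::time::Duration;

fn main() {
    // Stable equivalent of the nightly-only `Duration::from_mins(10)`.
    let freeze_time = Duration::from_secs(60 * 10);
    assert_eq!(freeze_time.as_secs(), 600);
}
```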
Description
Currently the bad token detection based on metrics marks tokens as unsupported forever. This is problematic for tokens which only have issues temporarily, for example when the most important pool for a token gets into a weird state or when a token gets paused for a while.
Changes
Adjusts the logic to freeze tokens for a configurable period of time. Once the freeze period is over, we give the token another chance (even if the stats indicate that it's currently unsupported). To not run into issues when a token is always bad, the logic was built such that one more `bad` measurement is enough to freeze the token again. That way we can safely configure a very high `min_measurements` without having periods where a token that was flagged as bad can cause issues again because we would need a lot of new measurements to mark it as unsupported again.

Additionally, the PR simplifies how the metrics-based bad token detector gets instantiated and gives each solver its own completely separate instance (as originally communicated, because each solver may support different tokens).
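The freeze-and-retry behaviour described above can be sketched roughly as follows; the struct fields, method name, and threshold handling are hypothetical simplifications of the PR's actual implementation, meant only to show why a single additional bad measurement re-freezes the token.

```rust
use std::time::{Duration, Instant};

// Illustrative sketch; field and method names are hypothetical.
struct TokenStats {
    bad_measurements: u32,
    flagged_unsupported_at: Option<Instant>,
}

struct Detector {
    min_measurements: u32,
    freeze_time: Duration,
}

impl Detector {
    // Record one measurement and report whether the token is frozen.
    fn record(&self, stats: &mut TokenStats, bad: bool, now: Instant) -> bool {
        if bad {
            stats.bad_measurements += 1;
        }
        match stats.flagged_unsupported_at {
            // Still inside the freeze window: keep the token frozen.
            Some(t) if now.duration_since(t) <= self.freeze_time => true,
            // Freeze window elapsed (or never frozen): the token gets
            // another chance, but because the accumulated stats already
            // exceed the threshold, a single additional bad measurement
            // is enough to freeze it again immediately.
            _ if stats.bad_measurements >= self.min_measurements && bad => {
                stats.flagged_unsupported_at = Some(now);
                true
            }
            _ => false,
        }
    }
}

fn main() {
    let detector = Detector {
        min_measurements: 3,
        freeze_time: Duration::from_secs(600),
    };
    let base = Instant::now();
    let mut stats = TokenStats { bad_measurements: 0, flagged_unsupported_at: None };

    // Three bad measurements reach the threshold and freeze the token.
    assert!(!detector.record(&mut stats, true, base));
    assert!(!detector.record(&mut stats, true, base));
    assert!(detector.record(&mut stats, true, base));

    // After the freeze window the token gets another chance...
    let later = base + Duration::from_secs(700);
    assert!(!detector.record(&mut stats, false, later));
    // ...but one more bad measurement is enough to freeze it again.
    assert!(detector.record(&mut stats, true, later));
}
```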
How to test
Added a unit test.