Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

only mark tokens as unsupported based on metrics for a limited time #3205

Merged
merged 6 commits into from
Jan 3, 2025

Conversation

MartinquaXD
Copy link
Contributor

Description

Currently the bad token detection based on metrics will mark tokens as unsupported forever. This is problematic for tokens which only have issues temporarily. For example this can happen when the most important pool for a token gets into a weird state or when a token gets paused or a while.

Changes

Adjusts the logic to freeze tokens for a configurable period of time. Once the freeze period is over we give the token another chance (even if the stats indicate that it's currently unsupported). To not run into issues when a token is always bad the logic was built such that 1 more bad measurement is enough to freeze the token again.
That way we can safely configure a very high min_measurements without having periods where a token that was flagged as bad can issues again because we need to get a lot new measurements to mark it as unsupported again.

Additionally the PR simplifies how the metrics based bad token detector gets instantiated and gives each solver their completely separate instance (how it was originally communicated because each solver may support different tokens).

How to test

added a unit test

@MartinquaXD MartinquaXD requested a review from a team as a code owner January 3, 2025 08:58
@MartinquaXD MartinquaXD changed the base branch from main to metrics-detection-log-only-mode January 3, 2025 09:49
crates/driver/src/domain/competition/bad_tokens/metrics.rs Outdated Show resolved Hide resolved
let stats = self.counter.get(token)?;
if stats
.flagged_unsupported_at
.is_some_and(|t| now.duration_since(t) > self.token_freeze_time)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this correct? if the flagged_unsupported_at is some and the time between freezing period and now is bigger than token freeze time, should it return None?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is explained in the comment. I think the confusion might come from the interface. I think all but 1 strategy (the hardcoded list) can only really return whether a token should be dropped but not if it needs to be kept.
The reason is that it's enough for a single metric to indicate that a token is bad but it's not enough if only 1 strategy says the token is good.
This could maybe be improved in a follow up PR adjusting these functions to return Quality instead of Option<Quality> and have the wrapping detector only pay attention to Quality::Unsupported results in the short circuiting logic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the upcoming PR you proposed would be really nice! thanks for the explanation!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually after thinking about this more Option<Quality> seems correct to me. That way we can express:

  1. not enough information to make a decision
  2. information indicates good
  3. information indicates bad

I think I'll just adjust the comment and make it more explicit what gets returned. Because the current code focuses only on whether or not we have enough information to mark the token as unsupported.

Base automatically changed from metrics-detection-log-only-mode to main January 3, 2025 10:23
Copy link
Contributor

@sunce86 sunce86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG. Good unit test.

@MartinquaXD MartinquaXD force-pushed the allow-unfreezing-tokens branch from 5a8fb19 to b3cadea Compare January 3, 2025 10:38
Copy link
Contributor

@m-lord-renkse m-lord-renkse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation, now it is clear. LGTM.

Copy link
Contributor

@mstrug mstrug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good decision to remove DetectorBuilder.

@@ -742,3 +751,7 @@ fn default_settle_queue_size() -> usize {
fn default_metrics_bad_token_detector_log_only() -> bool {
true
}

fn default_metrics_bad_token_detector_freeze_time() -> Duration {
Duration::from_secs(60 * 10)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use from_mins(10)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, it doesn't exist.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah you're right, it is nightly api.

@MartinquaXD MartinquaXD merged commit 3ca92b7 into main Jan 3, 2025
11 checks passed
@MartinquaXD MartinquaXD deleted the allow-unfreezing-tokens branch January 3, 2025 11:28
@github-actions github-actions bot locked and limited conversation to collaborators Jan 3, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants