Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug Report: tablet throttler starvation scenario II #15433

Closed
shlomi-noach opened this issue Mar 11, 2024 · 2 comments
Closed

Bug Report: tablet throttler starvation scenario II #15433

shlomi-noach opened this issue Mar 11, 2024 · 2 comments

Comments

@shlomi-noach
Copy link
Contributor

Overview of the Issue

While working on #15398, in solving #15397, I found another, more dominant throttler starvation scenario. Like #15397, it can happen in v17 and v18, and it does not happen in v19. Again like #15397, the reason it does not happen in v19 is the unintentional introduction of some behavior. While fixing that behavior in #15398, the bug described here was made immediately visible and consistently failed the tabletmanager_throttler_topo CI tests.

The unintentional behavior added in v19, which hides both bugs, is that neither PRIMARY nor replicas ever go into dormant mode. Dormant mode was introduced to reduce the entwork traffic between PRIMARY and replicas. The idea is that if no one needs the throttler for some period of time (ie no one makes requests of the throttler), then the PRIMARY can relax checking up on replicas, reducing the check frequency from 4 times per second to just once per minute. Then, if someone does check the throttler, the PRIMARY resumes checking the replicas at high frequency.

The problem leading to this issue is that we used dormant state to determine the call frequency for collectMySQLMetrics(). This is the function where the PRIMARY collects info from the replicas. However, and this is the root of the issue, it is also the function where the replica aggregates its own metrics. This is worth elaborating on. All throttlers routinely check their own metrics (ie read _vt.heartbeat or whatever they're instructed to do). But then those metrics are collected. There is an abstraction layer (which we may end up deciding is unnecessary and will remove) where:

  • On the PRIMARY, collecting metrics means communicating with replicas via CheckThrottler().
  • On any throttler, including PRIMARY and replicas, collecting metrics means aggregating the metrics we routinely read in an in-memory map, from where they will be served.

And so on a replica this collection process normally merely copies one value from one place to another. But when the PRIMARY collects metrics from the replica, that's where the value needs to be on the replica. Without it, the PRIMARY reads a stale value from the replica.

Solution: a throttler should always collect its own in-memory metrics at high frequency, irrespective of dormancy.

Reproduction Steps

Binary Version

`v17`, `v18`

Operating System and Environment details

-

Log Fragments

No response

@shlomi-noach
Copy link
Contributor Author

Fixed in b58bac6

@shlomi-noach
Copy link
Contributor Author

Closed by #15398

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant