Bug Report: tablet throttler starvation scenario II #15433

shlomi-noach · 2024-03-11T06:24:59Z

Overview of the Issue

While working on #15398, in solving #15397, I found another, more dominant throttler starvation scenario. Like #15397, it can happen in v17 and v18, and it does not happen in v19. Again like #15397, the reason it does not happen in v19 is the unintentional introduction of some behavior. While fixing that behavior in #15398, the bug described here was made immediately visible and consistently failed the tabletmanager_throttler_topo CI tests.

The unintentional behavior added in v19, which hides both bugs, is that neither PRIMARY nor replicas ever go into dormant mode. Dormant mode was introduced to reduce the entwork traffic between PRIMARY and replicas. The idea is that if no one needs the throttler for some period of time (ie no one makes requests of the throttler), then the PRIMARY can relax checking up on replicas, reducing the check frequency from 4 times per second to just once per minute. Then, if someone does check the throttler, the PRIMARY resumes checking the replicas at high frequency.

The problem leading to this issue is that we used dormant state to determine the call frequency for collectMySQLMetrics(). This is the function where the PRIMARY collects info from the replicas. However, and this is the root of the issue, it is also the function where the replica aggregates its own metrics. This is worth elaborating on. All throttlers routinely check their own metrics (ie read _vt.heartbeat or whatever they're instructed to do). But then those metrics are collected. There is an abstraction layer (which we may end up deciding is unnecessary and will remove) where:

On the PRIMARY, collecting metrics means communicating with replicas via CheckThrottler().
On any throttler, including PRIMARY and replicas, collecting metrics means aggregating the metrics we routinely read in an in-memory map, from where they will be served.

And so on a replica this collection process normally merely copies one value from one place to another. But when the PRIMARY collects metrics from the replica, that's where the value needs to be on the replica. Without it, the PRIMARY reads a stale value from the replica.

Solution: a throttler should always collect its own in-memory metrics at high frequency, irrespective of dormancy.

Reproduction Steps

Binary Version

`v17`, `v18`

Operating System and Environment details

Log Fragments

No response

The text was updated successfully, but these errors were encountered:

shlomi-noach · 2024-03-11T06:40:10Z

Fixed in b58bac6

shlomi-noach · 2024-03-20T15:35:11Z

Closed by #15398

shlomi-noach added Type: Bug Component: Throttler labels Mar 11, 2024

shlomi-noach self-assigned this Mar 11, 2024

shlomi-noach mentioned this issue Mar 11, 2024

Tablet throttler: starvation fix and consolidation of logic. #15398

Merged

5 tasks

shlomi-noach closed this as completed Mar 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug Report: tablet throttler starvation scenario II #15433

Bug Report: tablet throttler starvation scenario II #15433

shlomi-noach commented Mar 11, 2024

shlomi-noach commented Mar 11, 2024

shlomi-noach commented Mar 20, 2024

Bug Report: tablet throttler starvation scenario II #15433

Bug Report: tablet throttler starvation scenario II #15433

Comments

shlomi-noach commented Mar 11, 2024

Overview of the Issue

Reproduction Steps

Binary Version

Operating System and Environment details

Log Fragments

shlomi-noach commented Mar 11, 2024

shlomi-noach commented Mar 20, 2024