While working on #15398, which solves #15397, I found another, more dominant throttler starvation scenario. Like #15397, it can happen in v17 and v18, and it does not happen in v19. Again like #15397, the reason it does not happen in v19 is the unintentional introduction of some behavior. Once that behavior was fixed in #15398, the bug described here became immediately visible and consistently failed the tabletmanager_throttler_topo CI tests.
The unintentional behavior added in v19, which hides both bugs, is that neither PRIMARY nor replicas ever go into dormant mode. Dormant mode was introduced to reduce the network traffic between PRIMARY and replicas. The idea is that if no one needs the throttler for some period of time (i.e., no one makes requests of the throttler), then the PRIMARY can relax checking up on replicas, reducing the check frequency from 4 times per second to just once per minute. Then, if someone does check the throttler, the PRIMARY resumes checking the replicas at high frequency.
The problem leading to this issue is that we used the dormant state to determine the call frequency for collectMySQLMetrics(). This is the function where the PRIMARY collects info from the replicas. However, and this is the root of the issue, it is also the function where the replica aggregates its own metrics. This is worth elaborating on. All throttlers routinely check their own metrics (i.e., read _vt.heartbeat or whatever they're instructed to do). But then those metrics are collected. There is an abstraction layer (which we may eventually decide is unnecessary and remove) where:
- On the PRIMARY, collecting metrics means communicating with replicas via CheckThrottler().
- On any throttler, including PRIMARY and replicas, collecting metrics means aggregating the routinely read metrics into an in-memory map, from which they are served.
So on a replica, this collection step normally just copies one value from one place to another. But the destination of that copy, the in-memory map, is exactly where the PRIMARY reads from when it collects metrics from the replica. If the replica skips the collection step, the PRIMARY reads a stale value from the replica.
Solution: a throttler should always collect its own in-memory metrics at high frequency, irrespective of dormancy.
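To make the difference concrete, here is a small simulation of the bug and of the proposed fix. It is a sketch under stated assumptions, not the Vitess implementation: the `throttler` struct, its fields, the tick-based scheduling, and the `buggyTick`/`fixedTick` names are all hypothetical. Only the idea (self-aggregation must not be gated by dormancy, while replica probing may be) comes from the description above.

```go
package main

import "fmt"

// dormantEvery is a hypothetical tick count: at 4 ticks per second,
// 240 ticks correspond to once per minute.
const dormantEvery = 240

// throttler is a simplified, hypothetical stand-in for the real throttler.
type throttler struct {
	isPrimary     bool
	dormant       bool
	lastRead      float64            // latest self-check value, e.g. replication lag
	served        map[string]float64 // in-memory map from which checks are served
	replicaProbes int                // counts stand-in CheckThrottler() calls
}

// aggregateSelf copies the latest self-read value into the served map.
func (t *throttler) aggregateSelf() { t.served["self"] = t.lastRead }

// probeReplicas stands in for the PRIMARY calling CheckThrottler() on replicas.
func (t *throttler) probeReplicas() { t.replicaProbes++ }

// buggyTick gates the entire collection step on dormancy, so a dormant
// throttler also stops refreshing its own served value.
func (t *throttler) buggyTick(tick int) {
	if t.dormant && tick%dormantEvery != 0 {
		return
	}
	t.aggregateSelf()
	if t.isPrimary {
		t.probeReplicas()
	}
}

// fixedTick always aggregates the throttler's own metrics; only the
// replica probing is slowed down while dormant.
func (t *throttler) fixedTick(tick int) {
	t.aggregateSelf()
	if t.isPrimary && (!t.dormant || tick%dormantEvery == 0) {
		t.probeReplicas()
	}
}

func main() {
	replica := &throttler{dormant: true, lastRead: 0.1, served: map[string]float64{}}
	replica.buggyTick(0) // tick 0 collects even while dormant
	replica.lastRead = 5.0
	replica.buggyTick(1)
	fmt.Println("buggy, served value:", replica.served["self"])
	replica.fixedTick(1)
	fmt.Println("fixed, served value:", replica.served["self"])
}
```

In the buggy variant, the dormant replica keeps serving the value it aggregated at tick 0 even though its self-check has since read a newer one, which is the stale value the PRIMARY would see. In the fixed variant the served value is refreshed on every tick, while a dormant PRIMARY still probes replicas only once per `dormantEvery` ticks.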
Reproduction Steps
Binary Version
`v17`, `v18`
Operating System and Environment details
-
Log Fragments
No response