-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VReplication: Improve query buffering behavior during MoveTables traffic switching #15701
Conversation
Signed-off-by: Matt Lord <[email protected]>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Signed-off-by: Matt Lord <[email protected]>
3424b0b
to
61a0fe5
Compare
Signed-off-by: Matt Lord <[email protected]>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #15701 +/- ##
==========================================
- Coverage 68.44% 68.42% -0.02%
==========================================
Files 1558 1558
Lines 195822 195865 +43
==========================================
- Hits 134025 134024 -1
- Misses 61797 61841 +44 ☔ View full report in Codecov by Sentry. |
Signed-off-by: Matt Lord <[email protected]>
fddaf21
to
be18fc1
Compare
Signed-off-by: Matt Lord <[email protected]>
be18fc1
to
1f81b8a
Compare
Signed-off-by: Matt Lord <[email protected]>
fe9556b
to
8f72116
Compare
Signed-off-by: Matt Lord <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@harshit-gangal can you please review the changes to plan_execute.go?
Rest mostly LGTM. I have some minor questions and comments. Once we have harshit's review and your responses, I can review once again.
Version: EtcdVersion(initial.Kvs[0].ModRevision), | ||
Version: EtcdVersion(initial.Kvs[0].Version), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference between these two?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revision, ModRevision, and Version are related but distinct things:
- what is different about Revision, ModRevision and Version? etcd-io/etcd#6518
- https://etcd.io/docs/v3.5/learning/data_model/
I'll try to summarize:
- Revision: a logical clock used to track changes to the etcd keyspace
- Version: a logical clock used to track the changes to an individual key since it was created
- ModRevision: the etcd keyspace revision number for the most recent update to an individual key (the latest version)
What I tried to do is align the Vitess comments and code so that we're matching etcd for which one of these to use. As you can see in this example here, we were using Version in Vitess code and setting it to the ModRevision value from etcd. When I first started the investigation I thought that, aside from this being odd, it could have perhaps led us to miss/skip an update to the key in the watch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good one
var currVersion = initial.Header.Revision | ||
var rev = initial.Header.Revision |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any other actual diff in this file? Everything else looks cosmetic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, just renaming things to be more precise and clear: https://github.com/vitessio/vitess/pull/15701/files/f1fed6dd787feea26a4f4e13f660e8eaa91e64bb#r1568995668
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, nice work!
I had just one point where I was not clear about the functionality: where we clear the plan cache.
func (kew *KeyspaceEventWatcher) Subscribe() chan *KeyspaceEvent { | ||
kew.subsMu.Lock() | ||
defer kew.subsMu.Unlock() | ||
c := make(chan *KeyspaceEvent, 2) | ||
// Use a decent size buffer to: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can probably indeed process the most recent version, discarding the rest, We can revisit this separately in the context of multi-tenant migrations, for example, where we may be switching several per-tenant workflows per second for smaller tenants.
rootCause := vterrors.RootCause(err) | ||
if rootCause != nil && strings.Contains(rootCause.Error(), "enforce denied tables") { | ||
log.V(2).Infof("Retry: %d, will retry query %s due to %v", try, query, err) | ||
lastVSchemaCreated = vs.GetCreated() | ||
if try == 0 { // We are going to retry at least once |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain why this deferred clearing of plans when the executor.newExecutor()
returns helps?
When we retry, we wait for a newer vschema. Shouldn't we clear the plans before executing the query with the new vschema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain why this deferred clearing of plans when the executor.newExecutor() returns helps?
As noted in the description, this is the part that prevented the query shape from failing indefinitely and causing a never-ending cycle of keyspace events when there was a ~ concurrent buffer event.
The executor's VSchemaManager clears the plan cache when it receives a new vschema via its SrvVSchema watcher (it calls executor.SaveVSchema() in its watch's subscriber callback). This happens concurrently with the KeyspaceEventWatcher also receiving the new vschema in its SrvVSchema watcher and in its subscriber callback processing it (which includes getting info on all shards from the topo), and eventually determining that the keyspace is consistent and ending the buffering window. So there was a race with query retries such that a query could be planned against the wrong side just as the keyspace event is getting resolved and the buffers drained. Then that bad plan is the cached plan for the query until you do another topo.RebuildSrvVSchema
/vtctldclient RebuildVSchemaGraph
which then causes the VSchemaManager to clear the plan cache. It's essentially a race between the two SrvVSchema watchers and the work they do when a new one is received.
When we retry, we wait for a newer vschema. Shouldn't we clear the plans before executing the query with the new vschema.
As I noted above, the VSchemaManager already clears the plan cache when a new vschema is received. We wait for a newer vschema, but not indefinitely as the wait times out and we retry without it. So if we timeout just as the new vschema is coming in, we could plan with the old vschema, putting that into the just cleared plan cache for all subsequent executions. Do you remember when you, @harshit-gangal, and I were discussing this on Zoom? That's where this idea originated as you and I were both confused by how the query failures continued even after the buffering window ended due to hitting the max duration.
Since we're now making 1s the minimum for time between failovers the likelihood of hitting this should be drastically reduced, but since the results of this happening are worse than if we'd not done any buffering at all, I think it's better to be extra safe here when doing a retry. We should do it VERY rarely overall, and when we do, let's just be sure to prevent this issue from ever happening since it's fairly catastrophic and it's not at all clear to the user what is happening and how to clear it out (vtctldclient RebuildVSchemaGraph
) w/o just restarting the vtgate. That was my thinking. If we DID a retry AND the last time we retried (the 3rd try) still encountered an error, we know that the plan used was 1) probably not valid/correct as it failed 2) will be the plan in the cache if we do not clear the plans after it was added to to the cache.
Make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I improved the behavior and comments here: 7675cf8
Sorry for not explaining this more clearly ahead of time! 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This issue was clarified in a call: my misunderstanding was because I was assuming a closure in the check for the err
in defer. However err
is global to the function and we will not clear the cache if the query succeeds. Apologies for the misunderstanding and thanks for the additional comments ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good.
@@ -64,11 +64,11 @@ func (e *Executor) newExecute( | |||
logStats *logstats.LogStats, | |||
execPlan planExec, // used when there is a plan to execute | |||
recResult txResult, // used when it's something simple like begin/commit/rollback/savepoint | |||
) error { | |||
// 1: Prepare before planning and execution | |||
) (err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why we have error as a named parameter here? That is not our usual normal practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be sure that we're checking the returned error in the defer.
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
go/vt/vtgate/plan_execute.go
Outdated
if waitForNewerVSchema(ctx, e, lastVSchemaCreated) { | ||
// Without a wait we fail non-deterministically since the previous vschema will not have | ||
// the updated routing rules. | ||
timeout := e.resolver.scatterConn.gateway.buffer.GetConfig().MaxFailoverDuration / MaxBufferingRetries |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you explain why this timeout is calculated in this way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We retry 2 times before giving up. How long we wait before each retry — IF we don't see a newer vschema come in — was previously hardcoded at 30s while the max buffer failover duration defaults to 20s. I thought it made more sense to instead base the wait time on how long the buffering window is.
My thinking for the calculation was that this way we should be able to perform the max retries within the given window of time for many queries (certainly not all) and we should not end up waiting too long after the buffer window has ended, retrying old queries.
Before we could end up waiting 60 seconds to retry 2 times and continuing to retry when the buffering window ended 30+ seconds ago. That's of course assuming that we didn't receive a newer vschema.
Thinking about it again... I think this improves some failure scenarios/conditions but doesn't make the typical happy path much better as there's no point in retrying when we haven't gotten a new vschema UNLESS the traffic switch failed and we backed out and those queries should now succeed with the original targets/routing. Let's discuss tomorrow. I'm happy to change this... and should add a comment either way. 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noting that we discussed this directly and were both OK with the calculation. I will add a comment, however, to describe the reasoning (which I should have already done).
select { | ||
case c <- th: | ||
default: | ||
} | ||
c <- ev | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this change ensures that broadcast happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good overall, some clarification questions
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
…fic switching (vitessio#15701) Signed-off-by: Matt Lord <[email protected]>
…fic switching (vitessio#15701) Signed-off-by: Matt Lord <[email protected]>
Description
This PR is the result of investigating #15707 and going through the full call path for how we handle keyspace events and specifically how they are used to detect a
MoveTables
switch traffic operation — both when it starts and when it finishes — and how queries are handled during this time. The main changes are:RebuildSrvVSchema
operation — which kicks off the keyspace event which is supposed to end the buffering window — per traffic switch (previously if you were switching all traffic we did it twice, once for reads and once for writes)1s
the minimum for--buffer_min_time_between_failovers
to ensure that we don't start a new buffering window concurrently with resolving the keyspace event, which can result in an errant second buffering window to start while resolving the first — which resulted in a lot of additional query latency and errors along with leaving a bad query plan in the cache which then caused future queries with the same plan to go the wrong side and thus continuously kick off more buffering windows which hit the max window because there's not another keyspace event to resolve them. This change addresses the first part of this issue with the second/concurrent buffer window.MoveTables
traffic switch was detected, we always clear the plan cache after retrying when we were not successful on retry. This prevents the second part of the given query shape then failing indefinitely due to the cached bad plan which also then continuously kicked off more buffering windows. (This is item number 2 in the list.)Related Issue(s)
Fixes: #15707
Checklist