Multi-tenant migrations: add topo locking while updating keyspace routing rules #15807
Conversation
Review Checklist
Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.
General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
## main #15807 +/- ##
==========================================
+ Coverage 68.40% 68.43% +0.02%
==========================================
Files 1556 1562 +6
Lines 195121 197053 +1932
==========================================
+ Hits 133479 134854 +1375
- Misses 61642 62199 +557
☔ View full report in Codecov by Sentry.
a130cfc to b284b26 (compare)
I know this is still a draft, but I'm wondering whether it affects …
Yes, just updated the description as well.
LGTM! Nice work on this. I apologize again for my initial confusion and unfounded concerns. 🙂
Signed-off-by: Rohit Nayak <[email protected]>
Signed-off-by: Rohit Nayak <[email protected]>
Signed-off-by: Rohit Nayak <[email protected]>
Signed-off-by: Rohit Nayak <[email protected]>
The watch output after changing the key paths to …
Approving but the proto field issue should be addressed before merge. I'll apply the "do not merge" label until then.
I'm really pleased with the direction on this one, though we still have work to do to align existing code with this implementation. Nice work @rohit-nayak-ps!
…he client side Signed-off-by: Rohit Nayak <[email protected]>
This is looking much better! Thank you for spending the time and effort on it. Beyond nits, I have a couple of questions and suggestions. Please let me know what you think.
Thanks!
go/vt/topo/routing_rules_lock.go
Outdated
}

func NewRoutingRulesLock(ctx context.Context, ts *Server, name string) (*RoutingRulesLock, error) {
	if err := ts.EnsureTopoPathExists(ctx, "Routing Rules", RoutingRulesPath); err != nil {
I don't like this model. 1) We have unused junk in the topo server. 2) More importantly, every caller on every call, for the life of the topo server over the course of years, is making an unnecessary additional topo call to get the sentinel file. That seems bad to me when, AFAIUI, the purpose of this work is to increase concurrency and to do so safely. This is really only something that we typically have to do one time over the lifetime of the topo server, so I think we should do it differently. Suggestion/idea to follow in the actual Save function.
Commented on the suggested approach below. Just to clarify, the aim of this is not to "increase concurrency" but to make it safer. I think the extra call should take milliseconds, so it does not really affect the bandwidth of migrations, but it can also be avoided (see below).
Also, I personally don't see the sentinel as junk. It is part of the algorithm used.
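For readers following along, here is a minimal sketch of the sentinel pattern being debated; it is an illustration only, not necessarily the PR's actual EnsureTopoPathExists implementation, and the sentinel file name is an assumption:

// Illustrative only: ensure a lockable "dir" exists in the topo by writing a
// small sentinel file under it if it is not already there.
func (ts *Server) EnsureTopoPathExists(ctx context.Context, contents, path string) error {
	sentinelPath := path + "/sentinel" // file name assumed for illustration
	if _, _, err := ts.globalCell.Get(ctx, sentinelPath); err == nil {
		return nil // path already exists, nothing to do
	} else if !IsErrType(err, NoNode) {
		return err
	}
	// Create the sentinel. If another caller races us and creates it first,
	// NodeExists is not an error for our purposes.
	if _, err := ts.globalCell.Create(ctx, sentinelPath, []byte(contents)); err != nil && !IsErrType(err, NodeExists) {
		return err
	}
	return nil
}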
Network calls are typically the most expensive thing we do. Yes, it should typically be fast, but that's no reason to do it unnecessarily.
It's not just that it's junk, it's that it's bad practice IMO. I say that because the topo is a critical piece of Vitess and we have established models of working with it (just like MySQL). In the topo server today you have to create something, only then can you get it, lock it, update it etc. You're using this sentinel file to break/violate this model — and we're paying a cost to do so (deliberately making unnecessary topo calls is also bad IMO).
The topo server is our source of truth for metadata about a Vitess cluster. It's also our locking/coordination service. So I think it's important to be very thoughtful and deliberate about it. IMO part of the problem is that virtually all of the topo work was done by Alain ~ 7 years ago and we've since lost that knowledge and expertise. I think that building that back up as a group (maintainers) is important. And from there, being just as thoughtful and deliberate as he was (he is/was very good) in how we use it.
go/vt/topotools/routing_rules.go
Outdated
update func(ctx context.Context, rules *map[string]string) error) (err error) {
	var lock *topo.RoutingRulesLock
	lock, err = topo.NewRoutingRulesLock(ctx, ts, reason)
	if err != nil {
		return err
	}
	lockCtx, unlock, lockErr := lock.Lock(ctx)
	if lockErr != nil {
		return lockErr
	}
	defer unlock(&err)
	rules, _ := GetKeyspaceRoutingRules(lockCtx, ts)
	if rules == nil {
		rules = make(map[string]string)
	}
	if err := update(lockCtx, &rules); err != nil {
		return err
	}
	if err := saveKeyspaceRoutingRulesLocked(lockCtx, ts, rules); err != nil {
		return err
	}
	return nil
IMO this is much nicer and it eliminates all the awkward sentinel related code, while remaining more standard for the topo server, and most importantly ensuring that we're only ever making the minimum necessary topo calls:
update func(ctx context.Context, rules *map[string]string) error) (err error) {
var lock *topo.RoutingRulesLock
lock, err = topo.NewRoutingRulesLock(ctx, ts, reason)
if err != nil {
return err
}
lockCtx, unlock, lockErr := lock.Lock(ctx)
if lockErr != nil {
// If the key does not yet exist then let's create it.
if !topo.IsErrType(lockErr, topo.NoNode) {
return lockErr
}
rules := make(map[string]string)
if err := update(ctx, &rules); err != nil {
return err
}
// This will fail if the key already exists and thus avoids any races here. The first
// writer will win and the others will have to retry. This situation should be very
// rare as we are typically only updating the rules from here on out.
if err := ts.CreateKeyspaceRoutingRules(ctx, buildKeyspaceRoutingRules(&rules)); err != nil {
return err
}
return nil
}
defer unlock(&err)
rules, err := GetKeyspaceRoutingRules(lockCtx, ts)
if err != nil {
return err
}
if err := update(lockCtx, &rules); err != nil {
return err
}
if err := saveKeyspaceRoutingRulesLocked(lockCtx, ts, rules); err != nil {
return err
}
return nil
Full diff (including updated unit tests): https://gist.github.com/mattlord/68e8636fe65cf72baab31e8f6b7b8604
If you disagree, can you help me understand why?
So if I understand correctly, your issues with the current code are:
- There is that extra sentinel key that stays around
- We have an extra topo get for every lock
With the recommended code:
- There is no sentinel and no extra check
- The caller can (most likely very rarely) need to retry. So every caller of the Update needs to be aware of the need to retry, or pass it on to the client, who needs to know that this is not an error but that they need to retry.
I guess in your approach we can add N retries instead of the one and make the chance of the race surfacing to the user practically zero.
In fact, I did start with an approach similar to yours. My thought was originally that since the topo (at least etcd) can lock a keypath without the key needing to exist, I could obtain a lock first and then create the rules. Unfortunately for this approach, Vitess requires that the key path be a "dir" and you cannot lock leaf keys.
I didn't think it was an option to surface a race to the end user, so came up with the sentinel approach. Seemed like Vitess should manage it, because it can.
Regarding the extra get, a similar approach of trying to lock first and creating only on a NoNode can remove the extra gets. But the sentinel key will indeed remain.
Another point to note is that we delete the routing rules files today if there are no rules. So we would be creating the rules more than just once. If there are always multiple migrations in flight we will create it once, but if the user runs a set of migrations and waits for them to complete before starting the next set, we will again follow the flow where the key to be locked needs to be created. With the sentinel file the prefix being locked is always present.
I think it is more important to get this feature in, to add safety to ongoing migrations. So @deepthi and @mattlord, if you are both fine with @mattlord's suggested approach, please go ahead and push that change to this PR, so that the attribution is correct and we minimise iterations. I can review the changes once they are in, just as an extra 👀
So if I understand correctly, your issues with the current code are:
- There is that extra sentinel key that stays around
- We have an extra topo get for every lock
It's more than that too: #15807 (comment)
With the recommended code:
- There is no sentinel and no extra check
That's trivializing it a bit IMO, just as it would be to ignore us doing something unusual with MySQL that breaks well established usage/access patterns. And doing so unnecessarily, and in a way that is slower / more costly.
We could, of course, alternatively require that you create the rules before updating them in a more direct way (CreateKeyspaceRoutingRules). That would be even more standard — in that model we would always check for topo.NoNode errors and do CreateKeyspaceRoutingRules instead (as we're doing automatically in my suggestion).
- The caller can (most likely very rarely) need to retry. So every caller of the Update needs to be aware of the need to retry, or pass it on to the client, who needs to know that this is not an error but that they need to retry.
The caller (the vitess function or the vitess user) can always get an error and can retry. For topo.NodeExists in the case of an update, the caller would only ever at most get this once and retry an update because of it once — as if you get that error on update you know that it now exists and you can of course then update it. And we don't ever allow deleting it (although someone could do that directly in the underlying topo implementation like etcd, of course) so it's generally a one time scenario.
I added the auto retry on topo.NodeExists error in the Apply client command RPC and the workflow utils update function as those looked like the two places where you'd likely be creating the key (the utils update function is used when creating the initial routing rules for a workflow). IMO, no other callers / locations need to worry about topo.NodeExists errors.
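As a minimal sketch of that one-time retry (not the PR's actual code; the helper name and signature and the updateFn variable are assumptions used purely for illustration):

// Sketch: if two callers race to create the routing rules key, the loser gets
// topo.NodeExists. Since the key now exists, a single retry takes the normal
// lock-and-update path, so the error never needs to reach the end user.
err := topotools.UpdateKeyspaceRoutingRules(ctx, ts, "ApplyKeyspaceRoutingRules", updateFn)
if err != nil && topo.IsErrType(err, topo.NodeExists) {
	err = topotools.UpdateKeyspaceRoutingRules(ctx, ts, "ApplyKeyspaceRoutingRules", updateFn)
}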
I guess in your approach we can add N retries instead of the one and make the chance of the race surfacing to the user practically zero.
No need to retry more than once. Or retry at all, really. Since only one caller should ever get the noted error (from then on the node will exist) in the lifetime of the topo server instance.
I don't understand under what scenario you could possibly race further here. There will always be one writer that wins (etcd2topo and etcd do this Create -> Put in a strictly serializable transaction).
In fact, I did start with an approach similar to yours. My thought was originally that since the topo (at least etcd) can lock a keypath without the key needing to exist, I could obtain a lock first and then create the rules. Unfortunately for this approach, Vitess requires that the key path be a "dir" and you cannot lock leaf keys.
Yeah, I think that these issues are in part arising because RoutingRules was a later topo addition and not done in a standard way (with a path, create, locking, etc). I'm hoping that we can avoid doing more non-standard things here.
I didn't think it was an option to surface a race to the end user, so came up with the sentinel approach. Seemed like Vitess should manage it, because it can.
topo.NoNode and topo.NodeExists are not unusual to surface. If you try to update things in the topo before they exist, you get NoNode (that's why most things use create/update vs apply for the others that always do a blind write). And if you try to create something that already exists you get NodeExists. These are not unusual to have surface in the UI.
And my suggestion does prevent this from surfacing to the user, just as the sentinel file method did.
Regarding the extra get, a similar approach of trying to lock first and creating only on a NoNode can remove the extra gets. But the sentinel key will indeed remain. Another point to note is that we delete the routing rules files today if there are no rules. So we would be creating the rules more than just once.
Why would we delete the file/key? IMO we should only delete the rules we want to delete as empty/no rules is not the same as having no key. Think of it like a table with no rows vs no table. That would be yet another non-standard thing IMO that I must have missed. Where do we do that? Ah, I see:
if len(data) == 0 {
// No rules, remove it.
if err := ts.globalCell.Delete(ctx, ts.GetKeyspaceRoutingRulesPath(), nil); err != nil && !IsErrType(err, NoNode) {
There's no reason to do that. IMO we should simply write the empty rules. You're saying that you want to apply empty routing rules, not that you want to delete the file. This IMO is like turning a delete from t1 into a drop table t1.
If there are always multiple migrations in flight we will create it once, but if the user runs a set of migrations and waits for them to complete before starting the next set, we will again follow the flow where the key to be locked needs to be created. With the sentinel file the prefix being locked is always present.
I think that this is resulting from another unusual and unnecessary behavior as noted above. There's no need to delete the file, just the rules within it.
I think it is more important to get this feature in, to add safety to ongoing migrations. So @deepthi and @mattlord, if you are both fine with @mattlord's suggested approach, please go ahead and push that change to this PR, so that the attribution is correct and we minimise iterations. I can review the changes once they are in, just as an extra 👀
I could push my suggestions if you want. The diff also clearly lays them out.
I'll remove my require changes block on the PR now that we've at least discussed things. Perhaps others are fine with things as-is and you can move forward with their review (I've spent A LOT of time on this to date). It could always be changed/improved later too. The topo paths would have been much more difficult to change later, but I think we've resolved those aspects well. These other things we're discussing are more internal and could be changed in a non-user-visible and upgrade/downgrade safe way later.
If you do end up moving forward with things as they are, I think that this is an important point to address: #15807 (comment)
And if you do end up wanting me to push the local commit with my suggestions, I've updated the endtoend tests as well (minor relevant tweaks). I can also push it to demonstrate that all tests are passing if you like (they are passing locally).
I was in fact somewhat uncomfortable with adding a sentinel file. If we can indeed work out a way to avoid that, that will be better.
Regarding deleting the rules file @rohit-nayak-ps do we currently do that with Table/Shard routing rules as well?
As for why they may be different from other entities we store in the topo, I can only guess that it is because unlike keyspaces/shards/tablets, routing rules were intended to be temporary in the first place.
I cannot say for sure why things related to the routing rules were done in a certain way in the past, but it does look like they were initially added here AFAICT: https://github.com/vitessio/vitess/pull/4833/files#diff-c12ed500f6b6ee0d413204f2a22a0885caf9daaa427cc636de912fd5a93e0320
I can, however, say the following things, which are largely based on my own opinions and preferences so it's perfectly fine if others disagree:
- The routing rules related topo work came ~ 2 years later than virtually all of the general topo server related work (~5 vs ~ 7 years). That general topo server design/work was largely done by Alain and later work was done by others (so different times, perspectives, pressures, etc).
- I think that the general design/work was well thought out and well done — to me it seems that time was clearly taken to write a well thought out design/architecture which was later implemented. Some of the later work was IMO done in a more rushed fashion as we were more/most concerned with trying to quickly achieve the needed given product/user feature — which is normal and fine, no judgement there.
- Lastly, and for me most importantly, if I reviewed a PR today where the author was dropping a sidecar table because there were no rows in it after the action being taken, I would object. It's the same fundamental principle for the topo data store in my mind.
Regarding deleting the rules file @rohit-nayak-ps do we currently do that with Table/Shard routing rules as well? As for why they may be different from other entities we store in the topo, I can only guess that it is because unlike keyspaces/shards/tablets, routing rules were intended to be temporary in the first place.
Yes, we delete it, I presume for the same reasons you outlined, rather than leave an empty file around.
// SaveRoutingRules saves the routing rules into the topo.
func (ts *Server) SaveRoutingRules(ctx context.Context, routingRules *vschemapb.RoutingRules) error {
data, err := routingRules.MarshalVT()
if err != nil {
return err
}
if len(data) == 0 {
// No vschema, remove it. So we can remove the keyspace.
if err := ts.globalCell.Delete(ctx, RoutingRulesFile, nil); err != nil && !IsErrType(err, NoNode) {
return err
}
return nil
}
_, err = ts.globalCell.Update(ctx, RoutingRulesFile, data, nil)
return err
}
// No vschema, remove it. So we can remove the keyspace.
That comment makes no sense to me in Vitess today FWIW.
My larger point is that the RoutingRules weren't really defined as a permanent part of the Vitess data model before — it was temporary storage for a blob so no rules (no pun intended, I mean no data store rules). Write a temporary blob here.
But now we want to — at least in my mind — make RoutingRules a well defined part of the Vitess data model. And with that, apply the normal patterns, rules, and usage. We now want concurrency controls, consistency, etc. So we define a container to hold this specific type of element with a specific structure (it's like a table in our sidecar database). There's no reason to drop/delete that container whenever it happens to be empty. That would be wrong IMO in principle, let alone that it causes more unnecessary work and slows system performance.
One could argue that no, we should drop it and recreate it when actually needed, because in their opinion the data model should only be defined in the application and not the data store. OK, I disagree but that's a valid opinion. But to then say, instead of keeping the container for the type permanently in our data store, we're going to keep some other random thing (the sentinel file) permanently in the data store so that we can get around the rules and access patterns already defined for containers in the data store — I think that is just wrong.
The larger context in my mind is that the VSchema — routing rules becoming a part of that in VTGate — is a critical part of Vitess. A critical part of our data model and our operating model. So we should start treating them that way. Making them a more standard part of the data model so that we can ensure proper / expected behavior with proper concurrency controls, consistency guarantees, versioning etc is pretty important in my mind. I'm trying to push us more methodically in that direction, starting here, so that we can eventually properly address issues like this: #15794
I personally think this is important and it's why I took the time to show exactly — with a working diff — how we could move in that direction while still achieving the immediate larger feature goals of this PR in a timely manner.
// We send both the old and new rules back in the response, so that we have a backup of the rules
// in case we need to revert them or for audit purposes.
currentRules, err := s.ts.GetKeyspaceRoutingRules(ctx)
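For context, a rough sketch of the old/new pattern under discussion (not the PR's actual handler; the response type and field names are assumptions for illustration only):

// Sketch: capture the current rules before applying the change so the response
// can carry both the previous and the resulting state for audit/backup purposes.
oldRules, err := s.ts.GetKeyspaceRoutingRules(ctx)
if err != nil {
	return nil, err
}
// ... apply the requested changes while holding the routing rules lock ...
newRules, err := s.ts.GetKeyspaceRoutingRules(ctx)
if err != nil {
	return nil, err
}
// The field names below are hypothetical, shown only to convey the shape of the response.
return &vtctldatapb.ApplyKeyspaceRoutingRulesResponse{
	OldKeyspaceRoutingRules: oldRules,
	NewKeyspaceRoutingRules: newRules,
}, nil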
This Old / New behavior is non-standard for the topo server and for Vitess. Older versions of the key are stored in the topo server if it came to that.
Can you help me understand what is different/special about KeyspaceRoutingRules that requires or warrants this Old / New behavior? I didn't see anything related to this in the issue or PR description. If it's truly warranted — then I think we should lay out why in the PR description, along with showing an example command and output.
The reason I thought of this is not special to keyspace routing rules. Just that with the number of workflows being quite high (depending on number of tenants) the probability of wanting to update routing rules might be higher than for other migration flows.
Actually I had originally thought of changing the Apply() for keyspace routing rules to only update one tenant at a time, but it didn't match what the other Applys do: they modify the entire rule file. We decided we didn't want feature creep and could implement this later if required.
In fact it may not be a bad idea for most user-facing RPCs/commands to the topo done by users to send back previous state as well since it makes it easy to audit/backup. It is a pattern I have used often in the past for "admin" commands.
Older versions of the key are stored in the topo server if it came to that.
Do we have a way to access a key value today from the topo, or do you mean using the underlying topo's mechanisms like etcdctl get? In that case, depending on the key, we may need post-processing of the value before displaying it or using it for an Apply to revert a previous update.
I agree though, for DR purposes, it is a good option to have and I hadn't thought of it: we do have intrinsic versioning! But for regular users, sending the previous version provides a lightweight way to audit/backup older rules.
Again, I can change it to not send the old rules and let the onus lie on the user doing a Get before an Apply. Let me know what you all want.
Do we have a way to access a key value today from the topo, or do you mean using the underlying topo's mechanisms like etcdctl get? In that case, depending on the key, we may need post-processing of the value before displaying it or using it for an Apply to revert a previous update.
IMO, the better/nicer way would be to add a --version flag to the GetTopologyPath command for things like this.
I'm not set against it, I just think that we should document — in the PR description and the code — why we're doing it, since it's unique within Vitess today AFAIK.
I also think it would be better/nicer to add a command flag to control this: ~ --include-previous-value (even if it defaults to true). We could then use that elsewhere going forward too, when/where it might be nice.
The model I'm using for these Apply commands is ApplySchema / ApplyVSchema, which in fact return the "new" state. I'm not sure why ApplyRoutingRules did not follow that pattern, and it seems like at least initially, the ApplyRoutingRules pattern was then followed by ApplyShardRoutingRules and ApplyKeyspaceRoutingRules. To make them all consistent, we should at least return the "new" state.
Let us not feature-creep this PR into adding a --version to GetTopologyPath, because then the next thing we'll need to add is a way of listing versions with enough metadata to be able to choose a version to inspect.
Either returning just the "new" value, or returning both behind a flag, are fine by me.
We should definitely return the new state. What's unusual is returning the old and new.
I agree that we shouldn't expand the scope of the work here, I was only noting what I think would be a nicer way to achieve what I believe to be the aim here with returning the old value. Adding e.g. --version to GetTopologyPath would IMO be a much nicer and more thorough way to address what I perceived as the reasoning for it — but it should definitely be a separate piece of work (we/I could create an issue for it though).
Are SwitchTraffic calls not going to be serialized?
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Rohit Nayak <[email protected]>
…rules-locking' into rohit/multi-tenant-routing-rules-locking Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
LGTM! Thank you for your patience and effort sticking with this.
I think that we need to adjust the description though:
Today we are not locking either the table or shard routing rules. Also the routing rules file only exists if there are any actual rules present: meaning the key for these rules, eg: /vitess/global/RoutingRules doesn't exist unless there is an active workflow.
As a precursor to a refactor of the existing routing rules, with the ability to lock them before modifications, this PR adds some new paths and a different naming convention for storing the rules.
/vitess/global/routing_rules/ will be the prefix for all locks and rules
/vitess/global/routing_rules/keyspace is the prefix for keyspace routing rules and /vitess/global/routing_rules/keyspace/Rules is the file where the rules are being saved.
/vitess/global/routing_rules/table and /vitess/global/routing_rules/shard will be the prefixes when we refactor the table and shard rules.
The locking is done at the /vitess/global/routing_rules/ level, so that we will serialize changes to all routing rules
topotools.UpdateKeyspaceRoutingRules() is the single entry point for all keyspace routing rule mods, which are made holding the lock to /vitess/global/routing_rules/.
Since I think it will be:
/vitess/global/routing_rules/keyspace/Rules
/vitess/global/routing_rules/shard/Rules
/vitess/global/routing_rules/table/Rules
And thus the locking will be on the given path:
/vitess/global/routing_rules/keyspace
/vitess/global/routing_rules/shard
/vitess/global/routing_rules/table
Maybe that's what you mean as-is. In any event, probably worth one last pass on the description given the recent changes. Thanks again!
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Rohit Nayak <[email protected]>
They should be, but we are trying to guard against the possibility that people might run them in parallel (perhaps by mistake), when migrating thousands of tenants into Vitess.
Looks good.
// find the individual entry
_, ok = i.info[keyPath]
if !ok {
	return vterrors.Errorf(vtrpc.Code_INVALID_ARGUMENT, "%s is not locked (no lockInfo in map)", keyPath)
}

// and we're good for now.
return nil
I think this might be an insufficient check for knowing if the lock still exists. Because etcd expires the lock after a certain amount of time, I think we need something like -
// find the individual entry
li, ok := i.info[keyPath]
if !ok {
return vterrors.Errorf(vtrpc.Code_INVALID_ARGUMENT, "%s is not locked (no lockInfo in map)", keyPath)
}
// check if we still have the lock
return li.lockDescriptor.Check(ctx)
I added some docs on how checking of shard lock works for different topo servers, but I think we still need to do this. This is what we are doing in CheckShardLocked as well.
Right. We need to add it to CheckKeyspaceLocked too ...
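For illustration, a hedged sketch of what adding the same liveness check to CheckKeyspaceLocked could look like, mirroring the snippet above. The surrounding internals (locksKey, locksInfo, i.info, lockDescriptor) are assumed from that snippet and may differ from the actual implementation:

// Sketch: after confirming the keyspace appears in the in-context lock map,
// also ask the topo server whether the lock is still held (e.g. the etcd
// lease backing it may have expired).
func CheckKeyspaceLocked(ctx context.Context, keyspace string) error {
	i, ok := ctx.Value(locksKey).(*locksInfo)
	if !ok {
		return vterrors.Errorf(vtrpc.Code_INVALID_ARGUMENT, "keyspace %v is not locked (no locksInfo in context)", keyspace)
	}
	i.mu.Lock()
	defer i.mu.Unlock()
	li, ok := i.info[keyspace]
	if !ok {
		return vterrors.Errorf(vtrpc.Code_INVALID_ARGUMENT, "keyspace %v is not locked (no lockInfo in map)", keyspace)
	}
	// Verify the lock still exists in the topo server, not just in our map.
	return li.lockDescriptor.Check(ctx)
}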
Description
While updating routing rules (table routing rules, shard routing rules, or keyspace routing rules), in certain paths we initialize or update them without holding any specific locks. Only SwitchTraffic holds locks (at the keyspace level).
Usually this is fine, since vreplication workflows that impact routing rules (MoveTables/Reshard) are expected to be used mostly for imports or resharding operations, which are infrequent. The user is (implicitly) expected to serialize these operations.
However for multi-tenant migrations with thousands of tenants we could have several concurrent per-tenant migrations in flight. This PR adds locking, ensuring that operations that update the global keyspace routing rules are serialized.
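As a rough illustration of the pattern this enables (the exact signature of topotools.UpdateKeyspaceRoutingRules and the example rule are assumptions, shown only to convey the shape of the API):

// Sketch: callers mutate the in-memory rules inside a callback; the helper
// acquires the routing rules lock, applies the callback, and persists the
// result, so all writers are serialized.
func addTenantRoute(ctx context.Context, ts *topo.Server) error {
	return topotools.UpdateKeyspaceRoutingRules(ctx, ts, "migrate tenant1",
		func(ctx context.Context, rules *map[string]string) error {
			(*rules)["tenant1"] = "target_keyspace" // illustrative rule only
			return nil
		})
}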
Keyspace Routing Rules are updated in the following cases:
- Create, when the initial routing rules are set up
- SwitchTraffic for replica/rdonly
- SwitchTraffic for primary
- Complete, when routing rules are dropped
- the ApplyKeyspaceRoutingRules command
Additionally, we also delete keyspace routing rules on a Cancel.
Topo design
Today we are not locking either the table or shard routing rules. Also, the routing rules file only exists if there are any actual rules present: meaning the key for these rules, e.g. /vitess/global/RoutingRules, doesn't exist unless there is an active workflow.
As a precursor to a refactor of the existing routing rules, with the ability to lock them before modifications, this PR adds some new paths and a different naming convention for storing the rules:
- /vitess/global/routing_rules/ will be the prefix for all locks and rules.
- /vitess/global/routing_rules/keyspace is the prefix for keyspace routing rules and /vitess/global/routing_rules/keyspace/Rules is the file where the rules are saved.
- /vitess/global/routing_rules/table and /vitess/global/routing_rules/shard will be the prefixes when we refactor the table and shard rules.
The locking is done at the /vitess/global/routing_rules/ level, so that we will serialize changes to all routing rules.
topotools.UpdateKeyspaceRoutingRules() is the single entry point for all keyspace routing rule mods, which are made holding the lock to /vitess/global/routing_rules/keyspace.
Related Issue(s)
#15403
Checklist
Deployment Notes