-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: routing rules to mirror traffic during MoveTables #13772
Comments
@maxenglander This is a very well-written RFC! I looked at the query-serving related code in the linked PR too, and that looks good to me too! |
@maxenglander The idea is great to have such a feature.
|
Hey @harshit-gangal thanks for the feedback 🤗
|
In discussions with @rohit-nayak-ps, we agreed that if we go forward with this it should use a new topology record type, and not try to fit this into the routing rules record type. This will give us more flexibility in how we define the record type, and reduce the likelihood that we complicate or break the routing rules implementation. Updated RFC to reflect. |
Met with members of the Vitess team yesterday to gather feedback on the proposal and the demo implementation. Some notes:
Next steps: update current demo implementation to use alternative query planning/execution approach, and then loop back for another round of feedback. |
I will no longer be able to work on this. If someone else wants to take this over feel free to re-open. Thanks everyone for the feedback and discussions on this! |
Motivation
MoveTables
facilitates moving workloads from non-Vitess MySQL into Vitess. Typical issues users may encounter duringSwitchTraffic
:Users should mitigate these challenges through best practices such as:
vtexplain
to discover syntax and V/Schema design issues.SwitchTraffic --tablet_types RDONLY,REPLICA
to validateSELECT
queries, and gradually ramping queries to@replica
or@rdonly
.However, these best practices come with their own challenges:
vtexplain
does a good job of reporting whether a query can be parsed and planned by Vitess, but it does not perfectly replicate the VTGate planning phase, and does not emulate the query execution phase at all.@replica
may be challenging, depending on the nature of the workload or limitations in application libraries.Proposal
To complement these best practices, this RFC proposes a new type of routing rule to mirror some or all traffic to the target keyspace in the course of a
MoveTables
migration. This feature will allow users to easily test how a new, production-grade Vitess cluster will handle a production workload prior to runningSwitchTraffic
.Rule Definitions
Define a new topology record
MirrorRules
to support mirror rules that look like this:VSchema
vindexes.BuildKeyspace
will be updated to take new proto definitions into account.vindexes.(*VSchema).FindMirrorTables
will provide other packages visibility into mirror rules.Validate Mirror Rules
Validate constraints on mirror rules usage, e.g.:
Find Mirror Tables
Other parts of the code base will need to be able to introspect tables which have mirror rules on them.
A table will be considered "mirrored" if there is a mirror rule defined for keyspace, and the target keyspace has a matching table name.
Query Planning
Modify
planbuilder
to produceengine.Mirror
primitives when aSELECT
query is planned, and the tables that are touched are all in keyspaces with mirror rules to other keyspaces.Query Execution
Query planning will produce a
Mirror
primitive that executes the main and mirrored queries in parallel, returning only the results and the error from the main query.Some additional goals, not demonstrated in the code sample below:
MoveTables
Introduce a new
MoveTables
command:MirrorTraffic
.MirrorTraffic --percent=1 --tablet_types=RDONLY,REPLICA
will add mirror rules reflecting 1% of traffic from the source keyspace to the target keyspace.SwitchTraffic
, there will be no reverse option.--tables
; all tables in the workflow will be mirrored. Users are free to manually add/remove/modify rules to achieve different results.--remove
will remove all mirror rules.Monitoring
Existing metrics
Mirrored queries will appear in VTTablet-level metrics of the keyspace(s) where they are sent. Users will be able to compare the query rates, error rates, and latencies of the primary and mirrored keyspace(s).
I don't see a compelling reason to add additional metrics, labels, or label values at the VTTablet to indicate which queries were mirrored.
Differences between main and mirrored queries
I think the most useful thing to will be for VTGate to instrument differences between main queries and mirrored queries. For example if mirrored queries take 20% longer than their main counterpart. Similarly, if the main query does not return an error, but the mirrored queries do.
In terms of how this could be reported, it wouldn't be feasible to instrument this in metrics on a per-query-pattern basis. We can log differences, but that could potentially result in a lot of logging for a very busy application.
What I suggest doing is a cache with a user-definable max size, which maps query patterns to stats that can be fetched on demand through
:15000
or logged periodically (e.g. every T minutes or every N queries) to stderr or a file.Also
Other things that could be useful to monitor and/or log:
DML
I think there is use case for mirroring DML. Some applications are write-heavy, and for those applications it would be useful to test the throughput and latency of queries through Vitess. Additionally, mirroring DML may surface subtle issues in V/Schema design, e.g. where a main query to an unsharded source keyspace returns a duplicate key error, but the mirrored query to the target keyspace do not because a unique key is not enforced globally.
However, in the context of
MoveTables
, it does not make sense to commit DML to the target keyspace, since it would break VReplication (duplicate keys) or corrupt data. One way DML could be mirrored without wreaking havoc would be to wrap every DML in a transaction, and to ensure that every transaction is rolled back and never committed. Even then, there would be the strong likelihood of lock contention which would affect the performance of the mirrored DML.Given the risk and uncertainty around mirrored DML, I suggest that an initial implementation only mirror
Route
(=SELECT
) operators. A bug in mirrored DML could cause a lot of damage, so I think it is wise to defer any attempt at this until an initial,SELECT
-only implementation has stabilized through production trials.Performance
Performance considerations have been mentioned in various sections above, and are repeated here:
Alternatives
Some alternatives to parts of the proposal above.
Accommodate mirror rules within routing rules
Originally, the RFC proposed that we accommodate mirror rules within routing rules. After discussion, we agreed not to go with this approach, because it restricts the shape that mirror rule definitions can take, and risks breaking or complicating routing rule processing. Here is the content of the original proposal:
Pros:
Cons:
fromTable
, an assumption changed by introducing mirror rules.Decouple main and mirrored queries
In the proposal above, main and mirror queries execute in parallel, but complete as a group:
An alternate approach would be to decouple the main and mirrored queries, so that a reply can be sent back to the caller as soon as the main query finishes. Mirrored queries can continue executing afterwards, up to a timeout or global concurrency limit. A similar approach was used to implement replica warming queries.
Pros:
Cons:
Implement mirroring at the operator level
The proposal initially recommended implementing mirroring at the operator level. After receiving feedback from the Vitess team, the proposal was amended to mirror at the plan-level. Here is the content of the original proposal:
Incorporate mirror operators into the operator tree. Prune and consolidate these operators in a way that balances the intent of mirroring with other concerns like performance.
Route
(=SELECT
) operators will be mirrored. DML and other types of operators will not be mirrored. See the DML section below for more thoughts on this.Mirror
operators to the tree in a way that reflects mirror rule definitions.Mirror
operators into as few combined operators as possible, by following existing logic in the codebase for merging routes, for example.Mirror
mirror targets when their keyspace is the same as the keyspace of the mainRoute
.Mirror
operators without any mirror targets.Horizon
belowMirror
(andRoute
) where possible and sensible to do so.Mirror
operators from the final tree if the totalMirror
count exceeds some small, hard-coded value, to mitigateMirror
operator explosions resulting from planning interactions betweenMirror
and other operators.Proof-of-concept
#14872
Schedule
If this RFC is approved, I think it will make sense to sequence blocks of development in this order:
Mirror
operator planning and engine primitive.MoveTables MirrorTraffic
.Approval
Looking forward to feedback, and ultimately hoping for approval, from the following groups/people.
The text was updated successfully, but these errors were encountered: