Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] RFC: Sharding Recommendations (Tracking Issue) #68

Open
rohit-nayak-ps opened this issue Nov 21, 2024 · 1 comment
Open

[WIP] RFC: Sharding Recommendations (Tracking Issue) #68

rohit-nayak-ps opened this issue Nov 21, 2024 · 1 comment

Comments

@rohit-nayak-ps
Copy link
Contributor

One of the aims of the vt toolset is to provide information to users to help them shard their Vitess clusters better. This document discusses the different artifacts that will form a sharding recommendation. It will also attempt to identify the low level elements associated with each artifact.

Key artifacts

Essentially the output of a sharding recommendation is a complete VSchema including the keyspace which will be sharded and the unsharded keyspace which will contain the sequence tables and source reference tables. Each table in a sharded keyspace will need their sharding key: which includes the columns which are part of the key and the type of Vindex that maps these columns to a keyspace_id and hence shard.

Non-goal Orthogonally we could also suggest the number of shards based on the size of data and performance parameters. But that is currently not in scope.

Sharding Keys

Properties of a sharding key for a table

  • High cardinality: data should be uniformly distributed across shards
  • Query patterns: common columns in filters
  • Avoid hotspots: Use vindex types to hash or distribute auto-increments / time-stamps
  • Data locality: join clauses should inform selection of sharding keys on related tables
  • Application level: in a multi-tenant system tenant related data should reside on the same shard

Inputs

  • Query Logs
  • Database schema: PKs, indexes, FKs
  • Data distribution metrics: row count, selectivity

Workflow

Based on all provided inputs the recommender will suggest one set of artifacts. If we can provide a simulation of this against the query logs using the planner and the suggested sharding strategy then we can perturb the vschema with other possibilities or with user suggested changes and compare.

In an "agentic" approach we can try out different options automatically to refine the vschema further.

It is probably not reasonable to expect a recommendation system to come up with the "ideal" set of artifacts, so the UX should allow users to refine them based on humanint.

Reference Tables

Reference Tables are tables that are replicated across all shards within a keyspace. They are designed to store small, relatively static datasets—such as configuration settings, lookup tables, or enumeration values. Duplicating these tables on every shard reduces the number of cross-shard queries required for queries that join these tables.

The reference table's source-of-truth is in a unsharded keyspace. All DMLs on that table go to the source table and it is expected to be kept updated on each shard using other means: like a VReplication Materialize workflow.

Candidates for reference tables would be:

  • used in many queries or expensive queries
  • tables with low row count compared to those in the operational tables
  • fairly static: low write qps on those tables. This is to minimize inconsistent lookups due to replica lag while replicating the reference data

Lookup Vindexes

Secondary (or lookup) vindexes complement the sharding key by minimizing cross-shard queries when there are expensive queries that have one or more covering columns. Based on the same logic used to identify sharding keys, other candidates for sharding keys that are not chosen as primary vindexes can be recommended as lookup vindexes provided they signficantly increase selectivity.

Each lookup vindex has an additional cost of maintenance on DMLs in the original (owner) table.

@frouioui
Copy link
Member

Essentially the output of a sharding recommendation is a complete VSchema including the keyspace which will be sharded and the unsharded keyspace which will contain the sequence tables and source reference tables.

I read this as in the new feature will only suggest two keyspaces (sharded and unsharded). If the MySQL query log given by the end-users uses different logical database (use db1, etc), will it take into account that the users may want to have more than one sharded keyspace?

  • Data distribution metrics: row count, selectivity

How will this be provided by the users?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants