From ab57de090813ef407cb2645421835be894a4ace1 Mon Sep 17 00:00:00 2001
From: Manan Gupta
Date: Mon, 20 May 2024 16:17:34 +0530
Subject: [PATCH 1/4] feat: add lock sharding docs for etcd

Signed-off-by: Manan Gupta
---
 doc/design-docs/LockShard.md | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)
 create mode 100644 doc/design-docs/LockShard.md

diff --git a/doc/design-docs/LockShard.md b/doc/design-docs/LockShard.md
new file mode 100644
index 00000000000..95f3e760acf
--- /dev/null
+++ b/doc/design-docs/LockShard.md
@@ -0,0 +1,26 @@
+Shard Locking
+=====================
+
+This doc describes the working of shard locking that Vitess does using the topo servers.
+
+There are 2 variants of shard locking, `LockShard` which is a blocking call, and `TryLockShard` which tries to be a non-blocking call, but does not guarantee it.
+
+`TryLockShard` tries to find out if the shard is available to be locked or not. If it finds that the shard is locked, it returns with an error. However, there is still a race when the shard is not locked, that can cause `TryLockShard` to still block.
+
+### Working of LockShard
+
+`getLockTimeout` gets the amount of time we have to acquire a shard lock. It is not the amount of time that we acquire the shard lock for. It is currently misadvertised. LockShard returns a context, but that context doesn't have a timeout on it. When the shard lock expires, the context doesn't expire, because it doesn't have a timeout. To check whether the shard is locked or not, we have `CheckShardLocked`.
+
+The implementations of LockShard and CheckShardLocked differ slightly for all the different topology servers. We'll look at each of them separately.
+
+### Etcd
+
+In Etcd implementation, we use `KeepAlive` API to keep renewing the context that we have for acquiring the shard lock every 10 seconds. The duration of the lease is controlled by the `--topo_etcd_lease_ttl` flag which defaults to 10 seconds. Once we acquire the shard lock, the context for acquiring the shard lock expires and that stops the KeepAlives too.
+
+The shard lock is released either when the unlock function is called, or if the lease ttl expires. This guards against servers crashing while holding the shard lock.
+
+The Check function of etcd, is unique in the sense that apart from just checking whether the shard is locked or not, it also renews the lease by running `KeepAliveOnce`.
+
+
+### ZooKeeper
+

From 32839be827d46fd34431d5631c10617fb38b1e19 Mon Sep 17 00:00:00 2001
From: Manan Gupta
Date: Mon, 20 May 2024 19:24:49 +0530
Subject: [PATCH 2/4] docs: add zookeeper docs

Signed-off-by: Manan Gupta
---
 doc/design-docs/LockShard.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/doc/design-docs/LockShard.md b/doc/design-docs/LockShard.md
index 95f3e760acf..f391de88c14 100644
--- a/doc/design-docs/LockShard.md
+++ b/doc/design-docs/LockShard.md
@@ -24,3 +24,6 @@ The Check function of etcd, is unique in the sense that apart from just checking
 
 ### ZooKeeper
 
+In ZooKeeper, locks are implemented by creating ephemeral files. The ephemeral files are present until the connection is alive. So there doesn't look like a timeout on the shard lock, unless the connection/process dies.
+
+The Check function doesn't do anything in ZooKeeper. The implementation just returns nil. To implement the Check functionality, we just need to check that the connection isn't broken and the ephemeral node exists.

From 3ecb7d9649e69667ddf7e1f29166bf26e52a1d8b Mon Sep 17 00:00:00 2001
From: Manan Gupta
Date: Mon, 20 May 2024 19:34:27 +0530
Subject: [PATCH 3/4] docs: add consul too to the docs

Signed-off-by: Manan Gupta
---
 doc/design-docs/LockShard.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/doc/design-docs/LockShard.md b/doc/design-docs/LockShard.md
index f391de88c14..2372c92c171 100644
--- a/doc/design-docs/LockShard.md
+++ b/doc/design-docs/LockShard.md
@@ -27,3 +27,9 @@ The Check function of etcd, is unique in the sense that apart from just checking
 In ZooKeeper, locks are implemented by creating ephemeral files. The ephemeral files are present until the connection is alive. So there doesn't look like a timeout on the shard lock, unless the connection/process dies.
 
 The Check function doesn't do anything in ZooKeeper. The implementation just returns nil. To implement the Check functionality, we just need to check that the connection isn't broken and the ephemeral node exists.
+
+### Consul
+
+In Consul, the timeout for the lock is controlled by the `--topo_consul_lock_session_ttl` flag.
+
+The Check function works properly and checks if the lock still exists.
\ No newline at end of file

From 3095b45b252cc6ccca713c3338b9e2fd4332b242 Mon Sep 17 00:00:00 2001
From: Manan Gupta
Date: Wed, 22 May 2024 15:50:06 +0530
Subject: [PATCH 4/4] docs: address review comments

Signed-off-by: Manan Gupta
---
 doc/design-docs/{LockShard.md => TopoLocks.md} | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
 rename doc/design-docs/{LockShard.md => TopoLocks.md} (81%)

diff --git a/doc/design-docs/LockShard.md b/doc/design-docs/TopoLocks.md
similarity index 81%
rename from doc/design-docs/LockShard.md
rename to doc/design-docs/TopoLocks.md
index 2372c92c171..b25927b4ed8 100644
--- a/doc/design-docs/LockShard.md
+++ b/doc/design-docs/TopoLocks.md
@@ -1,4 +1,4 @@
-Shard Locking
+Locking Using Topology Servers
 =====================
 
 This doc describes the working of shard locking that Vitess does using the topo servers.
@@ -9,9 +9,9 @@ There are 2 variants of shard locking, `LockShard` which is a blocking call, and
 
 ### Working of LockShard
 
-`getLockTimeout` gets the amount of time we have to acquire a shard lock. It is not the amount of time that we acquire the shard lock for. It is currently misadvertised. LockShard returns a context, but that context doesn't have a timeout on it. When the shard lock expires, the context doesn't expire, because it doesn't have a timeout. To check whether the shard is locked or not, we have `CheckShardLocked`.
+`getLockTimeout` gets the amount of time we have to acquire a shard lock. It is not the amount of time that we acquire the shard lock for. It is currently misadvertised. `LockShard` returns a context, but that context doesn't have a timeout on it. When the shard lock expires, the context doesn't expire, because it doesn't have a timeout. To check whether the shard is locked or not, we have `CheckShardLocked`.
 
-The implementations of LockShard and CheckShardLocked differ slightly for all the different topology servers. We'll look at each of them separately.
+The implementations of `LockShard` and `CheckShardLocked` differ slightly for all the different topology servers. We'll look at each of them separately.
 
 ### Etcd
 
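
The etcd behaviour this series documents (the lock is released on unlock or when the lease TTL runs out, and the Check call also renews the lease) can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `LeaseLock`, `FakeClock`, `try_lock`, `check`, and `unlock` are hypothetical names invented for illustration, and nothing here calls etcd or Vitess code.

```python
class FakeClock:
    """Hypothetical manual clock so the example needs no real sleeping."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


class LeaseLock:
    """Toy model of a lease-backed lock (illustrative only, not etcd's API)."""

    def __init__(self, ttl_seconds, clock):
        self.ttl = ttl_seconds
        self.clock = clock        # callable returning the current time in seconds
        self.holder = None
        self.expires_at = 0.0

    def _expired(self):
        # The lock is free if nobody holds it or the lease has run out.
        return self.holder is None or self.clock() >= self.expires_at

    def try_lock(self, owner):
        """Take the lock if it is free; return False if someone else holds it."""
        if not self._expired():
            return False
        self.holder = owner
        self.expires_at = self.clock() + self.ttl
        return True

    def check(self, owner):
        """Report whether owner still holds the lock, renewing the lease if so."""
        if self._expired() or self.holder != owner:
            return False
        self.expires_at = self.clock() + self.ttl
        return True

    def unlock(self, owner):
        """Release the lock explicitly; a no-op for anyone who is not the holder."""
        if self.holder == owner:
            self.holder = None


clock = FakeClock()
lock = LeaseLock(ttl_seconds=10, clock=clock)
lock.try_lock("holder-a")   # holder-a takes the lock at t=0
clock.now = 8
lock.check("holder-a")      # still held; the lease now runs until t=18
clock.now = 30
lock.check("holder-a")      # the lease lapsed without renewal, so this is False
```

In this model a holder that crashes simply stops calling `check`, and the lock becomes available again once the TTL passes, which is the crash-safety property the Etcd section describes.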