In the old days of CouchDB 1.x things were simple -- once the HTTP daemon had started up, CouchDB was ready for request processing. In a 2.x cluster matters are more complex: nodes need to be aware of each other's presence, and they need an up-to-date view of the cluster's shard map. These details are captured in two databases that are replicated to every node in the cluster:
- _nodes
- _dbs
I think it would be beneficial to surface an endpoint that reliably indicates when these databases have been brought over and the node is ready to serve traffic. Sort of like _up but with additional checks.
There is a bootstrapping problem here -- when a node first comes online it doesn't know about any future peers and so nominally it would consider itself ready to go. One way systems address this is by offering an option to configure "seed" nodes that should be contacted to learn about the cluster. That's of course how we configure clusters today, but the order of operations is a bit funny because the only way to tell a new node about a seed is to use the HTTP API ... which means the node is already online.
Here's what I'm thinking:
1. Introduce a [cluster] seed_nodes config setting which accepts a comma-delimited list of Erlang node names that this node should contact.
2. Add a /_ready endpoint which returns 200 only after confirming complete replication of _dbs and _nodes from one of the other cluster members (one of the seed nodes?). A sketch of both pieces follows this list.
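To make this concrete, here is a minimal sketch of the proposed configuration, assuming the [cluster] seed_nodes name from the list above; the node names are made up for illustration:

```ini
; local.ini on a freshly provisioned node (proposed setting; hypothetical node names)
[cluster]
seed_nodes = couchdb@seed1.example.com,couchdb@seed2.example.com,couchdb@seed3.example.com
```

A load balancer or orchestrator would then poll the proposed /_ready endpoint and only start routing traffic to the node once it returns 200.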
The second step is a bit tricky because a) all of the internal replications are of the "push" variety, and b) the checkpoint information saved locally does not include any measure of whether the replication is complete. I'm wondering if we ought to use a different algorithm for those internal databases, e.g. pulling the databases from a seed on startup instead of relying on push replication to complete the ring.
I think this enhancement will be important if we ultimately want to get to the kind of autoscaling of API nodes described in #1338.
I gave this a bit more thought and I think it has good potential. Here's how I'd see the replication part working:
1. Introduce a [cluster] seed_nodes config setting which accepts a comma-delimited list of Erlang node names that this node should contact. The typical case would be to set this to the names of the first 3 nodes in the cluster.
2. In mem3_nodes:initialize_nodelist(), add the seeds to the _nodes DB and the associated ets table.
3. Deterministically sort the seedlist independently on each node (e.g. by preferring seeds in the same zone).
4. Add a new mem3_seeds module that synchronously pulls _nodes and _dbs from the first live node in the seed preflist (a rough sketch follows this list).
5. Set an internal ready flag once the step above completes.
6. Asynchronously pull from all seed nodes on a (configurable) schedule to maintain checkpoints and minimize node startup time.
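Putting those steps together, here is a rough Erlang sketch of how a mem3_seeds module along those lines could work. Only the module name, the [cluster] seed_nodes setting, and the overall flow come from the list above; the exported functions, the ets-based ready flag, and pull_internal_db/2 are hypothetical placeholders, and the actual internal replication machinery is not shown:

```erlang
%% Sketch of the proposed mem3_seeds module -- not the actual implementation.
%% net_adm:ping/1, string:lexemes/2 and the ets calls are standard OTP;
%% config:get/3 is CouchDB's config API; pull_internal_db/2 is a placeholder.
-module(mem3_seeds).
-export([start_sync/0, is_ready/0]).

-define(TABLE, mem3_seeds_status).

start_sync() ->
    ets:new(?TABLE, [named_table, public, set]),
    ets:insert(?TABLE, {ready, false}),
    case seed_nodes() of
        [] ->
            %% No seedlist configured: behave as today and report ready at once.
            ets:insert(?TABLE, {ready, true});
        Seeds ->
            sync_from_first_live_seed(order_seeds(Seeds))
    end.

is_ready() ->
    ets:lookup(?TABLE, ready) =:= [{ready, true}].

%% Parse the proposed [cluster] seed_nodes setting into a list of node atoms.
seed_nodes() ->
    Raw = config:get("cluster", "seed_nodes", ""),
    [list_to_atom(N) || N <- string:lexemes(Raw, ", ")].

%% Deterministic ordering of the seedlist. A real version might prefer seeds
%% in the same zone; here we simply sort by name.
order_seeds(Seeds) ->
    lists:sort(Seeds).

sync_from_first_live_seed([]) ->
    %% Every seed is down: leave ready = false so the readiness check keeps failing.
    ok;
sync_from_first_live_seed([Seed | Rest]) ->
    case net_adm:ping(Seed) of
        pong ->
            ok = pull_internal_db(<<"_nodes">>, Seed),
            ok = pull_internal_db(<<"_dbs">>, Seed),
            ets:insert(?TABLE, {ready, true}),
            ok;
        pang ->
            sync_from_first_live_seed(Rest)
    end.

%% Placeholder: pull the named internal database from Seed to the local node
%% using mem3's internal replicator.
pull_internal_db(_DbName, _Seed) ->
    ok.
```

The important property is that the ready flag is only set after both internal databases have been pulled from a live seed, which is exactly what a /_ready (or stricter /_up) check would report.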
One open question is how the node should respond if a seedlist is configured but every seed node is down. The correct thing to do would be to not set the ready flag, although that means operators who rely on the ready flag (e.g. in a load balancer healthcheck) need to take extra care to maintain a good seedlist.
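For illustration, the kind of healthcheck mentioned above might look like this in a Kubernetes pod spec; the manifest fragment is hypothetical, and 5984 is just CouchDB's default HTTP port:

```yaml
# Hypothetical readiness probe: keep the node out of the Service until the
# ready flag is set and the endpoint starts returning 200.
readinessProbe:
  httpGet:
    path: /_up        # or the proposed /_ready endpoint
    port: 5984
  periodSeconds: 5
  failureThreshold: 3
```

If every seed is unreachable the probe would simply never pass, which matches the behaviour proposed above but puts the burden on operators to keep the seedlist healthy.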
I started implementing this in https://github.com/apache/couchdb/tree/mem3-seedlist. It still needs documentation and proper testing, but it passes smoke tests; e.g., in a Kubernetes environment the individual nodes will not take traffic until /_up flips to 200, and that only happens once they've synced up with a seed node. Before that has happened, /_up returns a 404 with "status": "seeding" and a JSON object showing the synchronization status with each of the members of the seedlist.
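For reference, the seeding response described above might look roughly like this; the 404 status and "status": "seeding" come from the description, while the node names and the shape of the per-seed status object are invented for the example:

```
GET /_up  ->  404 Object Not Found

{
  "status": "seeding",
  "seeds": {
    "couchdb@seed1.example.com": {"pending_updates": 12},
    "couchdb@seed2.example.com": {"pending_updates": 0}
  }
}
```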