
Improve ability to discover when a node is ready to handle requests #1416

Closed
kocolosk opened this issue Jun 27, 2018 · 3 comments

@kocolosk
Member

In the old days of CouchDB 1.x things were simple -- once the HTTP daemon had started up, CouchDB was ready for request processing. In a 2.x cluster matters are more complex; nodes need to be aware of each other's presence and they need to have an up-to-date view of the cluster's shard map. These details are captured in two databases that are replicated to every node in the cluster:

  • _nodes
  • _dbs

I think it would be beneficial to surface an endpoint that reliably indicates when these databases have been brought over and the node is ready to go -- sort of like _up, but with additional checks.
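
For context, this is roughly what the existing endpoints report on a freshly started node (node names and addresses are placeholders); both answer successfully as soon as the HTTP layer is up, regardless of whether _nodes and _dbs have arrived yet:

```
$ curl -s http://127.0.0.1:5984/_up
{"status":"ok"}

$ curl -s http://127.0.0.1:5984/_membership
{"all_nodes":["couchdb@node1.example.com"],"cluster_nodes":["couchdb@node1.example.com"]}
```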

There is a bootstrapping problem here -- when a node first comes online it doesn't know about any future peers and so nominally it would consider itself ready to go. One way systems address this is by offering an option to configure "seed" nodes that should be contacted to learn about the cluster. That's of course how we configure clusters today, but the order of operations is a bit funny because the only way to tell a new node about a seed is to use the HTTP API ... which means the node is already online.

Here's what I'm thinking:

  1. Introduce a [cluster] seed_nodes config setting that accepts a comma-delimited list of Erlang node names this node should contact (a configuration sketch follows this list).
  2. Add a /_ready endpoint that returns 200 only after confirming complete replication of _dbs and _nodes from one of the other cluster members (one of the seed nodes?).
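
As a rough sketch (the seed_nodes key and /_ready endpoint are proposals at this point, not existing settings or APIs; the node names are placeholders), the configuration might look like:

```
[cluster]
seed_nodes = couchdb@seed1.example.com,couchdb@seed2.example.com,couchdb@seed3.example.com
```

A load balancer healthcheck could then poll GET /_ready and only route traffic to the node once it returns 200.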

The second step is a bit tricky because a) all the internal replications are of the "push" variety and b) the checkpoint information saved locally does not include any measure of whether the replication is complete. I'm wondering if we ought to use a different algorithm for those internal databases, e.g. pull the databases from a seed on startup instead of relying on pushes to complete the ring.

I think this enhancement will be important if we want to ultimately get to the kind of autoscaling of API nodes described in #1338.

@kocolosk
Member Author

I gave this a bit more thought and I think it has good potential. Here's how I'd see the replication part working:

  1. Introduce a [cluster] seed_nodes config setting that accepts a comma-delimited list of Erlang node names this node should contact. The typical case would be to set this to the names of the first 3 nodes in the cluster.
  2. In mem3_nodes:initialize_nodelist(), add the seeds to the _nodes DB and the associated ets table.
  3. Deterministically sort the seedlist independently on each node (e.g. by preferring seeds in the same zone).
  4. Add a new mem3_seeds module that synchronously pulls _nodes and _dbs from the first live node in the seed preflist (sketched after this list).
  5. Set an internal ready flag once the step above completes.
  6. Asynchronously pull from all seed nodes on a (configurable) scheduled basis to maintain checkpoints and minimize node startup time.

One open question is how the node should respond if a seedlist is configured but every seed node is down. The correct thing to do would be to not set the ready flag, although that means operators who rely on the ready flag (e.g. in a load balancer healthcheck) need to take extra care to maintain a good seedlist.

@kocolosk
Member Author

I started implementing this in https://github.com/apache/couchdb/tree/mem3-seedlist. It still needs documentation and proper testing, but it passes smoke tests; e.g., in a Kubernetes environment, the individual nodes will not take traffic until /_up flips to 200, and that only happens once they've synced up with a seed node. Before that has happened, /_up returns a 404 with "status": "seeding" and a JSON object showing the synchronization status with each member of the seedlist.
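
For illustration, the intermediate response looks something like this (exact field names and values are approximate; the seed name is a placeholder):

```
$ curl -i http://127.0.0.1:5984/_up
HTTP/1.1 404 Object Not Found
...
{
  "status": "seeding",
  "seeds": {
    "couchdb@seed1.example.com": {
      "timestamp": "2019-09-05T12:00:00Z",
      "last_replication_status": "ok"
    }
  }
}
```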

@kocolosk
Member Author

kocolosk commented Sep 5, 2019

Implemented in #1658 😄

@kocolosk kocolosk closed this as completed Sep 5, 2019