Progressive shutdown - shedding #317

Jmgr · 2021-01-20T14:09:52Z

Currently, when a Liftbridge server shuts down, it stops being the leader for its partitions. If many partitions exist, that results in a flurry of Raft events. Would it be possible to trigger a progressive shutdown to prevent this? Have you given this some thought, @tylertreat?

tylertreat · 2021-01-20T21:03:22Z

Yes, this is something I've thought a bit about, especially as it relates to rolling cluster upgrades. I think a graceful shutdown would make sense. There would be a few components to this:

1. If the server is the leader for any partitions, transfer leadership to another replica (invoke a ChangeLeaderOp in Raft) and remove self from the ISR (ShrinkISROp). This should be done gradually to avoid a flood of Raft ops. Also interrupt any clients currently subscribed.
2. If the server is a follower for any partitions, remove self from the ISR (ShrinkISROp). This should be done gradually to avoid a flood of Raft ops. Also interrupt any clients currently subscribed.
3. At this point, probably reject any client requests, e.g. publish or subscribe.
4. If the server shutting down is the metadata leader, transfer leadership to another node. Perform a Raft barrier to ensure all preceding Raft ops have been applied.
5. Remove self from the Raft group. Need to think through how this works when rejoining, e.g. in the case of restarting/upgrading a node.
6. Shut down the server.
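
To make the "done gradually" part of the partition-shedding steps concrete, here is a rough sketch in Go of how the ops could be planned and then proposed in small batches with a pause in between. This is not Liftbridge code: `Partition`, `Op`, `planShed`, `applyGradually`, and the `propose` callback are all hypothetical names for illustration, and the op kinds are plain strings that only mirror the ChangeLeaderOp and ShrinkISROp mentioned above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Role is the role this server holds for a partition (hypothetical type).
type Role int

const (
	Follower Role = iota
	Leader
)

// Partition is a partition replica hosted on the server that is shutting
// down (hypothetical type, not Liftbridge's own).
type Partition struct {
	Name string
	Role Role
}

// Op stands in for a metadata Raft operation. Kind is either
// "ChangeLeaderOp" or "ShrinkISROp"; these are labels, not real types.
type Op struct {
	Kind      string
	Partition string
}

// ErrStopped is returned when shedding is interrupted via the stop channel.
var ErrStopped = errors.New("shedding interrupted")

// planShed orders the ops: led partitions first (transfer leadership, then
// leave the ISR), followed by partitions where the server is a follower.
func planShed(parts []Partition) []Op {
	var ops []Op
	for _, p := range parts {
		if p.Role == Leader {
			ops = append(ops,
				Op{Kind: "ChangeLeaderOp", Partition: p.Name},
				Op{Kind: "ShrinkISROp", Partition: p.Name})
		}
	}
	for _, p := range parts {
		if p.Role == Follower {
			ops = append(ops, Op{Kind: "ShrinkISROp", Partition: p.Name})
		}
	}
	return ops
}

// applyGradually proposes ops in batches of at most batchSize, waiting for
// pause between batches so Raft is not flooded. A closed stop channel
// aborts the wait; a nil stop channel disables interruption.
func applyGradually(ops []Op, batchSize int, pause time.Duration,
	stop <-chan struct{}, propose func(Op) error) error {
	if batchSize < 1 {
		batchSize = 1
	}
	for i, op := range ops {
		if i > 0 && i%batchSize == 0 {
			select {
			case <-time.After(pause):
			case <-stop:
				return ErrStopped
			}
		}
		if err := propose(op); err != nil {
			return fmt.Errorf("propose %s for %s: %w", op.Kind, op.Partition, err)
		}
	}
	return nil
}
```

A caller would pass a `propose` function that submits the op to the metadata Raft group and waits for it to be applied, then tune `batchSize` and `pause` to the cluster. A fixed pause is the simplest throttle; waiting on applied-index progress instead would be another option.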
