pppoe-upgrade without session drops #13
Comments
@dfskoll we're currently debating this in the office, trying to determine the MINIMUM information that needs to be carried over. So the strategy currently is:
There will be other caveats (like handling the PID file). Is there any other information you can think of that may need to be carried over? There are other possible issues:
Are there any other "don't forget about" pearls of wisdom that you're able to impart prior to me commencing work on this? |
@jkroonza, I think re-execing the rp-pppoe program is not a good idea. Instead, I'd update all the data structures to be dynamic. So instead of an array of Removing an interface or lowering the number of sessions will require care. If you lower the number of sessions below the current number of active sessions, it's probably best simply to disallow new sessions until the number of active sessions drops below the limit. For removing an interface, you could either make it fail if there are active sessions using the interface, terminate all such sessions, or delay removal until the sessions end (and disallow creation of new sessions on that interface.) |
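The "disallow new sessions until the count drops below the limit" behaviour described above can be sketched roughly as follows. This is a minimal illustration only: `struct session_pool` and the `pool_*` helpers are hypothetical names and are not taken from the rp-pppoe source.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counters, NOT rp-pppoe's actual data structures. */
struct session_pool {
    size_t active; /* sessions currently established */
    size_t limit;  /* configured maximum, changeable at runtime */
};

/* Lowering the limit never terminates sessions; it only blocks new ones. */
static void pool_set_limit(struct session_pool *p, size_t new_limit)
{
    p->limit = new_limit;
}

/* Try to admit one new session; refused while active >= limit. */
static bool pool_open(struct session_pool *p)
{
    if (p->active >= p->limit)
        return false;
    p->active++;
    return true;
}

/* Called when a session ends (for example, when its pppd exits). */
static void pool_close(struct session_pool *p)
{
    if (p->active > 0)
        p->active--;
}
```

With this shape, lowering the limit from 3 to 1 while 3 sessions are active leaves all 3 running, and new sessions are admitted again only once the active count has dropped below 1.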
@dfskoll those changes are already in the pipeline on this side via the mentioned PR. We also want to implement a startup script of commands as part of that PR. As for the upgrade process I'm trying to figure out: as part of that PR we may end up implementing a mechanism to dump the current running configuration as a script, which could in theory be what gets sent over the pipe() here, or even be written to a temp file and then loaded in the new instance.
This pertains specifically to upgrades of pppoe-server where we'd prefer NOT to disconnect ongoing sessions. In other words, when we want to replace the currently running pppoe-server instance with a newer (or technically older, or the same, should we so choose) version of pppoe-server, or when some dependent library was updated and we want to ensure that the newest versions of the software and libraries are running. This process should be the exception compared with changing the running configuration, though. Currently it involves shutting down pppoe-server, which terminates all running sessions.
The argument can be made that pppd instances may need to be restarted too, and this is true, but that we can stagger over time: force reconnection of a handful of customers at a time over several hours, so that if this triggers a problem we didn't anticipate we can stop the process and solve the problem whilst it's only affecting a handful of customers rather than potentially several hundred.
The other (potentially valid) approach could be to take the ostrich approach and simply not care if pppd instances are leaked (since IP allocations are delegated). At worst a sysadmin could manually kill the pppd's for those leaked sessions, and if they end up sending a PADT we simply ignore it, since we don't know about the session.
This also raises another possible approach: spawn a new pppoe-server and set the old one to drain+exit, so the old one can still manage the pppd pids for old sessions and, once they're all dead, simply terminate. This may also be a valid approach.
Currently we're discussing options internally around the upgrade / replace pppoe-server mechanism (which is something we require) as well, but we are also looking for your input, as you plainly have more experience with this than we do. As it stands we thus see a couple of "upgrade path" options:
Options two and three are primarily problematic if IP delegation to pppd is not being used. As can be seen, variations on the above paths are also possible. As it stands we need to add a bunch of interfaces again in the coming weeks, and if possible we'd like to take the opportunity to deploy at the very least the code that I'm busy getting ready for the PR. If possible, I would not mind also getting something live for a more sensible upgrade path going forward. |
Well, I still think that re-execing the program is a bad idea. I would rework the PR so it doesn't require that. Not having to re-exec will make everything else vastly simpler. |
Oh, I just saw your note about no-drop upgrades. IMO, that is a somewhat unrealistic goal. Would you try to implement no-drop kernel upgrades? The reality is that a short amount of downtime for software upgrades is pretty normal and expected. If you want to support this, you'd have to version the information passed between the old and the new program, because new versions of the server might have additional state that old versions don't know about. All-in-all, sounds vastly too complex to me. But if you really need that, then your option (1) is the only way. And if you want to preserve parent/child relationships, you can't fork the server... you can only exec a new copy so that it is still the parent of the pppd processes and will receive SIGCHLD signals when they exit. So that means dumping state to a file and restoring it from there. |
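To make the versioning point concrete, here is a rough sketch of a versioned state dump that a replacement binary could read back after an execve. Everything in it is an assumption for illustration: the record layout, `STATE_MAGIC`, `STATE_VERSION`, `dump_state` and `restore_state` are hypothetical and do not come from rp-pppoe. It also writes raw structs, which is only reasonable when the old and new binaries run on the same host with the same ABI.

```c
#include <stdint.h>
#include <stdio.h>

#define STATE_MAGIC   0x50504F45u /* arbitrary marker, hypothetical */
#define STATE_VERSION 1u          /* bump when the record layout changes */

struct state_header {
    uint32_t magic;
    uint32_t version;
    uint32_t count;   /* number of session records that follow */
};

/* Hypothetical minimum per-session state to carry over. */
struct session_rec {
    uint16_t session_id;
    uint8_t  peer_mac[6];
    int32_t  pppd_pid;
};

/* Write header plus records; returns 0 on success, -1 on error. */
static int dump_state(FILE *f, const struct session_rec *recs, uint32_t n)
{
    struct state_header h = { STATE_MAGIC, STATE_VERSION, n };
    if (fwrite(&h, sizeof h, 1, f) != 1)
        return -1;
    if (n && fwrite(recs, sizeof *recs, n, f) != n)
        return -1;
    return fflush(f) == 0 ? 0 : -1;
}

/* Read records back; returns the record count, or -1 if the file is
 * malformed, too large, or written by a newer, unknown version. */
static long restore_state(FILE *f, struct session_rec *recs, uint32_t max)
{
    struct state_header h;
    if (fread(&h, sizeof h, 1, f) != 1)
        return -1;
    if (h.magic != STATE_MAGIC || h.version > STATE_VERSION)
        return -1;
    if (h.count > max)
        return -1;
    if (h.count && fread(recs, sizeof *recs, h.count, f) != h.count)
        return -1;
    return (long)h.count;
}
```

In this sketch the old process would call `dump_state` just before execve and the new image would call `restore_state` at startup; because execve keeps the PID, the restored `pppd_pid` values still refer to children of the same process. A reader that meets a newer version number refuses the file rather than guessing at fields it does not know about.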
We realized the parent/child thing ... two options:
|
What is the point of the fork? Why not just dump state to a file, execve the program, and restore state from the file? |
Actually, there's a race condition. |
That's valid, thanks for the heads up. |
For anyone bumping into this, this can now be (to a degree) done by setting pppoe-server to drain via the control socket, and then bringing up an additional pppoe-server. This makes a few assumptions:
I'm still contemplating automating the process in some way, but that's no longer high on the priority list. |
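The drain behaviour mentioned above boils down to a small state machine: a draining server stops accepting new sessions and exits once the existing ones have ended. The sketch below only illustrates that logic; the names are hypothetical and it does not show the actual control-socket commands or pppoe-server internals.

```c
#include <stdbool.h>

/* Hypothetical model of a server that can be put into drain mode. */
enum server_mode { MODE_RUNNING, MODE_DRAINING };

struct server {
    enum server_mode mode;
    unsigned active_sessions;
};

/* Requested by the operator (for example over a control socket). */
static void server_start_drain(struct server *s)
{
    s->mode = MODE_DRAINING;
}

/* A draining server leaves new discovery requests to its replacement. */
static bool server_accepts_new(const struct server *s)
{
    return s->mode == MODE_RUNNING;
}

/* Called when a child pppd for one of our sessions has exited. */
static void server_session_ended(struct server *s)
{
    if (s->active_sessions > 0)
        s->active_sessions--;
}

/* Once draining and idle, the old instance can simply terminate. */
static bool server_should_exit(const struct server *s)
{
    return s->mode == MODE_DRAINING && s->active_sessions == 0;
}
```

The old instance stays the parent of its pppd processes throughout, so it keeps reaping them as usual while the additional pppoe-server picks up all new sessions.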
Another crazy idea, that may or may not be viable.
Given a restart signal, somehow encode data required to reconstruct state prior to an execve, such that the replacing process (which may be a newer version of pppoe-server) can reconstruct internal state, and continue on keeping existing sessions and interfaces.
For now, just let this rest here to think about. This would be non-trivial to say the least, and I don't see upgrades to pppoe-server itself being something that would happen frequently.