-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
watchfrr: force kill daemons on restart #17163
base: master
Are you sure you want to change the base?
Conversation
9c2550a
to
5e0d6d6
Compare
Note: this PR is also to start the discussion. I would definitely understand if this is not the way you want to do this, or if there are huge issues I missed with those changes. Anyways. This is becoming problematic at scale, because I've been hitting a lot of issues with bgpd becoming completely unresponsive. And bgpd unresponsive also means bgpd not handling signals. Using |
This seems like a band-aid solution. Instead of using a "hammer" with bgp, we should be looking at the reasons why it becomes unresponsive in the first place and fix those. |
While I do agree with you that this should ultimately be fixed (we are trying to reproduce the issue in a lab), as it stands watchfrr is nigh useless because it cannot restart a failed bgpd. So yes, this is a band-aid, but the mechanism is already here. And honestly I don't want to be woken up at 4AM because the auto-heal does not work (though I've already deployed this as a quick measure). It's not a solution, but it keeps production happy. Would having two hammers be better for you? Keeping the kill -2 for e.g. 12s, and going the kill -9 route if this fails? |
Yeah. That is more reasonable, give it 12 or maybe 30 seconds? |
Great! watchfrr has a by default 20s timeout. do you prefer I give the daemon a bit less than that (e.g. 15s), or raise the default watchfrr timeout to 35s and use the big gun after 30s? |
@Tuetuopay Let get this across the finish line. Going with 35s timeouts seem reasonable. |
5e0d6d6
to
643adff
Compare
@Jafaral I updated the commit. Now the watchfrr timeout for stop jobs is 35 seconds. frrcommon will send a regular SIGSTOP (good for graceful stop unrelated to watchfrr), and if the daemon did not stop after 30s, it will send a SIGKILL. This should, I hope, be better. Meanwhile, I will deploy in prod auto perf profiling when such hangs happen, to hopefully find where those hangs come from. EDIT: following the rebase and seeing 5fba3c4 raises the timeout to 90s, I changed the SIGSTOP timeout to 60s. |
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
Today, watchfrr sends a SIGSTOP to a misbehaving daemon through frrcommon. The issue is, a stuck daemon (like in a thread starvation situation) will not honor a SIGSTOP, and watchfrr will try indefinitely to kill it. frrcommon will now send a SIGSTOP, and if ineffective after 60 seconds, it will send a SIGKILL. Signed-off-by: Tuetuopay <tuetuopay@me.com>
643adff
to
7cf4d10
Compare
Today, watchfrr sends a SIGSTOP to a misbehaving daemon through frrcommon. The issue is, a stuck daemon (like in a thread starvation situation) will not honor a SIGSTOP, and watchfrr will try indefinitely to kill it.
Let's not waste time and kill -9 from the get go.