watchfrr: force kill daemons on restart #17163

Tuetuopay · 2024-10-18T13:32:58Z

Today, watchfrr sends a SIGINT (kill -2) to a misbehaving daemon through frrcommon. The issue is that a stuck daemon (for example, one suffering from thread starvation) never gets around to handling the SIGINT, so watchfrr will keep trying to stop it indefinitely.

Let's not waste time and kill -9 from the get go.

Tuetuopay · 2024-10-18T13:37:48Z

Note: this PR is also to start the discussion. I would definitely understand if this is not the way you want to do this, or if there are huge issues I missed with those changes.

Anyways.

This is becoming problematic at scale, because I've been hitting a lot of issues with bgpd becoming completely unresponsive. An unresponsive bgpd is also a bgpd that does not handle signals. Using kill -2 is way too polite and will never recover from such a case. Literally today I had a route reflector where watchfrr tried for almost two hours to restart bgpd, to no avail (and it's not like waiting any longer would have fixed anything).

Jafaral · 2024-10-21T18:45:39Z

This seems like a band-aid solution. Instead of using a "hammer" with bgp, we should be looking at the reasons why it becomes unresponsive in the first place and fix those.

Tuetuopay · 2024-10-22T09:58:28Z

> This seems like a band-aid solution. Instead of using a "hammer" with bgp, we should be looking at the reasons why it becomes unresponsive in the first place and fix those.

While I do agree with you that this should ultimately be fixed (we are trying to reproduce the issue in a lab), as it stands watchfrr is nigh useless because it cannot restart a failed bgpd. So yes, this is a band-aid, but the mechanism is already here. And honestly I don't want to be woken up at 4AM because the auto-heal does not work (though I've already deployed this as a quick measure). It's not a solution, but it keeps production happy.

Would having two hammers be better for you? Keeping the kill -2 for e.g. 12s, and going the kill -9 route if this fails?

Jafaral · 2024-10-23T16:06:49Z

> Would having two hammers be better for you? Keeping the kill -2 for e.g. 12s, and going the kill -9 route if this fails?

Yeah. That is more reasonable, give it 12 or maybe 30 seconds?

Tuetuopay · 2024-10-23T16:44:31Z

Great!

watchfrr has a 20s timeout by default. Do you prefer that I give the daemon a bit less than that (e.g. 15s), or that I raise the default watchfrr timeout to 35s and use the big gun after 30s?

Jafaral · 2025-01-07T02:29:14Z

@Tuetuopay Let's get this across the finish line. Going with a 35s timeout seems reasonable.

Tuetuopay · 2025-01-07T14:10:23Z

@Jafaral I updated the commit.

Now the watchfrr timeout for stop jobs is 35 seconds. frrcommon will send a regular SIGINT (good for a graceful stop unrelated to watchfrr), and if the daemon has not stopped after 30s, it will send a SIGKILL.

This should, I hope, be better.

Meanwhile, I will deploy automatic perf profiling in prod that triggers when such hangs happen, to hopefully find where they come from.

EDIT: following the rebase, and seeing that 5fba3c4 raises the timeout to 90s, I changed the SIGINT timeout to 60s.

github-actions · 2025-01-07T14:12:49Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Today, watchfrr sends a SIGINT (kill -2) to a misbehaving daemon through frrcommon. The issue is that a stuck daemon (for example, one suffering from thread starvation) never gets around to handling the SIGINT, so watchfrr will keep trying to stop it indefinitely.

frrcommon will now send a SIGINT, and if that is ineffective after 60 seconds, it will send a SIGKILL.

Signed-off-by: Tuetuopay <tuetuopay@me.com>

frrbot bot added the watchfrr label Oct 18, 2024

github-actions bot added master size/XS labels Oct 18, 2024

Tuetuopay force-pushed the watchfrr-force-kill branch from 9c2550a to 5e0d6d6 Compare October 18, 2024 13:34

Tuetuopay force-pushed the watchfrr-force-kill branch from 5e0d6d6 to 643adff Compare January 7, 2025 14:08

github-actions bot added size/S rebase PR needs rebase and removed size/XS labels Jan 7, 2025

github-actions bot added the conflicts label Jan 7, 2025

Tuetuopay force-pushed the watchfrr-force-kill branch from 643adff to 7cf4d10 Compare January 7, 2025 14:14

github-actions bot added size/XS and removed size/S conflicts labels Jan 7, 2025
