Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

watchfrr: force kill daemons on restart #17163

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Tuetuopay
Copy link
Contributor

Today, watchfrr sends a SIGSTOP to a misbehaving daemon through frrcommon. The issue is, a stuck daemon (like in a thread starvation situation) will not honor a SIGSTOP, and watchfrr will try indefinitely to kill it.

Let's not waste time and kill -9 from the get go.

@Tuetuopay
Copy link
Contributor Author

Note: this PR is also to start the discussion. I would definitely understand if this is not the way you want to do this, or if there are huge issues I missed with those changes.

Anyways.

This is becoming problematic at scale, because I've been hitting a lot of issues with bgpd becoming completely unresponsive. And bgpd unresponsive also means bgpd not handling signals. Using kill -2 is way too polite and will never recover from such a case. Literally today I've had a route-reflector where watchfrr tries for almost two hours to restart bgpd, to no avail (and it's not like waiting any longer would have fixed anything).

@Jafaral
Copy link
Member

Jafaral commented Oct 21, 2024

This seems like a band-aid solution. Instead of using a "hammer" with bgp, we should be looking at the reasons why it becomes unresponsive in the first place and fix those.

@Tuetuopay
Copy link
Contributor Author

Tuetuopay commented Oct 22, 2024

This seems like a band-aid solution. Instead of using a "hammer" with bgp, we should be looking at the reasons why it becomes unresponsive in the first place and fix those.

While I do agree with you that this should ultimately be fixed (we are trying to reproduce the issue in a lab), as it stands watchfrr is nigh useless because it cannot restart a failed bgpd. So yes, this is a band-aid, but the mechanism is already here. And honestly I don't want to be woken up at 4AM because the auto-heal does not work (though I've already deployed this as a quick measure). It's not a solution, but it keeps production happy.

Would having two hammers be better for you? Keeping the kill -2 for e.g. 12s, and going the kill -9 route if this fails?

@Jafaral
Copy link
Member

Jafaral commented Oct 23, 2024

Would having two hammers be better for you? Keeping the kill -2 for e.g. 12s, and going the kill -9 route if this fails?

Yeah. That is more reasonable, give it 12 or maybe 30 seconds?

@Tuetuopay
Copy link
Contributor Author

Great!

watchfrr has a by default 20s timeout. do you prefer I give the daemon a bit less than that (e.g. 15s), or raise the default watchfrr timeout to 35s and use the big gun after 30s?

@Jafaral
Copy link
Member

Jafaral commented Jan 7, 2025

@Tuetuopay Let get this across the finish line. Going with 35s timeouts seem reasonable.

@Tuetuopay Tuetuopay force-pushed the watchfrr-force-kill branch from 5e0d6d6 to 643adff Compare January 7, 2025 14:08
@github-actions github-actions bot added size/S rebase PR needs rebase and removed size/XS labels Jan 7, 2025
@Tuetuopay
Copy link
Contributor Author

Tuetuopay commented Jan 7, 2025

@Jafaral I updated the commit.

Now the watchfrr timeout for stop jobs is 35 seconds. frrcommon will send a regular SIGSTOP (good for graceful stop unrelated to watchfrr), and if the daemon did not stop after 30s, it will send a SIGKILL.

This should, I hope, be better.

Meanwhile, I will deploy in prod auto perf profiling when such hangs happen, to hopefully find where those hangs come from.

EDIT: following the rebase and seeing 5fba3c4 raises the timeout to 90s, I changed the SIGSTOP timeout to 60s.

Copy link

github-actions bot commented Jan 7, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Today, watchfrr sends a SIGSTOP to a misbehaving daemon through
frrcommon. The issue is, a stuck daemon (like in a thread starvation
situation) will not honor a SIGSTOP, and watchfrr will try indefinitely
to kill it.

frrcommon will now send a SIGSTOP, and if ineffective after 60 seconds,
it will send a SIGKILL.

Signed-off-by: Tuetuopay <tuetuopay@me.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants