Automatic crash dumping and monitoring for "stuck processes" #2450

Open
ankush opened this issue Jan 30, 2025 · 0 comments
ankush (Member) commented Jan 30, 2025

This is "concepts of a plan", so I don't know specifics but...

We should automatically capture core dumps when processes misbehave, so we can do postmortem analysis.

Roughly:

  • Configure systemd-coredump so that dumps can be listed and retrieved with coredumpctl
  • Set up a monitoring worker that keeps checking the health of all services
  • If a service stays unhealthy for a long time, take a core dump (and eventually restart it)
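
The escalation logic in the steps above could look roughly like this. It is only a sketch: the class name, the thresholds, and the `dump`/`restart` callbacks are all hypothetical, and how health is actually checked is left to the caller.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class StuckServiceMonitor:
    """Tracks how long each service has been unhealthy and escalates.

    All names and thresholds here are illustrative assumptions.
    """
    dump_after: float                 # seconds unhealthy before a core dump
    restart_after: float              # seconds unhealthy before a restart
    dump: Callable[[str], None]       # hypothetical hook that captures a dump
    restart: Callable[[str], None]    # hypothetical hook that restarts a service
    clock: Callable[[], float] = time.monotonic
    _unhealthy_since: Dict[str, float] = field(default_factory=dict)
    _dumped: Set[str] = field(default_factory=set)

    def observe(self, service: str, healthy: bool) -> str:
        """Record one health check result and return the action taken."""
        now = self.clock()
        if healthy:
            # Recovery clears all state for this service.
            self._unhealthy_since.pop(service, None)
            self._dumped.discard(service)
            return "ok"

        since = self._unhealthy_since.setdefault(service, now)
        elapsed = now - since
        action = "waiting"

        # Dump at most once per unhealthy episode, and always before a restart.
        if elapsed >= self.dump_after and service not in self._dumped:
            self.dump(service)
            self._dumped.add(service)
            action = "dumped"

        if elapsed >= self.restart_after:
            self.restart(service)
            self._unhealthy_since.pop(service, None)
            self._dumped.discard(service)
            action = "restarted"

        return action
```

In a real deployment the `dump` hook might call something like `gcore` on the service's PID, or send a signal that makes the process abort so systemd-coredump picks it up. Which of those fits is an open question here.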

Why:

  • No need to debug under fire; we can investigate peacefully after the fact.

If not core dumps, we can also do a process-specific "dump state", like py-spy or a SIGUSR1 handler for Python processes. This doesn't reveal the whole picture, though.
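
For the SIGUSR1 option, one possible approach is Python's standard `faulthandler` module, which can write every thread's traceback when a signal arrives. This is a minimal sketch, not a settled design: `install_dump_handler` and the log path are hypothetical, and `faulthandler.register` is only available on Unix-like systems.

```python
import faulthandler
import signal


def install_dump_handler(path: str):
    """Dump every thread's traceback to `path` whenever SIGUSR1 arrives.

    `path` is a hypothetical log location. The file object is returned
    because faulthandler needs it to stay open while the handler is
    registered.
    """
    log = open(path, "a")
    faulthandler.register(signal.SIGUSR1, file=log, all_threads=True)
    return log
```

After this is installed at process start, `kill -USR1 <pid>` appends the current Python stacks to the log without stopping the process. `py-spy dump --pid <pid>` gives a similar snapshot from outside and needs no changes to the process itself.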
