Automatic crash dumping and monitoring for "stuck processes" #2450

ankush · 2025-01-30T08:55:03Z

This is "concepts of a plan", so I don't know specifics but...

We should have some kind of automated core dumping when processes behave badly, so that we can do postmortem analysis afterwards.

Roughly:

Configure coredumpctl
Set up a monitoring worker that keeps checking the health of all services
If a service isn't healthy for a long time, take a coredump (and eventually restart it)
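To make the rough steps above a bit more concrete, here is a minimal sketch of the "unhealthy for a long time" check. Everything named here is hypothetical (`StuckProcessMonitor`, `is_healthy`, `on_stuck`, the 300 second threshold); the issue does not specify any of it. `abort_for_coredump` assumes the kernel's core handler (for example systemd-coredump) is already configured, since SIGABRT only produces a dump when core dumps are enabled for that process.

```python
import os
import signal
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


def abort_for_coredump(pid: int) -> None:
    # Assumption: the target does not handle SIGABRT, so the default action
    # (terminate and dump core) applies and the configured core handler
    # picks the dump up. The process dies; a supervisor must restart it.
    os.kill(pid, signal.SIGABRT)


@dataclass
class StuckProcessMonitor:
    # Hypothetical health probe, e.g. an HTTP ping or a queue heartbeat.
    is_healthy: Callable[[str], bool]
    # Hypothetical action, e.g. look up the PID and call abort_for_coredump.
    on_stuck: Callable[[str], None]
    # How long a service may stay unhealthy before we act (seconds).
    unhealthy_after: float = 300.0
    # Injectable clock so the logic can be tested without sleeping.
    clock: Callable[[], float] = time.monotonic
    _unhealthy_since: Dict[str, float] = field(default_factory=dict, init=False)

    def check(self, service: str) -> bool:
        """Probe one service; return True if on_stuck was triggered."""
        now = self.clock()
        if self.is_healthy(service):
            # Healthy again: forget any earlier failure streak.
            self._unhealthy_since.pop(service, None)
            return False
        first_seen = self._unhealthy_since.setdefault(service, now)
        if now - first_seen >= self.unhealthy_after:
            self.on_stuck(service)
            # Reset so we do not fire again on every subsequent poll.
            self._unhealthy_since.pop(service, None)
            return True
        return False
```

A worker would call `check()` for every service in a loop with a sleep in between. Tracking the start of the failure streak, instead of acting on a single failed probe, keeps short blips from triggering a dump.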

Why:

No need to debug under fire; we can investigate things peacefully afterwards.

If not coredumps, we can also do a process-specific "dump state", like py-spy or a SIGUSR1 handler for Python processes. This doesn't reveal the whole picture though.
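For the SIGUSR1 option, one possible sketch uses the standard library's `faulthandler` module, which can dump the Python tracebacks of all threads when a signal arrives. The function name `install_dump_handler` is made up for illustration, and this assumes a Unix platform (`faulthandler.register` is not available on Windows).

```python
import faulthandler
import signal
import sys


def install_dump_handler(stream=sys.stderr) -> None:
    # On SIGUSR1, write the traceback of every Python thread to `stream`.
    # `stream` must be backed by a real file descriptor (it needs fileno()).
    # The process keeps running after the dump.
    faulthandler.register(signal.SIGUSR1, file=stream, all_threads=True)
```

After calling `install_dump_handler()` at startup, `kill -USR1 <pid>` prints the stacks without stopping the process. `py-spy dump --pid <pid>` gives a similar view from outside, with no changes to the target. Both show only Python-level stacks, which is the "not the whole picture" limitation mentioned above.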
