Feature: OOM killer protection similar to sshd #580

Open
erpel opened this issue Jul 31, 2024 · 4 comments

Comments

@erpel

erpel commented Jul 31, 2024

I initially opened this against amazonlinux, but it makes more sense in this project:

When the system is under memory pressure, I have repeatedly seen ssm-agent get killed by the OOM killer. Once the agent is gone we can no longer log in, which makes it hard to observe and debug the situation.

I'd like ssm-agent to run with the same OOM killer protection that sshd applies to its own process (an oom_score_adj of -1000).

The alternative would be to stop using SSM for login and switch to SSH, but that adds the overhead of administering user accounts and SSH keys. SSM Session Manager is a useful feature that would really benefit from more effort on stability.

This old bug https://bugzilla.redhat.com/show_bug.cgi?id=1010429#c0 contains some details about how it used to work with sshd - especially making sure that user processes spawned by the "protected" server don't inherit the strict protection of oom_score_adj -1000.
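For illustration, the pattern described there (protect the server process, but reset the value for the processes it spawns) can be sketched in Python roughly as follows. This is a minimal sketch, not how sshd or ssm-agent is implemented: the function names and constants are hypothetical, and it assumes a Linux system where `/proc/self/oom_score_adj` is available.

```python
import subprocess
from pathlib import Path

SELF_OOM_PATH = "/proc/self/oom_score_adj"  # Linux procfs file for the calling process
PROTECTED = -1000  # value requested in this issue; exempts the process from the OOM killer
DEFAULT = 0        # neutral value for spawned user processes (an assumption of this sketch)


def read_oom_score_adj(path=SELF_OOM_PATH):
    """Return the current oom_score_adj as an int."""
    return int(Path(path).read_text().strip())


def write_oom_score_adj(value, path=SELF_OOM_PATH):
    """Write a new oom_score_adj; the kernel accepts -1000..1000."""
    if not -1000 <= value <= 1000:
        raise ValueError(f"oom_score_adj out of range: {value}")
    Path(path).write_text(f"{value}\n")


def spawn_unprotected(argv, child_value=DEFAULT, path=SELF_OOM_PATH):
    """Start a child whose oom_score_adj is reset before exec."""
    def reset_in_child():
        # Runs in the forked child, so /proc/self refers to the child here.
        write_oom_score_adj(child_value, path)
    return subprocess.Popen(argv, preexec_fn=reset_in_child)
```

A protected server would call `write_oom_score_adj(PROTECTED)` once at startup and use something like `spawn_unprotected([...])` for user sessions. Lowering the value typically requires root or `CAP_SYS_RESOURCE`; raising it back in the child does not.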

@CuriousDolphin

I have the same problem: when the machine runs out of RAM, the ssm agent is killed and the only way to recover is to restart it. Is there any news?

@gianniLesl
Contributor

What OS is this on? The amazon-ssm-agent process should always restart if it crashes or is killed.

@erpel
Author

erpel commented Oct 2, 2024

We've seen this on up-to-date versions of Amazon Linux 2023. Under high load, restarting does not seem to work reliably or takes a very long time, which makes instances hard to reach.

Manually adjusting the OOM killer score in the systemd unit file (using an override for example) does help, so I feel that ssm-agent setting this automatically on the important process(es) is a good solution.
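As a sketch of such an override, a systemd drop-in could look like the fragment below. The unit name `amazon-ssm-agent.service` and the drop-in file name are assumptions (they were not stated in this thread); `OOMScoreAdjust=` is the systemd setting that controls `oom_score_adj` for a service.

```ini
# /etc/systemd/system/amazon-ssm-agent.service.d/oom.conf  (assumed unit name and path)
[Service]
# Same value sshd uses for its own process, per the issue description
OOMScoreAdjust=-1000
```

The change takes effect after `systemctl daemon-reload` and a restart of the unit. Note that child processes inherit this value unless the service resets it for them, which is the concern raised in the first comment about user sessions.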

@h0tw1r3

h0tw1r3 commented Dec 6, 2024

This is a significant issue for us after switching from SSH to SSM.
