The agent should more clearly indicate when it or its sub-processes have been OOM killed on Kubernetes #3641

Open
Tracked by #3640
cmacknz opened this issue Oct 19, 2023 · 4 comments
Labels
Team:Elastic-Agent Label for the Agent team


@cmacknz
Member

cmacknz commented Oct 19, 2023

We need to make it easier to detect inadequate memory limits on Kubernetes, which are extremely common.

The agent should detect when its last state was OOMKilled and report its status as degraded. Detecting that an agent has been OOMKilled from diagnostics alone is not easy; it must be inferred from process restarts appearing in the agent diagnostics with no other plausible explanation.

Today the primary way for us to detect this is to instruct users to run kubectl describe pod and look for the following:

       Last State:   Terminated
       Reason:       OOMKilled
       Exit Code:    137

We should automate this process and have the agent read the last state and reason for itself and report it in the agent status report.
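As a rough illustration of that automation, the sketch below inspects a pod object serialized as JSON (for example the output of kubectl get pod <name> -o json) and lists the containers whose last state was terminated with reason OOMKilled. The field paths follow the Kubernetes Pod status schema; the type and function names are hypothetical, and how the agent would actually obtain its own pod object is left open.

```go
package main

import (
	"encoding/json"
)

// podLastState mirrors only the Pod fields needed for this check.
// The type name is illustrative, not taken from the agent code base.
type podLastState struct {
	Status struct {
		ContainerStatuses []struct {
			Name      string `json:"name"`
			LastState struct {
				// Terminated is nil when the container has no previous termination.
				Terminated *struct {
					Reason   string `json:"reason"`
					ExitCode int    `json:"exitCode"`
				} `json:"terminated"`
			} `json:"lastState"`
		} `json:"containerStatuses"`
	} `json:"status"`
}

// oomKilledContainers returns the names of containers whose last
// termination reason was OOMKilled.
func oomKilledContainers(podJSON []byte) ([]string, error) {
	var p podLastState
	if err := json.Unmarshal(podJSON, &p); err != nil {
		return nil, err
	}
	var names []string
	for _, cs := range p.Status.ContainerStatuses {
		if t := cs.LastState.Terminated; t != nil && t.Reason == "OOMKilled" {
			names = append(names, cs.Name)
		}
	}
	return names, nil
}
```

A non-empty result could then be used to mark the agent as degraded in its status report.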

We have also seen cases where the agent sub-processes are killed and restarted without the agent process itself being OOMKilled (because the sub-processes use more memory). We should double check that the OOMKilled reason appears on the pod when this happens.

The OOM kill event also appears in the node kernel logs if we end up needing to look there:

Mar 13 20:37:14 aks-default-32489819 kernel: [2442796.469054] Memory cgroup out of memory: Killed process 2532535 (filebeat) total-vm:2766604kB, anon-rss:1298484kB, file-rss:71456kB, shmem-rss:0kB, UID:0 pgtables:2992kB oom_score_adj:-997
Mar 13 20:37:14 aks-default-32489819 systemd[1]: cri-containerd-8a7c9177c7f2c619df882ecfebb3895c.scope: A process of this unit has been killed by the OOM killer.
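If we do end up looking at the kernel logs, a small parser along these lines could pull the killed PID and process name out of a line like the first one above. This is only a sketch: the pattern is based on that single sample, the kernel message format can differ between kernel versions, and the names are hypothetical.

```go
package main

import (
	"regexp"
	"strconv"
)

// oomKillRe matches the "Killed process <pid> (<name>)" part of a
// memory cgroup OOM message, as seen in the sample log line.
var oomKillRe = regexp.MustCompile(`Memory cgroup out of memory: Killed process (\d+) \(([^)]+)\)`)

// kernelOOMKill holds the fields extracted from one kernel log line.
type kernelOOMKill struct {
	PID  int
	Name string
}

// parseKernelOOMKill reports whether line is a cgroup OOM kill message
// and, if so, which process was killed.
func parseKernelOOMKill(line string) (kernelOOMKill, bool) {
	m := oomKillRe.FindStringSubmatch(line)
	if m == nil {
		return kernelOOMKill{}, false
	}
	pid, err := strconv.Atoi(m[1])
	if err != nil {
		return kernelOOMKill{}, false
	}
	return kernelOOMKill{PID: pid, Name: m[2]}, true
}
```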
@cmacknz cmacknz added the Team:Elastic-Agent Label for the Agent team label Oct 19, 2023
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@cmacknz cmacknz added the Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team label Oct 19, 2023
@cmacknz cmacknz changed the title from "Report the agent status as degraded when previously OOMKilled on Kubernetes" to "The agent should more clearly indicate when it or its sub-processes have been OOM killed on Kubernetes" Mar 20, 2024
@jlind23 jlind23 removed the Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team label Mar 20, 2024
@cmacknz
Member Author

cmacknz commented Mar 20, 2024

I think we will need to experiment with a few different scenarios to test this properly:

  • The agent container going over its configured memory limit because one of the sub-processes (e.g. Filebeat) is using too much memory.
  • The agent container staying under its limit, but the node it is running on running out of memory. This can be triggered by having the individual containers stay under a large memory limit while the sum of their actual memory consumption is greater than the memory available on the node.
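To make the difference between the two scenarios concrete, here is a small arithmetic sketch. The type and function names are made up for illustration, and the numbers in the usage note are invented rather than measured.

```go
package main

// containerMem describes one container's memory in bytes.
// All names here are illustrative.
type containerMem struct {
	Limit uint64 // configured memory limit
	Usage uint64 // actual memory consumption
}

// overOwnLimit is the first scenario: the container exceeds its own limit.
func overOwnLimit(c containerMem) bool {
	return c.Usage > c.Limit
}

// nodeOvercommitted is the second scenario: every container stays under
// its limit, but their combined usage exceeds the memory on the node.
func nodeOvercommitted(cs []containerMem, nodeMemory uint64) bool {
	var total uint64
	for _, c := range cs {
		if overOwnLimit(c) {
			return false
		}
		total += c.Usage
	}
	return total > nodeMemory
}
```

For example, two containers each using 5 GiB under a 6 GiB limit fit their own limits, yet together need 10 GiB, so on a node with 8 GiB nodeOvercommitted returns true.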

@leehinman
Contributor

Just so we don't forget: if the ExitCode is -1, that signals that the "process hasn't exited or was terminated by a signal". We currently just log the ExitCode if a subprocess exits. When the exit code is -1, we could add to the error message that this is potentially an OOM kill, or at least that the process is being killed via an external mechanism.

@cmacknz
Member Author

cmacknz commented May 8, 2024

The reporting we get from k8s when a pod is OOMKilled differs based on the Kubernetes version.

Starting from Kubernetes 1.28 the memory.oom.group feature of cgroups v2 is turned on by default, so all processes in the container cgroup are killed together when any one of them is OOM killed, and the pod reports OOMKilled.

Prior versions have memory.oom.group turned off, so when only a sub-process is killed the pod won't be annotated with the OOMKilled last exit reason. Most of our memory consumption happens in sub-processes so we hit this situation frequently.

Kubernetes change log for reference: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.28.md

If using cgroups v2, then the cgroup aware OOM killer will be enabled for container cgroups via memory.oom.group. This causes processes within the cgroup to be treated as a unit and killed simultaneously in the event of an OOM kill on any process in the cgroup. (#117793, @tzneal) [SIG Apps, Node and Testing]
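For the case where only a sub-process is killed and the pod is not annotated, one option not discussed above would be to read the oom_kill counter from the container's cgroup v2 memory.events file. The sketch below only parses the file contents. The file location inside the container (commonly /sys/fs/cgroup/memory.events) is an assumption, the function name is hypothetical, and the counter is tied to the cgroup, so it would likely not survive a container restart.

```go
package main

import (
	"bufio"
	"strconv"
	"strings"
)

// oomKillCount extracts the oom_kill counter from the contents of a
// cgroup v2 memory.events file. The second return value is false when
// the counter is missing or malformed.
func oomKillCount(memoryEvents string) (uint64, bool) {
	sc := bufio.NewScanner(strings.NewReader(memoryEvents))
	for sc.Scan() {
		// Each line has the form "<key> <value>".
		fields := strings.Fields(sc.Text())
		if len(fields) == 2 && fields[0] == "oom_kill" {
			n, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				return 0, false
			}
			return n, true
		}
	}
	return 0, false
}
```

A counter that increases between two reads would suggest that some process in the container cgroup was OOM killed in the meantime.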
