Elastic agent uses too much memory per Pod in k8s #5835

Open
swiatekm opened this issue Oct 23, 2024 · 17 comments
Labels
Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@swiatekm
Contributor

swiatekm commented Oct 23, 2024

In its default configuration, agent has the kubernetes provider enabled. In DaemonSet mode, this provider keeps track of data about Pods scheduled on the Node the agent is running on. This issue concerns the fact that the agent process itself uses an excessive amount of memory if the number of these Pods is high (for the purpose of this issue, this will mean close to the default Kubernetes limit of 110). This was originally discovered while troubleshooting #4729.

This effect is visible even if we disable all inputs and self-monitoring, leaving agent to run as a single process without any components. This strongly implies it has to do with configuration variable providers. I used this empty configuration in my testing to limit confounding variables from beats, but the effect is more pronounced when components using variables are present in the configuration.

Here's a graph of agent memory consumption as the number of Pods on the Node increases from 10 to 110:

Image

A couple of observations from looking at how configuration changes affect this behaviour:

  • Making the garbage collector more aggressive makes most of the effect disappear, as does restarting the Pod. This makes it very likely that the cause is allocation churn rather than steady-state heap utilization.
  • Disabling the host provider also reduces the effect greatly. When creating variables, each entry from a dynamic provider gets its own copy of data from context providers. On a large node, there can be quite a bit of host provider data. This is also visible when looking at variables.yml in diagnostics.
  • Increasing the debounce time on variable emission in the composable coordinator doesn't help much.
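The distinction between allocation churn and steady-state heap use can be observed with the Go runtime's own counters. The following is a minimal sketch, not code from the agent: the `churn` helper and the buffer sizes are made up for illustration. It compares the cumulative `TotalAlloc` counter with the live `HeapAlloc` value after a forced collection.

```go
package main

import (
	"fmt"
	"runtime"
)

// sink is package-level so that every buffer assigned to it escapes to the heap.
var sink []byte

// churn allocates n short-lived buffers of the given size. It returns the
// number of bytes allocated during the run and the change in live heap
// measured after a forced garbage collection.
func churn(n, size int) (allocated uint64, liveGrowth int64) {
	var before, after runtime.MemStats

	runtime.GC()
	runtime.ReadMemStats(&before)

	for i := 0; i < n; i++ {
		sink = make([]byte, size) // the previous buffer becomes garbage
	}
	sink = nil

	runtime.GC()
	runtime.ReadMemStats(&after)

	allocated = after.TotalAlloc - before.TotalAlloc
	liveGrowth = int64(after.HeapAlloc) - int64(before.HeapAlloc)
	return allocated, liveGrowth
}

func main() {
	allocated, live := churn(10000, 4096)
	fmt.Printf("bytes allocated during the run: %d\n", allocated)
	fmt.Printf("live heap change after GC:      %d\n", live)
}
```

A workload dominated by churn shows a large first number and a small second one, which matches the observation that a more aggressive collector or a restart hides most of the effect.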

Test setup

  • Single node KiND cluster, default settings.
  • Default standalone kubernetes manifest, all inputs removed from configuration.
  • A single nginx Deployment, starting at 0 replicas, and later scaled up to the maximum the Node allows.

More data

I resized my Deployment a couple times and looked at a heap profile of the agent process:

Image

The churn appears to be coming primarily from needing to recreate all the variables whenever a Pod is updated. The call to composable.cloneMap is where we copy data from the host provider.

Root cause

The root cause appears to be a combination of behaviours in the variable pipeline:

  1. All variables are recreated and emitted whenever there's any change to the underlying data. In Kubernetes, with a lot of Pods on a Node, changes can be quite frequent, and the amount of data is non-trivial.
  2. We copy all the data from all context providers to any dynamic provider mapping. If there are a lot of dynamic provider mappings (one for each Pod), this can be quite expensive.
  3. I suspect there's also more copying going on in component model generation, but I haven't looked into it too much.
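The first two points can be sketched in a few lines of Go. This is a simplified model under stated assumptions, not the agent's code: `cloneMap` here only stands in for `composable.cloneMap`, and `buildVars` and the sample data are hypothetical.

```go
package main

import "fmt"

// cloneMap copies nested maps recursively; all other values are copied by
// assignment. It is a stand-in for composable.cloneMap, whose real
// implementation may differ.
func cloneMap(src map[string]any) map[string]any {
	dst := make(map[string]any, len(src))
	for k, v := range src {
		if nested, ok := v.(map[string]any); ok {
			dst[k] = cloneMap(nested)
		} else {
			dst[k] = v
		}
	}
	return dst
}

// buildVars rebuilds one variable set per dynamic mapping (one per Pod).
// Every set gets its own copy of the context provider data.
func buildVars(context map[string]any, pods []map[string]any) (vars []map[string]any, clones int) {
	for _, pod := range pods {
		v := cloneMap(context)
		clones++
		v["kubernetes"] = pod
		vars = append(vars, v)
	}
	return vars, clones
}

func main() {
	context := map[string]any{
		"host": map[string]any{"name": "node-1", "architecture": "x86_64"},
	}
	pods := make([]map[string]any, 110)
	for i := range pods {
		pods[i] = map[string]any{"pod": map[string]any{"name": fmt.Sprintf("nginx-%d", i)}}
	}

	total := 0
	// Three Pod updates trigger three full rebuilds of every variable set.
	for update := 0; update < 3; update++ {
		_, clones := buildVars(context, pods)
		total += clones
	}
	fmt.Println("context copies made:", total)
}
```

In this model the number of context copies grows with the number of Pods multiplied by the number of updates, which is the pattern the heap profile points at.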
cmacknz added the Team:Elastic-Agent-Control-Plane label Oct 23, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@swiatekm
Contributor Author

swiatekm commented Oct 23, 2024

Potential fixes I thought of below. Not sure how feasible they are given the provider architecture, but Vars themselves seem flexible enough to permit them:

  1. Use a reference for the context provider mapping instead of copying it. If we can guarantee this won't be modified selectively by the receiver, it should work.
  2. Store vars instead of mappings in the dynamic provider state. Then, just emit the stored value when needed. Also only viable if we know this won't be modified. This would prevent us from regenerating everything on each change.
  3. dynamicProviderState.AddOrUpdate could similarly avoid some copies with additional assumptions about provider semantics.
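A rough sketch of how fixes 1 and 2 could fit together, assuming the context mapping is treated as read-only by every receiver. The `vars` and `state` types below are hypothetical and much simpler than the agent's real `Vars` and dynamic provider state.

```go
package main

import "fmt"

// vars pairs a shared context mapping with a single dynamic mapping.
// Hypothetical type for illustration only.
type vars struct {
	context map[string]any // shared by reference; receivers must not mutate it
	mapping map[string]any
}

// state caches the built vars per dynamic mapping ID.
type state struct {
	context map[string]any
	built   map[string]*vars
	rebuilt int // counts how many vars were constructed
}

func newState(context map[string]any) *state {
	return &state{context: context, built: map[string]*vars{}}
}

// addOrUpdate builds vars only for the mapping that changed.
func (s *state) addOrUpdate(id string, mapping map[string]any) {
	s.built[id] = &vars{context: s.context, mapping: mapping}
	s.rebuilt++
}

// emit returns the cached vars without copying any provider data.
func (s *state) emit() []*vars {
	out := make([]*vars, 0, len(s.built))
	for _, v := range s.built {
		out = append(out, v)
	}
	return out
}

func main() {
	host := map[string]any{"host": map[string]any{"name": "node-1"}}
	s := newState(host)
	for i := 0; i < 110; i++ {
		id := fmt.Sprintf("pod-%d", i)
		s.addOrUpdate(id, map[string]any{"name": fmt.Sprintf("nginx-%d", i)})
	}

	// A single Pod update rebuilds one entry instead of all 110.
	s.addOrUpdate("pod-7", map[string]any{"name": "nginx-7", "phase": "Running"})
	fmt.Println("vars built:", s.rebuilt, "vars emitted:", len(s.emit()))
}
```

The trade-off is the one named in the list: this only works if nothing downstream modifies the shared context or the stored vars.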

@blakerouse
Contributor

blakerouse commented Oct 23, 2024

@swiatekm You can assume that the data that is passed into the provider is not modified. The reason this code copies is because it ensures that the provider doesn't update the structure after sending to AddOrUpdate. It would be totally acceptable to lower memory usage to not do this, but providers need to be sure that it doesn't mutate the structure after calling AddOrUpdate (unless they call AddOrUpdate after the mutation).
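The contract described here can be illustrated with a toy store. This is only a sketch: the `mappings` type and both methods are hypothetical, and the copy is shallow for brevity, whereas the thread describes the real code as cloning the data.

```go
package main

import "fmt"

// mappings is a stand-in for a provider state store keyed by mapping ID.
type mappings map[string]map[string]any

// addOrUpdateCopy stores a shallow copy, so later changes to the caller's
// map do not leak into the store.
func (m mappings) addOrUpdateCopy(id string, data map[string]any) {
	cp := make(map[string]any, len(data))
	for k, v := range data {
		cp[k] = v
	}
	m[id] = cp
}

// addOrUpdateShared stores the caller's map directly. It saves the copy, but
// the caller must not mutate the map afterwards without calling it again.
func (m mappings) addOrUpdateShared(id string, data map[string]any) {
	m[id] = data
}

func main() {
	store := mappings{}
	pod := map[string]any{"phase": "Pending"}
	store.addOrUpdateCopy("copied", pod)
	store.addOrUpdateShared("shared", pod)

	// The provider mutates its map without calling AddOrUpdate again.
	pod["phase"] = "Running"

	fmt.Println("copied:", store["copied"]["phase"]) // still Pending
	fmt.Println("shared:", store["shared"]["phase"]) // silently became Running
}
```

Dropping the copy is safe only when every provider follows the rule of treating the data as immutable after the call.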

@faec
Contributor

faec commented Oct 23, 2024

One aspect of this issue is that the Agent Coordinator doesn't tell variable providers which variables it actually wants, which means providers need to collect all supported data even if it's known deterministically that it will never be needed. There's an issue already involving preprocessing variable substitutions in new policies, and a natural feature to include with it is to give providers an explicit list of variables that need to be monitored. In that case, an empty policy that doesn't actually use the kubernetes metadata could avoid querying it in the first place.
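As a rough illustration of what such preprocessing might look like, the sketch below scans a policy for `${provider.field}` references and returns the set of providers that are actually used. It assumes a simplified form of the variable syntax (alternatives separated by `|`, quoted constants as fallbacks); `referencedProviders` is a hypothetical helper, not an existing agent function.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// varRef matches ${...} and captures the text between the braces.
var varRef = regexp.MustCompile(`\$\{([^}]+)\}`)

// referencedProviders returns the sorted set of provider names (the first
// path segment) referenced by variables in the policy text.
func referencedProviders(policy string) []string {
	seen := map[string]bool{}
	for _, m := range varRef.FindAllStringSubmatch(policy, -1) {
		for _, alt := range strings.Split(m[1], "|") {
			alt = strings.TrimSpace(alt)
			if alt == "" || strings.HasPrefix(alt, "'") || strings.HasPrefix(alt, `"`) {
				continue // a quoted constant fallback, not a variable
			}
			if provider, _, found := strings.Cut(alt, "."); found {
				seen[provider] = true
			}
		}
	}
	out := make([]string, 0, len(seen))
	for p := range seen {
		out = append(out, p)
	}
	sort.Strings(out)
	return out
}

func main() {
	policy := `
inputs:
  - type: filestream
    paths:
      - /var/log/containers/*${kubernetes.container.id}.log
    processors:
      - add_fields:
          fields:
            name: ${host.name|'unknown'}
`
	fmt.Println(referencedProviders(policy))
}
```

With a list like this, a policy that references no `kubernetes` variables would give the coordinator enough information to leave that provider idle.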

@swiatekm
Contributor Author

swiatekm commented Oct 23, 2024

One aspect of this issue is that the Agent Coordinator doesn't tell variable providers which variables it actually wants, which means providers need to collect all supported data even if it's known deterministically that it will never be needed. There's an issue already involving preprocessing variable substitutions in new policies, and a natural feature to include with it is to give providers an explicit list of variables that need to be monitored. In that case, an empty policy that doesn't actually use the kubernetes metadata could avoid querying it in the first place.

I think most of this data is actually used by filebeat. See an example filebeat input:

data_stream:
  dataset: kubernetes.container_logs
id: kubernetes-container-logs-nginx-d556bf558-vcdpl-4bd5a70737faebfa2fbd4d34b9003cf8f32cd086787133590342e24d331da995
index: logs-kubernetes.container_logs-default
parsers:
  - container:
      format: auto
      stream: all
paths:
  - /var/log/containers/*4bd5a70737faebfa2fbd4d34b9003cf8f32cd086787133590342e24d331da995.log
processors:
  - add_fields:
      fields:
        input_id: filestream-container-logs-4b47f8c5-5515-4267-a33d-4fb64806f81c-kubernetes-f483454b-e8f2-42b5-8c22-4da229e86b8a.nginx
      target: '@metadata'
  - add_fields:
      fields:
        dataset: kubernetes.container_logs
        namespace: default
        type: logs
      target: data_stream
  - add_fields:
      fields:
        dataset: kubernetes.container_logs
      target: event
  - add_fields:
      fields:
        stream_id: kubernetes-container-logs-nginx-d556bf558-vcdpl-4bd5a70737faebfa2fbd4d34b9003cf8f32cd086787133590342e24d331da995
      target: '@metadata'
  - add_fields:
      fields:
        id: 5206cacb-562e-46a1-b256-0b6833a0d653
        snapshot: false
        version: 8.15.0
      target: elastic_agent
  - add_fields:
      fields:
        id: 5206cacb-562e-46a1-b256-0b6833a0d653
      target: agent
  - add_fields:
      fields:
        id: 4bd5a70737faebfa2fbd4d34b9003cf8f32cd086787133590342e24d331da995
        image:
          name: nginx:1.14.2
        runtime: containerd
      target: container
  - add_fields:
      fields:
        cluster:
          name: kind
          url: kind-control-plane:6443
      target: orchestrator
  - add_fields:
      fields:
        container:
          name: nginx
        labels:
          app: nginx
          pod-template-hash: d556bf558
        namespace: default
        namespace_labels:
          kubernetes_io/metadata_name: default
        namespace_uid: 9df9c3db-a0ca-426d-bbb5-0c63092a39ae
        node:
          hostname: kind-control-plane
          labels:
            beta_kubernetes_io/arch: amd64
            beta_kubernetes_io/os: linux
            kubernetes_io/arch: amd64
            kubernetes_io/hostname: kind-control-plane
            kubernetes_io/os: linux
            node-role_kubernetes_io/control-plane: ""
          name: kind-control-plane
          uid: 0b6d3cbf-7a86-4775-9351-86f5448c21d8
        pod:
          ip: 10.244.0.102
          name: nginx-d556bf558-vcdpl
          uid: f483454b-e8f2-42b5-8c22-4da229e86b8a
        replicaset:
          name: nginx-d556bf558
      target: kubernetes
  - add_fields:
      fields:
        annotations:
          elastic_co/dataset: ""
          elastic_co/namespace: ""
          elastic_co/preserve_original_event: ""
      target: kubernetes
  - drop_fields:
      fields:
        - kubernetes.annotations.elastic_co/dataset
      ignore_missing: true
      when:
        equals:
          kubernetes:
            annotations:
              elastic_co/dataset: ""
  - drop_fields:
      fields:
        - kubernetes.annotations.elastic_co/namespace
      ignore_missing: true
      when:
        equals:
          kubernetes:
            annotations:
              elastic_co/namespace: ""
  - drop_fields:
      fields:
        - kubernetes.annotations.elastic_co/preserve_original_event
      ignore_missing: true
      when:
        equals:
          kubernetes:
            annotations:
              elastic_co/preserve_original_event: ""
  - add_tags:
      tags:
        - preserve_original_event
      when:
        and:
          - has_fields:
              - kubernetes.annotations.elastic_co/preserve_original_event
          - regexp:
              kubernetes:
                annotations:
                  elastic_co/preserve_original_event: ^(?i)true$
prospector:
  scanner:
    symlinks: true
type: filestream

@faec
Contributor

faec commented Oct 23, 2024

I think most of this data is actually used by filebeat.

This could be true, but one thing that hasn't been obvious to me looking at examples is how much unused metadata implicitly comes with the current approach. E.g. in the example you give, even though it needs different categories of Kubernetes metadata for potentially many containers/pods, all the actual substituted fields together are still quite small, and would still not be a major memory footprint even with hundreds or thousands of pods/containers. (There may be limits on how well we can act on that, though, depending on what's cached by the Kubernetes support libraries themselves.)

Anyway, if at some point there's reason to think that pruning fields would help, it would definitely be a feasible modification for the Agent Coordinator to pass the relevant/minimal field list through to variable providers.

@alexsapran
Contributor

@swiatekm, your comment here #5835 (comment) seems to be related to something else https://github.com/elastic/ingest-dev/issues/2454#issuecomment-1737569099 that investigated the impact of map.Clone in the context of the processors. Taking a look at the history would help IMHO give some context.

@swiatekm
Contributor Author

@swiatekm, your comment here #5835 (comment) seems to be related to something else elastic/ingest-dev#2454 (comment) that investigated the impact of map.Clone in the context of the processors. Taking a look at the history would help IMHO give some context.

To be clear, I wasn't making a statement about the performance of filebeat, just that the generated configuration for it appears to use a lot of the Pod metadata. This issue is about the elastic-agent binary exclusively.

@blakerouse
Contributor

I think the complexity of only providing the fields that are used by the provider might outweigh the memory savings unless we could determine this to be a very large amount. The complexity comes when the policy changes and a new field is referenced that the provider is not supplying because it was omitted from the previous policy. The new set of fields then needs to be sent to the provider, and the provider needs to update all mappings with that new variable. I think that complexity might outweigh the benefit of such a change.

I think the patch of not cloning the mappings might be a large enough win that the need to omit fields might not be required.

I do think having #3609 solved would be very nice, and I think it is something that needs to be done for Hybrid OTel mode, because when the agent is only running OTel configuration it should not be running any providers. That would be a memory saving for all agents that are not referencing anything Kubernetes-related in the policy.

@swiatekm
Contributor Author

swiatekm commented Oct 28, 2024

Experimenting with removing var cloning brought me to an unusual discovery: On agent start, we spend a lot of memory on getting the host ip addresses. See these heap profiles:

Image
Image

I don't get why this would be the case. I verified we only call the host provider function once, and it allocating 33 MB while just getting IP addresses from Linux seems very excessive. But we only use Go stdlib functions to do this. Looking at interfaces via shell commands on the Node didn't yield anything unusual. I'll see if I can run a small program to show me if we're maybe iterating over a bunch of junk data - Kubernetes is known to mess with local networking a lot.

EDIT: Looks like this is a kind of N+1 problem with syscalls. The code lists interfaces, and then lists addresses for each interface. The second call is not cheap, so if we have a lot of interfaces on a machine - which happens on a K8s Node with a lot of Pods - it adds up to a fair amount of memory allocations. There is actually a go stdlib function that gets all the addresses in a single syscall, I'll try to use it and see if that helps.
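The two access patterns can be sketched with the Go standard library. The thread does not name the single-call function; `net.InterfaceAddrs` is assumed here because it returns the addresses of all interfaces at once. The helper names are made up for this example, and this is not the go-sysinfo code.

```go
package main

import (
	"fmt"
	"net"
)

// addrsPerInterface lists the interfaces and then asks each one for its
// addresses: one lookup for the list plus one lookup per interface.
func addrsPerInterface() ([]net.Addr, error) {
	ifaces, err := net.Interfaces()
	if err != nil {
		return nil, err
	}
	var all []net.Addr
	for _, iface := range ifaces {
		addrs, err := iface.Addrs()
		if err != nil {
			return nil, err
		}
		all = append(all, addrs...)
	}
	return all, nil
}

// addrsSingleCall fetches the addresses of every interface in one call.
func addrsSingleCall() ([]net.Addr, error) {
	return net.InterfaceAddrs()
}

// ipStrings extracts the IP part of each address, dropping the netmask.
func ipStrings(addrs []net.Addr) []string {
	ips := make([]string, 0, len(addrs))
	for _, a := range addrs {
		switch v := a.(type) {
		case *net.IPNet:
			ips = append(ips, v.IP.String())
		case *net.IPAddr:
			ips = append(ips, v.IP.String())
		}
	}
	return ips
}

func main() {
	perIface, err := addrsPerInterface()
	if err != nil {
		fmt.Println("per-interface lookup failed:", err)
		return
	}
	single, err := addrsSingleCall()
	if err != nil {
		fmt.Println("single-call lookup failed:", err)
		return
	}
	fmt.Println("per-interface lookup found", len(perIface), "addresses")
	fmt.Println("single-call lookup found ", len(single), "addresses")
	fmt.Println(ipStrings(single))
}
```

Both functions should report the same addresses on a host whose interfaces are not changing; the difference is in how many lookups are issued, which is what grows with the number of Pod interfaces on a Node.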

swiatekm added a commit to elastic/go-sysinfo that referenced this issue Oct 29, 2024
When fetching ip addresses for the host, we fetch all the network
interfaces, and then ip addresses for each interface. The latter call is
surprisingly expensive on unix, as it involves opening a netlink socket,
sending a request for routing information, and receiving and parsing the
response. If the host has a lot of network interfaces, this can eat
surprising amounts of memory - I got in the order of 10 MB on a
Kubernetes Node with 100 Pods. See
elastic/elastic-agent#5835 (comment)
for some heap profiles from elastic-agent.

Instead, get all the addresses in a single stdlib call. We don't
actually care about which interface each ip address is attached to, we
just want all of them.

I've tested this in the real world scenario discussed in
elastic/elastic-agent#5835, not sure how to
include a self-contained test in this repo.
@cmacknz
Member

cmacknz commented Oct 31, 2024

@andrewkroh any quick ideas about how to bring down the memory usage from syscall.NetlinkRIB coming through go-sysinfo's shared.Network() implementation? See the profile above in #5835 (comment).

@swiatekm
Contributor Author

@cmacknz elastic/go-sysinfo#246 should already help a lot here.

@cmacknz
Member

cmacknz commented Oct 31, 2024

Nice! How much of an improvement was that out of curiosity?

@swiatekm
Contributor Author

Nice! How much of an improvement was that out of curiosity?

Around 4x less memory in my environment.

@Marchelune

Marchelune commented Nov 7, 2024

Hi there,
I found this issue after realising my agent pods were getting OOMKilled because they exceeded the default (700Mi) memory limit.

Is there anything that can be done to reduce the memory used? Here is the result of a kubectl top pod -A command in a single-node AKS cluster:

NAMESPACE        NAME                                                   CPU(cores)   MEMORY(bytes)   
beats-logging    filebeat-test-application-logs-beat-filebeat-4wn76     1m           46Mi            
elastic-system   elastic-operator-0                                     2m           56Mi            
kube-system      azure-ip-masq-agent-nx2rn                              1m           12Mi            
kube-system      azure-wi-webhook-controller-manager-795dfd9c87-5tcsd   8m           15Mi            
kube-system      azure-wi-webhook-controller-manager-795dfd9c87-dk6n9   8m           15Mi            
kube-system      cloud-node-manager-sz9xt                               1m           25Mi            
kube-system      coredns-597bb9d4db-26n5f                               2m           44Mi            
kube-system      coredns-597bb9d4db-xdgjb                               2m           24Mi            
kube-system      coredns-autoscaler-689db4649c-lvjb6                    1m           10Mi            
kube-system      csi-azuredisk-node-pqmh2                               2m           45Mi            
kube-system      csi-azurefile-node-67b4x                               2m           53Mi            
kube-system      elastic-agent-rwmsp                                    28m          973Mi  <-- !!
kube-system      keda-admission-webhooks-7778cc48bd-jsjtj               1m           14Mi            
kube-system      keda-admission-webhooks-7778cc48bd-q6cdk               1m           13Mi            
kube-system      keda-operator-5c76fdd585-4v52h                         7m           99Mi            
kube-system      keda-operator-5c76fdd585-l9s2d                         5m           53Mi            
kube-system      keda-operator-metrics-apiserver-58c8cbcc85-77qrk       4m           46Mi            
kube-system      keda-operator-metrics-apiserver-58c8cbcc85-c62r9       5m           42Mi            
kube-system      konnectivity-agent-6cd9c4cd64-md2rm                    2m           24Mi            
kube-system      konnectivity-agent-6cd9c4cd64-pkxvw                    1m           14Mi            
kube-system      kube-proxy-wr6hz                                       1m           38Mi            
kube-system      kube-state-metrics-5bcd4898-2zrwr                      2m           35Mi            
kube-system      metrics-server-7b685846d6-bcnrh                        3m           36Mi            
kube-system      metrics-server-7b685846d6-h9jm2                        3m           39Mi     

The agent is at version 8.14.3, with both the System and Kubernetes integrations. I've removed the System integration to try to lower the memory footprint, and that does bring memory usage down to ~700Mi. Still, this is way above the older alternative of using Metricbeat paired with Kubernetes modules, which is about 85Mi per pod (granted, I haven't configured scheduler and controllermanager on the beat).
Is there anything we could do in the short term to lower the agent memory footprint?

For reference, the agent config (fleet-enrolled, standalone behaves the same):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: elastic-agent
  namespace: kube-system
  labels:
      app: elastic-agent
spec:
  selector:
      matchLabels:
          app: elastic-agent
  template:
      metadata:
          labels:
              app: elastic-agent
      spec:
          tolerations:
              - key: node-role.kubernetes.io/control-plane
                effect: NoSchedule
              - key: node-role.kubernetes.io/master
                effect: NoSchedule
          serviceAccountName: elastic-agent
          hostNetwork: true
          hostPID: true
          dnsPolicy: ClusterFirstWithHostNet
          containers:
              - name: elastic-agent
                image: docker.elastic.co/beats/elastic-agent:8.14.3
                env:
                    - name: FLEET_ENROLL
                      value: '1'
                    - name: FLEET_INSECURE
                      value: 'false'
                    - name: FLEET_URL
                      value: 'https://someid.fleet.westeurope.azure.elastic-cloud.com:443'
                    - name: FLEET_ENROLLMENT_TOKEN
                      value: 'the-token'
                    - name: NODE_NAME
                      valueFrom:
                          fieldRef:
                              fieldPath: spec.nodeName
                    - name: POD_NAME
                      valueFrom:
                          fieldRef:
                              fieldPath: metadata.name
                    - name: ELASTIC_NETINFO
                      value: 'false'
                securityContext:
                    runAsUser: 0                     
                resources:
                    # Commented-out the limits, otherwise we run into OOMkilling land.
                    requests:
                        cpu: 100m
                        memory: 400Mi
                volumeMounts:
                    - name: proc
                      mountPath: /hostfs/proc
                      readOnly: true
                    - name: cgroup
                      mountPath: /hostfs/sys/fs/cgroup
                      readOnly: true
                    - name: varlibdockercontainers
                      mountPath: /var/lib/docker/containers
                      readOnly: true
                    - name: varlog
                      mountPath: /var/log
                      readOnly: true
                    - name: etc-full
                      mountPath: /hostfs/etc
                      readOnly: true
                    - name: var-lib
                      mountPath: /hostfs/var/lib
                      readOnly: true
                    - name: etc-mid
                      mountPath: /etc/machine-id
                      readOnly: true
                    - name: sys-kernel-debug
                      mountPath: /sys/kernel/debug
                    - name: elastic-agent-state
                      mountPath: /usr/share/elastic-agent/state
                    
          volumes:
              - name: proc
                hostPath:
                    path: /proc
              - name: cgroup
                hostPath:
                    path: /sys/fs/cgroup
              - name: varlibdockercontainers
                hostPath:
                    path: /var/lib/docker/containers
              - name: varlog
                hostPath:
                    path: /var/log
              - name: etc-full
                hostPath:
                    path: /etc
              - name: var-lib
                hostPath:
                    path: /var/lib
              - name: etc-mid
                hostPath:
                    path: /etc/machine-id
                    type: File
              - name: sys-kernel-debug
                hostPath:
                    path: /sys/kernel/debug
              - name: elastic-agent-state
                hostPath:
                    path: /var/lib/elastic-agent-managed/kube-system/state-1
                    type: DirectoryOrCreate
              #- name: universal-profiling-cache
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: elastic-agent
subjects:
  - kind: ServiceAccount
    name: elastic-agent
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: elastic-agent
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: kube-system
  name: elastic-agent
subjects:
  - kind: ServiceAccount
    name: elastic-agent
    namespace: kube-system
roleRef:
  kind: Role
  name: elastic-agent
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: elastic-agent-kubeadm-config
  namespace: kube-system
subjects:
  - kind: ServiceAccount
    name: elastic-agent
    namespace: kube-system
roleRef:
  kind: Role
  name: elastic-agent-kubeadm-config
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: elastic-agent
  labels:
      k8s-app: elastic-agent
rules:
  - apiGroups: ['']
    resources:
        - nodes
        - namespaces
        - events
        - pods
        - services
        - configmaps
        - serviceaccounts
        - persistentvolumes
        - persistentvolumeclaims
    verbs: ['get', 'list', 'watch']
  #- apiGroups: [""]
  - apiGroups: ['extensions']
    resources:
        - replicasets
    verbs: ['get', 'list', 'watch']
  - apiGroups: ['apps']
    resources:
        - statefulsets
        - deployments
        - replicasets
        - daemonsets
    verbs: ['get', 'list', 'watch']
  - apiGroups:
        - ''
    resources:
        - nodes/stats
    verbs:
        - get
  - apiGroups: ['batch']
    resources:
        - jobs
        - cronjobs
    verbs: ['get', 'list', 'watch']
  - nonResourceURLs:
        - '/metrics'
    verbs:
        - get
  - apiGroups: ['rbac.authorization.k8s.io']
    resources:
        - clusterrolebindings
        - clusterroles
        - rolebindings
        - roles
    verbs: ['get', 'list', 'watch']
  - apiGroups: ['policy']
    resources:
        - podsecuritypolicies
    verbs: ['get', 'list', 'watch']
  - apiGroups: ['storage.k8s.io']
    resources:
        - storageclasses
    verbs: ['get', 'list', 'watch']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: elastic-agent
  namespace: kube-system
  labels:
      k8s-app: elastic-agent
rules:
  - apiGroups:
        - coordination.k8s.io
    resources:
        - leases
    verbs: ['get', 'create', 'update']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: elastic-agent-kubeadm-config
  namespace: kube-system
  labels:
      k8s-app: elastic-agent
rules:
  - apiGroups: ['']
    resources:
        - configmaps
    resourceNames:
        - kubeadm-config
    verbs: ['get']
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: elastic-agent
  namespace: kube-system
  labels:
      k8s-app: elastic-agent
---

@swiatekm
Contributor Author

Is there anything we could do in the short term to lower the agent memory footprint?

@Marchelune you could try setting GOGC to a lower-than-default value. It doesn't look like there's much we can do otherwise without overhauling the agent architecture, unfortunately. This overhaul is coming with our move to OpenTelemetry, but right now it's difficult to reduce the memory consumption given that every integration runs in a dedicated process.
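For context on what that knob does: the Go runtime reads the `GOGC` environment variable at start-up (the default is 100), and `debug.SetGCPercent` is the in-process equivalent. The sketch below only shows the runtime call; the value 25 is an arbitrary example, not a recommendation from this thread. For a deployed agent the variable would be set in the container's environment rather than in code.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// setGCPercent applies a new garbage collection target percentage and
// returns the previous setting. Lower values trigger collections sooner,
// trading CPU time for a smaller heap.
func setGCPercent(percent int) int {
	return debug.SetGCPercent(percent)
}

func main() {
	previous := setGCPercent(25)
	fmt.Printf("GC percent changed from %d to 25\n", previous)
}
```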

@swiatekm
Contributor Author

swiatekm commented Jan 3, 2025

In 8.16.2, which contains all of my improvements, the historical heap profile looks much more reasonable:

Image

Measuring memory consumption with 100 Pods on the Node, and Pods being created and deleted regularly, shows around 100 MB of savings. There isn't any noticeable change in a steady state without Pod changes, unfortunately.
