[Meta]Investigate resource consumption of Elastic Agent with K8s Integration #3801

gizas · 2023-11-22T14:23:30Z

Background

Recent issues such as 3863, 3991 and 4081 showed that installing Elastic Agent with the default configuration of our Kubernetes Integration can leave our customers in bad situations (sometimes even with broken k8s clusters). Many details and variables affect the final setup and installation of our observability solution, and we try to summarise and list them here.

Goals

This issue tries to summarise the next actions we need in order to investigate:

The current resource consumption of default Elastic Agent with K8s Integration
Alternative options we can offer to minimise the resource consumption impact on the k8s cluster, across different k8s environments and customer setups.

Actions

Current Actions

We have observed until now that:
a) Memory consumption of Elastic Agent has increased from version 8.8 to 8.9 and later (Relevant: https://github.com/elastic/sdh-beats/issues/3863#issuecomment-1733750863)
b) The number of API calls towards the Kubernetes Control API has increased since version 8.9 (See Salesforce 01507229 regarding Elastic Agent overloading the Kubernetes API server: https://github.com/elastic/sdh-beats/issues/3991#issuecomment-1787648161)
c) CPU consumption has been referred here as a concern, although it is not such a big issue at the moment and not the first priority.

Until now:

Since 8.11 we have updated elastic-agent-autodiscover to v0.6.4 (beats PR), which disables metadata for deployments and cronjobs: pods created from deployments or cronjobs will not have the extra kubernetes.deployment or kubernetes.cronjob metadata field.
We have merged leader election configuration variables
We have proposed a way to disable Leader Election in Managed Elastic Agents (See here)

Next Planned Actions

Future Plans/Actions

axw · 2023-11-23T02:46:34Z

Run tests in real k8s clusters and retrieve diagnostics from Agent trying to investigate memory consumption

Once we've resolved the issues (or earlier, if resolving them is not straightforward and we need to iterate): I think we should also figure out how to reliably reproduce the issues in an ephemeral cluster, ideally with some automation in place to create the cluster and whatever workload is necessary to trigger the issues (e.g. create a bunch of deployments/pods/whatever).

Then we can:

consider performing those tests regularly to ensure we don't regress
more rapidly iterate on improvements and bug fixes

gizas · 2023-11-23T09:47:57Z

Thanks @axw, I have updated the Next actions section a bit and added some previous ideas/issues that we can investigate here

lucabelluccini · 2024-01-05T14:44:49Z

As a short-term measure, can we somehow document the known issues / limitations we have faced so far?

dimm0 · 2024-06-06T21:46:22Z

Is there progress in the latest version or is it still destroying the k8s master? I've disabled elastic in our cluster a while ago, checking if there's any progress so far. I can't really tell whether things would improve if I upgrade.

cmacknz · 2024-06-07T15:25:56Z

We have tracked down the source of the high memory usage on k8s and are working to fix it. #4729 is the tracking issue.

dimm0 · 2024-06-07T18:57:46Z

And what about rate-limiting the k8s apiserver requests? Is any work going on for that?

gizas · 2024-06-11T09:03:38Z

what about rate-limiting the k8s apiserver requests

Regarding rate limiting, the main issue is this, which is not yet prioritised in the next iterations. But it is certainly in our backlog.

Somewhat related, we have already merged 3625 in order to minimise any possible effect of leader election API calls. Additionally, since 8.14.0 we have done a major refactoring in 37243, which we have shown helps the overall resource consumption.

constanca-m · 2024-09-13T05:47:42Z

Test setup

I have run a script to evaluate the performance of our K8s integration. I evaluated all 8.x.0 versions between 8.5.0 and 8.15.0.

The test increases the number of pods in a one-node cluster in these steps: 12, 61, 111, 161, 211, 311, 411, and 511.

I recorded the following results after 5 min for each cycle:

Pods: number of pods in the cluster.
CPU: CPU usage of EA.
Memory: Memory usage of EA.
EA pod restarts: Restarts of EA so far.

Once the EA restarts, I stop recording results for the upcoming pod increases, since the performance is no longer stable.
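
The stop condition described above could be automated with a small check on the restart count. This is only a minimal sketch: `ea_has_restarted` is a hypothetical helper name, and it assumes the value passed in is the RESTARTS column as printed by `kubectl get pods` (for example `0` or `2`).

```shell
# Hypothetical helper (an assumption, not an existing function):
# succeeds when the given restart count is greater than zero.
ea_has_restarted () {
  # $1 - restart count as printed by kubectl, e.g. "0" or "2"
  case "$1" in
    ''|*[!0-9]*) return 1 ;;  # empty or non-numeric: treat as "no restart seen"
  esac
  [ "$1" -gt 0 ]
}

# Possible use inside the test loop, after the results for the cycle are written:
#   if ea_has_restarted "$restarts"; then break; fi
```

Placing the check after the results are written keeps the row for the cycle in which the restart happened and skips only the following cycles.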

This is the script I am running for the tests.

setup_cluster () {
   kind delete cluster
   kind create cluster
   # This is so we can execute kubectl top
   kubectl apply -f https://raw.githubusercontent.com/pythianarora/total-practice/master/sample-kubernetes-code/metrics-server.yaml
}

test_n_pods () {
  # $1 - EA filename to use in kubectl apply
  # $2 - filename for the results
  # Prepare cluster with EA using kubernetes + system policy
  setup_cluster
  kubectl apply -f "$1"

  echo "| Pods | CPU | Memory | EA pod restarts |" > "$2"
  echo "|------|-----|--------|-----------------|" >> "$2"

  for replicas in 1 50 100 150 200 300 400 500 ;
    do
      # On the first cycle the deployment does not exist yet, so ignore "not found"
      kubectl delete -f nginx-pod.yaml --ignore-not-found
      sed -i -e "s/  replicas: .*/  replicas: $replicas/g" nginx-pod.yaml
      kubectl apply -f nginx-pod.yaml
      sleep 5m

      top=$(kubectl top pods -n kube-system | grep '^elastic')
      pods=$(kubectl get pods --no-headers --all-namespaces | wc -l)
      line=$(kubectl get pods -o wide --all-namespaces | awk '$2 ~ /^elastic/')
      restarts=$(echo "$line" | awk '{print  $5}')

      print_results_to_file "$pods" "$top" "$restarts" "$2"
    done
}

print_results_to_file () {
    # Gets arguments:
    # $1 = number of pods
    # $2 = kubectl top result
    # $3 = number of EA restarts
    # $4 = results filename

    # Parse result of kubectl top (example 'elastic-agent-985zk                          16m          583Mi')
    cpu=$(echo "$2" | awk '{print  $2}')
    memory=$(echo "$2" | awk '{print  $3}')
    echo "| $1 | $cpu | $memory | $3 |" >> "$4"
}

# Test the performance by running test_n_pods. Change the arguments to your own.
test_n_pods "<DEPLOYMENT EA FILE GOES HERE>" "<RESULTS FILENAME GOES HERE>"
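
When comparing the result tables of several versions, it can help to strip the units that `kubectl top` prints. The following is a rough sketch under the assumption that CPU is reported in millicores (`16m`) and memory in mebibytes (`583Mi`), as in the example line in the script. `cpu_to_millicores` and `mem_to_mib` are hypothetical helper names, and any other value is passed through unchanged.

```shell
# Hypothetical helpers (assumptions): reduce kubectl top values to plain numbers.
cpu_to_millicores () {
  # $1 - CPU value from kubectl top, e.g. "16m"
  case "$1" in
    *m) echo "${1%m}" ;;   # drop the trailing "m"
    *)  echo "$1" ;;       # unexpected format: leave as-is
  esac
}

mem_to_mib () {
  # $1 - memory value from kubectl top, e.g. "583Mi"
  case "$1" in
    *Mi) echo "${1%Mi}" ;; # drop the trailing "Mi"
    *)   echo "$1" ;;      # unexpected format: leave as-is
  esac
}

# Example: prints "16 583"
echo "$(cpu_to_millicores 16m) $(mem_to_mib 583Mi)"
```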

This is the NGINX pod deployment I use in the script.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 500
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.14.2
          ports:
            - containerPort: 80