
{WIP} To run some scale tests to understand the recent failures #16296

Closed
wants to merge 1 commit

Conversation

hakuna-matatah
Contributor

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 29, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zetaab for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2

@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2

Testing on a small scale to see if the prom stack comes up properly.

@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

@hakuna-matatah
Contributor Author

/retest

@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2

Testing on a small scale to see if the prom stack comes up properly.

It looks like Prometheus is coming up fine in the small-scale tests:

W0129 20:28:32.119620   20929 framework.go:291] Skipping empty manifest default/prometheus-serviceMonitorKubelet.yaml
I0129 20:28:32.155996   20929 framework.go:276] Applying templates for "master-ip/*.yaml"
W0129 20:28:32.174218   20929 warnings.go:70] unknown field "spec.selector.k8s-app"
W0129 20:28:32.182250   20929 util.go:72] error while calling prometheus api: the server is currently unable to handle the request (get services http:prometheus-k8s:9090), response: Status Failure: no endpoints available for service "prometheus-k8s" (ServiceUnavailable)
I0129 20:29:32.379189   20929 framework.go:276] Applying templates for "manifest/exec_deployment.yaml"
W0129 20:29:42.457062   20929 imagepreload.go:92] No images specified. Skipping image preloading
I0129 20:29:42.499512   20929 clusterloader.go:445] Test config successfully dumped to: /logs/artifacts/generatedConfig_load.yaml
I0129 20:29:42.499545   20929 clusterloader.go:238] --------------------------------------------------------------------------------
I0129 20:29:42.499551   20929 clusterloader.go:239] Running /home/prow/go/src/k8s.io/perf-tests/clusterloader2/testing/load/config.yaml
I0129 20:29:42.499556   20929 clusterloader.go:240] --------------------------------------------------------------------------------
I0129 20:29:42.649101   20929 framework.go:276] Applying templates for "manifests/dnsLookup/*yaml"
I0129 20:29:42.651600   20929 framework.go:276] Applying templates for "manifests/*.yaml"
I0129 20:31:10.253880   20929 shared_informer.go:240] Waiting for caches to sync for PodsIndexer
I0129 20:31:10.654823   20929 shared_informer.go:247] Caches are synced for PodsIndexer 
I0129 20:36:43.528751   20929 phase_latency.go:146] PodStartupLatency: perc50: 4s, perc90: 8s, perc99: 9s

This tells me there might be some resource constraint at large scale (e.g. 5k nodes) that is causing Prometheus not to come up. But we schedule the Prometheus pod on the control-plane node, which is an 18xlarge instance, so it would be surprising if it were suffering from resource constraints, unless it has recently stopped actually getting scheduled on the control-plane node. Need to verify this.
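
One quick way to verify where the pod actually landed would be something like the sketch below (the monitoring namespace and the prometheus-k8s-0 pod name are assumptions based on typical kube-prometheus naming, not taken from this test setup):

```
# Which node did the Prometheus pod get scheduled on? (pod/namespace names assumed)
kubectl -n monitoring get pod prometheus-k8s-0 -o wide

# If it is Pending, the scheduler events usually explain why (taints, insufficient resources).
kubectl -n monitoring describe pod prometheus-k8s-0 | grep -A10 Events

# Compare against what the control-plane node actually has allocatable.
kubectl describe node -l node-role.kubernetes.io/control-plane | grep -A8 Allocatable
```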

@hakuna-matatah
Contributor Author

hakuna-matatah commented Jan 30, 2024

From the latest periodic run, it appears that the Prometheus pod is getting scheduled on the control-plane node as expected, but the pod is panicking and exiting. See the errors below for reference:

It has the expected tolerations to schedule on the control-plane node:

    tolerations:
    - effect: NoSchedule
      key: monitoring
      operator: Exists
    - effect: NoSchedule
      key: node-role.kubernetes.io/master
      operator: Exists
    - effect: NoSchedule
      key: node-role.kubernetes.io/control-plane
      operator: Exists
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300

And it appears the Prometheus pod/container failed with a panic for the following reason:

  lastState:
    terminated:
      containerID: containerd://19350c59815eaf55b9b15d9eed62673dd82f559dc6f684c426cc3afff8c0d2be
      exitCode: 2
      finishedAt: "2024-01-30T06:55:11Z"
      message: "ar/run/secrets/kubernetes.io/serviceaccount/token: too many open
        files\"\nlevel=debug ts=2024-01-30T06:55:11.717Z caller=scrape.go:1127
        component=\"scrape manager\" scrape_pool=monitoring/coredns/0 target=http://10.2.124.225:9153/metrics
        msg=\"Scrape failed\" err=\"Get \\\"http://10.2.124.225:9153/metrics\\\":
        unable to read bearer token file /var/run/secrets/kubernetes.io/serviceaccount/token:
        open /var/run/secrets/kubernetes.io/serviceaccount/token: too many open
        files\"\npanic: next sequence file: open /prometheus/chunks_head: too
        many open files\n\ngoroutine 947 [running]:\ngithub.com/prometheus/prometheus/tsdb.(*memSeries).mmapCurrentHeadChunk(0xc000919180,
        0xc0001ea180)\n\t/app/tsdb/head.go:2105 +0x21a\ngithub.com/prometheus/prometheus/tsdb.(*memSeries).cutNewHeadChunk(0xc000919180,
        0x18d59267f19, 0xc0001ea180, 0x0)\n\t/app/tsdb/head.go:2076 +0x39\ngithub.com/prometheus/prometheus/tsdb.(*memSeries).append(0xc000919180,
        0x18d59267f19, 0x402a000000000000, 0x7cc5, 0xc0001ea180, 0x53d8)\n\t/app/tsdb/head.go:2232
        +0x38e\ngithub.com/prometheus/prometheus/tsdb.(*headAppender).Commit(0xc051f71300,
        0x0, 0x0)\n\t/app/tsdb/head.go:1295 +0x265\ngithub.com/prometheus/prometheus/tsdb.dbAppender.Commit(0x32af340,
        0xc051f71300, 0xc00029e0e0, 0x0, 0x0)\n\t/app/tsdb/db.go:794 +0x35\ngithub.com/prometheus/prometheus/storage.(*fanoutAppender).Commit(0xc06dbb9540,
        0xc1664533e63af65d, 0x98f89e9856)\n\t/app/storage/fanout.go:174 +0x49\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1(0xc01c3b9c18,
        0xc01c3b9c28, 0xc001fa0790)\n\t/app/scrape/scrape.go:1086 +0x49\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport(0xc001fa0790,
        0x12a05f200, 0x12a05f200, 0xc1664532a63d3d1b, 0x97ce9aed2d, 0x4450000,
        0xc1664533e63af65d, 0x98f89e9856, 0x4450000, 0x0, ...)\n\t/app/scrape/scrape.go:1153
        +0xb45\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).run(0xc001fa0790,
        0x12a05f200, 0x12a05f200, 0x0)\n\t/app/scrape/scrape.go:1039 +0x39e\ncreated
        by github.com/prometheus/prometheus/scrape.(*scrapePool).sync\n\t/app/scrape/scrape.go:510
        +0x9ce\n"
      reason: Error
      startedAt: "2024-01-30T06:44:14Z"
  name: prometheus
  ready: true
  restartCount: 2
      
The Prometheus version in use is 2.25; need to figure out what this Prometheus error means.
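
The "too many open files" lines suggest the container is hitting its file-descriptor limit. A rough way to confirm that directly, assuming the usual monitoring namespace and pod/container names and that the image ships a shell, would be:

```
# Effective open-file limit inside the Prometheus container (names assumed).
kubectl -n monitoring exec prometheus-k8s-0 -c prometheus -- sh -c 'ulimit -n'

# How many descriptors the Prometheus process currently holds
# (assuming Prometheus runs as PID 1 in its container).
kubectl -n monitoring exec prometheus-k8s-0 -c prometheus -- sh -c 'ls /proc/1/fd | wc -l'
```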

@hakuna-matatah
Contributor Author

hakuna-matatah commented Jan 30, 2024

        files\"\npanic: next sequence file: open /prometheus/chunks_head: too
        many open files\n\ngoroutine 947

This looks suspicious; need to understand why so many files end up open that it causes Prometheus to panic.
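
Prometheus also exports its own file-descriptor gauges, so while the pod is still up this can be checked from its metrics endpoint. A sketch, assuming port-forward access and the same assumed pod name as above:

```
# Forward the Prometheus port locally (pod name assumed).
kubectl -n monitoring port-forward pod/prometheus-k8s-0 9090:9090 &

# process_open_fds vs process_max_fds shows how close the process is to its limit.
curl -s localhost:9090/metrics | grep -E '^process_(open|max)_fds'
```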

@hakman
Member

hakman commented Jan 30, 2024

May be related to #16300.

@hakuna-matatah
Contributor Author

hakuna-matatah commented Jan 30, 2024

May be related to #16300.

Looks like it, @hakman, given that I have seen "too many open files" as the cause for the container restart, and the container didn't come back healthy afterwards.
Given the ulimit fixes were merged into kOps in the last hour, I will kick off the test to verify whether they fix the issue.
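
Once that kOps change rolls out, the effective limit can be double-checked on the control-plane node itself with a sketch like the following (the pgrep patterns are assumptions; the exact process names depend on the node setup):

```
# Limit actually inherited by the running Prometheus process.
cat /proc/$(pgrep -o prometheus)/limits | grep -i 'open files'

# Limit on containerd itself, which containers inherit by default.
cat /proc/$(pgrep -o containerd)/limits | grep -i 'open files'
```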

@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

@hakuna-matatah
Contributor Author

hakuna-matatah commented Jan 30, 2024

Looks like the ulimits fix above (the reverted change) helped resolve the scale-test failures we were seeing. I'm now able to see the Prometheus pod coming up cleanly, unlike the failures we saw before the ulimits fix:

W0130 19:11:25.825912   21056 framework.go:291] Skipping empty manifest default/prometheus-podMonitorDNSPerf.yaml
W0130 19:11:25.826013   21056 framework.go:291] Skipping empty manifest default/prometheus-podMonitorNetPolicyClient.yaml
W0130 19:11:25.826150   21056 framework.go:291] Skipping empty manifest default/prometheus-podMonitorNodeLocalDNS.yaml
W0130 19:11:25.889030   21056 framework.go:291] Skipping empty manifest default/prometheus-serviceMonitorKubelet.yaml
I0130 19:11:25.929395   21056 framework.go:276] Applying templates for "master-ip/*.yaml"
W0130 19:11:25.960898   21056 warnings.go:70] unknown field "spec.selector.k8s-app"
W0130 19:11:25.978294   21056 util.go:72] error while calling prometheus api: the server is currently unable to handle the request (get services http:prometheus-k8s:9090), response: Status Failure: no endpoints available for service "prometheus-k8s" (ServiceUnavailable)
I0130 19:12:56.710499   21056 framework.go:276] Applying templates for "manifest/exec_deployment.yaml"
W0130 19:13:06.833085   21056 imagepreload.go:92] No images specified. Skipping image preloading
I0130 19:13:06.872931   21056 clusterloader.go:445] Test config successfully dumped to: /logs/artifacts/generatedConfig_load.yaml
I0130 19:13:06.872963   21056 clusterloader.go:238] --------------------------------------------------------------------------------
I0130 19:13:06.872969   21056 clusterloader.go:239] Running /home/prow/go/src/k8s.io/perf-tests/clusterloader2/testing/load/config.yaml
I0130 19:13:06.872974   21056 clusterloader.go:240] --------------------------------------------------------------------------------
I0130 19:13:09.281630   21056 framework.go:276] Applying templates for "manifests/*.yaml"
I0130 19:13:09.307557   21056 framework.go:276] Applying templates for "manifests/dnsLookup/*yaml"
I0130 19:22:28.314648   21056 shared_informer.go:240] Waiting for caches to sync for PodsIndexer
I0130 19:22:31.815435   21056 shared_informer.go:247] Caches are synced for PodsIndexer 

@hakuna-matatah
Contributor Author

Closing this as it has accomplished its purpose, summarized here: kubernetes/test-infra#31755 (comment). The periodics are now succeeding.
