
{WIP} To run some scale tests to understand the recent failures #16296

Closed
wants to merge 1 commit

Conversation

hakuna-matatah
Contributor

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 29, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zetaab for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2

@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2

Testing on a small scale to see if the prom stack comes up properly.

@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

@hakuna-matatah
Contributor Author

/retest

@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2

Testing on a small scale to see if the prom stack comes up properly.

It looks like Prometheus is coming up fine in the small-scale tests:

W0129 20:28:32.119620   20929 framework.go:291] Skipping empty manifest default/prometheus-serviceMonitorKubelet.yaml
I0129 20:28:32.155996   20929 framework.go:276] Applying templates for "master-ip/*.yaml"
W0129 20:28:32.174218   20929 warnings.go:70] unknown field "spec.selector.k8s-app"
W0129 20:28:32.182250   20929 util.go:72] error while calling prometheus api: the server is currently unable to handle the request (get services http:prometheus-k8s:9090), response: Status Failure: no endpoints available for service "prometheus-k8s" (ServiceUnavailable)
I0129 20:29:32.379189   20929 framework.go:276] Applying templates for "manifest/exec_deployment.yaml"
W0129 20:29:42.457062   20929 imagepreload.go:92] No images specified. Skipping image preloading
I0129 20:29:42.499512   20929 clusterloader.go:445] Test config successfully dumped to: /logs/artifacts/generatedConfig_load.yaml
I0129 20:29:42.499545   20929 clusterloader.go:238] --------------------------------------------------------------------------------
I0129 20:29:42.499551   20929 clusterloader.go:239] Running /home/prow/go/src/k8s.io/perf-tests/clusterloader2/testing/load/config.yaml
I0129 20:29:42.499556   20929 clusterloader.go:240] --------------------------------------------------------------------------------
I0129 20:29:42.649101   20929 framework.go:276] Applying templates for "manifests/dnsLookup/*yaml"
I0129 20:29:42.651600   20929 framework.go:276] Applying templates for "manifests/*.yaml"
I0129 20:31:10.253880   20929 shared_informer.go:240] Waiting for caches to sync for PodsIndexer
I0129 20:31:10.654823   20929 shared_informer.go:247] Caches are synced for PodsIndexer 
I0129 20:36:43.528751   20929 phase_latency.go:146] PodStartupLatency: perc50: 4s, perc90: 8s, perc99: 9s

This tells me there might be some resource constraint at large scale (e.g. 5k nodes) that is causing Prometheus not to come up. But we schedule the Prometheus pod on the control-plane node, which is an 18xlarge instance, so it would be surprising if it were suffering from resource constraints, unless it has recently stopped actually getting scheduled on the control-plane node. Need to verify this.
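
One quick way to verify where the pod actually landed would be something like the sketch below (the monitoring namespace and the prometheus-k8s-0 pod name are assumptions based on typical kube-prometheus naming, not taken from this test setup):

```
# Which node did the Prometheus pod get scheduled on? (pod/namespace names assumed)
kubectl -n monitoring get pod prometheus-k8s-0 -o wide

# If it is Pending, the scheduler events usually explain why (taints, insufficient resources).
kubectl -n monitoring describe pod prometheus-k8s-0 | grep -A10 Events

# Compare against what the control-plane node actually has allocatable.
kubectl describe node -l node-role.kubernetes.io/control-plane | grep -A8 Allocatable
```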

@hakuna-matatah
Contributor Author

hakuna-matatah commented Jan 30, 2024

From the latest periodic run, it appears that the Prometheus pod is getting scheduled on the control-plane node as expected, but the pod is panicking and exiting. See the errors below for reference:

It has the expected tolerations to schedule on the control-plane node:

    tolerations:
    - effect: NoSchedule
      key: monitoring
      operator: Exists
    - effect: NoSchedule
      key: node-role.kubernetes.io/master
      operator: Exists
    - effect: NoSchedule
      key: node-role.kubernetes.io/control-plane
      operator: Exists
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300

And it appears the Prometheus pod/container failed with a panic for the following reason:

  lastState:
    terminated:
      containerID: containerd://19350c59815eaf55b9b15d9eed62673dd82f559dc6f684c426cc3afff8c0d2be
      exitCode: 2
      finishedAt: "2024-01-30T06:55:11Z"
      message: "ar/run/secrets/kubernetes.io/serviceaccount/token: too many open
        files\"\nlevel=debug ts=2024-01-30T06:55:11.717Z caller=scrape.go:1127
        component=\"scrape manager\" scrape_pool=monitoring/coredns/0 target=http://10.2.124.225:9153/metrics
        msg=\"Scrape failed\" err=\"Get \\\"http://10.2.124.225:9153/metrics\\\":
        unable to read bearer token file /var/run/secrets/kubernetes.io/serviceaccount/token:
        open /var/run/secrets/kubernetes.io/serviceaccount/token: too many open
        files\"\npanic: next sequence file: open /prometheus/chunks_head: too
        many open files\n\ngoroutine 947 [running]:\ngithub.com/prometheus/prometheus/tsdb.(*memSeries).mmapCurrentHeadChunk(0xc000919180,
        0xc0001ea180)\n\t/app/tsdb/head.go:2105 +0x21a\ngithub.com/prometheus/prometheus/tsdb.(*memSeries).cutNewHeadChunk(0xc000919180,
        0x18d59267f19, 0xc0001ea180, 0x0)\n\t/app/tsdb/head.go:2076 +0x39\ngithub.com/prometheus/prometheus/tsdb.(*memSeries).append(0xc000919180,
        0x18d59267f19, 0x402a000000000000, 0x7cc5, 0xc0001ea180, 0x53d8)\n\t/app/tsdb/head.go:2232
        +0x38e\ngithub.com/prometheus/prometheus/tsdb.(*headAppender).Commit(0xc051f71300,
        0x0, 0x0)\n\t/app/tsdb/head.go:1295 +0x265\ngithub.com/prometheus/prometheus/tsdb.dbAppender.Commit(0x32af340,
        0xc051f71300, 0xc00029e0e0, 0x0, 0x0)\n\t/app/tsdb/db.go:794 +0x35\ngithub.com/prometheus/prometheus/storage.(*fanoutAppender).Commit(0xc06dbb9540,
        0xc1664533e63af65d, 0x98f89e9856)\n\t/app/storage/fanout.go:174 +0x49\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1(0xc01c3b9c18,
        0xc01c3b9c28, 0xc001fa0790)\n\t/app/scrape/scrape.go:1086 +0x49\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport(0xc001fa0790,
        0x12a05f200, 0x12a05f200, 0xc1664532a63d3d1b, 0x97ce9aed2d, 0x4450000,
        0xc1664533e63af65d, 0x98f89e9856, 0x4450000, 0x0, ...)\n\t/app/scrape/scrape.go:1153
        +0xb45\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).run(0xc001fa0790,
        0x12a05f200, 0x12a05f200, 0x0)\n\t/app/scrape/scrape.go:1039 +0x39e\ncreated
        by github.com/prometheus/prometheus/scrape.(*scrapePool).sync\n\t/app/scrape/scrape.go:510
        +0x9ce\n"
      reason: Error
      startedAt: "2024-01-30T06:44:14Z"
  name: prometheus
  ready: true
  restartCount: 2
      
The Prometheus version in use is 2.25; need to figure out what this Prometheus error means.
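
The "too many open files" lines suggest the container is hitting its file-descriptor limit. A rough way to confirm that directly, assuming the usual monitoring namespace and pod/container names and that the image ships a shell, would be:

```
# Effective open-file limit inside the Prometheus container (names assumed).
kubectl -n monitoring exec prometheus-k8s-0 -c prometheus -- sh -c 'ulimit -n'

# How many descriptors the Prometheus process currently holds
# (assuming Prometheus runs as PID 1 in its container).
kubectl -n monitoring exec prometheus-k8s-0 -c prometheus -- sh -c 'ls /proc/1/fd | wc -l'
```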

@hakuna-matatah
Contributor Author

hakuna-matatah commented Jan 30, 2024

        files\"\npanic: next sequence file: open /prometheus/chunks_head: too
        many open files\n\ngoroutine 947

This looks suspicious; need to understand why so many files end up open that it causes Prometheus to panic.
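
Prometheus also exports its own file-descriptor gauges, so while the pod is still up this can be checked from its metrics endpoint. A sketch, assuming port-forward access and the same assumed pod name as above:

```
# Forward the Prometheus port locally (pod name assumed).
kubectl -n monitoring port-forward pod/prometheus-k8s-0 9090:9090 &

# process_open_fds vs process_max_fds shows how close the process is to its limit.
curl -s localhost:9090/metrics | grep -E '^process_(open|max)_fds'
```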

@hakman
Member

hakman commented Jan 30, 2024

May be related to #16300.

@hakuna-matatah
Contributor Author

hakuna-matatah commented Jan 30, 2024

May be related to #16300.

Looks like it, @hakman, given that I have seen "too many open files" as the cause for the container restart, and the container didn't come back healthy afterwards.
Given the ulimit fixes were merged into kOps in the last hour, I will kick off the test to verify whether they fix the issue.
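
Once that kOps change rolls out, the effective limit can be double-checked on the control-plane node itself with a sketch like the following (the pgrep patterns are assumptions; the exact process names depend on the node setup):

```
# Limit actually inherited by the running Prometheus process.
cat /proc/$(pgrep -o prometheus)/limits | grep -i 'open files'

# Limit on containerd itself, which containers inherit by default.
cat /proc/$(pgrep -o containerd)/limits | grep -i 'open files'
```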

@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

@hakuna-matatah
Contributor Author

hakuna-matatah commented Jan 30, 2024

Looks like the ulimits fix above (the reverted change) helped resolve the scale-test failures we were seeing. I'm now able to see the Prometheus pod coming up cleanly, unlike the failures we saw before the ulimits fix:

W0130 19:11:25.825912   21056 framework.go:291] Skipping empty manifest default/prometheus-podMonitorDNSPerf.yaml
W0130 19:11:25.826013   21056 framework.go:291] Skipping empty manifest default/prometheus-podMonitorNetPolicyClient.yaml
W0130 19:11:25.826150   21056 framework.go:291] Skipping empty manifest default/prometheus-podMonitorNodeLocalDNS.yaml
W0130 19:11:25.889030   21056 framework.go:291] Skipping empty manifest default/prometheus-serviceMonitorKubelet.yaml
I0130 19:11:25.929395   21056 framework.go:276] Applying templates for "master-ip/*.yaml"
W0130 19:11:25.960898   21056 warnings.go:70] unknown field "spec.selector.k8s-app"
W0130 19:11:25.978294   21056 util.go:72] error while calling prometheus api: the server is currently unable to handle the request (get services http:prometheus-k8s:9090), response: Status Failure: no endpoints available for service "prometheus-k8s" (ServiceUnavailable)
I0130 19:12:56.710499   21056 framework.go:276] Applying templates for "manifest/exec_deployment.yaml"
W0130 19:13:06.833085   21056 imagepreload.go:92] No images specified. Skipping image preloading
I0130 19:13:06.872931   21056 clusterloader.go:445] Test config successfully dumped to: /logs/artifacts/generatedConfig_load.yaml
I0130 19:13:06.872963   21056 clusterloader.go:238] --------------------------------------------------------------------------------
I0130 19:13:06.872969   21056 clusterloader.go:239] Running /home/prow/go/src/k8s.io/perf-tests/clusterloader2/testing/load/config.yaml
I0130 19:13:06.872974   21056 clusterloader.go:240] --------------------------------------------------------------------------------
I0130 19:13:09.281630   21056 framework.go:276] Applying templates for "manifests/*.yaml"
I0130 19:13:09.307557   21056 framework.go:276] Applying templates for "manifests/dnsLookup/*yaml"
I0130 19:22:28.314648   21056 shared_informer.go:240] Waiting for caches to sync for PodsIndexer
I0130 19:22:31.815435   21056 shared_informer.go:247] Caches are synced for PodsIndexer 

@hakuna-matatah
Contributor Author

Closing this as it has accomplished its purpose, summarized here: kubernetes/test-infra#31755 (comment). The periodics are now succeeding.
