
Investigate AWS 5K test failures that started happening in last 10 days #31755

Closed
hakuna-matatah opened this issue Jan 29, 2024 · 4 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one.


@hakuna-matatah
Contributor

What happened:
Investigate the AWS 5K test failures that started happening in the last 10 days. We need to understand which dependency change upstream has broken the tests. From the last couple of failures, I see that the Prometheus stack is not coming up.

What you expected to happen:
Tests were succeeding continuously prior to that and then started failing, so we need to make them succeed again.

How to reproduce it (as minimally and precisely as possible):
The periodics that run here every day are already reproducing it - https://testgrid.k8s.io/sig-scalability-aws#ec2-master-scale-performance

Please provide links to example occurrences, if any:
https://testgrid.k8s.io/sig-scalability-aws#ec2-master-scale-performance

Anything else we need to know?:

@hakuna-matatah hakuna-matatah added the kind/bug Categorizes issue or PR as related to a bug. label Jan 29, 2024
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jan 29, 2024
@k8s-ci-robot
Contributor

There are no sig labels on this issue. Please add an appropriate label by using one of the following commands:

  • /sig <group-name>
  • /wg <group-name>
  • /committee <group-name>

Please see the group list for a listing of the SIGs, working groups, and committees available.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hakuna-matatah
Contributor Author

hakuna-matatah commented Jan 30, 2024

The investigation of the test failures for this issue was done here (so tests could be run to RCA the issue).

Summary:

Prometheus pods were restarting and failing to come up due to the error mentioned in this comment here:

  files\"\npanic: next sequence file: open /prometheus/chunks_head: too
        many open files\n\ngoroutine 947

The Prometheus container/pod was failing because of the above error, which was caused by this containerd change and the KOPS change in this PR.
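
To confirm that a pod is really hitting a low file-descriptor ceiling, it helps to print the RLIMIT_NOFILE the process actually sees. Below is a minimal Go sketch (written for this write-up, not taken from the issue or from Prometheus) that reports the soft and hard limits; the "too many open files" panic above means the soft limit was exhausted.

```go
// fdlimit.go - print the open-file limits the current process sees.
// Run it inside the affected container (e.g. via a debug/ephemeral container)
// to check what RLIMIT_NOFILE the runtime handed to the workload.
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	// Cur is the soft limit that "too many open files" runs into; Max is the hard ceiling.
	fmt.Printf("RLIMIT_NOFILE: soft=%d hard=%d\n", rl.Cur, rl.Max)
}
```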

As of this morning, that change was reverted on KOPS here, and that fixed the problem; I verified it through this one-off test, which includes the reverted commit.
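
For reference, the limit involved is the soft RLIMIT_NOFILE a container process starts with, and a process can also bump its own soft limit up to the hard limit at startup. The sketch below only illustrates that knob and is not the fix applied here (the revert in KOPS was); names and structure are mine.

```go
// raise_nofile.go - illustrative only: raise the soft open-file limit to the hard limit.
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	rl.Cur = rl.Max // the soft limit may not exceed the hard limit
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	fmt.Printf("soft RLIMIT_NOFILE raised to %d\n", rl.Cur)
}
```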

I will close the issue once I see a successful run in our periodics tomorrow, which should pull in these changes in the next run.

@dims
Member

dims commented Jan 30, 2024

Sweet! thanks @hakuna-matatah

@hakuna-matatah
Contributor Author
