
Investigate AWS 5K test failures that started happening in last 10 days #31755

Closed
hakuna-matatah opened this issue Jan 29, 2024 · 4 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one.


@hakuna-matatah
Contributor

What happened:
Investigate the AWS 5K test failures that started happening in the last 10 days. We need to understand which dependency change upstream has broken the tests. From the last couple of failures, I see that the Prometheus stack is not coming up.

What you expected to happen:
Tests were succeeding continuously prior to that and then started failing, so we need to make them succeed again.

How to reproduce it (as minimally and precisely as possible):
The periodics that run here every day are already reproducing it - https://testgrid.k8s.io/sig-scalability-aws#ec2-master-scale-performance

Please provide links to example occurrences, if any:
https://testgrid.k8s.io/sig-scalability-aws#ec2-master-scale-performance

Anything else we need to know?:

@hakuna-matatah hakuna-matatah added the kind/bug Categorizes issue or PR as related to a bug. label Jan 29, 2024
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jan 29, 2024
@k8s-ci-robot
Contributor

There are no sig labels on this issue. Please add an appropriate label by using one of the following commands:

  • /sig <group-name>
  • /wg <group-name>
  • /committee <group-name>

Please see the group list for a listing of the SIGs, working groups, and committees available.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hakuna-matatah
Contributor Author

hakuna-matatah commented Jan 30, 2024

The investigation of the test failures for this issue was done here (so tests could be run to RCA the issue).

Summary:

Prometheus pods were restarting and failing to come up due to the error mentioned in this comment here:

  files\"\npanic: next sequence file: open /prometheus/chunks_head: too
        many open files\n\ngoroutine 947

The Prometheus container/pod was failing because of the above error, which was caused by this containerd change and the KOPS change in this PR.
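
To confirm that a pod is really hitting a low file-descriptor ceiling, it helps to print the RLIMIT_NOFILE the process actually sees. Below is a minimal Go sketch (written for this write-up, not taken from the issue or from Prometheus) that reports the soft and hard limits; the "too many open files" panic above means the soft limit was exhausted.

```go
// fdlimit.go - print the open-file limits the current process sees.
// Run it inside the affected container (e.g. via a debug/ephemeral container)
// to check what RLIMIT_NOFILE the runtime handed to the workload.
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	// Cur is the soft limit that "too many open files" runs into; Max is the hard ceiling.
	fmt.Printf("RLIMIT_NOFILE: soft=%d hard=%d\n", rl.Cur, rl.Max)
}
```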

As of this morning, that change was reverted on KOPS here, and that fixed the problem; I verified it through this one-off test, which includes the reverted commit.
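
For reference, the limit involved is the soft RLIMIT_NOFILE a container process starts with, and a process can also bump its own soft limit up to the hard limit at startup. The sketch below only illustrates that knob and is not the fix applied here (the revert in KOPS was); names and structure are mine.

```go
// raise_nofile.go - illustrative only: raise the soft open-file limit to the hard limit.
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	rl.Cur = rl.Max // the soft limit may not exceed the hard limit
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	fmt.Printf("soft RLIMIT_NOFILE raised to %d\n", rl.Cur)
}
```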

I will close the issue once I see a successful run in our periodics tomorrow, which should pull in these changes in the next run.

@dims
Member

dims commented Jan 30, 2024

Sweet! thanks @hakuna-matatah

@hakuna-matatah
Contributor Author
