
Setup periodic scalability CI tests on AWS #29139

Closed · 4 tasks done
shyamjvs opened this issue Mar 24, 2023 · 18 comments
Assignees: ameukam · justinsb · shyamjvs
Labels: kind/feature · priority/important-longterm · sig/k8s-infra · sig/scalability · sig/testing

Comments

@shyamjvs (Member) commented Mar 24, 2023

Follows from the discussion here: https://groups.google.com/g/kubernetes-sig-release/c/ShwzKuYoRAc/m/t6LvF7BQAgAJ

Let's use this issue to plan and track the tasks we need to get there. For the initial phase, we need to:

  • Find (or create) a CNCF-owned AWS account where we can run these jobs. From experience, doing this in a separate account from regular CI jobs is better w.r.t. isolation, limits/throttling, etc.
  • Get the account limits raised for the various AWS resources/APIs we depend on (EC2, ENIs, EBS, IAM, etc.)
  • Bring up a smaller-scale job first (say 100 nodes?) using the recommended way today for creating AWS test clusters (kops? we have other, possibly better, options for scale testing too - CAPA, KIT) - see the sketch after this list
  • Get the job to reliably pass. We can bump up the scale incrementally from there
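For concreteness, here's a minimal sketch of what such a periodic could look like, loosely modeled on the existing kOps periodics in test-infra. The job name, image, and kubetest2 flags are illustrative assumptions, not a final definition:

```yaml
# Hypothetical Prow periodic for a 100-node kOps scale cluster on AWS.
# Name, image, and flags below are placeholders to be settled later.
periodics:
- name: ci-kubernetes-e2e-kops-aws-scale-100
  interval: 24h
  decorate: true
  spec:
    containers:
    - image: gcr.io/k8s-staging-test-infra/kubekins-e2e:latest-master
      command:
      - runner.sh
      args:
      - bash
      - -c
      - |
        # Bring the cluster up, run the tests, and always tear it down.
        kubetest2 kops \
          --up --down \
          --cloud-provider=aws \
          --create-args="--networking amazonvpc" \
          --test=kops
```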

cc @kubernetes/sig-scalability @kubernetes/sig-k8s-infra @kubernetes/sig-testing
cc @dims @wojtek-t @BenTheElder

@shyamjvs shyamjvs added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 24, 2023
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Mar 24, 2023
@shyamjvs (Member Author)

We need guidance on the status quo for items 1 and 3 from @kubernetes/sig-testing :)

@BenTheElder (Member)

Hope I can help more here in the future; I'm a bit distracted with the registry redirect at the moment.

Find (or create) a CNCF-owned AWS account where we can run these jobs. From experience, doing this in a separate account from regular CI jobs is better w.r.t. isolation, limits/throttling, etc.

We usually use a pool of sub-accounts that we rent via boskos as a way to ensure we can clean up after CI runs. I think SIG K8s Infra can help with the accounts, and this pattern should be pretty well established for kOps and CAPA at least.

We'd still need K8s Infra to set up some new projects with higher quota in a distinct pool, but the project-pool part should be ready to go; it's mainly just the quota part, I think?
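For reference, a boskos pool is declared in a YAML config along these lines. The resource type and account names below are hypothetical, purely to illustrate the dedicated-pool idea:

```yaml
# Hypothetical boskos config entry for a dedicated scale-test account pool.
# The resource type and account names are placeholders.
resources:
- type: aws-scale-account
  state: free
  names:
  - k8s-infra-e2e-scale-boskos-001
  - k8s-infra-e2e-scale-boskos-002
```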

Bring up a smaller-scale job first (say 100 nodes?) using the recommended way today for creating AWS test clusters (kops? we have other, possibly better, options for scale testing too - CAPA, KIT)

I would strongly suggest starting with kOps:

  • Integrates with our existing CI tooling, including kubetest(2), boskos, etc.
  • Existing, fairly extensive, and well-kept CI signal on AWS and GCE, including at Kubernetes HEAD
  • Members of the Kubernetes project can readily patch any issues
  • Prior art: it ran in kubernetes/kubernetes PR-blocking CI for many years and worked well (until "AWS CNCF Account seems to have lost quota" #10043 in 2019)

@ameukam (Member) commented Mar 24, 2023

/sig k8s-infra
/sig testing
/sig scalability
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. sig/testing Categorizes an issue or PR as relevant to SIG Testing. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 24, 2023
@justinsb (Member) commented Mar 24, 2023

I'd be happy to work with you @shyamjvs to get scale testing running with kOps + AWS; we have a bunch of tests running already. It looks like the ones at k8s HEAD broke literally today, so we'll have to dig into why (which is probably a good reason not to start your own effort, if your focus is on scalability rather than fixing random breakages!)

We can simply create a scenario that runs with 100 nodes and see what breaks (e.g. we might have quota already). Which CNI/network configuration would we want to start with? And if you have any ideas on machine sizes, we can plug those in to create a one-off scenario here. If we don't know, it's also fairly easy to iterate.
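To illustrate, such a one-off scenario could pin down the node count, machine type, and CNI in the kOps cluster spec roughly as follows. The cluster name and c5.large machine type are placeholder assumptions, and required fields unrelated to scale are omitted:

```yaml
# Illustrative kOps objects for a 100-node scenario using the AWS VPC CNI.
# Cluster name and machine type are placeholders; unrelated required
# fields (subnets, kubernetesVersion, etc.) are omitted for brevity.
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: scale-100.test.example.com
spec:
  networking:
    amazonvpc: {}   # AWS VPC CNI
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: nodes
  labels:
    kops.k8s.io/cluster: scale-100.test.example.com
spec:
  role: Node
  machineType: c5.large
  minSize: 100
  maxSize: 100
```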

Then we should absolutely make sure it runs against the new CNCF AWS account, and that it has the higher limits, etc.

(Good news: the breakage looks like a known regression at HEAD that should be fixed by kubernetes/kubernetes@b83600d.)

@shyamjvs (Member Author)

We'd still need K8s Infra to set up some new projects with higher quota in a distinct pool, but the project-pool part should be ready to go; it's mainly just the quota part, I think?

Thanks Ben. So to clarify: do we either need a new account pool with higher quotas, or need to increase the quotas for every single account in the existing pool, before we can run these tests? Would the path of least resistance be to provision a single new account (with higher quotas) and dedicate that to the scale tests instead?

We can simply create a scenario that runs with 100 nodes and see what breaks (e.g. we might have quota already).

Thanks Justin for offering to help! We're still figuring out a few details, especially around setting custom flags on a bunch of components (typically needed for the scale tests) and the control-plane setup (e.g. co-locating etcd with the apiserver vs. running them separately - the latter being the model EKS uses today, fwiw). We'll need some help figuring out what's possible with kops vs. not. Is the sig-testing call a good place to discuss this?

@BenTheElder (Member) commented Mar 27, 2023

Thanks Ben. So to clarify: do we either need a new account pool with higher quotas, or need to increase the quotas for every single account in the existing pool, before we can run these tests? Would the path of least resistance be to provision a single new account (with higher quotas) and dedicate that to the scale tests instead?

In the past we've created a separate resource pool for high-quota projects on GCP, and I think we should do the same here.

This makes it easier to track scale-testing-related spend, and we have needed fewer accounts for this purpose than for general e2e testing.

We'll need some help figuring out what's possible with kops vs. not. Is the sig-testing call a good place to discuss this?

SIG Testing could be a reasonable place for this, as the config will also need to flow through the CI tooling.

In the meantime, re: custom etcd options, see
https://pkg.go.dev/k8s.io/kops/pkg/apis/kops#EtcdManagerSpec (Env)
and, at a higher level:
https://kops.sigs.k8s.io/cluster_spec/#the-cluster-resource
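To make that concrete, component flags and etcd overrides live in the kOps cluster spec. A minimal sketch, assuming illustrative values (the flag and env settings below are examples, not recommendations):

```yaml
# Sketch: overriding component config in a kOps cluster spec.
# Values here are illustrative only.
spec:
  kubeAPIServer:
    maxRequestsInflight: 800           # example apiserver flag override
  etcdClusters:
  - name: main
    manager:
      env:
      - name: ETCD_QUOTA_BACKEND_BYTES # passed through to etcd via etcd-manager
        value: "8589934592"
```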

@shyamjvs (Member Author) commented Apr 3, 2023

We discussed the above in the last sig-scale meeting (30th Mar) - meeting notes.

Summary of next steps:

  • we decided to use a separate boskos pool for scale jobs (1 or 2 accounts should be enough for now) - @ameukam volunteered to help with this
  • after that, we need to get some limit increases for those accounts (I'll help get the list together for this)
  • port the relevant configs defined here and here (from the GCE scale job today) to the kops cluster
  • switch the test cluster to use the AWS VPC CNI
  • switch from the conformance to the scalability e2e test suite
  • (other supporting work - log upload to S3, SSH to apiservers, perfdash, etc. - we'll figure out later)

@shyamjvs (Member Author) commented Apr 6, 2023

/assign @ameukam @justinsb @shyamjvs
(initial set of owners - I'll add more as the work evolves)

@dims (Member) commented May 17, 2023

@shyamjvs how many dedicated accounts do you need?

(from slack thread https://kubernetes.slack.com/archives/CCK68P2Q2/p1684337286675859)

@shyamjvs (Member Author)

We just need 1 account to begin with.

ameukam added a commit to ameukam/k8s.io that referenced this issue May 17, 2023
@dims (Member) commented May 22, 2023

We have 2 accounts from line item #1 with limits updated (k8s-infra-e2e-scale-boskos-001 and k8s-infra-e2e-scale-boskos-002).

details: https://kubernetes.slack.com/archives/CCK68P2Q2/p1684593816977859

Some follow ups pending with @ashishranjan738

@dims (Member) commented May 22, 2023

Looks like there's a kOps CI job definition here that we can start from for line item 3:
https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/kops/kops-periodics-scale.yaml#L6

@hakuna-matatah (Contributor)

Current status for scalability test runs on AWS: we were able to successfully set up periodics at 5k nodes measuring pod SLOs on a daily basis, which can be monitored here: https://testgrid.k8s.io/kops-misc#ec2-master-scale-performance

Next things:

  • Measure API server SLOs at 5k scale - see the sketch after this list
  • Add AWS scalability tests to perf-dash
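For the first item, clusterloader2 (which drives the scalability suite) gathers API server SLO data through measurements declared in its YAML test configs. A minimal excerpt, assuming the standard APIResponsivenessPrometheus measurement from perf-tests; treat the exact identifiers as illustrative:

```yaml
# Sketch: clusterloader2 steps that start and then gather an API
# responsiveness (API server SLO) measurement. Illustrative excerpt.
steps:
- name: Start API responsiveness measurement
  measurements:
  - Identifier: APIResponsivenessPrometheus
    Method: APIResponsivenessPrometheus
    Params:
      action: start
- name: Gather API responsiveness measurement
  measurements:
  - Identifier: APIResponsivenessPrometheus
    Method: APIResponsivenessPrometheus
    Params:
      action: gather
```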

@shyamjvs (Member Author)

Looks like the 5k-node kops test has been failing for the past 10 days due to some issue during cluster creation - https://testgrid.k8s.io/kops-misc#ec2-master-scale-performance.

@hakuna-matatah or @mengqiy - could one of you open an issue against this repo for tracking the fixes?

@hakuna-matatah (Contributor)

Thanks @shyamjvs. We are aware of this and are looking into it, as discussed internally. I have cut an issue for tracking purposes.

@dims (Member) commented Feb 10, 2024

/close

https://testgrid.k8s.io/amazon-ec2-release#ec2-master-scale-performance&width=20

@k8s-ci-robot (Contributor)

@dims: Closing this issue.

In response to this:

/close

https://testgrid.k8s.io/amazon-ec2-release#ec2-master-scale-performance&width=20

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
