
Migrate KernelCI infrastructure to new Azure subscription #179

Closed
6 tasks done
mgalka opened this issue Mar 7, 2023 · 7 comments

mgalka commented Mar 7, 2023

Now that the new Azure subscription is available, the KernelCI infrastructure should be moved there. It's also a good moment to review the functionality of the Ansible playbooks and Terraform configs.

Items to be migrated

VMs

Machines:

  • Staging (staging.kernelci.org)
  • Production (kernelci.org)
  • Jenkins nodes
    • staging: kernelci7, kernelci9, kernelci-jenkins-usa
    • production: kernelci1, kernelci6, kernelci-jenkins-usa
  • Monitoring
  • Sandbox (kernelci2)

It may be a good idea to start the migration with kernelci2, the sandbox machine, as it has a relatively low impact on KernelCI stability and quality of service.
There are repositories with Ansible playbooks for configuring the backend, frontend and Jenkins nodes.
During the update it may be worth checking whether the playbooks are still functional (see the sketch below).
It may also be worth checking whether Debian-based VMs can be created for the Jenkins nodes and configured using https://github.com/kernelci/builder-config2.

Fixing possible errors is not the aim of this task, so if making the Ansible playbooks functional would take a significant amount of time, the VMs should instead be migrated by recreating them from images.
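
A quick way to check whether a playbook is still functional is a dry run against the target host. This is a minimal sketch; the inventory and playbook file names are placeholders, not necessarily the ones used in the kernelci repositories:

# Dry run: report what would change without touching the host.
# 'inventory/staging' and 'site.yml' are hypothetical names here.
ansible-playbook -i inventory/staging site.yml --check --diff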

Kubernetes build cluster

There are Terraform configs for deploying our Kubernetes clusters to Azure. Recreating the cluster under the new subscription is a good opportunity to test that they still work. The configs may need slight adjustments for the current configuration and storage space.
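
Assuming the configs use the standard azurerm provider, pointing them at the new subscription could look roughly like this; the subscription ID is a placeholder and the actual workflow may differ:

# Select the new subscription for the azurerm provider, then do a
# plan/apply cycle. <new-subscription-id> is a placeholder.
export ARM_SUBSCRIPTION_ID="<new-subscription-id>"
terraform init
terraform plan -out=migration.tfplan
terraform apply migration.tfplan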

Firewall settings

Firewall settings need to be moved from the current Azure subscription to the new one.
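
One way to capture the current rules for review before recreating them, assuming they live in network security groups (the resource group name below is a placeholder):

# Dump the existing NSG definitions, including their rules, to a file
# for review. <old-rg> is a placeholder resource group name.
az network nsg list --resource-group <old-rg> --output json > nsg-rules.json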

Tasks


gctucker commented Apr 5, 2023

@VinceHillier @mgalka I haven't found an issue about migrating staging.kernelci.org to the new Azure subscription, so I'm leaving a comment here specifically about storage:

I've reduced the number of builds for staging-* branches; here's some data about the impact it should have on disk usage:

# staging-mainline
1.6G	./staging-mainline-20230404.3
4.7G	./staging-mainline-20230326.0
5.1G	./staging-mainline-20230331.0
4.9G	./staging-mainline-20230327.4
# staging-stable
662M	./staging-stable-20230404.2
1.5G	./staging-stable-20230404.0
1.5G	./staging-stable-20230404.1
1.4G	./staging-stable-20230403.0
# staging-next
2.3G	./staging-next-20230405.0
4.2G	./staging-next-20230403.0
5.0G	./staging-next-20230404.0
5.0G	./staging-next-20230330.2

These jobs are run every day, as you can see from the directory names (some were run multiple times, hence the dotted version numbers). Basically, they used to amount to 5 + 1.5 + 5 = ~12GB / day and now they're down to 1.6 + 0.7 + 2.3 = ~5GB / day. So for 9 days (the current retention period) we had 108 GB; now we can easily extend that to 14 days with 70 GB. I think 2 weeks / 14 days is a good guideline to follow for all the data on staging.

Meanwhile, the plain mainline jobs, which are also run daily, were reduced to the same level as staging-mainline rather than the full-blown set run in production. So we have this shift:

12G	./v6.3-rc5
1.8G	./v6.3-rc5-22-g76f598ba7d8e
1.3G	./v6.3-rc5-5-g148341f0a2f5

Basically down from 12GB / day to ~2GB. Likewise, 9 days was 108 GB and now we can easily have 14 days at 28GB.

Then linux-next is built and tested once a week (weekend) with the same scale as in production, and that's by design to catch everything else that the more frequent but trimmed-down jobs don't catch. These take ~12GB, so 24GB for 14 days.

So for the kernel build artifacts and associated test logs, we used to have a weekly storage usage of ~180GB and now we're down to ~60GB. We can add some margin for additional builds run by hand, bisections and other bits. Bottom line:

With a 14-day retention policy, that means we need about 200GB on the VM's disk for the kernel builds.
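
As a back-of-envelope check of that figure, using the per-day numbers above (all values in GB, rounded):

# (staging + mainline) per day over 14 days, plus two weekly
# linux-next runs: ~122 GB, leaving ~80 GB of margin in 200 GB.
echo $(( (5 + 2) * 14 + 12 * 2 ))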

Next things to look into are rootfs images, MongoDB data, Docker, and what the new API & Pipeline storage requirements are going to look like (still pretty small right now).


mgalka commented Apr 5, 2023

> @VinceHillier @mgalka I haven't found an issue about migrating staging.kernelci.org to the new Azure subscription, so I'm leaving a comment here specifically about storage:

@gctucker Do you think any action needs to be taken about the storage during the migration, apart from increasing space?


gctucker commented Apr 6, 2023

> > @VinceHillier @mgalka I haven't found an issue about migrating staging.kernelci.org to the new Azure subscription, so I'm leaving a comment here specifically about storage:
>
> @gctucker Do you think any action needs to be taken about the storage during the migration, apart from increasing space?

I don't know; the other storage parts haven't been quantified yet either, so 1TB might not be the optimal size.

I/O bandwidth and operations might also be something to benchmark, as we could choose a different type of disk based on that, and there are a few caching options too.
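
For comparing disk types, something like a fio run on each candidate disk would give comparable numbers; this is a sketch, not an agreed benchmark profile, and the file path and job parameters are placeholders:

# Mixed random read/write benchmark with direct I/O, bypassing the
# page cache; adjust size/runtime to taste.
fio --name=randrw --filename=/var/tmp/fio.test --size=4G \
    --rw=randrw --bs=4k --iodepth=32 --ioengine=libaio \
    --direct=1 --runtime=60 --time_based --group_reporting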

VinceHillier commented

I've attached the iowait graphs for staging to a Google doc here:

https://docs.google.com/document/d/1VtpFJFwX74yKH5SDX_hcNWylty6sbOw8qib1iUZBkNA

Increasing the disks to 1TB brings our IOPS from 2300 to 5000. We can analyze the differences once the environment has been moved to the new subscription.


nuclearcat commented Apr 7, 2023

We also need to add a step for updating settings and the various repositories where old hostnames/IPs or old credentials are listed.
Like:

  • Jenkins runners configuration
  • Kubernetes configuration, cluster names, auth data (for Azure only)

What else am I missing?


gctucker commented Apr 7, 2023

> We also need to add a step for updating settings and the various repositories where old hostnames/IPs or old credentials are listed. Like:
>
> • Jenkins runners configuration
> • Kubernetes configuration, cluster names, auth data (for Azure only)
>
> What else am I missing?

Sounds like it would be worth a separate issue about updating configuration files, and a bullet point here with a link to it. Maybe we can also have an issue for migrating the Kubernetes clusters, and potentially a few more.

VinceHillier commented

This has been completed.
