Skip to content

[AKS VMs Pool] Add VMs implementation to 1.30 #7960

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: cluster-autoscaler-release-1.30
Choose a base branch
from

Conversation

wenxuan0923
Copy link
Contributor

@wenxuan0923 wenxuan0923 commented Mar 21, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

Support VMs pool autoscaling in CAS 1.30 release

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

No (disabled by default)

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 21, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @wenxuan0923. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot requested a review from feiskyer March 21, 2025 17:41
@k8s-ci-robot k8s-ci-robot added area/provider/azure Issues or PRs related to azure provider size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 21, 2025
@jackfrancis
Copy link
Contributor

@wenxuan0923 we don't want to commit the vendor/ directory, could you remove that from the changeset? thank you!

This reverts commit 222c8e8.
@wenxuan0923
Copy link
Contributor Author

wenxuan0923 commented Mar 21, 2025

@wenxuan0923 we don't want to commit the vendor/ directory, could you remove that from the changeset? thank you!

Yes, removed the vendor directory.

Emmm but now it failed the test with package not found...

image

@comtalyst
Copy link
Contributor

@jackfrancis this PR is to 1.30 specifically, which still have the vendor folder (right?)

@comtalyst
Copy link
Contributor

@wenxuan0923 just to confirm: once more testing is done, this will be cherry-picked to all other versions, right?

Copy link
Contributor

@comtalyst comtalyst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the direction of making VMs pool's implementation similar to VMSS, I don't think it is that necessary if the differences it makes is for the better. I think many parts of VMSS's implementation worth refactoring, and one way to do so is for VMs pool to set up a possible pattern for that.

Happy to discuss more, and thanks for driving all this 🙂


isVMPool, agentPoolName, sku := m.parseSKUAndVMsAgentpoolNameFromSpecName(s.Name)
if isVMPool {
return NewVMPool(s, m, agentPoolName, sku)
}

switch m.config.VMType {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To prevent confusion, I think there could be a comment explaining the relationship between the above and this switch statement. Likely highlighting the fact that we are supporting multiple kinds of pools in the same cluster for VMs pool case and onwards.

Comment on lines 126 to 128
SkuName: *vmss.Sku.Name,
Tags: vmss.Tags,
Location: *vmss.Location,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be a good chance to introduce the pattern of nil checks and return an error (next to the NodeTemplate) for this and other functions. Looking at current uses, we could see what fields are unacceptable to be missing. Thoughts?

SkuName string
InstanceOS string
Location string
Zones *[]string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why only Zones is a pointer?

}
}

func buildNodeTemplateFromVMPool(vmsPool armcontainerservice.AgentPool, location string, skuName string) NodeTemplate {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there are "VMPool", "VMsPool", and "VMs" being referred to this concept. Trying to to make those consistent would be easier to understand/navigate. Thoughts?

template compute.VirtualMachineScaleSet, manager *AzureManager, enableDynamicInstanceList bool) (*apiv1.Node, error) {
// VMsNodeTemplate holds properties for node from VMPool
type VMsNodeTemplate struct {
Taints []string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In scale from zero, CAS with VMSS currently use inputLabels, which is the list expected labels of that pool, to predict what labels it will have through the template as we can see in this file. The assumption that each node will have the same set of labels is made.
Without that for VMs pool, how would it handle scale from zero? And I think that assumption is not usable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also, doesn't look like Taints is being used yet)

Comment on lines 55 to 56
lastSizeRefresh time.Time
sizeRefreshPeriod time.Duration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asides from consistency with VMSS, what do we think about the necessity of caching here in the first place?
There was an idea that it might have always been overkill, and not worth the complexity it adds. Maybe we could try not having it in VMs pool. Can discuss more. Thoughts?

func (agentPool *VMsPool) IncreaseSize(delta int) error {
// TODO(wenxuan): Implement this method
return cloudprovider.ErrNotImplemented
func (vmPool *VMPool) getAgentpoolFromAzure() (armcontainerservice.AgentPool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe move this down to where getAgentPoolFromCache() is?

Comment on lines 153 to 154
func (vmPool *VMPool) getVMPoolSize() (int64, error) {
size, err := vmPool.getCurSize()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this could have been one method. Thoughts?

}

// buildRequestBodyForScaleUp builds the request body for scale up for self-hosted CAS
func (vmPool *VMPool) buildRequestBodyForScaleUp(versionedAP armcontainerservice.AgentPool, count int64) (armcontainerservice.AgentPool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks more like a utility function than a method IMO. Maybe pass vmPool.sku as a parameter as well?

// this is a placeholder for now, no real implementation
type VMsPool struct {
// VMPool represents a pool of standalone VMs with a single SKU.
// It is part of a mixed-SKU agent pool (agent pool with type `VirtualMachines`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just making sure I understand it correctly: a mixed SKU pool maps to multiple "VMPool"s here, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's correct.

@wenxuan0923
Copy link
Contributor Author

@wenxuan0923 just to confirm: once more testing is done, this will be cherry-picked to all other versions, right?

yes that's correct.

@rakechill
Copy link
Contributor

@jackfrancis this PR is to 1.30 specifically, which still have the vendor folder (right?)

yes, can see from my update skewer module PR: https://github.com/kubernetes/autoscaler/pull/7288/files

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wenxuan0923
Once this PR has been reviewed and has the lgtm label, please assign towca for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cluster-autoscaler area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants