CAPI is taking too long removing taint node.cluster.x-k8s.io/uninitialized:NoSchedule from nodes #9858

Closed
njuettner opened this issue Dec 12, 2023 · 44 comments · May be fixed by kubernetes-sigs/cluster-api-provider-azure#5292
Labels
area/bootstrap Issues or PRs related to bootstrap providers
area/machinepool Issues or PRs related to machinepools
kind/bug Categorizes issue or PR as related to a bug.
needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.
priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done.

Comments

@njuettner
Member

njuettner commented Dec 12, 2023

What steps did you take and what happened?

From reading the docs, the bootstrap provider can optionally taint worker nodes at creation with node.cluster.x-k8s.io/uninitialized:NoSchedule. We noticed that CAPI takes up to 5 minutes to remove the taint from nodes.
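
For reference, the taint in question is defined in the CAPI API package roughly as follows (a sketch from memory; check the current source for the exact definition):

import corev1 "k8s.io/api/core/v1"

// NodeUninitializedTaint is applied by the bootstrap provider at node creation and
// is expected to be removed by CAPI once the node has been matched to its Machine
// or MachinePool and its labels/annotations have been synced.
var NodeUninitializedTaint = corev1.Taint{
	Key:    "node.cluster.x-k8s.io/uninitialized",
	Effect: corev1.TaintEffectNoSchedule,
}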

Unfortunately, this is documented as optional, but there is no flag to omit it. From observing the code, we cannot see any place where this is possible.

We don't think this is intended, especially when MachinePools are managed externally by cluster-autoscaler.

Additionally, there are other provider-specific controllers, such as the aws-cloud-controller-manager, which set a similar taint (node.cloudprovider.kubernetes.io/uninitialized) while waiting for nodes to become ready.

What did you expect to happen?

Ensure taint configuration in bootstrap provider is optional.

Cluster API version

1.4.7

Kubernetes version

No response

Anything else you would like to add?

No response

Label(s) to be applied

/kind bug
/area bootstrap

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. area/bootstrap Issues or PRs related to bootstrap providers needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 12, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@AverageMarcus
Member

In addition to this taking time, it also creates a requirement for the MC to be able to connect to the WC when new nodes are added. This means if you're using externally managed cluster-autoscaler (as in, running cluster-autoscaler in the workload cluster) and the MC is unavailable for any reason (network issue, upgrading, maintenance, etc) new nodes added by cluster-autoscaler will not become ready until the MC can connect again. This breaks one of the main reasons for wanting to run cluster-autoscaler in the WC in the first place - not having a reliance on the MC for scaling when needed.

@chrischdi
Member

Question: does it only take too long for MachinePool-based Machines that were created by external scaling?

Some datapoints around the removal of the taint:

It gets done at the patchNode function:

hasTaintChanges := taints.RemoveNodeTaint(newNode, clusterv1.NodeUninitializedTaint)
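
A minimal sketch of what such a helper does (illustrative only, not the actual CAPI taints util): it removes the taint in place and reports whether a patch is needed.

import corev1 "k8s.io/api/core/v1"

// removeNodeTaint drops the given taint from the Node's spec and returns true
// if the taint list changed, so the caller knows whether to patch the Node.
func removeNodeTaint(node *corev1.Node, drop corev1.Taint) bool {
	changed := false
	kept := node.Spec.Taints[:0]
	for _, t := range node.Spec.Taints {
		if t.MatchTaint(&drop) {
			changed = true
			continue
		}
		kept = append(kept, t)
	}
	node.Spec.Taints = kept
	return changed
}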

Which gets called from reconcileNode

if err := r.patchNode(ctx, remoteClient, node, nodeLabels, nodeAnnotations); err != nil {

And before that, it waits e.g. for a ProviderID to be set on the Machine as well as on the node inside the workload cluster:

if machine.Spec.ProviderID == nil || *machine.Spec.ProviderID == "" {
	log.Info("Waiting for infrastructure provider to report spec.providerID", machine.Spec.InfrastructureRef.Kind, klog.KRef(machine.Spec.InfrastructureRef.Namespace, machine.Spec.InfrastructureRef.Name))
	conditions.MarkFalse(machine, clusterv1.MachineNodeHealthyCondition, clusterv1.WaitingForNodeRefReason, clusterv1.ConditionSeverityInfo, "")
	return ctrl.Result{}, nil
}

node, err := r.getNode(ctx, remoteClient, *machine.Spec.ProviderID)
if err != nil {
	if err == ErrNodeNotFound {
		// While a NodeRef is set in the status, failing to get that node means the node is deleted.
		// If Status.NodeRef is not set before, node still can be in the provisioning state.
		if machine.Status.NodeRef != nil {
			conditions.MarkFalse(machine, clusterv1.MachineNodeHealthyCondition, clusterv1.NodeNotFoundReason, clusterv1.ConditionSeverityError, "")
			return ctrl.Result{}, errors.Wrapf(err, "no matching Node for Machine %q in namespace %q", machine.Name, machine.Namespace)
		}
		conditions.MarkFalse(machine, clusterv1.MachineNodeHealthyCondition, clusterv1.NodeProvisioningReason, clusterv1.ConditionSeverityWarning, "")
		// No need to requeue here. Nodes emit an event that triggers reconciliation.
		return ctrl.Result{}, nil
	}
	log.Error(err, "Failed to retrieve Node by ProviderID")
	r.recorder.Event(machine, corev1.EventTypeWarning, "Failed to retrieve Node by ProviderID", err.Error())
	return ctrl.Result{}, err
}

It would be good to first know why it takes so long.

Could it be that, when scaled externally, it takes too long for CAPI to reconcile and finish the node? I think we watch the Node objects in a cluster 🤔

Do you have data points on whether the other work for the node finished? E.g. did a provider ID get set by the cloud provider running inside this cluster (as noted above)?

In addition to this taking time, it also creates a requirement for the MC to be able to connect to the WC when new nodes are added. This means if you're using externally managed cluster-autoscaler (as in, running cluster-autoscaler in the workload cluster) and the MC is unavailable for any reason (network issue, upgrading, maintenance, etc) new nodes added by cluster-autoscaler will not become ready until the MC can connect again. This breaks one of the main reasons for wanting to run cluster-autoscaler in the WC in the first place - not having a reliance on the MC for scaling when needed.

Regarding communication to the WC: there is a feature group around alternative communication patterns (don't know how active it is currently): https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/community/20230111-alternative-communication-patterns.md

@chrischdi
Member

There is also a feature group around Karpenter: https://hackmd.io/@elmiko/ryR2VXR0n

@fabriziopandini
Member

/triage needs-information

@k8s-ci-robot k8s-ci-robot added the triage/needs-information Indicates an issue needs more information in order to work on it. label Jan 19, 2024
@fabriziopandini fabriziopandini removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Feb 6, 2024
@fabriziopandini
Member

/close
due to lack of feedback

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

/close
due to lack of feedback

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mweibel
Contributor

mweibel commented Jun 26, 2024

I'm not sure whether it's better to reopen this or create a new issue, but we noticed similar behavior with our CAPI (v1.7.3), CAPZ (v1.5.2 with a few patches from open PRs), and cluster-autoscaler with the clusterapi provider (v1.28.2 with a few patches from open PRs).

We're observing this using MachinePools.

Looking at metrics using the following Prometheus query (replace scrape_interval with your scrape interval):

sum by (node) (
  sum_over_time(kube_node_spec_taint{key="node.cluster.x-k8s.io/uninitialized"}[$__range]) * scrape_interval
)

Most nodes have the taint for 30-60s. However, looking at the top 50 by wrapping the query in topk(50, ...), we see some outliers: the maximum in our case was 3090s, and the minimum of those top 50 was 210s, i.e. 3.5 minutes. This is IMO too long for "only" a label sync.

I've been investigating the code and some logs a bit to understand why that might happen in some cases. I haven't fully verified this yet, but I hope someone can confirm or reject my suspicion :)

So: a prerequisite for the label sync is the set of providerIDs, which gets populated by reconcileInfrastructure.

Looking at the CAPZ code related to readiness:
The ready state is only set when the underlying VMSS is in state succeeded and the replicas match.

This means that when we're scaling up or down, the InfraMP is not marked ready:

status:
  # ....
  infrastructureMachineKind: AzureMachinePoolMachine
  provisioningState: Updating
  ready: false
  replicas: 0

and in this case the nodes keep the uninitialized taint until the InfraMP is ready again.
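
To illustrate the suspicion, here is a simplified, hypothetical sketch of the gating (the field names follow the CAPI MachinePool API, but the function itself is illustrative and not the actual controller code):

import expv1 "sigs.k8s.io/cluster-api/exp/api/v1beta1"

// nodeRefsCanBeReconciled sketches the suspected precondition: node refs (and
// therefore taint removal) only proceed once the InfraMachinePool has reported
// ready and the providerIDList has been propagated to the MachinePool.
func nodeRefsCanBeReconciled(mp *expv1.MachinePool) bool {
	if !mp.Status.InfrastructureReady || len(mp.Spec.ProviderIDList) == 0 {
		// While the InfraMP reports ready=false (e.g. the VMSS is still
		// "Updating"), new Nodes keep node.cluster.x-k8s.io/uninitialized.
		return false
	}
	// Nodes can now be matched by provider ID, NodeRefs set, and the taint dropped.
	return true
}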

Now my questions are:

  1. Is this analysis correct?
  2. Is CAPZ handling the ready state correctly?
  3. Do we really need the InfraMP ready state for updating the providerIDList, or could we update it anyway so that the node taint also gets removed during scale-up or scale-down?

@k8s-ci-robot
Contributor

@mweibel: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@killianmuldoon
Contributor

@mweibel reopening

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 27, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@chrischdi
Member

While it sounds reasonable to check whether this can be improved, have you also been able to find out why it took that long for the underlying VMSS to be in state succeeded and for the replicas to match?

@mweibel
Contributor

mweibel commented Jun 27, 2024

If I understand the CAPZ code correctly, it only adds ready MachinePoolMachines to the count in AzureMachinePool.Status.Replicas. Given that code:

It depends on the OS, on the availability of the machines we need, and on the number of machines. We run batch workloads and sometimes scale up from close to zero (or zero) to 100 machines within a short amount of time.

If we, let's say, scale up from 0 to 10, and a machine takes 1-5 minutes to become ready (Linux machines usually take <1 min, while Windows machines average around 5 min):
If all 10 machines become ready in roughly that time, it takes in the worst case 5 minutes for the AzureMachinePool to be marked as ready again. It will be faster for a Linux MachinePool and can take longer for Windows MachinePools.

The worst case above with the 3090s might have been some other issue - I've investigated a bit but will need to investigate some more to figure it out.

Does that additional context help to make the issue more clear?

@chrischdi
Member

Makes sense. I think it's still worth finding the root cause, though.

Changing the above-mentioned behaviour in CAPI would mean changing the contract documented here: https://main.cluster-api.sigs.k8s.io/developer/architecture/controllers/machine-pool

Especially this part:

The status object must have at least one field defined:

  • ready - a boolean field indicating if the infrastructure is ready to be used or not.

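Purely as an illustration of that contract (a hypothetical Go shape, not copied from any provider), an InfraMachinePool status could look like:

// InfraMachinePoolStatus is a hypothetical status type; only Ready is mandated
// by the contract, Replicas is a common optional addition.
type InfraMachinePoolStatus struct {
	// Ready indicates the infrastructure is ready to be used.
	Ready bool `json:"ready"`
	// Replicas is the most recently observed number of provisioned replicas.
	Replicas int32 `json:"replicas,omitempty"`
}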

If we want to change how this works, we also need to check whether the other providers supporting MachinePools are okay with it, so we don't break things.

So getting to the questions:

  2. Is CAPZ handling the ready state correctly?

I think CAPZ is doing fine according to the contract.

Depending on which state CAPZ's MachinePool is in, CAPZ could consider setting the ready boolean to true earlier.
If I saw it right, CAPI still checks whether all nodes are already there.

  3. Do we really need the InfraMP ready state for updating the providerIDList, or could we update it anyway so that the node taint also gets removed during scale-up or scale-down?

I think this is defined in the contract, so yes, CAPI should not continue unless the ready state is set to true; however, it's the infra provider's decision when it signals to CAPI that it can continue.

Note: I'm not that deep into the machine pool controller and machine pools, so don't take this as a strong opinion. But changing the contract should IMHO be done very carefully. Also see the machine pool proposal, which has more detailed diagrams about it.

@sbueringer sbueringer added the area/machinepool Issues or PRs related to machinepools label Jul 2, 2024
@sbueringer
Member

cc @mboersma

@sbueringer
Member

In addition to this taking time, it also creates a requirement for the MC to be able to connect to the WC when new nodes are added. This means if you're using externally managed cluster-autoscaler (as in, running cluster-autoscaler in the workload cluster) and the MC is unavailable for any reason (network issue, upgrading, maintenance, etc) new nodes added by cluster-autoscaler will not become ready until the MC can connect again. This breaks one of the main reasons for wanting to run cluster-autoscaler in the WC in the first place - not having a reliance on the MC for scaling when needed.

In my opinion, with CABPK this requirement exists independently of the taint. When a KubeadmConfig is created, CABPK creates a bootstrap token secret within the workload cluster. It's unclear to me how this would work without access to the workload cluster.

@sbueringer
Member

sbueringer commented Jul 3, 2024

@AverageMarcus can you please share how you were able to create new nodes without connectivity with cluster-autoscaler and CABPK before we introduced the taint?

I'm also curious to know how the cluster-autoscaler is able to scale up a MD in the mgmt cluster if the mgmt cluster is not available (or are we assuming the connection is only broken in the other direction in that scenario?)

@fabriziopandini fabriziopandini added priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. and removed triage/needs-information Indicates an issue needs more information in order to work on it. labels Jul 3, 2024
@sbueringer
Member

sbueringer commented Jul 11, 2024

Shouldn't the Nodes show up in the workload cluster at some point, and shouldn't that trigger a reconcile of the Machine controller? If that isn't implemented, it should be; if it's buggy, it should be fixed.

See:

func (r *Reconciler) watchClusterNodes(ctx context.Context, cluster *clusterv1.Cluster) error {

EDIT: Sorry, I guess Christian already said roughly the same above.
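
For illustration, the general controller-runtime pattern for such a watch looks roughly like the sketch below. This is not the actual CAPI wiring: CAPI adds the Node watch dynamically per workload cluster (see watchClusterNodes and its cluster cache), the reconciler and mapping function here are simplified stand-ins, and exact builder/handler signatures vary between controller-runtime versions.

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// machineReconcilerSketch stands in for the Machine reconciler.
type machineReconcilerSketch struct{}

func (r *machineReconcilerSketch) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ...fetch the Machine, find its Node by provider ID, patch the Node to drop the taint.
	return ctrl.Result{}, nil
}

// nodeToMachine maps a Node event to the Machine(s) that should be reconciled (hypothetical mapping).
func (r *machineReconcilerSketch) nodeToMachine(ctx context.Context, o client.Object) []reconcile.Request {
	// ...look up the Machine whose spec.providerID matches the Node's provider ID.
	return nil
}

func setupSketch(mgr ctrl.Manager, r *machineReconcilerSketch) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&clusterv1.Machine{}).
		// Sketch only: in CAPI this watch targets the *workload* cluster's cache,
		// so a Node registering there enqueues a reconcile within milliseconds.
		Watches(&corev1.Node{}, handler.EnqueueRequestsFromMapFunc(r.nodeToMachine)).
		Complete(r)
}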

@paurosello

So cluster-autoscaler triggers the scaling via the AWS autoscaling group and never touches CAPI objects. So you are not using cluster-autoscaler in --cloud-provider=clusterapi mode as described in the docs, correct?

Sidenote: In this case, the management cluster still needs connectivity to the workload cluster to create or refresh the bootstrap token used.

Regarding "taking too long to remove the taint":

  • When cluster-autoscaler talks directly to AWS, CAPI may not get an event that triggers a reconciliation.

    • Probably one of the reasons why cluster-autoscaler has a special mode for clusterapi.
  • Because of this, it may take until the controller hits the configured sync period to get an event and reconcile the MachinePool to remove the taint from the node.

Yes, the bootstrap token would still be a blocker if the management cluster is down for more than 15 minutes, which I think is too short a period, but that would be for another issue :)

And yeah, I guess the issue is that the reconciliation loop can take a bit of time.

Would it make sense to only have this taint on the instances created by Machine CRs but not on the instances created via MachinePools?

@sbueringer
Member

sbueringer commented Jul 11, 2024

I would like to figure out first why the current implementation that should react immediately doesn't work (#9858 (comment))

There's also this dependency:

I'm also curious to know how the cluster-autoscaler is able to scale up a MD in the mgmt cluster if the mgmt cluster is not available (or are we assuming the connection is only broken in the other direction in that scenario?)
#9858 (comment)

@paurosello

I would like to figure out first why the current implementation that should react immediately doesn't work (#9858 (comment))

There's also this dependency:

I'm also curious to know how the cluster-autoscaler is able to scale up a MD in the mgmt cluster if the mgmt cluster is not available (or are we assuming the connection is only broken in the other direction in that scenario?)
#9858 (comment)

The current situation is that it works, but there is a very big delay, and my main gripe is that I don't see why the taint is needed at all, since the instances are fully operational.

I have no experience with MachineDeployments; we only use MachinePools.

@mweibel
Contributor

mweibel commented Jul 12, 2024

FYI I linked a CAPZ issue here for a related problem. I'm continuing to investigate this.

@chrischdi FYI:

So cluster-autoscaler triggers the scaling via the AWS autoscaling group and never touches CAPI objects. So you are not using cluster-autoscaler in --cloud-provider=clusterapi mode as described in the docs, correct?

In our case we use the cluster-autoscaler in clusterapi provider mode.

@paurosello

The current situation is that it works, but there is a very big delay, and my main gripe is that I don't see why the taint is needed at all, since the instances are fully operational.

I have the same feeling, but I also see why it needs to be there. We could probably avoid the taint if the MachinePool template doesn't specify labels - but I don't think this would be a good fix because it hides an underlying issue.

@paurosello

FYI I linked a CAPZ issue here for a related problem. I'm continuing to investigate this.

@chrischdi FYI:

So cluster-autoscaler triggers the scaling via the AWS autoscaling group and never touches CAPI objects. So you are not using cluster-autoscaler in --cloud-provider=clusterapi mode as described in the docs, correct?

In our case we use the cluster-autoscaler in clusterapi provider mode.

@paurosello

The current situation is that it works, but there is a very big delay, and my main gripe is that I don't see why the taint is needed at all, since the instances are fully operational.

I have the same feeling, but I also see why it needs to be there. We could probably avoid the taint if the MachinePool template doesn't specify labels - but I don't think this would be a good fix because it hides an underlying issue.

Yeah, I can see why it's useful, but I wonder if we could either make it optional or decrease the reconciliation loop time so new nodes are detected much faster.

@sbueringer
Member

sbueringer commented Jul 15, 2024

I would like to figure out first why the current implementation that should react immediately doesn't work (#9858 (comment))

Until then I'm -1 to fixes like disabling the feature or reducing the sync period. Just to be clear. With our current implementation we should immediately reconcile after a Node shows up in the workload cluster (this means within a few milliseconds the taint should be gone)

@paurosello

paurosello commented Jul 15, 2024

I would like to figure out first why the current implementation that should react immediately doesn't work (#9858 (comment))

Until then I'm -1 to fixes like disabling the feature or reducing the sync period. Just to be clear. With our current implementation we should immediately reconcile after a Node shows up in the workload cluster (this means within a few milliseconds the taint should be gone)

How should it reconcile immediately when the nodes are created by cluster-autoscaler directly talking to the AWS API? Aren't we limited by the sync-period (10 minutes by default) when using cluster-autoscaler deployed in the workload cluster?

@chrischdi
Member

I would like to figure out first why the current implementation that should react immediately doesn't work (#9858 (comment))

Until then I'm -1 to fixes like disabling the feature or reducing the sync period. Just to be clear. With our current implementation we should immediately reconcile after a Node shows up in the workload cluster (this means within a few milliseconds the taint should be gone)

How should it reconcile immediately when the nodes are created by cluster-autoscaler directly talking to the AWS API? Aren't we limited by the sync-period (10 minutes by default) when using cluster-autoscaler deployed in the workload cluster?

As Stefan pointed out: the controller should watch the Nodes in the cluster and get an event when the kubelet creates its node: #9858 (comment)

@paurosello

paurosello commented Jul 15, 2024

I would like to figure out first why the current implementation that should react immediately doesn't work (#9858 (comment))

Until then I'm -1 to fixes like disabling the feature or reducing the sync period. Just to be clear. With our current implementation we should immediately reconcile after a Node shows up in the workload cluster (this means within a few milliseconds the taint should be gone)

How should it reconcile immediately when the nodes are created by cluster-autoscaler directly talking to the AWS API? Aren't we limited by the sync-period (10 minutes by default) when using cluster-autoscaler deployed in the workload cluster?

As stefan pointed out: the controller should watch for the nodes in the cluster and get an event when the kubelet created its node: #9858 (comment)

But that is only executed every sync-period when there are no changes to the CRs, AFAIK.

@sbueringer
Member

sbueringer commented Jul 15, 2024

A Kubernetes watch does not depend on the sync period (you can try it with kubectl get -w). The apiserver pushes an event ~as soon as it occurs.

The sync period is just the interval at which the local list => broadcast is executed for all objects currently in the local store.
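
For illustration, a minimal client-go sketch of the same mechanism (assuming clientset points at the workload cluster):

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// watchNodes prints Node events as the apiserver pushes them. No polling or
// sync period is involved: an ADDED event arrives roughly as soon as the
// kubelet registers the Node.
func watchNodes(ctx context.Context, clientset kubernetes.Interface) error {
	w, err := clientset.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		if node, ok := ev.Object.(*corev1.Node); ok {
			fmt.Printf("%s %s\n", ev.Type, node.Name)
		}
	}
	return nil
}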

@paurosello

Will try to gather logs and events from controllers, but I think this watch is not really triggering the removal of the taint.

@sbueringer
Member

sbueringer commented Jul 15, 2024

That is exactly what I'm trying to say :). For some reason this watch does not seem to work, and the proper fix is to figure out why.

@paurosello

Here we can see it happening:

last node was provisioned at Mon, 15 Jul 2024 12:58:57 +0200
[screenshot]

The SuccessfulSetNodeRefs event was triggered at Mon, 15 Jul 2024 13:06:00 +0200
[screenshot]

@sbueringer
Member

sbueringer commented Jul 15, 2024

These events are not the same as the events pushed by the Kubernetes watch API.

@paurosello

paurosello commented Jul 15, 2024

This is the event registered on the workload cluster:

Name:             ip-10-0-135-179.eu-west-2.compute.internal.17e25dcbd9a81aa1
Namespace:        default
Labels:           <none>
Annotations:      <none>
API Version:      v1
Count:            1
Event Time:       <nil>
First Timestamp:  2024-07-15T10:58:57Z
Involved Object:
  API Version:   v1
  Kind:          Node
  Name:          ip-10-0-135-179.eu-west-2.compute.internal
  UID:           85c04d10-ca6c-40e9-bf6e-86cd1762c026
Kind:            Event
Last Timestamp:  2024-07-15T10:58:57Z
Message:         Node ip-10-0-135-179.eu-west-2.compute.internal event: Registered Node ip-10-0-135-179.eu-west-2.compute.internal in Controller
Metadata:
  Creation Timestamp:  2024-07-15T10:58:57Z
  Resource Version:    16429
  UID:                 e949ab3e-d9dc-47b5-ba48-a9699f6ca68f
Reason:                RegisteredNode
Reporting Component:   node-controller
Reporting Instance:    
Source:
  Component:  node-controller
Type:         Normal
Events:       <none>

@sbueringer
Member

None of our controllers are watching these Event objects; they are using the watch API. Please see https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes

@fabriziopandini
Member

+1 to investigate why existing watches on Node objects are not working as expected

@paurosello

paurosello commented Jul 16, 2024

I have some additional information here:

New node created (ip-10-0-204-6.eu-west-2.compute.internal, aws:///eu-west-2c/i-088c9f73c6f2ae2dc) at Tue, 16 Jul 2024 12:51:29 +0200

We can see multiple reconciliation loops:

| 2024-07-16 12:52:47.642 | I0716 10:52:47.642992       1 machinepool_controller_noderef.go:72] "node_taint_issue MachinePool has the required number of NodeRefs" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="org-giantswarm/pau06-nodepool0" namespace="org-giantswarm" name="pau06-nodepool0" reconcileID="a57c52d5-d5b4-4955-84f6-d10a9012fc84" Cluster="org-giantswarm/pau06" NodeRefs=5 |  
| 2024-07-16 12:52:47.642 | I0716 10:52:47.642915       1 machinepool_controller_noderef.go:66] "node_taint_issue MachinePool Status" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="org-giantswarm/pau06-nodepool0" namespace="org-giantswarm" name="pau06-nodepool0" reconcileID="a57c52d5-d5b4-4955-84f6-d10a9012fc84" Cluster="org-giantswarm/pau06" NodeRefs={"nodeRefs":[{"kind":"Node","name":"ip-10-0-127-171.eu-west-2.compute.internal","uid":"85c3540b-8a63-4cbf-8cdd-75af17278d48","apiVersion":"v1"},{"kind":"Node","name":"ip-10-0-68-84.eu-west-2.compute.internal","uid":"403a48e8-1f12-4dfd-b843-73f9dc157fc6","apiVersion":"v1"},{"kind":"Node","name":"ip-10-0-239-183.eu-west-2.compute.internal","uid":"37c10c3b-ea47-4292-8ae3-5845272a0bc0","apiVersion":"v1"},{"kind":"Node","name":"ip-10-0-160-170.eu-west-2.compute.internal","uid":"6a6c4f88-d29a-4ca1-b8a5-1bc165e18841","apiVersion":"v1"},{"kind":"Node","name":"ip-10-0-144-226.eu-west-2.compute.internal","uid":"99ee0049-4d1a-40a1-bc97-68bfe3e2ebbd","apiVersion":"v1"}],"replicas":5,"readyReplicas":5,"availableReplicas":5,"phase":"Running","bootstrapReady":true,"infrastructureReady":true,"observedGeneration":18,"conditions":[{"type":"Ready","status":"True","lastTransitionTime":"2024-07-16T10:49:17Z"},{"type":"BootstrapReady","status":"True","lastTransitionTime":"2024-07-16T07:17:38Z"},{"type":"InfrastructureReady","status":"True","lastTransitionTime":"2024-07-16T07:17:41Z"},{"type":"ReplicasReady","status":"True","lastTransitionTime":"2024-07-16T10:49:17Z"}]} |  
| 2024-07-16 12:52:47.642 | I0716 10:52:47.642881       1 machinepool_controller_noderef.go:65] "node_taint_issue MachinePool Spec" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="org-giantswarm/pau06-nodepool0" namespace="org-giantswarm" name="pau06-nodepool0" reconcileID="a57c52d5-d5b4-4955-84f6-d10a9012fc84" Cluster="org-giantswarm/pau06" Spec={"clusterName":"pau06","replicas":5,"template":{"metadata":{},"spec":{"clusterName":"pau06","bootstrap":{"configRef":{"kind":"KubeadmConfig","namespace":"org-giantswarm","name":"pau06-nodepool0-0d449","apiVersion":"bootstrap.cluster.x-k8s.io/v1beta1"},"dataSecretName":"pau06-nodepool0-0d449"},"infrastructureRef":{"kind":"AWSMachinePool","namespace":"org-giantswarm","name":"pau06-nodepool0","apiVersion":"infrastructure.cluster.x-k8s.io/v1beta2"},"version":"v1.25.16"}},"minReadySeconds":0,"providerIDList":["aws:///eu-west-2a/i-04281ba43c2bdbd79","aws:///eu-west-2a/i-0712ed4ef7cafc9bb","aws:///eu-west-2c/i-091c973916182b381","aws:///eu-west-2b/i-091f25ac9f8876f4f","aws:///eu-west-2b/i-0988e25c080783176"]} |  
| 2024-07-16 12:52:47.642 | I0716 10:52:47.642848       1 machinepool_controller_noderef.go:51] "node_taint_issue Reconciling MachinePool NodeRefs" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="org-giantswarm/pau06-nodepool0" namespace="org-giantswarm" name="pau06-nodepool0" reconcileID="a57c52d5-d5b4-4955-84f6-d10a9012fc84" Cluster="org-giantswarm/pau06" cluster="pau06" machinepool="pau06-nodepool0"
| 2024-07-16 12:55:14.499 | I0716 10:55:14.499423       1 machinepool_controller_noderef.go:72] "node_taint_issue MachinePool has the required number of NodeRefs" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="org-giantswarm/pau06-nodepool0" namespace="org-giantswarm" name="pau06-nodepool0" reconcileID="d7a36e07-fa1b-444e-8196-f7bc6a37e36d" Cluster="org-giantswarm/pau06" NodeRefs=5 |  
| 2024-07-16 12:55:14.499 | I0716 10:55:14.499398       1 machinepool_controller_noderef.go:66] "node_taint_issue MachinePool Status" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="org-giantswarm/pau06-nodepool0" namespace="org-giantswarm" name="pau06-nodepool0" reconcileID="d7a36e07-fa1b-444e-8196-f7bc6a37e36d" Cluster="org-giantswarm/pau06" NodeRefs={"nodeRefs":[{"kind":"Node","name":"ip-10-0-127-171.eu-west-2.compute.internal","uid":"85c3540b-8a63-4cbf-8cdd-75af17278d48","apiVersion":"v1"},{"kind":"Node","name":"ip-10-0-68-84.eu-west-2.compute.internal","uid":"403a48e8-1f12-4dfd-b843-73f9dc157fc6","apiVersion":"v1"},{"kind":"Node","name":"ip-10-0-239-183.eu-west-2.compute.internal","uid":"37c10c3b-ea47-4292-8ae3-5845272a0bc0","apiVersion":"v1"},{"kind":"Node","name":"ip-10-0-160-170.eu-west-2.compute.internal","uid":"6a6c4f88-d29a-4ca1-b8a5-1bc165e18841","apiVersion":"v1"},{"kind":"Node","name":"ip-10-0-144-226.eu-west-2.compute.internal","uid":"99ee0049-4d1a-40a1-bc97-68bfe3e2ebbd","apiVersion":"v1"}],"replicas":5,"readyReplicas":5,"availableReplicas":5,"phase":"Running","bootstrapReady":true,"infrastructureReady":true,"observedGeneration":18,"conditions":[{"type":"Ready","status":"True","lastTransitionTime":"2024-07-16T10:49:17Z"},{"type":"BootstrapReady","status":"True","lastTransitionTime":"2024-07-16T07:17:38Z"},{"type":"InfrastructureReady","status":"True","lastTransitionTime":"2024-07-16T07:17:41Z"},{"type":"ReplicasReady","status":"True","lastTransitionTime":"2024-07-16T10:49:17Z"}]} |  
| 2024-07-16 12:55:14.499 | I0716 10:55:14.499364       1 machinepool_controller_noderef.go:65] "node_taint_issue MachinePool Spec" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="org-giantswarm/pau06-nodepool0" namespace="org-giantswarm" name="pau06-nodepool0" reconcileID="d7a36e07-fa1b-444e-8196-f7bc6a37e36d" Cluster="org-giantswarm/pau06" Spec={"clusterName":"pau06","replicas":5,"template":{"metadata":{},"spec":{"clusterName":"pau06","bootstrap":{"configRef":{"kind":"KubeadmConfig","namespace":"org-giantswarm","name":"pau06-nodepool0-0d449","apiVersion":"bootstrap.cluster.x-k8s.io/v1beta1"},"dataSecretName":"pau06-nodepool0-0d449"},"infrastructureRef":{"kind":"AWSMachinePool","namespace":"org-giantswarm","name":"pau06-nodepool0","apiVersion":"infrastructure.cluster.x-k8s.io/v1beta2"},"version":"v1.25.16"}},"minReadySeconds":0,"providerIDList":["aws:///eu-west-2a/i-04281ba43c2bdbd79","aws:///eu-west-2a/i-0712ed4ef7cafc9bb","aws:///eu-west-2c/i-091c973916182b381","aws:///eu-west-2b/i-091f25ac9f8876f4f","aws:///eu-west-2b/i-0988e25c080783176"]} |  
| 2024-07-16 12:55:14.499 | I0716 10:55:14.499328       1 machinepool_controller_noderef.go:51] "node_taint_issue Reconciling MachinePool NodeRefs" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="org-giantswarm/pau06-nodepool0" namespace="org-giantswarm" name="pau06-nodepool0" reconcileID="d7a36e07-fa1b-444e-8196-f7bc6a37e36d" Cluster="org-giantswarm/pau06" cluster="pau06" machinepool="pau06-nodepool0"

Finally, at 12:56 it gets detected:

| 2024-07-16 12:56:35.363 | I0716 10:56:35.363639       1 machinepool_controller_noderef.go:66] "node_taint_issue MachinePool Status" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="org-giantswarm/pau06-nodepool0" namespace="org-giantswarm" name="pau06-nodepool0" reconcileID="6d5c9c46-a764-4b53-8d9b-251b501cd453" Cluster="org-giantswarm/pau06" NodeRefs={"nodeRefs":[{"kind":"Node","name":"ip-10-0-127-171.eu-west-2.compute.internal","uid":"85c3540b-8a63-4cbf-8cdd-75af17278d48","apiVersion":"v1"},{"kind":"Node","name":"ip-10-0-68-84.eu-west-2.compute.internal","uid":"403a48e8-1f12-4dfd-b843-73f9dc157fc6","apiVersion":"v1"},{"kind":"Node","name":"ip-10-0-204-6.eu-west-2.compute.internal","uid":"4a96de7a-64bf-4511-a804-432cea488595","apiVersion":"v1"},{"kind":"Node","name":"ip-10-0-239-183.eu-west-2.compute.internal","uid":"37c10c3b-ea47-4292-8ae3-5845272a0bc0","apiVersion":"v1"},{"kind":"Node","name":"ip-10-0-160-170.eu-west-2.compute.internal","uid":"6a6c4f88-d29a-4ca1-b8a5-1bc165e18841","apiVersion":"v1"},{"kind":"Node","name":"ip-10-0-144-226.eu-west-2.compute.internal","uid":"99ee0049-4d1a-40a1-bc97-68bfe3e2ebbd","apiVersion":"v1"}],"replicas":6,"readyReplicas":6,"availableReplicas":6,"unavailableReplicas":-1,"phase":"Running","bootstrapReady":true,"infrastructureReady":true,"observedGeneration":19,"conditions":[{"type":"Ready","status":"False","severity":"Info","lastTransitionTime":"2024-07-16T10:56:35Z","reason":"WaitingForReplicasReady"},{"type":"BootstrapReady","status":"True","lastTransitionTime":"2024-07-16T07:17:38Z"},{"type":"InfrastructureReady","status":"True","lastTransitionTime":"2024-07-16T07:17:41Z"},{"type":"ReplicasReady","status":"False","severity":"Info","lastTransitionTime":"2024-07-16T10:56:35Z","reason":"WaitingForReplicasReady"}]} |  
| 2024-07-16 12:56:35.363 | I0716 10:56:35.363605       1 machinepool_controller_noderef.go:65] "node_taint_issue MachinePool Spec" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="org-giantswarm/pau06-nodepool0" namespace="org-giantswarm" name="pau06-nodepool0" reconcileID="6d5c9c46-a764-4b53-8d9b-251b501cd453" Cluster="org-giantswarm/pau06" Spec={"clusterName":"pau06","replicas":6,"template":{"metadata":{},"spec":{"clusterName":"pau06","bootstrap":{"configRef":{"kind":"KubeadmConfig","namespace":"org-giantswarm","name":"pau06-nodepool0-0d449","apiVersion":"bootstrap.cluster.x-k8s.io/v1beta1"},"dataSecretName":"pau06-nodepool0-0d449"},"infrastructureRef":{"kind":"AWSMachinePool","namespace":"org-giantswarm","name":"pau06-nodepool0","apiVersion":"infrastructure.cluster.x-k8s.io/v1beta2"},"version":"v1.25.16"}},"minReadySeconds":0,"providerIDList":["aws:///eu-west-2a/i-04281ba43c2bdbd79","aws:///eu-west-2a/i-0712ed4ef7cafc9bb","aws:///eu-west-2c/i-088c9f73c6f2ae2dc","aws:///eu-west-2c/i-091c973916182b381","aws:///eu-west-2b/i-091f25ac9f8876f4f","aws:///eu-west-2b/i-0988e25c080783176"]} | 
| 2024-07-16 12:56:35.363 | I0716 10:56:35.363570       1 machinepool_controller_noderef.go:51] "node_taint_issue Reconciling MachinePool NodeRefs" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="org-giantswarm/pau06-nodepool0" namespace="org-giantswarm" name="pau06-nodepool0" reconcileID="6d5c9c46-a764-4b53-8d9b-251b501cd453" Cluster="org-giantswarm/pau06" cluster="pau06" machinepool="pau06-nodepool0" |  
| 2024-07-16 12:56:35.282 | I0716 10:56:35.282875       1 machinepool_controller_noderef.go:241] "node_taint_issue Patching node to set annotations and drop taints" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="org-giantswarm/pau06-nodepool0" namespace="org-giantswarm" name="pau06-nodepool0" reconcileID="c91e110a-f8fd-45e3-bd2f-e3c123ba7cc3" Cluster="org-giantswarm/pau06" node name="ip-10-0-204-6.eu-west-2.compute.internal"

Based on what I can see, it seems like nodeRefs are not being updated soon enough and the code is following this path

conditions.MarkTrue(mp, expv1.ReplicasReadyCondition)
until something updates the NodeRefs.

I see that the NodeRefs are updated later in the code in

nodeRefsResult, err := r.getNodeReferences(ctx, clusterClient, mp.Spec.ProviderIDList, mp.Spec.MinReadySeconds)

based on the ProviderIDList.

The ProviderIDList is updated in

func (r *MachinePoolReconciler) reconcileInfrastructure(ctx context.Context, cluster *clusterv1.Cluster, mp *expv1.MachinePool) (ctrl.Result, error) {

I'm currently adding some logging here to figure out when this is called.

I also added some logging to the nodeToMachinePool function, which should be called from watchClusterNodes, and it is never called.

@paurosello

The current issue does not seem to be related to the CAPI controller but to the CAPA controller: this function is only run every 10 minutes and is the one that updates the ProviderIDList.

https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/c383c6ea5c7260b037c35df8244741606862b407/exp/controllers/awsmachinepool_controller.go#L205

@fabriziopandini
Member

Great job triaging this issue!
Based on the evidence above, is it OK to close this issue (you should probably open a new one in CAPA)?

@paurosello

paurosello commented Jul 16, 2024 via email

@fabriziopandini
Member

(I think it is a yes)
/close

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

(I think it is a yes)
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@AndiDog
Contributor

AndiDog commented Jul 17, 2024

This CAPA problem is basically the same issue: kubernetes-sigs/cluster-api-provider-aws#4618
