talos_cluster_health times out when adding new nodes #221

Open
rmvangun opened this issue Dec 19, 2024 · 4 comments

Comments

@rmvangun

rmvangun commented Dec 19, 2024

When creating a basic Talos cluster with Terraform, I think there's a logical problem that makes health checks fail in the plan phase when adding new nodes. The talos_cluster_health data source takes a list of all control plane and worker nodes. On the first run this is fine: the nodes are configured and the health check passes. On a later run, say you increase the worker node count. Now the health check times out in the plan phase, because it expects the additional worker node to be available, but that node only receives its machine configuration in the apply step.

So, the health check runs during the plan phase, expects a new healthy worker node that has yet to be configured, and so fails the check.

Here's the cluster health config:

data "talos_cluster_health" "this" {
  depends_on = [talos_cluster_kubeconfig.this]

  client_configuration   = talos_machine_secrets.this.client_configuration
  control_plane_nodes    = module.controlplanes.*.node
  worker_nodes           = module.workers.*.node
  endpoints              = module.controlplanes.*.endpoint
}

Add a new node, and the plan times out with: data.talos_cluster_health.this: Still reading... [20s elapsed]

I'm not sure how to fix this. I wish the health check knew not to check nodes that have yet to be configured during the plan phase.
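One idea that may be worth trying (a sketch, not a confirmed fix): Terraform defers reading a data source until apply when anything listed in its depends_on has pending changes. Assuming the machine configuration for each node is applied by resources inside module.controlplanes and module.workers (the issue does not show the module internals), listing those modules in depends_on should keep the health check out of the plan phase whenever a node is being added.

```hcl
data "talos_cluster_health" "this" {
  # Assumption: the resources that apply machine configuration live inside
  # these two modules. When either module has pending changes (for example,
  # a new worker), Terraform postpones this read until after they are applied.
  depends_on = [
    talos_cluster_kubeconfig.this,
    module.controlplanes,
    module.workers,
  ]

  client_configuration = talos_machine_secrets.this.client_configuration
  control_plane_nodes  = module.controlplanes[*].node
  worker_nodes         = module.workers[*].node
  endpoints            = module.controlplanes[*].endpoint
}
```

The trade-off is that on those runs anything consuming this data source sees its attributes as "known after apply", and the check still runs during plan when no node changes are pending.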

@nebula-it

Facing the exact same issue here. We need the health check only on the first run, to confirm the cluster is up before deploying the CNI. There should be a skip_node_check option so that it doesn't check the nodes at all and just makes sure the cluster endpoint is healthy.

@nebula-it

That does not work. It keeps trying to connect to the IP of the new worker node (we pass the IPs of all worker nodes, calculated in a local variable, so the new worker's IP is in there) and then times out.
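Since the timeout comes from the new worker's IP being in the list, another thing to try is a check that names only the control plane nodes. The sketch below is untested and rests on two assumptions: that worker_nodes can be left out of the data source, and that your provider version supports the skip_kubernetes_checks attribute. The name "bootstrap" is only illustrative.

```hcl
# Hypothetical control-plane-only check for gating the CNI deployment.
data "talos_cluster_health" "bootstrap" {
  depends_on = [talos_cluster_kubeconfig.this]

  client_configuration = talos_machine_secrets.this.client_configuration
  control_plane_nodes  = module.controlplanes[*].node
  endpoints            = module.controlplanes[*].endpoint

  # worker_nodes is intentionally omitted, so a not-yet-configured worker
  # is never contacted.

  # Assumption: available in your provider version. Skips the Kubernetes
  # component checks, which cannot pass before a CNI is installed anyway.
  skip_kubernetes_checks = true
}
```

This does not verify worker health at all; it only answers whether the control plane is up.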

@ionfury

ionfury commented Jan 20, 2025

I have the same issue. I was not able to find a workaround with the talos_cluster_health data source.

I reverted to using local-exec provisioners to check cluster health in a scaling-compatible manner (link).
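For readers who want to try the same route, here is one possible shape of a local-exec based check. It is not ionfury's actual code. It assumes talosctl is installed where Terraform runs, Terraform 1.4 or later (for terraform_data), and a hypothetical variable var.talosconfig_path pointing at a talosconfig file on disk. Because provisioners only run during apply, the plan phase never contacts the nodes.

```hcl
# Hypothetical input: path to a talosconfig file for this cluster.
variable "talosconfig_path" {
  type = string
}

resource "terraform_data" "cluster_health" {
  # Re-run the check whenever the set of nodes changes.
  triggers_replace = {
    control_plane_nodes = join(",", module.controlplanes[*].node)
    worker_nodes        = join(",", module.workers[*].node)
  }

  # Assumption: machine configuration is applied inside these modules,
  # so the check starts only after new nodes have been configured.
  depends_on = [module.controlplanes, module.workers]

  provisioner "local-exec" {
    command = <<-EOT
      talosctl health \
        --talosconfig "$TALOSCONFIG_PATH" \
        --nodes "$FIRST_CP_NODE" \
        --endpoints "$ENDPOINTS" \
        --control-plane-nodes "$CP_NODES" \
        --worker-nodes "$WORKER_NODES" \
        --wait-timeout 10m
    EOT

    environment = {
      TALOSCONFIG_PATH = var.talosconfig_path
      FIRST_CP_NODE    = module.controlplanes[0].node
      ENDPOINTS        = join(",", module.controlplanes[*].endpoint)
      CP_NODES         = join(",", module.controlplanes[*].node)
      WORKER_NODES     = join(",", module.workers[*].node)
    }
  }
}
```

Resources that must wait for a healthy cluster can then list terraform_data.cluster_health in their own depends_on.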
