
Unable to progress past BGP peering step on Anthos 1.11.2 #84

Open
cmluciano opened this issue Jul 8, 2022 · 4 comments

I am trying to test out Anthos 1.11.2 so that I can leverage some newer features that take advantage of Equinix Metal's SR-IOV support on the bare-metal hardware. My preference is to use centos_8 as the node OS, and I patched the script to get past some errors I was hitting (I can send the patch in a PR), but the issue also reproduces on the default ubuntu_20_04 image, so it does not appear to be OS-related.

terraform.tfvars

// this should be your personal token, not the project token
metal_auth_token = "sanitized"
metal_organization_id = "sanitized"
metal_project_id = "sanitized"
// don't create a new project, use an existing
metal_create_project = false
gcp_project_id = "sanitized"
cluster_name = "anthos-metal-1"
// 1.11.X is necessary to get the latest multi-nic pieces for sriov
anthos_version = "1.11.2"
// ideally we want rhel_7 here but saw a couple bugs for rhel
// operating_system = "rhel_8"
// operating_system = "centos_8"
operating_system = "ubuntu_20_04"
facility = "dc13"

The apply makes it through the null_resource.kube_vip_install_first_cp step, but null_resource.deploy_anthos_cluster never completes. I've even let it run overnight, and it was still going after 15 hours.

null_resource.kube_vip_install_first_cp (remote-exec): /root/bootstrap/vip.yaml FOUND!
null_resource.kube_vip_install_first_cp (remote-exec): BGP peering initiated! Cluster should be completed in about 5 minutes.
null_resource.kube_vip_install_first_cp: Creation complete after 9m23s [id=7216402651719392522]
***
***
***
null_resource.deploy_anthos_cluster: Still creating... [15h22m26s elapsed]
null_resource.deploy_anthos_cluster: Still creating... [15h22m36s elapsed]
null_resource.deploy_anthos_cluster: Still creating... [15h22m46s elapsed]
null_resource.deploy_anthos_cluster: Still creating... [15h22m56s elapsed]
^CStopping operation...
Interrupt received.
Please wait for Terraform to exit or data loss may occur.
Gracefully shutting down...
╷
│ Error: execution halted
│ 
│ Error: remote-exec provisioner error
│ 
│   with null_resource.deploy_anthos_cluster,
│   on main.tf line 239, in resource "null_resource" "deploy_anthos_cluster":
│  239:   provisioner "remote-exec" {
│ 
│ error executing "/tmp/terraform_925104650.sh": wait: remote command exited without exit status or exit signal
╵

Since my values are fairly standard apart from the newer Anthos version, I presume the issue is likely a change in the BGP peering setup that hasn't been accounted for yet.

displague (Member) commented Jul 11, 2022

I was able to confirm this problem with the latest main after ~2h of waiting on null_resource.deploy_anthos_cluster.

displague (Member) commented Jul 11, 2022

The script was stuck looping on the following error, found in /root/baremetal/cluster_create.log (the log of the /root/bootstrap/create_cluster.sh run):

Waiting for cluster to become ready: Internal error occurred: failed calling webhook "vvmruntime.kb.io": failed to call webhook: Post "https://vmruntime-webhook-service.vm-system.svc:443/validate-vm-cluster-gke-io-v1-vmruntime?timeout=10s": dial tcp 172.31.79.252:443: connect: connection refused⠋

root@eqnx-metal-gke-g5klw-cp-01:~# kubectl  get ValidatingWebhookConfiguration -A
NAME                                                    WEBHOOKS   AGE
capi-validating-webhook-configuration                   5          91m
cert-manager-webhook                                    1          92m
clientconfig-admission-webhook                          1          92m
clusterdns-webhook                                      1          92m
net-attach-def-admission-controller-validating-config   1          92m
validating-webhook-configuration                        10         91m
validation-webhook.snapshot.storage.k8s.io              1          92m
vmruntime-validating-webhook-configuration              1          92m
root@eqnx-metal-gke-g5klw-cp-01:~# kubectl  get all  -n vm-system
NAME                                                READY   STATUS    RESTARTS   AGE
pod/vmruntime-controller-manager-67775946fb-f9rlt   0/2     Pending   0          93m

NAME                                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/vmruntime-webhook-service   ClusterIP   172.31.79.252   <none>        443/TCP   93m

NAME                                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/vmruntime-controller-manager   0/1     1            0           93m

NAME                                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/vmruntime-controller-manager-67775946fb   1         1         0       93m

The pod is failing to start with:

Warning FailedScheduling 3m27s (x90 over 93m) default-scheduler 0/1 nodes are available: 1 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate.
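
For anyone reproducing this, the taint itself is easy to confirm from the first control-plane node (a minimal check, assuming the admin kubeconfig is already set up there; the node name will differ per cluster):

# Show the taints on each node; the cloud provider's "uninitialized"
# taint should still be present on the single control-plane node.
kubectl describe nodes | grep -A 2 'Taints:'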

cmluciano (Author) commented

Thanks for taking a look @displague. Do you think it might be sufficient to just patch vmruntime-controller-manager to allow scheduling on nodes that have this taint, or does it need to be up and responding for some of the BGP setup?

displague (Member) commented

@cmluciano Yes, I think patching should be sufficient to get things started. I'm not sure about the dependencies of vmruntime-controller-manager, but this sounds like a good thing to try first.
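
A rough, untested sketch of that kind of patch (a strategic-merge patch via kubectl; note that it replaces whatever tolerations the deployment already has, so check the current spec first):

# Hypothetical workaround: tolerate the cloud provider's "uninitialized"
# taint so the vmruntime webhook pod can be scheduled before the taint
# is cleared. This overwrites the deployment's existing tolerations.
kubectl -n vm-system patch deployment vmruntime-controller-manager \
  --patch '{"spec":{"template":{"spec":{"tolerations":[{"key":"node.cloudprovider.kubernetes.io/uninitialized","value":"true","effect":"NoSchedule"}]}}}}'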

I wonder if the upstream Anthos project might consider adding a toleration for node.cloudprovider.kubernetes.io/uninitialized: true to vmruntime-controller-manager. Thoughts, @c0dyhi11?

In this case, vmruntime-controller-manager is waiting for the cloud-provider taint to be cleared, which will not happen until after vmruntime-controller-manager succeeds. 🐔 🥚
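
Another way to break the cycle, purely as a guess and not something verified here, would be to clear the taint by hand and let the scheduler retry (whatever set the taint may put it back):

# Hypothetical: remove the uninitialized taint from the control-plane node
# (substitute the actual node name) so the pending pod can schedule.
kubectl taint nodes eqnx-metal-gke-g5klw-cp-01 \
  node.cloudprovider.kubernetes.io/uninitialized:NoSchedule-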
