Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for testing aws-hosted-cp #280

Merged
merged 3 commits into from
Sep 17, 2024
Merged

Add support for testing aws-hosted-cp #280

merged 3 commits into from
Sep 17, 2024

Conversation

squizzi
Copy link

@squizzi squizzi commented Sep 5, 2024

Note: This PR was intentionally created on a repository branch instead of my fork so we can iterate on the e2e testing workflow and then change it to pull_request_target for merge.


  • Break KubeClient helpers into provider specific file
  • Finish aws-hosted-cp test and add comments through test to make it
    easier to understand.
  • Use GinkgoHelper across e2e tests, populate hosted vars from
    AWSCluster.
  • No longer rely on local registry for images in test/e2e.
  • Support OS for awscli install.
  • Prepend hostname to collected log artifacts.
  • Support no cleanup of provider specs, differentiate ci
    cluster names.
  • Add docs on running tests, do not wait for all providers
    if configured.
  • Reinstantiate resource validation map on each instance of
    validation.
  • Enable the external-gc feature via annotation, featureGate
    bool. (Closes: AWS provider can't delete VPC if dependencies are present #152)
  • Bump aws-*-cp templates to 0.1.3
  • Bump cluster-api-provider-aws template to 0.1.2
  • Improve test logging to log template name and validation
    phase.
  • Bump k0s version to v1.30.4+k0s.0, set CCM nodeSelector to
    null for aws-hosted-cp. (Closes: Nodes bootstrapped via aws-hosted-cp get stuck with node.cloudprovider.kubernetes.io/uninitialized taint #290)
  • Break cleanup into seperate job so that it is unaffected by
    concurrency group cancellations.

Closes: #212

@squizzi squizzi added the test e2e Runs the entire provider E2E test suite, controller E2E tests are always ran label Sep 5, 2024
@squizzi squizzi marked this pull request as draft September 5, 2024 23:14
@squizzi squizzi force-pushed the aws-hosted-tests branch 4 times, most recently from 35fca5e to 8e3465d Compare September 9, 2024 23:38
@squizzi squizzi marked this pull request as ready for review September 10, 2024 04:18
@squizzi

This comment was marked as outdated.

@squizzi
Copy link
Author

squizzi commented Sep 10, 2024

I was able to fix the nuke behavior so we reliably cleanup EIP's now which helps the failed to allocate EIP behavior but concurrent running tests will fail because of the allocation limit, I'll try to talk to whomever to get that increase.

I just need to figure out #290 and this should be done. It's definitely ready for review.

@squizzi squizzi force-pushed the aws-hosted-tests branch 3 times, most recently from 01732f0 to 7da3be0 Compare September 10, 2024 19:06
@squizzi
Copy link
Author

squizzi commented Sep 11, 2024

Currently failing in Deletion validation and reproducing the VPC dependency violation bug, I'll probably comment those tests out with a fixme because they'll never pass otherwise.

@slysunkin
Copy link
Contributor

Currently failing in Deletion validation and reproducing the VPC dependency violation bug, I'll probably comment those tests out with a fixme because they'll never pass otherwise.

We already have bug #152 for AWS VPC

@squizzi
Copy link
Author

squizzi commented Sep 11, 2024

We already have bug #152 for AWS VPC

Thanks for adding it, I knew of it but was on my phone and couldn't throw it on here when I typed that comment.

@squizzi
Copy link
Author

squizzi commented Sep 11, 2024

Heads up to reviewers: The tests are failing as they've revealed (or encountered existing) issues: #152 and #290. The testing code itself isn't problematic though.

@squizzi
Copy link
Author

squizzi commented Sep 12, 2024

I've removed the CCM tests for the hosted-cp temporarily so we can test cluster deletions via CI. I intend to bring them back before merge.

@squizzi squizzi force-pushed the aws-hosted-tests branch 2 times, most recently from 7c6d4d9 to 1292207 Compare September 12, 2024 23:39
@squizzi squizzi force-pushed the aws-hosted-tests branch 2 times, most recently from 6f69f86 to 37c26c5 Compare September 13, 2024 00:39
@squizzi
Copy link
Author

squizzi commented Sep 13, 2024

It appears cancelled tests will not run the cleanup resources job, which is concerning since we could get into a state where our test gets cancelled by a subsequent PR push and then the AWS resources are never cleaned. I'll need to look into the best way to handle this case since the if: always() declaration doesn't appear to work in this situation.

@squizzi
Copy link
Author

squizzi commented Sep 13, 2024

It appears cancelled tests will not run the cleanup resources job, which is concerning since we could get into a state where our test gets cancelled by a subsequent PR push and then the AWS resources are never cleaned. I'll need to look into the best way to handle this case since the if: always() declaration doesn't appear to work in this situation.

I think I have an idea of how to fix this, I can make the e2etest job concurrent and then have a cleanup job that runs as always(), we'll see if this works...

@DinaBelova DinaBelova linked an issue Sep 13, 2024 that may be closed by this pull request
@squizzi
Copy link
Author

squizzi commented Sep 14, 2024

I've squash and pushed all of the changes since no one has reviewed this yet. In summary the squashed commit resolves all of the outstanding issues and fixes the cancellation issue. We just need @slysunkin 's merge in #242 so I can rebase and that should fix the deletions that are stuck for both hosted and standalone templates ... hopefully

Looking forward to that green checkmark.

@squizzi
Copy link
Author

squizzi commented Sep 14, 2024

I put the other make steps for dev-aws-nuke in the PHONY nooooooo 🤦‍♂️

@squizzi
Copy link
Author

squizzi commented Sep 14, 2024

I pulled @slysunkin 's changes in and CI is running, hopefully things are good 🤞

@squizzi
Copy link
Author

squizzi commented Sep 14, 2024

Last test failed in secret generation, that's a new one.

E0914 20:07:27.884717       1 controller.go:329] "Reconciler error" err="failed to add &Node{ObjectMeta:{      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,},} watch on cluster hmc-system/ci-1726339710-e2e-test-hosted: failed to create cluster accessor: error fetching REST client config for remote cluster \"hmc-system/ci-1726339710-e2e-test-hosted\": failed to retrieve kubeconfig secret for Cluster hmc-system/ci-1726339710-e2e-test-hosted: Secret \"ci-1726339710-e2e-test-hosted-kubeconfig\" not found" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="hmc-system/ci-1726339710-e2e-test-hosted-md-6dx4q-cnzvf" namespace="hmc-system" name="ci-1726339710-e2e-test-hosted-md-6dx4q-cnzvf" reconcileID="e4dfb466-766a-4267-9c3d-ca3b30e0992d"

It's consistently failing here after a 2nd run, so it's not flake. :( I was able to get past the this phase of validation prior to the rebase so not sure what's changed but the control plane isn't readying now.

k0s etcd is stuck in Pending state with the following events:

Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  45m (x2 over 55m)  default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: context deadline exceeded
  Warning  FailedScheduling  35m                default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: context deadline exceeded
  Warning  FailedScheduling  15m (x2 over 25m)  default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: context deadline exceeded
  Warning  FailedScheduling  4m56s              default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: context deadline exceeded

The ebs-csi daemonset is up and running on the standalone cluster where this is deploying and the csi-driver validation phase completed successfully:

NAME           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
ebs-csi-node   2         2         2       2            2           kubernetes.io/os=linux   133m

I'm seeing failed to provision volume with StorageClass "ebs-csi-default-sc": error generating accessibility requirements: no topology key found on CSINode squizzi-test-e2e-md-2cm7q-phl9q errors in events:

108s        Normal    ExternalProvisioning   persistentvolumeclaim/etcd-data-kmc-squizzi-test-e2e-hosted-etcd-0   Waiting for a volume to be created either by the external provisioner 'ebs.csi.aws.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
22s         Normal    Provisioning           persistentvolumeclaim/etcd-data-kmc-squizzi-test-e2e-hosted-etcd-0   External provisioner is provisioning volume for claim "hmc-system/etcd-data-kmc-squizzi-test-e2e-hosted-etcd-0"
22s         Warning   ProvisioningFailed     persistentvolumeclaim/etcd-data-kmc-squizzi-test-e2e-hosted-etcd-0   failed to provision volume with StorageClass "ebs-csi-default-sc": error generating accessibility requirements: no topology key found on CSINode squizzi-test-e2e-md-2cm7q-phl9q
22s         Normal    Provisioning           persistentvolumeclaim/etcd-data-kmc-squizzi-test-e2e-hosted-etcd-1   External provisioner is provisioning volume for claim "hmc-system/etcd-data-kmc-squizzi-test-e2e-hosted-etcd-1"
22s         Warning   ProvisioningFailed     persistentvolumeclaim/etcd-data-kmc-squizzi-test-e2e-hosted-etcd-1   failed to provision volume with StorageClass "ebs-csi-default-sc": error generating accessibility requirements: no topology key found on CSINode squizzi-test-e2e-md-2cm7q-phl9q
108s        Normal    ExternalProvisioning   persistentvolumeclaim/etcd-data-kmc-squizzi-test-e2e-hosted-etcd-1   Waiting for a volume to be created either by the external provisioner 'ebs.csi.aws.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
22s         Normal    Provisioning           persistentvolumeclaim/etcd-data-kmc-squizzi-test-e2e-hosted-etcd-2   External provisioner is provisioning volume for claim "hmc-system/etcd-data-kmc-squizzi-test-e2e-hosted-etcd-2"
22s         Warning   ProvisioningFailed     persistentvolumeclaim/etcd-data-kmc-squizzi-test-e2e-hosted-etcd-2   failed to provision volume with StorageClass "ebs-csi-default-sc": error generating accessibility requirements: no topology key found on CSINode squizzi-test-e2e-md-2cm7q-phl9q
108s        Normal    ExternalProvisioning   persistentvolumeclaim/etcd-data-kmc-squizzi-test-e2e-hosted-etcd-2   Waiting for a volume to be created either by the external provisioner 'ebs.csi.aws.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered

The ebs-csi-controller logs are quiet though:

KUBECONFIG=squizzi-test-e2e-kubeconfig k logs deployments/ebs-csi-controller -n kube-system
Found 2 pods, using pod/ebs-csi-controller-5c9db44f4f-h2877
Defaulted container "ebs-plugin" out of: ebs-plugin, csi-provisioner, csi-attacher, csi-resizer, liveness-probe
I0914 23:09:11.286385       1 main.go:157] "Initializing metadata"
I0914 23:09:14.794997       1 metadata.go:48] "Retrieved metadata from IMDS"
I0914 23:09:14.795715       1 driver.go:69] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.33.0"

I'm going to try setting node.tolerateAllTaints on the csi-driver helm chart and see if it fixes it per: kubernetes-sigs/aws-ebs-csi-driver#848 (comment)

Per https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/charts/aws-ebs-csi-driver/values.yaml#L371C1-L372C1 node.tolerateAllTaints is the default, so I'm not sure this will change any behavior, not to mention it doesn't make much sense because the daemonset has scheduled and has a ready pod for ebs-csi-driver on each node.

@squizzi squizzi force-pushed the aws-hosted-tests branch 2 times, most recently from 3ccb94c to d590bfb Compare September 16, 2024 18:18
@squizzi squizzi added the github actions Pull requests that update GitHub Actions code label Sep 16, 2024
@squizzi squizzi added test e2e Runs the entire provider E2E test suite, controller E2E tests are always ran and removed test e2e Runs the entire provider E2E test suite, controller E2E tests are always ran labels Sep 16, 2024
@squizzi
Copy link
Author

squizzi commented Sep 16, 2024

Almost there, seeing a failure to delete standalone now with:

I0916 20:53:12.723064       1 cleanup.go:64] "deleting aws resources created by tenant cluster" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="hmc-system/ci-1726517368-e2e-test" namespace="hmc-system" name="ci-1726517368-e2e-test" reconcileID="22e48f35-85e1-4a5c-97d5-7b116922050d" cluster="ci-1726517368-e2e-test"
I0916 20:53:12.723081       1 cleanup.go:97] "get aws resources created by tenant cluster with resource group tagging API" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="hmc-system/ci-1726517368-e2e-test" namespace="hmc-system" name="ci-1726517368-e2e-test" reconcileID="22e48f35-85e1-4a5c-97d5-7b116922050d" cluster="ci-1726517368-e2e-test"
E0916 20:53:13.726802       1 controller.go:329] "Reconciler error" err=<
	error deleting network: failed to detach internet gateway "igw-0b4456b73e77a2f65": DependencyViolation: Network vpc-0829721ee4d5f7b15 has some mapped public address(es). Please unmap those public address(es) before detaching the gateway.

This is with the ExternalGC feature gate active.

The ELB that was created to validate CCM which is given a public IP is what's holding up deletion this time, we see the cleanup script delete it here:

Checking for ELB with 'ci-1726524390-e2e-test' tag
Checking ELB: ad05423d65bc04a81be6d6d4b9d4e327 for tag
Deleting ELB: ad05423d65bc04a81be6d6d4b9d4e327

@squizzi
Copy link
Author

squizzi commented Sep 17, 2024

I've commented out the deletion test for standalone for a couple of reasons:

  1. The IGW issue here: Add support for testing aws-hosted-cp #280 (comment) is intermittent, but even if we get past that we eventually run into issues reconciling the AWSMachine's due to Cluster deletion stuck due to AWSCluster finalizer issue kubernetes-sigs/cluster-api-provider-aws#5107 almost every single time.
  2. We're already testing deleting of an AWS template via our aws-hosted-cp test which covers HMC's involvement in the deletion process relatively well. We can work in a follow-up once 5107 above is resolved or perhaps work on a workaround in a follow-up.
  3. There's a bunch other work that would benefit from the changes in this PR and I've worked through a ton of issues to get to a pretty much green result, minus a deletion test with outstanding known bugs.

* Break KubeClient helpers into provider specific file.
* Try to simplify the validation process for lots of different providers
  with different requirements.
* Finish aws-hosted-cp test and add comments through test to make it
  easier to understand.
* Use GinkgoHelper across e2e tests, populate hosted vars from
  AWSCluster.
* No longer rely on local registry for images in test/e2e.
* Support OS for awscli install.
* Prepend hostname to collected log artifacts.
* Support no cleanup of provider specs, differentiate ci
  cluster names.
* Add docs on running tests, do not wait for all providers
  if configured.
* Reinstantiate resource validation map on each instance of
  validation.
* Enable the external-gc feature via annotation, featureGate
  bool. (Closes: #152)
* Bump aws-*-cp templates to 0.1.3
* Bump cluster-api-provider-aws template to 0.1.2
* Improve test logging to log template name and validation
  phase.
* Bump k0s version to v1.30.4+k0s.0, set CCM nodeSelector to
  null for aws-hosted-cp. (Closes: #290)
* Break cleanup into seperate job so that it is unaffected by
  concurrency group cancellations.
* Make dev-aws-nuke target less PHONY.
* Only build linux/amd64 arch since CI does not need arm.

Signed-off-by: Kyle Squizzato <[email protected]>
@squizzi squizzi merged commit 7f04100 into main Sep 17, 2024
1 of 2 checks passed
@squizzi squizzi deleted the aws-hosted-tests branch September 17, 2024 22:44
bnallapeta pushed a commit to bnallapeta/hmc that referenced this pull request Nov 15, 2024
Add support for testing `aws-hosted-cp`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
github actions Pull requests that update GitHub Actions code test e2e Runs the entire provider E2E test suite, controller E2E tests are always ran
Projects
Status: Done
3 participants