Add support for testing `aws-hosted-cp` #280

squizzi · 2024-09-05T23:02:59Z

Note: This PR was intentionally created on a repository branch instead of my fork so we can iterate on the e2e testing workflow and then change it to pull_request_target for merge.

Break KubeClient helpers into provider specific file
Finish aws-hosted-cp test and add comments through test to make it
easier to understand.
Use GinkgoHelper across e2e tests, populate hosted vars from
AWSCluster.
No longer rely on local registry for images in test/e2e.
Support OS for awscli install.
Prepend hostname to collected log artifacts.
Support no cleanup of provider specs, differentiate ci
cluster names.
Add docs on running tests, do not wait for all providers
if configured.
Reinstantiate resource validation map on each instance of
validation.
Enable the external-gc feature via annotation, featureGate
bool. (Closes: AWS provider can't delete VPC if dependencies are present #152)
Bump aws-*-cp templates to 0.1.3
Bump cluster-api-provider-aws template to 0.1.2
Improve test logging to log template name and validation
phase.
Bump k0s version to v1.30.4+k0s.0, set CCM nodeSelector to
null for aws-hosted-cp. (Closes: Nodes bootstrapped via aws-hosted-cp get stuck with node.cloudprovider.kubernetes.io/uninitialized taint #290)
Break cleanup into seperate job so that it is unaffected by
concurrency group cancellations.

Closes: #212

squizzi · 2024-09-10T17:53:16Z

I was able to fix the nuke behavior so we reliably cleanup EIP's now which helps the failed to allocate EIP behavior but concurrent running tests will fail because of the allocation limit, I'll try to talk to whomever to get that increase.

I just need to figure out #290 and this should be done. It's definitely ready for review.

squizzi · 2024-09-11T16:03:00Z

Currently failing in Deletion validation and reproducing the VPC dependency violation bug, I'll probably comment those tests out with a fixme because they'll never pass otherwise.

slysunkin · 2024-09-11T16:13:25Z

Currently failing in Deletion validation and reproducing the VPC dependency violation bug, I'll probably comment those tests out with a fixme because they'll never pass otherwise.

We already have bug #152 for AWS VPC

squizzi · 2024-09-11T17:16:06Z

We already have bug #152 for AWS VPC

Thanks for adding it, I knew of it but was on my phone and couldn't throw it on here when I typed that comment.

squizzi · 2024-09-11T23:24:53Z

Heads up to reviewers: The tests are failing as they've revealed (or encountered existing) issues: #152 and #290. The testing code itself isn't problematic though.

squizzi · 2024-09-12T20:18:34Z

I've removed the CCM tests for the hosted-cp temporarily so we can test cluster deletions via CI. I intend to bring them back before merge.

squizzi · 2024-09-13T18:07:11Z

It appears cancelled tests will not run the cleanup resources job, which is concerning since we could get into a state where our test gets cancelled by a subsequent PR push and then the AWS resources are never cleaned. I'll need to look into the best way to handle this case since the if: always() declaration doesn't appear to work in this situation.

squizzi · 2024-09-13T21:15:20Z

It appears cancelled tests will not run the cleanup resources job, which is concerning since we could get into a state where our test gets cancelled by a subsequent PR push and then the AWS resources are never cleaned. I'll need to look into the best way to handle this case since the if: always() declaration doesn't appear to work in this situation.

I think I have an idea of how to fix this, I can make the e2etest job concurrent and then have a cleanup job that runs as always(), we'll see if this works...

squizzi · 2024-09-14T02:04:37Z

I've squash and pushed all of the changes since no one has reviewed this yet. In summary the squashed commit resolves all of the outstanding issues and fixes the cancellation issue. We just need @slysunkin 's merge in #242 so I can rebase and that should fix the deletions that are stuck for both hosted and standalone templates ... hopefully

Looking forward to that green checkmark.

squizzi · 2024-09-14T04:30:47Z

I put the other make steps for dev-aws-nuke in the PHONY nooooooo 🤦‍♂️

squizzi · 2024-09-14T06:39:55Z

I pulled @slysunkin 's changes in and CI is running, hopefully things are good 🤞

squizzi · 2024-09-14T21:33:25Z

Last test failed in secret generation, that's a new one.

E0914 20:07:27.884717       1 controller.go:329] "Reconciler error" err="failed to add &Node{ObjectMeta:{      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,},} watch on cluster hmc-system/ci-1726339710-e2e-test-hosted: failed to create cluster accessor: error fetching REST client config for remote cluster \"hmc-system/ci-1726339710-e2e-test-hosted\": failed to retrieve kubeconfig secret for Cluster hmc-system/ci-1726339710-e2e-test-hosted: Secret \"ci-1726339710-e2e-test-hosted-kubeconfig\" not found" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="hmc-system/ci-1726339710-e2e-test-hosted-md-6dx4q-cnzvf" namespace="hmc-system" name="ci-1726339710-e2e-test-hosted-md-6dx4q-cnzvf" reconcileID="e4dfb466-766a-4267-9c3d-ca3b30e0992d"

It's consistently failing here after a 2nd run, so it's not flake. :( I was able to get past the this phase of validation prior to the rebase so not sure what's changed but the control plane isn't readying now.

k0s etcd is stuck in Pending state with the following events:

Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  45m (x2 over 55m)  default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: context deadline exceeded
  Warning  FailedScheduling  35m                default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: context deadline exceeded
  Warning  FailedScheduling  15m (x2 over 25m)  default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: context deadline exceeded
  Warning  FailedScheduling  4m56s              default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: context deadline exceeded

The ebs-csi daemonset is up and running on the standalone cluster where this is deploying and the csi-driver validation phase completed successfully:

NAME           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
ebs-csi-node   2         2         2       2            2           kubernetes.io/os=linux   133m

I'm seeing failed to provision volume with StorageClass "ebs-csi-default-sc": error generating accessibility requirements: no topology key found on CSINode squizzi-test-e2e-md-2cm7q-phl9q errors in events:

108s        Normal    ExternalProvisioning   persistentvolumeclaim/etcd-data-kmc-squizzi-test-e2e-hosted-etcd-0   Waiting for a volume to be created either by the external provisioner 'ebs.csi.aws.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
22s         Normal    Provisioning           persistentvolumeclaim/etcd-data-kmc-squizzi-test-e2e-hosted-etcd-0   External provisioner is provisioning volume for claim "hmc-system/etcd-data-kmc-squizzi-test-e2e-hosted-etcd-0"
22s         Warning   ProvisioningFailed     persistentvolumeclaim/etcd-data-kmc-squizzi-test-e2e-hosted-etcd-0   failed to provision volume with StorageClass "ebs-csi-default-sc": error generating accessibility requirements: no topology key found on CSINode squizzi-test-e2e-md-2cm7q-phl9q
22s         Normal    Provisioning           persistentvolumeclaim/etcd-data-kmc-squizzi-test-e2e-hosted-etcd-1   External provisioner is provisioning volume for claim "hmc-system/etcd-data-kmc-squizzi-test-e2e-hosted-etcd-1"
22s         Warning   ProvisioningFailed     persistentvolumeclaim/etcd-data-kmc-squizzi-test-e2e-hosted-etcd-1   failed to provision volume with StorageClass "ebs-csi-default-sc": error generating accessibility requirements: no topology key found on CSINode squizzi-test-e2e-md-2cm7q-phl9q
108s        Normal    ExternalProvisioning   persistentvolumeclaim/etcd-data-kmc-squizzi-test-e2e-hosted-etcd-1   Waiting for a volume to be created either by the external provisioner 'ebs.csi.aws.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
22s         Normal    Provisioning           persistentvolumeclaim/etcd-data-kmc-squizzi-test-e2e-hosted-etcd-2   External provisioner is provisioning volume for claim "hmc-system/etcd-data-kmc-squizzi-test-e2e-hosted-etcd-2"
22s         Warning   ProvisioningFailed     persistentvolumeclaim/etcd-data-kmc-squizzi-test-e2e-hosted-etcd-2   failed to provision volume with StorageClass "ebs-csi-default-sc": error generating accessibility requirements: no topology key found on CSINode squizzi-test-e2e-md-2cm7q-phl9q
108s        Normal    ExternalProvisioning   persistentvolumeclaim/etcd-data-kmc-squizzi-test-e2e-hosted-etcd-2   Waiting for a volume to be created either by the external provisioner 'ebs.csi.aws.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered

The ebs-csi-controller logs are quiet though:

KUBECONFIG=squizzi-test-e2e-kubeconfig k logs deployments/ebs-csi-controller -n kube-system
Found 2 pods, using pod/ebs-csi-controller-5c9db44f4f-h2877
Defaulted container "ebs-plugin" out of: ebs-plugin, csi-provisioner, csi-attacher, csi-resizer, liveness-probe
I0914 23:09:11.286385       1 main.go:157] "Initializing metadata"
I0914 23:09:14.794997       1 metadata.go:48] "Retrieved metadata from IMDS"
I0914 23:09:14.795715       1 driver.go:69] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.33.0"

I'm going to try setting node.tolerateAllTaints on the csi-driver helm chart and see if it fixes it per: kubernetes-sigs/aws-ebs-csi-driver#848 (comment)

Per https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/charts/aws-ebs-csi-driver/values.yaml#L371C1-L372C1 node.tolerateAllTaints is the default, so I'm not sure this will change any behavior, not to mention it doesn't make much sense because the daemonset has scheduled and has a ready pod for ebs-csi-driver on each node.

squizzi · 2024-09-16T21:14:52Z

Almost there, seeing a failure to delete standalone now with:

I0916 20:53:12.723064       1 cleanup.go:64] "deleting aws resources created by tenant cluster" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="hmc-system/ci-1726517368-e2e-test" namespace="hmc-system" name="ci-1726517368-e2e-test" reconcileID="22e48f35-85e1-4a5c-97d5-7b116922050d" cluster="ci-1726517368-e2e-test"
I0916 20:53:12.723081       1 cleanup.go:97] "get aws resources created by tenant cluster with resource group tagging API" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="hmc-system/ci-1726517368-e2e-test" namespace="hmc-system" name="ci-1726517368-e2e-test" reconcileID="22e48f35-85e1-4a5c-97d5-7b116922050d" cluster="ci-1726517368-e2e-test"
E0916 20:53:13.726802       1 controller.go:329] "Reconciler error" err=<
	error deleting network: failed to detach internet gateway "igw-0b4456b73e77a2f65": DependencyViolation: Network vpc-0829721ee4d5f7b15 has some mapped public address(es). Please unmap those public address(es) before detaching the gateway.

This is with the ExternalGC feature gate active.

The ELB that was created to validate CCM which is given a public IP is what's holding up deletion this time, we see the cleanup script delete it here:

Checking for ELB with 'ci-1726524390-e2e-test' tag
Checking ELB: ad05423d65bc04a81be6d6d4b9d4e327 for tag
Deleting ELB: ad05423d65bc04a81be6d6d4b9d4e327

squizzi · 2024-09-17T00:00:16Z

I've commented out the deletion test for standalone for a couple of reasons:

The IGW issue here: Add support for testing aws-hosted-cp #280 (comment) is intermittent, but even if we get past that we eventually run into issues reconciling the AWSMachine's due to Cluster deletion stuck due to AWSCluster finalizer issue kubernetes-sigs/cluster-api-provider-aws#5107 almost every single time.
We're already testing deleting of an AWS template via our aws-hosted-cp test which covers HMC's involvement in the deletion process relatively well. We can work in a follow-up once 5107 above is resolved or perhaps work on a workaround in a follow-up.
There's a bunch other work that would benefit from the changes in this PR and I've worked through a ton of issues to get to a pretty much green result, minus a deletion test with outstanding known bugs.

* Break KubeClient helpers into provider specific file. * Try to simplify the validation process for lots of different providers with different requirements. * Finish aws-hosted-cp test and add comments through test to make it easier to understand. * Use GinkgoHelper across e2e tests, populate hosted vars from AWSCluster. * No longer rely on local registry for images in test/e2e. * Support OS for awscli install. * Prepend hostname to collected log artifacts. * Support no cleanup of provider specs, differentiate ci cluster names. * Add docs on running tests, do not wait for all providers if configured. * Reinstantiate resource validation map on each instance of validation. * Enable the external-gc feature via annotation, featureGate bool. (Closes: #152) * Bump aws-*-cp templates to 0.1.3 * Bump cluster-api-provider-aws template to 0.1.2 * Improve test logging to log template name and validation phase. * Bump k0s version to v1.30.4+k0s.0, set CCM nodeSelector to null for aws-hosted-cp. (Closes: #290) * Break cleanup into seperate job so that it is unaffected by concurrency group cancellations. * Make dev-aws-nuke target less PHONY. * Only build linux/amd64 arch since CI does not need arm. Signed-off-by: Kyle Squizzato <[email protected]>

Signed-off-by: Kyle Squizzato <[email protected]>

Add support for testing `aws-hosted-cp`

squizzi added the test e2e Runs the entire provider E2E test suite, controller E2E tests are always ran label Sep 5, 2024

squizzi marked this pull request as draft September 5, 2024 23:14

squizzi force-pushed the aws-hosted-tests branch 4 times, most recently from 35fca5e to 8e3465d Compare September 9, 2024 23:38

squizzi marked this pull request as ready for review September 10, 2024 04:18

squizzi requested review from kylewuolle, Kshatrix, eromanova, a13x5, wahabmk and slysunkin September 10, 2024 04:18

This comment was marked as outdated.

Sign in to view

squizzi force-pushed the aws-hosted-tests branch from ce30914 to 5596db9 Compare September 10, 2024 16:44

squizzi force-pushed the aws-hosted-tests branch 3 times, most recently from 01732f0 to 7da3be0 Compare September 10, 2024 19:06

squizzi force-pushed the aws-hosted-tests branch from bedae9a to 2148a61 Compare September 11, 2024 18:28

This was referenced Sep 12, 2024

Nodes bootstrapped via aws-hosted-cp get stuck with node.cloudprovider.kubernetes.io/uninitialized taint #290

Closed

Implement Azure e2e template tests #300

Closed

squizzi force-pushed the aws-hosted-tests branch from 2148a61 to 3530a70 Compare September 12, 2024 19:11

squizzi force-pushed the aws-hosted-tests branch 2 times, most recently from 7c6d4d9 to 1292207 Compare September 12, 2024 23:39

squizzi force-pushed the aws-hosted-tests branch 2 times, most recently from 6f69f86 to 37c26c5 Compare September 13, 2024 00:39

DinaBelova linked an issue Sep 13, 2024 that may be closed by this pull request

AWS provider can't delete VPC if dependencies are present #152

Closed

squizzi force-pushed the aws-hosted-tests branch from aae6aa2 to 6a3d3db Compare September 14, 2024 01:39

squizzi force-pushed the aws-hosted-tests branch from 6a3d3db to ee368eb Compare September 14, 2024 06:39

squizzi force-pushed the aws-hosted-tests branch from ee368eb to 2da5382 Compare September 14, 2024 18:48

squizzi force-pushed the aws-hosted-tests branch 2 times, most recently from 3ccb94c to d590bfb Compare September 16, 2024 18:18

squizzi added the github actions Pull requests that update GitHub Actions code label Sep 16, 2024

squizzi force-pushed the aws-hosted-tests branch from ee759b8 to 6e10aa9 Compare September 16, 2024 20:05

squizzi added test e2e Runs the entire provider E2E test suite, controller E2E tests are always ran and removed test e2e Runs the entire provider E2E test suite, controller E2E tests are always ran labels Sep 16, 2024

squizzi force-pushed the aws-hosted-tests branch from 2568013 to 0ef71dc Compare September 16, 2024 23:56

squizzi force-pushed the aws-hosted-tests branch from 0ef71dc to 7ec7fca Compare September 17, 2024 00:52

squizzi mentioned this pull request Sep 17, 2024

Re-enable aws-standalone-cp template deletion test #339

Open

squizzi added 2 commits September 17, 2024 08:15

Make unzip quiet, add FIXME comment

f43d927

Signed-off-by: Kyle Squizzato <[email protected]>

Undo any single-quote changes across yaml

1c942ef

Signed-off-by: Kyle Squizzato <[email protected]>

Kshatrix approved these changes Sep 17, 2024

View reviewed changes

squizzi merged commit 7f04100 into main Sep 17, 2024
1 of 2 checks passed

squizzi deleted the aws-hosted-tests branch September 17, 2024 22:44

bnallapeta pushed a commit to bnallapeta/hmc that referenced this pull request Nov 15, 2024

Merge pull request K0rdent#280 from Mirantis/aws-hosted-tests

69fd11a

Add support for testing `aws-hosted-cp`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for testing `aws-hosted-cp` #280

Add support for testing `aws-hosted-cp` #280

squizzi commented Sep 5, 2024 •

edited

Loading

This comment was marked as outdated.

squizzi commented Sep 10, 2024 •

edited

Loading

squizzi commented Sep 11, 2024

slysunkin commented Sep 11, 2024

squizzi commented Sep 11, 2024

squizzi commented Sep 11, 2024 •

edited

Loading

squizzi commented Sep 12, 2024

squizzi commented Sep 13, 2024 •

edited

Loading

squizzi commented Sep 13, 2024

squizzi commented Sep 14, 2024 •

edited

Loading

squizzi commented Sep 14, 2024

squizzi commented Sep 14, 2024

squizzi commented Sep 14, 2024 •

edited

Loading

squizzi commented Sep 16, 2024 •

edited

Loading

squizzi commented Sep 17, 2024 •

edited

Loading

Add support for testing aws-hosted-cp #280

Add support for testing aws-hosted-cp #280

Conversation

squizzi commented Sep 5, 2024 • edited Loading

This comment was marked as outdated.

squizzi commented Sep 10, 2024 • edited Loading

squizzi commented Sep 11, 2024

slysunkin commented Sep 11, 2024

squizzi commented Sep 11, 2024

squizzi commented Sep 11, 2024 • edited Loading

squizzi commented Sep 12, 2024

squizzi commented Sep 13, 2024 • edited Loading

squizzi commented Sep 13, 2024

squizzi commented Sep 14, 2024 • edited Loading

squizzi commented Sep 14, 2024

squizzi commented Sep 14, 2024

squizzi commented Sep 14, 2024 • edited Loading

squizzi commented Sep 16, 2024 • edited Loading

squizzi commented Sep 17, 2024 • edited Loading

Add support for testing `aws-hosted-cp` #280

Add support for testing `aws-hosted-cp` #280

squizzi commented Sep 5, 2024 •

edited

Loading

squizzi commented Sep 10, 2024 •

edited

Loading

squizzi commented Sep 11, 2024 •

edited

Loading

squizzi commented Sep 13, 2024 •

edited

Loading

squizzi commented Sep 14, 2024 •

edited

Loading

squizzi commented Sep 14, 2024 •

edited

Loading

squizzi commented Sep 16, 2024 •

edited

Loading

squizzi commented Sep 17, 2024 •

edited

Loading