Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWS provider can't delete VPC if dependencies are present #152

Closed
DinaBelova opened this issue Aug 6, 2024 · 7 comments · Fixed by #280
Closed

AWS provider can't delete VPC if dependencies are present #152

DinaBelova opened this issue Aug 6, 2024 · 7 comments · Fixed by #280
Assignees
Labels
bug Something isn't working

Comments

@DinaBelova
Copy link
Collaborator

CAPI AWS provider can’t delete VPC if any resource (like LB or security group) is present in there.

As a result cluster will stuck in the Deleting state indefinitely. In the logs there is no clue regarding which resource is preventing deletion, so it could be only done manually.

We should research if there is a possibility to recursively delete all resources in VPC using the provider or fix is required in AWS provider code base.

@DinaBelova DinaBelova added the bug Something isn't working label Aug 6, 2024
@slysunkin slysunkin self-assigned this Aug 27, 2024
@slysunkin
Copy link
Contributor

Also the cluster deletion process may fall into an infinite loop of other AWSCluster resource removal, i.e.:
ClusterInfrastructure - AWSCluster/aws-dev-slava  False  Warning   DeletingFailed  2m49s  2 of 8 completed                                                                                                         ├─ClusterSecurityGroupsReady                        False  Warning   DeletingFailed  23s    [failed to delete security group "sg-0e7265faa200dbf55" with name "aws-dev-slava-node": DependencyVi ...                 ├─InternetGatewayReady                              False  Warning   DeletingFailed  22s    failed to detach internet gateway "igw-07172cadc7a642df2": DependencyViolation: Network vpc-0d15c201 ...  
Currently it can be fixed only manually by detaching network interfaces, detaching gateways, etc. Our code doesn't manage all these underlaying AWS objects, so it has to be resolved in AWSProvider.

@squizzi
Copy link

squizzi commented Sep 11, 2024

This E2E test has reproduced the issue: https://github.com/Mirantis/hmc/actions/runs/10800786965/job/29959547787

Here's the log artifacts from the test: e2e-test.zip

In this case we create a LoadBalancer service to test the CCM which gets assigned an external IP which never gets cleaned up and leads to a DependencyViolation, we can see in the Clean up resources job that only the VPC exists and all other resources were cleaned properly.

The AWSProvider has a CCM resources garbage collector (added in: kubernetes-sigs/cluster-api-provider-aws#3632), my best guess is there's a potential issue with gc not working correctly in this instance. The LoadBalancer we create is appropriately tagged with the cluster owned label kubernetes.io/cluster/[CLUSTERNAME] (we actually use this label in the cleanup resources script we run following the tests to clean CCM resources) and in this instance the LB was cleaned before the cleanup script got to it, so something else is holding this preventing this VPC from deleting. Perhaps another resource affiliated with the LB (such as a security-group) isn't appropriately tagged, or the code that is GCing these resources has an issue. A potential place to start investigation for whomever picks this up.

@Kshatrix
Copy link
Collaborator

@squizzi have you enabled the ExternalResourceGC feature gate?

@squizzi
Copy link

squizzi commented Sep 12, 2024

@Kshatrix nope, didn't realize that was a feature gate. I'll try, perhaps we should be making that a default annotation in our templates if it works?

@squizzi
Copy link

squizzi commented Sep 12, 2024

@Kshatrix it made things better, but it didn't fix the behavior entirely. It handled deletion of igw resources better, so it's probably worth adding to our templates.

Unfortunately, the issue I opened my ticket that got merged into this one is still present:

E0912 22:23:51.098332       1 controller.go:329] "Reconciler error" err=<
	error deleting network: failed to delete vpc "vpc-055495ce4bb41cf51": DependencyViolation: The vpc 'vpc-055495ce4bb41cf51' has dependencies and cannot be deleted.
		status code: 400, request id: ed8c44c5-02f8-45eb-af42-af585b8909a7
 > controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="hmc-system/d4878555-e2e-test" namespace="hmc-system" name="d4878555-e2e-test" reconcileID="e81abc52-2bf3-4216-aaa9-9f3bfc796cfd"
I0912 22:23:51.098696       1 awscluster_controller.go:208] "Reconciling AWSCluster delete" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="hmc-system/d4878555-e2e-test" namespace="hmc-system" name="d4878555-e2e-test" reconcileID="db899951-8fe4-4141-a834-1904b1df77b1" cluster="hmc-system/d4878555-e2e-test"

The elb has been deleted, but the sg associated with the elb is holding the vpc hostage.

SecurityGroups:
- Description: Security group for Kubernetes ELB adda32bb30b0840a4bde709a34df54b4
    (default/loadbalancer-d4878555-e2e-test)
  GroupId: sg-0da6ab6dbbc580c74
  GroupName: k8s-elb-adda32bb30b0840a4bde709a34df54b4
  IpPermissions:
  - FromPort: 3
    IpProtocol: icmp
    IpRanges:
    - CidrIp: 0.0.0.0/0
    Ipv6Ranges: []
    PrefixListIds: []
    ToPort: 4
    UserIdGroupPairs: []
  - FromPort: 8765
    IpProtocol: tcp
    IpRanges:
    - CidrIp: 0.0.0.0/0
    Ipv6Ranges: []
    PrefixListIds: []
    ToPort: 8765
    UserIdGroupPairs: []
  IpPermissionsEgress:
  - IpProtocol: '-1'
    IpRanges:
    - CidrIp: 0.0.0.0/0
    Ipv6Ranges: []
    PrefixListIds: []
    UserIdGroupPairs: []
  OwnerId: '026090528175'
  Tags:
  - Key: kubernetes.io/cluster/d4878555-e2e-test <---
    Value: owned <---
  VpcId: vpc-055495ce4bb41cf51

Note the kubernetes.io/cluster/<cluster-name> tag exists on this resource. I'm going to open an upstream issue.

@squizzi
Copy link

squizzi commented Sep 12, 2024

Looking further, I don't even see external GC running, none of the logs that start the gc_service are showing up, and the GC service is supposed to get registered prior to DeleteNetwork which we see in the logs here:

	if r.ExternalResourceGC {
		gcSvc := gc.NewService(clusterScope, gc.WithGCStrategy(r.AlternativeGCStrategy))
		if gcErr := gcSvc.ReconcileDelete(ctx); gcErr != nil {
			allErrs = append(allErrs, fmt.Errorf("failed delete reconcile for gc service: %w", gcErr))
		}
	}

	if err := networkSvc.DeleteNetwork(); err != nil {
		allErrs = append(allErrs, errors.Wrap(err, "error deleting network"))
	}

Based on this, I guess it was just pure luck that the igw deletion was better.

I'm going to build a custom CAPA image with some debug logging added and see if I can get to the bottom of this, I guess I should own this bug too.

@squizzi squizzi self-assigned this Sep 12, 2024
squizzi added a commit that referenced this issue Sep 13, 2024
* Bump aws templates to 0.1.3
* Remove testing around CCM features from hosted template
  until fixed

Closes: #152

Signed-off-by: Kyle Squizzato <[email protected]>
@squizzi
Copy link

squizzi commented Sep 13, 2024

Alright, I figured out what was going on and fixed how we enable the feature gate in the provider. It appears to have fixed the issue 🎉, but when deleting even the standalone cluster we end up with orphaned AWSMachine objects (boooo) which I'm assuming will be resolved via #242.

I've added a commit to the testing PR which adds these changes so they can be tested when #242 merges. If all goes well over there we can most likely consider this resolved when that merges.

squizzi added a commit that referenced this issue Sep 13, 2024
* Bump aws-*-cp templates to 0.1.3
* Bump cluster-api-provider-aws template to 0.1.2
* Delete csi-driver, ccm validation tests from aws-hosted-cp test
  until #290 is resolved so that we don't get stuck there and can
  properly test deletion.

Closes: #152

Signed-off-by: Kyle Squizzato <[email protected]>
@DinaBelova DinaBelova linked a pull request Sep 13, 2024 that will close this issue
squizzi added a commit that referenced this issue Sep 14, 2024
* Break KubeClient helpers into provider specific file
* Finish aws-hosted-cp test and add comments through test to make it
  easier to understand.
* Use GinkgoHelper across e2e tests, populate hosted vars from
  AWSCluster.
* No longer rely on local registry for images in test/e2e.
* Support OS for awscli install.
* Prepend hostname to collected log artifacts.
* Support no cleanup of provider specs, differentiate ci
  cluster names.
* Add docs on running tests, do not wait for all providers
  if configured.
* Reinstantiate resource validation map on each instance of
  validation.
* Enable the external-gc feature via annotation, featureGate
  bool. (Closes: #152)
* Bump aws-*-cp templates to 0.1.3
* Bump cluster-api-provider-aws template to 0.1.2
* Improve test logging to log template name and validation
  phase.
* Bump k0s version to v1.30.4+k0s.0, set CCM nodeSelector to
  null for aws-hosted-cp. (Closes: #290)
* Break cleanup into seperate job so that it is unaffected by
  concurrency group cancellations.

Closes: #212

Signed-off-by: Kyle Squizzato <[email protected]>
squizzi added a commit that referenced this issue Sep 14, 2024
* Break KubeClient helpers into provider specific file
* Finish aws-hosted-cp test and add comments through test to make it
  easier to understand.
* Use GinkgoHelper across e2e tests, populate hosted vars from
  AWSCluster.
* No longer rely on local registry for images in test/e2e.
* Support OS for awscli install.
* Prepend hostname to collected log artifacts.
* Support no cleanup of provider specs, differentiate ci
  cluster names.
* Add docs on running tests, do not wait for all providers
  if configured.
* Reinstantiate resource validation map on each instance of
  validation.
* Enable the external-gc feature via annotation, featureGate
  bool. (Closes: #152)
* Bump aws-*-cp templates to 0.1.3
* Bump cluster-api-provider-aws template to 0.1.2
* Improve test logging to log template name and validation
  phase.
* Bump k0s version to v1.30.4+k0s.0, set CCM nodeSelector to
  null for aws-hosted-cp. (Closes: #290)
* Break cleanup into seperate job so that it is unaffected by
  concurrency group cancellations.

Closes: #212

Signed-off-by: Kyle Squizzato <[email protected]>
squizzi added a commit that referenced this issue Sep 14, 2024
* Break KubeClient helpers into provider specific file
* Finish aws-hosted-cp test and add comments through test to make it
  easier to understand.
* Use GinkgoHelper across e2e tests, populate hosted vars from
  AWSCluster.
* No longer rely on local registry for images in test/e2e.
* Support OS for awscli install.
* Prepend hostname to collected log artifacts.
* Support no cleanup of provider specs, differentiate ci
  cluster names.
* Add docs on running tests, do not wait for all providers
  if configured.
* Reinstantiate resource validation map on each instance of
  validation.
* Enable the external-gc feature via annotation, featureGate
  bool. (Closes: #152)
* Bump aws-*-cp templates to 0.1.3
* Bump cluster-api-provider-aws template to 0.1.2
* Improve test logging to log template name and validation
  phase.
* Bump k0s version to v1.30.4+k0s.0, set CCM nodeSelector to
  null for aws-hosted-cp. (Closes: #290)
* Break cleanup into seperate job so that it is unaffected by
  concurrency group cancellations.
* Make dev-aws-nuke target less PHONY.

Closes: #212

Signed-off-by: Kyle Squizzato <[email protected]>
squizzi added a commit that referenced this issue Sep 16, 2024
* Break KubeClient helpers into provider specific file.
* Try to simplify the validation process for lots of different providers
  with different requirements.
* Finish aws-hosted-cp test and add comments through test to make it
  easier to understand.
* Use GinkgoHelper across e2e tests, populate hosted vars from
  AWSCluster.
* No longer rely on local registry for images in test/e2e.
* Support OS for awscli install.
* Prepend hostname to collected log artifacts.
* Support no cleanup of provider specs, differentiate ci
  cluster names.
* Add docs on running tests, do not wait for all providers
  if configured.
* Reinstantiate resource validation map on each instance of
  validation.
* Enable the external-gc feature via annotation, featureGate
  bool. (Closes: #152)
* Bump aws-*-cp templates to 0.1.3
* Bump cluster-api-provider-aws template to 0.1.2
* Improve test logging to log template name and validation
  phase.
* Bump k0s version to v1.30.4+k0s.0, set CCM nodeSelector to
  null for aws-hosted-cp. (Closes: #290)
* Break cleanup into seperate job so that it is unaffected by
  concurrency group cancellations.
* Make dev-aws-nuke target less PHONY.

Closes: #212

Signed-off-by: Kyle Squizzato <[email protected]>
squizzi added a commit that referenced this issue Sep 16, 2024
* Break KubeClient helpers into provider specific file.
* Try to simplify the validation process for lots of different providers
  with different requirements.
* Finish aws-hosted-cp test and add comments through test to make it
  easier to understand.
* Use GinkgoHelper across e2e tests, populate hosted vars from
  AWSCluster.
* No longer rely on local registry for images in test/e2e.
* Support OS for awscli install.
* Prepend hostname to collected log artifacts.
* Support no cleanup of provider specs, differentiate ci
  cluster names.
* Add docs on running tests, do not wait for all providers
  if configured.
* Reinstantiate resource validation map on each instance of
  validation.
* Enable the external-gc feature via annotation, featureGate
  bool. (Closes: #152)
* Bump aws-*-cp templates to 0.1.3
* Bump cluster-api-provider-aws template to 0.1.2
* Improve test logging to log template name and validation
  phase.
* Bump k0s version to v1.30.4+k0s.0, set CCM nodeSelector to
  null for aws-hosted-cp. (Closes: #290)
* Break cleanup into seperate job so that it is unaffected by
  concurrency group cancellations.
* Make dev-aws-nuke target less PHONY.

Closes: #212

Signed-off-by: Kyle Squizzato <[email protected]>
squizzi added a commit that referenced this issue Sep 16, 2024
* Break KubeClient helpers into provider specific file.
* Try to simplify the validation process for lots of different providers
  with different requirements.
* Finish aws-hosted-cp test and add comments through test to make it
  easier to understand.
* Use GinkgoHelper across e2e tests, populate hosted vars from
  AWSCluster.
* No longer rely on local registry for images in test/e2e.
* Support OS for awscli install.
* Prepend hostname to collected log artifacts.
* Support no cleanup of provider specs, differentiate ci
  cluster names.
* Add docs on running tests, do not wait for all providers
  if configured.
* Reinstantiate resource validation map on each instance of
  validation.
* Enable the external-gc feature via annotation, featureGate
  bool. (Closes: #152)
* Bump aws-*-cp templates to 0.1.3
* Bump cluster-api-provider-aws template to 0.1.2
* Improve test logging to log template name and validation
  phase.
* Bump k0s version to v1.30.4+k0s.0, set CCM nodeSelector to
  null for aws-hosted-cp. (Closes: #290)
* Break cleanup into seperate job so that it is unaffected by
  concurrency group cancellations.
* Make dev-aws-nuke target less PHONY.
* Only build linux/amd64 arch since CI does not need arm.

Signed-off-by: Kyle Squizzato <[email protected]>
@github-project-automation github-project-automation bot moved this from In Progress to Done in Project 2A Sep 17, 2024
bnallapeta pushed a commit to bnallapeta/hmc that referenced this issue Nov 15, 2024
* Break KubeClient helpers into provider specific file.
* Try to simplify the validation process for lots of different providers
  with different requirements.
* Finish aws-hosted-cp test and add comments through test to make it
  easier to understand.
* Use GinkgoHelper across e2e tests, populate hosted vars from
  AWSCluster.
* No longer rely on local registry for images in test/e2e.
* Support OS for awscli install.
* Prepend hostname to collected log artifacts.
* Support no cleanup of provider specs, differentiate ci
  cluster names.
* Add docs on running tests, do not wait for all providers
  if configured.
* Reinstantiate resource validation map on each instance of
  validation.
* Enable the external-gc feature via annotation, featureGate
  bool. (Closes: K0rdent#152)
* Bump aws-*-cp templates to 0.1.3
* Bump cluster-api-provider-aws template to 0.1.2
* Improve test logging to log template name and validation
  phase.
* Bump k0s version to v1.30.4+k0s.0, set CCM nodeSelector to
  null for aws-hosted-cp. (Closes: K0rdent#290)
* Break cleanup into seperate job so that it is unaffected by
  concurrency group cancellations.
* Make dev-aws-nuke target less PHONY.
* Only build linux/amd64 arch since CI does not need arm.

Signed-off-by: Kyle Squizzato <[email protected]>
@alex-shl alex-shl added this to K0rdent Jan 3, 2025
@alex-shl alex-shl moved this to Done in K0rdent Jan 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants