Upon provider pod crash/restart, MR deletion leaves orphaned external resources on cloud provider #304
Comments
Hi @arfevrier,
Do you have an idea why the provider could not find the external resource (the external resource in this context is the actual set of Google Cloud DNS records)? Looks like the managed resource ( One explanation is for what happened to But then the external resource must have been deleted by an external entity (other than the provider), and from what I understand this is not the case. Another, more dangerous case is that the provider loses track of the external resource. The provider associates the MR with the external resource using the external-name annotation and a number of other pieces of information, which are configured using upjet's resource configuration framework. In this specific case for the Looking at the underlying Terraform provider's documentation, I see the following:
I don't have experience with GCP DNS recordsets, but could this maybe be related to our observations? At first sight, it does not seem related, at least for the
Hi @ulucinar,
Successfully deleted managed resource: here we are at line reconciler.go#L998 of the reconcile loop. This means the finalizer has been removed from the resource (on line 981). The resource no longer has a finalizer, so K8s will finally delete it. Because the resource has been modified, a new reconcile loop starts. We see the new reconcile loop at 2023-10-11T15:51:44 (Reconciling). Of course, the resource no longer exists (Cannot get managed resource) because the finalizer was removed in the previous reconcile loop.
That’s why I think we need to understand what happens before this Successfully deleted managed resource log. I've just seen that the benchmarksix10 and benchmarksix11 objects have no Terraform deletion log. Maybe it's linked to your comment: "Terraform will not actually remove NS records during destroy but will report that it did."
BUT! We don't have the "Successfully requested deletion of external resource" log. We should enter the if condition at reconciler.go#L942, but there is neither "Cannot delete external resource" nor "Successfully requested deletion of external resource", meaning that observation.ResourceExists is false. If we look at the Observe function at external_nofork.go#L451, the only way to return ResourceExists=false without an error is this condition:
    if meta.WasDeleted(mg) && n.opTracker.IsDeleted() {
        return managed.ExternalObservation{
            ResourceExists: false,
        }, nil
    }
We can also take a look at the GCP Bucket issue. You can see in the log that the Bucket bucketbenchone1928 is deleted, but the audit log at the end shows that my browser can still access the bucket. For this bucketbenchone1928 bucket we see the "Successfully requested deletion of external resource" log multiple times, but it comes with a Terraform TTL error:
Resource deletion and orphaned resources on cloud provider
What happened?
Hello, I created a benchmark to run tests with thousands of GCP objects. During the MR deletion part of my benchmark, sometimes (for less than 1% of managed resources) the MR is deleted from the K8s cluster but the external resource is not deleted from GCP. The benchmark was run against the GCP provider, but I think the root cause is in the upjet framework generator. GCP provider version v0.37, upjet version v0.11.0-rc.0.
> Here is a link to my Run N°4: 4000 RecordSets: Create, shutdown then Delete.
In this run, I created 4000 RecordSet MRs, then waited for the MR objects to be in the ready state. I scaled the K8s cluster to 0 worker nodes, then scaled it back to 1 worker node. After requesting deletion of all RecordSets, some of them were deleted from Kubernetes but not from the GCP console.
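One way to spot such orphans after a run is to compare the names still known to the cluster with the names still present on the cloud side. The sketch below assumes both lists have already been exported (for example from the Kubernetes API and the cloud console); the `orphans` function and the sample inputs are illustrative only.

```go
package main

import (
	"fmt"
	"sort"
)

// orphans returns the names that still exist on the cloud side but no longer
// have a managed resource in the cluster, in sorted order.
func orphans(clusterNames, cloudNames []string) []string {
	inCluster := make(map[string]struct{}, len(clusterNames))
	for _, n := range clusterNames {
		inCluster[n] = struct{}{}
	}
	var out []string
	for _, n := range cloudNames {
		if _, ok := inCluster[n]; !ok {
			out = append(out, n)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Illustrative inputs: every MR is gone from the cluster, but one
	// record set is still visible on the cloud side.
	cluster := []string{}
	cloud := []string{"benchmarksix10"}
	fmt.Println(orphans(cluster, cloud)) // prints [benchmarksix10]
}
```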
Top diagram: number of API calls received by GCP for the DNS API endpoints. Bottom diagram: CPU usage of the worker node.
The CPU consumption of the node is 60%, meaning 10 vCPUs used by the provider. During my other benchmarks the provider has never been able to consume more than 10 vCPUs. This means that it is probably already consuming the maximum amount of available CPU time.
For example, the object recordset.dns.gcp.upbound.io/benchmarksix10 is missing the Terraform deletion log. This object is still present in the GCP console. The object recordset.dns.gcp.upbound.io/benchmarksix11 is correctly deleted and no longer present in the GCP console.
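A check like this can be automated over an exported log. The following is a minimal sketch that assumes each log line has already been parsed into a resource name and a message; the `Entry` type, the `suspects` function and the `example-a`/`example-b` names are hypothetical, and only the two message strings come from the provider logs discussed in this issue.

```go
package main

import (
	"fmt"
	"sort"
)

// Entry is a hypothetical, pre-parsed log record: the managed resource name
// and the reconciler debug message attached to it.
type Entry struct {
	Resource string
	Message  string
}

const (
	msgRequested = "Successfully requested deletion of external resource"
	msgDeleted   = "Successfully deleted managed resource"
)

// suspects returns, in sorted order, the resources whose finalizer was
// removed although no external deletion request was logged for them.
func suspects(entries []Entry) []string {
	requested := map[string]bool{}
	deleted := map[string]bool{}
	for _, e := range entries {
		switch e.Message {
		case msgRequested:
			requested[e.Resource] = true
		case msgDeleted:
			deleted[e.Resource] = true
		}
	}
	var out []string
	for name := range deleted {
		if !requested[name] {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Made-up sample entries, not taken from the real export.
	entries := []Entry{
		{"example-a", msgDeleted},
		{"example-b", msgRequested},
		{"example-b", msgDeleted},
	}
	fmt.Println(suspects(entries)) // prints [example-a]
}
```

A resource flagged this way is only a candidate: it still has to be checked against the cloud console, since a resource that never existed externally would produce the same log pattern.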
Please find the complete log export (exported from GCP Logs Explorer):
For the object benchmarksix10, these two logs are missing:
> Here is a link to my Run N°6: 2000 Buckets. The run deploys 2000 Buckets (at 1 rq/s), and then requests deletion of the buckets. Same behavior as Run N°4: Buckets deleted from K8s but not from GCP.
For the object bucketbenchone1928, this log is present:
Expected behavior?
When requesting deletion of an object in K8s/Crossplane, the provider should wait for the deletion confirmation from GCP/Terraform. Here, the terraform destroy logic probably did not finish correctly, but the object was deleted anyway.
How can we reproduce it?
The project benchmark_gcp_18-10-2023 contains a description of how I built my Crossplane environment (installation, configuration and scripts):