More reliable provisioning/deprovisioning of shared terraform provider #263
Comments
Hello @toastwaffle, Could you please also detail how we can reproduce the issue? Any Crossplane composition, claim, and XRD manifests or MR manifests we could use together with the experiment scenarios (e.g., how many claims/MRs to provision, which versions of the providers & Crossplane, the managed Kubernetes cluster version, etc.) would be very helpful. Thank you for looking into this.
Hi @ulucinar! Thanks for the pointer to those metrics, they will definitely prove useful.

Our XRD, composition, and an example instantiation can be found in this gist. We (perhaps foolishly) updated the composition at the same time as upgrading the provider; the changes were adding management policies to both the Cluster and NodePool resources to disable late initialisation, and removing the

The composition uses private nodes and a project-level VPC, which requires additional manual configuration and allocation of IP ranges. I suspect that's probably irrelevant to the issue at hand, so it might be easier to remove all of the networking configuration and rely on the defaults.

At any one time we probably have about 30 instances of this composition, but they change fairly regularly - we're using Crossplane to provision development and demonstration instances of our stack.

Versions:
@toastwaffle we are close to releasing a new architecture in which we integrate directly with the Terraform providers rather than using the Terraform CLI as an intermediary. The new architecture will obviate the issue you are experiencing, so I will close this issue for now. You can follow progress and updates on the new architecture in the #sig-upjet-provider-efficiency Slack channel.
What problem are you facing?
When we migrated provider-gcp from v0.28 to v0.35, we saw the provider appear to get stuck when the shared terraform provider had reached its invocation limit but didn't seem to be getting replaced. Based on code exploration, I suspect this was because there was a long-running synchronous terraform operation which held the inUse count above 0, preventing the shared provider from being recreated.
How could Upjet help solve your problem?
One potential improvement would be to provision a new instance of the shared provider as soon as the invocation limit is reached, without waiting for the old instance to be destroyed. The question is whether it is even possible for multiple instances of the provider to live side-by-side (at a minimum I assume there would be challenges around port allocation)?
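On the port-allocation question: one generic way to let two instances coexist is to bind each to an OS-assigned ephemeral port rather than a fixed one. The sketch below is not upjet's mechanism, just a demonstration that listening on port 0 yields distinct ports for concurrently live listeners.

```go
package main

import (
	"fmt"
	"net"
)

// ephemeralPort asks the OS for a free port by listening on port 0 and
// returns the port actually bound, along with the live listener.
func ephemeralPort() (int, net.Listener, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, nil, err
	}
	return l.Addr().(*net.TCPAddr).Port, l, nil
}

func main() {
	p1, l1, err := ephemeralPort()
	if err != nil {
		panic(err)
	}
	defer l1.Close()
	p2, l2, err := ephemeralPort()
	if err != nil {
		panic(err)
	}
	defer l2.Close()
	// Two live listeners, two distinct ports -- no fixed-port clash.
	fmt.Println(p1 != p2) // prints "true"
}
```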
Another potential improvement would be to support adding a timeout to terraform operations, such that execution is aborted after a fixed period of time, meaning that eventually all invocations would terminate and release the shared instance.