More reliable provisioning/deprovisioning of shared terraform provider #263

Closed

toastwaffle opened this issue Aug 22, 2023 · 3 comments

Labels: enhancement (New feature or request), needs:triage

Comments

@toastwaffle (Contributor)

What problem are you facing?

When we migrated provider-gcp from v0.28 to v0.35, the provider appeared to get stuck: the shared Terraform provider had reached its invocation limit but was not being replaced. Based on code exploration, I suspect this was because a long-running synchronous Terraform operation held the inUse count above 0, preventing the shared provider from being recreated.
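
To illustrate the pattern I suspect is at play, here is a minimal Go sketch (purely illustrative, not upjet's actual types or code) of a reference-counted shared runner whose replacement has to wait for the in-use count to drop to zero:

```go
// Purely illustrative sketch -- not upjet's actual code. A reference-counted
// shared runner is only replaced once no invocation holds it, so a single
// long-running synchronous Terraform operation can delay recycling indefinitely.
package main

import (
	"fmt"
	"sync"
	"time"
)

type sharedRunner struct {
	mu          sync.Mutex
	inUse       int // invocations currently holding the runner
	invocations int // total invocations served so far
	limit       int // invocation limit after which the runner should be recycled
}

// acquire reserves the runner for one Terraform invocation.
func (r *sharedRunner) acquire() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.inUse++
	r.invocations++
}

// release marks one invocation as finished.
func (r *sharedRunner) release() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.inUse--
}

// replaceWhenIdle recreates the provider only after inUse drops to zero.
func (r *sharedRunner) replaceWhenIdle() {
	for {
		r.mu.Lock()
		idle := r.inUse == 0
		r.mu.Unlock()
		if idle {
			fmt.Println("invocation limit reached and runner idle; recreating shared provider")
			return
		}
		time.Sleep(100 * time.Millisecond)
	}
}

func main() {
	r := &sharedRunner{limit: 2}

	r.acquire() // a quick invocation
	r.release()

	r.acquire() // a long-running synchronous apply
	go func() {
		time.Sleep(2 * time.Second) // simulate the slow operation
		r.release()
	}()

	if r.invocations >= r.limit { // limit reached, but replacement must wait
		r.replaceWhenIdle()
	}
}
```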

How could Upjet help solve your problem?

One potential improvement would be to provision a new instance of the shared provider as soon as the invocation limit is reached, without waiting for the old instance to be destroyed. The question is whether it is even possible for multiple instances of the provider to live side by side; at a minimum, I assume there would be challenges around port allocation.

Another potential improvement would be to support a timeout on Terraform operations, so that execution is aborted after a fixed period of time; eventually every invocation would then terminate and release the shared instance.
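
As a rough sketch of the timeout idea, assuming the Terraform CLI is invoked as a subprocess (the helper name, working directory and timeout value below are hypothetical, not an existing upjet API), each CLI invocation could be bounded with a context deadline so a hung operation is killed and releases the shared instance:

```go
// Rough sketch only: bound each Terraform CLI invocation with a deadline so a
// hung operation is killed instead of holding the shared provider forever.
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// runTerraform runs one CLI invocation under a context deadline.
func runTerraform(ctx context.Context, workdir string, timeout time.Duration, args ...string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	cmd := exec.CommandContext(ctx, "terraform", args...)
	cmd.Dir = workdir
	out, err := cmd.CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		return out, fmt.Errorf("terraform %v timed out after %s", args, timeout)
	}
	return out, err
}

func main() {
	// Example: abort a plan that takes longer than 10 minutes.
	out, err := runTerraform(context.Background(), "/tmp/workspace", 10*time.Minute, "plan", "-input=false")
	if err != nil {
		fmt.Println("invocation aborted:", err)
	}
	fmt.Print(string(out))
}
```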

@toastwaffle added the enhancement (New feature or request) and needs:triage labels on Aug 22, 2023
@ulucinar (Collaborator) commented Aug 23, 2023

Hello @toastwaffle,
Thank you for opening this issue. Implementing a new scheduler that runs parallel Terraform providers, at the expense of increased memory consumption, is on our roadmap.

The upjet_terraform_cli_duration histogram metric and the upjet_terraform_active_cli_invocations & upjet_terraform_running_processes gauge metrics may prove useful while debugging the issue you described above.

Could you please also detail how we can reproduce the issue? Any Crossplane composition, claim and XRD manifests or MR manifests we could use, together with the experiment scenarios (e.g., how many claims/MRs to provision, which versions of the providers & Crossplane, the managed Kubernetes cluster version, etc.), would be very helpful. Thank you for looking into this.

@toastwaffle (Contributor, Author)

Hi @ulucinar!

Thanks for the pointer to those metrics; they will definitely prove useful.

Our XRD, composition, and an example instantiation can be found in this gist. We (perhaps foolishly) updated the composition at the same time as upgrading the provider; the changes were adding management policies to both the Cluster and NodePool resources to disable late initialisation, and removing the networkConfig.enablePrivateNodes lines (which we were using to prevent reconciliation after creation, to mitigate crossplane-contrib/provider-upjet-gcp#340).

The composition is using private nodes and a project-level VPC, which requires additional manual configuration and allocation of IP ranges. I suspect that's probably irrelevant to the issue at hand, so it might be easier to remove all of the networking configuration and rely on the defaults.

At any one time, we probably have about 30 instances of this composition, but they change somewhat regularly - we're using Crossplane to provision development and demonstration instances of our stack.

Versions:

  • The Kubernetes cluster is running 1.24.13-gke.2500
  • We're using Universal Crossplane version 1.12.2
  • We were using upbound/provider-gcp v0.28.0 before the upgrade and v0.35.0 after it. We've since rolled back to v0.28.0 out of an abundance of caution (an adventure in itself), but I'm hoping to roll forward again soon. If there's something specific you suggest we do during that process to collect useful information, let me know; otherwise I will keep an eye on those metrics.

@jeanduplessis (Collaborator)

@toastwaffle we are close to releasing a new architecture in which we integrate directly with the Terraform providers rather than using the Terraform CLI as an intermediary. This new architecture will obviate the issue you are experiencing, so I will close this issue for now. You can follow progress and updates on the new architecture in the #sig-upjet-provider-efficiency Slack channel.
