-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Setup workload identity via terraform #1130
Conversation
This was caught by @sgibson91 in #1074, but then https://2i2c.freshdesk.com/a/tickets/107 happened today as well so I had to fix that. Automating this seemed the best way to go forward. |
Looking at the cloud buckets in https://console.cloud.google.com/storage/browser?forceOnBucketsSortingFiltering=false&project=two-eye-two-see&prefix=&forceOnObjectsSortingFiltering=false, none of them actually have any content in them! So nobody is actually using scratch buckes in the default pilot-hubs cluster. |
Thank you for your work following this up @yuvipanda! 🙌 |
1c58a86
to
2c5f8eb
Compare
The GKE config connector was helpful in letting us deploy Google Cloud Service Accounts with permissions for cloud storage directly just from helm. However, it has been difficult to debug, and in 2i2c-org#669 we decided to move away from it and towards creating these cloud resources via Terraform. This commit adds: - Terraform code that will create a Google Service Account, bind it to a given Kubernetes Service Account, for a list of hub namespaces passed in. This means that some hub initial deployments now *can not be done just with CD*, but need manual work with terraform. I think this would be any hub that wants to use requestor pays or scratch buckets. This would need to be documented. - Move meom-ige to use this new scheme. metadata concealment (https://cloud.google.com/kubernetes-engine/docs/how-to/protecting-cluster-metadata#concealment) which is what we were using earlier as alternative to config-connector + workload identity, is no longer supported by the terraform google provider. In b7b42ce, we changed the default from 'SECURE' to 'UNSPECIFIED', but it looks like 'UNSPECIFIED' really means 'use workload identity' haha. When 2i2c-org#1124 was deployed to meom-ige yesterday, it seems to have enabled workload identity, causing cloud access to stop working, leading to https://2i2c.freshdesk.com/a/tickets/107. Further investigation on what happened here is needed, but I've currently fixed it by just deploying this change for meom-ige. - All hubs are given access to all buckets we create. This is inadequete, and needs to be more fine grained. Ref 2i2c-org#669 Ref 2i2c-org#1046
Much cleaner than trying to just 'do it all' based on a list of hubs.
Service Accounts have name length limitations that are busted by us using the full hub name there (looking at you, catalyst-cooperative hub). Instead, we explicitly specify the hub namespace.
c03ab51
to
69ec676
Compare
- Write docs on how to setup workload-identity - Create features/ subsection of the docs - Remove older instructions
69ec676
to
d954e00
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really love the level of documentation in this PR and I'm very grateful for understanding what's happening here with little knowledge of the terraform syntax. 🌼 Thank you, Yuvi!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left a few comments/typo fixes, but this looks good to me! 🚀
@@ -1,3 +1,7 @@ | |||
userServiceAccount: | |||
enabled: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this default to true? I'm not sure if it's better to explicitly enable features or rely on default config
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a comment about that - but the short version is, we always have a service account attached to pods, regardless of wether they have any permissions assigned or not. I think if we don't, the default
service account gets attached.
Note that we're *removing* the scratch bucket feature from these hubs - those buckets are not being used at all. Simplify!
Co-authored-by: Sarah Gibson <[email protected]>
…ple hub matrix jobs inputs
'Missing' means a prod hub job was generated but no support or staging
Its not something that is common in the codebase and we don't check them, so we shouldn't use them for now. This decision can be reverted later if required.
Co-authored-by: choldgraf <null> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
These are no longer being set by the automatic config-connector, so we need to set them. But I think pangeo-hubs wasn't using this anyway so maybe not a biggie.
pangeo-hubs I've preserved current access, as deploying this there would require #1153 to be done too. |
The GKE config connector was helpful in letting us deploy
Google Cloud Service Accounts with permissions for cloud storage
directly just from helm. However, it has been difficult to debug,
and in #669
we decided to move away from it and towards creating these
cloud resources via Terraform.
This commit adds:
bind it to a given Kubernetes Service Account, for a list of
hub namespaces passed in. This means that some hub initial deployments
now can not be done just with CD, but need manual work with
terraform. I think this would be any hub that wants to use
requestor pays or scratch buckets. This would need to be
documented.
(https://cloud.google.com/kubernetes-engine/docs/how-to/protecting-cluster-metadata#concealment)
which is what we were using earlier as alternative to config-connector
google provider. In b7b42ce,
we changed the default from 'SECURE' to 'UNSPECIFIED', but
it looks like 'UNSPECIFIED' really means 'use workload identity'
haha. When Switch to using pd-balanced for all user & dask nodes #1124 was
deployed to meom-ige yesterday, it seems to have enabled workload
identity, causing cloud access to stop working, leading to
https://2i2c.freshdesk.com/a/tickets/107. Further investigation on
what happened here is needed, but I've currently fixed it by
just deploying this change for meom-ige.
enable these on a per-hub basis. They are designed to be 'feature flags'
in terraform wherever possible. This still involves copying some terraform
outputs to helm values files unfortunately, as we use helm to create namespaces
(and terraform doesn't have a way to say 'create this namespace if required',
unlike eksctl)
hub features, with a corresponding entry in the topic page. Will slowly move
other documentation about features to this structure.
cloud-access.md
document to clarify it's about authfor 2i2c engineers, not about accessing cloud features from the hub
dask-staging, ohw and catalyst-cooperative. None of these actually have
used their bucket. requestor_pays is enabled for all of them.
TODO:
Perhaps we can move the Kubernetes secret creation away from terraform
and into helm to simplify this?
Ref #669
Ref #1046