Allow JupyterHub admins different cloud permissions than standard users #9

Open · 1 of 4 tasks
yuvipanda opened this issue Apr 23, 2022 · 11 comments

@yuvipanda
Member

yuvipanda commented Apr 23, 2022

Context

@rabernat brought up the point that it's important for hubs to be able to create cloud buckets whenever they want, without having to rely entirely on 2i2c. This can be accomplished by giving hub admin accounts a different set of cloud credentials than regular users when they're logged in to the hub - that way, we can scope the credentials to just the extra permissions admins need (probably full GCS / S3 access) without having to give them full ownership of the cloud project.

Proposal

We already provide cloud credentials via workload identity on GCP and IRSA on AWS. Both work by matching a Kubernetes service account (KSA) to a GCP / AWS service account. We can have a different Kubernetes service account for admins and thus grant it different rights (see the sketch after the task list below).

  • Create a different KSA that is attached to hub admins' user pods
  • Write terraform config that optionally provisions an extra GCP Service Account for this admin KSA. Its permissions should be a superset of the regular permissions
  • Optionally give extra rights to admins
  • Write documentation on how to create additional storage buckets
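
For illustration, a rough sketch of the first task using the Kubernetes Python client; the namespace, KSA name, and GCP service account email are placeholders, and the `iam.gke.io/gcp-service-account` annotation is the standard GKE workload identity binding:

```python
# Sketch: create a separate Kubernetes ServiceAccount for hub admins and bind
# it to a GCP service account via GKE workload identity. Names are illustrative.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

NAMESPACE = "my-hub"                        # hypothetical hub namespace
ADMIN_KSA = "admin-user-sa"                 # hypothetical admin-only KSA
GCP_SA = "hub-admin@my-project.iam.gserviceaccount.com"  # hypothetical GCP SA

admin_service_account = client.V1ServiceAccount(
    metadata=client.V1ObjectMeta(
        name=ADMIN_KSA,
        namespace=NAMESPACE,
        # Pods running with this KSA inherit the GCP SA's permissions
        annotations={"iam.gke.io/gcp-service-account": GCP_SA},
    )
)

client.CoreV1Api().create_namespaced_service_account(NAMESPACE, admin_service_account)
```

In practice the terraform config from the second task would manage both the GCP service account and the IAM binding; the snippet above only shows the Kubernetes side.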

Updates and actions

No response

@rabernat

How would the UI side of this work? Would they just run aws s3 commands from the terminal? I rely heavily on the aws / gcp console for this currently.

@rabernat

Credentials for cloud storage use the cloud-provider IAM system. In my ideal world, credentials for these buckets would be automatically populated based on hub identity. However, since hub identity is different from cloud-provider identity, that's not trivial to do, and would require some kind of database mapping hub users to projects and project storage buckets. The concept of "groups" in JupyterHub could be very helpful here. Developing a general solution to this problem as part of z2jh would have a huge impact.
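
To make this concrete, one possible hook point is a JupyterHub `pre_spawn_hook` that chooses a Kubernetes service account (and therefore the cloud identity bound to it) per user. This is only a sketch assuming KubeSpawner; the service account names are hypothetical, and a similar check on group membership could cover the per-group case later:

```python
# In jupyterhub_config.py (or z2jh hub.extraConfig). Sketch only:
# route hub admins to a KSA with broader cloud permissions, and
# everyone else to the default, narrowly-scoped KSA.

def assign_service_account(spawner):
    if spawner.user.admin:
        spawner.service_account = "admin-user-sa"  # hypothetical admin KSA
    else:
        spawner.service_account = "user-sa"        # hypothetical default KSA

c.KubeSpawner.pre_spawn_hook = assign_service_account
```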

@yuvipanda
Member Author

There are two separate parts here:

  1. Different cloud credentials just for JupyterHub admins,
  2. Different cloud credentials per-group

(1) is easier to do than (2) now, since we already have code that has special overrides for hub admins (that's how we do the shared dir). I want to focus this issue on (1).

And yes, any AWS command / tool should 'just work' - aws on the terminal would work with all the permissions granted.
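
As a rough illustration of the "just works" behaviour: with IRSA (or workload identity on GCP) the standard SDK credential chain picks up the pod's identity automatically, so no keys are configured anywhere. The bucket name and region below are placeholders:

```python
# Sketch: under IRSA, boto3 resolves credentials from the pod's web identity
# token automatically; no access keys appear in the hub config.
import boto3

s3 = boto3.client("s3")

# An admin whose role allows bucket creation could do, e.g.:
s3.create_bucket(
    Bucket="my-hub-persistent-data",  # hypothetical bucket name
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},  # needed outside us-east-1
)

# Regular users with read/write access would simply use existing buckets:
for obj in s3.list_objects_v2(Bucket="my-hub-persistent-data").get("Contents", []):
    print(obj["Key"], obj["Size"])
```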

@scottyhq

scottyhq commented Jun 7, 2022

Just wanted to chime in here to say this would be really useful! I can think of a couple cases that (might?) be relatively straightforward to implement before tackling group-based permissions.

  1. admin creates a bucket without a lifecycle policy that everyone automatically has read-only access to (similar to current ~/shared folder)

  2. admin modifies the base service account policy to add additional buckets everyone can access. For example, in AWS you have to explicitly list buckets that are in other accounts but "requester pays". It seems many public datasets have the requester-pays configuration, and it would be nice to access those in addition to the scratch bucket (a small read example follows below): https://registry.opendata.aws/usgs-landsat/
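
For case 2, once the role policy lists the bucket, reads only need the requester-pays flag passed through. A hedged sketch with s3fs (the prefix shown is illustrative; charges accrue to our account):

```python
# Sketch: read from a requester-pays bucket such as usgs-landsat.
# The IAM policy attached to the hub's role must also allow access
# to this bucket; the prefix below is illustrative.
import s3fs

fs = s3fs.S3FileSystem(requester_pays=True)
print(fs.ls("usgs-landsat/collection02/")[:5])
```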

@rabernat

As we begin the new semester, I am pinging this issue to remind the team that this is an extremely high-value feature that would really accelerate the use of data on our hubs.

@yuvipanda
Member Author

@rabernat ok, so to be more specific, we want to allow admins to create buckets, right? And implement that in a way that generalizes?

@rabernat

Correct. This will empower the hub communities to manage their own cloud storage, rather than relying on 2i2c admins. Using object storage (rather than an NFS mount) is key for more cloud-native-style workflows.

@rabernat

I'm checking in on this issue. We continue to have requests from M2LInES and LEAP users to have a non-scratch bucket in which to store their data and share it with the hub team (but not the public).

@yuvipanda
Member Author

I've dealt with the specific issue here in 2i2c-org/infrastructure#1776 by making PERSISTENT_BUCKET a feature. That PR will enable it for LEAP and m2lines. How do we make sure that it doesn't balloon costs by users unexpectedly leaving data there?
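
One low-tech way to keep an eye on that: periodically sum usage per top-level prefix (the per-user directories) in the persistent bucket and report it to hub admins. A rough sketch with boto3; the bucket name is a placeholder:

```python
# Sketch: report storage used per top-level prefix so admins can see
# who is leaving data in the persistent bucket. Bucket name is a placeholder.
from collections import defaultdict
import boto3

BUCKET = "my-hub-persistent"  # hypothetical PERSISTENT_BUCKET name

s3 = boto3.client("s3")
usage = defaultdict(int)

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        top_level = obj["Key"].split("/", 1)[0]  # per-user prefix
        usage[top_level] += obj["Size"]

for prefix, nbytes in sorted(usage.items(), key=lambda kv: -kv[1]):
    print(f"{prefix}: {nbytes / 1e9:.1f} GB")
```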

@jmunroe

jmunroe commented Oct 14, 2022

I am caught between two ways of solving this issue of letting admins create cloud storage on hubs.

  1. It is "easy" to create buckets using the Google Console or the command line, assuming you have the right permissions. We could set it up with instructions to default to "requester-pays" and then give guidance on how to set lifecycle rules (a sketch of creating such a bucket in code follows this list). It would solve the immediate problem of letting admins create whatever storage buckets they want. I think this would only be an option on a "dedicated" cluster where the community partner is paying (either directly or via 2i2c) the entire cloud costs. It would then be the community partner's responsibility to manage the costs and lifecycle rules associated with that cloud storage.

  2. But I think that is not the "right" way to set it up (the way I would expect 2i2c cloud engineers to create and manage cloud storage on a hub). I assume we would modify the correct terraform configuration files so that we are practicing infrastructure-as-code and other devops goodness. I see this as being especially important in cases where we are asked to migrate a hub to another availability zone, decommission a hub, or facilitate the right to replicate: if the entire infrastructure is not managed, we run the risk of "forgetting" some resource at some point down the road. Is there potential to automate this process with a UI so that hub admins could deploy cloud storage in a managed way?
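
For reference on option 1: creating a requester-pays bucket with a lifecycle rule is only a few lines with the google-cloud-storage client, whether run by an admin directly or wrapped in tooling. Project, bucket name, location, and retention period are placeholders:

```python
# Sketch: create a requester-pays bucket with a delete-after-30-days
# lifecycle rule. All names and values are placeholders.
from google.cloud import storage

client = storage.Client(project="my-gcp-project")    # hypothetical project

bucket = client.bucket("my-hub-extra-bucket")         # hypothetical bucket name
bucket.requester_pays = True
bucket.add_lifecycle_delete_rule(age=30)              # delete objects older than 30 days

client.create_bucket(bucket, location="us-central1")
```

Option 2 would express the same thing as terraform resources instead, so the bucket stays tracked alongside the rest of the hub's infrastructure.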

Are cloud buckets something that needs to be created/destroyed frequently? What is the true "cost" to having 2i2c create this resource on behalf of users?

As a research user, waiting for "I.T." to deploy a resource like extra storage was frustrating when I knew it would be "easy" to do if I just had admin access to my own infrastructure. But thinking about it from the sustainability side, I am more hesitant to bypass any recommended cloud engineering best practices.

To be clear, it may be that for M2LInES and LEAP we just create the hubs for them so they can proceed with their work. My comments here are about the more general question of what 2i2c is providing in a "research hub" and how that should be represented on our product roadmap.

@yuvipanda
Member Author

This is currently being done in 2i2c-org/infrastructure#3932 for AWS
