
[Hub] - Jupyter Meets the Earth #433

Closed
4 tasks done
consideRatio opened this issue May 27, 2021 · 16 comments

@consideRatio (Contributor) commented May 27, 2021

Background

A collaboration space created for Jupyter Meets the Earth.

Setup Information

  • Hub auth type: GitHub
  • Hub administrators: @consideRatio, @andersy005
  • Hub url: hub.jupytearth.org
  • Hub logo URL: https://pangeo-data.github.io/jupyter-earth/_static/jupyter-earth.png
  • Hub type: z2jh, dask-gateway, shared filesystem storage, shared object storage space
  • Hub cluster: External AWS account 286354552638 in us-west-2 region
  • Hub image: We want something based on pangeo-notebook, perhaps via a custom Dockerfile with a FROM statement referencing pangeo-notebook (a config sketch follows below).
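
As a sketch, the resulting image would then be referenced in the z2jh config roughly like this (registry, name, and tag below are hypothetical placeholders):

    singleuser:
      image:
        name: quay.io/example/jmte-user-env  # hypothetical registry/repo
        tag: "2021.05.27"                    # hypothetical tag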

Important Information

  • Link to leads issue: Private discussion in a Zoom call on May 26th between me, @andersy005, and @fperez
  • Hub config name: jmte
  • Community champion: @consideRatio / Erik Sundell
  • Hub start date: As soon as possible
  • Hub end date: None
  • Hub important dates:

Deploy To Do

  • Initial Hub deployment
  • Administrators able to log on
  • Community Champion satisfied with hub environment
  • Hub now in steady-state
@consideRatio changed the title from "[Hub] - [Hub name]" to "[Hub] - Jupyter Meets the Earth" on May 27, 2021
@yuvipanda (Member) commented:

@consideRatio \o/ - #391 (comment) has our draft docs for setting up a hub on AWS in this repo. I'd love for you to try them out so we can see how it goes!

@consideRatio (Author) commented May 27, 2021

@yuvipanda thanks for that pointer, it is very relevant for me to have right now so I better align with various infrastructure choices. I figure it's better for me to go with kops, even though my experience at this point is with eksctl, so as not to set up something different from the other clusters.

At the same time, is the main reason for abandoning EKS the inability to scale a managed node group to zero? I ended up opting for non-managed node groups and everything has been fine doing that. Wait! This belongs in #431; I'll ask it there.
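
For reference, the eksctl cluster config I have in mind is roughly this sketch (nodegroup name and sizes are hypothetical):

    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    metadata:
      name: jmte
      region: us-west-2
    nodeGroups:               # unmanaged nodegroups, as opposed to managedNodeGroups
      - name: core-a
        instanceType: m5.large
        minSize: 0            # unmanaged nodegroups can scale to zero
        maxSize: 4
        desiredCapacity: 1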

@yuvipanda (Member) commented:

I opened #431 to discuss EKS vs kops - I think your experience there will be invaluable. 2i2c-org/farallon-image#28 has current information on the switch - it eventually came down to costs, but I've found EKS somewhat clunky to use.

@consideRatio (Author) commented:

@yuvipanda I've now read through all the documentation at:

Questions raised and answered

Reading it from the perspective that I'll create the cloud infrastructure myself, I was uncertain about what I would create myself and what 2i2c's various scripts would handle.

Will an NFS server be deployed within the k8s cluster?

  • No, it must be self-hosted alongside or within the k8s cluster you deploy. I'll use AWS EFS.

Do the hubs in pilot-hubs assume they run in 2i2c-managed cloud projects?

  • No, I can provide a kubeconfig for access to any k8s cluster.

Do the hubs in pilot-hubs configure anything besides what's inside the k8s cluster?

  • No, but it is possible in GCP: there is automation to set up scratch buckets in GCP projects by creating k8s resources of a GCP-specific kind.

Do the hubs in pilot-hubs assume I set up a keychain or similar in some KMS service?

  • No. Secrets stored encrypted in the repo and decrypted during deployment are encrypted/decrypted via Google KMS, managed by 2i2c, to which only 2i2c engineers have access (I assume).
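
For reference, sops is pointed at such a KMS key via a .sops.yaml file at the repo root, roughly like this sketch (project and key names are hypothetical):

    creation_rules:
      - path_regex: secrets/.*\.yaml
        gcp_kms: projects/example-project/locations/global/keyRings/sops/cryptoKeys/sops-key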

Does the hub allow for managing custom images?

  • Yes, but only via /services/configurator, where the image can be set. This means the image must first have been built and pushed manually by the user wanting to provide a custom image.

List of misc questions:

  • What is included in basehub?
    • JupyterHub
    • A k8s ServiceAccount for the users (user-sa)
    • GCP Cloud resources for scratch buckets
    • A docs k8s Service and Deployment
    • NFS PVC to reference
    • NFS Share creator job
      • Seems to ensure a folder on the NFS server is created and chowned
  • What is included in daskhub?
    • Basehub
    • Dask-Gateway
  • What is centralized within each k8s cluster?
    • The support chart, including prometheus, grafana, cert-manager, and ingress-nginx (sketched after this list)
  • What is centralized outside the various k8s clusters?
    • Google KMS for use by SOPS?
    • Auth0 and an OAuth2 application registered for the hub specifically?
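
As a sketch of how I understand the support chart composes those components as Helm dependencies (the versions and repository URLs below are illustrative, not the repo's actual pins):

    apiVersion: v2
    name: support
    version: 0.1.0
    dependencies:
      - name: prometheus
        version: "14.*"       # illustrative version constraint
        repository: https://prometheus-community.github.io/helm-charts
      - name: grafana
        version: "6.*"
        repository: https://grafana.github.io/helm-charts
      - name: cert-manager
        version: "1.*"
        repository: https://charts.jetstack.io
      - name: ingress-nginx
        version: "3.*"
        repository: https://kubernetes.github.io/ingress-nginx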

How does the configurator influence the user image chosen, and how do its settings override the Helm chart's configuration?

  • A custom Spawner class is defined, augmenting KubeSpawner with the ConfiguratorSpawnerMixin
  • The ConfiguratorSpawnerMixin accesses the configurator service, presumably running on localhost, which responds with the configuration state - this state is then used to overwrite traitlets whenever the Spawner is about to start a user pod.
  • I believe the configurator's state is lost on restart of the hub pod, because its StorageBackend writes to a file that I think won't be persisted in the hub container.

@consideRatio (Author) commented May 28, 2021

For transparency and to help me think clearly, I'm writing up my thoughts on using 2i2c-org/pilot-hubs as the configuration base for the JMTE deployment.

  • I think open source contributions are more likely to become fruitful as part of 2i2c-org than in a standalone repo.
  • I'd like to better learn about the 2i2c-org infrastructure
  • I'd like to deliver a deployed hub quickly
  • I'm worried about abandoning my experience with hubploy, a functional standalone project, in favor of the repo's deployer script, which is new to me. It worries me that the deployer script is locked into 2i2c infrastructure by being part of this repo, in a way hubploy isn't. For that reason, a contribution to hubploy would feel more valuable to the open source community than one to the 2i2c deployer script.
  • I'm not confident about choosing between kops and EKS via eksctl to deploy the k8s cluster.
    • EKS costing ~$50/month more is not an issue
    • Maximum of ~30 users per node is not an issue
    • I have no experience with kops, but I have experience with eksctl.
  • I'm not confident about what it would mean to use Auth0 instead of GitHub directly, and I'm worried that coupling to Auth0 rather than GitHub directly could cause trouble if we ever need to decouple from 2i2c.
  • I'm worried about the added complexity of the configurator
  • I'm positive about the shared / shared-readwrite folders
  • I'm worried about what happens if we want a custom Helm chart, e.g. because we need to deploy some additional template for something very custom. Then we wouldn't change basehub, but instead create another meta-chart (sketched below).
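
To make that concrete, such a meta-chart would roughly be a Chart.yaml declaring basehub as a dependency, plus our own templates - a hypothetical sketch:

    apiVersion: v2
    name: jmte-hub            # hypothetical meta-chart name, analogous to daskhub
    version: 0.1.0
    dependencies:
      - name: basehub
        version: 0.1.0        # illustrative; would track the repo's basehub chart
        repository: file://../basehub
    # ...plus a templates/ directory holding the very custom extras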

@yuvipanda (Member) commented:

This is all beautiful, thank you for writing this up, @consideRatio!

The configurator is not required - in fact in most cases I just set the image tag in the config right now - like in https://github.com/2i2c-org/pilot-hubs/blob/fef7da6a93284d006a8536b144f3fd0a0be5a936/config/hubs/carbonplan.cluster.yaml#L59. So you can basically ignore the configurator now.

I actually think it'll be great for you to use eksctl in this case. I think we should pick and choose which one we want to use based on the circumstances - I think the ideal outcome of #431 is to determine when to use kops vs EKS. I suspect we'll end up using both for a while.

hubploy is in an interesting space. I think the current setup of 1 directory per deployment hasn't scaled well in repos with a lot of deployments, IME - too much duplication. I also think that introducing jsonnet could reduce complexity in the longer term, but that would be a radical change to hubploy, since the set directory structure is one of its core parts. Many parts of the current scripts are cannibalized from it (particularly around sops - but perhaps that should be its own small library). hubploy grew out of the deploy scripts I had for https://github.com/berkeley-dsep-infra/datahub, and perhaps something can grow out of what we have here? I personally don't plan on doing any more work on hubploy...

But, it's extremely important to have an off-ramp from this repo - that's an essential part of right-to-replicate. My earlier thinking was to make sure we have a way to just extract out a values.yaml + base-chart config from this repo so people can continue using the same deployment without any changes. But perhaps a better way is to use this opportunity to think of a way to decouple the deployment script from this repo?

Auth0 is primarily used for automating the creation of credentials - you don't have to use it! However, we currently don't have a way to store secret values in the repo to be merged in (something hubploy has), so we'll have to build that.

Can you give me an example of a super-custom thing you might want to deploy? My intuition would be that anything useful for JMTE will also be useful for others, so we could just incorporate that into one of the charts we have. Alternatively we can create a new meta chart.

I hope this is all helpful. Everything is nascent and malleable in this repo - I look forward to your experience and contributions shaping how things happen in this repo :)

@consideRatio (Author) commented:

@yuvipanda ❤️ thanks for your quick and thorough response!

An example of a super custom thing could be a conda-store server, but that is of course quite standalone and could run in a separate namespace and such. But if we want to maintain that, we would need to set up some automation in a separate repo, set up sops, set up a KMS location, etc. for that repo as well.

I'd like to get some sleep now, but I'd love to speak with you 1on1, brainstorm a bit, and then try to go at full speed with practical steps towards a functional hub for JMTE.

Would you have time to chat with me sometime during 14:00-18:00 in your timezone today? I assume it's 06:40 for you now, btw, and I'll sleep ~7 hours. I'll be available on Slack at your convenience, and you could also schedule a time slot here if you want.

@yuvipanda (Member) commented:

An example of a super custom thing could be a conda-store server, but that is of course quite standalone and could run in a separate namespace and such. But if we want to maintain that, we would need to set up some automation in a separate repo, set up sops, set up a KMS location, etc. for that repo as well.

I think of conda-store as something that would indeed be broadly useful! It could live in basehub, in fact.

I'll try to book a slot now - I see it's almost 4 AM for you?!

@consideRatio (Author) commented:

@yuvipanda thank you so much for your care and effort to help me learn a lot about the 2i2c setup!

Here are some notes I scribbled down while speaking with you, for future reference:

  • eksctl folder created in 2i2c-org/pilot-hubs
    • eksctl config regarding k8s cluster setup
    • CloudFormation config regarding S3 buckets etc.
    • EFS: setup-efs.py script regarding security groups etc.
    • CloudFormation stuff should become Terraform stuff in the long run
  • Our image is defined in pangeo-data/jupyter-earth repo
    • The image could be updated by:
      • JupyterHub admin using the configurator UI
      • GitHub workflow to automatically create a PR if an image was built and pushed
      • A manual PR to 2i2c-org/pilot-hubs
      • A workflow dispatched in 2i2c-org/pilot-hubs following an image build/push in some remote repo
    • We start out having the JupyterHub admins use the configurator UI under https://hub.jupytearth.org/services/configurator
  • Z2JH's native configuration options hub.extraFiles / singleuser.extraFiles are used to mount additional files (example below)
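
For example, mounting an extra ipython config file could look like this sketch (the path and contents are hypothetical):

    singleuser:
      extraFiles:
        ipython-startup:                             # arbitrary key naming this file
          mountPath: /etc/ipython/ipython_config.py  # hypothetical mount path
          stringData: |
            c.InteractiveShellApp.exec_lines = ["print('JMTE environment ready')"]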

@consideRatio (Author) commented May 31, 2021

PRs representing current progress

Status summary

  • I have an issue getting proxy-public (or any k8s Service of type LoadBalancer) to provide a public IP; this is the error (tracked in AWS Load Balancer Controller v2.1.3 + EKSCTL (Multiple tagged SGs) eksctl-io/eksctl#3459):

    Events:
      Type     Reason                  Age               From                Message
      ----     ------                  ----              ----                -------
      Normal   EnsuringLoadBalancer    3s (x2 over 10s)  service-controller  Ensuring load balancer
      Warning  SyncLoadBalancerFailed  3s (x2 over 8s)   service-controller  Error syncing load balancer: failed to ensure load balancer: Multiple tagged security groups found for instance i-0a24c650cd9fbfc68; ensure only the k8s security group is tagged; the tagged groups were sg-0731df37a5a8e6844(eksctl-jmte-cluster-ClusterSharedNodeSecurityGroup-2LRGJW3MQ4O8) sg-00efe23d69c9c0c4a(eksctl-jmte-nodegroup-core-a-SG-LGMTEQ7JLTNX)
    
  • I have an issue mounting the EFS server from the k8s pods
    I've created MountTargets and an AccessPoint for the FileSystem resource. Kubelet reports a timeout trying to mount a PVC, which in turn is reported to be bound to an NFS-specific PV (a generic sketch of this PV/PVC pairing follows after this list).

    Events:
      Type     Reason       Age    From                                                   Message
      ----     ------       ----   ----                                                   -------
      Normal   Scheduled    2m36s  default-scheduler                                      Successfully assigned prod/nfs-test-fsmrn to ip-192-168-31-119.us-west-2.compute.internal
      Warning  FailedMount  33s    kubelet, ip-192-168-31-119.us-west-2.compute.internal  Unable to attach or mount volumes: unmounted volumes=[home-base], unattached volumes=[home-base default-token-7pm2j]: timed out waiting for the condition
      Warning  FailedMount  26s    kubelet, ip-192-168-31-119.us-west-2.compute.internal  MountVolume.SetUp failed for volume "home-base" : mount failed: exit status 32
      Mounting command: systemd-run
      Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/18402c62-bfe0-454c-a84c-439f5be4b319/volumes/kubernetes.io~nfs/home-base --scope -- mount -t nfs fs-01707b06.efs.us-west-2.amazonaws.com:/homes/ /var/lib/kubelet/pods/18402c62-bfe0-454c-a84c-439f5be4b319/volumes/kubernetes.io~nfs/home-base
      Output: Running scope as unit run-27728.scope.
      mount.nfs: Connection timed out
    
  • No S3 storage buckets set up yet

  • No user-environment image build automation set up yet, but a Dockerfile is defined for use.
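
For reference, the PV/PVC pairing mentioned in the second item above is roughly the following sketch (names simplified; the EFS DNS name is the one from the mount log):

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: home-base
    spec:
      capacity:
        storage: 1Mi               # NFS ignores this, but the field is required
      accessModes: [ReadWriteMany]
      nfs:
        server: fs-01707b06.efs.us-west-2.amazonaws.com
        path: /homes/
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: home-base
      namespace: prod
    spec:
      storageClassName: ""         # bind to the pre-created PV instead of a dynamic one
      volumeName: home-base
      accessModes: [ReadWriteMany]
      resources:
        requests:
          storage: 1Mi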

@2i2c-org/tech-team anyone with a guess on what to do about the first two situations above? Note for the first issue, I've provided quite an exhaustive report in the linked issue.

@yuvipanda (Member) commented:

Yay!

For EFS, you need one mount target per subnet your EKS cluster is in, added to all the security groups in that subnet. Access points aren't used yet.
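
Since your cloud setup is partly CloudFormation, one mount target per subnet can be expressed roughly like this sketch (the subnet ID is hypothetical; the SG IDs are the ones from your error message):

    Resources:
      EfsMountTargetA:                        # repeat per subnet the cluster uses
        Type: AWS::EFS::MountTarget
        Properties:
          FileSystemId: fs-01707b06           # the EFS filesystem from your mount log
          SubnetId: subnet-0123456789abcdef0  # hypothetical subnet ID
          SecurityGroups:                     # at most five security groups per mount target
            - sg-0731df37a5a8e6844
            - sg-00efe23d69c9c0c4a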

@yuvipanda (Member) commented:

What kinda functionality do you want for storage buckets? A PANGEO_SCRATCH and SCRATCH_BUCKET setup per hub?

@consideRatio (Author) commented:

Yay!

For EFS, you need one mount target per subnet your EKS cluster is in, added to all the security groups in that subnet. Access points aren't used yet.

Noooooo! Yuck, ALL security groups? Not just one or a few? It turns out you can have at most five per mount target, and I have multiple worker nodegroups...

@consideRatio (Author) commented:

A PANGEO_SCRATCH and SCRATCH_BUCKET setup per hub?

Is this documented somewhere? What it means is not clearly defined in my mind yet, even though I saw a reference to an environment variable in a startup script for the pangeo-notebook image.

@yuvipanda (Member) commented:

Is this documented somewhere what it means?

pangeo-data/pangeo-cloud-federation#610 is the upstream discussion. With https://github.com/2i2c-org/pilot-hubs/blob/fef7da6a93284d006a8536b144f3fd0a0be5a936/hub-templates/basehub/values.yaml#L322 the customization in the image is not necessary.
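
The gist of that values.yaml line is injecting per-user scratch-bucket environment variables, roughly like this sketch (the bucket name is hypothetical):

    singleuser:
      extraEnv:
        # $(JUPYTERHUB_USER) is expanded by k8s from the user pod's own env
        SCRATCH_BUCKET: s3://example-scratch-bucket/$(JUPYTERHUB_USER)
        PANGEO_SCRATCH: s3://example-scratch-bucket/$(JUPYTERHUB_USER)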

@consideRatio (Author) commented:

We have a JMTE deployment active and functional; I think this can be closed now!
