merging 5750/5803/5804/5805/5806/5807/5808 to prod #5810

Merged · Jun 17, 2024

Commits (26):
- `1cf940c` WIP (shaneknapp, May 21, 2024)
- `ee39223` more WIP (shaneknapp, May 22, 2024)
- `e2a898d` more edits (shaneknapp, May 28, 2024)
- `995098c` moar edits (shaneknapp, Jun 5, 2024)
- `5d37160` more more edits (shaneknapp, Jun 6, 2024)
- `d365228` Enable enhanced privileges for Data 100 summer instructors (balajialg, Jun 15, 2024)
- `98f7b62` Merge pull request #5803 from balajialg/ds100_staff (balajialg, Jun 15, 2024)
- `6ca3799` removing errant space (shaneknapp, Jun 15, 2024)
- `8a0a523` Merge pull request #5804 from shaneknapp/fix-data100-config (shaneknapp, Jun 15, 2024)
- `020a485` quick readme update (shaneknapp, Jun 15, 2024)
- `a8af219` Merge pull request #5805 from shaneknapp/update-readme (shaneknapp, Jun 15, 2024)
- `fca19b9` added newline (shaneknapp, Jun 15, 2024)
- `ea449d7` Merge pull request #5806 from shaneknapp/update-readme (shaneknapp, Jun 15, 2024)
- `0e5f0ac` Bump jupyter-server-proxy for security update. (ryanlovett, Jun 17, 2024)
- `98508ac` Bump and disable paup. (ryanlovett, Jun 17, 2024)
- `7022c95` Temporarily unpin DataFrames version. (ryanlovett, Jun 17, 2024)
- `2401d1d` Scale down workshop hub RAM (balajialg, Jun 17, 2024)
- `fe723f4` Merge pull request #5808 from balajialg/workshop_ram (balajialg, Jun 17, 2024)
- `2f6c030` Merge pull request #5807 from ryanlovett/jsp-4.2.0 (ryanlovett, Jun 17, 2024)
- `79bcd14` expanding upon DNS black magic *waves hands* (shaneknapp, Jun 17, 2024)
- `0558bb9` adding some reasoning behind doing this herculean task (shaneknapp, Jun 17, 2024)
- `a788ff3` more verbiage (shaneknapp, Jun 17, 2024)
- `7e47753` more end-of-switchover task verbiage (shaneknapp, Jun 17, 2024)
- `b82684e` update dns/ip stuff acccording to felders feedback (shaneknapp, Jun 17, 2024)
- `cec4201` Update clusterswitch.md (felder, Jun 17, 2024)
- `a297d4a` Merge pull request #5750 from shaneknapp/cluster-switch-details (shaneknapp, Jun 17, 2024)
6 changes: 5 additions & 1 deletion README.md
@@ -3,7 +3,11 @@
# Berkeley JupyterHubs

Contains a fully reproducible configuration for JupyterHub on datahub.berkeley.edu,
as well as its single user image.
as well as the single user images.

[UC Berkeley Datahub](https://cdss.berkeley.edu/data)

[UC Berkeley CDSS](https://cdss.berkeley.edu)

## Branches

2 changes: 1 addition & 1 deletion deployments/astro/image/environment.yml
@@ -6,7 +6,7 @@ channels:

dependencies:
- python=3.11.*
- jupyter-server-proxy==4.1.2
- jupyter-server-proxy==4.2.0
# A linux desktop environment
- websockify

8 changes: 7 additions & 1 deletion deployments/biology/image/bio1b-packages.bash
@@ -1,5 +1,11 @@
# Install PAUP* for BIO 1B
# https://github.com/berkeley-dsep-infra/datahub/issues/1699
wget http://phylosolutions.com/paup-test/paup4a168_ubuntu64.gz -O ${CONDA_DIR}/bin/paup.gz

# This package was requested in 2020 for the instructor to try out.
# The 168 version doesn't exist so I've bumped it to 169, but also disabled
# it in case the package is no longer needed.
return

wget https://phylosolutions.com/paup-test/paup4a169_ubuntu64.gz -O ${CONDA_DIR}/bin/paup.gz
gunzip ${CONDA_DIR}/bin/paup.gz
chmod +x ${CONDA_DIR}/bin/paup
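
A note on the early `return` above: it only behaves as intended if this file is sourced by the image build; bash rejects a top-level `return` in a script that is executed directly. If the invocation style isn't guaranteed, a guard along these lines works either way (a sketch, not taken from the repo):

    # succeed quietly whether the file is sourced or executed
    return 0 2>/dev/null || exit 0
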
2 changes: 1 addition & 1 deletion deployments/biology/image/environment.yml
@@ -9,7 +9,7 @@ dependencies:
- nb_conda_kernels=2.3.1

# proxy web applications
- jupyter-server-proxy==4.1.2
- jupyter-server-proxy==4.2.0
- jupyter-rsession-proxy==2.0.1

# Packages from bioconda for IB134L
2 changes: 1 addition & 1 deletion deployments/cee/image/environment.yml
@@ -5,7 +5,7 @@
# Only libraries *not* available in PyPI should be here
dependencies:
- python=3.11.*
- jupyter-server-proxy==4.1.2
- jupyter-server-proxy==4.2.0
#adding math functionality
- matplotlib=3.7.*
- scipy=1.10.*
18 changes: 9 additions & 9 deletions deployments/data100/config/common.yaml
@@ -32,18 +32,18 @@ jupyterhub:
# this role will be assigned to...
groups:
- course::1524699::group::all-admins
# Data 100, Spring 2024, https://github.com/berkeley-dsep-infra/datahub/issues/5376
#course-staff-1531798:
#Data 100, Summer 2024, https://github.com/berkeley-dsep-infra/datahub/issues/5802
course-staff-1535115:
# description: Enable course staff to view and access servers.
# this role provides permissions to...
# scopes:
# - admin-ui
# - list:users!group=course::1531798
# - admin:servers!group=course::1531798
# - access:servers!group=course::1531798
scopes:
- admin-ui
- list:users!group=course::1535115
- admin:servers!group=course::1535115
- access:servers!group=course::1535115
# this role will be assigned to...
# groups:
# - course::1531798::group::Admins
groups:
- course::1535115::group::Admins
# Econ 148, Spring 2024, DH-225
#course-staff-1532866:
# description: Enable course staff to view and access servers.
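
To confirm the new Data 100 role is actually picked up after the config deploys, one option is to query the JupyterHub REST API for the hub's groups. This is a sketch: it assumes you hold an admin API token in `$TOKEN` and that the hub answers at data100.datahub.berkeley.edu.

    # list groups known to the hub; the Canvas-derived group from the config
    # above (course::1535115::group::Admins) should appear in the output
    curl -s -H "Authorization: token $TOKEN" \
      https://data100.datahub.berkeley.edu/hub/api/groups | jq .
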
2 changes: 1 addition & 1 deletion deployments/data101/image/environment.yml
@@ -34,7 +34,7 @@ dependencies:
- jupyter-archive==3.4.0
- jupyter-book==0.15.1
- jupyter-resource-usage==1.0.0
- jupyter-server-proxy==4.1.2
- jupyter-server-proxy==4.2.0
- jupyter_bokeh
- jupyterlab==4.0.11
- jupyterlab-favorites==3.0.0
2 changes: 1 addition & 1 deletion deployments/datahub/images/default/environment.yml
@@ -73,7 +73,7 @@ dependencies:

# data8; foundation
- datascience==0.17.6
- jupyter-server-proxy==4.1.2
- jupyter-server-proxy==4.2.0
- jupyter-rsession-proxy==2.2.0
- folium==0.12.1.post1

2 changes: 1 addition & 1 deletion deployments/dev/images/default/environment.yml
@@ -5,7 +5,7 @@ dependencies:
# bug w/notebook and traitlets: https://github.com/jupyter/notebook/issues/7048
- traitlets=5.9.*

- jupyter-server-proxy==4.1.2
- jupyter-server-proxy==4.2.0
- jupyter-rsession-proxy==2.2.0

- syncthing==1.23.5
2 changes: 1 addition & 1 deletion deployments/eecs/image/environment.yml
@@ -7,7 +7,7 @@ dependencies:
- python=3.11.*
- nbclassic==1.0.0

- jupyter-server-proxy==4.1.2
- jupyter-server-proxy==4.2.0
# Visual Studio Code!
- jupyter-vscode-proxy=0.1
- code-server=4.5.2
2 changes: 1 addition & 1 deletion deployments/ischool/image/environment.yml
@@ -8,7 +8,7 @@ dependencies:
- jupyter-rsession-proxy==2.2.0
# https://github.com/berkeley-dsep-infra/datahub/issues/5251
- nodejs=16 # code-server requires node < 17
- jupyter-server-proxy==4.1.2
- jupyter-server-proxy==4.2.0
- jupyter-vscode-proxy==0.5
- code-server==4.10.1
# bug w/notebook and traitlets: https://github.com/jupyter/notebook/issues/7048
2 changes: 1 addition & 1 deletion deployments/julia/image/environment.yml
@@ -1,5 +1,5 @@
dependencies:
- jupyter-server-proxy==4.1.2
- jupyter-server-proxy==4.2.0
- nodejs==20.8.1
- pip==22.3.1
- python==3.11.*
2 changes: 1 addition & 1 deletion deployments/julia/image/install-julia-packages.jl
@@ -15,7 +15,7 @@ Pkg.add.([
Pkg.PackageSpec(;name="VegaLite", version="2.6.0"),
Pkg.PackageSpec(;name="CSVFiles", version="1.0.1"),
Pkg.PackageSpec(;name="Distributions", version="0.23.11"),
Pkg.PackageSpec(;name="DataFrames", version="0.21.8"),
Pkg.PackageSpec(;name="DataFrames"),
Pkg.PackageSpec(;name="Plots", version="1.24.3"),
Pkg.PackageSpec(;name="Images", version="0.24.1"),
Pkg.PackageSpec(;name="PyPlot", version="2.10.0"),
2 changes: 1 addition & 1 deletion deployments/publichealth/image/environment.yml
@@ -1,7 +1,7 @@
dependencies:
- pip
- syncthing==1.18.6
- jupyter-server-proxy==4.1.2
- jupyter-server-proxy==4.2.0
- jupyter-rsession-proxy==2.2.0
- pip:
# bug w/notebook and traitlets: https://github.com/jupyter/notebook/issues/7048
2 changes: 1 addition & 1 deletion deployments/shiny/image/environment.yml
@@ -3,7 +3,7 @@ dependencies:
- ipywidgets==8.1.2
- jupyter-archive==3.4.0
- jupyter-resource-usage==1.0.1
- jupyter-server-proxy==4.1.2
- jupyter-server-proxy==4.2.0
- jupyter-rsession-proxy==2.2.0
- jupyter-syncthing-proxy==1.0.3
- jupyterhub==4.1.5
2 changes: 1 addition & 1 deletion deployments/stat159/image/environment.yml
@@ -137,7 +137,7 @@ dependencies:
- syncthing==1.23.0
- websockify==0.11.0

- jupyter-server-proxy==4.1.2
- jupyter-server-proxy==4.2.0
# VS Code support
- jupyter-vscode-proxy==0.2
- code-server==4.10.1
2 changes: 1 addition & 1 deletion deployments/stat20/image/environment.yml
@@ -5,7 +5,7 @@ channels:

dependencies:
- syncthing==1.22.2
- jupyter-server-proxy==4.1.2
- jupyter-server-proxy==4.2.0
- jupyter-rsession-proxy==2.2.0
# bug w/notebook and traitlets: https://github.com/jupyter/notebook/issues/7048
- traitlets=5.9.*
4 changes: 2 additions & 2 deletions deployments/workshop/config/common.yaml
@@ -49,5 +49,5 @@ jupyterhub:
subPath: "{username}"
memory:
# As low a guarantee as possible
guarantee: 4G
limit: 4G
guarantee: 1G
limit: 1G
120 changes: 91 additions & 29 deletions docs/admins/howto/clusterswitch.md
@@ -1,50 +1,112 @@
# Switching over a hub to a new cluster

This document describes how to switch an existing hub to a new cluster. The example used here refers to the data8x hub.
This document describes how to switch an existing hub to a new cluster. The example used here refers to moving all UC Berkeley Datahubs.

## Make a new cluster
You might find it easier to switch to a new cluster if you're running a [very old k8s version](https://cloud.google.com/kubernetes-engine/docs/release-notes), or in lieu of performing a [cluster credential rotation](https://cloud.google.com/kubernetes-engine/docs/how-to/credential-rotation). Sometimes starting from scratch is easier than an iterative and potentially destructive series of operations.

## Create a new cluster
1. Create a new cluster using the specifications here:
https://docs.datahub.berkeley.edu/en/latest/topic/cluster-config.html
https://docs.datahub.berkeley.edu/en/latest/admins/cluster-config.html
2. Set up helm on the cluster according to the instructions here:
http://z2jh.jupyter.org/en/latest/setup-helm.html
- Make sure the version of helm you're working with matches the version CircleCI is using.
For example: https://github.com/berkeley-dsep-infra/datahub/blob/staging/.circleci/config.yml#L169
3. Re-create all existing node pools for hubs, support and prometheus deployments in the new cluster. If the old cluster is still up and running, you will probably run out of CPU quota, as the new node pools will immediately default to three nodes. Wait ~15m for the new pools to wind down to zero, and then continue.

## Setting the 'context' for kubectl and work on the new cluster.
1. Ensure you're logged in to GCP: `gcloud auth login`
2. Pull down the credentials from the new cluster: `gcloud container clusters get-credentials <CLUSTER_NAME> --region us-central1`
3. Switch the kubectl context to this cluster: `kubectl config use-context gke_ucb-datahub-2018_us-central1_<CLUSTER_NAME>`

## Recreate node pools
Re-create all existing node pools for hubs, support and prometheus deployments in the new cluster.

If the old cluster is still up and running, you will probably run out of CPU quota, as the new node pools will immediately default to three nodes. Wait ~15m for the new pools to wind down to zero, and then continue.
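
For reference, recreating a pool from the command line looks roughly like the following. This is a sketch: the pool name, machine type, autoscaling bounds, labels and taints are placeholders and should mirror whatever the old cluster's pools used.

    gcloud container node-pools create <POOL_NAME> \
      --cluster=<NEW_CLUSTER_NAME> --region=us-central1 \
      --machine-type=n2-highmem-8 \
      --enable-autoscaling --min-nodes=0 --max-nodes=20 \
      --node-labels=hub.jupyter.org/node-purpose=user \
      --node-taints=hub.jupyter.org_dedicated=user:NoSchedule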

## Install and configure the certificate manager
Before you can deploy any of the hubs or support tooling, the certificate manager must be installed and
configured on the new cluster. Until this is done, `hubploy` and `helm` will fail with the following error:
`ensure CRDs are installed first`.

1. Create a new feature branch and update your helm dependencies: `helm dep up`
2. At this point, it's usually wise to upgrade `cert-manager` to the latest version found in the chart repo.
You can find this by running the following command:

    cert_manager_version=$(helm show all -n cert-manager jetstack/cert-manager | grep ^appVersion | awk '{print $2}')

3. Then, you can install the latest version of `cert-manager`:

    kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/${cert_manager_version}/cert-manager.yaml

4. Change the corresponding entry in `support/requirements.yaml` to `$cert_manager_version` and commit the changes (do not push).

## Create the node-placeholder k8s namespace
The [calendar autoscaler](https://docs.datahub.berkeley.edu/en/latest/admins/howto/calendar-scaler.html) requires the `node-placeholder` namespace. Run the following command to create it:

    kubectl create namespace node-placeholder

## Create a new static IP and switch DNS to point our new deployment at it.
1. Create a new static IP in the [GCP console](https://console.cloud.google.com/networking/addresses/add?project=ucb-datahub-2018).
2. Open [infoblox](https://infoblox.net.berkeley.edu) and change the wildcard and empty entries for datahub.berkeley.edu to point to the IP from the previous step.
3. Update `support/values.yaml`, under `ingress-nginx` with the newly created IP from infoblox: `loadBalancerIP: xx.xx.xx.xx`.
4. Add and commit this change to your feature branch (still do not push).

You will re-deploy the support chart in the next step.
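
If you prefer the CLI to the console for step 1, reserving and inspecting a regional static IP looks like this (the address name is illustrative; project and region match the rest of this doc):

    gcloud compute addresses create <NEW_INGRESS_IP_NAME> \
      --project=ucb-datahub-2018 --region=us-central1
    gcloud compute addresses describe <NEW_INGRESS_IP_NAME> \
      --project=ucb-datahub-2018 --region=us-central1 --format='value(address)'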

## Manually deploy the support and prometheus pools
First, update any node pools in the configs to point to the new cluster. Typically, this is just for the `ingress-nginx` controllers in `support/values.yaml`.

Now we will manually deploy the `support` helm chart:

    sops -d support/secrets.yaml > /tmp/secrets.yaml
    helm install -f support/values.yaml -f /tmp/secrets.yaml -n support support support/ --set installCRDs=true --debug --create-namespace

Before continuing, confirm via the GCP console that the IP that was defined in step 1 is now [bound to a forwarding rule](https://console.cloud.google.com/networking/addresses/list?project=ucb-datahub-2018). You can further confirm by listing the services in the [support chart](https://github.com/berkeley-dsep-infra/datahub/blob/staging/support/requirements.yaml) and making sure the ingress-controller is using the newly defined IP.

One special thing to note: our `prometheus` instance uses a persistent volume that contains historical monitoring data. This is specified in `support/values.yaml`, under the `prometheus:` block:

    persistentVolume:
      size: 1000Gi
      storageClass: ssd
      existingClaim: prometheus-data-2024-05-15

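Since the chart expects that claim to already exist, it's worth a quick check that it is present and bound in the new cluster before (and after) deploying support, otherwise the prometheus server pod will likely stay pending (namespace assumed to be `support`):

    kubectl --namespace=support get pvc prometheus-data-2024-05-15
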
## Manually deploy a hub to staging
Finally, we can attempt to deploy a hub to the new cluster! Any hub will do, but we should start with a low-traffic hub (eg: https://dev.datahub.berkeley.edu).

First, check the hub's configs for any node pools that need updating. Typically, this is just the core pool.

Second, update `hubploy.yaml` for this hub and point it to the new cluster you've created.

After this is done, add the changes to your feature branch (but don't push). After that, deploy a hub manually:

    hubploy deploy dev hub staging

When the deploy is done, visit that hub and confirm that things are working.
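
Besides poking at the hub in a browser, you can watch the pods and the public proxy service come up; this assumes the usual `<deployment>-staging` namespace naming:

    kubectl --namespace=dev-staging get pods --watch
    kubectl --namespace=dev-staging get svc proxy-public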

## Switch staging over to new cluster
1. Change the name of the cluster in hubploy.yaml to match the name you chose when creating your new cluster.
2. Make sure the staging IP is a 'static' IP - so we don't lose the IP. You can see the list of IPs used by the project by checking the google cloud console.
For example: https://console.cloud.google.com/networking/addresses/list?project=data8x-scratch
Make sure you are in the right project!
3. If the staging IP (which you can find in staging.yaml) is marked as 'ephemeral', mark it as 'static'
4. Make a PR that includes your hubploy.yaml change, but don't merge it just yet.
## Manually deploy remaining hubs to staging and prod
Now, update the remaining hubs' configs to point to the new node pools and `hubploy.yaml` to the cluster.

Now we will perform the IP switch over from the old cluster to the new cluster. There will be downtime during the switchover!
Then use `hubploy` to deploy them to staging as with the previous step. The easiest way to do this is to have a list of hubs in a text file, and iterate over it with a `for` loop:

The current easiest way to do this is:
1. Merge the PR.
2. Immediately delete the service 'proxy-public' in the appropriate staging namespace in the old cluster. Make sure you have the command ready for this so that you can execute reasonably quickly.
    for x in $(cat hubs.txt); do hubploy deploy ${x} hub staging; done
    for x in $(cat hubs.txt); do hubploy deploy ${x} hub prod; done

    gcloud container clusters list
    gcloud container clusters get-credentials ${OLDCLUSTER} --region=us-central1
    kubectl --namespace=data8x-staging get svc
    kubectl --namespace=data8x-staging delete svc proxy-public

As the PR deploys, staging on the new cluster should pick up the IP we released from the old cluster. This way we don't have to wait for DNS propagation time.
When done, add the modified configs to your feature branch (and again, don't push yet).

At this time you can switch to the new cluster and watch the pods come up.
## Update CircleCI
Once you've successfully deployed the hubs to the new cluster manually via `hubploy`, it's time to update CircleCI to point to the new cluster.

Once done, poke around and make sure the staging cluster works fine. Since data8x requires going through EdX in order to load a hub, testing can be tricky. If you're able, the easiest way is to edit an old course you have access to and point one the notebooks to the staging instance.
All you need to do is `grep` for the old cluster name in `.circleci/config.yml` and change it to the name of the new cluster. There should be just four entries: two for the `gcloud container clusters get-credentials <cluster-name>` commands, and two in comments. Make these changes and add them to your existing feature branch, but don't push yet.
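
A minimal sketch of that search and replace (cluster names are placeholders; BSD/macOS `sed` needs `-i ''`):

    grep -n '<OLD_CLUSTER_NAME>' .circleci/config.yml
    sed -i 's/<OLD_CLUSTER_NAME>/<NEW_CLUSTER_NAME>/g' .circleci/config.yml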

Assuming everything worked correctly, you can follow the above steps to switch production over.
## Create and merge your PR!
Now you can finally push your changes to github. Create a PR, merge to `staging` and immediately kill off the deploy jobs for `node-placeholder`, `support` and `deploy`.

## Get hub logs from old cluster
Prior to deleting the old cluster, fetch the usage logs.
Create another PR to merge to `prod` and that deploy should work just fine.

    HUB=data8x
    kubectl --namespace=${HUB}-prod exec -it $(kubectl --namespace=${HUB}-prod get pod -l component=hub -o name | sed 's_pod/__') -- grep -a 'seconds to ' jupyterhub.log > ${HUB}-usage.log
## Update log and billing sinks, BigQuery queries, etc.
I would recommend searching GCP console for all occurrences of the old cluster name, and fixing any bits that might be left over. This should only take a few minutes, but should definitely be done.
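
A couple of CLI starting points for that sweep (a sketch; the cluster name is a placeholder, and the Cloud Asset API has to be enabled for the second command):

    # log sinks that may still point at the old cluster
    gcloud logging sinks list --project=ucb-datahub-2018
    # search the project's resources for any remaining references to the old name
    gcloud asset search-all-resources --scope=projects/ucb-datahub-2018 \
      --query='<OLD_CLUSTER_NAME>'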

Currently these are being placed on google drive here:
https://drive.google.com/open?id=1bUIJYGdFZCgmFXkhkPzFalJ1v9T8v7__
FIN!

## Deleting the old cluster
