Commit

Merge branch 'dev'

emlundell committed Dec 12, 2024
2 parents 685df14 + f3c9f2f commit bb95da5
Showing 15 changed files with 628 additions and 639 deletions.
15 changes: 15 additions & 0 deletions docs/dev-guides/about_opensciencelab.md
@@ -0,0 +1,15 @@
# About OpenScienceLab

OpenScienceLab is about Open Science.

Brought to you by...

the Alaska Satellite Facility: making remote sensing accessible.

And...

the OpenScienceLab team.

And...

by developers like you. Thank you.
110 changes: 110 additions & 0 deletions docs/dev-guides/cluster/build_and_deploy_opensarlab_cluster.md
@@ -0,0 +1,110 @@
# Build and Deploy OpenSARLab Cluster

1. Build the Docker images first, based on `opensarlab-container`.

1. Deploy the following in the same AWS account and region as the previous container images.

1. Create new GitHub repo

To organize repos, use the naming convention: `deployment-{location/owner}-{maturity?}-cluster`

1. Copy canonical `opensarlab-cluster` and commit.

    Either copy/paste the files or use `git remote add github https://github.com/ASFOpenSARlab/opensarlab-cluster.git`, as shown in the sketch below.

Make sure any hidden files (like .gitignore, .yamllint, etc.) are properly copied.
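
    A minimal sketch of this copy step, assuming the new deployment repo is `YourOrg/deployment-example-test-cluster` (both names hypothetical) and its default branch is `main`:

```bash
# Clone the new, empty deployment repo
git clone https://github.com/YourOrg/deployment-example-test-cluster.git
cd deployment-example-test-cluster

# Pull in the canonical opensarlab-cluster code
git remote add github https://github.com/ASFOpenSARlab/opensarlab-cluster.git
git fetch github
git merge github/main --allow-unrelated-histories

# Confirm hidden files (.gitignore, .yamllint, etc.) came across, then push
ls -la
git push origin main
```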

1. Within AWS, add a GitHub Connection. If this has been done before, the app should show your GitHub app name.

https://docs.aws.amazon.com/dtconsole/latest/userguide/connections-create-github.html

Make sure you are in the right region of your AWS account.

    Once the Connection is set up, save the Connection ARN for later.
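
    The Connection can also be created from the CLI; a sketch (the connection name is arbitrary, and the pending Connection must still be authorized through the console):

```bash
# Create a pending GitHub connection in the current region
aws codestar-connections create-connection \
    --provider-type GitHub \
    --connection-name deployment-example-test-cluster

# List connections to recover the Connection ARN later
aws codestar-connections list-connections --provider-type-filter GitHub
```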

1. Remember to add the current GitHub repo to the Connection app

GitHub > Settings > GitHub Apps > AWS Connector for GitHub > Repository Access

Add GitHub repo

1. Add an SSL certificate to AWS Certificate Manager.

You will need the ARN of the certificate.
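
    For example, a DNS-validated certificate can be requested and its ARN looked up with the AWS CLI (the domain below is a placeholder):

```bash
# Request a certificate for the lab domain (validate via DNS afterward)
aws acm request-certificate \
    --domain-name lab.example.org \
    --validation-method DNS

# List issued certificates to find the ARN
aws acm list-certificates --certificate-statuses ISSUED
```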

1. Update `opensciencelab.yaml` within the code. See an explanation of the various parts [here](../opensciencelab_yaml.md).

1. Deploy the CloudFormation template found at `pipeline/cf-setup-pipeline.yaml`.

Use the following parameters:

| Parameter | Description |
|-----------|-------------|
| Stack name | The CloudFormation stack name. For readability, append `-pipeline` to the end. |
| CodeStarConnectionArn | The ARN of the Connection made earlier. |
| CostTagKey | Useful if using billing allocation tags. |
| CostTagValue | Useful if using billing allocation tags. Note that many resources will have this in their name for uniqueness. It needs to be short in length. |
| GitHubBranchName | The branch name of the GitHub repo where the code resides. |
| GitHubFullRepo | The GitHub repo name. Needs to be in the format `{GitHub organization}/{GitHub repo}` from `https://github.com/OrgName/RepoName`. |

The pipeline will take a few seconds to form.

    If the CloudFormation stack fails to form completely, it will need to be fully deleted and the template re-uploaded.
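
    The stack can also be created from the CLI; a sketch with placeholder values (whether the template requires `CAPABILITY_NAMED_IAM` is an assumption here):

```bash
aws cloudformation create-stack \
    --stack-name deployment-example-test-cluster-pipeline \
    --template-body file://pipeline/cf-setup-pipeline.yaml \
    --capabilities CAPABILITY_NAMED_IAM \
    --parameters \
        ParameterKey=CodeStarConnectionArn,ParameterValue=arn:aws:codestar-connections:us-west-2:123456789012:connection/abc123 \
        ParameterKey=CostTagKey,ParameterValue=osl-billing \
        ParameterKey=CostTagValue,ParameterValue=osl-test \
        ParameterKey=GitHubBranchName,ParameterValue=main \
        ParameterKey=GitHubFullRepo,ParameterValue=YourOrg/deployment-example-test-cluster
```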

1. The pipeline will start to build automatically in CodePipeline.

A successful run will take about 12 minutes.

    If it takes significantly less time than that, the build might have failed even if CodePipeline reports success.

    Sometimes the final build stage will error with something like "build role not found". In this case, just retry the stage. There is sometimes a race condition for AWS role creation.

    During the course of the build, other CloudFormation stacks will be created. One of these is for the cluster. Within its Outputs is the Load Balancer URL, which can be used for external DNS.

1. Add the Portal SSO Token to Secrets Manager.

Update `sso-token/{region}-{cluster name}`.
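
    A sketch of updating the secret from the CLI (region, cluster name, and token value are placeholders):

```bash
aws secretsmanager put-secret-value \
    --secret-id sso-token/us-west-2-osl-test \
    --secret-string 'REPLACE_WITH_PORTAL_SSO_TOKEN'
```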

1. Add deployment to Portal

Update `labs.{maturity}.yaml` and re-build Portal.

    Within the Portal Access page, create a lab sheet with the `lab_short_name` found in `opensciencelab.yaml`.

Within the Portal Access page, add usernames and profiles as needed.

1. Add CloudShell access

    From the AWS console, start CloudShell (preferably in its own browser tab).

    Copy and paste in CloudShell do not use the shifted shortcuts of a normal terminal; the standard keyboard shortcuts apply.

If needed, update default editor:

- Append to ~/.bashrc the command `export EDITOR=vim`

    Set up access to the K8s cluster

- From the AWS EKS page, get the cluster name for below.

    - From the AWS IAM page, get the ARN of the role `{region name}-{cluster name}-user-full-access`

    - On the CloudShell terminal, run `aws eks update-kubeconfig --name {EKS cluster name} --role-arn {role ARN}`

    - Run `kubectl get pods -A`. You should see the hub pods and any user pods.
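
    Put together, a hypothetical CloudShell session might look like this (cluster name, region, and account ID are placeholders):

```bash
# Find the EKS cluster name
aws eks list-clusters

# Find the user-full-access role ARN
aws iam list-roles --query "Roles[?contains(RoleName, 'user-full-access')].Arn"

# Point kubectl at the cluster
aws eks update-kubeconfig \
    --name osl-test-cluster \
    --role-arn arn:aws:iam::123456789012:role/us-west-2-osl-test-cluster-user-full-access

# Verify access
kubectl get pods -A
```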

1. Bump the AutoScaling Groups

    For reasons unknown, brand-new ASGs need to be "primed" by setting the desired capacity to one, as shown in the sketch below. JupyterHub's autoscaler will scale the groups back down to zero if there is no use. This normally only has to be done once.
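
    A sketch of priming a group from the CLI (the group name is a placeholder; find the real names with `describe-auto-scaling-groups`):

```bash
# List the cluster's ASG names
aws autoscaling describe-auto-scaling-groups \
    --query "AutoScalingGroups[].AutoScalingGroupName"

# Prime a fresh group by setting its desired capacity to one
aws autoscaling set-desired-capacity \
    --auto-scaling-group-name osl-test-user-nodes \
    --desired-capacity 1
```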

1. Start a JupyterLab server to make sure one works

1. Within CloudShell, check the PVC and PV of the user volume. Make sure the K8s annotation `pv.kubernetes.io/provisioned-by: ebs.csi.aws.com` is present.

    If not, then JupyterHub volume management will fail and volumes will become orphaned upon lifecycle deletion.
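
    One way to check from CloudShell, assuming user volumes live in the `jupyter` namespace and claims are named `claim-{username}` (both assumptions):

```bash
# List user volume claims
kubectl get pvc -n jupyter

# Inspect the annotations on the PV backing one claim
kubectl describe pv "$(kubectl get pvc claim-someuser -n jupyter -o jsonpath='{.spec.volumeName}')" \
    | grep provisioned-by
```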


## Destroy OpenSARLab Cluster

To take down, consult [destroy deployment docs](../destroy_deployment.md)
75 changes: 75 additions & 0 deletions docs/dev-guides/cluster/egress_config.md
@@ -0,0 +1,75 @@
# Egress Configuration

If enabled, the Istio service mesh can apply rules for rate limiting and domain blocking. To facilitate usability, a custom configuration with custom rules is employed. The rules in a particular configuration apply only to the user or dask pod assigned to the corresponding egress profile. The configuration files must be placed in the `{root}/egress_configs` directory.

## Schema

In general, any parameter starting with `@` is global, `%` is sequential, and `+` is one-time.

Wildcards (`*`) are not allowed.

Comment lines start with `#` and are ignored.

Other line entries:

| Parameter | Value Type | Description |
| --- | --- | ----------- |
| `@profile` | str | Required. Egress profile name that will be assigned to the lab profile. There can only be one `@profile` per egress config file. Other `@profile` references will be ignored. Because the profile name is part of the naming structure of some k8s resources, it must be FQDN-compatible. |
| `@rate` | int | Required. Rate limit (per 10 seconds) applied to the assigned pod. The value is the maximum number of requests per second. Any subsequent `@rate` is ignored. To turn off the rate limit, set the value to `None`. |
| `@list` | `white` or `black` | Required. Either the config is a whitelist or a blacklist. Any subsequent `@list` is ignored. |
| `@include` | str | Optional. Any named `.conf` file within a sibling `includes` folder will be copied/inserted at the point of the `@include`. Having `@rate`, `@include`, or `@profile` within the "included" configs will throw an error. Other rules for ordering still apply. |
| `%port` | int,int | Required. Port value for the host. Must have a value between 1 and 65535. Ports can be consolidated by comma separation. Ports separated by `=>` will be treated like a redirect (_this is currently not working; the ports will be treated as separated by a comma_). |
| `%timeout` | str | Optional. Timeout applied to any subsequent host. The value must end in `s` for seconds, `m` for minutes, etc. |
| `+ip` | num | Optional. Any valid IP address. |
| `^` | str | Optional. Globally negate the hostname value. Useful for disabling included hosts. |

Lines not prepended with `@`, `%`, `+`, `^`, or `#` will be treated as a hostname.

## Examples

**Blacklist with rate limiting**

``` conf
# Included blacklist
%timeout 10s
%port 80=>443
example.com
```

``` conf
# This conf is required!!
# This will be used by profiles that don't have any explicit whitelist and are not None
@profile default
@rate 30
@list black
@include blacklist
# Note that the explicit redirect is not working properly and should not be used
# Both port 80 and port 443 will be allowed, though
%port 80=>443
%timeout 1s
blackhole.webpagetest.org
```

**Whitelist with rate limiting**

```conf
@profile m6a-large-whitelist
@rate 30
@list white
@include asf
@include aws
@include earthdata
@include github
@include local
@include mappings
@include mintpy
@include others
@include packaging
@include ubuntu
```
97 changes: 97 additions & 0 deletions docs/dev-guides/cluster/opensciencelab_yaml.md
@@ -0,0 +1,97 @@
# Contents of `opensciencelab.yaml`

Schema for the egress config can be found [here](../egress_config.md).

```yaml
---

parameters:
lab_short_name: The url-friendly short name of the lab deployment.
cost_tag_key: Name of the cost allocation tag.
cost_tag_value: Value of the cost allocation tag. Also used by cloudformation during setup for naming.
admin_user_name: Username of initial JupyterHub admin
certificate_arn: AWS arn of the SSL certificate held in Certificate Manager
container_namespace: A namespaced path within AWS ECR containing custom images
lab_domain: Domain of JupyterHub deployment. Use `load balancer` if not known.
portal_domain: Domain of the OSL Portal. Used to communicate with email services, etc.

# Volume and snapshot lifecycle management
days_till_volume_deletion: The number of integer days after last server use when the user's volume is deleted. To never delete the volume, use the value 365000.
days_after_server_stop_till_warning_email: Comma-separated list of integer days after last server use when the user gets a warning email. Must have at least one value. To never send emails, use the value 365000.
days_till_snapshot_deletion: The number of integer days after last server use when the user's snapshot is deleted. To never delete the snapshot, use the value 365000.
days_after_server_stop_till_deletion_email: Number of integer days after last server use when the user gets an email notifying about permanent deletion of data. Must have at least one value. To never send emails, use the value 365000.
utc_hour_of_day_snapshot_cron_runs: Integer hour (UTC) when the daily snapshot cron runs.
utc_hour_of_day_volume_cron_runs: Integer hour (UTC) when the daily volume cron runs.

# Versions of software installed
eks_version: '1.31' # https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html
kubectl_version: '1.31.0/2024-09-12' # https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html
aws_ebs_csi_driver_version: '2.36.0' # https://github.com/kubernetes-sigs/aws-ebs-csi-driver/releases
jupyterhub_helm_version: '3.3.7' # https://jupyterhub.github.io/helm-chart/
jupyterhub_hub_image_version: '4.1.5' # Match App Version of JupyterHub Helm
aws_k8s_cni_version: 'v1.18.5' # https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html
cluster_autoscaler_helm_version: '9.43.1' # https://github.com/kubernetes/autoscaler/releases > cluster-autoscaler-chart
istio_version: '1.23.2' # https://github.com/istio/istio/releases; set to None if disabling Istio
dask_helm_version: '2024.1.0' # https://helm.dask.org/ > dask-gateway-{version}; Set to None if disabling Dask

nodes:
- name: hub # Required
instance: The EC2 instance for the hub node. Type t3a.medium is preferred.
min_number: 1 # Required
max_number: 1 # Required
node_policy: hub # Required
is_hub: True # Required

- name: daskcontroller # Required
instance: t3a.medium, t3.medium
min_number: 1 # Required
max_number: 1 # Required
node_policy: dask_controller # Required
is_dask_controller: True # Required
is_spot: True

- name: Name of node type. Must be alphanumeric (no special characters, whitespace, etc.)
instance: The EC2 instance type for the node. Fallback types separated by commas. (m6a.xlarge, m5a.xlarge)
min_number: Minimum number of running node of this type in the cluster (0)
max_number: Maximum number of running node of this type in the cluster (25)
node_policy: Node permission policy (user)
root_volume_size: Size of the root volume of the EC2 (GiB) (Optional, range 1 - 16,384)
is_dask_worker: The EC2 is a dask worker (Optional, True).
is_spot: The EC2 is part of a spot fleet (Optional, True).

# Service accounts allow a built-in way to interact with AWS resources from within a server.
# However, the default AWS profile is overwritten, which may have unintended consequences.
service_accounts:
- name: service_account_name
namespace: namespace of k8s resource (jupyter)
permissions:
- Effect: "Allow"
Action:
- "AWS Resource Action"
Resource: "AWS Resource ARN"

dask_profiles:
- name: Name of dask profile that the user can select (Example 1)
short_name: example_1
description: "Basic worker used by example notebook"
image_url: FQDN with docker tags (233535791844.dkr.ecr.us-west-2.amazonaws.com/smce-test-opensarlab/daskworker:180a826). If not public, the domain must be in the same AWS account as the cluster.
node_name: Node must be defined as a dask worker.
egress_profile: Name of the egress config to use. Do not include `.conf` suffix (Optional)

lab_profiles:
- name: Name of profile that users can select (SAR 1)
description: Description of profile
image_url: FQDN of JupyterLab single user image with docker tags ( 233535791844.dkr.ecr.us-west-2.amazonaws.com/smce-test-opensarlab/sar:ea3e147). If not public, the domain must be in the same AWS account as the cluster.
hook_script: Name of the script run on user server startup (sar.sh) (Optional)
memory_guarantee: RAM usage guaranteed per user (6G) (Optional. Defaults to 0% RAM.)
memory_limit: RAM usage limit per user (16G) (Optional. Defaults to 100% RAM of server.)
cpu_guarantee: CPU usage guaranteed per user (15) (Optional. Defaults to 0% CPU. Memory limits are preferable.)
cpu_limit: CPU usage limit per user (30) (Optional. Defaults to 100% CPU of server. Memory limits are preferable.)
storage_capacity: Size of each user's home directory (500Gi). Cannot be reduced after allocation.
node_name: Node name as given in the above section (sar1)
delete_user_volumes: If True, deletes user volumes upon server stopping (Optional. Defaults to False.)
desktop: If True, use Virtual Desktop by default (Optional. Defaults to False.) The desktop environment must be installed on the image.
default: If True, this specific profile is selected by default (Optional. False if not explicitly set.)
service_account: Name of previously defined service account to apply to profile (Optional)
egress_profile: Name of the egress config to use. Do not include `.conf` suffix (Optional)
```
59 changes: 59 additions & 0 deletions docs/dev-guides/container/build_and_deploy_opensarlab_image.md
@@ -0,0 +1,59 @@
# Build and Deploy OpenSARLab Image Container

## Setup Container Build in AWS

1. Create AWS account if needed

1. Gain GitHub access if needed

1. Create new GitHub repo

To organize repos, use the naming convention: `deployment-{location/owner}-{maturity?}-container`

1. Copy canonical `opensarlab-container` and commit

Either copy/paste or use `git remote add github https://github.com/ASFOpenSARlab/opensarlab-container.git`

1. Within AWS, add a GitHub Connection. If this has been done before, the app should show your GitHub app name.

https://docs.aws.amazon.com/dtconsole/latest/userguide/connections-create-github.html

Make sure you are in the right region of your AWS account.

    Once the Connection is set up, save the Connection ARN for later.

1. Remember to add the current GitHub repo to the Connection app

GitHub > Settings > GitHub Apps > AWS Connector for GitHub > Repository Access

Add GitHub repo

1. Within AWS CloudFormation, upload the template file `cf-container.yaml` and build.

When prompted, use the Parameters:

| Parameter | Description |
|-----------|-------------|
| Stack name | The CloudFormation stack name. For readability, append `-pipeline` to the end. |
| CodeStarConnectionArn | The ARN of the Connection made earlier. |
| ContainerNamespace | The ECR prefix acting as a namespace for the images. This will be needed for the cluster's `opensarlab.yaml`. |
| CostTagKey | Useful if using billing allocation tags. |
| CostTagValue | Useful if using billing allocation tags. Note that many resources will have this in their name for uniqueness. It needs to be short in length. |
| GitHubBranchName | The branch name of the GitHub repo where the code resides. |
| GitHubFullRepo | The GitHub repo name. Needs to be in the format `{GitHub organization}/{GitHub repo}` from `https://github.com/OrgName/RepoName`. |

The pipeline will take a few seconds to form.

    If the CloudFormation stack fails to form completely, it will need to be fully deleted and the template re-uploaded.
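
    As with the cluster pipeline, the stack can be created from the CLI; a sketch with placeholder values (whether the template requires `CAPABILITY_NAMED_IAM` is an assumption here):

```bash
aws cloudformation create-stack \
    --stack-name deployment-example-test-container-pipeline \
    --template-body file://cf-container.yaml \
    --capabilities CAPABILITY_NAMED_IAM \
    --parameters \
        ParameterKey=CodeStarConnectionArn,ParameterValue=arn:aws:codestar-connections:us-west-2:123456789012:connection/abc123 \
        ParameterKey=ContainerNamespace,ParameterValue=example-opensarlab \
        ParameterKey=CostTagKey,ParameterValue=osl-billing \
        ParameterKey=CostTagValue,ParameterValue=osl-test \
        ParameterKey=GitHubBranchName,ParameterValue=main \
        ParameterKey=GitHubFullRepo,ParameterValue=YourOrg/deployment-example-test-container
```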

1. The pipeline will start to build automatically in CodePipeline.

A successful run will take about 20 minutes.

    If it takes significantly less time than that, the build might have failed even if CodePipeline reports success.


## Destroy OpenSARLab Image Container

To take down, consult [destroy deployment docs](../destroy_deployment.md)
@@ -1,4 +1,4 @@
[Return to Developer Guide](../dev.md)
[Return to Developer Guide](../../dev.md)

# There are a few options for creating conda environments in OpenSARLab.
Each option comes with benefits and drawbacks.
File renamed without changes.
