Merge image rebuild

wtripp180901 committed Aug 16, 2023
2 parents 4c7f875 + 9cde995 commit d3daba4
Showing 27 changed files with 292 additions and 106 deletions.
1 change: 0 additions & 1 deletion .github/workflows/publish-helm-chart.yml
@@ -24,4 +24,3 @@ jobs:
token: ${{ secrets.GITHUB_TOKEN }}
version: ${{ steps.semver.outputs.version }}
app-version: ${{ steps.semver.outputs.short-sha }}

34 changes: 20 additions & 14 deletions README.md
@@ -1,8 +1,7 @@
# Slurm Docker Cluster

This is a multi-container Slurm cluster using Kubernetes. The Helm chart
creates a named volume for persistent storage of MySQL data files as well as
an NFS volume for shared storage.
This is a multi-container Slurm cluster using Kubernetes. The Slurm cluster Helm chart creates a named volume for persistent storage of MySQL data files. By default, it also installs the
RookNFS Helm chart (also in this repo) to provide shared storage across the Slurm cluster nodes.

## Dependencies

@@ -27,47 +26,51 @@ The Helm chart will create the following named volumes:

* var_lib_mysql ( -> /var/lib/mysql )

A named ReadWriteMany (RWX) volume mounted to `/home` is also expected, this can be external or can be deployed using the scripts in the `/nfs` directory (See "Deploying the Cluster")
A named ReadWriteMany (RWX) volume mounted to `/home` is also expected; this can be external or can be deployed using the provided `rooknfs` chart directory (see "Deploying the Cluster").

## Configuring the Cluster

All config files in `slurm-cluster-chart/files` will be mounted into the container to configure their respective services on startup. Note that changes to these files will not all be propagated to existing deployments (see "Reconfiguring the Cluster").
Additional parameters can be found in the `values.yaml` file, which will be applied on a Helm chart deployment. Note that some of these values will also not propagate until the cluster is restarted (see "Reconfiguring the Cluster").
All config files in `slurm-cluster-chart/files` will be mounted into the container to configure their respective services on startup. Note that changes to these files will not all be propagated to existing deployments (see "Reconfiguring the Cluster"). Additional parameters can be found in the `values.yaml` file for the Helm chart. Note that some of these values will also not propagate until the cluster is restarted (see "Reconfiguring the Cluster").
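
For example, chart values can be overridden at deployment time using Helm's standard `--values` flag (a sketch; `my-overrides.yaml` is a hypothetical file holding your changes):

```console
helm install <deployment-name> slurm-cluster-chart --values my-overrides.yaml
```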

## Deploying the Cluster

### Generating Cluster Secrets

On initial deployment ONLY, run
```console
./generate-secrets.sh
./generate-secrets.sh [<target-namespace>]
```
This generates a set of secrets. If these need to be regenerated, see "Reconfiguring the Cluster"
This generates a set of secrets in the target namespace to be used by the Slurm cluster. If these need to be regenerated, see "Reconfiguring the Cluster".

Be sure to take note of the Open OnDemand credentials; you will need them to access the cluster through a browser.
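
For example, to generate the secrets in a hypothetical `slurm-ns` namespace:

```console
./generate-secrets.sh slurm-ns
```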

### Connecting RWX Volume

A ReadWriteMany (RWX) volume is required, if a named volume exists, set `nfs.claimName` in the `values.yaml` file to its name. If not, manifests to deploy a Rook NFS volume are provided in the `/nfs` directory. You can deploy this by running
```console
./nfs/deploy-nfs.sh
```
and leaving `nfs.claimName` as the provided value.
A ReadWriteMany (RWX) volume is required for shared storage across cluster nodes. By default, the Rook NFS Helm chart is installed as a dependency of the Slurm cluster chart to provide an RWX-capable StorageClass for the required shared volume. If the target Kubernetes cluster has an existing storage class which should be used instead, set `storageClass` in `values.yaml` to the name of this existing class and disable the RookNFS dependency by setting `rooknfs.enabled = false`. In either case, the storage capacity of the provisioned RWX volume can be configured via `storage.capacity`.

See the separate RookNFS chart [values.yaml](./rooknfs/values.yaml) for further configuration options when using RookNFS to provide the shared storage volume.
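
As an illustrative sketch, reusing an existing RWX-capable class (the class name here is hypothetical) would mean setting something like the following in `values.yaml`:

```yaml
storageClass: my-existing-rwx-class # hypothetical pre-existing RWX-capable class
rooknfs:
  enabled: false # disable the bundled RookNFS dependency
storage:
  capacity: 10Gi # requested size of the shared RWX volume
```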

### Supplying Public Keys

To access the cluster via `ssh`, you will need to make your public keys available. All your public keys from localhost can be added by running

```console
./publish-keys.sh
./publish-keys.sh [<target-namespace>]
```
where `<target-namespace>` is the namespace in which the Slurm cluster chart will be deployed (i.e. using `helm install -n <target-namespace> ...`). This will create a Kubernetes Secret in the appropriate namespace for the Slurm cluster to use. Omitting the namespace arg will install the secrets in the default namespace.
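
For example, assuming a hypothetical `slurm-ns` namespace:

```console
./publish-keys.sh slurm-ns
helm install <deployment-name> slurm-cluster-chart -n slurm-ns
```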

### Deploying with Helm

After configuring `kubectl` with the appropriate `kubeconfig` file, deploy the cluster using the Helm chart:
```console
helm install <deployment-name> slurm-cluster-chart
```

NOTE: If using the RookNFS dependency, then the following must be run before installing the Slurm cluster chart:
```console
helm dependency update slurm-cluster-chart
```
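
Taken together, a first-time install that uses the bundled RookNFS dependency would run:

```console
helm dependency update slurm-cluster-chart
helm install <deployment-name> slurm-cluster-chart
```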

Subsequent releases can be deployed using:

```console
@@ -130,6 +133,7 @@ srun singularity exec docker://ghcr.io/stackhpc/mpitests-container:${MPI_CONTAINER_TAG}
```

Note: The mpirun script assumes you are running as user 'rocky'. If you are running as root, you will need to include the `--allow-run-as-root` argument.
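
As a generic Open MPI sketch (the application name and process count are placeholders, not part of this repo):

```console
mpirun --allow-run-as-root -np 4 ./my_mpi_app
```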

## Reconfiguring the Cluster

### Changes to config files
@@ -173,3 +177,5 @@ and then restart the other dependent deployments to propagate changes:
```console
kubectl rollout restart deployment slurmd slurmctld login slurmdbd
```

## Known Issues
18 changes: 10 additions & 8 deletions image/docker-entrypoint.sh
@@ -91,12 +91,6 @@ then
mkdir -p /home/rocky/.ssh
cp /tmp/authorized_keys /home/rocky/.ssh/authorized_keys

if [ -f /home/rocky/.ssh/id_rsa.pub ]; then
  echo "ssh keys already found"
else
  ssh-keygen -t rsa -f /home/rocky/.ssh/id_rsa -N ""
fi

echo "---> Setting permissions for user home directories"
pushd /home > /dev/null
for DIR in *
@@ -119,14 +113,22 @@ then
start_munge

echo "---> Setting up self ssh capabilities for OOD"

if [ -f /home/rocky/.ssh/id_rsa.pub ]; then
  echo "ssh keys already found"
else
  ssh-keygen -t rsa -f /home/rocky/.ssh/id_rsa -N ""
  chown rocky:rocky /home/rocky/.ssh/id_rsa /home/rocky/.ssh/id_rsa.pub
fi

ssh-keyscan localhost > /etc/ssh/ssh_known_hosts
echo "" >> /home/rocky/.ssh/authorized_keys #Adding newline to avoid breaking authorized_keys file
cat /home/rocky/.ssh/id_rsa.pub >> /home/rocky/.ssh/authorized_keys

echo "---> Starting Apache Server"

mkdir --parents /etc/ood/config/apps/shell
env > /etc/ood/config/apps/shell/env
# mkdir --parents /etc/ood/config/apps/shell
# env > /etc/ood/config/apps/shell/env

/usr/libexec/httpd-ssl-gencerts
/opt/ood/ood-portal-generator/sbin/update_ood_portal
11 changes: 0 additions & 11 deletions nfs/deploy-nfs.sh

This file was deleted.

11 changes: 0 additions & 11 deletions nfs/pvc.yaml

This file was deleted.

16 changes: 0 additions & 16 deletions nfs/teardown-nfs.sh

This file was deleted.

9 changes: 7 additions & 2 deletions publish-keys.sh
@@ -1,3 +1,8 @@
kubectl create configmap authorized-keys-configmap \
NAMESPACE="$1"
if [[ -z $1 ]]; then
  NAMESPACE=default
fi
echo "Installing in namespace $NAMESPACE"
kubectl -n $NAMESPACE create configmap authorized-keys-configmap \
"--from-literal=authorized_keys=$(cat ~/.ssh/*.pub)" --dry-run=client -o yaml | \
kubectl apply -f -
kubectl -n $NAMESPACE apply -f -
4 changes: 4 additions & 0 deletions rooknfs/Chart.yaml
@@ -0,0 +1,4 @@
apiVersion: v2
name: rooknfs
version: 0.0.1
description: A packaged installation of Rook NFS for Kubernetes.
3 changes: 3 additions & 0 deletions rooknfs/README.md
@@ -0,0 +1,3 @@
# RookNFS Helm Chart

See `values.yaml` for available config options.
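
A minimal sketch of the values these templates consume, with illustrative defaults matching the names previously hard-coded in the `nfs/` manifests:

```yaml
claimName: nfs-default-claim # PVC backing the NFS server
serverName: rook-nfs # name of the NFSServer resource
serverNamespace: rook-nfs # namespace for the NFS server and its PVC
systemNamespace: rook-nfs-system # namespace for the Rook NFS operator
shareName: share1 # name of the NFS export
storageClassName: rook-nfs-share1 # RWX StorageClass offered to consumers
storageCapacity: 1Gi # size of the backing PVC
# backingStorageClass: "" # optional StorageClass for the backing PVC
```
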
File renamed without changes.
50 changes: 50 additions & 0 deletions rooknfs/templates/hooks/pre-delete.yaml
@@ -0,0 +1,50 @@
# NOTE: The cleanup jobs defined here are required to ensure that things which
# Rook NFS is responsible for cleaning up are deleted before deleting the Rook
# pods which do the actual cleanup of NFS resources. For example, the RWX PVC
# must be deleted before the Rook StorageClass and provisioner pod. However,
# the PVC cannot be deleted until the pods which are using it are deleted, so
# the various Slurm node pods must actually be the first resources deleted.
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rook-nfs-cleanup
---
# TODO: Create a job-specific ClusterRole for the ServiceAccount
# instead of using the cluster-admin role here
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: rook-nfs-cleanup
subjects:
  - kind: ServiceAccount
    name: rook-nfs-cleanup
    namespace: {{ .Release.Namespace }}
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: batch/v1
kind: Job
metadata:
  name: rook-nfs-pre-delete-cleanup
  annotations:
    "helm.sh/hook": pre-delete
    "helm.sh/hook-delete-policy": hook-succeeded
    "helm.sh/hook-weight": "10"
spec:
  template:
    metadata:
      name: rook-nfs-pre-delete-cleanup
    spec:
      serviceAccountName: rook-nfs-cleanup
      containers:
        - name: tester
          image: bitnami/kubectl
          command:
            - "/bin/bash"
            - "-c"
            - |
              kubectl delete -n {{ .Values.serverNamespace }} nfsservers {{ .Values.serverName }} --wait
      restartPolicy: Never
---
18 changes: 11 additions & 7 deletions nfs/nfs.yaml → rooknfs/templates/nfs.yaml
@@ -3,30 +3,34 @@
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-default-claim
  namespace: rook-nfs
  name: {{ .Values.claimName }}
  namespace: {{ .Values.serverNamespace }}
spec:
{{- if .Values.backingStorageClass }}
  storageClassName: {{ .Values.backingStorageClass }}
{{- end }}
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
      storage: {{ .Values.storageCapacity }}
---
apiVersion: nfs.rook.io/v1alpha1
kind: NFSServer
metadata:
  name: rook-nfs
  namespace: rook-nfs
  name: {{ .Values.serverName }}
  namespace: {{ .Values.serverNamespace }}
spec:
  replicas: 1
  exports:
    - name: share1
    - name: {{ .Values.shareName }}
      server:
        accessMode: ReadWrite
        squash: "none"
        # A Persistent Volume Claim must be created before creating NFS CRD instance.
        persistentVolumeClaim:
          claimName: nfs-default-claim
          claimName: {{ .Values.claimName }}
  # A key/value list of annotations
  annotations:
    rook: nfs
---
10 changes: 6 additions & 4 deletions nfs/operator.yaml → rooknfs/templates/operator.yaml
@@ -1,13 +1,14 @@
---
apiVersion: v1
kind: Namespace
metadata:
  name: rook-nfs-system # namespace:operator
  name: {{ .Values.systemNamespace }}
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rook-nfs-operator
  namespace: rook-nfs-system # namespace:operator
  namespace: {{ .Values.systemNamespace }}
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
@@ -20,7 +21,7 @@ roleRef:
subjects:
  - kind: ServiceAccount
    name: rook-nfs-operator
    namespace: rook-nfs-system # namespace:operator
    namespace: {{ .Values.systemNamespace }}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
@@ -106,7 +107,7 @@ apiVersion: apps/v1
kind: Deployment
metadata:
  name: rook-nfs-operator
  namespace: rook-nfs-system # namespace:operator
  namespace: {{ .Values.systemNamespace }}
  labels:
    app: rook-nfs-operator
spec:
@@ -134,3 +135,4 @@ spec:
valueFrom:
  fieldRef:
    fieldPath: metadata.namespace
---
8 changes: 4 additions & 4 deletions nfs/rbac.yaml → rooknfs/templates/rbac.yaml
@@ -2,13 +2,13 @@
apiVersion: v1
kind: Namespace
metadata:
  name: rook-nfs
  name: {{ .Values.serverNamespace }}
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rook-nfs-server
  namespace: rook-nfs
  namespace: {{ .Values.serverNamespace }}
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
@@ -51,9 +51,9 @@ metadata:
subjects:
  - kind: ServiceAccount
    name: rook-nfs-server
    # replace with namespace where provisioner is deployed
    namespace: rook-nfs
    namespace: {{ .Values.serverNamespace }}
roleRef:
  kind: ClusterRole
  name: rook-nfs-provisioner-runner
  apiGroup: rbac.authorization.k8s.io
---
10 changes: 6 additions & 4 deletions nfs/sc.yaml → rooknfs/templates/sc.yaml
@@ -1,13 +1,15 @@
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  labels:
    app: rook-nfs
  name: rook-nfs-share1
  name: {{ .Values.storageClassName }}
parameters:
  exportName: share1
  nfsServerName: rook-nfs
  nfsServerNamespace: rook-nfs
  exportName: {{ .Values.shareName }}
  nfsServerName: {{ .Values.serverName }}
  nfsServerNamespace: {{ .Values.serverNamespace }}
provisioner: nfs.rook.io/rook-nfs-provisioner
reclaimPolicy: Delete
volumeBindingMode: Immediate
---