From a781677ccecc2057f47941a601043d27c8545f0f Mon Sep 17 00:00:00 2001
From: Dean Roehrich
Date: Thu, 1 Aug 2024 10:28:17 -0500
Subject: [PATCH] Relationship betw cray.nnf.node.drain taint and Storage resource.

The Storage status will be "Drained".

Document how to use the Storage's .spec.state to manually disable a node.

Signed-off-by: Dean Roehrich
---
 docs/guides/index.md                 |  2 +-
 docs/guides/node-management/drain.md | 54 +++++++++++++++++++++++-----
 mkdocs.yml                           |  2 +-
 3 files changed, 47 insertions(+), 11 deletions(-)

diff --git a/docs/guides/index.md b/docs/guides/index.md
index 96dd22d..768d483 100644
--- a/docs/guides/index.md
+++ b/docs/guides/index.md
@@ -24,5 +24,5 @@
 
 ## Node Management
 
-* [Draining A Node](node-management/drain.md)
+* [Disable or Drain a Node](node-management/drain.md)
 * [Debugging NVMe Namespaces](node-management/nvme-namespaces.md)
diff --git a/docs/guides/node-management/drain.md b/docs/guides/node-management/drain.md
index 8f996fd..9ea3381 100644
--- a/docs/guides/node-management/drain.md
+++ b/docs/guides/node-management/drain.md
@@ -1,4 +1,40 @@
-# Draining A Node
+# Disable Or Drain A Node
+
+## Disabling a node
+
+A Rabbit node can be manually disabled, indicating to the WLM that it should not schedule more jobs on the node. Jobs currently on the node will be allowed to complete at the discretion of the WLM.
+
+Disable a node by setting its Storage state to `Disabled`.
+
+```shell
+kubectl patch storage $NODE --type=json -p '[{"op":"replace", "path":"/spec/state", "value": "Disabled"}]'
+```
+
+When the Storage is queried by the WLM, it will show the disabled status.
+
+```console
+$ kubectl get storages
+NAME           STATE      STATUS     MODE   AGE
+kind-worker2   Enabled    Ready      Live   10m
+kind-worker3   Disabled   Disabled   Live   10m
+```
+
+To re-enable a node, set its Storage state to `Enabled`.
+
+```shell
+kubectl patch storage $NODE --type=json -p '[{"op":"replace", "path":"/spec/state", "value": "Enabled"}]'
+```
+
+The Storage state will show that it is enabled.
+
+```console
+kubectl get storages
+NAME           STATE     STATUS   MODE   AGE
+kind-worker2   Enabled   Ready    Live   10m
+kind-worker3   Enabled   Ready    Live   10m
+```
+
+## Draining a node
 
 The NNF software consists of a collection of DaemonSets and Deployments. The pods
 on the Rabbit nodes are usually from DaemonSets. Because of this, the `kubectl drain`
@@ -9,7 +45,7 @@ Given the limitations of DaemonSets, the NNF software will be drained by using t
 as described in
 [Taints and Tolerations](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/).
 
-## Drain NNF Pods From A Rabbit Node
+### Drain NNF pods from a rabbit node
 
 Drain the NNF software from a node by applying the `cray.nnf.node.drain` taint.
 The CSI driver pods will remain on the node to satisfy any unmount requests from k8s
@@ -19,16 +55,16 @@ as it cleans up the NNF pods.
 kubectl taint node $NODE cray.nnf.node.drain=true:NoSchedule cray.nnf.node.drain=true:NoExecute
 ```
 
-This will cause the node's `Storage` resource to be disabled:
+This will cause the node's `Storage` resource to be drained:
 
 ```console
 $ kubectl get storages
-NAME      STATE     STATUS     MODE   AGE
-rabbit1   Enabled   Disabled   Live   3m18s
-rabbit2   Enabled   Ready      Live   3m18s
+NAME           STATE     STATUS    MODE   AGE
+kind-worker2   Enabled   Drained   Live   5m44s
+kind-worker3   Enabled   Ready     Live   5m45s
 ```
 
-The `Storage` resource will contain the following message indicating the reason it has been disabled:
+The `Storage` resource will contain the following message indicating the reason it has been drained:
 
 ```console
 $ kubectl get storages rabbit1 -o json | jq -rM .status.message
@@ -43,9 +79,9 @@ kubectl taint node $NODE cray.nnf.node.drain-
 
 The `Storage` resource will revert to a `Ready` status.
 
-## The CSI Driver
+### The CSI driver
 
-While the CSI driver pods may be drained from a Rabbit node, it is advisable not to do so.
+While the CSI driver pods may be drained from a Rabbit node, it is inadvisable to do so.
 
 **Warning** K8s relies on the CSI driver to unmount any filesystems that may have
 been mounted into a pod's namespace. If it is not present when k8s is attempting
diff --git a/mkdocs.yml b/mkdocs.yml
index 258fec7..6e0535c 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -20,7 +20,7 @@ nav:
     - 'User Containers': 'guides/user-containers/readme.md'
     - 'Lustre External MGT': 'guides/external-mgs/readme.md'
     - 'Global Lustre': 'guides/global-lustre/readme.md'
-    - 'Draining A Node': 'guides/node-management/drain.md'
+    - 'Disable or Drain a Node': 'guides/node-management/drain.md'
     - 'Debugging NVMe Namespaces': 'guides/node-management/nvme-namespaces.md'
     - 'Directive Breakdown': 'guides/directive-breakdown/readme.md'
   - 'RFCs':