From 539192030c229656f4671e257b908626cbcd37f7 Mon Sep 17 00:00:00 2001
From: Dean Roehrich
Date: Wed, 10 Jul 2024 14:23:19 -0500
Subject: [PATCH 1/4] NVMe namespace management

Describe how to view total NVMe space available, used, or leaked.
Describe how to reclaim leaked NVMe space.

Signed-off-by: Dean Roehrich
---
 docs/guides/index.md                          |  2 +
 .../guides/node-management/nvme-namespaces.md | 87 +++++++++++++++++++
 2 files changed, 89 insertions(+)
 create mode 100644 docs/guides/node-management/nvme-namespaces.md

diff --git a/docs/guides/index.md b/docs/guides/index.md
index e80105c..b074aa2 100644
--- a/docs/guides/index.md
+++ b/docs/guides/index.md
@@ -16,6 +16,7 @@
 * [Lustre External MGT](external-mgs/readme.md)
 * [Global Lustre](global-lustre/readme.md)
 * [Directive Breakdown](directive-breakdown/readme.md)
+* [NVMe Namespaces](nvme-namespaces/readme.md)
 
 ## NNF User Containers
 
@@ -24,3 +25,4 @@
 ## Node Management
 
 * [Draining A Node](node-management/drain.md)
+* [Debugging NVMe Namespaces](node-management/nvme-namespaces.md)

diff --git a/docs/guides/node-management/nvme-namespaces.md b/docs/guides/node-management/nvme-namespaces.md
new file mode 100644
index 0000000..6f23b5f
--- /dev/null
+++ b/docs/guides/node-management/nvme-namespaces.md
@@ -0,0 +1,87 @@
+# Debugging NVMe Namespaces
+
+## Total Space Available or Used
+
+Find the total space available, and the total space used, on a Rabbit node using the Redfish API. One way to access the API is to use the `nnf-node-manager` pod on that node.
+
+To view the space on node ee50, find its nnf-node-manager pod and then exec into it to query the Redfish API:
+
+```console
+[richerso@ee1:~]$ kubectl get pods -A -o wide | grep ee50 | grep node-manager
+nnf-system   nnf-node-manager-jhglm   1/1   Running   0   61m   10.85.71.11   ee50
+```
+
+Then query the Redfish API to view the total space used (`AllocatedBytes`) and the total space available (`GuaranteedBytes`):
+
+```console
+[richerso@ee1:~]$ kubectl exec --stdin --tty -n nnf-system nnf-node-manager-jhglm -- curl -S localhost:50057/redfish/v1/StorageServices/NNF/CapacitySource | jq
+{
+  "@odata.id": "/redfish/v1/StorageServices/NNF/CapacitySource",
+  "@odata.type": "#CapacitySource.v1_0_0.CapacitySource",
+  "Id": "0",
+  "Name": "Capacity Source",
+  "ProvidedCapacity": {
+    "Data": {
+      "AllocatedBytes": 128849888,
+      "ConsumedBytes": 128849888,
+      "GuaranteedBytes": 307132496928,
+      "ProvisionedBytes": 307261342816
+    },
+    "Metadata": {},
+    "Snapshot": {}
+  },
+  "ProvidedClassOfService": {},
+  "ProvidingDrives": {},
+  "ProvidingPools": {},
+  "ProvidingVolumes": {},
+  "Actions": {},
+  "ProvidingMemory": {},
+  "ProvidingMemoryChunks": {}
+}
+```
+
+## Total Orphaned or Leaked Space
+
+To determine the amount of orphaned space, look at the Rabbit node when there are no allocations on it. If there are no allocations, then there should be no `NnfNodeBlockStorages` in the k8s namespace with the Rabbit's name:
+
+```console
+[richerso@ee1:~]$ kubectl get nnfnodeblockstorage -n ee50
+No resources found in ee50 namespace.
+```
+
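+As a cross-check, the same query can be widened to all namespaces with `-A`. Assuming no other allocations exist anywhere on the system, this should also come back empty:
+
+```console
+[richerso@ee1:~]$ kubectl get nnfnodeblockstorage -A
+No resources found
+```
+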
+To check that there are no orphaned namespaces, you can use the `nvme` command while logged into that Rabbit node:
+
+```console
+[root@ee50:~]# nvme list
+Node                  SN                   Model                                    Namespace Usage                      Format           FW Rev
+--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
+/dev/nvme0n1          S666NN0TB11877       SAMSUNG MZ1L21T9HCLS-00A07               1           8.57  GB /   1.92  TB    512   B +  0 B   GDC7302Q
+```
+
+There should be no namespaces on the Kioxia drives:
+
+```console
+[root@ee50:~]# nvme list | grep -i kioxia
+[root@ee50:~]#
+```
+
+If there are namespaces listed, and there weren't any `NnfNodeBlockStorages` on the node, then they need to be deleted through the Rabbit software. The `NnfNodeECData` resource is a persistent data store for the allocations that should exist on the Rabbit. Deleting it, and then deleting the `nnf-node-manager` pod, causes `nnf-node-manager` to delete the orphaned namespaces. This can take a few minutes after the pod is deleted:
+
+```console
+kubectl delete nnfnodeecdata ec-data -n ee50
+kubectl delete pod -n nnf-system nnf-node-manager-jhglm
+```
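+
+Once the replacement `nnf-node-manager` pod is running, repeat the checks from above to confirm that the cleanup worked; the Kioxia drives should once again show no namespaces:
+
+```console
+[root@ee50:~]# nvme list | grep -i kioxia
+[root@ee50:~]#
+```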
From 5ed7e33c913e2454b482d07a01b3f0c99b32a86e Mon Sep 17 00:00:00 2001
From: Dean Roehrich
Date: Wed, 10 Jul 2024 14:26:12 -0500
Subject: [PATCH 2/4] remove breadcrumb

Signed-off-by: Dean Roehrich
---
 docs/guides/index.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/docs/guides/index.md b/docs/guides/index.md
index b074aa2..5d0a83b 100644
--- a/docs/guides/index.md
+++ b/docs/guides/index.md
@@ -16,7 +16,6 @@
 * [Lustre External MGT](external-mgs/readme.md)
 * [Global Lustre](global-lustre/readme.md)
 * [Directive Breakdown](directive-breakdown/readme.md)
-* [NVMe Namespaces](nvme-namespaces/readme.md)
 
 ## NNF User Containers
 

From b0757f21be9dd1ce1b03bd0c58dd7e63ba9729b1 Mon Sep 17 00:00:00 2001
From: Dean Roehrich
Date: Wed, 10 Jul 2024 14:29:58 -0500
Subject: [PATCH 3/4] add top index

Signed-off-by: Dean Roehrich
---
 mkdocs.yml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mkdocs.yml b/mkdocs.yml
index 2b75c8f..208575a 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -20,6 +20,7 @@ nav:
       - 'Lustre External MGT': 'guides/external-mgs/readme.md'
       - 'Global Lustre': 'guides/global-lustre/readme.md'
       - 'Draining A Node': 'guides/node-management/drain.md'
+      - 'Debugging NVMe Namespaces': 'guides/node-management/nvme-namespaces.md'
       - 'Directive Breakdown': 'guides/directive-breakdown/readme.md'
   - 'RFCs':
       - rfcs/index.md

From e81d93760ec2ec31f5b63ee995ffee49e30bb54d Mon Sep 17 00:00:00 2001
From: Dean Roehrich
Date: Wed, 10 Jul 2024 15:06:03 -0500
Subject: [PATCH 4/4] Update docs/guides/node-management/nvme-namespaces.md

Co-authored-by: Blake Devcich <89158881+bdevcich@users.noreply.github.com>
---
 docs/guides/node-management/nvme-namespaces.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/guides/node-management/nvme-namespaces.md b/docs/guides/node-management/nvme-namespaces.md
index 6f23b5f..ed35274 100644
--- a/docs/guides/node-management/nvme-namespaces.md
+++ b/docs/guides/node-management/nvme-namespaces.md
@@ -4,7 +4,7 @@
 
 Find the total space available, and the total space used, on a Rabbit node using the Redfish API. One way to access the API is to use the `nnf-node-manager` pod on that node.
 
-To view the space on node ee50, find its nnf-node-manager pod and then exec into it to query the Redfish API:
+To view the space on node ee50, find its `nnf-node-manager` pod and then exec into it to query the Redfish API:
 
 ```console
 [richerso@ee1:~]$ kubectl get pods -A -o wide | grep ee50 | grep node-manager