NVMe namespace management #179

Merged 4 commits on Jul 10, 2024
1 change: 1 addition & 0 deletions docs/guides/index.md
@@ -24,3 +24,4 @@
## Node Management

* [Draining A Node](node-management/drain.md)
* [Debugging NVMe Namespaces](node-management/nvme-namespaces.md)
73 changes: 73 additions & 0 deletions docs/guides/node-management/nvme-namespaces.md
@@ -0,0 +1,73 @@
# Debugging NVMe Namespaces

## Total Space Available or Used

Find the total space available and the total space used on a Rabbit node using the Redfish API. One way to access the API is through the `nnf-node-manager` pod on that node.

To view the space on node ee50, find its `nnf-node-manager` pod and then exec into it to query the Redfish API:

```console
[richerso@ee1:~]$ kubectl get pods -A -o wide | grep ee50 | grep node-manager
nnf-system nnf-node-manager-jhglm 1/1 Running 0 61m 10.85.71.11 ee50 <none> <none>
```

Then query the Redfish API to view the `AllocatedBytes` and `GuaranteedBytes`:

```console
[richerso@ee1:~]$ kubectl exec --stdin --tty -n nnf-system nnf-node-manager-jhglm -- curl -S localhost:50057/redfish/v1/StorageServices/NNF/CapacitySource | jq
{
"@odata.id": "/redfish/v1/StorageServices/NNF/CapacitySource",
"@odata.type": "#CapacitySource.v1_0_0.CapacitySource",
"Id": "0",
"Name": "Capacity Source",
"ProvidedCapacity": {
"Data": {
"AllocatedBytes": 128849888,
"ConsumedBytes": 128849888,
"GuaranteedBytes": 307132496928,
"ProvisionedBytes": 307261342816
},
"Metadata": {},
"Snapshot": {}
},
"ProvidedClassOfService": {},
"ProvidingDrives": {},
"ProvidingPools": {},
"ProvidingVolumes": {},
"Actions": {},
"ProvidingMemory": {},
"ProvidingMemoryChunks": {}
}
```
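The used and free byte counts can be pulled straight out of this response with `jq`. A minimal sketch, assuming `jq` is available wherever you run `kubectl`; `summarize_capacity` is a hypothetical helper name, not part of the Rabbit software:

```shell
# Hypothetical helper: read the CapacitySource JSON on stdin and print the
# used and free byte counts (requires jq).
summarize_capacity() {
  jq -r '.ProvidedCapacity.Data
           | "used: \(.ConsumedBytes) bytes, free: \(.GuaranteedBytes) bytes"'
}

# Usage against a live node (pod name taken from the query above):
#   kubectl exec -n nnf-system nnf-node-manager-jhglm -- \
#     curl -s localhost:50057/redfish/v1/StorageServices/NNF/CapacitySource \
#     | summarize_capacity
```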

## Total Orphaned or Leaked Space

To determine the amount of orphaned space, look at the Rabbit node when there are no allocations on it. If there are no allocations, there should be no `NnfNodeBlockStorages` in the k8s namespace with the Rabbit's name:

```console
[richerso@ee1:~]$ kubectl get nnfnodeblockstorage -n ee50
No resources found in ee50 namespace.
```
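This check can be wrapped in a small script for repeated use. A sketch, assuming cluster access via `kubectl`; `check_node_clean` is a hypothetical helper that, as above, treats the Rabbit node's name as its Kubernetes namespace:

```shell
# Hypothetical helper: count NnfNodeBlockStorage resources in a Rabbit node's
# namespace and report whether the node is free of allocations.
check_node_clean() {
  node="$1"
  # grep -c . counts non-empty lines, so an empty listing yields 0.
  count=$(kubectl get nnfnodeblockstorage -n "$node" --no-headers 2>/dev/null | grep -c .)
  if [ "$count" -eq 0 ]; then
    echo "$node: no NnfNodeBlockStorages"
  else
    echo "$node: $count NnfNodeBlockStorage resource(s) remain"
  fi
}
```

For example, `check_node_clean ee50` would report whether node ee50 still has block storage allocations.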

To check that there are no orphaned namespaces, use the `nvme` command while logged into that Rabbit node:

```console
[root@ee50:~]# nvme list
Node SN Model Namespace Usage Format FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 S666NN0TB11877 SAMSUNG MZ1L21T9HCLS-00A07 1 8.57 GB / 1.92 TB 512 B + 0 B GDC7302Q
```

There should be no namespaces on the Kioxia drives:

```console
[root@ee50:~]# nvme list | grep -i kioxia
[root@ee50:~]#
```
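The same check can be made into a reusable guard that runs on the Rabbit node itself. A sketch, assuming the `nvme` CLI shown above is installed; `check_kioxia_clean` is a hypothetical name:

```shell
# Hypothetical guard: succeed only if no KIOXIA namespaces are listed.
check_kioxia_clean() {
  if nvme list 2>/dev/null | grep -qi kioxia; then
    echo "orphaned KIOXIA namespaces found"
    return 1
  fi
  echo "no KIOXIA namespaces"
}
```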

If there are namespaces listed, but there weren't any `NnfNodeBlockStorages` on the node, then they need to be deleted through the Rabbit software. The `NnfNodeECData` resource is a persistent data store for the allocations that should exist on the Rabbit. Deleting it, and then deleting the `nnf-node-manager` pod, causes `nnf-node-manager` to delete the orphaned namespaces. Cleanup can take a few minutes after the pod is deleted:

```console
kubectl delete nnfnodeecdata ec-data -n ee50
kubectl delete pod -n nnf-system nnf-node-manager-jhglm
```
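Because the cleanup is asynchronous, a polling loop on the Rabbit node can confirm when the namespaces are actually gone. A sketch; `wait_for_cleanup` is a hypothetical helper assuming the `nvme` CLI:

```shell
# Hypothetical helper: poll until nvme-cli no longer lists KIOXIA namespaces,
# checking every 10 seconds up to a given number of tries (default 30).
wait_for_cleanup() {
  tries="${1:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if ! nvme list 2>/dev/null | grep -qi kioxia; then
      echo "cleanup complete"
      return 0
    fi
    sleep 10
    i=$((i + 1))
  done
  echo "namespaces still present after $tries checks"
  return 1
}
```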
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -20,6 +20,7 @@ nav:
- 'Lustre External MGT': 'guides/external-mgs/readme.md'
- 'Global Lustre': 'guides/global-lustre/readme.md'
- 'Draining A Node': 'guides/node-management/drain.md'
- 'Debugging NVMe Namespaces': 'guides/node-management/nvme-namespaces.md'
- 'Directive Breakdown': 'guides/directive-breakdown/readme.md'
- 'RFCs':
- rfcs/index.md