diff --git a/docs/guides/index.md b/docs/guides/index.md
index e80105c..5d0a83b 100644
--- a/docs/guides/index.md
+++ b/docs/guides/index.md
@@ -24,3 +24,4 @@
 ## Node Management
 
 * [Draining A Node](node-management/drain.md)
+* [Debugging NVMe Namespaces](node-management/nvme-namespaces.md)
diff --git a/docs/guides/node-management/nvme-namespaces.md b/docs/guides/node-management/nvme-namespaces.md
new file mode 100644
index 0000000..ed35274
--- /dev/null
+++ b/docs/guides/node-management/nvme-namespaces.md
@@ -0,0 +1,85 @@
+# Debugging NVMe Namespaces
+
+## Total Space Available or Used
+
+Find the total space available and the total space used on a Rabbit node by querying the Redfish API. One way to access the API is through the `nnf-node-manager` pod on that node.
+
+To view the space on node ee50, first find its `nnf-node-manager` pod:
+
+```console
+[richerso@ee1:~]$ kubectl get pods -A -o wide | grep ee50 | grep node-manager
+nnf-system   nnf-node-manager-jhglm   1/1   Running   0   61m   10.85.71.11   ee50
+```
+
+Then exec into the pod and query the Redfish API to view the `AllocatedBytes` and `GuaranteedBytes`:
+
+```console
+[richerso@ee1:~]$ kubectl exec --stdin --tty -n nnf-system nnf-node-manager-jhglm -- curl -S localhost:50057/redfish/v1/StorageServices/NNF/CapacitySource | jq
+{
+  "@odata.id": "/redfish/v1/StorageServices/NNF/CapacitySource",
+  "@odata.type": "#CapacitySource.v1_0_0.CapacitySource",
+  "Id": "0",
+  "Name": "Capacity Source",
+  "ProvidedCapacity": {
+    "Data": {
+      "AllocatedBytes": 128849888,
+      "ConsumedBytes": 128849888,
+      "GuaranteedBytes": 307132496928,
+      "ProvisionedBytes": 307261342816
+    },
+    "Metadata": {},
+    "Snapshot": {}
+  },
+  "ProvidedClassOfService": {},
+  "ProvidingDrives": {},
+  "ProvidingPools": {},
+  "ProvidingVolumes": {},
+  "Actions": {},
+  "ProvidingMemory": {},
+  "ProvidingMemoryChunks": {}
+}
+```
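+
+To pull out just these two numbers, the same query can be piped through a `jq` filter. This is a sketch based on the response shown above, treating `AllocatedBytes` as the space in use and `GuaranteedBytes` as the space still available for new allocations:
+
+```console
+[richerso@ee1:~]$ kubectl exec --stdin --tty -n nnf-system nnf-node-manager-jhglm -- curl -S localhost:50057/redfish/v1/StorageServices/NNF/CapacitySource | jq '.ProvidedCapacity.Data | {used: .AllocatedBytes, available: .GuaranteedBytes}'
+{
+  "used": 128849888,
+  "available": 307132496928
+}
+```
+
+The `used` and `available` names are only labels assigned by the filter; `ConsumedBytes` and `ProvisionedBytes` are available in the same `Data` object if they are needed as well.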
+
+## Total Orphaned or Leaked Space
+
+To determine the amount of orphaned space, look at the Rabbit node when there are no allocations on it. If there are no allocations, then there should be no `NnfNodeBlockStorage` resources in the Kubernetes namespace with the Rabbit's name:
+
+```console
+[richerso@ee1:~]$ kubectl get nnfnodeblockstorage -n ee50
+No resources found in ee50 namespace.
+```
+
+To check that there are no orphaned namespaces, use the `nvme` command while logged into that Rabbit node:
+
+```console
+[root@ee50:~]# nvme list
+Node                  SN                   Model                                    Namespace Usage                      Format           FW Rev
+--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
+/dev/nvme0n1          S666NN0TB11877       SAMSUNG MZ1L21T9HCLS-00A07               1           8.57  GB /   1.92  TB    512   B +  0 B   GDC7302Q
+```
+
+There should be no namespaces on the Kioxia drives:
+
+```console
+[root@ee50:~]# nvme list | grep -i kioxia
+[root@ee50:~]#
+```
+
+If namespaces are listed even though there were no `NnfNodeBlockStorages` on the node, then they must be deleted through the Rabbit software. The `NnfNodeECData` resource is a persistent data store for the allocations that should exist on the Rabbit. Deleting it, and then deleting the `nnf-node-manager` pod, causes `nnf-node-manager` to remove the orphaned namespaces. This can take a few minutes after the pod is deleted:
+
+```console
+kubectl delete nnfnodeecdata ec-data -n ee50
+kubectl delete pod -n nnf-system nnf-node-manager-jhglm
+```
diff --git a/mkdocs.yml b/mkdocs.yml
index 2b75c8f..208575a 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -20,6 +20,7 @@ nav:
     - 'Lustre External MGT': 'guides/external-mgs/readme.md'
     - 'Global Lustre': 'guides/global-lustre/readme.md'
     - 'Draining A Node': 'guides/node-management/drain.md'
+    - 'Debugging NVMe Namespaces': 'guides/node-management/nvme-namespaces.md'
     - 'Directive Breakdown': 'guides/directive-breakdown/readme.md'
   - 'RFCs':
     - rfcs/index.md