NVMe namespace management #179

Merged 4 commits on Jul 10, 2024
1 change: 1 addition & 0 deletions docs/guides/index.md
@@ -24,3 +24,4 @@
## Node Management

* [Draining A Node](node-management/drain.md)
* [Debugging NVMe Namespaces](node-management/nvme-namespaces.md)
73 changes: 73 additions & 0 deletions docs/guides/node-management/nvme-namespaces.md
@@ -0,0 +1,73 @@
# Debugging NVMe Namespaces

## Total Space Available or Used

Find the total space available and the total space used on a Rabbit node using the Redfish API. One way to access the API is through the `nnf-node-manager` pod on that node.

To view the space on node ee50, find its `nnf-node-manager` pod and then exec into it to query the Redfish API:

```console
[richerso@ee1:~]$ kubectl get pods -A -o wide | grep ee50 | grep node-manager
nnf-system nnf-node-manager-jhglm 1/1 Running 0 61m 10.85.71.11 ee50 <none> <none>
```

Then query the Redfish API to view the `AllocatedBytes` and `GuaranteedBytes`:

```console
[richerso@ee1:~]$ kubectl exec --stdin --tty -n nnf-system nnf-node-manager-jhglm -- curl -S localhost:50057/redfish/v1/StorageServices/NNF/CapacitySource | jq
{
"@odata.id": "/redfish/v1/StorageServices/NNF/CapacitySource",
"@odata.type": "#CapacitySource.v1_0_0.CapacitySource",
"Id": "0",
"Name": "Capacity Source",
"ProvidedCapacity": {
"Data": {
"AllocatedBytes": 128849888,
"ConsumedBytes": 128849888,
"GuaranteedBytes": 307132496928,
"ProvisionedBytes": 307261342816
},
"Metadata": {},
"Snapshot": {}
},
"ProvidedClassOfService": {},
"ProvidingDrives": {},
"ProvidingPools": {},
"ProvidingVolumes": {},
"Actions": {},
"ProvidingMemory": {},
"ProvidingMemoryChunks": {}
}
```
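The used and free byte counts can be pulled straight out of this response with `jq`. A minimal sketch, assuming `jq` is available wherever you run `kubectl`; `summarize_capacity` is a hypothetical helper name, not part of the Rabbit software:

```shell
# Hypothetical helper: read the CapacitySource JSON on stdin and print the
# used and free byte counts (requires jq).
summarize_capacity() {
  jq -r '.ProvidedCapacity.Data
           | "used: \(.ConsumedBytes) bytes, free: \(.GuaranteedBytes) bytes"'
}

# Usage against a live node (pod name taken from the query above):
#   kubectl exec -n nnf-system nnf-node-manager-jhglm -- \
#     curl -s localhost:50057/redfish/v1/StorageServices/NNF/CapacitySource \
#     | summarize_capacity
```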

## Total Orphaned or Leaked Space

To determine the amount of orphaned space, look at the Rabbit node when there are no allocations on it. If there are no allocations, there should be no `NnfNodeBlockStorages` in the k8s namespace with the Rabbit's name:

```console
[richerso@ee1:~]$ kubectl get nnfnodeblockstorage -n ee50
No resources found in ee50 namespace.
```
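This check can be wrapped in a small script for repeated use. A sketch, assuming cluster access via `kubectl`; `check_node_clean` is a hypothetical helper that, as above, treats the Rabbit node's name as its Kubernetes namespace:

```shell
# Hypothetical helper: count NnfNodeBlockStorage resources in a Rabbit node's
# namespace and report whether the node is free of allocations.
check_node_clean() {
  node="$1"
  # grep -c . counts non-empty lines, so an empty listing yields 0.
  count=$(kubectl get nnfnodeblockstorage -n "$node" --no-headers 2>/dev/null | grep -c .)
  if [ "$count" -eq 0 ]; then
    echo "$node: no NnfNodeBlockStorages"
  else
    echo "$node: $count NnfNodeBlockStorage resource(s) remain"
  fi
}
```

For example, `check_node_clean ee50` would report whether node ee50 still has block storage allocations.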

To check that there are no orphaned namespaces, use the `nvme` command while logged into that Rabbit node:

```console
[root@ee50:~]# nvme list
Node SN Model Namespace Usage Format FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 S666NN0TB11877 SAMSUNG MZ1L21T9HCLS-00A07 1 8.57 GB / 1.92 TB 512 B + 0 B GDC7302Q
```

There should be no namespaces on the Kioxia drives:

```console
[root@ee50:~]# nvme list | grep -i kioxia
[root@ee50:~]#
```
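The same check can be made into a reusable guard that runs on the Rabbit node itself. A sketch, assuming the `nvme` CLI shown above is installed; `check_kioxia_clean` is a hypothetical name:

```shell
# Hypothetical guard: succeed only if no KIOXIA namespaces are listed.
check_kioxia_clean() {
  if nvme list 2>/dev/null | grep -qi kioxia; then
    echo "orphaned KIOXIA namespaces found"
    return 1
  fi
  echo "no KIOXIA namespaces"
}
```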

If there are namespaces listed, but there weren't any `NnfNodeBlockStorages` on the node, then they need to be deleted through the Rabbit software. The `NnfNodeECData` resource is a persistent data store for the allocations that should exist on the Rabbit. Deleting it, and then deleting the `nnf-node-manager` pod, causes `nnf-node-manager` to delete the orphaned namespaces. Cleanup can take a few minutes after the pod is deleted:

```console
kubectl delete nnfnodeecdata ec-data -n ee50
kubectl delete pod -n nnf-system nnf-node-manager-jhglm
```
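Because the cleanup is asynchronous, a polling loop on the Rabbit node can confirm when the namespaces are actually gone. A sketch; `wait_for_cleanup` is a hypothetical helper assuming the `nvme` CLI:

```shell
# Hypothetical helper: poll until nvme-cli no longer lists KIOXIA namespaces,
# checking every 10 seconds up to a given number of tries (default 30).
wait_for_cleanup() {
  tries="${1:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if ! nvme list 2>/dev/null | grep -qi kioxia; then
      echo "cleanup complete"
      return 0
    fi
    sleep 10
    i=$((i + 1))
  done
  echo "namespaces still present after $tries checks"
  return 1
}
```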
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -20,6 +20,7 @@ nav:
- 'Lustre External MGT': 'guides/external-mgs/readme.md'
- 'Global Lustre': 'guides/global-lustre/readme.md'
- 'Draining A Node': 'guides/node-management/drain.md'
- 'Debugging NVMe Namespaces': 'guides/node-management/nvme-namespaces.md'
- 'Directive Breakdown': 'guides/directive-breakdown/readme.md'
- 'RFCs':
- rfcs/index.md