Scenario: A workflow is stuck in the PreRun state where it never comes ready.

Rabbit: tioga102
Compute: tioga39
Looking at the nnf-system_nnf-node-manager log for the time in question, we see the filesystem successfully created, but there is never an attempt to attach the namespaces to tioga39. A ClientMount resource is created for tioga39, however, which would indicate that the NnfNodeBlockStorage controller believed it had attached the namespaces to the compute node. There were no failures in the log to indicate that it had attempted to map anything and failed.
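For anyone retracing this triage, the resources mentioned above can be inspected directly. The API groups, resource plurals, namespaces, and label selector below are assumptions about how this deployment lays things out (ClientMounts in a namespace named after the compute node, NnfNodeBlockStorages in the Rabbit's namespace), not verified values:

```sh
# ClientMount created for the compute node (namespace assumed to match the node name)
kubectl get clientmounts.dataworkflowservices.github.io -n tioga39 -o yaml

# Block-storage state recorded by the controller on the Rabbit
# (resource plural and namespace are assumptions)
kubectl get nnfnodeblockstorages -n tioga102 -o yaml

# Controller log for the window in question (label selector is an assumption)
kubectl logs -n nnf-system -l app=nnf-node-manager --since=2h
```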
Digging into the configuration a bit deeper, we see that the PCIe status of the link to tioga39 was actually offline while this workflow was stuck.
# /admin/scripts/nnf/switch.sh status
Execute switch status on /dev/switchtec0
DEVICE: /dev/switchtec0 PAX_ID: 1
Switch Connection Status
=========================== ======
Interswitch Link UP
Drive Slot 4 UP
Drive Slot 5 UP
Drive Slot 6 UP
Drive Slot 2 UP
Drive Slot 1 DOWN
Drive Slot 9 UP
Drive Slot 10 UP
Drive Slot 11 UP
Drive Slot 3 UP
Rabbit, x9000c?j7b0 UP
Compute 8, x9000c?s4b0n0 DOWN
Compute 9, x9000c?s4b1n0 DOWN
Compute 10, x9000c?s5b0n0 DOWN
Compute 11, x9000c?s5b1n0 DOWN
Compute 12, x9000c?s6b0n0 UP
Compute 13, x9000c?s6b1n0 DOWN <<<< tioga39
Compute 14, x9000c?s7b0n0 DOWN
Compute 15, x9000c?s7b1n0 DOWN
Execute switch status on /dev/switchtec1
DEVICE: /dev/switchtec1 PAX_ID: 0
Switch Connection Status
=========================== ======
Interswitch Link UP
Drive Slot 8 UP
Drive Slot 7 UP
Drive Slot 15 UP
Drive Slot 16 UP
Drive Slot 17 UP
Drive Slot 18 UP
Drive Slot 14 UP
Drive Slot 13 DOWN
Drive Slot 12 UP
Rabbit, x9000c?j7b0 UP
Compute 0, x9000c?s0b0n0 DOWN
Compute 1, x9000c?s0b1n0 UP
Compute 2, x9000c?s1b0n0 UP
Compute 3, x9000c?s1b1n0 UP
Compute 4, x9000c?s2b0n0 UP
Compute 5, x9000c?s2b1n0 UP
Compute 6, x9000c?s3b0n0 UP
Compute 7, x9000c?s3b1n0 UP
$ kubectl get storages.dataworkflowservices.github.io tioga102 -o yaml
apiVersion: dataworkflowservices.github.io/v1alpha2
kind: Storage
metadata:
  creationTimestamp: "2024-03-12T23:56:54Z"
  generation: 1
  labels:
    dataworkflowservices.github.io/storage: Rabbit
  name: tioga102
  namespace: default
  resourceVersion: "63456188"
  uid: 4552b44e-1e68-416a-9576-d8597d20f4d0
spec:
  state: Enabled
status:
  access:
    computes:
    - name: tioga26
      status: Offline
    - name: tioga27
      status: Ready
    - name: tioga28
      status: Ready
    - name: tioga29
      status: Ready
    - name: tioga30
      status: Ready
    - name: tioga31
      status: Ready
    - name: tioga32
      status: Ready
    - name: tioga33
      status: Ready
    - name: tioga34
      status: Offline
    - name: tioga35
      status: Offline
    - name: tioga36
      status: Offline
    - name: tioga37
      status: Offline
    - name: tioga38
      status: Ready
    - name: tioga39               <<<<<<<<<<<<<<< tioga39
      status: Offline             <<<<<<<<<<<<<<< Offline
    - name: tioga40
      status: Offline
    - name: tioga41
      status: Offline
    protocol: PCIe
    servers:
    - name: tioga102
      status: Ready
  capacity: 2965239273881
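As an aside, the per-compute status can be pulled out of the Storage resource without paging through the full YAML; for example, assuming jq is available:

```sh
# Name and access status of every compute attached to rabbit tioga102
kubectl get storages.dataworkflowservices.github.io tioga102 -o json \
  | jq -r '.status.access.computes[] | "\(.name)\t\(.status)"'

# Just the computes that are not Ready
kubectl get storages.dataworkflowservices.github.io tioga102 -o json \
  | jq -r '.status.access.computes[] | select(.status != "Ready") | .name'
```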
We confirmed with lspci -PP | grep KIO that tioga39 had no PCI connections to the drives.
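For reference, the same check on the compute node can be scripted; the match string follows the command above, and nvme-cli is assumed to be installed:

```sh
# Count KIOXIA endpoints visible over PCIe on tioga39; zero means the
# link to the Rabbit's drives is down
lspci -PP | grep -c KIO

# When healthy, the attached namespaces should also show up as NVMe devices
nvme list
```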
We decided to reboot tioga39 to see if the PCI connections could be restored. Sure enough, after tioga39 rebooted, the link status was restored. Good news!
However, the workflow still stayed in PreRun with Ready==false. Looking at the storages.dataworkflowservices.github.io tioga102 resource, tioga39's status had not changed. We restarted the nnf-node-manager pod on tioga102, which caused the storages.dataworkflowservices.github.io tioga102 resource to be updated. Once we did that, the workflow successfully completed the PreRun state and proceeded all the way through Teardown.
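For the record, "restarting the pod" amounted to deleting it so that its controller recreated it and the fresh instance re-read the PCIe state. Roughly the following, with the nnf-system namespace and pod naming being assumptions:

```sh
# Find the nnf-node-manager pod running on rabbit tioga102
kubectl get pods -n nnf-system -o wide | grep node-manager | grep tioga102

# Delete it; the owning DaemonSet/Deployment recreates it, and the new pod
# rescans the hardware and updates the Storage resource
kubectl delete pod -n nnf-system <nnf-node-manager-pod-on-tioga102>
```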
Issues:
Why didn't the nnf-node-manager log show anything when it either attempted to attach the namespaces to tioga39 and failed, or skipped tioga39 entirely because it was offline?
Why wasn't the storages.dataworkflowservices.github.io tioga102 resource updated when tioga39 rebooted and its PCIe link was restored? (Until then, a manual cross-check like the sketch below can flag the mismatch.)
Why did flux allow a job to run using tioga39 when that compute resource was offline?
Should the workflow sit there waiting for a compute node, or should it fail if it is assigned a compute node that is offline at the time it attempts the PreRun stage?
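Until the above are resolved, comparing the live switch state against the Storage resource can catch the stale/offline condition before (or while) a job lands on the node. A minimal sketch, assuming the paths and names used earlier in this report:

```sh
#!/bin/bash
# Compare the PCIe link state reported by the switch with the status recorded
# in the Storage resource for one compute. Run on the Rabbit node.
RABBIT=tioga102
COMPUTE=tioga39
XNAME_FRAGMENT="s6b1n0"   # tioga39's slot in the switch.sh output above

link=$(/admin/scripts/nnf/switch.sh status | awk -v p="$XNAME_FRAGMENT" '$0 ~ p {print $NF}')
k8s=$(kubectl get storages.dataworkflowservices.github.io "$RABBIT" -o json \
        | jq -r --arg c "$COMPUTE" '.status.access.computes[] | select(.name == $c) | .status')

echo "switch link: $link    Storage resource: $k8s"
if [ "$link" = "UP" ] && [ "$k8s" != "Ready" ]; then
    echo "WARNING: Storage resource looks stale for $COMPUTE"
fi
```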