Workflow stuck in PreRun #145

Open

ajfloeder opened this issue Mar 19, 2024 · 0 comments
Scenario:

  1. Create a workflow in Flux.
  2. Flux progresses the workflow to the PreRun state, where it never becomes ready.

Rabbit: tioga102
Compute: tioga39

$ grep tioga39 /etc/coral2/xhost_mapping 
tioga39 x1000c1s6b1n0

Looking at the nnf-system_nnf-node-manager log for the time in question, we see the filesystem successfully created, but there is never an attempt to attach the namespaces to tioga39. A ClientMount resource is created for tioga39, however. This suggests that the NnfNodeBlockStorage controller believed it had attached the namespaces to the compute node. There were no failures in the log to indicate that it had attempted to map the namespaces and failed.
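
For anyone reproducing this triage, one way to cross-check what the NnfNodeBlockStorage controller recorded is to inspect the resources directly. A sketch; the lowercase plural names follow standard kubectl conventions for the kinds named above, but the namespace placement (Rabbit name for the block storage, compute name for the mount) is an assumption about this deployment:

$ kubectl get nnfnodeblockstorages -n tioga102 -o yaml   # per-allocation block storage status on the Rabbit
$ kubectl get clientmounts -n tioga39 -o yaml            # the mount request created for the compute node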

Digging into the configuration a bit deeper, we see that the PCIe status of the link to tioga39 was actually offline while this workflow was stuck.

# /admin/scripts/nnf/switch.sh status
Execute switch status on /dev/switchtec0
DEVICE: /dev/switchtec0 PAX_ID: 1

Switch Connection        	Status
===========================	======
Interswitch Link           	UP
Drive Slot 4               	UP
Drive Slot 5               	UP
Drive Slot 6               	UP
Drive Slot 2               	UP
Drive Slot 1               	DOWN
Drive Slot 9               	UP
Drive Slot 10              	UP
Drive Slot 11              	UP
Drive Slot 3               	UP
Rabbit,       x9000c?j7b0	UP
Compute 8,    x9000c?s4b0n0	DOWN
Compute 9,    x9000c?s4b1n0	DOWN
Compute 10,   x9000c?s5b0n0	DOWN
Compute 11,   x9000c?s5b1n0	DOWN
Compute 12,   x9000c?s6b0n0	UP
Compute 13,   x9000c?s6b1n0	DOWN   <<<< tioga39
Compute 14,   x9000c?s7b0n0	DOWN
Compute 15,   x9000c?s7b1n0	DOWN

Execute switch status on /dev/switchtec1
DEVICE: /dev/switchtec1 PAX_ID: 0

Switch Connection        	Status
===========================	======
Interswitch Link           	UP
Drive Slot 8               	UP
Drive Slot 7               	UP
Drive Slot 15              	UP
Drive Slot 16              	UP
Drive Slot 17              	UP
Drive Slot 18              	UP
Drive Slot 14              	UP
Drive Slot 13              	DOWN
Drive Slot 12              	UP
Rabbit,       x9000c?j7b0	UP
Compute 0,    x9000c?s0b0n0	DOWN
Compute 1,    x9000c?s0b1n0	UP
Compute 2,    x9000c?s1b0n0	UP
Compute 3,    x9000c?s1b1n0	UP
Compute 4,    x9000c?s2b0n0	UP
Compute 5,    x9000c?s2b1n0	UP
Compute 6,    x9000c?s3b0n0	UP
Compute 7,    x9000c?s3b1n0	UP


$ kubectl get storages.dataworkflowservices.github.io tioga102 -o yaml
apiVersion: dataworkflowservices.github.io/v1alpha2
kind: Storage
metadata:
  creationTimestamp: "2024-03-12T23:56:54Z"
  generation: 1
  labels:
    dataworkflowservices.github.io/storage: Rabbit
  name: tioga102
  namespace: default
  resourceVersion: "63456188"
  uid: 4552b44e-1e68-416a-9576-d8597d20f4d0
spec:
  state: Enabled
status:
  access:
    computes:
    - name: tioga26
      status: Offline
    - name: tioga27
      status: Ready
    - name: tioga28
      status: Ready
    - name: tioga29
      status: Ready
    - name: tioga30
      status: Ready
    - name: tioga31
      status: Ready
    - name: tioga32
      status: Ready
    - name: tioga33
      status: Ready
    - name: tioga34
      status: Offline
    - name: tioga35
      status: Offline
    - name: tioga36
      status: Offline
    - name: tioga37
      status: Offline
    - name: tioga38
      status: Ready
    - name: tioga39     <<<<<<<<<<<<<<< tioga39
      status: Offline   <<<<<<<<<<<<<<< Offline
    - name: tioga40
      status: Offline
    - name: tioga41
      status: Offline
    protocol: PCIe
    servers:
    - name: tioga102
      status: Ready
  capacity: 2965239273881

We confirmed with lspci -PP | grep KIO that tioga39 had no PCI connections to the drives.
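
As a sketch of that check (run on the compute node; with the link down, no KIOXIA drive functions are enumerated, so the grep prints nothing):

[tioga39]$ lspci -PP | grep KIO
[tioga39]$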

We decided to reboot tioga39 to see if the PCI connections could be restored. Sure enough, after tioga39 rebooted, the link status was restored. Good news!

However, the workflow still stayed in PreRun with Ready==false. Looking at the storages.dataworkflowservices.github.io tioga102 resource, tioga39's status had not changed. We restarted the nnf-node-manager pod on tioga102, which caused the storages.dataworkflowservices.github.io tioga102 resource to be updated. Once we did that, the workflow successfully completed the PreRun state and proceeded all the way through Teardown.
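
For reference, a sketch of that restart (the namespace and label selector here are assumptions; match them to the labels on the actual nnf-node-manager pods):

$ kubectl delete pod -n nnf-system \
    -l app.kubernetes.io/name=nnf-node-manager \
    --field-selector spec.nodeName=tioga102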

Issues:

  1. Why didn't the nnf-node-manager log show anything when it either attempted to attach the namespaces to tioga39 and failed, or skipped tioga39 because it was offline?
  2. Why wasn't the storages.dataworkflowservices.github.io tioga102 resource updated when tioga39 rebooted and its PCIe link was restored? (A way to watch for that update is sketched after this list.)
  3. Why did Flux allow a job to be run using tioga39 when that compute resource was offline?
  4. Should the workflow sit there waiting for a compute node, or fail if it is assigned a compute node that is offline when it reaches the PreRun stage?
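
Until issue 2 is resolved, a stopgap for operators is to watch the compute access states on the Storage resource and restart nnf-node-manager when they go stale. A minimal sketch, assuming the status layout shown in the YAML above:

$ kubectl get storages.dataworkflowservices.github.io tioga102 \
    -o jsonpath='{range .status.access.computes[*]}{.name}{"\t"}{.status}{"\n"}{end}'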