Flux picks computes that are attached to missing rabbits #159

bdevcich · 2024-05-09T16:02:33Z

Attempting to run a workflow on el cap:

[devcich1@elcap1:system-test]$ N=1 Q=iotesting make sanity
bats -j 1 --filter-tags tag:sanity .
./system-test.bats
✗ XFS
tags: tag:sanity tag:simple tag:xfs
(in test file ./system-test.bats, line 49)
`#DW jobdw type=xfs name=xfs capacity=50GB" ' failed
16.026s: job.exception type=dws-setup severity=0 DWS workflow interactions failed: 'XXX'
16.079s: job.exception type=prolog severity=0 prolog killed by signal 15 (timeout or job canceled)

1 test, 1 failure
In this case, I'm using system-test to create a simple xfs workflow, but the behavior is the same for any filesystem type. It's effectively running: flux run -l -N${N} --wait-event=clean -q iotesting --setattr=dw="#DW jobdw type=xfs name=xfs capacity=50GB"

The workflow goes from Proposal directly to Teardown. I do not believe the workflow itself throws an error; I watched changes to the workflow with:

kubectl get workflows -w -A -oyaml | grep -i error
No errors appear in the output.

Tracing the compute node to its rabbit node shows that the rabbit node is not in the cluster:

[devcich1@elcap1:~]$ kubectl get node XXX
Error from server (NotFound): nodes "XXX" not found

[devcich1@elcap1:~]$ kubectl get nnfnodes -n elcapXXX
No resources found in elcapXXX namespace.
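
The manual check above could be repeated over several rabbits with a small helper. This is only a sketch: the function name `missing_rabbits` is hypothetical, and it assumes each rabbit's name is also its Kubernetes node name, as in the `kubectl get node` call above.

```shell
# Hypothetical helper (not part of system-test): print every rabbit name
# that "kubectl get node" cannot find in the cluster.
missing_rabbits() {
    for _rabbit in "$@"; do
        # kubectl exits non-zero when the node is NotFound
        if ! kubectl get node "$_rabbit" >/dev/null 2>&1; then
            printf '%s\n' "$_rabbit"
        fi
    done
}
```

Calling `missing_rabbits` with a list of rabbit names prints only the ones that are absent, so empty output means every listed rabbit is present.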

jameshcorbett · 2024-05-09T16:03:41Z

There's already a fix for this (I'm fairly sure), but the rest of the flux team doesn't want me to put it in place yet on elcap, because they're working on sorting out some other issues. One thing you can do to avoid it (inconvenient, I know, sorry) is to force flux to choose specific compute nodes with flux run --requires=hosts:elcap[12-15] or similar.
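
A minimal sketch of that workaround as a wrapper, so the host constraint is not forgotten on each invocation. The function name `flux_run_pinned` is hypothetical, and the host range is just the example from the comment above, not a verified list of computes with working rabbits.

```shell
# Hypothetical wrapper: run a job only on the given hosts by passing
# --requires=hosts:<hostlist> to flux run. The first argument is the
# hostlist; every remaining argument is forwarded unchanged.
flux_run_pinned() {
    _hosts=$1
    shift
    flux run --requires="hosts:${_hosts}" "$@"
}
```

For example, `flux_run_pinned 'elcap[12-15]' -N1 -q iotesting hostname` would restrict the job to that host range. Quote the hostlist so the shell does not treat the brackets as a glob pattern.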

ajfloeder · 2024-09-19T16:28:28Z

@jameshcorbett Has this change been rolled out?

jameshcorbett · 2024-09-20T23:47:06Z

Not yet, fingers crossed in a couple of weeks though. The issues the other developers were working on have been resolved, but basically the fix involves a tragically (or comically, if you prefer) large file and associated memory usage that I have to reduce before we can put it in place.

jameshcorbett · 2024-09-21T04:41:25Z

A closely related Flux issue (although it's not obvious how it's related): flux-framework/flux-sched#1255

github-project-automation bot added this to Issues Dashboard May 9, 2024

github-project-automation bot moved this to 📋 Open in Issues Dashboard May 9, 2024
