Flux picks computes that are attached to missing rabbits #159

Open
bdevcich opened this issue May 9, 2024 · 4 comments

Comments

@bdevcich
Contributor

bdevcich commented May 9, 2024

Attempting to run a workflow on el cap:

[devcich1@elcap1:system-test]$ N=1 Q=iotesting make sanity
bats -j 1 --filter-tags tag:sanity .
./system-test.bats
✗ XFS
tags: tag:sanity tag:simple tag:xfs
(in test file ./system-test.bats, line 49)
`#DW jobdw type=xfs name=xfs capacity=50GB" ' failed
16.026s: job.exception type=dws-setup severity=0 DWS workflow interactions failed: 'XXX'
16.079s: job.exception type=prolog severity=0 prolog killed by signal 15 (timeout or job canceled)

1 test, 1 failure

In this case, I'm using system test to create a simple xfs workflow, but this behavior is the same for any filesystem type. It's effectively running:

flux run -l -N${N} --wait-event=clean -q iotesting --setattr=dw="#DW jobdw type=xfs name=xfs capacity=50GB"

The workflow is going from Proposal directly to Teardown. I do not believe the workflow itself throws an error when watching changes to the workflow with:

kubectl get workflows -w -A -oyaml | grep -i error
No errors appear in the output.

Tracing the compute node to its rabbit node shows that the rabbit node is not in the cluster:

[devcich1@elcap1:~]$ kubectl get node XXX
Error from server (NotFound): nodes "XXX" not found

[devcich1@elcap1:~]$ kubectl get nnfnodes -n elcapXXX
No resources found in elcapXXX namespace.
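
The manual check above could be scripted along these lines. This is only a sketch: rabbit_present and missing_rabbits are hypothetical helper names, the KUBECTL override is an assumption added so the check can be dry-run, and the only real call used is kubectl get node NAME, which exits non-zero when the node is not found.

```shell
# Hypothetical helper: succeeds if Kubernetes knows the named rabbit node.
# KUBECTL may be overridden (assumption, handy for dry runs); default is kubectl.
rabbit_present() {
    "${KUBECTL:-kubectl}" get node "$1" >/dev/null 2>&1
}

# Hypothetical helper: print each rabbit name from the argument list
# that is NOT present in the cluster, one per line.
missing_rabbits() {
    for rabbit_name in "$@"; do
        rabbit_present "$rabbit_name" || echo "$rabbit_name"
    done
}
```

Calling missing_rabbits with the rabbit names of interest prints only the ones the cluster does not know about, which should match the NotFound case shown above.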

@jameshcorbett
Collaborator

There's already a fix for this (I'm fairly sure), but the rest of the flux team doesn't want me to put it in place yet on elcap, because they're working on sorting out some other issues. One thing you can do to avoid it (inconvenient, I know, sorry) is to force flux to choose specific compute nodes with flux run --requires=hosts:elcap[12-15] or similar.
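
For reference, the workaround can be combined with the command from the original report roughly like this. It is a sketch, not a tested recipe: run_pinned is a hypothetical wrapper name, the FLUX override is an assumption added so the command line can be printed instead of executed, and the hostlist must name compute nodes whose rabbit is actually present on your system. All flux run options are the ones already quoted in this thread.

```shell
# Hypothetical wrapper: run_pinned HOSTLIST NODES COMMAND...
# Pins the job to HOSTLIST via --requires=hosts:..., otherwise mirroring
# the flux run invocation from the original report.
# FLUX may be overridden (assumption), e.g. FLUX=echo to print the arguments.
run_pinned() {
    pinned_hosts=$1
    pinned_nodes=$2
    shift 2
    "${FLUX:-flux}" run -l -N"$pinned_nodes" --wait-event=clean -q iotesting \
        --requires="hosts:$pinned_hosts" \
        --setattr=dw="#DW jobdw type=xfs name=xfs capacity=50GB" \
        "$@"
}
```

For example, run_pinned 'elcap[12-15]' 1 hostname would submit a one-node job restricted to those hosts; hostname here is just a placeholder job command. Quote the hostlist so the shell does not treat the brackets as a glob.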

@ajfloeder
Contributor

@jameshcorbett Has this change been rolled out?

@jameshcorbett
Collaborator

Not yet, fingers crossed in a couple of weeks though. The issues the other developers were working on have been resolved, but basically the fix involves a tragically (or comically, if you prefer) large file and associated memory usage that I have to reduce before we can put it in place.

@jameshcorbett
Collaborator

A closely related Flux issue (although it's not obvious how it's related): flux-framework/flux-sched#1255
