Flux picks computes that are attached to missing rabbits #159
Comments
There's already a fix for this (I'm fairly sure), but the rest of the Flux team doesn't want me to put it in place yet on elcap, because they're working on sorting out some other issues. One thing you can do to avoid it (inconvenient, I know, sorry) is to force Flux to choose specific compute nodes with flux run --requires=hosts:elcap[12-15] or similar.
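For example, a minimal sketch of that workaround applied to the reproducer below (the host list and the trailing hostname command are placeholders; substitute computes whose rabbits are actually present):
# Pin the job to named compute nodes so Flux cannot pick ones attached to missing rabbits.
flux run -l -N1 --requires=hosts:elcap[12-15] -q iotesting --wait-event=clean \
    --setattr=dw="#DW jobdw type=xfs name=xfs capacity=50GB" hostname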
@jameshcorbett Has this change been rolled out?
Not yet, fingers crossed it will be in a couple of weeks though. The issues the other developers were working on have been resolved, but the fix involves a tragically (or comically, if you prefer) large file and associated memory usage that I have to reduce before we can put it in place.
A closely related Flux issue (although it's not obvious how it's related): flux-framework/flux-sched#1255
Attempting to run a workflow on elcap:
[devcich1@elcap1:system-test]$ N=1 Q=iotesting make sanity
bats -j 1 --filter-tags tag:sanity .
./system-test.bats
✗ XFS
tags: tag:sanity tag:simple tag:xfs
(in test file ./system-test.bats, line 49)
`#DW jobdw type=xfs name=xfs capacity=50GB" ' failed
16.026s: job.exception type=dws-setup severity=0 DWS workflow interactions failed: 'XXX'
16.079s: job.exception type=prolog severity=0 prolog killed by signal 15 (timeout or job canceled)
1 test, 1 failure
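After a failure like this, the exception events shown above can also be pulled directly from the job's eventlog (a sketch; <jobid> is a placeholder for the failed job id taken from flux jobs -a):
# Find the failed job id, then dump its eventlog and filter for exception events.
flux jobs -a
flux job eventlog <jobid> | grep exception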
In this case, I'm using system-test to create a simple xfs workflow, but this behavior is the same for any filesystem type. It's effectively running:
flux run -l -N${N} --wait-event=clean -q iotesting --setattr=dw="#DW jobdw type=xfs name=xfs capacity=50GB"
The workflow goes from Proposal directly to Teardown. I do not believe the workflow itself reports an error; watching changes to the workflow with:
kubectl get workflows -w -A -oyaml | grep -i error
shows no errors in the output.
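A more direct way to watch the state transitions themselves is sketched below (an assumption that the standard DWS Workflow fields spec.desiredState, status.state, and status.status are present):
# Watch desired vs. actual state for all workflows; the failing run should show
# the state jumping from Proposal to Teardown without passing through Setup.
kubectl get workflows -A -w \
    -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,DESIRED:.spec.desiredState,STATE:.status.state,STATUS:.status.status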
Tracing from the compute node to its rabbit node shows that the rabbit node is not in the cluster:
[devcich1@elcap1:~]$ kubectl get node XXX
Error from server (NotFound): nodes "XXX" not found
[devcich1@elcap1:~]$ kubectl get nnfnodes -n elcapXXX
No resources found in elcapXXX namespace.
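For reference, one way to find which rabbit a given compute is attached to is to look it up in the DWS SystemConfiguration (a sketch only; the resource name, namespace, and the storageNodes/computesAccess layout are assumptions based on a default DWS install and may differ on elcap):
# List each storage (rabbit) node and the computes attached to it, then search
# the surrounding context for the compute of interest (<compute-node> is a placeholder).
kubectl get systemconfiguration default -n default -o yaml | grep -B 20 'name: <compute-node>'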