Issue with determining number of valid nodes for num_tasks=0 #3216

Closed
lagerhardt opened this issue Jun 18, 2024 · 16 comments · Fixed by #3290

@lagerhardt

With 4.6.1, if you have a reservation and a test with num_tasks=0, the framework returns 0 nodes. I'm invoking ReFrame with reframe -vvvvvvv -r -R -c checks/microbenchmarks/dgemm -J reservation=checkout -n dgemm_cpu -C nersc-config.py, and here's what I see with the logging turned up:

[F] Flexible node allocation requested
[CMD] 'scontrol -a show -o nodes'
[F] Total available nodes: 443
[CMD] 'scontrol -a show res checkout'
[CMD] 'scontrol -a show -o Nodes=login[01-07],nid[001000-001023,001033,001036-001037,001040-001041,001044-001045,001048-001049,001052-001053,001064-001065,001068-001069,001072-001073,001076-001077,001080-001081,001084-001085,001088-001089,001092-001093,200001-200257,200260-200261,200264-200265,200268-200269,200272-200273,200276-200277,200280-200281,200284-200285,200288-200289,200292-200293,200296-200297,200300-200301,200304-200305,200308-200309,200312-200313,200316-200317,200320-200321,200324-200325,200328-200329,200332-200333,200336-200337,200340-200341,200344-200345,200348-200349,200352-200353,200356-200357,200360-200361,200364-200365,200368-200369,200372-200373,200376-200377,200380-200381,200384-200385,200388-200389,200392-200393,200396-200397,200400-200401,200404-200405,200408-200409,200412-200413,200416-200417,200420-200421,200424-200425,200428-200429,200432-200433,200436-200437,200440-200441,200444-200445,200448-200449,200452-200453,200456-200457,200460-200461,200464-200465,200468-200469,200472-200473,200476-200477,200480-200481,200484-200485,200488-200489,200492-200493,200496-200497,200500-200501,200504-200505,200508-200509]'
[S] slurm: [F] Filtering nodes by reservation checkout: available nodes now: 0

There are available nodes in the reservation, though not all of them are. Here's the list of node states:

1 State=DOWN+DRAIN+MAINTENANCE+RESERVED+NOT_RESPONDING
2 State=DOWN+DRAIN+MAINTENANCE+RESERVED
5 State=DOWN+DRAIN+RESERVED+NOT_RESPONDING
1 State=DOWN+MAINTENANCE+RESERVED+NOT_RESPONDING
2 State=DOWN+MAINTENANCE+RESERVED+NOT_RESPONDING
1 State=DOWN+RESERVED+NOT_RESPONDING
1 State=DOWN+RESERVED
1 State=DOWN+RESERVED
7 State=IDLE+DRAIN+MAINTENANCE+RESERVED
2 State=IDLE+MAINTENANCE+RESERVED
45 State=IDLE+RESERVED
1 State=IDLE+RESERVED
370 State=IDLE+RESERVED
1 State=IDLE+RESERVED
1 State=IDLE
1 State=MIXED+RESERVED

I can only get a non-zero number if I add --flex-alloc-nodes=IDLE+RESERVED; I still get zero with --flex-alloc-nodes=IDLE. My understanding was that asking for IDLE was supposed to match any of these states, but that doesn't seem to be the case. I suspect the in-place intersection between the two node sets, i.e.

nodes &= self._get_reservation_nodes(reservation)

might have something to do with it. From my logging it looks like the node set is already empty before it queries the nodes in the reservation.
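
To illustrate the set arithmetic I suspect is at play, here's a toy sketch (made-up node names and simplified structures, not ReFrame's actual code):

# Illustrative sketch only, not ReFrame code: if exact-state filtering has already
# dropped the IDLE+RESERVED nodes, the later intersection with the reservation is empty.
all_nodes = {
    'nid001004': {'IDLE', 'RESERVED'},
    'nid001005': {'IDLE', 'RESERVED'},
    'nid001040': {'ALLOCATED'},
}

# Exact match against IDLE keeps nothing: the real states also carry RESERVED
nodes = {name for name, states in all_nodes.items() if states == {'IDLE'}}

reservation_nodes = {'nid001004', 'nid001005'}  # hypothetical reservation members
nodes &= reservation_nodes                      # like `nodes &= self._get_reservation_nodes(...)`
print(len(nodes))                               # prints 0 -- "available nodes now: 0"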

@vkarak
Contributor

vkarak commented Jun 18, 2024

Hi @lagerhardt, this has changed in 4.6 (check the docs for the --distribute option). To revert to the original behaviour you should use --flex-alloc-policy=idle*. The rationale behind this change is that there was previously no way to select nodes that were exclusively in a given state, so ReFrame could end up requesting nodes in IDLE+DRAIN states, which it could never get.

@lagerhardt
Author

Ah, thanks. I missed that update.

So it looks like I can get roughly the behavior I want with the --flex-alloc-nodes=idle+reserved flag, but I see the note that multiple tests have to be executed serially. This would be okay, except we have separate tests for our GPU and CPU partitions and we don't want one partition to sit idle while the other is busy. In theory I could get around this by running two separate instances of ReFrame, each targeting a single partition, but that adds a second layer of complexity. Before I start setting that up, I was wondering if you know of a better way to run tests across all available nodes. If this mechanism is the way to do it, are there any plans to adjust the behavior in the near future?

@vkarak
Contributor

vkarak commented Jun 19, 2024

This note was valid even before. You can still run in parallel across ReFrame partitions. The reason behind the note is that the first test will consume all available nodes in a partition, so the next one will not find any idle nodes and will be skipped. Across partitions, though, this is not a problem, as ReFrame scopes the node request automatically.

We should update this note in the docs to make this clearer.

@lagerhardt
Author

lagerhardt commented Jun 19, 2024 via email

@teojgo
Contributor

teojgo commented Jun 19, 2024

One solution that might work is to run with the async execution policy but limit the number of jobs per partition to 1. Each test will then try to consume all the nodes of its partition, but ReFrame will not submit another job until those nodes are free again.
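
For reference, a rough sketch of what that could look like in the site configuration using the max_jobs partition parameter. The system and partition names are taken from the logs above; everything else here is illustrative, not your actual config:

# Sketch of a partition entry limiting concurrent jobs to 1 (names are illustrative).
site_configuration = {
    'systems': [
        {
            'name': 'muller',
            'hostnames': ['muller'],
            'partitions': [
                {
                    'name': 'gpu_ss11',
                    'scheduler': 'slurm',
                    'launcher': 'srun',
                    'environs': ['builtin'],
                    'max_jobs': 1,   # async policy keeps at most one active job here
                },
            ],
        },
    ],
}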

@vkarak
Contributor

vkarak commented Jun 24, 2024

@lagerhardt We have now added a new pseudo-state in the flexible allocation policy. You can run with --flex-alloc-policy=avail or --distribute=avail (depending on which option you use) and this will scale the tests to all the "available" nodes in each partition. An available node is one that is in the ALLOCATED, COMPLETING, or IDLE state. This way you can still submit all of your tests with the async policy: those submitted later will just wait.

@lagerhardt
Author

That sounds great! Thank you!!! When would this be available?

@vkarak
Contributor

vkarak commented Jun 25, 2024

That sounds great! Thank you!!! When would this be available?

It is already available from 4.6 :-)

@lagerhardt
Author

Sorry for the long silence, I'm finally able to come back to this. I am still getting zero nodes: --flex-alloc-nodes=avail gives me zero nodes for all jobs, even the first one (by the way, I'm only seeing "--flex-alloc-nodes", not "--flex-alloc-policy", which is what I assume you meant). For a reservation with five nodes, where only two of the nodes are up and of the proper type, using the avail flag tells me:

[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [CMD] 'scontrol -a show -o nodes'
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [F] Total available nodes: 443
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [F] Total available nodes after filter by state: 45
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [CMD] 'scontrol -a show res checkout'
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [CMD] 'scontrol -a show -o Nodes=nid[001004-001006,001023,001040-001041]'
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [CMD] 'scontrol -a show -o partitions'
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [F] Total available nodes after filternodes: 0
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: caught reframe.core.exceptions.JobError: [jobid=None] could not satisfy the minimum task requirement: required 4, found 0

@vkarak
Contributor

vkarak commented Jul 16, 2024

Yes, I meant --flex-alloc-policy (I always mix up the name). What state are the reservation nodes in? If they are also in the RESERVED state, then that could explain it, since avail excludes nodes in this state. Maybe, if a reservation is also requested, it would make sense for ReFrame to allow nodes in the RESERVED state to be selected automatically.

@lagerhardt
Author

Yes, these are also reserved. We typically use a full system reservation after a maintenance to do checkout.

@vkarak
Contributor

vkarak commented Jul 17, 2024

I think we need to add better support for RESERVED nodes then. Currently you could run with --flex-alloc-policy=IDLE+RESERVED, but you would have to submit the tests serially as before.

@dmargala
Contributor

@vkarak, do you have any suggestions on where a good place to add better support for this might be?

One way might be to extend the syntax of --flex-alloc-nodes to handle something like an "ignore these states" suffix, e.g. --flex-alloc-nodes=avail-reserved (maybe there's a better notation?). This seems flexible, but I haven't really thought about how it fits in more broadly.

For example, changes in schedulers.filter_nodes_by_state:

+
+    if '-' in state:
+        state, ignore_states = state.split('-')
+    else:
+        ignore_states = ''
     if state == 'avail':
-        nodelist = {n for n in nodelist if n.is_avail()}
+        nodelist = {n for n in nodelist if n.is_avail(ignore_states)}

and schedulers.slurm:

-    def in_statex(self, state):
-        return self._states == set(state.upper().split('+'))
+    def in_statex(self, state, ignore_states=''):
+        return self._states - set(ignore_states.upper().split('+')) == set(state.upper().split('+'))

-    def is_avail(self):
-        return any(self.in_statex(s)
+    def is_avail(self, ignore_states=''):
+        return any(self.in_statex(s, ignore_states=ignore_states)
                    for s in ('ALLOCATED', 'COMPLETING', 'IDLE'))

Alternatively, it might be reasonable to ignore the RESERVED state entirely in is_avail(), since SlurmJobScheduler.filternodes already filters nodes based on the presence of a reservation option before the node state filter is applied. I'm not sure what else might be impacted by that, though.

@dmargala
Contributor

dmargala commented Oct 18, 2024

Another problem with the existing is_avail implementation that contributes to the broader issue is that nodes can apparently be both IDLE and COMPLETING at the same time. I see some nodes in an active reservation with this state:

State=IDLE+COMPLETING+RESERVED

@vkarak
Contributor

vkarak commented Oct 21, 2024

Another problem with the existing is_avail implementation that contributes to the broader issue is that nodes can apparently be both IDLE and COMPLETING at the same time. I see some nodes in an active reservation with this state:

State=IDLE+COMPLETING+RESERVED

I think this is easily fixed if is_avail allows the node to be in any combination of the "avail" states, i.e., self._states <= {'ALLOCATED', 'COMPLETING', 'IDLE'}
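
A minimal sketch of what I mean, with a simplified class rather than the actual scheduler code:

# Illustrative only: subset check instead of exact-state matching.
class Node:
    def __init__(self, states):
        self._states = {s.upper() for s in states}

    def is_avail(self):
        # IDLE+COMPLETING still counts as "available", because it is a subset of
        # the avail states; RESERVED handling is a separate question (see below).
        return self._states <= {'ALLOCATED', 'COMPLETING', 'IDLE'}

print(Node(['IDLE', 'COMPLETING']).is_avail())   # True
print(Node(['IDLE', 'DRAIN']).is_avail())        # False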

For including the RESERVED nodes, I don't know yet what would be the best implementation. Ideally, we would want RESERVED to be automatically included in the "avail" states if --reservation is passed, but looking at the way the node filtering is currently implemented, it's not so straightforward without breaking the encapsulation.

@vkarak
Contributor

vkarak commented Oct 21, 2024

For including the RESERVED nodes, I don't know yet what would be the best implementation. Ideally, we would want RESERVED to be automatically included in the "avail" states if --reservation is passed, but looking at the way the node filtering is currently implemented, it's not so straightforward without breaking the encapsulation.

I have an idea for this. Since we do a scheduler-specific filtering anyway here, the is_avail() for Slurm could always include the RESERVED nodes and then filter them out in filternodes() if the --reservation option is not passed.
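
Roughly, and only as an illustration of the idea (simplified data structures, not the actual ReFrame code):

# Illustrative sketch: count RESERVED as available, then drop reserved nodes in
# filternodes() unless a reservation was explicitly requested.
AVAIL_STATES = {'ALLOCATED', 'COMPLETING', 'IDLE', 'RESERVED'}

def is_avail(states):
    return set(states) <= AVAIL_STATES

def filternodes(nodes, reservation=None):
    # nodes: hypothetical mapping of node name -> set of Slurm states
    avail = {name for name, states in nodes.items() if is_avail(states)}
    if reservation is None:
        # Without --reservation, reserved nodes can never be allocated, so drop them
        avail -= {name for name in avail if 'RESERVED' in nodes[name]}
    return avail

nodes = {
    'nid001004': {'IDLE', 'RESERVED'},
    'nid001040': {'IDLE'},
}
print(filternodes(nodes))                           # {'nid001040'}
print(filternodes(nodes, reservation='checkout'))   # both nodes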

@vkarak vkarak moved this from Todo to In Progress in ReFrame Backlog Oct 23, 2024
@vkarak vkarak added bug and removed help wanted labels Oct 23, 2024
@vkarak vkarak added this to the ReFrame 4.7 milestone Oct 23, 2024
@vkarak vkarak linked a pull request Oct 24, 2024 that will close this issue
@vkarak vkarak moved this from In Progress to Merge To Develop in ReFrame Backlog Oct 24, 2024
@vkarak vkarak closed this as completed Nov 8, 2024
@github-project-automation github-project-automation bot moved this from Merge To Develop to Done in ReFrame Backlog Nov 8, 2024