Issue with determining number of valid nodes for num_tasks=0 #3216

Closed
lagerhardt opened this issue Jun 18, 2024 · 16 comments · Fixed by #3290

@lagerhardt

With 4.6.1, if you have a reservation and a test with num_tasks=0, the framework returns 0 nodes. I'm invoking ReFrame with reframe -vvvvvvv -r -R -c checks/microbenchmarks/dgemm -J reservation=checkout -n dgemm_cpu -C nersc-config.py, and here's what I see with the logging turned up:

[F] Flexible node allocation requested
[CMD] 'scontrol -a show -o nodes'
[F] Total available nodes: 443
[CMD] 'scontrol -a show res checkout'
[CMD] 'scontrol -a show -o Nodes=login[01-07],nid[001000-001023,001033,001036-001037,001040-001041,001044-001045,001048-001049,001052-001053,001064-001065,001068-001069,001072-001073,001076-001077,001080-001081,001084-001085,001088-001089,001092-001093,200001-200257,200260-200261,200264-200265,200268-200269,200272-200273,200276-200277,200280-200281,200284-200285,200288-200289,200292-200293,200296-200297,200300-200301,200304-200305,200308-200309,200312-200313,200316-200317,200320-200321,200324-200325,200328-200329,200332-200333,200336-200337,200340-200341,200344-200345,200348-200349,200352-200353,200356-200357,200360-200361,200364-200365,200368-200369,200372-200373,200376-200377,200380-200381,200384-200385,200388-200389,200392-200393,200396-200397,200400-200401,200404-200405,200408-200409,200412-200413,200416-200417,200420-200421,200424-200425,200428-200429,200432-200433,200436-200437,200440-200441,200444-200445,200448-200449,200452-200453,200456-200457,200460-200461,200464-200465,200468-200469,200472-200473,200476-200477,200480-200481,200484-200485,200488-200489,200492-200493,200496-200497,200500-200501,200504-200505,200508-200509]'
[S] slurm: [F] Filtering nodes by reservation checkout: available nodes now: 0

There are available nodes in the reservation, though not all of them are. Here's the list of node states:

1 State=DOWN+DRAIN+MAINTENANCE+RESERVED+NOT_RESPONDING
2 State=DOWN+DRAIN+MAINTENANCE+RESERVED
5 State=DOWN+DRAIN+RESERVED+NOT_RESPONDING
1 State=DOWN+MAINTENANCE+RESERVED+NOT_RESPONDING
2 State=DOWN+MAINTENANCE+RESERVED+NOT_RESPONDING
1 State=DOWN+RESERVED+NOT_RESPONDING
1 State=DOWN+RESERVED
1 State=DOWN+RESERVED
7 State=IDLE+DRAIN+MAINTENANCE+RESERVED
2 State=IDLE+MAINTENANCE+RESERVED
45 State=IDLE+RESERVED
1 State=IDLE+RESERVED
370 State=IDLE+RESERVED
1 State=IDLE+RESERVED
1 State=IDLE
1 State=MIXED+RESERVED

I can only get a non-zero number if I add --flex-alloc-nodes=IDLE+RESERVED; I still get zero with --flex-alloc-nodes=IDLE. My understanding was that asking for IDLE was supposed to match any of these states, but that doesn't seem to be the case. I suspect the in-place intersection between the two node sets, i.e.

nodes &= self._get_reservation_nodes(reservation)

might have something to do with it. From my logging it looks like the node set is already empty before it queries the nodes in the reservation.
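
To illustrate the set arithmetic I suspect is at play, here's a toy sketch (made-up node names and simplified structures, not ReFrame's actual code):

# Illustrative sketch only, not ReFrame code: if exact-state filtering has already
# dropped the IDLE+RESERVED nodes, the later intersection with the reservation is empty.
all_nodes = {
    'nid001004': {'IDLE', 'RESERVED'},
    'nid001005': {'IDLE', 'RESERVED'},
    'nid001040': {'ALLOCATED'},
}

# Exact match against IDLE keeps nothing: the real states also carry RESERVED
nodes = {name for name, states in all_nodes.items() if states == {'IDLE'}}

reservation_nodes = {'nid001004', 'nid001005'}  # hypothetical reservation members
nodes &= reservation_nodes                      # like `nodes &= self._get_reservation_nodes(...)`
print(len(nodes))                               # prints 0 -- "available nodes now: 0"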

@vkarak
Contributor

vkarak commented Jun 18, 2024

Hi @lagerhardt, this has changed in 4.6 (check the docs for the --distribute option). To revert to the original behaviour you should use --flex-alloc-policy=idle*. The rationale behind this change is that there was previously no way to select nodes that were exclusively in a given state, so ReFrame could end up requesting nodes in IDLE+DRAIN states, which it could never get.

@lagerhardt
Author

Ah, thanks. I missed that update.

So it looks like I can get roughly the behavior I want with the --flex-alloc-nodes=idle+reserved flag, but I see the note that multiple tests have to be executed serially. This would be okay, except we have separate tests for our GPU and CPU partitions and we don't want one partition to sit idle while the other is busy. In theory I could get around this by running two separate instances of ReFrame, each targeting a single partition, but that adds a second layer of complexity. Before I start setting that up, I was wondering if you know of a better way to run tests across all available nodes. If this mechanism is the way to do it, are there any plans to adjust the behavior in the near future?

@vkarak
Contributor

vkarak commented Jun 19, 2024

This note was valid even before. You can still run in parallel across ReFrame partitions. The reason behind the note is that the first test will consume all available nodes in a partition, so the next one will not find any idle nodes and will be skipped. Across partitions, though, this is not a problem, as ReFrame scopes the node request automatically.

We should update this note in the docs to make this clearer.

@lagerhardt
Author

lagerhardt commented Jun 19, 2024 via email

@teojgo
Contributor

teojgo commented Jun 19, 2024

One solution that might work is to run with the async execution policy but limit the number of jobs per partition to 1. Each test will then try to consume all the nodes of its partition, but ReFrame will not submit another job until those nodes are free again.
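
For reference, a rough sketch of what that could look like in the site configuration using the max_jobs partition parameter. The system and partition names are taken from the logs above; everything else here is illustrative, not your actual config:

# Sketch of a partition entry limiting concurrent jobs to 1 (names are illustrative).
site_configuration = {
    'systems': [
        {
            'name': 'muller',
            'hostnames': ['muller'],
            'partitions': [
                {
                    'name': 'gpu_ss11',
                    'scheduler': 'slurm',
                    'launcher': 'srun',
                    'environs': ['builtin'],
                    'max_jobs': 1,   # async policy keeps at most one active job here
                },
            ],
        },
    ],
}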

@vkarak
Contributor

vkarak commented Jun 24, 2024

@lagerhardt We have now added a new pseudo-state in the flexible allocation policy. You can run with --flex-alloc-policy=avail or --distribute=avail (depending on which option you use) and this will scale the tests to all the "available" nodes in each partition. An available node is one that is in the ALLOCATED, COMPLETING, or IDLE state. This way you can still submit all of your tests with the async policy: those submitted later will just wait.

@lagerhardt
Author

That sounds great! Thank you!!! When would this be available?

@vkarak
Contributor

vkarak commented Jun 25, 2024

That sounds great! Thank you!!! When would this be available?

It is already available from 4.6 :-)

@lagerhardt
Author

Sorry for the long silence, I'm finally able to come back to this. I am still getting zero nodes: --flex-alloc-nodes=avail gives me zero nodes for all jobs, even the first one (by the way, I'm only seeing "--flex-alloc-nodes", not "--flex-alloc-policy", which is what I assume you meant). For a reservation with five nodes, where only two of the nodes are up and of the proper type, using the avail flag tells me:

[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [CMD] 'scontrol -a show -o nodes'
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [F] Total available nodes: 443
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [F] Total available nodes after filter by state: 45
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [CMD] 'scontrol -a show res checkout'
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [CMD] 'scontrol -a show -o Nodes=nid[001004-001006,001023,001040-001041]'
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [CMD] 'scontrol -a show -o partitions'
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [F] Total available nodes after filternodes: 0
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: caught reframe.core.exceptions.JobError: [jobid=None] could not satisfy the minimum task requirement: required 4, found 0

@vkarak
Contributor

vkarak commented Jul 16, 2024

Yes, I meant --flex-alloc-policy (I always mix up the name). What state are the reservation nodes in? If they are also in the RESERVED state, then that could explain it, since avail excludes nodes in this state. Maybe, if a reservation is also requested, it would make sense for ReFrame to allow nodes in the RESERVED state to be selected automatically.

@lagerhardt
Author

Yes, these are also reserved. We typically use a full system reservation after a maintenance to do checkout.

@vkarak
Contributor

vkarak commented Jul 17, 2024

I think we need to add better support for RESERVED nodes then. Currently you could run with --flex-alloc-policy=IDLE+RESERVED, but you would have to submit the tests serially as before.

@dmargala
Contributor

@vkarak, do you have any suggestions on where a good place to add better support for this might be?

One way might be to extend the syntax of --flex-alloc-nodes to handle something like an "ignore these states" suffix, e.g. --flex-alloc-nodes=avail-reserved (maybe there's a better notation?). This seems flexible, but I haven't really thought about how it fits in more broadly.

For example, changes in schedulers.filter_nodes_by_state:

+
+    if '-' in state:
+        state, ignore_states = state.split('-')
+    else:
+        ignore_states = ''
     if state == 'avail':
-        nodelist = {n for n in nodelist if n.is_avail()}
+        nodelist = {n for n in nodelist if n.is_avail(ignore_states)}

and schedulers.slurm:

-    def in_statex(self, state):
-        return self._states == set(state.upper().split('+'))
+    def in_statex(self, state, ignore_states=''):
+        return self._states - set(ignore_states.upper().split('+')) == set(state.upper().split('+'))

-    def is_avail(self):
-        return any(self.in_statex(s)
+    def is_avail(self, ignore_states=''):
+        return any(self.in_statex(s, ignore_states=ignore_states)
                    for s in ('ALLOCATED', 'COMPLETING', 'IDLE'))

Alternatively, it might be reasonable to ignore the RESERVED state entirely in is_avail(), since SlurmJobScheduler.filternodes already filters nodes based on the presence of a reservation option before the node state filter is applied. I'm not sure what else might be impacted by that, though.

@dmargala
Contributor

dmargala commented Oct 18, 2024

Another problem with the existing is_avail implementation that contributes to the broader issue is that nodes can apparently be both IDLE and COMPLETING at the same time. I see some nodes in an active reservation with this state:

State=IDLE+COMPLETING+RESERVED

@vkarak
Contributor

vkarak commented Oct 21, 2024

Another problem with the existing is_avail implementation that contributes to the broader issue is that nodes can apparently be both IDLE and COMPLETING at the same time. I see some nodes in an active reservation with this state:

State=IDLE+COMPLETING+RESERVED

I think this is easily fixed if is_avail allows the node to be in any combination of the "avail" states, i.e., self._states <= {'ALLOCATED', 'COMPLETING', 'IDLE'}
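
A minimal sketch of what I mean, with a simplified class rather than the actual scheduler code:

# Illustrative only: subset check instead of exact-state matching.
class Node:
    def __init__(self, states):
        self._states = {s.upper() for s in states}

    def is_avail(self):
        # IDLE+COMPLETING still counts as "available", because it is a subset of
        # the avail states; RESERVED handling is a separate question (see below).
        return self._states <= {'ALLOCATED', 'COMPLETING', 'IDLE'}

print(Node(['IDLE', 'COMPLETING']).is_avail())   # True
print(Node(['IDLE', 'DRAIN']).is_avail())        # False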

For including the RESERVED nodes, I don't know yet what would be the best implementation. Ideally, we would want RESERVED to be automatically included in the "avail" states if --reservation is passed, but looking at the way the node filtering is currently implemented, it's not so straightforward without breaking the encapsulation.

@vkarak
Contributor

vkarak commented Oct 21, 2024

For including the RESERVED nodes, I don't know yet what would be the best implementation. Ideally, we would want RESERVED to be automatically included in the "avail" states if --reservation is passed, but looking at the way the node filtering is currently implemented, it's not so straightforward without breaking the encapsulation.

I have an idea for this. Since we do a scheduler-specific filtering anyway here, the is_avail() for Slurm could always include the RESERVED nodes and then filter them out in filternodes() if the --reservation option is not passed.
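
Roughly, and only as an illustration of the idea (simplified data structures, not the actual ReFrame code):

# Illustrative sketch: count RESERVED as available, then drop reserved nodes in
# filternodes() unless a reservation was explicitly requested.
AVAIL_STATES = {'ALLOCATED', 'COMPLETING', 'IDLE', 'RESERVED'}

def is_avail(states):
    return set(states) <= AVAIL_STATES

def filternodes(nodes, reservation=None):
    # nodes: hypothetical mapping of node name -> set of Slurm states
    avail = {name for name, states in nodes.items() if is_avail(states)}
    if reservation is None:
        # Without --reservation, reserved nodes can never be allocated, so drop them
        avail -= {name for name in avail if 'RESERVED' in nodes[name]}
    return avail

nodes = {
    'nid001004': {'IDLE', 'RESERVED'},
    'nid001040': {'IDLE'},
}
print(filternodes(nodes))                           # {'nid001040'}
print(filternodes(nodes, reservation='checkout'))   # both nodes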

@vkarak vkarak moved this from Todo to In Progress in ReFrame Backlog Oct 23, 2024
@vkarak vkarak added bug and removed help wanted labels Oct 23, 2024
@vkarak vkarak added this to the ReFrame 4.7 milestone Oct 23, 2024
@vkarak vkarak linked a pull request Oct 24, 2024 that will close this issue
@vkarak vkarak moved this from In Progress to Merge To Develop in ReFrame Backlog Oct 24, 2024
@vkarak vkarak closed this as completed Nov 8, 2024
@github-project-automation github-project-automation bot moved this from Merge To Develop to Done in ReFrame Backlog Nov 8, 2024