Nodelist #200

Closed
kylechard opened this issue May 5, 2022 · 19 comments

@kylechard
Contributor

Libensemble would like to have access to a nodelist on different systems. They also have some code for several schedulers that could be used as a basis for this.

@jameshcorbett
Collaborator

See also this issue in the job-api-spec repo.

@jameshcorbett
Collaborator

In my tests on LLNL machines:

  • Flux does not export any nodelist environment variables, but there is a command you can run to get something like ruby[1-2]
  • Slurm gives you something like SLURM_JOB_NODELIST=ruby[1-2], which is fine
  • LSF gives a bunch of stuff in different formats, but none of them is perfect:
LLNL_COMPUTE_NODES=lassen[23,26]
LSB_HOSTS=lassen710 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen26 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23 lassen23
LCSCHEDCLUSTER=lassen
LSB_SUB_HOST=lassen709
LSB_MCPU_HOSTS=lassen710 1 lassen26 40 lassen23 40

LLNL_COMPUTE_NODES is the best but not portable.
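
For reference, a minimal sketch (not from the thread) of deriving a raw node specification from the environment variables observed above; the helper name is hypothetical and only the variables listed in this comment are checked:

import os

def raw_nodelist():
    # Hypothetical helper: return a raw node specification from scheduler env vars,
    # based on the variables observed above.
    if "SLURM_JOB_NODELIST" in os.environ:   # Slurm: compressed form, e.g. ruby[1-2]
        return os.environ["SLURM_JOB_NODELIST"]
    if "LSB_MCPU_HOSTS" in os.environ:       # LSF: "host1 n1 host2 n2 ..."
        return os.environ["LSB_MCPU_HOSTS"]
    if "LLNL_COMPUTE_NODES" in os.environ:   # LLNL-specific, not portable
        return os.environ["LLNL_COMPUTE_NODES"]
    return None                              # Flux exports no nodelist variable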

@hategan
Collaborator

hategan commented May 13, 2022

So I posted this on Slack, but I'm adding it here for posterity. It's from https://slurm.schedmd.com/srun.html.

srun -l /bin/hostname | sort -n | awk '{print $2}' > $MACHINEFILE

It's a hack because translating the Slurm nodelist into an mpirun hostfile should be the job of a translator (i.e., a parser + generator), not of a distributed job. But it's brilliant because the only LRM-specific thing is the string srun.
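
A rough Python equivalent of that one-liner, as a sketch only: it assumes srun is on PATH, that it runs inside a Slurm allocation, and that the "rank: hostname" prefix format of srun -l holds.

import subprocess

def nodelist_via_srun():
    # srun -l prefixes each output line with the task rank ("0: ruby1"); sort by
    # rank, keep the host names, and drop duplicates while preserving order.
    out = subprocess.run(["srun", "-l", "/bin/hostname"],
                         capture_output=True, text=True, check=True).stdout
    lines = sorted(out.splitlines(), key=lambda l: int(l.split(":", 1)[0]))
    hosts = [l.split(":", 1)[1].strip() for l in lines]
    return list(dict.fromkeys(hosts))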

@hategan
Collaborator

hategan commented May 13, 2022

Point being, does jsrun -n ALL_HOSTS /bin/hostname work?

@jameshcorbett
Collaborator

It works if you add -c ALL_CPUS. Otherwise I'm guessing it packs all the resource sets onto as few nodes as it can.
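
The LSF variant could be wrapped the same way; a sketch only, using the jsrun flags reported in this thread:

import subprocess

def nodelist_via_jsrun():
    # Per the observation above, -c ALL_CPUS is what spreads the resource sets
    # across all nodes; without it everything may land on a single node.
    out = subprocess.run(["jsrun", "-n", "ALL_HOSTS", "-c", "ALL_CPUS", "/bin/hostname"],
                         capture_output=True, text=True, check=True).stdout
    return sorted(set(out.split()))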

@andre-merzky
Collaborator

srun -l /bin/hostname | sort -n | awk '{print $2}' > $MACHINEFILE

Note that this can consume a significant amount of time, whereas interpreting the system-specific env variables and files is more complex and less portable, but also much faster (at scale).

We do have some Python code which generates a nodelist for ccm, cobalt, lsf, pbspro, slurm and torque, so we could extract this from RP (or rewrite it, it's fairly simple). The code lives here -- have a look at the individual _init_from_scratch() methods.

BTW: not all systems make it easy to obtain cores_per_node and gpus_per_node via runtime inspection - specifically the latter may be a much harder problem to solve than the nodelist.
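
As an illustration of the kind of parsing such per-LRM code ends up doing (this is a sketch, not code from RP), here is the LSB_MCPU_HOSTS format shown earlier, which also yields a per-node core count; it assumes, as in the lassen example above, that the first entry is the launch/batch host:

import os

def lsf_nodes_and_cores():
    # LSB_MCPU_HOSTS alternates host name and slot count, e.g.
    # "lassen710 1 lassen26 40 lassen23 40".
    tokens = os.environ["LSB_MCPU_HOSTS"].split()
    pairs = list(zip(tokens[0::2], (int(n) for n in tokens[1::2])))
    launch, *compute = pairs   # assumption: first entry is the launch host
    return dict(compute)       # {"lassen26": 40, "lassen23": 40} for the example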

@hategan
Collaborator

hategan commented May 13, 2022

It works if you add -c ALL_CPUS. Otherwise I'm guessing it packs all the resource sets onto as few nodes as it can.

Is the guess about whether it works or about why it doesn't? (i.e., I'm not entirely clear on whether you tried without -c ALL_CPUS or not)

@hategan
Collaborator

hategan commented May 13, 2022

Note that this can consume a significant amount of time, whereas interpreting the system-specific env variables and files is more complex and less portable, but also much faster (at scale).

That's what worries me, too. But there's a question we probably should ask: is this time on the same order of magnitude as starting the actual job, or is there a significant penalty not due to the srun itself, but due to the output collection?

If the former, then maybe the penalty is acceptable given the portability.

In any case, I know changes to the spec should be rare, but maybe it makes sense to protect this operation with a flag?

@jameshcorbett
Collaborator

It works if you add -c ALL_CPUS. Otherwise I'm guessing it packs all the resource sets onto as few nodes as it can.

Is the guess about whether it works or about why it doesn't? (i.e., I'm not entirely clear on whether you tried without -c ALL_CPUS or not)

Yeah, it didn't work without -c ALL_CPUS; it put both "resource sets" on the same node.

@jameshcorbett
Collaborator

If the former, then maybe the penalty is acceptable given the portability.

I think the best approach would be to let different systems do different implementations. With LSF, maybe it makes sense to use jsrun rather than reading the host file and trying to figure out which is the launch node; I'm not sure. With Slurm it's unclear: is parsing the hostlist format better than running srun over every node? The former is more complicated (string parsing) but faster. As far as I know they are equally portable. But with Flux you could have a pretty simple, clean Python script that sends an RPC to Flux and then uses Flux's hostlist iterator to produce the list.
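
For the Slurm option, the string parsing is roughly the following sketch; it only handles simple ruby[1-2,5]-style ranges with a single prefix, not every corner of the format:

import re

def expand_slurm_nodelist(nodelist):
    # Expand e.g. "ruby[1-2,5]" -> ["ruby1", "ruby2", "ruby5"]; bare names pass through.
    m = re.fullmatch(r"(\S+?)\[([\d,\-]+)\]", nodelist)
    if m is None:
        return [nodelist]
    prefix, spec = m.groups()
    hosts = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            width = len(lo)  # keep zero padding, e.g. node[001-003]
            hosts.extend(f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1))
        else:
            hosts.append(prefix + part)
    return hosts

Alternatively, scontrol show hostnames "$SLURM_JOB_NODELIST" does the same expansion without launching a job step, at the cost of still shelling out.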

@andre-merzky
Collaborator

andre-merzky commented May 13, 2022

I think the best approach would be to let different systems do different implementations

I agree with that.

@hategan
Collaborator

hategan commented May 14, 2022

I think the best approach would be to let different systems do different implementations.

Yes. If an executor has reasonable access to the hostlist, it should use that.

We should also test and see what the delay of running hostname is compared to the time it takes to launch the user job so that we know whether it's worth investing into alternative methods for non-flux executors.

@andre-merzky
Collaborator

We should also test and see what the delay of running hostname is compared to the time it takes to launch the user job so that we know whether it's worth investing into alternative methods for non-flux executors.

Yeah, measuring this is useful. The time will depend on the number of nodes though, so we should be careful to measure this at scale.
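
A throwaway measurement along these lines might look like the sketch below; it times only the nodelist collection, which would then be compared against the launch time of the actual user job at the same node count:

import subprocess
import time

def time_nodelist_collection():
    # Time just the srun-based nodelist collection, not the user job itself.
    t0 = time.perf_counter()
    subprocess.run(["srun", "-l", "/bin/hostname"],
                   capture_output=True, check=True)
    return time.perf_counter() - t0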

@jameshcorbett
Collaborator

My preference would just be to do the easy thing, whatever that is, and then update the implementation once someone starts complaining...

@hategan
Collaborator

hategan commented May 15, 2022

My preference would just be to do the easy thing, whatever that is, and then update the implementation once someone starts complaining...

A few days of work could save us a few hours of planning! :)

@mturilli
Collaborator

Decide whether this should be part of 0.1.0. The general feeling is that it should not be, but we need confirmation/agreement. As a side note, we should also decide on a release schedule after reaching 0.1.0.

@kylechard
Contributor Author

Agreed, seems reasonable to push to 0.2.0 release.

@hategan
Collaborator

hategan commented Jun 16, 2022

A Python tool for Slurm hostlists: https://www.nsc.liu.se/~kent/python-hostlist/
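
With that package the expansion reduces to something like the following (a sketch based on my reading of the python-hostlist documentation; the exact function names should be double-checked):

import hostlist  # pip install python-hostlist

nodes = hostlist.expand_hostlist("ruby[1-2]")   # -> ["ruby1", "ruby2"]
compressed = hostlist.collect_hostlist(nodes)   # -> "ruby[1-2]"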

@hategan
Collaborator

hategan commented May 13, 2024

Nodelists have been supported since 0.9.4

hategan closed this as completed May 13, 2024