
Fixed slurm launcher #214

Open · wants to merge 3 commits into master
Conversation

dquigley533 commented:

Greetings from Warwick.

This PR fixes compute.py so that run3's arguments are actually passed through when --slurm is specified.

Without this fix, every instance of CASTEP launched by run3 tries to use all the processors in the SLURM reservation, which with -np > 1 leads to massive contention, huge server load, and failures when the .castep file doesn't appear after 30 s. Bad times.
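For concreteness, a rough sketch of what the fix means for each CASTEP launch (the exact flags emitted by compute.py and the seed name are illustrative, not quoted from the code):

# before: every job step tries to claim the whole SLURM reservation
srun castep.mpi $SEED

# after, with run3 -nc 16: each step is limited to its own 16 tasks
srun --exclusive -n 16 castep.mpi $SEED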

I've tested this on the TiO2 tutorial example on Sulis (sulis.ac.uk) with the following script, which should also work fine for larger batches of geomopts on greater numbers of nodes.

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=3850
#SBATCH --time=08:00:00
#SBATCH --account=su000-nottellingyou

module purge
module load foss/2020b
module load Python SciPy-bundle
module load numba

EXEC=/path/to/castep.mpi

# Use 16 cores per geomopt with 8 geomopts running concurrently
run3 --slurm -nc 16 -np 8  --executable $EXEC TiO2

Side note - the matador install script seems to load generic builds of NumPy and SciPy newer than the hardware-optimised versions loaded by the SciPy-bundle module on the cluster. I wouldn't recommend requiring that unless you specifically need newer functionality. Not a major issue here, as I imagine there isn't much CPU-intensive work going on inside matador.

ml-evs (Owner) commented Nov 2, 2021

Hi Dave, thanks for this. Completely understand the issue but I wonder if instead I just need to deprecate the --slurm flag and improve the documentation. Happy to talk about this offline if it is easier - if you are just wanting to encourage high-throughput on Sulis, there are probably more advanced packages that are actively developed, e.g. signac, aiida, parsl, fireworks etc.

Let me make some clarifying remarks.

This PR fixes compute.py so that run3's arguments are actually passed through when --slurm is specified.

Without this fix, every instance of CASTEP launched by run3 tries to use all the processors in the SLURM reservation, which with -np > 1 leads to massive contention, huge server load, and failures when the .castep file doesn't appear after 30 s. Bad times.

  1. I added the --slurm flag many years ago for a machine I was using that did not expose mpirun to the user, only srun. Within the context of run3, it simply replaces the mpirun call with srun and tries to use the appropriate flags for parallelism. No matter whether you use --slurm, run3 will still look for SLURM_<x> environment variables and use them to control things like walltime, array jobs etc. If you don't have a specific reason to use srun over mpirun, then I would suggest that I just remove the --slurm flag. If you do want to use srun, we can certainly look at using the flags you desire, but I think it would also be good for these to be user configurable.
  2. I never use -np > 1 myself. Although there is little Python overhead from doing the job parallelism, there is always going to be a bottleneck where you may only be waiting for one process to finish (leaving up to (np-1)/np of your allocation idle). run3 does not do any load-balancing in this sense. My typical suggestion for doing high-throughput jobs is to use SLURM array jobs (a rough sketch follows below); see e.g. this run3 tutorial for BlueBear.
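As an illustration of that array-job pattern, a minimal sketch (the resource numbers are arbitrary, and the assumption that several array elements can safely share one working directory via run3's file-based job distribution is untested here):

#!/bin/bash
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1
#SBATCH --time=08:00:00
#SBATCH --array=1-8

# Each array element is an independent run3 process; SLURM does the
# scheduling and load-balancing that -np > 1 would not.
run3 -nc 16 --executable /path/to/castep.mpi TiO2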

I've tested this on the TiO2 tutorial example on Sulis (sulis.ac.uk) with the following script, which should also work fine for larger batches of geomopts on greater numbers of nodes.

I would make sure to test one multi-node job too, we have had some trouble in the past depending on the type of parallel filesystem in use.

# Use 16 cores per geomopt with 8 geomopts running concurrently
run3 --slurm -nc 16 -np 8  --executable $EXEC TiO2

-nc is per run3 job, not per geomopt. You might be getting lucky in that the SLURM_NTASKS environment variable is taking preference over the command-line args. This is another source of confusion, and another reason I would like to deprecate -np...

(tell a lie, I must have changed this at some point in the last 5 years... maybe the confusion is only with me!)

Side note - the matador install script seems to load generic builds of NumPy and SciPy newer than the hardware-optimised versions loaded by the SciPy-bundle module on the cluster. I wouldn't recommend requiring that unless you specifically need newer functionality. Not a major issue here, as I imagine there isn't much CPU-intensive work going on inside matador.

matador just uses pip/setuptools for installation, so if you already have recent versions of numpy and scipy loaded when you install, it will not overwrite them. I don't think the versions I require are that recent (probably just scipy>1 and a numpy from the last few years). As you say, matador is doing no real computations itself in this mode, so I wouldn't worry about this.

dquigley533 (Author) commented:

Hi Matthew,

Thanks for looking at the PR and I see where you're coming from. If you're used to working on clusters where job arrays are the norm and mpirun does everything you need, then your plan to deprecate srun and -np would make sense.

I wouldn't recommend it though - I'm looking at this precisely because Sulis users are very much not in that boat and some of them are wanting to use run3 for their AIRSS workloads.

  1. I'd be wary of treating srun and mpirun as interchangeable. They often don't have the same user-facing feature set, or integrate with the resource allocation manager and MPI implementation in the same way. Sometimes mpirun is just a wrapper around srun that exposes only minimal functionality. It's a bit of a cluster-specific minefield, as you've seen from your machine that didn't even expose mpirun (there were probably good reasons for that).

For example the --exclusive option to srun I suggested in the PR is crucial here otherwise job step creation fails for anything but the first geomopt. I wouldn't know how to get equivalent results with mpirun.

From what I could tell, with --slurm run3 replaced mpirun with srun and then didn't try any flags for parallelism (not even the same ones it used for mpirun). That's the issue I was running up against and suggesting a fix for. Regardless of what was specified for -np and -nc, it just did "srun executable".

So my suggestion would be to support srun, passing through ncores and nnodes as you do for mpirun, but also allowing users to pass additional arguments through run3 to srun, like --exclusive, when needed. Similarly, CASTEP benefits from --cpu-bind=rank on Sulis (and the equivalent on ARCHER2). That's free performance we'd ideally not leave on the table. So something that allows

run3 --slurm -nc 16 -np 8 --with-srun-args="--exclusive --cpu-bind=rank" TiO2

would be perfect.

  2. Understood, although that is easy(?) to avoid by keeping the pipeline fed. It should also be tensioned against having jobs that request whole nodes held up in the queue because your last few array-job elements are using a fraction of a node or nodes (we have 128-core nodes). We are discouraging (in fact rejecting) large job arrays on Sulis for this and other SAFE-related reasons.

Can also confirm that running in the way I suggest across multiple nodes works as expected.

On the setuptools thing, your requirements.txt is doing what it should, I think; I was just surprised by the requirement for such recent versions of things. But then I only looked at two of the source files :-).

HTH

David

ml-evs (Owner) commented Nov 3, 2021

I wouldn't recommend it though - I'm looking at this precisely because Sulis users are very much not in that boat and some of them are wanting to use run3 for their AIRSS workloads.

Noted, maybe @ajm143 should be looped in here as I'm not sure who will actually be supporting new development of matador (it probably will not be me). If you have anyone to work up the suggested flags below, then feel free; otherwise I am happy to merge this PR as-is too. It will probably be a few weeks until I have time to do this properly (and test it) myself.

1. I'd be wary of treating srun and mpirun as interchangeable. They often don't have the same user-facing feature set, or integrate with the resource allocation manager and MPI implementation in the same way. Sometimes mpirun is just a wrapper around srun that exposes only minimal functionality. It's a bit of a cluster-specific minefield, as you've seen from your machine that didn't even expose mpirun (there were probably good reasons for that).

It is precisely this machine-specific minefield that I am trying to avoid. I think the suggestion of overridable --mpi-command and --with-mpi-args options would be the best general approach. There is already a mechanism for using the values from -nc and -np within other command-line args, so e.g. --mpi-command srun --with-mpi-args -n $nc --exclusive --cpu-bind=rank could do the job.
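As a sketch of how that could look from the user's side (these flags are hypothetical and not yet implemented in run3; the $nc expansion is the proposed mechanism, not current behaviour):

# Hypothetical interface: with -nc 16, $nc would be expanded to 16
# before the launcher command is assembled.
run3 -nc 16 -np 8 \
     --mpi-command srun \
     --with-mpi-args '-n $nc --exclusive --cpu-bind=rank' \
     --executable $EXEC TiO2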

For example the --exclusive option to srun I suggested in the PR is crucial here otherwise job step creation fails for anything but the first geomopt. I wouldn't know how to get equivalent results with mpirun.

From what I could tell, with --slurm run3 replaced mpirun with srun and then didn't try any flags for parallelism (not even the same ones it used for mpirun). That's the issue I was running up against and suggesting a fix for. Regardless of what was specified for -np and -nc, it just did "srun executable".

On the machine in question, srun castep would sort out the parallelism for you based on the node, but this approach was never used with np > 1 and the code path should have been blocked off.

2. Understood, although that is easy(?) to avoid by keeping the pipeline fed. It should also be tensioned against having jobs that request whole nodes held up in the queue because your last few array-job elements are using a fraction of a node or nodes (we have 128-core nodes). We are discouraging (in fact rejecting) large job arrays on Sulis for this and other SAFE-related reasons.

Fair enough - I am reluctant to claim that run3 will support this use case properly, as I will not be personally testing any of it. The method of distributing jobs by checking what other running calculations have written to disk is probably not up to the task of keeping a machine like this fed. Using np > 1 may alleviate some of these issues, but the hours I can personally dedicate to this are 0.

Can also confirm that running in the way I suggest across multiple nodes works as expected.

👍

On the setuptools thing, your requirements.txt is doing what it should, I think; I was just surprised by the requirement for such recent versions of things. But then I only looked at two of the source files :-).

👍

dquigley533 (Author) commented:

Thanks Matthew,

Let me try the mechanism you suggest for doing this via --mpi-command srun and --with-mpi-args, and I'll get back to you. If that works, then I can see why --slurm might be superfluous, with no need for this PR.

It was a screen-shared Teams call with @ajm143, during which we attempted to fumble through trying run3 on Sulis, that prompted me to look into this in the first place. He seems to think sorting this out would be a good use of my 50% sabbatical this year. Not sure I'm 100% on board with that, but I can put some time into it.

Another option is to just do one geomopt per run3 invocation and then invoke run3 many times per job script with GNU parallel. I can give that a whirl as well.
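A rough sketch of that GNU parallel pattern, assuming one directory per structure and that run3 behaves sensibly when pointed at a single-structure directory (both assumptions, untested):

# Hypothetical: 8 concurrent single-structure run3 invocations, 16 cores each
EXEC=/path/to/castep.mpi
parallel -j 8 "cd {} && run3 -nc 16 --executable $EXEC TiO2" ::: struct_*/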

dquigley533 (Author) commented:

Sorry, I might have grabbed the wrong end of the stick there: --mpi-command and --with-mpi-args aren't recognised arguments to run3. Was that a suggestion based on the existing code, or a possibility for future implementation (in which case I'm happy to have a go)?

ml-evs (Owner) commented Nov 3, 2021

Sorry, I might have grabbed the wrong end of the stick there: --mpi-command and --with-mpi-args aren't recognised arguments to run3. Was that a suggestion based on the existing code, or a possibility for future implementation (in which case I'm happy to have a go)?

Yes, sorry, the suggestion was that they get added! I can do the code but won't be able to test - let me know if you want to have a go in the meantime (I will post here if I begin implementing it).

dquigley533 requested a review from ml-evs as a code owner on June 8, 2022 17:47.