Fixed slurm launcher #214
base: master
Conversation
Hi Dave, thanks for this. I completely understand the issue, but I wonder if instead I just need to deprecate `srun` and `-np` altogether. Let me make some clarifying remarks.
I would make sure to test one multi-node job too; we have had some trouble in the past depending on the type of parallel filesystem in use.
(Tell a lie, I must have changed this at some point in the last 5 years... maybe the confusion is only with me!)
matador just uses `mpirun`.
Hi Matthew, thanks for looking at the PR, and I see where you're coming from. If you're used to working on clusters where job arrays are the norm and `mpirun` does everything you need, then your plan to deprecate `srun` and `-np` would make sense. I wouldn't recommend it, though: I'm looking at this precisely because Sulis users are very much not in that boat, and some of them want to use run3 for their AIRSS workloads.
My suggestion, from what I could tell, would be to support `srun`, passing through ncores and nnodes as you do for `mpirun`, but also allowing users to pass additional run3 arguments through to `srun`. Something like `run3 --slurm -nc 16 -np 8 --with-srun-args="--exclusive --cpu-bind=rank" TiO2` would be perfect.
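For concreteness, here is roughly what such a (currently hypothetical) `--with-srun-args` flag would need to produce per calculation. This is a sketch only: I'm assuming `-nc` is the core count per calculation and `-np` the number of concurrent calculations, and the `castep.mpi` binary name and seed are placeholders.

```bash
# Each of the 8 concurrent CASTEP instances would be launched on its own
# 16-core share, with the user-supplied srun options appended verbatim.
seed=TiO2-candidate-001   # placeholder for the per-structure seed run3 generates
srun -n 16 --exclusive --cpu-bind=rank castep.mpi "$seed"
```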
I can also confirm that running in the way I suggest across multiple nodes works as expected. On the setuptools thing, your requirements.txt is doing what it should, I think; I was just surprised by the requirement for such recent versions of things. But then I only looked at two of the source files :-). HTH, David
Noted; maybe @ajm143 should be looped in here, as I'm not sure who will actually be supporting new development of matador (it probably will not be me). If you have anyone who can work up the suggested flags then feel free, otherwise I am happy to merge this PR as-is too. It will probably be a few weeks until I have time to do this properly (and test it) myself.
It is precisely this machine-specific minefield that I am trying to avoid. I think the suggestion of something overridable like `--with-srun-args` is a good one.
On the machine in question,
Fair enough. I am reticent to claim that run3 will support this use case properly, as I will not be personally testing any of it. The method of distributing jobs by checking on disk for other running calculations is probably not up to the task of keeping a machine like this fed.
👍
👍
Thanks Matthew. Let me try the mechanism you suggest for doing this. It was a screen-shared Teams call with @ajm143, during which we attempted to fumble through trying run3 on Sulis, that prompted me to look into this in the first place. He seems to think sorting this out would be a good use of my 50% sabbatical this year. Not sure I'm 100% on board with that, but I can put some time into it. Another option is to just do one geomopt per run3 invocation and then invoke run3 many times per job script with GNU parallel; I can give that a whirl as well.
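A rough sketch of that GNU parallel route (untested; the per-structure directory layout, the TiO2.cell/TiO2.param file names and the core counts are my own assumptions rather than anything run3 requires):

```bash
# Give each candidate structure its own working directory so that every
# run3 invocation sees exactly one .res file, i.e. one geomopt per invocation.
for f in *.res; do
    d="${f%.res}"
    mkdir -p "$d"
    cp "$f" TiO2.cell TiO2.param "$d"/
done

# Keep 8 run3 instances in flight at once, each doing a single 16-core geomopt.
ls -d */ | parallel -j 8 'cd {} && run3 --slurm -nc 16 -np 1 TiO2'
```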
Sorry, I might have grabbed the wrong end of the stick there.
Yes, sorry, the suggestion was that they get added! I can do the code but won't be able to test - let me know if you want to have a go in the meantime (I will post here if I begin implementing it). |
Greetings from Warwick.
This PR fixes the missing pass-through of run3 arguments into compute.py when `--slurm` is specified. Without this fix, every instance of CASTEP launched by run3 tries to use all the processors in the SLURM reservation, which with `-np` > 1 leads to massive contention, huge server load, and failures when the .castep file doesn't appear after 30 s. Bad times.
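To illustrate the failure mode (the launcher lines below are a sketch, not the actual code path in compute.py): inside a batch allocation, `srun` without an explicit task count inherits the job's full `--ntasks`, so each concurrent instance tries to grab the whole reservation.

```bash
# Before the fix (sketch): every one of the -np concurrent instances runs
srun castep.mpi "$seed"              # inherits the full --ntasks of the job

# After the fix (sketch): each instance is limited to its own share of cores
srun -n "$ncores" castep.mpi "$seed"
```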
I've tested this on the TiO2 tutorial example on Sulis (sulis.ac.uk) with the following script, which should also work fine for larger batches of geomopts on greater numbers of nodes.
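A minimal version of such a script, with site-specific details (budget code, module names, walltime) left as placeholders and the `-nc`/`-np` values taken from the example discussed above:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=128    # one full Sulis compute node (128 cores)
#SBATCH --time=02:00:00
#SBATCH --account=changeme       # placeholder budget code

# Site-specific: load whichever CASTEP + MPI stack the cluster provides.
module purge
module load CASTEP               # placeholder module name

# Launch the TiO2 geometry optimisations: 16 x 8 = 128 cores, so the node is
# filled by concurrent CASTEP instances rather than oversubscribed.
run3 --slurm -nc 16 -np 8 TiO2
```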
Side note: the matador install script seems to pull in generic builds of NumPy and SciPy that are newer than the hardware-optimised versions provided by the SciPy-bundle module on the cluster. I wouldn't recommend requiring such recent versions unless you specifically need newer functionality. Not a major issue here, as I imagine there isn't much CPU-intensive work going on inside matador.