Failure launching MPI application on Slurm machine (unable to allocate shared memory) #5208
-
After doing some Googling around and seeing some reports of differing behaviors with versions of Intel MPI, I thought I should try a different version to see if it made any difference. And it did. In fact, Intel 2019 does work correctly! It allocates tasks correctly and runs fine. I wish I had thought to try this earlier. Switching back to 2022.3.0 makes it fail. Recompiling with each compiler before running it makes no difference. The only thing that affects the behavior is which version is loaded at run time.
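For anyone reproducing this, a minimal sketch of the runtime swap (module names are assumptions about this site's naming and ./mpi_app stands in for the real binary; the build is left untouched, only the module loaded at run time changes):

```sh
# build once, then only change which Intel MPI is loaded at run time
module load intel-mpi/2019.x          # assumed module name: this one works
flux run -n4 -N2 ./mpi_app

module swap intel-mpi/2019.x intel-mpi/2022.3.0   # assumed module name: this one fails
flux run -n4 -N2 ./mpi_app
```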
-
It turns out that
-
Thanks! Looks like I'm trying to remember how this shared memory stuff works! One thing to check might be to run
Apologies for not having any better ideas at the moment!
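A sketch of that kind of check, comparing what a task sees under each launcher (assumes both srun and a flux instance are available on the same nodes):

```sh
# compare resource limits and /dev/shm capacity as seen by a task
srun -n1 sh -c 'ulimit -a; df -h /dev/shm'
flux run -n1 sh -c 'ulimit -a; df -h /dev/shm'
```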
-
One interesting thing in the PMI trace: I would have expected intel mpi (as an mpich derivative) to fetch
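For more detail on both sides, a hedged sketch (the flux shell's verbose option and Intel MPI's I_MPI_DEBUG are the two knobs assumed here; ./mpi_app is a placeholder):

```sh
# debug-level flux shell logging, which includes PMI server operations
flux run -o verbose=2 -n4 -N2 ./mpi_app

# Intel MPI's own startup trace (fabric/shm selection, rank placement)
I_MPI_DEBUG=5 flux run -n4 -N2 ./mpi_app
```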
-
@garlick - The limits look to be the same. Unfortunately, the
-
Bummer! I suppose we could try running all the tasks on separate nodes which should have the same effect, e.g.
Running all the tasks on one node might be interesting too.
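For example, a sketch of both variants (./mpi_app is a placeholder; counts assume a 2-node instance):

```sh
# one task per node: no two ranks share a node, so no intra-node shm
flux run -n2 -N2 ./mpi_app

# all tasks on a single node: only intra-node shared memory in play
flux run -n4 -N1 ./mpi_app
```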
-
Ok, this is weird.
-
This is interesting... I used
-
That probably makes sense - mpi ranks don't share nodes when -n is less than -N, so shared memory cannot be used between any of the ranks. I was a little surprised in your earlier result that -n2 -N1 worked since those two ranks would be on the same node. However, maybe the problem is in how the mapping is communicated and we have the wrong groups of nodes trying to use shared memory to talk to each other?
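One cheap way to see which node each rank actually lands on, independent of MPI (sketch):

```sh
# -l (--label-io) prefixes each output line with the task rank,
# so this shows the rank-to-node mapping flux is using
flux run -l -n4 -N2 hostname | sort -n
```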
-
And when I have 4 nodes in my total allocation from
-
Wait, -n4 -N2 was the original failing case. What's different here?
-
The difference is that I obtained a total allocation of 4 nodes instead of only 2 before I started flux.
-
Puzzling. If you run
I mean, you ran flux with
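A sketch of one way to double-check what each broker rank reports (standard flux-core commands assumed):

```sh
# print the hostname of every broker rank in the instance
flux exec hostname

# and the node/core view the instance thinks it has
flux resource list
```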
-
Yes. They are all different.
-
The pattern seems to be: For any N > 1, it will fail if n > total number of nodes in the allocation. For example, this is with 3 nodes in the allocation:
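A sketch of a sweep that exercises that pattern, assuming a 3-node allocation and a placeholder MPI binary ./mpi_hello:

```sh
# with 3 nodes in the allocation, failures are expected once n > 3
for n in 2 3 4 5 6; do
  printf '== -n%d -N2 ==\n' "$n"
  flux run -n"$n" -N2 ./mpi_hello && echo OK || echo FAILED
done
```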
-
I just went back to my Parsl program and tried to reproduce the successful behavior above, but got this when my MPI program tried to launch:
Looking at the launch script, it seems correct to me. I'm not sure yet what is going on there.
-
Hmm, we've seen that in the past when flux tries to bootstrap an application that is linked against slurm's libpmi2. Maybe this could be worked around by setting LD_LIBRARY_PATH to point to flux's libpmi2.so?
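A sketch of that workaround; the flux library directory below is an assumption and varies with the install prefix:

```sh
# put flux's libpmi2.so ahead of slurm's on the runtime linker path
export LD_LIBRARY_PATH="/usr/lib64/flux:${LD_LIBRARY_PATH}"
flux run -n4 -N2 ./mpi_app    # ./mpi_app is a placeholder
```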
-
The application run in my Parsl program (
No pmi library is linked at all.
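For reference, the kind of check that shows this (the executable name is a placeholder):

```sh
# list the dynamic dependencies and look for any PMI library
ldd ./mpi_app | grep -i pmi
```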
-
Ah... hold on. I am also printing out the environment right before the call to run the executable, and it contains this:
The full (ugly) output of the grep (ignore the use of
-
We have a flux shell plugin that rewrites I_MPI_PMI_LIBRARY to point to the flux-provided PMI library.
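One way to confirm what a task actually sees after the shell plugins have run (sketch):

```sh
# print the PMI library setting inside a flux-launched task
flux run -n1 printenv I_MPI_PMI_LIBRARY
```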
-
This is what my Parsl App looks like. Maybe the replacement happens during the call to the executable, and so even though it's in the environment here, it gets replaced at launch?
-
So flux effectively runs that little scriptlet on each rank? If so then this is past where flux would be doing anything to the environment. I wonder if the
Edit: I am not at all familiar with Parsl so that comment might be way off
-
That is my understanding but @jameshcorbett could probably confirm. The
-
What about adding something like this after the module load?
Edit: it's unfortunate that the module assumes impi will only be used under slurm. Maybe long term the site could address that?
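A sketch of the kind of override meant here; the module name and the flux path are assumptions, and whether to unset the variable or point it at flux's copy is a judgment call:

```sh
module load intel-mpi/2022.3.0        # assumed module name
# the module points I_MPI_PMI_LIBRARY at slurm's PMI library;
# clear it so flux's shell plugin can set it for each job...
unset I_MPI_PMI_LIBRARY
# ...or pin it to flux's copy explicitly (assumed path):
# export I_MPI_PMI_LIBRARY=/usr/lib64/flux/libpmi.so
```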
-
I understand why that's happening now. I'm not sure it will help us, but what's happening is that when you submit an app to the FluxExecutor, it grabs the environment from the process it was submitted from (e.g. a process running on a login node of the cluster, if that's where you invoked parsl from). So naturally there are no
Yeah, that's what's going on.
-
@garlick - Using
At this point, the thing I'm confused about is the Parsl configuration and use of the
-
@jameshcorbett - We've made a ton of progress and I'm trying to apply what we learned with our experiments on the Flux-only side to my Parsl program with the
Just to make sure we are all on the same page after playing around with lots of different settings, I'd like to spell out what I'm currently running:
-
The behavior I'm seeing now is that the application runs to completion without any errors. However, the output file,
If I change the
This makes me wonder whether using the
-
Ah, right... Of course. I have some very good news! If I use the
It works. And by that I mean the stdout file in
If I try to use the
Bottom line: I can get it all to work correctly using
-
One outstanding question I have is: Should the environment set up in the
-
A user on a Slurm scheduled machine can launch an MPI application with `srun` fine but can't with Flux. The versions in use are [email protected] and [email protected], with Python 3.9.15. After launching Flux with `srun --tasks-per-node=1 --pty -c40 flux start`:
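For completeness, a minimal sketch of the launch sequence that was failing, based on the -n4 -N2 case discussed above (./mpi_app is a placeholder):

```sh
# start a flux instance under slurm (inside a 2-node allocation)
srun --tasks-per-node=1 --pty -c40 flux start

# then, inside the flux instance, the originally failing case
flux run -n4 -N2 ./mpi_app
```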