Adding new daemons issue in prte_odls_base_default_construct_child_list #2084
Had to read this a few times to fully grok what you are doing, but I'm not surprised it would fail. The "add-host" capability remains under development as it requires some significant re-design, and the initial implementation ran into problems when folks started testing it.

For example: suppose I have two applications (or just two procs within an app - doesn't matter) that each do a spawn that includes an "add-host" request, and the spawn specifies that the caller wants to start N procs/node. Currently, PRRTE acts as a state machine where each step of the launch procedure is treated as an independent state. So we treat each "add-host" request independently of the mapping of each specified job - which means we might wind up adding the hosts from both of the spawn requests before we start to map the specified processes, leading to more procs than either request was expecting.

We could try to fix this, for example, by treating a spawn request as an atomic operation. However, that means the number of procs the two requests might see depends upon which spawn request gets handled first. In other words, the number of nodes in the DVM for the first one we process will be the current nodes plus the new ones that request specified. But remember that the second requestor doesn't know about those new nodes - it only "saw" the current ones, and therefore expects that its job will include the current nodes plus only the new ones it specified. So we get a race condition that leads to unexpected results.

That takes us to a different possible solution: take a "snapshot" of the current nodes when a spawn request is made, and treat any "add-host" option as relative to that snapshot. This provides the desired isolation between competing spawns - at the price of some significant complexity.

Unsure of the final resolution. There is at least one other person working on it, but I need to check on their progress. I thought they had something to propose, but haven't seen it yet.

@HawkmoonEternal - can you provide any insight on your progress to add support for "add-host"? I seem to recall you had something working? If so, can we see a PR for it?

Anyway, we can leave this open as it has an interesting reproducer. Just don't expect a solution anytime soon.

BTW: this cmd line works because you launched daemons on every node in the allocation, and all your spawn requests are doing is adding procs to the existing daemons (i.e., your "add-host" requests are being ignored because the host is already known). Your other cmd line actually is trying to expand the DVM, and that isn't working at this time.
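For context, this is roughly what such an "add-host" request looks like from the application side - a minimal sketch, with the hostname and worker binary as placeholders:

```c
/* Minimal sketch of an application-level "add-host" spawn request
 * ("node05" and "./worker" are placeholders): the info key asks the RTE
 * to add the host to the job before mapping the spawned procs. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm child;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    MPI_Info_set(info, "add-host", "node05");   /* grow the DVM by one node */

    /* the spawn itself carries the add-host request discussed above */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 2, info, 0,
                   MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```

As noted above, if the named host is already part of the DVM the request is effectively a no-op and the procs just land on the known daemons; only when the DVM actually has to grow does the problematic path get exercised.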
Background information
What version of the PMIx Reference RTE (PRRTE) are you using? (e.g., v2.0, v3.0, git master @ hash, etc.)
What version of PMIx are you using? (e.g., v4.2.0, git branch name and hash, etc.)
OMPI5 version tested with
Slurm version tested with
slurm 23.02.7
Please describe the system on which you are running
Details of the problem
This is an issue tangentially related to this MPI issue I opened a while back, which was resolved by a change in PRRTE that allows child processes to outlive their parents. I have been experimenting with this new functionality, and I've discovered a potential PRRTE issue related to it. It appears when expanding the DVM using MPI_Comm_spawn(_multiple) calls, in my case inside a Slurm allocation but with --prtemca ras ^slurm and --prtemca plm ^slurm set (I assume the important detail is that PRRTE has to add a daemon to do the spawn).

I've created a reproducer which consistently crashes due to this problem: an MPI program that grows, via spawn calls, from 1, to 2, to 4, and then to 8 processes on separate nodes. It takes a list of nodes with appropriate slots as input in argv. Without --prtemca ras ^slurm and --prtemca plm ^slurm set it works fine, but with them the crash occurs when attempting to grow to 8 nodes.

Having investigated the issue, I've narrowed it down to this location:
prrte/src/mca/odls/base/odls_base_default_fns.c, in the function prte_odls_base_default_construct_child_list
Specifically, it is inside this loop:
while (PMIX_SUCCESS == rc) { (line 461)
For some reason, it will normally iterate once, unpacking data in state 16 (PRTE_JOB_STATE_REGISTERED), then find more data to unpack in state 35 (PRTE_JOB_STATE_NOTIFIED) and proceed with a second iteration. In this second iteration, the code tries to access jdata->map, which is null, at
pmix_pointer_array_add(jdata->map->nodes, pptr->node); (line 518)
and the PRRTE daemon crashes instantly.

The issue can be fixed by adding a null check and discarding the jdata, e.g. by changing line 479 to
if (NULL != prte_get_job_data_object(jdata->nspace) || NULL == jdata->map) {
but I am not knowledgeable enough about PRRTE's code to know whether this would cause other issues, or whether there is a more proper way to handle it and address the underlying cause.
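Spelled out with the reasoning attached, the same one-line change annotated with why it avoids the crash (a sketch based on my reading of the code, not an exact patch):

```c
/* In prte_odls_base_default_construct_child_list(), around line 479.
 * As I read it, the existing check discards the unpacked jdata when the
 * job is already known; the extra "|| NULL == jdata->map" also discards
 * a jdata that arrives without a map, so the second loop iteration can
 * no longer dereference jdata->map->nodes at line 518 and crash the
 * daemon. */
if (NULL != prte_get_job_data_object(jdata->nspace) || NULL == jdata->map) {
```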
Here's the code for the reproducer:
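In outline, it follows this pattern - a minimal sketch, assuming each generation doubles the current processes onto the next host from argv via the "add-host" info key and that the key accepts the host[:slots] strings passed on the command line; the actual reproducer differs in detail:

```c
/* Sketch of the doubling-spawn pattern: launched as a single process with
 * three hosts on the command line (e.g. "nodeA:1 nodeB:2 nodeC:4"),
 * generation 0 spawns 1 proc, generation 1 spawns 2, generation 2 spawns 4,
 * so the job grows 1 -> 2 -> 4 -> 8. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NGEN 3 /* three doublings: 1 -> 2 -> 4 -> 8 */

int main(int argc, char **argv)
{
    MPI_Comm parent, inter, world, merged;
    char *spawn_argv[NGEN + 2];
    char genbuf[8];
    int gen = 0, size, rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* generation 0: started as a singleton with hosts in argv[1..NGEN] */
        MPI_Comm_dup(MPI_COMM_WORLD, &world);
    } else {
        /* spawned child: join the spawning processes; the parent appended
         * our generation number to the forwarded host list */
        MPI_Intercomm_merge(parent, 1, &world);
        gen = atoi(argv[argc - 1]);
    }

    for (; gen < NGEN; gen++) {
        MPI_Info info;

        MPI_Comm_size(world, &size);

        /* forward the host list and the children's generation number */
        for (i = 0; i < NGEN; i++) {
            spawn_argv[i] = argv[1 + i];
        }
        snprintf(genbuf, sizeof(genbuf), "%d", gen + 1);
        spawn_argv[NGEN] = genbuf;
        spawn_argv[NGEN + 1] = NULL;

        /* ask the RTE to grow the DVM onto the next host before mapping */
        MPI_Info_create(&info);
        MPI_Info_set(info, "add-host", argv[1 + gen]);

        /* collectively spawn as many new procs as currently exist,
         * doubling the job */
        MPI_Comm_spawn(argv[0], spawn_argv, size, info, 0, world,
                       &inter, MPI_ERRCODES_IGNORE);
        MPI_Info_free(&info);

        /* merge so the next spawn is collective over old and new procs */
        MPI_Intercomm_merge(inter, 0, &merged);
        MPI_Comm_free(&inter);
        MPI_Comm_free(&world);
        world = merged;
    }

    MPI_Comm_rank(world, &rank);
    MPI_Comm_size(world, &size);
    if (0 == rank) {
        printf("final size: %d\n", size);
    }

    MPI_Comm_free(&world);
    MPI_Finalize();
    return 0;
}
```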
And here is the Slurm script I used to run it:
Just let me know if you need any more info!