
Adding new daemons issue in prte_odls_base_default_construct_child_list #2084

Open

Petter-Programs opened this issue Nov 26, 2024 · 1 comment

@Petter-Programs commented Nov 26, 2024

Background information

What version of the PMIx Reference RTE (PRRTE) are you using? (e.g., v2.0, v3.0, git master @ hash, etc.)

commit 0bfee1d11c760f5e6f9ac2f20ef98c0adebce732 (HEAD -> master, origin/master, origin/HEAD)
Author: Ralph Castain <[email protected]>
Date:   Mon Nov 18 11:19:34 2024 -0700

What version of PMIx are you using? (e.g., v4.2.0, git branch name and hash, etc.)

commit d76deb390365e128779a27c25ebed26bbd9c7f5f (HEAD -> master, origin/master, origin/HEAD)
Author: Ralph Castain <[email protected]>
Date:   Sun Nov 17 14:25:49 2024 -0700

OMPI5 version tested with

commit 448c3ba2d1b8dced090e5aefb7ccb07588613bcd (HEAD -> main, origin/main, origin/HEAD)
Merge: 30b1fb00ae 071bba8ce6
Author: Edgar Gabriel <[email protected]>
Date:   Wed Oct 16 13:59:04 2024 -0500

Slurm version tested with

slurm 23.02.7

Please describe the system on which you are running

  • Operating system/version: Red Hat Enterprise Server; Linux 5.14.x (EL9-based distribution)
  • Computer hardware: Intel Sapphire Rapids 8480
  • Network type: NDR200

Details of the problem

This issue is tangentially related to this MPI issue I opened a while back, which was resolved by a change made in PRRTE to allow child processes to outlive their parents. I have been experimenting with this new functionality, and I've discovered a potential PRRTE issue related to it. It appears when expanding the DVM using MPI_Comm_spawn(_multiple) calls, in my case inside a Slurm allocation but with --prtemca ras ^slurm and --prtemca plm ^slurm set (I assume the important detail is that PRRTE has to add a daemon to perform the spawn). I've created a reproducer which consistently crashes due to this problem: an MPI program that grows, using spawn calls, from 1 to 2, to 4, and then to 8 processes on separate nodes. It takes a list of nodes with appropriate slots as input in argv. Without --prtemca ras ^slurm and --prtemca plm ^slurm set it works fine, but with them the crash occurs when attempting to grow to 8 nodes.
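
The essential trigger, distilled from the full reproducer further below, is a spawn whose MPI_Info asks PRRTE to add a host it has no daemon on yet. This is only a minimal sketch; ./child-binary and the nodeX:1 hostname:slots string are placeholders, not values from my setup:

#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* Ask PRRTE to grow the DVM onto a node outside the current one,
     * then place the spawned child there. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "PMIX_ADD_HOST", "nodeX:1");
    MPI_Info_set(info, "host", "nodeX:1");

    MPI_Comm child;
    int errcode;
    MPI_Comm_spawn("./child-binary", MPI_ARGV_NULL, 1, info,
                   0, MPI_COMM_SELF, &child, &errcode);

    MPI_Info_free(&info);
    MPI_Comm_free(&child);
    MPI_Finalize();
    return 0;
}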

Having investigated the issue, I've narrowed it down to this location:
src/mca/odls/base/odls_base_default_fns.c, in the function prte_odls_base_default_construct_child_list.

Specifically, it is inside this loop:
while (PMIX_SUCCESS == rc) { (line 461)

For some reason, it iterates once, unpacking data in state 16 (PRTE_JOB_STATE_REGISTERED), then finds more data to unpack in state 35 (PRTE_JOB_STATE_NOTIFIED) and proceeds with a second iteration. In this second iteration the code dereferences jdata->map, which is NULL (pmix_pointer_array_add(jdata->map->nodes, pptr->node); at line 518), and the PRRTE daemon crashes instantly.

The issue can be worked around by adding a NULL check and discarding the jdata, e.g. by changing line 479 to
if (NULL != prte_get_job_data_object(jdata->nspace) || NULL == jdata->map) {
but I am not knowledgeable enough about PRRTE's code to know whether this would cause other issues, or whether there is a more proper way to handle it that addresses the underlying cause.
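
For context, here is a rough sketch of where that check sits relative to the crash site. This is paraphrased from my description above, not copied from the PRRTE source; only jdata, pptr, rc, and the named functions and line numbers come from the actual code:

/* Paraphrased sketch of prte_odls_base_default_construct_child_list
 * (src/mca/odls/base/odls_base_default_fns.c); not verbatim source. */
while (PMIX_SUCCESS == rc) {          /* line 461: unpack loop */
    /* ... unpack the next job object into jdata ... */

    /* Line 479: the existing duplicate-job check, extended with the
     * proposed NULL-map guard so the second-iteration jdata (state
     * PRTE_JOB_STATE_NOTIFIED, whose map was never constructed) is
     * discarded rather than dereferenced. */
    if (NULL != prte_get_job_data_object(jdata->nspace) || NULL == jdata->map) {
        /* ... release jdata and continue unpacking ... */
    }

    /* ... mapping proceeds; line 518 crashes when jdata->map is NULL:
     *     pmix_pointer_array_add(jdata->map->nodes, pptr->node);      */
}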

Here's the code for the reproducer:

#include <mpi.h>
#include <stdio.h>
#include <stdbool.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>

#define ROOT 0
#define SLEEP_TIME 5
#define STRING_SIZE 256
#define MAX_ARGC 64

// MIN_WORLD_SIZE=1 and MAX_WORLD_SIZE=8 parameters reproduce the crash
#define MIN_WORLD_SIZE 1
#define MAX_WORLD_SIZE 8

// Start with MIN_WORLD_SIZE, then double the number of processes at each step until MAX_WORLD_SIZE is reached.

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    MPI_Comm parent_comm;

    MPI_Comm_get_parent(&parent_comm);

    bool i_am_parent = parent_comm == MPI_COMM_NULL;

    int my_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Some checks to make sure the program is being run as expected

    if (i_am_parent && world_size != MIN_WORLD_SIZE)
    {
        if (my_rank == ROOT)
            printf("Please run with exactly %d process initially or set MIN_WORLD_SIZE differently.\n", MIN_WORLD_SIZE);

        MPI_Finalize();
        return EXIT_FAILURE;
    }

    if (argc < MAX_WORLD_SIZE + MAX_WORLD_SIZE / 2)
    {
        if (my_rank == ROOT)
            printf("Please provide at least %d arguments (or set MAX_WORLD_SIZE so that argc >= MAX_WORLD_SIZE + MAX_WORLD_SIZE/2).\n", MAX_WORLD_SIZE + MAX_WORLD_SIZE / 2);

        MPI_Finalize();
        return EXIT_FAILURE;
    }

    if (argc > MAX_ARGC)
    {
        if (my_rank == ROOT)
            printf("Provided %d arguments but MAX_ARGC set to %d. Please adjust it.", argc, MAX_ARGC);

        MPI_Finalize();
        return EXIT_FAILURE;
    }

    int nodename_length;
    char nodename[MPI_MAX_PROCESSOR_NAME];

    MPI_Get_processor_name(nodename, &nodename_length);

    char nodes_inuse[MAX_WORLD_SIZE][MPI_MAX_PROCESSOR_NAME];

    // Get a list of the nodes currently in use

    MPI_Allgather(nodename, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, nodes_inuse, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, MPI_COMM_WORLD);

    if (my_rank == ROOT)
    {
        printf("Nodes in use in group with %d processes: ", world_size);
        for (int i = 0; i < world_size; i++)
            printf("%s%s", (i == 0 ? "" : ","), nodes_inuse[i]);

        printf("\n");
    }

    // Give parents some time to terminate
    if (!i_am_parent)
    {
        // Disconnect the communicators so that the parents can exit
        MPI_Comm_free(&parent_comm);

        for (int i = 0; i < SLEEP_TIME; i++)
        {
            if (my_rank == ROOT)
                printf("Child ranks in world size %d sleeping for another %d seconds...\n", world_size, SLEEP_TIME - i);

            sleep(1);
        }
    }

    int next_world_size = world_size * 2;

    if (next_world_size <= MAX_WORLD_SIZE)
    {
        MPI_Comm spawned_comm;
        int error_codes[MAX_WORLD_SIZE];

        char *command_pointers[MAX_WORLD_SIZE]; // Fill with pointers to argv[0]
        int child_per_command[MAX_WORLD_SIZE];  // Fill with the integer 1 for each command
        MPI_Info command_infos[MAX_WORLD_SIZE]; // Fill with host target information

        char ***the_argvs = MPI_ARGVS_NULL;     // Provide pointers to current argv array for each spawned process
        char *destination_pointers[MAX_ARGC];   // Store pointers to argv entries with hosts to spawn into 

        if (my_rank == ROOT)
        {
            the_argvs = malloc(sizeof(char **) * next_world_size);

            for (int i = 0; i < next_world_size; i++)
            {
                // Pass on the argv array to each of the spawned processes
                the_argvs[i] = malloc(sizeof(char *) * argc);

                for (int j = 0; j < argc - 1; j++)
                {
                    the_argvs[i][j] = argv[j+1];
                }

                the_argvs[i][argc - 1] = NULL;
            }

            for (int i = 0; i < MAX_WORLD_SIZE; i++)
            {
                command_pointers[i] = argv[0];
                child_per_command[i] = 1;
                MPI_Info_create(&command_infos[i]);
            }

            // Store pointers to the argv entries containing nodes not in current MPI_COMM_WORLD

            int destinations_added = 0;

            for (int i = 1; i < argc; i++)
            {
                bool in_use = false;
                for (int j = 0; j < world_size; j++)
                {
                    // Compare to list of nodes in use in MPI_COMM_WORLD
                    if (strstr(argv[i], nodes_inuse[j]) != NULL)
                    {
                        in_use = true;
                        break;
                    }
                }

                // Add as possible destination if not in MPI_COMM_WORLD
                if (!in_use)
                {
                    destination_pointers[destinations_added] = argv[i];
                    destinations_added++;
                }
            }

            if (destinations_added < next_world_size)
            {
                printf("Did not find enough options in argv to spawn into. Aborting.\n");
                MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
                return EXIT_FAILURE;
            }

            for (int i = 0; i < next_world_size; i++)
            {
                // Assign hosts
                MPI_Info_set(command_infos[i], "PMIX_ADD_HOST", destination_pointers[i]);
                MPI_Info_set(command_infos[i], "host", destination_pointers[i]);
            }
        }

        MPI_Comm_spawn_multiple(next_world_size, command_pointers, the_argvs, child_per_command, command_infos, ROOT, MPI_COMM_WORLD, &spawned_comm, error_codes);
        MPI_Comm_free(&spawned_comm);

        // Just free up some resources
        if (my_rank == ROOT)
        {
            for (int i = 0; i < MAX_WORLD_SIZE; i++)
            {
                MPI_Info_free(&command_infos[i]);
            }

            if (argc > 1)
            {
                for (int i = 0; i < next_world_size; i++)
                {
                    free(the_argvs[i]);
                }
                free(the_argvs);
            }
        }
    }

    MPI_Finalize();

    printf("Past finalize from %s rank %d/%d on %s.\n", i_am_parent ? "parent" : "child", my_rank, world_size - 1, nodename);

    return EXIT_SUCCESS;
}

And here is the Slurm script I used to run it:

#!/bin/bash
#SBATCH --time=00:02:30
#SBATCH --exclusive

# Number of nodes
#SBATCH -N12

#SBATCH --tasks-per-node=1

NODELIST="$(scontrol show hostname $SLURM_JOB_NODELIST)"

NODELIST_WITH_COUNTS=""
for node in $NODELIST; do
    NODELIST_WITH_COUNTS+="${node}:${SLURM_NTASKS_PER_NODE} "
done

echo "This should work..."

cmd="prterun -np 1 ./minimal-issue-reproduce $NODELIST_WITH_COUNTS"

echo $cmd
$cmd

echo "======================================"

echo "This should crash..."

cmd="prterun --host $(hostname):$SLURM_NTASKS_PER_NODE --prtemca ras ^slurm --prtemca plm ^slurm -np 1 ./minimal-issue-reproduce $NODELIST_WITH_COUNTS"

echo $cmd
$cmd

Just let me know if you need any more info!

@rhc54 (Contributor) commented Nov 26, 2024

Had to read this a few times to fully grok what you are doing, but I'm not surprised it would fail. The "add-host" capability remains under development as it requires some significant re-design. The initial implementation ran into problems when folks started testing it.

For example: suppose I have two applications (or just two procs within an app - doesn't matter) that do a spawn that includes an "add-host" request, and the spawn specifies that the caller wants to start N procs/node. Currently, PRRTE acts as a state machine where each step of the launch procedure is treated as an independent state.

So we treat each "add-host" request independently of the mapping of each specified job - which means we might wind up adding the hosts from both spawn requests before we start to map the specified processes, leading to more procs than either request was expecting.

We could try to fix this, for example, by treating a spawn request as an atomistic operation. However, that means that the number of procs the two requests might see depends upon which spawn request gets handled first. In other words, the number of nodes in the DVM for the first one we process will be the current nodes plus the new ones the request specified. However, remember that the second requestor doesn't know about those new nodes - it only "saw" the current ones, and therefore expects that its job will include the current nodes plus only the new ones that it specified.

So we get a race condition that leads to unexpected results, which takes us to a different possible solution. Take a "snapshot" of the current nodes when a spawn request is made, and treat any "add-host" option as relative to that snapshot. This provides the desired isolation between competing spawns - at the price of some significant complexity.
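
To make that snapshot idea concrete, here is a minimal sketch using entirely hypothetical names (node_list_t, snapshot, add_hosts, and so on are not PRRTE APIs):

#include <stdio.h>
#include <string.h>

#define MAX_NODES 16

/* Hypothetical model of the DVM's node list. */
typedef struct {
    char nodes[MAX_NODES][64];
    int count;
} node_list_t;

/* Copy the DVM node list at the moment a spawn request arrives,
 * giving the request a private view of the DVM. */
static node_list_t snapshot(const node_list_t *dvm)
{
    return *dvm; /* struct copy */
}

/* Apply a request's "add-host" entries to its own snapshot only. */
static void add_hosts(node_list_t *view, const char **hosts, int n)
{
    for (int i = 0; i < n && view->count < MAX_NODES; i++)
        strncpy(view->nodes[view->count++], hosts[i], 63);
}

int main(void)
{
    node_list_t dvm = {{"node1", "node2"}, 2};

    /* Two concurrent spawn requests, each adding a different host. */
    const char *req_a_hosts[] = {"node3"};
    const char *req_b_hosts[] = {"node4"};

    node_list_t view_a = snapshot(&dvm);
    node_list_t view_b = snapshot(&dvm);
    add_hosts(&view_a, req_a_hosts, 1);
    add_hosts(&view_b, req_b_hosts, 1);

    /* Request A maps over node1,node2,node3 and never sees node4;
     * request B likewise never sees node3. With N procs/node, each
     * job gets exactly the proc count its caller expected. */
    printf("A maps over %d nodes, B maps over %d nodes\n",
           view_a.count, view_b.count);
    return 0;
}

Handled this way, each request's proc count is deterministic regardless of which spawn gets processed first; the cost, as noted, is the extra complexity of maintaining and reconciling those per-request views with the real DVM.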

Unsure of the final resolution. There is at least one other person working on it, but I need to check on their progress. I thought they had something to propose, but haven't seen it yet.

@HawkmoonEternal - can you provide any insight on your progress to add support for "add-host"? I seem to recall you had something working? If so, can we see a PR for it?

Anyway, we can leave this open as it has an interesting reproducer. Just don't expect a solution anytime soon.

BTW: this cmd line

prterun -np 1 ./minimal-issue-reproduce $NODELIST_WITH_COUNTS

works because you launched daemons on every node in the allocation, and all your spawn requests are doing is simply adding procs to the existing daemons (i.e., your "add-host" requests are being ignored because the host is already known). Your other cmd line actually is trying to expand the DVM, and that isn't working at this time.
