LaunchMON hang #36
Comments
FWIW, below is a successful run of test.attach_1 (ignore those first 4 srun errors). Can you attach to the hung process and give a stack trace of where the test hangs? Any more output info from the test runs would be helpful too.
bash-4.2$ ./test.attach_1
[LMON FE] Please check the correctness of the following resource handle
[LMON FE] RM launcher's pid is 203029
[LMON FE] PASS: run through the end
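One way to collect the requested stack trace is to attach gdb to the hung process in batch mode. The sketch below is only an illustration: the helper names are hypothetical, and it assumes gdb is installed and that you have permission to attach to the process.

```python
import shutil
import subprocess


def gdb_backtrace_cmd(pid):
    """Build a gdb command that attaches to `pid`, dumps all thread stacks, and exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def collect_backtrace(pid, timeout=60):
    """Run gdb against `pid` and return its stdout, or None if gdb is not on PATH."""
    if shutil.which("gdb") is None:
        return None
    result = subprocess.run(
        gdb_backtrace_cmd(pid),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

Calling `collect_backtrace(pid)` with the pid of the hung test (or of the launcher it spawned) should return text that can be pasted into the issue.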
I believe my failure occurs prior to attachment to the remote job, so I collected a backtrace from the coredump file instead:
(gdb) bt
LaunchMON[334492]: LaunchMON security error in handshake: COBO/PMGR Handshake Security Error. My uid = xxx. Server at xxx:20101 took my connection from yyy:52020, but failed with error: Bad credential provided: Rewound credential
In an older run, I had enabled some tracing within the shell script called "fe_attach_smoketest". Several of the variables resolve to empty strings, which I speculate may be the result of build malfunctions.
@jeffreybquinn: It seems the LaunchMON front end fails to connect to the back-end daemons. Could you configure with
With the U.S. holidays now out of the way, there's lots to summarize here. With this change plus a few other changes from Ralph Castain's 12/12 email on the OpenMPI mailing list, we're able to build LaunchMON (GNU compilers, OpenMPI 1.10.7) and pass the LaunchMON smoke tests. :) During execution of STAT script tests, we encounter new issues:
Issue 1) is a reattach issue. One should be able to attach, detach, and reattach a debugger and expect orterun to handle this properly. You are using quite an old version of OpenMPI, so I don't know if there is much hope in getting that version to work. I haven't tested it recently, but I seem to recall that this should work in more recent versions of OpenMPI.
@jeffreybquinn: for OpenMPI/orterun, LaunchMON uses MPIR_attach_fifo support (http://mpi-forum.org/docs/mpir-specification-10-11-2010.pdf). It sends a small message to the FIFO opened by orterun asking this launcher to launch tool daemons on the nodes where the MPI processes are running. I believe what @lee218llnl characterizes above is accurate.
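To make the FIFO handshake described above more concrete, here is a toy simulation of the pattern: one side blocks on a named pipe and the other writes a small request into it. This is not LaunchMON or orterun code. The function names, the FIFO path, and the one-byte payload are all assumptions made for illustration, and it runs only on POSIX systems that support `os.mkfifo`.

```python
import os
import tempfile
import threading


def launcher_side(fifo_path, received):
    """Stand-in for the launcher: block until a tool writes to the FIFO."""
    with open(fifo_path, "rb") as fifo:
        received.append(fifo.read(1))


def tool_side(fifo_path):
    """Stand-in for the tool front end: send a small attach request."""
    with open(fifo_path, "wb") as fifo:
        fifo.write(b"\x01")  # payload value is an assumption, not taken from the thread


def demo():
    """Create a throwaway FIFO, run both sides, and return what the launcher saw."""
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "attach_fifo")  # hypothetical path
    os.mkfifo(fifo_path)
    received = []
    reader = threading.Thread(target=launcher_side, args=(fifo_path, received))
    reader.start()
    tool_side(fifo_path)
    reader.join(timeout=5)
    os.remove(fifo_path)
    os.rmdir(workdir)
    return received
```

In the real setup the launcher would react to the request by starting tool daemons next to the MPI processes. If the launcher never reads the FIFO, the tool's `open` call blocks, which is one way a reattach attempt could appear to hang.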
We have recently upgraded our internal development cluster (Xeon Skylake Gold 6140/38 with SLES12 SP2). In rebuilding the STAT debug tool and its dependencies such as LaunchMON, we've encountered hangs in the LaunchMON smoke tests test.attach_1 and test.launch_1.
Versions in use:
LaunchMON 1.0.2
gcc 5.4.0
openmpi 1.10.7
slurm 16.05.10-2
(These are the versions specified by our BKC build recipe. Our plan is to stage updating to newest versions after the baseline has been re-established.)
For our debugging, we were hoping to gain access to logs/traces of successful runs of these two smoke tests on a similar configuration. We believe a differential analysis of this sort can help point us toward the configuration and build settings we need to adjust. We are additionally collecting strace logs to narrow down the hang point, but we are having trouble interpreting them because we lack in-depth familiarity with how the tests and the library operate.
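As a rough sketch of the differential analysis we have in mind, the snippet below normalizes values that vary from run to run (leading pids, pointer values, large numbers) and then diffs a known-good log against a failing one. The helper names and normalization rules are our own assumptions; real strace output will likely need extra rules, for example for timestamps.

```python
import difflib
import re


def normalize(line):
    """Mask values that differ between otherwise identical runs."""
    line = re.sub(r"^\d+\s+", "", line)                # leading pid column, if present
    line = re.sub(r"0x[0-9a-fA-F]+", "0xADDR", line)   # pointer values
    line = re.sub(r"\b\d{4,}\b", "N", line)            # large numbers such as pids or sizes
    return line.rstrip()


def diff_logs(good_lines, bad_lines):
    """Return a unified diff of the normalized logs as a list of lines."""
    good = [normalize(line) for line in good_lines]
    bad = [normalize(line) for line in bad_lines]
    return list(difflib.unified_diff(good, bad, "good", "bad", lineterm="", n=1))
```

The first hunk of the resulting diff is usually the most interesting, since it marks the earliest point where the failing run diverges from the successful one.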