
Meeting 2025 03 14

Jeff Squyres edited this page Mar 18, 2025 · 10 revisions

Summary of discussion with George/Ralph/Tommy/Jeff

This prefix thing kinda sucks (https://github.com/openpmix/prrte/pull/2154); it's getting complicated and Jeff fears it will be difficult to maintain over time. Is there a long-term path to get us out of this business?

Idea:

  • This particular problem comes down to installdirs.

    • We have installdirs for those who relocate installations (e.g., NVIDIA's Open MPI packaging).
    • At run time, we need to find plugins and show_help files.
      • Sidenote: Do we need to find anything else?
    • Can we solve this? I.e., can we find what we need at runtime via some other mechanism?
    • [JS] Can we look at /proc/self/maps to find the directory of a known library (e.g., libmpi.so)? Not sure how robust that is.
  • We still have problems of multiple levels in the stack (OMPI, PRTE, PMIX) re-using MCA things.

    • We've sorta solved that by replicating everything and using different prefixes in env variable names and the like.
    • But it still kinda sucks -- lots of corner cases come up. And code duplication.
    • We're not going to solve that problem today.
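Sketch of the [JS] /proc/self/maps idea above, to make it concrete. This is only an illustration under assumptions: the helper name find_mapped_dir() is hypothetical (nothing like it exists in OMPI today), it is Linux-only, and it simply takes the first mapping whose file name starts with the given library name. It takes the maps file path as a parameter so it can be exercised against a saved copy.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not an existing OPAL function): scan a file in
 * /proc/<pid>/maps format for a mapping whose file name starts with
 * `libname`, and copy the directory holding that file into `out`.
 * Returns 0 on success, -1 if nothing matched or on error. */
int find_mapped_dir(const char *maps_path, const char *libname,
                    char *out, size_t outlen)
{
    char line[4096];
    int rc = -1;
    FILE *fp;

    assert(NULL != maps_path && NULL != libname && NULL != out);

    fp = fopen(maps_path, "r");
    if (NULL == fp) {
        return -1;
    }

    while (0 != rc && NULL != fgets(line, sizeof(line), fp)) {
        line[strcspn(line, "\n")] = '\0';

        /* The pathname column is the only field that contains a '/' */
        char *path = strchr(line, '/');
        if (NULL == path) {
            continue;   /* anonymous mapping, [stack], [heap], ... */
        }

        char *slash = strrchr(path, '/');
        if (0 != strncmp(slash + 1, libname, strlen(libname))) {
            continue;   /* some other file */
        }

        size_t dirlen = (size_t) (slash - path);
        if (0 == dirlen) {
            dirlen = 1; /* file lives directly in "/" */
        }
        if (dirlen >= outlen) {
            break;      /* caller's buffer is too small */
        }
        memcpy(out, path, dirlen);
        out[dirlen] = '\0';
        rc = 0;
    }

    fclose(fp);
    return rc;
}
```

Usage would be something like find_mapped_dir("/proc/self/maps", "libmpi.so", buf, sizeof(buf)). Matching on a name prefix is what allows versioned names such as libmpi.so.40 to match, but it is also one reason the robustness question above is still open.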

Let's look at the installdirs issue.

Short term / v5.0.x

  • Ralph's PR already merged to PRTE: we now have 4 prefixes (CLI params and env vars)

  • George will investigate: in OMPI's installdirs init:

    • If user sets env variable(s), use that(them)
    • If user didn't set env variable(s):
      • Make LD call to find filesystem path of library containing opal_init (or whatever symbol makes sense)

      • Take dirname of that

      • Compare to installdirs libdir

      • If it's the same -- ok, we're done

      • If it's not the same:

        • Look at old libdir: is it defined in terms of prefix? If so, see if comparing the path we just found against the old libdir lets us distill a prefix.
        • E.g., if we found /bar/lib/libopal_pal.so, and original installdirs libdir (from configure) was {prefix}/lib, then the new value for the installdirs prefix can be /bar.
        • Otherwise, assume prefix is one dir up from the dirname we just found
        • Set installdirs prefix to that value
      • This is good enough for OMPI v5.0.x / NVIDIA

      • Make sure to document this process in the RST docs somewhere
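A rough sketch of the path arithmetic in the steps above, not George's prototype. Assumptions: the function name distill_prefix() is hypothetical, the configured libdir is passed in as a template using the {prefix} spelling from these notes, and the library path has already been obtained from the "LD call" (dladdr() would be one candidate on systems that provide it; that part is left out here).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PREFIX_TOKEN "{prefix}"   /* spelling taken from the notes above */

/* Hypothetical sketch: given the absolute path of the library we are
 * running from, the configure-time libdir template, and the
 * configure-time prefix, write the prefix to use into `out`.
 * Returns 0 if the library is where configure said (prefix unchanged),
 * 1 if a new prefix was distilled, -1 on error. */
int distill_prefix(const char *lib_path, const char *libdir_template,
                   const char *old_prefix, char *out, size_t outlen)
{
    char found_dir[4096], old_libdir[4096];
    const size_t toklen = strlen(PREFIX_TOKEN);
    const char *suffix = NULL;
    size_t found_len, suffix_len, keep;
    int n;

    assert(NULL != lib_path && NULL != libdir_template);
    assert(NULL != old_prefix && NULL != out);

    /* Take the dirname of the library path (absolute paths only) */
    const char *slash = strrchr(lib_path, '/');
    if ('/' != lib_path[0] || slash == lib_path) {
        return -1;
    }
    found_len = (size_t) (slash - lib_path);
    if (found_len >= sizeof(found_dir)) {
        return -1;
    }
    memcpy(found_dir, lib_path, found_len);
    found_dir[found_len] = '\0';

    /* Expand the configure-time libdir so it can be compared */
    if (0 == strncmp(libdir_template, PREFIX_TOKEN, toklen)) {
        suffix = libdir_template + toklen;          /* e.g., "/lib" */
        n = snprintf(old_libdir, sizeof(old_libdir), "%s%s", old_prefix, suffix);
    } else {
        n = snprintf(old_libdir, sizeof(old_libdir), "%s", libdir_template);
    }
    if (n < 0 || (size_t) n >= sizeof(old_libdir)) {
        return -1;
    }

    /* Same place as configured: nothing was relocated */
    if (0 == strcmp(found_dir, old_libdir)) {
        n = snprintf(out, outlen, "%s", old_prefix);
        return (n < 0 || (size_t) n >= outlen) ? -1 : 0;
    }

    suffix_len = (NULL != suffix) ? strlen(suffix) : 0;
    if (suffix_len > 0 && found_len > suffix_len
        && 0 == strcmp(found_dir + found_len - suffix_len, suffix)) {
        /* libdir was {prefix}<suffix>: strip the suffix to get the prefix */
        keep = found_len - suffix_len;
    } else {
        /* Fallback: assume the prefix is one dir up from what we found */
        const char *up = strrchr(found_dir, '/');
        keep = (up == found_dir) ? 1 : (size_t) (up - found_dir);
    }
    if (keep >= outlen) {
        return -1;
    }
    memcpy(out, found_dir, keep);
    out[keep] = '\0';
    return 1;
}
```

With the example from the notes, distill_prefix("/bar/lib/libopal_pal.so", "{prefix}/lib", "/foo", buf, sizeof(buf)) returns 1 and leaves "/bar" in buf. Env variables set by the user would still take precedence before any of this runs.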

Can we get this to work with a small-ish patch? Assume yes. George will prototype.

Longer term / main/v6.0.x

  • Include everything that George did for v5.0.x

    • Perhaps get fancier trying to distill prefix from libdir (TBD)
    • Document in the RST whatever fanciness we do
  • installdirs currently has a bunch of dirs that nothing in the C code uses

    • Let's remove all the dirs that we are not using -- only keep the ones that we actually need.
      • Perhaps we only need libdir and help files dir...? (TBD)
    • If we remove things, we need to update documentation to remove all corresponding env variables / MCA params.
  • After removing the dirs that aren't necessary, we should stat() all remaining dirs in installdirs and complain if something doesn't exist

  • Can we slurp the text help files into C code somehow?

    • This would be one less thing we have to find at run time
      • ...and potentially one more entry we can remove from installdirs
    • Maybe run some (python?) script during make that converts the text files into C code that is then compiled.
      • Random note: clang v16 doesn't like multi-line C strings. Will need to be a little clever about how to encode the strings.
    • Will also need to update opal_show_help() to get text source from C variables instead of reading text files.
    • JS: C23 has #embed to include arbitrary files into the binary but that would require GCC15/Clang19
  • Open question: if the new prefix-setting mechanism works reliably, can we sunset the prefix-setting CLI/env var mechanisms?

  • Here are the dirs we need:

    • bindir (when launching on a remote node, especially via SSH, or launching into dissimilar environments such as containers)
    • libdir (when launching on a remote node, especially via SSH, or launching into dissimilar environments such as containers)
    • DSO dir (to find DSOs)
    • sysconf dir (to find config files)
    • text help file dir (to find the show_help text files)
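Going back to the idea of slurping the help files into C code: a sketch of what generated code and its lookup could look like. Everything named here is hypothetical (struct help_entry, help_table, help_lookup(), the file and topic names, and the message text); it also assumes messages are keyed by a file name plus a topic. The generator emits one string literal per source line and relies on the compiler concatenating adjacent literals, so no literal ever spans a line break.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One embedded help message. All names in this sketch are hypothetical. */
struct help_entry {
    const char *file;   /* name of the original text help file */
    const char *topic;  /* topic within that file */
    const char *text;   /* message body, may contain printf-style escapes */
};

/* What a generator script run during "make" might emit: one literal per
 * source line, joined by the compiler's adjacent-literal concatenation. */
static const struct help_entry help_table[] = {
    {"help-example.txt", "missing-dir",
     "The directory %s does not exist.\n"
     "Check that the installation was not moved.\n"},
    {"help-example.txt", "bad-prefix",
     "Could not determine the installation prefix.\n"},
};

/* Find the text for (file, topic); returns NULL if there is no such entry.
 * A show_help-style function could call this instead of opening a file. */
const char *help_lookup(const char *file, const char *topic)
{
    size_t i;

    assert(NULL != file && NULL != topic);

    for (i = 0; i < sizeof(help_table) / sizeof(help_table[0]); ++i) {
        if (0 == strcmp(help_table[i].file, file)
            && 0 == strcmp(help_table[i].topic, topic)) {
            return help_table[i].text;
        }
    }
    return NULL;
}
```

A linear scan is used only to keep the sketch short. If this approach works out, the text help file dir could potentially drop off the list above.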