Checklist for "stable" landing point #2020

Open
8 of 15 tasks
rhc54 opened this issue Oct 1, 2024 · 6 comments

Comments

@rhc54
Contributor

rhc54 commented Oct 1, 2024

With the project winding down, it is time to define a stable landing point where we can leave it for those wanting to use it. This means:

  • removing all stale code, particularly components that aren't actively used
  • collapsing frameworks into single code directories where multiple variations are not required (e.g., rtc)
  • reducing complexity wherever possible

We'll keep a checklist here as we work through the process - it will culminate in a new PRRTE v4 release series

Code pruning and correction

Enhancements

  • Add PRRTE-internal resiliency support - recover connections to grandparents when parent connection is lost, restore parent connection if/when parent returns, number collective messages to ensure replay when necessary
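The message-numbering part of the resiliency item can be illustrated with a minimal Python sketch. This is not PRRTE's design or code: the class and method names (ReplayBuffer, send, ack, replay) are hypothetical, and the sketch assumes the peer acknowledges cumulatively, i.e. an ack for sequence number n covers every message up to and including n.

```python
from collections import OrderedDict


class ReplayBuffer:
    """Hypothetical sender-side log of numbered messages awaiting acknowledgement."""

    def __init__(self):
        self._next_seq = 0
        self._pending = OrderedDict()  # seq -> payload, kept in send order

    def send(self, payload):
        """Assign the next sequence number and remember the message until acked."""
        seq = self._next_seq
        self._next_seq += 1
        self._pending[seq] = payload
        return seq, payload

    def ack(self, seq):
        """Drop every message up to and including seq (cumulative ack assumed)."""
        for done in [s for s in self._pending if s <= seq]:
            del self._pending[done]

    def replay(self):
        """Return the unacknowledged messages, in order, for resending."""
        return list(self._pending.items())
```

After a connection is lost and re-established (to the parent or, under the assumption above, to a grandparent), the sender would resend whatever replay() returns; the receiver can discard anything whose number it has already seen.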

Scheduler integration

  • Resolve question of moving scheduler integration support into separate branch
  • Complete node extension support for adding nodes on-the-fly
  • Complete session directive support - e.g., session/job preemption
@naughtont3
Contributor

naughtont3 commented Oct 4, 2024

  • [ ] Resolve "permanent" solution to the Slurm plm problem - use new launcher lib if it becomes available, otherwise may need to remove envar support for the internal "srun" cmd line options

Quick follow-up after the 3 Oct 2024 teleconf: I was mistaken, and SLURM_VERSION is not exported as an environment variable within the allocation. It appears you must go through one of the utilities (e.g., srun --version, or scontrol show config | grep SLURM_VERSION).

shell: $ srun --version
slurm 24.05.2
shell: $ scontrol show config | grep SLURM_VERSION
SLURM_VERSION           = 24.05.2
shell: $ echo $SLURM_VERSION

shell: $
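
For anyone wanting to script the lookup shown above, here is a minimal Python sketch that extracts the version from the output of either utility. The function names (parse_slurm_version, detect_slurm_version) are hypothetical and are not part of PRRTE or Slurm, and the parsing assumes only the two output formats shown in the transcript.

```python
import re
import shutil
import subprocess

# Matches dotted versions such as 24.05.2 (the third field is optional).
_VERSION_RE = re.compile(r"(\d+)\.(\d+)(?:\.(\d+))?")


def parse_slurm_version(text):
    """Return (major, minor, micro) parsed from utility output, or None."""
    for line in text.splitlines():
        lowered = line.strip().lower()
        # "slurm 24.05.2" (srun --version) or "SLURM_VERSION = 24.05.2" (scontrol)
        if lowered.startswith("slurm ") or lowered.startswith("slurm_version"):
            match = _VERSION_RE.search(line)
            if match:
                return tuple(int(p) if p is not None else 0 for p in match.groups())
    return None


def detect_slurm_version():
    """Try srun first, then scontrol; return None if neither yields a version."""
    for cmd in (["srun", "--version"], ["scontrol", "show", "config"]):
        if shutil.which(cmd[0]) is None:
            continue  # utility not on PATH
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=10, check=False
            )
        except (OSError, subprocess.TimeoutExpired):
            continue
        version = parse_slurm_version(result.stdout)
        if version is not None:
            return version
    return None
```

Returning a tuple keeps comparisons simple, e.g. detect_slurm_version() >= (24, 5, 0), though the caller still has to handle the None case.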

@rhc54
Contributor Author

rhc54 commented Oct 4, 2024

If you just get an allocation (salloc and no srun), is there anything you can see that might give us a hint as to the version, even if it doesn't give us a direct value?

@naughtont3
Contributor

Unfortunately, I do not see anything that would give an indication (salloc and then env | grep SLURM).
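
The same check can be scripted. The following is a rough Python equivalent of env | grep SLURM, restricted to variable names that start with SLURM; the helper names (slurm_env, version_hints) are hypothetical, and the sample values in the usage note are made up for illustration.

```python
import os


def slurm_env(environ=None):
    """Return the environment variables whose names start with SLURM."""
    env = os.environ if environ is None else environ
    return {key: value for key, value in env.items() if key.startswith("SLURM")}


def version_hints(environ=None):
    """Return the names of SLURM variables that mention VERSION, sorted."""
    return sorted(key for key in slurm_env(environ) if "VERSION" in key.upper())
```

For example, with a made-up environment such as {"SLURM_JOB_ID": "12345", "SLURM_NNODES": "2"}, version_hints returns an empty list, which matches the observation above that nothing in the allocation hints at the version.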

@rhc54
Contributor Author

rhc54 commented Oct 14, 2024

The "oob collapse" has been completed - see #2035

@edgargabriel
Contributor

@rhc54 Is the fix for open-mpi/ompi#12682 already in as well?

@rhc54
Contributor Author

rhc54 commented Oct 23, 2024

Yes - everything is caught up. I have one thing still in the queue, but it's being tracked over in the PMIx repo. Otherwise, everything still needing attention is listed above, and anything else is done. The fix you ask about is also in the release branch awaiting update over in OMPI.

3 participants