SLURM observability cookbook #3

Open
wants to merge 3 commits into
base: main

@piotrrojek piotrrojek commented Dec 19, 2024

Summary

This PR adds a cookbook for a SLURM observability stack.


For new content

When contributing new content, read through our contribution guidelines, and mark the following action items as completed:

  • I have added a new entry in registry.yaml (and, optionally, in authors.yaml) so that my content renders on the cookbook website.
  • I have conducted a self-review of my content based on the contribution guidelines:
    • Relevance: This content is related to building with Crusoe Cloud and is useful to others.
    • Uniqueness: I have searched for related examples in the Crusoe Cookbook, and verified that my content offers new insights or unique information compared to existing documentation.
    • Spelling and Grammar: I have checked for spelling or grammatical mistakes.
    • Clarity: I have done a final read-through and verified that my submission is well-organized and easy to understand.
    • Correctness: The information I include is correct and all of my code executes successfully.
    • Completeness: I have explained everything fully, including all necessary references and citations.

We will rate each of these areas on a scale from 1 to 4, and will only accept contributions that score 3 or higher on all areas. Refer to our contribution guidelines for more details.

@piotrrojek piotrrojek marked this pull request as ready for review December 23, 2024 14:25

#### 2. *Create Users and Directories*

_On all nodes_
Member

For commands that must be run on all compute nodes, can we recommend a tool like pssh or clush ( https://clustershell.readthedocs.io/en/latest/tools/clush.html ) to make it easier to get started with managing large clusters? Perhaps this warrants another mini section on setting up a hostfile for parallel VM management.

Author

After internal discussion, we suggest using Ansible for this. In fact, the SLURM observability role has been submitted as a PR (I'm going to move it from draft to ready for review today, after final testing).
We propose this because Ansible is already widely used at Crusoe and is also used in the GPUd cookbook. To avoid overloading the main SLURM repo, we can also extract it into its own playbook.

3 participants