Add amd gpu autodetect=rsmi support #2057
base: 3.x
…smi can be used Signed-off-by: Antony Cleave <[email protected]>
Thanks for the pull request, and thanks for providing a build with the feature enabled. That makes things easier.

I will bring this to the technical steering committee (TSC) to see what they think. An external repository providing a dependency is not the usual approach in OpenHPC, but from AMD's description it all seems to be open source. We also include the Intel repository in our build system, so we already do something similar.

It is not clear to me exactly how this works. There seems to be a new Slurm plugin called

Without talking to the TSC, I would say the plugin needs to be in a separate sub-package so that it does not pull in dependencies for people who are not interested in this feature. As this probably also requires runtime packages to be installed, we need the correct runtime dependencies expressed in the RPM and a way to easily enable the AMD repository. For the Intel repository we ship a DNF repository definition.

This also needs to be added to the documentation as an optional step. If people enable it, the recipes should automatically enable the repository and install the corresponding runtime dependencies. GitHub Actions also needs to deal correctly with this runtime dependency.

We would also need some tests to verify that this change actually works. The minimal test would be that the AMD repository is correctly enabled and the plugin is installed. Actually testing something with AMD GPUs would be even better. Do you have a system that the OpenHPC project could access to run the corresponding tests?
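The minimal test described above (repository enabled, plugin installed) could be sketched roughly as follows. This is only an illustration: the function name `check_rsmi_setup` is hypothetical, and both the repository definition file and the plugin path are supplied by the caller rather than taken from any OpenHPC default.

```shell
#!/bin/sh
# Hypothetical minimal check: is the AMD repository definition enabled,
# and is the Slurm plugin file present? Both paths are arguments chosen
# by the caller; nothing here is an OpenHPC or Slurm default.
check_rsmi_setup() {
    repo_file="$1"    # a DNF .repo file (location is an assumption of the caller)
    plugin_file="$2"  # full path to the installed plugin (left to the caller)

    if [ ! -f "$repo_file" ]; then
        echo "FAIL: repository definition $repo_file not found"
        return 1
    fi
    # This sketch requires an explicit enabled=1 line in the repo file.
    if ! grep -Eq '^[[:space:]]*enabled[[:space:]]*=[[:space:]]*1[[:space:]]*$' "$repo_file"; then
        echo "FAIL: repository in $repo_file is not enabled"
        return 1
    fi
    if [ ! -f "$plugin_file" ]; then
        echo "FAIL: plugin $plugin_file is not installed"
        return 1
    fi
    echo "OK: repository enabled and plugin installed"
}
```

A test harness would call it with the two paths and treat a non-zero return code as a failure. It does not exercise any GPU, so it only covers the minimal case.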
@@ -16,6 +16,7 @@
 %global _with_slurmrestd 1
 %global _with_multiple_slurmd 1
 %global _with_freeipmi 1
+%global _with_rsmi /opt/rocm/lib
Why is this a path? This is never used anywhere else.
That was left over from my very first attempt, when I was building manually with different versions of the ROCm stack installed and wanted to be sure it was using the right one. With a fresh rpm install the path is no longer required, and this can be 1.
@@ -94,6 +94,7 @@ Patch0: slurm.conf.example.patch
 %bcond_with lua
 %bcond_with numa
 %bcond_with pmix
+%bcond_with rsmi
The repository you mentioned seems to be for RHEL. So this only needs to be enabled on RHEL builds for now.
Yes, this is true.
There are no Arm packages either. Will this matter for arm64 builds?
Right, that also needs to be excluded. Better make it x86_64 only.
@@ -452,7 +456,6 @@ module load hwloc
 %{?_with_nvml} \
 --with-hwloc=%{OHPC_LIBS}/hwloc \
 %{?_with_cflags} || { cat config.log && exit 1; }
-
Try to avoid unnecessary whitespace changes.
Oops, my bad! I'll fix this this afternoon, along with the unnecessary path in _with_rsmi.
Yes, there is a new plugin, and it ends up in
It has no hard dependencies on packages in the AMD repos, either on the ctld or on the computes; this is just a build-time dependency. We have tested the current rpms on:
In case 2 everything works and the GPUs are detected. In case 1, with the defaults (i.e. no AutoDetect enabled in gres.conf), nothing changed at all. I would imagine that a system with AMD GPUs and no rocm-smi-lib installed would still fail to detect any GPUs, but that is unlikely to occur if you have installed the drivers and any of the ROCm stack to use the GPUs for compute.
Yes. I need to officially check about access, but as long as you give us some notice beforehand I doubt there will be any issues. We currently have MI250X available.
Just ripped out rocm-smi-lib on an active compute node and restarted slurmd in the foreground on a Rocky Linux 8.10 system where I have done the same
and putting it back on:
In today's TSC meeting everyone was in favour of this change. We will continue to work with you here to get this merged. With 3.2 released this week, we will target this change for the 3.3 release, which might be in May 2025.
That's great news! I hope to get time to finish cleaning this up next week.
A friendly reminder that this PR had no activity for 30 days.
This allows AutoDetect=rsmi to be used in the Slurm gres.conf.
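As a rough sketch of what this enables, a compute node's gres.conf could consist of a single `AutoDetect=rsmi` line. The helper below writes such a file to a path chosen by the caller; the function name `write_gres_conf` is hypothetical, and the target path is an assumption to be adapted to the local Slurm configuration directory.

```shell
#!/bin/sh
# Hypothetical helper: write a minimal gres.conf that asks slurmd to
# discover AMD GPUs through the ROCm SMI library.
# The target path is supplied by the caller (assumption: it points at
# the gres.conf location used by the local Slurm installation).
write_gres_conf() {
    target="$1"
    cat > "$target" <<'EOF'
# Detect AMD GPUs automatically instead of listing them by hand.
AutoDetect=rsmi
EOF
}
```

The usual GPU GRES entries in slurm.conf are still needed alongside this; the sketch covers only the gres.conf side.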
Technically no modification to the spec file is needed, but this is just a reminder that the rocm-smi-lib rpm needs to be installed as a build requirement on the OBS nodes.
The rpm is included in the AMD ROCm repos, which can be found here:
https://repo.radeon.com/rocm/el9/6.2.2/main
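For a build host outside OBS, a DNF repository definition pointing at that URL might look like the sketch below. The function name `write_rocm_repo`, the repository id `rocm`, and the `gpgkey` URL are assumptions made for illustration and should be checked against AMD's installation documentation before use; only the `baseurl` comes from this description.

```shell
#!/bin/sh
# Hypothetical helper: write a DNF repository definition for the AMD
# ROCm repository to a path chosen by the caller.
write_rocm_repo() {
    target="$1"
    cat > "$target" <<'EOF'
# Repository id and name are illustrative only.
[rocm]
name=AMD ROCm 6.2.2 (el9)
baseurl=https://repo.radeon.com/rocm/el9/6.2.2/main
enabled=1
gpgcheck=1
# Assumption: signing key location; verify against AMD's documentation.
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
}
```

With the file written under the system's repository directory, rocm-smi-lib could then be installed as the build requirement mentioned above.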
I have published a release here, with the rpms attached, to verify that it works.
https://github.com/antonycleave/openhpc-slurm-with-rocm/releases