
AMD MI300A Unified Memory Support #145693

Open
lancelotnd opened this issue Jan 26, 2025 · 3 comments
Labels
feature A request for a proper, new feature. module: rocm AMD GPU support for PyTorch topic: new features topic category triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@lancelotnd
Contributor

lancelotnd commented Jan 26, 2025

🚀 The feature, motivation and pitch

I am working on improving performance for LLM workloads on the AMD Instinct™ MI300A Accelerator. This APU has a fully unified memory architecture that PyTorch does not take advantage of at this time. Because the GPU and CPU share the same physical memory, memcpy ops become redundant, and the duplicated buffers limit the size of the models we can train. Adding support for unified memory on ROCm for this particular APU would allow for zero-copy operations.

The motivation is similar to that of #140787, but for ROCm instead of MPS.

Given that this APU targets the most demanding HPC ML workloads, there is great interest in optimizing PyTorch's performance for it. Notably, El Capitan, the #1 supercomputer on the TOP500 list, runs exclusively on AMD's MI300A.

Alternatives

No response

Additional context

To facilitate understanding, I provide more detail on the kind of changes this involves.

To understand the differences in operations between non-unified and unified memory, let us consider a regular matrix multiplication of matrices $A$ and $B$ where the result is stored in matrix $C$.

In a non-unified setup with a discrete GPU (device), the steps are as follows (a code sketch follows the list):

  1. malloc matrices $A,B,C$ on the host, each of size $n \times n$.
  2. ... values are written to matrices $A$ and $B$.
  3. cudaMalloc to allocate device memory for matrices $A',B',C'$.
  4. cudaMemcpy $A \rightarrow A'$ and $B \rightarrow B'$ (HostToDevice).
  5. Kernel launch (results are written to $C'$).
  6. cudaMemcpy $C' \rightarrow C$ (DeviceToHost) to retrieve the results.
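
For concreteness, a minimal CUDA sketch of these steps (a toy matmul kernel and hard-coded sizes for illustration only; this is not PyTorch code):

```cuda
// Discrete-GPU path: separate device buffers and explicit copies (steps 1-6 above).
#include <cuda_runtime.h>
#include <cstdlib>

__global__ void matmul(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k) acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }
}

int main() {
    const int n = 1024;
    const size_t bytes = (size_t)n * n * sizeof(float);

    // 1. Host allocations for A, B, C.
    float* A = (float*)std::malloc(bytes);
    float* B = (float*)std::malloc(bytes);
    float* C = (float*)std::malloc(bytes);
    // 2. ... fill A and B on the host ...

    // 3. Separate device allocations A', B', C'.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    // 4. Host-to-device copies.
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

    // 5. Kernel launch; results are written to C'.
    dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
    matmul<<<grid, block>>>(dA, dB, dC, n);

    // 6. Device-to-host copy to retrieve the results (implicitly synchronizes).
    cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    std::free(A); std::free(B); std::free(C);
    return 0;
}
```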

Whereas with unified memory you would have (again, sketched after the list):

  1. cudaMallocManaged matrices $A,B,C$, each of size $n \times n$ (a single allocation visible to both host and device).
  2. ... values are written to matrices $A$ and $B$.
  3. Kernel launch (results are written to $C$).
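
The same computation with managed memory is sketched below (it reuses the matmul kernel from the previous sketch); note that the host still has to synchronize before reading $C$, even though no copy is needed:

```cuda
// Unified-memory path: one allocation per matrix, no explicit copies (steps 1-3 above).
#include <cuda_runtime.h>

__global__ void matmul(const float* A, const float* B, float* C, int n);  // defined in the previous sketch

int main() {
    const int n = 1024;
    const size_t bytes = (size_t)n * n * sizeof(float);

    // 1. Managed allocations, valid from both the CPU and the GPU.
    float *A, *B, *C;
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    // 2. ... fill A and B directly from the host ...

    // 3. Kernel launch; results are written to C in place.
    dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
    matmul<<<grid, block>>>(A, B, C, n);

    // No copy back, but the host must still synchronize before reading C.
    cudaDeviceSynchronize();

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```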

On machines with discrete GPUs, the concept of unified memory is purely virtual and still results in memory movement by way of page faults and page migrations, which adds a lot of overhead.

On architectures where the CPU and GPU share the same physical memory, such as Apple silicon Macs and the AMD MI300A, any memcpy operation becomes pointless and wastes both space and time.

The quickest and dirtiest hack to support unified memory in PyTorch is to replace all cudaMalloc calls with cudaMallocManaged and to get rid of the memcpy operations, as in this paper. This, however, is neither ideal nor portable.
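
To make the substitution concrete, here is an API-level sketch of the hack (the helper names are invented for illustration; this is not PyTorch's allocator code):

```cuda
// API-level illustration of the "swap cudaMalloc for cudaMallocManaged" hack.
#include <cuda_runtime.h>

cudaError_t allocTensorBuffer(void** ptr, size_t bytes) {
    // Original path would be: return cudaMalloc(ptr, bytes);
    // Hacked path: a single managed allocation usable from both CPU and GPU.
    return cudaMallocManaged(ptr, bytes);
}

cudaError_t copyIfNeeded(void* dst, const void* src, size_t bytes, cudaMemcpyKind kind) {
    // Once "host" and "device" tensors share the same managed buffer,
    // an H2D/D2H transfer between identical pointers can simply be skipped.
    if (dst == src) return cudaSuccess;
    return cudaMemcpy(dst, src, bytes, kind);
}
```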

Perhaps a better way to do it would be a toggle that turns unified memory on or off. Given that this is a relatively new architecture, more hardware with this kind of configuration is likely to come out from different manufacturers, so it would be great to have device-agnostic support for unified memory.
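
A purely hypothetical sketch of such a toggle (the environment variable name and helpers are invented for illustration and are not an existing PyTorch option):

```cuda
// Hypothetical runtime toggle between managed and plain device allocations.
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstring>

static bool useUnifiedMemory() {
    // e.g. set PYTORCH_UNIFIED_MEMORY=1 on MI300A-class hardware (made-up knob).
    const char* env = std::getenv("PYTORCH_UNIFIED_MEMORY");
    return env != nullptr && std::strcmp(env, "1") == 0;
}

cudaError_t allocTensorStorage(void** ptr, size_t bytes) {
    // Device-agnostic choice between zero-copy-capable managed memory
    // and a plain device allocation.
    return useUnifiedMemory() ? cudaMallocManaged(ptr, bytes)
                              : cudaMalloc(ptr, bytes);
}
```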

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

@malfet malfet added the module: rocm AMD GPU support for Pytorch label Jan 27, 2025
@drisspg drisspg added feature A request for a proper, new feature. triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module topic: new features topic category labels Jan 28, 2025
@jayfurmanek
Contributor

We did have an RFC for this very thing a while back. The idea is to provide zero-copy for tensors across CPU/GPU, but the usage model in torch has implicit and explicit copies that have to be handled properly, which is a bit tricky. Unified addressing also doesn't prevent you from having to synchronize. So the question becomes: can we safely make these changes and get a performance win with a framework that already does a pretty good job of avoiding copy latency in the first place?

@lancelotnd
Contributor Author

Thank you very much for resurfacing this RFC. I really appreciate all the work that has been done toward the design of this implementation. I'll start from that document to build a prototype and test it on MI300A nodes. As tricky as it may be, I think it is still worth the effort now that unified memory is no longer just virtual.

@naromero77amd
Collaborator

@lancelotnd @malfet Do we need to keep this issue open? Seems more like a dev discussion.
