
AMD MI300A Unified Memory Support #145693

Open
lancelotnd opened this issue Jan 26, 2025 · 3 comments
Labels
feature A request for a proper, new feature. module: rocm AMD GPU support for PyTorch topic: new features topic category triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@lancelotnd
Contributor

lancelotnd commented Jan 26, 2025

🚀 The feature, motivation and pitch

I am working on improving performance for LLM workloads on the AMD Instinct™ MI300A Accelerator. This APU has a fully unified memory architecture that PyTorch does not take advantage of at this time. Because the GPU and CPU share the same physical memory, memcpy ops become redundant, and the duplicated buffers limit the size of the models we can train. Adding support for unified memory on ROCm for this particular APU would allow for zero-copy operations.

The motivation is similar to that of #140787, but for ROCm instead of MPS.

Given that this APU targets the most demanding HPC ML workloads, there is great interest in optimizing PyTorch's performance for it. Notably, El Capitan, the #1 supercomputer on the TOP500 list, runs exclusively on AMD's MI300A.

Alternatives

No response

Additional context

To facilitate understanding, I provide more detail on the kind of changes this involves.

To understand the differences in operations between non-unified and unified memory, let us consider a regular matrix multiplication of matrices $A$ and $B$ where the result is stored in matrix $C$.

In a non-unified setup with a discrete GPU (device), the steps are as follows (a code sketch follows the list):

  1. malloc matrices $A,B,C$ on the host, each of size $n \times n$.
  2. ... values are written to matrices $A$ and $B$.
  3. cudaMalloc to allocate device memory for matrices $A',B',C'$.
  4. cudaMemcpy $A \rightarrow A'$ and $B \rightarrow B'$ (HostToDevice).
  5. Kernel launch (results are written to $C'$).
  6. cudaMemcpy $C' \rightarrow C$ (DeviceToHost) to retrieve the results.
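
For concreteness, a minimal CUDA sketch of these steps (a toy matmul kernel and hard-coded sizes for illustration only; this is not PyTorch code):

```cuda
// Discrete-GPU path: separate device buffers and explicit copies (steps 1-6 above).
#include <cuda_runtime.h>
#include <cstdlib>

__global__ void matmul(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k) acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }
}

int main() {
    const int n = 1024;
    const size_t bytes = (size_t)n * n * sizeof(float);

    // 1. Host allocations for A, B, C.
    float* A = (float*)std::malloc(bytes);
    float* B = (float*)std::malloc(bytes);
    float* C = (float*)std::malloc(bytes);
    // 2. ... fill A and B on the host ...

    // 3. Separate device allocations A', B', C'.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    // 4. Host-to-device copies.
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

    // 5. Kernel launch; results are written to C'.
    dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
    matmul<<<grid, block>>>(dA, dB, dC, n);

    // 6. Device-to-host copy to retrieve the results (implicitly synchronizes).
    cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    std::free(A); std::free(B); std::free(C);
    return 0;
}
```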

Whereas with unified memory you would have (again, sketched after the list):

  1. cudaMallocManaged matrices $A,B,C$, each of size $n \times n$ (a single allocation visible to both host and device).
  2. ... values are written to matrices $A$ and $B$.
  3. Kernel launch (results are written to $C$).
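
The same computation with managed memory is sketched below (it reuses the matmul kernel from the previous sketch); note that the host still has to synchronize before reading $C$, even though no copy is needed:

```cuda
// Unified-memory path: one allocation per matrix, no explicit copies (steps 1-3 above).
#include <cuda_runtime.h>

__global__ void matmul(const float* A, const float* B, float* C, int n);  // defined in the previous sketch

int main() {
    const int n = 1024;
    const size_t bytes = (size_t)n * n * sizeof(float);

    // 1. Managed allocations, valid from both the CPU and the GPU.
    float *A, *B, *C;
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    // 2. ... fill A and B directly from the host ...

    // 3. Kernel launch; results are written to C in place.
    dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
    matmul<<<grid, block>>>(A, B, C, n);

    // No copy back, but the host must still synchronize before reading C.
    cudaDeviceSynchronize();

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```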

On machines with discrete GPUs, the concept of unified memory is purely virtual and still results in memory movement by way of page faults and page migrations, which adds a lot of overhead.

On architectures where the CPU and GPU share the same physical memory, such as Apple silicon Macs and the AMD MI300A, any memcpy operation becomes pointless and wastes both space and time.

The quickest and dirtiest hack to support unified memory in PyTorch is to replace all cudaMalloc calls with cudaMallocManaged and to get rid of the memcpy operations, as in this paper. This, however, is neither ideal nor portable.
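
To make the substitution concrete, here is an API-level sketch of the hack (the helper names are invented for illustration; this is not PyTorch's allocator code):

```cuda
// API-level illustration of the "swap cudaMalloc for cudaMallocManaged" hack.
#include <cuda_runtime.h>

cudaError_t allocTensorBuffer(void** ptr, size_t bytes) {
    // Original path would be: return cudaMalloc(ptr, bytes);
    // Hacked path: a single managed allocation usable from both CPU and GPU.
    return cudaMallocManaged(ptr, bytes);
}

cudaError_t copyIfNeeded(void* dst, const void* src, size_t bytes, cudaMemcpyKind kind) {
    // Once "host" and "device" tensors share the same managed buffer,
    // an H2D/D2H transfer between identical pointers can simply be skipped.
    if (dst == src) return cudaSuccess;
    return cudaMemcpy(dst, src, bytes, kind);
}
```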

Perhaps a better way to do it would be a toggle that turns unified memory on or off. Given that this is a relatively new architecture, more hardware with this kind of configuration is likely to come out from different manufacturers, so it would be great to have device-agnostic support for unified memory.
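
A purely hypothetical sketch of such a toggle (the environment variable name and helpers are invented for illustration and are not an existing PyTorch option):

```cuda
// Hypothetical runtime toggle between managed and plain device allocations.
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstring>

static bool useUnifiedMemory() {
    // e.g. set PYTORCH_UNIFIED_MEMORY=1 on MI300A-class hardware (made-up knob).
    const char* env = std::getenv("PYTORCH_UNIFIED_MEMORY");
    return env != nullptr && std::strcmp(env, "1") == 0;
}

cudaError_t allocTensorStorage(void** ptr, size_t bytes) {
    // Device-agnostic choice between zero-copy-capable managed memory
    // and a plain device allocation.
    return useUnifiedMemory() ? cudaMallocManaged(ptr, bytes)
                              : cudaMalloc(ptr, bytes);
}
```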

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

@malfet malfet added the module: rocm AMD GPU support for Pytorch label Jan 27, 2025
@drisspg drisspg added feature A request for a proper, new feature. triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module topic: new features topic category labels Jan 28, 2025
@jayfurmanek
Contributor

We did have an RFC for this very thing a while back. The idea is to provide zero-copy for tensors across CPU/GPU, but the usage model in torch has implicit and explicit copies that have to be handled properly, which is a bit tricky. Unified addressing also doesn't prevent you from having to synchronize. So the question becomes: can we safely make these changes and get a performance win with a framework that already does a pretty good job of avoiding copy latency in the first place?

@lancelotnd
Contributor Author

Thank you very much for resurfacing this RFC. I really appreciate all the work that has been done toward the design of this implementation. I'll start from that document to build a prototype and test it on MI300A nodes. As tricky as it may be, I think it is still worth the effort now that unified memory is no longer just virtual.

@naromero77amd
Collaborator

@lancelotnd @malfet Do we need to keep this issue open? Seems more like a dev discussion.
