AMD MI300A Unified Memory Support #145693
Labels
feature
module: rocm
topic: new features
triaged
🚀 The feature, motivation and pitch
I am working on improving performance for LLM workloads on the AMD Instinct™ MI300A accelerator. This APU has a fully unified memory architecture that PyTorch does not take advantage of at this time. Because the GPU and CPU share the same physical memory, memcpy operations between them are redundant: they waste time and duplicate buffers, which limits the size of the model we can train. Adding support for unified memory in ROCm for this APU would allow zero-copy operations.
The motivation is similar to that of #140787, but for ROCm instead of MPS.
Given that this APU targets the most demanding HPC ML workloads, there is great interest in optimizing PyTorch performance on it. Notably, El Capitan, currently the #1 supercomputer on the TOP500 list, runs exclusively on AMD MI300A APUs.
Alternatives
No response
Additional context
To facilitate understanding, here are more details on the kind of changes this would involve.
To see how operations differ between non-unified and unified memory, consider a regular matrix multiplication of matrices $A$ and $B$, with the result stored in matrix $C$.
In a non-unified setup with a discrete GPU (device):

- `malloc` to allocate host memory for the matrices
- `cudaMalloc` to allocate device memory for the matrices
- `cudaMemcpy` (`HostToDevice`) to copy the input matrices to the device
- `cudaMemcpy` (`DeviceToHost`) to get back the results

Whereas with unified memory you would have:

- `cudaMallocManaged` to allocate memory for the matrices

On machines with discrete GPUs the concept of unified memory is purely virtual and still results in memory movement by way of page faults and page migrations, which adds a lot of overhead.
On architectures where the CPU and GPU share the same physical memory, such as Apple silicon Macs and the AMD MI300A, any memcpy operation becomes pointless and wastes both space and time.
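For concreteness, here is a minimal CUDA-style sketch contrasting the two flows (on ROCm the HIP equivalents are `hipMalloc`, `hipMemcpy`, and `hipMallocManaged`); the kernel and launch configuration are illustrative only, not PyTorch code:

```cpp
#include <cuda_runtime.h>

// Naive kernel for C = A * B with square N x N matrices (illustrative only).
__global__ void matmul(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k) acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

// Discrete-GPU flow: separate device buffers plus explicit copies in and out.
void matmul_discrete(const float* hA, const float* hB, float* hC, int N) {
    size_t bytes = (size_t)N * N * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);  // copy inputs in
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
    dim3 block(16, 16), grid((N + 15) / 16, (N + 15) / 16);
    matmul<<<grid, block>>>(dA, dB, dC, N);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);  // copy results out
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}

// Unified-memory flow: one allocation visible to both CPU and GPU, no memcpy.
void matmul_unified(float* A, float* B, float* C, int N) {
    size_t bytes = (size_t)N * N * sizeof(float);
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    // ... fill A and B directly from the CPU here, no HostToDevice copy ...
    dim3 block(16, 16), grid((N + 15) / 16, (N + 15) / 16);
    matmul<<<grid, block>>>(A, B, C, N);
    cudaDeviceSynchronize();  // after this, C is directly readable on the CPU
    cudaFree(A); cudaFree(B); cudaFree(C);
}
```

On an APU like the MI300A the managed pointers refer to the same physical memory the CPU uses, so the unified path involves no data movement at all, while on a discrete GPU the same code still triggers page migrations under the hood.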
The quickest and dirtiest hack to support unified memory in PyTorch is to replace all `cudaMalloc` calls with `cudaMallocManaged` and to get rid of the `cudaMemcpy` operations, as done in this paper. This, however, is neither ideal nor portable. A better way would probably be to make unified memory something that can be toggled on or off. Given that this is a relatively new architecture, more hardware with this kind of configuration is likely to come from other manufacturers, so device-agnostic support for unified memory would be great to have. A rough sketch of such a toggle follows.
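Purely as an illustration of the toggle idea, here is one possible shape for an allocator shim; the function name, the environment variable, and the overall wiring below are hypothetical and are not an existing PyTorch or ROCm API:

```cpp
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstring>

// Hypothetical toggle: in a real integration this would more likely live in the
// caching allocator's configuration than in a raw environment variable.
static bool use_unified_memory() {
    const char* env = std::getenv("EXAMPLE_USE_UNIFIED_MEMORY");
    return env != nullptr && std::strcmp(env, "1") == 0;
}

// Hypothetical allocation hook: route storage through managed memory on
// unified-memory hardware (e.g. an MI300A-style APU), and fall back to plain
// device allocations, with the usual explicit copies, everywhere else.
static cudaError_t allocate_tensor_storage(void** ptr, size_t bytes) {
    if (use_unified_memory()) {
        // Single allocation accessible from both CPU and GPU; on hardware with
        // physically shared memory this removes HostToDevice/DeviceToHost copies.
        return cudaMallocManaged(ptr, bytes);
    }
    // Discrete-GPU path: caller remains responsible for cudaMemcpy traffic.
    return cudaMalloc(ptr, bytes);
}
```

The point of routing the decision through one place is that the rest of the framework stays device-agnostic: the same code path works on discrete GPUs and on APUs, and only the allocation policy changes.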
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd