Summary
I have tested running the BinaryBH example on Dawn, which contains Intel Data Center GPU Max 1550s (codename 'Ponte Vecchio', often abbreviated to 'PVC').
Resources
Useful information can be found in the following sources:
GPU Architecture
The PVCs on Dawn are composed of 2 stacks (previously referred to as "tiles", a term still common in Intel documentation). Each stack is effectively a separate GPU, except that the 2 stacks can share GPU memory and can communicate with each other at relatively high bandwidth (16 Xe links at 26.5 GB/s in each direction; see here for further details). Additional information can be found in the Xe Architecture section of the oneAPI GPU Optimization Guide.
At the time of writing, the PVCs on Dawn are using Level Zero driver version 1.3.26516 (as reported by sycl-ls --verbose). Empirically, the best performance is achieved with 1 MPI process (or rank) per stack (i.e. 2 MPI ranks in the case of 1 GPU).
Software Environment
All the tests below use the following modules:
GPU-aware MPI
To enable passing GPU buffers directly to MPI calls, it is necessary to set the following Intel MPI environment variable:
export I_MPI_OFFLOAD=1
Furthermore, it is necessary to enable it at the AMReX level by setting the following parameter
amrex.use_gpu_aware_mpi = 1
either as an argument on the command line or in the parameter file.
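As a hypothetical illustration of the command-line form (the executable and inputs file names below are placeholders, not the actual BinaryBH binary name):

```bash
export I_MPI_OFFLOAD=1
# AMReX's ParmParse accepts key=value overrides appended after the inputs file,
# so the parameter can be supplied at launch time instead of editing the inputs file
mpiexec -n 2 ./main3d.ex inputs amrex.use_gpu_aware_mpi=1
```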
GPU Pinning
Currently, SLURM interferes with the GPU pinning support provided in the Intel MPI Library as it sets the ZE_AFFINITY_MASK environment variable automatically. We will need to unset this:
unset ZE_AFFINITY_MASK
For now, we will restrict ourselves to a single node, so we can change the MPI bootstrap server from slurm to fork (which only works intra-node) by passing -bootstrap fork to mpiexec, e.g. as in the sketch below.
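A minimal single-node sketch (2 ranks for the 2 stacks of one GPU; executable and inputs names are again placeholders):

```bash
# The fork bootstrap bypasses SLURM's process manager, so this only works within one node
mpiexec -bootstrap fork -n 2 ./main3d.ex inputs amrex.use_gpu_aware_mpi=1
```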
We can also set the ZE_FLAT_DEVICE_HIERARCHY environment variable to FLAT, which exposes the stacks as separate devices to programs. With the current GPU drivers, the default is COMPOSITE, but this should change to FLAT in later driver versions. In any case, we can set it explicitly:
export ZE_FLAT_DEVICE_HIERARCHY=FLAT
We can also set the interface for GPU topology recognition by Intel MPI as follows (this should be done automatically if I_MPI_OFFLOAD = 1):
export I_MPI_OFFLOAD_TOPOLIB=level_zero
GPU pinning should work automatically if the above advice has been followed but, if not, it can be explicitly enabled with
export I_MPI_OFFLOAD_PIN=1
One can get Intel MPI to print the pinning topology by either setting
export I_MPI_DEBUG=3
or
export I_MPI_OFFLOAD_PRINT_TOPOLOGY=1
The aim is to see in the reported topology that a single tile (or stack) is pinned to each rank.
I have yet to explore how best to do GPU pinning in the case of multi-node jobs but it may require some kind of wrapper script.
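For reference, a sketch collecting the single-node settings above into the relevant part of a job script (the launch line reuses the placeholder names from earlier):

```bash
# Undo SLURM's automatic device masking so Intel MPI can handle the pinning
unset ZE_AFFINITY_MASK

# Expose each PVC stack as a separate device
export ZE_FLAT_DEVICE_HIERARCHY=FLAT

# GPU-aware MPI, Level Zero topology detection and explicit pinning
export I_MPI_OFFLOAD=1
export I_MPI_OFFLOAD_TOPOLIB=level_zero
export I_MPI_OFFLOAD_PIN=1

# Print the resulting rank-to-stack pinning so it can be checked
export I_MPI_DEBUG=3

# 2 ranks per GPU, i.e. one per stack
mpiexec -bootstrap fork -n 2 ./main3d.ex inputs amrex.use_gpu_aware_mpi=1
```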
Ahead-Of-Time compilation
By default, building with Intel GPU support via SYCL (USE_SYCL = TRUE) uses Just-in-Time (JIT) compilation: code for a specific device is not produced at build time; instead, the SYCL code is compiled down to an intermediate representation (IR) and then compiled on-the-fly at runtime once the specific device is known. Whilst this saves the user from having to figure out how to target a specific device, it increases run times and makes performance comparisons (particularly with other GPU backends) inaccurate. Since we know we will be targeting PVCs, we can instead do Ahead-of-Time (AOT) compilation. This can be enabled by adding the following code to your Make.local-pre file under /path/to/amrex/Tools/GNUMake:
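A sketch of what such a block might contain is given below. The SYCL_AOT and AMREX_INTEL_ARCH variable names are assumptions based on AMReX's SYCL make options, so check the make files shipped with your AMReX version for the exact names; SYCL_PARALLEL_LINK_JOBS is discussed next.

```make
# Make.local-pre (sketch; variable names assume AMReX's SYCL make support)
ifeq ($(USE_SYCL),TRUE)
  SYCL_AOT = TRUE                  # compile device code ahead of time
  AMREX_INTEL_ARCH = pvc           # target the Data Center GPU Max (PVC)
  SYCL_PARALLEL_LINK_JOBS = 24     # parallelise the AOT device link step
  # <any other SYCL options>
endif
```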
Note that since AOT compilation happens at link time, it can be quite slow, hence the SYCL_PARALLEL_LINK_JOBS option, which allows this device code compilation to occur in parallel. This should not interfere with make's parallel build jobs option (-j 24) because linking occurs after all other files have been compiled. Since it is currently only possible to use the Dawn software stack on the PVC compute nodes, I have assumed you have requested at least 24 cores (corresponding to 1/4 of the available cores on a node, assuming you request at least 1 GPU).
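For example, assuming the standard AMReX GNU Make workflow and a 24-core allocation:

```bash
# Host-side compilation runs in parallel here; the AOT device link step is
# parallelised separately via SYCL_PARALLEL_LINK_JOBS above
make -j 24 USE_SYCL=TRUE
```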
Register Pressure
There is a useful section, Registers and Performance, in the Intel oneAPI GPU Optimization Guide.
Empirically, it is found that the BinaryBH example (and most likely the CCZ4 RHS) exhibits kernels with high register pressure, and this can be observed at AOT device compilation time. With the current AMReX defaults, you will see lots of warnings about register spills like the following:
warning: kernel _ZTSZZN5amrex6launchILi256EZNS_12experimental6detail11ParallelForILi256ENS_8MultiFabEZN13BinaryBHLevel15specificEvalRHSERS4_S6_dEUliiiiE0_EENSt9enable_ifIXsr10IsFabArrayIT0_EE5valueEvE4typeERKS9_RKNS_9IntVectNDILi3EEEiSH_bRKT1_EUlRKN4sycl3_V17nd_itemILi1EEEE_EEviNS_11gpuStream_tESD_ENKUlRNSM_7handlerEE_clESU_EUlSO_E_ compiled SIMD32 allocated 128 regs and spilled around 64
Register spills severely hurt performance and should be avoided as far as possible. The size of the registers can be maximised by setting the following AMReX makefile options [1] (also in the Make.local-pre file, in the <any other SYCL options> block above).
SYCL_SUB_GROUP_SIZE = 16
SYCL_AOT_GRF_MODE = Large
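In context, with the effect of each option noted (register counts as described in the Intel optimization guide linked in footnote [1]):

```make
# Also in the Make.local-pre SYCL block sketched above
SYCL_SUB_GROUP_SIZE = 16     # compile kernels SIMD16 rather than SIMD32, leaving more registers per work-item
SYCL_AOT_GRF_MODE = Large    # large GRF mode: 256 registers per thread instead of the default 128
```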
Unfortunately, performance on more than 1 GPU is severely degraded with v2024.1 of the Intel oneAPI software stack. For now, I advise downgrading to the following modules:
Footnotes
1. https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2024-1/porting-registers.html#REGISTERS-ON-AN-INTEL-R-DATA-CENTER-GPU-MAX-1550