Please do not hesitate to reach out to us on the Discourse forums (Runtimes - OpenMP) or join one of our :ref:`regular calls <calls>`. Some common questions are answered in the :ref:`faq`.
- Development updates on OpenMP (and OpenACC) in the LLVM Project, including Clang, optimization, and runtime work.
- Join OpenMP in LLVM Technical Call.
- Time: Weekly call every Wednesday at 7:00 AM Pacific time.
- Meeting minutes are here.
- Status tracking page.
- Development updates on OpenMP and OpenACC in the Flang Project.
- Join OpenMP in Flang Technical Call
- Time: Weekly call every Thursday at 8:00 AM Pacific time.
- Meeting minutes are here.
- Status tracking page.
Note
The FAQ is a work in progress and most of the expected content is not yet available. While you can expect changes, we always welcome feedback and additions. Please post on the Discourse forums (Runtimes - OpenMP).
All patches go through the regular LLVM review process.
To build an effective OpenMP offload capable compiler, only one extra CMake
option, ``LLVM_ENABLE_RUNTIMES="openmp"``, is needed when building LLVM (generic
information about building LLVM is available here). Make sure all backends that
are targeted by OpenMP are enabled. That can be done by adjusting the CMake
option ``LLVM_TARGETS_TO_BUILD``. The corresponding targets for offloading to
AMD and Nvidia GPUs are ``"AMDGPU"`` and ``"NVPTX"``, respectively. By default,
Clang will be built with all backends enabled. When building with
``LLVM_ENABLE_RUNTIMES="openmp"``, OpenMP should not be enabled in
``LLVM_ENABLE_PROJECTS`` because it is enabled by default.
For Nvidia offload, please see :ref:`build_nvidia_offload_capable_compiler`. For AMDGPU offload, please see :ref:`build_amdgpu_offload_capable_compiler`.
Note
The compiler that generates the offload code should be the same (version) as the compiler that builds the OpenMP device runtimes. The OpenMP host runtime can be built by a different compiler.
The CUDA SDK is required on the machine that will execute the OpenMP application.
If your build machine is not the target machine or automatic detection of the available GPUs failed, you should also set:
- ``LIBOMPTARGET_DEVICE_ARCHITECTURES=sm_<xy>,...`` where ``<xy>`` is the numeric compute capability of your GPU. For instance, set ``LIBOMPTARGET_DEVICE_ARCHITECTURES=sm_70,sm_80`` to target the Nvidia Volta and Ampere architectures.
A subset of the ROCm toolchain is required to build the LLVM toolchain and to execute the OpenMP application. Either install ROCm somewhere that CMake's find_package can locate it, or build the required subcomponents ROCt and ROCr from source.
The two components used are ROCT-Thunk-Interface (roct) and ROCR-Runtime (rocr). Roct is the userspace part of the Linux driver. It calls into the driver which ships with the Linux kernel. It is an implementation detail of Rocr from OpenMP's perspective. Rocr is an implementation of HSA.
SOURCE_DIR=same-as-llvm-source # e.g. the checkout of llvm-project, next to openmp
BUILD_DIR=somewhere
INSTALL_PREFIX=same-as-llvm-install
cd $SOURCE_DIR
git clone git@github.com:RadeonOpenCompute/ROCT-Thunk-Interface.git -b roc-4.2.x \
  --single-branch
git clone git@github.com:RadeonOpenCompute/ROCR-Runtime.git -b rocm-4.2.x \
  --single-branch
cd $BUILD_DIR && mkdir roct && cd roct
cmake $SOURCE_DIR/ROCT-Thunk-Interface/ -DCMAKE_INSTALL_PREFIX=$INSTALL_PREFIX \
-DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF
make && make install
cd $BUILD_DIR && mkdir rocr && cd rocr
cmake $SOURCE_DIR/ROCR-Runtime/src -DIMAGE_SUPPORT=OFF \
-DCMAKE_INSTALL_PREFIX=$INSTALL_PREFIX -DCMAKE_BUILD_TYPE=Release \
-DBUILD_SHARED_LIBS=ON
make && make install
``IMAGE_SUPPORT`` requires building rocr with clang and is not used by OpenMP.
Provided CMake's find_package can find the ROCR-Runtime package, LLVM will
build a tool ``bin/amdgpu-arch`` which will print a string like ``gfx906`` when
run if it recognises a GPU on the local system. LLVM will also build a shared
library, libomptarget.rtl.amdgpu.so, which is linked against rocr.
With those libraries installed, and LLVM then built and installed, try:
clang -O2 -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa example.c -o example && ./example
If your build machine is not the target machine or automatic detection of the available GPUs failed, you should also set:
- ``LIBOMPTARGET_DEVICE_ARCHITECTURES=gfx<xyz>,...`` where ``<xyz>`` is the shader core instruction set architecture. For instance, set ``LIBOMPTARGET_DEVICE_ARCHITECTURES=gfx906,gfx90a`` to target AMD GCN5 and CDNA2 devices.
LD_LIBRARY_PATH or rpath/runpath are required to find libomp.so and libomptarget.so.
There is no libc. That is, malloc and printf do not exist. Libm is implemented in terms of the ROCm device library, which will be searched for if linking with '-lm'.
Some versions of the driver for the Radeon VII (gfx906) will error unless the environment variable 'export HSA_IGNORE_SRAMECC_MISREPORT=1' is set.
AMDGPU offloading is a recent addition to LLVM, and the implementation differs from the one that has been shipping in ROCm and AOMP for some time. Early adopters will encounter bugs.
The libraries used by an executable compiled for target offloading are:
- ``libomp.so`` (or similar), the host openmp runtime
- ``libomptarget.so``, the target-agnostic target offloading openmp runtime
- plugins loaded by libomptarget.so:
  - ``libomptarget.rtl.amdgpu.so``
  - ``libomptarget.rtl.cuda.so``
  - ``libomptarget.rtl.x86_64.so``
  - ``libomptarget.rtl.ve.so``
  - and others
- dependencies of those plugins, e.g. cuda/rocr for nvptx/amdgpu
The compiled executable is dynamically linked against a host runtime, e.g.
``libomp.so``, and against the target offloading runtime, ``libomptarget.so``.
These are found like any other dynamic library, by setting rpath or runpath on
the executable, by setting ``LD_LIBRARY_PATH``, or by adding them to the system
search.
``libomptarget.so`` is only supported to work with the associated ``clang``
compiler. On systems with globally installed ``libomptarget.so`` this can be
problematic. For this reason it is recommended to use a Clang configuration
file to automatically configure the environment. For example, store the
following file as ``openmp.cfg`` next to your ``clang`` executable.
# Library paths for OpenMP offloading.
-L '<CFGDIR>/../lib'
-Wl,-rpath='<CFGDIR>/../lib'
The plugins will try to find their dependencies in plugin-dependent fashion.
The cuda plugin is dynamically linked against libcuda if cmake found it at
compiler build time. Otherwise it will attempt to dlopen ``libcuda.so``. It
does not have rpath set.
The amdgpu plugin is linked against ROCr if cmake found it at compiler build
time. Otherwise it will attempt to dlopen ``libhsa-runtime64.so``. It has rpath
set to ``$ORIGIN``, so installing ``libhsa-runtime64.so`` in the same directory
is a way to locate it without environment variables.
In addition to those, there is a compiler runtime library called deviceRTL.
This is compiled from mostly common code into an architecture specific
bitcode library, e.g. ``libomptarget-nvptx-sm_70.bc``.
Clang and the deviceRTL need to match closely as the interface between them changes frequently. Using both from the same monorepo checkout is strongly recommended.
Unlike the host side, which lets environment variables select components, the
deviceRTL that is located in the clang lib directory is preferred. Only if it
is absent is the ``LIBRARY_PATH`` environment variable searched to find a
bitcode file with the right name. This can be overridden by passing a clang
flag, ``--libomptarget-nvptx-bc-path`` or ``--libomptarget-amdgcn-bc-path``.
That flag can specify a directory or an exact bitcode file to use.
For now, the answer is most likely no. Please see :ref:`build_offload_capable_compiler`.
Yes, LLVM/Clang allows math functions and complex arithmetic inside of OpenMP target regions that are compiled for GPUs.
Clang provides a set of wrapper headers that are found first when math.h and complex.h (for C), cmath and complex (for C++), or similar headers are included by the application. These wrappers will eventually include the system version of the corresponding header file after setting up a target device specific environment. Including the system header is important because system headers differ based on the architecture and operating system and may contain preprocessor, variable, and function definitions that need to be available in the target region regardless of the targeted device architecture. However, various functions may require specialized device versions, e.g., sin, and others are only available on certain devices, e.g., __umul64hi. To provide "native" support for math and complex on the respective architecture, Clang will wrap the "native" math functions, e.g., as provided by the device vendor, in an OpenMP begin/end declare variant. These functions will then be picked up instead of the host versions while host-only variables and function definitions are still available. Complex arithmetic and functions are supported through a similar mechanism. It is worth noting that this support requires extensions to the OpenMP begin/end declare variant context selector that are exposed through LLVM/Clang to the user as well.
An experimental way to debug these errors is to use
:ref:`remote process offloading <remote_offloading_plugin>`.
By using ``libomptarget.rtl.rpc.so`` and ``openmp-offloading-server``, it is
possible to explicitly perform memory transfers between processes on the host
CPU and run sanitizers while doing so in order to catch these errors.
Dynamically linked libraries can only be used if there is no device code split between the library and the application. Anything declared on the device inside the shared library will not be visible to the application when it is linked.
Enabling the OpenMP runtime will perform a two-stage build for you. If your host compiler is different from your system-wide compiler, you may need to set the CMake variable GCC_INSTALL_PREFIX so clang will be able to find the correct GCC toolchain in the second stage of the build.
For example, if your system-wide GCC installation is too old to build LLVM and you would like to use a newer GCC, set the CMake variable GCC_INSTALL_PREFIX to inform clang of the GCC installation you would like to use in the second stage.
Currently, there is an experimental CMake find module for OpenMP target
offloading provided by LLVM. It will attempt to find OpenMP target offloading
support for your compiler. The flags necessary for OpenMP target offloading will
be loaded into the ``OpenMPTarget::OpenMPTarget_<device>`` target or the
``OpenMPTarget_<device>_FLAGS`` variable if successful. Currently supported
devices are ``AMDGPU`` and ``NVPTX``.
To use this module, simply add the path to CMake's current module path and call
``find_package``. The module will be installed with your OpenMP installation by
default. Including OpenMP offloading support in an application should now only
require a few additions.
cmake_minimum_required(VERSION 3.20.0)
project(offloadTest VERSION 1.0 LANGUAGES CXX)
list(APPEND CMAKE_MODULE_PATH "${PATH_TO_OPENMP_INSTALL}/lib/cmake/openmp")
find_package(OpenMPTarget REQUIRED NVPTX)
add_executable(offload)
target_link_libraries(offload PRIVATE OpenMPTarget::OpenMPTarget_NVPTX)
target_sources(offload PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/src/Main.cpp)
Using this module requires at least CMake version 3.20.0. Supported languages are C and C++, with Fortran support planned for the future. Compiler support is best for Clang, but this module should also work for compilers from other vendors such as IBM and GNU.
This is a warning that the Nvidia tools will sometimes emit if the offloading
region is too complex. Normally, the CUDA tools attempt to statically determine
how much stack memory each thread needs. This way, when the kernel is launched,
each thread will have as much memory as it needs. If the control flow of the
kernel is too complex, containing recursive calls or nested parallelism, this
analysis can fail. If this warning is triggered, it means that the kernel may
run out of stack memory during execution and crash. The environment variable
``LIBOMPTARGET_STACK_SIZE`` can be used to increase the stack size if this
occurs.
Since LLVM version 15.0, OpenMP offloading supports offloading to multiple architectures at once. This allows for executables to be run on different targets, such as offloading to AMD and NVIDIA GPUs simultaneously, as well as multiple sub-architectures for the same target. Additionally, static libraries will only extract archive members if an architecture is used, allowing users to create generic libraries.
The architecture can be specified manually using ``--offload-arch=``. If
``--offload-arch=`` is present and no ``-fopenmp-targets=`` flag is present,
then the targets will be inferred from the architectures. Conversely, if
``-fopenmp-targets=`` is present with no ``--offload-arch``, then the target
architecture will be set to a default value, usually the architecture supported
by the system LLVM was built on.
For example, an executable can be built that runs on AMDGPU and NVIDIA hardware given that the necessary build tools are installed for both.
clang example.c -fopenmp --offload-arch=gfx90a --offload-arch=sm_80
If just given the architectures we should be able to infer the triples, otherwise we can specify them manually.
clang example.c -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa,nvptx64-nvidia-cuda \
-Xopenmp-target=amdgcn-amd-amdhsa --offload-arch=gfx90a \
-Xopenmp-target=nvptx64-nvidia-cuda --offload-arch=sm_80
When linking against a static library that contains device code for multiple architectures, only the images used by the executable will be extracted.
clang example.c -fopenmp --offload-arch=gfx90a,sm_70,sm_80 -c
llvm-ar rcs libexample.a example.o
clang app.c libexample.a -fopenmp --offload-arch=gfx90a -o app
The supported device images can be viewed using the ``--offloading`` option with
``llvm-objdump``.
clang example.c -fopenmp --offload-arch=gfx90a --offload-arch=sm_80 -o example
llvm-objdump --offloading example
example: file format elf64-x86-64
OFFLOADING IMAGE [0]:
kind elf
arch gfx90a
triple amdgcn-amd-amdhsa
producer openmp
OFFLOADING IMAGE [1]:
kind elf
arch sm_80
triple nvptx64-nvidia-cuda
producer openmp
OpenMP offloading files can currently be experimentally linked with CUDA and HIP files. This will allow OpenMP to call a CUDA device function or vice-versa. However, the global state will be distinct between the two images at runtime. This means any global variables will potentially have different values when queried from OpenMP or CUDA.
Linking CUDA and HIP currently requires enabling a different compilation mode
for CUDA / HIP with ``--offload-new-driver`` and linking with
``--offload-link``. Additionally, ``-fgpu-rdc`` must be used to create a
linkable device image.
clang++ openmp.cpp -fopenmp --offload-arch=sm_80 -c
clang++ cuda.cu --offload-new-driver --offload-arch=sm_80 -fgpu-rdc -c
clang++ openmp.o cuda.o --offload-link -o app
No. Starting with LLVM 15, libomptarget and its plugins are built as LLVM libraries. Because LLVM libraries are not backward compatible, neither are libomptarget and its plugins. As a result, the interfaces between 1) the Clang compiler and libomptarget, 2) the Clang compiler and the device runtime library, and 3) libomptarget and its plugins are not guaranteed to be compatible with an earlier version. Users are responsible for ensuring compatibility when not using the Clang compiler and runtime libraries from the same build. Nevertheless, in order to better support third-party libraries and toolchains that depend on existing libomptarget entry points, contributors are discouraged from making modifications to them.
LLVM provides basic ``libc`` functionality through the LLVM C Library. For
building instructions, refer to the associated LLVM libc documentation. Once
built, this provides a static library called ``libcgpu.a``. See the
documentation for a list of supported functions as well.
To utilize these functions, simply link this library as any other when building
with OpenMP.
clang++ openmp.cpp -fopenmp --offload-arch=gfx90a -lcgpu
For more information on how this is implemented in LLVM/OpenMP's offloading runtime, refer to the runtime documentation.
We recommend taking a look at the OpenMP :doc:`command line argument reference <CommandLineArgumentReference>` page.
When installing OpenMP and other LLVM components, the build time on multicore
systems can be significantly reduced with parallel build jobs. As suggested in
LLVM Techniques, Tips, and Best Practices, one could consider using ``ninja`` as
the generator. This can be done with the CMake option ``cmake -G Ninja``.
Afterward, use ``ninja install`` and specify the number of parallel jobs with
``-j``. The build time can also be reduced by setting the build type to
``Release`` with the ``CMAKE_BUILD_TYPE`` option. Recompilation can also be sped
up by caching previous compilations. Consider enabling ``Ccache`` with
``CMAKE_CXX_COMPILER_LAUNCHER=ccache``.
Feel free to post questions or browse old threads at LLVM Discourse.