Add support for work_group_memory extension #1984

Open

wants to merge 3 commits into IntelPython:master from sommerlukas:work_group_memory

Conversation

sommerlukas
Contributor

Extend kernel argument handling to add support for the work_group_memory extension, allowing users to dynamically allocate local memory for a kernel.

  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • Have you checked performance impact of proposed changes?
  • Have you added documentation for your changes, if necessary?
  • Have you added your changes to the changelog?
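The intended shape of the feature, as far as this PR exposes it to Python, can be sketched roughly as follows. The class and attribute names (`WorkGroupMemory`, `nbytes`) are assumptions for illustration only, not the merged dpctl API; the object carries just a byte count, and the SYCL runtime allocates the local memory when the kernel is submitted:

```python
# Hypothetical sketch of a dynamically sized local-memory kernel argument.
# Names are assumptions for illustration; this is not dpctl's actual API.
class WorkGroupMemory:
    def __init__(self, nbytes):
        if nbytes <= 0:
            raise ValueError("nbytes must be a positive byte count")
        self.nbytes = nbytes


# A kernel needing 64 floats (4 bytes each) of local memory per work-group:
wg_mem = WorkGroupMemory(64 * 4)
```

At submission time such an object would be passed in the kernel argument list like any other argument, and the runtime would bind it to the kernel's local-memory parameter.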

@sommerlukas
Contributor Author

Regarding the stability of the extension: The extension was originally introduced June last year: intel/llvm#13725

Since then, one minor fix (intel/llvm#14937) and one minor API change to a constructor not used in this implementation (intel/llvm#15933) were made.

The implementation was added in intel/llvm#15861 in October, so it should be included in the 2025.1 oneAPI releases.

@oleksandr-pavlyk
Collaborator

Thank you for your contribution @sommerlukas !

Please cite the extension document from intel/llvm or elsewhere.

It would be great to add a test exercising this kernel argument type.

Perhaps libsyclinterface/tests/test_sycl_queue_submit_local_accessor_arg could be used to add such a test.

I am happy to work with you to build the test and add it.

@sommerlukas
Contributor Author

Thanks for your feedback @oleksandr-pavlyk!

Please cite the extension document from intel/llvm or elsewhere.

Extension document is here. As I had discussed with @ndgrigorian offline last week, the implementation in DPCTL focuses on the variant with runtime-specified size:
work_group_memory(size_t num, handler& cgh); (2)

@sommerlukas
Contributor Author

It would be great to add a test exercising this kernel argument type.

Sure, I can look into adding a test. As we would need a kernel to execute to exercise this, what's the recommended way of writing that kernel?

@oleksandr-pavlyk
Collaborator

The first step would be to use the DPC++ or an OpenCL compiler to create a SPV file for a kernel that consumes a work_group_memory argument. Please check the comment containing SYCL source code and how to compile it in libsyclinterface/tests for local_accessor arguments.

Then one would need to use that SPV like it is used in the test for local_accessor argument.

@sommerlukas
Contributor Author

The first step would be to use the DPC++ or an OpenCL compiler to create a SPV file for a kernel that consumes a work_group_memory argument. Please check the comment containing SYCL source code and how to compile it in libsyclinterface/tests for local_accessor arguments.

Then one would need to use that SPV like it is used in the test for local_accessor argument.

I've taken a look at libsyclinterface/tests/test_sycl_queue_submit_local_accessor_arg.cpp and noticed that this is testing through a C++ file, not Python.

The current implementation of this PR only exposes the work_group_memory in Python. Should I write a test using a compiled SYCL kernel as .spv in Python? Or do we need to extend the implementation into libsyclinterface?

@ndgrigorian
Collaborator

Builds seem to be failing, likely because the SYCL_EXT_ONEAPI_WORK_GROUP_MEMORY isn't defined and the header file for the extension isn't included.

It would probably work best in a helper file—for an example, we use the experimental sycl_ext_oneapi_complex in our element-wise functions.
https://github.com/IntelPython/dpctl/blob/master/dpctl/tensor/libtensor/include/kernels/elementwise_functions/sycl_complex.hpp

We define the macro and use the header file to indirectly include sycl/ext/oneapi/experimental/sycl_complex.hpp.

@sommerlukas
Contributor Author

Builds seem to be failing, likely because the SYCL_EXT_ONEAPI_WORK_GROUP_MEMORY isn't defined and the header file for the extension isn't included.

It would probably work best in a helper file—for an example, we use the experimental sycl_ext_oneapi_complex in our element-wise functions. https://github.com/IntelPython/dpctl/blob/master/dpctl/tensor/libtensor/include/kernels/elementwise_functions/sycl_complex.hpp

We define the macro and use the header file to indirectly include sycl/ext/oneapi/experimental/sycl_complex.hpp.

The use in that header isn't very idiomatic. Feature test macros for SYCL extensions are not supposed to be set by the application. Instead, the implementation should set them to a value defined in the extension specification, as described in the SYCL specification.
Most extensions simply define the value 1 to indicate support.

In DPC++, this is implemented through the feature_test header: https://github.com/intel/llvm/blob/sycl/sycl/source/feature_test.hpp.in

SYCL_EXT_ONEAPI_WORK_GROUP_MEMORY is set to 1 here on SYCL tip, but in the 2025.0 release, the implementation of this extension wasn't present yet, so the macro won't be defined.

If CI wants to build with the 2025.0 release compiler until a newer release comes out, I need to conditionally enable the feature based on the macro and throw an exception in case somebody tries to use it with DPC++ version 2025.0 or older.
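At the Python level, the conditional-enablement idea described above might look roughly like this. The flag and function names here are assumptions, not dpctl's actual API; in practice the flag would be derived at build time from the compiler's SYCL_EXT_ONEAPI_WORK_GROUP_MEMORY feature-test macro:

```python
# Assumed to be set at build time from the feature-test macro;
# False models a build against DPC++ 2025.0 or older.
WORK_GROUP_MEMORY_AVAILABLE = False


def make_work_group_memory(nbytes):
    """Return a work-group-memory argument, or raise if unsupported."""
    if not WORK_GROUP_MEMORY_AVAILABLE:
        raise NotImplementedError(
            "work_group_memory is not supported by the DPC++ runtime "
            "this package was built with (needs a release newer than 2025.0)"
        )
    return ("work_group_memory", nbytes)
```

With this pattern, a package built against 2025.0 still imports and runs; only an actual attempt to use the extension fails, with a clear error.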

@ndgrigorian
Collaborator

The use in that header isn't very idiomatic. Feature test macros for SYCL extensions are not supposed to be set by the application. Instead, the implementation should set them to a value defined in the extension specification, as described in the SYCL specification. Most extensions simply define the value 1 for supported.

I see, good to know. I noticed that some of the experimental extensions don't need to be explicitly set, but SYCL_EXT_ONEAPI_COMPLEX still does. Thanks for clarifying.

Making it conditional based on compiler version would definitely be preferable.

@oleksandr-pavlyk
Collaborator

This PR needs work; I suggest we take time to get it right. It also brought to light that the dpctl Python API does not expose the LocalAccessor kernel argument.

I have created a branch in my fork that adds changes on top of Lukas's changes in this PR (sommerlukas#1).

  1. It adds test_sycl_queue_work_group_memory_arg.cpp as well as two new SPV files generated using clang++ from the SYCL nightly build dated 2025-02-07.
  2. It implements the dpctl._sycl_queue.LocalAccessor type to allow specifying a local accessor argument to the kernel.
  3. It adds Python tests for submissions using work-group memory and local accessors.

What is still to do:

  1. Do not introduce dpctl.experimental. Move definition of WorkGroupMemory type to _sycl_queue.pyx.
  2. Support specifying the type of array passed to WorkGroupMemory to avoid the need to recalculate the number of elements into a number of bytes.
  3. Submission of these kernels in the test only works on Level Zero devices, but fails on OpenCL devices (raises a SyclKernelSubmissionError exception). The native ELF executes fine on all devices available to me ("opencl:cpu", "opencl:gpu", "level_zero:gpu", and "cuda:gpu").

dpctl code to build interoperability kernel bundle may need updating.

@sommerlukas
Contributor Author

sommerlukas commented Feb 10, 2025

This PR needs work, I suggest we take time to get it right, and it also brought to light that dpctl Python API does not expose LocalAccessor kernel argument.

I have created a branch in my fork that adds changes on top of Lukas's changes in this PR (sommerlukas#1).

Hi @oleksandr-pavlyk,

I'm also working on improving the implementation locally, I was planning to push as soon as I have tests ready.

What I've done so far:

  1. Reworked the implementation of work_group_memory to also make it available in libsyclinterface.
  2. Made it conditionally available depending on whether the DPC++ version supports the extension and added an availability query.
  3. I'm currently working on a test for submitting a kernel with work_group_memory, for both the Python interface and via libsyclinterface.
  1. Do not introduce dpctl.experimental. Move definition of WorkGroupMemory type to _sycl_queue.pyx.

The reason I introduced dpctl.experimental is that I have another PR in the pipeline that adds support for the raw_kernel_arg extension. From my offline discussion with @ndgrigorian, support for the kernel_compiler extension is also on the roadmap, so I think bundling these extensions in experimental makes sense.

  1. Support specifying type of array passed to WorkGroupMemory to avoid the need to recalculate number of elements into number of bytes.

I had considered that, but there was no precedent in the Python interface, as support for local_accessor wasn't there.

If we add this variant, IMHO it should be in addition to the current version where the number of bytes is specified. Unless we build something complicated with Cython/ctypes, we would otherwise limit the possible array element types to a fixed list, whereas the DPC++ extension allows user-defined types.
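For the fixed set of element types a typed variant could support, the elements-to-bytes recalculation it would hide is a one-liner. This sketch uses struct typecodes purely for illustration; a user-defined element type has no such typecode and would still need an explicit byte size, which is why the byte-count variant remains useful:

```python
import struct


def wg_nbytes(num_elems, typecode):
    """Convert an element count to a byte count for a struct typecode."""
    return num_elems * struct.calcsize(typecode)


nbytes_f = wg_nbytes(64, "f")  # 64 single-precision floats
nbytes_d = wg_nbytes(8, "d")   # 8 double-precision floats
```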

@oleksandr-pavlyk
Collaborator

@sommerlukas The PR sommerlukas#1 was opened against your branch in your fork with my changes. If you do not intend to accept it because you have already worked something out locally, please let me know, and I will break it up into smaller changes and open follow-up PRs.

@sommerlukas
Contributor Author

@sommerlukas The PR sommerlukas#1 was opened against your branch in your fork with my changes. If you do not intend to accept it because you have already worked something out locally, please let me know, and I will break it up into smaller changes and open follow-up PRs.

Thanks for opening that PR; I think there's some overlap with what I've done locally.

Maybe it makes sense to break out the changes exposing local_accessor in the Python API and submit them as a separate PR to the main repository directly.

For the changes relating to work_group_memory, I'm currently looking at how I can reuse some of your work in my local changes, e.g., the test you wrote for libsyclinterface.

@sommerlukas
Contributor Author

I've pushed my local changes now. The work group memory extension is now also added to libsyclinterface. It's conditionally enabled based on the SYCL extension's feature test macro and there's a function to query availability, in libsyclinterface as well as in the Python API.

I also added a test for the Python interface and a test for libsyclinterface, based on @oleksandr-pavlyk's test in sommerlukas#1.

@ndgrigorian
Collaborator

dpctl code to build interoperability kernel bundle may need updating.
While on this subject, should updating dpctl.program and dpctl.program.SyclProgram to use kernel_bundle be considered as well?

This may make the mapping between the class and its SYCL counterpart less confusing, though there are downstream implications.

@sommerlukas
Contributor Author

dpctl code to build interoperability kernel bundle may need updating.
While on this subject, should updating dpctl.program and dpctl.program.SyclProgram to use kernel_bundle be considered as well?

This may make the mapping between the class and its SYCL counterpart less confusing, though there are downstream implications.

As I mentioned offline, there's also the possibility of replacing the interop used to create a kernel_bundle from a SPIR-V binary or OpenCL C source with the new SYCL kernel compiler extension. The extension is directly in SYCL, so there would be no need to reach into lower-level APIs, and maintenance effort would be reduced, as the SYCL version works across Level Zero and OpenCL.

The other benefit is that it integrates into kernel_bundle and is the same interface used for SYCL-RTC, i.e., creating an executable kernel_bundle from SYCL source code at runtime.

If there are downstream implications, maybe it makes sense to build a new kernel_bundle API alongside the existing program API and keep both for a transition period. The kernel compiler extension is still relatively fresh and was only added in the latest release.


@oleksandr-pavlyk
Collaborator

  1. Submission of these kernels in the test only works on Level Zero devices, but fails on OpenCL devices (raises a SyclKernelSubmissionError exception). The native ELF executes fine on all devices available to me ("opencl:cpu", "opencl:gpu", "level_zero:gpu", and "cuda:gpu").

dpctl code to build interoperability kernel bundle may need updating.

My testing was done using a DPC++ runtime from a SYCL nightly build, which may have some bugs. Using that runtime library, only the "level_zero:gpu" device behaved as I expected. But testing my branch with DPC++ 2025.0.4, I did not experience any problems with OpenCL or CUDA devices.

@oleksandr-pavlyk
Collaborator

@sommerlukas While the kernel_compiler extension is an exciting development, we must wait for it to land in the shipping oneAPI DPC++ compiler before adopting it in dpctl. When that happens, it might indeed be good to rewrite dpctl.program to avoid interoperability.

@oleksandr-pavlyk
Collaborator

Apart from the details of the implementation suggested here, what is the idea behind exposing a work-group-memory kernel argument in dpctl? For this to matter, a Python user must have a way to create kernels that take work-group-memory arguments.

I feel that creating dpctl.experimental is the wrong thing to do. This namespace reflects an implementation detail (that a feature is, at present, considered experimental in SYCL) and does not reflect the functionality provided to the user of dpctl. The namespace dpctl.program.experimental is more appropriate for exposing the WorkGroupMemory object, in my opinion.

@sommerlukas
Contributor Author

@sommerlukas While kernel_compiler extension is an exciting development, we must wait for it to land in the shipping oneAPI DPC++ compiler before adopting it in dpctl. When that happens, it might indeed be good to rewrite dpctl.program to avoid interoperability.

Yeah, this was not meant for this PR, more medium-term.

Support for SPIR-V and OpenCL source languages should already be in the 2025.0 package released a few weeks ago, so 2025.0 should match DPCTL's existing support.

SYCL as a source language is going to be in one of the upcoming 2025.X releases.

@sommerlukas
Contributor Author

Apart from the details of the implementation suggested here, what is the idea behind exposing a work-group-memory kernel argument in dpctl? For this to matter, a Python user must have a way to create kernels that take work-group-memory arguments.

The work_group_memory just decays to a pointer to local memory, so it is possible to use it with an OpenCL kernel that takes a __local pointer as a parameter.

E.g., users could compile an OpenCL kernel like this with DPCTL and pass a work_group_memory for the __local float* parameter as the second argument to the kernel launch:

__kernel void example_kernel(__global float *input, __local float *local_data) {
...
}

Our concrete use case is with SYCL kernels compiled via DPC++ that use work_group_memory as kernel argument.
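Sizing the work-group memory for the __local float* parameter in the kernel above could then be done on the Python side. Assuming, purely for illustration, that the kernel wants one float per work-item:

```python
import struct

# One float of local memory per work-item (an illustrative assumption,
# not part of the example kernel's actual contract).
wg_size = 64                                   # work-group size
local_nbytes = wg_size * struct.calcsize("f")  # bytes of local memory needed
```

A work-group-memory argument of local_nbytes bytes would then be passed at launch in place of the __local float* parameter.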

@sommerlukas
Contributor Author

I feel that creating dpctl.experimental is the wrong thing to do. This namespace reflects an implementation detail (that a feature is, at present, considered experimental in SYCL) and does not reflect the functionality provided to the user of dpctl. The namespace dpctl.program.experimental is more appropriate for exposing the WorkGroupMemory object, in my opinion.

I'm happy to move it somewhere else if you and @ndgrigorian prefer. I'm also fine with not marking it experimental at all if that's preferred.

@sommerlukas
Contributor Author

@oleksandr-pavlyk @ndgrigorian Any hunch why the conda tests on CI fail? I can't reproduce the issue locally with my venv. Any recommendations on how to debug this?

@oleksandr-pavlyk
Collaborator

Apart from the details of the implementation suggested here, what is the idea behind exposing a work-group-memory kernel argument in dpctl? For this to matter, a Python user must have a way to create kernels that take work-group-memory arguments.

The work_group_memory just decays to a pointer to local memory, so it is possible to use it with an OpenCL kernel that takes a __local pointer as parameter.

E.g., users could compile a OpenCL kernel like this with DPCTL and pass a work_group_memory for the __local float* parameter as second argument to kernel launch:

__kernel void example_kernel(__global float *input, __local float *local_data) {
...
}

Thanks for explaining, @sommerlukas, this is useful. Please add a test for an OpenCL device that uses a local pointer in a kernel specified in OpenCL and that is invoked using work-group memory.

Extend kernel argument handling to add support for the
work_group_memory extension, allowing users to dynamically allocate
local memory for a kernel.

Signed-off-by: Lukas Sommer <[email protected]>
@sommerlukas
Contributor Author

Thanks for explaining, @sommerlukas, this is useful. Please add a test for an OpenCL device that uses a local pointer in a kernel specified in OpenCL and that is invoked using work-group memory.

I've added a test using an OpenCL kernel with work_group_memory in the latest commit.

@ndgrigorian
Collaborator

I'm happy to move it to some other place if you and @ndgrigorian prefer. I'm also fine to not mark it experimental at all if that's preferred.

I think I more or less agree with @oleksandr-pavlyk , putting it aside in a special namespace which marks it experimental doesn't do much for a Python-level user—it's an implementation detail and may or may not change in the compiler in the future. If there's a check for availability already present, we can instead document that the compiler version is relevant to whether the extension works.

@ndgrigorian
Collaborator

@oleksandr-pavlyk @ndgrigorian Any hunch why the conda tests on CI fail? I can't reproduce the issue locally with my venv. Any recommendations on how to debug this?

I'll take a look at the CI too, but the best way I would imagine to debug would be to download the artifact and install it into a Conda environment, if you can't reproduce locally.

@sommerlukas
Contributor Author

I'm happy to move it to some other place if you and @ndgrigorian prefer. I'm also fine to not mark it experimental at all if that's preferred.

I think I more or less agree with @oleksandr-pavlyk , putting it aside in a special namespace which marks it experimental doesn't do much for a Python-level user—it's an implementation detail and may or may not change in the compiler in the future. If there's a check for availability already present, we can instead document that the compiler version is relevant to whether the extension works.

Ok, makes sense. I've removed the experimental namespace in the most recent commit and moved the Python API for WorkGroupMemory to the queue, similar to #1991, as WorkGroupMemory would also be used with kernel launches on queues.

@coveralls
Collaborator

Coverage Status: 88.117% (first build) when pulling 67c915e on sommerlukas:work_group_memory into a9bba0b on IntelPython:master.

@sommerlukas
Contributor Author

sommerlukas commented Feb 13, 2025

@oleksandr-pavlyk @ndgrigorian Any hunch why the conda tests on CI fail? I can't reproduce the issue locally with my venv. Any recommendations on how to debug this?

I'll take a look at the CI too, but the best way I would imagine to debug would be to download the artifact and install it into a Conda environment, if you can't reproduce locally.

Removing the experimental package and putting WorkGroupMemory in existing files seems to have resolved the problem.

Building and testing on Windows seems fine. For the failure on Linux, I'm not sure this is related to my changes.

The error message points to _diagnostics, which I did not modify in this PR:

$SRC_DIR/_skbuild/linux-x86_64-3.13/cmake-build/dpctl/_diagnostics.cxx:2232:80: error: unknown type name '__pyx_vectorcallfunc'; did you mean 'vectorcallfunc'?

@ndgrigorian @oleksandr-pavlyk Any idea what could be causing this?

@diptorupd
Contributor

@oleksandr-pavlyk @sommerlukas nice set of changes. numba-dpex also supports local accessors.

@ndgrigorian
Collaborator

ndgrigorian commented Feb 13, 2025

@ndgrigorian @oleksandr-pavlyk Any idea what could be causing this?

The failure looks like one I would see when building with 3.13t (free-threaded Python).

Indeed, looking at the log—3.13.1-h9a34b6e_5_cp313t is the installed version. I'll try to open a PR to avoid free-threaded Python for now, or to fix build failures with it. Cython supports it now, so I don't see why we couldn't allow it.

@ndgrigorian
Collaborator

@sommerlukas
Sorry for the delay, I am going to take a look at this PR over this weekend/at start of next week and look into CI failures. Can you merge in the master branch or rebase onto master to see if picking up some of the CI fixes helps?

Also a quick glance at gh-1991 would be appreciated, if any of it is necessary for your work—this wasn't clear to me.
