Add support for work_group_memory extension #1984

Open

wants to merge 3 commits into IntelPython:master from sommerlukas:work_group_memory

Conversation

sommerlukas
Contributor

Extend kernel argument handling to add support for the work_group_memory extension, allowing users to dynamically allocate local memory for a kernel.

  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • Have you checked performance impact of proposed changes?
  • Have you added documentation for your changes, if necessary?
  • Have you added your changes to the changelog?
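The intended shape of the feature, as far as this PR exposes it to Python, can be sketched roughly as follows. The class and attribute names (`WorkGroupMemory`, `nbytes`) are assumptions for illustration only, not the merged dpctl API; the object carries just a byte count, and the SYCL runtime allocates the local memory when the kernel is submitted:

```python
# Hypothetical sketch of a dynamically sized local-memory kernel argument.
# Names are assumptions for illustration; this is not dpctl's actual API.
class WorkGroupMemory:
    def __init__(self, nbytes):
        if nbytes <= 0:
            raise ValueError("nbytes must be a positive byte count")
        self.nbytes = nbytes


# A kernel needing 64 floats (4 bytes each) of local memory per work-group:
wg_mem = WorkGroupMemory(64 * 4)
```

At submission time such an object would be passed in the kernel argument list like any other argument, and the runtime would bind it to the kernel's local-memory parameter.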

@sommerlukas
Contributor Author

Regarding the stability of the extension: The extension was originally introduced June last year: intel/llvm#13725

Since then, one minor fix (intel/llvm#14937) and one minor API change to a constructor not used in this implementation (intel/llvm#15933) were made.

The implementation was added in intel/llvm#15861 in October, so it should be included in the 2025.1 oneAPI releases.

@oleksandr-pavlyk
Collaborator

Thank you for your contribution @sommerlukas !

Please cite the extension document from intel/llvm or elsewhere.

It would be great to add a test exercising this kernel argument type.

Perhaps libsyclinterface/tests/test_sycl_queue_submit_local_accessor_arg could be used to add such a test.

I am happy to work with you to build the test and add it.

@sommerlukas
Contributor Author

Thanks for your feedback @oleksandr-pavlyk!

Please cite the extension document from intel/llvm or elsewhere.

Extension document is here. As I had discussed with @ndgrigorian offline last week, the implementation in DPCTL focuses on the variant with runtime-specified size:
work_group_memory(size_t num, handler& cgh); (2)

@sommerlukas
Contributor Author

It would be great to add a test exercising this kernel argument type.

Sure, I can look into adding a test. As we would need a kernel to execute to exercise this, what's the recommended way of writing that kernel?

@oleksandr-pavlyk
Collaborator

The first step would be to use the DPC++ or an OpenCL compiler to create a SPV file for a kernel that consumes a work_group_memory argument. Please check the comment containing SYCL source code and how to compile it in libsyclinterface/tests for local_accessor arguments.

Then one would need to use that SPV like it is used in the test for local_accessor argument.

@sommerlukas
Contributor Author

The first step would be to use the DPC++ or an OpenCL compiler to create a SPV file for a kernel that consumes a work_group_memory argument. Please check the comment containing SYCL source code and how to compile it in libsyclinterface/tests for local_accessor arguments.

Then one would need to use that SPV like it is used in the test for local_accessor argument.

I've taken a look at libsyclinterface/tests/test_sycl_queue_submit_local_accessor_arg.cpp and noticed that this is testing through a C++ file, not Python.

The current implementation of this PR only exposes the work_group_memory in Python. Should I write a test using a compiled SYCL kernel as .spv in Python? Or do we need to extend the implementation into libsyclinterface?

@ndgrigorian
Collaborator

Builds seem to be failing, likely because the SYCL_EXT_ONEAPI_WORK_GROUP_MEMORY isn't defined and the header file for the extension isn't included.

It would probably work best in a helper file—for an example, we use the experimental sycl_ext_oneapi_complex in our element-wise functions.
https://github.com/IntelPython/dpctl/blob/master/dpctl/tensor/libtensor/include/kernels/elementwise_functions/sycl_complex.hpp

We define the macro and use the header file to indirectly include sycl/ext/oneapi/experimental/sycl_complex.hpp.

@sommerlukas
Contributor Author

Builds seem to be failing, likely because the SYCL_EXT_ONEAPI_WORK_GROUP_MEMORY isn't defined and the header file for the extension isn't included.

It would probably work best in a helper file—for an example, we use the experimental sycl_ext_oneapi_complex in our element-wise functions. https://github.com/IntelPython/dpctl/blob/master/dpctl/tensor/libtensor/include/kernels/elementwise_functions/sycl_complex.hpp

We define the macro and use the header file to indirectly include sycl/ext/oneapi/experimental/sycl_complex.hpp.

The use in that header isn't very idiomatic. Feature test macros for SYCL extensions are not supposed to be set by the application. Instead, the implementation should set them to a value defined in the extension specification, as described in the SYCL specification.
Most extensions simply define the value 1 to indicate support.

In DPC++, this is implemented through the feature_test header: https://github.com/intel/llvm/blob/sycl/sycl/source/feature_test.hpp.in

SYCL_EXT_ONEAPI_WORK_GROUP_MEMORY is set to 1 here on SYCL tip, but in the 2025.0 release, the implementation of this extension wasn't present yet, so the macro won't be defined.

If CI wants to build with the 2025.0 release compiler until a newer release comes out, I need to conditionally enable the feature based on the macro and throw an exception in case somebody tries to use it with DPC++ version 2025.0 or older.
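At the Python level, the conditional-enablement idea described above might look roughly like this. The flag and function names here are assumptions, not dpctl's actual API; in practice the flag would be derived at build time from the compiler's SYCL_EXT_ONEAPI_WORK_GROUP_MEMORY feature-test macro:

```python
# Assumed to be set at build time from the feature-test macro;
# False models a build against DPC++ 2025.0 or older.
WORK_GROUP_MEMORY_AVAILABLE = False


def make_work_group_memory(nbytes):
    """Return a work-group-memory argument, or raise if unsupported."""
    if not WORK_GROUP_MEMORY_AVAILABLE:
        raise NotImplementedError(
            "work_group_memory is not supported by the DPC++ runtime "
            "this package was built with (needs a release newer than 2025.0)"
        )
    return ("work_group_memory", nbytes)
```

With this pattern, a package built against 2025.0 still imports and runs; only an actual attempt to use the extension fails, with a clear error.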

@ndgrigorian
Collaborator

The use in that header isn't very idiomatic. Feature test macros for SYCL extensions are not supposed to be set by the application. Instead, the implementation should set them to a value defined in the extension specification, as described in the SYCL specification. Most extensions simply define the value 1 for supported.

I see, good to know. I noticed that some of the experimental extensions don't need to be explicitly set, but SYCL_EXT_ONEAPI_COMPLEX still does. Thanks for clarifying.

Making it conditional based on compiler version would definitely be preferable.

@oleksandr-pavlyk
Collaborator

This PR needs work; I suggest we take time to get it right. It also brought to light that the dpctl Python API does not expose the LocalAccessor kernel argument.

I have created a branch in my fork that adds changes on top of Lukas's changes in this PR (sommerlukas#1).

  1. It adds test_sycl_queue_work_group_memory_arg.cpp as well as two new SPV files generated using clang++ from the SYCL nightly build dated 2025-02-07.
  2. It implements the dpctl._sycl_queue.LocalAccessor type to allow specifying a local accessor argument to the kernel.
  3. It adds Python tests for submissions using work-group memory and local accessors.

What is still to do:

  1. Do not introduce dpctl.experimental. Move definition of WorkGroupMemory type to _sycl_queue.pyx.
  2. Support specifying the type of array passed to WorkGroupMemory to avoid the need to recalculate the number of elements into a number of bytes.
  3. Submission of these kernels in the test only works on Level Zero devices, but fails on OpenCL devices (raises a SyclKernelSubmissionError exception). The native ELF executes fine on all devices available to me ("opencl:cpu", "opencl:gpu", "level_zero:gpu", and "cuda:gpu").

dpctl code to build interoperability kernel bundle may need updating.

@sommerlukas
Contributor Author

sommerlukas commented Feb 10, 2025

This PR needs work, I suggest we take time to get it right, and it also brought to light that dpctl Python API does not expose LocalAccessor kernel argument.

I have created a branch in my fork that adds changes on top of Lukas's changes in this PR (sommerlukas#1).

Hi @oleksandr-pavlyk,

I'm also working on improving the implementation locally, I was planning to push as soon as I have tests ready.

What I've done so far:

  1. Reworked the implementation of work_group_memory to also make it available in libsyclinterface.
  2. Made it conditionally available depending on whether the DPC++ version supports the extension and added an availability query.
  3. I'm currently working on a test for submitting a kernel with work_group_memory, for both the Python interface and via libsyclinterface.
  1. Do not introduce dpctl.experimental. Move definition of WorkGroupMemory type to _sycl_queue.pyx.

The reason I introduced dpctl.experimental is that I have another PR in the pipeline that adds support for the raw_kernel_arg extension. From my offline discussion with @ndgrigorian, support for the kernel_compiler extension is also on the roadmap, so I think bundling these extensions in experimental makes sense.

  1. Support specifying type of array passed to WorkGroupMemory to avoid the need to recalculate number of elements into number of bytes.

I had considered that, but there was no precedent in the Python interface, as support for local_accessor wasn't there.

If we add this variant, IMHO it should be in addition to the current version where the number of bytes is specified. Unless we build something complicated with Cython/ctypes, we would otherwise limit the possible array element types to a fixed list, whereas the DPC++ extension allows user-defined types.
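For the fixed set of element types a typed variant could support, the elements-to-bytes recalculation it would hide is a one-liner. This sketch uses struct typecodes purely for illustration; a user-defined element type has no such typecode and would still need an explicit byte size, which is why the byte-count variant remains useful:

```python
import struct


def wg_nbytes(num_elems, typecode):
    """Convert an element count to a byte count for a struct typecode."""
    return num_elems * struct.calcsize(typecode)


nbytes_f = wg_nbytes(64, "f")  # 64 single-precision floats
nbytes_d = wg_nbytes(8, "d")   # 8 double-precision floats
```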

@oleksandr-pavlyk
Collaborator

@sommerlukas The PR sommerlukas#1 was opened against your branch in your fork with my changes. If you do not intend to accept it because you have already worked something out locally, please let me know, and I will break it up into smaller changes and open follow-up PRs.

@sommerlukas
Contributor Author

@sommerlukas The PR sommerlukas#1 was opened against your branch in your fork with my changes. If you do not intend to accept it because you have already worked something out locally, please let me know, and I will break it up into smaller changes and open follow-up PRs.

Thanks for opening that PR; I think there's some overlap with what I've done locally.

Maybe it makes sense to break out the changes exposing local_accessor in the Python API and submit them as a separate PR to the main repository directly.

For the changes relating to work_group_memory, I'm currently looking at how I can reuse some of your work in my local changes, e.g., the test you wrote for libsyclinterface.

@sommerlukas
Contributor Author

I've pushed my local changes now. The work group memory extension is now also added to libsyclinterface. It's conditionally enabled based on the SYCL extension's feature test macro and there's a function to query availability, in libsyclinterface as well as in the Python API.

I also added a test for the Python interface and a test for libsyclinterface, based on @oleksandr-pavlyk's test in sommerlukas#1.

@ndgrigorian
Collaborator

dpctl code to build interoperability kernel bundle may need updating.
While on this subject, should updating dpctl.program and dpctl.program.SyclProgram to use kernel_bundle be considered as well?

This may make the mapping between the class and its SYCL counterpart less confusing, though there are downstream implications.

@sommerlukas
Contributor Author

dpctl code to build interoperability kernel bundle may need updating.
While on this subject, should updating dpctl.program and dpctl.program.SyclProgram to use kernel_bundle be considered as well?

This may make the mapping between the class and its SYCL counterpart less confusing, though there are downstream implications.

As I mentioned offline, there's also the possibility of replacing the interop used to create a kernel_bundle from a SPIR-V binary or OpenCL C source with the new SYCL kernel compiler extension. The extension is directly in SYCL, so there would be no need to reach into lower-level APIs, and maintenance effort would be reduced, as the SYCL version works across Level Zero and OpenCL.

The other benefit is that it integrates into kernel_bundle and is the same interface used for SYCL-RTC, i.e., creating an executable kernel_bundle from SYCL source code at runtime.

If there are downstream implications, maybe it makes sense to build a new kernel_bundle API alongside the existing program API and keep both for a transition period. The kernel compiler extension is still relatively fresh and was only added in the latest release.


@oleksandr-pavlyk
Collaborator

  1. Submission of these kernels in the test only works on Level Zero devices, but fails on OpenCL devices (raises a SyclKernelSubmissionError exception). The native ELF executes fine on all devices available to me ("opencl:cpu", "opencl:gpu", "level_zero:gpu", and "cuda:gpu").

dpctl code to build interoperability kernel bundle may need updating.

My testing was done using a DPC++ runtime from a SYCL nightly build, which may have some bugs. Using that runtime library, only the "level_zero:gpu" device behaved as I expected. But testing my branch with DPC++ 2025.0.4, I did not experience any problems with OpenCL or CUDA devices.

@oleksandr-pavlyk
Collaborator

@sommerlukas While the kernel_compiler extension is an exciting development, we must wait for it to land in the shipping oneAPI DPC++ compiler before adopting it in dpctl. When that happens, it might indeed be good to rewrite dpctl.program to avoid interoperability.

@oleksandr-pavlyk
Collaborator

Apart from the details of the implementation suggested here, what is the idea behind exposing a work-group-memory kernel argument in dpctl? For this to matter, a Python user must have a way to create kernels that take work-group-memory arguments.

I feel that creating dpctl.experimental is the wrong thing to do. This namespace reflects an implementation detail (that a feature is, at present, considered experimental in SYCL) and does not reflect the functionality provided to the user of dpctl. The namespace dpctl.program.experimental is more appropriate for exposing the WorkGroupMemory object, in my opinion.

@sommerlukas
Contributor Author

@sommerlukas While kernel_compiler extension is an exciting development, we must wait for it to land in the shipping oneAPI DPC++ compiler before adopting it in dpctl. When that happens, it might indeed be good to rewrite dpctl.program to avoid interoperability.

Yeah, this was not meant for this PR, more medium-term.

Support for SPIR-V and OpenCL source languages should already be in the 2025.0 package released a few weeks ago, so 2025.0 should match DPCTL's existing support.

SYCL as a source language is going to be in one of the upcoming 2025.X releases.

@sommerlukas
Contributor Author

Apart from the details of the implementation suggested here, what is the idea behind exposing a work-group-memory kernel argument in dpctl? For this to matter, a Python user must have a way to create kernels that take work-group-memory arguments.

The work_group_memory just decays to a pointer to local memory, so it is possible to use it with an OpenCL kernel that takes a __local pointer as a parameter.

E.g., users could compile an OpenCL kernel like this with DPCTL and pass a work_group_memory for the __local float* parameter as the second argument to the kernel launch:

__kernel void example_kernel(__global float *input, __local float *local_data) {
...
}

Our concrete use case is with SYCL kernels compiled via DPC++ that use work_group_memory as kernel argument.
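Sizing the work-group memory for the __local float* parameter in the kernel above could then be done on the Python side. Assuming, purely for illustration, that the kernel wants one float per work-item:

```python
import struct

# One float of local memory per work-item (an illustrative assumption,
# not part of the example kernel's actual contract).
wg_size = 64                                   # work-group size
local_nbytes = wg_size * struct.calcsize("f")  # bytes of local memory needed
```

A work-group-memory argument of local_nbytes bytes would then be passed at launch in place of the __local float* parameter.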

@sommerlukas
Contributor Author

I feel that creating dpctl.experimental is the wrong thing to do. This namespace reflects an implementation detail (that a feature is, at present, considered experimental in SYCL) and does not reflect the functionality provided to the user of dpctl. The namespace dpctl.program.experimental is more appropriate for exposing the WorkGroupMemory object, in my opinion.

I'm happy to move it somewhere else if you and @ndgrigorian prefer. I'm also fine with not marking it experimental at all if that's preferred.

@sommerlukas
Contributor Author

@oleksandr-pavlyk @ndgrigorian Any hunch why the conda tests on CI fail? I can't reproduce the issue locally with my venv. Any recommendations on how to debug this?

@oleksandr-pavlyk
Collaborator

Apart from the details of the implementation suggested here, what is the idea behind exposing a work-group-memory kernel argument in dpctl? For this to matter, a Python user must have a way to create kernels that take work-group-memory arguments.

The work_group_memory just decays to a pointer to local memory, so it is possible to use it with an OpenCL kernel that takes a __local pointer as parameter.

E.g., users could compile a OpenCL kernel like this with DPCTL and pass a work_group_memory for the __local float* parameter as second argument to kernel launch:

__kernel void example_kernel(__global float *input, __local float *local_data) {
...
}

Thanks for explaining, @sommerlukas, this is useful. Please add a test for an OpenCL device that uses a local pointer in a kernel specified in OpenCL and that is invoked using work-group memory.

Extend kernel argument handling to add support for the
work_group_memory extension, allowing users to dynamically allocate
local memory for a kernel.

Signed-off-by: Lukas Sommer <[email protected]>
@sommerlukas
Contributor Author

Thanks for explaining, @sommerlukas, this is useful. Please add a test for an OpenCL device that uses a local pointer in a kernel specified in OpenCL and that is invoked using work-group memory.

I've added a test using an OpenCL kernel with work_group_memory in the latest commit.

@ndgrigorian
Collaborator

I'm happy to move it to some other place if you and @ndgrigorian prefer. I'm also fine to not mark it experimental at all if that's preferred.

I think I more or less agree with @oleksandr-pavlyk , putting it aside in a special namespace which marks it experimental doesn't do much for a Python-level user—it's an implementation detail and may or may not change in the compiler in the future. If there's a check for availability already present, we can instead document that the compiler version is relevant to whether the extension works.

@ndgrigorian
Collaborator

@oleksandr-pavlyk @ndgrigorian Any hunch why the conda tests on CI fail? I can't reproduce the issue locally with my venv. Any recommendations on how to debug this?

I'll take a look at the CI too, but the best way I would imagine to debug would be to download the artifact and install it into a Conda environment, if you can't reproduce locally.

@sommerlukas
Contributor Author

I'm happy to move it to some other place if you and @ndgrigorian prefer. I'm also fine to not mark it experimental at all if that's preferred.

I think I more or less agree with @oleksandr-pavlyk , putting it aside in a special namespace which marks it experimental doesn't do much for a Python-level user—it's an implementation detail and may or may not change in the compiler in the future. If there's a check for availability already present, we can instead document that the compiler version is relevant to whether the extension works.

Ok, makes sense. I've removed the experimental namespace in the most recent commit and moved the Python API for WorkGroupMemory to the queue, similar to #1991, as WorkGroupMemory would also be used with kernel launches on queues.

@coveralls
Collaborator

Coverage Status: 88.117% (first build) when pulling 67c915e on sommerlukas:work_group_memory into a9bba0b on IntelPython:master.

@sommerlukas
Contributor Author

sommerlukas commented Feb 13, 2025

@oleksandr-pavlyk @ndgrigorian Any hunch why the conda tests on CI fail? I can't reproduce the issue locally with my venv. Any recommendations on how to debug this?

I'll take a look at the CI too, but the best way I would imagine to debug would be to download the artifact and install it into a Conda environment, if you can't reproduce locally.

Removing the experimental package and putting WorkGroupMemory in existing files seems to have resolved the problem.

Building and testing on Windows seems fine. For the failure on Linux, I'm not sure this is related to my changes.

The error message points to _diagnostics, which I did not modify in this PR:

$SRC_DIR/_skbuild/linux-x86_64-3.13/cmake-build/dpctl/_diagnostics.cxx:2232:80: error: unknown type name '__pyx_vectorcallfunc'; did you mean 'vectorcallfunc'?

@ndgrigorian @oleksandr-pavlyk Any idea what could be causing this?

@diptorupd
Contributor

@oleksandr-pavlyk @sommerlukas nice set of changes. numba-dpex also supports local accessors.

@ndgrigorian
Collaborator

ndgrigorian commented Feb 13, 2025

@ndgrigorian @oleksandr-pavlyk Any idea what could be causing this?

The failure looks like one I would see when building with 3.13t (free-threaded Python).

Indeed, looking at the log—3.13.1-h9a34b6e_5_cp313t is the installed version. I'll try to open a PR to avoid free-threaded Python for now, or to fix build failures with it. Cython supports it now, so I don't see why we couldn't allow it.

@ndgrigorian
Collaborator

@sommerlukas
Sorry for the delay, I am going to take a look at this PR over this weekend/at start of next week and look into CI failures. Can you merge in the master branch or rebase onto master to see if picking up some of the CI fixes helps?

Also a quick glance at gh-1991 would be appreciated, if any of it is necessary for your work—this wasn't clear to me.
