-
Notifications
You must be signed in to change notification settings - Fork 115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Updated cl_khr_defined_builtin_kernels #1007
base: main
Are you sure you want to change the base?
Conversation
First WiP draft of a defined BiKs extension.
* Add API function for describing (and querying at the same time) a DBK. * Add API function for creating a program from described DBKs. * Add API function for creating a kernel handle for DBK. * Change DBK description. * Replaced old DBKs for simpler start for illustrating new features. * Use (yet specified) tensors. * Add constraints for usage. * Add sample code for queuing a DBK. * Device-side launching needs redesign. Added note for it.
Co-authored-by: Pekka Jääskeläinen <[email protected]>
code sample. * Fixes and tweaks in the DBK code sample. * Update date.
* Fix typo.
|
So while I am extremely supportive of the idea of OpenCL getting built-in kernels, I am unsure if this is the right approach. Instead, I feel that there are real advantages to handling this at the kernel language level. That is to say one simply writes a kernel corresponding to an S/DGEMM (say). This kernel can then be annotated (assuming the availability of a language extension) saying "try and match this against any built-ins and if so, feel free to substitute". If so, the runtime would be free to use its own implementation and disregard any user-provided work group sizes and the such like. This idea is not without precedent: Intel's C compiler has been applying this exact idea for the best part of a decade: write a triple-for-loop S/DGEMM and the compiler will recognise it and substitute all three loops for a call out to MKL. The advantages are numerous. Firstly, as a developer it makes my life easier as I am going to need a fallback kernel anyway given that it is unlikely that all vendors are going to ship with built-in kernels anytime soon. Thus I already have my own kernel to hand. Secondly, it is more expressive, with built-ins it is going to be hard to agree on a set of functionality everyone is happy with (for S/DGEMM you have the data layout, alpha and beta and if they are fixed or varying, leading dimensions, and all of the extra fun which comes with tensors and batching). Make your API too expansive and a vendor may be locked out just because they only support 95% of the functionality. But, make it too restrictive and you leave performance on the table. Kernels + pattern matching is a far more natural means of handling this. Additionally, if one is forward looking kernels enable easier bolt-ons. For example, say I want a S/DGEMM but then want to take the absolute value of each element in C. This is a trivial post-processing operation which ideally should be fused or merged into the end of the GEMM kernel. But it is difficult to express through an API, but trivial inside a kernel. The compiler can then either fuse it in (assuming it knows how, but it is in the best place for this being a compiler) or just crack it into two operations (S/DGEMM + second kernel to take the absolute values). Either way this makes fusing possible and is compatible with existing code (command buffers could potentially enable fusing but requires runtimes from science-fiction). |
Thanks for your useful input. What you describe makes a lot of sense and I agree on the software kernel writer's perspective here, but from the point of view of this DBK proposal, automated function pattern matching would be just another way of utilizing a DBK, not a completely contradicting alternative mechanism. An annotation mechanism somewhere along these lines is definitely something to add to the OpenCL C/SPIR-V side of the spec when we start on it. Unfortunately, automated primitive operation pattern matching of potentially complex non-DAG functions is known to be very tricky and rather limited in applicability while with this proposal we aim to cover a large range of useful accelerated functions from multiple application domains, not only GEMMS. Relying on automated matching would require writing a (potentially) complex pattern matcher for each DBK for them to be usable in the compiler if I understood your suggestion right. The matching task would be easier if the kernel could be annotated with the matching kernel directly by the programmer similarly as you describe, but not delegate the "DBK detection" to the compiler. That is, what you describe would be instead a software-fallback mechanism / assistance for automated DBK selection. In a comment in the original thread there's a question if we would like to make the DBKs callable from software kernels via a mechanism or another, which could help kernel fusion use cases among other things. I think that makes sense. For now the main means for automated kernel fusion we have written up for this proposal is via command buffers of which contents can be optimized as an entity, but it can be expanded and improved to cover other means. You consider automated task/compute graph node fusion is science fiction? I disagree in this aspect - I think a full kernel pattern matcher from primitive software operations is much more complex than merging successive nodes in a higher level task/computation graph (for example, via simple rule/pattern based matching). Moreover, "fuser middlewares" that assist in this task can be provided as reusable (open source) components implemented in frameworks like MLIR or TVM which the drivers can integrate if they want since they work on generic well-defined DAGs. So, overall, I think it's better to support both options; a nice way for software assist/fallback and whole task graph optimizations, which I think can be done within the specs we are arriving at. |
https://github.com/pocl/pocl/pull/1521/files a prototype implementation was merged in PoCL. |
tensors of shape `(1, a, b)`. | ||
|
||
Operations of the matrix muliplication are performed in the precision | ||
of the `elementof\(COUT)`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seem to be a couple of formatting errors regarding \(COUT)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out. It'll be fixed.
This draft is an updated version of #867.
Notable changes:
cl_tensor
data type for data arguments. The data type is drafted here.HTML is rendered here. (last update: 8/22/2024)
For previous discussion about the topic, follow this link.