Add vector-vector and matrix-matrix Kronecker product #575
I didn't add the |
Maybe this can also fix #558?
Because with Julia all kernels execute asynchronously (from the CPU perspective) and in a stream-ordered fashion (on the GPU). Synchronization happens either explicitly by user calling So adding explicit synchronization is redundant and would only slow down GPU. |
But what if an external user does for example If it is asynchronous, c may not be updated and d would be incorrect. Am I wrong? The user may explicitly call |
GPU executes kernels in order of their submission. So GPU will first compute |
Ok very clear thanks |
Any reason you went with a different implementation from what we already have in CUDA.jl: https://github.com/JuliaGPU/CUDA.jl/blob/19a08efa06bcb0b5aa88b3a25bb0b336b6538a9a/lib/cublas/linalg.jl#L400-L484?
Here I'm indexing on C instead of A or B. This should be simpler and faster; I ran some benchmarks. The CUDA implementation has the stride stuff, and I never understood whether I should also use it with KernelAbstractions.
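For readers following along, "indexing on C" can be sketched as below. This is a minimal CPU illustration in plain Julia, not the kernel from this PR: the names `kron_entry` and `kron_by_output` are hypothetical, and the nested loop stands in for the work-items a GPU backend would launch, one per entry of C.

```julia
using LinearAlgebra  # only needed for the reference `kron` in the usage note

# Hypothetical helper: compute one entry of C = kron(A, B) from its own (i, j).
# In a GPU kernel, each work-item would evaluate this for its output index.
function kron_entry(A, B, i, j)
    p, q = size(B)
    # divrem on zero-based indices splits (i, j) into a block position in A
    # and an offset inside B.
    ia, ib = divrem(i - 1, p)
    ja, jb = divrem(j - 1, q)
    return A[ia + 1, ja + 1] * B[ib + 1, jb + 1]
end

# CPU stand-in for the kernel launch: visit every entry of C exactly once.
function kron_by_output(A::AbstractMatrix, B::AbstractMatrix)
    m, n = size(A)
    p, q = size(B)
    C = similar(A, promote_type(eltype(A), eltype(B)), m * p, n * q)
    for j in axes(C, 2), i in axes(C, 1)
        C[i, j] = kron_entry(A, B, i, j)
    end
    return C
end

A = [1 2; 3 4]
B = [0 5; 6 7]
C = kron_by_output(A, B)
```

Because every output entry is written by exactly one index pair, no two work-items would write to the same location, which is presumably why this formulation is easy to express as a kernel. Comparing `C` against `kron(A, B)` is a quick sanity check.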
Didn't you implement that kernel in CUDA.jl? JuliaGPU/CUDA.jl#2177 |
Yes. I mean that this implementation should be faster and simpler. Assuming that we implement the same also for the CUDA case, I still don't understand if I need to use the stride stuff also for KernelAbstractions. From the examples I have seen so far, it seems that it is handled internally in KernelAbstractions. |
Can you be more exact than "stride stuff"? Are you talking about what looks like a grid-stride loop in your CUDA.jl implementation? That's an optimization that makes it possible to launch fewer blocks and still have the kernel behave as expected, dynamically scaling to whatever the problem size is. KernelAbstractions.jl currently does not support that. |
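The grid-stride pattern mentioned here can be emulated on the CPU to show the indexing. This is only a sketch: `scale_grid_stride!` and `nworkers` are hypothetical names, and `nworkers` plays the role of the total number of launched GPU threads (blocks times threads per block), which may be smaller than the array length.

```julia
# Hypothetical CPU emulation of a grid-stride loop.
# Each "worker" starts at its own index and then jumps ahead by the total
# number of workers, so a fixed launch size covers an array of any length.
function scale_grid_stride!(y::AbstractVector, a, nworkers::Integer)
    for worker in 1:nworkers                 # on a GPU these run in parallel
        for i in worker:nworkers:length(y)   # stride equals the worker count
            y[i] *= a
        end
    end
    return y
end

y = collect(1.0:10.0)
scale_grid_stride!(y, 2.0, 4)  # 4 workers cover 10 elements
```

With 4 workers and 10 elements, worker 1 handles indices 1, 5, 9, worker 2 handles 2, 6, 10, and so on, so each element is touched exactly once regardless of the launch size.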
OK, that is exactly what I meant. So the implementation here should be more efficient than the current one in CUDA.jl (except for the grid-stride loop).
Sorry, I wanted to make sure we were on the same page. @vchuravy is aware of grid stride loops not being supported in KA.jl, so hopefully we can revisit this some time in the future. Thanks for the PR! |
The kron method between two AbstractGPUVector or two AbstractGPUMatrix was missing. This should fix this issue in CUDA.jl. It also fixes #558.