
Add vector-vector and matrix-matrix Kronecker product #575

Open
wants to merge 4 commits into master
Conversation

albertomercurio

@albertomercurio albertomercurio commented Dec 15, 2024

The kron method between two AbstractGPUVectors or between two AbstractGPUMatrices was missing.

This should fix this issue in CUDA.jl. It also fixes #558.
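For reference, this is the semantics the vector-vector case implements, as a plain-CPU Julia sketch (kron_vec is an illustrative name, not the PR's actual code):

```julia
using LinearAlgebra  # provides the reference kron

# CPU sketch of the vector-vector Kronecker product: element (i, j) of the
# outer product a[i] * b[j] lands at linear index (i-1)*length(b) + j.
function kron_vec(a::AbstractVector, b::AbstractVector)
    c = similar(a, length(a) * length(b))
    for i in eachindex(a), j in eachindex(b)
        c[(i - 1) * length(b) + j] = a[i] * b[j]
    end
    return c
end

kron_vec([1, 2], [10, 20]) == kron([1, 2], [10, 20])  # true
```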

@albertomercurio
Author

I didn't add the synchronize function, since I saw that none of the implementations have one. I'm also wondering why, since the output might be incorrect until we wait for the kernel.

@albertomercurio albertomercurio changed the title Add vector-vector Kronecker product Add vector-vector and matrix-matrix Kronecker product Dec 15, 2024
@ytdHuang

Maybe this can also fix #558 ?

@pxl-th
Member

pxl-th commented Dec 15, 2024

I didn't add the synchronize function, since I saw that none of the implementations have one. I'm also wondering why, since the output might be incorrect until we wait for the kernel.

Because with Julia all kernels execute asynchronously (from the CPU's perspective) and in a stream-ordered fashion (on the GPU).

Synchronization happens either explicitly, by the user calling CUDA.synchronize(), or implicitly during a GPU->CPU transfer such as Array(gpu_array).

So adding explicit synchronization is redundant and would only slow down the GPU.
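The two synchronization paths described above can be sketched like this, assuming CUDA.jl (requires a CUDA device, so this is illustrative only):

```julia
using CUDA

a = CUDA.rand(Float32, 1024)
b = a .* 2.0f0        # kernel launches asynchronously; this line returns immediately
CUDA.synchronize()    # explicit: block until all work on the current stream is done
host = Array(b)       # implicit: the device-to-host copy waits for prior kernels
```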

@albertomercurio
Author

But what if an external user does, for example, c = kron(a, b); d = c * a?

If it is asynchronous, c may not be updated yet and d would be incorrect. Am I wrong?

The user could explicitly call synchronize, but the code is in principle type-agnostic, so that it works on the CPU as well as on the GPU.

@pxl-th
Member

pxl-th commented Dec 15, 2024

But what if an external user does, for example, c = kron(a, b); d = c * a?

If it is asynchronous, c may not be updated yet and d would be incorrect. Am I wrong?

The user could explicitly call synchronize, but the code is in principle type-agnostic, so that it works on the CPU as well as on the GPU.

The GPU executes kernels in the order of their submission, so it will first compute kron and only when that is done c * a.
That is true when the kernels are executed on the same stream (which is the default).
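So the pattern from the question works without any explicit wait (a hedged CUDA.jl sketch; any backend with a default stream behaves the same way):

```julia
using CUDA, LinearAlgebra

a = CUDA.rand(Float32, 4)
b = CUDA.rand(Float32, 3)
c = kron(a, b)   # launched asynchronously on the default stream
d = c .* 2.0f0   # queued after kron on the same stream, so it sees the finished c
Array(d)         # the copy back to the host synchronizes implicitly
```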

@albertomercurio
Author

Ok, very clear, thanks.

Member

@maleadt maleadt left a comment


Any reason you went with a different implementation from what we already have in CUDA.jl: https://github.com/JuliaGPU/CUDA.jl/blob/19a08efa06bcb0b5aa88b3a25bb0b336b6538a9a/lib/cublas/linalg.jl#L400-L484?

src/host/linalg.jl (outdated review thread, resolved)
@albertomercurio
Author

Here I'm indexing on C instead of A or B. This should be simpler and faster; I ran some benchmarks.

The CUDA implementation has the stride handling, which I never understood whether I should also use for KernelAbstractions.
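The "index on the output" strategy can be sketched like this with KernelAbstractions (illustrative names, not the PR's actual kernel): each work-item owns one element of C and recovers the corresponding indices into A and B with divrem.

```julia
using KernelAbstractions

# One work-item per element of C = kron(A, B). With zero-based offsets,
# i - 1 == ia * size(B, 1) + ib, so divrem splits the output index back
# into the A index and the B index.
@kernel function kron_kernel!(C, @Const(A), @Const(B))
    i, j = @index(Global, NTuple)
    mb, nb = size(B)
    ia, ib = divrem(i - 1, mb)
    ja, jb = divrem(j - 1, nb)
    @inbounds C[i, j] = A[ia + 1, ja + 1] * B[ib + 1, jb + 1]
end

# Hypothetical launch over the whole output:
#   kron_kernel!(get_backend(C))(C, A, B; ndrange = size(C))
```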

@maleadt
Member

maleadt commented Dec 16, 2024

The CUDA implementation has the stride handling, which I never understood whether I should also use for KernelAbstractions.

Didn't you implement that kernel in CUDA.jl? JuliaGPU/CUDA.jl#2177

@albertomercurio
Author

Yes. I mean that this new implementation should be faster and simpler.

Assuming that we implement the same for the CUDA case as well, I still don't understand whether I need the stride handling for KernelAbstractions too. From the examples I have seen so far, it seems to be handled internally by KernelAbstractions.
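The difference being discussed, sketched side by side (hedged; the raw CUDA.jl part is shown as comments since it needs a device):

```julia
using KernelAbstractions

# Raw CUDA.jl style: a grid-stride loop is written by hand so a fixed-size
# grid can cover an arbitrarily large array:
#   idx    = (blockIdx().x - 1) * blockDim().x + threadIdx().x
#   stride = gridDim().x * blockDim().x
#   for k in idx:stride:length(C) ... end
#
# KernelAbstractions style: @index(Global) is mapped over the ndrange by the
# framework, so no manual stride loop is needed in the kernel body.
@kernel function scale!(C, @Const(A), s)
    k = @index(Global, Linear)
    @inbounds C[k] = s * A[k]
end
```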

Successfully merging this pull request may close these issues.

Scalar indexing when performing kron on two CuVectors