Base.stack is underperforming. #2248
Comments
The edge case presented here of stacking 1000+ tiny arrays doesn't seem very realistic; that's not how you typically use a GPU. The performance is bad because of all the API calls required for this operation.
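One way to see that overhead on a concrete case is CUDA.jl's integrated profiler (recent versions of `CUDA.@profile` print a summary of host-side API calls and kernel launches); the sizes below are made up for illustration:

```julia
using CUDA

# Hypothetical sizes, chosen to mimic the pattern discussed here:
# many tiny device arrays stacked into one matrix.
vs = [CUDA.rand(Float32, 32) for _ in 1:1000]

CUDA.@profile stack(vs)   # summarizes the API calls issued while stacking
```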
It could be specialized using a kernel that takes an array of all of the sources, in case anybody wants to tackle this.
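For anyone who does want to tackle it, one possible shape is sketched below. This is only a rough sketch, not anything that exists in CUDA.jl: it assumes equal-length `CuVector` sources, and the names `stack_columns_kernel!` / `stack_columns` are invented here.

```julia
using CUDA

# A single launch gathers every source vector into its own column of `out`.
function stack_columns_kernel!(out, srcs, len)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= len * length(srcs)
        col = (i - 1) ÷ len + 1          # which source vector
        row = (i - 1) % len + 1          # element within that vector
        @inbounds out[row, col] = srcs[col][row]
    end
    return nothing
end

function stack_columns(vs::Vector{<:CuVector{T}}) where {T}
    len = length(first(vs))              # assumes all sources share this length
    out = CuMatrix{T}(undef, len, length(vs))
    # The device-side handles of the sources are isbits, so they can be stored
    # in a CuArray and indexed from inside the kernel.
    srcs = CuArray(CUDA.cudaconvert.(vs))
    n = len * length(vs)
    threads = min(n, 256)
    blocks = cld(n, threads)
    @cuda threads=threads blocks=blocks stack_columns_kernel!(out, srcs, len)
    return out
end
```

Something along these lines replaces the per-array copies with one small upload of the source handles plus one kernel launch.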
I came across this issue a while back and have been intending to write a proper kernel for stacking CuArrays. In the meantime, though, you can achieve a solid speedup by using broadcasted assignment:

```julia
using CUDA, BenchmarkTools

function fast_stack(x::CuVector{T}, M) where {T}
    # Allocate the output once, then fill every column with x via a single
    # broadcasted assignment instead of issuing one copy per column.
    out = CuArray{T}(undef, length(x), M)
    return out[:, :] .= x
end

# x_cu and M as defined in the MWE below.
y = fast_stack(x_cu, M);
println(y == cu(stack(fill(collect(x_cu), M))));
@btime fast_stack($x_cu, $M);
```

Output on my RTX 4090:
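For what it's worth, the gain here comes from filling the preallocated output with a single broadcast kernel rather than issuing a separate copy (and its API calls) per column; note that it only covers the case where every column is the same vector, as in the MWE, rather than stacking M distinct arrays.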
Describe the bug
Stacking arrays of CuArrays is slow.
To reproduce
The Minimal Working Example (MWE) for this bug:
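A hypothetical reproducer along the lines discussed in this thread; the vector length and count are assumptions, and `x_cu` / `M` are the names used in the comments above:

```julia
using CUDA, BenchmarkTools

# Assumed sizes, matching the "1000+ tiny arrays" scenario discussed above.
M = 1000
x_cu = CUDA.rand(Float32, 32)

@btime stack(fill($x_cu, $M));
```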
The timings I am getting:
Manifest.toml
Version info
Details on Julia: 1.10
Details on CUDA: