Base.stack is underperforming. #2248

Open
rcalxrc08 opened this issue Jan 21, 2024 · 2 comments
Labels
good first issue (Good for newcomers), performance (How fast can we go?)

Comments

@rcalxrc08

Describe the bug
Stacking arrays of CuArrays is slow.

To reproduce

The Minimal Working Example (MWE) for this bug:

using BenchmarkTools, CUDA;
N = 100;
M = 1000;
x = randn(N);
x_cu = cu(x);
@btime stack(fill($x, M));                      # CPU baseline
@btime stack(fill($x_cu, M));                   # GPU: the slow path
@btime cu(stack(fill(collect($x_cu), M)));      # copy to CPU, stack there, copy back

The timings I am getting are:

70.800 μs (3 allocations: 789.23 KiB)
15.774 ms (8 allocations: 8.19 KiB)
318.900 μs (12 allocations: 399.83 KiB)
Manifest.toml

CUDA v5.1.2

Version info

Details on Julia: 1.10

Julia Version 1.10.0
Commit 3120989f39 (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores

Details on CUDA:

CUDA runtime 12.3, artifact installation
CUDA driver 12.0
Unknown NVIDIA driver

CUDA libraries:
- CUBLAS: 12.3.4
- CURAND: 10.3.4
- CUFFT: 11.0.12
- CUSOLVER: 11.5.4
- CUSPARSE: 12.2.0
- CUPTI: 21.0.0
- NVML: missing

Julia packages:
- CUDA: 5.1.2
- CUDA_Driver_jll: 0.7.0+1
- CUDA_Runtime_jll: 0.10.1+0

Toolchain:
- Julia: 1.10.0
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce MX150 (sm_61, 1.491 GiB / 2.000 GiB available)
@rcalxrc08 added the bug (Something isn't working) label Jan 21, 2024
@maleadt removed the bug (Something isn't working) label Jan 23, 2024
@maleadt
Member

maleadt commented Jan 23, 2024

The edge case presented here of stacking 1000+ tiny arrays doesn't seem very realistic; that's not how you typically use a GPU. The performance is bad because of all the API calls required for this operation:

julia> CUDA.@profile stack(fill(x_cu,M))
Profiler ran for 4.53 ms, capturing 9006 events.

Host-side activity: calling CUDA APIs took 3.92 ms (86.52% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                    │
├──────────┼────────────┼───────┼──────────────────────────────────────┼─────────────────────────┤
│   76.70% │    3.47 ms │  1000 │   3.47 µs ± 0.53   (  2.62 ‥ 17.17)  │ cuMemcpyDtoDAsync       │
│    2.05% │   92.98 µs │  2000 │  46.49 ns ± 99.18  (   0.0 ‥ 476.84) │ cuDeviceGet             │
│    0.20% │    9.06 µs │     1 │                                      │ cuMemAllocFromPoolAsync │
└──────────┴────────────┴───────┴──────────────────────────────────────┴─────────────────────────┘

Device-side activity: GPU was busy for 946.04 µs (20.89% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name                           │
├──────────┼────────────┼───────┼───────────────────────────────────────┼────────────────────────────────┤
│   20.89% │  946.04 µs │  1000 │ 946.04 ns ± 159.12 (715.26 ‥ 1192.09) │ [copy device to device memory] │
└──────────┴────────────┴───────┴───────────────────────────────────────┴────────────────────────────────┘

It could be specialized using a kernel that takes an array of all of the sources, in case anybody wants to tackle this.
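
A minimal, untested sketch of that approach, in case it helps whoever picks this up: upload the device-side array handles of all sources in a single transfer, then copy everything with one kernel launch instead of one memcpy per source. The names stack_columns and _stack_kernel! are placeholders, and it assumes all sources have the same length.

using CUDA

# One thread per output element: the linear index is split into (row, col)
# so that out[row, col] = xs[col][row].
function _stack_kernel!(out, xs)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(out)
        n = size(out, 1)
        col, row = divrem(i - 1, n)
        @inbounds out[row + 1, col + 1] = xs[col + 1][row + 1]
    end
    return nothing
end

function stack_columns(xs::Vector{<:CuVector{T}}) where {T}
    n, m = length(first(xs)), length(xs)
    out = CuArray{T}(undef, n, m)
    # Upload all device-side array handles at once, rather than issuing
    # one device-to-device copy per source array.
    xs_dev = CuArray(CUDA.cudaconvert.(xs))
    threads = 256
    blocks = cld(n * m, threads)
    GC.@preserve xs begin
        @cuda threads=threads blocks=blocks _stack_kernel!(out, xs_dev)
    end
    return out
end

A production version would also want a proper launch configuration and handling of empty inputs, but this shows the shape of the single-launch approach.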

@maleadt added the good first issue (Good for newcomers) and performance (How fast can we go?) labels Jan 23, 2024
@THargreaves

I came across this issue a while back and have been intending to write a proper kernel for stacking CuArrays. In the meantime though, you can achieve a solid speedup by using broadcasted assignment.

function fast_stack(
    x::CuVector{T}, M
) where {T}
    out = CuArray{T}(undef, length(x), M)
    # Broadcast x across all M columns in a single fused kernel.
    return out[:, :] .= x
end

y = fast_stack(x_cu, M);
println(y == cu(stack(fill(collect(x_cu), M))));

@btime fast_stack($x_cu, $M);

Output on my RTX 4090:

true
  4.770 μs (49 allocations: 1.00 KiB)
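
Note that the broadcast trick works here because every slice in the MWE is the same vector x_cu, so a single broadcasted assignment fills all M columns at once. Stacking M distinct CuVectors would still need something like the kernel approach sketched above.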
