Base.stack is underperforming. #2248

Open
rcalxrc08 opened this issue Jan 21, 2024 · 2 comments
Labels
good first issue (Good for newcomers), performance (How fast can we go?)

Comments

@rcalxrc08

Describe the bug
Stacking arrays of CuArrays is slow.

To reproduce

The Minimal Working Example (MWE) for this bug:

using BenchmarkTools, CUDA;
N = 100;
M = 1000;
x = randn(N);
x_cu = cu(x);
@btime stack(fill($x, M));                      # CPU baseline
@btime stack(fill($x_cu, M));                   # GPU: the slow path
@btime cu(stack(fill(collect($x_cu), M)));      # copy to CPU, stack there, copy back

The timings I am getting are:

70.800 μs (3 allocations: 789.23 KiB)
15.774 ms (8 allocations: 8.19 KiB)
318.900 μs (12 allocations: 399.83 KiB)
Manifest.toml

CUDA v5.1.2

Version info

Details on Julia: 1.10

Julia Version 1.10.0
Commit 3120989f39 (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores

Details on CUDA:

CUDA runtime 12.3, artifact installation
CUDA driver 12.0
Unknown NVIDIA driver

CUDA libraries:
- CUBLAS: 12.3.4
- CURAND: 10.3.4
- CUFFT: 11.0.12
- CUSOLVER: 11.5.4
- CUSPARSE: 12.2.0
- CUPTI: 21.0.0
- NVML: missing

Julia packages:
- CUDA: 5.1.2
- CUDA_Driver_jll: 0.7.0+1
- CUDA_Runtime_jll: 0.10.1+0

Toolchain:
- Julia: 1.10.0
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce MX150 (sm_61, 1.491 GiB / 2.000 GiB available)
@rcalxrc08 added the bug (Something isn't working) label Jan 21, 2024
@maleadt removed the bug (Something isn't working) label Jan 23, 2024
@maleadt
Member

maleadt commented Jan 23, 2024

The edge case presented here of stacking 1000+ tiny arrays doesn't seem very realistic; that's not how you typically use a GPU. The performance is bad because of all the API calls required for this operation:

julia> CUDA.@profile stack(fill(x_cu,M))
Profiler ran for 4.53 ms, capturing 9006 events.

Host-side activity: calling CUDA APIs took 3.92 ms (86.52% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                    │
├──────────┼────────────┼───────┼──────────────────────────────────────┼─────────────────────────┤
│   76.70% │    3.47 ms │  1000 │   3.47 µs ± 0.53   (  2.62 ‥ 17.17)  │ cuMemcpyDtoDAsync       │
│    2.05% │   92.98 µs │  2000 │  46.49 ns ± 99.18  (   0.0 ‥ 476.84) │ cuDeviceGet             │
│    0.20% │    9.06 µs │     1 │                                      │ cuMemAllocFromPoolAsync │
└──────────┴────────────┴───────┴──────────────────────────────────────┴─────────────────────────┘

Device-side activity: GPU was busy for 946.04 µs (20.89% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name                           │
├──────────┼────────────┼───────┼───────────────────────────────────────┼────────────────────────────────┤
│   20.89% │  946.04 µs │  1000 │ 946.04 ns ± 159.12 (715.26 ‥ 1192.09) │ [copy device to device memory] │
└──────────┴────────────┴───────┴───────────────────────────────────────┴────────────────────────────────┘

It could be specialized using a kernel that takes an array of all of the sources, in case anybody wants to tackle this.
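
A minimal, untested sketch of that approach, in case it helps whoever picks this up: upload the device-side array handles of all sources in a single transfer, then copy everything with one kernel launch instead of one memcpy per source. The names stack_columns and _stack_kernel! are placeholders, and it assumes all sources have the same length.

using CUDA

# One thread per output element: the linear index is split into (row, col)
# so that out[row, col] = xs[col][row].
function _stack_kernel!(out, xs)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(out)
        n = size(out, 1)
        col, row = divrem(i - 1, n)
        @inbounds out[row + 1, col + 1] = xs[col + 1][row + 1]
    end
    return nothing
end

function stack_columns(xs::Vector{<:CuVector{T}}) where {T}
    n, m = length(first(xs)), length(xs)
    out = CuArray{T}(undef, n, m)
    # Upload all device-side array handles at once, rather than issuing
    # one device-to-device copy per source array.
    xs_dev = CuArray(CUDA.cudaconvert.(xs))
    threads = 256
    blocks = cld(n * m, threads)
    GC.@preserve xs begin
        @cuda threads=threads blocks=blocks _stack_kernel!(out, xs_dev)
    end
    return out
end

A production version would also want a proper launch configuration and handling of empty inputs, but this shows the shape of the single-launch approach.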

@maleadt added the good first issue (Good for newcomers) and performance (How fast can we go?) labels Jan 23, 2024
@THargreaves

I came across this issue a while back and have been intending to write a proper kernel for stacking CuArrays. In the meantime though, you can achieve a solid speedup by using broadcasted assignment.

function fast_stack(
    x::CuVector{T}, M
) where {T}
    out = CuArray{T}(undef, length(x), M)
    # Broadcast x across all M columns in a single fused kernel.
    return out[:, :] .= x
end

y = fast_stack(x_cu, M);
println(y == cu(stack(fill(collect(x_cu), M))));

@btime fast_stack($x_cu, $M);

Output on my RTX 4090:

true
  4.770 μs (49 allocations: 1.00 KiB)
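
Note that the broadcast trick works here because every slice in the MWE is the same vector x_cu, so a single broadcasted assignment fills all M columns at once. Stacking M distinct CuVectors would still need something like the kernel approach sketched above.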
