
Move strided batch pointer conversion to GPU #2601

Closed · wants to merge 7 commits

Conversation

THargreaves (Contributor)

Relates to #2592

Modifications and comments welcome.

codecov bot commented Dec 20, 2024

Codecov Report

Attention: Patch coverage is 46.15385% with 7 lines in your changes missing coverage. Please review.

Project coverage is 73.50%. Comparing base (830c49b) to head (953c796).
Report is 3 commits behind head on master.

Files with missing lines      Patch %    Lines
lib/cublas/wrappers.jl        46.15%     7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2601      +/-   ##
==========================================
- Coverage   73.51%   73.50%   -0.02%     
==========================================
  Files         157      157              
  Lines       15207    15218      +11     
==========================================
+ Hits        11180    11186       +6     
- Misses       4027     4032       +5     


github-actions bot left a comment


CUDA.jl Benchmarks

Benchmark suite Current: 953c796 Previous: 62564aa Ratio
latency/precompile 45168542210 ns 45352888677 ns 1.00
latency/ttfp 6429231913.5 ns 6365236644.5 ns 1.01
latency/import 3045177395 ns 3028318014.5 ns 1.01
integration/volumerhs 9554999 ns 9566694 ns 1.00
integration/byval/slices=1 146470 ns 146977 ns 1.00
integration/byval/slices=3 425314 ns 425622 ns 1.00
integration/byval/reference 144902 ns 144997 ns 1.00
integration/byval/slices=2 286364 ns 286278 ns 1.00
integration/cudadevrt 103469 ns 103635 ns 1.00
kernel/indexing 14403 ns 14502 ns 0.99
kernel/indexing_checked 15257 ns 15491 ns 0.98
kernel/occupancy 693.4575163398692 ns 685.9934210526316 ns 1.01
kernel/launch 2086 ns 2211.5555555555557 ns 0.94
kernel/rand 16453 ns 15495 ns 1.06
array/reverse/1d 19368 ns 19288.5 ns 1.00
array/reverse/2d 23258 ns 25407 ns 0.92
array/reverse/1d_inplace 9968 ns 11091 ns 0.90
array/reverse/2d_inplace 11558 ns 12732 ns 0.91
array/copy 20492 ns 20971 ns 0.98
array/iteration/findall/int 156404 ns 159562 ns 0.98
array/iteration/findall/bool 137197 ns 139396 ns 0.98
array/iteration/findfirst/int 153397 ns 153501 ns 1.00
array/iteration/findfirst/bool 154184.5 ns 154850 ns 1.00
array/iteration/scalar 76406 ns 75791 ns 1.01
array/iteration/logical 209655.5 ns 216533 ns 0.97
array/iteration/findmin/1d 40407 ns 41336 ns 0.98
array/iteration/findmin/2d 93014 ns 94735 ns 0.98
array/reductions/reduce/1d 34611 ns 42301 ns 0.82
array/reductions/reduce/2d 46237.5 ns 45473.5 ns 1.02
array/reductions/mapreduce/1d 32218 ns 40507 ns 0.80
array/reductions/mapreduce/2d 43278 ns 47968 ns 0.90
array/broadcast 21168.5 ns 21824 ns 0.97
array/copyto!/gpu_to_gpu 13385 ns 13701 ns 0.98
array/copyto!/cpu_to_gpu 209733 ns 213728 ns 0.98
array/copyto!/gpu_to_cpu 245916 ns 245756 ns 1.00
array/accumulate/1d 108625 ns 108365 ns 1.00
array/accumulate/2d 79320.5 ns 80195 ns 0.99
array/construct 1204.95 ns 1204.7 ns 1.00
array/random/randn/Float32 42140 ns 43838 ns 0.96
array/random/randn!/Float32 26056 ns 26350 ns 0.99
array/random/rand!/Int64 27164 ns 27179 ns 1.00
array/random/rand!/Float32 8737.333333333334 ns 8916.333333333334 ns 0.98
array/random/rand/Int64 29715 ns 30059 ns 0.99
array/random/rand/Float32 12729 ns 12950 ns 0.98
array/permutedims/4d 66562 ns 67259 ns 0.99
array/permutedims/2d 55897 ns 56558.5 ns 0.99
array/permutedims/3d 58481 ns 59379 ns 0.98
array/sorting/1d 2931270.5 ns 2932110 ns 1.00
array/sorting/by 3497571 ns 3498470 ns 1.00
array/sorting/2d 1084182 ns 1085408 ns 1.00
cuda/synchronization/stream/auto 1029.7 ns 1025.5 ns 1.00
cuda/synchronization/stream/nonblocking 6521.6 ns 6545.6 ns 1.00
cuda/synchronization/stream/blocking 811.1630434782609 ns 796.0645161290323 ns 1.02
cuda/synchronization/context/auto 1177.7 ns 1190.9 ns 0.99
cuda/synchronization/context/nonblocking 6703.5 ns 6739.6 ns 0.99
cuda/synchronization/context/blocking 915.9777777777778 ns 903.25 ns 1.01

This comment was automatically generated by workflow using github-action-benchmark.

maleadt (Member) left a comment


Why not pass the array in its entirety to the kernel, so you can keep the pointer(strided, (i-1)*stride + 1) intact (but now device-side) instead of having to roll your own address calculation? It doesn't matter much in practice, but I'm just trying to keep the code as generic as possible (in case we ever add strides to CuArray, as suggested in #1298).
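For concreteness, a minimal sketch of what that suggestion might look like (illustrative names only, not code from the PR; the device-side pointer comes back as an LLVMPtr, and converting it to the CuPtr element type is the detail worked out further down the thread):

using CUDA

function device_batch_ptrs_kernel!(Aptrs, strided, arr_stride)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(Aptrs)
        # `strided` is a CuDeviceArray here, so `pointer` is a cheap address computation;
        # reinterpret turns the resulting LLVMPtr into the CuPtr eltype of `Aptrs`
        Aptrs[i] = reinterpret(eltype(Aptrs), pointer(strided, (i - 1) * arr_stride + 1))
    end
    return nothing
end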

THargreaves (Contributor, Author)

Wouldn't that put us back at the same point as before, calling the very slow pointer repeatedly, just now on the GPU?

My thought process was to perform the pointer call only once, since that's the expensive bit, and then leave the remaining simple logic to the kernel.

What if we restricted this fast method to dense arrays and then used a fallback based on pointer or Base._memory_offset for any other case?

Am I correct in thinking that even with strided arrays, we'd still need the more general case, in case the strided batched operation were performed on a non-contiguous view?

maleadt (Member) commented Dec 21, 2024

Wouldn't that put us back at the same point as before, calling the very slow pointer repeatedly, just now on the GPU?

No. You're dealing with different array types on the CPU and the GPU. On the CPU, pointer calls into an expensive operation performing lots of bookkeeping (which I linked to in the issue; configuring memory pools, verifying accessibility, etc.). On the GPU, you're dealing with CuDeviceArrays, for which pointer is a simple memory address calculation.
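For context, the host-side pattern being replaced looks roughly like the following (a simplified example with made-up sizes; each iteration pays the full cost of the host pointer call):

using CUDA
A = CUDA.rand(Float32, 4, 4, 1000)
arr_stride = 4 * 4  # elements per batch entry
# one expensive host-side `pointer` call per batch entry, then upload the pointer table
ptrs = CuPtr{Float32}[pointer(A, (i - 1) * arr_stride + 1) for i in 1:size(A, 3)]
Aptrs = CuArray(ptrs)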

THargreaves (Contributor, Author) commented Dec 29, 2024

Ah, that makes sense. Thank you.

It looks like the pointer method for CuDeviceArray returns an LLVMPtr instead of a CuPtr though.

The implicit conversion during setindex! leads to an IR error:

ERROR: InvalidIRError: compiling MethodInstance for create_ptrs_general_kernel!(::CuDeviceVector{CuPtr{Float32}, 1}, ::CuDeviceArray{Float32, 3, 1}, ::Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to convert)
Stacktrace:
 [1] setindex!
   @ ~/.julia/packages/CUDA/2kjXI/src/device/array.jl:166

Source:

using CUDA
D = 4;
N = 10^6;
A = CUDA.rand(Float32, D, D, N);
Aptrs_gpu = CuArray{CuPtr{Float32}}(undef, N);
arr_stride = D^2  # elements between consecutive batch entries
function create_ptrs_general_kernel!(Aptrs, A, arr_stride)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x  # grid stride, distinct from the array stride
    for i in index:stride:length(Aptrs)
        # device-side `pointer` returns an LLVMPtr; storing it into a vector of
        # CuPtr{Float32} triggers the dynamic `convert` reported above
        Aptrs[i] = pointer(A, (i - 1) * arr_stride + 1)
    end
    return nothing
end

@cuda threads = 256 blocks = ceil(Int, N / 256) create_ptrs_general_kernel!(
    Aptrs_gpu, A, arr_stride
)

maleadt (Member) commented Jan 6, 2025

It looks like the pointer method for CuDeviceArray returns an LLVMPtr instead of a CuPtr though.

Yes, that's expected. CuPtr only exists on the host, to differentiate from Ptr. On the device, we use plain pointers, albeit using the LLVMPtr type so that we can additionally track address space information.

It should be safe to reinterpret or Core.bitcast the pointer back to CuPtr.

THargreaves (Contributor, Author)

Ah okay, gotcha. Thanks.

I've reached something that runs, has essentially the same runtime as the manual pointer offsets above, and outputs the correct pointers.

It feels a bit hacky though, as I had to go through UInt as an intermediate step: the kernel wouldn't compile if I tried to reinterpret/bitcast LLVMPtr to Ptr/CuPtr directly. I also wasn't able to create a CuArray of LLVMPtrs and reinterpret afterwards, since that's not allocated inline.

Anyway, does this sort of approach look reasonable to you?

function create_ptrs_general_2_kernel!(Aptrs, A, arr_stride)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x  # grid stride
    for i in index:stride:length(Aptrs)
        # go through UInt so the store into the CuPtr{Float32} vector converts cleanly
        Aptrs[i] = UInt(pointer(A, (i - 1) * arr_stride + 1))
    end
    return nothing
end

function create_ptrs_general_2(A)
    arr_stride = size(A, 1) * size(A, 2)
    Aptrs = CuArray{CuPtr{Float32}}(undef, size(A, 3))
    @cuda threads = 256 blocks = ceil(Int, length(Aptrs) / 256) create_ptrs_general_2_kernel!(
        Aptrs, A, arr_stride
    )
    return Aptrs
end
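With the same A as in the earlier snippet, this would be used as:

Aptrs = create_ptrs_general_2(A)  # CuVector{CuPtr{Float32}} of length size(A, 3)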

If so, I'll tidy it up and add it to the PR.

maleadt (Member) commented Jan 6, 2025

the kernel wouldn't compile if I tried to reinterpret/bitcast LLVMPtr to Ptr/CuPtr directly

Should work for both Ptr and CuPtr:

julia> code_llvm(reinterpret, Tuple{Type{Ptr{Int}}, Core.LLVMPtr{Int,1}})
; Function Signature: reinterpret(Type{Ptr{Int64}}, Core.LLVMPtr{Int64, 1})
;  @ essentials.jl:728 within `reinterpret`
define i64 @julia_reinterpret_1538(ptr addrspace(1) %"x::LLVMPtr") #0 {
top:
;  @ essentials.jl:730 within `reinterpret`
  %bitcast_coercion = ptrtoint ptr addrspace(1) %"x::LLVMPtr" to i64
  ret i64 %bitcast_coercion
}

I also wasn't able to create a CuArray of LLVMPtrs and reinterpret afterwards, since that's not allocated inline.

LLVMPtr is allocated inline; you were probably omitting one of its typevars. That said, it's probably better to use an array of CuPtrs, which you can pass directly to the subsequent API.
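Putting those suggestions together, a sketch of what the tidied-up helper might look like (illustrative names, not the PR's final code; reinterpreting straight to the CuPtr eltype avoids both the UInt detour and the hard-coded Float32):

using CUDA

function strided_batch_ptrs_kernel!(Aptrs, A, arr_stride)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    gstride = gridDim().x * blockDim().x  # grid stride
    for i in index:gstride:length(Aptrs)
        # device-side `pointer` yields an LLVMPtr; bitcast it straight to the CuPtr eltype
        Aptrs[i] = reinterpret(eltype(Aptrs), pointer(A, (i - 1) * arr_stride + 1))
    end
    return nothing
end

function strided_batch_ptrs(A::CuArray{T,3}) where {T}
    arr_stride = size(A, 1) * size(A, 2)
    Aptrs = CuArray{CuPtr{T}}(undef, size(A, 3))
    @cuda threads = 256 blocks = cld(size(A, 3), 256) strided_batch_ptrs_kernel!(
        Aptrs, A, arr_stride
    )
    return Aptrs
end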

maleadt (Member) commented Jan 7, 2025

Something went wrong with the rebase?

THargreaves (Contributor, Author)

Indeed. I think I know how to resolve it and am just doing that now. Apologies!

THargreaves (Contributor, Author)

Yeah, I was making changes on my fork's master branch and I think that's why I messed things up.

Sorry for all the faff this PR has been and thank you for your guidance. I'm learning!

I've cherry-picked the changes into a new branch and created a new PR for that (#2608).

THargreaves closed this Jan 7, 2025