
Move strided batch pointer conversion to GPU #2601

Closed · wants to merge 7 commits

Conversation

THargreaves (Contributor)

Relates to #2592

Modifications and comments welcome.

codecov bot commented Dec 20, 2024

Codecov Report

Attention: Patch coverage is 46.15385% with 7 lines in your changes missing coverage. Please review.

Project coverage is 73.50%. Comparing base (830c49b) to head (953c796).
Report is 3 commits behind head on master.

Files with missing lines      Patch %    Lines
lib/cublas/wrappers.jl        46.15%     7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2601      +/-   ##
==========================================
- Coverage   73.51%   73.50%   -0.02%     
==========================================
  Files         157      157              
  Lines       15207    15218      +11     
==========================================
+ Hits        11180    11186       +6     
- Misses       4027     4032       +5     


github-actions bot left a comment


CUDA.jl Benchmarks

Benchmark suite Current: 953c796 Previous: 62564aa Ratio
latency/precompile 45168542210 ns 45352888677 ns 1.00
latency/ttfp 6429231913.5 ns 6365236644.5 ns 1.01
latency/import 3045177395 ns 3028318014.5 ns 1.01
integration/volumerhs 9554999 ns 9566694 ns 1.00
integration/byval/slices=1 146470 ns 146977 ns 1.00
integration/byval/slices=3 425314 ns 425622 ns 1.00
integration/byval/reference 144902 ns 144997 ns 1.00
integration/byval/slices=2 286364 ns 286278 ns 1.00
integration/cudadevrt 103469 ns 103635 ns 1.00
kernel/indexing 14403 ns 14502 ns 0.99
kernel/indexing_checked 15257 ns 15491 ns 0.98
kernel/occupancy 693.4575163398692 ns 685.9934210526316 ns 1.01
kernel/launch 2086 ns 2211.5555555555557 ns 0.94
kernel/rand 16453 ns 15495 ns 1.06
array/reverse/1d 19368 ns 19288.5 ns 1.00
array/reverse/2d 23258 ns 25407 ns 0.92
array/reverse/1d_inplace 9968 ns 11091 ns 0.90
array/reverse/2d_inplace 11558 ns 12732 ns 0.91
array/copy 20492 ns 20971 ns 0.98
array/iteration/findall/int 156404 ns 159562 ns 0.98
array/iteration/findall/bool 137197 ns 139396 ns 0.98
array/iteration/findfirst/int 153397 ns 153501 ns 1.00
array/iteration/findfirst/bool 154184.5 ns 154850 ns 1.00
array/iteration/scalar 76406 ns 75791 ns 1.01
array/iteration/logical 209655.5 ns 216533 ns 0.97
array/iteration/findmin/1d 40407 ns 41336 ns 0.98
array/iteration/findmin/2d 93014 ns 94735 ns 0.98
array/reductions/reduce/1d 34611 ns 42301 ns 0.82
array/reductions/reduce/2d 46237.5 ns 45473.5 ns 1.02
array/reductions/mapreduce/1d 32218 ns 40507 ns 0.80
array/reductions/mapreduce/2d 43278 ns 47968 ns 0.90
array/broadcast 21168.5 ns 21824 ns 0.97
array/copyto!/gpu_to_gpu 13385 ns 13701 ns 0.98
array/copyto!/cpu_to_gpu 209733 ns 213728 ns 0.98
array/copyto!/gpu_to_cpu 245916 ns 245756 ns 1.00
array/accumulate/1d 108625 ns 108365 ns 1.00
array/accumulate/2d 79320.5 ns 80195 ns 0.99
array/construct 1204.95 ns 1204.7 ns 1.00
array/random/randn/Float32 42140 ns 43838 ns 0.96
array/random/randn!/Float32 26056 ns 26350 ns 0.99
array/random/rand!/Int64 27164 ns 27179 ns 1.00
array/random/rand!/Float32 8737.333333333334 ns 8916.333333333334 ns 0.98
array/random/rand/Int64 29715 ns 30059 ns 0.99
array/random/rand/Float32 12729 ns 12950 ns 0.98
array/permutedims/4d 66562 ns 67259 ns 0.99
array/permutedims/2d 55897 ns 56558.5 ns 0.99
array/permutedims/3d 58481 ns 59379 ns 0.98
array/sorting/1d 2931270.5 ns 2932110 ns 1.00
array/sorting/by 3497571 ns 3498470 ns 1.00
array/sorting/2d 1084182 ns 1085408 ns 1.00
cuda/synchronization/stream/auto 1029.7 ns 1025.5 ns 1.00
cuda/synchronization/stream/nonblocking 6521.6 ns 6545.6 ns 1.00
cuda/synchronization/stream/blocking 811.1630434782609 ns 796.0645161290323 ns 1.02
cuda/synchronization/context/auto 1177.7 ns 1190.9 ns 0.99
cuda/synchronization/context/nonblocking 6703.5 ns 6739.6 ns 0.99
cuda/synchronization/context/blocking 915.9777777777778 ns 903.25 ns 1.01

This comment was automatically generated by workflow using github-action-benchmark.

maleadt (Member) left a comment


Why not pass the array in its entirety to the kernel, so you can keep the pointer(strided, (i-1)*stride + 1) intact (but now device-side) instead of having to roll your own address calculation? It doesn't matter much in practice, but I'm just trying to keep the code as generic as possible (in case we ever add strides to CuArray, as suggested in #1298).
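For concreteness, a minimal sketch of what that suggestion might look like (illustrative names only, not code from the PR; the device-side pointer comes back as an LLVMPtr, and converting it to the CuPtr element type is the detail worked out further down the thread):

using CUDA

function device_batch_ptrs_kernel!(Aptrs, strided, arr_stride)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(Aptrs)
        # `strided` is a CuDeviceArray here, so `pointer` is a cheap address computation;
        # reinterpret turns the resulting LLVMPtr into the CuPtr eltype of `Aptrs`
        Aptrs[i] = reinterpret(eltype(Aptrs), pointer(strided, (i - 1) * arr_stride + 1))
    end
    return nothing
end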

THargreaves (Contributor, Author)

Wouldn't that put us back at the same point as before, calling the very slow pointer repeatedly, just now on the GPU?

My thought process was to perform the pointer call only once, since that's the expensive bit, and then leave the remaining simple logic to the kernel.

What if we restricted this fast method to dense arrays and then used a fallback based on pointer or Base._memory_offset for any other case?

Am I correct in thinking that even with strided arrays, we'd still need the more general case, in case the strided batched operation were performed on a non-contiguous view?

maleadt (Member) commented Dec 21, 2024

Wouldn't that put us back at the same point as before, calling the very slow pointer repeatedly, just now on the GPU?

No. You're dealing with different array types on the CPU and the GPU. On the CPU, pointer calls into an expensive operation performing lots of bookkeeping (which I linked to in the issue; configuring memory pools, verifying accessibility, etc.). On the GPU, you're dealing with CuDeviceArrays, for which pointer is a simple memory address calculation.
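For context, the host-side pattern being replaced looks roughly like the following (a simplified example with made-up sizes; each iteration pays the full cost of the host pointer call):

using CUDA
A = CUDA.rand(Float32, 4, 4, 1000)
arr_stride = 4 * 4  # elements per batch entry
# one expensive host-side `pointer` call per batch entry, then upload the pointer table
ptrs = CuPtr{Float32}[pointer(A, (i - 1) * arr_stride + 1) for i in 1:size(A, 3)]
Aptrs = CuArray(ptrs)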

THargreaves (Contributor, Author) commented Dec 29, 2024

Ah, that makes sense. Thank you.

It looks like the pointer method for CuDeviceArray returns an LLVMPtr instead of a CuPtr though.

The implicit conversion during setindex! leads to an IR error:

ERROR: InvalidIRError: compiling MethodInstance for create_ptrs_general_kernel!(::CuDeviceVector{CuPtr{Float32}, 1}, ::CuDeviceArray{Float32, 3, 1}, ::Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to convert)
Stacktrace:
 [1] setindex!
   @ ~/.julia/packages/CUDA/2kjXI/src/device/array.jl:166

Source:

using CUDA
D = 4;
N = 10^6;
A = CUDA.rand(Float32, D, D, N);
Aptrs_gpu = CuArray{CuPtr{Float32}}(undef, N);
arr_stride = D^2  # elements between consecutive batch entries
function create_ptrs_general_kernel!(Aptrs, A, arr_stride)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x  # grid stride, distinct from the array stride
    for i in index:stride:length(Aptrs)
        # device-side `pointer` returns an LLVMPtr; storing it into a vector of
        # CuPtr{Float32} triggers the dynamic `convert` reported above
        Aptrs[i] = pointer(A, (i - 1) * arr_stride + 1)
    end
    return nothing
end

@cuda threads = 256 blocks = ceil(Int, N / 256) create_ptrs_general_kernel!(
    Aptrs_gpu, A, arr_stride
)

maleadt (Member) commented Jan 6, 2025

It looks like the pointer method for CuDeviceArray returns an LLVMPtr instead of a CuPtr though.

Yes, that's expected. CuPtr only exists on the host, to differentiate from Ptr. On the device, we use plain pointers, albeit using the LLVMPtr type so that we can additionally track address space information.

It should be safe to reinterpret or Core.bitcast the pointer back to CuPtr.

THargreaves (Contributor, Author)

Ah okay, gotcha. Thanks.

I've reached something that runs, has essentially the same runtime as the manual pointer offsets above, and outputs the correct pointers.

It feels a bit hacky though, as I had to go through UInt as an intermediate step: the kernel wouldn't compile if I tried to reinterpret/bitcast LLVMPtr to Ptr/CuPtr directly. I also wasn't able to create a CuArray of LLVMPtrs and reinterpret afterwards, since that's not allocated inline.

Anyway, does this sort of approach look reasonable to you?

function create_ptrs_general_2_kernel!(Aptrs, A, arr_stride)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x  # grid stride
    for i in index:stride:length(Aptrs)
        # go through UInt so the store into the CuPtr{Float32} vector converts cleanly
        Aptrs[i] = UInt(pointer(A, (i - 1) * arr_stride + 1))
    end
    return nothing
end

function create_ptrs_general_2(A)
    arr_stride = size(A, 1) * size(A, 2)
    Aptrs = CuArray{CuPtr{Float32}}(undef, size(A, 3))
    @cuda threads = 256 blocks = ceil(Int, length(Aptrs) / 256) create_ptrs_general_2_kernel!(
        Aptrs, A, arr_stride
    )
    return Aptrs
end
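With the same A as in the earlier snippet, this would be used as:

Aptrs = create_ptrs_general_2(A)  # CuVector{CuPtr{Float32}} of length size(A, 3)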

If so, I'll tidy it up and add it to the PR.

maleadt (Member) commented Jan 6, 2025

the kernel wouldn't compile if I tried to reinterpret/bitcast LLVMPtr to Ptr/CuPtr directly

Should work for both Ptr and CuPtr:

julia> code_llvm(reinterpret, Tuple{Type{Ptr{Int}}, Core.LLVMPtr{Int,1}})
; Function Signature: reinterpret(Type{Ptr{Int64}}, Core.LLVMPtr{Int64, 1})
;  @ essentials.jl:728 within `reinterpret`
define i64 @julia_reinterpret_1538(ptr addrspace(1) %"x::LLVMPtr") #0 {
top:
;  @ essentials.jl:730 within `reinterpret`
  %bitcast_coercion = ptrtoint ptr addrspace(1) %"x::LLVMPtr" to i64
  ret i64 %bitcast_coercion
}

I also wasn't able to create a CuArray of LLVMPtrs and reinterpret afterwards, since that's not allocated inline.

LLVMPtr is allocated inline; you were probably omitting one of its typevars. That said, it's probably better to use an array of CuPtrs, which you can pass directly to the subsequent API.
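Putting those suggestions together, a sketch of what the tidied-up helper might look like (illustrative names, not the PR's final code; reinterpreting straight to the CuPtr eltype avoids both the UInt detour and the hard-coded Float32):

using CUDA

function strided_batch_ptrs_kernel!(Aptrs, A, arr_stride)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    gstride = gridDim().x * blockDim().x  # grid stride
    for i in index:gstride:length(Aptrs)
        # device-side `pointer` yields an LLVMPtr; bitcast it straight to the CuPtr eltype
        Aptrs[i] = reinterpret(eltype(Aptrs), pointer(A, (i - 1) * arr_stride + 1))
    end
    return nothing
end

function strided_batch_ptrs(A::CuArray{T,3}) where {T}
    arr_stride = size(A, 1) * size(A, 2)
    Aptrs = CuArray{CuPtr{T}}(undef, size(A, 3))
    @cuda threads = 256 blocks = cld(size(A, 3), 256) strided_batch_ptrs_kernel!(
        Aptrs, A, arr_stride
    )
    return Aptrs
end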

maleadt (Member) commented Jan 7, 2025

Something went wrong with the rebase?

THargreaves (Contributor, Author)

Indeed. I think I know how to resolve it and am just doing that now. Apologies!

THargreaves (Contributor, Author)

Yeah, I was making changes on my fork's master branch and I think that's why I messed things up.

Sorry for all the faff this PR has been and thank you for your guidance. I'm learning!

I've cherry-picked the changes into a new branch and created a new PR for that (#2608).

THargreaves closed this Jan 7, 2025