
Regression in broadcast: getting Array (Julia 1.10) instead of CuArray (Julia 1.9) #2276

Closed
drewrobson opened this issue Feb 28, 2024 · 3 comments
Labels
bug Something isn't working

Comments

drewrobson commented Feb 28, 2024

Describe the bug

Certain broadcast expressions that previously executed on the GPU (on Julia 1.9.3) and returned a CuArray are instead triggering scalar indexing warnings (on Julia 1.10.1) and returning an Array.

To reproduce

The Minimal Working Example (MWE) for this bug:

using CUDA
d_test = CUDA.ones(5)
getindex.(Ref(d_test), keys(d_test))

Expected behavior

Based on previous Julia versions, the MWE should produce a CuVector{Float32}:

5-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 1.0
 1.0
 1.0
 1.0
 1.0

Version info

Details on Julia:

Julia Version 1.10.1
Commit 7790d6f064* (2024-02-13 20:41 UTC)
Build Info:

    Note: This is an unofficial build, please report bugs to the project
    responsible for this build and not to the Julia project unless you can
    reproduce the issue using official builds available at https://julialang.org/downloads

Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 24 × AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
Threads: 1 default, 0 interactive, 1 GC (on 24 virtual cores)

Details on CUDA:

CUDA runtime 12.3, artifact installation
CUDA driver 12.2
NVIDIA driver 535.54.3

CUDA libraries: 
- CUBLAS: 12.3.4
- CURAND: 10.3.4
- CUFFT: 11.0.12
- CUSOLVER: 11.5.4
- CUSPARSE: 12.2.0
- CUPTI: 21.0.0
- NVML: 12.0.0+535.54.3

Julia packages: 
- CUDA: 5.3.0
- CUDA_Driver_jll: 0.7.0+1
- CUDA_Runtime_jll: 0.11.1+0

Toolchain:
- Julia: 1.10.1
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce RTX 4090 (sm_89, 19.250 GiB / 23.988 GiB available)

Additional context

On Julia 1.9.3, Base.broadcasted(getindex, Ref(d_test), keys(d_test)) yields a

Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Nothing, typeof(getindex), Tuple{Base.RefValue{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, LinearIndices{1, Tuple{Base.OneTo{Int64}}}}}

On Julia 1.10.1, the same expression yields a

Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(getindex), Tuple{Base.RefValue{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, LinearIndices{1, Tuple{Base.OneTo{Int64}}}}}

This change in behavior broke some more complicated broadcast expressions (the MWE was reduced from one of these). For now, I am working around the issue by specifying a CuArray destination, like this:

d_result .= getindex.(Ref(d_test), keys(d_test))

(but that means figuring out the output type and dimensions first, which adds a step during development/prototyping)
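For reference, the workaround needs a preallocated destination; a minimal sketch of that setup, using `similar` to match the element type and shape of `d_test`:

```julia
using CUDA

d_test = CUDA.ones(5)

# Preallocate a device-resident destination with the same eltype and
# shape as d_test, then broadcast into it. The CuArrayStyle of the
# destination forces GPU execution of the right-hand side.
d_result = similar(d_test)
d_result .= getindex.(Ref(d_test), keys(d_test))
```

For outputs whose type or shape differ from the inputs, `similar` alone isn't enough, which is the extra step mentioned above.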

Thanks!

drewrobson added the bug label Feb 28, 2024
maleadt commented Feb 28, 2024

This was a deliberate change; see JuliaGPU/GPUArrays.jl#510 for the rationale.
It's too bad this trips up your code, as I had hoped to sneak this in without having to tag a breaking release...

@maleadt maleadt closed this as completed Feb 28, 2024
drewrobson (Author) commented
Thanks very much, that makes sense. I like the clarity of the capture approach - it's easier to see which arguments actually participate in broadcasting in a nontrivial way.

I'm updating my code, but in many cases all the "GPU-residing" objects are now captures. The MWE is such a case: keys(d_test) is (Base.OneTo(5),) so the naive fix wouldn't work:

function test()
    d_test = CUDA.ones(5)
    broadcast(keys(d_test)) do idx
        d_test[idx]
    end
end
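One way to make a variant of this run on the GPU again is to broadcast over a device-resident index array instead of the lightweight `keys(d_test)` object. This is an untested sketch, not an endorsed fix; it assumes `collect`ing the keys and moving them over with `CuArray` is acceptable for the real use case:

```julia
using CUDA

function test_gpu()
    d_test = CUDA.ones(5)
    # Materialize the indices and move them to the device so that the
    # broadcast picks up CuArrayStyle from the index array itself.
    d_idx = CuArray(collect(keys(d_test)))
    broadcast(d_idx) do idx
        d_test[idx]  # d_test is captured; indexing happens inside the kernel
    end
end
```

The cost is an extra device allocation for the indices, which somewhat defeats the appeal of lightweight objects like `OneTo` - hence the question below.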

This leads to a question I've been wanting to ask anyway:

Certain lightweight objects like OneTo(1000000) seem equally happy broadcasting on the host or on the GPU (which, I think, is why cu(OneTo(1000000)) doesn't "move" anything to the device). Is there a way to opt into GPU execution? For broadcast! we can write

d_result .= foo.(OneTo(1000000))

For broadcast, is there anything easier than manually constructing a Broadcasted{CuArrayStyle} object?
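For completeness, the manual construction alluded to here can be sketched as follows, reusing the `CuArrayStyle{1}` parameterization shown in the Broadcasted types earlier in this issue. This is an untested, hedged sketch of Base internals, not a recommended API:

```julia
using CUDA
using Base.Broadcast: Broadcasted, materialize

foo(i) = Float32(i)^2

# Wrap the call in a Broadcasted with an explicit CuArrayStyle so that
# materialize dispatches to the GPU copy path instead of the host one.
bc = Broadcasted{CUDA.CuArrayStyle{1}}(foo, (Base.OneTo(1_000_000),))
d_result = materialize(bc)  # expected to yield a CuArray
```

The ergonomic-wrapper question below is essentially asking for a supported spelling of this pattern.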


maleadt commented Mar 2, 2024

For broadcast, is there anything easier than manually constructing a Broadcasted{CuArrayStyle} object?

I don't know of anything like that, but I agree it would be useful to override the BroadcastStyle in a more ergonomic way. Maybe something to open an issue about upstream?
