Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

getindex for CuArray to fix repeated indices error #1131

Closed
wants to merge 7 commits into from

Conversation

jgreener64
Copy link
Contributor

I spent quite a while working out where my gradients were wrong, until eventually tracking it down to the GPU repeated index issue that is mentioned in multiple places:

Based on a solution suggested at https://discourse.julialang.org/t/increment-elements-of-array-by-index/49694 I added a hack to get the correct gradients, which can be found in this PR. It uses scatter! from NNlib to accumulate values over multiple indices. It is considerably slower than the current version, which on the GPU gives silently incorrect and even stochastic gradients. However it is faster than moving the array off the GPU to do this step.

This isn't currently suitable for merging since it adds dependencies and isn't tested beyond the simple case. It might be useful as a starting point for a discussion on how to get this important issue fixed, though, and also might be useful as a hack to others.

src/lib/array.jl Outdated
Comment on lines 44 to 45
using CUDA
import NNlib
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these imports shouldn't be here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's what @jgreener64 was referring to with "This isn't currently suitable for merging since it adds dependencies..."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah sorry, missed that

src/lib/array.jl Outdated Show resolved Hide resolved
src/lib/array.jl Outdated

∇getindex(x::CuArray, inds::Tuple{AbstractArray{<:Integer}}) = dy -> begin
dx = _zero(x, eltype(dy))
NNlib.scatter!(+, dx, dy, inds[1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we use only inds[1] we should restict the input to ::AbstractArray{<:Integer} or even better AbstractVector{<:Integer}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tuple wrapper comes from the varargs in the adjoint, so unless that's changed too there's not much that can be done.

@CarloLucibello
Copy link
Member

I think we should be correct and avoid silent bugs before aiming at being fast. Maybe taking a fast path after a "has duplicates" check can help performance a bit?

@CarloLucibello
Copy link
Member

Since Zygote doesn't depend on CUDA and NNlib I'm not sure where we should put this. Here but inside a @require? In NNlib/NNlibCUDA?

@jgreener64
Copy link
Contributor Author

Maybe taking a fast path after a "has duplicates" check can help performance a bit?

Yes I would imagine so.

I'm not sure where we should put this

There is already a CUDA require block in lib/broadcast.jl though that still leaves the NNlib dependency. Another option is to rewrite a simple version of scatter here.

@ToucheSir
Copy link
Member

Is this something ReverseDiff could benefit from as well? If so, we could consider adding it as an rrule with RuleConfig{>:HasReverseMode} to NNlibCUDA.

@DhairyaLGandhi
Copy link
Member

DhairyaLGandhi commented Dec 9, 2021

I don't think this is something that can be solved well with an adjoint at this point. The forward pass value would be incorrect by this point already. Specifically, note JuliaGPU/CUDA.jl#89 which describes why this happens. Best fixed at the CUDA.jl level based on how it is implemented.

@jgreener64
Copy link
Contributor Author

The forward pass value would be incorrect by this point already.

Not in general, since assignment broadcasting is used in the backward pass whether or not it is used in the forward pass. For example on the current release:

using Zygote, CUDA

mul2(x) = 2x

function f(a)
    v = view(a, [1, 1, 2])
    sum(mul2.(v))
end

a = [1.0f0, 1.0f0, 1.0f0]
f(a) # Correct, 6.0f0
f(cu(a)) # Correct, 6.0f0

gradient(f, a) # Correct, (Float32[4.0, 2.0, 0.0],)
gradient(f, cu(a)) # Incorrect, (Float32[2.0, 2.0, 0.0],)

The gradient is correct with the change in this PR.

Best fixed at the CUDA.jl level based on how it is implemented.

The problem is that this seems difficult to fix at the CUDA level based on what @maleadt says at JuliaGPU/CUDA.jl#89.

@CarloLucibello
Copy link
Member

Can we have some benchmarks here?

@ToucheSir
Copy link
Member

So the main concern wrt. dependencies here is not CUDA (because there's already a Requires block for it in Zygote) but NNlib(CUDA). I think one way to make this more palatable would be to extract scatter!(+, dx, dy, inds) into a kernel and create a wrapper accumulation function that dispatches to this kernel when passed CuArrays. The default fallback could reuse the code in https://github.com/FluxML/Zygote.jl/blob/v0.6.33/src/lib/array.jl#L38-L39.

@jgreener64
Copy link
Contributor Author

I added the check for unique indices. Here are some benchmarks on a NVIDIA RTX A6000 with 1000 indices:

using Zygote, CUDA, BenchmarkTools, Random
using Zygote: ∇getindex

arr = cu(rand(Float32, 1000))
inds_norep = shuffle(1:1000)
inds_norep_cu = cu(inds_norep)
inds_rep = rand(1:1000, 1000)
inds_rep_cu = cu(inds_rep)
yb = cu(rand(Float32, 1000))

# On master
@btime $(∇getindex(arr, (inds_norep   ,)))($(yb)) # 11.474 μs (35 allocations: 2.28 KiB)
@btime $(∇getindex(arr, (inds_norep_cu,)))($(yb)) # 15.839 μs (31 allocations: 10.06 KiB)
@btime $(∇getindex(arr, (inds_rep     ,)))($(yb)) # 11.380 μs (35 allocations: 2.28 KiB)
@btime $(∇getindex(arr, (inds_rep_cu  ,)))($(yb)) # 16.069 μs (31 allocations: 10.06 KiB)

# With this change
@btime $(∇getindex(arr, (inds_norep   ,)))($(yb)) # 23.712 μs (53 allocations: 59.36 KiB)
@btime $(∇getindex(arr, (inds_norep_cu,)))($(yb)) # 35.649 μs (50 allocations: 67.16 KiB)
@btime $(∇getindex(arr, (inds_rep     ,)))($(yb)) # 3.626 ms (26013 allocations: 1.29 MiB)
@btime $(∇getindex(arr, (inds_rep_cu  ,)))($(yb)) # 3.663 ms (26014 allocations: 1.29 MiB)

So it's a 2x slowdown for the previously correct case of no repeated indices due to the uniqueness check, and a 300x slowdown for the previously incorrect case of repeated indices due to the use of NNlib.scatter!.

@CarloLucibello
Copy link
Member

What should we do here? I think we should take the performance hit for the correctness.

So the main concern wrt. dependencies here is not CUDA (because there's already a Requires block for it in Zygote) but NNlib(CUDA). I think one way to make this more palatable would be to extract scatter!(+, dx, dy, inds) into a kernel and create a wrapper accumulation function that dispatches to this kernel when passed CuArrays. The default fallback could reuse the code in https://github.com/FluxML/Zygote.jl/blob/v0.6.33/src/lib/array.jl#L38-L39.

Hi don't quite understand this suggestion

@ToucheSir
Copy link
Member

I meant that we could copy the scatter kernel definition from NNlibCUDA and use it as an internal function. Pretty messy, but a couple of duplicated methods isn't the end of the world.

@CarloLucibello
Copy link
Member

Yes, we could do that, not ideal but seems the best alternative

@CarloLucibello
Copy link
Member

@jgreener64 do you want to implement this plan?

@jgreener64
Copy link
Contributor Author

Looking at this again, another option when the indices are not unique is to do the broadcast on the CPU and move it to the GPU. I added that in the last commit. Compared to the above it benchmarks as:

@btime $(∇getindex(arr, (inds_norep   ,)))($(yb)) # 24.253 μs (57 allocations: 59.64 KiB)
@btime $(∇getindex(arr, (inds_norep_cu,)))($(yb)) # 37.904 μs (53 allocations: 67.39 KiB)
@btime $(∇getindex(arr, (inds_rep     ,)))($(yb)) # 12.634 μs (21 allocations: 19.92 KiB)
@btime $(∇getindex(arr, (inds_rep_cu  ,)))($(yb)) # 20.333 μs (22 allocations: 19.94 KiB)

i.e. it is fast, though I imagine it increases GPU memory requirements. It may be a good way to go here due to its simplicity.

@mcabbott
Copy link
Member

Do these times need a CUDA.@sync? Seems a little crazy that copying to an Array could be quicker, but if it really is then that's simple. And if so, it should probably skip the check? allunique is surprisingly slow, has to make a Dict I think:

julia> @btime allunique($(shuffle!(collect(1:1000))));
  min 11.958 μs, mean 20.876 μs (17 allocations, 49.14 KiB)

@jgreener64
Copy link
Contributor Author

I'm not fully sure but with this benchmark I think it is quicker to do on CPU and copy across. Skipping the uniqueness check and always doing this, which I just added to the branch, gives:

@btime $(∇getindex(arr, (inds_norep   ,)))($(yb)) # 11.843 μs (11 allocations: 16.30 KiB)
@btime $(∇getindex(arr, (inds_norep_cu,)))($(yb)) # 19.797 μs (12 allocations: 16.31 KiB)
@btime $(∇getindex(arr, (inds_rep     ,)))($(yb)) # 12.079 μs (11 allocations: 16.30 KiB)
@btime $(∇getindex(arr, (inds_rep_cu  ,)))($(yb)) # 19.589 μs (12 allocations: 16.31 KiB)

This is very similar to the results on master above, but I'm unsure of the wider impact of changing every indexing operation on a GPU array as it may be slower or memory-heavy in other contexts.

Are there downstream benchmark tests that can be run (e.g. for Flux) to assess this?

@ToucheSir
Copy link
Member

Could you package up some of the test cases you're using and put them in https://github.com/FluxML/Zygote.jl/blob/master/test/cuda.jl?

@mcabbott mcabbott added the CUDA All things GPU label Jul 4, 2022
@mcabbott
Copy link
Member

mcabbott commented Aug 3, 2022

JuliaDiff/ChainRules.jl#655 implements the same copy-to-CPU idea is 98e87c3. Comments (or benchmarking) would be welcome.

Adding test cases here would also be a great idea, as they will be run on a real GPU via buildkite.

@jgreener64
Copy link
Contributor Author

I added a simple test case that is fixed by this PR.

Yes it probably makes sense to deal with this on the ChainRules.jl side long term.

Is there a way to benchmark performance on various downstream tasks such as Flux? I'm a little worried about the consequences for GPU memory if a CPU-GPU conversion happens with every indexing step, though in my tests above it seemed okay. The approach of allowing integers/ranges in JuliaDiff/ChainRules.jl#655 without doing a uniqueness check seems sensible.

@ToucheSir ToucheSir mentioned this pull request Oct 6, 2023
1 task
@jgreener64
Copy link
Contributor Author

I think this was fixed in JuliaDiff/ChainRules.jl#655.

@jgreener64 jgreener64 closed this Oct 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CUDA All things GPU
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants