getindex for CuArray to fix repeated indices error #1131
Conversation
src/lib/array.jl (outdated)
```julia
using CUDA
import NNlib
```
These imports shouldn't be here.
I think that's what @jgreener64 was referring to with "This isn't currently suitable for merging since it adds dependencies..."
Ah sorry, missed that.
src/lib/array.jl (outdated)
```julia
∇getindex(x::CuArray, inds::Tuple{AbstractArray{<:Integer}}) = dy -> begin
    dx = _zero(x, eltype(dy))
    NNlib.scatter!(+, dx, dy, inds[1])
```
Since we use only `inds[1]`, we should restrict the input to `::AbstractArray{<:Integer}`, or even better `::AbstractVector{<:Integer}`.
The tuple wrapper comes from the varargs in the adjoint, so unless that's changed too there's not much that can be done.
I think we should be correct and avoid silent bugs before aiming at being fast. Maybe taking a fast path after a "has duplicates" check can help performance a bit?
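A sketch of that fast path might look like the following (illustrative names only; it assumes NNlib's `scatter!` with a working GPU method, i.e. NNlibCUDA loaded):

```julia
using CUDA, NNlib  # scatter! on CuArrays also needs NNlibCUDA loaded

# Illustrative sketch, not the PR's exact code: only pay the scatter! cost
# when the indices actually contain repeats.
function accum_index_grad!(dx::CuVector, dy::CuVector, inds::CuVector{<:Integer})
    if allunique(Array(inds))            # duplicate check, done on the CPU here
        dxv = view(dx, inds)
        dxv .+= dy                       # no repeats: the plain broadcast is already correct
    else
        NNlib.scatter!(+, dx, dy, inds)  # repeats: scatter! sums every contribution
    end
    return dx
end
```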
Since Zygote doesn't depend on CUDA and NNlib I'm not sure where we should put this. Here, but inside a `@require` block?
Yes I would imagine so.
There is already a CUDA require block in …
Is this something ReverseDiff could benefit from as well? If so, we could consider adding it as an …
I don't think this is something that can be solved well with an adjoint at this point. The forward pass value would be incorrect by this point already. Specifically, note JuliaGPU/CUDA.jl#89 which describes why this happens. Best fixed at the CUDA.jl level based on how it is implemented.
Not in general, since assignment broadcasting is used in the backward pass whether or not it is used in the forward pass. For example, on the current release:

```julia
using Zygote, CUDA

mul2(x) = 2x

function f(a)
    v = view(a, [1, 1, 2])
    sum(mul2.(v))
end

a = [1.0f0, 1.0f0, 1.0f0]
f(a)               # Correct, 6.0f0
f(cu(a))           # Correct, 6.0f0
gradient(f, a)     # Correct, (Float32[4.0, 2.0, 0.0],)
gradient(f, cu(a)) # Incorrect, (Float32[2.0, 2.0, 0.0],)
```

The gradient is correct with the change in this PR.
The problem is that this seems difficult to fix at the CUDA level based on what @maleadt says at JuliaGPU/CUDA.jl#89.
Can we have some benchmarks here?
So the main concern wrt. dependencies here is not CUDA (because there's already a …)
I added the check for unique indices. Here are some benchmarks on an NVIDIA RTX A6000 with 1000 indices:

```julia
using Zygote, CUDA, BenchmarkTools, Random
using Zygote: ∇getindex

arr = cu(rand(Float32, 1000))
inds_norep = shuffle(1:1000)
inds_norep_cu = cu(inds_norep)
inds_rep = rand(1:1000, 1000)
inds_rep_cu = cu(inds_rep)
yb = cu(rand(Float32, 1000))

# On master
@btime $(∇getindex(arr, (inds_norep   ,)))($(yb)) # 11.474 μs (35 allocations: 2.28 KiB)
@btime $(∇getindex(arr, (inds_norep_cu,)))($(yb)) # 15.839 μs (31 allocations: 10.06 KiB)
@btime $(∇getindex(arr, (inds_rep     ,)))($(yb)) # 11.380 μs (35 allocations: 2.28 KiB)
@btime $(∇getindex(arr, (inds_rep_cu  ,)))($(yb)) # 16.069 μs (31 allocations: 10.06 KiB)

# With this change
@btime $(∇getindex(arr, (inds_norep   ,)))($(yb)) # 23.712 μs (53 allocations: 59.36 KiB)
@btime $(∇getindex(arr, (inds_norep_cu,)))($(yb)) # 35.649 μs (50 allocations: 67.16 KiB)
@btime $(∇getindex(arr, (inds_rep     ,)))($(yb)) # 3.626 ms (26013 allocations: 1.29 MiB)
@btime $(∇getindex(arr, (inds_rep_cu  ,)))($(yb)) # 3.663 ms (26014 allocations: 1.29 MiB)
```

So it's a 2x slowdown for the previously correct case of no repeated indices, due to the uniqueness check, and a ~300x slowdown for the previously incorrect case of repeated indices, due to the use of `scatter!`.
What should we do here? I think we should take the performance hit for correctness.
Hi, I don't quite understand this suggestion.
I meant that we could copy the scatter kernel definition from NNlibCUDA and use it as an internal function. Pretty messy, but a couple of duplicated methods isn't the end of the world.
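For reference, a self-contained scatter-add kernel in that spirit could look roughly like this (a sketch with made-up names, not NNlibCUDA's actual implementation); atomic addition is what makes repeated indices accumulate instead of race:

```julia
using CUDA

# Minimal scatter-add sketch for 1D gradients (illustrative, not NNlibCUDA's kernel).
function scatter_add!(dx::CuVector{T}, dy::CuVector{T}, inds::CuVector{<:Integer}) where T
    function kernel(dx, dy, inds)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= length(inds)
            CUDA.@atomic dx[inds[i]] += dy[i]  # atomic, so concurrent repeats all accumulate
        end
        return nothing
    end
    threads = 256
    blocks = cld(length(inds), threads)
    @cuda threads=threads blocks=blocks kernel(dx, dy, inds)
    return dx
end
```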
Yes, we could do that; not ideal, but it seems the best alternative.
@jgreener64 do you want to implement this plan?
Force-pushed from dd4d7d2 to 40517a6.
Looking at this again, another option when the indices are not unique is to do the broadcast on the CPU and move it to the GPU. I added that in the last commit. Compared to the above it benchmarks as:

```julia
@btime $(∇getindex(arr, (inds_norep   ,)))($(yb)) # 24.253 μs (57 allocations: 59.64 KiB)
@btime $(∇getindex(arr, (inds_norep_cu,)))($(yb)) # 37.904 μs (53 allocations: 67.39 KiB)
@btime $(∇getindex(arr, (inds_rep     ,)))($(yb)) # 12.634 μs (21 allocations: 19.92 KiB)
@btime $(∇getindex(arr, (inds_rep_cu  ,)))($(yb)) # 20.333 μs (22 allocations: 19.94 KiB)
```

i.e. it is fast, though I imagine it increases GPU memory requirements. It may be a good way to go here due to its simplicity.
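The copy-to-CPU path amounts to something like the sketch below (illustrative names, not the exact commit): the fused broadcast runs sequentially on the CPU, so repeated indices accumulate correctly, and only the final gradient is copied back to the GPU.

```julia
using CUDA

# Sketch of the CPU fallback for the accumulation step (illustrative names only).
function accum_on_cpu(x::CuVector, dy::CuVector, inds::AbstractVector{<:Integer})
    dx_cpu = zeros(eltype(dy), length(x))
    dxv = view(dx_cpu, Array(inds))
    dxv .+= Array(dy)        # sequential CPU broadcast: repeated indices are summed
    return CuArray(dx_cpu)   # move the accumulated gradient back to the GPU
end
```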
Do these times need a …?
I'm not fully sure, but with this benchmark I think it is quicker to do on the CPU and copy across. Skipping the uniqueness check and always doing this, which I just added to the branch, gives:

```julia
@btime $(∇getindex(arr, (inds_norep   ,)))($(yb)) # 11.843 μs (11 allocations: 16.30 KiB)
@btime $(∇getindex(arr, (inds_norep_cu,)))($(yb)) # 19.797 μs (12 allocations: 16.31 KiB)
@btime $(∇getindex(arr, (inds_rep     ,)))($(yb)) # 12.079 μs (11 allocations: 16.30 KiB)
@btime $(∇getindex(arr, (inds_rep_cu  ,)))($(yb)) # 19.589 μs (12 allocations: 16.31 KiB)
```

This is very similar to the results on master above, but I'm unsure of the wider impact of changing every indexing operation on a GPU array, as it may be slower or memory-heavy in other contexts. Are there downstream benchmark tests that can be run (e.g. for Flux) to assess this?
Could you package up some of the test cases you're using and put them in https://github.com/FluxML/Zygote.jl/blob/master/test/cuda.jl?
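For instance, based on the view example earlier in this thread, a test along these lines could go in test/cuda.jl (sketch; the committed test may differ):

```julia
using Zygote, CUDA, Test

mul2(x) = 2x
f(a) = sum(mul2.(view(a, [1, 1, 2])))

a = [1.0f0, 1.0f0, 1.0f0]
@test gradient(f, a)[1] ≈ [4.0f0, 2.0f0, 0.0f0]             # CPU reference
@test Array(gradient(f, cu(a))[1]) ≈ [4.0f0, 2.0f0, 0.0f0]  # should now match on the GPU
```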
JuliaDiff/ChainRules.jl#655 implements the same copy-to-CPU idea as 98e87c3. Comments (or benchmarking) would be welcome. Adding test cases here would also be a great idea, as they will be run on a real GPU via buildkite.
Force-pushed from 3b85689 to 8e06259.
I added a simple test case that is fixed by this PR. Yes, it probably makes sense to deal with this on the ChainRules.jl side long term. Is there a way to benchmark performance on various downstream tasks such as Flux? I'm a little worried about the consequences for GPU memory if a CPU-GPU conversion happens with every indexing step, though in my tests above it seemed okay. The approach of allowing integers/ranges in JuliaDiff/ChainRules.jl#655 without doing a uniqueness check seems sensible.
I think this was fixed in JuliaDiff/ChainRules.jl#655.
I spent quite a while working out where my gradients were wrong, until eventually tracking it down to the GPU repeated index issue that is mentioned in multiple places:

- getindex() with repeating indices on CuArrays #821

Based on a solution suggested at https://discourse.julialang.org/t/increment-elements-of-array-by-index/49694 I added a hack to get the correct gradients, which can be found in this PR. It uses scatter! from NNlib to accumulate values over multiple indices. It is considerably slower than the current version, which on the GPU gives silently incorrect and even stochastic gradients. However, it is faster than moving the array off the GPU to do this step.

This isn't currently suitable for merging since it adds dependencies and isn't tested beyond the simple case. It might be useful as a starting point for a discussion on how to get this important issue fixed, though, and also might be useful as a hack to others.
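As a CPU-only illustration of that accumulation (not the PR code itself), scatter! sums the contributions that target the same index instead of overwriting them:

```julia
using NNlib

dy = [2.0f0, 2.0f0, 2.0f0]            # upstream gradient for x[[1, 1, 2]]
dx = zeros(Float32, 3)
NNlib.scatter!(+, dx, dy, [1, 1, 2])
dx                                    # Float32[4.0, 2.0, 0.0]: repeats are summed, not overwritten
```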