RNN update to drop CUDNN, fix LSTM bug and output type stability #1367
Conversation
There's a non-deterministic failure that hits computations on the GPU when running an RNN. The CPU and GPU models are identical through a couple of iterations, then the GPU run crashes:

```julia
using Revise
using Flux
using Statistics: mean

feat = 32
h_size = 64
seq_len = 10
batch_size = 32

rnn = Chain(RNN(feat, h_size),
            Dense(h_size, 1, σ),
            x -> reshape(x, :))

X = [rand(feat, batch_size) for i in 1:seq_len]
Y = rand(batch_size, seq_len) ./ 10

######################################
rnn_gpu = rnn |> gpu
X_gpu = gpu(X)
Y_gpu = gpu(Y)
######################################

θ = Flux.params(rnn)
θ_gpu = Flux.params(rnn_gpu)

function loss(x, y)
    l = mean((Flux.stack(map(rnn, x), 2) .- y) .^ 2f0)
    Flux.reset!(rnn)
    return l
end

function loss_gpu(x, y)
    l = mean((Flux.stack(map(rnn_gpu, x), 2) .- y) .^ 2f0)
    Flux.reset!(rnn_gpu)
    return l
end

opt = ADAM(1e-4)
opt_gpu = ADAM(1e-4)

for i in 1:50
    println("iter: ", i)
    Flux.train!(loss, θ, [(X, Y)], opt)
    Flux.train!(loss_gpu, θ_gpu, [(X_gpu, Y_gpu)], opt_gpu)
    println("loss_cpu: ", loss(X, Y))
    println("loss_gpu: ", loss_gpu(X_gpu, Y_gpu))
end
```

Output:

iter: 1
loss_cpu: 0.30154937090167455
loss_gpu: 0.3015062
iter: 2
loss_cpu: 0.291274794331852
loss_gpu: 0.29123157
iter: 3
loss_cpu: 0.2811146438490362
loss_gpu: 0.28107145
...
iter: 38
loss_cpu: 0.0699269985382991
loss_gpu: 0.06990228
iter: 39
loss_cpu: 0.0676047453112943
loss_gpu: 0.06758058
iter: 40
loss_cpu: 0.06539862587330023
loss_gpu: NaN
iter: 41
loss_cpu: 0.06330197006630321
loss_gpu: NaN
iter: 42
loss_cpu: 0.061308456094240696
loss_gpu: NaN
On this run it failed at iteration 40, but on another run it can be at iteration 28, etc.
Check that the CUDNN drop also solves the "too many wrappers" issue - FluxML#1259
Nope, no immediate thoughts. Besides, I haven't worked much on the CUDNN wrappers; most are a copy of what used to exist in Flux and was developed here.
@maleadt Thanks for the feedback. In the above example, however, the CUDNN wrappers aren't involved. The example is based on this current PR, in which the CUDNN wrappers were dropped, so the issue would be CUDA-related (I assume).
Ah, I misunderstood. I don't see how the CUDA stack would be to blame, then. Since you're not using streams, kernels
@maleadt Considering that the RNN layer running on GPU is now just the plain cell applied to CuArrays: the basic RNN cell simply applies a matrix multiplication and an addition, along with a broadcasted activation function, which shouldn't be any different from a basic dense layer. So if the issue explaining why the CPU case works but the GPU one doesn't isn't in CUDA/CuArray, I am a bit confused about where to look. Any help would be welcome!

@DhairyaLGandhi Has there been a Flux release whose RNN GPU support was validated (might help trace back if something broke along the way)?
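To make that concrete, here is a minimal sketch of the forward pass such a basic cell performs (illustrative names, not the exact Flux internals):

```julia
# A basic RNN cell step: two matrix multiplications, an addition, and a
# broadcast activation. The same code path runs on CPU Arrays and on CuArrays.
function rnn_cell_forward(σ, Wi, Wh, b, h, x)
    h_new = σ.(Wi * x .+ Wh * h .+ b)
    return h_new, h_new   # (new hidden state, output)
end
```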
Is there no other (GPU-specific) functionality from Flux involved at all? Anyway, as I said it's possible there's an issue with a specific kernel or operation in CUDA.jl, but that's impossible to tell from this code. Happy to give it a better look if you can give me some CuArray operation(s) that don't match up with their Array counterparts.
@maleadt I created a gist with a reproducible example of the RNN failure on GPU. It builds the RNN structures from scratch. I couldn't isolate the problem down to a single array operation. As can be seen by executing the script, the predictions of the CPU and GPU models are exactly the same for a couple of iterations, until NaNs are returned on the GPU side (the gist's prints show the models' loss and some parameters), so it seems more tied to the iterative process. Let me know if there are some operations or tests that I could perform that may help the diagnosis.
Thanks, but this is still way too high-level to debug from the GPU side of things. I started reducing further, and after a couple of hours (and getting dumbfounded by JuliaGPU/CUDA.jl#515) I'm at:

```julia
using Flux
using CUDA
using Statistics: mean
using Random

mutable struct MyRecur{T}
    cell::T
    init
    state
end

MyRecur(m, h = hidden(m)) = MyRecur(m, h, h)

function (m::MyRecur)(xs...)
    m.state, y = m.cell(m.state, xs...)
    return y
end

Flux.@functor MyRecur
Flux.trainable(a::MyRecur) = (a.cell,)

struct MyRNNCell{F,A,V}
    σ::F
    Wi::A
    Wh::A
    b::V
end

MyRNNCell(in::Integer, out::Integer, σ = tanh; init = Flux.glorot_uniform) =
    MyRNNCell(σ, init(out, in), init(out, out), init(out))

function (m::MyRNNCell)(h, x)
    h = m.σ.(m.Wi*x)
    return h, h
end

hidden(m::MyRNNCell) = m.h
Flux.@functor MyRNNCell

MyRecur(m::MyRNNCell) = MyRecur(m, zeros(length(m.b)), zeros(length(m.b)))
MyRNN(a...; ka...) = MyRecur(MyRNNCell(a...; ka...))

function main(seed=0x6e73f7d40a1175b8)
    feat = 2
    h_size = 2
    Random.seed!(seed)

    rnn = Chain(MyRNN(feat, h_size))
    X = [[0.9426079f0; 0.8691368f0]]
    Y = [0.098483585f0]

    rnn_gpu = rnn |> gpu
    X_gpu = gpu(X)
    Y_gpu = gpu(Y)

    θ = Flux.params(rnn)
    θ_gpu = Flux.params(rnn_gpu)

    function loss(x, y)
        l = mean((Flux.stack(map(rnn, x), 2) .- y) .^ 2f0)
        return l
    end
    function loss_gpu(x, y)
        l = mean((Flux.stack(map(rnn_gpu, x), 2) .- y) .^ 2f0)
        return l
    end

    println("CPU")
    train!(loss, θ, (X, Y))
    println()

    println("GPU")
    train!(loss_gpu, θ_gpu, (X_gpu, Y_gpu))
    println()
end

function train!(loss, ps, d)
    gs = gradient(ps) do
        loss(d...)
    end
    x = first(ps)
    @show x gs[x]
end

isinteractive() || main()
```
So it looks like the gradients are wrong? I'll defer to somebody who knows about Zygote here.
Thanks for taking a look at this, and sorry for not nailing down the example to a basic operator.
@maleadt I think I finally put my finger on what was causing the problem... and I'm a little disappointed in myself! By replacing the `^ 2f0` exponent in the loss, the NaNs on the GPU went away. It remains unclear to me whether it would be considered an anomaly that such NaNs are thrown on some iterations, as if the 2f0 sometimes got rounded to something not exactly 2, resulting in complex-number errors when the base is negative, while the CPU seems not subject to the same rounding issue (or not as sensitive). Anyhow, I think I now have everything in place to complete this PR on RNNs, thanks again for your help.
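For reference, a minimal sketch of the kind of change being described, in assumed form rather than the exact diff from this PR:

```julia
# Avoid the Float32 literal exponent in the loss; abs2 (or an integer exponent)
# keeps the power on the real, integer-exponent code path.
using Statistics: mean

loss_pow(ŷ, y)  = mean((ŷ .- y) .^ 2f0)   # original form, NaN-prone on GPU here
loss_abs2(ŷ, y) = mean(abs2.(ŷ .- y))     # equivalent MSE without the Float32 power
```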
@CarloLucibello Anything left to do?
just two questions:
This is for being able to use the same layer type with and without bias (e.g. with Zeros as the bias).
@DrChainsaw Thanks for the clarifications! From my understanding, the challenges relating to Zeros / no bias are more general than just the RNN. Therefore, I think it makes sense to consider this current RNN PR independently and bring in the eventual adaptation to whichever no-bias approach is selected, as needed.
Ok, let's merge this then.

bors r+
Build succeeded: |
CUDNN is an industry standard for CUDA usage and dropping it can come with hidden costs, so we should be extra careful before releasing. Further, is storing the state necessary? It doesn't seem like it should be.
In the RNN cells (RNN, GRU, LSTM), the state stored on the cell is the trainable initial state; the running state itself lives on the recurrent wrapper (a rough sketch of this layout follows below). I agree that CUDNN is a standard; however, as previously discussed, CUDNN's RNN would be beneficial in a 3D-data (full-sequence) logic, which I think would indeed be very beneficial to add. In the single-timestep logic as of now, the performance tests performed above actually showed a slight improvement with this pure-CUDA approach on fairly good-sized data on LSTMs, as well as consistent results between CPU and GPU. Given the various issues raised around RNNs, the aim here was to remove some of the frictions encountered (a couple of issues seem to be fixed by it). I agree though about the importance of not breaking something else along the way, so let me know if there are tests you think would help in that assessment.
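A rough sketch of that layout (field names follow this discussion, not necessarily the merged code):

```julia
# The cell stores the trainable initial state `state0`; the recurrent wrapper
# stores the running `state` that is overwritten at every step.
mutable struct Recur{T,S}
    cell::T      # assumed to carry its trainable initial state as `state0`
    state::S
end

function (m::Recur)(x)
    m.state, y = m.cell(m.state, x)   # cell returns (new_state, output)
    return y
end

reset!(m::Recur) = (m.state = m.cell.state0)   # rewind to the stored initial state
```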
I would definitely change
actually, there is no need for that since you just renamed
I would opt for
I'm not sure of the use case for
for backward compatibility, since RNNCell and others used to have an `h` field.
I was thinking about backward compatibility. Could you file a new PR with the renaming and the getproperty with some deprecation messages?
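For illustration, a rough sketch of what such a getproperty-based deprecation could look like (hypothetical struct name, not the actual PR code):

```julia
# Map the legacy `h` field name onto the new `state0` field, with a warning.
struct RNNCellExample{A,V,S}
    Wi::A
    Wh::A
    b::V
    state0::S
end

function Base.getproperty(m::RNNCellExample, sym::Symbol)
    if sym === :h
        Base.depwarn("field `h` is deprecated, use `state0` instead", :getproperty)
        return getfield(m, :state0)
    end
    return getfield(m, sym)
end
```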
RE cuDNN, JAX-based frameworks don't use the cuDNN RNN functionality at all. This is presumably because XLA doesn't support it and can optimize aggressively enough to be competitive, though. |
Will do!
1390: RNN deprecations and naming fixes r=CarloLucibello a=jeremiedb

Following the discussion in #1367, this PR disambiguates the initial-state parameters, named `state0` in the RNN cells, from the state of the RNN chain, named `state` in the `Recur` struct. It adds getproperty methods with deprecation messages to access the legacy `h` and `c` fields in the RNN cells, as well as the `init` field in `Recur` (which now points to `recur.cell.state0`). It also includes both 1D and 2D input dimensions in the basic BPTT test.

Co-authored-by: jeremie.db <[email protected]>
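As an aside, a small sketch of what covering both input dimensions means for a recurrent layer, assuming the standard Flux calling convention:

```julia
# A single sample is a feature vector (1D); a batch is a feature × batch matrix (2D).
using Flux

rnn = Flux.RNN(3, 5)
y1 = rnn(rand(Float32, 3))      # 1D input: one sample  -> length-5 output vector
Flux.reset!(rnn)
y2 = rnn(rand(Float32, 3, 8))   # 2D input: batch of 8  -> 5×8 output matrix
```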
Revisiting this, perhaps we should also experiment with 3D data and look at testing. Currently we seem to have a lot of broken tests, which was not the case until recently.
@DhairyaLGandhi Could you point to the actual broken tests? I'm only aware of issues that were actually fixed.
You might want to check the tests marked broken in the linked PR #1472.
I'm a bit confused by your comment regarding the broken tests, since from what I see these were introduced as "broken" in July 2020, so prior to this refactor of RNNs for Flux v0.11.3. As for the underlying cause of the broken tests, it seems to be a minor thing; I've just opened a PR where those tests should now be passing: #1473

Regarding performance, as previously mentioned in the PR, I got a slight improvement vs the prior CURNN dependency, observed while on CUDA.jl v1. However, there was a performance regression introduced in CUDA.jl v2, which seems to stem from a different GEMM being called: JuliaGPU/CUDA.jl#538. From my understanding, it results in a less-than-10% slowdown, which should in principle revert back to improved performance if fixed.

I think it is very desirable to have a CURNN integration for 3D input. However, for the single-step and other custom cases, I don't see the benefit vs relying on the transparent integration of Flux with CUDA.jl, which is where I think Flux shines. But this is arguably just a very subjective perception of mine :)
Right, I wasn't certain as to when the tests were marked so, and the point about the regression is well taken.
1473: Fix RNN tests on GPU r=DhairyaLGandhi a=jeremiedb

Fix for RNN on CUDA, as discussed in #1367.

Co-authored-by: jeremie.db <[email protected]>
PR related to #1114 #1360 #1365

Some experiments with RNN handling.

The hidden state of each cell structure was dropped as it wasn't needed (AFAIK, it was only needed for size inference for CUDNN, but the bias size could be used as a substitute for the cells' `h` there as well). Looked to drop the dependence on CUDNN entirely, so it's pure Flux/CUDA.jl; the file `src/cuda/curnn.jl` is no longer used. No modifications were made to the cell computations. Initial tests seem to show decent performance, but yet to benchmark.

Pending issue: despite having dropped the CUDNN dependency completely, there's still an instability issue that seems present when running on GPU. This is illustrated in the test at lines 1-50 of file `test\rnn-test-jdb.jl`. If that test runs on CPU, it goes well through the 100 iterations. However, the same on GPU will throw NaNs after a couple dozen iterations.

My only hypothesis so far: when performing the iteration over the sequence through `m.(x)` or `map(rnn, x)`, is the order of execution safe? I.e. is it possible that there isn't a sync() on the CUDA side between those sequence steps, which may mess up the state? (One way to probe this is sketched after the checklist below.)

PR Checklist

@dhairyagandhi96 (for API changes).
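For what it's worth, one way to probe that synchronization hypothesis is sketched below, reusing `rnn_gpu` and `X_gpu` from the reproducer at the top of the thread (illustrative only; the NaNs were later traced to the loss exponent instead):

```julia
# Force a device synchronization between sequence steps and check whether the
# GPU run still diverges. Normally this explicit sync is not required, since
# operations launched on the same CUDA stream execute in order.
using CUDA

outputs = map(X_gpu) do x
    y = rnn_gpu(x)
    CUDA.synchronize()
    y
end
```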