NaNs in jacobian of a function that uses a StaticArray and norm #547

ferrolho opened this issue Oct 2, 2021 · 11 comments · Fixed by JuliaArrays/StaticArrays.jl#975



ferrolho commented Oct 2, 2021

Hi! I am having trouble differentiating a function with ForwardDiff.jl. I don't know why, but the resulting jacobian contains NaNs. Below is an MWE to kick-start the discussion.

Consider the following functions:

using BenchmarkTools, ForwardDiff, LinearAlgebra, StaticArrays

foo2(A) = SVector{4}(norm(r, 2) for r = eachrow(A))

function foo1!(out, x; μ = 0.8 / 2)
    λ = SVector{3}(@view x[1:3])
    K = SMatrix{3,3}(@view x[4:12])

    A_λ = @SMatrix [ 1   0  -1   0 ;
                     0   1   0  -1 ;
                     μ   μ   μ   μ ]

    out[1:4] = A_λ' * λ + foo2(A_λ' * K)

    out
end

Let's try out foo1!:

julia> x = rand(12);

julia> out = zeros(4);

julia> foo1!(out, x)
4-element Vector{Float64}:
 2.403435298832313
 2.0030472808350077
 0.7216049166673891
 0.6293600802591698

foo1! is type-stable and does not perform dynamic allocations:

julia> @btime $foo1!($out, $x)
  21.684 ns (0 allocations: 0 bytes)
4-element Vector{Float64}:
 2.403435298832313
 2.0030472808350077
 0.7216049166673891
 0.6293600802591698

We can use ForwardDiff.jl to compute the jacobian of foo1!:

julia> ForwardDiff.jacobian(foo1!, out, x)
4×12 Matrix{Float64}:
  1.0   0.0  0.565685  0.606882  0.0        0.343304     0.353444  0.491234  0.0        0.277884
  0.0   1.0  0.565685  0.0       0.744664   0.421245      0.22503   0.0       0.535939   0.303173
 -1.0   0.0  0.565685  0.722007  0.0       -0.408429     -0.362283  0.261825  0.0       -0.14811
  0.0  -1.0  0.565685  0.0       0.930926  -0.526611     -0.154069  0.0       0.243306  -0.137634

However, if the inputs are all zeros, the jacobian will contain NaNs:

julia> x = zeros(12);

julia> ForwardDiff.jacobian(foo1!, out, x)
4×12 Matrix{Float64}:
 NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
 NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
 NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
 NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN

But if we change the type of the matrix A_λ (in foo1!) from an SMatrix to a normal Matrix, the jacobian will be evaluated properly:

julia> function foo1!(out, x; μ = 0.8 / 2)
           λ = SVector{3}(@view x[1:3])
           K = SMatrix{3,3}(@view x[4:12])

           A_λ = [ 1   0  -1   0 ;
                   0   1   0  -1 ;
                   μ   μ   μ   μ ]

           out[1:4] = A_λ' * λ + foo2(A_λ' * K)

           out
       end
foo1! (generic function with 1 method)

julia> ForwardDiff.jacobian(foo1!, out, x)
4×12 Matrix{Float64}:
  1.0   0.0  0.565685  0.0  0.0  0.0  0.0  0.0  0.0   1.0   0.0  0.565685
  0.0   1.0  0.565685  0.0  0.0  0.0  0.0  0.0  0.0   0.0   1.0  0.565685
 -1.0   0.0  0.565685  0.0  0.0  0.0  0.0  0.0  0.0  -1.0   0.0  0.565685
  0.0  -1.0  0.565685  0.0  0.0  0.0  0.0  0.0  0.0   0.0  -1.0  0.565685

Moreover, I have observed that the jacobian does not contain NaNs if we use the 1-norm or the Inf-norm, even if we keep A_λ as an SMatrix, i.e.,

foo2(A) = SVector{4}(norm(r, 1) for r = eachrow(A))

or

foo2(A) = SVector{4}(norm(r, Inf) for r = eachrow(A))

I am not sure if this is a bug or if I am doing something wrong... Can someone help me figure it out? Thank you in advance!


ferrolho commented Oct 2, 2021

I have read the docs section Fixing NaN/Inf Issues under User Documentation > Advanced Usage Guide. Indeed, if I set the NANSAFE_MODE_ENABLED constant to true, the jacobian no longer contains NaNs. However, the evaluated jacobian differs from the one obtained with the version of foo1! that defines A_λ as a normal Matrix.

If A_λ is a Matrix{Float64}:

julia> ForwardDiff.jacobian(foo1!, out, x)
4×12 Matrix{Float64}:
  1.0   0.0  0.565685  0.0  0.0  0.0  0.0  0.0  0.0   1.0   0.0  0.565685
  0.0   1.0  0.565685  0.0  0.0  0.0  0.0  0.0  0.0   0.0   1.0  0.565685
 -1.0   0.0  0.565685  0.0  0.0  0.0  0.0  0.0  0.0  -1.0   0.0  0.565685
  0.0  -1.0  0.565685  0.0  0.0  0.0  0.0  0.0  0.0   0.0  -1.0  0.565685

If A_λ is an SMatrix{3, 4, Float64, 12}:

julia> ForwardDiff.jacobian(foo1!, out, x)
4×12 Matrix{Float64}:
  1.0   0.0  0.565685  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
  0.0   1.0  0.565685  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 -1.0   0.0  0.565685  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
  0.0  -1.0  0.565685  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0

I would appreciate some guidance on what I should do. That is, should I set NANSAFE_MODE_ENABLED to true from now on (and take the ~5-10% performance hit)? Or is there something else I can do to fix this issue without having to toggle NANSAFE_MODE_ENABLED? Thank you!

@KristofferC

Regardless of NaNs, it would be good to figure out why Matrix and SMatrix differ. I guess just peppering some print statements throughout the code should reveal where things diverge.


ferrolho commented Oct 2, 2021

Hi, @KristofferC. Thank you for your reply. You mean in ForwardDiff.jl? Where should I start? 😅


KristofferC commented Oct 2, 2021

With StaticArrays 0.12.4 I get the same result for the static matrix and the regular matrix when running the example in the first post. With StaticArrays 1.2.13 I get your result. So something changed in between.

Bisecting points to JuliaArrays/StaticArrays.jl#908 as the offending PR.


mcabbott commented Oct 7, 2021

A more minimal example is

julia> using StaticArrays, ForwardDiff

julia> ForwardDiff.gradient(norm, [0,0])
2-element Vector{Float64}:
 0.0
 1.0

julia> ForwardDiff.gradient(norm, SA[0,0])
2-element SVector{2, Float64} with indices SOneTo(2):
 NaN
 NaN

With a hand-written norm, we end up evaluating sqrt at 0, where its derivative is infinite. The non-NaN answer is something like a choice of subgradient:

julia> ForwardDiff.gradient(x -> sqrt(sum(abs2, x)), [0,0])
2-element Vector{Float64}:
 NaN
 NaN

julia> ForwardDiff.gradient(x -> sqrt(sum(x .^ 2)), [0, eps()])
2-element Vector{Float64}:
 0.0
 1.0

julia> ForwardDiff.gradient(x -> sqrt(sum(x .^ 2)), [eps(), 0])
2-element Vector{Float64}:
 1.0
 0.0

Note also that, because the non-NaN result comes from an explicit branch (at least according to JuliaArrays/StaticArrays.jl#964; I didn't check LinearAlgebra's source), it will be changed by #481, which never takes the measure-zero branch and hence always gives NaNs.

Zygote's behaviour (using a rule for norm from ChainRules) does not pick one of these, BTW:

julia> Zygote.gradient(x -> sqrt(sum(abs2, x)), [0,0])
([NaN, NaN],)

julia> Zygote.gradient(norm, [0,0])
([0.0, 0.0],)

That rule could choose to pick a particular "subgradient". But I'm not sure that doing so with DiffRules would work, since the function takes a vector, not a scalar.
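
Roughly, that convention amounts to returning the zero vector as the gradient of norm at the origin. A hypothetical sketch of such a subgradient choice, for illustration only (this is not the rule ChainRules actually ships):

using LinearAlgebra

# Pick 0 as the subgradient of the 2-norm at the origin; elsewhere use x / norm(x).
function subgrad_norm(x)
    n = norm(x)
    return iszero(n) ? zeros(length(x)) : x ./ n
end

subgrad_norm([0, 0]) then returns [0.0, 0.0], matching Zygote.gradient(norm, [0, 0]) above.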

@sethaxen

It seems this particular issue is due to the derivative for sqrt assuming the input is nonzero.

julia> ForwardDiff.gradient(x -> sqrt(sum(abs2, x)), [0,0])  # bad
2-element Vector{Float64}:
 NaN
 NaN

julia> ForwardDiff.gradient(x -> sum(abs2, x), [0,0]) # removing the sqrt, fine
2-element Vector{Int64}:
 0
 0

julia> sqrt(ForwardDiff.Dual(0.0, 0.0)) # Uh-oh! should be zero!
Dual{Nothing}(0.0,NaN)

julia> ForwardDiff.Dual(0.0, 0.0)^0.5 # yes, pow has the same issue
Dual{Nothing}(0.0,NaN)

I think the right fix here is to bypass DiffRules' rules for ^, sqrt, and cbrt so they return 0 in these cases, since DiffRules cannot take the value of the input derivative into account.

See also JuliaDiff/ChainRules.jl#576
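
A hypothetical sketch of what such a zero-aware rule could look like for sqrt on Dual numbers (illustration only, not the fix that was eventually merged): when the incoming partials are all zero, propagate zero partials instead of multiplying them by the infinite derivative 1 / (2 * sqrt(0)).

using ForwardDiff: Dual, value, partials

function safe_sqrt(d::Dual{T}) where {T}
    v = sqrt(value(d))
    # All-zero partials stay zero instead of becoming 0 * Inf == NaN.
    p = iszero(partials(d)) ? zero(partials(d)) : partials(d) / (2v)
    return Dual{T}(v, p)
end

With this, ForwardDiff.gradient(x -> safe_sqrt(sum(abs2, x)), [0.0, 0.0]) should return [0.0, 0.0] rather than NaNs.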

@mcabbott

Oh right, well spotted. The dual part is indeed zero:

julia> sqrt_show(x) = sqrt(@show x);

julia> ForwardDiff.gradient(x -> sqrt_show(sum(abs2, x)), [0,0])
x = Dual{ForwardDiff.Tag{var"#5#6", Int64}}(0,0,0)
2-element Vector{Float64}:
 NaN
 NaN

Catching that and replacing it by zero would avoid the NaN, but it seems slightly weird in that it doesn't pick a subgradient (if that's the term). While evaluating at [0, eps()] gives you a gradient of norm 1, telling you that moving away from zero will change the function, this would imply that the function is flat there.

We can picture this (for a 2-vector) as the tip of a cone, like one rolled up from a piece of paper. The current behaviour is to say that the tip is a singularity and hence has no gradient. The sqrt proposal would smooth the tip off, so that microscopic angels can treat it as a dance floor. If you could always pick [0, eps()], then instead you would always be informed that you can go downhill, with a random direction provided, but the rate at which you go downhill would be correct.

Maybe that's not so strange then. It's a bit like derivative(abs, 0) == 0, rounding off the singularity, whereas [0, eps()] is like derivative(abs, 0) == 1. At present CR picks the former, ForwardDiff the latter.
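
For the record, the ForwardDiff side of that comparison can be checked directly; on a recent version it gives the right-hand derivative:

julia> ForwardDiff.derivative(abs, 0.0)
1.0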

@sethaxen

FWIW, finite differences finds that the gradient of norm at a zero vector is zero:

julia> using FiniteDifferences, LinearAlgebra

julia> FiniteDifferences.grad(central_fdm(5, 1), norm, [0.0,0.0])
([-3.8050255348364236e-17, -3.8050255348364236e-17],)

So I would be inclined to side with CR here.

But either way, I think the sqrt, cbrt, and ^ issue is bad behavior that needs to be addressed.
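
For completeness, cbrt shows the same pattern as sqrt and ^ above (output from a recent ForwardDiff version):

julia> cbrt(ForwardDiff.Dual(0.0, 0.0))  # should arguably be zero as well
Dual{Nothing}(0.0,NaN)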

@mcabbott

Finite differencing just averages the two sides of abs, right? If it were doing anything smarter I'd give it less weight...

@devmotion

As a side remark: as noted above, everything works if one uses the NaN-safe setting (https://juliadiff.org/ForwardDiff.jl/dev/user/advanced/#Fixing-NaN/Inf-Issues). I think one should benchmark whether, and to what extent, performance is affected by making NaN-safe mode the default. It seems weird to report incorrect results with the default setting.
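
For reference, a minimal sketch of turning that setting on. Recent ForwardDiff releases expose NaN-safe mode as a package preference (the preference name "nansafe_mode" is assumed here from the linked docs page; older versions instead required editing the NANSAFE_MODE_ENABLED constant in ForwardDiff's source and re-precompiling):

julia> using Preferences, ForwardDiff

julia> set_preferences!(ForwardDiff, "nansafe_mode" => true)

julia> # restart Julia afterwards so ForwardDiff re-precompiles with the new setting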


mcabbott commented Jan 19, 2022

Great. I see that the implementation is here:

if NANSAFE_MODE_ENABLED
    @inline function Base.:*(partials::Partials, x::Real)
        x = ifelse(!isfinite(x) && iszero(partials), one(x), x)
        return Partials(scale_tuple(partials.values, x))
    end

    @inline function Base.:/(partials::Partials, x::Real)
        x = ifelse(x == zero(x) && iszero(partials), one(x), x)
        return Partials(div_tuple_by_scalar(partials.values, x))
    end

    @inline function _mul_partials(a::Partials{N}, b::Partials{N}, x_a, x_b) where N
        x_a = ifelse(!isfinite(x_a) && iszero(a), one(x_a), x_a)
        x_b = ifelse(!isfinite(x_b) && iszero(b), one(x_b), x_b)
        return Partials(mul_tuples(a.values, b.values, x_a, x_b))
    end
else
    @inline function Base.:*(partials::Partials, x::Real)
        return Partials(scale_tuple(partials.values, x))
    end

    @inline function Base.:/(partials::Partials, x::Real)
        return Partials(div_tuple_by_scalar(partials.values, x))
    end

    @inline function _mul_partials(a::Partials{N}, b::Partials{N}, x_a, x_b) where N
        return Partials(mul_tuples(a.values, b.values, x_a, x_b))
    end
end
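
For context, the failure mode these guards target is an all-zero Partials being scaled by a non-finite derivative, which plain IEEE arithmetic turns into NaN:

julia> 0.0 * Inf
NaN

Rewriting the scale factor to one(x) when it is non-finite and all partials are zero keeps the product at zero instead.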

Is it obvious why this tests !isfinite(x) && iszero(partials) and not just iszero(partials)? And if both checks are needed, why && rather than &? It seems strange to put a branch inside an ifelse. Maybe the idea is that, with 12 partials and one x, it's quicker to check the single x first and branch?
