norm at zero #538

Closed
mcabbott opened this issue Oct 8, 2021 · 3 comments

Comments

@mcabbott
Member

mcabbott commented Oct 8, 2021

From JuliaDiff/ForwardDiff.jl#547, note that the rule for norm gives a zero gradient at x == 0. It might be preferable to pick something like a sub-gradient instead?

julia> using Zygote, ForwardDiff, LinearAlgebra

julia> for g in [Zygote.gradient, ForwardDiff.gradient]
       @show g
       for f in [norm, x -> sqrt(sum(abs2, x))]
         @show f
         @show g(f, [eps(),0])
         @show g(f, [0,eps()])
         @show g(f, [0,0])
       end
       end
g = Zygote.gradient
f = LinearAlgebra.norm
g(f, [eps(), 0]) = ([1.0, 0.0],)
g(f, [0, eps()]) = ([0.0, 1.0],)
g(f, [0, 0]) = ([0.0, 0.0],)   # rule from ChainRules
f = var"#17#18"()
g(f, [eps(), 0]) = ([1.0, 0.0],)
g(f, [0, eps()]) = ([0.0, 1.0],)
g(f, [0, 0]) = ([NaN, NaN],)   # with hand-written norm, 0/0
g = ForwardDiff.gradient
f = LinearAlgebra.norm
g(f, [eps(), 0]) = [1.0, 0.0]
g(f, [0, eps()]) = [0.0, 1.0]
g(f, [0, 0]) = [0.0, 1.0]      # this picks a sub-gradient?
f = var"#17#18"()
g(f, [eps(), 0]) = [1.0, 0.0]
g(f, [0, eps()]) = [0.0, 1.0]
g(f, [0, 0]) = [NaN, NaN]
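
For illustration only (a sketch of the convention under discussion, not the actual ChainRules rule): one way to avoid the 0/0 with a hand-written norm is to special-case y == 0 in an rrule and return whichever sub-gradient we settle on, here the zero vector. mynorm is a hypothetical helper.

using ChainRulesCore, Zygote

# Hypothetical hand-written norm plus an rrule that guards the x == 0 case.
mynorm(x) = sqrt(sum(abs2, x))

function ChainRulesCore.rrule(::typeof(mynorm), x::AbstractVector)
    y = mynorm(x)
    function mynorm_pullback(ȳ)
        # Away from zero the gradient is x ./ y; at zero the naive formula
        # gives 0/0 = NaN, so we return the zero sub-gradient instead.
        x̄ = iszero(y) ? zero(x) : x .* (ȳ / y)
        return NoTangent(), x̄
    end
    return y, mynorm_pullback
end

Zygote.gradient(mynorm, [0.0, 0.0])  # ([0.0, 0.0],) rather than ([NaN, NaN],)

Any other convention, say a fixed unit vector g returning ȳ .* g, would just replace the zero(x) branch above.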
mcabbott transferred this issue from JuliaDiff/ChainRulesCore.jl Oct 8, 2021
@oxinabox
Member

[0.0, 0.0] seems right to me; but maybe I am missing something important.
Breaking symmetry and choosing either [1.0, 0.0] or [0.0, 1.0] seems icky.
I guess we could do fill(inv(sqrt(length(x))), length(x)), though that is also an arbitrary choice, perturbing off along a "positive diagonal".
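
For what it's worth (a quick check added for illustration): the sub-differential of the Euclidean norm at zero is the closed unit ball, so any vector of norm at most 1 is a valid sub-gradient, and this candidate sits exactly on the boundary:

using LinearAlgebra

n = 2
v = fill(inv(sqrt(n)), n)   # [0.7071..., 0.7071...]
norm(v) ≈ 1                 # true: v lies on the boundary of the unit ball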

@sethaxen
Member

It seems norm would often be used in an optimization problem where the optimum is achieved when norm(...) == 0, so the [0,0] gradient makes sense to me. The only other way I can think of to get exactly a 0-norm is to initialize points such that exactly a 0-norm is formed, which doesn't seem like our problem.

@mcabbott
Member Author

The concern would be that if x == [0,0] isn't the optimum, you could get stuck there. And you needn't initialise there; you could, for instance, be adding some noise and restricting, like x_next = clamp.(x .+ randn.()./100, 0, 1).

Mathematically the answer will depend on what direction you approach this point from, which could lead you to argue that no limit exists and the right answer is then NaN. But for optimisation, it's probably better to pick one?
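
To make that concrete (a finite-difference check added for illustration; dirderiv is a hypothetical helper): the one-sided directional derivatives from opposite directions don't cancel, so no true gradient exists at zero.

using LinearAlgebra

# One-sided directional derivative of f at x along v, by finite differences.
dirderiv(f, x, v; t = 1e-8) = (f(x .+ t .* v) - f(x)) / t

dirderiv(norm, [0.0, 0.0], [ 1.0, 0.0])   # ≈ +1
dirderiv(norm, [0.0, 0.0], [-1.0, 0.0])   # also ≈ +1, not -1, so norm is not differentiable at 0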

That said, this hasn't bitten me, but it came up in the linked ForwardDiff issue.
