Inequality, ε-balls, and accidental structural zeros #480
There's another example here: TuringLang/DistributionsAD.jl#23 (comment). Behaviour on tagged ForwardDiff:

```julia
using ForwardDiff, LinearAlgebra

A = Matrix(I, 2, 2)
ForwardDiff.gradient(x -> sum(x \ [1,3]), A)            # [-1 0; 0 -3]
ForwardDiff.gradient(x -> sum(cholesky(x) \ [1,3]), A)  # [-1 -4; 0 -3]

Aplus2 = [1 0; 0.001 1]  # perturb A[2]
sum(Aplus2 \ [1,3]) - sum(A \ [1,3])                      # -0.001
sum(cholesky(Aplus2) \ [1,3]) - sum(cholesky(A) \ [1,3])  # PosDefException: matrix is not Hermitian

Aplus23 = [1 0.001; 0.001 1]  # perturb A[2] and A[3] together
sum(Aplus23 \ [1,3]) - sum(A \ [1,3])                       # -0.004
sum(cholesky(Aplus23) \ [1,3]) - sum(cholesky(A) \ [1,3])   # -0.004
```

Perhaps one could compare the full Jacobians:

```julia
ForwardDiff.jacobian(x -> x \ [1,3], A)
# -1.0  -0.0  -0.0  -0.0
# -0.0  -0.0  -0.0  -3.0

ForwardDiff.jacobian(x -> cholesky(x) \ [1,3], A)
# -1.0  0.0  -3.0   0.0
#  0.0  0.0  -1.0  -3.0
```

With the proposed change to `==`:

```julia
ForwardDiff.gradient(x -> sum(x \ [1,3]), A)            # [-1 -3; -1 -3]
ForwardDiff.gradient(x -> sum(cholesky(x) \ [1,3]), A)  # ERROR: PosDefException

ForwardDiff.jacobian(x -> x \ [1,3], A)
# -1.0   0.0  -3.0   0.0
#  0.0  -1.0   0.0  -3.0

Aplus2 = [1 0; 0.001 1]
Aplus2 \ [1, 3] - A \ [1, 3]  # [0, -0.001]
Aplus3 = [1 0.001; 0 1]
Aplus3 \ [1, 3] - A \ [1, 3]  # [-0.003, 0]
```

The example just above that, in TuringLang/DistributionsAD.jl#23 (comment), also gives an error on this branch.
---

I didn't fully think it through, but wouldn't it break functions that special-case a point, like a hand-written `sinc` with an `x == 0` branch?
---

Right now, if I wanted correct derivatives of such functions up to a certain order, I'd just put a Taylor approximation of the function in the "measure zero" branch.
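A minimal sketch of that idea (not from the thread — the window `1e-4` and the name `fpi_taylor` are made up here; `fpi` itself appears in the next comment):

```julia
using ForwardDiff

# sin(pi*x)/(pi*x) = 1 - (pi*x)^2/6 + O(x^4), so a second-order Taylor polynomial in the
# "measure zero" branch keeps the value and the first two derivatives correct near x == 0.
fpi_taylor(x) = abs(x) < 1e-4 ? 1 - (pi*x)^2/6 : sin(pi*x) / (pi*x)

ForwardDiff.derivative(fpi_taylor, 0.0)    # 0.0, correct (the function is even)
ForwardDiff.derivative(fpi_taylor, 1e-40)  # ≈ -3.29e-40 = -pi^2*x/3, instead of ~1e24
```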
---

This definition:

```julia
julia> fpi(x) = x==0 ? 1.0 : sin(pi*x) / (pi*x);

julia> fpi(1e-40)
1.0

julia> ForwardDiff.derivative(fpi, 1e-40)
1.2089258196146292e24
```

is a variant of #466, for which the solution involves the exact derivative, `cosc`:

```julia
julia> cosc(1e-40)
-3.2898681336964526e-40
```

Even if you don't have an exact closed-form derivative, I think that you would usually want to replace the exact `x == 0` test with something wider. Note that since the fallback `atol` is zero, `≈` out of the box is no less strict than `==` when comparing against zero:

```julia
julia> ForwardDiff.Dual(1e-40, 0) ≈ 0
false

julia> nextfloat(0.0) ≈ 0.0
false
```
---

One more data point is that https://github.com/JuliaPhysics/Measurements.jl gets this right, according to my argument above. While its error bars aren't exactly dual numbers, they are a related species. Here's a version of #536 in which the scalar is exactly zero but carries uncertainty:

```julia
julia> using Measurements, LinearAlgebra  # LinearAlgebra for mul!

julia> λ = measurement(0, 0.1)
0.0 ± 0.1

julia> iszero(λ)
false

julia> A = measurement.([1 2; 3 4], 0.1);

julia> B = measurement.([5 6; 7 8], 0.1);

julia> A * (B * λ)
2×2 Matrix{Measurement{Float64}}:
 0.0±1.9  0.0±2.2
 0.0±4.3  0.0±5.0

julia> mul!(similar(A), A, B, λ, 0)  # like issue 536
2×2 Matrix{Measurement{Float64}}:
 0.0±1.9  0.0±2.2
 0.0±4.3  0.0±5.0
```
---

And it's definitely a false one. It's an axiom for computer scientists, but not for numerical analysts 😅. A nice example is in the space of ODEs. Automatic differentiation is equivalent to solving the expanded ODE known as the forward sensitivity equations, essentially:

```
u' = f(u, p, t)
d/dt (du/dp) = (df/du) (du/dp) + df/dp
```

Straight automatic differentiation is equivalent to solving this expanded ODE with the adaptive error controls only applying to the first part of the equation (see https://github.com/SciML/DiffEqBase.jl/blob/v6.83.1/src/forwarddiff.jl#L31-L34). This means that solving with ForwardDiff gives different stepping behaviour; but if you don't do that, then there will be cases where you get an "infinite" derivative because of numerical instability in the derivative calculation, even when the primal is stable. So definitely, this axiom does not hold for the realities of numerical computing.

So, back to the core of the thread: I definitely agree with you. In fact, DiffEq specializes its interpolation computation in order to work around this kind of issue.
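A minimal sketch (assumed for illustration, not DiffEq's implementation; the names `euler_solve` and `euler_solve_with_sensitivity` are made up) of the equivalence being described: differentiating a fixed-step solve with ForwardDiff computes exactly the same recursion as integrating the sensitivity equation alongside the state.

```julia
using ForwardDiff

# Fixed-step Euler solve of u' = p*u on [0, T].
function euler_solve(p; u0 = 1.0, dt = 1e-3, T = 1.0)
    u = u0
    for _ in 1:round(Int, T / dt)
        u += dt * (p * u)
    end
    u
end

# Same solve, carrying s = du/dp via d/dt(du/dp) = (df/du)*(du/dp) + df/dp,
# with df/du = p and df/dp = u for this right-hand side.
function euler_solve_with_sensitivity(p; u0 = 1.0, dt = 1e-3, T = 1.0)
    u, s = u0, 0.0
    for _ in 1:round(Int, T / dt)
        u, s = u + dt * (p * u), s + dt * (p * s + u)
    end
    u, s
end

ForwardDiff.derivative(euler_solve, 0.5)  # ≈ 1.648, i.e. ≈ T * exp(p*T)
euler_solve_with_sensitivity(0.5)[2]      # the same number, from the sensitivity ODE
```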
---

Perhaps it's worth noting that many other AD systems have the same problem. On the `prod1`/`prod2` example from the issue text:

```julia
julia> Zygote.gradient(prod1, [1,2,0,4,0,6])
([0.0, 0.0, 0.0, 0.0, 0.0, 0.0],)

julia> Zygote.gradient(prod2, [1,2,0,4,0,6])  # wrong
([0.0, 0.0, 2.0, 0.0, 0.0, 0.0],)

julia> Tracker.gradient(prod1, [1,2,0,4,0,6])  # (after removing the ::Vector restriction)
([0.0, 0.0, 0.0, 0.0, 0.0, 0.0] (tracked),)

julia> Tracker.gradient(prod2, [1,2,0,4,0,6])  # wrong
([0.0, 0.0, 2.0, 0.0, 0.0, 0.0] (tracked),)

julia> dx = zeros(6); Enzyme.autodiff(prod1, Duplicated([1,2,0,4,0,6.], dx)); (dx,)
([0.0, 0.0, 0.0, 0.0, 0.0, 0.0],)

julia> dx = zeros(6); Enzyme.autodiff(prod2, Duplicated([1,2,0,4,0,6.], dx)); (dx,)  # wrong
([0.0, 0.0, 2.0, 0.0, 0.0, 0.0],)
```

I think Zygote cannot fix this, as it does not know which variables are active when it transforms code. Tracker could surely change this.
---

I think @wsmoses's comment is quite interesting. I've never viewed it that way, but I guess one could say that all these AD systems are correct, in that they correctly return the gradient of the function that is actually implemented/defined by the code as written.
---

It would be best if this were merely surprising, but for users the behaviour is at least frustrating. From this discussion, and from discussion on Slack, it turns out that this is a known issue and a property of many other AD backends, but for some reason it is not communicated well to end users, who may rely on it in critical systems or extensive simulations. That matters because it happens not only in toy examples but in real code as well; we hit a good example of it in our own case. This kind of limitation should be communicated better (e.g. in the documentation) to end users, who indeed may not have thought about the implications of different implementations for AD. AD systems position themselves as a fast and more accurate alternative to finite differences, but do not document clear (known!) pitfalls. That is bad, and it is not specific to ForwardDiff.

---
Consider this function:
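The code block quoted here isn't preserved in this extraction; judging from the discussion that follows, it was presumably something of this shape (a reconstruction, not the original):

```julia
using ForwardDiff

# Equal values on both branches, but different derivatives at the special point.
sq(x) = x == 1 ? one(x) : x^2

ForwardDiff.derivative(sq, 1.0)  # 0.0, from the x == 1 branch; your calculus class says 2.0
```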
Here `ForwardDiff` gets the wrong answer, according to your first calculus class: the derivative is defined by taking limits, evaluating `sq(x + ε)` for some small `ε`, and these always see the continuum `x^2`, not the special point.

One way to think about this is to say that `x==1` really means `abs(x-1) < ζ` with some tiny `ζ`, which we keep finite until we are sure we aren't confused. The calculus-class assumption is that `ζ << ε`.

The assumption of `ForwardDiff` is the opposite. Its `Dual(x,1)` encodes a perturbation `x + 1ε` with `ε` smaller than everything else around, and in particular `ε << ζ`. Or in other words, `sq` is viewed as being piecewise continuous, with a small flat area of width `2ζ`, which is still large enough for us to see that its slope is zero.

Of course nobody really writes contrived examples like `sq`. But they do write things like this:
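Again the quoted block isn't preserved here; judging from the `prod1`/`prod2` outputs shown in the comments above, it was presumably along these lines (a reconstruction):

```julia
prod1(xs) = prod(xs)

function prod2(xs)            # same value as prod1 for every real input
    p = one(eltype(xs))
    for x in xs
        p = p * x
        p == 0 && return p    # "simple optimisation": once p is exactly zero, stop early
    end
    p
end
```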
This has almost the same problem as #197, where `det(A)` tests for `istriu(A) || istril(A)` before calling a simpler branch. The fact that `f(x,y) == g(x,y)` when `y==0` does not imply that `df/dy == dg/dy`. So it seems AD ought not to take that branch.

In which case we want something like this:
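The proposed definition itself isn't quoted in this extraction; a sketch of the kind of rule meant, judging from the surrounding text (a `Dual` carrying a genuine perturbation never compares equal to a plain number, while one with zero partials still does):

```julia
using ForwardDiff: Dual, value, partials

# Sketch only, not the PR's exact code: equality requires matching values *and* an
# exactly zero perturbation, so branches guarded by `x == 1` or `p == 0` are not
# taken while differentiating.
Base.:(==)(x::Dual, y::Real) = value(x) == y && all(iszero, partials(x))
```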
This fixes the tests above, and (a slightly more careful version) fixes #197 and #407.

However, it means that `fun(Dual(x,1)).value` need not be equal to `fun(x)`, on a discontinuous function. Although `fun(Dual(x,0)).value` should still be equal, `@assert zero(x) == 0` isn't broken, and there should be no problems where functions use things like `zero(eltype(xs))` for type-stability.

The idea that the forward evaluation is unchanged is often thought of as an axiom of AD, but for discontinuous functions, I think that's another way of saying `ε << ζ`. Which is a choice. And one that your calculus teacher would disapprove of. The point of evaluating a function with dual numbers is, presumably, to find derivatives, so finding them correctly ought to have a higher priority.

There are other comparisons to think about, for example:
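The comparison examples themselves aren't preserved; the names `sq2` and `clamp2` below are taken from the prose, and the bodies are guesses at the kind of thing meant:

```julia
sq2(x)    = (x >= 1 && x <= 1) ? one(x) : x^2        # simulates x == 1 using inequalities
clamp2(x) = x > 1 ? one(x) : (x < -1 ? -one(x) : x)  # hand-written clamp to [-1, 1]
```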
I'm not sure how often simulating `x==1` as in `sq2(x)` happens in the wild. Perhaps from some combination like `f(x) = relu(x) + 0.1*relu(-x)`? But `clamp`ing parameters to some range is routine. Your calculus teacher would throw an error here, but that's probably not the most helpful response for the computer.
ForwardDiff
will use is a feature. The0.5
gradient allaFiniteDifferences
would also be fine for getting un-stuck, but it's very difficult to picture howForwardDiff
could evaluate both branches, and easy to picture doing so having awful side-effects.Here is one way to relate the present rule for
>(::Dual, ::Real)
and>=(::Dual, ::Real)
to the finite-everythingζ << ε
story. We can say that while theε
-ball overlaps with both sides, the vote from the longer side (longer by about2ζ
) always wins by a hair:Trying out the above
==(::Dual, ::Real)
rule, it looks like the tests of this package all pass, except for the ones explicitly testing such rules. It would be interesting to know if this breaks any other uses in the wild. It would also be interesting to think up other pathological examples, maybe I've missed something important.Also:
Another way to talk about this: The problem with
prod2
above, anddet
in support for the determinant function #197, is that they promote accidental zeros of the input to structural zeros. And AD then respects these, and gets the wrong answer. What looked like a simple optimisation when writing for real numbers, has been unintentionally promoted to a constraint on what derivatives are allowed. This is the reverse of the discussion about preserving structural zeros in things likeZygote.gradient(sum∘exp, Diagonal(ones(3)))[1]
.Some other packages get this right, such as TaylorSeries.jl, this C++ code, and this Ruby. Some get it wrong (according to me) like this Mathematica code, this paper with Matlab, and this blog post, although he changed his mind from right to wrong. More mathematical treatments seem to regard the tuple
(x,δx)
as inheriting==
from tuples, i.e. they get it right.Similar things were also discussed in Problem with
abs
#377, where the example is this:A rule was suggested there in which
x > y
behaves differently forx.value == y.value
, breaking such ties by comparingx.partials > y.partials
. In theclamp2
example, whether you get stuck against the wall presumably shouldn't depend on whether you minimiseloss(x)
or maximise-loss(x)
, so we probably don't want to comparex.partials .> 0
when onlyx
is a dual number. But the rule when bothx
andy
are dual might be worth some more thought.The text was updated successfully, but these errors were encountered: