Add direct Enzyme support #476

Closed
wsmoses opened this issue Oct 8, 2024 · 40 comments

Comments

@wsmoses

wsmoses commented Oct 8, 2024

AutoEnzyme should probably be specialized and not fall back to DI.

In addition to being slower in some cases, it's been shown to cause errors (even segfaults) when using AutoEnzyme in DI whereas using Enzyme directly ran successfully.

see jump-dev/JuMP.jl#3836 (comment)

@gdalle
Collaborator

gdalle commented Oct 8, 2024

How about we try to fix bugs together in DI instead of always undermining it? We could start with this one you mention: EnzymeAD/Enzyme.jl#1942 (I managed to reproduce it with pure Enzyme, independently from DI).

I understand your concerns about Enzyme performance / correctness being misrepresented by DI. But whenever someone wants access to multiple AD backends, as is the case for most SciML libraries, DI is the natural way to go. It offers unified syntax, and it can do things that no backend on its own can do at the moment, like sparse autodiff¹.
Besides, the whole idea is to avoid duplication of efforts, so that bindings only have to be written in one place.

If you show me a meaningful example where LinearSolve fails with DI + Enzyme (I have no doubt there are plenty), I'll be happy to try and fix it. I have already altered DI's design significantly to fit Enzyme's peculiarities (e.g. by introducing Constant arguments). I can alter it again, but it would be more pleasant to work together instead of against each other.

Footnotes

  1. I know about Spadina but it seems there is no Julia interface yet?

@wsmoses
Author

wsmoses commented Oct 8, 2024

Oh for sure, and I'm not at all saying to remove DI.

It's a fantastic package that makes a lot of things easier, some of which you describe above!

But also at the end of the day we both want to make things easier for our users. From my point of view, we should help the ecosystem by writing extensions/examples for Enzyme that maximize performance and compatibility. I don't see how this is different from you opening up PRs/issues on various repos asking if DI can be helpful?

In some cases, like LinearSolve.jl, that's an extension to add EnzymeRules to simplify the code being differentiated (also leading to better performance).

In other cases, where a package dispatches to an autodiff backend, it makes sense to call Enzyme directly.

The fact that the other day we saw a segfault when using AutoEnzyme in DI for a small docs example, where Enzyme directly worked, is rather worrying to me, and implies that presently the overhead of DI might be causing problems for Enzyme users more generally.

As a result, packages have usually been adopting a dual approach: calling Enzyme.autodiff/related directly when given an AutoEnzyme or other object, and DI.gradient/related for other non-specialized ADTypes. This lets users get the best of both worlds: performance and compatibility when available, general support for AD packages, as well as all the nice things like sparsity that DI provides.
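A minimal sketch of this dual-dispatch pattern (the function name compute_gradient and the scalar-valued use case are assumptions for illustration, not code from any particular package):

using ADTypes
import DifferentiationInterface as DI
using Enzyme: Enzyme

# Specialized path: when given AutoEnzyme, call Enzyme's native API directly.
function compute_gradient(f::F, ::AutoEnzyme, x::AbstractVector) where {F}
    dx = Enzyme.make_zero(x)
    Enzyme.autodiff(Enzyme.Reverse, Enzyme.Const(f), Enzyme.Active, Enzyme.Duplicated(x, dx))
    return dx
end

# Generic fallback: any other ADTypes backend goes through DI.
function compute_gradient(f::F, backend::ADTypes.AbstractADType, x::AbstractVector) where {F}
    return DI.gradient(f, backend, x)
end

# e.g. compute_gradient(v -> sum(abs2, v), AutoEnzyme(), [1.0, 2.0])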

I'm all up for fixing issues as they arise, but also in the case where we have a backend which is more performant and stable, we should use it and not force the burden of debugging our packages on users, when unnecessary.

Oftentimes, someone will try something once, and if it fails or runs unnecessarily slowly, they'll just never use it again.

I get that your goal is to try to have as many packages as possible start using DI, but there's more than enough room for both DI and Enzyme extensions in the Julia autodiff ecosystem! :)

@wsmoses
Author

wsmoses commented Oct 8, 2024

As for the issue you opened, seemingly after my initial comment, the MWE implies that DI is introducing a type instability through the splat operator. This may result in performance slowdowns, as well as some code no longer being differentiable (as is clearly the case there).

I'll quickly look into fixing it, but historically we have also been advising people to remove type instabilities (as well as unions), which here would require not using DI.

@gdalle
Collaborator

gdalle commented Oct 8, 2024

In other cases, where a package dispatches to an autodiff backend, it makes sense to call Enzyme directly.

My main message is that this doesn't always make sense. There is a convenience tradeoff between (1) using DI for everything and (2) using DI for everything except Enzyme + using Enzyme's native interface. And of course option (2) seems easy to you because you know Enzyme inside and out, but that's not the case for most users, even power users from SciML.

The fact that the other day we saw a segfault when using AutoEnzyme in DI for a small docs example where Enzyme directly worked is rather worrying to me

This was a very specific case due to JuMP's unusual requirement that arguments should be splatted into the function f(x...). In most other optimization settings, the input x is a vector or array that doesn't need to be splatted into individual numbers before being passed to f. So I wouldn't really make such a big deal out of it, but we can add a warning to that JuMP page mentioning this caveat.

Still, whenever you say "DI is problematic because it can segfault on Enzyme", it also implies "Enzyme itself is problematic because it can segfault". Indeed, the pure Enzyme MWE for splatting shows that this type instability is not handled gracefully, and just crashes the console. Sure, Enzyme provides a multi-argument way around this bug (which is inaccessible through DI's single-argument interface), but it remains an Enzyme bug because it doesn't happen with other backends.

presently the overhead of DI might be making problems for Enzyme users more generically.

On the other hand, DI also allows users to experiment with Enzyme at virtually no cost. Until LinearSolve switched to DI, it had no way of using Enzyme for its internal Jacobians. Now it's as simple as a backend argument switch, because the necessary bindings are already in DI.
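For concreteness, the switch looks like this (with a toy function rather than LinearSolve's internal one):

using ADTypes: AutoEnzyme, AutoForwardDiff
import DifferentiationInterface as DI
import Enzyme, ForwardDiff  # loading them activates the corresponding DI extensions

g(x) = x .^ 2
x = [1.0, 2.0, 3.0]

DI.jacobian(g, AutoForwardDiff(), x)  # ForwardDiff
DI.jacobian(g, AutoEnzyme(), x)       # Enzyme: same call, different backend argument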

As a result, packages have been usually adopting a dual approach, calling Enzyme.autodiff/related directly when given an AutoEnzyme or other object, and DI.gradient/related for other non-specialized ADTypes.

But then what's the point of going through all the trouble of supporting Enzyme in DI, if you're just gonna go around telling people not to use it that way?
LinearSolve is a classic example of a case where using Enzyme through DI should be straightforward: array in, array out, single active argument. Do you have benchmarks or errors to show that DI is insufficient here? Not in JuMP, in this very repo.

This lets users get the best of both worlds, performance and compatibility when available, and general support for AD packages, as well as all the nice things like sparsity that DI provides.

Fair enough, so how would you handle AutoSparse{AutoEnzyme} with Enzyme's native API? Because that's a big part of what LinearSolve needs, and DI gives you that for free.

I'm all up for fixing issues as they arise, but also in the case where we have a backend which is more performant and stable, we should use it and not force the burden of debugging our packages on users, when unnecessary.

Except that in doing so, we force another burden on the users: maintenance of tens of copies of essentially identical Enzyme extensions. My hope is that, if we keep putting our minds together, we could just write this extension once in DI and be good enough for most use cases.

@adrhill

adrhill commented Oct 8, 2024

I think the last bit addressing the "maintenance of tens of copies of essentially identical Enzyme extensions" hits the nail on the head. If maintainers had unlimited time, adding individual package extensions would be great. The appeal of DI is the centralization of that maintenance burden into a single package. We keep up with any breaking changes in any backends for package developers, freeing up dev time.

Adding specialized package extensions on top of DI is probably a valuable approach for performance-critical hot loops. Individual package developers will have to decide whether the performance delta gained is worth the increased maintenance burden. Here, Avik and Chris will have to make that call.

At the end of the day, to advance the Julia AD ecosystem, that maintenance burden should be centralized as much as possible. DI made big steps toward Enzyme with the new argument activities.

@wsmoses
Author

wsmoses commented Oct 8, 2024

In other cases, where a package dispatches to an autodiff backend, it makes sense to call Enzyme directly.

My main message is that this doesn't always make sense. There is a convenience tradeoff between (1) using DI for everything and (2) using DI for everything except Enzyme + using Enzyme's native interface. And of course option (2) seems easy to you because you know Enzyme inside and out, but that's not the case for most users, even power users from SciML.

Oh for sure, but my argument here isn't that users ought to use Enzyme directly (though that also may be wise/helpful), but that libraries that already hardcode a bunch of autodiff support should. Such libraries already have a bunch of code for various AD tools, so there is already precedent for having such code around (and of course users all benefit without writing additional code).

I guess I bucket library devs and end users in two separate categories.

Still, whenever you say "DI is problematic because it can segfault on Enzyme", it also implies "Enzyme itself is problematic because it can segfault". Indeed, the pure Enzyme MWE for splatting shows that this type instability is not handled gracefully, and just crashes the console. Sure, Enzyme provides a multi-argument way around this bug (which is inaccessible through DI's single-argument interface), but it remains an Enzyme bug because it doesn't happen with other backends.

Yeah for sure, but also if the use of DI makes it more likely to hit Enzyme issues, it's natural to just use Enzyme directly, no?

On the other hand, DI also allows users to experiment with Enzyme at virtually no cost. Until LinearSolve switched to DI, it had no way of using Enzyme for its internal Jacobians. Now it's as simple as a backend argument switch, because the necessary bindings are already in DI.

Oh definitely, and that's one of the most significant advantages of DI (both swapping out backends, and also the sparse support). I'm only suggesting here that NonlinearSolve.jl add some special cases when something is using a dense Enzyme backend, to improve performance/compatibility.

But then what's the point of going through all the trouble of supporting Enzyme in DI, if you're just gonna go around telling people not to use it that way? LinearSolve is a classic example of a case where using Enzyme through DI should be straightforward: array in, array out, single active argument. Do you have benchmarks or errors to show that DI is insufficient here? Not in JuMP, in this very repo.

I think you're conflating LinearSolve and NonlinearSolve. Like I said above, "In some cases, like LinearSolve.jl, that's an extension to add EnzymeRules to simplify the code being differentiated (also leading to better performance)"; some repos need an Enzyme extension for a custom rule (https://github.com/SciML/LinearSolve.jl/blob/main/ext/LinearSolveEnzymeExt.jl). This was needed to get some things differentiating at the time (which now might work without it). My guess is that it's like 2-5x (and possibly more with threading on) with the Enzyme extension?

Fair enough, so how would you handle AutoSparse{AutoEnzyme} with Enzyme's native API? Because that's a big part of what LinearSolve needs, and DI gives you that for free.

Yeah I wouldn't special case this, just dense. Sparse can always call DI (unless say a future sparse backend wants to specialize).
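A rough sketch of that split, with a hypothetical my_jacobian entry point (whether the sparse path runs as written depends on the sparsity detector and coloring algorithm configured in AutoSparse):

using ADTypes: AbstractADType, AutoEnzyme, AutoSparse
import DifferentiationInterface as DI
using Enzyme: Enzyme

# Dense AutoEnzyme: specialize on Enzyme's native (forward-mode) Jacobian.
my_jacobian(f, ::AutoEnzyme, x) = Enzyme.jacobian(Enzyme.Forward, f, x)[1]

# Everything else, including AutoSparse{<:AutoEnzyme}, delegates to DI,
# which handles sparsity detection, coloring and decompression.
my_jacobian(f, backend::AbstractADType, x) = DI.jacobian(f, backend, x)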

@gdalle
Collaborator

gdalle commented Oct 8, 2024

I guess I bucket library devs and end users in two separate categories.

Yeah that might be the main difference between our mindsets, because I see library devs as users of AD systems ;) But I get where you come from.

Yeah for sure, but also if the use of DI makes it more likely to hit Enzyme issues, it's natural to just use Enzyme directly, no?

All I'm saying is that this needs to be decided on a case by case basis and with examples, to justify the cost of additional implementations living side by side.

I think you're conflating LinearSolve and NonlinearSolve. Like I said above, "In some cases, like LinearSolve.jl, that's an extension to add EnzymeRules to simplify the code being differentiated (also leading to better performance)", some repos need an Enzyme extension for a custom rule

Sorry yes I meant NonlinearSolve in my remark. Of course extensions for rule systems are still warranted and essential, like those in LinearSolve. This discussion was solely focused on extensions for calling into AD, not for making stuff AD-compatible.

@wsmoses
Author

wsmoses commented Oct 8, 2024

Yeah for sure, but also if the use of DI makes it more likely to hit Enzyme issues, it's natural to just use Enzyme directly, no?

All I'm saying is that this needs to be decided on a case by case basis and with examples, to justify the cost of additional implementations living side by side.

Sure, but at this point, having seen segfaults in the wild for some tiny example code (in addition to the performance issues discussed on Slack), at least my default is that there should be a separate Enzyme backend unless DI can be shown to be @code_typed-equivalent to relevant Enzyme calls.

In other words, if the function DI passes to Enzyme's autodiff utilities is equivalent to the original user function (and not wrapped in a closure, unwrapping, etc. that could dramatically change how Julia compiles the code, add type instabilities, or drop alias info), then defaulting to no Enzyme extension is reasonable; if there is indirection, it should default to using Enzyme directly (since that indirection, like above, can frequently be the source of Enzyme failing to differentiate a function).

The reason I'd push this is because Enzyme has a defined scope of Julia code that it works on. Adding indirection can cause Julia to compile code outside of that scope, causing crashes (like above), in addition to all the perf issues. I agree in the long term it would be nice for Enzyme to handle all code always, but otherwise that's equivalent to asking for significantly more feature dev for something which is already supported natively.

Our goal was explicitly to start with a small but well-defined set of code (e.g. originally just code you could use in a @cuda kernel from GPUCompiler), and do exceptionally well on it. This lets us organically grow an ecosystem of users who can use Enzyme without being stuck trying to "boil the ocean" before anything gets done. We've been growing that scope with time, but again, if something is likely to cause code to move outside of it, I'd recommend someone use the version which works (and open an issue).

@gdalle
Collaborator

gdalle commented Oct 8, 2024

my default is that there should be a separate Enzyme backend unless DI can be shown to be @code_typed-equivalent to relevant Enzyme calls.

Let me rephrase that the way I see it:

"Given infinite developer, money and time resources, provided everyone is perfectly at ease with autodiff in general and Enzyme in particular, and assuming that optimal performance in every single case is worth more than convenience, conciseness and maintainability, then there should be a separate Enzyme backend unless DI can be shown to be @code_typed-equivalent to relevant Enzyme calls."

But in the current situation of the Julia ecosystem, I think this is a completely unreasonable request, and it would be essentially equivalent to

const DifferentiationInterface = Enzyme

Most users and package devs don't even need a fraction of what you're demanding for DI + Enzyme to be useful.


I'm going to stop the conversation here because it makes me sad and I admire your work too much to keep fighting. In any case, it's up to package developers to decide what they want to do with the tools that we offer.

@wsmoses
Author

wsmoses commented Oct 8, 2024

Most users and package devs don't even need a fraction of what you're demanding for DI + Enzyme to be useful.

But that's exactly my point!

I'm just trying to propose a threshold that would indicate where an extension would be useful or not, and catch the relevant usage and performance bugs.

To be clear, my biggest concern here is not performance -- but code which fails with DI but works with Enzyme directly (performance is good too but not crashing should be the first priority).

My argument is that if extra indirection is added, it probably presents issues (and should be catchable). But in most cases that indirection isn't there.

So let's see how it works as a rule of thumb on some samples (again, not sure if it's the best rule of thumb, but it's what I could think of just now):

using ADTypes
import DifferentiationInterface as DI

# Note: I modified Enzyme.autodiff with @noinline so we could see if and how the call occurred (without it getting inlined into an llvmcall, at which point the info is lost)
# Specifically
#  @noinline function autodiff(
#     rmode::ReverseMode{ReturnPrimal,RuntimeActivity,RABI,Holomorphic,ErrIfFuncWritten},
#     f::FA,
#     ::Type{A},
#     args::Vararg{Annotation,Nargs},
#
# and 
# 
# @noinline function autodiff(
#     ::ForwardMode{ReturnPrimal,RABI,ErrIfFuncWritten,RuntimeActivity},
#     f::FA,
#     ::Type{A},
#     args::Vararg{Annotation,Nargs},

using Enzyme: Enzyme
using LinearAlgebra: dot  # `mul` below calls `dot`

x = [2.0, 3.0]

a = [3.0, 4.0]

function mul(a, b)
    dot(a, b)
end

function grad(f::F, a, x) where F
    DI.gradient(f, AutoEnzyme(), x, DI.Constant(a))
end

@code_typed grad(mul, a, x)

julia> @code_typed grad(mul, a, x)
# 
# CodeInfo(
# 1 ── %1  = Base.arraysize(x, 1)::Int64
# │    %2  = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Vector{Float64}, svec(Any, Int64), 0, :(:ccall), Vector{Float64}, :(%1), :(%1)))::Vector{Float64}
# │    %3  = Base.arraysize(%2, 1)::Int64
# │    %4  = Base.slt_int(%3, 0)::Bool
# │    %5  = Core.ifelse(%4, 0, %3)::Int64
# │    %6  = Base.slt_int(%5, 1)::Bool
# └───       goto #3 if not %6
# 2 ──       goto #4
# 3 ──       goto #4
# 4 ┄─ %10 = φ (#2 => true, #3 => false)::Bool
# │    %11 = φ (#3 => 1)::Int64
# │    %12 = φ (#3 => 1)::Int64
# │    %13 = Base.not_int(%10)::Bool
# └───       goto #10 if not %13
# 5 ┄─ %15 = φ (#4 => %11, #9 => %23)::Int64
# │    %16 = φ (#4 => %12, #9 => %24)::Int64
# │          Base.arrayset(false, %2, 0.0, %15)::Vector{Float64}
# │    %18 = (%16 === %5)::Bool
# └───       goto #7 if not %18
# 6 ──       goto #8
# 7 ── %21 = Base.add_int(%16, 1)::Int64
# └───       goto #8
# 8 ┄─ %23 = φ (#7 => %21)::Int64
# │    %24 = φ (#7 => %21)::Int64
# │    %25 = φ (#6 => true, #7 => false)::Bool
# │    %26 = Base.not_int(%25)::Bool
# └───       goto #10 if not %26
# 9 ──       goto #5
# 10 ┄       goto #11
# 11 ─       goto #12
# 12 ─       goto #13
# 13 ─ %32 = %new(Duplicated{Vector{Float64}}, x, %2)::Duplicated{Vector{Float64}}
# │    %33 = %new(Const{Vector{Float64}}, a)::Const{Vector{Float64}}
# │          invoke Enzyme.autodiff($(QuoteNode(ReverseMode{false, false, FFIABI, false, true}()))::ReverseMode{false, false, FFIABI, false, true}, $(QuoteNode(Const{typeof(mul)}(mul)))::Const{typeof(mul)}, Active::Type{Active}, %32::Duplicated{Vector{Float64}}, %33::Const{Vector{Float64}})::Tuple{Tuple{Nothing, Nothing}}
# └───       goto #14
# 14 ─       return %2
# ) => Vector{Float64}

We can see that mul was directly sent to an Enzyme call, so this case passes our rule of thumb.

Let's look at the earlier multi-argument case (which was fixed above via DI's Constant support), to confirm that its absence (using Base.Fix1 instead) would trigger the rule of thumb.

function grad2(mul, a, x)
    DI.gradient(Base.Fix1(mul, a), AutoEnzyme(), x)
end

julia> @code_typed grad2(mul, a, x)
# CodeInfo(
# 1 ── %1  = %new(Base.Fix1{typeof(mul), Vector{Float64}}, mul, a)::Base.Fix1{typeof(mul), Vector{Float64}}
# │    %2  = Base.arraysize(x, 1)::Int64
# │    %3  = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Vector{Float64}, svec(Any, Int64), 0, :(:ccall), Vector{Float64}, :(%2), :(%2)))::Vector{Float64}
# │    %4  = Base.arraysize(%3, 1)::Int64
# │    %5  = Base.slt_int(%4, 0)::Bool
# │    %6  = Core.ifelse(%5, 0, %4)::Int64
# │    %7  = Base.slt_int(%6, 1)::Bool
# └───       goto #3 if not %7
# 2 ──       goto #4
# 3 ──       goto #4
# 4 ┄─ %11 = φ (#2 => true, #3 => false)::Bool
# │    %12 = φ (#3 => 1)::Int64
# │    %13 = φ (#3 => 1)::Int64
# │    %14 = Base.not_int(%11)::Bool
# └───       goto #10 if not %14
# 5 ┄─ %16 = φ (#4 => %12, #9 => %24)::Int64
# │    %17 = φ (#4 => %13, #9 => %25)::Int64
# │          Base.arrayset(false, %3, 0.0, %16)::Vector{Float64}
# │    %19 = (%17 === %6)::Bool
# └───       goto #7 if not %19
# 6 ──       goto #8
# 7 ── %22 = Base.add_int(%17, 1)::Int64
# └───       goto #8
# 8 ┄─ %24 = φ (#7 => %22)::Int64
# │    %25 = φ (#7 => %22)::Int64
# │    %26 = φ (#6 => true, #7 => false)::Bool
# │    %27 = Base.not_int(%26)::Bool
# └───       goto #10 if not %27
# 9 ──       goto #5
# 10 ┄       goto #11
# 11 ─       goto #12
# 12 ─       goto #13
# 13 ─ %33 = %new(Duplicated{Vector{Float64}}, x, %3)::Duplicated{Vector{Float64}}
# │    %34 = %new(Const{Base.Fix1{typeof(mul), Vector{Float64}}}, %1)::Const{Base.Fix1{typeof(mul), Vector{Float64}}}
# │          invoke Enzyme.autodiff($(QuoteNode(ReverseMode{false, false, FFIABI, false, true}()))::ReverseMode{false, false, FFIABI, false, true}, %34::Const{Base.Fix1{typeof(mul), Vector{Float64}}}, Active::Type{Active}, %33::Duplicated{Vector{Float64}})::Tuple{Tuple{Nothing}}
# └───       goto #14
# 14 ─       return %3
# ) => Vector{Float64}
# 

We can see that this doesn't directly forward mul into the call, instead passing in Base.Fix1{typeof(mul), Vector{Float64}}, which might cause problems under our rule of thumb. Thus, libraries that require Fix1/etc. to use DI probably need an Enzyme Ext.

Let's try the Hessian case that's causing issues:

f(x::T...) where {T} = (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2
f_nosplat(x::AbstractVector) = (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2

function h1(x)
    backend = AutoEnzyme()
    DI.hessian(f_nosplat, backend, x)  # works
end

function h2(x)
    backend = AutoEnzyme()
    DI.hessian(splat(f), backend, x)  # segfaults
end

julia> @code_typed h1(x)
CodeInfo(
1 ─ %1 = invoke DifferentiationInterface._prepare_hessian_aux($(QuoteNode(Val{16}()))::Val{16}, f_nosplat::typeof(f_nosplat), $(QuoteNode(AutoEnzyme()))::AutoEnzyme{Nothing, Nothing}, x::Vector{Float64})::DifferentiationInterface.HVPGradientHessianPrep{16, NTuple{16, Vector{Float64}}, NTuple{16, Vector{Float64}}, DifferentiationInterface.ForwardOverReverseHVPPrep{DifferentiationInterface.var"#inner_gradient#46"{typeof(f_nosplat), AutoEnzyme{Nothing, Nothing}, DifferentiationInterface.Rewrap{0, Tuple{}}}, DifferentiationInterface.NoPushforwardPrep}, DifferentiationInterfaceEnzymeExt.EnzymeGradientPrep{Vector{Float64}}}
│   %2 = invoke DifferentiationInterface.hessian(f_nosplat::typeof(f_nosplat), %1::DifferentiationInterface.HVPGradientHessianPrep{16, NTuple{16, Vector{Float64}}, NTuple{16, Vector{Float64}}, DifferentiationInterface.ForwardOverReverseHVPPrep{DifferentiationInterface.var"#inner_gradient#46"{typeof(f_nosplat), AutoEnzyme{Nothing, Nothing}, DifferentiationInterface.Rewrap{0, Tuple{}}}, DifferentiationInterface.NoPushforwardPrep}, DifferentiationInterfaceEnzymeExt.EnzymeGradientPrep{Vector{Float64}}}, $(QuoteNode(AutoEnzyme()))::AutoEnzyme{Nothing, Nothing}, x::Vector{Float64})::Matrix{Float64}
└──      return %2
) => Matrix{Float64}

Okay, this isn't terribly helpful, so let's recurse a bit into the inner calls:

function h1(f::F, x, dx) where F
    @inline DI.hvp(f, AutoEnzyme(), x, dx)
end

@code_typed h1(f_nosplat, x, (x,))

CodeInfo(
1 ── %1  = Base.getfield(dx, 1, true)::Vector{Float64}
│    %2  = %new(Duplicated{Vector{Float64}}, x, %1)::Duplicated{Vector{Float64}}
│    %3  = invoke Enzyme.autodiff($(QuoteNode(ForwardMode{false, FFIABI, true, false}()))::ForwardMode{false, FFIABI, true, false}, $(QuoteNode(Const{DifferentiationInterface.var"#inner_gradient#46"{typeof(f_nosplat), AutoEnzyme{Nothing, Nothing}, DifferentiationInterface.Rewrap{0, Tuple{}}}}(DifferentiationInterface.var"#inner_gradient#46"{typeof(f_nosplat), AutoEnzyme{Nothing, Nothing}, DifferentiationInterface.Rewrap{0, Tuple{}}}(f_nosplat, AutoEnzyme(), DifferentiationInterface.Rewrap{0, Tuple{}}(())))))::Const{DifferentiationInterface.var"#inner_gradient#46"{typeof(f_nosplat), AutoEnzyme{Nothing, Nothing}, DifferentiationInterface.Rewrap{0, Tuple{}}}}, Duplicated{Vector{Float64}}::Type{Duplicated{Vector{Float64}}}, %2::Duplicated{Vector{Float64}})::@NamedTuple{1::Vector{Float64}}
└───       goto #3
2 ──       nothing::Nothing
3 ┄─ %6  = Base.getfield(%3, 1)::Vector{Float64}
└───       goto #4
4 ──       goto #5
5 ──       goto #6
6 ──       goto #7
7 ── %11 = Core.tuple(%6)::Tuple{Vector{Float64}}
└───       goto #8
8 ──       goto #9
9 ──       goto #10
10 ─       return %11
) => Tuple{Vector{Float64}}

This indicates that the wrapper of DI.inner_gradient could be an issue, meriting further investigation.

It looks like this is a closure capturing the function argument, which actually might cause problems for user code. Let's reimplement from DI: https://github.com/gdalle/DifferentiationInterface.jl/blob/efb5acf1f3df51cd5f85edc7fb693703a4cd5bf0/DifferentiationInterface/src/second_order/hvp.jl#L91

In particular, let's recurse a bit more to debug and try to fix.

# copying from DI
#     function inner_gradient(_x, unannotated_contexts...)
#        annotated_contexts = rewrap(unannotated_contexts...)
#        return gradient(f, nested(inner(backend)), _x, annotated_contexts...)
#     end


function h1(f::F, x, dx) where F
    prep = @inline DI.prepare_hvp(f, AutoEnzyme(), x, (x,))
    @inline DI.hvp(f, prep, AutoEnzyme(), x, dx)
end

@code_typed h1(f_nosplat, x, (x,))

# ...
# much more inlining and debugging later
# 

function my_inner_gradient(_x, f::F) where F
    return DI.gradient(f, AutoEnzyme(), _x)
end

function my_hvp(f::F, _x, dx) where F
    return (DI.pushforward(my_inner_gradient, DI.NoPushforwardPrep(), AutoEnzyme(), _x, dx, DI.Constant(f)),)
end

function my_h1(x, dx)
    my_hvp(f_nosplat, x, dx)
end
@code_typed my_h1(x, (x,))

# CodeInfo(
# 1 ─ %1  = Base.getfield(dx, 1, true)::Vector{Float64}
# │   %2  = %new(Duplicated{Vector{Float64}}, x, %1)::Duplicated{Vector{Float64}}
# │   %3  = invoke Enzyme.autodiff($(QuoteNode(ForwardMode{false, FFIABI, true, false}()))::ForwardMode{false, FFIABI, true, false}, $(QuoteNode(Const{typeof(my_inner_gradient)}(my_inner_gradient)))::Const{typeof(my_inner_gradient)}, Duplicated{Vector{Float64}}::Type{Duplicated{Vector{Float64}}}, %2::Duplicated{Vector{Float64}}, $(QuoteNode(Const{typeof(f_nosplat)}(f_nosplat)))::Const{typeof(f_nosplat)})::@NamedTuple{1::Vector{Float64}}
# └──       goto #3
# 2 ─       nothing::Nothing
# 3 ┄ %6  = Base.getfield(%3, 1)::Vector{Float64}
# └──       goto #4
# 4 ─       goto #5
# 5 ─       goto #6
# 6 ─       goto #7
# 7 ─ %11 = Core.tuple(%6)::Tuple{Vector{Float64}}
# └──       goto #8
# 8 ─ %13 = Core.tuple(%11)::Tuple{Tuple{Vector{Float64}}}
# └──       goto #9
# 9 ─       return %13
# ) => Tuple{Tuple{Vector{Float64}}}

Great, with our fix it looks like we're directly passing in functions without creating closures/indirection! This is probably good to go! (I opened an issue on DI to fix this: JuliaDiff/DifferentiationInterface.jl#555)

Let's check out my_h2

function my_h2(x, dx)
    my_hvp(splat(f), x, dx)
end
@code_typed my_h2(x, (x,))
# julia> @code_typed my_h2(x, (x,))
# CodeInfo(
# 1 ─ %1  = Base.getfield(dx, 1, true)::Vector{Float64}
# │   %2  = %new(Duplicated{Vector{Float64}}, x, %1)::Duplicated{Vector{Float64}}
# │   %3  = invoke Enzyme.autodiff($(QuoteNode(ForwardMode{false, FFIABI, true, false}()))::ForwardMode{false, FFIABI, true, false}, $(QuoteNode(Const{typeof(my_inner_gradient)}(my_inner_gradient)))::Const{typeof(my_inner_gradient)}, Duplicated{Vector{Float64}}::Type{Duplicated{Vector{Float64}}}, %2::Duplicated{Vector{Float64}}, $(QuoteNode(Const{Base.Splat{typeof(f)}}(splat(f))))::Const{Base.Splat{typeof(f)}})::@NamedTuple{1::Vector{Float64}}
# └──       goto #3
# 2 ─       nothing::Nothing
# 3 ┄ %6  = Base.getfield(%3, 1)::Vector{Float64}
# └──       goto #4
# 4 ─       goto #5
# 5 ─       goto #6
# 6 ─       goto #7
# 7 ─ %11 = Core.tuple(%6)::Tuple{Vector{Float64}}
# └──       goto #8
# 8 ─ %13 = Core.tuple(%11)::Tuple{Tuple{Vector{Float64}}}
# └──       goto #9
# 9 ─       return %13
# ) => Tuple{Tuple{Vector{Float64}}}
# 

Yeah, so it looks like we're sending a closure Const{Base.Splat{typeof(f)}}(splat(f)) to autodiff, so this could (and does) cause more problems.

@gdalle
Collaborator

gdalle commented Oct 8, 2024

Thank you for this analysis, it will be helpful to improve DI!

@wsmoses
Author

wsmoses commented Oct 8, 2024

To be clear, this is the @code_typed analysis I mentioned above as the back-of-the-envelope threshold for whether an Enzyme Ext makes sense in addition to DI.

Obviously there will be mitigating factors in either direction (code re-use, extreme performance/compat), but this is what I'm suggesting be done, in a way that generalizes to the types of arguments/functions intended for use. If it passes the smoke test, an Enzyme Ext probably isn't necessary; if not, one probably is.

However, as seen in the past few days, even seemingly simple use cases can fail this (including DI internally, shown above, not needing the closure). Thus my personal default is that it probably is necessary to have an extension until this analysis is performed (which would probably pass for most users, who have simpler cases).

I am in no way trying to undermine DI. I am, however, trying to keep users from having broken (first priority) or slow (second priority) code with the default setup.

@gdalle
Collaborator

gdalle commented Oct 8, 2024

To be clear, this is the @code_typed analysis I mentioned above as the back-of-the-envelope threshold for whether an Enzyme Ext makes sense in addition to DI.

I had interpreted your requirement as "DI should generate the exact same @code_typed as Enzyme to the semicolon", which will obviously be impossible. But if it's only about passing the right function and argument annotations to the Enzyme call without a closure, that seems much more reasonable indeed.
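To make the distinction concrete with a toy function (not DI's internals): the closure bakes the extra argument into the function object Enzyme sees, while the multi-argument call passes the original function plus a Const annotation.

using Enzyme: Enzyme, Reverse, Const, Active, Duplicated
using LinearAlgebra: dot

g(x, a) = dot(a, x)
x = [2.0, 3.0]; a = [3.0, 4.0]; dx = zero(x)

# Closure route: Enzyme differentiates the wrapper `Base.Fix2(g, a)`, not `g` itself.
Enzyme.autodiff(Reverse, Const(Base.Fix2(g, a)), Active, Duplicated(x, dx))

# Multi-argument route: `g` is passed directly, with `a` annotated as Const.
fill!(dx, 0.0)
Enzyme.autodiff(Reverse, Const(g), Active, Duplicated(x, dx), Const(a))

Both calls accumulate the same gradient into dx; the difference is only in which function object reaches the Enzyme call.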

I am in no way trying to undermine DI. I am, however, trying to keep users from having broken (first priority) or slow (second priority) code with the default setup.

Thank you for clarifying. I think in the end we are mostly in agreement, except on the default recommendation:

  • Mine is to start with DI, and if Enzyme problems are detected, add an Enzyme extension.
  • Yours is to start with an Enzyme extension, and if the comparison with DI is good enough, remove it afterwards.

Both are understandable from our respective standpoints; I think they will simply appeal to different categories of devs, and that's okay.

@gdalle
Collaborator

gdalle commented Oct 8, 2024

And I really appreciate you taking the time to dive into DI internals and debug, the same way I try to do it for Enzyme. I do think we can end up with something pretty decent with some more elbow grease.

@oscardssmith
Contributor

oscardssmith commented Oct 8, 2024

Thus my personal default is that it probably is necessary to have an extension until this analysis is performed

imo this is exactly backwards. we should never do extra work to implement a feature that adds code complexity and maintenance burden without direct evidence that it's needed.

Edit: and any PR made as a result of such analysis should have a @test_broken for the DI version so that we know when we should revert it.
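A minimal sketch of such a guard test, using the view-input Jacobian case that comes up later in this thread (it errors on the DI/Enzyme versions discussed here); if DI starts handling it, the test reports an unexpected pass and flags the extension for removal:

using Test
using ADTypes: AutoEnzyme
import DifferentiationInterface as DI
using DifferentiationInterface: Constant
import Enzyme

fres(u, p) = u .* u .- 2            # residual-style function, as in the benchmarks below
u = ones(4); u0 = view(u, 2:3); p = [1.0, 2.0]

@testset "DI + AutoEnzyme guard" begin
    # Broken today; an unexpected pass tells us the DI path is ready again.
    @test_broken DI.jacobian(fres, AutoEnzyme(), u0, Constant(p)) ≈ [2.0 0.0; 0.0 2.0]
end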

@ChrisRackauckas
Member

Any discussion that is too general is hard to be correct on. Context matters. NonlinearSolve uses Jacobians for the core Newton methods, and jvp/vjp operations within some line search algorithms. For this context, DI looks pretty good. In fact, that should be DI's bread and butter operation really, and it seems to do rather well already. If it's within 10% of optimal, then I don't see why we wouldn't drop the maintenance burden. We of course need to hold DI accountable for keeping up its performance and continually improving it, but in this context it looks reasonable.

That doesn't mean it's reasonable everywhere. Fully supporting Optimization.jl and SciMLSensitivity.jl will take a lot more work in DI; those libraries demand different things. But pointing out that the hvp has overhead in DI, for example, has no bearing on this library, since it's not used here. The moment DI is ready for a given library, we should use it, consolidate the code, and reduce the maintenance. It won't be ready for all of them at the same time.

@wsmoses
Author

wsmoses commented Oct 9, 2024

I totally agree that context matters, and I opened this issue to suggest that it's worth doing the investigation here, given a recent case that changed my impression of the relative stability.

The specific context which made me think it was worth revisiting places like this repo to investigate a potential extension was an issue on JuMP where using DI caused a segfault while using Enzyme directly ran fine. This was for the intro docs example, so if DI causes a crash for simple functions, it gives me a bit of pause that merits investigation (and the rule of thumb I mention above, "does it add wrappers", hopefully serves as a good way to judge whether DI will introduce overhead that increases that likelihood).

using Enzyme, DifferentiationInterface

f(x::T...) where {T} = (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2

function ∇²f(x::Vararg{T,N}) where {T,N}
    H = Matrix{T}(undef, N, N)
    direction(i) = ntuple(j -> Enzyme.Active(T(i == j)), N)
    ∇f(x...) = Enzyme.autodiff(Enzyme.Reverse, f, Enzyme.Active, x...)[1]
    hess = Enzyme.autodiff(
        Enzyme.Forward,
        ∇f,
        Enzyme.BatchDuplicated.(Enzyme.Active.(x), ntuple(direction, N))...,
    )[1]
    for j in 1:N, i in 1:j
        H[j, i] = hess[j][i]
    end
    return H
end

x = [2.0, 3.0]

∇²f(x...) # returns hessian

backend = AutoEnzyme()
DifferentiationInterface.hessian(splat(f), backend, x)  # segfaults

While I personally think the additional risk of a crash is more critical than a performance gap, it's easier to do the performance test, so I took the example from the README in this repo; the performance gap is ~10x (though @gdalle let me know if something should be changed in the setup). It's also probably worth doing the wrapper analysis as well to determine stability.

using DifferentiationInterface, ADTypes, Enzyme, BenchmarkTools, StaticArrays

f(u, p) = u .* u .- 2

u0 = @SVector[1.0, 1.0]

p = [1.0, 2.0]

prep = DifferentiationInterface.prepare_jacobian(f, AutoEnzyme(), u0, Constant(p))

@btime DifferentiationInterface.jacobian($f, $prep, AutoEnzyme(), $u0, Constant($p))
#   265.006 ns (4 allocations: 592 bytes)
# 2×2 Matrix{Float64}:
#  2.0  0.0
#  0.0  2.0

function idx(f::F, x, p, i) where F
	return f(x, p)[i]
end

function jac1(f::F, x, p) where F
	dx = Enzyme.autodiff(Reverse, idx, Const(f), Active(x), Const(p), Const(1))[1][2]
	dx2 = Enzyme.autodiff(Reverse, idx, Const(f), Active(x), Const(p), Const(2))[1][2]
	return hcat(dx, dx2)
end

@btime jac1($f, $u0, $p)
#  17.817 ns (0 allocations: 0 bytes)
# 2×2 SMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
# 2.0  0.0
# 0.0  2.0

@gdalle
Collaborator

gdalle commented Oct 9, 2024

First of all, thank you @wsmoses for focusing on a concrete benchmark instead of generalities. It helped me understand some things and even fix a few bugs. Therefore, the code below is run on the branch from JuliaDiff/DifferentiationInterface.jl#557 (I'll explain the contents of that PR in a minute).

I won't go back to the JuMP example because the main culprit here is the obligation to splat arguments into the function, even if there are thousands of them. DI's behavior here may or may not be improved by a solution to JuliaDiff/DifferentiationInterface.jl#555, we'll see. But let me discuss your second example in more detail.

Warning

I just remembered that AutoEnzyme() computes Jacobians in forward mode by default in DI, so take the comparison with a grain of salt. I'll edit the message to add reverse mode.

DI's design principles

To understand DI's behavior on this example, it helps to list a few rules of thumb that I used when writing the package:

  1. The differentiated functions are costly
  2. The arrays involved can be high-dimensional
  3. Allocations are expensive so we should work in-place whenever possible

Obviously none of that applies here, which explains a big part of the bad performance. Now onto the nitty gritty details.

Jacobians without cheating

Setup code
using ADTypes
using BenchmarkTools
import DifferentiationInterface as DI
using DifferentiationInterface: Constant
using Enzyme: Enzyme, Active, Const, Reverse
using StaticArrays
import ForwardDiff

f(u0, p) = u0 .* u0 .- 2
u0 = @SVector[1.0, 1.0]
p = [1.0, 2.0]

If you want people to compute Jacobians with Enzyme, the first thing they will try is Enzyme.jacobian. So I'll compare both native interfaces, which seems a bit fairer to me (let me know if I used Enzyme's wrong).

DI.jacobian with AutoEnzyme() (in forward mode by default) works as expected, even though it doesn't return the right matrix type (more on that below):

julia> DI.jacobian(f, AutoEnzyme(), u0, Constant(p))
2×2 Matrix{Float64}:
 2.0  0.0
 0.0  2.0

DI.jacobian with AutoEnzyme(; mode=Enzyme.Reverse) fails because of mutation:

julia> DI.jacobian(f, prep, AutoEnzyme(; mode=Enzyme.Reverse), u0, Constant(p))
ERROR: setindex!(::SVector{2, Float64}, value, ::Int) is not defined.
 Hint: Use `MArray` or `SizedArray` to create a mutable static array

Reverse Enzyme.jacobian fails because, unlike Enzyme.gradient, it doesn't accept multiple arguments:

julia> Enzyme.jacobian(Reverse, f, u0, Const(p))
ERROR: MethodError: no method matching jacobian(::EnzymeCore.ReverseMode{…}, ::typeof(f), ::SVector{…}, ::Const{…})

Closest candidates are:
  jacobian(::EnzymeCore.ReverseMode{ReturnPrimal, RuntimeActivity, RABI, Holomorphic, ErrIfFuncWritten}, ::F, ::X; n_outs, chunk) where {ReturnPrimal, F, X, RABI<:EnzymeCore.ABI, ErrIfFuncWritten, RuntimeActivity, Holomorphic}
   @ Enzyme ~/.julia/packages/Enzyme/Vjlrr/src/Enzyme.jl:2087
  jacobian(::EnzymeCore.ForwardMode, ::Any...; kwargs...)
   @ Enzyme ~/.julia/packages/Enzyme/Vjlrr/src/Enzyme.jl:2029

Reverse Enzyme.jacobian also fails without the additional p argument, again because of mutation:

julia> f_noparam(u0) = u0 .* u0 .- 2
f_noparam (generic function with 1 method)

julia> Enzyme.jacobian(Reverse, f_noparam, u0)
ERROR: setindex!(::SVector{2, Float64}, value, ::Int) is not defined.
 Hint: Use `MArray` or `SizedArray` to create a mutable static array
...

On the other hand, forward Enzyme.jacobian works perfectly without the additional p:

julia> Enzyme.jacobian(Forward, f_noparam, u0)[1]
2×2 SMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
 2.0  0.0
 0.0  2.0

Jacobians with cheating

To squeeze maximum performance out of Enzyme, you didn't use Enzyme.jacobian (you couldn't have anyway, see above), you used Enzyme.autodiff on each individual function output. For a function with m outputs, that would multiply the complexity of autodiff by m, because you would compute the full vector every time only to grab a single element. But sure, let's do it that way.

partial_output(f::F, u0, p, i) where {F} = f(u0, p)[i]
partial_output_shuffled(u0, f::F, p, i) where {F} = f(u0, p)[i]

function enzyme_jac_cheating(f::F, u0, p) where {F}
    du01 = Enzyme.autodiff(
        Reverse, partial_output, Const(f), Active(u0), Const(p), Const(1)
    )[1][2]
    du02 = Enzyme.autodiff(
        Reverse, partial_output, Const(f), Active(u0), Const(p), Const(2)
    )[1][2]
    return hcat(du01, du02)
end

function di_jac_cheating(f::F, u0, p) where {F}
    du01 = DI.gradient(
        partial_output_shuffled, AutoEnzyme(), u0, Constant(f), Constant(p), Constant(1)
    )
    du02 = DI.gradient(
        partial_output_shuffled, AutoEnzyme(), u0, Constant(f), Constant(p), Constant(2)
    )
    return hcat(du01, du02)
end

What do we get in the benchmark?

julia> @btime enzyme_jac_cheating($f, $u0, $p)
  17.600 ns (0 allocations: 0 bytes)
2×2 SMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
 2.0  0.0
 0.0  2.0

julia> @btime di_jac_cheating($f, $u0, $p)
  26.199 ns (2 allocations: 64 bytes)
2×2 SMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
 2.0  0.0
 0.0  2.0

I guess that's not too bad, although I'll have to investigate where the allocations come from.

Why is DI slow?

On the branch from JuliaDiff/DifferentiationInterface.jl#557 (where I modified the batch size to adapt to the array length), we get the following measurements.
Keep in mind that AutoEnzyme() is forward mode by default for Jacobians in DI.

julia> prep = DI.prepare_jacobian(f, AutoEnzyme(), u0, Constant(p));

julia> @btime DI.jacobian($f, $prep, AutoEnzyme(), $u0, Constant($p))
  132.796 ns (3 allocations: 256 bytes)
2×2 Matrix{Float64}:
 2.0  0.0
 0.0  2.0

As you notice, the return type is a Matrix, which is due to stack and its variants returning non-static arrays. You pointed this out yourself in JuliaArrays/StaticArrays.jl#1272 but it hasn't been addressed. Thus, the bulk of the runtime is spent in allocations. If we work in-place, we recover most of the lost performance:

julia> J = @MMatrix(zeros(2, 2))
2×2 MMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
 0.0  0.0
 0.0  0.0

julia> @btime DI.jacobian!($f, $J, $prep, AutoEnzyme(), $u0, Constant($p))
  34.976 ns (0 allocations: 0 bytes)
2×2 MMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
 2.0  0.0
 0.0  2.0

In this case, ForwardDiff remains our safest bet by far, and it is also accessible from DI:

julia> @btime DI.jacobian($f, AutoForwardDiff(; chunksize=2), $u0, Constant($p))
  2.464 ns (0 allocations: 0 bytes)
2×2 SMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
 2.0  0.0
 0.0  2.0

Why is DI wrong?

These experiments also made me realize some issues with how DI handles SArrays, specifically with Enzyme. I always assumed that the tangents inside Duplicated would be modified in-place, which is not the case for SArray. I fixed the gradient method in JuliaDiff/DifferentiationInterface.jl#557 but I opened JuliaDiff/DifferentiationInterface.jl#558 to keep track of the need for more tests.
I have also opened EnzymeAD/Enzyme.jl#1948 to discuss automatic choice of activity.
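As a small illustration of the SArray point (mirroring the indexed autodiff calls used earlier in this thread): for an immutable SVector there is no shadow to mutate in place, so the derivative has to come from Enzyme's return value via an Active annotation rather than from a Duplicated tangent.

using Enzyme: Enzyme, Reverse, Active
using StaticArrays

h(x) = sum(abs2, x)
xs = @SVector [1.0, 2.0]

# The gradient comes back as a fresh SVector; nothing is written in place.
grad = Enzyme.autodiff(Reverse, h, Active, Active(xs))[1][1]   # SVector(2.0, 4.0)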

@wsmoses
Author

wsmoses commented Oct 9, 2024

Jacobians without cheating

I don't think that calling autodiff directly is cheating... Enzyme.autodiff is the standard interface and we explicitly warn folks about the sugar functions being more limited (though I do agree we should still add this support here). I agree it's more natural for end users to try them first, but for a package extension trying to maximize performance and compatibility, using the core function is fine imo.

from https://enzyme.mit.edu/index.fcgi/julia/stable/#Gradient-Convenience-functions:

While the convenience functions discussed below use autodiff internally, they are generally more limited in their functionality. Beyond that, these convenience functions may also come with performance penalties; especially if one makes a closure of a multi-argument function instead of calling the appropriate multi-argument autodiff function directly.

In any case, I think your experiment supports my suggestion that there be an Enzyme extension here. I don't mind if that extension is implemented with DI.gradient instead of Enzyme.autodiff as above (though the gap there is still a ~50% slowdown), but the point is that some special case handling here is required for the Enzyme ADType.

@gdalle
Collaborator

gdalle commented Oct 9, 2024

some special case handling here is required for the Enzyme ADType.

But this special handling does not scale well at all if the function is costly or has high-dimensional outputs, can we at least agree on that? Computing the whole vector only to throw most of it away seems very strange to me. Unless Enzyme is able to somehow optimize that away?

function idx(f::F, x, p, i) where F
	return f(x, p)[i]
end

@wsmoses
Author

wsmoses commented Oct 9, 2024

Often yes it can, but it depends on the program.

And sure, then in your special-case handler, if you're worried about it, you can use the indexed algorithm for low dimensions and something else for high dimensions. I just wrote it that way because it was the easier setup, but you could totally set something up nicely with batch mode to get similar behavior.

@wsmoses
Author

wsmoses commented Oct 9, 2024

Since I didn't realize DI does forward mode for Jacobians by default, you get something a lot more concise as follows. If you want to truly speed race, here are a couple of ways to tune it that put it on par with ForwardDiff.

julia> @btime hcat(Enzyme.autodiff(Forward, $f, $(BatchDuplicated(u0, Enzyme.onehot(u0))), $(Const(p)))[1]...)
  3.740 ns (0 allocations: 0 bytes)
2×2 SMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
 2.0  0.0
 0.0  2.0

julia> @btime hcat(Enzyme.autodiff(Enzyme.set_abi(Forward,InlineABI), $f, $(BatchDuplicated(u0, Enzyme.onehot(u0))), $(Const(p)))[1]...)
  2.340 ns (0 allocations: 0 bytes)
2×2 SMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
 2.0  0.0
 0.0  2.0
 
 julia> @btime DifferentiationInterface.jacobian($f, AutoForwardDiff(; chunksize=2), $u0, Constant($p))
  2.460 ns (0 allocations: 0 bytes)
2×2 SMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
 2.0  0.0
 0.0  2.0

But again, per below (using the latest DI release), I think it's clear that some special handling is required for AutoEnzyme here -- be it calling DI.gradient or Enzyme.autodiff.

julia> @btime DifferentiationInterface.jacobian($f, AutoEnzyme(), $u0, Constant($p))
 449.050 ns (23 allocations: 1.69 KiB)
2×2 Matrix{Float64}:
2.0  0.0
0.0  2.0

julia> @btime DifferentiationInterface.jacobian($f, AutoEnzyme(; mode=Enzyme.Reverse), $u0, Constant($p))
ERROR: setindex!(::SVector{2, Float64}, value, ::Int) is not defined.
Hint: Use `MArray` or `SizedArray` to create a mutable static array
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] setindex!(a::SVector{2, Float64}, value::Float64, i::Int64)
   @ StaticArrays ~/.julia/packages/StaticArrays/MSJcA/src/indexing.jl:3
 [3] macro expansion
   @ ~/.julia/packages/StaticArrays/MSJcA/src/broadcast.jl:159 [inlined]
 [4] _broadcast!
   @ ~/.julia/packages/StaticArrays/MSJcA/src/broadcast.jl:143 [inlined]
 [5] _copyto!
   @ ~/.julia/packages/StaticArrays/MSJcA/src/broadcast.jl:70 [inlined]
 [6] copyto!
   @ ~/.julia/packages/StaticArrays/MSJcA/src/broadcast.jl:63 [inlined]
 [7] materialize!
   @ ./broadcast.jl:914 [inlined]
 [8] materialize!
   @ ./broadcast.jl:911 [inlined]
 [9] (::DifferentiationInterfaceEnzymeExt.var"#5#6")(d0::SVector{2, Float64}, d::SVector{2, Float64})
   @ DifferentiationInterfaceEnzymeExt ~/.julia/packages/DifferentiationInterface/v1iM4/ext/DifferentiationInterfaceEnzymeExt/reverse_onearg.jl:39
[10] foreach(::Function, ::@NamedTuple{}, ::NTuple{…})
   @ Base ./abstractarray.jl:3098
[11] batch_seeded_autodiff_thunk(::EnzymeCore.ReverseModeSplit{true, true, false, 0, true, FFIABI, false, true, false}, ::NTuple{16, SVector{…}}, ::Const{typeof(f)}, ::Type{BatchDuplicated}, ::BatchDuplicated{SVector{…}, 16}, ::Const{Vector{…}})
   @ DifferentiationInterfaceEnzymeExt ~/.julia/packages/DifferentiationInterface/v1iM4/ext/DifferentiationInterfaceEnzymeExt/reverse_onearg.jl:38
[12] value_and_pullback(f::typeof(f), ::DifferentiationInterface.NoPullbackPrep, backend::AutoEnzyme{ReverseMode{false, false, FFIABI, false, false}, Nothing}, x::SVector{2, Float64}, ty::NTuple{16, SVector{2, Float64}}, contexts::Constant{Vector{Float64}})
   @ DifferentiationInterfaceEnzymeExt ~/.julia/packages/DifferentiationInterface/v1iM4/ext/DifferentiationInterfaceEnzymeExt/reverse_onearg.jl:128
[13] pullback(f::typeof(f), prep::DifferentiationInterface.NoPullbackPrep, backend::AutoEnzyme{ReverseMode{false, false, FFIABI, false, false}, Nothing}, x::SVector{2, Float64}, ty::NTuple{16, SVector{2, Float64}}, contexts::Constant{Vector{Float64}})
   @ DifferentiationInterfaceEnzymeExt ~/.julia/packages/DifferentiationInterface/v1iM4/ext/DifferentiationInterfaceEnzymeExt/reverse_onearg.jl:142
[14] (::DifferentiationInterface.var"#42#43"{16, Tuple{typeof(f)}, AutoEnzyme{ReverseMode{}, Nothing}, SVector{2, Float64}, Tuple{Constant{}}, DifferentiationInterface.NoPullbackPrep, Int64, Vector{NTuple{}}})(a::Int64)
   @ DifferentiationInterface ~/.julia/packages/DifferentiationInterface/v1iM4/src/first_order/jacobian.jl:269
[15] iterate
   @ ./generator.jl:47 [inlined]
[16] _collect
   @ ./array.jl:854 [inlined]
[17] collect_similar
   @ ./array.jl:763 [inlined]
[18] map
   @ ./abstractarray.jl:3285 [inlined]
[19] _jacobian_aux(f_or_f!y::Tuple{…}, prep::DifferentiationInterface.PullbackJacobianPrep{…}, backend::AutoEnzyme{…}, x::SVector{…}, contexts::Constant{…})
   @ DifferentiationInterface ~/.julia/packages/DifferentiationInterface/v1iM4/src/first_order/jacobian.jl:268
[20] jacobian(f::typeof(f), prep::DifferentiationInterface.PullbackJacobianPrep{16, NTuple{…}, NTuple{…}, DifferentiationInterface.NoPullbackPrep}, backend::AutoEnzyme{ReverseMode{…}, Nothing}, x::SVector{2, Float64}, contexts::Constant{Vector{…}})
   @ DifferentiationInterface ~/.julia/packages/DifferentiationInterface/v1iM4/src/first_order/jacobian.jl:157
[21] jacobian(f::typeof(f), backend::AutoEnzyme{ReverseMode{false, false, FFIABI, false, false}, Nothing}, x::SVector{2, Float64}, contexts::Constant{Vector{Float64}})
   @ DifferentiationInterface ~/.julia/packages/DifferentiationInterface/v1iM4/src/fallbacks/no_prep.jl:49
[22] var"##core#1571"(f#1568::typeof(f), u0#1569::SVector{2, Float64}, p#1570::Vector{Float64})
   @ Main ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:561
[23] var"##sample#1572"(::Tuple{typeof(f), SVector{2, Float64}, Vector{Float64}}, __params::BenchmarkTools.Parameters)
   @ Main ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:570
[24] _lineartrial(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; maxevals::Int64, kwargs::@Kwargs{})
   @ BenchmarkTools ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:187
[25] _lineartrial(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters)
   @ BenchmarkTools ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:182
[26] #invokelatest#2
   @ ./essentials.jl:892 [inlined]
[27] invokelatest
   @ ./essentials.jl:889 [inlined]
[28] #lineartrial#46
   @ ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:51 [inlined]
[29] lineartrial
   @ ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:50 [inlined]
[30] tune!(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, verbose::Bool, pad::String, kwargs::@Kwargs{})
   @ BenchmarkTools ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:300
[31] tune!
   @ ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:289 [inlined]
[32] tune!(b::BenchmarkTools.Benchmark)
   @ BenchmarkTools ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:289
[33] top-level scope
   @ ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:666
Some type information was truncated. Use `show(err)` to see complete types.

@gdalle
Collaborator

gdalle commented Oct 9, 2024

I am 100% convinced that on this small instance, it is possible to speed up Enzyme significantly beyond what DI can do at the moment. And I also see that it comes at the cost of some careful special-casing, likely based on function runtime and dimensionality.
Hopefully such differences will vanish as the functions get more complex or the inputs get larger, because we spend less time in AD overhead and more time actually computing stuff.

In any case, I think we should let @avik-pal make a decision, and whatever it is that decision can also be revisited in the future. Meanwhile, I'm gonna do my best to bring DI+Enzyme to a satisfactory level.

@wsmoses
Author

wsmoses commented Oct 9, 2024

Meanwhile, I'm gonna do my best to bring DI+Enzyme to a satisfactory level.

For sure, and happy to help you get there :)

Generalizing slightly, my personal takeaway at the moment, when folks ask whether we should add an Enzyme extension, is:

  • If a package just needs a gradient: DI.gradient(Enzyme.Reverse) is probably close to optimal as a replacement for Enzyme.autodiff extensions, provided the function/arguments don't add any indirection (like splatting, closures, etc.). DI.gradient(Enzyme.Forward) might also be, but I haven't looked in depth yet, so no strong opinions at the moment. That said, the ~50% slowdown above is weird... so it might be a case-by-case basis.
  • Users who need a Jacobian may be best served calling Enzyme directly (the crash with reverse mode is bad, as is the slowdown for forward mode). Though it might be fine with a regular dense Julia array input/output and no indirection (e.g. no splatting or closures), so again it depends on the context of the package.
  • The same Jacobian advice applies to hessian/hvp/etc.

All of course subject to change as things progress moving forward, but that's presently what I'd recommend.

@ChrisRackauckas
Member

Non-allocating, fast handling of static arrays is of course pretty essential. Though that is something that DI should be able to specialize on, and if we need to use ForwardDiff for that case because that's the setup that is currently optimized, I think that's probably fine.

Small systems (size 10 or less) with static arrays are a very common case of course, so their performance matters a lot (there are lots of models which are repeated nonlinear solves of such systems, solving them a few hundred thousand times). Generally, to achieve this we've had some specialization specifically for static arrays (for example SimpleNonlinearSolve.jl vs NonlinearSolve.jl), as it's just necessary, and so for this we can also just ensure that the static-array-optimized solvers are targeting whatever the best target is.

@gdalle
Collaborator

gdalle commented Oct 9, 2024

As discussed with @ExpandingMan, specializing every DI backend on static arrays would double the number of extensions, so it's not a great idea for code maintenance. But realistically I guess we only need ForwardDiff, FiniteDiff and Enzyme?

@oscardssmith
Contributor

yeah, you only really care about StaticArrays for forward mode.

@wsmoses
Author

wsmoses commented Oct 9, 2024

I guess a question to me is whether that would need to be done for more array types, or whether it's just StaticArrays?

For example, using a view:

                 Enzyme    DI
Forward Mode:    128ns     crash
Reverse Mode:    5.079 μs  27.761 μs
julia> u = ones(4); u0 = view(u, 2:3)
2-element view(::Vector{Float64}, 2:3) with eltype Float64:
 1.0
 1.0
 
julia> @btime hcat(Enzyme.autodiff(Enzyme.set_abi(Forward,InlineABI), $f, $(BatchDuplicated(u0, (view([0., 1., 0., 0.], 2:3), view([0., 0., 1.0, 0.], 2:3)))), $(Const(p)))[1]...)
  128.774 ns (4 allocations: 336 bytes)
2×2 Matrix{Float64}:
 2.0  0.0
 0.0  2.0

julia> function idx(f::F, x, p, i) where F
               return f(x, p)[i]
       end
idx (generic function with 1 method)

julia> function jac1(f::F, x, p) where F
               dx = Enzyme.gradient(Reverse, idx, Const(f), x, Const(p), Const(1))[2]
               dx2 = Enzyme.gradient(Reverse, idx, Const(f), x, Const(p), Const(2))[2]
               return hcat(dx, dx2)
       end
jac1 (generic function with 1 method)

julia> @btime jac1($f, $u0, $p)
  5.079 μs (59 allocations: 3.31 KiB)
2×2 Matrix{Float64}:
 2.0  0.0
 0.0  2.0

julia> @btime DifferentiationInterface.jacobian($f, AutoEnzyme(), $u0, Constant($p))
ERROR: MethodError: Cannot `convert` an object of type Vector{Float64} to an object of type SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}

Closest candidates are:
  convert(::Type{T}, ::T) where T
   @ Base Base.jl:84
  convert(::Type{T}, ::LinearAlgebra.Factorization) where T<:AbstractArray
   @ LinearAlgebra ~/git/Enzyme.jl/julia-1.10.5/share/julia/stdlib/v1.10/LinearAlgebra/src/factorization.jl:108
  convert(::Type{T}, ::LinearAlgebra.AbstractQ) where T<:AbstractArray
   @ LinearAlgebra ~/git/Enzyme.jl/julia-1.10.5/share/julia/stdlib/v1.10/LinearAlgebra/src/abstractq.jl:49
  ...

Stacktrace:
  [1] (::Base.Fix1{typeof(convert), Type{SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}}})(y::Vector{Float64})
    @ Base ./operators.jl:1118
  [2] map
    @ ./tuple.jl:294 [inlined]
  [3] pushforward(f::typeof(f), ::DifferentiationInterface.NoPushforwardPrep, backend::AutoEnzyme{Nothing, Nothing}, x::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, tx::NTuple{16, Vector{Float64}}, contexts::Constant{Vector{Float64}})
    @ DifferentiationInterfaceEnzymeExt ~/.julia/packages/DifferentiationInterface/v1iM4/ext/DifferentiationInterfaceEnzymeExt/forward_onearg.jl:73
  [4] (::DifferentiationInterface.var"#40#41"{16, Tuple{typeof(f)}, AutoEnzyme{Nothing, Nothing}, SubArray{Float64, 1, Vector{}, Tuple{}, true}, Tuple{Constant{}}, DifferentiationInterface.NoPushforwardPrep, Int64, Vector{NTuple{}}})(a::Int64)
    @ DifferentiationInterface ~/.julia/packages/DifferentiationInterface/v1iM4/src/first_order/jacobian.jl:236
  [5] iterate
    @ ./generator.jl:47 [inlined]
  [6] _collect
    @ ./array.jl:854 [inlined]
  [7] collect_similar
    @ ./array.jl:763 [inlined]
  [8] map
    @ ./abstractarray.jl:3285 [inlined]
  [9] _jacobian_aux(f_or_f!y::Tuple{…}, prep::DifferentiationInterface.PushforwardJacobianPrep{…}, backend::AutoEnzyme{…}, x::SubArray{…}, contexts::Constant{…})
    @ DifferentiationInterface ~/.julia/packages/DifferentiationInterface/v1iM4/src/first_order/jacobian.jl:235
 [10] jacobian(f::typeof(f), prep::DifferentiationInterface.PushforwardJacobianPrep{…}, backend::AutoEnzyme{…}, x::SubArray{…}, contexts::Constant{…})
    @ DifferentiationInterface ~/.julia/packages/DifferentiationInterface/v1iM4/src/first_order/jacobian.jl:157
 [11] jacobian(f::typeof(f), backend::AutoEnzyme{Nothing, Nothing}, x::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, contexts::Constant{Vector{Float64}})
    @ DifferentiationInterface ~/.julia/packages/DifferentiationInterface/v1iM4/src/fallbacks/no_prep.jl:49
 [12] var"##core#21078"(f#21075::typeof(f), u0#21076::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, p#21077::Vector{Float64})
    @ Main ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:561
 [13] var"##sample#21079"(::Tuple{typeof(f), SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, Vector{Float64}}, __params::BenchmarkTools.Parameters)
    @ Main ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:570
 [14] _lineartrial(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; maxevals::Int64, kwargs::@Kwargs{})
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:187
 [15] _lineartrial(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters)
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:182
 [16] #invokelatest#2
    @ ./essentials.jl:892 [inlined]
 [17] invokelatest
    @ ./essentials.jl:889 [inlined]
 [18] #lineartrial#46
    @ ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:51 [inlined]
 [19] lineartrial
    @ ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:50 [inlined]
 [20] tune!(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, verbose::Bool, pad::String, kwargs::@Kwargs{})
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:300
 [21] tune!
    @ ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:289 [inlined]
 [22] tune!(b::BenchmarkTools.Benchmark)
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:289
 [23] top-level scope
    @ ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:666
Some type information was truncated. Use `show(err)` to see complete types.

julia> @btime DifferentiationInterface.jacobian($f, AutoEnzyme(; mode=Enzyme.Reverse), $u0, Constant($p))
  27.761 μs (431 allocations: 25.58 KiB)
2×2 Matrix{Float64}:
 2.0  0.0
 0.0  2.0

@gdalle
Collaborator

gdalle commented Oct 9, 2024

This is due to a combination of two things:

  • the way I construct basis vectors from an array x in DI, which involves calling zero(x)
  • the conversion I perform in the Enzyme extension to make sure that typeof(x) == typeof(dx) when I build Duplicated(x, dx)
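
Isolated, the combination looks roughly like this (a sketch, not DI's exact code):

julia> u0 = view(ones(4), 2:3);

julia> dx = zero(u0)   # zero of a SubArray falls back to a plain Vector
2-element Vector{Float64}:
 0.0
 0.0

julia> convert(typeof(u0), dx)   # the typeof(x) == typeof(dx) conversion then has no method
ERROR: MethodError: Cannot `convert` an object of type Vector{Float64} to an object of type SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}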

I have opened an issue here: JuliaDiff/DifferentiationInterface.jl#559

I'm very happy we're finding bugs in DI, feel free to also open issues on the repo instead of just mentioning them in this (already very long) thread.

@wsmoses
Author

wsmoses commented Oct 9, 2024

For a regular old plain array it also looks like DI adds a ~10x slowdown.

                 Enzyme      DI
Forward Mode:    107.521 ns  1.442 μs
Reverse Mode:    218.175 ns  2.169 μs
julia> u0 = ones(2);
 
julia> @btime hcat(Enzyme.autodiff(Enzyme.set_abi(Forward,InlineABI), $f, $(BatchDuplicated(u0, Enzyme.onehot(u0))), $(Const(p)))[1]...)
  107.521 ns (4 allocations: 336 bytes)
2×2 Matrix{Float64}:
 2.0  0.0
 0.0  2.0

julia> function idx(f::F, x, p, i) where F
               return f(x, p)[i]
       end
idx (generic function with 1 method)

julia> function jac1(f::F, x, p) where F
               dx = Enzyme.gradient(Reverse, idx, Const(f), x, Const(p), Const(1))[2]
               dx2 = Enzyme.gradient(Reverse, idx, Const(f), x, Const(p), Const(2))[2]
               return hcat(dx, dx2)
       end
jac1 (generic function with 1 method)

julia> @btime jac1($f, $u0, $p)
  218.175 ns (7 allocations: 576 bytes)
2×2 Matrix{Float64}:
 2.0  0.0
 0.0  2.0

julia> @btime DifferentiationInterface.jacobian($f, AutoEnzyme(), $u0, Constant($p))
  1.442 μs (45 allocations: 3.98 KiB)
2×2 Matrix{Float64}:
 2.0  0.0
 0.0  2.0

julia> @btime DifferentiationInterface.jacobian($f, AutoEnzyme(; mode=Enzyme.Reverse), $u0, Constant($p))
  2.169 μs (61 allocations: 5.23 KiB)
2×2 Matrix{Float64}:
 2.0  0.0
 0.0  2.0

@wsmoses
Author

wsmoses commented Oct 9, 2024

For the regular array, doing DI prep shaved off some time, but it remains ~6x slower for some reason.

julia> prep = DifferentiationInterface.prepare_jacobian(f, AutoEnzyme(), u0, Constant(p))
DifferentiationInterface.PushforwardJacobianPrep{16, NTuple{16, Vector{Float64}}, NTuple{16, Vector{Float64}}, DifferentiationInterface.NoPushforwardPrep}([([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0])], [([2.0, 0.0], [0.0, 2.0], [2.0, 0.0], [0.0, 2.0], [2.0, 0.0], [0.0, 2.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])], DifferentiationInterface.NoPushforwardPrep(), 2)

julia> @btime DifferentiationInterface.jacobian($f, $prep, AutoEnzyme(), $u0, Constant($p))
  670.331 ns (21 allocations: 1.91 KiB)
2×2 Matrix{Float64}:
 2.0  0.0
 0.0  2.0

@gdalle
Collaborator

gdalle commented Oct 9, 2024

For a regular old plain array it also looks like DI adds a ~10x slowdown.
For the regular array, doing DI prep shaved off some time, but it remains ~6x slower for some reason.

Again, you're comparing apples to oranges. Your optimized Enzyme routine calls Enzyme.autodiff as many times as there are function outputs, wasting function evaluations to discard all coordinates but one. And then magically you find that DI.jacobian is slower. Of course it is. But as shown above, we could also hand-tune the DI code by calling DI.gradient two times, which would get us closer to pure Enzyme performance. Not equal, but closer. If you really want to play the speedracing game, that's how you should play it. Specialize every implementation, not just the one you like best.
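
For reference, that hand-tuned DI variant could look something like this (idx2 is a hypothetical wrapper, introduced only so the differentiated argument comes first as DI expects):

import DifferentiationInterface as DI
using DifferentiationInterface: Constant
using ADTypes: AutoEnzyme
import Enzyme

# Same component-selection trick as jac1 above, but expressed through DI.
idx2(x, f, p, i) = f(x, p)[i]

function jac_di(f, x, p)
    backend = AutoEnzyme(; mode=Enzyme.Reverse)
    d1 = DI.gradient(idx2, backend, x, Constant(f), Constant(p), Constant(1))
    d2 = DI.gradient(idx2, backend, x, Constant(f), Constant(p), Constant(2))
    return permutedims(hcat(d1, d2))  # gradients of the outputs form the Jacobian rows
end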

Honestly though, I'm tired of this back and forth, which doesn't prove much except that things are faster when we hardcode them. We already knew that. The whole point of DI is to avoid hardcoding everything everywhere, but I can't seem to get that point across.
So if you really want to help, you're welcome to open issues on the DI repo with non-manufactured examples, comparing things which are comparable, and we can discuss the sources of slowdowns or errors (like the one with view, which is actually interesting). In the meantime, this is going nowhere, so I'm gonna stop adding fuel to the fire. Good night.

@ExpandingMan

ExpandingMan commented Oct 9, 2024

I'd like to contribute an outside voice that may be helpful here.

The benefits of a generic AD interface for packages like NonlinearSolve.jl are obvious. There are multiple back-ends available at different levels of maturity and if a moment ever comes when there is one and only one obvious choice for autodiff in Julia, it is well in the future. There are also potential benefits (particularly for sake of testing) to being able to easily swap out finite differencing.

Of course developing a generic interface that comes with no performance penalties is very hard to do. DI achieves this very well in the vast majority of "simple" cases. There is likely a large class of cases which currently come with some significant performance penalty in DI which have relatively straightforward fixes and can and should be fixed. There is then a (probably much smaller) class of cases which are not reasonable to expect any generic interface to achieve without significant performance penalties. I suspect NonlinearSolve falls into the second class, but this is a decision that it will have to be made by its developers based on its specific needs and benchmarks.

Now, as for StaticArrays, while ultimately in most cases they work amazingly well, there is no denying that they are also a major source of headaches for package developers. Normally users of StaticArrays are looking for a whole bunch of extra idiosyncratic guarantees that it may or may not be reasonable for AD packages to make. Indeed, most StaticArrays users will consider any functions with heap allocations "broken", which is not necessarily compatible with the philosophy of an AD implementation or interface. I have come to the conclusion that it's not entirely reasonable to expect AD to work on them without significant penalties in the generic case, and they often require slightly more specialization, so my own approach has been StaticDiffInterface.jl. This may well be entirely temporary, as the most important cases (Enzyme and ForwardDiff) tend to work pretty well now, but at least at this moment I both very much want a generic interface and do not really trust StaticArrays to not get broken (either by AD back-ends or an interface that is not specifically for them). The fact that DI is extensible enough that it allows for this approach illustrates the usefulness of a generic interface.

Despite whatever difficulties they present, I don't think StaticArrays changes the overall picture of interface vs direct implementation in any significant way. In most cases (and I have tested quite a lot) you can get perfectly good performance through an interface. There will undoubtedly be cases where that fails. If those are important for NonlinearSolve, it should use a direct AD dependency.

Finally, I think showing benchmarks here is only useful if one can show the clear applicability to common use cases in NonlinearSolve. Again, nobody should doubt that there will be cases where DI is fine and cases where it costs a significant penalty. It's perfectly fine that both types of cases exist, it's merely a matter of deciding whether or not a particular package is truly stuck with edge cases that a generic interface cannot solve and how crucial it is for that package to be generic across AD back-ends.

@ChrisRackauckas
Member

It is a bit odd though.

julia> prep = DifferentiationInterface.prepare_jacobian(f, AutoEnzyme(), u0, Constant(p))
DifferentiationInterface.PushforwardJacobianPrep{16, NTuple{16, Vector{Float64}}, NTuple{16, Vector{Float64}}, DifferentiationInterface.NoPushforwardPrep}([([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0])], [([2.0, 0.0], [0.0, 2.0], [2.0, 0.0], [0.0, 2.0], [2.0, 0.0], [0.0, 2.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])], DifferentiationInterface.NoPushforwardPrep(), 2)

julia> @btime DifferentiationInterface.jacobian($f, $prep, AutoEnzyme(), $u0, Constant($p))
  670.331 ns (21 allocations: 1.91 KiB)
2×2 Matrix{Float64}:
 2.0  0.0
 0.0  2.0

That form should be lowering to exactly:

julia> @btime hcat(Enzyme.autodiff(Enzyme.set_abi(Forward,InlineABI), $f, $(BatchDuplicated(u0, Enzyme.onehot(u0))), $(Const(p)))[1]...)
  107.521 ns (4 allocations: 336 bytes)
2×2 Matrix{Float64}:
 2.0  0.0
 0.0  2.0

so I'm not sure I understand what would cause the difference. Can someone profile what the actual overhead is here? That must be able to be cut out.

Of course developing a generic interface that comes with no performance penalties is very hard to do. DI achieves this very well in the vast majority of "simple" cases. There is likely a large class of cases which currently come with some significant performance penalty in DI which have relatively straightforward fixes and can and should be fixed. There is then a (probably much smaller) class of cases which are not reasonable to expect any generic interface to achieve without significant performance penalties. I suspect NonlinearSolve falls into the second class, but this is a decision that it will have to be made by its developers based on its specific needs and benchmarks.

I would actually be surprised if NonlinearSolve is in the latter case. Optimization.jl, yes. SciMLSensitivity.jl, yes. NonlinearSolve.jl is just doing Jacobians and jvps of functions f(u,p) with respect to u. So basically just Jacobians w.r.t. f(u) with p a constant enclosed. That's all it is. That's the most basic case of differentiation you can ever have. If there is any case that can be handled well with a general AD interface, this would be the first one. No hard things like gradients, reverse mode, etc., just forward mode total derivatives.

Despite whatever difficulties they present, I don't think StaticArrays changes the overall picture of interface vs direct implementation in any significant way. In most cases (and I have tested quite a lot) you can get perfectly good performance through an interface. There will undoubtedly be cases where that fails. If those are important for NonlinearSolve, it should use a direct AD dependency.

Agreed. If we have to specialize on static arrays, so be it; that's straightforward. I expect StaticArrays are weird because most code working with arrays assumes the array interface is satisfied, which assumes setindex! exists. StaticArrays aren't really arrays in that sense because they don't implement that part of the interface, so they need some different paths. The hardest cases are the smallest cases, and so this is the most special of special cases. We can special-case it here; I don't see a problem.

DI should handle the common array case well, and use ArrayInterface.jl to take the optimal path on all generic array types (i.e. paths for indexing vs non-indexable, etc.). That's what it should focus on getting right for now. There's still a ton of generic code optimization for it to do, including for plain Array, so reducing its scope to just those cases and taking StaticArrays out of it is a good use of time.
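
The kind of trait I'd expect that branching to key off of is ArrayInterface.can_setindex (assumed here; the exact dispatch is up to DI):

using ArrayInterface, StaticArrays

ArrayInterface.can_setindex(Vector{Float64})     # true: indexable, in-place paths are fine
ArrayInterface.can_setindex(SVector{2,Float64})  # expected false: needs the non-mutating path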

@gdalle
Collaborator

gdalle commented Oct 10, 2024

Can someone profile what the actual overhead is here? That must be able to be cut out.

I explained the overhead in great detail in this comment, section "Why is DI slow?". TLDR:

DI does not assume that the batch size B is sufficient to compute the whole Jacobian. Indeed, at least for Enzyme and ForwardDiff, batched mode becomes slower beyond B ~ 10. Therefore, like ForwardDiff, for a function f with dimensions N -> M, DI computes the out-of-place forward Jacobian as follows:

  1. Split the input into N/B batches of size B each
  2. Compute a batched pushforward on each batch
  3. Within each batch, stack the results of the pushforward into a block of size (M, B)
  4. Concatenate the blocks with reduce(hcat) to get a matrix of size (M, N)

You can see the whole code in this file:

https://github.com/gdalle/DifferentiationInterface.jl/blob/f6632258d24631f60c7aaeb081aea0cf5849b993/DifferentiationInterface/src/first_order/jacobian.jl#L222-L253
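
Stripped of preparation and context handling, the strategy amounts to something like this (a simplified sketch assuming a plain Vector input and a one-argument f, not the linked implementation):

import DifferentiationInterface as DI

function batched_jacobian_sketch(f, backend, x; B=8)
    N = length(x)
    onehot(i) = [j == i ? 1.0 : 0.0 for j in eachindex(x)]   # basis vector e_i
    blocks = map(Iterators.partition(1:N, B)) do batch
        tx = Tuple(onehot(j) for j in batch)       # 1. one batch of at most B seeds
        ty = DI.pushforward(f, backend, x, tx)     # 2. batched pushforward
        stack(ty)                                  # 3. block of size (M, length(batch))
    end
    return reduce(hcat, blocks)                    # 4. full (M, N) Jacobian
end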

Your remark actually made me realize a suboptimal aspect, fixed in JuliaDiff/DifferentiationInterface.jl#561 (x2 speedup).

I would gladly call Enzyme.jacobian directly, but (1) as we have seen it doesn't support additional arguments (EnzymeAD/Enzyme.jl#1949) and (2) Billy keeps warning against it because it is slow anyway.

Agreed. If we have to specialize on static arrays, so be it. That's straightforward. [...] I don't see a problem. DI should handle the common array case well, and use ArrayInterface.jl to take the optimal path on all generic array types (i.e. paths for indexing vs non-indexable, etc.).

It may be straightforward, but it doesn't come for free. Anything I do in DI, I have to test properly, against as many backends as I can support. That's how I get 99% code coverage (by the way I just added code coverage to Enzyme in EnzymeAD/Enzyme.jl#1954).
The goal of this interface is to be both reliable and maintainable, so even though we all want every feature as fast as possible, at some point we need to be realistic.

The productive part of this thread was coming up with actual benchmarks where DI is slowed down. I have taken note of them, opened some issues to keep track. Barring a few exceptions (JuMP-mandated splatting), most of them are definitely fixable, because I designed the API to be flexible enough. It's just a matter of time and energy.

@ChrisRackauckas
Member

Barring a few exceptions (JuMP-mandated splatting)

I would just keep JuMP entirely separate for this reason. Not supporting full Julia functions is just a non-starter and we shouldn't have things splat because JuMP needs it 😅. That's one of the reasons we had to start making Optimization.jl for the neural network support, splatting for a NN is just... yeah that's not happening.

Anything I do in DI, I have to test properly, against as many backends as I can support. That's how I get 99% code coverage (by the way I just added code coverage to Enzyme in EnzymeAD/Enzyme.jl#1954).
The goal of this interface is to be both reliable and maintainable, so even though we all want every feature as fast as possible, at some point we need to be realistic.

That stat is very misleading though. In a real sense it has a lot less coverage than something like SparseDiffTools because the coverage metric is only whether a line of code is hit, not for all of the different dispatches and types that can hit it. Testing the full gamut of FillArrays, MultiScaleArrays, ArrayPartition, CuSparseMatrixCSR, etc. is definitely not covered from what I can tell right now.

DI does not assume that the batch size B is sufficient to compute the whole Jacobian. Indeed, at least for Enzyme and ForwardDiff, batched mode becomes slower beyond B ~ 10.

ForwardDiff benchmarked it at 12 before. I'm skeptical it's 10 and not 8; if the SIMD is good, it should max out once all of the operations fully SIMD, right? Saving a few primals at that point usually isn't a big deal.

I would gladly call Enzyme.jacobian directly, but (1) as we have seen it doesn't support additional arguments (EnzymeAD/Enzyme.jl#1949) and (2) Billy keeps warning against it because it is slow anyway.

That's probably the solution here. NonlinearSolve literally just wants a Jacobian. If we need a faster Jacobian setup for Enzyme, we should let Enzyme.jl control its definition via Enzyme.jacobian! and DI/NonlinearSolve.jl should just use it. I'm sure Billy can make that fast and we'll gladly just use the fast version of that.

The upside is that I know Enzyme has special tools coming for sparsity handling, and so it would be great if Enzyme was handling that the way it wanted to and DI could just facilitate shuttling down to the Enzyme-specific calls to abstract away the interface.

In the end, someone has to make a Jacobian function if we want a Jacobian, so let's just make sure the Enzyme.jacobian is the good one.

@gdalle
Collaborator

gdalle commented Oct 10, 2024

In a real sense it has a lot less coverage than something like SparseDiffTools

Let's not even discuss SparseDiffTools given the bad stuff we found in #473 and JuliaDiff/SparseDiffTools.jl#307. I've been working tirelessly so that we can retire it as soon as possible, and that day is coming.

because the coverage metric is only whether a line of code is hit, not for all of the different dispatches and types that can hit it.

That is entirely correct. I have tests for StaticArrays and JLArrays (GPU emulators) but I can't test every array type in the ecosystem while keeping the CI time reasonable (keep in mind everything is multiplied by 12 backends and 8 differentiation operators).
For some array types, it's gonna be the responsibility of downstream packages to test that everything works, and report bugs. Maybe we can discuss integration tests, but they would have to be on a separate compute budget than the free GitHub tier, and definitely not run at every PR.

ForwardDiff benchmarked it at 12 before. I'm skeptical it's 10 and not 8; if the SIMD is good, it should max out once all of the operations fully SIMD, right? Saving a few primals at that point usually isn't a big deal.

That's why I used the ~ sign (8, 10, 12, it's the same ballpark), and that's also why I want to upstream batch size choice to the AD backend. I use ForwardDiff.pickchunksize with ForwardDiff, and I have opened EnzymeAD/Enzyme.jl#1545 to add a similar function to Enzyme.
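
For reference, ForwardDiff exposes that heuristic as pickchunksize, whose default threshold is 12 at the time of writing (matching the number Chris cites):

julia> using ForwardDiff

julia> ForwardDiff.pickchunksize(4)   # small inputs: the chunk covers the whole input
4

julia> ForwardDiff.DEFAULT_CHUNK_THRESHOLD
12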

If we need a faster Jacobian setup for Enzyme, we should let Enzyme.jl control its definition via Enzyme.jacobian! and DI/NonlinearSolve.jl should just use it. I'm sure Billy can make that fast and we'll gladly just use the fast version of that.
In the end, someone has to make a Jacobian function if we want a Jacobian, so let's just make sure the Enzyme.jacobian is the good one.

If Billy codes a faster Jacobian, DI can also fall back on it, and everyone is happy. But I gather the Enzyme philosophy so far has been "high-level functions are in general not optimally fast" (see this comment). This is the root of all our disagreements above, because low-level operators need to be hardcoded in each case for Enzyme to reach top speed.

The upside is that I know Enzyme has special tools coming for sparsity handling, and so it would be great if Enzyme was handling that the way it wanted to and DI could just facilitate shuttling down to the Enzyme-specific calls to abstract away the interface.

I'll be happy to add bindings to Spadina once it's accessible from Julia, documented and tested.

@gdalle
Collaborator

gdalle commented Oct 10, 2024

I'm happy to report a x15 speedup in DI's handling of this benchmark case, and we get the right Jacobian type now.

DI v0.6.8 (before this discussion):

julia> prep = DI.prepare_jacobian(f, AutoEnzyme(), u0, Constant(p));

julia> @btime DI.jacobian(f, $prep, AutoEnzyme(), $u0, Constant($p))
  375.357 ns (4 allocations: 592 bytes)
2×2 Matrix{Float64}:
 2.0  0.0
 0.0  2.0

DI v0.6.10 (from the branch JuliaDiff/DifferentiationInterface.jl#565):

julia> prep = DI.prepare_jacobian(f, AutoEnzyme(), u0, Constant(p));

julia> @btime DI.jacobian(f, $prep, AutoEnzyme(), $u0, Constant($p))
  25.890 ns (1 allocation: 48 bytes)
2×2 SMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
 2.0  0.0
 0.0  2.0

It's not fully allocation-free yet, but this shows that progress on such cases is not limited by the API, just by the current generic implementation.
Importantly, nothing is hardcoded, and the meaningful code changes I made to get there are probably 20 LOCs max:

  1. Custom stacking for StaticArrays JuliaDiff/DifferentiationInterface.jl#564
  2. mapreduce instead of map + reduce for Jacobians & Hessians JuliaDiff/DifferentiationInterface.jl#565

@wsmoses
Author

wsmoses commented Oct 10, 2024

Awesome!

By chance, did you figure out the source of the other ~10x gap? Is it just that allocation, or is it something more?

Note that below I used DI#main and Enzyme#main. Per your request I added multi-argument support to ForwardMode functions. However, note that the semantics of Enzyme.jacobian are sufficiently different from DI.jacobian that offhand I don't think you can directly forward one to the other (in particular, how Const/Duplicated/etc. arguments are handled differs from your present semantics, and the return type is multi-dimensional).

julia> using Enzyme, BenchmarkTools, DifferentiationInterface, ADTypes, StaticArrays

julia> f(u, p) = u .* u .- 2
f (generic function with 1 method)

julia> u0 = @SVector[1.0, 1.0]
2-element SVector{2, Float64} with indices SOneTo(2):
 1.0
 1.0

julia> p = [1.0, 2.0]
2-element Vector{Float64}:
 1.0
 2.0

julia> prep = DifferentiationInterface.prepare_jacobian(f, AutoEnzyme(), u0, Constant(p))

DifferentiationInterface.PushforwardJacobianPrep{2, Tuple{SVector{2, Float64}, SVector{2, Float64}}, Tuple{MVector{2, Float64}, MVector{2, Float64}}, DifferentiationInterface.NoPushforwardPrep}(Tuple{SVector{2, Float64}, SVector{2, Float64}}[([1.0, 0.0], [0.0, 1.0])], Tuple{MVector{2, Float64}, MVector{2, Float64}}[([1.0e-323, 6.23553276621004e-310], [1.0e-323, 6.23553276621004e-310])], DifferentiationInterface.NoPushforwardPrep(), 2)

julia> @btime DifferentiationInterface.jacobian($f, $prep, AutoEnzyme(), $u0, Constant($p))
  19.158 ns (1 allocation: 48 bytes)
2×2 SMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
 2.0  0.0
 0.0  2.0

julia> @btime hcat(Enzyme.autodiff(Enzyme.set_abi(Forward,InlineABI), $f, $(BatchDuplicated(u0, Enzyme.onehot(u0))), $(Const(p)))[1]...)
  2.330 ns (0 allocations: 0 bytes)
2×2 SMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
 2.0  0.0
 0.0  2.0

julia> @btime Enzyme.jacobian(Enzyme.set_abi(Forward,InlineABI), $f, $u0, $(Const(p)))
  2.330 ns (0 allocations: 0 bytes)
([2.0 0.0; 0.0 2.0], nothing)

@gdalle again, just to reiterate: the goal of DI is fantastic. The idea of having an interface that encapsulates all of the common use cases is a laudable and critical task. However, each end user may have additional information which lets them run code more efficiently and/or lets more code be successfully differentiated. Adding that generally to DI would mean a separate function for each potential context (known scalar-output Jacobian, known array-output Jacobian, symmetric-input Jacobian, coupled input/output, etc.). Covering every case is clearly a non-goal, as it would be boiling the ocean.

The question is a trade-off over what level of generality gets you the most bang for your buck across use cases. There will always be cases where the extra information is worth the trade, and others where it's not.

This will also depend on the state of the artifact's implementation, which will change over time and will likely let more users switch over to DI.

Your improvements here are fantastic, but I still think that, at the current state of the artifact and the end application, it merits the ~10 lines of code needed to add special handling for Enzyme. Again, this could call DI.gradient inside the special handling if you want; I don't care if the special handling is implemented with DI subfunctions instead of DI.jacobian. But it does look like it's merited -- and of course that may change in the future if these gaps disappear!

But that's no reason to block a PR you don't like; we can always remove it once it's no longer needed.


On a side note:

Frankly this entire conversation has been exhausting and makes me think twice about engaging with DI in the future.

Here's what seemed to happen in this thread from my view:

I offer the suggestion that it may be warranted to add a ~10-line special-case handler for my package, and you respond as if it were a personal attack and an attempt to undermine you, demanding a justification for why Enzyme deserves an extension here.

I give a high-level description of why/when it can be beneficial, and you then demand: "Do you have benchmarks or errors to show that DI is insufficient here? Not in JuMP, in this very repo."

I benchmark the function in the README of this repo, and you claim this is "cheating" and "hard-coded" because it calls the actual autodiff API.

At this point I've shown this is sufficiently general that there's now a draft PR (#478), and a nicer Enzyme.jacobian function.

I don't know what more you'll want, and I don't think it's productive to continue this thread, so I'm going to close the issue.

I'm glad that you've used these examples to improve DI, and I hope it continues to improve in the future.

Feel free to take the PR, or feel free to ignore it.

@gdalle
Collaborator

gdalle commented Oct 14, 2024

By chance, did you figure out the source of the other ~10x gap? Is it just that allocation, or is it something more?

Yes, I eventually figured it out!

DI's default jacobian splits the input into batches and then applies a mapreduce(hcat) over them. But when a single batch is enough (and we can know that statically after preparation), we can remove the mapreduce and gain more speed.
In JuliaDiff/DifferentiationInterface.jl#575, I added the necessary special cases for single-batch jacobian and hessian. Essentially, I took your clever Enzyme implementation and added a special case to the DI one that looks about the same. I also drew inspiration from the distinction between ForwardDiff's "vector mode" and "chunk mode" operators.
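
A rough sketch of what the single-batch fast path boils down to (again simplified to a plain Vector input; the real code is in the linked PR):

import DifferentiationInterface as DI

function singlebatch_jacobian_sketch(f, backend, x)
    N = length(x)
    onehot(i) = [j == i ? 1.0 : 0.0 for j in eachindex(x)]
    tx = ntuple(onehot, N)                    # all N seeds fit in one batch
    ty = DI.pushforward(f, backend, x, tx)    # a single batched pushforward call
    return stack(ty)                          # (M, N) Jacobian, no mapreduce over batches
end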

On that PR branch, DI's generic implementation using Enzyme's pushforwards is within a factor 2 of native Enzyme (note that it doesn't use your new jacobian with constant arguments, it goes right down to the autodiff level):

julia> mode = Enzyme.set_abi(Enzyme.Forward, Enzyme.InlineABI)
EnzymeCore.ForwardMode{false, EnzymeCore.InlineABI, false, false}()

julia> backend = AutoEnzyme(; mode);

julia> prep = prepare_jacobian(f, backend, u0, Constant(p));

julia> @btime jacobian(f, $prep, $backend, $u0, Constant($p))
  3.815 ns (0 allocations: 0 bytes)
2×2 SMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
 2.0  0.0
 0.0  2.0

julia> @btime hcat(Enzyme.autodiff($mode, $f, $(Enzyme.BatchDuplicated(u0, Enzyme.onehot(u0))), $(Enzyme.Const(p)))[1]...,)
  2.996 ns (0 allocations: 0 bytes)
2×2 SMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
 2.0  0.0
 0.0  2.0

Frankly this entire conversation has been exhausting and makes me think twice about engaging with DI in the future.

As discussed privately on Slack, I'll strive to be more Enzyme-friendly in the future, and learn from your custom implementations rather than dismiss them. I can even help out with #478 if necessary.
