-
-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add non-default broadcasting of ExtendedJumpArray's. #340
Conversation
The previous ExtendedJumpArray code contains code that implements part of the broadcastable interface, but is not actually used due to the entire interface not being included. This commit replaces that broadcast code, reusing much of the arg packing/repacking. Because of this, the current code uses the fallback indexing method when broadcasting, which requires branching inside the broadcast loop. This branch is likely well-branch-predicted, but the presence of the branch prevents kernel functions from being fused into efficent SIMD instructions. This defines a ExtendedJumpArrayStyle, a broadcast style that includes two sub-styles that represent the styles for broadcasting the `u` and `jump_u` entries. `copyto!` overloads are specified that do the efficent thing and call `copyto!`, using the broadcast computation tree, on `u` and `jump_u` separately. Binary style rules are defined such that broadcasts fall back to the old method if you do something like adding an ExtendedJumpArray to a normal vector. A Julia >=1.9 extension is added for FastBroadcast.jl, where the equivilant of the `copyto!` call, `fast_materialize!` is similarlly overloaded.
Pull Request Test Coverage Report for Build 6026823188
💛 - Coveralls |
This looks great! Just a few minor pieces mentioned above and I think this is good to go. Comments:
👍 that should be fine.
This is something that FastBroadcast.jl handles, so in practice we're fine. |
Would be interesting to redo the VariableRate benchmarks once this is merged and see the performance vs. Coevolve now. |
I added a commit to address those comments, thanks!
I think the problem could still exist, as FastBroadcast also calls out to
I expect that the broadcasting speedup is potentially less for short |
Yes, #321 looks like it is fixed now by the broadcasting changes. Tests added and pushed. It possibly won't be fixed pre Julia 1.9 though, because the FastBroadcast extension won't be loaded. |
We'll see when the 1.6 CI finishes, but I think that this is still fine on < Julia 1.9. It won't load the extension so the FastBroadcasting will be slower, but the broadcast rules should still load and correctly select the output type, fixing the problem in #321. |
Yes, it should just fallback to being slower, which is fine for earlier versions. We need to just move forward and get the new LTS 😅 |
@meson800 thanks for this! I suspect you've actually closed a bunch of other open |
Summary
The previous ExtendedJumpArray code contains code that implements part of the broadcastable interface, but is not actually used due to the entire interface not being included. This commit replaces that broadcast code, reusing much of the arg packing/repacking. Broadcasting operations that only rely on scalars and
ExtendedJumpArray
's will now use an efficient implementation that can emit SIMD instructions at the native and LLVM code levels. This closes #335. Benchmarking shows between a two to five fold increase in performance with this broadcasting method.The previous code uses the fallback indexing method when broadcasting, which requires branching inside the broadcast loop. This branch is likely well-branch-predicted, but the presence of the branch prevents kernel functions from being fused into efficient SIMD instructions.
Implementation
This PR defines a ExtendedJumpArrayStyle, a broadcast style that includes two sub-styles that represent the styles for broadcasting the
u
andjump_u
entries.copyto!
overloads are specified that do the efficient thing and callcopyto!
, using the broadcast computation tree, onu
andjump_u
separately. Binary style rules are defined such that broadcasts fall back to the old method if you do something like adding an ExtendedJumpArray to a normal vector.The arg tuple unpacking is mostly the same as the old partial implementation that was there; I just renamed the functions and unified them with the new broadcasting style.
A Julia >=1.9 extension is added for FastBroadcast.jl, where the equivalent of the
copyto!
call,fast_materialize!
is similarly overloaded. This only adds a weak dependency, so only users who also useFastBroadcast
(such asOrdinaryDiffEq
) will have these methods loaded; those that don't already have FastBroadcast in the environment will not pull it in as a dependency.Tests are added to check that the new broadcasting behavior is behaving as expected. In addition, a benchmark file is included that can be used to confirm that
ExtendedJumpArray
broadcasting is just as fast as broadcasting of normalVector{Float64}
's. This PR passes all other tests, including e.g. the ones that use auto differentiation; broadcasting works great even in this case :)Remaining problems/notes
Inhomogeneous
ExtendedJumpArray
's still act strangely under broadcasting, but no more strangely than they were already acting. In fact, some of the VariableRateJump tests rely on the behavior that anExtendedJumpArray
plus aVector
decay to become aVector
.The new broadcasting method means that if you define an output array that is inhomogeneous (e.g. one where the
A.u
is aVector{Float64}
butA.jump_u
is aVector{Int64}
), then broadcasted operations will successfully maintain the inhomogeneity. However, if you allow the broadcast machinery to create an output array for you (allocating), it will just allocate a homogenous ExtendedJumpArray with bothu
andjump_u
beingVector{Float64}
's. This is the same behavior as the fallback, so this should not break any existing downstream code.Finally, I just used an implementation for selecting an output
ExtendedJumpArray
while broadcasting based on the Julia interface documentation; this is a recursive function that is used to unpack the broadcasted tuple in a type-stable way, but there can be a few allocations that happen prior to the broadcast if someone broadcasts a very complicated expression that exceeds a recursion depth of 16. I was trying to replace it with a concise generated function that would always be type stable, but I think it is far less likely to hit this limit than in my other recent PRs. If there is an instance where this happens, I can fix this.