Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add non-default broadcasting of ExtendedJumpArray's. #340

Merged
merged 3 commits into from
Aug 30, 2023

Conversation

meson800
Copy link
Contributor

@meson800 meson800 commented Aug 30, 2023

Summary

The previous ExtendedJumpArray code contains code that implements part of the broadcastable interface, but is not actually used due to the entire interface not being included. This commit replaces that broadcast code, reusing much of the arg packing/repacking. Broadcasting operations that only rely on scalars and ExtendedJumpArray's will now use an efficient implementation that can emit SIMD instructions at the native and LLVM code levels. This closes #335. Benchmarking shows between a two to five fold increase in performance with this broadcasting method.

The previous code uses the fallback indexing method when broadcasting, which requires branching inside the broadcast loop. This branch is likely well-branch-predicted, but the presence of the branch prevents kernel functions from being fused into efficient SIMD instructions.

Implementation

This PR defines a ExtendedJumpArrayStyle, a broadcast style that includes two sub-styles that represent the styles for broadcasting the u and jump_u entries. copyto! overloads are specified that do the efficient thing and call copyto!, using the broadcast computation tree, on u and jump_u separately. Binary style rules are defined such that broadcasts fall back to the old method if you do something like adding an ExtendedJumpArray to a normal vector.

The arg tuple unpacking is mostly the same as the old partial implementation that was there; I just renamed the functions and unified them with the new broadcasting style.

A Julia >=1.9 extension is added for FastBroadcast.jl, where the equivalent of the copyto! call, fast_materialize! is similarly overloaded. This only adds a weak dependency, so only users who also use FastBroadcast (such as OrdinaryDiffEq) will have these methods loaded; those that don't already have FastBroadcast in the environment will not pull it in as a dependency.

Tests are added to check that the new broadcasting behavior is behaving as expected. In addition, a benchmark file is included that can be used to confirm that ExtendedJumpArray broadcasting is just as fast as broadcasting of normal Vector{Float64}'s. This PR passes all other tests, including e.g. the ones that use auto differentiation; broadcasting works great even in this case :)

Remaining problems/notes

Inhomogeneous ExtendedJumpArray's still act strangely under broadcasting, but no more strangely than they were already acting. In fact, some of the VariableRateJump tests rely on the behavior that an ExtendedJumpArray plus a Vector decay to become a Vector.

The new broadcasting method means that if you define an output array that is inhomogeneous (e.g. one where the A.u is a Vector{Float64} but A.jump_u is a Vector{Int64}), then broadcasted operations will successfully maintain the inhomogeneity. However, if you allow the broadcast machinery to create an output array for you (allocating), it will just allocate a homogenous ExtendedJumpArray with both u and jump_u being Vector{Float64}'s. This is the same behavior as the fallback, so this should not break any existing downstream code.

Finally, I just used an implementation for selecting an output ExtendedJumpArray while broadcasting based on the Julia interface documentation; this is a recursive function that is used to unpack the broadcasted tuple in a type-stable way, but there can be a few allocations that happen prior to the broadcast if someone broadcasts a very complicated expression that exceeds a recursion depth of 16. I was trying to replace it with a concise generated function that would always be type stable, but I think it is far less likely to hit this limit than in my other recent PRs. If there is an instance where this happens, I can fix this.

The previous ExtendedJumpArray code contains code that implements part
of the broadcastable interface, but is not actually used due to the
entire interface not being included. This commit replaces that broadcast
code, reusing much of the arg packing/repacking.

Because of this, the current code uses the fallback indexing method when
broadcasting, which requires branching inside the broadcast loop. This
branch is likely well-branch-predicted, but the presence of the branch
prevents kernel functions from being fused into efficent SIMD
instructions.

This defines a ExtendedJumpArrayStyle, a broadcast style that includes
two sub-styles that represent the styles for broadcasting the `u` and
`jump_u` entries. `copyto!` overloads are specified that do the efficent
thing and call `copyto!`, using the broadcast computation tree, on `u`
and `jump_u` separately. Binary style rules are defined such that
broadcasts fall back to the old method if you do something like adding
an ExtendedJumpArray to a normal vector.

A Julia >=1.9 extension is added for FastBroadcast.jl, where the
equivilant of the `copyto!` call, `fast_materialize!` is similarlly
overloaded.
@coveralls
Copy link

coveralls commented Aug 30, 2023

Pull Request Test Coverage Report for Build 6026823188

  • 37 of 42 (88.1%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+1.4%) to 89.861%

Changes Missing Coverage Covered Lines Changed/Added Lines %
src/extended_jump_array.jl 33 34 97.06%
ext/JumpProcessFastBroadcastExt.jl 4 8 50.0%
Totals Coverage Status
Change from base Build 5658974660: 1.4%
Covered Lines: 2269
Relevant Lines: 2525

💛 - Coveralls

Project.toml Outdated Show resolved Hide resolved
@ChrisRackauckas
Copy link
Member

This looks great! Just a few minor pieces mentioned above and I think this is good to go. Comments:

The new broadcasting method means that if you define an output array that is inhomogeneous (e.g. one where the A.u is a Vector{Float64} but A.jump_u is a Vector{Int64}), then broadcasted operations will successfully maintain the inhomogeneity. However, if you allow the broadcast machinery to create an output array for you (allocating), it will just allocate a homogenous ExtendedJumpArray with both u and jump_u being Vector{Float64}'s. This is the same behavior as the fallback, so this should not break any existing downstream code.

👍 that should be fine.

Finally, I just used an implementation for selecting an output ExtendedJumpArray while broadcasting based on the Julia interface documentation; this is a recursive function that is used to unpack the broadcasted tuple in a type-stable way, but there can be a few allocations that happen prior to the broadcast if someone broadcasts a very complicated expression that exceeds a recursion depth of 16. I was trying to replace it with a concise generated function that would always be type stable, but I think it is far less likely to hit this limit than in my other recent PRs. If there is an instance where this happens, I can fix this.

This is something that FastBroadcast.jl handles, so in practice we're fine.

@isaacsas
Copy link
Member

Would be interesting to redo the VariableRate benchmarks once this is merged and see the performance vs. Coevolve now.

@meson800
Copy link
Contributor Author

Just a few minor pieces mentioned above and I think this is good to go

I added a commit to address those comments, thanks!

This is something that FastBroadcast.jl handles, so in practice we're fine.

I think the problem could still exist, as FastBroadcast also calls out to similar. In the ExtendedJumpArrayStyle method of similar is where find_eja is used (in order to locate the ExtendedJumpArray in the arg list to determine output axis sizes); if the argument list is too deep then the compiler might give up type inference for the recursive call, leading to some inefficient boxing/unboxing.

Would be interesting to redo the VariableRate benchmarks once this is merged and see the performance vs. Coevolve now.

I expect that the broadcasting speedup is potentially less for short ExtendedJumpArray's because they benefit less from SIMD.

@isaacsas
Copy link
Member

@meson800 could you see if this issue is also closed now:

#321

If so can you add a related test using the simple example in that issue in your extended jump array tests file? (I only ask as it should be super easy to check.)

Thanks!

@meson800
Copy link
Contributor Author

meson800 commented Aug 30, 2023

Yes, #321 looks like it is fixed now by the broadcasting changes. Tests added and pushed. It possibly won't be fixed pre Julia 1.9 though, because the FastBroadcast extension won't be loaded.

@meson800
Copy link
Contributor Author

meson800 commented Aug 30, 2023

We'll see when the 1.6 CI finishes, but I think that this is still fine on < Julia 1.9. It won't load the extension so the FastBroadcasting will be slower, but the broadcast rules should still load and correctly select the output type, fixing the problem in #321.

@ChrisRackauckas
Copy link
Member

We'll see when the 1.6 CI finishes, but I think that this is still fine on < Julia 1.9. It won't load the extension so the FastBroadcasting will be slower, but the broadcast rules should still load and correctly select the output type, fixing the problem in #321.

Yes, it should just fallback to being slower, which is fine for earlier versions. We need to just move forward and get the new LTS 😅

@ChrisRackauckas ChrisRackauckas merged commit 9f89895 into SciML:master Aug 30, 2023
@isaacsas
Copy link
Member

@meson800 thanks for this! I suspect you've actually closed a bunch of other open ExtendedJumpArray-related issues with this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Efficency of ExtendedJumpArray broadcasting in ode_interpolant
4 participants