
Broadcasting is much slower than necessary #400

Open
ranocha opened this issue Aug 18, 2024 · 3 comments
ranocha commented Aug 18, 2024

Describe the bug 🐞

The documentation of ArrayPartition states

Broadcasting like f.(A) is efficient.

However, I found that broadcasting appears to be much slower than it could be, by roughly an order of magnitude in the example below.

Expected behavior

Broadcasting should be as fast as the hand-written version below.

Minimal Reproducible Example 👇

using BenchmarkTools: @benchmark
using FastBroadcast: @..
using MuladdMacro: @muladd
using RecursiveArrayTools: ArrayPartition, npartitions

function run_benchmarks(num_arrays, num_elements)
    u0 = ArrayPartition(ntuple(_ -> randn(num_elements), num_arrays)...)
    u1 = ArrayPartition(ntuple(_ -> randn(num_elements), num_arrays)...)
    u2 = ArrayPartition(ntuple(_ -> randn(num_elements), num_arrays)...)
    u3 = ArrayPartition(ntuple(_ -> randn(num_elements), num_arrays)...)
    u4 = ArrayPartition(ntuple(_ -> randn(num_elements), num_arrays)...)

    a1 = randn()
    a2 = randn()
    a3 = randn()

    @info "Base broadcast"
    display(@benchmark base_broadcast!($u0, $u1, $u2, $u3, $u4, $a1, $a2, $a3))
    sleep(1)
    @info "FastBroadcast.jl"
    display(@benchmark fast_broadcast!($u0, $u1, $u2, $u3, $u4, $a1, $a2, $a3))
    sleep(1)
    @info "Own base broadcast"
    display(@benchmark own_base_broadcast!($u0, $u1, $u2, $u3, $u4, $a1, $a2, $a3))
    sleep(1)
    @info "Own fast broadcast"
    display(@benchmark own_fast_broadcast!($u0, $u1, $u2, $u3, $u4, $a1, $a2, $a3))

    return nothing
end

@muladd @noinline function base_broadcast!(u0, u1, u2, u3, u4, a1, a2, a3)
    @. u0 = a1 * u1 + a2 * u2 + a3 * u3 + u4
    return nothing
end

@muladd @noinline function fast_broadcast!(u0, u1, u2, u3, u4, a1, a2, a3)
    @.. u0 = a1 * u1 + a2 * u2 + a3 * u3 + u4
    return nothing
end

@muladd @noinline function own_base_broadcast!(u0, u1, u2, u3, u4, a1, a2, a3)
    for i in 1:npartitions(u0, u1, u2, u3, u4)
        @. u0.x[i] = a1 * u1.x[i] + a2 * u2.x[i] + a3 * u3.x[i] + u4.x[i]
    end
    return nothing
end

@muladd @noinline function own_fast_broadcast!(u0, u1, u2, u3, u4, a1, a2, a3)
    for i in 1:npartitions(u0, u1, u2, u3, u4)
        @.. u0.x[i] = a1 * u1.x[i] + a2 * u2.x[i] + a3 * u3.x[i] + u4.x[i]
    end
    return nothing
end

Error & Stacktrace ⚠️

julia> run_benchmarks(3, 1000)
[ Info: Base broadcast
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.018 μs … 4.394 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.093 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.145 μs ± 216.232 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆▅▁██▂                 ▁ ▁▂ ▂                               ▂
  ██████▄▁▁▁▁▁▁▁▁▃▄▁▁▁▆▆▇███████▇█▆▆▆▄▄▄▄▁▅▄▄▇▆▇▆▇▇▅▆▆▅▆▅▇▆█▇ █
  2.02 μs      Histogram: log(frequency) by time      3.16 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
[ Info: FastBroadcast.jl
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.023 μs … 4.519 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.093 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.107 μs ± 108.084 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▃▃  ▆█▇▅▁▂                                                 ▂
  ███▇▅██████▇▆▅▆▄▆▄▃▄▃▄▆▁▅▅▆▄▃▄▄▃▅▁▃▄▄▅▃▄▄▆▅▄▄▅▆▄▅▅▆▆▆▄▇▆▆▇▆ █
  2.02 μs      Histogram: log(frequency) by time      2.62 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
[ Info: Own base broadcast
BenchmarkTools.Trial: 10000 samples with 496 evaluations.
 Range (min … max):  221.101 ns … 431.115 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     227.823 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   232.615 ns ±   8.997 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▁█▄▄▂▂▃  ▁▁▃▁▁▁▂▁   ▁▁▁▁▁▁▁   ▁                       ▁
  ▆▃▂▅▃▂▃▄█████████████████████████████████████▇██▇▇▆▆▆▆▇▆▆▆▇▅▆ █
  221 ns        Histogram: log(frequency) by time        263 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
[ Info: Own fast broadcast
BenchmarkTools.Trial: 10000 samples with 511 evaluations.
 Range (min … max):  218.280 ns … 419.112 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     224.967 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   229.016 ns ±   8.731 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂       █▇▄▅▃▃▂▂▂▂▄▂▂▃▂▁ ▂▁▁▁▂▂▁▁▁    ▁  ▁ ▁                  ▂
  █▁▃▄▁▃▁▄███████████████████████████▇███████████▇▇▇▇█▇▇▇▇▆▇▇▆▆ █
  218 ns        Histogram: log(frequency) by time        262 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Environment (please complete the following information):

  • Output of using Pkg; Pkg.status()
julia> using Pkg; Pkg.status()
Status `~/..../Project.toml`
  [6e4b80f9] BenchmarkTools v1.5.0
  [7034ab61] FastBroadcast v0.3.5
  [46d2c3a1] MuladdMacro v0.2.4
  [e4faabce] PProf v3.1.0
  [c46f51b8] ProfileView v1.8.0
  [731186ca] RecursiveArrayTools v3.27.0
  • Output of using Pkg; Pkg.status(; mode = PKGMODE_MANIFEST)
  [1520ce14] AbstractTrees v0.4.5
  [7d9f7c33] Accessors v0.1.37
  [79e6a3ab] Adapt v4.0.4
  [4fba245c] ArrayInterface v7.15.0
  [6e4b80f9] BenchmarkTools v1.5.0
  [d1d4a3ce] BitFlags v0.1.9
  [62783981] BitTwiddlingConvenienceFunctions v0.1.6
  [e1450e63] BufferedStreams v1.2.2
  [fa961155] CEnum v0.5.0
  [2a0fbf3d] CPUSummary v0.2.6
  [159f3aea] Cairo v1.1.0
  [fb6a15b2] CloseOpenIntervals v0.1.13
  [944b1d66] CodecZlib v0.7.6
  [3da002f7] ColorTypes v0.11.5
  [5ae59095] Colors v0.12.11
  [f70d9fcc] CommonWorldInvalidations v1.0.0
  [34da2185] Compat v4.16.0
  [a33af91c] CompositionsBase v0.1.2
  [187b0558] ConstructionBase v1.5.6
  [adafc99b] CpuId v0.3.1
  [9a962f9c] DataAPI v1.16.0
  [e2d170a0] DataValueInterfaces v1.0.0
  [ffbed154] DocStringExtensions v0.9.3
  [4e289a0a] EnumX v1.0.4
  [e2ba6199] ExprTools v0.1.10
  [7034ab61] FastBroadcast v0.3.5
  [5789e2e9] FileIO v1.16.3
  [53c48c17] FixedPointNumbers v0.8.5
  [08572546] FlameGraphs v1.0.0
  [46192b85] GPUArraysCore v0.1.6
  [a2bd30eb] Graphics v1.1.2
⌃ [9db2cae5] Gtk4 v0.6.1
  [8710efd8] GtkObservables v2.1.1
  [615f187c] IfElse v0.1.1
  [9b13fd28] IndirectArrays v1.0.0
  [8197267c] IntervalSets v0.7.10
  [3587e190] InverseFunctions v0.1.16
  [82899510] IteratorInterfaceExtensions v1.0.0
  [692b3bcd] JLLWrappers v1.5.0
  [682c06a0] JSON v0.21.4
  [10f19ff3] LayoutPointers v0.1.17
  [1d6d02ad] LeftChildRightSiblingTrees v0.2.0
  [1914dd2f] MacroTools v0.5.13
  [d125e4d3] ManualMemory v0.1.8
  [85b6ec6f] MethodAnalysis v0.4.13
  [46d2c3a1] MuladdMacro v0.2.4
  [77ba4419] NaNMath v1.0.2
  [510215fc] Observables v0.5.5
  [bac558e1] OrderedCollections v1.6.3
  [e4faabce] PProf v3.1.0
  [69de0a69] Parsers v2.8.1
  [f517fe37] Polyester v0.7.16
  [1d0040c9] PolyesterWeave v0.2.2
  [aea7be01] PrecompileTools v1.2.1
  [21216c6a] Preferences v1.4.3
  [c46f51b8] ProfileView v1.8.0
  [92933f4c] ProgressMeter v1.10.2
  [3349acd9] ProtoBuf v1.0.16
  [3cdcf5f2] RecipesBase v1.3.4
  [731186ca] RecursiveArrayTools v3.27.0
  [189a3867] Reexport v1.2.2
  [ae029012] Requires v1.3.0
  [d5f540fe] RoundingIntegers v1.1.0
  [7e49a35a] RuntimeGeneratedFunctions v0.5.13
  [94e857df] SIMDTypes v0.1.0
  [6c6a2e73] Scratch v1.2.1
  [aedffcd0] Static v1.1.1
  [0d7ed370] StaticArrayInterface v1.8.0
  [1e83bf80] StaticArraysCore v1.4.3
  [7792a7ef] StrideArraysCore v0.5.7
  [2efcf032] SymbolicIndexingInterface v0.3.28
  [3783bdb8] TableTraits v1.0.1
  [bd369af6] Tables v1.12.0
  [8290d209] ThreadingUtilities v0.5.2
  [3bb67fe8] TranscodingStreams v0.11.2
  [6e34b625] Bzip2_jll v1.0.8+1
  [83423d85] Cairo_jll v1.18.0+2
  [2702e6a9] EpollShim_jll v0.0.20230411+0
  [2e619515] Expat_jll v2.6.2+0
  [a3f928ae] Fontconfig_jll v2.13.96+0
  [d7e528f0] FreeType2_jll v2.13.2+0
  [559328eb] FriBidi_jll v1.0.14+0
  [6ebb71f1] GTK4_jll v4.12.5+0
  [78b55507] Gettext_jll v0.21.0+0
  [7746bdde] Glib_jll v2.80.2+0
  [75302f13] Graphene_jll v1.10.6+0
  [3b182d85] Graphite2_jll v1.3.14+0
  [3c863552] Graphviz_jll v2.50.0+1
  [2e76f6c2] HarfBuzz_jll v2.8.1+1
  [aacddb02] JpegTurbo_jll v3.0.3+0
⌅ [88015f11] LERC_jll v3.0.0+1
  [1d63c593] LLVMOpenMP_jll v17.0.6+0
  [dd4b983a] LZO_jll v2.10.2+0
  [42c93a91] Libepoxy_jll v1.5.10+0
⌅ [e9f186c6] Libffi_jll v3.2.2+1
  [d4300ac3] Libgcrypt_jll v1.8.11+0
  [7e76a0d4] Libglvnd_jll v1.6.0+0
  [7add5ba3] Libgpg_error_jll v1.49.0+0
  [94ce4f54] Libiconv_jll v1.17.0+0
  [4b2f31a3] Libmount_jll v2.40.1+0
  [925c91fb] Librsvg_jll v2.54.5+0
  [89763e89] Libtiff_jll v4.6.0+0
  [38a345b3] Libuuid_jll v2.40.1+0
  [36c8627f] Pango_jll v1.52.2+0
  [30392449] Pixman_jll v0.43.4+0
  [a2964d1f] Wayland_jll v1.21.0+1
  [2381bf8a] Wayland_protocols_jll v1.31.0+0
  [02c8fc9c] XML2_jll v2.13.1+0
  [aed1982a] XSLT_jll v1.1.41+0
  [ffd25f8a] XZ_jll v5.4.6+0
  [4f6342f7] Xorg_libX11_jll v1.8.6+0
  [0c0b7dd1] Xorg_libXau_jll v1.0.11+0
  [935fb764] Xorg_libXcursor_jll v1.2.0+4
  [0aeada51] Xorg_libXdamage_jll v1.1.5+4
  [a3789734] Xorg_libXdmcp_jll v1.1.4+0
  [1082639a] Xorg_libXext_jll v1.3.6+0
  [d091e8ba] Xorg_libXfixes_jll v5.0.3+4
  [a51aa0fd] Xorg_libXi_jll v1.7.10+4
  [d1454406] Xorg_libXinerama_jll v1.1.4+4
  [ec84b674] Xorg_libXrandr_jll v1.5.2+4
  [ea2f1a96] Xorg_libXrender_jll v0.9.11+0
  [14d82f49] Xorg_libpthread_stubs_jll v0.1.1+0
  [c7cfdc94] Xorg_libxcb_jll v1.17.0+0
  [cc61e674] Xorg_libxkbfile_jll v1.1.2+0
  [35661453] Xorg_xkbcomp_jll v1.4.6+0
  [33bec58e] Xorg_xkeyboard_config_jll v2.39.0+0
  [c5fb5394] Xorg_xtrans_jll v1.5.0+0
  [3161d3a3] Zstd_jll v1.5.6+0
  [b437f822] adwaita_icon_theme_jll v43.0.1+0
  [da03df04] gdk_pixbuf_jll v2.42.10+0
  [059c91fe] hicolor_icon_theme_jll v0.17.0+3
  [bf975903] iso_codes_jll v4.15.1+0
  [b53b4c65] libpng_jll v1.6.43+1
  [cf2c5f97] pprof_jll v1.0.1+0
  [d8fb68d0] xkbcommon_jll v1.4.1+1
  [0dad84c5] ArgTools v1.1.1
  [56f22d72] Artifacts
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8ba89e20] Distributed
  [f43a241f] Downloads v1.6.0
  [7b1f6079] FileWatching
  [b77e0a4c] InteractiveUtils
  [b27032c2] LibCURL v0.6.4
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [a63ad114] Mmap
  [ca575930] NetworkOptions v1.2.0
  [44cfe95a] Pkg v1.10.0
  [de0858da] Printf
  [9abbd945] Profile
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA v0.7.0
  [9e88b42a] Serialization
  [6462fe0b] Sockets
  [2f01184e] SparseArrays v1.10.0
  [10745b16] Statistics v1.10.0
  [fa267f1f] TOML v1.0.3
  [a4e569a6] Tar v1.10.0
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
  [e66e0078] CompilerSupportLibraries_jll v1.1.1+0
  [deac9b47] LibCURL_jll v8.4.0+0
  [e37daf67] LibGit2_jll v1.6.4+0
  [29816b5a] LibSSH2_jll v1.11.0+1
  [c8ffd9c3] MbedTLS_jll v2.28.2+1
  [14a3606d] MozillaCACerts_jll v2023.1.10
  [4536629a] OpenBLAS_jll v0.3.23+4
  [05823500] OpenLibm_jll v0.8.1+2
  [efcefdf7] PCRE2_jll v10.42.0+1
  [bea87d4a] SuiteSparse_jll v7.2.1+1
  [83775a58] Zlib_jll v1.2.13+1
  [8e850b90] libblastrampoline_jll v5.8.0+1
  [8e850ede] nghttp2_jll v1.52.0+1
  [3f19e933] p7zip_jll v17.4.0+2
  • Output of versioninfo()
julia> versioninfo()
Julia Version 1.10.4
Commit 48d4fd48430 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release

Additional context

Using Cthulhu.jl, the basic difference in the underlying method

https://github.com/JuliaLang/julia/blob/306cee71560e24d26585fd1143a2aacac41b5508/base/broadcast.jl#L961-L977

seems to be the type of bc (a standalone sketch of the difference follows the list):

  • the current version with RecursiveArrayTools.jl gives bc::Base.Broadcast.Broadcasted{Nothing, Nothing, typeof(*), Tuple{Float64, Vector{Float64}}}
  • the manual version gives Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(*), Tuple{Float64, Vector{Float64}}}
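
As a standalone illustration (not part of the measurement above; it only uses the Broadcasted constructors from Base on Julia 1.10, and all names in the snippet are made up for the example), the two Axes type parameters can be reproduced in isolation, which also makes it easy to time the resulting copyto! calls directly:

# Illustrative sketch: the same broadcast arguments with the axes field dropped
# (Axes === Nothing) vs. carried along (Axes === Tuple{Base.OneTo{Int64}}).
using Base.Broadcast: Broadcasted
using BenchmarkTools: @btime

x = randn(1000)
y = similar(x)

bc_noaxes = Broadcasted{Nothing}(*, (2.0, x))           # axes dropped, like the RecursiveArrayTools.jl path
bc_axes   = Broadcasted{Nothing}(*, (2.0, x), axes(x))  # concrete axes, like the manual per-partition version

typeof(bc_noaxes)  # Broadcasted{Nothing, Nothing, typeof(*), Tuple{Float64, Vector{Float64}}}
typeof(bc_axes)    # Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(*), Tuple{Float64, Vector{Float64}}}

# Both hit copyto!(dest::AbstractArray, bc::Broadcasted{Nothing}); timing them in
# isolation is one way to check how much the Axes parameter alone contributes
# (no timings for this snippet are claimed here).
@btime copyto!($y, $bc_noaxes)
@btime copyto!($y, $bc_axes)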

For reference, the basic setup of the current broadcasting infrastructure seems to have been laid out in #41.

ranocha added the bug label Aug 18, 2024

ranocha commented Aug 18, 2024

@vchuravy I know, it's pretty annoying to ask you something about some code you wrote six years ago 🙈

# drop axes because it is easier to recompute

Could you please explain this comment? As far as I can tell, dropping the Axes type parameter (Nothing instead of Tuple{Base.OneTo{Int64}}) is the only difference I can spot between the two versions in the Cthulhu.jl output for the underlying copyto!(dest::Vector{Float64}, ...) calls.
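
For context, "recompute" presumably refers to Base's fallback when the axes field is nothing: axes(bc) then combines the axes of the arguments on demand, e.g. (illustrative snippet, Base API only):

# When bc.axes === nothing, Base recomputes the axes from the arguments on demand.
using Base.Broadcast: Broadcasted, combine_axes

x = randn(1000)
bc = Broadcasted(*, (2.0, x))      # axes argument left as the default nothing
bc.axes === nothing                # true
axes(bc) == combine_axes(2.0, x) == (Base.OneTo(1000),)  # true: recomputed from the args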

Current version from RecursiveArrayTools.jl

copyto!(dest::AbstractArray, bc::Base.Broadcast.Broadcasted{Nothing}) @ Base.Broadcast broadcast.jl:991
 991 function copyto!(dest::Vector{Float64}::AbstractArray, bc::Base.Broadcast.Broadcasted{Nothing, Nothing, typeof(*), Tuple{Float64, Vector{Float64}}}::Broadcasted{Nothing})::Vector{Float64}
 992     (axes(dest) == axes(bc))::Bool || throwdm(axes(dest), axes(bc))
 993     # Performance optimization: broadcast!(identity, dest, A) is equivalent to copyto!(dest, A) if indices match
 994     if bc::Base.Broadcast.Broadcasted{Nothing, Nothing, typeof(*), Tuple{Float64, Vector{Float64}}}.f::typeof(*) === identity && bc.args isa Tuple{AbstractArray} # only a single input argument to broadcast!
 995         A = bc.args[1]
 996         if axes(dest) == axes(A)
 997             return copyto!(dest, A)
 998         end
 999     end
1000     bc′ = preprocess(dest::Vector{Float64}, bc::Base.Broadcast.Broadcasted{Nothing, Nothing, typeof(*), Tuple{Float64, Vector{Float64}}})::Base.Broadcast.Broadcasted{Nothing, Nothing, typeof(*), Tuple{Float64, Base.Broadcast.Extruded{Vector{Float64}, Tuple{Bool}, Tuple{Int64}}}}
1001     # Performance may vary depending on whether `@inbounds` is placed outside the
1002     # for loop or not. (cf. https://github.com/JuliaLang/julia/issues/38086)
1003     @inbounds @simd for I in eachindex(bc′::Base.Broadcast.Broadcasted{Nothing, Nothing, typeof(*), Tuple{Float64, Base.Broadcast.Extruded{Vector{Float64}, Tuple{Bool}, Tuple{Int64}}}})::Base.OneTo{Int64}
1004         dest[I] = bc′::Base.Broadcast.Broadcasted{Nothing, Nothing, typeof(*), Tuple{Float64, Base.Broadcast.Extruded{Vector{Float64}, Tuple{Bool}, Tuple{Int64}}}}[I::Int64]::Float64
1005     end
1006     return dest::Vector{Float64}
1007 end

A profile shows that most time is spent in setindex!.

Manual version

copyto!(dest::AbstractArray, bc::Base.Broadcast.Broadcasted{Nothing}) @ Base.Broadcast broadcast.jl:991
 991 function copyto!(dest::Vector{Float64}::AbstractArray, bc::Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(*), Tuple{Float64, Vector{Float64}}}::Broadcasted{Nothing})::Vector{Float64}
 992     (axes(dest) == axes(bc))::Bool || throwdm(axes(dest), axes(bc))
 993     # Performance optimization: broadcast!(identity, dest, A) is equivalent to copyto!(dest, A) if indices match
 994     if bc::Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(*), Tuple{Float64, Vector{Float64}}}.f::typeof(*) === identity && bc.args isa Tuple{AbstractArray} # only a single input argument to broadcast!
 995         A = bc.args[1]
 996         if axes(dest) == axes(A)
 997             return copyto!(dest, A)
 998         end
 999     end
1000     bc′ = preprocess(dest::Vector{Float64}, bc::Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(*), Tuple{Float64, Vector{Float64}}})::Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(*), Tuple{Float64, Base.Broadcast.Extruded{Vector{Float64}, Tuple{Bool}, Tuple{Int64}}}}
1001     # Performance may vary depending on whether `@inbounds` is placed outside the
1002     # for loop or not. (cf. https://github.com/JuliaLang/julia/issues/38086)
1003     @inbounds @simd for I in eachindex(bc′::Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(*), Tuple{Float64, Base.Broadcast.Extruded{Vector{Float64}, Tuple{Bool}, Tuple{Int64}}}})::Base.OneTo{Int64}
1004         dest[I] = bc′::Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(*), Tuple{Float64, Base.Broadcast.Extruded{Vector{Float64}, Tuple{Bool}, Tuple{Int64}}}}[I::Int64]::Float64
1005     end
1006     return dest::Vector{Float64}
1007 end

A profile shows that most time is spent in getindex, i.e., calculating the broadcasted expression value.
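
For completeness, here is a sketch of how such a flame graph can be collected (it assumes the MRE definitions from the top of this issue are loaded; ProfileView.jl is in the environment listed below):

# Sketch: collect a flame graph for one of the kernels from the MRE (here base_broadcast!).
using Profile, ProfileView
using RecursiveArrayTools: ArrayPartition

u0, u1, u2, u3, u4 = (ArrayPartition(ntuple(_ -> randn(1000), 3)...) for _ in 1:5)
a1, a2, a3 = randn(3)

function profile_kernel(n, u0, u1, u2, u3, u4, a1, a2, a3)
    for _ in 1:n
        base_broadcast!(u0, u1, u2, u3, u4, a1, a2, a3)
    end
    return nothing
end

profile_kernel(10, u0, u1, u2, u3, u4, a1, a2, a3)            # compile first
Profile.clear()
@profile profile_kernel(100_000, u0, u1, u2, u3, u4, a1, a2, a3)
ProfileView.view()    # inspect which frames (setindex!, getindex, ...) dominate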


ranocha commented Aug 18, 2024

Maybe this is the wrong path... With the diff

diff --git a/src/array_partition.jl b/src/array_partition.jl
index 7bd8bb5..52ba89a 100644
--- a/src/array_partition.jl
+++ b/src/array_partition.jl
@@ -357,7 +357,7 @@ end
 }
     N = npartitions(dest, bc)
     @inline function f(i)
-        copyto!(dest.x[i], unpack(bc, i))
+        copyto!(dest.x[i], unpack(bc, i, axes(dest.x[i])))
     end
     ntuple(f, Val(N))
     dest
@@ -383,20 +383,20 @@ _npartitions(args::Tuple{Any}) = npartitions(args[1])
 _npartitions(args::Tuple{}) = 0
 
 # drop axes because it is easier to recompute
-@inline function unpack(bc::Broadcast.Broadcasted{Style}, i) where {Style}
-    Broadcast.Broadcasted(bc.f, unpack_args(i, bc.args))
+@inline function unpack(bc::Broadcast.Broadcasted{Style}, i, axes = nothing) where {Style}
+    Broadcast.Broadcasted(bc.f, unpack_args(i, bc.args), axes)
 end
 @inline function unpack(bc::Broadcast.Broadcasted{ArrayPartitionStyle{Style}},
-        i) where {Style}
-    Broadcast.Broadcasted(bc.f, unpack_args(i, bc.args))
+        i, axes = nothing) where {Style}
+    Broadcast.Broadcasted(bc.f, unpack_args(i, bc.args), axes)
 end
 @inline function unpack(bc::Broadcast.Broadcasted{Style},
-        i) where {Style <: Broadcast.DefaultArrayStyle}
-    Broadcast.Broadcasted{Style}(bc.f, unpack_args(i, bc.args))
+        i, axes = nothing) where {Style <: Broadcast.DefaultArrayStyle}
+    Broadcast.Broadcasted{Style}(bc.f, unpack_args(i, bc.args), axes)
 end
 @inline function unpack(bc::Broadcast.Broadcasted{ArrayPartitionStyle{Style}},
-        i) where {Style <: Broadcast.DefaultArrayStyle}
-    Broadcast.Broadcasted{Style}(bc.f, unpack_args(i, bc.args))
+        i, axes = nothing) where {Style <: Broadcast.DefaultArrayStyle}
+    Broadcast.Broadcasted{Style}(bc.f, unpack_args(i, bc.args), axes)
 end
 unpack(x, ::Any) = x
 unpack(x::ArrayPartition, i) = x.x[i]

I get the same Cthulhu.jl output

copyto!(dest::AbstractArray, bc::Base.Broadcast.Broadcasted{Nothing}) @ Base.Broadcast broadcast.jl:991
 991 function copyto!(dest::Vector{Float64}::AbstractArray, bc::Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(*), Tuple{Float64, Vector{Float64}}}::Broadcasted{Nothing})::Vector{Float64}
 992     (axes(dest) == axes(bc))::Bool || throwdm(axes(dest), axes(bc))
 993     # Performance optimization: broadcast!(identity, dest, A) is equivalent to copyto!(dest, A) if indices match
 994     if bc::Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(*), Tuple{Float64, Vector{Float64}}}.f::typeof(*) === identity && bc.args isa Tuple{AbstractArray} # only a single input argument to broadcast!
 995         A = bc.args[1]
 996         if axes(dest) == axes(A)
 997             return copyto!(dest, A)
 998         end
 999     end
1000     bc′ = preprocess(dest::Vector{Float64}, bc::Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(*), Tuple{Float64, Vector{Float64}}})::Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(*), Tuple{Float64, Base.Broadcast.Extruded{Vector{Float64}, Tuple{Bool}, Tuple{Int64}}}}
1001     # Performance may vary depending on whether `@inbounds` is placed outside the
1002     # for loop or not. (cf. https://github.com/JuliaLang/julia/issues/38086)
1003     @inbounds @simd for I in eachindex(bc′::Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(*), Tuple{Float64, Base.Broadcast.Extruded{Vector{Float64}, Tuple{Bool}, Tuple{Int64}}}})::Base.OneTo{Int64}
1004         dest[I] = bc′::Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(*), Tuple{Float64, Base.Broadcast.Extruded{Vector{Float64}, Tuple{Bool}, Tuple{Int64}}}}[I::Int64]::Float64
1005     end
1006     return dest::Vector{Float64}
1007 end

as with the manual version...

vchuravy (Contributor) commented

The code here probably comes from some old Distributed arrays code I was working on many years ago. The Base implementation has changed slightly since then; there is now an output style. I would need to dig in a fair bit to understand this again.
