Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large regression with StaticArrays in v1.11 #1102

Closed
kaipartmann opened this issue Oct 22, 2024 · 9 comments
Closed

Large regression with StaticArrays in v1.11 #1102

kaipartmann opened this issue Oct 22, 2024 · 9 comments
Labels
arrays [a, r, r, a, y, s]

Comments

@kaipartmann
Copy link

As also described in JuliaArrays/StaticArrays.jl#1282, there is a large performance regression with v1.11 when using StaticArrays:

using LinearAlgebra, StaticArrays, BenchmarkTools

function mwe1(a, X, n)
    K = zeros(SMatrix{3,3})
    for i in 1:n
        k = a * i
        K += k * X * X'
    end
    return K
end

@btime mwe1(1e-5, SVector{3}(1.0, 1.0, 1.0), 1_000_000);
❯ julia +1.10.5 --project -t 6 mwe.jl 
  1.070 ms (0 allocations: 0 bytes)

❯ julia +1.11.1 --project -t 6 mwe.jl 
  129.890 ms (7000000 allocations: 289.92 MiB)

Interestingly, the problem can be solved by changing k * X * X' to k * (X * X'):

function mwe3(a, X, n)
    K = zeros(SMatrix{3,3})
    for i in 1:n
        k = a * i
        K += k * (X * X')
    end
    return K
end
@btime mwe3(1e-5, SVector{3}(1.0, 1.0, 1.0), 1_000_000);
❯ julia +1.10.5 --project -t 6 mwe.jl 
  805.458 μs (0 allocations: 0 bytes)

❯ julia +1.11.1 --project -t 6 mwe.jl
  708.208 μs (0 allocations: 0 bytes)
@ronisbr
Copy link
Member

ronisbr commented Oct 22, 2024

I can reproduce those results in macOS. I found some interesting scenarios based on the proposed MWE:

If I use this version, I see all the allocations:

function mwe2(a, X, n)
    local K
    for i in 1:n
        k = a * i
        K = k * X * X'
    end
    return K
end

julia> @btime mwe2(1e-5, SVector{3}(1.0, 1.0, 1.0), 1_000_000)
  163.540 ms (7000000 allocations: 289.92 MiB)
3×3 SMatrix{3, 3, Float64, 9} with indices SOneTo(3)×SOneTo(3):
 10.0  10.0  10.0
 10.0  10.0  10.0
 10.0  10.0  10.0

However, if I suppress the local variable k, everything works:

function mwe3(a, X, n)
    local K
    for i in 1:n
        K = a * i * X * X'
    end
    return K
end

julia> @btime mwe3(1e-5, SVector{3}(1.0, 1.0, 1.0), 1_000_000)
  1.916 ns (0 allocations: 0 bytes)
3×3 SMatrix{3, 3, Float64, 9} with indices SOneTo(3)×SOneTo(3):
 10.0  10.0  10.0
 10.0  10.0  10.0
 10.0  10.0  10.0

@gbaraldi
Copy link
Member

This is an inlining change
│ %31 = invoke LinearAlgebra.broadcast(LinearAlgebra.:*::typeof(*), %29::Float64, X::SVector{3, Float64}, %30::Vararg{Any})::SMatrix{3, 3, Float64, 9}
no longer gets inlined and we allocate because of it.
Changing the code to this

function mwe1(a, X, n)
           K = zeros(SMatrix{3,3})
           for i in 1:n
               k = a * i
               K1 = k * X
               K += @inline K1* X'
           end
           return K
       end

fixes it and its actually better

@ronisbr
Copy link
Member

ronisbr commented Oct 22, 2024

Hi @gbaraldi !

no longer gets inlined and we allocate because of it.

Will those allocations happen on every call to that function (scalar x vector product) ? Or does it depend on the scenario? If so, I think I might experience a huge amount of performance degradation in our simulations. Is it something that might be reverted or fixed in 1.11?

@nsajko nsajko added the arrays [a, r, r, a, y, s] label Oct 22, 2024
@nsajko
Copy link
Contributor

nsajko commented Oct 22, 2024

@gbaraldi what's the point with your example? As far as I see it's using different code paths so it seems irrelevant here?

To be specific, the MWE is using a three-argument * method, while your example only calls two-argument * methods.

@dkarrasch
Copy link
Member

I don't remember in which release cycle (could well be in v1.11), but we introduced three-arg * specializations, and that "generic" choice seems to be a bad one for static arrays, as changing the order of multiplication makes a gigantic difference (mwe1 vs mwe3 in the OP). I'm not sure if we want to change the order of multiplication generically, or whether this should be specified by the user.

@mcabbott
Copy link
Collaborator

mcabbott commented Oct 25, 2024

3-arg * was new in Julia 1.7. That did have a small negative effect on square SMatrices, a few ns, and at the time I made a PR to address this omitting the size-checks, JuliaArrays/StaticArrays.jl#919 (not merged).

Edit: sorry this is scalar-vector-adjointvec, which now calls broadcast(*, k, X, X'). Not sure whether there was any regression on 1.7 for SVectors. The earlier behaviour would have been K += (k * X) * X', which seems as fast as mwe3's K += k * (X * X') now.

I am a bit surprised that the version with a * i * X * X' is fast, as this does not call a 4-arg * method, but instead to (a * i) * X * X' and then the same 3-arg * via broadcast. But perhaps the inlining is different.

@dkarrasch
Copy link
Member

dkarrasch commented Oct 25, 2024

Time flies. 🤦

@nsajko: @gbaraldi's quote stems from an inspection of @code_typed mwe1(1e-5, SVector{3}(1.0, 1.0, 1.0), 1_000_000), and the corresponding broadcast call is absent on v1.10. Instead, one finds the inlined code for all the products and additions there. So, indeed, it seems that inlining thresholds have changed, and splitting the multiplications (as he suggested) improves performance beyond what one can achieve with setting brackets explicitly (EDIT: the @inline annotation doesn't even seem to make a difference for me).

@dkarrasch
Copy link
Member

Should we close this, or is there anything actionable here?

@KristofferC KristofferC transferred this issue from JuliaLang/julia Nov 26, 2024
@ViralBShah
Copy link
Member

Closing - but please reopen if necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrays [a, r, r, a, y, s]
Projects
None yet
Development

No branches or pull requests

7 participants