
Use VectorOfArray in wrap_array for DGMulti solvers #2150

Draft · wants to merge 46 commits into main

Conversation

jlchan
Contributor

@jlchan jlchan commented Nov 7, 2024

@huiyuxie FYI this PR goes towards addressing #1789.

Contributor

github-actions bot commented Nov 7, 2024

Review checklist

This checklist is meant to assist creators of PRs (to let them know what reviewers will typically look for) and reviewers (to guide them in a structured review process). Items do not need to be checked explicitly for a PR to be eligible for merging.

Purpose and scope

  • The PR has a single goal that is clear from the PR title and/or description.
  • All code changes represent a single set of modifications that logically belong together.
  • No more than 500 lines of code are changed or there is no obvious way to split the PR into multiple PRs.

Code quality

  • The code can be understood easily.
  • Newly introduced names for variables etc. are self-descriptive and consistent with existing naming conventions.
  • There are no redundancies that can be removed by simple modularization/refactoring.
  • There are no leftover debug statements or commented code sections.
  • The code adheres to our conventions and style guide, and to the Julia guidelines.

Documentation

  • New functions and types are documented with a docstring or top-level comment.
  • Relevant publications are referenced in docstrings (see example for formatting).
  • Inline comments are used to document longer or unusual code sections.
  • Comments describe intent ("why?") and not just functionality ("what?").
  • If the PR introduces a significant change or new feature, it is documented in NEWS.md with its PR number.

Testing

  • The PR passes all tests.
  • New or modified lines of code are covered by tests.
  • New or modified tests run in less than 10 seconds.

Performance

  • There are no type instabilities or memory allocations in performance-critical parts.
  • If the PR intent is to improve performance, before/after time measurements are posted in the PR.

Verification

  • The correctness of the code was verified using appropriate tests.
  • If new equations/methods are added, a convergence test has been run and the results are posted in the PR.

Created with ❤️ by the Trixi.jl community.


codecov bot commented Nov 7, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.44%. Comparing base (91eaaf6) to head (9b21e12).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2150      +/-   ##
==========================================
- Coverage   96.36%   95.44%   -0.92%     
==========================================
  Files         480      480              
  Lines       38028    38025       -3     
==========================================
- Hits        36645    36292     -353     
- Misses       1383     1733     +350     
Flag        Coverage Δ
unittests   95.44% <ø> (-0.92%) ⬇️



@jlchan
Contributor Author

jlchan commented Nov 7, 2024

@DanielDoehring if I remember correctly, you implemented @allocated tests, right?

I am running into some @allocated failures (for example https://github.com/trixi-framework/Trixi.jl/actions/runs/11732466598/job/32684751742?pr=2150)

elixir_euler_weakform.jl (SBP, EC): Test Failed at /home/runner/work/Trixi.jl/Trixi.jl/test/test_threaded.jl:388
  Expression: #= /home/runner/work/Trixi.jl/Trixi.jl/test/test_threaded.jl:388 =# @allocated(Trixi.rhs!(du_ode, u_ode, semi, t)) < 5000
   Evaluated: 118704 < 5000

However, at least locally, this appears to only happen the first time Trixi.rhs! is run. I tried running Trixi.rhs! to avoid these allocations in CI, but it doesn't seem to work.

Did you run into this issue before?
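A common cause of first-call-only `@allocated` failures is that JIT compilation of the measured call gets counted by the measurement. As a minimal, Trixi-independent sketch of the usual warm-up pattern (using a hypothetical stand-in kernel instead of `Trixi.rhs!`):

```julia
# Stand-in for an in-place RHS kernel; not Trixi code.
f!(du, u) = (du .= 2 .* u; nothing)

function measure_allocs()
    u = rand(100)
    du = similar(u)
    f!(du, u)                    # warm-up: compilation allocates here
    return @allocated f!(du, u)  # steady-state call: typically 0 bytes
end

measure_allocs()
```

This mirrors what the Trixi allocation tests already try to do, so if the warm-up call is present and CI still reports allocations, the extra bytes are coming from somewhere other than compilation of `rhs!` itself.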

jlchan and others added 9 commits November 7, 2024 17:17
Update test/test_dgmulti_2d.jl
Apply suggestions from code review
@DanielDoehring
Contributor

However, at least locally, this appears to only happen the first time Trixi.rhs! is run. I tried running Trixi.rhs! to avoid these allocations in CI, but it doesn't seem to work.

Did you run into this issue before?

Hm, I have not encountered this. Can you maybe track down the allocations using the SummaryCallback? I guess there must be some type instability. But I have no idea why the behaviour after executing rhs! once is different depending on the machine.

@jlchan
Contributor Author

jlchan commented Nov 8, 2024

Hm, I have not encountered this. Can you maybe track down the allocations using the SummaryCallback ? I guess there must be some type instability. But I have no idea why the behaviour after executing rhs! once is different depending on the machine.

There doesn't seem to be a type instability. There's a weird GC call stack that @profview_allocs shows, I have no idea what's going on. I'll bring it up at the next Trixi meeting.

@DanielDoehring
Contributor

There doesn't seem to be a type instability. There's a weird GC call stack that @profview_allocs shows, I have no idea what's going on. I'll bring it up at the next Trixi meeting.

Oh wow, that sounds strange. So the allocations are only present in the test "environment"?

@jlchan
Contributor Author

jlchan commented Nov 10, 2024

There doesn't seem to be a type instability. There's a weird GC call stack that @profview_allocs shows, I have no idea what's going on. I'll bring it up at the next Trixi meeting.

Oh wow, that sounds strange. So the allocations are only present in the test "environment"?

Not exactly - on my machine, the allocations disappear after running rhs! once. In the test environment (or on CI, I can't tell) this isn't the case.

Here's the output from Profile.Allocs.@profile sample_rate=0.001 Trixi.rhs!(du, ode.u0, semi, 0.0) printed to a txt file: profile_allocs.txt.

Project.toml (resolved)
src/auxiliary/precompile.jl (outdated, resolved)
test/test_dgmulti_1d.jl (resolved)
@huiyuxie
Member

https://github.com/JuliaLang/julia/blob/3318941e585db632423366b8b703ea55a6ba8421/base/timing.jl#L479-L489
If the second call allocates less than the first within a single session, the GC might have been triggered between the two gc_bytes calls. But I also suspect it could be caused by type instability introduced by RecursiveArrayTools.jl. Why do you think it doesn’t seem to be a type instability issue?
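For illustration, `@allocated` is essentially a pair of allocation-counter reads around the call (see the linked `base/timing.jl`), which is why compilation or any other allocation happening in between gets attributed to the measured expression. A rough, simplified sketch, not the actual macro implementation:

```julia
# Simplified sketch of the mechanism behind Base.@allocated; the real macro
# in base/timing.jl uses per-thread counters and wraps the expression in a
# local function.
function allocated_sketch(f)
    b0 = Base.gc_bytes()   # total bytes allocated so far
    f()
    b1 = Base.gc_bytes()
    return b1 - b0         # includes any compilation triggered by f
end

allocated_sketch(() -> zeros(1000))  # at least the 8000-byte array payload
```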

@huiyuxie
Member

Here's the output from Profile.Allocs.@profile sample_rate=0.001 Trixi.rhs!(du, ode.u0, semi, 0.0) printed to a txt file: profile_allocs.txt.

The text file only shows half of the information on each line. Could you provide it in a different way?

@jlchan
Contributor Author

jlchan commented Nov 11, 2024

Why do you think it doesn’t seem to be a type instability issue?

I checked @code_warntype on the first call; nothing is red and I don't see an Any instance anywhere. If you have any suggestions for where else to check, please let me know.
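As a toy illustration of that kind of check (not Trixi code), `@code_warntype` highlights non-concrete inferred types, and `Base.return_types` gives the same information programmatically:

```julia
using InteractiveUtils  # provides @code_warntype outside the REPL

# Toy example: the return type depends on a runtime value, so inference
# yields Union{Float64, Int64} and @code_warntype flags it in red.
unstable(x) = x > 0 ? 1 : 1.0

@code_warntype unstable(1)

# Programmatic equivalent via inference:
Base.return_types(unstable, (Int,))  # [Union{Float64, Int64}]
```

If `rhs!` showed nothing like this, the allocations are more likely coming from the caller's side (e.g., how the integrator's caches are constructed) than from the kernel itself.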

The text file only shows half of the information on each line. Could you provide it in a different way?

Thanks - how is this?
profile_allocs.txt

@huiyuxie
Member

Same issue - how about telling me how to reproduce it? Use your latest commit and then run Profile.Allocs.@profile sample_rate=0.001 Trixi.rhs!(du, ode.u0, semi, 0.0)?

jlchan and others added 2 commits November 14, 2024 14:09
test/test_dgmulti_3d.jl (outdated, resolved)
# @assert Base.precompile(Tuple{DiscreteCallback{typeof(Trixi.summary_callback),
# typeof(Trixi.summary_callback),
# typeof(Trixi.initialize_summary_callback),
# typeof(SciMLBase.FINALIZE_DEFAULT)}})
Member

Will the CI fail if you uncomment these lines?

Contributor Author

Tried in fdf1ff5; let's see

Member

Thanks

@huiyuxie
Member

huiyuxie commented Nov 15, 2024

One thing that might be weird to me is that I tried the allocation profiling (using your MWE) for both the main branch and this PR - but found that the memory allocation bytes are similar

Total bytes: 199336 - from main
Total bytes: 191504 - from this PR

Can you confirm this on your side @jlchan (no rush)? Or is there anything we're doing wrong...

@jlchan
Contributor Author

jlchan commented Nov 15, 2024

One thing that might be weird to me is that I tried the allocation profiling (using your MWE) for both the main branch and this PR - but found that the memory allocation bytes are similar

Total bytes: 199336 - from main
Total bytes: 191504 - from this PR

Can you confirm this on your side @jlchan (no rush)? Or is there anything we're doing wrong...

Thanks for checking; I can confirm tomorrow.

If you run Trixi.rhs! a second time, do the total number of allocations decrease?

@jlchan
Contributor Author

jlchan commented Nov 15, 2024

@huiyuxie I see something very weird. If I print out typeof(du) in this function, then for the failing elixir I see typeof(du) = Matrix{SVector{...}}.

For the passing elixir, however, I see typeof(du) = StructArray{SVector{...}}. Yet u is the same type for both...

EDIT: aha, it is the time-stepper. If I use CarpenterKennedy with a fixed time-step size, it works. If I use SSPRK43 with adaptivity turned on, it fails. In general, turning adaptivity off fixes the failing CI.

@huiyuxie
Member

huiyuxie commented Nov 15, 2024

But if you said that u is the same type for both, I highly suspect there is a type instability problem with this package...

@jlchan
Contributor Author

jlchan commented Nov 15, 2024

But if you said that u is the same type for both, I highly suspect there is a type instability problem with this package...

No, I think it's a weird subtlety with how OrdinaryDiffEq.jl initializes the du container for specifically the SSPRK43 time-stepper. Here's a MWE:

using Trixi, OrdinaryDiffEq

dg = DGMulti(polydeg = 3, element_type = Line(),
             surface_integral = SurfaceIntegralWeakForm(FluxLaxFriedrichs()),
             volume_integral = VolumeIntegralWeakForm())

equations = CompressibleEulerEquations1D(1.4)
initial_condition = initial_condition_constant

mesh = DGMultiMesh(dg, (2, ), periodicity = true)
semi = SemidiscretizationHyperbolic(mesh, equations, initial_condition, dg)
ode = semidiscretize(semi, (0.0, .01))

integrator = init(ode, CarpenterKennedy2N54(williamson_condition = false), dt = 0.01) # works
integrator = init(ode, Tsit5()) # works
integrator = init(ode, RDPK3SpFSAL49()) # works
integrator = init(ode, SSPRK43(), dt = 0.01) # works

integrator = init(ode, SSPRK43()) # fails?!?

The error for the last line isn't very helpful, but if I run @show typeof(du) in Trixi.rhs!, I get typeof(du) = Matrix{SVector{3, Float64}}. It should be StructArray{SVector{3, Float64}, ...} instead, so I think OrdinaryDiffEq.jl is initializing the integrator.k field (that gets passed into Trixi.rhs! as du) incorrectly.

@ranocha - you know the SSPRK time-steppers in OrdinaryDiffEq.jl better than I do. Any idea what might be causing this?

@ranocha
Member

ranocha commented Nov 20, 2024

Which versions of the packages are you using?

julia> using Pkg; Pkg.activate(temp = true); Pkg.add(["OrdinaryDiffEq", "StaticArrays", "StructArrays"])
[...]
⌃ [1dea7af3] + OrdinaryDiffEq v6.89.0
  [90137ffa] + StaticArrays v1.9.8
⌃ [09ab397b] + StructArrays v0.6.18
[...]

julia> using OrdinaryDiffEq, StructArrays, StaticArrays

julia> u = StructArray([SVector(1.0, 1.0), SVector(2.0, 2.0)])
2-element StructArray(::Vector{Float64}, ::Vector{Float64}) with eltype SVector{2, Float64}:
 [1.0, 1.0]
 [2.0, 2.0]

julia> ode = ODEProblem(u, (0.0, 1.0)) do du, u, p, t
           @show typeof(du)
           du .= u
           return nothing
       end

julia> solve(ode, Tsit5())
typeof(du) = StructVector{SVector{2, Float64}, Tuple{Vector{Float64}, Vector{Float64}}, Int64}
typeof(du) = StructVector{SVector{2, Float64}, Tuple{Vector{Float64}, Vector{Float64}}, Int64}
typeof(du) = StructVector{SVector{2, Float64}, Tuple{Vector{Float64}, Vector{Float64}}, Int64}
[...]

julia> solve(ode, SSPRK43())
typeof(du) = StructVector{SVector{2, Float64}, Tuple{Vector{Float64}, Vector{Float64}}, Int64}
typeof(du) = StructVector{SVector{2, Float64}, Tuple{Vector{Float64}, Vector{Float64}}, Int64}
typeof(du) = StructVector{SVector{2, Float64}, Tuple{Vector{Float64}, Vector{Float64}}, Int64}
[...]

I can't see the difference here that you describe 🤷

@jlchan
Contributor Author

jlchan commented Nov 20, 2024

Which versions of the packages are you using? [...] I can't see the difference here that you describe 🤷

Odd! I'll check tomorrow (Thursdays are my Trixi PR day).

@jlchan
Contributor Author

jlchan commented Nov 20, 2024

@ranocha I misread your MWE; I couldn't reproduce this with a simplified MWE either. It only occurs in Trixi.jl.

I'll try to isolate it into a MWE tomorrow.

@@ -94,7 +94,7 @@
 PrecompileTools = "1.1"
 Preferences = "1.3"
 Printf = "1"
 RecipesBase = "1.1"
-RecursiveArrayTools = "2.38.10"
+RecursiveArrayTools = "3.27.1"
Member

You are setting this compatibility version so high that SciMLBase will also adopt a higher version, forcing you to use the new version of the DiscreteCallback struct.

That's why it still failed here even though you're using Julia 1.10: https://github.com/trixi-framework/Trixi.jl/actions/runs/11871869411/job/33084910314?pr=2150#step:7:24

Either change this to a lower version or adopt the new DiscreteCallback struct in precompile.
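As a hypothetical illustration of the first option (not this PR's actual compat entry), a Project.toml [compat] range can allow both major series, so the resolver is not forced onto the SciMLBase version that ships the new DiscreteCallback struct:

```toml
# Hypothetical sketch: allow both the old and the new major series of
# RecursiveArrayTools in [compat]; comma-separated entries are a union
# of ranges in Julia's Pkg compat syntax.
[compat]
RecursiveArrayTools = "2.38.10, 3.27.1"
```

Whether the PR's code actually works under both majors is a separate question, so this only helps if the VectorOfArray changes are compatible with the 2.x API as well.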

@JoshuaLampert
Member

Is there any update on this? I also need Trixi.jl to be compatible with newer versions of RecursiveArrayTools.jl and OrdinaryDiffEq.jl.

@jlchan
Contributor Author

jlchan commented Dec 3, 2024

Is there any update on this? I also need Trixi.jl being compatible with newer versions of RecursiveArrayTools.jl and OrdinaryDiffEq.jl.

I've got a conference this week so I won't be able to return to this until next Monday. However, there aren't any major numerical issues to address IMO. The main failing tests are related to allocation (e.g., #2150 (comment)) and precompilation (#2150 (comment)). I believe @huiyuxie has identified some of these issues already.

If you have any time to look at those before I get back to it next week I'd welcome the help.

@huiyuxie
Member

huiyuxie commented Dec 4, 2024

I will open a PR to address this, but I'm not familiar with the new DiscreteCallback struct, so I may need help from @ranocha

@ranocha
Member

ranocha commented Dec 4, 2024

Please let me know if you have questions

5 participants