
Proof of concept: TrixiMPIArray #1104

Draft: wants to merge 37 commits into main


ranocha (Member) commented Mar 30, 2022

This is a rough draft of a possible MPI array type. Many TODO notes are still left in the draft.

A reduced version (only ode_norm and ode_unstable_check) is implemented in #1113. We will use this reduced version for now and see how it works in the wild.

TODO:

  • Local reductions (sum): document this in the docstring, or test whether we could also just use a local mapreduce together with a parallel ode_norm?
  • Check step rejections
  • Check some complex setups (MPI shock capturing does not use alpha smoothing, but everything else should work, including AMR)
  • Maybe compare the performance of serial runs vs. one MPI rank (needs some hacks: mpi_parallel and mpi_isparallel)
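The "parallel ode_norm" in the first TODO item concerns the Hairer & Wanner-style RMS norm, sqrt(sum(abs2, u) / length(u)), which OrdinaryDiffEq.jl uses by default for error estimates. A minimal sketch, assuming each rank stores only its local part of `u`: both the sum of squares and the length have to be reduced over all ranks before taking the square root. To keep it runnable without MPI, the hypothetical `simulated_allreduce_sum` stands in for an `MPI.Allreduce` with `+`, and the "ranks" are just a vector of local arrays. This is an illustration, not the implementation in this PR.

```julia
# Hypothetical stand-in for MPI.Allreduce(x, +, comm): sums one contribution per simulated rank.
simulated_allreduce_sum(contributions) = sum(contributions)

# RMS norm sqrt(sum(abs2, u) / length(u)) of the *global* vector, computed from
# per-rank local arrays. Numerator and denominator are both reduced globally;
# dividing by a local length instead would make the norm (and thus step
# acceptance) depend on how the data is distributed.
function parallel_rms_norm(local_parts::Vector{<:AbstractVector})
    local_sums    = [sum(abs2, u_local) for u_local in local_parts]
    local_lengths = [length(u_local) for u_local in local_parts]
    global_sum    = simulated_allreduce_sum(local_sums)
    global_length = simulated_allreduce_sum(local_lengths)
    return sqrt(global_sum / global_length)
end

# Serial reference: the same norm evaluated on the full vector.
serial_rms_norm(u) = sqrt(sum(abs2, u) / length(u))

u = [1.0, 2.0, 3.0, 4.0, 5.0]
parts = [u[1:2], u[3:5]]  # "rank 0" owns 2 entries, "rank 1" owns 3
parallel_rms_norm(parts) ≈ serial_rms_norm(u)  # true
```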

Closes #329; closes #339

codecov bot commented Mar 30, 2022

Codecov Report

Merging #1104 (cdcf828) into main (1b604a6) will increase coverage by 0.00%.
The diff coverage is 98.81%.

@@           Coverage Diff           @@
##             main    #1104   +/-   ##
=======================================
  Coverage   96.75%   96.75%           
=======================================
  Files         303      305    +2     
  Lines       23876    23931   +55     
=======================================
+ Hits        23099    23153   +54     
- Misses        777      778    +1     
Flag Coverage Δ
unittests 96.75% <98.81%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/Trixi.jl 66.67% <ø> (ø)
src/callbacks_step/save_restart_dg.jl 89.36% <ø> (ø)
src/callbacks_step/save_solution_dg.jl 95.89% <ø> (ø)
src/auxiliary/mpi_arrays.jl 97.92% <97.92%> (ø)
src/callbacks_step/amr.jl 97.07% <100.00%> (ø)
src/callbacks_step/analysis_dg2d_parallel.jl 100.00% <100.00%> (ø)
src/callbacks_step/stepsize_dg2d.jl 100.00% <100.00%> (ø)
src/callbacks_step/stepsize_dg3d.jl 100.00% <100.00%> (ø)
src/callbacks_step/time_series_dg2d.jl 100.00% <100.00%> (ø)
src/meshes/meshes.jl 100.00% <100.00%> (ø)
... and 5 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1b604a6...cdcf828.

sloede (Member) left a comment

Great news that you have started thinking about an MPI array implementation for Trixi! I looked through the code and left some comments where I thought it might be helpful. Looking forward to getting something like this to work with the adaptive time integration schemes 😎

src/Trixi.jl (resolved)
src/auxiliary/mpi_arrays.jl (outdated, resolved)
src/auxiliary/mpi_arrays.jl (outdated, resolved)
src/auxiliary/mpi_arrays.jl (outdated, resolved)
src/auxiliary/mpi_arrays.jl (outdated, resolved)
ranocha (Member Author) commented Apr 1, 2022

Some results from commit 987407e:

julia --check-bounds=no --threads=2

julia> trixi_include("examples/tree_2d_dgsem/elixir_euler_ec.jl", tspan=(0.0, 10.0))

julia> sol = solve(ode, CarpenterKennedy2N54(williamson_condition=false), dt=1.0, save_everystep=false, callback=callbacks); summary_callback()
 ────────────────────────────────────────────────────────────────────────────────────
              Trixi.jl                      Time                    Allocations      
                                   ───────────────────────   ────────────────────────
         Tot / % measured:              2.73s /  90.4%           23.5MiB /  97.3%    

 Section                   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────
 rhs!                       4.24k    2.36s   95.9%   558μs   7.57MiB   33.1%  1.83KiB
   volume integral          4.24k    1.94s   78.6%   458μs   1.16MiB    5.1%     288B
   interface flux           4.24k    251ms   10.2%  59.2μs   1.62MiB    7.1%     400B
   prolong2interfaces       4.24k   58.2ms    2.4%  13.7μs   0.97MiB    4.2%     240B
   surface integral         4.24k   56.3ms    2.3%  13.3μs   1.23MiB    5.4%     304B
   reset ∂u/∂t              4.24k   28.3ms    1.1%  6.68μs     0.00B    0.0%    0.00B
   Jacobian                 4.24k   22.3ms    0.9%  5.27μs   1.10MiB    4.8%     272B
   ~rhs!~                   4.24k   8.06ms    0.3%  1.90μs   1.50MiB    6.5%     370B
   prolong2boundaries       4.24k    251μs    0.0%  59.2ns     0.00B    0.0%    0.00B
   prolong2mortars          4.24k    177μs    0.0%  41.7ns     0.00B    0.0%    0.00B
   mortar flux              4.24k    145μs    0.0%  34.3ns     0.00B    0.0%    0.00B
   source terms             4.24k   91.7μs    0.0%  21.6ns     0.00B    0.0%    0.00B
   boundary flux            4.24k   87.0μs    0.0%  20.5ns     0.00B    0.0%    0.00B
 calculate dt                 848   50.1ms    2.0%  59.0μs     0.00B    0.0%    0.00B
 analyze solution              10   30.6ms    1.2%  3.06ms    174KiB    0.7%  17.4KiB
 I/O                           11   20.9ms    0.8%  1.90ms   15.1MiB   66.1%  1.38MiB
   save solution               10   20.7ms    0.8%  2.07ms   15.1MiB   66.0%  1.51MiB
   get element variables       10   97.2μs    0.0%  9.72μs   20.6KiB    0.1%  2.06KiB
   ~I/O~                       11   26.0μs    0.0%  2.37μs   7.20KiB    0.0%     671B
   save mesh                   10    785ns    0.0%  78.5ns     0.00B    0.0%    0.00B
 ────────────────────────────────────────────────────────────────────────────────────

julia> sol = solve(ode, RDPK3SpFSAL35(), abstol=1.0e-4, reltol=1.0e-4, save_everystep=false, callback=callbacks); summary_callback()
 ────────────────────────────────────────────────────────────────────────────────────
              Trixi.jl                      Time                    Allocations      
                                   ───────────────────────   ────────────────────────
         Tot / % measured:              1.43s /  81.7%           15.5MiB /  86.5%    

 Section                   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────
 rhs!                       2.35k    1.14s   97.7%   487μs   4.20MiB   31.4%  1.83KiB
   volume integral          2.35k    924ms   79.0%   394μs    660KiB    4.8%     288B
   interface flux           2.35k    121ms   10.3%  51.3μs    917KiB    6.7%     400B
   prolong2interfaces       2.35k   32.3ms    2.8%  13.7μs    550KiB    4.0%     240B
   surface integral         2.35k   30.9ms    2.6%  13.1μs    697KiB    5.1%     304B
   reset ∂u/∂t              2.35k   17.4ms    1.5%  7.42μs     0.00B    0.0%    0.00B
   Jacobian                 2.35k   12.9ms    1.1%  5.51μs    624KiB    4.6%     272B
   ~rhs!~                   2.35k   4.41ms    0.4%  1.88μs    853KiB    6.2%     372B
   prolong2boundaries       2.35k    158μs    0.0%  67.3ns     0.00B    0.0%    0.00B
   prolong2mortars          2.35k    104μs    0.0%  44.2ns     0.00B    0.0%    0.00B
   mortar flux              2.35k   79.6μs    0.0%  33.9ns     0.00B    0.0%    0.00B
   source terms             2.35k   54.2μs    0.0%  23.1ns     0.00B    0.0%    0.00B
   boundary flux            2.35k   50.1μs    0.0%  21.3ns     0.00B    0.0%    0.00B
 analyze solution               6   18.3ms    1.6%  3.05ms    105KiB    0.8%  17.5KiB
 I/O                            7   9.09ms    0.8%  1.30ms   9.08MiB   67.8%  1.30MiB
   save solution                6   9.00ms    0.8%  1.50ms   9.06MiB   67.7%  1.51MiB
   get element variables        6   73.3μs    0.0%  12.2μs   12.4KiB    0.1%  2.06KiB
   ~I/O~                        7   16.2μs    0.0%  2.31μs   5.20KiB    0.0%     761B
   save mesh                    6    448ns    0.0%  74.7ns     0.00B    0.0%    0.00B
 ────────────────────────────────────────────────────────────────────────────────────

tmpi 2 julia --check-bounds=no --threads=1

julia> trixi_include("examples/tree_2d_dgsem/elixir_euler_ec.jl", tspan=(0.0, 10.0))

julia> sol = solve(ode, CarpenterKennedy2N54(williamson_condition=false), dt=1.0, save_everystep=false, callback=callbacks); summary_callback()
 ────────────────────────────────────────────────────────────────────────────────────
              Trixi.jl                      Time                    Allocations      
                                   ───────────────────────   ────────────────────────
         Tot / % measured:              2.72s /  95.4%           19.2MiB /  98.0%    

 Section                   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────
 rhs!                       4.24k    2.49s   95.9%   588μs   3.44MiB   18.3%     852B
   volume integral          4.24k    2.01s   77.4%   475μs     0.00B    0.0%    0.00B
   interface flux           4.24k    277ms   10.6%  65.3μs     0.00B    0.0%    0.00B
   surface integral         4.24k   55.9ms    2.1%  13.2μs     0.00B    0.0%    0.00B
   prolong2interfaces       4.24k   52.2ms    2.0%  12.3μs     0.00B    0.0%    0.00B
   reset ∂u/∂t              4.24k   23.0ms    0.9%  5.44μs     0.00B    0.0%    0.00B
   Jacobian                 4.24k   19.6ms    0.8%  4.64μs     0.00B    0.0%    0.00B
   MPI interface flux       4.24k   13.6ms    0.5%  3.22μs     0.00B    0.0%    0.00B
   ~rhs!~                   4.24k   11.8ms    0.5%  2.79μs   1.70MiB    9.0%     420B
   finish MPI receive       4.24k   11.4ms    0.4%  2.68μs    530KiB    2.8%     128B
   start MPI send           4.24k   9.67ms    0.4%  2.28μs    397KiB    2.1%    96.0B
   prolong2mpiinterfaces    4.24k   3.17ms    0.1%   749ns     0.00B    0.0%    0.00B
   finish MPI send          4.24k   1.03ms    0.0%   243ns    596KiB    3.1%     144B
   start MPI receive        4.24k    912μs    0.0%   215ns    265KiB    1.4%    64.0B
   prolong2mortars          4.24k    286μs    0.0%  67.5ns     0.00B    0.0%    0.00B
   prolong2boundaries       4.24k    256μs    0.0%  60.3ns     0.00B    0.0%    0.00B
   MPI mortar flux          4.24k    224μs    0.0%  52.8ns     0.00B    0.0%    0.00B
   prolong2mpimortars       4.24k    210μs    0.0%  49.6ns     0.00B    0.0%    0.00B
   mortar flux              4.24k    148μs    0.0%  35.0ns     0.00B    0.0%    0.00B
   boundary flux            4.24k   91.0μs    0.0%  21.5ns     0.00B    0.0%    0.00B
   source terms             4.24k   75.2μs    0.0%  17.8ns     0.00B    0.0%    0.00B
 calculate dt                 848   70.5ms    2.7%  83.2μs   79.5KiB    0.4%    96.0B
 analyze solution              10   22.1ms    0.9%  2.21ms   2.61MiB   13.9%   267KiB
 I/O                           11   14.6ms    0.6%  1.33ms   12.6MiB   67.4%  1.15MiB
   save solution               10   14.4ms    0.6%  1.44ms   12.6MiB   67.2%  1.26MiB
   get element variables       10    178μs    0.0%  17.8μs   23.0KiB    0.1%  2.30KiB
   ~I/O~                       11   21.5μs    0.0%  1.95μs   7.20KiB    0.0%     671B
   save mesh                   10    991ns    0.0%  99.1ns     0.00B    0.0%    0.00B
 ────────────────────────────────────────────────────────────────────────────────────

julia> sol = solve(ode, RDPK3SpFSAL35(), abstol=1.0e-4, reltol=1.0e-4, save_everystep=false, callback=callbacks); summary_callback()
 ────────────────────────────────────────────────────────────────────────────────────
              Trixi.jl                      Time                    Allocations      
                                   ───────────────────────   ────────────────────────
         Tot / % measured:              1.44s /  87.5%           12.3MiB /  90.0%    

 Section                   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────
 rhs!                       2.35k    1.23s   98.1%   525μs   1.91MiB   17.3%     855B
   volume integral          2.35k    978ms   77.7%   416μs     0.00B    0.0%    0.00B
   interface flux           2.35k    135ms   10.7%  57.4μs     0.00B    0.0%    0.00B
   surface integral         2.35k   31.0ms    2.5%  13.2μs     0.00B    0.0%    0.00B
   prolong2interfaces       2.35k   30.3ms    2.4%  12.9μs     0.00B    0.0%    0.00B
   reset ∂u/∂t              2.35k   12.6ms    1.0%  5.37μs     0.00B    0.0%    0.00B
   finish MPI receive       2.35k   11.5ms    0.9%  4.90μs    294KiB    2.6%     128B
   Jacobian                 2.35k   11.2ms    0.9%  4.77μs     0.00B    0.0%    0.00B
   MPI interface flux       2.35k   7.86ms    0.6%  3.35μs     0.00B    0.0%    0.00B
   ~rhs!~                   2.35k   7.16ms    0.6%  3.05μs    969KiB    8.6%     423B
   start MPI send           2.35k   5.48ms    0.4%  2.33μs    220KiB    1.9%    96.0B
   prolong2mpiinterfaces    2.35k   1.91ms    0.2%   813ns     0.00B    0.0%    0.00B
   finish MPI send          2.35k    712μs    0.1%   303ns    330KiB    2.9%     144B
   start MPI receive        2.35k    547μs    0.0%   233ns    147KiB    1.3%    64.0B
   prolong2mortars          2.35k    184μs    0.0%  78.6ns     0.00B    0.0%    0.00B
   prolong2mpimortars       2.35k    161μs    0.0%  68.7ns     0.00B    0.0%    0.00B
   prolong2boundaries       2.35k    154μs    0.0%  65.5ns     0.00B    0.0%    0.00B
   MPI mortar flux          2.35k    120μs    0.0%  51.3ns     0.00B    0.0%    0.00B
   mortar flux              2.35k    109μs    0.0%  46.4ns     0.00B    0.0%    0.00B
   source terms             2.35k   58.0μs    0.0%  24.7ns     0.00B    0.0%    0.00B
   boundary flux            2.35k   47.8μs    0.0%  20.4ns     0.00B    0.0%    0.00B
 analyze solution               6   13.3ms    1.1%  2.21ms   1.56MiB   14.1%   267KiB
 I/O                            7   10.8ms    0.9%  1.54ms   7.58MiB   68.6%  1.08MiB
   save solution                6   10.6ms    0.8%  1.76ms   7.57MiB   68.4%  1.26MiB
   get element variables        6    169μs    0.0%  28.1μs   13.8KiB    0.1%  2.30KiB
   ~I/O~                        7   12.8μs    0.0%  1.83μs   5.20KiB    0.0%     761B
   save mesh                    6    647ns    0.0%   108ns     0.00B    0.0%    0.00B
 ────────────────────────────────────────────────────────────────────────────────────

TL;DR: Looks reasonable.
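A quick check of the overhead behind this conclusion, using only the `rhs!` times from the four timer outputs above (one process with 2 threads vs. 2 MPI ranks with 1 thread each). These are single runs, so the numbers are indicative only. The `rhs!` call counts match in both configurations (4.24k and 2.35k, as rounded by the timer), which suggests that the adaptive RDPK3SpFSAL35 run took essentially the same steps with and without MPI.

```julia
# rhs! wall times in seconds, copied from the timer outputs above:
# (time integrator, 1 process × 2 threads, 2 MPI ranks × 1 thread)
timings = [
    ("CarpenterKennedy2N54", 2.36, 2.49),
    ("RDPK3SpFSAL35",        1.14, 1.23),
]

# Relative slowdown of the MPI run in percent.
relative_overhead(t_threads, t_mpi) = 100 * (t_mpi / t_threads - 1)

for (alg, t_threads, t_mpi) in timings
    overhead = round(relative_overhead(t_threads, t_mpi); digits = 1)
    println(rpad(alg, 22), overhead, " % slower with 2 MPI ranks")
end
# CarpenterKennedy2N54  5.5 % slower with 2 MPI ranks
# RDPK3SpFSAL35         7.9 % slower with 2 MPI ranks
```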

src/auxiliary/mpi_arrays.jl (outdated, resolved)
src/callbacks_step/stepsize_dg3d.jl (resolved)
src/solvers/dgsem_p4est/dg_2d_parallel.jl (outdated, resolved)
test/test_mpi_tree.jl (outdated, resolved)
src/auxiliary/mpi_arrays.jl (outdated, resolved)
src/auxiliary/mpi_arrays.jl (outdated, resolved)
src/auxiliary/mpi_arrays.jl (outdated, resolved)
# like regular `Array`s in most code, e.g., when looping over an array (which
# should use `eachindex`). At the same time, we want to be able to use adaptive
# time stepping using error estimates in OrdinaryDiffEq.jl. There, the default
# norm `ODE_DEFAULT_NORM` is the one described in the book of Hairer & Wanner,
Member:

Just thinking: can we avoid this local_length issue by defining a norm function that works in parallel? That might be an alternative to having to remember to use local_length.

A potential downside of local_length, which I just noticed, is that it allows users to write code that works in serial but may fail in spectacularly surprising ways when run in parallel. That is, if someone uses length where local_length is required, the code works fine in serial but may cause weird issues in parallel (especially when running with --check-bounds=no).
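The pitfall described above can be reproduced with a toy type. A minimal sketch, assuming the "global length" design discussed here: the hypothetical `ToyMPIVector` (not the `TrixiMPIArray` from this PR) stores only one rank's local entries, but its `length` reports the global length, while `eachindex` returns local indices. A loop over `eachindex` works on every rank; a loop over `1:length(u)` only works when a single rank owns all data, i.e., in serial.

```julia
# Hypothetical toy type: local storage, but `length` reports the global length.
struct ToyMPIVector{T}
    u_local::Vector{T}
    global_length::Int
end

Base.length(u::ToyMPIVector) = u.global_length          # global, e.g., for norms
Base.eachindex(u::ToyMPIVector) = eachindex(u.u_local)  # local indices only
Base.getindex(u::ToyMPIVector, i::Int) = u.u_local[i]   # local access
local_length(u::ToyMPIVector) = length(u.u_local)       # hypothetical helper

# Correct on every rank: iterate over the local indices.
function sum_local_eachindex(u)
    s = zero(eltype(u.u_local))
    for i in eachindex(u)
        s += u[i]
    end
    return s
end

# Subtle bug: `length` is global, so on a rank owning fewer entries this loop
# runs past the local data. With bounds checking it throws a BoundsError;
# with --check-bounds=no it could silently read invalid memory instead.
function sum_local_length(u)
    s = zero(eltype(u.u_local))
    for i in 1:length(u)
        s += u[i]
    end
    return s
end

serial   = ToyMPIVector([1.0, 2.0, 3.0], 3)  # a single rank owns all 3 entries
parallel = ToyMPIVector([1.0, 2.0], 5)       # this rank owns 2 of 5 entries

sum_local_eachindex(serial) == sum_local_length(serial) == 6.0  # true: serial hides the bug
sum_local_eachindex(parallel)                                    # 3.0

try
    sum_local_length(parallel)
catch err
    println("1:length(u) on a partial rank: ", typeof(err))  # BoundsError
end
```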

ranocha (Member Author), Apr 1, 2022:

Yeah, that's the drawback of the minimally invasive approach using a global length. However, I would argue that users should use eachindex in most cases anyway, which works fine.

Member:

No, I agree: eachindex should be used where possible. However, misusing length leads to errors that are difficult to understand, and such a "wrong" use of length might be hard to spot in reviews. I suggest we continue making it work, but then revisit this (or at least capture it in an issue).

ranocha (Member Author), Apr 1, 2022:

Another alternative would be to write our own norm function and pass that as solve(ode, alg; kwargs..., internalnorm=our_new_norm_function). However, that requires yet another keyword argument we need to remember.
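For reference, OrdinaryDiffEq.jl calls the function passed as `internalnorm` with the arguments `(u, t)`, like its default `ODE_DEFAULT_NORM`. A minimal sketch of such a replacement, where `our_new_norm_function` is the placeholder name from the comment above and `global_sum` is a hypothetical hook that would wrap an MPI reduction in a parallel run. Here it is the identity, so the sketch runs serially and reproduces the default RMS norm for plain arrays.

```julia
# Hypothetical hook: in an MPI run this would sum `x` over all ranks
# (e.g., via an Allreduce); here it is the identity so the sketch runs serially.
global_sum(x) = x

# RMS norm over the global data with the (u, t) signature used for `internalnorm`.
# `u` is assumed to be a plain array holding this rank's local data.
function our_new_norm_function(u::AbstractArray, t)
    return sqrt(global_sum(sum(abs2, u)) / global_sum(length(u)))
end

# Scalar method, mirroring the scalar method of the default norm.
our_new_norm_function(u::Number, t) = abs(u)

# Usage (not executed here; `ode`, `alg`, and `callbacks` as in the elixirs above):
# sol = solve(ode, alg; internalnorm = our_new_norm_function,
#             save_everystep = false, callback = callbacks)
```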

Member:

Right, especially since it would fail very late during the initialization (or even worse, just hang) if forgotten. Maybe we need our own trixi_solve that passes some default options to OrdinaryDiffEq.jl's solve?

Member Author:

Either this or set up some trixi_default_kwargs()?
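Both ideas rely on Julia's rule that, when keyword arguments are splatted, the rightmost occurrence of a name takes precedence. A minimal sketch, assuming hypothetical `trixi_default_kwargs` and `trixi_solve` functions (neither exists in Trixi.jl); `fake_solve` stands in for OrdinaryDiffEq.jl's `solve` and simply returns the keywords it receives, and `placeholder_parallel_norm` is a serial placeholder for an MPI-aware norm.

```julia
# Placeholder norm so the sketch is self-contained; an MPI-aware version would
# reduce across ranks (hypothetical).
placeholder_parallel_norm(u, t) = sqrt(sum(abs2, u) / length(u))

# Hypothetical defaults that Trixi.jl could provide for MPI-aware time integration.
trixi_default_kwargs() = (; internalnorm = placeholder_parallel_norm)

# Stand-in for OrdinaryDiffEq.solve that just returns the keywords it received.
fake_solve(ode, alg; kwargs...) = values(kwargs)

# Hypothetical trixi_solve: defaults first, user keywords last, so a keyword
# given by the user overrides the corresponding default.
trixi_solve(ode, alg; kwargs...) = fake_solve(ode, alg; trixi_default_kwargs()..., kwargs...)

kw = trixi_solve(nothing, nothing; save_everystep = false)
kw.internalnorm === placeholder_parallel_norm  # true: default applied
kw.save_everystep                              # false: user keyword passed through

my_norm(u, t) = maximum(abs, u)
trixi_solve(nothing, nothing; internalnorm = my_norm).internalnorm === my_norm  # true: user wins
```

With only `trixi_default_kwargs()`, users would instead write `solve(ode, alg; trixi_default_kwargs()..., kwargs...)` themselves, which keeps `solve` untouched but still has to be remembered.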

Member:

That might be better. We don't need to solve this right now, though, do we? Maybe we just copy the current discussion to an issue and deal with it later, once we have some more experience with the new type.

ranocha (Member Author), Apr 1, 2022:

Yeah, sounds good to me. I'll leave this thread open and we can continue the discussion later (#1108).

ranocha (Member Author) commented Apr 1, 2022

New results from Rocinante:

julia --project=. --check-bounds=no --threads=24

julia> using Trixi, OrdinaryDiffEq

julia> trixi_include("examples/tree_2d_dgsem/elixir_euler_ec.jl", tspan=(0.0, 10.0),
                     initial_refinement_level=6, save_solution=TrivialCallback())

julia> sol = solve(ode, CarpenterKennedy2N54(williamson_condition=false), dt=1.0, save_everystep=false, callback=callbacks); summary_callback()
 ─────────────────────────────────────────────────────────────────────────────────
            Trixi.jl                     Time                    Allocations      
                                ───────────────────────   ────────────────────────
        Tot / % measured:            8.31s /  44.9%           18.3MiB /  87.6%    

 Section                ncalls     time    %tot     avg     alloc    %tot      avg
 ─────────────────────────────────────────────────────────────────────────────────
 rhs!                    8.77k    3.00s   80.5%   343μs   15.7MiB   98.0%  1.83KiB
   volume integral       8.77k    1.59s   42.5%   181μs   2.41MiB   15.1%     288B
   reset ∂u/∂t           8.77k    883ms   23.6%   101μs     0.00B    0.0%    0.00B
   interface flux        8.77k    289ms    7.7%  33.0μs   3.35MiB   20.9%     400B
   prolong2interfaces    8.77k   92.6ms    2.5%  10.6μs   2.01MiB   12.6%     240B
   surface integral      8.77k   89.8ms    2.4%  10.2μs   2.54MiB   15.9%     304B
   ~rhs!~                8.77k   32.1ms    0.9%  3.66μs   3.09MiB   19.3%     369B
   Jacobian              8.77k   29.6ms    0.8%  3.38μs   2.28MiB   14.2%     272B
   prolong2mortars       8.77k    473μs    0.0%  54.0ns     0.00B    0.0%    0.00B
   prolong2boundaries    8.77k    469μs    0.0%  53.5ns     0.00B    0.0%    0.00B
   mortar flux           8.77k    291μs    0.0%  33.2ns     0.00B    0.0%    0.00B
   boundary flux         8.77k    207μs    0.0%  23.5ns     0.00B    0.0%    0.00B
   source terms          8.77k    205μs    0.0%  23.4ns     0.00B    0.0%    0.00B
 calculate dt            1.75k    554ms   14.8%   316μs     0.00B    0.0%    0.00B
 analyze solution           19    175ms    4.7%  9.22ms    328KiB    2.0%  17.3KiB
 ─────────────────────────────────────────────────────────────────────────────────

julia> sol = solve(ode, CarpenterKennedy2N54(williamson_condition=false, thread=OrdinaryDiffEq.True()), dt=1.0, save_everystep=false, callback=callbacks); summary_callback()
 ─────────────────────────────────────────────────────────────────────────────────
            Trixi.jl                     Time                    Allocations      
                                ───────────────────────   ────────────────────────
        Tot / % measured:            3.56s /  80.3%           19.6MiB /  81.5%    

 Section                ncalls     time    %tot     avg     alloc    %tot      avg
 ─────────────────────────────────────────────────────────────────────────────────
 rhs!                    8.77k    2.13s   74.5%   243μs   15.7MiB   98.0%  1.83KiB
   volume integral       8.77k    1.54s   53.8%   175μs   2.41MiB   15.1%     288B
   interface flux        8.77k    286ms   10.0%  32.6μs   3.35MiB   20.9%     400B
   prolong2interfaces    8.77k    120ms    4.2%  13.7μs   2.01MiB   12.6%     240B
   surface integral      8.77k   87.9ms    3.1%  10.0μs   2.54MiB   15.9%     304B
   reset ∂u/∂t           8.77k   33.9ms    1.2%  3.87μs     0.00B    0.0%    0.00B
   ~rhs!~                8.77k   31.2ms    1.1%  3.55μs   3.09MiB   19.3%     369B
   Jacobian              8.77k   30.8ms    1.1%  3.52μs   2.28MiB   14.2%     272B
   prolong2boundaries    8.77k    486μs    0.0%  55.4ns     0.00B    0.0%    0.00B
   prolong2mortars       8.77k    378μs    0.0%  43.1ns     0.00B    0.0%    0.00B
   mortar flux           8.77k    288μs    0.0%  32.8ns     0.00B    0.0%    0.00B
   boundary flux         8.77k    204μs    0.0%  23.2ns     0.00B    0.0%    0.00B
   source terms          8.77k    199μs    0.0%  22.7ns     0.00B    0.0%    0.00B
 calculate dt            1.75k    555ms   19.4%   316μs     0.00B    0.0%    0.00B
 analyze solution           19    172ms    6.0%  9.07ms    328KiB    2.0%  17.3KiB
 ─────────────────────────────────────────────────────────────────────────────────

julia> sol = solve(ode, RDPK3SpFSAL35(), abstol=1.0e-4, reltol=1.0e-4, save_everystep=false, callback=callbacks); summary_callback()
 ─────────────────────────────────────────────────────────────────────────────────
            Trixi.jl                     Time                    Allocations      
                                ───────────────────────   ────────────────────────
        Tot / % measured:            4.52s /  35.2%           16.6MiB /  51.0%    

 Section                ncalls     time    %tot     avg     alloc    %tot      avg
 ─────────────────────────────────────────────────────────────────────────────────
 rhs!                    4.64k    1.49s   93.7%   322μs   8.29MiB   97.8%  1.83KiB
   volume integral       4.64k    687ms   43.2%   148μs   1.27MiB   15.0%     288B
   reset ∂u/∂t           4.64k    474ms   29.8%   102μs     0.00B    0.0%    0.00B
   interface flux        4.64k    142ms    8.9%  30.7μs   1.77MiB   20.9%     400B
   ~rhs!~                4.64k   62.2ms    3.9%  13.4μs   1.64MiB   19.3%     370B
   prolong2interfaces    4.64k   54.7ms    3.4%  11.8μs   1.06MiB   12.5%     240B
   surface integral      4.64k   50.0ms    3.1%  10.8μs   1.34MiB   15.9%     304B
   Jacobian              4.64k   18.5ms    1.2%  4.00μs   1.20MiB   14.2%     272B
   prolong2mortars       4.64k    672μs    0.0%   145ns     0.00B    0.0%    0.00B
   prolong2boundaries    4.64k    520μs    0.0%   112ns     0.00B    0.0%    0.00B
   mortar flux           4.64k    345μs    0.0%  74.3ns     0.00B    0.0%    0.00B
   source terms          4.64k    127μs    0.0%  27.4ns     0.00B    0.0%    0.00B
   boundary flux         4.64k    108μs    0.0%  23.2ns     0.00B    0.0%    0.00B
 analyze solution           11    101ms    6.3%  9.19ms    189KiB    2.2%  17.2KiB
 ─────────────────────────────────────────────────────────────────────────────────

julia> sol = solve(ode, RDPK3SpFSAL35(thread=OrdinaryDiffEq.True()), abstol=1.0e-4, reltol=1.0e-4, save_everystep=false, callback=callbacks); summary_callback()
 ─────────────────────────────────────────────────────────────────────────────────
            Trixi.jl                     Time                    Allocations      
                                ───────────────────────   ────────────────────────
        Tot / % measured:            2.57s /  44.0%           17.8MiB /  47.7%    

 Section                ncalls     time    %tot     avg     alloc    %tot      avg
 ─────────────────────────────────────────────────────────────────────────────────
 rhs!                    4.64k    1.03s   91.2%   223μs   8.29MiB   97.8%  1.83KiB
   volume integral       4.64k    660ms   58.2%   142μs   1.27MiB   15.0%     288B
   interface flux        4.64k    142ms   12.5%  30.6μs   1.77MiB   20.9%     400B
   reset ∂u/∂t           4.64k   92.8ms    8.2%  20.0μs     0.00B    0.0%    0.00B
   prolong2interfaces    4.64k   62.0ms    5.5%  13.4μs   1.06MiB   12.5%     240B
   surface integral      4.64k   45.4ms    4.0%  9.79μs   1.34MiB   15.9%     304B
   ~rhs!~                4.64k   17.1ms    1.5%  3.69μs   1.64MiB   19.3%     370B
   Jacobian              4.64k   14.4ms    1.3%  3.11μs   1.20MiB   14.2%     272B
   prolong2boundaries    4.64k    238μs    0.0%  51.3ns     0.00B    0.0%    0.00B
   mortar flux           4.64k    189μs    0.0%  40.8ns     0.00B    0.0%    0.00B
   prolong2mortars       4.64k    183μs    0.0%  39.5ns     0.00B    0.0%    0.00B
   boundary flux         4.64k    108μs    0.0%  23.2ns     0.00B    0.0%    0.00B
   source terms          4.64k    105μs    0.0%  22.7ns     0.00B    0.0%    0.00B
 analyze solution           11   99.4ms    8.8%  9.04ms    190KiB    2.2%  17.2KiB
 ─────────────────────────────────────────────────────────────────────────────────

tmpi 2 julia --project=. --check-bounds=no --threads=12

julia> using Trixi, OrdinaryDiffEq

julia> trixi_include("examples/tree_2d_dgsem/elixir_euler_ec.jl", tspan=(0.0, 10.0),
                     initial_refinement_level=6, save_solution=TrivialCallback())

julia> sol = solve(ode, CarpenterKennedy2N54(williamson_condition=false), dt=1.0, save_everystep=false, callback=callbacks); summary_callback()
 ────────────────────────────────────────────────────────────────────────────────────
              Trixi.jl                      Time                    Allocations      
                                   ───────────────────────   ────────────────────────
         Tot / % measured:              5.61s /  58.3%           46.1MiB /  97.2%    

 Section                   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────
 rhs!                       8.77k    2.84s   86.7%   323μs   25.4MiB   56.7%  2.97KiB
   volume integral          8.77k    1.54s   47.2%   176μs   2.81MiB    6.3%     336B
   reset ∂u/∂t              8.77k    415ms   12.7%  47.4μs     0.00B    0.0%    0.00B
   interface flux           8.77k    282ms    8.6%  32.2μs   3.35MiB    7.5%     400B
   finish MPI receive       8.77k    194ms    5.9%  22.1μs   1.07MiB    2.4%     128B
   surface integral         8.77k   94.0ms    2.9%  10.7μs   2.54MiB    5.7%     304B
   start MPI send           8.77k   93.2ms    2.8%  10.6μs    822KiB    1.8%    96.0B
   prolong2interfaces       8.77k   85.0ms    2.6%  9.69μs   2.01MiB    4.5%     240B
   ~rhs!~                   8.77k   33.6ms    1.0%  3.83μs   3.49MiB    7.8%     418B
   MPI interface flux       8.77k   31.2ms    1.0%  3.55μs   3.35MiB    7.5%     400B
   Jacobian                 8.77k   29.2ms    0.9%  3.33μs   2.41MiB    5.4%     288B
   prolong2mpiinterfaces    8.77k   27.1ms    0.8%  3.09μs   1.87MiB    4.2%     224B
   finish MPI send          8.77k   2.14ms    0.1%   244ns   1.20MiB    2.7%     144B
   start MPI receive        8.77k   1.89ms    0.1%   216ns    548KiB    1.2%    64.0B
   prolong2boundaries       8.77k    547μs    0.0%  62.4ns     0.00B    0.0%    0.00B
   prolong2mpimortars       8.77k    401μs    0.0%  45.7ns     0.00B    0.0%    0.00B
   prolong2mortars          8.77k    388μs    0.0%  44.3ns     0.00B    0.0%    0.00B
   MPI mortar flux          8.77k    368μs    0.0%  42.0ns     0.00B    0.0%    0.00B
   mortar flux              8.77k    287μs    0.0%  32.7ns     0.00B    0.0%    0.00B
   source terms             8.77k    203μs    0.0%  23.2ns     0.00B    0.0%    0.00B
   boundary flux            8.77k    201μs    0.0%  22.9ns     0.00B    0.0%    0.00B
 calculate dt               1.75k    335ms   10.2%   191μs    165KiB    0.4%    96.0B
 analyze solution              19    101ms    3.1%  5.29ms   19.2MiB   42.9%  1.01MiB
 ────────────────────────────────────────────────────────────────────────────────────

julia> sol = solve(ode, CarpenterKennedy2N54(williamson_condition=false, thread=OrdinaryDiffEq.True()), dt=1.0, save_everystep=false, callback=callbacks); summary_callback()
 ────────────────────────────────────────────────────────────────────────────────────
              Trixi.jl                      Time                    Allocations      
                                   ───────────────────────   ────────────────────────
         Tot / % measured:              3.30s /  82.8%           48.5MiB /  92.5%    

 Section                   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────
 rhs!                       8.77k    2.33s   85.5%   266μs   25.4MiB   56.7%  2.97KiB
   volume integral          8.77k    1.53s   56.2%   175μs   2.81MiB    6.3%     336B
   interface flux           8.77k    297ms   10.9%  33.9μs   3.35MiB    7.5%     400B
   prolong2interfaces       8.77k    105ms    3.9%  12.0μs   2.01MiB    4.5%     240B
   finish MPI receive       8.77k   98.0ms    3.6%  11.2μs   1.07MiB    2.4%     128B
   surface integral         8.77k   86.5ms    3.2%  9.87μs   2.54MiB    5.7%     304B
   start MPI send           8.77k   62.5ms    2.3%  7.12μs    822KiB    1.8%    96.0B
   ~rhs!~                   8.77k   33.9ms    1.2%  3.86μs   3.49MiB    7.8%     418B
   MPI interface flux       8.77k   33.0ms    1.2%  3.77μs   3.35MiB    7.5%     400B
   Jacobian                 8.77k   28.2ms    1.0%  3.22μs   2.41MiB    5.4%     288B
   reset ∂u/∂t              8.77k   27.1ms    1.0%  3.09μs     0.00B    0.0%    0.00B
   prolong2mpiinterfaces    8.77k   20.6ms    0.8%  2.35μs   1.87MiB    4.2%     224B
   finish MPI send          8.77k   2.44ms    0.1%   279ns   1.20MiB    2.7%     144B
   start MPI receive        8.77k   1.81ms    0.1%   207ns    548KiB    1.2%    64.0B
   prolong2boundaries       8.77k    404μs    0.0%  46.0ns     0.00B    0.0%    0.00B
   prolong2mortars          8.77k    380μs    0.0%  43.3ns     0.00B    0.0%    0.00B
   MPI mortar flux          8.77k    341μs    0.0%  38.9ns     0.00B    0.0%    0.00B
   prolong2mpimortars       8.77k    341μs    0.0%  38.8ns     0.00B    0.0%    0.00B
   mortar flux              8.77k    250μs    0.0%  28.5ns     0.00B    0.0%    0.00B
   source terms             8.77k    203μs    0.0%  23.1ns     0.00B    0.0%    0.00B
   boundary flux            8.77k    201μs    0.0%  22.9ns     0.00B    0.0%    0.00B
 calculate dt               1.75k    295ms   10.8%   168μs    165KiB    0.4%    96.0B
 analyze solution              19    101ms    3.7%  5.32ms   19.2MiB   42.9%  1.01MiB
 ────────────────────────────────────────────────────────────────────────────────────

julia> sol = solve(ode, RDPK3SpFSAL35(), abstol=1.0e-4, reltol=1.0e-4, save_everystep=false, callback=callbacks); summary_callback()
 ────────────────────────────────────────────────────────────────────────────────────
              Trixi.jl                      Time                    Allocations      
                                   ───────────────────────   ────────────────────────
         Tot / % measured:              3.01s /  45.0%           29.0MiB /  84.7%    

 Section                   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────
 rhs!                       4.64k    1.30s   95.7%   280μs   13.5MiB   54.7%  2.97KiB
   volume integral          4.64k    677ms   50.0%   146μs   1.49MiB    6.0%     336B
   reset ∂u/∂t              4.64k    233ms   17.2%  50.3μs     0.00B    0.0%    0.00B
   interface flux           4.64k    137ms   10.1%  29.6μs   1.77MiB    7.2%     400B
   surface integral         4.64k   47.6ms    3.5%  10.3μs   1.34MiB    5.5%     304B
   finish MPI receive       4.64k   47.0ms    3.5%  10.1μs    580KiB    2.3%     128B
   prolong2interfaces       4.64k   45.1ms    3.3%  9.72μs   1.06MiB    4.3%     240B
   start MPI send           4.64k   44.5ms    3.3%  9.60μs    435KiB    1.7%    96.0B
   ~rhs!~                   4.64k   18.2ms    1.3%  3.92μs   1.85MiB    7.5%     419B
   MPI interface flux       4.64k   15.9ms    1.2%  3.43μs   1.77MiB    7.2%     400B
   Jacobian                 4.64k   15.1ms    1.1%  3.26μs   1.27MiB    5.2%     288B
   prolong2mpiinterfaces    4.64k   13.1ms    1.0%  2.82μs   0.99MiB    4.0%     224B
   start MPI receive        4.64k   1.09ms    0.1%   235ns    290KiB    1.2%    64.0B
   finish MPI send          4.64k    982μs    0.1%   212ns    652KiB    2.6%     144B
   prolong2boundaries       4.64k    284μs    0.0%  61.3ns     0.00B    0.0%    0.00B
   prolong2mpimortars       4.64k    238μs    0.0%  51.2ns     0.00B    0.0%    0.00B
   prolong2mortars          4.64k    217μs    0.0%  46.8ns     0.00B    0.0%    0.00B
   MPI mortar flux          4.64k    196μs    0.0%  42.4ns     0.00B    0.0%    0.00B
   mortar flux              4.64k    140μs    0.0%  30.1ns     0.00B    0.0%    0.00B
   boundary flux            4.64k    118μs    0.0%  25.4ns     0.00B    0.0%    0.00B
   source terms             4.64k    112μs    0.0%  24.0ns     0.00B    0.0%    0.00B
 analyze solution              11   57.7ms    4.3%  5.25ms   11.1MiB   45.3%  1.01MiB
 ────────────────────────────────────────────────────────────────────────────────────

julia> sol = solve(ode, RDPK3SpFSAL35(thread=OrdinaryDiffEq.True()), abstol=1.0e-4, reltol=1.0e-4, save_everystep=false, callback=callbacks); summary_callback()
 ────────────────────────────────────────────────────────────────────────────────────
              Trixi.jl                      Time                    Allocations      
                                   ───────────────────────   ────────────────────────
         Tot / % measured:              2.17s /  55.6%           31.0MiB /  79.2%    

 Section                   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────
 rhs!                       4.64k    1.11s   92.2%   240μs   13.5MiB   54.7%  2.97KiB
   volume integral          4.64k    662ms   54.9%   143μs   1.49MiB    6.0%     336B
   interface flux           4.64k    135ms   11.2%  29.1μs   1.77MiB    7.2%     400B
   finish MPI receive       4.64k   57.1ms    4.7%  12.3μs    580KiB    2.3%     128B
   reset ∂u/∂t              4.64k   56.8ms    4.7%  12.2μs     0.00B    0.0%    0.00B
   prolong2interfaces       4.64k   56.7ms    4.7%  12.2μs   1.06MiB    4.3%     240B
   surface integral         4.64k   48.3ms    4.0%  10.4μs   1.34MiB    5.5%     304B
   start MPI send           4.64k   32.3ms    2.7%  6.97μs    435KiB    1.7%    96.0B
   ~rhs!~                   4.64k   17.5ms    1.5%  3.78μs   1.85MiB    7.5%     419B
   MPI interface flux       4.64k   15.7ms    1.3%  3.38μs   1.77MiB    7.2%     400B
   Jacobian                 4.64k   15.2ms    1.3%  3.28μs   1.27MiB    5.2%     288B
   prolong2mpiinterfaces    4.64k   11.2ms    0.9%  2.42μs   0.99MiB    4.0%     224B
   finish MPI send          4.64k   1.35ms    0.1%   292ns    652KiB    2.6%     144B
   start MPI receive        4.64k    919μs    0.1%   198ns    290KiB    1.2%    64.0B
   prolong2boundaries       4.64k    226μs    0.0%  48.7ns     0.00B    0.0%    0.00B
   prolong2mpimortars       4.64k    215μs    0.0%  46.4ns     0.00B    0.0%    0.00B
   prolong2mortars          4.64k    203μs    0.0%  43.7ns     0.00B    0.0%    0.00B
   MPI mortar flux          4.64k    199μs    0.0%  42.9ns     0.00B    0.0%    0.00B
   mortar flux              4.64k    137μs    0.0%  29.6ns     0.00B    0.0%    0.00B
   source terms             4.64k    111μs    0.0%  24.0ns     0.00B    0.0%    0.00B
   boundary flux            4.64k    106μs    0.0%  22.8ns     0.00B    0.0%    0.00B
 analyze solution              11   93.9ms    7.8%  8.54ms   11.1MiB   45.3%  1.01MiB
 ────────────────────────────────────────────────────────────────────────────────────

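Looks okay, doesn't it? In particular, there seems to be an effect of using multi-threading also for the RK solver: for RDPK3SpFSAL35, "reset ∂u/∂t" drops from 233ms to 56.8ms and the total time from 3.01s to 2.17s.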

@ranocha ranocha requested a review from sloede April 1, 2022 13:55
Member

@sloede sloede left a comment

As said before, great work, and thanks for pushing this! I have left a few remarks and suggestions; please ping me if anything is unclear.

@sloede
Member

sloede commented Apr 1, 2022

> Looks okay, doesn't it? In particular, there seems to be an effect of using multi-threading also for the RK solver.

Yes, it looks OK, although it's not clear yet what the performance impact really is (hard to tell with such a small problem size) and whether it makes more sense to use more threads or more ranks. Then again, this is often hardware-dependent...

@sloede
Member

sloede commented Apr 3, 2022

> Do you understand why the serial p4est runs fail? Why would the results change? Is it because we do not use raw PtrArrays anymore and thus OrdinaryDiffEq.jl does something different under the hood when computing the time step update?

No idea... It's elixir_advection_basic.jl; everything else passes 😕

Positive: Now everything that was "weirdly" broken passes. Negative: macOS tests are still hanging...

@ranocha
Member Author

ranocha commented Apr 3, 2022

Yeah... but I can't really debug the macOS part (since I don't have a Mac)

@sloede
Member

sloede commented Apr 3, 2022

Could you see which test is the issue? If yes, we can try disabling it to check whether it's a singleton issue or a general problem. Although we should try to find the root cause either way.

@ranocha
Member Author

ranocha commented Apr 3, 2022

> Could you see which test is the issue? If yes, we can try disabling it to check whether it's a singleton issue or a general problem. Although we should try to find the root cause either way.

Looks like it's examples/tree_2d_dgsem/elixir_euler_ec.jl with error-based step size control 😢

@sloede
Member

sloede commented Apr 3, 2022

> Could you see which test is the issue? If yes, we can try disabling it to check whether it's a singleton issue or a general problem. Although we should try to find the root cause either way.
>
> Looks like it's examples/tree_2d_dgsem/elixir_euler_ec.jl with error-based step size control 😢

@andrewwinters5000 It would be great if you could try to reproduce this issue.

@ranocha
Member Author

ranocha commented Apr 4, 2022

I got rid of the global length completely, since it leads to hard-to-find bugs. Let's see what happens now...
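A likely reason why mixing local and global lengths is bug-prone: an RMS-type error norm (like the default `internalnorm` of OrdinaryDiffEq.jl) divides by `length(u)`. The sketch below is plain Julia without MPI; the two "ranks" are simulated by two vectors, and `local_parts`, `rms_local` and `rms_global` are hypothetical names, not Trixi.jl API. Norms computed from local data differ from rank to rank, while reducing the sums of squares and the lengths over all ranks first gives the same value everywhere. If ranks disagree on the norm, they may also disagree on accepting or rejecting a step, which is one plausible way to end up with mismatched communication and hanging runs.

```julia
# Illustration only: the inner vectors stand in for the local parts of one
# global solution vector u, distributed over two hypothetical MPI ranks.
local_parts = [[1.0, 2.0, 3.0], [4.0, 5.0]]

# RMS norm in the style of OrdinaryDiffEq's default internalnorm,
# evaluated on one rank's local data and local length only.
rms_local(u) = sqrt(sum(abs2, u) / length(u))

# Global RMS norm: reduce the sums of squares and the lengths over all ranks
# first, then divide. With MPI, both sums would come from an Allreduce(+).
function rms_global(parts)
    sumsq = sum(p -> sum(abs2, p), parts)
    len = sum(length, parts)
    return sqrt(sumsq / len)
end

local_norms = rms_local.(local_parts)   # differs from rank to rank
global_norm = rms_global(local_parts)   # identical on every rank

println("local norms: ", local_norms)
println("global norm: ", global_norm)
```

The global value equals the norm of the concatenated vector, i.e., what a serial run would compute.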

@ranocha ranocha requested a review from sloede April 4, 2022 12:05
@ranocha
Member Author

ranocha commented Apr 4, 2022

MPI tests pass 🥳
@sloede Please have a look at the new changes. Right now, the calling convention must be:

sol = solve(ode, alg; kwargs..., internalnorm=ode_norm, unstable_check=ode_unstable_check)

We should probably make all this easier to use, but it seems to be working.
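For reference, a minimal sketch of what MPI-aware versions of `ode_norm` and `ode_unstable_check` could look like, assuming the only communication needed is a global sum. `sketch_ode_norm`, `sketch_unstable_check` and the `allreduce_sum` keyword are hypothetical names for illustration, not the implementation in this PR; the positional signatures `internalnorm(u, t)` and `unstable_check(dt, u, p, t)` are the ones OrdinaryDiffEq.jl expects for these keyword arguments.

```julia
# Sketch only: `allreduce_sum` is a hypothetical hook for a global sum over all
# ranks, e.g. x -> MPI.Allreduce(x, +, comm) with MPI.jl. The default
# `identity` corresponds to running on a single rank.
function sketch_ode_norm(u::AbstractArray, t; allreduce_sum = identity)
    sumsq = allreduce_sum(sum(abs2, u))  # global sum of squares
    len = allreduce_sum(length(u))       # global number of entries
    return sqrt(sumsq / len)
end

# Scalars (e.g. dt) need no communication.
sketch_ode_norm(u::Number, t; kwargs...) = abs(u)

# Returns the same Bool on every rank by counting the ranks that saw a NaN.
function sketch_unstable_check(dt, u, p, t; allreduce_sum = identity)
    local_flag = isnan(dt) || any(isnan, u)
    return allreduce_sum(Int(local_flag)) > 0
end

# Hypothetical usage, with `global_sum` wrapping an MPI Allreduce:
# sol = solve(ode, alg; kwargs...,
#             internalnorm = (u, t) -> sketch_ode_norm(u, t; allreduce_sum = global_sum),
#             unstable_check = (dt, u, p, t) -> sketch_unstable_check(dt, u, p, t;
#                                                                     allreduce_sum = global_sum))
```

Routing both functions through one summing hook means every rank makes its step size and abort decisions from the same reduced values.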

ranocha added a commit that referenced this pull request Apr 5, 2022
Member

@sloede sloede left a comment

LGTM from my side - great work!

@ranocha ranocha changed the title WIP: TrixiMPIArray TrixiMPIArray Apr 5, 2022
@ranocha ranocha mentioned this pull request Apr 5, 2022
@ranocha ranocha changed the title TrixiMPIArray Proof of concept: TrixiMPIArray Apr 8, 2022
Labels
parallelization Related to MPI, threading, tasks etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MPI: MPI array type for dispatch
MPI: Disable unstable_check in OrdinaryDiffEq for MPI
4 participants