Proof of Concept: benchmark neighborhood search overhead #284

efaulhaber · 2023-11-09T22:14:30Z

I have been wondering for quite a while how large the actual overhead of querying my grid neighborhood search (NHS) is, that is, the part that loops over a generator collecting the particle lists of neighboring cells.

I profiled a simulation on one thread without bounds checking and found the following:

30% of the full simulation time is spent in the NHS in get(hashtable, ...), which is the get(::Dict, key, default) function in Julia Base. About half of this is in hash.
6% is in sqrt(distance2) in the particle-neighbor loop.
5% is in dot(pos_diff, pos_diff) above to compute the squared distance.

Finding that 30% of the total runtime is spent in get(hashtable, ...), I got really curious. So, I implemented a neighborhood search that wraps the existing grid neighborhood search, lets it do the update part, and then computes explicit lists of neighbors for each particle as a vector of vectors. Of course, allocating the full vector of vectors is terribly slow, but then for the interaction, the eachneighbor loop will just be a loop over a simple vector.
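The idea of precomputing explicit neighbor lists from a grid can be sketched as follows. This is only a minimal 2D illustration, not the code from this PR: `build_grid` and `build_neighbor_lists` are hypothetical names, and the plain `Dict` of cell lists is a simplified stand-in for the actual grid neighborhood search.

```julia
# Hypothetical, simplified stand-in for a 2D grid neighborhood search:
# a Dict mapping integer cell coordinates to the indices of the particles in that cell.
function build_grid(coords::Matrix{Float64}, cell_size::Float64)
    grid = Dict{NTuple{2, Int}, Vector{Int}}()
    for particle in axes(coords, 2)
        cell = (floor(Int, coords[1, particle] / cell_size),
                floor(Int, coords[2, particle] / cell_size))
        push!(get!(() -> Int[], grid, cell), particle)
    end
    return grid
end

# Query the grid once per particle and store the result as a vector of vectors.
function build_neighbor_lists(coords::Matrix{Float64}, grid, cell_size::Float64)
    empty_cell = Int[]
    neighbor_lists = [Int[] for _ in axes(coords, 2)]
    for particle in axes(coords, 2)
        cx = floor(Int, coords[1, particle] / cell_size)
        cy = floor(Int, coords[2, particle] / cell_size)
        # Visit the 3x3 block of cells around the particle's own cell
        for dx in -1:1, dy in -1:1
            # This `get` on the Dict is the kind of lookup that showed up in the profile
            append!(neighbor_lists[particle],
                    get(grid, (cx + dx, cy + dy), empty_cell))
        end
    end
    return neighbor_lists
end

# Four particles; columns are particle positions
coords = [0.1 0.4 2.6 0.2;
          0.1 0.3 2.7 1.2]
cell_size = 1.0
grid = build_grid(coords, cell_size)
neighbor_lists = build_neighbor_lists(coords, grid, cell_size)
```

With lists like these, the interaction only has to loop over `neighbor_lists[particle]`, so no hash table lookup happens inside the particle-neighbor loop. Note that the lists contain all particles in neighboring cells, including the particle itself.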

Assuming that looping over an explicit vector of neighbors is the absolute best that we can do, we should get an upper limit of how much we can optimize the neighborhood search. Here we go:

Benchmarks

The following is the basic rectangular_tank_2d.jl example (tspan = (0.0, 20.0)) on main with the GridNeighborhoodSearch on 24 threads. So it's where we currently stand. Considering that the particles basically don't move in this example, I disabled the NHS update.

 ──────────────────────────────────────────────────────────────────────────────────────────
            TrixiParticles.jl                     Time                    Allocations      
                                         ───────────────────────   ────────────────────────
            Tot / % measured:                 20.9s /  73.3%            157MiB /  98.7%    

 Section                         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────────
 kick!                            50.1k    14.8s   96.9%   296μs    146MiB   94.1%  2.98KiB
   system interaction             50.1k    8.46s   55.4%   169μs    113MiB   72.8%  2.30KiB
     fluid1-fluid1                50.1k    6.41s   42.0%   128μs   23.7MiB   15.3%     496B
     fluid1-boundary2             50.1k    1.80s   11.8%  36.0μs   21.4MiB   13.8%     448B
     ~system interaction~         50.1k    244ms    1.6%  4.88μs   67.6MiB   43.7%  1.38KiB
     boundary2-fluid1             50.1k   1.12ms    0.0%  22.4ns     0.00B    0.0%    0.00B
     boundary2-boundary2          50.1k   1.12ms    0.0%  22.3ns     0.00B    0.0%    0.00B
   update systems and nhs         50.1k    5.80s   37.9%   116μs   23.7MiB   15.3%     496B
     ~update systems and nhs~     50.1k    5.00s   32.7%   100μs      752B    0.0%    0.02B
     compute boundary pressure    50.1k    792ms    5.2%  15.8μs   23.7MiB   15.3%     496B
   gravity and damping            50.1k    293ms    1.9%  5.85μs   9.17MiB    5.9%     192B
   reset ∂v/∂t                    50.1k    200ms    1.3%  4.00μs     0.00B    0.0%    0.00B
   ~kick!~                        50.1k   52.1ms    0.3%  1.04μs   2.94KiB    0.0%    0.06B
 drift!                           50.1k    476ms    3.1%  9.51μs   9.17MiB    5.9%     192B
   reset ∂u/∂t                    50.1k    235ms    1.5%  4.70μs     0.00B    0.0%    0.00B
   velocity                       50.1k    216ms    1.4%  4.31μs   9.17MiB    5.9%     192B
   ~drift!~                       50.1k   24.9ms    0.2%   497ns   1.47KiB    0.0%    0.03B
 ──────────────────────────────────────────────────────────────────────────────────────────

This is the new explicit neighbor list NHS:

 ──────────────────────────────────────────────────────────────────────────────────────────
            TrixiParticles.jl                     Time                    Allocations      
                                         ───────────────────────   ────────────────────────
            Tot / % measured:                 17.7s /  70.0%            161MiB /  98.7%    

 Section                         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────────
 kick!                            50.1k    11.9s   96.1%   238μs    150MiB   94.2%  3.07KiB
   system interaction             50.1k    5.78s   46.6%   115μs    115MiB   72.2%  2.35KiB
     fluid1-fluid1                50.1k    4.67s   37.6%  93.2μs   26.0MiB   16.3%     544B
     fluid1-boundary2             50.1k    851ms    6.9%  17.0μs   21.4MiB   13.4%     448B
     ~system interaction~         50.1k    258ms    2.1%  5.16μs   67.6MiB   42.4%  1.38KiB
     boundary2-boundary2          50.1k   1.12ms    0.0%  22.3ns     0.00B    0.0%    0.00B
     boundary2-fluid1             50.1k   1.11ms    0.0%  22.2ns     0.00B    0.0%    0.00B
   update systems and nhs         50.1k    5.59s   45.1%   112μs   26.0MiB   16.3%     544B
     ~update systems and nhs~     50.1k    5.09s   41.1%   102μs      752B    0.0%    0.02B
     compute boundary pressure    50.1k    495ms    4.0%  9.89μs   26.0MiB   16.3%     544B
   gravity and damping            50.1k    286ms    2.3%  5.71μs   9.17MiB    5.8%     192B
   reset ∂v/∂t                    50.1k    215ms    1.7%  4.29μs     0.00B    0.0%    0.00B
   ~kick!~                        50.1k   48.1ms    0.4%   961ns   2.97KiB    0.0%    0.06B
 drift!                           50.1k    486ms    3.9%  9.71μs   9.17MiB    5.8%     192B
   reset ∂u/∂t                    50.1k    245ms    2.0%  4.90μs      288B    0.0%    0.01B
   velocity                       50.1k    217ms    1.7%  4.32μs   9.17MiB    5.8%     192B
   ~drift!~                       50.1k   24.1ms    0.2%   480ns   1.47KiB    0.0%    0.03B
 ──────────────────────────────────────────────────────────────────────────────────────────

To be more accurate, here is a benchmark of just the fluid-fluid interaction:

julia> @btime TrixiParticles.interact!($dv, $v, $u, $v, $u, $(nhs.grid_nhs), $fluid_system, $fluid_system);
  86.652 μs (1 allocation: 496 bytes)
  
julia> @btime TrixiParticles.interact!($dv, $v, $u, $v, $u, $nhs, $fluid_system, $fluid_system);
  57.951 μs (1 allocation: 544 bytes)

So the NHS query overhead is about 33% of the current runtime of the fluid-fluid interaction.

Here is the grid NHS on a single thread:

 ──────────────────────────────────────────────────────────────────────────────────────────
            TrixiParticles.jl                     Time                    Allocations      
                                         ───────────────────────   ────────────────────────
            Tot / % measured:                 16.9s /  96.7%           16.0MiB /  42.4%    

 Section                         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────────
 kick!                            5.02k    16.3s   99.9%  3.25ms   6.78MiB  100.0%  1.38KiB
   system interaction             5.02k    15.4s   94.4%  3.07ms   6.78MiB   99.9%  1.38KiB
     fluid1-fluid1                5.02k    13.2s   81.1%  2.64ms     0.00B    0.0%    0.00B
     fluid1-boundary2             5.02k    2.16s   13.2%   431μs     0.00B    0.0%    0.00B
     ~system interaction~         5.02k   14.9ms    0.1%  2.97μs   6.78MiB   99.9%  1.38KiB
     boundary2-boundary2          5.02k    111μs    0.0%  22.1ns     0.00B    0.0%    0.00B
     boundary2-fluid1             5.02k    109μs    0.0%  21.7ns     0.00B    0.0%    0.00B
   update systems and nhs         5.02k    870ms    5.3%   173μs      752B    0.0%    0.15B
     compute boundary pressure    5.02k    621ms    3.8%   124μs     0.00B    0.0%    0.00B
     ~update systems and nhs~     5.02k    249ms    1.5%  49.7μs      752B    0.0%    0.15B
   gravity and damping            5.02k   14.7ms    0.1%  2.93μs     0.00B    0.0%    0.00B
   reset ∂v/∂t                    5.02k   5.33ms    0.0%  1.06μs     0.00B    0.0%    0.00B
   ~kick!~                        5.02k   1.87ms    0.0%   372ns   2.94KiB    0.0%    0.60B
 drift!                           5.02k   16.1ms    0.1%  3.21μs   1.47KiB    0.0%    0.30B
   velocity                       5.02k   11.0ms    0.1%  2.19μs     0.00B    0.0%    0.00B
   reset ∂u/∂t                    5.02k   4.22ms    0.0%   840ns     0.00B    0.0%    0.00B
   ~drift!~                       5.02k    904μs    0.0%   180ns   1.47KiB    0.0%    0.30B
 ──────────────────────────────────────────────────────────────────────────────────────────

And the neighbor list NHS on a single thread:

 ──────────────────────────────────────────────────────────────────────────────────────────
            TrixiParticles.jl                     Time                    Allocations      
                                         ───────────────────────   ────────────────────────
            Tot / % measured:                 11.4s /  96.2%           7.46MiB /  90.9%    

 Section                         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────────
 kick!                            5.02k    10.9s   99.8%  2.18ms   6.78MiB  100.0%  1.38KiB
   system interaction             5.02k    10.5s   96.2%  2.10ms   6.78MiB   99.9%  1.38KiB
     fluid1-fluid1                5.02k    10.3s   94.3%  2.06ms     0.00B    0.0%    0.00B
     fluid1-boundary2             5.02k    190ms    1.7%  37.8μs     0.00B    0.0%    0.00B
     ~system interaction~         5.02k   15.9ms    0.1%  3.18μs   6.78MiB   99.9%  1.38KiB
     boundary2-boundary2          5.02k    114μs    0.0%  22.8ns     0.00B    0.0%    0.00B
     boundary2-fluid1             5.02k    109μs    0.0%  21.8ns     0.00B    0.0%    0.00B
   update systems and nhs         5.02k    378ms    3.5%  75.4μs      752B    0.0%    0.15B
     ~update systems and nhs~     5.02k    250ms    2.3%  49.9μs      752B    0.0%    0.15B
     compute boundary pressure    5.02k    128ms    1.2%  25.5μs     0.00B    0.0%    0.00B
   gravity and damping            5.02k   14.6ms    0.1%  2.90μs     0.00B    0.0%    0.00B
   reset ∂v/∂t                    5.02k   5.34ms    0.0%  1.06μs     0.00B    0.0%    0.00B
   ~kick!~                        5.02k   1.72ms    0.0%   344ns   2.94KiB    0.0%    0.60B
 drift!                           5.02k   16.6ms    0.2%  3.31μs   1.47KiB    0.0%    0.30B
   velocity                       5.02k   11.2ms    0.1%  2.24μs     0.00B    0.0%    0.00B
   reset ∂u/∂t                    5.02k   4.39ms    0.0%   876ns     0.00B    0.0%    0.00B
   ~drift!~                       5.02k    943μs    0.0%   188ns   1.47KiB    0.0%    0.30B
 ──────────────────────────────────────────────────────────────────────────────────────────

And the precise benchmark:

julia> @btime TrixiParticles.interact!($dv, $v, $u, $v, $u, $(nhs.grid_nhs), $fluid_system, $fluid_system);
  2.548 ms (0 allocations: 0 bytes)

julia> @btime TrixiParticles.interact!($dv, $v, $u, $v, $u, $nhs, $fluid_system, $fluid_system);
  1.943 ms (0 allocations: 0 bytes)

Relative to the grid NHS runtime, this is a 24% overhead.

I played around a little and only stored actual neighbors in the lists instead of "possible neighbors" (all particles in neighboring cells). This gets rid of the distance checking for particles that are outside the search radius and should probably also be considered for a "perfect" neighborhood search.

 ──────────────────────────────────────────────────────────────────────────────────────────
            TrixiParticles.jl                     Time                    Allocations      
                                         ───────────────────────   ────────────────────────
            Tot / % measured:                 8.76s /  95.1%           7.46MiB /  90.9%    

 Section                         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────────
 kick!                            5.02k    8.31s   99.8%  1.66ms   6.78MiB  100.0%  1.38KiB
   system interaction             5.02k    7.98s   95.8%  1.59ms   6.78MiB   99.9%  1.38KiB
     fluid1-fluid1                5.02k    7.82s   93.9%  1.56ms     0.00B    0.0%    0.00B
     fluid1-boundary2             5.02k    148ms    1.8%  29.4μs     0.00B    0.0%    0.00B
     ~system interaction~         5.02k   14.7ms    0.2%  2.92μs   6.78MiB   99.9%  1.38KiB
     boundary2-fluid1             5.02k    109μs    0.0%  21.6ns     0.00B    0.0%    0.00B
     boundary2-boundary2          5.02k    108μs    0.0%  21.6ns     0.00B    0.0%    0.00B
   update systems and nhs         5.02k    311ms    3.7%  62.0μs      752B    0.0%    0.15B
     ~update systems and nhs~     5.02k    231ms    2.8%  46.0μs      752B    0.0%    0.15B
     compute boundary pressure    5.02k   80.3ms    1.0%  16.0μs     0.00B    0.0%    0.00B
   gravity and damping            5.02k   15.1ms    0.2%  3.01μs     0.00B    0.0%    0.00B
   reset ∂v/∂t                    5.02k   5.19ms    0.1%  1.03μs     0.00B    0.0%    0.00B
   ~kick!~                        5.02k   1.74ms    0.0%   347ns   2.94KiB    0.0%    0.60B
 drift!                           5.02k   16.0ms    0.2%  3.19μs   1.47KiB    0.0%    0.30B
   velocity                       5.02k   10.9ms    0.1%  2.17μs     0.00B    0.0%    0.00B
   reset ∂u/∂t                    5.02k   4.19ms    0.1%   835ns     0.00B    0.0%    0.00B
   ~drift!~                       5.02k    932μs    0.0%   186ns   1.47KiB    0.0%    0.30B
 ──────────────────────────────────────────────────────────────────────────────────────────

julia> @btime TrixiParticles.interact!($dv, $v, $u, $v, $u, $nhs, $fluid_system, $fluid_system);
  1.519 ms (0 allocations: 0 bytes)

Compared to the 2.548 ms of the grid NHS on a single thread, this would be a 40% overhead.
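Storing only actual neighbors amounts to doing the distance check once, when the lists are built, instead of in every interaction. A minimal sketch, assuming candidate lists are given as a vector of vectors and positions as columns of a matrix (`filter_actual_neighbors` is a hypothetical name, not a function from this PR):

```julia
using LinearAlgebra: dot

# Keep only those candidates that lie within the search radius of each particle.
function filter_actual_neighbors(candidate_lists::Vector{Vector{Int}},
                                 coords::Matrix{Float64}, search_radius::Float64)
    radius2 = search_radius^2
    return map(eachindex(candidate_lists)) do particle
        filter(candidate_lists[particle]) do neighbor
            pos_diff = coords[:, particle] - coords[:, neighbor]
            # Compare squared distances to avoid the sqrt
            dot(pos_diff, pos_diff) <= radius2
        end
    end
end

# Four particles; columns are particle positions
coords = [0.1 0.4 2.6 0.2;
          0.1 0.3 2.7 1.2]
# Candidates as they would come from the neighboring grid cells
candidate_lists = [[1, 2, 4], [1, 2, 4], [3], [1, 2, 4]]
actual_lists = filter_actual_neighbors(candidate_lists, coords, 1.0)
```

In this example, particle 4 is dropped from the list of particle 1 because their distance exceeds the search radius, while it remains a neighbor of particle 2. The filtered lists are only valid as long as the particles do not move relative to each other, which is why this fits the static test case used here.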

Conclusion

The NHS query could be more efficient. If we stick with grid-based NHS, the upper limit seems to be a 33% multithreaded or 24% single-threaded improvement of the fluid-fluid interaction.
It is, however, very difficult to improve the query performance without hurting the update performance, since faster querying will probably require a different data structure for the cell lists. Great care has to be taken regarding the parallelizability of the update step, or otherwise we will lose a lot of performance in the multithreaded update.
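The percentages in this post can be reproduced from the `@btime` results of the 2D benchmarks. The following snippet only restates that arithmetic; `relative_saving` is a helper name introduced here for illustration, and all values are measured relative to the grid NHS runtime.

```julia
# Fraction of the current (grid NHS) runtime that the faster variant saves
relative_saving(t_grid, t_list) = (t_grid - t_list) / t_grid

multithreaded = relative_saving(86.652, 57.951) # μs, 24 threads
single_thread = relative_saving(2.548, 1.943)   # ms, one thread
actual_only = relative_saving(2.548, 1.519)     # ms, one thread, actual neighbors only

for (label, value) in (("24 threads", multithreaded),
                       ("1 thread", single_thread),
                       ("1 thread, actual neighbors only", actual_only))
    println(label, ": ", round(100 * value; digits = 1), "%")
end
```

This gives roughly 33%, 24%, and 40%, matching the numbers quoted above.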

We might also be able to improve the existing NHS. About 15% of the total runtime is in hash, so maybe an optimized hash function for NTuple{2, Int} could help the performance. Also, the remaining 15% in the hashtable querying might come from the fact that the cell lists are spread across memory. A compact hashing approach where the cell lists are columns of a large matrix might improve cache hits.
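One way to try a cheaper hash without overriding `hash` for `NTuple{2, Int}` globally would be a small wrapper type for cell indices with its own `Base.hash` method. The sketch below is only an assumption of what this could look like: `CellIndex2D` is a hypothetical type, the multiplier constants are arbitrary illustrative choices, and whether this is actually faster than the default hash has not been measured here.

```julia
# Hypothetical cell index type with a cheap, hand-written hash
struct CellIndex2D
    x::Int
    y::Int
end

# Multiply each coordinate by a large odd constant and combine with xor.
# The constants are illustrative only, not tuned values.
function Base.hash(cell::CellIndex2D, h::UInt)
    hx = (cell.x % UInt) * (0x9e3779b97f4a7c15 % UInt)
    hy = (cell.y % UInt) * (0xc2b2ae3d27d4eb4f % UInt)
    return hx ⊻ hy ⊻ h
end

# Equality falls back to field-wise comparison for immutable structs,
# so the type can be used as a Dict key without further definitions.
cell_lists = Dict{CellIndex2D, Vector{Int}}()
push!(get!(() -> Int[], cell_lists, CellIndex2D(0, 0)), 1)
push!(get!(() -> Int[], cell_lists, CellIndex2D(0, 0)), 2)
push!(get!(() -> Int[], cell_lists, CellIndex2D(-1, 3)), 3)

empty_cell = Int[]
particles_in_cell = get(cell_lists, CellIndex2D(0, 0), empty_cell)
```

A simple multiplicative hash like this trades hash quality for speed, so it would need to be benchmarked against the default on realistic cell distributions before drawing any conclusions.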

Edit

Now I'm thinking, would it make sense to do another test where I use a contiguous vector for the neighbor lists instead of a vector of vectors?
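A contiguous layout could, for example, store all neighbor lists back to back in one vector together with an offset per particle. This is a minimal sketch under that assumption; `flatten_neighbor_lists` and `neighbors_of` are hypothetical names and not necessarily how it would be implemented in this PR.

```julia
# Flatten a vector of vectors into one contiguous vector plus offsets.
# The neighbors of particle `i` are stored at `offsets[i]:(offsets[i + 1] - 1)`.
function flatten_neighbor_lists(neighbor_lists::Vector{Vector{Int}})
    offsets = Vector{Int}(undef, length(neighbor_lists) + 1)
    offsets[1] = 1
    for (i, list) in enumerate(neighbor_lists)
        offsets[i + 1] = offsets[i] + length(list)
    end
    neighbors = reduce(vcat, neighbor_lists; init = Int[])
    return neighbors, offsets
end

# Neighbors of `particle` as a view into the contiguous storage (no copy)
function neighbors_of(neighbors, offsets, particle)
    return view(neighbors, offsets[particle]:(offsets[particle + 1] - 1))
end

neighbor_lists = [[1, 2, 4], [1, 2, 4], [3], [1, 2, 4]]
neighbors, offsets = flatten_neighbor_lists(neighbor_lists)
neighbors_of_third = neighbors_of(neighbors, offsets, 3)
```

Compared to a vector of vectors, this avoids one pointer indirection per particle and keeps the lists of consecutive particles adjacent in memory.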

efaulhaber · 2023-11-13T11:54:43Z

Update

I implemented a contiguous vector for the neighbor lists, and there was no noticeable difference in 2D.
I also repeated the same test in 3D and could find a small difference there, probably due to the larger problem size.

In 3D, there was a 25% overhead with the first version on one thread, which went up to 32% when using the contiguous vector (here measured relative to the neighbor list runtime):

julia> @btime TrixiParticles.interact!($dv, $v, $u, $v, $u, $(nhs.grid_nhs), $fluid_system, $fluid_system);
  26.861 ms (0 allocations: 0 bytes)

julia> @btime TrixiParticles.interact!($dv, $v, $u, $v, $u, $nhs, $fluid_system, $fluid_system);
  21.386 ms (0 allocations: 0 bytes)

Contiguous vector:

julia> @btime TrixiParticles.interact!($dv, $v, $u, $v, $u, $nhs, $fluid_system, $fluid_system);
  20.305 ms (0 allocations: 0 bytes)

efaulhaber added 3 commits November 9, 2023 22:32

[skip ci] Add NeighborListNeighborhoodSearch

e6aef57

Only consider actual neighbors

fe91de9

Use contiguous vector as neighbor list

5cb4b00

[skip ci] Reformat

83f1e19

svchb added the discussion label Dec 13, 2023

sloede closed this Mar 5, 2024

efaulhaber mentioned this pull request May 10, 2024

Implement neighborhood search based on static neighbor lists trixi-framework/PointNeighbors.jl#9

Merged
