Combine similar kernels using cooperative groups #97

Merged
merged 3 commits into from
Dec 21, 2024

huiyuxie
Member

@huiyuxie huiyuxie commented Dec 21, 2024

Several kernels are currently launched separately only to obtain grid-wide synchronization between them. Cooperative groups allow these similar kernels to be combined into a single kernel launch, which improves performance. This PR also adds kernel size configurators for 1D, 2D, and 3D cooperative kernel launches.

Tasks:

  • cuda_prolong2mortars! for 3D
  • cuda_mortar_flux! with nonconservative_terms::False for 3D
  • cuda_mortar_flux! with nonconservative_terms::False for 3D
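
The kernel combination described above can be sketched as follows. This is a minimal illustration, not the code from this PR: `fused_kernel!` and its two phases are hypothetical, and the launch sizing only assumes that the block count returned by CUDA.jl's `launch_configuration` is a safe upper bound for a cooperative launch on the device in use. It requires CUDA.jl and a GPU that supports cooperative launches.

```julia
using CUDA

# Hypothetical fused kernel: two phases that would otherwise be two separate
# launches, separated by a grid-wide barrier from cooperative groups.
function fused_kernel!(tmp, out, inp)
    grid = CG.this_grid()
    n = length(inp)
    start = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x

    # Phase 1: every thread writes its part of the intermediate array.
    i = start
    while i <= n
        @inbounds tmp[i] = 2 * inp[i]
        i += stride
    end

    # All blocks must finish phase 1 before any thread starts phase 2.
    CG.sync(grid)

    # Phase 2: reads a neighbor's value, which may come from another block.
    i = start
    while i <= n
        j = ifelse(i == n, one(i), i + one(i))
        @inbounds out[i] = tmp[i] + tmp[j]
        i += stride
    end
    return nothing
end

inp = CUDA.rand(Float32, 4096)
tmp = similar(inp)
out = similar(inp)

# Compile without launching, then size the launch from the occupancy API.
# Assumption: capping at config.blocks keeps all blocks resident at once,
# which a cooperative launch requires.
kernel = @cuda launch=false fused_kernel!(tmp, out, inp)
config = launch_configuration(kernel.fun)
threads = min(length(inp), config.threads)
blocks = min(cld(length(inp), threads), config.blocks)

kernel(tmp, out, inp; threads, blocks, cooperative = true)
```

The grid-stride loops let a capped number of blocks cover the whole array, which is one way to keep the grid small enough for a cooperative launch; the configurators added in this PR may use a different strategy.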

@huiyuxie huiyuxie added the performance and benchmark labels Dec 21, 2024
@huiyuxie
Member Author

Here we use Euler mortar 3D as an example for benchmarking.

cuda_prolong2mortars! before kernel combination

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):   67.100 μs …  48.753 ms  ┊ GC (min … max): 0.00% … 22.12%
Time  (median):     111.200 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   152.253 μs ± 647.446 μs  ┊ GC (mean ± σ):  1.41% ±  0.34%

    ▁▄▇█▅▁
 ▃▆███████▆▆▅▅▄▃▂▂▂▂▂▂▂▂▁▁▁▁▂▃▂▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
 67.1 μs          Histogram: frequency by time          453 μs <

Memory estimate: 15.75 KiB, allocs estimate: 217.

cuda_prolong2mortars! after kernel combination

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   76.600 μs …  1.312 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     112.900 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   122.359 μs ± 49.262 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁█▁    ▁  ▂▁
  ▁███▆▇▇██████▇▅▄▃▃▃▃▂▃▃▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  76.6 μs         Histogram: frequency by time          300 μs <

 Memory estimate: 12.66 KiB, allocs estimate: 181.

@huiyuxie
Member Author

Here we use Euler mortar 3D as an example for benchmarking.

cuda_mortar_flux! with nonconservative_terms::False before kernel combination

BenchmarkTools.Trial: 4405 samples with 1 evaluation.
 Range (min … max):  296.400 μs … 132.412 ms  ┊ GC (min … max): 0.00% … 17.08%
 Time  (median):     980.600 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.128 ms ±   4.445 ms  ┊ GC (mean ± σ):  0.46% ±  0.26%

   █▅                       ▃▄▃▂
  ▃██▆▇▆▄▄▃▁▂▁▁▁▁▁▁▁▁▁▁▂▂▃▄▅█████▆▅▅▄▃▂▂▂▂▂▁▁▁▁▁▁▂▂▂▁▂▁▂▂▁▁▁▁▁▁ ▂
  296 μs           Histogram: frequency by time         1.87 ms <

 Memory estimate: 20.28 KiB, allocs estimate: 265.

cuda_mortar_flux! with nonconservative_terms::False after kernel combination

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  124.800 μs …  55.918 ms  ┊ GC (min … max): 0.00% … 21.18%
 Time  (median):     231.250 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   389.808 μs ± 786.346 μs  ┊ GC (mean ± σ):  0.54% ±  0.28%

  █▃▆▂
  ████▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▄▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▁▁ ▂
  125 μs           Histogram: frequency by time          819 μs <

 Memory estimate: 16.00 KiB, allocs estimate: 218.

@huiyuxie
Member Author

Here we use MHD mortar 3D as an example for benchmarking.

cuda_mortar_flux! with nonconservative_terms::False before kernel combination

BenchmarkTools.Trial: 3956 samples with 1 evaluation.
 Range (min … max):  746.100 μs …  33.295 ms  ┊ GC (min … max): 0.00% … 21.74%
 Time  (median):       1.374 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.259 ms ± 644.034 μs  ┊ GC (mean ± σ):  0.15% ±  0.35%

     █▄▁    ▁
  ▃▆████▆▆▇▇█▅▄▃▃▂▂▂▂▂▂▂▁▂▂▁▂▁▂▁▁▃▆▆▅▆▇▆▅▅▅▅▃▃▃▃▄▆▆▆▆▆▅▄▄▄▄▃▃▃▂ ▄
  746 μs           Histogram: frequency by time         1.92 ms <

 Memory estimate: 20.81 KiB, allocs estimate: 268.

cuda_mortar_flux! with nonconservative_terms::False after kernel combination

 BenchmarkTools.Trial: 5731 samples with 1 evaluation.
 Range (min … max):  403.400 μs …  37.955 ms  ┊ GC (min … max): 0.00% … 21.76%
 Time  (median):     892.700 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   864.990 μs ± 731.922 μs  ┊ GC (mean ± σ):  0.28% ±  0.37%

   █▃▂ ▁▄▅▂                  ▂        ▁
  ▇███▇█████▇▆▅▄▄▄▃▃▃▂▂▂▁▂▁▁▄█▇▇██▇▇▆██▇▄▄▃▂▃▂▃▄▅▅▅▇▆▇▆▆▇▇▆▅▄▃▂ ▄
  403 μs           Histogram: frequency by time         1.45 ms <

 Memory estimate: 16.34 KiB, allocs estimate: 221.
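
To compare the three benchmarks at a glance, the median times reported above can be turned into before/after ratios. The snippet below is only a convenience sketch using the medians copied from the output above; the labels in parentheses are added here for readability.

```julia
# Median times in microseconds, copied from the BenchmarkTools output above:
# (label, before combination, after combination)
medians = [
    ("cuda_prolong2mortars! (Euler)", 111.2, 112.9),
    ("cuda_mortar_flux! (Euler)", 980.6, 231.25),
    ("cuda_mortar_flux! (MHD)", 1374.0, 892.7),
]

# Ratio > 1 means the combined kernel has the lower median time.
speedups = [(name, before / after) for (name, before, after) in medians]

for (name, s) in speedups
    println(rpad(name, 32), round(s; digits = 2), "x")
end
```

By the medians, cuda_prolong2mortars! is essentially unchanged (about 0.98x), while cuda_mortar_flux! improves by roughly 4.24x for Euler and 1.54x for MHD. The runs show large maxima and differing sample counts, so these ratios are indicative rather than precise.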

@huiyuxie huiyuxie closed this Dec 21, 2024
@huiyuxie huiyuxie reopened this Dec 21, 2024
@huiyuxie huiyuxie merged commit ccb6d67 into main Dec 21, 2024
7 checks passed
@huiyuxie huiyuxie deleted the optimize branch December 24, 2024 05:06