Better support for unified and host memory #2138
Conversation
Instead of tracking a single
Codecov Report

```
@@            Coverage Diff             @@
##           master    #2138      +/-   ##
==========================================
+ Coverage   72.21%   72.29%   +0.08%
==========================================
  Files         159      159
  Lines       14340    14444     +104
==========================================
+ Hits        10356    10443      +87
- Misses       3984     4001      +17
```

View full report in Codecov by Sentry.
```julia
if is_unified(xs) && sizeof(xs) > 0 && !is_capturing()
    buf = xs.data[]
    subbuf = Mem.UnifiedBuffer(buf.ctx, pointer(xs), sizeof(xs))
    Mem.prefetch(subbuf)
```
@pmccormick you were asking me about this yesterday.
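For context, a minimal sketch of what this prefetching amounts to from the user's side. Note the `unified=true` keyword is an assumption here, not confirmed by this diff; `xs.data[]` and `Mem.prefetch` follow the snippet above:

```julia
using CUDA

# Sketch: prefetch a unified array's backing buffer to the active device
# before launching work on it, so the kernel doesn't fault pages on demand.
xs = cu(rand(Float32, 1024); unified=true)  # `unified=true` is an assumption
buf = xs.data[]                             # underlying Mem.UnifiedBuffer
Mem.prefetch(buf)                           # migrate pages to the device
CUDA.@sync xs .+= 1f0                       # kernel touches resident memory
```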
src/array.jl (Outdated)
```julia
haskey(tls, :CUDA_ASYNC_BUFFERS) || return
async_buffers = tls[:CUDA_ASYNC_BUFFERS]::Vector{Mem.UnifiedBuffer}
```
I really need to finish TaskLocalValues.jl xD
This PR includes a couple of unified memory improvements that are long overdue. With most vendors providing mature unified memory architectures, we really should be exploring defaulting to unified memory, as it greatly helps with two major issues that users run into: scalar indexing errors due to AbstractArray fallbacks, and out-of-memory situations.
Scalar iteration
Building on JuliaGPU/GPUArrays.jl#499, CUDA.jl now implements efficient scalar indexing for unified arrays. The performance difference compared to old-style `@allowscalar` is huge.
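To illustrate the difference, a hedged sketch; the `unified=true` keyword is assumed from this PR, and no timings are claimed here:

```julia
using CUDA

# Device-memory array: scalar reads are disallowed by default, and each
# `@allowscalar` access performs a synchronizing device-to-host copy.
xs = CuArray(collect(1:1024))
CUDA.@allowscalar xs[1]       # slow path: one copy per element access

# Unified-memory array: the host can dereference the memory directly,
# so scalar indexing only needs a synchronization check, not a copy.
ys = cu(collect(1:1024); unified=true)  # keyword assumed
ys[1]                                   # fast path: plain host-side load
```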
The `dirty` bit is a bit naive though, as it breaks down when multiple tasks access a single array concurrently. Maybe that's fine, though, and users ought to perform manual synchronization when using multiple tasks or streams.

Default unified memory
I also added a preference to switch the default memory type to `unified`, and added a CI job to test whether the test suite supports that. We should consider whether we actually want to switch: for one, unified memory doesn't have a stream-ordered allocator, so it may make allocations and GC pauses worse. At the same time, it would allow the user to use much more memory, so maybe that helps with GC pressure...

TODO: consider adding a MemAdvise or Prefetch (see https://developer.nvidia.com/blog/maximizing-unified-memory-performance-cuda/)
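If the preference is wired through Preferences.jl, enabling it might look like the following. The `default_memory` key is an assumption; check the PR diff for the actual preference name:

```julia
using Preferences, CUDA

# Hypothetical: persist the memory-type preference for CUDA.jl in
# LocalPreferences.toml; it takes effect on the next Julia session.
set_preferences!(CUDA, "default_memory" => "unified")
```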
cc @vchuravy