
Better support for unified and host memory #2138

Merged · 19 commits · Nov 1, 2023

Conversation

maleadt (Member) commented Oct 31, 2023

This PR includes a couple of unified memory improvements that are long overdue. With most vendors providing mature unified memory architectures, we really should explore defaulting to unified memory, as it greatly helps with two major issues that users run into: scalar indexing errors due to AbstractArray fallbacks, and out-of-memory situations.

Scalar iteration

Building on JuliaGPU/GPUArrays.jl#499, CUDA.jl now implements efficient scalar indexing for unified arrays. The performance difference compared to the old-style `@allowscalar` is huge:

```julia
julia> a = cu([1]; unified=true);

julia> a[1]
1

julia> @benchmark a[1]
BenchmarkTools.Trial: 10000 samples with 998 evaluations.
 Range (min … max):  13.145 ns … 29.890 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     13.497 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.521 ns ±  0.357 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▂            ▁             ▃▃  ▃  █▇  ▅▇▄ ▆▃ ▂▄         ▁  ▂
  ▇█▄▃█▇▁▁▃▁▃▁▁▃█▄▁▆▄▇▁▇█▁▃▇▆▁██▄▇██▁██▆████▄██▅███▁▁▁▁▃▁▁▁█▆ █
  13.1 ns      Histogram: log(frequency) by time      13.7 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

vs.

```julia
julia> b = cu([1]);

julia> b[1]
ERROR: Scalar indexing is disallowed.

julia> CUDA.@allowscalar b[1]
1

julia> @benchmark CUDA.@allowscalar b[1]
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  11.910 μs … 107.569 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     14.620 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.354 μs ±   2.073 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▂▁                         ▁▃▆█▄▁
  ▂▂▂▂▃▃▄▆██▇▄▄▄▄▃▃▃▂▂▂▂▂▁▂▂▂▂▂▂▃▃▅▆▇██████▇▆▆▇▆▆▄▄▄▄▄▄▃▃▃▂▃▂▂ ▄
  11.9 μs         Histogram: frequency by time         16.2 μs <

 Memory estimate: 80 bytes, allocs estimate: 2.
```

The dirty bit is a bit naive, though: it breaks down when multiple tasks access a single array concurrently. Maybe that's fine, and users ought to perform manual synchronization when using multiple tasks or streams, as the sketch below illustrates.
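For illustration, a minimal sketch of the hazard in hypothetical user code (each Julia task launches work on its own stream, so a single per-buffer dirty bit cannot tell whose work is still in flight):

```julia
using CUDA

a = cu(zeros(Int, 2); unified=true)

# Two tasks queue kernels on the same unified array, each on its own stream.
t1 = Threads.@spawn fill!(view(a, 1:1), 1)
t2 = Threads.@spawn fill!(view(a, 2:2), 2)
wait(t1); wait(t2)

# Manual device synchronization before reading on the CPU is the safe
# pattern when multiple tasks/streams touch one array.
synchronize()
@show a[1] a[2]
```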

Default unified memory

I also added a preference to switch the default memory type to unified, and a CI job that tests whether the test suite supports that. We should consider whether we actually want to switch: for one, unified memory doesn't have a stream-ordered allocator, so it may make allocations and GC pauses worse. At the same time, it would allow the user to use much more memory, so maybe that helps with GC pressure.
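As a sketch of how opting in could look from the user side (the preference key `default_memory` is an assumption here; check the diff for the actual name and accepted values):

```julia
using Preferences, CUDA

# Persist the preference in LocalPreferences.toml; it takes effect after
# restarting Julia. The key name "default_memory" is assumed, not confirmed.
set_preferences!(CUDA, "default_memory" => "unified")

# Per-array opt-in remains available regardless of the global default:
a = cu([1]; unified=true)
```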

TODO: consider adding support for `Mem.advise` (cuMemAdvise) or prefetching (see https://developer.nvidia.com/blog/maximizing-unified-memory-performance-cuda/)
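For reference, a minimal sketch of manual placement hints using the existing driver wrappers; `xs.data[]` fetching the underlying buffer mirrors the diff excerpt later in this thread, while the advise constant is my assumption and may differ across versions:

```julia
using CUDA

a = cu(rand(Float32, 1024); unified=true)
buf = a.data[]  # underlying Mem.UnifiedBuffer, as in the diff excerpt below

# Advise the driver that these pages are mostly read, then migrate them
# to the current device ahead of the first kernel that touches them.
Mem.advise(buf, Mem.ADVISE_SET_READ_MOSTLY)
Mem.prefetch(buf)
```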

cc @vchuravy

maleadt changed the title from "Allow scalar indexing with unified arrays." to "Unified memory improvements" on Oct 31, 2023.
maleadt marked this pull request as ready for review on October 31, 2023 at 19:45.
maleadt (Member, Author) commented Oct 31, 2023

Instead of tracking a single dirty bit per buffer, which can lead to wrong results when using multiple tasks, I switched to keeping track of in-flight asynchronous buffers in task-local storage (TLS). That halves the performance of scalar indexing, but at 23 ns instead of 10 μs it's still a huge improvement. I think I'll start with this approach and optimize it along the way, instead of starting with a questionable but fast design. A simplified sketch of the idea is below.
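An illustrative sketch of that bookkeeping, not the actual implementation (which lives in src/array.jl); the function names here are made up, only the `:CUDA_ASYNC_BUFFERS` key appears in the diff:

```julia
using CUDA

# On kernel launch: remember that this unified buffer may have writes in flight.
function mark_async(buf::Mem.UnifiedBuffer)
    buffers = get!(Vector{Mem.UnifiedBuffer}, task_local_storage(),
                   :CUDA_ASYNC_BUFFERS)
    push!(buffers, buf)
end

# On scalar access: only synchronize if this buffer is actually in flight.
function ensure_synchronized(buf::Mem.UnifiedBuffer)
    tls = task_local_storage()
    haskey(tls, :CUDA_ASYNC_BUFFERS) || return
    buffers = tls[:CUDA_ASYNC_BUFFERS]::Vector{Mem.UnifiedBuffer}
    if buf in buffers
        synchronize()      # wait for this task's outstanding work
        empty!(buffers)    # all of it is now visible on the host
    end
end
```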

codecov bot commented Oct 31, 2023

Codecov Report

Attention: 30 lines in your changes are missing coverage. Please review.

Comparison: base (b917e1c) at 72.21% vs. head (f377f81) at 72.29%.
The report is 3 commits behind head on master.

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master    #2138      +/-   ##
==========================================
+ Coverage   72.21%   72.29%   +0.08%
==========================================
  Files         159      159
  Lines       14340    14444     +104
==========================================
+ Hits        10356    10443      +87
- Misses       3984     4001      +17
```
| File | Coverage | Δ |
|---|---|---|
| src/pool.jl | 69.31% <100.00%> | (+0.77%) ⬆️ |
| src/texture.jl | 88.88% <100.00%> | (ø) |
| src/utilities.jl | 84.21% <100.00%> | (+1.19%) ⬆️ |
| lib/cusparse/array.jl | 67.73% <85.71%> | (ø) |
| src/compiler/execution.jl | 86.92% <92.85%> | (+2.92%) ⬆️ |
| src/array.jl | 80.44% <76.66%> | (-2.17%) ⬇️ |

... and 2 files with indirect coverage changes


```julia
if is_unified(xs) && sizeof(xs) > 0 && !is_capturing()
    buf = xs.data[]
    # wrap just this array's region of the unified buffer, so only those
    # pages get migrated to the device ahead of the kernel launch
    subbuf = Mem.UnifiedBuffer(buf.ctx, pointer(xs), sizeof(xs))
    Mem.prefetch(subbuf)
```
A Member commented:
@pmccormick you were asking me about this yesterday.

src/array.jl (outdated), comment on lines 381 to 382:
```julia
haskey(tls, :CUDA_ASYNC_BUFFERS) || return
async_buffers = tls[:CUDA_ASYNC_BUFFERS]::Vector{Mem.UnifiedBuffer}
```
A Member commented:
I really need to finish TaskLocalValues.jl xD

maleadt changed the title from "Unified memory improvements" to "Better support for unified and host memory" on Nov 1, 2023.