Add caching allocator interface #576

Merged: 28 commits from pxl-th/cache-alloc into master on Jan 9, 2025

Conversation

@pxl-th (Member) commented on Dec 15, 2024:

Since Julia's GC is not aware of GPU memory, in scenarios with lots of allocations we end up either in OOM situations or with excessively high memory usage, even though the program may require only a fraction of it.

To improve GPU memory utilization in a program with repeating blocks of code, we can wrap those regions in a scope that uses a caching allocator every time the program enters it during execution.

This is especially useful when training models, where each iteration computes the loss, the gradients w.r.t. the loss, and performs an in-place parameter update of the model:

cache = GPUArrays.AllocCache()
model = ...
for i in 1:1000
    GPUArrays.@enable cache begin
        loss, grads = ...
        update!(optimizer, model, grads)
    end
end

Caching is keyed on (ArrayType, current device, eltype, dims[, buffer type]).
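As a rough sketch of what such a keyed cache can look like (illustrative only, not the PR's implementation; `SimpleAllocCache`, `free_buffers`, and `cached_allocate` are made-up names):

```julia
# Illustrative cache keyed on (ArrayType, device, eltype, dims): reuse a free
# buffer with a matching key, otherwise fall back to a fresh allocation.
struct SimpleAllocCache
    lock::ReentrantLock
    free_buffers::Dict{Any, Vector{Any}}   # key => buffers available for reuse
    busy_buffers::Vector{Any}              # buffers handed out in the current scope
end
SimpleAllocCache() = SimpleAllocCache(ReentrantLock(), Dict{Any, Vector{Any}}(), Any[])

function cached_allocate(cache::SimpleAllocCache, key, allocate::Function)
    lock(cache.lock) do
        pool = get!(Vector{Any}, cache.free_buffers, key)
        x = isempty(pool) ? allocate() : pop!(pool)
        push!(cache.busy_buffers, x)   # keep the buffer alive until the scope ends
        return x
    end
end
```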

Example

In the following example we apply the caching allocator at every iteration of the for-loop.
Every iteration requires 8 GiB of GPU memory; without the caching allocator the GC wouldn't be able to free the arrays in time, resulting in higher memory usage.
With the caching allocator, memory usage stays at exactly 8 GiB.

After the loop, we free all cached memory, if there is any.
Alternatively, it is freed automatically when the cache is collected by the GC.

cache = GPUArrays.AllocCache(CuArray)
n = 1024^3
CUDA.@sync for i in 1:1000
    GPUArrays.@enable cache begin
        sin.(CUDA.rand(Float32, n))
    end
end
GPUArrays.unsafe_free!(cache)

Performance impact

Executing the GaussianSplatting.jl benchmark (1k training iterations) on an RX 7900 XTX:

| | Without caching allocator | With caching allocator |
| --- | --- | --- |
| GPU memory utilization | (screenshot) | (screenshot) |
| Time | 59.656476 seconds | 46.365646 seconds |

TODO

  • Support for 1.10.
  • Support bulk-freeing instead of caching.
  • Add PR description.
  • Documentation.
  • Tests.

@maleadt (Member) left a review comment:

Could you add some high-level design description to the PR?

As I mentioned on Slack, CUDA already has a caching allocator, so I'm not sure if for those back-ends this shouldn't boil down to basically batch-calling unsafe_free! at the end of each iteration, instead of actively caching arrays. Would be good to compare performance, if possible.
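For reference, the batch-freeing alternative suggested here could be as simple as the following sketch (made-up names, not part of the PR): record every allocation made inside the scope and `unsafe_free!` all of them when the iteration ends, letting the back-end's own pool do the caching.

```julia
# Sketch of bulk-freeing instead of caching (illustrative only).
const deferred_buffers = Any[]          # arrays allocated inside the scope

record!(x) = (push!(deferred_buffers, x); x)

function free_all!()
    foreach(GPUArrays.unsafe_free!, deferred_buffers)   # return memory to the back-end pool
    empty!(deferred_buffers)
end
```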

@pxl-th (Member, Author) commented on Dec 16, 2024:

> Could you add some high-level design description to the PR?
>
> As I mentioned on Slack, CUDA already has a caching allocator, so I'm not sure if for those back-ends this shouldn't boil down to basically batch-calling unsafe_free! at the end of each iteration, instead of actively caching arrays. Would be good to compare performance, if possible.

Yeah, I'm planning to add both a detailed PR description and documentation,
and to allow batch-freeing instead of caching the arrays (which can be just an option in the caching allocator).

@pxl-th (Member, Author) commented on Dec 18, 2024:

@maleadt, I've updated the PR.
Also, I've added tests, but they are not enabled right now because no backend currently has the implementation merged (including JLArrays, because the tests use its released version).
However, they pass locally on my machines.

Let me know what you think.

pxl-th force-pushed the pxl-th/cache-alloc branch from 051cd6d to d6a74b0 on January 6, 2025
pxl-th self-assigned this on Jan 6, 2025
pxl-th requested a review from maleadt on January 6, 2025
@pxl-th (Member, Author) commented on Jan 7, 2025:

One difference I've found between Julia 1.10 and Julia 1.11:

  • Julia 1.10:
julia> GPUArrays.AllocCache.@enable CuArray :loop begin
           x1 = CuArray(rand(Float32, 1))
       end
1-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 0.680597

julia> x1
ERROR: UndefVarError: `x1` not defined
  • Julia 1.11:
julia> GPUArrays.AllocCache.@enable CuArray :loop begin
           x1 = CuArray(rand(Float32, 1))
       end
1-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 0.7703809

julia> x1
1-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 0.7703809

Not sure where it is coming from.

maleadt force-pushed the pxl-th/cache-alloc branch from bc6dcd7 to ee377ea on January 8, 2025
@github-actions (bot) reviewed: Runic suggested formatting changes (on docs/make.jl, lib/JLArrays/src/JLArrays.jl, and src/host/allocations_cache.jl).
@maleadt (Member) commented on Jan 8, 2025:

> One difference I've found between Julia 1.10 and Julia 1.11

Hmm, that seems problematic. Macros should not introduce scope:

❯ jl +1.10
julia> @time begin
       x1 = []
       end
  0.000002 seconds (1 allocation: 48 bytes)
Any[]

julia> x1
Any[]

@pxl-th (Member, Author) commented on Jan 8, 2025:

ScopedValues.jl on Julia 1.10 introduces a scope:

julia> using ScopedValues

julia> x = ScopedValue(1)
ScopedValue{Int64}(1)

julia> @with x => 2 begin
           x2 = x[]
           x3 = 1
       end
1

julia> x2
ERROR: UndefVarError: `x2` not defined

@maleadt (Member) commented on Jan 8, 2025:

Another fundamental question (sorry for stretching this out): Why do you even care about the array type in the @enable interface? Wouldn't it be better if the user didn't have to worry about this? The cache can be sharded internally so that back-ends can invalidate! only their portion (e.g. from within a retry_reclaim hook), but I don't see why it has to be provided.

Maybe the cache name should be optional as well. It could default to something derived from the current task's name, so that it's really convenient to do:

AllocCache.@enable begin
  for i in epochs
    ...
  end
end
AllocCache.invalidate!()

Just spitballing here, you probably have a better view regarding it based on your experiments with it already.


Seeing the above written out, I wonder if a wholly different API wouldn't be much more idiomatic, reifying the now-implicit stuff like the name of the cache:

cache = AllocCache()
cache() do
  for i in epochs
    ...
  end
end
empty!(cache)

A finalizer could then also empty the cache, avoiding the risk of leaking memory if you forget to empty! it.
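A minimal sketch of that finalizer idea (assuming buffers that support `unsafe_free!`; this is not the merged code):

```julia
# Sketch: free any cached buffers when the cache object itself is collected,
# so forgetting to call empty! does not leak GPU memory.
mutable struct FinalizableCache
    buffers::Vector{Any}
    function FinalizableCache()
        cache = new(Any[])
        finalizer(cache) do c
            foreach(GPUArrays.unsafe_free!, c.buffers)
        end
        return cache
    end
end
```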

@github-actions (bot) left three successive reviews: Runic suggested formatting changes on src/host/alloc_cache.jl.
@pxl-th (Member, Author) commented on Jan 8, 2025:

> Seeing the above written out, I wonder if a wholly different API wouldn't be much more idiomatic, reifying the now-implicit stuff like the name of the cache:

@maleadt, I've updated the implementation based on this; see the examples in the PR description for a TL;DR.
It will now also free all memory in the finalizer.

@github-actions (bot) reviewed: Runic suggested formatting changes on src/host/alloc_cache.jl.
@maleadt (Member) commented on Jan 9, 2025:

Looks good! Thanks for keeping up with my review requests.

CI failures look related though.

pxl-th and others added 6 commits on January 9, 2025 (several co-authored with github-actions[bot])
maleadt force-pushed the pxl-th/cache-alloc branch from 6c0962e to 09818a1 on January 9, 2025
@github-actions (bot) reviewed: Runic suggested formatting changes on lib/JLArrays/src/JLArrays.jl.
maleadt force-pushed the pxl-th/cache-alloc branch from 09818a1 to 9960b52 on January 9, 2025
@github-actions (bot) reviewed: Runic suggested formatting changes on lib/JLArrays/src/JLArrays.jl.
@maleadt (Member) commented on Jan 9, 2025:

Pushed a simplification to the back-end interface to avoid having to reach into the scoped value. The only cost is having to allocate the key tuple unconditionally, but I think that should be fine.
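If I read this right, the back-end side then reduces to something like the sketch below (names such as `cached_alloc` and `current_device` are assumptions here, not the exact interface): the key tuple is built unconditionally, and a single helper decides whether an active cache should serve the request.

```julia
# Hypothetical back-end hook: always build the key, let the helper consult
# the (scoped) cache and fall back to a plain allocation when none is enabled.
function backend_allocate(::Type{AT}, ::Type{T}, dims::Dims) where {AT, T}
    key = (AT, current_device(), T, dims)   # allocated unconditionally
    return cached_alloc(key) do
        AT{T}(undef, dims)
    end
end
```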

@pxl-th (Member, Author) commented on Jan 9, 2025:

Looks good!

@maleadt (Member) commented on Jan 9, 2025:

The show method is broken:

julia> cache = GPUArrays.AllocCache(CuArray)
GPUArrays.AllocCache{CuArray}(Error showing value of type GPUArrays.AllocCache{CuArray}:
ERROR: TypeError: in typeassert, expected Int64, got a value of type UInt64

Would probably be useful if it showed some stats.
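The error suggests a `UInt64` byte count being typeasserted as `Int64` somewhere in `show`; a defensive version with some stats might look like this sketch (the `free` field and stats layout are assumptions, not the actual struct):

```julia
# Illustrative only: report buffer count and total bytes without asserting Int64.
function Base.show(io::IO, cache::AllocCache)
    nbuffers = sum(length, values(cache.free); init = 0)
    nbytes = sum(x -> UInt64(sizeof(x)), Iterators.flatten(values(cache.free)); init = UInt64(0))
    print(io, "AllocCache(", nbuffers, " buffers, ", Base.format_bytes(nbytes), ")")
end
```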

@pxl-th (Member, Author) commented on Jan 9, 2025:

> The show method is broken:
>
> julia> cache = GPUArrays.AllocCache(CuArray)
> GPUArrays.AllocCache{CuArray}(Error showing value of type GPUArrays.AllocCache{CuArray}:
> ERROR: TypeError: in typeassert, expected Int64, got a value of type UInt64
>
> Would probably be useful if it showed some stats.

Fixed.

@github-actions (bot) reviewed: Runic suggested formatting changes on src/host/alloc_cache.jl.
@maleadt (Member) left a review comment:
Alright, I think we arrived at something great. Thanks for the PR! Let's merge and tag.

maleadt merged commit c869cc3 into master on Jan 9, 2025
maleadt deleted the pxl-th/cache-alloc branch on January 9, 2025, 11:56
maleadt referenced this pull request on Jan 9, 2025
@maleadt (Member) commented on Jan 9, 2025:

Just saw this:

      From worker 2:	error in running finalizer: ErrorException("Invalidating allocations cache that's currently in use. Invalidating inside `@cached` is not allowed.")
      From worker 2:	error at ./error.jl:35
      From worker 2:	macro expansion at /home/tim/Julia/pkg/GPUArrays/src/host/alloc_cache.jl:75 [inlined]
      From worker 2:	macro expansion at ./lock.jl:273 [inlined]
      From worker 2:	unsafe_free! at /home/tim/Julia/pkg/GPUArrays/src/host/alloc_cache.jl:73
      From worker 2:	unknown function (ip: 0x7385147b5df2)
      From worker 2:	run_finalizer at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/gc.c:299
      From worker 2:	jl_gc_run_finalizers_in_list at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/gc.c:391

@cached should probably root the cache until it completes?
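One way to root it (just a sketch of the idea, not necessarily how the follow-up fix works) is to keep the cache reachable for the whole block, e.g. with `GC.@preserve`, so its finalizer cannot fire while the scope is still active:

```julia
cache = GPUArrays.AllocCache()
GC.@preserve cache begin            # cache stays alive; its finalizer cannot run mid-scope
    GPUArrays.@enable cache begin
        # allocations served through the cache
    end
end
```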

@pxl-th (Member, Author) commented on Jan 9, 2025:

Oh... right

pxl-th referenced this pull request in JuliaGPU/AMDGPU.jl Jan 9, 2025