Add caching allocator interface #576
Conversation
Could you add some high-level design description to the PR?

As I mentioned on Slack, CUDA already has a caching allocator, so I'm not sure whether for those back-ends this shouldn't boil down to basically batch-calling `unsafe_free!` at the end of each iteration, instead of actively caching arrays. Would be good to compare performance, if possible.
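(For reference, a minimal sketch of that batch-freeing alternative, assuming CUDA.jl; only `CUDA.unsafe_free!` is an existing API here, the `freelist` bookkeeping is hypothetical:)

```julia
using CUDA

# Hypothetical bookkeeping: record the temporaries allocated in an iteration,
# then batch-free them at the end instead of caching the arrays for reuse.
freelist = CuArray[]

for epoch in 1:100
    x = CUDA.rand(Float32, 1024, 1024)
    push!(freelist, x)
    # ... use x ...
    foreach(CUDA.unsafe_free!, freelist)  # batch-free at end of iteration
    empty!(freelist)
end
```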
Yeah, I'm planning to add both a detailed PR description and documentation.
@maleadt, I've updated the PR. Let me know what you think.
Force-pushed from 051cd6d to d6a74b0
One difference I've found between Julia 1.10 and Julia 1.11. On one version, `x1` does not leak out of the block:

```julia
julia> GPUArrays.AllocCache.@enable CuArray :loop begin
           x1 = CuArray(rand(Float32, 1))
       end
1-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 0.680597

julia> x1
ERROR: UndefVarError: `x1` not defined
```

On the other, it does:

```julia
julia> GPUArrays.AllocCache.@enable CuArray :loop begin
           x1 = CuArray(rand(Float32, 1))
       end
1-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 0.7703809

julia> x1
1-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 0.7703809
```

Not sure where it's coming from.
Force-pushed from bc6dcd7 to ee377ea
Hmm, that seems problematic. Macros should not introduce scope:

```julia
❯ jl +1.10

julia> @time begin
           x1 = []
       end
  0.000002 seconds (1 allocation: 48 bytes)
Any[]

julia> x1
Any[]
```
```julia
julia> using ScopedValues

julia> x = ScopedValue(1)
ScopedValue{Int64}(1)

julia> @with x => 2 begin
           x2 = x[]
           x3 = 1
       end
1

julia> x2
ERROR: UndefVarError: `x2` not defined
```
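(This behavior is consistent with `@with` passing its body as a closure to `with` on older Julia versions; that is an assumption, but the hand-written closure form below reproduces the same scoping:)

```julia
using ScopedValues

x = ScopedValue(1)
# The do-block is a closure, so `x2` is local to it and invisible afterwards.
result = with(x => 2) do
    x2 = x[]
    x3 = 1
end
result  # 1
```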
Another fundamental question (sorry for stretching this out): why do you even care about the array type in the cache key?

Maybe the cache name should be optional as well. It could default to something derived from the current task's name, so that it's really convenient to do:

```julia
AllocCache.@enable begin
    for i in epochs
        # ...
    end
end
AllocCache.invalidate!()
```

Just spitballing here, you probably have a better view regarding it based on your experiments with it already.

Seeing the above written out, I wonder if a wholly different API wouldn't be much more idiomatic, reifying the now-implicit stuff like the name of the cache:

```julia
cache = AllocCache()
cache() do
    for i in epochs
        # ...
    end
end
empty!(cache)
```

A finalizer could then also empty the cache, avoiding the risk of leaking memory if you forget to `empty!` it.
@maleadt, I've updated the implementation based on this; see the examples in the PR description for a TL;DR.
Looks good! Thanks for keeping up with my review requests. CI failures look related, though.
Force-pushed from 6c0962e to 09818a1
Force-pushed from 09818a1 to 9960b52
Pushed a simplification to the back-end interface to avoid having to reach into the scoped value. The only cost is having to allocate the key tuple unconditionally, but I think that should be fine.
Looks good!
The `show` method errors:

```julia
julia> cache = GPUArrays.AllocCache(CuArray)
GPUArrays.AllocCache{CuArray}(Error showing value of type GPUArrays.AllocCache{CuArray}:
ERROR: TypeError: in typeassert, expected Int64, got a value of type UInt64
```

Would probably be useful if it showed some stats.
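(For illustration, a hypothetical stats-printing `show`; the `pool` field name is assumed and this is not the PR's actual fix:)

```julia
# Assumes buffers are stored in a `pool::Dict{Any, Vector{Any}}` field.
function Base.show(io::IO, cache::AllocCache)
    nbufs = sum(length, values(cache.pool); init = 0)
    nbytes = sum(sizeof, Iterators.flatten(values(cache.pool)); init = 0)
    print(io, "AllocCache(", nbufs, " buffers, ", Base.format_bytes(nbytes), ")")
end
```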
Fixed.
Alright, I think we arrived at something great. Thanks for the PR! Let's merge and tag.
Just saw this:

Oh... right
Since Julia's GC is not aware of GPU memory, in scenarios with lots of allocations we end up either in OOM situations or with excessively high memory usage, even though the program may require only a fraction of it.

To improve GPU memory utilization in programs with repeating blocks of code, we can wrap those regions in a scope that uses the caching allocator every time the program enters it. This is especially useful when training models, where you compute the loss, the gradients w.r.t. the loss, and perform an in-place parameter update of the model.
Caching is keyed on:

`(ArrayType, current device, eltype, dims[, buffer type])`
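(A minimal sketch of the bucketing idea behind this key; all names here are hypothetical, and CPU `Array` stands in for a GPU array type:)

```julia
# Free buffers are bucketed by a tuple key, so a request with the same array
# type, device, eltype, and dims can reuse a previously cached buffer.
device_id() = 0   # stand-in; a real backend queries the active GPU

const POOL = Dict{Tuple, Vector{Any}}()

function cached_alloc(::Type{AT}, ::Type{T}, dims::Dims) where {AT, T}
    key = (AT, device_id(), T, dims)
    bucket = get!(Vector{Any}, POOL, key)
    isempty(bucket) ? AT{T}(undef, dims) : pop!(bucket)
end

x = cached_alloc(Array, Float32, (2, 2))     # allocates a fresh buffer
push!(POOL[(Array, 0, Float32, (2, 2))], x)  # return it to the cache
y = cached_alloc(Array, Float32, (2, 2))     # reuses `x`
```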
Example
In the following example we apply the caching allocator at every iteration of the for-loop. Every iteration requires 8 GiB of GPU memory; without the caching allocator the GC wouldn't be able to free arrays in time, resulting in higher memory usage. With the caching allocator, memory usage stays at exactly 8 GiB. After the loop, we free all cached memory if there is any; alternatively, it will be freed automatically when the cache is collected by the GC.
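(The example code itself didn't survive extraction; below is a reconstruction consistent with the description, two 4 GiB buffers per iteration, using the `AllocCache` type from this PR and a `@cached`-style scoping macro whose exact name may differ in the merged version:)

```julia
using CUDA, GPUArrays

cache = GPUArrays.AllocCache()
n = 1024^3  # 1024^3 Float32s = 4 GiB
for epoch in 1:1000
    # Assumed scoping macro: allocations inside the block go through `cache`.
    GPUArrays.@cached cache begin
        sin.(CUDA.rand(Float32, n))  # rand output + sin output = 8 GiB total
    end
end
# Free the cached memory now; otherwise it is freed when `cache` is GC'ed.
GPUArrays.unsafe_free!(cache)
```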
Performance impact

Executing the GaussianSplatting.jl benchmark (1k training iterations) on an RX 7900 XTX:

| Without caching allocator | With caching allocator |
|---------------------------|------------------------|
| 59.656476 seconds         | 46.365646 seconds      |

TODO