I have a case where the inclusion of a try-catch in a for loop is causing CUDA OOM.
Rather nastily, the try-catch was actually introduced by an @info "epoch $epoch_num" log, which wraps the evaluation of the log message in a try-catch because of the interpolation. The issue doesn't happen for @info "epoch".
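For context, the difference between the two forms shows up if you macro-expand them (a sketch, output elided): with interpolation the generated code evaluates the log message inside a try/catch, whereas the plain literal form does not.

julia> @macroexpand @info "epoch $epoch_num"   # message evaluation is wrapped in a try/catch
julia> @macroexpand @info "epoch"              # string literal; no try/catch around the message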
This is with CUDA.jl dev-ed and logging added as suggested by @maleadt
julia> include("effnet_train.jl")
julia> @time train(limit=3, gpu_gc=false, gpu_stats=true)
[ Info: loading CIFAR-10 dataset
[ Info: loading EfficientNetv2 model
[ Info: starting training
Effective GPU memory usage: 99.55% (11.688 GiB/11.741 GiB)
Memory pool usage: 8.505 GiB (10.406 GiB reserved)
WARNING: CUDA memory allocation failed; attempting to free up memory...
... waiting for pending frees (freed 0 bytes)
... waiting for all pending frees (freed 0 bytes)
... running a quick GC (freed 8.292 GiB)
Effective GPU memory usage: 99.55% (11.688 GiB/11.741 GiB)
Memory pool usage: 8.492 GiB (10.406 GiB reserved)
training epoch 1/45   4%|███▊ | ETA: 0:00:09
WARNING: CUDA memory allocation failed; attempting to free up memory...
... waiting for pending frees (freed 0 bytes)
... waiting for all pending frees (freed 0 bytes)
... running a quick GC (freed 8.279 GiB)
Effective GPU memory usage: 99.55% (11.688 GiB/11.741 GiB)
Memory pool usage: 8.492 GiB (10.406 GiB reserved)
training epoch 1/45   6%|█████▊ | ETA: 0:00:09
WARNING: CUDA memory allocation failed; attempting to free up memory...
... waiting for pending frees (freed 0 bytes)
... waiting for all pending frees (freed 0 bytes)
... running a quick GC (freed 8.279 GiB)
Effective GPU memory usage: 99.54% (11.687 GiB/11.741 GiB)
Memory pool usage: 8.492 GiB (10.406 GiB reserved)
training epoch 1/45 100%|███████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:00
[ Info: epoch 1 complete. Testing...
WARNING: CUDA memory allocation failed; attempting to free up memory...
... waiting for pending frees (freed 0 bytes)
... waiting for all pending frees (freed 0 bytes)
... running a quick GC (freed 9.974 GiB)
WARNING: CUDA memory allocation failed; attempting to free up memory...
... waiting for pending frees (freed 0 bytes)
... waiting for all pending frees (freed 0 bytes)
... running a quick GC (freed 9.917 GiB)
WARNING: CUDA memory allocation failed; attempting to free up memory...
... waiting for pending frees (freed 0 bytes)
WARNING: CUDA memory allocation failed; attempting to free up memory...
... waiting for pending frees (freed 0 bytes)
... waiting for all pending frees (freed 0 bytes)
... running a quick GC (freed 9.818 GiB)
WARNING: CUDA memory allocation failed; attempting to free up memory...
... waiting for pending frees (freed 0 bytes)
... waiting for all pending frees (freed 0 bytes)
... running a quick GC (freed 9.834 GiB)
[ Info: (train_loss = 2.303f0, train_acc = 0.1, test_loss = 2.303f0, test_acc = 0.099)
WARNING: CUDA memory allocation failed; attempting to free up memory...
... waiting for pending frees (freed 0 bytes)
WARNING: CUDA memory allocation failed; attempting to free up memory...
... waiting for pending frees (freed 0 bytes)
... waiting for all pending frees (freed 0 bytes)
... running a quick GC (freed 5.953 GiB)
Effective GPU memory usage: 99.54% (11.687 GiB/11.741 GiB)
Memory pool usage: 6.301 GiB (10.406 GiB reserved)
WARNING: CUDA memory allocation failed; attempting to free up memory...
... waiting for pending frees (freed 0 bytes)
WARNING: CUDA memory allocation failed; attempting to free up memory...
... waiting for pending frees (freed 0 bytes)
... waiting for all pending frees (freed 0 bytes)
... running a quick GC (freed 5.875 GiB)
WARNING: CUDA memory allocation failed; attempting to free up memory...
... waiting for pending frees (freed 0 bytes)
WARNING: CUDA memory allocation failed; attempting to free up memory...
... waiting for pending frees (freed 0 bytes)
... waiting for all pending frees (freed 0 bytes)
... running a quick GC (freed 3.002 GiB)
Effective GPU memory usage: 99.54% (11.687 GiB/11.741 GiB)
Memory pool usage: 5.703 GiB (10.406 GiB reserved)
training epoch 2/45   4%|███▊ | ETA: 0:00:09
WARNING: CUDA memory allocation failed; attempting to free up memory...
... waiting for pending frees (freed 0 bytes)
... waiting for all pending frees (freed 0 bytes)
... running a quick GC (freed 5.240 GiB)
WARNING: CUDA memory allocation failed; attempting to free up memory...
... waiting for pending frees (freed 0 bytes)
WARNING: CUDA memory allocation failed; attempting to free up memory...
... waiting for pending frees (freed 0 bytes)
... waiting for all pending frees (freed 0 bytes)
... running a quick GC (freed 1.578 GiB)
WARNING: CUDA memory allocation failed; attempting to free up memory...
... waiting for pending frees (freed 0 bytes)
... waiting for all pending frees (freed 0 bytes)
... running a quick GC (freed 167.542 MiB)
... running a full GC (freed 11.719 MiB)
... releasing reserved memory (freed 0 bytes)
┌ Error: Out of GPU memory trying to allocate 128.000 MiB
│ Effective GPU memory usage: 99.54% (11.687 GiB/11.741 GiB)
│ Memory pool usage: 4.157 GiB (10.406 GiB reserved)
└ @ Main ~/Documents/GitHub/EfficientNet-Training/effnet_train.jl:104
  3.465019 seconds (2.09 M allocations: 1.532 GiB, 27.75% gc time)
The inner loop can be simplified to this reproducer:

for epoch in 1:epochs
    for (x, y) in train_loader
        x = x |> device; y = y |> device
        gs, _ = gradient(model, x) do m, _x
            logitcrossentropy(m(_x), y)
        end
        # state, model = Optimisers.update(state, model, gs)
        model = Flux.Functors.fmap(copy, model; exclude = Optimisers.maywrite)
        state = Flux.Functors.fmap(copy, state; exclude = Optimisers.maywrite)
        @show CUDA.MemoryInfo().pool_used_bytes
    end
    try     # stands in for the try/catch that the interpolated @info introduces
        true
    catch
    end
end
Commenting out the try-catch (or the two loop @info logs in the original code) will avoid the OOM entirely.
I have been trying to look for differences in GC collections with GC logging on. On the left here is with the try-catch, on the right is without.
The GC operations seem to happen in the same sequence.
You can see pool_used_bytes jumps up on the left and doesn't on the right, but on the right there's no big GC collect, which I was expecting to see.
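For reference, gathering such a GC-logging comparison can be done like this (a minimal sketch assuming Julia ≥ 1.8's GC.enable_logging; the run above may have used different tooling):

julia> GC.enable_logging(true)    # print one line per collection (pause time, bytes collected)
julia> train(limit=3, gpu_gc=false, gpu_stats=true)
julia> GC.enable_logging(false)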
maleadt changed the title from "CUDA OOM if loop contains a try-catch" to "Julia keeps allocations alive in presence of try/catch" on Dec 12, 2023.
On Slack, we determined that this is likely due to objects being kept alive in upsilon nodes due to the try/catch. Unclear why calling GC.gc helps for that, but this does look like a Julia bug...
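As a standalone illustration of what that means (a hypothetical sketch, not the training code): a variable that is assigned inside a try block and read in the catch block is carried through upsilon/phi-c nodes once converted to SSA form, and, depending on the Julia version and optimization level, these can be seen in the typed IR.

function upsilon_demo(c::Bool)
    y = 1
    try
        y = 2
        c && error("boom")
        y = 3
    catch
        return y    # observes whatever value y had when the exception was thrown
    end
    return y
end

# julia> @code_typed optimize=true upsilon_demo(true)
# The printed IR may contain υ (UpsilonNode) / φᶜ (PhiCNode) entries for the assignments to y.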
Describe the bug
A try-catch inside a for loop causes CUDA OOM. As described above, the try-catch comes from the interpolation in an @info "epoch $epoch_num" log (the issue doesn't happen for @info "epoch"); the relevant logging code is here:
https://github.com/IanButterworth/julia/blob/61a36549f961eacd74d431f5e987f3e7f789c643/base/logging.jl#L344-L380
To reproduce
The Minimal Working Example (MWE) for this bug:
Instantiate https://github.com/IanButterworth/EfficientNet-Training/tree/main
This is with CUDA.jl dev-ed and logging added as suggested by @maleadt; the inner loop can be simplified to the reproducer shown above. Commenting out the try-catch (or the two loop @info logs in the original code) avoids the OOM entirely.

Manifest.toml
Expected behavior
No OOM, which is the case if you GC.gc() each loop (turn that on via train(limit=3, gpu_gc=true, gpu_stats=true)).
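For reference, a rough sketch of what the gpu_gc=true path amounts to (hypothetical helper names; the real wiring lives in effnet_train.jl): force a full collection, and optionally reclaim the CUDA memory pool, once per loop iteration.

using CUDA

function train_with_gc(epochs, train_loader, step!)   # hypothetical signature
    for epoch in 1:epochs
        for batch in train_loader
            step!(batch)      # one gradient/update step, as in the reproducer above
        end
        GC.gc(true)           # full GC frees the objects the try/catch kept alive
        CUDA.reclaim()        # hand freed pool memory back to the driver
    end
end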
Version info
Details on Julia:
Details on CUDA:
Additional context
CUDA.jl debug prints added