
Unrelated try-catch causes CUDA arrays to not be freed #52533

Open
IanButterworth opened this issue Dec 14, 2023 · 1 comment
Labels
bug (Indicates an unexpected problem or unintended behavior), GC (Garbage collector), gpu (Affects running Julia on a GPU)

Comments

@IanButterworth
Member

IanButterworth commented Dec 14, 2023

Originally posted at JuliaGPU/CUDA.jl#2197.

Take a GPU training loop like this:

for epoch in 1:epochs
    for (x, y) in train_loader
        x = x |> gpu; y = y |> gpu
        gs, _ = gradient(model, x) do m, _x
            logitcrossentropy(m(_x), y)
        end
        state, model = Optimisers.update(state, model, gs)
    end
    try   # unrelated try-catch; removing it avoids the leak
        true
    catch
    end
end

With the try-catch, the GPU runs out of memory very quickly; without it, there is no issue.
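As a point of reference, here is a minimal CPU-only sketch of the kind of check one could run without a GPU. Everything in it (the Tracked struct, the freed counter) is invented for illustration, and whether the unrelated try-catch actually pins the loop's temporaries depends on the Julia version, so treat it as an experiment rather than a guaranteed reproducer:

mutable struct Tracked              # stand-in for a large device array
    data::Vector{Float64}
end

const freed = Ref(0)                # bumped by each object's finalizer

function epochs_loop(nepochs, nbatches)
    for epoch in 1:nepochs
        for i in 1:nbatches
            t = Tracked(zeros(10^6))
            finalizer(_ -> (freed[] += 1), t)
            sum(t.data)             # use the object
        end
        try                         # the unrelated try-catch from the report
            true
        catch
        end
        GC.gc()                     # request a full collection each epoch
    end
end

epochs_loop(3, 4)
@show freed[]                       # compare the count with the try-catch removed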

Approximately quoting @gbaraldi from Slack:


try-catches introduce some PhiC nodes to store variables in case we error and still need their values.
(from the example above)

   store volatile {}* %value_phi61, {}** %phic, align 8
   store volatile {}* %value_phi62, {}** %phic1, align 16
   store volatile {}* %value_phi46, {}** %phic2, align 8
   store volatile {}* %value_phi47, {}** %phic3, align 16
   store volatile i64 %value_phi48, i64* %phic4, align 8
   store volatile i64 %value_phi49, i64* %phic5, align 8
   store volatile i8 0, i8* %phic6, align 1
   store volatile {}* null, {}** %phic7, align 8
   store volatile i8 0, i8* %phic8, align 1
   store volatile {}* %278, {}** %phic9, align 16
   store volatile {}* %267, {}** %phic10, align 8
   store volatile {}* inttoptr (i64 140366834286144 to {}*), {}** %phic11, align 16

I have the suspicion some of them are holding our CUDA arrays
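
For anyone who wants to reproduce that kind of output, a sketch like the following (mine, not from the issue) prints the LLVM IR of a small loop-plus-try-catch function. Which values, if any, get spilled into volatile %phic slots depends on the Julia version, so inspect the output for store volatile lines like the ones quoted above:

using InteractiveUtils          # provides @code_llvm

function with_try(n)
    acc = 0.0
    for i in 1:n
        v = rand(3)
        acc += sum(v)
        try                     # unrelated try-catch, mirroring the report
            true
        catch
        end
    end
    return acc
end

@code_llvm with_try(10)         # look for `store volatile ... %phic` lines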


This is especially nasty because the logging macros introduce a try-catch block whenever the log message cannot be proven not to throw, i.e. whenever it is more than a simple string literal.

So an @info call with interpolation, as below, introduces the issue, while one without interpolation, like @info "completed epoch", doesn't.

for epoch in 1:epochs
    for (x, y) in train_loader
        x = x |> gpu; y = y |> gpu
        gs, _ = gradient(model, x) do m, _x
            logitcrossentropy(m(_x), y)
        end
        state, model = Optimisers.update(state, model, gs)
    end
    @info "completed epoch $epoch"
end
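
A quick way to check that claim on a given Julia version is to compare the macro expansions; this is my own sketch, and the exact expansion details differ between releases:

epoch = 1
simple = @macroexpand @info "completed epoch"
interp = @macroexpand @info "completed epoch $epoch"

# recursively look for a try expression anywhere in an expansion
hastry(ex) = ex isa Expr && (ex.head === :try || any(hastry, ex.args))

@show hastry(simple)   # expected false: a plain string literal cannot throw
@show hastry(interp)   # expected true: the interpolation gets wrapped in a try-catch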

However, @Keno's automatic try-catch elision on 1.11 might fix that?
Note that this is on 1.9.4. CUDA.jl currently has issues on Julia master, so I haven't been able to test there yet.

@maleadt added the GC (Garbage collector) and gpu (Affects running Julia on a GPU) labels on May 7, 2024
@IanButterworth added the bug (Indicates an unexpected problem or unintended behavior) label on Oct 30, 2024
@IanButterworth
Member Author

This is such an unfriendly bug.
It could be behind reports like this one: https://discourse.julialang.org/t/memory-usage-increasing-with-each-epoch/121798

IMO we need to get this fixed, but I don't know how.

@gbaraldi you had some ideas.

I'm going to put this on the 1.12 milestone to raise visibility.
