
Unrelated try-catch causes CUDA arrays to not be freed #52533

Open
IanButterworth opened this issue Dec 14, 2023 · 1 comment
Labels
bug (Indicates an unexpected problem or unintended behavior), GC (Garbage collector), gpu (Affects running Julia on a GPU)

Comments

@IanButterworth
Member

IanButterworth commented Dec 14, 2023

Originally posted at JuliaGPU/CUDA.jl#2197.

Take a GPU training loop like this:

for epoch in 1:epochs
    for (x, y) in train_loader
        x = x |> gpu; y = y |> gpu
        gs, _ = gradient(model, x) do m, _x
            logitcrossentropy(m(_x), y)
        end
        state, model = Optimisers.update(state, model, gs)
    end
    try   # unrelated try-catch; removing it avoids the leak
        true
    catch
    end
end

With the try-catch, the GPU runs out of memory very quickly; without it, there is no issue.
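As a point of reference, here is a minimal CPU-only sketch of the kind of check one could run without a GPU. Everything in it (the Tracked struct, the freed counter) is invented for illustration, and whether the unrelated try-catch actually pins the loop's temporaries depends on the Julia version, so treat it as an experiment rather than a guaranteed reproducer:

mutable struct Tracked              # stand-in for a large device array
    data::Vector{Float64}
end

const freed = Ref(0)                # bumped by each object's finalizer

function epochs_loop(nepochs, nbatches)
    for epoch in 1:nepochs
        for i in 1:nbatches
            t = Tracked(zeros(10^6))
            finalizer(_ -> (freed[] += 1), t)
            sum(t.data)             # use the object
        end
        try                         # the unrelated try-catch from the report
            true
        catch
        end
        GC.gc()                     # request a full collection each epoch
    end
end

epochs_loop(3, 4)
@show freed[]                       # compare the count with the try-catch removed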

Approximately quoting @gbaraldi from Slack:


try-catches introduce some PhiC nodes to store variables in case we error and still need their values.
(from the example above)

   store volatile {}* %value_phi61, {}** %phic, align 8
   store volatile {}* %value_phi62, {}** %phic1, align 16
   store volatile {}* %value_phi46, {}** %phic2, align 8
   store volatile {}* %value_phi47, {}** %phic3, align 16
   store volatile i64 %value_phi48, i64* %phic4, align 8
   store volatile i64 %value_phi49, i64* %phic5, align 8
   store volatile i8 0, i8* %phic6, align 1
   store volatile {}* null, {}** %phic7, align 8
   store volatile i8 0, i8* %phic8, align 1
   store volatile {}* %278, {}** %phic9, align 16
   store volatile {}* %267, {}** %phic10, align 8
   store volatile {}* inttoptr (i64 140366834286144 to {}*), {}** %phic11, align 16

I have the suspicion some of them are holding our CUDA arrays
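
For anyone who wants to reproduce that kind of output, a sketch like the following (mine, not from the issue) prints the LLVM IR of a small loop-plus-try-catch function. Which values, if any, get spilled into volatile %phic slots depends on the Julia version, so inspect the output for store volatile lines like the ones quoted above:

using InteractiveUtils          # provides @code_llvm

function with_try(n)
    acc = 0.0
    for i in 1:n
        v = rand(3)
        acc += sum(v)
        try                     # unrelated try-catch, mirroring the report
            true
        catch
        end
    end
    return acc
end

@code_llvm with_try(10)         # look for `store volatile ... %phic` lines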


This is especially nasty because the logging macros introduce a try-catch block whenever the log message cannot be proven not to throw, i.e. whenever it is more than a simple string literal.

So an @info call with interpolation, as below, introduces the issue, while one without interpolation, like @info "completed epoch", doesn't.

for epoch in 1:epochs
    for (x, y) in train_loader
        x = x |> gpu; y = y |> gpu
        gs, _ = gradient(model, x) do m, _x
            logitcrossentropy(m(_x), y)
        end
        state, model = Optimisers.update(state, model, gs)
    end
    @info "completed epoch $epoch"
end
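
A quick way to check that claim on a given Julia version is to compare the macro expansions; this is my own sketch, and the exact expansion details differ between releases:

epoch = 1
simple = @macroexpand @info "completed epoch"
interp = @macroexpand @info "completed epoch $epoch"

# recursively look for a try expression anywhere in an expansion
hastry(ex) = ex isa Expr && (ex.head === :try || any(hastry, ex.args))

@show hastry(simple)   # expected false: a plain string literal cannot throw
@show hastry(interp)   # expected true: the interpolation gets wrapped in a try-catch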

However, @Keno's automatic try-catch elision on 1.11 might fix that?
Note that this is on 1.9.4. CUDA.jl currently has issues on Julia master, so I haven't been able to test there yet.

@maleadt added the GC (Garbage collector) and gpu (Affects running Julia on a GPU) labels on May 7, 2024
@IanButterworth added the bug (Indicates an unexpected problem or unintended behavior) label on Oct 30, 2024
@IanButterworth
Member Author

This is such an unfriendly bug.
It could be behind reports like this one: https://discourse.julialang.org/t/memory-usage-increasing-with-each-epoch/121798

IMO we need to get this fixed, but I don't know how.

@gbaraldi you had some ideas.

I'm going to put this on the 1.12 milestone to raise visibility.
