Originally posted here JuliaGPU/CUDA.jl#2197

Take a GPU training loop like this:
```julia
using Flux, Optimisers, CUDA
using Flux.Losses: logitcrossentropy
# (model, state, train_loader, epochs set up as usual)

for epoch in 1:epochs
    for (x, y) in train_loader
        x = x |> gpu; y = y |> gpu
        gs, _ = gradient(model, x) do m, _x
            logitcrossentropy(m(_x), y)
        end
        state, model = Optimisers.update(state, model, gs)
    end
    # An empty try/catch; its mere presence is enough to trigger the problem.
    try
        true
    catch
    end
end
```
With the try-catch present, the GPU runs out of memory very quickly; without it, there is no issue.
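For reference, one way to watch the pool fill up between epochs (a minimal sketch, not from the original report; it uses CUDA.jl's `CUDA.memory_status()`, which prints the current memory/pool usage):

```julia
using CUDA

for epoch in 1:epochs
    # ... inner training loop as above ...
    try
        true
    catch
    end
    # With the try/catch present, the reported usage should keep climbing
    # from epoch to epoch until an out-of-memory error is hit.
    CUDA.memory_status()
end
```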
Approximately quoting @gbaraldi from Slack: try/catch blocks introduce `phic` nodes to store variables, in case we error and still need their values. From the example above:
```llvm
store volatile {}* %value_phi61, {}** %phic, align 8
store volatile {}* %value_phi62, {}** %phic1, align 16
store volatile {}* %value_phi46, {}** %phic2, align 8
store volatile {}* %value_phi47, {}** %phic3, align 16
store volatile i64 %value_phi48, i64* %phic4, align 8
store volatile i64 %value_phi49, i64* %phic5, align 8
store volatile i8 0, i8* %phic6, align 1
store volatile {}* null, {}** %phic7, align 8
store volatile i8 0, i8* %phic8, align 1
store volatile {}* %278, {}** %phic9, align 16
store volatile {}* %267, {}** %phic10, align 8
store volatile {}* inttoptr (i64 140366834286144 to {}*), {}** %phic11, align 16
```
I suspect some of these slots are holding on to our CUDA arrays.
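To see where such slots come from, it helps to inspect a tiny function of the same shape (a sketch for illustration; `g` is a made-up example, and the exact IR depends on the Julia version): a variable assigned inside a `try` but read after the `catch` has to survive a potential error, so lowering gives it a `phic` slot backed by volatile stores like those above.

```julia
using InteractiveUtils  # provides @code_llvm

function g(x)
    local y
    try
        y = x .+ 1   # assigned inside the try, still needed afterwards
    catch
    end
    return sum(y)
end

# The emitted IR should contain volatile stores to `%phic` stack slots,
# analogous to the ones quoted above:
@code_llvm g(rand(4))
```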
This is especially nasty because the logging macros introduce try/catch blocks whenever the log message cannot be proven not to error, i.e. whenever it is more than a simple string literal. So an `@info` with interpolation, as below, introduces the issue, while one without interpolation, like `@info "completed epoch"`, doesn't:
```julia
for epoch in 1:epochs
    for (x, y) in train_loader
        x = x |> gpu; y = y |> gpu
        gs, _ = gradient(model, x) do m, _x
            logitcrossentropy(m(_x), y)
        end
        state, model = Optimisers.update(state, model, gs)
    end
    # The interpolated message makes @info expand to code containing a try/catch.
    @info "completed epoch $epoch"
end
```
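One way to check this claim (a sketch; the exact lowered output differs across Julia versions) is to compare the lowered form of the two log statements:

```julia
epoch = 1

# Interpolated message: the expansion wraps message construction in an
# exception handler, visible as an `Expr(:enter, ...)` in the lowered code.
Meta.@lower @info "completed epoch $epoch"

# Plain string literal: the message cannot throw, so no handler is emitted.
Meta.@lower @info "completed epoch"
```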
However, @Keno's automatic try/catch elision on 1.11 might fix that?
Note that this is on Julia 1.9.4; CUDA.jl currently has issues on Julia master, so I haven't been able to test this there yet.