device_reset can throw errors and not reset the device #17
There's a little explanation for the current behavior in the first paragraph of this section. To be a little more explicit, I settled on this design as a way to solve a tricky interaction between Julia's GC and CUDA:

```julia
function foo()
    p = CUDArt.malloc(Float32, 8) # let's say this points to 0x1234
    # Do some stuff with p
end
foo()
device_reset()
# Let's say p is still hanging around on the julia side, even though the
# memory has been freed on the CUDA side
device(0)                      # re-initialize a device
x = CUDArt.malloc(Float32, 8)  # this might also point to 0x1234: CUDA can
                               # reassign the same pointer across sessions
gc()        # oops, the memory backing x just got freed, because p's finalizer ran
fill!(x, 4) # segfault
```
There's an explicit test for this behavior here: https://github.com/JuliaGPU/CUDArt.jl/blob/35e9e4922eba7ed4bd3154b520e6716f345195c1/test/gc.jl#L43-L62
Hmm, I wonder if there might be another way to solve the same problem: perhaps each "session" could get a unique version number (stored as some global), and each pointer could record the session it was allocated in, so that a finalizer frees memory only if its session is still the current one. Thoughts? An alternative is to figure out which calls in device_reset can safely have their errors ignored.
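As a hedged sketch of that version-number idea (all names here are hypothetical, not CUDArt's API):

```julia
# A global session counter, bumped by every device_reset():
cuda_session = 0

# Each allocation records the session it was made in; its finalizer
# closes over that value and frees only while the session is current.
# After a reset the same address may be handed out again, so freeing
# it then would destroy someone else's data.
function free_if_current(p, session)
    if session == cuda_session
        CUDArt.free(p)
    end
end
```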
Hm, that's a tricky issue; I hadn't thought of that. Given that you never know when a device might be reset (in another thread, even?), I think this means that finalizers can't be made to work correctly.
This sounds complicated; it's like trying to keep track of the device context object inside libcudart. How would this handle multiple threads doing this? Can you guarantee that it wouldn't go wrong?

I would like to propose something else. Garbage collection is non-deterministic, so you can't really rely on finalizers running exactly at the end of a "compute region", which starts with initialization via cudaSetDevice and ends with cudaDeviceReset. But everything associated with the device context between those two calls is only valid until the matching cudaDeviceReset, which can be called almost anywhere, even from another thread. My proposal: require all work on a device to happen inside a function passed to a wrapper (a do-block), where the wrapper performs the initialization before calling the function and the reset afterwards; see the sketch below. Otherwise the user can …
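A minimal sketch of what such a wrapper could look like (hedged: the name `with_device` and its body are illustrative, not CUDArt's actual implementation):

```julia
# Run `f` inside a well-delimited compute region. Everything `f`
# allocates is valid only until the reset below.
function with_device(f::Function, dev::Integer)
    device(dev)               # cudaSetDevice: region begins
    try
        f()
    finally
        device_reset(dev)     # cudaDeviceReset: region ends, no matter what
    end
end

# Usage with a do-block:
with_device(0) do
    p = CUDArt.malloc(Float32, 8)
    # ... compute with p ...
end   # device 0 is reset here; p must not escape this block
```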
I think it's kind of brittle. Also, I'm not very happy with a wrapper library that occasionally passes weird invalid inputs for no good reason. You would also have to reset …
It's an interesting idea, and it extends a pattern that is obviously already used heavily within CUDArt for initialization of devices and modules. I worry a little bit about the performance implications, since Julia's anonymous functions are not speedy. But it certainly would be a simple and robust approach from the standpoint of the package. (Less convenient for the user, of course.) Keep in mind that the current system runs the finalizers either upon garbage collection or upon device_reset, whichever comes first.

In other words, I agree that context errors are nasty and need to be stomped out. But I'm a little reluctant to break the more "julian" behavior unless it's abundantly clear that the current system can't work. I'd be more convinced if, for example, I could replicate #18 and knew there wasn't an easy way to fix it. (Of course, other demonstrations of the problem would also be fine.)
Okay, here is an example where CUDArt is not reset correctly. I think this can be addressed by ignoring all errors in the finalizers in device_reset, because cudaFree and similar functions can return error codes left over from earlier asynchronous operations.
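A minimal sketch of the failure mode (hedged: `bad_kernel` and the launch call are purely illustrative stand-ins for any operation whose error surfaces asynchronously):

```julia
using CUDArt

device(0)
p = CUDArt.malloc(Float32, 8)

# Hypothetical: launch a kernel that fails asynchronously, e.g. with an
# out-of-bounds write. The launch itself reports cudaSuccess; the error
# only surfaces from a *later* runtime call.
# launch(bad_kernel, 1, 1, (p,))    # `bad_kernel` is illustrative

# device_reset() frees every tracked pointer before resetting. The first
# cudaFree returns the deferred error, the checked wrapper throws, and
# cudaDeviceReset() is never reached:
device_reset()    # throws; the device is left un-reset
```

Outcome: device_reset throws, the exception propagates out of the cleanup loop, and cudaDeviceReset() never runs.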
As it's implemented now, all wrappers of CUDA functions check the return value against cudaSuccess and throw an error otherwise. This means that if, for some reason, one of the functions called in device_reset (https://github.com/JuliaGPU/CUDArt.jl/blob/master/src/device.jl#L10) returns an error, for example because a previous kernel launch by the user failed, or because, as happens to me irreproducibly (so not a bug, I think), one of cuda_ptrs is "an invalid device pointer", the cleanup code throws and never executes cudaDeviceReset().
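For illustration, the checking pattern in question is roughly the following (a hedged sketch; these names are not CUDArt's actual internals):

```julia
# Every runtime call goes through a checker that throws on anything
# other than cudaSuccess (status code 0):
function checkcall(status::Cint)
    status == 0 || error("CUDA runtime error: code $status")
    status
end

# So a free during cleanup, e.g.
#     checkcall(ccall((:cudaFree, "libcudart"), Cint, (Ptr{Void},), p))
# throws if an earlier asynchronous failure happens to be reported here,
# aborting the rest of device_reset before cudaDeviceReset() is reached.
```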
Also, I believe that freeing pointers before cudaDeviceReset() is unnecessary: its documentation says it releases all resources associated with the current device in the current process. So the pointers need not be freed individually at all; cudaDeviceReset will reclaim them.
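Under that reading, device_reset could be simplified to something like this (a hedged sketch; it assumes cuda_ptrs is the module's registry of live pointers):

```julia
function device_reset(dev::Integer)
    device(dev)
    # Drop the bookkeeping without calling cudaFree on each pointer:
    # cudaDeviceReset below releases all device resources for this
    # process anyway, and skipping per-pointer frees means no deferred
    # asynchronous error can abort the reset.
    empty!(cuda_ptrs)
    ccall((:cudaDeviceReset, "libcudart"), Cint, ())
end
```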
This also means that in devices (https://github.com/JuliaGPU/CUDArt.jl/blob/master/src/device.jl#L59) the finally clause can throw: the first error aborts the whole finally clause and prevents the remaining devices from being reset.
The bug is that code in a finally clause should never throw in a way that prevents the remaining resources/devices from being cleaned up; see the sketch below.
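One hedged sketch of a fix, catching per-device errors so every device still gets its reset attempt (devlist and the warn message are illustrative):

```julia
try
    # ... user code running on the devices ...
finally
    for dev in devlist
        try
            device_reset(dev)
        catch err
            # Swallow (or just log) the error so the loop continues and
            # the remaining devices are still reset, instead of letting
            # one failure abort all cleanup.
            warn("device_reset($dev) failed: $err")
        end
    end
end
```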