Support for GPU checkpointing in nvproxy #11095

cweld510 · 2024-10-31T02:39:53Z

Description

We're interested in some form of GPU checkpointing - is this something that the gVisor team plans on supporting at any point?

Generally, existing GPU checkpointing implementations described in papers like Singularity or Cricket intercept CUDA calls via LD_PRELOAD. Prior to a checkpoint, they record stateful calls in a log, which is stored at checkpoint time along with the contents of GPU memory. At restore time, GPU memory is reloaded and the log is replayed. Both frameworks also have to do some virtualization of device pointers.

It seems (perhaps naively) that a similar scheme might be possible within nvproxy, which already intercepts calls to the GPU driver. In theory, nvproxy could record a subset of calls made to the GPU driver and replay them at checkpoint-restore time, virtualizing file descriptors and device pointers as needed; and separately, support copying contents of GPU memory off the device to a file and back.
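
As a rough illustration of the record-and-replay idea described above, here is a toy sketch in Python. It is not nvproxy code and does not reflect any real driver interface: `FakeDriver`, `RecordingProxy`, and all of their methods are hypothetical names invented for this example. The proxy hands the application stable virtual handles, logs each stateful allocation, and at restore time replays the log against a fresh driver and copies the saved memory contents back.

```python
from typing import Dict, List, Tuple


class FakeDriver:
    """Hypothetical stand-in for a GPU driver that hands out opaque handles."""

    def __init__(self, first_handle: int) -> None:
        self._next = first_handle
        self.memory: Dict[int, bytes] = {}

    def alloc(self, size: int) -> int:
        handle = self._next
        self._next += 0x1000
        self.memory[handle] = bytes(size)  # zero-filled buffer
        return handle

    def write(self, handle: int, data: bytes) -> None:
        self.memory[handle] = data

    def read(self, handle: int) -> bytes:
        return self.memory[handle]


class RecordingProxy:
    """Hypothetical proxy: logs stateful calls and virtualizes handles."""

    def __init__(self, driver: FakeDriver) -> None:
        self.driver = driver
        self.log: List[Tuple[int, int]] = []  # (virtual handle, size) per alloc
        self.virt_to_real: Dict[int, int] = {}
        self._next_virtual = 1

    def alloc(self, size: int) -> int:
        real = self.driver.alloc(size)
        virt = self._next_virtual
        self._next_virtual += 1
        self.virt_to_real[virt] = real
        self.log.append((virt, size))  # stateful call: record it
        return virt  # the application only ever sees the virtual handle

    def write(self, virt: int, data: bytes) -> None:
        self.driver.write(self.virt_to_real[virt], data)

    def read(self, virt: int) -> bytes:
        return self.driver.read(self.virt_to_real[virt])

    def checkpoint(self) -> dict:
        # Save the call log plus the contents of every live allocation.
        contents = {v: self.driver.read(r) for v, r in self.virt_to_real.items()}
        return {"log": list(self.log), "memory": contents,
                "next_virtual": self._next_virtual}

    @classmethod
    def restore(cls, snapshot: dict, driver: FakeDriver) -> "RecordingProxy":
        proxy = cls(driver)
        # Replay the log: the new driver may return different real handles,
        # so rebuild the virtual-to-real mapping as we go.
        for virt, size in snapshot["log"]:
            proxy.virt_to_real[virt] = driver.alloc(size)
            proxy.log.append((virt, size))
        # Copy the saved memory contents back onto the new allocations.
        for virt, data in snapshot["memory"].items():
            driver.write(proxy.virt_to_real[virt], data)
        proxy._next_virtual = snapshot["next_virtual"]
        return proxy
```

In this sketch, a handle obtained before `checkpoint()` stays valid after `RecordingProxy.restore(...)`, even though the underlying real handle changes. A real implementation would also have to handle frees, ordering between calls, and driver state that is not visible through the intercepted interface, none of which is modeled here.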

This is clearly complex. I'm curious if you all believe it to be viable and plan on exploring the scheme described above, or a different one, at any point?

Is this feature related to a specific bug?

No response

Do you have a specific solution in mind?

No response

EtiennePerot · 2024-10-31T05:15:20Z

Have you looked at #10478 (which I believe was filed by one of your colleagues :))?
I believe cuda-checkpoint should work well within gVisor now that NVIDIA has fixed the issue described in that bug, and should allow GPU checkpointing to work in gVisor without the complexity of recording and replaying CUDA calls.

cweld510 · 2024-10-31T13:01:43Z

Interesting, I assumed that NVIDIA hadn't fixed the issue since NVIDIA/cuda-checkpoint#4 is still open, but honestly, I haven't tried running cuda-checkpoint on PyTorch within gVisor recently. I will do that.

ayushr2 · 2024-10-31T17:17:54Z

I would recommend trying the latest driver (R565 I believe).

cweld510 · 2024-11-06T18:47:38Z

Thanks! I'll reply back when I've had a chance to try the latest driver. Really appreciate the help on this.

cweld510 added the "type: enhancement" (New feature or request) label on Oct 31, 2024

ayushr2 added the "area: gpu" (Issue related to sandboxed GPU access) label on Oct 31, 2024
