
[FEA] Dump GPU/CPU memory footprint when GPU/CPU OOM happens #12173

Open
binmahone opened this issue Feb 19, 2025 · 0 comments
Labels
? - Needs Triage Need team to review and classify feature request New feature or request

Comments


binmahone commented Feb 19, 2025

Is your feature request related to a problem? Please describe.

Let's only talk about GPU memory here, since the same logic applies to CPU memory.

When a GPU OOM happens, it would be great if we could tell how much GPU memory has been used so far, and how much GPU memory each thread is accountable for (let's call it memory bookkeeping). The concept of "account for" can simply be defined as "which thread allocated it"; that would be good enough to help us figure out what is happening when the OOM occurs, e.g. "is one thread alone exhausting all the GPU memory?" or "how many threads are heavy users of GPU memory?" In order to better understand what each thread was doing when the OOM occurred, a callstack dump would also be helpful.
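To make the idea concrete, here is a minimal sketch (hypothetical class and method names, not existing spark-rapids or cuDF code) of per-thread bookkeeping: each allocation is charged to the allocating thread, and a dump reports per-thread totals along with every thread's stack trace:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: per-thread memory bookkeeping. "Account for" is
// simplified to "which thread allocated it", as proposed above.
public class MemoryBookkeeper {
    private static final Map<Long, AtomicLong> perThreadBytes = new ConcurrentHashMap<>();

    // Call from the allocation path: charge the bytes to the current thread.
    public static void onAlloc(long bytes) {
        perThreadBytes
            .computeIfAbsent(Thread.currentThread().getId(), t -> new AtomicLong())
            .addAndGet(bytes);
    }

    // Call from the free path: credit the bytes back to the allocating thread.
    public static void onFree(long allocatingThreadId, long bytes) {
        AtomicLong counter = perThreadBytes.get(allocatingThreadId);
        if (counter != null) {
            counter.addAndGet(-bytes);
        }
    }

    public static long totalBytes() {
        return perThreadBytes.values().stream().mapToLong(AtomicLong::get).sum();
    }

    public static void reset() {
        perThreadBytes.clear();
    }

    // Dump per-thread footprints plus each thread's stack trace, so we can
    // see both who holds the memory and what every thread was doing.
    public static void dump() {
        System.err.println("=== bookkeeping: total " + totalBytes() + " bytes ===");
        for (Map.Entry<Long, AtomicLong> e : perThreadBytes.entrySet()) {
            System.err.println("thread " + e.getKey() + ": " + e.getValue().get() + " bytes");
        }
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            System.err.println("stack of " + e.getKey().getName() + ":");
            for (StackTraceElement frame : e.getValue()) {
                System.err.println("    at " + frame);
            }
        }
    }
}
```

The real implementation would hook the actual allocation/free paths (e.g. the buffer classes in cuDF's Java layer); this only illustrates the data model.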

The timing of the memory bookkeeping dump is critical. It is straightforward to think about adding a callback in RapidsExecutorPlugin.onTaskFailed, similar to

GpuCoreDumpHandler.waitForDump(timeoutSecs = 60)
. However, by the time the catch block is being executed it is already too late: the OOM state machine has already finished its transition, so we can no longer capture the blocked threads' stack traces, as they have already woken up and resumed running. So it is best if we can dump the bookkeeping before the exception is thrown. But it is hard to tell whether an exception about to be thrown will eventually cause the task to fail.

With the above said, I suggest dumping right before RmmRapidsRetryIterator.AutoCloseableAttemptSpliterator#split. If a thread starts to split, it usually means some severe memory contention has already taken place, so a bookkeeping dump at this moment is valuable for understanding what is happening. It is worth mentioning, though, that calling split does not necessarily lead to task failure, so we might end up dumping the bookkeeping too many times (I think that is okay, because we have a toggle to enable/disable the whole memory bookkeeping behavior, and it is OFF by default). Only in the case of withRetryNoSplit will the rapids plugin throw a GpuSplitAndRetryOOM and inevitably fail the current task attempt.
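The proposed hook point can be sketched as follows (hypothetical names; the real hook would live next to AutoCloseableAttemptSpliterator#split in spark-rapids): the dump fires just before the split logic runs, gated behind a toggle that defaults to off so repeated dumps cost nothing in normal operation:

```java
// Hypothetical sketch: fire a bookkeeping dump right before a split is
// attempted, since a split signals that memory contention has already
// occurred. Gated behind a config toggle that is OFF by default.
public class SplitDumpHook {
    // Stand-in for the spark-rapids config toggle; default OFF.
    private static volatile boolean bookkeepingEnabled = false;

    public static void setEnabled(boolean enabled) {
        bookkeepingEnabled = enabled;
    }

    // Called right before the actual split logic runs. Returns whether a
    // dump actually happened, which makes the gating easy to verify.
    public static boolean beforeSplit(Runnable dump) {
        if (bookkeepingEnabled) {
            dump.run(); // dump per-thread footprints + stack traces
            return true;
        }
        return false;
    }
}
```

Because split can fire many times per task attempt, keeping the dump itself cheap (or rate-limited) would matter if the toggle is ever left on in production.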

This work can be broken down into three subtasks:

  1. Add CPU memory footprint bookkeeping in the spark-rapids repo: add memory bookkeeping for CPU Memory #12181
  2. Add GPU memory footprint bookkeeping in the cuDF repo: BaseDeviceMemoryBuffer for GPU mem bookkeep rapidsai/cudf#18052
  3. Invoke GPU memory bookkeeping from the spark-rapids side