
[FEA] Dump GPU/CPU memory footprint when GPU/CPU OOM happens #12173

Open
binmahone opened this issue Feb 19, 2025 · 0 comments
Labels
? - Needs Triage Need team to review and classify feature request New feature or request

Comments


binmahone commented Feb 19, 2025

Is your feature request related to a problem? Please describe.

Let's only talk about GPU memory here, since the same logic applies to CPU memory.

When a GPU OOM happens, it would be great if we could tell how much GPU memory has been used so far, and how much GPU memory each thread is accountable for (let's call it memory bookkeeping). The concept of "account for" can simply be defined as "which thread allocated it"; that would be good enough to help us figure out what is happening when the OOM occurs, e.g. "is one thread alone exhausting all the GPU memory?" or "how many threads are heavy users of GPU memory?" In order to better understand what each thread was doing when the OOM occurred, a callstack dump would also be helpful.
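To make the idea concrete, here is a minimal sketch (hypothetical class and method names, not existing spark-rapids or cuDF code) of per-thread bookkeeping: each allocation is charged to the allocating thread, and a dump reports per-thread totals along with every thread's stack trace:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: per-thread memory bookkeeping. "Account for" is
// simplified to "which thread allocated it", as proposed above.
public class MemoryBookkeeper {
    private static final Map<Long, AtomicLong> perThreadBytes = new ConcurrentHashMap<>();

    // Call from the allocation path: charge the bytes to the current thread.
    public static void onAlloc(long bytes) {
        perThreadBytes
            .computeIfAbsent(Thread.currentThread().getId(), t -> new AtomicLong())
            .addAndGet(bytes);
    }

    // Call from the free path: credit the bytes back to the allocating thread.
    public static void onFree(long allocatingThreadId, long bytes) {
        AtomicLong counter = perThreadBytes.get(allocatingThreadId);
        if (counter != null) {
            counter.addAndGet(-bytes);
        }
    }

    public static long totalBytes() {
        return perThreadBytes.values().stream().mapToLong(AtomicLong::get).sum();
    }

    public static void reset() {
        perThreadBytes.clear();
    }

    // Dump per-thread footprints plus each thread's stack trace, so we can
    // see both who holds the memory and what every thread was doing.
    public static void dump() {
        System.err.println("=== bookkeeping: total " + totalBytes() + " bytes ===");
        for (Map.Entry<Long, AtomicLong> e : perThreadBytes.entrySet()) {
            System.err.println("thread " + e.getKey() + ": " + e.getValue().get() + " bytes");
        }
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            System.err.println("stack of " + e.getKey().getName() + ":");
            for (StackTraceElement frame : e.getValue()) {
                System.err.println("    at " + frame);
            }
        }
    }
}
```

The real implementation would hook the actual allocation/free paths (e.g. the buffer classes in cuDF's Java layer); this only illustrates the data model.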

The timing of the memory bookkeeping dump is critical. It is straightforward to think about adding a callback in RapidsExecutorPlugin.onTaskFailed, similar to

GpuCoreDumpHandler.waitForDump(timeoutSecs = 60)
. However, by the time the catch block is being executed it is already too late: the OOM state machine has already finished its transition, so we can no longer capture the blocked threads' stack traces, as they have already woken up and resumed running. So it is best if we can dump the bookkeeping before the exception is thrown. But it is hard to tell whether an exception about to be thrown will eventually cause the task to fail.

With the above said, I suggest dumping right before RmmRapidsRetryIterator.AutoCloseableAttemptSpliterator#split. If a thread starts to split, it usually means some severe memory contention has already taken place, so a bookkeeping dump at this moment is valuable for understanding what is happening. It is worth mentioning, though, that calling split does not necessarily lead to task failure, so we might end up dumping the bookkeeping too many times (I think that is okay, because we have a toggle to enable/disable the whole memory bookkeeping behavior, and it is OFF by default). Only in the case of withRetryNoSplit will the rapids plugin throw a GpuSplitAndRetryOOM and inevitably fail the current task attempt.
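The proposed hook point can be sketched as follows (hypothetical names; the real hook would live next to AutoCloseableAttemptSpliterator#split in spark-rapids): the dump fires just before the split logic runs, gated behind a toggle that defaults to off so repeated dumps cost nothing in normal operation:

```java
// Hypothetical sketch: fire a bookkeeping dump right before a split is
// attempted, since a split signals that memory contention has already
// occurred. Gated behind a config toggle that is OFF by default.
public class SplitDumpHook {
    // Stand-in for the spark-rapids config toggle; default OFF.
    private static volatile boolean bookkeepingEnabled = false;

    public static void setEnabled(boolean enabled) {
        bookkeepingEnabled = enabled;
    }

    // Called right before the actual split logic runs. Returns whether a
    // dump actually happened, which makes the gating easy to verify.
    public static boolean beforeSplit(Runnable dump) {
        if (bookkeepingEnabled) {
            dump.run(); // dump per-thread footprints + stack traces
            return true;
        }
        return false;
    }
}
```

Because split can fire many times per task attempt, keeping the dump itself cheap (or rate-limited) would matter if the toggle is ever left on in production.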

This work can be broken down into three subtasks:

  1. Add CPU memory footprint bookkeeping in the spark-rapids repo: add memory bookkeeping for CPU Memory #12181
  2. Add GPU memory footprint bookkeeping in the cuDF repo: BaseDeviceMemoryBuffer for GPU mem bookkeep rapidsai/cudf#18052
  3. Invoke GPU memory bookkeeping from the spark-rapids side