Is your feature request related to a problem? Please describe.
Let's only talk about GPU memory, since the same logic applies to CPU memory.
When a GPU OOM happens, it would be great if we could tell how much GPU memory has been used so far and how much GPU memory each thread is accountable for (let's call this memory bookkeeping). The concept of "accountable for" can be simply defined as "which thread allocated it"; that would be good enough to help us figure out what's happening when the OOM is raised, e.g. "is one thread alone exhausting all the GPU memory?" or "how many threads are heavy users of GPU memory?" To better understand what each thread was doing when the OOM was raised, a call stack dump would also be helpful.
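To make "which thread allocated it" concrete, here is a minimal sketch of thread-attributed bookkeeping. The names (`MemoryBookkeeping`, `onAlloc`, `onFree`) are hypothetical, not existing plugin APIs; in practice the hooks would sit wherever the plugin intercepts GPU allocations and frees.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.LongAdder
import scala.collection.mutable

// Hypothetical per-thread GPU memory bookkeeping: every allocation is
// attributed to the thread that made it ("which thread allocated it").
object MemoryBookkeeping {
  private val usedByThread = new ConcurrentHashMap[Long, LongAdder]()

  // Called from the allocation hook on the allocating thread.
  def onAlloc(bytes: Long): Unit =
    usedByThread
      .computeIfAbsent(Thread.currentThread().getId, _ => new LongAdder)
      .add(bytes)

  // Called from the free hook; attributes the free to the current thread.
  def onFree(bytes: Long): Unit =
    usedByThread
      .computeIfAbsent(Thread.currentThread().getId, _ => new LongAdder)
      .add(-bytes)

  // Snapshot of bytes currently attributed to each thread id.
  def perThread: Map[Long, Long] = {
    val m = mutable.Map.empty[Long, Long]
    usedByThread.forEach((tid: java.lang.Long, adder: LongAdder) =>
      m(tid) = adder.sum())
    m.toMap
  }

  def totalUsed: Long = perThread.values.sum
}
```

A sorted view of `perThread` directly answers the two questions above: whether one thread alone dominates, and how many heavy users there are.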
The timing for dumping the memory bookkeeping is critical. It's straightforward to think about adding a callback in RapidsExecutorPlugin.onTaskFailed, just like the existing handling in sql-plugin/src/main/scala/com/nvidia/spark/rapids/Plugin.scala (line 703 at 9122ab0). However, by the time the catch block is executed it's already too late: the OOM state machine has already finished its transition, so we can no longer capture the blocked threads' stack traces, as they have already woken up and resumed running. So it's best if we can dump the bookkeeping before the exception is thrown. But it's hard to tell whether an exception about to be thrown will eventually cause a task failure.
With the above said, I suggest dumping right before RmmRapidsRetryIterator.AutoCloseableAttemptSpliterator#split. If a thread starts to split, it usually means some severe memory contention has already taken place, so a bookkeeping dump at this moment is valuable for understanding what's happening. It's also worth mentioning that calling split does not necessarily lead to task failure, so we might end up dumping the bookkeeping too many times (I think that's okay because, after all, we have a toggle to enable/disable the whole memory bookkeeping behavior, and it's OFF by default). Only in the case of withRetryNoSplit will the rapids plugin throw a GpuSplitAndRetryOOM and inevitably fail the current task attempt.
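As a sketch of what the dump placed before split could look like: it combines the per-thread totals with `Thread.getAllStackTraces`, so the stacks of threads still blocked on allocation remain visible. All names here (`BookkeepingDump`, `render`, `dumpBeforeSplit`, the toggle) are illustrative assumptions, not the plugin's actual API.

```scala
// Sketch of the proposed bookkeeping dump; hook point name and report
// format are assumptions, not existing spark-rapids code.
object BookkeepingDump {
  // Proposed on/off toggle; the whole feature is OFF by default.
  @volatile var enabled: Boolean = false

  // Build the report: total usage, per-thread attribution, and a stack
  // trace for every live thread, including threads blocked on allocation.
  def render(perThreadUsage: Map[Long, Long]): String = {
    val sb = new StringBuilder
    sb.append(s"GPU memory used: ${perThreadUsage.values.sum} bytes\n")
    Thread.getAllStackTraces.forEach {
      (t: Thread, frames: Array[StackTraceElement]) =>
        val used = perThreadUsage.getOrElse(t.getId, 0L)
        sb.append(s"Thread ${t.getName} (id=${t.getId}): $used bytes\n")
        frames.foreach(f => sb.append(s"    at $f\n"))
    }
    sb.toString
  }

  // Intended call site: right before AutoCloseableAttemptSpliterator#split.
  def dumpBeforeSplit(perThreadUsage: Map[Long, Long]): Unit =
    if (enabled) Console.err.println(render(perThreadUsage))
}
```

Because split can fire many times without a task failure, gating everything behind the toggle keeps the repeated dumps from becoming a cost when the feature is off.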
This work can be broken down into three subtasks: