catch tensor.numel() == 0 in nan detector (pytorch#140741)

Context: we are now sometimes passing an empty tensor through the system (it's an edge case), and it seems to cause all_reduce to seg fault, which was unexpected to me.

Deep Shah and Pavan identified the issue; I'm just pushing for a fix :)

Test Plan: idk what i'm doing here, someone help

Reviewed By: shuqiangzhang

Differential Revision: D65956095

Pull Request resolved: pytorch#140741
Approved by: https://github.com/shuqiangzhang
yuchengliu1 · Nov 15, 2024 · 8043e67
1 parent 865a7c5
Showing 1 changed file with 4 additions and 0 deletions.
diff --git a/torch/csrc/distributed/c10d/NanCheck.cu b/torch/csrc/distributed/c10d/NanCheck.cu
@@ -233,6 +233,10 @@ void checkForNan(const at::Tensor& tensor, at::cuda::CUDAStream& stream) {
   const size_t numThreadsPerBlock =
       std::min<size_t>(maxNumThreadsPerBlock, tensor.numel());
 
+  if (!(numThreadsPerBlock > 0)) {
+    return;
+  }
+
   const size_t numBlocks = std::min<size_t>(
       maxNumBlocks,
       (tensor.numel() + numThreadsPerBlock - 1) / numThreadsPerBlock);
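
The hunk above can be read as follows: when `tensor.numel()` is 0, `numThreadsPerBlock` is also 0, and the ceiling division that computes `numBlocks` would then divide by zero, which is undefined behavior in C++. Whether that division is the exact cause of the reported all_reduce crash is not stated in the commit, so treat that link as an assumption. Below is a minimal host-only sketch of the same arithmetic with the guard in place. `LaunchConfig` and `computeLaunchConfig` are hypothetical names that do not exist in PyTorch, and the two limits are passed as parameters because their real values are not shown in this diff.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper type, not part of PyTorch: holds the result of the
// host-side launch arithmetic so it can be inspected without a GPU.
struct LaunchConfig {
  bool shouldLaunch;               // false means "empty tensor, skip the check"
  std::size_t numBlocks;
  std::size_t numThreadsPerBlock;
};

// Hypothetical function mirroring the arithmetic in checkForNan.
// maxNumThreadsPerBlock and maxNumBlocks are assumptions supplied by the
// caller; the diff does not show their actual values.
LaunchConfig computeLaunchConfig(
    std::size_t numel,
    std::size_t maxNumThreadsPerBlock,
    std::size_t maxNumBlocks) {
  const std::size_t numThreadsPerBlock =
      std::min<std::size_t>(maxNumThreadsPerBlock, numel);

  // Same early exit as the patch: with zero threads there is nothing to
  // check, and the division below would otherwise divide by zero.
  if (numThreadsPerBlock == 0) {
    return LaunchConfig{false, 0, 0};
  }

  // Ceiling division, capped at maxNumBlocks.
  const std::size_t numBlocks = std::min<std::size_t>(
      maxNumBlocks,
      (numel + numThreadsPerBlock - 1) / numThreadsPerBlock);

  return LaunchConfig{true, numBlocks, numThreadsPerBlock};
}
```

Returning a flag instead of launching directly keeps the sketch testable on a machine without CUDA; the actual patch simply returns from `checkForNan` before any kernel launch.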