[GPU] [ROCM] Matmul-like op followed by pad produces NAN values #19703

Open
zjgarvey opened this issue Jan 15, 2025 · 0 comments

What happened?

Compiling and running a few face analysis ONNX models produces NAN outputs for gfx942.

I was able to generate a small linalg-level reproducer from the problematic dispatch in one such model. It seems like performing a particular matmul-like conv operation followed by a pad results in the NAN values.

Some other notes:

  • Removing the pad does not reproduce the issue.
  • Using the same operations with smaller sizes (e.g., changing 256 -> 10, 512 -> 20, 6 -> 2) does not reproduce the issue either (see the sketch after this list).
  • The issue does not exist on CPU (see the CPU commands after the expected output below).
  • The issue occurs regardless of the inputs provided.
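For reference, a minimal sketch of the smaller-size variant from the second note (just the same IR with 256 -> 10, 512 -> 20, 6 -> 2 substituted); compiled and run the same way with splat inputs '10x2x2xf32=1.0' and '20x10xf32=1.0', this version does not produce NAN values:

module {
    func.func @smaller_generator(%arg0 : tensor<10x2x2xf32>, %arg1: tensor<20x10xf32>) -> tensor<20x4x4xf32> {
        %cst = arith.constant 0.0 : f32
        %2 = tensor.empty() : tensor<20x2x2xf32>
        %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<20x2x2xf32>) -> tensor<20x2x2xf32>
        %4 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d3, d1, d2)>, affine_map<(d0, d1, d2, d3) -> (d0, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel", "reduction"]} ins(%arg0, %arg1 : tensor<10x2x2xf32>, tensor<20x10xf32>) outs(%3 : tensor<20x2x2xf32>) {
        ^bb0(%in: f32, %in_0: f32, %out: f32):
          %5 = arith.mulf %in, %in_0 : f32
          %6 = arith.addf %out, %5 : f32
          linalg.yield %6 : f32
        } -> tensor<20x2x2xf32>
        %5 = tensor.pad %4 low[0,1,1] high[0,1,1] {
            ^bb0(%arg2: index, %arg3: index, %arg4: index):
                tensor.yield %cst : f32
        } : tensor<20x2x2xf32> to tensor<20x4x4xf32>
        return %5 : tensor<20x4x4xf32>
    }
}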

Steps to reproduce your issue

Small Reproducer

  1. Save the following IR to a file 'repro.mlir':
module {
    func.func @nan_generator(%arg0 : tensor<256x6x6xf32>, %arg1: tensor<512x256xf32>) -> tensor<512x8x8xf32> {
        %cst = arith.constant 0.0 : f32
        %2 = tensor.empty() : tensor<512x6x6xf32>
        %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<512x6x6xf32>) -> tensor<512x6x6xf32>
        %4 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d3, d1, d2)>, affine_map<(d0, d1, d2, d3) -> (d0, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel", "reduction"]} ins(%arg0, %arg1 : tensor<256x6x6xf32>, tensor<512x256xf32>) outs(%3 : tensor<512x6x6xf32>) {
        ^bb0(%in: f32, %in_0: f32, %out: f32):
          %5 = arith.mulf %in, %in_0 : f32
          %6 = arith.addf %out, %5 : f32
          linalg.yield %6 : f32
        } -> tensor<512x6x6xf32>
        %5 = tensor.pad %4 low[0,1,1] high[0,1,1] {
            ^bb0(%arg2: index, %arg3: index, %arg4: index):
                tensor.yield %cst : f32
        } : tensor<512x6x6xf32> to tensor<512x8x8xf32>
        return %5 : tensor<512x8x8xf32>
    }
}
  2. Compile for MI300 (gfx942):
iree-compile --iree-hal-target-backends=rocm --iree-hip-target=gfx942 repro.mlir -o repro.vmfb
  3. Run with splat inputs:
iree-run-module --device=hip --module=repro.vmfb --input='256x6x6xf32=1.0' --input='512x256xf32=1.0'

The terminal output should look like:

EXEC @nan_generator
result[0]: hal.buffer_view
512x8x8xf32=[[0 0 0 0 0 0 0 0][0 -NAN -NAN -NAN -NAN -NAN -NAN 0][0 -NAN -NAN -NAN -NAN -NAN -NAN 0][0 -NAN -NAN -NAN -NAN -NAN -NAN 0][0 -NAN -NAN -NAN -NAN -NAN -NAN 0][0 -NAN -NAN -NAN -NAN -NAN -NAN 0][0 -NAN -NAN -NAN -NAN -NAN -NAN 0]...
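For comparison, the CPU path (which does not show the issue) can be checked with something like the following; these are the standard llvm-cpu flags, but adjust as needed for your setup:

iree-compile --iree-hal-target-backends=llvm-cpu repro.mlir -o repro_cpu.vmfb
iree-run-module --device=local-task --module=repro_cpu.vmfb --input='256x6x6xf32=1.0' --input='512x256xf32=1.0'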

Full Model Reproducer:

  1. Get the ONNX model:
wget https://onnxstorage.blob.core.windows.net/onnxstorage/e2eshark/onnx/models/face_analysis_2d106det/model.onnx.zip
unzip model.onnx.zip
  2. Import to MLIR:

If your iree-compiler has ONNX import support:

iree-import-onnx model.onnx -o repro.mlir

If you have Python bindings enabled and IREE's Python packages are on your PYTHONPATH:

python -m iree.compiler.tools.import_onnx model.onnx -o repro.mlir
  3. Follow steps 2 and 3 from the previous section, with the single input shape 1x3x192x192xf32 (see the commands below).
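Concretely, the analogous commands would be something like the following (splat inputs should be fine here, since the issue is input-independent):

iree-compile --iree-hal-target-backends=rocm --iree-hip-target=gfx942 repro.mlir -o model.vmfb
iree-run-module --device=hip --module=model.vmfb --input='1x3x192x192xf32=1.0'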

What component(s) does this issue relate to?

No response

Version information

The issue is reproducible with pip-installed packages:

iree-base-compiler 3.2.0rc20250114
iree-base-runtime  3.2.0rc20250114
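These were installed roughly as follows (the find-links URL is the nightly-release index from the IREE docs; adjust if yours differs):

python -m pip install -f https://iree.dev/pip-release-links.html iree-base-compiler==3.2.0rc20250114 iree-base-runtime==3.2.0rc20250114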

rocminfo indicates I have "ROCk module version 6.8.5"

Additional context

The original op in the onnx model that generates the matmul-like generic op is:

%227 = torch.operator "onnx.Conv"(%226, %122) {torch.onnx.dilations = [1 : si64, 1 : si64], torch.onnx.group = 1 : si64, torch.onnx.kernel_shape = [1 : si64, 1 : si64], torch.onnx.pads = [0 : si64, 0 : si64, 0 : si64, 0 : si64], torch.onnx.strides = [1 : si64, 1 : si64]} : (!torch.vtensor<[1,256,6,6],f32>, !torch.vtensor<[512,256,1,1],f32>) -> !torch.vtensor<[1,512,6,6],f32> 

The output appears to be stored into a larger tensor because it is eventually followed by another conv with padding of 1 on both low and high.


The last IR snippet I can somewhat read is the IR dump before LLVMGPUVectorLoweringPass:

func.func @nan_generator_dispatch_0_matmul_like_512x6x6x256_f32() {
  %c1 = arith.constant 1 : index
  %c1024 = arith.constant 1024 : index
  %c6 = arith.constant 6 : index
  %c0 = arith.constant 0 : index
  %cst = arith.constant 0.000000e+00 : f32
  %c256 = arith.constant 256 : index
  %c64 = arith.constant 64 : index
  %cst_0 = arith.constant dense<0.000000e+00> : vector<1x1x6xf32>
  %thread_id_y = gpu.thread_id  y
  %thread_id_x = gpu.thread_id  x
  %0 = arith.muli %thread_id_y, %c64 : index
  %1 = arith.addi %0, %thread_id_x : index
  %2 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : memref<256x6x6xf32, #gpu.address_space<global>>
  memref.assume_alignment %2, 64 : memref<256x6x6xf32, #gpu.address_space<global>>
  %3 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(1) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : memref<512x256xf32, #gpu.address_space<global>>
  memref.assume_alignment %3, 64 : memref<512x256xf32, #gpu.address_space<global>>
  %4 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(2) alignment(64) offset(%c0) flags(Indirect) : memref<512x8x8xf32, #gpu.address_space<global>>
  memref.assume_alignment %4, 64 : memref<512x8x8xf32, #gpu.address_space<global>>
  %workgroup_id_x = hal.interface.workgroup.id[0] : index
  gpu.barrier
  scf.for %arg0 = %1 to %c6 step %c1024 {
    %5 = scf.for %arg1 = %c0 to %c256 step %c64 iter_args(%arg2 = %cst_0) -> (vector<1x1x6xf32>) {
      %8 = vector.transfer_read %2[%arg1, %arg0, %c0], %cst {in_bounds = [true, true, true]} : memref<256x6x6xf32, #gpu.address_space<global>>, vector<64x1x6xf32>
      %9 = vector.transfer_read %3[%workgroup_id_x, %arg1], %cst {in_bounds = [true]} : memref<512x256xf32, #gpu.address_space<global>>, vector<64xf32>
      %10 = vector.transpose %8, [1, 0, 2] : vector<64x1x6xf32> to vector<1x64x6xf32>
      %11 = vector.extract %10[0] : vector<64x6xf32> from vector<1x64x6xf32>
      %12 = vector.extract %arg2[0, 0] : vector<6xf32> from vector<1x1x6xf32>
      %13 = vector.contract {indexing_maps = [affine_map<(d0, d1) -> (d1, d0)>, affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0)>], iterator_types = ["parallel", "reduction"], kind = #vector.kind<add>} %11, %9, %12 : vector<64x6xf32>, vector<64xf32> into vector<6xf32>
      %14 = vector.broadcast %13 : vector<6xf32> to vector<1x1x6xf32>
      scf.yield %14 : vector<1x1x6xf32>
    }
    %6 = arith.addi %arg0, %c1 : index
    %7 = vector.extract %5[0, 0] : vector<6xf32> from vector<1x1x6xf32>
    vector.transfer_write %7, %4[%workgroup_id_x, %6, %c1] {in_bounds = [true]} : vector<6xf32>, memref<512x8x8xf32, #gpu.address_space<global>>
  }
  gpu.barrier
  return
}

After this, it gets converted into around 130 lines of vector.load and affine.apply, then another 150 lines of vector.extract, vector.splat, and vector.fma, and I can't glean anything useful from reading it.

Although my understanding of this level is pretty poor, the outer scf.for op has a somewhat suspicious step of 1024 and an upper bound of 6 (although I don't really know how to interpret the lower bound %1, and I also don't understand the syntax of this loop, since it doesn't have an scf.yield; see the minimal example below). If relevant, this loop is generated by GPUDistributeForallPass.
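As a side note, here is a minimal standalone sketch of the same loop shape (hypothetical, just to illustrate the syntax):

func.func @loop_without_yield(%buf: memref<6xf32>, %start: index) {
  %c6 = arith.constant 6 : index
  %c1024 = arith.constant 1024 : index
  %cst = arith.constant 1.000000e+00 : f32
  // No iter_args, so the body needs no explicit scf.yield; with step 1024 and
  // upper bound 6, the store executes at most once, and only if %start < 6.
  scf.for %i = %start to %c6 step %c1024 {
    memref.store %cst, %buf[%i] : memref<6xf32>
  }
  return
}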

@zjgarvey zjgarvey added the bug 🐞 Something isn't working label Jan 15, 2025
@zjgarvey zjgarvey added this to the Scalability on AMD GPU milestone Jan 15, 2025