[GPU] [ROCM] Matmul-like op followed by pad produces NAN values #19703

Open
zjgarvey opened this issue Jan 15, 2025 · 0 comments

What happened?

Compiling and running a few face analysis ONNX models produces NAN outputs for gfx942.

I was able to generate a small linalg-level reproducer from the problematic dispatch in one such model. It seems like performing a particular matmul-like conv operation followed by a pad results in the NAN values.

Some other notes:

  • Removing the pad does not reproduce the issue.
  • Using the same operations with smaller sizes (e.g., changing 256 -> 10, 512 -> 20, 6 -> 2) does not reproduce the issue either (see the sketch after this list).
  • The issue does not exist on CPU (see the CPU commands after the expected output below).
  • The issue occurs regardless of the inputs provided.
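For reference, a minimal sketch of the smaller-size variant from the second note (just the same IR with 256 -> 10, 512 -> 20, 6 -> 2 substituted); compiled and run the same way with splat inputs '10x2x2xf32=1.0' and '20x10xf32=1.0', this version does not produce NAN values:

module {
    func.func @smaller_generator(%arg0 : tensor<10x2x2xf32>, %arg1: tensor<20x10xf32>) -> tensor<20x4x4xf32> {
        %cst = arith.constant 0.0 : f32
        %2 = tensor.empty() : tensor<20x2x2xf32>
        %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<20x2x2xf32>) -> tensor<20x2x2xf32>
        %4 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d3, d1, d2)>, affine_map<(d0, d1, d2, d3) -> (d0, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel", "reduction"]} ins(%arg0, %arg1 : tensor<10x2x2xf32>, tensor<20x10xf32>) outs(%3 : tensor<20x2x2xf32>) {
        ^bb0(%in: f32, %in_0: f32, %out: f32):
          %5 = arith.mulf %in, %in_0 : f32
          %6 = arith.addf %out, %5 : f32
          linalg.yield %6 : f32
        } -> tensor<20x2x2xf32>
        %5 = tensor.pad %4 low[0,1,1] high[0,1,1] {
            ^bb0(%arg2: index, %arg3: index, %arg4: index):
                tensor.yield %cst : f32
        } : tensor<20x2x2xf32> to tensor<20x4x4xf32>
        return %5 : tensor<20x4x4xf32>
    }
}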

Steps to reproduce your issue

Small Reproducer

  1. Save the following IR to a file 'repro.mlir':
module {
    func.func @nan_generator(%arg0 : tensor<256x6x6xf32>, %arg1: tensor<512x256xf32>) -> tensor<512x8x8xf32> {
        %cst = arith.constant 0.0 : f32
        %2 = tensor.empty() : tensor<512x6x6xf32>
        %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<512x6x6xf32>) -> tensor<512x6x6xf32>
        %4 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d3, d1, d2)>, affine_map<(d0, d1, d2, d3) -> (d0, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel", "reduction"]} ins(%arg0, %arg1 : tensor<256x6x6xf32>, tensor<512x256xf32>) outs(%3 : tensor<512x6x6xf32>) {
        ^bb0(%in: f32, %in_0: f32, %out: f32):
          %5 = arith.mulf %in, %in_0 : f32
          %6 = arith.addf %out, %5 : f32
          linalg.yield %6 : f32
        } -> tensor<512x6x6xf32>
        %5 = tensor.pad %4 low[0,1,1] high[0,1,1] {
            ^bb0(%arg2: index, %arg3: index, %arg4: index):
                tensor.yield %cst : f32
        } : tensor<512x6x6xf32> to tensor<512x8x8xf32>
        return %5 : tensor<512x8x8xf32>
    }
}
  2. Compile for MI300 (gfx942):
iree-compile --iree-hal-target-backends=rocm --iree-hip-target=gfx942 repro.mlir -o repro.vmfb
  3. Run with splat inputs:
iree-run-module --device=hip --module=repro.vmfb --input='256x6x6xf32=1.0' --input='512x256xf32=1.0'

The terminal output should look like:

EXEC @nan_generator
result[0]: hal.buffer_view
512x8x8xf32=[[0 0 0 0 0 0 0 0][0 -NAN -NAN -NAN -NAN -NAN -NAN 0][0 -NAN -NAN -NAN -NAN -NAN -NAN 0][0 -NAN -NAN -NAN -NAN -NAN -NAN 0][0 -NAN -NAN -NAN -NAN -NAN -NAN 0][0 -NAN -NAN -NAN -NAN -NAN -NAN 0][0 -NAN -NAN -NAN -NAN -NAN -NAN 0]...
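For comparison, the CPU path (which does not show the issue) can be checked with something like the following; these are the standard llvm-cpu flags, but adjust as needed for your setup:

iree-compile --iree-hal-target-backends=llvm-cpu repro.mlir -o repro_cpu.vmfb
iree-run-module --device=local-task --module=repro_cpu.vmfb --input='256x6x6xf32=1.0' --input='512x256xf32=1.0'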

Full Model Reproducer:

  1. Get the ONNX model:
wget https://onnxstorage.blob.core.windows.net/onnxstorage/e2eshark/onnx/models/face_analysis_2d106det/model.onnx.zip
unzip model.onnx.zip
  2. Import to MLIR:

If your iree-compiler has ONNX import support:

iree-import-onnx model.onnx -o repro.mlir

If you have Python bindings enabled and IREE's Python packages are on your PYTHONPATH:

python -m iree.compiler.tools.import_onnx model.onnx -o repro.mlir
  3. Follow steps 2 and 3 from the previous section, with the single input shape 1x3x192x192xf32 (see the commands below).
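Concretely, the analogous commands would be something like the following (splat inputs should be fine here, since the issue is input-independent):

iree-compile --iree-hal-target-backends=rocm --iree-hip-target=gfx942 repro.mlir -o model.vmfb
iree-run-module --device=hip --module=model.vmfb --input='1x3x192x192xf32=1.0'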

What component(s) does this issue relate to?

No response

Version information

The issue is reproducible with pip-installed packages:

iree-base-compiler 3.2.0rc20250114
iree-base-runtime  3.2.0rc20250114
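These were installed roughly as follows (the find-links URL is the nightly-release index from the IREE docs; adjust if yours differs):

python -m pip install -f https://iree.dev/pip-release-links.html iree-base-compiler==3.2.0rc20250114 iree-base-runtime==3.2.0rc20250114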

rocminfo indicates I have "ROCk module version 6.8.5"

Additional context

The original op in the onnx model that generates the matmul-like generic op is:

%227 = torch.operator "onnx.Conv"(%226, %122) {torch.onnx.dilations = [1 : si64, 1 : si64], torch.onnx.group = 1 : si64, torch.onnx.kernel_shape = [1 : si64, 1 : si64], torch.onnx.pads = [0 : si64, 0 : si64, 0 : si64, 0 : si64], torch.onnx.strides = [1 : si64, 1 : si64]} : (!torch.vtensor<[1,256,6,6],f32>, !torch.vtensor<[512,256,1,1],f32>) -> !torch.vtensor<[1,512,6,6],f32> 

The output appears to be stored into a larger tensor because it is eventually followed by another conv with padding of 1 on both low and high.


The last IR snippet I can somewhat read is the IR dump before LLVMGPUVectorLoweringPass:

func.func @nan_generator_dispatch_0_matmul_like_512x6x6x256_f32() {
  %c1 = arith.constant 1 : index
  %c1024 = arith.constant 1024 : index
  %c6 = arith.constant 6 : index
  %c0 = arith.constant 0 : index
  %cst = arith.constant 0.000000e+00 : f32
  %c256 = arith.constant 256 : index
  %c64 = arith.constant 64 : index
  %cst_0 = arith.constant dense<0.000000e+00> : vector<1x1x6xf32>
  %thread_id_y = gpu.thread_id  y
  %thread_id_x = gpu.thread_id  x
  %0 = arith.muli %thread_id_y, %c64 : index
  %1 = arith.addi %0, %thread_id_x : index
  %2 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : memref<256x6x6xf32, #gpu.address_space<global>>
  memref.assume_alignment %2, 64 : memref<256x6x6xf32, #gpu.address_space<global>>
  %3 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(1) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : memref<512x256xf32, #gpu.address_space<global>>
  memref.assume_alignment %3, 64 : memref<512x256xf32, #gpu.address_space<global>>
  %4 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(2) alignment(64) offset(%c0) flags(Indirect) : memref<512x8x8xf32, #gpu.address_space<global>>
  memref.assume_alignment %4, 64 : memref<512x8x8xf32, #gpu.address_space<global>>
  %workgroup_id_x = hal.interface.workgroup.id[0] : index
  gpu.barrier
  scf.for %arg0 = %1 to %c6 step %c1024 {
    %5 = scf.for %arg1 = %c0 to %c256 step %c64 iter_args(%arg2 = %cst_0) -> (vector<1x1x6xf32>) {
      %8 = vector.transfer_read %2[%arg1, %arg0, %c0], %cst {in_bounds = [true, true, true]} : memref<256x6x6xf32, #gpu.address_space<global>>, vector<64x1x6xf32>
      %9 = vector.transfer_read %3[%workgroup_id_x, %arg1], %cst {in_bounds = [true]} : memref<512x256xf32, #gpu.address_space<global>>, vector<64xf32>
      %10 = vector.transpose %8, [1, 0, 2] : vector<64x1x6xf32> to vector<1x64x6xf32>
      %11 = vector.extract %10[0] : vector<64x6xf32> from vector<1x64x6xf32>
      %12 = vector.extract %arg2[0, 0] : vector<6xf32> from vector<1x1x6xf32>
      %13 = vector.contract {indexing_maps = [affine_map<(d0, d1) -> (d1, d0)>, affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0)>], iterator_types = ["parallel", "reduction"], kind = #vector.kind<add>} %11, %9, %12 : vector<64x6xf32>, vector<64xf32> into vector<6xf32>
      %14 = vector.broadcast %13 : vector<6xf32> to vector<1x1x6xf32>
      scf.yield %14 : vector<1x1x6xf32>
    }
    %6 = arith.addi %arg0, %c1 : index
    %7 = vector.extract %5[0, 0] : vector<6xf32> from vector<1x1x6xf32>
    vector.transfer_write %7, %4[%workgroup_id_x, %6, %c1] {in_bounds = [true]} : vector<6xf32>, memref<512x8x8xf32, #gpu.address_space<global>>
  }
  gpu.barrier
  return
}

After this, it gets converted into around 130 lines of vector.load and affine.apply, then another 150 lines of vector.extract, vector.splat, and vector.fma, and I can't glean anything useful from reading it.

Although my understanding of this level is pretty poor, the outer scf.for op has a somewhat suspicious step of 1024 and an upper bound of 6 (although I don't really know how to interpret the lower bound %1, and I also don't understand the syntax of this loop, since it doesn't have an scf.yield; see the minimal example below). If relevant, this loop is generated by GPUDistributeForallPass.
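As a side note, here is a minimal standalone sketch of the same loop shape (hypothetical, just to illustrate the syntax):

func.func @loop_without_yield(%buf: memref<6xf32>, %start: index) {
  %c6 = arith.constant 6 : index
  %c1024 = arith.constant 1024 : index
  %cst = arith.constant 1.000000e+00 : f32
  // No iter_args, so the body needs no explicit scf.yield; with step 1024 and
  // upper bound 6, the store executes at most once, and only if %start < 6.
  scf.for %i = %start to %c6 step %c1024 {
    memref.store %cst, %buf[%i] : memref<6xf32>
  }
  return
}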

@zjgarvey zjgarvey added the bug 🐞 Something isn't working label Jan 15, 2025
@zjgarvey zjgarvey added this to the Scalability on AMD GPU milestone Jan 15, 2025