[GPU] [ROCM] Matmul-like op followed by pad produces NAN values
#19703
Labels
bug 🐞
Something isn't working
What happened?
Compiling and running a few face analysis ONNX models produces NAN outputs for gfx942.
I was able to generate a small linalg-level reproducer from the problematic dispatch in one such model. It seems like performing a particular matmul-like conv operation followed by a pad results in the NAN values.
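For comparison against the GPU output, here is a minimal NumPy sketch of the pattern described above. It assumes the matmul-like conv is a 1x1 (pointwise) convolution and uses small made-up shapes; the function name and shapes are hypothetical and are not taken from the actual reproducer.

```python
import numpy as np

def pointwise_conv_then_pad(x, w, pad=1):
    """Reference for a 1x1 conv (a matmul over channels) followed by a zero pad.

    x: (N, C_in, H, W) input, w: (C_out, C_in) weights.
    Both the shapes and the 1x1 assumption are illustrative only.
    """
    # A 1x1 convolution contracts only over the channel dimension.
    y = np.einsum("nchw,kc->nkhw", x, w)
    # Pad the two spatial dims by `pad` on both low and high sides.
    return np.pad(y, ((0, 0), (0, 0), (pad, pad), (pad, pad)),
                  mode="constant", constant_values=0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8, 6, 6)).astype(np.float32)
w = rng.standard_normal((4, 8)).astype(np.float32)
ref = pointwise_conv_then_pad(x, w)
```

With finite inputs, the reference contains no NaN values, so any NaN in the compiled output would have to come from the generated code rather than from the math itself.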
Some other notes:
256->10, 512->20, 6->2) also does not reproduce the issue.
Steps to reproduce your issue
Small Reproducer
The output to terminal should look like
Full Model Reproducer:
if your iree-compiler has onnx import support:
if you have python bindings enabled and iree's python packages are on your PYTHONPATH:
1x3x192x192xf32
What component(s) does this issue relate to?
No response
Version information
The issue is reproducible with pip installed packages:
rocminfo indicates I have "ROCk module version 6.8.5"
Additional context
The original op in the onnx model that generates the matmul-like generic op is:
The output seems to be stored to a larger tensor because it is followed eventually by another conv with padding of 1 on both high and low.
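Since the result is written into the interior of a larger, padded tensor, it may help to check whether the NaN values land in the padded border or in the interior. The helper below is a hypothetical sketch, assuming the output has been loaded as a NumPy array with the two spatial dimensions last and a pad of 1 on each side.

```python
import numpy as np

def nan_report(out, pad=1):
    """Count NaNs in the interior vs. the padded border of `out`.

    Assumes the last two dims are spatial and were padded by `pad`
    on both low and high sides.
    """
    nan_mask = np.isnan(out)
    interior = np.zeros_like(nan_mask)  # boolean mask, all False
    interior[..., pad:-pad, pad:-pad] = True
    return {
        "interior": int((nan_mask & interior).sum()),
        "border": int((nan_mask & ~interior).sum()),
    }

# Tiny synthetic example: one NaN in the border, one in the interior.
example = np.zeros((1, 2, 4, 4), dtype=np.float32)
example[0, 0, 0, 0] = np.nan   # border element
example[0, 1, 2, 2] = np.nan   # interior element
report = nan_report(example)
```

NaNs only in the border would point at the pad region not being initialized, while NaNs in the interior would point at the matmul-like computation or its store.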
The last IR snippet I can somewhat read is the IR dump before LLVMGPUVectorLoweringPass:
After this, it gets converted into around 130 lines of vector.load, affine.apply, then another 150 lines of vector.extract, vector.splat, vector.fma, and I can't seem to glean anything useful from reading it.
Although my understanding of this level is pretty poor, the outer scf.for op has a somewhat suspicious step of 1024 and an end of 6 (although I don't really know how to parse the initial value %0, and also don't understand the syntax of this loop, since it doesn't have an scf.yield). If relevant, this loop gets generated from the pass GPUDistributeForallPass.
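On the loop syntax: an scf.for with no loop-carried values prints without its scf.yield terminator, so the missing scf.yield is expected. A step of 1024 with an end of 6 is also what a thread-distributed loop usually looks like, assuming %0 is a linearized thread id and 1024 is the total number of threads (neither is confirmed by the IR available here). Under that assumption, the loop behaves like this Python sketch:

```python
def distributed_loop_indices(thread_id, upper_bound=6, num_threads=1024):
    """Iterations one thread would run for: scf.for %i = %tid to 6 step 1024.

    `thread_id` stands in for %0; treating %0 as a thread id is an assumption.
    """
    return list(range(thread_id, upper_bound, num_threads))

def covered_indices(upper_bound=6, num_threads=1024):
    """Union of the iterations executed across all threads."""
    covered = []
    for tid in range(num_threads):
        covered.extend(distributed_loop_indices(tid, upper_bound, num_threads))
    return sorted(covered)

covered = covered_indices()
idle_thread_work = distributed_loop_indices(7)
```

Here threads 0 through 5 each run exactly one iteration and the remaining threads run none, so together they cover indices 0 to 5 once. If that reading is right, the loop bounds themselves would not be the source of the NaN values.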