[QNN] MatMul Op Builder to Handle All Cases of ONNX's MatMul #22639
Merged
Conversation
cloudhan approved these changes on Oct 29, 2024
skottmckay reviewed on Nov 7, 2024, leaving two review threads on onnxruntime/core/providers/qnn/builder/opbuilder/matmul_op_builder.cc (since resolved)
Commits updated: cae0690 to 172d6c5 (compare)
skottmckay reviewed on Nov 27, 2024, leaving two review threads on onnxruntime/core/providers/qnn/builder/opbuilder/matmul_op_builder.cc (since resolved)
skottmckay approved these changes on Dec 3, 2024
adrianlizarraga approved these changes on Jan 8, 2025
snnn pushed a commit that referenced this pull request on Jan 8, 2025
tarekziade pushed a commit to tarekziade/onnxruntime that referenced this pull request on Jan 10, 2025
guschmue pushed a commit that referenced this pull request on Jan 12, 2025
adrianlizarraga added a commit that referenced this pull request on Jan 17, 2025
…inputs (#23419)

### Description
- Fixes a regression for MatMul with two quantized/dynamic uint16 inputs: input[1] must be converted to uint8 to pass QNN validation.
- Separates the translation of `ONNX MatMul -> QNN MatMul` and `ONNX MatMul -> QNN FullyConnected` into separate functions to make the code more readable.

### Motivation and Context
PR #22639 updated the handling of MatMul, but the logic to handle MatMul with two non-const uint16 inputs was not ported from [simple_op_builder.cc](https://github.com/microsoft/onnxruntime/blob/c64fa18834f0651b7d62507a34d802874b099c29/onnxruntime/core/providers/qnn/builder/opbuilder/simple_op_builder.cc#L107) to the new [matmul_op_builder.cc](https://github.com/microsoft/onnxruntime/blob/c64fa18834f0651b7d62507a34d802874b099c29/onnxruntime/core/providers/qnn/builder/opbuilder/matmul_op_builder.cc#L57). #22639
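As a rough illustration of the restored check, here is a minimal sketch in C++. The type and function names (`TensorInfo`, `NeedsUint8Convert`) are hypothetical stand-ins, not onnxruntime's actual builder API; the sketch only encodes the condition described above, i.e., both MatMul inputs being dynamic (non-initializer) uint16 tensors, which is when input[1] must be routed through a conversion to uint8:

```cpp
#include <cstdio>

// Hypothetical stand-ins for graph tensor metadata; not onnxruntime types.
enum class DType { UInt8, UInt16, Float32 };

struct TensorInfo {
  DType dtype;
  bool is_initializer;  // true if the tensor is a constant weight
};

// True when QNN validation would reject the MatMul as-is and the builder
// must convert input[1] from uint16 to uint8.
bool NeedsUint8Convert(const TensorInfo& in0, const TensorInfo& in1) {
  const bool both_dynamic = !in0.is_initializer && !in1.is_initializer;
  return both_dynamic &&
         in0.dtype == DType::UInt16 &&
         in1.dtype == DType::UInt16;
}

int main() {
  TensorInfo a{DType::UInt16, false}, b{DType::UInt16, false};
  std::printf("%d\n", NeedsUint8Convert(a, b));  // prints 1
}
```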
ashrit-ms pushed a commit that referenced this pull request on Jan 23, 2025
ONNX's MatMul follows numpy.matmul semantics and supports input tensors with rank >= 1, but QNN's MatMul only supports input tensors with rank >= 2. This PR adds a MatMulOpBuilder for the QNN EP that builds a QNN graph covering all possible cases of ONNX's MatMul by inserting Reshape nodes where necessary, e.g., reshaping a 1D input to 2D if one exists, and reshaping the output back to the expected shape at the end.
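For reference, here is a minimal sketch of the numpy.matmul shape rules the builder has to reproduce. This is illustrative only, not the onnxruntime implementation, and all names are made up: it mirrors the promote-then-strip logic by padding a 1D input to 2D, broadcasting the batch dimensions, and dropping the padded dimensions from the output shape again, which is exactly why the builder inserts Reshape nodes around QNN's rank >= 2 MatMul.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

using Shape = std::vector<int64_t>;

// Computes the ONNX/numpy MatMul output shape: a 1D input is promoted to
// 2D for the multiply, and the padded dimension is dropped at the end.
Shape MatMulOutputShape(Shape a, Shape b) {
  assert(!a.empty() && !b.empty());
  const bool a_was_1d = a.size() == 1;
  const bool b_was_1d = b.size() == 1;
  if (a_was_1d) a.insert(a.begin(), 1);  // [K] -> [1, K]
  if (b_was_1d) b.push_back(1);          // [K] -> [K, 1]
  assert(a.back() == b[b.size() - 2]);   // inner dimensions must match

  // Broadcast the batch dimensions (everything except the last two).
  Shape out;
  const size_t a_batch = a.size() - 2, b_batch = b.size() - 2;
  const size_t n = std::max(a_batch, b_batch);
  for (size_t i = 0; i < n; ++i) {
    const int64_t da = i < n - a_batch ? 1 : a[i - (n - a_batch)];
    const int64_t db = i < n - b_batch ? 1 : b[i - (n - b_batch)];
    assert(da == db || da == 1 || db == 1);
    out.push_back(std::max(da, db));
  }
  if (!a_was_1d) out.push_back(a[a.size() - 2]);  // M
  if (!b_was_1d) out.push_back(b.back());         // N
  return out;  // 1D x 1D yields an empty shape, i.e. a scalar
}

int main() {
  const Shape r = MatMulOutputShape({3}, {2, 3, 4});  // 1D x 3D
  for (int64_t d : r) std::cout << d << ' ';          // prints: 2 4
  std::cout << '\n';
}
```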
This PR also tries to use the FullyConnected op in place of MatMul when the 2nd input is a 2D initializer or a 1D tensor, because FullyConnected is faster than MatMul on the QNN EP. A 2D 2nd input must be an initializer: FullyConnected requires the 2nd input in [n, k] shape, and only an initializer can be transposed at graph-build time, which avoids adding an extra Transpose node.
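A sketch of that decision, again using hypothetical helper types rather than onnxruntime's actual builder API: FullyConnected is chosen only when the 2nd input's [n, k] layout can be produced at graph-build time, and a 2D [k, n] initializer is transposed in place.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical stand-in for graph input metadata; not an onnxruntime type.
struct InputInfo {
  std::vector<int64_t> shape;
  bool is_initializer;  // weights known at graph-build time
};

enum class QnnOp { MatMul, FullyConnected };

// Pick the QNN op based on the 2nd MatMul input, per the rules above.
QnnOp ChooseQnnOp(const InputInfo& input1) {
  const size_t rank = input1.shape.size();
  if (rank == 1) return QnnOp::FullyConnected;  // 1D tensor: always usable
  // QNN FullyConnected wants the 2nd input as [n, k], but ONNX MatMul has
  // it as [k, n]; only an initializer can be transposed at build time
  // without inserting an extra Transpose node into the graph.
  if (rank == 2 && input1.is_initializer) return QnnOp::FullyConnected;
  return QnnOp::MatMul;  // dynamic 2D or higher-rank input
}

// Build-time transpose of a row-major [k, n] initializer into [n, k].
std::vector<float> TransposeToNK(const std::vector<float>& w,
                                 int64_t k, int64_t n) {
  std::vector<float> out(static_cast<size_t>(k * n));
  for (int64_t i = 0; i < k; ++i)
    for (int64_t j = 0; j < n; ++j)
      out[static_cast<size_t>(j * k + i)] = w[static_cast<size_t>(i * n + j)];
  return out;
}

int main() {
  InputInfo dynamic_2d{{64, 128}, false}, const_2d{{64, 128}, true};
  std::cout << (ChooseQnnOp(dynamic_2d) == QnnOp::MatMul)           // 1
            << (ChooseQnnOp(const_2d) == QnnOp::FullyConnected)     // 1
            << '\n';
}
```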
Take the swin_base model as an example: it contains several MatMul nodes whose 2nd input is a 2D initializer (not followed by Add). Running on a Gen3 mobile device, inference takes 34.8876 ms before this change and 27.0639 ms after, roughly a 22% latency reduction.