
Remaining Issues for MLP performance on par with libxsmm-dnn #783

Open · 2 of 4 tasks
rengolin opened this issue Nov 13, 2023 · 1 comment

rengolin commented Nov 13, 2023

These are the known issues to reach libxsmm-dnn performance on "pre-packed layer" MLPs:

In theory, once all of those land, we should reach parity. If more issues are discovered, please add them to the list. Let's only close this issue when we reach parity on the base pre-packed MLP benchmarks we have.

@chelini @alheinecke

chelini added a commit that referenced this issue Nov 20, 2023
Duplicate fill operations when the use is a contraction and the fill can be folded into the contraction later in the pipeline using `fold-xsmm-flags`. Duplication avoids the `memref.copy` operations that bufferization would otherwise introduce. Example:

```mlir
%0 = tensor.empty()
%1 = linalg.fill ins(...) outs(%0) // fill with zeros.
%2 = linalg.matmul ins(...) outs(%1)
%3 = linalg.matmul ins(...) outs(%1)
```
Without this PR it bufferizes as:

```mlir
%0 = memref.alloc()
%1 = memref.alloc()
linalg.fill ins(...) outs(%0) // fill with zeros.
memref.copy %0, %1 // copy so each matmul gets its own zero-filled buffer.
linalg.matmul ins(...) outs(%0)
linalg.matmul ins(...) outs(%1)
```
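With this PR, the fill is instead duplicated at the tensor level, so each contraction consumes its own fill and bufferization can write each result in place. A sketch of that intermediate IR, in the same abbreviated style as above (SSA names hypothetical):

```mlir
%0 = tensor.empty()
%1 = linalg.fill ins(...) outs(%0) // fill with zeros.
%2 = tensor.empty()
%3 = linalg.fill ins(...) outs(%2) // duplicated fill, one per contraction.
%4 = linalg.matmul ins(...) outs(%1)
%5 = linalg.matmul ins(...) outs(%3)
```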

After bufferization and `fold-xsmm-flags`, the IR with this PR looks like:

```mlir
// No copies; the zero fills are folded into the matmuls as beta = 0.
%0 = memref.alloc()
%1 = memref.alloc()
xsmm.matmul ins(...) outs(%0) // beta = 0
xsmm.matmul ins(...) outs(%1) // beta = 0
```
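For context: `linalg.matmul` accumulates into its output operand (C += A * B), which is why each result buffer must first be zero-filled. GEMM kernels such as libxsmm's take a beta scaling flag (C = A * B + beta * C), so folding the fill in as beta = 0 lets the kernel skip reading the output and makes the explicit fill unnecessary.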

The PR has minor performance impact; the only notable improvement is for
`fp32_mha_tensorflow_seq_len_32`. The IR also looks cleaner, with one fewer
allocation and all the beta flags properly folded.
`fp32_mha_tensorflow_seq_len_1024` does not improve because its
dimensionality allows fusion to distribute the fill; see b1167fe.

This PR is part of #783
rengolin commented:

Beta = 0 is done and the benchmark IR is affected, but we got <1% performance change from it, probably within noise. We didn't expect a huge change, so it's not a big deal.
