Duplicate fill operations when the use is a contraction and we can fold
the fill into the contraction later in the pipeline using
`fold-xsmm-flags`. Duplicating the fill avoids the `memref.copy` ops
that bufferization would otherwise introduce. Example:
```mlir
%0 = tensor.empty()
%1 = linalg.fill ins(...) outs(%0) // fill with zeros.
%2 = linalg.matmul ins(...) outs(%1)
%3 = linalg.matmul ins(...) outs(%1)
```
Without this PR it bufferizes as:
```mlir
%0 = memref.alloc()
%1 = memref.alloc()
linalg.fill ins(...) outs(%0) // fill with zeros.
memref.copy %0 into %1
linalg.matmul ins(...) outs(%0)
linalg.matmul ins(...) outs(%1)
```
With this PR the IR looks like:
```mlir
// no copies and fills folded as beta = 0.
%0 = memref.alloc()
%1 = memref.alloc()
xsmm.matmul ins(...) outs(%0) // beta = 0
xsmm.matmul ins(...) outs(%1) // beta = 0
```
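To make the effect of the duplication concrete, here is a minimal tensor-level sketch of the IR after the fill has been duplicated. The shapes, operand names, and function name are hypothetical, and whether the `tensor.empty` is also duplicated is an assumption of this sketch rather than something the PR states; the point is only that each matmul ends up consuming its own fill, so bufferization can give each chain a separate buffer without a `memref.copy`, and `fold-xsmm-flags` can later absorb each fill as `beta = 0`.
```mlir
// Hypothetical sketch, not the exact IR produced by the PR.
// Each matmul consumes its own empty + fill chain, so one-shot
// bufferization needs no memref.copy, and fold-xsmm-flags can
// fold each zero-fill into the contraction as beta = 0.
func.func @duplicated_fills(%a: tensor<8x16xf32>, %b: tensor<16x8xf32>,
                            %c: tensor<16x8xf32>)
    -> (tensor<8x8xf32>, tensor<8x8xf32>) {
  %zero = arith.constant 0.0 : f32
  // Assumption: the empty tensors are duplicated along with the fills.
  %e0 = tensor.empty() : tensor<8x8xf32>
  %e1 = tensor.empty() : tensor<8x8xf32>
  %f0 = linalg.fill ins(%zero : f32) outs(%e0 : tensor<8x8xf32>) -> tensor<8x8xf32>
  %f1 = linalg.fill ins(%zero : f32) outs(%e1 : tensor<8x8xf32>) -> tensor<8x8xf32>
  %m0 = linalg.matmul ins(%a, %b : tensor<8x16xf32>, tensor<16x8xf32>)
                      outs(%f0 : tensor<8x8xf32>) -> tensor<8x8xf32>
  %m1 = linalg.matmul ins(%a, %c : tensor<8x16xf32>, tensor<16x8xf32>)
                      outs(%f1 : tensor<8x8xf32>) -> tensor<8x8xf32>
  return %m0, %m1 : tensor<8x8xf32>, tensor<8x8xf32>
}
```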
The PR has a minor performance impact; the only notable improvement is for
`fp32_mha_tensorflow_seq_len_32`. The IR also looks cleaner, with one fewer
allocation and all the beta flags properly folded.
`fp32_mha_tensorflow_seq_len_1024` does not improve because its
dimensionality allows fusion to distribute the fill, see
b1167fe.
This PR is part of #783
Beta = 0 is done and the benchmark IR is affected, but we got <1% performance change from it, probably within noise. We didn't expect a huge change, so not a big deal.
These are the known issues that stand between us and libxsmm-dnn performance on "pre-packed layer" MLPs:
In theory, if we land all of those, we should reach parity. If more issues are discovered, please add them to the list. Let's only close this issue once we reach parity on the base MLP benchmarks we have for pre-packed MLPs.
@chelini @alheinecke