
Remaining Issues for MLP performance on par with libxsmm-dnn #783

Open · 2 of 4 tasks
rengolin opened this issue Nov 13, 2023 · 1 comment

rengolin commented Nov 13, 2023

These are the known issues to reach libxsmm-dnn performance on "pre-packed layer" MLPs:

In theory, once all of those land, we should reach parity. If more issues are discovered, please add them to the list. Let's only close this issue when we reach parity on the base pre-packed MLP benchmarks we have.

@chelini @alheinecke

chelini added a commit that referenced this issue Nov 20, 2023
Duplicate fill operations when the use is a contraction and the fill can be folded into the contraction later in the pipeline using `fold-xsmm-flags`. Duplication avoids the `memref.copy` operations that bufferization would otherwise introduce. Example:

```mlir
%0 = tensor.empty()
%1 = linalg.fill ins(...) outs(%0) // fill with zeros.
%2 = linalg.matmul ins(...) outs(%1)
%3 = linalg.matmul ins(...) outs(%1)
```
Without this PR it bufferizes as:

```mlir
%0 = memref.alloc()
%1 = memref.alloc()
linalg.fill ins(...) outs(%0) // fill with zeros.
memref.copy %0, %1 // copy so each matmul gets its own zero-filled buffer.
linalg.matmul ins(...) outs(%0)
linalg.matmul ins(...) outs(%1)
```
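With this PR, the fill is instead duplicated at the tensor level, so each contraction consumes its own fill and bufferization can write each result in place. A sketch of that intermediate IR, in the same abbreviated style as above (SSA names hypothetical):

```mlir
%0 = tensor.empty()
%1 = linalg.fill ins(...) outs(%0) // fill with zeros.
%2 = tensor.empty()
%3 = linalg.fill ins(...) outs(%2) // duplicated fill, one per contraction.
%4 = linalg.matmul ins(...) outs(%1)
%5 = linalg.matmul ins(...) outs(%3)
```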

After bufferization and `fold-xsmm-flags`, the IR with this PR looks like:

```mlir
// No copies; the zero fills are folded into the matmuls as beta = 0.
%0 = memref.alloc()
%1 = memref.alloc()
xsmm.matmul ins(...) outs(%0) // beta = 0
xsmm.matmul ins(...) outs(%1) // beta = 0
```
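For context: `linalg.matmul` accumulates into its output operand (C += A * B), which is why each result buffer must first be zero-filled. GEMM kernels such as libxsmm's take a beta scaling flag (C = A * B + beta * C), so folding the fill in as beta = 0 lets the kernel skip reading the output and makes the explicit fill unnecessary.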

The PR has minor performance impact; the only notable improvement is for
`fp32_mha_tensorflow_seq_len_32`. The IR also looks cleaner, with one fewer
allocation and all the beta flags properly folded.
`fp32_mha_tensorflow_seq_len_1024` does not improve because its
dimensionality allows fusion to distribute the fill; see b1167fe.

This PR is part of #783
rengolin commented:

Beta = 0 is done and the benchmark IR is affected, but we got <1% performance change from it, probably within noise. We didn't expect a huge change, so it's not a big deal.
