
Duplicate fill on contractions #784

Merged
merged 2 commits on Nov 20, 2023

Conversation

@chelini (Contributor) commented Nov 16, 2023

Duplicate fill operations when the use is a contraction, so that the fill can be folded into the contraction later in the pipeline using `fold-xsmm-flags`. Duplication avoids the `memref.copy` operations that bufferization would otherwise introduce. Example:

%0 = tensor.empty()
%1 = linalg.fill ins(...) outs(%0) // fill with zeros.
%2 = linalg.matmul ins(...) outs(%1)
%3 = linalg.matmul ins(...) outs(%1)
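
With this PR, the fill is duplicated so that each contraction writes into its own freshly filled init tensor. A sketch of what the tensor-level IR could look like after duplication (illustrative; SSA names are hypothetical):

%0 = tensor.empty()
%1 = linalg.fill ins(...) outs(%0) // fill with zeros.
%2 = tensor.empty()
%3 = linalg.fill ins(...) outs(%2) // duplicated fill.
%4 = linalg.matmul ins(...) outs(%1)
%5 = linalg.matmul ins(...) outs(%3)

Each matmul now exclusively owns its init value, so bufferization can allocate one buffer per matmul without inserting a copy, and each fill has a single contraction user that `fold-xsmm-flags` can fold away.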

Without this PR it bufferizes as:

%0 = memref.alloc()
%1 = memref.alloc()
linalg.fill ins(...) outs(%0) // fill with zeros.
memref.copy %0, %1 // copy the zero-filled buffer for the second matmul.
linalg.matmul ins(...) outs(%0)
linalg.matmul ins(...) outs(%1)

With this PR the IR looks like:

// no copies and fills folded as beta = 0.
%0 = memref.alloc()
%1 = memref.alloc()
xsmm.matmul ins(...) outs(%0) // beta = 0
xsmm.matmul ins(...) outs(%1) // beta = 0
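
The folding is sound because a BLAS-style GEMM computes C := beta * C + A * B; with beta = 0 the destination's prior contents are ignored, so the zero-fill is subsumed by the contraction itself and the separate fill (plus the copy it forced) can be dropped.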

The PR has a minor performance impact; the only notable improvement is for `fp32_mha_tensorflow_seq_len_32`. The IR also looks cleaner, with one fewer allocation and all the beta flags properly folded. `fp32_mha_tensorflow_seq_len_1024` does not improve because its dimensionality allows fusion to distribute the fill, see: b1167fe.

This PR is part of #783

@rengolin (Contributor) left a comment

I can't remember the examples we discussed about external dependencies on following ops, but we need to make sure that the copy isn't being used by other non-contraction/matmul operations, so we probably need an artificial test that checks the number of users of each following fill/contraction.
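
Concretely, an artificial test along these lines could exercise that case (a sketch; `linalg.generic` stands in for any non-contraction user):

%0 = tensor.empty()
%1 = linalg.fill ins(...) outs(%0) // fill with zeros.
%2 = linalg.matmul ins(...) outs(%1) // contraction user: eligible for beta = 0 folding.
%3 = linalg.generic ... ins(%1) outs(...) // non-contraction user: the fill must be preserved.

Here the fill has two users but only one is a contraction, so the pass should either skip duplication for %1 or duplicate only for the matmul while keeping the original fill for the generic op.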
