
Duplicate fill on contractions #784

Merged
merged 2 commits on Nov 20, 2023

Conversation

@chelini (Contributor) commented Nov 16, 2023

Duplicate fill operations when the use is a contraction, so that the fill can be folded into the contraction later in the pipeline using `fold-xsmm-flags`. Duplication avoids the `memref.copy` operations that bufferization would otherwise introduce. Example:

%0 = tensor.empty()
%1 = linalg.fill ins(...) outs(%0) // fill with zeros.
%2 = linalg.matmul ins(...) outs(%1)
%3 = linalg.matmul ins(...) outs(%1)
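
With this PR, the fill is duplicated so that each contraction writes into its own freshly filled init tensor. A sketch of what the tensor-level IR could look like after duplication (illustrative; SSA names are hypothetical):

%0 = tensor.empty()
%1 = linalg.fill ins(...) outs(%0) // fill with zeros.
%2 = tensor.empty()
%3 = linalg.fill ins(...) outs(%2) // duplicated fill.
%4 = linalg.matmul ins(...) outs(%1)
%5 = linalg.matmul ins(...) outs(%3)

Each matmul now exclusively owns its init value, so bufferization can allocate one buffer per matmul without inserting a copy, and each fill has a single contraction user that `fold-xsmm-flags` can fold away.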

Without this PR it bufferizes as:

%0 = memref.alloc()
%1 = memref.alloc()
linalg.fill ins(...) outs(%0) // fill with zeros.
memref.copy %0, %1 // copy the zero-filled buffer for the second matmul.
linalg.matmul ins(...) outs(%0)
linalg.matmul ins(...) outs(%1)

With this PR the IR looks like:

// no copies and fills folded as beta = 0.
%0 = memref.alloc()
%1 = memref.alloc()
xsmm.matmul ins(...) outs(%0) // beta = 0
xsmm.matmul ins(...) outs(%1) // beta = 0
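
The folding is sound because a BLAS-style GEMM computes C := beta * C + A * B; with beta = 0 the destination's prior contents are ignored, so the zero-fill is subsumed by the contraction itself and the separate fill (plus the copy it forced) can be dropped.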

The PR has a minor performance impact; the only notable improvement is for `fp32_mha_tensorflow_seq_len_32`. The IR also looks cleaner, with one fewer allocation and all the beta flags properly folded. `fp32_mha_tensorflow_seq_len_1024` does not improve because its dimensionality allows fusion to distribute the fill, see: b1167fe.

This PR is part of #783

@rengolin (Contributor) left a comment

I can't remember the examples we discussed about external dependencies on following ops, but we need to make sure that the copy isn't being used by other non-contraction/matmul operations, so we probably need an artificial test that checks the number of users of each following fill/contraction.
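
Concretely, an artificial test along these lines could exercise that case (a sketch; `linalg.generic` stands in for any non-contraction user):

%0 = tensor.empty()
%1 = linalg.fill ins(...) outs(%0) // fill with zeros.
%2 = linalg.matmul ins(...) outs(%1) // contraction user: eligible for beta = 0 folding.
%3 = linalg.generic ... ins(%1) outs(...) // non-contraction user: the fill must be preserved.

Here the fill has two users but only one is a contraction, so the pass should either skip duplication for %1 or duplicate only for the matmul while keeping the original fill for the generic op.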
