
Function outlining with -O3 #929

Closed

Conversation

@newling (Contributor) commented Nov 25, 2024

Bumping llvm-opt from -O2 to -O3 improves performance with function outlining so much that, at -O3, performance with outlining is better than without it! Hopefully CI doesn't report other workloads going OOM.

Update: Summary of findings (WIP)

We saw a ~2x decline in performance (time to run a matmul) for our 3 benchmark sizes ((M, N, K) in [(512, 512, 4096), (512, 4096, 512), (4096, 512, 512)]). This is illustrated in PR #919, where for each of the (M, N, K) sizes there is a run with and without function outlining. For example, for (4096, 512, 512) the time to run increases from 18 [ms] to 40 [ms] when outlining is enabled.

So we initially thought that function call overhead on AIE is for some reason very large, but then Jorn pointed out that the ukernel approach also uses function calls, and performance there is decent -- indeed with ukernels the time here is 11 [ms] (see #919). And there are presumably as many function calls with ukernels as with function outlining, so it can't just be function call overhead that is causing the slowdown.

As an experiment I tried increasing the optimization level of llvm-opt from -O2 to -O3. With outlining enabled this results in some function calls being inlined (of course this isn't possible with ukernels).
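For concreteness, the comparison amounts to running the optimizer at the two levels on the same input; a minimal sketch, assuming llvm-opt accepts standard opt-style flags, and with placeholder output file names:

```
llvm-opt -O2 -S input.ll -o output-O2.ll
llvm-opt -O3 -S input.ll -o output-O3.ll
```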

To check this, I first count the number of calls to the outlined function with:

```
grep -r "call void @generic_matmul_0_outlined" file.name.ll | wc -l
```

Before llvm-opt there are 32 calls (this is the input.ll). After llvm-opt with -O2 there are 64 calls. With -O3 there are 16 calls.

Doing the same for the matmul intrinsic, i.e. counting appearances of llvm.aie2.bf.mac16.conf:

Before llvm-opt: 2 appearances. After -O2: 33 appearances. After -O3: 545 appearances. As a reference, without outlining there are 1024 appearances, for both -O2 and -O3 (-O2 and -O3 produce exactly the same code when there is no outlining).
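The same counting approach as above applies here; a sketch, using the same placeholder file names as in the llvm-opt sketch:

```
grep "llvm.aie2.bf.mac16.conf" input.ll | wc -l      # 2 (before llvm-opt)
grep "llvm.aie2.bf.mac16.conf" output-O2.ll | wc -l  # 33
grep "llvm.aie2.bf.mac16.conf" output-O3.ll | wc -l  # 545
```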

The sizes of the elf files differ between -O2 and -O3 as well. The file core_1_2.elf is 9.8 KB with -O2 and 13 KB with -O3. As a reference, with outlining disabled the size is 16 KB at both -O2 and -O3.
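A quick way to check this; a sketch, assuming the elf files are in the current directory:

```
ls -lh core_1_2.elf   # 9.8K with -O2, 13K with -O3 (outlining enabled)
```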

Summary for (512, 4096, 512) matmul (see https://github.com/nod-ai/iree-amd-aie/actions/runs/12015758500/job/33494886925)

Current ToM (top of main):

vanilla_matmul_512_4096_512_bf16_f32.mlir
[IREE_AMDAIE] Kernel time: 18.8055 [ms]
[IREE_AMDAIE] Kernel time: 18.1393 [ms]

vanilla_matmul_512_4096_512_bf16_f32_ukernel.mlir
[IREE_AMDAIE] Kernel time: 13.0724 [ms]
[IREE_AMDAIE] Kernel time: 13.33 [ms]

This PR:

vanilla_matmul_512_4096_512_bf16_f32_outlining.mlir
[IREE_AMDAIE] Kernel time: 16.5064 [ms]
[IREE_AMDAIE] Kernel time: 16.4465 [ms]
[IREE_AMDAIE] Kernel time: 16.0052 [ms]

vanilla_matmul_512_4096_512_bf16_f32.mlir
[IREE_AMDAIE] Kernel time: 18.7304 [ms]
[IREE_AMDAIE] Kernel time: 19.48 [ms]
[IREE_AMDAIE] Kernel time: 18.7937 [ms]

So -O3 and -O2 are identical if there is no function outlining, but -O3 helps performance when there is outlining.

@newling newling changed the title first commit Function outlining with -O3 Nov 25, 2024
@jtuyls (Collaborator) commented Nov 25, 2024

> Bumping llvm-opt from -O2 to -O3 improves performance with function outlining so much that, at -O3, performance with outlining is better than without it! Hopefully CI doesn't report other workloads going OOM.

Nice find! Do you have an idea why the difference is so big? IMO we shouldn't (in theory) have to bump to -O3 to get reasonable function outlining performance. Are we sure this doesn't just get inlined again after all? A couple of things that could be tried:

  • What's the elf size difference between -O2 and -O3?
  • What about explicitly disabling auto-inlining? Does --auto_inline=0 work?

@jtuyls (Collaborator) commented Nov 26, 2024

> Before llvm-opt there are 32 calls (this is the input.ll). After llvm-opt with -O2 there are 64 calls. With -O3 there are 16 calls.

This result quite surprises me. Why would it double with -O2 and halve with -O3?

@newling (Contributor, Author) commented Nov 26, 2024

> This result quite surprises me. Why would it double with -O2 and halve with -O3?

Note this is just a simple count of the number of calls in the file.

What makes the number of calls increase? The number of calls increases if a loop containing calls is unrolled.

What makes the number of calls decrease? The number of calls decreases if a call is inlined. It could also decrease if the opposite of unrolling (re-rolling) happens, but that seems very unlikely at -O3.

So I think what has happened (I can take another look to confirm) is that at -O2 there is some unrolling, so the number of calls in the file increases from 32 to 64.

Then at -O3, some of the 64 calls which appear in the -O2 version are inlined. Specifically, three quarters of them are, leaving just 16 calls.
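In numbers, assuming an unroll factor of 2 at -O2 and that three quarters of the resulting calls are then inlined at -O3: 32 × 2 = 64 calls after -O2, and 64 × (1 − 3/4) = 16 calls after -O3, which matches the counts above.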

newling added a commit that referenced this pull request Dec 2, 2024

Motivation: we probably want -O3, but we're not completely committed to it yet (see
#929)

I've added a C++ unit test. I've also tested this from run.py with
```
aie_compilation_flags=["--iree-amd-aie-additional-peano-opt-flags=\"-O3\""],
```
 
and it works, but I don't want this e2e-style test to run in CI for now.

Also in PR:

-- use anonymous namespace for functions in the .h file
-- minor refactorings

---------

Co-authored-by: Jorn Tuyls <[email protected]>
@newling (Contributor, Author) commented Dec 5, 2024

Closing, please see #950

@newling newling closed this Dec 5, 2024