
Function outlining with -O3 #929

Closed

Conversation

@newling (Contributor) commented Nov 25, 2024

Bumping llvm-opt from -O2 to -O3 improves performance with function outlining so much that, at -O3, performance with outlining is better than without it! Hopefully CI doesn't report other workloads going OOM.

Update: Summary of findings (WIP)

We saw a ~2x decline in performance (time to run a matmul) for our 3 benchmark sizes ((M, N, K) in [(512, 512, 4096), (512, 4096, 512), (4096, 512, 512)]). This is illustrated in PR #919, where for each of the (M, N, K) sizes there is a run with and without function outlining. For example, for (4096, 512, 512) the time to run increases from 18 [ms] to 40 [ms] when outlining is enabled.

So we initially thought that function call overhead on AIE is for some reason very large, but then Jorn pointed out that the ukernel approach also uses function calls, and performance there is decent -- indeed with ukernels the time here is 11 [ms] (see #919). And there are presumably as many function calls with ukernels as with function outlining, so it can't just be function call overhead that is causing the slowdown.

As an experiment I tried increasing the optimization level of llvm-opt from -O2 to -O3. With outlining enabled this results in some function calls being inlined (of course this isn't possible with ukernels).
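For concreteness, the comparison amounts to running the optimizer at the two levels on the same input; a minimal sketch, assuming llvm-opt accepts standard opt-style flags, and with placeholder output file names:

```
llvm-opt -O2 -S input.ll -o output-O2.ll
llvm-opt -O3 -S input.ll -o output-O3.ll
```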

To check this, I first count the number of calls to the outlined function with:

```
grep -r "call void @generic_matmul_0_outlined" file.name.ll | wc -l
```

Before llvm-opt there are 32 calls (this is the input.ll). After llvm-opt with -O2 there are 64 calls. With -O3 there are 16 calls.

Doing the same for the matmul intrinsic, i.e. counting appearances of llvm.aie2.bf.mac16.conf:

Before llvm-opt: 2 appearances. After -O2: 33 appearances. After -O3: 545 appearances. As a reference, without outlining there are 1024 appearances, for both -O2 and -O3 (-O2 and -O3 produce exactly the same code when there is no outlining).
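The same counting approach as above applies here; a sketch, using the same placeholder file names as in the llvm-opt sketch:

```
grep "llvm.aie2.bf.mac16.conf" input.ll | wc -l      # 2 (before llvm-opt)
grep "llvm.aie2.bf.mac16.conf" output-O2.ll | wc -l  # 33
grep "llvm.aie2.bf.mac16.conf" output-O3.ll | wc -l  # 545
```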

The sizes of the elf files differ between -O2 and -O3 as well. The file core_1_2.elf is 9.8 KB with -O2 and 13 KB with -O3. As a reference, with outlining disabled the size is 16 KB at both -O2 and -O3.
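A quick way to check this; a sketch, assuming the elf files are in the current directory:

```
ls -lh core_1_2.elf   # 9.8K with -O2, 13K with -O3 (outlining enabled)
```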

Summary for (512, 4096, 512) matmul (see https://github.com/nod-ai/iree-amd-aie/actions/runs/12015758500/job/33494886925)

Current ToM (top of main):

vanilla_matmul_512_4096_512_bf16_f32.mlir
[IREE_AMDAIE] Kernel time: 18.8055 [ms]
[IREE_AMDAIE] Kernel time: 18.1393 [ms]

vanilla_matmul_512_4096_512_bf16_f32_ukernel.mlir
[IREE_AMDAIE] Kernel time: 13.0724 [ms]
[IREE_AMDAIE] Kernel time: 13.33 [ms]

This PR:

vanilla_matmul_512_4096_512_bf16_f32_outlining.mlir
[IREE_AMDAIE] Kernel time: 16.5064 [ms]
[IREE_AMDAIE] Kernel time: 16.4465 [ms]
[IREE_AMDAIE] Kernel time: 16.0052 [ms]

vanilla_matmul_512_4096_512_bf16_f32.mlir
[IREE_AMDAIE] Kernel time: 18.7304 [ms]
[IREE_AMDAIE] Kernel time: 19.48 [ms]
[IREE_AMDAIE] Kernel time: 18.7937 [ms]

So -O3 and -O2 are identical if there is no function outlining, but -O3 helps performance when there is outlining.

@newling newling changed the title first commit Function outlining with -O3 Nov 25, 2024
@jtuyls (Collaborator) commented Nov 25, 2024

> Bumping llvm-opt from -O2 to -O3 improves performance with function outlining so much that, at -O3, performance with outlining is better than without it! Hopefully CI doesn't report other workloads going OOM.

Nice find! Do you have an idea why the difference is so big? IMO we shouldn't (in theory) have to bump to -O3 to get reasonable function outlining performance. Are we sure this doesn't just get inlined again after all? A couple of things that could be tried:

  • What's the elf size difference between -O2 and -O3?
  • What about explicitly disabling auto-inlining? Does --auto_inline=0 work?

@jtuyls (Collaborator) commented Nov 26, 2024

> Before llvm-opt there are 32 calls (this is the input.ll). After llvm-opt with -O2 there are 64 calls. With -O3 there are 16 calls.

This result quite surprises me. Why would it double with -O2 and halve with -O3?

@newling (Contributor, Author) commented Nov 26, 2024

> This result quite surprises me. Why would it double with -O2 and halve with -O3?

Note this is just a simple count of the number of calls in the file.

What makes the number of calls increase? The number of calls increases if a loop containing calls is unrolled.

What makes the number of calls decrease? The number of calls decreases if a call is inlined. It could also decrease if the opposite of unrolling (re-rolling) happens, but that seems very unlikely at -O3.

So I think what has happened (I can take another look to confirm) is that at -O2 there is some unrolling, so the number of calls in the file increases from 32 to 64.

Then at -O3, some of the 64 calls which appear in the -O2 version are inlined. Specifically, three quarters of them are, leaving just 16 calls.
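In numbers, assuming an unroll factor of 2 at -O2 and that three quarters of the resulting calls are then inlined at -O3: 32 × 2 = 64 calls after -O2, and 64 × (1 − 3/4) = 16 calls after -O3, which matches the counts above.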

newling added a commit that referenced this pull request Dec 2, 2024

Motivation: we probably want -O3, but we're not completely committed to it yet (see
#929)

I've added a C++ unit test. I've also tested this from run.py with
```
aie_compilation_flags=["--iree-amd-aie-additional-peano-opt-flags=\"-O3\""],
```
 
and it works, but I don't want this e2e-style test to run in CI for now.

Also in PR:

-- use anonymous namespace for functions in the .h file
-- minor refactorings

---------

Co-authored-by: Jorn Tuyls <[email protected]>
@newling (Contributor, Author) commented Dec 5, 2024

Closing, please see #950

@newling newling closed this Dec 5, 2024