Bumping llvm-opt from `O2` to `O3` improves perf with function outlining so much that with `O3`, perf with function outlining is better than without! Hopefully CI doesn't report other workloads going OOM.

**Update: Summary of findings (WIP)**
We saw a ~2x decline in performance (time to run matmul) in our 3 benchmark sizes ((M, N, K) in [(512, 512, 4096), (512, 4096, 512), (4096, 512, 512)]). This is illustrated in PR #919, where for each of the (M, N, K) dimensions there is a run with and without function outlining. For example, for (4096, 512, 512) the time to run increases from 18 [ms] to 40 [ms] when outlining is enabled.
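As a side note, the three benchmark shapes perform the same number of multiply-accumulates, so their times are directly comparable (this is just the standard M*N*K MAC count for a matmul, computed here as a quick check):

```python
# Sanity check: all three benchmark shapes do the same amount of work,
# so timing differences reflect compilation choices, not problem size.
shapes = [(512, 512, 4096), (512, 4096, 512), (4096, 512, 512)]

for m, n, k in shapes:
    macs = m * n * k  # multiply-accumulates for C[M,N] = A[M,K] @ B[K,N]
    print(f"(M, N, K) = ({m}, {n}, {k}): {macs / 1e9:.3f} G-MACs")
```

All three shapes come out to the same 2^30 MACs.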
We initially thought that function call overhead on AIE is for some reason very large, but then Jorn pointed out that the ukernel approach also uses function calls, and performance there is decent -- indeed with ukernels the time here is 11 [ms] (see #919). There are presumably as many function calls with ukernels as with function outlining, so it can't just be function call overhead causing the slowdown.
As an experiment I tried increasing the llvm-opt optimization level from -O2 to -O3. With outlining enabled this results in some function calls being inlined (of course this isn't possible with ukernels).
To check this, I first count the number of calls to the outlined function with

```shell
grep -r "call void @generic_matmul_0_outlined" file.name.ll | wc -l
```

Before llvm-opt there are 32 calls (this is the input.ll). After llvm-opt with `O2` there are 64 calls. With `O3` there are 16 calls.

Doing the same for the matmul instruction, i.e. counting appearances of `llvm.aie2.bf.mac16.conf`: before llvm-opt there are 2, after `O2` there are 33, and after `O3` there are 545. As a reference, without outlining there are 1024 appearances, for both O2 and O3 (O2 and O3 produce exactly the same code without outlining).

The sizes of the elf files also differ between O2 and O3. The file `core_1_2.elf` is 9.8 KB with O2 and 13 KB with O3. As a reference, without outlining enabled it is 16 KB at both O2 and O3.

**Summary for (512, 4096, 512) matmul** (see https://github.com/nod-ai/iree-amd-aie/actions/runs/12015758500/job/33494886925)
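The grep-based counting above can also be scripted if you want to tally several patterns across opt levels at once. A minimal sketch -- the IR below is a synthetic stand-in for file.name.ll (not real compiler output), while `generic_matmul_0_outlined` and `llvm.aie2.bf.mac16.conf` are the actual names from this workload:

```python
# Count how often a pattern appears in an LLVM IR dump, mirroring the
# `grep ... | wc -l` counting used above. The IR here is a synthetic
# stand-in with 2 outlined call sites and 1 intrinsic use.
ir = """
define void @core_1_2() {
  call void @generic_matmul_0_outlined()
  call void @generic_matmul_0_outlined()
  ret void
}
define void @generic_matmul_0_outlined() {
  call void @llvm.aie2.bf.mac16.conf()
  ret void
}
"""

def count(pattern: str, text: str) -> int:
    # Like grep | wc -l, this counts matching *lines*; each call site
    # in the IR dump is on its own line.
    return sum(pattern in line for line in text.splitlines())

print(count("call void @generic_matmul_0_outlined", ir))  # 2 call sites
print(count("llvm.aie2.bf.mac16.conf", ir))               # 1 intrinsic use
```

Note that the pattern `call void @generic_matmul_0_outlined` deliberately excludes the `define` line, so only actual call sites are counted.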
- Current ToM:
- This PR:
So O3 and O2 are identical if there is no function outlining, but O3 helps performance if there is outlining.