Replies: 5 comments 7 replies
-
Makes sense to me! ukernels are by nature doing what our codegen flow should be doing: deduplicating early and only expanding at the end. We want as few flow.executables as possible through the pipeline, and then fork out only where we know we want multiple variants. The equivalent would be if clang, while parsing the AST, expanded anything memcpy-like into a loop over the static range being copied at every call site, versus what actually happens: memcpy is kept as a compiler intrinsic until very near codegen. The expanded version means a lot more IR to move around, a lot fewer opportunities to fold, etc.

Today we generate 1000 flow.executables that may only represent 50 unique code paths and pay the cost of compiling them all through the stack. The thing that saves us early on is that we can parallelize (so wall time is reasonable the more cores you have to soak up the CPU time), but we also generate a lot more code for our currently serialized LLVM code generation step. There were some experiments done to try to make our executables more generic (pull out some of the static shapes, etc.) to help with this, but at the time at least too much of the pipeline required static shapes to function.

What I'd like to see in the future is an approach where we start to think about our pipeline in terms of ukernels and then just have a decision as to whether to source the ukernel contents from external files or internal codegen. This keeps our IR narrow and less sensitive to static shapes through the bulk of the pipeline, and forks out at exactly the same point ukernels would. We can even do fun things like a pass that gathers all required generated ukernels from all executables, goes parallel to generate them all (possibly even to bitcode, but maybe not), and links them just like our hand-authored ukernels. Then we have nearly the same code path for both approaches, which will ensure consistent compile performance and make mixing/matching require much less code and fewer rarely-travelled paths.
So maybe a challenge: take an external ukernel function declaration and write a pass that defines it with something we would codegen instead (linalg on buffers, whatever). If we can get good runtime performance like that, it opens a lot of doors. We can then amortize that so we don't have 1000 identical ukernels generated for 1000 users, etc., and it should end up in the wall-time noise.
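To make the challenge concrete, here is a minimal scalar sketch of the kind of body such a pass could generate for a declared-but-undefined mmt4d tile function. The function name, layout, and tile sizes are illustrative assumptions, not IREE's actual ukernel ABI; the point is only that the generated definition is a plain loop nest equivalent to lowering linalg.mmt4d on buffers for one tile.

```c
#include <stdint.h>

/* Illustrative tile sizes (assumptions, not IREE defaults). */
enum { M0 = 4, N0 = 4, K0 = 1 };

/* Hypothetical generated definition for an external mmt4d tile-function
 * declaration: accumulate one M0xN0 output tile over K reduction steps,
 * with LHS packed as [K][M0][K0] and RHS packed as [K][N0][K0]. */
void mmt4d_tile_f32(float *restrict out,       /* [M0][N0] accumulator */
                    const float *restrict lhs, /* [K][M0][K0] */
                    const float *restrict rhs, /* [K][N0][K0] */
                    int64_t K) {
  for (int64_t k = 0; k < K; ++k)
    for (int m = 0; m < M0; ++m)
      for (int n = 0; n < N0; ++n)
        for (int k0 = 0; k0 < K0; ++k0)
          out[m * N0 + n] += lhs[(k * M0 + m) * K0 + k0] *
                             rhs[(k * N0 + n) * K0 + k0];
}
```

Whether a body like this (or its Linalg equivalent) reaches hand-written ukernel performance is exactly the experiment being proposed.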
-
Ooh neat. So basically we just switch wholesale to ukernels with no "codegen fallback" as envisioned in #15784. But then we autogenerate any missing ukernel tile function with an MLIR code generator that emits Linalg source for any mmt4d tile function, which we compile with iree-compile. The nontrivial part is the ABI: iree-compile currently generates dispatch functions, which don't have the same ABI as ukernel tile functions, but that could be overcome.
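The ABI gap can be sketched roughly as follows. Both signatures below are hypothetical stand-ins (neither is IREE's real dispatch or ukernel ABI): a dispatch-style entry point sees whole buffers plus workgroup coordinates, while a tile function sees one tile and a K extent, so a thin generated adapter could bridge the two.

```c
#include <stdint.h>

enum { M0 = 4, N0 = 4 }; /* illustrative tile sizes (assumptions) */

/* Hypothetical ukernel tile-function ABI: one M0xN0 tile, K extent. */
typedef void (*tile_fn_t)(float *out, const float *lhs, const float *rhs,
                          int64_t K);

/* Trivial example tile function for the usage test below: records K. */
static void record_k_tile(float *out, const float *lhs, const float *rhs,
                          int64_t K) {
  (void)lhs;
  (void)rhs;
  out[0] = (float)K;
}

/* Hypothetical dispatch-style entry: sees whole packed buffers and
 * workgroup coordinates; each (wg_m, wg_n) owns one M0xN0 tile.
 * Offsets the buffer pointers down to the tile and calls through. */
void dispatch_mmt4d(float *out_buf, const float *lhs_buf,
                    const float *rhs_buf, int64_t M1, int64_t N1, int64_t K,
                    int64_t wg_m, int64_t wg_n, tile_fn_t tile) {
  (void)M1;
  tile(out_buf + (wg_m * N1 + wg_n) * M0 * N0, /* [M1][N1][M0][N0] */
       lhs_buf + wg_m * M0 * K,                /* [M1][K][M0] with K0=1 */
       rhs_buf + wg_n * N0 * K,                /* [N1][K][N0] with K0=1 */
       K);
}
```

The adapter itself is mechanical; the real work is agreeing on a stable tile-function signature that both iree-compile output and hand-authored ukernels can satisfy.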
-
We've also seen that DT+UK can cut the peak memory usage significantly compared to DT alone, depending on the tile size used. I've only run these experiments on internal models so can't share any numbers.
-
Interesting findings! It would be great if we could understand the details. Do you know which passes are consuming most of the time? Currently the DT and DT+UK paths are not very comparable: DT is not using i8mm and it's using more aggressive unrolling than DT+UK. We know that most of the compilation time goes to LLVM's ISel, which is very sensitive to unrolling. We should have i8mm support in DT hopefully by the end of this week, and then both paths should be more comparable, I guess. We also keep the mmt4d in its "compact" representation.

Regarding memory usage, there shouldn't be any differences once the DT path supports i8mm. Currently, given the more aggressive unrolling, major memory differences may come from larger padding on the outer dimensions (M, N), given that the innermost dimension that we have in some models is huge (K).
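The padding effect above is just round-up arithmetic; a small sketch (tile sizes here are illustrative assumptions, not IREE's actual defaults) shows how a larger M0 chosen for more aggressive unrolling inflates the padded footprint of a skinny operand even though K is padded the same either way:

```c
#include <stdint.h>

/* Round x up to the next multiple of tile. */
static int64_t round_up(int64_t x, int64_t tile) {
  return (x + tile - 1) / tile * tile;
}

/* Padded element count of an MxK operand data-tiled by (m0, k0). */
int64_t padded_elems(int64_t M, int64_t K, int64_t m0, int64_t k0) {
  return round_up(M, m0) * round_up(K, k0);
}
```

For a matvec-like LHS with M=1 and a huge K=4096, an m0 of 16 pads it to 16x its logical size, while an m0 of 4 pads it to only 4x, which is the kind of peak-memory gap tile-size choice can produce.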
-
@bjacob What is the target CPU in your reports? What Diego said is aligned with my understanding: we are using more aggressive unrolling in pure DT, and we are not doing well in LLVM's ISel. The i8mm support should help the Arm cases, but not x86. The ISel issue is fixable on the x86 side as well: we could introduce a VNNI dialect and implement lowering to the VNNI ops. The ISel issue is likely what's happening in
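For reference, the operation such a VNNI lowering would target is vpdpbusd. A scalar reference for one 32-bit lane (my sketch, shown only to pin down the semantics the dialect op would carry) is:

```c
#include <stdint.h>

/* Scalar reference for one lane of x86 AVX-VNNI vpdpbusd: multiply 4
 * unsigned 8-bit values by 4 corresponding signed 8-bit values and add
 * the sum of products into a 32-bit accumulator. vpdpbusd does not
 * saturate; vpdpbusds is the saturating variant. */
int32_t vpdpbusd_lane(int32_t acc, const uint8_t a[4], const int8_t b[4]) {
  for (int i = 0; i < 4; ++i)
    acc += (int32_t)a[i] * (int32_t)b[i];
  return acc;
}
```

Presumably, emitting ops with this exact u8 x s8 -> s32 shape (rather than scalar unrolled IR) would make it much easier for the backend to select the instruction directly.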
-
In the discussion of how to ensure that enabling ukernels by default doesn't regress anything (which led to the discussion of how to fall back to codegen when a ukernel would perform worse, #15784), I hadn't realized we had a blind spot: the compilation-times angle. This is actually one of the areas where ukernels help the most, and whichever solution we retain for #15784 shouldn't regress it.
Some numbers on my AMD 7950X3D show that when compiling a typical LLM whose parameters are not baked into the code (e.g. they live in a separate parameter file, or we simulate that with elision), enabling ukernels tends to save 8 seconds of compilation time, consistently across two different models where this means a very different speedup ratio: