Replies: 5 comments 7 replies
-
Makes sense to me! ukernels are by nature doing what our codegen flow should be doing: deduplicating early and only expanding at the end. We want as few flow.executables as possible through the pipeline, and then fork out only where we know we want multiple variants. The equivalent would be if clang, while parsing the AST, expanded anything memcpy-like into a loop over the static range being copied at every call site, versus what actually happens: memcpy is kept as a compiler intrinsic until very near codegen. The expanded version means a lot more IR to move around, a lot fewer opportunities to fold, etc.

Today we generate 1000 flow.executables that may only represent 50 unique code paths and pay the cost of compiling them all through the stack. The thing that saves us early on is that we can parallelize (so wall time is reasonable the more cores you have to soak up the CPU time), but we also generate a lot more code for our currently serialized LLVM code generation step. There were some experiments done to try to make our executables more generic (pull out some of the static shapes, etc.) to help with this, but at the time at least too much of the pipeline required static shapes to function.

What I'd like to see in the future is an approach where we start to think about our pipeline in terms of ukernels and then just have a decision as to whether to source the ukernel contents from external files or internal codegen. This keeps our IR narrow and less sensitive to static shapes through the bulk of the pipeline, and forks out at exactly the same point ukernels would. We can even do fun things like a pass that gathers all required generated ukernels from all executables, goes parallel to generate them all (possibly even to bitcode, but maybe not), and links them just like our hand-authored ukernels. Then we have nearly the same code path for both approaches, which will ensure consistent compile performance and make mixing/matching require much less code and fewer rarely-travelled paths.
So maybe a challenge: take an external ukernel function declaration and write a pass that defines it with something we would codegen instead (linalg on buffers, whatever). If we can get good runtime performance like that, it opens a lot of doors. We can then amortize that so we don't have 1000 identical ukernels generated for 1000 users, etc., and it should end up in the wall-time noise.
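To make the challenge concrete, here is a minimal scalar sketch of the kind of body such a pass could generate for a declared-but-undefined mmt4d tile function. The function name, layout, and tile sizes are illustrative assumptions, not IREE's actual ukernel ABI; the point is only that the generated definition is a plain loop nest equivalent to lowering linalg.mmt4d on buffers for one tile.

```c
#include <stdint.h>

/* Illustrative tile sizes (assumptions, not IREE defaults). */
enum { M0 = 4, N0 = 4, K0 = 1 };

/* Hypothetical generated definition for an external mmt4d tile-function
 * declaration: accumulate one M0xN0 output tile over K reduction steps,
 * with LHS packed as [K][M0][K0] and RHS packed as [K][N0][K0]. */
void mmt4d_tile_f32(float *restrict out,       /* [M0][N0] accumulator */
                    const float *restrict lhs, /* [K][M0][K0] */
                    const float *restrict rhs, /* [K][N0][K0] */
                    int64_t K) {
  for (int64_t k = 0; k < K; ++k)
    for (int m = 0; m < M0; ++m)
      for (int n = 0; n < N0; ++n)
        for (int k0 = 0; k0 < K0; ++k0)
          out[m * N0 + n] += lhs[(k * M0 + m) * K0 + k0] *
                             rhs[(k * N0 + n) * K0 + k0];
}
```

Whether a body like this (or its Linalg equivalent) reaches hand-written ukernel performance is exactly the experiment being proposed.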
-
Ooh neat. So basically we just switch wholesale to ukernels with no "codegen fallback" as envisioned in #15784. But then we autogenerate any missing ukernel tile function with an MLIR code generator that emits Linalg source for any mmt4d tile function, which we compile with iree-compile. The nontrivial part is the ABI: iree-compile currently generates dispatch functions, which don't have the same ABI as ukernel tile functions, but that could be overcome.
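The ABI gap can be sketched roughly as follows. Both signatures below are hypothetical stand-ins (neither is IREE's real dispatch or ukernel ABI): a dispatch-style entry point sees whole buffers plus workgroup coordinates, while a tile function sees one tile and a K extent, so a thin generated adapter could bridge the two.

```c
#include <stdint.h>

enum { M0 = 4, N0 = 4 }; /* illustrative tile sizes (assumptions) */

/* Hypothetical ukernel tile-function ABI: one M0xN0 tile, K extent. */
typedef void (*tile_fn_t)(float *out, const float *lhs, const float *rhs,
                          int64_t K);

/* Trivial example tile function for the usage test below: records K. */
static void record_k_tile(float *out, const float *lhs, const float *rhs,
                          int64_t K) {
  (void)lhs;
  (void)rhs;
  out[0] = (float)K;
}

/* Hypothetical dispatch-style entry: sees whole packed buffers and
 * workgroup coordinates; each (wg_m, wg_n) owns one M0xN0 tile.
 * Offsets the buffer pointers down to the tile and calls through. */
void dispatch_mmt4d(float *out_buf, const float *lhs_buf,
                    const float *rhs_buf, int64_t M1, int64_t N1, int64_t K,
                    int64_t wg_m, int64_t wg_n, tile_fn_t tile) {
  (void)M1;
  tile(out_buf + (wg_m * N1 + wg_n) * M0 * N0, /* [M1][N1][M0][N0] */
       lhs_buf + wg_m * M0 * K,                /* [M1][K][M0] with K0=1 */
       rhs_buf + wg_n * N0 * K,                /* [N1][K][N0] with K0=1 */
       K);
}
```

The adapter itself is mechanical; the real work is agreeing on a stable tile-function signature that both iree-compile output and hand-authored ukernels can satisfy.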
-
We've also seen that DT+UK can cut the peak memory usage significantly compared to DT alone, depending on the tile size used. I've only run these experiments on internal models so can't share any numbers.
-
Interesting findings! It would be great if we could understand the details. Do you know which passes are consuming most of the time? Currently the DT and DT+UK paths are not very comparable: DT is not using i8mm and it's using more aggressive unrolling than DT+UK. We know that most of the compilation time goes to LLVM's ISel, which is very sensitive to unrolling. We should have i8mm support in DT hopefully by the end of this week, and then both paths should be more comparable, I guess. We also keep the mmt4d in its "compact" representation.

Regarding memory usage, there shouldn't be any differences once the DT path supports i8mm. Currently, given the more aggressive unrolling, major memory differences may come from larger padding on the outer dimensions (M, N), given that the innermost dimension that we have in some models is huge (K).
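The padding effect above is just round-up arithmetic; a small sketch (tile sizes here are illustrative assumptions, not IREE's actual defaults) shows how a larger M0 chosen for more aggressive unrolling inflates the padded footprint of a skinny operand even though K is padded the same either way:

```c
#include <stdint.h>

/* Round x up to the next multiple of tile. */
static int64_t round_up(int64_t x, int64_t tile) {
  return (x + tile - 1) / tile * tile;
}

/* Padded element count of an MxK operand data-tiled by (m0, k0). */
int64_t padded_elems(int64_t M, int64_t K, int64_t m0, int64_t k0) {
  return round_up(M, m0) * round_up(K, k0);
}
```

For a matvec-like LHS with M=1 and a huge K=4096, an m0 of 16 pads it to 16x its logical size, while an m0 of 4 pads it to only 4x, which is the kind of peak-memory gap tile-size choice can produce.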
-
@bjacob What is the target CPU in your reports? What Diego said is aligned with my understanding: we are using more aggressive unrolling in pure DT, and we are not doing well in LLVM's ISel. The i8mm support should help the Arm cases, but not x86. The ISel issue is fixable on the x86 side as well: we could introduce a VNNI dialect and implement lowering to the VNNI ops. The ISel issue is likely what's happening in
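For reference, the operation such a VNNI lowering would target is vpdpbusd. A scalar reference for one 32-bit lane (my sketch, shown only to pin down the semantics the dialect op would carry) is:

```c
#include <stdint.h>

/* Scalar reference for one lane of x86 AVX-VNNI vpdpbusd: multiply 4
 * unsigned 8-bit values by 4 corresponding signed 8-bit values and add
 * the sum of products into a 32-bit accumulator. vpdpbusd does not
 * saturate; vpdpbusds is the saturating variant. */
int32_t vpdpbusd_lane(int32_t acc, const uint8_t a[4], const int8_t b[4]) {
  for (int i = 0; i < 4; ++i)
    acc += (int32_t)a[i] * (int32_t)b[i];
  return acc;
}
```

Presumably, emitting ops with this exact u8 x s8 -> s32 shape (rather than scalar unrolled IR) would make it much easier for the backend to select the instruction directly.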
-
In the discussion of how to ensure that enabling ukernels by default doesn't regress anything (which led to the discussion of how to fall back to codegen when a ukernel would perform worse, #15784), I hadn't realized we had a blind spot: the compilation-times angle. This is actually one of the areas where ukernels help the most, and whichever solution we retain for #15784 shouldn't regress it.
Some numbers on my AMD 7950X3D show that when compiling a typical LLM whose parameters are not baked into the code (e.g. they live in a separate parameter file, or we simulate that with elision), enabling ukernels tends to save 8 seconds of compilation time, consistently across two different models where this means a very different speedup ratio: