[AMD] release/3.2.x AMD perf cherry picks #5191

jataylo · 2024-11-19T13:55:02Z

Cherry pick list:

For 16-bit float operands of tt::AtomicRMWOp, construct only one LLVM::AtomicRMWOp that operates on a vector of elements. This approach allows packed intrinsics to be generated, processing 2 elements at once. Added a lit test for the f16 vectorized case. (cherry picked from commit 78c8054)
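To illustrate why packing helps, here is a minimal host-side sketch in Python. It is not the Triton lowering: it only models two f16 values sharing one 32-bit word, so that one read-modify-write updates both. All function names (`pack_f16x2`, `unpack_f16x2`, `atomic_add_f16_vectorized`) are hypothetical and do not appear in the PR.

```python
import struct


def pack_f16x2(lo, hi):
    """Pack two half-precision values into one 32-bit word (lo in the low bits)."""
    return struct.unpack("<I", struct.pack("<2e", lo, hi))[0]


def unpack_f16x2(word):
    """Split a 32-bit word back into its two half-precision values."""
    return struct.unpack("<2e", struct.pack("<I", word))


def packed_fadd(word_a, word_b):
    """Element-wise add of two packed f16 pairs, modelling a packed intrinsic."""
    a_lo, a_hi = unpack_f16x2(word_a)
    b_lo, b_hi = unpack_f16x2(word_b)
    return pack_f16x2(a_lo + b_lo, a_hi + b_hi)


def atomic_add_f16_vectorized(memory, values):
    """Apply `values` (flat list of f16, even length) to `memory` (packed words).

    Returns the number of read-modify-write operations issued, which is
    half the number of elements because each operation covers a pair.
    """
    assert len(values) % 2 == 0
    ops = 0
    for i in range(0, len(values), 2):
        word = pack_f16x2(values[i], values[i + 1])
        memory[i // 2] = packed_fadd(memory[i // 2], word)
        ops += 1
    return ops
```

With four f16 elements the sketch issues two operations instead of four, which is the effect the vectorized lowering aims for.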

(cherry picked from commit 86a2ac7)

…4935) This PR adds more restrictions on when the sched-load optimizations are applied and un-reverts triton-lang#4823. The optimization is applied only when all of the following are satisfied:

1. pureMatmulProblem, i.e. 1 `tt.dot` in the main loop
2. two `tt.load`s in the main loop
3. 2nd `tt.load` is ahead of the `tt.dot`
4. 1st user of 2nd `tt.load` is after the `tt.dot`
5. tile size is large enough, i.e. nonKDim >= 128 and kDim >= 64

(cherry picked from commit 4f6f768)
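The five conditions above can be read as a single predicate. The following Python sketch assumes a simplified model in which the main loop body is a flat list of op names in program order. `LoopInfo` and `should_apply_sched_load` are hypothetical names for illustration, not the pass's actual data structures.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LoopInfo:
    ops: List[str]                  # op names in program order
    first_user_of_second_load: int  # index in `ops` of the 2nd load's first user
    non_k_dim: int
    k_dim: int


def should_apply_sched_load(loop):
    """Return True only when all five conditions from the list hold."""
    dots = [i for i, op in enumerate(loop.ops) if op == "tt.dot"]
    loads = [i for i, op in enumerate(loop.ops) if op == "tt.load"]
    if len(dots) != 1:   # 1. exactly one tt.dot in the main loop
        return False
    if len(loads) != 2:  # 2. exactly two tt.load ops in the main loop
        return False
    dot, second_load = dots[0], loads[1]
    if second_load >= dot:  # 3. the 2nd load is ahead of the dot
        return False
    if loop.first_user_of_second_load <= dot:  # 4. its first user is after the dot
        return False
    # 5. the tile is large enough
    return loop.non_k_dim >= 128 and loop.k_dim >= 64
```

Writing the checks as early returns mirrors the "all of the following" wording: any single failed condition disables the optimization.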

…n-lang#4991) Specifically, it fixes problems when `srcLayout` and `dstLayout` have a different number of registers but the same number of non-free registers. The problem is solved by padding free registers to either `srcLayout` or `dstLayout`, but this can be improved by fixing the `invertAndCompose` function. (cherry picked from commit 15c5e55)

…triton-lang#4951) This PR removes the legacy `isMmaToDotShortcut` and its associated shortcut conversion. (cherry picked from commit 1d5fdfe)

This commit removes special cases for MFMA -> Dot Operand LDS shortcuts. These are now supported by the common linear layout infrastructure. No tests are added; mfma-shortcut.mlir already tests this. (cherry picked from commit 69f656c)

This commit adds support for warp-level reduction with DPP instructions, which can improve performance. See https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/ (cherry picked from commit 21119e3)
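As a rough illustration of a warp-level reduction built from cross-lane exchanges, the Python sketch below simulates a butterfly pattern on the host. It is an assumption-laden model: it does not use the specific DPP controls that the actual lowering emits, and `warp_reduce` is a hypothetical name.

```python
def warp_reduce(values, combine):
    """Simulate a cross-lane butterfly reduction over one warp.

    `values` holds one element per lane; its length must be a power of two.
    At each step every lane combines its value with the value held by the
    lane whose index differs in one bit, so after log2(n) steps all lanes
    hold the full reduction.
    """
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "lane count must be a power of two"
    lanes = list(values)
    offset = n // 2
    while offset >= 1:
        # All lanes exchange simultaneously, so build the new state from the old.
        lanes = [combine(lanes[i], lanes[i ^ offset]) for i in range(n)]
        offset //= 2
    return lanes
```

The point of the sketch is that the exchange happens directly between lanes, in log2(n) steps, rather than through a round trip to shared memory.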

TritonAMDGPUTransforms now depends on it. (cherry picked from commit 0b443ce)

In the case of unpaired f16 elements, utilize DPP instructions to accelerate atomics. The algorithm for lowering `tt::atomicRmwOp(%ptr, %val, %mask)` is:

0. Group threads into pairs. The master thread is the one with `tid % 2 == 0`.
1. All threads send `%val` to the `(tid - 1)` thread via `dppUpdateOp shl`, so every master receives the value from its secondary thread.
2. Take the parity in the `%mask` value into account and build CF structures accordingly.
3. Generate `llvm::atomicRmwOp` in the threads enabled by the `%mask` value.
4. All threads send the result of the generated operation to the `(tid + 1)` thread via `dppUpdateOp shl`, so every secondary thread also receives its result.

The DPP approach gives a ~5% perf improvement, so it is used when the target arch supports DPP.

Signed-off-by: Ilya Veselov <[email protected]> (cherry picked from commit bab3470)
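The lowering steps can be modelled on the host with a small Python simulation. This is a sketch under several assumptions that the commit message does not state: lane `t` targets `memory[t]`, the operation is an add, a master issues one combined operation only when both lanes of its pair are enabled, and otherwise each enabled lane issues its own. The name `paired_atomic_add` is hypothetical.

```python
def paired_atomic_add(memory, vals, mask):
    """Simulate pairing lanes so one master lane updates two adjacent elements.

    Returns (old, rmw_ops): the pre-update value seen by each lane (None for
    lanes disabled by `mask`) and the number of read-modify-write operations.
    """
    n = len(vals)
    assert n % 2 == 0 and len(mask) == n and len(memory) >= n
    # Step 1: every lane sends its value to lane tid - 1, so each master
    # (even tid) now holds its secondary's value.
    received = [vals[t + 1] if t + 1 < n else 0.0 for t in range(n)]
    old = [None] * n
    rmw_ops = 0
    for master in range(0, n, 2):  # Step 0: pairs are (master, master + 1)
        secondary = master + 1
        # Step 2: branch on which lanes of the pair are enabled (assumed model).
        if mask[master] and mask[secondary]:
            # Step 3: one combined operation covers both elements.
            old[master], staged = memory[master], memory[secondary]
            memory[master] += vals[master]
            memory[secondary] += received[master]
            rmw_ops += 1
            # Step 4: the master forwards the secondary's result to lane tid + 1.
            old[secondary] = staged
        else:
            for lane in (master, secondary):
                if mask[lane]:
                    old[lane] = memory[lane]
                    memory[lane] += vals[lane]
                    rmw_ops += 1
    return old, rmw_ops
```

For four lanes with mask `[True, True, False, True]`, the first pair is served by one combined operation and the second pair by one scalar operation, for two operations in total instead of three.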

joviliast and others added 9 commits November 18, 2024 16:56

[AMD] Restructure ReorderInstructions pass (triton-lang#4998)

bbd72b7

(cherry picked from commit 86a2ac7)

[BACKEND] Replace isMmaToDotShortcut with linear layout based logic (…

4499262

…triton-lang#4951) This PR removes the legacy `isMmaToDotShortcut` and its associated shortcut conversion. (cherry picked from commit 1d5fdfe)

[AMD] Support warp-level reduction with DPP (triton-lang#5019)

5014ca9

This commit adds support for warp-level reduction with DPP instructions, which can improve performance. See https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/ (cherry picked from commit 21119e3)

[AMD] Add missing dependency to TritonAMDGPUIR (triton-lang#5053)

2527a67

TritonAMDGPUTransforms now depends on it. (cherry picked from commit 0b443ce)

jataylo requested review from antiagainst, zhanglx13, Jokeren and ptillet as code owners November 19, 2024 13:55

jataylo marked this pull request as draft November 19, 2024 13:56
