
add new branch heyi_cast_transpose and update cast_transpose optimizations #89

Open

eliotwang wants to merge 9 commits into dev from heyi_cast_transpose

Conversation


@eliotwang eliotwang commented Nov 4, 2024

Description

Optimize the cast_transpose kernel by:

  1. refactoring the kernel to support more flexible parameter tuning;
  2. using assembly instructions to optimize the conversion to fp8 output types;
  3. analyzing performance on different input shapes and providing empirical parameter configurations, ensuring that the HIP kernel outperforms the Triton kernel in all tests.

Steps to run:

docker pull rocm/pytorch:latest

docker run -it --network=host -v /home/yigex/heyi:/workspace --device=/dev/kfd --device=/dev/dri --group-add video --security-opt seccomp=unconfined --ipc=host --name heyi_te_wx rocm/pytorch:latest

mkdir heyi && cd heyi

git clone -b heyi_cast_transpose --recursive https://github.com/eliotwang/TransformerEngine.git

cd TransformerEngine

export NVTE_FRAMEWORK=pytorch
export NVTE_ROCM_ARCH=gfx942

pip install .

mkdir tests/cpp/build && cd tests/cpp/build

cmake .. && make

./operator/test_operator

[ RUN ] OperatorTest/CTTestSuite.TestCastTranspose/bfloat16Xfloat8e5m2X2048X12288
GPU execution time: 54.7456 us
[ OK ] OperatorTest/CTTestSuite.TestCastTranspose/bfloat16Xfloat8e5m2X2048X12288 (2854 ms)
[ RUN ] OperatorTest/CTTestSuite.TestCastTranspose/bfloat16Xfloat8e5m2X256X65536
GPU execution time: 39.4748 us
[ OK ] OperatorTest/CTTestSuite.TestCastTranspose/bfloat16Xfloat8e5m2X256X65536 (1690 ms)

Performance:

(Performance comparison attached as an image in the original PR.)


nvte_cast_transpose(input.data(), output_c.data(), output_t.data(), 0);

int warm_iter = 3;

Collaborator

I suggest that we keep this test file as is. This file was just for unit testing. We can measure performance when running it with rocprof.

const size_t num_blocks = kernel_config.num_blocks;

size_t load_size;
size_t store_size;

Collaborator

We would like to keep the original NVTE CUDA code while adding the new code for ROCm. You can use __HIP_PLATFORM_AMD__ to guard the code. Here is an example: https://github.com/ROCm/TransformerEngine/blob/dev/transformer_engine/common/fused_softmax/scaled_masked_softmax.cu#L72-L80
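
For illustration, a minimal sketch of that guard pattern; the sizes shown are placeholders, not the PR's actual values:

// Sketch: keep the upstream NVTE CUDA defaults untouched and add the
// ROCm-specific values behind __HIP_PLATFORM_AMD__ (numbers are illustrative).
size_t load_size;
size_t store_size;
#ifdef __HIP_PLATFORM_AMD__
  // ROCm path added by this PR
  load_size  = 16;
  store_size = 8;
#else
  // Original upstream NVTE CUDA path, left unchanged
  load_size  = 8;
  store_size = 8;
#endif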

Contributor

+1

Lines 280 to 286 should also be guarded by #ifdef __HIP_PLATFORM_AMD__.

Basically, after removing the ROCm/AMD-specific ifdefs, we would like our repo to be exactly the same as the upstream NVTE.

constexpr size_t block_size = __BLOCK_SIZE__;

} // namespace

__device__ OType convert_from_fp32(float v) {

Author

This function appears to be more complete. When compiling with the gfx940 macro enabled, it results in an error related to function overloading

@eliotwang
Author

  1. Revert tests/cpp/operator/test_cast_transpose.cu to its original version;
  2. Add __HIP_PLATFORM_AMD__ to guard modified code in transformer_engine/common/transpose/cast_transpose.cu;
  3. Modify the hip_f8 funcs in hip_float8.h to ensure the code runs correctly on MI308X;
  4. Adopt the hip_f8 funcs in hip_float8.h for precision conversion and remove the custom conversion functions.

@BruceXcluding BruceXcluding requested review from BruceXcluding and wangye805 and removed request for BruceXcluding November 12, 2024 04:25
@eliotwang
Author

Fix the cast_transpose issue where the tensor width and height are not divisible by the tile size. Passes pytest, including test_float8tensor.py, test_numerics.py, fused_attn/test_fused_attn.py, and test_sanity.py.

…version, add __HIP_PLATFORM_AMD__ to guard modified code in transformer_engine/common/transpose/cast_transpose.cu, use hip_float8 implementation and remove custom conversion functions
@eliotwang eliotwang force-pushed the heyi_cast_transpose branch from 0acaf23 to 54d606a on January 7, 2025 at 13:42

Contributor

@wangye805 wangye805 left a comment

Sorry for the late review.

I was only able to review the cast_transpose part, not the cast_transpose_fusion part. I gave some comments on the coding style, but I'm still quite confused about the big picture of how wpt_size and iter_size work.


bool do_general_config = true;

#ifdef __HIP_PLATFORM_AMD__
if((std::is_same<OutputType, fp8e5m2>::value) || (std::is_same<OutputType, fp8e4m3>::value)){

Contributor

can use if constexpr since OutputType is determined at compile time
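
For illustration, a minimal sketch of that suggestion, mirroring the quoted condition (not the PR's actual code):

#ifdef __HIP_PLATFORM_AMD__
  // Evaluated at compile time, so the discarded branch is never instantiated.
  if constexpr (std::is_same<OutputType, fp8e5m2>::value ||
                std::is_same<OutputType, fp8e4m3>::value) {
    // FP8 output: take the tuned ROCm configuration path
  } else {
    // Other output types: keep the general configuration
  }
#endif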

auto get_n_tiles = [=] (size_t load_size, size_t store_size) -> int {
constexpr size_t threads_per_warp = static_cast<size_t>(THREADS_PER_WARP);
size_t nvec_in = load_size / sizeof(InputType);
size_t nvec_out = store_size / sizeof(OutputType);

Contributor

Actually the sizeof(OutputType) is already fixed under this if condition
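
In other words, since fp8e5m2 and fp8e4m3 are both one byte, the division is a no-op inside this branch; a minimal sketch of the simplification:

// Inside the FP8-only branch, sizeof(OutputType) == 1, so:
size_t nvec_out = store_size;  // equivalent to store_size / sizeof(OutputType) here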

size_t nvec_out = store_size / sizeof(OutputType);
size_t n_tiles = DIVUP(row_length, nvec_in * threads_per_warp) *
DIVUP(num_rows, nvec_out * threads_per_warp);
return n_tiles;

Contributor

Should be the same indent as line 301

auto get_n_blocks = [=] (size_t n_tiles, size_t cast_transpose_num_threads, size_t wpt_size) -> int {
size_t n_warps_per_block = cast_transpose_num_threads / THREADS_PER_WARP;
size_t n_blocks = DIVUP(n_tiles * wpt_size, n_warps_per_block);
return n_blocks;

Contributor

indent issue

// Number of CUDA blocks
num_blocks = (row_length / row_tile_elements) * (num_rows / col_tile_elements);
rtc_block_size = THREADS_PER_WARP * wpt_size;
do_general_config =!(row_length % row_tile_elements == 0 && num_rows % col_tile_elements == 0);

Contributor

So do_general_config is just !aligned()?

Author

Under the AMD framework, if the output type is FP8 and the current tile_size meets certain conditions, the optimized configuration will be used; otherwise, it will fall under the do_general_config case.
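
Restated with the names from the quoted hunk, the condition amounts to the following (a sketch, not the PR's exact code):

// Fall back to the general configuration whenever the shape does not divide
// evenly into tiles; otherwise use the optimized FP8 configuration.
const bool aligned = (row_length % row_tile_elements == 0) &&
                     (num_rows   % col_tile_elements == 0);
do_general_config = !aligned;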

kernel_configs.emplace_back(row_length, num_rows, itype_size, otype_size, load_size,
store_size, sm_count);
};
add_config(8, 8);

Contributor

Regarding the big picture, in my opinion, you are trying to introduce a wpt_size and an iter_size so that wpt_size * iter_size <= THREADS_PER_WARP, in contrast to warps_per_tile * num_iterations = THREADS_PER_WARP, so that each thread can save some registers and overall shared memory. Is my understanding correct?

If so, I would expect add_config() to add wpt_size and iter_size to the cost model. But here we don't see additional parameters for add_config. Why?
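
For illustration, a hypothetical shape of add_config with the two tunables exposed, along the lines the reviewer describes; the extra parameters and the commented-out emplace_back fields are assumptions, not code from the PR:

auto add_config = [&](size_t load_size, size_t store_size,
                      size_t wpt_size, size_t iter_size) {
  // A cost model could then weigh the register/shared-memory savings that come
  // from wpt_size * iter_size <= THREADS_PER_WARP against achieved bandwidth.
  kernel_configs.emplace_back(row_length, num_rows, itype_size, otype_size,
                              load_size, store_size, sm_count
                              /*, wpt_size, iter_size */);
};
add_config(8, 8, /*wpt_size=*/8, /*iter_size=*/4);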

Author

Yes, when wpt_size * iter_size <= THREADS_PER_WARP, it can save some registers and overall shared memory, thus improving the occupancy of the CU. How can we reflect the impact of wpt_size and iter_size on performance in the cost model?

Contributor

Emm, we probably first need to understand how latency is affected by wpt_size*iter_size before changing the cost model.

By the way, from https://github.com/eliotwang/TransformerEngine/blob/b9359d65666bbcbd6734376e180c8fbe513c50ee/transformer_engine/common/transpose/cast_transpose.cu#L312, it seems that wpt_size*iter_size is still tied to THREADS_PER_WARP, which is 32?

};

wpt_size = 8;
iter_size = THREADS_PER_WARP / wpt_size;

Contributor

if iter_size x wpt_size = THREADS_PER_WARP here, why do we need to add ITER_SIZE into the subsequent RTC params?

Author

This is an added parameter for optimization, aimed at testing the impact of different wpt_size and iter_size configurations on performance.
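
Presumably the tunable reaches the RTC-compiled kernel the same way as the __BLOCK_SIZE__ placeholder quoted earlier; a sketch, with placeholder names assumed rather than taken from the PR:

// Substituted into the kernel source before runtime compilation,
// alongside __BLOCK_SIZE__:
constexpr size_t block_size = __BLOCK_SIZE__;
constexpr size_t iter_size  = __ITER_SIZE__;
constexpr size_t wpt_size   = __WPT_SIZE__;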

…ization implementation for cast_transpose and cast_transpose_fusion. Organize the newly added code within the management scope of __HIP_PLATFORM_AMD__.