[Phantom] Pytorch topk perf (SWDEV-499808) #1848

Open
apakbin opened this issue Jan 21, 2025 · 9 comments

apakbin commented Jan 21, 2025

Link to issue: https://ontrack-internal.amd.com/browse/SWDEV-499808

List of tasks kindly given by Jerry:

  1. Default numbers with upstream
  2. numbers with forcing sort based path
  3. numbers with forcing native path
  4. numbers with forcing native path with NV config
  5. Raise a draft upstream PR for native path with NV config (Ping Pruthvi to enable CI test on this)
  6. numbers with forcing sort based path with updated rocprim
  7. Update heuristics so that we take the best path for different shapes
  8. Update the draft PR with heuristics change

Benchmarks:

  1. topk_repro_metamd.py (METAMD benchmark repro)
  2. topk_repro.py (DENSE benchmark repro)
apakbin self-assigned this Jan 21, 2025

apakbin commented Jan 21, 2025

machine: banff-cyxtera-s80-2
docker image: compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-dkms-no-npi-hipclang:15472_ubuntu22.04_py3.10_pytorch_rocm6.4_internal_testing_d4d0b07
pytorch upstream: 35f5668


apakbin commented Jan 21, 2025

Upstream build failed:

FAILED: caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/bgemm_kernels/torch_hip_generated_bgemm_kernel_bf16bf16bf16_256_256x256x32_32x32_4x4_16x16x1_16x16x1_1x16x1x16_4_Intrawave_v4.hip.o /home/docker_shared_dir/topk_tune/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/bgemm_kernels/torch_hip_generated_bgemm_kernel_bf16bf16bf16_256_256x256x32_32x32_4x4_16x16x1_16x16x1_1x16x1x16_4_Intrawave_v4.hip.o
cd /home/docker_shared_dir/topk_tune/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/bgemm_kernels && /opt/conda/envs/py_3.10/lib/python3.10/site-packages/cmake/data/bin/cmake -E make_directory /home/docker_shared_dir/topk_tune/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/bgemm_kernels/. && /opt/conda/envs/py_3.10/lib/python3.10/site-packages/cmake/data/bin/cmake -D verbose:BOOL=OFF -D build_configuration:STRING=RELEASE -D generated_file:STRING=/home/docker_shared_dir/topk_tune/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/bgemm_kernels/./torch_hip_generated_bgemm_kernel_bf16bf16bf16_256_256x256x32_32x32_4x4_16x16x1_16x16x1_1x16x1x16_4_Intrawave_v4.hip.o -P /home/docker_shared_dir/topk_tune/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/bgemm_kernels/torch_hip_generated_bgemm_kernel_bf16bf16bf16_256_256x256x32_32x32_4x4_16x16x1_16x16x1_1x16x1x16_4_Intrawave_v4.hip.o.cmake
clang++: warning: argument unused during compilation: '--offload-compress' [-Wunused-command-line-argument]
In file included from /home/docker_shared_dir/topk_tune/pytorch/aten/src/ATen/native/hip/bgemm_kernels/bgemm_kernel_bf16bf16bf16_256_256x256x32_32x32_4x4_16x16x1_16x16x1_1x16x1x16_4_Intrawave_v4.hip:3:
In file included from /home/docker_shared_dir/topk_tune/pytorch/aten/src/ATen/native/hip/bgemm_kernels/bgemm_kernel_template.h:11:
In file included from /home/docker_shared_dir/topk_tune/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp:9:
In file included from /home/docker_shared_dir/topk_tune/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/utility/common_header.hpp:11:
In file included from /home/docker_shared_dir/topk_tune/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/utility/multi_index.hpp:11:
In file included from /home/docker_shared_dir/topk_tune/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/utility/statically_indexed_array_multi_index.hpp:8:
/home/docker_shared_dir/topk_tune/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/utility/math_v2.hpp:614:12: error: call to '__hneg' is ambiguous
  614 |     return __hneg(x);
      |            ^~~~~~
/opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_fp16.h:1688:20: note: candidate function
 1688 | __half __hneg(__half x)
      |        ^
/opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_bf16.h:839:44: note: candidate function
  839 | __BF16_HOST_DEVICE_STATIC__ __hip_bfloat16 __hneg(const __hip_bfloat16 a) {
      |                                            ^
In file included from /home/docker_shared_dir/topk_tune/pytorch/aten/src/ATen/native/hip/bgemm_kernels/bgemm_kernel_bf16bf16bf16_256_256x256x32_32x32_4x4_16x16x1_16x16x1_1x16x1x16_4_Intrawave_v4.hip:3:
In file included from /home/docker_shared_dir/topk_tune/pytorch/aten/src/ATen/native/hip/bgemm_kernels/bgemm_kernel_template.h:11:
In file included from /home/docker_shared_dir/topk_tune/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp:15:
/home/docker_shared_dir/topk_tune/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3_multi_d.hpp:742:35: warning: division by zero is undefined [-Wdivision-by-zero]
  742 |             ? KThreadRead / (kfold * K0PerThreadWrite / K0PerThreadRead)
      |                           ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/docker_shared_dir/topk_tune/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3_multi_d.hpp:971:42: note: in instantiation of member function 'ck::GridwiseGemmMultiD_xdl_cshuffle_v3<ck::tensor_layout::gemm::ColumnMajor, ck::tensor_layout::gemm::ColumnMajor, ck::Tuple<>, ck::tensor_layout::gemm::RowMajor, unsigned short, unsigned short, float, unsigned short, ck::Tuple<>, unsigned short, ck::tensor_operation::element_wise::PassThrough, ck::tensor_operation::element_wise::PassThrough, ck::tensor_operation::element_wise::PassThrough, ck::tensor_operation::device::GemmSpecialization::Default, 256, 256, 256, 32, 2, 2, 32, 32, 4, 4, ck::Sequence<16, 16, 1>, ck::Sequence<1, 0, 2>, ck::Sequence<1, 0, 2>, 2, 2, 2, false, 0, ck::Sequence<16, 16, 1>, ck::Sequence<1, 0, 2>, ck::Sequence<1, 0, 2>, 2, 2, 2, false, 0, 1, 1, ck::Sequence<1, 16, 1, 16>, ck::Sequence<4>, ck::BlockGemmPipelineScheduler::Intrawave, ck::BlockGemmPipelineVersion::v4>::GetABlockDescriptor_AK0PerBlock_MPerBlock_AK1' requested here
  971 |           decltype(GetABlockDescriptor_AK0PerBlock_MPerBlock_AK1()),
      |           ^

Using the PyTorch build shipped with the docker image instead.


apakbin commented Jan 22, 2025

Results for topk with the default rocPRIM, sweeping over the following choices (a sketch of the two codepaths follows the attachment):

  1. codepath $\in$ {default, sort, native}
  2. config $\in$ {default, NV}

rslts.csv


apakbin commented Jan 23, 2025

PR: pytorch#145416


apakbin commented Jan 23, 2025

Verbatim from email thread, written by [email protected]:
"
It looks like the radixFindKthValues kernel uses __threadfence() in some locations, it seems like this generates an L1 and L2 cache flush[4], which is obviously gonna be a problem on MI300. I replaced this with asm volatile ("s_waitcnt vmcnt(0) lgkmcnt(0)" ::: "memory");, which improves performance by about 50% to 723 us. See the attached patch. There are certainly cleaner ways to do this, at least in HIP, I think that __builtin_amdgcn_fence(__ATOMIC_RELEASE, "agent"); along with __syncthreads() should suffice. See also this[5] pull request. Note that the current implementation has a bug that has been fixed recently, see [6]. This threadfence should be replaced by __builtin_amdgcn_fence(__ATOMIC_ACQUIRE, "agent");.
"

Replacing __threadfence() on line 337 of TensorTopK.hip with the suggested s_waitcnt asm gave me "Memory access fault by GPU node-8 (Agent handle: 0xf914a50) on address 0x7f2a8cb5a000. Reason: Unknown." Replacing it instead with __builtin_amdgcn_fence(__ATOMIC_RELEASE, "agent") worked without errors. The entries for this fix have "patch" values of "TensorTopK.hip:337:threadfence~>amdgcnfence".

Also, for the bug fix mentioned above, I added __builtin_amdgcn_fence(__ATOMIC_ACQUIRE, "agent"); at line 359 of TensorTopK.hip. Entries with both changes have "patch" values of "TensorTopK.hip:337:threadfence~>amdgcnfence+TensorTopK.hip:359:amdgcnfence".

As expected, these patches improve performance.

rslts.csv


apakbin commented Jan 24, 2025

Focusing on (patch + bug fix) results only, we added more shapes to get a better understanding of how native and sort compare in different scenarios:
input_shapes = [
    (512, 4, 64, 1024),
    (512, 10, 1, 71664),
    (512, 1, 1, 71664),
    (1, 1, 71664),
    (1, 1, 1024),
    (2048, 71664),
    (512, 71664),
    (62, 71664),
    (8, 71664),
    (2, 71664),
    (1, 71664),
    (1, 32768),
    (1, 1024),
    (1, 512),
    (1, 256),
    (800000080,),
    (80000008,),
    (71664,),
    (32768,),
    (1024,),
    (512,),
    (256,),
]

The results are attached, with the rows where "sort" is better highlighted. We observe that for data that is effectively 1-dimensional (counting shapes like (1, 256) and (1, 1, 256) as 1-dimensional as well) and rather long, sort is the better option. We will focus our next tests specifically on the number of elements in the 1-dimensional case. A sketch of the heuristic this suggests follows the attachment.

sort_vs_native_diff_shapes.xlsx


apakbin commented Jan 27, 2025

Added the changes described in #1848 (comment) to the PR: pytorch#145416.

After talking to Pruthvi and Jerry on Jan 24, our focus will be on:

  1. Creating a comprehensive analysis of how each component in the following list changes the performance of topk:
     1. How we compare against H100 before the changes and after the changes
  2. Adding sanity checks to the benchmarking, making sure the output of topk is correct (see the sketch below)


apakbin commented Jan 31, 2025

Results presented on Jan 30 with Phantom (attaching both for the record; the second one includes 'sort' as well and subsumes the first):

topk_results_jan30_2025.xlsx
topk_results_jan30_2025_2.xlsx

The GitHub repo with the analysis code: https://github.com/apakbin/topk_benchmark.git
commit: 56634a14c80f3cc5bfa2aa6e6702c9b8d3d0078c


apakbin commented Feb 4, 2025

Questions asked on Jan 30, answered Feb 3:

  • reg=100 without NV config
  • multiblock tuning without NV config
  • H100 with no fence, how does it compare with baseline H100?

(Two images with the benchmark results are attached.)

jerrymannil self-assigned this Feb 6, 2025