[Phantom] Pytorch topk perf (SWDEV-499808) #1848

Open
apakbin opened this issue Jan 21, 2025 · 9 comments

apakbin commented Jan 21, 2025

Link to issue: https://ontrack-internal.amd.com/browse/SWDEV-499808

List of tasks kindly given by Jerry:

  1. Default numbers with upstream
  2. numbers with forcing sort based path
  3. numbers with forcing native path
  4. numbers with forcing native path with NV config
  5. Raise a draft upstream PR for native path with NV config (Ping Pruthvi to enable CI test on this)
  6. numbers with forcing sort based path with updated rocprim
  7. Update heuristics so that we take the best path for different shapes
  8. Update the draft PR with heuristics change

Benchmarks:

  1. topk_repro_metamd.py (METAMD benchmark repro)
  2. topk_repro.py (DENSE benchmark repro)
apakbin self-assigned this Jan 21, 2025

apakbin commented Jan 21, 2025

machine: banff-cyxtera-s80-2
docker image: compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-dkms-no-npi-hipclang:15472_ubuntu22.04_py3.10_pytorch_rocm6.4_internal_testing_d4d0b07
pytorch upstream: 35f5668


apakbin commented Jan 21, 2025

Upstream build failed:

FAILED: caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/bgemm_kernels/torch_hip_generated_bgemm_kernel_bf16bf16bf16_256_256x256x32_32x32_4x4_16x16x1_16x16x1_1x16x1x16_4_Intrawave_v4.hip.o /home/docker_shared_dir/topk_tune/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/bgemm_kernels/torch_hip_generated_bgemm_kernel_bf16bf16bf16_256_256x256x32_32x32_4x4_16x16x1_16x16x1_1x16x1x16_4_Intrawave_v4.hip.o
cd /home/docker_shared_dir/topk_tune/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/bgemm_kernels && /opt/conda/envs/py_3.10/lib/python3.10/site-packages/cmake/data/bin/cmake -E make_directory /home/docker_shared_dir/topk_tune/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/bgemm_kernels/. && /opt/conda/envs/py_3.10/lib/python3.10/site-packages/cmake/data/bin/cmake -D verbose:BOOL=OFF -D build_configuration:STRING=RELEASE -D generated_file:STRING=/home/docker_shared_dir/topk_tune/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/bgemm_kernels/./torch_hip_generated_bgemm_kernel_bf16bf16bf16_256_256x256x32_32x32_4x4_16x16x1_16x16x1_1x16x1x16_4_Intrawave_v4.hip.o -P /home/docker_shared_dir/topk_tune/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/bgemm_kernels/torch_hip_generated_bgemm_kernel_bf16bf16bf16_256_256x256x32_32x32_4x4_16x16x1_16x16x1_1x16x1x16_4_Intrawave_v4.hip.o.cmake
clang++: warning: argument unused during compilation: '--offload-compress' [-Wunused-command-line-argument]
In file included from /home/docker_shared_dir/topk_tune/pytorch/aten/src/ATen/native/hip/bgemm_kernels/bgemm_kernel_bf16bf16bf16_256_256x256x32_32x32_4x4_16x16x1_16x16x1_1x16x1x16_4_Intrawave_v4.hip:3:
In file included from /home/docker_shared_dir/topk_tune/pytorch/aten/src/ATen/native/hip/bgemm_kernels/bgemm_kernel_template.h:11:
In file included from /home/docker_shared_dir/topk_tune/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp:9:
In file included from /home/docker_shared_dir/topk_tune/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/utility/common_header.hpp:11:
In file included from /home/docker_shared_dir/topk_tune/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/utility/multi_index.hpp:11:
In file included from /home/docker_shared_dir/topk_tune/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/utility/statically_indexed_array_multi_index.hpp:8:
/home/docker_shared_dir/topk_tune/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/utility/math_v2.hpp:614:12: error: call to '__hneg' is ambiguous
  614 |     return __hneg(x);
      |            ^~~~~~
/opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_fp16.h:1688:20: note: candidate function
 1688 | __half __hneg(__half x)
      |        ^
/opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_bf16.h:839:44: note: candidate function
  839 | __BF16_HOST_DEVICE_STATIC__ __hip_bfloat16 __hneg(const __hip_bfloat16 a) {
      |                                            ^
In file included from /home/docker_shared_dir/topk_tune/pytorch/aten/src/ATen/native/hip/bgemm_kernels/bgemm_kernel_bf16bf16bf16_256_256x256x32_32x32_4x4_16x16x1_16x16x1_1x16x1x16_4_Intrawave_v4.hip:3:
In file included from /home/docker_shared_dir/topk_tune/pytorch/aten/src/ATen/native/hip/bgemm_kernels/bgemm_kernel_template.h:11:
In file included from /home/docker_shared_dir/topk_tune/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp:15:
/home/docker_shared_dir/topk_tune/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3_multi_d.hpp:742:35: warning: division by zero is undefined [-Wdivision-by-zero]
  742 |             ? KThreadRead / (kfold * K0PerThreadWrite / K0PerThreadRead)
      |                           ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/docker_shared_dir/topk_tune/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3_multi_d.hpp:971:42: note: in instantiation of member function 'ck::GridwiseGemmMultiD_xdl_cshuffle_v3<ck::tensor_layout::gemm::ColumnMajor, ck::tensor_layout::gemm::ColumnMajor, ck::Tuple<>, ck::tensor_layout::gemm::RowMajor, unsigned short, unsigned short, float, unsigned short, ck::Tuple<>, unsigned short, ck::tensor_operation::element_wise::PassThrough, ck::tensor_operation::element_wise::PassThrough, ck::tensor_operation::element_wise::PassThrough, ck::tensor_operation::device::GemmSpecialization::Default, 256, 256, 256, 32, 2, 2, 32, 32, 4, 4, ck::Sequence<16, 16, 1>, ck::Sequence<1, 0, 2>, ck::Sequence<1, 0, 2>, 2, 2, 2, false, 0, ck::Sequence<16, 16, 1>, ck::Sequence<1, 0, 2>, ck::Sequence<1, 0, 2>, 2, 2, 2, false, 0, 1, 1, ck::Sequence<1, 16, 1, 16>, ck::Sequence<4>, ck::BlockGemmPipelineScheduler::Intrawave, ck::BlockGemmPipelineVersion::v4>::GetABlockDescriptor_AK0PerBlock_MPerBlock_AK1' requested here
  971 |           decltype(GetABlockDescriptor_AK0PerBlock_MPerBlock_AK1()),
      |           ^

Using the PyTorch build shipped with the docker image instead.


apakbin commented Jan 22, 2025

Results for topk with the default rocPRIM, sweeping over the following choices (a sketch of the two codepaths follows the attachment):

  1. codepath $\in$ {default, sort, native}
  2. config $\in$ {default, NV}

rslts.csv


apakbin commented Jan 23, 2025

PR: pytorch#145416


apakbin commented Jan 23, 2025

Verbatim from email thread, written by [email protected]:
"
It looks like the radixFindKthValues kernel uses __threadfence() in some locations, it seems like this generates an L1 and L2 cache flush[4], which is obviously gonna be a problem on MI300. I replaced this with asm volatile ("s_waitcnt vmcnt(0) lgkmcnt(0)" ::: "memory");, which improves performance by about 50% to 723 us. See the attached patch. There are certainly cleaner ways to do this, at least in HIP, I think that __builtin_amdgcn_fence(__ATOMIC_RELEASE, "agent"); along with __syncthreads() should suffice. See also this[5] pull request. Note that the current implementation has a bug that has been fixed recently, see [6]. This threadfence should be replaced by __builtin_amdgcn_fence(__ATOMIC_ACQUIRE, "agent");.
"

Replacing __threadfence() on line 337 of TensorTopK.hip with the suggested s_waitcnt asm gave me "Memory access fault by GPU node-8 (Agent handle: 0xf914a50) on address 0x7f2a8cb5a000. Reason: Unknown." Replacing it instead with __builtin_amdgcn_fence(__ATOMIC_RELEASE, "agent") worked without errors. The entries for this fix have "patch" values of "TensorTopK.hip:337:threadfence~>amdgcnfence".

Also, for the bug fix mentioned above, I added __builtin_amdgcn_fence(__ATOMIC_ACQUIRE, "agent"); at line 359 of TensorTopK.hip. Entries with both changes have "patch" values of "TensorTopK.hip:337:threadfence~>amdgcnfence+TensorTopK.hip:359:amdgcnfence".

As expected, these patches improve performance.

rslts.csv


apakbin commented Jan 24, 2025

Focusing on (patch + bug fix) results only, we added more shapes to get a better understanding of how native and sort compare in different scenarios:
input_shapes = [
    (512, 4, 64, 1024),
    (512, 10, 1, 71664),
    (512, 1, 1, 71664),
    (1, 1, 71664),
    (1, 1, 1024),
    (2048, 71664),
    (512, 71664),
    (62, 71664),
    (8, 71664),
    (2, 71664),
    (1, 71664),
    (1, 32768),
    (1, 1024),
    (1, 512),
    (1, 256),
    (800000080,),
    (80000008,),
    (71664,),
    (32768,),
    (1024,),
    (512,),
    (256,),
]

The results are attached, with the rows where "sort" is better highlighted. We observe that for data that is effectively 1-dimensional (counting shapes like (1, 256) and (1, 1, 256) as 1-dimensional as well) and rather long, sort is the better option. We will focus our next tests specifically on the number of elements in the 1-dimensional case. A sketch of the heuristic this suggests follows the attachment.

sort_vs_native_diff_shapes.xlsx


apakbin commented Jan 27, 2025

Added the changes described in #1848 (comment) to the PR: pytorch#145416.

After talking to Pruthvi and Jerry on Jan 24, our focus will be on:

  1. Creating a comprehensive analysis of how each component in the following list changes the performance of topk:
     1. How we compare against H100 before the changes and after the changes
  2. Adding sanity checks to the benchmarking, making sure the output of topk is correct (see the sketch below)


apakbin commented Jan 31, 2025

Results presented on Jan 30 with Phantom (attaching both for the record; the second one includes 'sort' as well and subsumes the first):

topk_results_jan30_2025.xlsx
topk_results_jan30_2025_2.xlsx

The GitHub repo with the analysis code: https://github.com/apakbin/topk_benchmark.git
commit: 56634a14c80f3cc5bfa2aa6e6702c9b8d3d0078c


apakbin commented Feb 4, 2025

Questions asked on Jan 30, answered Feb 3:

  • reg=100 without NV config
  • multiblock tuning without NV config
  • H100 with no fence, how does it compare with baseline H100?

(Two images with the benchmark results are attached.)

jerrymannil self-assigned this Feb 6, 2025