
[GPU] Improve NCHW Convolution performance #19660

Open
qedawkins opened this issue Jan 10, 2025 · 0 comments
Labels
codegen/rocm ROCm code generation compiler backend (HIP/HSA) performance ⚡ Performance/optimization related work across the compiler and runtime


Performance of convolutions with NCHW input layouts has not been given much attention. This issue lists the known codegen deficiencies for this layout (specifically with implicit GEMM) and the steps to address them.

1. Vectorization

Currently, vectorization of im2col for NCHW inputs never occurs. For example, this conv:

```mlir
linalg.conv_2d_nchw_fchw {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%arg0, %arg1 : tensor<2x256x130x130xf16>, tensor<32x256x3x3xf16>) outs(%1 : tensor<2x32x128x128xf32>) -> tensor<2x32x128x128xf32>
```

will generate an im2col op like this:

```mlir
iree_linalg_ext.im2col {lowering_config = #iree_gpu.derived_thread_config} strides = [1, 1] dilations = [1, 1] kernel_size = [3, 3] m_offset = [%37, %34#1] * [128, 1] k_offset = [%36] * [1] batch_pos = [0] m_pos = [2, 3] k_pos = [1] ins(%extracted_slice_9 : tensor<1x256x130x130xf16>) outs(%extracted_slice_10 : tensor<1x1x1x8xf16>) -> tensor<1x1x1x8xf16>
```

This im2col is not vectorizable because it loads 8 elements along the combined k dimension, which are not contiguous in the input. If we instead change the tiling configuration to load 8 elements along the image width dimension:

```mlir
iree_linalg_ext.im2col {lowering_config = #iree_gpu.derived_thread_config} strides = [1, 1] dilations = [1, 1] kernel_size = [3, 3] m_offset = [%37, %35] * [128, 1] k_offset = [%36] * [1] batch_pos = [0] m_pos = [2, 3] k_pos = [1] ins(%extracted_slice_9 : tensor<1x256x130x130xf16>) outs(%extracted_slice_10 : tensor<1x1x8x1xf16>) -> tensor<1x1x8x1xf16>
```

This currently won't vectorize either, because DecomposeIm2colPass only vectorizes in cases where k is the innermost dim.
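To illustrate the contiguity difference, here is a small sketch (illustrative only, not IREE code) that computes the flat offsets an im2col gather would touch in a row-major NCHW image, for the index mapping k = c·KH·KW + kh·KW + kw and m = h_out·W_out + w_out:

```python
# Flat offsets into one batch image of the 2x256x130x130 NCHW input above.
# Illustrative sketch; the index decomposition mirrors iree_linalg_ext.im2col
# semantics but all names here are hypothetical helpers.
C, H, W = 256, 130, 130
KH = KW = 3

def flat_offset(c, h, w):
    # Row-major (NCHW) offset within a single image.
    return (c * H + h) * W + w

def im2col_offset(k, m, W_out=128):
    # Decompose the combined k (reduction) and m (output pixel) indices
    # back into input-tensor coordinates.
    c, rem = divmod(k, KH * KW)
    kh, kw = divmod(rem, KW)
    h_out, w_out = divmod(m, W_out)
    return flat_offset(c, h_out + kh, w_out + kw)

# 8 consecutive k elements: strides break at every kernel-row boundary.
print([im2col_offset(k, 0) for k in range(8)])
# -> [0, 1, 2, 130, 131, 132, 260, 261]  (not contiguous, no vector load)

# 8 consecutive m elements (image width): unit stride, vectorizable.
print([im2col_offset(0, m) for m in range(8)])
# -> [0, 1, 2, 3, 4, 5, 6, 7]
```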

  • Improve tile size selection for im2col in NCHW layouts
  • Add support for vectorizing im2col ops along non-innermost dims

2. Filter Layouts

The choice of filter layout determines the preferred iteration order for the K dimension after im2col. The default (and currently the only supported) filter layout for NCHW convs is FCHW (output channels, input channels, filter height, filter width). This means the fastest-varying part of the K dim is the kernel window (typically 3x3), which is reflected in the load order from the input image matrix.
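A quick sketch of what "fastest-varying" means here, assuming small illustrative sizes (FHWC is shown only as a hypothetical alternative layout for contrast, not something the source says is supported):

```python
# With an FCHW filter, the combined K dim decomposes as (c, kh, kw), so the
# kernel window position (kw) varies fastest as K increments. With an FHWC
# filter it would decompose as (kh, kw, c), making the input channel fastest.
KH = KW = 3
C = 4  # small channel count for illustration

def decompose_fchw(k):
    c, rem = divmod(k, KH * KW)
    kh, kw = divmod(rem, KW)
    return (c, kh, kw)

def decompose_fhwc(k):
    hw, c = divmod(k, C)
    kh, kw = divmod(hw, KW)
    return (kh, kw, c)

print([decompose_fchw(k) for k in range(4)])
# -> [(0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 1, 0)]  kernel window fastest
print([decompose_fhwc(k) for k in range(4)])
# -> [(0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 0, 3)]  channel fastest
```

Stepping through the kernel window fastest is what forces the strided input loads shown in the vectorization section above.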
