
[GPU] Improve NCHW Convolution performance #19660

Open
qedawkins opened this issue Jan 10, 2025 · 0 comments
Labels
codegen/rocm ROCm code generation compiler backend (HIP/HSA) performance ⚡ Performance/optimization related work across the compiler and runtime


Performance of convolutions with NCHW input layouts has not been given much attention. This issue lists the known codegen deficiencies for this layout (specifically with implicit GEMM) and the steps to address them.

1. Vectorization

Currently, vectorization of im2col for NCHW inputs never occurs. For example, this conv:

```mlir
linalg.conv_2d_nchw_fchw {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%arg0, %arg1 : tensor<2x256x130x130xf16>, tensor<32x256x3x3xf16>) outs(%1 : tensor<2x32x128x128xf32>) -> tensor<2x32x128x128xf32>
```

will generate an im2col op like this:

```mlir
iree_linalg_ext.im2col {lowering_config = #iree_gpu.derived_thread_config} strides = [1, 1] dilations = [1, 1] kernel_size = [3, 3] m_offset = [%37, %34#1] * [128, 1] k_offset = [%36] * [1] batch_pos = [0] m_pos = [2, 3] k_pos = [1] ins(%extracted_slice_9 : tensor<1x256x130x130xf16>) outs(%extracted_slice_10 : tensor<1x1x1x8xf16>) -> tensor<1x1x1x8xf16>
```

This im2col is not vectorizable because it loads 8 elements along the combined k dimension, which are not contiguous in the input. If we instead change the tiling configuration to load 8 elements along the image width dimension:

```mlir
iree_linalg_ext.im2col {lowering_config = #iree_gpu.derived_thread_config} strides = [1, 1] dilations = [1, 1] kernel_size = [3, 3] m_offset = [%37, %35] * [128, 1] k_offset = [%36] * [1] batch_pos = [0] m_pos = [2, 3] k_pos = [1] ins(%extracted_slice_9 : tensor<1x256x130x130xf16>) outs(%extracted_slice_10 : tensor<1x1x8x1xf16>) -> tensor<1x1x8x1xf16>
```

This currently won't vectorize either, because DecomposeIm2colPass only vectorizes in cases where k is the innermost dim.
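To illustrate the contiguity difference, here is a small sketch (illustrative only, not IREE code) that computes the flat offsets an im2col gather would touch in a row-major NCHW image, for the index mapping k = c·KH·KW + kh·KW + kw and m = h_out·W_out + w_out:

```python
# Flat offsets into one batch image of the 2x256x130x130 NCHW input above.
# Illustrative sketch; the index decomposition mirrors iree_linalg_ext.im2col
# semantics but all names here are hypothetical helpers.
C, H, W = 256, 130, 130
KH = KW = 3

def flat_offset(c, h, w):
    # Row-major (NCHW) offset within a single image.
    return (c * H + h) * W + w

def im2col_offset(k, m, W_out=128):
    # Decompose the combined k (reduction) and m (output pixel) indices
    # back into input-tensor coordinates.
    c, rem = divmod(k, KH * KW)
    kh, kw = divmod(rem, KW)
    h_out, w_out = divmod(m, W_out)
    return flat_offset(c, h_out + kh, w_out + kw)

# 8 consecutive k elements: strides break at every kernel-row boundary.
print([im2col_offset(k, 0) for k in range(8)])
# -> [0, 1, 2, 130, 131, 132, 260, 261]  (not contiguous, no vector load)

# 8 consecutive m elements (image width): unit stride, vectorizable.
print([im2col_offset(0, m) for m in range(8)])
# -> [0, 1, 2, 3, 4, 5, 6, 7]
```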

  • Improve tile size selection for im2col in NCHW layouts
  • Add support for vectorizing im2col ops along non-innermost dims

2. Filter Layouts

The choice of filter layout determines the preferred iteration order for the K dimension after im2col. The default (and currently the only supported) filter layout for NCHW convs is FCHW (output channels, input channels, filter height, filter width). This means the fastest-varying part of the K dim is the kernel window (typically 3x3), which is reflected in the load order from the input image matrix.
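A quick sketch of what "fastest-varying" means here, assuming small illustrative sizes (FHWC is shown only as a hypothetical alternative layout for contrast, not something the source says is supported):

```python
# With an FCHW filter, the combined K dim decomposes as (c, kh, kw), so the
# kernel window position (kw) varies fastest as K increments. With an FHWC
# filter it would decompose as (kh, kw, c), making the input channel fastest.
KH = KW = 3
C = 4  # small channel count for illustration

def decompose_fchw(k):
    c, rem = divmod(k, KH * KW)
    kh, kw = divmod(rem, KW)
    return (c, kh, kw)

def decompose_fhwc(k):
    hw, c = divmod(k, C)
    kh, kw = divmod(hw, KW)
    return (kh, kw, c)

print([decompose_fchw(k) for k in range(4)])
# -> [(0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 1, 0)]  kernel window fastest
print([decompose_fhwc(k) for k in range(4)])
# -> [(0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 0, 3)]  channel fastest
```

Stepping through the kernel window fastest is what forces the strided input loads shown in the vectorization section above.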
