[GPU] Improve NCHW Convolution performance #19660
Labels
codegen/rocm
ROCm code generation compiler backend (HIP/HSA)
performance ⚡
Performance/optimization related work across the compiler and runtime
Performance of convolutions with NCHW input layouts has received little attention so far. This issue lists the known codegen deficiencies for this layout (specifically for implicit GEMM) and the steps to address them.
1. Vectorization
Currently vectorization of im2col for NCHW inputs never occurs. For example, this conv
Will generate an im2col op like this:
This im2col is not vectorizable because it is loading 8 elements along the combined k dimension, which are not contiguous in the input. If we change the tiling configuration to load 8 elements along the image width dimension:
This currently won't vectorize either, because DecomposeIm2colPass only vectorizes when k is the innermost dimension.
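To make the contiguity argument concrete, here is a rough sketch in plain Python (not IREE code) that computes the linear input offsets an im2col gather would touch. The shapes, the helper names (`nchw_offset`, `im2col_offset`, `is_contiguous`), and the unit-stride, no-padding, no-dilation setup are all assumptions chosen for illustration.

```python
def nchw_offset(c, h, w, C, H, W, n=0):
    """Linear offset of element (n, c, h, w) in a row-major NCHW tensor."""
    return ((n * C + c) * H + h) * W + w


def im2col_offset(k, oh, ow, C, H, W, KH, KW):
    """Input offset read for combined index k at output pixel (oh, ow).

    Assumes K is the collapse of (c, kh, kw) with kw fastest, unit stride,
    and no padding or dilation (illustrative assumptions only).
    """
    c, rem = divmod(k, KH * KW)
    kh, kw = divmod(rem, KW)
    return nchw_offset(c, oh + kh, ow + kw, C, H, W)


def is_contiguous(offsets):
    """True if consecutive offsets differ by exactly one element."""
    return all(b - a == 1 for a, b in zip(offsets, offsets[1:]))


# Hypothetical shapes: 16 channels, 34x34 image, 3x3 kernel window.
C, H, W, KH, KW = 16, 34, 34, 3, 3

# 8 elements along the combined K dimension at a fixed output pixel.
along_k = [im2col_offset(k, 0, 0, C, H, W, KH, KW) for k in range(8)]

# 8 elements along the image width dimension at a fixed k.
along_w = [im2col_offset(0, 0, ow, C, H, W, KH, KW) for ow in range(8)]

print(along_k, is_contiguous(along_k))
print(along_w, is_contiguous(along_w))
```

Under these assumptions the K-direction reads jump by a full image row after every `KW` elements, while the width-direction reads are unit-stride, which is why only the second tiling is a candidate for vector loads.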
2. Filter Layouts
The choice of filter layout determines the preferred iteration order for the K dimension post im2col. The default filter layout (and currently the only supported variant for NCHW convs) is FCHW (output channels, input channels, filter height, filter width). This means that the fastest-varying part of the K dim is the kernel window (typically 3x3), which is reflected in the load order from the input image matrix.
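As a small illustration of why FCHW implies this K ordering, the sketch below (plain Python, not IREE code) enumerates K as the collapse of (c, kh, kw) and checks that it walks the filter in memory order. The sizes and the helper names `fchw_filter_offset` and `k_iteration_order` are hypothetical.

```python
from itertools import product


def fchw_filter_offset(f, c, kh, kw, C, KH, KW):
    """Linear offset of element (f, c, kh, kw) in a row-major FCHW filter."""
    return ((f * C + c) * KH + kh) * KW + kw


def k_iteration_order(C, KH, KW):
    """(c, kh, kw) triples in the order implied by collapsing C, KH, KW.

    itertools.product varies the last range fastest, so kw is the
    fastest-varying component, then kh, then c.
    """
    return list(product(range(C), range(KH), range(KW)))


# Hypothetical sizes: 2 input channels, 3x3 kernel window.
C, KH, KW = 2, 3, 3
order = k_iteration_order(C, KH, KW)

# Filter offsets visited for output channel f = 0 while walking K.
filter_offsets = [
    fchw_filter_offset(0, c, kh, kw, C, KH, KW) for c, kh, kw in order
]

print(order[:4])
print(filter_offsets)
```

With this ordering the filter is read with unit stride along K, and the kernel window indices (kh, kw) change fastest, which is the order the input image loads then have to follow.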