[pytorch][PR] Optimize qavg_pool3d_nhwc (pytorch#35740)
Summary:
Pull Request resolved: pytorch#35740

For one of the quantized CV models, the avg_pool3d operation is more than 6x slower than the C2 implementation. The slowdown has two causes:
- function calls inside the loop (such as `q_scale()` and `q_zero_point()`)
- additional data copies in `Vec256::store` and `at::quantize_vec`

This diff resolves both issues with the following measures:
- lift the function calls outside the loops
- add an 8-lane path in `QuantizeAvx2` to replace `at::quantize_vec`
- in addition, interchange the loops so that the c-loop is innermost, for better memory locality

Test Plan:
buck test //caffe2/test:quantized

Performance Before (n x h x w x c = 4 x 56 x 56 x ch):
```
type          c=2        c=4         c=15        c=24        c=48         c=128        c=256
torch.qint8   903.08 us  1373.39 us  2297.97 us  636.72 us   864.98 us    1618.72 us   2908.47 us
torch.quint8  911.93 us  1429.39 us  2315.59 us  623.08 us   844.17 us    1522.28 us   2711.08 us
torch.qint32  897.77 us  1346.97 us  3846.41 us  6211.92 us  11977.74 us  34348.23 us  62927.48 us
```

Performance After:
```
type          c=2        c=4        c=15       c=24       c=48        c=128       c=256
torch.qint8   123.29 us  176.00 us  348.90 us  99.02 us   132.73 us   267.17 us   513.43 us
torch.quint8  123.76 us  171.90 us  338.17 us  97.92 us   131.06 us   260.09 us   521.16 us
torch.qint32  102.97 us  172.57 us  559.31 us  814.03 us  1606.11 us  4164.89 us  10041.52 us
```

Reviewed By: lly-zero-one

Differential Revision: D20711888

fbshipit-source-id: a71dd55639500f4a036eee96c357737cff9d33db
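Note: the loop structure described in the summary can be illustrated with a minimal pure-Python sketch. It is not the ATen kernel and contains no SIMD path. The function name `qavg_pool3d_ndhwc`, its parameters, and the flat channels-last (N, D, H, W, C) list layout are assumptions made for illustration. It shows two of the three measures: loop-invariant quantization parameters are computed once before the loops, and the channel loop is innermost so that consecutive reads touch adjacent elements.

```python
from typing import List, Tuple


def qavg_pool3d_ndhwc(
    qx: List[int],                           # quantized input, flat, N-D-H-W-C order
    shape: Tuple[int, int, int, int, int],   # (n, d, h, w, c)
    kernel: Tuple[int, int, int],            # (kd, kh, kw)
    stride: Tuple[int, int, int],            # (sd, sh, sw)
    in_scale: float,
    in_zero_point: int,
    out_scale: float,
    out_zero_point: int,
    qmin: int = 0,
    qmax: int = 255,
) -> List[int]:
    """Hypothetical sketch: average pooling without padding on quantized data."""
    n, d, h, w, c = shape
    kd, kh, kw = kernel
    sd, sh, sw = stride
    od = (d - kd) // sd + 1
    oh = (h - kh) // sh + 1
    ow = (w - kw) // sw + 1

    # Hoisted out of the loops: none of these values depend on the position.
    window = kd * kh * kw
    multiplier = in_scale / (out_scale * window)
    zp_sum = in_zero_point * window

    out = [0] * (n * od * oh * ow * c)
    for b in range(n):
        for z in range(od):
            for y in range(oh):
                for x in range(ow):
                    acc = [0] * c  # one accumulator per channel
                    for dz in range(kd):
                        for dy in range(kh):
                            for dx in range(kw):
                                iz, iy, ix = z * sd + dz, y * sh + dy, x * sw + dx
                                base = (((b * d + iz) * h + iy) * w + ix) * c
                                # Channel loop innermost: contiguous reads.
                                for ch in range(c):
                                    acc[ch] += qx[base + ch]
                    obase = (((b * od + z) * oh + y) * ow + x) * c
                    for ch in range(c):
                        # Requantize: remove input zero point, rescale, re-offset.
                        q = round((acc[ch] - zp_sum) * multiplier) + out_zero_point
                        out[obase + ch] = min(qmax, max(qmin, q))
    return out
```

For example, a 1 x 2 x 2 x 2 x 2 input pooled with a 2 x 2 x 2 kernel yields one output value per channel. In the real kernel, the per-channel accumulation and requantization loops are the parts that would be vectorized.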