
Aarch64 and X64 have different behaviour in u8s8 reorder for convolution #2412

Open
robert-hardwick opened this issue Jan 15, 2025 · 5 comments
Labels
platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 sighting Suspicious library behavior. Should be promoted to a bug when confirmed

Comments

@robert-hardwick

robert-hardwick commented Jan 15, 2025

Summary

This issue has been identified from the PyTorch unit test failures in pytorch/pytorch#144770, and it appears to be caused by the weight prepacking placing the weights in unexpected memory locations.

https://github.com/pytorch/pytorch/blob/dc8692b0eb093d5af150ae0f3a29a0957c3e4c0d/aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp#L437-L458

The code relating to the test failure is linked above, and I can confirm that commenting out the following line makes the tests pass:

https://github.com/pytorch/pytorch/blob/dc8692b0eb093d5af150ae0f3a29a0957c3e4c0d/aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp#L415
op_attr.set_zero_points_mask(DNNL_ARG_SRC, /* zero_points_mask= */0);


A reproducer is in the comments below.

In short, reordering a 135x2 int8 matrix with the reversed memory format (ab -> ba) appears to leave zeros in the resulting memory on AArch64, but not on x64.
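
For context, here is a minimal sketch of the setup the snippets below assume (the engine, stream, and input buffer are not shown in the original post; the values are purely illustrative):

// Assumed setup for the snippets below (not part of the original report).
// Requires <dnnl.hpp>, <vector>, <cstdint>.
dnnl::engine engine(dnnl::engine::kind::cpu, 0);
dnnl::stream engine_stream(engine);

std::vector<int8_t> input_data(135 * 2);
for (size_t i = 0; i < input_data.size(); ++i)
    input_data[i] = static_cast<int8_t>(static_cast<int>(i % 9) - 4);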

memory::dims dims = {135, 2};

dnnl::memory::desc src_desc(dims, dnnl::memory::data_type::s8, dnnl::memory::format_tag::ab);
dnnl::memory src_tensor(src_desc, engine, ( void * ) input_data.data());

dnnl::memory::desc weights_desc(dims, dnnl::memory::data_type::s8, dnnl::memory::format_tag::ba);
weights_desc.get()->extra.flags |= dnnl::impl::memory_extra_flags::compensation_conv_asymmetric_src;
weights_desc.get()->extra.asymm_compensation_mask = (1 << 0);
dnnl::memory dst_tensor(weights_desc, engine);

followed by

dnnl::primitive_attr op_attr = dnnl::primitive_attr();
op_attr.set_scales_mask(DNNL_ARG_DST, 0);
op_attr.set_scratchpad_mode(dnnl::scratchpad_mode::user);
auto pd = dnnl::reorder::primitive_desc(src_tensor, dst_tensor, op_attr);

dnnl::memory scratchpad(pd.scratchpad_desc(), engine);
std::unordered_map<int, memory> args;
args.insert({DNNL_ARG_FROM, src_tensor});
args.insert({DNNL_ARG_TO, dst_tensor});
args.insert({DNNL_ARG_SCRATCHPAD, scratchpad});

// add scales argument
float scale = 1.f;
dnnl::memory scale_tensor(dnnl::memory::desc({1}, dnnl::memory::data_type::f32, dnnl::memory::format_tag::a), engine, ( void * ) &scale );
args.insert({DNNL_ARG_ATTR_SCALES | DNNL_ARG_DST, scale_tensor});

dnnl::reorder(pd).execute(
    engine_stream,
    args);
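
To inspect the result, the reordered buffer can be read back along these lines (a sketch; the attached reproducer prints the same indices):

// Wait for the reorder to complete, then look at the destination payload.
// Requires <iostream>.
engine_stream.wait();
const int8_t *dst = static_cast<const int8_t *>(dst_tensor.get_data_handle());
for (int idx : {0, 1, 2, 3, 135, 136})
    std::cout << "dst_tensor @ " << idx << " = " << static_cast<int>(dst[idx]) << "\n";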

See Observed behavior below.

Note: this seems to follow the reorder (jit_uni_reorder) and convolution (gemm_x8s8s32x_convolution_fwd_t) primitive implementations.

Version

Ideep = 77f18a4a44492e79b8c738e4fcd2698699b1acba
oneDNN = 66f0cb9 ( v3.5.3 )

Environment

$ gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Vendor ID: ARM
Model: 1
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 1
Stepping: r1p1
BogoMIPS: 2100.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng

Steps to reproduce

See comment below with reproducer attached.

Observed behavior

Tensor dimensions are: 135 (27 groups x 5 input channels), 2, 1, 1

src tensor @ 0 = -2
src tensor @ 1 = 3
src tensor @ 2 = -4
src tensor @ 3 = 1
dst_tensor @ 0 = -2
dst_tensor @ 1 = -4
dst_tensor @ 2 = -4
dst_tensor @ 3 = -1
dst_tensor @ 135 = 0  <---- THIS IS UNEXPECTED, should be the same as src tensor @ 1
dst_tensor @ 136 = 0  <---- THIS IS UNEXPECTED, should be the same as src tensor @ 3

Expected behavior


dst_tensor @ 0 = -2
dst_tensor @ 1 = -4
dst_tensor @ 2 = -4
dst_tensor @ 3 = -1
dst_tensor @ 135 = 3  <---- THIS IS THE EXPECTED OUTPUT (same as src tensor @ 1)
dst_tensor @ 136 = 1  <---- THIS IS THE EXPECTED OUTPUT (same as src tensor @ 3)
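
For reference, the expected values follow directly from the ba layout: for dims {135, 2}, format_tag::ba corresponds to strides {1, 135}, so element (i, j) lands at offset i + j * 135. A small compile-time sanity check of that mapping:

// Offset of element (i, j) in a {135, 2} tensor laid out as format_tag::ba
// (strides {1, 135}).
constexpr int ba_offset(int i, int j) { return i + j * 135; }
static_assert(ba_offset(0, 1) == 135, "dst @ 135 holds src element (0, 1), i.e. src @ 1 = 3");
static_assert(ba_offset(1, 1) == 136, "dst @ 136 holds src element (1, 1), i.e. src @ 3 = 1");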

If you comment out these lines https://github.com/oneapi-src/oneDNN/blob/381134e4fdbdcb1099fc584291ee2805fafdb5f0/src/cpu/cpu_convolution_list.cpp#L592C1-L612C82 to force x86 to use gemm_x8s8s32x_convolution_fwd_t, then we get the correct output on x86.

@robert-hardwick robert-hardwick added the sighting Suspicious library behavior. Should be promoted to a bug when confirmed label Jan 15, 2025
@vpirogov vpirogov added the platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 label Jan 15, 2025
@vpirogov
Member

+@oneapi-src/onednn-cpu-aarch64, @milpuz01, @Ryo-not-rio, @fadara01, @annop-w

@robert-hardwick
Author

robert-hardwick commented Jan 16, 2025

EDIT

I've made a simpler reproducer without any ideep dependencies.

reorder_reproducer.cpp.zip

The problem seems to occur only for very specific tensor shapes, in this case (135 x 2), and only when the compensation flags are set and a scale is set on the dst:

weights_desc.get()->extra.flags |= dnnl::impl::memory_extra_flags::compensation_conv_asymmetric_src;
weights_desc.get()->extra.asymm_compensation_mask = (1 << 0);

and

op_attr.set_scales_mask(DNNL_ARG_DST, 0);

float scale = 1.0f;
dnnl::memory scale_tensor(dnnl::memory::desc({1}, dnnl::memory::data_type::f32, dnnl::memory::format_tag::a), engine, ( void * ) &scale );
args.insert({DNNL_ARG_ATTR_SCALES | DNNL_ARG_DST, scale_tensor});

@robert-hardwick
Author

I have tested on oneDNN main and the problem persists.

@robert-hardwick
Author

I am trying to reproduce this issue in benchdnn; does anybody know if we can set asymm_compensation_mask in benchdnn for reorder?

int asymm_compensation_mask;

@robert-hardwick
Author

Found the answer:

if (i_oflag.first & FLAG_ZP_COMP) {
    extra.flags |= dnnl::impl::memory_extra_flags::
            compensation_conv_asymmetric_src;
    extra.asymm_compensation_mask = i_oflag.second;
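
Based on that, a benchdnn invocation along these lines should exercise the same reorder; the exact --oflag and --attr-scales spellings below are my best guess from the reorder driver options and should be double-checked against the benchdnn documentation:

# Hypothetical benchdnn run mimicking the reproducer: s8 ab -> s8 ba with
# asymmetric-src compensation (mask 1) and a common scale on dst.
$ ./benchdnn --reorder --sdt=s8 --ddt=s8 --stag=ab --dtag=ba \
      --oflag=zp_comp:1 --attr-scales=dst:common:1 135x2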

@robert-hardwick robert-hardwick changed the title Aarch64 and X86 have different behaviour in u8s8 reorder for convolution Aarch64 and X64 have different behaviour in u8s8 reorder for convolution Jan 17, 2025