Description

On Blackhole, the watcher reports a ncrisc_noc_nonposted_writes_flushed assert for some matmul and convolution ops, even though they pass normally when run without the watcher.

I discovered this while developing SD 1.4 on Blackhole. Adding noc_async_write_barrier() at the end of the reader kernel resolves the watcher assert in both cases I encountered. However, when I tried to add the barrier at the end of all matmul and conv kernels that issue writes and are missing this barrier, I ran into a hang in one convolution.

Since I am not very knowledgeable about this problem: should this be debugged globally, or should each op owner debug their ops separately?
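For reference, this is roughly where the barrier goes. This is a minimal sketch of a reader-style dataflow kernel, not the actual kernel from the repo; the argument layout and variable names are made up for illustration.

// Illustrative sketch only: a dataflow "reader" kernel that also issues a NoC write
// should flush its non-posted writes before returning.
#include "dataflow_api.h"

void kernel_main() {
    // Hypothetical runtime args: destination core coordinates, L1 addresses, size.
    uint32_t dst_noc_x   = get_arg_val<uint32_t>(0);
    uint32_t dst_noc_y   = get_arg_val<uint32_t>(1);
    uint32_t dst_l1_addr = get_arg_val<uint32_t>(2);
    uint32_t src_l1_addr = get_arg_val<uint32_t>(3);
    uint32_t size_bytes  = get_arg_val<uint32_t>(4);

    // The write that the watcher tracks as non-posted.
    uint64_t dst_noc_addr = get_noc_addr(dst_noc_x, dst_noc_y, dst_l1_addr);
    noc_async_write(src_l1_addr, dst_noc_addr, size_bytes);

    // ... rest of the reader's work ...

    // The fix described above: flush all outstanding non-posted writes before
    // kernel_main() returns, so the ncrisc_noc_nonposted_writes_flushed check passes.
    noc_async_write_barrier();
}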
Matmul that triggers the assert

$ TT_METAL_WATCHER=1 pytest <name-of-the-file>.py

-> triggers the assert. Adding the barrier to ttnn/cpp/ttnn/operations/matmul/device/kernels/dataflow/reader_bmm_tile_layout_in0_sender_receiver_padding_block_sharded.cpp resolves the watcher assert.
Conv that hangs when the barrier is added to the reader kernel
The convolution that hangs with the barrier is already in main, and can be invoked:
If we add the barrier at the end of the reader (ttnn/cpp/ttnn/operations/conv/conv2d/device/kernels/activation_reader_width_sharded.cpp), the test hangs; otherwise it passes.