[BH] Watcher asserts ncrisc_noc_nonposted_writes_flushed for matmul and conv ops #18341

Open
s-jovic opened this issue Feb 26, 2025 · 1 comment

s-jovic commented Feb 26, 2025

Description

On Blackhole, the watcher reports an ncrisc_noc_nonposted_writes_flushed assert for some matmul and convolution ops, even though they pass normally when run without the watcher.

I discovered this while developing SD 1.4 on Blackhole. Adding noc_async_write_barrier() at the end of the reader kernel resolves the watcher assert in both cases I encountered. However, when I tried to add the barrier at the end of all matmul and conv kernels that issue writes but lack this barrier, I hit a hang in one convolution.

Since I am not very knowledgeable about this problem: should it be debugged globally, or should each op owner debug it separately?

Matmul that triggers the assert

# SPDX-FileCopyrightText: © 2025 Tenstorrent Inc.
# SPDX-License-Identifier: Apache-2.0
import torch
import ttnn

def test_matmul_with_watcher_assert(
    device,
):
    grid_size = (5, 8)
    input_shape = [1, 1, 8192, 320]
    weights_shape = [1, 1, 320, 1280]
    bias_shape = [1, 1, 1, 1280]

    block_sharded_mem_config = ttnn.MemoryConfig(
        memory_layout=ttnn.TensorMemoryLayout.BLOCK_SHARDED,
        buffer_type=ttnn.BufferType.L1,
    )

    dram_mem_config = ttnn.MemoryConfig(
        memory_layout=ttnn.TensorMemoryLayout.INTERLEAVED,
        buffer_type=ttnn.BufferType.DRAM,
    )

    input = torch.randn(input_shape).bfloat16().float()
    weights = torch.randn(weights_shape).bfloat16().float()
    bias = torch.randn(bias_shape).bfloat16().float()

    input_t = ttnn.Tensor(input, ttnn.bfloat16).to(ttnn.TILE_LAYOUT).to(
        device, ttnn.MemoryConfig(
            ttnn.TensorMemoryLayout.BLOCK_SHARDED,
            ttnn.BufferType.L1,
            ttnn.ShardSpec(
                ttnn.CoreRangeSet(
                    {
                        ttnn.CoreRange(
                            ttnn.CoreCoord(0, 0),
                            ttnn.CoreCoord(4, 7)
                        ),
                    }
                ),
                (1024, 64),
                ttnn.ShardOrientation.ROW_MAJOR,
            )
        ))
    weights_t = ttnn.Tensor(weights, ttnn.bfloat8_b).to(ttnn.TILE_LAYOUT).to(device, dram_mem_config)
    bias_t = ttnn.Tensor(bias, ttnn.bfloat8_b).to(ttnn.TILE_LAYOUT).to(device, dram_mem_config)

    program_config = ttnn.MatmulMultiCoreReuseMultiCastProgramConfig(
        compute_with_storage_grid_size=grid_size,
        in0_block_w=2,
        out_subblock_h=1,
        out_subblock_w=8,
        per_core_M=32,
        per_core_N=8,
        transpose_mcast=False,
        fused_activation=None,
    )
    output_t = ttnn.linear(
        input_t,
        weights_t,
        bias=bias_t,
        program_config=program_config,
        memory_config=block_sharded_mem_config,
        dtype=ttnn.bfloat8_b,
        compute_kernel_config=ttnn.WormholeComputeKernelConfig(
            math_fidelity=ttnn.MathFidelity.LoFi,
            math_approx_mode=False,
            fp32_dest_acc_en=False,
            packer_l1_acc=False,
        )
    )

    tt_out = output_t.cpu().to_torch()

$ TT_METAL_WATCHER=1 pytest <name-of-the-file>.py  # triggers the assert

Adding the barrier to ttnn/cpp/ttnn/operations/matmul/device/kernels/dataflow/reader_bmm_tile_layout_in0_sender_receiver_padding_block_sharded.cpp resolves the watcher assert.
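
For reference, a minimal sketch of what that change looks like, assuming the usual tt-metal dataflow kernel structure with a kernel_main() entry point; the reader's actual body is elided here, and only the trailing barrier is the fix described above:

#include "dataflow_api.h"

void kernel_main() {
    // ... existing reader logic that issues noc_async_write() calls ...

    // Block until all outstanding non-posted NOC writes have been flushed,
    // so the kernel does not return with writes still in flight; this is
    // the condition the ncrisc_noc_nonposted_writes_flushed watcher assert
    // checks.
    noc_async_write_barrier();
}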

Conv that hangs when the barrier is added to the reader kernel

The convolution that hangs with the barrier is already on main and can be invoked with:

$ pytest "tests/ttnn/unit_tests/operations/test_new_conv2d.py::test_conv_ws[tilized-auto_shard-activations_dtype=DataType.BFLOAT16-weights_dtype=DataType.BFLOAT16-has_bias=True-batch_size=2-output_channels=576-input_channels=576-input_height=9-input_width=9-filter_height=3-filter_width=3-pad_h=0-pad_w=0-act_block_w_div=1-stride=1-device_params={'l1_small_size': 16384}]"

If we add the barrier to the end of the reader kernel (ttnn/cpp/ttnn/operations/conv/conv2d/device/kernels/activation_reader_width_sharded.cpp), the test hangs; otherwise it passes.

s-jovic commented Feb 26, 2025

@pavlejosipovic
