Matrix Multiplication - Single Core Design

In this design, a single AI Engine compute core performs a matrix-matrix multiplication. By default, the inputs are int16, the output is int32, and the dimensions are M×K×N = 256×256×256. The kernel operates on chunks of 64×32×64 (m×k×n), so it is invoked multiple times to complete the full result.
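The chunking described above can be sketched in plain Python. This is only an illustration of the arithmetic, not the design's kernel or data movement: the helper names `tile_counts` and `tiled_matmul` are hypothetical, and the sketch assumes each kernel invocation processes exactly one m×k×n chunk and that the matrix dimensions are multiples of the chunk dimensions.

```python
def tile_counts(M, K, N, m, k, n):
    """Return the number of chunks along the M, K and N dimensions."""
    for full, tile in ((M, m), (K, k), (N, n)):
        if full % tile != 0:
            raise ValueError(f"{full} is not a multiple of chunk size {tile}")
    return M // m, K // k, N // n


def tiled_matmul(A, B, m, k, n):
    """Multiply A (M x K) by B (K x N) one m x k x n chunk at a time.

    Returns the product and the number of chunk-level calls that were made.
    """
    M, K, N = len(A), len(B), len(B[0])
    tiles_m, tiles_k, tiles_n = tile_counts(M, K, N, m, k, n)
    C = [[0] * N for _ in range(M)]
    calls = 0
    for ti in range(tiles_m):
        for tj in range(tiles_n):
            for tk in range(tiles_k):
                # One hypothetical "kernel invocation": accumulate the partial
                # product of an m x k chunk of A and a k x n chunk of B into
                # the matching m x n chunk of C.
                for i in range(ti * m, (ti + 1) * m):
                    for j in range(tj * n, (tj + 1) * n):
                        acc = 0
                        for p in range(tk * k, (tk + 1) * k):
                            acc += A[i][p] * B[p][j]
                        C[i][j] += acc
                calls += 1
    return C, calls


# Default sizes from this README: 256x256x256 split into 64x32x64 chunks.
print(tile_counts(256, 256, 256, 64, 32, 64))  # prints (4, 8, 4)
```

Under these assumptions, the default sizes give 4 × 8 × 4 = 128 chunk-level invocations, and each output chunk accumulates 8 partial products along K.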

This design is a simplification of the whole-array design. Instead of utilizing all available AI Engine compute cores in parallel, this design performs all computation on a single core. To understand this design better, please refer to the discussion of the whole-array design and the differences outlined below.

Differences from the Whole-Array Design

  • This design supports tracing; see the Tracing section below.
  • Only a single core performs computations. As such, we only need a single ObjectFIFO for each of the transfers between the levels (shim → memory, memory → compute, and back). These ObjectFIFOs are named inA, inB, and outC, and memA, memB, and memC, respectively.

Notes on the single_core_alt.py Implementation

As in the whole-array design, the single_core.py file describes the data movement of the design. This single-core example also comes with an alternative implementation, which can be found in single_core_alt.py. If you set use_alt=1 as an environment variable at compile time, this alternative implementation is used in place of single_core.py.

Functionally, single_core.py and single_core_alt.py are intended to be identical. However, single_core_alt.py is implemented using a new syntax for runtime buffer descriptor configuration on the shim. Specifically, single_core_alt.py uses the aiex.dma_configure_task_for, aiex.dma_start_task and aiex.dma_await_task operations instead of aiex.dma_memcpy_nd.

Notes on the single_core_iron.py Implementation

A third implementation of this design, single_core_iron.py, uses a higher-level version of IRON. If you set use_iron=1 as an environment variable at compile time, this implementation is used in place of single_core.py.

Functionally, this design is intended to be identical to the other two. However, single_core_iron.py currently does not support tracing.

Building and Running the Design

You need a C++23 compiler for bfloat16_t support. g++-13 provides it; for installation instructions, see https://lindevs.com/install-g-on-ubuntu

To compile and run the design:

make
make single_core.exe
make run

To compile and run the alternative design:

env use_alt=1 make
env use_alt=1 make single_core.exe
env use_alt=1 make run

To compile and run the higher-level IRON design:

env use_iron=1 make
env use_iron=1 make single_core.exe
env use_iron=1 make run
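If you drive these builds from a script, the three command sequences above differ only in which environment variable is set. The following is a minimal sketch of that idea; the helper names (`build_env`, `make_commands`, `run_variant`) and the variant labels "default", "alt", and "iron" are hypothetical and not part of this repository. Only the use_alt and use_iron variables and the make targets come from this README. The sketch clears the other selector variable before setting the requested one, since this README does not say what happens when both are set.

```python
import os
import subprocess

# Hypothetical variant label -> environment variable named in this README.
VARIANT_ENV = {"default": None, "alt": "use_alt", "iron": "use_iron"}

# The three make invocations listed above: `make`, `make single_core.exe`, `make run`.
MAKE_TARGETS = [[], ["single_core.exe"], ["run"]]


def build_env(variant, base_env=None):
    """Return a copy of the environment with only the requested selector set."""
    if variant not in VARIANT_ENV:
        raise ValueError(f"unknown variant: {variant!r}")
    env = dict(os.environ if base_env is None else base_env)
    # Drop any selector inherited from the caller so that at most one is set.
    for name in VARIANT_ENV.values():
        if name is not None:
            env.pop(name, None)
    selected = VARIANT_ENV[variant]
    if selected is not None:
        env[selected] = "1"
    return env


def make_commands():
    """Return the make command lines in the order shown in this README."""
    return [["make", *targets] for targets in MAKE_TARGETS]


def run_variant(variant):
    """Build and run one variant; stops at the first failing command."""
    env = build_env(variant)
    for cmd in make_commands():
        subprocess.run(cmd, env=env, check=True)
```

For example, `run_variant("iron")` would run the same three commands as the `env use_iron=1 make ...` lines above, assuming it is called from this design's directory.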

Tracing

To get tracing output, set enable_tracing=True in single_core.py and ENABLE_TRACING=true in test.cpp. Tracing is also supported in single_core_alt.py.

By default, traces will be written out to trace.txt; another output file can be specified using the --trace (or -t) flag to the host code.