In this design, a single AI Engine compute core performs a matrix-matrix multiplication. By default, the matrices use the int16 data type for the input and the int32 data type for the output, and the dimensions are set to M×K×N = 256×256×256. The kernel operates on chunks of 64×32×64 (m×k×n), so it is invoked multiple times to complete the full result.
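To make the tiling arithmetic concrete, the following sketch counts the kernel invocations implied by the default sizes and shows, on a small example, how tile-sized partial products accumulate into the full result. The function names and the loop order are illustrative assumptions, not taken from single_core.py; the actual order is determined by the design's data movement.

```python
def kernel_invocations(M, K, N, m, k, n):
    """Count the m x k x n kernel calls needed to cover an M x K x N product."""
    if M % m or K % k or N % n:
        raise ValueError("matrix dimensions must be multiples of the tile dimensions")
    return (M // m) * (K // k) * (N // n)


def tiled_matmul(A, B, m, k, n):
    """Multiply A (M x K) by B (K x N) one m x k x n tile at a time.

    Each pass through the three outer loops stands in for one kernel
    invocation; the loop order here is an assumption for illustration.
    """
    M, K, N = len(A), len(B), len(B[0])
    kernel_invocations(M, K, N, m, k, n)  # validates divisibility
    C = [[0] * N for _ in range(M)]
    for i0 in range(0, M, m):          # tile row of C
        for j0 in range(0, N, n):      # tile column of C
            for k0 in range(0, K, k):  # reduction step, accumulates into the same C tile
                for i in range(i0, i0 + m):
                    for j in range(j0, j0 + n):
                        C[i][j] += sum(A[i][p] * B[p][j] for p in range(k0, k0 + k))
    return C


# Default sizes from the text: (256/64) * (256/32) * (256/64) kernel calls.
print(kernel_invocations(256, 256, 256, 64, 32, 64))
```

With the default sizes this comes to 4 × 8 × 4 = 128 invocations, with 8 of them accumulating into each 64×64 output tile.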
This design is a simplification of the whole-array design. Instead of utilizing all available AI Engine compute cores in parallel, this design performs all computation on a single core. To understand this design better, please refer to the discussion of the whole-array design and the differences outlined below.
Differences from the Whole-Array Design
- This design supports tracing; see below.
- Only a single core performs computations. As such, we only need a single ObjectFIFO for each of the transfers between the levels (shim → memory, memory → compute, and back). These ObjectFIFOs are named inA, inB, outC and memA, memB and memC, respectively.
As in the whole-array design, the single_core.py file describes the data movement of the design. This single core example also comes with an alternative implementation, which can be found in single_core_alt.py. If you specify use_alt=1 as an environment variable at compile time, this alternative implementation will be used in place of single_core.py.
Functionally, single_core.py and single_core_alt.py are intended to be identical. However, single_core_alt.py is implemented using a new syntax for runtime buffer descriptor configuration on the shim. Specifically, single_core_alt.py uses the aiex.dma_configure_task_for, aiex.dma_start_task and aiex.dma_await_task operations instead of aiex.dma_memcpy_nd.
A third implementation of this design, written using a higher-level version of IRON, can be found in single_core_iron.py. If you specify use_iron=1 as an environment variable at compile time, this implementation will be used in place of single_core.py.
Functionally, this design is intended to be identical to the other two. However, single_core_iron.py currently does not support tracing.
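The selection between the three implementations can be pictured with the short sketch below. It is only an illustration of the behavior described above: the actual selection happens in the build system, the function name is hypothetical, and the handling of both variables being set at once is an assumption, since the text does not specify it.

```python
import os


def select_design_source(env=None):
    """Return the design file that the documented environment variables select.

    Hypothetical helper: mirrors the described behavior, not the real build logic.
    """
    env = os.environ if env is None else env
    use_alt = env.get("use_alt") == "1"
    use_iron = env.get("use_iron") == "1"
    if use_alt and use_iron:
        # Assumption: the text does not say which one wins, so this sketch rejects it.
        raise ValueError("set only one of use_alt=1 or use_iron=1")
    if use_alt:
        return "single_core_alt.py"
    if use_iron:
        return "single_core_iron.py"
    return "single_core.py"


print(select_design_source({}))
print(select_design_source({"use_alt": "1"}))
print(select_design_source({"use_iron": "1"}))
```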
C++23 is required for bfloat16_t support. It is available in g++-13; for installation instructions, see https://lindevs.com/install-g-on-ubuntu
To compile and run the design:
make
make single_core.exe
make run
To compile and run the alternative design:
env use_alt=1 make
env use_alt=1 make single_core.exe
env use_alt=1 make run
To compile and run the higher-level IRON design:
env use_iron=1 make
env use_iron=1 make single_core.exe
env use_iron=1 make run
To get tracing output, set enable_tracing=True in single_core.py and ENABLE_TRACING=true in test.cpp. Tracing is also supported in single_core_alt.py.
By default, traces will be written out to trace.txt; another output file can be specified using the --trace (or -t) flag to the host code.
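The flag behavior described above can be mirrored in a few lines of Python. This is only an analogy: the host code for this design is C++ (test.cpp), and the parser below is a hypothetical stand-in that reproduces the documented default and flag names, not the real argument handling.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented --trace / -t option."""
    parser = argparse.ArgumentParser(description="trace output option (illustrative)")
    parser.add_argument(
        "-t",
        "--trace",
        default="trace.txt",  # documented default output file
        help="file the trace is written to",
    )
    return parser


# No flag given: the default file name is used.
print(build_parser().parse_args([]).trace)
# Short form with an explicit (made-up) file name.
print(build_parser().parse_args(["-t", "my_trace.txt"]).trace)
```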