AIE platform architecture

Maksim Levental edited this page Jun 12, 2024 · 2 revisions

The AIE platform is like a 12 layer cake where the lower layers aren't cake at all but actually onions. The following attempts to describe AIE "from farm to table": starting from high-level code and ending with registers and assembly.

Disclaimer 1: this doc only covers AIE2.

Disclaimer 2: this is a "brain dump" and is thus only as authoritative as the brain that dumped it.

Disclaimer 3: this is a WIP that I'll be updating continually.

The layers

  • High-level code
    • ML frameworks like ONNX/torch-mlir that somehow lower/translate to LLVM IR;
      • Many of these go through IREE but not all;
    • C/C++ with aie_api;
  • Intermediate Representation
    • MLIR (tensor/loop level transformations);
    • LLVM IR;
  • Target code gen
    • Clang/LLVM based compilers like Peano and Chess.

----------------------------------------

  • Device configuration
    • MLIR-AIE;
    • AIE-RT;
  • Launching/loading
    • IREE;
    • XRT;
    • Drivers;
    • Firmware.

The horizontal line between "Target code gen" and "Device configuration" marks what I think is a rough separation of concerns: above the line are program/code concerns and below the line are device and runtime concerns. Naturally, the line is "fuzzy" (markdown doesn't support fuzzy lines...). Also, for the moment, we'll skip everything above the line. Note that IREE stands for Intermediate Representation Execution Environment; the part to stress is "Execution Environment", because IREE primarily adds runtime infrastructure (most of the codegen components are inherited/reused wholesale from upstream MLIR).

Device overview

AIE devices/platforms are composed of "tiles" arranged in grids and connected by AXI-S (AXI4-Stream). There are three kinds of tiles, each with different resources:

  • Compute/core tiles
    • Vector VLIW processor
    • Some memory
    • 2-channel DMA engine
    • Some locks (actually semaphores)
  • Mem tiles
    • More memory
    • No compute/vector/ALU/nothing
    • 6-channel DMA engine
    • More locks
  • Shim tiles
    • No memory, no compute
    • Can access host DDR
    • 2-channel DMA engine
    • Some locks
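
As a rough way to keep the three tile kinds straight, the list above can be written down as a small lookup table. This is only a sketch: the names `TileKind`, `TileResources`, and `kinds_with` are made up for illustration and do not come from AIE-RT or MLIR-AIE, and the table records nothing beyond what the list states (no memory sizes or lock counts, since those aren't given here).

```python
from dataclasses import dataclass
from enum import Enum


class TileKind(Enum):
    CORE = "core"
    MEM = "mem"
    SHIM = "shim"


@dataclass(frozen=True)
class TileResources:
    has_vector_core: bool      # vector VLIW processor present
    has_local_memory: bool     # any on-tile memory at all
    dma_channels: int          # channel count as given in the list above
    can_access_host_ddr: bool  # only listed for shim tiles
    has_locks: bool            # locks (really semaphores)


# Hypothetical summary table, transcribed from the bullet list above.
RESOURCES = {
    TileKind.CORE: TileResources(True, True, 2, False, True),
    TileKind.MEM: TileResources(False, True, 6, False, True),
    TileKind.SHIM: TileResources(False, False, 2, True, True),
}


def kinds_with(predicate):
    """Return the tile kinds whose resources satisfy `predicate`."""
    return [kind for kind, res in RESOURCES.items() if predicate(res)]


# Which tiles can talk to host DDR? Which can run code?
ddr_capable = kinds_with(lambda r: r.can_access_host_ddr)
compute_capable = kinds_with(lambda r: r.has_vector_core)
```

The only tile kind that ends up in `ddr_capable` is the shim tile, which is why shim tiles show up in every design.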

To get a working design, one needs to configure the pertinent DMAs, locks, and stream switches (i.e., those associated with the tiles that will comprise the design), and of course load the cores with runnable code. Note: you don't have to program all the core/mem tiles all the time (different designs use differing quantities), but you must always configure at least some of the shim tiles in order to send/receive data to/from any of the core/mem tiles.
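
The two rules in the paragraph above (some shim tile is always needed, and cores need runnable code) can be expressed as a tiny sanity check. This is a hypothetical sketch, not anything AIE-RT or MLIR-AIE actually provides: `TileConfig` and `check_design` are invented names, and the tile coordinates in the example are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class TileConfig:
    kind: str                  # "core", "mem", or "shim"
    col: int
    row: int
    has_program: bool = False  # only meaningful for core tiles


def check_design(tiles):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    # Rule 1: without a shim tile no data can get in or out of the array.
    if not any(t.kind == "shim" for t in tiles):
        problems.append("no shim tile configured")
    # Rule 2: every core tile in the design needs runnable code.
    for t in tiles:
        if t.kind == "core" and not t.has_program:
            problems.append(f"core_tile({t.col}, {t.row}) has no program")
    return problems


# Arbitrary example design: one shim tile and one programmed core tile.
design = [
    TileConfig("shim", 0, 0),
    TileConfig("core", 0, 2, has_program=True),
]
print(check_design(design))
```

A real design would also have to check the DMA, lock, and stream switch configuration of each tile; that is deliberately left out here.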

Device configuration

Here is your moment of zen: all device configuration (today) is via direct register writes to configuration registers on the device itself. Don't get distracted by the questions of when/how/where those register writes are actually performed (it's in the firmware) because the point is that a configuration object/serialization/thing is just a sequence of such writes. For example, here is a sequence of writes to configure the locks and DMA on shim_tile(0, 0) to read a single 4-byte value and store it somewhere in host DDR:

0x06000100
0x00000000
0x00000001
0x00000000
0x00000000
0x00000000
0x80000000
0x00000000
0x00000000
0x02000000
0x02000000
0x0001D204
0x80000000
0x03000000
0x00010100

Disassembling (i.e., explaining) these would be pointless, since you don't write them by hand; the point is just to drive home the claim that configuration is just register writes.
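
To make "a configuration is just a sequence of register writes" concrete, here is a minimal sketch of such an object. Everything in it is an assumption made for illustration: the `RegWrite` type, the (address, value) pair encoding as two little-endian 32-bit words, and the example addresses and values. In particular, this is not the encoding used by the dump above.

```python
import struct
from typing import NamedTuple


class RegWrite(NamedTuple):
    address: int  # 32-bit register address
    value: int    # 32-bit value to store there


def serialize(writes):
    """Pack writes as consecutive little-endian (address, value) word pairs.

    Hypothetical layout, chosen only to show the idea.
    """
    return b"".join(struct.pack("<II", w.address, w.value) for w in writes)


def deserialize(blob):
    """Inverse of `serialize`."""
    return [RegWrite(a, v) for a, v in struct.iter_unpack("<II", blob)]


def apply(writes, registers=None):
    """Replay writes in order onto a dict standing in for the register file."""
    regs = dict(registers or {})
    for w in writes:
        regs[w.address] = w.value  # later writes to an address win
    return regs


# Made-up addresses and values; order matters when an address repeats.
config = [
    RegWrite(0x0001D000, 0x00000004),
    RegWrite(0x0001D204, 0x00000000),
    RegWrite(0x0001D000, 0x00000008),
]
blob = serialize(config)
final_state = apply(deserialize(blob))
```

The useful property is that the whole "program" for the device is plain data: it can be built on one machine, shipped as bytes, and replayed elsewhere by whatever actually performs the writes.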

Today, all such configuration register sequences are built ultimately using AIE-RT. Yes, even MLIR-AIE hands off to AIE-RT as its last step. AIE-RT is a utility that collects up all the relative register offsets (very big file...) for each tile and provides an "expression like" interface for computing absolute register addresses. A minimal-ish use of AIE-RT APIs appears in E2E-Linux-Example.
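
The "relative offset to absolute address" computation can be sketched as below. This is an illustration of the idea, not AIE-RT's API: the function name `absolute_address` is made up, and the shift values (25 bits for the column, 20 for the row) are assumptions that should be checked against the AIE-RT sources for the device at hand. The offset in the example is an arbitrary value.

```python
COL_SHIFT = 25  # assumption: the column index sits above bit 25
ROW_SHIFT = 20  # assumption: the row index sits above bit 20


def absolute_address(col, row, reg_offset, base=0):
    """Combine tile coordinates and a tile-relative register offset.

    Assumes the offset fits below ROW_SHIFT and the row index fits in the
    bits between ROW_SHIFT and COL_SHIFT.
    """
    if not 0 <= reg_offset < (1 << ROW_SHIFT):
        raise ValueError("register offset does not fit below the row field")
    if not 0 <= row < (1 << (COL_SHIFT - ROW_SHIFT)):
        raise ValueError("row index does not fit in the row field")
    if col < 0:
        raise ValueError("column index must be non-negative")
    return base + (col << COL_SHIFT) + (row << ROW_SHIFT) + reg_offset


# The same relative offset lands at different absolute addresses per tile.
offset = 0x1D204  # arbitrary example offset
addr_tile_0_0 = absolute_address(0, 0, offset)
addr_tile_1_2 = absolute_address(1, 2, offset)
```

Under these assumptions, tile (0, 0) sees the offset unchanged, while any other tile gets its column and row folded into the upper bits of the address.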

Building the right sequence of register writes is only half the challenge - in order to cross the remainder of the Rubicon you must become intimately familiar with the onions in the bottom layers of the stack (the runtime components).

Runtime

TBD