Systolic arrays are among the most popular compute substrates in DL accelerators today, because they run dense matrix multiplications very efficiently by reusing operands through local data shifts. One such accelerator, built by the RISE lab at IIT Madras, is ShaktiMAAN: an open-source, systolic-array-based DNN accelerator for inference on edge devices.
The complexity of this accelerator poses a variety of challenges in:
- Hardware verification
- Bottleneck analysis using performance modelling
- Design space trade-offs
- Efficient mapping strategy
- Compiler optimizations
To tackle these challenges, I built Synapse (SYstolic CNN Accelerator’s MaPper-Simulator Environment): a versatile Python-based mapper-simulator environment. This work, done under the guidance of Prof. Pratyush Kumar, was submitted as my Bachelor's thesis at IIT Madras. It has three components:
- A mapper that generates an instruction trace for any workload and set of knob values on a targeted architecture.
- A functional simulator (cost model) for ShaktiMAAN.
- A reinforcement learning agent that interacts with the mapper-simulator environment and searches the design space for optimal hardware (array size, buffer size) and software (network folds, loop order) co-design knobs.
Install the following dependencies before running Synapse:
- DRAMSim2
- SWIG
- OpenAI Gym 0.7+
- PyTorch 1.11+
- Instructions for ShaktiMAAN and the simulator can be generated using `systolic/mapper.py`, which takes care of all dependency resolution between instructions.
- As done in the cost-model file `model.py`, instantiate a `NetworkMapping` object and pass it the DL network (`topologies/1.csv`) and systolic array (`configs/1.cfg`) configurations to generate instructions.
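The mapping step can be illustrated with a toy sketch. This is not the code in `systolic/mapper.py`: the `Instr` record, the `LOAD`/`COMPUTE`/`STORE` opcodes, and the `map_matmul` function are hypothetical names chosen for illustration, and the sketch assumes that one fold computes one array-sized block of the output. It folds an (m x k) by (k x n) matrix multiplication onto an array and records which earlier instructions each instruction depends on.

```python
from dataclasses import dataclass, field
from math import ceil


@dataclass
class Instr:
    """One entry of a toy instruction trace (hypothetical format)."""
    idx: int                                   # position in the trace
    op: str                                    # "LOAD", "COMPUTE" or "STORE"
    tile: tuple                                # (row_fold, col_fold) it works on
    deps: list = field(default_factory=list)   # indices that must finish first
    macs: int = 0                              # multiply-accumulates (COMPUTE only)


def map_matmul(m, k, n, rows, cols):
    """Fold an (m x k) @ (k x n) matmul onto a rows x cols array.

    Assumption: each fold produces a block of at most rows x cols outputs,
    reducing over the full k dimension.
    """
    trace = []
    prev_compute = None
    for i in range(ceil(m / rows)):
        for j in range(ceil(n / cols)):
            # Edge folds may be smaller than the array.
            tile_h = min(rows, m - i * rows)
            tile_w = min(cols, n - j * cols)

            load = Instr(len(trace), "LOAD", (i, j))
            trace.append(load)

            # The array is shared: a fold computes only after its operands
            # are loaded and the previous fold has finished computing.
            deps = [load.idx]
            if prev_compute is not None:
                deps.append(prev_compute)
            compute = Instr(len(trace), "COMPUTE", (i, j), deps, tile_h * tile_w * k)
            trace.append(compute)

            # Results can be written back once the fold has been computed.
            trace.append(Instr(len(trace), "STORE", (i, j), [compute.idx]))
            prev_compute = compute.idx
    return trace


trace = map_matmul(m=64, k=32, n=48, rows=32, cols=32)
for ins in trace:
    print(ins.idx, ins.op, ins.tile, ins.deps, ins.macs)
```

Because the dependencies are stored explicitly, a consumer of the trace is free to overlap the `LOAD` of one fold with the `COMPUTE` of the previous one.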
- `systolic/simulator.py` is an event-driven, analytical, data-flow-accurate simulator that models ShaktiMAAN. It consumes the instructions generated by the mapper and advances in an event-driven fashion using `timestamps`.
- It also calculates instruction-wise and global utilization efficiency.
- Finally, it verifies that the output it generates matches the expected `MatMul` output.
- As done in the cost-model file `model.py`, instantiate a `Simulator` object and pass it the systolic array configuration and the generated instructions (from the mapper, or from any other source that follows the ISA) to simulate. It writes layer-wise, instruction-wise, and global statistics to the `outputs/` directory.
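The event-driven idea can be shown in a few lines. The following is a toy sketch, not the logic of `systolic/simulator.py`: the opcodes, the latencies in `LATENCY`, the `(opcode, deps)` trace format, and the `simulate` function are assumptions made for illustration, and the sketch covers only timing and utilization, not the functional `MatMul` check. Each opcode is assumed to run on its own unit, one instruction at a time; the simulator jumps from one completion timestamp to the next instead of stepping cycle by cycle.

```python
import heapq

# Hypothetical per-opcode latencies in cycles, chosen only for illustration.
LATENCY = {"LOAD": 4, "COMPUTE": 10, "STORE": 2}


def simulate(trace):
    """Simulate a list of (opcode, deps) pairs.

    Assumption: each opcode runs on its own unit, one instruction at a time.
    Returns (total_cycles, per-unit utilization).
    """
    pending = [len(deps) for _, deps in trace]   # unfinished dependencies
    users = [[] for _ in trace]                  # who waits on each instruction
    for idx, (_, deps) in enumerate(trace):
        for dep in deps:
            users[dep].append(idx)

    ready = {op: [] for op in LATENCY}           # per-unit heaps of ready indices
    for idx, (op, _) in enumerate(trace):
        if pending[idx] == 0:
            heapq.heappush(ready[op], idx)

    idle = {op: True for op in LATENCY}
    busy = {op: 0 for op in LATENCY}             # cycles each unit spent working
    events = []                                  # heap of (finish_timestamp, idx)

    def issue(now):
        # Start the lowest-index ready instruction on every idle unit.
        for op in LATENCY:
            if idle[op] and ready[op]:
                idx = heapq.heappop(ready[op])
                idle[op] = False
                busy[op] += LATENCY[op]
                heapq.heappush(events, (now + LATENCY[op], idx))

    now = 0
    issue(now)
    while events:
        now, idx = heapq.heappop(events)         # jump to the next completion
        idle[trace[idx][0]] = True
        for user in users[idx]:
            pending[user] -= 1
            if pending[user] == 0:
                heapq.heappush(ready[trace[user][0]], user)
        issue(now)

    utilization = {op: (busy[op] / now if now else 0.0) for op in LATENCY}
    return now, utilization


# Two folds: each loads operands, computes, then stores its result.
trace = [
    ("LOAD", []), ("COMPUTE", [0]), ("STORE", [1]),
    ("LOAD", []), ("COMPUTE", [3, 1]), ("STORE", [4]),
]
total, util = simulate(trace)
print(total, util)
```

In this run the second `LOAD` executes while the first `COMPUTE` is still in flight, which is the kind of overlap that timestamp-based simulation makes visible. Utilization here is simply busy cycles divided by total cycles for each unit.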
A single Jupyter notebook, `ppo_main.py.ipynb`, defines the RL agent, the update rule, and the learning algorithm (PPO), then trains and evaluates the model to find optimal knobs such as `buffer-size` and `loop-order`. It can easily be ported to run on a public cloud such as GCP or AWS, or on Google Colab.
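To make the agent-environment interaction concrete, here is a minimal sketch of a knob-search loop. Everything in it is an assumption for illustration: the knob choices, the `KnobEnv` class, and the cycle formula in `_toy_cycles` are made up and stand in for the real mapper-simulator, and a random policy stands in for PPO. The class mirrors the classic Gym `reset`/`step` convention (a four-tuple from `step`) without importing Gym, so that it runs on its own.

```python
import random
from math import ceil

# Hypothetical knob choices; the real search space is defined in the notebook.
ARRAY_SIZES = [8, 16, 32, 64]
BUFFER_KB = [64, 128, 256, 512]
LOOP_ORDERS = ["mnk", "nmk"]


class KnobEnv:
    """Gym-style environment: one step evaluates one knob configuration."""

    def __init__(self, m=256, k=256, n=256):
        self.workload = (m, k, n)

    def reset(self):
        return self.workload                       # observation: matmul shape

    def step(self, action):
        size_i, buf_i, order_i = action
        cycles = self._toy_cycles(
            ARRAY_SIZES[size_i], BUFFER_KB[buf_i], LOOP_ORDERS[order_i]
        )
        # Fewer cycles means higher reward; each episode is a single step.
        return self.workload, -cycles, True, {"cycles": cycles}

    def _toy_cycles(self, size, buf_kb, order):
        # Made-up analytical model standing in for the mapper-simulator.
        m, k, n = self.workload
        row_folds, col_folds = ceil(m / size), ceil(n / size)
        compute = row_folds * col_folds * (k + 2 * size)   # stream k, fill, drain
        a_bytes, b_bytes = m * k, k * n                    # 1 byte per element
        if a_bytes + b_bytes <= buf_kb * 1024:
            traffic = a_bytes + b_bytes                    # both operands fit on chip
        elif order == "mnk":
            traffic = a_bytes + row_folds * b_bytes        # B re-fetched per row fold
        else:
            traffic = col_folds * a_bytes + b_bytes        # A re-fetched per column fold
        return compute + traffic // 16                     # assumed 16 bytes per cycle


def random_search(env, episodes=200, seed=0):
    """Random policy used here in place of PPO, to show the interaction loop."""
    rng = random.Random(seed)
    best_action, best_reward = None, float("-inf")
    for _ in range(episodes):
        env.reset()
        action = (
            rng.randrange(len(ARRAY_SIZES)),
            rng.randrange(len(BUFFER_KB)),
            rng.randrange(len(LOOP_ORDERS)),
        )
        _, reward, _, _ = env.step(action)
        if reward > best_reward:
            best_action, best_reward = action, reward
    return best_action, best_reward


env = KnobEnv()
best_action, best_reward = random_search(env)
print(best_action, best_reward)
```

A learning agent such as PPO would replace `random_search` and use the reward signal to bias which knob values it samples next, rather than drawing them uniformly.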