diff --git a/hw_acceleration.qmd b/hw_acceleration.qmd
index 6ad3f7f0..bc8740eb 100644
--- a/hw_acceleration.qmd
+++ b/hw_acceleration.qmd
@@ -93,13 +93,9 @@ Accelerator design involves squeezing maximum performance within area constraint

The target workload dictates optimal accelerator architectures. Some of the key considerations include:

- **Memory vs compute boundedness:** Memory-bound workloads require more memory bandwidth, while compute-bound workloads need more arithmetic throughput.
- **Data locality:** Data movement should be minimized for efficiency; near-compute memory helps.
- **Bit-level operations:** Low-precision datatypes like INT8/INT4 increase compute density.
- **Data parallelism:** Multiple replicated compute units allow parallel execution.
- **Pipelining:** Overlapped execution of operations increases throughput.

Understanding workload characteristics enables customized acceleration. For example, convolutional neural networks use sliding window operations that map naturally onto spatial arrays of processing elements.

@@ -391,13 +387,9 @@ The term CPUs has a long history that dates back to 1955 [@weik_survey_1955] whi

An overview of significant developments in CPUs:

* **Single-core Era (1950s-2000):** This era is known for aggressive microarchitectural improvements. Techniques like speculative execution (executing an instruction before the previous one has finished), out-of-order execution (re-ordering instructions to use the hardware more effectively), and wider issue widths (executing multiple instructions at once) were implemented to increase instruction throughput. The term “System on Chip” also originated in this era, as analog components (designed at the transistor level) and digital components (designed with hardware description languages that are mapped to transistors) were put on the same platform to achieve some task.
* **Multi-core Era (2000s):** Driven by the slowing of Moore’s Law, this era is marked by scaling the number of cores within a CPU. Tasks can now be split across many different cores, each with its own datapath and control unit. Many of the issues arising in this era pertained to which resources to share, how to share them, and how to maintain coherency and consistency across all the cores.
* **Sea of accelerators (2010s):** Again driven by the slowing of Moore’s Law, this era is marked by offloading more complicated tasks to accelerators (widgets) attached to the main datapath in CPUs. It’s common to see accelerators dedicated to various AI workloads, as well as image and signal processing, and cryptography. In these designs, CPUs are often described more as arbiters, deciding which tasks should be processed rather than doing the processing themselves. Any task could still run on the CPU rather than on an accelerator, but it would generally be slower. However, the cost of designing and especially programming accelerators became a non-trivial hurdle that led to a spike of interest in domain-specific languages (DSLs).
* **Presence in data centers:** Although we often hear that GPUs dominate the data center market, CPUs are still well suited to tasks that don’t inherently possess a large amount of parallelism. CPUs often handle serial and small tasks and coordinate the data center as a whole.
* **On the edge:** Given the tighter resource constraints on the edge, edge CPUs often implement only a subset of the techniques developed in the single-core era because these optimizations tend to be heavy on power and area consumption. Edge CPUs still maintain a relatively simple datapath with limited memory capacity.

Traditionally, CPUs have been synonymous with general-purpose computing, a term whose meaning has shifted as the “average” consumer workload has changed over time. For example, floating-point units were once considered reserved for “scientific computing,” so they were usually implemented as co-processors (modular components that worked in tandem with the datapath) and seldom deployed to average consumers. Compare this attitude to today, where FPUs are built into every datapath.

@@ -512,24 +504,17 @@ The key goal is tailoring the hardware capabilities to match the algorithms and

The software stack can be optimized to better leverage the underlying hardware capabilities:

- **Model Parallelism:** Parallelize matrix computations like convolution or attention layers to maximize throughput on vector engines.
- **Memory Optimization:** Tune data layouts to improve cache locality based on hardware profiling. This maximizes reuse and minimizes expensive DRAM access.
- **Custom Operations:** Incorporate specialized ops like low-precision INT4 or bfloat16 into models to capitalize on dedicated hardware support (see the sketch after this list).
- **Dataflow Mapping:** Explicitly map model stages to computational units to optimize data movement on hardware.
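To make the memory-layout and low-precision points concrete, here is a minimal PyTorch sketch. It is an illustration of the idea rather than a tuning recipe: the layer shapes and batch size are arbitrary, and it assumes a recent PyTorch build with bfloat16 autocast support on CPU.

```python
import torch
import torch.nn as nn

# Toy convolutional block; shapes are illustrative only.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
)

# Memory optimization: a channels-last layout often maps better onto
# vectorized convolution kernels than the default NCHW layout.
model = model.to(memory_format=torch.channels_last)
x = torch.randn(8, 3, 224, 224).to(memory_format=torch.channels_last)

# Custom datatypes: run the forward pass in bfloat16 where the backend
# supports it, trading precision for arithmetic density.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.shape, y.dtype)
```

On hardware with native bfloat16 or channels-last convolution support, these two small changes are often enough to expose the dedicated datapaths the hardware provides.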
#### Algorithm-Driven Hardware Specialization

Hardware can be tailored to better suit the characteristics of ML algorithms:

- **Custom Datatypes:** Support low-precision INT8/INT4 or bfloat16 in hardware for higher arithmetic density.
- **On-Chip Memory:** Increase SRAM bandwidth and lower access latency to match model memory access patterns.
- **Domain-Specific Ops:** Add hardware units for key ML functions like FFTs or matrix multiplication to reduce latency and energy.
- **Model Profiling:** Use model simulation and profiling to identify computational hotspots and guide hardware optimization.

The key is collaborative feedback - insights from hardware profiling

@@ -582,13 +567,9 @@ Engineers comfortable with established discrete hardware or software design prac

By now it should be clear that specialized hardware accelerators like GPUs, TPUs, and FPGAs are essential to delivering high-performance artificial intelligence applications. But leveraging these hardware platforms effectively requires an extensive software stack spanning the entire development and deployment lifecycle. Frameworks and libraries form the backbone of AI hardware, offering sets of robust, pre-built code, algorithms, and functions specifically optimized to perform a wide array of AI tasks on different hardware. They are designed to simplify the complexities of using the hardware from scratch, which can be time-consuming and error-prone. Software plays an important role in the following:

- Providing programming abstractions and models like CUDA and OpenCL to map computations onto accelerators.
- Integrating accelerators into popular deep learning frameworks like TensorFlow and PyTorch (a brief example follows this list).
- Compilers and tools to optimize across the hardware-software stack.
- Simulation platforms to model hardware and software together.
- Infrastructure to manage deployment on accelerators.
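As a small illustration of how framework integration hides the accelerator details, the sketch below (plain PyTorch, no specific hardware assumed) picks a CUDA device when one is visible and falls back to the CPU otherwise; the same matrix multiply is then dispatched to vendor libraries such as cuBLAS on an Nvidia GPU or to an optimized CPU kernel.

```python
import torch

# Pick an accelerator if the framework can see one, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)

# The same Python line is routed to an accelerator kernel (e.g. a cuBLAS GEMM
# on Nvidia GPUs) or to an optimized CPU kernel, with no model code changes.
c = a @ b
print(c.device, c.shape)
```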
```{mermaid}

@@ -608,13 +589,9 @@ This expansive software ecosystem is as important as the hardware itself in deli

Programming models provide abstractions to map computations and data onto heterogeneous hardware accelerators:

- **[CUDA](https://developer.nvidia.com/cuda-toolkit):** Nvidia's parallel programming model for GPUs, using extensions to languages like C/C++. It allows launching kernels across GPU cores [@luebke2008cuda].
- **[OpenCL](https://www.khronos.org/opencl/):** Open standard for writing programs spanning CPUs, GPUs, FPGAs, and other accelerators. Specifies a heterogeneous computing framework [@munshi2009opencl].
- **[OpenGL/WebGL](https://www.opengl.org):** 3D graphics programming interfaces that can map general-purpose code to GPU cores [@segal1999opengl].
- **[Verilog](https://www.verilog.com)/VHDL:** Hardware description languages (HDLs) used to configure FPGAs as AI accelerators by specifying digital circuits [@gannot1994verilog].
- **[TVM](https://tvm.apache.org):** Compiler framework providing a Python frontend to optimize and map deep learning models onto diverse hardware backends [@chen2018tvm].

Key challenges include expressing parallelism, managing memory across devices, and matching algorithms to hardware capabilities. Abstractions must balance portability with allowing hardware customization. Programming models enable developers to harness accelerators without deep hardware expertise. More of these details are discussed in the [AI frameworks](frameworks.qmd) section.

@@ -624,11 +601,8 @@ Key challenges include expressing parallelism, managing memory across devices, a

Specialized libraries and runtimes provide software abstractions to access and maximize utilization of AI accelerators:

- **Math Libraries:** Highly optimized implementations of linear algebra primitives like GEMM, FFTs, and convolutions, tailored to the target hardware. [Nvidia cuBLAS](https://developer.nvidia.com/cublas), [Intel MKL](https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html), and the [Arm Compute Library](https://www.arm.com/technologies/compute-library) are examples.
- **Framework Integrations:** Libraries to accelerate deep learning frameworks like TensorFlow, PyTorch, and MXNet on supported hardware, for example [cuDNN](https://developer.nvidia.com/cudnn) for accelerating CNNs on Nvidia GPUs.
- **Runtimes:** Software to handle execution on accelerators, including scheduling, synchronization, memory management, and other tasks. [Nvidia TensorRT](https://developer.nvidia.com/tensorrt) is an inference optimizer and runtime.
- **Drivers and Firmware:** Low-level software to interface with hardware, initialize devices, and handle execution. Vendors like Xilinx provide drivers for their accelerator boards.

For instance, PyTorch's GPU integration uses the cuDNN and cuBLAS libraries to accelerate training on Nvidia GPUs. The TensorFlow XLA runtime optimizes and compiles models for accelerators like TPUs. Drivers initialize devices and offload operations.

@@ -642,9 +616,7 @@ Libraries, runtimes and drivers provide optimized building blocks that deep lear

Optimizing compilers play a key role in extracting maximum performance and efficiency from hardware accelerators for AI workloads. They apply optimizations spanning algorithmic changes, graph-level transformations, and low-level code generation:

- **Algorithm Optimization:** Techniques like quantization, pruning, and neural architecture search to enhance model efficiency and match hardware capabilities.
- **Graph Optimizations:** Graph-level transformations like operator fusion, rewriting, and layout changes to improve performance on the target hardware (a minimal sketch follows this list).
- **Code Generation:** Generating optimized low-level code for accelerators from high-level models and frameworks.
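To illustrate the graph-level idea, the toy pass below fuses adjacent convolution and ReLU nodes in a simple list-based graph representation. This is a framework-free sketch of operator fusion only; production compiler stacks such as TVM or XLA perform this kind of rewriting on much richer intermediate representations.

```python
# Toy graph: each node is (op_name, attributes). Real compilers use richer IRs.
graph = [
    ("conv2d", {"out_channels": 64}),
    ("relu", {}),
    ("conv2d", {"out_channels": 128}),
    ("relu", {}),
    ("global_avg_pool", {}),
]

def fuse_conv_relu(nodes):
    """Fuse a conv2d followed by a relu into a single conv2d_relu node.

    Fusion removes the intermediate tensor between the two ops, so the
    activation never has to round-trip through off-chip memory.
    """
    fused, i = [], 0
    while i < len(nodes):
        if (i + 1 < len(nodes)
                and nodes[i][0] == "conv2d"
                and nodes[i + 1][0] == "relu"):
            fused.append(("conv2d_relu", nodes[i][1]))
            i += 2  # consume both nodes
        else:
            fused.append(nodes[i])
            i += 1
    return fused

print(fuse_conv_relu(graph))
```

Production compilers chain many such passes together with layout, tiling, and scheduling decisions before emitting device code.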
For example, the TVM open compiler stack applies quantization for a BERT model targeting Arm GPUs. It fuses pointwise convolution operations and transforms the weight layout to optimize memory access. Finally, it emits optimized OpenGL code to run the workload on the GPU.

@@ -658,9 +630,7 @@ However, efficiently mapping complex models introduces challenges like efficient

Simulation software is important in hardware-software co-design. It enables joint modeling of proposed hardware architectures and software stacks:

- **Hardware Simulation:** Platforms like [Gem5](https://www.gem5.org) allow detailed simulation of hardware components like pipelines, caches, interconnects, and memory hierarchies. Engineers can model hardware changes without physical prototyping [@binkert2011gem5].
- **Software Simulation:** Compiler stacks like [TVM](https://tvm.apache.org) support simulation of machine learning workloads to estimate performance on target hardware architectures. This assists with software optimizations.
- **Co-simulation:** Unified platforms like SCALE-Sim [@samajdar2018scale] integrate hardware and software simulation into a single tool. This enables what-if analysis to quantify the system-level impacts of cross-layer optimizations early in the design cycle.

For example, an FPGA-based AI accelerator could be described in the Verilog hardware description language, which is well suited to specifying the datapaths, control logic, on-chip memories, and interconnects that will be implemented in the FPGA fabric. Once the Verilog design is complete, its behavior can be simulated alongside the rest of the system, for example with the Gem5 simulator. Gem5 is useful for this task because it models full systems, including processors, caches, buses, and custom accelerators, and it supports interfacing Verilog hardware models to the simulation, enabling unified system modeling.

@@ -799,7 +769,6 @@ Chiplets are interconnected using advanced packaging techniques like high-densit

![Chiplet partitioning concept. Figure taken from [@Vivet2021]; see also [Cerebras](https://www.cerebras.net/product-chip/).](images/hw_acceleration/aimage2.png)

Some key advantages of using chiplets for AI include:

- **Flexibility:** Chiplets allow combining different chip types, process nodes, and memories tailored for each function. This is more modular than a fixed wafer-scale design.

@@ -821,8 +790,6 @@ Neuromorphic computing is an emerging field aiming to emulate the efficiency and

![Comparison of the von Neumann architecture with the neuromorphic architecture. These two architectures have some fundamental differences when it comes to operation, organization, programming, communication, and timing. Figure taken from [@schuman2022].](images/hw_acceleration/aimage3.png)

Intel and IBM are leading commercial efforts in neuromorphic hardware. Intel's Loihi and Loihi 2 chips [@davies2018loihi; @davies2021advancing] offer programmable neuromorphic cores with on-chip learning. IBM's NorthPole [@modha2023neural] device comprises more than 100 million magnetic tunnel junction synapses and 68 billion transistors. These specialized chips deliver benefits like low power consumption for edge inference.

Spiking neural networks (SNNs) [@maass1997networks] are computational models suited for neuromorphic hardware. Unlike deep neural networks that communicate via continuous values, SNNs use discrete spikes, more akin to biological neurons. This allows efficient event-based computation rather than constant processing. Additionally, SNNs account for the temporal characteristics of input data in addition to spatial characteristics. This better mimics biological neural networks, where the timing of neuronal spikes plays an important role. However, training SNNs remains challenging due to the added temporal complexity. See the following figure and video for reference.
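To give a flavor of event-based computation, the sketch below simulates a single leaky integrate-and-fire neuron in discrete time: the membrane potential leaks between steps, and a spike is emitted only when it crosses a threshold. The parameters are purely illustrative and not tied to Loihi, NorthPole, or any other chip.

```python
def lif_neuron(input_current, v_thresh=1.0, v_reset=0.0, leak=0.9):
    """Simulate a leaky integrate-and-fire neuron over discrete time steps."""
    v = 0.0
    spikes = []
    for i_t in input_current:
        v = leak * v + i_t      # leaky integration of the input current
        if v >= v_thresh:       # emit a spike when the threshold is crossed
            spikes.append(1)
            v = v_reset         # reset membrane potential after spiking
        else:
            spikes.append(0)
    return spikes

# A constant input produces periodic spikes; information is carried in spike timing.
print(lif_neuron([0.3] * 20))
```

Because the output is a sparse train of events, downstream work only happens when spikes occur, which is the basis of the efficiency claims for neuromorphic hardware.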
@@ -837,7 +804,6 @@ Recently, the integration of photonics with neuromorphic computing [@shastri2021

Neuromorphic computing offers promising capabilities for efficient edge inference but still faces obstacles around training algorithms, nanodevice integration, and system design. Ongoing multidisciplinary research across computer science, engineering, materials science, and physics will be key to unlocking the full potential of this technology for AI use cases.

### Analog Computing

Analog computing is an emerging approach that uses analog signals and components like capacitors, inductors, and amplifiers rather than digital logic for computing. It represents information as continuous electrical signals instead of discrete 0s and 1s. This allows the computation to directly reflect the analog nature of real-world data, avoiding digitization errors and overhead.

@@ -896,7 +862,6 @@ These technologies find applications in AI workloads and high-performance comput

While in-memory computing technologies like ReRAM and PIM offer exciting prospects for efficiency and performance, they come with their own set of challenges, such as issues with data uniformity and scalability in ReRAM [@imani2016resistive]. Nonetheless, the field is ripe for innovation, and addressing these limitations can open new frontiers in both AI and high-performance computing.

### Optical Computing

In AI acceleration, a burgeoning area of interest lies in novel technologies that deviate from traditional paradigms. Some of the emerging technologies mentioned above, such as flexible electronics, in-memory computing, and neuromorphic computing, are close to becoming a reality, given their groundbreaking innovations and applications. One of the most promising next-generation frontiers is optical computing [@miller2000optical; @zhou2022photonic]. Companies like [LightMatter](https://lightmatter.co/) are pioneering the use of photonics for calculation, using photons instead of electrons for data transmission and computation.
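Whether the substrate is resistive memory or photonics, the operation these devices typically accelerate is an analog matrix-vector multiply, and the recurring question is how much device noise and reduced precision the application can tolerate. The NumPy sketch below is a toy numerical model of that trade-off; the number of weight levels and the noise magnitude are invented parameters, not measurements of any real device.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ideal digital reference: a 64x64 weight matrix applied to an input vector.
W = rng.standard_normal((64, 64))
x = rng.standard_normal(64)
y_digital = W @ x

def analog_mvm(W, x, levels=15, noise_std=0.01):
    """Toy model: weights stored at coarse analog precision, noisy readout."""
    scale = np.abs(W).max()
    W_analog = np.round(W / scale * levels) / levels * scale   # coarse weight storage
    y = W_analog @ x                                           # one-shot analog multiply-accumulate
    noise = rng.normal(0.0, noise_std * np.abs(y).max(), y.shape)
    return y + noise                                           # readout/conversion noise

y_analog = analog_mvm(W, x)
rel_error = np.linalg.norm(y_analog - y_digital) / np.linalg.norm(y_digital)
print(f"relative error vs. digital: {rel_error:.3f}")
```

Sweeping the precision and noise parameters in such a model is one way to estimate how much accuracy a given network can give up before an analog or optical substrate stops being attractive.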