diff --git a/hw_acceleration.qmd b/hw_acceleration.qmd
index 214f0d9a..bc8740eb 100644
--- a/hw_acceleration.qmd
+++ b/hw_acceleration.qmd
@@ -2,95 +2,1004 @@
![_DALL·E 3 Prompt: Create an intricate and colorful representation of a System on Chip (SoC) design in a rectangular format. Showcase a variety of specialized machine learning accelerators and chiplets, all integrated into the processor. Provide a detailed view inside the chip, highlighting the rapid movement of electrons. Each accelerator and chiplet should be designed to interact with neural network neurons, layers, and activations, emphasizing their processing speed. Depict the neural networks as a network of interconnected nodes, with vibrant data streams flowing between the accelerator pieces, showcasing the enhanced computation speed._](./images/cover_ai_hardware.png)
+Machine learning has emerged as a transformative technology across many industries. However, deploying ML capabilities in real-world edge devices faces challenges due to limited computing resources. Specialized hardware acceleration has become essential to enable high-performance machine learning under these constraints. Hardware accelerators optimize compute-intensive operations like inference using custom silicon optimized for matrix multiplications. This provides dramatic speedups over general-purpose CPUs, unlocking real-time execution of advanced models on size, weight and power-constrained devices.
+
+This chapter provides essential background on hardware acceleration techniques for embedded machine learning and their tradeoffs. The goal is to equip readers to make informed hardware selections and software optimizations to develop performant on-device ML capabilities.
+
::: {.callout-tip}
## Learning Objectives
-* coming soon.
+* Understand why hardware acceleration is needed for AI workloads
+
+* Survey key accelerator options like GPUs, TPUs, FPGAs, and ASICs and their tradeoffs
+
+* Learn about programming models, frameworks, compilers for AI accelerators
+
+* Appreciate the importance of benchmarking and metrics for hardware evaluation
+
+* Recognize the role of hardware-software co-design in building efficient systems
+
+* Gain exposure to cutting-edge research directions like neuromorphics and quantum computing
+
+* Understand how ML is beginning to augment and enhance hardware design
:::
## Introduction
+Machine learning has emerged as a transformative technology across many industries, enabling systems to learn and improve from data. To deploy machine learning capabilities in real-world environments, there is a growing demand for embedded ML solutions - where models are built into edge devices like smartphones, home appliances and autonomous vehicles. However, these edge devices have limited computing resources compared to data center servers.
+
+To enable high-performance machine learning on resource-constrained edge devices, specialized hardware acceleration has become essential. Hardware acceleration refers to using custom silicon chips and architectures to offload compute-intensive ML operations from the main processor. In neural networks, the most intensive computations are the matrix multiplications during inference. Hardware accelerators can optimize these matrix operations, providing 10-100x speedups over general-purpose CPUs. This acceleration unlocks the ability to run advanced neural network models in real-time on devices with size, weight and power constraints.
-Explanation: This section lays the groundwork for the chapter, introducing readers to the fundamental concepts of hardware acceleration and its role in enhancing the performance of AI systems, particularly embedded AI. This context is essential because hardware acceleration is a pivotal topic in the domain of embedded AI.
+This chapter overviews hardware acceleration techniques for embedded machine learning and their design tradeoffs. The goal of this chapter is to equip readers with essential background on embedded ML acceleration. This will enable informed hardware selection and software optimization to develop high-performance machine learning capabilities on edge devices.
## Background and Basics
-Explanation: Here, readers are provided with a foundational understanding of the historical and theoretical aspects of hardware acceleration technologies. This section is essential to give readers a historical perspective and a base to aid them in understanding the current state of hardware acceleration technologies.
+### Historical Background
+
+The origins of hardware acceleration date back to the 1960s, with the advent of floating point math co-processors to offload calculations from the main CPU. One early example was the [Intel 8087](https://en.wikipedia.org/wiki/Intel_8087) chip released in 1980 to accelerate floating point operations for the 8086 processor. This established the practice of using specialized processors to handle math-intensive workloads efficiently.
+
+In the 1990s, the first [graphics processing units (GPUs)](https://en.wikipedia.org/wiki/History_of_the_graphics_processor) emerged to rapidly process graphics pipelines for rendering and gaming. Nvidia's [GeForce 256](https://en.wikipedia.org/wiki/GeForce_256) in 1999 was one of the earliest programmable GPUs capable of running custom software algorithms. GPUs exemplify how domain-specific fixed-function accelerators can evolve into parallel programmable accelerators.
+
+In the 2000s, GPUs were applied to general-purpose computing under [GPGPU](https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units). Their high memory bandwidth and computational throughput made them well-suited for math-intensive workloads. This included breakthroughs in using GPUs to accelerate training of deep learning models such as [AlexNet](https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html) in 2012.
+
+In recent years, Google's [Tensor Processing Units (TPUs)](https://en.wikipedia.org/wiki/Tensor_processing_unit) represent customized ASICs specifically architected for matrix multiplication in deep learning. Their optimized tensor cores achieve higher TeraOPS/watt than CPUs or GPUs during inference. Ongoing innovation includes model compression techniques like [pruning](https://arxiv.org/abs/1506.02626) and [quantization](https://arxiv.org/abs/1609.07061) to fit larger neural networks on edge devices.
+
+This evolution demonstrates how hardware acceleration has focused on solving compute-intensive bottlenecks, from floating point math to graphics to matrix multiplication for ML. Understanding this history provides a crucial context for specialized AI accelerators today.
+
+### The Need for Acceleration
+
+The evolution of hardware acceleration is closely tied to the broader history of computing. In the early decades, chip design was governed by Moore's Law and Dennard scaling, which predicted exponential growth in transistor density and proportional improvements in power and performance. These trends held through the single-core era.
+
+However, as @patterson2016computer describe, technological constraints eventually forced a transition to the multicore era, with chips containing multiple processing cores to deliver further performance gains. As power limitations prevented all of these transistors from being used at once, this led to ["dark silicon"](https://en.wikipedia.org/wiki/Dark_silicon) [@xiu2019time].
+
+"Dark silicon" refers to portions of a chip that cannot be powered on simultaneously due to thermal and power limitations. As transistor density increased, the proportion of the chip that could be active without overheating or exceeding power budgets shrank. So while chips contained ever more transistors, not all could operate at once, limiting the performance gains that general-purpose scaling could deliver.
+
+This power crisis necessitated a shift to the accelerator era, with specialized hardware units tailored for specific tasks to maximize efficiency. The explosion in AI workloads further drove demand for customized accelerators. Enabling factors included new programming languages, software tools, and manufacturing advances.
+
+![Figure 1: Time Moore: Exploiting Moore's Law from the Perspective of Time. Dennard scaling broke down around the mid-2000s. Figure from @xiu2019time.](images/hw_acceleration/hwai_40yearsmicrotrenddata.png)
+
+Fundamentally, hardware accelerators are evaluated on performance, power, and silicon area (PPA). The nature of the target application - whether memory-bound or compute-bound - heavily influences the design. For example, memory-bound workloads demand high bandwidth and low latency access, while compute-bound applications require maximal computational throughput.
+
+### General Principles
+
+The design of specialized hardware accelerators involves navigating complex trade-offs between performance, power efficiency, silicon area, and workload-specific optimizations. This section outlines core considerations and methodologies for achieving an optimal balance based on application requirements and hardware constraints.
+
+#### Performance Within Power Budgets
+
+Performance refers to the throughput of computational work per unit time, commonly measured in floating point operations per second (FLOPS) or frames per second (FPS). Higher performance enables completing more work, but power consumption rises with activity.
+
+[Show Image - Optimization curve showing tradeoffs between performance and power. Maximum performance requires exceeding power limits.]
+
+Hardware accelerators aim to maximize performance within set power budgets. This requires carefully balancing parallelism, clock frequency, operating voltage, workload optimization, and other techniques to maximize operations per watt.
+
+- **Performance** = Throughput * Efficiency
+- **Throughput** ~= Parallelism * Clock Frequency
+- **Efficiency** = Operations / Watt
+
+For example, GPUs achieve high throughput via massively parallel architectures. However, their efficiency is lower than customized ASICs like Google's TPU that optimize for a specific workload.
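+
+To make these relationships concrete, the back-of-the-envelope sketch below estimates throughput and operations per watt for a hypothetical accelerator. All the numbers are illustrative assumptions, not specifications of any real chip.
+
+```python
+# Back-of-the-envelope model of throughput and efficiency for a
+# hypothetical accelerator. All numbers are illustrative, not vendor specs.
+
+parallel_units = 4096          # number of multiply-accumulate units (assumed)
+clock_hz = 1.0e9               # 1 GHz clock (assumed)
+ops_per_unit_per_cycle = 2     # one multiply + one add per cycle
+
+throughput_ops = parallel_units * clock_hz * ops_per_unit_per_cycle
+power_watts = 15.0             # assumed power draw at this operating point
+
+efficiency = throughput_ops / power_watts  # operations per watt
+
+print(f"Throughput: {throughput_ops / 1e12:.1f} TOPS")
+print(f"Efficiency: {efficiency / 1e9:.0f} GOPS/W")
+```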
+
+#### Managing Silicon Area and Costs
+
+Chip area directly impacts manufacturing cost. Larger die sizes require more materials and suffer lower yields and higher defect rates. Multi-die packages help scale designs but add packaging complexity. Silicon area depends on:
+
+* **Computational resources** - e.g. number of cores, memory, caches
+* **Manufacturing process node** - smaller transistors enable higher density
+* **Programming model** - programmable accelerators require additional area for flexibility
+
+Accelerator design involves squeezing maximum performance within area constraints. Techniques like pruning and compression help fit larger models on chip.
+
+#### Workload-Specific Optimizations
+
+The target workload dictates optimal accelerator architectures. Some of the key considerations include:
+
+- **Memory vs Compute boundedness:** Memory-bound workloads require more memory bandwidth, while compute-bound apps need arithmetic throughput.
+- **Data locality:** Data movement should be minimized for efficiency. Near-compute memory helps.
+- **Bit-level operations:** Low precision datatypes like INT8/INT4 optimize compute density.
+- **Data parallelism:** Multiple replicated compute units allow parallel execution.
+- **Pipelining:** Overlapped execution of operations increases throughput.
+
+Understanding workload characteristics enables customized acceleration. For example, convolutional neural networks use sliding window operations that are optimally mapped to spatial arrays of processing elements.
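+
+As a small illustration of the sliding-window point, the sketch below expresses a 2D convolution as independent window-wise dot products in plain Python/NumPy. Because each output element depends only on its local input window, the output loop iterations could be mapped onto parallel processing elements; this is a conceptual sketch, not an accelerator implementation.
+
+```python
+import numpy as np
+
+# A 2D convolution expressed as independent sliding-window dot products.
+# Each output element depends only on its local input window, so the
+# (i, j) iterations could be spread across parallel processing elements.
+def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
+    kh, kw = kernel.shape
+    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
+    out = np.zeros((oh, ow))
+    for i in range(oh):          # these two loops are embarrassingly parallel
+        for j in range(ow):
+            window = image[i:i + kh, j:j + kw]
+            out[i, j] = np.sum(window * kernel)
+    return out
+
+image = np.random.rand(8, 8)
+kernel = np.random.rand(3, 3)
+print(conv2d(image, kernel).shape)  # (6, 6)
+```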
+
+By navigating these architectural tradeoffs, hardware accelerators can deliver massive performance gains and enable emerging applications in AI, graphics, scientific computing and other domains.
+
+#### Sustainable Hardware Design
+
+In recent years, AI sustainability has become a pressing concern driven by two key factors - the exploding scale of AI workloads and their associated energy consumption.
+
+First, the size of AI models and datasets has grown rapidly. For example, according to OpenAI's analysis of AI compute trends, the amount of compute used to train state-of-the-art models doubled roughly every 3.4 months. This exponential growth requires massive computational resources in data centers.
+
+Second, the energy usage of AI training and inference presents sustainability challenges. Data centers running AI applications now consume substantial amounts of energy, contributing to high carbon emissions. It's estimated that training a large AI model can have a carbon footprint of 626,000 pounds of CO2 equivalent, almost 5 times the lifetime emissions of an average car.
+
+As a result, AI research and practice must prioritize energy efficiency and carbon impact alongside accuracy. There is increasing focus on model efficiency, data center design, hardware optimization and other solutions to improve sustainability. Striking a balance between AI progress and environmental responsibility has emerged as a key consideration and an area of active research across the field.
+
+The scale of AI systems is expected to keep growing. Developing sustainable AI is crucial for managing the environmental footprint and enabling widespread beneficial deployment of this transformative technology.
+
+We will learn about [Sustainable AI](./sustainable_ai.qmd) in a later chapter where we will go into more detail about it.
+
+## Accelerator Types {#sec-aihw}
+
+Hardware accelerators can take on many forms. They can exist as a widget (like the [Neural Engine in the Apple M1 chip](https://www.apple.com/newsroom/2020/11/apple-unleashes-m1/)) or as entire chips specially designed to perform certain tasks very well. In this section, we will examine processors for machine learning workloads along the spectrum from highly specialized ASICs to more general-purpose CPUs. We first focus on custom hardware purpose-built for AI to understand the most extreme optimizations possible when design constraints are removed. This establishes a ceiling for performance and efficiency.
+
+We then progressively consider more programmable and adaptable architectures with discussions of GPUs and FPGAs. These make tradeoffs in customization to maintain flexibility. Finally, we cover general-purpose CPUs which sacrifice optimizations for a particular workload in exchange for versatile programmability across applications.
+
+By structuring the analysis along this spectrum, we aim to illustrate the fundamental tradeoffs in accelerator design between utilization, efficiency, programmability, and flexibility. The optimal balance point depends on the constraints and requirements of the target application. This spectrum perspective provides a framework for reasoning about hardware choices for machine learning and the capabilities required at each level of specialization.
+
+The progression begins with the most specialized option, ASICs purpose-built for AI, to ground our understanding in the maximum possible optimizations before expanding to more generalizable architectures. This structured approach aims to elucidate the accelerator design space.
+
+![](images/hw_acceleration/tradeoffs.png)
+*This graph illustrates the complex interplay between flexibility, performance, functional diversity, and area for various types of hardware processors. As the area of a processor increases, so does its potential for functional diversity and flexibility. However, a design could instead dedicate that additional area to application-specific tasks. Performance ultimately depends on how effectively that area is utilized. Optimal design always requires balancing these factors according to a hierarchy of application requirements.*
+
+### Application-Specific Integrated Circuits (ASICs)
+
+An Application-Specific Integrated Circuit (ASIC) is a type of [integrated circuit](https://en.wikipedia.org/wiki/Integrated_circuit) (IC) that is custom-designed for a specific application or workload, rather than for general-purpose use. Unlike CPUs and GPUs, ASICs do not support multiple applications or workloads. Rather, they are optimized to perform a single task extremely efficiently. Google's TPUs are a prominent example of ASICs built for ML; more broadly, custom-designed chips such as Apple's M1/M2/M3, Arm's Neoverse, Intel's Core i5/i7/i9, and NVIDIA's GPUs show how much of the industry now invests in purpose-built silicon.
+
+ASICs achieve this efficiency by tailoring every aspect of the chip design - the underlying logic gates, electronic components, architecture, memory, I/O, and manufacturing process - specifically for the target application. This level of customization allows removing any logic or functionality that would only be needed for general-purpose computation. The result is an IC that maximizes performance and power efficiency on the desired workload. The efficiency gains from application-specific hardware are so substantial that even traditionally software-centric firms are dedicating enormous engineering resources to designing customized ASICs.
+
+The rise of more complex machine learning algorithms has made the performance advantages enabled by tailored hardware acceleration a key competitive differentiator, even for companies traditionally concentrated on software engineering. ASICs have become a high-priority investment for major cloud providers aiming to offer faster AI computation.
+
+#### Advantages
+
+ASICs provide significant benefits over general purpose processors like CPUs and GPUs due to their customized nature. The key advantages include the following.
+
+##### Maximized Performance and Efficiency
+
+The most fundamental advantage of ASICs is the ability to maximize performance and power efficiency by customizing the hardware architecture specifically for the target application. Every transistor and design aspect is optimized for the desired workload - no unnecessary logic or overhead is needed to support generic computation.
+
+For example, [Google's Tensor Processing Units (TPUs)](https://cloud.google.com/tpu/docs/intro-to-tpu) contain architectures tailored exactly for the matrix multiplication operations used in neural networks. To design the TPU ASICs, Google's engineering teams need to clearly define the chip specifications, write the architecture description using Hardware Description Languages like [Verilog](https://www.verilog.com/), synthesize the design to map it to hardware components, and carefully place-and-route transistors and wires based on the fabrication process design rules. This complex design process, known as very-large-scale integration (VLSI), allows them to build an IC optimized just for machine learning workloads.
+
+As a result, TPU ASICs achieve over an order of magnitude higher efficiency in operations per watt than general purpose GPUs on ML workloads by maximizing performance and minimizing power consumption through a full-stack custom hardware design.
+
+##### Specialized On-Chip Memory
+
+ASICs incorporate on-chip SRAM and caches specifically optimized to feed data to the computational units. For example, Apple's M1 system-on-a-chip contains special low-latency SRAM to accelerate the performance of its Neural Engine machine learning hardware. Large local memory with high bandwidth enables keeping data as close as possible to the processing elements. This provides tremendous speed advantages compared to off-chip DRAM access, which is up to 100x slower.
+
+Data locality and an optimized memory hierarchy are crucial for both high throughput and low power. Below is the "Numbers Everyone Should Know" table popularized by [Jeff Dean](https://research.google/people/jeff/).
+
+| Operation | Latency | Notes |
+|-|-|-|
+| L1 cache reference | 0.5 ns | |
+| Branch mispredict | 5 ns | |
+| L2 cache reference | 7 ns | |
+| Mutex lock/unlock | 25 ns | |
+| Main memory reference | 100 ns | |
+| Compress 1K bytes with Zippy | 3,000 ns | 3 μs |
+| Send 1 KB over 1 Gbps network | 10,000 ns | 10 μs |
+| Read 4 KB randomly from SSD | 150,000 ns | 150 μs |
+| Read 1 MB sequentially from memory | 250,000 ns | 250 μs |
+| Round trip within same datacenter | 500,000 ns | 0.5 ms |
+| Read 1 MB sequentially from SSD | 1,000,000 ns | 1 ms |
+| Disk seek | 10,000,000 ns | 10 ms |
+| Read 1 MB sequentially from disk | 20,000,000 ns | 20 ms |
+| Send packet CA->Netherlands->CA | 150,000,000 ns | 150 ms |
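+
+A quick back-of-the-envelope calculation with the figures above shows why keeping data on chip matters so much. The sketch below simply takes ratios of the table's numbers; the 10 MB weight set is an illustrative assumption.
+
+```python
+# Rough comparisons using the "Numbers Everyone Should Know" figures above.
+# All values come from the table and are expressed in nanoseconds.
+l1_ref_ns = 0.5
+dram_ref_ns = 100
+read_1mb_mem_ns = 250_000
+read_1mb_ssd_ns = 1_000_000
+
+print(f"DRAM reference vs L1 reference: {dram_ref_ns / l1_ref_ns:.0f}x slower")
+print(f"1 MB from SSD vs from memory:   {read_1mb_ssd_ns / read_1mb_mem_ns:.0f}x slower")
+
+# Streaming a hypothetical 10 MB set of weights from main memory just once:
+print(f"~{10 * read_1mb_mem_ns / 1e6:.1f} ms spent only on moving data")
+```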
+
+##### Custom Datatypes and Operations
+
+Unlike general purpose processors, ASICs can be designed to natively support custom datatypes like INT4 or bfloat16 that are widely used in ML models. For instance, Nvidia's Ampere GPU architecture has dedicated bfloat16 Tensor Cores to accelerate AI workloads. Low precision datatypes enable higher arithmetic density and performance. ASICs can also directly incorporate non-standard operations common in ML algorithms as primitive operations - for example, natively supporting activation functions like ReLU makes execution more efficient. We encourage you to refer to the Efficient Numeric Representations chapter for additional details.
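+
+To illustrate why low-precision datatypes help, the sketch below applies a generic, textbook-style symmetric INT8 quantization to a weight tensor. It is a minimal example only; production toolchains typically use per-channel scales, calibration data, and more careful saturation handling.
+
+```python
+import numpy as np
+
+# Minimal symmetric INT8 quantization of a weight tensor (illustrative only).
+weights_fp32 = np.random.randn(256, 256).astype(np.float32)
+
+scale = np.max(np.abs(weights_fp32)) / 127.0
+weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)
+weights_dequant = weights_int8.astype(np.float32) * scale
+
+print(f"Storage: {weights_fp32.nbytes} bytes -> {weights_int8.nbytes} bytes (4x smaller)")
+print(f"Max abs round-trip error: {np.max(np.abs(weights_fp32 - weights_dequant)):.4f}")
+```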
+
+##### High Parallelism
+
+ASIC architectures can leverage much higher parallelism tuned for the target workload versus general purpose CPUs or GPUs. More computational units tailored for the application means more operations execute simultaneously. Highly parallel ASICs achieve tremendous throughput for data parallel workloads like neural network inference.
+
+##### Advanced Process Nodes
+Cutting edge manufacturing processes allow packing more transistors into smaller die areas, increasing density. ASICs designed specifically for high volume applications can better amortize the costs of bleeding edge process nodes.
+
+#### Disadvantages
+
+##### Long Design Timelines
+
+The engineering process of designing and validating an ASIC can take 2-3 years. Synthesizing the architecture using hardware description languages, taping out the chip layout, and fabricating the silicon on advanced process nodes involves long development cycles. For example, to tape out a 7nm chip, teams need to carefully define specifications, write the architecture in HDL, synthesize the logic gates, place components, route all interconnections, and finalize the layout to send for fabrication. This very large scale integration (VLSI) flow means ASIC design and manufacturing traditionally spans multiple years.
+
+There are a few key reasons why the long design timelines of ASICs, often 2-3 years, can be challenging for machine learning workloads:
+
+- **ML algorithms evolve rapidly:** New model architectures, training techniques, and network optimizations are constantly emerging. For example, Transformers became hugely popular in NLP in just the last few years. By the time an ASIC finishes tapeout, the optimal architecture for a workload may have changed.
+- **Datasets grow quickly:** ASICs designed for certain model sizes or datatypes can become undersized relative to demand. For instance, natural language models are scaling exponentially with more data and parameters. A chip designed for BERT might not accommodate GPT-3.
+- **ML applications change frequently:** The industry focus shifts between computer vision, speech, NLP, recommender systems etc. An ASIC optimized for image classification may have less relevance in a few years.
+- **Faster design cycles with GPUs/FPGAs:** Programmable accelerators like GPUs can adapt much quicker by upgrading software libraries and frameworks. New algorithms can be deployed without hardware changes.
+- **Time-to-market needs:** Getting a competitive edge in ML requires rapidly experimenting with new ideas and deploying them. Waiting several years for an ASIC is not aligned with fast iteration.
+
+The pace of innovation in ML is not well matched to the multi-year timescale for ASIC development. Significant engineering efforts are required to extend ASIC lifespan through modular architectures, process scaling, model compression, and other techniques. But the rapid evolution of ML makes fixed function hardware challenging.
+
+##### High Non-Recurring Engineering Costs
+The fixed costs of taking an ASIC from design to high volume manufacturing can be very capital intensive, often tens of millions of dollars. Photomask fabrication for taping out chips in advanced process nodes, packaging, and one-time engineering efforts are expensive. For instance, a 7nm chip tapeout alone could cost tens of millions of dollars. The high non-recurring engineering (NRE) investment narrows ASIC viability to high-volume production use cases where the upfront cost can be amortized.
+
+![](images/hw_acceleration/nre.png)
+*Table from [Enabling Cheaper Design](https://semiengineering.com/enabling-cheaper-design/)*
+
+##### Complex Integration and Programming
+ASICs require extensive software integration work including drivers, compilers, OS support, and debugging tools. They also need expertise in electrical and thermal packaging. Additionally, programming ASIC architectures efficiently can involve challenges like workload partitioning and scheduling across many parallel units. The customized nature necessitates significant integration efforts to turn raw hardware into fully operational accelerators.
+
+While ASICs provide massive efficiency gains on target applications by tailoring every aspect of the hardware design to one specific task, their fixed nature results in tradeoffs in flexibility and development costs compared to programmable accelerators, which must be weighed based on the application.
+
+### Field-Programmable Gate Arrays (FPGAs)
+
+FPGAs are programmable integrated circuits that can be reconfigured for different applications. Their customizable nature provides advantages for accelerating AI algorithms compared to fixed ASICs or inflexible GPUs. While Google, Meta, and NVIDIA are pursuing ASICs for their data centers, Microsoft deployed FPGAs in its data centers starting in 2011 [@putnam_reconfigurable_2014] to efficiently serve diverse workloads.
+
+#### Advantages
+
+FPGAs provide several benefits over GPUs and ASICs for accelerating machine learning workloads.
+
+##### Flexibility Through Reconfigurable Fabric
+
+The key advantage of FPGAs is the ability to reconfigure the underlying fabric to implement custom architectures optimized for different models, unlike fixed-function ASICs. For example, quantitative trading firms use FPGAs to accelerate algorithms that change frequently, where the low NRE cost of reprogramming an FPGA is more viable than taping out new ASICs.
+
+![](images/hw_acceleration/fpga.png)
+*Comparison of FPGAs on the market [@gwennap_certus-nx_nodate]*
+
+FPGAs are composed of basic building blocks - configurable logic blocks, RAM blocks, and interconnects. Vendors provide a base amount of these resources, and engineers program the chips by compiling HDL code into bitstreams that rearrange the fabric into different configurations. This makes FPGAs adaptable as algorithms evolve.
+
+While FPGAs may not achieve the utmost performance and efficiency of workload-specific ASICs, their programmability provides more flexibility as algorithms change. This adaptability makes FPGAs a compelling choice for accelerating evolving machine learning applications. For machine learning workloads, Microsoft has deployed FPGAs in its Azure data centers to serve diverse applications, instead of using ASICs. The programmability enables optimization across changing ML models.
+
+##### Customized Parallelism and Pipelining
+
+FPGA architectures can leverage spatial parallelism and pipelining by tailoring the hardware design to mirror the parallelism in ML models. For example, Intel's HARPv2 FPGA platform splits the layers of an MNIST convolutional network across separate processing elements to maximize throughput. Unique parallel patterns like tree ensemble evaluations are also possible on FPGAs. Deep pipelines with optimized buffering and dataflow can be customized to each model's structure and datatypes. This level of tailored parallelism and pipelining is not feasible on GPUs.
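+
+The payoff of deep pipelining can be seen with simple arithmetic. The sketch below models an idealized pipeline with assumed stage counts and timings: once the pipeline is full, one result completes per stage interval regardless of how many stages each item passes through.
+
+```python
+# Idealized throughput model of a pipelined datapath (illustrative numbers).
+stages = 8
+stage_ns = 5            # assumed time per pipeline stage
+items = 1_000_000
+
+unpipelined_ns = items * stages * stage_ns       # each item occupies the whole path
+pipelined_ns = (stages + items - 1) * stage_ns   # fill latency + one result per tick
+
+print(f"Unpipelined: {unpipelined_ns / 1e6:.1f} ms")
+print(f"Pipelined:   {pipelined_ns / 1e6:.1f} ms (~{unpipelined_ns / pipelined_ns:.1f}x)")
+```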
+
+##### Low Latency On-Chip Memory
+
+Large amounts of high-bandwidth on-chip memory enable localized storage for weights and activations. For instance, Xilinx Versal FPGAs contain 32MB of low-latency RAM blocks along with dual-channel DDR4 interfaces for external memory. Bringing memory physically closer to the compute units reduces access latency. This provides significant speed advantages over GPUs that must traverse PCIe or other system buses to reach off-chip GDDR6 memory.
+
+##### Native Support for Low Precision
+
+A key advantage of FPGAs is the ability to natively implement any bit width for arithmetic units, such as INT4 or bfloat16 used in quantized ML models. For example, Intel's Stratix 10 NX FPGAs have dedicated INT8 cores that can achieve up to 143 INT8 TOPS at ~1 TOPS/W ([Intel® Stratix® 10 NX FPGA](https://www.intel.com/content/www/us/en/products/details/fpga/stratix/10/nx.html)). Lower bit widths increase arithmetic density and performance. FPGAs can even support mixed precision or dynamic precision tuning at runtime.
+
+#### Disadvantages
+
+##### Lower Peak Throughput than ASICs
+
+FPGAs cannot match the raw throughput numbers of ASICs customized for a specific model and precision. The overhead of the reconfigurable fabric compared to fixed-function hardware results in lower peak performance. For example, a TPU v5e pod connects up to 256 chips delivering more than 100 petaOps of INT8 performance, while a single FPGA like the Intel Stratix 10 NX offers up to 143 INT8 TOPS or 286 INT4 TOPS ([Intel® Stratix® 10 NX FPGA](https://www.intel.com/content/www/us/en/products/details/fpga/stratix/10/nx.html)).
+
+This is because FPGAs are composed of basic building blocks - configurable logic blocks, RAM blocks, and interconnects. Vendors provide a set amount of these resources. To program FPGAs, engineers write HDL code and compile into bitstreams that rearrange the fabric, which has inherent overheads versus an ASIC purpose-built for one computation.
+
+##### Programming Complexity
+
+To optimize FPGA performance, engineers must program the architectures in low-level hardware description languages like Verilog or VHDL. This requires hardware design expertise and longer development cycles versus higher level software frameworks like TensorFlow. Maximizing utilization can be challenging despite advances in high-level synthesis from C/C++.
-- Historical Background
-- The Need for Hardware Acceleration
-- General Principles of Hardware Acceleration
+##### Reconfiguration Overheads
-## Types of Hardware Accelerators {#sec-aihw}
+Changing an FPGA's configuration requires reloading a new bitstream, which incurs considerable latency and storage costs. For example, partial reconfiguration on Xilinx FPGAs can take hundreds of milliseconds. This makes dynamically swapping architectures in real time infeasible. The bitstream storage also consumes on-chip memory.
-Explanation: This section offers an overview of the hardware options available for accelerating AI tasks, discussing each type in detail, and comparing their advantages and disadvantages. It is key for readers to comprehend the various hardware solutions available for specific AI tasks, and to make informed decisions when selecting hardware solutions.
+##### Diminishing Gains on Advanced Nodes
-- Central Processing Units (CPUs) with AI Capabilities
-- Graphics Processing Units (GPUs)
-- Digital Signal Processors (DSPs)
-- Field-Programmable Gate Arrays (FPGAs)
-- Application-Specific Integrated Circuits (ASICs)
-- Tensor Processing Units (TPUs)
-- Comparative Analysis of Different Hardware Accelerators
+While smaller process nodes benefit ASICs greatly, they provide fewer advantages for FPGAs. At 7nm and below, effects like process variation, thermal constraints, and aging disproportionately impact FPGA performance. The overhead of the configurable fabric also diminishes gains versus fixed-function ASICs.
+
+#### Case Study
+
+FPGAs have found widespread application in various fields, including medical imaging, robotics, and finance, where they excel in handling computationally intensive machine learning tasks. In the context of medical imaging, an illustrative example is the application of FPGAs for brain tumor segmentation, a traditionally time-consuming and error-prone process. For instance, Xiong et al. developed a quantized segmentation accelerator, which they retrained using the BraTS19 and BraTS20 datasets. Their work yielded remarkable results, achieving over 5x and 44x performance improvements, as well as 11x and 82x energy efficiency gains compared to GPU and CPU implementations, respectively [@xiong_mri-based_2021].
+
+### Digital Signal Processors (DSPs)
+
+Texas Instruments introduced some of the first digital signal processor chips in the late 1970s and early 1980s (["The Evolution of Audio DSPs"](https://audioxpress.com/article/the-evolution-of-audio-dsps)). Traditionally, DSPs would have logic to allow them to directly access digital/audio data in memory, perform an arithmetic operation (multiply-accumulate, or MAC, was one of the most common operations), and then write the result back to memory. The DSP would also include specialized analog components to acquire the audio data.
+
+Once we entered the smartphone era, DSPs started encompassing more sophisticated tasks. They required Bluetooth, Wi-Fi, and cellular connectivity. Media also became much more complex. Today, it’s not common to have entire chips dedicated to just DSP, but a System on Chip would include DSPs in addition to general-purpose CPUs. For example, Qualcomm’s [Hexagon Digital Signal Processor](https://developer.qualcomm.com/software/hexagon-dsp-sdk/dsp-processor) claims to be a “world-class processor with both CPU and DSP functionality to support deeply embedded processing needs of the mobile platform for both multimedia and modem functions.” [Google Tensors](https://blog.google/products/pixel/google-tensor-g3-pixel-8/), the chip in the Google Pixel phones, also includes both CPUs and specialized DSP engines.
+
+#### Advantages
+
+DSPs architecturally provide advantages in vector math throughput, low latency memory access, power efficiency, and support for diverse datatypes - making them well-suited for embedded ML acceleration.
+
+##### Optimized Architecture for Vector Math
+
+DSPs contain specialized data paths, register files, and instructions optimized specifically for vector math operations commonly used in machine learning models. This includes dot product engines, MAC units, and SIMD capabilities tailored for vector/matrix calculations. For example, the CEVA-XM6 DSP (["Ceva SensPro Fuses AI and Vector DSP"](https://www.ceva-dsp.com/wp-content/uploads/2020/04/Ceva-SensPro-Fuses-AI-and-Vector-DSP.pdf)) has 512-bit vector units to accelerate convolutions. This efficiency on vector math workloads is far beyond general CPUs.
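+
+The multiply-accumulate pattern that DSP hardware specializes in is easy to see in code. The sketch below writes out an FIR filter (a repeated dot product) explicitly in plain Python/NumPy; a vector DSP would execute many of these multiply-adds per cycle, but the computation itself is the same.
+
+```python
+import numpy as np
+
+# The multiply-accumulate (MAC) pattern at the heart of DSP workloads:
+# an FIR filter, i.e., a sliding dot product, written out explicitly.
+def fir_filter(signal, taps):
+    n, k = len(signal), len(taps)
+    out = np.zeros(n - k + 1)
+    for i in range(len(out)):
+        acc = 0.0
+        for j in range(k):          # k multiply-accumulate operations
+            acc += signal[i + j] * taps[j]
+        out[i] = acc
+    return out
+
+signal = np.random.rand(1024)
+taps = np.array([0.25, 0.5, 0.25])   # simple smoothing filter
+print(fir_filter(signal, taps)[:5])
+```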
+
+##### Low Latency On-Chip Memory
+
+DSPs integrate large amounts of fast on-chip SRAM to hold data locally for processing. Bringing memory physically closer to the computation units reduces access latency. For example, Analog Devices' SHARC+ DSP contains 10MB of on-chip SRAM. This high-bandwidth local memory provides speed advantages for real-time applications.
+
+##### Power Efficiency
+
+DSPs are engineered to provide high performance per watt on digital signal workloads. Efficient data paths, parallelism, and memory architectures enable trillions of math operations per second within tight mobile power budgets. For example, [Qualcomm's Hexagon DSP](https://developer.qualcomm.com/software/hexagon-dsp-sdk/dsp-processor) can deliver 4 trillion operations per second (4 TOPS) while consuming minimal power.
+
+##### Support for Integer and Floating Point Math
+
+Unlike GPUs which excel at single or half precision, DSPs can natively support both 8/16-bit integer and 32-bit floating point datatypes used across ML models. Some DSPs even support dot product acceleration at INT8 precision for quantized neural networks.
+
+#### Disadvantages
+
+DSPs make architectural tradeoffs that limit peak throughput, precision, and model capacity compared to other AI accelerators. But their advantages in power efficiency and integer math make them a strong edge compute option. So while DSPs provide some benefits over CPUs, they also come with limitations for machine learning workloads:
+
+##### Lower Peak Throughput than ASICs/GPUs
+
+DSPs cannot match the raw computational throughput of GPUs or customized ASICs designed specifically for machine learning. For example, Qualcomm's Cloud AI 100 ASIC delivers 480 TOPS on INT8, while their Hexagon DSP provides 10 TOPS. DSPs lack the massive parallelism of GPU SM units.
+
+##### Slower Double Precision Performance
+
+Most DSPs are not optimized for the higher-precision floating point needed in some ML models. Their dot product engines focus on INT8/16 and FP32, which provide better power efficiency, but 64-bit floating point throughput is much lower. This can limit their use in models requiring high precision.
+
+##### Constrained Model Capacity
+
+The limited on-chip memory of DSPs constrains the model sizes that can be run. Large deep learning models with hundreds of megabytes of parameters would exceed on-chip SRAM capacity. DSPs are best suited for small to mid-sized models targeted for edge devices.
+
+##### Programming Complexity
+
+Efficiently programming DSP architectures requires expertise in parallel programming and optimizing data access patterns. Their specialized microarchitectures have more learning curve than high-level software frameworks. This makes development more complex.
+
+### Graphics Processing Units (GPUs)
+
+The term graphics processing unit has existed since at least the 1980s. There had always been demand for graphics hardware both in video game consoles (high volume, needing relatively low cost) and in scientific simulations (lower volume, but needing higher resolution and tolerating a higher price point).
+
+The term was popularized, however, in 1999 when NVIDIA launched the GeForce 256 mainly targeting the PC games market sector [@lindholm_nvidia_2008]. As PC games became more sophisticated, NVIDIA GPUs became more programmable over time as well. Soon, users realized they could take advantage of this programmability and run a variety of non-graphics related workloads on GPUs and benefit from the underlying architecture. And so, starting in the late 2000s, GPUs became general-purpose graphics processing units or GP-GPUs.
+
+Intel (with its Arc Graphics line) and AMD (with the [Radeon RX](https://www.amd.com/en/graphics/radeon-rx-graphics) series) have also continued developing their own GPUs.
+
+#### Advantages
+
+##### High Computational Throughput
+
+The key advantage of GPUs is their ability to perform massively parallel floating point calculations optimized for computer graphics and linear algebra [@raina_large-scale_2009]. Modern GPUs like Nvidia's A100 offer up to 19.5 teraflops of FP32 performance with 6912 CUDA cores and 40GB of graphics memory tightly coupled with 1.6TB/s of memory bandwidth.
+
+This raw throughput stems from the highly parallel streaming multiprocessor (SM) architecture tailored for data-parallel workloads [@jia2019beyond]. Each SM contains dozens of scalar cores optimized for float32/64 math, and with over a hundred SMs per chip - thousands of cores in total - GPUs are purpose-built for the matrix multiplications and vector operations used throughout neural networks.
+
+For example, Nvidia's latest [H100](https://www.nvidia.com/en-us/data-center/h100/) GPU provides 4000 TFLOPs of FP8, 2000 TFLOPs of FP16, 1000 TFLOPs of TF32, 67 TFLOPs of FP32 and 34 TFLOPs of FP64 compute performance, which can dramatically accelerate large-batch training of models like BERT, GPT-3, and other transformer architectures. The scalable parallelism of GPUs is key to speeding up computationally intensive deep learning.
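+
+To get an intuition for what such throughput numbers mean, the sketch below estimates the ideal-case time for one large matrix multiplication at an assumed sustained rate of 100 TFLOP/s (well below the peak figures quoted above, since peak rates are rarely sustained in practice).
+
+```python
+# Ideal-case time for a matrix multiply C = A @ B with A: MxK and B: KxN.
+# A GEMM performs roughly 2*M*N*K floating point operations.
+M = N = K = 8192
+flops = 2 * M * N * K
+
+sustained_flops_per_s = 100e12   # assumed sustained rate, below quoted peaks
+print(f"GEMM work: {flops / 1e12:.1f} TFLOPs")
+print(f"Ideal time at 100 TFLOP/s: {flops / sustained_flops_per_s * 1e3:.1f} ms")
+```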
+
+##### Mature Software Ecosystem
+
+Nvidia provides extensive runtime libraries like [cuDNN](https://developer.nvidia.com/cudnn) and [cuBLAS](https://developer.nvidia.com/cublas) that are highly optimized for deep learning primitives. Frameworks like TensorFlow and PyTorch integrate with these libraries to enable GPU acceleration without requiring direct GPU programming, while CUDA provides lower-level control for custom computations.
+
+This ecosystem enables quickly leveraging GPUs via high-level Python without GPU programming expertise. Known workflows and abstractions provide a convenient on-ramp for scaling up deep learning experiments. The software maturity supplements the throughput advantages.
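+
+As an example of this on-ramp, the snippet below moves a toy PyTorch model and its inputs to a GPU if one is available; the framework then dispatches the underlying matrix multiplications to the optimized libraries described above, with no GPU-specific code from the user. The model itself is a stand-in chosen for illustration.
+
+```python
+import torch
+import torch.nn as nn
+
+# Framework-level GPU acceleration: no CUDA code is written by the user.
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
+x = torch.randn(64, 512, device=device)
+
+with torch.no_grad():
+    logits = model(x)   # matmuls dispatched to vendor-optimized kernels
+print(logits.shape, logits.device)
+```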
+
+##### Broad Availability
+
+The economies of scale of graphics processing make GPUs broadly accessible in data centers, cloud platforms like AWS and GCP, and desktop workstations. Their availability in research environments has provided a convenient platform for ML experimentation and innovation. For example, nearly every state-of-the-art deep learning result has involved GPU acceleration because of this ubiquity. The broad access supplements the software maturity to make GPUs the standard ML accelerator.
+
+##### Programmable Architecture
+
+While not as flexible as FPGAs, GPUs do provide programmability via CUDA and shader languages to customize computations. Developers can optimize data access patterns, create new ops, and tune precisions for evolving models and algorithms.
+
+#### Disadvantages
+
+While GPUs have become the standard accelerator for deep learning, their architecture also comes with some key downsides.
+
+##### Less Efficient than Custom ASICs
+
+The statement "GPUs are less efficient than ASICs" could spark intense debate within the ML/AI field and cause this book to explode 🤯.
+
+Typically, GPUs are perceived as less efficient than ASICs because the latter are custom-built for specific tasks and thus can operate more efficiently by design. GPUs, with their general-purpose architecture, are inherently more versatile and programmable, catering to a broad spectrum of computational tasks beyond ML/AI.
+
+However, modern GPUs have evolved to include specialized hardware support for essential AI operations, such as generalized matrix multiplication (GEMM) and other matrix operations, which are critical for running ML models effectively. These enhancements have significantly improved the efficiency of GPUs for AI tasks, to the point where they can rival the performance of ASICs for certain applications.
+
+Consequently, some might argue that contemporary GPUs represent a convergence of sorts, incorporating specialized, ASIC-like capabilities within a flexible, general-purpose processing framework. This adaptability has blurred the lines between the two types of hardware, with GPUs offering a strong balance of specialization and programmability that is well-suited to the dynamic needs of ML/AI research and development.
+
+##### High Memory Bandwidth Needs
+
+The massively parallel architecture requires tremendous memory bandwidth to keep thousands of cores fed with data. For example, the Nvidia A100 GPU requires 1.6TB/sec to fully saturate its compute. GPUs rely on wide 384-bit memory buses to high-bandwidth GDDR6 RAM, but even the fastest GDDR6 tops out around 1 TB/sec. This dependence on external DRAM incurs latency and power overheads.
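+
+A simple roofline-style check shows how memory bandwidth can cap effective throughput. The sketch below reuses the A100-class figures quoted earlier (19.5 FP32 TFLOPS, 1.6 TB/s) and compares two kernels with very different arithmetic intensity; the intensities themselves are illustrative approximations.
+
+```python
+# Roofline-style estimate: is a kernel compute-bound or memory-bound?
+peak_flops = 19.5e12         # FP32 peak of an A100-class GPU (from above)
+mem_bw = 1.6e12              # bytes/sec of memory bandwidth (from above)
+
+# Elementwise vector add: ~1 FLOP per 12 bytes moved (2 reads + 1 write, FP32)
+intensity = 1 / 12.0                      # FLOPs per byte
+attainable = min(peak_flops, intensity * mem_bw)
+print(f"Vector add attainable: {attainable / 1e12:.2f} TFLOP/s (memory-bound)")
+
+# Large matrix multiply reuses operands heavily, e.g. ~100 FLOPs per byte
+attainable = min(peak_flops, 100 * mem_bw)
+print(f"Large GEMM attainable: {attainable / 1e12:.2f} TFLOP/s (compute-bound)")
+```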
+
+##### Programming Complexity
+
+While tools like CUDA help, optimally mapping and partitioning ML workloads across the massively parallel GPU architecture remains challenging. Achieving both high utilization and memory locality requires low-level tuning [@jia_dissecting_2018]. Abstractions like TensorFlow can leave performance on the table.
+
+##### Limited On-Chip Memory
+
+GPUs have relatively small on-chip memory caches compared to the large working set requirements of ML models during training. They are reliant on high bandwidth access to external DRAM, which ASICs minimize with large on-chip SRAM.
+
+##### Fixed Architecture
+
+Unlike FPGAs, the fundamental GPU architecture cannot be altered post-manufacture. This constraint limits adapting to novel ML workloads or layers. The CPU-GPU boundary also creates data movement overheads.
+
+#### Case Study
+
+A prime example is the groundbreaking work by OpenAI on the GPT-3 model [@brown2020language]. GPT-3, a language model with 175 billion parameters, demonstrated unprecedented language understanding and generation capabilities. Its training, which would have taken months on conventional CPUs, was accomplished in a matter of days using powerful GPUs, pushing the boundaries of natural language processing (NLP).
+
+### Central Processing Units (CPUs)
+
+The term CPU dates back to 1955 [@weik_survey_1955], while the first microprocessor CPU, the Intel 4004, was introduced in 1971 ([Who Invented the Microprocessor?](https://computerhistory.org/blog/who-invented-the-microprocessor/)). High-level programming languages like Python, Java, or C are ultimately translated into the assembly instructions (x86, ARM, RISC-V, etc.) that CPUs execute. The set of instructions a CPU understands is called the "instruction set" and must be agreed upon by both the hardware and the software running atop it (see Section 5 for a more in-depth description of instruction set architectures, or ISAs).
+
+An overview of significant developments in CPUs:
+
+* **Single-core Era (1950s-2000):** This era saw aggressive microarchitectural improvements. Techniques like speculative execution (executing instructions before knowing whether they are needed, e.g., past a predicted branch), out-of-order execution (re-ordering instructions to use hardware more effectively), and wider issue widths (executing multiple instructions at once) were implemented to increase instruction throughput. The term "System on Chip" also originated in this era, as analog components (designed directly with transistors) and digital components (described in hardware description languages that are mapped to transistors) were put on the same die to accomplish a task.
+* **Multi-core Era (2000s):** Driven by the breakdown of Dennard scaling and slowing single-core gains, this era is marked by scaling the number of cores within a CPU. Tasks can now be split across many cores, each with its own datapath and control unit. Many of the challenges of this era concerned which resources to share, how to share them, and how to maintain coherency and consistency across all the cores.
+* **Sea of accelerators (2010s):** As general-purpose scaling continued to slow, this era is marked by offloading more complicated tasks to accelerators (widgets) attached to the main datapath in CPUs. It's common to see accelerators dedicated to various AI workloads, as well as image/digital processing and cryptography. In these designs, CPUs act more as arbiters, deciding which tasks should be processed where, rather than doing all the processing themselves. Any task could still run on the CPU rather than on an accelerator, but it would generally be slower. However, the cost of designing and especially programming accelerators became a non-trivial hurdle, which led to a spike of interest in domain-specific languages (DSLs).
+* **Presence in data centers:** Although we often hear that GPUs dominate the data center market, CPUs are still well suited for tasks that don't inherently possess a large amount of parallelism. CPUs often handle serial and small tasks and coordinate the data center as a whole.
+* **On the edge:** Given the tighter resource constraints on the edge, edge CPUs often implement only a subset of the techniques developed in the single-core era, because those optimizations tend to be heavy on power and area consumption. Edge CPUs maintain a relatively simple datapath with limited memory capacity.
+
+Traditionally, CPUs have been synonymous with general-purpose computing, a term whose meaning has shifted as the "average" consumer workload changes over time. For example, floating point hardware was once considered reserved for "scientific computing," so it was usually implemented as a co-processor (a modular component that worked in tandem with the datapath) and seldom shipped to average consumers. Compare that attitude to today, where FPUs are built into every datapath.
+
+#### Advantages
+
+While limited in raw throughput, general-purpose CPUs do provide some practical benefits for AI acceleration.
+
+##### General Programmability
+
+CPUs support diverse workloads beyond ML, providing flexible general-purpose programmability. This versatility comes from their standardized instruction sets and mature compiler ecosystems that allow running any application from databases and web servers to analytics pipelines [@Hennessy2019-je].
+
+This avoids the need for dedicated ML accelerators and enables leveraging existing CPU-based infrastructure for basic ML deployment. For example, x86 servers from vendors like Intel and AMD can run common ML frameworks using Python and TensorFlow packages alongside other enterprise workloads.
+
+##### Mature Software Ecosystem
+
+For decades, highly optimized math libraries like [BLAS](https://www.netlib.org/blas/), [LAPACK](https://hpc.llnl.gov/software/mathematical-software/lapack), and [FFTW](https://www.fftw.org/) have leveraged vectorized instructions and multithreading on CPUs [@Dongarra2009-na]. Major ML frameworks like PyTorch, TensorFlow, and SciKit-Learn are designed to integrate seamlessly with these CPU math kernels.
+
+Hardware vendors like Intel and AMD also provide low-level libraries to fully optimize performance for deep learning primitives ([AI Inference Acceleration on CPUs](https://www.intel.com/content/www/us/en/developer/articles/technical/ai-inference-acceleration-on-intel-cpus.html#gs.0w9qn2)). This robust, mature software ecosystem allows quickly deploying ML on existing CPU infrastructure.
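+
+As a small illustration, a plain NumPy matrix multiplication on a typical x86 machine already dispatches to whichever multithreaded BLAS backend is installed (e.g., OpenBLAS or MKL, depending on the environment), so reasonably fast linear algebra is available with no accelerator at all. The matrix size below is arbitrary.
+
+```python
+import time
+import numpy as np
+
+# NumPy's matmul calls into the installed BLAS library (OpenBLAS, MKL, ...),
+# which uses vectorized instructions and multiple threads on the CPU.
+a = np.random.rand(2048, 2048).astype(np.float32)
+b = np.random.rand(2048, 2048).astype(np.float32)
+
+start = time.perf_counter()
+c = a @ b
+elapsed = time.perf_counter() - start
+
+gflops = 2 * 2048**3 / elapsed / 1e9
+print(f"2048x2048 GEMM: {elapsed * 1e3:.1f} ms (~{gflops:.0f} GFLOP/s on this CPU)")
+```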
+
+##### Wide Availability
+
+The economies of scale of CPU manufacturing, driven by demand across many markets like PCs, servers, and mobile, make them ubiquitously available. Intel CPUs, for example, have powered most servers for decades [@Ranganathan2011-dc]. This wide availability in data centers reduces hardware costs for basic ML deployment.
+
+Even small embedded devices typically integrate some form of CPU, enabling edge inference. This ubiquity reduces the need to purchase specialized ML accelerators in many situations.
+
+##### Low Power for Inference
+
+Optimizations like vector extensions in ARM Neon and Intel AVX provide power efficient integer and floating point throughput optimized for "bursty" workloads like inference [@Ignatov2018-kh]. While slower than GPUs, CPU inference can be deployed in power-constrained environments. For example, ARM's Cortex-M CPUs now deliver over 1 TOPS of INT8 performance under 1W, enabling keyword spotting and vision applications on edge devices ([ARM](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/armv8_2d00_m-based-processor-software-development-hints-and-tips)).
+
+#### Disadvantages
+
+While providing some advantages, general-purpose CPUs also come with limitations for AI workloads.
+
+##### Lower Throughput than Accelerators
+
+CPUs lack the specialized architectures for massively parallel processing that GPUs and other accelerators provide. Their general-purpose design results in lower computational throughput for the highly parallelizable math operations common in ML models [@jouppi2017datacenter].
+
+##### Not Optimized for Data Parallelism
+
+The architectures of CPUs are not specifically optimized for data parallel workloads inherent to AI [@Sze2017-ak]. They allocate substantial silicon area to instruction decoding, speculative execution, caching, and flow control that provide little benefit for the array operations used in neural networks ([AI Inference Acceleration on CPUs](https://www.intel.com/content/www/us/en/developer/articles/technical/ai-inference-acceleration-on-intel-cpus.html#gs.0w9qn2)).
+
+GPU streaming multiprocessors, for example, devote most transistors to floating point units instead of complex branch prediction logic. This specialization allows much higher utilization for ML math.
+
+##### Higher Memory Latency
+
+CPUs suffer from higher latency accessing main memory relative to GPUs and other accelerators ([DDR](https://www.integralmemory.com/articles/the-evolution-of-ddr-sdram/)). Techniques like tiling and caching can help, but the physical separation from off-chip RAM bottlenecks data-intensive ML workloads. This emphasizes the need for specialized memory architectures in ML hardware.
+
+##### Power Inefficiency Under Heavy Workloads
+
+While suitable for intermittent inference, sustaining near-peak throughput for training results in inefficient power consumption on CPUs, especially mobile CPUs [@ignatov2018ai]. Accelerators explicitly optimize the dataflow, memory, and computation for sustained ML workloads. For training large models, CPUs are energy-inefficient.
+
+### Comparison
+
+| Accelerator | Description | Key Advantages | Key Disadvantages |
+|-|-|-|-|
+| ASICs | Custom ICs designed for a target workload like AI inference | Maximizes perf/watt; optimized for tensor ops; low-latency on-chip memory | Fixed architecture lacks flexibility; high NRE cost; long design cycles |
+| FPGAs | Reconfigurable fabric with programmable logic and routing | Flexible architecture; low-latency memory access | Lower perf/watt than ASICs; complex programming |
+| GPUs | Originally for graphics, now used for neural network acceleration | High throughput; parallel scalability; software ecosystem with CUDA | Not as power efficient as ASICs; require high memory bandwidth |
+| CPUs | General-purpose processors | Programmability; ubiquitous availability | Lower performance for AI workloads |
+
+In general, CPUs provide a readily available baseline, GPUs deliver broadly accessible acceleration, FPGAs offer programmability, and ASICs maximize efficiency for fixed functions. The optimal choice depends on the scale, cost, flexibility and other requirements of the target application.
+
+Although TPUs were first developed for data center deployment, Google has also put considerable effort into developing Edge TPUs. These Edge TPUs maintain the inspiration from systolic arrays but are tailored to the limited resources available at the edge.
## Hardware-Software Co-Design
-Explanation: Focusing on the synergies between hardware and software components, this section discusses the principles and techniques of hardware-software co-design to achieve optimized performance in AI systems. This information is crucial to understanding how to design powerful and efficient AI systems that leverage both hardware and software components effectively.
+Hardware-software co-design is based on the principle that AI systems achieve optimal performance and efficiency when the hardware and software components are designed in tight integration. This involves an iterative, collaborative design cycle where the hardware architecture and software algorithms are concurrently developed and refined with continuous feedback between teams.
+
+For example, a new neural network model may be prototyped on an FPGA-based accelerator platform to obtain real performance data early in the design process. These results provide feedback both to the hardware designers on potential optimizations and to the software developers on refinements to the model or framework to better leverage the hardware capabilities. This level of synergy is difficult to achieve with the common practice of developing software independently and deploying it on fixed commodity hardware.
+
+Co-design is particularly critical for embedded AI systems which face significant resource constraints like low power budgets, limited memory and compute capacity, and real-time latency requirements. Tight integration between algorithm developers and hardware architects helps unlock optimizations across the stack to meet these restrictions. Enabling techniques include algorithmic improvements like neural architecture search and pruning along with hardware advances like specialized dataflows and memory hierarchies.
+
+By bringing hardware and software design together, rather than developing them separately, holistic optimizations can be made that maximize performance and efficiency. The next sections provide more details on specific co-design approaches.
+
+### The Need for Co-Design
+
+There are several key factors that make a collaborative hardware-software co-design approach essential for building efficient AI systems.
+
+#### Increasing Model Size and Complexity
+
+State-of-the-art AI models have been rapidly growing in size, enabled by advances in neural architecture design and availability of large datasets. For example, the GPT-3 language model contains 175 billion parameters [@brown2020language], requiring huge computational resources for training. This explosion in model complexity necessitates co-design to develop efficient hardware and algorithms in tandem. Techniques like model compression [@cheng2017survey] and quantization must be co-optimized with the hardware architecture.
+
+#### Constraints of Embedded Deployment
+
+Deploying AI applications on edge devices like mobile phones or smart home appliances introduces significant constraints on resources such as energy, memory, and silicon area [@sze2017efficient]. Enabling real-time inference under these restrictions requires co-exploring hardware optimizations like specialized dataflows and compression together with efficient neural network design and pruning techniques. Co-design maximizes performance within the tight deployment constraints.
+
+#### Rapid Evolution of AI Algorithms
-- Principles of Hardware-Software Co-Design
-- Optimization Techniques
-- Integration with Embedded Systems
+The field of AI is evolving extremely rapidly, with new model architectures, training methodologies, and software frameworks constantly emerging. For example, Transformers have become hugely popular for NLP just in the last few years [@young2018recent]. Keeping pace with these algorithmic innovations requires hardware-software co-design to quickly adapt platforms and avoid accrued technical debt.
-## Acceleration Techniques
+#### Complex Hardware-Software Interactions
-Explanation: In this section, various techniques to enhance computational efficiency and reduce latency through hardware acceleration are discussed. This information is fundamental for readers to understand how to maximize the benefits of hardware acceleration in AI systems, focusing on achieving superior computational performance.
+There are many subtle interactions and tradeoffs between hardware architectural choices and software optimizations that have significant impacts on overall efficiency. For instance, techniques like tensor partitioning and batching affect parallelism. Data access patterns impact memory utilization. Co-design provides a cross-layer perspective to unravel these dependencies.
-- Parallel Computing
-- Pipeline Computing
-- Memory Hierarchy Optimization
-- Instruction Set Optimization
+#### Need for Specialization
-## Tools and Frameworks
+AI workloads benefit from specialized operations like low precision math and customized memory hierarchies. This motivates incorporating custom hardware tailored to neural network algorithms rather than relying solely on flexible software running on generic hardware [@sze2017efficient]. But to realize the benefits, the software stack must explicitly target the custom hardware operations.
-Explanation: This section introduces readers to an array of tools and frameworks available for facilitating work with hardware accelerators. It is essential for practical applications to help readers understand the resources they have at their disposal for implementing and optimizing hardware-accelerated AI systems.
+#### Demand for Higher Efficiency
-- Software Tools for Hardware Acceleration
-- Development Environments
-- Libraries and APIs
+With growing model complexity, there are diminishing returns and overhead from optimizing only the hardware or software in isolation [@putnam_reconfigurable_2014]. Inevitable tradeoffs arise that require a global optimization across layers. Jointly co-designing hardware and software provides large compound efficiency gains.
-## Case Studies
+### Principles of Hardware-Software Co-Design
-Explanation: Providing real-world case studies offers practical insights and lessons from actual hardware-accelerated AI implementations. This section helps readers bridge theory with practice by demonstrating potential benefits and challenges in real-world scenarios, and offers a practical perspective on the topics discussed.
+To build high-performance and efficient AI systems, there must be tight integration and co-optimization between the underlying hardware architecture and software stack. Neither can be designed in isolation - maximizing their synergies requires a holistic approach known as hardware-software co-design.
-- Real-world Applications
-- Case Study 1: Implementing Neural Networks on FPGAs
-- Case Study 2: Optimizing Performance with GPUs
-- Lessons Learned from Case Studies
+The key goal is tailoring the hardware capabilities to match the algorithms and workloads run by the software. This requires a feedback loop between hardware architects and software developers to converge on optimized solutions. Several techniques enable effective co-design:
+
+#### Hardware-Aware Software Optimization
+
+The software stack can be optimized to better leverage the underlying hardware capabilities:
+
+- **Model Parallelism:** Parallelize matrix computations like convolution or attention layers to maximize throughput on vector engines.
+- **Memory Optimization:** Tune data layouts to improve cache locality based on hardware profiling. This maximizes reuse and minimizes expensive DRAM access.
+- **Custom Operations:** Incorporate specialized ops like low precision INT4 or bfloat16 into models to capitalize on dedicated hardware support.
+- **Dataflow Mapping:** Explicitly map model stages to computational units to optimize data movement on hardware.
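+
+As a concrete illustration of hardware-aware software optimization, the minimal sketch below uses PyTorch's automatic mixed precision to run matrix-heavy layers in bfloat16 where the underlying accelerator supports it, falling back to float32 elsewhere; the model, shapes, and batch size are arbitrary placeholders.
+
+```python
+import torch
+import torch.nn as nn
+
+# Placeholder model: two large linear layers dominated by matrix multiplies.
+model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
+device = "cuda" if torch.cuda.is_available() else "cpu"
+model = model.to(device)
+
+x = torch.randn(64, 1024, device=device)
+
+# autocast runs eligible ops (matmuls, convolutions) in bfloat16 on hardware
+# with native support, while keeping numerically sensitive ops in float32.
+with torch.autocast(device_type=device, dtype=torch.bfloat16):
+    y = model(x)
+
+print(y.dtype)  # torch.bfloat16 on backends with autocast support
+```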
+
+#### Algorithm-Driven Hardware Specialization
+
+Hardware can be tailored to better suit the characteristics of ML algorithms:
+
+- **Custom Datatypes:** Support low precision INT8/4 or bfloat16 in hardware for higher arithmetic density.
+- **On-Chip Memory:** Increase SRAM bandwidth and lower access latency to match model memory access patterns.
+- **Domain-Specific Ops:** Add hardware units for key ML functions like FFTs or matrix multiplication to reduce latency and energy.
+- **Model Profiling:** Use model simulation and profiling to identify computational hotspots and guide hardware optimization.
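+
+Because profiling drives this feedback loop, a small example helps. The hedged sketch below uses PyTorch's built-in profiler to surface the operator-level hotspots (for instance, convolutions) that might justify dedicated hardware support; the model is an arbitrary stand-in.
+
+```python
+import torch
+import torch.nn as nn
+from torch.profiler import profile, ProfilerActivity
+
+# Placeholder network dominated by convolutions.
+model = nn.Sequential(
+    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
+    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
+)
+x = torch.randn(8, 3, 224, 224)
+
+activities = [ProfilerActivity.CPU]
+if torch.cuda.is_available():
+    activities.append(ProfilerActivity.CUDA)
+    model, x = model.cuda(), x.cuda()
+
+# Record per-operator time to identify computational hotspots.
+with profile(activities=activities) as prof:
+    model(x)
+
+# The hottest operators are natural candidates for hardware specialization.
+print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))
+```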
+
+The key is collaborative feedback - insights from hardware profiling
+guide software optimizations, while algorithmic advances inform hardware
+specialization. This mutual enhancement provides multiplicative
+efficiency gains compared to isolated efforts.
+
+#### Algorithm-Hardware Co-exploration
+
+Jointly exploring innovations in neural network architectures along with custom hardware design is a powerful co-design technique. This allows finding ideal pairings tailored to each other's strengths [@sze2017efficient].
+
+For instance, the shift to mobile architectures like MobileNets [@howard_mobilenets_2017] was guided by edge device constraints like model size and latency. The quantization [@jacob2018quantization] and pruning techniques [@gale2019state] that unlocked these efficient models became possible thanks to hardware accelerators with native low-precision integer support.
+
+Attention-based models have thrived on massively parallel GPUs and ASICs where their computation maps well spatially, as opposed to RNN architectures reliant on sequential processing. Co-evolution of algorithms and hardware unlocked new capabilities.
+
+Effective co-exploration requires close collaboration between algorithm researchers and hardware architects. Rapid prototyping on FPGAs [@zhang2015fpga] or specialized AI simulators allows quickly evaluating different pairings of model architectures and hardware designs pre-silicon.
+
+For example, Google's TPU architecture evolved in conjunction with optimizations to TensorFlow models to maximize performance on image classification. This tight feedback loop yielded models tailored for the TPU that would have been unlikely in isolation.
+
+Studies have shown 2-5x higher performance and efficiency gains with algorithm-hardware co-exploration compared to isolated algorithm or hardware optimization efforts [@suda2016throughput]. Parallelizing the joint development also reduces time-to-deployment.
+
+Overall, exploring the tight interdependencies between model innovation and hardware advances unlocks opportunities not visible when tackled sequentially. This synergistic co-design yields solutions greater than the sum of their parts.
+
+### Challenges
+
+While collaborative co-design can improve efficiency, adaptability, and time-to-market, it also comes with engineering and organizational challenges.
+
+#### Increased Prototyping Costs
+
+More extensive prototyping is required to evaluate different hardware-software pairings. The need for rapid, iterative prototypes on FPGAs or emulators increases validation overhead. For example, Microsoft found that co-designing an AI accelerator required more prototypes than a sequential design flow [@fowers2018configurable].
+
+#### Team and Organizational Hurdles
+
+Co-design requires close coordination between traditionally disconnected hardware and software groups. This could introduce communication issues or misaligned priorities and schedules. Navigating different engineering workflows is also challenging. Some organizational inertia to adopting integrated practices may exist.
+
+#### Simulation and Modeling Complexity
+
+Capturing subtle interactions between hardware and software layers for joint simulation and modeling adds significant complexity. Accurate cross-layer models are difficult to construct before implementation, which makes holistic optimizations harder to quantify ahead of time.
+
+#### Over-Specialization Risks
+
+Tight co-design bears the risk of overfitting optimizations to current algorithms, sacrificing generality. For example, hardware tuned exclusively for Transformer models could underperform on future techniques. Maintaining flexibility requires foresight.
+
+#### Adoption Challenges
+
+Engineers comfortable with established discrete hardware or software design practices may resist adopting unfamiliar collaborative workflows. Projects could face friction in transitioning to co-design, despite long-term benefits.
+
+## Software for AI Hardware
+
+By now it should be clear that specialized hardware accelerators like GPUs, TPUs, and FPGAs are essential to delivering high-performance artificial intelligence applications. But leveraging these hardware platforms effectively requires an extensive software stack spanning the entire development and deployment lifecycle. Frameworks and libraries form the backbone of AI hardware, offering robust, pre-built code, algorithms, and functions specifically optimized to perform a wide array of AI tasks on the different hardware platforms. They are designed to simplify the complexities involved in programming the hardware from scratch, which can be time-consuming and error-prone. Software plays an important role in the following:
+
+- Providing programming abstractions and models like CUDA and OpenCL to map computations onto accelerators.
+- Integrating accelerators into popular deep learning frameworks like TensorFlow and PyTorch.
+- Compilers and tools to optimize across the hardware-software stack.
+- Simulation platforms to model hardware and software together.
+- Infrastructure to manage deployment on accelerators.
+
+```{mermaid}
+%%| label: fig-ai-stack
+%%| fig-cap: AI software stack spanning development, optimization, simulation, and deployment
+%%| fig-align: center
+%%| fig-alt: AI software stack spanning development, optimization, simulation, and deployment
+flowchart TD
+ A(Model Development) --> B(Model Optimization)
+ B --> C(Deployment onto Simulator) & D(Deployment onto Hardware)
+```
+
+This expansive software ecosystem is as important as the hardware itself in delivering performant and efficient AI applications. This section provides an overview of the tools available at each layer of the stack shown in @fig-ai-stack to enable developers to build and run AI systems powered by hardware acceleration.
+
+### Programming Models {#sec-programming-models}
+
+Programming models provide abstractions to map computations and data onto heterogeneous hardware accelerators:
+
+- **[CUDA](https://developer.nvidia.com/cuda-toolkit):** Nvidia's parallel programming model to leverage GPUs using extensions to languages like C/C++. Allows launching kernels across GPU cores [@luebke2008cuda].
+- **[OpenCL](https://www.khronos.org/opencl/):** Open standard for writing programs spanning CPUs, GPUs, FPGAs and other accelerators. Specifies a heterogeneous computing framework [@munshi2009opencl].
+- **[OpenGL/WebGL](https://www.opengl.org):** 3D graphics programming interfaces that can map general-purpose code to GPU cores [@segal1999opengl].
+- **[Verilog](https://www.verilog.com)/VHDL**: Hardware description languages (HDLs) used to configure FPGAs as AI accelerators by specifying digital circuits [@gannot1994verilog].
+- **[TVM](https://tvm.apache.org):** Compiler framework providing Python frontend to optimize and map deep learning models onto diverse hardware back-ends [@chen2018tvm].
+
+Key challenges include expressing parallelism, managing memory across devices, and matching algorithms to hardware capabilities. Abstractions must balance portability with allowing hardware customization. Programming models enable developers to harness accelerators without hardware expertise. More of these details are discussed in the [AI frameworks](frameworks.qmd) section.
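+
+To make the kernel-launch abstraction concrete, the sketch below expresses a CUDA-style data-parallel kernel through Numba's Python bindings. It assumes a CUDA-capable GPU and the `numba` package, and a production version would manage device memory explicitly rather than relying on implicit transfers.
+
+```python
+import numpy as np
+from numba import cuda
+
+@cuda.jit
+def vector_add(a, b, out):
+    i = cuda.grid(1)          # global thread index
+    if i < out.size:          # guard against out-of-range threads
+        out[i] = a[i] + b[i]
+
+n = 1 << 20
+a = np.random.rand(n).astype(np.float32)
+b = np.random.rand(n).astype(np.float32)
+out = np.zeros_like(a)
+
+threads_per_block = 256
+blocks = (n + threads_per_block - 1) // threads_per_block
+
+# Launch the kernel across many GPU threads (implicit host<->device copies).
+vector_add[blocks, threads_per_block](a, b, out)
+assert np.allclose(out, a + b)
+```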
+
+### Libraries and Runtimes
+
+Specialized libraries and runtimes provide software abstractions to access and maximize utilization of AI accelerators:
+
+- **Math Libraries:** Highly optimized implementations of linear algebra primitives like GEMM, FFTs, convolutions etc. tailored to target hardware. [Nvidia cuBLAS](https://developer.nvidia.com/cublas), [Intel MKL](https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html), and [Arm compute libraries](https://www.arm.com/technologies/compute-library) are examples.
+- **Framework Integrations:** Libraries to accelerate deep learning frameworks like TensorFlow, PyTorch, and MXNet on supported hardware. For example, [cuDNN](https://developer.nvidia.com/cudnn) for accelerating CNNs on Nvidia GPUs.
+- **Runtimes:** Software to handle execution on accelerators, including scheduling, synchronization, memory management and other tasks. [Nvidia TensorRT](https://developer.nvidia.com/tensorrt) is an inference optimizer and runtime.
+- **Drivers and Firmware:** Low-level software to interface with hardware, initialize devices, and handle execution. Vendors like Xilinx provide drivers for their accelerator boards.
+
+For instance, PyTorch integrates the cuDNN and cuBLAS libraries to accelerate training on Nvidia GPUs. The TensorFlow XLA runtime optimizes and compiles models for accelerators like TPUs. Drivers initialize devices and offload operations.
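+
+To give a feel for how thin this layer looks from the developer's side, the hedged sketch below queries and tunes a few of the knobs PyTorch exposes over cuDNN and cuBLAS; the options that actually apply depend on the installed build and hardware.
+
+```python
+import torch
+
+# Query the vendor libraries this PyTorch build links against.
+print("cuDNN available:", torch.backends.cudnn.is_available())
+print("cuDNN version:  ", torch.backends.cudnn.version())
+
+if torch.cuda.is_available():
+    # Let cuDNN benchmark and cache the fastest convolution algorithm
+    # for the input shapes it observes.
+    torch.backends.cudnn.benchmark = True
+    # Allow TF32 tensor-core math in cuBLAS matmuls on supporting GPUs.
+    torch.backends.cuda.matmul.allow_tf32 = True
+```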
+
+The challenges include efficiently partitioning and scheduling workloads across heterogeneous devices like multi-GPU nodes. Runtimes must also minimize overhead of data transfers and synchronization.
+
+Libraries, runtimes and drivers provide optimized building blocks that deep learning developers can leverage to tap into accelerator performance without hardware programming expertise. Their optimization is essential for production deployments.
+
+### Optimizing Compilers
+
+Optimizing compilers play a key role in extracting maximum performance and efficiency from hardware accelerators for AI workloads. They apply optimizations spanning algorithmic changes, graph-level transformations, and low-level code generation.
+
+- **Algorithm Optimization:** Techniques like quantization, pruning, and neural architecture search to enhance model efficiency and match hardware capabilities.
+- **Graph Optimizations:** Graph-level optimizations like operator fusion, rewriting, and layout transformations to optimize performance on target hardware.
+- **Code Generation:** Generating optimized low-level code for accelerators from high-level models and frameworks.
+
+For example, the TVM open compiler stack can apply quantization to a BERT model targeting Arm GPUs. It fuses pointwise convolution operations and transforms the weight layout to optimize memory access. Finally, it emits optimized OpenCL code to run the workload on the GPU.
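+
+A hedged sketch of such a flow using TVM's Relay-based Python API is shown below; the ONNX file, input name, shape, and target string are placeholders, and the exact API surface varies across TVM releases.
+
+```python
+import onnx
+import tvm
+from tvm import relay
+
+# Import a pre-exported model (placeholder path and input name) into Relay.
+onnx_model = onnx.load("model.onnx")
+mod, params = relay.frontend.from_onnx(onnx_model, shape={"input": (1, 3, 224, 224)})
+
+# Graph-level optimizations (operator fusion, layout transforms, constant
+# folding) are applied at opt_level=3 before low-level code generation.
+target = "llvm"  # e.g. "cuda" or "opencl" for GPU back-ends
+with tvm.transform.PassContext(opt_level=3):
+    lib = relay.build(mod, target=target, params=params)
+
+# The compiled module can then be loaded by the TVM runtime for inference.
+```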
+
+Key compiler optimizations include maximizing parallelism, improving data locality and reuse, minimizing memory footprint, and exploiting custom hardware operations. Compilers build and optimize machine learning workloads holistically across hardware components like CPUs, GPUs, and other accelerators.
+
+However, mapping complex models efficiently introduces challenges, such as partitioning workloads across heterogeneous devices. Production-level compilers also require extensive tuning time on representative workloads. Still, optimizing compilers are indispensable for unlocking the full capabilities of AI accelerators.
+
+### Simulation and Modeling
+
+Simulation software is important in hardware-software co-design. It enables joint modeling of proposed hardware architectures and software stacks:
+
+- **Hardware Simulation:** Platforms like [Gem5](https://www.gem5.org) allow detailed simulation of hardware components like pipelines, caches, interconnects, and memory hierarchies. Engineers can model hardware changes without physical prototyping [@binkert2011gem5].
+- **Software Simulation:** Compiler stacks like [TVM](https://tvm.apache.org) support simulation of machine learning workloads to estimate performance on target hardware architectures. This assists with software optimizations.
+- **Co-simulation:** Unified platforms like SCALE-Sim [@samajdar2018scale] integrate hardware and software simulation into a single tool. This enables what-if analysis to quantify the system-level impact of cross-layer optimizations early in the design cycle.
+
+For example, an FPGA-based AI accelerator design could be simulated using Verilog hardware description language and synthesized into a Gem5 model. Verilog is well-suited for describing the digital logic and interconnects that make up the accelerator architecture. Using Verilog allows the designer to specify the datapaths, control logic, on-chip memories, and other components that will be implemented in the FPGA fabric. Once the Verilog design is complete, it can be synthesized into a model that simulates the behavior of the hardware, such as using the Gem5 simulator. Gem5 is useful for this task because it allows modeling of full systems including processors, caches, buses, and custom accelerators. Gem5 supports interfacing Verilog models of hardware to the simulation, enabling unified system modeling.
+
+The synthesized FPGA accelerator model could then have ML workloads simulated using TVM compiled onto it within the Gem5 environment for unified modeling. TVM allows optimized compilation of ML models onto heterogeneous hardware like FPGAs. Running TVM-compiled workloads on the accelerator within the Gem5 simulation provides an integrated way to validate and refine the hardware design, software stack, and system integration before ever needing to physically realize the accelerator on a real FPGA.
+
+This type of co-simulation provides estimates of overall metrics like throughput, latency, and power to guide co-design before expensive physical prototyping. It also assists with partitioning optimizations between hardware and software to guide design tradeoffs.
+
+However, limitations exist in accurately modeling subtle low-level interactions between components. Simulations provide estimates but cannot wholly replace physical prototypes and testing. Still, unified simulation and modeling provides invaluable early insight into system-level optimization opportunities during the co-design process.
+
+## Benchmarking AI Hardware
+
+Benchmarking is a critical process that quantifies and compares the performance of various hardware platforms designed to speed up artificial intelligence applications. It guides purchasing decisions, development focus, and performance optimization efforts for both hardware manufacturers and software developers.
+
+The [benchmarking chapter](benchmarking.qmd) explores this topic in great detail and why it has become an indispensable part of the AI hardware development cycle and how it impacts the broader technology landscape. Here, we will briefly review the main concepts but refer you to the chapter for more details.
+
+Benchmarking suites such as MLPerf, Fathom, and AI Benchmark offer a set of standardized tests that can be used across different hardware platforms. These suites measure AI accelerator performance across various neural networks and machine learning tasks, from basic image classification to complex language processing. By providing a common ground for comparison, they help ensure that performance claims are consistent and verifiable. These "tools" are applied not only to guide the development of hardware but also to ensure that the software stack leverages the full potential of the underlying architecture.
+
+- **MLPerf**: Includes a broad set of benchmarks covering both training [@mattson2020mlperf] and inference [@reddi2020mlperf] for a range of machine learning tasks.
+- **Fathom**: Focuses on core operations found in deep learning models, emphasizing their execution on different architectures [@adolf2016fathom].
+- **AI Benchmark**: Targets mobile and consumer devices, assessing AI performance in end-user applications [@ignatov2018ai].
+
+Benchmarks also have performance metrics that are the quantifiable measures used to evaluate the effectiveness of AI accelerators. These metrics provide a comprehensive view of an accelerator's capabilities and are used to guide the design and selection process for AI systems. Common metrics include:
+
+- **Throughput**: Usually measured in operations per second, this metric indicates the volume of computations an accelerator can handle.
+- **Latency**: The time delay from input to output in a system, vital for real-time processing tasks.
+- **Energy Efficiency**: Calculated as computations per watt, representing the trade-off between performance and power consumption.
+- **Cost Efficiency**: This evaluates the cost of operation relative to performance, an essential metric for budget-conscious deployments.
+- **Accuracy**: Particularly in inference tasks, the precision of computations is critical and sometimes balanced against speed.
+- **Scalability**: The ability of the system to maintain performance gains as the computational load scales up.
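+
+As a toy illustration of how throughput and latency fall out of raw timings, the sketch below benchmarks a single matrix multiplication in NumPy; real suites such as MLPerf wrap the same idea in standardized workloads, accuracy targets, and run rules.
+
+```python
+import time
+import numpy as np
+
+batch, dim = 64, 1024
+a = np.random.rand(batch, dim).astype(np.float32)
+w = np.random.rand(dim, dim).astype(np.float32)
+
+_ = a @ w                      # warm-up run
+runs = 100
+start = time.perf_counter()
+for _ in range(runs):
+    _ = a @ w
+elapsed = time.perf_counter() - start
+
+latency_ms = elapsed / runs * 1e3        # time per batch
+throughput = runs * batch / elapsed      # samples per second
+gflops = 2 * batch * dim * dim * runs / elapsed / 1e9  # sustained compute rate
+
+print(f"latency: {latency_ms:.2f} ms/batch, "
+      f"throughput: {throughput:.0f} samples/s, {gflops:.1f} GFLOP/s")
+```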
+
+Benchmark results give insights beyond raw numbers - they can reveal bottlenecks in the software and hardware stack. For example, benchmarks may show how a larger batch size improves GPU utilization by exposing more parallelism, or how compiler optimizations boost TPU performance. These learnings enable continuous optimization [@jia2019beyond].
+
+Standardized benchmarking provides quantified, comparable evaluation of AI accelerators to inform design, purchasing, and optimization. But real-world performance validation remains essential as well [@zhu2018benchmarking].
## Challenges and Solutions
-Explanation: This segment discusses the prevalent challenges encountered in implementing hardware acceleration in AI systems and proposes potential solutions. It equips readers with a realistic view of the complexities involved and guides them in overcoming common hurdles.
+### Portability/Compatibility Issues
+
+AI accelerators offer impressive performance improvements, but their integration into the broader AI landscape is often hindered by significant portability and compatibility challenges. The crux of the issue lies in the diversity of the AI ecosystem — a vast array of machine learning frameworks and programming languages exists, each with its unique features and requirements.
+
+Developers frequently encounter difficulties when attempting to transfer their AI models from one hardware environment to another. For example, a machine learning model developed for a desktop environment in Python using the PyTorch framework, optimized for an Nvidia GPU, may not easily transition to a more constrained device such as the Arduino Nano 33 BLE. This complexity stems from stark differences in programming requirements — Python and PyTorch on the desktop versus a C++ environment on an Arduino, not to mention the shift from x86 architecture to ARM ISA.
+
+These divergences highlight the intricacy of portability within AI systems. Moreover, the rapid advancement in AI algorithms and models means that hardware accelerators must continually adapt, creating a moving target for compatibility. The absence of universal standards and interfaces compounds the issue, making it challenging to deploy AI solutions consistently across various devices and platforms.
+
+#### Solutions and Strategies
+
+To address these hurdles, the AI industry is moving towards several solutions:
+
+##### Standardization Initiatives
+
+The Open Neural Network Exchange (ONNX) is at the forefront of this pursuit, proposing an open and shared ecosystem that promotes model interchangeability. ONNX facilitates the use of AI models across various frameworks, allowing for models trained in one environment to be efficiently deployed in another, which significantly reduces the need for time-consuming rewrites or adjustments.
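+
+A minimal sketch of what this interchange looks like in practice is exporting a small PyTorch model to ONNX so that any ONNX-aware runtime or accelerator toolchain can consume it; the model and file name here are placeholders.
+
+```python
+import torch
+import torch.nn as nn
+
+# Placeholder model standing in for a trained network.
+model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
+model.eval()
+
+dummy_input = torch.randn(1, 32)
+
+# Export to the framework-neutral ONNX format.
+torch.onnx.export(
+    model,
+    dummy_input,
+    "model.onnx",
+    input_names=["input"],
+    output_names=["logits"],
+)
+# model.onnx can now be loaded by ONNX Runtime, TensorRT, TVM, and other
+# ONNX-compatible back-ends.
+```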
+
+##### Cross-Platform Frameworks
+
+Complementing the standardization efforts, cross-platform frameworks such as TensorFlow Lite and PyTorch Mobile have been developed specifically to create cohesion between diverse computational environments ranging from desktops to mobile and embedded devices. These frameworks offer streamlined, lightweight versions of their parent frameworks, ensuring compatibility and functional integrity across different hardware types without sacrificing performance. This ensures that developers can create applications with the confidence that they will work on a multitude of devices, bridging a gap that has traditionally posed a considerable challenge in AI development.
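+
+For example, converting a Keras model into the compact TensorFlow Lite format for a mobile or embedded target might look roughly like the following; the toy model is a placeholder and the optimization flag is optional.
+
+```python
+import tensorflow as tf
+
+# Placeholder Keras model standing in for a trained network.
+model = tf.keras.Sequential([
+    tf.keras.layers.Input(shape=(32,)),
+    tf.keras.layers.Dense(64, activation="relu"),
+    tf.keras.layers.Dense(10),
+])
+
+# Convert to TensorFlow Lite, letting the converter apply its default
+# size and latency optimizations.
+converter = tf.lite.TFLiteConverter.from_keras_model(model)
+converter.optimizations = [tf.lite.Optimize.DEFAULT]
+tflite_model = converter.convert()
+
+with open("model.tflite", "wb") as f:
+    f.write(tflite_model)
+```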
+
+##### Hardware-agnostic Platforms
+
+The rise of hardware-agnostic platforms has also played a pivotal role in democratizing the use of AI. By creating environments where AI applications can be executed on various accelerators, these platforms remove the burden of hardware-specific coding from developers. This abstraction not only simplifies the development process but also opens up new possibilities for innovation and application deployment, free from the constraints of hardware specifications.
+
+##### Advanced Compilation Tools
+
+In addition, the advent of advanced compilation tools like TVM—an end-to-end tensor compiler—offers an optimized path through the jungle of diverse hardware architectures. TVM equips developers with the means to fine-tune machine learning models for a broad spectrum of computational substrates, ensuring optimal performance and avoiding the need for manual model adjustment each time there is a shift in the underlying hardware.
+
+##### Community and Industry Collaboration
+
+The role of collaboration between open-source communities and industry consortia cannot be overstated. These collective bodies are instrumental in forming shared standards and best practices that all developers and manufacturers can adhere to. Such collaboration fosters a more unified and synergistic AI ecosystem, significantly diminishing the prevalence of portability issues and smoothing the path toward global AI integration and advancement. Through these combined efforts, the field of AI is steadily moving toward a future where seamless model deployment across various platforms becomes the standard rather than the exception.
+
+Solving the portability challenges is crucial for the AI field to realize the full potential of hardware accelerators in a dynamic and diverse technological landscape. It requires a concerted effort from hardware manufacturers, software developers, and standard bodies to create a more interoperable and flexible environment. With continued innovation and collaboration, the AI community can pave the way for seamless integration and deployment of AI models across a multitude of platforms.
+
+### Power Consumption Concerns
+
+Power consumption is a crucial issue in the development and operation of data center AI accelerators, like Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs) [@Norman2017TPUv1] [@Norrie2021TPUv2_3] [@Jouppi2023TPUv4]. These powerful components are the backbone of contemporary AI infrastructure, but their high energy demands contribute to the environmental impact of technology and drive up operational costs significantly. As data processing needs become more complex, with the popularity of AI and deep learning increasing, there's a pressing demand for GPUs and TPUs that can deliver the necessary computational power more efficiently. The impact of such advancements is two-fold: they can lower the environmental footprint of these technologies and also reduce the cost of running AI applications.
+
+Emerging hardware technologies are at the cusp of revolutionizing power efficiency in this sector. Photonic computing, for instance, uses light rather than electricity to carry information, offering a promise of high-speed processing with a fraction of the power usage. We delve deeper into this and other innovative technologies in the "Emerging Hardware Technologies" section, exploring their potential to address current power consumption challenges.
+
+At the edge of the network, AI accelerators are engineered to process data on devices like smartphones, IoT sensors, and smart wearables. These devices often work under severe power limitations, necessitating a careful balancing act between performance and power usage. A high-performance AI model may provide quick results but at the cost of depleting battery life swiftly and increasing thermal output, which may affect the device's functionality and durability. The stakes are higher for devices deployed in remote or hard-to-reach areas, where consistent power supply cannot be guaranteed, underscoring the need for low-power consuming solutions.
+
+The challenge of power efficiency at the edge is further compounded by latency issues. Edge AI applications in fields such as autonomous driving and healthcare monitoring require not just speed but also precision and reliability, as delays in processing can lead to serious safety risks. For these applications, developers are compelled to optimize both the AI algorithms and the hardware design to strike an optimal balance between power consumption and latency.
+
+This optimization effort is not just about making incremental improvements to existing technologies; it’s about rethinking how and where we process AI tasks. By designing AI accelerators that are both power-efficient and capable of quick processing, we can ensure these devices serve their intended purposes without unnecessary energy use or compromised performance. Such developments could propel the widespread adoption of AI across various sectors, enabling smarter, safer, and more sustainable use of technology.
+
+### Overcoming Resource Constraints
+
+Resource constraints also pose a significant challenge for Edge AI accelerators, as these specialized hardware and software solutions must deliver robust performance within the limitations of edge devices. Due to power and size limitations, edge AI accelerators often have restricted computation, memory, and storage capacity [@lin2022ondevice]. This scarcity of resources necessitates a careful allocation of processing capabilities to execute machine learning models efficiently.
+
+Moreover, managing constrained resources demands innovative approaches, including model quantization [@lin2023awq] [@Li2020Additive], pruning [@wang2020apq], and optimizing inference pipelines. Edge AI accelerators must strike a delicate balance between providing meaningful AI functionality and not exhausting the available resources, all while maintaining low power consumption. Overcoming these resource constraints is crucial to ensure the successful deployment of AI at the edge, where many applications, from IoT to mobile devices, rely on the efficient use of limited hardware resources to deliver real-time and intelligent decision-making.
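+
+To give a flavor of what quantization buys under such constraints, the sketch below applies simple symmetric post-training quantization to a weight matrix in NumPy, shrinking it from 32-bit floats to 8-bit integers at the cost of a small reconstruction error; production toolchains implement far more sophisticated schemes.
+
+```python
+import numpy as np
+
+# Placeholder float32 weights standing in for a trained layer.
+w = np.random.randn(256, 256).astype(np.float32)
+
+# Symmetric per-tensor quantization to signed 8-bit integers.
+scale = np.abs(w).max() / 127.0
+w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
+
+# Dequantize to inspect the approximation error.
+w_hat = w_int8.astype(np.float32) * scale
+
+print(f"memory: {w.nbytes // 1024} KiB -> {w_int8.nbytes // 1024} KiB")
+print(f"max absolute reconstruction error: {np.abs(w - w_hat).max():.4f}")
+```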
+
+## Emerging Technologies
+
+Thus far we have discussed AI hardware technology in the context of conventional von Neumann architecture design and CMOS-based implementation. These specialized AI chips offer benefits like higher throughput and power efficiency but rely on traditional computing principles. The relentless growth in demand for AI compute power is driving innovations in integration methods for AI hardware.
+
+Two leading approaches have emerged for maximizing compute density - wafer-scale integration and chiplet-based architectures - which we discuss in this section. Looking much further ahead, we will also examine emerging technologies that diverge from conventional architectures and take fundamentally different approaches to AI-specialized computing.
+
+Some of these unconventional paradigms include neuromorphic computing which mimics biological neural networks, quantum computing that leverages quantum mechanical effects, and optical computing utilizing photons instead of electrons. Beyond novel computing substrates, new device technologies are enabling additional gains through better memory and interconnect.
+
+Examples include memristors for in-memory computing and nanophotonics for integrated photonic communication. Together, these technologies offer the potential for orders of magnitude improvements in speed, efficiency, and scalability compared to current AI hardware. We will examine these in this section.
+
+### Integration Methods
+
+Integration methods refer to the approaches used to combine and interconnect the various computational and memory components in an AI chip or system. The goal of integration is to maximize performance, power efficiency, and density by closely linking the key processing elements.
+
+In the past, AI compute was primarily performed on CPUs and GPUs built using conventional integration methods. These discrete components were manufactured separately then connected together on a board. However, this loose integration creates bottlenecks like data transfer overheads.
+
+As AI workloads have grown, there is increasing demand for tighter integration between compute, memory, and communication elements. Some key drivers of integration include:
+
+- **Minimizing data movement:** Tight integration reduces latency and power for moving data between components. This improves efficiency.
+- **Customization:** Tailoring all components of a system to AI workloads allows optimizations throughout the hardware stack.
+- **Parallelism:** Integrating a large number of processing elements enables massively parallel computation.
+- **Density:** Tighter integration allows packing more transistors and memory into a given area.
+- **Cost:** Economies of scale from large integrated systems can reduce costs.
+
+In response, new manufacturing techniques like wafer-scale fabrication and advanced packaging now allow much higher levels of integration. The goal is to create unified, specialized AI compute complexes tailored for deep learning and other AI algorithms. Tighter integration is key to delivering the performance and efficiency needed for the next generation of AI.
+
+#### Wafer-scale AI
-- Portability/Compatibility Issues
-- Power Consumption Concerns
-- Latency Reduction
-- Overcoming Resource Constraints
+Wafer-scale AI takes an extremely integrated approach, manufacturing an entire silicon wafer as one gigantic chip. This differs drastically from conventional CPUs and GPUs, where each wafer is cut into many smaller individual chips. While some GPUs may contain billions of transistors, they still pale in comparison to a wafer-scale chip with over a trillion transistors.
-## Emerging Hardware Technologies and Future Trends
+The wafer-scale approach also diverges from more modular system-on-chip designs that still have discrete components communicating by bus. Instead, wafer-scale AI enables full customization and tight integration of computation, memory, and interconnects across the entire die.
-Explanation: Discussing emerging technologies and trends, this section offers readers a glimpse into the future developments in the field of embedded hardware. This is vital to help readers stay abreast of the evolving landscape and potentially guide research and development efforts in the sector.
+![Comparing a wafer-scale AI chip with a conventional GPU die](images/hw_acceleration/aimage1.png)
-- Optimization Techniques for New Hardware
-- Flexible Electronics
-- Neuromorphic Computing
-- In-Memory Computing
-- ...
-- Challenges with Scalability and Hardware-Software Integration
-- Next-Generation Hardware Trends and Innovations
+By designing the wafer as one integrated logic unit, data transfer between elements is minimized. This provides lower latency and power consumption compared to discrete system-on-chip or chiplet designs. While chiplets can offer flexibility by mixing and matching components, communication between chiplets is a challenge. The monolithic nature of wafer-scale integration eliminates these inter-chip communication bottlenecks.
+
+However, the ultra-large scale also poses difficulties for manufacturability and yield with wafer-scale designs. Defects in any region of the wafer can render parts of the chip unusable, and specialized lithography techniques are required to produce such large dies. So wafer-scale integration pursues the maximum performance gains from integration but requires overcoming substantial fabrication challenges. The following video provides additional context.
+
+{{< video https://www.youtube.com/watch?v=Fcob512SJz0 >}}
+
+
+#### Chiplets for AI
+
+[Chiplets](https://en.wikipedia.org/wiki/Chiplet) are smaller independent dies that are manufactured separately then interconnected on a substrate to create a larger system. For AI hardware, chiplets enable mixing different types of chips optimized for tasks like matrix multiplication, data movement, analog I/O, and specialized memories. This heterogeneous integration differs greatly from wafer-scale integration where all logic is manufactured as one monolithic chip. Companies like Intel and AMD have adopted chiplet design for their CPUs.
+
+Chiplets are interconnected using advanced packaging techniques like high-density substrate interposers, 2.5D/3D stacking, and wafer-level packaging. This allows combining chiplets fabricated with different process nodes, specialized memories, and various optimized AI engines.
+
+![Chiplet partitioning concept. Figure taken from [@Vivet2021].](images/hw_acceleration/aimage2.png)
+
+Some key advantages of using chiplets for AI include:
+
+- **Flexibility:** Chiplets allow combining different chip types, process nodes, and memories tailored for each function. This is more modular than a fixed wafer-scale design.
+- **Yield:** Smaller chiplets have higher yield than a gigantic wafer-scale chip. Defects are contained to individual chiplets.
+- **Cost:** Leverages existing manufacturing capabilities versus requiring specialized new processes. Reduces costs by reusing mature fabrication.
+- **Compatibility:** Can integrate with more conventional system architectures like PCIe and standard DDR memory interfaces.
+
+However, chiplets also face integration and performance challenges:
+
+- Lower density compared to wafer-scale, as chiplets are limited in size.
+- Added latency when communicating between chiplets versus monolithic integration. Requires optimization for low-latency interconnect.
+- Advanced packaging adds complexity relative to wafer-scale integration, though this comparison is debatable.
+
+The key objective of chiplets is finding the right balance between modular flexibility and integration density for optimal AI performance. Chiplets aim for efficient AI acceleration while working within the constraints of conventional manufacturing techniques. Overall, chiplets take a middle path between the extremes of wafer-scale integration and fully discrete components. This provides practical benefits but may sacrifice some computational density and efficiency versus a theoretical wafer-size system.
+
+### Neuromorphic Computing
+
+Neuromorphic computing is an emerging field aiming to emulate the efficiency and robustness of biological neural systems for machine learning applications. A key difference from classical von Neumann architectures is the merging of memory and processing in the same circuit [@schuman2022; @markovic2020; @furber2016large], as illustrated in the figure below. This integrated approach is inspired by the structure of the brain. A key advantage is the potential for orders-of-magnitude improvement in energy-efficient computation compared to conventional AI hardware. For example, some estimates project 100x-1000x gains in energy efficiency versus current GPU-based systems for equivalent workloads.
+
+![Comparison of the von Neumann architecture with the neuromorphic architecture. These two architectures have some fundamental differences when it comes to operation, organization, programming, communication, and timing. Figure taken from [@schuman2022]](images/hw_acceleration/aimage3.png)
+
+Intel and IBM are leading commercial efforts in neuromorphic hardware. Intel's Loihi and Loihi 2 chips [@davies2018loihi; @davies2021advancing] offer programmable neuromorphic cores with on-chip learning. IBM's NorthPole chip [@modha2023neural] co-locates memory and compute across 256 cores for highly energy-efficient neural inference. These specialized chips deliver benefits like low power consumption for edge inference.
+
+Spiking neural networks (SNNs) [@maass1997networks] are computational models suited for neuromorphic hardware. Unlike deep neural networks that communicate via continuous values, SNNs use discrete spikes, more akin to biological neurons. This allows efficient event-based computation rather than constant processing. Additionally, SNNs take into account the temporal characteristics of input data in addition to the spatial characteristics. This better mimics biological neural networks, where the timing of neuronal spikes plays an important role. However, training SNNs remains challenging due to the added temporal complexity. See the following figure and video for reference.
+
+![Neurons communicate via spikes. (a) Diagram of a neuron. (b) Measuring an action potential propagated along the axon of a neuron. Only the action potential is detectable along the axon. (c) The neuron's spike is approximated with a binary representation. (d) Event-Driven Processing (e) Active Pixel Sensor and Dynamic Vision Sensor. Figure taken from [@10242251]](images/hw_acceleration/aimage4.png)
+
+{{< video https://www.youtube.com/watch?v=yihk_8XnCzg >}}
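+
+To make the event-driven nature of spiking computation concrete, here is a minimal sketch of a single leaky integrate-and-fire neuron in NumPy: the membrane potential integrates incoming current, leaks over time, and emits a discrete spike whenever it crosses a threshold. All constants are illustrative.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+steps, dt = 200, 1.0                 # number of timesteps and step size (ms)
+tau, v_th, v_reset = 20.0, 1.0, 0.0  # leak time constant, threshold, reset value
+
+v = 0.0
+spikes = []
+input_current = rng.uniform(0.0, 0.12, size=steps)  # noisy input drive
+
+for t in range(steps):
+    # Leaky integration of the input current.
+    v += dt * (-v / tau + input_current[t])
+    if v >= v_th:
+        spikes.append(t)   # emit a discrete spike event
+        v = v_reset        # reset the membrane potential
+
+print(f"{len(spikes)} spikes emitted at timesteps {spikes}")
+```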
+
+Specialized nanoelectronic devices called memristors [@chua1971memristor] serve as the synaptic components in neuromorphic systems. Memristors act as non-volatile memory with adjustable conductance, emulating the plasticity of real synapses. By combining memory and processing functions, memristors enable in-situ learning without separate data transfers. However, memristor technology has not yet reached maturity and scalability for commercial hardware.
+
+Recently, the integration of photonics with neuromorphic computing [@shastri2021photonics] has emerged as an active research area. Using light for computation and communication allows high speeds and reduced energy consumption. However, fully realizing photonic neuromorphic systems requires overcoming design and integration challenges.
+
+Neuromorphic computing offers promising capabilities for efficient edge inference but still faces obstacles around training algorithms, nanodevice integration, and system design. Ongoing multidisciplinary research across computer science, engineering, materials science, and physics will be key to unlocking the full potential of this technology for AI use cases.
+
+### Analog Computing
+
+Analog computing is an emerging approach that uses analog signals and components like capacitors, inductors, and amplifiers rather than digital logic for computing. It represents information as continuous electrical signals instead of discrete 0s and 1s. This allows the computation to directly reflect the analog nature of real-world data, avoiding digitization errors and overhead.
+
+Analog computing has generated renewed interest for efficient AI hardware, particularly for inference directly on low-power edge devices. Operations like multiplication and summation at the core of neural networks can be performed with very low energy consumption using analog circuits. This makes analog well-suited for deploying ML models on energy-constrained end nodes. Startups like Mythic are developing analog AI accelerators.
+
+While analog computing was popular in early computers, the boom of digital logic led to its decline. However, analog is compelling for niche applications requiring extreme efficiency [@haensch2018next]. It contrasts with digital neuromorphic approaches that still use digital spikes for computation. Analog may allow lower precision computation but requires expertise in analog circuit design. Tradeoffs around precision, programming complexity, and fabrication costs remain active areas of research.
+
+Neuromorphic computing, which aims to emulate biological neural systems for efficient ML inference, can for instance use analog circuits to implement the key components and behaviors of brains. For example, researchers have designed analog circuits to model neurons and synapses using capacitors, transistors, and operational amplifiers [@hazan2021neuromorphic]. The capacitors can exhibit the spiking dynamics of biological neurons, while the amplifiers and transistors provide weighted summation of inputs to mimic dendrites. Variable resistor technologies like memristors can realize analog synapses with spike-timing dependent plasticity - the ability to strengthen or weaken connections based on spiking activity.
+
+![Neuromorphic circuits. Credit: @hazan2021neuromorphic](images/hw_acceleration/aimage5.png)
+
+Startups like SynSense have developed analog neuromorphic chips containing these biomimetic components [@bains2020business]. This analog approach results in very low power consumption and high scalability for edge devices versus complex digital SNN implementations.
+
+However, training analog SNNs on chip remains an open challenge. Overall, analog realization is a promising technique for delivering the efficiency, scalability, and biological plausibility envisioned with neuromorphic computing. The physics of analog components combined with neural architecture design could enable large improvements in inference efficiency over conventional digital neural networks.
+
+### Flexible Electronics
+
+While much of the new hardware technology in the ML workspace has been focused on optimizing and making systems more efficient, there's a parallel trajectory aiming to adapt hardware for specific applications [@gates2009flexible; @musk2019integrated; @tang2023flexible; @tang2022soft; @kwon2022flexible]. One such avenue is the development of flexible electronics for AI use cases.
+
+Flexible electronics refer to electronic circuits and devices fabricated on flexible plastic or polymer substrates rather than rigid silicon. This allows the electronics to bend, twist, and conform to irregular shapes, unlike conventional rigid boards and chips. Early examples of flexible electronics include rollable OLED displays, flexible printed circuits, and skin-like patches. The flexibility and bendability of emerging electronic materials allows them to be integrated into thin, lightweight form factors well-suited for embedded AI and tinyML applications.
+
+Flexible AI hardware can conform to curvy surfaces and operate efficiently with microwatt power budgets. Flexibility also enables rollable or foldable form factors to minimize device footprint and weight, which is ideal for small, portable smart devices and wearables incorporating tinyML. Another key advantage of flexible electronics over conventional technologies is lower manufacturing cost and simpler fabrication processes, which could democratize access to these technologies. While silicon masks and fabrication typically cost millions of dollars, flexible hardware can cost only tens of cents to manufacture [@huang2010pseudo; @biggs2021natively]. The potential to fabricate flexible electronics directly onto plastic films using high-throughput printing and coating processes can reduce costs and improve manufacturability at scale versus rigid AI chips [@musk2019integrated].
+
+The characteristics like low-power operation, compactness, lightweight, and potential low cost stemming from flexibility make flexible electronics a promising technology vector for further enhancing embedded and tinyML applications.
+
+![Flexible electronics and some of their applications in daily life. Figure taken from @farah2005neuroethics. ](images/hw_acceleration/fimage1.png)
+
+The field is enabled by advances in organic semiconductors and nanomaterials that can be deposited on thin, flexible films. However, fabrication remains challenging compared to mature silicon processes. Flexible circuits typically exhibit lower performance than rigid equivalents right now. Still, they promise to transform electronics into lightweight, bendable materials.
+
+Flexible electronics use cases are well-suited for intimate integration with the human body. Potential medical AI applications include biointegrated sensors, soft assistive robots, and implants to monitor or stimulate the nervous system intelligently. Specifically, flexible electrode arrays could enable higher density, less invasive neural interfaces compared to rigid equivalents.
+
+Therefore, flexible electronics are ushering in a new era of wearables and body sensors, largely due to innovations in organic transistors. These components allow for more lightweight and bendable electronics, which are ideal for wearables, electronic skin, and body-conforming medical devices.
+
+In terms of biocompatibility, they are well-suited for bioelectronic devices, opening avenues for applications in both brain and cardiac interfaces. For example, research in flexible brain--computer interfaces and soft bioelectronics for cardiac applications demonstrates the potential for wide-ranging medical applications.
+
+Companies and research institutions are not only developing and investing great amounts of resources in flexible electrodes, as showcased in Neuralink's work [@musk2019integrated], but are also pushing the boundaries to integrate machine learning models within the systems [@kwon2022flexible]. These smart sensors aim for a seamless, long-lasting symbiosis with the human body.
+
+Ethically, the incorporation of smart, machine-learning-driven sensors within the body raises important questions. Issues surrounding data privacy, informed consent, and the long-term societal implications of such technologies are the focus of ongoing work in neuroethics and bioethics [@segura2018ethical; @goodyear2017social; @farah2005neuroethics; @roskies2002neuroethics]. The field is progressing at a pace that necessitates parallel advancements in ethical frameworks to guide the responsible development and deployment of these technologies. Overall, while there are limitations and ethical hurdles to overcome, the prospects for flexible electronics are expansive and hold immense promise for future research and applications.
+
+### Memory Technologies
+
+Memory technologies are critical to AI hardware, but conventional DDR DRAM and SRAM create bottlenecks. AI workloads require high bandwidth (>1 TB/s) to feed data to compute units, and extreme scientific applications of AI demand very low latency (<50 ns) [@duarte2022fastml]. They also need high density (>128 Gb) to store large model parameters and datasets, and excellent energy efficiency (<100 fJ/b) for embedded use [@verma2019memory]. New memories are needed to meet these demands. Emerging options include several new technologies:
+
+- Resistive RAM (ReRAM) can improve density with simple, passive arrays. However, challenges around variability remain [@chi2016prime].
+- Phase change memory (PCM) exploits the unique properties of chalcogenide glass. Crystalline and amorphous phases have different resistances. Intel's Optane DCPMM provides fast (100ns), high endurance PCM. But challenges include limited write cycles and high reset current [@burr2016recent].
+- 3D stacking can also boost memory density and bandwidth by vertically integrating memory layers with TSV interconnects [@loh20083d]. For example, HBM provides 1024-bit wide interfaces.
+
+New memory technologies are critical to unlock the next level of AI hardware performance and efficiency through their innovative cell architectures and materials. Realizing their benefits in commercial systems remains an ongoing challenge.
+
+In-Memory Computing is gaining traction as a promising avenue for optimizing machine learning and high-performance computing workloads. At its core, the technology co-locates data storage and computation to improve energy efficiency and reduce latency [@verma2019memory; @mittal2021survey; @wong2012metal]. Two key technologies under this umbrella are Resistive RAM (ReRAM) and Processing-In-Memory (PIM).
+
+ReRAM [@wong2012metal] and PIM [@chi2016prime] serve as the backbone for in-memory computing by storing and computing data in the same location. ReRAM research focuses on issues of uniformity, endurance, retention, multibit operation, and scalability. PIM, on the other hand, integrates processing units directly into memory arrays, specialized for tasks like matrix multiplication that are central to AI computations.
+
+These technologies find applications in AI workloads and high-performance computing, where the synergy of storage and computation can lead to significant performance gains. The architecture is particularly useful for compute-intensive tasks common in machine learning models.
+
+While in-memory computing technologies like ReRAM and PIM offer exciting prospects for efficiency and performance, they come with their own set of challenges such as issues with data uniformity and scalability in ReRAM [@imani2016resistive]. Nonetheless, the field is ripe for innovation, and addressing these limitations can potentially open new frontiers in both AI and high-performance computing.
+
+### Optical Computing
+
+In AI acceleration, a burgeoning area of interest lies in novel technologies that deviate from traditional paradigms. Some of the emerging technologies mentioned above, such as flexible electronics, in-memory computing, and even neuromorphic computing, are close to becoming a reality, given their ground-breaking innovations and applications. One of the most promising next-generation frontiers is optical computing [@miller2000optical; @zhou2022photonic]. Companies like [Lightmatter](https://lightmatter.co/) are pioneering the use of photonics for calculation, utilizing photons instead of electrons for data transmission and computation.
+
+![A photonic AI computing platform enables neural network workloads while reducing environmental impact. ([https://lightmatter.co/](https://lightmatter.co/))](images/hw_acceleration/aimage7.png)
+
+Optical computing utilizes photons and photonic devices rather than traditional electronic circuits for computing and data processing. It takes inspiration from fiber optic communication links that already rely on light for fast, efficient data transfer [@shastri2021photonics]. Light can propagate with much less loss compared to electrons in semiconductors, enabling inherent speed and efficiency benefits.
+
+Some specific advantages of optical computing include:
+
+- **High throughput:** Photons can transmit with bandwidths >100 Tb/s using wavelength division multiplexing.
+- **Low latency:** Photons interact on femtosecond timescales, orders of magnitude faster than silicon transistors.
+- **Parallelism:** Multiple data signals can propagate through the same optical medium simultaneously.
+- **Low power:** Photonic circuits utilizing waveguides and resonators can achieve complex logic and memory with only microwatts of power.
+
+However, optical computing currently faces significant challenges:
+
+- Lack of an optical memory equivalent to electronic RAM.
+- Need for conversion between the optical and electrical domains.
+- Limited set of available optical components compared to the rich electronics ecosystem.
+- Immature methods for integrating photonics with traditional CMOS chips.
+- Complex programming models required to handle parallelism.
+
+As a result, optical computing is still in the very early research stage despite its promising potential. But technical breakthroughs could enable it to complement electronics and unlock performance gains for AI workloads. Companies like Lightmatter are pioneering early optical AI accelerators. Long term, it could represent a revolutionary computing substrate if key challenges are overcome.
+
+### Quantum Computing
+
+Quantum computers leverage unique phenomena of quantum physics like superposition and entanglement to represent and process information in ways not possible classically. Instead of binary bits, the fundamental unit is the quantum bit or qubit. Unlike classical bits limited to 0 or 1, qubits can exist in a superposition of both states simultaneously due to quantum effects.
+
+Multiple qubits can also be entangled, giving an exponentially large joint state space, though measurement outcomes remain probabilistic. Superposition enables parallel computation over all possible states, while entanglement allows nonlocal correlations between qubits.
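+
+For intuition, the short sketch below simulates two qubits with a plain state vector: a Hadamard gate places the first qubit in superposition, and a CNOT then entangles the pair into a Bell state, so a measurement of either qubit fixes the other. This is a minimal NumPy illustration of the concepts, not code for any particular quantum SDK.
+
+```python
+import numpy as np
+
+# Single-qubit basis state and gates.
+zero = np.array([1, 0], dtype=complex)
+H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)  # Hadamard
+I2 = np.eye(2, dtype=complex)
+CNOT = np.array([[1, 0, 0, 0],
+                 [0, 1, 0, 0],
+                 [0, 0, 0, 1],
+                 [0, 0, 1, 0]], dtype=complex)
+
+# Start in |00>, apply H to qubit 0, then entangle the pair with CNOT.
+state = np.kron(zero, zero)
+state = np.kron(H, I2) @ state     # superposition: (|00> + |10>) / sqrt(2)
+state = CNOT @ state               # Bell state:    (|00> + |11>) / sqrt(2)
+
+probs = np.abs(state) ** 2
+for basis, p in zip(["00", "01", "10", "11"], probs):
+    print(f"P(|{basis}>) = {p:.2f}")   # 0.50, 0.00, 0.00, 0.50
+```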
+
+Quantum algorithms carefully manipulate these inherently quantum mechanical effects to, in theory, solve problems like optimization and search more efficiently than their classical counterparts. For machine learning specifically, proposed directions include:
+
+- Faster training of deep neural networks by exploiting quantum parallelism for linear algebra operations.
+- Efficient quantum ML algorithms making use of the unique capabilities of qubits.
+- Quantum neural networks with inherent quantum effects baked into the model architecture.
+- Quantum optimizers leveraging quantum annealing or adiabatic algorithms for combinatorial optimization problems.
+
+However, quantum states are fragile and prone to errors, requiring elaborate error-correction protocols. The non-intuitive nature of quantum programming also introduces challenges not present in classical computing. Key obstacles today include:
+
+- Noisy, fragile qubits that are hard to scale up; today's largest machines offer at most a few hundred physical qubits, far from the error-corrected scale many algorithms require.
+- Restricted set of available quantum gates and circuits relative to classical programming.
+- Lack of datasets and benchmarks to evaluate quantum ML in practical domains.
+
+While meaningful quantum advantage for ML remains far off, active research at companies like [D-Wave](https://www.dwavesys.com/company/about-d-wave/), [Rigetti](https://www.rigetti.com/), and [IonQ](https://ionq.com/) is advancing quantum hardware engineering and quantum algorithms. Major technology companies such as Google, [IBM](https://www.ibm.com/quantum), and Microsoft are also investing heavily in quantum computing. Google, for example, announced the 72-qubit [Bristlecone](https://blog.research.google/2018/03/a-preview-of-bristlecone-googles-new.html) quantum processor in 2018, while Microsoft maintains an active research program in topological quantum computing and collaborates with quantum startups such as [IonQ](https://ionq.com/).
+
+Quantum techniques may first make inroads for optimization before more generalized ML adoption. Realizing the full potential of quantum ML awaits major milestones in quantum hardware development and ecosystem maturity.
+
+## Future Trends
+
+Thus far in this chapter, we have primarily explored how to design specialized hardware that is optimized for machine learning workloads and algorithms. For example, we discussed how GPUs and TPUs have architectures tailored for neural network training and inference. However, we have not yet discussed an emerging and exciting area - using machine learning to aid in the hardware design process itself.
+
+The hardware design process involves many complex stages, including specification, high-level modeling, simulation, synthesis, verification, prototyping, and fabrication. Traditionally, much of this process requires extensive human expertise, effort, and time. However, recent advances in machine learning are enabling parts of the hardware design workflow to be automated and enhanced using ML techniques.
+
+Some examples of how ML is transforming hardware design include:
+
+* **Automated circuit synthesis using reinforcement learning:** Rather than hand-crafting transistor-level designs, ML agents can learn to connect logic gates and generate circuit layouts automatically. This can accelerate the time-consuming synthesis process.
+* **ML-based hardware simulation and emulation:** Deep neural network models can be trained to predict how a hardware design will perform under different conditions, enabling much faster evaluation than traditional RTL simulation while preserving useful accuracy.
+* **Automated chip floorplanning using ML algorithms:** Chip floorplanning, which involves optimally placing different components on a die, can leverage genetic algorithms and ML to explore floorplan options. This can lead to performance improvements.
+* **ML-driven architecture optimization:** Novel hardware architectures, like those for efficient ML accelerators, can be automatically generated and optimized using neural architecture search techniques. This expands the architectural design space.
+
+Applying ML to hardware design automation holds enormous promise to make the process faster, cheaper, and more efficient. It opens up design possibilities that would be extremely difficult through manual design. The use of ML in hardware design is an area of active research and early deployment, and we will study the techniques involved and their transformative potential.
+
+### ML for Hardware Design Automation
+
+A major opportunity for machine learning in hardware design is automating parts of the complex and tedious design workflow. Hardware design automation (HDA) broadly refers to using ML techniques like reinforcement learning, genetic algorithms, and neural networks to automate tasks like synthesis, verification, floorplanning, and more. A few examples of where ML for HDA shows real promise:
+
+- **Automated circuit synthesis:** Circuit synthesis involves converting a high-level description of desired logic into an optimized gate-level netlist implementation. This complex process has many design considerations and tradeoffs. ML agents can be trained through reinforcement learning to explore the design space and output optimized netlists automatically. Startups like [Symbiotic EDA](https://www.symbioticeda.com/) are bringing this technology to market.
+- **Automated chip floorplanning:** Floorplanning refers to strategically placing different components on a chip die area. ML techniques like genetic algorithms can be used to automate floorplan optimization to minimize wire length, power consumption, and other objectives; a toy sketch of this idea follows this list. This is extremely valuable as chip complexity increases.
+- **ML hardware simulators:** Deep neural networks can be trained to mimic how a hardware design behaves and then stand in as fast simulators, accelerating simulation by over 100x compared to traditional RTL simulation.
+- **Automated code translation:** Translating and optimizing hardware descriptions, for example converting a high-level specification into efficient Verilog RTL, is critical but time-consuming. ML models can be trained to act as translation agents and automate parts of this process.
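+
+As a toy illustration of the floorplanning idea above, the sketch below runs a tiny genetic algorithm that searches over orderings of blocks on a one-dimensional strip and evolves toward layouts with shorter total wire length. Real floorplanners operate in two dimensions under many more constraints; the block sizes, netlist, and GA parameters here are invented purely for demonstration.
+
+```python
+import random
+
+random.seed(0)
+
+# Hypothetical blocks (widths) and nets connecting pairs of blocks.
+widths = {"cpu": 5, "cache": 3, "dma": 2, "npu": 6, "io": 1, "ddr": 4}
+nets = [("cpu", "cache"), ("cpu", "npu"), ("npu", "ddr"),
+        ("dma", "ddr"), ("io", "cpu"), ("cache", "npu")]
+blocks = list(widths)
+
+def wirelength(order):
+    """Total distance between connected block centers for a 1-D placement."""
+    pos, x = {}, 0.0
+    for b in order:
+        pos[b] = x + widths[b] / 2
+        x += widths[b]
+    return sum(abs(pos[a] - pos[b]) for a, b in nets)
+
+def crossover(p1, p2):
+    """Order crossover: keep a prefix of p1, fill the rest in p2's order."""
+    cut = random.randrange(1, len(p1))
+    head = p1[:cut]
+    return head + [b for b in p2 if b not in head]
+
+def mutate(order):
+    """Swap two blocks in place."""
+    i, j = random.sample(range(len(order)), 2)
+    order[i], order[j] = order[j], order[i]
+
+# Evolve a small population of candidate floorplans.
+population = [random.sample(blocks, len(blocks)) for _ in range(30)]
+for _ in range(100):
+    population.sort(key=wirelength)
+    parents = population[:10]            # keep the shortest-wirelength layouts
+    children = []
+    for _ in range(20):
+        child = crossover(random.choice(parents), random.choice(parents))
+        if random.random() < 0.3:
+            mutate(child)
+        children.append(child)
+    population = parents + children
+
+best = min(population, key=wirelength)
+print("best order:", best, "wirelength:", round(wirelength(best), 1))
+```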
+
+The benefits of HDA using ML are reduced design time, superior optimizations, and exploration of design spaces too complex for manual approaches. This can accelerate hardware development and lead to better designs.
+
+Challenges include limits of ML generalization, the black-box nature of some techniques, and accuracy tradeoffs. But research is rapidly advancing to address these issues and make HDA ML solutions robust and reliable for production use. HDA provides a major avenue for ML to transform hardware design.
+
+### ML-Based Hardware Simulation and Verification
+
+Simulating and verifying hardware designs is critical before manufacturing to ensure the design behaves as intended. Traditional approaches like register-transfer level (RTL) simulation are complex and time-consuming. ML introduces new opportunities to enhance hardware simulation and verification. Some examples include:
+
+- **Surrogate modeling for simulation:** Highly accurate surrogate models of a design can be built using neural networks. These models predict outputs from inputs much faster than RTL simulation, enabling fast design space exploration (a minimal sketch appears after this list). Companies like Ansys use this technique.
+- **ML simulators:** Large neural network models can be trained on RTL simulations to learn to mimic the functionality of a hardware design. Once trained, the NN model can act as a highly efficient simulator to use for regression testing and other tasks. Graphcore has demonstrated over 100x speedup with this approach.
+- **Formal verification using ML:** Formal verification mathematically proves properties about a design. ML techniques can help generate candidate verification properties and guide the search for the complex formal proofs needed, automating parts of this challenging process. Startups are beginning to bring ML-assisted formal verification tools to market.
+- **Bug detection:** ML models can be trained to process hardware designs and identify potential issues. This assists human designers in inspecting complex designs and finding bugs. Facebook has shown bug detection models for their server hardware.
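+
+Returning to the surrogate-modeling bullet above, the sketch below fits a gradient-boosted regressor on (input, output) pairs collected from a slow "golden" simulator, represented here by a stand-in analytic timing model, and then uses the learned model as a fast approximate simulator. The feature names, timing formula, and scikit-learn choice are illustrative assumptions rather than any specific EDA flow.
+
+```python
+import numpy as np
+from sklearn.ensemble import GradientBoostingRegressor
+
+rng = np.random.default_rng(42)
+
+def slow_golden_simulator(params):
+    """Stand-in for an expensive RTL/circuit simulation.
+
+    params columns: [voltage, clock_freq_ghz, num_units] (made-up knobs).
+    Returns a synthetic critical-path delay in nanoseconds.
+    """
+    v, f, n = params[:, 0], params[:, 1], params[:, 2]
+    return 1.0 / v + 0.3 * f * np.log1p(n) + rng.normal(0, 0.02, len(v))
+
+# Sample a modest number of "simulated" design points for training.
+X = np.column_stack([
+    rng.uniform(0.7, 1.2, 500),    # supply voltage (V)
+    rng.uniform(0.5, 3.0, 500),    # clock frequency (GHz)
+    rng.integers(1, 64, 500),      # number of parallel units
+])
+y = slow_golden_simulator(X)
+
+surrogate = GradientBoostingRegressor(n_estimators=200, max_depth=3)
+surrogate.fit(X, y)
+
+# The surrogate now answers "what-if" queries almost instantly instead of
+# re-running the expensive simulator for every candidate design.
+query = np.array([[0.9, 2.0, 32]])
+print(f"predicted delay: {surrogate.predict(query)[0]:.3f} ns")
+```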
+
+The key benefits of applying ML to simulation and verification are faster design validation turnaround times, more rigorous testing, and reduced human effort. Challenges include verifying ML model correctness and handling corner cases. ML promises to significantly accelerate testing workflows.
+
+### ML for Efficient Hardware Architectures
+
+Designing hardware architectures optimized for performance, power, and efficiency is a key goal. ML introduces new techniques to automate and enhance architecture design space exploration for both general-purpose and specialized hardware like ML accelerators. Some promising examples include:
+
+- **Neural architecture search for hardware:** Search techniques like evolutionary algorithms can automatically generate novel hardware architectures by mutating and mixing design attributes like cache size, number of parallel units, and memory bandwidth. This expands the design space beyond human limitations (a toy search loop is sketched after this list).
+- **ML-based architecture optimizers:** ML agents can be trained with reinforcement learning to tweak architectures to optimize for desired objectives like throughput or power. The agent explores the space of possible configurations to find high-performing, efficient designs.
+- **Predictive modeling for optimization:** ML models can be trained to predict hardware performance, power, and efficiency metrics for a given architecture. These become "surrogate models" for fast optimization and design space exploration by substituting for lengthy simulations.
+- **Specialized accelerator optimization:** For specialized chips like tensor processing units for AI, automated architecture search techniques based on ML and evolutionary algorithms show promise for finding fast, efficient designs.
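+
+A compressed sketch of the search loop these bullets describe: candidate accelerator configurations are mutated and scored by a cheap surrogate cost model, and the best designs survive into the next generation. The configuration knobs and cost function below are invented for illustration; in practice the score would come from a trained predictor or a cycle-accurate simulator.
+
+```python
+import random
+
+random.seed(7)
+
+# Hypothetical accelerator design knobs and their legal values.
+SPACE = {
+    "pe_array":  [16, 32, 64, 128],      # processing elements
+    "sram_kb":   [128, 256, 512, 1024],  # on-chip buffer size
+    "bus_width": [64, 128, 256],         # bits
+    "precision": [4, 8, 16],             # operand bit-width
+}
+
+def surrogate_score(cfg):
+    """Toy surrogate: reward throughput, penalize area and energy proxies."""
+    throughput = cfg["pe_array"] * cfg["bus_width"] / cfg["precision"]
+    area = cfg["pe_array"] * cfg["precision"] + cfg["sram_kb"] * 0.5
+    return throughput / (1.0 + 0.01 * area)
+
+def random_config():
+    return {k: random.choice(v) for k, v in SPACE.items()}
+
+def mutate(cfg):
+    child = dict(cfg)
+    knob = random.choice(list(SPACE))
+    child[knob] = random.choice(SPACE[knob])
+    return child
+
+# Simple evolutionary search: keep the best half, refill with mutants.
+population = [random_config() for _ in range(16)]
+for _ in range(50):
+    population.sort(key=surrogate_score, reverse=True)
+    survivors = population[:8]
+    population = survivors + [mutate(random.choice(survivors)) for _ in range(8)]
+
+best = max(population, key=surrogate_score)
+print(best, round(surrogate_score(best), 2))
+```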
+
+The benefits of using ML include superior design space exploration, automated optimization, and reduced manual effort. Challenges include long training times for some techniques and local optima limitations. But ML for hardware architecture holds great potential for unlocking performance and efficiency gains.
+
+### ML to Optimize Manufacturing and Reduce Defects
+
+Once a hardware design is complete, it moves to manufacturing. But variability and defects during manufacturing can impact yields and quality. ML techniques are now being applied to improve fabrication processes and reduce defects. Some examples include:
+
+- **Predictive maintenance:** ML models can analyze equipment sensor data over time and identify signals that predict maintenance needs before failure. This enables proactive upkeep, which is especially valuable given the cost of downtime in the fabrication process.
+- **Process optimization:** Supervised learning models can be trained on process data to identify factors that lead to low yields. The models can then optimize parameters to improve yields, throughput, or consistency.
+- **Yield prediction:** By analyzing test data from fabricated designs with techniques like regression trees, ML models can predict yields early in production, allowing timely process adjustments (a small example follows this list).
+- **Defect detection:** Computer vision ML techniques can be applied to images of designs to identify defects invisible to the human eye. This enables precision quality control and root cause analysis.
+- **Proactive failure analysis:** By analyzing structured and unstructured process data, ML models can help predict, diagnose, and prevent issues that lead to downstream defects and failures.
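+
+Tying back to the yield-prediction bullet, this sketch fits a regression tree on synthetic wafer-test measurements to predict die yield from a handful of process parameters. The feature names and data-generating function are fabricated placeholders; a real deployment would train on actual fab test data.
+
+```python
+import numpy as np
+from sklearn.tree import DecisionTreeRegressor
+
+rng = np.random.default_rng(3)
+
+# Synthetic wafer-level process measurements (all made-up features).
+n = 1000
+temperature = rng.normal(350, 5, n)      # deposition temperature (C)
+pressure = rng.normal(2.0, 0.1, n)       # chamber pressure (Torr)
+dose = rng.normal(1.0, 0.05, n)          # implant dose (arbitrary units)
+
+# Synthetic yield: degrades when parameters drift from their targets.
+yield_pct = (95 - 0.4 * np.abs(temperature - 350)
+                - 20 * np.abs(pressure - 2.0)
+                - 30 * np.abs(dose - 1.0) + rng.normal(0, 1, n))
+
+X = np.column_stack([temperature, pressure, dose])
+model = DecisionTreeRegressor(max_depth=4).fit(X, yield_pct)
+
+# Predict yield for a new lot before full test data is available.
+new_lot = np.array([[353.0, 2.05, 0.98]])
+print(f"predicted yield: {model.predict(new_lot)[0]:.1f}%")
+print("feature importances:", model.feature_importances_.round(2))
+```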
+
+Applying ML to manufacturing enables process optimization, real-time quality control, predictive maintenance, and ultimately higher yields. Challenges include managing complex manufacturing data and variations. But ML is poised to transform semiconductor manufacturing.
+
+### Toward Foundation Models for Hardware Design
+
+As we have seen, machine learning is opening up new possibilities across the hardware design workflow, from specification to manufacturing. However, current ML techniques are still narrow in scope and require extensive domain-specific engineering. The long-term vision is the development of general artificial intelligence systems that can be applied with versatility across hardware design tasks.
+
+To fully realize this vision, investment and research are needed to develop foundation models for hardware design. These are unified, general-purpose ML models and architectures that can learn complex hardware design skills with the right training data and objectives.
+
+Realizing foundation models for end-to-end hardware design will require:
+
+- Accumulation of large, high-quality, labeled datasets across hardware design stages to train foundation models.
+- Advances in multi-modal, multi-task ML techniques to handle the diversity of hardware design data and tasks.
+- Interfaces and abstraction layers to connect foundation models to existing design flows and tools.
+- Development of simulation environments and benchmarks to train and test foundation models on hardware design capabilities.
+- Methods to explain and interpret the design decisions and optimizations made by ML models for trust and verification.
+- Compilation techniques to optimize foundation models for efficient deployment across hardware platforms.
+
+While significant research remains, foundation models represent the most transformative long-term goal for imbuing AI into the hardware design process. Democratizing hardware design via versatile, automated ML systems promises to unlock a new era of optimized, efficient, and innovative chip design. The journey ahead is filled with open challenges and opportunities.
+
+We encourage you to read [Architecture 2.0](https://www.sigarch.org/architecture-2-0-why-computer-architects-need-a-data-centric-ai-gymnasium/) if ML-aided computer architecture design interests you. Alternatively, you can watch the video below.
+
+{{< video https://www.youtube.com/watch?v=F5Eieaz7u1I&ab_channel=OpenComputeProject >}}
## Conclusion
-Explanation: This section consolidates the key learnings from the chapter, providing a summary and a future outlook on hardware acceleration in embedded AI systems. This offers insight into where the field might be headed, helping to inspire future projects or study.
+Specialized hardware acceleration has become indispensable for enabling performant and efficient artificial intelligence applications as models and datasets explode in complexity. In this chapter, we examined the limitations of general-purpose processors like CPUs for AI workloads. Their limited parallelism and computational throughput prevent them from training or running state-of-the-art deep neural networks quickly. These motivations have driven innovations in customized accelerators.
+
+We surveyed GPUs, TPUs, FPGAs and ASICs specifically designed for the math-intensive operations inherent to neural networks. By covering this spectrum of options, we aimed to provide a framework for reasoning through accelerator selection based on constraints around flexibility, performance, power, cost, and other factors.
+
+We also explored the role of software in enabling and optimizing AI acceleration. This spans programming abstractions, frameworks, compilers and simulators. We discussed hardware-software co-design as a proactive methodology for building more holistic AI systems by closely integrating algorithm innovation and hardware advances.
+
+But there is so much more to come! Exciting frontiers like analog computing, optical neural networks, and quantum machine learning represent active research directions that could unlock orders of magnitude improvements in efficiency, speed, and scale compared to present paradigms.
-- Summary of Key Points
-- The Future Outlook for Hardware Acceleration in Embedded AI Systems
\ No newline at end of file
+In the end, specialized hardware acceleration remains essential for unlocking the performance and efficiency necessary to fulfill the promise of artificial intelligence from cloud to edge. We hope this chapter provided useful background and insight into the rapid innovation occurring in this domain.
\ No newline at end of file
diff --git a/images/hw_acceleration/aimage1.png b/images/hw_acceleration/aimage1.png
new file mode 100644
index 00000000..a3359e67
Binary files /dev/null and b/images/hw_acceleration/aimage1.png differ
diff --git a/images/hw_acceleration/aimage2.png b/images/hw_acceleration/aimage2.png
new file mode 100644
index 00000000..0a4c91dc
Binary files /dev/null and b/images/hw_acceleration/aimage2.png differ
diff --git a/images/hw_acceleration/aimage3.png b/images/hw_acceleration/aimage3.png
new file mode 100644
index 00000000..e8ffff73
Binary files /dev/null and b/images/hw_acceleration/aimage3.png differ
diff --git a/images/hw_acceleration/aimage4.png b/images/hw_acceleration/aimage4.png
new file mode 100644
index 00000000..9071ac0e
Binary files /dev/null and b/images/hw_acceleration/aimage4.png differ
diff --git a/images/hw_acceleration/aimage5.png b/images/hw_acceleration/aimage5.png
new file mode 100644
index 00000000..009eba1e
Binary files /dev/null and b/images/hw_acceleration/aimage5.png differ
diff --git a/images/hw_acceleration/aimage6.png b/images/hw_acceleration/aimage6.png
new file mode 100644
index 00000000..115c39af
Binary files /dev/null and b/images/hw_acceleration/aimage6.png differ
diff --git a/images/hw_acceleration/aimage7.png b/images/hw_acceleration/aimage7.png
new file mode 100644
index 00000000..13626692
Binary files /dev/null and b/images/hw_acceleration/aimage7.png differ
diff --git a/images/hw_acceleration/fimage1.png b/images/hw_acceleration/fimage1.png
new file mode 100644
index 00000000..115c39af
Binary files /dev/null and b/images/hw_acceleration/fimage1.png differ
diff --git a/images/hw_acceleration/fpga.png b/images/hw_acceleration/fpga.png
new file mode 100644
index 00000000..a960175b
Binary files /dev/null and b/images/hw_acceleration/fpga.png differ
diff --git a/images/hw_acceleration/hwai_40yearsmicrotrenddata.png b/images/hw_acceleration/hwai_40yearsmicrotrenddata.png
new file mode 100644
index 00000000..23bf6afe
Binary files /dev/null and b/images/hw_acceleration/hwai_40yearsmicrotrenddata.png differ
diff --git a/images/hw_acceleration/nre.png b/images/hw_acceleration/nre.png
new file mode 100644
index 00000000..0874161e
Binary files /dev/null and b/images/hw_acceleration/nre.png differ
diff --git a/images/hw_acceleration/tradeoffs.png b/images/hw_acceleration/tradeoffs.png
new file mode 100644
index 00000000..8000aa9a
Binary files /dev/null and b/images/hw_acceleration/tradeoffs.png differ
diff --git a/references.bib b/references.bib
index cc07274f..ee76ca3e 100644
--- a/references.bib
+++ b/references.bib
@@ -1967,6 +1967,1044 @@ @misc{zhou_deep_2023
file = {Zhou et al. - 2023 - Deep Class-Incremental Learning A Survey.pdf:/Users/alex/Zotero/storage/859VZG7W/Zhou et al. - 2023 - Deep Class-Incremental Learning A Survey.pdf:application/pdf}
}
+@misc{kuzmin2022fp8,
+ title={FP8 Quantization: The Power of the Exponent},
+ author={Andrey Kuzmin and Mart Van Baalen and Yuwei Ren and Markus Nagel and Jorn Peters and Tijmen Blankevoort},
+ year={2022},
+ eprint={2208.09225},
+ archivePrefix={arXiv},
+ primaryClass={cs.LG}
+}
+
+
+% Types of AI Accelerators, up until CPU Advantages
+@misc{noauthor_who_nodate,
+ title = {Who {Invented} the {Microprocessor}? - {CHM}},
+ url = {https://computerhistory.org/blog/who-invented-the-microprocessor/},
+ urldate = {2023-11-07},
+}
+
+@book{weik_survey_1955,
+ title = {A {Survey} of {Domestic} {Electronic} {Digital} {Computing} {Systems}},
+ language = {en},
+ publisher = {Ballistic Research Laboratories},
+ author = {Weik, Martin H.},
+ year = {1955},
+}
+
+@inproceedings{brown_language_2020,
+ title = {Language {Models} are {Few}-{Shot} {Learners}},
+ volume = {33},
+ url = {https://proceedings.neurips.cc/paper_files/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html},
+ abstract = {We demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even becoming competitive with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks. We also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora.},
+ urldate = {2023-11-07},
+ booktitle = {Advances in {Neural} {Information} {Processing} {Systems}},
+ publisher = {Curran Associates, Inc.},
+ author = {Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winter, Clemens and Hesse, Chris and Chen, Mark and Sigler, Eric and Litwin, Mateusz and Gray, Scott and Chess, Benjamin and Clark, Jack and Berner, Christopher and McCandlish, Sam and Radford, Alec and Sutskever, Ilya and Amodei, Dario},
+ year = {2020},
+ pages = {1877--1901},
+}
+
+@misc{jia_dissecting_2018,
+ title = {Dissecting the {NVIDIA} {Volta} {GPU} {Architecture} via {Microbenchmarking}},
+ url = {http://arxiv.org/abs/1804.06826},
+ abstract = {Every year, novel NVIDIA GPU designs are introduced. This rapid architectural and technological progression, coupled with a reluctance by manufacturers to disclose low-level details, makes it difficult for even the most proficient GPU software designers to remain up-to-date with the technological advances at a microarchitectural level. To address this dearth of public, microarchitectural-level information on the novel NVIDIA GPUs, independent researchers have resorted to microbenchmarks-based dissection and discovery. This has led to a prolific line of publications that shed light on instruction encoding, and memory hierarchy's geometry and features at each level. Namely, research that describes the performance and behavior of the Kepler, Maxwell and Pascal architectures. In this technical report, we continue this line of research by presenting the microarchitectural details of the NVIDIA Volta architecture, discovered through microbenchmarks and instruction set disassembly. Additionally, we compare quantitatively our Volta findings against its predecessors, Kepler, Maxwell and Pascal.},
+ urldate = {2023-11-07},
+ publisher = {arXiv},
+ author = {Jia, Zhe and Maggioni, Marco and Staiger, Benjamin and Scarpazza, Daniele P.},
+ month = apr,
+ year = {2018},
+ note = {arXiv:1804.06826 [cs]},
+ keywords = {Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance},
+}
+
+@article{jia2019beyond,
+ title={Beyond Data and Model Parallelism for Deep Neural Networks.},
+ author={Jia, Zhihao and Zaharia, Matei and Aiken, Alex},
+ journal={Proceedings of Machine Learning and Systems},
+ volume={1},
+ pages={1--13},
+ year={2019}
+}
+
+@inproceedings{raina_large-scale_2009,
+ address = {Montreal Quebec Canada},
+ title = {Large-scale deep unsupervised learning using graphics processors},
+ isbn = {978-1-60558-516-1},
+ url = {https://dl.acm.org/doi/10.1145/1553374.1553486},
+ doi = {10.1145/1553374.1553486},
+ language = {en},
+ urldate = {2023-11-07},
+ booktitle = {Proceedings of the 26th {Annual} {International} {Conference} on {Machine} {Learning}},
+ publisher = {ACM},
+ author = {Raina, Rajat and Madhavan, Anand and Ng, Andrew Y.},
+ month = jun,
+ year = {2009},
+ pages = {873--880},
+}
+
+@misc{noauthor_amd_nodate,
+ title = {{AMD} {Radeon} {RX} 7000 {Series} {Desktop} {Graphics} {Cards}},
+ url = {https://www.amd.com/en/graphics/radeon-rx-graphics},
+ urldate = {2023-11-07},
+}
+
+@misc{noauthor_intel_nodate,
+ title = {Intel® {Arc}™ {Graphics} {Overview}},
+ url = {https://www.intel.com/content/www/us/en/products/details/discrete-gpus/arc.html},
+ abstract = {Find out how Intel® Arc Graphics unlock lifelike gaming and seamless content creation.},
+ language = {en},
+ urldate = {2023-11-07},
+ journal = {Intel},
+}
+
+@article{lindholm_nvidia_2008,
+ title = {{NVIDIA} {Tesla}: {A} {Unified} {Graphics} and {Computing} {Architecture}},
+ volume = {28},
+ issn = {1937-4143},
+ shorttitle = {{NVIDIA} {Tesla}},
+ url = {https://ieeexplore.ieee.org/document/4523358},
+ doi = {10.1109/MM.2008.31},
+ abstract = {To enable flexible, programmable graphics and high-performance computing, NVIDIA has developed the Tesla scalable unified graphics and parallel computing architecture. Its scalable parallel array of processors is massively multithreaded and programmable in C or via graphics APIs.},
+ number = {2},
+ urldate = {2023-11-07},
+ journal = {IEEE Micro},
+ author = {Lindholm, Erik and Nickolls, John and Oberman, Stuart and Montrym, John},
+ month = mar,
+ year = {2008},
+ note = {Conference Name: IEEE Micro},
+ pages = {39--55},
+}
+
+@article{dally_evolution_2021,
+ title = {Evolution of the {Graphics} {Processing} {Unit} ({GPU})},
+ volume = {41},
+ issn = {1937-4143},
+ url = {https://ieeexplore.ieee.org/document/9623445},
+ doi = {10.1109/MM.2021.3113475},
+ abstract = {Graphics processing units (GPUs) power today’s fastest supercomputers, are the dominant platform for deep learning, and provide the intelligence for devices ranging from self-driving cars to robots and smart cameras. They also generate compelling photorealistic images at real-time frame rates. GPUs have evolved by adding features to support new use cases. NVIDIA’s GeForce 256, the first GPU, was a dedicated processor for real-time graphics, an application that demands large amounts of floating-point arithmetic for vertex and fragment shading computations and high memory bandwidth. As real-time graphics advanced, GPUs became programmable. The combination of programmability and floating-point performance made GPUs attractive for running scientific applications. Scientists found ways to use early programmable GPUs by casting their calculations as vertex and fragment shaders. GPUs evolved to meet the needs of scientific users by adding hardware for simpler programming, double-precision floating-point arithmetic, and resilience.},
+ number = {6},
+ urldate = {2023-11-07},
+ journal = {IEEE Micro},
+ author = {Dally, William J. and Keckler, Stephen W. and Kirk, David B.},
+ month = nov,
+ year = {2021},
+ note = {Conference Name: IEEE Micro},
+ pages = {42--51},
+}
+
+@article{demler_ceva_2020,
+ title = {{CEVA} {SENSPRO} {FUSES} {AI} {AND} {VECTOR} {DSP}},
+ language = {en},
+ author = {Demler, Mike},
+ year = {2020},
+}
+
+@misc{noauthor_google_2023,
+ title = {Google {Tensor} {G3}: {The} new chip that gives your {Pixel} an {AI} upgrade},
+ shorttitle = {Google {Tensor} {G3}},
+ url = {https://blog.google/products/pixel/google-tensor-g3-pixel-8/},
+ abstract = {Tensor G3 on Pixel 8 and Pixel 8 Pro is more helpful, more efficient and more powerful.},
+ language = {en-us},
+ urldate = {2023-11-07},
+ journal = {Google},
+ month = oct,
+ year = {2023},
+}
+
+@misc{noauthor_hexagon_nodate,
+ title = {Hexagon {DSP} {SDK} {Processor}},
+ url = {https://developer.qualcomm.com/software/hexagon-dsp-sdk/dsp-processor},
+ abstract = {The Hexagon DSP processor has both CPU and DSP functionality to support deeply embedded processing needs of the mobile platform for both multimedia and modem functions.},
+ language = {en},
+ urldate = {2023-11-07},
+ journal = {Qualcomm Developer Network},
+}
+
+@misc{noauthor_evolution_2023,
+ title = {The {Evolution} of {Audio} {DSPs}},
+ url = {https://audioxpress.com/article/the-evolution-of-audio-dsps},
+ abstract = {To complement the extensive perspective of another Market Update feature article on DSP Products and Applications, published in the November 2020 edition, audioXpress was honored to have the valuable contribution from one of the main suppliers in the field. In this article, Youval Nachum, CEVA’s Senior Product Marketing Manager, writes about \"The Evolution of Audio DSPs,\" discussing how DSP technology has evolved, its impact on the user experience, and what the future of DSP has in store for us.},
+ language = {en},
+ urldate = {2023-11-07},
+ journal = {audioXpress},
+ month = oct,
+ year = {2023},
+}
+
+@article{xiong_mri-based_2021,
+ title = {{MRI}-based brain tumor segmentation using {FPGA}-accelerated neural network},
+ volume = {22},
+ issn = {1471-2105},
+ url = {https://doi.org/10.1186/s12859-021-04347-6},
+ doi = {10.1186/s12859-021-04347-6},
+ abstract = {Brain tumor segmentation is a challenging problem in medical image processing and analysis. It is a very time-consuming and error-prone task. In order to reduce the burden on physicians and improve the segmentation accuracy, the computer-aided detection (CAD) systems need to be developed. Due to the powerful feature learning ability of the deep learning technology, many deep learning-based methods have been applied to the brain tumor segmentation CAD systems and achieved satisfactory accuracy. However, deep learning neural networks have high computational complexity, and the brain tumor segmentation process consumes significant time. Therefore, in order to achieve the high segmentation accuracy of brain tumors and obtain the segmentation results efficiently, it is very demanding to speed up the segmentation process of brain tumors.},
+ number = {1},
+ urldate = {2023-11-07},
+ journal = {BMC Bioinformatics},
+ author = {Xiong, Siyu and Wu, Guoqing and Fan, Xitian and Feng, Xuan and Huang, Zhongcheng and Cao, Wei and Zhou, Xuegong and Ding, Shijin and Yu, Jinhua and Wang, Lingli and Shi, Zhifeng},
+ month = sep,
+ year = {2021},
+ keywords = {Brain tumor segmatation, FPGA acceleration, Neural network},
+ pages = {421},
+}
+
+@article{gwennap_certus-nx_nodate,
+ title = {Certus-{NX} {Innovates} {General}-{Purpose} {FPGAs}},
+ language = {en},
+ author = {Gwennap, Linley},
+}
+
+@misc{noauthor_fpga_nodate,
+ title = {{FPGA} {Architecture} {Overview}},
+ url = {https://www.intel.com/content/www/us/en/docs/oneapi-fpga-add-on/optimization-guide/2023-1/fpga-architecture-overview.html},
+ urldate = {2023-11-07},
+}
+
+@misc{noauthor_what_nodate,
+ title = {What is an {FPGA}? {Field} {Programmable} {Gate} {Array}},
+ shorttitle = {What is an {FPGA}?},
+ url = {https://www.xilinx.com/products/silicon-devices/fpga/what-is-an-fpga.html},
+ abstract = {What is an FPGA - Field Programmable Gate Arrays are semiconductor devices that are based around a matrix of configurable logic blocks (CLBs) connected via programmable interconnects. FPGAs can be reprogrammed to desired application or functionality requirements after manufacturing.},
+ language = {en},
+ urldate = {2023-11-07},
+ journal = {AMD},
+}
+
+@article{putnam_reconfigurable_2014,
+ title = {A reconfigurable fabric for accelerating large-scale datacenter services},
+ volume = {42},
+ issn = {0163-5964},
+ url = {https://dl.acm.org/doi/10.1145/2678373.2665678},
+ doi = {10.1145/2678373.2665678},
+ abstract = {Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we have designed and built a composable, reconfigurablefabric to accelerate portions of large-scale software services. Each instantiation of the fabric consists of a 6x8 2-D torus of high-end Stratix V FPGAs embedded into a half-rack of 48 machines. One FPGA is placed into each server, accessible through PCIe, and wired directly to other FPGAs with pairs of 10 Gb SAS cables
+ In this paper, we describe a medium-scale deployment of this fabric on a bed of 1,632 servers, and measure its efficacy in accelerating the Bing web search engine. We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system when ranking candidate documents. Under high load, the largescale reconfigurable fabric improves the ranking throughput of each server by a factor of 95\% for a fixed latency distribution--- or, while maintaining equivalent throughput, reduces the tail latency by 29\%},
+ language = {en},
+ number = {3},
+ urldate = {2023-11-07},
+ journal = {ACM SIGARCH Computer Architecture News},
+ author = {Putnam, Andrew and Caulfield, Adrian M. and Chung, Eric S. and Chiou, Derek and Constantinides, Kypros and Demme, John and Esmaeilzadeh, Hadi and Fowers, Jeremy and Gopal, Gopi Prashanth and Gray, Jan and Haselman, Michael and Hauck, Scott and Heil, Stephen and Hormati, Amir and Kim, Joo-Young and Lanka, Sitaram and Larus, James and Peterson, Eric and Pope, Simon and Smith, Aaron and Thong, Jason and Xiao, Phillip Yi and Burger, Doug},
+ month = oct,
+ year = {2014},
+ pages = {13--24},
+}
+
+@misc{noauthor_project_nodate,
+ title = {Project {Catapult} - {Microsoft} {Research}},
+ url = {https://www.microsoft.com/en-us/research/project/project-catapult/},
+ urldate = {2023-11-07},
+}
+
+@misc{dean_jeff_numbers_nodate,
+ title = {Numbers {Everyone} {Should} {Know}},
+ url = {https://brenocon.com/dean_perf.html},
+ urldate = {2023-11-07},
+ author = {Dean. Jeff},
+}
+
+@misc{bailey_enabling_2018,
+ title = {Enabling {Cheaper} {Design}},
+ url = {https://semiengineering.com/enabling-cheaper-design/},
+ abstract = {Enabling Cheaper Design, At what point does cheaper design enable a significant growth in custom semiconductor content? Not everyone is onboard with the idea.},
+ language = {en-US},
+ urldate = {2023-11-07},
+ journal = {Semiconductor Engineering},
+ author = {Bailey, Brian},
+ month = sep,
+ year = {2018},
+}
+
+@misc{noauthor_integrated_2023,
+ title = {Integrated circuit},
+ copyright = {Creative Commons Attribution-ShareAlike License},
+ url = {https://en.wikipedia.org/w/index.php?title=Integrated_circuit&oldid=1183537457},
+ abstract = {An integrated circuit (also known as an IC, a chip, or a microchip) is a set of electronic circuits on one small flat piece of semiconductor material, usually silicon. Large numbers of miniaturized transistors and other electronic components are integrated together on the chip. This results in circuits that are orders of magnitude smaller, faster, and less expensive than those constructed of discrete components, allowing a large transistor count.
+The IC's mass production capability, reliability, and building-block approach to integrated circuit design have ensured the rapid adoption of standardized ICs in place of designs using discrete transistors. ICs are now used in virtually all electronic equipment and have revolutionized the world of electronics. Computers, mobile phones and other home appliances are now essential parts of the structure of modern societies, made possible by the small size and low cost of ICs such as modern computer processors and microcontrollers.
+Very-large-scale integration was made practical by technological advancements in semiconductor device fabrication. Since their origins in the 1960s, the size, speed, and capacity of chips have progressed enormously, driven by technical advances that fit more and more transistors on chips of the same size – a modern chip may have many billions of transistors in an area the size of a human fingernail. These advances, roughly following Moore's law, make the computer chips of today possess millions of times the capacity and thousands of times the speed of the computer chips of the early 1970s.
+ICs have three main advantages over discrete circuits: size, cost and performance. The size and cost is low because the chips, with all their components, are printed as a unit by photolithography rather than being constructed one transistor at a time. Furthermore, packaged ICs use much less material than discrete circuits. Performance is high because the IC's components switch quickly and consume comparatively little power because of their small size and proximity. The main disadvantage of ICs is the high initial cost of designing them and the enormous capital cost of factory construction. This high initial cost means ICs are only commercially viable when high production volumes are anticipated.},
+ language = {en},
+ urldate = {2023-11-07},
+ journal = {Wikipedia},
+ month = nov,
+ year = {2023},
+ note = {Page Version ID: 1183537457},
+}
+
+@article{el-rayis_reconfigurable_nodate,
+ title = {Reconfigurable {Architectures} for the {Next} {Generation} of {Mobile} {Device} {Telecommunications} {Systems}},
+ language = {en},
+ author = {El-Rayis, Ahmed Osman},
+}
+
+
+@misc{noauthor_intel_nodate,
+ title = {Intel® {Stratix}® 10 {NX} {FPGA} {Overview} - {High} {Performance} {Stratix}® {FPGA}},
+ url = {https://www.intel.com/content/www/us/en/products/details/fpga/stratix/10/nx.html},
+ abstract = {View Intel® Stratix® 10 NX FPGAs and find product specifications, features, applications and more.},
+ language = {en},
+ urldate = {2023-11-07},
+ journal = {Intel},
+}
+
+
+@book{patterson2016computer,
+ title={Computer organization and design ARM edition: the hardware software interface},
+ author={Patterson, David A and Hennessy, John L},
+ year={2016},
+ publisher={Morgan kaufmann}
+}
+
+@article{xiu2019time,
+ title={Time Moore: Exploiting Moore's Law from the perspective of time},
+ author={Xiu, Liming}, journal={IEEE Solid-State Circuits Magazine},
+ volume={11}, number={1}, pages={39--55}, year={2019}, publisher={IEEE}
+
+}
+
+@article{brown2020language,
+ title={Language models are few-shot learners},
+ author={Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and others},
+ journal={Advances in neural information processing systems},
+ volume={33},
+ pages={1877--1901},
+ year={2020}
+}
+
+@article{cheng2017survey,
+ title={A survey of model compression and acceleration for deep neural networks},
+ author={Cheng, Yu and Wang, Duo and Zhou, Pan and Zhang, Tao},
+ journal={arXiv preprint arXiv:1710.09282},
+ year={2017}
+}
+
+@article{sze2017efficient,
+ title={Efficient processing of deep neural networks: A tutorial and survey},
+ author={Sze, Vivienne and Chen, Yu-Hsin and Yang, Tien-Ju and Emer, Joel S},
+ journal={Proceedings of the IEEE},
+ volume={105},
+ number={12},
+ pages={2295--2329},
+ year={2017},
+ publisher={Ieee}
+}
+
+@article{young2018recent,
+ title={Recent trends in deep learning based natural language processing},
+ author={Young, Tom and Hazarika, Devamanyu and Poria, Soujanya and Cambria, Erik},
+  journal={IEEE Computational Intelligence Magazine},
+ volume={13},
+ number={3},
+ pages={55--75},
+ year={2018},
+ publisher={IEEE}
+}
+
+@inproceedings{jacob2018quantization,
+ title={Quantization and training of neural networks for efficient integer-arithmetic-only inference},
+ author={Jacob, Benoit and Kligys, Skirmantas and Chen, Bo and Zhu, Menglong and Tang, Matthew and Howard, Andrew and Adam, Hartwig and Kalenichenko, Dmitry},
+ booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
+ pages={2704--2713},
+ year={2018}
+}
+
+@article{gale2019state,
+ title={The state of sparsity in deep neural networks},
+ author={Gale, Trevor and Elsen, Erich and Hooker, Sara},
+ journal={arXiv preprint arXiv:1902.09574},
+ year={2019}
+}
+
+@inproceedings{zhang2015fpga,
+  title={Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks},
+  author={Zhang, Chen and Li, Peng and Sun, Guangyu and Guan, Yijin and Xiao, Bingjun and Cong, Jason},
+  booktitle={Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays},
+  pages={161--170},
+  year={2015}
+}
+
+@inproceedings{suda2016throughput,
+ title={Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks},
+ author={Suda, Naveen and Chandra, Vikas and Dasika, Ganesh and Mohanty, Abinash and Ma, Yufei and Vrudhula, Sarma and Seo, Jae-sun and Cao, Yu},
+ booktitle={Proceedings of the 2016 ACM/SIGDA international symposium on field-programmable gate arrays},
+ pages={16--25},
+ year={2016}
+}
+
+@inproceedings{fowers2018configurable,
+ title={A configurable cloud-scale DNN processor for real-time AI},
+ author={Fowers, Jeremy and Ovtcharov, Kalin and Papamichael, Michael and Massengill, Todd and Liu, Ming and Lo, Daniel and Alkalay, Shlomi and Haselman, Michael and Adams, Logan and Ghandi, Mahdi and others},
+ booktitle={2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA)},
+ pages={1--14},
+ year={2018},
+ organization={IEEE}
+}
+
+@inproceedings{zhu2018benchmarking,
+ title={Benchmarking and analyzing deep neural network training},
+ author={Zhu, Hongyu and Akrout, Mohamed and Zheng, Bojian and Pelegris, Andrew and Jayarajan, Anand and Phanishayee, Amar and Schroeder, Bianca and Pekhimenko, Gennady},
+ booktitle={2018 IEEE International Symposium on Workload Characterization (IISWC)},
+ pages={88--100},
+ year={2018},
+ organization={IEEE}
+}
+
+@article{samajdar2018scale,
+ title={Scale-sim: Systolic cnn accelerator simulator},
+ author={Samajdar, Ananda and Zhu, Yuhao and Whatmough, Paul and Mattina, Matthew and Krishna, Tushar},
+ journal={arXiv preprint arXiv:1811.02883},
+ year={2018}
+}
+
+@INPROCEEDINGS{munshi2009opencl,
+ author={Munshi, Aaftab},
+ booktitle={2009 IEEE Hot Chips 21 Symposium (HCS)},
+ title={The OpenCL specification},
+ year={2009},
+ volume={},
+ number={},
+ pages={1-314},
+ doi={10.1109/HOTCHIPS.2009.7478342}
+}
+
+@INPROCEEDINGS{luebke2008cuda,
+ author={Luebke, David},
+ booktitle={2008 5th IEEE International Symposium on Biomedical Imaging: From Nano to Macro},
+ title={CUDA: Scalable parallel programming for high-performance scientific computing},
+ year={2008},
+ volume={},
+ number={},
+ pages={836-838},
+ doi={10.1109/ISBI.2008.4541126}
+}
+
+@misc{segal1999opengl,
+ title={The OpenGL graphics system: A specification (version 1.1)},
+ author={Segal, Mark and Akeley, Kurt},
+ year={1999}
+}
+
+@INPROCEEDINGS{gannot1994verilog,
+ author={Gannot, G. and Ligthart, M.},
+ booktitle={International Verilog HDL Conference},
+ title={Verilog HDL based FPGA design},
+ year={1994},
+ volume={},
+ number={},
+ pages={86-92},
+ doi={10.1109/IVC.1994.323743}
+}
+
+@article{binkert2011gem5,
+ title={The gem5 simulator},
+ author={Binkert, Nathan and Beckmann, Bradford and Black, Gabriel and Reinhardt, Steven K and Saidi, Ali and Basu, Arkaprava and Hestness, Joel and Hower, Derek R and Krishna, Tushar and Sardashti, Somayeh and others},
+ journal={ACM SIGARCH computer architecture news},
+ volume={39},
+ number={2},
+ pages={1--7},
+ year={2011},
+ publisher={ACM New York, NY, USA}
+}
+
+
+
+
+
+% ai-acc
+@ARTICLE{Vivet2021, author={Vivet, Pascal and Guthmuller, Eric and Thonnart, Yvain and Pillonnet, Gael and Fuguet, César and Miro-Panades, Ivan and Moritz, Guillaume and Durupt, Jean and Bernard, Christian and Varreau, Didier and Pontes, Julian and Thuries, Sébastien and Coriat, David and Harrand, Michel and Dutoit, Denis and Lattard, Didier and Arnaud, Lucile and Charbonnier, Jean and Coudrain, Perceval and Garnier, Arnaud and Berger, Frédéric and Gueugnot, Alain and Greiner, Alain and Meunier, Quentin L. and Farcy, Alexis and Arriordaz, Alexandre and Chéramy, Séverine and Clermidy, Fabien},
+journal={IEEE Journal of Solid-State Circuits},
+title={IntAct: A 96-Core Processor With Six Chiplets 3D-Stacked on an Active Interposer With Distributed Interconnects and Integrated Power Management},
+year={2021},
+volume={56},
+number={1},
+pages={79-97},
+doi={10.1109/JSSC.2020.3036341}}
+
+
+@article{schuman2022,
+ title={Opportunities for neuromorphic computing algorithms and applications},
+ author={Schuman, Catherine D and Kulkarni, Shruti R and Parsa, Maryam and Mitchell, J Parker and Date, Prasanna and Kay, Bill},
+ journal={Nature Computational Science},
+ volume={2},
+ number={1},
+ pages={10--19},
+ year={2022},
+ publisher={Nature Publishing Group US New York}
+}
+
+
+@article{markovic2020,
+ title={Physics for neuromorphic computing},
+ author={Markovi{\'c}, Danijela and Mizrahi, Alice and Querlioz, Damien and Grollier, Julie},
+ journal={Nature Reviews Physics},
+ volume={2},
+ number={9},
+ pages={499--510},
+ year={2020},
+ publisher={Nature Publishing Group UK London}
+}
+
+@article{furber2016large,
+ title={Large-scale neuromorphic computing systems},
+ author={Furber, Steve},
+ journal={Journal of neural engineering},
+ volume={13},
+ number={5},
+ pages={051001},
+ year={2016},
+ publisher={IOP Publishing}
+}
+
+@article{davies2018loihi,
+ title={Loihi: A neuromorphic manycore processor with on-chip learning},
+ author={Davies, Mike and Srinivasa, Narayan and Lin, Tsung-Han and Chinya, Gautham and Cao, Yongqiang and Choday, Sri Harsha and Dimou, Georgios and Joshi, Prasad and Imam, Nabil and Jain, Shweta and others},
+  journal={IEEE Micro},
+ volume={38},
+ number={1},
+ pages={82--99},
+ year={2018},
+ publisher={IEEE}
+}
+
+@article{davies2021advancing,
+ title={Advancing neuromorphic computing with loihi: A survey of results and outlook},
+ author={Davies, Mike and Wild, Andreas and Orchard, Garrick and Sandamirskaya, Yulia and Guerra, Gabriel A Fonseca and Joshi, Prasad and Plank, Philipp and Risbud, Sumedh R},
+ journal={Proceedings of the IEEE},
+ volume={109},
+ number={5},
+ pages={911--934},
+ year={2021},
+ publisher={IEEE}
+}
+
+@article{modha2023neural,
+ title={Neural inference at the frontier of energy, space, and time},
+ author={Modha, Dharmendra S and Akopyan, Filipp and Andreopoulos, Alexander and Appuswamy, Rathinakumar and Arthur, John V and Cassidy, Andrew S and Datta, Pallab and DeBole, Michael V and Esser, Steven K and Otero, Carlos Ortega and others},
+ journal={Science},
+ volume={382},
+ number={6668},
+ pages={329--335},
+ year={2023},
+ publisher={American Association for the Advancement of Science}
+}
+
+@article{maass1997networks,
+ title={Networks of spiking neurons: the third generation of neural network models},
+ author={Maass, Wolfgang},
+ journal={Neural networks},
+ volume={10},
+ number={9},
+ pages={1659--1671},
+ year={1997},
+ publisher={Elsevier}
+}
+
+@ARTICLE{10242251,
+author={Eshraghian, Jason K. and Ward, Max and Neftci, Emre O. and Wang, Xinxin and Lenz, Gregor and Dwivedi, Girish and Bennamoun, Mohammed and Jeong, Doo Seok and Lu, Wei D.},
+journal={Proceedings of the IEEE},
+title={Training Spiking Neural Networks Using Lessons From Deep Learning},
+year={2023},
+volume={111},
+number={9},
+pages={1016-1054},
+doi={10.1109/JPROC.2023.3308088}}
+
+@article{chua1971memristor,
+ title={Memristor-the missing circuit element},
+ author={Chua, Leon},
+ journal={IEEE Transactions on circuit theory},
+ volume={18},
+ number={5},
+ pages={507--519},
+ year={1971},
+ publisher={IEEE}
+}
+
+@article{shastri2021photonics,
+ title={Photonics for artificial intelligence and neuromorphic computing},
+ author={Shastri, Bhavin J and Tait, Alexander N and Ferreira de Lima, Thomas and Pernice, Wolfram HP and Bhaskaran, Harish and Wright, C David and Prucnal, Paul R},
+ journal={Nature Photonics},
+ volume={15},
+ number={2},
+ pages={102--114},
+ year={2021},
+ publisher={Nature Publishing Group UK London}
+}
+
+@article{haensch2018next,
+title={The next generation of deep learning hardware: Analog computing},
+author={Haensch, Wilfried and Gokmen, Tayfun and Puri, Ruchir},
+journal={Proceedings of the IEEE},
+volume={107},
+number={1},
+pages={108--122},
+year={2018},
+publisher={IEEE}
+}
+
+@article{hazan2021neuromorphic,
+ title={Neuromorphic analog implementation of neural engineering framework-inspired spiking neuron for high-dimensional representation},
+ author={Hazan, Avi and Ezra Tsur, Elishai},
+ journal={Frontiers in Neuroscience},
+ volume={15},
+ pages={627221},
+ year={2021},
+ publisher={Frontiers Media SA}
+}
+
+@article{gates2009flexible,
+ title={Flexible electronics},
+ author={Gates, Byron D},
+ journal={Science},
+ volume={323},
+ number={5921},
+ pages={1566--1567},
+ year={2009},
+ publisher={American Association for the Advancement of Science}
+}
+
+@article{musk2019integrated,
+ title={An integrated brain-machine interface platform with thousands of channels},
+ author={Musk, Elon and others},
+ journal={Journal of medical Internet research},
+ volume={21},
+ number={10},
+ pages={e16194},
+ year={2019},
+ publisher={JMIR Publications Inc., Toronto, Canada}
+}
+
+@article{tang2023flexible,
+ title={Flexible brain--computer interfaces},
+ author={Tang, Xin and Shen, Hao and Zhao, Siyuan and Li, Na and Liu, Jia},
+ journal={Nature Electronics},
+ volume={6},
+ number={2},
+ pages={109--118},
+ year={2023},
+ publisher={Nature Publishing Group UK London}
+}
+
+@article{tang2022soft,
+ title={Soft bioelectronics for cardiac interfaces},
+ author={Tang, Xin and He, Yichun and Liu, Jia},
+ journal={Biophysics Reviews},
+ volume={3},
+ number={1},
+ year={2022},
+ publisher={AIP Publishing}
+}
+@article{kwon2022flexible,
+ title={Flexible sensors and machine learning for heart monitoring},
+ author={Kwon, Sun Hwa and Dong, Lin},
+ journal={Nano Energy},
+ pages={107632},
+ year={2022},
+ publisher={Elsevier}
+}
+
+@article{huang2010pseudo,
+ title={Pseudo-CMOS: A design style for low-cost and robust flexible electronics},
+ author={Huang, Tsung-Ching and Fukuda, Kenjiro and Lo, Chun-Ming and Yeh, Yung-Hui and Sekitani, Tsuyoshi and Someya, Takao and Cheng, Kwang-Ting},
+ journal={IEEE Transactions on Electron Devices},
+ volume={58},
+ number={1},
+ pages={141--150},
+ year={2010},
+ publisher={IEEE}
+}
+@article{biggs2021natively,
+ title={A natively flexible 32-bit Arm microprocessor},
+ author={Biggs, John and Myers, James and Kufel, Jedrzej and Ozer, Emre and Craske, Simon and Sou, Antony and Ramsdale, Catherine and Williamson, Ken and Price, Richard and White, Scott},
+ journal={Nature},
+ volume={595},
+ number={7868},
+ pages={532--536},
+ year={2021},
+ publisher={Nature Publishing Group UK London}
+}
+
+@article{farah2005neuroethics,
+ title={Neuroethics: the practical and the philosophical},
+ author={Farah, Martha J},
+ journal={Trends in cognitive sciences},
+ volume={9},
+ number={1},
+ pages={34--40},
+ year={2005},
+ publisher={Elsevier}
+}
+
+@article{segura2018ethical,
+ title={Ethical implications of user perceptions of wearable devices},
+ author={Segura Anaya, LH and Alsadoon, Abeer and Costadopoulos, Nectar and Prasad, PWC},
+ journal={Science and engineering ethics},
+ volume={24},
+ pages={1--28},
+ year={2018},
+ publisher={Springer}
+}
+
+@article{goodyear2017social,
+ title={Social media, apps and wearable technologies: navigating ethical dilemmas and procedures},
+ author={Goodyear, Victoria A},
+ journal={Qualitative research in sport, exercise and health},
+ volume={9},
+ number={3},
+ pages={285--302},
+ year={2017},
+ publisher={Taylor \& Francis}
+}
+
+@article{roskies2002neuroethics,
+ title={Neuroethics for the new millenium},
+ author={Roskies, Adina},
+ journal={Neuron},
+ volume={35},
+ number={1},
+ pages={21--23},
+ year={2002},
+ publisher={Elsevier}
+}
+
+@article{duarte2022fastml,
+ title={FastML Science Benchmarks: Accelerating Real-Time Scientific Edge Machine Learning},
+ author={Duarte, Javier and Tran, Nhan and Hawks, Ben and Herwig, Christian and Muhizi, Jules and Prakash, Shvetank and Reddi, Vijay Janapa},
+ journal={arXiv preprint arXiv:2207.07958},
+ year={2022}
+}
+
+@article{verma2019memory,
+ title={In-memory computing: Advances and prospects},
+ author={Verma, Naveen and Jia, Hongyang and Valavi, Hossein and Tang, Yinqi and Ozatay, Murat and Chen, Lung-Yen and Zhang, Bonan and Deaville, Peter},
+ journal={IEEE Solid-State Circuits Magazine},
+ volume={11},
+ number={3},
+ pages={43--55},
+ year={2019},
+ publisher={IEEE}
+}
+
+@article{chi2016prime,
+ title={Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory},
+ author={Chi, Ping and Li, Shuangchen and Xu, Cong and Zhang, Tao and Zhao, Jishen and Liu, Yongpan and Wang, Yu and Xie, Yuan},
+ journal={ACM SIGARCH Computer Architecture News},
+ volume={44},
+ number={3},
+ pages={27--39},
+ year={2016},
+ publisher={ACM New York, NY, USA}
+}
+
+
+@article{burr2016recent,
+ title={Recent progress in phase-change memory technology},
+ author={Burr, Geoffrey W and Brightsky, Matthew J and Sebastian, Abu and Cheng, Huai-Yu and Wu, Jau-Yi and Kim, Sangbum and Sosa, Norma E and Papandreou, Nikolaos and Lung, Hsiang-Lan and Pozidis, Haralampos and others},
+ journal={IEEE Journal on Emerging and Selected Topics in Circuits and Systems},
+ volume={6},
+ number={2},
+ pages={146--162},
+ year={2016},
+ publisher={IEEE}
+}
+
+@article{loh20083d,
+ title={3D-stacked memory architectures for multi-core processors},
+ author={Loh, Gabriel H},
+ journal={ACM SIGARCH computer architecture news},
+ volume={36},
+ number={3},
+ pages={453--464},
+ year={2008},
+ publisher={ACM New York, NY, USA}
+}
+
+@article{mittal2021survey,
+ title={A survey of SRAM-based in-memory computing techniques and applications},
+ author={Mittal, Sparsh and Verma, Gaurav and Kaushik, Brajesh and Khanday, Farooq A},
+ journal={Journal of Systems Architecture},
+ volume={119},
+ pages={102276},
+ year={2021},
+ publisher={Elsevier}
+}
+
+@article{wong2012metal,
+ title={Metal--oxide RRAM},
+ author={Wong, H-S Philip and Lee, Heng-Yuan and Yu, Shimeng and Chen, Yu-Sheng and Wu, Yi and Chen, Pang-Shiu and Lee, Byoungil and Chen, Frederick T and Tsai, Ming-Jinn},
+ journal={Proceedings of the IEEE},
+ volume={100},
+ number={6},
+ pages={1951--1970},
+ year={2012},
+ publisher={IEEE}
+}
+
+@inproceedings{imani2016resistive,
+ title={Resistive configurable associative memory for approximate computing},
+ author={Imani, Mohsen and Rahimi, Abbas and Rosing, Tajana S},
+ booktitle={2016 Design, Automation \& Test in Europe Conference \& Exhibition (DATE)},
+ pages={1327--1332},
+ year={2016},
+ organization={IEEE}
+}
+
+@article{miller2000optical,
+ title={Optical interconnects to silicon},
+ author={Miller, David AB},
+ journal={IEEE Journal of Selected Topics in Quantum Electronics},
+ volume={6},
+ number={6},
+ pages={1312--1317},
+ year={2000},
+ publisher={IEEE}
+}
+
+@article{zhou2022photonic,
+ title={Photonic matrix multiplication lights up photonic accelerator and beyond},
+ author={Zhou, Hailong and Dong, Jianji and Cheng, Junwei and Dong, Wenchan and Huang, Chaoran and Shen, Yichen and Zhang, Qiming and Gu, Min and Qian, Chao and Chen, Hongsheng and others},
+ journal={Light: Science \& Applications},
+ volume={11},
+ number={1},
+ pages={30},
+ year={2022},
+ publisher={Nature Publishing Group UK London}
+}
+
+@article{shastri2021photonics,
+ title={Photonics for artificial intelligence and neuromorphic computing},
+ author={Shastri, Bhavin J and Tait, Alexander N and Ferreira de Lima, Thomas and Pernice, Wolfram HP and Bhaskaran, Harish and Wright, C David and Prucnal, Paul R},
+ journal={Nature Photonics},
+ volume={15},
+ number={2},
+ pages={102--114},
+ year={2021},
+ publisher={Nature Publishing Group UK London}
+}
+
+@article{bains2020business,
+ title={The business of building brains},
+ author={Bains, Sunny},
+ journal={Nature Electronics},
+ volume={3},
+ number={7},
+ pages={348--351},
+ year={2020}
+}
+
+@article{Hennessy2019-je,
+ title={A new golden age for computer architecture},
+ author={Hennessy, John L and Patterson, David A},
+ journal={Communications of the ACM},
+ volume={62},
+ number={2},
+ pages={48--60},
+ year={2019},
+ publisher={Association for Computing Machinery (ACM)}
+}
+
+@article{Dongarra2009-na,
+ title={The evolution of high performance computing on {System z}},
+ author={Dongarra, Jack J},
+ journal={IBM Journal of Research and Development},
+ volume={53},
+ pages={3--4},
+ year={2009}
+}
+
+@article{Ranganathan2011-dc,
+ title={From microprocessors to nanostores: Rethinking data-centric systems},
+ author={Ranganathan, Parthasarathy},
+ journal={Computer},
+ volume={44},
+ number={1},
+ pages={39--48},
+ year={2011},
+ publisher={IEEE}
+}
+
+@article{Ignatov2018-kh,
+ title={{AI} Benchmark: Running deep neural networks on {Android} smartphones},
+ author={Ignatov, Andrey and Timofte, Radu and Chou, William and Wang, Ke and Wu, Max and Hartley, Tim and Van Gool, Luc},
+ publisher={arXiv},
+ year={2018}
+}
+
+@inproceedings{jouppi2017datacenter,
+ title={In-datacenter performance analysis of a tensor processing unit},
+ author={Jouppi, Norman P and Young, Cliff and Patil, Nishant and Patterson, David and Agrawal, Gaurav and Bajwa, Raminder and Bates, Sarah and Bhatia, Suresh and Boden, Nan and Borchers, Al and others},
+ booktitle={Proceedings of the 44th Annual International Symposium on Computer Architecture},
+ pages={1--12},
+ year={2017}
+}
+
+@article{Sze2017-ak,
+ title={Efficient processing of deep neural networks: A tutorial and survey},
+ author={Sze, Vivienne and Chen, Yu-Hsin and Yang, Tien-Ju and Emer, Joel},
+ journal={arXiv preprint arXiv:1703.09039},
+ year={2017}
+}
+
+@inproceedings{ignatov2018ai,
+ title={{AI} Benchmark: Running deep neural networks on {Android} smartphones},
+ author={Ignatov, Andrey and Timofte, Radu and Chou, William and Wang, Ke and Wu, Max and Hartley, Tim and Van Gool, Luc},
+ booktitle={Proceedings of the European Conference on Computer Vision (ECCV) Workshops},
+ year={2018}
+}
+
+@inproceedings{lin2022ondevice,
+ title={On-Device Training Under 256KB Memory},
+ author={Lin, Ji and Zhu, Ligeng and Chen, Wei-Ming and Wang, Wei-Chen and Gan, Chuang and Han, Song},
+ booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
+ year={2022}
+}
+
+@article{lin2023awq,
+ title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
+ author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
+ journal={arXiv},
+ year={2023}
+}
+
+@inproceedings{wang2020apq,
+ author={Wang, Tianzhe and Wang, Kuan and Cai, Han and Lin, Ji and Liu, Zhijian and Wang, Hanrui and Lin, Yujun and Han, Song},
+ booktitle={2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
+ title={APQ: Joint Search for Network Architecture, Pruning and Quantization Policy},
+ year={2020},
+ pages={2075--2084},
+ doi={10.1109/CVPR42600.2020.00215}
+}
+
+@inproceedings{Li2020Additive,
+ title={Additive Powers-of-Two Quantization: An Efficient Non-uniform Discretization for Neural Networks},
+ author={Li, Yuhang and Dong, Xin and Wang, Wei},
+ booktitle={International Conference on Learning Representations},
+ year={2020},
+ url={https://openreview.net/forum?id=BkgXT24tDS}
+}
+
+@inproceedings{Norman2017TPUv1,
+author = {Jouppi, Norman P. and Young, Cliff and Patil, Nishant and Patterson, David and Agrawal, Gaurav and Bajwa, Raminder and Bates, Sarah and Bhatia, Suresh and Boden, Nan and Borchers, Al and Boyle, Rick and Cantin, Pierre-luc and Chao, Clifford and Clark, Chris and Coriell, Jeremy and Daley, Mike and Dau, Matt and Dean, Jeffrey and Gelb, Ben and Ghaemmaghami, Tara Vazir and Gottipati, Rajendra and Gulland, William and Hagmann, Robert and Ho, C. Richard and Hogberg, Doug and Hu, John and Hundt, Robert and Hurt, Dan and Ibarz, Julian and Jaffey, Aaron and Jaworski, Alek and Kaplan, Alexander and Khaitan, Harshit and Killebrew, Daniel and Koch, Andy and Kumar, Naveen and Lacy, Steve and Laudon, James and Law, James and Le, Diemthu and Leary, Chris and Liu, Zhuyuan and Lucke, Kyle and Lundin, Alan and MacKean, Gordon and Maggiore, Adriana and Mahony, Maire and Miller, Kieran and Nagarajan, Rahul and Narayanaswami, Ravi and Ni, Ray and Nix, Kathy and Norrie, Thomas and Omernick, Mark and Penukonda, Narayana and Phelps, Andy and Ross, Jonathan and Ross, Matt and Salek, Amir and Samadiani, Emad and Severn, Chris and Sizikov, Gregory and Snelham, Matthew and Souter, Jed and Steinberg, Dan and Swing, Andy and Tan, Mercedes and Thorson, Gregory and Tian, Bo and Toma, Horia and Tuttle, Erick and Vasudevan, Vijay and Walter, Richard and Wang, Walter and Wilcox, Eric and Yoon, Doe Hyun},
+title = {In-Datacenter Performance Analysis of a Tensor Processing Unit},
+year = {2017},
+isbn = {9781450348928},
+publisher = {Association for Computing Machinery},
+address = {New York, NY, USA},
+url = {https://doi.org/10.1145/3079856.3080246},
+doi = {10.1145/3079856.3080246},
+booktitle = {Proceedings of the 44th Annual International Symposium on Computer Architecture},
+pages = {1--12},
+numpages = {12},
+location = {Toronto, ON, Canada},
+series = {ISCA '17}
+}
+
+@article{Norrie2021TPUv2_3,
+ author={Norrie, Thomas and Patil, Nishant and Yoon, Doe Hyun and Kurian, George and Li, Sheng and Laudon, James and Young, Cliff and Jouppi, Norman and Patterson, David},
+ journal={IEEE Micro},
+ title={The Design Process for Google's Training Chips: TPUv2 and TPUv3},
+ year={2021},
+ volume={41},
+ number={2},
+ pages={56--63},
+ doi={10.1109/MM.2021.3058217}
+}
+
+@inproceedings{Jouppi2023TPUv4,
+author = {Jouppi, Norm and Kurian, George and Li, Sheng and Ma, Peter and Nagarajan, Rahul and Nai, Lifeng and Patil, Nishant and Subramanian, Suvinay and Swing, Andy and Towles, Brian and Young, Clifford and Zhou, Xiang and Zhou, Zongwei and Patterson, David A},
+title = {TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings},
+year = {2023},
+isbn = {9798400700958},
+publisher = {Association for Computing Machinery},
+address = {New York, NY, USA},
+url = {https://doi.org/10.1145/3579371.3589350},
+doi = {10.1145/3579371.3589350},
+booktitle = {Proceedings of the 50th Annual International Symposium on Computer Architecture},
+articleno = {82},
+numpages = {14},
+location = {Orlando, FL, USA},
+series = {ISCA '23}
+}
+
@misc{zhou2021analognets,
title = {AnalogNets: ML-HW Co-Design of Noise-robust TinyML Models and Always-On Analog Compute-in-Memory Accelerator},
author = {Chuteng Zhou and Fernando Garcia Redondo and Julian Büchel and Irem Boybat and Xavier Timoneda Comas and S. R. Nandakumar and Shidhartha Das and Abu Sebastian and Manuel Le Gallo and Paul N. Whatmough},