diff --git a/contents/core/benchmarking/benchmarking.bib b/contents/core/benchmarking/benchmarking.bib
index 678f09b23..863e8d386 100644
--- a/contents/core/benchmarking/benchmarking.bib
+++ b/contents/core/benchmarking/benchmarking.bib
@@ -86,7 +86,7 @@ @article{10.1145/3467017
 abstract = {After decades of incentivizing the isolation of hardware, software, and algorithm development, the catalysts for closer collaboration are changing the paradigm.},
 journal = {Commun. ACM},
 month = nov,
-pages = {58–65},
+pages = {58-65},
 numpages = {8}
 }
@@ -397,3 +397,10 @@ @misc{yik2023neurobench
 title = {{NeuroBench:} {Advancing} Neuromorphic Computing through Collaborative, Fair and Representative Benchmarking},
 year = {2023},
 }
+
+@article{tschand2024mlperf,
+ title={MLPerf Power: Benchmarking the Energy Efficiency of Machine Learning Systems from $\mu$Watts to MWatts for Sustainable AI},
+ author={Tschand, Arya and Rajan, Arun Tejusve Raghunath and Idgunji, Sachin and Ghosh, Anirban and Holleman, Jeremy and Kiraly, Csaba and Ambalkar, Pawan and Borkar, Ritika and Chukka, Ramesh and Cockrell, Trevor and others},
+ journal={arXiv preprint arXiv:2410.12032},
+ year={2024}
+}
\ No newline at end of file
diff --git a/contents/core/benchmarking/benchmarking.qmd b/contents/core/benchmarking/benchmarking.qmd
index 345c56203..00ee74ed3 100644
--- a/contents/core/benchmarking/benchmarking.qmd
+++ b/contents/core/benchmarking/benchmarking.qmd
@@ -20,7 +20,7 @@ This chapter will provide an overview of popular ML benchmarks, best practices f
 
 * Understand the purpose and goals of benchmarking AI systems, including performance assessment, resource evaluation, validation, and more.
 
-* Learn about key model benchmarks, metrics, and trends, including accuracy, fairness, complexity, and efficiency.
+* Learn about key model benchmarks, metrics, and trends, including accuracy, fairness, complexity, performance, and energy efficiency.
 
 * Become familiar with the key components of an AI benchmark, including datasets, tasks, metrics, baselines, reproducibility rules, and more.
 
@@ -36,7 +36,7 @@ This chapter will provide an overview of popular ML benchmarks, best practices f
 
 :::
 
-## Introduction {#sec-benchmarking-ai}
+## Introduction
 
 Benchmarking provides the essential measurements needed to drive machine learning progress and truly understand system performance. As the physicist Lord Kelvin famously said, "To measure is to know." Benchmarks allow us to quantitatively know the capabilities of different models, software, and hardware. They allow ML developers to measure the inference time, memory usage, power consumption, and other metrics that characterize a system. Moreover, benchmarks create standardized processes for measurement, enabling fair comparisons across different solutions.
 
@@ -46,6 +46,8 @@ Benchmarking has several important goals and objectives that guide its implement
 
 * **Performance assessment.** This involves evaluating key metrics like a given model's speed, accuracy, and efficiency. For instance, in a TinyML context, it is crucial to benchmark how quickly a voice assistant can recognize commands, as this evaluates real-time performance.
 
+* **Power assessment.** Evaluating the power a workload draws, together with the performance it delivers, characterizes its energy efficiency. As the environmental impact of ML computing continues to grow, benchmarking energy enables us to optimize our systems for sustainability.
+
 * **Resource evaluation.** This means assessing the model's impact on critical system resources, including battery life, memory usage, and computational overhead. A relevant example is comparing the battery drain of two different image recognition algorithms running on a wearable device.
 
 * **Validation and verification.** Benchmarking helps ensure the system functions correctly and meets specified requirements. One way is by checking the accuracy of an algorithm, like a heart rate monitor on a smartwatch, against readings from medical-grade equipment as a form of clinical validation.
@@ -60,7 +62,7 @@ This chapter will cover the 3 types of AI benchmarks, the standard metrics, tool
 
 ## Historical Context
 
-### Standard Benchmarks
+### Performance Benchmarks
 
 The evolution of benchmarks in computing vividly illustrates the industry's relentless pursuit of excellence and innovation. In the early days of computing during the 1960s and 1970s, benchmarks were rudimentary and designed for mainframe computers. For example, the [Whetstone benchmark](https://en.wikipedia.org/wiki/Whetstone_(benchmark)), named after the Whetstone ALGOL compiler, was one of the first standardized tests to measure the floating-point arithmetic performance of a CPU. These pioneering benchmarks prompted manufacturers to refine their architectures and algorithms to achieve better benchmark scores.
 
@@ -70,7 +72,25 @@ The 1990s brought the era of graphics-intensive applications and video games. Th
 
 The 2000s saw a surge in mobile phones and portable devices like tablets. With portability came the challenge of balancing performance and power consumption. Benchmarks like [MobileMark](https://bapco.com/products/mobilemark-2014/) by BAPCo evaluated speed and battery life. This drove companies to develop more energy-efficient System-on-Chips (SOCs), leading to the emergence of architectures like ARM that prioritized power efficiency.
 
-The focus of the recent decade has shifted towards cloud computing, big data, and artificial intelligence. Cloud service providers like Amazon Web Services and Google Cloud compete on performance, scalability, and cost-effectiveness. Tailored cloud benchmarks like [CloudSuite](http://cloudsuite.ch/) have become essential, driving providers to optimize their infrastructure for better services.
+The focus of the recent decade has shifted towards cloud computing, big data, and artificial intelligence. Cloud service providers like Amazon Web Services and Google Cloud compete on performance, scalability, and cost-effectiveness. Tailored cloud benchmarks like [CloudSuite](http://cloudsuite.ch/) have become essential, driving providers to optimize their infrastructure for better services.
+
+### Energy Benchmarks
+
+Energy consumption and environmental concerns have gained prominence in recent years, making power (more precisely, energy) benchmarking increasingly important in the industry. This shift began in the mid-2000s, when processors and systems started running into cooling limits and the growth of internet services made it crucial to scale out large systems efficiently. Since then, energy considerations have expanded to encompass all areas of computing, from personal devices to large-scale data centers.
+
+Power benchmarking aims to measure the energy efficiency of computing systems, evaluating performance in relation to power consumption. This is crucial for several reasons:
+
+* **Environmental impact:** With the growing carbon footprint of the tech industry, there's a pressing need to reduce energy consumption.
+* **Operational costs:** Energy expenses constitute a significant portion of data center operating costs.
+* **Device longevity:** For mobile devices, power efficiency directly impacts battery life and user experience.
+
+Several key benchmarks have emerged in this space:
+
+* **SPEC Power:** Introduced in 2007, [SPEC Power](https://www.spec.org/power/) was one of the first industry-standard benchmarks for evaluating the power and performance characteristics of computer servers.
+* **Green500:** The [Green500](https://top500.org/lists/green500/) list ranks supercomputers by energy efficiency, complementing the performance-focused TOP500 list.
+* **Energy Star:** While not a benchmark per se, the [ENERGY STAR for Computers](https://www.energystar.gov/products/computers) certification program has driven manufacturers to improve the energy efficiency of consumer electronics.
+
+Power benchmarking faces unique challenges, such as accounting for different workloads and system configurations, and measuring power accurately across hardware whose consumption ranges from microwatts to megawatts. As AI and edge computing continue to grow, power benchmarking is likely to become even more critical, driving the development of specialized energy-efficient AI hardware and software optimizations.
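+
+To make the relationship between performance and power concrete, the sketch below combines a simple throughput measurement with externally logged power samples to report inferences per joule. It is only an illustration of the arithmetic involved, not the methodology of any particular benchmark; `run_inference` and the `power_samples_watts` values are hypothetical stand-ins for a real workload and power meter.
+
+```python
+import time
+
+def run_inference(batch):
+    """Hypothetical stand-in for invoking a real model."""
+    time.sleep(0.001)
+
+# Pretend these samples (in watts) were logged by an external power meter
+# while the workload was running.
+power_samples_watts = [3.2, 3.4, 3.3, 3.5, 3.4]
+
+num_inferences = 1000
+start = time.time()
+for _ in range(num_inferences):
+    run_inference(batch=None)
+elapsed_s = time.time() - start
+
+throughput_ips = num_inferences / elapsed_s              # inferences per second
+avg_power_w = sum(power_samples_watts) / len(power_samples_watts)
+energy_j = avg_power_w * elapsed_s                       # energy = power x time
+efficiency_ipj = num_inferences / energy_j               # inferences per joule
+
+print(f"{throughput_ips:.0f} inf/s at {avg_power_w:.2f} W -> {efficiency_ipj:.0f} inf/J")
+```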
 
 ### Custom Benchmarks
 
@@ -197,7 +217,7 @@ Example: Tasks for natural language processing benchmarks might include sentimen
 
 #### Evaluation Metrics
 
-Once a task is defined, benchmarks require metrics to quantify performance. These metrics offer objective measures to compare different models or systems. In classification tasks, metrics like accuracy, precision, recall, and [F1 score](https://en.wikipedia.org/wiki/F-score) are commonly used. Mean squared or absolute errors might be employed for regression tasks.
+Once a task is defined, benchmarks require metrics to quantify performance. These metrics offer objective measures to compare different models or systems. In classification tasks, metrics like accuracy, precision, recall, and [F1 score](https://en.wikipedia.org/wiki/F-score) are commonly used. Mean squared or absolute errors might be employed for regression tasks. We can also measure the power consumed during benchmark execution to report energy efficiency alongside these quality metrics.
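+
+As a small illustration of how the classification metrics above are computed, the sketch below derives accuracy, precision, recall, and F1 from a handful of toy predictions. The label arrays are invented for the example and are not drawn from any benchmark dataset.
+
+```python
+import numpy as np
+
+# Toy binary classification results (1 = positive class).
+y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
+y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])
+
+tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
+fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
+fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
+
+accuracy = np.mean(y_pred == y_true)
+precision = tp / (tp + fp)
+recall = tp / (tp + fn)
+f1 = 2 * precision * recall / (precision + recall)
+
+print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
+```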
 
 #### Baselines and Baseline Models
 
@@ -303,9 +323,11 @@ It is important to carefully consider these factors when designing benchmarks to
 
 Here are some original works that laid the fundamental groundwork for developing systematic benchmarks for training machine learning systems.
 
-*[MLPerf Training Benchmark](https://github.com/mlcommons/training)*
+* *[MLPerf Training Benchmark](https://github.com/mlcommons/training)*
 
-MLPerf is a suite of benchmarks designed to measure the performance of machine learning hardware, software, and services. The MLPerf Training benchmark [@mattson2020mlperf] focuses on the time it takes to train models to a target quality metric. It includes diverse workloads, such as image classification, object detection, translation, and reinforcement learning.
+MLPerf is a suite of benchmarks designed to measure the performance of machine learning hardware, software, and services. The MLPerf Training benchmark [@mattson2020mlperf] focuses on the time it takes to train models to a target quality metric. It includes diverse workloads, such as image classification, object detection, translation, and reinforcement learning. @fig-perf-trend highlights the performance improvements across successive versions of the MLPerf Training benchmark, all of which have outpaced Moore's Law. Standardized benchmarking of this kind lets us rigorously track and showcase the rapid evolution of ML computing.
+
+![MLPerf Training performance trends. Source: @mattson2020mlperf.](images/png/mlperf_perf_trend.png){#fig-perf-trend}
 
 Metrics:
 
@@ -313,7 +335,7 @@ Metrics:
 * Throughput (examples per second)
 * Resource utilization (CPU, GPU, memory, disk I/O)
 
-*[DAWNBench](https://dawn.cs.stanford.edu/benchmark/)*
+* *[DAWNBench](https://dawn.cs.stanford.edu/benchmark/)*
 
 DAWNBench [@coleman2017dawnbench] is a benchmark suite focusing on end-to-end deep learning training time and inference performance. It includes common tasks such as image classification and question answering.
 
@@ -323,7 +345,7 @@ Metrics:
 * Inference latency
 * Cost (in terms of cloud computing and storage resources)
 
-*[Fathom](https://github.com/rdadolf/fathom)*
+* *[Fathom](https://github.com/rdadolf/fathom)*
 
 Fathom [@adolf2016fathom] is a benchmark from Harvard University that evaluates the performance of deep learning models using a diverse set of workloads. These include common tasks such as image classification, speech recognition, and language modeling.
 
@@ -459,6 +481,18 @@ Get ready to put your AI models to the ultimate test! MLPerf is like the Olympic
 
 :::
 
+### Measuring Energy Efficiency
+
+As machine learning capabilities expand, both in training and inference, concerns about increased power consumption and its ecological footprint have intensified. Addressing the sustainability of ML systems, a topic explored in more depth in the [Sustainable AI](../sustainable_ai/sustainable_ai.qmd) chapter, has thus become a key priority. This focus on sustainability has led to the development of standardized benchmarks designed to accurately measure energy efficiency. However, standardizing these methodologies poses challenges due to the need to accommodate vastly different scales—from the microwatt consumption of TinyML devices to the megawatt demands of data center training systems. Moreover, ensuring that benchmarking is fair and reproducible requires accommodating the diverse range of hardware configurations and architectures in use today.
+
+One example is the MLPerf Power benchmarking methodology [@tschand2024mlperf], which tackles these challenges by tailoring the methodologies for datacenter, edge inference, and tiny inference systems while measuring power consumption as comprehensively as possible for each scale. This methodology adapts to a variety of hardware, from general-purpose CPUs to specialized AI accelerators, while maintaining uniform measurement principles to ensure that comparisons are both fair and accurate across different platforms.
+
+@fig-power-diagram illustrates the power measurement boundaries for different system scales, from TinyML devices to inference nodes and training racks. Each example highlights the components within the measurement boundary and those outside it. This setup allows for accurate reflection of the true energy costs associated with running ML workloads across various real-world scenarios, and ensures that the benchmark captures the full spectrum of energy consumption.
+
+![MLPerf Power system measurement diagram. Source: @tschand2024mlperf.](images/png/power_component_diagram.png){#fig-power-diagram}
+
+It is important to note that optimizing a system for performance may not lead to the most energy-efficient execution. Often, sacrificing a small amount of performance or accuracy can yield significant gains in energy efficiency, which highlights the importance of accurately benchmarking power metrics. Future insights from energy efficiency and sustainability benchmarking will enable us to optimize for more sustainable ML systems.
+
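+A toy comparison helps make this trade-off concrete. The numbers below are invented for illustration, but they show how a configuration that gives up a little throughput (for example, by capping power) can still come out well ahead on energy per inference.
+
+```python
+# Hypothetical operating points for the same model on the same accelerator.
+configs = {
+    "max-clocks":   {"throughput_ips": 1000, "power_w": 300},
+    "power-capped": {"throughput_ips": 900, "power_w": 180},
+}
+
+for name, cfg in configs.items():
+    joules_per_inf = cfg["power_w"] / cfg["throughput_ips"]  # W / (inf/s) = J/inf
+    print(f"{name:>12}: {cfg['throughput_ips']} inf/s, "
+          f"{joules_per_inf * 1000:.0f} mJ per inference")
+```
+
+In this made-up example, the power-capped configuration is 10% slower yet consumes roughly a third less energy per inference.
+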
 ### Benchmark Example
 
 To properly illustrate the components of a systems benchmark, we can look at the keyword spotting benchmark in MLPerf Tiny and explain the motivation behind each decision.
 
@@ -505,14 +539,12 @@ But of all these, the most important challenge is benchmark engineering.
 
 #### Hardware Lottery
 
-The hardware lottery, first described by @10.1145/3467017, in benchmarking machine learning systems refers to the situation where the success or efficiency of a machine learning model is significantly influenced by the compatibility of the model with the underlying hardware [@chu2021discovering]. In other words, some models perform exceptionally well because they are a good fit for the particular characteristics or capabilities of the hardware they are run on rather than because they are intrinsically superior models.
+The hardware lottery, first described by @10.1145/3467017, refers to the situation where a machine learning model's success or efficiency is significantly influenced by its compatibility with the underlying hardware [@chu2021discovering]. Some models perform exceptionally well not because they are intrinsically superior, but because they are optimized for specific hardware characteristics, such as the parallel processing capabilities of Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs).
 
 For instance, @fig-hardware-lottery compares the performance of models across different hardware platforms. The multi-hardware models show comparable results to "MobileNetV3 Large min" on both the CPU uint8 and GPU configurations. However, these multi-hardware models demonstrate significant performance improvements over the MobileNetV3 Large baseline when run on the EdgeTPU and DSP hardware. This emphasizes the variable efficiency of multi-hardware models in specialized computing environments.
 
 ![Accuracy-latency trade-offs of multiple ML models and how they perform on various hardware. Source: @chu2021discovering](images/png/hardware_lottery.png){#fig-hardware-lottery}
 
-For instance, certain machine learning models may be designed and optimized to take advantage of the parallel processing capabilities of specific hardware accelerators, such as Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs). As a result, these models might show superior performance when benchmarked on such hardware compared to other models that are not optimized for the hardware.
-
 Hardware lottery can introduce challenges and biases in benchmarking machine learning systems, as the model's performance is not solely dependent on the model's architecture or algorithm but also on the compatibility and synergies with the underlying hardware. This can make it difficult to compare different models fairly and to identify the best model based on its intrinsic merits. It can also lead to a situation where the community converges on models that are a good fit for the popular hardware of the day, potentially overlooking other models that might be superior but incompatible with the current hardware trends.
 
 #### Benchmark Engineering
@@ -567,9 +599,9 @@ Machine learning datasets have a rich history and have evolved significantly ove
 
 #### MNIST (1998)
 
-The [MNIST dataset](https://www.tensorflow.org/datasets/catalog/mnist), created by Yann LeCun, Corinna Cortes, and Christopher J.C. Burges in 1998, can be considered a cornerstone in the history of machine learning datasets. It comprises 70,000 labeled 28x28 pixel grayscale images of handwritten digits (0-9). MNIST has been widely used for benchmarking algorithms in image processing and machine learning as a starting point for many researchers and practitioners.
+The [MNIST dataset](https://www.tensorflow.org/datasets/catalog/mnist), created by Yann LeCun, Corinna Cortes, and Christopher J.C. Burges in 1998, can be considered a cornerstone in the history of machine learning datasets. It comprises 70,000 labeled 28x28 pixel grayscale images of handwritten digits (0-9). MNIST has been widely used for benchmarking algorithms in image processing and machine learning as a starting point for many researchers and practitioners. @fig-mnist shows some examples of handwritten digits.
 
-![MNIST handwritten digits. Source: [Suvanjanprasai.](https://en.wikipedia.org/wiki/File:MnistExamplesModified.png)](images/png/mnist.png){#fig-mnist}
+![MNIST handwritten digits. Source: [Suvanjanprasai](https://en.wikipedia.org/wiki/File:MnistExamplesModified.png)](images/png/mnist.png){#fig-mnist}
 
 #### ImageNet (2009)
 
@@ -579,7 +611,7 @@ Fast forward to 2009, and we see the introduction of the [ImageNet dataset](http
 
 The [Common Objects in Context (COCO) dataset](https://cocodataset.org/) [@lin2014microsoft], released in 2014, further expanded the landscape of machine learning datasets by introducing a richer set of annotations. COCO consists of images containing complex scenes with multiple objects, and each image is annotated with object bounding boxes, segmentation masks, and captions, as shown in @fig-coco. This dataset has been instrumental in advancing research in object detection, segmentation, and image captioning.
 
-![Example images from the COCO dataset. Source: [Coco](https://cocodataset.org/).](images/png/coco.png){#fig-coco}
+![Example images from the COCO dataset. Source: [COCO](https://cocodataset.org/)](images/png/coco.png){#fig-coco}
 
 #### GPT-3 (2020)
 
@@ -774,7 +806,7 @@ Several approaches can be taken to improve data quality. These methods include a
 * **Feature Engineering:** Transforming or creating new features can significantly improve model performance by providing more relevant information for learning.
 * **Data Augmentation:** Augmenting data by creating new samples through various transformations can help improve model robustness and generalization.
 * **Active Learning:** This is a semi-supervised learning approach where the model actively queries a human oracle to label the most informative samples [@coleman2022similarity]. This ensures that the model is trained on the most relevant data.
-* Dimensionality Reduction: Techniques like PCA can reduce the number of features in a dataset, thereby reducing complexity and training time.
+* **Dimensionality Reduction:** Techniques like PCA can reduce the number of features in a dataset, thereby reducing complexity and training time (see the sketch below).
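+
+To illustrate the last item, the sketch below applies principal component analysis (PCA) to a synthetic feature matrix and keeps only enough components to explain most of the variance. The data, the 95% variance threshold, and the use of scikit-learn's `PCA` are illustrative choices for this example rather than a recommendation tied to any particular benchmark.
+
+```python
+import numpy as np
+from sklearn.decomposition import PCA
+
+# Synthetic dataset: 1,000 samples with 50 correlated features
+# generated from only 5 underlying latent factors.
+rng = np.random.default_rng(0)
+latent = rng.normal(size=(1000, 5))
+X = latent @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(1000, 50))
+
+# Keep just enough principal components to explain 95% of the variance.
+pca = PCA(n_components=0.95)
+X_reduced = pca.fit_transform(X)
+
+print("original shape:", X.shape, "reduced shape:", X_reduced.shape)
+print("variance explained:", round(pca.explained_variance_ratio_.sum(), 3))
+```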
 
 There are many other methods in the wild. But the goal is the same. Refining the dataset and ensuring it is of the highest quality can reduce the training time required for models to converge. However, achieving this requires developing and implementing sophisticated methods, algorithms, and techniques that can clean, preprocess, and augment data while retaining the most informative samples. This is an ongoing challenge that will require continued research and innovation in the field of machine learning.
 
diff --git a/contents/core/benchmarking/images/png/mlperf_perf_trend.png b/contents/core/benchmarking/images/png/mlperf_perf_trend.png
new file mode 100644
index 000000000..711915a06
Binary files /dev/null and b/contents/core/benchmarking/images/png/mlperf_perf_trend.png differ
diff --git a/contents/core/benchmarking/images/png/power_component_diagram.png b/contents/core/benchmarking/images/png/power_component_diagram.png
new file mode 100644
index 000000000..2bab340c2
Binary files /dev/null and b/contents/core/benchmarking/images/png/power_component_diagram.png differ