diff --git a/benchmarking.qmd b/benchmarking.qmd
index cbd998fb..c60ca9d1 100644
--- a/benchmarking.qmd
+++ b/benchmarking.qmd
@@ -99,25 +99,25 @@ Micro-benchmarks in AI are specialized, focusing on the evaluation of distinct c

These types of microbenchmarks zoom into very specific operations or components of the AI pipeline, such as the following:

-- Tensor Operations: Libraries like cuDNN (by NVIDIA) often have benchmarks to measure the performance of individual tensor operations, such as convolutions or matrix multiplications, which are foundational to deep learning computations.
+- Tensor Operations: Libraries like [cuDNN](https://developer.nvidia.com/cudnn) (by NVIDIA) often have benchmarks to measure the performance of individual tensor operations, such as convolutions or matrix multiplications, which are foundational to deep learning computations.

- Activation Functions: Benchmarks that measure the speed and efficiency of various activation functions like ReLU, Sigmoid, or Tanh in isolation.

- Layer Benchmarks: Evaluations of the computational efficiency of distinct neural network layers, such as an LSTM layer or a Transformer block, when operating on standardized input sizes.

-Example: Introduced by Baidu, is a good example of something that asseses the above. DeepBench assesses the performance of basic operations in deep learning models, providing insights into how different hardware platforms handle neural network training and inference.
+Example: [DeepBench](https://github.com/baidu-research/DeepBench), introduced by Baidu, is a good example of a suite that assesses these operations. DeepBench measures the performance of basic operations in deep learning models, providing insights into how different hardware platforms handle neural network training and inference.

**Macro Benchmarks**

-Macro-benchmarks provide a holistic view, assessing the end-to-end performance of entire machine learning models or comprehensive AI systems. Rather than focusing on individual operations, macro-benchmarks evaluate the collective efficacy of models under real-world scenarios or tasks. For example, a macro-benchmark might assess the complete performance of a deep learning model undertaking image classification on a dataset like ImageNet. This includes gauging accuracy, computational speed, and resource consumption. Similarly, one might measure the cumulative time and resources needed to train a natural language processing model on extensive text corpora or evaluate the performance of an entire recommendation system, from data ingestion to final user-specific outputs.
+Macro-benchmarks provide a holistic view, assessing the end-to-end performance of entire machine learning models or comprehensive AI systems. Rather than focusing on individual operations, macro-benchmarks evaluate the collective efficacy of models under real-world scenarios or tasks. For example, a macro-benchmark might assess the complete performance of a deep learning model undertaking image classification on a dataset like [ImageNet](https://www.image-net.org/). This includes gauging accuracy, computational speed, and resource consumption. Similarly, one might measure the cumulative time and resources needed to train a natural language processing model on extensive text corpora or evaluate the performance of an entire recommendation system, from data ingestion to final user-specific outputs.
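+To make the contrast with micro-benchmarks concrete, the sketch below shows the skeleton of a macro-benchmark loop in Python/PyTorch: it runs a complete model over a labeled evaluation set and reports both a quality metric (top-1 accuracy) and a cost metric (average batch latency). The `model` and `dataloader` arguments are placeholders for whatever classifier and validation split you are benchmarking; real suites add far more rigor, such as fixed seeds, warm-up runs, and standardized reporting rules.
+
+```python
+import time
+
+import torch
+
+def macro_benchmark(model, dataloader, device="cpu"):
+    """Evaluate a classifier end to end: accuracy plus average batch latency."""
+    model = model.to(device).eval()
+    correct, total, elapsed = 0, 0, 0.0
+    with torch.no_grad():
+        for images, labels in dataloader:  # any labeled evaluation set
+            images, labels = images.to(device), labels.to(device)
+            start = time.perf_counter()
+            outputs = model(images)  # full forward pass, not an isolated operation
+            elapsed += time.perf_counter() - start
+            correct += (outputs.argmax(dim=1) == labels).sum().item()
+            total += labels.size(0)
+    return {"top1_accuracy": correct / total,
+            "avg_batch_latency_s": elapsed / len(dataloader)}
+```
+
+Several widely used macro-benchmark suites formalize exactly this kind of end-to-end measurement.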
Examples: These benchmarks evaluate the AI model:

-- MLPerf Inference: An industry-standard set of benchmarks for measuring the performance of machine learning software and hardware. MLPerf has a dedicated benchmark for TinyML, called MLPerf Tiny, which focuses on microcontrollers and other resource-constrained devices.
+- [MLPerf Inference](https://github.com/mlcommons/inference): An industry-standard set of benchmarks for measuring the performance of machine learning software and hardware. MLPerf has a dedicated benchmark for TinyML, called MLPerf Tiny, which focuses on microcontrollers and other resource-constrained devices.

-- EEMBC's MLMark: A benchmarking suite for evaluating the performance and power efficiency of embedded devices running machine learning workloads. This benchmark provides insights into how different hardware platforms handle tasks like image recognition or audio processing.
+- [EEMBC's MLMark](https://github.com/eembc/mlmark): A benchmarking suite for evaluating the performance and power efficiency of embedded devices running machine learning workloads. This benchmark provides insights into how different hardware platforms handle tasks like image recognition or audio processing.

-- AI-Benchmark: A benchmarking tool designed for Android devices, it valuates the performance of AI tasks on mobile devices, encompassing various real-world scenarios like image recognition, face parsing, and optical character recognition.
+- [AI-Benchmark](https://ai-benchmark.com/): A benchmarking tool designed for Android devices that evaluates the performance of AI tasks on mobile devices, encompassing various real-world scenarios like image recognition, face parsing, and optical character recognition.

**End-to-end Benchmarks**

@@ -161,7 +161,7 @@ Example: For natural language processing benchmarks, tasks might include sentime

Once a task is defined, benchmarks require metrics to quantify performance. These metrics offer objective measures to compare different models or systems.

-In classification tasks, metrics like accuracy, precision, recall, and F1 score are commonly used. For regression tasks, mean squared error or mean absolute error might be employed.
+In classification tasks, metrics like accuracy, precision, recall, and [F1 score](https://en.wikipedia.org/wiki/F-score) are commonly used. For regression tasks, mean squared error or mean absolute error might be employed.

**Baseline Models**

@@ -199,7 +199,7 @@ The development life cycle of a machine learning model involves two critical pha

Benchmarking the training phase provides insights into how different model architectures, hyperparameter values, and optimization algorithms impact the time and resources needed to train the model. For instance, benchmarking shows how neural network depth affects training time on a given dataset. Benchmarking also reveals how hardware accelerators like GPUs and TPUs can speed up training.

-On the other hand, benchmarking inference evaluates model performance in real-world conditions after deployment. Key metrics include latency, throughput, memory footprint, and power consumption. Inference benchmarking determines if an model meets the requirements of its target application regarding response time and device constraints, which is typically the focus of tinyML but we wil discsuss these broadly to make sure we have a general understanding.
+On the other hand, benchmarking inference evaluates model performance in real-world conditions after deployment.
Key metrics include latency, throughput, memory footprint, and power consumption. Inference benchmarking determines if a model meets the requirements of its target application regarding response time and device constraints, which is typically the focus of TinyML, but we will discuss these metrics broadly to make sure we have a general understanding.

### Training Benchmarks

@@ -221,9 +221,9 @@ Training metrics, when viewed from a systems perspective, offer insights that tr

The following metrics are often considered important:

-1. Training Time: The time taken to train a model from scratch until it reaches a satisfactory performance level. It is a direct measure of the computational resources required to train a model. For example, Google's BERT model is a state-of-the-art natural language processing model that requires several days to train on a massive corpus of text data using multiple GPUs. The long training time is a significant challenge in terms of resource consumption and cost.
+1. Training Time: The time taken to train a model from scratch until it reaches a satisfactory performance level. It is a direct measure of the computational resources required to train a model. For example, [Google's BERT](https://arxiv.org/abs/1810.04805) is a natural language processing model that requires several days to train on a massive corpus of text data using multiple GPUs. The long training time is a significant challenge in terms of resource consumption and cost.

-2. Scalability: How well the training process can handle increases in data size or model complexity. Scalability can be assessed by measuring training time, memory usage, and other resource consumption as data size or model complexity increases. OpenAI's GPT-3 model has 175 billion parameters, making it one of the largest language models in existence. Training GPT-3 required extensive engineering efforts to scale up the training process to handle the massive model size. This involved the use of specialized hardware, distributed training, and other techniques to ensure that the model could be trained efficiently.
+2. Scalability: How well the training process can handle increases in data size or model complexity. Scalability can be assessed by measuring training time, memory usage, and other resource consumption as data size or model complexity increases. [OpenAI's GPT-3](https://arxiv.org/abs/2005.14165) model has 175 billion parameters, making it one of the largest language models in existence. Training GPT-3 required extensive engineering efforts to scale up the training process to handle the massive model size. This involved the use of specialized hardware, distributed training, and other techniques to ensure that the model could be trained efficiently.

3. Resource Utilization: The extent to which the training process utilizes available computational resources such as CPU, GPU, memory, and disk I/O. High resource utilization can indicate an efficient training process, while low utilization can suggest bottlenecks or inefficiencies. For instance, training a convolutional neural network (CNN) for image classification requires significant GPU resources. Utilizing multi-GPU setups and optimizing the training code for GPU acceleration can greatly improve resource utilization and training efficiency.

@@ -237,9 +237,9 @@ The following metrics are often considered important:

8. Fault Tolerance and Robustness: The ability of the training process to handle failures or errors without crashing or producing incorrect results.
This is important for ensuring the reliability of the training process. In a real-world scenario, where a machine learning model is being trained on a distributed system, network failures or hardware malfunctions can occur. In recent years, for instance, it has become abundantly clear that faults that arise from silent data corruption have emerged as a major issue. A fault-tolerant and robust training process can recover from such failures without compromising the integrity of the model.

-9. Ease of Use and Flexibility: The ease with which the training process can be set up and used, as well as its flexibility in handling different types of data and models. In companies like Google, efficiency can sometimes be measured in terms of the GCUs saved or number of SWE years saved, since that translates directly to impact. Ease of use and flexibility can reduce the time and effort required to train a model. TensorFlow and PyTorch are popular machine learning frameworks that provide user-friendly interfaces and flexible APIs for building and training machine learning models. These frameworks support a wide range of model architectures and are equipped with tools that simplify the training process.
+9. Ease of Use and Flexibility: The ease with which the training process can be set up and used, as well as its flexibility in handling different types of data and models. In companies like Google, efficiency can sometimes be measured in terms of the number of Software Engineer (SWE) years saved, since that translates directly to impact. Ease of use and flexibility can reduce the time and effort required to train a model. TensorFlow and PyTorch are popular machine learning frameworks that provide user-friendly interfaces and flexible APIs for building and training machine learning models. These frameworks support a wide range of model architectures and are equipped with tools that simplify the training process.

-10. Reproducibility: The ability to reproduce the results of the training process. Reproducibility is important for verifying the correctness and validity of a model. However, There are often variations due to stochastic network characteristics and this makes it hard to reproduce the precise behavior of applications being trained, and this can present a challenge for benchmarking.
+10. Reproducibility: The ability to reproduce the results of the training process. Reproducibility is important for verifying the correctness and validity of a model. However, variations due to stochastic network characteristics often make it hard to reproduce the precise behavior of applications being trained, which can present a challenge for benchmarking.

By benchmarking for these types of metrics, we can obtain a comprehensive view of the performance and efficiency of the training process from a systems perspective, which can help identify areas for improvement and ensure that resources are used effectively.

@@ -267,7 +267,7 @@ It is important to carefully consider these factors when designing benchmarks to

Here are some original works that laid the fundamental groundwork for developing systematic benchmarks for training machine learning systems.

-*MLPerf Training Benchmark*
+*[MLPerf Training Benchmark](https://github.com/mlcommons/training)*

MLPerf is a suite of benchmarks designed to measure the performance of machine learning hardware, software, and services. The MLPerf Training benchmark focuses on the time it takes to train models to a target quality metric.
It includes a diverse set of workloads, such as image classification, object detection, translation, and reinforcement learning.

Metrics:

- Resource utilization (CPU, GPU, memory, disk I/O)

-*DAWNBench*
+*[DAWNBench](https://dawn.cs.stanford.edu/benchmark/)*

DAWNBench is a benchmark suite that focuses on end-to-end deep learning training time and inference performance. It includes common tasks such as image classification and question answering.

Metrics:

- Cost (in terms of cloud compute and storage resources)

-*Fathom*
+*[Fathom](https://github.com/rdadolf/fathom)*

Fathom is a benchmark from Harvard University that includes a diverse set of workloads to evaluate the performance of deep learning models. It includes common tasks such as image classification, speech recognition, and language modeling.

@@ -353,9 +353,9 @@ Finally, ensuring that the model's predictions are not only accurate but also co

**Tasks**

-By and large, the challenges in picking representative tasks for benchmarking inference machine learning systems are somewhat of the same taxonomy as what we have provided for training (Section XXX). Nevertheless, to be pedantic, let's discuss those in the context of inference machine learning systems.
+By and large, the challenges in picking representative tasks for benchmarking inference machine learning systems follow much the same taxonomy as the one we laid out for training. Nevertheless, for completeness, let's discuss them in the context of inference machine learning systems.

-1. Diversity of Applications: Inference machine learning is employed across numerous domains such as healthcare, finance, entertainment, security, and more. Each domain has its unique tasks, and what's representative in one domain might not be in another. For example, an inference task for predicting stock prices in the financial domain might not be representative for image recognition tasks in the medical domain.
+1. Diversity of Applications: Inference machine learning is employed across numerous domains such as healthcare, finance, entertainment, security, and more. Each domain has its unique tasks, and what's representative in one domain might not be in another. For example, an inference task for predicting stock prices in the financial domain might not be representative of image recognition tasks in the medical domain.

2. Variability in Data Types: Different inference tasks require different types of data -- text, images, videos, numerical data, etc. Ensuring that benchmarks address the wide variety of data types used in real-world applications is challenging. For example, voice recognition systems process audio data, which is vastly different from the visual data processed by facial recognition systems.

@@ -375,7 +375,7 @@ By and large, the challenges in picking representative tasks for benchmarking in

Here are some original works that laid the fundamental groundwork for developing systematic benchmarks for inference machine learning systems.

-*MLPerf Inference Benchmark*
+*[MLPerf Inference Benchmark](https://github.com/mlcommons/inference)*

MLPerf Inference is a comprehensive suite of benchmarks that assess the performance of machine learning models during the inference phase. It encompasses a variety of workloads including image classification, object detection, and natural language processing, aiming to provide standardized and insightful metrics for evaluating different inference systems.
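+To ground the kinds of metrics these suites report in something concrete, below is a minimal, framework-agnostic Python sketch of how single-stream latency and throughput might be measured. The `predict_fn` and `sample` arguments are placeholders for any model's inference call and one representative input; production suites like MLPerf Inference impose far stricter rules (defined query scenarios, accuracy targets, minimum run durations) than this illustration.
+
+```python
+import time
+
+import numpy as np
+
+def benchmark_inference(predict_fn, sample, warmup=10, runs=100):
+    """Report median and tail latency plus derived throughput for one input."""
+    for _ in range(warmup):  # warm-up runs exclude one-time costs (JIT, caches)
+        predict_fn(sample)
+    timings = []
+    for _ in range(runs):
+        start = time.perf_counter()
+        predict_fn(sample)
+        timings.append(time.perf_counter() - start)
+    timings = np.array(timings)
+    return {
+        "p50_latency_ms": 1000 * np.percentile(timings, 50),
+        "p99_latency_ms": 1000 * np.percentile(timings, 99),  # tail latency matters for SLAs
+        "throughput_qps": 1.0 / timings.mean(),
+    }
+```
+
+Tail percentiles are reported alongside the median because a handful of slow queries can violate a latency budget even when the average looks acceptable.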
@@ -391,7 +391,7 @@ Metrics:

- Energy consumption

-*AI Benchmark*
+*[AI Benchmark](https://ai-benchmark.com/)*

AI Benchmark is a benchmarking tool that evaluates the performance of AI and machine learning models on mobile devices and edge computing platforms. It includes tests for image classification, object detection, and natural language processing tasks, providing a detailed analysis of the inference performance on different hardware platforms.

@@ -407,7 +407,7 @@ Metrics:

- Throughput

-*OpenVINO™ toolkit*
+*[OpenVINO™ toolkit](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html)*

OpenVINO™ toolkit provides a benchmark tool to measure the performance of deep learning models for a variety of tasks such as image classification, object detection, and facial recognition on Intel hardware. It offers detailed insights into the inference performance of the models on different hardware configurations.

@@ -451,7 +451,7 @@ Keyword spotting was selected as a task because it is a common usecase in TinyML

**Dataset**

-Google Speech Commands was selected as the best dataset to represent the task. The dataset is well established in the research community and has permissive licensing which allows it to be easily used in a benchmark.
+[Google Speech Commands](https://www.tensorflow.org/datasets/catalog/speech_commands) was selected as the best dataset to represent the task. The dataset is well established in the research community and has permissive licensing, which allows it to be easily used in a benchmark.

**Model**

@@ -463,17 +463,17 @@ Latency was selected as the primary metric for the benchmark, as keyword spottin

**Benchmark Harness**

-MLPerf Tiny uses EEMBCs EnergyRunner™ benchmark harness to load the inputs to the model and to isolate and measure the energy consumption of the device. When measuring energy consumption it's critical to select a harness that is accurate at the expected power levels of the devices under test, and simple enough to not become a burden for participants of the benchmark.
+MLPerf Tiny uses [EEMBC's EnergyRunner™ benchmark harness](https://github.com/eembc/energyrunner) to load the inputs to the model and to isolate and measure the energy consumption of the device. When measuring energy consumption, it's critical to select a harness that is accurate at the expected power levels of the devices under test and simple enough not to become a burden for participants of the benchmark.

**Baseline Submission**

-Baseline submissions are critical for contextualizing results and acting as a reference point to help participants get started. The baseline submission should prioritise simplicity and readability over state of the art performance. The keyword spotting baseline uses a standard STM microcontroller as it's hardware and TensorFlow Lite for Microcontrollers as it's inference framework.
+Baseline submissions are critical for contextualizing results and acting as a reference point to help participants get started. The baseline submission should prioritize simplicity and readability over state-of-the-art performance. The keyword spotting baseline uses a standard [STM microcontroller](https://www.st.com/en/microcontrollers-microprocessors.html) as its hardware and [TensorFlow Lite for Microcontrollers](https://www.tensorflow.org/lite/microcontrollers) as its inference framework.
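+For readers who want to experiment before moving to real hardware, the snippet below is a rough Python sketch of timing a quantized keyword spotting model with the TensorFlow Lite interpreter on a host machine. It is only a desktop proxy for the on-device, energy-instrumented measurement that the EnergyRunner harness performs; the model filename and the int8 input type are assumptions, so substitute the artifacts from your own submission.
+
+```python
+import time
+
+import numpy as np
+import tensorflow as tf
+
+# "kws_int8.tflite" is a hypothetical quantized keyword spotting model file.
+interpreter = tf.lite.Interpreter(model_path="kws_int8.tflite")
+interpreter.allocate_tensors()
+inp = interpreter.get_input_details()[0]
+out = interpreter.get_output_details()[0]
+
+# Random int8 data stands in for preprocessed audio features (e.g., MFCC frames).
+dummy = np.random.randint(-128, 128, size=inp["shape"], dtype=np.int8)
+
+latencies = []
+for _ in range(100):
+    interpreter.set_tensor(inp["index"], dummy)
+    start = time.perf_counter()
+    interpreter.invoke()
+    latencies.append(time.perf_counter() - start)
+    scores = interpreter.get_tensor(out["index"])  # per-keyword scores
+
+print(f"median latency: {1000 * np.median(latencies):.2f} ms")
+```
+
+Numbers gathered this way are not comparable to official MLPerf Tiny results, but they are useful as a quick sanity check while iterating on a model.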
### Challenges and Limitations While benchmarking provides a structured methodology for performance evaluation in complex domains like artificial intelligence and computing, the process also poses several challenges. If not properly addressed, these challenges can undermine the credibility and accuracy of benchmarking results. Some of the predominant difficulties faced in benchmarking include the following: -- Incomplete problem coverage - Benchmark tasks may not fully represent the problem space. For instance, common image classification datasets like CIFAR-10 have limited diversity in image types. Algorithms tuned for such benchmarks may fail to generalize well to real-world datasets. +- Incomplete problem coverage - Benchmark tasks may not fully represent the problem space. For instance, common image classification datasets like [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) have limited diversity in image types. Algorithms tuned for such benchmarks may fail to generalize well to real-world datasets. - Statistical insignificance - Benchmarks must have enough trials and data samples to produce statistically significant results. For example, benchmarking an OCR model on only a few text scans may not adequately capture its true error rates. @@ -487,7 +487,7 @@ But of all these, perhaps the most important challenge is dealing with benchmark **Hardware Lottery** -The "hardware lottery" in benchmarking machine learning systems refers to the situation where the success or efficiency of a machine learning model is significantly influenced by the compatibility of the model with the underlying hardware. In other words, some models perform exceptionally well because they are a good fit for the particular characteristics or capabilities of the hardware on which they are run, rather than because they are intrinsically superior models. Unfortunately, the hardware used is often omitted from papers or given only brief mentions, making reproducing results difficult if not impossible. +The ["hardware lottery"](https://arxiv.org/abs/2009.06489) in benchmarking machine learning systems refers to the situation where the success or efficiency of a machine learning model is significantly influenced by the compatibility of the model with the underlying hardware. In other words, some models perform exceptionally well because they are a good fit for the particular characteristics or capabilities of the hardware on which they are run, rather than because they are intrinsically superior models. Unfortunately, the hardware used is often omitted from papers or given only brief mentions, making reproducing results difficult if not impossible. For instance, certain machine learning models may be designed and optimized to take advantage of parallel processing capabilities of specific hardware accelerators, such as Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs). As a result, these models might show superior performance when benchmarked on such hardware, compared to other models that are not optimized for the hardware. @@ -553,7 +553,7 @@ Machine learning datasets have a rich history and have evolved significantly ove **MNIST (1998)** -The MNIST dataset, created by Yann LeCun, Corinna Cortes, and Christopher J.C. Burges in 1998, can be considered a cornerstone in the history of machine learning datasets. It consists of 70,000 labeled 28x28 pixel grayscale images of handwritten digits (0-9). 
MNIST has been widely used for benchmarking algorithms in image processing and machine learning, serving as a starting point for many researchers and practitioners in the field. +The [MNIST dataset](https://www.tensorflow.org/datasets/catalog/mnist), created by Yann LeCun, Corinna Cortes, and Christopher J.C. Burges in 1998, can be considered a cornerstone in the history of machine learning datasets. It consists of 70,000 labeled 28x28 pixel grayscale images of handwritten digits (0-9). MNIST has been widely used for benchmarking algorithms in image processing and machine learning, serving as a starting point for many researchers and practitioners in the field. ![](vertopal_ee1dac92e9a84172a09aeb26aee803be/media/image2.png){width="5.802083333333333in" height="3.40625in"} @@ -561,7 +561,7 @@ Source: https://en.wikipedia.org/wiki/File:MnistExamplesModified.png **ImageNet (2009)** -Fast forward to 2009, and we see the introduction of the ImageNet dataset, which marked a significant leap in the scale and complexity of datasets. ImageNet consists of over 14 million labeled images spanning more than 20,000 categories. It was developed by Fei-Fei Li and her team with the goal of advancing research in object recognition and computer vision. The dataset became synonymous with the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an annual competition that played a crucial role in the development of deep learning models, including the famous AlexNet in 2012. +Fast forward to 2009, and we see the introduction of the [ImageNet dataset](https://www.tensorflow.org/datasets/catalog/imagenet2012), which marked a significant leap in the scale and complexity of datasets. ImageNet consists of over 14 million labeled images spanning more than 20,000 categories. It was developed by Fei-Fei Li and her team with the goal of advancing research in object recognition and computer vision. The dataset became synonymous with the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an annual competition that played a crucial role in the development of deep learning models, including the famous AlexNet in 2012. ![](vertopal_ee1dac92e9a84172a09aeb26aee803be/media/image7.png){width="6.5in" height="2.5972222222222223in"} @@ -569,7 +569,7 @@ Source: https://cv.gluon.ai/_images/imagenet_banner.jpeg **COCO (2014)** -The Common Objects in Context (COCO) dataset, released in 2014, further expanded the landscape of machine learning datasets by introducing a richer set of annotations. COCO consists of images containing complex scenes with multiple objects, and each image is annotated with object bounding boxes, segmentation masks, and captions. This dataset has been instrumental in advancing research in object detection, segmentation, and image captioning. +The [Common Objects in Context (COCO) dataset](https://cocodataset.org/), released in 2014, further expanded the landscape of machine learning datasets by introducing a richer set of annotations. COCO consists of images containing complex scenes with multiple objects, and each image is annotated with object bounding boxes, segmentation masks, and captions. This dataset has been instrumental in advancing research in object detection, segmentation, and image captioning. ![](vertopal_ee1dac92e9a84172a09aeb26aee803be/media/image1.png){width="6.5in" height="1.9722222222222223in"}