sidenotes added for ml systems, ml workflow, and data engineering cha… #514

Open
wants to merge 8 commits into dev
61,625 changes: 61,625 additions & 0 deletions Machine-Learning-Systems 2.tex

Large diffs are not rendered by default.

61,526 changes: 61,526 additions & 0 deletions Machine-Learning-Systems 3.tex

Large diffs are not rendered by default.

48 changes: 35 additions & 13 deletions contents/core/data_engineering/data_engineering.qmd

Large diffs are not rendered by default.

26 changes: 18 additions & 8 deletions contents/core/efficient_ai/efficient_ai.qmd
@@ -58,17 +58,23 @@ The spectrum from Cloud to TinyML represents a shift from vast, centralized comp

Selecting an optimal model architecture is as crucial as optimizing it. In recent years, researchers have made significant strides in exploring innovative architectures that can inherently have fewer parameters while maintaining strong performance.

**MobileNets:** MobileNets are a family of efficient models for mobile and embedded vision applications [@howard2017mobilenets]. The key idea behind their success is the use of depth-wise separable convolutions, which significantly reduce the number of parameters and computations in the network. MobileNetV2[^mn-v2] and V3 further refine this design by introducing inverted residuals and linear bottlenecks.

[^mn-v2]: MobileNets are useful for real-time applications on embedded systems, such as facial recognition or augmented reality.
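
To make the savings concrete, here is a back-of-the-envelope sketch comparing the parameter count of a standard convolution with that of a depth-wise separable one; the layer sizes are hypothetical and chosen purely for illustration.

```python
# Parameter count for one convolutional layer (bias terms omitted).
# Hypothetical sizes: 3x3 kernels, 128 input channels, 256 output channels.
k, c_in, c_out = 3, 128, 256

standard = k * k * c_in * c_out           # dense 3x3 convolution
separable = k * k * c_in + c_in * c_out   # depth-wise 3x3 + point-wise 1x1

print(f"standard:             {standard:,}")                  # 294,912
print(f"depth-wise separable: {separable:,}")                 # 33,920
print(f"reduction:            {standard / separable:.1f}x")   # ~8.7x
```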

**SqueezeNet:** SqueezeNet is a class of ML models known for its small size without sacrificing accuracy. It achieves this with a "fire module" that reduces the number of input channels fed to the 3x3 filters, thereby reducing the parameter count [@iandola2016squeezenet]. It also employs delayed downsampling, which increases accuracy by maintaining larger feature maps.[^squeeze-net]

[^squeeze-net]: SqueezeNet achieves similar accuracy to AlexNet while being 50 times smaller.
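
A minimal sketch of a fire module in Keras; the squeeze and expand channel counts are left as free parameters rather than the exact values used in the original SqueezeNet.

```python
import tensorflow as tf

def fire_module(x, squeeze_channels, expand_channels):
    # Squeeze: a 1x1 convolution shrinks the number of channels fed to the 3x3 filters.
    s = tf.keras.layers.Conv2D(squeeze_channels, 1, activation="relu")(x)
    # Expand: a mix of cheap 1x1 and more expressive 3x3 convolutions.
    e1 = tf.keras.layers.Conv2D(expand_channels, 1, activation="relu")(s)
    e3 = tf.keras.layers.Conv2D(expand_channels, 3, padding="same", activation="relu")(s)
    return tf.keras.layers.Concatenate()([e1, e3])

# Hypothetical usage on a feature map with 96 channels.
inputs = tf.keras.Input(shape=(56, 56, 96))
outputs = fire_module(inputs, squeeze_channels=16, expand_channels=64)
```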

**ResNet variants:** The Residual Network (ResNet) architecture allows for the introduction of skip connections or shortcuts [@he2016deep]. Some variants of ResNet are designed to be more efficient. For instance, ResNet-SE incorporates the "squeeze and excitation" mechanism to recalibrate feature maps [@hu2018squeeze], while ResNeXt offers grouped convolutions for efficiency [@xie2017aggregated].
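
As an illustration, a minimal squeeze-and-excitation block in Keras might look like the sketch below; the reduction ratio of 16 follows the original paper, but everything else is simplified.

```python
import tensorflow as tf

def se_block(x, reduction=16):
    channels = x.shape[-1]
    # Squeeze: summarize each channel with its global average.
    s = tf.keras.layers.GlobalAveragePooling2D()(x)
    # Excitation: learn a per-channel gate between 0 and 1.
    s = tf.keras.layers.Dense(channels // reduction, activation="relu")(s)
    s = tf.keras.layers.Dense(channels, activation="sigmoid")(s)
    s = tf.keras.layers.Reshape((1, 1, channels))(s)
    # Recalibrate: rescale the original feature map channel by channel.
    return tf.keras.layers.Multiply()([x, s])

# Hypothetical usage on a 64-channel feature map.
inputs = tf.keras.Input(shape=(32, 32, 64))
outputs = se_block(inputs)
```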

## Efficient Model Compression {#sec-efficient-model-compression}

Model compression methods are essential for bringing deep learning models to devices with limited resources. These techniques reduce models' size, energy consumption, and computational demands without significantly losing accuracy. At a high level, they fall into the following fundamental categories:

**Pruning:** We've mentioned pruning a few times in previous chapters but have not yet formally introduced it. Pruning is similar to trimming the branches of a tree. The idea was first proposed in the [Optimal Brain Damage](https://proceedings.neurips.cc/paper/1989/file/6c9882bbac1c7093bd25041881277658-Paper.pdf) paper [@lecun1989optimal] and was later popularized in the context of deep learning by @han2016deep.[^han] In pruning, certain weights or entire neurons are removed from the network based on specific criteria, which can significantly reduce the model size. We will explore the two main pruning strategies, structured and unstructured pruning, in @sec-pruning. @fig-pruning shows an example of neural network pruning: removing some of the nodes in the inner layers (based on specific criteria) reduces the number of edges between the nodes and, in turn, the model's size.

[^han]: Pruning was inspired by biological brain development, in which unused connections between neurons are eliminated over time.

![Neural Network Pruning.](images/jpg/pruning.jpeg){#fig-pruning}
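
The sketch below shows the simplest version of the idea, unstructured magnitude-based pruning, applied to a hypothetical weight matrix; the structured variant and the criteria used in practice are discussed in @sec-pruning.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    # Zero out the fraction `sparsity` of weights with the smallest magnitudes.
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 128)).astype(np.float32)      # hypothetical layer weights
pruned, mask = magnitude_prune(w, sparsity=0.9)
print(f"nonzero weights remaining: {mask.mean():.0%}")  # ~10%
```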

@@ -85,7 +91,9 @@ Model compression methods are essential for bringing deep learning models to dev

In the [Training](../training/training.qmd) chapter, we discussed the process of training AI models. Now, from an efficiency standpoint, it's important to note that training is a resource- and time-intensive task, often requiring powerful hardware and taking anywhere from hours to weeks to complete. Inference, on the other hand, needs to be as fast as possible, especially in real-time applications. This is where efficient inference hardware comes into play. By optimizing the hardware specifically for inference tasks, we can achieve rapid response times and power-efficient operation, which is especially crucial for edge devices and embedded systems.

**TPUs (Tensor Processing Units):** [TPUs](https://cloud.google.com/tpu) are custom ASICs (Application-Specific Integrated Circuits) built by Google to accelerate machine learning workloads [@jouppi2017datacenter].[^work-load] They are optimized for tensor operations, offering high throughput for low-precision arithmetic, and are designed specifically for neural network machine learning. TPUs significantly accelerate model training and inference compared to general-purpose CPUs and GPUs. This boost means faster model training and real-time or near-real-time inference capabilities, which are crucial for applications like voice search and augmented reality.

[^work-load]: In Google's datacenter study, TPUs ran inference roughly 15 to 30 times faster than contemporary CPUs and GPUs, and delivered 30 to 80 times better performance per watt [@jouppi2017datacenter].

[Edge TPUs](https://cloud.google.com/edge-tpu) are a smaller, power-efficient version of Google's TPUs tailored for edge devices. They provide fast on-device ML inferencing for TensorFlow Lite models. Edge TPUs allow for low-latency, high-efficiency inference on edge devices like smartphones, IoT devices, and embedded systems. AI capabilities can be deployed in real-time applications without communicating with a central server, thus saving bandwidth and reducing latency. Consider the table in @fig-edge-tpu-perf. It shows the performance differences between running different models on CPUs versus a Coral USB accelerator. The Coral USB accelerator is an accessory by Google's Coral AI platform that lets developers connect Edge TPUs to Linux computers. Running inference on the Edge TPUs was 70 to 100 times faster than on CPUs.
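
For a sense of what this looks like in code, the sketch below runs a TensorFlow Lite model through the Edge TPU delegate using the `tflite_runtime` package; the model file name and input contents are placeholders, and the model is assumed to have already been compiled for the Edge TPU.

```python
import numpy as np
from tflite_runtime.interpreter import Interpreter, load_delegate

# Hand the compiled model to the Edge TPU delegate instead of running on the CPU.
interpreter = Interpreter(
    model_path="model_edgetpu.tflite",  # placeholder file name
    experimental_delegates=[load_delegate("libedgetpu.so.1")],
)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Feed one dummy input frame and read back the model's prediction.
frame = np.zeros(input_details["shape"], dtype=input_details["dtype"])
interpreter.set_tensor(input_details["index"], frame)
interpreter.invoke()
prediction = interpreter.get_tensor(output_details["index"])
print(prediction.shape)
```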

@@ -99,7 +107,9 @@ Efficient hardware for inference speeds up the process, saves energy, extends ba

## Efficient Numerics {#sec-efficient-numerics}

Machine learning, and especially deep learning, involves enormous amounts of computation. Models can have millions to billions of parameters, often trained on vast datasets. Every operation, every multiplication or addition, demands computational resources. Therefore, the precision of the numbers used in these operations can significantly impact the computational speed, energy consumption, and memory requirements. This is where the concept of efficient numerics comes into play.[^into-play]

[^into-play]: Even though the human brain relies on imprecise neural signals, it still performs extraordinarily well. This efficiency has helped inspire the use of reduced precision in AI models.
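
A quick sketch of why the numerical format matters so much: the same set of weights occupies very different amounts of memory depending on precision. The parameter count below is an approximation for a BERT-base-sized model.

```python
import numpy as np

params = 110_000_000  # roughly the parameter count of BERT-base

for dtype in (np.float32, np.float16, np.int8):
    size_mb = params * np.dtype(dtype).itemsize / 1e6
    print(f"{np.dtype(dtype).name:>7}: {size_mb:,.0f} MB")
# float32: 440 MB, float16: 220 MB, int8: 110 MB
```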

### Numerical Formats {#sec-numerical-formats}

@@ -188,7 +198,9 @@ A deep understanding of model evaluation methods is important to guide this proc

**FLOPs (Floating Point Operations)**, as introduced in [Training](../training/training.qmd), gauge a model's computational demands. For instance, a modern neural network like BERT requires billions of FLOPs per inference, which might be manageable on a powerful cloud server but would be taxing on a smartphone. Higher FLOP counts lead to longer inference times and greater power drain, especially on devices without specialized hardware accelerators. Hence, for real-time applications such as video streaming or gaming, models with lower FLOPs might be more desirable.
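
As a rough illustration of how such numbers are estimated, the sketch below tallies the floating point operations of a small, hypothetical fully connected network, using the common convention of 2 FLOPs per multiply-accumulate; real models like BERT push this into the billions.

```python
def dense_layer_flops(n_in, n_out):
    # One multiply and one add per weight: 2 FLOPs per multiply-accumulate.
    return 2 * n_in * n_out

# Hypothetical three-layer MLP, purely for illustration.
layers = [(784, 512), (512, 512), (512, 10)]
total = sum(dense_layer_flops(n_in, n_out) for n_in, n_out in layers)
print(f"{total:,} FLOPs per forward pass")  # ~1.3 million
```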

**Memory Usage** pertains to how much storage the model requires, affecting both the deploying device's storage and **RAM**[^ram]. Consider deploying a model onto a smartphone: a model that occupies several gigabytes of space not only consumes precious storage but might also be slower due to the need to load large weights into memory. This becomes especially crucial for edge devices like security cameras or drones, where minimal memory footprints are vital for storage and rapid data processing.

[^ram]: Random Access Memory (RAM) acts as a computer's working memory by storing data temporarily while the device is in use. RAM is much faster than storage devices like hard drives, enabling computers to run smoothly and efficiently. Since RAM's contents are erased when the device is powered off, it is ideal for temporary tasks such as running applications or processing data.

**Power Consumption** becomes especially crucial for devices that rely on batteries. For instance, a wearable health monitor using a power-hungry model could drain its battery in hours, rendering it impractical for continuous health monitoring. Optimizing models for low power consumption becomes essential as we move toward an era dominated by IoT devices, where many devices operate on battery power.

@@ -205,8 +217,6 @@ Moreover, the optimal model choice is not always universal but often depends on

Another important consideration is the relationship between model complexity and its practical benefits. Take voice-activated assistants such as "Alexa" or "OK Google." While a complex model might demonstrate a marginally superior understanding of user speech, if it's slower to respond than a simpler counterpart, the user experience could be compromised. Thus, adding layers or parameters does not always equate to better real-world outcomes.

Furthermore, while benchmark datasets such as ImageNet [@russakovsky2015imagenet], COCO [@lin2014microsoft], Visual Wake Words [@chowdhery2019visual], and Google Speech Commands [@warden2018speech] provide standardized performance metrics, they might not capture the diversity and unpredictability of real-world data. Two facial recognition models with similar benchmark scores might exhibit varied competencies when faced with diverse ethnic backgrounds or challenging lighting conditions. Such disparities underscore the importance of robustness and consistency across varied data. For example, @fig-stoves from the Dollar Street dataset shows stove images across extreme monthly incomes. Stoves have different shapes and technological levels across regions and income levels. A model that is not trained on diverse datasets might perform well on a benchmark but fail in real-world applications: if a model were trained only on pictures of stoves found in wealthy countries, it would fail to recognize stoves from poorer regions.

![Different types of stoves. Source: Dollar Street stove images.](images/jpg/ds_stoves.jpg){#fig-stoves}