diff --git a/_quarto.yml b/_quarto.yml
index 3ca168cc..62e3be4b 100644
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -10,9 +10,9 @@ website:
icon: star-half
dismissable: true
content: |
- ⭐ [Oct 18] We Hit 1,000 GitHub Stars 🎉 Thanks to you, Arduino and SEEED donated AI hardware kits for TinyML workshops in developing nations!
- 🎓 [Nov 15] The [EDGE AI Foundation](https://www.edgeaifoundation.org/) is **matching academic scholarship funds** for every new GitHub ⭐ (up to 10,000 stars). Click here to show support! 🙏
- 🚀 Our mission. 1 ⭐ = 1 👩🎓 Learner. Every star tells a story: learners gaining knowledge and supporters driving the mission. Together, we're making a difference.
+ 🚀 Our mission. 1 ⭐ = 1 👩🎓 Learner. Every star tells a story: learners gaining knowledge and supporters fueling our mission. Together, we're making a difference. Thank you for your support and happy holidays!
+ 🎓 [Nov 15] The EDGE AI Foundation is matching academic scholarship funds for every new GitHub ⭐ (up to 10,000 stars). Click here to show support! 🙏
+ 📘 [Dec 22] Chapter 3 updated! New revisions include expanded content and improved explanations. Check it out here. 🌟
position: below-navbar
@@ -117,6 +117,7 @@ book:
- contents/core/introduction/introduction.qmd
- contents/core/ml_systems/ml_systems.qmd
- contents/core/dl_primer/dl_primer.qmd
+# - contents/core/dl_architectures/dl_architectures.qmd
- contents/core/workflow/workflow.qmd
- contents/core/data_engineering/data_engineering.qmd
- contents/core/frameworks/frameworks.qmd
@@ -242,7 +243,7 @@ format:
- style-dark.scss
code-block-bg: true
- code-block-border-left: "#A51C30"
+ #code-block-border-left: "#A51C30"
table:
classes: [table-striped, table-hover]
diff --git a/contents/core/dl_architectures/dl_architectures.bib b/contents/core/dl_architectures/dl_architectures.bib
new file mode 100644
index 00000000..e69de29b
diff --git a/contents/core/dl_architectures/dl_architectures.qmd b/contents/core/dl_architectures/dl_architectures.qmd
new file mode 100644
index 00000000..8a8a2676
--- /dev/null
+++ b/contents/core/dl_architectures/dl_architectures.qmd
@@ -0,0 +1,748 @@
+---
+bibliography: dl_architectures.bib
+---
+
+# DL Architectures {#sec-dl_arch}
+
+::: {.content-visible when-format="html"}
+Resources: [Slides](#sec-deep-learning-primer-resource), [Videos](#sec-deep-learning-primer-resource), [Exercises](#sec-deep-learning-primer-resource)
+:::
+
+![_DALL·E 3 Prompt: A visually striking rectangular image illustrating the interplay between deep learning algorithms like CNNs, RNNs, and Attention Networks, interconnected with machine learning systems. The composition features neural network diagrams blending seamlessly with representations of computational systems such as processors, graphs, and data streams. Bright neon tones contrast against a dark futuristic background, symbolizing cutting-edge technology and intricate system complexity._](images/png/cover_dl_arch.png)
+
+::: {.callout-tip}
+
+## Learning Objectives
+
+* Bridge theoretical deep learning concepts with their practical system implementations.
+
+* Understand how different deep learning architectures impact system requirements.
+
+* Compare traditional ML and deep learning approaches from a systems perspective.
+
+* Identify core computational patterns in deep learning and their deployment implications.
+
+* Apply systems thinking to deep learning model development and deployment.
+
+:::
+
+## Overview
+
+coming soon
+
+## Modern Neural Network Architectures
+
+The basic principles of neural networks have evolved into more sophisticated architectures that power today's most advanced AI systems. While the fundamental concepts of neurons, activations, and learning remain the same, modern architectures introduce specialized structures and organizational patterns that dramatically enhance the network's ability to process specific types of data and solve increasingly complex problems.
+
+These architectural innovations reflect both theoretical insights about how networks learn and practical lessons from applying neural networks in the real world. Each new architecture represents a careful balance between computational efficiency, learning capacity, and the specific demands of different problem domains. By examining these architectures, we will see how the field has moved from general-purpose networks to highly specialized designs optimized for particular tasks, while also understanding the implications of these changes for the underlying systems that enable their computation.
+
+### Multi-Layer Perceptrons: Dense Pattern Processing
+
+Multi-Layer Perceptrons (MLPs) represent the most direct extension of neural networks into deep architectures. Unlike more specialized networks, MLPs process each input element with equal importance, making them versatile but computationally intensive. Their architecture, while simple, establishes fundamental computational patterns that appear throughout deep learning systems.
+
+When applied to the MNIST handwritten digit recognition challenge, an MLP reveals its computational power by transforming a complex 28×28 pixel image into a precise digit classification. By treating each of the 784 pixels as an equally weighted input, the network learns to decompose visual information through a systematic progression of layers, converting raw pixel intensities into increasingly abstract representations that capture the essential characteristics of handwritten digits.
+
+#### Algorithmic Structure
+
+The core structure of an MLP is a series of fully-connected layers, where each neuron connects to every neuron in adjacent layers. As shown in @fig-mlp, each layer transforms its input through matrix multiplication followed by element-wise activation:
+
+![MLP layers and their associated matrix representation. Source: @reagen2017deep](images/png/mlp_mm.png){#fig-mlp}
+
+$$
+\mathbf{h}^{(l)} = f(\mathbf{W}^{(l)}\mathbf{h}^{(l-1)} + \mathbf{b}^{(l)})
+$$
+
+The dimensions at each layer illustrate the computation scale:
+
+* Input vector: $\mathbf{h}^{(0)} \in \mathbb{R}^{d_{in}}$
+* Weight matrices: $\mathbf{W}^{(l)} \in \mathbb{R}^{d_{out} \times d_{in}}$
+* Output vector: $\mathbf{h}^{(l)} \in \mathbb{R}^{d_{out}}$
+
+#### Computational Mapping
+
+The fundamental computation in MLPs is dense matrix multiplication, which creates specific computational patterns:
+
+1. **Full Connectivity**: Each output depends on every input element
+ * Complete rows and columns must be accessed
+ * All-to-all communication pattern
+ * High memory bandwidth requirement
+
+2. **Batch Processing**: Multiple inputs processed simultaneously via matrix-matrix multiply:
+ $$
+ \mathbf{H}^{(l)} = f(\mathbf{H}^{(l-1)}\mathbf{W}^{(l)} + \mathbf{b}^{(l)})
+ $$
+ where $\mathbf{H}^{(l)} \in \mathbb{R}^{B \times d_{out}}$ for batch size B
+
+While this mathematical view is elegant, its actual implementation reveals a more detailed computational reality:
+
+```python
+# Mathematical abstraction
+def mlp_layer_matrix(X, W, b):
+    H = activation(matmul(X, W) + b)  # One clean line of math
+    return H
+
+# System reality
+def mlp_layer_compute(X, W, b):
+    # X: [batch_size, num_inputs], W: [num_inputs, num_outputs], b: [num_outputs]
+    batch_size, num_inputs = X.shape
+    num_outputs = W.shape[1]
+    Z = zeros((batch_size, num_outputs))
+
+    for batch in range(batch_size):
+        for out in range(num_outputs):
+            Z[batch, out] = b[out]
+            for in_ in range(num_inputs):
+                Z[batch, out] += X[batch, in_] * W[in_, out]
+
+    H = activation(Z)
+    return H
+```
+
+This translation from mathematical abstraction to concrete computation exposes how dense matrix multiplication decomposes into nested loops of simpler operations. The clean mathematical notation of $\mathbf{W}\mathbf{x}$ becomes hundreds of individual multiply-accumulate operations, each requiring multiple memory accesses. These patterns fundamentally influence system design, creating both challenges in implementation and opportunities for optimization.
+
+#### System Implications
+
+The implementation of MLPs presents several key challenges and opportunities that shape system design.
+
+##### System Challenges
+
+1. **Memory Bandwidth Pressure**
+ * Each output neuron must read all input values and weights. For MNIST, this means 784 memory accesses for inputs and another 784 for weights per output.
+ * Weights must be accessed repeatedly. In a batch of 32 images, each weight is read 32 times.
+ * The system must constantly write intermediate results to memory. For a layer with 128 outputs, this means 128 writes per image.
+ These intensive memory operations create a fundamental bottleneck. For just one fully-connected layer processing a batch of MNIST images, the system performs over 50,000 memory operations.
+
+2. **Computation Volume**
+ * Each output requires hundreds of multiply-accumulate (MAC) operations. For MNIST, each output performs 784 multiply-accumulates.
+ * These operations repeat for every output neuron in each layer. A layer with 128 outputs performs over 100,000 multiply-accumulates per image.
+ * A typical network processes millions of operations per input. Even a modest three-layer network can require over a million operations per image.
+ The massive number of operations means even small inefficiencies can significantly impact performance, as the rough arithmetic sketch below illustrates.
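+
+To make these counts concrete, the back-of-the-envelope arithmetic below tallies multiply-accumulates and memory accesses for a single 784-input, 128-output fully-connected layer on a batch of 32 MNIST images. The layer sizes come from the example above; the counting assumes the naive nested-loop implementation with no caching or data reuse.
+
+```python
+# Rough operation counts for one fully-connected layer (784 -> 128)
+# processing a batch of 32 MNIST images. Assumes the naive nested-loop
+# implementation shown earlier, with no caching or reuse.
+batch_size, num_inputs, num_outputs = 32, 784, 128
+
+macs = batch_size * num_outputs * num_inputs          # multiply-accumulates
+input_reads = batch_size * num_outputs * num_inputs   # one X read per MAC
+weight_reads = batch_size * num_outputs * num_inputs  # one W read per MAC
+output_writes = batch_size * num_outputs              # one Z write per output
+
+# Even counting each value only once, the layer touches well over 50,000 values.
+unique_values = (batch_size * num_inputs + num_inputs * num_outputs
+                 + batch_size * num_outputs)
+
+print(f"MACs per batch:        {macs:,}")             # 3,211,264
+print(f"Naive memory accesses: {input_reads + weight_reads + output_writes:,}")
+print(f"Unique values touched: {unique_values:,}")    # 129,536
+```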
+
+##### System Opportunities
+
+1. **Regular Computation Structure**
+ * The core operation is a simple multiply-accumulate. This enables specialized hardware units optimized for this specific operation.
+ * Memory access patterns are predictable. The system knows it needs all 784 inputs and weights in a fixed order.
+ * The same operations repeat throughout the network. A single optimization in the multiply-accumulate unit gets reused millions of times.
+ * These patterns remain consistent across different network sizes. Whether processing 28×28 MNIST images or 224×224 ImageNet images, the basic computational pattern stays the same.
+
+2. **Parallelism Potential**
+ * Input samples can process independently. A batch of 32 MNIST images can process on 32 separate units.
+ * Output neurons have no dependencies. All 128 outputs in a layer can compute in parallel.
+ * Individual multiply-accumulates can execute together. A vector unit could process 8 or 16 multiplications at once.
+ * Layers operate independently. While one layer processes batch 1, another layer can start on batch 0.
+
+These challenges and opportunities drive the development of specialized neural processing engines for machine learning systems. While memory bandwidth limitations push designs toward sophisticated memory hierarchies (needing to handle >50,000 memory operations efficiently), the regular patterns and parallel opportunities enable efficient implementations through specialized processing units. The patterns established by MLPs form a baseline that more specialized neural architectures must consider in their implementations.
+
+:::{.callout-caution #exr-mlp collapse="false"}
+
+##### Multilayer Perceptrons (MLPs)
+
+We've just scratched the surface of neural networks. Now you'll apply these concepts in practical examples. In the provided Colab notebooks, you'll explore:
+
+**Predicting house prices:** Learn how neural networks can analyze housing data to estimate property values.
+[![](https://colab.research.google.com/assets/colab-badge.png)](https://colab.research.google.com/github/Mjrovai/UNIFEI-IESTI01-TinyML-2022.1/blob/main/00_Curse_Folder/1_Fundamentals/Class_07/TF_Boston_Housing_Regression.ipynb)
+
+**Image Classification:** Discover how to build a network to understand the famous MNIST handwritten digit dataset.
+[![](https://colab.research.google.com/assets/colab-badge.png)](https://colab.research.google.com/github/Mjrovai/UNIFEI-IESTI01-TinyML-2022.1/blob/main/00_Curse_Folder/1_Fundamentals/Class_09/TF_MNIST_Classification_v2.ipynb)
+
+**Real-world medical diagnosis:** Use deep learning to tackle the important task of breast cancer classification.
+[![](https://colab.research.google.com/assets/colab-badge.png)](https://colab.research.google.com/github/Mjrovai/UNIFEI-IESTI01-TinyML-2022.1/blob/main/00_Curse_Folder/1_Fundamentals/Class_13/docs/WDBC_Project/Breast_Cancer_Classification.ipynb)
+
+:::
+
+### Convolutional Neural Networks: Spatial Pattern Processing
+
+Convolutional Neural Networks (CNNs) represent a specialized neural architecture designed to efficiently process data with spatial relationships, such as images. While MLPs treat each input independently, CNNs exploit local patterns and spatial hierarchies, establishing computational patterns that have revolutionized computer vision and spatial data processing.
+
+In the MNIST digit recognition task, CNNs demonstrate their unique approach by recognizing that visual information contains inherent spatial dependencies. Using sliding kernels that move across the image, these networks detect local features like edges, curves, and intersections specific to handwritten digits. This spatially-aware method allows the network to learn and distinguish digit characteristics by focusing on their most distinctive local patterns, fundamentally different from the uniform processing of MLPs.
+
+#### Algorithmic Structure
+
+The core innovation of CNNs is the convolutional layer, where each neuron processes only a local region of the input. As shown in @fig-cnn, a kernel (or filter) slides across the input data, performing the same transformation at each position:
+
+::: {.content-visible when-format="html"}
+![Convolution operation, image data (blue) and 3x3 kernel (green). Source: V. Dumoulin, F. Visin, MIT](images/gif/cnn.gif){#fig-cnn}
+:::
+
+::: {.content-visible when-format="pdf"}
+![Convolution operation, image data (blue) and 3x3 kernel (green). Source: V. Dumoulin, F. Visin, MIT](images/png/cnn.png){#fig-cnn}
+:::
+
+For a convolutional layer with kernel $\mathbf{K}$, the computation at each spatial position $(i,j)$ is:
+
+$$
+\mathbf{h}^{(l)}_{i,j} = f(\sum_{p,q} \mathbf{K}_{p,q} \cdot \mathbf{h}^{(l-1)}_{i+p,j+q} + b)
+$$
+
+where the dimensions depend on the layer configuration:
+
+* Input: $\mathbf{h}^{(l-1)} \in \mathbb{R}^{H \times W \times C_{in}}$
+* Kernel: $\mathbf{K} \in \mathbb{R}^{k \times k \times C_{in} \times C_{out}}$
+* Output: $\mathbf{h}^{(l)} \in \mathbb{R}^{H' \times W' \times C_{out}}$
+
+#### Computational Mapping
+
+The fundamental computation in CNNs is the convolution operation, which exhibits distinct patterns:
+
+1. **Sliding Window**: The kernel moves systematically across spatial dimensions, creating regular access patterns.
+2. **Weight Sharing**: The same kernel weights are reused at each position, reducing parameter count.
+3. **Local Connectivity**: Each output depends only on a small neighborhood of input values.
+
+For a batch of B inputs, the convolution operation processes multiple images simultaneously:
+
+$$
+\mathbf{H}^{(l)}_{b,i,j} = f(\sum_{p,q} \mathbf{K}_{p,q} \cdot \mathbf{H}^{(l-1)}_{b,i+p,j+q} + b)
+$$
+
+where $b$ indexes the batch dimension.
+
+The convolution operation involves systematically applying a kernel across an input. While mathematically elegant, its implementation reveals the complex patterns of data movement and computation:
+
+```python
+# Mathematical abstraction - simple and clean
+def conv_layer_math(input, kernel, bias):
+    output = convolution(input, kernel) + bias
+    return activation(output)
+
+# System reality - nested loops of computation
+def conv_layer_compute(input, kernel, bias):
+    # input: [batch, height, width, in_channels]
+    # kernel: [kernel_height, kernel_width, in_channels, out_channels]
+    batch_size, height, width, num_input_channels = input.shape
+    kernel_height, kernel_width, _, num_output_channels = kernel.shape
+    # Output size for stride 1 and no padding
+    out_height = height - kernel_height + 1
+    out_width = width - kernel_width + 1
+    output = zeros((batch_size, out_height, out_width, num_output_channels))
+
+    # Loop 1: Process each image in batch
+    for image in range(batch_size):
+
+        # Loop 2&3: Move across image spatially
+        for y in range(out_height):
+            for x in range(out_width):
+
+                # Loop 4: Compute each output feature
+                for out_channel in range(num_output_channels):
+                    result = bias[out_channel]
+
+                    # Loop 5&6: Move across kernel window
+                    for ky in range(kernel_height):
+                        for kx in range(kernel_width):
+
+                            # Loop 7: Process each input feature
+                            for in_channel in range(num_input_channels):
+                                # Get input value from correct window position
+                                in_y = y + ky
+                                in_x = x + kx
+                                # Perform multiply-accumulate operation
+                                result += input[image, in_y, in_x, in_channel] * \
+                                          kernel[ky, kx, in_channel, out_channel]
+
+                    # Store result for this output position
+                    output[image, y, x, out_channel] = result
+
+    return output
+```
+
+This implementation, while simplified, exposes the core patterns in convolution. The seven nested loops reveal different aspects of the computation:
+
+* Outer loops (1-3) manage position: which image and where in the image
+* Middle loop (4) handles output features: computing different learned patterns
+* Inner loops (5-7) perform the actual convolution: sliding the kernel window
+
+Each level creates different system implications. The outer loops show where we can parallelize across images or spatial positions. The inner loops reveal opportunities for data reuse, as adjacent positions share input values. While real implementations use sophisticated optimizations, these basic patterns drive the key design decisions in CNN execution.
+
+#### System Implications
+
+The implementation of CNNs reveals a distinct set of challenges and opportunities that fundamentally shape system design.
+
+**Data Access Complexity:** The sliding window nature of convolution creates complex data access patterns. Consider a basic 3×3 convolution kernel. Each output computation requires reading 9 input values per channel. For a simple MNIST image with one channel, this means 9 values per output pixel. The complexity grows substantially with modern CNNs. A 3×3 kernel operating on a 224×224 RGB image with 64 output channels must access 1,728 values (9 × 3 × 64) for each output position. Moreover, these windows overlap, meaning adjacent outputs need much of the same input data.
+
+**Computation Organization:** The computational workload in CNNs follows the sliding window pattern. Each position requires a complete set of kernel computations. Even for a modest 3×3 kernel on MNIST, this means 9 multiply-accumulate operations per output. The computational demands scale dramatically with channels. In a modern network, a 3×3 kernel with 64 input and output channels requires 36,864 multiply-accumulate operations (3 × 3 × 64 × 64) per window position.
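+
+A few lines of arithmetic, sketched below, show how quickly these per-window costs multiply across a full feature map. The stride of 1 and 'same' padding are assumptions added here so that the output keeps the input's spatial size.
+
+```python
+# Per-position and per-image multiply-accumulate counts for a 3x3 convolution
+# with 64 input and 64 output channels on a 224x224 feature map.
+# Assumes stride 1 and 'same' padding, so the output stays 224x224.
+kernel_h = kernel_w = 3
+in_channels = out_channels = 64
+height = width = 224
+
+macs_per_position = kernel_h * kernel_w * in_channels * out_channels
+macs_per_image = macs_per_position * height * width
+
+print(f"MACs per output position: {macs_per_position:,}")   # 36,864
+print(f"MACs per image:           {macs_per_image:,}")      # ~1.85 billion
+```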
+
+**Spatial Locality:** CNNs exhibit strong spatial data reuse patterns. In a 3×3 convolution, each input value contributes to 9 different output computations. This natural overlap creates opportunities for efficient caching strategies. A well-designed memory hierarchy can reduce memory bandwidth by a factor of 9 for a 3×3 kernel. The potential for reuse grows with kernel size - a 5×5 kernel allows each input value to be reused up to 25 times.
+
+**Computation Structure:** The computational pattern of CNNs offers multiple paths to parallelism. The same kernel weights apply across the entire image, simplifying weight management. Within each window, the 9 multiplications of a 3×3 kernel can execute in parallel. The computation can also parallelize across output channels and spatial positions, as these calculations are independent.
+
+##### System Design Impact
+
+These patterns create fundamentally different system demands than MLPs. While MLPs struggle with global data access patterns, CNNs must efficiently manage sliding windows of computation. The opportunity for data reuse is significantly higher in CNNs, but capturing this reuse requires more sophisticated memory systems. Many CNN accelerators specifically target these patterns through specialized memory hierarchies and computational arrays that align with the natural structure of convolution operations.
+
+### Recurrent Neural Networks: Sequential Pattern Processing
+
+Recurrent Neural Networks (RNNs) introduce a fundamental shift in deep learning architecture by maintaining state across time steps. Unlike MLPs and CNNs that process fixed-size inputs, RNNs handle variable-length sequences by reusing weights across time steps while maintaining an internal memory state. This architecture enables sequential data processing but introduces new computational patterns and dependencies.
+
+RNNs excel in processing sequential data that changes over time. In tasks like predicting stock prices or analyzing sentiment in customer reviews, RNNs demonstrate their unique ability to process data as a temporal sequence. By maintaining an internal hidden state that captures information from previous time steps, these networks can recognize patterns that unfold over time. For instance, in sentiment analysis, an RNN can progressively understand the emotional trajectory of a sentence by processing each word in context, allowing it to capture nuanced changes in tone and meaning that traditional fixed-input models might miss.
+
+#### Algorithmic Structure
+
+The core innovation of RNNs is the recurrent connection, where each step's computation depends on both the current input and the previous hidden state. As shown in @fig-rnn, the same weights are applied repeatedly across the sequence:
+
+![RNN unrolled.](images/jpg/rnn_unrolled.jpg){#fig-rnn}
+
+For a sequence of T time steps, each step t computes:
+
+$$
+\mathbf{h}_t = f(\mathbf{W}_h\mathbf{h}_{t-1} + \mathbf{W}_x\mathbf{x}_t + \mathbf{b})
+$$
+
+The dimensions reveal the computation structure:
+
+* Input vector: $\mathbf{x}_t \in \mathbb{R}^{d_{in}}$
+* Hidden state: $\mathbf{h}_t \in \mathbb{R}^{d_h}$
+* Weight matrices: $\mathbf{W}_h \in \mathbb{R}^{d_h \times d_h}$, $\mathbf{W}_x \in \mathbb{R}^{d_h \times d_{in}}$
+
+#### Computational Mapping
+
+RNN computation exhibits distinct patterns due to its sequential nature:
+
+1. **State Update**: Each time step combines current input with previous state
+ * Two matrix multiplications per step
+ * State must be maintained across sequence
+ * Sequential dependency limits parallelization
+
+2. **Batch Processing**: Sequences processed in parallel with shape transformations:
+ $$
+ \mathbf{H}_t = f(\mathbf{H}_{t-1}\mathbf{W}_h + \mathbf{X}_t\mathbf{W}_x + \mathbf{b})
+ $$
+ where $\mathbf{H}_t \in \mathbb{R}^{B \times d_h}$ for batch size B
+
+The translation from mathematical abstraction to implementation reveals how these patterns manifest in practice:
+
+```python
+# Mathematical abstraction - clean and stateless
+def rnn_step_math(x_t, h_prev, W_h, W_x, b):
+    h_t = activation(matmul(h_prev, W_h) + matmul(x_t, W_x) + b)
+    return h_t
+
+# System reality - managing state and sequence
+def rnn_sequence_compute(X, W_h, W_x, b):
+    # X: [batch_size, sequence_length, input_size], W_h: [hidden_size, hidden_size]
+    batch_size, sequence_length, _ = X.shape
+    hidden_size = W_h.shape[0]
+    outputs = zeros((batch_size, sequence_length, hidden_size))
+
+    # Process each sequence in batch
+    for batch in range(batch_size):
+        # Hidden state must be maintained
+        h = zeros(hidden_size)
+
+        # Steps must process sequentially
+        for t in range(sequence_length):
+            # State update computation
+            h_prev = h
+            h = activation(matmul(h_prev, W_h) +
+                           matmul(X[batch, t], W_x) + b)
+            outputs[batch, t] = h
+
+    return outputs
+```
+
+While real implementations use sophisticated optimizations, this simplified code reveals how RNNs must carefully manage state and sequential dependencies.
+
+#### System Implications
+
+The implementation of RNNs reveals unique challenges and opportunities that shape system design. Consider processing a batch of text sequences, each 100 words long, with a hidden state size of 256:
+
+**Sequential Data Dependencies:** The fundamental challenge in RNNs stems from their sequential nature. Each time step must wait for the completion of the previous step's computation, creating a chain of dependencies throughout the sequence. For a 100-word input, this means 100 strictly sequential operations, each producing a 256-dimensional hidden state that must be preserved and passed forward. This creates a computational pipeline that cannot be parallelized across time steps, fundamentally limiting system throughput.
+
+**Variable Computation and Storage:** Unlike MLPs or CNNs where computation is fixed, RNNs must handle varying sequence lengths. A batch might contain sequences ranging from a few words to hundreds of words, each requiring a different amount of computation. This variability creates complex resource management challenges. A system processing sequences of length 100 might suddenly need to accommodate sequences of length 500, requiring five times more computation and memory. Load balancing becomes particularly challenging when different sequences in a batch have widely varying lengths.
+
+**Weight Reuse Structure:** RNNs offer significant opportunities for weight reuse. The same weight matrices (W_h and W_x) are applied at every time step in the sequence. For a 100-word sequence, each weight is reused 100 times. With a hidden state size of 256, the W_h matrix has only 65,536 parameters (256 × 256), yet each of those parameters participates in a multiply-accumulate at every one of the 100 time steps, amounting to millions of operations per sequence. This intensive reuse makes weight-stationary architectures particularly effective for RNN computation.
+
+**Regular Computation Pattern:** Despite their sequential nature, RNNs exhibit highly regular computation patterns. Each time step performs identical operations: two matrix multiplications with fixed dimensions followed by an activation function. This regularity means that once a system is optimized for a single step's computation, that optimization applies throughout the sequence. The matrix sizes remain constant (256 × 256 for W_h, 256 × d_in for W_x), enabling efficient hardware utilization despite the sequential constraints.
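+
+The sketch below works through these numbers for the running example (hidden state of 256, 100-step sequences). The input size of 128 is an assumption added here purely for illustration.
+
+```python
+# Per-step and per-sequence operation counts for the RNN example:
+# hidden size 256, sequence length 100. The input size (128) is an
+# illustrative assumption, not taken from the text.
+hidden_size = 256
+input_size = 128        # assumed for illustration
+seq_length = 100
+
+w_h_params = hidden_size * hidden_size             # 65,536
+w_x_params = hidden_size * input_size              # 32,768
+
+macs_per_step = w_h_params + w_x_params            # every weight used once per step
+macs_per_sequence = macs_per_step * seq_length
+
+print(f"W_h parameters:    {w_h_params:,}")
+print(f"MACs per step:     {macs_per_step:,}")      # 98,304
+print(f"MACs per sequence: {macs_per_sequence:,}")  # ~9.8 million, all sequential
+```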
+
+These patterns create fundamentally different demands than MLPs or CNNs. While MLPs focus on parallel matrix multiplication and CNNs on spatial data reuse, RNNs must carefully manage sequential state while maximizing batch-level parallelism. This has led to specialized architectures that pipeline sequential operations while exploiting weight reuse across time steps.
+
+Modern systems address these challenges through deep pipelines that overlap sequential computations, sophisticated batch scheduling for varying sequence lengths, and hierarchical memory systems optimized for weight reuse. The patterns established by RNNs highlight how sequential dependencies can fundamentally reshape system architecture decisions, even while maintaining the core mathematical operations common to deep learning.
+
+### Attention Mechanisms: Dynamic Information Flow
+
+Attention mechanisms revolutionized deep learning by enabling dynamic, content-based information processing. Unlike the fixed patterns of MLPs, CNNs, or RNNs, attention allows each element in a sequence to selectively focus on relevant parts of the input. This dynamic routing of information creates new computational patterns that scale differently from traditional architectures.
+
+In language translation tasks, attention mechanisms reveal their power by allowing the network to dynamically align and focus on relevant words across different languages. Instead of processing a sentence sequentially with fixed weights, an attention-based model can simultaneously consider multiple parts of the input, creating contextually rich representations that capture subtle linguistic nuances. This approach enables more accurate and contextually aware translations by letting the model dynamically determine which parts of the input are most relevant for generating each output word.
+
+#### Mathematical Structure
+
+The core operation in attention computes relevance scores between pairs of elements in sequences. Each output is a weighted combination of all inputs, where weights are learned from content:
+
+$$
+\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}})\mathbf{V}
+$$
+
+The dimensions highlight the scale of computation:
+
+* Queries: $\mathbf{Q} \in \mathbb{R}^{N \times d_k}$
+* Keys: $\mathbf{K} \in \mathbb{R}^{N \times d_k}$
+* Values: $\mathbf{V} \in \mathbb{R}^{N \times d_v}$
+* Output: $\mathbf{O} \in \mathbb{R}^{N \times d_v}$
+
+#### Computational Mapping
+
+Attention computation exhibits distinct patterns:
+
+1. **All-to-All Interaction**: Every query computes relevance with every key
+ * Quadratic scaling with sequence length
+ * Large intermediate matrices
+ * Global information access
+
+2. **Batch Processing**: Multiple sequences processed simultaneously:
+ $$
+ \mathbf{A} = \text{softmax}(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}) \in \mathbb{R}^{B \times N \times N}
+ $$
+ where B is batch size and A is the attention matrix
+
+The translation from mathematical abstraction to implementation reveals the computational complexity:
+
+```python
+# Mathematical abstraction - clean matrix operations
+def attention_math(Q, K, V):
+    scores = matmul(Q, K.transpose()) / sqrt(d_k)
+    weights = softmax(scores)
+    output = matmul(weights, V)
+    return output
+
+# System reality - explicit computation
+def attention_compute(Q, K, V):
+    # Q, K: [batch_size, seq_length, d_k], V: [batch_size, seq_length, d_v]
+    batch_size, seq_length, d_k = Q.shape
+    d_v = V.shape[-1]
+    scores = zeros((batch_size, seq_length, seq_length))
+    output = zeros((batch_size, seq_length, d_v))
+
+    # Process each sequence in batch
+    for b in range(batch_size):
+        # Every query must interact with every key
+        for i in range(seq_length):
+            for j in range(seq_length):
+                # Compute attention score
+                score = 0
+                for d in range(d_k):
+                    score += Q[b,i,d] * K[b,j,d]
+                scores[b,i,j] = score / sqrt(d_k)
+
+        # Softmax normalization
+        weights = softmax(scores[b])
+
+        # Weighted combination of values
+        for i in range(seq_length):
+            for j in range(seq_length):
+                for d in range(d_v):
+                    output[b,i,d] += weights[i,j] * V[b,j,d]
+
+    return output
+```
+
+This simplified code view exposes the quadratic nature of attention computation.
+
+#### System Implications
+
+The implementation of attention mechanisms reveals unique challenges and opportunities that shape system design. Consider processing a batch of sentences, each containing 512 tokens, with an embedding dimension of 64:
+
+**Memory Intensity:** The defining characteristic of attention is its quadratic memory scaling. For a sequence of 512 tokens, the attention matrix alone requires 512 × 512 = 262,144 elements per head. With 8 attention heads, this grows to over 2 million elements just for intermediate storage. This quadratic growth means doubling sequence length quadruples memory requirements, creating a fundamental scaling challenge that often limits batch sizes in practice.
+
+**Computation Volume:** Attention mechanisms create an intense computational workload. Computing attention scores requires 512 × 512 × 64 multiply-accumulate operations just for the Q-K interaction in a single head, which amounts to over 16 million operations for that interaction alone. Unlike CNNs where computation is local, or RNNs where it's sequential, attention requires global interaction between all elements, making the computational pattern both dense and extensive.
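+
+These memory and compute figures follow directly from the sequence length and head dimensions. A small sketch, using the example's 512 tokens, head dimension of 64, and 8 heads, makes them explicit; the 4-byte element size is an assumption added here for illustration.
+
+```python
+# Attention score memory and compute for one layer of the running example:
+# sequence length 512, head dimension 64, 8 heads. The 4-byte (fp32) element
+# size is an assumption for illustration.
+seq_length, d_k, num_heads = 512, 64, 8
+bytes_per_value = 4   # assumed fp32
+
+scores_per_head = seq_length * seq_length          # 262,144 elements
+scores_all_heads = scores_per_head * num_heads     # ~2.1M elements
+qk_macs_per_head = seq_length * seq_length * d_k   # ~16.8M MACs for Q @ K^T
+
+print(f"Score elements (all heads): {scores_all_heads:,}")
+print(f"Score memory (all heads):   {scores_all_heads * bytes_per_value / 1e6:.1f} MB")
+print(f"Q-K MACs per head:          {qk_macs_per_head:,}")
+```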
+
+**Parallelization Opportunity:** Despite its computational intensity, attention offers extensive parallelism opportunities. Unlike RNNs, attention has no sequential dependencies. All attention scores can compute simultaneously. Moreover, different attention heads can process independently, and operations across the batch dimension are entirely separate. This creates multiple levels of parallelism that modern hardware can exploit.
+
+**Dynamic Data Access:** The content-dependent nature of attention creates unique data access patterns. Unlike CNNs with fixed sliding windows or MLPs with regular matrix multiplication, attention's data access depends on the computed relevance scores. This dynamic pattern means the system cannot predict which values will be most important, requiring efficient access to the entire sequence at each step.
+
+These characteristics drive the development of specialized attention accelerators. While CNNs focus on reusing spatial data and RNNs on managing sequential state, attention accelerators must handle large, dense matrix operations while managing significant memory requirements.
+
+Modern systems address these challenges through:
+
+* Sophisticated memory hierarchies to handle the quadratic scaling
+* Highly parallel matrix multiplication units
+* Custom data layouts optimized for attention patterns
+* Memory management strategies to handle variable sequence lengths
+
+The patterns established by attention mechanisms highlight how dynamic, content-dependent computation can create fundamentally different system requirements, even while building on the basic operations of deep learning.
+
+### Transformers: Parallel Sequence Processing
+
+Transformers represent a culmination of deep learning architecture development, combining attention mechanisms with positional encoding and parallel processing. Unlike RNNs that process sequences step by step, Transformers handle entire sequences simultaneously through self-attention layers. This architecture has enabled breakthrough performance in language processing and increasingly in other domains, while introducing distinct computational patterns that influence system design.
+
+#### Mathematical Structure
+
+The Transformer architecture processes sequences through alternating self-attention and feedforward layers. The core computation in each attention head is:
+
+$$
+\text{head}_i = \text{Attention}(\mathbf{X}\mathbf{W}^Q_i, \mathbf{X}\mathbf{W}^K_i, \mathbf{X}\mathbf{W}^V_i)
+$$
+
+Multiple heads are concatenated and projected:
+$$
+\text{MultiHead}(\mathbf{X}) = \text{Concat}(\text{head}_1,...,\text{head}_h)\mathbf{W}^O
+$$
+
+Dimensions scale with model size:
+
+* Input sequence: $\mathbf{X} \in \mathbb{R}^{N \times d_{model}}$
+* Per-head projections: $\mathbf{W}^Q_i, \mathbf{W}^K_i, \mathbf{W}^V_i \in \mathbb{R}^{d_{model} \times d_k}$
+* Output projection: $\mathbf{W}^O \in \mathbb{R}^{hd_v \times d_{model}}$
+
+#### Computational Mapping
+
+Transformer computation exhibits several key patterns:
+
+1. **Multi-Head Processing**: Attention computed in parallel heads
+ * Multiple attention patterns learned simultaneously
+ * Head outputs concatenated and projected
+ * Parameter count scales with head count
+
+2. **Layer Operations**: Each block performs sequence transformation:
+ $$
+ \mathbf{X}' = \text{FFN}(\text{LayerNorm}(\mathbf{X} + \text{MultiHead}(\mathbf{X})))
+ $$
+ where FFN is a position-wise feedforward network
+
+The translation from mathematical abstraction to implementation reveals the computational complexity:
+
+```python
+# Mathematical abstraction - clean composition
+def transformer_layer(X):
+    # Multi-head attention
+    attention_out = multi_head_attention(X)
+    residual1 = layer_norm(X + attention_out)
+
+    # Position-wise feedforward
+    ff_out = feedforward(residual1)
+    output = layer_norm(residual1 + ff_out)
+    return output
+
+# System reality - explicit computation
+def transformer_compute(X):
+    # X: [batch_size, seq_length, d_model]; the per-head projections W_q, W_k,
+    # W_v, the output projection W_o, the head count num_heads, and the head
+    # size d_k are assumed to be defined elsewhere
+    batch_size, seq_length, d_model = X.shape
+    scores = zeros((num_heads, seq_length, seq_length))
+    head_out = [None] * num_heads
+    output = zeros((batch_size, seq_length, d_model))
+
+    # Process each sequence in batch
+    for b in range(batch_size):
+        # Compute attention in each head
+        for h in range(num_heads):
+            # Project inputs to Q,K,V
+            Q = matmul(X[b], W_q[h])
+            K = matmul(X[b], W_k[h])
+            V = matmul(X[b], W_v[h])
+
+            # Compute attention scores
+            for i in range(seq_length):
+                for j in range(seq_length):
+                    scores[h,i,j] = dot(Q[i], K[j]) / sqrt(d_k)
+
+            # Apply attention and transform
+            weights = softmax(scores[h])
+            head_out[h] = matmul(weights, V)
+
+        # Combine heads and apply FFN
+        combined = concat(head_out) @ W_o
+        residual1 = layer_norm(X[b] + combined)
+
+        # Position-wise feedforward
+        ff_out = zeros_like(residual1)
+        for pos in range(seq_length):
+            ff_out[pos] = feedforward(residual1[pos])
+        output[b] = layer_norm(residual1 + ff_out)
+
+    return output
+```
+
+#### System Implications
+
+The implementation of Transformers reveals unique challenges and opportunities that shape system design. Consider a typical model with 12 layers, 12 attention heads, and processing sequences of 512 tokens with model dimension 768:
+
+**Memory Hierarchy Complexity:** Transformers create intricate memory demands across multiple scales. Each attention head in each layer requires its own attention matrix (512 × 512 elements), leading to 12 × 12 × 512 × 512 = 37.7M elements just for attention scores. Additionally, residual connections mean each layer's activations (512 × 768 = 393K elements) must be preserved until the corresponding addition operation. Layer normalization requires computing and storing statistics across the full sequence length. For a batch size of 32, the total active memory footprint can exceed 1GB, creating complex memory management challenges.
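+
+The footprint numbers quoted above can be reproduced with a short calculation. The sketch below uses the example configuration (12 layers, 12 heads, 512 tokens, model dimension 768, batch size 32) and assumes 4-byte values with all layers' attention scores and activations held live at once, as during training.
+
+```python
+# Rough activation footprint for the example Transformer: 12 layers, 12 heads,
+# 512 tokens, d_model 768, batch size 32. Assumes fp32 (4 bytes) and that all
+# layers' attention scores and activations are held simultaneously (training).
+layers, heads, seq_length, d_model, batch = 12, 12, 512, 768, 32
+bytes_per_value = 4   # assumed fp32
+
+score_elems = layers * heads * seq_length * seq_length   # ~37.7M per example
+layer_act_elems = seq_length * d_model                    # ~393K per layer
+
+total_bytes = batch * (score_elems + layers * layer_act_elems) * bytes_per_value
+print(f"Attention score elements (one example): {score_elems:,}")
+print(f"Approximate footprint for batch of 32:  {total_bytes / 1e9:.1f} GB")
+```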
+
+**Computational Pattern Diversity:** Unlike MLPs or CNNs with uniform computation patterns, Transformers alternate between two fundamentally different computations. The self-attention mechanism requires the quadratic-scaling attention computation we saw earlier (262K elements per head), while the position-wise feedforward network applies identical dense layers (768 × 3072 for the first FFN layer) at each position independently. This creates a unique rhythm of global interaction followed by local computation that systems must manage efficiently.
+
+**Parallelization Structure:** Transformers offer multiple levels of parallel computation. Each attention head can process independently, the feedforward networks can run separately for each position, and different layers can potentially pipeline their operations. However, this parallelism comes with data dependencies: layer normalization needs global statistics, attention requires the full key-value space, and residual connections need preserved activations. For our example model, each layer offers 12-way parallelism across heads and 512-way parallelism across positions, but must synchronize between major operations.
+
+**Scaling Implications:** The most distinctive aspect of Transformer computation is how it scales with different dimensions. Doubling sequence length quadruples memory requirements for attention (from 262K to 1M elements per head). Doubling model dimension increases both attention projection sizes and feedforward layer sizes. Adding heads or layers creates more independent computation streams but requires proportionally more parameter storage. These scaling properties create fundamental tensions in system design between memory capacity, computational throughput, and parallelization strategy.
+
+These characteristics drive the development of specialized Transformer accelerators. Systems must balance:
+
+* Complex memory hierarchies to handle multiple scales of data reuse
+* Mixed computation units for attention and feedforward operations
+* Sophisticated scheduling to exploit available parallelism
+* Memory management strategies for growing model sizes
+
+The patterns established by Transformers show how combining multiple computational motifs (attention and feedforward networks) creates new system-level challenges that go beyond simply implementing each component separately. Modern hardware must carefully co-design their memory systems, computational units, and control logic to handle these diverse demands efficiently.
+
+## Computational Patterns and System Impact
+
+Having examined major deep learning architectures, we can now compare and contrast their computational patterns and system implications collectively. This synthesis reveals common themes and distinct challenges that drive system design decisions in deep learning.
+
+### Memory Access Patterns
+
+Deep learning architectures exhibit various memory access patterns that significantly impact system performance. MLPs demonstrate the simplest pattern with dense, regular access to weight matrices. Each layer requires loading its entire weight matrix, creating substantial memory bandwidth demands that grow with layer width. CNNs, in contrast, exploit spatial locality through weight sharing. The same kernel parameters are reused across spatial positions, reducing memory requirements but necessitating efficient caching mechanisms for kernel weights.
+
+RNNs introduce temporal dependencies in memory access, requiring state maintenance across sequence steps. While they reuse weights efficiently across time steps, the sequential nature of computation limits opportunities for parallel processing. Attention mechanisms and Transformers create perhaps the most demanding memory patterns, with quadratic scaling in sequence length due to their all-to-all interaction matrices.
+
+Consider the memory scaling for processing a sequence of length N:
+
+* MLP: Linear scaling with layer width
+* CNN: Constant kernel size, independent of input
+* RNN: Linear scaling with hidden state size
+* Attention: Quadratic scaling with sequence length
+* Transformer: Quadratic in sequence length, linear in model width
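+
+These rough proportionalities can be written down directly. The sketch below expresses them as simple scaling rules in the spirit of the list above (constants and lower-order terms omitted); it is not an exact memory model.
+
+```python
+# Rough memory scaling rules for processing a sequence of length N, mirroring
+# the list above. Growth trends only; constants and lower-order terms omitted.
+def memory_scaling(N, layer_width, hidden_size, model_dim):
+    return {
+        "MLP": layer_width,                    # linear in layer width
+        "CNN": 1,                              # kernel storage independent of input size
+        "RNN": hidden_size,                    # linear in hidden state size
+        "Attention": N * N,                    # quadratic in sequence length
+        "Transformer": N * N + N * model_dim,  # quadratic in N, linear in width
+    }
+
+print(memory_scaling(N=512, layer_width=1024, hidden_size=256, model_dim=768))
+```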
+
+### Computation Characteristics
+
+The computational structure of these architectures reveals a spectrum of parallelization opportunities and constraints. MLPs and CNNs offer straightforward parallelization—each output element can be computed independently. CNNs add the complexity of sliding window operations but maintain regular computation patterns amenable to hardware acceleration.
+
+RNNs represent a sharp departure, with inherent sequential dependencies limiting parallelization across time steps. While batch processing enables some parallelism, the fundamental sequential nature remains a bottleneck. Attention mechanisms swing to the opposite extreme, enabling full parallelization but at the cost of quadratic computation growth.
+
+Key computation patterns across architectures:
+
+1. **Matrix Operations**
+
+* MLPs: Dense matrix multiplication
+* CNNs: Convolution as structured sparse multiplication
+* RNNs: Sequential matrix operations with state updates
+* Attention/Transformers: Batched matrix multiplication with dynamic weights
+
+2. **Data Dependencies**
+
+* MLPs: Layer-wise dependencies only
+* CNNs: Local spatial dependencies
+* RNNs: Sequential temporal dependencies
+* Attention: Global but parallel dependencies
+
+### System Design Implications
+
+These patterns drive specific system design decisions across the deep learning stack:
+
+#### Memory Hierarchy
+
+Modern deep learning systems must balance multiple memory access patterns:
+
+* Fast access to frequently reused weights (CNN kernels)
+* Efficient handling of large, dense matrices (MLP layers)
+* Support for dynamic, content-dependent access (Attention)
+* Management of growing intermediate states (Transformers)
+
+#### Computation Units
+
+Different architectures benefit from specialized compute capabilities:
+
+* Matrix multiplication units for MLPs and Transformers
+* Convolution accelerators for CNNs
+* Sequential processing optimization for RNNs
+* High-bandwidth units for attention computation
+
+#### Data Movement
+
+The cost of data movement often dominates energy and performance:
+
+* Weight streaming in MLPs
+* Feature map movement in CNNs
+* State propagation in RNNs
+* Attention score distribution in Transformers
+
+Understanding these patterns helps system designers to make informed decisions about hardware architecture, memory hierarchy, and optimization strategies. The diversity of patterns across architectures also explains the trend toward heterogeneous computing systems that can efficiently handle multiple types of deep learning workloads.
+
+The computational characteristics of neural network architectures reveal fundamental trade-offs in deep learning system design. @tbl-arch-comparison synthesizes the key computational patterns across major neural network architectures, highlighting how different design choices impact performance, memory usage, and computational efficiency.
+
+The diversity of architectural approaches reflects the complex challenges of processing different types of data. Each architecture emerges as a specialized solution to specific computational demands, balancing factors like parallelization, memory efficiency, and the ability to capture complex patterns in data.
+
++--------------+------------------------------------+--------------------------------+-----------------------+----------------------------------------------+---------------------------------------------------------+
+| Architecture | Primary Computation | Memory Scaling | Parallelization | Key Computational Patterns | Typical Use Case |
++:=============+:===================================+:===============================+:======================+:=============================================+:========================================================+
+| MLP | Dense Matrix Multiplication | Linear with layer width | High | Full connectivity, uniform weight access | General-purpose classification, regression |
++--------------+------------------------------------+--------------------------------+-----------------------+----------------------------------------------+---------------------------------------------------------+
+| CNN | Convolution with Sliding Kernel | Constant with kernel size | High | Spatial local computation, weight sharing | Image processing, spatial data |
++--------------+------------------------------------+--------------------------------+-----------------------+----------------------------------------------+---------------------------------------------------------+
+| RNN | Sequential Matrix Operations | Linear with hidden state size | Low | Temporal dependencies, state maintenance | Sequential data, time series |
++--------------+------------------------------------+--------------------------------+-----------------------+----------------------------------------------+---------------------------------------------------------+
+| Attention | Batched Matrix Multiplication | Quadratic with sequence length | Very High | Global, content-dependent interactions | Natural language processing, sequence-to-sequence tasks |
++--------------+------------------------------------+--------------------------------+-----------------------+----------------------------------------------+---------------------------------------------------------+
+| Transformer | Multi-head Attention + Feedforward | Quadratic in sequence length | Highly Parallelizable | Parallel sequence processing, global context | Large language models, complex sequence tasks |
++--------------+------------------------------------+--------------------------------+-----------------------+----------------------------------------------+---------------------------------------------------------+
+
+: Comparative computational characteristics of neural network architectures. {#tbl-arch-comparison .striped .hover}
+
+This comparative view shows the fundamental design principles driving neural network architectures, demonstrating how computational requirements shape the evolution of deep learning systems.
+
+### Architectural Complexity Analysis
+
+#### Computational Complexity
+
+For an input of size N, different architectures exhibit distinct computational complexities:
+
+* **MLP (Multi-Layer Perceptron)**:
+ * Time Complexity: O(N * W)
+ * Where W is the width of the layers
+ * Example: For a 784-100-10 MNIST network, complexity scales linearly with layer sizes
+
+* **CNN (Convolutional Neural Network)**:
+ * Time Complexity: O(N * K * C)
+ * Where K is kernel size, C is number of channels
+ * Significantly more efficient for spatial data due to local connectivity
+
+* **RNN (Recurrent Neural Network)**:
+ * Time Complexity: O(T * H * (H + N))
+ * Where T is sequence length, H is hidden state size, and N is the per-step input size
+ * Linear in sequence length, but sequential dependencies across time steps limit parallelism
+
+* **Transformer**:
+ * Time Complexity: O(N² * d)
+ * Where N is sequence length, d is model dimensionality
+ * Quadratic scaling makes long sequence processing expensive
+
+#### Memory Complexity Comparison
+
++---------------+-------------------+-------------------+---------------------+-------------------+
+| Architecture | Input Dependency | Parameter Storage | Activation Storage | Scaling Behavior |
++:==============+:==================+:==================+:====================+:==================+
+| MLP | Linear | O(N * W) | O(B * W) | Predictable |
++---------------+-------------------+-------------------+---------------------+-------------------+
+| CNN | Constant | O(K * C) | O(B * H * W) | Efficient |
++---------------+-------------------+-------------------+---------------------+-------------------+
+| RNN | Linear | O(H²) | O(B * T * H) | Challenging |
++---------------+-------------------+-------------------+---------------------+-------------------+
+| Transformer   | Quadratic         | O(d²)             | O(B * N²)           | Problematic       |
++---------------+-------------------+-------------------+---------------------+-------------------+
+
+Where:
+
+* N: Input/sequence size
+* W: Layer width
+* B: Batch size
+* K: Kernel size
+* C: Channels
+* H: Hidden state size (in the CNN row, H and W denote feature map height and width)
+* T: Sequence length
+* d: Model dimensionality
+
+## System Demands
+
+Deep learning systems must manage significant computational and memory resources during both training and inference. While the previous section examined computational patterns of different architectures, here we focus on the practical system requirements for running these models. Understanding these fundamental demands is crucial for later chapters where we explore optimization and deployment strategies in detail.
+
+### Training Demands
+
+#### Batch Processing
+
+Training neural networks requires processing multiple examples simultaneously through batch processing. This approach not only improves statistical learning by computing gradients across multiple examples but also enables better hardware utilization. The batch size choice directly impacts memory usage since activations must be stored for each example in the batch, creating a fundamental trade-off between training efficiency and memory constraints.
+
+From a systems perspective, batch processing creates opportunities for optimization through parallel computation and memory access patterns. Modern deep learning systems employ techniques like gradient accumulation to work around memory limitations, and dynamic batch sizing to adapt to available resources. These considerations become particularly important in distributed training settings, which we will explore in detail in subsequent chapters.
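+
+As one concrete illustration of working around memory limits, the sketch below shows gradient accumulation in PyTorch-style code: gradients from several small micro-batches are summed before a single optimizer step, approximating a larger batch without holding all of its activations at once. The model, optimizer, loss function, and data loader here are placeholders for illustration, not taken from the text.
+
+```python
+import torch
+
+# Hypothetical model, optimizer, loss, and loader; stand-ins for illustration.
+model = torch.nn.Linear(784, 10)
+optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
+loss_fn = torch.nn.CrossEntropyLoss()
+loader = [(torch.randn(8, 784), torch.randint(0, 10, (8,))) for _ in range(16)]
+
+accum_steps = 4  # effective batch = 4 micro-batches of 8 examples
+
+optimizer.zero_grad()
+for step, (x, y) in enumerate(loader):
+    loss = loss_fn(model(x), y) / accum_steps  # scale so the sum matches one big batch
+    loss.backward()                            # gradients accumulate in .grad buffers
+    if (step + 1) % accum_steps == 0:
+        optimizer.step()                       # one parameter update per accumulation window
+        optimizer.zero_grad()
+```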
+
+#### Gradient Computation
+
+The backward pass in neural network training requires maintaining activation values from the forward pass to compute gradients. This effectively doubles the memory requirements compared to inference, as each layer's outputs must be preserved until they're needed for gradient calculation. The computational graph grows with model depth, making gradient computation increasingly demanding for deeper networks.
+
+System design must carefully manage this memory-computation trade-off. Techniques exist to reduce memory pressure by recomputing activations instead of storing them, or by using checkpointing to save memory at the cost of additional computation. While we will explore these optimization strategies thoroughly in Chapter 8, the fundamental challenge of balancing gradient computation requirements with system resources shapes basic design decisions.
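+
+The sketch below illustrates one such technique, activation checkpointing, using PyTorch's `torch.utils.checkpoint`: checkpointed blocks discard their internal activations during the forward pass and recompute them during the backward pass. The small two-block model is a made-up example for illustration only.
+
+```python
+import torch
+from torch.utils.checkpoint import checkpoint
+
+# Two hypothetical blocks; their internal activations are not stored when
+# checkpointed, only recomputed during the backward pass.
+block1 = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU())
+block2 = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU())
+
+x = torch.randn(32, 512, requires_grad=True)
+
+# Forward pass: each checkpointed block trades memory for recomputation.
+h = checkpoint(block1, x, use_reentrant=False)
+out = checkpoint(block2, h, use_reentrant=False)
+
+out.sum().backward()  # block activations are recomputed here as needed
+```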
+
+#### Parameter Updates and State Management
+
+Training requires maintaining and updating millions to billions of parameters across iterations. Each training step not only computes gradients but must safely update these parameters, often while maintaining optimizer states like momentum values. This creates substantial memory requirements beyond the basic model parameters.
+
+Systems must efficiently handle these frequent, large-scale parameter updates. Memory hierarchies need to be designed for both fast access during forward/backward passes and efficient updates during the optimization step. While various optimization techniques exist to reduce this overhead, which we will explore in Chapter 8, the basic challenge of parameter management influences fundamental system design choices.
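+
+A rough calculation, sketched below, shows why optimizer state matters: with an Adam-style optimizer keeping two extra values per parameter, training memory for weights alone is roughly four times the raw parameter storage (parameters, gradients, and two moment estimates). The 1-billion-parameter figure and fp32 storage are assumptions chosen for illustration.
+
+```python
+# Memory needed just for parameters and optimizer state, ignoring activations.
+# Assumes a 1-billion-parameter model stored in fp32 (4 bytes) and an
+# Adam-style optimizer with two moment estimates per parameter.
+num_params = 1_000_000_000
+bytes_per_value = 4
+
+weights = num_params * bytes_per_value
+gradients = num_params * bytes_per_value
+optimizer_state = 2 * num_params * bytes_per_value   # first and second moments
+
+total = weights + gradients + optimizer_state
+print(f"Parameters:      {weights / 1e9:.1f} GB")
+print(f"Total w/ states: {total / 1e9:.1f} GB")       # ~4x the raw parameter size
+```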
+
+### Inference Requirements
+
+#### Latency Management
+
+Inference workloads often demand real-time or near-real-time responses, particularly in applications like autonomous systems, recommendation engines, or interactive services. Unlike training, where throughput is the primary concern, inference must often meet strict latency requirements. These timing constraints fundamentally shape how the system processes inputs and manages resources.
+
+From a systems perspective, meeting latency requirements involves carefully orchestrating computation and memory access. While techniques like batching can improve throughput, they must be balanced against latency constraints. The system must maintain consistent response times while efficiently utilizing available compute resources, a challenge we will examine more deeply in Chapter 9.
+
+#### Resource Utilization
+
+Inference deployments must operate within specific resource budgets, whether running on edge devices with limited power and memory or in cloud environments with cost constraints. The model's computational and memory demands must fit within these boundaries while maintaining acceptable performance. This creates a fundamental tension between model capability and resource efficiency.
+
+System designs address this challenge through various approaches to resource management. Memory usage can be optimized through quantization and compression, while computation can be streamlined through operation fusion and hardware acceleration. While we will explore these optimization techniques in detail in later chapters, understanding the basic resource constraints is essential for system design.
+
+### Memory Management
+
+#### Working Memory
+
+Deep learning systems require substantial working memory to hold intermediate activations during computation. For training, these memory demands are particularly high as activations must be preserved for gradient computation. Even during inference, the system needs sufficient memory to handle intermediate results, especially for large models or batch processing.
+
+Memory management systems must efficiently handle these dynamic memory requirements. Techniques like gradient checkpointing during training or activation recomputation during inference can trade computation for memory savings. The fundamental challenge lies in balancing memory usage against computational overhead, a trade-off that influences system architecture.
+
+#### Storage Requirements
+
+Beyond working memory, deep learning systems need efficient storage and access patterns for model parameters. Modern architectures can require anywhere from megabytes to hundreds of gigabytes for weight storage. This creates challenges not just in terms of capacity but also in terms of efficient parameter access during computation.
+
+Systems must carefully manage this parameter storage across the memory hierarchy. Fast access to frequently used parameters must be balanced against total storage capacity. While advanced techniques like parameter sharding and caching exist, which we will discuss in a future chapter, the basic requirements for parameter storage and access shape fundamental system design decisions.
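+
+The storage range quoted above follows from parameter counts and numeric precision. A short sketch, using round illustrative model sizes rather than any specific system, shows the arithmetic and how quantization shifts it.
+
+```python
+# Approximate weight storage for a few illustrative model sizes and precisions.
+# Model sizes here are round numbers for illustration, not specific systems.
+def weight_storage_mb(num_params, bytes_per_param):
+    return num_params * bytes_per_param / 1e6
+
+for name, params in [("small CNN", 5e6), ("ResNet-scale", 25e6), ("large Transformer", 10e9)]:
+    fp32 = weight_storage_mb(params, 4)
+    int8 = weight_storage_mb(params, 1)
+    print(f"{name:>18}: {fp32:>10,.0f} MB (fp32) | {int8:>10,.0f} MB (int8)")
+```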
+
+## Case Studies
+
+### Scaling Architectures: Transformers and Large Language Models
+
+* Evolution from Attention Mechanisms to GPT and BERT
+* System challenges in scaling, memory, and deployment
+
+### Efficiency and Specialization: EfficientNet and Mobile AI
+
+* Optimizing for constrained environments
+* Trade-offs between accuracy and efficiency
+
+### Distributed Systems: Training at Scale
+
+* Challenges in multi-GPU communication and resource scaling
+* Tools and frameworks like Horovod and DeepSpeed
+
+## Conclusion
+
+coming soon.
\ No newline at end of file
diff --git a/contents/core/dl_architectures/images/png/cover_dl_arch.png b/contents/core/dl_architectures/images/png/cover_dl_arch.png
new file mode 100644
index 00000000..88f0f4f4
Binary files /dev/null and b/contents/core/dl_architectures/images/png/cover_dl_arch.png differ
diff --git a/contents/core/dl_primer/dl_primer.bib b/contents/core/dl_primer/dl_primer.bib
index 8daf6b12..60e5286f 100644
--- a/contents/core/dl_primer/dl_primer.bib
+++ b/contents/core/dl_primer/dl_primer.bib
@@ -26,6 +26,27 @@ @article{goodfellow2020generative
month = oct,
}
+@article{vaswani2017attention,
+ title={Attention is all you need},
+  author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, {\L}ukasz and Polosukhin, Illia},
+ journal={Advances in Neural Information Processing Systems},
+ year={2017}
+}
+
+@book{reagen2017deep,
+ title={Deep learning for computer architects},
+ author={Reagen, Brandon and Adolf, Robert and Whatmough, Paul and Wei, Gu-Yeon and Brooks, David and Martonosi, Margaret},
+ year={2017},
+ publisher={Springer}
+}
+
+@article{bahdanau2014neural,
+ title={Neural machine translation by jointly learning to align and translate},
+  author={Bahdanau, Dzmitry and Cho, Kyunghyun and Bengio, Yoshua},
+ journal={arXiv preprint arXiv:1409.0473},
+ year={2014}
+}
+
@inproceedings{jouppi2017datacenter,
author = {Jouppi, Norman P. and Young, Cliff and Patil, Nishant and Patterson, David and Agrawal, Gaurav and Bajwa, Raminder and Bates, Sarah and Bhatia, Suresh and Boden, Nan and Borchers, Al and Boyle, Rick and Cantin, Pierre-luc and Chao, Clifford and Clark, Chris and Coriell, Jeremy and Daley, Mike and Dau, Matt and Dean, Jeffrey and Gelb, Ben and Ghaemmaghami, Tara Vazir and Gottipati, Rajendra and Gulland, William and Hagmann, Robert and Ho, C. Richard and Hogberg, Doug and Hu, John and Hundt, Robert and Hurt, Dan and Ibarz, Julian and Jaffey, Aaron and Jaworski, Alek and Kaplan, Alexander and Khaitan, Harshit and Killebrew, Daniel and Koch, Andy and Kumar, Naveen and Lacy, Steve and Laudon, James and Law, James and Le, Diemthu and Leary, Chris and Liu, Zhuyuan and Lucke, Kyle and Lundin, Alan and MacKean, Gordon and Maggiore, Adriana and Mahony, Maire and Miller, Kieran and Nagarajan, Rahul and Narayanaswami, Ravi and Ni, Ray and Nix, Kathy and Norrie, Thomas and Omernick, Mark and Penukonda, Narayana and Phelps, Andy and Ross, Jonathan and Ross, Matt and Salek, Amir and Samadiani, Emad and Severn, Chris and Sizikov, Gregory and Snelham, Matthew and Souter, Jed and Steinberg, Dan and Swing, Andy and Tan, Mercedes and Thorson, Gregory and Tian, Bo and Toma, Horia and Tuttle, Erick and Vasudevan, Vijay and Walter, Richard and Wang, Walter and Wilcox, Eric and Yoon, Doe Hyun},
abstract = {Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC{\textemdash}called a Tensor Processing Unit (TPU) {\textemdash} deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95\% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X {\textendash} 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X {\textendash} 80X higher. Moreover, using the CPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.},
diff --git a/contents/core/dl_primer/dl_primer.qmd b/contents/core/dl_primer/dl_primer.qmd
index 95710bbf..d5d3139b 100644
--- a/contents/core/dl_primer/dl_primer.qmd
+++ b/contents/core/dl_primer/dl_primer.qmd
@@ -2,91 +2,325 @@
bibliography: dl_primer.bib
---
-# DL Primer {#sec-dl_primer}
+# DNN Primer {#sec-dl_primer}
::: {.content-visible when-format="html"}
Resources: [Slides](#sec-deep-learning-primer-resource), [Videos](#sec-deep-learning-primer-resource), [Exercises](#sec-deep-learning-primer-resource)
:::
-![_DALL·E 3 Prompt: Photo of a classic classroom with a large blackboard dominating one wall. Chalk drawings showcase a detailed deep neural network with several hidden layers, and each node and connection is precisely labeled with white chalk. The rustic wooden floor and brick walls provide a contrast to the modern concepts. Surrounding the room, posters mounted on frames emphasize deep learning themes: convolutional networks, transformers, neurons, activation functions, and more._](images/png/cover_dl_primer.png)
+![_DALL·E 3 Prompt: A rectangular illustration divided into two halves on a clean white background. The left side features a detailed and colorful depiction of a biological neural network, showing interconnected neurons with glowing synapses and dendrites. The right side displays a sleek and modern artificial neural network, represented by a grid of interconnected nodes and edges resembling a digital circuit. The transition between the two sides is distinct but harmonious, with each half clearly illustrating its respective theme: biological on the left and artificial on the right._](images/png/cover_nn_primer.png)
-This section serves as a primer for deep learning, providing systems practitioners with essential context and foundational knowledge needed to implement deep learning solutions effectively. Rather than delving into theoretical depths, we focus on key concepts, architectures, and practical considerations relevant to systems implementation. We begin with an overview of deep learning's evolution and its particular significance in embedded AI systems. Core concepts like neural networks are introduced with an emphasis on implementation considerations rather than mathematical foundations.
-
-The primer explores major deep learning architectures from a systems perspective, examining their practical implications and resource requirements. We also compare deep learning to traditional machine learning approaches, helping readers make informed architectural choices based on real-world system constraints. This high-level overview sets the context for the more detailed systems-focused techniques and optimizations covered in subsequent chapters.
+This chapter bridges fundamental neural network concepts with real-world system implementations by exploring how different architectural patterns process information and influence system design. Instead of concentrating on algorithms or model accuracy—typical topics in deep learning algorithms courses or books—this chapter focuses on how architectural choices shape distinct computational patterns that drive system-level decisions, such as memory hierarchy, processing units, and hardware acceleration. By understanding these relationships, readers will gain the insight needed to make informed decisions about model selection, system optimization, and hardware/software co-design in the chapters that follow.
::: {.callout-tip}
## Learning Objectives
-* Understand the basic concepts and definitions of deep neural networks.
+* Understand the biological inspiration for artificial neural networks and how this foundation informs their design and function.
+
+* Explore the fundamental structure of neural networks, including neurons, layers, and connections.
+
+* Examine the processes of forward propagation, backward propagation, and optimization as the core mechanisms of learning.
+
+* Understand the complete machine learning pipeline, from pre-processing through neural computation to post-processing.
-* Recognize there are different deep learning model architectures.
+* Compare and contrast training and inference phases, understanding their distinct computational requirements and optimizations.
-* Comparison between deep learning and traditional machine learning approaches across various dimensions.
+* Learn how neural networks process data to extract patterns and make predictions, bridging theoretical concepts with computational implementations.
-* Acquire the basic conceptual building blocks to dive deeper into advanced deep-learning techniques and applications.
-
:::
## Overview
-### Definition and Importance
+Neural networks, a foundational concept within machine learning and artificial intelligence, are computational models inspired by the structure and function of biological neural systems. These networks represent a critical intersection of algorithms, mathematical frameworks, and computing infrastructure, making them integral to solving complex problems in AI.
-Deep learning, a specialized area within machine learning and artificial intelligence (AI), utilizes algorithms modeled after the structure and function of the human brain, known as artificial neural networks. This field is a foundational element in AI, driving progress in diverse sectors such as computer vision, natural language processing, and self-driving vehicles. Its significance in embedded AI systems is highlighted by its capability to handle intricate calculations and predictions, optimizing the limited resources in embedded settings.
-
-@fig-ai-ml-dl provides a visual representation of how deep learning fits within the broader context of AI and machine learning. The diagram illustrates the chronological development and relative segmentation of these three interconnected fields, showcasing deep learning as a specialized subset of machine learning, which in turn is a subset of AI.
+When studying neural networks, it is helpful to place them within the broader hierarchy of AI and machine learning. @fig-ai-ml-dl provides a visual representation of this context. AI, as the overarching field, encompasses all computational methods that aim to mimic human cognitive functions. Within AI, machine learning includes techniques that enable systems to learn patterns from data. Neural networks, a key subset of ML, form the backbone of more advanced learning systems, including deep learning, by modeling complex relationships in data through interconnected computational units.
![The diagram illustrates artificial intelligence as the overarching field encompassing all computational methods that mimic human cognitive functions. Machine learning is a subset of AI that includes algorithms capable of learning from data. Deep learning, a further subset of ML, specifically involves neural networks that are able to learn more complex patterns in large volumes of data. Source: NVIDIA.](images/png/ai_dl_progress_nvidia.png){#fig-ai-ml-dl}
-As shown in the figure, AI represents the overarching field, encompassing all computational methods that mimic human cognitive functions. Machine learning, shown as a subset of AI, includes algorithms capable of learning from data. Deep learning, the smallest subset in the diagram, specifically involves neural networks that are able to learn more complex patterns from large volumes of data.
+The emergence of neural networks reflects key shifts in how AI systems process information across three fundamental dimensions:
+
+* **Data:** From manually structured and rule-based datasets to raw, high-dimensional data. Neural networks are particularly adept at learning from complex and unstructured data, making them essential for tasks involving images, speech, and text.
+
+* **Algorithms:** From explicitly programmed rules to adaptive systems capable of learning patterns directly from data. Neural networks eliminate the need for manual feature engineering by discovering representations automatically through layers of interconnected units.
+
+* **Computation:** From simple, sequential operations to massively parallel computations. The scalability of neural networks has driven demand for advanced hardware, such as GPUs, that can efficiently process large models and datasets.
+
+These shifts underscore the importance of understanding neural networks, not only as mathematical constructs but also as practical components of real-world AI systems. The development and deployment of neural networks require careful consideration of computational efficiency, data processing workflows, and hardware optimization.
+
+To build a strong foundation, this chapter focuses on the core principles of neural networks, exploring their structure, functionality, and learning mechanisms. By understanding these basics, readers will be well-prepared to delve into more advanced architectures and their systems-level implications in later chapters.
+
+## What Makes Deep Learning Different
+
+Deep learning represents a fundamental shift in how we approach problem solving with computers. To understand this shift, let's consider the classic example of computer vision---specifically, the task of identifying objects in images.
+
+### Traditional Programming: The Era of Explicit Rules
+
+Traditional programming requires developers to explicitly define rules that tell computers how to process inputs and produce outputs. Consider a simple game like Breakout, shown in @fig-breakout. The program needs explicit rules for every interaction: when the ball hits a brick, the code must specify that the brick should be removed and the ball's direction should be reversed. While this approach works well for games with clear physics and limited states, it demonstrates an inherent limitation of rule-based systems.
+
+![Rule-based programming.](images/png/breakout.png){#fig-breakout}
+
+This rules-based paradigm extends to all traditional programming, as illustrated in @fig-traditional. The program takes both rules for processing and input data to produce outputs. Early artificial intelligence research explored whether this approach could scale to solve complex problems by encoding sufficient rules to capture intelligent behavior.
+
+![Traditional programming.](images/png/traditional.png){#fig-traditional}
+
+However, the limitations of rule-based approaches become evident when addressing complex real-world tasks. Consider the problem of recognizing human activities, shown in @fig-activity-rules. Initial rules might appear straightforward: classify movement below 4 mph as walking and faster movement as running. Yet real-world complexity quickly emerges. The classification must account for variations in speed, transitions between activities, and numerous edge cases. Each new consideration requires additional rules, leading to increasingly complex decision trees.
+
+![Activity rules.](images/png/activities.png){#fig-activity-rules}
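+
+To make this brittleness concrete, the sketch below shows what such hand-written activity rules might look like in Python. The speed thresholds and activity labels are illustrative only; every new edge case forces yet another branch.
+
+```python
+def classify_activity(speed_mph: float, on_stairs: bool = False) -> str:
+    """Toy rule-based activity classifier (illustrative thresholds)."""
+    if on_stairs:                 # special case: stair climbing
+        return "climbing stairs"
+    if speed_mph < 0.3:           # special case: barely moving
+        return "standing"
+    if speed_mph < 4.0:           # below ~4 mph we call it walking
+        return "walking"
+    if speed_mph < 8.0:
+        return "running"
+    return "cycling?"             # ...and the special cases keep coming
+
+print(classify_activity(3.2))     # walking
+print(classify_activity(6.5))     # running
+```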
+
+This challenge extends to computer vision tasks. Detecting objects like cats in images would require rules describing their distinguishing features: pointed ears, whiskers, typical body shapes. Such rules would need to account for variations in viewing angle, lighting conditions, partial occlusions, and natural variations among instances. Early computer vision systems attempted this approach through geometric rules but achieved success only in controlled environments with well-defined objects.
+
+This knowledge engineering approach characterized artificial intelligence research in the 1970s and 1980s. Expert systems encoded domain knowledge as explicit rules, showing promise in specific domains with well-defined parameters but struggling with tasks humans perform naturally---such as object recognition, speech understanding, or natural language interpretation. These limitations highlighted a fundamental challenge: many aspects of intelligent behavior rely on implicit knowledge that resists explicit rule-based representation.
+
+### Machine Learning: Learning from Engineered Patterns
+
+The limitations of pure rule-based systems led researchers to explore approaches that could learn from data. Machine learning offered a promising direction: instead of writing rules for every situation, we could write programs that found patterns in examples. However, the success of these methods still depended heavily on human insight to define what patterns might be important---a process known as feature engineering.
+
+Feature engineering involves transforming raw data into representations that make patterns more apparent to learning algorithms. In computer vision, researchers developed sophisticated methods to extract meaningful patterns from images. The Histogram of Oriented Gradients (HOG) method, shown in @fig-hog, exemplifies this approach. HOG works by first identifying edges in an image---places where brightness changes sharply, often indicating object boundaries. It then divides the image into small cells and measures how edges are oriented within each cell, summarizing these orientations in a histogram. This transformation converts raw pixel values into a representation that captures important shape information while being robust to variations in lighting and small changes in position.
+
+![Histogram of Oriented Gradients (HOG) requires explicit feature engineering.](images/png/hog.png){#fig-hog}
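+
+For readers who want to see what this feature-engineering step looks like in code, the sketch below computes HOG features with the scikit-image library; the parameter values are common defaults rather than anything prescribed here.
+
+```python
+from skimage import color, data
+from skimage.feature import hog
+
+# Grayscale example image bundled with scikit-image
+image = color.rgb2gray(data.astronaut())
+
+# Gradients are binned into orientation histograms over small cells,
+# then normalized over blocks of neighboring cells.
+features, hog_image = hog(
+    image,
+    orientations=9,            # number of gradient-orientation bins
+    pixels_per_cell=(8, 8),    # cell size for each histogram
+    cells_per_block=(2, 2),    # block size for contrast normalization
+    visualize=True,
+)
+
+print(features.shape)          # a fixed-length descriptor a classifier can use
+```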
+
+Other feature extraction methods like SIFT (Scale-Invariant Feature Transform) and Gabor filters provided different ways to capture patterns in images. SIFT found distinctive points that could be recognized even when an object's size or orientation changed. Gabor filters helped identify textures and repeated patterns. Each method encoded different types of human insight about what makes visual patterns recognizable.
+
+These engineered features enabled significant advances in computer vision during the 2000s. Systems could now recognize objects with some robustness to real-world variations, leading to applications in face detection, pedestrian detection, and object recognition. However, the approach had fundamental limitations. Experts needed to carefully design feature extractors for each new problem, and the resulting features might miss important patterns that weren't anticipated in their design.
+
+### Deep Learning Paradigm
+
+Deep learning fundamentally differs by learning directly from raw data. Traditional programming, as we saw earlier in @fig-traditional, required both rules and data as inputs to produce answers. Machine learning inverts this relationship, as shown in @fig-deeplearning: we provide examples (data) together with their correct answers, and the system discovers the underlying rules automatically. This shift eliminates the need for humans to specify what patterns are important.
+
+![Deep learning.](images/png/ml_rules.png){#fig-deeplearning}
+
+The system discovers these patterns automatically from examples. When shown millions of images of cats, the system learns to identify increasingly complex visual patterns---from simple edges to more sophisticated combinations that make up cat-like features. This mirrors how our own visual system works, building up understanding from basic visual elements to complex objects.
+
+Unlike traditional approaches where performance often plateaus with more data and computation, deep learning systems continue to improve as we provide more resources. More training examples help the system recognize more variations and nuances. More computational power enables the system to discover more subtle patterns. This scalability has led to dramatic improvements in performance---for example, the accuracy of image recognition systems has improved from 74% in 2012 to over 95% today.
+
+This different approach has profound implications for how we build AI systems. Deep learning's ability to learn directly from raw data eliminates the need for manual feature engineering, but it comes with new demands. We need sophisticated infrastructure to handle massive datasets, powerful computers to process this data, and specialized hardware to perform the complex mathematical calculations efficiently. The computational requirements of deep learning have even driven the development of new types of computer chips optimized for these calculations.
+
+The success of deep learning in computer vision exemplifies how this approach, when given sufficient data and computation, can surpass traditional methods. This pattern has repeated across many domains, from speech recognition to game playing, establishing deep learning as a transformative approach to artificial intelligence.
+
+### Systems Implications of Each Approach
+
+The progression from traditional programming to deep learning represents not just a shift in how we solve problems, but a fundamental transformation in computing system requirements. This transformation becomes particularly critical when we consider the full spectrum of ML systems---from massive cloud deployments to resource-constrained tiny ML devices.
+
+Traditional programs follow predictable patterns. They execute sequential instructions, access memory in regular patterns, and utilize computing resources in well-understood ways. A typical rule-based image processing system might scan through pixels methodically, applying fixed operations with modest and predictable computational and memory requirements. These characteristics made traditional programs relatively straightforward to deploy across different computing platforms.
+
+Machine learning with engineered features introduced new complexities. Feature extraction algorithms required more intensive computation and structured data movement. The HOG feature extractor discussed earlier, for instance, requires multiple passes over image data, computing gradients and constructing histograms. While this increased both computational demands and memory complexity, the resource requirements remained relatively predictable and scalable across platforms.
+
+Deep learning, however, fundamentally reshapes system requirements across multiple dimensions. @tbl-evolution shows the evolution of system requirements across programming paradigms:
+
++------------------+-------------------------+---------------------------+--------------------------------+
+| System Aspect | Traditional Programming | ML with Features | Deep Learning |
++:=================+:========================+:==========================+:===============================+
+| Computation | Sequential, | Structured parallel | Massive matrix |
+| | predictable paths | operations | parallelism |
++------------------+-------------------------+---------------------------+--------------------------------+
+| Memory Access | Small, predictable | Medium, | Large, complex |
+| | patterns | batch-oriented | hierarchical patterns |
++------------------+-------------------------+---------------------------+--------------------------------+
+| Data Movement | Simple input/output | Structured batch | Intensive cross-system |
+| | flows | processing | movement |
++------------------+-------------------------+---------------------------+--------------------------------+
+| Hardware Needs | CPU-centric | CPU with vector | Specialized |
+| | | units | accelerators |
++------------------+-------------------------+---------------------------+--------------------------------+
+| Resource Scaling | Fixed requirements | Linear with data | Exponential with |
+| | | size | complexity |
++------------------+-------------------------+---------------------------+--------------------------------+
+
+: Evolution of system requirements across programming paradigms. {#tbl-evolution}
+
+These differences manifest in several critical ways, with implications across the entire ML systems spectrum.
+
+#### Computation Patterns
+
+While traditional programs follow sequential logic flows, deep learning requires massive parallel operations on matrices. This shift explains why conventional CPUs, designed for sequential processing, prove inefficient for neural network computations. The need for parallel processing has driven the adoption of specialized hardware architectures---from powerful cloud GPUs to specialized mobile processors to tiny ML accelerators.
+
+#### Memory Systems
+
+Traditional programs typically maintain small, fixed memory footprints. Deep learning models, however, must manage parameters across complex memory hierarchies. Memory bandwidth often becomes the primary performance bottleneck, creating particular challenges for resource-constrained systems. This drives different optimization strategies across the ML systems spectrum---from memory-rich cloud deployments to heavily optimized tiny ML implementations.
+
+#### System Scale
+
+Perhaps most importantly, deep learning fundamentally changes how systems scale and the critical importance of efficiency. Traditional programs have relatively fixed resource requirements with predictable performance characteristics. Deep learning systems, however, can consume exponentially more resources as models grow in complexity. This relationship between model capability and resource consumption makes system efficiency a central concern.
+
+The need to bridge algorithmic concepts with hardware realities becomes crucial. While traditional programs map relatively straightforwardly to standard computer architectures, deep learning requires us to think carefully about:
+
+* How to efficiently map matrix operations to physical hardware
+* Ways to minimize data movement across memory hierarchies
+* Methods to balance computational capability with resource constraints
+* Techniques to optimize both algorithm and system-level efficiency
+
+These fundamental shifts explain why deep learning has spurred innovations across the entire computing stack. From specialized hardware accelerators to new memory architectures to sophisticated software frameworks, the demands of deep learning continue to reshape computer system design. Interestingly, many of these challenges---efficiency, scaling, and adaptability---are ones that biological systems have already solved. This brings us to a critical question: what can we learn from nature's own information processing system, and how might we mimic it in artificially intelligent systems?
+
+## From Brain to Artificial Neurons
+
+The quest to create artificial intelligence has been profoundly influenced by our understanding of biological intelligence, particularly the human brain. This isn't surprising---the brain represents the most sophisticated information processing system we know of, capable of learning, adapting, and solving complex problems while maintaining remarkable energy efficiency. The way our brains function has provided fundamental insights that continue to shape how we approach artificial intelligence.
+
+### Biological Intelligence
+
+When we observe biological intelligence, several key principles emerge. The brain demonstrates an extraordinary ability to learn from experience, constantly modifying its neural connections based on new information and interactions with the environment. This adaptability is fundamental---every experience potentially alters the brain's structure, refining its responses for future situations. This biological capability directly inspired one of the core principles of machine learning: the ability to learn and improve from data rather than following fixed, pre-programmed rules.
-### Brief History of Deep Learning
+Another striking feature of biological intelligence is its parallel processing capability. The brain processes vast amounts of information simultaneously, with different regions specializing in specific functions while working in concert. This distributed, parallel architecture stands in stark contrast to traditional sequential computing and has significantly influenced modern AI system design. The brain's ability to efficiently coordinate these parallel processes while maintaining coherent function represents a level of sophistication we're still working to fully understand and replicate.
-The idea of deep learning has origins in early artificial neural networks. It has experienced several cycles of interest, starting with the introduction of the Perceptron in the 1950s [@rosenblatt1957perceptron], followed by the invention of backpropagation algorithms in the 1980s [@rumelhart1986learning].
+The brain's pattern recognition capabilities are particularly noteworthy. Biological systems excel at identifying patterns in complex, noisy data---whether recognizing faces in a crowd, understanding speech in a noisy environment, or identifying objects from partial information. This remarkable ability has inspired numerous AI applications, particularly in computer vision and speech recognition systems. The brain accomplishes these tasks with an efficiency that artificial systems are still striving to match.
-The term "deep learning" became prominent in the 2000s, characterized by advances in computational power and data accessibility. Important milestones include the successful training of deep networks like AlexNet [@krizhevsky2012imagenet] by [Geoffrey Hinton](https://amturing.acm.org/award_winners/hinton_4791679.cfm), a leading figure in AI, and the renewed focus on neural networks as effective tools for data analysis and modeling.
+Perhaps most remarkably, biological systems achieve all this with incredible energy efficiency. The human brain operates on approximately 20 watts of power---about the same as a low-power light bulb---while performing complex cognitive tasks that would require orders of magnitude more power in current artificial systems. This efficiency hasn't just impressed researchers; it has become a crucial goal in the development of AI hardware and algorithms.
-Deep learning has recently seen exponential growth, transforming various industries. @fig-trends illustrates this remarkable progression, highlighting two key trends in the field. First, the graph shows that computational growth followed an 18-month doubling pattern from 1952 to 2010. This trend then dramatically accelerated to a 6-month doubling cycle from 2010 to 2022, indicating a significant leap in computational capabilities.
+These biological principles have led to two distinct but complementary approaches in artificial intelligence. The first attempts to directly mimic neural structure and function, leading to artificial neural networks and deep learning architectures that structurally resemble biological neural networks. The second takes a more abstract approach, adapting biological principles to work efficiently within the constraints of computer hardware without necessarily copying biological structures exactly. In the following sections, we will explore how these approaches manifest in practice, beginning with the fundamental building block of neural networks: the neuron itself.
-Second, the figure depicts the emergence of large-scale models between 2015 and 2022. These models appeared 2 to 3 orders of magnitude faster than the general trend, following an even more aggressive 10-month doubling cycle. This rapid scaling of model sizes represents a paradigm shift in deep learning capabilities.
+### Biological to Artificial Neurons
-![Growth of deep learning models.](https://epochai.org/assets/images/posts/2022/compute-trends.png){#fig-trends}
+To understand how biological principles translate into artificial systems, we must first examine the basic unit of biological information processing: the neuron. This cellular building block provides the blueprint for its artificial counterpart and helps us understand how complex neural networks emerge from simple components working in concert.
-Multiple factors have contributed to this surge, including advancements in computational power, the abundance of big data, and improvements in algorithmic designs. First, the growth of computational capabilities, especially the arrival of Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs) [@jouppi2017datacenter], has significantly sped up the training and inference times of deep learning models. These hardware improvements have enabled the construction and training of more complex, deeper networks than what was possible in earlier years.
+In biological systems, the neuron (or cell) is the basic functional unit of the nervous system. Understanding its structure is crucial before we draw parallels to artificial systems. @fig-bio_nn2ai_nn illustrates the structure of a biological neuron.
-Second, the digital revolution has yielded a wealth of big data, offering rich material for deep learning models to learn from and excel in tasks such as image and speech recognition, language translation, and game playing. Large, labeled datasets have been key in refining and successfully deploying deep learning applications in real-world settings.
+![Biological structure of a neuron and its mapping to an artificial neuron. Source: GeeksforGeeks](images/png/bio_nn2ai_nn.png){#fig-bio_nn2ai_nn}
-Additionally, collaborations and open-source efforts have nurtured a dynamic community of researchers and practitioners, accelerating advancements in deep learning techniques. Innovations like deep reinforcement learning, transfer learning, and generative artificial intelligence have broadened the scope of what is achievable with deep learning, opening new possibilities in various sectors, including healthcare, finance, transportation, and entertainment.
+A biological neuron consists of several key components. The central part is the cell body, or soma, which contains the nucleus and performs the cell's basic life processes. Extending from the soma are branch-like structures called dendrites, which receive signals from other neurons. At the junctions where signals are passed between neurons are synapses. Finally, a long, slender projection called the axon conducts electrical impulses away from the cell body to other neurons.
-Organizations worldwide recognize deep learning's transformative potential and invest heavily in research and development to leverage its capabilities in providing innovative solutions, optimizing operations, and creating new business opportunities. As deep learning continues its upward trajectory, it is set to redefine how we interact with technology, enhancing convenience, safety, and connectivity in our lives.
+The neuron functions as follows: Dendrites receive inputs from other neurons, with synapses determining the strength of the connections. The soma integrates these signals and decides whether to trigger an output signal. If triggered, the axon transmits this signal to other neurons.
-### Applications of Deep Learning
+Each element of a biological neuron has a computational analog in artificial systems, reflecting the principles of learning, adaptability, and efficiency found in nature. @tbl-bio_nn2ai_nn captures this mapping between the components of biological and artificial neurons; viewed alongside @fig-bio_nn2ai_nn, it completes the picture of how biological structure translates into artificial computation.
-Deep learning is extensively used across numerous industries today, with its transformative impact evident in various sectors, as illustrated in @fig-deeplearning. In finance, it powers stock market prediction, risk assessment, and fraud detection, guiding investment strategies and improving financial decisions. Marketing leverages deep learning for customer segmentation and personalization, enabling highly targeted advertising and content optimization based on consumer behavior analysis. In manufacturing, it streamlines production processes and enhances quality control, allowing companies to boost productivity and minimize waste. Healthcare benefits from deep learning in diagnosis, treatment planning, and patient monitoring, potentially saving lives through improved medical predictions.
++-----------------------+-----------------------+
+| Biological Neuron | Artificial Neuron |
++:======================+:======================+
+| Cell | Neuron / Node |
++-----------------------+-----------------------+
+| Dendrites / Synapse | Weights |
++-----------------------+-----------------------+
+| Soma | Net Input |
++-----------------------+-----------------------+
+| Axon | Output |
++-----------------------+-----------------------+
-![Deep learning applications, benefits, and implementations across various industries including finance, marketing, manufacturing, and healthcare. Source: [Leeway Hertz](https://www.leewayhertz.com/what-is-deep-learning/)](images/png/deeplearning.png){#fig-deeplearning}
+: Mapping the biological neuron structure to an artificial neuron. {#tbl-bio_nn2ai_nn .striped .hover}
-Beyond these core industries, deep learning enhances everyday products and services. Netflix uses it to strengthen its recommender systems, providing users with more [personalized recommendations](https://dl.acm.org/doi/abs/10.1145/3543873.3587675). Google has significantly improved its Translate service, now handling over [100 languages](https://cloud.google.com/translate/docs/languages) with increased accuracy, as highlighted in their [recent advances](https://research.google/blog/recent-advances-in-google-translate/). Autonomous vehicles from companies like Waymo, Cruise, and Motional have become a reality through deep learning in their [perception system](https://motional.com/news/technically-speaking-improving-av-perception-through-transformative-machine-learning). Additionally, Amazon employs deep learning at the edge in Alexa devices for tasks such as [keyword spotting](https://towardsdatascience.com/how-amazon-alexa-works-your-guide-to-natural-language-processing-ai-7506004709d3). These applications demonstrate how machine learning often predicts and processes information with greater accuracy and speed than humans, revolutionizing various aspects of our daily lives.
+Each component serves a similar function, albeit through vastly different mechanisms. Here, we explain these mappings and their implications for artificial neural networks.
-### Relevance to Embedded AI
+1. **Cell ↔ Neuron/Node:** The artificial neuron or node serves as the fundamental computational unit, mirroring the cell's role in biological systems.
-Embedded AI, the integration of AI algorithms directly into hardware devices, naturally gains from deep learning capabilities. Combining deep learning algorithms and embedded systems has laid the groundwork for intelligent, autonomous devices capable of advanced on-device data processing and analysis. Deep learning aids in extracting complex patterns and information from input data, which is essential in developing smart embedded systems, from household appliances to industrial machinery. This collaboration ushers in a new era of intelligent, interconnected devices that can learn and adapt to user behavior and environmental conditions, optimizing performance and offering unprecedented convenience and efficiency.
+2. **Dendrites/Synapse ↔ Weights:** Weights in artificial neurons represent connection strengths, analogous to synapses in biological neurons. These weights are adjustable, enabling learning and optimization over time.
-## Neural Networks
+3. **Soma ↔ Net Input:** The net input in artificial neurons sums weighted inputs to determine activation, similar to how the soma integrates signals in biological neurons.
-Deep learning draws inspiration from the human brain's neural networks to create decision-making patterns. This section digs into the foundational concepts of deep learning, providing insights into the more complex topics discussed later in this primer.
+4. **Axon ↔ Output:** The output of an artificial neuron passes processed information to subsequent network layers, much like an axon transmits signals to other neurons.
-Neural networks serve as the foundation of deep learning, inspired by the biological neural networks in the human brain to process and analyze data hierarchically. Neural networks are composed of basic units called perceptrons, which are typically organized into layers. Each layer consists of several perceptrons, and multiple layers are stacked to form the entire network. The connections between these layers are defined by sets of weights or parameters that determine how data is processed as it flows from the input to the output of the network.
+This mapping illustrates how artificial neural networks simplify and abstract biological processes while preserving their essential computational principles. However, understanding individual neurons is just the beginning—the true power of neural networks emerges from how these basic units work together in larger systems.
-Below, we examine the primary components and structures in neural networks.
+### Artificial Intelligence
-### Perceptrons
+The translation from biological principles to artificial computation requires a deep appreciation of what makes biological neural networks so effective at both the cellular and network levels. The brain processes information through distributed computation across billions of neurons, each operating relatively slowly compared to silicon transistors. A biological neuron fires at approximately 200 Hz, while modern processors operate at gigahertz frequencies. Despite this speed limitation, the brain's parallel architecture enables sophisticated real-time processing of complex sensory input, decision making, and control of behavior.
+
+This computational efficiency emerges from the brain's basic organizational principles. Each neuron acts as a simple processing unit, integrating inputs from thousands of other neurons and producing a binary output signal based on whether this integrated input exceeds a threshold. The connection strengths between neurons, mediated by synapses, are continuously modified through experience. This synaptic plasticity forms the basis for learning and adaptation in biological neural networks. These biological principles suggest key computational elements needed in artificial neural systems:
+
+* Simple processing units that integrate multiple inputs
+* Adjustable connection strengths between units
+* Nonlinear activation based on input thresholds
+* Parallel processing architecture
+* Learning through modification of connection strengths
+
+### Computational Translation
+
+We face the challenge of capturing the essence of neural computation within the rigid framework of digital systems. The implementation of biological principles in artificial neural systems represents a nuanced balance between biological fidelity and computational efficiency. At its core, an artificial neuron captures the essential computational properties of its biological counterpart through mathematical operations that can be efficiently executed on digital hardware.
+
+@tbl-bio2comp provides a systematic view of how key biological features map to their computational counterparts. Each biological feature has an analog in computational systems, revealing both the possibilities and limitations of digital neural implementation, which we will learn more about later.
+
++---------------------+---------------------------+
+| Biological Feature | Computational Translation |
++:====================+:==========================+
+| Neuron firing | Activation function |
++---------------------+---------------------------+
+| Synaptic strength | Weighted connections |
++---------------------+---------------------------+
+| Signal integration | Summation operation |
++---------------------+---------------------------+
+| Distributed memory | Weight matrices |
++---------------------+---------------------------+
+| Parallel processing | Concurrent computation |
++---------------------+---------------------------+
+
+: Translating biological features to the computing domain. {#tbl-bio2comp .striped .hover}
+
+The basic computational unit in artificial neural networks, the artificial neuron, simplifies the complex electrochemical processes of biological neurons into three fundamental operations. First, input signals are weighted, mimicking how biological synapses modulate incoming signals with different strengths. Second, these weighted inputs are summed together, analogous to how a biological neuron integrates incoming signals in its cell body. Finally, the summed input passes through an activation function that determines the neuron's output, similar to how a biological neuron fires based on whether its membrane potential exceeds a threshold.
+
+This mathematical abstraction preserves key computational principles while enabling efficient digital implementation. The weighting of inputs allows the network to learn which connections are important, just as biological neural networks strengthen or weaken synaptic connections through experience. The summation operation captures how biological neurons integrate multiple inputs into a single decision. The activation function introduces nonlinearity essential for learning complex patterns, much like the threshold-based firing of biological neurons.
+
+Memory in artificial neural networks takes a markedly different form from biological systems. While biological memories are distributed across synaptic connections and neural patterns, artificial networks store information in discrete weights and parameters. This architectural difference reflects the constraints of current computing hardware, where memory and processing are physically separated rather than integrated as in biological systems. Despite these implementation differences, artificial neural networks achieve similar functional capabilities in pattern recognition and learning.
+
+The brain's massive parallelism represents a fundamental challenge in artificial implementation. While biological neural networks process information through billions of neurons operating simultaneously, artificial systems approximate this parallelism through specialized hardware like GPUs and tensor processing units. These devices efficiently compute the matrix operations that form the mathematical foundation of artificial neural networks, achieving parallel processing at a different scale and granularity than biological systems.
+
+### System Requirements
+
+The computational translation of neural principles creates specific demands on the underlying computing infrastructure. These requirements emerge from the fundamental differences between biological and artificial implementations of neural processing, shaping how we design and build systems capable of supporting artificial neural networks.
+
+@tbl-comp2sys shows how each computational element drives particular system requirements. From this mapping, we can see how the choices made in computational translation directly influence the hardware and system architecture needed for implementation.
+
++-----------------------+---------------------------------+
+| Computational Element | System Requirements |
++:======================+:================================+
+| Activation functions | Fast nonlinear operation units |
++-----------------------+---------------------------------+
+| Weight operations | High-bandwidth memory access |
++-----------------------+---------------------------------+
+| Parallel computation | Specialized parallel processors |
++-----------------------+---------------------------------+
+| Weight storage | Large-scale memory systems |
++-----------------------+---------------------------------+
+| Learning algorithms | Gradient computation hardware |
++-----------------------+---------------------------------+
+
+: From computation to system requirements. {#tbl-comp2sys .striped .hover}
+
+Storage architecture represents a critical requirement, driven by the fundamental difference in how biological and artificial systems handle memory. In biological systems, memory and processing are intrinsically integrated—synapses both store connection strengths and process signals. Artificial systems, however, must maintain a clear separation between processing units and memory. This creates a need for both high-capacity storage to hold millions or billions of connection weights and high-bandwidth pathways to move this data quickly between storage and processing units. The efficiency of this data movement often becomes a critical bottleneck that biological systems do not face.
+
+The learning process itself imposes distinct requirements on artificial systems. While biological networks modify synaptic strengths through local chemical processes, artificial networks must coordinate weight updates across the entire network. This creates substantial computational and memory demands during training—systems must not only store current weights but also maintain space for gradients and intermediate calculations. The requirement to backpropagate error signals, with no real biological analog, further complicates the system architecture.
+
+Energy efficiency emerges as a final critical requirement, highlighting perhaps the starkest contrast between biological and artificial implementations. The human brain's remarkable energy efficiency—operating on roughly 20 watts—stands in sharp contrast to the substantial power demands of artificial neural networks. Current systems often require orders of magnitude more energy to implement similar capabilities. This gap drives ongoing research in more efficient hardware architectures and has profound implications for the practical deployment of neural networks, particularly in resource-constrained environments like mobile devices or edge computing systems.
+
+### Evolution and Impact
+
+We can now better appreciate how the field of deep learning evolved to meet these challenges through advances in hardware and algorithms. This journey began with early artificial neural networks in the 1950s, marked by the introduction of the Perceptron. While groundbreaking in concept, these early systems were severely limited by the computational capabilities of their era—primarily mainframe computers that lacked both the processing power and memory capacity needed for complex networks.
+
+The development of backpropagation algorithms in the 1980s [@rumelhart1986learning], which we will learn about later, represented a theoretical breakthrough and provided a systematic way to train multi-layer networks. However, the computational demands of this algorithm far exceeded available hardware capabilities. Training even modest networks could take weeks, making experimentation and practical applications challenging. This mismatch between algorithmic requirements and hardware capabilities contributed to a period of reduced interest in neural networks.
+
+The term "deep learning" gained prominence in the 2000s, coinciding with significant advances in computational power and data accessibility. The field has since experienced exponential growth, as illustrated in @fig-trends. The graph reveals two remarkable trends: computational capabilities measured in the number of Floating Point Operations per Second (FLOPS) initially followed a 1.4x improvement pattern from 1952 to 2010, then accelerated to a 3.4-month doubling cycle from 2012 to 2022. Perhaps more striking is the emergence of large-scale models between 2015 and 2022 (not explicitly shown or easily seen in the figure), which scaled 2 to 3 orders of magnitude faster than the general trend, following an aggressive 10-month doubling cycle.
+
+![Growth of deep learning models. Source: Epoch AI](https://epochai.org/assets/images/posts/2022/compute-trends.png){#fig-trends}
+
+These evolutionary trends were driven by parallel advances across three fundamental dimensions: data availability, algorithmic innovations, and computing infrastructure.
+
+These three factors—data, algorithms, and infrastructure—reinforced each other. As @fig-virtuous-cycle shows, more powerful computing infrastructure enabled processing larger datasets. Larger datasets drove algorithmic innovations. Better algorithms demanded more sophisticated computing systems. This virtuous cycle continues to drive progress in the field today.
+
+![The virtuous cycle enabled by key breakthroughs in each layer.](images/png/virtuous-cycle.png){#fig-virtuous-cycle width=40%}
+
+The data revolution transformed what was possible with neural networks. The rise of the internet and digital devices created unprecedented access to training data. Image sharing platforms provided millions of labeled images. Digital text collections enabled language processing at scale. Sensor networks and IoT devices generated continuous streams of real-world data. This abundance of data provided the raw material needed for neural networks to learn complex patterns effectively.
+
+Algorithmic innovations made it possible to harness this data effectively. New methods for initializing networks and controlling learning rates made training more stable. Techniques for preventing overfitting allowed models to generalize better to new data. Most importantly, researchers discovered that neural network performance scaled predictably with model size, computation, and data quantity, leading to increasingly ambitious architectures.
+
+Computing infrastructure evolved to meet these growing demands. On the hardware side, graphics processing units (GPUs) provided the parallel processing capabilities needed for efficient neural network computation. Specialized AI accelerators like TPUs [@jouppi2017datacenter] pushed performance further. High-bandwidth memory systems and fast interconnects addressed data movement challenges. Equally important were software advances—frameworks and libraries that made it easier to build and train networks, distributed computing systems that enabled training at scale, and tools for optimizing model deployment.
+
+## Neural Network Foundations
+
+We can now examine the fundamental building blocks that make machine learning systems work. While the field has grown tremendously in sophistication, all modern neural networks—from simple classifiers to large language models—share a common architectural foundation built upon basic computational units and principles.
+
+This foundation begins with understanding how individual artificial neurons process information, how they are organized into layers, and how these layers are connected to form complete networks. By starting with these fundamental concepts, we can progressively build up to understanding more complex architectures and their applications.
+
+Neural networks have come a long way since their inception in the 1950s, when the perceptron was first introduced. After a period of decline in popularity due to computational and theoretical limitations, the field saw a resurgence in the 2000s, driven by advancements in hardware (e.g., GPUs) and innovations like deep learning. These breakthroughs have made it possible to train networks with millions of parameters, enabling applications once considered impossible.
+
+### Basic Architecture
+
+The architecture of a neural network determines how information flows through the system, from input to output. While modern networks can be tremendously complex, they all build upon a few key organizational principles that we will explore in the following sections. Understanding these principles is essential for both implementing neural networks and appreciating how they achieve their remarkable capabilities.
+
+#### Neurons and Activations
The Perceptron is the basic unit or node that forms the foundation for more complex structures. It functions by taking multiple inputs, each representing a feature of the object under analysis, such as the characteristics of a home for predicting its price or the attributes of a song to forecast its popularity in music streaming services. These inputs are denoted as $x_1, x_2, ..., x_n$. A perceptron can be configured to perform either regression or classification tasks. For regression, the actual numerical output $\hat{y}$ is used. For classification, the output depends on whether $\hat{y}$ crosses a certain threshold. If $\hat{y}$ exceeds this threshold, the perceptron might output one class (e.g., 'yes'), and if it does not, another class (e.g., 'no').
-@fig-perceptron illustrates the fundamental building blocks of a perceptron, which serves as the foundation for more complex neural networks. A perceptron can be thought of as a miniature decision-maker, utilizing its weights, bias, and activation function to process inputs and generate outputs based on learned parameters. This concept forms the basis for understanding more intricate neural network architectures, such as multilayer perceptrons. In these advanced structures, layers of perceptrons work in concert, with each layer's output serving as the input for the subsequent layer. This hierarchical arrangement creates a deep learning model capable of comprehending and modeling complex, abstract patterns within data. By stacking these simple units, neural networks gain the ability to tackle increasingly sophisticated tasks, from image recognition to natural language processing.
+@fig-perceptron illustrates the fundamental building blocks of a perceptron, which serves as the foundation for more complex neural networks. A perceptron can be thought of as a miniature decision-maker, utilizing its weights, bias, and activation function to process inputs and generate outputs based on learned parameters. This concept forms the basis for understanding more intricate neural network architectures, such as multilayer perceptrons.
+
+In these advanced structures, layers of perceptrons work in concert, with each layer's output serving as the input for the subsequent layer. This hierarchical arrangement creates a deep learning model capable of comprehending and modeling complex, abstract patterns within data. By stacking these simple units, neural networks gain the ability to tackle increasingly sophisticated tasks, from image recognition to natural language processing.
-![Perceptron. Conceived in the 1950s, perceptrons paved the way for developing more intricate neural networks and have been a fundamental building block in deep learning. Source: Wikimedia - Chrislb.](images/png/Rosenblattperceptron.png){#fig-perceptron}
+![Perceptron. Conceived in the 1950s, perceptrons paved the way for developing more intricate neural networks and have been a fundamental building block in deep learning. Source: Wikimedia---Chrislb.](images/png/Rosenblattperceptron.png){#fig-perceptron}
Each input $x_i$ has a corresponding weight $w_{ij}$, and the perceptron simply multiplies each input by its matching weight. This operation is similar to linear regression, where the intermediate output, $z$, is computed as the sum of the products of inputs and their weights:
@@ -100,238 +334,949 @@ $$
z = \sum (x_i \cdot w_{ij}) + b
$$
-This basic form of a perceptron can only model linear relationships between the input and output. Patterns found in nature are often complex and extend beyond linear relationships. To enable the perceptron to handle non-linear relationships, an activation function is applied to the linear output $z$.
+Common activation functions include:
+
+* **ReLU (Rectified Linear Unit)**: Defined as $f(x) = \max(0,x)$, it introduces sparsity and accelerates convergence in deep networks. Its simplicity and effectiveness have made it the default choice in many modern architectures.
+
+* **Sigmoid**: Historically popular, the sigmoid function maps inputs to a range between 0 and 1 but is prone to vanishing gradients in deeper architectures. It's particularly useful in binary classification problems where probabilities are needed.
+
+* **Tanh**: Similar to sigmoid but maps inputs to a range of -1 to 1, centering the data. This centered output often leads to faster convergence in practice compared to sigmoid.
+
+![Activation functions enable the modeling of complex non-linear relationships. Source: Medium---Sachin Kaushik.](images/png/nonlinear_patterns.png){#fig-nonlinear}
+
+These activation functions transform the linear input sum into a non-linear output:
$$
\hat{y} = \sigma(z)
$$
-@fig-nonlinear illustrates an example where data exhibit a nonlinear pattern that could not be adequately modeled with a linear approach. The activation function, such as sigmoid, tanh, or ReLU, transforms the linear input sum into a non-linear output. The primary objective of this function is to introduce non-linearity into the model, enabling it to learn and perform more sophisticated tasks. Thus, the final output of the perceptron, including the activation function, can be expressed as:
+@fig-nonlinear shows an example where data exhibit a nonlinear pattern that could not be adequately modeled with a linear approach. The activation function enables the network to learn and represent complex relationships in the data, making it possible to solve sophisticated tasks like image recognition or speech processing.
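+
+To make this concrete, the short sketch below (using NumPy, with made-up input values and weights) computes a single perceptron's weighted sum $z$ and then applies each of the activation functions described above; all names and numbers are illustrative, not taken from any particular model.
+
+```python
+import numpy as np
+
+# Illustrative perceptron: three input features with hypothetical weights and bias
+x = np.array([0.5, -1.2, 3.0])   # inputs x_1, x_2, x_3
+w = np.array([0.8, 0.1, -0.4])   # one weight per input
+b = 0.2                          # bias term
+
+# Linear combination: z = sum(x_i * w_i) + b
+z = np.dot(x, w) + b
+
+# Non-linear activations applied to z
+relu_out = np.maximum(0.0, z)            # ReLU: max(0, z)
+sigmoid_out = 1.0 / (1.0 + np.exp(-z))   # Sigmoid: maps z into (0, 1)
+tanh_out = np.tanh(z)                    # Tanh: maps z into (-1, 1)
+
+print(z, relu_out, sigmoid_out, tanh_out)
+```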
+
+#### Layers and Connections
+
+While a single perceptron can model simple decisions, the power of neural networks comes from combining multiple neurons into layers. A layer is a collection of neurons that process information in parallel. Each neuron in a layer operates independently on the same input but with its own set of weights and bias, allowing the layer to learn different features or patterns from the same input data.
+
+In a typical neural network, we organize these layers hierarchically:
+
+1. **Input Layer**: Receives the raw data features
+2. **Hidden Layers**: Process and transform the data through multiple stages
+3. **Output Layer**: Produces the final prediction or decision
+
+@fig-layers illustrates this layered architecture. When data flows through these layers, each successive layer transforms the representation of the data, gradually building more complex and abstract features. This hierarchical processing is what gives deep neural networks their remarkable ability to learn complex patterns.
+
+![Neural network layers. Source: BrunelloN](images/png/nnlayers.png){#fig-layers}
+
+#### Data Flow and Layer Transformations
+
+As data flows through the network, it is transformed at each layer to extract meaningful patterns. Each layer combines the input data using learned weights and biases, then applies an activation function to introduce non-linearity. This process can be written mathematically as:
+
+$$
+\mathbf{z}^{(l)} = \mathbf{W}^{(l)}\mathbf{x}^{(l-1)} + \mathbf{b}^{(l)}
+$$
+
+Where:
+
+* $\mathbf{x}^{(l-1)}$ is the input vector from the previous layer
+
+* $\mathbf{W}^{(l)}$ is the weight matrix for the current layer
+
+* $\mathbf{b}^{(l)}$ is the bias vector
+
+* $\mathbf{z}^{(l)}$ is the pre-activation output
+
+Now that we have covered the basics, @vid-nn provides a great overview of how neural networks work using handwritten digit recognition. It introduces some new concepts that we will explore in more depth soon, but it serves as an excellent introduction.
+
+:::{#vid-nn .callout-important}
+
+# Neural Network
+
+{{< video https://youtu.be/aircAruvnKk?si=P7aT71L_uGT4xUz6 >}}
+
+:::
+
+### Weights and Biases
+
+#### Weight Matrices
-![Activation functions enable the modeling of complex non-linear relationships. Source: Medium - Sachin Kaushik.](images/png/nonlinear_patterns.png){#fig-nonlinear}
+Weights in neural networks determine how strongly inputs influence the output of a neuron. While we first discussed weights for a single perceptron, in larger networks, weights are organized into matrices for efficient computation across entire layers. For example, in a layer with $n$ input features and $m$ neurons, the weights form a matrix $\mathbf{W} \in \mathbb{R}^{n \times m}$. Each column in this matrix represents the weights for a single neuron in the layer. This organization allows the network to process multiple inputs simultaneously, an essential feature for handling real-world data efficiently.
-### Multilayer Perceptrons
+Let's consider how this extends our previous perceptron equations to handle multiple neurons simultaneously. For a layer of $m$ neurons, instead of computing each neuron's output separately:
-Multilayer perceptrons (MLPs) are an evolution of the single-layer perceptron model, featuring multiple layers of nodes connected in a feedforward manner. @fig-mlp provides a visual representation of this structure. As illustrated in the figure, information in a feedforward network moves in only one direction - from the input layer on the left, through the hidden layers in the middle, to the output layer on the right, without any cycles or loops.
+$$
+z_j = \sum_{i=1}^n (x_i \cdot w_{ij}) + b_j
+$$
+
+We can compute all outputs at once using matrix multiplication:
+
+$$
+\mathbf{z} = \mathbf{x}^T\mathbf{W} + \mathbf{b}
+$$
+
+This matrix organization is more than just mathematical convenience---it reflects how modern neural networks are implemented for efficiency. Each weight $w_{ij}$ represents the strength of the connection between input feature $i$ and neuron $j$ in the layer.
-![Multilayer Perceptron. Source: Wikimedia - Charlie.](https://www.nomidl.com/wp-content/uploads/2022/04/image-7.png){width=70% #fig-mlp}
+#### Connection Patterns
-While a single perceptron is limited in its capacity to model complex patterns, the real strength of neural networks emerges from the assembly of multiple layers. Each layer consists of numerous perceptrons working together, allowing the network to capture intricate and non-linear relationships within the data. With sufficient depth and breadth, these networks can approximate virtually any function, no matter how complex.
+In the simplest and most common case, each neuron in a layer is connected to every neuron in the previous layer, forming what we call a "dense" or "fully-connected" layer. This pattern means that each neuron has the opportunity to learn from all available features from the previous layer.
-### Training Process
+@fig-connections illustrates these dense connections between layers. For a network with layers of sizes $(n_1, n_2, n_3)$, the weight matrices would have dimensions:
-A neural network receives an input, performs a calculation, and produces a prediction. The prediction is determined by the calculations performed within the sets of perceptrons found between the input and output layers. These calculations depend primarily on the input and the weights. Since you do not have control over the input, the objective during training is to adjust the weights in such a way that the output of the network provides the most accurate prediction.
+![Dense connections between layers in an MLP. Source: J. McCaffrey](images/png/mlp_connection_weights.png){#fig-connections}
-The training process involves several key steps, beginning with the forward pass, where the existing weights of the network are used to calculate the output for a given input. This output is then compared to the true target values to calculate an error, which measures how well the network's prediction matches the expected outcome. Following this, a backward pass is performed. This involves using the error to make adjustments to the weights of the network through a process called backpropagation. This adjustment reduces the error in subsequent predictions. The cycle of forward pass, error calculation, and backward pass is repeated iteratively. This process continues until the network's predictions are sufficiently accurate or a predefined number of iterations is reached, effectively minimizing the loss function used to measure the error.
+* Between first and second layer: $\mathbf{W}^{(1)} \in \mathbb{R}^{n_1 \times n_2}$
+* Between second and third layer: $\mathbf{W}^{(2)} \in \mathbb{R}^{n_2 \times n_3}$
-#### Forward Pass
+#### Bias Terms
-The forward pass is the initial phase where data moves through the network from the input to the output layer, as illustrated in @fig-forward-propagation. At the start of training, the network's weights are randomly initialized, setting the initial conditions for learning. During the forward pass, each layer performs specific computations on the input data using these weights and biases, and the results are then passed to the subsequent layer. The final output of this phase is the network's prediction. This prediction is compared to the actual target values present in the dataset to calculate the loss, which can be thought of as the difference between the predicted outputs and the target values. The loss quantifies the network's performance at this stage, providing a crucial metric for the subsequent adjustment of weights during the backward pass.
+Each neuron in a layer also has an associated bias term. While weights determine the relative importance of inputs, biases allow neurons to shift their activation functions. This shifting is crucial for learning, as it gives the network flexibility to fit more complex patterns.
-![Neural networks - forward and backward propagation. Source: [Linkedin](https://www.linkedin.com/pulse/lecture2-unveiling-theoretical-foundations-ai-machine-underdown-phd-oqsuc/)](images/png/forwardpropagation.png){#fig-forward-propagation}
+For a layer with $m$ neurons, the bias terms form a vector $\mathbf{b} \in \mathbb{R}^m$. When we compute the layer's output, this bias vector is added to the weighted sum of inputs:
-#### Backward Pass (Backpropagation) {#sec-backward_pass}
+$$
+\mathbf{z} = \mathbf{x}^T\mathbf{W} + \mathbf{b}
+$$
+
+The bias terms effectively allow each neuron to have a different "threshold" for activation, making the network more expressive.
+
+#### Parameter Organization
-After completing the forward pass and computing the loss, which measures how far the model's predictions deviate from the actual target values, the next step is to improve the model's performance by adjusting the network's weights. Since we cannot control the inputs to the model, adjusting the weights becomes our primary method for refining the model.
+The organization of weights and biases across a neural network follows a systematic pattern. For a network with $L$ layers, we maintain:
-We determine how to adjust the weights of our model through a key algorithm called backpropagation. Backpropagation uses the calculated loss to determine the gradient of each weight. These gradients describe the direction and magnitude in which the weights should be adjusted. By tuning the weights based on these gradients, the model is better positioned to make predictions that are closer to the actual target values in the next forward pass.
+* A weight matrix $\mathbf{W}^{(l)}$ for each layer $l$
-Grasping these foundational concepts paves the way to understanding more intricate deep learning architectures and techniques, fostering the development of more sophisticated and productive applications, especially within embedded AI systems.
+* A bias vector $\mathbf{b}^{(l)}$ for each layer $l$
-@vid-gd and @vid-bp build upon @vid-nn. They cover gradient descent and backpropagation in neural networks.
+* Activation functions $f^{(l)}$ for each layer $l$
-:::{#vid-gd .callout-important}
+This gives us the complete layer computation:
+
+$$
+\mathbf{h}^{(l)} = f^{(l)}(\mathbf{z}^{(l)}) = f^{(l)}(\mathbf{h}^{(l-1)T}\mathbf{W}^{(l)} + \mathbf{b}^{(l)})
+$$
-# Gradient descent
+Where $\mathbf{h}^{(l)}$ represents the layer's output after applying the activation function.
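+
+As a minimal sketch of this layer computation (assuming NumPy and small, randomly generated weights purely for illustration), a fully-connected layer can be written as a single matrix multiply followed by an elementwise activation:
+
+```python
+import numpy as np
+
+def dense_layer(h_prev, W, b, activation):
+    """One fully-connected layer: h = f(h_prev^T W + b).
+
+    h_prev: (n_in,) output of the previous layer
+    W:      (n_in, n_out) weight matrix, one column per neuron
+    b:      (n_out,) bias vector
+    """
+    z = h_prev @ W + b      # pre-activation output z^(l)
+    return activation(z)    # apply the non-linearity elementwise
+
+relu = lambda z: np.maximum(0.0, z)
+
+# Illustrative sizes: 4 input features feeding a layer of 3 neurons
+rng = np.random.default_rng(0)
+x = rng.normal(size=4)
+W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)
+h1 = dense_layer(x, W1, b1, relu)
+print(h1.shape)   # (3,)
+```
+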
-{{< video https://www.youtube.com/watch?v=IHZwWFHWa-w&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=2 >}}
+### Network Topology
+
+Network topology describes how the basic building blocks we've discussed—neurons, layers, and connections—come together to form a complete neural network. We can best understand network topology through a concrete example. Consider the task of recognizing handwritten digits, a classic problem in deep learning using the MNIST[^defn-MNIST] dataset.
+
+[^defn-MNIST]: MNIST (Modified National Institute of Standards and Technology) is a large database of handwritten digits that has been widely used to train and test machine learning systems since its creation in 1998. The dataset consists of 60,000 training images and 10,000 testing images, each being a 28×28 pixel grayscale image of a single handwritten digit from 0 to 9.
+
+#### Basic Structure
+
+The fundamental structure of a neural network consists of three main components: input layer, hidden layers, and output layer. As shown in @fig-mnist-topology-1, a 28×28 pixel grayscale image of a handwritten digit must be processed through these layers to produce a classification output.
+
+::: {layout-nrow=1}
+
+![A neural network topology for classifying MNIST digits, showing how a 28x28 pixel image is processed. The image on the left shows the original digit, with dimensions labeled. The network on the right shows how each pixel connects to the hidden layers, ultimately producing 10 outputs for digit classification.](images/png/topology_28x28.png){#fig-mnist-topology-1}
+
+![Alternative visualization of the MNIST network topology, showing how the 2D image is flattened into a 784-dimensional vector before being processed by the network. This representation emphasizes how spatial data is transformed into a format suitable for neural network processing.](images/png/topology_flatten.png){#fig-mnist-topology-2}
:::
+The input layer's width is directly determined by our data format. As shown in @fig-mnist-topology-2, for a 28×28 pixel image, each pixel becomes an input feature, requiring 784 input neurons (28 × 28 = 784). We can think of this either as a 2D grid of pixels or as a flattened vector of 784 values, where each value represents the intensity of one pixel.
+
+The output layer's structure is determined by our task requirements. For digit classification, we use 10 output neurons, one for each possible digit (0-9). When presented with an image, the network produces a value for each output neuron, where higher values indicate greater confidence that the image represents that particular digit.
+
+Between these fixed input and output layers, we have flexibility in designing the hidden layer topology. The choice of hidden layer structure—how many layers to use and how wide to make them—represents one of the fundamental design decisions in neural networks. Additional layers increase the network's depth, allowing it to learn more abstract features through successive transformations. The width of each layer provides capacity for learning different features at each level of abstraction.
+
+These basic topological choices have significant implications for both the network's capabilities and its computational requirements. Each additional layer or neuron increases the number of parameters that must be stored and computed during both training and inference. However, without sufficient depth or width, the network may lack the capacity to learn complex patterns in the data.
+
+#### Design Trade-offs
+
+The design of neural network topology centers on three fundamental decisions: the number of layers (depth), the size of each layer (width), and how these layers connect. Each choice affects both the network's learning capability and its computational requirements.
+
+Network depth determines the level of abstraction the network can achieve. Each layer transforms its input into a new representation, and stacking multiple layers allows the network to build increasingly complex features. In our MNIST example, a deeper network might first learn to detect edges, then combine these edges into strokes, and finally assemble strokes into complete digit patterns. However, adding layers isn't always beneficial—deeper networks increase computational cost substantially, can be harder to train due to vanishing gradients, and may require more sophisticated training techniques.
+
+The width of each layer—the number of neurons it contains—controls how much information the network can process in parallel at each stage. Wider layers can learn more features simultaneously but require proportionally more parameters and computation. For instance, if a hidden layer is processing edge features in our digit recognition task, its width determines how many different edge patterns it can detect simultaneously.
+
+A very important consideration in topology design is the total parameter count. For a network with layers of size $(n_1, n_2, ..., n_L)$, each pair of adjacent layers $l$ and $l+1$ requires $n_l \times n_{l+1}$ weight parameters, plus $n_{l+1}$ bias parameters. These parameters must be stored in memory and updated during training, making the parameter count a key constraint in practical applications.
+
+When designing networks, we need to balance learning capacity, computational efficiency, and ease of training. While the basic approach connects every neuron to every neuron in the next layer (fully connected), this isn't always the most effective strategy. Sometimes, using fewer but more strategic connections—like in convolutional networks—can achieve better results with less computation. Consider our MNIST example—when humans recognize digits, we don't analyze every pixel independently but look for meaningful patterns like lines and curves. Similarly, we can design our network to focus on local patterns in the image rather than treating each pixel as completely independent.
+
+Another important consideration is how information flows through the network. While the basic flow is from input to output, some network designs include additional paths for information to flow, such as skip connections or residual connections. These alternative paths can make the network easier to train and more effective at learning complex patterns. Think of these as shortcuts that help information flow more directly when needed, similar to how our brain can combine both detailed and general impressions when recognizing objects.
+
+These design decisions have significant practical implications for memory usage for storing network parameters, computational costs during both training and inference, training behavior and convergence, and the network's ability to generalize to new examples. The optimal balance of these trade-offs depends heavily on your specific problem, available computational resources, and dataset characteristics. Successful network design requires carefully weighing these factors against practical constraints.
+
+#### Connection Patterns
+
+Neural networks can be structured with different connection patterns between layers, each offering distinct advantages for learning and computation. Understanding these fundamental patterns provides insight into how networks process information and learn representations from data.
+
+Dense connectivity represents the standard pattern where each neuron connects to every neuron in the subsequent layer. In our MNIST example, connecting our 784-dimensional input layer to a hidden layer of 100 neurons requires 78,400 weight parameters. This full connectivity enables the network to learn arbitrary relationships between inputs and outputs, but the number of parameters grows with the product of adjacent layer widths, scaling roughly quadratically as layers get wider.
+
+Sparse connectivity patterns introduce purposeful restrictions in how neurons connect between layers. Rather than maintaining all possible connections, neurons connect to only a subset of neurons in the adjacent layer. This approach draws inspiration from biological neural systems, where neurons typically form connections with a limited number of other neurons. In visual processing tasks like our MNIST example, neurons might connect only to inputs representing nearby pixels, reflecting the local nature of visual features.
+
+As networks grow deeper, the path from input to output becomes longer, potentially complicating the learning process. Skip connections address this by adding direct paths between non-adjacent layers. These connections provide alternative routes for information flow, supplementing the standard layer-by-layer progression. In our digit recognition example, skip connections might allow later layers to reference both high-level patterns and the original pixel values directly.
+
+These connection patterns have significant implications for both the theoretical capabilities and practical implementation of neural networks. Dense connections maximize learning flexibility at the cost of computational efficiency. Sparse connections can reduce computational requirements while potentially improving the network's ability to learn structured patterns. Skip connections help maintain effective information flow in deeper networks.
+
+#### Weight Considerations
+
+The arrangement of weights in a neural network fundamentally determines both its learning capacity and computational requirements. While topology defines the network's structure, the initialization and organization of weights within this structure plays a crucial role in how the network learns and performs.
+
+The number of weights in a network grows with both width and depth. For our MNIST example, consider a network with a 784-dimensional input layer, two hidden layers of 100 neurons each, and a 10-neuron output layer. The first layer requires 78,400 weights (784 × 100), the second layer 10,000 weights (100 × 100), and the output layer 1,000 weights (100 × 10), totaling 89,400 weights. Each of these weights must be stored in memory and updated during learning.
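+
+A quick sketch (assuming the layer sizes quoted above) reproduces this count programmatically:
+
+```python
+# Layer widths: 784 inputs, two hidden layers of 100 neurons, 10 outputs
+sizes = [784, 100, 100, 10]
+
+weights = sum(n_in * n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))
+biases = sum(sizes[1:])
+
+print(weights)            # 89400 weight parameters
+print(weights + biases)   # 89610 parameters once biases are included
+```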
+
+Weight initialization plays a fundamental role in network behavior. When we create a new neural network, these weights must be set to initial values that enable effective learning. Setting all weights to zero would cause all neurons in a layer to behave identically, preventing the network from learning diverse features. Instead, weights are typically initialized randomly, but the scale of these random values matters significantly. Too large or too small initial weights can lead to poor learning dynamics.
+
+The distribution of weights across the network affects how information flows through layers. Consider our digit recognition task: if weights in early layers are too small, important details from the input image might not be preserved for later layers to process. Conversely, if weights are too large, the network might amplify noise in the input, making it harder to identify relevant patterns.
+
+Different network architectures may impose specific constraints on how weights are organized. Some architectures share weights across different parts of the network to encode specific properties, such as the ability to recognize patterns regardless of their position in an image. Other architectures might restrict certain weights to be zero, effectively implementing the sparse connectivity patterns discussed earlier.
+
+## Learning Process
+
+Neural networks learn to perform tasks through a process of training on examples. This process transforms the network from its initial state, where its weights are randomly initialized, to a trained state where the weights encode meaningful patterns from the training data. Understanding this process is fundamental to both the theoretical foundations and practical implementations of deep learning systems.
+
+### Training Overview
+
+The core principle of neural network training is supervised learning from labeled examples. Consider our MNIST digit recognition task: we have a dataset of 60,000 training images, each a 28×28 pixel grayscale image paired with its correct digit label. The network must learn the relationship between these images and their corresponding digits through an iterative process of prediction and weight adjustment.
+
+Training operates as a loop, where each iteration involves processing a subset of training examples called a batch. For each batch, the network performs several key operations:
+
+* Forward computation through the network layers to generate predictions
+* Evaluation of prediction accuracy using a loss function
+* Computation of weight adjustments based on prediction errors
+* Update of network weights to improve future predictions
+
+This process can be expressed mathematically. Given an input image $x$ and its true label $y$, the network computes its prediction:
+
+$$
+\hat{y} = f(x; \theta)
+$$
+
+where $f$ represents the neural network function and $\theta$ represents all trainable parameters (weights and biases, which we discussed earlier). The network's error is measured by a loss function $L$:
+
+$$
+\text{loss} = L(\hat{y}, y)
+$$
+
+This error measurement drives the adjustment of network parameters through a process called "backpropagation," which we will examine in detail later.
+
+In practice, training operates on batches of examples rather than individual inputs. For the MNIST dataset, each training iteration might process 32, 64, or 128 images simultaneously. This batch processing serves two purposes: it enables efficient use of modern computing hardware through parallel processing, and it provides more stable parameter updates by averaging errors across multiple examples.
+
+The training cycle continues until the network achieves sufficient accuracy or reaches a predetermined number of iterations. Throughout this process, the loss function serves as a guide, with its minimization indicating improved network performance.
+
+### Forward Propagation
+
+Forward propagation, as illustrated in @fig-forward-propagation, is the core computational process in a neural network, where input data flows through the network's layers to generate predictions. Understanding this process is essential as it forms the foundation for both network inference and training. Let's examine how forward propagation works using our MNIST digit recognition example.
+
+![Neural networks---forward and backward propagation. Source: [Linkedin](https://www.linkedin.com/pulse/lecture2-unveiling-theoretical-foundations-ai-machine-underdown-phd-oqsuc/)](images/png/forwardpropagation.png){#fig-forward-propagation}
+
+When an image of a handwritten digit enters our network, it undergoes a series of transformations through the layers. Each transformation combines the weighted inputs with learned patterns to progressively extract relevant features. In our MNIST example, a 28×28 pixel image is processed through multiple layers to ultimately produce probabilities for each possible digit (0-9).
+
+The process begins with the input layer, where each pixel's grayscale value becomes an input feature. For MNIST, this means 784 input values (28 × 28 = 784), each normalized between 0 and 1. These values then propagate forward through the hidden layers, where each neuron combines its inputs according to its learned weights and applies a nonlinear activation function.
+
+#### Layer-by-Layer Computation
+
+The forward computation through a neural network proceeds systematically, with each layer transforming its inputs into increasingly abstract representations. In our MNIST network, this transformation process occurs in distinct stages.
+
+At each layer, the computation involves two key steps: a linear transformation of inputs followed by a nonlinear activation. The linear transformation combines all inputs to a neuron using learned weights and a bias term. For a single neuron receiving inputs from the previous layer, this computation takes the form:
+
+$$
+z = \sum_{i=1}^n w_ix_i + b
+$$
+
+where $w_i$ represents the weights, $x_i$ the inputs, and $b$ the bias term. For an entire layer of neurons, we can express this more efficiently using matrix operations:
+
+$$
+\mathbf{Z}^{(l)} = \mathbf{W}^{(l)}\mathbf{A}^{(l-1)} + \mathbf{b}^{(l)}
+$$
+
+Here, $\mathbf{W}^{(l)}$ represents the weight matrix for layer $l$, $\mathbf{A}^{(l-1)}$ contains the activations from the previous layer, and $\mathbf{b}^{(l)}$ is the bias vector.
+
+Following this linear transformation, each layer applies a nonlinear activation function $f$:
+
+$$
+\mathbf{A}^{(l)} = f(\mathbf{Z}^{(l)})
+$$
+
+This process repeats at each layer, creating a chain of transformations:
+
+Input → Linear Transform → Activation → Linear Transform → Activation → ... → Output
+
+In our MNIST example, the pixel values first undergo a transformation by the first hidden layer's weights, converting the 784-dimensional input into an intermediate representation. Each subsequent layer further transforms this representation, ultimately producing a 10-dimensional output vector representing the network's confidence in each possible digit.
+
+#### Mathematical Representation
+
+The complete forward propagation process can be expressed as a composition of functions, each representing a layer's transformation. Let us formalize this mathematically, building on our MNIST example.
+
+For a network with L layers, we can express the full forward computation as:
+
+$$
+\mathbf{A}^{(L)} = f^{(L)}(\mathbf{W}^{(L)}f^{(L-1)}(\mathbf{W}^{(L-1)}...(f^{(1)}(\mathbf{W}^{(1)}\mathbf{X} + \mathbf{b}^{(1)}))... + \mathbf{b}^{(L-1)}) + \mathbf{b}^{(L)})
+$$
+
+While this nested expression captures the complete process, we typically compute it step by step:
+
+1. First layer:
+
+$$
+\mathbf{Z}^{(1)} = \mathbf{W}^{(1)}\mathbf{X} + \mathbf{b}^{(1)}
+$$
+$$
+\mathbf{A}^{(1)} = f^{(1)}(\mathbf{Z}^{(1)})
+$$
+
+2. Hidden layers (l = 2, ..., L-1):
+
+$$
+\mathbf{Z}^{(l)} = \mathbf{W}^{(l)}\mathbf{A}^{(l-1)} + \mathbf{b}^{(l)}
+$$
+$$
+\mathbf{A}^{(l)} = f^{(l)}(\mathbf{Z}^{(l)})
+$$
+
+3. Output layer:
+
+$$
+\mathbf{Z}^{(L)} = \mathbf{W}^{(L)}\mathbf{A}^{(L-1)} + \mathbf{b}^{(L)}
+$$
+$$
+\mathbf{A}^{(L)} = f^{(L)}(\mathbf{Z}^{(L)})
+$$
+
+In our MNIST example, if we process a batch of B images arranged as the columns of $\mathbf{X}$, the dimensions of these operations are:
+
+* Input $\mathbf{X}$: $784 \times B$
+* First layer weights $\mathbf{W}^{(1)}$: $n_1 \times 784$
+* Hidden layer weights $\mathbf{W}^{(l)}$: $n_l \times n_{l-1}$
+* Output layer weights $\mathbf{W}^{(L)}$: $10 \times n_{L-1}$
+
+#### Computational Process
+
+To understand how these mathematical operations translate into actual computation, let's walk through the forward propagation process for a batch of MNIST images. This process illustrates how data is transformed from raw pixel values to digit predictions.
+
+Consider a batch of 32 images entering our network. Each image starts as a 28×28 grid of pixel values, which we flatten into a 784-dimensional vector. For the entire batch, this gives us an input matrix $\mathbf{X}$ of size 32×784, where each row represents one image. The values are typically normalized to lie between 0 and 1.
+
+The transformation at each layer proceeds as follows:
+
+1. Input Layer Processing
+   The network takes our input matrix $\mathbf{X}$ (32×784, with each image stored as a row) and transforms it using the first layer's weights. If our first hidden layer has 128 neurons, $\mathbf{W}^{(1)}$ can be stored as a 784×128 matrix (the transpose of the column-oriented layout used in the equations above), so the computation $\mathbf{X}\mathbf{W}^{(1)}$ produces a 32×128 matrix.
+
+2. Hidden Layer Transformations
+ Each element in this matrix then has its corresponding bias added and passes through an activation function. For example, with a ReLU activation, any negative values become zero while positive values remain unchanged. This nonlinear transformation enables the network to learn complex patterns in the data.
+
+3. Output Generation
+ The final layer transforms its inputs into a 32×10 matrix, where each row contains 10 values corresponding to the network's confidence scores for each possible digit. Often, these scores are converted to probabilities using a softmax function:
+
+$$
+P(\text{digit } j) = \frac{e^{z_j}}{\sum_{k=1}^{10} e^{z_k}}
+$$
+
+For each image in our batch, this gives us a probability distribution over the possible digits. The digit with the highest probability becomes the network's prediction.
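+
+The sketch below (NumPy, with random scores standing in for the final layer's 32×10 output) shows how the softmax conversion and the final prediction step might look; the score matrix is synthetic, not produced by a trained network.
+
+```python
+import numpy as np
+
+def softmax(z):
+    """Row-wise softmax: convert raw scores into probabilities that sum to 1."""
+    z = z - z.max(axis=1, keepdims=True)   # shift for numerical stability
+    e = np.exp(z)
+    return e / e.sum(axis=1, keepdims=True)
+
+scores = np.random.default_rng(0).normal(size=(32, 10))  # stand-in for the 32x10 output
+probs = softmax(scores)            # each row is a probability distribution over digits
+preds = probs.argmax(axis=1)       # most probable digit for each image in the batch
+print(preds.shape)                 # (32,)
+```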
+
+#### Practical Considerations
+
+The implementation of forward propagation requires careful attention to several practical aspects that affect both computational efficiency and memory usage. These considerations become particularly important when processing large batches of data or working with deep networks.
+
+Memory management plays an important role during forward propagation. Each layer's activations must be stored for potential use in the backward pass during training. For our MNIST example with a batch size of 32, if we have three hidden layers of sizes 128, 256, and 128, the activation storage requirements are:
+
+* First hidden layer: 32 × 128 = 4,096 values
+* Second hidden layer: 32 × 256 = 8,192 values
+* Third hidden layer: 32 × 128 = 4,096 values
+* Output layer: 32 × 10 = 320 values
+
+This gives us a total of 16,704 values that must be maintained in memory for each batch during training. The memory requirements scale linearly with batch size and can become substantial for larger networks.
+
+Batch processing introduces important trade-offs. Larger batches enable more efficient matrix operations and better hardware utilization but require more memory. For example, doubling the batch size to 64 would double our memory requirements for activations. This relationship between batch size, memory usage, and computational efficiency often guides the choice of batch size in practice.
+
+The organization of computations also affects performance. Matrix operations can be optimized through careful memory layout and the use of specialized libraries. The choice of activation functions impacts not only the network's learning capabilities but also its computational efficiency, as some functions (like ReLU) are less expensive to compute than others (like tanh or sigmoid).
+
+These considerations form the foundation for understanding the system requirements of neural networks, which we will explore in more detail in later chapters.
+
+### Loss Functions
+
+Neural networks learn by measuring and minimizing their prediction errors. Loss functions provide the mathematical framework for quantifying these errors, serving as the essential feedback mechanism that guides the learning process. Through loss functions, we can convert the abstract goal of "making good predictions" into a concrete optimization problem.
+
+To understand the role of loss functions, let's continue with our MNIST digit recognition example. When the network processes a handwritten digit image, it outputs ten numbers representing its confidence in each possible digit (0-9). The loss function measures how far these predictions deviate from the true answer. For instance, if an image shows a "7", we want high confidence for digit "7" and low confidence for all other digits. The loss function penalizes the network when its prediction differs from this ideal.
+
+Consider a concrete example: if the network sees an image of "7" and outputs confidences:
+
+```
+[0.1, 0.1, 0.1, 0.0, 0.0, 0.0, 0.2, 0.3, 0.1, 0.1]
+```
+
+The highest confidence (0.3) is assigned to digit "7", but this confidence is quite low, indicating uncertainty in the prediction. A good loss function would produce a high loss value here, signaling that the network needs significant improvement. Conversely, if the network outputs:
+
+```
+[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0, 0.1]
+```
+
+The loss function should produce a lower value, as this prediction is much closer to ideal.
+
+#### Basic Concepts
+
+A loss function measures how far the network's predictions are from the correct answers. This difference is expressed as a single number: a lower loss means the predictions are more accurate, while a higher loss indicates the network needs improvement. During training, the loss function guides the network by helping it adjust its weights to make better predictions. For example, in recognizing handwritten digits, the loss will penalize predictions that assign low confidence to the correct digit.
+
+Mathematically, a loss function $L$ takes two inputs: the network's predictions $\hat{y}$ and the true values $y$. For a single training example in our MNIST task:
+
+$$
+L(\hat{y}, y) = \text{measure of discrepancy between prediction and truth}
+$$
+
+When training with batches of data, we typically compute the average loss across all examples in the batch:
+
+$$
+L_{\text{batch}} = \frac{1}{B}\sum_{i=1}^B L(\hat{y}_i, y_i)
+$$
+
+where $B$ is the batch size and $(\hat{y}_i, y_i)$ represents the prediction and truth for the $i$-th example.
+
+The choice of loss function depends on the type of task. For our MNIST classification problem, we need a loss function that can:
+
+1. Handle probability distributions over multiple classes
+2. Provide meaningful gradients for learning
+3. Penalize wrong predictions effectively
+4. Scale well with batch processing
+
+#### Common Classification Losses
+
+For classification tasks like MNIST digit recognition, "cross-entropy" loss has emerged as the standard choice. This loss function is particularly well-suited for comparing predicted probability distributions with true class labels.
+
+For a single digit image, our network outputs a probability distribution over the ten possible digits. We represent the true label as a one-hot vector where all entries are 0 except for a 1 at the correct digit's position. For instance, if the true digit is "7", the label would be:
+
+$$
+y = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
+$$
+
+The cross-entropy loss for this example is:
+
+$$
+L(\hat{y}, y) = -\sum_{j=1}^{10} y_j \log(\hat{y}_j)
+$$
+
+where $\hat{y}_j$ represents the network's predicted probability for digit j. Given our one-hot encoding, this simplifies to:
+
+$$
+L(\hat{y}, y) = -\log(\hat{y}_c)
+$$
+
+where c is the index of the correct class. This means the loss depends only on the predicted probability for the correct digit—the network is penalized based on how confident it is in the right answer.
+
+For example, if our network predicts the following probabilities for an image of "7":
+
+```
+Predicted: [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8, 0.0, 0.1]
+True: [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
+```
+
+The loss would be $-\log(0.8)$, which is approximately 0.223. If the network were more confident and predicted 0.9 for the correct digit, the loss would decrease to approximately 0.105.
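+
+This arithmetic is easy to verify with a few lines of NumPy; the following is a sketch using the example prediction above, not a general-purpose loss implementation:
+
+```python
+import numpy as np
+
+y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0])          # one-hot label for "7"
+y_pred = np.array([0.1, 0, 0, 0, 0, 0, 0, 0.8, 0, 0.1])    # predicted probabilities
+
+loss = -np.sum(y_true * np.log(y_pred + 1e-12))   # small epsilon guards against log(0)
+print(round(loss, 3))   # ~0.223, i.e., -log(0.8)
+```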
+
+#### Loss Computation
+
+The practical computation of loss involves considerations for both numerical stability and batch processing. When working with batches of data, we compute the average loss across all examples in the batch.
+
+For a batch of B examples, the cross-entropy loss becomes:
+
+$$
+L_{\text{batch}} = -\frac{1}{B}\sum_{i=1}^B \sum_{j=1}^{10} y_{ij} \log(\hat{y}_{ij})
+$$
+
+Computing this loss efficiently requires careful consideration of numerical precision. Taking the logarithm of very small probabilities can lead to numerical instability. Consider a case where our network predicts a probability of 0.0001 for the correct class. Computing $\log(0.0001)$ directly might cause underflow or result in imprecise values.
+
+To address this, we typically implement the loss computation with two key modifications:
+
+1. Add a small epsilon to prevent taking log of zero:
+
+$$
+L = -\log(\hat{y} + \epsilon)
+$$
+
+2. Apply the log-sum-exp trick for numerical stability:
+
+$$
+\text{softmax}(z_i) = \frac{\exp(z_i - \max(z))}{\sum_j \exp(z_j - \max(z))}
+$$
+
+For our MNIST example with a batch size of 32, this means:
+
+* Processing 32 sets of 10 probabilities
+* Computing 32 individual loss values
+* Averaging these values to produce the final batch loss
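+
+A sketch of a numerically stable batch loss, working directly from raw scores (logits) with the log-sum-exp trick rather than explicit probabilities, might look like the following; the logits and labels here are random placeholders:
+
+```python
+import numpy as np
+
+def stable_cross_entropy(logits, labels):
+    """Average cross-entropy over a batch, computed from raw scores.
+
+    logits: (B, 10) raw network outputs; labels: (B,) integer class indices.
+    """
+    shifted = logits - logits.max(axis=1, keepdims=True)   # log-sum-exp trick
+    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
+    return -log_probs[np.arange(len(labels)), labels].mean()
+
+rng = np.random.default_rng(0)
+logits = rng.normal(size=(32, 10))       # placeholder for the network's outputs
+labels = rng.integers(0, 10, size=32)    # placeholder for the true digits
+print(stable_cross_entropy(logits, labels))
+```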
+
+#### Training Implications
+
+Understanding how loss functions influence training helps explain key implementation decisions in deep learning systems.
+
+During each training iteration, the loss value serves multiple purposes:
+
+1. Performance Metric: It quantifies current network accuracy
+2. Optimization Target: Its gradients guide weight updates
+3. Convergence Signal: Its trend indicates training progress
+
+For our MNIST classifier, monitoring the loss during training reveals the network's learning trajectory. A typical pattern might show:
+
+* Initial high loss (~2.3, equivalent to random guessing among 10 classes)
+* Rapid decrease in early training iterations
+* Gradual improvement as the network fine-tunes its predictions
+* Eventually stabilizing at a lower loss (~0.1, indicating confident correct predictions)
+
+The loss function's gradients with respect to the network's outputs provide the initial error signal that drives backpropagation. For cross-entropy loss, these gradients have a particularly simple form: the difference between predicted and true probabilities. This mathematical property makes cross-entropy loss especially suitable for classification tasks, as it provides strong gradients even when predictions are very wrong.
+
+The choice of loss function also influences other training decisions:
+
+* Learning rate selection (larger loss gradients might require smaller learning rates)
+* Batch size (loss averaging across batches affects gradient stability)
+* Optimization algorithm behavior
+* Convergence criteria
+
+### Backward Propagation {#sec-backward_pass}
+
+Backward propagation, often called backpropagation, is the algorithmic cornerstone of neural network training. While forward propagation computes predictions, backward propagation determines how to adjust the network's weights to improve these predictions. This process enables neural networks to learn from their mistakes.
+
+To understand backward propagation, let's continue with our MNIST example. When the network predicts a "3" for an image of "7", we need a systematic way to adjust weights throughout the network to make this mistake less likely in the future. Backward propagation provides this by calculating how each weight contributed to the error.
+
+The process begins at the network's output, where we compare the predicted digit probabilities with the true label. This error then flows backward through the network, with each layer's weights receiving an update signal based on their contribution to the final prediction. The computation follows the chain rule of calculus, breaking down the complex relationship between weights and final error into manageable steps.
+
+@vid-gd1 and @vid-gd2 give a good high-level overview of how cost functions help neural networks learn.
+
+::: {layout-nrow=1}
+
+:::{#vid-gd1 .callout-important}
+
+#### Gradient descent - Part 1
+
+{{< video https://youtu.be/IHZwWFHWa-w?si=_MpUFVskdVHYztkz >}}
+
+:::
+
+:::{#vid-gd2 .callout-important}
+
+#### Gradient descent - Part 2
+
+{{< video https://youtu.be/Ilg3gGewQ5U?si=YXVP3tm_ZBY9R-Hg >}}
+
+:::
+
+:::
+
+#### Gradient Flow
+
+The flow of gradients through a neural network follows a path opposite to the forward propagation. Starting from the loss at the output layer, gradients propagate backwards, computing how each layer, and ultimately each weight, influenced the final prediction error.
+
+In our MNIST example, consider what happens when the network misclassifies a "7" as a "3". The loss function generates an initial error signal at the output layer---essentially indicating that the probability for "7" should increase while the probability for "3" should decrease. This error signal then propagates backward through the network layers.
+
+For a network with L layers, the gradient flow can be expressed mathematically. At each layer l, we compute how the layer's output affected the final loss:
+
+$$
+\frac{\partial L}{\partial \mathbf{A}^{(l)}} = \frac{\partial L}{\partial \mathbf{A}^{(l+1)}} \frac{\partial \mathbf{A}^{(l+1)}}{\partial \mathbf{A}^{(l)}}
+$$
+
+This computation cascades backward through the network, with each layer's gradients depending on the gradients computed in the layer above it. The process reveals how each layer's transformation contributed to the final prediction error. For instance, if certain weights in an early layer strongly influenced a misclassification, they will receive larger gradient values, indicating a need for more substantial adjustment.
+
+However, this process faces important challenges in deep networks. As gradients flow backward through many layers, they can either vanish or explode. When gradients are repeatedly multiplied through many layers, they can become exponentially small, particularly with sigmoid or tanh activation functions. This causes early layers to learn very slowly or not at all, as they receive negligible (vanishing) updates. Conversely, if gradient values are consistently greater than 1, they can grow exponentially, leading to unstable training and destructive weight updates.
+
+#### Computing Gradients
+
+The actual computation of gradients involves calculating several partial derivatives at each layer. For each layer, we need to determine how changes in weights, biases, and activations affect the final loss. These computations follow directly from the chain rule of calculus but must be implemented efficiently for practical neural network training.
+
+At each layer l, we compute three main gradient components:
+
+1. Weight Gradients:
+
+$$
+\frac{\partial L}{\partial \mathbf{W}^{(l)}} = \frac{\partial L}{\partial \mathbf{Z}^{(l)}} {\mathbf{A}^{(l-1)}}^T
+$$
+
+2. Bias Gradients:
+
+$$
+\frac{\partial L}{\partial \mathbf{b}^{(l)}} = \frac{\partial L}{\partial \mathbf{Z}^{(l)}}
+$$
+
+3. Input Gradients (for propagating to previous layer):
+
+$$
+\frac{\partial L}{\partial \mathbf{A}^{(l-1)}} = {\mathbf{W}^{(l)}}^T \frac{\partial L}{\partial \mathbf{Z}^{(l)}}
+$$
+
+In our MNIST example, consider the final layer where the network outputs digit probabilities. If the network predicted [0.1, 0.2, 0.5, ..., 0.05] for an image of "7", the gradient computation would:
+
+1. Start with the error in these probabilities
+2. Compute how weight adjustments would affect this error
+3. Propagate these gradients backward to help adjust earlier layer weights
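+
+These three gradient components translate almost directly into code. The sketch below (NumPy, using the column-vector convention with examples as columns, and purely illustrative shapes) computes the weight, bias, and input gradients for a single dense layer given the upstream error signal $\partial L / \partial \mathbf{Z}^{(l)}$:
+
+```python
+import numpy as np
+
+def layer_gradients(dL_dZ, A_prev, W):
+    """Backward pass for one dense layer.
+
+    dL_dZ:  (n_out, B) gradient of the loss w.r.t. this layer's pre-activation Z
+    A_prev: (n_in, B)  activations saved from the forward pass
+    W:      (n_out, n_in) this layer's weight matrix
+    """
+    dL_dW = dL_dZ @ A_prev.T        # weight gradients
+    dL_db = dL_dZ.sum(axis=1)       # bias gradients, summed over the batch
+    dL_dA_prev = W.T @ dL_dZ        # error signal propagated to the previous layer
+    return dL_dW, dL_db, dL_dA_prev
+
+# Illustrative shapes: 5 inputs, 3 neurons, batch of 2
+rng = np.random.default_rng(0)
+grads = layer_gradients(rng.normal(size=(3, 2)), rng.normal(size=(5, 2)), rng.normal(size=(3, 5)))
+print([g.shape for g in grads])   # [(3, 5), (3,), (5, 2)]
+```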
+
+#### Implementation Aspects
+
+The practical implementation of backward propagation requires careful consideration of computational resources and memory management. These implementation details significantly impact training efficiency and scalability.
+
+Memory requirements during backward propagation stem from two main sources. First, we need to store the intermediate activations from the forward pass, as these are required for computing gradients. For our MNIST network with a batch size of 32, each layer's activations must be maintained:
+
+* Input layer: 32 × 784 values
+* Hidden layers: 32 × h values (where h is the layer width)
+* Output layer: 32 × 10 values
+
+Second, we need storage for the gradients themselves. For each layer, we must maintain gradients of similar dimensions to the weights and biases. Taking our previous example of a network with hidden layers of size 128, 256, and 128, this means storing:
+
+* First layer gradients: 784 × 128 values
+* Second layer gradients: 128 × 256 values
+* Third layer gradients: 256 × 128 values
+* Output layer gradients: 128 × 10 values
+
+The computational pattern of backward propagation follows a specific sequence:
+
+1. Compute gradients at current layer
+2. Update stored gradients
+3. Propagate error signal to previous layer
+4. Repeat until input layer is reached
+
+For batch processing, these computations are performed simultaneously across all examples in the batch, enabling efficient use of matrix operations and parallel processing capabilities.
+
+### Optimization Process
+
+#### Gradient Descent Basics
+
+The optimization process adjusts the network's weights to improve its predictions. Using a method called gradient descent, the network calculates how much each weight contributes to the error and updates it to reduce the loss. This process is repeated over many iterations, gradually refining the network's ability to make accurate predictions.
+
+The fundamental update rule for gradient descent is:
+
+$$
+\theta_{\text{new}} = \theta_{\text{old}} - \alpha \nabla_{\theta}L
+$$
+
+where $\theta$ represents any network parameter (weights or biases), $\alpha$ is the learning rate, and $\nabla_{\theta}L$ is the gradient of the loss with respect to that parameter.
+
+For our MNIST example, this means adjusting weights to improve digit classification accuracy. If the network frequently confuses "7"s with "1"s, gradient descent will modify the weights to better distinguish between these digits. The learning rate $\alpha$ controls how large these adjustments are—too large, and the network might overshoot optimal values; too small, and training will progress very slowly.
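+
+In code, this update rule is a one-liner applied to every parameter. The sketch below assumes the parameters and their gradients are stored as parallel lists of NumPy arrays, which is an illustrative layout rather than a prescribed one:
+
+```python
+import numpy as np
+
+def sgd_step(params, grads, lr=0.01):
+    """One gradient-descent update: theta <- theta - lr * dL/dtheta."""
+    return [p - lr * g for p, g in zip(params, grads)]
+
+# Illustrative usage with a single weight matrix and bias vector
+W, b = np.ones((784, 128)), np.zeros(128)
+dW, db = np.full((784, 128), 0.5), np.full(128, 0.5)
+W, b = sgd_step([W, b], [dW, db])
+```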
+
+@vid-bp demonstrates how the backpropagation math works in neural networks for those inclined towards a more theoretical foundation.
+
:::{#vid-bp .callout-important}
# Backpropagation
-{{< video https://www.youtube.com/watch?v=Ilg3gGewQ5U&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=3 >}}
+{{< video https://youtu.be/tIeHLnjs5U8?si=Uckr8YPwwAZ_UI6t >}}
:::
-### Model Architectures
+#### Batch Processing
-Deep learning architectures refer to the various structured approaches that dictate how neurons and layers are organized and interact in neural networks. These architectures have evolved to tackle different problems and data types effectively. This section overviews some well-known deep learning architectures and their characteristics.
+Neural networks typically process multiple examples simultaneously during training, an approach known as mini-batch gradient descent. Rather than updating weights after each individual image, we compute the average gradient over a batch of examples before performing the update.
-#### Multilayer Perceptrons (MLPs)
+For a batch of size B, the loss gradient becomes:
-MLPs are basic deep learning architectures comprising three layers: an input layer, one or more hidden layers, and an output layer. These layers are fully connected, meaning each neuron in a layer is linked to every neuron in the preceding and following layers. MLPs can model intricate functions and are used in various tasks, such as regression, classification, and pattern recognition. Their capacity to learn non-linear relationships through backpropagation makes them a versatile instrument in the deep learning toolkit.
+$$
+\nabla_{\theta}L_{\text{batch}} = \frac{1}{B}\sum_{i=1}^B \nabla_{\theta}L_i
+$$
-In embedded AI systems, MLPs can function as compact models for simpler tasks like sensor data analysis or basic pattern recognition, where computational resources are limited. Their ability to learn non-linear relationships with relatively less complexity makes them a suitable choice for embedded systems.
+In our MNIST training, with a typical batch size of 32, this means:
-:::{.callout-caution #exr-mlp collapse="false"}
+1. Process 32 images through forward propagation
+2. Compute loss for all 32 predictions
+3. Average the gradients across all 32 examples
+4. Update weights using this averaged gradient
-##### Multilayer Perceptrons (MLPs)
+#### Training Loop
-We've just scratched the surface of neural networks. Now, you'll get to try and apply these concepts in practical examples. In the provided Colab notebooks, you'll explore:
+The complete training process combines forward propagation, backward propagation, and weight updates into a systematic training loop. This loop repeats until the network achieves satisfactory performance or reaches a predetermined number of iterations.
-**Predicting house prices:** Learn how neural networks can analyze housing data to estimate property values.
-[![](https://colab.research.google.com/assets/colab-badge.png)](https://colab.research.google.com/github/Mjrovai/UNIFEI-IESTI01-TinyML-2022.1/blob/main/00_Curse_Folder/1_Fundamentals/Class_07/TF_Boston_Housing_Regression.ipynb)
+A single pass through the entire training dataset is called an epoch. For MNIST, with 60,000 training images and a batch size of 32, each epoch consists of approximately 1,875 batch iterations. The training loop structure is:
-**Image Classification:** Discover how to build a network to understand the famous MNIST handwritten digit dataset.
-[![](https://colab.research.google.com/assets/colab-badge.png)](https://colab.research.google.com/github/Mjrovai/UNIFEI-IESTI01-TinyML-2022.1/blob/main/00_Curse_Folder/1_Fundamentals/Class_09/TF_MNIST_Classification_v2.ipynb)
+1. For each epoch:
+ * Shuffle training data to prevent learning order-dependent patterns
+ * For each batch:
+ * Perform forward propagation
+ * Compute loss
+ * Execute backward propagation
+ * Update weights using gradient descent
+ * Evaluate network performance
-**Real-world medical diagnosis:** Use deep learning to tackle the important task of breast cancer classification.
-[![](https://colab.research.google.com/assets/colab-badge.png)](https://colab.research.google.com/github/Mjrovai/UNIFEI-IESTI01-TinyML-2022.1/blob/main/00_Curse_Folder/1_Fundamentals/Class_13/docs/WDBC_Project/Breast_Cancer_Classification.ipynb)
+During training, we monitor several key metrics:
-:::
+* Training loss: average loss over recent batches
+* Validation accuracy: performance on held-out test data
+* Learning progress: how quickly the network improves
-#### Convolutional Neural Networks (CNNs)
+For our digit recognition task, we might observe the network's accuracy improve from 10% (random guessing) to over 95% through multiple epochs of training.
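+
+The skeleton below sketches this loop end to end. To keep it self-contained it uses synthetic random "images" and a single softmax layer rather than the full multi-layer MNIST network, so the printed loss values are not meaningful; only the structure (shuffle, batch, forward pass, loss, backward pass, update) is the point.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+# Synthetic stand-in data: random 784-dimensional "images" with random labels
+X_train = rng.normal(size=(256, 784))
+y_train = rng.integers(0, 10, size=256)
+
+# A single softmax layer keeps the example short; a real MNIST model would stack more layers
+W = rng.normal(scale=0.01, size=(784, 10))
+b = np.zeros(10)
+lr, batch_size, epochs = 0.01, 32, 3
+
+def softmax(z):
+    z = z - z.max(axis=1, keepdims=True)
+    e = np.exp(z)
+    return e / e.sum(axis=1, keepdims=True)
+
+for epoch in range(epochs):
+    order = rng.permutation(len(X_train))                 # shuffle each epoch
+    for start in range(0, len(X_train), batch_size):
+        idx = order[start:start + batch_size]
+        X, y = X_train[idx], y_train[idx]
+
+        # Forward pass and batch loss
+        probs = softmax(X @ W + b)
+        loss = -np.log(probs[np.arange(len(y)), y] + 1e-12).mean()
+
+        # Backward pass: gradient of cross-entropy w.r.t. the logits is (probs - one_hot) / B
+        d_logits = probs.copy()
+        d_logits[np.arange(len(y)), y] -= 1
+        d_logits /= len(y)
+        dW, db = X.T @ d_logits, d_logits.sum(axis=0)
+
+        # Gradient-descent update
+        W -= lr * dW
+        b -= lr * db
+    print(f"epoch {epoch}: last batch loss {loss:.3f}")
+```
+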
-CNNs are mainly used in image and video recognition tasks. This architecture consists of two main parts: the convolutional base and the fully connected layers. In the convolutional base, convolutional layers filter input data to identify features like edges, corners, and textures. Following each convolutional layer, a pooling layer can be applied to reduce the spatial dimensions of the data, thereby decreasing computational load and concentrating the extracted features. Unlike MLPs, which treat input features as flat, independent entities, CNNs maintain the spatial relationships between pixels, making them particularly effective for image and video data. The extracted features from the convolutional base are then passed into the fully connected layers, similar to those used in MLPs, which perform classification based on the features extracted by the convolution layers. CNNs have proven highly effective in image recognition, object detection, and other computer vision applications.
+#### Practical Considerations
-@vid-nn explains how neural networks work using handwritten digit recognition as an example application. It also touches on the math underlying neural nets.
+The successful implementation of neural network training requires attention to several key practical aspects that significantly impact learning effectiveness. These considerations bridge the gap between theoretical understanding and practical implementation.
-:::{#vid-nn .callout-important}
+Learning rate selection is perhaps the most critical parameter affecting training. For our MNIST network, the choice of learning rate dramatically influences the training dynamics. A large learning rate of 0.1 might cause unstable training where the loss oscillates or explodes as weight updates overshoot optimal values. Conversely, a very small learning rate of 0.0001 might result in extremely slow convergence, requiring many more epochs to achieve good performance. A moderate learning rate of 0.01 often provides a good balance between training speed and stability, allowing the network to make steady progress while maintaining stable learning.
-# MLP & CNN Networks
+Convergence monitoring provides crucial feedback during the training process. As training progresses, we typically observe the loss value stabilizing around a particular value, indicating the network is approaching a local optimum. The validation accuracy often plateaus as well, suggesting the network has extracted most of the learnable patterns from the data. The gap between training and validation performance offers insights into whether the network is overfitting or generalizing well to new examples.
-{{< video https://www.youtube.com/embed/aircAruvnKk?si=ZRj8jf4yx7ZMe8EK >}}
+Resource requirements become increasingly important as we scale neural network training. The memory footprint must accommodate both model parameters and the intermediate computations needed for backpropagation. Computation scales linearly with batch size, affecting training speed and hardware utilization. Modern training often leverages GPU acceleration, making efficient use of parallel computing capabilities crucial for practical implementation.
-:::
+Training neural networks also presents several fundamental challenges. Overfitting occurs when the network becomes too specialized to the training data, performing well on seen examples but poorly on new ones. Gradient instability can manifest as either vanishing or exploding gradients, making learning difficult. The interplay between batch size, available memory, and computational resources often requires careful balancing to achieve efficient training while working within hardware constraints.
-CNNs are crucial for image and video recognition tasks, where real-time processing is often needed. They can be optimized for embedded systems using techniques like quantization and pruning to minimize memory usage and computational demands, enabling efficient object detection and facial recognition functionalities in devices with limited computational resources.
+## Prediction Phase
-:::{.callout-caution #exr-cnn collapse="false"}
+Neural networks serve two distinct purposes: learning from data during training and making predictions during inference. While we've explored how networks learn through forward propagation, backward propagation, and weight updates, the prediction phase operates differently. During inference, networks use their learned parameters to transform inputs into outputs without the need for learning mechanisms. This simpler computational process still requires careful consideration of how data flows through the network and how system resources are utilized. Understanding the prediction phase is crucial as it represents how neural networks are actually deployed to solve real-world problems, from classifying images to generating text predictions.
-### Convolutional Neural Networks (CNNs)
+### Inference Fundamentals
-We discussed that CNNs excel at identifying image features, making them ideal for tasks like object classification. Now, you'll get to put this knowledge into action! This Colab notebook focuses on building a CNN to classify images from the CIFAR-10 dataset, which includes objects like airplanes, cars, and animals. You'll learn about the key differences between CIFAR-10 and the MNIST dataset we explored earlier and how these differences influence model choice. By the end of this notebook, you'll have a grasp of CNNs for image recognition.
+#### Training vs Inference
-[![](https://colab.research.google.com/assets/colab-badge.png)](https://colab.research.google.com/github/Mjrovai/UNIFEI-IESTI01-TinyML-2022.1/blob/main/00_Curse_Folder/1_Fundamentals/Class_11/CNN_Cifar_10.ipynb)
+The computation flow fundamentally changes when moving from training to inference. While training requires both forward and backward passes through the network to compute gradients and update weights, inference involves only the forward pass computation. This simpler flow means that each layer needs to perform only one set of operations---transforming inputs to outputs using the learned weights---rather than tracking intermediate values for gradient computation.
-:::
+Parameter freezing is another major distinction between training and inference phases. During training, weights and biases continuously update to minimize the loss function. In inference, these parameters remain fixed, acting as static transformations learned from the training data. This freezing of parameters not only simplifies computation but also enables optimizations impossible during training, such as weight quantization or pruning.
-#### Recurrent Neural Networks (RNNs)
+The structural difference between training loops and inference passes significantly impacts system design. Training operates in an iterative loop, processing multiple batches of data repeatedly across many epochs to refine the network's parameters. Inference, in contrast, typically processes each input just once, generating predictions in a single forward pass. This fundamental shift from iterative refinement to single-pass prediction influences how we architect systems for deployment.
-RNNs are suitable for sequential data analysis, like time series forecasting and natural language processing. In this architecture, connections between nodes form a directed graph along a temporal sequence, allowing information to be carried across sequences through hidden state vectors. Variants of RNNs include Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), designed to capture longer dependencies in sequence data.
+Memory and computation requirements differ substantially between training and inference. Training demands considerable memory to store intermediate activations for backpropagation, gradients for weight updates, and optimization states. Inference eliminates these memory-intensive requirements, needing only enough memory to store the model parameters and compute a single forward pass. This reduction in memory footprint, coupled with simpler computation patterns, enables inference to run efficiently on a broader range of devices, from powerful servers to resource-constrained edge devices.
-These networks can be used in voice recognition systems, predictive maintenance, or IoT devices where sequential data patterns are common. Optimizations specific to embedded platforms can assist in managing their typically high computational and memory requirements.
+In general, the training phase requires more computational resources and memory for learning, while inference is streamlined for efficient prediction. @tbl-train-vs-inference summarizes the key differences between training and inference.
-#### Generative Adversarial Networks (GANs)
++-------------------------+------------------------------+--------------------------------+
+| Aspect | Training | Inference |
++:========================+:=============================+:===============================+
+| Computation Flow | Forward and backward passes, | Forward pass only, |
+| | gradient computation | direct input to output |
++-------------------------+------------------------------+--------------------------------+
+| Parameters | Continuously updated | Fixed/frozen weights |
+| | weights and biases | and biases |
++-------------------------+------------------------------+--------------------------------+
+| Processing Pattern | Iterative loops over | Single pass through |
+| | multiple epochs | the network |
++-------------------------+------------------------------+--------------------------------+
+| Memory Requirements | High - stores activations, | Lower - stores only model |
+| | gradients, optimizer state | parameters and current input |
++-------------------------+------------------------------+--------------------------------+
+| Computational Needs | Heavy - gradient updates, | Lighter - matrix |
+| | backpropagation | multiplication only |
++-------------------------+------------------------------+--------------------------------+
+| Hardware Requirements | GPUs/specialized hardware | Can run on simpler devices, |
+| | for efficient training | including mobile/edge |
++-------------------------+------------------------------+--------------------------------+
-GANs consist of two networks, a generator and a discriminator, trained simultaneously through adversarial training [@goodfellow2020generative]. The generator produces data that tries to mimic the real data distribution, while the discriminator distinguishes between real and generated data. GANs are widely used in image generation, style transfer, and data augmentation.
+: Key differences between training and inference phases in neural networks. {#tbl-train-vs-inference}
-In embedded settings, GANs could be used for on-device data augmentation to improve the training of models directly on the embedded device, enabling continual learning and adaptation to new data without the need for cloud computing resources.
+This stark contrast between training and inference phases highlights why system architectures often differ significantly between development and deployment environments. While training requires substantial computational resources and specialized hardware, inference can be optimized for efficiency and deployed across a broader range of devices.
-#### Autoencoders
+#### Basic Pipeline
-Autoencoders are neural networks for data compression and noise reduction [@bank2023autoencoders]. They are structured to encode input data into a lower-dimensional representation and then decode it back to its original form. Variants like Variational Autoencoders (VAEs) introduce probabilistic layers that allow for generative properties, finding applications in image generation and anomaly detection.
+The implementation of neural networks in practical applications requires a complete processing pipeline that extends beyond the network itself. This pipeline, which is illustrated in @fig-inference-pipeline, transforms raw inputs into meaningful outputs through a series of distinct stages, each essential for the system's operation. Understanding this complete pipeline provides critical insights into the design and deployment of machine learning systems.
-Using autoencoders can help in efficient data transmission and storage, improving the overall performance of embedded systems with limited computational and memory resources.
+![End-to-end workflow for the inference prediction phase.](images/png/inference_pipeline.png){#fig-inference-pipeline}
-#### Transformer Networks
+The key thing to notice from the figure is that machine learning systems operate as hybrid architectures that combine conventional computing operations with neural network computations. The neural network component, focused on learned transformations through matrix operations, represents just one element within a broader computational framework. This framework encompasses both the preparation of input data and the interpretation of network outputs, processes that rely primarily on traditional computing methods.
-Transformer networks have emerged as a powerful architecture, especially in natural language processing [@vaswani2017attention]. These networks use self-attention mechanisms to weigh the influence of different input words on each output word, enabling parallel computation and capturing intricate patterns in data. Transformer networks have led to state-of-the-art results in tasks like language translation, summarization, and text generation.
+Consider how data flows through the pipeline in @fig-inference-pipeline:
-These networks can be optimized to perform language-related tasks directly on the device. For example, transformers can be used in embedded systems for real-time translation services or voice-assisted interfaces, where latency and computational efficiency are crucial. Techniques such as model distillation can be employed to deploy these networks on embedded devices with limited resources.
+1. Raw inputs arrive in their original form, which might be images, text, sensor readings, or other data types
+2. Pre-processing transforms these inputs into a format suitable for neural network consumption
+3. The neural network performs its learned transformations
+4. Raw outputs emerge from the network, often in numerical form
+5. Post-processing converts these outputs into meaningful, actionable results
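+
+The five stages above can be condensed into a short skeleton that makes the hybrid structure explicit. The function names (`preprocess`, `postprocess`) and the `network` callable are placeholders for illustration, not a specific library's API.
+
+```python
+def predict(raw_input, network, preprocess, postprocess):
+    """Minimal hybrid pipeline: conventional code wraps the neural network."""
+    x = preprocess(raw_input)        # traditional computing: resize, normalize, flatten
+    raw_output = network(x)          # learned transformation: forward pass only
+    return postprocess(raw_output)   # traditional computing: thresholds, formatting
+```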
-These architectures serve specific purposes and excel in different domains, offering a rich toolkit for addressing diverse problems in embedded AI systems. Understanding the nuances of these architectures is crucial in designing effective and efficient deep learning models for various applications.
+This pipeline structure reveals several fundamental characteristics of machine learning systems. The neural network, despite its computational sophistication, functions as a component within a larger system. Performance bottlenecks may arise at any stage of the pipeline, not exclusively within the neural network computation. System optimization must therefore consider the entire pipeline rather than focusing solely on the neural network's operation.
-### Traditional ML vs Deep Learning
+The hybrid nature of this architecture has significant implications for system implementation. While neural network computations may benefit from specialized hardware accelerators, pre- and post-processing operations typically execute on conventional processors. This distribution of computation across heterogeneous hardware resources represents a fundamental consideration in system design.
-Deep learning extends traditional machine learning by utilizing neural networks to discern patterns in data. In contrast, traditional machine learning relies on a set of established algorithms such as decision trees, k-nearest neighbors, and support vector machines, but does not involve neural networks. @fig-ml-dl provides a visual comparison of Machine Learning and Deep Learning, highlighting their key differences in approach and capabilities.
+### Pre-processing
-![Comparing Machine Learning and Deep Learning. Source: [Medium](https://aoyilmaz.medium.com/understanding-the-differences-between-deep-learning-and-machine-learning-eb41d64f1732)](images/png/mlvsdl.png){#fig-ml-dl}
+The pre-processing stage transforms raw inputs into a format suitable for neural network computation. While often overlooked in theoretical discussions, this stage forms a critical bridge between real-world data and neural network operations. Consider our MNIST digit recognition example: before a handwritten digit image can be processed by the neural network we designed earlier, it must undergo several transformations. Raw images of handwritten digits arrive in various formats, sizes, and pixel value ranges. For instance, in @fig-handwritten, we see that the digits are all of different sizes, and even the number 6 is written differently by the same person.
-As shown in the figure, deep learning models can process raw data directly and automatically extract relevant features, while traditional machine learning often requires manual feature engineering. The figure also illustrates how deep learning models can handle more complex tasks and larger datasets compared to traditional machine learning approaches.
+![Images of handwritten digits. Source: O. Augereau](images/png/handwritten_digits.png){#fig-handwritten}
-To further highlight the differences, @tbl-mlvsdl provides a more detailed comparison of the contrasting characteristics between traditional ML and deep learning. This table complements the visual representation in @fig-ml-dl by offering specific points of comparison across various aspects of these two approaches.
+The pre-processing stage standardizes these inputs through conventional computing operations:
-+-------------------------------+-----------------------------------------------------------+--------------------------------------------------------------+
-| Aspect | Traditional ML | Deep Learning |
-+:==============================+:==========================================================+:=============================================================+
-| Data Requirements | Low to Moderate (efficient with smaller datasets) | High (requires large datasets for nuanced learning) |
-+-------------------------------+-----------------------------------------------------------+--------------------------------------------------------------+
-| Model Complexity | Moderate (suitable for well-defined problems) | High (detects intricate patterns, suited for complex tasks) |
-+-------------------------------+-----------------------------------------------------------+--------------------------------------------------------------+
-| Computational Resources | Low to Moderate (cost-effective, less resource-intensive) | High (demands substantial computational power and resources) |
-+-------------------------------+-----------------------------------------------------------+--------------------------------------------------------------+
-| Deployment Speed | Fast (quicker training and deployment cycles) | Slow (prolonged training times, esp. with larger datasets) |
-+-------------------------------+-----------------------------------------------------------+--------------------------------------------------------------+
-| Interpretability | High (clear insights into decision pathways) | Low (complex layered structures, "black box" nature) |
-+-------------------------------+-----------------------------------------------------------+--------------------------------------------------------------+
-| Maintenance | Easier (simple to update and maintain) | Complex (requires more efforts in maintenance and updates) |
-+-------------------------------+-----------------------------------------------------------+--------------------------------------------------------------+
+* Image scaling to the required 28×28 pixel dimensions, since camera images are usually larger.
+* Pixel value normalization from [0,255] to [0,1], along with grayscale conversion, since most cameras produce color images.
+* Flattening the 2D image array into a 784-dimensional vector, preparing it for the neural network's input layer.
+* Basic validation to ensure data integrity before the image reaches the network.
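+
+A minimal sketch of these steps for a single image, assuming Pillow and NumPy are available, might look as follows; the exact operations and their order vary by deployment.
+
+```python
+import numpy as np
+from PIL import Image
+
+def preprocess(path):
+    """Convert a raw image file into the 784-value vector the network expects."""
+    img = Image.open(path).convert("L")         # grayscale, since cameras give color
+    img = img.resize((28, 28))                  # scale to the required dimensions
+    pixels = np.asarray(img, dtype=np.float32)  # 28x28 array of values in [0, 255]
+    pixels /= 255.0                             # normalize to [0, 1]
+    assert pixels.shape == (28, 28)             # basic integrity check
+    return pixels.reshape(784)                  # flatten for the input layer
+```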
-: Comparison of traditional machine learning and deep learning. {#tbl-mlvsdl .striped .hover}
+What distinguishes pre-processing from neural network computation is its reliance on traditional computing operations rather than learned transformations. While the neural network learns to recognize digits through training, pre-processing operations remain fixed, deterministic transformations. This distinction has important system implications: pre-processing operates on conventional CPUs rather than specialized neural network hardware, and its performance characteristics follow traditional computing patterns.
-### Choosing Traditional ML vs. DL
+The effectiveness of pre-processing directly impacts system performance. Poor normalization can lead to reduced accuracy, inconsistent scaling can introduce artifacts, and inefficient implementation can create bottlenecks. Understanding these implications helps in designing robust machine learning systems that perform well in real-world conditions.
-#### Data Availability and Volume
+### Inference
-**Amount of Data:** Traditional machine learning algorithms, such as decision trees or Naive Bayes, are often more suitable when data availability is limited. They offer robust predictions even with smaller datasets. This is particularly true in medical diagnostics for disease prediction and customer segmentation in marketing.
+The inference phase represents the operational state of a neural network, where learned parameters are used to transform inputs into predictions. Unlike the training phase we discussed earlier, inference focuses solely on forward computation with fixed parameters.
-**Data Diversity and Quality:** Traditional machine learning algorithms often work well with structured data (the input to the model is a set of features, ideally independent of each other) but may require significant preprocessing effort (i.e., feature engineering). On the other hand, deep learning takes the approach of automatically performing feature engineering as part of the model architecture. This approach enables the construction of end-to-end models capable of directly mapping from unstructured input data (such as text, audio, and images) to the desired output without relying on simplistic heuristics that have limited effectiveness. However, this results in larger models demanding more data and computational resources. In noisy data, the necessity for larger datasets is further emphasized when utilizing Deep Learning.
+#### Network Initialization
-#### Complexity of the Problem
+Before processing any inputs, the neural network must be properly initialized for inference. This initialization phase involves loading the model parameters learned during training into memory. For our MNIST digit recognition network, this means loading specific weight matrices and bias vectors for each layer. Let's examine the exact memory requirements for our architecture:
-**Problem Granularity:** Problems that are simple to moderately complex, which may involve linear or polynomial relationships between variables, often find a better fit with traditional machine learning methods.
+* Input to first hidden layer:
+ * Weight matrix: 784 × 100 = 78,400 parameters
+ * Bias vector: 100 parameters
-**Hierarchical Feature Representation:** Deep learning models are excellent in tasks that require hierarchical feature representation, such as image and speech recognition. However, not all problems require this complexity, and traditional machine learning algorithms may sometimes offer simpler and equally effective solutions.
+* First to second hidden layer:
+ * Weight matrix: 100 × 100 = 10,000 parameters
+ * Bias vector: 100 parameters
-#### Hardware and Computational Resources
+* Second hidden layer to output:
+ * Weight matrix: 100 × 10 = 1,000 parameters
+ * Bias vector: 10 parameters
-**Resource Constraints:** The availability of computational resources often influences the choice between traditional ML and deep learning. The former is generally less resource-intensive and thus preferable in environments with hardware limitations or budget constraints.
+In total, the network requires storage for 89,610 learned parameters (89,400 weights plus 210 biases). Beyond these fixed parameters, memory must also be allocated for intermediate activations during forward computation. For processing a single image, this means allocating space for:
-**Scalability and Speed:** Traditional machine learning algorithms, like support vector machines (SVM), often allow for faster training times and easier scalability, which is particularly beneficial in projects with tight timelines and growing data volumes.
+* First hidden layer activations: 100 values
+* Second hidden layer activations: 100 values
+* Output layer activations: 10 values
-#### Regulatory Compliance
+This memory allocation pattern differs significantly from training, where additional memory was needed for gradients, optimizer states, and backpropagation computations.
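+
+A sketch of what initialization allocates for this 784-100-100-10 network is shown below; NumPy arrays stand in for whatever storage format a real deployment loads its trained weights from, and the zero values are placeholders.
+
+```python
+import numpy as np
+
+# Parameters loaded once at startup (zeros as placeholders for trained values)
+W1, b1 = np.zeros((784, 100), dtype=np.float32), np.zeros(100, dtype=np.float32)
+W2, b2 = np.zeros((100, 100), dtype=np.float32), np.zeros(100, dtype=np.float32)
+W3, b3 = np.zeros((100, 10), dtype=np.float32), np.zeros(10, dtype=np.float32)
+
+params = [W1, b1, W2, b2, W3, b3]
+total = sum(p.size for p in params)   # 89,610 parameters: 89,400 weights + 210 biases
+```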
-Regulatory compliance is crucial in various industries, requiring adherence to guidelines and best practices such as the General Data Protection Regulation (GDPR) in the EU. Traditional ML models, due to their inherent interpretability, often align better with these regulations, especially in sectors like finance and healthcare.
+#### Forward Pass Computation
-#### Interpretability
+During inference, data propagates through the network's layers using the initialized parameters. This forward propagation process, while similar in structure to its training counterpart, operates with different computational constraints and optimizations. The computation follows a deterministic path from input to output, transforming the data at each layer using learned parameters.
-Understanding the decision-making process is easier with traditional machine learning techniques than deep learning models, which function as "black boxes," making it challenging to trace decision pathways.
+For our MNIST digit recognition network, consider the precise computations at each layer. The network processes a pre-processed image represented as a 784-dimensional vector through successive transformations:
-### Making an Informed Choice
+1. First Hidden Layer Computation:
+ * Input transformation: 784 inputs combine with 78,400 weights through matrix multiplication
+ * Linear computation: $\mathbf{z}^{(1)} = \mathbf{x}\mathbf{W}^{(1)} + \mathbf{b}^{(1)}$
+ * Activation: $\mathbf{a}^{(1)} = \text{ReLU}(\mathbf{z}^{(1)})$
+ * Output: 100-dimensional activation vector
-Given the constraints of embedded AI systems, understanding the differences between traditional ML techniques and deep learning becomes essential. Both avenues offer unique advantages, and their distinct characteristics often dictate the choice of one over the other in different scenarios.
+2. Second Hidden Layer Computation:
+ * Input transformation: 100 values combine with 10,000 weights
+ * Linear computation: $\mathbf{z}^{(2)} = \mathbf{a}^{(1)}\mathbf{W}^{(2)} + \mathbf{b}^{(2)}$
+ * Activation: $\mathbf{a}^{(2)} = \text{ReLU}(\mathbf{z}^{(2)})$
+ * Output: 100-dimensional activation vector
-Despite this, deep learning has steadily outperformed traditional machine learning methods in several key areas due to abundant data, computational advancements, and proven effectiveness in complex tasks. Here are some specific reasons why we focus on deep learning:
+3. Output Layer Computation:
+ * Final transformation: 100 values combine with 1,000 weights
+ * Linear computation: $\mathbf{z}^{(3)} = \mathbf{a}^{(2)}\mathbf{W}^{(3)} + \mathbf{b}^{(3)}$
+ * Activation: $\mathbf{a}^{(3)} = \text{softmax}(\mathbf{z}^{(3)})$
+ * Output: 10 probability values
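+
+These three layer computations reduce to a few lines of matrix arithmetic. The sketch below reuses the parameter names from the initialization example and is an illustration rather than an optimized implementation.
+
+```python
+import numpy as np
+
+def relu(z):
+    return np.maximum(0, z)
+
+def softmax(z):
+    e = np.exp(z - np.max(z))        # subtract max for numerical stability
+    return e / e.sum()
+
+def forward(x, W1, b1, W2, b2, W3, b3):
+    """Inference-only forward pass: no activations are kept for backpropagation."""
+    a1 = relu(x @ W1 + b1)           # 784 -> 100
+    a2 = relu(a1 @ W2 + b2)          # 100 -> 100
+    return softmax(a2 @ W3 + b3)     # 100 -> 10 class probabilities
+```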
-1. **Superior Performance in Complex Tasks:** Deep learning models, particularly deep neural networks, excel in tasks where the relationships between data points are incredibly intricate. Tasks like image and speech recognition, language translation, and playing complex games like Go and Chess have seen significant advancements primarily through deep learning algorithms.
+@tbl-forward-pass shows how these computations, while mathematically identical to training-time forward propagation, differ in several important operational respects:
-2. **Efficient Handling of Unstructured Data:** Unlike traditional machine learning methods, deep learning can more effectively process unstructured data. This is crucial in today's data landscape, where the vast majority of data, such as text, images, and videos, is unstructured.
++----------------------+--------------------------------+--------------------------------+
+| Characteristic | Training Forward Pass | Inference Forward Pass |
++:=====================+:===============================+:===============================+
+| Activation Storage | Maintains complete activation | Retains only current layer |
+| | history for backpropagation | activations |
++----------------------+--------------------------------+--------------------------------+
+| Memory Pattern | Preserves intermediate states | Releases memory after layer |
+| | throughout forward pass | computation completes |
++----------------------+--------------------------------+--------------------------------+
+| Computational Flow | Structured for gradient | Optimized for direct output |
+| | computation preparation | generation |
++----------------------+--------------------------------+--------------------------------+
+| Resource Profile | Higher memory requirements | Minimized memory footprint |
+| | for training operations | for efficient execution |
++----------------------+--------------------------------+--------------------------------+
-3. **Leveraging Big Data:** With the availability of big data, deep learning models can learn and improve continually. These models excel at utilizing large datasets to improve their predictive accuracy, a limitation in traditional machine-learning approaches.
+: Operational characteristics of forward pass computation during training versus inference. {#tbl-forward-pass}
-4. **Hardware Advancements and Parallel Computing:** The advent of powerful GPUs and the availability of cloud computing platforms have enabled the rapid training of deep learning models. These advancements have addressed one of deep learning's significant challenges: the need for substantial computational resources.
+This streamlined computation pattern enables efficient inference while maintaining the network's learned capabilities. The reduction in memory requirements and simplified computational flow make inference particularly suitable for deployment in resource-constrained environments, such as Mobile ML and TinyML.
-5. **Dynamic Adaptability and Continuous Learning:** Deep learning models can dynamically adapt to new information or data. They can be trained to generalize their learning to new, unseen data, crucial in rapidly evolving fields like autonomous driving or real-time language translation.
+#### Resource Requirements
-While deep learning has gained significant traction, it's essential to understand that traditional machine learning is still relevant. As we dive deeper into the intricacies of deep learning, we will also highlight situations where traditional machine learning methods may be more appropriate due to their simplicity, efficiency, and interpretability. By focusing on deep learning in this text, we aim to equip readers with the knowledge and tools to tackle modern, complex problems across various domains while also providing insights into the comparative advantages and appropriate application scenarios for deep learning and traditional machine learning techniques.
+Neural networks consume computational resources differently during inference compared to training. During inference, resource utilization focuses primarily on efficient forward pass computation and minimal memory overhead. Let's examine the specific requirements for our MNIST digit recognition network.
-## Conclusion
+Memory requirements during inference can be precisely quantified:
+
+1. Static Memory (Model Parameters):
+ * Layer 1: 78,400 weights + 100 biases
+ * Layer 2: 10,000 weights + 100 biases
+ * Layer 3: 1,000 weights + 10 biases
+ * Total: 89,610 parameters (≈ 358.44 KB at 32-bit floating point precision)
+
+2. Dynamic Memory (Activations):
+ * Layer 1 output: 100 values
+ * Layer 2 output: 100 values
+ * Layer 3 output: 10 values
+ * Total: 210 values (≈ 0.84 KB at 32-bit floating point precision)
+
+Computational requirements follow a fixed pattern for each input:
+
+* First layer: 78,400 multiply-adds
+* Second layer: 10,000 multiply-adds
+* Output layer: 1,000 multiply-adds
+* Total: 89,400 multiply-add operations per inference
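+
+These figures follow directly from the layer shapes, as the back-of-the-envelope sketch below shows (assuming 32-bit parameters and activations):
+
+```python
+layers = [(784, 100), (100, 100), (100, 10)]   # (inputs, outputs) per layer
+
+weights = sum(i * o for i, o in layers)        # 89,400
+biases = sum(o for _, o in layers)             # 210
+activations = sum(o for _, o in layers)        # 210 values per single input
+
+print("static memory:", (weights + biases) * 4 / 1000, "KB")   # ~358.44 KB
+print("dynamic memory:", activations * 4 / 1000, "KB")         # ~0.84 KB
+print("multiply-adds per inference:", weights)                 # 89,400
+```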
+
+This resource profile stands in stark contrast to training requirements, where additional memory for gradients and computational overhead for backpropagation significantly increase resource demands. The predictable, streamlined nature of inference computations enables various optimization opportunities and efficient hardware utilization.
+
+#### Optimization Opportunities
+
+The fixed nature of inference computation presents several opportunities for optimization that are not available during training. Once a neural network's parameters are frozen, the predictable pattern of computation allows for systematic improvements in both memory usage and computational efficiency.
+
+Batch size selection represents a fundamental trade-off in inference optimization. During training, large batches were necessary for stable gradient computation, but inference offers more flexibility. Processing single inputs minimizes latency, making it ideal for real-time applications where immediate responses are crucial. However, batch processing can significantly improve throughput by better utilizing parallel computing capabilities, particularly on GPUs. For our MNIST network, consider the memory implications: processing a single image requires storing 210 activation values, while a batch of 32 images requires 6,720 activation values but can process images up to 32 times faster on parallel hardware.
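+
+The arithmetic behind this tradeoff is simple, as the short sketch below illustrates (assuming 32-bit activations):
+
+```python
+activations_per_image = 100 + 100 + 10           # two hidden layers plus output
+
+for batch_size in (1, 32):
+    values = activations_per_image * batch_size  # 210 vs 6,720 activation values
+    print(batch_size, values, values * 4, "bytes")
+```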
+
+Memory management during inference can be significantly more efficient than during training. Since intermediate values are only needed for forward computation, memory buffers can be carefully managed and reused. The activation values from each layer need only exist until the next layer's computation is complete. This enables in-place operations where possible, reducing the total memory footprint. Furthermore, the fixed nature of inference allows for precise memory alignment and access patterns optimized for the underlying hardware architecture.
+
+Hardware-specific optimizations become particularly important during inference. On CPUs, computations can be organized to maximize cache utilization and take advantage of SIMD (Single Instruction, Multiple Data) capabilities. GPU deployments benefit from optimized matrix multiplication routines and efficient memory transfer patterns. These optimizations extend beyond pure computational efficiency---they can significantly impact power consumption and hardware utilization, critical factors in real-world deployments.
+
+The predictable nature of inference also enables more aggressive optimizations like reduced numerical precision. While training typically requires 32-bit floating-point precision to maintain stable gradient computation, inference can often operate with 16-bit or even 8-bit precision while maintaining acceptable accuracy. For our MNIST network, this could reduce the memory footprint from 358.44 KB to 179.22 KB or even 89.61 KB, with corresponding improvements in computational efficiency.
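+
+A minimal sketch of post-training weight quantization to 8-bit integers is shown below. Symmetric scaling by the maximum absolute weight is just one of several common schemes, and real deployments typically calibrate the scaling more carefully.
+
+```python
+import numpy as np
+
+def quantize_int8(w):
+    """Map float32 weights to int8 values plus a scale factor."""
+    scale = np.abs(w).max() / 127.0
+    q = np.round(w / scale).astype(np.int8)      # 4x smaller than float32
+    return q, scale
+
+def dequantize(q, scale):
+    return q.astype(np.float32) * scale          # approximate original weights
+
+W1 = np.random.randn(784, 100).astype(np.float32)
+q, s = quantize_int8(W1)
+max_error = np.abs(W1 - dequantize(q, s)).max()  # small approximation error
+```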
+
+These optimization principles, while illustrated through our simple MNIST feedforward network, represent only the foundation of neural network optimization. More sophisticated architectures introduce additional considerations and opportunities. Convolutional Neural Networks (CNNs), for instance, present unique optimization opportunities in handling spatial data and filter operations. Recurrent Neural Networks (RNNs) require careful consideration of sequential computation and state management. Transformer architectures introduce distinct patterns of attention computation and memory access. These architectural variations and their optimizations will be explored in detail in subsequent chapters, particularly when we discuss deep learning architectures, model optimizations, and efficient AI implementations.
+
+### Post-processing
+
+The transformation of neural network outputs into actionable predictions requires a return to traditional computing paradigms. Just as pre-processing bridges real-world data to neural computation, post-processing bridges neural outputs back to conventional computing systems. This completes the hybrid computing pipeline we examined earlier, where neural and traditional computing operations work in concert to solve real-world problems.
+
+The complexity of post-processing extends beyond simple mathematical transformations. Real-world systems must handle uncertainty, validate outputs, and integrate with larger computing systems. In our MNIST example, a digit recognition system might require not just the most likely digit, but also confidence measures to determine when human intervention is needed. This introduces additional computational steps: confidence thresholds, secondary prediction checks, and error handling logic---all of which are implemented in traditional computing frameworks.
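+
+Such post-processing logic might be sketched as follows for a single digit prediction; the 0.9 threshold is an arbitrary illustration rather than a recommended value.
+
+```python
+import numpy as np
+
+def interpret(probabilities, threshold=0.9):
+    """Turn the network's 10 output probabilities into an actionable decision."""
+    digit = int(np.argmax(probabilities))
+    confidence = float(probabilities[digit])
+    if confidence < threshold:
+        return {"digit": None, "action": "route to human review"}
+    return {"digit": digit, "confidence": confidence, "action": "accept"}
+```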
+
+The computational requirements of post-processing differ significantly from neural network inference. While inference benefits from parallel processing and specialized hardware, post-processing typically runs on conventional CPUs and follows sequential logic patterns. This return to traditional computing brings both advantages and constraints. Operations are more flexible and easier to modify than neural computations, but they may become bottlenecks if not carefully implemented. For instance, computing softmax probabilities for a batch of predictions requires different optimization strategies than the matrix multiplications of neural network layers.
+
+System integration considerations often dominate post-processing design. Output formats must match downstream system requirements, error handling must align with broader system protocols, and performance must meet system-level constraints. In a complete mail sorting system, the post-processing stage must not only identify digits but also format these predictions for the sorting machinery, handle uncertainty cases appropriately, and maintain processing speeds that match physical mail flow rates.
+
+This return to traditional computing paradigms completes the hybrid nature of machine learning systems. Just as pre-processing prepares real-world data for neural computation, post-processing adapts neural outputs for real-world use. Understanding this hybrid nature---the interplay between neural and traditional computing---is essential for designing and implementing effective machine learning systems.
-Deep learning has become a potent set of techniques for addressing intricate pattern recognition and prediction challenges. Starting with an overview, we outlined the fundamental concepts and principles governing deep learning, laying the groundwork for more advanced studies.
+## Case study: USPS Postal Service
-Central to deep learning, we explored the basic ideas of neural networks, powerful computational models inspired by the human brain's interconnected neuron structure. This exploration allowed us to appreciate neural networks' capabilities and potential in creating sophisticated algorithms capable of learning and adapting from data.
+### Real-world Problem
-Understanding the role of libraries and frameworks was a key part of our discussion. We offered insights into the tools that can facilitate developing and deploying deep learning models. These resources ease the implementation of neural networks and open avenues for innovation and optimization.
+The United States Postal Service (USPS) processes over 100 million pieces of mail daily, each requiring accurate routing based on handwritten ZIP codes. In the early 1990s, this task was primarily performed by human operators, making it one of the largest manual data entry operations in the world. The automation of this process through neural networks represents one of the earliest and most successful large-scale deployments of artificial intelligence, embodying many of the principles we've explored in this chapter.
-Next, we tackled the challenges one might face when embedding deep learning algorithms within embedded systems, providing a critical perspective on the complexities and considerations of bringing AI to edge devices.
+Consider the complexity of this task: a ZIP code recognition system must process images of handwritten digits captured under varying conditions---different writing styles, pen types, paper colors, and environmental factors. It must make accurate predictions within milliseconds to maintain mail processing speeds. Furthermore, errors in recognition can lead to significant delays and costs from misrouted mail. This real-world constraint meant the system needed not just high accuracy, but also reliable measures of prediction confidence to identify when human intervention was necessary.
-Furthermore, we examined deep learning's limitations. Through discussions, we unraveled the challenges faced in deep learning applications and outlined scenarios where traditional machine learning might outperform deep learning. These sections are crucial for fostering a balanced view of deep learning's capabilities and limitations.
+This challenging environment presented requirements spanning every aspect of neural network implementation we've discussed---from biological inspiration to practical deployment considerations. The success or failure of the system would depend not just on the neural network's accuracy, but on the entire pipeline from image capture through to final sorting decisions.
-In this primer, we have equipped you with the knowledge to make informed choices between deploying traditional machine learning or deep learning techniques, depending on the unique demands and constraints of a specific problem.
+### System Development
-As we conclude this chapter, we hope you are now well-equipped with the basic "language" of deep learning and prepared to go deeper into the subsequent chapters with a solid understanding and critical perspective. The journey ahead is filled with exciting opportunities and challenges in embedding AI within systems.
+The development of the USPS digit recognition system required careful consideration at every stage, from data collection to deployment. This process illustrates how theoretical principles of neural networks translate into practical engineering decisions.
-## Resources {#sec-deep-learning-primer-resource}
+Data collection presented the first major challenge. Unlike controlled laboratory environments, postal facilities needed to process mail pieces with tremendous variety. The training dataset had to capture this diversity. Digits written by people of different ages, educational backgrounds, and writing styles formed just part of the challenge. Envelopes came in varying colors and textures, and images were captured under different lighting conditions and orientations. This extensive data collection effort later contributed to the creation of the MNIST database we've used in our examples.
-Here is a curated list of resources to support students and instructors in their learning and teaching journeys. We are continuously working on expanding this collection and will be adding new exercises soon.
+The network architecture design required balancing multiple constraints. While deeper networks might achieve higher accuracy, they would also increase processing time and computational requirements. Processing each 28×28 pixel digit image had to complete within strict time constraints while running reliably on the available hardware. The network had to maintain consistent accuracy across varying conditions, from well-written digits to hurried scrawls.
+
+Training the network introduced additional complexity. The system needed to achieve high accuracy not just on a test dataset, but on the endless variety of real-world handwriting styles. Careful preprocessing normalized input images to account for variations in size and orientation. Data augmentation techniques increased the variety of training samples. The team validated performance across different demographic groups and tested under actual operating conditions to ensure robust performance.
+
+The engineering team faced a critical decision regarding confidence thresholds. Setting these thresholds too high would route too many pieces to human operators, defeating the purpose of automation. Setting them too low would risk delivery errors. The solution emerged from analyzing the confidence distributions of correct versus incorrect predictions. This analysis established thresholds that optimized the tradeoff between automation rate and error rate, ensuring efficient operation while maintaining acceptable accuracy.
+
+### Complete Pipeline
+
+Following a single piece of mail through the USPS recognition system illustrates how the concepts we've discussed integrate into a complete solution. The journey from physical mail piece to sorted letter demonstrates the interplay between traditional computing, neural network inference, and physical machinery.
+
+The process begins when an envelope reaches the imaging station. High-speed cameras capture the ZIP code region at rates on the order of ten mail pieces per second. This image acquisition process must adapt to varying envelope colors, handwriting styles, and environmental conditions. The system must maintain consistent image quality despite this speed of operation---controlling motion blur and maintaining proper illumination present significant engineering challenges.
+
+Pre-processing transforms these raw camera images into a format suitable for neural network analysis. The system must locate the ZIP code region, segment individual digits, and normalize each digit image. This stage employs traditional computer vision techniques: image thresholding adapts to envelope background color, connected component analysis identifies individual digits, and size normalization produces standard 28×28 pixel images. Speed remains critical---these operations must complete within milliseconds to maintain throughput.
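+
+These segmentation steps might look roughly like the sketch below, written with OpenCV. It illustrates the general technique (Otsu thresholding plus connected-component analysis), not the actual USPS implementation, and it assumes `zip_region` is a grayscale 8-bit image.
+
+```python
+import cv2
+
+def segment_digits(zip_region):
+    """Binarize the ZIP code region and extract normalized 28x28 digit images."""
+    # Otsu thresholding adapts the binarization to the envelope background
+    _, binary = cv2.threshold(zip_region, 0, 255,
+                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
+    # Connected components separate the individual digits
+    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
+    digits = []
+    for i in sorted(range(1, n), key=lambda i: stats[i, cv2.CC_STAT_LEFT]):
+        x, y, w, h, _ = stats[i]                  # bounding box of one digit
+        digit = binary[y:y + h, x:x + w]
+        digits.append(cv2.resize(digit, (28, 28)) / 255.0)
+    return digits                                 # left-to-right digit images
+```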
+
+The neural network then processes each normalized digit image. The trained network, with its 89,610 parameters (as we detailed earlier), performs forward propagation to generate predictions. Each digit passes through two hidden layers of 100 neurons each, ultimately producing ten output values representing digit probabilities. This inference process, while computationally intensive, benefits from the optimizations we discussed in the previous section.
+
+Post-processing converts these neural network outputs into sorting decisions. The system applies confidence thresholds to each digit prediction. A complete ZIP code requires high confidence in all five digits---a single uncertain digit flags the entire piece for human review. When confidence meets thresholds, the system transmits sorting instructions to mechanical systems that physically direct the mail piece to its appropriate bin.
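+
+The whole-ZIP decision rule might be sketched as below, with each digit's output probabilities checked against a confidence threshold; the 0.95 value is illustrative, not the operational setting.
+
+```python
+def route_zip(digit_probabilities, threshold=0.95):
+    """Accept a ZIP code only if all five digit predictions are confident."""
+    if all(p.max() >= threshold for p in digit_probabilities):
+        return "".join(str(int(p.argmax())) for p in digit_probabilities)
+    return None   # flag the mail piece for human review
+```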
+
+The entire pipeline operates under strict timing constraints. From image capture to sorting decision, processing must complete before the mail piece reaches its sorting point. The system maintains multiple pieces in various pipeline stages simultaneously, requiring careful synchronization between computing and mechanical systems. This real-time operation illustrates why the optimizations we discussed in inference and post-processing become crucial in practical applications.
+
+### Results and Impact
+
+The implementation of neural network-based ZIP code recognition transformed USPS mail processing operations. By 2000, several facilities across the country utilized this technology, processing millions of mail pieces daily. This real-world deployment demonstrated both the potential and limitations of neural network systems in mission-critical applications.
+
+Performance metrics revealed interesting patterns that validate many of the principles discussed earlier in this chapter. The system achieved its highest accuracy on clearly written digits, similar to those in the training data. However, performance varied significantly with real-world factors. Lighting conditions affected pre-processing effectiveness. Unusual writing styles occasionally confused the neural network. Environmental vibrations could also impact image quality. These challenges led to continuous refinements in both the physical system and the neural network pipeline.
+
+The economic impact proved substantial. Prior to automation, manual sorting required operators to read and key in ZIP codes at an average rate of one piece per second. The neural network system processed pieces at ten times this rate while reducing labor costs and error rates. However, the system didn't eliminate human operators entirely---their role shifted to handling uncertain cases and maintaining system performance. This hybrid approach, combining artificial and human intelligence, became a model for other automation projects.
+
+The system also revealed important lessons about deploying neural networks in production environments. Training data quality proved crucial---the network performed best on digit styles well-represented in its training set. Regular retraining helped adapt to evolving handwriting styles. Maintenance required both hardware specialists and machine learning experts, introducing new operational considerations. These insights influenced subsequent deployments of neural networks in other industrial applications.
+
+Perhaps most importantly, this implementation demonstrated how theoretical principles translate into practical constraints. The biological inspiration of neural networks provided the foundation for digit recognition, but successful deployment required careful consideration of system-level factors: processing speed, error handling, maintenance requirements, and integration with existing infrastructure. These lessons continue to inform modern machine learning deployments, where similar challenges of scale, reliability, and integration persist.
+
+### Takeaway
+
+The USPS ZIP code recognition system is an excellent example of the journey from biological inspiration to practical neural network deployment that we've explored throughout this chapter. It demonstrates how the basic principles of neural computation---from pre-processing through inference to post-processing---come together in solving real-world problems.
+
+The system's development shows why understanding both the theoretical foundations and practical considerations is crucial. While the biological visual system processes handwritten digits effortlessly, translating this capability into an artificial system required careful consideration of network architecture, training procedures, and system integration.
+
+The success of this early large-scale neural network deployment helped establish many practices we now consider standard: the importance of comprehensive training data, the need for confidence metrics, the role of pre- and post-processing, and the critical nature of system-level optimization.
+
+As we move forward to explore more complex architectures and applications in subsequent chapters, this case study reminds us that successful deployment requires mastery of both fundamental principles and practical engineering considerations.
+
+## Conclusion
+
+In this chapter, we explored the foundational concepts of neural networks, bridging the gap between biological inspiration and artificial implementation. We began by examining the remarkable efficiency and adaptability of the human brain, uncovering how its principles influence the design of artificial neurons. From there, we delved into the behavior of a single artificial neuron, breaking down its components and operations. This understanding laid the groundwork for constructing neural networks, where layers of interconnected neurons collaborate to tackle increasingly complex tasks.
+
+The progression from single neurons to network-wide behavior underscored the power of hierarchical learning, where each layer extracts and transforms patterns from raw data into meaningful abstractions. We examined both the learning process and the prediction phase, showing how neural networks first refine their performance through training and then deploy that knowledge through inference. The distinction between these phases revealed important system-level considerations for practical implementations.
+
+Our exploration of the complete processing pipeline---from pre-processing through inference to post-processing---highlighted the hybrid nature of machine learning systems, where traditional computing and neural computation work together. The USPS case study demonstrated how these theoretical principles translate into practical applications, revealing both the power and complexity of deployed neural networks. These real-world considerations, from data collection to system integration, form an essential part of understanding machine learning systems.
+
+In the next chapter, we will expand on these ideas, exploring sophisticated deep learning architectures such as convolutional and recurrent neural networks. These architectures are tailored to process diverse data types, from images and text to time series, enabling breakthroughs across a wide range of applications. By building on the concepts introduced here, we will gain a deeper appreciation for the design, capabilities, and versatility of modern deep learning systems.
:::{.callout-note collapse="false"}
-#### Slides
+### Slides
These slides are a valuable tool for instructors to deliver lectures and for students to review the material at their own pace. We encourage students and instructors to leverage these slides to improve their understanding and facilitate effective knowledge transfer.
@@ -355,11 +1300,13 @@ These slides are a valuable tool for instructors to deliver lectures and for stu
:::{.callout-important collapse="false"}
-#### Videos
+### Videos
* @vid-nn
-* @vid-gd
+* @vid-gd1
+
+* @vid-gd2
* @vid-bp
@@ -367,11 +1314,9 @@ These slides are a valuable tool for instructors to deliver lectures and for stu
:::{.callout-caution collapse="false"}
-#### Exercises
+### Exercises
To reinforce the concepts covered in this chapter, we have curated a set of exercises that challenge students to apply their knowledge and deepen their understanding.
-* @exr-mlp
-
-* @exr-cnn
+Coming soon.
:::
diff --git a/contents/core/dl_primer/images/gif/cnn.gif b/contents/core/dl_primer/images/gif/cnn.gif
new file mode 100644
index 00000000..a319772b
Binary files /dev/null and b/contents/core/dl_primer/images/gif/cnn.gif differ
diff --git a/contents/core/dl_primer/images/jpg/rnn_unrolled.jpg b/contents/core/dl_primer/images/jpg/rnn_unrolled.jpg
new file mode 100644
index 00000000..f565b8a1
Binary files /dev/null and b/contents/core/dl_primer/images/jpg/rnn_unrolled.jpg differ
diff --git a/contents/core/dl_primer/images/png/activities.png b/contents/core/dl_primer/images/png/activities.png
new file mode 100644
index 00000000..f2008129
Binary files /dev/null and b/contents/core/dl_primer/images/png/activities.png differ
diff --git a/contents/core/dl_primer/images/png/attention.png b/contents/core/dl_primer/images/png/attention.png
new file mode 100644
index 00000000..bda606c5
Binary files /dev/null and b/contents/core/dl_primer/images/png/attention.png differ
diff --git a/contents/core/dl_primer/images/png/bio_nn2ai_nn.png b/contents/core/dl_primer/images/png/bio_nn2ai_nn.png
new file mode 100644
index 00000000..bcb990f5
Binary files /dev/null and b/contents/core/dl_primer/images/png/bio_nn2ai_nn.png differ
diff --git a/contents/core/dl_primer/images/png/breakout.png b/contents/core/dl_primer/images/png/breakout.png
new file mode 100644
index 00000000..7968b74f
Binary files /dev/null and b/contents/core/dl_primer/images/png/breakout.png differ
diff --git a/contents/core/dl_primer/images/png/cnn.png b/contents/core/dl_primer/images/png/cnn.png
new file mode 100644
index 00000000..15bf4511
Binary files /dev/null and b/contents/core/dl_primer/images/png/cnn.png differ
diff --git a/contents/core/dl_primer/images/png/cover_nn_primer.png b/contents/core/dl_primer/images/png/cover_nn_primer.png
new file mode 100644
index 00000000..258eabd9
Binary files /dev/null and b/contents/core/dl_primer/images/png/cover_nn_primer.png differ
diff --git a/contents/core/dl_primer/images/png/encoder_decoder.png b/contents/core/dl_primer/images/png/encoder_decoder.png
new file mode 100644
index 00000000..c63b6bd4
Binary files /dev/null and b/contents/core/dl_primer/images/png/encoder_decoder.png differ
diff --git a/contents/core/dl_primer/images/png/handwritten_digits.png b/contents/core/dl_primer/images/png/handwritten_digits.png
new file mode 100644
index 00000000..9770cf50
Binary files /dev/null and b/contents/core/dl_primer/images/png/handwritten_digits.png differ
diff --git a/contents/core/dl_primer/images/png/hog.png b/contents/core/dl_primer/images/png/hog.png
new file mode 100644
index 00000000..eaa724e2
Binary files /dev/null and b/contents/core/dl_primer/images/png/hog.png differ
diff --git a/contents/core/dl_primer/images/png/inference_pipeline.png b/contents/core/dl_primer/images/png/inference_pipeline.png
new file mode 100644
index 00000000..96aa27a0
Binary files /dev/null and b/contents/core/dl_primer/images/png/inference_pipeline.png differ
diff --git a/contents/core/dl_primer/images/png/ml_rules.png b/contents/core/dl_primer/images/png/ml_rules.png
new file mode 100644
index 00000000..1d912de3
Binary files /dev/null and b/contents/core/dl_primer/images/png/ml_rules.png differ
diff --git a/contents/core/dl_primer/images/png/mlp_connection_weights.png b/contents/core/dl_primer/images/png/mlp_connection_weights.png
new file mode 100644
index 00000000..53c66150
Binary files /dev/null and b/contents/core/dl_primer/images/png/mlp_connection_weights.png differ
diff --git a/contents/core/dl_primer/images/png/mlp_mm.png b/contents/core/dl_primer/images/png/mlp_mm.png
new file mode 100644
index 00000000..884b16d2
Binary files /dev/null and b/contents/core/dl_primer/images/png/mlp_mm.png differ
diff --git a/contents/core/dl_primer/images/png/nnlayers.png b/contents/core/dl_primer/images/png/nnlayers.png
new file mode 100644
index 00000000..2fd69ae2
Binary files /dev/null and b/contents/core/dl_primer/images/png/nnlayers.png differ
diff --git a/contents/core/dl_primer/images/png/rnn_unrolled.png b/contents/core/dl_primer/images/png/rnn_unrolled.png
new file mode 100644
index 00000000..f03c307b
Binary files /dev/null and b/contents/core/dl_primer/images/png/rnn_unrolled.png differ
diff --git a/contents/core/dl_primer/images/png/topology_28x28.png b/contents/core/dl_primer/images/png/topology_28x28.png
new file mode 100644
index 00000000..2c878777
Binary files /dev/null and b/contents/core/dl_primer/images/png/topology_28x28.png differ
diff --git a/contents/core/dl_primer/images/png/topology_flatten.png b/contents/core/dl_primer/images/png/topology_flatten.png
new file mode 100644
index 00000000..71efe6a1
Binary files /dev/null and b/contents/core/dl_primer/images/png/topology_flatten.png differ
diff --git a/contents/core/dl_primer/images/png/traditional.png b/contents/core/dl_primer/images/png/traditional.png
new file mode 100644
index 00000000..39a3d73e
Binary files /dev/null and b/contents/core/dl_primer/images/png/traditional.png differ
diff --git a/contents/core/dl_primer/images/png/virtuous-cycle.png b/contents/core/dl_primer/images/png/virtuous-cycle.png
new file mode 100644
index 00000000..18092313
Binary files /dev/null and b/contents/core/dl_primer/images/png/virtuous-cycle.png differ