Primer chapters review #560

Merged 2 commits on Dec 24, 2024
contents/core/dl_primer/dl_primer.qmd: 35 changes (24 additions, 11 deletions)
@@ -284,7 +284,7 @@ We can now better appreciate how the field of deep learning evolved to meet thes

The development of backpropagation algorithms in the 1980s [@rumelhart1986learning], which we will learn about later, represented a theoretical breakthrough and provided a systematic way to train multi-layer networks. However, the computational demands of this algorithm far exceeded available hardware capabilities. Training even modest networks could take weeks, making experimentation and practical applications challenging. This mismatch between algorithmic requirements and hardware capabilities contributed to a period of reduced interest in neural networks.

The term "deep learning" gained prominence in the 2000s, coinciding with significant advances in computational power and data accessibility. The field has since experienced exponential growth, as illustrated in @fig-trends. The graph reveals two remarkable trends: computational capabilities measured in the number of Floating Point Operations per Second (FLOPS) initially followed a 1.4x improvement pattern from 1952 to 2010, then accelerated to a 3.4-month doubling cycle from 2012 to 2022. Perhaps more striking is the emergence of large-scale models between 2015 and 2022 (not explicitly shown or easily seen in the figure), which scaled 2 to 3 orders of magnitude faster than the general trend, following an aggressive 10-month doubling cycle.
The term "deep learning" gained prominence in the 2010s, coinciding with significant advances in computational power and data accessibility. The field has since experienced exponential growth, as illustrated in @fig-trends. The graph reveals two remarkable trends: computational capabilities measured in the number of Floating Point Operations per Second (FLOPS) initially followed a 1.4x improvement pattern from 1952 to 2010, then accelerated to a 3.4-month doubling cycle from 2012 to 2022. Perhaps more striking is the emergence of large-scale models between 2015 and 2022 (not explicitly shown or easily seen in the figure), which scaled 2 to 3 orders of magnitude faster than the general trend, following an aggressive 10-month doubling cycle.

![Growth of deep learning models. Source: Epoch AI](https://epochai.org/assets/images/posts/2022/compute-trends.png){#fig-trends}

@@ -320,7 +320,7 @@ The Perceptron is the basic unit or node that forms the foundation for more comp

In these advanced structures, layers of perceptrons work in concert, with each layer's output serving as the input for the subsequent layer. This hierarchical arrangement creates a deep learning model capable of comprehending and modeling complex, abstract patterns within data. By stacking these simple units, neural networks gain the ability to tackle increasingly sophisticated tasks, from image recognition to natural language processing.

![Perceptron. Conceived in the 1950s, perceptrons paved the way for developing more intricate neural networks and have been a fundamental building block in deep learning. Source: Wikimedia---Chrislb.](images/png/Rosenblattperceptron.png){#fig-perceptron}
![Perceptron. Conceived in the 1950s, perceptrons paved the way for developing more intricate neural networks and have been a fundamental building block in deep learning.](images/png/basic_perceptron.png){#fig-perceptron}

Each input $x_i$ has a corresponding weight $w_{ij}$, and the perceptron simply multiplies each input by its matching weight. This operation is similar to linear regression, where the intermediate output, $z$, is computed as the sum of the products of inputs and their weights:

@@ -350,8 +350,21 @@ $$
\hat{y} = \sigma(z)
$$

@fig-nonlinear shows an example where the data exhibit a nonlinear pattern that cannot be adequately modeled with a linear approach. The activation function enables the network to learn and represent such complex relationships, making it possible to solve sophisticated tasks like image recognition or speech processing.

Thus, the final output of the perceptron, including the activation function, can be expressed as:

$$
\hat{y} = \sigma\left(\sum (x_i \cdot w_{ij}) + b\right)
$$
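To make these equations concrete, here is a minimal sketch of a single perceptron computation in NumPy, assuming a sigmoid activation; the input values, weights, and bias below are illustrative placeholders, not numbers from the chapter.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: maps any real z into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, b):
    """Weighted sum of inputs plus bias, passed through the activation."""
    z = np.dot(w, x) + b      # z = sum_i (x_i * w_i) + b
    return sigmoid(z)         # y_hat = sigma(z)

# Illustrative placeholder values
x = np.array([0.5, -1.2, 3.0])   # inputs x_i
w = np.array([0.4, 0.2, -0.1])   # weights w_i
b = 0.1                          # bias

print(perceptron(x, w, b))       # single scalar output in (0, 1)
```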


#### Layers and Connections

While a single perceptron can model simple decisions, the power of neural networks comes from combining multiple neurons into layers. A layer is a collection of neurons that process information in parallel. Each neuron in a layer operates independently on the same input but with its own set of weights and bias, allowing the layer to learn different features or patterns from the same input data.
@@ -368,7 +381,7 @@ In a typical neural network, we organize these layers hierarchically:

#### Data Flow and Layer Transformations

As data flows through the network, it is transformed at each layer to extract meaningful patterns. Each layer combines the input data using learned weights and biases, then applies an activation function to introduce non-linearity. This process can be written mathematically as:
As data flows through the network, it is transformed at each layer $l$ to extract meaningful patterns. Each layer combines the input data using learned weights and biases, then applies an activation function to introduce non-linearity. This process can be written mathematically as:

$$
\mathbf{z}^{(l)} = \mathbf{W}^{(l)}\mathbf{x}^{(l-1)} + \mathbf{b}^{(l)}
@@ -509,17 +522,17 @@ As networks grow deeper, the path from input to output becomes longer, potential

These connection patterns have significant implications for both the theoretical capabilities and practical implementation of neural networks. Dense connections maximize learning flexibility at the cost of computational efficiency. Sparse connections can reduce computational requirements while potentially improving the network's ability to learn structured patterns. Skip connections help maintain effective information flow in deeper networks.
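As a small illustration of these ideas (a sketch under simple assumptions, not code from the chapter), the snippet below applies a 0/1 mask to a layer's weights to implement sparse connectivity and adds a skip connection that carries the layer's input directly to its output.

```python
import numpy as np

def sparse_skip_layer(x, W, b, mask, activation=np.tanh):
    """One layer with sparse weights (via a 0/1 mask) and a skip connection.

    The mask zeroes out selected weights, implementing sparse connectivity;
    adding x at the end lets information bypass the transformation (a skip).
    Assumes the layer preserves dimensionality so x can be added directly.
    """
    h = activation((W * mask) @ x + b)
    return h + x

# Illustrative 4-neuron layer with roughly half of its connections removed
rng = np.random.default_rng(0)
x = rng.random(4)
W = rng.normal(size=(4, 4))
b = np.zeros(4)
mask = (rng.random((4, 4)) > 0.5).astype(float)

print(sparse_skip_layer(x, W, b, mask))
```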

#### Weight Considerations
#### Parameter Considerations

The arrangement of weights in a neural network fundamentally determines both its learning capacity and computational requirements. While topology defines the network's structure, the initialization and organization of weights within this structure plays a crucial role in how the network learns and performs.
The arrangement of parameters (weights and biases) in a neural network determines both its learning capacity and computational requirements. While topology defines the network's structure, the initialization and organization of parameters play a crucial role in learning and performance.

The number of weights in a network grows with both width and depth. For our MNIST example, consider a network with a 784-dimensional input layer, two hidden layers of 100 neurons each, and a 10-neuron output layer. The first layer requires 78,400 weights (784 × 100), the second layer 10,000 weights (100 × 100), and the output layer 1,000 weights (100 × 10), totaling 89,400 weights. Each of these weights must be stored in memory and updated during learning.
Parameter count grows with network width and depth. For our MNIST example, consider a network with a 784-dimensional input layer, two hidden layers of 100 neurons each, and a 10-neuron output layer. The first layer requires 78,400 weights and 100 biases, the second layer 10,000 weights and 100 biases, and the output layer 1,000 weights and 10 biases, totaling 89,610 parameters. Each must be stored in memory and updated during learning.
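As a quick check on this arithmetic, the short sketch below counts weights and biases for any fully connected layout; the 784-100-100-10 sizes are the ones used in the example above.

```python
def count_parameters(layer_sizes):
    """Count weights (fan_in x fan_out) and biases (fan_out) for each pair of consecutive layers."""
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights, biases

weights, biases = count_parameters([784, 100, 100, 10])
print(weights, biases, weights + biases)   # 89400 weights + 210 biases = 89610 parameters
```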

Weight initialization plays a fundamental role in network behavior. When we create a new neural network, these weights must be set to initial values that enable effective learning. Setting all weights to zero would cause all neurons in a layer to behave identically, preventing the network from learning diverse features. Instead, weights are typically initialized randomly, but the scale of these random values matters significantly. Too large or too small initial weights can lead to poor learning dynamics.
Parameter initialization is fundamental to network behavior. Setting all parameters to zero would cause neurons in a layer to behave identically, preventing diverse feature learning. Instead, weights are typically initialized randomly, while biases often start at small constant values or even zero. The scale of these initial values matters significantly: values that are too large or too small can lead to poor learning dynamics.
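A minimal initialization sketch, assuming NumPy and a fan-in-scaled Gaussian scheme (one common choice; the chapter does not prescribe a specific method here): weights are drawn randomly with a scale tied to the number of inputs feeding each neuron, and biases start at zero.

```python
import numpy as np

rng = np.random.default_rng(42)

def init_layer(fan_in, fan_out):
    """Random weights scaled by 1/sqrt(fan_in); biases start at zero."""
    W = rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))
    b = np.zeros(fan_out)
    return W, b

# First layer of the MNIST example: 784 inputs feeding 100 neurons
W1, b1 = init_layer(784, 100)
print(W1.std())   # roughly 1/sqrt(784), i.e. small but not zero
```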

The distribution of weights across the network affects how information flows through layers. Consider our digit recognition task: if weights in early layers are too small, important details from the input image might not be preserved for later layers to process. Conversely, if weights are too large, the network might amplify noise in the input, making it harder to identify relevant patterns.
The distribution of parameters affects information flow through layers. In digit recognition, if weights are too small, important input details might not propagate to later layers. If too large, the network might amplify noise. Biases help adjust the activation threshold of each neuron, enabling the network to learn optimal decision boundaries.

Different network architectures may impose specific constraints on how weights are organized. Some architectures share weights across different parts of the network to encode specific properties, such as the ability to recognize patterns regardless of their position in an image. Other architectures might restrict certain weights to be zero, effectively implementing the sparse connectivity patterns discussed earlier.
Different architectures may impose specific constraints on parameter organization. Some share weights across network regions to encode position-invariant pattern recognition. Others might restrict certain weights to zero, implementing sparse connectivity patterns.

## Learning Process

@@ -550,15 +563,15 @@ $$

This error measurement drives the adjustment of network parameters through a process called "backpropagation," which we will examine in detail later.

In practice, training operates on batches of examples rather than individual inputs. For the MNIST dataset, each training iteration might process 32, 64, or 128 images simultaneously. This batch processing serves two purposes: it enables efficient use of modern computing hardware through parallel processing, and it provides more stable parameter updates by averaging errors across multiple examples.
In practice, training operates on batches of examples rather than individual inputs. For the MNIST dataset, each training iteration might process, for example, 32, 64, or 128 images simultaneously. This batch processing serves two purposes: it enables efficient use of modern computing hardware through parallel processing, and it provides more stable parameter updates by averaging errors across multiple examples.
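A small sketch of the batching idea, assuming a mean-squared-error style loss over one-hot targets (the exact loss function is defined earlier in the chapter): the error is computed for every example in the batch and averaged into a single value that drives one parameter update.

```python
import numpy as np

def batch_loss(predictions, targets):
    """Average the squared error over every example in the batch."""
    per_example = np.mean((predictions - targets) ** 2, axis=1)  # one error per image
    return float(np.mean(per_example))                           # averaged over the batch

# Illustrative batch of 32 MNIST-style outputs (10 class scores each)
rng = np.random.default_rng(0)
predictions = rng.random((32, 10))
targets = np.zeros((32, 10))
targets[np.arange(32), rng.integers(0, 10, size=32)] = 1.0       # one-hot labels

print(batch_loss(predictions, targets))
```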

The training cycle continues until the network achieves sufficient accuracy or reaches a predetermined number of iterations. Throughout this process, the loss function serves as a guide, with its minimization indicating improved network performance.

### Forward Propagation

Forward propagation, as illustrated in @fig-forward-propagation, is the core computational process in a neural network, where input data flows through the network's layers to generate predictions. Understanding this process is essential as it forms the foundation for both network inference and training. Let's examine how forward propagation works using our MNIST digit recognition example.

![Neural networks---forward and backward propagation. Source: [Linkedin](https://www.linkedin.com/pulse/lecture2-unveiling-theoretical-foundations-ai-machine-underdown-phd-oqsuc/)](images/png/forwardpropagation.png){#fig-forward-propagation}
![Neural networks---forward and backward propagation.](images/png/forward_backward_propagation.png){#fig-forward-propagation}

When an image of a handwritten digit enters our network, it undergoes a series of transformations through the layers. Each transformation combines the weighted inputs with learned patterns to progressively extract relevant features. In our MNIST example, a 28×28 pixel image is processed through multiple layers to ultimately produce probabilities for each possible digit (0-9).
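A compact sketch of this flow for the 784-100-100-10 layout used throughout the example, assuming NumPy, sigmoid hidden activations, and a softmax output layer (these activation choices are illustrative, not prescribed by the chapter):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))       # subtract the max for numerical stability
    return e / e.sum()

def forward(x, layers):
    """Apply z = Wx + b followed by an activation at every layer."""
    for i, (W, b) in enumerate(layers):
        z = W @ x + b
        x = softmax(z) if i == len(layers) - 1 else sigmoid(z)
    return x

# Randomly initialized 784-100-100-10 network (weights are placeholders)
rng = np.random.default_rng(0)
sizes = [784, 100, 100, 10]
layers = [(rng.normal(0, 1 / np.sqrt(m), size=(n, m)), np.zeros(n))
          for m, n in zip(sizes, sizes[1:])]

image = rng.random(784)             # stand-in for a flattened 28x28 digit
probs = forward(image, layers)
print(probs.shape, probs.sum())     # (10,) class probabilities summing to ~1.0
```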
