Commit

Merge pull request #588 from harvard-edge/586-feedback-on-chapter-4-dnn-architectures

Improvement on Ch4 (DNN Architectures): add visualization figures and tool links
profvjreddi authored Jan 5, 2025
2 parents 8f29053 + 4d890ef commit 4075b1c
Showing 8 changed files with 26 additions and 7 deletions.
33 changes: 26 additions & 7 deletions contents/core/dnn_architectures/dnn_architectures.qmd
@@ -193,7 +193,7 @@ Here, $(i,j)$ represents spatial positions, $k$ indexes output channels, $c$ ind

For a concrete example, consider our MNIST digit classification task with 28×28 grayscale images. Each convolutional layer applies a set of filters (say 3×3) that slide across the image, computing local weighted sums. If we use 32 filters, the layer produces a 28×28×32 output, where each spatial position contains 32 different feature measurements of its local neighborhood. This is in stark contrast to our MLP approach where we flattened the entire image into a 784-dimensional vector.
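
To make the arithmetic concrete, here is a minimal NumPy sketch (illustrative only; the random image, filter values, and zero padding are assumptions, not code from the chapter) that slides 32 filters of size 3×3 over a 28×28 image and reproduces the 28×28×32 output shape described above:

```python
import numpy as np

# Illustrative shapes from the running MNIST example: one 28x28 grayscale
# image and 32 filters of size 3x3; zero-padding keeps the 28x28 extent.
image = np.random.rand(28, 28)
filters = np.random.rand(32, 3, 3)

padded = np.pad(image, 1)                # 30x30, so the output stays 28x28
output = np.zeros((28, 28, 32))

for k in range(32):                      # output channel
    for i in range(28):                  # spatial row
        for j in range(28):              # spatial column
            patch = padded[i:i+3, j:j+3]
            output[i, j, k] = np.sum(patch * filters[k])

print(output.shape)                      # (28, 28, 32)
```

Each output position $(i, j, k)$ is a local weighted sum over a 3×3 neighborhood, in contrast to the MLP, where every output depends on all 784 inputs.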

This algorithmic structure directly implements the requirements we identified for spatial pattern processing, creating distinct computational patterns that influence system design.
This algorithmic structure directly implements the requirements we identified for spatial pattern processing, creating distinct computational patterns that influence system design. For a detailed visual exploration of these network structures, the [CNN Explainer](https://poloclub.github.io/cnn-explainer/) project provides an interactive visualization that illuminates how different convolutional networks are constructed.

::: {.content-visible when-format="html"}
![Convolution operation, image data (blue) and 3x3 filter (green). Source: V. Dumoulin, F. Visin, MIT](images/gif/cnn.gif){#fig-cnn}
@@ -307,11 +307,13 @@ $$
\mathbf{h}_t = f(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b}_h)
$$

where $\mathbf{h}_t$ represents the hidden state at time $t$, $\mathbf{x}_t$ is the input at time $t$, $\mathbf{W}_{hh}$ contains the recurrent weights, and $\mathbf{W}_{xh}$ contains the input weights.
where $\mathbf{h}_t$ represents the hidden state at time $t$, $\mathbf{x}_t$ is the input at time $t$, $\mathbf{W}_{hh}$ contains the recurrent weights, and $\mathbf{W}_{xh}$ contains the input weights, as shown in the unfolded network structure in @fig-rnn.

For example, in processing a sequence of words, each word might be represented as a 100-dimensional vector ($\mathbf{x}_t$), and we might maintain a hidden state of 128 dimensions ($\mathbf{h}_t$). At each time step, the network combines the current input with its previous state to update its understanding of the sequence. This creates a form of memory that can capture patterns across time steps.
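
A minimal sketch of this update (assuming $\tanh$ as the nonlinearity $f$; the sequence length, random weights, and zero initial state are illustrative choices rather than anything prescribed by the chapter):

```python
import numpy as np

# Illustrative dimensions from the example above: 100-dimensional inputs
# and a 128-dimensional hidden state.
d_in, d_h = 100, 128
W_hh = 0.01 * np.random.randn(d_h, d_h)   # recurrent weights
W_xh = 0.01 * np.random.randn(d_h, d_in)  # input weights
b_h = np.zeros(d_h)

def rnn_step(h_prev, x_t):
    """One update h_t = f(W_hh h_{t-1} + W_xh x_t + b_h), with f = tanh."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

# Process a 10-step sequence, carrying the hidden state forward in time.
h = np.zeros(d_h)
for x_t in np.random.randn(10, d_in):
    h = rnn_step(h, x_t)

print(h.shape)  # (128,)
```

The same weight matrices are reused at every time step; only the hidden state changes as the sequence is consumed.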

This recurrent structure directly implements our requirements for sequential processing through the introduction of recurrent connections, which maintain internal state and allow the network to carry information forward in time. Instead of processing all inputs independently, RNNs process sequences of data by iteratively updating a hidden state based on the current input and the previous hidden state. This makes RNNs well-suited for tasks such as language modeling, speech recognition, and time-series forecasting.
This recurrent structure directly implements our requirements for sequential processing through the introduction of recurrent connections, which maintain internal state and allow the network to carry information forward in time. Instead of processing all inputs independently, RNNs process sequences of data by iteratively updating a hidden state based on the current input and the previous hidden state, as depicted in @fig-rnn. This makes RNNs well-suited for tasks such as language modeling, speech recognition, and time-series forecasting.

![RNN architecture. Source: A. Amidi, S. Amidi, Stanford](images/png/rnn_unrolled.png){#fig-rnn}

### Computational Mapping

@@ -403,15 +405,30 @@ These scenarios demand specific capabilities from our processing architecture. T

#### Algorithmic Structure

Attention mechanisms form the foundation of dynamic pattern processing by computing weighted connections between elements based on their content [@bahdanau2014neural]. This approach allows for the processing of relationships that aren't fixed by architecture but instead emerge from the data itself. At the core of an attention mechanism lies a fundamental operation that can be expressed mathematically as:
Attention mechanisms form the foundation of dynamic pattern processing by computing weighted connections between elements based on their content [@bahdanau2014neural]. This approach allows for the processing of relationships that aren't fixed by architecture but instead emerge from the data itself. At the core of an attention mechanism is a fundamental operation that can be expressed mathematically as:

$$
\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}
$$

In this equation, $\mathbf{Q}$ (queries), $\mathbf{K}$ (keys), and $\mathbf{V}$ (values) represent learned projections of the input. For a sequence of length $N$ with dimension $d$, this operation creates an $N \times N$ attention matrix, determining how each position should attend to all others.
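
A minimal NumPy sketch of this operation (the sequence length $N = 6$, dimension $d = 64$, and random inputs are arbitrary illustrative choices):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single sequence."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # N x N matrix of pairwise scores
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V                    # weighted combination of values

N, d = 6, 64
Q, K, V = (np.random.randn(N, d) for _ in range(3))
print(attention(Q, K, V).shape)  # (6, 64)
```

The $N \times N$ `weights` matrix grows quadratically with the sequence length $N$.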

The attention operation involves several key steps. First, it computes query, key, and value projections for each position in the sequence. Next, it generates an N×N attention matrix through query-key interactions. Finally, it uses these attention weights to combine value vectors, producing the output. Unlike the fixed weight matrices found in previous architectures, these attention weights are computed dynamically for each input, allowing the model to adapt its processing based on the content at hand.
The attention operation involves several key steps. First, it computes query, key, and value projections for each position in the sequence. Next, it generates an N×N attention matrix through query-key interactions. These steps are illustrated in @fig-attention. Finally, it uses these attention weights to combine value vectors, producing the output.

![The interaction between Query, Key, and Value components. Source: [Transformer Explainer](https://poloclub.github.io/transformer-explainer/).](images/png/attention_viz_mm.png){#fig-attention width=70%}

The key difference is that, unlike the fixed weight matrices found in previous architectures, these attention weights are computed dynamically for each input, as shown in @fig-attention-weightcalc. This allows the model to adapt its processing to the content at hand.

::: {.content-visible when-format="html"}
![Dynamic weight calculation. Source: [Transformer Explainer](https://poloclub.github.io/transformer-explainer/).](images/gif/attention_kqv_calc.gif){#fig-attention-weightcalc}
:::

::: {.content-visible when-format="pdf"}
![Dynamic weight calculation. Source: [Transformer Explainer](https://poloclub.github.io/transformer-explainer/).](images/png/attention_kqv_calc.png){#fig-attention-weightcalc}
:::


<!-- ![(left) Scaled dot-product attention. (right) Multi-head attention consists of several attention layers running in parallel. Source: [Attention Is All You Need](https://arxiv.org/pdf/1706.03762)](images/png/attention.png){#fig-attention} -->

#### Computational Mapping

@@ -494,7 +511,9 @@ $$

Here, X is the input sequence, and $W_Q$, $W_K$, and $W_V$ are learned weight matrices for queries, keys, and values respectively. This formulation highlights how self-attention derives all its components from the same input, creating a dynamic, content-dependent processing pattern.
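
A minimal sketch of this formulation (single head, illustrative sizes, and random matrices standing in for the learned parameters), showing how all three components are projected from the same input $X$:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

N, d_model, d_k = 6, 64, 64
X = np.random.randn(N, d_model)                                    # input sequence
W_Q, W_K, W_V = (np.random.randn(d_model, d_k) for _ in range(3))  # learned in practice

Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # all derived from the same X
weights = softmax(Q @ K.T / np.sqrt(d_k))  # N x N, content-dependent
output = weights @ V
print(output.shape)                        # (6, 64)
```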

The Transformer architecture leverages this self-attention mechanism within a broader structure that typically includes feed-forward layers, layer normalization, and residual connections. This combination allows Transformers to process input sequences in parallel, capturing complex dependencies without the need for sequential computation. As a result, Transformers have demonstrated remarkable effectiveness across a wide range of tasks, from natural language processing to computer vision, revolutionizing the landscape of deep learning architectures.
The Transformer architecture leverages this self-attention mechanism within a broader structure that typically includes feed-forward layers, layer normalization, and residual connections (see @fig-transformer). This combination allows Transformers to process input sequences in parallel, capturing complex dependencies without the need for sequential computation. As a result, Transformers have demonstrated remarkable effectiveness across a wide range of tasks, from natural language processing to computer vision, revolutionizing the landscape of deep learning architectures.

![The Transformer model architecture. Source: [Attention Is All You Need](https://arxiv.org/pdf/1706.03762)](images/png/transformer.png){#fig-transformer}
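
As a rough sketch of how these pieces compose, the following single-head, encoder-style block (illustrative sizes and random parameters; real Transformers add multiple heads, dropout, and trained weights, and may place the normalization before each sublayer) wraps self-attention and a position-wise feed-forward network in residual connections and layer normalization:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def self_attention(x, W_Q, W_K, W_V):
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def transformer_block(x, p):
    """Self-attention and feed-forward sublayers, each with a residual
    connection followed by layer normalization (post-norm layout)."""
    x = layer_norm(x + self_attention(x, p["W_Q"], p["W_K"], p["W_V"]))
    ff = np.maximum(0, x @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]   # ReLU MLP
    return layer_norm(x + ff)

# Illustrative sizes: sequence length 6, model width 64, feed-forward width 256.
N, d, d_ff = 6, 64, 256
rng = np.random.default_rng(0)
p = {"W_Q": rng.normal(size=(d, d)), "W_K": rng.normal(size=(d, d)),
     "W_V": rng.normal(size=(d, d)),
     "W1": rng.normal(size=(d, d_ff)), "b1": np.zeros(d_ff),
     "W2": rng.normal(size=(d_ff, d)), "b2": np.zeros(d)}
print(transformer_block(rng.normal(size=(N, d)), p).shape)  # (6, 64)
```

Every position in the sequence is processed by the same block at once; no step waits on the output of the previous position, which is the parallelism the paragraph above refers to.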

#### Computational Mapping

@@ -732,7 +751,7 @@ The data movement primitives have particularly influenced the design of intercon
+-----------------------+---------------------------+--------------------------+----------------------------+
| Dynamic Computation | Flexible routing | Dynamic graph execution | Load balancing |
+-----------------------+---------------------------+--------------------------+----------------------------+
| Sequential Access | Burst mode DRAM | Contiguous allocation | |
| Sequential Access | Burst mode DRAM | Contiguous allocation | Access latency |
+-----------------------+---------------------------+--------------------------+----------------------------+
| Random Access | Large caches | Memory-aware scheduling | Cache misses |
+-----------------------+---------------------------+--------------------------+----------------------------+
(The remaining 7 changed files could not be displayed.)

0 comments on commit 4075b1c
