Fixing small typos #25

Open · wants to merge 2 commits into master
_posts/2015-05-21-rnn-effectiveness.markdown (4 changes: 2 additions & 2 deletions)
@@ -19,7 +19,7 @@ By the way, together with this post I am also releasing [code on Github](https:/

<div class="imgcap">
<img src="/assets/rnn/diags.jpeg">
-<div class="thecap" style="text-align:justify">Each rectangle is a vector and arrows represent functions (e.g. matrix multiply). Input vectors are in red, output vectors are in blue and green vectors hold the RNN's state (more on this soon). From left to right: <b>(1)</b> Vanilla mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. image classification). <b>(2)</b> Sequence output (e.g. image captioning takes an image and outputs a sentence of words). <b>(3)</b> Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment). <b>(4)</b> Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English and then outputs a sentence in French). <b>(5)</b> Synced sequence input and output (e.g. video classification where we wish to label each frame of the video). Notice that in every case are no pre-specified constraints on the lengths sequences because the recurrent transformation (green) is fixed and can be applied as many times as we like.</div>
+<div class="thecap" style="text-align:justify">Each rectangle is a vector and arrows represent functions (e.g. matrix multiply). Input vectors are in red, output vectors are in blue and green vectors hold the RNN's state (more on this soon). From left to right: <b>(1)</b> Vanilla mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. image classification). <b>(2)</b> Sequence output (e.g. image captioning takes an image and outputs a sentence of words). <b>(3)</b> Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment). <b>(4)</b> Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English and then outputs a sentence in French). <b>(5)</b> Synced sequence input and output (e.g. video classification where we wish to label each frame of the video). Notice that in every case there are no pre-specified constraints on the lengths sequences because the recurrent transformation (green) is fixed and can be applied as many times as we like.</div>
</div>

As you might expect, the sequence regime of operation is much more powerful compared to fixed networks that are doomed from the get-go by a fixed number of computational steps, and hence also much more appealing for those of us who aspire to build more intelligent systems. Moreover, as we'll see in a bit, RNNs combine the input vector with their state vector with a fixed (but learned) function to produce a new state vector. This can in programming terms be interpreted as running a fixed program with certain inputs and some internal variables. Viewed this way, RNNs essentially describe programs. In fact, it is known that [RNNs are Turing-Complete](http://binds.cs.umass.edu/papers/1995_Siegelmann_Science.pdf) in the sense that they can to simulate arbitrary programs (with proper weights). But similar to universal approximation theorems for neural nets you shouldn't read too much into this. In fact, forget I said anything.
@@ -58,7 +58,7 @@ class RNN:
return y
```

-The above specifies the forward pass of a vanilla RNN. This RNN's parameters are the three matrices `W_hh, W_xh, W_hy`. The hidden state `self.h` is initialized with the zero vector. The `np.tanh` function implements a non-linearity that squashes the activations to the range `[-1, 1]`. Notice briefly how this works: There are two terms inside of the tanh: one is based on the previous hidden state and one is based on the current input. In numpy `np.dot` is matrix multiplication. The two intermediates interact with addition, and then get squashed by the tanh into the new state vector. If you're more comfortable with math notation, we can also write the hidden state update as \\( h\_t = \tanh ( W\_{hh} h\_{t-1} + W\_{xh} x\_t ) \\), where tanh is applied elementwise.
+The above specifies the forward pass of a vanilla RNN. This RNN's parameters are the three matrices `W_hh, W_xh, W_hy`. The hidden state `self.h` is initialized with the zero vector. The `np.tanh` function implements a non-linearity that squashes the activations to the range `[-1, 1]`. Notice briefly how this works: There are two terms inside of the tanh: one is based on the previous hidden state and one is based on the current input. In numpy `np.dot` is matrix multiplication. The two intermediates interact with addition, and then get squashed by the tanh into the new state vector. If you're more comfortable with math notation, we can also write the hidden state update as \\( h\_t = \tanh ( W\_{hh} h\_{t-1} + W\_{xh} x\_t ) \\), where tanh is applied element-wise.


We initialize the matrices of the RNN with random numbers and the bulk of work during training goes into finding the matrices that give rise to desirable behavior, as measured with some loss function that expresses your preference to what kinds of outputs `y` you'd like to see in response to your input sequences `x`.
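
For reference, here is a minimal, self-contained sketch of the forward pass that the quoted paragraph describes: the hidden-state update \\( h\_t = \tanh ( W\_{hh} h\_{t-1} + W\_{xh} x\_t ) \\) followed by a linear readout. Only `return y` is visible in the hunk above, so the class layout, constructor, matrix shapes, and the 0.01 scaling below are illustrative assumptions, not the post's exact code.

```python
import numpy as np

class RNN:
    def __init__(self, input_size, hidden_size, output_size):
        # The three parameter matrices named in the text, randomly initialized;
        # the 0.01 scaling is an illustrative choice, not taken from the post.
        self.W_hh = 0.01 * np.random.randn(hidden_size, hidden_size)
        self.W_xh = 0.01 * np.random.randn(hidden_size, input_size)
        self.W_hy = 0.01 * np.random.randn(output_size, hidden_size)
        self.h = np.zeros(hidden_size)  # hidden state starts as the zero vector

    def step(self, x):
        # Two terms inside the tanh: one from the previous hidden state and one
        # from the current input; np.dot is matrix-vector multiplication here.
        self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
        # Project the new hidden state to an output vector.
        y = np.dot(self.W_hy, self.h)
        return y

rnn = RNN(input_size=10, hidden_size=100, output_size=10)
y = rnn.step(np.random.randn(10))  # one time step; call repeatedly for a sequence
```

As the quoted paragraph notes, training then consists of searching for values of `W_hh`, `W_xh`, and `W_hy` that produce desirable outputs `y` under a chosen loss on your input sequences `x`, typically by backpropagation through time.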
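
The caption in the first hunk also notes that there are no pre-specified constraints on sequence lengths. A short sketch of why, under the same assumptions: the recurrent update is one fixed function applied once per element, so identical weights process a 3-step and a 50-step sequence. The functional `step` helper and the shapes below are hypothetical, for illustration only.

```python
import numpy as np

# Hypothetical functional form of the same update, to show that sequence
# length never enters the definition of the transformation itself.
def step(h, x, W_hh, W_xh):
    return np.tanh(np.dot(W_hh, h) + np.dot(W_xh, x))

W_hh = 0.01 * np.random.randn(100, 100)
W_xh = 0.01 * np.random.randn(100, 10)

for seq_len in (3, 50):                  # two inputs of very different lengths
    h = np.zeros(100)
    for x in [np.random.randn(10) for _ in range(seq_len)]:
        h = step(h, x, W_hh, W_xh)       # the same fixed transformation each step
```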