
GRU cells #20

Open · wants to merge 11 commits into master
Conversation

guillitte
I added the possibility to use GRU cells.
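For readers skimming the thread: the PR implements the GRU recurrence in Torch/Lua. A minimal NumPy sketch of a single GRU time step is below, purely for illustration; the function name `gru_step` and the weight layout (update, reset, candidate blocks stacked side by side) are assumptions for this sketch, not the patch's actual code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, prev_h, Wx, Wh, b):
    """One GRU time step (illustrative layout, not the patch's exact code).

    x:      input vector, shape (D,)
    prev_h: previous hidden state, shape (H,)
    Wx:     input-to-hidden weights, shape (D, 3H), blocks [update | reset | candidate]
    Wh:     hidden-to-hidden weights, shape (H, 3H), same block order
    b:      bias, shape (3H,)
    """
    H = prev_h.shape[0]
    gates = x @ Wx + b
    # update and reset gates see prev_h directly
    gates[:2 * H] += prev_h @ Wh[:, :2 * H]
    u = sigmoid(gates[:H])          # update gate
    r = sigmoid(gates[H:2 * H])     # reset gate
    # candidate sees the reset-gated hidden state
    hc = np.tanh(gates[2 * H:] + (r * prev_h) @ Wh[:, 2 * H:])
    # blend: next_h = (1 - u) * prev_h + u * hc (the convention this patch uses)
    return (1.0 - u) * prev_h + u * hc

rng = np.random.default_rng(0)
D, H = 4, 5
h = gru_step(rng.standard_normal(D), np.zeros(H),
             0.1 * rng.standard_normal((D, 3 * H)),
             0.1 * rng.standard_normal((H, 3 * H)),
             np.zeros(3 * H))
```

With a zero initial state the output reduces to `u * tanh(...)`, so every component stays strictly inside (-1, 1).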

@jcjohnson
Owner

Wow, this looks amazing - thanks a bunch! There's even a unit test! I want to look through it in a bit more detail before merging, and I probably won't have time to do so today.

@guillitte
Author

Thanks. It could certainly be optimized further, but at least it seems to work fine.

@JoostvDoorn

JoostvDoorn commented May 4, 2016

Any update on this?

@guillitte
Author

For those interested, I also added a gridgru module, adapted from http://arxiv.org/abs/1507.01526, in the Dev branch.

@guillitte
Author

Running a small benchmark of 1000 iterations on tiny Shakespeare (epoch 3.8), I got the following results:

LSTM :

{"i":1000,"val_loss_history":[1.6292053406889],"val_loss_history_it":[1000],"forward_backward_times":{},"opt":{"max_epochs":50,"checkpoint_every":1000,"batch_size":50,"memory_benchmark":0,"init_from":"","grad_clip":5,"model_type":"lstm","lr_decay_every":5,"print_every":1,"wordvec_size":64,"seq_length":50,"input_json":"data/tiny-shakespeare.json","num_layers":3,"input_h5":"data/tiny-shakespeare.h5","reset_iterations":1,"rnn_size":800,"dropout":0,"checkpoint_name":"cv/lstm","batchnorm":0,"learning_rate":0.0005,"speed_benchmark":0,"gpu_backend":"cuda","lr_decay_factor":0.5,"gpu":0}

GRU :

{"i":1000,"val_loss_history":[1.4681989658963],"val_loss_history_it":[1000],"forward_backward_times":{},"opt":{"max_epochs":50,"checkpoint_every":1000,"batch_size":50,"memory_benchmark":0,"init_from":"","grad_clip":5,"model_type":"gru","lr_decay_every":5,"print_every":1,"wordvec_size":64,"seq_length":50,"input_json":"data/tiny-shakespeare.json","num_layers":3,"input_h5":"data/tiny-shakespeare.h5","reset_iterations":1,"rnn_size":800,"dropout":0,"checkpoint_name":"cv/gru","batchnorm":0,"learning_rate":0.0005,"speed_benchmark":0,"gpu_backend":"cuda","lr_decay_factor":0.5,"gpu":0}

GRIDGRU :

{"i":1000,"val_loss_history":[1.4313773946329],"val_loss_history_it":[1000],"forward_backward_times":{},"opt":{"max_epochs":50,"checkpoint_every":1000,"batch_size":50,"memory_benchmark":0,"init_from":"","grad_clip":5,"model_type":"gridgru","lr_decay_every":5,"print_every":1,"wordvec_size":800,"seq_length":50,"input_json":"data/tiny-shakespeare.json","num_layers":3,"input_h5":"data/tiny-shakespeare.h5","reset_iterations":1,"rnn_size":800,"dropout":0,"checkpoint_name":"cv/gridgru","batchnorm":0,"learning_rate":0.0005,"speed_benchmark":0,"gpu_backend":"cuda","lr_decay_factor":0.5,"gpu":0}

NB: for GRIDGRU, wordvec_size is the size of the network along the depth dimension, so it should be about the same as rnn_size.

cur_gates[{{}, {2 * H + 1, 3 * H}}]:addmm(next_h, Wh[{{}, {2 * H + 1, 3 * H}}]) -- hc += Wh * (r . prev_h)
local hc = cur_gates[{{}, {2 * H + 1, 3 * H}}]:tanh() -- hidden candidate: hc = tanh(Wx * x + Wh * (r . prev_h) + b)
next_h:addcmul(prev_h, -1, u, prev_h)
next_h:addcmul(u, hc) -- next_h = (1 - u) . prev_h + u . hc


A small note: the original paper (http://arxiv.org/pdf/1406.1078v3.pdf) has it the other way around; see Equation 7.

@guillitte (Author)


True. As always, there are many small variations of the same algorithm.
For the definition of the GRU, I used the code in Karpathy's char-rnn and didn't check the original article.
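The two conventions differ only in which branch the update gate weights, so they are mirror images under relabeling the gate. A tiny scalar sketch (illustrative only, helper names made up here):

```python
def blend_char_rnn(u, prev_h, hc):
    # convention used in this patch (following char-rnn):
    # next_h = (1 - u) * prev_h + u * hc
    return (1 - u) * prev_h + u * hc

def blend_paper_eq7(z, prev_h, hc):
    # Cho et al. 2014, Eq. 7: h_t = z * h_{t-1} + (1 - z) * h~_t
    return z * prev_h + (1 - z) * hc

# equivalent up to relabeling the gate: u corresponds to 1 - z
u, prev_h, hc = 0.3, 0.8, -0.5
diff = abs(blend_char_rnn(u, prev_h, hc) - blend_paper_eq7(1 - u, prev_h, hc))
```

Since the gate is learned, the network simply learns the complementary gate values; the trained models are equivalent.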

@AlekzNet

AlekzNet commented Nov 6, 2016

@guillitte I wonder how fair this comparison is. GRIDGRU has about twice as many parameters as LSTM, and 2.5 times as many as GRU. A 3x800 GRIDGRU has roughly the same number of parameters as, say, a 3x1070 LSTM or a 3x1250 GRU. So, in this comparison, GRU wins hands down.
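To make the parameter comparison concrete, here is a back-of-the-envelope count in Python (illustrative only: standard per-layer formulas with biases, ignoring the embedding and output layers, and not necessarily torch-rnn's exact layout). GRIDGRU is omitted because its count depends on the specific layer definition in the Dev branch.

```python
def lstm_params(D, H):
    # 4 gates (i, f, o, g), each with input (D) and recurrent (H) weights plus bias
    return 4 * H * (D + H) + 4 * H

def gru_params(D, H):
    # 3 blocks (update, reset, candidate)
    return 3 * H * (D + H) + 3 * H

def stack_params(per_layer, D, H, layers):
    # first layer sees the word vector, deeper layers see the hidden state
    return per_layer(D, H) + (layers - 1) * per_layer(H, H)

lstm_total = stack_params(lstm_params, 64, 800, 3)  # 3x800 LSTM, wordvec_size 64
gru_total = stack_params(gru_params, 64, 800, 3)    # 3x800 GRU, wordvec_size 64
print(lstm_total, gru_total)
```

At equal width the LSTM stack is about 4/3 the size of the GRU stack, which is why matching parameter budgets requires widening the GRU as suggested above.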

@binary-person

This has been open for a while; would one of the contributors mind merging it?

@JoostvDoorn

@scheng123 An equivalent implementation has also been merged into https://github.com/torch/rnn/ under the name SeqGRU.

5 participants