torch-rnn provides high-performance, reusable RNN and LSTM modules for torch7, and uses these modules for character-level and word-level language modeling similar to char-rnn.
You can find documentation for the RNN and LSTM modules here; they have no dependencies other than torch and nn, so they should be easy to integrate into existing projects.
Compared to char-rnn, torch-rnn is up to 1.9x faster and uses up to 7x less memory. For more details see the Benchmark section below.
Cristian Baldi has prepared Docker images for both CPU-only mode and GPU mode; you can find them here.
You'll need to install the header files for Python 2.7 and the HDF5 library. On Ubuntu you can install them like this:
sudo apt-get -y install python2.7-dev
sudo apt-get install libhdf5-dev
The preprocessing script is written in Python 2.7; its dependencies are listed in the file requirements.txt.
You can install these dependencies in a virtual environment like this:
virtualenv .env # Create the virtual environment
source .env/bin/activate # Activate the virtual environment
pip install -r requirements.txt # Install Python dependencies
# Work for a while ...
deactivate # Exit the virtual environment
The main modeling code is written in Lua using torch; you can find installation instructions here. You'll need the following Lua packages:
After installing torch, you can install / update these packages by running the following:
# Install most things using luarocks
luarocks install torch
luarocks install nn
luarocks install optim
luarocks install lua-cjson
# We need to install torch-hdf5 from GitHub
git clone https://github.com/deepmind/torch-hdf5
cd torch-hdf5
luarocks make hdf5-0-0.rockspec
To enable GPU acceleration with CUDA, you'll need to install CUDA 6.5 or higher and the following Lua packages:
You can install / update them by running:
luarocks install cutorch
luarocks install cunn
Use the -cudnn 1 command-line flag to enable the cuDNN LSTM implementation. You will need to install the cuDNN library and the cuDNN torch bindings:
luarocks install cudnn
Thanks to Jeremy Appleyard for implementing this.
To enable GPU acceleration with OpenCL, you'll need to install the following Lua packages:
You can install / update them by running:
luarocks install cltorch
luarocks install clnn
Jeff Thompson has written a very detailed installation guide for OSX that you can find here.
To train a model and use it to generate new text, you'll need to follow three simple steps:
You can use any text file or folder of .txt files for training models. Before training, you'll need to preprocess the data using the script scripts/preprocess.py; this will generate an HDF5 file and a JSON file containing a preprocessed version of the data.
If you have training data stored in my_data.txt, you can run the script like this:
python scripts/preprocess.py \
--input_txt my_data.txt \
--output_h5 my_data.h5 \
--output_json my_data.json
If you instead have multiple .txt files in the folder my_data, you can run the script like this:
python scripts/preprocess.py \
--input_folder my_data
This will produce the files my_data.h5 and my_data.json, which will be passed to the training script.
There are a few more flags you can use to configure preprocessing; read about them here.
To preprocess the input data with words as tokens, add the flag --use_words.
A large text corpus will contain many rare words, usually typos or unusual names. Adding a token for each of these is impractical and can result in a very large token space. The options --min_occurrences and --min_documents let you specify how many times, or in how many documents, a word must occur before it is added as a token. Words that fail to meet these criteria are replaced by wildcards, which are randomly distributed to avoid overtraining.
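For example, a word-level preprocessing run with a rare-word threshold might look like this (the threshold value of 5 is an arbitrary illustration, not a recommendation):

```shell
# Word-level preprocessing; keep only words occurring at least 5 times
python scripts/preprocess.py \
  --input_txt my_data.txt \
  --output_h5 my_data.h5 \
  --output_json my_data.json \
  --use_words \
  --min_occurrences 5
```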
More information on additional flags is available here.
If you have an existing token schema (a .json file generated by preprocess.py), you can use the script scripts/tokenize.py to tokenize a file based on that schema. It accepts input as a text file or folder of text files (similar to the preprocessing script), as well as an argument --input_json which specifies the input token schema file. This is useful for transfer learning onto a new dataset.
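As a sketch, tokenizing a new dataset against an existing schema might look like the following; the --input_txt flag is assumed by analogy with preprocess.py (only --input_json is documented above), so check the script's help output for the exact flag names:

```shell
# Hypothetical invocation: reuse the token schema from my_data.json
# to tokenize a new corpus for transfer learning
python scripts/tokenize.py \
  --input_txt new_data.txt \
  --input_json my_data.json
```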
To learn more about the tokenizing script see here.
After preprocessing the data, you'll need to train the model using the train.lua script. This will be the slowest step.
You can run the training script like this:
th train.lua -input_h5 my_data.h5 -input_json my_data.json
This will read the data stored in my_data.h5 and my_data.json, run for a while, and save checkpoints to files with names like cv/checkpoint_1000.t7.
Checkpoints also allow you to resume training if it is interrupted. Use the flag -resume_from (e.g. -resume_from cv/checkpoint_1000) to resume from any checkpoint, preserving the model and training state.
You can change the RNN model type, hidden state size, and number of RNN layers like this:
th train.lua -input_h5 my_data.h5 -input_json my_data.json -model_type rnn -num_layers 3 -rnn_size 256
By default this will run in GPU mode using CUDA; to run in CPU-only mode, add the flag -gpu -1. To run with OpenCL, add the flag -gpu_backend opencl.
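Putting these together, a CPU-only training run on the preprocessed data from above would look like this:

```shell
# Train on CPU only (no CUDA or OpenCL required)
th train.lua -input_h5 my_data.h5 -input_json my_data.json -gpu -1
```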
There are many more flags you can use to configure training; read about them here.
After training a model, you can generate new text by sampling from it using the script sample.lua. Run it like this:
th sample.lua -checkpoint cv/checkpoint_10000.t7 -length 2000
This will load the trained checkpoint cv/checkpoint_10000.t7 from the previous step, sample 2000 characters from it, and print the results to the console.
By default the sampling script will run in GPU mode using CUDA; to run in CPU-only mode add the flag -gpu -1, and to run in OpenCL mode add the flag -gpu_backend opencl.
To pre-seed the model with text, there are two options. If you used character-based preprocessing, use the flag -start_text with a quoted string.
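For example, with a character-level model trained on Shakespeare, a seeded sampling run might look like this (the seed string is of course arbitrary):

```shell
# Sample 2000 characters, seeding the model with a quoted start string
th sample.lua -checkpoint cv/checkpoint_10000.t7 -length 2000 -start_text "ROMEO:"
```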
If you used word-based preprocessing, use the Python script scripts/tokenize.py to generate a JSON file of tokens and provide it using the flag -start_tokens. Since Python was used to parse the input data into tokens, it is best to use Python to do so for the seed text as well; Lua does not have full regex support, hence the extra step.
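A sketch of the word-based seeding workflow; the tokenizer's input flag and the output filename here are assumptions (only --input_json and -start_tokens are documented above), so verify them against the scripts themselves:

```shell
# Hypothetical two-step seeding for a word-level model:
# 1) tokenize the seed text with the training-time token schema
python scripts/tokenize.py --input_txt seed.txt --input_json my_data.json
# 2) pass the resulting token file to the sampler
th sample.lua -checkpoint cv/checkpoint_10000.t7 -start_tokens seed_tokens.json
```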
To learn more about the tokenizing script see here.
There are more flags you can use to configure sampling; read about them here.
# Resuming

Every checkpoint automatically saves a resume point; all parameters and the current learning rate are saved within it. To pick up from a checkpoint, run the train script like this:
th train.lua -resume_from cv/checkpoint_10000_resume.json
To benchmark torch-rnn against char-rnn, we use each to train LSTM language models on the tiny-shakespeare dataset with 1, 2, or 3 layers and with an RNN size of 64, 128, 256, or 512. For each we use a minibatch size of 50, a sequence length of 50, and no dropout. For each model size and for both implementations, we record the forward/backward times and GPU memory usage over the first 100 training iterations, and use these measurements to compute the mean time and memory usage.
All benchmarks were run on a machine with an Intel i7-4790k CPU, 32 GB main memory, and a Titan X GPU.
Below we show the forward/backward times for both implementations, as well as the mean speedup of torch-rnn over char-rnn. We see that torch-rnn is faster than char-rnn at all model sizes, with smaller models giving a larger speedup: for a single-layer LSTM with 128 hidden units we achieve a 1.9x speedup, and for larger models we achieve about a 1.4x speedup.
Below we show the GPU memory usage for both implementations, as well as the mean memory saving of torch-rnn over char-rnn. Again torch-rnn outperforms char-rnn at all model sizes, but here the savings become more significant for larger models: for models with 512 hidden units, we use 7x less memory than char-rnn.
- Get rid of Python / JSON / HDF5 dependencies?