This repository contains the official TensorFlow implementation of the following paper:
Update (April 10, 2020): A simple and lightweight variant of the Shuffle-Exchange network has been released. The new version not only scales to longer sequences but also converges faster and achieves better accuracy. It is available on arXiv and GitHub.
Neural Shuffle-Exchange Networks − Sequence Processing in O(n log n) Time
by Kārlis Freivalds, Emīls Ozoliņš, Agris Šostaks
Abstract: A key requirement in sequence to sequence processing is the modeling of long range dependencies. To this end, a vast majority of the state-of-the-art models use an attention mechanism, which has O(n²) complexity and leads to slow execution for long sequences.
We introduce a new Shuffle-Exchange neural network model for sequence to sequence tasks, which has O(log n) depth and O(n log n) total complexity. We show that this model is powerful enough to infer efficient algorithms for common algorithmic benchmarks including sorting, addition and multiplication. We evaluate our architecture on the challenging LAMBADA question answering dataset and compare it with the state-of-the-art models which use attention. Our model achieves competitive accuracy and scales to sequences with more than a hundred thousand elements.
We are confident that the proposed model has the potential for building more efficient architectures for processing large interrelated data in language modeling, music generation and other application domains.
The Shuffle-Exchange neural network is a new differentiable architecture for sequence processing tasks that has O(log n) depth and O(n log n) total complexity and allows modeling any dependencies in the sequence. It can successfully learn nontrivial O(n log n) time algorithms with good generalization. The Shuffle-Exchange model can serve as an alternative to the attention mechanism with better scaling to long sequences.
Our paper describes Shuffle-Exchange neural networks in detail and provides full results on sequence duplication, reversal, long binary addition, long binary multiplication, sorting tasks and the LAMBADA question answering task.
Here are the accuracy results on the LAMBADA question answering task of predicting a target word in its broader context (on average 4.6 sentences picked from novels):
| Model | Test accuracy (%) |
| --- | --- |
| Random word from passage | 1.6 |
| Gated-Attention Reader | 49.0 |
| Shuffle-Exchange neural network | 52.28 |
| Universal Transformer | 56.0 |
| Human performance | 86.0 |
Note: Our model is 7 times smaller than the Universal Transformer model and, using the same amount of GPU memory, can process 32 times longer sequences.
- Our model with a depth of 2 log n can learn sorting and long binary addition algorithms that generalize to longer inputs. Here are the accuracy results on the chosen sequential algorithmic tasks – generalization to sequences of length 512 when the model is trained on length 64:
- The algorithms learned by our model can process sequences 128 times longer than the DNGPU model (an optimized Neural GPU with diagonal gates) can, and they run faster.
- On the long binary multiplication task, our model achieves 98% test accuracy on sequences of length 128. Training multiplication on longer sequences is possible but requires increasing the width or depth of the model.
- Note: The fastest practical FFT-based multiplication algorithms are O(n log n log log n); therefore, it is expected that the learned solution does not generalize to inputs longer than those the model is trained on.
Shuffle-Exchange neural networks are continuous, differentiable neural networks with a regular-layered structure consisting of alternating Switch and Shuffle layers.
The Switch Layer divides the input into adjacent pairs of values and applies the Switch Unit, a learnable 2-to-2 function, to each pair of inputs, producing two outputs.
Here is an illustration of the Switch Unit; it is similar to the Gated Recurrent Unit (GRU):
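In code terms, a minimal NumPy sketch of a GRU-style switch layer might look as follows; the gate layout and weight shapes here are illustrative assumptions, not the exact Switch Unit parametrization used in the implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def switch_layer(x, W_r, W_c, W_u, b_r, b_c, b_u):
    """GRU-style 2-to-2 switch unit applied to each adjacent pair (illustrative sketch).

    x: [n, m] sequence of n feature vectors (n must be even).
    Each pair (x[2k], x[2k+1]) is concatenated into a 2m vector, gated,
    and split back into two m-dimensional outputs.
    Weights W_* have shape (2m, 2m), biases b_* have shape (2m,).
    """
    n, m = x.shape
    pairs = x.reshape(n // 2, 2 * m)          # join adjacent positions into pairs
    r = sigmoid(pairs @ W_r + b_r)            # reset gate
    u = sigmoid(pairs @ W_u + b_u)            # update gate
    c = np.tanh((r * pairs) @ W_c + b_c)      # candidate values
    out = u * pairs + (1.0 - u) * c           # gated mix of input and candidate
    return out.reshape(n, m)                  # split each pair back into two outputs
```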
The Shuffle Layer follows, where inputs are permuted according to a perfect-shuffle permutation (i.e., how a deck of cards is shuffled by splitting it into halves and then interleaving them) – a cyclic bit shift of the indices, rotating left in the first part of the network and (inversely) rotating right in the second part.
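Concretely, for a sequence of length n = 2^k, the perfect shuffle moves the element at index i to the index obtained by rotating the k-bit representation of i left by one bit. Here is a small sketch of the two permutations (the function names are illustrative, not taken from the repository):

```python
import numpy as np

def rotate_left(i, k):
    """Cyclically rotate the k-bit index i left by one bit."""
    return ((i << 1) & (2 ** k - 1)) | (i >> (k - 1))

def shuffle_layer(x):
    """Perfect-shuffle permutation of the rows of x; len(x) must be a power of two."""
    n = x.shape[0]
    k = n.bit_length() - 1                    # n = 2**k
    dest = np.array([rotate_left(i, k) for i in range(n)])
    out = np.empty_like(x)
    out[dest] = x                             # element i moves to position rotate_left(i)
    return out

def inverse_shuffle_layer(x):
    """Inverse perfect shuffle: each index is rotated right by one bit."""
    n = x.shape[0]
    k = n.bit_length() - 1
    src = np.array([rotate_left(i, k) for i in range(n)])
    return x[src]                             # position i takes the element from rotate_left(i)
```

For example, with n = 8 the shuffle maps [0, 1, 2, 3, 4, 5, 6, 7] to [0, 4, 1, 5, 2, 6, 3, 7], i.e., the two halves interleaved.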
The Shuffle-Exchange neural network is organized in blocks by alternating these two kinds of layers in the pattern of the Beneš network. Such a network can represent a wide class of functions including any permutation of the input values.
Here is an illustration of a whole Shuffle-Exchange neural network model consisting of two blocks on input of length 16:
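Putting the pieces together, the following sketch stacks the layers above into Beneš-style blocks. The exact number of layers per block and the parameter sharing follow the paper and config.py, so treat this purely as an illustration; the parameter lists and function names are hypothetical:

```python
def benes_block(x, forward_params, reverse_params):
    """One Beneš-style block: switch + shuffle layers, then switch + inverse-shuffle layers.

    x: [n, m] with n = 2**k. forward_params / reverse_params are lists of
    per-layer switch-unit weight tuples for switch_layer above (hypothetical).
    """
    k = x.shape[0].bit_length() - 1
    h = x
    for layer in range(k):                    # first half: shuffle rotates indices left
        h = switch_layer(h, *forward_params[layer])
        h = shuffle_layer(h)
    for layer in range(k):                    # second half: inverse shuffle rotates right
        h = switch_layer(h, *reverse_params[layer])
        h = inverse_shuffle_layer(h)
    return h

def shuffle_exchange_model(x, block_params):
    """Whole model: a stack of Beneš-style blocks (two blocks in the illustration above)."""
    h = x
    for fwd, rev in block_params:
        h = benes_block(h, fwd, rev)
    return h
```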
- Python 3.5 or higher.
- TensorFlow 1.13.0.
To select the sequence processing task on which to train the Shuffle-Exchange neural network, edit the config.py file, which contains various hyperparameter settings and other suggested options. For example, these are the suggested settings for the LAMBADA question answering task under “Task configuration”:
...
"""
Task configuration.
"""
...
# suggested settings for LAMBADA question answering
task = "lambada"
bins = [128]
batch_size = 64
n_input = lambada_vocab_size
n_output = 3
n_hidden = 48 * 8
input_word_dropout_keep_prob = 0.95
use_front_padding = True
use_pre_trained_embedding = False # set this flag to True and provide an embedding for the best results
disperse_padding = True # we trained with this setting but are not sure whether it helps
label_smoothing = 0.1
min_learning_rate = initial_learning_rate = 2e-4
...
To download the LAMBADA dataset, see the original publication by Paperno et al.
To download the pre-trained fastText 1M English word embedding, see the downloads section of the fastText library website and extract the files to the directory given by the config.py variable base_folder under “Embedding configuration”:
...
"""
Embedding configuration
"""
use_pre_trained_embedding = False
base_folder = "/host-dir/embeddings/"
embedding_file = base_folder + "fast_word_embedding.vec"
emb_vector_file = base_folder + "emb_vectors.bin"
emb_word_dictionary = base_folder + "word_dict.bin"
...
To enable the pre-trained embedding, change the config.py variable use_pre_trained_embedding to True:
...
use_pre_trained_embedding = True # set this flag to True and provide an embedding for the best results
...
To start training the Shuffle-Exchange neural network, run the following terminal command:
python3 shuffle_exchange_trainer.py
If you're running Windows, edit the config.py file before starting training and change the directory-related variables to the Windows file path format:
...
"""
Local storage (checkpoints, etc).
"""
...
out_dir = ".\\host-dir\\gpu" + gpu_instance
model_file = out_dir + "\\varWeights.ckpt"
image_path = out_dir + "\\images"
...
"""
Lambada configuration
"""
lambada_data_dir = ".\\host-dir\\lambada-dataset"
...
"""
Embedding configuration
"""
...
base_folder = ".\\host-dir\\embeddings\\"
embedding_file = base_folder + "fast_word_embedding.vec"
emb_vector_file = base_folder + "emb_vectors.bin"
emb_word_dictionary = base_folder + "word_dict.bin"
...
If you use Shuffle-Exchange neural networks, please cite the paper using the following BibTeX entry:
@incollection{NIPS2019_8889,
title = {Neural Shuffle-Exchange Networks -- Sequence Processing in {O}(n log n) Time},
author = {Freivalds, Karlis and Ozoli\c{n}\v{s}, Em\={\i}ls and \v{S}ostaks, Agris},
booktitle = {Advances in Neural Information Processing Systems 32},
pages = {6626--6637},
year = {2019},
publisher = {Curran Associates, Inc.},
url = {http://papers.nips.cc/paper/8889-neural-shuffle-exchange-networks-sequence-processing-in-on-log-n-time.pdf}
}
For help or issues using Shuffle-Exchange neural networks, please submit a GitHub issue.
For personal communication related to Shuffle-Exchange neural networks, please contact Kārlis Freivalds ([email protected]).