FBTT-Embedding library provides functionality to compress sparse embedding tables commonly usedin machine learning models such as recommendation and natural language processing. The library can be used as a direct replacement to PyTorch’s EmbeddingBag functionality. It provides the forward and backward propagation functionality same as PyTorch’s EmbeddingBag with only difference of compression. In addition, our implementation includes a software cache to store a portion of the entries in the embedding tables (or “bag”s) in decompressed format for faster lookup and process removing the need for decompressing and compressing the entries every-time it is accessed during the program execution of training or inference.
Read more at "TT-Rec: Tensor Train Compression for Deep Learning Recommendation Models", accepted and to appear in the Proceedings of Conference on Machine Learning and Systems, MLSys 2021.
- The implementation was compiled and tested with PyTorch 1.6 and above, with CUDA 10.1.
- To install the library, run setup.py install In order to get the best performance, specify -gencode for nvcc in the setup.py (line 24). The corresponding compiler settings for each architecture can be found in CUDA Toolkit Documentation, for example Volta https://docs.nvidia.com/cuda/volta-compatibility-guide/index.html Pascal https://docs.nvidia.com/cuda/pascal-compatibility-guide/index.html
- Dependence cub https://nvlabs.github.io/cub/
- A sample run of the code, run tt_embeddings_benchmark.py. A sample output is shown below
INFO:root:sparse: True, optimizer: sgd
INFO:root:p_shapes: [200, 220, 250], q_shapes: [4, 4, 4], ranks: [32, 32]
INFO:root:B: 512, E: 11000000, D: 64, nnz: 10240
INFO:root:TTEmbeddingBag FWD-BWD time/nnz: 0.416 usecs, GFLOPS: 2657.631, BW: 18.456
num_embeddings(int)
— size of the dictionary of embeddingsembedding_dim(int)
— the length of each embedding vectortt_ranks(List[int])
— the ranks of TT corestt_p_shapes(Optional[List[int]])
— The factorization ofnum_embeddings
, where the product of all elements is not smaller thannum_embeddings
.tt_q_shapes(Optional[List[int]])
— The factorization ofembedding_dim
, where the product of all elements is equal toembedding_dim
.sparse (bool)
— if True, the weight update is fused with gradient calculation, and gradients are not returned by backward propagation. Otherwise the gradient w.r.t. TT cores or cache would be returned to external optimizer.optimizer (OptimType)
— The type of optimizer when using fused kernel.learning_rate (float)
— Learning rate of the optimizereps (float)
— term added to the denominator to improve numerical stability, for Adagrad only.use_cache (bool)
— if True, a software cache will be used to store the most-frequently-accessed embedding vectors.cache_size (int)
— The maximum number of embedding vectors to be stored in the cache.hashtbl_size (int
) — The maximum number of entries in the hash table for frequency count.weight_dist (str)
— “uniform”, “naive-uniform”, “approx-uniform”, "normal", “approx-normal”. When using “uniform” or “normal”, the weights of TT cores will be i.i.d from the specified distribution. When using “approx-uniform” or “approx-normal”, the TT cores are initialized in a way that the entries of full embedding table follow the specified distribution.
The initialization of TT-Emb is similar to Pytorch EmbeddingBag
tt_emb = TTEmbeddingBag(
num_embeddings=1000000,
embedding_dim=64,
tt_p_shapes=[120, 90, 110],
tt_q_shapes=[4, 4, 4],
tt_ranks=[12, 14],
sparse=False,
use_cache=False,
weight_dist="uniform"
)
This method will generate TT cores representing an embedding table of size 1000000 by 64 (num_embeddings x embedding_dim
), where each TT core is of size ranks[i] x tt_p_shapes[i] x tt_q_shapes[i] x ranks[i+1] and ranks = [1]+tt_rank+[1]
. In this case, the shape of the 3 TT-cores are 1 x 120 x 4 x 12, 12 x 90 x 4 x 16
, and 14 x 110 x 4 x 1
.
When tt_p_shapes
and tt_q_shapes are
specified, the product of tt_p_shapes[]
needs be no smaller than num_embeddings
; the product of tt_q_shapes
must be equal to embedding_dim
. When passing these 2 parameters as None, TTEmbeddingBag will factor num_embeddings
and embedding_dim
automatically.
>>> # an Embedding module containing 10 tensors of size 3
>>> embedding_sum *=* TTEmbeddingBag(10, 3, None, None, tt_ranks=[2,2], sparse=False, use_cache=False)
>>> # a batch of 2 samples of 4 indices each
>>> input *=* torch*.*LongTensor([1,2,4,5,4,3,2,9])
>>> offsets *=* torch*.*LongTensor([0,4])
>>> embedding_sum(input, offsets)
tensor([[-0.8861, -5.4350, -0.0523],
[ 1.1306, -2.5798, -1.0044]])
TT-Emb supports fused gradient computation and weight updates for better efficiency, where the weights of embedding tables are updated along with backward propagation. If the network is trained with an external optimizer, the gradients will no longer be returned to the optimizer. To enable the fused kernel, specify sparse=True, and pass the corresponding optimizer type and parameters to TTEmbeddingBag. For example,
tt_emb = TTEmbeddingBag(
num_embeddings=1000000,
embedding_dim=64,
tt_p_shapes=[120, 90, 110],
tt_q_shapes=[4, 4, 4],
tt_ranks=[12, 14],
*sparse**=True,
optimizer=OptimType.SGD,
learning_rate=0.05,*
eps=1.0e-10 #for ADAGRAD only
use_cache=False,
)
Embedding lookup in TT-Rec requires explicitly computing the embedding vectors from TT cores via two consecutive matrix multiplications (GEMM). Similarly in backward propagation, the gradient of each tensor core is calculated through a chain of matrix multiplications in reversed order.
To reduce the computation during training, TT-Emb implements a software cache to store an uncompressed copy of the most-frequently queried embedding vectors. When such vectors are queried, the vectors can be loaded directly from cache without computation. The size of cache can be determined according to each hardware platform and dataset, so that cached embedding rows capture as many embedding lookups as possible while minimizing the memory requirement during training.
We implemented a 32-way set-associative Least-Frequently-Used (LFU) cache using open addressing hash table for frequency count. To enable cache for TT-Emb, specify use_cache=True
and set the appropriate cache size as the maximum number of embedding vectors to store in the cache, and max hash table size.
tt_emb = TTEmbeddingBag(
num_embeddings=1000000,
embedding_dim=64,
tt_p_shapes=[120, 90, 110],
tt_q_shapes=[4, 4, 4],
tt_ranks=[12, 14],
sparse=True,
optimizer=OptimType.SGD,
learning_rate=0.05,
eps=1.0e-10 #for ADAGRAD only
*use_cache**=True,
cache_size=1000,
hashtbl_size=1000*
)
During forward propagation, the access frequency of embedding vectors will be updated. However, the cache will only be updated when tt_emb.cache_populate()
is called. The cached rows are determined by the access frequency of the embedding vectors, and the value of each embedding vector would be initialized from TT cores.
Apart from TTEmbeddingBag demonstrated above, there's also a TableBatchedTTEmbeddingBag which groups the lookup operation together for multiple TT embedding tables. It can be more efficient than creating multiple instances of TTEmbeddingBags when each of the TTEmbeddingBags has little computation involved.
To use that, you just need to initialize it as below
tt_emb = TableBatchedTTEmbeddingBag(
num_tables=100,
num_embeddings=1000000,
embedding_dim=64,
tt_p_shapes=[120, 90, 110],
tt_q_shapes=[4, 4, 4],
tt_ranks=[12, 14],
sparse=False,
use_cache=False,
weight_dist="uniform"
)
The above creates 100 TT embedding tables with the same dimensions underlying. The only additional argument that needs to be passed is num_tables
Currently there are some limitations to TableBatchedTTEmbeddingBag.
- The mutiple tables in TableBatchedTTEmbeddingBag must share the same dimensions.
- No support for software cache yet.
- No support for "approx-uniform" weight init yet.
- No support for expanding to full weight yet.
FBTT-Embedding is MIT licensed, as found in the LICENSE file.