This repository contains the code to reproduce major results from the paper:
Nearest Neighbor (NN) Search is widely applied in retrieval tasks. However, a phenomenon called hubness [1] often degrades NN's performance. Hubness manifests as some data points, called hubs, being suspiciously close to many others; it is often encountered in high-dimensional spaces.
This work aims to reduce hubness during retrieval. We propose a new method, which we call Hubless Nearest Neighbor (HNN) Search. Theoretically, HNN provides a unified view of NN and Inverted Softmax (ISF) [2], a recently proposed retrieval method that mitigates hubness. Empirically, HNN demonstrates superior accuracy on a typical retrieval task, Bilingual Lexicon Induction (BLI) [3].
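For intuition, here is a small sketch (ours, not part of the repository) that estimates the k-occurrence of random Gaussian points, i.e., how many times each point appears in other points' k-NN lists. The largest count typically grows with the dimension, which is the hubness phenomenon described in [1]:

```python
# Toy illustration of hubness: in high dimensions, a few points ("hubs")
# appear in many other points' k-NN lists.
import numpy as np

def max_k_occurrence(dim, n=500, k=5, seed=0):
    rng = np.random.RandomState(seed)
    X = rng.randn(n, dim)
    # pairwise squared Euclidean distances via the dot-product identity
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X.dot(X.T)
    np.fill_diagonal(d2, np.inf)           # exclude self-matches
    knn = np.argsort(d2, axis=1)[:, :k]    # each point's k nearest neighbors
    counts = np.bincount(knn.ravel(), minlength=n)
    return counts.max()                    # k-occurrence of the biggest hub

print("max 5-occurrence in   2-d:", max_k_occurrence(2))
print("max 5-occurrence in 500-d:", max_k_occurrence(500))
```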
The following two sections give a step-by-step tutorial for reproducing the major results in the paper. If you have any questions, please post them on GitHub or email [email protected]
Note: other versions of Python and of the packages below have not been tested, but we believe they should work as long as Python 3+ is used.
Environment
- python (3.6.6)
Mandatory Packages
- numpy (1.15.4)
- scipy (1.1.0)
- matplotlib (3.0.2)
- gensim (3.6.0)
All of the above can be installed via pip, e.g.,
pip install 'numpy==1.15.4'
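The same pinned versions can also be installed in one command (a convenience, equivalent to installing them one by one):

pip install 'numpy==1.15.4' 'scipy==1.1.0' 'matplotlib==3.0.2' 'gensim==3.6.0'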
Optional Packages (if using GPU)
- cupy
Assuming the available CUDA version is 9.0, install it by
pip install cupy-cuda90
More details can be found at Chainer. Also, do not forget to append the CUDA paths to your bash environment. The following is a working example:
CUDA_PATH=/path/to/cuda-9.0.176
CUDNN_PATH=/path/to/cudnn-9.0-linux-x64-v7.0-rc
export LD_LIBRARY_PATH=$CUDA_PATH/lib64:$CUDA_PATH/extras/CUPTI/lib64:$CUDNN_PATH/lib64:$LD_LIBRARY_PATH
export PATH=$CUDA_PATH/bin:$PATH
Feel free to test other CUDA versions.
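To quickly verify that cupy can see the GPU after installation, a simple check (our suggestion, not part of the repository) is:

python -c "import cupy; print(cupy.cuda.runtime.getDeviceCount())"

which should print the number of visible CUDA devices.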
Run
python GMM_retrieval.py
to reproduce Table 1 of the paper. The task is to retrieve the same class from a Gaussian mixture. Details can be found in Section 5.1 of the paper. The purpose of this experiment is to understand the connection between HNN and other related methods. An example output is already given in ./exp/GMM_retrieval.example.log.
The log includes the accuracies of different retrieval methods, as well as evidence that primal and dual solvers of HNN are equivalent. Results in Table 1 can be easily found in the log.
As discussed in the paper, the dual solver (Algorithm 2) of HNN minimizes a convex objective. Therefore, if the learning rate in Algorithm 2 is properly chosen, the loss and gradient norm should be monotonically decreasing. This fact can be easily checked by parsing the log file using the following bash command (assuming gnuplot is installed):
grep "grad norm" exp/GMM_retrieval.example.log | cut -d, -f2 | cut -d: -f2 | gnuplot -p -e "set xlabel 'iteration'; set ylabel 'gradient norm'; plot '-' with line notitle;"
It will produce a plot of the gradient norm over the iterations, as below.
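If gnuplot is not available, the same curve can be drawn with matplotlib (already a required package). The following sketch is ours; it mirrors the grep/cut parsing above and assumes every "grad norm" line carries the value in the second comma-separated field, after a colon:

```python
# Sketch: plot the gradient norm from the example log with matplotlib.
# Mirrors the grep/cut pipeline: keep lines containing "grad norm",
# take the second comma-separated field, then the number after the colon.
import matplotlib.pyplot as plt

norms = []
with open('exp/GMM_retrieval.example.log') as f:
    for line in f:
        if 'grad norm' in line:
            norms.append(float(line.split(',')[1].split(':')[1]))

plt.plot(norms)
plt.xlabel('iteration')
plt.ylabel('gradient norm')
plt.show()
```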
The following will reproduce Table 3 of the paper.
(1) Download the fasttext embeddings and dictionaries.
./get_data.sh
A ./data directory will be created. It contains embeddings for 6 European languages, de (German), en (English), es (Spanish), fr (French), it (Italian) and pt (Portuguese), as well as dictionaries for all the language pairs.
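To sanity-check the download, the embeddings can be loaded with gensim. The sketch below is ours and assumes the vectors are stored in word2vec text format; the file name is a guess, so check ./data for the actual names:

```python
# Sketch: load one of the downloaded embedding files and inspect it.
# 'data/wiki.en.vec' is a hypothetical path; adjust to what ./get_data.sh produced.
from gensim.models import KeyedVectors

emb = KeyedVectors.load_word2vec_format('data/wiki.en.vec')
print(len(emb.vocab), emb.vector_size)    # vocabulary size and embedding dimension
print(emb.most_similar('cat', topn=5))    # a quick nearest-neighbor query
```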
(2) Get translation accuracy for a src-tgt pair, using a specified retrieval method (one of {nn, isf, csls, hnn}). Run
./bli_exp.sh $src $tgt $method
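For example, Portuguese-to-English with HNN:

./bli_exp.sh pt en hnn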
The experiment follows the "supervised" setup of MUSE, but differs in that we use larger test dictionaries (data/src-tgt.txt). The output is a log, ./exp/BLI/src-tgt.method.log.
By default, we use 500K as the vocabulary size for both the source and target languages. In the supplementary material, we have also reported results using a vocabulary of 200K. To reproduce those, simply set V=200000 in bli_exp.sh.
To see all translated words, set save_translation=1 in bli_exp.sh. The translated words will be stored in ./exp/BLI/src-tgt.method.trans. Each row of the file is a source word, followed by its top-10 translation candidates, ordered from the "best" to the "worst".
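A minimal sketch (ours) for reading such a file, assuming the tokens in each row are whitespace-separated:

```python
# Sketch: parse a *.trans file into {source word: list of top-10 candidates}.
# Assumes whitespace-separated rows; the path is a hypothetical example.
translations = {}
with open('exp/BLI/pt-en.hnn.trans') as f:
    for line in f:
        tokens = line.split()
        translations[tokens[0]] = tokens[1:]   # candidates ordered best to worst

print(list(translations.items())[:3])
```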
(3) Check how hubness is reduced (Figure 4 and Table 2). For example, to check the hubness for Portuguese-to-English task, simply run
python hubness_in_translations.py pt en -k 5 -N 200
It will produce k-occurrence (k=5 in this case) histograms, as measures of hubness, for the different methods. In particular, a long tail of the histogram indicates strong hubness, which should be reduced. The Portuguese-to-English example will produce the following histograms, where HNN has the shortest tail, i.e., the weakest hubness.
"hubby" words | NN | ISF | CSLS | HNN |
---|---|---|---|---|
conspersus | 1,776 | 0 | 374 | 0 |
s+bd | 912 | 7 | 278 | 16 |
were | 798 | 99 | 262 | 24 |
you | 474 | 12 | 57 | 20 |
The numbers are how many times each word was retrieved. A large value indicates that the word is a hub; note how the values are reduced by HNN.
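For reference, a k-occurrence count like the one in this table can be computed from retrieval results in a few lines; the sketch below is ours, not the repository's implementation:

```python
# Sketch: k-occurrence of target words, i.e. how many times each target
# appears in the top-k retrieval lists of the source words.
from collections import Counter

def k_occurrence(topk_lists):
    """Each element of topk_lists is the list of top-k targets retrieved
    for one source word. Returns target -> number of times retrieved."""
    counts = Counter()
    for retrieved in topk_lists:
        counts.update(retrieved)
    return counts

# Toy usage: 'were' is retrieved for three sources, so it looks like a hub.
lists = [['were', 'cat'], ['were', 'dog'], ['were', 'you'], ['house', 'you']]
print(k_occurrence(lists).most_common(2))
```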
[1] Milos Radovanovic, Alexandros Nanopoulos, and Mirjana Ivanovic. 2010. Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research.
[2] Samuel L. Smith, David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In International Conference on Learning Representations.
[3] Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In International Conference on Learning Representations.