This is a PyTorch implementation of the two models described in "Deep Semantic Text Hashing with Weak Supervision".
The code requires Python 3.6 and PyTorch 0.4.
We use four datasets in the paper: 20Newsgroups, DBPedia, YahooAnswers, and AG's News. You can download the original datasets from the links provided in the paper. For your convenience, the preprocessed datasets can be downloaded from here. They are stored in bag-of-words format with BM25 weighting. The k-nearest neighbors of each document in both the train and test collections are also provided.
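To make "bag-of-words with BM25 weighting" concrete, here is a minimal sketch of one common BM25 formulation applied to tokenized documents. The function name `bm25_weights`, the parameters `k1=1.2` and `b=0.75`, and the smoothed IDF variant are assumptions chosen for illustration. The README does not state which BM25 variant or parameters were used to build the preprocessed datasets.

```python
import math
from collections import Counter


def bm25_weights(docs, k1=1.2, b=0.75):
    """Return one {term: BM25 weight} dict per tokenized document.

    k1 and b are common defaults assumed here, not values from the paper.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    weighted = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: longer-than-average documents are damped.
        norm = k1 * (1.0 - b + b * len(d) / avg_len)
        vec = {}
        for term, freq in tf.items():
            # Smoothed IDF variant that never goes negative (an assumption).
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            vec[term] = idf * freq * (k1 + 1.0) / (freq + norm)
        weighted.append(vec)
    return weighted


# Toy corpus: each document is a list of tokens.
docs = [
    ["hash", "text", "hash"],
    ["text", "retrieval"],
    ["text", "model", "model", "model"],
]
vectors = bm25_weights(docs)
print(vectors[0])
```

In this toy corpus, "text" appears in every document, so it receives a much smaller weight than the rarer term "hash".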
Two data folders must exist before you train the models. The first is the "data" directory, which stores all the bag-of-words datasets. The second is the "bm25" directory, which stores all the k-nearest neighbors data.
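The two folders can be created by hand or with a few lines of Python. The sketch below uses a hypothetical helper, `make_data_dirs`, and assumes the training scripts look for "data" and "bm25" relative to the directory you run them from. The README does not state this, so check the scripts if the folders are not found.

```python
from pathlib import Path


def make_data_dirs(root="."):
    """Create the "data" and "bm25" folders under root if they are missing."""
    created = []
    for name in ("data", "bm25"):
        path = Path(root) / name
        # exist_ok=True makes the call safe to repeat.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `make_data_dirs()` from the repository root creates both folders and leaves any existing contents untouched.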
We provide three models in this repo: VDSH [1], NbrReg, and NbrReg+Doc. To train a model, use one of the following commands:
To train the NbrReg model:
python train_NbrReg.py -g 0 -b 32 -d ng20 --epoch 30 --batch_size 100
To train the NbrReg+Doc model:
python train_NbrRegDoc.py -g 0 -b 32 -d ng20 --epoch 30 --batch_size 100
To train the VDSH model:
python train_VDSH.py -g 0 -b 32 -d ng20 --epoch 30 --batch_size 100
If you want to train our models on a custom dataset, make sure the dataset is in bag-of-words format. You also need to generate the k-nearest neighbors files for the train and test sets with topK.py:
To create the kNN file for the train set:
python topK.py -d your_custom_dataset -g 0 --use_train
To create the kNN file for the test set:
python topK.py -d your_custom_dataset -g 0
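For intuition about what a k-nearest neighbors file contains, the following is a minimal pure-Python sketch of top-k retrieval over sparse bag-of-words vectors. It is not the code in topK.py. The helpers `cosine` and `top_k_neighbors`, the use of cosine similarity, the `exclude_self` option, and retrieving test-set neighbors from the train collection are all assumptions made for illustration.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_neighbors(queries, collection, k, exclude_self=False):
    """Return, for each query, the indices of its k most similar documents.

    Set exclude_self=True when queries and collection are the same list,
    so that a document is not reported as its own neighbor.
    """
    result = []
    for qi, q in enumerate(queries):
        scored = [
            (cosine(q, doc), di)
            for di, doc in enumerate(collection)
            if not (exclude_self and di == qi)
        ]
        best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
        result.append([di for _, di in best])
    return result


# Toy weighted vectors standing in for a train and a test collection.
train = [{"a": 1.0, "b": 1.0}, {"a": 1.0}, {"c": 1.0}, {"a": 1.0, "b": 0.9}]
test = [{"c": 2.0, "a": 0.1}]
print(top_k_neighbors(train, train, k=2, exclude_self=True))
print(top_k_neighbors(test, train, k=1))
```

This brute-force loop is quadratic in the number of documents, so it is only suitable for small collections.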
@inproceedings{Chaidaroon:2018:DST:3209978.3210090,
author = {Chaidaroon, Suthee and Ebesu, Travis and Fang, Yi},
title = {Deep Semantic Text Hashing with Weak Supervision},
booktitle = {The 41st International ACM SIGIR Conference on Research \& Development in Information Retrieval},
series = {SIGIR '18},
year = {2018},
isbn = {978-1-4503-5657-2},
location = {Ann Arbor, MI, USA},
pages = {1109--1112},
numpages = {4},
url = {http://doi.acm.org/10.1145/3209978.3210090},
doi = {10.1145/3209978.3210090},
acmid = {3210090},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {semantic hashing, variational autoencoder, weak supervision},
}
[1] Chaidaroon, Suthee, and Yi Fang. "Variational deep semantic hashing for text documents." Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2017.