The PyTorch implementation of two models described in Deep Semantic Text Hashing with Weak Supervision (SIGIR'18)

unsuthee/SemanticHashingWeakSupervision

Deep Semantic Text Hashing with Weak Supervision (SIGIR'18)

Author: Suthee Chaidaroon

This is a PyTorch implementation of the two models described in Deep Semantic Text Hashing with Weak Supervision.

Requirements

Python 3.6 and PyTorch 0.4.

Datasets

We use four datasets in this paper: 20Newsgroups, DBPedia, YahooAnswers, and AG's news. You can download the original datasets from the links provided in the paper. For your convenience, the preprocessed datasets can be downloaded from here. The preprocessed datasets are bag-of-words representations with BM25 weighting. The k-nearest neighbors of each document in both the train and test collections are also provided.
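To make "bag-of-words using BM25 weighting" concrete, here is a minimal sketch of one common BM25 formulation applied to tokenized documents. The function name `bm25_weights`, the parameter values `k1=1.2` and `b=0.75`, and the toy documents are assumptions for illustration; the exact BM25 variant and parameters used to build the preprocessed datasets are not stated in this README.

```python
import math
from collections import Counter


def bm25_weights(docs, k1=1.2, b=0.75):
    """Return one {term: BM25 weight} dict per tokenized document.

    k1 and b are common defaults, assumed here for illustration only.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))

    weighted = []
    for d in docs:
        tf = Counter(d)
        # Length normalization shared by every term in this document.
        norm = k1 * (1 - b + b * len(d) / avg_len)
        weighted.append({
            term: math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            * freq * (k1 + 1) / (freq + norm)
            for term, freq in tf.items()
        })
    return weighted


# Toy corpus (hypothetical), already tokenized.
docs = [["hash", "code", "hash"], ["text", "code"], ["text", "retrieval", "model"]]
vectors = bm25_weights(docs)
```

In the first toy document, "hash" is both more frequent within the document and rarer across the corpus than "code", so it receives the larger weight.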

You must create two data folders before training the models. The first is the "data" directory, which stores all the bag-of-words datasets. The second is the "bm25" directory, which stores all the k-nearest neighbors data.
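The two folders can be created by hand or with a few lines of Python. The sketch below assumes both folders sit next to the training scripts, which this README does not state explicitly; `make_data_dirs` is a hypothetical helper, not part of the repository.

```python
import tempfile
from pathlib import Path


def make_data_dirs(root="."):
    """Create the "data" and "bm25" folders under root if they are missing."""
    created = []
    for name in ("data", "bm25"):
        path = Path(root) / name
        # exist_ok=True makes the call safe to repeat.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Demonstrated in a throwaway directory; in practice pass the folder
# that contains the training scripts (an assumption about the layout).
with tempfile.TemporaryDirectory() as tmp:
    created_names = [p.name for p in make_data_dirs(tmp)]
```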

Run the program

This repo provides three models: VDSH [1], NbrReg, and NbrReg+Doc. To train a model, use the corresponding command below.

To train NbrReg model:

python train_NbrReg.py -g 0 -b 32 -d ng20 --epoch 30 --batch_size 100

To train NbrReg+Doc model:

python train_NbrRegDoc.py -g 0 -b 32 -d ng20 --epoch 30 --batch_size 100

To train VDSH model:

python train_VDSH.py -g 0 -b 32 -d ng20 --epoch 30 --batch_size 100
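To train all three models with the same settings, the commands above can be assembled programmatically. This is a minimal sketch: `build_train_command` and `run_all` are hypothetical helpers, and the flags are copied verbatim from the commands above and passed through unchanged, since this README does not document what each one means.

```python
import subprocess

TRAIN_SCRIPTS = ["train_VDSH.py", "train_NbrReg.py", "train_NbrRegDoc.py"]


def build_train_command(script, dataset="ng20", g=0, b=32, epoch=30, batch_size=100):
    """Build the argument list for one training run, mirroring the README commands."""
    return [
        "python", script,
        "-g", str(g),
        "-b", str(b),
        "-d", dataset,
        "--epoch", str(epoch),
        "--batch_size", str(batch_size),
    ]


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = [build_train_command(script) for script in TRAIN_SCRIPTS]
# Call run_all(commands) from the folder that contains the training scripts.
```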

Custom datasets

To train our models on a custom dataset, make sure the dataset is in bag-of-words format. You also need to generate the k-nearest neighbors files by running topK.py:

To create kNN for a train set:

python topK.py -d your_custom_dataset -g 0 --use_train

To create kNN for a test set:

python topK.py -d your_custom_dataset -g 0
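For intuition about what a k-nearest neighbors file contains, the sketch below ranks documents by cosine similarity over sparse bag-of-words vectors. It does not reproduce topK.py: the similarity measure, the choice to search the train collection for test documents, and the names `cosine` and `top_k_neighbors` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_neighbors(queries, collection, k, exclude_self=False):
    """Return, for each query, the indices of its k most similar documents."""
    result = []
    for qi, query in enumerate(queries):
        scored = [
            (cosine(query, doc), di)
            for di, doc in enumerate(collection)
            if not (exclude_self and di == qi)
        ]
        # Highest similarity first; ties are broken by the lower index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result.append([di for _, di in scored[:k]])
    return result


# Toy weighted vectors (hypothetical).
train = [{"hash": 1.0, "code": 0.5}, {"hash": 0.9, "code": 0.6}, {"text": 1.0}]
test = [{"text": 0.8, "code": 0.1}]

# Train documents skip themselves; test documents search the train set (assumed).
train_nbrs = top_k_neighbors(train, train, k=1, exclude_self=True)
test_nbrs = top_k_neighbors(test, train, k=1)
```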

Bibtex

@inproceedings{Chaidaroon:2018:DST:3209978.3210090,
 author = {Chaidaroon, Suthee and Ebesu, Travis and Fang, Yi},
 title = {Deep Semantic Text Hashing with Weak Supervision},
 booktitle = {The 41st International ACM SIGIR Conference on Research \& Development in Information Retrieval},
 series = {SIGIR '18},
 year = {2018},
 isbn = {978-1-4503-5657-2},
 location = {Ann Arbor, MI, USA},
 pages = {1109--1112},
 numpages = {4},
 url = {http://doi.acm.org/10.1145/3209978.3210090},
 doi = {10.1145/3209978.3210090},
 acmid = {3210090},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {semantic hashing, variational autoencoder, weak supervision},
} 

References

[1] Chaidaroon, Suthee, and Yi Fang. "Variational deep semantic hashing for text documents." Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2017.
