The weighted MinHash algorithms based on Consistent Weighted Sampling
The algorithms convert each weighted set into a hashcode for similarity-based data mining and machine learning tasks (e.g., classification and retrieval), where the similarity between two weighted sets is computed as the pairwise Hamming similarity between their hashcodes.
Here, we develop three algorithms:
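The comparison step described above can be sketched as follows. This is only an illustration, not code from the algorithms listed here: the function names `hamming_similarity` and `generalized_jaccard` are hypothetical, and it assumes each hashcode is a 1-D sequence with one hashable value per hash function. Weighted MinHash schemes are designed so that the fraction of matching positions approximates the generalized Jaccard similarity of the original weighted sets, which is included for reference.

```python
import numpy as np


def generalized_jaccard(s, t):
    """Generalized Jaccard similarity of two non-negative weight vectors."""
    s = np.asarray(s, dtype=float)
    t = np.asarray(t, dtype=float)
    denom = np.maximum(s, t).sum()
    # Two all-zero vectors are treated as having similarity 0 (an assumption).
    return float(np.minimum(s, t).sum() / denom) if denom > 0 else 0.0


def hamming_similarity(code_a, code_b):
    """Fraction of positions at which two equal-length hashcodes agree.

    Assumes 1-D hashcodes, one entry per hash function.
    """
    code_a = np.asarray(code_a)
    code_b = np.asarray(code_b)
    if code_a.ndim != 1 or code_a.shape != code_b.shape:
        raise ValueError("hashcodes must be 1-D and of equal length")
    return float(np.mean(code_a == code_b))


# Toy hashcodes (made up for illustration, not produced by CCWS/PCWS/I2CWS).
print(hamming_similarity([1, 2, 3, 4], [1, 2, 0, 4]))  # 0.75
```

With real hashcodes, a longer code gives a lower-variance estimate at the cost of more hashing work.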
- CCWS. Wei Wu, Bin Li, Ling Chen, Chengqi Zhang. (2016). Canonical Consistent Weighted Sampling for Real-Value Weighted Min-Hash. Proceedings of the 16th International Conference on Data Mining. 1287-1292.
- PCWS. Wei Wu, Bin Li, Ling Chen, Chengqi Zhang. (2017). Consistent Weighted Sampling Made More Practical. Proceedings of the 26th International World Wide Web Conference. 1035-1043.
- I2CWS. Wei Wu, Bin Li, Ling Chen, Chengqi Zhang, Philip S. Yu. (2019). Improved Consistent Weighted Sampling Revisited. IEEE Transactions on Knowledge and Data Engineering. 31(12):2332-2345.
If you use our algorithms in your research, please cite the following papers in your publications:
@inproceedings{wu2016canonical,
title={{C}anonical {C}onsistent {W}eighted {S}ampling for {R}eal-{V}alue {W}eighted {M}in-{H}ash},
author={Wu, Wei and Li, Bin and Chen, Ling and Zhang, Chengqi},
booktitle={ICDM},
pages={1287--1292},
year={2016}
}
@inproceedings{wu2017consistent,
title={{C}onsistent {W}eighted {S}ampling {M}ade {M}ore {P}ractical},
author={Wu, Wei and Li, Bin and Chen, Ling and Zhang, Chengqi},
booktitle={WWW},
pages={1035--1043},
year={2017}
}
@article{wu2017improved,
title={{I}mproved {C}onsistent {W}eighted {S}ampling {R}evisited},
author={Wu, Wei and Li, Bin and Chen, Ling and Zhang, Chengqi and Yu, Philip S},
journal={IEEE Transactions on Knowledge and Data Engineering},
volume={31},
number={12},
pages={2332--2345},
year={2019}
}
@article{wu2020review,
title={{A} {R}eview for {W}eighted {M}in{H}ash {A}lgorithms},
author={Wu, Wei and Li, Bin and Chen, Ling and Gao, Junbin and Zhang, Chengqi},
journal={IEEE Transactions on Knowledge and Data Engineering},
year={2022},
pages={2553--2573},
volume={34},
number={6}
}