Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization Ideas For MinHash #17

Open
keithing opened this issue Oct 8, 2022 · 2 comments
Open

Optimization Ideas For MinHash #17

keithing opened this issue Oct 8, 2022 · 2 comments

Comments

@keithing
Copy link

keithing commented Oct 8, 2022

Noticed your comment on Hacker News about this repo. I worked on something similar at a previous job (so I don't have the code the share), but looked pretty deeply into optimizing minhash. Not sure if it's helpful or if you're already aware of these tricks, but I found them to significantly speed up my minhash algorithm (and they are fun to code :P).

  1. One Permutation Hashing. Basically use a single hash function in clever way instead of the typical 128. Made a pretty big different in speed during my testing.
  2. Densified One Permutation Hashing. A small tweak for handling the "empty bin" problem of one permutation hashing.
  3. Reservoir Sampling. A trick for using less memory and avoiding dynamic allocations. This last paper also has a summary of the first two.

Caveat: Seems like you're working on Super Min Hash, which I was never able to fully code myself. I don't know how it compares to the one-hash approach.

@serega
Copy link
Owner

serega commented Oct 10, 2022

Thanks for your suggestions. I may have encountered some of these ideas in papers, but never took a deep look. Currently, I use MinHash for relatively short text documents (usually 50 - 200 characters), and generating min-hash signatures takes a small fraction of the whole pipeline. For bigger documents hashing can become a bottleneck.
Memory usage is definitely an issue for me, which I mitigate by using smaller hashes (thanks to Rust generics this is easy). I should definitely look at Reservoir Sampling and One Permutation Hashing, (especially if they are fun to code :) )

@jianshu93
Copy link

Hello All,

Just a followup, the most recently densification is called optimal densification: two ways to perform the densification are optimal (RMSE is the same with the original, much slower, many hash function implementation): http://proceedings.mlr.press/v70/shrivastava17a.html or http://proceedings.mlr.press/v115/mai20a.html. Implementation here: https://github.com/jean-pierreBoth/probminhash/blob/master/src/densminhash.rs

@serega I have some interesting idea to combine minhash with graphs to do very large scale search. Let me know if you are interested.

Jianshu

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants