Noticed your comment on Hacker News about this repo. I worked on something similar at a previous job (so I don't have the code to share), but I looked pretty deeply into optimizing minhash. Not sure if it's helpful or if you're already aware of these tricks, but I found them to significantly speed up my minhash algorithm (and they are fun to code :P).
One Permutation Hashing. Basically, use a single hash function in a clever way instead of the typical 128. It made a pretty big difference in speed during my testing.
Reservoir Sampling. A trick for using less memory and avoiding dynamic allocations. This last paper also has a summary of the first two.
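For anyone who hasn't seen reservoir sampling before, here is a rough sketch of the plain textbook version (Algorithm R), not the specific construction from the paper. It keeps a uniform sample of up to `K` items from a stream of unknown length in a fixed-size array, which is where the "less memory, no dynamic allocations" part comes from. The names `SplitMix64` and `reservoir_sample` are made up for this example, and the tiny generator is only there so the snippet needs no external crates.

```rust
/// Tiny SplitMix64 generator so the sketch needs no external crates
/// (an assumption for illustration; real code would likely use a proper RNG crate).
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Algorithm R: keep a uniform random sample of up to K items from a stream
/// of unknown length, using a fixed-size array and no heap allocation.
/// Returns the reservoir and the number of slots that are actually filled.
fn reservoir_sample<const K: usize, I>(stream: I, rng: &mut SplitMix64) -> ([u64; K], usize)
where
    I: Iterator<Item = u64>,
{
    let mut reservoir = [0u64; K];
    let mut seen = 0usize;
    for item in stream {
        if seen < K {
            // The first K items fill the reservoir directly.
            reservoir[seen] = item;
        } else {
            // Pick j in 0..=seen; the new item replaces slot j only if j
            // lands inside the reservoir (probability K / (seen + 1)).
            // The modulo introduces a negligible bias for a 64-bit generator.
            let j = (rng.next_u64() % (seen as u64 + 1)) as usize;
            if j < K {
                reservoir[j] = item;
            }
        }
        seen += 1;
    }
    (reservoir, seen.min(K))
}
```

Calling it looks like `reservoir_sample::<4, _>(0..100u64, &mut SplitMix64(42))`; only the first `len` slots of the returned array are meaningful when the stream is shorter than `K`.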
Caveat: Seems like you're working on Super Min Hash, which I was never able to fully code myself. I don't know how it compares to the one-hash approach.
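For reference, a minimal sketch of the one-hash idea, assuming the standard library's `DefaultHasher` as the single hash function and splitting each hash into a bin index (`h % bins`) and a within-bin value (`h / bins`). The function names and the `EMPTY` marker are invented for this example and are not from the repo. It leaves empty bins as they are and skips jointly empty bins in the estimate; there is no densification step.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Marker for a bin that received no element (hypothetical convention).
const EMPTY: u64 = u64::MAX;

/// Hash one item with the standard library's default hasher
/// (an assumption; any well-mixed 64-bit hash would do).
fn hash64<T: Hash>(item: &T) -> u64 {
    let mut h = DefaultHasher::new();
    item.hash(&mut h);
    h.finish()
}

/// One-permutation signature: each item is hashed once, the hash selects
/// a bin, and the bin keeps the smallest within-bin value it has seen.
fn oph_signature<T: Hash>(items: &[T], bins: usize) -> Vec<u64> {
    assert!(bins > 0, "need at least one bin");
    let mut sig = vec![EMPTY; bins];
    for item in items {
        let h = hash64(item);
        let bin = (h % bins as u64) as usize;
        let value = h / bins as u64;
        if value < sig[bin] {
            sig[bin] = value;
        }
    }
    sig
}

/// Jaccard estimate: matching bins divided by the bins that are
/// non-empty in at least one of the two signatures.
fn oph_similarity(a: &[u64], b: &[u64]) -> f64 {
    let mut matched = 0usize;
    let mut used = 0usize;
    for (x, y) in a.iter().zip(b.iter()) {
        if *x == EMPTY && *y == EMPTY {
            continue; // jointly empty bins carry no information
        }
        used += 1;
        if x == y {
            matched += 1;
        }
    }
    if used == 0 {
        0.0
    } else {
        matched as f64 / used as f64
    }
}
```

The cost is one hash per item instead of one per item per signature slot, which is where the speedup comes from. For very short inputs many bins stay empty, so the estimate gets noisy unless the bin count is kept small or a densification scheme is added.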
Thanks for your suggestions. I may have encountered some of these ideas in papers, but never took a deep look. Currently, I use MinHash for relatively short text documents (usually 50 - 200 characters), and generating min-hash signatures takes a small fraction of the whole pipeline. For bigger documents hashing can become a bottleneck.
Memory usage is definitely an issue for me, which I mitigate by using smaller hashes (thanks to Rust generics this is easy). I should definitely look at Reservoir Sampling and One Permutation Hashing (especially if they are fun to code :) ).