-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
self implemented distance metric #200
Comments
You can provide a custom metric if you wish as long as it can be jit compiled by numba. Just specify the function as the metric instead of a metric name as a string. So for example you might have @numba.njit()
def custom_distance_function(a, b):
...
index = NNDescent(metric=custom_distance_function) The biggest hurdle here is that numba supports a (relatively large) subset of python, but can't see inside other functions, so you will need your own implementation of the distance metric, you can't just call another library (unless they have jit compiled code, or it is a C library and you use CFFI in numba). Note that a custom metric may not be as fast as many of the pre-provided metrics since the latter have had a lot of optimization work go into them. Lastly, if you are interested in using minhash, perhaps you could consider, instead, using jaccard on a sparse matrix representation of your data -- since minhash is really just an approximation of jaccard anyway, why not use the real thing? As long as you convert the data to |
hello Leland, Thanks for the suggestion on MinHash, the problem for me is, the matrix representation of my data (dimension) is huge, 10^6 or above (and there are 10^6 of such data vector ), so to have fast approximation of Jaccard index, there are no choice for me right now but MinHash approximation. I understand the jit complied part and the MinHash software I am relying on is actually a C++ library. So it may be difficult to incorporate it or supported by jit since I am not family with JIT at all. I am thinking kgraph C++ library here(https://github.com/aaalgo/kgraph) since Jaccard is a metric. Thanks for your help. Jianshu |
The ability to compute jaccard using sparse structure is dependent on the number of non-zeros per row, not on the actual dimension. So 10^6 dimensions might be fine if most of the data are zeros. This is very often the case for data with that many dimensions. If you still need to use minhash then as long as you can import the function via ctypes then you should be able to use it within a numba jitted function -- you'll just have to be careful with what you pass it. |
Dear Pynndescent team,
I am wondering whether there could be an option for this library to provide an API to allow external distance metric so that some unique distance metric can be used., e.g., for biological applications, distance computation was actually based on MinHash of genomic profiles, none of the provided metric is fast in this case.
Thanks,
Jianshu
The text was updated successfully, but these errors were encountered: