I don't understand what is returned #73
Comments
PolyFuzz extracts the most similar examples from the |
Can we take a step back? How does PolyFuzz know what are the "most similar examples from the It must calculate each pairwise distance, so m*n distances, where m=len(from_list) and n=len(to_list), and take the most similar examples, right? Because the docs say that So I don't understand with |
I didn't say that we are not calculating pairwise distances. What happens is that, after the distances are calculated, extracting the best or top n matching values can be computationally expensive. For that reason, several options are implemented that let you select either the best match or the top n matches. You can find more about that in the source code here: https://github.com/MaartenGr/PolyFuzz/blob/master/polyfuzz/models/_utils.py |
I still do NOT understand.
Sorting a list of floating point numbers and returning the biggest N ones is less computationally expensive than returning all of them?
Dec 6, 2023 7:31:27 PM Maarten Grootendorst ***@***.***>:
…
I didn't say that we are not calculating pairwise distances. What is happening here is that after calculating the distances, extracting the best or top n matching values can computationally be expensive. For that reason, there are several options implemented that allow you to select either the best or the top n matches. You can find more about that in the source code here: https://github.com/MaartenGr/PolyFuzz/blob/master/polyfuzz/models/_utils.py
|
I think the source code above would be a good start. Notice that there are several ways to calculate pairwise distances in that method. Doing so for a full m x n matrix can be computationally demanding. Getting only the best match can be more feasible, depending on the method you use. The link above might help you understand. |
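To make the "best or top n" idea concrete, here is a minimal sketch in plain Python. It is not PolyFuzz's implementation: `difflib.SequenceMatcher` stands in for whatever similarity measure is configured, and the names `similarity` and `top_n_matches` are hypothetical helpers invented for this illustration. It scores every candidate for one entry, but only keeps the n highest scores instead of storing the whole m x n table.

```python
import heapq
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]; an assumption, not PolyFuzz's metric."""
    return SequenceMatcher(None, a, b).ratio()


def top_n_matches(from_list, to_list, n=1):
    """For each entry in from_list, keep only its n best-scoring entries of to_list."""
    results = {}
    for source in from_list:
        # Lazily score each candidate; nothing is stored beyond the current top n.
        scored = ((similarity(source, target), target) for target in to_list)
        # nlargest returns (score, target) pairs, highest score first.
        results[source] = heapq.nlargest(n, scored)
    return results


if __name__ == "__main__":
    matches = top_n_matches(["apple", "appl"], ["apple", "apples", "mouse"], n=2)
    for source, best in matches.items():
        print(source, best)
```

Every pair is still compared, but the output has at most n rows per entry of the from list rather than one row per pair.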
The docs provide the following example:
which returns
I don't understand what is returned. If PolyFuzz is doing a pairwise comparison from the from list to the to list, it should return len(from)*len(to) rows, so 18 results in this case. In the example, apple->apples, apple->mouse, apples->apple, apples->mouse and so on are missing.
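The difference between the two row counts can be sketched as follows. This is only an illustration under stated assumptions: the two lists are made up to give 6 x 3 = 18 pairs and are not necessarily the ones from the docs, `difflib.SequenceMatcher` is a stand-in scorer, and `all_pairs` and `best_matches` are hypothetical helpers rather than PolyFuzz functions.

```python
from difflib import SequenceMatcher

# Illustrative lists (assumed for this sketch): 6 entries to match against 3 candidates.
from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]


def score(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def all_pairs(sources, targets):
    """One row per (source, target) pair: len(sources) * len(targets) rows."""
    return [(s, t, score(s, t)) for s in sources for t in targets]


def best_matches(sources, targets):
    """One row per source: only its single highest-scoring target is kept."""
    rows = []
    for s in sources:
        best = max(targets, key=lambda t: score(s, t))
        rows.append((s, best, score(s, best)))
    return rows


if __name__ == "__main__":
    print(len(all_pairs(from_list, to_list)))     # 18 rows, one per pair
    print(len(best_matches(from_list, to_list)))  # 6 rows, one per from entry
    for row in best_matches(from_list, to_list):
        print(row)
```

In this sketch the pairs apple->apples and apple->mouse are scored but then discarded, because apple->apple has the highest score for that entry.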