I don't understand what is returned #73

raffaem · 2023-12-06T16:29:58Z

from polyfuzz import PolyFuzz

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

model = PolyFuzz("TF-IDF").match(from_list, to_list)

which returns

>>> model.get_matches()
         From      To    Similarity
0       apple   apple    1.000000
1      apples  apples    1.000000
2        appl   apple    0.783751
3       recal    None    0.000000
4       house   mouse    0.587927
5  similarity    None    0.000000

I don't understand what is returned. If polyfuzz is doing pairwise comparion from from list to to list, it should return len(from)*len(to) rows. So 18 results in this case.

In the example, apple->apples, apple->mouse, apples->apple, apples->mouse and so on are missing.

The text was updated successfully, but these errors were encountered:

MaartenGr · 2023-12-06T18:03:07Z

PolyFuzz extracts the most similar examples from the from_list and attempts to find the best matches. If you want to extract more matches, use the top_n parameter in the .match function.

raffaem · 2023-12-06T18:13:59Z

Can we take a step back?

How does polyfuzz knows what are the "most similar examples from the from_list"?

It must calculate each pairwise distance, so m*n distances, where m=len(from_list) and n=len(to_list), and take the most similar examples, right?

Because the docs say that top_n is "only implemented for polyfuzz.models.TFIDF and polyfuzz.models.Embeddings".

So I don't understand with EditDistance how does it know what are the most similar examples without calculating each pairwise distance.

MaartenGr · 2023-12-06T18:31:09Z

I didn't say that we are not calculating pairwise distances. What is happening here is that after calculating the distances, extracting the best or top n matching values can computationally be expensive. For that reason, there are several options implemented that allow you to select either the best or the top n matches. You can find more about that in the source code here: https://github.com/MaartenGr/PolyFuzz/blob/master/polyfuzz/models/_utils.py

raffaem · 2023-12-06T19:47:03Z

I continue to do NOT understand. Sorting a list of floating point numbers and returning the biggest N ones is less computationally expensive than returning all of them? Dec 6, 2023 7:31:27 PM Maarten Grootendorst ***@***.***>:

…

I didn't say that we are not calculating pairwise distances. What is happening here is that after calculating the distances, extracting the best or top n matching values can computationally be expensive. For that reason, there are several options implemented that allow you to select either the best or the top n matches. You can find more about that in the source code here: https://github.com/MaartenGr/PolyFuzz/blob/master/polyfuzz/models/_utils.py — Reply to this email directly, view it on GitHub[#73 (comment)], or unsubscribe[https://github.com/notifications/unsubscribe-auth/ANBZZ5S5SBNZ4SLM3FMDSMLYIC2XPAVCNFSM6AAAAABAJVAZSKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNBTGQ2TKNBVHA]. You are receiving this because you authored the thread.[Tracking image][https://github.com/notifications/beacon/ANBZZ5RC36VM35S5UC5XU2TYIC2XPA5CNFSM6AAAAABAJVAZSKWGG33NNVSW45C7OR4XAZNMJFZXG5LFINXW23LFNZ2KUY3PNVWWK3TUL5UWJTTN4DS6E.gif]

MaartenGr · 2023-12-06T19:59:20Z

I think the source code above would be a good start. Notice that there are several ways to calculate pairwise distances in that method. Doing so for a m x n matrix can be computationally difficult. Instead, getting only the best can be more feasible depending on the method you use. The link above might help you understand.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

I don't understand what is returned #73

I don't understand what is returned #73

raffaem commented Dec 6, 2023

MaartenGr commented Dec 6, 2023

raffaem commented Dec 6, 2023

MaartenGr commented Dec 6, 2023

raffaem commented Dec 6, 2023 via email

MaartenGr commented Dec 6, 2023

I don't understand what is returned #73

I don't understand what is returned #73

Comments

raffaem commented Dec 6, 2023

MaartenGr commented Dec 6, 2023

raffaem commented Dec 6, 2023

MaartenGr commented Dec 6, 2023

raffaem commented Dec 6, 2023 via email

MaartenGr commented Dec 6, 2023