I don't understand what is returned #73

Open
raffaem opened this issue Dec 6, 2023 · 5 comments

Comments

@raffaem
Contributor

raffaem commented Dec 6, 2023

The docs provide the following example:

from polyfuzz import PolyFuzz

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

model = PolyFuzz("TF-IDF").match(from_list, to_list)

which returns

>>> model.get_matches()
         From      To    Similarity
0       apple   apple    1.000000
1      apples  apples    1.000000
2        appl   apple    0.783751
3       recal    None    0.000000
4       house   mouse    0.587927
5  similarity    None    0.000000

I don't understand what is returned. If PolyFuzz is doing a pairwise comparison from from_list to to_list, it should return len(from_list) * len(to_list) rows, so 18 results in this case.

In the example, apple->apples, apple->mouse, apples->apple, apples->mouse and so on are missing.

@MaartenGr
Owner

For each string in from_list, PolyFuzz keeps only its best match in to_list, which is why the output has one row per from_list entry. If you want more matches per string, use the top_n parameter in the .match function.
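
To make the row count concrete, here is a minimal sketch of that reduction. It is not PolyFuzz's implementation: the scorer is Python's difflib ratio, used as a stand-in assumption for TF-IDF, so the scores differ from the table above and the None rows are not reproduced. It only shows how 18 pairwise scores collapse to 6 returned rows.

```python
from difflib import SequenceMatcher
from itertools import product

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]


def similarity(a: str, b: str) -> float:
    # Stand-in scorer (an assumption): difflib's ratio, not PolyFuzz's TF-IDF.
    return SequenceMatcher(None, a, b).ratio()


# Step 1: score every (from, to) pair, giving len(from_list) * len(to_list) rows.
all_pairs = [(f, t, similarity(f, t)) for f, t in product(from_list, to_list)]

# Step 2: keep only the highest-scoring row for each "from" string.
best = {}
for f, t, score in all_pairs:
    if f not in best or score > best[f][1]:
        best[f] = (t, score)

print(len(all_pairs), len(best))  # prints "18 6"
for f, (t, score) in best.items():
    print(f"{f:>10} -> {t:<6} {score:.3f}")
```

The 18 comparisons still happen in step 1; the pairs reported as missing (apple->apples, apple->mouse, and so on) are simply dropped in step 2 because they are not the best match for their from string.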

@raffaem
Contributor Author

raffaem commented Dec 6, 2023

Can we take a step back?

How does PolyFuzz know which are the "most similar examples from the from_list"?

It must calculate each pairwise distance, so m*n distances, where m=len(from_list) and n=len(to_list), and take the most similar examples, right?

Because the docs say that top_n is "only implemented for polyfuzz.models.TFIDF and polyfuzz.models.Embeddings".

So I don't understand how EditDistance knows which are the most similar examples without calculating each pairwise distance.

@MaartenGr
Owner

I didn't say that we are not calculating pairwise distances. After the distances are calculated, extracting the best or top n matching values can be computationally expensive. For that reason, several options are implemented that let you select either the best or the top n matches. You can find more about that in the source code here: https://github.com/MaartenGr/PolyFuzz/blob/master/polyfuzz/models/_utils.py
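
As a rough illustration of selecting the top n rather than only the best, the sketch below uses heapq.nlargest from the standard library. The function name top_n_matches and the difflib scorer are assumptions made for this example; the linked _utils.py may do this differently.

```python
import heapq
from difflib import SequenceMatcher


def top_n_matches(query, candidates, n=2):
    """Return the n best (candidate, score) pairs for one query string.

    Hypothetical helper for illustration; not part of PolyFuzz.
    """
    # Stand-in scorer (an assumption), not one of PolyFuzz's models.
    scored = ((c, SequenceMatcher(None, query, c).ratio()) for c in candidates)
    # nlargest tracks only the n best items instead of sorting every score.
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


to_list = ["apple", "apples", "mouse"]
for word in ["appl", "house"]:
    print(word, top_n_matches(word, to_list, n=2))
```

Every candidate is still scored once; the saving is in the selection step, which keeps n items per query instead of ordering the full row.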

@raffaem
Contributor Author

raffaem commented Dec 6, 2023 via email

@MaartenGr
Owner

I think the source code above would be a good start. Notice that there are several ways to calculate pairwise distances in that method. Doing so for an m x n matrix can be computationally expensive. Getting only the best match can be more feasible, depending on the method you use. The link above might help you understand.
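
One way to picture "getting only the best" is to process one from string at a time and keep a running maximum, so the m x n matrix is never stored. The sketch below assumes a difflib scorer and a hypothetical min_similarity cutoff of 0.5 that reports None when nothing scores high enough. It is an illustration under those assumptions, not the code in _utils.py.

```python
from difflib import SequenceMatcher


def best_matches(from_list, to_list, min_similarity=0.5):
    """Yield (from, to, score) one row at a time without storing the m x n matrix."""
    for f in from_list:
        best_to, best_score = None, 0.0
        for t in to_list:
            # Stand-in scorer (an assumption), not PolyFuzz's own.
            score = SequenceMatcher(None, f, t).ratio()
            if score > best_score:
                best_to, best_score = t, score
        # Hypothetical cutoff: report no match when the best score is too low.
        if best_score < min_similarity:
            best_to, best_score = None, 0.0
        yield f, best_to, best_score


from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]
rows = list(best_matches(from_list, to_list))
for row in rows:
    print(row)
```

Memory use stays proportional to the number of from strings, while the number of comparisons is still m * n.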
