v0.3
You can now specify the top_n matches for each string. This option returns, for each input string, the top_n candidates that best match it rather than only the single best one. It is implemented in polyfuzz.models.TFIDF and polyfuzz.models.Embeddings, since computing several matches per string is computationally quite heavy and these models are best suited for making those calculations.
Usage:
from polyfuzz import PolyFuzz
from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]
model = PolyFuzz("TF-IDF")
model.match(from_list, to_list, top_n=3)
Or use it in custom models:
from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF, Embeddings
from flair.embeddings import TransformerWordEmbeddings
embeddings = TransformerWordEmbeddings('bert-base-multilingual-cased')
bert = Embeddings(embeddings, min_similarity=0, model_id="BERT", top_n=3)
tfidf = TFIDF(min_similarity=0, top_n=3)
string_models = [bert, tfidf]
model = PolyFuzz(string_models)
model.match(from_list, to_list)
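
To make the idea behind top_n concrete, here is a minimal standalone sketch of selecting the n best candidates per input string. It is not PolyFuzz's implementation: the helper name top_n_matches is hypothetical, and it uses the standard library's difflib.SequenceMatcher ratio as a stand-in similarity instead of TF-IDF or embedding similarity.

```python
from difflib import SequenceMatcher


def top_n_matches(from_list, to_list, top_n=3):
    """Map each string in from_list to its top_n most similar strings in to_list.

    Hypothetical helper for illustration only. Similarity is difflib's ratio,
    an assumption made here; PolyFuzz's TFIDF and Embeddings models score
    matches differently.
    """
    results = {}
    for source in from_list:
        # Score the source string against every candidate in to_list.
        scored = [
            (target, SequenceMatcher(None, source, target).ratio())
            for target in to_list
        ]
        # Highest similarity first, then keep only the best top_n candidates.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        results[source] = scored[:top_n]
    return results


from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]
matches = top_n_matches(from_list, to_list, top_n=2)
```

Each entry in matches holds at most top_n (candidate, score) pairs, ordered from most to least similar. Scoring every input against every candidate is what makes larger top_n requests expensive on long lists.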