v0.4.0 (#35)
* Add spacy, use, sbert, gensim
* Add fit, transform, and fit_transform
* Add options to save and load model
MaartenGr authored May 7, 2022
1 parent 241d7d3 commit 9d9754d
Showing 27 changed files with 1,232 additions and 79 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/testing.yml
@@ -25,6 +25,6 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -e ".[dev]"
pip install -e ".[dev, sbert]"
- name: Run Checking Mechanisms
run: make check
3 changes: 3 additions & 0 deletions .gitignore
@@ -75,3 +75,6 @@ venv.bak/

.idea
.idea/

# For quick testing
/Untitled.ipynb
61 changes: 47 additions & 14 deletions README.md
@@ -22,24 +22,21 @@ You can install **`PolyFuzz`** via pip:
pip install polyfuzz
```

You may want to install additional dependencies depending on the transformers and language backends that you will be using. The possible installations are:

```bash
pip install polyfuzz[sbert]
pip install polyfuzz[flair]
pip install polyfuzz[gensim]
pip install polyfuzz[spacy]
pip install polyfuzz[use]
```

If you want to speed up the cosine similarity comparison and decrease memory usage when using embedding models,
you can use `sparse_dot_topn`, which is installed via:

```bash
pip install polyfuzz[fast]
```

@@ -103,6 +100,42 @@ The resulting matches can be accessed through `model.get_matches()`:
**NOTE 2**: When instantiating `PolyFuzz` we could also have used "EditDistance" or "Embeddings" to quickly
access Levenshtein and FastText (English) models, respectively.
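
Both shortcuts map to ready-made models; a minimal sketch (the "Embeddings" option assumes the additional `flair` dependency is installed):

```python
from polyfuzz import PolyFuzz

edit_model = PolyFuzz("EditDistance")     # Levenshtein edit distance via RapidFuzz
embedding_model = PolyFuzz("Embeddings")  # FastText (English) embeddings
```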

### Production
The `.match` function allows you to quickly extract similar strings. However, after selecting the right models to be used, you may want to use PolyFuzz
in production to match incoming strings. To do so, we can make use of the familiar `fit`, `transform`, and `fit_transform` functions.

Let's say that we have a list of words that we know to be correct, called `train_words`. We want any incoming word to be mapped to one of the words in `train_words`.
In other words, we `fit` on `train_words` and we use `transform` on any incoming words:

```python
from polyfuzz import PolyFuzz

train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
unseen_words = ["apple", "apples", "mouse"]

# Fit
model = PolyFuzz("TF-IDF")
model.fit(train_words)

# Transform
results = model.transform(unseen_words)
```

In the above example, we use `fit` on `train_words` to calculate the TF-IDF representations of those words, which are stored for later use.
This speeds up `transform` considerably, since those representations do not need to be recomputed for each incoming batch of words.
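
If you do not need separate steps, `fit_transform` should combine both in a single call; a minimal sketch:

```python
# Fit on `train_words` and match those same words in one step,
# equivalent to calling `fit` followed by `transform`
results = PolyFuzz("TF-IDF").fit_transform(train_words)
```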

Then, we can save and load the fitted model as follows for use in production:

```python
# Save the model
model.save("my_model")

# Load the model
loaded_model = PolyFuzz.load("my_model")
```
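
Since the fitted TF-IDF representations are stored with the model, the loaded model should be able to match incoming strings directly; a quick sketch:

```python
# Match new, unseen words against the fitted `train_words`
results = loaded_model.transform(["houze", "aple"])
```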

### Group Matches
We can group the matches `To` as there might be significant overlap in strings in our `to_list`.
To do this, we calculate the similarity within strings in `to_list` and use `single linkage` to then
@@ -214,7 +247,7 @@ from polyfuzz.models import BaseMatcher


class MyModel(BaseMatcher):
def match(self, from_list, to_list):
def match(self, from_list, to_list, **kwargs):
# Calculate distances
matches = [[fuzz.ratio(from_string, to_string) / 100 for to_string in to_list]
for from_string in from_list]
3 changes: 3 additions & 0 deletions docs/api/models/gensim.md
@@ -0,0 +1,3 @@
# `polyfuzz.models.GensimEmbeddings`

::: polyfuzz.models.GensimEmbeddings
3 changes: 3 additions & 0 deletions docs/api/models/sbert.md
@@ -0,0 +1,3 @@
# `polyfuzz.models.SentenceEmbeddings`

::: polyfuzz.models.SentenceEmbeddings
3 changes: 3 additions & 0 deletions docs/api/models/spacy.md
@@ -0,0 +1,3 @@
# `polyfuzz.models.SpacyEmbeddings`

::: polyfuzz.models.SpacyEmbeddings
3 changes: 3 additions & 0 deletions docs/api/models/use.md
@@ -0,0 +1,3 @@
# `polyfuzz.models.USEEmbeddings`

::: polyfuzz.models.USEEmbeddings
18 changes: 1 addition & 17 deletions docs/index.md
@@ -8,20 +8,4 @@ Currently, methods include Levenshtein distance with RapidFuzz, a character-based
techniques such as FastText and GloVe, and 🤗 transformers embeddings.

The philosophy of PolyFuzz is: `Easy to use yet highly customizable`. It is a string matcher tool that requires only
a few lines of code but that allows you to customize and create your own models.
92 changes: 84 additions & 8 deletions docs/releases.md
@@ -1,8 +1,83 @@
## **v0.4.0**


* Added new models (SentenceTransformers, Gensim, USE, Spacy)
* Added `.fit`, `.transform`, and `.fit_transform` methods
* Added `.save` and `PolyFuzz.load()`


**SentenceTransformers**
```python
from polyfuzz import PolyFuzz
from polyfuzz.models import SentenceEmbeddings

distance_model = SentenceEmbeddings("all-MiniLM-L6-v2")
model = PolyFuzz(distance_model)
```

**Gensim**
```python
from polyfuzz.models import GensimEmbeddings
distance_model = GensimEmbeddings("glove-twitter-25")
model = PolyFuzz(distance_model)
```

**USE**
```python
from polyfuzz.models import USEEmbeddings
distance_model = USEEmbeddings("https://tfhub.dev/google/universal-sentence-encoder/4")
model = PolyFuzz(distance_model)
```

**Spacy**
```python
from polyfuzz.models import SpacyEmbeddings
distance_model = SpacyEmbeddings("en_core_web_md")
model = PolyFuzz(distance_model)
```
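
All of these distance models plug into the same interface, so they should also work with the new `fit` and `transform` methods; a minimal sketch using the SentenceTransformers backend:

```python
from polyfuzz import PolyFuzz
from polyfuzz.models import SentenceEmbeddings

train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
unseen_words = ["apple", "apples", "mouse"]

distance_model = SentenceEmbeddings("all-MiniLM-L6-v2")
model = PolyFuzz(distance_model).fit(train_words)
results = model.transform(unseen_words)
```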


**fit, transform, fit_transform**

Added `fit`, `transform`, and `fit_transform` in order to use PolyFuzz in production [#34](https://github.com/MaartenGr/PolyFuzz/issues/34)

```python
from polyfuzz import PolyFuzz

train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
unseen_words = ["apple", "apples", "mouse"]

# Fit
model = PolyFuzz("TF-IDF")
model.fit(train_words)

# Transform
results = model.transform(unseen_words)
```

In the code above, we fit our TF-IDF model on `train_words` and use `.transform()` to match the words in `unseen_words` to the words that we trained on.

After fitting our model, we can save it as follows:

```python
model.save("my_model")
```

Then, we can load our model to be used elsewhere:

```python
from polyfuzz import PolyFuzz

model = PolyFuzz.load("my_model")
```


## **v0.3.4**

- Make sure that when you use two lists that are exactly the same, it will return 1 for identical terms:

```python
from polyfuzz import PolyFuzz

from_list = ["apple", "house"]
model = PolyFuzz("TF-IDF")
model.match(from_list, from_list)
@@ -14,6 +89,7 @@ mapping to itself:

```python
from polyfuzz import PolyFuzz

from_list = ["apple", "apples"]
model = PolyFuzz("TF-IDF")
model.match(from_list)
@@ -22,32 +98,32 @@ model.match(from_list)
In the example above, `apple` will be mapped to `apples` and not to `apple`. Here, we assume that the user wants to
find the most similar words within a list without mapping to itself.

## **v0.3.3**
- Update numpy to `numpy>=1.20.0` to prevent [this issue](https://github.com/MaartenGr/PolyFuzz/issues/23) and [this one](https://github.com/MaartenGr/PolyFuzz/issues/21)
- Update pytorch to `torch>=1.4.0,<1.7.1` to prevent the `save_state_warning` error

## **v0.3.2**
- Fix exploding memory usage when using `top_n`

## **v0.3.0**
- Use `top_n` in `polyfuzz.models.TFIDF` and `polyfuzz.models.Embeddings`

## **v0.2.2**
- Update grouping to include all strings only if identical lists of strings are compared

## **v0.2.0**
- Update naming convention matcher --> model
- Update documentation
- Add basic models to grouper
- Fix issues with vector order in cosine similarity
- Update naming of cosine similarity function

## **v0.1.0**
- Additional tests
- More thorough documentation
- Prepare for public release

## **v0.0.1**
- First release of `PolyFuzz`
- Matching through:
- Edit Distance
67 changes: 66 additions & 1 deletion docs/tutorial/basematcher/basematcher.md
@@ -9,6 +9,7 @@ You simply create a class using `BaseMatcher`, make sure it has a function `matc
two lists and outputs a pandas dataframe. That's it!

We start by creating our own model that implements the ratio similarity measure from RapidFuzz:

```python
import numpy as np
import pandas as pd
@@ -19,7 +20,7 @@ from polyfuzz.models import BaseMatcher


class MyModel(BaseMatcher):
def match(self, from_list, to_list):
def match(self, from_list, to_list, **kwargs):
# Calculate distances
matches = [[fuzz.ratio(from_string, to_string) / 100
for to_string in to_list] for from_string in from_list]
@@ -53,3 +54,67 @@ model.visualize_precision_recall(kde=True)
```

![](custom_model.png)


## fit, transform, fit_transform

Although the above model can be used in production through `fit`, it does not track any state between `fit` and `transform`.
That is not necessary here, since edit distances need to be recalculated for every new comparison. If you use embeddings
that you do not want to recompute, however, it helps to carry state over from `fit` to `transform`. To do so, we can use
the `re_train` parameter to define what happens when we re-train a model (for example, when using `fit`) and what happens
when we do not (for example, when using `transform`).

In the example below, when `re_train=True` we calculate the embeddings from the `from_list` and, if it is defined, the `to_list`,
and save the result to the `self.embeddings_to` variable. When we set `re_train=False`, we avoid redoing that work by leveraging
the pre-calculated `self.embeddings_to` variable instead.

```python
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

from polyfuzz.models import BaseMatcher
from polyfuzz.models._utils import cosine_similarity


class SentenceEmbeddings(BaseMatcher):
    def __init__(self, model_id):
        super().__init__(model_id)
        self.type = "Embeddings"

        self.embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
        self.embeddings_to = None

    def match(self, from_list, to_list=None, re_train=True) -> pd.DataFrame:
        # Extract embeddings from the `from_list`
        embeddings_from = self.embedding_model.encode(from_list, show_progress_bar=False)

        # Re-use the embeddings saved during `fit` unless we are re-training
        if not re_train and isinstance(self.embeddings_to, np.ndarray):
            embeddings_to = self.embeddings_to
        elif to_list is None:
            embeddings_to = self.embedding_model.encode(from_list, show_progress_bar=False)
        else:
            embeddings_to = self.embedding_model.encode(to_list, show_progress_bar=False)

        # Extract matches through cosine similarity
        matches = cosine_similarity(embeddings_from, embeddings_to, from_list, to_list)

        # Save the embeddings so that `transform` can re-use them
        self.embeddings_to = embeddings_to

        return matches
```

Then, we can use it as follows:

```python
from polyfuzz import PolyFuzz

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

custom_matcher = SentenceEmbeddings("SBERT")

model = PolyFuzz(custom_matcher).fit(from_list)
```

By using the `.fit` function, embeddings are created from the `from_list` variable and saved. Then, when we
run `model.transform(to_list)`, the embeddings created from the `from_list` variable do not need to be recalculated.
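
A short sketch of that flow:

```python
# `fit` already computed and stored the `from_list` embeddings;
# `transform` re-uses them and only embeds the words in `to_list`
results = model.transform(to_list)
```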
