v0.4.0 (#35)
* Add spacy, use, sbert, gensim
* Add fit, transform, and fit_transform
* Add options to save and load model
MaartenGr authored May 7, 2022
1 parent 241d7d3 commit 9d9754d
Showing 27 changed files with 1,232 additions and 79 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/testing.yml
@@ -25,6 +25,6 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -e ".[dev]"
pip install -e ".[dev, sbert]"
- name: Run Checking Mechanisms
run: make check
3 changes: 3 additions & 0 deletions .gitignore
@@ -75,3 +75,6 @@ venv.bak/

.idea
.idea/

# For quick testing
/Untitled.ipynb
61 changes: 47 additions & 14 deletions README.md
@@ -22,24 +22,21 @@ You can install **`PolyFuzz`** via pip:
pip install polyfuzz
```

You may want to install additional dependencies depending on the transformers and language backends that you will be using. The possible installations are:

```bash
pip install polyfuzz[sbert]
pip install polyfuzz[flair]
pip install polyfuzz[gensim]
pip install polyfuzz[spacy]
pip install polyfuzz[use]
```

If you want to speed up the cosine similarity comparison and decrease memory usage when using embedding models,
you can use `sparse_dot_topn`, which is installed via:

```bash
pip install polyfuzz[fast]
```

@@ -103,6 +100,42 @@ The resulting matches can be accessed through `model.get_matches()`:
**NOTE 2**: When instantiating `PolyFuzz` we could also have used "EditDistance" or "Embeddings" to quickly
access Levenshtein and FastText (English) models, respectively.
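
Both shortcuts map to ready-made models; a minimal sketch (the "Embeddings" option assumes the additional `flair` dependency is installed):

```python
from polyfuzz import PolyFuzz

edit_model = PolyFuzz("EditDistance")     # Levenshtein edit distance via RapidFuzz
embedding_model = PolyFuzz("Embeddings")  # FastText (English) embeddings
```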

### Production
The `.match` function allows you to quickly extract similar strings. However, after selecting the right models to be used, you may want to use PolyFuzz
in production to match incoming strings. To do so, we can make use of the familiar `fit`, `transform`, and `fit_transform` functions.

Let's say that we have a list of words that we know to be correct, called `train_words`. We want any incoming word to be mapped to one of the words in `train_words`.
In other words, we `fit` on `train_words` and we use `transform` on any incoming words:

```python
from polyfuzz import PolyFuzz

train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
unseen_words = ["apple", "apples", "mouse"]

# Fit
model = PolyFuzz("TF-IDF")
model.fit(train_words)

# Transform
results = model.transform(unseen_words)
```

In the above example, we use `fit` on `train_words` to calculate the TF-IDF representations of those words, which are stored for later use.
This speeds up `transform` considerably, since those representations do not need to be recomputed for each incoming batch of words.
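
If you do not need separate steps, `fit_transform` should combine both in a single call; a minimal sketch:

```python
# Fit on `train_words` and match those same words in one step,
# equivalent to calling `fit` followed by `transform`
results = PolyFuzz("TF-IDF").fit_transform(train_words)
```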

Then, we can save and load the fitted model as follows for use in production:

```python
# Save the model
model.save("my_model")

# Load the model
loaded_model = PolyFuzz.load("my_model")
```
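
Since the fitted TF-IDF representations are stored with the model, the loaded model should be able to match incoming strings directly; a quick sketch:

```python
# Match new, unseen words against the fitted `train_words`
results = loaded_model.transform(["houze", "aple"])
```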

### Group Matches
We can group the matches `To` as there might be significant overlap in strings in our `to_list`.
To do this, we calculate the similarity within strings in `to_list` and use `single linkage` to then
@@ -214,7 +247,7 @@ from polyfuzz.models import BaseMatcher


class MyModel(BaseMatcher):
def match(self, from_list, to_list):
def match(self, from_list, to_list, **kwargs):
# Calculate distances
matches = [[fuzz.ratio(from_string, to_string) / 100 for to_string in to_list]
for from_string in from_list]
3 changes: 3 additions & 0 deletions docs/api/models/gensim.md
@@ -0,0 +1,3 @@
# `polyfuzz.models.GensimEmbeddings`

::: polyfuzz.models.GensimEmbeddings
3 changes: 3 additions & 0 deletions docs/api/models/sbert.md
@@ -0,0 +1,3 @@
# `polyfuzz.models.SentenceEmbeddings`

::: polyfuzz.models.SentenceEmbeddings
3 changes: 3 additions & 0 deletions docs/api/models/spacy.md
@@ -0,0 +1,3 @@
# `polyfuzz.models.SpacyEmbeddings`

::: polyfuzz.models.SpacyEmbeddings
3 changes: 3 additions & 0 deletions docs/api/models/use.md
@@ -0,0 +1,3 @@
# `polyfuzz.models.USEEmbeddings`

::: polyfuzz.models.USEEmbeddings
18 changes: 1 addition & 17 deletions docs/index.md
@@ -8,20 +8,4 @@ Currently, methods include Levenshtein distance with RapidFuzz, a character-based
techniques such as FastText and GloVe, and 🤗 transformers embeddings.

The philosophy of PolyFuzz is: `Easy to use yet highly customizable`. It is a string matcher tool that requires only
a few lines of code but that allows you to customize and create your own models.
92 changes: 84 additions & 8 deletions docs/releases.md
@@ -1,8 +1,83 @@
## **v0.4.0**


* Added new models (SentenceTransformers, Gensim, USE, Spacy)
* Added `.fit`, `.transform`, and `.fit_transform` methods
* Added `.save` and `PolyFuzz.load()`


**SentenceTransformers**
```python
from polyfuzz import PolyFuzz
from polyfuzz.models import SentenceEmbeddings

distance_model = SentenceEmbeddings("all-MiniLM-L6-v2")
model = PolyFuzz(distance_model)
```

**Gensim**
```python
from polyfuzz.models import GensimEmbeddings
distance_model = GensimEmbeddings("glove-twitter-25")
model = PolyFuzz(distance_model)
```

**USE**
```python
from polyfuzz.models import USEEmbeddings
distance_model = USEEmbeddings("https://tfhub.dev/google/universal-sentence-encoder/4")
model = PolyFuzz(distance_model)
```

**Spacy**
```python
from polyfuzz.models import SpacyEmbeddings
distance_model = SpacyEmbeddings("en_core_web_md")
model = PolyFuzz(distance_model)
```
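
All of these distance models plug into the same interface, so they should also work with the new `fit` and `transform` methods; a minimal sketch using the SentenceTransformers backend:

```python
from polyfuzz import PolyFuzz
from polyfuzz.models import SentenceEmbeddings

train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
unseen_words = ["apple", "apples", "mouse"]

distance_model = SentenceEmbeddings("all-MiniLM-L6-v2")
model = PolyFuzz(distance_model).fit(train_words)
results = model.transform(unseen_words)
```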


**fit, transform, fit_transform**

Added `fit`, `transform`, and `fit_transform` in order to use PolyFuzz in production [#34](https://github.com/MaartenGr/PolyFuzz/issues/34)

```python
from polyfuzz import PolyFuzz

train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
unseen_words = ["apple", "apples", "mouse"]

# Fit
model = PolyFuzz("TF-IDF")
model.fit(train_words)

# Transform
results = model.transform(unseen_words)
```

In the code above, we fit our TF-IDF model on `train_words` and use `.transform()` to match the words in `unseen_words` to the words that we trained on.

After fitting our model, we can save it as follows:

```python
model.save("my_model")
```

Then, we can load our model to be used elsewhere:

```python
from polyfuzz import PolyFuzz

model = PolyFuzz.load("my_model")
```


## **v0.3.4**

- Make sure that when you use two lists that are exactly the same, it will return 1 for identical terms:

```python
from polyfuzz import PolyFuzz

from_list = ["apple", "house"]
model = PolyFuzz("TF-IDF")
model.match(from_list, from_list)
@@ -14,6 +89,7 @@ mapping to itself:

```python
from polyfuzz import PolyFuzz

from_list = ["apple", "apples"]
model = PolyFuzz("TF-IDF")
model.match(from_list)
@@ -22,32 +98,32 @@ model.match(from_list)
In the example above, `apple` will be mapped to `apples` and not to `apple`. Here, we assume that the user wants to
find the most similar words within a list without mapping to itself.

## **v0.3.3**
- Update numpy to `numpy>=1.20.0` to prevent [this issue](https://github.com/MaartenGr/PolyFuzz/issues/23) and [this one](https://github.com/MaartenGr/PolyFuzz/issues/21)
- Update pytorch to `torch>=1.4.0,<1.7.1` to prevent the `save_state_warning` error

## **v0.3.2**
- Fix exploding memory usage when using `top_n`

## **v0.3.0**
- Use `top_n` in `polyfuzz.models.TFIDF` and `polyfuzz.models.Embeddings`

## **v0.2.2**
- Update grouping to include all strings only if identical lists of strings are compared

## **v0.2.0**
- Update naming convention matcher --> model
- Update documentation
- Add basic models to grouper
- Fix issues with vector order in cosine similarity
- Update naming of cosine similarity function

## **v0.1.0**
- Additional tests
- More thorough documentation
- Prepare for public release

## **v0.0.1**
- First release of `PolyFuzz`
- Matching through:
- Edit Distance
67 changes: 66 additions & 1 deletion docs/tutorial/basematcher/basematcher.md
@@ -9,6 +9,7 @@ You simply create a class using `BaseMatcher`, make sure it has a function `matc
two lists and outputs a pandas dataframe. That's it!

We start by creating our own model that implements the ratio similarity measure from RapidFuzz:

```python
import numpy as np
import pandas as pd
@@ -19,7 +20,7 @@ from polyfuzz.models import BaseMatcher


class MyModel(BaseMatcher):
def match(self, from_list, to_list):
def match(self, from_list, to_list, **kwargs):
# Calculate distances
matches = [[fuzz.ratio(from_string, to_string) / 100
for to_string in to_list] for from_string in from_list]
@@ -53,3 +54,67 @@ model.visualize_precision_recall(kde=True)
```

![](custom_model.png)


## fit, transform, fit_transform

Although the above model can be used in production through `fit`, it does not track any state between `fit` and `transform`.
That is not necessary here, since edit distances need to be recalculated for every new comparison. If you use embeddings
that you do not want to recompute, however, it helps to carry state over from `fit` to `transform`. To do so, we can use
the `re_train` parameter to define what happens when we re-train a model (for example, when using `fit`) and what happens
when we do not (for example, when using `transform`).

In the example below, when `re_train=True` we calculate the embeddings from the `from_list` and, if it is defined, the `to_list`,
and save the result to the `self.embeddings_to` variable. When we set `re_train=False`, we avoid redoing that work by leveraging
the pre-calculated `self.embeddings_to` variable instead.

```python
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

from polyfuzz.models import BaseMatcher
from polyfuzz.models._utils import cosine_similarity


class SentenceEmbeddings(BaseMatcher):
    def __init__(self, model_id):
        super().__init__(model_id)
        self.type = "Embeddings"

        self.embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
        self.embeddings_to = None

    def match(self, from_list, to_list=None, re_train=True) -> pd.DataFrame:
        # Extract embeddings from the `from_list`
        embeddings_from = self.embedding_model.encode(from_list, show_progress_bar=False)

        # Re-use the embeddings saved during `fit` unless we are re-training
        if not re_train and isinstance(self.embeddings_to, np.ndarray):
            embeddings_to = self.embeddings_to
        elif to_list is None:
            embeddings_to = self.embedding_model.encode(from_list, show_progress_bar=False)
        else:
            embeddings_to = self.embedding_model.encode(to_list, show_progress_bar=False)

        # Extract matches through cosine similarity
        matches = cosine_similarity(embeddings_from, embeddings_to, from_list, to_list)

        # Save the embeddings so that `transform` can re-use them
        self.embeddings_to = embeddings_to

        return matches
```

Then, we can use it as follows:

```python
from polyfuzz import PolyFuzz

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

custom_matcher = SentenceEmbeddings("SBERT")

model = PolyFuzz(custom_matcher).fit(from_list)
```

By using the `.fit` function, embeddings are created from the `from_list` variable and saved. Then, when we
run `model.transform(to_list)`, the embeddings created from the `from_list` variable do not need to be recalculated.
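
A short sketch of that flow:

```python
# `fit` already computed and stored the `from_list` embeddings;
# `transform` re-uses them and only embeds the words in `to_list`
results = model.transform(to_list)
```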
