Commit
Showing 6 changed files with 92 additions and 2 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
# Datasets | ||
There are two datasets prepared for you to play around with: | ||
* Company Names | ||
* Movie Titles | ||
|
||
## Movie Titles | ||
This data is retrieved from: | ||
* https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset | ||
* https://www.kaggle.com/shivamb/netflix-shows | ||
|
||
It contains Netflix and IMDB movie titles that can be matched against each other: | ||
IMDB has 80852 movie titles and Netflix has 6172. | ||
|
||
You can use them as follows: | ||
|
||
```python | ||
from polyfuzz import PolyFuzz | ||
from polyfuzz.datasets import load_movie_titles | ||
|
||
data = load_movie_titles() | ||
model = PolyFuzz("TF-IDF").match(data["Netflix"], data["IMDB"]) | ||
``` | ||
|
||
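To give a feel for what matching one list of titles against another involves, here is a minimal standard-library sketch of character n-gram TF-IDF with cosine similarity. It is an illustration only, not PolyFuzz's implementation: the helper names (`char_ngrams`, `tfidf_vectors`, `best_matches`), the trigram size, and the smoothed IDF formula are assumptions, and the titles are a tiny hand-written sample rather than rows from the dataset.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """Split a lowercased string into overlapping character n-grams."""
    text = text.lower()
    if len(text) <= n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(strings: List[str], n: int = 3) -> List[Dict[str, float]]:
    """Build one L2-normalised TF-IDF vector (a sparse dict) per string."""
    counts = [Counter(char_ngrams(s, n)) for s in strings]
    doc_freq = Counter(gram for c in counts for gram in c)
    total = len(strings)
    vectors = []
    for c in counts:
        # Smoothed IDF (an assumption) keeps every weight strictly positive.
        vec = {
            gram: tf * (math.log((1 + total) / (1 + doc_freq[gram])) + 1)
            for gram, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({gram: w / norm for gram, w in vec.items()})
    return vectors


def best_matches(from_list: List[str], to_list: List[str]) -> List[Tuple[str, str, float]]:
    """For each string in from_list, return its most similar string in to_list."""
    vectors = tfidf_vectors(from_list + to_list)
    from_vecs = vectors[:len(from_list)]
    to_vecs = vectors[len(from_list):]
    results = []
    for source, fv in zip(from_list, from_vecs):
        # Cosine similarity reduces to a dot product on normalised vectors.
        scored = [
            (sum(w * tv.get(gram, 0.0) for gram, w in fv.items()), target)
            for target, tv in zip(to_list, to_vecs)
        ]
        score, target = max(scored)
        results.append((source, target, round(score, 3)))
    return results


# Hand-written sample shaped like the dictionary described above.
sample = {
    "Netflix": ["the irishman", "roma"],
    "IMDB": ["The Irishman", "Roma", "Casablanca"],
}
matches = best_matches(sample["Netflix"], sample["IMDB"])
for source, target, score in matches:
    print(source, "->", target, score)
```

A pairwise loop like this is only practical for small samples; comparing every title in lists of the size described above would call for sparse matrix operations.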
## Company Names | ||
This data is retrieved from https://www.kaggle.com/dattapiy/sec-edgar-companies-list?select=sec__edgar_company_info.csv | ||
and contains 100_000 company names to be matched against each other. | ||
|
||
This is a different use case from the one you have typically seen so far, where two different lists are compared | ||
with each other. Here, you can compare the company names with themselves in order to clean | ||
them up. | ||
|
||
You can use them as follows: | ||
|
||
```python | ||
from polyfuzz import PolyFuzz | ||
from polyfuzz.datasets import load_company_names | ||
|
||
data = load_company_names() | ||
model = PolyFuzz("TF-IDF").match(data, data) | ||
``` | ||
|
||
PolyFuzz will recognize that the lists are the same and that you are looking to match the names with themselves. | ||
It will ignore any comparison a string has with itself; otherwise, everything would get mapped to itself. |
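The idea of skipping self-comparisons when a list is matched against itself can be sketched with the standard library alone. This is a hedged illustration, not how PolyFuzz does it internally: `match_within` is a hypothetical helper, `difflib.SequenceMatcher` stands in for the TF-IDF similarity, and the company names are invented.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def match_within(names: List[str]) -> List[Tuple[str, Optional[str], float]]:
    """Match every name to its closest *other* name in the same list."""
    results = []
    for i, name in enumerate(names):
        best_name, best_score = None, 0.0
        for j, candidate in enumerate(names):
            if i == j:
                # Skip the comparison of a string with itself; without this,
                # every name would trivially map to itself with score 1.0.
                continue
            score = SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
            if score > best_score:
                best_name, best_score = candidate, score
        results.append((name, best_name, round(best_score, 3)))
    return results


# Invented names for illustration only.
companies = ["Acme Corp", "ACME Corporation", "Globex LLC", "Globex L.L.C."]
pairs = match_within(companies)
for source, target, score in pairs:
    print(source, "->", target, score)
```

Skipping by index rather than by string value means that genuine duplicates in the list can still be matched to each other.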
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,2 @@ | ||
from .polyfuzz import PolyFuzz | ||
-__version__ = "0.2.0" | ||
+__version__ = "0.2.1" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
from ._load_data import load_movie_titles, load_company_names | ||
|
||
__all__ = [ | ||
"load_movie_titles", | ||
"load_company_names" | ||
] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
import json | ||
import requests | ||
from typing import List, Mapping | ||
|
||
|
||
def load_movie_titles() -> Mapping[str, List[str]]: | ||
""" Load Netflix and IMDB movie titles to be matched against each other | ||
Retrieved from: | ||
https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset | ||
https://www.kaggle.com/shivamb/netflix-shows | ||
Preprocessed such that it only contains the title names where | ||
IMDB has 80852 titles and Netflix has 6172 | ||
Returns: | ||
data: a dictionary with two keys: "Netflix" and "IMDB" where | ||
each value contains a list of movie titles | ||
""" | ||
url = 'https://github.com/MaartenGr/PolyFuzz/raw/master/data/movie_titles.json' | ||
resp = requests.get(url) | ||
data = json.loads(resp.text) | ||
return data | ||
|
||
|
||
def load_company_names() -> List[str]: | ||
""" Load company names to be matched against each other. | ||
Retrieved from: | ||
https://www.kaggle.com/dattapiy/sec-edgar-companies-list?select=sec__edgar_company_info.csv | ||
Preprocessed such that it only contains 100_000 company names. | ||
Returns: | ||
data: a list of company names | ||
""" | ||
url = 'https://github.com/MaartenGr/PolyFuzz/raw/master/data/company_names.json' | ||
resp = requests.get(url) | ||
data = json.loads(resp.text) | ||
return data |
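Both loaders parse whatever the request returns without checking it. As a sketch of how the downloaded payload could be validated before use, the hypothetical helper below (`parse_movie_titles` is not part of this commit or of PolyFuzz) checks for the two keys named in the docstring of `load_movie_titles`. In a real loader one might also call `resp.raise_for_status()` on the `requests` response before parsing.

```python
import json
from typing import List, Mapping


def parse_movie_titles(payload: str) -> Mapping[str, List[str]]:
    """Parse a JSON payload and check it has the documented shape.

    Hypothetical helper: expects an object with "Netflix" and "IMDB" keys,
    each holding a list of title strings.
    """
    data = json.loads(payload)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    for key in ("Netflix", "IMDB"):
        titles = data.get(key)
        if not isinstance(titles, list) or not all(isinstance(t, str) for t in titles):
            raise ValueError(f"expected a list of strings under {key!r}")
    return data


# A small hand-written payload standing in for the downloaded file.
example = '{"Netflix": ["roma"], "IMDB": ["Roma", "Casablanca"]}'
titles = parse_movie_titles(example)
print(len(titles["Netflix"]), len(titles["IMDB"]))
```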
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -37,7 +37,7 @@ | |
setup( | ||
name="polyfuzz", | ||
packages=find_packages(exclude=["notebooks", "docs"]), | ||
-version="0.2.0", | ||
+version="0.2.1", | ||
author="Maarten Grootendorst", | ||
author_email="[email protected]", | ||
description="PolyFuzz performs fuzzy string matching, grouping, and evaluation.", | ||
|