The image displayed above is a visualization of the graph structure of one of the groups of strings found by string_grouper. Each circle (node) represents a string, and each connecting arc (edge) represents a match between a pair of strings whose similarity score is above a given threshold (here 0.8).
The centroid of the group, as determined by string_grouper (see tutorials/group_representatives.md for an explanation), is the largest node, which also has the most edges originating from it. A thick line in the image denotes strong similarity between the nodes at its ends, while a faint, thin line denotes weak similarity.
The power of string_grouper is discernible from this image: in large datasets, string_grouper is often able to resolve indirect associations between strings even when direct matches between those strings cannot be computed by conventional methods with a lower similarity threshold, for example because of memory limitations.
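The idea of an indirect association can be illustrated with a small sketch. This is not string_grouper's implementation: it assumes a hypothetical list of string pairs that already passed the similarity threshold, and it groups them by the connected components of the match graph using a union-find structure. Two strings land in the same group whenever a chain of matches links them, even if no direct match between them was ever computed.

```python
from collections import defaultdict


def group_by_components(pairs):
    """Group strings into connected components of the match graph."""
    parent = {}

    def find(x):
        # Register unseen strings as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Each matched pair is an edge: merge the two components it connects.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for s in parent:
        groups[find(s)].add(s)
    return list(groups.values())


# Hypothetical matches above the threshold (illustrative names only).
matched_pairs = [
    ("ACME INC", "ACME INC."),
    ("ACME INC.", "ACME INCORPORATED"),
    ("ZETA LLC", "ZETA, LLC"),
]
groups = group_by_components(matched_pairs)
```

Here "ACME INC" and "ACME INCORPORATED" share a group through "ACME INC.", although the pair itself is not in the input.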
This image was designed using the graph-visualization software Gephi 0.9.2 with data generated by string_grouper operating on the sec__edgar_company_info.csv sample data file.
string_grouper is a library that makes finding groups of similar strings within a single list, or across multiple lists, of strings easy and fast. string_grouper uses tf-idf to calculate cosine similarities within a single list or between two lists of strings. The full process is described in the blog Super Fast String Matching in Python.
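To make the tf-idf and cosine-similarity step concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not string_grouper's code: it assumes character trigrams as tokens and a smoothed idf formula, and the helper names (`ngrams`, `tfidf_vectors`, `cosine`) are introduced here for the example only. The library itself works on sparse matrices for speed.

```python
import math
from collections import Counter


def ngrams(text, n=3):
    """Split a string into overlapping character n-grams (assumed tokenization)."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(strings, n=3):
    """Return one L2-normalized tf-idf vector (as a dict) per input string."""
    docs = [Counter(ngrams(s, n)) for s in strings]
    # Document frequency: in how many strings each n-gram occurs.
    df = Counter(g for doc in docs for g in doc)
    total = len(strings)
    vectors = []
    for doc in docs:
        # Smoothed idf (an assumption of this sketch): rarer n-grams weigh more.
        vec = {g: tf * (math.log((1 + total) / (1 + df[g])) + 1)
               for g, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()} if norm else vec)
    return vectors


def cosine(u, v):
    """Cosine similarity of two normalized sparse vectors (a plain dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(g, 0.0) for g, w in u.items())


names = [
    "1 800 MUTUALS ADVISOR SERIES",
    "1 800 MUTUALS ADVISORS SERIES",
    "0210, LLC",
]
vecs = tfidf_vectors(names)
close = cosine(vecs[0], vecs[1])  # near-duplicate names share most trigrams
far = cosine(vecs[0], vecs[2])    # unrelated names share none
```

Because the vectors are normalized up front, the similarity of every pair reduces to a sparse dot product, which is the operation that makes the approach fast at scale.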
pip install string-grouper
string_grouper leverages the blazingly fast sparse_dot_topn library to calculate cosine similarities.
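import datetime
from string_grouper import match_strings

# names is assumed to be a DataFrame with a 'Company Name' column
s = datetime.datetime.now()
matches = match_strings(names['Company Name'], number_of_processes=4)
e = datetime.datetime.now()
diff = e - s
str(diff)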
Results in:
00:05:34.65
This was measured on an Intel i7-6500U CPU @ 2.50GHz, where len(names) = 663 000. In other words, the library is able to perform fuzzy matching of 663 000 names in about five and a half minutes on a 2015 consumer CPU using 4 processes.
import pandas as pd
from string_grouper import match_strings
company_names = 'sec__edgar_company_info.csv'
companies = pd.read_csv(company_names)
# Create all matches:
matches = match_strings(companies['Company Name'])
# Look at only the non-exact matches:
matches[matches['left_Company Name'] != matches['right_Company Name']].head()
 | left_index | left_Company Name | similarity | right_Company Name | right_index |
---|---|---|---|---|---|
15 | 14 | 0210, LLC | 0.870291 | 90210 LLC | 4211 |
167 | 165 | 1 800 MUTUALS ADVISOR SERIES | 0.931615 | 1 800 MUTUALS ADVISORS SERIES | 166 |
168 | 166 | 1 800 MUTUALS ADVISORS SERIES | 0.931615 | 1 800 MUTUALS ADVISOR SERIES | 165 |
172 | 168 | 1 800 RADIATOR FRANCHISE INC | 1 | 1-800-RADIATOR FRANCHISE INC. | 201 |
178 | 173 | 1 FINANCIAL MARKETPLACE SECURITIES LLC /BD | 0.949364 | 1 FINANCIAL MARKETPLACE SECURITIES, LLC | 174 |
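In the table above, the same pair can appear twice, once in each direction (rows 167 and 168). If only one row per pair is wanted, one possible approach is to build an order-independent key from the two index columns and drop the repeats. The sketch below uses a small hand-built DataFrame with the same column names as the table; the variable names `low`, `high`, `mirrored` and `unique_pairs` are introduced here for illustration.

```python
import pandas as pd

# A few rows copied from the table above, standing in for the real output.
matches = pd.DataFrame({
    "left_index": [14, 165, 166],
    "left_Company Name": [
        "0210, LLC",
        "1 800 MUTUALS ADVISOR SERIES",
        "1 800 MUTUALS ADVISORS SERIES",
    ],
    "similarity": [0.870291, 0.931615, 0.931615],
    "right_Company Name": [
        "90210 LLC",
        "1 800 MUTUALS ADVISORS SERIES",
        "1 800 MUTUALS ADVISOR SERIES",
    ],
    "right_index": [4211, 166, 165],
})

# Order-independent key: (smaller index, larger index) for every row.
low = matches[["left_index", "right_index"]].min(axis=1)
high = matches[["left_index", "right_index"]].max(axis=1)

# Mark every repeat of a key after its first occurrence, then drop those rows.
mirrored = pd.DataFrame({"low": low, "high": high}).duplicated()
unique_pairs = matches[~mirrored]
```

Using a min/max key rather than a simple `left_index < right_index` filter keeps a pair even when it only appears in one direction.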
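from string_grouper import group_similar_strings
# Assign each company a group id and a representative (deduplicated) name:
companies[["group-id", "name_deduped"]] = group_similar_strings(companies['Company Name'])
companies.groupby('name_deduped')['Line Number'].count().sort_values(ascending=False).head(10)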
name_deduped | Line Number |
---|---|
ADVISORS DISCIPLINED TRUST | 1747 |
NUVEEN TAX EXEMPT UNIT TRUST SERIES 1 | 916 |
GUGGENHEIM DEFINED PORTFOLIOS, SERIES 1200 | 652 |
U S TECHNOLOGIES INC | 632 |
CAPITAL MANAGEMENT LLC | 628 |
CLAYMORE SECURITIES DEFINED PORTFOLIOS, SERIES 200 | 611 |
E ACQUISITION CORP | 561 |
CAPITAL PARTNERS LP | 561 |
FIRST TRUST COMBINED SERIES 1 | 560 |
PRINCIPAL LIFE INCOME FUNDINGS TRUST 20 | 544 |
The documentation can be found here