
String Grouper


[Image: graph visualization of one group of similar strings found by string_grouper]

The image displayed above is a visualization of the graph structure of one of the groups of strings found by string_grouper. Each circle (node) represents a string, and each connecting arc (edge) represents a match between a pair of strings with a similarity score above a given threshold (here 0.8).

The centroid of the group, as determined by string_grouper (see tutorials/group_representatives.md for an explanation), is the largest node and also the one with the most edges originating from it. A thick line in the image denotes a strong similarity between the nodes at its ends, while a thin, faint line denotes a weak one.

The power of string_grouper is discernible from this image: in large datasets, string_grouper can often resolve indirect associations between strings even when direct matches between them cannot be computed with conventional methods at a lower similarity threshold, for example because of memory limitations. The sketch after the image credit below illustrates the idea.

———

This image was designed using the graph-visualization software Gephi 0.9.2 with data generated by string_grouper operating on the sec__edgar_company_info.csv sample data file.
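The principle is easy to reproduce in miniature: if string A matches B and B matches C above the threshold, all three land in the same group even when the A-to-C similarity was never computed. Below is a minimal sketch of this idea using scipy's connected_components directly; it illustrates the graph logic only and is not string_grouper's actual implementation.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Toy data: similarities above the threshold exist for A~B and B~C only.
strings = ['ACME CORP', 'ACME CORPORATION', 'ACME CORPORATION INC']
edges = [(0, 1), (1, 2)]  # pairs whose similarity exceeded the threshold

rows, cols = zip(*edges)
adjacency = csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(3, 3))

# Connected components place all three strings in one group,
# even though the A~C similarity was never computed:
n_groups, labels = connected_components(adjacency, directed=False)
print(n_groups, labels)  # 1 [0 0 0]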


string_grouper is a library that makes it easy, and fast, to find groups of similar strings within a single list of strings or between two lists. It uses tf-idf to calculate cosine similarities between strings. The full process is described in the blog post Super Fast String Matching in Python.
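At its core, the method vectorizes each string into tf-idf weights over character n-grams and compares the resulting sparse vectors by cosine similarity. A rough sketch of that idea using scikit-learn follows; string_grouper's own implementation is far more optimized, and its n-gram settings may differ.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

strings = ['1 800 MUTUALS ADVISOR SERIES', '1 800 MUTUALS ADVISORS SERIES', 'ACME INC']

# Vectorize each string into tf-idf weights over character 3-grams:
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(3, 3))
tfidf = vectorizer.fit_transform(strings)

# Pairwise cosine similarities; near-duplicates score close to 1:
print(cosine_similarity(tfidf).round(2))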

Installing

pip install string-grouper

Speed

string_grouper leverages the blazingly fast sparse_dot_topn library to calculate cosine similarities.

import datetime

# 'names' holds the ~663,000 company names from sec__edgar_company_info.csv
s = datetime.datetime.now()
matches = match_strings(names['Company Name'], number_of_processes=4)
e = datetime.datetime.now()
diff = e - s
str(diff)

Results in:

00:05:34.65

on an Intel i7-6500U CPU @ 2.50GHz, where len(names) == 663,000. In other words, the library performs fuzzy matching on 663,000 names in about five and a half minutes on a 2015 consumer CPU using 4 cores.
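The speed comes from never materializing the full similarity matrix: sparse_dot_topn multiplies the two sparse tf-idf matrices while keeping only the top-n entries per row that exceed a threshold. The sketch below shows that operation in isolation, assuming the awesome_cossim_topn function exposed by older sparse_dot_topn releases (newer releases rename it sp_matmul_topn); the random matrices stand in for L2-normalized tf-idf matrices, for which the dot product equals cosine similarity.

from scipy.sparse import random as sparse_random
from sparse_dot_topn import awesome_cossim_topn  # API of older sparse_dot_topn releases

# Random sparse matrices standing in for a tf-idf matrix and a transposed one:
A = sparse_random(1000, 5000, density=0.01, format='csr')
B = sparse_random(5000, 1000, density=0.01, format='csr')

# Keep at most the 10 best entries per row, and only those above 0.8:
C = awesome_cossim_topn(A, B, ntop=10, lower_bound=0.8)
print(C.shape, C.nnz)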

Simple Match

import pandas as pd
from string_grouper import match_strings

companies = pd.read_csv('sec__edgar_company_info.csv')
# Create all matches:
matches = match_strings(companies['Company Name'])
# Look at only the non-exact matches:
matches[matches['left_Company Name'] != matches['right_Company Name']].head()
     left_index                           left_Company Name  similarity                           right_Company Name  right_index
15           14                                   0210, LLC    0.870291                                    90210 LLC         4211
167         165                1 800 MUTUALS ADVISOR SERIES    0.931615                1 800 MUTUALS ADVISORS SERIES          166
168         166               1 800 MUTUALS ADVISORS SERIES    0.931615                 1 800 MUTUALS ADVISOR SERIES          165
172         168                1 800 RADIATOR FRANCHISE INC    1.000000                1-800-RADIATOR FRANCHISE INC.          201
178         173  1 FINANCIAL MARKETPLACE SECURITIES LLC /BD    0.949364      1 FINANCIAL MARKETPLACE SECURITIES, LLC          174
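match_strings also accepts a second Series, matching every string in it against the first (master) Series. A short example, where new_companies.csv is a hypothetical second file with its own 'Company Name' column:

# 'new_companies.csv' is a hypothetical second file with a 'Company Name' column:
new_companies = pd.read_csv('new_companies.csv')
# Match each new name against the master list of company names:
new_matches = match_strings(companies['Company Name'], new_companies['Company Name'])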

Group Similar Strings and Find the Most Common

from string_grouper import group_similar_strings

companies[["group-id", "name_deduped"]] = group_similar_strings(companies['Company Name'])
companies.groupby('name_deduped')['Line Number'].count().sort_values(ascending=False).head(10)
name_deduped                                        Line Number
ADVISORS DISCIPLINED TRUST                                 1747
NUVEEN TAX EXEMPT UNIT TRUST SERIES 1                       916
GUGGENHEIM DEFINED PORTFOLIOS, SERIES 1200                  652
U S TECHNOLOGIES INC                                        632
CAPITAL MANAGEMENT LLC                                      628
CLAYMORE SECURITIES DEFINED PORTFOLIOS, SERIES 200          611
E ACQUISITION CORP                                          561
CAPITAL PARTNERS LP                                         561
FIRST TRUST COMBINED SERIES 1                               560
PRINCIPAL LIFE INCOME FUNDINGS TRUST 20                     544
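group_similar_strings passes its keyword options through to the same matcher used above, so the threshold can be tuned; raising min_similarity above its default of 0.8, for example, should produce smaller, more conservative groups:

# Tighter grouping: only merge names with cosine similarity of at least 0.9
companies[['group-id', 'name_deduped']] = group_similar_strings(
    companies['Company Name'],
    min_similarity=0.9
)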

Documentation

The documentation can be found here