Word clustering using synonyms
This class builds clusters of words from the synonym definitions available in NLTK. Let's see an example.
import pandas as pd
import WordsClusterBySynonyms as wcbs
In this example we use a list of Italian verbs.
verbs = [
'cogliere', 'intagliare', 'ragguagliare', 'dilazionare', 'tuffare',
'dissipare', 'indisporre', 'complottare', 'contraddire', 'sconoscere',
'sgocciolare', 'ridimensionare', 'ammansire', 'stuzzicare', 'rintuzzare',
...
'autenticare', 'programmare', 'assassinare', 'immalinconire', 'esalare',
'istigare', 'abiurare', 'curare', 'tranciare', 'tracciare', 'vagolare',
'raddolcire', 'sfinire', 'confrontare', 'indispettire','fare','avere','vivere'
]
verbs = pd.DataFrame(verbs)
verbs.columns = ['verbs']
WordsClusterBySynonyms takes a dataframe together with the name of the target column and the language.
The first method of WordsClusterBySynonyms is get_synonyms_pandas. It generates the synonyms of every word in the target column and stores them in a new column of the dataframe.
wc = wcbs.WordsClusterBySynonyms(verbs, 'verbs', lang='ita')
df = wc.get_synonyms_pandas()
wc.plot_hist(df)
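As a rough illustration of what a synonym lookup through NLTK can look like, the sketch below collects the lemma names of every WordNet synset a word belongs to. It is not the code of get_synonyms_pandas: the helper names wordnet_synonyms and synonym_table are hypothetical, and the lookup assumes that NLTK is installed and that its 'wordnet' and 'omw-1.4' corpora have been downloaded.

```python
def wordnet_synonyms(word, lang="ita"):
    """Return the sorted lemma names that share a WordNet synset with `word`.

    Assumption: NLTK is installed and the 'wordnet' and 'omw-1.4' corpora
    have already been fetched with nltk.download().
    """
    # Imported lazily because the corpus data is only needed for real lookups.
    from nltk.corpus import wordnet as wn

    synonyms = set()
    for synset in wn.synsets(word, lang=lang):
        synonyms.update(synset.lemma_names(lang))
    synonyms.discard(word)  # do not list a word as its own synonym
    return sorted(synonyms)


def synonym_table(words, lookup=wordnet_synonyms):
    """Map each word to its list of synonyms (hypothetical helper)."""
    return {word: lookup(word) for word in words}
```

Passing a different lookup function makes it possible to try the rest of the pipeline without the WordNet data, for example with a small hand-written dictionary.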
Using set_threshold you can repeat get_synonyms_pandas with a threshold on the number of synonyms for each word.
df = wc.set_threshold(20, df)
Using plot_hist you can check whether your list contains words associated with a very large number of synonyms. These words are a problem because, with our definition of distance, they tend to produce a few huge clusters.
wc.plot_hist(df)
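The effect of such a threshold can be pictured with plain dictionaries. The sketch assumes that the threshold discards every word with more than the given number of synonyms; set_threshold may behave differently (for example by truncating the lists instead), and both drop_heavy_words and the placeholder data are hypothetical.

```python
def drop_heavy_words(synonyms_by_word, threshold):
    """Keep only the words that have at most `threshold` synonyms."""
    return {
        word: synonyms
        for word, synonyms in synonyms_by_word.items()
        if len(synonyms) <= threshold
    }


# Placeholder data: the names stand in for real words and synonyms.
toy = {
    "word_a": ["syn%d" % i for i in range(35)],  # 35 synonyms, above the cut
    "word_b": ["syn1", "syn2", "syn3"],
    "word_c": [],
}
kept = drop_heavy_words(toy, threshold=20)
print(sorted(kept))
```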
Given two different words A and B with associated lists of synonyms $S_A$ and $S_B$, A is equal to B if $S_A$ is equal to $S_B$. A is totally different from B if the intersection between $S_A$ and $S_B$ is empty.
The formula we used is:
You can choose between min and max as the criteria, or pass your own definition of distance:
def mydistance_name():
    ...
    return ...
wc.create_distance_matrix(mydistance=mydistance_name, criteria=None, verbose=True)
matrix = wc.create_distance_matrix(criteria=min, verbose=True)
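To make the role of the criteria concrete, here is a minimal sketch of one distance with the properties described above: 0 when the two synonym lists coincide and 1 when they share nothing. It assumes an overlap ratio normalised by criteria applied to the two list sizes, which is not necessarily the formula used by create_distance_matrix. The names overlap_distance and distance_matrix are hypothetical.

```python
from itertools import combinations


def overlap_distance(syn_a, syn_b, criteria=min):
    """Return 1 - |shared synonyms| / criteria(|syn_a|, |syn_b|)."""
    set_a, set_b = set(syn_a), set(syn_b)
    if not set_a or not set_b:
        # Assumption: a word without synonyms is maximally far from any other.
        return 1.0
    shared = len(set_a & set_b)
    return 1.0 - shared / criteria(len(set_a), len(set_b))


def distance_matrix(synonyms_by_word, criteria=min):
    """Build a symmetric matrix of pairwise distances, with a zero diagonal."""
    words = list(synonyms_by_word)
    size = len(words)
    matrix = [[0.0] * size for _ in range(size)]
    for i, j in combinations(range(size), 2):
        dist = overlap_distance(
            synonyms_by_word[words[i]], synonyms_by_word[words[j]], criteria
        )
        matrix[i][j] = matrix[j][i] = dist
    return words, matrix
```

In this sketch, min gives distance 0 as soon as one synonym list is contained in the other, while max gives 0 only when the two lists are identical, so max is the stricter choice.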
wc.plot_eps_ncluster(matrix, ntot=10, min_samples=6)
The function run_cluster uses the DBSCAN implementation in scikit-learn. You can find the documentation here: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
result = wc.run_cluster(0.3,6, matrix)
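For readers who want to see the underlying call, the sketch below runs scikit-learn's DBSCAN directly on a precomputed distance matrix and counts the clusters found for a few values of eps. It is an illustration under assumptions, not the body of run_cluster: the helper names cluster_precomputed and count_clusters and the toy matrix are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def cluster_precomputed(matrix, eps, min_samples):
    """Return one DBSCAN label per row of a square distance matrix (-1 = noise)."""
    model = DBSCAN(eps=eps, min_samples=min_samples, metric="precomputed")
    return model.fit_predict(np.asarray(matrix, dtype=float))


def count_clusters(labels):
    """Count the distinct cluster ids, ignoring the noise label -1."""
    return len({int(label) for label in labels} - {-1})


# Toy matrix: a tight group of three points, a looser pair, one isolated point.
toy = np.ones((6, 6))
toy[:3, :3] = 0.1
toy[3:5, 3:5] = 0.2
np.fill_diagonal(toy, 0.0)

labels = cluster_precomputed(toy, eps=0.3, min_samples=2)

# Number of clusters for increasing eps, the kind of curve used to pick eps.
sweep = {
    eps: count_clusters(cluster_precomputed(toy, eps, min_samples=2))
    for eps in (0.05, 0.15, 0.3)
}
print(labels, sweep)
```

A small eps leaves every point as noise, while a larger one lets first the tight group and then the looser pair form a cluster, which is why it helps to inspect several values before settling on one.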
Below is a plot that shows the cluster in a wordcloud-like format, where a smaller size corresponds to a lower distance.
wc.plot_cluster_k(matrix, 'contraddire')
This class seems to work better for verbs and adjectives, but in general the quality of the clusters depends crucially on the quality of the underlying synonym structure.
I wrote this class together with https://github.com/aborgher