WordsClusterBySynonyms

Words clustering using synonyms

This class builds clusters of words from the synonyms that NLTK provides for each word. Let's see an example.

import pandas as pd
import WordsClusterBySynonyms as wcbs

In this case we use a list of Italian verbs.

verbs = [
    'cogliere', 'intagliare', 'ragguagliare', 'dilazionare', 'tuffare',
    'dissipare', 'indisporre', 'complottare', 'contraddire', 'sconoscere',
    'sgocciolare', 'ridimensionare', 'ammansire', 'stuzzicare', 'rintuzzare',
    ...
    'autenticare', 'programmare', 'assassinare', 'immalinconire', 'esalare',
    'istigare', 'abiurare', 'curare', 'tranciare', 'tracciare', 'vagolare',
    'raddolcire', 'sfinire', 'confrontare', 'indispettire', 'fare', 'avere', 'vivere'
]

verbs = pd.DataFrame(verbs)
verbs.columns = ['verbs']

WordsClusterBySynonyms requires a dataframe, the name of the target column, and the language.

The first method of WordsClusterBySynonyms is get_synonyms_pandas. It generates the synonyms of each word in the dataframe and stores them in a new column.

wc = wcbs.WordsClusterBySynonyms(verbs, 'verbs', lang='ita')
df = wc.get_synonyms_pandas()

wc.plot_hist(df)
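
To illustrate where the synonyms could come from, here is a minimal sketch that collects lemma names through NLTK's WordNet interface. It assumes the class relies on WordNet with the multilingual data installed (for example via nltk.download('wordnet') and nltk.download('omw-1.4')). The names collect_synonyms and synset_lookup are hypothetical and are not part of WordsClusterBySynonyms.

```python
def collect_synonyms(word, lang="ita", synset_lookup=None):
    """Return the sorted synonyms of `word` found in its synsets.

    `synset_lookup` is a hypothetical hook that lets the sketch run
    without corpora; by default NLTK's WordNet lookup is used.
    """
    if synset_lookup is None:
        # Imported lazily: this needs the NLTK WordNet corpora on disk.
        from nltk.corpus import wordnet
        synset_lookup = wordnet.synsets

    synonyms = set()
    for synset in synset_lookup(word, lang=lang):
        # lemma_names(lang) lists the words attached to this synset.
        synonyms.update(synset.lemma_names(lang))

    # Assumption: a word is not counted as its own synonym.
    synonyms.discard(word)
    return sorted(synonyms)
```

With the corpora installed, collect_synonyms('contraddire') would return the Italian lemma names that share a synset with the verb.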

Using set_threshold you can repeat get_synonyms_pandas with a threshold on the number of synonyms for each word.

df = wc.set_threshold(20, df)

Using plot_hist you can check whether your list contains words with a huge number of associated synonyms. These words are a problem because, with our definition of distance, they tend to produce a few huge clusters.

wc.plot_hist(df)

DISTANCE

Given two different words A and B with associated lists of synonyms S_A and S_B, A is equal to B if S_A is equal to S_B. A is totally different from B if the intersection between S_A and S_B is empty.

The formula we used is:

You can choose between min and max as the criteria, or supply your own definition of distance:

    def mydistance_name():
        ...
        return ...

    wc.create_distance_matrix(mydistance=mydistance_name, criteria=None, verbose=True)
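
As an illustration, here is a minimal sketch of one distance that satisfies the two properties stated above: it is 0 for identical synonym lists and 1 for lists with no word in common. It assumes an overlap count normalized by the min or max of the two list sizes, which is not necessarily the exact formula used by the class. The names synonym_distance and distance_matrix are hypothetical.

```python
def synonym_distance(synonyms_a, synonyms_b, criteria=min):
    """Distance in [0, 1] between two words, based on shared synonyms."""
    set_a, set_b = set(synonyms_a), set(synonyms_b)
    if not set_a or not set_b:
        # Assumption: a word without synonyms is far from everything.
        return 1.0
    shared = len(set_a & set_b)
    # `criteria` picks which list size normalizes the overlap (min or max).
    return 1.0 - shared / criteria(len(set_a), len(set_b))


def distance_matrix(synonym_lists, criteria=min):
    """Symmetric matrix of pairwise distances, with zeros on the diagonal."""
    n = len(synonym_lists)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = synonym_distance(synonym_lists[i], synonym_lists[j], criteria)
            matrix[i][j] = matrix[j][i] = d
    return matrix
```

With min, a word whose synonyms are all contained in a much longer list gets distance 0 from that word, which is one way words with very many synonyms can pull others into a few huge clusters. With max, the same pair stays far apart.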

matrix = wc.create_distance_matrix(criteria=min, verbose=True)
wc.plot_eps_ncluster(matrix, ntot=10, min_samples=6)

The function run_cluster uses the DBSCAN implementation in scikit-learn. You can find the documentation here: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

result = wc.run_cluster(0.3, 6, matrix)
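
For reference, here is a minimal sketch of how DBSCAN can be run directly on a distance matrix with scikit-learn. It assumes that the first two arguments of run_cluster correspond to DBSCAN's eps and min_samples, and that the matrix is passed as precomputed distances; neither is confirmed by the text above. The helper names cluster_precomputed and count_clusters are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def cluster_precomputed(matrix, eps, min_samples):
    """Run DBSCAN on a square matrix of pairwise distances."""
    # metric="precomputed" tells DBSCAN the input already holds distances.
    model = DBSCAN(eps=eps, min_samples=min_samples, metric="precomputed")
    return model.fit_predict(np.asarray(matrix, dtype=float))


def count_clusters(labels):
    """Number of clusters found; the label -1 marks noise, not a cluster."""
    return len(set(labels) - {-1})
```

Calling count_clusters on the labels obtained for several eps values gives the kind of eps versus number-of-clusters view that helps when choosing eps.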

Below is a plot that shows the cluster in a wordcloud-like format, where a smaller size corresponds to a lower distance.

wc.plot_cluster_k(matrix, 'contraddire')

This class seems to work better for verbs and adjectives, but in general the quality of the clusters depends crucially on the quality of the synonym structure.

I wrote this class together with https://github.com/aborgher

frucci/WordsClusterBySynonyms
