This program processes a CSV file containing names, generates embeddings for those names using a pre-trained model (sentence-transformers/all-mpnet-base-v2), and clusters them with the HDBSCAN algorithm.
The results, including cluster IDs and potential noise points, are saved to CSV files.
This project focuses on clustering similar names based on semantic embeddings generated by the sentence-transformers/all-mpnet-base-v2 model, creating a 768-dimensional representation for each name.
Using HDBSCAN, a density-based clustering algorithm, it groups names with similar meanings or structures while labeling outliers as noise.
The process begins by loading the dataset from a CSV file, preprocessing the data to handle missing values, and performing basic exploratory data analysis (EDA) to understand its structure. Embeddings are then generated, clustered, and saved into two output CSV files—one including noise labels and another without.
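The loading and preprocessing step described above can be sketched as follows. This is a minimal illustration, not the project's actual `cluster.py`: the column name `name` and the sample data are assumptions made for the example.

```python
import io
import pandas as pd

# A small in-memory CSV stands in for the real input file;
# the column name "name" is an assumption for illustration.
csv_data = io.StringIO('name\nAlice Smith\n""\nBob Jones\nAlicia Smyth\n')
df = pd.read_csv(csv_data)

# Basic EDA: shape and missing-value counts.
print(df.shape)
print(df.isna().sum())

# Handle missing values: drop rows with no name before embedding.
df = df.dropna(subset=["name"]).reset_index(drop=True)
print(len(df))
```

The same `read_csv` / `dropna` pattern applies when pointing pandas at a file path instead of an in-memory buffer.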
The silhouette score is a metric used to evaluate the quality of clusters by measuring how well-separated and well-defined they are. It ranges from -1 to 1, where a higher score indicates better-defined clusters. A score close to 1 means that data points are well-matched within their cluster and distinct from neighboring clusters. A score near 0 suggests overlapping clusters, while negative scores indicate that points may be in the wrong clusters.
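As a quick illustration of the metric, scikit-learn's `silhouette_score` can be computed on toy data. The 2-D points below are made up for the example; in this project the inputs would be the 768-dimensional name embeddings.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two well-separated toy clusters in 2-D (illustrative data only).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Tight, well-separated clusters yield a score close to 1.
score = silhouette_score(X, labels)
print(round(score, 3))
```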
This project aimed to balance several clustering objectives, including maximizing the silhouette score, minimizing the number of clusters, and ensuring high-quality groupings. While efforts were made to achieve a well-defined clustering of names, compromises were necessary to optimize across these goals. In some cases, clusters may have been merged or outliers labeled as noise to maintain a manageable number of clusters with clear boundaries. As such, the clustering results represent an optimized trade-off and may not perfectly capture every subtle variation in the data.
Results also depend heavily on the quality of the embeddings; other factors, such as tuning the parameters of the HDBSCAN algorithm, may help achieve a better clustering.
Users are encouraged to adjust parameters based on their specific use cases for optimal results.
- Python 3.10 and above
- pandas
- numpy==1.26.4
- scikit-learn
- hdbscan
- langchain
- langchain_huggingface
- sentence-transformers
Note: Example free & open-source Hugging Face embedding models you can use:
- all-mpnet-base-v2: requires approximately 438.7 MB (used in this project)
- all-MiniLM-L6-v2: requires approximately 91.6 MB
- Clone the repository:
git clone https://github.com/felixLandlord/nlpClusterAnalysis.git
cd nlpClusterAnalysis
- Create a virtual environment:
python -m venv .venv
- Activate the virtual environment:
macOS:
source .venv/bin/activate
Windows:
.venv\Scripts\activate
Linux:
source .venv/bin/activate
- Install the required dependencies:
pip install -r requirements.txt
- Running the clustering program:
This script processes the input data and generates the output CSV files with the cluster IDs.
python cluster.py
Enter the path to the input CSV file: /path/to/your/input.csv
Enter the directory to save the output files: /path/to/output/directory
You can also run the clustering program in a Google Colab notebook. Here is an example of how to do it: