
The repository provides a pipeline for preprocessing text data, extracting features, and applying clustering algorithms like K-means, DBSCAN, or hierarchical clustering.


michellemashutian/clusteringText


ClusteringText

A repository for exploring clustering techniques in natural language processing (NLP), with a focus on analyzing textual datasets. This project demonstrates the implementation of unsupervised learning methods to group similar text documents effectively.

Features

  • Preprocessing pipeline for text datasets, including Chinese text

  • Multiple clustering algorithms (K-Means, MiniBatchKMeans, Birch, AffinityPropagation, AgglomerativeClustering, DBSCAN)

  • Support for several vectorization methods: vector space model (VSM), latent semantic indexing (LSI), and latent Dirichlet allocation (LDA)

  • Easy integration with custom datasets (given text and keywords)
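
To make the features above concrete, here is a minimal sketch of one vectorize-then-cluster path: a TF-IDF vector space model followed by K-Means, both from scikit-learn. It is not taken from the repository's code; the function name `cluster_texts`, its parameters, and the sample documents are hypothetical and chosen only for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_texts(docs, n_clusters=2, tokenizer=None, seed=0):
    """Vectorize documents with TF-IDF (a VSM) and cluster them with K-Means.

    Hypothetical helper for illustration; not part of the repository.
    """
    if tokenizer is not None:
        # A custom tokenizer (for Chinese text, e.g. jieba.lcut) replaces the
        # default regex tokenizer, so token_pattern is set to None.
        vectorizer = TfidfVectorizer(tokenizer=tokenizer, token_pattern=None)
    else:
        vectorizer = TfidfVectorizer()
    features = vectorizer.fit_transform(docs)  # sparse document-term matrix
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(features)  # one cluster label per document


# Made-up sample documents: three about fruit, three about software.
docs = [
    "apple banana fruit juice",
    "banana fruit salad apple",
    "fruit juice apple banana smoothie",
    "python code software bug",
    "software bug fix python",
    "code review python software",
]
labels = cluster_texts(docs, n_clusters=2)
```

The two topic groups share no vocabulary, so their documents should land in different clusters; which group receives label 0 and which receives label 1 is arbitrary. Swapping `KMeans` for another scikit-learn estimator such as `AgglomerativeClustering` keeps the same overall shape, although some estimators require a dense feature matrix.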

Installation

  • Clone the repository:
git clone https://github.com/michellemashutian/clusteringText.git
cd clusteringText
  • Install required dependencies:
pip install -r requirements.txt

Usage

Prepare your dataset as a .txt file whose columns contain the text data.

Run the main script:

python main.py
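
The column layout that main.py expects is not spelled out here, so the following is only a sketch under a stated assumption: a tab-separated .txt file with the text in the first column and optional space-separated keywords in the second. The helper `read_dataset` is hypothetical and uses only the standard library; it can serve as a quick check that a file parses as intended before running the script.

```python
import io


def read_dataset(handle, delimiter="\t"):
    """Read (text, keywords) rows from a delimited text file.

    Assumed layout (not confirmed by the repository): column 1 holds the
    text, and an optional column 2 holds space-separated keywords.
    """
    rows = []
    for line in handle:
        fields = line.rstrip("\n").split(delimiter)
        text = fields[0].strip()
        if not text:
            continue  # skip blank lines
        keywords = fields[1].split() if len(fields) > 1 else []
        rows.append((text, keywords))
    return rows


# In-memory stand-in for a real file such as open("data.txt", encoding="utf-8").
sample = io.StringIO("first document text\tfirst document\n\nsecond one\n")
dataset = read_dataset(sample)
```

For a real file, pass an open file handle instead of the `io.StringIO` stand-in, and adjust `delimiter` if the data uses a different separator.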

Dependencies

gensim==4.3.3
jieba==0.42.1
numpy==2.2.1
scikit_learn==1.6.0

Contact

For any questions or feedback, feel free to reach out via issues or email me at [email protected]

Happy Clustering!
