This project is a robust, configurable pipeline for extracting, cleaning, embedding, and clustering web links based on their topical content. It aims to help you automatically organize large collections of URLs (such as bookmarks or research links) into semantically meaningful groups.
```mermaid
flowchart TD
    A[Input List of Weblinks] --> B[Extract Textual Semantic Representation]
    B --> C[Compute Embedding per Weblink]
    C --> D[Cluster Weblinks]
```
- **Configurable Input:** Reads URLs from a specified file (via `config.json`) and supports domain-based rate limiting and ignore lists (e.g., skip YouTube or problematic PDFs).
- **Robust Content Extraction:** Uses trafilatura to extract the main content from HTML pages (with a fallback to BeautifulSoup) and pdfminer.six to extract text from PDFs (limited to the first 10 pages). Advanced cleaning routines remove common boilerplate and extraneous text (a minimal extraction sketch follows this list).
- **Parallel Processing & Logging:** Extracts text from URLs in parallel using Python's ThreadPoolExecutor with progress indication. Detailed logs are maintained in separate log files for general extraction, failures, and summary statistics.
- **Embedding & Clustering:** Computes text embeddings with SentenceTransformers and clusters the links using HDBSCAN and hierarchical clustering methods.
- **Semantic Keyword Extraction:** Generates cluster reports using KeyBERT to extract meaningful keywords that summarize each cluster's content.
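To illustrate the extraction logic described above, here is a minimal sketch. `extract_main_text` is a hypothetical helper name (the project's actual extraction module adds cleaning and rate limiting on top of this), and the use of `requests` for the PDF download is an assumption:

```python
from io import BytesIO
from typing import Optional

import requests
import trafilatura
from bs4 import BeautifulSoup
from pdfminer.high_level import extract_text as extract_pdf_text


def extract_main_text(url: str) -> Optional[str]:
    """Return the main textual content of a URL, or None on failure."""
    if url.lower().endswith(".pdf"):
        # PDFs: download the bytes, then let pdfminer.six read only the first 10 pages.
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return extract_pdf_text(BytesIO(response.content), maxpages=10)

    html = trafilatura.fetch_url(url)
    if html is None:
        return None
    # Primary path: trafilatura's main-content extraction.
    text = trafilatura.extract(html)
    if text:
        return text
    # Fallback: strip markup with BeautifulSoup and keep the visible text.
    return BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
```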
The pipeline proceeds in four steps (a condensed end-to-end sketch of steps 2-4 follows this list):

- **Read and Clean URLs:** URLs are read from a links file (specified in `config.json`), cleaned, and filtered using `get_links.py`.
- **Parallel Extraction:** The pipeline fetches each URL, extracts the main content using advanced extraction (with trafilatura and cleaning functions), and logs successes and failures.
- **Compute Embeddings:** Extracted texts are converted into semantic embeddings using SentenceTransformers.
- **Clustering & Reporting:** Embeddings are clustered using HDBSCAN (and optionally hierarchical clustering), and a detailed cluster report (with semantic keywords) is generated.
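For orientation, a condensed sketch of steps 2-4 under stated assumptions: it reuses the hypothetical `extract_main_text` helper from the extraction sketch above, and the worker count, model name, and `min_cluster_size` are illustrative defaults, not necessarily the project's:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import hdbscan
from sentence_transformers import SentenceTransformer


def cluster_links(urls):
    """Map each successfully extracted URL to an HDBSCAN cluster label."""
    texts, kept_urls = [], []
    # Step 2: fetch and extract main content in parallel.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {pool.submit(extract_main_text, url): url for url in urls}
        for future in as_completed(futures):
            try:
                text = future.result()
            except Exception:
                continue  # the real pipeline logs failures rather than dropping them silently
            if text:
                kept_urls.append(futures[future])
                texts.append(text)

    # Step 3: one embedding per link.
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model name
    embeddings = model.encode(texts, show_progress_bar=True)

    # Step 4: density-based clustering; a label of -1 marks outliers ("noise").
    labels = hdbscan.HDBSCAN(min_cluster_size=3).fit_predict(embeddings)
    return dict(zip(kept_urls, labels))
```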
Make sure you have Python 3.8+ installed. Then install dependencies:
```bash
pip install -r requirements.txt
```
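For reference, the libraries named in this README imply a `requirements.txt` roughly like the following (unpinned and reconstructed from the text, not necessarily the shipped file):

```text
trafilatura
beautifulsoup4
pdfminer.six
sentence-transformers
hdbscan
keybert
```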
Edit `config.json` to set your links file (a text file listing weblinks), the domains to rate-limit, and the domains to ignore.
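A hypothetical `config.json` might look like this; the key names are assumptions inferred from the options described above (here `rate_limit_domains` maps a domain to a minimum delay in seconds, purely as an illustration), so check the actual file for the exact schema:

```json
{
  "links_file": "links.txt",
  "rate_limit_domains": {"example.com": 2.0},
  "ignore_domains": ["youtube.com"]
}
```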
Then run the pipeline:

```bash
python main.py
```
- **Cluster Report:** See `cluster_report.txt` for grouped links and extracted keywords (a minimal keyword-extraction sketch follows this list).
- **Logs:** Review `extraction.log`, `extraction_failures.log`, and `extraction_stats.log` for detailed processing information.
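The keyword step behind the cluster report can be approximated with KeyBERT as follows; a minimal sketch assuming each cluster's documents are simply concatenated (`summarize_cluster` is a hypothetical helper, and the n-gram range and `top_n` are illustrative, not the project's actual settings):

```python
from keybert import KeyBERT


def summarize_cluster(texts, top_n=5):
    """Return keywords that summarize one cluster's concatenated documents."""
    kw_model = KeyBERT()
    doc = " ".join(texts)
    # extract_keywords returns (keyword, relevance score) pairs.
    keywords = kw_model.extract_keywords(
        doc, keyphrase_ngram_range=(1, 2), stop_words="english", top_n=top_n
    )
    return [keyword for keyword, _score in keywords]
```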