This project is a robust, configurable pipeline for extracting, cleaning, embedding, and clustering web links based on their topical content. It aims to help you automatically organize large collections of URLs (such as bookmarks or research links) into semantically meaningful groups.
```mermaid
flowchart TD
    A[Input List of Weblinks] --> B[Extract Textual Semantic Representation]
    B --> C[Compute Embedding per Weblink]
    C --> D[Cluster Weblinks]
```
- **Configurable Input:** Reads URLs from a specified file (via `config.json`) and supports domain-based rate limiting and ignore lists (e.g., skip YouTube or problematic PDFs).
- **Robust Content Extraction:** Uses trafilatura to extract the main content from HTML pages (with a fallback to BeautifulSoup) and pdfminer.six to extract text from PDFs (limited to the first 10 pages). Advanced cleaning routines remove common boilerplate and extraneous text (a minimal extraction sketch follows this list).
- **Parallel Processing & Logging:** Extracts text from URLs in parallel using Python's ThreadPoolExecutor with progress indication. Detailed logs are maintained in separate log files for general extraction, failures, and summary statistics.
- **Embedding & Clustering:** Computes text embeddings with SentenceTransformers and clusters the links using HDBSCAN and hierarchical clustering methods.
- **Semantic Keyword Extraction:** Generates cluster reports using KeyBERT to extract meaningful keywords that summarize each cluster's content.
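To illustrate the extraction logic described above, here is a minimal sketch. `extract_main_text` is a hypothetical helper name (the project's actual extraction module adds cleaning and rate limiting on top of this), and the use of `requests` for the PDF download is an assumption:

```python
from io import BytesIO
from typing import Optional

import requests
import trafilatura
from bs4 import BeautifulSoup
from pdfminer.high_level import extract_text as extract_pdf_text


def extract_main_text(url: str) -> Optional[str]:
    """Return the main textual content of a URL, or None on failure."""
    if url.lower().endswith(".pdf"):
        # PDFs: download the bytes, then let pdfminer.six read only the first 10 pages.
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return extract_pdf_text(BytesIO(response.content), maxpages=10)

    html = trafilatura.fetch_url(url)
    if html is None:
        return None
    # Primary path: trafilatura's main-content extraction.
    text = trafilatura.extract(html)
    if text:
        return text
    # Fallback: strip markup with BeautifulSoup and keep the visible text.
    return BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
```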
The pipeline proceeds in four steps (a condensed end-to-end sketch of steps 2-4 follows this list):

- **Read and Clean URLs:** URLs are read from a links file (specified in `config.json`), cleaned, and filtered using `get_links.py`.
- **Parallel Extraction:** The pipeline fetches each URL, extracts the main content using advanced extraction (with trafilatura and cleaning functions), and logs successes and failures.
- **Compute Embeddings:** Extracted texts are converted into semantic embeddings using SentenceTransformers.
- **Clustering & Reporting:** Embeddings are clustered using HDBSCAN (and optionally hierarchical clustering), and a detailed cluster report (with semantic keywords) is generated.
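For orientation, a condensed sketch of steps 2-4 under stated assumptions: it reuses the hypothetical `extract_main_text` helper from the extraction sketch above, and the worker count, model name, and `min_cluster_size` are illustrative defaults, not necessarily the project's:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import hdbscan
from sentence_transformers import SentenceTransformer


def cluster_links(urls):
    """Map each successfully extracted URL to an HDBSCAN cluster label."""
    texts, kept_urls = [], []
    # Step 2: fetch and extract main content in parallel.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {pool.submit(extract_main_text, url): url for url in urls}
        for future in as_completed(futures):
            try:
                text = future.result()
            except Exception:
                continue  # the real pipeline logs failures rather than dropping them silently
            if text:
                kept_urls.append(futures[future])
                texts.append(text)

    # Step 3: one embedding per link.
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model name
    embeddings = model.encode(texts, show_progress_bar=True)

    # Step 4: density-based clustering; a label of -1 marks outliers ("noise").
    labels = hdbscan.HDBSCAN(min_cluster_size=3).fit_predict(embeddings)
    return dict(zip(kept_urls, labels))
```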
Make sure you have Python 3.8+ installed. Then install dependencies:
```bash
pip install -r requirements.txt
```
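For reference, the libraries named in this README imply a `requirements.txt` roughly like the following (unpinned and reconstructed from the text, not necessarily the shipped file):

```text
trafilatura
beautifulsoup4
pdfminer.six
sentence-transformers
hdbscan
keybert
```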
Edit `config.json` to set your links file (a text file listing weblinks), the domains to rate-limit, and the domains to ignore.
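A hypothetical `config.json` might look like this; the key names are assumptions inferred from the options described above (here `rate_limit_domains` maps a domain to a minimum delay in seconds, purely as an illustration), so check the actual file for the exact schema:

```json
{
  "links_file": "links.txt",
  "rate_limit_domains": {"example.com": 2.0},
  "ignore_domains": ["youtube.com"]
}
```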
Then run the pipeline:

```bash
python main.py
```
- **Cluster Report:** See `cluster_report.txt` for grouped links and extracted keywords (a minimal keyword-extraction sketch follows this list).
- **Logs:** Review `extraction.log`, `extraction_failures.log`, and `extraction_stats.log` for detailed processing information.
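The keyword step behind the cluster report can be approximated with KeyBERT as follows; a minimal sketch assuming each cluster's documents are simply concatenated (`summarize_cluster` is a hypothetical helper, and the n-gram range and `top_n` are illustrative, not the project's actual settings):

```python
from keybert import KeyBERT


def summarize_cluster(texts, top_n=5):
    """Return keywords that summarize one cluster's concatenated documents."""
    kw_model = KeyBERT()
    doc = " ".join(texts)
    # extract_keywords returns (keyword, relevance score) pairs.
    keywords = kw_model.extract_keywords(
        doc, keyphrase_ngram_range=(1, 2), stop_words="english", top_n=top_n
    )
    return [keyword for keyword, _score in keywords]
```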