NOTE: This is a work in progress and not all functionality is available and/or stable yet.
OpenMatch is an enterprise-grade Python library for comprehensive Master Data Management (MDM) solutions. It provides a complete suite of tools for entity resolution, data governance, and master data lifecycle management using cutting-edge AI and scalable architecture.
- Advanced Matching Engine
- Configurable blocking strategies for performance optimization
- Multi-attribute fuzzy matching with weighted scoring
- Incremental matching support
- Match result persistence and metadata tracking
- Comprehensive match statistics and performance metrics
- Caching support for improved performance
- Intelligent Merge Processing
- Flexible merge strategy framework
- Golden record generation and management
- Cross-reference (xref) tracking
- Source record lineage
- Merge operation rollback support
- Detailed merge audit trails
- Robust Model Framework
- Entity and field configuration
- Physical model generation
- Schema validation
- Source system integration
- Field mapping and transformation
- Custom validation rules
- Comprehensive Lineage
- Cross-reference management
- Change history tracking
- Source system mapping
- Temporal data support
- Full audit capabilities
- Rich Connector Framework
- AWS integration
- Azure support
- Databricks connectivity
- JDBC/ODBC support
- REST API integration
- Snowflake native support
- Flat file processing
- Administrative Capabilities
- Command-line interface
- Configuration management
- Deployment utilities
- Monitoring tools
- Data Quality Management
- Configurable trust rules
- Scoring framework
- Quality metrics
- Trust-based survivorship
- Framework configuration
pip install openmatch
OpenMatch is a powerful Master Data Management (MDM) system that uses advanced vector similarity search and machine learning for accurate record matching and deduplication.
- Vector-based similarity search using pgvector
- Configurable matching rules and thresholds
- Automatic schema and vector index management
- Batch processing with optimized performance
- Support for multiple source systems
- Real-time and batch matching capabilities
- Materialized views for performance optimization
- Comprehensive logging and monitoring
# Clone the repository
git clone https://github.com/yourusername/openmatch.git
cd openmatch
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install requirements
pip install -r requirements.txt
- Install PostgreSQL 14+ and the pgvector extension:
# On Ubuntu/Debian
sudo apt-get install postgresql-14 postgresql-14-pgvector
# On macOS with Homebrew
brew install postgresql@14
brew install pgvector
- Create the MDM database:
CREATE DATABASE mdm;
- Configure your database settings in openmatch/match/local_settings.py:
# local_settings.py
MDM_DB = {
'ENGINE': 'postgresql',
'HOST': 'localhost',
'PORT': 5432,
'NAME': 'mdm',
'USER': 'your_user',
'PASSWORD': 'your_password',
'SCHEMA': 'mdm',
}
# Configure your source systems
SOURCE_SYSTEMS = {
'source1': {
'ENGINE': 'postgresql',
'HOST': 'localhost',
'PORT': 5432,
'NAME': 'source1',
'USER': 'source1_user',
'PASSWORD': 'source1_password',
'ENTITY_TYPE': 'person',
'QUERY': 'SELECT * FROM persons WHERE updated_at > :last_sync',
'FIELD_MAPPINGS': {
'first_name': 'given_name',
'last_name': 'family_name',
}
}
}
- Adjust matching settings if needed:
MATCH_SETTINGS = {
'BLOCKING_KEYS': ['first_name', 'last_name', 'birth_date'],
'MATCH_THRESHOLD': 0.8,
'MERGE_THRESHOLD': 0.9,
}
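As a rough illustration of how the three MATCH_SETTINGS values might interact, the sketch below groups records by a combined blocking key and maps a similarity score to an action. The helper names (`blocking_key`, `group_into_blocks`, `classify`) are hypothetical and not part of OpenMatch's API. The sketch also assumes that scores at or above MERGE_THRESHOLD are merged, that scores between the two thresholds are recorded as matches only, and that the blocking fields are combined into one key; the actual engine may behave differently.

```python
from collections import defaultdict

# Values copied from the MATCH_SETTINGS example above.
MATCH_SETTINGS = {
    "BLOCKING_KEYS": ["first_name", "last_name", "birth_date"],
    "MATCH_THRESHOLD": 0.8,
    "MERGE_THRESHOLD": 0.9,
}


def blocking_key(record, keys=MATCH_SETTINGS["BLOCKING_KEYS"]):
    """Build a normalized tuple from the blocking fields (hypothetical helper)."""
    return tuple(str(record.get(k, "")).strip().lower() for k in keys)


def group_into_blocks(records):
    """Group records so that only records sharing a blocking key get compared."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    return dict(blocks)


def classify(score, settings=MATCH_SETTINGS):
    """Map a similarity score to an action (assumed semantics, see lead-in)."""
    if score >= settings["MERGE_THRESHOLD"]:
        return "merge"
    if score >= settings["MATCH_THRESHOLD"]:
        return "match"
    return "no_match"
```

Under these assumptions, raising MATCH_THRESHOLD reduces false positives at the cost of missed matches, while the gap between the two thresholds defines the band of matches that are kept but not merged.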
# Initialize database schema and extensions
python manage.py init_db
# Process matches with default batch size
python manage.py process_matches
# Process matches with custom batch size
python manage.py process_matches --batch_size 5000
# Refresh materialized views
python manage.py refresh_views
openmatch/
├── match/
│   ├── __init__.py
│   ├── settings.py          # Main settings file
│   ├── local_settings.py    # Local overrides (create this)
│   ├── engine.py            # Core matching engine
│   ├── db_ops.py            # Database operations
│   ├── rules.py             # Matching rules
│   └── matchers.py          # Matching algorithms
├── manage.py                # Management script
├── requirements.txt
└── README.md
VECTOR_SETTINGS = {
'BACKEND': VectorBackend.PGVECTOR,
'DIMENSION': 768,
'INDEX_TYPE': 'ivfflat', # or 'hnsw'
'IVF_LISTS': 100,
'PROBES': 10,
'SIMILARITY_THRESHOLD': 0.8,
}
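To make the INDEX_TYPE, IVF_LISTS, and PROBES settings more concrete, here is a minimal sketch of the pgvector statements they could translate to. The `build_index_sql` helper is hypothetical, and the column name `embedding` and the cosine operator class are assumptions. The table name `mdm.record_embeddings` is taken from the troubleshooting queries in this README. OpenMatch manages its indexes automatically, so this only illustrates the underlying pgvector syntax.

```python
# Subset of the VECTOR_SETTINGS example above that affects index creation.
VECTOR_SETTINGS = {
    "INDEX_TYPE": "ivfflat",  # or "hnsw"
    "IVF_LISTS": 100,
    "PROBES": 10,
}


def build_index_sql(settings, table="mdm.record_embeddings", column="embedding"):
    """Return the pgvector SQL statements implied by the settings (sketch only)."""
    index_type = settings["INDEX_TYPE"]
    if index_type == "ivfflat":
        return [
            # `lists` sets how many clusters the IVFFlat index partitions vectors into.
            f"CREATE INDEX ON {table} USING ivfflat ({column} vector_cosine_ops) "
            f"WITH (lists = {int(settings['IVF_LISTS'])});",
            # `probes` sets how many of those clusters are scanned per query.
            f"SET ivfflat.probes = {int(settings['PROBES'])};",
        ]
    if index_type == "hnsw":
        # HNSW needs no list count; pgvector's defaults are used here.
        return [f"CREATE INDEX ON {table} USING hnsw ({column} vector_cosine_ops);"]
    raise ValueError(f"Unsupported INDEX_TYPE: {index_type!r}")


if __name__ == "__main__":
    for statement in build_index_sql(VECTOR_SETTINGS):
        print(statement)
```

In pgvector, scanning more probes generally improves recall at the cost of query speed, which is why the two values are usually tuned together.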
MODEL_SETTINGS = {
'EMBEDDING_MODEL': 'sentence-transformers/all-MiniLM-L6-v2',  # outputs 384-dim vectors; VECTOR_SETTINGS['DIMENSION'] must match the model
'USE_GPU': False,
'BATCH_SIZE': 128,
}
PROCESSING = {
'BATCH_SIZE': 10000,
'MAX_WORKERS': None, # None = 2 * CPU cores
'USE_PROCESSES': False,
}
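The PROCESSING values can be read as follows: a MAX_WORKERS of None falls back to twice the CPU count, and BATCH_SIZE caps how many records are handled at once. The sketch below shows one way this could be implemented. Both helper names are hypothetical and are not taken from OpenMatch's code.

```python
import os
from itertools import islice

# Values copied from the PROCESSING example above.
PROCESSING = {
    "BATCH_SIZE": 10000,
    "MAX_WORKERS": None,  # None = 2 * CPU cores
    "USE_PROCESSES": False,
}


def resolve_max_workers(settings=PROCESSING):
    """Return the configured worker count, or 2 * CPU cores when it is None."""
    configured = settings["MAX_WORKERS"]
    if configured is not None:
        return configured
    # os.cpu_count() can return None on some platforms, so fall back to 1.
    return 2 * (os.cpu_count() or 1)


def iter_batches(items, batch_size):
    """Yield consecutive lists of at most batch_size items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    print("workers:", resolve_max_workers())
    for batch in iter_batches(range(25), 10):
        print("batch of", len(batch))
```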
from openmatch.match.engine import MatchEngine
from openmatch.match.db_ops import DatabaseOptimizer, BatchProcessor
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
# Create database session
engine = create_engine("postgresql://user:password@localhost/mdm")
Session = sessionmaker(bind=engine)
session = Session()
# Initialize match engine
match_engine = MatchEngine(session)
# Process matches
processor = BatchProcessor(session, match_engine)
processor.process_matches()
# Find matches for a specific record
matches = match_engine.find_candidates({
'first_name': 'John',
'last_name': 'Doe',
'birth_date': '1990-01-01'
})
# Get match details
for match_id, similarity in matches:
print(f"Match ID: {match_id}, Similarity: {similarity}")
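The similarity values returned above are not defined in this README. Assuming they are cosine similarities between record embeddings (a common choice with pgvector, whose `<=>` operator returns cosine distance, that is, 1 minus cosine similarity), the calculation can be sketched in plain Python as:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Cosine distance as pgvector's <=> operator defines it: 1 - similarity."""
    return 1.0 - cosine_similarity(a, b)
```

Under this reading, a SIMILARITY_THRESHOLD of 0.8 would correspond to a maximum cosine distance of 0.2.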
- Environment Variables: Store sensitive information in environment variables:
export MDM_DB_PASSWORD=your_secure_password
export SOURCE1_DB_PASSWORD=another_secure_password
- Batch Size: Adjust batch sizes based on your system's memory:
- For systems with 8GB RAM: 5,000-10,000 records
- For systems with 16GB RAM: 10,000-20,000 records
- For systems with 32GB+ RAM: 20,000-50,000 records
- Vector Index: Choose the appropriate vector index:
- ivfflat: Better for frequent updates, good balance of speed/accuracy
- hnsw: Better for read-heavy workloads, highest accuracy
- Monitoring: Monitor the system using materialized views:
SELECT * FROM mdm.match_statistics;
SELECT * FROM mdm.blocking_statistics;
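For the environment-variable recommendation in the list above, one possible way to wire the exported values into local_settings.py is sketched here. The `require_env` and `build_mdm_db_config` helpers are hypothetical. Only the MDM_DB_PASSWORD variable name comes from this README, and the remaining values mirror the MDM_DB example in the configuration section.

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of an environment variable, failing loudly if it is unset."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def build_mdm_db_config(environ=os.environ):
    """Build the MDM_DB dict with the password taken from the environment."""
    return {
        "ENGINE": "postgresql",
        "HOST": "localhost",
        "PORT": 5432,
        "NAME": "mdm",
        "USER": "your_user",
        "PASSWORD": require_env("MDM_DB_PASSWORD", environ),
        "SCHEMA": "mdm",
    }
```

In local_settings.py this could be used as `MDM_DB = build_mdm_db_config()`, which keeps the password out of version control.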
- Vector Extension Issues:
-- Check if pgvector is installed
SELECT * FROM pg_extension WHERE extname = 'vector';
-- Manually install if needed
CREATE EXTENSION vector;
- Performance Issues:
-- Check index usage
SELECT * FROM pg_stat_user_indexes
WHERE schemaname = 'mdm'
AND indexrelname LIKE '%vector%';
-- Analyze tables
ANALYZE mdm.record_embeddings;
- Memory Issues:
- Reduce batch size in PROCESSING settings
- Increase PostgreSQL work_mem for vector operations
- Monitor system memory usage during processing
- Fork the repository
- Create your feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Ready to master your data? Get started with OpenMatch today!