This document provides technical details about the search functionality and vector embedding system implemented in this blog.
The blog features a multi-modal search system that allows users to find content through three different search types:
- Keyword Search: Traditional text-based search using PostgreSQL's full-text search capabilities
- Semantic Search: Vector-based search using embeddings to find conceptually related content
- Hybrid Search: A weighted combination of both keyword and semantic search
The search system consists of these main components:
- React component with search input and type toggling
- Maintains search state and parameters in URL and session storage
- Displays results with formatting based on content type
- Shows match percentage based on the relevant score (hybrid_score, keyword_score, or vector_similarity)
- Handles search requests with different search types
- Generates embeddings for semantic/hybrid search using VoyageAI
- Executes PostgreSQL queries with vector operations
- Implements fallback searches when no results are found
- Returns normalized results with appropriate scoring
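The match percentage mentioned above can be derived by picking whichever score belongs to the active search type. The following is a minimal sketch, assuming scores are already normalized to the 0 to 1 range; the `SearchType` and `SearchResult` types and both helper functions are hypothetical names, not the blog's actual code.

```typescript
type SearchType = "keyword" | "semantic" | "hybrid";

// Hypothetical result shape: only the score for the active search type is expected to be set.
interface SearchResult {
  hybrid_score?: number;
  keyword_score?: number;
  vector_similarity?: number;
}

// Pick the score that matches the search type; a missing score counts as 0.
function relevantScore(result: SearchResult, type: SearchType): number {
  switch (type) {
    case "hybrid":
      return result.hybrid_score ?? 0;
    case "keyword":
      return result.keyword_score ?? 0;
    case "semantic":
      return result.vector_similarity ?? 0;
  }
}

// Clamp to [0, 1] (assumed normalization) and convert to a whole-number percentage.
function matchPercentage(result: SearchResult, type: SearchType): number {
  const clamped = Math.min(1, Math.max(0, relevantScore(result, type)));
  return Math.round(clamped * 100);
}
```

Clamping keeps the displayed value between 0 and 100 even if a score falls slightly outside the expected range.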
CREATE TABLE content_chunks (
  id UUID PRIMARY KEY,
  post_slug TEXT,
  post_title TEXT,
  content TEXT,
  chunk_type TEXT,
  metadata JSONB,
  sequence INTEGER,
  embedding vector(512),
  overlaps_with UUID[],
  overlap_score FLOAT[],
  created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
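On the application side, a row of this table could be modeled roughly as follows. This is a sketch only: the `ContentChunkRow` interface and `validateChunkRow` helper are hypothetical, and treating `overlaps_with` and `overlap_score` as parallel arrays is an assumption the schema itself does not state.

```typescript
// Hypothetical TypeScript mirror of the content_chunks table.
// Every column except the primary key is nullable in the schema.
interface ContentChunkRow {
  id: string; // UUID
  post_slug: string | null;
  post_title: string | null;
  content: string | null;
  chunk_type: string | null;
  metadata: Record<string, unknown> | null; // JSONB
  sequence: number | null;
  embedding: number[] | null; // vector(512)
  overlaps_with: string[] | null; // UUID[]
  overlap_score: number[] | null; // FLOAT[]
  created_at: Date | null; // TIMESTAMP WITH TIME ZONE
}

const EMBEDDING_DIMENSIONS = 512;

// Returns a list of problems; an empty list means the row passed both checks.
function validateChunkRow(row: ContentChunkRow): string[] {
  const problems: string[] = [];
  if (row.embedding !== null && row.embedding.length !== EMBEDDING_DIMENSIONS) {
    problems.push(`embedding has ${row.embedding.length} dimensions, expected ${EMBEDDING_DIMENSIONS}`);
  }
  // Assumption: each overlap score belongs to the chunk ID at the same array position.
  const ids = row.overlaps_with ?? [];
  const scores = row.overlap_score ?? [];
  if (ids.length !== scores.length) {
    problems.push("overlaps_with and overlap_score differ in length");
  }
  return problems;
}
```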
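Uses PostgreSQL's to_tsvector and plainto_tsquery with the ts_rank function to find and score text matches. Because the 'english' configuration stems words, matches are on word stems rather than exact strings: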
SELECT
  content, post_slug, post_title, chunk_type, metadata,
  ts_rank(
    to_tsvector('english', content || ' ' || post_title),
    plainto_tsquery('english', ${query})
  ) as text_rank
FROM content_chunks
WHERE to_tsvector('english', content || ' ' || post_title) @@ plainto_tsquery('english', ${query})
Uses vector similarity with the cosine distance operator <=> to find conceptually similar content:
SELECT
  content, post_slug, post_title, chunk_type, metadata,
  1 - (embedding <=> ${formattedEmbedding}::vector) as vector_similarity
FROM content_chunks
WHERE 1 - (embedding <=> ${formattedEmbedding}::vector) > 0.4
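To make the query above concrete, the sketch below shows how an embedding might be serialized into pgvector's text form (`[x,y,z]`) before being cast with `::vector`, and what `1 - (a <=> b)` computes. The function names are hypothetical; only the 0.4 threshold comes from the query.

```typescript
// Serialize an embedding as a pgvector text literal, e.g. "[0.1,0.2,0.3]".
function formatEmbedding(embedding: number[]): string {
  return `[${embedding.join(",")}]`;
}

// Cosine similarity: the value that 1 - (a <=> b) yields, since <=> is cosine distance.
// Returns NaN when either vector has zero length (norm of 0).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Same cutoff as the WHERE clause in the query.
const SIMILARITY_THRESHOLD = 0.4;

function passesThreshold(a: number[], b: number[]): boolean {
  return cosineSimilarity(a, b) > SIMILARITY_THRESHOLD;
}
```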
Combines both approaches with weighted scoring (70% vector, 30% text):
(vector_similarity * 0.7 + COALESCE(text_rank, 0) * 0.3) as hybrid_score
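The same weighting can be expressed outside SQL. This is a small sketch of the formula above, with `null` standing in for a missing `text_rank` the way `COALESCE(text_rank, 0)` does; the `hybridScore` and `rankByHybridScore` names are illustrative only.

```typescript
const VECTOR_WEIGHT = 0.7;
const TEXT_WEIGHT = 0.3;

// Mirrors: vector_similarity * 0.7 + COALESCE(text_rank, 0) * 0.3
function hybridScore(vectorSimilarity: number, textRank: number | null): number {
  return vectorSimilarity * VECTOR_WEIGHT + (textRank ?? 0) * TEXT_WEIGHT;
}

interface ScoredChunk {
  vector_similarity: number;
  text_rank: number | null;
}

// Return a new array ordered by hybrid score, highest first.
function rankByHybridScore<T extends ScoredChunk>(chunks: T[]): T[] {
  return [...chunks].sort(
    (a, b) =>
      hybridScore(b.vector_similarity, b.text_rank) -
      hybridScore(a.vector_similarity, a.text_rank),
  );
}
```

A chunk with no keyword match still receives up to 70% of the maximum score, so purely semantic matches are not dropped from hybrid results.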
Before embedding, blog content is split into chunks using three techniques:
- Sliding Window: Chunks overlap to maintain context between segments
- Content-Aware: Different chunk types (paragraphs, code, lists, quotes)
- Context Preservation: Metadata tracks section information and relationships
- MIN_CHUNK_LENGTH: Minimum size (150 chars) for a standalone chunk
- MAX_CHUNK_LENGTH: Target size (600 chars) for chunks
- OVERLAP_PERCENTAGE: 30% overlap between sliding windows
- OVERLAP_MIN_CHARS: Minimum overlap of 100 characters
- OVERLAP_MAX_CHARS: Maximum overlap of 200 characters
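One plausible way these parameters interact is sketched below. The constant values come from the list above, but how they are combined is an assumption: here the overlap is 30% of the chunk length clamped to the 100 to 200 character range, and a tail shorter than the minimum is absorbed into the preceding chunk. The real chunker is content-aware, whereas this sketch works on raw characters.

```typescript
const MIN_CHUNK_LENGTH = 150;
const MAX_CHUNK_LENGTH = 600;
const OVERLAP_PERCENTAGE = 0.3;
const OVERLAP_MIN_CHARS = 100;
const OVERLAP_MAX_CHARS = 200;

// Overlap is a percentage of the chunk length, clamped to the min/max bounds.
function overlapChars(chunkLength: number): number {
  const raw = Math.round(chunkLength * OVERLAP_PERCENTAGE);
  return Math.min(OVERLAP_MAX_CHARS, Math.max(OVERLAP_MIN_CHARS, raw));
}

// Character-based sliding window; each chunk starts inside the end of the previous one.
function slidingWindowChunks(text: string): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + MAX_CHUNK_LENGTH, text.length);
    // A remaining tail too short to stand alone is absorbed into this chunk.
    if (text.length - end < MIN_CHUNK_LENGTH) end = text.length;
    chunks.push(text.slice(start, end));
    if (end === text.length) break;
    start = end - overlapChars(end - start);
  }
  return chunks;
}
```

With these values a full 600-character chunk overlaps the next by 180 characters, so the window advances 420 characters at a time.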
- Paragraphs (main content)
- Code blocks (with language metadata)
- Bullet lists (with count metadata)
- Blockquotes
- Mixed content (small elements combined)
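As an illustration of content-aware typing, the sketch below classifies a single Markdown block and attaches the metadata noted above (language for code, item count for lists). The `ChunkKind` type, its labels, and the detection rules are all assumptions; the actual `chunk_type` values are not documented here, and the "mixed content" case of merging small elements is left out.

```typescript
// Hypothetical chunk kinds with the metadata each one carries.
type ChunkKind =
  | { type: "code"; language: string | null }
  | { type: "list"; count: number }
  | { type: "quote" }
  | { type: "paragraph" };

// Markdown code fence marker, built indirectly to avoid a literal fence in this example.
const FENCE = "`".repeat(3);

function classifyBlock(block: string): ChunkKind {
  const lines = block.trim().split("\n");
  const first = lines[0] ?? "";
  if (first.startsWith(FENCE)) {
    // The text after the opening fence is the language tag, if any.
    const language = first.slice(FENCE.length).trim();
    return { type: "code", language: language === "" ? null : language };
  }
  if (lines.every((line) => /^\s*[-*+]\s+/.test(line))) {
    return { type: "list", count: lines.length };
  }
  if (lines.every((line) => line.trim().startsWith(">"))) {
    return { type: "quote" };
  }
  return { type: "paragraph" };
}
```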
- Content is processed into semantic chunks
- Each chunk is embedded using VoyageAI's voyage-3-lite model
- Embeddings are stored with metadata in PostgreSQL with pgvector
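The embedding step can be sketched as building a request for VoyageAI's REST embeddings endpoint and reading the vectors back. The endpoint URL, the request fields (`input`, `model`, `input_type`), and the response shape reflect my understanding of the public API and should be checked against the VoyageAI documentation; the helper names are hypothetical, and the blog's script may use an SDK instead.

```typescript
const VOYAGE_EMBEDDINGS_URL = "https://api.voyageai.com/v1/embeddings";
const EMBEDDING_MODEL = "voyage-3-lite";

interface EmbeddingRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Build (but do not send) the HTTP request for a batch of chunk texts.
function buildEmbeddingRequest(texts: string[], apiKey: string): EmbeddingRequest {
  return {
    url: VOYAGE_EMBEDDINGS_URL,
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    // input_type "document" marks the texts as content to be indexed (assumed setting).
    body: JSON.stringify({ input: texts, model: EMBEDDING_MODEL, input_type: "document" }),
  };
}

// Assumed response shape: one entry per input, each tagged with its input index.
interface EmbeddingResponse {
  data: { embedding: number[]; index: number }[];
}

// Return embeddings in the same order as the input texts.
function extractEmbeddings(response: EmbeddingResponse): number[][] {
  return [...response.data].sort((a, b) => a.index - b.index).map((item) => item.embedding);
}
```

The request object can be passed to `fetch(request.url, { method: "POST", headers: request.headers, body: request.body })`, with the API key read from the `VOYAGE_AI_API_KEY` environment variable.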
- Batching: Processes chunks in batches of 20
- Rate Limiting: Includes delays between batches (200ms) and files (1000ms)
- Retry Logic: Exponential backoff for API rate limits
- Transaction Safety: Only updates embeddings for modified files
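The batching, rate limiting, and retry behavior listed above could look roughly like this. The batch size (20) and inter-batch delay (200ms) come from the list; the retry count, the base backoff delay, and the injected `embedBatch` callback are assumptions made for illustration.

```typescript
const BATCH_SIZE = 20;
const BATCH_DELAY_MS = 200;
const MAX_RETRIES = 5; // assumption: not specified in this document
const BASE_BACKOFF_MS = 1000; // assumption: not specified in this document

// Split items into consecutive batches of at most `size` elements.
function toBatches<T>(items: T[], size: number = BATCH_SIZE): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Exponential backoff: the delay doubles with each attempt (0-based).
function backoffDelayMs(attempt: number, baseMs: number = BASE_BACKOFF_MS): number {
  return baseMs * Math.pow(2, attempt);
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// `embedBatch` is a hypothetical callback that embeds one batch of texts.
async function embedAll(
  texts: string[],
  embedBatch: (batch: string[]) => Promise<number[][]>,
): Promise<number[][]> {
  const embeddings: number[][] = [];
  const batches = toBatches(texts);
  for (let b = 0; b < batches.length; b++) {
    const batch = batches[b] ?? [];
    for (let attempt = 0; ; attempt++) {
      try {
        embeddings.push(...(await embedBatch(batch)));
        break;
      } catch (error) {
        // This sketch retries on any error; a real script would likely
        // retry only on rate-limit responses.
        if (attempt >= MAX_RETRIES) throw error;
        await sleep(backoffDelayMs(attempt));
      }
    }
    // Pause between batches, but not after the last one.
    if (b < batches.length - 1) await sleep(BATCH_DELAY_MS);
  }
  return embeddings;
}
```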
- Tracks relationships between sequential chunks
- Records overlap scores for improved context retrieval
- Enables "chunking with memory" for more coherent search results
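A sketch of how sequential overlaps might be recorded in the `overlaps_with` and `overlap_score` columns follows. The scoring rule is an assumption: the score here is the number of characters shared at the boundary with a neighboring chunk, divided by the chunk's own length. The type and function names are hypothetical.

```typescript
interface SequencedChunk {
  id: string;
  content: string;
}

interface LinkedChunk extends SequencedChunk {
  overlaps_with: string[];
  overlap_score: number[];
}

// Length of the longest suffix of `previous` that is also a prefix of `current`.
function sharedBoundaryLength(previous: string, current: string): number {
  const max = Math.min(previous.length, current.length);
  for (let len = max; len > 0; len--) {
    if (previous.endsWith(current.slice(0, len))) return len;
  }
  return 0;
}

// For each chunk, record the neighbors it shares text with and how much.
function linkSequentialChunks(chunks: SequencedChunk[]): LinkedChunk[] {
  return chunks.map((chunk, i) => {
    const linked: LinkedChunk = { ...chunk, overlaps_with: [], overlap_score: [] };
    const previous = chunks[i - 1];
    const next = chunks[i + 1];
    const candidates: Array<[SequencedChunk | undefined, number]> = [
      [previous, previous ? sharedBoundaryLength(previous.content, chunk.content) : 0],
      [next, next ? sharedBoundaryLength(chunk.content, next.content) : 0],
    ];
    for (const [neighbor, shared] of candidates) {
      if (neighbor && shared > 0) {
        linked.overlaps_with.push(neighbor.id);
        linked.overlap_score.push(shared / chunk.content.length);
      }
    }
    return linked;
  });
}
```

At query time, the recorded neighbor IDs let a matching chunk be expanded with its adjacent chunks for more context.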
A GitHub Actions workflow automatically updates embeddings when content changes:
name: Generate Blog Post Embeddings
on:
  push:
    paths:
      - 'posts/*.md'
    branches:
      - main
jobs:
  generate-embeddings:
    runs-on: ubuntu-latest
    env:
      DATABASE_URL: ${{ secrets.DATABASE_URL }}
      VOYAGE_AI_API_KEY: ${{ secrets.VOYAGE_AI_API_KEY }}
    steps:
      # Setup and identify changed files
      - name: Generate embeddings for changed files
        run: |
          for file in ${{ steps.changed-files.outputs.all_changed_files }}; do
            filename=$(basename "$file")
            slug="${filename%.md}"
            npx tsx scripts/generateEmbeddings.ts "$slug"
          done
This system:
- Triggers on content changes
- Identifies modified files
- Regenerates embeddings only for changed content
- Preserves existing embeddings for unchanged content
- Vector operations can be computationally expensive; an index on the embedding column (pgvector supports HNSW and IVFFlat indexes) speeds up similarity queries
- Embeddings are generated asynchronously, not during page load
- The search UI is client-side with loading states to handle API response time
- Session storage caches recent search results to reduce redundant API calls
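The session storage caching mentioned in the last point might be structured as below. This is a sketch under assumptions: the key format, the `KeyValueStore` interface, and the helper names are hypothetical. The interface is a subset of the browser's `Storage` API, so `window.sessionStorage` can be passed in directly.

```typescript
// Minimal subset of the Web Storage API that the cache needs.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Normalize the query so that "Vectors " and "vectors" share one cache entry.
function searchCacheKey(type: string, query: string): string {
  return `search:${type}:${query.trim().toLowerCase()}`;
}

// Return cached results, or null on a cache miss or unreadable entry.
function readSearchCache<T>(store: KeyValueStore, type: string, query: string): T | null {
  const raw = store.getItem(searchCacheKey(type, query));
  if (raw === null) return null;
  try {
    return JSON.parse(raw) as T;
  } catch {
    return null;
  }
}

function writeSearchCache<T>(store: KeyValueStore, type: string, query: string, results: T): void {
  store.setItem(searchCacheKey(type, query), JSON.stringify(results));
}
```

The search page would call `readSearchCache` before issuing an API request and `writeSearchCache` after a successful response.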