improve migration implementation (#1452)

* improve migration implementation
* refine migrations to include kg
* add alembic cli
* extend documentation
* extend docs and all that
SciPhi-AI · Oct 23, 2024 · f14d3c2
1 parent a478550

Showing 35 changed files with 1,815 additions and 251 deletions.
diff --git a/docs/api-reference/openapi.json b/docs/api-reference/openapi.json
diff --git a/docs/cookbooks/maintenance.mdx b/docs/cookbooks/maintenance.mdx
@@ -0,0 +1,185 @@
+---
+title: 'Maintenance & Scaling'
+description: 'Learn how to maintain and scale your R2R system'
+icon: 'paint-roller'
+---
+
+## Introduction
+
+This guide covers essential maintenance tasks for R2R deployments, with a focus on vector index management and system updates. Understanding when and how to build vector indices, as well as keeping your R2R installation current, is crucial for maintaining optimal performance at scale.
+
+## Vector Indices
+
+### Why Vector Indices Matter
+
+Vector indices are essential for efficient similarity search across documents. Without an index, every search must compare the query against every vector in your database, a cost that grows linearly with the size of your dataset.
+
+Based on benchmarks from similar systems (pgvector), vector indices can provide significant performance improvements:
+- Queries can be 10-100x faster with proper indexing
+- High-dimensional vectors (1536d) benefit even more from indexing than lower-dimensional ones
+- Index performance becomes critical at scale (>100K documents)
+
+### When to Build Vector Indices
+
+Consider building vector indices when:
+- Your document collection exceeds 100K documents
+- Query latency exceeds acceptable thresholds
+- You're using high-dimensional vectors (e.g., 1536d from large language models)
+- You need to support concurrent queries
+
+### Vector Index Creation
+
+R2R supports multiple indexing methods, with HNSW (Hierarchical Navigable Small World) being recommended for most use cases:
+
+```python
+create_response = client.create_vector_index(
+    table_name="vectors",
+    index_method="hnsw",
+    index_measure="cosine_distance",
+    index_arguments={
+        "m": 16,              # Number of connections per element
+        "ef_construction": 64 # Size of dynamic candidate list
+    },
+    concurrently=True
+)
+```
+
+#### Important Considerations
+
+1. **Resource Usage**
+   - Index creation is CPU and memory intensive
+   - Memory usage scales with both dataset size and the `m` parameter
+   - Consider creating indices during off-peak hours
+
+2. **Performance Tuning**
+   - HNSW Parameters:
+     - `m`: 16-64 (higher = better quality, more memory)
+     - `ef_construction`: 64-100 (higher = better quality, longer build time)
+   - Distance Measures:
+     - `cosine_distance`: Best for normalized vectors (most common)
+     - `l2_distance`: Better for absolute distances
+     - `max_inner_product`: Optimized for dot product similarity
+
+3. **Index Warming**
+   - New indices require warming for optimal performance
+   - Initial queries may be slower until the index is loaded into memory
+   - Consider implementing explicit pre-warming in production
+
+### Managing Vector Indices
+
+List existing indices:
+```bash
+r2r list-vector-indices
+```
+
+Delete an index:
+```bash
+r2r delete-vector-index <index-name>
+```
+
+For detailed information about vector index management, see the [Ingestion documentation](/documentation/cli/ingestion).
+
+## System Updates and Maintenance
+
+### Version Management
+
+Check your current R2R version:
+```bash
+r2r version
+```
+
+### Update Process
+
+1. **Prepare for Update**
+   ```bash
+   # Check current versions
+   r2r version
+   r2r db current
+
+   # Generate system report (optional)
+   r2r generate-report
+   ```
+
+2. **Stop Running Services**
+   ```bash
+   r2r docker-down
+   ```
+
+3. **Update R2R**
+   ```bash
+   r2r update
+   ```
+
+4. **Update Database**
+   ```bash
+   r2r db upgrade
+   ```
+
+5. **Restart Services**
+   ```bash
+   r2r serve --docker [additional options]
+   ```
+
+### Database Migration Management
+
+R2R uses database migrations to manage schema changes. Always check and update your database schema after updates:
+
+Check current migration:
+```bash
+r2r db current
+```
+
+Apply migrations:
+```bash
+r2r db upgrade
+```
+
+### Managing Multiple Environments
+
+Use different project names and schemas for different environments:
+
+```bash
+# Development
+export R2R_PROJECT_NAME=r2r_dev
+r2r serve --docker --project-name r2r-dev
+
+# Staging
+export R2R_PROJECT_NAME=r2r_staging
+r2r serve --docker --project-name r2r-staging
+
+# Production
+export R2R_PROJECT_NAME=r2r_prod
+r2r serve --docker --project-name r2r-prod
+```
+
+## Troubleshooting
+
+If issues occur:
+
+1. Generate a system report:
+   ```bash
+   r2r generate-report
+   ```
+
+2. Restart the containers:
+   ```bash
+   r2r docker-down
+   r2r serve --docker
+   ```
+
+3. Review database state:
+   ```bash
+   r2r db current
+   r2r db history
+   ```
+
+4. Roll back if needed:
+   ```bash
+   r2r db downgrade --revision <previous-working-version>
+   ```
+
+## Additional Resources
+
+- [Python SDK Ingestion Documentation](/documentation/python-sdk/ingestion)
+- [CLI Maintenance Documentation](/documentation/cli/maintenance)
+- [Ingestion Configuration Documentation](/documentation/configuration/ingestion/overview)
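Note on the "Index Warming" guidance in the cookbook above: explicit pre-warming can be sketched as a loop that replays representative queries right after the index is built. The helper below is a minimal sketch, not part of R2R. `prewarm_index`, its parameters, and the `search` callable are hypothetical names introduced for illustration; in practice `search` would wrap whatever search call your deployment exposes.

```python
import time
from typing import Callable, List, Sequence


def prewarm_index(
    search: Callable[[str], object],
    queries: Sequence[str],
    rounds: int = 2,
) -> List[float]:
    """Replay representative queries and return each call's latency in seconds.

    `search` is any callable that runs one similarity search (hypothetical;
    wrap your own client call here). Results are discarded, since the goal
    is only to pull the index into memory.
    """
    if rounds < 1:
        raise ValueError("rounds must be at least 1")

    latencies: List[float] = []
    for _ in range(rounds):
        for query in queries:
            start = time.perf_counter()
            search(query)  # result intentionally ignored
            latencies.append(time.perf_counter() - start)
    return latencies


if __name__ == "__main__":
    # Stand-in for a real search call, so the sketch runs on its own.
    def fake_search(query: str) -> list:
        return []

    timings = prewarm_index(fake_search, ["what is r2r?", "vector index"])
    print(f"ran {len(timings)} warm-up queries")
```

Comparing the latencies of the first round with those of later rounds gives a rough signal of whether the index has been loaded into memory.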
diff --git a/docs/documentation/cli/ingestion.mdx b/docs/documentation/cli/ingestion.mdx
@@ -1,6 +1,6 @@
 ---
 title: 'Ingestion'
-description: 'Ingesting files with the R2R CLI.'
+description: 'Ingesting files and managing vector indices with the R2R CLI.'
 ---
 
 ## Document Ingestion and Management
@@ -73,9 +73,138 @@ r2r update-files path/to/file1_v2.txt \
   </Accordion>
 </AccordionGroup>
 
+## Vector Index Management
+
+### Create Vector Index
+
+Create a new vector index for similarity search using the `create-vector-index` command:
+
+```bash
+r2r create-vector-index \
+  --table-name vectors \
+  --index-method hnsw \
+  --index-measure cosine_distance \
+  --index-arguments '{"m": 16, "ef_construction": 64}'
+```
+
+<AccordionGroup>
+  <Accordion title="Arguments">
+    <ParamField path="--table-name" type="str">
+      Table to create index on. Options: vectors, entities_document, entities_collection, communities. Default: vectors
+    </ParamField>
+
+    <ParamField path="--index-method" type="str">
+      Indexing method to use. Options: hnsw, ivfflat, auto. Default: hnsw
+    </ParamField>
+
+    <ParamField path="--index-measure" type="str">
+      Distance measure for vector comparisons. Options: cosine_distance, l2_distance, max_inner_product. Default: cosine_distance
+    </ParamField>
+
+    <ParamField path="--index-arguments" type="str">
+      Configuration parameters as JSON string. For HNSW: `{"m": int, "ef_construction": int}`. For IVFFlat: `{"n_lists": int}`
+    </ParamField>
+
+    <ParamField path="--index-name" type="str">
+      Optional custom name for the index. If not provided, one will be auto-generated
+    </ParamField>
+
+    <ParamField path="--no-concurrent" type="flag">
+      Disable concurrent index creation. Default: False
+    </ParamField>
+  </Accordion>
+</AccordionGroup>
+
+#### Important Considerations
+
+Vector index creation requires careful planning and consideration of your data and performance requirements. Keep in mind:
+
+**Resource Intensive Process**
+- Index creation can be CPU and memory intensive, especially for large datasets
+- For HNSW indexes, memory usage scales with both dataset size and the `m` parameter
+- Consider creating indexes during off-peak hours for production systems
+
+**Performance Tuning**
+1. **HNSW Parameters:**
+   - `m`: Higher values (16-64) improve search quality but increase memory usage and build time
+   - `ef_construction`: Higher values increase build time and quality but have diminishing returns past 100
+   - Recommended starting point: `m=16`, `ef_construction=64`
+
+```bash
+# Example balanced configuration
+r2r create-vector-index \
+  --table-name vectors \
+  --index-method hnsw \
+  --index-measure cosine_distance \
+  --index-arguments '{"m": 16, "ef_construction": 64}'
+```
+
+**Pre-warming Required**
+- **Important:** Newly created indexes require pre-warming to achieve optimal performance
+- Initial queries may be slower until the index is loaded into memory
+- The first several queries will automatically warm the index
+- For production systems, consider implementing explicit pre-warming by running representative queries after index creation
+- Without pre-warming, you may not see the expected performance improvements
+
+**Best Practices**
+1. Always use concurrent index creation (avoid `--no-concurrent`) in production to prevent blocking other operations
+2. Monitor system resources during index creation
+3. Test index performance with representative queries before deploying
+4. Consider creating indexes on smaller test datasets first to validate parameters
+5. Implement an index pre-warming strategy before handling production traffic
+
+**Distance Measures**
+Choose the appropriate measure based on your use case:
+- `cosine_distance`: Best for normalized vectors (most common)
+- `l2_distance`: Better for absolute distances
+- `max_inner_product`: Optimized for dot product similarity
+
+### List Vector Indices
+
+List existing vector indices using the `list-vector-indices` command:
+
+```bash
+r2r list-vector-indices --table-name vectors
+```
+
+<AccordionGroup>
+  <Accordion title="Arguments">
+    <ParamField path="--table-name" type="str">
+      Table to list indices from. Options: vectors, entities_document, entities_collection, communities. Default: vectors
+    </ParamField>
+  </Accordion>
+</AccordionGroup>
+
+### Delete Vector Index
+
+Delete a vector index using the `delete-vector-index` command:
+
+```bash
+r2r delete-vector-index my-index-name --table-name vectors
+```
+
+<AccordionGroup>
+  <Accordion title="Arguments">
+    <ParamField path="index-name" type="str" required>
+      Name of the index to delete
+    </ParamField>
+
+    <ParamField path="--table-name" type="str">
+      Table containing the index. Options: vectors, entities_document, entities_collection, communities. Default: vectors
+    </ParamField>
+
+    <ParamField path="--no-concurrent" type="flag">
+      Disable concurrent index deletion. Default: False
+    </ParamField>
+  </Accordion>
+</AccordionGroup>
+
+## Sample File Management
+
 ### Ingest Sample Files
 
-Ingest one or more sample files from the R2R GitHub repository using the `ingest-sample-file` or `ingest-sample-files` commands:
+Ingest one or more sample files from the R2R GitHub repository:
 
 ```bash
 # Ingest a single sample file
@@ -92,7 +221,7 @@ These commands have no additional arguments. The `--v2` flag for `ingest-sample-
 
 ### Ingest Local Sample Files
 
-Ingest the local sample files in the `core/examples/data_unstructured` directory using the `ingest-sample-files-from-unstructured` command:
+Ingest the local sample files in the `core/examples/data_unstructured` directory:
 
 ```bash
 r2r ingest-sample-files-from-unstructured

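Note on the `create-vector-index` section above: `--index-arguments` takes a JSON string, which is easy to misquote when the command is assembled by hand or from a script. The sketch below builds the argument list in Python, using only the flags and option values listed in that section. The function name `build_create_index_command` and its defaults are illustrative assumptions, not part of the R2R CLI or SDK.

```python
import json
import shlex
from typing import Dict, List, Optional

# Option values as listed in the CLI documentation above.
INDEX_METHODS = {"hnsw", "ivfflat", "auto"}
INDEX_MEASURES = {"cosine_distance", "l2_distance", "max_inner_product"}


def build_create_index_command(
    table_name: str = "vectors",
    index_method: str = "hnsw",
    index_measure: str = "cosine_distance",
    index_arguments: Optional[Dict[str, int]] = None,
) -> List[str]:
    """Return the argv list for `r2r create-vector-index` (hypothetical helper)."""
    if index_method not in INDEX_METHODS:
        raise ValueError(f"unknown index method: {index_method}")
    if index_measure not in INDEX_MEASURES:
        raise ValueError(f"unknown index measure: {index_measure}")
    if index_arguments is None:
        # Recommended starting point from the documentation above.
        index_arguments = {"m": 16, "ef_construction": 64}

    return [
        "r2r", "create-vector-index",
        "--table-name", table_name,
        "--index-method", index_method,
        "--index-measure", index_measure,
        # json.dumps produces valid JSON; no manual quoting needed.
        "--index-arguments", json.dumps(index_arguments),
    ]


if __name__ == "__main__":
    command = build_create_index_command()
    # shlex.join adds shell quoting for display or logging.
    print(shlex.join(command))
```

Passing the list directly to `subprocess.run(command, check=True)` avoids shell quoting entirely; `shlex.join` is only needed when the command has to be shown or pasted into a shell.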
diff --git a/docs/documentation/cli/introduction.mdx b/docs/documentation/cli/introduction.mdx
@@ -58,5 +58,5 @@ For more detailed information on specific functionalities of the R2R CLI, please
 
 - [Document Ingestion](/documentation/cli/ingestion): Learn how to add, retrieve, and manage documents using the CLI.
 - [Search & RAG](/documentation/cli/retrieval): Explore various querying techniques and Retrieval-Augmented Generation capabilities.
-- [Knowledge Graphs](/documentation/cli/graphrag): Learn how to create and enrich knowledge graphs, and perform GraphRAG.
+- [Knowledge Graphs](/documentation/cli/graph): Learn how to create and enrich knowledge graphs, and perform GraphRAG.
 - [Server Management](/documentation/cli/server): Manage your R2R server, including health checks, logs, and updates.