improve migration implementation (#1452)
* improve migration implementation

* refine migrations to include kg

* add alembic cli

* extend documentation

* extend docs and all that
emrgnt-cmplxty authored Oct 23, 2024
1 parent a478550 commit f14d3c2
Showing 35 changed files with 1,815 additions and 251 deletions.
2 changes: 1 addition & 1 deletion docs/api-reference/openapi.json

Large diffs are not rendered by default.

185 changes: 185 additions & 0 deletions docs/cookbooks/maintenance.mdx
@@ -0,0 +1,185 @@
---
title: 'Maintenance & Scaling'
description: 'Learn how to maintain and scale your R2R system'
icon: 'paint-roller'
---

## Introduction

This guide covers essential maintenance tasks for R2R deployments, with a focus on vector index management and system updates. Understanding when and how to build vector indices, as well as keeping your R2R installation current, is crucial for maintaining optimal performance at scale.

## Vector Indices

### Why Vector Indices Matter

Vector indices are essential for efficient similarity search across documents. Without an index, every search would require comparing against every vector in your database - a process that becomes increasingly expensive as your dataset grows.
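
To make the cost of an unindexed search concrete, here is a minimal sketch in plain Python. It is not R2R's implementation: `cosine_distance`, `brute_force_search`, and the sample vectors are illustrative names and data. Every query scores all stored vectors, so the work grows linearly with the size of the collection.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_search(query, vectors, k=2):
    """Score every stored vector against the query: O(N) work per search."""
    scored = [(cosine_distance(query, v), i) for i, v in enumerate(vectors)]
    scored.sort()  # smallest distance first
    return [i for _, i in scored[:k]]


# Illustrative data only: four 2-dimensional vectors.
stored = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
nearest = brute_force_search([1.0, 0.0], stored, k=2)
```

An approximate index such as HNSW avoids this full scan by visiting only a small part of the collection per query, trading a little recall for much lower latency.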

Based on benchmarks from similar systems (pgvector), vector indices can provide significant performance improvements:
- Queries can be 10-100x faster with proper indexing
- High-dimensional vectors (1536d) benefit even more from indexing than lower-dimensional ones
- Index performance becomes critical at scale (>100K documents)

### When to Build Vector Indices

Consider building vector indices when:
- Your document collection exceeds 100K documents
- Query latency exceeds acceptable thresholds
- You're using high-dimensional vectors (e.g., 1536-dimensional embeddings)
- You need to support concurrent queries

### Vector Index Creation

R2R supports multiple indexing methods, with HNSW (Hierarchical Navigable Small World) being recommended for most use cases:

```python
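from r2r import R2RClient

# Assumes an R2R server is reachable at this address; adjust as needed.
client = R2RClient("http://localhost:7272")

create_response = client.create_vector_index(
    table_name="vectors",
    index_method="hnsw",
    index_measure="cosine_distance",
    index_arguments={
        "m": 16,  # Number of connections per element
        "ef_construction": 64,  # Size of dynamic candidate list
    },
    concurrently=True,  # Build without blocking other operations
)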
```

#### Important Considerations

1. **Resource Usage**
   - Index creation is CPU and memory intensive
   - Memory usage scales with both dataset size and the `m` parameter
   - Consider creating indices during off-peak hours

2. **Performance Tuning**
   - HNSW Parameters:
     - `m`: 16-64 (higher = better quality, more memory)
     - `ef_construction`: 64-100 (higher = better quality, longer build time)
   - Distance Measures:
     - `cosine_distance`: Best for normalized vectors (most common)
     - `l2_distance`: Better for absolute distances
     - `max_inner_product`: Optimized for dot product similarity

3. **Index Warming**
   - New indices require warming for optimal performance
   - Initial queries may be slower until the index is loaded into memory
   - Consider implementing explicit pre-warming in production
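
Explicit pre-warming can be as simple as replaying representative queries after the index is built. The following is a minimal sketch under stated assumptions: `prewarm_index` is a hypothetical helper, and `search_fn` stands in for whatever callable issues a search against your deployment. It is not part of the R2R SDK.

```python
import time


def prewarm_index(search_fn, queries, rounds=2):
    """Replay representative queries; return the mean latency (seconds) per round.

    search_fn is any callable that takes one query (hypothetical stand-in
    for your own search call).
    """
    if not queries:
        raise ValueError("at least one representative query is required")
    round_means = []
    for _ in range(rounds):
        elapsed = []
        for query in queries:
            start = time.perf_counter()
            search_fn(query)
            elapsed.append(time.perf_counter() - start)
        round_means.append(sum(elapsed) / len(elapsed))
    return round_means
```

Comparing the per-round means gives a rough signal: if later rounds are no faster than the first, the index was likely already in memory.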

### Managing Vector Indices

List existing indices:
```bash
r2r list-vector-indices
```

Delete an index:
```bash
r2r delete-vector-index <index-name>
```

For detailed information about vector index management, see the [Ingestion documentation](/documentation/cli/ingestion).

## System Updates and Maintenance

### Version Management

Check your current R2R version:
```bash
r2r version
```

### Update Process

1. **Prepare for Update**
```bash
# Check current versions
r2r version
r2r db current

# Generate system report (optional)
r2r generate-report
```

2. **Stop Running Services**
```bash
r2r docker-down
```

3. **Update R2R**
```bash
r2r update
```

4. **Update Database**
```bash
r2r db upgrade
```

5. **Restart Services**
```bash
r2r serve --docker [additional options]
```
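
The steps above can also be scripted so they always run in the same order. This is a minimal sketch, not an official R2R tool: `UPDATE_STEPS` and `run_update` are hypothetical names, and the commands are exactly those listed in the steps. It defaults to a dry run that only returns the planned commands.

```python
import subprocess

# Commands taken from the update steps, in order.
UPDATE_STEPS = [
    ["r2r", "version"],
    ["r2r", "db", "current"],
    ["r2r", "docker-down"],
    ["r2r", "update"],
    ["r2r", "db", "upgrade"],
    ["r2r", "serve", "--docker"],
]


def run_update(steps=UPDATE_STEPS, dry_run=True):
    """Run each step in order and stop at the first failure.

    With dry_run=True nothing is executed; the planned commands are returned.
    """
    executed = []
    for cmd in steps:
        if not dry_run:
            # check=True raises CalledProcessError if a step exits non-zero.
            subprocess.run(cmd, check=True)
        executed.append(" ".join(cmd))
    return executed


plan = run_update(dry_run=True)
```

Stopping at the first failure matters here: restarting services after a failed `r2r db upgrade` could leave the application running against an outdated schema.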

### Database Migration Management

R2R uses database migrations to manage schema changes. Always check and update your database schema after updates:

Check current migration:
```bash
r2r db current
```

Apply migrations:
```bash
r2r db upgrade
```

### Managing Multiple Environments

Use different project names and schemas for different environments:

```bash
# Development
export R2R_PROJECT_NAME=r2r_dev
r2r serve --docker --project-name r2r-dev

# Staging
export R2R_PROJECT_NAME=r2r_staging
r2r serve --docker --project-name r2r-staging

# Production
export R2R_PROJECT_NAME=r2r_prod
r2r serve --docker --project-name r2r-prod
```

## Troubleshooting

If issues occur:

1. Generate a system report:
```bash
r2r generate-report
```

2. Restart the containers to rule out a stale or unhealthy service:
```bash
r2r docker-down
r2r serve --docker
```

3. Review database state:
```bash
r2r db current
r2r db history
```

4. Roll back if needed:
```bash
r2r db downgrade --revision <previous-working-version>
```

## Additional Resources

- [Python SDK Ingestion Documentation](/documentation/python-sdk/ingestion)
- [CLI Maintenance Documentation](/documentation/cli/maintenance)
- [Ingestion Configuration Documentation](/documentation/configuration/ingestion/overview)
135 changes: 132 additions & 3 deletions docs/documentation/cli/ingestion.mdx
@@ -1,6 +1,6 @@
---
title: 'Ingestion'
description: 'Ingesting files with the R2R CLI.'
description: 'Ingesting files and managing vector indices with the R2R CLI.'
---

## Document Ingestion and Management
@@ -73,9 +73,138 @@ r2r update-files path/to/file1_v2.txt \
</Accordion>
</AccordionGroup>

## Vector Index Management

### Create Vector Index

Create a new vector index for similarity search using the `create-vector-index` command:

```bash
r2r create-vector-index \
--table-name vectors \
--index-method hnsw \
--index-measure cosine_distance \
--index-arguments '{"m": 16, "ef_construction": 64}'
```

<AccordionGroup>
<Accordion title="Arguments">
<ParamField path="--table-name" type="str">
Table to create index on. Options: vectors, entities_document, entities_collection, communities. Default: vectors
</ParamField>

<ParamField path="--index-method" type="str">
Indexing method to use. Options: hnsw, ivfflat, auto. Default: hnsw
</ParamField>

<ParamField path="--index-measure" type="str">
Distance measure for vector comparisons. Options: cosine_distance, l2_distance, max_inner_product. Default: cosine_distance
</ParamField>

<ParamField path="--index-arguments" type="str">
Configuration parameters as JSON string. For HNSW: `{"m": int, "ef_construction": int}`. For IVFFlat: `{"n_lists": int}`
</ParamField>

<ParamField path="--index-name" type="str">
Optional custom name for the index. If not provided, one will be auto-generated
</ParamField>

<ParamField path="--no-concurrent" type="flag">
Disable concurrent index creation. Default: False
</ParamField>
</Accordion>
</AccordionGroup>
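
Because `--index-arguments` takes a JSON string, it can help to build and check that string programmatically. The sketch below is a hypothetical helper, not part of the R2R CLI: `build_index_arguments` and `ALLOWED_KEYS` are illustrative names. The allowed keys come from the parameter description above; requiring positive integers is an assumption, and `auto` is rejected only because no arguments are documented for it here.

```python
import json

# Keys documented for each method in the argument description above.
ALLOWED_KEYS = {
    "hnsw": {"m", "ef_construction"},
    "ivfflat": {"n_lists"},
}


def build_index_arguments(method, **params):
    """Return a JSON string suitable for --index-arguments."""
    allowed = ALLOWED_KEYS.get(method)
    if allowed is None:
        raise ValueError(f"no documented arguments for method {method!r}")
    unknown = set(params) - allowed
    if unknown:
        raise ValueError(f"unexpected keys for {method}: {sorted(unknown)}")
    for name, value in params.items():
        # Assumption: every parameter must be a positive integer.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be a positive integer")
    return json.dumps(params)


args = build_index_arguments("hnsw", m=16, ef_construction=64)
```

The resulting string can be passed, single-quoted, as the value of `--index-arguments`.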

#### Important Considerations

Plan vector index creation around your data size and performance requirements. Keep in mind:

**Resource Intensive Process**
- Index creation can be CPU and memory intensive, especially for large datasets
- For HNSW indexes, memory usage scales with both dataset size and `m` parameter
- Consider creating indexes during off-peak hours for production systems

**Performance Tuning**
1. **HNSW Parameters:**
   - `m`: Higher values (16-64) improve search quality but increase memory usage and build time
   - `ef_construction`: Higher values increase build time and quality but have diminishing returns past 100
   - Recommended starting point: `m=16`, `ef_construction=64`

```bash
# Example balanced configuration
r2r create-vector-index \
--table-name vectors \
--index-method hnsw \
--index-measure cosine_distance \
--index-arguments '{"m": 16, "ef_construction": 64}'
```

**Pre-warming Required**
- **Important:** Newly created indexes require pre-warming to achieve optimal performance
- Initial queries may be slower until the index is loaded into memory
- The first several queries will automatically warm the index
- For production systems, consider implementing explicit pre-warming by running representative queries after index creation
- Without pre-warming, you may not see the expected performance improvements

**Best Practices**
1. Always use concurrent index creation (avoid `--no-concurrent`) in production to prevent blocking other operations
2. Monitor system resources during index creation
3. Test index performance with representative queries before deploying
4. Consider creating indexes on smaller test datasets first to validate parameters
5. Implement an index pre-warming strategy before handling production traffic

**Distance Measures**
Choose the appropriate measure based on your use case:
- `cosine_distance`: Best for normalized vectors (most common)
- `l2_distance`: Better for absolute distances
- `max_inner_product`: Optimized for dot product similarity
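
To see how the three measures differ, the following minimal sketch computes each quantity for the same pair of vectors in plain Python. The function names and sample vectors are illustrative and do not reflect how R2R or the database computes them internally.

```python
import math


def inner_product(a, b):
    """Dot product: sensitive to both direction and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance: sensitive to absolute position."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity: depends on direction only (non-zero vectors)."""
    norms = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return 1.0 - inner_product(a, b) / norms


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

cos_d = cosine_distance(a, b)  # direction matches, so this is (near) zero
l2_d = l2_distance(a, b)       # lengths differ, so this is non-zero
dot = inner_product(a, b)
```

The two vectors point the same way, so the cosine distance treats them as identical while the L2 distance does not. When vectors are normalized to unit length, all three measures yield the same ranking.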

### List Vector Indices

List existing vector indices using the `list-vector-indices` command:

```bash
r2r list-vector-indices --table-name vectors
```

<AccordionGroup>
<Accordion title="Arguments">
<ParamField path="--table-name" type="str">
Table to list indices from. Options: vectors, entities_document, entities_collection, communities. Default: vectors
</ParamField>
</Accordion>
</AccordionGroup>

### Delete Vector Index

Delete a vector index using the `delete-vector-index` command:

```bash
r2r delete-vector-index my-index-name --table-name vectors
```

<AccordionGroup>
<Accordion title="Arguments">
<ParamField path="index-name" type="str" required>
Name of the index to delete
</ParamField>

<ParamField path="--table-name" type="str">
Table containing the index. Options: vectors, entities_document, entities_collection, communities. Default: vectors
</ParamField>

<ParamField path="--no-concurrent" type="flag">
Disable concurrent index deletion. Default: False
</ParamField>
</Accordion>
</AccordionGroup>

## Sample File Management

### Ingest Sample Files

Ingest one or more sample files from the R2R GitHub repository using the `ingest-sample-file` or `ingest-sample-files` commands:
Ingest one or more sample files from the R2R GitHub repository:

```bash
# Ingest a single sample file
@@ -92,7 +221,7 @@ These commands have no additional arguments. The `--v2` flag for `ingest-sample-

### Ingest Local Sample Files

Ingest the local sample files in the `core/examples/data_unstructured` directory using the `ingest-sample-files-from-unstructured` command:
Ingest the local sample files in the `core/examples/data_unstructured` directory:

```bash
r2r ingest-sample-files-from-unstructured
2 changes: 1 addition & 1 deletion docs/documentation/cli/introduction.mdx
@@ -58,5 +58,5 @@ For more detailed information on specific functionalities of the R2R CLI, please

- [Document Ingestion](/documentation/cli/ingestion): Learn how to add, retrieve, and manage documents using the CLI.
- [Search & RAG](/documentation/cli/retrieval): Explore various querying techniques and Retrieval-Augmented Generation capabilities.
- [Knowledge Graphs](/documentation/cli/graphrag): Learn how to create and enrich knowledge graphs, and perform GraphRAG.
- [Knowledge Graphs](/documentation/cli/graph): Learn how to create and enrich knowledge graphs, and perform GraphRAG.
- [Server Management](/documentation/cli/server): Manage your R2R server, including health checks, logs, and updates.