Feature/include vectors option document chunks #1419

Merged
2 changes: 1 addition & 1 deletion docs/api-reference/openapi.json

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions docs/documentation/cli/management.mdx
@@ -119,5 +119,10 @@ r2r document-chunks --document-id doc1 --offset 0 --limit 10
<ParamField path="--limit" type="int">
The maximum number of nodes to return. Defaults to 100.
</ParamField>

<ParamField path="include_vectors" type="Optional[bool]">
An optional value to return the vectors associated with each chunk, defaults to `False`.
</ParamField>

</Accordion>
</AccordionGroup>
20 changes: 17 additions & 3 deletions docs/documentation/js-sdk/ingestion.mdx
@@ -57,7 +57,7 @@ const ingestResponse = await client.ingestFiles(files, {
</ParamField>

<ParamField path="ingestion_config" type="Optional[Union[dict, ChunkingConfig]]">
The ingestion config override parameter enables developers to customize their R2R chunking strategy at runtime.
The ingestion config override parameter enables developers to customize their R2R chunking strategy at runtime. Learn more about [configuration here](/documentation/configuration/ingestion/parsing_and_chunking).
<Expandable title="properties">
<ParamField path="provider" type="str" default="unstructured_local">
Which chunking provider to use. Options are "r2r", "unstructured_local", or "unstructured_api".
@@ -273,7 +273,7 @@ const updateResponse = await client.updateFiles(files, {
</ParamField>

<ParamField path="ingestion_config" type="Record<string, any>">
The ingestion config override parameter enables developers to customize their R2R chunking strategy at runtime.
The ingestion config override parameter enables developers to customize their R2R chunking strategy at runtime. Learn more about [configuration here](/documentation/configuration/ingestion/parsing_and_chunking).
<Expandable title="properties">
<ParamField path="provider" type="str" default="r2r">
Which chunking provider to use, `r2r` or `unstructured`. Selecting `unstructured` is generally recommended when parsing with `unstructured` or `unstructured_api`.
@@ -335,6 +335,12 @@ const documentsOverview = await client.documentsOverview();
<ParamField path="document_ids" type="Array<string>">
An optional array of document IDs to filter the overview.
</ParamField>
<ParamField path="offset" type="Optional[int]">
An optional value to offset the starting point of fetched results, defaults to `0`.
</ParamField>
<ParamField path="limit" type="Optional[int]">
An optional value to limit the fetched results, defaults to `100`.
</ParamField>


### Document Chunks
@@ -368,7 +374,15 @@ const chunks = await client.documentChunks(documentId);
<ParamField path="document_id" type="string" required>
The ID of the document to retrieve chunks for.
</ParamField>

<ParamField path="offset" type="Optional[int]">
An optional value to offset the starting point of fetched results, defaults to `0`.
</ParamField>
<ParamField path="limit" type="Optional[int]">
An optional value to limit the fetched results, defaults to `100`.
</ParamField>
<ParamField path="include_vectors" type="Optional[bool]">
An optional value to return the vectors associated with each chunk, defaults to `False`.
</ParamField>

### Delete Documents

21 changes: 19 additions & 2 deletions docs/documentation/python-sdk/ingestion.mdx
@@ -67,7 +67,7 @@ Refer to the [ingestion configuration](/documentation/configuration/ingestion/pa


<ParamField path="ingestion_config" type="Optional[Union[dict, IngestionConfig]]">
The ingestion config override parameter enables developers to customize their R2R chunking strategy at runtime.
The ingestion config override parameter enables developers to customize their R2R chunking strategy at runtime. Learn more about [configuration here](/documentation/configuration/ingestion/parsing_and_chunking).
<Expandable title="Other Provider Options">
<ParamField path="provider" type="str" default="r2r">
Which R2R ingestion provider to use. Options are "r2r".
@@ -287,7 +287,7 @@ The ingestion configuration can be customized analogously to the ingest files en


<ParamField path="ingestion_config" type="Optional[Union[dict, IngestionConfig]]">
The ingestion config override parameter enables developers to customize their R2R chunking strategy at runtime.
The ingestion config override parameter enables developers to customize their R2R chunking strategy at runtime. Learn more about [configuration here](/documentation/configuration/ingestion/parsing_and_chunking).
<Expandable title="Other Provider Options">
<ParamField path="provider" type="str" default="r2r">
Which R2R ingestion provider to use. Options are "r2r".
@@ -458,6 +458,13 @@ documents_overview = client.documents_overview()
<ParamField path="document_ids" type="Optional[list[Union[UUID, str]]]">
An optional list of document IDs to filter the overview.
</ParamField>
<ParamField path="offset" type="Optional[int]">
An optional value to offset the starting point of fetched results, defaults to `0`.
</ParamField>
<ParamField path="limit" type="Optional[int]">
An optional value to limit the fetched results, defaults to `100`.
</ParamField>
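
A short pagination sketch using the new parameters (the page size is arbitrary, `client` is assumed to be the already-initialized client from the snippet above, and the response is assumed to wrap records under `results`):

```python
# Sketch only: page through the documents overview 50 records at a time.
page_size = 50
offset = 0
while True:
    page = client.documents_overview(offset=offset, limit=page_size)
    results = page["results"]  # assumed response shape
    if not results:
        break
    for doc in results:
        print(doc)
    offset += page_size
```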


### Document Chunks

@@ -493,6 +500,16 @@ chunks = client.document_chunks(document_id)
<ParamField path="document_id" type="str" required>
The ID of the document to retrieve chunks for.
</ParamField>
<ParamField path="offset" type="Optional[int]">
An optional value to offset the starting point of fetched results, defaults to `0`.
</ParamField>
<ParamField path="limit" type="Optional[int]">
An optional value to limit the fetched results, defaults to `100`.
</ParamField>
<ParamField path="include_vectors" type="Optional[bool]">
An optional value to return the vectors associated with each chunk, defaults to `False`.
</ParamField>
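
A minimal usage sketch of the new option (the document ID is a placeholder, and `client` is the already-initialized client from the snippet above):

```python
# Sketch: fetch the first ten chunks of a document together with their vectors.
chunks = client.document_chunks(
    "9fbe403b-c11c-5aae-8ade-ef22980c3ad1",  # placeholder document ID
    offset=0,
    limit=10,
    include_vectors=True,
)

for chunk in chunks["results"]:
    # "vector" is only populated when include_vectors is set; otherwise it is None.
    print(chunk["extraction_id"], len(chunk["vector"]))
```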


### Delete Documents

14 changes: 12 additions & 2 deletions py/cli/commands/management.py
@@ -126,16 +126,24 @@ def documents_overview(ctx, document_ids, offset, limit):
default=None,
help="The maximum number of nodes to return. Defaults to 100.",
)
@click.option(
"--include-vectors",
is_flag=True,
default=False,
help="Should the vector be included in the response chunks",
)
@pass_context
def document_chunks(ctx, document_id, offset, limit):
def document_chunks(ctx, document_id, offset, limit, include_vectors):
"""Get chunks of a specific document."""
client = ctx.obj
if not document_id:
click.echo("Error: Document ID is required.")
return

with timer():
chunks_data = client.document_chunks(document_id, offset, limit)
chunks_data = client.document_chunks(
document_id, offset, limit, include_vectors
)

chunks = chunks_data["results"]
if not chunks:
@@ -150,5 +158,7 @@ def document_chunks(ctx, document_id, offset, limit):
click.echo(f"Extraction ID: {chunk.get('id', 'N/A')}")
click.echo(f"Text: {chunk.get('text', '')[:100]}...")
click.echo(f"Metadata: {chunk.get('metadata', {})}")
if include_vectors:
click.echo(f"Vector: {chunk.get('vector', 'N/A')}")
else:
click.echo(f"Unexpected chunk format: {chunk}")
3 changes: 2 additions & 1 deletion py/core/main/api/management_router.py
@@ -367,12 +367,13 @@ async def document_chunks_app(
document_id: str = Path(...),
offset: Optional[int] = Query(0, ge=0),
limit: Optional[int] = Query(100, ge=0),
include_vectors: Optional[bool] = Query(False),
auth_user=Depends(self.service.providers.auth.auth_wrapper),
) -> WrappedDocumentChunkResponse:
document_uuid = UUID(document_id)

document_chunks = await self.service.document_chunks(
document_uuid, offset, limit
document_uuid, offset, limit, include_vectors
)

document_chunks_result = document_chunks["results"]
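For reference, a hedged sketch of exercising the new query parameter directly against this endpoint (the base URL, `/v2` route prefix, and bearer token are placeholders; in practice the SDK and CLI calls above handle this for you):

```python
# Sketch only: hit the document_chunks endpoint with include_vectors enabled.
import requests

response = requests.get(
    "http://localhost:7272/v2/document_chunks/9fbe403b-c11c-5aae-8ade-ef22980c3ad1",
    params={"offset": 0, "limit": 10, "include_vectors": "true"},
    headers={"Authorization": "Bearer <token>"},  # auth scheme assumed
)
response.raise_for_status()
for chunk in response.json()["results"]:
    print(chunk["extraction_id"], chunk["vector"] is not None)
```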
6 changes: 5 additions & 1 deletion py/core/main/services/management_service.py
@@ -365,11 +365,15 @@ async def document_chunks(
document_id: UUID,
offset: int = 0,
limit: int = 100,
include_vectors: bool = False,
*args,
**kwargs,
):
return self.providers.database.vector.get_document_chunks(
document_id, offset=offset, limit=limit
document_id,
offset=offset,
limit=limit,
include_vectors=include_vectors,
)

@telemetry_event("AssignDocumentToCollection")
19 changes: 16 additions & 3 deletions py/core/providers/database/vector.py
@@ -1,5 +1,6 @@
import concurrent.futures
import copy
import json
import logging
import time
from concurrent.futures import ThreadPoolExecutor
@@ -490,16 +491,25 @@ def delete_collection(self, collection_id: str) -> None:
raise

def get_document_chunks(
self, document_id: str, offset: int = 0, limit: int = -1
self,
document_id: str,
offset: int = 0,
limit: int = -1,
include_vectors: bool = False,
) -> dict[str, Any]:
if not self.collection:
raise ValueError("Collection is not initialized.")

limit_clause = f"LIMIT {limit}" if limit != -1 else ""
table_name = self.collection.table.name

select_clause = "SELECT extraction_id, document_id, user_id, collection_ids, text, metadata"
if include_vectors:
select_clause += ", vec"

query = text(
f"""
SELECT extraction_id, document_id, user_id, collection_ids, text, metadata, COUNT(*) OVER() AS total
{select_clause}, COUNT(*) OVER() AS total
FROM {self.project_name}."{table_name}"
WHERE document_id = :document_id
ORDER BY CAST(metadata->>'chunk_order' AS INTEGER)
@@ -518,7 +528,7 @@ def get_document_chunks(
total = 0

if results:
total = results[0][6]
total = results[0][-1] # Get the total count from the last column
chunks = [
{
"extraction_id": result[0],
@@ -527,6 +537,9 @@
"collection_ids": result[3],
"text": result[4],
"metadata": result[5],
"vector": (
json.loads(result[6]) if include_vectors else None
),
}
for result in results
]
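To make the dynamic SELECT concrete, here is a sketch of the rendered query text when `include_vectors=True` (the project and table names are placeholders, and the trailing offset/limit clauses are omitted):

```python
# Sketch of the generated SQL for include_vectors=True; identifiers are placeholders.
rendered_query = """
SELECT extraction_id, document_id, user_id, collection_ids, text, metadata, vec, COUNT(*) OVER() AS total
FROM my_project."demo_vecs"
WHERE document_id = :document_id
ORDER BY CAST(metadata->>'chunk_order' AS INTEGER)
"""
```

Because `vec` is appended before the window-count column, `total` is no longer at a fixed position, which is why the result handling now reads it from the last column rather than index 6.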
3 changes: 3 additions & 0 deletions py/sdk/management.py
@@ -248,6 +248,7 @@ async def document_chunks(
document_id: str,
offset: Optional[int] = None,
limit: Optional[int] = None,
include_vectors: Optional[bool] = False,
) -> dict:
"""
Get the chunks for a document.
@@ -263,6 +264,8 @@
params["offset"] = offset
if limit is not None:
params["limit"] = limit
if include_vectors:
params["include_vectors"] = include_vectors
if not params:
return await client._make_request(
"GET", f"document_chunks/{document_id}"
1 change: 1 addition & 0 deletions py/shared/api/models/management/responses.py
@@ -106,6 +106,7 @@ class DocumentChunkResponse(BaseModel):
collection_ids: list[UUID]
text: str
metadata: dict[str, Any]
vector: Optional[list[float]] = None


KnowledgeGraphResponse = str