Fix bug in `index` api #10614

richarddwang · 2023-09-15T02:43:33Z

Description: a fix for index.
Issue: Not applicable.
Dependencies: None
Tag maintainer:
Twitter handle: richarddwang

Problem

Replication code

from pprint import pprint
from langchain.embeddings import OpenAIEmbeddings
from langchain.indexes import SQLRecordManager, index
from langchain.schema import Document
from langchain.vectorstores import Qdrant
from langchain_setup.qdrant import pprint_qdrant_documents, create_inmemory_empty_qdrant

# Documents
metadata1 = {"source": "fullhell.alchemist"}
doc1_1 = Document(page_content="1-1 I have a dog~", metadata=metadata1)
doc1_2 = Document(page_content="1-2 I have a daugter~", metadata=metadata1)
doc1_3 = Document(page_content="1-3 Ahh! O..Oniichan", metadata=metadata1)
doc2 = Document(page_content="2 Lancer died again.", metadata={"source": "fate.docx"})

# Create empty vectorstore
collection_name = "secret_of_D_disk"
vectorstore: Qdrant = create_inmemory_empty_qdrant()

# Create record Manager
import tempfile
from pathlib import Path

record_manager = SQLRecordManager(
    namespace="qdrant/{collection_name}",
    db_url=f"sqlite:///{Path(tempfile.gettempdir())/collection_name}.sql",
)
record_manager.create_schema()  # 必須

sync_result = index(
    [doc1_1, doc1_2, doc1_2, doc2],
    record_manager,
    vectorstore,
    cleanup="full",
    source_id_key="source",
)
print(sync_result, end="\n\n")
pprint_qdrant_documents(vectorstore)

Code of helper functions `pprint_qdrant_documents` and `create_inmemory_empty_qdrant`

def create_inmemory_empty_qdrant(**from_texts_kwargs):
    # Qdrant requires vector size, which can be only know after applying embedder
    vectorstore = Qdrant.from_texts(["dummy"], location=":memory:", embedding=OpenAIEmbeddings(), **from_texts_kwargs)
    dummy_document_id = vectorstore.client.scroll(vectorstore.collection_name)[0][0].id
    vectorstore.delete([dummy_document_id])
    return vectorstore

def pprint_qdrant_documents(vectorstore, limit: int = 100, **scroll_kwargs):
    document_ids, documents = [], []
    for record in vectorstore.client.scroll(
        vectorstore.collection_name, limit=100, **scroll_kwargs
    )[0]:
        document_ids.append(record.id)
        documents.append(
            Document(
                page_content=record.payload["page_content"],
                metadata=record.payload["metadata"] or {},
            )
        )
    pprint_documents(documents, document_ids=document_ids)

def pprint_document(document: Document = None, document_id=None, return_string=False):
    displayed_text = ""
    if document_id:
        displayed_text += f"Document {document_id}:\n\n"
    displayed_text += f"{document.page_content}\n\n"
    metadata_text = pformat(document.metadata, indent=1)
    if "\n" in metadata_text:
        displayed_text += f"Metadata:\n{metadata_text}"
    else:
        displayed_text += f"Metadata:{metadata_text}"

    if return_string:
        return displayed_text
    else:
        print(displayed_text)


def pprint_documents(documents, document_ids=None):
    if not document_ids:
        document_ids = [i + 1 for i in range(len(documents))]

    displayed_texts = []
    for document_id, document in zip(document_ids, documents):
        displayed_text = pprint_document(
            document_id=document_id, document=document, return_string=True
        )
        displayed_texts.append(displayed_text)
    print(f"\n{'-' * 100}\n".join(displayed_texts))

You will get

{'num_added': 3, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

Document 1b19816e-b802-53c0-ad60-5ff9d9b9b911:

1-2 I have a daugter~

Metadata:{'source': 'fullhell.alchemist'}
----------------------------------------------------------------------------------------------------
Document 3362f9bc-991a-5dd5-b465-c564786ce19c:

1-1 I have a dog~

Metadata:{'source': 'fullhell.alchemist'}
----------------------------------------------------------------------------------------------------
Document a4d50169-2fda-5339-a196-249b5f54a0de:

1-2 I have a daugter~

Metadata:{'source': 'fullhell.alchemist'}

This is not correct. We should be able to expect that the vectorsotre now includes doc1_1, doc1_2, and doc2, but not doc1_1, doc1_2, and doc1_2.

Reason

In index, the original code is

uids = []
docs_to_index = []
for doc, hashed_doc, doc_exists in zip(doc_batch, hashed_docs, exists_batch):
    if doc_exists:
        # Must be updated to refresh timestamp.
        record_manager.update([hashed_doc.uid], time_at_least=index_start_dt)
        num_skipped += 1
        continue
    uids.append(hashed_doc.uid)
    docs_to_index.append(doc)

In the aforementioned example, len(doc_batch) == 4, but len(hashed_docs) == len(exists_batch) == 3. This is because the deduplication of input documents [doc1_1, doc1_2, doc1_2, doc2] is [doc1_1, doc1_2, doc2]. So index insert doc1_1, doc1_2, doc1_2 with the uid of doc1_1, doc1_2, doc2.

vercel · 2023-09-15T02:43:36Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment

Name	Status	Preview	Comments	Updated (UTC)
langchain	⬜️ Ignored (Inspect)	Visit Preview		Sep 23, 2023 2:22am

eyurtsev

Thank you!

eyurtsev · 2023-09-23T02:19:10Z

Linting and merging

Fix the bug in index

8f364a5

dosubot bot added Ɑ: vector store Related to vector store module 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Sep 15, 2023

eyurtsev self-assigned this Sep 23, 2023

eyurtsev approved these changes Sep 23, 2023

View reviewed changes

eyurtsev changed the title ~~Fix the bug in index~~ Fix bug in index api Sep 23, 2023

x

90c1b6c

eyurtsev merged commit b809c24 into langchain-ai:master Sep 23, 2023
31 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix bug in `index` api #10614

Fix bug in `index` api #10614

richarddwang commented Sep 15, 2023

vercel bot commented Sep 15, 2023 •

edited

Loading

eyurtsev left a comment

eyurtsev commented Sep 23, 2023

Fix bug in index api #10614

Fix bug in index api #10614

Conversation

richarddwang commented Sep 15, 2023

Problem

Reason

vercel bot commented Sep 15, 2023 • edited Loading

eyurtsev left a comment

Choose a reason for hiding this comment

eyurtsev commented Sep 23, 2023

Fix bug in `index` api #10614

Fix bug in `index` api #10614

vercel bot commented Sep 15, 2023 •

edited

Loading