Fix QdrantClient Document import issue and improve text processing fo… #311

Open
wants to merge 1 commit into base: main
Conversation

Tim-nocode

…r STORM

Summary:

This commit updates the STORM repository to work with the latest versions of qdrant_client by:

Replacing the deprecated Document import from qdrant_client with PointStruct.
Ensuring compatibility with RecursiveCharacterTextSplitter from LangChain by converting PointStruct into a LangChain Document.
Fixing potential issues with CSV parsing and content chunking before vectorization.

Key Changes:

1. Fixed incompatibility with newer qdrant_client versions

Removed:
from qdrant_client import Document
Reason: In newer versions of qdrant_client, Document was removed and is no longer available.

Added instead:
from qdrant_client.models import PointStruct

Why? PointStruct is the correct way to structure documents before inserting them into Qdrant.

2. Updated document processing to avoid conflicts with LangChain

Old version:
documents = [
    Document(
        page_content=row[content_column],
        metadata={
            "title": row.get(title_column, ""),
            "url": row[url_column],
            "description": row.get(desc_column, ""),
        },
    )
    for row in df.to_dict(orient="records")
]

New version:
documents = [
    PointStruct(
        id=index,  # Unique identifier
        vector=[],  # Empty vector (will be generated later)
        payload={
            "content": row[content_column],
            "title": row.get(title_column, ""),
            "url": row[url_column],
            "description": row.get(desc_column, ""),
        },
    )
    for index, row in enumerate(df.to_dict(orient="records"))
]

Why? This ensures compatibility with qdrant_client and allows storing metadata separately.

3. Fixed compatibility with LangChain's RecursiveCharacterTextSplitter

Old version:
split_documents = text_splitter.split_documents(documents)

Issue: PointStruct does not have a page_content attribute, which text_splitter requires.

Fixed version:
from langchain.schema import Document as LangchainDocument

documents_langchain = [
    LangchainDocument(
        page_content=doc.payload["content"],
        metadata=doc.payload,
    )
    for doc in documents
]

split_documents = text_splitter.split_documents(documents_langchain)

Why? RecursiveCharacterTextSplitter requires page_content, which PointStruct does not have. Converting PointStruct to LangChain Document resolves this issue.
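The conversion above only touches the payload attribute, so its logic can be exercised without qdrant_client or LangChain installed. Here is a self-contained sketch using dataclass stand-ins for both types (the stand-in classes are illustrative, not the real library classes):

```python
from dataclasses import dataclass, field

# Stand-in for qdrant_client.models.PointStruct (illustrative only)
@dataclass
class PointStruct:
    id: int
    vector: list
    payload: dict

# Stand-in for langchain.schema.Document (illustrative only)
@dataclass
class LangchainDocument:
    page_content: str
    metadata: dict = field(default_factory=dict)

points = [
    PointStruct(id=0, vector=[], payload={"content": "some text", "title": "t"}),
]

# The conversion from the PR: lift payload["content"] into page_content,
# keeping the full payload as metadata
documents_langchain = [
    LangchainDocument(page_content=p.payload["content"], metadata=p.payload)
    for p in points
]
print(documents_langchain[0].page_content)  # -> some text
```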

4. Ensured correct CSV parsing and encoding

Added sep="|" and encoding="utf-8" in pd.read_csv():

df = pd.read_csv(file_path, sep="|", encoding="utf-8")

Why?

Prevents issues where pandas treats the entire header row as a single column.

Ensures compatibility with datasets that use | as a separator.
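A quick way to see the effect of sep="|" is to parse a small in-memory sample (the sample data below is hypothetical; encoding="utf-8" applies when reading from a file path, so it is omitted for the in-memory buffer):

```python
import io
import pandas as pd

# Hypothetical pipe-separated sample matching the columns the PR expects
csv_text = (
    "content|title|url|description\n"
    "Hello world|Greeting|https://example.com|A test row\n"
)

# With the default sep=",", the entire header would land in one column;
# sep="|" splits it into the four expected columns
df = pd.read_csv(io.StringIO(csv_text), sep="|")
print(list(df.columns))  # -> ['content', 'title', 'url', 'description']
```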

5. Batch processing optimization

Ensured that data is properly batched before sending to Qdrant:

num_batches = (len(split_documents) + batch_size - 1) // batch_size
for i in tqdm(range(num_batches)):
    start_idx = i * batch_size
    end_idx = min((i + 1) * batch_size, len(split_documents))
    qdrant.add_documents(
        documents=split_documents[start_idx:end_idx],
        batch_size=batch_size,
    )

Why? Prevents timeout errors when handling large documents.

Ensures efficient memory usage and better API performance.
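The ceiling-division batching used above can be checked on its own. This sketch factors the index arithmetic into a hypothetical batch_slices helper (not part of the PR) so the slicing behavior is easy to verify:

```python
def batch_slices(n_items, batch_size):
    # Ceiling division: number of batches needed to cover all items
    num_batches = (n_items + batch_size - 1) // batch_size
    # Each batch is a half-open [start, end) range; the last one may be short
    return [
        (i * batch_size, min((i + 1) * batch_size, n_items))
        for i in range(num_batches)
    ]

# 10 items in batches of 4 -> two full batches and one partial batch
print(batch_slices(10, 4))  # -> [(0, 4), (4, 8), (8, 10)]
```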

Impact & Benefits:

✅ Fixes compatibility issues with the latest qdrant_client versions.
✅ Ensures correct document chunking for LangChain's text splitter.
✅ Prevents "Content column not found" errors in CSV parsing.
✅ Improves stability when inserting large documents into Qdrant.

This commit ensures that STORM continues to work seamlessly with Qdrant and LangChain while providing better document processing support.

Next Steps:
Review and test with additional datasets.
Consider additional optimizations for embedding model selection.
