Fix QdrantClient Document import issue and improve text processing for STORM #311
Summary:
This commit updates the STORM repository to work with the latest versions of qdrant_client by:
- Replacing the deprecated Document import from qdrant_client with PointStruct.
- Ensuring compatibility with RecursiveCharacterTextSplitter from LangChain by converting PointStruct points into LangChain Document objects.
- Fixing potential issues with CSV parsing and content chunking before vectorization.
Key Changes:
Removed:
```python
from qdrant_client import Document
```
Reason: In newer versions of qdrant_client, Document was removed and is no longer available.
Added instead:
```python
from qdrant_client.models import PointStruct
```
Why? PointStruct is the correct way to structure documents before inserting them into Qdrant.
Old version:
```python
documents = [
    Document(
        page_content=row[content_column],
        metadata={
            "title": row.get(title_column, ""),
            "url": row[url_column],
            "description": row.get(desc_column, ""),
        },
    )
    for row in df.to_dict(orient="records")
]
```
New version:
```python
documents = [
    PointStruct(
        id=index,  # Unique identifier
        vector=[],  # Empty vector (will be generated later)
        payload={
            "content": row[content_column],
            "title": row.get(title_column, ""),
            "url": row[url_column],
            "description": row.get(desc_column, ""),
        },
    )
    for index, row in enumerate(df.to_dict(orient="records"))
]
```
Why? This ensures compatibility with qdrant_client and allows storing metadata separately.
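For context, here is a minimal sketch of how such points could later be embedded and upserted. The embedding backend, Qdrant URL, and collection name are illustrative assumptions, not part of this commit:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct
from langchain.embeddings import HuggingFaceEmbeddings  # assumed backend, not necessarily STORM's model

client = QdrantClient(url="http://localhost:6333")  # assumed local instance
embedder = HuggingFaceEmbeddings()

# Fill in the empty vectors created above before upserting.
texts = [point.payload["content"] for point in documents]
vectors = embedder.embed_documents(texts)

client.upsert(
    collection_name="storm_corpus",  # hypothetical collection name
    points=[
        PointStruct(id=point.id, vector=vec, payload=point.payload)
        for point, vec in zip(documents, vectors)
    ],
)
```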
Old version:
```python
split_documents = text_splitter.split_documents(documents)
```
Issue: PointStruct does not have a page_content attribute, which the text splitter requires.
Fixed version:
```python
from langchain.schema import Document as LangchainDocument

documents_langchain = [
    LangchainDocument(
        page_content=doc.payload["content"],
        metadata=doc.payload,
    )
    for doc in documents
]
split_documents = text_splitter.split_documents(documents_langchain)
```
Why? RecursiveCharacterTextSplitter requires page_content, which PointStruct does not have. Converting PointStruct to LangChain Document resolves this issue.
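For reference, a minimal example of the splitter in isolation; the chunk sizes here are illustrative, not STORM's configured values:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document as LangchainDocument

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # illustrative values
    chunk_overlap=100,
)

doc = LangchainDocument(
    page_content="A long article body ... " * 100,
    metadata={"url": "https://example.com"},
)
chunks = text_splitter.split_documents([doc])
# Each chunk is itself a LangChain Document that keeps the original metadata.
```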
Added sep="|" and encoding="utf-8" in pd.read_csv():
```python
df = pd.read_csv(file_path, sep="|", encoding="utf-8")
```
Why?
- Prevents issues where pandas treats the entire header row as a single column.
- Ensures compatibility with datasets that use | as a separator (see the defensive sketch below).
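A defensive variant of this read, shown as a sketch (it reuses the column variables from the snippets above and fails fast instead of producing a later "Content column not found" error):

```python
import pandas as pd

df = pd.read_csv(file_path, sep="|", encoding="utf-8")

# Fail fast with a clear message if expected columns are absent.
missing = {content_column, url_column} - set(df.columns)
if missing:
    raise ValueError(f"CSV is missing required columns: {sorted(missing)}")
```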
Ensured that data is properly batched before sending to Qdrant:
```python
from tqdm import tqdm

num_batches = (len(split_documents) + batch_size - 1) // batch_size
for i in tqdm(range(num_batches)):
    start_idx = i * batch_size
    end_idx = min((i + 1) * batch_size, len(split_documents))
    qdrant.add_documents(
        documents=split_documents[start_idx:end_idx],
        batch_size=batch_size,
    )
```
Why?
- Prevents timeout errors when handling large documents (see the client timeout sketch below).
- Ensures efficient memory usage and better API performance.
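If timeouts persist on very large batches, the client-side timeout can also be raised; a sketch, with an arbitrary 60-second value:

```python
from qdrant_client import QdrantClient

# Allow more per-request time (in seconds) for large upserts.
client = QdrantClient(url="http://localhost:6333", timeout=60)
```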
Impact & Benefits:
✅ Fixes compatibility issues with the latest qdrant_client versions.
✅ Ensures correct document chunking for LangChain's text splitter.
✅ Prevents "Content column not found" errors in CSV parsing.
✅ Improves stability when inserting large documents into Qdrant.
This commit ensures that STORM continues to work seamlessly with Qdrant and LangChain while providing better document processing support.
Next Steps:
- Review and test with additional datasets.
- Consider additional optimizations for embedding model selection.