
langchain: fix EmbeddingsFilter compress_documents return type to Sequence[Document] instead of Sequence[_DocumentWithState] #17946

Conversation

maximeperrindev (Contributor) commented Feb 22, 2024

  • Description: This PR fixes a typing problem in the `EmbeddingsFilter.compress_documents` method. The returned sequence of `_DocumentWithState` did not match the declared return type of `Sequence[Document]`, and the extra embeddings field on `_DocumentWithState` could break JSON serialization downstream.
  • Issue: TypeError: Type is not JSON serializable: numpy.float64 #17875
  • Twitter handle: @maximeperrin_
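The linked TypeError arises when serializing documents whose fields hold numpy scalars, which strict JSON serializers reject. A minimal stdlib sketch of the same failure mode, using a hypothetical `Float64Like` class in place of `numpy.float64`:

```python
import json

class Float64Like:
    """Hypothetical stand-in for a non-JSON-native scalar such as numpy.float64."""
    def __init__(self, value: float) -> None:
        self.value = value

# json.dumps raises TypeError for types it does not know how to encode
try:
    json.dumps({"embedded_doc": Float64Like(0.5)})
except TypeError as exc:
    print(exc)  # Object of type Float64Like is not JSON serializable
```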


dosubot added labels on Feb 22, 2024: size:XS (This PR changes 0-9 lines, ignoring generated files), Ɑ: embeddings (Related to text embedding models module), 🤖:bug (Related to a bug, vulnerability, unexpected error with an existing feature)
maximeperrindev (Contributor, Author) commented

@eyurtsev

baskaryan (Collaborator) commented

This is actually intentional: the method needs to return a stateful document so that embeddings aren't recomputed multiple times in a compression pipeline (where multiple compressors use embeddings). The correct solution is to convert any stateful docs to non-stateful docs outside of this class.

Closing, but let me know if I'm missing something.
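The suggested fix can be sketched as follows. The `Document` and `DocumentWithState` classes below are minimal stand-ins for illustration only (the real classes live in `langchain_core.documents` and in langchain's `embeddings_filter` module), and `strip_state` is a hypothetical helper mirroring the conversion described above:

```python
from dataclasses import dataclass, field
from typing import Any

# Minimal stand-ins for illustration; the real Document lives in
# langchain_core.documents and _DocumentWithState in
# langchain.retrievers.document_compressors.embeddings_filter.
@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

@dataclass
class DocumentWithState(Document):
    # cached per-document state, e.g. the document's embedding vector
    state: dict[str, Any] = field(default_factory=dict)

    def to_document(self) -> Document:
        # Drop the cached state so the result is a plain, JSON-friendly Document
        return Document(page_content=self.page_content, metadata=self.metadata)

def strip_state(docs: list[Document]) -> list[Document]:
    """Convert any stateful docs back to plain Documents."""
    return [
        d.to_document() if isinstance(d, DocumentWithState) else d
        for d in docs
    ]
```

Calling a helper like this on the compressor's output, outside the compressor class, keeps the embedding cache intact inside the pipeline while returning serializable documents to callers.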

baskaryan closed this on Mar 29, 2024
hpx502766238 commented
> This is actually intentional: the method needs to return a stateful document so that embeddings aren't recomputed multiple times in a compression pipeline (where multiple compressors use embeddings). The correct solution is to convert any stateful docs to non-stateful docs outside of this class.
>
> Closing, but let me know if I'm missing something.

Excuse me, can you tell me how to convert stateful docs to non-stateful docs?

hpx502766238 commented Aug 15, 2024

I have found a solution: convert each `_DocumentWithState` back to a `Document` before `ContextualCompressionRetriever` returns. In `langchain/retrievers/contextual_compression.py`, I changed `_get_relevant_documents` and `_aget_relevant_documents` as follows (async version shown):

```python
from langchain.retrievers.document_compressors.embeddings_filter import (
    _DocumentWithState,
)

docs = await self.base_retriever.ainvoke(
    query, config={"callbacks": run_manager.get_child()}, **kwargs
)
if docs:
    compressed_docs = await self.base_compressor.acompress_documents(
        docs, query, callbacks=run_manager.get_child()
    )
    # Convert any _DocumentWithState back to a plain Document
    compressed_docs_converted = [
        doc.to_document() if isinstance(doc, _DocumentWithState) else doc
        for doc in compressed_docs
    ]
    return compressed_docs_converted
else:
    return []
```
