Chroma similarity_search_with_score (and similar methods) don't populate the document.id property of the returned documents. #26860
Comments
Until there is a fix I am patching my code with this helper method that has my fix in it:

####################################################################################################
# This is PATCHED code from the original langchain_chroma module
####################################################################################################
# See bug: https://github.com/langchain-ai/langchain/issues/26860
from typing import Optional

from langchain_chroma import Chroma
from langchain_chroma.vectorstores import DEFAULT_K
from langchain_core.documents import Document


def similarity_search_with_score_with_id(
    vector_db: Chroma,
    query: str,
    k: int = DEFAULT_K,
    filter: Optional[dict[str, str]] = None,
    where_document: Optional[dict[str, str]] = None,
    **kwargs,
) -> list[tuple[Document, float]]:
    if vector_db._embedding_function is None:
        # No embedding function configured: let Chroma embed the query text itself.
        results = vector_db._collection.query(
            query_texts=[query],
            n_results=k,
            where=filter,  # type: ignore
            where_document=where_document,  # type: ignore
            **kwargs,
        )
    else:
        query_embedding = vector_db._embedding_function.embed_query(query)
        results = vector_db._collection.query(
            query_embeddings=[query_embedding],  # type: ignore
            n_results=k,
            where=filter,  # type: ignore
            where_document=where_document,  # type: ignore
            **kwargs,
        )
    return _results_to_docs_and_scores(results)


def _results_to_docs_and_scores(results) -> list[tuple[Document, float]]:
    return [
        # TODO: Chroma can do batch querying,
        # we shouldn't hard code to the 1st result
        (
            # The fix: pass the id from the query result through to the Document.
            Document(id=result[0], page_content=result[1], metadata=result[2] or {}),
            result[3],
        )
        for result in zip(
            results["ids"][0],
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]
####################################################################################################
# End of PATCHED code
####################################################################################################

In case anyone finds it useful.
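On the TODO in the helper about batch querying: a rough sketch of how every query in a batch could be converted instead of only the first one. This assumes the result dict keeps the list-of-lists shape the helper indexes with `[0]`. `Document` here is a stand-in dataclass so the snippet runs without LangChain installed, and `results_to_docs_and_scores_batched` is a hypothetical name, not part of langchain_chroma.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Document:
    """Stand-in for langchain_core.documents.Document (assumption, illustration only)."""

    page_content: str
    metadata: dict = field(default_factory=dict)
    id: Optional[str] = None


def results_to_docs_and_scores_batched(
    results: dict[str, Any],
) -> list[list[tuple[Document, float]]]:
    """Return one list of (Document, distance) pairs per query in the batch."""
    batches = zip(
        results["ids"],
        results["documents"],
        results["metadatas"],
        results["distances"],
    )
    return [
        [
            (Document(id=doc_id, page_content=text, metadata=meta or {}), distance)
            for doc_id, text, meta, distance in zip(ids, docs, metas, dists)
        ]
        for ids, docs, metas, dists in batches
    ]


# Hand-written result for two queries, in the assumed list-of-lists shape.
sample = {
    "ids": [["a", "b"], ["c"]],
    "documents": [["alpha", "beta"], ["gamma"]],
    "metadatas": [[{"src": "x"}, None], [{"src": "y"}]],
    "distances": [[0.1, 0.4], [0.2]],
}
batched = results_to_docs_and_scores_batched(sample)
```

Taking `batched[0]` gives the same pairs the single-query helper would build for the first query.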
Not all vectorstores currently support this. Someone can make a PR to add support for returning the ID as part of the document. The ID in documents was added only a few months ago, and it is still considered optional.
Hi @eyurtsev,
@KrisTC ha sorry I didn't mean to imply that. This was for anyone in the community who wants to tackle this. I should read more carefully what I write :)
@eyurtsev We are looking to contribute to this issue. Since @KrisTC already provided a fix for Chroma, we can potentially write test cases for it and also investigate the remaining vectorstores and add support for the ones that are missing this functionality. I believe we are targeting only vectorstores with similarity search, would that be correct? If you have more detailed information on which vectorstores need an update, that would be greatly appreciated.
There's a standard test suite in the standard tests package that covers a lot of the relevant edge cases. You can git grep through the codebase to find example usage: `git grep "standard_test"`
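For anyone writing test cases for this, a minimal sketch of the property such a test would check: every returned document carries an id, and the ids match what was stored. This is not taken from the standard tests package. `check_ids_round_trip` is a hypothetical helper, and the search results are stand-in objects rather than output from a real vector store.

```python
from types import SimpleNamespace


def check_ids_round_trip(results, expected_ids):
    """Fail if any returned document lacks an id or the ids differ from those stored."""
    returned = [doc.id for doc, _score in results]
    assert all(doc_id is not None for doc_id in returned), "a returned document has no id"
    assert sorted(returned) == sorted(expected_ids), "returned ids differ from stored ids"


# Stand-in search results; a real test would call similarity_search_with_score on a store.
fake_results = [
    (SimpleNamespace(id="1", page_content="foo"), 0.0),
    (SimpleNamespace(id="2", page_content="bar"), 0.3),
]
check_ids_round_trip(fake_results, ["2", "1"])
```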
Some updates; any suggestions are appreciated:
Hello @eyurtsev, we have the fix and test cases for Chroma. Should we open a pull request for Chroma first and handle the other vector stores in later pull requests, or should we open a single pull request once the other vector stores are fixed as well?
Individual PRs would be great
Just created PR #27366
**Description:** Returns the document id along with the Vector Search results

**Issue:** Fixes #26860 for CouchbaseVectorStore

- [x] **Add tests and docs**: If you're adding a new integration, please include
  1. a test for the integration, preferably unit tests that do not rely on network access,
  2. an example notebook showing its use. It lives in `docs/docs/integrations` directory.
- [x] **Lint and test**: Run `make format`, `make lint` and `make test` from the root of the package(s) you've modified.

Co-authored-by: Erick Friis <[email protected]>
Checked other resources
Example Code
I want to get the document id from the documents I find with `similarity_search_with_score`, but the ids aren't being set on the documents when they are created. Here is my test code:
Error Message and Stack Trace (if applicable)
Test output:
I would expect the result:
Description
I am using:
I am calling `vector_db.similarity_search_with_score(...)` or `vector_db.similarity_search(...)` and the documents returned never have an id. I can't see a way of getting the documents to give me an id. I thought about adding a kwarg to include ids, but the documentation for the underlying query says it always returns ids. I think the issue is in your method from `langchain/libs/partners/chroma/langchain_chroma/vectorstores.py` (lines 43 to 53 in 51c4393).
It doesn't use the id.
This fix should be very easy:
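As a rough illustration of the kind of change meant here (a sketch, not the actual langchain_chroma source): the snippet below uses a stand-in `Document` dataclass and a hand-written result dict in the list-of-lists shape that the patched helper in the comments reads from. It compares a conversion that skips `ids` with one that passes them through. The function names `to_docs_without_id` and `to_docs_with_id` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Document:
    """Stand-in for langchain_core.documents.Document (assumption, illustration only)."""

    page_content: str
    metadata: dict = field(default_factory=dict)
    id: Optional[str] = None


def to_docs_without_id(results):
    """Mirrors the behaviour described in the report: results["ids"] is never read."""
    return [
        (Document(page_content=text, metadata=meta or {}), distance)
        for text, meta, distance in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]


def to_docs_with_id(results):
    """Same conversion, but the id from the query result is set on each Document."""
    return [
        (Document(id=doc_id, page_content=text, metadata=meta or {}), distance)
        for doc_id, text, meta, distance in zip(
            results["ids"][0],
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]


# Hand-written query result for a single query (assumed shape).
results = {
    "ids": [["doc-1", "doc-2"]],
    "documents": [["foo", "bar"]],
    "metadatas": [[{"baz": "qux"}, None]],
    "distances": [[0.0, 0.5]],
}
before = to_docs_without_id(results)
after = to_docs_with_id(results)
```

With the first function every `doc.id` stays `None`; with the second the ids from the query result end up on the documents.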
System Info
System Information
Package Information
Optional packages not installed
Other Dependencies