community: added the possibility to return document ids as part of the results of similarity search in vectorstore chroma #17938

majdabd · 2024-02-22T11:28:22Z

Hello, I modified the _results_to_docs_and_scores method in chroma.py to allow the results of the similarity search to include the IDs of documents or document chunks. I'm new to Chroma and LangChain, but I couldn't find a way to retrieve the IDs of documents that best match the query.

The new _results_to_docs_and_scores thus becomes:

def _results_to_docs_and_scores(results: Any) -> List[Tuple[Document, float]]:
    return [
        # TODO: Chroma can do batch querying,
        # we shouldn't hard code to the 1st result
        # Metadata can be None, so default to {} before adding the id.
        (Document(page_content=result[0], metadata={**(result[1] or {}), "id": result[3]}), result[2])
        for result in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
            results["ids"][0]
        )
    ]

Note: I found this solution in this closed issue #11592, but as far as I know it has not been proposed in a pull request.
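
For context, here is a minimal sketch of the result shape the method above consumes. It assumes each key of the query result holds a list of lists (one inner list per query), which is what the `[0]` indexing in the method suggests. The `rows_with_ids` helper and the `fake_results` data are hypothetical and only illustrate how the IDs line up with the other fields; they are not part of Chroma or LangChain.

```python
from typing import Any, Dict, List, Tuple


def rows_with_ids(
    results: Dict[str, Any], query_index: int = 0
) -> List[Tuple[str, str, Dict[str, Any], float]]:
    """Zip the parallel lists for one query into (id, text, metadata, distance) rows."""
    return [
        # A missing metadata entry (None) is replaced by an empty dict.
        (doc_id, text, metadata or {}, distance)
        for doc_id, text, metadata, distance in zip(
            results["ids"][query_index],
            results["documents"][query_index],
            results["metadatas"][query_index],
            results["distances"][query_index],
        )
    ]


# Hypothetical result for a single query that matched two chunks.
fake_results = {
    "ids": [["a", "b"]],
    "documents": [["first chunk", "second chunk"]],
    "metadatas": [[{"source": "x.txt"}, None]],
    "distances": [[0.12, 0.34]],
}

rows = rows_with_ids(fake_results)
```

Passing `query_index` explicitly is one way to avoid hard-coding the first result, as the TODO in the method mentions.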

baskaryan · 2024-03-29T00:14:56Z

cc @jeffchuber

jeffchuber · 2024-04-01T05:06:52Z

libs/community/langchain_community/vectorstores/chroma.py

-        (Document(page_content=result[0], metadata=result[1] or {}), result[2])
+        (
+            Document(
+                page_content=result[0], metadata=(result[1] | {"id": result[3]}) or {}


This looks like it will overwrite an existing id key if the metadata already has one. Can we think of a different way to handle this? I think it may really confuse some users, albeit in an edge case.
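
The collision described above can be reproduced with plain dictionaries. The sketch below is illustrative only: `merge_overriding` mirrors the merge in the diff, while `merge_namespaced` is a hypothetical alternative that writes the ID under a separate key (here assumed to be "doc_id") and refuses to overwrite anything. Neither function exists in LangChain.

```python
from typing import Any, Dict, Optional


def merge_overriding(metadata: Optional[Dict[str, Any]], chroma_id: str) -> Dict[str, Any]:
    """Mirror of the merge in the diff: the right-hand "id" always wins."""
    # Same effect as `metadata | {"id": chroma_id}` on Python 3.9+.
    return {**(metadata or {}), "id": chroma_id}


def merge_namespaced(
    metadata: Optional[Dict[str, Any]], chroma_id: str, key: str = "doc_id"
) -> Dict[str, Any]:
    """Hypothetical alternative: store the ID under its own key, never overwrite."""
    merged = dict(metadata or {})  # copy so the caller's dict is left untouched
    if key in merged:
        raise KeyError(f"metadata already defines {key!r}")
    merged[key] = chroma_id
    return merged


# A user who already stores their own "id" in the metadata.
user_metadata = {"id": "invoice-7", "source": "a.pdf"}

overridden = merge_overriding(user_metadata, "chroma-123")
preserved = merge_namespaced(user_metadata, "chroma-123")
```

With the first function the user's own "id" value is silently replaced; with the second both values survive, at the cost of reserving one more key name.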

TheDeafOne · 2024-08-27T14:15:54Z

@hwchase17 why was this closed? Was there a solution merged? Having this functionality would be super useful.

LDelPinoNT · 2024-08-28T08:55:51Z

From the messages, I understand there is already a variable called id, but it would be useful to have the document ID, maybe as "doc_id" (so the names don't overlap).

One use case for "doc_id" is retrieving it to train and evaluate RAG with the InformationRetrievalEvaluator from the Sentence Transformers package (https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#informationretrievalevaluator). This evaluator takes a mapping of document IDs to documents as its "corpus" parameter, which could be retrieved directly from VectorStores.
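
A rough sketch of that use case, assuming the search results carry the ID under a "doc_id" metadata key as suggested above. The `Doc` dataclass is a stand-in for a LangChain Document so the example runs on its own, and `build_corpus` is a hypothetical helper; the query and relevance data are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set


@dataclass
class Doc:
    """Stand-in for a LangChain Document: text plus a metadata dict."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def build_corpus(docs: List[Doc], id_key: str = "doc_id") -> Dict[str, str]:
    """Map each document ID to its text, the shape a corpus dict needs."""
    return {str(doc.metadata[id_key]): doc.page_content for doc in docs}


# Documents as they might come back from a similarity search with IDs attached.
docs = [
    Doc("Chroma stores embeddings.", {"doc_id": "chunk-1"}),
    Doc("LangChain wraps vector stores.", {"doc_id": "chunk-2"}),
]

corpus = build_corpus(docs)

# Made-up evaluation data: query IDs to text, and query IDs to relevant document IDs.
queries: Dict[str, str] = {"q1": "Where are embeddings stored?"}
relevant_docs: Dict[str, Set[str]] = {"q1": {"chunk-1"}}
```

The three dictionaries would then be passed as the `queries`, `corpus` and `relevant_docs` arguments of the evaluator; that call is left out here so the sketch does not depend on the package being installed.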

Added the possibility to return document ids within similarity search

dd5b188

dosubot bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Feb 22, 2024

dosubot bot added Ɑ: vector store Related to vector store module 🔌: chroma Primarily related to ChromaDB integrations 🤖:improvement Medium size change to existing code to handle new use-cases labels Feb 22, 2024

majdabd and others added 3 commits February 23, 2024 16:08

Merge branch 'master' into similarity_search_with_id

7a0da5d

Merge branch 'master' into majdabd/similarity_search_with_id

e1c5b17

fmt

f553868

jeffchuber reviewed Apr 1, 2024

ccurme added the community Related to langchain-community label Jun 18, 2024

hwchase17 closed this Aug 22, 2024
