community: added the possibility to return document ids as part of the results of similarity search in vectorstore chroma #17938

Closed

Conversation

@majdabd majdabd commented Feb 22, 2024

Hello, I modified the _results_to_docs_and_scores method in chroma.py so that similarity search results include the IDs of the matching documents or document chunks. I'm new to Chroma and LangChain, and I couldn't find an existing way to retrieve the IDs of the documents that best match a query.

The new _results_to_docs_and_scores thus becomes:

def _results_to_docs_and_scores(results: Any) -> List[Tuple[Document, float]]:
    return [
        # TODO: Chroma can do batch querying,
        # we shouldn't hard code to the 1st result
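        (
            Document(
                page_content=result[0],
                # result[1] can be None, and None | {...} raises TypeError,
                # so default to {} before merging in the id.
                # Note: an existing "id" key in the metadata is still overwritten.
                metadata={**(result[1] or {}), "id": result[3]},
            ),
            result[2],
        )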
        for result in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
            results["ids"][0]
        )
    ]
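
To make the data shapes concrete, here is a minimal, self-contained sketch of the same conversion. It is an illustration under stated assumptions, not the code in chroma.py: `Document` is a stand-in dataclass rather than LangChain's class, `results_to_docs_and_scores` is a free function, and `sample` is hand-written data in the nested-list shape that the method above indexes into. It defaults missing metadata to an empty dict before merging in the id.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Document:
    """Stand-in for LangChain's Document so the sketch runs without LangChain."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def results_to_docs_and_scores(results: Dict[str, Any]) -> List[Tuple[Document, float]]:
    """Turn the first query's hits into (Document, distance) pairs, id in metadata."""
    return [
        # meta may be None for chunks stored without metadata
        (Document(page_content=text, metadata={**(meta or {}), "id": doc_id}), distance)
        for text, meta, distance, doc_id in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
            results["ids"][0],
        )
    ]


# Hand-written sample: one query, two hits, the second without metadata.
sample = {
    "ids": [["chunk-1", "chunk-2"]],
    "documents": [["first chunk", "second chunk"]],
    "metadatas": [[{"source": "a.txt"}, None]],
    "distances": [[0.12, 0.34]],
}
pairs = results_to_docs_and_scores(sample)
```

Each returned pair then carries the chunk id under `metadata["id"]`, including for the hit that had no metadata at all.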

Note: I found this solution in this closed issue #11592, but as far as I know it has not been proposed in a pull request.

@dosubot dosubot bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Feb 22, 2024
@dosubot dosubot bot added Ɑ: vector store Related to vector store module 🔌: chroma Primarily related to ChromaDB integrations 🤖:improvement Medium size change to existing code to handle new use-cases labels Feb 22, 2024
@baskaryan
Collaborator

cc @jeffchuber

(Document(page_content=result[0], metadata=result[1] or {}), result[2])
(
Document(
page_content=result[0], metadata=(result[1] | {"id": result[3]}) or {}
Contributor

It looks like this will overwrite an existing id property if the metadata already has one. Can we think of a different way to handle this? I think it could really confuse some users, even though it is an edge case.
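
The collision described in this comment can be reproduced with plain dicts. The sketch below is only an illustration: `attach_id` is a hypothetical helper and the `"chroma_id"` key is an assumed name, neither of which comes from the PR or from LangChain. It shows one possible way to avoid the overwrite, by writing the id under a separate key and refusing to clobber an existing value.

```python
from typing import Any, Dict, Optional


def attach_id(
    metadata: Optional[Dict[str, Any]], doc_id: str, key: str = "chroma_id"
) -> Dict[str, Any]:
    """Return a copy of metadata with doc_id stored under key (hypothetical helper)."""
    merged = dict(metadata or {})
    if key in merged and merged[key] != doc_id:
        # Refuse to silently replace a value the user stored themselves.
        raise ValueError(f"metadata already has a different {key!r} value")
    merged[key] = doc_id
    return merged


user_metadata = {"id": "invoice-2024-001", "source": "a.txt"}
store_id = "c0ffee"

# The merge from the PR: the user's own "id" value is lost.
overwritten = {**user_metadata, "id": store_id}

# Writing to a separate key keeps both values.
namespaced = attach_id(user_metadata, store_id)
```

Returning the id alongside the document instead of inside its metadata would avoid the naming question entirely, at the cost of changing the return type.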

@ccurme ccurme added the community Related to langchain-community label Jun 18, 2024
@hwchase17 hwchase17 closed this Aug 22, 2024
@TheDeafOne

@hwchase17 why was this closed? Was there a solution merged? Having this functionality would be super useful.

@LDelPinoNT

From the messages, I understand that the metadata may already contain a key called id, but it would still be useful to have the document ID, perhaps as "doc_id" (so that the names do not overlap).

One use case for the "doc_id" is retrieving it for training and evaluating RAG with the InformationRetrievalEvaluator from the Sentence Transformers package (https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#informationretrievalevaluator). This evaluator takes a mapping between document IDs and documents as its "corpus" parameter, which could then be retrieved directly from the vector store.
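
As a rough sketch of that use case, the snippet below builds an id-to-text mapping from a dict of parallel lists. The assumptions are mine: `stored` is hand-written data shaped like what I understand Chroma's `collection.get()` to return (flat `ids` and `documents` lists), `corpus_from_get_result` is a hypothetical helper, and the evaluator is assumed to accept `corpus` as a dict from document id to document text. Neither library is imported here.

```python
from typing import Any, Dict


def corpus_from_get_result(result: Dict[str, Any]) -> Dict[str, str]:
    """Map each stored id to its document text (hypothetical helper)."""
    return dict(zip(result["ids"], result["documents"]))


# Hand-written stand-in for the stored documents and their ids.
stored = {
    "ids": ["doc-1", "doc-2"],
    "documents": ["Chroma stores embeddings.", "LangChain wraps vector stores."],
    "metadatas": [None, None],
}
corpus = corpus_from_get_result(stored)

# Queries and relevance judgments would be keyed the same way, for example:
queries = {"q-1": "What stores embeddings?"}
relevant_docs = {"q-1": {"doc-1"}}
```

With ids available on the search results as well, retrieved documents could be matched against `relevant_docs` without comparing page contents.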

Labels
🔌: chroma Primarily related to ChromaDB integrations community Related to langchain-community 🤖:improvement Medium size change to existing code to handle new use-cases size:XS This PR changes 0-9 lines, ignoring generated files. Ɑ: vector store Related to vector store module
7 participants