community: added the possibility to return document ids as part of the results of similarity search in vectorstore chroma #17938
cc @jeffchuber
(Document(page_content=result[0], metadata=result[1] or {}), result[2])
(
    Document(
        page_content=result[0], metadata=(result[1] | {"id": result[3]}) or {}
It looks like if the metadata already has an id prop, this will override it... can we think of a different way to handle this? I think it may really confuse some users, albeit in an edge case.
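To make the concern concrete, here is a minimal sketch of the merge from the diff, reduced to plain dicts. The metadata values and the id are made up for illustration, and the "doc_id" key is only one hypothetical way to avoid the clash, not something the PR implements.

```python
# The merge from the diff, reduced to plain dicts (the | operator needs Python 3.9+).
user_metadata = {"source": "report.pdf", "id": "user-supplied-id"}  # hypothetical metadata
chroma_id = "a1b2c3"  # hypothetical id returned by the vector store

# On duplicate keys the right-hand operand wins, so the user's own "id" is lost.
merged = user_metadata | {"id": chroma_id}
print(merged["id"])

# One possible alternative: a distinct key (hypothetical name) leaves user metadata intact.
separate = user_metadata | {"doc_id": chroma_id}
print(separate["id"], separate["doc_id"])
```

A distinct key only narrows the edge case: metadata that already contains "doc_id" would be overwritten in the same way.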
@hwchase17 why was this closed? Was there a solution merged? Having this functionality would be super useful.
From the messages I understand there is already a variable called id, but it would still be useful to have the document id, maybe as "doc_id" (so the names do not overlap). One use case for "doc_id" is retrieving it for training and evaluating RAG with the InformationRetrievalEvaluator from the Sentence Transformers package (https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#informationretrievalevaluator). This evaluator needs documents paired with their document ids as its "corpus" parameter, and those pairs could then be retrieved directly from VectorStores.
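As a rough illustration of that use case, the sketch below builds the id-to-text mappings that the evaluator's queries, corpus and relevant_docs parameters expect. The stored dict is a stand-in for ids and texts pulled from a vector store; the ids, texts and query are invented, and the evaluator itself is not imported here.

```python
# Stand-in for a vector store dump: parallel lists of ids and chunk texts.
stored = {
    "ids": ["chunk-1", "chunk-2", "chunk-3"],
    "documents": [
        "Chroma stores embeddings.",
        "LangChain wraps vector stores.",
        "RAG retrieves context before generating.",
    ],
}

# corpus maps document id -> document text.
corpus = dict(zip(stored["ids"], stored["documents"]))

# Hypothetical evaluation data: query id -> query text, query id -> relevant document ids.
queries = {"q1": "What does LangChain wrap?"}
relevant_docs = {"q1": {"chunk-2"}}

# Sanity check: every relevant id should exist in the corpus.
referenced = {doc_id for ids in relevant_docs.values() for doc_id in ids}
missing = referenced - set(corpus)
print(sorted(missing))
```

Without the ids in the search results, there is no reliable way to match a retrieved chunk back to an entry in relevant_docs.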
Hello, I modified the _results_to_docs_and_scores method in chroma.py to allow the results of the similarity search to include the IDs of documents or document chunks. I'm new to Chroma and LangChain, but I couldn't find a way to retrieve the IDs of documents that best match the query.
The new _results_to_docs_and_scores thus becomes:
Note: I found this solution in this closed issue #11592, but as far as I know it has not been proposed in a pull request.
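The code from the description did not survive on this page. Based on the diff fragment quoted in the review, the modified method might look roughly like the sketch below. This is not the PR's exact code: Document is a stand-in dataclass so the example runs without LangChain, the sample results are invented, and applying "or {}" before the merge (so that missing metadata does not break the | operator) is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Document:
    """Stand-in for LangChain's Document so the sketch has no dependencies."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def results_to_docs_and_scores(results: Dict[str, Any]) -> List[Tuple[Document, float]]:
    """Pair each returned chunk with its distance and copy its id into the metadata."""
    return [
        (
            Document(
                page_content=text,
                # Fall back to {} first so a None metadata entry can still be merged.
                metadata=(metadata or {}) | {"id": doc_id},
            ),
            distance,
        )
        for text, metadata, distance, doc_id in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
            results["ids"][0],
        )
    ]


# Invented query result: one list per query, here a single query with two hits.
sample_results = {
    "ids": [["id-1", "id-2"]],
    "documents": [["first chunk", "second chunk"]],
    "metadatas": [[{"source": "a.txt"}, None]],
    "distances": [[0.12, 0.34]],
}

docs_and_scores = results_to_docs_and_scores(sample_results)
for doc, score in docs_and_scores:
    print(doc.metadata, score)
```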