
Chroma similarity_search_with_score (and similar methods) don't populate the document.id property of the returned documents. #26860

Closed
KrisTC opened this issue Sep 25, 2024 · 10 comments · Fixed by #27622
Labels
03 enhancement (Enhancement of existing functionality) · investigate (Flagged for investigation) · Ɑ: vector store (Related to vector store module)

Comments

@KrisTC

KrisTC commented Sep 25, 2024

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

I want to get the document id from the documents returned by similarity_search_with_score, but the ids aren't being set on the documents when they are created.

Here is my test code:

def test_basic_operations_with_langchain():
    token = os.environ.get('CHROMA_TOKEN')
    client = chromadb.HttpClient(
        settings=Settings(chroma_client_auth_provider="chromadb.auth.token_authn.TokenAuthClientProvider",
                          chroma_client_auth_credentials=token))
    test_collection_name = "test_collection"

    try:
        client.delete_collection(test_collection_name)
    except:  # noqa: E722
        pass

    collection = client.get_or_create_collection(test_collection_name)
    assert collection.name == test_collection_name
    assert collection.count() == 0

    EMBEDDING_MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
    EMBEDDING_MODEL_CHUNK_SIZE = 384

    embedding_model = HuggingFaceEmbeddings(
        model_name=EMBEDDING_MODEL_NAME,
        # multi_process=True,   # I found this causes crashes and slowness
        model_kwargs={"device": "cuda" if torch.cuda.is_available() else "cpu"},
        encode_kwargs={
            "normalize_embeddings": True
        },  # set True for cosine similarity
    )
    vector_db = Chroma(
        embedding_function=embedding_model,
        client=client,
        collection_name=test_collection_name,
        collection_metadata={"hnsw:space": "cosine"},
    )
    # Make new uid
    id = str(uuid.uuid4())
    test_doc = Document("hello world")
    test_doc.id = id
    test_doc.metadata["test"] = "test"
    vector_db.add_documents([test_doc])
    assert collection.count() == 1

    docs = vector_db.similarity_search_with_score("hello world", 1)
    assert len(docs) == 1
    assert docs[0][0].page_content == "hello world"
    assert docs[0][0].id == id

    client.delete_collection(test_collection_name)

Error Message and Stack Trace (if applicable)

Test output:

=================================== FAILURES ===================================
_____________________ test_basic_operations_with_langchain _____________________

    def test_basic_operations_with_langchain():
        token = os.environ.get('CHROMA_TOKEN')
        client = chromadb.HttpClient(
            settings=Settings(chroma_client_auth_provider="chromadb.auth.token_authn.TokenAuthClientProvider",
                              chroma_client_auth_credentials=token))
        test_collection_name = "test_collection"
    
        try:
            client.delete_collection(test_collection_name)
        except:  # noqa: E722
            pass
    
        collection = client.get_or_create_collection(test_collection_name)
        assert collection.name == test_collection_name
        assert collection.count() == 0
    
        EMBEDDING_MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
        EMBEDDING_MODEL_CHUNK_SIZE = 384
    
        embedding_model = HuggingFaceEmbeddings(
            model_name=EMBEDDING_MODEL_NAME,
            # multi_process=True,   # I found this causes crashes and slowness
            model_kwargs={"device": "cuda" if torch.cuda.is_available() else "cpu"},
            encode_kwargs={
                "normalize_embeddings": True
            },  # set True for cosine similarity
        )
        vector_db = Chroma(
            embedding_function=embedding_model,
            client=client,
            collection_name=test_collection_name,
            collection_metadata={"hnsw:space": "cosine"},
        )
        # Make new uid
        id = str(uuid.uuid4())
        test_doc = Document("hello world")
        test_doc.id = id
        test_doc.metadata["test"] = "test"
        vector_db.add_documents([test_doc])
        assert collection.count() == 1
    
        docs = vector_db.similarity_search_with_score("hello world", 1)
        assert len(docs) == 1
        assert docs[0][0].page_content == "hello world"
>       assert docs[0][0].id == id
E       AssertionError: assert None == '808971a7-f042-4306-864b-97af29c4837d'
E        +  where None = Document(metadata={'test': 'test'}, page_content='hello world').id

code/tests/test_basic_chroma_setup.py:99: AssertionError
=============================== warnings summary ===============================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: Type google._upb._message.MessageMapContainer uses PyType_Spec with a metaclass that has custom tp_new. This is deprecated and will no longer be allowed in Python 3.14.

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: Type google._upb._message.ScalarMapContainer uses PyType_Spec with a metaclass that has custom tp_new. This is deprecated and will no longer be allowed in Python 3.14.

code/tests/test_basic_chroma_setup.py::test_basic_operations_with_langchain
  /projects/myproj/.venv/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED code/tests/test_basic_chroma_setup.py::test_basic_operations_with_langchain
======================== 1 failed, 3 warnings in 6.21s =========================
Finished running tests!

I would expect the result:

=============================== warnings summary ===============================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: Type google._upb._message.MessageMapContainer uses PyType_Spec with a metaclass that has custom tp_new. This is deprecated and will no longer be allowed in Python 3.14.

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: Type google._upb._message.ScalarMapContainer uses PyType_Spec with a metaclass that has custom tp_new. This is deprecated and will no longer be allowed in Python 3.14.

code/tests/test_basic_chroma_setup.py::test_basic_operations_with_langchain
  /projects/myproj/.venv/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================== 1 passed, 3 warnings in 7.26s =========================
Finished running tests!

Description

I am using:

langchain==0.2.15
langchain-chroma==0.1.3
langchain-community==0.2.15
langchain-core==0.2.37
langchain-huggingface==0.0.3
langchain-text-splitters==0.2.2
chromadb==0.5.3

I am calling vector_db.similarity_search_with_score(...) or vector_db.similarity_search(...) and the documents returned never have an id.

I can't see a way of getting the returned documents to include an id. I thought about adding a kwarg to include ids, but the documentation for the underlying query says it always returns ids. I think the issue is in this method:

def _results_to_docs_and_scores(results: Any) -> List[Tuple[Document, float]]:
    return [
        # TODO: Chroma can do batch querying,
        # we shouldn't hard code to the 1st result
        (Document(page_content=result[0], metadata=result[1] or {}), result[2])
        for result in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]


It doesn't use the ids that Chroma returns.

This fix should be very easy:

def _results_to_docs_and_scores(results: Any) -> List[Tuple[Document, float]]:
    return [
        # TODO: Chroma can do batch querying,
        # we shouldn't hard code to the 1st result
        (Document(id=result[0], page_content=result[1], metadata=result[2] or {}), result[3])
        for result in zip(
            results["ids"][0],
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]
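To see the proposed fix in action without a running Chroma server, the snippet below exercises it against a hand-built results dict shaped like Chroma's `collection.query(...)` output. The `Document` dataclass here is a minimal stand-in for `langchain_core.documents.Document` (an assumption made so the sketch runs without langchain installed):

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple

# Minimal stand-in for langchain_core.documents.Document (illustration only).
@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)
    id: Optional[str] = None

def _results_to_docs_and_scores(results: Any) -> List[Tuple[Document, float]]:
    # Zip ids alongside documents, metadatas, and distances for the 1st query.
    return [
        (Document(id=result[0], page_content=result[1], metadata=result[2] or {}), result[3])
        for result in zip(
            results["ids"][0],
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]

# A fake Chroma query result with one match:
fake_results = {
    "ids": [["doc-1"]],
    "documents": [["hello world"]],
    "metadatas": [[{"test": "test"}]],
    "distances": [[0.0]],
}
docs_and_scores = _results_to_docs_and_scores(fake_results)
print(docs_and_scores[0][0].id)  # doc-1
```

With the extra `results["ids"][0]` stream in the zip, the id survives into the constructed `Document` instead of defaulting to `None`.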

System Info

System Information

OS: Linux
OS Version: #1 SMP Fri Mar 29 23:14:13 UTC 2024
Python Version: 3.12.4 (main, Jul 19 2024, 17:20:16) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]

Package Information

langchain_core: 0.2.37
langchain: 0.2.15
langchain_community: 0.2.15
langsmith: 0.1.108
langchain_chroma: 0.1.3
langchain_huggingface: 0.0.3
langchain_text_splitters: 0.2.2

Optional packages not installed

langgraph
langserve

Other Dependencies

aiohttp: 3.10.5
async-timeout: Installed. No version info available.
chromadb: 0.5.3
dataclasses-json: 0.6.7
fastapi: 0.112.2
httpx: 0.27.2
huggingface-hub: 0.24.6
jsonpatch: 1.33
numpy: 1.26.4
orjson: 3.10.7
packaging: 24.1
pydantic: 2.8.2
PyYAML: 6.0.2
requests: 2.32.3
sentence-transformers: 3.0.1
SQLAlchemy: 2.0.32
tenacity: 8.5.0
tokenizers: 0.19.1
transformers: 4.44.2
typing-extensions: 4.12.2

@langcarl langcarl bot added the investigate Flagged for investigation. label Sep 25, 2024
@dosubot dosubot bot added the Ɑ: vector store Related to vector store module label Sep 25, 2024
@KrisTC
Author

KrisTC commented Sep 25, 2024

Until there is a fix I am patching my code with this helper method that has my fix in it:

####################################################################################################
# This is PATCHED code from the original langchain_chroma module
####################################################################################################
# See bug: https://github.com/langchain-ai/langchain/issues/26860

def similarity_search_with_score_with_id(
    vector_db: Chroma,
    query: str,
    k: int = DEFAULT_K,
    filter: Optional[dict[str, str]] = None,
    where_document: Optional[dict[str, str]] = None,
    **kwargs,
) -> list[tuple[Document, float]]:
    if vector_db._embedding_function is None:
        results = vector_db._collection.query(
            query_texts=[query],
            n_results=k,
            where=filter,                   # type: ignore
            where_document=where_document,  # type: ignore
            **kwargs,
        )
    else:
        query_embedding = vector_db._embedding_function.embed_query(query)
        results = vector_db._collection.query(
            query_embeddings=[query_embedding], # type: ignore
            n_results=k,
            where=filter,                       # type: ignore
            where_document=where_document,      # type: ignore
            **kwargs,
        )

    return _results_to_docs_and_scores(results)


def _results_to_docs_and_scores(results) -> list[tuple[Document, float]]:
    return [
        # TODO: Chroma can do batch querying,
        # we shouldn't hard code to the 1st result
        (Document(id=result[0], page_content=result[1], metadata=result[2] or {}), result[3])
        for result in zip(
            results["ids"][0],
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]
####################################################################################################
# End of PATCHED code
####################################################################################################

In case anyone finds it useful.
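The patched helper above can be exercised end-to-end without a real Chroma server. This is a hedged sketch, not langchain's API: `Document`, `fake_query`, and `fake_db` are stand-ins built with `SimpleNamespace` purely to mimic the `_embedding_function` / `_collection.query` attributes the helper touches:

```python
from dataclasses import dataclass, field
from types import SimpleNamespace
from typing import Optional

# Stand-in for langchain_core.documents.Document (illustration only).
@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)
    id: Optional[str] = None

def _results_to_docs_and_scores(results):
    return [
        (Document(id=r[0], page_content=r[1], metadata=r[2] or {}), r[3])
        for r in zip(results["ids"][0], results["documents"][0],
                     results["metadatas"][0], results["distances"][0])
    ]

def similarity_search_with_score_with_id(vector_db, query, k=4, filter=None,
                                         where_document=None, **kwargs):
    # Same control flow as the patched helper above, minus type annotations.
    if vector_db._embedding_function is None:
        results = vector_db._collection.query(
            query_texts=[query], n_results=k, where=filter,
            where_document=where_document, **kwargs)
    else:
        emb = vector_db._embedding_function.embed_query(query)
        results = vector_db._collection.query(
            query_embeddings=[emb], n_results=k, where=filter,
            where_document=where_document, **kwargs)
    return _results_to_docs_and_scores(results)

# Fake vector store standing in for a Chroma instance (illustrative only).
def fake_query(**kwargs):
    return {"ids": [["doc-1"]], "documents": [["hello world"]],
            "metadatas": [[{"test": "test"}]], "distances": [[0.0]]}

fake_db = SimpleNamespace(
    _embedding_function=SimpleNamespace(embed_query=lambda q: [0.0] * 3),
    _collection=SimpleNamespace(query=fake_query),
)

doc, score = similarity_search_with_score_with_id(fake_db, "hello world", k=1)[0]
print(doc.id, score)  # doc-1 0.0
```

Structuring the fake this way also makes it easy to drop into a unit test, since only the two private attributes the helper reads need to exist.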

@eyurtsev eyurtsev added the 03 enhancement Enhancement of existing functionality label Sep 25, 2024
@eyurtsev
Collaborator

Not all vectorstores currently support this. Someone can make a PR to add support for returning the ID as part of the document.

The ID field on documents was added only a few months ago, and it is still considered optional.

@KrisTC
Author

KrisTC commented Sep 25, 2024

Hi @eyurtsev,
Thanks for your reply. Are you suggesting "someone" is me? I am happy to do it, but I don't want to break things. The code I highlighted is specific to Chroma, so I am guessing my suggested change would be OK, but I would need more info to do a better, more complete job.
Thanks
Kris

@eyurtsev
Collaborator

@KrisTC Ha, sorry, I didn't mean to imply that. This was for anyone in the community who wants to tackle this. I should read what I write more carefully :)

@chaunguyenm

@eyurtsev We are looking to contribute to this issue. Since @KrisTC already provided a fix for Chroma, we can potentially write test cases for it and also investigate the remaining vectorstores and add support for the ones that are missing this functionality. I believe we are targeting only vectorstores with similarity search, would that be correct? If you have more detailed information on which vectorstores need an update, that would be greatly appreciated.

@eyurtsev
Collaborator

There's a standard test suite in the standard tests package that covers a lot of the relevant edge cases. You can git grep through the codebase to find example usage

git grep "standard_test"

@chaunguyenm

Some update, any suggestions are appreciated:

  1. We looked through langchain/libs/partners and found these vectorstores also do not populate Document's id: qdrant, couchbase, mongodb. We haven't fully checked langchain/libs/community; we will focus on partners first if there's no preference.

  2. Applying OP's fix above for Chroma does not fail any test. We are planning to add some unit test cases to ensure Document's id is populated and returned correctly.

@kwei-zhang
Contributor

Hello @eyurtsev, we have the fix and test cases for ChromaDB. Should we create a pull request for Chroma first and work on the other vector stores in later pull requests, or should we create one pull request after fixing the other vector stores altogether?

@eyurtsev
Collaborator

Individual PRs would be great

@kwei-zhang
Contributor

Just created a PR #27366

efriis added a commit that referenced this issue Oct 24, 2024
**Description:** Returns the document id along with the Vector Search results

**Issue:** Fixes #26860 for CouchbaseVectorStore

- [x] **Add tests and docs**: If you're adding a new integration, please include
  1. a test for the integration, preferably unit tests that do not rely on network access,
  2. an example notebook showing its use. It lives in the `docs/docs/integrations` directory.
- [x] **Lint and test**: Run `make format`, `make lint` and `make test` from the root of the package(s) you've modified.

Co-authored-by: Erick Friis <[email protected]>