
Chroma similarity_search_with_score (and similar methods) don't populate the document.id property of the returned documents. #26860

Closed
KrisTC opened this issue Sep 25, 2024 · 10 comments · Fixed by #27622
Labels
03 enhancement (Enhancement of existing functionality) · investigate (Flagged for investigation) · Ɑ: vector store (Related to vector store module)

Comments

@KrisTC

KrisTC commented Sep 25, 2024

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

I want to get the document id from the documents returned by similarity_search_with_score, but the ids aren't being set on the documents when they are created.

Here is my test code:

def test_basic_operations_with_langchain():
    token = os.environ.get('CHROMA_TOKEN')
    client = chromadb.HttpClient(
        settings=Settings(chroma_client_auth_provider="chromadb.auth.token_authn.TokenAuthClientProvider",
                          chroma_client_auth_credentials=token))
    test_collection_name = "test_collection"

    try:
        client.delete_collection(test_collection_name)
    except:  # noqa: E722
        pass

    collection = client.get_or_create_collection(test_collection_name)
    assert collection.name == test_collection_name
    assert collection.count() == 0

    EMBEDDING_MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
    EMBEDDING_MODEL_CHUNK_SIZE = 384

    embedding_model = HuggingFaceEmbeddings(
        model_name=EMBEDDING_MODEL_NAME,
        # multi_process=True,   # I found this causes crashes and slowness
        model_kwargs={"device": "cuda" if torch.cuda.is_available() else "cpu"},
        encode_kwargs={
            "normalize_embeddings": True
        },  # set True for cosine similarity
    )
    vector_db = Chroma(
        embedding_function=embedding_model,
        client=client,
        collection_name=test_collection_name,
        collection_metadata={"hnsw:space": "cosine"},
    )
    # Make new uid
    id = str(uuid.uuid4())
    test_doc = Document("hello world")
    test_doc.id = id
    test_doc.metadata["test"] = "test"
    vector_db.add_documents([test_doc])
    assert collection.count() == 1

    docs = vector_db.similarity_search_with_score("hello world", 1)
    assert len(docs) == 1
    assert docs[0][0].page_content == "hello world"
    assert docs[0][0].id == id

    client.delete_collection(test_collection_name)

Error Message and Stack Trace (if applicable)

Test output:

=================================== FAILURES ===================================
_____________________ test_basic_operations_with_langchain _____________________

    def test_basic_operations_with_langchain():
        token = os.environ.get('CHROMA_TOKEN')
        client = chromadb.HttpClient(
            settings=Settings(chroma_client_auth_provider="chromadb.auth.token_authn.TokenAuthClientProvider",
                              chroma_client_auth_credentials=token))
        test_collection_name = "test_collection"
    
        try:
            client.delete_collection(test_collection_name)
        except:  # noqa: E722
            pass
    
        collection = client.get_or_create_collection(test_collection_name)
        assert collection.name == test_collection_name
        assert collection.count() == 0
    
        EMBEDDING_MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
        EMBEDDING_MODEL_CHUNK_SIZE = 384
    
        embedding_model = HuggingFaceEmbeddings(
            model_name=EMBEDDING_MODEL_NAME,
            # multi_process=True,   # I found this causes crashes and slowness
            model_kwargs={"device": "cuda" if torch.cuda.is_available() else "cpu"},
            encode_kwargs={
                "normalize_embeddings": True
            },  # set True for cosine similarity
        )
        vector_db = Chroma(
            embedding_function=embedding_model,
            client=client,
            collection_name=test_collection_name,
            collection_metadata={"hnsw:space": "cosine"},
        )
        # Make new uid
        id = str(uuid.uuid4())
        test_doc = Document("hello world")
        test_doc.id = id
        test_doc.metadata["test"] = "test"
        vector_db.add_documents([test_doc])
        assert collection.count() == 1
    
        docs = vector_db.similarity_search_with_score("hello world", 1)
        assert len(docs) == 1
        assert docs[0][0].page_content == "hello world"
>       assert docs[0][0].id == id
E       AssertionError: assert None == '808971a7-f042-4306-864b-97af29c4837d'
E        +  where None = Document(metadata={'test': 'test'}, page_content='hello world').id

code/tests/test_basic_chroma_setup.py:99: AssertionError
=============================== warnings summary ===============================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: Type google._upb._message.MessageMapContainer uses PyType_Spec with a metaclass that has custom tp_new. This is deprecated and will no longer be allowed in Python 3.14.

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: Type google._upb._message.ScalarMapContainer uses PyType_Spec with a metaclass that has custom tp_new. This is deprecated and will no longer be allowed in Python 3.14.

code/tests/test_basic_chroma_setup.py::test_basic_operations_with_langchain
  /projects/myproj/.venv/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED code/tests/test_basic_chroma_setup.py::test_basic_operations_with_langchain
======================== 1 failed, 3 warnings in 6.21s =========================
Finished running tests!

I would expect the result:

=============================== warnings summary ===============================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: Type google._upb._message.MessageMapContainer uses PyType_Spec with a metaclass that has custom tp_new. This is deprecated and will no longer be allowed in Python 3.14.

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: Type google._upb._message.ScalarMapContainer uses PyType_Spec with a metaclass that has custom tp_new. This is deprecated and will no longer be allowed in Python 3.14.

code/tests/test_basic_chroma_setup.py::test_basic_operations_with_langchain
  /projects/myproj/.venv/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================== 1 passed, 3 warnings in 7.26s =========================
Finished running tests!

Description

I am using:

langchain==0.2.15
langchain-chroma==0.1.3
langchain-community==0.2.15
langchain-core==0.2.37
langchain-huggingface==0.0.3
langchain-text-splitters==0.2.2
chromadb==0.5.3

I am calling vector_db.similarity_search_with_score(...) or vector_db.similarity_search(...) and the documents returned never have an id.

I can't see a way of getting the returned documents to include an id. I thought about adding a kwarg to include ids, but the documentation for the underlying query says it always returns ids. I think the issue is in this method:

def _results_to_docs_and_scores(results: Any) -> List[Tuple[Document, float]]:
    return [
        # TODO: Chroma can do batch querying,
        # we shouldn't hard code to the 1st result
        (Document(page_content=result[0], metadata=result[1] or {}), result[2])
        for result in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]


It doesn't use the ids that Chroma returns.

This fix should be very easy:

def _results_to_docs_and_scores(results: Any) -> List[Tuple[Document, float]]:
    return [
        # TODO: Chroma can do batch querying,
        # we shouldn't hard code to the 1st result
        (Document(id=result[0], page_content=result[1], metadata=result[2] or {}), result[3])
        for result in zip(
            results["ids"][0],
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]
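To see the proposed fix in action without a running Chroma server, the snippet below exercises it against a hand-built results dict shaped like Chroma's `collection.query(...)` output. The `Document` dataclass here is a minimal stand-in for `langchain_core.documents.Document` (an assumption made so the sketch runs without langchain installed):

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple

# Minimal stand-in for langchain_core.documents.Document (illustration only).
@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)
    id: Optional[str] = None

def _results_to_docs_and_scores(results: Any) -> List[Tuple[Document, float]]:
    # Zip ids alongside documents, metadatas, and distances for the 1st query.
    return [
        (Document(id=result[0], page_content=result[1], metadata=result[2] or {}), result[3])
        for result in zip(
            results["ids"][0],
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]

# A fake Chroma query result with one match:
fake_results = {
    "ids": [["doc-1"]],
    "documents": [["hello world"]],
    "metadatas": [[{"test": "test"}]],
    "distances": [[0.0]],
}
docs_and_scores = _results_to_docs_and_scores(fake_results)
print(docs_and_scores[0][0].id)  # doc-1
```

With the extra `results["ids"][0]` stream in the zip, the id survives into the constructed `Document` instead of defaulting to `None`.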

System Info

System Information

OS: Linux
OS Version: #1 SMP Fri Mar 29 23:14:13 UTC 2024
Python Version: 3.12.4 (main, Jul 19 2024, 17:20:16) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]

Package Information

langchain_core: 0.2.37
langchain: 0.2.15
langchain_community: 0.2.15
langsmith: 0.1.108
langchain_chroma: 0.1.3
langchain_huggingface: 0.0.3
langchain_text_splitters: 0.2.2

Optional packages not installed

langgraph
langserve

Other Dependencies

aiohttp: 3.10.5
async-timeout: Installed. No version info available.
chromadb: 0.5.3
dataclasses-json: 0.6.7
fastapi: 0.112.2
httpx: 0.27.2
huggingface-hub: 0.24.6
jsonpatch: 1.33
numpy: 1.26.4
orjson: 3.10.7
packaging: 24.1
pydantic: 2.8.2
PyYAML: 6.0.2
requests: 2.32.3
sentence-transformers: 3.0.1
SQLAlchemy: 2.0.32
tenacity: 8.5.0
tokenizers: 0.19.1
transformers: 4.44.2
typing-extensions: 4.12.2

@langcarl langcarl bot added the investigate Flagged for investigation. label Sep 25, 2024
@dosubot dosubot bot added the Ɑ: vector store Related to vector store module label Sep 25, 2024
@KrisTC
Author

KrisTC commented Sep 25, 2024

Until there is a fix I am patching my code with this helper method that has my fix in it:

####################################################################################################
# This is PATCHED code from the original langchain_chroma module
####################################################################################################
# See bug: https://github.com/langchain-ai/langchain/issues/26860

def similarity_search_with_score_with_id(
    vector_db: Chroma,
    query: str,
    k: int = DEFAULT_K,
    filter: Optional[dict[str, str]] = None,
    where_document: Optional[dict[str, str]] = None,
    **kwargs,
) -> list[tuple[Document, float]]:
    if vector_db._embedding_function is None:
        results = vector_db._collection.query(
            query_texts=[query],
            n_results=k,
            where=filter,                   # type: ignore
            where_document=where_document,  # type: ignore
            **kwargs,
        )
    else:
        query_embedding = vector_db._embedding_function.embed_query(query)
        results = vector_db._collection.query(
            query_embeddings=[query_embedding], # type: ignore
            n_results=k,
            where=filter,                       # type: ignore
            where_document=where_document,      # type: ignore
            **kwargs,
        )

    return _results_to_docs_and_scores(results)


def _results_to_docs_and_scores(results) -> list[tuple[Document, float]]:
    return [
        # TODO: Chroma can do batch querying,
        # we shouldn't hard code to the 1st result
        (Document(id=result[0], page_content=result[1], metadata=result[2] or {}), result[3])
        for result in zip(
            results["ids"][0],
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]
####################################################################################################
# End of PATCHED code
####################################################################################################

In case anyone finds it useful.
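The patched helper above can be exercised end-to-end without a real Chroma server. This is a hedged sketch, not langchain's API: `Document`, `fake_query`, and `fake_db` are stand-ins built with `SimpleNamespace` purely to mimic the `_embedding_function` / `_collection.query` attributes the helper touches:

```python
from dataclasses import dataclass, field
from types import SimpleNamespace
from typing import Optional

# Stand-in for langchain_core.documents.Document (illustration only).
@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)
    id: Optional[str] = None

def _results_to_docs_and_scores(results):
    return [
        (Document(id=r[0], page_content=r[1], metadata=r[2] or {}), r[3])
        for r in zip(results["ids"][0], results["documents"][0],
                     results["metadatas"][0], results["distances"][0])
    ]

def similarity_search_with_score_with_id(vector_db, query, k=4, filter=None,
                                         where_document=None, **kwargs):
    # Same control flow as the patched helper above, minus type annotations.
    if vector_db._embedding_function is None:
        results = vector_db._collection.query(
            query_texts=[query], n_results=k, where=filter,
            where_document=where_document, **kwargs)
    else:
        emb = vector_db._embedding_function.embed_query(query)
        results = vector_db._collection.query(
            query_embeddings=[emb], n_results=k, where=filter,
            where_document=where_document, **kwargs)
    return _results_to_docs_and_scores(results)

# Fake vector store standing in for a Chroma instance (illustrative only).
def fake_query(**kwargs):
    return {"ids": [["doc-1"]], "documents": [["hello world"]],
            "metadatas": [[{"test": "test"}]], "distances": [[0.0]]}

fake_db = SimpleNamespace(
    _embedding_function=SimpleNamespace(embed_query=lambda q: [0.0] * 3),
    _collection=SimpleNamespace(query=fake_query),
)

doc, score = similarity_search_with_score_with_id(fake_db, "hello world", k=1)[0]
print(doc.id, score)  # doc-1 0.0
```

Structuring the fake this way also makes it easy to drop into a unit test, since only the two private attributes the helper reads need to exist.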

@eyurtsev eyurtsev added the 03 enhancement Enhancement of existing functionality label Sep 25, 2024
@eyurtsev
Collaborator

Not all vectorstores currently support this. Someone can make a PR to add support for returning the ID as part of the document.

The ID field on documents was added only a few months ago, and it is still considered optional.

@KrisTC
Author

KrisTC commented Sep 25, 2024

Hi @eyurtsev,
Thanks for your reply. Are you suggesting "someone" is me? I am happy to do it, but I don't want to break things. The code I highlighted is specific to Chroma, so I am guessing my suggested change would be OK, but I would need more info to do a better, more complete job.
Thanks
Kris

@eyurtsev
Collaborator

@KrisTC Ha, sorry, I didn't mean to imply that. This was for anyone in the community who wants to tackle this. I should read what I write more carefully :)

@chaunguyenm

@eyurtsev We are looking to contribute to this issue. Since @KrisTC already provided a fix for Chroma, we can potentially write test cases for it and also investigate the remaining vectorstores and add support for the ones that are missing this functionality. I believe we are targeting only vectorstores with similarity search, would that be correct? If you have more detailed information on which vectorstores need an update, that would be greatly appreciated.

@eyurtsev
Collaborator

There's a standard test suite in the standard tests package that covers a lot of the relevant edge cases. You can git grep through the codebase to find example usage

git grep "standard_test"

@chaunguyenm

Some update, any suggestions are appreciated:

  1. We looked through langchain/libs/partners and found these vectorstores also do not populate Document's id: qdrant, couchbase, mongodb. We haven't fully checked langchain/libs/community; we will focus on partners first if there's no preference.

  2. Applying OP's fix above for Chroma does not fail any test. We are planning to add some unit test cases to ensure Document's id is populated and returned correctly.

@kwei-zhang
Contributor

Hello @eyurtsev, we have the fix and test cases for ChromaDB. Should we create a pull request for Chroma first and work on the other vector stores in later pull requests, or should we create one pull request after fixing the other vector stores altogether?

@eyurtsev
Collaborator

Individual PRs would be great

@kwei-zhang
Contributor

Just created a PR #27366

efriis added a commit that referenced this issue Oct 24, 2024
**Description:** Returns the document id along with the Vector Search results

**Issue:** Fixes #26860 for CouchbaseVectorStore

- [x] **Add tests and docs**: If you're adding a new integration, please include
  1. a test for the integration, preferably unit tests that do not rely on network access,
  2. an example notebook showing its use. It lives in the `docs/docs/integrations` directory.
- [x] **Lint and test**: Run `make format`, `make lint` and `make test` from the root of the package(s) you've modified.

Co-authored-by: Erick Friis <[email protected]>