Add ChromaDB Document Store (#47)
* setup test workflow

* add chroma

* fix package versioning

* fix

* try

* fix tests

* try

* fix linter

* revert unneeded test override
masci authored Nov 9, 2023
1 parent e1f99ca commit 57b6b9c
Showing 9 changed files with 58 additions and 79 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/components_instructor_embedders.yml
@@ -1,4 +1,4 @@
name: Test / instructor-embedders
name: Test / Components / instructor-embedders

on:
push:
@@ -40,4 +40,4 @@ jobs:
- name: Run integration tests
run: |
pytest -v -m integration
pytest -v -m integration
@@ -1,12 +1,19 @@
# This workflow comes from https://github.com/ofek/hatch-mypyc
# https://github.com/ofek/hatch-mypyc/blob/5a198c0ba8660494d02716cfc9d79ce4adfb1442/.github/workflows/test.yml
name: test
name: Test / Document Stores / chroma

on:
push:
branches:
- main
pull_request:
paths:
- 'document_stores/chroma/**'
- '.github/workflows/document_stores_chroma.yml'

defaults:
run:
working-directory: document_stores/chroma

concurrency:
group: test-${{ github.head_ref }}
@@ -29,6 +36,7 @@ jobs:
steps:
- name: Support longpaths
if: matrix.os == 'windows-latest'
working-directory: .
run: git config --system core.longpaths true

- uses: actions/checkout@v3
18 changes: 5 additions & 13 deletions README.md
@@ -1,25 +1,17 @@
# Haystack 2.x additional resources

This repository contains integrations to extend the capabilities of [Haystack][haystack-repo] version 2.0 and
This repository contains integrations to extend the capabilities of [Haystack](https://github.com/deepset-ai/haystack) version 2.0 and
onwards. The code in this repo is maintained by [deepset](https://www.deepset.ai), some of it on a best-effort
basis: see each folder's `README` file for details around installation, usage and support.

This is the list of packages currently hosted in this repo.

| Package | Type | PyPi Package | Status |
| -------------------------------------------------------- | -------- | -------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [instructor-embedders](components/instructor-embedders/) | Embedder | [![PyPI - Version](https://img.shields.io/pypi/v/instructor-embedders-haystack.svg)](https://pypi.org/project/instructor-embedders-haystack) | [![Test / instructor-embedders](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/components_instructor_embedders.yml/badge.svg)](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/components_instructor_embedders.yml) |
| Package | Type | PyPi Package | Status |
| ----------------------------------------------------------------- | -------------- | -------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [chroma-haystack](document_stores/chroma/) | Document Store | [![PyPI - Version](https://img.shields.io/pypi/v/chroma-haystack.svg)](https://pypi.org/project/chroma-haystack) | [![Test / Document Stores / chroma](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/document_stores_chroma.yml/badge.svg)](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/document_stores_chroma.yml) |
| [instructor-embedders-haystack](components/instructor-embedders/) | Embedder | [![PyPI - Version](https://img.shields.io/pypi/v/instructor-embedders-haystack.svg)](https://pypi.org/project/instructor-embedders-haystack) | [![Test / instructor-embedders](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/components_instructor_embedders.yml/badge.svg)](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/components_instructor_embedders.yml) |


## Contributing

You will need `hatch` to create new projects in this folder. Run `pip install -r requirements.txt` to install it.



[haystack-repo]: https://github.com/deepset-ai/haystack
[text2speechbadge]: https://github.com/deepset-ai/haystack-extras/actions/workflows/nodes_text2speech.yml/badge.svg
[text2speech]: https://github.com/deepset-ai/haystack-extras/actions/workflows/nodes_text2speech.yml
[text2speechPypi]: https://pypi.org/project/farm-haystack-text2speech
[milvus_badge]: https://github.com/deepset-ai/haystack-extras/actions/workflows/stores_milvus_document_store.yml/badge.svg
[milvus]: https://github.com/deepset-ai/haystack-extras/actions/workflows/stores_milvus_document_store.yml
26 changes: 0 additions & 26 deletions document_stores/chroma/.github/workflows/release.yml

This file was deleted.

2 changes: 1 addition & 1 deletion document_stores/chroma/pyproject.toml
@@ -33,7 +33,7 @@ Issues = "https://github.com/masci/chroma-haystack/issues"
Source = "https://github.com/masci/chroma-haystack"

[tool.hatch.version]
source="vcs"
path = "src/chroma_haystack/__about__.py"

[tool.hatch.envs.default]
dependencies = [
4 changes: 4 additions & 0 deletions document_stores/chroma/src/chroma_haystack/__about__.py
@@ -0,0 +1,4 @@
# SPDX-FileCopyrightText: 2023-present deepset GmbH <[email protected]>
#
# SPDX-License-Identifier: Apache-2.0
__version__ = "0.7.0"
55 changes: 23 additions & 32 deletions document_stores/chroma/src/chroma_haystack/document_store.py
@@ -45,7 +45,7 @@ def __init__(
self._chroma_client = chromadb.Client()
self._collection = self._chroma_client.create_collection(
name=collection_name,
embedding_function=get_embedding_function(embedding_function)(**embedding_function_params),
embedding_function=get_embedding_function(embedding_function, **embedding_function_params),
)

def count_documents(self) -> int:
@@ -149,22 +149,23 @@ def write_documents(self, documents: List[Document], policy: DuplicatePolicy = D
:raises DuplicateDocumentError: Exception trigger on duplicate document if `policy=DuplicatePolicy.FAIL`
:return: None
"""
for d in documents:
if not isinstance(d, Document):
for doc in documents:
if not isinstance(doc, Document):
msg = "param 'documents' must contain a list of objects of type Document"
raise ValueError(msg)

doc = self._prepare(d)

if doc.text is None:
if doc.content is None:
logger.warn(
"ChromaDocumentStore can only store the text field of Documents: "
"'array', 'dataframe' and 'blob' will be dropped."
)
data = {"ids": [doc.id], "documents": [doc.text], "metadatas": [doc.metadata]}
data = {"ids": [doc.id], "documents": [doc.content]}

if doc.meta:
data["metadatas"] = [doc.meta]

if doc.embedding is not None:
data["embeddings"] = [doc.embedding.tolist()]
data["embeddings"] = [doc.embedding]

self._collection.add(**data)
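The per-document payload assembled in this hunk can be sketched without Chroma or Haystack installed. This is a minimal illustration, assuming a hypothetical `Doc` dataclass that stands in for Haystack's `Document` and a hypothetical helper name `build_add_payload`; neither is part of the commit.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for Haystack's Document (only the fields used here)."""

    id: str
    content: Optional[str] = None
    meta: Dict[str, Any] = field(default_factory=dict)
    embedding: Optional[List[float]] = None


def build_add_payload(doc: Doc) -> Dict[str, Any]:
    """Build the keyword arguments for a single-document collection.add() call."""
    data: Dict[str, Any] = {"ids": [doc.id], "documents": [doc.content]}
    # Metadata is only sent when the document has some, as in the diff.
    if doc.meta:
        data["metadatas"] = [doc.meta]
    # Embeddings are already plain lists of floats, so no conversion is applied.
    if doc.embedding is not None:
        data["embeddings"] = [doc.embedding]
    return data


payload = build_add_payload(Doc(id="1", content="hello"))
```

Building the dictionary incrementally keeps optional keys out of the call entirely instead of passing empty or `None` values.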

@@ -224,7 +225,7 @@ def _normalize_filters(self, filters: Dict[str, Any]) -> Tuple[List[str], Dict[s
keys_to_remove = []

for field, value in filters.items():
if field == "text":
if field == "content":
# Schedule for removal the original key, we're going to change it
keys_to_remove.append(field)
where_document["$contains"] = value
@@ -267,35 +268,25 @@ def _normalize_filters(self, filters: Dict[str, Any]) -> Tuple[List[str], Dict[s

return ids, final_where, where_document

def _prepare(self, d: Document) -> Document:
"""
Change the document in a way we can better store it into Chroma.
Fore example, we store as metadata additional fields Chroma doesn't manage
"""
new_meta = {"_mime_type": d.mime_type} | d.metadata
orig = d.to_dict()
orig["metadata"] = new_meta
# return a copy
return Document.from_dict(orig)

def _get_result_to_documents(self, result: GetResult) -> List[Document]:
"""
Helper function to convert Chroma results into Haystack Documents
"""
retval = []
for i in range(len(result["documents"])):
# prepare metadata
metadata = result["metadatas"][i]
mime_type = metadata.pop("_mime_type")
document_dict = {
"id": result["ids"][i],
"text": result["documents"][i],
"metadata": metadata,
"mime_type": mime_type,
}
for i in range(len(result.get("documents", []))):
document_dict: Dict[str, Any] = {"id": result["ids"][i]}

result_documents = result.get("documents")
if result_documents:
document_dict["content"] = result_documents[i]

result_metadata = result.get("metadatas")
if result_metadata:
document_dict["meta"] = result_metadata[i]

if result["embeddings"]:
document_dict["embedding"] = np.ndarray(result["embeddings"][i])
result_embeddings = result.get("embeddings")
if result_embeddings:
document_dict["embedding"] = list(result_embeddings[i])

retval.append(Document.from_dict(document_dict))
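The conversion in this hunk turns a column-oriented result (parallel lists under `ids`, `documents`, `metadatas`, `embeddings`) into one record per document. A minimal sketch follows, assuming a plain dictionary in place of Chroma's `GetResult` and returning dictionaries rather than `Document` objects; the function name `get_result_to_dicts` is hypothetical. Unlike the diff, it also tolerates `documents` being present but `None`.

```python
from typing import Any, Dict, List


def get_result_to_dicts(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn a column-oriented result into one dict per document."""
    documents = result.get("documents") or []
    metadatas = result.get("metadatas")
    embeddings = result.get("embeddings")

    retval = []
    for i in range(len(documents)):
        document_dict: Dict[str, Any] = {"id": result["ids"][i], "content": documents[i]}
        # Optional columns are copied only when the result carries them.
        if metadatas:
            document_dict["meta"] = metadatas[i]
        if embeddings:
            # Embeddings are kept as plain lists of floats.
            document_dict["embedding"] = list(embeddings[i])
        retval.append(document_dict)
    return retval


rows = get_result_to_dicts(
    {"ids": ["a", "b"], "documents": ["x", "y"], "metadatas": [{"k": 1}, {"k": 2}], "embeddings": None}
)
```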

4 changes: 2 additions & 2 deletions document_stores/chroma/src/chroma_haystack/utils.py
@@ -31,9 +31,9 @@
}


def get_embedding_function(function_name: str) -> EmbeddingFunction:
def get_embedding_function(function_name: str, **kwargs) -> EmbeddingFunction:
try:
return FUNCTION_REGISTRY[function_name]
return FUNCTION_REGISTRY[function_name](**kwargs)
except KeyError:
msg = f"Invalid function name: {function_name}"
raise ChromaDocumentStoreConfigError(msg) from KeyError
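The change above moves instantiation into the lookup helper, so callers receive a configured embedding function instead of a class. A self-contained sketch of that registry pattern follows; `ConfigError` and `FakeEmbeddingFunction` are hypothetical stand-ins for `ChromaDocumentStoreConfigError` and a real Chroma embedding function. As a variation on the diff, the `try` block here covers only the registry lookup, so a `KeyError` raised inside a constructor would not be reported as an invalid function name.

```python
from typing import Any, Callable, Dict


class ConfigError(Exception):
    """Hypothetical stand-in for ChromaDocumentStoreConfigError."""


class FakeEmbeddingFunction:
    """Hypothetical embedding function that only records its configuration."""

    def __init__(self, model_name: str = "default") -> None:
        self.model_name = model_name


# Maps a user-facing name to the class (or factory) that builds the function.
FUNCTION_REGISTRY: Dict[str, Callable[..., Any]] = {"fake": FakeEmbeddingFunction}


def get_embedding_function(function_name: str, **kwargs: Any) -> Any:
    """Look up a registered name and return a configured instance."""
    try:
        factory = FUNCTION_REGISTRY[function_name]
    except KeyError as err:
        msg = f"Invalid function name: {function_name}"
        raise ConfigError(msg) from err
    return factory(**kwargs)


embedder = get_embedding_function("fake", model_name="tiny")
```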
14 changes: 12 additions & 2 deletions document_stores/chroma/tests/test_document_store.py
@@ -21,7 +21,7 @@ class TestEmbeddingFunction(EmbeddingFunction):
vectors in unit tests.
"""

def __call__(self, _: Documents) -> Embeddings:
def __call__(self, input: Documents) -> Embeddings: # noqa - chroma will inspect the signature, it must match
# embed the documents somehow
return [np.random.default_rng().uniform(-1, 1, 768).tolist()]

@@ -39,9 +39,19 @@ def docstore(self) -> ChromaDocumentStore:
an instance of this document store so the base class can use it.
"""
with mock.patch("chroma_haystack.document_store.get_embedding_function") as get_func:
get_func.return_value = TestEmbeddingFunction
get_func.return_value = TestEmbeddingFunction()
return ChromaDocumentStore(embedding_function="test_function", collection_name=str(uuid.uuid1()))

@pytest.mark.unit
def test_ne_filter(self, docstore: ChromaDocumentStore, filterable_docs: List[Document]):
"""
We customize this test because Chroma consider "not equal" true when
a field is missing
"""
docstore.write_documents(filterable_docs)
result = docstore.filter_documents(filters={"page": {"$ne": "100"}})
assert self.contains_same_docs(result, [doc for doc in filterable_docs if doc.meta.get("page", "100") != "100"])
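The expected list in `test_ne_filter` is built with `doc.meta.get("page", "100") != "100"`, which leaves out documents that have no `page` field at all. A Chroma-free sketch of how that expression differs from a plain lookup, using made-up documents:

```python
from typing import Any, Dict, List

docs: List[Dict[str, Any]] = [
    {"id": "1", "meta": {"page": "100"}},
    {"id": "2", "meta": {"page": "123"}},
    {"id": "3", "meta": {}},  # no "page" field at all
]

# Defaulting a missing field to the excluded value drops such documents.
with_default = [d["id"] for d in docs if d["meta"].get("page", "100") != "100"]

# A plain lookup yields None for a missing field, which is "not equal" to "100".
plain_lookup = [d["id"] for d in docs if d["meta"].get("page") != "100"]

print(with_default)  # ['2']
print(plain_lookup)  # ['2', '3']
```

Which of the two behaviours a store implements decides whether documents without the field appear in a "not equal" query, so the test's expected set has to mirror it.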

@pytest.mark.unit
def test_delete_empty(self, docstore: ChromaDocumentStore):
"""