Skip to content

Commit

Permalink
feat!: Update embedders to use new Secret API (#29)
Browse files Browse the repository at this point in the history
  • Loading branch information
awinml authored Mar 17, 2024
1 parent de353da commit 8428df6
Show file tree
Hide file tree
Showing 12 changed files with 331 additions and 150 deletions.
50 changes: 27 additions & 23 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,37 +1,41 @@
[![PyPI](https://img.shields.io/pypi/v/voyage-embedders-haystack)](https://pypi.org/project/voyage-embedders-haystack/)
![PyPI - Downloads](https://img.shields.io/pypi/dm/voyage-embedders-haystack?color=blue&logo=pypi&logoColor=gold)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/voyage-embedders-haystack?logo=python&logoColor=gold)
[![GitHub](https://img.shields.io/github/license/awinml/voyage-embedders-haystack?color=green)](LICENSE)
[![PyPI](https://img.shields.io/pypi/v/voyage-embedders-haystack)](https://pypi.org/project/voyage-embedders-haystack/)
![PyPI - Downloads](https://img.shields.io/pypi/dm/voyage-embedders-haystack?color=blue&logo=pypi&logoColor=gold)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/voyage-embedders-haystack?logo=python&logoColor=gold)
[![GitHub](https://img.shields.io/github/license/awinml/voyage-embedders-haystack?color=green)](LICENSE)
[![Actions status](https://github.com/awinml/voyage-embedders-haystack/workflows/Test/badge.svg)](https://github.com/awinml/voyage-embedders-haystack/actions)
[![Coverage Status](https://coveralls.io/repos/github/awinml/voyage-embedders-haystack/badge.svg?branch=main)](https://coveralls.io/github/awinml/voyage-embedders-haystack?branch=main)

[![Types - Mypy](https://img.shields.io/badge/types-Mypy-blue.svg)](https://github.com/python/mypy)
[![Types - Mypy](https://img.shields.io/badge/types-Mypy-blue.svg)](https://github.com/python/mypy)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![Code Style - Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)


[![Code Style - Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

<h1 align="center"> <a href="https://github.com/awinml/voyage-embedders-haystack"> Voyage Embedders - Haystack </a> </h1>

Custom component for [Haystack](https://github.com/deepset-ai/haystack) (2.x) for creating embeddings using the [VoyageAI Embedding Models](https://voyageai.com/).

Voyage’s embedding models, `voyage-2` and `voyage-2-code`, are state-of-the-art in retrieval accuracy. These models outperform top performing embedding models like `intfloat/e5-mistral-7b-instruct` and `OpenAI/text-embedding-3-large` on the [MTEB Benchmark](https://github.com/embeddings-benchmark/mteb). `voyage-2` is current ranked second on the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).


#### What's New

- **[v1.3.0 - 18/03/24]:**

- **Breaking Change:** The import path for the embedders has been changed to `haystack_integrations.components.embedders.voyage_embedders`.
Please replace all instances of `from voyage_embedders.voyage_document_embedder import VoyageDocumentEmbedder` and `from voyage_embedders.voyage_text_embedder import VoyageTextEmbedder` with
`from haystack_integrations.components.embedders.voyage_embedders import VoyageDocumentEmbedder, VoyageTextEmbedder`.
- The embedders now use the Haystack `Secret` API for authentication. For more information please see the [Secret Management Documentation](https://docs.haystack.deepset.ai/docs/secret-management).

- **[v1.2.0 - 02/02/24]:**
- **Breaking Change:** `VoyageDocumentEmbedder` and `VoyageTextEmbedder` now accept the `model` parameter instead of `model_name`.
- The embedders have been use the new `voyageai.Client.embed()` method instead of the deprecated `get_embedding` and `get_embeddings` methods of the global namespace.
- Support for the new `truncate` parameter has been added.
- Default embedding model has been changed to "voyage-2" from the deprecated "voyage-01".
- The embedders now return the total number of tokens used as part of the `"total_tokens"` in the metadata.

- **Breaking Change:** `VoyageDocumentEmbedder` and `VoyageTextEmbedder` now accept the `model` parameter instead of `model_name`.
- The embedders have been use the new `voyageai.Client.embed()` method instead of the deprecated `get_embedding` and `get_embeddings` methods of the global namespace.
- Support for the new `truncate` parameter has been added.
- Default embedding model has been changed to "voyage-2" from the deprecated "voyage-01".
- The embedders now return the total number of tokens used as part of the `"total_tokens"` in the metadata.

- **[v1.1.0 - 13/12/23]:** Added support for `input_type` parameter in `VoyageTextEmbedder` and `VoyageDocument Embedder`.

- **[v1.0.0 - 21/11/23]:** Added `VoyageTextEmbedder` and `VoyageDocument Embedder` to embed strings and documents.


## Installation

```bash
Expand All @@ -42,17 +46,18 @@ pip install voyage-embedders-haystack

You can use Voyage Embedding models with two components: [VoyageTextEmbedder](https://github.com/awinml/voyage-embedders-haystack/blob/main/src/voyage_embedders/voyage_text_embedder.py) and [VoyageDocumentEmbedder](https://github.com/awinml/voyage-embedders-haystack/blob/main/src/voyage_embedders/voyage_document_embedder.py).

To create semantic embeddings for documents, use `VoyageDocumentEmbedder` in your indexing pipeline. For generating embeddings for queries, use `VoyageTextEmbedder`. Once you've selected the suitable component for your specific use case, initialize the component with the model name and VoyageAI API key. You can also
set the environment variable "VOYAGE_API_KEY" instead of passing the api key as an argument.
To create semantic embeddings for documents, use `VoyageDocumentEmbedder` in your indexing pipeline. For generating embeddings for queries, use `VoyageTextEmbedder`.

Once you've selected the suitable component for your specific use case, initialize the component with the model name and VoyageAI API key. You can also
set the environment variable `VOYAGE_API_KEY` instead of passing the API key as an argument.

Information about the supported models, can be found on the [Embeddings Documentation.](https://docs.voyageai.com/embeddings/)

To get an API key, please see the [Voyage AI website.](https://www.voyageai.com/)


## Example

Below is the example Semantic Search pipeline that uses the [Simple Wikipedia](https://huggingface.co/datasets/pszemraj/simple_wikipedia) Dataset from HuggingFace. You can find more examples in the [`examples`](https://github.com/awinml/voyage-embedders-haystack/tree/main/examples) folder.
Below is the example Semantic Search pipeline that uses the [Simple Wikipedia](https://huggingface.co/datasets/pszemraj/simple_wikipedia) Dataset from HuggingFace. You can find more examples in the [`examples`](https://github.com/awinml/voyage-embedders-haystack/tree/main/examples) folder.

Load the dataset:

Expand All @@ -66,8 +71,7 @@ from haystack.dataclasses import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Import Voyage Embedders
from voyage_embedders.voyage_document_embedder import VoyageDocumentEmbedder
from voyage_embedders.voyage_text_embedder import VoyageTextEmbedder
from haystack_integrations.components.embedders.voyage_embedders import VoyageDocumentEmbedder, VoyageTextEmbedder

# Load first 100 rows of the Simple Wikipedia Dataset from HuggingFace
dataset = load_dataset("pszemraj/simple_wikipedia", split="validation[:100]")
Expand Down Expand Up @@ -101,7 +105,7 @@ text_embedder = VoyageTextEmbedder(model="voyage-2", input_type="query")
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
indexing_pipeline.add_component(instance=doc_writer, name="DocWriter")
indexing_pipeline.connect(sender="DocEmbedder", receiver="DocWriter")
indexing_pipeline.connect("DocEmbedder", "DocWriter")

indexing_pipeline.run({"DocEmbedder": {"documents": docs}})

Expand All @@ -111,6 +115,7 @@ print(f"Embedding of first Document: {doc_store.filter_documents()[0].embedding}
```

Query the Semantic Search Pipeline using the `InMemoryEmbeddingRetriever` and `VoyageTextEmbedder`:

```python
text_embedder = VoyageTextEmbedder(model="voyage-2", input_type="query")

Expand All @@ -120,7 +125,6 @@ query_pipeline.add_component(instance=text_embedder, name="TextEmbedder")
query_pipeline.add_component(instance=retriever, name="Retriever")
query_pipeline.connect("TextEmbedder.embedding", "Retriever.query_embedding")


# Search
results = query_pipeline.run({"TextEmbedder": {"text": "Which year did the Joker movie release?"}})

Expand Down
3 changes: 1 addition & 2 deletions examples/document_embedder_example.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,5 @@
from haystack.dataclasses import Document

from voyage_embedders.voyage_document_embedder import VoyageDocumentEmbedder
from haystack_integrations.components.embedders.voyage_embedders import VoyageDocumentEmbedder

# Text taken from PubMed QA Dataset (https://huggingface.co/datasets/pubmed_qa)
document_list = [
Expand Down
7 changes: 3 additions & 4 deletions examples/semantic_search_pipeline_example.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,8 +7,7 @@
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Import Voyage Embedders
from voyage_embedders.voyage_document_embedder import VoyageDocumentEmbedder
from voyage_embedders.voyage_text_embedder import VoyageTextEmbedder
from haystack_integrations.components.embedders.voyage_embedders import VoyageDocumentEmbedder, VoyageTextEmbedder

# Load first 100 rows of the Simple Wikipedia Dataset from HuggingFace
dataset = load_dataset("pszemraj/simple_wikipedia", split="validation[:100]")
Expand All @@ -32,20 +31,20 @@
model="voyage-2",
input_type="document",
)
text_embedder = VoyageTextEmbedder(model="voyage-2", input_type="query")

# Indexing Pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
indexing_pipeline.add_component(instance=doc_writer, name="DocWriter")
indexing_pipeline.connect(sender="DocEmbedder", receiver="DocWriter")
indexing_pipeline.connect("DocEmbedder", "DocWriter")

indexing_pipeline.run({"DocEmbedder": {"documents": docs}})

print(f"Number of documents in Document Store: {len(doc_store.filter_documents())}")
print(f"First Document: {doc_store.filter_documents()[0]}")
print(f"Embedding of first Document: {doc_store.filter_documents()[0].embedding}")

text_embedder = VoyageTextEmbedder(model="voyage-2", input_type="query")

# Query Pipeline
query_pipeline = Pipeline()
Expand Down
2 changes: 1 addition & 1 deletion examples/text_embedder_example.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
from voyage_embedders.voyage_text_embedder import VoyageTextEmbedder
from haystack_integrations.components.embedders.voyage_embedders import VoyageTextEmbedder

# Example text from the Amazon Reviews Polarity Dataset (https://huggingface.co/datasets/amazon_polarity)
text = (
Expand Down
44 changes: 26 additions & 18 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -36,10 +36,10 @@ Issues = "https://github.com/awinml/voyage-embedders-haystack/issues"
Source = "https://github.com/awinml/voyage-embedders-haystack"

[tool.hatch.build.targets.wheel]
packages = ["src/voyage_embedders"]
packages = ["src/haystack_integrations"]

[tool.hatch.version]
path = "src/voyage_embedders/__about__.py"
path = "src/haystack_integrations/components/embedders/voyage_embedders/__about__.py"

[tool.hatch.envs.default]
dependencies = ["coverage[toml]>=6.5", "coveralls", "pytest", "datasets"]
Expand All @@ -52,7 +52,6 @@ cov = ["test-cov", "cov-report"]
example-text-embedder = "python examples/text_embedder_example.py"
example-doc-embedder = "python examples/document_embedder_example.py"
example-semantic-search = "python examples/semantic_search_pipeline_example.py"

test-examples = [
"example-text-embedder",
"example-doc-embedder",
Expand All @@ -65,10 +64,11 @@ python = ["3.8", "3.9", "3.10", "3.11", "3.12"]
[tool.hatch.envs.lint]
detached = true
dependencies = ["black>=23.1.0", "mypy>=1.0.0", "ruff>=0.0.243"]

[tool.hatch.envs.lint.scripts]
typing = "mypy --install-types --non-interactive {args:src/voyage_embedders tests}"
style = ["ruff {args:.}", "black --check --diff {args:.}"]
fmt = ["black {args:.}", "ruff --fix {args:.}", "style"]
typing = "mypy --install-types --non-interactive --explicit-package-bases {args:src/ tests}"
style = ["ruff check {args:.}", "black --check --diff {args:.}"]
fmt = ["black {args:.}", "ruff check --fix {args:.}", "style"]
all = ["fmt", "typing"]

[tool.hatch.metadata]
Expand All @@ -82,7 +82,7 @@ skip-string-normalization = true
[tool.ruff]
target-version = "py37"
line-length = 120
select = [
lint.select = [
"A",
"ARG",
"B",
Expand All @@ -109,7 +109,7 @@ select = [
"W",
"YTT",
]
ignore = [
lint.ignore = [
# Allow non-abstract empty methods in abstract base classes
"B027",
# Allow boolean positional values in function calls, like `dict.get(... True)`
Expand All @@ -126,46 +126,54 @@ ignore = [
"PLR0915",
# Ignore print statements
"T201",
# Ignore function call in argument default - for secrets
"B008",
]
unfixable = [
lint.unfixable = [
# Don't touch unused imports
"F401",
]

[tool.ruff.isort]
[tool.ruff.lint.isort]
known-first-party = ["voyage_embedders"]

[tool.ruff.flake8-tidy-imports]
ban-relative-imports = "all"
[tool.ruff.lint.flake8-tidy-imports]
ban-relative-imports = "parents"

[tool.ruff.per-file-ignores]
[tool.ruff.lint.per-file-ignores]
# Tests can use magic values, assertions, and relative imports
"tests/**/*" = ["PLR2004", "S101", "TID252"]

[tool.coverage.run]
source_pkgs = ["voyage_embedders", "tests"]
source_pkgs = ["haystack_integrations", "tests"]
branch = true
parallel = true
omit = ["src/voyage_embedders/__about__.py", "example"]
omit = [
"src/haystack_integrations/components/embedders/voyage_embedders/__about__.py",
"example",
]

[tool.coverage.paths]
voyage_embedders = [
"src/voyage_embedders",
"*/voyage_embedders/src/voyage_embedders",
"src/haystack_integrations/components/embedders/voyage_embedders",
"*/voyage_embedders/src/haystack_integrations/components/embedders/voyage_embedders",
]
tests = ["tests", "*voyage_embedders/tests"]

[tool.coverage.report]
omit = ["*/__init__.py"]
show_missing = true
exclude_lines = ["no cov", "if __name__ == .__main__.:", "if TYPE_CHECKING:"]

[tool.pytest.ini_options]
minversion = "6.0"
addopts = "-vv"
markers = ["unit: unit tests", "integration: integration tests"]
log_cli = true

[tool.mypy]
ignore_missing_imports = true

[[tool.mypy.overrides]]
module = ["haystack.*", "pytest.*"]
module = ["haystack.*", "haystack_integrations.*", "pytest.*"]
ignore_missing_imports = true
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# SPDX-FileCopyrightText: 2023-present Ashwin Mathur <>
#
# SPDX-License-Identifier: Apache-2.0
__version__ = "1.2.0"
__version__ = "1.3.0"
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
# SPDX-FileCopyrightText: 2023-present Ashwin Mathur <>
#
# SPDX-License-Identifier: Apache-2.0

from haystack_integrations.components.embedders.voyage_embedders.voyage_document_embedder import VoyageDocumentEmbedder
from haystack_integrations.components.embedders.voyage_embedders.voyage_text_embedder import VoyageTextEmbedder

__all__ = ["VoyageDocumentEmbedder", "VoyageTextEmbedder"]
Loading

0 comments on commit 8428df6

Please sign in to comment.