Ollama document embedder #400

jmdevita · 2024-02-13T03:50:05Z

Added the Ollama Document Embedder and corresponding pytest tests. Referenced the existing Ollama Text Embedder and pre-existing Document Embedders to maintain parity.

Came from this issue

vblagoje · 2024-02-13T09:48:13Z

Seems like there are some linting issues. You can fix these via hatch lint with the --fix parameter. See the readme for more details. LMK if you need some help, and thanks for this contribution @jmdevita

CLAassistant · 2024-02-13T17:43:01Z

All committers have signed the CLA.

jmdevita · 2024-02-13T18:06:24Z

@vblagoje Just pushed a recent version where everything seems to be working now. For some reason one of my commits wasn't configured to my account, so it's labeled as CLA unsigned (even though it's still me). Amending that branch's commit didn't change anything, so let me know if there's something else I need to do.

Thanks!

vblagoje · 2024-02-14T09:16:51Z

@jmdevita It's unfortunate that this commit from one of your other accounts got in somehow, but you can fix that easily. In your local git repo, do an interactive git rebase and edit the commit with something like git commit --amend --author="Your Name <[email protected]>", inspect the git history, and force push back to the same jmdevita:ollama-document-embedder branch.

jmdevita · 2024-02-14T18:05:12Z

@vblagoje thanks for your help there. Everything should be good now

vblagoje · 2024-02-15T09:21:32Z

@jmdevita seems ok to me, have you played with this document embedder? Does it work ok in your particular use case?

jmdevita · 2024-02-15T15:21:02Z

@vblagoje Yup, I use it for my pipeline that runs daily.

I use the TikaDocumentConverter to process hundreds of files, then run the DocumentCleaner and DocumentSplitter on the resulting docs, and finally use the OllamaDocumentEmbedder and a writer to put them into my Qdrant Vector DB.

vblagoje

Let's 🚢 thanks for your contribution @jmdevita Keep them coming 👍

vblagoje · 2024-02-15T16:16:03Z

@dfokina can we please add a note in docs about adding Ollama Document Embedder, then we can tag and release a new ollama package

vblagoje · 2024-02-15T16:17:06Z

If you have time @jmdevita please help out with some notes so @dfokina can update ollama integration page

jmdevita · 2024-02-15T16:23:07Z

@vblagoje @dfokina Happy to help out with the docs/notes. I can write up a guide later today and send it over.

jmdevita · 2024-02-16T04:06:39Z

@dfokina Not sure where to send this, but the write-up is attached below:

OllamaDocumentEmbedder

OllamaDocumentEmbedder computes the embeddings of a list of Documents and stores the resulting vectors in the embedding field of each Document. It uses embedding models compatible with the Ollama Library. Note, however, that most of the pre-built models are not well suited to producing embeddings.

The vectors computed by this component are necessary to perform embedding retrieval on a collection of Documents. At retrieval time, the vector that represents the query is compared with those of the Documents to find the most similar or relevant Documents.
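To make the comparison step concrete, here is a minimal sketch of ranking documents against a query by cosine similarity. It is not Haystack's retriever code: the function names are made up for illustration, and the three-dimensional vectors are toy values rather than real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_embedding, documents):
    """Return (score, content) pairs sorted from most to least similar."""
    scored = [
        (cosine_similarity(query_embedding, embedding), content)
        for content, embedding in documents
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy (content, embedding) pairs, invented for this example.
toy_documents = [
    ("llamas", [0.9, 0.1, 0.0]),
    ("alpacas", [0.7, 0.3, 0.1]),
    ("tractors", [0.0, 0.2, 0.9]),
]
toy_query = [1.0, 0.0, 0.0]

ranking = rank_documents(toy_query, toy_documents)
for score, content in ranking:
    print(f"{content}: {score:.3f}")
```

In a real setup the document store performs this comparison, and the query vector has to come from the same model as the document vectors for the scores to be meaningful.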

Overview

OllamaDocumentEmbedder should be used to embed a list of Documents. To embed a single string, use the OllamaTextEmbedder instead. The component uses http://localhost:11434/api/embeddings as the default URL, since most available setups (macOS, Linux, Docker) default to port 11434.
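For orientation, this is roughly what a single request to that endpoint could look like using only the standard library. It is a sketch, not the component's internals, which are not shown in this thread: the helper names are hypothetical, and the request and response shapes (a JSON body with model and prompt, a JSON reply with an embedding list) are assumed from Ollama's embeddings API.

```python
import json
import urllib.request

DEFAULT_URL = "http://localhost:11434/api/embeddings"


def build_embedding_request(text, model="orca-mini", url=DEFAULT_URL):
    """Build (but do not send) a POST request for a single text."""
    payload = {"model": model, "prompt": text}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_embedding(text, model="orca-mini", url=DEFAULT_URL, timeout=120):
    """Send the request to a running Ollama server and return the vector."""
    request = build_embedding_request(text, model=model, url=url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["embedding"]


# Building the request needs no server; fetch_embedding() does.
request = build_embedding_request("No probllama!")
print(request.full_url, request.get_method())
```

Calling fetch_embedding("No probllama!") only works once an Ollama server is listening on the default port and the model has been pulled.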

Compatible Models

Unless specified otherwise when initializing this component, the default embedding model is "orca-mini". Other models can be found in the list of pre-built models. To load your own custom model, follow these instructions from Ollama.

Instructions

To start using this integration with Haystack, install the package with:
pip install ollama-haystack
Make sure that you have a running Ollama model (either in a Docker container or hosted locally). No other configuration is necessary, as Ollama has the embedding API built in.

Embedding Metadata

Most embedded metadata contains information about the model name and type. You can also pass optional arguments to the Ollama generation endpoint, such as temperature and top_p.

The model used will automatically be appended as part of the document metadata. An example payload using the orca-mini model will look like:
{'meta': {'model': 'orca-mini'}}
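The sketch below shows one way the optional arguments and the model metadata could fit together. It is a simplified stand-in, not the component's implementation: build_payload, embed_all, and fake_embed are hypothetical names, the "options" key is assumed from Ollama's API, and the fake embed function exists only so the example runs without a server.

```python
def build_payload(text, model="orca-mini", options=None):
    """Request body for one text; options such as temperature are optional."""
    payload = {"model": model, "prompt": text}
    if options:
        payload["options"] = dict(options)
    return payload


def embed_all(texts, embed_fn, model="orca-mini", options=None):
    """Embed each text and report the model used alongside the vectors."""
    embeddings = [embed_fn(build_payload(text, model, options)) for text in texts]
    return {"embeddings": embeddings, "meta": {"model": model}}


def fake_embed(payload):
    """Stand-in for a real model call: a one-dimensional 'embedding'."""
    return [float(len(payload["prompt"]))]


result = embed_all(["a", "bcd"], fake_embed, options={"temperature": 0.0})
print(result["meta"])
```

Swapping fake_embed for a function that actually posts the payload to a running Ollama server would turn this into a working, if minimal, embedder.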

Usage

On its own:

from haystack import Document
from haystack_integrations.components.embedders.ollama import OllamaDocumentEmbedder

doc = Document(content="What do llamas say once you have thanked them? No probllama!")
document_embedder = OllamaDocumentEmbedder()

result = document_embedder.run([doc])
print(result['documents'][0].embedding)

# Calculating embeddings: 100%|██████████| 1/1 [00:02<00:00, 2.82s/it]
# [-0.16412407159805298, -3.8359334468841553, ... ]

In a Pipeline

from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.components.embedders.ollama import OllamaDocumentEmbedder

document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")

embedder = OllamaDocumentEmbedder(model="orca-mini", url="http://localhost:11434/api/embeddings")  # These are the default model and URL

cleaner = DocumentCleaner()
splitter = DocumentSplitter()
file_converter = PyPDFToDocument()
writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE)

indexing_pipeline = Pipeline()

# Add components to pipeline
indexing_pipeline.add_component("embedder", embedder)
indexing_pipeline.add_component("converter", file_converter)
indexing_pipeline.add_component("cleaner", cleaner)
indexing_pipeline.add_component("splitter", splitter)
indexing_pipeline.add_component("writer", writer)

# Connect components in pipeline
indexing_pipeline.connect("converter", "cleaner")
indexing_pipeline.connect("cleaner", "splitter")
indexing_pipeline.connect("splitter", "embedder")
indexing_pipeline.connect("embedder", "writer")

# Run Pipeline
indexing_pipeline.run({"converter": {"sources": ["files/test_pdf_data.pdf"]}})

# Calculating embeddings: 100%|██████████| 115/115
# {'embedder': {'meta': {'model': 'orca-mini'}},  'writer': {'documents_written': 115}}

vblagoje · 2024-02-16T09:08:09Z

Wow, thank you @jmdevita we'll take it from here. Much much appreciated and looking forward to your next contribution. Keep them coming!

jmdevita added 2 commits February 12, 2024 22:09

Added ollama document embedder and tests

d210ef6

Cleaning of non-used variables and batch restrictions

b0d35b5

jmdevita requested a review from a team as a code owner February 13, 2024 03:50

jmdevita requested review from vblagoje and removed request for a team February 13, 2024 03:50

github-actions bot added type:documentation Improvements or additions to documentation integration:ollama labels Feb 13, 2024

Fixed issue with test_document_embedder.py import_text_in_embedder te…

27c0a38

…st, test was incorrect

jmdevita mentioned this pull request Feb 13, 2024

Ollama: support Embedders #190

Closed

jmdevita and others added 5 commits February 14, 2024 12:53

Fixed lint issues and tests

de0038a

chore: Exculde evaluator private classes in API docs (deepset-ai#392)

eecaa10

rename astraretriever (deepset-ai#399)

5f88391

rename retriever (deepset-ai#407)

e1df049

test patch- documents embedding wasn't working as expected

912fca5

jmdevita force-pushed the ollama-document-embedder branch from 157af9d to 912fca5 Compare February 14, 2024 17:54

github-actions bot added integration:chroma integration:astra integration:uptrain integration:deepeval labels Feb 14, 2024

Merge branch 'main' into ollama-document-embedder

8033d37

vblagoje approved these changes Feb 15, 2024

vblagoje merged commit e486ff0 into deepset-ai:main Feb 15, 2024
7 checks passed

jmdevita deleted the ollama-document-embedder branch February 16, 2024 03:34
