Ollama document embedder #400

Merged: 9 commits, Feb 15, 2024

Conversation

jmdevita
Contributor

Added Ollama Document Embedder and correlated pytests. Referenced existing Ollama Text Embedder and pre-existing Document Embedders to maintain parity.

Came from this issue

@jmdevita jmdevita requested a review from a team as a code owner February 13, 2024 03:50
@jmdevita jmdevita requested review from vblagoje and removed request for a team February 13, 2024 03:50
@github-actions github-actions bot added type:documentation Improvements or additions to documentation integration:ollama labels Feb 13, 2024
@vblagoje
Member

vblagoje commented Feb 13, 2024

Seems like there are some linting issues. You can fix these via hatch lint with --fix parameter. See readme for more details. LMK if you need some help and thanks for this contribution @jmdevita

@CLAassistant

CLAassistant commented Feb 13, 2024

CLA assistant check
All committers have signed the CLA.

@jmdevita
Contributor Author

jmdevita commented Feb 13, 2024

@vblagoje Just pushed a recent version where everything seems to be working now. For some reason one of my commits wasn't configured to my account, so it's labeled as CLA unsigned (even though it's still me). Amending that branch's commit didn't change anything, so let me know if there's something else I need to do.

Thanks!

@vblagoje
Member

@jmdevita It's unfortunate that this commit from one of your other accounts got in somehow, but you can fix that easily. In your local git repo, do an interactive git rebase and edit the commit with something like git commit --amend --author="Your Name <[email protected]>", inspect the git history, and force push back to the same jmdevita:ollama-document-embedder branch.

@jmdevita
Contributor Author

@vblagoje thanks for your help there. Everything should be good now

@vblagoje
Member

@jmdevita seems ok to me, have you played with this document embedder? Does it work ok in your particular use case?

@jmdevita
Contributor Author

@vblagoje Yup, I use it for my pipeline that runs daily.

I use the TikaDocumentConverter that processes hundreds of files, then I use the DocumentCleaner & DocumentSplitter in the docs and use the OllamaDocumentEmbedder and writer to put into my Qdrant Vector DB.

Member

@vblagoje vblagoje left a comment

Let's 🚢 thanks for your contribution @jmdevita Keep them coming 👍

@vblagoje vblagoje merged commit e486ff0 into deepset-ai:main Feb 15, 2024
7 checks passed
@vblagoje
Member

@dfokina can we please add a note in docs about adding Ollama Document Embedder, then we can tag and release a new ollama package

@vblagoje
Member

If you have time @jmdevita please help out with some notes so @dfokina can update ollama integration page

@jmdevita
Contributor Author

@vblagoje @dfokina Happy to help out with the docs/notes- I can write up a guide later today and send it over

@jmdevita jmdevita deleted the ollama-document-embedder branch February 16, 2024 03:34
@jmdevita
Contributor Author

@dfokina Not sure where to send this, but attached below is a write up:


OllamaDocumentEmbedder

OllamaDocumentEmbedder computes the embeddings of a list of Documents and stores the resulting vectors in the embedding field of each Document. It uses embedding models compatible with the Ollama Library. Note, however, that most of the pre-built models are not well suited to producing embeddings.

The vectors computed by this component are necessary to perform embedding retrieval on a collection of Documents. At retrieval time, the vector that represents the query is compared with those of the Documents to find the most similar or relevant Documents.
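
To make the comparison step concrete, here is a minimal sketch of ranking Documents by cosine similarity to a query vector. It is only an illustration in plain Python: the helper names (cosine_similarity, rank_by_similarity) and the toy 3-dimensional vectors are made up for this example, and real document stores may use other similarity functions or optimized implementations.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as "not similar to anything"
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, v) for v in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for real embeddings (hypothetical values).
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
ranking = rank_by_similarity(query, docs)
print(ranking)  # indices of docs, most similar first
```

In practice the query vector would come from OllamaTextEmbedder and the document vectors from OllamaDocumentEmbedder, using the same model so the vectors live in the same space.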

Overview

OllamaDocumentEmbedder should be used to embed a list of Documents. To embed a single string, use the OllamaTextEmbedder instead. The component uses http://localhost:11434/api/embeddings as the default URL, since most setups (Mac, Linux, Docker) serve Ollama on port 11434 by default.
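
For readers curious what a call to that URL can look like, below is a rough sketch using only the Python standard library. It assumes the Ollama embeddings endpoint accepts a JSON body with "model" and "prompt" fields and answers with an "embedding" list. The function names are hypothetical, and this is not necessarily how the component issues its requests internally.

```python
import json
import urllib.request

DEFAULT_URL = "http://localhost:11434/api/embeddings"  # default quoted above


def build_embedding_request(text, model="orca-mini", url=DEFAULT_URL):
    """Prepare (but do not send) a POST request for one piece of text."""
    payload = {"model": model, "prompt": text}  # assumed request shape
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_embedding(text, model="orca-mini", url=DEFAULT_URL, timeout=120):
    """Send the request and return the vector; needs a running Ollama server."""
    request = build_embedding_request(text, model=model, url=url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["embedding"]  # assumed response shape
```

Calling fetch_embedding("some text") only works while an Ollama server is listening at the given URL; build_embedding_request can be inspected without one.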

Compatible Models

Unless specified otherwise when initializing this component, the default embedding model is "orca-mini". Other available models are listed among Ollama's pre-built models. To load your own custom model, follow the instructions from Ollama.

Instructions

To start using this integration with Haystack, install the package with:
pip install ollama-haystack
Make sure that you have a running Ollama model (either in a Docker container or hosted locally). No other configuration is necessary, as Ollama has the embedding API built in.
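
One way to check that a server is running before indexing is a quick reachability probe. The sketch below is an assumption-laden convenience, not part of the integration: the function name ollama_is_reachable is hypothetical, and it assumes the server's base URL answers a plain GET with HTTP 200 when Ollama is up.

```python
import urllib.request


def ollama_is_reachable(base_url="http://localhost:11434", timeout=2.0):
    """Return True if something answers HTTP 200 at base_url, else False."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error responses.
        return False
```

A pipeline script could call ollama_is_reachable() first and exit with a clear message instead of failing midway through embedding.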

Embedding Metadata

The embedding metadata mostly contains information about the model name and type. You can also pass optional arguments, such as temperature and top_p, to the Ollama generation endpoint.

The model used will automatically be appended as part of the document metadata. An example payload using the orca-mini model will look like:
{'meta': {'model': 'orca-mini'}}

Usage

On its own:

from haystack import Document
from haystack_integrations.components.embedders.ollama import OllamaDocumentEmbedder

doc = Document(content="What do llamas say once you have thanked them? No probllama!")
document_embedder = OllamaDocumentEmbedder()

result = document_embedder.run([doc])
print(result['documents'][0].embedding)

# Calculating embeddings: 100%|██████████| 1/1 [00:02<00:00, 2.82s/it]

# [-0.16412407159805298, -3.8359334468841553, ... ]

In a Pipeline

from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.components.embedders.ollama import OllamaDocumentEmbedder

document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")

embedder = OllamaDocumentEmbedder(model="orca-mini", url="http://localhost:11434/api/embeddings")  # These are the default model and URL

cleaner = DocumentCleaner()
splitter = DocumentSplitter()
file_converter = PyPDFToDocument()
writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE)

indexing_pipeline = Pipeline()

# Add components to pipeline
indexing_pipeline.add_component("embedder", embedder)
indexing_pipeline.add_component("converter", file_converter)
indexing_pipeline.add_component("cleaner", cleaner)
indexing_pipeline.add_component("splitter", splitter)
indexing_pipeline.add_component("writer", writer)

# Connect components in pipeline
indexing_pipeline.connect("converter", "cleaner")
indexing_pipeline.connect("cleaner", "splitter")
indexing_pipeline.connect("splitter", "embedder")
indexing_pipeline.connect("embedder", "writer")

# Run Pipeline
indexing_pipeline.run({"converter": {"sources": ["files/test_pdf_data.pdf"]}})

# Calculating embeddings: 100%|██████████| 115/115
# {'embedder': {'meta': {'model': 'orca-mini'}},  'writer': {'documents_written': 115}}

@vblagoje
Member

Wow, thank you @jmdevita we'll take it from here. Much much appreciated and looking forward to your next contribution. Keep them coming!
