
Weaviate error when updating embeddings #3390

Closed
1 task done
danielbichuetti opened this issue Oct 14, 2022 · 18 comments
danielbichuetti commented Oct 14, 2022

Describe the bug
After initializing Weaviate with dummy embeddings (just meta), and trying to update embeddings, it throws an error because of pagination limits.

Error message

---------------------------------------------------------------------------
WeaviateDocumentStoreError                Traceback (most recent call last)
Cell In [8], line 8
      5 wv_docstore = WeaviateDocumentStore(host='http://weaviate.weaviate.svc.cluster.local',port=80,index='news',embedding_dim=768)
      6 wv_retriever = EmbeddingRetriever(document_store=wv_docstore,embedding_model='sentence-transformers/paraphrase-multilingual-mpnet-base-v2',model_format='sentence_transformers',use_gpu=True)
----> 8 wv_docstore.update_embeddings(retriever=wv_retriever,batch_size=1000)

File /opt/conda/lib/python3.10/site-packages/haystack/document_stores/weaviate.py:1230, in WeaviateDocumentStore.update_embeddings(self, retriever, index, filters, update_existing_embeddings, batch_size)
   1224     raise RuntimeError(
   1225         "All the documents in Weaviate store have an embedding by default. Only update is allowed!"
   1226     )
   1228 result = self._get_all_documents_in_index(index=index, filters=filters, batch_size=batch_size)
-> 1230 for result_batch in get_batches_from_generator(result, batch_size):
   1231     document_batch = [
   1232         self._convert_weaviate_result_to_document(hit, return_embedding=False) for hit in result_batch
   1233     ]
   1234     embeddings = retriever.embed_documents(document_batch)

File /opt/conda/lib/python3.10/site-packages/haystack/document_stores/base.py:890, in get_batches_from_generator(iterable, n)
    886 """
    887 Batch elements of an iterable into fixed-length chunks or blocks.
    888 """
    889 it = iter(iterable)
--> 890 x = tuple(islice(it, n))
    891 while x:
    892     yield x

File /opt/conda/lib/python3.10/site-packages/haystack/document_stores/weaviate.py:752, in WeaviateDocumentStore._get_all_documents_in_index(self, index, filters, batch_size, only_documents_without_embedding)
    749     raise WeaviateDocumentStoreError(f"Weaviate raised an exception: {e}")
    751 if "errors" in result:
--> 752     raise WeaviateDocumentStoreError(f"Query results contain errors: {result['errors']}")
    754 # If `query.do` didn't raise and `result` doesn't contain errors,
    755 # we are good accessing data
    756 docs = result.get("data").get("Get").get(index)

WeaviateDocumentStoreError: Query results contain errors: [{'locations': [{'column': 6, 'line': 1}], 'message': 'explorer: list class: search: invalid pagination params: query maximum results exceeded', 'path': ['Get', 'News']}]

Expected behavior
It's expected that embeddings get updated correctly.

Additional context
This error happens because of the pagination limit (10000) that was introduced on release 1.8.0.

Some way to update embeddings despite this limit should probably be implemented.

UPDATE: The best solution currently appears to be this one:

You need to add an ordered arbitrary field… an integer going from 0..N… and then not use pagination but filter over this…
Yeah, it is annoying… 😕 but there is no other way for now.

The easy solution would be increasing the maximum query results, but this has a dramatic impact on Weaviate's memory consumption. Another quote:

If it works for you then OK, but be aware that there is a 10k limit to this approach (you can change it, but it will consume a lot of memory) and it will also get slower the bigger the offset you use… this might be your only way to do it if your data is already saved and you need to retrieve it. However, I wouldn't recommend it as a general solution. If you have more than 10k records, then you have a problem… you can either change the default configuration variable (and make sure that you have enough memory) or split the dataset by some other field for which you already know all the possible values.
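The quoted workaround can be sketched in a few lines. Everything here is an assumption drawn from the quote: a hypothetical integer field (called `seq_id` below) assigned 0..N-1 at write time, and Haystack-style comparison filters over it, so that no single query ever pages past the 10k limit:

```python
def seq_windows(total_docs, window_size):
    """Yield [start, end) ranges over the hypothetical seq_id field,
    each small enough to stay under the 10k pagination limit."""
    for start in range(0, total_docs, window_size):
        yield start, min(start + window_size, total_docs)

# Each window then becomes a filter instead of an offset, e.g.:
#
# for lo, hi in seq_windows(total_docs=25_000, window_size=5_000):
#     filters = {"seq_id": {"$gte": lo, "$lt": hi}}
#     wv_docstore.update_embeddings(retriever=wv_retriever,
#                                   filters=filters, batch_size=1_000)
```

The `update_embeddings` call does accept a `filters` argument (visible in the traceback above); the `seq_id` field itself is something you would have to add to your documents yourself.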

To Reproduce
First, load more than 10,000 documents into the Document Store:

from haystack.document_stores import WeaviateDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack.schema import Document

wv_docstore = WeaviateDocumentStore(host='http://weaviate.weaviate.svc.cluster.local',port=80,index='news',embedding_dim=768)
wv_retriever = EmbeddingRetriever(document_store=wv_docstore,embedding_model='sentence-transformers/paraphrase-multilingual-mpnet-base-v2',model_format='sentence_transformers',use_gpu=True)

wv_docstore.update_embeddings(retriever=wv_retriever,batch_size=1000)

FAQ Check

System:

  • OS: Ubuntu
  • GPU/CPU: T4
  • Haystack version (commit or version number): 1.9.1
  • DocumentStore: WeaviateDocumentStore
  • Retriever: EmbeddingRetriever
@anakin87
Member

Similar to #2898

@danielbichuetti
Contributor Author

@anakin87 The error message is similar, but unfortunately the cause here is the update logic combined with Weaviate's pagination limits.

@bobvanluijt
Contributor

@byronvoorbach / @dirkkul - can we help with this? Seems like a legacy issue which can be resolved now.

@etiennedi

Hi all. From what I understand, a Cursor API in Weaviate would be helpful here? Feel free to upvote the linked issue to indicate demand for it.
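For context, cursor-style iteration avoids offsets entirely by asking for the page after the last object seen, so cost does not grow with how deep you are into the collection. A toy sketch of the idea — the `fetch_page` store below is a stand-in for illustration, not the actual Weaviate API:

```python
def iterate_with_cursor(fetch_page, page_size=100):
    """Drain a collection by repeatedly requesting the page *after* the
    last seen id, instead of using ever-growing offsets."""
    cursor = None
    while True:
        page = fetch_page(after=cursor, limit=page_size)
        if not page:
            return
        yield from page
        cursor = page[-1]["id"]  # resume after the last object seen

# Toy backing store standing in for Weaviate (ids sorted, as a cursor
# API requires):
objects = [{"id": f"{i:04d}"} for i in range(250)]

def fetch_page(after, limit):
    start = 0 if after is None else next(
        i + 1 for i, o in enumerate(objects) if o["id"] == after
    )
    return objects[start : start + limit]
```

With this shape, each page costs the same regardless of position, which is why a cursor API would remove the 10k ceiling that offset pagination enforces.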

@danielbichuetti
Contributor Author

Thank you, @bobvanluijt and @etiennedi for the support! 😃

This would allow using Weaviate in Haystack production scenarios without complex workarounds.

@bobvanluijt
Contributor

You're very welcome @danielbichuetti 👍

@fvanlitsenburg

Just ran into this myself. Has anyone found a viable solution for this?

@hsm207
Contributor

hsm207 commented Jan 19, 2023

@fvanlitsenburg the long-term solution is for Weaviate to implement weaviate/weaviate#2302, which is already planned for 1.18. Then we can update the Haystack integration so that it retrieves all objects using the cursor API instead of offsets.

As a workaround until Weaviate 1.18 is released, you could consider setting up your Weaviate instance to use the text2vec-transformers module. That way, vectorization is done as each document is uploaded to Weaviate, so you do not need to call wv_docstore.update_embeddings()
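If you go the text2vec-transformers route, the class definition is what delegates vectorization to the module. A minimal sketch, assuming a News class with a single text property (the class and property names are illustrative, not from this thread's setup):

```python
# Hypothetical class definition delegating vectorization to the
# text2vec-transformers module, so vectors are computed at import time
# and no later update_embeddings() pass is needed.
news_class = {
    "class": "News",
    "vectorizer": "text2vec-transformers",
    "properties": [
        {"name": "content", "dataType": ["text"]},
    ],
}

# With a weaviate.Client instance this would be registered as:
# client.schema.create_class(news_class)
```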

@asanoop24

@hsm207 Even when using the text2vec-transformers module in the Weaviate instance, Haystack adds dummy embeddings while writing the documents to Weaviate, to avoid sending empty vectors. In that case, Weaviate doesn't update the embeddings using its own text2vec-transformers model. Is there a way to disable adding these dummy embeddings when writing documents?

@fvanlitsenburg

fvanlitsenburg commented Jan 21, 2023

@asanoop24 what worked for me was to embed the documents before writing to Weaviate:

def pre_embedder(docs):
    print('Running the pre-embedding')
    retriever = EmbeddingRetriever(
        document_store=split_document_store,
        embedding_model=embedding_model,
        model_format=model_format,
    )
    embeds = retriever.embed_documents(docs)
    for doc, emb in zip(docs, embeds):
        doc.embedding = emb
    return docs

@asanoop24

@fvanlitsenburg Works. Thats a good workaround. Thanks!

@Heucles

Heucles commented Jul 19, 2023

@asanoop24 what worked for me was to embed the documents before writing to Weaviate:

def pre_embedder(docs):
    print('Running the pre-embedding')
    retriever = EmbeddingRetriever(
        document_store=split_document_store,
        embedding_model=embedding_model,
        model_format=model_format,
    )
    embeds = retriever.embed_documents(docs)
    for doc, emb in zip(docs, embeds):
        doc.embedding = emb
    return docs

@fvanlitsenburg can you elaborate a little more on your workaround? In your scenario you haven't even written the documents yet, so you are not updating them; you are actually embedding before the first write, is that correct? Also, could you give me an example of the split_document_store parameter?

@fvanlitsenburg

fvanlitsenburg commented Jul 19, 2023

Hi @Heucles - not sure how relevant this still is for version 1.18 (perhaps this bug has been fixed), but:

  • Yes, I am embedding them before writing them
  • The split_document_store is just a standard Haystack document store, e.g.:
from haystack.document_stores import WeaviateDocumentStore
split_document_store = WeaviateDocumentStore(host='http://weaviate.weaviate.svc.cluster.local',port=80,index='news',embedding_dim=768)

@Heucles

Heucles commented Jul 19, 2023

Hey @fvanlitsenburg, thank you for the quick reply! I am still facing this issue when trying to update the documents as proposed in this tutorial: Weaviate + Haystack presented by Laura Ham (the Harry Potter example!). The issue happens when I try to update the embeddings.

And I am currently on 1.19.6

The issue was also reported here:
weaviate/weaviate-examples#56

and here: https://weaviate.slack.com/archives/C02RRQP23K3/p1689784391036609

Tks again and if you have any insight that you can provide me, it will be much appreciated.

@fvanlitsenburg

fvanlitsenburg commented Jul 20, 2023 via email

@Heucles

Heucles commented Jul 25, 2023

Hey @fvanlitsenburg, I have an update: I was able to get it to work by increasing the environment variable QUERY_MAXIMUM_RESULTS to 20k, since my dataset has approx. 13k documents, but that is not an efficient solution.
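For anyone else reaching for the same knob: QUERY_MAXIMUM_RESULTS is set as an environment variable on the Weaviate container itself, e.g. in a docker-compose file. The service layout below is illustrative; the default value is 10000, and raising it costs memory, as quoted earlier in this thread.

```yaml
services:
  weaviate:
    image: semitechnologies/weaviate:1.19.6
    environment:
      QUERY_MAXIMUM_RESULTS: 20000  # default is 10000; raising it costs memory
```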

I've also found these threads discussing the issue:
#2517
weaviate/weaviate#1947

And there is the same suggestion you made of pre-calculating the embeddings before uploading them...

I am currently trying to do that, but could you share your complete code? What I am not sure about is how to add cursor-like functionality to a dataset coming directly from a CSV file.

Can you help me out?

@fvanlitsenburg

Have you tried something like this? Essentially, it uses the Haystack DocumentStore to write to Weaviate, rather than Weaviate's client itself.


import weaviate
import pandas as pd

from haystack.document_stores import WeaviateDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack.schema import Document
from haystack.utils import clean_wiki_text

# Direct Weaviate client; only needed if your instance vectorizes through
# a module that requires an API key
client = weaviate.Client(
    url="http://localhost:8080/",
    additional_headers={
        'X-HuggingFace-Api-Key': 'xxxxxxx'
    }
)

harry = pd.read_csv("documents/harry_potter_wiki.csv")

docs = []
for ix, row in harry.iterrows():
    docs.append(Document(
        content=clean_wiki_text(row.text),
        meta={'name': row['name'], 'url': row.url},
    ))

# You may want to make sure this, in particular the host, aligns with your set-up
document_store = WeaviateDocumentStore(host='http://weaviate.weaviate.svc.cluster.local', port=80, index='harry_potter', embedding_dim=768)

def pre_embedder(docs):
    print('Running the pre-embedding')
    retriever = EmbeddingRetriever(
        document_store=document_store,
        embedding_model='sentence-transformers/multi-qa-mpnet-base-dot-v1',
        model_format='sentence_transformers',
    )
    embeds = retriever.embed_documents(docs)
    for doc, emb in zip(docs, embeds):
        doc.embedding = emb
    return docs

document_store.write_documents(pre_embedder(docs), index='harry_potter', duplicate_documents='overwrite')

@Heucles

Heucles commented Jul 25, 2023

Hey @fvanlitsenburg, thank you very much for the explanation and attention! I was able to get it to work by combining your suggestions and @zoltan-fedor's, thank you both very much! Here is the final code:

import pandas as pd

from haystack import Document
from haystack.nodes import EmbeddingRetriever
from haystack.document_stores import InMemoryDocumentStore, WeaviateDocumentStore
from haystack.utils import clean_wiki_text

def pre_embedder(docs, haystack_document_store, embedding_model, model_format):
    print('Running the pre-embedding')
    retriever = EmbeddingRetriever(
        document_store=haystack_document_store,
        embedding_model=embedding_model,
        model_format=model_format,
    )
    embeds = retriever.embed_documents(docs)
    for doc, emb in zip(docs, embeds):
        doc.embedding = emb
    return docs

harry = pd.read_csv("documents/harry_potter_wiki.csv")

docs = []
for ix, row in harry.iterrows():
    docs.append(Document(
        content=clean_wiki_text(row.text),
        meta={'name': row['name'], 'url': row.url},
    ))

print('Running the pre-embedding and writing the documents so there is no need to update')
weaviate_document_store = WeaviateDocumentStore()  # assumes Weaviate is running on http://localhost:8080
BATCH_SIZE = 500

batch = []
for doc in docs:
    batch.append(doc)
    if len(batch) == BATCH_SIZE:
        print('----------------------------------------------------------------------------')
        print(f'Batching and inserting {len(batch)} documents')
        print('----------------------------------------------------------------------------')
        batch = pre_embedder(
            docs=batch,
            haystack_document_store=InMemoryDocumentStore(),
            embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
            model_format="sentence_transformers",
        )
        weaviate_document_store.write_documents(batch, index='potterDocuments')
        batch = []

# Flush the final partial batch so the last documents are also written
if batch:
    batch = pre_embedder(
        docs=batch,
        haystack_document_store=InMemoryDocumentStore(),
        embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
        model_format="sentence_transformers",
    )
    weaviate_document_store.write_documents(batch, index='potterDocuments')


One thing I might add is that I needed to cast the objects into Haystack Document objects, but other than that it was fairly easy! Thank you very much again!
