
Weaviate error when updating embeddings #3390

Closed
1 task done
danielbichuetti opened this issue Oct 14, 2022 · 18 comments
danielbichuetti commented Oct 14, 2022

Describe the bug
After initializing Weaviate with dummy embeddings (just meta), and trying to update embeddings, it throws an error because of pagination limits.

Error message

---------------------------------------------------------------------------
WeaviateDocumentStoreError                Traceback (most recent call last)
Cell In [8], line 8
      5 wv_docstore = WeaviateDocumentStore(host='http://weaviate.weaviate.svc.cluster.local',port=80,index='news',embedding_dim=768)
      6 wv_retriever = EmbeddingRetriever(document_store=wv_docstore,embedding_model='sentence-transformers/paraphrase-multilingual-mpnet-base-v2',model_format='sentence_transformers',use_gpu=True)
----> 8 wv_docstore.update_embeddings(retriever=wv_retriever,batch_size=1000)

File /opt/conda/lib/python3.10/site-packages/haystack/document_stores/weaviate.py:1230, in WeaviateDocumentStore.update_embeddings(self, retriever, index, filters, update_existing_embeddings, batch_size)
   1224     raise RuntimeError(
   1225         "All the documents in Weaviate store have an embedding by default. Only update is allowed!"
   1226     )
   1228 result = self._get_all_documents_in_index(index=index, filters=filters, batch_size=batch_size)
-> 1230 for result_batch in get_batches_from_generator(result, batch_size):
   1231     document_batch = [
   1232         self._convert_weaviate_result_to_document(hit, return_embedding=False) for hit in result_batch
   1233     ]
   1234     embeddings = retriever.embed_documents(document_batch)

File /opt/conda/lib/python3.10/site-packages/haystack/document_stores/base.py:890, in get_batches_from_generator(iterable, n)
    886 """
    887 Batch elements of an iterable into fixed-length chunks or blocks.
    888 """
    889 it = iter(iterable)
--> 890 x = tuple(islice(it, n))
    891 while x:
    892     yield x

File /opt/conda/lib/python3.10/site-packages/haystack/document_stores/weaviate.py:752, in WeaviateDocumentStore._get_all_documents_in_index(self, index, filters, batch_size, only_documents_without_embedding)
    749     raise WeaviateDocumentStoreError(f"Weaviate raised an exception: {e}")
    751 if "errors" in result:
--> 752     raise WeaviateDocumentStoreError(f"Query results contain errors: {result['errors']}")
    754 # If `query.do` didn't raise and `result` doesn't contain errors,
    755 # we are good accessing data
    756 docs = result.get("data").get("Get").get(index)

WeaviateDocumentStoreError: Query results contain errors: [{'locations': [{'column': 6, 'line': 1}], 'message': 'explorer: list class: search: invalid pagination params: query maximum results exceeded', 'path': ['Get', 'News']}]

Expected behavior
It's expected that embeddings get updated correctly.

Additional context
This error happens because of the pagination limit (10000) that was introduced on release 1.8.0.

Some way to update embeddings despite this limit should probably be implemented.

UPDATE: The best solution currently appears to be this one:

You need to add an ordered arbitrary field… an integer going from 0..N… and then not use pagination but filter over this…
Yeah, it is annoying… 😕 but there is no other way for now.

The easy solution would be increasing the maximum query results, but this has a dramatic impact on Weaviate's memory consumption. Another quote:

If it works for you then OK, but be aware that there is a 10k limit to this approach (you can change it, but it will consume a lot of memory) and it will also get slower the bigger the offset you use… this might be your only way to do it if your data is already saved and you need to retrieve it. However, I wouldn't recommend it as a general solution. If you have more than 10k records, then you have a problem… you can either change the default configuration variable (and make sure that you have enough memory) or split the dataset by some other field for which you already know all the possible values.
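The quoted workaround can be sketched in a few lines. Everything here is an assumption drawn from the quote: a hypothetical integer field (called `seq_id` below) assigned 0..N-1 at write time, and Haystack-style comparison filters over it, so that no single query ever pages past the 10k limit:

```python
def seq_windows(total_docs, window_size):
    """Yield [start, end) ranges over the hypothetical seq_id field,
    each small enough to stay under the 10k pagination limit."""
    for start in range(0, total_docs, window_size):
        yield start, min(start + window_size, total_docs)

# Each window then becomes a filter instead of an offset, e.g.:
#
# for lo, hi in seq_windows(total_docs=25_000, window_size=5_000):
#     filters = {"seq_id": {"$gte": lo, "$lt": hi}}
#     wv_docstore.update_embeddings(retriever=wv_retriever,
#                                   filters=filters, batch_size=1_000)
```

The `update_embeddings` call does accept a `filters` argument (visible in the traceback above); the `seq_id` field itself is something you would have to add to your documents yourself.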

To Reproduce
First, load more than 10,000 documents into the Document Store:

from haystack.document_stores import WeaviateDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack.schema import Document

wv_docstore = WeaviateDocumentStore(host='http://weaviate.weaviate.svc.cluster.local',port=80,index='news',embedding_dim=768)
wv_retriever = EmbeddingRetriever(document_store=wv_docstore,embedding_model='sentence-transformers/paraphrase-multilingual-mpnet-base-v2',model_format='sentence_transformers',use_gpu=True)

wv_docstore.update_embeddings(retriever=wv_retriever,batch_size=1000)

FAQ Check

System:

  • OS: Ubuntu
  • GPU/CPU: T4
  • Haystack version (commit or version number): 1.9.1
  • DocumentStore: WeaviateDocumentStore
  • Retriever: EmbeddingRetriever
@anakin87
Member

Similar to #2898

@danielbichuetti
Contributor Author

@anakin87 The error message is similar, but unfortunately the cause here is the update logic combined with Weaviate's pagination limits.

@bobvanluijt
Contributor

@byronvoorbach / @dirkkul - can we help with this? Seems like a legacy issue which can be resolved now.

@etiennedi

Hi all. From what I understand, a Cursor API in Weaviate would be helpful here? Feel free to upvote the linked issue to indicate demand for it.
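For context, cursor-style iteration avoids offsets entirely by asking for the page after the last object seen, so cost does not grow with how deep you are into the collection. A toy sketch of the idea — the `fetch_page` store below is a stand-in for illustration, not the actual Weaviate API:

```python
def iterate_with_cursor(fetch_page, page_size=100):
    """Drain a collection by repeatedly requesting the page *after* the
    last seen id, instead of using ever-growing offsets."""
    cursor = None
    while True:
        page = fetch_page(after=cursor, limit=page_size)
        if not page:
            return
        yield from page
        cursor = page[-1]["id"]  # resume after the last object seen

# Toy backing store standing in for Weaviate (ids sorted, as a cursor
# API requires):
objects = [{"id": f"{i:04d}"} for i in range(250)]

def fetch_page(after, limit):
    start = 0 if after is None else next(
        i + 1 for i, o in enumerate(objects) if o["id"] == after
    )
    return objects[start : start + limit]
```

With this shape, each page costs the same regardless of position, which is why a cursor API would remove the 10k ceiling that offset pagination enforces.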

@danielbichuetti
Contributor Author

Thank you, @bobvanluijt and @etiennedi for the support! 😃

This would allow using Weaviate in Haystack production scenarios without complex workarounds.

@bobvanluijt
Contributor

You're very welcome @danielbichuetti 👍

@fvanlitsenburg

Just ran into this myself. Has anyone found a viable solution for this?

@hsm207
Contributor

hsm207 commented Jan 19, 2023

@fvanlitsenburg the long-term solution is for Weaviate to implement weaviate/weaviate#2302, which is already planned for 1.18. Then we can update the Haystack integration so that it retrieves all objects using the cursor API instead of offsets.

As a workaround until Weaviate 1.18 is released, you could consider setting up your Weaviate instance to use the text2vec-transformers module. That way, vectorization is done as each document is uploaded to Weaviate, so you do not need to call wv_docstore.update_embeddings()
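If you go the text2vec-transformers route, the class definition is what delegates vectorization to the module. A minimal sketch, assuming a News class with a single text property (the class and property names are illustrative, not from this thread's setup):

```python
# Hypothetical class definition delegating vectorization to the
# text2vec-transformers module, so vectors are computed at import time
# and no later update_embeddings() pass is needed.
news_class = {
    "class": "News",
    "vectorizer": "text2vec-transformers",
    "properties": [
        {"name": "content", "dataType": ["text"]},
    ],
}

# With a weaviate.Client instance this would be registered as:
# client.schema.create_class(news_class)
```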

@asanoop24

@hsm207 Even when using the text2vec-transformers module in the Weaviate instance, Haystack adds dummy embeddings while writing the documents to Weaviate, to avoid sending empty vectors. In that case, Weaviate doesn't update the embeddings using its own text2vec-transformers model. Is there a way to disable adding these dummy embeddings when writing documents?

@fvanlitsenburg

fvanlitsenburg commented Jan 21, 2023

@asanoop24 what worked for me was to embed the documents before writing to Weaviate:

def pre_embedder(docs):
    print('Running the pre-embedding')
    retriever = EmbeddingRetriever(
        document_store=split_document_store,
        embedding_model=embedding_model,
        model_format=model_format,
    )
    embeds = retriever.embed_documents(docs)
    for doc, emb in zip(docs, embeds):
        doc.embedding = emb
    return docs

@asanoop24

@fvanlitsenburg Works. Thats a good workaround. Thanks!

@Heucles

Heucles commented Jul 19, 2023

@asanoop24 what worked for me was to embed the documents before writing to Weaviate:

def pre_embedder(docs):
    print('Running the pre-embedding')
    retriever = EmbeddingRetriever(
        document_store=split_document_store,
        embedding_model=embedding_model,
        model_format=model_format,
    )
    embeds = retriever.embed_documents(docs)
    for doc, emb in zip(docs, embeds):
        doc.embedding = emb
    return docs

@fvanlitsenburg can you elaborate a little more on your workaround? In your scenario you haven't even written the documents yet, so you are not updating them; you are actually embedding before the first write, is that correct? Also, could you give me an example of the split_document_store parameter?

@fvanlitsenburg

fvanlitsenburg commented Jul 19, 2023

Hi @Heucles - not sure how relevant this still is for version 1.18 (perhaps this bug has been fixed), but:

  • Yes, I am embedding them before writing them
  • The split_document_store is just a standard Haystack document store, e.g.:
from haystack.document_stores import WeaviateDocumentStore
split_document_store = WeaviateDocumentStore(host='http://weaviate.weaviate.svc.cluster.local',port=80,index='news',embedding_dim=768)

@Heucles

Heucles commented Jul 19, 2023

Hey @fvanlitsenburg, thank you for the quick reply! I am still facing this issue when trying to update the documents as proposed in this tutorial: Weaviate + Haystack presented by Laura Ham (the Harry Potter example!). The issue happens when I try to update the embeddings.

And I am currently on 1.19.6

The issue was also reported here:
weaviate/weaviate-examples#56

and here: https://weaviate.slack.com/archives/C02RRQP23K3/p1689784391036609

Tks again and if you have any insight that you can provide me, it will be much appreciated.

@fvanlitsenburg

fvanlitsenburg commented Jul 20, 2023 via email

@Heucles

Heucles commented Jul 25, 2023

Hey @fvanlitsenburg, I have an update: I was able to get it to work by increasing the environment variable QUERY_MAXIMUM_RESULTS to 20k, since my dataset has approx. 13k documents, but that is not an efficient solution.
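For anyone else reaching for the same knob: QUERY_MAXIMUM_RESULTS is set as an environment variable on the Weaviate container itself, e.g. in a docker-compose file. The service layout below is illustrative; the default value is 10000, and raising it costs memory, as quoted earlier in this thread.

```yaml
services:
  weaviate:
    image: semitechnologies/weaviate:1.19.6
    environment:
      QUERY_MAXIMUM_RESULTS: 20000  # default is 10000; raising it costs memory
```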

I've also found these threads discussing the issue:
#2517
weaviate/weaviate#1947

And there is the same suggestion you made of pre-calculating the embeddings before uploading them...

I am currently trying to do that, but could you share your complete code? What I am not sure about is how to add cursor-like functionality to a dataset coming directly from a CSV file.

Can you help me out?

@fvanlitsenburg

Have you tried something like this? Essentially, it uses the Haystack DocumentStore to write to Weaviate, rather than Weaviate's client itself.


import weaviate
import pandas as pd

from haystack.document_stores import WeaviateDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack.schema import Document
from haystack.utils import clean_wiki_text

# Direct Weaviate client; only needed if your instance vectorizes through
# a module that requires an API key
client = weaviate.Client(
    url="http://localhost:8080/",
    additional_headers={
        'X-HuggingFace-Api-Key': 'xxxxxxx'
    }
)

harry = pd.read_csv("documents/harry_potter_wiki.csv")

docs = []
for ix, row in harry.iterrows():
    docs.append(Document(
        content=clean_wiki_text(row.text),
        meta={'name': row['name'], 'url': row.url},
    ))

# You may want to make sure this, in particular the host, aligns with your set-up
document_store = WeaviateDocumentStore(host='http://weaviate.weaviate.svc.cluster.local', port=80, index='harry_potter', embedding_dim=768)

def pre_embedder(docs):
    print('Running the pre-embedding')
    retriever = EmbeddingRetriever(
        document_store=document_store,
        embedding_model='sentence-transformers/multi-qa-mpnet-base-dot-v1',
        model_format='sentence_transformers',
    )
    embeds = retriever.embed_documents(docs)
    for doc, emb in zip(docs, embeds):
        doc.embedding = emb
    return docs

document_store.write_documents(pre_embedder(docs), index='harry_potter', duplicate_documents='overwrite')

@Heucles

Heucles commented Jul 25, 2023

Hey @fvanlitsenburg, thank you very much for the explanation and attention! I was able to get it to work by combining your suggestions and @zoltan-fedor's, thank you both very much! Here is the final code:

import pandas as pd

from haystack import Document
from haystack.nodes import EmbeddingRetriever
from haystack.document_stores import InMemoryDocumentStore, WeaviateDocumentStore
from haystack.utils import clean_wiki_text

def pre_embedder(docs, haystack_document_store, embedding_model, model_format):
    print('Running the pre-embedding')
    retriever = EmbeddingRetriever(
        document_store=haystack_document_store,
        embedding_model=embedding_model,
        model_format=model_format,
    )
    embeds = retriever.embed_documents(docs)
    for doc, emb in zip(docs, embeds):
        doc.embedding = emb
    return docs

harry = pd.read_csv("documents/harry_potter_wiki.csv")

docs = []
for ix, row in harry.iterrows():
    docs.append(Document(
        content=clean_wiki_text(row.text),
        meta={'name': row['name'], 'url': row.url},
    ))

print('Running the pre-embedding and writing the documents so there is no need to update')
weaviate_document_store = WeaviateDocumentStore()  # assumes Weaviate is running on http://localhost:8080
BATCH_SIZE = 500

batch = []
for doc in docs:
    batch.append(doc)
    if len(batch) == BATCH_SIZE:
        print('----------------------------------------------------------------------------')
        print(f'Batching and inserting {len(batch)} documents')
        print('----------------------------------------------------------------------------')
        batch = pre_embedder(
            docs=batch,
            haystack_document_store=InMemoryDocumentStore(),
            embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
            model_format="sentence_transformers",
        )
        weaviate_document_store.write_documents(batch, index='potterDocuments')
        batch = []

# Flush the final partial batch so the last documents are also written
if batch:
    batch = pre_embedder(
        docs=batch,
        haystack_document_store=InMemoryDocumentStore(),
        embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
        model_format="sentence_transformers",
    )
    weaviate_document_store.write_documents(batch, index='potterDocuments')


One thing I might add is that I needed to cast the objects into Haystack Document objects, but other than that it was fairly easy! Thank you very much again!
