Weaviate error when updating embeddings #3390
Comments
Similar to #2898 |
@anakin87 The error message is similar, but unfortunately this error is caused by the logic used and by the pagination limits in Weaviate. |
@byronvoorbach / @dirkkul - can we help with this? Seems like a legacy issue which can be resolved now. |
Hi all. From what I understand, a Cursor API in Weaviate would be helpful here? Feel free to upvote the linked issue to indicate demand for it. |
Thank you, @bobvanluijt and @etiennedi, for the support! 😃 This would allow using Weaviate in Haystack production scenarios without complex solutions. |
You're very welcome @danielbichuetti 👍 |
Just ran into this myself. Has anyone found a viable solution for this? |
@fvanlitsenburg the long-term solution is for Weaviate to implement weaviate/weaviate#2302, which is already planned for 1.18. Then we can update the Haystack integration so that it retrieves all objects using the cursor API instead of offsets. As a workaround until Weaviate 1.18 is released, you could consider setting up your Weaviate instance to use the text2vec-transformers module. That way, vectorization is done as a document is uploaded to Weaviate, so you do not need to call |
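To illustrate the difference between offset and cursor pagination discussed above, here is a minimal, self-contained sketch. It does not use the Weaviate client or the Haystack integration: `FakeStore`, `get_by_offset`, `get_after`, and `iter_with_cursor` are hypothetical names invented for this example, and the limit of 10000 is taken from the issue description. It only shows why offset-based retrieval stops at the maximum query results while a cursor (the last seen id) can walk the whole collection.

```python
from typing import Dict, Iterator, List, Optional

MAX_RESULTS = 10_000  # pagination limit mentioned in this issue


class FakeStore:
    """Hypothetical in-memory stand-in for a vector database (not Weaviate's API)."""

    def __init__(self, n_objects: int) -> None:
        # Zero-padded ids keep lexicographic order equal to insertion order.
        self.objects: List[Dict[str, str]] = [
            {"id": f"{i:08d}"} for i in range(n_objects)
        ]

    def get_by_offset(self, offset: int, limit: int) -> List[Dict[str, str]]:
        # Offset pagination: refuses any window that reaches past the limit.
        if offset + limit > MAX_RESULTS:
            raise ValueError("offset + limit exceeds the maximum query results")
        return self.objects[offset:offset + limit]

    def get_after(self, after: Optional[str], limit: int) -> List[Dict[str, str]]:
        # Cursor pagination: return up to `limit` objects following id `after`.
        start = 0 if after is None else int(after) + 1
        return self.objects[start:start + limit]


def iter_with_cursor(store: FakeStore, page_size: int = 1_000) -> Iterator[Dict[str, str]]:
    """Walk every object by passing the last seen id as the cursor."""
    after: Optional[str] = None
    while True:
        page = store.get_after(after, page_size)
        if not page:
            return
        yield from page
        after = page[-1]["id"]
```

With a store of 13000 objects, `iter_with_cursor` visits all of them, whereas `get_by_offset(10_000, 1_000)` raises, which mirrors the failure mode reported in this issue.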
@hsm207 Even when using the text2vec-transformers module in a Weaviate instance, Haystack adds dummy embeddings while writing the documents to Weaviate to avoid sending empty vectors. In that case, Weaviate doesn't update the embeddings using its own text2vec-transformers model. Is there a way to disable adding these dummy embeddings while writing documents? |
@asanoop24 what worked for me was to embed the documents before writing to Weaviate:
|
@fvanlitsenburg Works. That's a good workaround. Thanks! |
@fvanlitsenburg can you elaborate a little bit more on your workaround? In your scenario you haven't even written the documents, so you are not updating them, you actually are embedding before the first write, is that correct? And also can you provide me with an example of the parameter for split_document_store? |
Hi @Heucles - not sure how relevant this still is for version 1.18 (perhaps this bug has been fixed), but:
|
Hey @fvanlitsenburg, thank you for the quick reply! I am still facing this issue when trying to update the documents as proposed in this tutorial: Weaviate + Haystack presented by Laura Ham (Harry Potter example!) <https://www.youtube.com/watch?v=BkozaOnZpJI>. The issue happens when I try to update the embeddings, and I am currently on 1.19.6. The issue was also reported here: weaviate/weaviate-examples#56 and here: https://weaviate.slack.com/archives/C02RRQP23K3/p1689784391036609 Thanks again, and if you have any insight that you can provide, it will be much appreciated. |
Hey Heucles,
Not sure how much help I can be here. Have you tried the method we suggest
here in the Github issue, i.e. add the embeddings before writing? The error
message in the issue you linked would suggest there's a different issue
here, potentially.
It's quite hard for me to give a super helpful answer here, because you're
working with different versions of Haystack (1.19) and possibly Weaviate...
…On Wed, Jul 19, 2023 at 7:44 PM Heucles ***@***.***> wrote:
Hey @fvanlitsenburg <https://github.com/fvanlitsenburg> thank you for the
quick reply! I am still facing this issue when trying to update the
documents as it is proposed in this tutorial: Weaviate + Haystack
presented by Laura Ham (Harry Potter example!)
<https://www.youtube.com/watch?v=BkozaOnZpJI>, the issue happens when I
try to update the embeddings.
And I am currently on 1.19.6
The issue was also reported here:
weaviate/weaviate-examples#56
<weaviate/weaviate-examples#56>
|
Hey @fvanlitsenburg, I have an update: I was able to get it to work by increasing the environment variable QUERY_MAXIMUM_RESULTS to 20k, since my dataset has approx. 13k, but that is not an efficient solution. I've also found these threads discussing the issue: And there is the same suggestion you made of pre-calculating the embeddings before uploading them... I am currently trying to do that, but could you provide me with your complete code for it? I am not sure how to give cursor functionality to the dataset, which comes directly from a CSV file. Can you help me out? |
Have you tried something like this? Essentially, using the Haystack DocumentStore to write to Weaviate, rather than Weaviate's client itself.
|
Hey @fvanlitsenburg, thank you very much for the explanation and attention. I was able to get it to work by combining your suggestions and @zoltan-fedor's. I thank you both very much! Here is the final code:

```python
import pandas as pd
from haystack import Document
from haystack.document_stores import InMemoryDocumentStore, WeaviateDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack.utils import clean_wiki_text

BATCH_SIZE = 500


def pre_embedder(docs, haystack_document_store, embedding_model, model_format):
    # Compute the embeddings up front and attach them to the documents.
    print('Running the pre-embedding')
    retriever = EmbeddingRetriever(
        document_store=haystack_document_store,
        embedding_model=embedding_model,
        model_format=model_format
    )
    embeds = retriever.embed_documents(docs)
    for doc, emb in zip(docs, embeds):
        doc.embedding = emb
    return docs


# Cast every CSV row into a Haystack Document.
harry = pd.read_csv("documents/harry_potter_wiki.csv")
dicts = []
for ix, row in harry.iterrows():
    dic = Document(
        content=clean_wiki_text(row.text),
        meta={'name': row['name'],
              'url': row.url})
    dicts.append(dic)

print('Running the pre-embedding and writing the documents so there is no need to update')
weaviate_document_store = WeaviateDocumentStore()  # assumes Weaviate is running on http://localhost:8080

# Slice by BATCH_SIZE so that the last, partial batch is written as well.
for start in range(0, len(dicts), BATCH_SIZE):
    dict_embeddings = dicts[start:start + BATCH_SIZE]
    print('----------------------------------------------------------------------------')
    print(f'Batching and inserting from: {start} till {start + len(dict_embeddings)}')
    print('----------------------------------------------------------------------------')
    dict_embeddings = pre_embedder(
        docs=dict_embeddings,
        haystack_document_store=InMemoryDocumentStore(),  # fresh in-memory store per batch
        model_format="sentence_transformers",
        embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
    )
    weaviate_document_store.write_documents(dict_embeddings, index='potterDocuments')
```

One thing I might add is that I needed to cast the objects into Haystack Documents, but other than that it was fairly easy! Thank you very much again! |
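The batching step can also be looked at in isolation. This is a minimal sketch, assuming a hypothetical `batched` helper that is not part of Haystack or Weaviate. It slices the input by a fixed step, so the final partial batch is yielded too and no documents are left unwritten when the total is not a multiple of the batch size.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches of at most `batch_size` items (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        # The last slice is simply shorter when len(items) is not a multiple.
        yield list(items[start:start + batch_size])
```

Each yielded batch could then be embedded and written to the document store before the next one is loaded, which keeps memory use bounded by the batch size.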
Describe the bug
After initializing Weaviate with dummy embeddings (just meta), and trying to update embeddings, it throws an error because of pagination limits.
Error message
Expected behavior
It's expected that embeddings get updated correctly.
Additional context
This error happens because of the pagination limit (10000) that was introduced in release 1.8.0.
There should probably be a way to update embeddings despite this limit.
UPDATE: The best solution currently appears to be this one:
The easy solution would be increasing the maximum query results, but this has a dramatic impact on Weaviate's memory consumption. Another quote:
To Reproduce
First, load more than 10000 documents into the Document Store
FAQ Check
System: