Save and load local QdrantDocumentstore #279

Closed
NILICK opened this issue Jan 27, 2024 · 6 comments
Labels
feature request Ideas to improve an integration

Comments

@NILICK

NILICK commented Jan 27, 2024

I created a QdrantDocumentStore with the code below and saved the database to a local directory.

from haystack.document_stores.types import DuplicatePolicy
from haystack import Document
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder

from haystack_integrations.components.retrievers.qdrant import QdrantEmbeddingRetriever
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

# Save the DocumentStore locally
document_store = QdrantDocumentStore(
    path="~/mydatabase/Haystack/db",
    index="Document",
    embedding_dim=768,
    recreate_index=True,
    hnsw_config={"m": 16, "ef_construct": 64}  # Optional
)


documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]

document_embedder = SentenceTransformersDocumentEmbedder()  
document_embedder.warm_up()
documents_with_embeddings = document_embedder.run(documents)

document_store.write_documents(documents_with_embeddings.get("documents"), policy=DuplicatePolicy.OVERWRITE)

query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder())
query_pipeline.add_component("retriever", QdrantEmbeddingRetriever(document_store=document_store))
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

query = "How many languages are there?"

result = query_pipeline.run({"text_embedder": {"text": query}})

print(result['retriever']['documents'][0])

But when I try to use the saved local database in another project, it raises ValueError: document_store must be an instance of QdrantDocumentStore.
I load the saved QdrantDocumentStore using:

from qdrant_client import QdrantClient
# Load the DocumentStore from the local path
path="~/mydatabase/Haystack/db"
document_store = QdrantClient(path=path) 

How can I load it correctly?

@NILICK added the "feature request" (Ideas to improve an integration) label Jan 27, 2024
@masci changed the title from "Sava and load local QdrantDocumentstore" to "Save and load local QdrantDocumentstore" Feb 5, 2024
@anakin87
Member

anakin87 commented Feb 6, 2024

Hey!

I tried to replicate your example and was able to successfully reload the QdrantDocumentStore using

document_store = QdrantDocumentStore(
    path="~/mydatabase/Haystack/db",
    index="Document",
    embedding_dim=768,
)

In your case, I would not directly use qdrant_client, which is abstracted by the Document Store.
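The ValueError reported above is consistent with the retriever checking the type of the object it receives. The following is a minimal sketch of that idea using hypothetical stand-in classes (`FakeDocumentStore`, `FakeClient`, `FakeRetriever`); it is not the Haystack or qdrant_client implementation, and only the error message is taken from the report.

```python
class FakeDocumentStore:
    """Hypothetical stand-in for the document store wrapper."""

    def __init__(self, path: str, index: str = "Document", embedding_dim: int = 768):
        self.path = path
        self.index = index
        self.embedding_dim = embedding_dim


class FakeClient:
    """Hypothetical stand-in for the low-level client the wrapper abstracts."""

    def __init__(self, path: str):
        self.path = path


class FakeRetriever:
    """Hypothetical retriever that only accepts the wrapper type."""

    def __init__(self, document_store):
        # Assumption: the real retriever performs a comparable type check.
        if not isinstance(document_store, FakeDocumentStore):
            raise ValueError("document_store must be an instance of QdrantDocumentStore")
        self.document_store = document_store


def try_build(store) -> str:
    """Return 'ok' if the retriever accepts the store, else the error text."""
    try:
        FakeRetriever(document_store=store)
        return "ok"
    except ValueError as err:
        return f"ValueError: {err}"


raw_client_result = try_build(FakeClient(path="db"))
wrapper_result = try_build(FakeDocumentStore(path="db"))
print(raw_client_result)
print(wrapper_result)
```

In this sketch the low-level client is rejected even though it points at the same directory, while the wrapper is accepted. That matches the advice to reload the data through QdrantDocumentStore.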

I'm closing this issue. Feel free to reopen it if you still encounter problems.
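One detail that may be worth checking when reusing the store from another project is the "~" in the path. Python does not expand "~" on its own, and whether the document store or the underlying client expands it is an assumption not confirmed in this thread. If it is not expanded, the directory would be created relative to the current working directory, which can differ between projects. A small sketch of expanding the path explicitly before passing it on:

```python
import os

raw_path = "~/mydatabase/Haystack/db"

# Without expansion, "~" is treated as an ordinary directory name
# under the current working directory.
unexpanded = os.path.abspath(raw_path)

# expanduser replaces a leading "~" with the user's home directory
# (it returns the path unchanged if no home directory can be determined).
expanded = os.path.expanduser(raw_path)

print(unexpanded)
print(expanded)
```

Passing the expanded value as `path` would make both projects point at the same directory regardless of where each script is started.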

@anakin87 anakin87 closed this as completed Feb 6, 2024
@NILICK
Author

NILICK commented Feb 6, 2024

Hey! Thanks for your reply. I tried your suggestion, but it returns the error below:

TypeError                                 Traceback (most recent call last)
File <timed exec>:13

File ~/micromamba/envs/hstack/lib/python3.10/site-packages/haystack/document_stores/base.py:185, in BaseDocumentStore.__next__(self)
    183     raise StopIteration
    184 curr_id = self.ids_iterator[0]
--> 185 ret = self.get_document_by_id(curr_id)
    186 self.ids_iterator = self.ids_iterator[1:]
    187 return ret

File ~/micromamba/envs/hstack/lib/python3.10/site-packages/qdrant_haystack/document_stores/qdrant.py:179, in QdrantDocumentStore.get_document_by_id(self, id, index, headers)
    173 def get_document_by_id(
    174     self,
    175     id: str,
    176     index: Optional[str] = None,
    177     headers: Optional[Dict[str, str]] = None,
    178 ) -> Optional[Document]:
--> 179     documents = self.get_documents_by_id([id], index, headers)
    180     if 0 == len(documents):
    181         return None

File ~/micromamba/envs/hstack/lib/python3.10/site-packages/qdrant_haystack/document_stores/qdrant.py:199, in QdrantDocumentStore.get_documents_by_id(self, ids, index, batch_size, headers)
    197 scroll_filter = self.qdrant_filter_converter.convert(None, ids)
    198 while not stop_scrolling:
--> 199     records, next_offset = self.client.scroll(
    200         collection_name=index,
    201         scroll_filter=scroll_filter,
    202         limit=batch_size,
    203         offset=next_offset,
    204         with_payload=True,
    205         with_vectors=True,
    206     )
    207     stop_scrolling = next_offset is None or (
    208         isinstance(next_offset, grpc.PointId)
    209         and next_offset.num == 0
    210         and next_offset.uuid == ""
    211     )
    213     for record in records:

File ~/micromamba/envs/hstack/lib/python3.10/site-packages/qdrant_client/qdrant_client.py:905, in QdrantClient.scroll(self, collection_name, scroll_filter, limit, offset, with_payload, with_vectors, consistency, shard_key_selector, **kwargs)
    867 """Scroll over all (matching) points in the collection.
    868 
    869 This method provides a way to iterate over all stored points with some optional filtering condition.
   (...)
    901     If next page offset is `None` - there is no more points in the collection to scroll.
    902 """
    903 assert len(kwargs) == 0, f"Unknown arguments: {list(kwargs.keys())}"
--> 905 return self._client.scroll(
    906     collection_name=collection_name,
    907     scroll_filter=scroll_filter,
    908     limit=limit,
    909     offset=offset,
    910     with_payload=with_payload,
    911     with_vectors=with_vectors,
    912     consistency=consistency,
    913     shard_key_selector=shard_key_selector,
    914     **kwargs,
    915 )

File ~/micromamba/envs/hstack/lib/python3.10/site-packages/qdrant_client/local/qdrant_local.py:421, in QdrantLocal.scroll(self, collection_name, scroll_filter, limit, offset, with_payload, with_vectors, **kwargs)
    410 def scroll(
    411     self,
    412     collection_name: str,
   (...)
    418     **kwargs: Any,
    419 ) -> Tuple[List[types.Record], Optional[types.PointId]]:
    420     collection = self._get_collection(collection_name)
--> 421     return collection.scroll(
    422         scroll_filter=scroll_filter,
    423         limit=limit,
    424         offset=offset,
    425         with_payload=with_payload,
    426         with_vectors=with_vectors,
    427     )

File ~/micromamba/envs/hstack/lib/python3.10/site-packages/qdrant_client/local/local_collection.py:930, in LocalCollection.scroll(self, scroll_filter, limit, offset, with_payload, with_vectors)
    927 if offset is not None and self._universal_id(point_id) < self._universal_id(offset):
    928     continue
--> 930 if len(result) >= limit + 1:
    931     break
    933 if not mask[idx]:

TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'

@anakin87
Member

anakin87 commented Feb 6, 2024

Can you report all the code you used?

@anakin87 anakin87 reopened this Feb 6, 2024
@NILICK
Author

NILICK commented Feb 6, 2024

My complete code is:


from pprint import pprint
from tqdm.auto import tqdm
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import QuestionGenerator, BM25Retriever, FARMReader
#from haystack.document_stores import ElasticsearchDocumentStore
from qdrant_haystack import QdrantDocumentStore
from haystack.pipelines import (
    QuestionGenerationPipeline,
    RetrieverQuestionGenerationPipeline,
    QuestionAnswerGenerationPipeline,
)
from haystack.utils import launch_es, print_questions
from haystack.nodes import MarkdownConverter
from haystack.nodes import PreProcessor

converter = MarkdownConverter(remove_numeric_tables=True, valid_languages=["en"])
docs = converter.convert(file_path=("./MD_files/Single_MD/Characterizing.md"), meta=None)[0]

preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=100,
    split_respect_sentence_boundary=True,
)

docs_default = preprocessor.process([docs])

# Save the DocumentStore locally
document_store = QdrantDocumentStore(
    path="~/Haystack/Qdrant_DB",
    index="Document",
    embedding_dim=768,
    recreate_index=True,
    hnsw_config={"m": 16, "ef_construct": 64}  # Optional
)

document_store.write_documents(docs_default, duplicate_documents='skip')

# Question generation pipeline using the reloaded QdrantDocumentStore database
document_store_reloaded = QdrantDocumentStore(
    path="~/Haystack/Qdrant_DB",
    index="Document",
    embedding_dim=768,
)

# Initialize Question Generator
question_generator = QuestionGenerator()

question_generation_pipeline = QuestionGenerationPipeline(question_generator)
for idx, document in enumerate(document_store_reloaded):

    print(f"\n * Generating questions for document {idx}: {document.content[:100]}...\n")
    result = question_generation_pipeline.run(documents=[document])
    print_questions(result)

@anakin87
Member

anakin87 commented Feb 7, 2024

In this new example, you are using Qdrant with Haystack 1.x.

Please try replacing

for idx, document in enumerate(document_store_reloaded):

with

for idx, document in enumerate(document_store_reloaded.get_all_documents()):

and let me know...

To understand why, you can look at the 1.x Document Store API Reference.
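One plausible reading of the traceback above: get_document_by_id calls get_documents_by_id([id], index, headers) positionally, while the callee's parameters are listed as (ids, index, batch_size, headers). That would put headers (None) in the batch_size slot, which then reaches scroll as limit. The sketch below reproduces that mechanism with hypothetical stand-in functions. The default values are assumptions, and this is not the library's code.

```python
from typing import List, Optional


def scroll(limit: Optional[int]) -> List[str]:
    """Stand-in for the local scroll; mirrors the failing comparison."""
    result: List[str] = []
    if len(result) >= limit + 1:  # raises TypeError when limit is None
        return result
    return result


def get_documents_by_id(ids, index=None, batch_size=None, headers=None):
    """Stand-in: forwards batch_size as the scroll limit."""
    return scroll(limit=batch_size)


def get_document_by_id(id, index=None, headers=None):
    """Stand-in: the third positional argument lands in batch_size."""
    return get_documents_by_id([id], index, headers)


try:
    get_document_by_id("doc-1")
    outcome = "ok"
except TypeError as err:
    outcome = str(err)

# Passing an explicit integer batch size by keyword avoids the problem.
explicit = get_documents_by_id(["doc-1"], batch_size=10)

print(outcome)
print(explicit)
```

Calling get_all_documents() instead of iterating over the store directly avoids the `__next__` path shown in the traceback, which is consistent with the suggested change resolving the error.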

@NILICK
Author

NILICK commented Feb 8, 2024

Hi! Thank you. It is the correct answer.

@NILICK NILICK closed this as completed Feb 8, 2024