Save and load local QdrantDocumentstore #279

NILICK · 2024-01-27T06:30:13Z

I created a QdrantDocumentStore using the code below and saved the database in a local directory.

from haystack.document_stores.types import DuplicatePolicy
from haystack import Document
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder

from haystack_integrations.components.retrievers.qdrant import QdrantEmbeddingRetriever
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

# Save the DocumentStore locally
document_store = QdrantDocumentStore(
    path="~/mydatabase/Haystack/db",
    index="Document",
    embedding_dim=768,
    recreate_index=True,
    hnsw_config={"m": 16, "ef_construct": 64}  # Optional
)


documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]

document_embedder = SentenceTransformersDocumentEmbedder()  
document_embedder.warm_up()
documents_with_embeddings = document_embedder.run(documents)

document_store.write_documents(documents_with_embeddings.get("documents"), policy=DuplicatePolicy.OVERWRITE)

query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder())
query_pipeline.add_component("retriever", QdrantEmbeddingRetriever(document_store=document_store))
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

query = "How many languages are there?"

result = query_pipeline.run({"text_embedder": {"text": query}})

print(result['retriever']['documents'][0])

But when I try to use the local database I created in another project, it raises ValueError: document_store must be an instance of QdrantDocumentStore.
I load the saved QdrantDocumentStore using:

from qdrant_client import QdrantClient
# Load the DocumentStore from the local directory
path="~/mydatabase/Haystack/db"
document_store = QdrantClient(path=path)

How can I load it correctly?


anakin87 · 2024-02-06T08:50:09Z

Hey!

I tried to replicate your example and was able to successfully reload the QdrantDocumentStore using

document_store = QdrantDocumentStore(
    path="~/mydatabase/Haystack/db",
    index="Document",
    embedding_dim=768,
)

In your case, I would not directly use qdrant_client, which is abstracted by the Document Store.
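
The ValueError in the question looks like a plain type check: the retriever accepts only a QdrantDocumentStore, and a bare QdrantClient is a different class even when it points at the same directory. Below is a minimal stand-alone sketch of that kind of check. `FakeClient`, `FakeStore`, `FakeRetriever`, and `try_build` are hypothetical stand-ins for illustration, not the real library classes.

```python
class FakeClient:
    """Stand-in for a low-level client opened on a local path (hypothetical)."""

    def __init__(self, path: str) -> None:
        self.path = path


class FakeStore:
    """Stand-in for a document store that wraps a client internally (hypothetical)."""

    def __init__(self, path: str, index: str) -> None:
        self.index = index
        self._client = FakeClient(path)


class FakeRetriever:
    """Stand-in retriever that accepts only a FakeStore (hypothetical)."""

    def __init__(self, document_store: object) -> None:
        # Reject anything that is not the expected store type
        if not isinstance(document_store, FakeStore):
            raise ValueError("document_store must be an instance of FakeStore")
        self.document_store = document_store


def try_build(document_store: object) -> str:
    """Report whether a retriever can be built on the given object."""
    try:
        FakeRetriever(document_store)
        return "ok"
    except ValueError as err:
        return f"rejected: {err}"


path = "~/mydatabase/Haystack/db"
print(try_build(FakeClient(path)))                   # the raw client is rejected
print(try_build(FakeStore(path, index="Document")))  # the store wrapper is accepted
```

In this sketch the store owns its client, so pointing a new store at the same path and index is enough to reopen the data, while the raw client never passes the check.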

I'm closing this issue. Feel free to reopen it if you still encounter problems.

NILICK · 2024-02-06T18:50:20Z

Hey! Thanks for your reply. I tried your suggestion, but it returns the error below:

TypeError                                 Traceback (most recent call last)
File <timed exec>:13

File ~/micromamba/envs/hstack/lib/python3.10/site-packages/haystack/document_stores/base.py:185, in BaseDocumentStore.__next__(self)
    183     raise StopIteration
    184 curr_id = self.ids_iterator[0]
--> 185 ret = self.get_document_by_id(curr_id)
    186 self.ids_iterator = self.ids_iterator[1:]
    187 return ret

File ~/micromamba/envs/hstack/lib/python3.10/site-packages/qdrant_haystack/document_stores/qdrant.py:179, in QdrantDocumentStore.get_document_by_id(self, id, index, headers)
    173 def get_document_by_id(
    174     self,
    175     id: str,
    176     index: Optional[str] = None,
    177     headers: Optional[Dict[str, str]] = None,
    178 ) -> Optional[Document]:
--> 179     documents = self.get_documents_by_id([id], index, headers)
    180     if 0 == len(documents):
    181         return None

File ~/micromamba/envs/hstack/lib/python3.10/site-packages/qdrant_haystack/document_stores/qdrant.py:199, in QdrantDocumentStore.get_documents_by_id(self, ids, index, batch_size, headers)
    197 scroll_filter = self.qdrant_filter_converter.convert(None, ids)
    198 while not stop_scrolling:
--> 199     records, next_offset = self.client.scroll(
    200         collection_name=index,
    201         scroll_filter=scroll_filter,
    202         limit=batch_size,
    203         offset=next_offset,
    204         with_payload=True,
    205         with_vectors=True,
    206     )
    207     stop_scrolling = next_offset is None or (
    208         isinstance(next_offset, grpc.PointId)
    209         and next_offset.num == 0
    210         and next_offset.uuid == ""
    211     )
    213     for record in records:

File ~/micromamba/envs/hstack/lib/python3.10/site-packages/qdrant_client/qdrant_client.py:905, in QdrantClient.scroll(self, collection_name, scroll_filter, limit, offset, with_payload, with_vectors, consistency, shard_key_selector, **kwargs)
    867 """Scroll over all (matching) points in the collection.
    868 
    869 This method provides a way to iterate over all stored points with some optional filtering condition.
   (...)
    901     If next page offset is `None` - there is no more points in the collection to scroll.
    902 """
    903 assert len(kwargs) == 0, f"Unknown arguments: {list(kwargs.keys())}"
--> 905 return self._client.scroll(
    906     collection_name=collection_name,
    907     scroll_filter=scroll_filter,
    908     limit=limit,
    909     offset=offset,
    910     with_payload=with_payload,
    911     with_vectors=with_vectors,
    912     consistency=consistency,
    913     shard_key_selector=shard_key_selector,
    914     **kwargs,
    915 )

File ~/micromamba/envs/hstack/lib/python3.10/site-packages/qdrant_client/local/qdrant_local.py:421, in QdrantLocal.scroll(self, collection_name, scroll_filter, limit, offset, with_payload, with_vectors, **kwargs)
    410 def scroll(
    411     self,
    412     collection_name: str,
   (...)
    418     **kwargs: Any,
    419 ) -> Tuple[List[types.Record], Optional[types.PointId]]:
    420     collection = self._get_collection(collection_name)
--> 421     return collection.scroll(
    422         scroll_filter=scroll_filter,
    423         limit=limit,
    424         offset=offset,
    425         with_payload=with_payload,
    426         with_vectors=with_vectors,
    427     )

File ~/micromamba/envs/hstack/lib/python3.10/site-packages/qdrant_client/local/local_collection.py:930, in LocalCollection.scroll(self, scroll_filter, limit, offset, with_payload, with_vectors)
    927 if offset is not None and self._universal_id(point_id) < self._universal_id(offset):
    928     continue
--> 930 if len(result) >= limit + 1:
    931     break
    933 if not mask[idx]:

TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'

anakin87 · 2024-02-06T19:21:07Z

Can you report all the code you used?

NILICK · 2024-02-06T19:42:31Z

My complete code is:


from pprint import pprint
from tqdm.auto import tqdm
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import QuestionGenerator, BM25Retriever, FARMReader
#from haystack.document_stores import ElasticsearchDocumentStore
from qdrant_haystack import QdrantDocumentStore
from haystack.pipelines import (
    QuestionGenerationPipeline,
    RetrieverQuestionGenerationPipeline,
    QuestionAnswerGenerationPipeline,
)
from haystack.utils import launch_es, print_questions
from haystack.nodes import MarkdownConverter
from haystack.nodes import PreProcessor

converter = MarkdownConverter(remove_numeric_tables=True, valid_languages=["en"])
docs = converter.convert(file_path=("./MD_files/Single_MD/Characterizing.md"), meta=None)[0]

preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=100,
    split_respect_sentence_boundary=True,
)

docs_default = preprocessor.process([docs])

# Save the DocumentStore locally
document_store = QdrantDocumentStore(
    path="~/Haystack/Qdrant_DB",
    index="Document",
    embedding_dim=768,
    recreate_index=True,
    hnsw_config={"m": 16, "ef_construct": 64}  # Optional
)

document_store.write_documents(docs_default, duplicate_documents='skip')

# Question generation pipeline using the reloaded QdrantDocumentStore
document_store_reloaded = QdrantDocumentStore(
    path="~/Haystack/Qdrant_DB",
    index="Document",
    embedding_dim=768,
)

# Initialize Question Generator
question_generator = QuestionGenerator()

question_generation_pipeline = QuestionGenerationPipeline(question_generator)
for idx, document in enumerate(document_store_reloaded):

    print(f"\n * Generating questions for document {idx}: {document.content[:100]}...\n")
    result = question_generation_pipeline.run(documents=[document])
    print_questions(result)

anakin87 · 2024-02-07T15:18:57Z

In this new example, you are using Qdrant with Haystack 1.x.

Please try replacing for idx, document in enumerate(document_store_reloaded):
with
for idx, document in enumerate(document_store_reloaded.get_all_documents()):
and let me know...

To understand why, you can look at the 1.x Document Store API Reference.
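
Reading the traceback above, one plausible explanation of the TypeError is a positional-argument mismatch. `get_document_by_id` calls `get_documents_by_id([id], index, headers)`, while the callee's signature is `(ids, index, batch_size, headers)`, so the `None` meant for `headers` would land in `batch_size` and reach `scroll` as `limit=None`. The sketch below reproduces that pattern with stand-in functions. It is an interpretation of the traceback, not a confirmed diagnosis, and the default `batch_size=10` is an assumption.

```python
from typing import List, Optional


def scroll(limit: Optional[int]) -> List[str]:
    """Stand-in for the local scroll, which evaluates `limit + 1`."""
    records = ["doc-a", "doc-b", "doc-c"]
    result: List[str] = []
    for record in records:
        if len(result) >= limit + 1:  # raises TypeError when limit is None
            break
        result.append(record)
    return result


def get_documents_by_id(ids, index=None, batch_size=10, headers=None):
    """Stand-in using the parameter order shown in the traceback."""
    return scroll(limit=batch_size)


def get_document_by_id(doc_id, index=None, headers=None):
    """Stand-in that forwards `headers` positionally, as in the traceback."""
    # The third positional argument binds to batch_size, not headers
    return get_documents_by_id([doc_id], index, headers)


def describe(call) -> str:
    """Run a call and report either its result or the TypeError it raised."""
    try:
        return f"ok: {call()}"
    except TypeError as err:
        return f"TypeError: {err}"


print(describe(lambda: get_documents_by_id(["doc-a"])))  # batch_size keeps its default
print(describe(lambda: get_document_by_id("doc-a")))     # batch_size becomes None
```

If this reading is right, iterating the store directly goes through the single-document path and hits the error, while `get_all_documents()` avoids it.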

NILICK · 2024-02-08T07:34:44Z

Hi! Thank you. It is the correct answer.

NILICK added the feature request label Jan 27, 2024

masci changed the title ~~Sava and load local QdrantDocumentstore~~ Save and load local QdrantDocumentstore Feb 5, 2024

anakin87 closed this as completed Feb 6, 2024

anakin87 reopened this Feb 6, 2024

NILICK closed this as completed Feb 8, 2024
