feat(Qdrant): start to work on sparse vector integration #578
Conversation
I want to ask @anakin87: will we have Haystack integrations to generate sparse vectors? @lambda-science's work relies on this. A clarification would be helpful.
@Anush008 I was planning to create a custom component, or contribute a public Haystack component, to create sparse embeddings after finishing this integration in Qdrant, especially since FastEmbed will release their embedder soon! (I know it's a Qdrant-maintained lib :p) qdrant/fastembed#144. The Qdrant integration could support sparse embeddings before we can generate them in Haystack; that's not an issue, we just wouldn't be able to use the retriever for now. But maybe this integration could require modifying the Document dataclass to not only have the … EDIT: Related to sparse embeddings in Haystack, we could soon modify the FastEmbed integration to generate them! https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/fastembed
Wouldn't it be better to have a sparse embedding component first? That would help in validating the new Qdrant sparse retriever. Also, just in.
I can look into modifying the FastEmbed integration in Haystack to create a SparseEmbedding component then, with this freshly released version of FastEmbed 😁
Great. Please let me know if you need a hand with the code.
@lambda-science I did some research on sparse embeddings over the past few weeks. Here is what I think would be necessary:
Since this requires several changes, I would like to discuss internally whether this is the best course of action and whether it should be prioritized.
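For context, a sparse embedding is usually carried as two parallel arrays: the non-zero dimension indices and their weights. A minimal, hypothetical sketch of such a dataclass (the name `SparseEmbedding` and its fields are illustrative, not the actual Haystack API):

```python
from dataclasses import dataclass, field


@dataclass
class SparseEmbedding:
    """Hypothetical sparse vector container, for illustration only."""

    indices: list = field(default_factory=list)  # positions of non-zero dimensions
    values: list = field(default_factory=list)   # corresponding weights

    @classmethod
    def from_dict(cls, d: dict) -> "SparseEmbedding":
        # Accepts the {"indices": [...], "values": [...]} shape that several
        # sparse encoders produce
        return cls(indices=list(d["indices"]), values=list(d["values"]))


emb = SparseEmbedding.from_dict({"indices": [3, 17, 42], "values": [0.5, 1.2, 0.8]})
print(len(emb.indices))  # 3
```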
Previously existing tests are fixed with the addition of the new sparse vector feature and hybrid search. EDIT: tests added for the sparse retriever.
This might be a breaking-change PR, because I switched from an unnamed dense vector to the named dense vector "text-dense".
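For readers unfamiliar with the distinction, here is a hedged sketch of the two collection layouts as plain REST-style payload dicts, loosely following Qdrant's collection-creation API; sizes and vector names are illustrative:

```python
# With an unnamed vector, "vectors" holds a single params object.
unnamed_config = {
    "vectors": {"size": 384, "distance": "Cosine"},
}

# With named vectors, "vectors" maps names to params, which also allows a
# separate "sparse_vectors" section alongside the dense one.
named_config = {
    "vectors": {
        "text-dense": {"size": 384, "distance": "Cosine"},
    },
    "sparse_vectors": {
        "text-sparse": {},
    },
}

# Points written against the unnamed layout carry a bare list as their vector;
# with named vectors they carry a dict keyed by vector name. That is why
# switching layouts breaks collections created before the change.
print(list(named_config["vectors"]))  # ['text-dense']
```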
…qdrant/converters.py Co-authored-by: Anush <[email protected]>
…qdrant/converters.py Co-authored-by: Anush <[email protected]>
…qdrant/document_store.py Co-authored-by: Anush <[email protected]>
…qdrant/document_store.py Co-authored-by: Anush <[email protected]>
…qdrant/document_store.py Co-authored-by: Anush <[email protected]>
…_stores/qdrant/document_store.py" This reverts commit f7cf65e.
…qdrant/converters.py Co-authored-by: Anush <[email protected]>
Thanks @Anush008 @anakin87. I think I've done enough; I will stop my part of the contribution here, because it has been kind of exhausting to fix the tests and test everything, specifically because end-to-end testing is messy (importing an old collection to see if nothing is breaking).
The only remaining thing is that maybe, for an old collection, it should set use_sparse to false automatically instead of throwing the error asking the user to do it. But please, please, please, test your suggestion deeply before doing so (with old collections, for example), because I already spent a lot of time fixing stuff.
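A minimal sketch of that suggested safeguard, assuming a hypothetical helper and a plain-dict stand-in for the collection info a Qdrant client would return (this is not the PR's actual code):

```python
def resolve_use_sparse(collection_info: dict, requested: bool) -> bool:
    """Fall back to use_sparse=False for collections without sparse vectors.

    `collection_info` mimics, as a plain dict, the collection configuration
    a Qdrant client could report; the function name is invented for
    illustration.
    """
    sparse_cfg = (
        collection_info.get("config", {}).get("params", {}).get("sparse_vectors")
    )
    if requested and not sparse_cfg:
        # Legacy collection created before sparse support: degrade gracefully
        # instead of raising and asking the user to migrate first.
        print("Warning: collection has no sparse vectors, disabling use_sparse")
        return False
    return requested


legacy = {"config": {"params": {"vectors": {"size": 384, "distance": "Cosine"}}}}
print(resolve_use_sparse(legacy, requested=True))  # False
```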
Thank you for the contribution @lambda-science.
.github/workflows/qdrant.yml (Outdated)
```diff
@@ -58,7 +58,9 @@ jobs:
       - name: Run tests
         id: tests
-       run: hatch run cov
+       run: |
+         hatch run pip install git+https://github.com/deepset-ai/haystack.git # TODO: rm before merging
```
A reminder.
@lambda-science thanks for all your work! I've pushed some small simplifications, but it looks good. The only thing that still leaves me doubtful is having … At this point, let's wait for 2.0.1!
One solution would be to have …
Next week we will release 2.0.1! We discussed this new feature with @masci. We would like to introduce it gradually:
@lambda-science do you want to make this change, or would you prefer me to make it?
Hey, I'll let you commit the modification to put … Happy to hear Haystack 2.0.1 is coming soon. However, I still think people will be quite unhappy when they discover that the default behavior of the integration is not compatible with all features (from Haystack or Qdrant) and requires manual scripting for migration.
I am working on these changes. @lambda-science, if you have it at hand, can you share the migration script you used?
In my case I used something like this, with … Something like this (a bit modified compared to the real one I use, for simplicity):
```python
"""
Script to help upgrade a Qdrant collection that was created <1.7.0
(before sparse vector support), to add sparse vectors in a new collection.
The script simply scrolls through all points in the old collection, creates
a sparse vector for those points, and upserts them into a new collection.
"""
import time

from pinecone_text.sparse import BM25Encoder
from qdrant_client import QdrantClient
from qdrant_client.http import models

bm25 = BM25Encoder.default()
client = QdrantClient(url="http://localhost:6333")

# In my case I create the new_collection with the proper (sparse) configuration
# by booting my Haystack instance before running this script.
old_collection = "pythia"
new_collection = "pythia-bm25"

total_points = client.count(
    collection_name=old_collection,
    exact=True,
).count

next_page_offset = "first"
points_transmitted = 0
start = time.time()

# Disable indexing while adding points so it's faster
client.update_collection(
    collection_name=new_collection,
    optimizer_config=models.OptimizersConfigDiff(indexing_threshold=0),
)

while next_page_offset:
    offset = None if next_page_offset == "first" else next_page_offset

    # Get a page of records from the old collection
    current_records, next_page_offset = client.scroll(
        collection_name=old_collection,
        limit=100,
        with_payload=True,
        with_vectors=True,
        offset=offset,
    )

    points = []
    # Calculate a sparse embedding for each document and set the vector dict
    for record in current_records:
        content_without_newlines = (
            record.payload["content"].replace("\n", " ").replace("\r", " ")
        )
        bm25_encoded = bm25.encode_documents(content_without_newlines)
        vector = {
            "text-sparse": models.SparseVector(**bm25_encoded),
            "text-dense": record.vector,
        }
        points.append(
            models.PointStruct(id=record.id, payload=record.payload, vector=vector)
        )

    client.upsert(collection_name=new_collection, points=points)

    points_transmitted += len(points)
    points_remaining = total_points - points_transmitted

    # Print progress
    elapsed = time.time() - start
    print(
        f"Points transmitted: {points_transmitted}/{total_points}\n"
        f"Percent done: {points_transmitted / total_points * 100:.2f}%\n"
        f"Time elapsed: {elapsed:.2f} seconds\n"
        f"Time remaining: {elapsed / points_transmitted * points_remaining / 60:.2f} minutes\n"
        f"Current offset: {next_page_offset}\n"
    )

# Re-enable indexing once all points are added, with my own indexing threshold
client.update_collection(
    collection_name=new_collection,
    optimizer_config=models.OptimizersConfigDiff(indexing_threshold=20000),
)
```
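For readers without pinecone_text installed, here is a toy stand-in that only illustrates the `{"indices": [...], "values": [...]}` shape that `BM25Encoder.encode_documents` feeds into `models.SparseVector`; real BM25 uses corpus statistics and a stable vocabulary hash, while this sketch just counts terms:

```python
from collections import Counter
import zlib


def toy_sparse_encode(text: str) -> dict:
    """Toy sparse encoder, for illustrating the output shape only."""
    counts = Counter(text.lower().split())
    # Map each term to a stable integer dimension via a checksum, then sort
    # by dimension index as sparse-vector consumers usually expect
    items = sorted(
        (zlib.crc32(term.encode()) % (2**31), float(tf))
        for term, tf in counts.items()
    )
    return {
        "indices": [i for i, _ in items],
        "values": [v for _, v in items],
    }


encoded = toy_sparse_encode("sparse vectors make hybrid search hybrid")
print(len(encoded["indices"]) == len(encoded["values"]))  # True
```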
Thanks also for your script! In a future PR, I will add a similar one to this project.
Related to issue #549
CC @Anush008 if you want to start keeping an eye on this and give feedback / guidelines 👀
This is a first pull request towards integration of Sparse vector capabilities from Qdrant v1.7/v1.8 (https://qdrant.tech/articles/qdrant-1.7.x/).
I'm working on this following this article: https://qdrant.tech/articles/sparse-vectors/
Most of the changes / work (70%) has to be done in `document_store.py`, 25% in the Haystack-Qdrant converter, and 5% just to add the retriever, which simply makes a `document_store.py` call. Fixing the tests will also take time.

CHANGES:
- Added `query_by_sparse()` in `document_store.py` to be able to query with a sparse embedding input.
- Modified `_recreate_collection()` in `document_store.py` to add sparse vectors in `vectors_config`.
- Added `QdrantSparseRetriever` in `retriever.py` that calls `self._document_store.query_by_sparse()`.
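As a rough illustration of what a sparse query could look like on the wire, here is a hedged sketch that builds a REST-style search payload as a plain dict; the helper is invented for illustration and is not the PR's actual `query_by_sparse()` implementation. The vector name "text-sparse" follows the naming used in this PR:

```python
def build_sparse_search_body(indices, values, top_k=10):
    """Build a hypothetical Qdrant-style sparse search payload.

    Illustrative only: named sparse queries carry the vector name plus the
    parallel indices/values arrays.
    """
    return {
        "vector": {
            "name": "text-sparse",
            "vector": {"indices": list(indices), "values": list(values)},
        },
        "limit": top_k,
        "with_payload": True,
    }


body = build_sparse_search_body([3, 17, 42], [0.5, 1.2, 0.8], top_k=5)
print(body["vector"]["name"])  # text-sparse
```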
TODO:
Main Concerns / Questions:
- `_set_up_collection` logic for warnings or safeguards?