
feat(Qdrant): start to work on sparse vector integration #578

Merged · 45 commits · Apr 12, 2024

Conversation

lambda-science
Contributor

@lambda-science lambda-science commented Mar 13, 2024

Related to issue #549
CC @Anush008 if you want to start having an eye on this and feedback / guidelines 👀

This is a first pull request towards integration of Sparse vector capabilities from Qdrant v1.7/v1.8 (https://qdrant.tech/articles/qdrant-1.7.x/).
I'm working on this following this article: https://qdrant.tech/articles/sparse-vectors/

Most of the changes/work (~70%) are in document_store.py, ~25% in the Haystack-Qdrant converter, and ~5% to add the Retriever, which simply makes a document_store.py call. Fixing the tests will also take time.

CHANGES:

  • NEW query_by_sparse() in document_store.py to be able to query with a sparse embedding input.
  • MODIFIED _recreate_collection() in document_store.py to add Sparse Vector in vectors_config
  • NEW QdrantSparseRetriever in retriever.py that calls the self._document_store.query_by_sparse()
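For context, the new query_by_sparse() boils down to a named sparse-vector search against Qdrant (v1.7+). Below is a rough sketch of the request body shape as a plain dict, mirroring the Qdrant REST points/search payload; the indices/values are made-up examples, and "text-sparse" is the vector name used in this PR:

```python
# Hypothetical sketch of the JSON body a sparse query would send to
# POST /collections/<collection>/points/search (Qdrant >= 1.7).
# The indices/values are made-up; "text-sparse" is the named sparse
# vector introduced in this PR.
sparse_query = {
    "vector": {
        "name": "text-sparse",
        "vector": {"indices": [12, 512, 2048], "values": [0.9, 0.4, 0.1]},
    },
    "limit": 10,
    "with_payload": True,
}
```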

TODO:

  • How do we handle previous collections created without this feature?

Main Concerns / Questions:

  • Afraid this creates some breaking changes with old collections, since our vectors are now a dict with both an unnamed dense vector and named sparse vectors.
  • What will happen for already existing collections without sparse vectors? Recreated / errors? Should we maybe make it optional? Modify the _set_up_collection logic to add warnings or safeguards?
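For reference, the sparse vectors in question store only the non-zero dimensions as parallel indices/values arrays, and relevance scoring is a dot product over the indices both vectors share. A tiny self-contained illustration (pure Python with hypothetical numbers, not the Qdrant implementation):

```python
def sparse_dot(a: dict[int, float], b: dict[int, float]) -> float:
    """Dot product of two sparse vectors stored as {index: value} maps.
    Only dimensions present in both vectors contribute to the score."""
    if len(b) < len(a):
        a, b = b, a  # iterate over the smaller map
    return sum(value * b[idx] for idx, value in a.items() if idx in b)

# Qdrant-style parallel indices/values arrays converted to {index: value} maps
query = dict(zip([1, 7, 42], [0.5, 1.2, 0.3]))
doc = dict(zip([7, 42, 99], [2.0, 1.0, 4.0]))
score = sparse_dot(query, doc)  # ≈ 2.7 (1.2*2.0 + 0.3*1.0)
```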

@lambda-science lambda-science requested a review from a team as a code owner March 13, 2024 12:36
@lambda-science lambda-science requested review from masci and removed request for a team March 13, 2024 12:36
@lambda-science lambda-science marked this pull request as draft March 13, 2024 12:36
@github-actions github-actions bot added integration:qdrant type:documentation Improvements or additions to documentation labels Mar 13, 2024
@Anush008
Contributor

I want to ask @anakin87, if we'll have Haystack integrations to generate sparse vectors? Since @lambda-science's work relies on this. A clarification would be helpful.

@lambda-science
Contributor Author

lambda-science commented Mar 13, 2024

I want to ask @anakin87, if we'll have Haystack integrations to generate sparse vectors? Since @lambda-science's work relies on this. A clarification would be helpful.

@Anush008 I was planning to create a custom component, or contribute a public Haystack component, to create sparse embeddings after finishing this Qdrant integration, especially since FastEmbed will release their embedder soon! (I know it's a Qdrant-maintained lib :p ) qdrant/fastembed#144

The Qdrant integration could support sparse embeddings before we can generate them in Haystack; that's not an issue. We would just not be able to use the retriever for now.

But maybe this integration could require modifying the Document dataclass to not only have the embedding field for dense vectors, but also an optional sparse_embedding field for sparse ones :) Here, just a supplementary field: https://github.com/deepset-ai/haystack/blob/main/haystack/dataclasses/document.py#L69

EDIT: Related to sparse embeddings in Haystack, we could soon modify the FastEmbed integration to generate them! https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/fastembed

@Anush008
Contributor

Wouldn't it be better to have a sparse embedding component first? Will help in validating the new Qdrant sparse retriever.

Also, just in: FastEmbed will now have a sparse embedding model.

qdrant/fastembed#146

@lambda-science
Contributor Author

Wouldn't it be better to have a sparse embedding component first? Will help in validating the new Qdrant sparse retriever.

Also, Just in. FastEmbed will now have a sparse embedding model.

qdrant/fastembed#146

I can look into modifying the FastEmbed integration in Haystack to create a SparseEmbedding component with this freshly released version of FastEmbed 😁

@Anush008
Contributor

Great. Please let me know if you need a hand with the code.

@anakin87
Member

@lambda-science I did some research on sparse embeddings over the past few weeks.

Here is what I think will be necessary:

  • see if we should change the Document dataclass to store sparse embedding
  • create SparseEmbedder components (based on FastEmbed, for example)
  • refactor write_documents methods of Document Stores with this capability (Qdrant and Pinecone, I think)
  • create SparseEmbeddingRetrievers specific to these Document Stores

Since this requires several changes, I would like to discuss internally if this is the best course of action and should be prioritized.
I will let you know!

@lambda-science lambda-science changed the title feat(Qdrant): start to working on sparse vector integration feat(Qdrant): start to work on sparse vector integration Mar 13, 2024
@lambda-science
Contributor Author

lambda-science commented Mar 14, 2024

Previously existing tests are fixed with the addition of the new sparse vector feature and hybrid search.
Now I've got to add tests for sparse vectors and retrievers :)

EDIT: tests added for the sparse retriever

@lambda-science
Contributor Author

I think I've finished my PR for a working sparse retriever in Qdrant, based on my previous PR adding the sparse embedder from FastEmbed. @Anush008 and @anakin87, I guess both PRs are compatible to work with, allowing a complete round trip of testing (without the Haystack Document class modification).

@lambda-science lambda-science marked this pull request as ready for review March 14, 2024 10:15
Contributor Author

@lambda-science lambda-science left a comment


Might be a breaking-change PR, because I switched from an unnamed dense vector to a named dense vector, "text-dense".

@lambda-science
Contributor Author

lambda-science commented Mar 25, 2024

Thanks @Anush008 @anakin87. I think I've done enough and will stop my part of the contribution here, because it has been kind of exhausting to fix the tests and verify everything, specifically because end-to-end testing is messy (importing an old collection to see that nothing is breaking).
I reverted part of your suggestion, Anush, because it would still be a breaking change / add a supplementary case to support.
Now the PR is compatible with two things:

  • Old collections created before this PR that used unnamed dense vectors.
  • New collections created after this PR that use both dense and sparse named vectors by default.

The only remaining item is that, for an old collection, we should maybe set use_sparse to false automatically instead of throwing an error asking the user to do it.

But please, please, please, test your suggestions deeply before applying them (with old collections, for example), because otherwise I spend a lot of time fixing things.

Contributor

@Anush008 Anush008 left a comment


Thank you for the contribution @lambda-science.

@@ -58,7 +58,9 @@ jobs:
       - name: Run tests
         id: tests
-        run: hatch run cov
+        run: |
+          hatch run pip install git+https://github.com/deepset-ai/haystack.git #TODO: rm before merging

A reminder.

@anakin87
Member

@lambda-science thanks for all your work!

I've pushed some small simplifications, but it looks good.

The only thing that still leaves me doubtful is having use_sparse_embeddings=True by default, but you provided good reasons. I will think about it...

At this point let's wait for 2.0.1!

@lambda-science
Contributor Author

@lambda-science thanks for all your work!

I've pushed some little simplifications, but looks good.

The only thing that still leaves me doubtful is having use_sparse_embeddings=True by default, but you provided good reasons. I will think about it...

At this point let's wait for 2.0.1!

One solution would be to have use_sparse_embeddings=True by default, and if we detect an old collection (vectors are not a dict with a sparse entry), set self.use_sparse_embeddings=False and log a warning. That way, no breaking change/blocking/modification is needed for old collections with dense-only vectors, and all new collections use dense+sparse named vectors by default.
I tried to do it but it turned into a mess; still, I guess it can be easy to implement. Maybe you could do that @anakin87? (After 2.0.1 maybe ahah)
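The warning-and-fallback idea could be sketched as a small helper like this (hypothetical: the real _set_up_collection inspects the Qdrant collection info, and the exact structure checked here is an assumption):

```python
import logging

logger = logging.getLogger(__name__)

def detect_sparse_support(vectors_config) -> bool:
    """Return True if an existing collection already stores named
    dense+sparse vectors (new layout), False for a legacy collection
    with a single unnamed dense vector.

    `vectors_config` stands in for what the collection info exposes:
    a dict of named vector configs for new collections, or a single
    (non-dict) params object for old ones.
    """
    if isinstance(vectors_config, dict) and "text-sparse" in vectors_config:
        return True
    logger.warning(
        "Collection was created without sparse vectors; "
        "falling back to use_sparse_embeddings=False."
    )
    return False
```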

@anakin87
Member

anakin87 commented Apr 5, 2024

Next week we will release 2.0.1!

We discussed with @masci about this new feature.

We would like to introduce it gradually:

  • set use_sparse_embeddings=False by default in qdrant-haystack 3.3.0
  • (see how much the new feature is used/what the reception is)
  • later (in qdrant-haystack 4.0.0?), switch to use_sparse_embeddings=True

@lambda-science do you want to make this change or would you prefer me to make it?

@lambda-science
Contributor Author

Next week we will release 2.0.1!

We discussed with @masci about this new feature.

We would like to introduce it gradually:

  • set use_sparse_embeddings=False by default in qdrant-haystack 3.3.0
  • (see how much the new feature is used/what the reception is)
  • later (in qdrant-haystack 4.0.0?), switch to use_sparse_embeddings=True

@lambda-science do you want to make this change or would you prefer me to make it?

Hey,

I'll let you commit the modification to set use_sparse_embeddings=False by default.

Happy to hear Haystack 2.0.1 is coming soon.

However, I still think people will be quite unhappy when they discover that the default behavior of the integration is not compatible with all features (from Haystack or Qdrant) and requires manual scripting for migration.

@anakin87
Member

I am working on these changes.

@lambda-science if you have it at hand, can you share the migration script you used?

@lambda-science
Contributor Author

lambda-science commented Apr 11, 2024

I am working on these changes.

@lambda-science if you have it at hand, can you share the migration script you used?

In my case I used something like this, with the pinecone-text lib for a BM25 encoder (instead of FastEmbed, because of resource consumption). Also note that old_collection refers to a collection created before this PR, and new_collection to a collection created after this PR with use_sparse_embeddings=True.

Something like this (a bit modified from the real one I use, for simplicity):

"""
Script to help upgrade a Qdrant collection that was created <1.7.0
(before sparse vector support), to add sparse vectors in a new collection.

The script simply scrolls through all points in the old collection, creates
a sparse vector for those points, and upserts them into a new collection.
"""

from qdrant_client import QdrantClient
from qdrant_client.http import models
from pinecone_text.sparse import BM25Encoder

import time

bm25 = BM25Encoder.default()
client = QdrantClient(url="http://localhost:6333")

# In my case I create the new_collection with proper configuration (sparse) by booting my Haystack instance
# before running this script. 
old_collection = "pythia"
new_collection = "pythia-bm25"

total_points = client.count(
    collection_name=old_collection,
    exact=True,
).count

next_page_offset = "first"
points_transmitted = 0
start = time.time()

# Disable indexing while adding points so it's faster
client.update_collection(
    collection_name=new_collection,
    optimizer_config=models.OptimizersConfigDiff(indexing_threshold=0),
)

while next_page_offset:
    offset = None if next_page_offset == "first" else next_page_offset

    # Get a page of records; scroll() returns (records, next_page_offset)
    current_records, next_page_offset = client.scroll(
        collection_name=old_collection,
        limit=100,
        with_payload=True,
        with_vectors=True,
        offset=offset,
    )

    points = []

    # Calculate Sparse Embeddings for each document and set the vector dict
    for record in current_records:
        vector = {}

        content_without_newlines = record.payload["content"].replace("\n", " ").replace(
            "\r", " "
        )
        bm25_encoded = bm25.encode_documents(content_without_newlines)

        # Set Vectors
        vector["text-sparse"] = models.SparseVector(**bm25_encoded)
        vector["text-dense"] = record.vector

        point = {"id": record.id, "payload": record.payload, "vector": vector}

        embedding_point = models.PointStruct(**point)
        points.append(embedding_point)

    client.upsert(collection_name=new_collection, points=points)

    points_transmitted += len(points)
    points_remaining = total_points - points_transmitted

    # Print progress
    elapsed = time.time() - start
    eta_minutes = (elapsed / points_transmitted) * points_remaining / 60
    print(f"Points transmitted: {points_transmitted}/{total_points}")
    print(f"Percent done: {points_transmitted / total_points * 100:.2f}%")
    print(f"Time elapsed: {elapsed:.2f} seconds")
    print(f"Time remaining: {eta_minutes:.2f} minutes")
    print(f"Current offset: {next_page_offset}\n")

# Re-enable indexing once all points are added with my own indexing threshold
client.update_collection(
    collection_name=new_collection,
    optimizer_config=models.OptimizersConfigDiff(indexing_threshold=20000),
)

@anakin87 anakin87 self-requested a review April 12, 2024 10:59
Member

@anakin87 anakin87 left a comment


Thanks also for your script! In a future PR, I will add a similar one to this project.

@anakin87 anakin87 merged commit 12cdc11 into deepset-ai:main Apr 12, 2024
10 checks passed