Elasticsearch: support dense, sparse, hybrid with inference in Elasticsearch #699

maxjakob · 2024-04-29T14:01:48Z

Summary and motivation

Elasticsearch offers multiple retrieval features, including:

approximate dense vector retrieval with embedding inference in Python or in Elasticsearch
exact dense vector retrieval with embedding inference in Python
sparse vector retrieval with embedding inference in Elasticsearch
hybrid retrieval (dense+BM25) with embedding inference in Elasticsearch

Other libraries such as LangChain already have all these options integrated. It would be great to also have them available in Haystack. Elastic is currently working on a Python package that will make the integration of these features easier. Here we want to discuss how to best make them available.

Questions

Does Haystack want to enable inference in Elasticsearch? The current design assumes that mapping from input string to embedding vector is done in Python before calling a retriever. With inference in Elasticsearch, this would change. For example, users could configure a dense vector model in Elasticsearch and then use input strings in Haystack.
The options mentioned above require different ways of indexing the data. How to best incorporate this requirement? The current document store abstraction kind of assumes that there is only one way of indexing.
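To make the first question concrete, here is a minimal sketch of how the search request could differ between the two modes. It only builds the request body as a plain dict and does not talk to a cluster. The helper name build_knn_body, the field name and the model id are hypothetical; the `query_vector` and `query_vector_builder` keys are assumed to follow the Elasticsearch kNN search option.

```python
from typing import Any, Dict, List, Optional


def build_knn_body(
    field: str,
    k: int = 10,
    num_candidates: int = 100,
    query_vector: Optional[List[float]] = None,
    query_text: Optional[str] = None,
    model_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a kNN search body (hypothetical helper, not a Haystack API).

    If query_vector is given, the embedding was computed in Python.
    Otherwise query_text and model_id are passed through so that
    Elasticsearch runs the embedding model itself.
    """
    knn: Dict[str, Any] = {"field": field, "k": k, "num_candidates": num_candidates}
    if query_vector is not None:
        # Inference in Python: send the precomputed vector.
        knn["query_vector"] = query_vector
    elif query_text is not None and model_id is not None:
        # Inference in Elasticsearch: send the raw text plus the model id.
        knn["query_vector_builder"] = {
            "text_embedding": {"model_id": model_id, "model_text": query_text}
        }
    else:
        raise ValueError("Provide either query_vector, or query_text and model_id")
    return {"knn": knn}
```

With inference in Elasticsearch the retriever would accept the input string directly, e.g. build_knn_body("embedding", query_text="what is RRF?", model_id="my-model"), so no embedder component would be needed in front of it.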

Detailed design

Concrete proposal:

ElasticsearchDocumentStore takes an argument retrieval_strategy, similar to how it is done in LangChain. Calls to write_documents make use of the retrieval strategy to know how to index the data.
We add a number of different retrievers (ElasticsearchDenseVectorRetriever, ElasticsearchSparseVectorRetriever, ElasticsearchHybridRetriever, ...) that get initialized with an ElasticsearchDocumentStore. The retrieval strategy has to match the expectation of the individual retrievers. We check that the expectation is met upon initialization. For retrieving documents, the retrievers call a search method on the document store as this is the established pattern.
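A rough sketch of how the two points of the proposal could fit together is shown below. All class names here (RetrievalStrategy, SketchDocumentStore, SketchRetriever and its subclasses) are hypothetical stand-ins, not the actual Haystack or LangChain classes, and the in-memory list with substring matching only stands in for real indexing and search calls against Elasticsearch.

```python
from enum import Enum
from typing import Any, Dict, FrozenSet, List


class RetrievalStrategy(Enum):
    """Hypothetical set of strategies the document store could be configured with."""
    DENSE = "dense"
    SPARSE = "sparse"
    HYBRID = "hybrid"


class SketchDocumentStore:
    """Stand-in for a document store that indexes according to its strategy."""

    def __init__(self, retrieval_strategy: RetrievalStrategy) -> None:
        self.retrieval_strategy = retrieval_strategy
        self.indexed: List[Dict[str, Any]] = []

    def write_documents(self, documents: List[Dict[str, Any]]) -> int:
        # A real store would pick the mapping and ingest pipeline here;
        # this sketch only records which strategy was used for indexing.
        for doc in documents:
            self.indexed.append({**doc, "indexed_for": self.retrieval_strategy.value})
        return len(documents)

    def search(self, query: str, top_k: int = 10) -> List[Dict[str, Any]]:
        # Naive substring match as a placeholder for the real search request.
        hits = [d for d in self.indexed if query.lower() in d["content"].lower()]
        return hits[:top_k]


class SketchRetriever:
    """Base retriever that checks the store's strategy upon initialization."""

    expected_strategies: FrozenSet[RetrievalStrategy] = frozenset()

    def __init__(self, document_store: SketchDocumentStore) -> None:
        if document_store.retrieval_strategy not in self.expected_strategies:
            raise ValueError(
                f"{type(self).__name__} does not support strategy "
                f"{document_store.retrieval_strategy.value!r}"
            )
        self.document_store = document_store

    def run(self, query: str, top_k: int = 10) -> Dict[str, Any]:
        # Retrieval is delegated to the store's search method.
        return {"documents": self.document_store.search(query, top_k=top_k)}


class SketchDenseVectorRetriever(SketchRetriever):
    expected_strategies = frozenset({RetrievalStrategy.DENSE})


class SketchSparseVectorRetriever(SketchRetriever):
    expected_strategies = frozenset({RetrievalStrategy.SPARSE})
```

In this sketch, constructing SketchSparseVectorRetriever with a store configured for RetrievalStrategy.DENSE raises a ValueError right away, so a mismatch surfaces at initialization rather than at query time.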

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.


maxjakob · 2024-04-29T14:04:45Z

@anakin87 @silvanocerza Would be great to get your input here.

silvanocerza · 2024-04-30T10:07:06Z

I don't see why not, to be fair. I'm not against this at all.
Everything you wrote makes total sense in my opinion.

silvanocerza · 2024-04-30T10:08:22Z

Are you going to handle the implementation of this? 👀

anakin87 · 2024-04-30T12:51:42Z

Thank you for your interest!

I would like to provide users with new options (such as inference in Elasticsearch) without significantly breaking existing ones.
This is the current naming convention for Retrievers. We should discuss together what would be the best names for the new retrievers.

maxjakob · 2024-04-30T16:20:20Z

I agree that breaking changes should be avoided. We can attempt to integrate this into the existing document store. If that proves too hard without breakage, we can add a new class (and deprecate the old one). What do you think?

Regarding naming, here are some proposals (I'm completely open to other names):

ElasticsearchBM25Retriever
ElasticsearchDenseEmbeddingRetriever
- This would have a hybrid option. Alternatively, we can add an ElasticsearchHybridRetriever.
ElasticsearchDenseExactEmbeddingRetriever (not convinced we need it but it is more efficient for <10k documents)
ElasticsearchSparseEmbeddingRetriever

maxjakob · 2024-04-30T16:22:19Z

I'm going to work on the LangChain integration. It will become the reference implementation for this kind of integration with the package mentioned above.
It would be fantastic if somebody from the community wanted to give it a shot and integrate this into Haystack. That person would be invited, if they are interested, to write a blog post for Elastic Search Labs to get some exposure for themselves and their Haystack use case and to make a bit of marketing noise.

maxjakob · 2024-05-24T09:34:17Z

The mentioned LangChain reference implementation can be found here:
https://github.com/langchain-ai/langchain-elastic/blob/66cf6f110dbfb2a89a1f92fbaa6488022275e17d/libs/elasticsearch/langchain_elasticsearch/vectorstores.py#L553

maxjakob added the new integration label Apr 29, 2024

davidsbatista added the integration:elasticsearch label May 3, 2024
