Maybe don't bother with vector DB integrations #44

nuric · 2024-07-19T15:20:36Z

Thank you for this great project! I really like how easy it is to get the embeddings 🚀

I saw in your roadmap that you mention integrating with vector databases. I think this repo would be better off without such integrations because they bloat the codebase. There are many vector databases, in many versions, and you might end up writing wrappers around wrappers for client libraries. It's not too difficult to just take the embeddings and call a function to put them in a vector DB.

I'm the author of SemaDB, so I thought I'd give a perspective from the vector database side.

akshayballal95 · 2024-07-22T14:51:01Z

Maintainer

Hey @nuric, I am glad you are finding the project useful.

You are correct that integrating vector databases could bloat the project. But one of the reasons we wanted to do this is to make it easy to stream the embeddings straight into the vector database as they are being generated, so that they don't have to sit in system memory. This could be useful for larger workloads that include several very large files.

I am happy to consider other solutions to not have the embeddings live in memory. Let me know if you have some ideas.

2 replies

nuric Jul 22, 2024
Author

Thank you for your reply. I understand; that concern definitely makes sense. I wonder if you could then let people provide a Python callback that takes the embeddings you produce, along with IDs, metadata, etc., and have EmbedAnything call that function. We would then be free to adapt quickly without you having to worry about each and every vector database client and its parameters in this codebase.

EmbedAnything can decide how often and when to call the callback so that it does not have to hold intermediate embeddings in memory. In the case of SemaDB, it could be as simple as:

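# Sample callback
import requests

# base_url and headers are assumed to be defined elsewhere
# (the SemaDB endpoint and its request headers).

def semadb_store(data):
    """Stores incoming embeddings to SemaDB"""
    points = []
    # Convert embeddings into point dicts for SemaDB
    for i, embedding in enumerate(data.data):
        points.append({"vector": embedding.tolist(), "myfield": i})
    payload = {"points": points}
    # Store
    response = requests.post(base_url + "/collections/mycollection/points", json=payload, headers=headers)
    # response.ok is a property, not a method
    if not response.ok:
        raise RuntimeError(f"SemaDB request failed with status {response.status_code}")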

And you might call it with

embed_anything.embed_directory("directory_path", embeder="Clip", store_callback=semadb_store)

You can give examples for major vector databases if you want to, but without the burden of maintaining wrappers. The function would be so small that anyone can easily implement their own without having to make pull requests. This way you keep control over when and which embeddings are stored.

This is just a suggestion based on my experience of trying to integrate SemaDB into libraries like LangChain, and it might just be simpler.
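
To make the memory argument concrete, here is a minimal sketch of how the library side could hand embeddings to a user-supplied callback in small batches. Everything in it is an assumption made for illustration: `stream_embeddings`, `fake_embedder`, and `batch_size` are hypothetical names, not part of EmbedAnything, and plain Python lists stand in for real embedding objects.

```python
from typing import Callable, Iterable, Iterator, List, Sequence

Vector = List[float]
StoreCallback = Callable[[Sequence[Vector]], None]


def stream_embeddings(
    embeddings: Iterable[Vector],
    store_callback: StoreCallback,
    batch_size: int = 2,
) -> int:
    """Pass embeddings to store_callback in batches of at most batch_size.

    Only the current batch is held in memory. Returns the number of
    embeddings handed to the callback.
    """
    batch: List[Vector] = []
    total = 0
    for vector in embeddings:
        batch.append(vector)
        if len(batch) >= batch_size:
            store_callback(batch)
            total += len(batch)
            batch = []  # start a fresh list so the callback's copy is untouched
    if batch:
        # Flush the final, possibly smaller, batch
        store_callback(batch)
        total += len(batch)
    return total


def fake_embedder(n: int) -> Iterator[Vector]:
    """Hypothetical stand-in for an embedding model: yields n tiny vectors lazily."""
    for i in range(n):
        yield [float(i), float(i) + 0.5]


# Usage: collect the batches instead of sending them to a database
received: List[List[Vector]] = []


def collect(batch: Sequence[Vector]) -> None:
    received.append(list(batch))


count = stream_embeddings(fake_embedder(5), collect, batch_size=2)
```

In real use the `collect` function would be replaced by something like the `semadb_store` callback above, and the batch size would be chosen by the library.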

akshayballal95 Jul 22, 2024
Maintainer

That's neat. This can be done. Thanks for the suggestion.

dhruv-anand-aintech · 2024-07-22T20:11:18Z

It'd be good to integrate with an existing library that implements the wrappers mentioned for each vector DB's API.

I develop and maintain Vector-io, which does this for 10 different vector DBs. The intent at the moment is to use it for data migrations and backups, but I'm open to expanding that scope so that it can serve as a common interface for all vector DBs.

https://github.com/AI-Northstar-Tech/vector-io

0 replies

sonam-pankaj95 · 2024-08-28T10:34:21Z

Maintainer

Hi @nuric,

We have also added vector database adapters, and we would love it if you could contribute a SemaDB adapter here.

https://github.com/StarlightSearch/EmbedAnything/tree/main/examples/adapters

Thanks

4 replies

nuric Aug 28, 2024
Author

I'll try to have a look. SemaDB just takes POST requests, so there isn't a Python client. From a quick scan through your abstract Adapter class, I noticed that SemaDB doesn't have an upsert method; it seems the adapter class was made with Pinecone in mind.

akshayballal95 Aug 28, 2024
Maintainer

It was initially written with Pinecone in mind, but upsert is just the name of the method. You can put any code in it that inserts the embeddings into the database.
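
As a rough illustration of that point, the sketch below shows an `upsert` method whose body is just a plain insert over HTTP. The `Adapter` base class here is a hypothetical stand-in, not EmbedAnything's actual class; `HttpInsertAdapter`, the injected `post` callable, and the example URL are also assumptions. The payload shape follows the sample callback earlier in this thread.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, List, Tuple


class Adapter(ABC):
    """Hypothetical stand-in for an abstract adapter base class."""

    @abstractmethod
    def upsert(self, vectors: List[List[float]]) -> None:
        """Store the given vectors in the backing database."""


class HttpInsertAdapter(Adapter):
    """Adapter for a database that only exposes a POST endpoint."""

    def __init__(self, url: str, post: Callable[..., Any]) -> None:
        self.url = url
        # In real use this could be requests.post; it is injected here
        # so the example runs without a network connection.
        self.post = post

    def upsert(self, vectors: List[List[float]]) -> None:
        # The method is named upsert, but the body is a plain insert.
        payload = {"points": [{"vector": v} for v in vectors]}
        self.post(self.url, json=payload)


# Usage with a fake POST function that records what would be sent
calls: List[Tuple[str, Dict[str, Any]]] = []


def fake_post(url: str, json: Dict[str, Any]) -> None:
    calls.append((url, json))


adapter = HttpInsertAdapter("https://example.invalid/collections/mycollection/points", fake_post)
adapter.upsert([[0.1, 0.2], [0.3, 0.4]])
```

Injecting the POST function keeps the adapter independent of any particular HTTP client, which fits a database that has no Python SDK.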

dhruv-anand-aintech Aug 28, 2024

@nuric please email me at [email protected]

Like I mentioned above, I'm trying to add more DBs to tryvector.io and would love to add yours.

The lack of a Python SDK is easily fixed by using the requests library in Python to wrap your REST API calls.

nuric Aug 28, 2024
Author

The getting started docs already have examples of using the requests library; we added those as the first examples.
