Transfer data from Apify Actors to vector databases (Chroma, Milvus, Pinecone, PostgreSQL (PG-Vector), Qdrant, and Weaviate)

apify/actor-vector-database-integrations

Apify Vector Database Integrations

Vector database integrations (Actors)

Actor        Actor badge
Chroma       Chroma integration
Milvus       Milvus integration
OpenSearch   OpenSearch integration
PGVector     PGVector integration
Pinecone     Pinecone integration
Qdrant       Qdrant integration
Weaviate     Weaviate integration

The Apify Vector Database Integrations transfer data from Apify Actors to a vector database. The process covers data processing, optional splitting into chunks, embedding computation, and data storage.

These integrations support incremental updates, so only changed data is updated. This reduces unnecessary embedding computation and storage operations, which makes the integrations well suited to search and retrieval-augmented generation (RAG) use cases.
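
As a rough illustration of how an incremental update can skip unchanged data, the sketch below compares content checksums of freshly crawled items against checksums that are already stored. The function name `split_by_change`, the dictionary layout, and the use of SHA-256 are assumptions made for this example, not the integrations' actual implementation.

```python
import hashlib


def checksum(text: str) -> str:
    """Return a stable fingerprint of the text content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def split_by_change(crawled: dict[str, str], stored: dict[str, str]) -> tuple[list[str], list[str]]:
    """Compare crawled items (id -> text) with stored checksums (id -> checksum).

    Returns (ids that need new embeddings, ids that can be skipped).
    """
    to_embed, unchanged = [], []
    for item_id, text in crawled.items():
        if stored.get(item_id) == checksum(text):
            unchanged.append(item_id)  # same content: no embedding call needed
        else:
            to_embed.append(item_id)   # new or modified content
    return to_embed, unchanged


# Hypothetical data: one unchanged page, one modified page, one new page.
stored = {"page-1": checksum("Hello"), "page-2": checksum("Old text")}
crawled = {"page-1": "Hello", "page-2": "New text", "page-3": "Brand new"}
to_embed, unchanged = split_by_change(crawled, stored)
print(to_embed)   # ['page-2', 'page-3']
print(unchanged)  # ['page-1']
```

Only the items in `to_embed` would be sent to the embedding provider, which is where the savings come from.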

This repository contains Actors for different vector databases.

How does it work?

  1. Retrieve a dataset as output from an Actor.
  2. [Optional] Split text data into chunks using LangChain.
  3. [Optional] Update only changed data.
  4. Compute embeddings, e.g. using OpenAI or Cohere.
  5. Save data into the database.
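
The steps above can be sketched with plain Python. This is a toy model only: `split_text`, `toy_embed`, `run_pipeline`, and the in-memory `store` are hypothetical stand-ins for the text splitter, the embedding provider, and the vector database, and the optional step 3 is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class StoredChunk:
    text: str
    vector: list[float]


def split_text(text: str, chunk_size: int) -> list[str]:
    """Step 2: cut the text into consecutive windows of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def toy_embed(text: str) -> list[float]:
    """Step 4 placeholder: a real run would call an embedding provider here."""
    return [float(len(text)), float(text.count(" "))]


def run_pipeline(items: list[dict], store: dict[str, StoredChunk], chunk_size: int = 20) -> None:
    for item in items:  # Step 1: items retrieved from the Actor's dataset
        for n, piece in enumerate(split_text(item["text"], chunk_size)):
            # Step 5: save under a key derived from the source URL and chunk index
            store[f"{item['url']}#{n}"] = StoredChunk(piece, toy_embed(piece))


text = "Apify Actors produce datasets of scraped pages."
store: dict[str, StoredChunk] = {}
run_pipeline([{"url": "https://example.com", "text": text}], store)
print(len(store))  # 3 chunks for this 47-character text
```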

Supported Vector Embeddings

How to add a new integration (an example for PG-Vector)?

  1. Add the database to docker-compose.yml for local testing (if the database is available as a Docker image).
version: '3.8'

services:
  pgvector-container:
    image: pgvector/pgvector:pg16
    environment:
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=apify
    ports:
      - "5432:5432"
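
To connect to this local container, the tests later in this guide read a connection string from the `POSTGRESQL_CONNECTION_STR` environment variable. The sketch below assembles one from the compose values. The helper `build_connection_string` is hypothetical, and the `postgresql+psycopg://` scheme is an assumption based on the SQLAlchemy-style URLs that `langchain_postgres` documents; `postgres` is the default user of the PostgreSQL image when `POSTGRES_USER` is not set.

```python
from urllib.parse import quote


def build_connection_string(user: str, password: str, host: str, port: int, database: str) -> str:
    """Assemble an SQLAlchemy-style PostgreSQL URL, percent-encoding user and password."""
    return (
        f"postgresql+psycopg://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


# Values mirror the docker-compose.yml service above.
conn_str = build_connection_string("postgres", "password", "localhost", 5432, "apify")
print(conn_str)  # postgresql+psycopg://postgres:password@localhost:5432/apify
```

Exporting the printed value as `POSTGRESQL_CONNECTION_STR` would then make it available to the test fixture.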
  1. Add the PostgreSQL dependency to pyproject.toml:

    poetry add --group=pgvector "langchain_postgres"

    and mark the group pgvector as optional (in pyproject.toml):

    [tool.poetry.group.pgvector]
    optional = true
  2. Create a new Actor in the actors directory, e.g. actors/pgvector, and add the following files:

    • README.md - the actor documentation
    • .actor/actor.json - the actor definition
    • .actor/input_schema.json - the actor input schema
  3. Create a pydantic model for the Actor input schema. Edit the Makefile to generate the model from the input schema:

    datamodel-codegen --input $(DIRS_WITH_ACTORS)/pgvector/.actor/input_schema.json --output $(DIRS_WITH_CODE)/src/models/pgvector_input_model.py --input-file-type jsonschema --field-constraints

    and then run

    make pydantic-model
  4. Import the created model in src/models/__init__.py:

    from .pgvector_input_model import PgvectorIntegration
  5. Create a new module (pgvector.py) in the vector_stores directory, e.g. vector_stores/pgvector.py, and implement the PGVectorDatabase class with all required methods.
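
The list of required methods is not spelled out here, but the test fixture later in this guide calls `add_documents`, `delete_all`, and sets `unit_test_wait_for_index`. The sketch below shows that surface on a toy in-memory class. `InMemoryDatabase`, the `Document` stand-in, `delete`, and `__len__` are hypothetical and exist only to make the example runnable; a real PGVectorDatabase would delegate to the database client instead.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Minimal stand-in for a LangChain-style document (assumption for this sketch)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class InMemoryDatabase:
    """Toy vector store exposing the methods that the test fixture calls."""

    def __init__(self) -> None:
        self._docs: dict[str, Document] = {}
        # Set by the fixture; assumed here to control waiting for indexing in tests.
        self.unit_test_wait_for_index = 0

    def add_documents(self, documents: list[Document], ids: list[str]) -> list[str]:
        if len(documents) != len(ids):
            raise ValueError("documents and ids must have the same length")
        for doc_id, doc in zip(ids, documents):
            self._docs[doc_id] = doc  # re-using an id overwrites the old document
        return ids

    def delete(self, ids: list[str]) -> None:
        """Hypothetical helper: remove selected documents, ignoring unknown ids."""
        for doc_id in ids:
            self._docs.pop(doc_id, None)

    def delete_all(self) -> None:
        self._docs.clear()

    def __len__(self) -> int:
        return len(self._docs)


db = InMemoryDatabase()
docs = [Document("first page", {"id": "a"}), Document("second page", {"id": "b"})]
db.add_documents(documents=docs, ids=[d.metadata["id"] for d in docs])
db.delete(["a"])
print(len(db))  # 1
```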

  6. Add pgvector to SupportedVectorStores in constants.py

       class SupportedVectorStores(str, enum.Enum):
           pgvector = "pgvector"
  7. Add PGVectorDatabase into entrypoint.py

       if actor_type == SupportedVectorStores.pgvector.value:
           await run_actor(PgvectorIntegration(**actor_input), actor_input)
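
The comparison above works because the enum mixes in `str`. A minimal sketch of that behaviour follows; the `chroma` member and the `resolve_store` helper are assumptions added for illustration and may not match constants.py or entrypoint.py.

```python
import enum


class SupportedVectorStores(str, enum.Enum):
    chroma = "chroma"      # assumed member, shown for contrast
    pgvector = "pgvector"


def resolve_store(actor_type: str) -> SupportedVectorStores:
    """Map a raw actor type string to an enum member, failing loudly on unknown values."""
    try:
        return SupportedVectorStores(actor_type)
    except ValueError:
        supported = ", ".join(s.value for s in SupportedVectorStores)
        raise ValueError(f"Unsupported vector store {actor_type!r}; expected one of: {supported}") from None


print(resolve_store("pgvector") is SupportedVectorStores.pgvector)  # True
print(SupportedVectorStores.pgvector == "pgvector")                  # True, thanks to the str mixin
```

Looking the value up through the enum gives one place where an unsupported actor type is rejected, instead of silently falling through a chain of `if` statements.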
  8. Add PGVectorDatabase and PgvectorIntegration into _types.py

        ActorInputsDb: TypeAlias = ChromaIntegration | PgvectorIntegration | PineconeIntegration | QdrantIntegration
        VectorDb: TypeAlias = ChromaDatabase | PGVectorDatabase | PineconeDatabase | QdrantDatabase
  9. Add PGVectorDatabase into vector_stores/vcs.py

        if isinstance(actor_input, PgvectorIntegration):
            from .vector_stores.pgvector import PGVectorDatabase
    
            return PGVectorDatabase(actor_input, embeddings)
  10. Add PGVectorDatabase fixture into tests/conftest.py

       @pytest.fixture()
       def db_pgvector(crawl_1: list[Document]) -> PGVectorDatabase:
           db = PGVectorDatabase(
               actor_input=PgvectorIntegration(
                   postgresSqlConnectionStr=os.getenv("POSTGRESQL_CONNECTION_STR"),
                   postgresCollectionName=INDEX_NAME,
                   embeddingsProvider=EmbeddingsProvider.OpenAI.value,
                   embeddingsApiKey=os.getenv("OPENAI_API_KEY"),
                   datasetFields=["text"],
               ),
               embeddings=embeddings,
           )
    
           db.unit_test_wait_for_index = 0
    
           db.delete_all()
           # Insert initially crawled objects
           db.add_documents(documents=crawl_1, ids=[d.metadata["id"] for d in crawl_1])
    
           yield db
    
           db.delete_all()
  11. Add the db_pgvector fixture into tests/test_vector_stores.py

       DATABASE_FIXTURES = ["db_pinecone", "db_chroma", "db_qdrant", "db_pgvector"]
  12. Update README.md in the actors/pgvector directory

  13. Add pgvector to the README.md in the root directory

  14. Run tests

    make pytest
  15. Run the actor locally

    export ACTOR_PATH_IN_DOCKER_CONTEXT=actors/pgvector
    apify run -p
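
A local run needs an input. The sketch below writes one to storage/key_value_stores/default/INPUT.json, which is where the Apify CLI conventionally looks for it; treat that path as an assumption and check it against your CLI version. The keys are the ones used in the test fixture above, while the collection name, the `"OpenAI"` provider value, and the `write_local_input` helper are hypothetical.

```python
import json
import os
from pathlib import Path


def write_local_input(actor_input: dict, storage_dir: Path) -> Path:
    """Write the Actor input as INPUT.json in the default local key-value store."""
    input_path = storage_dir / "key_value_stores" / "default" / "INPUT.json"
    input_path.parent.mkdir(parents=True, exist_ok=True)
    input_path.write_text(json.dumps(actor_input, indent=2), encoding="utf-8")
    return input_path


if __name__ == "__main__":
    actor_input = {
        "postgresSqlConnectionStr": os.getenv("POSTGRESQL_CONNECTION_STR", ""),
        "postgresCollectionName": "apify",  # hypothetical collection name
        "embeddingsProvider": "OpenAI",     # assumed enum value
        "embeddingsApiKey": os.getenv("OPENAI_API_KEY", ""),
        "datasetFields": ["text"],
    }
    print(write_local_input(actor_input, Path("storage")))
```

Reading secrets from environment variables keeps API keys and passwords out of the committed input file.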
  16. Set up the Actor on the Apify platform at https://console.apify.com

    Build configuration

    Git URL: https://github.com/apify/store-vector-db
    Branch: master
    Folder: actors/pgvector
    
  17. Test the actor on the Apify platform
