mgvalverde/terraform-serverless-vector-ingestion

Terraform Serverless Vector Ingestion

Transform your documents and ingest them into Qdrant using

Arch Diagram

Quickstart

Prerequisites:

  • awscli installed.
  • terraform installed.
  • docker installed.
  • AWS access with sufficient permissions. If you have a profile configured and want to make it the default, add AWS_PROFILE=... to the .env file.
  • A deployed Qdrant cluster. If you don't have one, create one for free on their website. Keep the URL and API key at hand.
  • A JinaAI API key. If you don't have one, grab one for free from their webpage. Keep the API key at hand.
  1. Create an ECR repo:
make aws/ecr/login && \
make aws/ecr/create TARGET=qdrant-ingestion
  1. Build the Docker image:
make docker/build TARGET=qdrant-ingestion
  1. Push the Docker image to the ECR repo:
make docker/push TARGET=qdrant-ingestion
  1. Create the following secure SSM Parameters:
aws ssm put-parameter \
    --name "/vectorized/qdrant/ecr/image" \
    --value "<ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/qdrant-ingestion:latest" \
    --type "SecureString" \
    --description "ECR image for Qdrant ingestor in lambda"

aws ssm put-parameter \
    --name "/vectorized/qdrant/url" \
    --value "<QDRANT_URL>" \
    --type "SecureString" \
    --description "Qdrant URL"

aws ssm put-parameter \
    --name "/vectorized/qdrant/apikey" \
    --value "<QDRANT_API_KEY>" \
    --type "SecureString" \
    --description "Qdrant API Key"

aws ssm put-parameter \
    --name "/vectorized/jina/apikey" \
    --value "<JINA_API_KEY>" \
    --type "SecureString" \
    --description "Jina AI API Key"
  1. Initialize the terraform stack. Follow one of the two options below.

A. Store TFState locally.
    make tf/init TARGET=app ENV=sandbox

B. Store TFState in S3 and lock it in DynamoDB.

a. Create a bucket in S3 to store the state:

aws s3 mb s3://<your-tf-state-bucket>
b. [Optional] Create a DynamoDB table (on demand) to store the lock:
aws dynamodb create-table \
    --table-name <your-tf-lock-dbtable> \
    --attribute-definitions \
        AttributeName=LockID,AttributeType=S \
    --key-schema \
        AttributeName=LockID,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST
c. Create the file `./environments/sandbox/app/backend.conf` with the following content (omit `dynamodb_table` if you skipped step b):
bucket="<your-tf-state-bucket>"
key="<path/to>/terraform.tfstate"
region="<your-bucket-region>"
dynamodb_table="<your-tf-lock-dbtable>"
d. Uncomment lines 8-15 of `deployments/app/versions.tf`, then initialize:
    make tf/init TARGET=app ENV=sandbox
  1. Create one more secret to store a UUID namespace. More about this in the section below.
aws ssm put-parameter \
    --name "/vector-ingestion/qdrant/namespace" \
    --value "$(uuidgen)" \
    --type "SecureString" \
    --description "Namespace UUID4 to ensure key consistency and avoid duplication in Qdrant"
  1. Create a terraform.tfvars file in environments/sandbox/app/, using template.terraform.tfvars as a reference, and fill it in with your values.

  2. Deploy:

make tf/deploy TARGET=app ENV=sandbox
  1. Test your deployment. Drop a file in your S3 bucket. After a moment, check your Qdrant database: the vectorized chunks of the document should be there.

  2. [Optional] Clean up.

aws s3 rm s3://<documents-bucket>/ --recursive
make tf/destroy TARGET=app ENV=sandbox
aws ecr delete-repository --repository-name qdrant-ingestion --force
aws dynamodb delete-table --table-name <your-tf-lock-dbtable>
aws s3 rm s3://<your-tf-state-bucket>/ --recursive
aws s3 rb s3://<your-tf-state-bucket>
aws ssm delete-parameter --name "/vectorized/qdrant/ecr/image"
aws ssm delete-parameter --name "/vectorized/qdrant/url"
aws ssm delete-parameter --name "/vectorized/qdrant/apikey"
aws ssm delete-parameter --name "/vectorized/jina/apikey"
aws ssm delete-parameter --name "/vector-ingestion/qdrant/namespace"
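
For option B of the initialization step, the contents of backend.conf can also be generated instead of typed by hand. The following is a minimal sketch, not part of the repository: the function name `render_backend_conf` and the example bucket, key, and region values are hypothetical. It assumes Terraform's S3 backend wants a bare bucket name (no `s3://` prefix) and that the lock table line is left out when the optional DynamoDB table was not created.

```python
from typing import Optional


def render_backend_conf(bucket: str, key: str, region: str,
                        lock_table: Optional[str] = None) -> str:
    """Render backend.conf content for Terraform's S3 backend (hypothetical helper)."""
    # The S3 backend expects a bare bucket name, so strip an accidental scheme.
    if bucket.startswith("s3://"):
        bucket = bucket[len("s3://"):]
    lines = [
        f'bucket="{bucket}"',
        f'key="{key}"',
        f'region="{region}"',
    ]
    # The lock table is optional, so only emit it when one was created.
    if lock_table:
        lines.append(f'dynamodb_table="{lock_table}"')
    return "\n".join(lines) + "\n"


# Example values are placeholders; replace them with your own.
print(render_backend_conf("my-tf-state", "sandbox/app/terraform.tfstate", "eu-west-1"))
```

The output can be written to `./environments/sandbox/app/backend.conf` before running `make tf/init TARGET=app ENV=sandbox`.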

Note: Namespace

Occasionally a document may be processed twice. If the chunk ID were generated without any control, for example as a random UUID, the same chunk produced by two different lambdas would get two different IDs and end up duplicated in Qdrant.

To avoid this, we use a namespace.

With a fixed namespace, two identical document chunks always get the same UUID.

E.g.:

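from uuid import uuid4, uuid5

text = "hello world"

# Any UUID works as a namespace; it only has to stay the same across runs.
namespace = uuid4()
# uuid5 is deterministic: same namespace + same text gives the same ID.
id1 = uuid5(namespace, text)
id2 = uuid5(namespace, text)

assert id1 == id2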
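
The example above generates the namespace and both IDs in a single run. What matters in practice is that separate invocations share one stored namespace. The sketch below illustrates that under stated assumptions: the two namespace values are made-up constants (in the deployment the value would come from the namespace SSM parameter), `chunk_id` is a hypothetical helper, and the ingestion is assumed to write points to Qdrant by ID so that an identical ID does not create a second point.

```python
from uuid import UUID, uuid5

# Hypothetical fixed namespaces, hard-coded here only for illustration.
NAMESPACE = UUID("12345678-1234-5678-1234-567812345678")
OTHER_NAMESPACE = UUID("87654321-4321-8765-4321-876543218765")


def chunk_id(namespace: UUID, chunk_text: str) -> str:
    """Derive a deterministic ID from the namespace and the chunk text."""
    return str(uuid5(namespace, chunk_text))


# Two independent invocations that share the namespace agree on the ID.
first_run = chunk_id(NAMESPACE, "hello world")
second_run = chunk_id(NAMESPACE, "hello world")
assert first_run == second_run

# A different namespace or a different text yields a different ID.
assert chunk_id(OTHER_NAMESPACE, "hello world") != first_run
assert chunk_id(NAMESPACE, "hello world!") != first_run
```

Changing the namespace after documents have been ingested would change every ID, so the stored value should be kept stable for the lifetime of the collection.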

TODO:

  • Test clean up logic.
  • Improve docs for lambda logic.
  • Remove layer creation code, which is not currently being used.

About

Process and ingest documents using LlamaIndex, Lambda and Qdrant
