GitHub - mgvalverde/terraform-serverless-vector-ingestion: Process and ingest documents using LlamaIndex, Lambda and Qdrant

Terraform Serverless Vector Ingestion

Transform your documents and ingest them into Qdrant using LlamaIndex and AWS Lambda.

Quickstart

Prerequisites:

awscli installed.
terraform installed.
docker installed.
Access to AWS with the necessary permissions. If you have a profile configured and want it to be the default, add AWS_PROFILE=... to the .env file.
A deployed Qdrant cluster. If you don't have one, create one for free on their website. Keep the URL and API key at hand.
A JinaAI API key. If you don't have one, grab one for free from their webpage. Keep the API key at hand.
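As a quick sanity check before starting, the CLI prerequisites above can be verified with a few lines of Python. This is only a sketch: the executable names (`aws`, `terraform`, `docker`, `make`) are assumptions about how the tools appear on your PATH, and `missing_tools` is a hypothetical helper, not part of this repository.

```python
import shutil

# Executable names assumed for the tools used in the Quickstart.
REQUIRED_TOOLS = ("aws", "terraform", "docker", "make")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```

The check only confirms that the executables exist. It does not validate versions or AWS credentials.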

Create an ECR repo:

make aws/ecr/login && \
make aws/ecr/create TARGET=qdrant-ingestion

Build the Docker image:

make docker/build TARGET=qdrant-ingestion

Push the Docker image to the ECR repo:

make docker/push TARGET=qdrant-ingestion

Create the following secure SSM Parameters:

aws ssm put-parameter \
    --name "/vectorized/qdrant/ecr/image" \
    --value "<ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/qdrant-ingestion:latest" \
    --type "SecureString" \
    --description "ECR image for Qdrant ingestor in lambda"

aws ssm put-parameter \
    --name "/vectorized/qdrant/url" \
    --value "<QDRANT_URL>" \
    --type "SecureString" \
    --description "Qdrant URL"

aws ssm put-parameter \
    --name "/vectorized/qdrant/apikey" \
    --value "<QDRANT_API_KEY>" \
    --type "SecureString" \
    --description "Qdrant API Key"

aws ssm put-parameter \
    --name "/vectorized/jina/apikey" \
    --value "<JINA_API_KEY>" \
    --type "SecureString" \
    --description "Jina AI API Key"
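For illustration, the sketch below shows how code running with the right IAM permissions might read these SecureString parameters back. It assumes an object that behaves like `boto3.client("ssm")` is passed in. `load_parameters` and the dictionary keys are hypothetical names, and the repository's Lambda may load its configuration differently.

```python
# Parameter names taken from the put-parameter commands above.
PARAMETER_NAMES = {
    "image": "/vectorized/qdrant/ecr/image",
    "qdrant_url": "/vectorized/qdrant/url",
    "qdrant_api_key": "/vectorized/qdrant/apikey",
    "jina_api_key": "/vectorized/jina/apikey",
}


def load_parameters(ssm_client, names=PARAMETER_NAMES):
    """Fetch each parameter and return a dict of decrypted values.

    `ssm_client` is expected to behave like boto3.client("ssm").
    """
    values = {}
    for key, name in names.items():
        # WithDecryption=True is needed to get SecureString values in plain text.
        response = ssm_client.get_parameter(Name=name, WithDecryption=True)
        values[key] = response["Parameter"]["Value"]
    return values


# Usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   params = load_parameters(boto3.client("ssm"))
```

Passing the client in as an argument keeps the function easy to exercise with a stub, without touching AWS.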

Initialize the terraform stack. Follow one of the two options below.

A. Store TFState locally.

    make tf/init TARGET=app ENV=sandbox

B. Store TFState in S3 and the lock in DynamoDB.

a. Create a bucket in S3 to store the state:

aws s3 mb s3://<your-tf-state-bucket>

b. [Optional] Create a DynamoDB table (on demand) to store the lock:

aws dynamodb create-table \
    --table-name <your-tf-lock-dbtable> \
    --attribute-definitions \
        AttributeName=LockID,AttributeType=S \
    --key-schema \
        AttributeName=LockID,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST

c. Create the file `./environments/sandbox/app/backend.conf` with the following content:

bucket="<your-tf-state-bucket>"
key="<path/to>/terraform.tfstate"
region="<your-bucket-region>"
dynamodb_table="<your-tf-lock-dbtable>"

d. Uncomment lines 8-15 of `deployments/app/versions.tf` and initialize:

    make tf/init TARGET=app ENV=sandbox

Create one more secret to store a UUID namespace. More about this in the section below.

aws ssm put-parameter \
    --name "/vector-ingestion/qdrant/namespace" \
    --value "$(uuidgen)" \
    --type "SecureString" \
    --description "Namespace UUID4 to ensure key consistency and avoid duplication in Qdrant"

Create a terraform.tfvars file in environments/sandbox/app/, using template.terraform.tfvars as a reference, and fill it in with your values.
Deploy:

make tf/deploy TARGET=app ENV=sandbox

Test your deployment: drop a file in your S3 bucket. After a moment, check your Qdrant database; you should find the vectorized document's chunks there.
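Because ingestion is asynchronous, a scripted check has to poll rather than look once. A minimal polling sketch follows. `wait_for` is a hypothetical helper, and the timeout and interval defaults are arbitrary assumptions rather than values measured for this stack.

```python
import time


def wait_for(check, timeout_s=120.0, interval_s=5.0,
             sleep=time.sleep, clock=time.monotonic):
    """Call `check()` until it returns a truthy value or the timeout expires.

    Returns the first truthy result, or raises TimeoutError.
    """
    deadline = clock() + timeout_s
    while True:
        result = check()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s} seconds")
        sleep(interval_s)


# Usage sketch (assumes qdrant-client is installed; the collection name is a
# placeholder, not one defined in this README):
#   from qdrant_client import QdrantClient
#   client = QdrantClient(url="<QDRANT_URL>", api_key="<QDRANT_API_KEY>")
#   wait_for(lambda: client.count(collection_name="<collection>").count > 0)
```

The `sleep` and `clock` parameters exist only so the loop can be tested without real waiting.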
[Optional] Clean up:

aws s3 rm s3://<documents-bucket>/ --recursive
make tf/destroy TARGET=app ENV=sandbox
aws ecr delete-repository --repository-name qdrant-ingestion --force
aws dynamodb delete-table --table-name <your-tf-lock-dbtable>
aws s3 rm s3://<your-tf-state-bucket>/ --recursive
aws s3 rb s3://<your-tf-state-bucket>
aws ssm delete-parameter --name "/vectorized/qdrant/ecr/image"
aws ssm delete-parameter --name "/vectorized/qdrant/url"
aws ssm delete-parameter --name "/vectorized/qdrant/apikey"
aws ssm delete-parameter --name "/vectorized/jina/apikey"
aws ssm delete-parameter --name "/vector-ingestion/qdrant/namespace"

Note: Namespace

A document may occasionally be processed more than once. If chunk IDs were generated without any control, the same chunk produced by two different Lambda invocations would get two different IDs and end up duplicated in Qdrant.

To avoid this, we use a namespace.

With a fixed namespace, two identical document chunks always get the same UUID.

E.g.:

from uuid import uuid4, uuid5

text = "hello world"

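# The namespace is generated once and then reused for every chunk.
namespace = uuid4()
# uuid5 is deterministic: same namespace + same text -> same ID.
id1 = uuid5(namespace, text)
id2 = uuid5(namespace, text)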

assert id1 == id2
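Extending the idea, the sketch below derives a point ID from both the document key and the chunk text, so that reprocessing the same document overwrites the existing points instead of adding new ones. Including the document key in the hashed name is an assumption made for illustration, as are `chunk_id` and the sample namespace values. The repository's Lambda may build its IDs differently.

```python
from uuid import UUID, uuid5


def chunk_id(namespace: UUID, document_key: str, chunk_text: str) -> str:
    """Derive a stable ID from the namespace, document key and chunk text."""
    # The NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
    return str(uuid5(namespace, f"{document_key}\x00{chunk_text}"))


# Hypothetical namespace; in the deployed stack it would be read from SSM.
namespace = UUID("12345678-1234-5678-1234-567812345678")

first_run = chunk_id(namespace, "docs/report.pdf", "hello world")
second_run = chunk_id(namespace, "docs/report.pdf", "hello world")
assert first_run == second_run  # reprocessing yields the same ID

# A different namespace yields a different ID for the same chunk.
other_namespace = UUID("87654321-4321-8765-4321-876543218765")
assert chunk_id(other_namespace, "docs/report.pdf", "hello world") != first_run
```

This is why the namespace is stored once as a parameter rather than generated on each invocation: every Lambda run must use the same value for the IDs to match.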

TODO:

Test clean up logic.
Improve docs for lambda logic.
Remove layer creation code, which is not currently being used.
