Replies: 1 comment
To integrate Amazon Textract with S3 and EC2 for asynchronous data extraction and to update your RAG application, follow these steps:
Step-by-Step Implementation

Step 1: Set up Amazon Textract for Asynchronous Processing

Use the following code to start an asynchronous Textract job and store the results in an S3 bucket. Textract publishes a completion message to the SNS topic when the job finishes:

    import boto3

    textract = boto3.client('textract')

    response = textract.start_document_analysis(
        DocumentLocation={
            'S3Object': {
                'Bucket': 'your-bucket-name',
                'Name': 'your-document-name.pdf'
            }
        },
        FeatureTypes=['TABLES'],
        NotificationChannel={
            'RoleArn': 'your-iam-role-arn',
            'SNSTopicArn': 'your-sns-topic-arn'
        },
        # Write the raw results (JSON, not CSV) to S3; bucket and prefix are placeholders
        OutputConfig={
            'S3Bucket': 'your-output-bucket-name',
            'S3Prefix': 'textract-output'
        }
    )

    # Keep the job ID; it is needed to fetch the results once the job completes
    job_id = response['JobId']

Step 2: Configure S3 Event Notifications

Go to your S3 bucket, navigate to the "Properties" tab, and add an event notification to trigger a Lambda function on object creation.

Step 3: Create a Lambda Function

Create a Lambda function that will be triggered by the S3 event. This function will use AWS SSM to run a command on your EC2 instance. For the command to be delivered, the instance must be managed by Systems Manager (SSM Agent running, with an instance profile that allows it).

    import json
    import urllib.parse

    import boto3


    def lambda_handler(event, context):
        ssm_client = boto3.client('ssm')

        # Extract bucket name and object key from the event
        # (object keys arrive URL-encoded in S3 event notifications)
        record = event['Records'][0]['s3']
        bucket_name = record['bucket']['name']
        object_key = urllib.parse.unquote_plus(record['object']['key'])

        # Command to run on EC2 instance
        command = (
            f"aws s3 sync s3://{bucket_name} /path/to/local/directory "
            "&& python3 /path/to/your/script.py"
        )

        # Send command to EC2 instance via SSM
        response = ssm_client.send_command(
            InstanceIds=['your-ec2-instance-id'],
            DocumentName='AWS-RunShellScript',
            Parameters={'commands': [command]}
        )

        return {
            'statusCode': 200,
            'body': json.dumps(f'Command sent to EC2 instance for {object_key}')
        }

Step 4: Run a Script on EC2

The script on your EC2 instance will handle the synchronization and updating of the vector store. Here is an example of what the script might look like:

    import os
    from pathlib import Path

    from langchain.retrievers.multi_vector import MultiVectorRetriever
    from langchain.storage import LocalFileStore
    from langchain_community.chat_models import ChatOllama
    from langchain_community.embeddings import OllamaEmbeddings
    from langchain_community.vectorstores import Chroma
    from langchain_core.documents import Document
    from PIL import Image

    # Your existing functions here...

    # Load CSV files
    doc_path = Path("/path/to/local/directory")
    print("Read CSV files")
    csv_files = sorted(doc_path.glob("*.csv"))

    # Process CSV files and update vector store
    for csv_file in csv_files:
        # Your code to process CSV and update vector store
        pass

With this setup, whenever a new file is added to your S3 bucket, the Lambda function triggers a script on your EC2 instance that synchronizes the local directory with the S3 bucket and updates the vector store accordingly [1].
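The Step 4 loop visits every CSV file on each run. If re-indexing unchanged files is a concern, one option is to keep a small manifest of content hashes and only process files that are new or changed. The following is a minimal sketch under that assumption; the helper names (`file_digest`, `find_new_or_changed`, `mark_processed`) and the manifest file are hypothetical, not part of LangChain or the setup above.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def _load_manifest(manifest_path: Path) -> dict:
    """Read the manifest (file name -> digest), or return an empty one."""
    if manifest_path.exists():
        return json.loads(manifest_path.read_text())
    return {}


def find_new_or_changed(csv_dir: Path, manifest_path: Path) -> list[Path]:
    """List CSV files whose current content is not recorded in the manifest."""
    seen = _load_manifest(manifest_path)
    return [
        p for p in sorted(csv_dir.glob("*.csv"))
        if seen.get(p.name) != file_digest(p)
    ]


def mark_processed(files: list[Path], manifest_path: Path) -> None:
    """Record files in the manifest once they have been added to the vector store."""
    seen = _load_manifest(manifest_path)
    seen.update({p.name: file_digest(p) for p in files})
    manifest_path.write_text(json.dumps(seen, indent=2))
```

In the EC2 script, the loop would then iterate over `find_new_or_changed(doc_path, manifest_path)` and call `mark_processed` only after the vector store update succeeds, so a failed run is retried on the next trigger.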
Description
Hello everyone,
I am working on a Retrieval-Augmented Generation (RAG) application for tabular data extraction from PDF documents. Here is my current setup:
Tabular Data Extraction:
Loading and Orchestration:
I now want to enhance my setup by using Amazon Textract's asynchronous capabilities to extract tables and store the extracted CSV files in an S3 bucket. My goal is to run my RAG application on an EC2 instance and sync the local directory with the S3 bucket. This way, whenever new files are added to the S3 bucket, they are also updated in my EC2 local directory. This update should trigger my application to update the vector store, ensuring that I get the latest responses based on new documents.
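Textract returns table results as JSON blocks rather than CSV, so the table-to-CSV step has to be done separately. A rough sketch of that step is below, assuming the job has already succeeded. `tables_to_csv` and `fetch_blocks` are hypothetical helper names; the block fields used (`BlockType`, `Relationships`, `RowIndex`, `ColumnIndex`, `Text`) follow the Textract response format, and merged cells and selection elements are ignored for brevity.

```python
import csv
import io


def tables_to_csv(blocks: list[dict]) -> list[str]:
    """Convert Textract TABLE blocks into CSV strings, one string per table."""
    by_id = {b["Id"]: b for b in blocks}

    def child_ids(block: dict) -> list[str]:
        # Collect the IDs listed under CHILD relationships of a block
        return [
            i
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for i in rel["Ids"]
        ]

    results = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        cells = {}
        for cell_id in child_ids(table):
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                by_id[w]["Text"]
                for w in child_ids(cell)
                if by_id[w]["BlockType"] == "WORD"
            ]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)

        # Textract row and column indexes are 1-based
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        buf = io.StringIO()
        writer = csv.writer(buf)
        for r in range(1, n_rows + 1):
            writer.writerow([cells.get((r, c), "") for c in range(1, n_cols + 1)])
        results.append(buf.getvalue())
    return results


def fetch_blocks(job_id: str) -> list[dict]:
    """Page through get_document_analysis for a job that has already succeeded."""
    import boto3  # imported lazily so tables_to_csv works without AWS access

    textract = boto3.client("textract")
    blocks, token = [], None
    while True:
        kwargs = {"JobId": job_id}
        if token:
            kwargs["NextToken"] = token
        page = textract.get_document_analysis(**kwargs)
        blocks.extend(page["Blocks"])
        token = page.get("NextToken")
        if not token:
            return blocks
```

Each returned string could then be uploaded to the bucket with the S3 client's `put_object`, giving one CSV object per extracted table.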
Specific Requirements:
I am able to refresh the vector store based on the arrival of new documents. However, I need guidance on whether there is any specific implementation available that can synchronize the S3 bucket with the local EC2 directory effectively.
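For context on the synchronization itself: the AWS CLI's `aws s3 sync s3://bucket /local/dir` already copies only new or changed objects. The same idea can be sketched in Python with boto3, as below. This is an illustration under simplifying assumptions, not an established implementation: `plan_downloads` and `sync_bucket` are hypothetical names, and the comparison uses object size only, so a changed file with an identical size would be missed.

```python
from pathlib import Path


def plan_downloads(remote: dict[str, int], local_dir: Path) -> list[str]:
    """Return the keys whose local copy is missing or differs in size."""
    todo = []
    for key, size in sorted(remote.items()):
        target = local_dir / key
        if not target.is_file() or target.stat().st_size != size:
            todo.append(key)
    return todo


def sync_bucket(bucket: str, local_dir: Path, prefix: str = "") -> list[str]:
    """Download new or changed objects from the bucket; return the keys fetched."""
    import boto3  # imported lazily so plan_downloads can be tested without AWS

    s3 = boto3.client("s3")

    # Build a {key: size} listing of the bucket, skipping folder placeholder keys
    remote = {}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if not obj["Key"].endswith("/"):
                remote[obj["Key"]] = obj["Size"]

    todo = plan_downloads(remote, local_dir)
    for key in todo:
        target = local_dir / key
        target.parent.mkdir(parents=True, exist_ok=True)
        s3.download_file(bucket, key, str(target))
    return todo
```

The list returned by `sync_bucket` names exactly the files that arrived in this run, which is a convenient input for the vector store refresh.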
Any suggestions or guidance would be greatly appreciated.