How to Integrate Amazon Textract with S3 and EC2 for Async Data Extraction and RAG Application Update #23698

crman · 2024-07-01T06:04:41Z

Checked other resources

I added a very descriptive title to this question.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.

Commit to Help

I commit to help with one of those options 👆

Description

Hello everyone,

I am working on a Retrieval-Augmented Generation (RAG) application for tabular data extraction from PDF documents. Here is my current setup:

Tabular Data Extraction:

- Using Amazon Textract to extract tabular information from PDF documents.
- Converting the extracted data into CSV files and saving them in a directory.

Loading and Orchestration:

- Loading these CSV files using an orchestration framework such as LangChain.

I now want to enhance my setup by using Amazon Textract's asynchronous capabilities to extract tables and store the extracted CSV files in an S3 bucket. My goal is to run my RAG application on an EC2 instance and sync the local directory with the S3 bucket. This way, whenever new files are added to the S3 bucket, they are also updated in my EC2 local directory. This update should trigger my application to update the vector store, ensuring that I get the latest responses based on new documents.

Specific Requirements:

- Use Amazon Textract's async feature to extract tables and store them in an S3 bucket.
- Sync the S3 bucket with a local directory on an EC2 instance.
- Automatically trigger the application to update the vector store whenever new files are added to the bucket.

I am able to refresh the vector store based on the arrival of new documents. However, I need guidance on whether there is any specific implementation available that can synchronize the S3 bucket with the local EC2 directory effectively.

Any suggestions or guidance would be greatly appreciated.

System Info

System Information

OS: Windows
OS Version: 10.0.22631
Python Version: 3.10.1 (tags/v3.10.1:2cd268a, Dec 6 2021, 19:10:37) [MSC v.1929 64 bit (AMD64)]

Package Information

langchain_core: 0.1.52
langchain: 0.1.20
langchain_community: 0.0.38
langsmith: 0.1.59
langchain_ai21: 0.1.5
langchain_aws: 0.1.6
langchain_chroma: 0.1.1
langchain_google_genai: 1.0.4
langchain_openai: 0.1.7
langchain_text_splitters: 0.0.2
langchainhub: 0.1.15

Packages not installed (Not Necessarily a Problem)

The following packages were not found:

langgraph
langserve

@dosu · 2024-07-01T06:07:25Z

dosubot[bot]
bot Jul 1, 2024

To integrate Amazon Textract with S3 and EC2 for asynchronous data extraction and to update your RAG application, follow these steps:

1. Set up Amazon Textract for Asynchronous Processing:
   - Use Amazon Textract's asynchronous API to extract tables from PDF documents and store the results in an S3 bucket.
2. Configure S3 Event Notifications:
   - Set up S3 event notifications to trigger a Lambda function whenever new files are added to the S3 bucket.
3. Create a Lambda Function:
   - This Lambda function will be triggered by the S3 event and will use AWS Systems Manager (SSM) to run a command on your EC2 instance.
4. Run a Script on EC2:
   - The script on your EC2 instance will handle the synchronization and updating of the vector store.

Step-by-Step Implementation

Step 1: Set up Amazon Textract for Asynchronous Processing

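Use the following code to start an asynchronous Textract job for a PDF that is already stored in S3. The call returns a JobId immediately, and Textract publishes a completion message to the SNS topic when the job finishes. The call does not write CSV files: you fetch the results with get_document_analysis using the JobId, convert the tables to CSV, and upload the CSV files to your S3 bucket as a separate step.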

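import boto3

textract = boto3.client('textract')

# Start an asynchronous analysis job for a PDF that is already in S3
response = textract.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': 'your-bucket-name',
            'Name': 'your-document-name.pdf'
        }
    },
    FeatureTypes=['TABLES'],  # extract tables only
    # Textract publishes a completion message to this SNS topic
    NotificationChannel={
        'RoleArn': 'your-iam-role-arn',
        'SNSTopicArn': 'your-sns-topic-arn'
    }
)

# Keep the job ID; it is needed to fetch the results once the job completes
job_id = response['JobId']
print(f"Started Textract job {job_id}")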
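Once the job has completed, the results still have to be fetched and turned into CSV. The following is a minimal sketch of that step, not a definitive implementation. The helper names `fetch_all_blocks` and `tables_to_csv` are hypothetical, and `textract_client` is assumed to be a `boto3.client('textract')` created by the caller. Merged cells and selection elements are ignored here.

```python
import csv
import io


def fetch_all_blocks(textract_client, job_id):
    """Collect every Block of a finished job, following NextToken pagination."""
    blocks, next_token = [], None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        page = textract_client.get_document_analysis(**kwargs)
        if page.get("JobStatus") != "SUCCEEDED":
            raise RuntimeError(f"Job {job_id} is {page.get('JobStatus')}")
        blocks.extend(page.get("Blocks", []))
        next_token = page.get("NextToken")
        if not next_token:
            return blocks


def _child_ids(block):
    """Return the IDs listed in the block's CHILD relationships."""
    return [
        child_id
        for rel in block.get("Relationships", [])
        if rel["Type"] == "CHILD"
        for child_id in rel["Ids"]
    ]


def tables_to_csv(blocks):
    """Return one CSV string per TABLE block, in the order the tables appear."""
    by_id = {b["Id"]: b for b in blocks}
    csv_texts = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        cells = {}
        for cell_id in _child_ids(table):
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                by_id[w]["Text"]
                for w in _child_ids(cell)
                if by_id[w]["BlockType"] == "WORD"
            ]
            # RowIndex and ColumnIndex are 1-based in Textract responses
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        for r in range(1, n_rows + 1):
            writer.writerow([cells.get((r, c), "") for c in range(1, n_cols + 1)])
        csv_texts.append(buffer.getvalue())
    return csv_texts
```

Each returned string can then be uploaded to the bucket, for example with the S3 client's `put_object`, under whatever key naming scheme suits your application.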

Step 2: Configure S3 Event Notifications

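Go to your S3 bucket, navigate to the "Properties" tab, and add an event notification for object-creation events (s3:ObjectCreated:*) with the Lambda function as the destination. Adding a .csv suffix filter keeps the function from being triggered by other objects, such as the source PDFs if they live in the same bucket.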
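The same notification can also be set up from code instead of the console. The sketch below assumes an S3 client created by the caller with `boto3.client('s3')`. The function names and the optional prefix are hypothetical. Note that `put_bucket_notification_configuration` replaces the bucket's entire notification configuration, and the Lambda function must already allow S3 to invoke it.

```python
def build_csv_notification(lambda_arn, prefix=""):
    """Build a notification config that fires on newly created .csv objects."""
    rules = [{"Name": "suffix", "Value": ".csv"}]
    if prefix:
        # Optionally restrict the trigger to one "folder" of the bucket
        rules.append({"Name": "prefix", "Value": prefix})
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {"Key": {"FilterRules": rules}},
            }
        ]
    }


def attach_notification(s3_client, bucket, lambda_arn, prefix=""):
    """Apply the config to the bucket; this overwrites any existing config."""
    s3_client.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=build_csv_notification(lambda_arn, prefix),
    )
```

Keeping the configuration builder separate from the API call makes it easy to inspect or log the configuration before applying it.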

Step 3: Create a Lambda Function

Create a Lambda function that will be triggered by the S3 event. This function will use AWS SSM to run a command on your EC2 instance.

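import json
from urllib.parse import unquote_plus

import boto3

# Create the client once so it is reused across warm invocations
ssm_client = boto3.client('ssm')

def lambda_handler(event, context):
    # Extract bucket name and object key from the first record of the event.
    # S3 URL-encodes object keys in event notifications, so decode the key.
    record = event['Records'][0]
    bucket_name = record['s3']['bucket']['name']
    object_key = unquote_plus(record['s3']['object']['key'])
    print(f"New object: s3://{bucket_name}/{object_key}")

    # Command to run on EC2 instance: sync the bucket, then refresh the vector store
    command = f"aws s3 sync s3://{bucket_name} /path/to/local/directory && python3 /path/to/your/script.py"

    # Send command to EC2 instance via SSM
    response = ssm_client.send_command(
        InstanceIds=['your-ec2-instance-id'],
        DocumentName='AWS-RunShellScript',
        Parameters={'commands': [command]}
    )

    return {
        'statusCode': 200,
        'body': json.dumps({
            'message': 'Command sent to EC2 instance',
            'commandId': response['Command']['CommandId']
        })
    }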
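A single S3 event can carry more than one record, and object keys arrive URL-encoded (a space becomes `+`, for example). As a rough sketch of how the handler could read every record rather than only the first, the hypothetical helper below returns the decoded bucket and key of each newly created CSV object.

```python
from urllib.parse import unquote_plus


def csv_objects_from_event(event):
    """Return (bucket, key) pairs for every created .csv object in an S3 event."""
    objects = []
    for record in event.get("Records", []):
        # Skip anything that is not an object-creation event
        if "ObjectCreated" not in record.get("eventName", ""):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Keys in S3 event notifications are URL-encoded
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.lower().endswith(".csv"):
            objects.append((bucket, key))
    return objects
```

If the helper returns an empty list, the handler can return early without sending an SSM command.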

Step 4: Run a Script on EC2

The script on your EC2 instance will handle the synchronization and updating of the vector store. Here is an example of what the script might look like:

import os
from pathlib import Path
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import LocalFileStore
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from PIL import Image

# Your existing functions here...

# Load CSV files
doc_path = Path("/path/to/local/directory")
print("Read CSV files")
csv_files = sorted(doc_path.glob("*.csv"))

# Process CSV files and update vector store
for csv_file in csv_files:
    # Your code to process CSV and update vector store
    pass
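Because `aws s3 sync` leaves previously downloaded files in place, the script will see old and new CSV files on every run. One possible way to process only the new or changed ones is to keep a small manifest of what has already been indexed. This is a sketch under that assumption: the function names and the JSON manifest file are hypothetical and not part of LangChain.

```python
import json
from pathlib import Path


def _load_manifest(manifest_path):
    """Read the manifest, or return an empty one if it does not exist yet."""
    manifest_path = Path(manifest_path)
    if manifest_path.exists():
        return json.loads(manifest_path.read_text())
    return {}


def find_unprocessed(csv_dir, manifest_path):
    """Return CSV paths whose size or modification time is not in the manifest."""
    seen = _load_manifest(manifest_path)
    pending = []
    for path in sorted(Path(csv_dir).glob("*.csv")):
        stat = path.stat()
        if seen.get(path.name) != [stat.st_size, stat.st_mtime_ns]:
            pending.append(path)
    return pending


def mark_processed(paths, manifest_path):
    """Record the given files as indexed so later runs skip them."""
    seen = _load_manifest(manifest_path)
    for path in paths:
        stat = Path(path).stat()
        seen[Path(path).name] = [stat.st_size, stat.st_mtime_ns]
    Path(manifest_path).write_text(json.dumps(seen, indent=2))
```

In the loop above, you would iterate over `find_unprocessed(...)` instead of all CSV files and call `mark_processed(...)` only after the vector store update succeeds, so a failed run is retried on the next trigger.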

This setup ensures that whenever a new file is added to your S3 bucket, the Lambda function will trigger a script on your EC2 instance to synchronize the local directory with the S3 bucket and update the vector store accordingly ^[1].
