RAG with multimodal (image and text paired) #23707
Hey there, @WithFoxSquirrel! I'm here to help you with any bugs, questions, or contributions you may have. If you need assistance, feel free to ask! To ensure that images and texts are paired correctly in the vector store when using the RAG pipeline in LangChain, you can follow a structured approach: extract and chunk the text, tables, and images from the document, summarize each modality, index the summaries in a vector store, and keep the raw content in a docstore linked to its summary by a shared doc_id via a MultiVectorRetriever.
Here is an example code snippet that illustrates the process:

```python
import os
import uuid

from unstructured.partition.pdf import partition_pdf

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

pdf_path = "path/to/your.pdf"        # placeholder: your source PDF
datapath = "path/to/image_output/"   # placeholder: where extracted images are written

# Extract images, tables, and chunked text from the PDF
raw_pdf_elements = partition_pdf(
    filename=pdf_path,
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=datapath,
)

# Categorize the extracted elements by type
tables = []
texts = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        tables.append(str(element))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        texts.append(str(element))

# Summarize tables so that concise summaries can be embedded
prompt_text = """You are an assistant tasked with summarizing tables and text. \
Give a concise summary of the table or text. Table or text chunk: {element} """
prompt = ChatPromptTemplate.from_template(prompt_text)
model = ChatOpenAI(temperature=0, model="gpt-4")
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

# Apply
table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})
text_summaries = texts  # text chunks are indexed as-is


def generate_img_summaries(path):
    """
    Generate summaries for images using LLaVA.
    path: Path to the directory of .jpg files extracted by Unstructured.
    """
    img_summaries = []
    for img_file in sorted(os.listdir(path)):
        if img_file.endswith(".jpg"):
            img_path = os.path.join(path, img_file)
            # Use LLaVA to generate the image summary and capture its output
            output_file = f"{img_path}.txt"
            os.system(f"/path/to/llava -m /path/to/model --image {img_path} > {output_file}")
            with open(output_file, "r") as file:
                img_summaries.append(file.read().strip())
    return img_summaries


# Image summaries
image_summaries = generate_img_summaries("./cj")

# Multi-vector retriever: summaries are embedded in the vector store,
# while the raw content lives in the docstore, linked by a shared doc_id
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
store = InMemoryStore()
id_key = "doc_id"
retriever = MultiVectorRetriever(vectorstore=vectorstore, docstore=store, id_key=id_key)

# Add texts
doc_ids = [str(uuid.uuid4()) for _ in texts]
summary_texts = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(text_summaries)
]
retriever.vectorstore.add_documents(summary_texts)
retriever.docstore.mset(list(zip(doc_ids, texts)))

# Add tables
table_ids = [str(uuid.uuid4()) for _ in tables]
summary_tables = [
    Document(page_content=s, metadata={id_key: table_ids[i]})
    for i, s in enumerate(table_summaries)
]
retriever.vectorstore.add_documents(summary_tables)
retriever.docstore.mset(list(zip(table_ids, tables)))

# Add images
image_ids = [str(uuid.uuid4()) for _ in image_summaries]
summary_images = [
    Document(page_content=s, metadata={id_key: image_ids[i]})
    for i, s in enumerate(image_summaries)
]
retriever.vectorstore.add_documents(summary_images)
retriever.docstore.mset(list(zip(image_ids, image_summaries)))

# RAG pipeline
template = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
model = ChatOpenAI(temperature=0, model="gpt-4")
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)


class Question(BaseModel):
    __root__: str


chain = chain.with_types(input_type=Question)
```

By following these steps, you can ensure that images and texts are paired correctly in the vector store when using the RAG pipeline in LangChain [1][2][3][4][5].
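As a quick usage sketch (the question string below is just a placeholder), the finished chain can be invoked with a plain question; the retriever supplies the relevant text, table, and image summaries as context:

```python
# Minimal usage sketch: replace the placeholder question with your own query
response = chain.invoke("What does the main figure in the document show?")
print(response)
```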
Description
I found the RAG pipeline in the LangChain docs, multi_modal_RAG_chroma.ipynb, which embeds images and text with OpenCLIP and then adds them to the vector store. I am confused about why the relevant image and text are not added as a pair, but end up out of order.
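For comparison with the notebook's approach, one way to keep a relevant image and text chunk associated is to tag both with a shared identifier in their metadata when adding them to Chroma, rather than relying on insertion order. This is only a minimal sketch, assuming you already have aligned lists `image_uris` and `paired_texts` (hypothetical names) where item i of each list describes the same content:

```python
import uuid

from langchain_community.vectorstores import Chroma
from langchain_experimental.open_clip import OpenCLIPEmbeddings

# Hypothetical example data: image_uris[i] and paired_texts[i] describe the same content
image_uris = ["figs/fig1.jpg"]
paired_texts = ["Caption for figure 1"]

vectorstore = Chroma(
    collection_name="mm_rag_paired",
    embedding_function=OpenCLIPEmbeddings(),  # requires open_clip_torch
)

# One shared pair_id per image/text pair
pair_ids = [str(uuid.uuid4()) for _ in image_uris]

vectorstore.add_images(
    uris=image_uris,
    metadatas=[{"pair_id": pid, "modality": "image"} for pid in pair_ids],
)
vectorstore.add_texts(
    texts=paired_texts,
    metadatas=[{"pair_id": pid, "modality": "text"} for pid in pair_ids],
)

# At query time, a retrieved document's pair_id can be used to fetch its
# counterpart, e.g. vectorstore.get(where={"pair_id": doc.metadata["pair_id"]})
```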
System Info
langchain 0.1.9
langchain-chroma 0.1.1
langchain-community 0.0.24
langchain-core 0.1.27
langchain-experimental 0.0.52
langchain-google-genai 0.0.9
langchain-openai 0.0.7
langchain-text-splitters 0.2.1
langchainhub 0.1.2