RAG with multimodal (image and text paired) #23707
Hey there, @WithFoxSquirrel! I'm here to help you with any bugs, questions, or contributions you may have. If you need assistance, feel free to ask! To ensure that images and texts are paired correctly in the vector store when using the RAG pipeline in LangChain, you can follow a structured approach: extract and chunk the text, tables, and images from the document, summarize each modality, index the summaries in a vector store, and keep the raw content in a docstore linked to its summary by a shared doc_id via a MultiVectorRetriever.
Here is an example code snippet that illustrates the process:

```python
import os
import uuid

from unstructured.partition.pdf import partition_pdf

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

pdf_path = "path/to/your.pdf"        # placeholder: your source PDF
datapath = "path/to/image_output/"   # placeholder: where extracted images are written

# Extract images, tables, and chunked text from the PDF
raw_pdf_elements = partition_pdf(
    filename=pdf_path,
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=datapath,
)

# Categorize the extracted elements by type
tables = []
texts = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        tables.append(str(element))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        texts.append(str(element))

# Summarize tables so that concise summaries can be embedded
prompt_text = """You are an assistant tasked with summarizing tables and text. \
Give a concise summary of the table or text. Table or text chunk: {element} """
prompt = ChatPromptTemplate.from_template(prompt_text)
model = ChatOpenAI(temperature=0, model="gpt-4")
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

# Apply
table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})
text_summaries = texts  # text chunks are indexed as-is


def generate_img_summaries(path):
    """
    Generate summaries for images using LLaVA.
    path: Path to the directory of .jpg files extracted by Unstructured.
    """
    img_summaries = []
    for img_file in sorted(os.listdir(path)):
        if img_file.endswith(".jpg"):
            img_path = os.path.join(path, img_file)
            # Use LLaVA to generate the image summary and capture its output
            output_file = f"{img_path}.txt"
            os.system(f"/path/to/llava -m /path/to/model --image {img_path} > {output_file}")
            with open(output_file, "r") as file:
                img_summaries.append(file.read().strip())
    return img_summaries


# Image summaries
image_summaries = generate_img_summaries("./cj")

# Multi-vector retriever: summaries are embedded in the vector store,
# while the raw content lives in the docstore, linked by a shared doc_id
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
store = InMemoryStore()
id_key = "doc_id"
retriever = MultiVectorRetriever(vectorstore=vectorstore, docstore=store, id_key=id_key)

# Add texts
doc_ids = [str(uuid.uuid4()) for _ in texts]
summary_texts = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(text_summaries)
]
retriever.vectorstore.add_documents(summary_texts)
retriever.docstore.mset(list(zip(doc_ids, texts)))

# Add tables
table_ids = [str(uuid.uuid4()) for _ in tables]
summary_tables = [
    Document(page_content=s, metadata={id_key: table_ids[i]})
    for i, s in enumerate(table_summaries)
]
retriever.vectorstore.add_documents(summary_tables)
retriever.docstore.mset(list(zip(table_ids, tables)))

# Add images
image_ids = [str(uuid.uuid4()) for _ in image_summaries]
summary_images = [
    Document(page_content=s, metadata={id_key: image_ids[i]})
    for i, s in enumerate(image_summaries)
]
retriever.vectorstore.add_documents(summary_images)
retriever.docstore.mset(list(zip(image_ids, image_summaries)))

# RAG pipeline
template = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
model = ChatOpenAI(temperature=0, model="gpt-4")
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)


class Question(BaseModel):
    __root__: str


chain = chain.with_types(input_type=Question)
```

By following these steps, you can ensure that images and texts are paired correctly in the vector store when using the RAG pipeline in LangChain [1][2][3][4][5].
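As a quick usage sketch (the question string below is just a placeholder), the finished chain can be invoked with a plain question; the retriever supplies the relevant text, table, and image summaries as context:

```python
# Minimal usage sketch: replace the placeholder question with your own query
response = chain.invoke("What does the main figure in the document show?")
print(response)
```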
Description
I found the RAG pipeline in the LangChain docs, multi_modal_RAG_chroma.ipynb, which embeds images and text with OpenCLIP and then adds them to the vector store. I am confused about why the relevant image and text are not added as a pair, but end up out of order.
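For comparison with the notebook's approach, one way to keep a relevant image and text chunk associated is to tag both with a shared identifier in their metadata when adding them to Chroma, rather than relying on insertion order. This is only a minimal sketch, assuming you already have aligned lists `image_uris` and `paired_texts` (hypothetical names) where item i of each list describes the same content:

```python
import uuid

from langchain_community.vectorstores import Chroma
from langchain_experimental.open_clip import OpenCLIPEmbeddings

# Hypothetical example data: image_uris[i] and paired_texts[i] describe the same content
image_uris = ["figs/fig1.jpg"]
paired_texts = ["Caption for figure 1"]

vectorstore = Chroma(
    collection_name="mm_rag_paired",
    embedding_function=OpenCLIPEmbeddings(),  # requires open_clip_torch
)

# One shared pair_id per image/text pair
pair_ids = [str(uuid.uuid4()) for _ in image_uris]

vectorstore.add_images(
    uris=image_uris,
    metadatas=[{"pair_id": pid, "modality": "image"} for pid in pair_ids],
)
vectorstore.add_texts(
    texts=paired_texts,
    metadatas=[{"pair_id": pid, "modality": "text"} for pid in pair_ids],
)

# At query time, a retrieved document's pair_id can be used to fetch its
# counterpart, e.g. vectorstore.get(where={"pair_id": doc.metadata["pair_id"]})
```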
System Info
langchain 0.1.9
langchain-chroma 0.1.1
langchain-community 0.0.24
langchain-core 0.1.27
langchain-experimental 0.0.52
langchain-google-genai 0.0.9
langchain-openai 0.0.7
langchain-text-splitters 0.2.1
langchainhub 0.1.2