Building a Multimodal RAG That Responds with Text, Images, and Tables from Sources

** Generation (RAG)** has been one of the earliest and most successful applications of Generative AI. Yet, few chatbots return images, tables, and figures from source documents alongside textual answers.

In this post, I explore why it’s difficult to build a reliable, truly multimodal RAG system, especially for complex documents such as research papers and corporate reports — which often include dense text, formulae, tables, and graphs.

Also, here I present an approach for an improved multimodal RAG pipeline that delivers consistent, high-quality multimodal results across these document types.

Dataset and Setup

To illustrate, I built a small multimodal knowledge base using the following documents:

[Fully Fine-tuned CLIP Models are Efficient Few-Shot Learners](https:…

Also, here I present an approach for an improved multimodal RAG pipeline that delivers consistent, high-quality multimodal results across these document types.

Dataset and Setup

To illustrate, I built a small multimodal knowledge base using the following documents:

The language model used is GPT-4o, and for embeddings I used text-embedding-3-small.

The Standard Multimodal RAG Architecture

In theory, a multimodal RAG bot should:

Accept text and image queries.
Return text and image responses.
Retrieve context from both text and image sources.

A typical pipeline looks like this:

Ingestion

Parsing & chunking: Split documents into text segments and extract images.
Image summarization: Use an LLM to generate captions or summaries for each image.
Multi-vector embeddings: Create embeddings for text chunks, image summaries, and optionally for the raw image features (e.g., using CLIP).

2. Indexing

Store embeddings and metadata in a vector database.

3. Retrieval

For a user query, perform similarity search on:
Text embeddings (for textual matches)
Image summary embeddings (for image relevance)

4. Generation

Use a multimodal LLM to synthesize the final response using both retrieved text and images.

The Inherent Assumption

This approach assumes that the caption or summary of an image generated from its content, always contains enough context about the text or themes that appear in the document, for which this image would be an appropriate response.

In real-world documents, this often isn’t true.

Example: Context Loss in Corporate Reports

Take the “Marketing Strategy for Financial Services (#3 in dataset)” report in the dataset. In its Executive Summary, there are two similar-looking tables showing Working Capital requirements — one for primary producers (farmers) and one for processors. They are the following:

Working Capital Table for Primary Producers Working Capital Table for Processors

GPT-4o generates the following for the first table:

“The table outlines various types of working capital financing options for agricultural businesses, including their purposes and availability across different situations”

And the following for the second table:

“The table provides an overview of working capital financing options, detailing their purposes and potential applicability in different scenarios for businesses, particularly exporters and stock purchasers”

Both seem fine individually — but neither captures the context that distinguishes producers from processors.

This means they will be retrieved incorrectly for queries specifically asking about producers or processors only. There are other tables such as CAPEX, Funding opportunities where the same issue can be seen.

For the VectorPainter paper, where Fig 3 in the paper shows the VectorPainter pipeline, GPT-4o generates the caption as “Overview of the proposed framework for stroke-based style extraction and stylized SVG synthesis with stroke-level constraints,” missing the fact that it represents the core theme of the paper, named “VectorPainter” by the authors.

And for the Vision Language similarity distillation loss formula defined in Sec 3.3 of the CLIP finetuning paper, the caption generated is *“Equation representing the Variational Logit Distribution (VLD) loss, defined as the sum of Kullback–Leibler (KL) divergences between predicted and target logit distributions over a batch of inputs.”, *where the context of vision and language correlation is absent.

It is also to be noted that in the research papers, the figures and tables have a author provided caption, however, during the extraction process, this is extracted not as part of the image, but as part of the text. And also the positioning of the caption is sometimes above and at other times below the figure. As for the Marketing Strategy reports, the embedded tables and other images do not even have an attached caption describing the figure.

What the above has illustrated is that the real-world documents do not follow any standard format of text, images, tables and captions, thereby making the process of associating context to the figures difficult.

The New and Improved Multimodal RAG pipeline

To solve this, I made two key changes.

1. Context-Aware Image Summaries

Instead of asking the LLM to summarize the image, I extract the text immediately before and after the figure — up to 200 characters in each direction. This way, the image caption includes:

The author-provided caption (if any)
The surrounding narrative that gives it meaning

Even if the document lacks a formal caption, this provides a contextually accurate summary.

2. Text Response Guided Image Selection at Generation Time

During retrieval, I don’t match the user query directly with image captions. This is because the user query often is too short to provide adequate context for image retrieval (eg; What is … ?) Instead:

First, generate the textual response using the top text chunks retrieved for context.
Then, select the best two images for the text response matched to the image captions

This ensures the final images are chosen in relation to the actual response, not the query alone.

Here is a diagram for the Extraction to Embedding pipeline:

Extraction to Embedding Pipeline

And the pipeline for **Retrieval and Response Generation **is as follows:

Retrieval and Response Generation

Implementation Details

Step 1: Extract Text and Images

Use Adobe PDF Extract API to parse PDFs into:

figures/ and tables/ folders with .png files
A structuredData.json file containing positions, text, and file paths

I found this API to be far more reliable than libraries like PyMuPDF, especially for extracting formulas and diagrams.

Step 2: Create a Text File

Concatenate all textual elements from the JSON to create the raw text corpus:

# Extract text, sorted by Page and vertical order (Bounds[1])
elements = data.get("elements", [])
# Concatenate text
all_text = []
for el in elements:
if "Text" in el:
all_text.append(el["Text"].strip())
final_text = "\n".join(all_text)

Step 3: Build Image Captions: Walk through each element of `structuredData.json`, check if the element filepath ends in `.png` . Load the file from figures and tables folder of the document, then use the LLM to perform a quality check on the image. This is needed as the extraction process will find some illegible, small images, header and footer, company logos etc which need to be excluded from any user responses.

Note that we are not asking the LLM to interpret the images; just comment if it is clear and relevant enough to be included in the database. The prompt for the LLM would be like:

Analyse the given image for quality, clarity, size etc. Is it a good quality image that can be used for further processing ? The images that we consider good quality are tables of facts and figures, scientific images, formulae, everyday objects and scenes etc. Images of poor quality would be any company logo or any image that is illegible, small, faint and in general would not look good in a response to a user query.
Answer with a simple Good or Poor. Do not be verbose

Next we create the image summary. For this, in the `structuredData.json`, we look at the elements behind and ahead of the `.png` element, and collect up to 200 characters in each direction for a total of 400 characters. This forms the image caption or summary. The code snippet is as follows:

# Collect before
j = i - 1
while j >= 0 and len(text_before) < 200:
if "Text" in elements[j] and not ("Table" in elements[j]["Path"] or "Figure" in elements[j]["Path"]):
text_before = elements[j]["Text"].strip() + " " + text_before
j -= 1
text_before = text_before[-200:]
# Collect after
k = i + 1
while k < len(elements) and len(text_after) < 200:
if "Text" in elements[k]:
text_after += " " + elements[k]["Text"].strip()
k += 1
text_after = text_after[:200]

We perform this for each figure and table for every document in our database, and store the image captions as metadata. In my case, I store as a `image_captions.json` file.

This simple change makes a huge difference — the resulting captions include meaningful context. For instance, the captions I get for the two Working Capital tables from the Marketing Strategy report are as follows. Note how the contexts are now clearly differentiated and include farmers and processors.

"caption": "o farmers for their capital expenditure needs as well as for their working capital needs. The table below shows the different products that would be relevant for the small, medium, and large farmers. Working Capital Input Financing For purchase of farm inputs and labour Yes Yes Yes Contracted Crop Loan* For purchase of inputs for farmers contracted by reputable buyers Yes Yes Yes Structured Loan"

"caption": "producers and their buyers b)\t Potential Loan products at the processing level At the processing level, the products that would be relevant to the small scale and the medium_large processors include Working Capital Invoice discounting_ Factoring Financing working capital requirements by use of accounts receivable as collateral for a loan Maybe Yes Warehouse receipt-financing Financing working ca"

Step 4: Chunk Text and Generate Embeddings

The text file of the document is split into chunks of 1000 characters, using ` RecursiveCharacterTextSplitter` from `langchain` and stored. Embeddings created for the text chunks and image captions, normalized and stored as `faiss` indexes

Step 5: Context Retrieval and Response Generation

The user query is matched and the top 5 text chunks are retrieved as context. Then we use these retrieved chunks and user query to get the text response using the LLM.

In the next step, we take the generated text response and find the top 2 closest image matches (based on caption embeddings) to the response. This is different from the traditional way of matching the user query to the image embeddings and provides much better results.

There is one final step. Our image captions were based on 400 characters around the image in the document, and may not form a logical and concise caption for display. Therefore, for the final selected 2 images, we ask the LLM to take the image captions along with the images and create a brief caption ready for display in the final response.

Here is the code for the above logic:

# Retrieve context
result = retrieve_context_with_images_from_chunks(
user_input,
content_chunks_json_path,
faiss_index_path,
top_k=5,
text_only_flag= True
)
text_results = result.get("top_chunks", [])
# Construct prompts
payload_1 = construct_prompt_text_only (user_input, text_results)
# Collect responses (synchronously for tool)
assistant_text, caption_text = "", ""
for chunk in call_gpt_stream(payload_1):
assistant_text += chunk
lst_final_images = retrieve_top_images (assistant_text, caption_faiss_index_path, captions_json_path, top_n=2)
if len(lst_final_images) > 0:
payload = construct_img_caption (lst_final_images)
for chunk in call_gpt_stream(payload):
caption_text += chunk
response = {
"answer": assistant_text + ("\n\n" + caption_text if caption_text else ""),
"images": [x['image_name'] for x in lst_final_images],
}
return response

Test Results

Let’s run the queries mentioned at the beginning of this blog to see if the images retrieved are relevant to the user query. For simplicity, I am printing only the images and their captions displayed and not the text response.

**Query 1: **What are the loan and working capital requirement of the primary producer ?

Figure 1: Overview of working capital financing options for small, medium, and large farmers.

Figure 2: Capital expenditure financing options for medium and large farmers.

Image Result for Query 1

**Query 2: **What are the loan and working capital requirement of the processors ?

Figure 1: Overview of working capital loan products for small-scale and medium-large processors. Figure 2: CAPEX loan products for machinery purchase and business expansion at the processing level.

Image Result for Query 2

Query 3: What is vision language distillation ?

Figure 1: Vision-language similarity distillation loss formula for transferring modal consistency from pre-trained CLIP to fine-tuned models.

Figure 2: Final objective function combining distillation loss, supervised contrastive loss, and vision-language similarity distillation loss with balancing hyperparameters.

Formula Retrieval for Query 3

Query 4: What is VectorPainter pipeline ?

Figure 1: Overview of the stroke style extraction and SVG synthesis process, highlighting stroke vectorization, style-preserving loss, and text-prompt-based generation.

Figure 2: Comparison of various methods for style transfer across raster and vector formats, showcasing the effectiveness of the proposed approach in maintaining stylistic consistency.

Image Retrieval for Query 4

Conclusion

This enhanced pipeline demonstrates how context-aware image summarization and text response based image selection can dramatically improve multimodal retrieval accuracy.

The approach produces rich, multimodal answers that combine text and visuals in a coherent way — essential for research assistants, document intelligence systems, and AI-powered knowledge bots.

Try it out… leave your comments and connect with me at www.linkedin.com/in/partha-sarkar-lets-talk-AI

Resources

1. Fully Fine-tuned CLIP Models are Efficient Few-Shot Learners: Mushui Liu, Bozheng Li, Yunlong Yu Zhejiang University

2. VectorPainter: Advanced Stylized Vector Graphics Synthesis Using Stroke-Style Priors: Juncheng Hu, Ximing Xing, Jing Zhang, Qian Yu† Beihang University

3. Marketing Strategy for Financial Services: Financing Farming & Processing the Cassava, Maize and Plantain Value Chains in Côte d’Ivoire from https://www.ifc.org

Dataset and Setup

Dataset and Setup

The Standard Multimodal RAG Architecture

The Inherent Assumption

The New and Improved Multimodal RAG pipeline

Implementation Details

Test Results

Conclusion

Resources

Similar Posts