Information comes in many shapes and forms. While retrieval-augmented generation (RAG) primarily focuses on plain text, it overlooks vast amounts of data along the way. Most enterprise knowledge resides in complex documents, slides, graphics, and other multimodal sources. Yet, extracting useful information from these formats using optical character recognition (OCR) or other parsing techniques is often low-fidelity, brittle, and expensive.
Vision RAG makes complex documents—including their figures and tables—searchable by using multimodal embeddings, eliminating the need for complex and costly text extraction. This guide explores how Voyage AI’s latest model powers this capability and provides a step-by-step implementation walkthrough.
Vision RAG: Building upon text RAG
Vision RAG is an evolution of traditional RAG built on the same two components: retrieval and generation.
In traditional RAG, unstructured text data is indexed for semantic search. At query time, the system retrieves relevant documents or chunks and appends them to the user’s prompt so the large language model (LLM) can produce more grounded, context-aware answers.
Figure 1. Text RAG with Voyage AI and MongoDB.

Enterprise data, however, is rarely just clean plain text. Critical information often lives in PDFs, slides, diagrams, dashboards, and other visual formats. Today, this is typically handled by parsing tools and OCR services. Those approaches create several problems:
Significant engineering effort to handle many file types, layouts, and edge cases
Accuracy issues across different OCR or parsing setups
High costs when scaled across large document collections
Next-generation multimodal embedding models provide a simpler and more cost-effective alternative. They can ingest not only text but also images or screenshots of complex document layouts, and generate vector representations that capture the meaning and structure of that content.
Vision RAG uses these multimodal embeddings to index entire documents, slides, and images directly, even when they contain interleaved text and images, making that content searchable via vector search without heavy parsing or OCR. At query time, the system retrieves the most relevant visual assets and feeds them, along with the text prompt, into a vision-capable LLM to inform its answer.
Figure 2. Vision RAG with Voyage AI and MongoDB.

As a result, vision RAG gives LLM-based systems native access to rich, multimodal enterprise data while reducing engineering complexity and avoiding the performance and cost pitfalls of traditional text-focused preprocessing pipelines.
Voyage AI’s latest multimodal embedding model
The multimodal embedding model is where the magic happens. Historically, building such a system was challenging due to the modality gap. Early multimodal embedding models, such as contrastive language-image pretraining (CLIP)-based models, processed text and images using separate encoders. Because the outputs were generated independently, results were often biased toward one modality, making retrieval across mixed content unreliable. These models also struggled to handle interleaved text and images, a critical limitation for vision RAG in real-world environments.
Voyage-multimodal-3 adopts an architecture similar to modern vision-capable LLMs. It uses a single encoder for both text and visual inputs, closing the modality gap and producing unified representations. This ensures that textual and visual features are treated consistently and accurately within the same vector space.
Figure 3. CLIP-based architecture vs. voyage-multimodal-3’s architecture.

This architectural shift enables true multimodal retrieval, making vision RAG a viable and efficient solution. For more details, refer to the voyage-multimodal-3 blog announcement.
Implementation of vision RAG
Let’s walk through a simple example of implementing vision RAG. Traditional text-based RAG often struggles with complex documents, such as slide decks, financial reports, or technical papers, where critical information is locked inside charts, diagrams, and figures.
By using Voyage AI’s multimodal embedding models alongside Anthropic’s vision-capable LLMs, we can bridge this gap. We will treat images (or screenshots of document pages) as first-class citizens, retrieving them directly based on their visual and semantic content and passing them to a vision-capable LLM for reasoning.
To demonstrate this, we will build a pipeline that extracts insights from the charts and figures of the GitHub Octoverse 2025 report, which is representative of the type of information typically found in enterprise data.
The Jupyter Notebook for this tutorial is available on GitHub in our GenAI Showcase repository. To follow along, run the notebook in Google Colab (or similar), and refer to this tutorial for explanations of key code blocks.
Step 1: Install necessary libraries
First, we need to set up our Python environment. We will install the voyageai client for generating embeddings and the anthropic client for our generative model.
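A minimal install cell might look like the following sketch; the exact dependency list in the notebook may differ (requests, beautifulsoup4, pillow, and numpy are assumed here to support the scraping and retrieval steps later on).

```python
# Run in a notebook cell (Colab or similar).
# voyageai and anthropic are the API clients; the rest support scraping, image handling, and retrieval.
%pip install -q voyageai anthropic requests beautifulsoup4 pillow numpy
```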
Step 2: Initialize API clients
To interact with the models, you must initialize the client objects with your API keys. You will need a Voyage AI API key (for the voyage-multimodal-3 model) and an Anthropic API key (for claude-sonnet-4.5).
Note: It is best practice to use environment variables or a secret manager rather than hardcoding keys in production.
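A sketch of the client setup, reading keys from environment variables as recommended above (the variable names are illustrative):

```python
import os

import anthropic
import voyageai

# Read keys from environment variables; set VOYAGE_API_KEY and ANTHROPIC_API_KEY beforehand.
voyage_client = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])
anthropic_client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
```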
Step 3: Extract visual content
For this example, we will scrape charts and infographics directly from the GitHub Octoverse blog post. In a production setting, this step might involve converting PDF pages to images or processing a directory of PNGs.
We’ll start by importing the standard utilities we need for web requests, image processing, and math operations.
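For example, the imports for this stage might look like this (the notebook's exact list may differ):

```python
from io import BytesIO
from urllib.parse import urljoin

import numpy as np
import requests
from bs4 import BeautifulSoup
from PIL import Image
```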
Next, we define a helper function extract_image_urls to parse the article’s HTML and grab image links, filtering out small icons or logos.
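Here is one possible implementation of extract_image_urls. The filtering heuristic (skipping SVGs, icons, logos, and avatars) is an assumption for this sketch and may not match the notebook exactly:

```python
def extract_image_urls(page_url: str) -> list[str]:
    """Return absolute image URLs found in <img> tags on the page."""
    response = requests.get(page_url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    urls = []
    for img in soup.find_all("img"):
        src = img.get("src") or ""
        if not src:
            continue
        full_url = urljoin(page_url, src)  # resolve relative paths
        lowered = full_url.lower()
        # Heuristic filter: skip vector icons, logos, and avatars.
        if lowered.endswith(".svg") or any(word in lowered for word in ("icon", "logo", "avatar")):
            continue
        urls.append(full_url)
    return urls
```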
Now, let’s run the extraction on the Octoverse blog post URL.
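For example (the URL below is illustrative; use the exact Octoverse 2025 post URL from the notebook):

```python
# Illustrative URL for the GitHub Octoverse 2025 announcement post.
ARTICLE_URL = "https://github.blog/news-insights/octoverse/octoverse-2025/"

image_urls = extract_image_urls(ARTICLE_URL)
print(f"Found {len(image_urls)} candidate images")
```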
The scraping might return general blog assets. To ensure high relevance, we will filter the list to only include images containing “octoverse-2025” in their URL, which targets the report’s charts.
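For example:

```python
# Keep only the report's charts, identified by "octoverse-2025" in the image URL.
chart_urls = [url for url in image_urls if "octoverse-2025" in url.lower()]
print(f"Kept {len(chart_urls)} chart images")
```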
Step 4: Build the multimodal index
This is the core indexing step. We loop through our filtered URLs, download the images locally, and then pass them to Voyage AI’s voyage-multimodal-3 model. This model converts the visual content into a dense vector embedding.
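A sketch of the indexing loop, using the voyageai client's multimodal_embed method, which accepts a list of inputs where each input is a list of interleaved text strings and PIL images (here, each input is a single image):

```python
# Download each chart as a PIL image.
images = []
for url in chart_urls:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    images.append(Image.open(BytesIO(resp.content)).convert("RGB"))

# Embed all images with voyage-multimodal-3.
result = voyage_client.multimodal_embed(
    inputs=[[img] for img in images],
    model="voyage-multimodal-3",
    input_type="document",
)
image_embeddings = np.array(result.embeddings)
```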
Step 5: Define RAG components
We need three specific capabilities to make our RAG pipeline work:
Image Encoding: Converting images to base64 so they can be sent to the Anthropic API.
Vector Retrieval: Searching our array of embeddings to find the image most semantically similar to the user’s text query.
Generation: Sending the retrieved image and the user’s query to a vision-capable LLM to get a natural language answer.
Let’s define helper functions for each.
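Here is one way these helpers might look. The model identifier "claude-sonnet-4-5" and the cosine-similarity retrieval are assumptions for this sketch; adjust them to match the notebook:

```python
import base64

def encode_image_base64(image: Image.Image) -> str:
    """Encode a PIL image as base64-encoded PNG for the Anthropic messages API."""
    buffer = BytesIO()
    image.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

def retrieve_image(query: str) -> Image.Image:
    """Embed the text query and return the most similar indexed image by cosine similarity."""
    query_embedding = np.array(
        voyage_client.multimodal_embed(
            inputs=[[query]], model="voyage-multimodal-3", input_type="query"
        ).embeddings[0]
    )
    scores = image_embeddings @ query_embedding / (
        np.linalg.norm(image_embeddings, axis=1) * np.linalg.norm(query_embedding)
    )
    return images[int(np.argmax(scores))]

def generate_answer(query: str, image: Image.Image) -> str:
    """Send the retrieved image plus the question to a vision-capable Claude model."""
    message = anthropic_client.messages.create(
        model="claude-sonnet-4-5",  # assumed API identifier for Claude Sonnet 4.5
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": encode_image_base64(image),
                    },
                },
                {"type": "text", "text": query},
            ],
        }],
    )
    return message.content[0].text
```

In this sketch, the index lives in an in-memory NumPy array rather than a database, which is fine for a handful of charts but not for production scale.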
If you want a full end-to-end example using MongoDB, see this tutorial: Building Multimodal AI Applications with MongoDB, Voyage AI, and Gemini.
Step 6: Combine the components into a complete pipeline
We can now wrap these steps into a single entry point, vision_rag. This function accepts a user query, performs the retrieval to find the correct chart, displays it, and then answers the question.
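A minimal version of the entry point might look like this:

```python
from IPython.display import display

def vision_rag(query: str) -> str:
    """Retrieve the most relevant chart, display it, and answer the question from it."""
    image = retrieve_image(query)
    display(image)  # show the retrieved chart in the notebook
    answer = generate_answer(query, image)
    print(answer)
    return answer
```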
Step 7: Run queries
Let’s test our pipeline. We will ask a specific question about developer communities. The system should identify the correct infographic from the report and read the data directly from it.
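For example (the query wording here is illustrative, not the notebook's exact prompt):

```python
vision_rag("What does the Octoverse 2025 report show about the fastest-growing developer communities?")
```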
Now we can try a quantitative question regarding open-source repositories.
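For example:

```python
vision_rag("According to the report, how many open source repositories were created, and how does that compare to previous years?")
```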
And finally, a ranking question about programming languages.
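For example:

```python
vision_rag("Which programming languages rank highest in the report, and in what order?")
```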
Conclusion
Multimodal embeddings, paired with vision-capable LLMs, enable systems to process and reason across diverse data types like text and images simultaneously. Vision RAG specifically unlocks the vast amount of knowledge trapped in visual formats containing interleaved text and images—such as slide decks, photos, PDF reports, and technical diagrams—that traditional text-only pipelines struggle with.
In this tutorial, we implemented a minimalist vision RAG pipeline from scratch. We extracted rich visual data from the GitHub Octoverse report, generated multimodal embeddings using Voyage AI, and used Anthropic’s model to answer complex questions grounded in visual evidence.
As you move forward, consider how this architecture can be applied to your own proprietary datasets, from financial reports to internal diagrams. When you are ready to take this into production, explore using a robust database like MongoDB to handle scale. For more examples and advanced techniques, check out the Voyage AI documentation to continue building the next generation of multimodal applications.
If you enjoyed reading this tutorial, you can explore more such content on our AI Learning Hub. If you want to go straight to code, we have several more examples of how to build RAG, agentic applications, evals, etc., in our Gen AI Showcase GitHub repository. As always, if you have further questions as you build your AI applications, please reach out to us in our generative AI community forums.