13 min read · Nov 16, 2025
Imagine a system that doesn’t just retrieve information but truly understands it — reading tables, interpreting visuals, extracting meaning from dense corporate reports, and reasoning across them like an analyst with superhuman speed. What if your RAG pipeline could not only process PDFs and financial statements but also connect patterns hidden in structured data, charts, and text, all in one flow? That’s the story we’re about to unfold.
This isn’t just another RAG tutorial. It’s a deep dive into how we can push the boundaries of retrieval-augmented systems with a new, composable stack — one that merges vision, language, and reasoning seamlessly. Welcome to the world of vLACQ Stack, where the question isn’t “Can we do RAG?” but “How far can we take it?”
[Image created by author M K Pavan Kumar]
What is vLACQ?
vLACQ stands for vLLM + LlamaIndex + Agno + Chonkie + Qdrant, a tightly integrated end-to-end RAG (Retrieval-Augmented Generation) stack built to handle multimodal, structured, and complex document intelligence at scale.
It’s designed to streamline the entire RAG lifecycle — from data ingestion to semantic retrieval to contextual inference — with a focus on efficiency, modularity, and real-world deployability.
The Architecture:
The architecture shown represents the vLACQ Stack, a tightly integrated RAG (Retrieval-Augmented Generation) ecosystem designed for multimodal and structured document understanding, including highly complex tables like those found in financial 10-K reports. It is engineered to run efficiently on a single node with multiple GPUs, enabling both vision and text inferencing in a unified workflow.
At the core of the system lies vLLM, deployed on the multi-GPU node using the Qwen3-VL model, which serves as the primary inference engine. This component performs multimodal processing — extracting structured and contextual insights from documents such as PDFs that contain both textual and tabular data. The extracted information is converted into markdown files, which act as the intermediate representation for further processing.
These markdown representations are then passed into the Chonkie ingestion pipeline, which consists of three key stages: Chef, Chunk, and Handshake. The Chef stage performs document pre-processing and cleaning; Chunk breaks the data into semantically coherent pieces optimized for retrieval; and Handshake ensures the processed chunks are indexed into Qdrant, the semantic vector search engine. This setup transforms complex document structures into rich embeddings, allowing for fine-grained and contextually accurate search and retrieval.
On the retrieval side, the AI Layer combines Agno, which orchestrates the RAG Agent, and LlamaIndex, which functions as the retriever framework. When a user submits a query, the RAG Agent retrieves relevant chunks from Qdrant via LlamaIndex, merges them with contextual understanding, and passes them to vLLM for inference. The response — contextualized, accurate, and explainable — is then returned to the user.
This closed-loop pipeline enables RAG to operate effectively on challenging enterprise data, including complex tables, hybrid textual-visual formats, and multimodal knowledge documents. The stack can run fully locally or be deployed on RunPod, and it blends vision-language modeling, semantic retrieval, and agentic reasoning — a true step forward in scalable, enterprise-grade document intelligence.
The Role of vLLM:
vLLM is an open-source, high-performance inference engine specifically designed for serving large language models in production environments. It’s built to address the computational and memory challenges that arise when deploying LLMs at scale, particularly in enterprise settings where reliability, efficiency, and control are paramount. Unlike basic inference frameworks or closed API services, vLLM implements advanced optimization techniques like PagedAttention, which dramatically improves memory management by treating attention computation similarly to how operating systems manage virtual memory through paging.
The primary reason to use vLLM centers on its ability to maximize throughput and minimize latency while making efficient use of expensive GPU resources. In critical-data projects, you’re typically dealing with high concurrency demands, large context windows from lengthy documents or complex tables, and the need to serve multiple users simultaneously without degradation in response quality. vLLM’s architecture enables you to achieve 2–4x or higher throughput compared to standard inference engines, meaning you can handle significantly more requests with the same hardware investment. This translates directly to cost savings and improved user experience, especially when dealing with real-time or near-real-time applications like interactive agents or analytical dashboards.
Beyond performance, vLLM provides the control and security posture that enterprise and regulated environments demand. When working with sensitive financial reports, regulatory documents, or proprietary operational data, relying on external API services introduces compliance risks and potential data leakage concerns. By deploying vLLM within your own infrastructure, whether on-premises or in a secure cloud enclave, you maintain complete control over data flow, can implement comprehensive logging and auditing, and ensure that sensitive information never leaves your security boundary. This self-hosted approach is often a non-negotiable requirement in industries like finance, healthcare, or government where data sovereignty and regulatory compliance are critical.
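Because vLLM exposes an OpenAI-compatible API, you are not tied to hand-rolled HTTP calls (the client later in this post uses raw requests). The snippet below is a minimal sketch, not part of the original project, showing the same endpoint reached through the standard openai SDK; it assumes VLLM_API_URL points at your vLLM server, as it does throughout this post.

```python
# Sketch: talk to a self-hosted vLLM server through the OpenAI-compatible API.
import os
from openai import OpenAI

client = OpenAI(
    base_url=f"{os.environ['VLLM_API_URL']}/v1",
    api_key="EMPTY",  # vLLM's OpenAI-compatible server accepts a placeholder key by default
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize the risk factors section of a 10-K in two sentences."}],
)
print(response.choices[0].message.content)
```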
The Ingestion Pipeline:
Chonkie is an open-source ingestion framework built to streamline the pipeline side of retrieval-augmented generation (RAG). Its purpose is to take raw, messy documents — anything from PDFs with tables to markdown files, images, and hybrid formats — and transform them into “chunks” optimized for embedding and indexing into a vector-search store. The idea is to give you a clean, modular workflow: raw data → structured pieces → semantically searchable knowledge base. This makes your downstream RAG engine far more efficient, accurate, and maintainable.
Core Components of Chonkie
Chonkie’s architecture is defined by a few key modules. Each has a specific role in the ingestion pipeline and, stitched together, they form the entire flow from raw data to indexed knowledge.
**Chefs**: These components handle the pre-processing of raw documents. They might clean up noisy text, normalize formats, extract structure (for example, pulling tables out of markup or images out of PDFs), and convert everything into a standard internal representation. The goal is to ensure that the content you feed onward is semantically richer and easier to chunk.
**Chunkers**: Once you have structured, cleaned content, the Chunker divides it into manageable units — “chunks” — that will later be embedded and indexed. The chunking strategy matters: you can chunk by a fixed token size, by sentence boundaries, by semantic coherence, or even by table rows. Choosing the right chunk boundaries directly impacts how well your retrieval and RAG agent will later perform.
**Refineries** (optional but important): After chunking, you can refine the chunks — for example by generating overlapping context so no information is lost at a boundary, generating embeddings, deduplicating, or trimming extraneous content. This refining step improves chunk quality and makes downstream retrieval more robust.
**Handshakes**: This is the ingestion into the vector store. Once your chunks are ready and ideally embedded, the Handshake module writes them into a vector database (such as Qdrant, or any other supported store). It abstracts connecting to the store, managing the schema, and indexing the chunks so that your retrieval layer can fetch them efficiently later.
**Pipeline**: Chonkie provides a high-level Pipeline interface where you connect all the above steps in order: fetch the documents → apply a Chef → apply a Chunker → (optionally) apply a Refinery → Handshake into your vector store. This abstraction keeps your ingestion code clean, reproducible, and modular, and helps you maintain and scale your ingestion workflows.
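To make the flow concrete, here is a minimal sketch that chains all five stages with Chonkie's Pipeline API. It reuses the same string handles ("file", "text", "semantic", "overlap", "qdrant") that the ingestion code later in this post relies on, but combines the directory-fetch entry point with the Qdrant handshake for illustration; the ./documents directory and the environment variables are assumptions, so treat it as a sketch rather than a drop-in.

```python
# Illustrative only: raw markdown files -> Chef -> Chunker -> Refinery -> Handshake (Qdrant).
import os
from chonkie import Pipeline, AutoEmbeddings

embeddings = AutoEmbeddings.get_embeddings("BAAI/bge-base-en-v1.5")

(Pipeline()
 .fetch_from("file", dir="./documents", ext=[".md"])  # pull raw markdown files
 .process_with("text")                                # Chef: clean and normalize
 .chunk_with("semantic", threshold=0.8)               # Chunker: semantically coherent pieces
 .refine_with("overlap", context_size=100)            # Refinery: overlapping context at boundaries
 .store_in("qdrant",                                  # Handshake: embed and index into Qdrant
           collection_name=os.environ.get("COLLECTION_NAME"),
           url=os.environ.get("QDRANT_URL"),
           api_key=os.environ.get("QDRANT_API_KEY"),
           embedding_model=embeddings)
 .run())
```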
The Implementation:
Let’s look at the project scaffolding first.
```
.
├── LICENSE
├── README.md
├── core
│   ├── __init__.py
│   ├── image_inferer.py
│   ├── image_to_base64_converter.py
│   ├── ingestion_pipe.py
│   ├── knowledge_to_image_converter.py
│   ├── output_images
│   │   ├── 0000773840-25-000105_page_5.png
│   │   ├── 0000773840-25-000105_page_6.png
│   │   ├── 0000773840-25-000105_page_7.png
│   │   └── 0000773840-25-000105_page_8.png
│   └── rag_agent.py
├── data
│   └── 0000773840-25-000105.pdf
├── requirements.txt
├── serve_model.md
└── system_requirements.md
```
The image_inferer.py file contains the VLLMVisionClient class which serves as the primary interface to a remotely hosted vision language model running on vLLM infrastructure. The chat_with_image_url method handles processing images from publicly accessible URLs by constructing a payload with the text prompt and image URL, then sending it to the vLLM API endpoint. The chat_with_local_image method processes local image files by first converting them to base64-encoded data URIs before sending them to the API endpoint, allowing the vision model to analyze visual content from your filesystem. The extract_response_text method parses the API’s JSON response structure to pull out the actual generated text, and all methods include timeout protection and exception handling for robustness.
```python
import os
import requests
from core.image_to_base64_converter import image_to_base64
from core.ingestion_pipe import ingest_data_to_store
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())


class VLLMVisionClient:
    """Client for calling vLLM Vision API endpoint"""

    def __init__(self, base_url="https://s7z2ms3wud6hm6-8000.proxy.runpod.net"):
        self.base_url = base_url.rstrip('/')
        self.endpoint = f"{self.base_url}/v1/chat/completions"

    def chat_with_image_url(self, text_prompt, image_url, model="Qwen/Qwen3-VL-8B-Instruct"):
        """
        Send a chat request with an image URL

        Args:
            text_prompt (str): Text prompt/question about the image
            image_url (str): URL of the image (http/https)
            model (str): Model name

        Returns:
            dict: API response
        """
        payload = {
            "model": model,
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": text_prompt
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": image_url
                            }
                        }
                    ]
                }
            ]
        }

        try:
            response = requests.post(
                self.endpoint,
                json=payload,
                headers={"Content-Type": "application/json"},
                timeout=60
            )
            response.raise_for_status()
            return response.json()
        except requests.exceptions.RequestException as e:
            print(f"Error calling API: {e}")
            return None

    def chat_with_local_image(self, text_prompt, image_path, model="Qwen/Qwen3-VL-8B-Instruct"):
        """
        Send a chat request with a local image file

        Args:
            text_prompt (str): Text prompt/question about the image
            image_path (str): Path to local image file
            model (str): Model name

        Returns:
            dict: API response
        """
        # Convert image to base64
        image_data_uri = image_to_base64(image_path)

        payload = {
            "model": model,
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": text_prompt
                        },
                        {
                            "type": "image_url",
                            "image_url": {"url": image_data_uri}
                        }
                    ]
                }
            ]
        }

        print(self.endpoint)
        # print(payload)

        try:
            response = requests.post(
                self.endpoint,
                json=payload,
                headers={"Content-Type": "application/json"},
                timeout=60
            )
            response.raise_for_status()
            return response.json()
        except requests.exceptions.RequestException as e:
            print(f"Error calling API: {e}")
            return None

    def extract_response_text(self, response):
        """Extract the text content from API response"""
        if response and "choices" in response:
            return response["choices"][0]["message"]["content"]
        return None


# Example usage
if __name__ == "__main__":
    # Initialize client
    client = VLLMVisionClient(base_url=os.environ.get("VLLM_API_URL"))

    # Example 1: Query with image URL
    # print("Example 1: Using image URL")
    #
    # response = client.chat_with_image_url(
    #     text_prompt="Extract all the information from the image in paragraph manner. No markdown or No markup or no bullet points.",
    #     image_url='your image URL here!'
    # )
    #
    # if response:
    #     text_response = client.extract_response_text(response)
    #     print("Response:", text_response)
    #
    #     # calling ingestion pipeline
    #     ingest_data_to_store(text_response)
    #
    # print("\n" + "=" * 50 + "\n")

    # Example 2: Query with local image
    print("Example 2: Using local image")

    response = client.chat_with_local_image(
        text_prompt="Extract all the information from the image in paragraph manner. No markdown or No markup or no bullet points.",
        image_path='output_images/0000773840-25-000105_page_8.png'
    )

    if response:
        text_response = client.extract_response_text(response)
        print("Response:", text_response)

        # calling ingestion pipeline
        ingest_data_to_store(text_response)

    print("\n" + "=" * 50 + "\n")
```
The image_to_base64_converter.py file provides the image_to_base64 function which transforms local image files into data URI format suitable for API transmission. It reads the image file in binary mode, encodes it to base64, and automatically detects the appropriate MIME type based on file extension, supporting common formats like PNG, JPEG, GIF, and WebP. This conversion is essential because the vision API expects images either as publicly accessible URLs or as embedded base64 data, and for local files the base64 approach ensures the images remain within your secure infrastructure without needing external hosting.
```python
import base64
from pathlib import Path


def image_to_base64(image_path) -> str:
    """Convert local image to base64 data URI"""
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode('utf-8')

    # Determine MIME type from the file extension
    extension = Path(image_path).suffix.lower()
    mime_types = {
        '.png': 'image/png',
        '.jpg': 'image/jpeg',
        '.jpeg': 'image/jpeg',
        '.gif': 'image/gif',
        '.webp': 'image/webp'
    }
    mime_type = mime_types.get(extension, 'image/jpeg')

    return f"data:{mime_type};base64,{encoded}"
```
The ingestion_pipe.py file uses the Chonkie framework to process extracted text through two functions. The ingest_data_to_store_with_fetch function processes markdown and text files directly from a directory, chunking them recursively with a chunk size of 512. The ingest_data_to_store function handles text strings passed to it programmatically: it chunks the text semantically with a similarity threshold of 0.8, so the system breaks text into meaningful segments rather than arbitrary character counts. The pipeline then applies the refine_with step to add 100 characters of overlapping context between chunks to preserve continuity, and finally the store_in step generates embeddings using the BAAI/bge-base-en-v1.5 model and writes everything into the specified Qdrant collection.
```python
from chonkie import Pipeline, AutoEmbeddings
from dotenv import load_dotenv, find_dotenv
import os

load_dotenv(find_dotenv())

# Get the embeddings handler for SentenceTransformer
embeddings = AutoEmbeddings.get_embeddings("BAAI/bge-base-en-v1.5")


def ingest_data_to_store_with_fetch():
    # Process all markdown files in a directory
    (Pipeline()
     .fetch_from("file", dir="./documents", ext=[".md", ".txt"])
     .process_with("text")
     .chunk_with("recursive", chunk_size=512)
     .run())
    print(f"Ingested documents")


def ingest_data_to_store(text: str):
    print(f"Indexing in Qdrant store in collection: {os.environ.get('COLLECTION_NAME')}")
    (Pipeline()
     .process_with("text")
     .chunk_with("semantic", threshold=0.8)
     .refine_with("overlap", context_size=100)
     .store_in("qdrant",
               collection_name=os.environ.get('COLLECTION_NAME'),
               url=os.environ.get("QDRANT_URL"),
               api_key=os.environ.get("QDRANT_API_KEY"),
               embedding_model=embeddings)
     .run(texts=text))
    print(f"Ingested documents")
```
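After an ingestion run, it is worth confirming that the chunks actually landed in Qdrant. The following is a small sanity-check sketch using the qdrant-client library (already a dependency of this stack); the environment variables are the same ones used above, and the snippet is illustrative rather than part of the project.

```python
# Sanity check: count the points stored in the target collection after ingestion.
import os
from qdrant_client import QdrantClient

client = QdrantClient(url=os.environ.get("QDRANT_URL"), api_key=os.environ.get("QDRANT_API_KEY"))
collection = os.environ.get("COLLECTION_NAME")

if client.collection_exists(collection_name=collection):
    count = client.count(collection_name=collection).count
    print(f"'{collection}' holds {count} chunks")
else:
    print(f"Collection '{collection}' does not exist yet")
```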
The knowledge_to_image_converter.py file contains the pdf_to_images function which uses the pdf2image library to transform PDF documents into high-resolution PNG images at 300 DPI by default. You can convert either specific pages by passing the page_number parameter or convert entire documents by leaving it as None, with each page becoming a separate image file organized in an output folder. This conversion is critical because the vision language model cannot directly read PDF files but excels at understanding visual layouts, tables, charts, and formatted text when presented as images, making it ideal for complex financial documents where layout and visual structure carry semantic meaning.
```python
import os
from pdf2image import convert_from_path
from pathlib import Path


def pdf_to_images(pdf_path, output_folder="pdf_images", page_number=None, dpi=300):
    """
    Convert PDF pages to images.

    Args:
        pdf_path (str): Path to the PDF file
        output_folder (str): Folder to save the images (default: 'pdf_images')
        page_number (int, optional): Specific page number to convert (1-indexed).
                                     If None, converts all pages
        dpi (int): Resolution of output images (default: 300)

    Returns:
        list: Paths of saved image files
    """
    # Create output folder if it doesn't exist
    Path(output_folder).mkdir(parents=True, exist_ok=True)

    # Get PDF filename without extension
    pdf_name = Path(pdf_path).stem

    saved_files = []

    try:
        if page_number is not None:
            # Convert specific page (pdf2image uses 1-indexed pages)
            print(f"Converting page {page_number}...")
            images = convert_from_path(
                pdf_path,
                dpi=dpi,
                first_page=page_number,
                last_page=page_number
            )

            # Save the image
            output_path = os.path.join(output_folder, f"{pdf_name}_page_{page_number}.png")
            images[0].save(output_path, "PNG")
            saved_files.append(output_path)
            print(f"Saved: {output_path}")
        else:
            # Convert all pages
            print("Converting all pages...")
            images = convert_from_path(pdf_path, dpi=dpi)

            # Save all images
            for i, image in enumerate(images, start=1):
                output_path = os.path.join(output_folder, f"{pdf_name}_page_{i}.png")
                image.save(output_path, "PNG")
                saved_files.append(output_path)
                print(f"Saved: {output_path}")

        print(f"\nTotal images saved: {len(saved_files)}")
        return saved_files

    except Exception as e:
        print(f"Error converting PDF: {e}")
        return []


# Example usage
if __name__ == "__main__":
    # Example 1: Convert page 8 of the 10-K filing
    pdf_to_images("../data/0000773840-25-000105.pdf", output_folder="output_images", page_number=8)

    # Example 2: Convert only page 3
    # pdf_to_images("sample.pdf", output_folder="output_images", page_number=3)

    # Example 3: Convert with custom DPI
    # pdf_to_images("sample.pdf", output_folder="output_images", dpi=150)
```
The rag_agent.py file orchestrates the entire question-answering workflow through several components. The retrieve_iso_knowledge_base function connects to the Qdrant vector store and creates a LlamaIndex retriever configured to fetch the top 15 most relevant chunks for any query, wrapping it in an Agno Knowledge object. The create_vllm_agent function instantiates an Agno Agent with **VLLM** as the backend model, configured with a strict system prompt that prevents hallucination by instructing it to answer only from provided context and format responses in bullets. The main execution flow searches the knowledge base using kb.search, concatenates the retrieved document contents into a context string, combines it with the user’s query, and passes everything to the agent’s print_response method which generates the final answer. The architecture supports multiple LLM backends including Gemini, GPT-4, Ollama, and Claude through simple model swapping, giving you flexibility in choosing between cloud APIs and self-hosted models depending on your security and cost requirements.
```python
import os

from agno.knowledge.knowledge import Knowledge
from agno.vectordb.llamaindex import LlamaIndexVectorDb
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import VectorStoreIndex, StorageContext, Settings
# from llama_index.embeddings.google_genai import GoogleGenAIEmbedding
# from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from qdrant_client import QdrantClient
# from agno.utils.pprint import pprint_run_response
from agno.agent import Agent
from agno.models.google import Gemini
from agno.models.anthropic import Claude
from agno.models.openai import OpenAIChat
from agno.models.ollama import Ollama
from agno.models.vllm import VLLM
from agno.tools.googlesearch import GoogleSearchTools
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

# Settings.embed_model = GoogleGenAIEmbedding(model_name="gemini-embedding-001",
#                                             embedding_config=EmbedContentConfig(output_dimensionality=768))
Settings.embed_model = FastEmbedEmbedding(model_name="BAAI/bge-base-en-v1.5")

qdrant_connector = QdrantClient(url="http://localhost:6333", api_key="th3s3cr3tk3y")


def retrieve_iso_knowledge_base():
    if qdrant_connector.collection_exists(collection_name=os.environ.get("COLLECTION_NAME")):
        # collection name should match the collection name used while ingesting
        vector_store = QdrantVectorStore(client=qdrant_connector,
                                         collection_name=os.environ.get("COLLECTION_NAME"))
        storage_context = StorageContext.from_defaults(vector_store=vector_store)
        index = VectorStoreIndex.from_vector_store(vector_store=vector_store,
                                                   storage_context=storage_context)
        # fetch the top 15 most relevant chunks per query
        retriever = index.as_retriever(similarity_top_k=15)
        knowledge = Knowledge(
            vector_db=LlamaIndexVectorDb(knowledge_retriever=retriever)
        )
        return knowledge
    else:
        # handle the fallback if Qdrant is unavailable
        pass


kb = retrieve_iso_knowledge_base()

### Agent uses vLLM for the response
SYSTEM_PROMPT = (
    "You are an expert Financial Data Analyst "
    "specialized in SEC 10-K reports. Your primary role is to provide accurate and detailed "
    "answers to questions *about* the SEC 10-K standards and to summarize details "
    "or the overall content of the *provided SEC 10-K context*. "
    "IMPORTANT: Do not use external knowledge, previous knowledge or make up information. Only use the context provided to you. "
    "If you don't find the answer, politely say you don't know the answer. "
    "MOST IMPORTANT: Always provide the information in bullets"
)

# gemini_llm = Gemini(id="gemini-2.5-flash", temperature=0.7, api_key=os.environ.get("GEMINI_API_KEY"))
# claude_llm = Claude(id="claude-sonnet-4-20250514", temperature=0.7, api_key=os.environ.get("ANTHROPIC_API_KEY"))
# gpt_llm = OpenAIChat(id="gpt-4o", temperature=0.7, api_key=os.environ.get("OPENAI_API_KEY"))
# llm = Ollama(id='gemma3:12b')
vllm = VLLM(id="Qwen/Qwen3-VL-8B-Instruct", base_url=f"{os.environ.get('VLLM_API_URL')}/v1/")


def create_vllm_agent():
    # Instantiate the Agno Agent backed by the vLLM-served model
    _agent = Agent(
        model=vllm,
        # tools=[GoogleSearchTools(fixed_max_results=5)],
        # search_knowledge=True,
        debug_mode=True,
        system_message=SYSTEM_PROMPT,
        instructions="Always give the response in bullets or tabular format."
    )
    return _agent


agent = create_vllm_agent()

query = "What are Total current assets in September 30, 2025?"
# query = "What are the total Asbestos-related liabilities?"

context = ''
documents = kb.search(query=query, max_results=10)
for document in documents:
    context += document.content

input_and_context = f'Query:{query}\n\nContext:{context}'
agent.print_response(input=input_and_context)
```
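Putting the pieces together, the end-to-end flow is short: render a 10-K page to an image, extract it with the vision model, ingest the extraction into Qdrant, and then ask the agent a question. The following is a condensed sketch, assuming the core modules above are importable, the environment variables are set, and you run it from the repository root (the paths are adjusted accordingly).

```python
# End-to-end sketch: PDF page -> image -> vision extraction -> Qdrant -> agent answer.
import os
from core.knowledge_to_image_converter import pdf_to_images
from core.image_inferer import VLLMVisionClient
from core.ingestion_pipe import ingest_data_to_store

# 1. Render page 8 of the filing as a 300-DPI PNG
pages = pdf_to_images("data/0000773840-25-000105.pdf", output_folder="core/output_images", page_number=8)

# 2. Ask Qwen3-VL (served by vLLM) to transcribe the page as plain paragraphs
client = VLLMVisionClient(base_url=os.environ.get("VLLM_API_URL"))
response = client.chat_with_local_image(
    text_prompt="Extract all the information from the image in paragraph manner.",
    image_path=pages[0],
)
extracted_text = client.extract_response_text(response)

# 3. Chunk, embed, and index the extraction into Qdrant via Chonkie
ingest_data_to_store(extracted_text)

# 4. Query the knowledge base through the Agno agent (see rag_agent.py above)
```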
The Result:
[Result screenshots by author M K Pavan Kumar]
Note: For this walkthrough I used Honeywell’s SEC 10-K filing.
The Conclusion:
The vLACQ Stack represents a decisive leap forward in how modern Retrieval-Augmented Generation systems are designed, deployed, and scaled. By bringing together the strengths of vLLM, LlamaIndex, Agno, Chonkie, and Qdrant, it creates a unified, production-grade ecosystem capable of handling the complexity of multimodal, structured, and enterprise-scale data. From vision-enabled inference to semantically rich retrieval and intelligent agentic reasoning, every layer of this stack is engineered for efficiency, modularity, and performance. Whether the challenge is decoding complex financial tables, ingesting hybrid documents, or powering contextual enterprise assistants, vLACQ demonstrates how the future of RAG lies in composability and precision — not just power. This architecture doesn’t just answer questions; it understands the data behind them.