MMCTAgent
Overview
MMCTAgent is a state-of-the-art multi-modal AI framework that brings human-like critical thinking to visual reasoning tasks. It combines advanced planning, self-critique, and tool-based reasoning to deliver superior performance in complex image and video understanding applications.
Why MMCTAgent?
- Self-Reflection Framework: MMCTAgent emulates human critical thinking by iteratively analyzing multi-modal information, decomposing complex queries, planning strategies, and dynamically evolving its reasoning. Designed as a research framework, it integrates critical-thinking elements such as verification of final answers and self-reflection through a novel approach that defines a vision-based critic and identifies task-specific evaluation criteria, thereby enhancing its decision-making abilities.
- Querying over Multimodal Collections: Its modular design lets you plug in the right audio and visual extraction and processing tools, combined with multimodal LLMs, to ingest and query large collections of video and image data.
- Easy Integration: The modular design allows for easy integration into existing workflows and the addition of domain-specific tools, facilitating adoption across domains that require advanced visual reasoning capabilities.
Key Features
Critical Thinking Architecture
MMCTAgent is inspired by human cognitive processes and integrates a structured reasoning loop:
Planner: Generates an initial response using relevant tools for visual or multi-modal input.
Critic: Evaluates the Plannerβs response and provides feedback to improve accuracy and decision-making.
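To make the loop concrete, here is a minimal, self-contained sketch of a Planner–Critic refinement cycle. The function names, the `Feedback` type, and the `max_rounds` cap are illustrative assumptions, not MMCTAgent's internal API:

```python
from dataclasses import dataclass

@dataclass
class Feedback:
    approved: bool      # did the answer pass verification?
    notes: str = ""     # critique to feed back into planning

def planner(query, feedback=None):
    # Stand-in for MMCTAgent's Planner, which calls vision tools and
    # multimodal LLMs; here it just echoes the query and any critique.
    suffix = f" (revised per: {feedback.notes})" if feedback else ""
    return f"initial answer to {query!r}{suffix}"

def critic(query, answer):
    # Stand-in for the vision-based Critic, which checks the answer
    # against task-specific evaluation criteria.
    return Feedback(approved=True)

def answer_with_critique(query, max_rounds=3):
    answer = planner(query)
    for _ in range(max_rounds):
        fb = critic(query, answer)
        if fb.approved:                       # verification passed; stop refining
            break
        answer = planner(query, feedback=fb)  # refine using the critique
    return answer

print(answer_with_critique("What objects are on the table?"))
```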
Modular Agents
MMCTAgent includes two specialized agents:
ImageAgent
A reasoning engine tailored for static image understanding.
It supports a configurable set of tools via the ImageQnaTools enum:
- `object_detection` – Detects objects in an image.
- `ocr` – Extracts embedded text content.
- `recog` – Recognizes scenes, faces, or objects.
- `vit` – Applies a vision LLM for high-level visual reasoning.

The Critic can be toggled via the `use_critic_agent` flag.
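For instance, a single-tool configuration might look like the following sketch, which reuses the constructor arguments from the Quick Start example later in this README; the exact defaults are assumptions to verify against the code:

```python
from mmct.image_pipeline import ImageAgent, ImageQnaTools
import asyncio

# Hypothetical OCR-only configuration with the Critic disabled;
# arguments mirror the Quick Start example below.
agent = ImageAgent(
    query="What does the sign say?",
    image_path="path/to/sign.jpg",
    tools=[ImageQnaTools.ocr],   # restrict the Planner to OCR
    use_critic_agent=False,      # skip the Critic pass
    stream=False,
)
print(asyncio.run(agent()).response)
```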
VideoAgent
Optimized for deep video understanding:
Video Question Answering
Applies a fixed toolchain orchestrated by the Planner:
- `GET_VIDEO_ANALYSIS` – Retrieves the most relevant video for the query, along with its summary and detected objects.
- `GET_CONTEXT` – Extracts transcript, visual summary chunks, and object collection info relevant to the query.
- `GET_RELEVANT_FRAMES` – Provides semantically similar keyframes related to the query, based on CLIP embeddings.
- `QUERY_FRAME` – Queries specific video keyframes to extract detailed information and provide additional visual context to the Planner.
The Critic agent helps validate and refine answers, improving reasoning depth.
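The sketch below illustrates how these four tools could be chained in order. The stub functions and their return values are illustrative assumptions, not the framework's internal orchestration:

```python
# Stubs standing in for the Planner's fixed video-QnA toolchain
# (tool names from the list above; signatures are assumptions).
def get_video_analysis(query):
    return {"video_id": "demo", "summary": "a street scene", "objects": ["car"]}

def get_context(query, video):
    return "transcript chunks and visual summaries relevant to the query"

def get_relevant_frames(query, video):
    return ["frame_0042.jpg"]  # CLIP-similar keyframes

def query_frame(query, frame):
    return f"fine-grained details extracted from {frame}"

def plan_video_answer(query):
    video = get_video_analysis(query)                   # 1. pick the video
    context = get_context(query, video)                 # 2. gather multimodal context
    frames = get_relevant_frames(query, video)          # 3. retrieve keyframes
    details = [query_frame(query, f) for f in frames]   # 4. inspect frames
    return {"context": context, "frame_details": details}

print(plan_video_answer("What color is the car?"))
```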
For more details, refer to the full research article:
MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning. Published on arXiv – arxiv.org/abs/2405.18358
Table of Contents
- Getting Started
- Provider System
- Configuration
- Project Structure
- Contributing
- Citation
- License
- Support
Getting Started
Installation
Clone the Repository
git clone https://github.com/microsoft/MMCTAgent.git
cd MMCTAgent
System Dependencies
Install FFmpeg
Linux/Ubuntu:
sudo apt-get update
sudo apt-get install ffmpeg libsm6 libxext6 -y
Windows:
- Download FFmpeg from ffmpeg.org
- Add the `bin` folder to your system PATH
Python Environment Setup
Option A: Using Conda (Recommended)
conda create -n mmct-agent python=3.11
conda activate mmct-agent
Option B: Using venv
python -m venv mmct-agent
# Linux/Mac
source mmct-agent/bin/activate
# Windows
mmct-agent\Scripts\activate.bat
Install Dependencies
pip install --upgrade pip
pip install -r requirements.txt
Quick Start Examples
Image Analysis with MMCTAgent
from mmct.image_pipeline import ImageAgent, ImageQnaTools
import asyncio
# Initialize the Image Agent with desired tools
image_agent = ImageAgent(
query="What objects are visible in this image and what text can you read?",
image_path="path/to/your/image.jpg",
tools=[ImageQnaTools.object_detection, ImageQnaTools.ocr, ImageQnaTools.vit],
use_critic_agent=True, # Enable critical thinking
stream=False
)
# Run the analysis
response = asyncio.run(image_agent())
print(f"Analysis Result: {response.response}")
Video Analysis with VideoAgent
Ingest a video through MMCT Video Ingestion Pipeline.
from mmct.video_pipeline import IngestionPipeline, Languages, TranscriptionServices
import asyncio
ingestion = IngestionPipeline(
video_path="path-of-your-video",
index_name="index-name",
transcription_service=TranscriptionServices.WHISPER,  # or TranscriptionServices.AZURE_STT
language=Languages.ENGLISH_INDIA,
)
# Run the ingestion pipeline
asyncio.run(ingestion.run())
Perform Q&A through MMCT's Video Agent.
from mmct.video_pipeline import VideoAgent
import asyncio
# Configure the Video Agent
video_agent = VideoAgent(
query="input-query",
index_name="your-index-name",
video_id=None, # Optional: specify video ID
url=None, # Optional: filter search results to a given video URL
use_critic_agent=True, # Enable critic agent
stream=False, # Stream response
cache=False # Optional: enable caching
)
# Execute video analysis
response = asyncio.run(video_agent())
print(f"Video Analysis: {response}")
For more comprehensive examples, see the examples/ directory.
Provider System
Multi-Cloud & Vendor-Agnostic Architecture
MMCTAgent now features a modular provider system that allows you to seamlessly switch between different cloud providers and AI services without changing your application code. This makes the framework truly vendor-agnostic and suitable for various deployment scenarios.
Supported Providers
| Service Type | Supported Providers | Use Cases |
|---|---|---|
| LLM | Azure OpenAI, OpenAI | Text generation, chat completion |
| Search | Azure AI Search, FAISS | Document search and retrieval |
| Transcription | Azure Speech Services, OpenAI Whisper | Audio-to-text conversion |
| Storage | Azure Blob Storage, Local Storage | File storage and management |
For detailed configuration instructions, see our Provider Configuration Guide.
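As a quick illustration, providers can in principle be selected through environment variables before any MMCT objects are created. The variable names below come from the Azure-first example in the Configuration section; setting them programmatically (rather than via `.env`) is an assumption to verify:

```python
import os

# Provider selection via environment variables (names taken from the
# Azure-first example below; the documented path is a .env file, so
# programmatic setting is an assumption).
os.environ["LLM_PROVIDER"] = "azure"             # or "openai"
os.environ["SEARCH_PROVIDER"] = "local_faiss"    # or "azure_ai_search"
os.environ["STORAGE_PROVIDER"] = "local"         # or "azure"

# Import after configuration so the provider system picks up the values.
from mmct.video_pipeline import VideoAgent
```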
Configuration
System Requirements for CLIP embeddings (openai/clip-vit-base-patch32)
Minimum (development / small-scale):
- CPU: 4-core modern i5/i7, ~8 GB RAM
- Disk: ~500 MB for the cached model plus image/text data
- GPU: none (works but slow)
Recommended (for decent speed / batching):
- CPU: 8+ cores, 16 GB RAM
- GPU: NVIDIA with ≥ 4–6 GB VRAM (e.g., RTX 2060/3060)
- PyTorch + CUDA installed, with mixed precision support
High-throughput (fast, large batches):
- 16+ cores CPU, 32+ GB RAM
- GPU: 8-16 GB+ VRAM, fast memory bandwidth (e.g. RTX 3090, A100)
- Use float16 / bfloat16, efficient batching, parallel preprocessing
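To ground these numbers, here is a sketch of loading this checkpoint with Hugging Face transformers in float16 on GPU, the setup the recommended tier assumes. MMCTAgent may load CLIP differently internally; this only illustrates the resource trade-offs, and the image path is hypothetical:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32  # fp16 needs a GPU

# The checkpoint is downloaded and cached locally on first use
# (~500 MB, per the requirements above).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32", torch_dtype=dtype).to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("path/to/keyframe.jpg")  # hypothetical keyframe path
inputs = processor(text=["a dog", "a car"], images=image,
                   return_tensors="pt", padding=True).to(device)
inputs["pixel_values"] = inputs["pixel_values"].to(dtype)  # match model precision

with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity scores
print(logits.softmax(dim=-1))
```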
Environment Setup
MMCTAgent uses a flexible configuration system that supports multiple cloud providers. Choose your configuration method:
Quick Start
Rename `.env.example` to `.env` and fill in the required values.
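If you run the examples as plain scripts, one common way to load the `.env` file is python-dotenv. Whether MMCTAgent loads `.env` automatically is an assumption to check, so treat this as a fallback sketch:

```python
from dotenv import load_dotenv

# Load variables from .env in the current working directory so the
# provider system can read them (only needed if nothing else loads .env).
load_dotenv()
```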
Provider Configuration Examples
Azure-First Setup:
# LLM Configuration
LLM_PROVIDER=azure
LLM_ENDPOINT=https://your-resource.openai.azure.com/
LLM_DEPLOYMENT_NAME=gpt-4o
LLM_MODEL_NAME=gpt-4o
LLM_USE_MANAGED_IDENTITY=true
# Search Configuration
SEARCH_PROVIDER=azure_ai_search # use `local_faiss` to use a FAISS index as the vector DB
SEARCH_ENDPOINT=https://your-search.search.windows.net
SEARCH_USE_MANAGED_IDENTITY=true
SEARCH_INDEX_NAME=your-index-name
# Storage Configuration
STORAGE_PROVIDER=azure # use `local` to store items to local storage
STORAGE_ACCOUNT_NAME=your-storage-account
STORAGE_USE_MANAGED_IDENTITY=true
For comprehensive configuration options, see our Provider Configuration Guide.
Project Structure
Below is the project structure, highlighting the key entry-point scripts for running the three main pipelines: Image QNA, Video Ingestion, and Video Agent.
MMCTAgent
│
├── infra
│   └── INFRA_DEPLOYMENT_GUIDE.md        # Guide for deploying the Azure infrastructure
├── app                                  # FastAPI application over the MMCT pipelines
├── mcp_server
│   ├── main.py                          # Run main.py to start the MCP server
│   ├── client.py                        # Client to connect to the MCP server
│   ├── notebooks/                       # Examples using the MCP server from different agentic frameworks
│   └── README.md                        # Guide for the MCP server
├── mmct
│   ├── .
│   ├── image_pipeline
│   │   ├── agents
│   │   │   └── image_agent.py           # Entry point for the MMCT Image Agentic Workflow
│   │   └── README.md                    # Guide for the Image Pipeline
│   └── video_pipeline
│       ├── agents
│       │   └── video_agent.py           # Entry point for the MMCT Video Agentic Workflow
│       ├── core
│       │   └── ingestion
│       │       └── ingestion_pipeline.py  # Entry point for the Video Ingestion Workflow
│       └── README.md                    # Guide for the Video Pipeline
├── requirements.txt
└── README.md
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repositories using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Note: This project is currently under active research and continuous development. While contributions are encouraged, please note that the codebase may evolve as the project matures.
Citation
If you find MMCTAgent useful in your research, please cite our paper:
@inproceedings{kumar2024mmctagent,
  title={MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning},
  author={Kumar, Somnath and Gadhia, Yash and Ganu, Tanuja and Nambi, Akshay},
  booktitle={NeurIPS OWA-2024},
  year={2024},
  url={https://www.microsoft.com/en-us/research/publication/mmctagent-multi-modal-critical-thinking-agent-framework-for-complex-visual-reasoning}
}
License
This project is licensed under the MIT License - see the LICENSE file for details.