# MoM (Mixture of Models) Service

*Transform multiple AI perspectives into superior answers through intelligent synthesis*

MoM Service is an OpenAI-compatible API that orchestrates multiple AI models at once. Instead of relying on a single model's perspective, it queries several LLMs in parallel and synthesizes their responses into a single, superior answer using a dedicated "concluding" model.

Think of it as assembling an expert panel: you get the creativity of GPT-5, the reasoning of Claude Sonnet 4.5, and the versatility of Gemini 2.5 Pro, all combined into one comprehensive response that is more reliable and nuanced than any individual model could produce.
## Why a Mixture of Models?

In today's AI landscape with hundreds of specialized LLMs, relying on a single model is limiting. A Mixture of Models (MoM) approach delivers compelling advantages:

Each AI model brings its own unique perspective and reasoning style. MoM synthesizes these diverse viewpoints into a more comprehensive answer.
| Benefit | Description |
|---|---|
| Superior Quality | Synthesize multiple perspectives to mitigate individual model weaknesses (hallucinations, biases, knowledge gaps) |
| Enhanced Reliability | If one LLM fails or underperforms, others compensate to maintain high-quality output |
| Cost Optimization | Route queries strategically: use cost-effective models where appropriate, premium ones when needed |
| Maximum Flexibility | Hot-swap models via configuration without code changes. Create specialized "meta-models" for different tasks |
### Real-World Use Cases

- **Content Creation**: Combine creative and factual models for balanced, engaging content
- **Code Generation**: Merge multiple coding assistants for more robust solutions
- **Research & Analysis**: Get comprehensive answers by consulting multiple AI "experts"
- **Educational Applications**: Provide students with well-rounded explanations from diverse perspectives
## How It Works
MoM Service uses an elegant fan-out, fan-in architecture for parallel processing and intelligent synthesis:
```mermaid
graph TD
    A[Client Request via OpenAI-Compatible API] --> B{MoM Service - FastAPI};
    B --> C[Fan-Out to Multiple LLMs];
    subgraph "Parallel LLM Inference"
        C --> D1[GPT-4o];
        C --> D2[Claude 3.5 Sonnet];
        C --> D3[Gemini 1.5 Pro];
        C --> D4[Llama 3.1 405B];
    end
    subgraph "Response Synthesis"
        D1 --> E{Concluding LLM};
        D2 --> E;
        D3 --> E;
        D4 --> E;
    end
    E --> F[Final Response Streamed to User];
    style B fill:#f9f,stroke:#333,stroke-width:2px
    style C fill:#ccf,stroke:#333,stroke-width:2px
    style E fill:#cfc,stroke:#333,stroke-width:2px
```
### Processing Flow

1. **Request In**: Client makes a request to the OpenAI-compatible endpoint (`/v1/chat/completions`)
2. **Fan-Out**: The service identifies the MoM configuration and forwards the request to all configured LLMs
3. **Concurrent Processing**: All LLMs process the request simultaneously (non-blocking)
4. **Synthesize**: Responses are collected and passed to the "Concluding LLM"
5. **Stream Response**: The final synthesized answer is streamed back to the client in real time
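The snippet below is a minimal sketch of this fan-out/fan-in pattern using `asyncio.gather`. It is illustrative only: `query_model` and `synthesize` are hypothetical stand-ins for the service's LiteLLM-backed calls, not its actual implementation.

```python
import asyncio

# Hypothetical stand-ins for illustration; the real service calls providers via LiteLLM.
async def query_model(model: str, prompt: str) -> str:
    await asyncio.sleep(0)  # placeholder for the network call to the provider
    return f"[{model}] answer to: {prompt}"

async def synthesize(concluding_model: str, answers: list[str]) -> str:
    joined = "\n---\n".join(answers)
    # In the real service this is another LLM call driven by the concluding prompt.
    return f"[{concluding_model}] synthesis of:\n{joined}"

async def mom_completion(prompt: str) -> str:
    models = ["gpt4", "claude", "gemini"]            # fan-out targets
    answers = await asyncio.gather(*(query_model(m, prompt) for m in models))
    return await synthesize("gpt4", list(answers))   # fan-in to the concluding LLM

if __name__ == "__main__":
    print(asyncio.run(mom_completion("Explain quantum computing in simple terms")))
```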
## Features

- **OpenAI-Compatible API**: Drop-in replacement with `/v1/chat/completions` and `/v1/models` endpoints
- **Multi-Model Orchestration**: Query multiple LLMs in parallel with intelligent synthesis
- **Multimodal Vision Support**: Send images alongside text using the OpenAI Vision API format
- **Real-Time Streaming**: Stream synthesized responses back to clients with low latency
- **Configuration-Driven**: Define everything in a single `config.yaml` file, no code changes needed
- **Advanced Pricing & Cost Tracking**:
  - Custom pricing configurations for reasoning tokens
  - Automatic model filtering based on multimodal capabilities
  - Detailed cost breakdowns with normalized token reporting
  - Per-request cost calculation and logging
- **Advanced Observability**:
  - Built-in Langfuse integration for distributed tracing
  - Comprehensive metrics API with cost tracking and usage analytics
  - Detailed health check endpoints for monitoring system components
- **Enterprise Security**:
  - Centralized Bearer token authentication with structured error responses
  - Clear distinction between service misconfiguration (503) and auth failures (401)
  - Flexible CORS policies for cross-origin requests
- **Production Ready**:
  - Multi-stage Docker builds with non-root users
  - Docker Compose for local development
  - Advanced health checks for orchestration
- **Response Caching**: Automatic LLM response caching to reduce costs and latency
- **Comprehensive Testing**: Full test suite with pytest for reliability
## Project Structure

```
mom-llm/
├── Dockerfile               # Multi-stage Docker build for production
├── docker-compose.yml       # Docker Compose for local development
├── config.yaml              # Main configuration (gitignored - use template)
├── config.yaml_template     # Configuration template with examples
├── requirements.txt         # Python dependencies
├── LICENSE                  # MIT License
├── .env                     # Environment variables (gitignored)
├── mom_service/
│   ├── main.py              # FastAPI application & middleware
│   ├── auth.py              # Authentication & token validation
│   ├── config.py            # Configuration loader & models
│   ├── core_logic.py        # Fan-out & synthesis engine
│   ├── llm_calls.py         # LLM communication via LiteLLM
│   ├── multimodal_utils.py  # Multimodal content & message sanitization
│   ├── cost_calculation.py  # Cost tracking with reasoning tokens
│   ├── pricing_utils.py     # Pricing conversions & normalization
│   ├── metrics_db.py        # Metrics persistence & analytics
│   ├── health.py            # Health check utilities
│   └── endpoints/
│       ├── models.py        # Pydantic request/response models
│       ├── openai_v1.py     # OpenAI-compatible endpoints
│       └── metrics_api.py   # Usage metrics API
└── tests/
    ├── conftest.py          # Pytest fixtures & configuration
    ├── test_config.py       # Configuration tests
    ├── test_core_logic.py   # Core logic tests
    ├── test_llm_calls.py    # LLM integration tests
    ├── test_endpoints.py    # API endpoint tests
    └── test_health.py       # Health check tests
```
## Quick Start

### Prerequisites
- Python 3.9 or higher
- Docker (optional, for containerized deployment)
- API keys for your chosen LLM providers (OpenAI, Google Gemini, Anthropic, etc.)
### Installation

**1. Clone the repository**

```bash
git clone https://github.com/arashbehmand/mom-llm.git
cd mom-llm
```

**2. Set up environment variables**

Create a `.env` file in the project root:

```bash
# Service Configuration
API_TOKEN="your-secret-bearer-token"
ALLOWED_CORS_ORIGINS=""   # Comma-separated origins, or empty for no CORS
LITELLM_VERBOSE="false"

# LLM API Keys (add the ones you need)
OPENAI_API_KEY="sk-..."
GOOGLE_API_KEY="..."
ANTHROPIC_API_KEY="..."

# Optional: Langfuse for observability
LANGFUSE_PUBLIC_KEY=""
LANGFUSE_SECRET_KEY=""
LANGFUSE_HOST="https://cloud.langfuse.com"
```

**3. Configure your models**

Copy the template and customize:

- macOS/Linux:

  ```bash
  cp config.yaml_template config.yaml
  # Edit config.yaml to define your LLMs and MoM configurations
  ```

- Windows (PowerShell):

  ```powershell
  Copy-Item config.yaml_template config.yaml
  # Then edit config.yaml to define your LLMs and MoM configurations
  ```

**4. Install dependencies**

```bash
pip install -r requirements.txt
```

**5. Run the service**

```bash
uvicorn mom_service.main:app --reload --host 0.0.0.0 --port 8000
```
## Docker Deployment

**Using Docker Compose (Recommended):**

```bash
# Start the service
docker-compose up -d

# View logs
docker-compose logs -f mom-service

# Stop the service
docker-compose down
```

**Using Docker directly:**

```bash
# Build the image
docker build -t mom-service .

# Run the container
docker run -d \
  --name mom-service \
  -p 8000:8000 \
  --env-file .env \
  -v $(pwd)/config.yaml:/app/config.yaml \
  -v $(pwd)/data:/app/data \
  mom-service
```
## Basic Usage

**Test the service:**

```bash
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer your-secret-bearer-token"
```

**Make a chat completion request:**

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secret-bearer-token" \
  -d '{
    "model": "mom",
    "messages": [
      {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    "stream": true
  }'
```

Note: Set `"stream": false` to get a single JSON response instead of an SSE stream.
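For a non-streaming call from Python without the OpenAI SDK, a plain `requests` call works, assuming the response follows the standard OpenAI chat completion schema (which the service advertises). The URL and token below are placeholders for your deployment.

```python
import requests

BASE_URL = "http://localhost:8000"        # adjust for your deployment
TOKEN = "your-secret-bearer-token"

resp = requests.post(
    f"{BASE_URL}/v1/chat/completions",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "model": "mom",
        "messages": [{"role": "user", "content": "Explain quantum computing in simple terms"}],
        "stream": False,  # single JSON response instead of an SSE stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```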
**Send an image (multimodal vision request):**

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secret-bearer-token" \
  -d '{
    "model": "mom",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What'\''s in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://example.com/image.jpg",
              "detail": "high"
            }
          }
        ]
      }
    ],
    "stream": false
  }'
```

Note: Vision requests automatically filter to multimodal-capable models. Non-capable models are skipped, and messages are sanitized for each provider to ensure compatibility.
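The same vision request can also be sent with the OpenAI Python SDK. This is an illustrative sketch using the standard OpenAI Vision message format shown above; the image URL is a placeholder.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-secret-bearer-token")

response = client.chat.completions.create(
    model="mom",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/image.jpg", "detail": "high"},
                },
            ],
        }
    ],
    stream=False,
)
print(response.choices[0].message.content)
```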
## Configuration

The service is configured through `config.yaml` and environment variables (`.env` file).

### Quick Configuration Overview

**1. Environment Variables** - API keys and service settings:

```bash
# Required
API_TOKEN="your-secret-bearer-token"

# LLM Provider Keys (add the ones you need)
OPENAI_API_KEY="sk-..."
GOOGLE_API_KEY="..."
ANTHROPIC_API_KEY="..."
```

**2. Configuration File** - Define your LLMs and MoM models:

```yaml
# Define individual LLMs
llm_definitions:
  - name: "gpt4"
    model: "openai/gpt-4"
    api_key_env: "OPENAI_API_KEY"

# Define synthesis prompts
prompt_definitions:
  - name: "synth_default"
    content: "Synthesize responses into a cohesive answer..."

# Create MoM models
models:
  - name: "mom"
    llms_to_query: ["gpt4", "claude", "gemini"]
    concluding_llm: "gpt4"
    concluding_prompt: "synth_default"
```

For detailed configuration options, custom pricing, advanced features, and complete examples, see the Configuration Guide.
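If you want to catch broken references before starting the service, a small sanity-check script along these lines can help. It is a hypothetical helper, not part of the project: the key names follow the example above and may not cover the full schema described in the Configuration Guide.

```python
import yaml  # requires PyYAML

# Illustrative sanity check for the config layout shown above.
with open("config.yaml") as fh:
    cfg = yaml.safe_load(fh)

defined_llms = {llm["name"] for llm in cfg.get("llm_definitions", [])}
defined_prompts = {p["name"] for p in cfg.get("prompt_definitions", [])}

for model in cfg.get("models", []):
    missing = [n for n in model.get("llms_to_query", []) if n not in defined_llms]
    if missing:
        print(f"{model['name']}: references undefined LLMs {missing}")
    if model.get("concluding_llm") not in defined_llms:
        print(f"{model['name']}: concluding_llm is not a defined LLM")
    if model.get("concluding_prompt") not in defined_prompts:
        print(f"{model['name']}: concluding_prompt is not a defined prompt")
```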
## API Reference

The MoM Service provides OpenAI-compatible endpoints plus additional metrics and health check endpoints.

### Quick API Overview

**Core Endpoints:**

- `GET /v1/models` - List available MoM models
- `POST /v1/chat/completions` - Chat completions (streaming and non-streaming)
- `GET /v1/metrics/usage` - Usage metrics and cost tracking
- `GET /health` - Health check

**Example Request:**

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer your-token" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mom",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```

For complete API documentation including all endpoints, parameters, response formats, and code examples in multiple languages, see the API Reference.
### Using with OpenAI SDK

The service is fully compatible with the OpenAI Python SDK:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-secret-bearer-token"
)

response = client.chat.completions.create(
    model="mom",
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    stream=True
)

for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")
```

See the API Reference for more examples including non-streaming and multimodal requests.
## Advanced Features

### Thinking Context

Set `include_thinking_context: true` in your model configuration to see intermediate responses from all LLMs before synthesis:

```
<think>
Model: gpt-4o
Content: [GPT-4o's response]
---
Model: claude-3-5-sonnet
Content: [Claude's response]
---
</think>

[Final synthesized answer]
```

Useful for understanding synthesis logic, debugging, and transparency.
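Client code that wants to separate the thinking context from the final answer can use a small helper like the illustrative sketch below. It assumes the response text wraps the intermediate answers in a single `<think>...</think>` block exactly as shown above; `split_thinking` is a hypothetical name, not part of the service.

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Split a response into (thinking_context, final_answer).

    Illustrative client-side helper; assumes one <think>...</think> block.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        return "", text.strip()
    thinking = match.group(1).strip()
    answer = (text[: match.start()] + text[match.end():]).strip()
    return thinking, answer

# Example
thinking, answer = split_thinking("<think>Model: gpt-4o\nContent: ...</think>\n42.")
print(answer)  # "42."
```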
### Message Sanitization

The service automatically sanitizes messages for provider compatibility, removing empty fields and preserving multimodal content appropriately. This ensures reliable operation across all LLM providers without manual adjustments.

### Cost Tracking & Observability

- Automatic cost calculation for every request with detailed breakdowns
- Langfuse integration for distributed tracing: add credentials to `.env` and view detailed traces in Langfuse
- Metrics API at `/v1/metrics/usage` for usage analytics (see the sketch below)
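As a quick illustration, the metrics endpoint can be queried like any other authenticated endpoint. The exact response fields are documented in the API Reference, so this sketch just prints the raw JSON; URL and token are placeholders.

```python
import requests

resp = requests.get(
    "http://localhost:8000/v1/metrics/usage",
    headers={"Authorization": "Bearer your-secret-bearer-token"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # usage and cost data; see the API Reference for the schema
```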
## Development

### Running in Development Mode

```bash
uvicorn mom_service.main:app --reload --reload-include "config.yaml"
```

The `--reload-include` flag watches `config.yaml` for changes and automatically reloads the service.

### Health Checks

```bash
# Basic health check
curl http://localhost:8000/health

# Detailed health check with component validation
curl http://localhost:8000/health/detailed

# Include LLM connectivity test
curl "http://localhost:8000/health/detailed?check_llm=true"
```
### Running Tests

```bash
# Run all tests
pytest

# Run with coverage report
pytest --cov=mom_service --cov-report=html

# Run specific test file
pytest tests/test_endpoints.py
```

The test suite includes unit tests, integration tests, API tests, and health check validation.
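As a rough illustration of the kind of test this suite contains, the sketch below exercises the `/health` endpoint with FastAPI's `TestClient`. It is a hypothetical example: it assumes the app can be imported without extra setup, whereas the real suite's fixtures in `tests/conftest.py` handle configuration.

```python
# Illustrative sketch, not part of the real suite.
from fastapi.testclient import TestClient

from mom_service.main import app  # the ASGI app referenced by the uvicorn command above


def test_health_endpoint_reports_ok():
    # TestClient runs the ASGI app in-process; no running server is needed
    # for a basic health check (detailed checks may require more setup).
    with TestClient(app) as client:
        response = client.get("/health")
        assert response.status_code == 200
```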
## Documentation
For more detailed information, check out these guides:
- Configuration Guide - Comprehensive guide to configuring LLMs, MoM models, and service settings
- API Reference - Complete API documentation with examples in multiple languages
- Contributing Guide - Guidelines for contributors
## Contributing

Contributions are welcome! Whether you're fixing bugs, improving documentation, or proposing new features, your help is appreciated.

Please see CONTRIBUTING.md for detailed guidelines on:

- Setting up your development environment
- Code style and standards
- Running tests and quality checks
- Submitting pull requests
- Reporting issues

Quick start for contributors:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes with tests
4. Run the test suite (`pytest`)
5. Commit your changes
6. Push to your branch
7. Open a Pull Request
## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

- This project was developed with the assistance of multiple AI tools, including Anthropic's Claude, GitHub Copilot, and Kilo Code.
- Built with FastAPI and LiteLLM
- Inspired by ensemble learning and multi-agent AI systems
- Observability powered by Langfuse
## Contact

**Arash Behmand**

- GitHub: @arashbehmand
- LinkedIn: linkedin.com/in/arashbehmand

If you find this project useful, please consider giving it a star on GitHub!