# MCP Hangar
Production-grade MCP infrastructure with auto-discovery, observability, and resilience patterns.
## Overview
MCP Hangar is a lifecycle management platform for Model Context Protocol providers, built for platform teams running MCP at scale. It replaces ad-hoc provider management with a unified control plane featuring auto-discovery from Kubernetes, Docker, and filesystem sources; circuit breakers and saga-based recovery for resilience; and first-class observability through Langfuse, OpenTelemetry, and Prometheus. The architecture follows Domain-Driven Design with CQRS and Event Sourcing, providing full audit trails for compliance-heavy environments.
## Why MCP Hangar?
| Challenge | Without MCP Hangar | With MCP Hangar |
|---|---|---|
| Provider lifecycle | Manual start/stop, no health monitoring | State machine with circuit breaker, health checks, automatic GC |
| Observability | None or DIY | Built-in Langfuse, OpenTelemetry, Prometheus metrics |
| Dynamic environments | Restart required for new providers | Auto-discovery from K8s, Docker, filesystem, entrypoints |
| Failure handling | Cascading failures | Circuit breaker, saga-based recovery and failover |
| Audit & compliance | None | Event sourcing with full audit trail |
| Cold start latency | Wait for provider startup | Predefined tools visible immediately, lazy loading |
| Multi-provider routing | Manual coordination | Load balancing with weighted round-robin, priority, least connections |
## Key Features
### Lifecycle Management
Provider lifecycle follows a strict state machine:
```
COLD → INITIALIZING → READY → DEGRADED → DEAD
```
- Lazy loading – Providers start on first invocation, not at boot
- Predefined tools – Tool schemas visible before the provider starts, so there is no cold start for discovery (sketched below)
- Automatic GC – Idle providers shut down after a configurable TTL
- Graceful shutdown – Clean termination with timeout enforcement
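For example, predefined tool schemas let clients see a provider's tools while it is still COLD. A minimal config sketch: `idle_ttl_s` and `tools` are documented options (see the configuration reference below), but the exact schema fields under `tools` are an illustrative assumption:

```yaml
providers:
  math:
    mode: subprocess
    command: [python, -m, my_math_server]
    idle_ttl_s: 300            # automatic GC after 5 idle minutes
    tools:                     # assumed schema shape; visible before first start
      - name: add
        description: Add two integers
```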
### Auto-Discovery

Automatically detect and register providers from multiple sources:
| Source | Configuration |
|---|---|
| Kubernetes | Pod annotations (`mcp-hangar.io/*`) with namespace filtering |
| Docker/Podman | Container labels (`mcp.hangar.*`) |
| Filesystem | YAML configs with file watching |
| Python entrypoints | `mcp.providers` entry point group |
Discovery modes:
- `additive` – Only adds providers, never removes (safe for static environments)
- `authoritative` – Adds and removes (for dynamic environments like K8s; see the example below)
Conflict resolution: Static config > Kubernetes > Docker > Filesystem > Entrypoints
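As an illustration of Kubernetes discovery, a pod opts in through annotations. Only the `mcp-hangar.io/*` prefix is documented above; the specific annotation keys in this sketch are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sqlite-mcp
  namespace: mcp-providers
  annotations:
    mcp-hangar.io/enabled: "true"    # hypothetical key: opt this pod into discovery
    mcp-hangar.io/provider: "sqlite" # hypothetical key: provider name to register
spec:
  containers:
    - name: server
      image: ghcr.io/modelcontextprotocol/server-sqlite:latest
```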
### Observability
Full observability stack for production operations:
#### Langfuse Integration
- End-to-end LLM tracing from prompt to provider response
- Cost attribution per provider, tool, user, or session
- Quality scoring and automated evals
#### OpenTelemetry
- Distributed tracing with context propagation
- OTLP export to Jaeger, Zipkin, or any collector
#### Prometheus Metrics
- Tool invocation latency and error rates
- Provider state transitions and cold starts
- Circuit breaker state and trip counts
- Health check results
#### Health Endpoints
- `/health/live` – Liveness check
- `/health/ready` – Readiness check (K8s compatible)
- `/health/startup` – Startup check
- `/metrics` – Prometheus scrape endpoint
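These paths map directly onto Kubernetes probes. A sketch, assuming the HTTP server listens on port 8080 (substitute your port):

```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  failureThreshold: 30
  periodSeconds: 2
```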
### Resilience

Production-grade failure handling:
#### Circuit Breaker
- Opens after a configurable failure threshold
- Auto-resets after a timeout period
- Prevents failures from cascading to healthy providers
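Conceptually, the breaker behaves like the minimal sketch below. This is illustrative only; MCP Hangar's actual implementation and names will differ:

```python
import time

class CircuitBreaker:
    """Illustrative circuit breaker: open after N failures, retry after a timeout."""

    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        if time.monotonic() - self.opened_at >= self.reset_timeout_s:
            self.opened_at = None  # half-open: let one probe request through
            self.failures = self.failure_threshold - 1
            return True
        return False  # open: fail fast, protect healthy providers

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # trip the breaker
```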
#### Saga-Based Recovery
- `ProviderRecoverySaga` – Automatic restart with exponential backoff (sketched below)
- `ProviderFailoverSaga` – Failover to backup providers with auto-failback
- `GroupRebalanceSaga` – Rebalance traffic when members change
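The recovery saga's exponential backoff amounts to a growing, capped delay schedule. A tiny sketch; the base, cap, and jitter strategy here are assumptions, not MCP Hangar's actual defaults:

```python
import random

def backoff_delays(attempts: int, base_s: float = 1.0, cap_s: float = 60.0) -> list[float]:
    """Full-jitter exponential backoff: random delay in [0, min(cap, base * 2^n)]."""
    return [random.uniform(0.0, min(cap_s, base_s * 2 ** n)) for n in range(attempts)]

print(backoff_delays(5))  # e.g. [0.4, 1.7, 2.9, 6.2, 13.8]
```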
#### Health Monitoring
- Configurable check intervals
- Consecutive failure thresholds
- Automatic state transitions (READY → DEGRADED)

### Security
Enterprise security controls:
- Rate limiting – Per-provider request limits
- Input validation – Schema validation before provider invocation
- Secrets management – Environment variable injection, never in config files
- Container isolation – Read-only filesystems, resource limits, network policies
- Discovery security – Namespace filtering, max providers per source, quarantine on failure

### Architecture
Domain-Driven Design with clean layer separation:
```
domain/           Core business logic, state machines, events, value objects
application/      Use cases, commands, queries, sagas
infrastructure/   Adapters for containers, subprocess, persistence, event bus
server/           MCP protocol handlers and validation
bootstrap/        Runtime initialization and dependency injection
```
- CQRS – Separate command and query paths
- Event Sourcing – All state changes emit domain events (sketched below)
- Port/Adapter – Extensible infrastructure layer
- Thread-safe – Lock hierarchy for concurrent access
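In practice, CQRS with event sourcing means a command expresses intent and produces domain events rather than mutating state directly; the event log doubles as the audit trail. A highly simplified sketch of the pattern; the class names are illustrative, not MCP Hangar's actual types:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class StartProvider:               # command: expresses intent
    provider_id: str

@dataclass(frozen=True)
class ProviderStarted:             # domain event: records what happened
    provider_id: str
    occurred_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def handle(command: StartProvider, event_log: list) -> None:
    # ... start the provider, then append the event; replaying the
    # log reconstructs state and provides a full audit history
    event_log.append(ProviderStarted(command.provider_id))
```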
## Quick Start
Install:
```bash
pip install mcp-hangar
```
Configure (`config.yaml`):

```yaml
providers:
  math:
    mode: subprocess
    command: [python, -m, my_math_server]
    idle_ttl_s: 300
  sqlite:
    mode: container
    image: ghcr.io/modelcontextprotocol/server-sqlite:latest
    volumes:
      - "/data/sqlite:/data:rw"
```
Run:
```bash
# Stdio mode (Claude Desktop, Cursor, etc.)
mcp-hangar --config config.yaml

# HTTP mode (LM Studio, web clients)
mcp-hangar --config config.yaml --http
```
## Architecture Overview
```
┌──────────────────────────────────────────────────────────────────┐
│                            MCP Hangar                            │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                       FastMCP Server                       │  │
│  │                  (Stdio or HTTP transport)                 │  │
│  └──────────────────────────────┬─────────────────────────────┘  │
│                                 │                                │
│  ┌──────────────────────────────┴─────────────────────────────┐  │
│  │                      Provider Manager                      │  │
│  │   ┌─────────┐      ┌─────────┐      ┌─────────┐            │  │
│  │   │  State  │      │ Health  │      │ Circuit │            │  │
│  │   │ Machine │      │ Tracker │      │ Breaker │            │  │
│  │   └─────────┘      └─────────┘      └─────────┘            │  │
│  └──────────────────────────────┬─────────────────────────────┘  │
│                                 │                                │
│  ┌──────────────────────────────┴─────────────────────────────┐  │
│  │                         Providers                          │  │
│  │   ┌───────────┐     ┌───────────┐     ┌───────────┐        │  │
│  │   │ Subprocess│     │  Docker   │     │  Remote   │        │  │
│  │   └───────────┘     └───────────┘     └───────────┘        │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
│  Background: [GC Worker]  [Health Worker]  [Discovery Worker]    │
└──────────────────────────────────────────────────────────────────┘
```
## Registry Tools
| Tool | Description |
|---|---|
| `registry_list` | List all providers with state, health status, and available tools |
| `registry_start` | Explicitly start a provider |
| `registry_stop` | Stop a running provider |
| `registry_invoke` | Invoke a tool on a provider (auto-starts if needed) |
| `registry_invoke_ex` | Invoke with retry, correlation ID, and metadata |
| `registry_invoke_stream` | Invoke with real-time progress notifications |
| `registry_tools` | Get tool schemas for a provider |
| `registry_details` | Get detailed information about a provider or group |
| `registry_health` | Get health status and metrics |
| `registry_status` | Dashboard view of all providers |
| `registry_discover` | Trigger a discovery cycle |
| `registry_sources` | List discovery sources with status |
| `registry_quarantine` | List quarantined providers |
| `registry_approve` | Approve a quarantined provider |
| `registry_warm` | Pre-start providers to avoid cold start latency |
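An MCP client calls these registry tools like any others. A sketch using the official MCP Python SDK; the argument shape passed to `registry_invoke` (provider, tool, arguments) is an assumption:

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Launch MCP Hangar in stdio mode and connect to it.
    params = StdioServerParameters(command="mcp-hangar", args=["--config", "config.yaml"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Assumed argument shape: provider name, tool name, tool arguments.
            result = await session.call_tool(
                "registry_invoke",
                {"provider": "math", "tool": "add", "arguments": {"a": 2, "b": 3}},
            )
            print(result)

asyncio.run(main())
```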
## Configuration Reference
| Option | Description | Default |
|---|---|---|
| `mode` | Provider mode: `subprocess`, `container`, `docker`, `remote`, `group` | required |
| `command` | Command for subprocess providers | – |
| `image` | Container image for container providers | – |
| `idle_ttl_s` | Seconds before idle provider shutdown | `300` |
| `health_check_interval_s` | Health check frequency in seconds | `60` |
| `max_consecutive_failures` | Failures before transition to DEGRADED | `3` |
| `tools` | Predefined tool schemas (visible before start) | – |
| `volumes` | Container volume mounts | – |
| `network` | Container network mode | `none` |
| `read_only` | Container read-only filesystem | `true` |
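Putting several of these options together in one provider definition (values are illustrative; every key comes from the table above):

```yaml
providers:
  sqlite:
    mode: container
    image: ghcr.io/modelcontextprotocol/server-sqlite:latest
    idle_ttl_s: 600                 # GC after 10 idle minutes
    health_check_interval_s: 30
    max_consecutive_failures: 3     # then READY → DEGRADED
    network: none                   # default: no network access
    read_only: true                 # default: read-only filesystem
    volumes:
      - "/data/sqlite:/data:rw"
```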
## Observability Setup
```yaml
observability:
  langfuse:
    enabled: true
    public_key: ${LANGFUSE_PUBLIC_KEY}
    secret_key: ${LANGFUSE_SECRET_KEY}
    host: https://cloud.langfuse.com
  tracing:
    enabled: true
    otlp_endpoint: http://localhost:4317
  metrics:
    enabled: true
    endpoint: /metrics
```
Environment Variables:
| Variable | Description |
|---|---|
| `LANGFUSE_PUBLIC_KEY` | Langfuse public key |
| `LANGFUSE_SECRET_KEY` | Langfuse secret key |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | OpenTelemetry collector endpoint |
| `MCP_TRACING_ENABLED` | Enable/disable tracing (`true`/`false`) |
Endpoints:
- `/metrics` – Prometheus metrics
- `/health/live` – Liveness probe
- `/health/ready` – Readiness probe
- `/health/startup` – Startup probe
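A quick smoke test from the command line, assuming HTTP mode on port 8080 (substitute your host and port):

```bash
curl -fsS http://localhost:8080/health/ready && echo "ready"
curl -s http://localhost:8080/metrics | head
```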
## Documentation

Full documentation is built with MkDocs from the `docs/` directory.
## Contributing
See Contributing Guide for development setup, testing requirements, and code style.
```bash
git clone https://github.com/mapyr/mcp-hangar.git
cd mcp-hangar

# Set up the Python core
cd packages/core
pip install -e ".[dev]"
pytest

# Or use the root Makefile
cd ../..
make setup
make test
```
## Project Structure
```
mcp-hangar/
├── packages/
│   ├── core/                   # Python package (PyPI: mcp-hangar)
│   │   ├── mcp_hangar/
│   │   ├── tests/
│   │   └── pyproject.toml
│   ├── operator/               # Kubernetes operator (Go)
│   │   ├── api/
│   │   ├── cmd/
│   │   └── go.mod
│   └── helm-charts/            # Helm charts
│       ├── mcp-hangar/
│       └── mcp-hangar-operator/
├── docs/                       # MkDocs documentation
├── examples/                   # Quick starts & demos
├── monitoring/                 # Grafana, Prometheus configs
└── Makefile                    # Root orchestration
```
## License

MIT License – see LICENSE for details.