The Real Problems With LiteLLM (And What Actually Works Better)
dev.to · 2h

I’ve been building LLM applications for the past year. LiteLLM seemed like the obvious choice - everyone uses it, the docs are extensive, and it supports every provider you can think of. After running it in production and talking to other developers, here’s what I’ve learned about where it falls short and what alternatives actually work better.

The Cold Start Issue

This is the first problem most people hit. Import time is slow. Really slow. On a decent machine, from litellm import completion takes 3-4 seconds.
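You can check this on your own machine with a minimal timing sketch. It is shown with a stdlib module so it runs anywhere; substitute "litellm" to measure the real cost (actual numbers will vary with hardware and which provider SDKs are installed):

```python
import importlib
import time

def time_import(module_name: str) -> float:
    """Measure wall-clock time to import a module.

    Only the first import pays the full cost; later calls hit
    sys.modules and return almost instantly.
    """
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start

# Shown with "json" so the snippet is self-contained;
# replace with "litellm" to reproduce the 3-4 second figure.
elapsed = time_import("json")
print(f"json imported in {elapsed:.4f}s")
```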

Why? The main __init__.py file has over 1,200 lines of imports. It loads the SDK for every single provider - OpenAI, Anthropic, Google, AWS, Azure, Cohere, all of them - whether you use them or not.

If you’re running serverless functions on AWS Lambda or similar platforms, that 3-4 seconds gets added to every cold start.
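One common mitigation, sketched below rather than taken from the article, is to defer the heavy import into the handler body so it runs on the first invocation instead of during the function's init phase. The event shape and model name are illustrative assumptions:

```python
import json

def handler(event, context):
    # Deferred import: litellm (and its 1,200+ lines of provider imports)
    # loads on the first invocation, not during Lambda's init phase.
    from litellm import completion

    # Hypothetical event shape: {"messages": [{"role": "user", "content": "..."}]}
    response = completion(model="gpt-4o-mini", messages=event["messages"])
    return {"statusCode": 200, "body": json.dumps(response, default=str)}
```

This doesn't make the import cheaper, it just moves the cost out of the init window that Lambda bills and times separately; the first request still eats the delay.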