Hi! In this part we’re moving from experiments and prototyping into the real world — production deployments.
Because the truth is: building a working notebook or a proof-of-concept is only the beginning. The real challenges start when your application must serve hundreds or thousands of users, run reliably 24/7, and still stay within budget.
Let’s start with the first foundation: a model-agnostic approach.
Model-agnostic from day one
Many teams building AI applications quickly lock themselves into a single provider — only OpenAI, or only Anthropic. That’s understandable: it’s faster to pick one API and focus. But long-term it’s a huge risk. If the provider raises prices, has an outage, or changes licensing terms — your entire application can stop.
That’s why it’s worth thinking from the very beginning about a model-agnostic gateway layer.
In practice, this means your code doesn’t talk directly to one specific model. Instead, it calls an abstraction:
- “give me a chat-class LLM”, or
- “give me an embedding generator”
And only the gateway decides whether under the hood it should call GPT-5, Claude 4.5 Sonnet, or a local LLaMA running on your own infrastructure.
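To make this concrete, here is a minimal sketch of such an abstraction. The function names and default models are illustrative assumptions, not a specific library's API; we'll build a fuller HTTP version of this idea in the code section later.

```python
# Minimal sketch of a model-agnostic factory.
# Function names and default models are assumptions for illustration.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

def get_chat_llm(provider: str = "openai"):
    """Return a chat-class LLM; callers never import a provider SDK directly."""
    if provider == "openai":
        return ChatOpenAI(model="gpt-4o-mini", temperature=0)
    # elif provider == "anthropic":
    #     return ChatAnthropic(model="claude-sonnet-4-5", temperature=0)
    raise ValueError(f"Unknown provider: {provider}")

def get_embedder(provider: str = "openai"):
    """Return an embedding generator behind the same abstraction."""
    if provider == "openai":
        return OpenAIEmbeddings(model="text-embedding-3-small")
    raise ValueError(f"Unknown provider: {provider}")
```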
API Gateway + routing + fallback
The second foundation is an API Gateway.
Imagine you expose a simple endpoint like POST /v1/chat, where users send requests. In a header like X-Model, the client specifies which model should be used.
The gateway can run multiple models in parallel — and it can also implement fallback logic: if the primary model doesn’t respond within a given time, you automatically switch to a backup model, for example an open-source one running locally.
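In LangChain, fallback can be as small as a `with_fallbacks` call. Here is a minimal sketch; the backup below is a second hosted model standing in for a local one, and the 10-second timeout is an assumed threshold:

```python
from langchain_openai import ChatOpenAI

# Primary model with a hard timeout; if it errors or times out,
# the runnable automatically retries with the backup.
primary = ChatOpenAI(model="gpt-4o-mini", temperature=0, timeout=10)
backup = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)  # stand-in for a local model

llm = primary.with_fallbacks([backup])
answer = llm.invoke("Ping: which model answered?")
```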
This pattern doesn’t only improve reliability — it also opens the door to experimentation.
You can route 1% of traffic to a new model and see how it performs compared to the previous one, without changing the entire system.
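A canary split like this can live in a few lines at the routing layer. A sketch, with the model names as placeholders:

```python
import random
from typing import Optional

def pick_model(x_model_header: Optional[str]) -> str:
    """Honor an explicit X-Model header; otherwise send ~1% of traffic to the candidate."""
    if x_model_header:
        return x_model_header
    return "openai:gpt-5-mini" if random.random() < 0.01 else "openai:gpt-4o-mini"
```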
Monitoring and cost control
The third foundation — often neglected — is monitoring and cost control.
In a prototype it’s enough to say “it works”. In production you’ll get harder questions:
- How much does it cost per day?
- What’s our hallucination rate?
- How often do we reject outputs?
This is where tools like LangSmith help — but even a simple internal logging system can work.
We measure latency (because users don’t want to wait 30 seconds), we measure costs, and we measure quality — for example: how many answers were rejected by guardrails or evaluation.
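As a minimal sketch, LangChain's `get_openai_callback` can capture tokens and cost around a single call, and latency is just a plain timer; the prompt here is a placeholder:

```python
import time
from langchain_community.callbacks import get_openai_callback
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

start = time.perf_counter()
with get_openai_callback() as cb:
    answer = llm.invoke("Summarize why monitoring matters in one sentence.")
latency_s = time.perf_counter() - start

# Log the signals the text calls out: latency, tokens, and cost.
print(f"latency={latency_s:.2f}s tokens={cb.total_tokens} cost=${cb.total_cost:.6f}")
```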
And we can set very simple but effective alerts:
- if daily cost exceeds $50 → send a notification,
- if average response time goes above 5 seconds → trigger another alert.
With this, you have real visibility into what’s happening inside the system.
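These alerts don't need dedicated tooling either. A sketch using the thresholds from the list above; the `notify` function is a placeholder for whatever channel you use:

```python
DAILY_COST_LIMIT_USD = 50.0
AVG_LATENCY_LIMIT_S = 5.0

def notify(message: str) -> None:
    # Placeholder: wire this to Slack, email, PagerDuty, etc.
    print(f"[ALERT] {message}")

def check_alerts(daily_cost_usd: float, avg_latency_s: float) -> None:
    """Compare the day's aggregates against the thresholds and alert on breaches."""
    if daily_cost_usd > DAILY_COST_LIMIT_USD:
        notify(f"Daily cost ${daily_cost_usd:.2f} exceeded ${DAILY_COST_LIMIT_USD:.2f} limit")
    if avg_latency_s > AVG_LATENCY_LIMIT_S:
        notify(f"Average latency {avg_latency_s:.1f}s exceeded {AVG_LATENCY_LIMIT_S:.1f}s limit")
```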
These three elements — model-agnostic gateway, API gateway, and monitoring — are not “nice-to-haves”. They’re foundations. If you treat them seriously, your application will not only run in production, but also stay resilient to changes in the market and technology.
Now let’s jump to the code.
Install libraries and load environment variables
```
!pip install -U langchain langchain-openai langgraph fastapi uvicorn
```

```python
from dotenv import load_dotenv

load_dotenv()
```
Human in the Loop
```python
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from langchain.agents import create_agent
from langchain.agents.middleware import HumanInTheLoopMiddleware
from langgraph.checkpoint.memory import MemorySaver
from langgraph.types import Command

@tool
def risky_operation(secret: str) -> str:
    """Perform a risky operation that requires manual approval."""
    return f"Executed risky operation with: {secret}"

tools = [risky_operation]
model = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Interrupt execution whenever the agent tries to call risky_operation,
# and let a human approve, edit, or reject the tool call.
hitl = HumanInTheLoopMiddleware(
    interrupt_on={
        "risky_operation": {"allowed_decisions": ["approve", "edit", "reject"]}
    },
    description_prefix="Manual approval required for risky operation:",
)

# A checkpointer is required so the agent can pause and later resume.
checkpointer = MemorySaver()

agent = create_agent(
    model=model,
    tools=tools,
    middleware=[hitl],
    checkpointer=checkpointer,
    debug=True,
)

config = {"configurable": {"thread_id": "hitl-demo-1"}}

result = agent.invoke(
    {"messages": [{"role": "user", "content": "Please run the risky operation with secret code $%45654@."}]},
    config=config,
)
```
output:
```
[values] {'messages': [HumanMessage(content='Please run the risky operation with secret code $%45654@.', additional_kwargs={}, response_metadata={}, id='589244c7-9860-48fa-b68a-eca595510a73')]}
[updates] {'model': {'messages': [AIMessage(content='', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 19, 'prompt_tokens': 60, 'total_tokens': 79, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_560af6e559', 'id': 'chatcmpl-CaJj7md4CRaAN2mcI1ju8uek8BJti', 'service_tier': 'default', 'finish_reason': 'tool_calls', 'logprobs': None}, id='lc_run--35ad04bd-5d01-4649-a64c-d8c583ffe3aa-0', tool_calls=[{'name': 'risky_operation', 'args': {'secret': '$%45654@'}, 'id': 'call_dK786IhVaO3Z4VssPOI1cM6y', 'type': 'tool_call'}], usage_metadata={'input_tokens': 60, 'output_tokens': 19, 'total_tokens': 79, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})]}}
[values] {'messages': [HumanMessage(content='Please run the risky operation with secret code $%45654@.', additional_kwargs={}, response_metadata={}, id='589244c7-9860-48fa-b68a-eca595510a73'), AIMessage(content='', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 19, 'prompt_tokens': 60, 'total_tokens': 79, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_560af6e559', 'id': 'chatcmpl-CaJj7md4CRaAN2mcI1ju8uek8BJti', 'service_tier': 'default', 'finish_reason': 'tool_calls', 'logprobs': None}, id='lc_run--35ad04bd-5d01-4649-a64c-d8c583ffe3aa-0', tool_calls=[{'name': 'risky_operation', 'args': {'secret': '$%45654@'}, 'id': 'call_dK786IhVaO3Z4VssPOI1cM6y', 'type': 'tool_call'}], usage_metadata={'input_tokens': 60, 'output_tokens': 19, 'total_tokens': 79, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})]}
[updates] {'__interrupt__': (Interrupt(value={'action_requests': [{'name': 'risky_operation', 'args': {'secret': '$%45654@'}, 'description': "Manual approval required for risky operation:\n\nTool: risky_operation\nArgs: {'secret': '$%45654@'}"}], 'review_configs': [{'action_name': 'risky_operation', 'allowed_decisions': ['approve', 'edit', 'reject']}]}, id='a3abdfe342bd7c8be8b1b586ee9f8815'),)}
```
handle interrupt:
if "__interrupt__" in result: print("Interrupt detected!") decisions = [{"type": "approve"}] result = agent.invoke( Command(resume={"decisions": decisions}), config=config, )
output:
```
[values] {'messages': [HumanMessage(content='Please run the risky operation with secret code $%45654@.', additional_kwargs={}, response_metadata={}, id='589244c7-9860-48fa-b68a-eca595510a73'), AIMessage(content='', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 19, 'prompt_tokens': 60, 'total_tokens': 79, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_560af6e559', 'id': 'chatcmpl-CaJj7md4CRaAN2mcI1ju8uek8BJti', 'service_tier': 'default', 'finish_reason': 'tool_calls', 'logprobs': None}, id='lc_run--35ad04bd-5d01-4649-a64c-d8c583ffe3aa-0', tool_calls=[{'name': 'risky_operation', 'args': {'secret': '$%45654@'}, 'id': 'call_dK786IhVaO3Z4VssPOI1cM6y', 'type': 'tool_call'}], usage_metadata={'input_tokens': 60, 'output_tokens': 19, 'total_tokens': 79, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}), ToolMessage(content='Executed risky operation with: $%45654@', name='risky_operation', id='13109032-38fb-4d94-920c-90026acc41f3', tool_call_id='call_dK786IhVaO3Z4VssPOI1cM6y')]}
```
Model agnostic API gateway
To run the model-agnostic API gateway example:
1. Place the code below in a file app.py:
```python
# app.py
from fastapi import FastAPI, Header
from pydantic import BaseModel
from langchain_core.runnables import RunnableLambda
from langchain_core.messages import AIMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

class ChatRequest(BaseModel):
    message: str

class ChatResponse(BaseModel):
    provider: str
    model: str
    answer: str

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("human", "{message}"),
])

def build_model(x_model: str):
    """
    x_model format:
    - 'openai:gpt-4o-mini'
    """
    if ":" in x_model:
        provider, model_name = x_model.split(":", 1)
    else:
        provider, model_name = "openai", x_model
    provider = provider.lower().strip()

    if provider == "openai":
        return provider, model_name, ChatOpenAI(model=model_name, temperature=0)
    # if provider == "anthropic":  # support for another LLM API provider
    #     return provider, model_name, ChatAnthropic(model=model_name, temperature=0)

    # Unknown provider: fall back to a harmless echo runnable instead of failing.
    def _unknown(inputs: dict):
        return AIMessage(content=f"(unknown provider) Echo: {inputs.get('message', '')}")

    return "unknown", x_model, RunnableLambda(_unknown)

app = FastAPI(title="Model-Agnostic LangChain Gateway")

@app.post("/chat", response_model=ChatResponse)
def chat_endpoint(
    req: ChatRequest,
    x_model: str = Header(default="openai:gpt-4o-mini", alias="X-Model"),
):
    # The X-Model header decides which model serves the request.
    provider, model_name, model = build_model(x_model)
    chain = prompt | model | StrOutputParser()
    answer: str = chain.invoke({"message": req.message})
    return ChatResponse(provider=provider, model=model_name, answer=answer)
```
2. Start server:
```
uvicorn app:app --reload
```
3. Send request:
```
curl -X POST 'http://127.0.0.1:8000/chat' \
  -H 'Content-Type: application/json' \
  -H 'X-Model: openai:gpt-5-mini' \
  -d '{"message":"List 3 advantages of Python."}'

curl -X POST 'http://127.0.0.1:8000/chat' \
  -H 'Content-Type: application/json' \
  -H 'X-Model: openai:gpt-4o-mini' \
  -d '{"message":"List 3 advantages of Python."}'
```
The future of GenAI
That brings us to the second part of this episode: the future of GenAI.
How will this industry look over the next few years? Nobody has a crystal ball — but some trends are already very clear.
Trend #1: Multimodality
Models like GPT-5 or Claude 4.5 can already analyze images, audio, and video. Soon this will be standard.
When you build applications, you have to assume users won’t send only text. They will upload screenshots, photos of documents, audio recordings. Your architecture needs to be ready for that.
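With LangChain, accepting an image alongside text is already just a list of content blocks. A minimal sketch; the image URL is a placeholder:

```python
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

llm = ChatOpenAI(model="gpt-4o-mini")

# A single user message mixing a text block and an image block.
message = HumanMessage(content=[
    {"type": "text", "text": "What does this screenshot show?"},
    {"type": "image_url", "image_url": {"url": "https://example.com/screenshot.png"}},
])
response = llm.invoke([message])
print(response.content)
```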
Trend #2: Agentic workflows
Classic APIs and linear workflows are not enough when a process is complex and dynamic.
Instead of hardcoding conditions in traditional code, we’ll declare state graphs of agents: Researcher, Critic, Expert — and let the system iterate based on state and quality signals.
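As a hedged sketch of what declaring such a state graph can look like in LangGraph (the node logic and the 0.8 quality threshold are placeholder assumptions, not a prescribed design):

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class ReviewState(TypedDict):
    draft: str
    quality: float

def researcher(state: ReviewState) -> ReviewState:
    # Placeholder: in practice this node would call an LLM.
    return {"draft": "initial findings", "quality": 0.5}

def critic(state: ReviewState) -> ReviewState:
    # Placeholder: revise the draft and raise the quality signal.
    return {"draft": state["draft"] + " (revised)", "quality": state["quality"] + 0.3}

graph = StateGraph(ReviewState)
graph.add_node("researcher", researcher)
graph.add_node("critic", critic)
graph.set_entry_point("researcher")
graph.add_edge("researcher", "critic")
# Iterate until the quality signal clears the threshold.
graph.add_conditional_edges("critic", lambda s: END if s["quality"] >= 0.8 else "researcher")

workflow = graph.compile()
final_state = workflow.invoke({"draft": "", "quality": 0.0})
```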
Keeping these trends in mind, we can prepare our applications for the next generation of even more capable AI models.
That’s all in this chapter, dedicated to the model-agnostic pattern, the LLM API gateway, and future AI trends.
see **next chapter**
see **previous chapter**
see the full code from this article in the GitHub **repository**