How to Cut Your AI API Costs by 50% (The Smart Router Pattern)

Hey developers, if you are running your app on large language models like GPT-4o, Claude 3, or Gemini, you probably know the pain of checking your monthly bill. It always hurts a little. Every word you send and every word the model sends back costs tokens, and tokens are money. As your user base grows and prompts get longer, that bill can start looking like your rent.

Prompt engineering helps, yes, but that is not enough anymore. The real game-changer is smart routing. Think of it as your personal air-traffic control for AI requests. You do not need to send every single prompt to the most expensive model. You just need to send the right parts to it and let cheaper models handle the routine work.

This article will show you how to build a JavaScript function that can cut your AI bill in half, keep your responses fast, and save your sanity when rate limits hit.

The Smart Request Router Pattern

Here is the idea. You separate the work into two tiers.

Tier 1 (Premium): The expensive, high-quality model like GPT-4o. Use it only when you truly need the best output.
Tier 2 (Utility): The cheaper or even free model like Gemini 2.5 Flash or Llama. Use it for light work such as text compression or as your backup when things go wrong.

The function you will build is called getSmartAIResponse. It has three jobs: protect your API from overload, reduce prompt size, and make sure your app never breaks even when the expensive model does.

Layer 1: The Defence Mechanism

Before you worry about saving money, you must stop people (and bots) from overusing your API. A simple rate limiter can do this. You can track how many requests each user sends within a time window using a plain JavaScript Map.

Code Example:

// --- Rate Limiting State (Example using a Map for per-user/IP limits) ---
const requestMap = new Map();
const WINDOW_MS = 60000; // 1 minute
const MAX_REQUESTS = 5; // 5 requests per minute

// ... (inside getSmartAIResponse) ...

// 1. Rate Limiter Check (Defense Layer)
const now = Date.now();
const userState = requestMap.get(userId) || { count: 0, time: now };

if (now - userState.time > WINDOW_MS) {
  userState.count = 0; // Reset count
  userState.time = now; // Start a new window
}

if (userState.count >= MAX_REQUESTS) {
  console.warn(`User ${userId} rate limited. Serving Tier 2.`);
  // Rate Limit Fallback (Tier 2 Call)
  return {
    ...(await handleTier2AICall(userPrompt)),
    message: "Rate limit exceeded. Request routed to Tier 2."
  };
}

// Allow the request and increment count
userState.count++;
requestMap.set(userId, userState);

requestMap: This standard JavaScript Map acts as our in-memory database. The key is the userId (which would be the client’s IP address or session ID in a real server environment), and the value stores their request count and the timestamp of their last reset.
Window Logic: The if (now - userState.time > WINDOW_MS) block implements a fixed window counter. If the current time is beyond the one-minute window, the counter is reset, and the user gets a fresh allowance.
Defense & Fallback: If MAX_REQUESTS is reached, the function immediately bypasses all expensive downstream logic and calls the cheap handleTier2AICall function for the final response. This keeps a single heavy user (or bot) from running up costs on the paid API.
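
The fixed-window logic above can also be pulled into a small standalone helper, which makes it easier to test and reuse. The sketch below is one way to do that, not the article's original code: createRateLimiter, allow, and prune are hypothetical names, and the clock is passed in as a parameter so the behaviour can be checked without waiting a real minute. The prune method is an addition that covers a gap in the plain Map approach, which otherwise keeps one entry per user forever.

```javascript
// Minimal fixed-window rate limiter (sketch; all names are hypothetical).
function createRateLimiter({ windowMs = 60000, maxRequests = 5 } = {}) {
  const state = new Map(); // userId -> { count, windowStart }

  return {
    // Returns true if the request is allowed, false if the user is over the limit.
    allow(userId, now = Date.now()) {
      let entry = state.get(userId);
      if (!entry || now - entry.windowStart > windowMs) {
        // First request, or the previous window has expired: start a new window.
        entry = { count: 0, windowStart: now };
        state.set(userId, entry);
      }
      if (entry.count >= maxRequests) return false;
      entry.count++;
      return true;
    },

    // Drops entries whose window has expired and returns the remaining size.
    prune(now = Date.now()) {
      for (const [userId, entry] of state) {
        if (now - entry.windowStart > windowMs) state.delete(userId);
      }
      return state.size;
    },
  };
}

// Usage with an explicit clock (in milliseconds) so the result is deterministic.
const limiter = createRateLimiter({ windowMs: 60000, maxRequests: 2 });
const results = [
  limiter.allow("user-1", 0),     // first request in the window
  limiter.allow("user-1", 1000),  // second request
  limiter.allow("user-1", 2000),  // over the limit
  limiter.allow("user-1", 61001), // a new window has started
];
console.log(results); // [ true, true, false, true ]
```

In a long-running server you would call prune on a timer. Across several server instances an in-memory Map is not shared, so a shared store would be needed there.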

Layer 2: The Optimization Layer

This is where the real money-saving starts. For prompt-heavy apps, a large share of the cost comes from input tokens: the user prompt itself. If you send a long essay to GPT-4o, you pay through the nose. But if you first ask a cheaper model to shorten it, you keep most of the meaning and pay roughly half the price for Tier 1 input.

Code Example:

// Optimization Layer: Compression using Tier 2 AI
let trimmedPrompt = userPrompt; // Default to the original prompt
const compressionPrompt = `Shorten the following text by 50% without losing the core meaning: "${userPrompt}"`;
const compressionResult = await handleTier2AICall(compressionPrompt);

if (compressionResult.reply && compressionResult.source !== "error") {
  trimmedPrompt = compressionResult.reply;
  // .length counts characters, not tokens; it is only a rough proxy for size
  console.log(`Optimization successful: ${userPrompt.length} -> ${trimmedPrompt.length} characters.`);
}

// ... continue to Paid API call using the 'trimmedPrompt'

Targeted Prompting: The compressionPrompt is highly directive. It instructs the Tier 2 model to act as a compressor, not a generator. This specialized task is ideal for smaller, faster models that can execute low-latency, low-cost processing.
The Big Win: The variable trimmedPrompt now holds the optimized input. When this is sent to the Tier 1 model in the next step, we pay roughly half the price for the input tokens, assuming the Tier 2 model actually hits the 50% target.
Decoupling: Notice that if the compression fails (compressionResult.source === "error"), we simply fall back to using the original userPrompt. The failure of the optimization layer does not break the entire service; it just makes that specific request more expensive.

Now you have a smaller prompt ready for the expensive model, and a safe default for the times the cheaper model fails.
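
One case the check above does not cover is a compression reply that comes back empty or longer than the original, which a small model can produce when it adds its own commentary. A guard like the one below is one possible safeguard. The names shouldCompress, choosePrompt, and MIN_CHARS_TO_COMPRESS are hypothetical, and the 400-character threshold is an arbitrary assumption to tune against your own traffic, not a figure from this article.

```javascript
// Sketch: decide whether compression is worth attempting and whether to trust its result.
const MIN_CHARS_TO_COMPRESS = 400; // assumed threshold, tune for your workload

// Short prompts are not worth an extra Tier 2 round trip.
function shouldCompress(prompt) {
  return prompt.length >= MIN_CHARS_TO_COMPRESS;
}

// Use the compressed text only if it is non-empty and actually shorter than the original.
function choosePrompt(original, compressionResult) {
  const candidate = compressionResult?.reply;
  const usable =
    typeof candidate === "string" &&
    compressionResult.source !== "error" &&
    candidate.trim().length > 0 &&
    candidate.length < original.length;
  return usable ? candidate.trim() : original;
}

// Usage: the result objects mimic the { reply, source } shape used in this article.
const longPrompt = "Please explain, in as much detail as you can, ".repeat(20);
const good = choosePrompt(longPrompt, { reply: "Explain in detail.", source: "tier-2-fallback" });
const failed = choosePrompt(longPrompt, { reply: "Service temporarily unavailable.", source: "error" });
console.log(shouldCompress("Hi there")); // false
console.log(good);                       // "Explain in detail."
console.log(failed === longPrompt);      // true
```

Skipping compression for short prompts also avoids an extra network round trip, which matters because every compressed request waits for two model calls instead of one.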

Layer 3: The Reliability Layer

Sometimes the premium API fails. Maybe your quota is over, maybe the server is down, or maybe it just had a bad day. You cannot let that stop your app. This layer catches any error and falls back to Tier 2 automatically.

Code Example:

try {
  // ... (Paid API Call using the TRIMMED prompt) ...

  if (!paidResponse.ok) {
    // Throw to activate the error fallback below
    throw new Error(`Tier 1 API failed with status ${paidResponse.status}`);
  }

  // ... return successful Tier 1 result ...

} catch (err) {
  console.error("Tier 1 API failure (Quota/Error). Routing to Tier 2:", err.message);

  // Reliability Layer: Quota/Error Fallback
  return {
    // We pass the ORIGINAL prompt here, ensuring the fallback provides a full response
    ...(await handleTier2AICall(userPrompt)),
    message: "Tier 1 service failed. Request automatically routed to Tier 2 for continuity."
  };
}

Error Trapping: Any HTTP status code where paidResponse.ok is false (e.g., 401, 429, 500) triggers the throw new Error(...) statement, sending control to the catch block. Network failures that make fetch itself reject end up in the same catch block.
Service Continuity: The catch block immediately executes the handleTier2AICall(userPrompt). Crucially, it sends the original, non-compressed userPrompt to the Tier 2 model. This ensures that even though the response is from the cheaper model, it has the full context to generate a comprehensive answer, maintaining the user experience.
Analytics Value: The message and source fields in the final returned object are essential for logging and analytics. They allow developers to track exactly how often the system is saving money by falling back and how often it’s running into Tier 1 quota limits.
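
To make the analytics point concrete, here is a minimal sketch of an in-memory tally keyed on the source field. The helper createRouteStats and its methods are hypothetical names introduced for illustration, and a production setup would more likely send these counts to whatever logging or metrics system you already use.

```javascript
// Sketch: count router outcomes by their `source` field (names are hypothetical).
function createRouteStats() {
  const counts = new Map(); // source -> number of responses

  return {
    // Records one router result and passes it through unchanged.
    record(result) {
      const key = result?.source ?? "unknown";
      counts.set(key, (counts.get(key) ?? 0) + 1);
      return result;
    },
    // Share of recorded responses with the given source, between 0 and 1.
    share(source) {
      let total = 0;
      for (const n of counts.values()) total += n;
      return total === 0 ? 0 : (counts.get(source) ?? 0) / total;
    },
    // Plain-object copy of the counts, convenient for logging.
    snapshot() {
      return Object.fromEntries(counts);
    },
  };
}

// Usage: the objects mimic what getSmartAIResponse returns in this article.
const stats = createRouteStats();
stats.record({ reply: "...", source: "tier-1-optimized" });
stats.record({ reply: "...", source: "tier-1-optimized" });
stats.record({ reply: "...", source: "tier-1-optimized" });
stats.record({ reply: "...", source: "tier-2-fallback", message: "Tier 1 service failed." });
console.log(stats.snapshot());               // { 'tier-1-optimized': 3, 'tier-2-fallback': 1 }
console.log(stats.share("tier-2-fallback")); // 0.25
```

Note that in the router shown in this article, the rate-limit fallback and the Tier 1 error fallback both report the source "tier-2-fallback", so telling them apart requires looking at the message field as well.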

The Full Router Function

Here is the complete version that connects all layers together. It is clean, fast, and fully asynchronous.

// --- Configuration Constants ---
// Placeholder values: load real keys from environment variables or a secret store,
// and never commit them to source control.
const PAID_API_KEY = "sk-YOUR-PAID-KEY";
const FREE_AI_API_KEY = "AIzaSy-YOUR-FREE-KEY";

const FREE_AI_URL = "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent";
const PAID_AI_URL = "https://api.openai.com/v1/chat/completions";

// --- Rate Limiting State ---
const requestMap = new Map();
const WINDOW_MS = 60000;
const MAX_REQUESTS = 5;

// --- UTILITY: TIER 2 FALLBACK & COMPRESSION ENGINE ---
async function handleTier2AICall(prompt) {
  try {
    const response = await fetch(`${FREE_AI_URL}?key=${FREE_AI_API_KEY}`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        contents: [{ parts: [{ text: prompt }] }],
      }),
    });
    if (!response.ok) throw new Error(`Tier 2 service error: ${response.status}`);
    const data = await response.json();
    const reply = data?.candidates?.[0]?.content?.parts?.[0]?.text;
    return { reply, source: "tier-2-fallback" };
  } catch (err) {
    console.error("Tier 2 service failure:", err.message);
    return { reply: "Service temporarily unavailable due to external API issues.", source: "error" };
  }
}

/**
 * Processes a user prompt using a cost-optimized, tiered routing strategy.
 * @param {string} userId - Identifier for rate limiting (e.g., user ID or IP address).
 * @param {string} userPrompt - The original, verbose input from the user.
 * @returns {Promise<{reply: string, source: string, message?: string}>}
 */
async function getSmartAIResponse(userId, userPrompt) {

  // LAYER 1: DEFENSE (Rate Limiting)
  const now = Date.now();
  const userState = requestMap.get(userId) || { count: 0, time: now };
  if (now - userState.time > WINDOW_MS) {
    userState.count = 0;
    userState.time = now;
  }

  if (userState.count >= MAX_REQUESTS) {
    return {
      ...(await handleTier2AICall(userPrompt)),
      message: "Rate limit exceeded. Request routed to Tier 2."
    };
  }
  userState.count++;
  requestMap.set(userId, userState);

  // Default to the original prompt in case compression fails
  let trimmedPrompt = userPrompt;

  // LAYER 2: OPTIMIZATION (Prompt Compression)
  const compressionPrompt = `Shorten the following text by 50% without losing the core meaning: "${userPrompt}"`;
  const compressionResult = await handleTier2AICall(compressionPrompt);

  if (compressionResult.reply && compressionResult.source !== "error") {
    trimmedPrompt = compressionResult.reply;
  }

  // PRIMARY EXECUTION: Tier 1 Paid API Call
  try {
    const paidResponse = await fetch(PAID_AI_URL, {
      method: "POST",
      headers: { "Authorization": `Bearer ${PAID_API_KEY}`, "Content-Type": "application/json" },
      body: JSON.stringify({
        model: "gpt-4o", // Tier 1 premium model, matching the tiers described above
        messages: [{ role: "user", content: trimmedPrompt }],
      }),
    });

    if (!paidResponse.ok) {
      throw new Error(`Tier 1 API failed with status ${paidResponse.status}`);
    }

    const paidData = await paidResponse.json();
    const reply = paidData?.choices?.[0]?.message?.content || "No content from Tier 1 API.";

    return { reply, source: "tier-1-optimized" };

  } catch (err) {
    console.error("Tier 1 API failure. Routing to Tier 2:", err.message);

    // LAYER 3: RELIABILITY (Error Fallback)
    return {
      ...(await handleTier2AICall(userPrompt)),
      message: "Tier 1 service failed. Request automatically routed to Tier 2 for continuity."
    };
  }
}

Why This Works

This setup wins because it does not over-engineer the problem. You use plain JavaScript features: fetch, Map, and try...catch. No fancy libraries, no magic.

The Cost Logic

Let us say your user sends a 1,000-token prompt.

Scenario	Input Tokens Sent to Tier 1	Tier 1 Input Cost Savings
Direct Tier 1 Call	1,000	0%
Smart Router Call	500 (after compression)	50%

The cost of running compression on a cheap model is usually small compared to what you save when sending fewer tokens to the expensive one. It is like paying a kid to fold your laundry before sending it to the dry cleaner.
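
To see how much the compression step eats into that 50%, the arithmetic can be sketched as below. Every price in PRICES is a placeholder assumption chosen for illustration, not a quote from any provider, and the function names are hypothetical. The sketch only covers input-side cost; the tokens the Tier 1 model writes back are billed the same either way.

```javascript
// Sketch: input-side cost of one request, with and without compression.
// All prices are placeholder assumptions in USD per million tokens, not real quotes.
const PRICES = {
  tier1Input: 2.5,
  tier2Input: 0.1,
  tier2Output: 0.4,
};

const perToken = (pricePerMillion) => pricePerMillion / 1e6;

// Cost of sending the full prompt straight to Tier 1.
function directCost(promptTokens) {
  return promptTokens * perToken(PRICES.tier1Input);
}

// Cost of compressing on Tier 2 first, then sending the shorter prompt to Tier 1.
function routedCost(promptTokens, compressionRatio = 0.5) {
  const compressedTokens = Math.round(promptTokens * compressionRatio);
  const compressionCost =
    promptTokens * perToken(PRICES.tier2Input) +     // Tier 2 reads the full prompt
    compressedTokens * perToken(PRICES.tier2Output); // Tier 2 writes the short version
  return compressionCost + compressedTokens * perToken(PRICES.tier1Input);
}

const direct = directCost(1000);
const routed = routedCost(1000);
const savings = 1 - routed / direct;
console.log(`direct: $${direct.toFixed(5)}, routed: $${routed.toFixed(5)}`);
console.log(`net input-side savings: ${(savings * 100).toFixed(1)}%`);
```

Under these assumed prices the net saving on a 1,000-token prompt comes out at about 38% rather than a full 50%, because the Tier 2 call is cheap but not free. The wider the price gap between the two tiers, the closer the result gets to 50%.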

Why JavaScript Fits the Job

You can run this pattern anywhere modern JavaScript runs: server, edge, or browser extension.

Edge Functions: Great for platforms like Cloudflare Workers or Vercel Edge. No extra libraries, very low latency.
Microservices: Easy to plug into your backend stack and share across teams.
Async Flow: The logic is readable, linear, and simple to maintain.

Conclusion

This Smart Request Router is not just a cost trick. It is a smart architecture for scaling AI apps. You make your app resilient, faster, and way cheaper without touching the quality of your main model.

You treat your expensive model like a celebrity: only bring it on stage when needed. The rest of the time, let your cheaper models handle the warm-up acts.

That is how you build an app that is not only smart but also financially wise. And your monthly API bill will finally stop looking like a horror story.
