Model Routing for Cost Savings: Cut LLM Costs 40-70% Without Sacrificing Quality
Model routing sends each LLM request to the cheapest capable model. This guide covers implementation patterns, routing strategies, and the cost math behind real savings.
You’re sending every LLM request to GPT-4o or Claude Opus. Every customer support reply. Every data extraction. Every classification. Every “summarize this email” call. They all cost the same: top-tier pricing for tasks that a model 10-30x cheaper could handle just as well.
This is like shipping every package via overnight express — even the ones that could go ground. Model routing fixes this.
What is model routing?
Model routing is a layer between your application and your LLM providers that dynamically selects which model handles each request. Instead of hardcoding model: "gpt-4o" everywhere, a router classifies each request by complexity and sends it to the cheapest model that can deliver acceptable quality.
The concept is simple:
- Simple tasks (classification, extraction, summarization of short text) → budget models (GPT-4o-mini, Claude Haiku 4.5, Gemini 2.0 Flash)
- Medium tasks (longer summarization, standard Q&A, structured output) → mid-tier models (Claude Sonnet 4.6, GPT-4o)
- Complex tasks (multi-step reasoning, code generation, creative writing, ambiguous queries) → frontier models (Claude Opus, GPT-4.5, Gemini Pro)
The price difference between tiers is dramatic:
| Model | Input cost (per 1M tokens) | Output cost (per 1M tokens) | Relative cost |
|---|---|---|---|
| GPT-4o-mini | $0.15 | $0.60 | 1x (baseline) |
| Claude Haiku 4.5 | $0.80 | $4.00 | ~5x |
| Claude Sonnet 4.6 | $3.00 | $15.00 | ~20x |
| GPT-4o | $2.50 | $10.00 | ~17x |
| Claude Opus 4.6 | $15.00 | $75.00 | ~100x |
| GPT-4.5 | $75.00 | $150.00 | ~500x |
When 60-80% of your production traffic is simple enough for a budget model, the math is compelling.
The cost savings math
Let’s work through a realistic scenario. Assume a production AI system handling 1 million requests per month with an average of 500 input tokens and 200 output tokens per request.
Before routing (everything on GPT-4o):
- Input: 1M × 500 tokens = 500M tokens → 500 × $2.50 = $1,250
- Output: 1M × 200 tokens = 200M tokens → 200 × $10.00 = $2,000
- Total: $3,250/month
After routing (70% to GPT-4o-mini, 20% to GPT-4o, 10% to Claude Opus):
- GPT-4o-mini (700K requests): $52.50 input + $84.00 output = $136.50
- GPT-4o (200K requests): $250.00 input + $400.00 output = $650.00
- Claude Opus (100K requests): $750.00 input + $1,500.00 output = $2,250.00
- Total: $3,036.50/month
Wait — that’s only 7% savings. The Opus calls ate the budget. Let’s re-route: 70% mini, 25% Sonnet, 5% Opus:
- GPT-4o-mini (700K): $136.50
- Claude Sonnet (250K): $375.00 input + $750.00 output = $1,125.00
- Claude Opus (50K): $375.00 input + $750.00 output = $1,125.00
- Total: $2,386.50/month — 27% savings
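The arithmetic above can be captured in a small cost model you can use to test your own traffic mix. This is an illustrative sketch: the `PRICES` table mirrors the per-1M-token prices quoted earlier in this article, and the request volume and token counts are the same assumptions as the worked example.

```python
# Per-1M-token (input, output) prices from the table above. Illustrative;
# verify current rates on provider pricing pages.
PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-opus-4-6": (15.00, 75.00),
    "gpt-4o": (2.50, 10.00),
}

def monthly_cost(mix, requests=1_000_000, in_tok=500, out_tok=200):
    """Blended monthly cost for a traffic mix like {"gpt-4o-mini": 0.70, ...}."""
    total = 0.0
    for model, share in mix.items():
        p_in, p_out = PRICES[model]
        n = requests * share
        total += (n * in_tok / 1e6) * p_in + (n * out_tok / 1e6) * p_out
    return total

baseline = monthly_cost({"gpt-4o": 1.0})            # ≈ $3,250.00
routed = monthly_cost({"gpt-4o-mini": 0.70,
                       "claude-sonnet-4-6": 0.25,
                       "claude-opus-4-6": 0.05})    # ≈ $2,386.50
```

Plugging in different mixes makes the sensitivity obvious: the Opus share dominates the bill, so shifting even a few percentage points from the frontier tier to the mid-tier moves the total more than anything else.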
The real wins come when you can push more traffic to the budget tier. Teams with high-volume, lower-complexity workloads (customer support, data extraction, content tagging) routinely see 50-70% savings.
Implementation patterns
There are three common approaches to model routing, each with different trade-offs.
Pattern 1: Rule-based routing
The simplest approach. Define rules based on request metadata — endpoint, feature flag, prompt length, or user tier.
```python
def route_request(request):
    # Short prompts for simple tasks → budget model
    if len(request.prompt) < 200 and request.feature == "classify":
        return "gpt-4o-mini"
    # Customer-facing responses → mid-tier
    if request.feature in ["support_reply", "summary"]:
        return "claude-sonnet-4-6"
    # Complex reasoning → frontier
    if request.feature in ["code_gen", "analysis", "planning"]:
        return "claude-opus-4-6"
    # Default fallback
    return "gpt-4o"
```
Pros: Zero latency overhead, fully deterministic, easy to debug.
Cons: Requires manual classification of every feature. Doesn’t adapt to request content. Misroutes edge cases.
Best for: Teams with well-defined request types and low tolerance for routing errors.
Pattern 2: Classifier-based routing
Use a lightweight ML model or LLM call to classify request complexity before routing.
```python
CLASSIFIER_PROMPT = """Rate this request's complexity 1-3:
1 = Simple (classification, extraction, short summary)
2 = Medium (Q&A, longer summary, structured output)
3 = Complex (reasoning, code generation, creative, ambiguous)

Request: {prompt}

Rating:"""

async def route_request(request):
    # Use the cheapest model as the classifier
    rating = await llm_call(
        model="gpt-4o-mini",
        prompt=CLASSIFIER_PROMPT.format(prompt=request.prompt),
        max_tokens=1,
    )
    routes = {"1": "gpt-4o-mini", "2": "claude-sonnet-4-6", "3": "claude-opus-4-6"}
    return routes.get(rating.strip(), "claude-sonnet-4-6")
```
Pros: Adapts to request content. Handles edge cases better. Can improve over time.
Cons: Adds latency (one extra LLM call). Classifier itself costs money. Classifier can misclassify.
Best for: Teams with diverse, unpredictable request types where rule-based routing can’t cover the variety.
Cost of the classifier: At GPT-4o-mini pricing, classifying 1M requests costs roughly $15-30 — negligible compared to the routing savings.
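That estimate is easy to check with a one-line cost formula. The sketch below assumes GPT-4o-mini pricing ($0.15/1M input, $0.60/1M output) and a single output token per classification (the rating digit); your classifier prompt length is the main variable.

```python
def classifier_cost(requests, prompt_tokens,
                    price_in_per_m=0.15, price_out_per_m=0.60):
    """Rough monthly cost of the routing classifier.

    Assumes GPT-4o-mini pricing and ~1 output token (the rating digit).
    """
    return requests * (prompt_tokens * price_in_per_m + 1 * price_out_per_m) / 1e6

classifier_cost(1_000_000, 100)  # ≈ $15.60
classifier_cost(1_000_000, 200)  # ≈ $30.60
```

At 100-200 prompt tokens per classification, the $15-30 range holds; a longer classifier prompt scales the cost linearly, which matters when you include few-shot examples.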
Pattern 3: Cascade routing (try cheap first)
Send every request to the cheapest model first. If the response quality is below a threshold, retry with a more expensive model.
```python
async def cascade_route(request):
    # Try budget model first
    response = await llm_call(model="gpt-4o-mini", prompt=request.prompt)

    # Check quality (confidence score, response length, format compliance)
    if quality_check(response, request.expected_format):
        return response  # Budget model was good enough

    # Fall back to mid-tier
    response = await llm_call(model="claude-sonnet-4-6", prompt=request.prompt)
    if quality_check(response, request.expected_format):
        return response

    # Last resort: frontier model
    return await llm_call(model="claude-opus-4-6", prompt=request.prompt)
```
Pros: Maximizes budget-tier usage. Quality floor is guaranteed. No misrouting risk.
Cons: Higher latency for complex requests (2-3x calls). Total cost can be higher if quality checks fail frequently. Harder to implement reliable quality checks.
Best for: Tasks where you can objectively verify output quality (structured output, JSON schema compliance, factual extraction).
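Whether a cascade actually saves money depends on how often the cheap tiers fail, since failed attempts are still billed. A quick expected-cost sketch makes the break-even point concrete; the per-request costs below are illustrative, not measured.

```python
def cascade_expected_cost(c_cheap, c_mid, c_top, p_fail_cheap, p_fail_mid):
    """Expected per-request cost of a two-fallback cascade.

    Every request pays for the cheap attempt; a fraction p_fail_cheap
    also pays for the mid-tier retry, and of those, p_fail_mid also
    pays for the frontier retry.
    """
    return (c_cheap
            + p_fail_cheap * c_mid
            + p_fail_cheap * p_fail_mid * c_top)

# Illustrative per-request costs with 20% / 30% fallback rates:
cascade_expected_cost(0.0002, 0.0045, 0.0225, 0.20, 0.30)  # ≈ $0.00245
```

In this example the cascade still beats sending everything to the mid-tier (~$0.0045/request), but push the cheap-tier fallback rate toward 100% and the cascade becomes strictly more expensive than routing directly, which is why the fallback rate belongs on your dashboard.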
Building your quality check
The cascade pattern and classifier pattern both depend on evaluating whether a cheaper model’s output is “good enough.” Here are practical quality signals:
| Signal | How to check | Works for |
|---|---|---|
| Format compliance | Response matches expected JSON/XML schema | Structured output, data extraction |
| Confidence score | Model’s log-probabilities above threshold | Classification tasks |
| Response length | Output within expected range | Summarization, Q&A |
| Keyword presence | Required terms appear in response | Factual extraction |
| Self-consistency | Run twice, compare outputs | Any task (expensive but reliable) |
| Downstream success | API call succeeds, code compiles | Tool use, code generation |
The simplest starting point: if you’re extracting structured data, validate the JSON schema. If it parses, the budget model was good enough.
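A minimal version of that format-compliance check might look like the sketch below. It only validates that the output parses as JSON and contains an expected set of top-level keys; the key names are hypothetical, and a real implementation would likely validate against a full schema.

```python
import json

def quality_check(response_text, required_keys):
    """Pass if the response parses as a JSON object with the expected fields."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(k in data for k in required_keys)

quality_check('{"name": "Ada", "email": "ada@example.com"}', {"name", "email"})  # True
quality_check('Sure! Here is the data you asked for...', {"name", "email"})      # False
```

The second case is the one that matters in practice: budget models are more likely to wrap structured output in conversational filler, and a parse failure is an unambiguous signal to escalate.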
What to monitor
Model routing without monitoring is cost optimization without accountability. Track these metrics per route:
- Cost per route — Are you actually saving money? Track the percentage of traffic per model tier and the total cost per tier.
- Quality per route — Are budget-model responses as good? Compare user satisfaction, task success rate, or downstream metrics per model.
- Routing accuracy — For classifier-based routing, how often does the classifier agree with a human assessment of complexity? Sample and review weekly.
- Latency per route — Budget models are typically faster. If your cascade pattern adds latency for complex requests, quantify the trade-off.
- Fallback rate — In cascade routing, what percentage of requests fall through to the next tier? A high fallback rate means your quality check is too strict or your budget model can't handle the traffic mix.
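An in-process tally of these metrics can be sketched in a few lines. This is a toy illustration, not a production design; a real system would export these counters to your metrics backend (Prometheus, Datadog, or similar) rather than hold them in memory.

```python
from collections import defaultdict

class RouteMetrics:
    """Tracks request counts, spend, and cascade fallbacks per model route."""

    def __init__(self):
        self.requests = defaultdict(int)
        self.cost = defaultdict(float)
        self.fallbacks = 0

    def record(self, model, cost, fell_back=False):
        self.requests[model] += 1
        self.cost[model] += cost
        if fell_back:
            self.fallbacks += 1

    def summary(self):
        """Traffic share and accumulated spend per route."""
        total = sum(self.requests.values()) or 1
        return {model: {"share": self.requests[model] / total,
                        "cost": self.cost[model]}
                for model in self.requests}

m = RouteMetrics()
m.record("gpt-4o-mini", 0.0002)
m.record("gpt-4o-mini", 0.0002)
m.record("claude-opus-4-6", 0.02, fell_back=True)
```

Even this crude split answers the two questions that matter for tuning: what fraction of traffic each tier actually handles, and which tier is eating the budget.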
A cost tracking dashboard that breaks down spend by model, feature, and route is essential for tuning. Without it, you’re routing blind.
Common mistakes
Over-routing to frontier models. Teams new to routing often set conservative thresholds — “when in doubt, use the expensive model.” This defeats the purpose. Start aggressive (route 80% to budget), monitor quality, and pull back only where you see degradation.
Ignoring the classifier cost. If your classifier prompt is 500 tokens and you’re classifying 1M requests/month on GPT-4o-mini, that’s about $75. Negligible. But if you accidentally use Claude Opus as the classifier, it’s $7,500. Always use the cheapest model for classification.
Not measuring quality per route. You routed 70% of traffic to GPT-4o-mini and your overall costs dropped 50%. Celebration. But customer satisfaction also dropped 15% and nobody noticed because the metric wasn’t segmented by route. Always measure quality per model tier.
Static routing rules that rot. Your traffic mix changes as features evolve. A rule that routed “support replies” to the mid-tier made sense when replies were simple. Now support handles complex troubleshooting and the mid-tier model hallucinates steps. Review routing rules quarterly.
Getting started: a 3-step plan
Step 1: Measure your current traffic mix. Before routing, you need to know what you’re routing. Instrument your LLM calls with feature tags and prompt length. Run this for 2 weeks. You’ll likely find that 60-80% of calls are simpler than you thought.
Step 2: Start with rule-based routing on your highest-volume, simplest feature. Pick the one feature that generates the most LLM calls and has the most predictable output format. Route it to GPT-4o-mini. Monitor quality for 1 week. If quality holds, expand to the next feature.
Step 3: Add a classifier for the ambiguous middle. Once you’ve routed the obvious simple and obvious complex tasks, the remaining 20-30% needs a classifier. Implement a lightweight classifier prompt and tune the thresholds based on production quality data.
Most teams see meaningful savings within the first week of Step 2.
Further reading
- Token Cost Tracking — Instrument your LLM calls for per-feature, per-model cost attribution
- Budget Alerts — Set spending limits per team and model tier
- Semantic Caching for Cost Reduction — The other major cost optimization lever, complementary to routing
- AI Cost Tracking Tools Compared — Tools that help you measure before you optimize
- Enterprise AI Cost Governance — Multi-team cost attribution and budget controls
Model pricing data accurate as of April 2026. Prices change frequently — verify current rates on provider pricing pages before building financial models.