AI Vyuh FinOps
Model Routing · LLM Costs · Cost Optimization · AI FinOps

Model Routing for Cost Savings: Cut LLM Costs 40-70% Without Sacrificing Quality

Model routing sends each LLM request to the cheapest capable model. Implementation patterns, routing strategies, and real AI cost optimization benchmarks.

AI Vyuh Engineering

You’re sending every LLM request to GPT-4o or Claude Opus. Every customer support reply. Every data extraction. Every classification. Every “summarize this email” call. They all cost the same: top-tier pricing for tasks that a model 10-30x cheaper could handle just as well.

This is like shipping every package via overnight express — even the ones that could go ground. Model routing fixes this.


What is model routing?

Model routing is a layer between your application and your LLM providers that dynamically selects which model handles each request. Instead of hardcoding model: "gpt-4o" everywhere, a router classifies each request by complexity and sends it to the cheapest model that can deliver acceptable quality.

The concept is simple:

  • Simple tasks (classification, extraction, summarization of short text) → budget models (GPT-4o-mini, Claude Haiku 4.5, Gemini 2.0 Flash)
  • Medium tasks (longer summarization, standard Q&A, structured output) → mid-tier models (Claude Sonnet 4.6, GPT-4o)
  • Complex tasks (multi-step reasoning, code generation, creative writing, ambiguous queries) → frontier models (Claude Opus, GPT-4.5, Gemini Pro)

The price difference between tiers is dramatic:

| Model | Input cost (per 1M tokens) | Output cost (per 1M tokens) | Relative cost |
|---|---|---|---|
| GPT-4o-mini | $0.15 | $0.60 | 1x (baseline) |
| Claude Haiku 4.5 | $0.80 | $4.00 | ~5x |
| Claude Sonnet 4.6 | $3.00 | $15.00 | ~20x |
| GPT-4o | $2.50 | $10.00 | ~17x |
| Claude Opus 4.6 | $15.00 | $75.00 | ~100x |
| GPT-4.5 | $75.00 | $150.00 | ~500x |

When 60-80% of your production traffic is simple enough for a budget model, the math is compelling.


The cost savings math

Let’s work through a realistic scenario. Assume a production AI system handling 1 million requests per month with an average of 500 input tokens and 200 output tokens per request.

Before routing (everything on GPT-4o):

  • Input: 1M × 500 tokens = 500M tokens → 500 × $2.50 = $1,250
  • Output: 1M × 200 tokens = 200M tokens → 200 × $10.00 = $2,000
  • Total: $3,250/month

After routing (70% to GPT-4o-mini, 20% to GPT-4o, 10% to Claude Opus):

  • GPT-4o-mini (700K requests): $52.50 input + $84.00 output = $136.50
  • GPT-4o (200K requests): $250.00 input + $400.00 output = $650.00
  • Claude Opus (100K requests): $750.00 input + $1,500.00 output = $2,250.00
  • Total: $3,036.50/month

Wait — that’s only 7% savings. The Opus calls ate the budget. Let’s re-route: 70% mini, 25% Sonnet, 5% Opus:

  • GPT-4o-mini (700K): $136.50
  • Claude Sonnet (250K): $375.00 input + $750.00 output = $1,125.00
  • Claude Opus (50K): $375.00 input + $750.00 output = $1,125.00
  • Total: $2,386.50/month — 27% savings

The real wins come when you can push more traffic to the budget tier. Teams with high-volume, lower-complexity workloads (customer support, data extraction, content tagging) routinely see 50-70% savings.
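The arithmetic above is easy to reproduce as a small cost model. A minimal sketch, using the prices from the table earlier (the `monthly_cost` helper and its defaults are illustrative, not a real API):

```python
# Per-1M-token prices (USD) from the pricing table above; verify current rates
# on provider pricing pages before relying on these numbers.
PRICES = {
    "gpt-4o-mini":       {"input": 0.15,  "output": 0.60},
    "gpt-4o":            {"input": 2.50,  "output": 10.00},
    "claude-sonnet-4-6": {"input": 3.00,  "output": 15.00},
    "claude-opus-4-6":   {"input": 15.00, "output": 75.00},
}

def monthly_cost(mix, requests=1_000_000, in_tokens=500, out_tokens=200):
    """mix maps model name -> fraction of traffic; returns USD per month."""
    total = 0.0
    for model, share in mix.items():
        n = requests * share
        total += n * in_tokens / 1e6 * PRICES[model]["input"]
        total += n * out_tokens / 1e6 * PRICES[model]["output"]
    return round(total, 2)

# Everything on GPT-4o: $3,250/month
baseline = monthly_cost({"gpt-4o": 1.0})

# 70% mini / 25% Sonnet / 5% Opus: $2,386.50/month (27% savings)
routed = monthly_cost({
    "gpt-4o-mini": 0.70,
    "claude-sonnet-4-6": 0.25,
    "claude-opus-4-6": 0.05,
})
```

Plugging in your own traffic mix before and after routing is the fastest way to sanity-check whether a proposed split actually saves money.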


Implementation patterns

There are three common approaches to model routing, each with different trade-offs.

Pattern 1: Rule-based routing

The simplest approach. Define rules based on request metadata — endpoint, feature flag, prompt length, or user tier.

def route_request(request):
    # Short prompts for simple tasks → budget model
    if len(request.prompt) < 200 and request.feature == "classify":
        return "gpt-4o-mini"
    
    # Customer-facing responses → mid-tier
    if request.feature in ["support_reply", "summary"]:
        return "claude-sonnet-4-6"
    
    # Complex reasoning → frontier
    if request.feature in ["code_gen", "analysis", "planning"]:
        return "claude-opus-4-6"
    
    # Default fallback
    return "gpt-4o"

Pros: Zero latency overhead, fully deterministic, easy to debug.
Cons: Requires manual classification of every feature. Doesn’t adapt to request content. Misroutes edge cases.

Best for: Teams with well-defined request types and low tolerance for routing errors.

Pattern 2: Classifier-based routing

Use a lightweight ML model or LLM call to classify request complexity before routing.

CLASSIFIER_PROMPT = """Rate this request's complexity 1-3:
1 = Simple (classification, extraction, short summary)
2 = Medium (Q&A, longer summary, structured output)
3 = Complex (reasoning, code generation, creative, ambiguous)

Request: {prompt}
Rating:"""

async def route_request(request):
    # Use the cheapest model as the classifier
    rating = await llm_call(
        model="gpt-4o-mini",
        prompt=CLASSIFIER_PROMPT.format(prompt=request.prompt),
        max_tokens=1
    )
    
    routes = {"1": "gpt-4o-mini", "2": "claude-sonnet-4-6", "3": "claude-opus-4-6"}
    return routes.get(rating.strip(), "claude-sonnet-4-6")

Pros: Adapts to request content. Handles edge cases better. Can improve over time.
Cons: Adds latency (one extra LLM call). Classifier itself costs money. Classifier can misclassify.

Best for: Teams with diverse, unpredictable request types where rule-based routing can’t cover the variety.

Cost of the classifier: At GPT-4o-mini pricing, classifying 1M requests with a 100-200 token classifier prompt costs roughly $15-30 — negligible compared to the routing savings.

Pattern 3: Cascade routing (try cheap first)

Send every request to the cheapest model first. If the response quality is below a threshold, retry with a more expensive model.

async def cascade_route(request):
    # Try budget model first
    response = await llm_call(model="gpt-4o-mini", prompt=request.prompt)
    
    # Check quality (confidence score, response length, format compliance)
    if quality_check(response, request.expected_format):
        return response  # Budget model was good enough
    
    # Fallback to mid-tier
    response = await llm_call(model="claude-sonnet-4-6", prompt=request.prompt)
    if quality_check(response, request.expected_format):
        return response
    
    # Last resort: frontier model
    return await llm_call(model="claude-opus-4-6", prompt=request.prompt)

Pros: Maximizes budget-tier usage. Quality floor is guaranteed. No misrouting risk.
Cons: Higher latency for complex requests (2-3x calls). Total cost can be higher if quality checks fail frequently. Harder to implement reliable quality checks.

Best for: Tasks where you can objectively verify output quality (structured output, JSON schema compliance, factual extraction).


Building your quality check

The cascade pattern and classifier pattern both depend on evaluating whether a cheaper model’s output is “good enough.” Here are practical quality signals:

| Signal | How to check | Works for |
|---|---|---|
| Format compliance | Response matches expected JSON/XML schema | Structured output, data extraction |
| Confidence score | Model's log-probabilities above threshold | Classification tasks |
| Response length | Output within expected range | Summarization, Q&A |
| Keyword presence | Required terms appear in response | Factual extraction |
| Self-consistency | Run twice, compare outputs | Any task (expensive but reliable) |
| Downstream success | API call succeeds, code compiles | Tool use, code generation |

The simplest starting point: if you’re extracting structured data, validate the JSON schema. If it parses, the budget model was good enough.
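For structured extraction, that format-compliance check can be a few lines. A minimal sketch — the required field names here are hypothetical, and this is one possible shape for the `quality_check` used in the cascade pattern:

```python
import json

# Hypothetical schema for an extraction task: swap in your own fields.
REQUIRED_FIELDS = {"name", "email", "company"}

def quality_check(response_text: str) -> bool:
    """Return True if the response parses as JSON and contains every required field."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_FIELDS <= data.keys()

# A compliant budget-model response passes the check...
assert quality_check('{"name": "Ada", "email": "ada@example.com", "company": "Acme"}')
# ...while truncated or malformed output triggers the cascade fallback.
assert not quality_check('{"name": "Ada", "email":')
```

For stricter validation (types, value ranges, nested objects), a JSON Schema validator does the same job with more precision, at the cost of an extra dependency.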


What to monitor

Model routing without monitoring is cost optimization without accountability. Track these metrics per route:

  1. Cost per route — Are you actually saving money? Track the percentage of traffic per model tier and the total cost per tier.

  2. Quality per route — Are budget-model responses as good? Compare user satisfaction, task success rate, or downstream metrics per model.

  3. Routing accuracy — For classifier-based routing, how often does the classifier agree with a human assessment of complexity? Sample and review weekly.

  4. Latency per route — Budget models are typically faster. If your cascade pattern adds latency for complex requests, quantify the trade-off.

  5. Fallback rate — In cascade routing, what percentage of requests fall through to the next tier? A high fallback rate means your quality check is too strict or your budget model can’t handle the traffic mix.

A cost tracking dashboard that breaks down spend by model, feature, and route is essential for tuning. Without it, you’re routing blind.
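The per-route breakdown doesn't need heavy tooling to start. A minimal in-process sketch of the aggregation (class and field names are illustrative; in production you'd emit these to your metrics pipeline instead):

```python
from collections import defaultdict

class RouteMetrics:
    """Aggregates per-model traffic share, spend, and cascade fallback rate."""

    def __init__(self):
        self.requests = defaultdict(int)
        self.cost = defaultdict(float)
        self.fallbacks = 0
        self.total = 0

    def record(self, model: str, cost_usd: float, fell_back: bool = False):
        """Call once per completed request, tagged with the model that served it."""
        self.total += 1
        self.requests[model] += 1
        self.cost[model] += cost_usd
        if fell_back:
            self.fallbacks += 1

    def summary(self):
        """Traffic share and spend per model, plus overall fallback rate."""
        return {
            "traffic_share": {m: n / self.total for m, n in self.requests.items()},
            "cost_per_model": dict(self.cost),
            "fallback_rate": self.fallbacks / self.total,
        }
```

Reviewing `summary()` per feature (not just globally) is what surfaces the "quality dropped on one route" failures described below.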


Common mistakes

Over-routing to frontier models. Teams new to routing often set conservative thresholds — “when in doubt, use the expensive model.” This defeats the purpose. Start aggressive (route 80% to budget), monitor quality, and pull back only where you see degradation.

Ignoring the classifier cost. If your classifier prompt is 500 tokens and you’re classifying 1M requests/month on GPT-4o-mini, that’s about $75. Still negligible. But if you accidentally use Claude Opus as the classifier, it’s $7,500. Always use the cheapest model for classification.

Not measuring quality per route. You routed 70% of traffic to GPT-4o-mini and your overall costs dropped 50%. Celebration. But customer satisfaction also dropped 15% and nobody noticed because the metric wasn’t segmented by route. Always measure quality per model tier.

Static routing rules that rot. Your traffic mix changes as features evolve. A rule that routed “support replies” to the mid-tier made sense when replies were simple. Now support handles complex troubleshooting and the mid-tier model hallucinates steps. Review routing rules quarterly.


Getting started: a 3-step plan

Step 1: Measure your current traffic mix. Before routing, you need to know what you’re routing. Instrument your LLM calls with feature tags and prompt length. Run this for 2 weeks. You’ll likely find that 60-80% of calls are simpler than you thought.
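One lightweight way to collect that traffic data is to wrap your existing call sites and log a feature tag plus a prompt-length proxy per request. A sketch under the assumption that you have some `llm_call` function to wrap (the log sink here is a plain list standing in for your metrics pipeline):

```python
import time

call_log = []  # stand-in for your real metrics pipeline or warehouse

def instrumented_call(llm_call, model: str, prompt: str, feature: str):
    """Wrap an LLM call and record the fields needed to size a routing plan."""
    start = time.monotonic()
    response = llm_call(model=model, prompt=prompt)
    call_log.append({
        "feature": feature,
        "model": model,
        "prompt_chars": len(prompt),  # cheap proxy for input tokens
        "latency_s": time.monotonic() - start,
    })
    return response
```

Two weeks of records like these, grouped by feature and prompt length, is enough to identify which features belong in the budget tier.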

Step 2: Start with rule-based routing on your highest-volume, simplest feature. Pick the one feature that generates the most LLM calls and has the most predictable output format. Route it to GPT-4o-mini. Monitor quality for 1 week. If quality holds, expand to the next feature.

Step 3: Add a classifier for the ambiguous middle. Once you’ve routed the obvious simple and obvious complex tasks, the remaining 20-30% needs a classifier. Implement a lightweight classifier prompt and tune the thresholds based on production quality data.

Most teams see meaningful savings within the first week of Step 2.


Model pricing data accurate as of April 2026. Prices change frequently — verify current rates on provider pricing pages before building financial models.