# Semantic Caching for Cost Reduction: Save 20-40% on LLM API Bills
Semantic caching matches similar prompts to cached LLM responses. This guide covers how it works, implementation patterns, realistic cache hit rates, and the savings you can expect.
Every production LLM system answers the same questions over and over. Your customer support agent gets “how do I reset my password?” fifty times a day — and you pay for fifty separate API calls to generate essentially identical responses.
Traditional caching doesn’t help because users phrase questions differently every time. “Reset my password,” “I forgot my login,” “how to change password” — different strings, same intent. Semantic caching solves this by matching on meaning, not exact text.
## How semantic caching works
Semantic caching adds a matching layer between your application and your LLM provider. Here’s the flow:
1. Embed the prompt. Convert the incoming prompt to a vector embedding using a lightweight model (OpenAI text-embedding-3-small, Cohere embed-v3, or a local model like sentence-transformers).
2. Search the cache. Query your vector store for embeddings similar to the new prompt, using cosine similarity. If a match exceeds your similarity threshold (typically 0.90-0.95), it’s a cache hit.
3. Return or generate. On a cache hit, return the stored response immediately — zero LLM cost, sub-10ms latency. On a cache miss, send the request to your LLM, store the response with its embedding, and return it.
The critical component is the similarity threshold. Too high (0.98+) and you’ll rarely get cache hits. Too low (0.85 or below) and you’ll return wrong answers for genuinely different questions.
```
User prompt → Embed → Vector search (cache)
  ├── Hit (similarity ≥ threshold) → Return cached response
  └── Miss → LLM call → Store response + embedding → Return
```
## What makes it “semantic”
Traditional caches use exact string matching or hash-based lookups. The prompt must be character-for-character identical to trigger a hit. In practice, this means almost zero cache hits for natural language — users never phrase things identically.
Semantic caching uses embeddings to represent prompts as vectors in a high-dimensional space. Prompts with similar meaning end up near each other regardless of phrasing:
| Prompt A | Prompt B | Cosine similarity |
|---|---|---|
| "What is the capital of France?" | "France's capital city?" | 0.96 |
| "How do I reset my password?" | "I forgot my login credentials" | 0.91 |
| "Summarize this quarterly report" | "Give me a summary of Q2 earnings" | 0.89 |
| "Write a poem about dogs" | "Explain quantum computing" | 0.12 |
The embedding model understands that “reset my password” and “forgot my login credentials” have the same intent — even though they share almost no words.
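As a toy illustration of the mechanism, here is cosine similarity computed over hand-made 3-dimensional vectors (real embeddings have hundreds to thousands of dimensions, and these toy values are invented for the example):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 = same direction, 0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made 3-d stand-ins for real embeddings (illustrative values only)
reset_pw = np.array([0.9, 0.1, 0.05])      # "How do I reset my password?"
forgot_login = np.array([0.85, 0.2, 0.1])  # "I forgot my login credentials"
quantum = np.array([0.05, 0.1, 0.95])      # "Explain quantum computing"

print(cosine_similarity(reset_pw, forgot_login))  # high: same intent
print(cosine_similarity(reset_pw, quantum))       # low: unrelated topics
```

Two prompts with the same intent land close together in vector space and score near 1.0; unrelated prompts score near 0.0, which is exactly what the threshold check exploits.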
## Cost savings benchmarks
The savings depend on your cache hit rate, which depends on how repetitive your traffic is. Here are benchmarks by use case:
| Use case | Typical cache hit rate | Why |
|---|---|---|
| Customer support chatbot | 40-60% | Users ask the same 50-100 questions repeatedly |
| Internal knowledge base / FAQ | 30-50% | Employees search for the same policies and procedures |
| Data extraction / classification | 25-40% | Same document types, same extraction patterns |
| Content generation (personalized) | 10-20% | Each prompt includes unique user context |
| Code generation | 5-15% | Highly variable prompts, context-dependent output |
| Creative writing | 5-10% | Almost every prompt is unique |
### Worked example: customer support chatbot
A SaaS company handles 100,000 customer support queries per month through their AI chatbot, using Claude Sonnet at $3/$15 per million tokens. Average query: 300 input tokens, 500 output tokens.
Without caching:
- Input: 100K × 300 = 30M tokens → $90
- Output: 100K × 500 = 50M tokens → $750
- Monthly cost: $840
With semantic caching (45% hit rate):
- 55,000 cache misses → same per-call cost
- Input: 55K × 300 = 16.5M tokens → $49.50
- Output: 55K × 500 = 27.5M tokens → $412.50
- Embedding cost: 100K × 300 tokens via text-embedding-3-small = 30M tokens → $0.60
- Vector search: negligible (Redis or in-memory)
- Monthly cost: $462.60 — 45% savings
The embedding cost is almost invisible: text-embedding-3-small costs $0.02 per million tokens. Even at 100K queries, embedding adds $0.60/month.
For teams using frontier models (Claude Opus at $15/$75 per million tokens), the same 45% hit rate saves proportionally more in absolute dollars.
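The arithmetic above can be checked with a short script, using the prices from the example (Sonnet at $3/$15 per million tokens, text-embedding-3-small at $0.02 per million):

```python
def monthly_cost(queries, in_tok, out_tok, in_price, out_price,
                 hit_rate=0.0, embed_price=0.0):
    """Monthly cost in dollars; prices are per million tokens."""
    misses = queries * (1 - hit_rate)
    llm = (misses * in_tok * in_price + misses * out_tok * out_price) / 1e6
    # Every query is embedded, hit or miss
    embed = queries * in_tok * embed_price / 1e6
    return llm + embed

baseline = monthly_cost(100_000, 300, 500, 3, 15)       # $840.00
cached = monthly_cost(100_000, 300, 500, 3, 15,
                      hit_rate=0.45, embed_price=0.02)  # $462.60
print(f"${baseline:.2f} -> ${cached:.2f} ({1 - cached / baseline:.0%} saved)")
```

Swapping in Opus pricing (`in_price=15, out_price=75`) reproduces the point about frontier models: the same hit rate saves five times as many dollars.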
## Implementation guide

### Step 1: Choose your embedding model
The embedding model determines cache quality. You want high semantic accuracy, low latency, and low cost.
| Model | Dimensions | Cost per 1M tokens | Latency | Notes |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | $0.02 | ~50ms | Best price/performance. Production standard |
| OpenAI text-embedding-3-large | 3072 | $0.13 | ~80ms | Higher accuracy, 6.5x cost. Rarely needed |
| Cohere embed-v3 | 1024 | $0.10 | ~60ms | Good multilingual support |
| sentence-transformers (local) | 384-768 | $0 (compute only) | ~10ms | Zero API cost but requires hosting |
For most teams, text-embedding-3-small is the right choice. The accuracy difference between small and large is minimal for cache matching, and the cost is negligible.
### Step 2: Choose your vector store
The vector store holds your cached embeddings and handles similarity search.
| Store | Type | Best for | Cache lookup latency |
|---|---|---|---|
| Redis + RediSearch | In-memory | Low latency, simple setup | < 1ms |
| Qdrant | Vector DB | Large cache sizes, filtering | 1-5ms |
| Pinecone | Managed | Zero-ops, serverless | 5-20ms |
| SQLite + faiss | Embedded | Local development, small scale | < 1ms |
| In-memory dict | Python dict | Prototyping only | < 0.1ms |
For production, Redis with RediSearch is the most common choice. Sub-millisecond lookups, simple to operate, and you probably already have Redis in your stack.
### Step 3: Implement the cache layer
Here’s the core pattern in Python:
```python
import hashlib
import json

import numpy as np
import redis
from openai import OpenAI

client = OpenAI()
cache = redis.Redis(host="localhost", port=6379, db=0)

SIMILARITY_THRESHOLD = 0.92
CACHE_TTL = 86400 * 7  # 7 days

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_cache_lookup(prompt: str) -> str | None:
    prompt_embedding = get_embedding(prompt)
    # Scan cached embeddings (use a vector search index in production)
    for key in cache.scan_iter("cache:embedding:*"):
        cached = json.loads(cache.get(key))
        similarity = cosine_similarity(prompt_embedding, cached["embedding"])
        if similarity >= SIMILARITY_THRESHOLD:
            return cached["response"]
    return None

def cache_response(prompt: str, response: str):
    # Note: this re-embeds a prompt the lookup already embedded on a miss;
    # in production, pass the embedding through to avoid the duplicate call
    embedding = get_embedding(prompt)
    cache_key = f"cache:embedding:{hashlib.md5(prompt.encode()).hexdigest()}"
    cache.setex(
        cache_key,
        CACHE_TTL,
        json.dumps({"embedding": embedding, "response": response, "prompt": prompt}),
    )

# Usage in your LLM call wrapper
async def llm_call(prompt: str, model: str = "claude-sonnet-4-6") -> str:
    # Check cache first
    cached = semantic_cache_lookup(prompt)
    if cached:
        return cached  # Free! No LLM call

    # Cache miss — call the LLM via your existing provider wrapper
    response = await actual_llm_call(model=model, prompt=prompt)
    cache_response(prompt, response)
    return response
```
Important: The example above uses a linear scan for simplicity. In production with 10K+ cached entries, use Redis vector search (RediSearch) or a dedicated vector database; approximate nearest-neighbor indexes give sublinear lookups instead of O(n) scans.
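For reference, creating such an index with RediSearch might look like the sketch below. The index name, field names, and key prefix are illustrative; note that RediSearch with `DISTANCE_METRIC COSINE` returns cosine *distance* (1 − similarity), so a 0.92 similarity threshold becomes a 0.08 distance cutoff.

```shell
# Create a vector index over hashes with the cache: prefix (RediSearch 2.4+).
# Index name "cache_idx" and field names are illustrative.
redis-cli FT.CREATE cache_idx ON HASH PREFIX 1 "cache:" SCHEMA \
  response TEXT \
  embedding VECTOR FLAT 6 TYPE FLOAT32 DIM 1536 DISTANCE_METRIC COSINE

# KNN query: the single nearest cached embedding to $vec, where $vec is the
# query embedding packed as a little-endian float32 byte string.
redis-cli FT.SEARCH cache_idx '*=>[KNN 1 @embedding $vec AS score]' \
  PARAMS 2 vec "<binary float32 blob>" SORTBY score DIALECT 2
```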
### Step 4: Tune the similarity threshold
Start strict and loosen gradually:
- Week 1: Set threshold to 0.95. Measure cache hit rate and sample 100 cached responses for quality. Expected hit rate: 10-20%.
- Week 2: If quality looks good, lower to 0.92. Measure again. Expected hit rate: 20-35%.
- Week 3: If still good, try 0.90. This is usually the sweet spot for customer-facing applications. Expected hit rate: 30-50%.
- Below 0.90: Quality degradation becomes noticeable. Only go lower for internal tools where occasional wrong answers are acceptable.
Track these metrics while tuning:
| Metric | What to watch |
|---|---|
| Cache hit rate | Should increase as you lower threshold |
| False positive rate | Sample cached responses — are they correct for the new prompt? |
| User satisfaction | NPS, CSAT, or thumbs-up/down per response |
| Latency (p50, p99) | Cache hits should be < 50ms; misses = normal LLM latency |
| Cost per query | Should decrease proportionally with hit rate |
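One way to quantify the hit-rate/false-positive trade-off while tuning: replay a labeled sample, recording each query’s best-match similarity and whether the cached answer was actually correct, then compute both metrics at each candidate threshold. A minimal sketch (the sample data below is made up):

```python
def evaluate_threshold(samples, threshold):
    """samples: list of (best_match_similarity, answer_was_correct) pairs.
    Returns (cache_hit_rate, false_positive_rate)."""
    hits = [(sim, ok) for sim, ok in samples if sim >= threshold]
    if not hits:
        return 0.0, 0.0
    hit_rate = len(hits) / len(samples)
    fp_rate = sum(1 for _, ok in hits if not ok) / len(hits)
    return hit_rate, fp_rate

# Made-up replay data: (best-match similarity, human-judged correctness)
samples = [(0.97, True), (0.94, True), (0.91, True), (0.91, False),
           (0.88, False), (0.84, True), (0.72, False), (0.60, False)]

for t in (0.95, 0.92, 0.90):
    hr, fp = evaluate_threshold(samples, t)
    print(f"threshold {t}: hit rate {hr:.0%}, false positives {fp:.0%}")
```

Lowering the threshold raises the hit rate but also admits mismatches, which is exactly the pattern to watch for in your own replay data.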
## Cache key design: the details that matter
The prompt alone isn’t always sufficient as a cache key. Consider these scenarios:
### Include conversation context
If your LLM sees conversation history, two identical prompts with different histories should NOT match:
```python
# BAD: cache key is just the user message
cache_key = get_embedding(user_message)

# GOOD: cache key includes system prompt + recent history
cache_key = get_embedding(f"{system_prompt}\n{last_2_messages}\n{user_message}")
```
### Include user-specific context
“What’s my account balance?” should return different answers for different users. Include the user ID or relevant context in the cache key:
```python
# For personalized queries, scope the cache to the user
cache_key = get_embedding(f"user:{user_id}\n{prompt}")
```
### Exclude timestamps and ephemeral data
“What’s the weather today?” cached on Monday shouldn’t serve on Tuesday. Set appropriate TTLs for time-sensitive content:
```python
# Short TTL for time-sensitive responses
CACHE_TTL_BY_TYPE = {
    "faq": 86400 * 30,   # 30 days — FAQ answers rarely change
    "policy": 86400 * 7, # 7 days — policies change occasionally
    "realtime": 3600,    # 1 hour — time-sensitive data
}
```
## Combining semantic caching with model routing
Semantic caching and model routing are complementary optimizations that compound their savings:
- Cache check first — zero cost for cache hits
- Route cache misses — cheapest appropriate model for new queries
- Cache the response — future similar queries skip both routing and LLM call
The combined savings stack multiplicatively:
| Optimization | Standalone savings | Combined savings |
|---|---|---|
| Model routing alone | 40-70% | — |
| Semantic caching alone | 20-40% | — |
| Both together | — | 52-82% |
If routing saves 50% and caching saves 30% of the remaining traffic: 1 - (0.50 × 0.70) = 65% total savings.
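That compounding can be expressed directly: routing scales the cost of every call that reaches an LLM, while caching removes a fraction of calls entirely, so the remaining-cost factors multiply.

```python
def combined_savings(routing_savings: float, caching_savings: float) -> float:
    """Total savings when two independent cost reductions stack:
    the fractions of cost that *remain* after each multiply together."""
    remaining = (1 - routing_savings) * (1 - caching_savings)
    return 1 - remaining

print(combined_savings(0.50, 0.30))  # ≈ 0.65, the example above
print(combined_savings(0.40, 0.20))  # ≈ 0.52, low end of both ranges
print(combined_savings(0.70, 0.40))  # ≈ 0.82, high end of both ranges
```

The low and high ends reproduce the 52-82% combined range in the table.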
## What NOT to cache
Semantic caching isn’t appropriate for every LLM use case. Avoid caching:
- Personalized responses where the same question should produce different answers per user (unless you scope the cache key to include user context)
- Responses that depend on external state — database lookups, API calls, real-time data
- Creative/generative tasks where users expect variety — marketing copy, brainstorming, creative writing
- Multi-turn conversations where context shifts the meaning of repeated phrases
- Security-sensitive responses where cached data could leak between users or tenants
A good rule of thumb: if two users asking the same question should get the same answer, it’s cacheable. If they should get different answers, it’s not (unless you scope the cache key appropriately).
## Monitoring your cache
Track these metrics to ensure your cache is healthy:
- Hit rate — The most important metric. Below 15%, caching isn’t worth the complexity. Above 40%, you’re in excellent territory.
- Cost savings — Actual dollars saved per month. Track with a cost tracking dashboard that attributes savings to the cache layer.
- Latency improvement — Cache hits should be 10-100x faster than LLM calls. If they’re not, your vector search is too slow.
- Cache size — Monitor growth. A cache that grows indefinitely will slow down vector search. Set TTLs and implement eviction policies.
- False positive rate — The percentage of cache hits that returned an incorrect or irrelevant response. Sample and review weekly. If above 5%, tighten your similarity threshold.
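A minimal in-process tracker for the first two metrics, as a sketch; in production you would export these counters to your metrics system rather than keep them in memory. The per-query cost below is the $0.0084 ($840 / 100K queries) from the worked example:

```python
from dataclasses import dataclass

@dataclass
class CacheStats:
    hits: int = 0
    misses: int = 0
    saved_usd: float = 0.0  # estimated LLM cost avoided by hits

    def record_hit(self, avoided_cost_usd: float) -> None:
        self.hits += 1
        self.saved_usd += avoided_cost_usd

    def record_miss(self) -> None:
        self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

stats = CacheStats()
for _ in range(45):
    stats.record_hit(avoided_cost_usd=0.0084)  # per-query cost, worked example
for _ in range(55):
    stats.record_miss()
print(f"hit rate {stats.hit_rate:.0%}, saved ${stats.saved_usd:.2f}")
```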
## Getting started: a 3-step plan
Step 1: Measure your repetition rate. Before building, estimate your potential cache hit rate. Log 1 week of prompts, embed them, and cluster by similarity. If fewer than 20% of prompts have a near-duplicate, caching may not be worth the investment.
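The repetition-rate estimate in Step 1 can be sketched as follows: embed the logged prompts, then count the fraction that have at least one near-duplicate neighbor above the threshold. Toy 2-d unit vectors stand in for real embeddings here:

```python
import numpy as np

def repetition_rate(embeddings: np.ndarray, threshold: float = 0.95) -> float:
    """Fraction of prompts with at least one other prompt above the
    similarity threshold. O(n^2) pairwise, fine for a one-off analysis."""
    # Normalize rows so the dot product is cosine similarity
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -1.0)  # ignore self-similarity
    return float(np.mean(sims.max(axis=1) >= threshold))

# Toy "embeddings": two near-duplicate prompts and two one-offs
vecs = np.array([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0], [0.5, -0.5]])
print(repetition_rate(vecs))  # 0.5: half the prompts have a near-duplicate
```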
Step 2: Start with your highest-volume, most repetitive endpoint. Customer support, FAQ, or internal knowledge base — wherever users ask the same things repeatedly. Implement caching on this single endpoint and measure hit rate and quality for 2 weeks.
Step 3: Expand and combine with routing. Once caching is proven on one endpoint, roll out to others. Layer model routing on top for cache misses. Monitor combined savings with budget alerts to quantify the impact.
## Further reading
- Model Routing for Cost Savings — The other major cost optimization lever, complementary to caching
- Token Cost Tracking — Attribute savings to your cache and routing layers
- Budget Alerts — Set alerts to track your cost reduction over time
- AI Cost Tracking Tools Compared — Tools for measuring your baseline before optimizing
- Enterprise AI Cost Governance — Organization-wide cost controls and reporting
Embedding model pricing and benchmarks accurate as of April 2026. Vector database performance varies by deployment configuration — benchmark with your own data and query patterns.