# Semantic Caching for Cost Reduction: Save 20-40% on LLM API Bills
Semantic caching matches similar prompts to cached LLM responses. This guide covers how it works, implementation patterns, realistic cache hit rates, and the savings you can expect.
Every production LLM system answers the same questions over and over. Your customer support agent gets “how do I reset my password?” fifty times a day — and you pay for fifty separate API calls to generate essentially identical responses.
Traditional caching doesn’t help because users phrase questions differently every time. “Reset my password,” “I forgot my login,” “how to change password” — different strings, same intent. Semantic caching solves this by matching on meaning, not exact text.
## How semantic caching works
Semantic caching adds a matching layer between your application and your LLM provider. Here’s the flow:
1. Embed the prompt. Convert the incoming prompt to a vector embedding using a lightweight model (OpenAI text-embedding-3-small, Cohere embed-v3, or a local model like sentence-transformers).
2. Search the cache. Query your vector store for embeddings similar to the new prompt, using cosine similarity. If a match exceeds your similarity threshold (typically 0.90-0.95), it’s a cache hit.
3. Return or generate. On a cache hit, return the stored response immediately — zero LLM cost, sub-10ms latency. On a cache miss, send the request to your LLM, store the response with its embedding, and return it.
The critical component is the similarity threshold. Too high (0.98+) and you’ll rarely get cache hits. Too low (0.85 or below) and you’ll return wrong answers for genuinely different questions.
```
User prompt → Embed → Vector search (cache)
  ├── Hit (similarity ≥ threshold) → Return cached response
  └── Miss → LLM call → Store response + embedding → Return
```
## What makes it “semantic”
Traditional caches use exact string matching or hash-based lookups. The prompt must be character-for-character identical to trigger a hit. In practice, this means almost zero cache hits for natural language — users never phrase things identically.
Semantic caching uses embeddings to represent prompts as vectors in a high-dimensional space. Prompts with similar meaning end up near each other regardless of phrasing:
| Prompt A | Prompt B | Cosine similarity |
|---|---|---|
| "What is the capital of France?" | "France's capital city?" | 0.96 |
| "How do I reset my password?" | "I forgot my login credentials" | 0.91 |
| "Summarize this quarterly report" | "Give me a summary of Q2 earnings" | 0.89 |
| "Write a poem about dogs" | "Explain quantum computing" | 0.12 |
The embedding model understands that “reset my password” and “forgot my login credentials” have the same intent — even though they share almost no words.
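As a toy illustration of the mechanism, here is cosine similarity computed over hand-made 3-dimensional vectors (real embeddings have hundreds to thousands of dimensions, and these toy values are invented for the example):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 = same direction, 0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made 3-d stand-ins for real embeddings (illustrative values only)
reset_pw = np.array([0.9, 0.1, 0.05])      # "How do I reset my password?"
forgot_login = np.array([0.85, 0.2, 0.1])  # "I forgot my login credentials"
quantum = np.array([0.05, 0.1, 0.95])      # "Explain quantum computing"

print(cosine_similarity(reset_pw, forgot_login))  # high: same intent
print(cosine_similarity(reset_pw, quantum))       # low: unrelated topics
```

Two prompts with the same intent land close together in vector space and score near 1.0; unrelated prompts score near 0.0, which is exactly what the threshold check exploits.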
## Cost savings benchmarks
The savings depend on your cache hit rate, which depends on how repetitive your traffic is. Here are benchmarks by use case:
| Use case | Typical cache hit rate | Why |
|---|---|---|
| Customer support chatbot | 40-60% | Users ask the same 50-100 questions repeatedly |
| Internal knowledge base / FAQ | 30-50% | Employees search for the same policies and procedures |
| Data extraction / classification | 25-40% | Same document types, same extraction patterns |
| Content generation (personalized) | 10-20% | Each prompt includes unique user context |
| Code generation | 5-15% | Highly variable prompts, context-dependent output |
| Creative writing | 5-10% | Almost every prompt is unique |
### Worked example: customer support chatbot
A SaaS company handles 100,000 customer support queries per month through their AI chatbot, using Claude Sonnet at $3/$15 per million tokens. Average query: 300 input tokens, 500 output tokens.
Without caching:
- Input: 100K × 300 = 30M tokens → $90
- Output: 100K × 500 = 50M tokens → $750
- Monthly cost: $840
With semantic caching (45% hit rate):
- 55,000 cache misses → same per-call cost
- Input: 55K × 300 = 16.5M tokens → $49.50
- Output: 55K × 500 = 27.5M tokens → $412.50
- Embedding cost: 100K × 300 tokens via text-embedding-3-small = 30M tokens → $0.60
- Vector search: negligible (Redis or in-memory)
- Monthly cost: $462.60 — 45% savings
The embedding cost is almost invisible: text-embedding-3-small costs $0.02 per million tokens. Even at 100K queries, embedding adds $0.60/month.
For teams using frontier models (Claude Opus at $15/$75 per million tokens), the same 45% hit rate saves proportionally more in absolute dollars.
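The arithmetic above can be checked with a short script, using the prices from the example (Sonnet at $3/$15 per million tokens, text-embedding-3-small at $0.02 per million):

```python
def monthly_cost(queries, in_tok, out_tok, in_price, out_price,
                 hit_rate=0.0, embed_price=0.0):
    """Monthly cost in dollars; prices are per million tokens."""
    misses = queries * (1 - hit_rate)
    llm = (misses * in_tok * in_price + misses * out_tok * out_price) / 1e6
    # Every query is embedded, hit or miss
    embed = queries * in_tok * embed_price / 1e6
    return llm + embed

baseline = monthly_cost(100_000, 300, 500, 3, 15)       # $840.00
cached = monthly_cost(100_000, 300, 500, 3, 15,
                      hit_rate=0.45, embed_price=0.02)  # $462.60
print(f"${baseline:.2f} -> ${cached:.2f} ({1 - cached / baseline:.0%} saved)")
```

Swapping in Opus pricing (`in_price=15, out_price=75`) reproduces the point about frontier models: the same hit rate saves five times as many dollars.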
## Implementation guide

### Step 1: Choose your embedding model
The embedding model determines cache quality. You want high semantic accuracy, low latency, and low cost.
| Model | Dimensions | Cost per 1M tokens | Latency | Notes |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | $0.02 | ~50ms | Best price/performance. Production standard |
| OpenAI text-embedding-3-large | 3072 | $0.13 | ~80ms | Higher accuracy, 6.5x cost. Rarely needed |
| Cohere embed-v3 | 1024 | $0.10 | ~60ms | Good multilingual support |
| sentence-transformers (local) | 384-768 | $0 (compute only) | ~10ms | Zero API cost but requires hosting |
For most teams, text-embedding-3-small is the right choice. The accuracy difference between small and large is minimal for cache matching, and the cost is negligible.
### Step 2: Choose your vector store
The vector store holds your cached embeddings and handles similarity search.
| Store | Type | Best for | Cache lookup latency |
|---|---|---|---|
| Redis + RediSearch | In-memory | Low latency, simple setup | < 1ms |
| Qdrant | Vector DB | Large cache sizes, filtering | 1-5ms |
| Pinecone | Managed | Zero-ops, serverless | 5-20ms |
| SQLite + faiss | Embedded | Local development, small scale | < 1ms |
| In-memory dict | Python dict | Prototyping only | < 0.1ms |
For production, Redis with RediSearch is the most common choice. Sub-millisecond lookups, simple to operate, and you probably already have Redis in your stack.
### Step 3: Implement the cache layer
Here’s the core pattern in Python:
```python
import hashlib
import json

import numpy as np
import redis
from openai import OpenAI

client = OpenAI()
cache = redis.Redis(host="localhost", port=6379, db=0)

SIMILARITY_THRESHOLD = 0.92
CACHE_TTL = 86400 * 7  # 7 days

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_cache_lookup(prompt: str) -> str | None:
    prompt_embedding = get_embedding(prompt)
    # Scan cached embeddings (use a vector search index in production)
    for key in cache.scan_iter("cache:embedding:*"):
        cached = json.loads(cache.get(key))
        similarity = cosine_similarity(prompt_embedding, cached["embedding"])
        if similarity >= SIMILARITY_THRESHOLD:
            return cached["response"]
    return None

def cache_response(prompt: str, response: str):
    # Note: this re-embeds a prompt the lookup already embedded on a miss;
    # in production, pass the embedding through to avoid the duplicate call
    embedding = get_embedding(prompt)
    cache_key = f"cache:embedding:{hashlib.md5(prompt.encode()).hexdigest()}"
    cache.setex(
        cache_key,
        CACHE_TTL,
        json.dumps({"embedding": embedding, "response": response, "prompt": prompt}),
    )

# Usage in your LLM call wrapper
async def llm_call(prompt: str, model: str = "claude-sonnet-4-6") -> str:
    # Check cache first
    cached = semantic_cache_lookup(prompt)
    if cached:
        return cached  # Free! No LLM call

    # Cache miss — call the LLM via your existing provider wrapper
    response = await actual_llm_call(model=model, prompt=prompt)
    cache_response(prompt, response)
    return response
```
Important: The example above uses a linear scan for simplicity. In production with 10K+ cached entries, use Redis vector search (RediSearch) or a dedicated vector database; approximate nearest-neighbor indexes give sublinear lookups instead of O(n) scans.
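For reference, creating such an index with RediSearch might look like the sketch below. The index name, field names, and key prefix are illustrative; note that RediSearch with `DISTANCE_METRIC COSINE` returns cosine *distance* (1 − similarity), so a 0.92 similarity threshold becomes a 0.08 distance cutoff.

```shell
# Create a vector index over hashes with the cache: prefix (RediSearch 2.4+).
# Index name "cache_idx" and field names are illustrative.
redis-cli FT.CREATE cache_idx ON HASH PREFIX 1 "cache:" SCHEMA \
  response TEXT \
  embedding VECTOR FLAT 6 TYPE FLOAT32 DIM 1536 DISTANCE_METRIC COSINE

# KNN query: the single nearest cached embedding to $vec, where $vec is the
# query embedding packed as a little-endian float32 byte string.
redis-cli FT.SEARCH cache_idx '*=>[KNN 1 @embedding $vec AS score]' \
  PARAMS 2 vec "<binary float32 blob>" SORTBY score DIALECT 2
```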
### Step 4: Tune the similarity threshold
Start strict and loosen gradually:
- Week 1: Set threshold to 0.95. Measure cache hit rate and sample 100 cached responses for quality. Expected hit rate: 10-20%.
- Week 2: If quality looks good, lower to 0.92. Measure again. Expected hit rate: 20-35%.
- Week 3: If still good, try 0.90. This is usually the sweet spot for customer-facing applications. Expected hit rate: 30-50%.
- Below 0.90: Quality degradation becomes noticeable. Only go lower for internal tools where occasional wrong answers are acceptable.
Track these metrics while tuning:
| Metric | What to watch |
|---|---|
| Cache hit rate | Should increase as you lower threshold |
| False positive rate | Sample cached responses — are they correct for the new prompt? |
| User satisfaction | NPS, CSAT, or thumbs-up/down per response |
| Latency (p50, p99) | Cache hits should be < 50ms; misses = normal LLM latency |
| Cost per query | Should decrease proportionally with hit rate |
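One way to quantify the hit-rate/false-positive trade-off while tuning: replay a labeled sample, recording each query’s best-match similarity and whether the cached answer was actually correct, then compute both metrics at each candidate threshold. A minimal sketch (the sample data below is made up):

```python
def evaluate_threshold(samples, threshold):
    """samples: list of (best_match_similarity, answer_was_correct) pairs.
    Returns (cache_hit_rate, false_positive_rate)."""
    hits = [(sim, ok) for sim, ok in samples if sim >= threshold]
    if not hits:
        return 0.0, 0.0
    hit_rate = len(hits) / len(samples)
    fp_rate = sum(1 for _, ok in hits if not ok) / len(hits)
    return hit_rate, fp_rate

# Made-up replay data: (best-match similarity, human-judged correctness)
samples = [(0.97, True), (0.94, True), (0.91, True), (0.91, False),
           (0.88, False), (0.84, True), (0.72, False), (0.60, False)]

for t in (0.95, 0.92, 0.90):
    hr, fp = evaluate_threshold(samples, t)
    print(f"threshold {t}: hit rate {hr:.0%}, false positives {fp:.0%}")
```

Lowering the threshold raises the hit rate but also admits mismatches, which is exactly the pattern to watch for in your own replay data.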
## Cache key design: the details that matter
The prompt alone isn’t always sufficient as a cache key. Consider these scenarios:
### Include conversation context
If your LLM sees conversation history, two identical prompts with different histories should NOT match:
```python
# BAD: cache key is just the user message
cache_key = get_embedding(user_message)

# GOOD: cache key includes system prompt + recent history
cache_key = get_embedding(f"{system_prompt}\n{last_2_messages}\n{user_message}")
```
### Include user-specific context
“What’s my account balance?” should return different answers for different users. Include the user ID or relevant context in the cache key:
```python
# For personalized queries, scope the cache to the user
cache_key = get_embedding(f"user:{user_id}\n{prompt}")
```
### Exclude timestamps and ephemeral data
“What’s the weather today?” cached on Monday shouldn’t serve on Tuesday. Set appropriate TTLs for time-sensitive content:
```python
# Short TTL for time-sensitive responses
CACHE_TTL_BY_TYPE = {
    "faq": 86400 * 30,   # 30 days — FAQ answers rarely change
    "policy": 86400 * 7, # 7 days — policies change occasionally
    "realtime": 3600,    # 1 hour — time-sensitive data
}
```
## Combining semantic caching with model routing
Semantic caching and model routing are complementary optimizations that compound their savings:
- Cache check first — zero cost for cache hits
- Route cache misses — cheapest appropriate model for new queries
- Cache the response — future similar queries skip both routing and LLM call
The combined savings stack multiplicatively:
| Optimization | Standalone savings | Combined savings |
|---|---|---|
| Model routing alone | 40-70% | — |
| Semantic caching alone | 20-40% | — |
| Both together | — | 52-82% |
If routing saves 50% and caching saves 30% of the remaining traffic: 1 - (0.50 × 0.70) = 65% total savings.
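That compounding can be expressed directly: routing scales the cost of every call that reaches an LLM, while caching removes a fraction of calls entirely, so the remaining-cost factors multiply.

```python
def combined_savings(routing_savings: float, caching_savings: float) -> float:
    """Total savings when two independent cost reductions stack:
    the fractions of cost that *remain* after each multiply together."""
    remaining = (1 - routing_savings) * (1 - caching_savings)
    return 1 - remaining

print(combined_savings(0.50, 0.30))  # ≈ 0.65, the example above
print(combined_savings(0.40, 0.20))  # ≈ 0.52, low end of both ranges
print(combined_savings(0.70, 0.40))  # ≈ 0.82, high end of both ranges
```

The low and high ends reproduce the 52-82% combined range in the table.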
## What NOT to cache
Semantic caching isn’t appropriate for every LLM use case. Avoid caching:
- Personalized responses where the same question should produce different answers per user (unless you scope the cache key to include user context)
- Responses that depend on external state — database lookups, API calls, real-time data
- Creative/generative tasks where users expect variety — marketing copy, brainstorming, creative writing
- Multi-turn conversations where context shifts the meaning of repeated phrases
- Security-sensitive responses where cached data could leak between users or tenants
A good rule of thumb: if two users asking the same question should get the same answer, it’s cacheable. If they should get different answers, it’s not (unless you scope the cache key appropriately).
## Monitoring your cache
Track these metrics to ensure your cache is healthy:
- Hit rate — The most important metric. Below 15%, caching isn’t worth the complexity. Above 40%, you’re in excellent territory.
- Cost savings — Actual dollars saved per month. Track with a cost tracking dashboard that attributes savings to the cache layer.
- Latency improvement — Cache hits should be 10-100x faster than LLM calls. If they’re not, your vector search is too slow.
- Cache size — Monitor growth. A cache that grows indefinitely will slow down vector search. Set TTLs and implement eviction policies.
- False positive rate — The percentage of cache hits that returned an incorrect or irrelevant response. Sample and review weekly. If above 5%, tighten your similarity threshold.
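A minimal in-process tracker for the first two metrics, as a sketch; in production you would export these counters to your metrics system rather than keep them in memory. The per-query cost below is the $0.0084 ($840 / 100K queries) from the worked example:

```python
from dataclasses import dataclass

@dataclass
class CacheStats:
    hits: int = 0
    misses: int = 0
    saved_usd: float = 0.0  # estimated LLM cost avoided by hits

    def record_hit(self, avoided_cost_usd: float) -> None:
        self.hits += 1
        self.saved_usd += avoided_cost_usd

    def record_miss(self) -> None:
        self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

stats = CacheStats()
for _ in range(45):
    stats.record_hit(avoided_cost_usd=0.0084)  # per-query cost, worked example
for _ in range(55):
    stats.record_miss()
print(f"hit rate {stats.hit_rate:.0%}, saved ${stats.saved_usd:.2f}")
```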
## Getting started: a 3-step plan
Step 1: Measure your repetition rate. Before building, estimate your potential cache hit rate. Log 1 week of prompts, embed them, and cluster by similarity. If fewer than 20% of prompts have a near-duplicate, caching may not be worth the investment.
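The repetition-rate estimate in Step 1 can be sketched as follows: embed the logged prompts, then count the fraction that have at least one near-duplicate neighbor above the threshold. Toy 2-d unit vectors stand in for real embeddings here:

```python
import numpy as np

def repetition_rate(embeddings: np.ndarray, threshold: float = 0.95) -> float:
    """Fraction of prompts with at least one other prompt above the
    similarity threshold. O(n^2) pairwise, fine for a one-off analysis."""
    # Normalize rows so the dot product is cosine similarity
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -1.0)  # ignore self-similarity
    return float(np.mean(sims.max(axis=1) >= threshold))

# Toy "embeddings": two near-duplicate prompts and two one-offs
vecs = np.array([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0], [0.5, -0.5]])
print(repetition_rate(vecs))  # 0.5: half the prompts have a near-duplicate
```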
Step 2: Start with your highest-volume, most repetitive endpoint. Customer support, FAQ, or internal knowledge base — wherever users ask the same things repeatedly. Implement caching on this single endpoint and measure hit rate and quality for 2 weeks.
Step 3: Expand and combine with routing. Once caching is proven on one endpoint, roll out to others. Layer model routing on top for cache misses. Monitor combined savings with budget alerts to quantify the impact.
## Further reading
- Model Routing for Cost Savings — The other major cost optimization lever, complementary to caching
- Token Cost Tracking — Attribute savings to your cache and routing layers
- Budget Alerts — Set alerts to track your cost reduction over time
- AI Cost Tracking Tools Compared — Tools for measuring your baseline before optimizing
- Enterprise AI Cost Governance — Organization-wide cost controls and reporting
Embedding model pricing and benchmarks accurate as of April 2026. Vector database performance varies by deployment configuration — benchmark with your own data and query patterns.