Caching Strategies for LLM Applications
Reduce costs and latency by 70% with smart caching strategies for LLM-powered applications.
LLM API calls are expensive and slow. A single Claude Opus call can take 3-10 seconds and cost several cents. At scale, that adds up fast. Smart caching can cut both cost and latency by 70% or more.
Exact Match Caching
The simplest strategy: hash the prompt, cache the response.
```elixir
defmodule LLM.Cache do
  # Cachex supervises its own cache process, so this wrapper
  # needs no GenServer of its own.
  def complete(prompt, opts \\ []) do
    cache_key = :crypto.hash(:sha256, prompt) |> Base.encode16()
    ttl = Keyword.get(opts, :ttl, :timer.hours(24))

    case Cachex.get(:llm_cache, cache_key) do
      # Cachex returns {:ok, nil} on a miss
      {:ok, nil} ->
        {:ok, response} = LLM.API.complete(prompt, opts)
        Cachex.put(:llm_cache, cache_key, response, ttl: ttl)
        {:ok, response}

      {:ok, cached} ->
        {:ok, cached}
    end
  end
end
```
This works well for deterministic prompts — classification, extraction, and formatting tasks where the same input always produces the same output.
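Exact-match hit rates improve further if prompts are normalized before hashing, so trivially different inputs map to the same key. A minimal sketch — the `LLM.CacheKey` module and its normalization rules are illustrative assumptions, not part of the cache above:

```elixir
defmodule LLM.CacheKey do
  # Hypothetical normalization: trim and collapse whitespace before hashing,
  # so "Classify:  urgent\n" and "Classify: urgent" share one cache entry.
  def normalize(prompt) do
    prompt
    |> String.trim()
    |> String.replace(~r/\s+/, " ")
  end

  def key(prompt) do
    :crypto.hash(:sha256, normalize(prompt))
    |> Base.encode16()
  end
end
```

How aggressively to normalize is a judgment call: collapsing whitespace is safe for most prompts, but lowercasing or stripping punctuation can merge inputs that genuinely deserve different answers.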
Semantic Caching
For user-facing queries, exact matches are rare. Semantic caching uses embeddings to find “similar enough” queries:
```elixir
defmodule LLM.SemanticCache do
  @similarity_threshold 0.95

  def complete(prompt, opts \\ []) do
    {:ok, embedding} = EmbeddingAPI.create(prompt)

    case find_similar(embedding) do
      {:ok, %{similarity: sim, response: response}} when sim >= @similarity_threshold ->
        {:ok, response}

      _ ->
        {:ok, response} = LLM.API.complete(prompt, opts)
        # store/2,3 inserts the prompt, embedding, and response
        # into llm_cache (omitted)
        store(prompt, embedding, response)
        {:ok, response}
    end
  end

  # <=> is pgvector's cosine-distance operator; 1 - distance = similarity.
  defp find_similar(embedding) do
    result =
      Repo.query(
        """
        SELECT response, 1 - (embedding <=> $1) AS similarity
        FROM llm_cache
        ORDER BY embedding <=> $1
        LIMIT 1
        """,
        [embedding]
      )

    case result do
      {:ok, %{rows: [[response, similarity]]}} ->
        {:ok, %{response: response, similarity: similarity}}

      _ ->
        :miss
    end
  end
end
```
Set the threshold high (0.95+) to avoid returning stale results for genuinely different queries.
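For intuition about what the threshold compares, cosine similarity is just the normalized dot product of two embedding vectors. A self-contained sketch (toy low-dimensional vectors — real embeddings have hundreds or thousands of dimensions):

```elixir
defmodule Cosine do
  # Cosine similarity: dot(a, b) / (|a| * |b|). pgvector's <=> operator
  # returns the cosine *distance*, i.e. 1 minus this value.
  def similarity(a, b) do
    dot = Enum.zip(a, b) |> Enum.map(fn {x, y} -> x * y end) |> Enum.sum()
    norm = fn v -> :math.sqrt(Enum.sum(Enum.map(v, &(&1 * &1)))) end
    dot / (norm.(a) * norm.(b))
  end
end
```

Vectors pointing the same direction score 1.0 and orthogonal vectors score 0.0, which is why a 0.95 cutoff demands near-identical meaning before serving a cached answer.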
Tiered Caching
In production, combine strategies:
- L1: Exact match (in-memory, microseconds) — ETS or Cachex
- L2: Semantic match (vector DB, milliseconds) — pgvector
- L3: API call (network, seconds) — The actual LLM
```elixir
defmodule LLM.TieredCache do
  def complete(prompt, opts \\ []) do
    # Each check returns {:hit, response} or {:miss, _};
    # fall through the tiers until one hits.
    with {:miss, _} <- check_exact_cache(prompt),
         {:miss, _} <- check_semantic_cache(prompt) do
      {:ok, response} = LLM.API.complete(prompt, opts)
      populate_caches(prompt, response)
      {:ok, response}
    else
      {:hit, response} -> {:ok, response}
    end
  end
end
```
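The `check_exact_cache/1` and `populate_caches/2` helpers are left abstract above. One way to back the L1 tier is a plain ETS table — a sketch under assumed names, not the full production code (the L2 tier would wrap the pgvector lookup shown earlier):

```elixir
defmodule LLM.TieredCache.L1 do
  @table :llm_l1_cache

  # In-memory exact-match tier backed by ETS. :read_concurrency lets many
  # callers read the cache in parallel without contending on one process.
  def init do
    :ets.new(@table, [:set, :public, :named_table, read_concurrency: true])
  end

  def check(prompt) do
    case :ets.lookup(@table, key(prompt)) do
      [{_key, response}] -> {:hit, response}
      [] -> {:miss, prompt}
    end
  end

  def put(prompt, response) do
    :ets.insert(@table, {key(prompt), response})
    response
  end

  defp key(prompt), do: :crypto.hash(:sha256, prompt) |> Base.encode16()
end
```

ETS trades the TTL support of Cachex for raw lookup speed; if entries must expire, Cachex remains the simpler choice for this tier.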
Cache Invalidation
The hard part. We use TTL-based expiration with manual invalidation for known data changes:
- Classification prompts — Long TTL (7 days). Categories rarely change.
- RAG responses — Short TTL (1 hour). Source data updates frequently.
- User-facing summaries — Medium TTL (24 hours). Balance freshness vs cost.
- On data change — Invalidate all cache entries that reference the changed document.
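Invalidating on data change requires a reverse index from documents to the cache entries that reference them. A minimal sketch — the index structure and function names are assumptions, not our production schema:

```elixir
defmodule LLM.CacheInvalidation do
  @index :doc_to_cache_keys

  # Reverse index as an ETS bag: one {doc_id, cache_key} row per reference,
  # so every entry touching a document can be found when it changes.
  def init, do: :ets.new(@index, [:bag, :public, :named_table])

  def track(cache_key, doc_ids) do
    for doc_id <- doc_ids, do: :ets.insert(@index, {doc_id, cache_key})
    :ok
  end

  # Returns the cache keys to evict; the caller deletes them from the
  # actual cache (Cachex.del/2, a SQL DELETE, etc.).
  def invalidate(doc_id) do
    keys = for {_doc, key} <- :ets.lookup(@index, doc_id), do: key
    :ets.delete(@index, doc_id)
    keys
  end
end
```

Populating the index at cache-write time keeps invalidation an O(entries-per-document) operation instead of a full cache scan.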
Impact
On a support automation system processing 10k tickets/day, tiered caching reduced LLM API costs from $450/day to $120/day and dropped p95 latency from 4.2s to 0.3s. The cache hit rate stabilized at 73% after the first week.