Caching Strategies for LLM Applications

Reduce costs and latency by 70% with smart caching strategies for LLM-powered applications.

LLM API calls are expensive and slow. A single Claude Opus call can take 3-10 seconds and cost several cents. At scale, that adds up fast. Smart caching can cut both cost and latency by 70% or more.

Exact Match Caching

The simplest strategy: hash the prompt, cache the response.

defmodule LLM.Cache do
  def complete(prompt, opts \\ []) do
    ttl = Keyword.get(opts, :ttl, :timer.hours(24))

    # Key on the prompt plus the options that affect the output
    # (model, temperature, ...), so different configurations don't collide.
    cache_key =
      :crypto.hash(:sha256, :erlang.term_to_binary({prompt, Keyword.delete(opts, :ttl)}))
      |> Base.encode16()

    case Cachex.get(:llm_cache, cache_key) do
      {:ok, nil} ->
        # Only cache successful responses; errors propagate to the caller
        with {:ok, response} <- LLM.API.complete(prompt, opts) do
          Cachex.put(:llm_cache, cache_key, response, ttl: ttl)
          {:ok, response}
        end

      {:ok, cached} ->
        {:ok, cached}
    end
  end
end

This works well for deterministic prompts — classification, extraction, and formatting tasks where the same input always produces the same output.
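Hit rates for exact matching also improve if prompts are normalized before hashing, so trivially different inputs (extra whitespace, trailing newlines) share one cache key. A minimal sketch; the `LLM.Cache.Normalize` module is illustrative, not part of the cache above:

```elixir
defmodule LLM.Cache.Normalize do
  # Collapse runs of whitespace and trim, so prompts that differ only
  # in formatting map to the same cache key. Keep normalization
  # conservative: anything that changes meaning changes the output.
  def normalize(prompt) do
    prompt
    |> String.trim()
    |> String.replace(~r/\s+/, " ")
  end

  def cache_key(prompt) do
    prompt
    |> normalize()
    |> then(&:crypto.hash(:sha256, &1))
    |> Base.encode16()
  end
end
```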

Semantic Caching

For user-facing queries, exact matches are rare. Semantic caching uses embeddings to find “similar enough” queries:

defmodule LLM.SemanticCache do
  @similarity_threshold 0.95

  def complete(prompt, opts \\ []) do
    {:ok, embedding} = EmbeddingAPI.create(prompt)

    case find_similar(embedding) do
      {:ok, %{similarity: similarity, response: response}}
      when similarity >= @similarity_threshold ->
        {:ok, response}

      _ ->
        with {:ok, response} <- LLM.API.complete(prompt, opts) do
          store(prompt, embedding, response)
          {:ok, response}
        end
    end
  end

  defp find_similar(embedding) do
    # <=> is pgvector's cosine-distance operator; 1 - distance = similarity
    case Repo.query(
           """
           SELECT response, 1 - (embedding <=> $1) AS similarity
           FROM llm_cache
           ORDER BY embedding <=> $1
           LIMIT 1
           """,
           [embedding]
         ) do
      {:ok, %{rows: [[response, similarity]]}} ->
        {:ok, %{response: response, similarity: similarity}}

      _ ->
        :miss
    end
  end
end

Set the threshold high (0.95+) so the cache doesn't answer a genuinely different question with someone else's response. A lower threshold raises the hit rate but risks returning wrong matches.

Tiered Caching

In production, combine strategies:

  1. L1: Exact match (in-memory, microseconds) — ETS or Cachex
  2. L2: Semantic match (vector DB, milliseconds) — pgvector
  3. L3: API call (network, seconds) — The actual LLM

defmodule LLM.TieredCache do
  def complete(prompt, opts \\ []) do
    with {:miss, _} <- check_exact_cache(prompt),
         {:miss, _} <- check_semantic_cache(prompt) do
      {:ok, response} = LLM.API.complete(prompt, opts)
      populate_caches(prompt, response)
      {:ok, response}
    else
      {:hit, response} -> {:ok, response}
    end
  end
end
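The module above assumes `check_exact_cache/1`, `check_semantic_cache/1`, and `populate_caches/2` exist. A minimal sketch of what two of them might look like, with a Cachex-backed L1; the `LLM.SemanticCache.store/2` write-through call is an assumed simplified variant that computes the embedding itself:

```elixir
defmodule LLM.TieredCache.Helpers do
  # Hypothetical implementations of the helpers LLM.TieredCache assumes.

  def check_exact_cache(prompt) do
    key = :crypto.hash(:sha256, prompt) |> Base.encode16()

    case Cachex.get(:llm_cache, key) do
      {:ok, nil} -> {:miss, key}
      {:ok, response} -> {:hit, response}
    end
  end

  def populate_caches(prompt, response) do
    key = :crypto.hash(:sha256, prompt) |> Base.encode16()

    # Write through both tiers so the next exact or near-duplicate
    # query hits L1 or L2 instead of the API.
    Cachex.put(:llm_cache, key, response, ttl: :timer.hours(24))
    LLM.SemanticCache.store(prompt, response)
  end
end
```

One design note: a semantic (L2) hit is also worth writing back into the exact-match (L1) tier, so repeated identical queries skip the embedding call entirely.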

Cache Invalidation

The hard part. We use TTL-based expiration with manual invalidation for known data changes:

  • Classification prompts — Long TTL (7 days). Categories rarely change.
  • RAG responses — Short TTL (1 hour). Source data updates frequently.
  • User-facing summaries — Medium TTL (24 hours). Balance freshness vs cost.
  • On data change — Invalidate all cache entries that reference the changed document.
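The last rule needs a reverse index from documents to the cache entries that used them. A minimal sketch, assuming a plain ETS bag table and the `:llm_cache` Cachex cache from earlier; the table and function names are illustrative:

```elixir
defmodule LLM.CacheInvalidation do
  # Hypothetical reverse index: document ID -> cache keys built from it.

  def setup do
    # :bag allows many cache keys per document ID
    :ets.new(:doc_index, [:bag, :named_table, :public])
  end

  # Call when storing a RAG response: record which documents it used
  def track(cache_key, doc_ids) do
    for doc_id <- doc_ids, do: :ets.insert(:doc_index, {doc_id, cache_key})
    :ok
  end

  # Call when a document changes: drop every entry that referenced it
  def invalidate(doc_id) do
    for {^doc_id, cache_key} <- :ets.lookup(:doc_index, doc_id) do
      Cachex.del(:llm_cache, cache_key)
    end

    :ets.delete(:doc_index, doc_id)
    :ok
  end
end
```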

Impact

On a support automation system processing 10k tickets/day, tiered caching reduced LLM API costs from $450/day to $120/day and dropped p95 latency from 4.2s to 0.3s. The cache hit rate stabilized at 73% after the first week.
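The cost figures line up with the hit rate, since only cache misses reach the paid API:

```elixir
hit_rate = 0.73
baseline_cost = 450.00

# Only the ~27% of requests that miss every cache tier pay for an API call
expected_cost = baseline_cost * (1 - hit_rate)
# ≈ $121.50/day, in line with the observed $120/day
```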