How to Evaluate LLM Outputs Systematically

Build evaluation pipelines that measure LLM quality, catch regressions, and guide prompt improvements.

You can’t improve what you can’t measure. LLM evaluation is how you know whether your prompts actually work — and whether changes make things better or worse.

The Eval Dataset

Start by building a curated set of input-output pairs. These are your ground truth:

defmodule Eval.Dataset do
  def load(path) do
    path
    |> File.read!()
    |> Jason.decode!()
    |> Enum.map(fn item ->
      %{
        input: item["input"],
        expected: item["expected"],
        tags: item["tags"] || [],
        difficulty: item["difficulty"] || "normal"
      }
    end)
  end
end

Your dataset should cover: happy paths, edge cases, adversarial inputs, and examples from real production failures. Start with 50-100 examples and grow it over time.

Metric Types

Different tasks need different metrics:

Exact Match — For classification, extraction, structured output:

def exact_match(predicted, expected) do
  predicted == expected
end

Fuzzy Match — For text generation where wording varies:

def fuzzy_match(predicted, expected, threshold \\ 0.85) do
  similarity = String.jaro_distance(predicted, expected)
  similarity >= threshold
end

LLM-as-Judge — For open-ended quality assessment:

def llm_judge(input, output, criteria) do
  prompt = """
  Rate the following output on a scale of 1-5 for each criterion.
  Respond with JSON: {"scores": {"criterion": score}, "reasoning": "..."}

  Input: #{input}
  Output: #{output}
  Criteria: #{Enum.join(criteria, ", ")}
  """

  {:ok, response} = LLM.complete(prompt, model: "claude-sonnet-4-6")
  Jason.decode!(response)
end

The Eval Pipeline

Run evals on every prompt change:

defmodule Eval.Runner do
  def run(dataset, prompt_fn, metrics) do
    results =
      dataset
      |> Task.async_stream(fn example ->
        {:ok, output} = prompt_fn.(example.input)

        scores = Enum.map(metrics, fn metric ->
          {metric.name, metric.score(output, example.expected)}
        end)

        %{input: example.input, output: output, expected: example.expected, scores: scores}
      end, max_concurrency: 10)
      |> Enum.map(fn {:ok, result} -> result end)

    summary = calculate_summary(results, metrics)
    %{results: results, summary: summary}
  end

  defp calculate_summary(results, metrics) do
    Enum.map(metrics, fn metric ->
      scores = Enum.map(results, fn r -> r.scores[metric.name] end)
      %{
        metric: metric.name,
        mean: Enum.sum(scores) / length(scores),
        min: Enum.min(scores),
        max: Enum.max(scores)
      }
    end)
  end
end

Catching Regressions

Store eval results over time and alert when scores drop:

defmodule Eval.Regression do
  @regression_threshold 0.05

  def check(current_scores, baseline_scores) do
    Enum.map(current_scores, fn %{metric: metric, mean: mean} ->
      baseline = Enum.find(baseline_scores, &(&1.metric == metric))
      delta = mean - baseline.mean

      cond do
        delta < -@regression_threshold -> {:regression, metric, delta}
        delta > @regression_threshold -> {:improvement, metric, delta}
        true -> {:stable, metric, delta}
      end
    end)
  end
end

Practical Advice

Run evals before every prompt change goes to production. Treat your eval dataset like test fixtures — keep them in version control, review additions in PRs, and never delete examples that represent real production failures. A growing eval dataset is your best defense against regressions.