How to Evaluate LLM Outputs Systematically

You can’t improve what you can’t measure. LLM evaluation is how you know whether your prompts actually work — and whether changes make things better or worse.

The Eval Dataset

Start by building a curated set of input-output pairs. These are your ground truth:

defmodule Eval.Dataset do
  def load(path) do
    path
    |> File.read!()
    |> Jason.decode!()
    |> Enum.map(fn item ->
      %{
        input: item["input"],
        expected: item["expected"],
        tags: item["tags"] || [],
        difficulty: item["difficulty"] || "normal"
      }
    end)
  end
end

Your dataset should cover: happy paths, edge cases, adversarial inputs, and examples from real production failures. Start with 50-100 examples and grow it over time.

Metric Types

Different tasks need different metrics:

Exact Match — For classification, extraction, structured output:

def exact_match(predicted, expected) do
  predicted == expected
end

Fuzzy Match — For text generation where wording varies:

def fuzzy_match(predicted, expected, threshold \\ 0.85) do
  similarity = String.jaro_distance(predicted, expected)
  similarity >= threshold
end

LLM-as-Judge — For open-ended quality assessment:

def llm_judge(input, output, criteria) do
  prompt = """
  Rate the following output on a scale of 1-5 for each criterion.
  Respond with JSON: {"scores": {"criterion": score}, "reasoning": "..."}

  Input: #{input}
  Output: #{output}
  Criteria: #{Enum.join(criteria, ", ")}
  """

  {:ok, response} = LLM.complete(prompt, model: "claude-sonnet-4-6")
  Jason.decode!(response)
end

The Eval Pipeline

Run evals on every prompt change:

defmodule Eval.Runner do
  def run(dataset, prompt_fn, metrics) do
    results =
      dataset
      |> Task.async_stream(fn example ->
        {:ok, output} = prompt_fn.(example.input)

        scores = Enum.map(metrics, fn metric ->
          {metric.name, metric.score(output, example.expected)}
        end)

        %{input: example.input, output: output, expected: example.expected, scores: scores}
      end, max_concurrency: 10)
      |> Enum.map(fn {:ok, result} -> result end)

    summary = calculate_summary(results, metrics)
    %{results: results, summary: summary}
  end

  defp calculate_summary(results, metrics) do
    Enum.map(metrics, fn metric ->
      scores = Enum.map(results, fn r -> r.scores[metric.name] end)
      %{
        metric: metric.name,
        mean: Enum.sum(scores) / length(scores),
        min: Enum.min(scores),
        max: Enum.max(scores)
      }
    end)
  end
end

Catching Regressions

Store eval results over time and alert when scores drop:

defmodule Eval.Regression do
  @regression_threshold 0.05

  def check(current_scores, baseline_scores) do
    Enum.map(current_scores, fn %{metric: metric, mean: mean} ->
      baseline = Enum.find(baseline_scores, &(&1.metric == metric))
      delta = mean - baseline.mean

      cond do
        delta < -@regression_threshold -> {:regression, metric, delta}
        delta > @regression_threshold -> {:improvement, metric, delta}
        true -> {:stable, metric, delta}
      end
    end)
  end
end

Practical Advice

Run evals before every prompt change goes to production. Treat your eval dataset like test fixtures — keep them in version control, review additions in PRs, and never delete examples that represent real production failures. A growing eval dataset is your best defense against regressions.