How to Evaluate LLM Outputs Systematically
Build evaluation pipelines that measure LLM quality, catch regressions, and guide prompt improvements.
You can’t improve what you can’t measure. LLM evaluation is how you know whether your prompts actually work — and whether changes make things better or worse.
The Eval Dataset
Start by building a curated set of input-output pairs. These are your ground truth:
defmodule Eval.Dataset do
  def load(path) do
    path
    |> File.read!()
    |> Jason.decode!()
    |> Enum.map(fn item ->
      %{
        input: item["input"],
        expected: item["expected"],
        tags: item["tags"] || [],
        difficulty: item["difficulty"] || "normal"
      }
    end)
  end
end
Your dataset should cover: happy paths, edge cases, adversarial inputs, and examples from real production failures. Start with 50-100 examples and grow it over time.
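The loader above expects a JSON array of objects. A minimal eval file might look like this (the inputs and tags are illustrative):

```json
[
  {
    "input": "Summarize: The quick brown fox jumps over the lazy dog.",
    "expected": "A fox jumps over a dog.",
    "tags": ["summarization", "happy-path"],
    "difficulty": "easy"
  },
  {
    "input": "Summarize: ",
    "expected": "No content to summarize.",
    "tags": ["edge-case"]
  }
]
```

Optional fields like "tags" and "difficulty" can be omitted; the loader falls back to sensible defaults.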
Metric Types
Different tasks need different metrics:
Exact Match — For classification, extraction, structured output:
# Return a numeric score rather than a boolean so results can be averaged later.
def exact_match(predicted, expected) do
  if predicted == expected, do: 1.0, else: 0.0
end
Fuzzy Match — For text generation where wording varies:
def fuzzy_match(predicted, expected, threshold \\ 0.85) do
  # Jaro distance is character-based; 1.0 means the strings are identical.
  if String.jaro_distance(predicted, expected) >= threshold, do: 1.0, else: 0.0
end
LLM-as-Judge — For open-ended quality assessment:
def llm_judge(input, output, criteria) do
  prompt = """
  Rate the following output on a scale of 1-5 for each criterion.
  Respond with JSON: {"scores": {"criterion": score}, "reasoning": "..."}

  Input: #{input}
  Output: #{output}
  Criteria: #{Enum.join(criteria, ", ")}
  """

  {:ok, response} = LLM.complete(prompt, model: "claude-sonnet-4-6")
  Jason.decode!(response)
end
The Eval Pipeline
Run evals on every prompt change:
defmodule Eval.Runner do
  def run(dataset, prompt_fn, metrics) do
    results =
      dataset
      |> Task.async_stream(
        fn example ->
          {:ok, output} = prompt_fn.(example.input)

          scores =
            Map.new(metrics, fn metric ->
              # metric.score holds an anonymous function, so invoke it with `.()`
              {metric.name, metric.score.(output, example.expected)}
            end)

          %{input: example.input, output: output, expected: example.expected, scores: scores}
        end,
        max_concurrency: 10,
        # LLM calls routinely exceed Task.async_stream's default 5s timeout
        timeout: 60_000
      )
      |> Enum.map(fn {:ok, result} -> result end)

    summary = calculate_summary(results, metrics)
    %{results: results, summary: summary}
  end

  defp calculate_summary(results, metrics) do
    Enum.map(metrics, fn metric ->
      scores = Enum.map(results, fn r -> r.scores[metric.name] end)

      %{
        metric: metric.name,
        mean: Enum.sum(scores) / length(scores),
        min: Enum.min(scores),
        max: Enum.max(scores)
      }
    end)
  end
end
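Metrics here are plain maps with a :name atom and a :score function. A sketch of wiring the pieces together (the dataset path and MyApp.LLM.summarize/1 are placeholders for your own client):

```elixir
dataset = Eval.Dataset.load("priv/evals/summarization.json")

metrics = [
  %{name: :exact, score: fn out, exp -> if out == exp, do: 1.0, else: 0.0 end},
  %{name: :fuzzy, score: fn out, exp -> String.jaro_distance(out, exp) end}
]

# prompt_fn wraps whatever LLM client you use; it must return {:ok, output}
prompt_fn = fn input -> MyApp.LLM.summarize(input) end

%{summary: summary} = Eval.Runner.run(dataset, prompt_fn, metrics)
```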
Catching Regressions
Store eval results over time and alert when scores drop:
defmodule Eval.Regression do
  @regression_threshold 0.05

  def check(current_scores, baseline_scores) do
    Enum.map(current_scores, fn %{metric: metric, mean: mean} ->
      baseline = Enum.find(baseline_scores, &(&1.metric == metric))
      delta = mean - baseline.mean

      cond do
        delta < -@regression_threshold -> {:regression, metric, delta}
        delta > @regression_threshold -> {:improvement, metric, delta}
        true -> {:stable, metric, delta}
      end
    end)
  end
end
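A sketch of comparing a fresh run against a stored baseline (how the baseline is persisted is left out; the values below are illustrative):

```elixir
baseline = [%{metric: :exact, mean: 0.90}]
current = [%{metric: :exact, mean: 0.80}]

Eval.Regression.check(current, baseline)
# The :exact mean dropped by roughly 0.10, which exceeds the 0.05
# threshold, so this returns [{:regression, :exact, delta}]
```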
Practical Advice
Run evals before every prompt change goes to production. Treat your eval dataset like test fixtures — keep it in version control, review additions in PRs, and never delete examples that represent real production failures. A growing eval dataset is your best defense against regressions.
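One way to enforce this is to wrap the regression check in an ExUnit test so CI fails on a drop (a sketch; the baseline path, the :eval tag convention, and MyApp.LLM.summarize/1 are assumptions):

```elixir
defmodule MyApp.PromptEvalTest do
  use ExUnit.Case, async: false

  @tag :eval
  test "summarization prompt has not regressed" do
    # Baseline summary stored from the last accepted run
    baseline =
      "priv/evals/baseline.json"
      |> File.read!()
      |> Jason.decode!(keys: :atoms)

    dataset = Eval.Dataset.load("priv/evals/summarization.json")
    %{summary: summary} = Eval.Runner.run(dataset, &MyApp.LLM.summarize/1, metrics())

    # Fail the build if any metric regressed past the threshold
    for result <- Eval.Regression.check(summary, baseline) do
      refute match?({:regression, _metric, _delta}, result)
    end
  end

  defp metrics do
    [%{name: :fuzzy, score: fn out, exp -> String.jaro_distance(out, exp) end}]
  end
end
```

Tagging the test lets you exclude it from the fast local suite (mix test --exclude eval) and run it only in CI or before a release.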