Fine-Tuning vs Prompting: When to Use Which

A decision framework for choosing between prompt engineering, RAG, and fine-tuning for your LLM application.

“Should I fine-tune?” is the most common question in LLM engineering. The answer is almost always “not yet.” But there are cases where it’s the right call.

The Decision Framework

Think of it as a ladder. Start at the bottom and only climb when you need to:

Level 1: Better Prompting — You’d be surprised how far this goes. Most “the model can’t do X” problems are actually “my prompt doesn’t clearly explain X” problems. Try structured output, few-shot examples, and chain-of-thought before anything else.

Level 2: RAG — If the model lacks knowledge, give it knowledge. RAG is cheaper, faster to implement, and easier to update than fine-tuning. New data? Update the vector store. No retraining needed.

Level 3: Fine-Tuning — When you need the model to behave in a specific way that prompting can’t achieve. This is about style, format, and domain-specific reasoning patterns.

When Prompting Wins

Prompting is the right choice when:

  • Your task is well-defined and can be explained in natural language
  • You need flexibility to change behavior quickly
  • Your volume is low-to-medium (< 100k calls/day)
  • You need to switch between models easily
# This is a prompting problem, not a fine-tuning problem
defmodule Classifier do
  @prompt """
  Classify this email as: spam, marketing, personal, work, automated.
  Respond with only the category name.

  Email subject: <%= subject %>
  Email body (first 500 chars): <%= String.slice(body, 0, 500) %>
  """

  # LLM.complete/1 stands in for whatever completion client you use
  def classify(subject, body) do
    @prompt
    |> EEx.eval_string(subject: subject, body: body)
    |> LLM.complete()
  end
end

When RAG Wins

RAG is the right choice when:

  • The model needs access to private or frequently updated data
  • Answers must be grounded in specific source documents
  • You need citation and traceability
  • The knowledge base changes regularly
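
The pattern behind these bullets can be sketched in a few lines. VectorStore and LLM below are hypothetical modules standing in for your retrieval layer and completion client:

# Minimal RAG sketch — VectorStore and LLM are placeholder modules
defmodule RagAnswer do
  def answer(question) do
    # 1. Retrieve the most relevant chunks for the question
    chunks = VectorStore.search(question, top_k: 5)

    # 2. Ground the prompt in the retrieved sources, tagged for citation
    context =
      chunks
      |> Enum.map(fn c -> "[#{c.source}] #{c.text}" end)
      |> Enum.join("\n\n")

    # 3. Ask the model to answer only from the provided context
    """
    Answer using only the sources below. Cite each claim by its [tag].

    #{context}

    Question: #{question}
    """
    |> LLM.complete()
  end
end

When the knowledge base changes, you re-index the vector store; the prompt and the model stay untouched.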

When Fine-Tuning Wins

Fine-tuning is worth it when:

  • You need a very specific output style that prompting can’t nail consistently
  • You’re processing extremely high volume and need to use a smaller, cheaper model
  • You have domain-specific reasoning patterns (medical, legal, financial)
  • Latency is critical and you need a smaller model to match a larger model’s quality
# This might justify fine-tuning:
# - Very high volume (1M+ calls/day)
# - Specific format that must be pixel-perfect
# - Domain-specific medical terminology
defmodule MedicalCoder do
  def code(clinical_note) do
    # Fine-tuned model that outputs ICD-10 codes
    # with domain-specific understanding
    LLM.complete(clinical_note,
      model: "ft:claude-haiku:medical-coder:v3",
      max_tokens: 100
    )
  end
end

The Cost Equation

Fine-tuning a model costs $200-2000+ in compute, plus the cost of curating training data (often 500-5000 high-quality examples). The payoff comes from using a smaller, cheaper model at high volume.

Break-even example: If fine-tuning lets you use Haiku instead of Sonnet, and you’re making 500k calls/day, the per-call savings can pay for fine-tuning within a week.
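
To make the arithmetic concrete, here is a rough break-even calculation. The per-call costs below are illustrative assumptions, not real pricing — plug in your provider’s actual numbers:

# Illustrative break-even math — all prices are assumptions
defmodule BreakEven do
  @large_cost_per_call 0.003   # assumed avg $/call on the larger model
  @small_cost_per_call 0.0005  # assumed avg $/call on the fine-tuned small model
  @fine_tune_cost 2_000.0      # assumed one-time training + data curation cost

  def days_to_break_even(calls_per_day) do
    daily_savings = calls_per_day * (@large_cost_per_call - @small_cost_per_call)
    @fine_tune_cost / daily_savings
  end
end

BreakEven.days_to_break_even(500_000)
# => 1.6 days at these assumed prices

Even with pessimistic assumptions, high volume makes the one-time cost disappear quickly; at low volume, the same math tells you to stay on the prompting rung.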

The Hybrid Approach

In practice, most production systems use a combination:

  1. Fine-tuned small model for high-volume, well-defined tasks (classification, extraction)
  2. RAG + large model for knowledge-intensive tasks (Q&A, research)
  3. Prompted large model for complex reasoning (analysis, planning)
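
A sketch of how the three tiers might sit behind a single entry point. Module and model names here are illustrative placeholders, not a real API:

# Hypothetical router over the three tiers
defmodule Router do
  # 1. Fine-tuned small model: high-volume, well-defined tasks
  def handle({:classify, text}),
    do: LLM.complete(text, model: "ft:small:classifier:v1")

  # 2. RAG + large model: knowledge-intensive tasks
  def handle({:question, q}) do
    context =
      VectorStore.search(q, top_k: 5)
      |> Enum.map_join("\n", & &1.text)

    LLM.complete("Sources:\n#{context}\n\nQuestion: #{q}", model: "large")
  end

  # 3. Prompted large model: complex reasoning
  def handle({:analyze, text}),
    do: LLM.complete(text, model: "large")
end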

Start with prompting. Add RAG when you need external knowledge. Fine-tune only when you have the volume and the data to justify it. This order saves you months of unnecessary complexity.