Building a RAG System from Scratch

Step-by-step guide to building retrieval-augmented generation systems that actually improve LLM accuracy.

Retrieval-Augmented Generation (RAG) is how you give an LLM access to your private data without fine-tuning. The concept is simple: retrieve relevant context, inject it into the prompt, and let the model generate an informed response.

The implementation? That’s where it gets interesting.

Architecture Overview

A production RAG system has four components:

  1. Ingestion Pipeline — Chunk, embed, and store your documents
  2. Retrieval Layer — Find the most relevant chunks for a query
  3. Prompt Assembly — Combine retrieved context with the user’s question
  4. Generation + Grounding — Generate a response and verify it’s grounded in the retrieved context

Step 1: Chunking Strategy

Chunking is the most underrated part of RAG. Bad chunks = bad retrieval = bad answers.

defmodule RAG.Chunker do
  @chunk_size 512
  @chunk_overlap 64

  def chunk_document(text, metadata \\ %{}) do
    text
    |> split_by_paragraphs()
    |> merge_small_chunks(@chunk_size)
    |> add_overlap(@chunk_overlap)
    |> Enum.with_index()
    |> Enum.map(fn {chunk, idx} ->
      %{
        content: chunk,
        metadata: Map.merge(metadata, %{chunk_index: idx}),
        token_count: estimate_tokens(chunk)
      }
    end)
  end

  defp split_by_paragraphs(text) do
    text
    |> String.split(~r/\n\n+/)
    |> Enum.reject(&(String.trim(&1) == ""))
  end

  defp merge_small_chunks(paragraphs, max_size) do
    Enum.reduce(paragraphs, [], fn para, acc ->
      case acc do
        [last | rest] when byte_size(last <> "\n\n" <> para) < max_size ->
          [(last <> "\n\n" <> para) | rest]
        _ ->
          [para | acc]
      end
    end)
    |> Enum.reverse()
  end

  # Prepend the tail of the previous chunk so context survives chunk boundaries.
  defp add_overlap([], _overlap), do: []

  defp add_overlap([first | rest] = chunks, overlap) do
    overlapped =
      chunks
      |> Enum.zip(rest)
      |> Enum.map(fn {prev, chunk} ->
        start = max(String.length(prev) - overlap, 0)
        String.slice(prev, start, overlap) <> "\n\n" <> chunk
      end)

    [first | overlapped]
  end

  # Rough heuristic: ~4 characters per token for English prose.
  defp estimate_tokens(text), do: div(String.length(text), 4)
end

Key insight: Chunk by semantic boundaries (paragraphs, sections), not arbitrary character counts. Overlap ensures you don’t lose context at chunk boundaries.

Step 2: Embedding and Storage

Use a vector database (pgvector with Postgres works great) to store embeddings alongside your chunks:

defmodule RAG.Embedder do
  # One API call per chunk for clarity; production code should batch
  # requests and handle {:error, reason} instead of crashing on the
  # {:ok, _} match below.
  def embed_and_store(chunks) do
    chunks
    |> Enum.map(fn chunk ->
      {:ok, embedding} = EmbeddingAPI.create(chunk.content)
      Map.put(chunk, :embedding, embedding)
    end)
    |> Enum.each(&RAG.VectorStore.insert/1)
  end
end
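Under the hood, similarity search ranks chunks by a distance metric, typically cosine similarity (pgvector exposes this as its cosine-distance operator). The math is simple enough to sketch in a few lines; the `Cosine` module here is purely illustrative, not part of the system above:

```elixir
# Illustration only: the cosine similarity that similarity_search
# typically ranks by. pgvector computes this in SQL; this Elixir
# version just shows the math.
defmodule Cosine do
  def similarity(a, b) do
    dot =
      a
      |> Enum.zip(b)
      |> Enum.map(fn {x, y} -> x * y end)
      |> Enum.sum()

    dot / (norm(a) * norm(b))
  end

  defp norm(v), do: :math.sqrt(Enum.sum(Enum.map(v, &(&1 * &1))))
end

# Identical vectors score 1.0; orthogonal vectors score 0.0.
IO.inspect(Cosine.similarity([1.0, 0.0], [1.0, 0.0]))
IO.inspect(Cosine.similarity([1.0, 0.0], [0.0, 1.0]))
```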

Step 3: Retrieval

Semantic search finds conceptually related chunks, but exact terms (IDs, error codes, product names) can slip past embeddings, so run keyword search alongside it:

defmodule RAG.Retriever do
  def retrieve(query, opts \\ []) do
    k = Keyword.get(opts, :top_k, 5)

    semantic_results = RAG.VectorStore.similarity_search(query, k: k)
    keyword_results = RAG.VectorStore.keyword_search(query, k: k)

    # Caveat: cosine similarity and keyword scores (e.g. BM25) live on
    # different scales; sorting the merged list by raw score assumes
    # both have been normalized to a common range.
    (semantic_results ++ keyword_results)
    |> Enum.uniq_by(& &1.id)
    |> Enum.sort_by(& &1.score, :desc)
    |> Enum.take(k)
  end
end
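The merge above has a weakness: semantic and keyword scores live on different scales, so sorting the combined list by raw score favors whichever backend happens to produce larger numbers. A standard fix is Reciprocal Rank Fusion (RRF), which combines rankings instead of scores. A minimal sketch (the `RRF` module is illustrative; 60 is the commonly used dampening constant):

```elixir
# Reciprocal Rank Fusion: score each document by the ranks it earns
# across all result lists, ignoring the raw (incomparable) scores.
defmodule RRF do
  @k 60  # dampening constant; 60 is the commonly used default

  def fuse(result_lists) do
    result_lists
    |> Enum.flat_map(fn results ->
      results
      |> Enum.with_index(1)
      |> Enum.map(fn {doc, rank} -> {doc.id, doc, 1 / (@k + rank)} end)
    end)
    |> Enum.group_by(fn {id, _doc, _score} -> id end)
    |> Enum.map(fn {_id, entries} ->
      {_, doc, _} = hd(entries)
      {doc, entries |> Enum.map(fn {_, _, s} -> s end) |> Enum.sum()}
    end)
    |> Enum.sort_by(fn {_doc, score} -> score end, :desc)
    |> Enum.map(fn {doc, _score} -> doc end)
  end
end

# A document ranked in both lists ("b") outranks any single-list hit.
semantic = [%{id: "a"}, %{id: "b"}, %{id: "c"}]
keyword = [%{id: "b"}, %{id: "d"}]
IO.inspect(Enum.map(RRF.fuse([semantic, keyword]), & &1.id))
```

In `retrieve/2`, this would replace the concat-dedupe-sort with `RRF.fuse([semantic_results, keyword_results]) |> Enum.take(k)`.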

Step 4: Prompt Assembly

This is where RAG meets prompt engineering:

@rag_prompt """
Answer the user's question using ONLY the provided context.
If the context doesn't contain enough information, say so clearly.
Do not make up information.

Context:
<%= for chunk <- chunks do %>
---
<%= chunk.content %>
<% end %>

Question: <%= question %>

Provide a clear, concise answer with references to the source material.
"""

Common Pitfalls

The three most common RAG failures:

  1. Chunks too large — diluting relevance with off-topic text
  2. No hybrid search — missing exact keyword matches
  3. No grounding check — the model hallucinating despite having context

Address all three and your RAG system will be dramatically more reliable.
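The grounding check named in component 4 deserves a concrete baseline. A crude but useful version (this `RAG.GroundingCheck` module is a sketch, not part of the system above) measures what fraction of the answer's words appear anywhere in the retrieved context; production systems typically use an LLM-as-judge or an NLI model instead:

```elixir
# Naive grounding check: the share of answer words that also occur
# somewhere in the retrieved context. A low score suggests the model
# drifted away from its sources.
defmodule RAG.GroundingCheck do
  def grounding_score(answer, context_chunks) do
    context_words =
      context_chunks
      |> Enum.flat_map(&tokenize/1)
      |> MapSet.new()

    case tokenize(answer) do
      [] ->
        0.0

      words ->
        hits = Enum.count(words, &MapSet.member?(context_words, &1))
        hits / length(words)
    end
  end

  defp tokenize(text) do
    text
    |> String.downcase()
    |> String.split(~r/[^a-z0-9]+/, trim: true)
  end
end

# Fully grounded answer scores 1.0; an unrelated one scores 0.0.
IO.inspect(RAG.GroundingCheck.grounding_score("cats sleep", ["Cats often sleep all day."]))
IO.inspect(RAG.GroundingCheck.grounding_score("dogs bark", ["Cats often sleep all day."]))
```

Word overlap misses paraphrases and negations, but it is cheap enough to run on every response and catches the worst hallucinations.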