Building a RAG System from Scratch
Step-by-step guide to building retrieval-augmented generation systems that actually improve LLM accuracy.
Retrieval-Augmented Generation (RAG) is how you give an LLM access to your private data without fine-tuning. The concept is simple: retrieve relevant context, inject it into the prompt, and let the model generate an informed response.
The implementation? That’s where it gets interesting.
Architecture Overview
A production RAG system has four components; the query-time path is sketched right after this list:
- Ingestion Pipeline — Chunk, embed, and store your documents
- Retrieval Layer — Find the most relevant chunks for a query
- Prompt Assembly — Combine retrieved context with the user’s question
- Generation + Grounding — Generate a response and verify it’s grounded in the retrieved context
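At query time, the last three components compose into one pipeline. Here's a minimal orchestration sketch; `LLM.generate/1` is a hypothetical stand-in for whatever model client you use, and the `RAG.*` modules are built in the steps below:

```elixir
defmodule RAG.Pipeline do
  # Query-time path: retrieve -> assemble -> generate -> grounding check.
  def answer(question) do
    chunks = RAG.Retriever.retrieve(question)
    prompt = RAG.Prompt.assemble(question, chunks)

    with {:ok, response} <- LLM.generate(prompt),
         :ok <- RAG.Grounding.verify(response, chunks) do
      {:ok, response}
    end
  end
end
```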
Step 1: Chunking Strategy
Chunking is the most underrated part of RAG. Bad chunks = bad retrieval = bad answers.
```elixir
defmodule RAG.Chunker do
  @chunk_size 512
  @chunk_overlap 64

  def chunk_document(text, metadata \\ %{}) do
    text
    |> split_by_paragraphs()
    |> merge_small_chunks(@chunk_size)
    |> add_overlap(@chunk_overlap)
    |> Enum.with_index()
    |> Enum.map(fn {chunk, idx} ->
      %{
        content: chunk,
        metadata: Map.merge(metadata, %{chunk_index: idx}),
        token_count: estimate_tokens(chunk)
      }
    end)
  end

  defp split_by_paragraphs(text) do
    text
    |> String.split(~r/\n\n+/)
    |> Enum.reject(&(String.trim(&1) == ""))
  end

  # Greedily merge consecutive paragraphs until adding the next one would
  # push the chunk past max_size. Oversized single paragraphs pass through.
  defp merge_small_chunks(paragraphs, max_size) do
    Enum.reduce(paragraphs, [], fn para, acc ->
      case acc do
        [last | rest] when byte_size(last <> "\n\n" <> para) < max_size ->
          [(last <> "\n\n" <> para) | rest]

        _ ->
          [para | acc]
      end
    end)
    |> Enum.reverse()
  end

  # Prepend the tail of the previous chunk (measured in graphemes here;
  # use tokens in production) so context straddling a boundary lands in both.
  defp add_overlap([first | rest] = chunks, overlap) do
    overlapped =
      chunks
      |> Enum.zip(rest)
      |> Enum.map(fn {prev, chunk} ->
        tail = prev |> String.graphemes() |> Enum.take(-overlap) |> Enum.join()
        tail <> "\n\n" <> chunk
      end)

    [first | overlapped]
  end

  defp add_overlap(chunks, _overlap), do: chunks

  # Rough heuristic: ~4 bytes per token for English text.
  defp estimate_tokens(chunk), do: div(byte_size(chunk), 4)
end
```
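A quick usage sketch (the file name and `source` metadata key are just examples):

```elixir
chunks = RAG.Chunker.chunk_document(File.read!("handbook.md"), %{source: "handbook.md"})
# => [%{content: "...", metadata: %{source: "handbook.md", chunk_index: 0}, token_count: 87}, ...]
```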
Key insight: Chunk by semantic boundaries (paragraphs, sections), not arbitrary character counts. Overlap ensures you don’t lose context at chunk boundaries.
Step 2: Embedding and Storage
Use a vector database (pgvector with Postgres works great) to store embeddings alongside your chunks:
```elixir
defmodule RAG.Embedder do
  def embed_and_store(chunks) do
    chunks
    |> Enum.map(fn chunk ->
      # Crashes on API failure. In production, batch chunks per request
      # and retry with backoff rather than embedding one at a time.
      {:ok, embedding} = EmbeddingAPI.create(chunk.content)
      Map.put(chunk, :embedding, embedding)
    end)
    |> Enum.each(&RAG.VectorStore.insert/1)
  end
end
```
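`RAG.VectorStore` is left abstract above. Here's one possible shape for it; this is a sketch only, assuming the pgvector-elixir bindings (`Pgvector.new/1` with its Postgrex extension registered), an Ecto repo named `RAG.Repo`, and a `chunks` table with `id`, `content`, and `embedding vector(...)` columns. `<=>` is pgvector's cosine-distance operator:

```elixir
defmodule RAG.VectorStore do
  # Sketch only: assumes `CREATE EXTENSION vector` and an
  # `embedding vector(1536)` column sized to your embedding model.
  import Ecto.Query

  def insert(chunk) do
    Ecto.Adapters.SQL.query!(
      RAG.Repo,
      "INSERT INTO chunks (content, embedding) VALUES ($1, $2)",
      [chunk.content, Pgvector.new(chunk.embedding)]
    )
  end

  def similarity_search(query, opts) do
    k = Keyword.get(opts, :k, 5)
    {:ok, embedding} = EmbeddingAPI.create(query)
    vector = Pgvector.new(embedding)

    # `<=>` returns cosine distance (lower is closer), so 1 - distance
    # gives a similarity score.
    from(c in "chunks",
      order_by: fragment("embedding <=> ?", ^vector),
      limit: ^k,
      select: %{id: c.id, content: c.content, score: fragment("1 - (embedding <=> ?)", ^vector)}
    )
    |> RAG.Repo.all()
  end

  def keyword_search(query, opts) do
    k = Keyword.get(opts, :k, 5)

    # Postgres full-text search, ranked by ts_rank.
    from(c in "chunks",
      where: fragment("to_tsvector('english', content) @@ plainto_tsquery('english', ?)", ^query),
      order_by: [desc: fragment("ts_rank(to_tsvector('english', content), plainto_tsquery('english', ?))", ^query)],
      limit: ^k,
      select: %{
        id: c.id,
        content: c.content,
        score: fragment("ts_rank(to_tsvector('english', content), plainto_tsquery('english', ?))", ^query)
      }
    )
    |> RAG.Repo.all()
  end
end
```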
Step 3: Retrieval
Semantic search finds conceptually relevant chunks, but pair it with keyword search so exact matches (identifiers, error codes, proper names) aren't missed. The two scores live on different scales, so fuse results by rank (reciprocal rank fusion) rather than by raw score:
```elixir
defmodule RAG.Retriever do
  # Constant from the original RRF paper; damps the weight of lower ranks.
  @rrf_k 60

  def retrieve(query, opts \\ []) do
    k = Keyword.get(opts, :top_k, 5)
    semantic = RAG.VectorStore.similarity_search(query, k: k)
    keyword = RAG.VectorStore.keyword_search(query, k: k)

    # Reciprocal rank fusion: score each hit as 1/(k + rank) per list,
    # summing across lists for chunks that appear in both.
    [semantic, keyword]
    |> Enum.flat_map(&Enum.with_index(&1, fn r, rank -> {r, 1 / (@rrf_k + rank + 1)} end))
    |> Enum.group_by(fn {r, _score} -> r.id end)
    |> Enum.map(fn {_id, hits} ->
      {r, _score} = hd(hits)
      %{r | score: hits |> Enum.map(&elem(&1, 1)) |> Enum.sum()}
    end)
    |> Enum.sort_by(& &1.score, :desc)
    |> Enum.take(k)
  end
end
```
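A chunk that ranks highly in both lists gets its reciprocal-rank scores summed and floats to the top. The `@rrf_k` constant of 60 comes from the original reciprocal rank fusion paper; raising it flattens the fusion, while lowering it weights top ranks more heavily.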
Step 4: Prompt Assembly
This is where RAG meets prompt engineering:
```elixir
defmodule RAG.Prompt do
  # EEx template, rendered with `chunks` and `question` bindings.
  @rag_prompt """
  Answer the user's question using ONLY the provided context.
  If the context doesn't contain enough information, say so clearly.
  Do not make up information.

  Context:
  <%= for chunk <- chunks do %>
  ---
  <%= chunk.content %>
  <% end %>

  Question: <%= question %>

  Provide a clear, concise answer with references to the source material.
  """

  def assemble(question, chunks) do
    EEx.eval_string(@rag_prompt, chunks: chunks, question: question)
  end
end
```
Common Pitfalls
The three most common RAG failures:

- Chunks that are too large, diluting relevance
- No hybrid search, missing exact keyword matches
- No grounding check, letting the model hallucinate despite having context (sketched below)

Address all three and your RAG system will be dramatically more reliable.
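Component 4's grounding check never got code above. One lightweight approach, sketched here, is a second LLM pass that judges whether the answer is supported by the retrieved chunks; `LLM.generate/1` is again a hypothetical stand-in for your model client:

```elixir
defmodule RAG.Grounding do
  # Second-pass verification: ask the model whether every claim in the
  # answer is supported by the retrieved context.
  def verify(answer, chunks) do
    context = Enum.map_join(chunks, "\n---\n", & &1.content)

    prompt = """
    Context:
    #{context}

    Answer to check:
    #{answer}

    Is every factual claim in the answer supported by the context?
    Reply with exactly GROUNDED or UNGROUNDED.
    """

    case LLM.generate(prompt) do
      {:ok, reply} ->
        # Check UNGROUNDED first: "GROUNDED" is a substring of it.
        if String.contains?(String.upcase(reply), "UNGROUNDED") do
          {:error, :ungrounded}
        else
          :ok
        end

      error ->
        error
    end
  end
end
```

This catches blatant hallucinations cheaply, but it isn't foolproof; for stricter guarantees, check answer claims against chunk spans directly or require inline citations.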