Automating Code Review with LLMs
How to build an LLM-powered code review system that catches real bugs and integrates into your CI pipeline.
LLM-powered code review isn’t about replacing human reviewers — it’s about catching the obvious stuff so humans can focus on architecture, design, and edge cases.
What LLMs Are Good At
LLMs excel at catching: obvious bugs, inconsistent naming, missing error handling, documentation gaps, style violations, and security anti-patterns. They’re less reliable for: architectural decisions, performance implications, and business logic correctness.
Building the Pipeline
Here’s how we wire an LLM reviewer into a GitHub Actions workflow:
````elixir
defmodule CodeReview.Pipeline do
  @moduledoc "Fetches a PR diff, reviews each changed file with an LLM, and posts comments."

  def review_pr(pr_number) do
    with {:ok, diff} <- GitHub.get_pr_diff(pr_number),
         {:ok, files} <- parse_diff(diff),
         {:ok, reviews} <- review_files(files),
         {:ok, _} <- post_comments(pr_number, reviews) do
      {:ok, length(reviews)}
    end
  end

  defp review_files(files) do
    reviews =
      files
      |> Enum.filter(&reviewable?/1)
      |> Task.async_stream(&review_file/1, max_concurrency: 5, on_timeout: :kill_task)
      |> Enum.flat_map(fn
        {:ok, comments} -> comments
        # A crashed or timed-out review task shouldn't sink the whole PR.
        {:exit, _reason} -> []
      end)

    {:ok, reviews}
  end

  defp review_file(%{filename: filename, patch: patch, language: lang}) do
    prompt = """
    Review this #{lang} code change. Focus on:

    1. Bugs or logic errors
    2. Missing error handling
    3. Security concerns
    4. Code clarity improvements

    Only comment on genuine issues. Do NOT comment on style preferences.

    Respond with a JSON array of comments:
    [{"line": <line_number>, "severity": "error|warning|info", "message": "..."}]

    If the code looks good, respond with an empty array: []

    File: #{filename}

    ```#{lang}
    #{patch}
    ```
    """

    {:ok, response} = LLM.complete(prompt, model: "claude-sonnet-4-6")

    # LLMs occasionally return malformed JSON; treat that as "no comments"
    # rather than crashing the review task with Jason.decode!/1.
    case Jason.decode(response) do
      {:ok, comments} when is_list(comments) -> comments
      _ -> []
    end
  end

  defp reviewable?(%{filename: f}) do
    not String.match?(f, ~r/\.(lock|min\.|generated)/)
  end
end
````
The Review Prompt Matters
The critical line is “Only comment on genuine issues.” Without it, LLMs tend to be overly nitpicky — suggesting rename-this and restructure-that on perfectly fine code. You want signal, not noise.
We also explicitly exclude lock files, minified assets, and generated code. LLMs waste tokens and produce garbage reviews on these.
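The exclusion is just a regex over filenames. A quick sketch of the same pattern used by `reviewable?/1`, checked against some sample paths:

```elixir
# The exclusion regex from reviewable?/1, shown against sample filenames.
exclude = ~r/\.(lock|min\.|generated)/

Regex.match?(exclude, "mix.lock")            # matches — skipped
Regex.match?(exclude, "assets/app.min.js")   # matches — skipped
Regex.match?(exclude, "schema.generated.ex") # matches — skipped
Regex.match?(exclude, "lib/review.ex")       # no match — reviewed
```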
Severity Tiers
Not all comments are equal. We use three severity levels:
- Error — Likely bug, security issue, or broken logic. Blocks merge.
- Warning — Missing error handling, potential edge case. Requires acknowledgment.
- Info — Style suggestion, documentation improvement. Optional.
This maps directly to GitHub’s review comment system and keeps noise manageable.
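One way to sketch that mapping: fold the comment list down to a single review event using GitHub's standard event names (`REQUEST_CHANGES`, `COMMENT`, `APPROVE`). The module and function names here are hypothetical, not part of the pipeline shown above:

```elixir
# Hypothetical helper: collapse a list of LLM comments into one
# GitHub review event. Any "error" severity blocks the merge.
defmodule CodeReview.Severity do
  def review_event(comments) do
    cond do
      Enum.any?(comments, &(&1["severity"] == "error")) -> "REQUEST_CHANGES"
      comments == [] -> "APPROVE"
      true -> "COMMENT"
    end
  end
end
```

A warning- or info-only review lands as a plain `COMMENT`, so it surfaces in the PR without gating the merge.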
Integration with CI
```yaml
# .github/workflows/llm-review.yml
name: LLM Code Review

on: [pull_request]

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # The runner needs Elixir/OTP before `mix run` will work.
      - uses: erlef/setup-beam@v1
        with:
          otp-version: "27"
          elixir-version: "1.17"
      - run: mix deps.get
      - name: Run LLM Review
        run: mix run -e "CodeReview.Pipeline.review_pr(${{ github.event.pull_request.number }})"
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
```
Results
After three months in production, our LLM reviewer catches about 15% of issues before human review. That doesn’t sound like a lot, but it means human reviewers can focus on the hard problems. The review cycle is faster, and fewer bugs slip through to staging.