Automating Code Review with LLMs
How to build an LLM-powered code review system that catches real bugs and integrates into your CI pipeline.
LLM-powered code review isn’t about replacing human reviewers — it’s about catching the obvious stuff so humans can focus on architecture, design, and edge cases.
What LLMs Are Good At
LLMs excel at catching: obvious bugs, inconsistent naming, missing error handling, documentation gaps, style violations, and security anti-patterns. They’re less reliable for: architectural decisions, performance implications, and business logic correctness.
Building the Pipeline
Here’s how we wire an LLM reviewer into a GitHub Actions workflow:
````elixir
defmodule CodeReview.Pipeline do
  @moduledoc "Fetches a PR diff, reviews each changed file with an LLM, and posts comments."

  def review_pr(pr_number) do
    with {:ok, diff} <- GitHub.get_pr_diff(pr_number),
         {:ok, files} <- parse_diff(diff),
         {:ok, reviews} <- review_files(files),
         {:ok, _} <- post_comments(pr_number, reviews) do
      {:ok, length(reviews)}
    end
  end

  defp review_files(files) do
    reviews =
      files
      |> Enum.filter(&reviewable?/1)
      |> Task.async_stream(&review_file/1, max_concurrency: 5, on_timeout: :kill_task)
      |> Enum.flat_map(fn
        {:ok, comments} -> comments
        # A crashed or timed-out review task shouldn't sink the whole PR.
        {:exit, _reason} -> []
      end)

    {:ok, reviews}
  end

  defp review_file(%{filename: filename, patch: patch, language: lang}) do
    prompt = """
    Review this #{lang} code change. Focus on:

    1. Bugs or logic errors
    2. Missing error handling
    3. Security concerns
    4. Code clarity improvements

    Only comment on genuine issues. Do NOT comment on style preferences.

    Respond with a JSON array of comments:
    [{"line": <line_number>, "severity": "error|warning|info", "message": "..."}]

    If the code looks good, respond with an empty array: []

    File: #{filename}

    ```#{lang}
    #{patch}
    ```
    """

    {:ok, response} = LLM.complete(prompt, model: "claude-sonnet-4-6")

    # LLMs occasionally return malformed JSON; treat that as "no comments"
    # rather than crashing the review task with Jason.decode!/1.
    case Jason.decode(response) do
      {:ok, comments} when is_list(comments) -> comments
      _ -> []
    end
  end

  defp reviewable?(%{filename: f}) do
    not String.match?(f, ~r/\.(lock|min\.|generated)/)
  end
end
````
The Review Prompt Matters
The critical line is “Only comment on genuine issues.” Without it, LLMs tend to be overly nitpicky — suggesting rename-this and restructure-that on perfectly fine code. You want signal, not noise.
We also explicitly exclude lock files, minified assets, and generated code. LLMs waste tokens and produce garbage reviews on these.
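The exclusion is just a regex over filenames. A quick sketch of the same pattern used by `reviewable?/1`, checked against some sample paths:

```elixir
# The exclusion regex from reviewable?/1, shown against sample filenames.
exclude = ~r/\.(lock|min\.|generated)/

Regex.match?(exclude, "mix.lock")            # matches — skipped
Regex.match?(exclude, "assets/app.min.js")   # matches — skipped
Regex.match?(exclude, "schema.generated.ex") # matches — skipped
Regex.match?(exclude, "lib/review.ex")       # no match — reviewed
```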
Severity Tiers
Not all comments are equal. We use three severity levels:
- Error — Likely bug, security issue, or broken logic. Blocks merge.
- Warning — Missing error handling, potential edge case. Requires acknowledgment.
- Info — Style suggestion, documentation improvement. Optional.
This maps directly to GitHub’s review comment system and keeps noise manageable.
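One way to sketch that mapping: fold the comment list down to a single review event using GitHub's standard event names (`REQUEST_CHANGES`, `COMMENT`, `APPROVE`). The module and function names here are hypothetical, not part of the pipeline shown above:

```elixir
# Hypothetical helper: collapse a list of LLM comments into one
# GitHub review event. Any "error" severity blocks the merge.
defmodule CodeReview.Severity do
  def review_event(comments) do
    cond do
      Enum.any?(comments, &(&1["severity"] == "error")) -> "REQUEST_CHANGES"
      comments == [] -> "APPROVE"
      true -> "COMMENT"
    end
  end
end
```

A warning- or info-only review lands as a plain `COMMENT`, so it surfaces in the PR without gating the merge.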
Integration with CI
```yaml
# .github/workflows/llm-review.yml
name: LLM Code Review

on: [pull_request]

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # The runner needs Elixir/OTP before `mix run` will work.
      - uses: erlef/setup-beam@v1
        with:
          otp-version: "27"
          elixir-version: "1.17"
      - run: mix deps.get
      - name: Run LLM Review
        run: mix run -e "CodeReview.Pipeline.review_pr(${{ github.event.pull_request.number }})"
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
```
Results
After three months in production, our LLM reviewer catches about 15% of issues before human review. That doesn’t sound like a lot, but it means human reviewers can focus on the hard problems. The review cycle is faster, and fewer bugs slip through to staging.