
Evaluating AI Code Review Tools

Pilot Results · 67 Bug Cases · 8 Tool Configs

v3 evaluation · Opus-judged · Det≥2 metric

General benchmarks don't test what matters for Provable

What benchmarks test

  • Generic web app bugs
  • Common language patterns
  • Well-documented frameworks
  • Typical CRUD logic errors

What Provable needs

  • ZK circuit constraint correctness
  • Soundness bugs in proof systems
  • Leo language semantics
  • Custom VM and compiler internals

There are 10+ commercial AI review tools now, but no established methodology for comparing them on domain-specific code. So we built our own evaluation framework.

AI Code Review Tools — March 2026

Tool | Price | Unlimited? | Platforms | Key Strength | Pilot
CodeRabbit | $30/user | Yes | GH GL BB Azure | Most installed GitHub AI app (2M+ repos) | Tested
CodeAnt AI | $10–24/user | Yes | GH GL BB Azure | Bundles SAST, secrets, IaC, SCA, DORA | Planned
GitHub Copilot | $10–39/user | No | GH | Zero friction, bundled with existing tooling | Tested
Greptile | $30 + $1/review | No | GH GL | Full codebase graph indexing, 82% catch rate | Tested
Claude Code Review | $15–25/review | No | GH | Deepest analysis, <1% false positives | Planned
In-house Agent | ~$0.65/review | Yes | Any | Full control, domain prompting, customizable | Tested

Full evaluation design

For each known bug: we check out the code before the fix, show it to each tool as if it were a new PR, and measure whether the tool identifies the bug. We know the answer because the fix has already been merged — that's our ground truth.

The in-house agent uses the Claude Agent SDK to spawn a Claude Code session in a cloned repo. The agent gets Bash, Grep, Read, Glob, and WebSearch — the same tools a human reviewer would use. It explores the codebase, reads related files, and writes a structured review, all within a sandboxed workspace.

diff-only — agent sees only the patch. No repo, no shell tools. Tests pure reasoning from the diff alone.

diff+repo — agent gets the full repo checkout at the base commit plus Bash and ripgrep. Tests whether codebase access improves catch rate.
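The two in-house modes differ only in inputs and tool access. A hypothetical config capturing the contrast (the schema is illustrative, not the runner's actual format):

```python
# Illustrative mode definitions for the in-house evaluation runner.
# Keys and structure are assumptions; the tool names mirror the text.
MODES = {
    "diff-only": {
        "inputs": ["patch"],                      # the PR diff, nothing else
        "tools": [],                              # no repo, no shell
        "tests": "pure reasoning from the diff alone",
    },
    "diff+repo": {
        "inputs": ["patch", "repo@base_commit"],  # full checkout at base
        "tools": ["Bash", "ripgrep"],             # shell-driven exploration
        "tests": "whether codebase access improves catch rate",
    },
}
```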

Do custom agents beat commercial tools? · Does repo context improve catch rates? · What's the cost per bug? · How noisy are the tools?

500 PRs in, 67 cases out

PRs fetched
500
↓ −189 no fix commit found
Cases mined
311
↓ −93 dep bumps, −45 duplicates, −27 no buggy lines
After rule filters
146
↓ −79 LLM curation (features, CI fixes, doc fixes, test-only)
Active cases
67
93 dependency bumps · 45 merged sibling (duplicate fix) · 27 no buggy lines found · 19 features mislabeled as fix · 16 test-expectation only · 14 CI/release script fixes · 12 doc-only fixes · 18 other (self-ref, typo, lint, perf)
  • 857 merged PRs over 3 years → 164 real bug fixes (19% bug rate)
  • 67 have provable ground truth — verified introducing commit + precise buggy lines
  • 90 more are real bugs with weak ground truth (recoverable with LLM-based labeling)
  • Not a sample — these are all the bugs we can rigorously verify
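The rule filters in the funnel above can be sketched as simple pattern checks over PR metadata. These patterns are illustrative assumptions, not the pipeline's actual eight rules:

```python
import re

# Hypothetical versions of the auto-exclusion rules; the real pipeline
# applies eight rules and then an LLM curation pass.
EXCLUSION_RULES = [
    ("dependency bump", re.compile(r"\bbump\b|\bdependabot\b", re.I)),
    ("doc-only fix",    re.compile(r"\bdocs?\b|\breadme\b", re.I)),
    ("CI/release fix",  re.compile(r"\bci\b|\brelease\b", re.I)),
]

def classify(pr_title: str) -> str:
    """Return the exclusion label for a PR title, or 'candidate' if it
    survives the rule filters and proceeds to LLM curation."""
    for label, pattern in EXCLUSION_RULES:
        if pattern.search(pr_title):
            return label        # excluded from the dataset
    return "candidate"          # humans / LLM make the final call
```

A title like "Bump serde to 1.0.200" is excluded up front; "Fix underflow in fee calc" proceeds to curation.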

End-to-end evaluation pipeline

mine (scrape fix PRs) → blame (find introducing commits) → curate (auto-exclude bad cases) → evaluate (run tools on cases) → normalize (common format) → judge (LLM scores each review) → analyze (reports & charts)

67 bug cases scored · 3 eval modes (PR, API, Agent) · ~$509 total pilot cost

How we score

Detection (0–3): did the tool find the bug?

0 missed → 1 wrong area → 2 correct file+issue → 3 correct ID + actionable fix

Quality (0–4): how useful was the review?

0 useless → 1 shallow → 2 adequate → 3 strong → 4 exceptional

Scored by Claude Opus as judge — tool names hidden to prevent bias.
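Given per-case judge scores, the headline metrics reduce to simple aggregation. A minimal sketch (`leaderboard_row` is a hypothetical name, not framework code):

```python
def leaderboard_row(det_scores: list[int], quality_scores: list[int]) -> dict:
    """Aggregate per-case judge scores into leaderboard metrics.
    det_scores: one 0-3 detection score per case; quality_scores: 0-4."""
    n = len(det_scores)
    return {
        "det>=2": sum(d >= 2 for d in det_scores) / n,   # bug found
        "det=3":  sum(d == 3 for d in det_scores) / n,   # found + actionable fix
        "quality": sum(quality_scores) / len(quality_scores),
    }
```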

Leaderboard — 67 cases, Opus-judged, Det≥2

Det≥2 = judge confirmed bug found. SNR = true-positive / false-positive comment ratio (above 1 = more signal than noise). FP = false positive comments.

Tool | Context | Det≥2 | Det=3 | Quality | SNR | FP | Novel
Opus 4.6 (prompt v1) | diff-only | 31% | 19% | 1.99 | 1.55 | 22 | 166
Opus 4.6 (prompt v1) | diff+repo | 31% | 14% | 2.25 | 1.20 | 30 | 202
Opus 4.6 (prompt v2) | diff+repo | 28% | 17% | 2.36 | 2.75 | 12 | 298
Sonnet 4.6 | diff-only | 27% | 15% | 1.95 | 0.57 | 49 | 126
Copilot | PR | 23% | 7% | 1.94 | 0.75 | 32 | 192
Sonnet 4.6 | diff+repo | 23% | 16% | 2.03 | 0.74 | 46 | 168
Greptile (rate limited) | PR | 11% | — | 2.02 | — | — | —
CodeRabbit (rate limited) | PR | 4% | — | 0.19 | — | — | —

Opus diff-only matches diff+repo on catch rate (31%) with fewer false positives (22 vs 30) and better precision on det=3 (19% vs 14%). Repo access adds quality (2.25) and novel findings (202) but also noise. Opus (prompt v2) has the best signal-to-noise (2.75) and most novel findings (298) — the structured approach produces the cleanest reviews.

Det≥2 = LLM judge confirmed tool identified the bug (stricter than mechanical ±10-line proximity used in pilot report).

Repo access adds quality, not catch rate

Diff-only matches or beats diff+repo on detection. Repo access adds novel findings and review depth — but also more false positives.

diff-only: 31% catch · 19% Det=3 · 22 FP
Focused. Fewer false positives. Best precision on actionable fixes.

diff+repo: 31% catch · 14% Det=3 · 30 FP
More novel findings (202 vs 166). Higher quality (2.25 vs 1.99). More noise.

The tradeoff

Repo access doesn't catch more bugs — but it produces richer reviews with more context. The agent finds issues that require understanding cross-file relationships. The cost is more false positives from exploring irrelevant code paths.

In-house agent beats commercial tools

Opus catches 31% of bugs vs Copilot's 23% — and produces nearly 3x as many actionable fixes (det=3: 19% vs 7%).

GitHub Copilot: 23% catch rate · 7% Det=3 · 0.75 SNR
Free, automatic on PR open. 32 false positives across 67 cases.

vs

Opus agent (diff-only): 31% catch rate · 19% Det=3 · 1.55 SNR
~$0.65/review, needs setup. Only 22 false positives, a third fewer than Copilot.

Signal-to-noise matters

SNR = true positives / false positives. Opus diff-only's SNR of 1.55 means more signal than noise. Copilot's 0.75 means more noise than signal. A tool that catches bugs but buries them in false positives isn't useful in practice.
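The reported SNR values above 1.0 only make sense as a ratio of true-positive to false-positive comments, which is the reading sketched here. The TP count of 34 for Opus diff-only is back-solved from the reported 1.55 SNR and 22 FPs, an inference rather than a measured count:

```python
def snr(true_positives: int, false_positives: int) -> float:
    # Ratio of correct findings to false-positive comments;
    # above 1.0 means more signal than noise.
    return true_positives / false_positives

# Opus diff-only: 1.55 SNR with 22 FPs implies roughly 34 true positives
# (back-solved from the leaderboard, an assumption).
# snr(34, 22) is about 1.55; snr for Copilot's 24 TP / 32 FP is 0.75.
```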

Structured review catches different bugs

The 3-phase v3 runner (survey → investigate → report) produces higher quality reviews (2.36 vs 1.99) with comparable catch rate (28% vs 31%). Each approach has a blind spot.

Diff-visible bugs (wrong attribute, wrong literal, wrong API call; no context needed): v3 gets distracted surveying 28 files and misses what's obvious in the diff hunk. The overhead hurts for simple bugs.

Context-dependent bugs (missing callers, spec violations, deleted error paths still referenced): structured investigation forces systematic checking. The domain rules file directly helped in 2 cases.

Implication: a hybrid v4 approach

Quick diff-level scan first (no tools, just read the patch). Then structured investigation only for hunks that need deeper context. Combines the speed of single-pass with the thoroughness of 3-phase — without the attention dilution.
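The proposed hybrid flow could look like the sketch below. `quick_scan`, `touches_cross_file_state`, and `deep_investigate` are hypothetical stand-ins for the two stages, and the heuristics inside them are placeholders, not the planned implementation:

```python
def quick_scan(hunk: str):
    # Phase 1: read the patch only. Placeholder heuristic: flag a
    # possible panic site if the hunk adds an unwrap().
    return f"possible panic in hunk: {hunk!r}" if "unwrap()" in hunk else None

def touches_cross_file_state(hunk: str) -> bool:
    # Placeholder escalation rule: deleted lines may orphan callers
    # elsewhere, so they warrant repo-level investigation.
    return any(line.startswith("-") for line in hunk.splitlines())

def deep_investigate(hunk: str) -> list[str]:
    # Stand-in for the structured survey -> investigate -> report pass.
    return [f"needs cross-file review: {hunk!r}"]

def review_pr(patch_hunks: list[str]) -> list[str]:
    comments, deferred = [], []
    for hunk in patch_hunks:
        finding = quick_scan(hunk)
        if finding:
            comments.append(finding)      # obvious diff-level bug: done
        elif touches_cross_file_state(hunk):
            deferred.append(hunk)         # only these pay the phase-2 cost
    for hunk in deferred:
        comments.extend(deep_investigate(hunk))
    return comments
```

The point of the structure: simple bugs never trigger the expensive investigation pass, so the attention dilution that hurt v3 on diff-visible bugs is avoided by design.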

The answer might be both

Each tool catches bugs the others miss. Copilot is free and automatic; the in-house agent catches more with higher quality. Running both should cover more ground than either alone.

Opus Agent (31% catch, 1.55 SNR) + Copilot (23% catch, free, automatic) = Ensemble (union catch rate > either alone)

Agent strengths: subtle logic errors, panic vs. proper error handling, refactoring regressions, issues that require reading multiple files to understand

Copilot strengths: copy-paste errors, wrong variables, type mismatches — fast line-specific comments on every PR at zero marginal cost

Next step: measure the overlap

If the two tools catch mostly different bugs, the union could reach 40-50% detection. If they overlap heavily, the agent alone is sufficient. This is the key experiment for the next phase.
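The 40-50% figure is consistent with simple bounds on the union: if the two tools' catches overlap completely, the ensemble adds nothing over the better tool; if their catches were statistically independent, the union would land near 47%. A sketch (the independence case is illustrative; the real overlap is what the next phase must measure):

```python
def union_bounds(p_a: float, p_b: float) -> tuple[float, float, float]:
    """Bounds on an ensemble's catch rate from two tools' individual rates."""
    floor = max(p_a, p_b)                    # tools catch the same bugs
    independent = 1 - (1 - p_a) * (1 - p_b)  # catches uncorrelated
    ceiling = min(p_a + p_b, 1.0)            # tools catch disjoint bugs
    return floor, independent, ceiling

# Opus agent at 31%, Copilot at 23% (leaderboard figures):
# floor 31%, independent ~47%, ceiling 54%
```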

Building a rigorous evaluation is itself a research problem

The methodology and framework are as valuable as the results — they make every future evaluation faster and more trustworthy.

Dataset quality is the hardest part

  • 13% yield: 500 PRs mined, only 67 survived curation
  • 8 exclusion rules built iteratively from bad results
  • LLM curation gets 80% there; humans make the final call

Scoring is harder than it looks

  • Mechanical scoring inflates — proximity ≠ detection
  • Tools output in wildly different formats — normalization is hard
  • LLM judge needed for "close but not exact" matches

PR tool integration is messy

  • Each tool posts reviews in different places
  • Greptile requires manual dashboard activation per repo
  • Response times: 3 min to 30 min — need async two-phase

Agent architecture matters

  • Agents have a time-management problem, not a reasoning problem
  • ~10% of agent runs produce zero comments (turn limit or timeout)
  • Two-pass (explore → review) separates navigation from analysis

Agent evals are expensive — but optimizable

$509 pilot spent (67 cases × 8 configs)
$1,636 full eval, naive (250 cases × 8 configs)
~$800 with optimizations (51% savings)

Cost per review: Opus (prompt v2, 3-phase) $2.25 · Opus diff+repo (Agent SDK) $0.99 · Opus + shell tools $0.73 (best) · Sonnet + shell tools $0.70 · PR tools (Copilot, Greptile, CodeRabbit) free

92% of cost is agent eval runs. Judge scoring is only 8%.

$1,636 naive → $1,252 with prompt caching → $994 with batch API + early stop → $803 with reused pilot data

51% savings from caching, batch API, early termination, and reusing scored pilot cases.
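The waterfall reduces to simple arithmetic over the reported figures; nothing here is newly measured:

```python
# Cost waterfall from the text: cumulative totals after each optimization.
waterfall = [
    ("naive", 1636),
    ("+ prompt caching", 1252),
    ("+ batch API + early stop", 994),
    ("+ reuse scored pilot cases", 803),
]

total_savings = 1 - waterfall[-1][1] / waterfall[0][1]
# ~0.509, matching the reported 51% savings

# Marginal dollars saved by each step:
step_savings = [
    (label, prev_cost - cost)
    for (label, cost), (_, prev_cost) in zip(waterfall[1:], waterfall)
]
# caching saves $384, batch + early stop $258, pilot reuse $191
```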

From pilot to full evaluation

  • Diff-only with shell tools — isolate whether tooling or context drives the improvement
  • Re-run Greptile & CodeRabbit with proper rate limit handling
  • Measure ensemble overlap — Copilot + Opus agent union catch rate
  • Hybrid v4 runner — quick diff scan + targeted investigation
  • Scale to 250 cases — n=67 is too small for statistical significance
  • Add non-Anthropic agents (Gemini, OpenAI)
  • Add remaining commercial tools (BugBot, Augment, DeepSource)
  • Domain-specific prompting — ZK/Leo hints for Provable repos
  • SWE-bench style fix evaluation — can the agent write a correct patch, not just find the bug?
  • Dataset already has base commits + known fixes to compare against
  • Human judge calibration — Cohen's kappa ≥ 0.85
  • Clean cases for false alarm rate measurement
  • Golden set curation — confirm/dispute via dashboard
  • Statistical rigor: confidence intervals and significance tests
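The judge-calibration item above uses the standard Cohen's kappa over paired labels. A self-contained sketch, assuming the detection scores (0-3) are treated as categorical labels (this is not the framework's implementation, and it is undefined when chance agreement is perfect):

```python
def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa: agreement between two raters beyond chance.
    Here: LLM judge vs. human judge over the same cases (target >= 0.85)."""
    n = len(rater_a)
    # observed agreement: fraction of cases where both raters match
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # chance agreement: product of each rater's marginal label frequencies
    labels = set(rater_a) | set(rater_b)
    p_chance = sum(
        (rater_a.count(lab) / n) * (rater_b.count(lab) / n) for lab in labels
    )
    return (p_observed - p_chance) / (1 - p_chance)
```

Perfect agreement gives 1.0; agreement at chance level gives 0.0, which is why raw percent agreement alone would overstate calibration.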

For each tool: how many bugs does it catch, at what cost, with how much noise? Enough data to make a clear build-vs-buy decision for Provable.

Questions?

1. In-house agents match or beat commercial tools on domain-specific code.
2. Shell access (Bash + ripgrep) is the key enabler: not just having the repo, but searching it efficiently.
3. The framework is reusable: new tools and models plug in and get scored automatically.

67 bug cases · 8 configs tested · 250 cases planned · ~$800 full eval est.

bugbench — framework code, pilot results, and case-by-case analysis