Evaluating AI Code Review Tools
v3 evaluation · Opus-judged · Det≥2 metric
There are 10+ commercial AI review tools now, but no established methodology for comparing them on domain-specific code. So we built our own evaluation framework.
| Tool | Price | Unlimited? | Platforms | Key Strength | Pilot |
|---|---|---|---|---|---|
| CodeRabbit | $30/user | Yes | GH GL BB Azure | Most installed GitHub AI app (2M+ repos) | Tested |
| CodeAnt AI | $10–24/user | Yes | GH GL BB Azure | Bundles SAST, secrets, IaC, SCA, DORA | Planned |
| GitHub Copilot | $10–39/user | No | GH | Zero friction, bundled with existing tooling | Tested |
| Greptile | $30 + $1/review | No | GH GL | Full codebase graph indexing, 82% catch rate | Tested |
| Claude Code Review | $15–25/review | No | GH | Deepest analysis, <1% false positives | Planned |
| In-house Agent | ~$0.65/review | Yes | Any | Full control, domain prompting, customizable | Tested |
For each known bug: we check out the code before the fix, show it to each tool as if it were a new PR, and measure whether the tool identifies the bug. We know the answer because the fix has already been merged — that's our ground truth.
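Under the assumption that each case is keyed by its merged fix commit, the setup can be sketched as a pair of git commands. The helper below and the case format it implies are hypothetical, not bugbench's actual code:

```python
def ground_truth_setup(fix_commit: str) -> dict[str, str]:
    # Parent of the merged fix = the last state where the bug is still present.
    base = f"{fix_commit}^"
    return {
        # State the tools review, as if it were a fresh PR.
        "checkout": f"git checkout {base}",
        # The merged fix itself: the oracle for what a reviewer should flag.
        "oracle_diff": f"git diff {base} {fix_commit}",
    }
```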
Uses the Claude Agent SDK to spawn a Claude Code session in a cloned repo. The agent gets Bash, Grep, Read, Glob, and WebSearch — the same tools a human reviewer would use. It explores the codebase, reads related files, and writes a structured review. All within a sandboxed workspace.
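A rough sketch of the spawning step, using Claude Code's non-interactive CLI rather than calling the SDK directly. `--allowedTools` and `-p` are documented Claude Code flags, but `build_review_command` and `run_review` are hypothetical wrappers; the real harness drives the Agent SDK:

```python
import subprocess

# Same tool set the harness grants the agent.
TOOLS = ["Bash", "Grep", "Read", "Glob", "WebSearch"]

def build_review_command(prompt: str) -> list[str]:
    # Non-interactive Claude Code invocation restricted to the listed tools.
    return ["claude", "-p", prompt, "--allowedTools", ",".join(TOOLS)]

def run_review(repo_dir: str, prompt: str) -> str:
    # cwd=repo_dir keeps the agent's file tools inside the cloned, sandboxed checkout.
    result = subprocess.run(build_review_command(prompt), cwd=repo_dir,
                            capture_output=True, text=True, check=True)
    return result.stdout  # the agent's structured review text
```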
diff-only — agent sees only the patch. No repo, no shell tools. Tests pure reasoning from the diff alone.
diff+repo — agent gets the full repo checkout at the base commit plus Bash and ripgrep. Tests whether codebase access improves catch rate.
Detection (0–3): did the tool find the bug?
0 missed → 1 wrong area → 2 correct file+issue → 3 correct ID + actionable fix
Quality (0–4): how useful was the review?
0 useless → 1 shallow → 2 adequate → 3 strong → 4 exceptional
Scored by Claude Opus as judge — tool names hidden to prevent bias.
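The two rubrics above can be encoded as small enums so scores stay comparable across runs. The names here are illustrative, not bugbench's actual schema:

```python
from enum import IntEnum

class Detection(IntEnum):
    MISSED = 0          # didn't find the bug
    WRONG_AREA = 1      # commented near, but on the wrong thing
    CORRECT_ISSUE = 2   # right file and issue
    ACTIONABLE_FIX = 3  # correct identification plus a usable fix

class Quality(IntEnum):
    USELESS = 0
    SHALLOW = 1
    ADEQUATE = 2
    STRONG = 3
    EXCEPTIONAL = 4

def detected(score: Detection) -> bool:
    # Det>=2 is the threshold used throughout: judge confirmed the bug was found.
    return score >= Detection.CORRECT_ISSUE
```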
Det≥2 = judge confirmed bug found. SNR = true-positive comments ÷ false-positive comments (higher = less noise; values above 1 mean more signal than noise). FP = false positive comments.
| Tool | Context | Det≥2 | Det=3 | Quality | SNR | FP | Novel |
|---|---|---|---|---|---|---|---|
| Opus 4.6 (prompt v1) | diff-only | 31% | 19% | 1.99 | 1.55 | 22 | 166 |
| Opus 4.6 (prompt v1) | diff+repo | 31% | 14% | 2.25 | 1.20 | 30 | 202 |
| Opus 4.6 (prompt v2) | diff+repo | 28% | 17% | 2.36 | 2.75 | 12 | 298 |
| Sonnet 4.6 | diff-only | 27% | 15% | 1.95 | 0.57 | 49 | 126 |
| Copilot | PR | 23% | 7% | 1.94 | 0.75 | 32 | 192 |
| Sonnet 4.6 | diff+repo | 23% | 16% | 2.03 | 0.74 | 46 | 168 |
| Greptile (rate limited) | PR | 11% | — | 2.02 | — | — | — |
| CodeRabbit (rate limited) | PR | 4% | — | 0.19 | — | — | — |
Opus diff-only matches diff+repo on catch rate (31%) with fewer false positives (22 vs 30) and better precision at Det=3 (19% vs 14%). Repo access adds quality (2.25) and novel findings (202) but also noise. Opus (prompt v2) has the best signal-to-noise (2.75) and most novel findings (298) — the structured approach produces the cleanest reviews.
Det≥2 = LLM judge confirmed tool identified the bug (stricter than mechanical ±10-line proximity used in pilot report).
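The two headline metrics reduce to a few lines, assuming per-case detection scores and per-tool comment counts are already available. SNR is taken as the ratio of true-positive to false-positive comments, the only reading consistent with table values above 1:

```python
def det_rate(detection_scores: list[int], threshold: int = 2) -> float:
    # Fraction of cases at or above the detection threshold (Det>=2 by default).
    return sum(s >= threshold for s in detection_scores) / len(detection_scores)

def snr(true_positives: int, false_positives: int) -> float:
    # Ratio of true-positive to false-positive comments; >1 = more signal than noise.
    return true_positives / false_positives
```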
Diff-only matches or beats diff+repo on detection. Repo access adds novel findings and review depth — but also more false positives.
Focused. Fewer false positives.
Best precision on actionable fixes.
More novel findings (202 vs 166).
Higher quality (2.25 vs 1.99). More noise.
Repo access doesn't catch more bugs — but it produces richer reviews with more context. The agent finds issues that require understanding cross-file relationships. The cost is more false positives from exploring irrelevant code paths.
Opus catches 31% of bugs vs Copilot's 23% — and produces nearly 3x as many actionable fixes (Det=3: 19% vs 7%).
Free, automatic on PR open.
32 false positives across 67 cases.
~$0.65/review, needs setup.
Only 22 false positives — a third fewer than Copilot.
SNR = true-positive comments ÷ false-positive comments. Opus diff-only's SNR of 1.55 means more signal than noise. Copilot's 0.75 means more noise than signal. A tool that catches bugs but buries them in false positives isn't useful in practice.
The 3-phase v3 runner (survey → investigate → report) produces higher quality reviews (2.36 vs 1.99) with comparable catch rate (28% vs 31%). Each approach has a blind spot.
Bug is visible directly in the diff — wrong attribute, wrong literal, wrong API call. No context needed.
v3 gets distracted surveying 28 files and misses what's obvious in the diff hunk. Overhead hurts for simple bugs.
Bug requires context beyond the diff — missing callers, spec violations, deleted error paths still referenced.
Structured investigation forces systematic checking. Domain rules file directly helped in 2 cases.
Quick diff-level scan first (no tools, just read the patch). Then structured investigation only for hunks that need deeper context. Combines the speed of single-pass with the thoroughness of 3-phase — without the attention dilution.
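The proposed hybrid can be sketched as a two-phase loop. `quick_scan`, `needs_context`, and `investigate` are hypothetical callables standing in for the diff-level pass and the structured 3-phase investigation:

```python
from typing import Callable

def review(diff_hunks: list,
           needs_context: Callable,
           quick_scan: Callable,
           investigate: Callable) -> list:
    findings = []
    for hunk in diff_hunks:
        # Phase 1: read the patch itself; catches diff-visible bugs cheaply.
        findings.extend(quick_scan(hunk))
        # Phase 2: structured investigation only when the hunk needs outside
        # context (callers, specs, deleted-but-still-referenced paths).
        if needs_context(hunk):
            findings.extend(investigate(hunk))
    return findings
```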
Each tool catches bugs the others miss. Copilot is free and automatic; the in-house agent catches more with higher quality. Running both should cover more ground than either alone.
Subtle logic errors, panic vs. proper error handling, refactoring regressions, issues that require reading multiple files to understand
Copy-paste errors, wrong variables, type mismatches — fast line-specific comments on every PR at zero marginal cost
If the two tools catch mostly different bugs, the union could reach 40-50% detection. If they overlap heavily, the agent alone is sufficient. This is the key experiment for the next phase.
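The union estimate is plain set arithmetic over case IDs, assuming each tool's caught bugs are tracked as a set:

```python
def union_rate(caught_by_a: set, caught_by_b: set, total_cases: int) -> float:
    # Detection rate if both tools run and a bug counts as caught by either.
    return len(caught_by_a | caught_by_b) / total_cases
```

With fully disjoint catches at 31% and 23%, the union would reach 54%; with full overlap it stays at 31%. The 40–50% estimate assumes partial overlap between those extremes.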
The methodology and framework are as valuable as the results — they make every future evaluation faster and more trustworthy.
67 cases × 8 configs
250 cases × 8 configs
51% savings
92% of cost is agent eval runs. Judge scoring is only 8%.
51% savings from caching, batch API, early termination, and reusing scored pilot cases.
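As a rough budget model, assuming a flat per-run agent cost and judge scoring at ~8% of total spend (the per-run figure is an assumption, not a measured constant):

```python
def projected_cost(n_cases: int, n_configs: int, cost_per_run: float,
                   judge_share: float = 0.08) -> float:
    # Agent eval runs are ~92% of spend; judge scoring adds the remaining ~8%.
    agent_total = n_cases * n_configs * cost_per_run
    return agent_total / (1.0 - judge_share)
```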
For each tool: how many bugs does it catch, at what cost, with how much noise? Enough data to make a clear build-vs-buy decision for Provable.
bugbench — framework code, pilot results, and case-by-case analysis