
Evaluating AI Code Review Tools

Pilot Results · 67 Bug Cases · 8 Tool Configs

v3 evaluation · Opus-judged · Det≥2 metric

General benchmarks don't test what matters for Provable

What benchmarks test

  • Generic web app bugs
  • Common language patterns
  • Well-documented frameworks
  • Typical CRUD logic errors

What Provable needs

  • ZK circuit constraint correctness
  • Soundness bugs in proof systems
  • Leo language semantics
  • Custom VM and compiler internals

There are 10+ commercial AI review tools now, but no established methodology for comparing them on domain-specific code. So we built our own evaluation framework.

AI Code Review Tools — March 2026

Tool | Price | Unlimited? | Platforms | Key Strength | Pilot
CodeRabbit | $30/user | Yes | GH GL BB Azure | Most installed GitHub AI app (2M+ repos) | Tested
CodeAnt AI | $10–24/user | Yes | GH GL BB Azure | Bundles SAST, secrets, IaC, SCA, DORA | Planned
GitHub Copilot | $10–39/user | No | GH | Zero friction, bundled with existing tooling | Tested
Greptile | $30 + $1/review | No | GH GL | Full codebase graph indexing, 82% catch rate | Tested
Claude Code Review | $15–25/review | No | GH | Deepest analysis, <1% false positives | Planned
In-house Agent | ~$0.65/review | Yes | Any | Full control, domain prompting, customizable | Tested

Full evaluation design

For each known bug: we check out the code before the fix, show it to each tool as if it were a new PR, and measure whether the tool identifies the bug. We know the answer because the fix has already been merged — that's our ground truth.

The in-house agent uses the Claude Agent SDK to spawn a Claude Code session in a cloned repo. The agent gets Bash, Grep, Read, Glob, and WebSearch — the same tools a human reviewer would use. It explores the codebase, reads related files, and writes a structured review, all within a sandboxed workspace.

diff-only — agent sees only the patch. No repo, no shell tools. Tests pure reasoning from the diff alone.

diff+repo — agent gets the full repo checkout at the base commit plus Bash and ripgrep. Tests whether codebase access improves catch rate.
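The two in-house modes differ only in inputs and tool access. A hypothetical config capturing the contrast (the schema is illustrative, not the runner's actual format):

```python
# Illustrative mode definitions for the in-house evaluation runner.
# Keys and structure are assumptions; the tool names mirror the text.
MODES = {
    "diff-only": {
        "inputs": ["patch"],                      # the PR diff, nothing else
        "tools": [],                              # no repo, no shell
        "tests": "pure reasoning from the diff alone",
    },
    "diff+repo": {
        "inputs": ["patch", "repo@base_commit"],  # full checkout at base
        "tools": ["Bash", "ripgrep"],             # shell-driven exploration
        "tests": "whether codebase access improves catch rate",
    },
}
```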

Do custom agents beat commercial tools? · Does repo context improve catch rates? · What's the cost per bug? · How noisy are the tools?

500 PRs in, 67 cases out

PRs fetched
500
↓ −189 no fix commit found
Cases mined
311
↓ −93 dep bumps, −45 duplicates, −27 no buggy lines
After rule filters
146
↓ −79 LLM curation (features, CI fixes, doc fixes, test-only)
Active cases
67
93 dependency bumps · 45 merged sibling (duplicate fix) · 27 no buggy lines found · 19 features mislabeled as fix · 16 test-expectation only · 14 CI/release script fixes · 12 doc-only fixes · 18 other (self-ref, typo, lint, perf)
  • 857 merged PRs over 3 years → 164 real bug fixes (19% bug rate)
  • 67 have provable ground truth — verified introducing commit + precise buggy lines
  • 90 more are real bugs with weak ground truth (recoverable with LLM-based labeling)
  • Not a sample — these are all the bugs we can rigorously verify
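The rule filters in the funnel above can be sketched as simple pattern checks over PR metadata. These patterns are illustrative assumptions, not the pipeline's actual eight rules:

```python
import re

# Hypothetical versions of the auto-exclusion rules; the real pipeline
# applies eight rules and then an LLM curation pass.
EXCLUSION_RULES = [
    ("dependency bump", re.compile(r"\bbump\b|\bdependabot\b", re.I)),
    ("doc-only fix",    re.compile(r"\bdocs?\b|\breadme\b", re.I)),
    ("CI/release fix",  re.compile(r"\bci\b|\brelease\b", re.I)),
]

def classify(pr_title: str) -> str:
    """Return the exclusion label for a PR title, or 'candidate' if it
    survives the rule filters and proceeds to LLM curation."""
    for label, pattern in EXCLUSION_RULES:
        if pattern.search(pr_title):
            return label        # excluded from the dataset
    return "candidate"          # humans / LLM make the final call
```

A title like "Bump serde to 1.0.200" is excluded up front; "Fix underflow in fee calc" proceeds to curation.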

End-to-end evaluation pipeline

mine (scrape fix PRs) → blame (find introducing commits) → curate (auto-exclude bad cases) → evaluate (run tools on cases) → normalize (common format) → judge (LLM scores each review) → analyze (reports & charts)

67 bug cases scored · 3 eval modes (PR, API, Agent) · ~$509 total pilot cost

How we score

Detection (0–3): did the tool find the bug?

0 missed → 1 wrong area → 2 correct file+issue → 3 correct ID + actionable fix

Quality (0–4): how useful was the review?

0 useless → 1 shallow → 2 adequate → 3 strong → 4 exceptional

Scored by Claude Opus as judge — tool names hidden to prevent bias.
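Given per-case judge scores, the headline metrics reduce to simple aggregation. A minimal sketch (`leaderboard_row` is a hypothetical name, not framework code):

```python
def leaderboard_row(det_scores: list[int], quality_scores: list[int]) -> dict:
    """Aggregate per-case judge scores into leaderboard metrics.
    det_scores: one 0-3 detection score per case; quality_scores: 0-4."""
    n = len(det_scores)
    return {
        "det>=2": sum(d >= 2 for d in det_scores) / n,   # bug found
        "det=3":  sum(d == 3 for d in det_scores) / n,   # found + actionable fix
        "quality": sum(quality_scores) / len(quality_scores),
    }
```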

Leaderboard — 67 cases, Opus-judged, Det≥2

Det≥2 = judge confirmed bug found. SNR = true-positive / false-positive comment ratio (above 1 = more signal than noise). FP = false positive comments.

Tool | Context | Det≥2 | Det=3 | Quality | SNR | FP | Novel
Opus 4.6 (prompt v1) | diff-only | 31% | 19% | 1.99 | 1.55 | 22 | 166
Opus 4.6 (prompt v1) | diff+repo | 31% | 14% | 2.25 | 1.20 | 30 | 202
Opus 4.6 (prompt v2) | diff+repo | 28% | 17% | 2.36 | 2.75 | 12 | 298
Sonnet 4.6 | diff-only | 27% | 15% | 1.95 | 0.57 | 49 | 126
Copilot | PR | 23% | 7% | 1.94 | 0.75 | 32 | 192
Sonnet 4.6 | diff+repo | 23% | 16% | 2.03 | 0.74 | 46 | 168
Greptile (rate limited) | PR | 11% | — | 2.02 | — | — | —
CodeRabbit (rate limited) | PR | 4% | — | 0.19 | — | — | —

Opus diff-only matches diff+repo on catch rate (31%) with fewer false positives (22 vs 30) and better precision on det=3 (19% vs 14%). Repo access adds quality (2.25) and novel findings (202) but also noise. Opus (prompt v2) has the best signal-to-noise (2.75) and most novel findings (298) — the structured approach produces the cleanest reviews.

Det≥2 = LLM judge confirmed tool identified the bug (stricter than mechanical ±10-line proximity used in pilot report).

Repo access adds quality, not catch rate

Diff-only matches or beats diff+repo on detection. Repo access adds novel findings and review depth — but also more false positives.

diff-only: 31% catch · 19% Det=3 · 22 FP
Focused. Fewer false positives. Best precision on actionable fixes.

diff+repo: 31% catch · 14% Det=3 · 30 FP
More novel findings (202 vs 166). Higher quality (2.25 vs 1.99). More noise.

The tradeoff

Repo access doesn't catch more bugs — but it produces richer reviews with more context. The agent finds issues that require understanding cross-file relationships. The cost is more false positives from exploring irrelevant code paths.

In-house agent beats commercial tools

Opus catches 31% of bugs vs Copilot's 23% — and produces nearly 3x as many actionable fixes (det=3: 19% vs 7%).

GitHub Copilot: 23% catch rate · 7% Det=3 · 0.75 SNR
Free, automatic on PR open. 32 false positives across 67 cases.

vs

Opus agent (diff-only): 31% catch rate · 19% Det=3 · 1.55 SNR
~$0.65/review, needs setup. Only 22 false positives, a third fewer than Copilot.

Signal-to-noise matters

SNR = true positives / false positives. Opus diff-only's SNR of 1.55 means more signal than noise. Copilot's 0.75 means more noise than signal. A tool that catches bugs but buries them in false positives isn't useful in practice.
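The reported SNR values above 1.0 only make sense as a ratio of true-positive to false-positive comments, which is the reading sketched here. The TP count of 34 for Opus diff-only is back-solved from the reported 1.55 SNR and 22 FPs, an inference rather than a measured count:

```python
def snr(true_positives: int, false_positives: int) -> float:
    # Ratio of correct findings to false-positive comments;
    # above 1.0 means more signal than noise.
    return true_positives / false_positives

# Opus diff-only: 1.55 SNR with 22 FPs implies roughly 34 true positives
# (back-solved from the leaderboard, an assumption).
# snr(34, 22) is about 1.55; snr for Copilot's 24 TP / 32 FP is 0.75.
```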

Structured review catches different bugs

The 3-phase v3 runner (survey → investigate → report) produces higher quality reviews (2.36 vs 1.99) with comparable catch rate (28% vs 31%). Each approach has a blind spot.

Diff-visible bugs (wrong attribute, wrong literal, wrong API call; no context needed): v3 gets distracted surveying 28 files and misses what's obvious in the diff hunk. The overhead hurts for simple bugs.

Context-dependent bugs (missing callers, spec violations, deleted error paths still referenced): structured investigation forces systematic checking. The domain rules file directly helped in 2 cases.

Implication: a hybrid v4 approach

Quick diff-level scan first (no tools, just read the patch). Then structured investigation only for hunks that need deeper context. Combines the speed of single-pass with the thoroughness of 3-phase — without the attention dilution.
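The proposed hybrid flow could look like the sketch below. `quick_scan`, `touches_cross_file_state`, and `deep_investigate` are hypothetical stand-ins for the two stages, and the heuristics inside them are placeholders, not the planned implementation:

```python
def quick_scan(hunk: str):
    # Phase 1: read the patch only. Placeholder heuristic: flag a
    # possible panic site if the hunk adds an unwrap().
    return f"possible panic in hunk: {hunk!r}" if "unwrap()" in hunk else None

def touches_cross_file_state(hunk: str) -> bool:
    # Placeholder escalation rule: deleted lines may orphan callers
    # elsewhere, so they warrant repo-level investigation.
    return any(line.startswith("-") for line in hunk.splitlines())

def deep_investigate(hunk: str) -> list[str]:
    # Stand-in for the structured survey -> investigate -> report pass.
    return [f"needs cross-file review: {hunk!r}"]

def review_pr(patch_hunks: list[str]) -> list[str]:
    comments, deferred = [], []
    for hunk in patch_hunks:
        finding = quick_scan(hunk)
        if finding:
            comments.append(finding)      # obvious diff-level bug: done
        elif touches_cross_file_state(hunk):
            deferred.append(hunk)         # only these pay the phase-2 cost
    for hunk in deferred:
        comments.extend(deep_investigate(hunk))
    return comments
```

The point of the structure: simple bugs never trigger the expensive investigation pass, so the attention dilution that hurt v3 on diff-visible bugs is avoided by design.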

The answer might be both

Each tool catches bugs the others miss. Copilot is free and automatic; the in-house agent catches more with higher quality. Running both should cover more ground than either alone.

Opus Agent (31% catch, 1.55 SNR) + Copilot (23% catch, free, automatic) = Ensemble (union catch rate > either alone)

Agent strengths: subtle logic errors, panic vs. proper error handling, refactoring regressions, issues that require reading multiple files to understand

Copilot strengths: copy-paste errors, wrong variables, type mismatches — fast line-specific comments on every PR at zero marginal cost

Next step: measure the overlap

If the two tools catch mostly different bugs, the union could reach 40-50% detection. If they overlap heavily, the agent alone is sufficient. This is the key experiment for the next phase.
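The 40-50% figure is consistent with simple bounds on the union: if the two tools' catches overlap completely, the ensemble adds nothing over the better tool; if their catches were statistically independent, the union would land near 47%. A sketch (the independence case is illustrative; the real overlap is what the next phase must measure):

```python
def union_bounds(p_a: float, p_b: float) -> tuple[float, float, float]:
    """Bounds on an ensemble's catch rate from two tools' individual rates."""
    floor = max(p_a, p_b)                    # tools catch the same bugs
    independent = 1 - (1 - p_a) * (1 - p_b)  # catches uncorrelated
    ceiling = min(p_a + p_b, 1.0)            # tools catch disjoint bugs
    return floor, independent, ceiling

# Opus agent at 31%, Copilot at 23% (leaderboard figures):
# floor 31%, independent ~47%, ceiling 54%
```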

Building a rigorous evaluation is itself a research problem

The methodology and framework are as valuable as the results — they make every future evaluation faster and more trustworthy.

Dataset quality is the hardest part

  • 13% yield: 500 PRs mined, only 67 survived curation
  • 8 exclusion rules built iteratively from bad results
  • LLM curation gets 80% there; humans make the final call

Scoring is harder than it looks

  • Mechanical scoring inflates — proximity ≠ detection
  • Tools output in wildly different formats — normalization is hard
  • LLM judge needed for "close but not exact" matches

PR tool integration is messy

  • Each tool posts reviews in different places
  • Greptile requires manual dashboard activation per repo
  • Response times: 3 min to 30 min — need async two-phase

Agent architecture matters

  • Agents have a time-management problem, not a reasoning problem
  • ~10% of agent runs produce zero comments (turn limit or timeout)
  • Two-pass (explore → review) separates navigation from analysis

Agent evals are expensive — but optimizable

$509 pilot spent (67 cases × 8 configs)
$1,636 full eval, naive (250 cases × 8 configs)
~$800 with optimizations (51% savings)

Cost per review: Opus (prompt v2, 3-phase) $2.25 · Opus diff+repo (Agent SDK) $0.99 · Opus + shell tools $0.73 (best) · Sonnet + shell tools $0.70 · PR tools (Copilot, Greptile, CodeRabbit) free

92% of cost is agent eval runs. Judge scoring is only 8%.

$1,636 naive → $1,252 with prompt caching → $994 with batch API + early stop → $803 with reused pilot data

51% savings from caching, batch API, early termination, and reusing scored pilot cases.
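The waterfall reduces to simple arithmetic over the reported figures; nothing here is newly measured:

```python
# Cost waterfall from the text: cumulative totals after each optimization.
waterfall = [
    ("naive", 1636),
    ("+ prompt caching", 1252),
    ("+ batch API + early stop", 994),
    ("+ reuse scored pilot cases", 803),
]

total_savings = 1 - waterfall[-1][1] / waterfall[0][1]
# ~0.509, matching the reported 51% savings

# Marginal dollars saved by each step:
step_savings = [
    (label, prev_cost - cost)
    for (label, cost), (_, prev_cost) in zip(waterfall[1:], waterfall)
]
# caching saves $384, batch + early stop $258, pilot reuse $191
```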

From pilot to full evaluation

  • Diff-only with shell tools — isolate whether tooling or context drives the improvement
  • Re-run Greptile & CodeRabbit with proper rate limit handling
  • Measure ensemble overlap — Copilot + Opus agent union catch rate
  • Hybrid v4 runner — quick diff scan + targeted investigation
  • Scale to 250 cases — n=67 is too small for statistical significance
  • Add non-Anthropic agents (Gemini, OpenAI)
  • Add remaining commercial tools (BugBot, Augment, DeepSource)
  • Domain-specific prompting — ZK/Leo hints for Provable repos
  • SWE-bench style fix evaluation — can the agent write a correct patch, not just find the bug?
  • Dataset already has base commits + known fixes to compare against
  • Human judge calibration — Cohen's kappa ≥ 0.85
  • Clean cases for false alarm rate measurement
  • Golden set curation — confirm/dispute via dashboard
  • Statistical rigor: confidence intervals and significance tests
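The judge-calibration item above uses the standard Cohen's kappa over paired labels. A self-contained sketch, assuming the detection scores (0-3) are treated as categorical labels (this is not the framework's implementation, and it is undefined when chance agreement is perfect):

```python
def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa: agreement between two raters beyond chance.
    Here: LLM judge vs. human judge over the same cases (target >= 0.85)."""
    n = len(rater_a)
    # observed agreement: fraction of cases where both raters match
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # chance agreement: product of each rater's marginal label frequencies
    labels = set(rater_a) | set(rater_b)
    p_chance = sum(
        (rater_a.count(lab) / n) * (rater_b.count(lab) / n) for lab in labels
    )
    return (p_observed - p_chance) / (1 - p_chance)
```

Perfect agreement gives 1.0; agreement at chance level gives 0.0, which is why raw percent agreement alone would overstate calibration.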

For each tool: how many bugs does it catch, at what cost, with how much noise? Enough data to make a clear build-vs-buy decision for Provable.

Questions?

1. In-house agents match or beat commercial tools on domain-specific code.
2. Shell access (Bash + ripgrep) is the key enabler: not just having the repo, but searching it efficiently.
3. The framework is reusable: new tools and models plug in and get scored automatically.

67 bug cases · 8 configs tested · 250 cases planned · ~$800 full eval est.

bugbench — framework code, pilot results, and case-by-case analysis