Applied Intelligence
Module 12: Knowing When Not to Use Agents

Reliability Statistics and Benchmarks

The numbers

The previous pages described patterns qualitatively: where agents struggle, what context can't be conveyed, how failures manifest. This page presents the quantitative picture.

Numbers matter because intuition misleads. Developers who've had productive agent sessions tend to overestimate general reliability. Those who've hit early failures tend to underestimate it. The data provides a reality check.

PR merge rates by tool

The January 2026 study of 33,596 agent-authored pull requests established baseline merge rates:

Agent            Merge Rate   Volume
Codex            82.59%       Highest volume
Devin            61-67%       Growing rapidly
Claude Code      59.04%       Mid-range
GitHub Copilot   43.04%       Highest absolute count

The gap between Codex (83%) and Copilot (43%) reflects architectural differences. Codex runs in isolated sandboxes for up to 30 minutes, iterating privately until tests pass before opening a PR. The work that reaches reviewers has already survived internal validation. Copilot operates closer to real-time, surfacing work earlier in its lifecycle.

Neither approach is inherently superior. Higher merge rates from extensive pre-validation come at the cost of latency. Lower merge rates from rapid surfacing enable faster human feedback loops. The right choice depends on workflow requirements.

OpenAI reported that internal teams saw a 70% rise in weekly merged pull requests after Codex adoption. That metric combines higher individual PR success rates with increased submission volume.

By early 2026, AI agents were authoring over 99,000 PRs per year. One in seven PRs now involves an AI agent.

SWE-bench: the benchmark reality gap

SWE-bench has become the standard evaluation for coding agents. It presents agents with real GitHub issues and measures whether they can produce patches that pass the repository's test suite.

The leaderboard tells one story:

Model             SWE-bench Verified
Claude Opus 4.5   80.9%
GPT-5.2           80.0%
Gemini 3 Flash    78.0%
Gemini 3 Pro      76.2%
GPT-5             74.9%

These numbers look impressive. An 80% success rate on real GitHub issues suggests agents are close to handling most routine development tasks autonomously.

The reality is messier.

SWE-bench Verified uses a curated subset of issues. Ambiguous problems, underspecified requirements, and issues requiring deep domain knowledge are filtered out. The remaining issues are cleaner than typical production work.

SWE-bench Pro introduced a more realistic evaluation. The same models that score 70-80% on Verified drop dramatically:

Model             Verified   Pro (Public)   Pro (Private)
Claude Opus 4.5   80.9%      ~45%           17.8%
GPT-5.2           80.0%      56.4%          ~15%
GPT-5             74.9%      41.8%          14.9%

The gap between Verified (80%) and Pro Private (15-18%) is a 4-5x drop. Models performing at apparent A-level on cleaned benchmarks deliver D-level on realistic evaluations.

Why the collapse?

Data contamination. Research estimates that 32.67% of SWE-bench Verified successes may involve models that saw evaluation code during training. When patches match training data, success doesn't indicate generalization.

Weak test suites. Analysis found 31.08% of patches incorrectly labeled as passing. The original test suites didn't catch the bugs the patches introduced. Actual success rates, when strictly validated, fell from 12.47% to 3.97% for one major system.

Simplified problem selection. Real-world issues include ambiguity, missing information, and implicit requirements. Benchmark curation removes exactly the problems that make development hard.

Private codebase performance

The most realistic assessment comes from private codebases: proprietary code that couldn't have appeared in training data.

The results are sobering:

Context                             Top Model Score
Public benchmarks                   56-80%
Private codebase (public set)       22-23%
Private codebase (commercial set)   14-17%

Enterprise codebases present additional challenges. Public benchmark repositories typically contain 1-30 million lines of code. Enterprise systems can reach 100 billion lines, four orders of magnitude larger. Benchmark evaluation pulls entire repositories into containers. No enterprise can replicate that approach with terabyte-scale codebases.

Language and domain matter. Go and Python show higher resolution rates, with some models exceeding 30% on private sets. JavaScript and TypeScript vary wildly, from 0% to 30% depending on the specific codebase and model combination.

The 14-17% figure for commercial private sets is the current floor for realistic agent performance on unfamiliar, proprietary code. That's not a temporary limitation. It's structural: general models applied to specific contexts.

Compound workflow mathematics

The previous module covered multi-agent orchestration. The mathematics of compound workflows deserves explicit attention.

If a single step succeeds with probability p, a workflow of n independent steps succeeds with probability p^n:

Per-Step Success   5 Steps   10 Steps   20 Steps
99%                95%       90%        82%
95%                77%       60%        36%
90%                59%       35%        12%
85%                44%       20%        4%

At 95% per-step reliability, a 20-step workflow succeeds only 36% of the time. At 90% per-step, it drops to 12%.
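
The arithmetic is simple enough to verify directly; a short Python sketch reproduces the table above:

```python
# Compound success probability: a workflow of n independent steps, each
# succeeding with probability p, succeeds overall with probability p ** n.

def workflow_success(p: float, steps: int) -> float:
    """Probability that all `steps` independent steps succeed."""
    return p ** steps

for p in (0.99, 0.95, 0.90, 0.85):
    cells = ", ".join(f"{n} steps: {workflow_success(p, n):.0%}" for n in (5, 10, 20))
    print(f"{p:.0%} per step -> {cells}")
```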

This is why complex autonomous workflows fail. No amount of clever prompting fixes compound probability. The only solutions are either dramatically higher per-step success rates or dramatically shorter workflows.

Current research finds agent task completion drops below 10% for tasks that would take humans more than 4 hours. That aligns with compound mathematics: a 4-hour task likely involves enough steps that even 90%+ per-step reliability collapses to low overall success.

For workflows requiring more than 10 steps, implement human checkpoints. Compound probability means autonomous multi-step execution will fail more often than it succeeds, regardless of individual step quality.
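
To make the checkpoint recommendation concrete, compare a fully autonomous 20-step run against the same work split into 5-step segments with a human check after each. The segment size and the framing of cost below are illustrative assumptions, not measured values:

```python
# Illustrative only: the 95% per-step rate comes from the table above;
# the 5-step segment size is an assumed checkpoint interval.

P_STEP = 0.95
TOTAL_STEPS = 20
SEGMENT_SIZE = 5

autonomous = P_STEP ** TOTAL_STEPS      # all 20 steps must succeed unattended
per_segment = P_STEP ** SEGMENT_SIZE    # each checkpointed 5-step chunk

print(f"Fully autonomous (20 steps): {autonomous:.0%} chance of clean success")
print(f"Per 5-step segment:          {per_segment:.0%} chance of clean success")
# With checkpoints, a failure surfaces after at most 5 steps of work and can
# be corrected before it propagates, instead of invalidating the entire run.
```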

Security failure rates

Code quality encompasses more than functionality. Security matters, and the data here is concerning.

Veracode's 2025 analysis of AI-generated code found:

  • 45% of samples contained security flaws (OWASP Top 10 vulnerabilities)
  • 72% failure rate for Java (only 28.5% produced secure code)
  • 43% failure rate for JavaScript
  • 38% failure rate for Python

Specific vulnerability patterns:

Vulnerability Type                  AI vs Human Rate
Cross-site scripting (XSS)          2.74x higher
Insecure direct object references   1.91x higher
Improper password handling          1.88x higher
Injection vulnerabilities           1.5x higher

For context-dependent security flaws like XSS, only 12-13% of generated code is secure. A failure rate that high on XSS defenses means security-critical paths require human implementation.

A concerning finding: security performance remained flat regardless of model size or training sophistication. Larger models weren't meaningfully more secure. The vulnerability patterns appear structural to how code generation works, not limitations of specific model versions.

Human oversight and remediation tools can reduce flaws by over 60%. That reduction requires the oversight to actually happen. When review processes assume AI code is "probably fine," security debt accumulates.

Code reproducibility

A less-discussed metric: can others run the code an agent generates?

Research found 31.7% of AI-generated code fails execution when others attempt to reproduce it. The code compiles. It may even pass tests on the generating system. But environmental assumptions, implicit dependencies, or platform-specific behavior cause failures elsewhere.
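
A hypothetical illustration of the pattern (nothing below comes from the research; the path and function are invented for the example): code like this can pass every check on the machine that generated it and still fail on a colleague's checkout.

```python
# Hypothetical example of an environmental assumption: the hardcoded,
# machine-specific path exists on the generating system but not on a
# teammate's checkout or in CI, so reproduction fails at runtime.
import json

def load_settings() -> dict:
    with open("/home/dev/project/config/settings.json") as fh:
        return json.load(fh)
```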

Related: code churn (code rewritten or deleted within two weeks) has doubled since 2021. GitClear's analysis of 211 million lines attributes a significant portion of this increase to AI-generated code that didn't survive production contact.

This matters for team contexts. Code written by an agent in one developer's session may not work when another developer pulls it. The "works on my machine" problem, historically a source of integration friction, has new manifestations.

What the numbers mean

Across all these metrics:

Realistic success expectations:

  • Structured, template-like tasks: 74-84% success
  • Bug fixes and feature work: 55-64% success
  • Refactoring and complex changes: 50% success
  • Novel domain work on private code: 15-20% success

Quality multipliers:

  • 1.7x more issues per PR than human code
  • 45% security vulnerability rate
  • 31.7% reproducibility failure rate

Compound effect:

  • Multi-step workflows degrade exponentially
  • 4+ hour tasks drop below 10% success
  • Session quality degrades over time, not gradually but suddenly

The numbers don't argue against using agents. They argue for calibrated expectations. An 83% merge rate from Codex is genuinely useful. A 15% success rate on private codebases is genuinely limiting. Both are true simultaneously.

Practitioners who use this data match tasks to realistic success probabilities. They don't attempt 20-step autonomous workflows when mathematics predicts 36% success. They don't assume security is handled when 45% of generated code contains vulnerabilities. They don't extrapolate from public benchmark scores to private codebase expectations.
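
One way to operationalize these figures is a simple lookup that pairs a task category with its reported success rate before deciding how much autonomy to grant. The categories and rates below come from this page; the 55% autonomy threshold is an assumed policy choice for illustration:

```python
# Hypothetical decision aid built from the success rates reported above
# (lower bound of each range). The threshold is an assumption, not a finding.

SUCCESS_RATES = {
    "structured, template-like tasks": 0.74,    # 74-84% reported
    "bug fixes and feature work": 0.55,         # 55-64% reported
    "refactoring and complex changes": 0.50,    # ~50% reported
    "novel domain work on private code": 0.15,  # 15-20% reported
}

AUTONOMY_THRESHOLD = 0.55  # assumed minimum expected success for light review

for task, rate in SUCCESS_RATES.items():
    mode = "agent-led, light review" if rate >= AUTONOMY_THRESHOLD else "human-led, agent-assisted"
    print(f"{task}: ~{rate:.0%} expected success -> {mode}")
```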

The numbers are the foundation for judgment. The next pages build on that foundation.
