Applied Intelligence
Module 12: Knowing When Not to Use Agents

Reliability Statistics and Benchmarks

The numbers

The previous pages described patterns qualitatively: where agents struggle, what context can't be conveyed, how failures manifest. This page presents the quantitative picture.

Numbers matter because intuition misleads. Developers who've had productive agent sessions tend to overestimate general reliability. Those who've hit early failures tend to underestimate it. The data provides a reality check.

PR merge rates by tool

The January 2026 study of 33,596 agent-authored pull requests established baseline merge rates:

Agent            Merge Rate   Volume
Codex            82.59%       Highest volume
Devin            61-67%       Growing rapidly
Claude Code      59.04%       Mid-range
GitHub Copilot   43.04%       Highest absolute count

The gap between Codex (83%) and Copilot (43%) reflects architectural differences. Codex runs in isolated sandboxes for up to 30 minutes, iterating privately until tests pass before opening a PR. The work that reaches reviewers has already survived internal validation. Copilot operates closer to real-time, surfacing work earlier in its lifecycle.

Neither approach is inherently superior. Higher merge rates from extensive pre-validation come at the cost of latency. Lower merge rates from rapid surfacing enable faster human feedback loops. The right choice depends on workflow requirements.

OpenAI reported that internal teams saw a 70% rise in weekly merged pull requests after Codex adoption. That metric combines higher individual PR success rates with increased submission volume.

By early 2026, AI agents were authoring over 99,000 PRs per year. One in seven PRs now involves an AI agent.

SWE-bench: the benchmark reality gap

SWE-bench has become the standard evaluation for coding agents. It presents agents with real GitHub issues and measures whether they can produce patches that pass the repository's test suite.

The leaderboard tells one story:

Model             SWE-bench Verified
Claude Opus 4.5   80.9%
GPT-5.2           80.0%
Gemini 3 Flash    78.0%
Gemini 3 Pro      76.2%
GPT-5             74.9%

These numbers look impressive. An 80% success rate on real GitHub issues suggests agents are close to handling most routine development tasks autonomously.

The reality is messier.

SWE-bench Verified uses a curated subset of issues. Ambiguous problems, underspecified requirements, and issues requiring deep domain knowledge are filtered out. The remaining issues are cleaner than typical production work.

SWE-bench Pro introduced a more realistic evaluation. The same models that score 70-80% on Verified drop dramatically:

Model             Verified   Pro (Public)   Pro (Private)
Claude Opus 4.5   80.9%      ~45%           17.8%
GPT-5.2           80.0%      56.4%          ~15%
GPT-5             74.9%      41.8%          14.9%

The gap between Verified (80%) and Pro Private (15-18%) is a 4-5x drop. Models performing at apparent A-level on cleaned benchmarks deliver D-level on realistic evaluations.

Why the collapse?

Data contamination. Research estimates that 32.67% of SWE-bench Verified successes may involve models that saw evaluation code during training. When patches match training data, success doesn't indicate generalization.

Weak test suites. Analysis found 31.08% of patches incorrectly labeled as passing. The original test suites didn't catch the bugs the patches introduced. Actual success rates, when strictly validated, fell from 12.47% to 3.97% for one major system.

Simplified problem selection. Real-world issues include ambiguity, missing information, and implicit requirements. Benchmark curation removes exactly the problems that make development hard.

Private codebase performance

The most realistic assessment comes from private codebases: proprietary code that couldn't have appeared in training data.

The results are sobering:

Context                             Top Model Score
Public benchmarks                   56-80%
Private codebase (public set)       22-23%
Private codebase (commercial set)   14-17%

Enterprise codebases present additional challenges. Public benchmark repositories typically contain 1-30 million lines of code. Enterprise systems can reach 100 billion lines, four orders of magnitude larger. Benchmark evaluation pulls entire repositories into containers. No enterprise can replicate that approach with terabyte-scale codebases.

Language and domain matter. Go and Python show higher resolution rates, with some models exceeding 30% on private sets. JavaScript and TypeScript vary wildly, from 0% to 30% depending on the specific codebase and model combination.

The 14-17% figure for commercial private sets is the current floor for realistic agent performance on unfamiliar, proprietary code. That's not a temporary limitation. It's structural: general models applied to specific contexts.

Compound workflow mathematics

The previous module covered multi-agent orchestration. The mathematics of compound workflows deserves explicit attention.

If a single step succeeds with probability p, a workflow of n independent steps succeeds with probability p^n:

Per-Step Success   5 Steps   10 Steps   20 Steps
99%                95%       90%        82%
95%                77%       60%        36%
90%                59%       35%        12%
85%                44%       20%        4%

At 95% per-step reliability, a 20-step workflow succeeds only 36% of the time. At 90% per-step, it drops to 12%.
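
The arithmetic is simple enough to verify directly; a short Python sketch reproduces the table above:

```python
# Compound success probability: a workflow of n independent steps, each
# succeeding with probability p, succeeds overall with probability p ** n.

def workflow_success(p: float, steps: int) -> float:
    """Probability that all `steps` independent steps succeed."""
    return p ** steps

for p in (0.99, 0.95, 0.90, 0.85):
    cells = ", ".join(f"{n} steps: {workflow_success(p, n):.0%}" for n in (5, 10, 20))
    print(f"{p:.0%} per step -> {cells}")
```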

This is why complex autonomous workflows fail. No amount of clever prompting fixes compound probability. The only solutions are either dramatically higher per-step success rates or dramatically shorter workflows.

Current research finds agent task completion drops below 10% for tasks that would take humans more than 4 hours. That aligns with compound mathematics: a 4-hour task likely involves enough steps that even 90%+ per-step reliability collapses to low overall success.

For workflows requiring more than 10 steps, implement human checkpoints. Compound probability means autonomous multi-step execution will fail more often than it succeeds, regardless of individual step quality.
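
To make the checkpoint recommendation concrete, compare a fully autonomous 20-step run against the same work split into 5-step segments with a human check after each. The segment size and the framing of cost below are illustrative assumptions, not measured values:

```python
# Illustrative only: the 95% per-step rate comes from the table above;
# the 5-step segment size is an assumed checkpoint interval.

P_STEP = 0.95
TOTAL_STEPS = 20
SEGMENT_SIZE = 5

autonomous = P_STEP ** TOTAL_STEPS      # all 20 steps must succeed unattended
per_segment = P_STEP ** SEGMENT_SIZE    # each checkpointed 5-step chunk

print(f"Fully autonomous (20 steps): {autonomous:.0%} chance of clean success")
print(f"Per 5-step segment:          {per_segment:.0%} chance of clean success")
# With checkpoints, a failure surfaces after at most 5 steps of work and can
# be corrected before it propagates, instead of invalidating the entire run.
```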

Security failure rates

Code quality encompasses more than functionality. Security matters, and the data here is concerning.

Veracode's 2025 analysis of AI-generated code found:

  • 45% of samples contained security flaws (OWASP Top 10 vulnerabilities)
  • 72% failure rate for Java (only 28.5% produced secure code)
  • 43% failure rate for JavaScript
  • 38% failure rate for Python

Specific vulnerability patterns:

Vulnerability Type                  AI vs Human Rate
Cross-site scripting (XSS)          2.74x higher
Insecure direct object references   1.91x higher
Improper password handling          1.88x higher
Injection vulnerabilities           1.5x higher

For context-dependent security flaws like XSS, only 12-13% of generated code is secure. A failure rate that high on XSS defenses means security-critical paths require human implementation.

A concerning finding: security performance remained flat regardless of model size or training sophistication. Larger models weren't meaningfully more secure. The vulnerability patterns appear structural to how code generation works, not limitations of specific model versions.

Human oversight and remediation tools can reduce flaws by over 60%. That reduction requires the oversight to actually happen. When review processes assume AI code is "probably fine," security debt accumulates.

Code reproducibility

A less-discussed metric: can others run the code an agent generates?

Research found 31.7% of AI-generated code fails execution when others attempt to reproduce it. The code compiles. It may even pass tests on the generating system. But environmental assumptions, implicit dependencies, or platform-specific behavior cause failures elsewhere.
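
A hypothetical illustration of the pattern (nothing below comes from the research; the path and function are invented for the example): code like this can pass every check on the machine that generated it and still fail on a colleague's checkout.

```python
# Hypothetical example of an environmental assumption: the hardcoded,
# machine-specific path exists on the generating system but not on a
# teammate's checkout or in CI, so reproduction fails at runtime.
import json

def load_settings() -> dict:
    with open("/home/dev/project/config/settings.json") as fh:
        return json.load(fh)
```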

Related: code churn (code rewritten or deleted within two weeks) has doubled since 2021. GitClear's analysis of 211 million lines attributes a significant portion of this increase to AI-generated code that didn't survive production contact.

This matters for team contexts. Code written by an agent in one developer's session may not work when another developer pulls it. The "works on my machine" problem, historically a source of integration friction, has new manifestations.

What the numbers mean

Across all these metrics:

Realistic success expectations:

  • Structured, template-like tasks: 74-84% success
  • Bug fixes and feature work: 55-64% success
  • Refactoring and complex changes: 50% success
  • Novel domain work on private code: 15-20% success

Quality multipliers:

  • 1.7x more issues per PR than human code
  • 45% security vulnerability rate
  • 31.7% reproducibility failure rate

Compound effect:

  • Multi-step workflows degrade exponentially
  • 4+ hour tasks drop below 10% success
  • Session quality degrades over time, not gradually but suddenly

The numbers don't argue against using agents. They argue for calibrated expectations. An 83% merge rate from Codex is genuinely useful. A 15% success rate on private codebases is genuinely limiting. Both are true simultaneously.

Practitioners who use this data match tasks to realistic success probabilities. They don't attempt 20-step autonomous workflows when mathematics predicts 36% success. They don't assume security is handled when 45% of generated code contains vulnerabilities. They don't extrapolate from public benchmark scores to private codebase expectations.
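
One way to operationalize these figures is a simple lookup that pairs a task category with its reported success rate before deciding how much autonomy to grant. The categories and rates below come from this page; the 55% autonomy threshold is an assumed policy choice for illustration:

```python
# Hypothetical decision aid built from the success rates reported above
# (lower bound of each range). The threshold is an assumption, not a finding.

SUCCESS_RATES = {
    "structured, template-like tasks": 0.74,    # 74-84% reported
    "bug fixes and feature work": 0.55,         # 55-64% reported
    "refactoring and complex changes": 0.50,    # ~50% reported
    "novel domain work on private code": 0.15,  # 15-20% reported
}

AUTONOMY_THRESHOLD = 0.55  # assumed minimum expected success for light review

for task, rate in SUCCESS_RATES.items():
    mode = "agent-led, light review" if rate >= AUTONOMY_THRESHOLD else "human-led, agent-assisted"
    print(f"{task}: ~{rate:.0%} expected success -> {mode}")
```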

The numbers are the foundation for judgment. The next pages build on that foundation.
