Applied Intelligence
Module 12: Knowing When Not to Use Agents

Tasks Where Agents Struggle

The limits of agentic assistance

Modules 1 through 11 taught how to make agents effective. Module 12 addresses when to stop using them.

The best practitioners know where the boundaries are. Not theoretical limits from benchmarks, but practical boundaries discovered through production work. The developer who recognizes when to close Claude Code and open an editor directly ships faster than the one who asks for "one more attempt."

Agent capabilities follow a clear pattern. Task success rates vary by category, and the variation is dramatic. Understanding this distribution—which tasks succeed at an 84% rate and which fail 73% of the time—turns agentic development from guesswork into planning.

Success rates by task type

A January 2026 empirical study analyzed 33,596 agent-authored pull requests across GitHub, providing the first large-scale view of where AI coding agents actually succeed and fail.

Task Category               Merge Rate   Notes
Documentation               84%          Codex achieves 92%
CI/Build configuration      74-79%       Structured, predictable changes
Bug fixes                   64%          Moderate success, high variability
Feature additions           56-81%       Depends heavily on scope
Performance optimization    55%          Copilot drops to 27%
Refactoring/tests           50%          Claude Code at 50%, others lower

The gap between documentation (84%) and performance optimization (55%) is a 29-percentage-point spread. That's not noise. It's a signal about what agents can and cannot do.

Agent-specific variation matters too. Codex merges 82.59% of PRs overall, while Copilot merges only 43.04%. For performance tasks specifically, Copilot's success rate drops to 27%—barely more than one merged PR in four. These aren't equivalent tools with minor differences. They have fundamentally different capability profiles.

Why documentation succeeds

Documentation tasks top the success charts because they're transformations, not judgments.

When an agent documents a function, the inputs are visible (the code) and the output format is predictable (natural language describing behavior). The task is self-contained. The agent doesn't need to understand system-wide implications, predict runtime behavior, or make architectural tradeoffs.
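
A sketch makes this concrete. The function below is invented, but it shows the shape of the task: the input is fully visible, and the docstring can be verified by reading it alongside the code.

```python
# An invented example: everything needed to document this function is visible
# in its own body, and the generated docstring can be checked by reading it.
def chunked(items: list, size: int):
    """Yield successive lists of at most `size` items from `items`.

    The final chunk may be shorter than `size`. Raises ValueError if
    `size` is less than 1.
    """
    if size < 1:
        raise ValueError("size must be >= 1")
    for i in range(0, len(items), size):
        yield items[i:i + size]
```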

High-success task categories share these characteristics:

  • Clear inputs: The relevant information fits in context
  • Predictable outputs: Success criteria can be evaluated without execution
  • Self-contained scope: No dependencies on implicit knowledge
  • Low consequence for errors: Mistakes are easily caught and corrected

CI configuration changes succeed for similar reasons. YAML syntax follows rigid rules. Configuration errors surface immediately in CI runs. The feedback loop is tight and automated.
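
A rough sketch of that feedback loop, assuming PyYAML and an invented workflow path: a malformed config fails the moment it is parsed, with no judgment required.

```python
# A minimal sketch of why the feedback loop is tight: a malformed CI config is
# rejected as soon as it is parsed, long before any judgment call is needed.
# Uses PyYAML; the path and the "jobs" convention are illustrative assumptions.
import sys
import yaml


def check_workflow(path: str) -> list[str]:
    try:
        with open(path) as f:
            doc = yaml.safe_load(f)
    except yaml.YAMLError as exc:
        return [f"YAML syntax error: {exc}"]      # surfaces immediately
    if not isinstance(doc, dict) or not doc.get("jobs"):
        return ["no jobs defined"]                # structural error, also immediate
    return []


if __name__ == "__main__":
    issues = check_workflow(sys.argv[1] if len(sys.argv) > 1 else ".github/workflows/ci.yml")
    print("\n".join(issues) or "config parses and defines jobs")
    sys.exit(1 if issues else 0)
```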

Why performance optimization fails

Performance optimization sits at the opposite end of the spectrum. The reasons are worth examining because they illuminate fundamental agent limitations.

CodeFlash research attempted to optimize over 100,000 open-source functions using leading LLMs. The results:

  • 62% of suggested optimizations introduced bugs (incorrect behavior)
  • 73% of behaviorally correct suggestions provided less than 5% improvement
  • 90% overall were either incorrect or ineffective

(The third figure follows from the first two: 62% incorrect, plus 73% of the remaining 38%, comes to roughly 90%.)

Agents cannot profile code, measure execution time, or verify correctness through testing. They operate in a theoretical space, predicting optimization effects without execution capabilities.

Performance optimization requires runtime profiling to identify actual bottlenecks, benchmarking to measure whether changes help or hurt, tradeoff analysis to weigh memory versus speed, and system-level understanding of how changes affect downstream components.

Agents possess none of these capabilities. A separate study of 324 agent-generated and 83 human-authored performance PRs found agents included explicit performance validation only 45.7% of the time, compared to 63.6% for humans. When agents did validate, they used benchmarks only 25% of the time versus 49% for humans.

The failure mode is silent. Agents occasionally report performance improvements without supporting evidence, claiming "7400x speedup" in PR descriptions without corresponding benchmark code. This confabulation pattern appears consistently in tasks where verification is difficult.

Don't delegate performance work to agents unless you're prepared to profile and benchmark every change yourself. The 90% failure rate means accepting suggestions uncritically will likely make code slower or broken. Use agents to generate candidate optimizations, then validate manually.
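
A minimal validation harness might look like the sketch below. The function names and workload are invented, but the shape follows the advice above: confirm behavioral equivalence first, then time both versions under identical conditions.

```python
# A sketch of manual validation for an agent-suggested optimization.
# `original` and `candidate` are invented stand-ins for the existing
# function and the agent's rewrite.
import random
import timeit


def original(xs):
    # Existing implementation: quadratic duplicate removal.
    seen, out = [], []
    for x in xs:
        if x not in seen:
            seen.append(x)
            out.append(x)
    return out


def candidate(xs):
    # Agent-suggested rewrite using a set for O(1) membership checks.
    seen, out = set(), []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


def validate(fn_old, fn_new, cases, repeat=5, number=20):
    # 1. Correctness: the rewrite must agree with the original on every case.
    for case in cases:
        assert fn_new(list(case)) == fn_old(list(case)), f"behavior differs on {case!r}"
    # 2. Performance: compare best-of-N timings on the same workload.
    data = [random.randrange(1000) for _ in range(5000)]
    t_old = min(timeit.repeat(lambda: fn_old(data), repeat=repeat, number=number))
    t_new = min(timeit.repeat(lambda: fn_new(data), repeat=repeat, number=number))
    print(f"original: {t_old:.4f}s  candidate: {t_new:.4f}s  speedup: {t_old / t_new:.1f}x")


if __name__ == "__main__":
    validate(original, candidate, cases=[[], [1, 1, 2], ["a", "b", "a"]])
```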

Multi-file refactoring: the context boundary problem

Refactoring appears to be a natural agent task—reorganize code, rename functions, extract modules. The data tells a different story.

Agent-authored PRs show refactoring merge rates around 50%, but that number hides the details. A study of agentic refactoring found agents perform 35.8% low-level refactorings (simple renames, extracts) versus 24.4% for humans, but only 43.0% high-level refactorings (architectural changes) versus 54.9% for humans.

Agents over-index on local transformations they can verify and under-deliver on architectural work requiring system understanding.

Multi-file refactoring requires understanding how components interact across module boundaries, tracking ripple effects through dependency chains, maintaining semantic consistency across files the agent may never read, and preserving undocumented conventions and implicit contracts.

Qodo's 2025 survey found 65% of developers report agents miss relevant context during refactoring tasks. The context window contains code, but not the relationships between code.

When refactoring scope exceeds context capacity, agents produce plausible-looking changes that break in integration. The fix works locally; it fails systemically.
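
The sketch below collapses two hypothetical modules into one file to show the failure mode: the agent renames a function and updates every call site it can see, while a string-based lookup outside its context still points at the old name.

```python
# A contrived, single-file stand-in for two modules. The agent edits only the
# "handlers" section; the "scheduler" section never enters its context.
import types

# --- handlers.py (visible to the agent) -------------------------------------
handlers = types.SimpleNamespace()

def process_payment(order):
    # Renamed from export_payment; every call site the agent could see
    # was updated to match.
    return f"processed {order}"

handlers.process_payment = process_payment

# --- scheduler.py (outside the agent's context) ------------------------------
# Implicit contract: jobs resolve their handler by name from configuration.
JOB_SPEC = {"handler": "export_payment"}

def run_job(order):
    fn = getattr(handlers, JOB_SPEC["handler"])   # fails only at runtime
    return fn(order)

if __name__ == "__main__":
    run_job("order-42")   # AttributeError: no attribute 'export_payment'
```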

Legacy code and domain-specific logic

Legacy codebases present compound challenges. One practitioner reported their 8-year-old jQuery codebase "repeatedly baffled" AI coding assistants, which kept suggesting modern alternatives (React, Vue, ES6 modules) when the correct answer was to respect the existing architecture.

This pattern is common. Agents optimize for contemporary patterns in their training data. When those patterns conflict with legacy constraints—browser compatibility, framework limitations, undocumented dependencies—agents produce modernization suggestions rather than working fixes.

Domain-specific business logic poses similar problems. Research on enterprise codebases found agents achieved only 34.2% accuracy when addressing domain-specific requirements. In regulated industries (healthcare, finance), success rates dropped to 22.7%.

Business rules exist in specifications, requirements documents, regulatory guidelines, and institutional knowledge, none of which fits in a context window. When agents encounter domain logic, they pattern-match against general programming practices rather than specific business constraints.

GitHub Copilot was found to misinterpret business requirements in 58.3% of cases involving workflows spanning multiple systems. Multi-system workflows require knowledge about data contracts, API behaviors, and error handling conventions that exist nowhere in the codebase itself.

The task duration correlation

METR research found a pattern: agent success correlates strongly with task duration.

Human task time    Agent success rate
< 4 minutes        ~100%
15 minutes         ~70%
1 hour             ~50%
4+ hours           < 10%

The strength of the fit (R² = 0.83) suggests task length isn't just a correlate but a predictor of agent failure.

This explains why the same agent can excel at unit test generation but fail at feature implementation. Writing a single test takes minutes. Implementing a feature spanning multiple files takes hours. The underlying capability is identical; the task duration determines outcomes.

Claude 3.7 Sonnet has an estimated "time horizon" of approximately one hour—the task duration at which agents succeed 50% of the time. This horizon is growing (doubling approximately every 7 months), but it remains a real constraint.
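
One way to make the horizon concrete, sketched here rather than taken from METR's exact published model, is to treat success probability as a logistic function of log task duration and define the horizon as the duration where the curve crosses 50%:

```latex
% A sketch, not METR's exact fit: p is success probability, t is human task
% duration, sigma is the logistic function, and h is the time horizon.
p(\text{success} \mid t) = \sigma\left(\beta_0 - \beta_1 \log t\right),
\qquad
p(\text{success} \mid h) = \tfrac{1}{2}
\;\Longrightarrow\;
h = e^{\beta_0 / \beta_1}
```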

If a task would take a competent developer more than an hour, decompose it before delegating. Break feature implementations into sub-tasks. Create checkpoints. Review intermediate results. Agent success compounds when tasks stay short.
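
A low-tech way to apply this is sketched below, with invented sub-tasks and estimates: write down the decomposition, compare each piece against the horizon, and split anything that exceeds it before delegating.

```python
# A sketch of pre-delegation decomposition. The sub-tasks and estimates are
# invented; the 60-minute horizon comes from the METR figure quoted above.
TIME_HORIZON_MIN = 60

subtasks = [
    {"name": "add database migration for the new column", "est_min": 20},
    {"name": "extend the API serializer and its unit tests", "est_min": 40},
    {"name": "wire the feature flag into the UI", "est_min": 30},
    {"name": "implement the full feature end to end", "est_min": 240},
]

for task in subtasks:
    if task["est_min"] > TIME_HORIZON_MIN:
        print(f"SPLIT FURTHER:         {task['name']} ({task['est_min']} min)")
    else:
        print(f"DELEGATE + CHECKPOINT: {task['name']} ({task['est_min']} min)")
```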

Why structured tasks succeed

The pattern across all these findings is consistent: structured tasks succeed; judgment tasks fail.

Structured tasks have defined inputs and outputs where the specification is complete. Results are verifiable without domain expertise. The solution space is constrained with few valid approaches. Feedback is fast—errors surface quickly through tests or compilation.

Boilerplate generation, CRUD operations, test scaffolding, and documentation all fit this pattern. Research found AI assistance reduced implementation time by 43.2% for boilerplate, 51.7% for repetitive CRUD, and 37.9% for common algorithmic challenges.

Judgment tasks are different. They require implicit knowledge not present in code or context, tradeoff evaluation where competing concerns have no clear answer, system understanding of effects that propagate beyond visible code, and business context from requirements, regulations, or conventions.

Performance optimization, architectural refactoring, business logic, and feature design all require judgment. Agents can generate options but cannot evaluate them.

Use agents for the structured portions of judgment tasks. Generate candidate implementations, then evaluate them yourself. Produce test scaffolds, then add meaningful assertions. Draft documentation, then verify accuracy.
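
The sketch below shows that split for test scaffolding. parse_invoice is a hypothetical stand-in defined inline so the scaffold runs; the TODO marks where human judgment belongs.

```python
# A runnable sketch: the agent drafts the structure and the obvious checks,
# the human adds the assertions that encode business rules.
import pytest


def parse_invoice(payload: dict) -> dict:
    # Stand-in for the real code under test, included so the file executes.
    if any(line["amount"] < 0 for line in payload.get("lines", [])):
        raise ValueError("negative line amount")
    return {"id": payload["id"], "total": sum(l["amount"] for l in payload.get("lines", []))}


class TestParseInvoice:
    def test_minimal_invoice(self):
        result = parse_invoice({"id": "INV-1", "lines": []})
        assert result["id"] == "INV-1"          # agent-written structural check

    def test_rejects_negative_amounts(self):
        with pytest.raises(ValueError):
            parse_invoice({"id": "INV-2", "lines": [{"amount": -10}]})

    def test_rounding_and_tax_rules(self):
        # TODO(human): assert the domain-specific rounding and tax behavior;
        # the agent has no way to know these rules.
        ...
```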

Agentic development isn't about replacing judgment. It's about accelerating the structured work surrounding judgment so you have more time for decisions that matter.
