When traditional approaches win
So when should you just write the code yourself?
The previous pages covered where agents struggle and how to spot failures in progress. This page tackles the practical question: when is agent assistance not worth the overhead?
Two factors determine the answer: certain tasks reliably defeat agents regardless of prompt quality, and certain developers, especially experienced ones in familiar codebases, see no net benefit once the overhead of directing and verifying is counted, even on tasks agents handle well.
Tasks where agents reliably fail
Agents fail at some task categories regardless of prompt quality, context-engineering effort, or which agent you use. These failures aren't solvable with better technique.
Performance optimization is the clearest example. CodeFlash research found 90% of AI-suggested optimizations were wrong or provided no benefit. Even worse, 62% changed actual behavior. Agents can't profile your runtime, identify real bottlenecks, or understand your hardware. They suggest optimizations based on what looked good in training data, not what your specific workload actually needs. Performance work requires measuring, forming hypotheses, and validating changes. Agents can't close that loop.
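Closing that loop means measuring before and after, and proving behavior didn't change in between. Here is a minimal sketch of that discipline, assuming a hypothetical linear-scan lookup as the suspected bottleneck and a dictionary index as the candidate fix; substitute the real code path and a workload sampled from production:

```python
"""Measure-hypothesize-validate loop that agents can't close on their own.
Both lookup functions are hypothetical stand-ins for real application code."""
import timeit

HAYSTACK = [f"item-{i}" for i in range(10_000)]
INDEX = {value: pos for pos, value in enumerate(HAYSTACK)}


def slow_lookup(needle: str) -> int:
    # Current implementation: linear scan (the suspected bottleneck).
    return HAYSTACK.index(needle)


def fast_lookup(needle: str) -> int:
    # Candidate optimization: precomputed dictionary index.
    return INDEX[needle]


def validate_then_measure(sample: list[str]) -> None:
    # 1. Behavior must be identical before any timing number matters.
    for needle in sample:
        assert slow_lookup(needle) == fast_lookup(needle), needle

    # 2. Only then compare timings on a representative workload.
    old = timeit.timeit(lambda: [slow_lookup(n) for n in sample], number=50)
    new = timeit.timeit(lambda: [fast_lookup(n) for n in sample], number=50)
    print(f"old {old:.3f}s  new {new:.3f}s  speedup {old / new:.1f}x")


if __name__ == "__main__":
    validate_then_measure(HAYSTACK[::500])
```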
Multi-file refactoring hovers around 50% merge rate, but that number hides the real problem. Agents can't hold an entire refactoring in context simultaneously. A function gets renamed in one file while call sites processed earlier or later keep the old name. Imports get added where unnecessary, removed where needed, or updated inconsistently. Larger refactorings compound these inconsistencies.
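A cheap guard for this failure mode is a mechanical sweep for the old symbol once the agent finishes. A sketch, assuming a hypothetical rename of fetch_user to load_user across a Python tree; adapt the name and glob to your own refactoring:

```python
"""Post-refactor consistency check: fail if any file still references the
old name. fetch_user/load_user and the .py glob are illustrative only."""
import sys
from pathlib import Path

OLD_NAME = "fetch_user"      # the symbol the agent was asked to rename
SOURCE_GLOB = "**/*.py"      # keep this script outside the tree it scans


def find_stale_references(repo_root: str) -> list[tuple[Path, int, str]]:
    stale = []
    for path in Path(repo_root).glob(SOURCE_GLOB):
        text = path.read_text(errors="ignore")
        for lineno, line in enumerate(text.splitlines(), 1):
            if OLD_NAME in line:
                stale.append((path, lineno, line.strip()))
    return stale


if __name__ == "__main__":
    leftovers = find_stale_references(".")
    for path, lineno, line in leftovers:
        print(f"{path}:{lineno}: {line}")
    sys.exit(1 if leftovers else 0)
```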
Security-critical code has unacceptable failure rates. With 45% of AI-generated code containing vulnerabilities (and context-dependent issues like XSS hitting 86% failure), security decisions shouldn't be delegated. Agents can't reason about threat models or your specific deployment context. They produce code that looks like it addresses security patterns without understanding why those patterns matter.
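The context dependence is easy to see in miniature. The sketch below (illustrative only; real code should lean on a templating engine or a vetted sanitizer) shows the same payload needing different treatment in body, attribute, and JavaScript contexts:

```python
"""Why XSS defenses are context-dependent: one payload, three contexts."""
import html
import json

user_input = '" onmouseover="alert(1)'

# HTML body context: escaping angle brackets and ampersands is enough.
body_safe = html.escape(user_input)

# Attribute context: quotes must be escaped too, or the payload breaks out
# of the attribute and injects its own event handler.
attr_unsafe = f'<div title="{html.escape(user_input, quote=False)}">'
attr_safe = f'<div title="{html.escape(user_input, quote=True)}">'

# JavaScript context: HTML escaping is the wrong tool; the value needs
# JS/JSON string encoding instead.
js_safe = f"<script>var name = {json.dumps(user_input)};</script>"

print(body_safe)
print(attr_unsafe)  # broken: onmouseover lands as a real attribute
print(attr_safe)    # quotes become &quot;, the payload stays inert
print(js_safe)
```

An agent that has only seen the body-context pattern will happily apply it to the attribute case, and the result looks escaped while remaining exploitable.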
Domain-specific business logic requires knowledge that doesn't exist in training data. Your company's unique rules, regulatory interpretations, edge cases from production incidents, assumptions buried in existing code. Agents pattern-match against generic examples that may look similar but mean something different. The code compiles, passes a quick review, then fails silently when production data hits an unconsidered edge case.
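A toy illustration of how that plays out; the billing rule below is invented for the example, which is exactly the point, since nothing in the code hints at it:

```python
"""Hypothetical domain rule an agent cannot infer from the code alone."""
from datetime import date


def billable_days(start: date, end: date) -> int:
    # Generic pattern an agent produces: calendar-day difference.
    # Compiles, passes a quick review, looks obviously correct.
    return (end - start).days


# The company's actual (made-up) rule, learned from a past billing incident:
# the start date is billable, the end date is not, and partial days round up.
# It lives in the billing team's heads, not in any file an agent can read.
print(billable_days(date(2025, 3, 1), date(2025, 3, 31)))  # 30 calendar days
```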
Context that can't be transmitted
Some tasks fail not because agents lack capability but because you can't convey what they'd need to succeed.
Research suggests 70-90% of organizational knowledge is tacit. Things like:
- Why the database schema has that weird constraint
- Which team owns that service and how they handle change requests
- The historical reason certain code paths exist despite looking dead
- Performance quirks learned from watching production
- Integration points that break under conditions nobody documented
No CLAUDE.md file captures knowledge that was never written down. If a task depends on tacit knowledge, the developer who holds it either does the work themselves or first extracts and documents that knowledge, which often takes longer than just doing the task.
Organizational dynamics also affect code in ways agents can't see. A refactoring might improve code quality while violating an unspoken agreement between teams about ownership. A dependency upgrade might conflict with another team's release schedule. These considerations exist outside the codebase entirely.
And judgment under uncertainty requires weighing incomplete information against your specific risk tolerance. Quick patch or address the underlying issue? Depends on deadlines, team capacity, technical debt appetite, and business priorities. Agents can't evaluate those tradeoffs even if you could somehow convey them.
When the overhead isn't worth it
Task duration is a strong predictor of agent success:
| Task duration | Agent success probability |
|---|---|
| Under 4 minutes | ~100% |
| 15 minutes | ~70% |
| 1 hour | ~50% |
| 4+ hours | Under 10% |
But duration alone doesn't tell the whole story. The real calculation (sketched numerically after this list) includes:
- Interaction overhead: composing prompts, preparing context, reviewing output
- Failure recovery: diagnosing wrong output, debugging agent mistakes, starting over
- Verification burden: security review, testing, validation you wouldn't need for code you wrote yourself
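A back-of-the-envelope version of that calculation, using the success rates from the table and overhead numbers that are purely illustrative assumptions:

```python
"""Rough break-even estimate for delegating a task to an agent.
The 10-minute overhead and verification figures are assumptions; the
success probabilities come from the task-duration table above."""


def expected_agent_minutes(manual_minutes: float, p_success: float,
                           overhead: float = 10, verify: float = 10) -> float:
    # overhead: composing prompts and preparing context
    # verify: review and testing you wouldn't need for your own code
    # on failure: you pay the overhead and verification, then do it manually
    success_path = overhead + verify
    failure_path = overhead + verify + manual_minutes
    return p_success * success_path + (1 - p_success) * failure_path


for manual, p in [(4, 1.0), (15, 0.7), (60, 0.5), (240, 0.1)]:
    agent = expected_agent_minutes(manual, p)
    print(f"{manual:>3}-minute task: agent ≈ {agent:.0f} min vs {manual} min manual")
```

Under those assumed overheads, delegation doesn't pay off until somewhere around the one-hour mark, and even there the margin is thin.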
For experienced developers in familiar codebases, traditional development often wins on tasks up to 30-60 minutes of actual work. The METR study made this explicit: developers with an average of five years on their projects were 19% slower with AI assistance. Directing and verifying exceeded the implementation time saved.
Different situations shift this:
- Unfamiliar domains: agents help explore APIs and translate concepts
- Repetitive patterns: agents apply the same transformation across many files well
- Documentation and tests: agents draft faster than humans write from scratch
- Learning contexts: agents can explain code faster than you can read it
Hybrid approaches
The choice isn't "use agents" versus "don't use agents." Effective practitioners apply agents to the parts of tasks where they help and retain direct control over the parts where they don't.
Scaffold with agents, refine manually. Let agents generate structure, boilerplate, and repetitive patterns. Hand-edit the results for domain-specific logic, performance, and security. Agent speed on mechanical work, human judgment where it matters.
Human architecture, agent implementation. Make structural decisions yourself: file organization, function signatures, data models, API contracts. Let agents fill in implementations within those constraints. The agent works within a well-defined scope rather than making decisions that ripple across the system.
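A sketch of what that boundary can look like; the rate limiter and every name in it are hypothetical. The human fixes the data model, the signature, and the semantics in the docstring; the agent is asked to replace only the TODO body, and the type checker and tests hold it to that contract:

```python
"""Human-owned contract, agent-filled implementation (illustrative names)."""
from dataclasses import dataclass


@dataclass(frozen=True)
class RateLimitDecision:
    allowed: bool
    retry_after_seconds: int


def check_rate_limit(user_id: str, recent_request_times: list[float],
                     now: float, limit: int,
                     window_seconds: float) -> RateLimitDecision:
    """Allow at most `limit` requests per sliding `window_seconds` window."""
    # TODO: agent fills in this body. It may not change the signature, the
    # return type, or the sliding-window semantics stated in the docstring.
    raise NotImplementedError
```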
Agent draft, human review. For documentation, test generation, and comments, let agents produce a first pass. Review for accuracy and alignment with your actual understanding. Faster than starting from scratch while ensuring the output reflects reality.
Parallel validation. Run agent-generated code through tests, linting, and static analysis before human review. Automated checks catch many errors before they consume your attention. Reserve human judgment for semantic correctness that automation can't assess.
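A minimal version of that gate is a script that runs the project's existing checks and refuses to hand the change to a human until they pass. The tools named below (pytest, ruff, mypy) are examples; the right list is whatever your project already runs in CI:

```python
"""Pre-review gate for agent-generated changes: run the cheap automated
checks first, reserve human attention for semantics."""
import subprocess
import sys

CHECKS = [
    ["pytest", "-q"],        # behavior: does the test suite still pass?
    ["ruff", "check", "."],  # lint: style issues and obvious bug patterns
    ["mypy", "."],           # types: contract violations across files
]


def run_checks() -> int:
    failed = []
    for cmd in CHECKS:
        print(f"$ {' '.join(cmd)}")
        if subprocess.run(cmd).returncode != 0:
            failed.append(cmd[0])
    if failed:
        print(f"Blocked before human review: {', '.join(failed)} failed")
        return 1
    print("Automated checks passed; ready for human review of semantics")
    return 0


if __name__ == "__main__":
    sys.exit(run_checks())
```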
What the experts say
Industry voices converge on a consistent position: AI coding tools are copilots, not autopilots.
Simon Willison captures it well:
"Think of LLMs as an over-confident pair programming assistant who's lightning fast at looking things up, can churn out relevant examples at a moment's notice and can execute on tedious tasks without complaint. Over-confident is important—they'll absolutely make mistakes, sometimes subtle, sometimes huge."
GitHub's official position states that Copilot "is not intended to replace developers, who should continue to apply the same sorts of safeguards and diligence they would apply with regard to any third-party code of unknown origin."
Late 2025 research found that "while experienced developers value agents as a productivity boost, they retain their agency in software design and implementation out of insistence on fundamental software quality attributes."
Even Andrej Karpathy, who coined "vibe coding," acknowledged its limits. His Nanochat project was "basically entirely hand-written" because "I tried to use Claude/Codex agents a few times but they just didn't work well enough at all."
The consensus isn't anti-AI. It's against giving up judgment to systems that can't exercise it.
Staying in control
The goal of ASD isn't maximum agent usage. It's effective software development that uses agents where they help.
You decide what gets built. Agents may suggest approaches or explore possibilities. The direction to pursue remains your call. Agents lack the context to understand business requirements, user needs, and organizational constraints that determine what should exist.
You decide how it gets built. Architecture, data models, API design, system decomposition—these require judgment about tradeoffs agents can't evaluate. Agents implement within structures you define.
You decide when it's done. Agents declare success when code compiles. Whether that code solves the actual problem, handles edge cases, performs well, and meets security needs requires human judgment.
You stay accountable. Your name goes on the commit. Your team owns the system. You answer when something breaks. The developer who can't explain their code because "the AI wrote it" isn't a developer. They're a conduit for technical debt.
The following pages cover how to build workflows that capture agent benefits while maintaining the oversight this page establishes. The balance isn't about limiting AI. It's about judgment—knowing where AI helps and where it doesn't.