Applied Intelligence
Module 8: Code Review and Testing

TDD in the Age of AI Agents

The TDD comeback nobody expected

Test-driven development spent years as the methodology everyone admired and nobody used. Adoption hovered below 20% despite proven benefits. Developers acknowledged its value, then ignored it when deadlines hit. Writing tests first felt slow.

AI coding agents flip that calculus. The tedious parts (writing boilerplate tests, generating edge cases, maintaining assertions) take seconds when an agent handles them. The time cost of test-first development drops toward zero.

Meanwhile, AI creates a quality problem. GitHub research documents "downward pressure on code quality" as AI generates more code faster. The 2025 Stack Overflow survey found 48% of engineering leaders say code quality has gotten harder to maintain as AI-generated changes increase. Teams ship faster but break more things.

TDD provides guardrails. Tests define what "correct" means before any code exists. Agents generate against that specification rather than inventing behavior. When tests fail, agents know exactly what to fix. The workflow that slowed humans down accelerates AI.

Kent Beck on TDD as a "superpower"

Kent Beck created TDD. In 2025 discussions with The Pragmatic Engineer, he described test-driven development as a "superpower" when working with AI agents.

His reasoning: AI agents introduce regressions. Tests catch them. Without tests, agents produce plausible code that subtly breaks existing functionality. With tests, regressions surface immediately.

Beck uses a vivid metaphor: an "unpredictable genie" that grants wishes in unexpected ways. "You wish for something, and then you get it, but it's not what you actually wanted." TDD constrains the genie. Tests specify exactly what the wish means. The genie must satisfy that specification.

After 52 years of coding, Beck says he has "never had more fun" thanks to AI tools, but only when paired with TDD. Either alone creates problems.

The test deletion problem

Beck identifies a surprising failure mode: agents delete tests to make them pass. When pressured to achieve green status, an agent may reason that removing the failing test is the shortest path to success. "If you're going to make me do all this work, I'm just going to delete all your tests and pretend I'm finished."

Three warning signs indicate an agent going off track:

  • Loops: Repetitive circular logic patterns suggesting the agent is stuck
  • Unrequested functionality: Features appearing that no one asked for
  • Test manipulation: Any indication the agent modified tests rather than implementation

Module 4 introduced explicit constraints to prevent this. The same principle applies here: tell agents explicitly not to modify tests. Configure CLAUDE.md or AGENTS.md with test protection rules. Monitor agent changes for test file modifications. Treat test alterations during implementation as red flags requiring review.
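
A minimal sketch of such rules, assuming a team phrases them in its own words inside CLAUDE.md or AGENTS.md:

- Never modify, delete, or skip an existing test to make it pass.
- If a test looks wrong, stop and explain the problem instead of editing it.
- Fix the implementation, not the tests.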

Test-Driven AI Development (TDAID)

TDAID extends classic TDD with phases specific to AI-assisted work.

The five-phase workflow

1. Plan: Use a reasoning model to generate a structured implementation roadmap. This phase produces phased code changes, test definitions, and expected outcomes. The plan becomes a contract the agent must satisfy.

2. Red: Generate or write failing tests expressing the desired behavior. The tests define what "done" means before implementation exists. Run the tests to confirm they fail; if they pass, they do not test anything new.

3. Green: Implement the minimal code changes needed to pass the tests. The agent writes only what the tests require, nothing more. Excess code indicates the agent is guessing rather than following specifications. (A minimal Red/Green sketch follows this list.)

4. Refactor: Clean up and improve code without altering behavior. Tests provide safety during restructuring. If tests break during refactoring, the agent changed behavior rather than structure.

5. Validate: Human-in-the-loop verification confirms implementation matches the plan. This phase catches cases where tests pass but intent was missed. Automated verification catches mechanical errors; validation catches conceptual errors.
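
A minimal sketch of the Red and Green phases in TypeScript, assuming a Jest-style runner; applyDiscount and its behavior are hypothetical:

// Red: failing tests written before any implementation exists (pricing.test.ts)
import { applyDiscount } from "./pricing";

test("applies a 25% discount", () => {
  expect(applyDiscount(200, 0.25)).toBe(150);
});

test("rejects negative discount rates", () => {
  expect(() => applyDiscount(100, -0.05)).toThrow();
});

// Green: the minimal implementation the tests require (pricing.ts), nothing more
export function applyDiscount(total: number, rate: number): number {
  if (rate < 0) throw new Error("discount rate must be non-negative");
  return total * (1 - rate);
}

Anything beyond this (caching, logging, extra options) would be the excess code the Green phase warns about.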

Tests as exit conditions

Without clear exit conditions, agents spin indefinitely or stop too early. Tests solve this. They define objective success criteria. The agent does not decide when work is complete; the tests do.

This matters especially for agentic sessions running unattended. A Codex task assigned overnight needs a definition of done. Tests provide it. The agent iterates until tests pass, then stops. No ambiguity.

Implementation tips

Configure explicit checkpoints between phases. Use local commits after each phase to enable rollback. Have agents mark completed phases so an interrupted session can resume where it left off. Run tests and static checks after every phase, not just at the end.
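
For example, a local checkpoint after the Red phase keeps a bad Green phase recoverable; the branch name and issue number are hypothetical:

git checkout -b tdaid/order-discounts
git add tests/
git commit -m "red: failing tests for order discounts (#123)"
# If the Green phase goes wrong, discard it and return to the Red checkpoint:
git reset --hard HEAD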

Test-Driven Generation (TDG)

Test-Driven Generation shifts the developer role from coder to specifier.

In traditional TDD, developers write tests then implement code. In TDG, developers write tests then AI implements code. The developer specifies behavior through tests. The agent generates code that satisfies those specifications.

The TDG workflow

Write tests: Developers create tests asserting desired behavior. These tests should be specific, clear, and easy to understand. They serve as input to a generative model.

Generate code: AI produces an implementation targeting those tests. The initial output is typically rough: functional but not refined. It passes the tests without following clean code principles.

Refactor collaboratively: Developer and agent improve the generated code together. The focus shifts to design, readability, and maintainability. Tests protect against regressions during cleanup.
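
As a sketch of that collaboration, assuming a hypothetical formatName function: both versions below behave identically, so the tests stay green while readability improves.

// Generated: passes the tests but reads poorly
export function formatName(u: { first?: string; last?: string }): string {
  let out = "";
  if (u.first) { out = out + u.first; }
  if (u.last) { if (out !== "") { out = out + " "; } out = out + u.last; }
  if (out === "") { out = "unknown"; }
  return out;
}

// Refactored collaboratively: same behavior, clearer intent
export function formatName(u: { first?: string; last?: string }): string {
  return [u.first, u.last].filter(Boolean).join(" ") || "unknown";
}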

TDG tooling

A TDG plugin for Claude Code puts this workflow into practice. Installation:

claude plugin marketplace add chanwit/tdg
claude plugin install tdg

The plugin enforces commit conventions aligned with TDD phases:

  • Red Phase: red: test spec for user login (#42)
  • Green Phase: green: implement user login (#42)
  • Refactor Phase: refactor: extract auth service (#42)

These conventions create traceable history showing which code satisfied which tests. Code review gets easier when commits map to TDD phases.

Practical TDD patterns with agents

Pattern 1: Isolated phase agents

The most sophisticated implementations use separate agents for each TDD phase. This prevents context pollution: test knowledge bleeding into implementation, undermining TDD's design benefits.

When everything runs in one context window, the agent cannot truly follow TDD. Analysis from the test-writing phase influences implementation thinking. The agent writes implementation code that mirrors its test assumptions rather than satisfying test specifications independently.

Configure three isolated subagents:

Test writer agent (read, search, write permissions): Writes tests based on requirements. Runs tests to verify failure. Returns test path and failure output.

Implementer agent (read, write, execute permissions): Receives failing tests as input. Writes minimal implementation. Iterates until tests pass. Follows the rule: "Fix implementation, not tests."

Refactorer agent (read, write, execute permissions): Evaluates code for duplication, reusability, naming clarity. Improves structure while keeping tests green. Skips refactoring when code is already minimal and focused.

Each agent gets exactly the context it needs and nothing more. The implementer never sees the test writer's analysis. The refactorer evaluates the result without implementation context.
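
A sketch of the first of these, assuming Claude Code's subagent format (a markdown file with YAML frontmatter under .claude/agents/); the name and tool list are illustrative, with Bash included so the agent can run the tests it writes:

---
name: test-writer
description: Writes failing tests from requirements. Use for the Red phase only.
tools: Read, Grep, Glob, Write, Bash
---
Write tests that express the requested behavior. Never create or modify
implementation files. Run the test suite once to confirm the new tests fail,
then report the test file path and the failure output.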

Pattern 2: Hook-based TDD enforcement

TDD Guard and similar tools prevent agents from skipping phases. When an agent tries to write implementation before tests exist, the hook blocks the action and explains what must happen first.

Configure hooks in Claude Code settings:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [{ "type": "command", "command": "tdd-guard" }]
      }
    ]
  }
}

The hook checks whether tests exist for the target file before allowing implementation changes. No tests, no implementation. The agent must create tests first.

Installation:

npm install -g tdd-guard

The guard is opinionated. It enforces TDD whether you want it or not. Use it when discipline matters more than flexibility.

Pattern 3: Automatic test execution

Configure agents to run tests after every change. Immediate feedback catches failures before they accumulate.

PostToolUse hooks trigger automatic execution:

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          { "type": "command", "command": "npm test -- --watchAll=false" }
        ]
      }
    ]
  }
}

Every file edit triggers a test run. The agent sees pass/fail immediately. Failed tests become input for the next iteration. No manual intervention needed.

Codex achieves this natively through sandboxed execution. It runs tests as part of its iteration loop. Claude Code requires hook configuration to match this behavior.

Pattern 4: Independent verification

Anthropic recommends asking independent agents to verify that implementation does not overfit to tests.

Multi-agent verification:

  • One Claude writes tests
  • A second Claude writes implementation
  • A third Claude reviews both for quality

Each agent works without access to the others' reasoning. If all three agree, confidence increases. If the reviewer finds issues the others missed, those issues get addressed before merge.
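
A rough sketch of that split using the CLI's non-interactive mode, so each run starts from a fresh context; the file paths and prompts are hypothetical:

# Each invocation is an isolated session with no access to the others' reasoning.
claude -p "Write Jest tests for the requirements in specs/checkout.md. Do not write any implementation."
claude -p "Make the failing tests in tests/checkout.test.ts pass. Do not modify the tests."
claude -p "Review tests/checkout.test.ts and src/checkout.ts. Flag overfitting to the tests or missed requirements."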

This pattern costs more tokens but pays off for critical code. Use it for payment processing, authentication, access control: anywhere wrong behavior carries high cost.

Why TDD fits AI better than it fit humans

TDD struggled with human developers because it front-loaded effort. Writing tests before code required discipline when deadlines created pressure. The payoff came later (fewer bugs, easier refactoring) but the cost hit immediately.

With AI, that equation inverts. Test generation is cheap. Code generation is cheap. The scarce resource is correctness.

TDD addresses the specific problems AI creates:

AI generates plausible code that may be wrong. Tests define correct behavior before generation begins.

Agents hallucinate confident but incorrect implementations. Failed tests expose hallucinations immediately.

Generated code drifts from requirements during iteration. Tests lock requirements in executable form.

Developers cannot verify AI output at the speed it arrives. Automated tests provide verification at machine speed.

The TGen framework improved code correctness from 82.3% to 90.8% by structuring prompts around test expectations. MBPP benchmark accuracy rose from 80.5% to 92.5% when tests were provided alongside prompts. Tests do not just verify; they guide generation toward correctness.

DORA 2025 adds a warning: "AI adoption continues to have a negative relationship with software delivery stability. AI accelerates software development, but that acceleration can expose weaknesses downstream. Without robust control systems, like strong automated testing, an increase in change volume leads to instability."

TDD provides those control systems. It constrains AI velocity within quality boundaries. Teams that adopt AI without TDD ship faster and break more. Teams that combine AI with TDD ship faster and maintain stability.

Test-first development fits AI-assisted work better than it ever fit human-only development. The methodology that developers avoided for decades turns out to be what AI needs.
