Applied Intelligence
Module 8: Code Review and Testing

AI-Assisted Test Generation Fundamentals

The test generation opportunity

Pages 1 through 6 covered reviewing code that agents wrote. This page shifts focus to a task where agents actually earn their keep: generating tests.

Test writing is tedious. Developers know they should write more tests. Most do not. The 2025 Qodo survey found that test coverage ranks near documentation in developer enthusiasm, which is to say near the bottom. This creates an opening. Agents handle tedious, repetitive work without complaint. Test generation fits that profile.

Diffblue, which uses reinforcement learning rather than LLMs for test generation, reports 250x faster test creation than manual writing for Java projects. LLM-based approaches show similar acceleration. Teams using Claude Code for test generation report completing coverage tasks in hours rather than days. The question is not whether agents can generate tests quickly; they can. The question is whether those tests catch bugs or just inflate coverage numbers.

How agents approach test generation

Understanding the mechanism helps set expectations.

Language models generate tests by pattern-matching against training data. They have seen millions of test files across thousands of frameworks. When asked to test a function, the model produces code that resembles tests it has encountered for similar functions.

For common patterns, this works. A sorting function gets tests with empty arrays, single elements, duplicates, and reverse-sorted inputs. A string parser gets tests for null, empty strings, special characters, and valid inputs. The model recognizes the shape of the problem and generates appropriate test shapes.
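
A sketch of the shape those tests typically take, assuming Vitest and a hypothetical sortNumbers helper:

import { describe, it, expect } from 'vitest';
import { sortNumbers } from './sortNumbers'; // hypothetical helper

describe('sortNumbers', () => {
  it('returns an empty array unchanged', () => {
    expect(sortNumbers([])).toEqual([]);
  });

  it('handles a single element', () => {
    expect(sortNumbers([3])).toEqual([3]);
  });

  it('keeps duplicates', () => {
    expect(sortNumbers([2, 1, 2])).toEqual([1, 2, 2]);
  });

  it('sorts reverse-sorted input', () => {
    expect(sortNumbers([5, 4, 3, 2, 1])).toEqual([1, 2, 3, 4, 5]);
  });
});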

For novel patterns, it fails. Proprietary business logic, domain-specific rules, and internal conventions do not exist in training data. The model guesses. Those guesses may look plausible while testing nothing meaningful.

Claude Code test generation patterns

Claude Code generates tests through natural language requests within a session.

The basic pattern:

Write unit tests for src/utils/validation.ts

This produces tests using whatever framework Claude Code infers from the project. Existing test files influence the choice. A project with Jest tests gets Jest tests. A project with Vitest gets Vitest.

Better results come from specific prompts:

Write unit tests for the validateEmail function in src/utils/validation.ts.
Use Vitest with describe/it blocks.
Cover: valid emails, missing @, missing domain, empty string, null input.

Specificity reduces guessing. The model knows what framework to use, what function to test, and what cases matter.
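
The output of a prompt like that tends to map one test per named case. A sketch, assuming validateEmail returns a boolean:

import { describe, it, expect } from 'vitest';
import { validateEmail } from '../src/utils/validation';

describe('validateEmail', () => {
  it('accepts a well-formed address', () => {
    expect(validateEmail('user@example.com')).toBe(true);
  });

  it('rejects an address missing the @', () => {
    expect(validateEmail('user.example.com')).toBe(false);
  });

  it('rejects an address missing the domain', () => {
    expect(validateEmail('user@')).toBe(false);
  });

  it('rejects an empty string', () => {
    expect(validateEmail('')).toBe(false);
  });

  it('rejects null input', () => {
    expect(validateEmail(null as unknown as string)).toBe(false);
  });
});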

Custom test commands

For repeated use, create a custom slash command. Add a file at .claude/commands/test.md:

Generate unit tests for the file at $ARGUMENTS.

Requirements:
- Use the testing framework in package.json (Jest or Vitest)
- Follow existing test patterns in __tests__ directories
- Include tests for: happy path, edge cases, error conditions
- Use descriptive test names that explain expected behavior
- Do not mock the function under test

Usage: /test src/utils/validation.ts

The command provides consistent instructions across sessions. Every test generation request follows the same standards.

TDD workflow support

Anthropic recommends test-driven development as a Claude Code workflow. The approach:

  1. Ask Claude to write tests based on expected behavior
  2. Verify tests fail (implementation does not exist yet)
  3. Ask Claude to implement the code
  4. Instruct Claude not to modify the tests
  5. Iterate until tests pass

The explicit instruction to avoid modifying tests is critical. Without it, agents optimize for passing tests by changing test assertions rather than fixing implementation. Kent Beck, in 2025 discussions about AI and TDD, noted that "agents actively try to delete tests to make them pass." Explicit constraints prevent this.
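
For step 1, the request describes behavior rather than pointing at code. A sketch of the tests it might produce for a hypothetical formatCurrency helper, written before any implementation exists:

import { describe, it, expect } from 'vitest';
// formatCurrency does not exist yet; this suite is expected to fail on the first run.
import { formatCurrency } from '../src/utils/formatCurrency';

describe('formatCurrency', () => {
  it('formats whole dollars with two decimal places', () => {
    expect(formatCurrency(5)).toBe('$5.00');
  });

  it('rounds fractional cents', () => {
    expect(formatCurrency(5.129)).toBe('$5.13');
  });

  it('throws on negative amounts', () => {
    expect(() => formatCurrency(-1)).toThrow();
  });
});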

Codex test generation

Codex approaches test generation differently. It runs in a sandboxed environment where it can execute tests.

Assign a task:

Add tests for the authentication service.
Target 80% coverage on src/auth/*.ts.

Codex reads the code, generates tests, runs them, and iterates. If tests fail, it examines the output and adjusts. This loop continues until tests pass or the agent determines it cannot proceed.

The difference from Claude Code: Codex does not just generate test code; it verifies that tests compile and execute. Failed assertions trigger debugging. Import errors trigger corrections. The feedback loop catches issues that static generation misses.

Configure test expectations in AGENTS.md:

## Testing standards

- Run tests with `pnpm test`
- Aim for 80% line coverage minimum
- Tests must pass before task completion
- Use @testing-library/react for component tests
- Mock external services, not internal modules

Codex checks these guidelines when generating and validating tests.

GitHub Copilot test generation

Copilot's /tests command generates tests for selected code. Highlight a function, type /tests, and Copilot produces test cases.

The context window matters. Copilot examines open files in adjacent tabs. If test files are open, Copilot mimics their patterns. If no test files are open, Copilot makes assumptions that may not match project conventions.

The 2025 Copilot Testing for .NET feature added deeper integration:

@Test #target

Where #target can be a member, class, file, project, or #changes for git diff. When tests fail, Copilot attempts fixes, regenerates, and reruns automatically.

Best practices for Copilot test generation:

  • Open existing test files in adjacent tabs before generating
  • Provide explicit framework and pattern instructions in prompts
  • Start with focused requests (one function) rather than entire files
  • Review generated tests for mock overuse

The 80% coverage target

Research consistently points to 80% test coverage as the practical optimum.

Capgemini studies show projects with over 80% coverage have 30% lower bug density than those below 50%. The gains between 80% and 100% coverage are much smaller. The last 20% typically covers trivial code (getters, setters, simple passthrough functions) that rarely contains bugs worth catching.

The economics favor stopping at 80%. Achieving 70-80% coverage requires reasonable effort. The final 10-20% requires disproportionate work. One practitioner summary: "the last 10% takes 90% of the time."

Exceptions exist. Critical systems in healthcare, finance, and aerospace may require higher targets. But for most enterprise software, 80% provides the best return on testing investment.

Configure agents with this target:

## Coverage requirements

- Target 80% line coverage for business logic
- Do not chase 100% coverage on utility code
- Focus testing effort on: payment processing, authentication, data transformation
- Skip coverage on: generated code, simple DTOs, passthrough wrappers

Explicit focus areas direct agent effort toward tests that matter.
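
The target can also be enforced by the test runner itself, so agent-generated suites are checked on every run. A minimal sketch, assuming Vitest 1.x or later with the v8 coverage provider (paths and numbers are illustrative):

// vitest.config.ts
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    coverage: {
      provider: 'v8',
      include: ['src/**/*.ts'],
      // Fail the run if coverage drops below the 80% target.
      thresholds: {
        lines: 80,
        branches: 80,
        functions: 80,
        statements: 80,
      },
    },
  },
});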

Prompting for quality tests

How you prompt determines whether you get useful tests or noise.

Specify what to test, not just what code

Poor prompt:

Write tests for UserService.ts

Better prompt:

Write tests for UserService.createUser().
Test: successful creation, duplicate email rejection, invalid password handling, database failure recovery.

The better prompt names specific scenarios. The agent tests behaviors you care about, not whatever patterns it infers.
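
One lightweight way to pin those scenarios down before generation is a skeleton of named cases for the agent to fill in. A sketch using Vitest's it.todo:

import { describe, it } from 'vitest';

describe('UserService.createUser', () => {
  it.todo('creates a user when the input is valid');
  it.todo('rejects a duplicate email address');
  it.todo('rejects an invalid password');
  it.todo('recovers when the database write fails');
});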

Provide examples when possible

Qodo research found that providing test examples increases test adoption by 15%. Show the agent what good tests look like in your project:

Write tests for validatePassword() following this pattern:

describe('validatePassword', () => {
  it('should reject passwords shorter than 8 characters', () => {
    expect(validatePassword('short')).toEqual({
      valid: false,
      reason: 'Password must be at least 8 characters'
    });
  });
});

The example demonstrates structure, assertion style, and expected return shapes. Generated tests follow the pattern.

Request edge cases explicitly

Agents excel at happy paths. Edge cases require prompting:

Include edge case tests for:
- Null and undefined inputs
- Empty strings and arrays
- Maximum length boundaries
- Special characters and Unicode
- Concurrent access scenarios

Without explicit instruction, agents generate tests that pass but miss the failure modes that matter in production.
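
A sketch of what the first few of those cases can look like, assuming a hypothetical normalizeUsername helper (concurrency scenarios usually need integration-level tests and are omitted here):

import { describe, it, expect } from 'vitest';
import { normalizeUsername } from '../src/utils/normalizeUsername'; // hypothetical

describe('normalizeUsername edge cases', () => {
  it('rejects null and undefined', () => {
    expect(() => normalizeUsername(null as unknown as string)).toThrow();
    expect(() => normalizeUsername(undefined as unknown as string)).toThrow();
  });

  it('rejects an empty string', () => {
    expect(() => normalizeUsername('')).toThrow();
  });

  it('truncates input at the 64-character boundary', () => {
    expect(normalizeUsername('a'.repeat(65))).toHaveLength(64);
  });

  it('preserves Unicode characters', () => {
    expect(normalizeUsername('ünïcødé')).toBe('ünïcødé');
  });
});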

Common test generation failures

Agent-generated tests fail in predictable ways.

Tautological tests

The agent writes both implementation and tests. Both reflect the same understanding. If the implementation is wrong, the tests pass anyway.

Example: an agent implements email validation that accepts user@ without a domain. The tests check that user@ is valid because that matches the implementation. Both are wrong, but the tests pass.
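
In code, the failure looks like this sketch: the implementation and the test encode the same mistaken rule, so the suite stays green.

import { it, expect } from 'vitest';

// Implementation the agent wrote: accepts "user@" because the regex
// never requires anything after the @.
function validateEmail(email: string): boolean {
  return /^[^@\s]+@[^\s]*$/.test(email);
}

// Test the same agent wrote from the same mistaken understanding.
// It asserts the buggy behavior, so both are wrong and the test passes.
it('accepts addresses containing an @ sign', () => {
  expect(validateEmail('user@')).toBe(true);
});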

Write tests before implementation, or have humans write tests that agents implement.

Over-mocking

Page 3 covered this anti-pattern. Agent tests mock dependencies, then assert that the mocked values were returned. The test validates the mock, not the code.
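
A sketch of the anti-pattern, assuming a hypothetical getUserDisplayName that delegates to an internal formatName module: the mock is what gets tested, not the code.

import { it, expect, vi } from 'vitest';
import { getUserDisplayName } from '../src/users/getUserDisplayName'; // hypothetical

// Anti-pattern: the internal module the code under test delegates to
// is replaced by a mock with a canned return value.
vi.mock('../src/users/formatName', () => ({
  formatName: vi.fn(() => 'Ada Lovelace'),
}));

it('returns the display name', async () => {
  // Passes even if formatName is completely broken: the assertion only
  // checks what the mock was told to return.
  expect(await getUserDisplayName('user-1')).toBe('Ada Lovelace');
});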

Explicitly instruct agents not to mock the code under test. Specify what should be mocked (external services) and what should not (internal logic).

Shallow coverage

Agents optimize for coverage metrics. A test that calls a function once achieves line coverage without testing behavior.

// Agent-generated test with 100% coverage, 0% value
it('should process order', () => {
  const result = processOrder(mockOrder);
  expect(result).toBeDefined();
});

Require specific assertions: "Each test must assert on return value properties, not just existence."
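
A stronger version of the same test asserts on the properties that encode the behavior (the field names here are illustrative):

// Reuses processOrder and mockOrder from the example above.
it('applies tax and totals the line items', () => {
  const result = processOrder(mockOrder);

  expect(result.status).toBe('confirmed');
  expect(result.lineItems).toHaveLength(mockOrder.items.length);
  expect(result.total).toBeCloseTo(107.5); // subtotal 100 plus 7.5% tax
});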

Mutation score gaps

Coverage measures execution, not fault detection. Mutation testing introduces deliberate bugs and checks whether tests catch them. Studies show AI-generated tests can achieve 100% coverage with only a 4% mutation score: the tests run all the code but catch almost no bugs.
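
A concrete picture of what a mutation tool does, assuming a hypothetical discount helper: it flips an operator and reruns the suite. A shallow test lets the mutant survive; a specific one kills it.

import { it, expect } from 'vitest';

// Original code.
function qualifiesForDiscount(total: number): boolean {
  return total >= 100;
}

// Mutant the tool would generate: >= becomes >.
// (Described here in a comment; real tools apply mutations automatically.)

// Survives the mutant: it executes the line, so coverage is 100%,
// but it passes against both versions.
it('returns a boolean', () => {
  expect(typeof qualifiesForDiscount(100)).toBe('boolean');
});

// Kills the mutant: it pins the boundary at exactly 100.
it('grants the discount at exactly 100', () => {
  expect(qualifiesForDiscount(100)).toBe(true);
});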

For critical code, consider mutation testing as a quality gate beyond coverage:

npx stryker run --mutate "src/payments/**/*.ts"

If mutation score falls below threshold, the tests need strengthening regardless of coverage percentage.

When to trust agent tests

Agent-generated tests work best for:

  • Boilerplate tests (CRUD operations, standard validations)
  • Regression tests for existing behavior
  • Expanding coverage on well-understood code
  • Generating test structure that humans refine

Agent-generated tests require scrutiny for:

  • Business logic validation
  • Security-sensitive operations
  • Complex state transitions
  • Anything where wrong behavior is expensive

The practical pattern: agents generate test structure and obvious cases, humans review and add the cases that require domain knowledge. Agents are fast. Humans catch what matters. Use both.
