Applied Intelligence
Module 8: Code Review and Testing

Test Quality and Coverage with AI

Beyond coverage: measuring test quality

Page 7 established that agents generate tests quickly. This page addresses the harder question: are those tests any good?

Coverage measures execution. Quality measures fault detection. These are not the same thing. A test suite can execute every line of code while catching zero bugs. One study found AI-generated tests achieving 100% line coverage but only a 4% mutation score: the tests ran all the code yet detected almost none of the faults the researchers deliberately introduced. That's a sobering number.

AI versus human test quality

Qodo's 2025 developer survey found 60% of developers report AI missing context in test generation. The symptom: tautological tests that validate implementation rather than intent. The agent writes code and tests together, and both embody the same understanding or misunderstanding. When the implementation is wrong, the tests pass anyway because they verify what the code does, not what it should do.
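A minimal sketch of that failure mode, using a hypothetical applyDiscount function and Vitest-style assertions (the names, the bug, and the framework choice are illustrative assumptions, not from the module's codebase): the implementation contains a bug, a tautological test that mirrors the implementation's formula still passes, and a test written against the business intent fails.

import { describe, it, expect } from "vitest";

// Hypothetical implementation with a bug: divides by 10 instead of 100,
// so a "10% discount" actually removes the entire price.
function applyDiscount(price: number, percent: number): number {
  return price - (price * percent) / 10;
}

describe("applyDiscount", () => {
  // Tautological: the expected value is derived from the same broken formula,
  // so the test passes and achieves full coverage without validating intent.
  it("applies a discount", () => {
    const price = 100;
    const percent = 10;
    expect(applyDiscount(price, percent)).toBe(price - (price * percent) / 10);
  });

  // Intent-based: encodes what the business rule should produce.
  // This test fails against the buggy implementation above.
  it("takes 10% off a 100-unit price", () => {
    expect(applyDiscount(100, 10)).toBe(90);
  });
});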

The 70% problem from earlier modules applies here too. AI generates 70% of the test structure rapidly: setup, teardown, basic assertions. The remaining 30% (edge cases, security scenarios, business rule validation) requires as much human effort as ever. Addy Osmani's observation about AI code applies equally to AI tests: the hard parts remain hard.

Human testers think differently than agents. A human asks "how could this break?" and "what would a malicious user try?" An agent asks "what inputs match this function signature?" One produces tests that find bugs. The other produces tests that exercise code.

Specialized tools: the Diffblue approach

General-purpose agents treat test generation as another coding task. Some tools take a fundamentally different approach.

Diffblue Cover uses reinforcement learning rather than language models. It analyzes Java bytecode directly, understanding execution paths that LLMs must infer from source code. The result is deterministic: every generated test compiles and passes. LLM-based tools achieve 58-88% compilation success. Diffblue achieves 100%.

The productivity numbers are striking. Diffblue reports generating one test every two seconds, 250 times faster than manual writing. A 2025 benchmark found Diffblue covers 3,658 lines per prompt compared to 18-297 lines for general-purpose assistants. Because the tests need no debugging, the effective speed advantage reaches 20x over a year.

Goldman Sachs documented a concrete example: 3,211 tests generated overnight, doubling coverage from 36% to 72% on a 15,000-line codebase. Work that would have taken 268 developer days finished in less than 24 hours.

Diffblue only supports Java and Kotlin. General-purpose agents handle any language but require more human oversight. For Java-heavy enterprise environments, the specialized approach is worth evaluating.

Edge case identification

Edge cases separate adequate tests from effective tests. Agents find some edge cases naturally: null inputs, empty collections, boundary values. Others require prompting.

Explicit edge case requests improve results:

Generate tests for processPayment() including:
- Zero and negative amounts
- Amounts exceeding account balance
- Concurrent payment attempts
- Network timeout during authorization
- Invalid currency codes
- Amounts at integer overflow boundaries

Without this guidance, agents test happy paths and obvious nulls. Production failures come from scenarios nobody explicitly requested.
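A sketch of what tests for a few of those requested cases might look like, assuming a hypothetical processPayment that takes a payment request object and rejects invalid input; the module path, signature, and error behavior are illustrative assumptions, not the course's codebase.

import { describe, it, expect } from "vitest";
// Hypothetical module and signature, used only to illustrate the requested edge cases.
import { processPayment } from "../src/payments/processPayment";

describe("processPayment edge cases", () => {
  it("rejects zero and negative amounts", async () => {
    await expect(processPayment({ amount: 0, currency: "USD", accountId: "acct_test" })).rejects.toThrow();
    await expect(processPayment({ amount: -50, currency: "USD", accountId: "acct_test" })).rejects.toThrow();
  });

  it("rejects amounts exceeding the account balance", async () => {
    // Assumes acct_test is seeded with a 100.00 balance in the test fixture.
    await expect(processPayment({ amount: 500, currency: "USD", accountId: "acct_test" })).rejects.toThrow();
  });

  it("rejects unknown currency codes", async () => {
    await expect(processPayment({ amount: 25, currency: "XXX", accountId: "acct_test" })).rejects.toThrow();
  });

  it("rejects amounts at integer overflow boundaries", async () => {
    await expect(processPayment({ amount: Number.MAX_SAFE_INTEGER + 1, currency: "USD", accountId: "acct_test" })).rejects.toThrow();
  });
});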

AI can assist edge case discovery through code analysis. Claude Code can examine a function and suggest edge cases:

Review src/auth/validateToken.ts and identify edge cases
not covered by existing tests in __tests__/validateToken.test.ts.

The agent compares implementation branches against test scenarios. Missing paths become testing suggestions. This reverses the typical workflow: instead of humans identifying what to test, AI surfaces the gaps and humans judge which ones matter.
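To make the branch-versus-test comparison concrete, here is a hypothetical shape for validateToken annotated with the kinds of paths a gap analysis tends to surface. The real file's contents are not shown in this module, so treat this purely as an illustration.

// Hypothetical sketch of src/auth/validateToken.ts; the actual file will differ.
export function validateToken(token: string, now: Date = new Date()): boolean {
  if (!token) return false;                              // usually tested: empty input

  const parts = token.split(".");
  if (parts.length !== 3) return false;                  // often untested: malformed structure

  try {
    const payload = JSON.parse(Buffer.from(parts[1], "base64url").toString());
    if (typeof payload.exp !== "number") return false;   // often untested: missing claim
    return payload.exp * 1000 > now.getTime();           // often untested: expiry boundary
  } catch {
    return false;                                        // often untested: undecodable payload
  }
}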

Codex goes further with its execution capability. After generating tests, Codex runs them. Failed assertions reveal edge cases the initial generation missed. Codex then adds tests for discovered scenarios, iterating toward better coverage.

Mutation testing with AI

Mutation testing introduces deliberate faults and verifies tests catch them. A mutation changes > to >= or removes a conditional. If tests still pass, they failed to detect the bug. Mutation score measures what percentage of mutations tests catch.
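A minimal illustration, using a hypothetical canWithdraw check and Vitest-style assertions: the mutant flips the boundary comparison, a boundary test kills it, and a coverage-only test lets it survive.

import { describe, it, expect } from "vitest";

// Original: withdrawals must leave the balance non-negative.
function canWithdraw(balance: number, amount: number): boolean {
  return balance - amount >= 0;
}

// A mutation tool would produce a variant like this (comparison weakened):
//   return balance - amount > 0;

describe("canWithdraw", () => {
  // Survives the mutant: both versions return true for these inputs.
  it("allows withdrawal when funds are ample", () => {
    expect(canWithdraw(100, 40)).toBe(true);
  });

  // Kills the mutant: the original returns true at the exact boundary,
  // the mutated version returns false, so this assertion fails against it.
  it("allows withdrawing the full balance", () => {
    expect(canWithdraw(100, 100)).toBe(true);
  });
});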

Traditional mutation testing is computationally expensive. Running every mutant against the full test suite produces a combinatorial explosion of executions. AI changes the economics.

Meta's Automated Compliance Hardener demonstrates this at scale. Applied to 10,795 Android Kotlin classes across seven platforms, it generated 9,095 mutants and 571 privacy-hardening test cases. Privacy engineers accepted 73% of the generated tests. The system detected equivalent mutants (mutations that do not change behavior) with 95% precision after preprocessing. Equivalent mutant detection is undecidable in general and out of reach for traditional methods; LLMs approximate it through pattern recognition.

Google runs mutation testing at even larger scale: 2 billion lines of code, 150 million daily test executions, 24,000 developers across 1,000 projects. Their approach: mutate only changed code during review. This bounds the combinatorial explosion while catching faults in new code.

For teams wanting to try mutation testing:

# PITest for Java
mvn test-compile org.pitest:pitest-maven:mutationCoverage

# Stryker for JavaScript/TypeScript
npx stryker run --mutate "src/payments/**/*.ts"
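Google's changed-code strategy can be approximated in a Stryker-based setup by restricting mutation to the files touched on the current branch. A rough sketch follows; the base branch name, file layout, and glob handling are assumptions to adapt to your repository.

// scripts/mutate-changed.ts -- hypothetical helper, not part of any vendor tool.
// Restricts mutation testing to source files changed relative to the base branch.
// Run with tsx or compile first.
import { execSync } from "node:child_process";

const base = process.env.BASE_BRANCH ?? "origin/main";

// List changed TypeScript source files (assumes tests end in .test.ts).
const changed = execSync(`git diff --name-only ${base}...HEAD`, { encoding: "utf8" })
  .split("\n")
  .filter((f) => f.startsWith("src/") && f.endsWith(".ts") && !f.endsWith(".test.ts"));

if (changed.length === 0) {
  console.log("No changed source files; skipping mutation testing.");
  process.exit(0);
}

// Stryker's --mutate option accepts a comma-separated list of files or globs.
execSync(`npx stryker run --mutate "${changed.join(",")}"`, { stdio: "inherit" });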

Target mutation scores by criticality:

  • 90%+ for critical systems (payments, authentication, access control)
  • 75-90% for standard production code
  • 50-75% for modules needing improvement

Low mutation scores despite high coverage indicate test quality problems. The tests execute code without validating behavior.

Self-healing test automation

UI tests break constantly. A button moves, a label changes, a class name updates. Tests fail not because behavior changed but because locators stopped matching. Teams spend more time maintaining tests than writing them.

Self-healing tools address this through AI-powered adaptation. When a test cannot find an element, the system searches for alternatives using element ID, text content, relative position, and visual appearance. It updates the locator and continues execution. Successful fixes train the system for future changes.

Vendors report maintenance reductions:

  • Mabl: up to 95% reduction in test maintenance
  • Katalon: 70% maintenance reduction with 90% reliability improvement
  • ACCELQ: 72% lower maintenance
  • Applitools: up to 75% maintenance reduction using visual AI

The 95% figure is a ceiling, not an average. Mabl documents typical progression: 20-30% autonomous handling early in deployment, growing to 70-80% after months of learning. The system improves as it encounters more changes and observes outcomes.

Self-healing works through multiple techniques:

Locator fallback chains: When primary selectors fail, systems try secondary identifiers captured during test creation, such as data attributes, aria labels, and relative positions (see the sketch after this list of techniques).

Visual AI: Applitools trains on 4 billion app screens. When DOM changes, visual recognition identifies elements by appearance rather than structure.

Semantic understanding: NLP interprets element meaning. If a button labeled "Submit" becomes "Send," semantic analysis recognizes equivalent intent.

Outcome learning: Systems track which auto-healing decisions produced valid tests versus false positives. Failed healings trigger human review; successful healings inform future decisions.
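A simplified sketch of the fallback-chain idea using Playwright primitives; real self-healing platforms layer learning and visual matching on top of this, and the selectors, URL, and element names here are illustrative assumptions.

import { chromium, type Page, type Locator } from "playwright";

// Candidate selectors captured at test-creation time, ordered by expected stability:
// data attribute first, then aria label, then visible text.
const SUBMIT_CANDIDATES = ['[data-test="submit"]', '[aria-label="Submit"]', 'text="Submit"'];

// Try each candidate until one resolves to exactly one element.
async function resolveWithFallback(page: Page, candidates: string[]): Promise<Locator> {
  for (const selector of candidates) {
    const locator = page.locator(selector);
    if ((await locator.count()) === 1) {
      // A real self-healing tool would record which fallback worked
      // and promote it for future runs.
      return locator;
    }
  }
  throw new Error(`No candidate selector matched: ${candidates.join(", ")}`);
}

async function main() {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com/checkout"); // illustrative URL
  const submit = await resolveWithFallback(page, SUBMIT_CANDIDATES);
  await submit.click();
  await browser.close();
}

main();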

A comparison of tools in this space:

Vendor        Approach                 Differentiator
Mabl          Adaptive AI              Learns from deployment history
ACCELQ        Codeless platform        Unified web, mobile, API, database
Katalon       ML confidence scoring    Element stability prediction
Testsigma     NLP test definition      Plain English test creation
Applitools    Visual AI                Trained on 4B screens
Sauce Labs    AutonomIQ acquisition    SaaS application focus

CI/CD integration for test quality

Quality gates can extend beyond pass/fail. AI-powered pipelines incorporate test quality metrics.

Configure mutation score thresholds:

# GitHub Actions mutation testing gate
- name: Mutation testing
  run: npx stryker run

- name: Check mutation score
  run: |
    score=$(cat reports/mutation/mutation-score.json | jq '.score')
    if (( $(echo "$score < 75" | bc -l) )); then
      echo "Mutation score $score% below 75% threshold"
      exit 1
    fi

Self-healing platforms integrate with CI through APIs:

# Mabl integration
- name: Run Mabl tests
  uses: mablhq/github-run-tests-action@v1
  with:
    application-id: ${{ secrets.MABL_APP_ID }}
    environment-id: ${{ secrets.MABL_ENV_ID }}

Modern pipelines add AI-powered diagnosis. When tests fail, LLM agents analyze logs, identify root causes, and suggest fixes. Sauce Labs reports AI identifies root causes "almost 100 times faster than manual investigation."

Automated quality checks catch mechanical issues, freeing humans for judgment calls about what quality means for their specific context.

Practical test quality workflow

Combining these techniques creates a layered approach:

  1. Generate: Use agents for initial test creation with explicit edge case requests
  2. Execute: Run tests in sandboxed environments (Codex) or locally (Claude Code)
  3. Measure: Check coverage as a baseline, mutation score as quality indicator
  4. Maintain: Deploy self-healing for UI tests, reducing locator maintenance
  5. Gate: Configure CI to enforce quality thresholds before merge
  6. Review: Human validation for business logic and security-sensitive tests

AI accelerates each layer. Humans ensure the layers serve actual quality goals rather than metric optimization. The combination produces tests that execute quickly, maintain themselves, and catch bugs that matter.
