Applied Intelligence
Module 9: Working with Legacy Code

Testing Strategies for Unfamiliar Code

From manual characterization to AI-assisted coverage

The previous section introduced characterization tests as a technique for capturing actual behavior before modification. Writing them manually is tedious. For each function, you write a test, run it to discover the output, update the assertion, and repeat for every input combination worth capturing. On a legacy codebase with thousands of functions, manual characterization testing is impractical.

AI agents make this feasible. They generate characterization tests at speeds no human can match. A mid-sized module that would take a developer a full day to characterize can be covered in hours with agent assistance. The agent examines the code, identifies input combinations, generates test cases, and captures actual outputs.

This is not about replacing human judgment. It removes the mechanical barrier that makes comprehensive characterization testing impractical. With agents handling test generation, developers can focus on reviewing the tests, identifying gaps, and deciding which behaviors to preserve.

AI-generated characterization test workflow

Select the target scope. Identify which functions or modules need characterization tests. Start with the code you intend to modify. Expand outward to code that calls or is called by the modification target.

Direct the agent to explore behavior. Provide a prompt that emphasizes discovery over expectation:

Generate characterization tests for the OrderProcessor class.
Do not assume what the code should do.
Run each method with various inputs and capture what actually happens.
Include:
- Normal cases from calling code patterns
- Edge cases: nulls, empty collections, boundary values
- Error conditions: invalid inputs, missing dependencies
For each test, document the discovered behavior in the test name.

Review generated tests for coverage gaps. The agent will miss cases; it cannot know which edge cases matter to your business. Fill the gaps with follow-up prompts for specific scenarios:

The characterization tests miss the case where customer.loyaltyTier is undefined.
Add tests that capture behavior when loyaltyTier is missing or null.

Establish the baseline. Run all characterization tests and verify they pass. This baseline represents current behavior, the behavior you must preserve or intentionally change. Commit these tests before making any modifications.

Name characterization tests descriptively: shouldReturn95WhenRegularCustomerOrders100, not testCalculateDiscount. The name documents what you discovered, making future changes easier to understand.
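A minimal sketch of what such a test can look like, assuming a Jest-style runner and a hypothetical calculateDiscount in ./pricing; the asserted values are placeholders for whatever the code actually returned when the tests were first run, with the 95-for-100 case taken from the name above:

import { calculateDiscount } from './pricing';

// Characterization test: the assertion pins the value the code actually
// returned, not a value from a specification.
test('shouldReturn95WhenRegularCustomerOrders100', () => {
  const customer = { loyaltyTier: 'regular' };
  const result = calculateDiscount(customer, 100);
  expect(result).toBe(95);
});

test('shouldReturnUndiscountedTotalWhenLoyaltyTierIsMissing', () => {
  const customer = {}; // the missing-loyaltyTier gap noted earlier
  const result = calculateDiscount(customer, 100);
  // Placeholder: replace 100 with the observed output, whatever it is.
  expect(result).toBe(100);
});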

Predictive impact analysis

Beyond generating tests, AI handles something humans struggle with: predicting which code changes will affect which behaviors. This predictive impact analysis helps prioritize testing effort where it matters.

Traditional impact analysis examines call graphs and data flow. A change to function A affects functions B, C, and D that call it. This analysis is mechanical and complete, but it treats all impacts equally. A payment calculation and an optional logging call both appear as dependencies.
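A rough sketch of that mechanical version, assuming a pre-built reverse call graph (callers keyed by callee); it is a plain breadth-first walk with no notion of which impacts matter:

// Reverse call graph: for each function, the functions that call it.
type ReverseCallGraph = Map<string, string[]>;

// Every function transitively reachable by asking "who calls this?"
function impactedFunctions(changed: string, callers: ReverseCallGraph): Set<string> {
  const impacted = new Set<string>();
  const queue = [changed];
  while (queue.length > 0) {
    const current = queue.shift()!;
    for (const caller of callers.get(current) ?? []) {
      if (!impacted.has(caller)) {
        impacted.add(caller); // a payment calculation and a logging call rank the same
        queue.push(caller);
      }
    }
  }
  return impacted;
}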

AI-enhanced impact analysis adds context. By understanding code semantics, agents can assess which impacts are high-risk versus low-risk:

Analyze the proposed change to calculateShipping.
Which calling functions are most at risk of behavioral changes?
Consider:
- Functions that depend on the specific return value format
- Functions that handle edge cases the modification might affect
- Business-critical paths versus utility functions
Rank the impacted functions by risk level.

The agent's ranking reflects patterns learned from training data: millions of examples of which changes broke what. It is not a perfect prediction, but it focuses human attention where it matters.

Studies on AI-powered testing found that intelligent test prioritization identifies 87% of critical defects in the first 30% of test execution time. Instead of running the entire test suite and waiting, teams run the highest-risk tests first. Critical bugs surface early.

Risk-based test selection

Not all tests matter equally when modifying unfamiliar code. Some tests exercise the code you are changing directly. Others test unrelated functionality that happens to share a test file. Running everything is comprehensive but slow.

Risk-based test selection uses AI to determine which tests are most likely to catch problems from a specific change.

The approach:

  1. Identify changed files and functions. The agent examines the diff to determine modification scope.

  2. Map tests to changed code. The agent traces which tests exercise the modified functions, either directly or through call chains.

  3. Rank tests by coverage relevance. Tests that directly exercise changed code rank highest. Tests that exercise callers of changed code rank second. Tests with no connection to changed code rank lowest.

  4. Execute in risk order. Run high-risk tests first. If they pass, continue to medium-risk tests. Low-risk tests can run in parallel or be deferred.

In practice, a developer modifying a single function might run 50 tests immediately rather than 5,000. The 50 tests cover the changed code and its direct consumers. If those pass, confidence is high that the change is safe. The remaining 4,950 tests can run in CI without blocking the developer.
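A simplified sketch of the ranking step (step 3 above), assuming you already have a per-test coverage map and the set of direct callers of the changed functions; real implementations derive both from per-test coverage data and the call graph:

type CoverageMap = Map<string, Set<string>>; // test name -> functions it exercises

function rankTests(
  coverage: CoverageMap,
  changedFns: Set<string>,
  callersOfChanged: Set<string>,
): { high: string[]; medium: string[]; low: string[] } {
  const high: string[] = [];
  const medium: string[] = [];
  const low: string[] = [];
  for (const [test, fns] of coverage) {
    if ([...fns].some((f) => changedFns.has(f))) {
      high.push(test); // directly exercises changed code
    } else if ([...fns].some((f) => callersOfChanged.has(f))) {
      medium.push(test); // exercises callers of changed code
    } else {
      low.push(test); // no known connection; defer or run in CI
    }
  }
  return { high, medium, low };
}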

Risk-based selection complements comprehensive testing; it does not replace it. CI pipelines should still run the full suite. Risk-based selection accelerates the feedback loop during development.

Regression risk management

Regression risk, the likelihood that a change breaks existing functionality, dominates work on unfamiliar codebases. You do not know what will break because you do not know what the code does.

Managing regression risk requires layers of defense.

Existing test coverage. Whatever tests exist, run them. Passing tests mean the change did not break known behaviors. Failing tests require investigation before proceeding.

Generated characterization tests. For code without tests, generate characterization coverage before modification. These tests catch regressions the original developers never tested for.

Mutation testing for critical paths. For high-risk code like payments, authentication, and data integrity, mutation testing verifies that tests actually catch faults. A test suite with 100% line coverage but 4% mutation score provides false confidence. Mutation testing reveals whether tests would catch real bugs.
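One way to put this into practice in a JavaScript or TypeScript codebase is StrykerJS; a minimal configuration sketch follows, with the file globs, runner, and threshold numbers as assumptions to adapt:

// stryker.conf.mjs -- a sketch for StrykerJS; globs and numbers are illustrative
export default {
  mutate: ['src/payments/**/*.ts', 'src/auth/**/*.ts'], // critical paths only
  testRunner: 'jest', // requires the @stryker-mutator/jest-runner plugin
  coverageAnalysis: 'perTest',
  reporters: ['clear-text', 'html'],
  thresholds: { high: 80, low: 60, break: 50 }, // fail the run below 50% mutation score
};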

Studies show mutation score correlates with fault detection far better than coverage metrics. Meta's mutation testing system achieves 95% precision in identifying test gaps. Google runs mutation testing across 2 billion lines of code with 150 million daily tests. These organizations treat mutation testing as table stakes for regression prevention.

Human review of agent output. No automated system catches everything. Human review of agent-generated modifications remains the final quality gate. The 96% of developers who do not fully trust AI output apply appropriate skepticism.

Test prioritization strategies

When time is limited, and it always is, prioritization determines which tests to write first.

Prioritize by change frequency. Files that change often accumulate bugs. Prioritize characterization tests for code with high commit velocity. Stable code that has not changed in years can wait.

Prioritize by bug history. If historical bug reports cluster in certain modules, those modules need coverage first. Agents can analyze git history and issue trackers to identify hotspots:

Analyze the git history for the billing module.
Which files have the most bug-fix commits in the past year?
List them in order of bug frequency.
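A sketch of the underlying analysis, runnable without an agent; it assumes bug fixes are identifiable by "fix" in the commit message and that the billing code lives under src/billing, both of which you would adapt to your repository:

import { execSync } from 'node:child_process';

// Count bug-fix commits per file over the past year, most-touched first.
const log = execSync(
  'git log --since="1 year ago" -i --grep="fix" --name-only --pretty=format: -- src/billing',
  { encoding: 'utf8' },
);

const counts = new Map<string, number>();
for (const file of log.split('\n').filter((line) => line.trim() !== '')) {
  counts.set(file, (counts.get(file) ?? 0) + 1);
}

const hotspots = [...counts.entries()].sort((a, b) => b[1] - a[1]);
console.table(hotspots.slice(0, 10));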

Prioritize by business criticality. Code that handles money, security, or compliance needs coverage before cosmetic features. Direct agents to focus on critical paths:

Generate characterization tests for the PaymentGateway class first.
This code processes real transactions and must not regress.
Cover all public methods before moving to utility classes.

Prioritize by complexity. Complex code breaks more often. Cyclomatic complexity, method length, and parameter count all correlate with bug rates. Agents can identify complexity hotspots:

Identify the five most complex functions in this module by cyclomatic complexity.
Generate characterization tests for these functions first.
Complex code needs more test coverage to catch edge case behaviors.

Approval testing for complex outputs

Some functions produce outputs too complex for simple assertions. A function that generates an HTML report, serializes a complex object, or formats a multi-line string produces output that is tedious to assert against.

Approval testing (also called golden master testing or snapshot testing) handles this. The workflow:

  1. Run the function and capture output to a file
  2. Human reviews and "approves" the output as the golden master
  3. Future test runs compare output against the approved version
  4. Any difference triggers a test failure
  5. Changes to output require explicit re-approval

This captures complex behavior without writing complex assertions. The approved output file documents what the code produces. Any modification that changes the output requires a human decision: was this change intentional?

Agents can set up approval testing infrastructure:

Convert the characterization tests for generateReport to approval tests.
Capture the current output to approved files.
Use the ApprovalTests library to compare future runs against approved output.

Approval tests are sensitive to trivial changes. Timestamps, random IDs, and ordering differences cause false failures. Configure approval tests to ignore non-deterministic elements.
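A sketch of the snapshot-testing flavor using a Jest-style runner, with a hypothetical generateReport that embeds timestamps and IDs; the scrubber normalizes the non-deterministic parts before comparison:

import { generateReport } from './reporting';

// Replace volatile values so only meaningful differences fail the test.
function scrub(output: string): string {
  return output
    .replace(/\d{4}-\d{2}-\d{2}T[\d:.]+Z/g, '<timestamp>')
    .replace(/id="[0-9a-f-]{36}"/g, 'id="<uuid>"');
}

test('generateReport matches the approved output', () => {
  const report = generateReport({ orderId: 42 });

  // The first run writes the snapshot (the golden master) for human review;
  // later runs fail on any difference until the snapshot is re-approved.
  expect(scrub(report)).toMatchSnapshot();
});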

Continuous test generation

Characterization tests are not a one-time activity. As code evolves, coverage gaps emerge. Continuous test generation treats test coverage as an ongoing concern rather than a milestone.

On pull request: Agent analyzes diff and generates tests for changed code.

Coverage gate: CI blocks merge if new code lacks tests.

Nightly sweep: Agent scans for untested functions and generates coverage.

Manual trigger: Developer requests tests for specific functionality.

This approach prevents the accumulation of untested code. Each change arrives with tests. Coverage improves incrementally rather than requiring dedicated testing sprints.
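One simple interpretation of the coverage gate, sketched as a CI script; it assumes Jest's json-summary reporter writes coverage/coverage-summary.json, that the main branch is origin/main, and an 80% line threshold, all of which are assumptions to adapt:

import { execSync } from 'node:child_process';
import { readFileSync } from 'node:fs';

// Per-file line coverage as written by the "json-summary" coverage reporter.
const summary = JSON.parse(readFileSync('coverage/coverage-summary.json', 'utf8'));

const changed = execSync('git diff --name-only origin/main...HEAD', { encoding: 'utf8' })
  .split('\n')
  .filter((f) => f.endsWith('.ts'));

const uncovered = changed.filter((file) => {
  const entry = Object.entries(summary).find(([path]) => path.endsWith(file));
  return !entry || (entry[1] as { lines: { pct: number } }).lines.pct < 80;
});

if (uncovered.length > 0) {
  console.error('Changed files below the coverage threshold:', uncovered);
  process.exit(1); // block the merge
}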

Teams that adopt AI-assisted test generation report cutting test creation time by 40-60% while maintaining or improving fault detection. The time savings compound: fewer bugs escape to production, reducing debugging effort downstream.

Verifying AI-generated tests

AI-generated tests require verification. An agent can produce tests that pass but prove nothing: tautological tests that check the code against itself.

Mutation testing the tests. Introduce deliberate bugs into the code and run the generated tests. If the tests still pass, they are not catching the faults they should. Delete and regenerate any tests that let these mutants survive.
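A tiny example of the kind of deliberate bug to introduce, using a hypothetical discount function; if every generated test still passes against the mutant, the suite is not pinning that behavior:

// Original behavior under test (hypothetical):
function applyDiscount(total: number): number {
  return total >= 100 ? total - 5 : total;
}

// Hand-introduced mutant: the boundary flips from >= to >.
// A useful characterization suite should fail with this version in place.
function applyDiscountMutant(total: number): number {
  return total > 100 ? total - 5 : total;
}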

Assertion review. Examine assertions for specificity. A test that asserts result !== undefined catches far less than one that asserts result.status === 'approved'. Direct agents to write specific assertions:

Strengthen the assertions in these tests.
Replace existence checks with value checks.
Each test should fail if the function's behavior changes, not just if it throws.
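A before-and-after sketch of that strengthening, using the assertions quoted above and a hypothetical processOrder under test; the asserted values are placeholders for observed behavior:

import { processOrder } from './orders';

// Weak: passes as long as the function returns anything at all.
test('processOrder returns a result', () => {
  const result = processOrder({ total: 100 });
  expect(result).toBeDefined();
});

// Strong: fails if the discovered behavior changes, not just if it throws.
test('shouldApproveOrderOf100WithStatusApproved', () => {
  const result = processOrder({ total: 100 });
  expect(result.status).toBe('approved');
  expect(result.total).toBe(100); // placeholder for the captured value
});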

Sabotage testing. Manually break the code in ways the modification might accidentally break it. If the characterization tests catch the sabotage, they are doing their job. If they miss it, coverage is insufficient.

Mark Seemann's 2026 advice applies: "If you don't engage with the tests [AI] generates, you can't tell what guarantees they give." Generated tests are a starting point. Verification transforms them into reliable safety nets.

Building the safety net

Testing unfamiliar code is defensive work. You build walls before making changes, verify the walls hold, then modify with confidence that failures will surface immediately.

The testing strategy for unfamiliar code combines AI-generated characterization tests for speed, predictive impact analysis for focus, risk-based selection for efficiency, mutation testing for confidence, and human verification for quality.

No single technique provides complete protection. Each layer catches what others miss. When the safety net is in place, unfamiliar code becomes merely unfamiliar, not dangerous. The tests tell you immediately when something breaks.
