Applied Intelligence
Module 9: Working with Legacy Code

Documentation Archaeology with AI

The missing why

Every codebase has two layers of understanding: what the code does and why it was written that way. Static analysis tools, linters, and documentation generators handle the first layer. They parse syntax, trace dependencies, and extract function signatures.

Design intent is the second layer, and it traditionally required human knowledge. Why does the authentication system use sessions instead of tokens? Why does the order processor batch writes instead of processing individually? Why does this module exist at all when a library could do the same thing?

These answers lived in the heads of developers who have since moved on. In their commit messages, if the messages said more than "fix bug." In architecture decision records, if anyone wrote them. In meeting notes, if those notes survived.

Nearly 70% of maintenance time goes to program comprehension rather than actual modification. Developers read and re-read, trace and retrace, building mental models that evaporate when they leave the project. Understanding a legacy system used to take months.

AI compresses this. Morgan Stanley's DevGen.AI processed 9 million lines of legacy code in 2025, saving an estimated 280,000 developer hours. The tool translates COBOL and Perl into plain English specifications by scanning line by line and explaining what the code does in terms humans can understand.

This is documentation archaeology: the systematic recovery of design decisions from code.

Discovery, mapping, synthesis

The process follows three phases. They sound obvious when stated, but skipping phases creates problems. Discovery without mapping produces a parts list. Mapping without synthesis produces spaghetti diagrams that explain nothing.

Phase 1: Discovery. Survey the codebase to identify what exists. File structure, naming patterns, configuration files, dependency declarations. This phase answers "what are the pieces?" without yet asking how they fit together.

Examine this repository and identify:
- The main directories and what each appears to contain
- Configuration files and what they configure
- Entry points (main files, index files, API routes)
- Test structure and coverage patterns

Don't explain the code yet. List what you find.

Discovery creates an inventory. The agent reads directory structures, scans file headers, and catalogs components without interpreting relationships.
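A first pass at this inventory can be scripted before the agent is involved. Below is a minimal sketch in Python, assuming a hypothetical repository root named legacy-app, that walks the tree and buckets files into configuration, tests, likely entry points, and everything else:

from pathlib import Path
from collections import defaultdict

REPO = Path("legacy-app")  # hypothetical repository root

CONFIG_NAMES = {"package.json", "requirements.txt", "settings.py", "Dockerfile"}
ENTRY_HINTS = {"main", "index", "app", "manage"}

inventory = defaultdict(list)
for path in REPO.rglob("*"):
    if not path.is_file() or ".git" in path.parts:
        continue  # skip directories and git internals
    rel = path.relative_to(REPO)
    if path.name in CONFIG_NAMES:
        inventory["config"].append(rel)
    elif "test" in path.name.lower() or "tests" in rel.parts:
        inventory["tests"].append(rel)
    elif path.stem.lower() in ENTRY_HINTS:
        inventory["entry points"].append(rel)
    else:
        inventory[path.suffix or "other"].append(rel)

for category, files in sorted(inventory.items(), key=lambda kv: -len(kv[1])):
    print(f"{category}: {len(files)} files, e.g. {files[0]}")

The output is a parts list and nothing more, which is exactly what this phase should produce.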

Phase 2: Mapping. Trace how pieces connect. Call chains, data flows, dependency relationships. This phase answers "how do the pieces interact?"

For the authentication system you identified:
- Trace the flow from login request to session creation
- Map which services depend on authentication state
- Identify where auth tokens are validated vs. where they're trusted
- Find any direct database access vs. service-mediated access

Show the relationships, not just the components.

Mapping produces architecture. The agent follows imports, traces function calls, and builds a picture of how data moves through the system.
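The import-following step is mechanical enough to sketch. Assuming Python sources under the same hypothetical legacy-app root, this rough script extracts module-level imports and prints the dependency edges an agent would reason over:

import ast
from pathlib import Path

REPO = Path("legacy-app")  # hypothetical project root

edges = set()
for path in REPO.rglob("*.py"):
    try:
        tree = ast.parse(path.read_text(encoding="utf-8"))
    except SyntaxError:
        continue  # skip files the parser can't handle
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            edges.update((path.stem, alias.name) for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            edges.add((path.stem, node.module))

for source, target in sorted(edges):
    print(f"{source} -> {target}")

Call chains and data flows take more than static imports, but even this edge list turns a parts list into a picture of connections.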

Phase 3: Synthesis. Infer intent from patterns. Why these choices instead of alternatives? What constraints shaped the design? This phase answers "why was it built this way?"

Based on your mapping of the authentication system:
- Why might the original developers have chosen session-based auth?
- What constraints does the current design optimize for?
- What tradeoffs did this approach accept?
- Are there patterns that suggest this evolved from something simpler?

Reason about intent from the evidence in the code.

Synthesis produces documentation. The agent generates hypotheses about design rationale, constrained by what the code actually shows.

AI cannot read minds. Phase 3 synthesis produces educated guesses, not recovered memories. Treat synthesized explanations as hypotheses to validate, not facts to trust blindly. The original developers may have had reasons the code doesn't reveal.

Generating plain-language explanations

Legacy code often defies quick comprehension. Nested conditionals, cryptic variable names, business logic distributed across dozens of files. AI translates this into plain English.

Explain what this function does in business terms, not technical terms.
What problem does it solve for users or the business?
Assume I understand the domain but haven't seen this code before.

The request for business terms forces abstraction. Instead of "this function iterates through the orders array and filters by status code 3," the agent produces "this function identifies orders ready for shipment."
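To see the gap, consider a hypothetical version of that function:

READY_FOR_SHIPMENT = 3  # hypothetical status code

def filter_orders(orders):
    # Technical reading: keep orders whose status equals 3.
    # Business reading: identify orders that are ready for shipment.
    return [order for order in orders if order["status"] == READY_FOR_SHIPMENT]

The code states the mechanism. The business-terms explanation states the purpose, which is what a newcomer needs first.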

For complex modules, request layered explanations:

Explain this module at three levels:
1. One sentence: what capability does it provide?
2. One paragraph: how does it work at a high level?
3. Detailed: walk through the main functions and their roles.

Layered explanations serve different audiences. The one-sentence version goes in architecture diagrams. The paragraph goes in developer onboarding docs. The detailed version becomes inline documentation.

Martin Fowler's team at Thoughtworks developed a "CodeConcise" approach: parsing code into Abstract Syntax Trees, storing them in graph databases, then generating explanations at method, class, and package levels. Engineers query for explanations at whatever granularity they need.
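The core move is simpler than it sounds. Here is a minimal sketch using only Python's standard library, parsing one hypothetical file and listing the class- and function-level units that explanations would attach to:

import ast
from pathlib import Path

source = Path("legacy-app/orders/processing.py")  # hypothetical file
tree = ast.parse(source.read_text(encoding="utf-8"))

# One record per class and function: the units that method- and
# class-level explanations are generated for.
for node in ast.walk(tree):
    if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
        kind = type(node).__name__
        doc = (ast.get_docstring(node) or "no docstring").splitlines()[0]
        print(f"{kind} {node.name} (line {node.lineno}): {doc}")

The graph database and the generated prose are the hard parts; the decomposition itself is ordinary parsing.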

Recovering architectural decisions

Design decisions leave fingerprints in code. Agents can read these fingerprints and reconstruct the decisions that created them.

Dependency choices. Why does the project use Redis for caching instead of Memcached? Why axios instead of fetch? The choice is visible in package.json or requirements.txt. The rationale might emerge from how the dependency is used.

This project uses [library X] for [purpose].
Examine how it's used throughout the codebase.
What features of [library X] does the code actually depend on?
What might have made this a better choice than alternatives?
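One way to gather that evidence is to count which parts of the library the code actually touches. A rough heuristic sketch, assuming Python sources and Redis as a hypothetical example library:

import ast
from collections import Counter
from pathlib import Path

REPO = Path("legacy-app")  # hypothetical project root
LIBRARY = "redis"          # hypothetical library under investigation

usage = Counter()
for path in REPO.rglob("*.py"):
    try:
        tree = ast.parse(path.read_text(encoding="utf-8"))
    except SyntaxError:
        continue
    for node in ast.walk(tree):
        # Count attribute accesses on the library name, e.g. redis.Redis or redis.ConnectionPool.
        if (isinstance(node, ast.Attribute)
                and isinstance(node.value, ast.Name)
                and node.value.id == LIBRARY):
            usage[node.attr] += 1

for feature, count in usage.most_common():
    print(f"{LIBRARY}.{feature}: {count} uses")

It misses from-imports and aliases, but it quickly shows whether the code leans on features an alternative library lacks.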

Structural patterns. Why is the code organized into these particular modules? Why do some services talk directly to the database while others go through a repository layer? Architecture emerges from examining the actual call graph.

Map the database access patterns in this codebase.
Which modules access the database directly?
Which go through an abstraction layer?
Is there a consistent pattern, or does it vary? If it varies, can you identify why?
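A crude first pass can be automated before asking the agent to explain the variance. This sketch uses hypothetical markers for raw-driver imports and for the repository layer, and sorts modules into direct versus mediated access:

from pathlib import Path

REPO = Path("legacy-app")                        # hypothetical project root
DIRECT = ("import psycopg2", "from sqlalchemy")  # hypothetical raw-driver imports
MEDIATED = ("from repositories import",)         # hypothetical abstraction layer

direct, mediated = set(), set()
for path in REPO.rglob("*.py"):
    text = path.read_text(encoding="utf-8")
    if any(marker in text for marker in DIRECT):
        direct.add(path)
    if any(marker in text for marker in MEDIATED):
        mediated.add(path)

print(f"Direct database access: {len(direct)} modules")
print(f"Repository-layer access: {len(mediated)} modules")
print(f"Both (inconsistent, worth asking about): {len(direct & mediated)} modules")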

Historical evolution. Code changes over time. Old patterns coexist with new ones. The git history shows when architectural shifts happened.

Look at the git history for src/auth/.
When was this module introduced?
Were there major rewrites or just incremental changes?
Do commit messages suggest why changes were made?

Claude Code can search git history to answer questions like "Why was this API designed this way?" Explicitly prompting for git history investigation often reveals context that code inspection alone misses.
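The underlying queries are easy to run yourself when you want to spot-check the agent's answer. A small Python sketch, assuming the src/auth/ path from the prompt above:

import subprocess

PATH = "src/auth/"  # module of interest

def git(*args):
    return subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    ).stdout

# Earliest commit touching the path: roughly when the module was introduced.
history = git("log", "--reverse", "--date=short", "--format=%ad %s", "--", PATH).splitlines()
print("Introduced:", history[0] if history else "no history found")

# Subject lines over time: clusters of "rewrite" or "refactor" subjects
# suggest major changes rather than incremental ones.
for line in history:
    print(line)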

Commit messages vary wildly in quality. "fix bug" teaches nothing. "Refactored auth to use JWT because session storage hit scaling limits at 10k concurrent users" is useful. When commit messages are informative, git history becomes a design rationale database.

When archaeology fails

Some design decisions leave no trace. This is worth acknowledging because it's easy to over-trust AI-generated explanations.

External constraints. The code uses a particular database because that's what the infrastructure team supported. The API follows a specific format because a partner required it. These constraints existed outside the codebase. No amount of code inspection reveals them.

Abandoned alternatives. The team considered three approaches and chose one. The other two left no evidence. The agent can only see what was built, not what was rejected.

Performance-driven changes. Code might be structured oddly because profiling revealed a bottleneck. The optimization remains; the profile that motivated it is gone.

Time pressure. "We built it this way because we had two weeks" is a common rationale. It leaves no fingerprint.

For these gaps, AI-generated documentation should acknowledge uncertainty:

When documenting design decisions, explicitly flag:
- Decisions where the rationale is clear from the code
- Decisions where you can hypothesize rationale but aren't certain
- Decisions that seem arbitrary and may have had external reasons

Flagging uncertainty makes documentation more honest and more useful. Readers know which explanations to trust and which to verify.

From archaeology to documentation

The goal isn't understanding for its own sake. It's producing artifacts that persist beyond any individual exploration.

After analysis, generate:

Architecture overview.

Based on your analysis, write an architecture overview for this system.
Include:
- Main components and their responsibilities
- How data flows between components
- Key design decisions and their apparent rationale
- Areas where the design seems inconsistent or evolved over time

Component documentation.

For the payment processing module:
- Purpose and business function
- Key classes/functions and their roles
- Dependencies (what it uses and what uses it)
- Known constraints or limitations visible in the code

Decision records.

Create an Architecture Decision Record for the choice to use [X].
Follow ADR format:
- Context: What situation prompted this decision?
- Decision: What was decided?
- Consequences: What does this make easier? Harder?
- Status: Accepted (inferred from implementation)
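Filled in for the session-based authentication example used throughout this page, a retrospective ADR might look like the following. Every detail is hypothetical and would need validation with the team:

ADR-012: Use server-side sessions for authentication
- Context: The legacy mobile client reuses the web login flow and has no token storage or refresh logic; all authenticated traffic already passes through a web tier with access to a shared session store.
- Decision: Keep authentication state in server-side sessions rather than issuing signed tokens.
- Consequences: Simpler clients and immediate revocation; the session store becomes a scaling bottleneck and a dependency for every authenticated service.
- Status: Accepted (inferred from implementation)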

Retrospective ADRs capture understanding before it evaporates. Teams report generating dozens of ADRs in a single morning using AI, work that would have taken weeks of meetings to produce manually.

Validating archaeological findings

AI-generated explanations require verification. The agent analyzed code; it didn't interview original developers.

Test against behavior. If the agent claims a function handles a particular edge case, write a test that exercises that case. Passing tests validate understanding. Failing tests reveal misinterpretation.
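For example, if the generated documentation claims that a (hypothetical) normalize_email function lowercases addresses and preserves plus-aliases, two short pytest-style tests settle it:

from myapp.users import normalize_email  # hypothetical module and function

def test_strips_whitespace_and_lowercases():
    assert normalize_email("  Ada@Example.COM ") == "ada@example.com"

def test_preserves_plus_aliases():
    # A failure here means the generated explanation misread the code,
    # not necessarily that the code itself is wrong.
    assert normalize_email("ada+billing@example.com") == "ada+billing@example.com"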

Cross-reference sources. Compare AI-generated documentation against any existing docs, however outdated. Contradictions highlight either code that has changed since those docs were written or an error in the AI's analysis.

Review with developers. Current team members who've worked in the code can validate or correct AI hypotheses. "Actually, we use that pattern because the ORM had a bug in 2019" adds context no amount of code inspection could reveal.

Iterate on corrections. When reviews identify errors, update the documentation and add corrections to project context files. CLAUDE.md entries like "The auth module uses sessions, not tokens, due to mobile app constraints, not because of security preferences" prevent future misinterpretation.

What this buys you

Documentation archaeology with AI doesn't replace human expertise. A senior developer investigating a legacy system still applies judgment about what matters, what to trust, and what to question. But instead of spending two weeks reading code to form initial hypotheses, they spend two hours reviewing AI-generated analysis.

EPAM's AI/Run platform reports 85% accuracy in identifying and documenting business rules from legacy COBOL, with a 70% reduction in manual documentation effort. The remaining 30% of the effort (validation, correction, contextualization) remains human work. That's still days instead of weeks.

The knowledge that lived only in departed developers' heads becomes explicit, searchable, persistent. Whether that's worth the effort depends on how long the codebase needs to live and how often new people need to understand it. For most enterprise systems, the answer is obviously yes.

The next page applies these techniques specifically: creating agent-friendly documentation for existing projects that have none.
