What to Look for in AI-Generated Code
The output validation problem
Module 4 established techniques for managing context across long sessions and parallel workflows. Those techniques enable agents to produce more code, faster. But output volume creates a new problem: how do you know if the code is any good?
Context engineering determines what agents can see. Output validation determines what you accept. The two form a complete loop: context shapes generation, validation gates acceptance. Skip validation and agentic development becomes a gamble.
The 2025 CodeRabbit analysis of 470 GitHub pull requests quantified what practitioners suspected: AI-assisted PRs contain 10.83 issues on average, compared to 6.45 for human-only PRs. That's a 1.7x multiplier. It compounds. At the 90th percentile, AI PRs reach 26 issues per change, more than double the human baseline. Higher output velocity combined with higher defect density creates technical debt faster than traditional development ever could.
The brilliant but untrusted intern
The right mental model makes validation easier. AI coding assistants have contradictory characteristics: broad knowledge with narrow judgment, speed with inconsistency, syntactic fluency with semantic gaps.
Simon Willison describes it well: "a junior developer that has read every textbook in the world but has 0 practical experience with your specific codebase, and is prone to forgetting anything but the most recent hour of things you've told it."
What the intern metaphor gets right for validation:
- The code might be excellent or subtly wrong. You can't tell without checking.
- Syntax looks professional even when logic fails. Plausibility isn't correctness.
- Local correctness without system awareness. The function works; it breaks three other things.
- Agents may confirm what you suggest rather than what's true. They're susceptible to leading questions.
Treating agent output as you would an intern's work produces better outcomes than either blind acceptance or reflexive rejection. Review everything. Trust nothing implicitly. Teach through correction.
Ethan Mollick frames it operationally: "You can delegate tasks, but you'll need to onboard and train it, giving it context and constraints around its job description." The onboarding is CLAUDE.md. The constraints are validation gates. The training is iterative feedback.
Four areas where AI code breaks
AI-generated code fails in predictable categories. The CodeRabbit study broke down the 1.7x issue multiplier by type:
| Category | AI vs Human Ratio | What It Means |
|---|---|---|
| Logic and correctness | 1.75x higher | Algorithm errors, business logic mistakes, error handling gaps |
| Maintainability | 1.64x higher | Readability issues 3x higher, naming inconsistencies 2x higher |
| Security | 1.57x higher | XSS 2.74x more likely, insecure references 1.91x more likely |
| Performance | 1.42x higher | Excessive I/O operations 8x more common |
These ratios tell you where to focus. Logic errors cause the most damage. Security errors create the most risk. Maintainability errors accumulate the most debt.
Logic and correctness
Logic errors dominate AI code issues. The 1.75x multiplier reflects a fundamental gap: agents optimize for plausible code, not provably correct code.
Common logic failures:
- Off-by-one errors, edge cases ignored, empty inputs mishandled
- Happy path works, error handling missing or wrong
- Assumptions about data presence that don't hold
- Operations in wrong order, race conditions introduced
- Rules implemented as understood, not as intended
The danger is subtlety. AI-generated code compiles, passes simple tests, and looks correct. Failures emerge under conditions the agent didn't anticipate: the cases you test only after production incidents.
Review technique: trace execution paths manually for non-trivial logic. Don't assume the code does what the agent said it does. Verify by reading, not by trusting the description.
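To make the tracing technique concrete, here is a minimal sketch in Python. The function name and scenario are hypothetical, invented for illustration: the first version compiles, handles the happy path, and looks correct, but a manual trace of the empty-input case exposes the gap the agent never considered.

```python
# Hypothetical agent output: plausible-looking code with a subtle edge-case gap.
def average_latency_ms(samples: list[float]) -> float:
    """Return the mean latency in milliseconds."""
    total = 0.0
    for sample in samples:
        total += sample
    return total / len(samples)  # fine for normal input, ZeroDivisionError for []

# Tracing the empty-input path by hand surfaces the case the agent skipped.
def average_latency_ms_checked(samples: list[float]) -> float:
    if not samples:  # the edge case the original never handled
        return 0.0
    return sum(samples) / len(samples)
```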
Security
Security failures in AI code follow recognizable patterns. Veracode's 2025 analysis found that 45% of AI-generated code samples contained security flaws, and failure rates for specific vulnerability classes vary widely:
| Vulnerability Type | AI Code Failure Rate |
|---|---|
| XSS prevention | 86% |
| Log injection prevention | 88% |
| SQL injection prevention | 20% |
Java code shows the highest overall security failure rate at 72%. AI code is 2.74x more likely to introduce XSS vulnerabilities than human code, 1.91x more likely to create insecure object references, and 1.88x more likely to mishandle passwords.
These aren't edge cases. Security failures represent how agents reason about adversarial inputs: they don't. Agents optimize for functionality, not for defense against malicious use.
Review technique: check every input handling path. User input, API responses, file contents, environment variables. Anything external needs validation. Assume the agent didn't add it.
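As one concrete illustration of checking an input handling path, the sketch below uses Python's standard sqlite3 module and a hypothetical `users` table. It contrasts a string-interpolated query an agent might plausibly produce with the parameterized form review should insist on.

```python
import sqlite3

# Hypothetical lookup as an agent might write it: user input spliced into SQL text.
def find_user_unsafe(conn: sqlite3.Connection, username: str):
    # Vulnerable: a username like "x' OR '1'='1" rewrites the query's meaning.
    return conn.execute(
        f"SELECT id, email FROM users WHERE username = '{username}'"
    ).fetchone()

# What review should require: parameterized queries keep data out of the SQL string.
def find_user(conn: sqlite3.Connection, username: str):
    return conn.execute(
        "SELECT id, email FROM users WHERE username = ?",
        (username,),
    ).fetchone()
```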
Maintainability
Maintainability issues accumulate invisibly. The 1.64x multiplier understates the problem because maintainability debt compounds over time.
The specifics:
- Readability issues appear 3x more often (inconsistent formatting, unclear naming, missing structure)
- GitClear found 2-3x higher duplication rates in AI-assisted codebases
- Formatting inconsistencies are 2.66x more common
- Naming problems are nearly 2x more frequent
AI code often resembles working pseudocode rather than production code. The logic functions, but the code doesn't communicate intent. Six months later, maintenance becomes archaeology.
Review technique: read for comprehension, not just correctness. If you can't immediately understand what code does and why, it needs revision regardless of whether it works.
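A hedged sketch of the gap between "working pseudocode" and code that communicates intent; the record fields (`ts`, `amt`, `fx`) and function names are hypothetical:

```python
# Hypothetical agent output: the logic works, but the names explain nothing.
def proc(d, t):
    r = []
    for x in d:
        if x["ts"] > t and not x.get("deleted"):
            r.append(x["amt"] * x.get("fx", 1.0))
    return r

# The same logic, rewritten so a reader six months from now doesn't need archaeology.
def converted_amounts_since(transactions: list[dict], cutoff_ts: float) -> list[float]:
    """Amounts for live transactions after cutoff_ts, converted via the optional 'fx' rate."""
    return [
        tx["amt"] * tx.get("fx", 1.0)
        for tx in transactions
        if tx["ts"] > cutoff_ts and not tx.get("deleted")
    ]
```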
Performance
Performance issues appear less frequently but with higher variance. The 1.42x overall multiplier hides extreme cases: excessive I/O operations appear 8x more often in AI PRs.
Agents generate code that works at development scale. They don't reason about production load, data volume growth, or resource constraints. A function that works fine with 100 records may fail catastrophically with 100,000.
Common performance issues:
- N+1 queries: database calls inside loops
- Unbounded operations: no pagination, no limits, no streaming
- Synchronous blocking: operations that should be async
- Resource leaks: connections, handles, memory not properly released
Review technique: identify scaling assumptions. What happens when inputs grow 10x? 100x? Agents rarely consider these questions. Reviewers must.
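The N+1 pattern is easy to spot once you ask the 10x question. A minimal sketch, assuming a sqlite3-style connection and a hypothetical `orders` table:

```python
# Hypothetical N+1 pattern: one database round trip per customer inside the loop.
def order_totals_n_plus_one(conn, customer_ids: list[int]) -> dict[int, float]:
    totals = {}
    for cid in customer_ids:  # N separate queries: fine for 100 customers, painful for 100,000
        row = conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer_id = ?",
            (cid,),
        ).fetchone()
        totals[cid] = row[0]
    return totals

# Batched alternative: one query regardless of how many customers are passed in.
def order_totals_batched(conn, customer_ids: list[int]) -> dict[int, float]:
    if not customer_ids:
        return {}
    placeholders = ",".join("?" * len(customer_ids))
    rows = conn.execute(
        f"SELECT customer_id, SUM(amount) FROM orders "
        f"WHERE customer_id IN ({placeholders}) GROUP BY customer_id",
        customer_ids,
    ).fetchall()
    totals = {cid: 0.0 for cid in customer_ids}  # customers with no orders stay at zero
    totals.update({cid: float(total) for cid, total in rows})
    return totals
```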
Calibrating review depth
Not all code warrants the same review intensity. The goal is appropriate scrutiny, not exhaustive analysis of every line.
High scrutiny code
Apply maximum review rigor to:
- Security-sensitive paths: authentication, authorization, input validation
- Data transformation logic: ETL, migrations, format conversions
- Financial calculations: anything involving money or quantities
- Integration boundaries: API calls, database operations, external services
- Error handling: recovery paths, fallbacks, failure modes
These categories combine high impact with high AI error rates. Time spent here prevents production incidents.
Standard scrutiny code
Apply normal review depth to:
- UI components: layout, styling, presentation logic
- Test code: unit tests, integration tests (but verify coverage claims)
- Configuration: non-sensitive settings, feature flags
- Documentation: comments, README updates
Errors here cause problems but rarely emergencies.
Lower scrutiny code
Reduced review is acceptable for:
- Boilerplate: standard patterns, scaffolding, repetitive structure
- Type definitions: interfaces, schemas (verify completeness, not correctness)
- Dependency updates: version bumps (rely on CI checks)
Agents excel at boilerplate. Detailed review of generated scaffolding wastes time better spent on logic.
The verification feedback loop
Boris Cherny, who leads Claude Code development, emphasizes verification as the critical quality lever: "Give Claude a way to verify its work. If Claude has that feedback loop, it will 2-3x the quality of the final result."
Verification isn't a post-hoc quality gate. It's a generation input. When agents can run tests, check types, or validate behavior, they iterate toward correctness. When verification is absent, agents produce what looks right rather than what is right.
This creates a practical principle: if you can't verify it, you shouldn't accept it.
- Can tests verify the logic? Accept after tests pass.
- Can type checking verify the contracts? Accept after types check.
- Can manual inspection verify the intent? Accept after reading confirms understanding.
Code that can't be verified through any mechanism requires extraordinary justification to merge. The cost of review rises in inverse proportion to verification coverage.
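One way to make the principle operational is a small local gate that runs the available verifiers before you accept an agent's change. A minimal sketch, assuming pytest and mypy are the project's test runner and type checker; the file name and check list are hypothetical and should match whatever verification your project actually has.

```python
# check_before_accept.py - run the project's verifiers; refuse acceptance if any fail.
import subprocess
import sys

CHECKS = [
    ["pytest", "-q"],  # can tests verify the logic?
    ["mypy", "."],     # can type checking verify the contracts?
]

def main() -> int:
    for cmd in CHECKS:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"Verification gate failed: {' '.join(cmd)}")
            return result.returncode
    print("All verification gates passed; manual review of intent is still yours.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```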
The next page examines specific quality issues in detail: the patterns that produce those 1.7x multiplier findings.