What to Look for in AI-Generated Code
The output validation problem
Module 4 established techniques for managing context across long sessions and parallel workflows. Those techniques enable agents to produce more code, faster. But output volume creates a new problem: how do you know if the code is any good?
Context engineering determines what agents can see. Output validation determines what you accept. The two form a complete loop: context shapes generation, validation gates acceptance. Skip validation and agentic development becomes a gamble.
The 2025 CodeRabbit analysis of 470 GitHub pull requests quantified what practitioners suspected: AI-assisted PRs contain 10.83 issues on average, compared to 6.45 for human-only PRs. That's a 1.7x multiplier. It compounds. At the 90th percentile, AI PRs reach 26 issues per change, more than double the human baseline. Higher output velocity combined with higher defect density creates technical debt faster than traditional development ever could.
The brilliant but untrusted intern
The right mental model makes validation easier. AI coding assistants have contradictory characteristics: broad knowledge with narrow judgment, speed with inconsistency, syntactic fluency with semantic gaps.
Simon Willison describes it well: "a junior developer that has read every textbook in the world but has 0 practical experience with your specific codebase, and is prone to forgetting anything but the most recent hour of things you've told it."
What the intern metaphor gets right for validation:
- The code might be excellent or subtly wrong. You can't tell without checking.
- Syntax looks professional even when logic fails. Plausibility isn't correctness.
- Local correctness without system awareness. The function works; it breaks three other things.
- Agents may confirm what you suggest rather than what's true. They're susceptible to leading questions.
Treating agent output as you would an intern's work produces better outcomes than either blind acceptance or reflexive rejection. Review everything. Trust nothing implicitly. Teach through correction.
Ethan Mollick frames it operationally: "You can delegate tasks, but you'll need to onboard and train it, giving it context and constraints around its job description." The onboarding is CLAUDE.md. The constraints are validation gates. The training is iterative feedback.
Four areas where AI code breaks
AI-generated code fails in predictable categories. The CodeRabbit study broke down the 1.7x issue multiplier by type:
| Category | AI vs Human Ratio | What It Means |
|---|---|---|
| Logic and correctness | 1.75x higher | Algorithm errors, business logic mistakes, error handling gaps |
| Maintainability | 1.64x higher | Readability issues 3x higher, naming inconsistencies 2x higher |
| Security | 1.57x higher | XSS 2.74x more likely, insecure references 1.91x more likely |
| Performance | 1.42x higher | Excessive I/O operations 8x more common |
These ratios tell you where to focus. Logic errors cause the most damage. Security errors create the most risk. Maintainability errors accumulate the most debt.
Logic and correctness
Logic errors dominate AI code issues. The 1.75x multiplier reflects a fundamental gap: agents optimize for plausible code, not provably correct code.
Common logic failures:
- Off-by-one errors, edge cases ignored, empty inputs mishandled
- Happy path works, error handling missing or wrong
- Assumptions about data presence that don't hold
- Operations in wrong order, race conditions introduced
- Rules implemented as understood, not as intended
The danger is subtlety. AI-generated code compiles, passes simple tests, and looks correct. Failures emerge under conditions the agent didn't anticipate: the cases you test only after production incidents.
Review technique: trace execution paths manually for non-trivial logic. Don't assume the code does what the agent said it does. Verify by reading, not by trusting the description.
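To make the tracing technique concrete, here is a minimal sketch in Python. The function name and scenario are hypothetical, invented for illustration: the first version compiles, handles the happy path, and looks correct, but a manual trace of the empty-input case exposes the gap the agent never considered.

```python
# Hypothetical agent output: plausible-looking code with a subtle edge-case gap.
def average_latency_ms(samples: list[float]) -> float:
    """Return the mean latency in milliseconds."""
    total = 0.0
    for sample in samples:
        total += sample
    return total / len(samples)  # fine for normal input, ZeroDivisionError for []

# Tracing the empty-input path by hand surfaces the case the agent skipped.
def average_latency_ms_checked(samples: list[float]) -> float:
    if not samples:  # the edge case the original never handled
        return 0.0
    return sum(samples) / len(samples)
```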
Security
Security failures in AI code follow recognizable patterns. Veracode's 2025 analysis found that 45% of AI-generated code samples contained security flaws, and failure rates for specific vulnerability classes vary widely:
| Vulnerability Type | AI Code Failure Rate |
|---|---|
| XSS prevention | 86% |
| Log injection prevention | 88% |
| SQL injection prevention | 20% |
Java code shows the highest overall security failure rate at 72%. AI code is 2.74x more likely to introduce XSS vulnerabilities than human code, 1.91x more likely to create insecure object references, and 1.88x more likely to mishandle passwords.
These aren't edge cases. Security failures represent how agents reason about adversarial inputs: they don't. Agents optimize for functionality, not for defense against malicious use.
Review technique: check every input handling path. User input, API responses, file contents, environment variables. Anything external needs validation. Assume the agent didn't add it.
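As one concrete illustration of checking an input handling path, the sketch below uses Python's standard sqlite3 module and a hypothetical `users` table. It contrasts a string-interpolated query an agent might plausibly produce with the parameterized form review should insist on.

```python
import sqlite3

# Hypothetical lookup as an agent might write it: user input spliced into SQL text.
def find_user_unsafe(conn: sqlite3.Connection, username: str):
    # Vulnerable: a username like "x' OR '1'='1" rewrites the query's meaning.
    return conn.execute(
        f"SELECT id, email FROM users WHERE username = '{username}'"
    ).fetchone()

# What review should require: parameterized queries keep data out of the SQL string.
def find_user(conn: sqlite3.Connection, username: str):
    return conn.execute(
        "SELECT id, email FROM users WHERE username = ?",
        (username,),
    ).fetchone()
```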
Maintainability
Maintainability issues accumulate invisibly. The 1.64x multiplier understates the problem because maintainability debt compounds over time.
The specifics:
- Readability issues appear 3x more often (inconsistent formatting, unclear naming, missing structure)
- GitClear found 2-3x higher duplication rates in AI-assisted codebases
- Formatting inconsistencies are 2.66x more common
- Naming problems are nearly 2x more frequent
AI code often resembles working pseudocode rather than production code. The logic functions, but the code doesn't communicate intent. Six months later, maintenance becomes archaeology.
Review technique: read for comprehension, not just correctness. If you can't immediately understand what code does and why, it needs revision regardless of whether it works.
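A hedged sketch of the gap between "working pseudocode" and code that communicates intent; the record fields (`ts`, `amt`, `fx`) and function names are hypothetical:

```python
# Hypothetical agent output: the logic works, but the names explain nothing.
def proc(d, t):
    r = []
    for x in d:
        if x["ts"] > t and not x.get("deleted"):
            r.append(x["amt"] * x.get("fx", 1.0))
    return r

# The same logic, rewritten so a reader six months from now doesn't need archaeology.
def converted_amounts_since(transactions: list[dict], cutoff_ts: float) -> list[float]:
    """Amounts for live transactions after cutoff_ts, converted via the optional 'fx' rate."""
    return [
        tx["amt"] * tx.get("fx", 1.0)
        for tx in transactions
        if tx["ts"] > cutoff_ts and not tx.get("deleted")
    ]
```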
Performance
Performance issues appear less frequently but with higher variance. The 1.42x overall multiplier hides extreme cases: excessive I/O operations appear 8x more often in AI PRs.
Agents generate code that works at development scale. They don't reason about production load, data volume growth, or resource constraints. A function that works fine with 100 records may fail catastrophically with 100,000.
Common performance issues:
- N+1 queries: database calls inside loops
- Unbounded operations: no pagination, no limits, no streaming
- Synchronous blocking: operations that should be async
- Resource leaks: connections, handles, memory not properly released
Review technique: identify scaling assumptions. What happens when inputs grow 10x? 100x? Agents rarely consider these questions. Reviewers must.
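The N+1 pattern is easy to spot once you ask the 10x question. A minimal sketch, assuming a sqlite3-style connection and a hypothetical `orders` table:

```python
# Hypothetical N+1 pattern: one database round trip per customer inside the loop.
def order_totals_n_plus_one(conn, customer_ids: list[int]) -> dict[int, float]:
    totals = {}
    for cid in customer_ids:  # N separate queries: fine for 100 customers, painful for 100,000
        row = conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer_id = ?",
            (cid,),
        ).fetchone()
        totals[cid] = row[0]
    return totals

# Batched alternative: one query regardless of how many customers are passed in.
def order_totals_batched(conn, customer_ids: list[int]) -> dict[int, float]:
    if not customer_ids:
        return {}
    placeholders = ",".join("?" * len(customer_ids))
    rows = conn.execute(
        f"SELECT customer_id, SUM(amount) FROM orders "
        f"WHERE customer_id IN ({placeholders}) GROUP BY customer_id",
        customer_ids,
    ).fetchall()
    totals = {cid: 0.0 for cid in customer_ids}  # customers with no orders stay at zero
    totals.update({cid: float(total) for cid, total in rows})
    return totals
```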
Calibrating review depth
Not all code warrants the same review intensity. The goal is appropriate scrutiny, not exhaustive analysis of every line.
High scrutiny code
Apply maximum review rigor to:
- Security-sensitive paths: authentication, authorization, input validation
- Data transformation logic: ETL, migrations, format conversions
- Financial calculations: anything involving money or quantities
- Integration boundaries: API calls, database operations, external services
- Error handling: recovery paths, fallbacks, failure modes
These categories combine high impact with high AI error rates. Time spent here prevents production incidents.
Standard scrutiny code
Apply normal review depth to:
- UI components: layout, styling, presentation logic
- Test code: unit tests, integration tests (but verify coverage claims)
- Configuration: non-sensitive settings, feature flags
- Documentation: comments, README updates
Errors here cause problems but rarely emergencies.
Lower scrutiny code
Reduced review is acceptable for:
- Boilerplate: standard patterns, scaffolding, repetitive structure
- Type definitions: interfaces, schemas (verify completeness, not correctness)
- Dependency updates: version bumps (rely on CI checks)
Agents excel at boilerplate. Detailed review of generated scaffolding wastes time better spent on logic.
The verification feedback loop
Boris Cherny, who leads Claude Code development, emphasizes verification as the critical quality lever: "Give Claude a way to verify its work. If Claude has that feedback loop, it will 2-3x the quality of the final result."
Verification isn't a post-hoc quality gate. It's a generation input. When agents can run tests, check types, or validate behavior, they iterate toward correctness. When verification is absent, agents produce what looks right rather than what is right.
This creates a practical principle: if you can't verify it, you shouldn't accept it.
- Can tests verify the logic? Accept after tests pass.
- Can type checking verify the contracts? Accept after types check.
- Can manual inspection verify the intent? Accept after reading confirms understanding.
Code that can't be verified through any mechanism requires extraordinary justification to merge. The cost of review rises in inverse proportion to verification coverage.
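One way to make the principle operational is a small local gate that runs the available verifiers before you accept an agent's change. A minimal sketch, assuming pytest and mypy are the project's test runner and type checker; the file name and check list are hypothetical and should match whatever verification your project actually has.

```python
# check_before_accept.py - run the project's verifiers; refuse acceptance if any fail.
import subprocess
import sys

CHECKS = [
    ["pytest", "-q"],  # can tests verify the logic?
    ["mypy", "."],     # can type checking verify the contracts?
]

def main() -> int:
    for cmd in CHECKS:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"Verification gate failed: {' '.join(cmd)}")
            return result.returncode
    print("All verification gates passed; manual review of intent is still yours.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```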
The next page examines specific quality issues in detail: the patterns that produce those 1.7x multiplier findings.