Adapted Review Checklists
Why traditional checklists fail
Standard code review checklists assume human authorship. They ask: did the developer follow conventions? Did they handle errors consistently? Did they document their intent?
These questions assume the code author understood the codebase, knew the conventions, and made deliberate choices. AI code breaks every one of those assumptions.
The previous three pages established the damage: 1.7x more issues, 75% more logic errors, 45% security failure rates, hallucinated packages, hardcoded secrets. Here's how to catch those problems systematically.
The verification inversion
Traditional review looks for mistakes in code that was written with intent. AI review looks for missing pieces in code that was generated without understanding.
The OpenSSF Security-Focused Guide for AI Code Assistant Instructions (September 2025) puts it bluntly: AI-generated code "lacks awareness of system behavior, threat surfaces, or compliance constraints." Review must verify these elements are present, not just confirm nothing looks obviously wrong.
Every checklist item becomes an existence check. Did the AI include input validation? Did it use the correct error handling pattern? Did it respect the architectural boundary? These are not mistakes you catch in passing. They are explicit checks that must happen for every AI-generated PR.
The baseline checklist
Before examining category-specific items, every AI-generated PR needs these baseline verifications.
Existence checks:
- All imported packages exist in official registries (npm, PyPI, Maven Central)
- All referenced APIs and methods exist in the stated versions
- All file paths and directory structures are valid
- Dependencies specified in package manifests match import statements
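Parts of these existence checks can be automated. The sketch below, assuming a Python project with a requirements.txt, queries the public PyPI JSON API to confirm each declared dependency actually exists in the registry; the file name and the simple line parsing are illustrative, not a complete requirements parser.

```python
"""Minimal sketch: verify that declared dependencies exist on PyPI.

Assumes a requirements.txt with one requirement per line; the parsing
rules here are illustrative and do not cover every pip syntax.
"""
import re
import sys
import urllib.error
import urllib.request

def package_exists_on_pypi(name: str) -> bool:
    """Return True if the PyPI JSON API knows about this package name."""
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        return False  # a 404 means the package is not in the registry

def declared_packages(path: str = "requirements.txt") -> list[str]:
    """Extract bare package names, ignoring comments, extras, and version pins."""
    names = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            match = re.match(r"^[A-Za-z0-9._-]+", line)
            if match:
                names.append(match.group(0))
    return names

if __name__ == "__main__":
    missing = [p for p in declared_packages() if not package_exists_on_pypi(p)]
    if missing:
        print("Possible hallucinated packages:", ", ".join(missing))
        sys.exit(1)
    print("All declared packages exist on PyPI.")
```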
Functional checks:
- Code compiles without warnings
- Existing tests pass without modification
- New tests verify actual behavior, not just happy paths
- Manual execution produces expected results
Artifact removal:
- No console.log, print(), or debug statements in production paths
- No commented-out code blocks
- No TODO comments marking incomplete requirements
- No unused imports, variables, or functions
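Several of the artifact checks are mechanical enough to script as a first pass. The sketch below is a minimal grep-style scan in Python that flags debug prints, TODO markers, and commented-out code; the regex patterns and file extensions are assumptions to adapt to your own stack, and linters cover the unused-import case better.

```python
"""Minimal sketch: flag common AI-generation artifacts before review.

The regex patterns and the .py/.js/.ts extensions are illustrative;
tune them to the languages and conventions of your own codebase.
"""
import pathlib
import re

ARTIFACT_PATTERNS = {
    "debug statement": re.compile(r"\bconsole\.log\(|\bprint\("),
    "TODO marker": re.compile(r"\bTODO\b"),
    "commented-out code": re.compile(r"^\s*(#|//)\s*\w+\s*\(.*\)\s*$"),
}

def scan(root: str = ".") -> list[tuple[str, int, str]]:
    """Walk the tree and report (file, line number, artifact type) findings."""
    findings = []
    for path in pathlib.Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in {".py", ".js", ".ts"}:
            continue
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            for label, pattern in ARTIFACT_PATTERNS.items():
                if pattern.search(line):
                    findings.append((str(path), lineno, label))
    return findings

if __name__ == "__main__":
    for file, lineno, label in scan():
        print(f"{file}:{lineno}: {label}")
```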
These checks catch the fingerprints described in Page 3. They take minutes to verify but prevent the embarrassment of shipping code that references packages that do not exist.
Security verification checklist
Module 5, Page 3 established a comprehensive security checklist. The items below adapt that checklist for PR review, adding checks specific to how AI fails.
Input validation (86% XSS failure rate, 88% log injection failure rate):
Every point where external data enters the system requires verification:
- HTTP parameters validated for type, length, and format
- Request bodies parsed with schema validation, not blind acceptance
- File uploads checked for type, size, and content
- URL parameters decoded and validated before use
- Headers treated as untrusted input
For each entry point, trace the data flow. Where does this input go? Is it rendered to users (XSS risk)? Written to logs (log injection risk)? Used in queries (SQL injection risk)? Passed to shell commands (command injection risk)?
AI generates code that processes data. It rarely generates code that validates data. Assume validation is missing until proven present.
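As a concrete picture of what "validated for type, length, and format" means, the sketch below hand-rolls validation for a hypothetical search endpoint; the field names and limits are assumptions, and a schema validation library would normally do this work. The reviewable property is that every field is checked before the data is used.

```python
"""Minimal sketch: explicit validation at an input boundary.

The parameter names, length limits, and allowed formats are hypothetical;
the point is that nothing passes through unchecked.
"""
import re

USERNAME_PATTERN = re.compile(r"^[A-Za-z0-9_-]{3,32}$")

def validate_search_request(params: dict) -> dict:
    """Return a cleaned copy of the request parameters or raise ValueError."""
    username = params.get("username", "")
    if not isinstance(username, str) or not USERNAME_PATTERN.fullmatch(username):
        raise ValueError("username must be 3-32 chars of letters, digits, _ or -")

    try:
        limit = int(params.get("limit", 20))
    except (TypeError, ValueError):
        raise ValueError("limit must be an integer")
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")

    return {"username": username, "limit": limit}

if __name__ == "__main__":
    print(validate_search_request({"username": "reviewer_1", "limit": "25"}))
```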
Output encoding:
- HTML output uses context-appropriate encoding (body, attribute, script, URL)
- JSON responses properly escaped
- Log entries sanitized before writing
- Error messages stripped of internal details
The 86% XSS failure rate comes from AI treating output as simple string operations. Context-dependent encoding requires understanding what the output will be used for. AI does not have this understanding.
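The standard library covers the simplest of these contexts. The sketch below shows HTML body encoding with html.escape, URL encoding with urllib.parse.quote, and a basic log sanitizer that strips newlines to blunt log injection; attribute and script contexts need their own encoders, and the helper names here are illustrative rather than library functions.

```python
"""Minimal sketch: context-aware output handling with the standard library.

html.escape covers HTML body and quoted-attribute contexts only; script
and URL contexts need dedicated encoders. sanitize_for_log is an
illustrative helper name, not a library function.
"""
import html
import urllib.parse

def render_greeting(user_supplied_name: str) -> str:
    # HTML body context: escape <, >, &, and quotes before rendering.
    return f"<p>Hello, {html.escape(user_supplied_name)}</p>"

def build_search_url(query: str) -> str:
    # URL context: percent-encode rather than HTML-escape.
    return "https://example.com/search?q=" + urllib.parse.quote(query)

def sanitize_for_log(value: str) -> str:
    # Strip newlines so attacker-controlled input cannot forge log entries.
    return value.replace("\r", "\\r").replace("\n", "\\n")

if __name__ == "__main__":
    print(render_greeting("<script>alert(1)</script>"))
    print(build_search_url("a&b c"))
    print("login failed for user=" + sanitize_for_log("bob\nadmin logged in"))
```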
Database operations (20% SQL injection failure rate):
- All queries use parameterized statements or ORM methods
- No string concatenation in query construction
- Dynamic table or column names validated against allowlists
- Stored procedures called with bound parameters
One in five AI-generated database queries is injectable. Every query in AI-generated code needs individual verification.
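The difference the first two items describe is visible in a few lines. The sketch below uses the stdlib sqlite3 driver for illustration; the table and column names are hypothetical, and placeholder syntax varies by driver (?, %s, or named parameters).

```python
"""Minimal sketch: parameterized query vs. string concatenation.

Uses the stdlib sqlite3 driver; table and column names are hypothetical,
and placeholder syntax differs across database drivers.
"""
import sqlite3

def find_user_unsafe(conn: sqlite3.Connection, email: str):
    # Injectable: attacker-controlled input becomes part of the SQL text.
    return conn.execute(
        "SELECT id, email FROM users WHERE email = '" + email + "'"
    ).fetchone()

def find_user_safe(conn: sqlite3.Connection, email: str):
    # Parameterized: the driver binds the value, never interpreting it as SQL.
    return conn.execute(
        "SELECT id, email FROM users WHERE email = ?", (email,)
    ).fetchone()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.execute("INSERT INTO users (email) VALUES (?)", ("reviewer@example.com",))
    print(find_user_safe(conn, "reviewer@example.com"))
```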
Authentication and authorization (1.88x improper password handling, 1.91x insecure direct object references):
- Password storage uses bcrypt, argon2, or scrypt with per-user salt
- No plaintext passwords in logs, errors, or responses
- Session tokens generated with cryptographic randomness
- Every data access verifies requesting user has permission
- Resource IDs validated against authorization context
AI generates authentication flows that work. It does not generate authentication flows that resist attack. Verify each authentication operation independently.
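To ground the password-storage items, the sketch below uses only the standard library: a per-user random salt, scrypt for hashing, and a constant-time comparison. The cost parameters are illustrative, and a maintained library such as bcrypt or argon2-cffi is the usual production choice.

```python
"""Minimal sketch: salted password hashing and constant-time verification.

Uses stdlib scrypt; the cost parameters (n, r, p) are illustrative and
should follow current guidance. Production code typically uses a
maintained library such as bcrypt or argon2-cffi instead.
"""
import hashlib
import hmac
import secrets

def hash_password(password: str) -> tuple[bytes, bytes]:
    salt = secrets.token_bytes(16)  # per-user salt, never reused
    digest = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return salt, digest

def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    candidate = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return hmac.compare_digest(candidate, expected)  # constant-time comparison

if __name__ == "__main__":
    salt, digest = hash_password("correct horse battery staple")
    print(verify_password("correct horse battery staple", salt, digest))  # True
    print(verify_password("guess", salt, digest))                         # False
```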
Secrets and credentials:
- No hardcoded API keys, passwords, or tokens
- Environment variables or secrets managers used for all credentials
- Connection strings parameterized
- Test credentials verified as non-functional in production
GitGuardian found that repositories using Copilot had a 40% higher rate of secret leakage. Pre-commit hooks catch some secrets. Review must catch what hooks miss.
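The review question for this section is simply "where does the credential come from?" The sketch below shows the shape reviewers should expect: credentials read from the environment, with a fast failure when a variable is absent. The variable names are hypothetical, and a secrets manager client would slot in the same way.

```python
"""Minimal sketch: credentials sourced from the environment, not the code.

The variable names are hypothetical. The reviewable property is that no
literal credential appears anywhere in source.
"""
import os

def require_env(name: str) -> str:
    """Fail fast at startup if a required credential is not configured."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required environment variable: {name}")
    return value

if __name__ == "__main__":
    database_url = require_env("DATABASE_URL")        # injected by the deploy platform
    payment_api_key = require_env("PAYMENT_API_KEY")  # or fetched from a secrets manager
    print("credentials loaded without any literals in source")
```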
Architectural alignment checklist
AI-generated code that passes security review may still violate architectural decisions. These violations work today and create technical debt next quarter.
Pattern conformance:
- Error handling matches project pattern (exceptions, error objects, Result types)
- Logging follows established format and severity conventions
- Configuration access uses approved mechanisms
- External service calls go through defined integration layers
Compare AI-generated code to adjacent code in the same module. If the patterns differ, one of them is wrong. Often the AI-generated code followed a pattern from training data that does not match this project.
Boundary respect:
- Database access goes through repository or data access layer
- Business logic does not appear in controller or API handlers
- Cross-module communication uses defined interfaces
- No direct access to internals of other components
AI optimizes for immediate functionality. Reaching through an abstraction layer to access data directly is faster than going through the proper channel. The code passes tests. The architecture erodes.
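A reviewer checking boundary respect is looking for a specific shape in the diff. The sketch below contrasts the two shapes, with a hypothetical UserRepository standing in for whatever data access layer the project actually defines.

```python
"""Minimal sketch: routing data access through a repository boundary.

UserRepository and the handlers are hypothetical stand-ins for the
project's own data access layer and controller conventions.
"""
import sqlite3
from typing import Optional

class UserRepository:
    """The only component allowed to know about SQL and the schema."""

    def __init__(self, conn: sqlite3.Connection):
        self._conn = conn

    def get_email(self, user_id: int) -> Optional[str]:
        row = self._conn.execute(
            "SELECT email FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        return row[0] if row else None

def profile_handler(repo: UserRepository, user_id: int) -> dict:
    # Boundary respected: the handler sees the repository, never the connection.
    return {"email": repo.get_email(user_id)}

def profile_handler_leaky(conn: sqlite3.Connection, user_id: int) -> dict:
    # Boundary violated: the handler reaches straight into the database.
    row = conn.execute("SELECT email FROM users WHERE id = ?", (user_id,)).fetchone()
    return {"email": row[0] if row else None}
```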
Dependency management:
- New dependencies align with approved library list
- No duplicate functionality already available in existing dependencies
- Version constraints match project standards
- License compatibility verified
AI suggests packages based on training data. Those suggestions may conflict with organizational policy on approved dependencies.
Naming and structure:
- File and directory placement follows project conventions
- Class, function, and variable names match project style
- Public API signatures consistent with existing patterns
- Test file organization mirrors source organization
Traditional review covered these for human code too. For AI code, they require explicit verification because the AI had no access to project conventions unless they were in CLAUDE.md or agents.md.
Quality gates for complexity
AI-generated code tends toward specific complexity patterns.
Duplication checks (8x increase in duplicated blocks):
- No functions that closely match existing functions elsewhere
- No constants defined in multiple locations
- No error handling logic repeated rather than centralized
- No transformation patterns that could be extracted
Ask: does similar code exist elsewhere? AI cannot see the whole codebase. Human reviewers can.
Abstraction appropriateness:
- No premature abstraction for single-use cases
- No excessive inheritance hierarchies
- No configuration options for functionality that should be hardcoded
- No overly generic solutions to specific problems
Ox Security found "over-specification of implementation details" in 80-90% of AI-generated codebases. AI creates abstraction where none is needed and skips abstraction where it would help.
Cyclomatic complexity:
- Functions remain under complexity thresholds
- Deeply nested conditionals refactored into guard clauses
- Large functions decomposed into named helper functions
- Switch statements limited to appropriate uses
AI generates working code without considering maintainability. The reviewer must ask: can future developers understand and modify this code?
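A before-and-after makes the guard-clause item concrete. The sketch below flattens a nested conditional into guard clauses; the discount rules are hypothetical, and the transformation itself is the point.

```python
"""Minimal sketch: nested conditionals refactored into guard clauses.

The discount rules are hypothetical; both functions return the same results.
"""

def discount_nested(user, order_total):
    # Typical AI-generated shape: every condition adds a level of nesting.
    if user is not None:
        if user.get("active"):
            if order_total > 100:
                return 0.10
            else:
                return 0.0
        else:
            return 0.0
    else:
        return 0.0

def discount_guarded(user, order_total):
    # Guard clauses: handle the exceptional cases first, then the real logic.
    if user is None or not user.get("active"):
        return 0.0
    if order_total <= 100:
        return 0.0
    return 0.10
```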
Business logic validation
AI cannot know business rules. It generates plausible implementations that may be wrong.
Requirement alignment:
- Implementation matches acceptance criteria, not adjacent functionality
- Edge cases specified in requirements are handled
- Default behaviors align with product intent
- Error states handled per business specification
The question is not "does this code work?" but "does this code do what was actually needed?" AI implements what was literally requested. Requirements often omit what was implicitly assumed.
Domain rule verification:
- Business calculations use correct formulas
- State transitions follow valid paths
- Time-based logic handles timezones and edge cases
- Currency and numeric precision appropriate for domain
AI copies patterns. Business rules require understanding. Every calculation, every state machine, every domain-specific operation needs human verification.
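Time-based logic is a reliable place to find plausible-but-wrong AI code. The sketch below shows the timezone-aware form a reviewer should expect, using the stdlib zoneinfo module; the business rule (an offer expiring at midnight in the customer's timezone) is hypothetical.

```python
"""Minimal sketch: timezone-aware expiry check.

The business rule (offers expire at local midnight) is hypothetical; the
reviewable property is that every datetime in the comparison carries a zone.
"""
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def offer_expired(expiry_local_midnight: datetime, customer_tz: str) -> bool:
    # Naive datetimes are a review red flag: attach the customer's zone explicitly.
    expiry = expiry_local_midnight.replace(tzinfo=ZoneInfo(customer_tz))
    now = datetime.now(timezone.utc)
    return now >= expiry

if __name__ == "__main__":
    midnight = datetime(2025, 1, 1, 0, 0)
    print(offer_expired(midnight, "America/New_York"))
```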
Integration points:
- External API calls match documented contracts
- Webhook payloads conform to expected schemas
- Event publishing follows established patterns
- Data formats compatible with downstream consumers
AI generates integration code based on patterns from training data. Those patterns may not match your specific integration requirements.
The tiered approach
Not every checklist item applies to every PR. Allocate review effort based on what the code touches.
Tier 1 - Full verification (15-30 minutes): Apply all checklist sections for code touching:
- Authentication and session management
- Authorization and access control
- Payment processing
- Personal data handling
- External-facing API endpoints
- Cryptographic operations
Tier 2 - Standard verification (10-15 minutes): Apply baseline, security, and architectural checklists for:
- Internal service communication
- Business logic without external input
- Administrative interfaces
- Database operations on non-sensitive data
Tier 3 - Automated plus spot-check (5 minutes): Rely on automated tooling with spot verification for:
- Static content rendering
- Test code and fixtures
- Internal utilities
- Documentation generation
The tiers reflect risk, not effort. AI failure rates are highest in Tier 1 code, and failures there cost the most.
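Tier assignment can be automated as a first pass. The sketch below maps the files a PR touches to a tier using path patterns; the patterns and directory names are assumptions about a hypothetical project layout, and a human can always escalate the result.

```python
"""Minimal sketch: assigning a review tier from the files a PR touches.

The glob patterns are assumptions about a hypothetical project layout;
the highest-risk match wins so mixed PRs get the stricter tier.
"""
from fnmatch import fnmatch

TIER_PATTERNS = {
    1: ["*auth*", "*payment*", "*public_api*", "*crypto*"],
    2: ["*services*", "*admin*", "*models*"],
    3: ["*tests*", "*docs*", "*static*", "*fixtures*"],
}

def review_tier(changed_files: list[str]) -> int:
    for tier in (1, 2, 3):  # check highest risk first
        for path in changed_files:
            if any(fnmatch(path, pattern) for pattern in TIER_PATTERNS[tier]):
                return tier
    return 2  # unknown paths default to standard verification

if __name__ == "__main__":
    print(review_tier(["src/services/billing/payment_gateway.py"]))  # 1
    print(review_tier(["docs/setup.md", "tests/test_helpers.py"]))   # 3
```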
Checklist enforcement
Checklists work when they are used consistently. A checklist used sometimes provides less value than a shorter checklist used always.
PR templates: Require submitters to complete a checklist before review begins. The template prompts self-review. Incomplete checklists signal insufficient preparation.
Review tooling: Integrate checklists into review tools (GitHub PR templates, GitLab merge request descriptions). Make completion visible in the review interface.
Quality gates: Block merge until designated checklist items are marked complete. Automate verification where possible (SAST for security items, linting for style items).
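The self-review step can also be enforced mechanically. The sketch below counts unchecked items in a PR description written with markdown task-list syntax and fails the build until they are complete; how the description reaches the script (here, a PR_BODY environment variable set by CI) is an assumption about your tooling, not a fixed interface.

```python
"""Minimal sketch: block merge while the PR checklist has unchecked items.

Assumes the PR description uses markdown task-list syntax and that CI
exposes the description text via a PR_BODY environment variable; both
are assumptions about your tooling.
"""
import os
import re
import sys

UNCHECKED = re.compile(r"^\s*[-*]\s+\[ \]", re.MULTILINE)

def unchecked_items(pr_body: str) -> int:
    """Count task-list items that are still unchecked."""
    return len(UNCHECKED.findall(pr_body))

if __name__ == "__main__":
    remaining = unchecked_items(os.environ.get("PR_BODY", ""))
    if remaining:
        print(f"{remaining} checklist item(s) still unchecked; complete self-review first.")
        sys.exit(1)
    print("Checklist complete.")
```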
Review rotation: Assign reviewers with domain expertise to PRs touching their domains. Security-sensitive code goes to security-aware reviewers. Architecture changes go to architects.
What checklists cannot catch
Checklists verify specific items. They do not replace judgment.
Novel vulnerabilities: Checklists codify known failure modes. New attack vectors require thinking beyond the checklist.
Subtle logic errors: A checklist confirms error handling exists. It does not confirm the error handling is correct for this specific case.
Performance at scale: A checklist asks about database queries in loops. It does not simulate production load.
Long-term maintainability: A checklist flags obvious complexity. It does not predict which code will become a maintenance burden in six months.
Checklists provide scaffolding. They ensure baseline verification happens consistently. Senior reviewers bring judgment that no checklist can encode.
Making the checklists stick
Using these checklists consistently creates a feedback loop.
Developers learn what reviewers check. They self-review against the checklist before submitting. AI-generated code gets iterated before reaching review. Review time drops because obvious issues are caught earlier.
The checklist also becomes a teaching tool. New team members learn what matters by seeing what gets checked. Over time, the checklist encodes organizational knowledge about AI failure modes.
Page 5 examines how to use AI tools themselves as review assistants, augmenting human review without creating the circular validation in which the same AI that wrote the code approves its own work.