Adapted Review Checklists
Why traditional checklists fail
Standard code review checklists assume human authorship. They ask: did the developer follow conventions? Did they handle errors consistently? Did they document their intent?
These questions assume the code author understood the codebase, knew the conventions, and made deliberate choices. AI code breaks every one of those assumptions.
The previous three pages established the damage: 1.7x more issues, 75% more logic errors, 45% security failure rates, hallucinated packages, hardcoded secrets. Here's how to catch those problems systematically.
The verification inversion
Traditional review looks for mistakes in code that was written with intent. AI review looks for missing pieces in code that was generated without understanding.
The OpenSSF Security-Focused Guide for AI Code Assistant Instructions (September 2025) puts it bluntly: AI-generated code "lacks awareness of system behavior, threat surfaces, or compliance constraints." Review must verify these elements are present, not just confirm nothing looks obviously wrong.
Every checklist item becomes an existence check. Did the AI include input validation? Did it use the correct error handling pattern? Did it respect the architectural boundary? These are not mistakes you catch in passing. They are explicit checks that must happen for every AI-generated PR.
The baseline checklist
Before examining category-specific items, every AI-generated PR needs these baseline verifications.
Existence checks:
- All imported packages exist in official registries (npm, PyPI, Maven Central)
- All referenced APIs and methods exist in the stated versions
- All file paths and directory structures are valid
- Dependencies specified in package manifests match import statements
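Parts of these existence checks can be automated. The sketch below, assuming a Python project with a requirements.txt, queries the public PyPI JSON API to confirm each declared dependency actually exists in the registry; the file name and the simple line parsing are illustrative, not a complete requirements parser.

```python
"""Minimal sketch: verify that declared dependencies exist on PyPI.

Assumes a requirements.txt with one requirement per line; the parsing
rules here are illustrative and do not cover every pip syntax.
"""
import re
import sys
import urllib.error
import urllib.request

def package_exists_on_pypi(name: str) -> bool:
    """Return True if the PyPI JSON API knows about this package name."""
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        return False  # a 404 means the package is not in the registry

def declared_packages(path: str = "requirements.txt") -> list[str]:
    """Extract bare package names, ignoring comments, extras, and version pins."""
    names = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            match = re.match(r"^[A-Za-z0-9._-]+", line)
            if match:
                names.append(match.group(0))
    return names

if __name__ == "__main__":
    missing = [p for p in declared_packages() if not package_exists_on_pypi(p)]
    if missing:
        print("Possible hallucinated packages:", ", ".join(missing))
        sys.exit(1)
    print("All declared packages exist on PyPI.")
```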
Functional checks:
- Code compiles without warnings
- Existing tests pass without modification
- New tests verify actual behavior, not just happy paths
- Manual execution produces expected results
Artifact removal:
- No console.log, print(), or debug statements in production paths
- No commented-out code blocks
- No TODO comments marking incomplete requirements
- No unused imports, variables, or functions
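Several of the artifact checks are mechanical enough to script as a first pass. The sketch below is a minimal grep-style scan in Python that flags debug prints, TODO markers, and commented-out code; the regex patterns and file extensions are assumptions to adapt to your own stack, and linters cover the unused-import case better.

```python
"""Minimal sketch: flag common AI-generation artifacts before review.

The regex patterns and the .py/.js/.ts extensions are illustrative;
tune them to the languages and conventions of your own codebase.
"""
import pathlib
import re

ARTIFACT_PATTERNS = {
    "debug statement": re.compile(r"\bconsole\.log\(|\bprint\("),
    "TODO marker": re.compile(r"\bTODO\b"),
    "commented-out code": re.compile(r"^\s*(#|//)\s*\w+\s*\(.*\)\s*$"),
}

def scan(root: str = ".") -> list[tuple[str, int, str]]:
    """Walk the tree and report (file, line number, artifact type) findings."""
    findings = []
    for path in pathlib.Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in {".py", ".js", ".ts"}:
            continue
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            for label, pattern in ARTIFACT_PATTERNS.items():
                if pattern.search(line):
                    findings.append((str(path), lineno, label))
    return findings

if __name__ == "__main__":
    for file, lineno, label in scan():
        print(f"{file}:{lineno}: {label}")
```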
These checks catch the fingerprints described in Page 3. They take minutes to verify but prevent the embarrassment of shipping code that references packages that do not exist.
Security verification checklist
Module 5, Page 3 established a comprehensive security checklist. The items below adapt that checklist for PR review, adding checks specific to how AI fails.
Input validation (86% XSS failure rate, 88% log injection failure rate):
Every point where external data enters the system requires verification:
- HTTP parameters validated for type, length, and format
- Request bodies parsed with schema validation, not blind acceptance
- File uploads checked for type, size, and content
- URL parameters decoded and validated before use
- Headers treated as untrusted input
For each entry point, trace the data flow. Where does this input go? Is it rendered to users (XSS risk)? Written to logs (log injection risk)? Used in queries (SQL injection risk)? Passed to shell commands (command injection risk)?
AI generates code that processes data. It rarely generates code that validates data. Assume validation is missing until proven present.
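As a concrete picture of what "validated for type, length, and format" means, the sketch below hand-rolls validation for a hypothetical search endpoint; the field names and limits are assumptions, and a schema validation library would normally do this work. The reviewable property is that every field is checked before the data is used.

```python
"""Minimal sketch: explicit validation at an input boundary.

The parameter names, length limits, and allowed formats are hypothetical;
the point is that nothing passes through unchecked.
"""
import re

USERNAME_PATTERN = re.compile(r"^[A-Za-z0-9_-]{3,32}$")

def validate_search_request(params: dict) -> dict:
    """Return a cleaned copy of the request parameters or raise ValueError."""
    username = params.get("username", "")
    if not isinstance(username, str) or not USERNAME_PATTERN.fullmatch(username):
        raise ValueError("username must be 3-32 chars of letters, digits, _ or -")

    try:
        limit = int(params.get("limit", 20))
    except (TypeError, ValueError):
        raise ValueError("limit must be an integer")
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")

    return {"username": username, "limit": limit}

if __name__ == "__main__":
    print(validate_search_request({"username": "reviewer_1", "limit": "25"}))
```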
Output encoding:
- HTML output uses context-appropriate encoding (body, attribute, script, URL)
- JSON responses properly escaped
- Log entries sanitized before writing
- Error messages stripped of internal details
The 86% XSS failure rate comes from AI treating output as simple string operations. Context-dependent encoding requires understanding what the output will be used for. AI does not have this understanding.
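The standard library covers the simplest of these contexts. The sketch below shows HTML body encoding with html.escape, URL encoding with urllib.parse.quote, and a basic log sanitizer that strips newlines to blunt log injection; attribute and script contexts need their own encoders, and the helper names here are illustrative rather than library functions.

```python
"""Minimal sketch: context-aware output handling with the standard library.

html.escape covers HTML body and quoted-attribute contexts only; script
and URL contexts need dedicated encoders. sanitize_for_log is an
illustrative helper name, not a library function.
"""
import html
import urllib.parse

def render_greeting(user_supplied_name: str) -> str:
    # HTML body context: escape <, >, &, and quotes before rendering.
    return f"<p>Hello, {html.escape(user_supplied_name)}</p>"

def build_search_url(query: str) -> str:
    # URL context: percent-encode rather than HTML-escape.
    return "https://example.com/search?q=" + urllib.parse.quote(query)

def sanitize_for_log(value: str) -> str:
    # Strip newlines so attacker-controlled input cannot forge log entries.
    return value.replace("\r", "\\r").replace("\n", "\\n")

if __name__ == "__main__":
    print(render_greeting("<script>alert(1)</script>"))
    print(build_search_url("a&b c"))
    print("login failed for user=" + sanitize_for_log("bob\nadmin logged in"))
```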
Database operations (20% SQL injection failure rate):
- All queries use parameterized statements or ORM methods
- No string concatenation in query construction
- Dynamic table or column names validated against allowlists
- Stored procedures called with bound parameters
One in five AI-generated database queries is injectable. Every query in AI-generated code needs individual verification.
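The difference the first two items describe is visible in a few lines. The sketch below uses the stdlib sqlite3 driver for illustration; the table and column names are hypothetical, and placeholder syntax varies by driver (?, %s, or named parameters).

```python
"""Minimal sketch: parameterized query vs. string concatenation.

Uses the stdlib sqlite3 driver; table and column names are hypothetical,
and placeholder syntax differs across database drivers.
"""
import sqlite3

def find_user_unsafe(conn: sqlite3.Connection, email: str):
    # Injectable: attacker-controlled input becomes part of the SQL text.
    return conn.execute(
        "SELECT id, email FROM users WHERE email = '" + email + "'"
    ).fetchone()

def find_user_safe(conn: sqlite3.Connection, email: str):
    # Parameterized: the driver binds the value, never interpreting it as SQL.
    return conn.execute(
        "SELECT id, email FROM users WHERE email = ?", (email,)
    ).fetchone()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.execute("INSERT INTO users (email) VALUES (?)", ("reviewer@example.com",))
    print(find_user_safe(conn, "reviewer@example.com"))
```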
Authentication and authorization (1.88x improper password handling, 1.91x insecure direct object references):
- Password storage uses bcrypt, argon2, or scrypt with per-user salt
- No plaintext passwords in logs, errors, or responses
- Session tokens generated with cryptographic randomness
- Every data access verifies requesting user has permission
- Resource IDs validated against authorization context
AI generates authentication flows that work. It does not generate authentication flows that resist attack. Verify each authentication operation independently.
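To ground the password-storage items, the sketch below uses only the standard library: a per-user random salt, scrypt for hashing, and a constant-time comparison. The cost parameters are illustrative, and a maintained library such as bcrypt or argon2-cffi is the usual production choice.

```python
"""Minimal sketch: salted password hashing and constant-time verification.

Uses stdlib scrypt; the cost parameters (n, r, p) are illustrative and
should follow current guidance. Production code typically uses a
maintained library such as bcrypt or argon2-cffi instead.
"""
import hashlib
import hmac
import secrets

def hash_password(password: str) -> tuple[bytes, bytes]:
    salt = secrets.token_bytes(16)  # per-user salt, never reused
    digest = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return salt, digest

def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    candidate = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return hmac.compare_digest(candidate, expected)  # constant-time comparison

if __name__ == "__main__":
    salt, digest = hash_password("correct horse battery staple")
    print(verify_password("correct horse battery staple", salt, digest))  # True
    print(verify_password("guess", salt, digest))                         # False
```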
Secrets and credentials:
- No hardcoded API keys, passwords, or tokens
- Environment variables or secrets managers used for all credentials
- Connection strings parameterized
- Test credentials verified as non-functional in production
GitGuardian found that repositories using Copilot had a 40% higher rate of secret leakage. Pre-commit hooks catch some secrets. Review must catch what hooks miss.
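The review question for this section is simply "where does the credential come from?" The sketch below shows the shape reviewers should expect: credentials read from the environment, with a fast failure when a variable is absent. The variable names are hypothetical, and a secrets manager client would slot in the same way.

```python
"""Minimal sketch: credentials sourced from the environment, not the code.

The variable names are hypothetical. The reviewable property is that no
literal credential appears anywhere in source.
"""
import os

def require_env(name: str) -> str:
    """Fail fast at startup if a required credential is not configured."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required environment variable: {name}")
    return value

if __name__ == "__main__":
    database_url = require_env("DATABASE_URL")        # injected by the deploy platform
    payment_api_key = require_env("PAYMENT_API_KEY")  # or fetched from a secrets manager
    print("credentials loaded without any literals in source")
```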
Architectural alignment checklist
AI-generated code that passes security review may still violate architectural decisions. These violations work today and create technical debt next quarter.
Pattern conformance:
- Error handling matches project pattern (exceptions, error objects, Result types)
- Logging follows established format and severity conventions
- Configuration access uses approved mechanisms
- External service calls go through defined integration layers
Compare AI-generated code to adjacent code in the same module. If the patterns differ, one of them is wrong. Often the AI-generated code followed a pattern from training data that does not match this project.
Boundary respect:
- Database access goes through repository or data access layer
- Business logic does not appear in controller or API handlers
- Cross-module communication uses defined interfaces
- No direct access to internals of other components
AI optimizes for immediate functionality. Reaching through an abstraction layer to access data directly is faster than going through the proper channel. The code passes tests. The architecture erodes.
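A reviewer checking boundary respect is looking for a specific shape in the diff. The sketch below contrasts the two shapes, with a hypothetical UserRepository standing in for whatever data access layer the project actually defines.

```python
"""Minimal sketch: routing data access through a repository boundary.

UserRepository and the handlers are hypothetical stand-ins for the
project's own data access layer and controller conventions.
"""
import sqlite3
from typing import Optional

class UserRepository:
    """The only component allowed to know about SQL and the schema."""

    def __init__(self, conn: sqlite3.Connection):
        self._conn = conn

    def get_email(self, user_id: int) -> Optional[str]:
        row = self._conn.execute(
            "SELECT email FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        return row[0] if row else None

def profile_handler(repo: UserRepository, user_id: int) -> dict:
    # Boundary respected: the handler sees the repository, never the connection.
    return {"email": repo.get_email(user_id)}

def profile_handler_leaky(conn: sqlite3.Connection, user_id: int) -> dict:
    # Boundary violated: the handler reaches straight into the database.
    row = conn.execute("SELECT email FROM users WHERE id = ?", (user_id,)).fetchone()
    return {"email": row[0] if row else None}
```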
Dependency management:
- New dependencies align with approved library list
- No duplicate functionality already available in existing dependencies
- Version constraints match project standards
- License compatibility verified
AI suggests packages based on training data. Those suggestions may conflict with organizational policy on approved dependencies.
Naming and structure:
- File and directory placement follows project conventions
- Class, function, and variable names match project style
- Public API signatures consistent with existing patterns
- Test file organization mirrors source organization
Traditional review covered these for human code too. For AI code, they require explicit verification because the AI had no access to project conventions unless they were in CLAUDE.md or agents.md.
Quality gates for complexity
AI-generated code tends toward specific complexity patterns.
Duplication checks (8x increase in duplicated blocks):
- No functions that closely match existing functions elsewhere
- No constants defined in multiple locations
- No error handling logic repeated rather than centralized
- No transformation patterns that could be extracted
Ask: does similar code exist elsewhere? AI cannot see the whole codebase. Human reviewers can.
Abstraction appropriateness:
- No premature abstraction for single-use cases
- No excessive inheritance hierarchies
- No configuration options for functionality that should be hardcoded
- No overly generic solutions to specific problems
Ox Security found "over-specification of implementation details" in 80-90% of AI-generated codebases. AI creates abstraction where none is needed and skips abstraction where it would help.
Cyclomatic complexity:
- Functions remain under complexity thresholds
- Deeply nested conditionals refactored into guard clauses
- Large functions decomposed into named helper functions
- Switch statements limited to appropriate uses
AI generates working code without considering maintainability. The reviewer must ask: can future developers understand and modify this code?
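A before-and-after makes the guard-clause item concrete. The sketch below flattens a nested conditional into guard clauses; the discount rules are hypothetical, and the transformation itself is the point.

```python
"""Minimal sketch: nested conditionals refactored into guard clauses.

The discount rules are hypothetical; both functions return the same results.
"""

def discount_nested(user, order_total):
    # Typical AI-generated shape: every condition adds a level of nesting.
    if user is not None:
        if user.get("active"):
            if order_total > 100:
                return 0.10
            else:
                return 0.0
        else:
            return 0.0
    else:
        return 0.0

def discount_guarded(user, order_total):
    # Guard clauses: handle the exceptional cases first, then the real logic.
    if user is None or not user.get("active"):
        return 0.0
    if order_total <= 100:
        return 0.0
    return 0.10
```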
Business logic validation
AI cannot know business rules. It generates plausible implementations that may be wrong.
Requirement alignment:
- Implementation matches acceptance criteria, not adjacent functionality
- Edge cases specified in requirements are handled
- Default behaviors align with product intent
- Error states handled per business specification
The question is not "does this code work?" but "does this code do what was actually needed?" AI implements what was literally requested. Requirements often omit what was implicitly assumed.
Domain rule verification:
- Business calculations use correct formulas
- State transitions follow valid paths
- Time-based logic handles timezones and edge cases
- Currency and numeric precision appropriate for domain
AI copies patterns. Business rules require understanding. Every calculation, every state machine, every domain-specific operation needs human verification.
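Time-based logic is a reliable place to find plausible-but-wrong AI code. The sketch below shows the timezone-aware form a reviewer should expect, using the stdlib zoneinfo module; the business rule (an offer expiring at midnight in the customer's timezone) is hypothetical.

```python
"""Minimal sketch: timezone-aware expiry check.

The business rule (offers expire at local midnight) is hypothetical; the
reviewable property is that every datetime in the comparison carries a zone.
"""
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def offer_expired(expiry_local_midnight: datetime, customer_tz: str) -> bool:
    # Naive datetimes are a review red flag: attach the customer's zone explicitly.
    expiry = expiry_local_midnight.replace(tzinfo=ZoneInfo(customer_tz))
    now = datetime.now(timezone.utc)
    return now >= expiry

if __name__ == "__main__":
    midnight = datetime(2025, 1, 1, 0, 0)
    print(offer_expired(midnight, "America/New_York"))
```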
Integration points:
- External API calls match documented contracts
- Webhook payloads conform to expected schemas
- Event publishing follows established patterns
- Data formats compatible with downstream consumers
AI generates integration code based on patterns from training data. Those patterns may not match your specific integration requirements.
The tiered approach
Not every checklist item applies to every PR. Allocate review effort based on what the code touches.
Tier 1 - Full verification (15-30 minutes): Apply all checklist sections for code touching:
- Authentication and session management
- Authorization and access control
- Payment processing
- Personal data handling
- External-facing API endpoints
- Cryptographic operations
Tier 2 - Standard verification (10-15 minutes): Apply baseline, security, and architectural checklists for:
- Internal service communication
- Business logic without external input
- Administrative interfaces
- Database operations on non-sensitive data
Tier 3 - Automated plus spot-check (5 minutes): Rely on automated tooling with spot verification for:
- Static content rendering
- Test code and fixtures
- Internal utilities
- Documentation generation
The tiers reflect risk, not effort. AI failure rates are highest in Tier 1 code, and failures there cost the most.
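Tier assignment can be automated as a first pass. The sketch below maps the files a PR touches to a tier using path patterns; the patterns and directory names are assumptions about a hypothetical project layout, and a human can always escalate the result.

```python
"""Minimal sketch: assigning a review tier from the files a PR touches.

The glob patterns are assumptions about a hypothetical project layout;
the highest-risk match wins so mixed PRs get the stricter tier.
"""
from fnmatch import fnmatch

TIER_PATTERNS = {
    1: ["*auth*", "*payment*", "*public_api*", "*crypto*"],
    2: ["*services*", "*admin*", "*models*"],
    3: ["*tests*", "*docs*", "*static*", "*fixtures*"],
}

def review_tier(changed_files: list[str]) -> int:
    for tier in (1, 2, 3):  # check highest risk first
        for path in changed_files:
            if any(fnmatch(path, pattern) for pattern in TIER_PATTERNS[tier]):
                return tier
    return 2  # unknown paths default to standard verification

if __name__ == "__main__":
    print(review_tier(["src/services/billing/payment_gateway.py"]))  # 1
    print(review_tier(["docs/setup.md", "tests/test_helpers.py"]))   # 3
```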
Checklist enforcement
Checklists work when they are used consistently. A checklist used sometimes provides less value than a shorter checklist used always.
PR templates: Require submitters to complete a checklist before review begins. The template prompts self-review. Incomplete checklists signal insufficient preparation.
Review tooling: Integrate checklists into review tools (GitHub PR templates, GitLab merge request descriptions). Make completion visible in the review interface.
Quality gates: Block merge until designated checklist items are marked complete. Automate verification where possible (SAST for security items, linting for style items).
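The self-review step can also be enforced mechanically. The sketch below counts unchecked items in a PR description written with markdown task-list syntax and fails the build until they are complete; how the description reaches the script (here, a PR_BODY environment variable set by CI) is an assumption about your tooling, not a fixed interface.

```python
"""Minimal sketch: block merge while the PR checklist has unchecked items.

Assumes the PR description uses markdown task-list syntax and that CI
exposes the description text via a PR_BODY environment variable; both
are assumptions about your tooling.
"""
import os
import re
import sys

UNCHECKED = re.compile(r"^\s*[-*]\s+\[ \]", re.MULTILINE)

def unchecked_items(pr_body: str) -> int:
    """Count task-list items that are still unchecked."""
    return len(UNCHECKED.findall(pr_body))

if __name__ == "__main__":
    remaining = unchecked_items(os.environ.get("PR_BODY", ""))
    if remaining:
        print(f"{remaining} checklist item(s) still unchecked; complete self-review first.")
        sys.exit(1)
    print("Checklist complete.")
```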
Review rotation: Assign reviewers with domain expertise to PRs touching their domains. Security-sensitive code goes to security-aware reviewers. Architecture changes go to architects.
What checklists cannot catch
Checklists verify specific items. They do not replace judgment.
Novel vulnerabilities: Checklists codify known failure modes. New attack vectors require thinking beyond the checklist.
Subtle logic errors: A checklist confirms error handling exists. It does not confirm the error handling is correct for this specific case.
Performance at scale: A checklist asks about database queries in loops. It does not simulate production load.
Long-term maintainability: A checklist flags obvious complexity. It does not predict which code will become a maintenance burden in six months.
Checklists provide scaffolding. They ensure baseline verification happens consistently. Senior reviewers bring judgment that no checklist can encode.
Making the checklists stick
Using these checklists consistently creates a feedback loop.
Developers learn what reviewers check. They self-review against the checklist before submitting. AI-generated code gets iterated before reaching review. Review time drops because obvious issues are caught earlier.
The checklist also becomes a teaching tool. New team members learn what matters by seeing what gets checked. Over time, the checklist encodes organizational knowledge about AI failure modes.
Page 5 examines how to use AI tools themselves as review assistants, augmenting human review without creating the circular validation in which the same AI that wrote the code approves its own work.