Red Flags Specific to Agent Output
Patterns that give AI away
Beyond the broad categories of logic errors and security gaps, certain problems appear disproportionately in AI-generated code. These are the fingerprints. Learn to spot them, and you can identify AI-written code at a glance - even before checking git blame.
Package hallucinations
AI models recommend packages that do not exist. A USENIX Security 2025 study analyzing 576,000 code samples across 16 LLMs found that 19.7% of all recommended packages were fabricated. That is one in five package recommendations pointing to nothing. The study identified 205,474 unique hallucinated package names.
Commercial models hallucinate less frequently (5.2%) than open-source models (21.7%). GPT-4 Turbo achieved the lowest rate at 3.59%. CodeLlama exceeded 33%. Gemini Pro reached 64.5%.
The composition breaks down like this:
| Pattern | Percentage |
|---|---|
| Inspired by real package names | 38% |
| Resulting from typos | 13% |
| Completely fabricated | 51% |
Here is what makes this dangerous: 43% of hallucinated package names repeated consistently across multiple queries. The same fake package appears over and over. An attacker who identifies these predictable hallucinations can publish malicious packages under those names and wait. This attack vector, dubbed "slopsquatting," is already being exploited.
Real-world exploitation
The PhantomRaven campaign contaminated 126 npm packages, accumulating 86,000 installations. A separate attack published 11 malicious packages targeting Solana developers, with names like solana-test, solana-token, and solana-charts.
Security researcher Bar Lanyado demonstrated the risk by publishing an empty package called huggingface-cli. It received over 30,000 downloads in three months, and Alibaba repositories referenced it. The package did not exist before Lanyado created it - LLMs simply hallucinated the name, and developers installed whatever they found.
What reviewers should check
Every dependency added by AI requires verification.
Before running `npm install` or `pip install` on AI-suggested packages:
- Verify the package exists on the official registry
- Check the package publication date and download count
- Confirm the maintainer identity matches expectations
- Read the package description and source if unfamiliar
CI pipelines should include lockfile verification and hash pinning. Phantom packages should never reach production.
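Reviewers can automate the first two checks. Here is a minimal sketch in JavaScript that queries the public npm registry metadata endpoint for an AI-suggested name; the function name and the signals it surfaces are illustrative, not a complete vetting process:

```javascript
// Sketch: confirm an AI-suggested package exists on npm and surface basic
// provenance signals before anyone runs `npm install`. Illustrative only -
// it does not replace reading the package source or checking downloads.
async function checkNpmPackage(name) {
  const res = await fetch(`https://registry.npmjs.org/${encodeURIComponent(name)}`);
  if (res.status === 404) {
    return { exists: false, verdict: 'Possible hallucination: package not found' };
  }
  const meta = await res.json();
  return {
    exists: true,
    created: meta.time?.created,                        // first publication date
    maintainers: (meta.maintainers || []).map((m) => m.name),
    verdict: 'Exists: review publication date, maintainers, and source manually',
  };
}
```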
Hardcoded secrets
AI models reproduce secrets from training data. Truffle Security found 12,000 live API keys and passwords in Common Crawl archives used to train LLMs. When developers prompt for database connection code, authentication examples, or API integration, models may generate code containing actual credentials from those archives.
GitGuardian's State of Secrets Sprawl 2025 report found that repositories using GitHub Copilot had a 40% higher secret leakage rate compared to baseline. Among 8,127 analyzed Copilot suggestions, researchers extracted 2,702 valid secrets - roughly one valid secret for every three suggestions.
AI models cannot distinguish security-sensitive constants from benign string values. They reproduce patterns from training data, including insecure patterns. A study of five major LLMs found CWE-798 (hardcoded credentials) in code from every model tested: Claude Sonnet 4 produced 20 instances, GPT-4o produced 20 instances, and Llama 3.2 90B produced 29 instances.
Common hardcoded patterns
| Pattern | Risk |
|---|---|
| Database connection strings with credentials | Direct database access |
| API keys in client-side code | Key theft and abuse |
| Test credentials that work in production | Unauthorized access |
| Placeholder values that look like placeholders but authenticate | Accidental exposure |
The last pattern deserves attention. AI-generated code may contain strings like `test_api_key_12345` that appear to be placeholders but happen to be valid keys from training data. You cannot tell by looking.
Detection requirements
Secret detection must precede code review:
- Pre-commit hooks using tools like `detect-secrets` or Gitleaks
- CI pipeline stages that block PRs containing detected secrets
- Regular repository scans for secrets that bypassed earlier checks
Netlify reported that 17% of applications had deployments blocked when smart secret detection was enabled. That number tells you how frequently AI-assisted development introduces credential exposure.
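A pre-commit hook does not need to be elaborate to catch the most obvious cases. The sketch below scans staged files for a few high-confidence patterns; the regex list is illustrative, and a real setup should delegate to `detect-secrets` or Gitleaks rather than a hand-rolled list:

```javascript
#!/usr/bin/env node
// Sketch of a pre-commit secret check. The patterns are illustrative and far
// from exhaustive - dedicated scanners add entropy checks and hundreds of
// provider-specific rules.
const { execSync } = require('node:child_process');

const PATTERNS = [
  /AKIA[0-9A-Z]{16}/,                          // AWS access key ID
  /-----BEGIN (?:RSA |EC )?PRIVATE KEY-----/,  // PEM private keys
  /ghp_[A-Za-z0-9]{36}/,                       // GitHub personal access token
];

const staged = execSync('git diff --cached --name-only --diff-filter=ACM', { encoding: 'utf8' })
  .split('\n')
  .filter(Boolean);

let blocked = false;
for (const file of staged) {
  const content = execSync(`git show :"${file}"`, { encoding: 'utf8' }); // staged version
  for (const pattern of PATTERNS) {
    if (pattern.test(content)) {
      console.error(`Potential secret in ${file} (matched ${pattern})`);
      blocked = true;
    }
  }
}
process.exit(blocked ? 1 : 0);
```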
Excessive code duplication
GitClear's 2025 analysis of 211 million lines of code documented an 8-fold increase in code blocks containing 5 or more duplicated lines between 2020 and 2024. Code duplication is now 10 times more prevalent than it was pre-AI. Copy-paste operations exceed refactoring operations for the first time in recorded Git history.
AI assistants make it easy to insert new code blocks with a single keystroke. Limited context windows prevent AI from proposing reuse of similar functions elsewhere in the codebase. The result: identical logic scattered across files, each instance a future maintenance burden.
Duplication metrics
| Metric | 2020-2021 Baseline | 2024 | Change |
|---|---|---|---|
| Copy/paste code | 8.3% | 12.3% | +48% |
| Refactored (moved) code | 24.1% | 9.5% | -60% |
| Code churn (2-week) | 3.1-4% | 7.9% | +155% |
That churn rate - code revised within two weeks of being written - more than doubled. AI-generated code requires more immediate corrections than human-written code. Fast to write, slow to fix.
Review approach for duplication
When reviewing AI-generated code, search for:
- Functions that closely resemble existing functions in the codebase
- Constants or configuration values defined in multiple places
- Error handling logic repeated rather than centralized
- Data transformation patterns that could be extracted
The question to ask: does this belong here, or does it already exist elsewhere? AI optimizes for immediate functionality. Humans must optimize for long-term maintainability.
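A small example of the last point, using hypothetical handler names: the duplicated transformation below is typical of what an agent produces when it cannot see the first handler while writing the second, and the extracted helper is what a reviewer should ask for.

```javascript
// Before: the same response normalization generated twice in separate handlers.
function formatOrderResponse(order) {
  return { id: order.id, total: Number(order.total).toFixed(2), status: order.status.toLowerCase() };
}

function formatInvoiceResponse(invoice) {
  return { id: invoice.id, total: Number(invoice.total).toFixed(2), status: invoice.status.toLowerCase() };
}

// After: one shared helper, reused by both call sites.
function formatBillingRecord({ id, total, status }) {
  return { id, total: Number(total).toFixed(2), status: status.toLowerCase() };
}
```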
Missing error handling
CodeRabbit found AI-generated code contains nearly twice as many error handling gaps as human-written code. The pattern is consistent: AI generates code that works for expected inputs and fails silently or catastrophically for unexpected inputs.
Common gaps include:
- Missing null checks before dereferencing
- Early returns that bypass necessary cleanup
- Exception handlers that log but do not propagate
- Fallback values that mask failures instead of reporting them
- Missing timeout handling for network operations
- No retry logic for transient failures
AI models see vastly more working code than failing code in their training data. They optimize for the happy path because the happy path dominates training examples. Error handling is an afterthought - when it appears at all.
What thorough error handling looks like
AI-generated code:

```javascript
async function fetchUserData(userId) {
  const response = await fetch(`/api/users/${userId}`);
  const data = await response.json();
  return data;
}
```

The same code with proper error handling:
```javascript
async function fetchUserData(userId) {
  if (!userId) {
    throw new Error('userId is required');
  }
  const response = await fetch(`/api/users/${userId}`);
  if (!response.ok) {
    throw new Error(`Failed to fetch user: ${response.status}`);
  }
  const data = await response.json();
  if (!data || typeof data !== 'object') {
    throw new Error('Invalid response format');
  }
  return data;
}
```

Reviewers must verify that every code path handles failure modes. AI rarely adds this handling unprompted.
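The gap list above also mentions missing timeouts and retry logic for transient failures, which neither version shows. A minimal sketch of both, assuming a fetch-based client; the retry count, backoff, and timeout values are illustrative choices:

```javascript
// Sketch: fetch with a timeout and simple retry for transient failures.
// Retry count, backoff, and timeout values are illustrative, not prescriptive.
async function fetchWithRetry(url, { retries = 2, timeoutMs = 5000 } = {}) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const response = await fetch(url, { signal: controller.signal });
      if (response.status >= 500 && attempt < retries) continue; // transient: retry
      return response;
    } catch (err) {
      if (attempt === retries) throw err; // out of retries: propagate, do not swallow
      await new Promise((r) => setTimeout(r, 2 ** attempt * 250)); // exponential backoff
    } finally {
      clearTimeout(timer); // always clean up the timeout
    }
  }
}
```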
Unused constructs and debugging artifacts
AI-generated code frequently contains artifacts that served no purpose from the start or served a purpose during generation that no longer applies.
Common artifacts
Unused imports. AI includes imports for types, functions, or modules that the generated code never references. Static analysis catches these reliably, but their presence indicates AI wrote the code without understanding what it needed.
Dead code paths. Conditional branches that cannot be reached. Functions defined but never called. Variables assigned but never read. AI generated more code than necessary and did not prune the excess.
Console logs and debug statements. `console.log`, `print()`, `debugger`, and similar statements that belong in development but not in production code. AI reproduces these from training examples without understanding the distinction between development and production contexts.
Commented-out code. AI sometimes generates alternative implementations as comments, or leaves previous attempts commented out rather than deleting them. Commented code creates maintenance burden and confusion.
Placeholder TODOs that were meant to be filled. Comments like `// TODO: implement validation` where validation was the actual requirement. AI generates the structure it expects to see, including placeholder comments, then moves on without completing them.
Detection strategy
Static analysis tools catch most artifacts:
- ESLint `no-unused-vars` and `no-console` rules
- Dead code detection via coverage analysis
- Import cleanup plugins for various languages
Multiple artifacts in a single PR suggest insufficient AI iteration or incomplete review by the submitting developer.
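For JavaScript projects, the relevant ESLint rules are all core rules and take one flat-config block to enable. A minimal sketch; the severity levels are illustrative choices:

```javascript
// eslint.config.js - minimal flat config enabling the artifact-catching core rules.
// Severity levels ("error" vs "warn") are illustrative, not requirements.
export default [
  {
    files: ['**/*.js'],
    rules: {
      'no-unused-vars': 'error', // unused imports, variables, and parameters
      'no-console': 'warn',      // console.log left behind by generation
      'no-debugger': 'error',    // stray debugger statements
      'no-unreachable': 'error', // dead code after return/throw/break
    },
  },
];
```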
Over-mocking in tests
AI-generated tests have a consistent problem: they mock dependencies, assert that mocked values were returned, and call it verification. This validates the mock, not the code under test.
Mutation testing reveals how bad it gets. One analysis found AI-generated test suites achieving 100% code coverage with only 4% mutation score. The tests executed all code paths but detected almost no injected faults. Coverage without mutation testing is a vanity metric.
The tautological test problem
When AI writes both code and tests, both reflect the same misunderstanding. A test asserting `divide(10, 0) == 0` passes because the buggy implementation returns 0 for division by zero. The test "characterizes a buggy implementation" rather than specifying correct behavior.
Qodo's 2025 survey found that 60% of developers report AI missing relevant context during test generation. The result: tests that pass but do not protect against regressions.
Over-mocking patterns
Tests that validate mocks instead of behavior. The test mocks a dependency to return a specific value, then asserts the function returned that exact value. The function could be a pass-through that does nothing, and the test would pass.
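A Jest-style sketch of this pattern, with a hypothetical `getUser` service and repository:

```javascript
// Hypothetical service under test: fetches a user record through a repository.
async function getUser(repo, id) {
  return repo.findById(id);
}

// Over-mocked test: the mock is configured to return a value, and the assertion
// only checks that the same value came back. A do-nothing pass-through satisfies it.
test('getUser returns the user', async () => {
  const repo = { findById: jest.fn().mockResolvedValue({ id: 1, name: 'Ada' }) };
  await expect(getUser(repo, 1)).resolves.toEqual({ id: 1, name: 'Ada' });
});
```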
Implementation coupling. Mocks that specify exactly which methods will be called in which order with which arguments. Any refactoring breaks the tests even if behavior remains correct.
Missing integration points. Every external dependency mocked, leaving no test that verifies actual integration works. Unit tests pass; the system fails when deployed.
What AI-generated tests should not mock
- The actual input validation logic
- The transformation or business logic being tested
- The return value construction
- Anything that constitutes the core behavior under test
Mocking is appropriate for external systems (databases, APIs, file systems) when those systems would make tests slow or unreliable. Mocking internal logic defeats the purpose of testing.
The compound red flag
One red flag is a problem. Multiple red flags in the same PR indicate something went wrong upstream.
A PR containing all of the following:
- Hallucinated packages
- Hardcoded API keys
- Duplicated logic
- Missing error handling
- Over-mocked tests
This is not code that needs fixing. This is code that was generated quickly, never reviewed by the submitting developer, and shipped without iteration. The appropriate review response is not to fix each issue individually but to request regeneration with better prompting and thorough pre-submission review.
The next page examines how to adapt review checklists to systematically catch these red flags before they reach production.