Red Flags Specific to Agent Output
Patterns that give AI away
Beyond the broad categories of logic errors and security gaps, certain problems appear disproportionately in AI-generated code. These are the fingerprints. Learn to spot them, and you can identify AI-written code at a glance - even before checking git blame.
Package hallucinations
AI models recommend packages that do not exist. A USENIX Security 2025 study analyzing 576,000 code samples across 16 LLMs found that 19.7% of all recommended packages were fabricated. That is one in five package recommendations pointing to nothing. The study identified 205,474 unique hallucinated package names.
Commercial models hallucinate less frequently (5.2%) than open-source models (21.7%). GPT-4 Turbo achieved the lowest rate at 3.59%. CodeLlama exceeded 33%. Gemini Pro reached 64.5%.
The composition breaks down like this:
| Pattern | Percentage |
|---|---|
| Inspired by real package names | 38% |
| Resulting from typos | 13% |
| Completely fabricated | 51% |
Here is what makes this dangerous: 43% of hallucinated package names repeated consistently across multiple queries. The same fake package appears over and over. An attacker who identifies these predictable hallucinations can publish malicious packages under those names and wait. This attack vector, dubbed "slopsquatting," is already being exploited.
Real-world exploitation
The PhantomRaven campaign contaminated 126 npm packages, accumulating 86,000 installations. A separate attack published 11 malicious packages targeting Solana developers, with names like solana-test, solana-token, and solana-charts.
Security researcher Bar Lanyado demonstrated the risk by publishing an empty package called huggingface-cli. It received over 30,000 downloads in three months, and Alibaba repositories referenced it. The package did not exist before Lanyado created it - LLMs simply hallucinated the name, and developers installed whatever they found.
What reviewers should check
Every dependency added by AI requires verification.
Before running `npm install` or `pip install` on AI-suggested packages:
- Verify the package exists on the official registry
- Check the package publication date and download count
- Confirm the maintainer identity matches expectations
- Read the package description and source if unfamiliar
CI pipelines should include lockfile verification and hash pinning. Phantom packages should never reach production.
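Reviewers can automate the first two checks. Here is a minimal sketch in JavaScript that queries the public npm registry metadata endpoint for an AI-suggested name; the function name and the signals it surfaces are illustrative, not a complete vetting process:

```javascript
// Sketch: confirm an AI-suggested package exists on npm and surface basic
// provenance signals before anyone runs `npm install`. Illustrative only -
// it does not replace reading the package source or checking downloads.
async function checkNpmPackage(name) {
  const res = await fetch(`https://registry.npmjs.org/${encodeURIComponent(name)}`);
  if (res.status === 404) {
    return { exists: false, verdict: 'Possible hallucination: package not found' };
  }
  const meta = await res.json();
  return {
    exists: true,
    created: meta.time?.created,                        // first publication date
    maintainers: (meta.maintainers || []).map((m) => m.name),
    verdict: 'Exists: review publication date, maintainers, and source manually',
  };
}
```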
Hardcoded secrets
AI models reproduce secrets from training data. Truffle Security found 12,000 live API keys and passwords in Common Crawl archives used to train LLMs. When developers prompt for database connection code, authentication examples, or API integration, models may generate code containing actual credentials from those archives.
GitGuardian's State of Secrets Sprawl 2025 report found that repositories using GitHub Copilot had a 40% higher secret leakage rate compared to baseline. Among 8,127 analyzed Copilot suggestions, researchers extracted 2,702 valid secrets - roughly one valid secret for every three suggestions.
AI models cannot distinguish security-sensitive constants from benign string values. They reproduce patterns from training data, including insecure patterns. A study of five major LLMs found CWE-798 (hardcoded credentials) in code from every model tested: Claude Sonnet 4 produced 20 instances, GPT-4o produced 20 instances, and Llama 3.2 90B produced 29 instances.
Common hardcoded patterns
| Pattern | Risk |
|---|---|
| Database connection strings with credentials | Direct database access |
| API keys in client-side code | Key theft and abuse |
| Test credentials that work in production | Unauthorized access |
| Placeholder values that look like placeholders but authenticate | Accidental exposure |
The last pattern deserves attention. AI-generated code may contain strings like `test_api_key_12345` that appear to be placeholders but happen to be valid keys from training data. You cannot tell by looking.
Detection requirements
Secret detection must precede code review:
- Pre-commit hooks using tools like `detect-secrets` or Gitleaks
- CI pipeline stages that block PRs containing detected secrets
- Regular repository scans for secrets that bypassed earlier checks
Netlify reported that 17% of applications had deployments blocked when smart secret detection was enabled. That number tells you how frequently AI-assisted development introduces credential exposure.
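A pre-commit hook does not need to be elaborate to catch the most obvious cases. The sketch below scans staged files for a few high-confidence patterns; the regex list is illustrative, and a real setup should delegate to `detect-secrets` or Gitleaks rather than a hand-rolled list:

```javascript
#!/usr/bin/env node
// Sketch of a pre-commit secret check. The patterns are illustrative and far
// from exhaustive - dedicated scanners add entropy checks and hundreds of
// provider-specific rules.
const { execSync } = require('node:child_process');

const PATTERNS = [
  /AKIA[0-9A-Z]{16}/,                          // AWS access key ID
  /-----BEGIN (?:RSA |EC )?PRIVATE KEY-----/,  // PEM private keys
  /ghp_[A-Za-z0-9]{36}/,                       // GitHub personal access token
];

const staged = execSync('git diff --cached --name-only --diff-filter=ACM', { encoding: 'utf8' })
  .split('\n')
  .filter(Boolean);

let blocked = false;
for (const file of staged) {
  const content = execSync(`git show :"${file}"`, { encoding: 'utf8' }); // staged version
  for (const pattern of PATTERNS) {
    if (pattern.test(content)) {
      console.error(`Potential secret in ${file} (matched ${pattern})`);
      blocked = true;
    }
  }
}
process.exit(blocked ? 1 : 0);
```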
Excessive code duplication
GitClear's 2025 analysis of 211 million lines of code documented an 8-fold increase in code blocks containing 5 or more duplicated lines between 2020 and 2024. Code duplication is now 10 times more prevalent than it was pre-AI. Copy-paste operations exceed refactoring operations for the first time in recorded Git history.
AI assistants make it easy to insert new code blocks with a single keystroke. Limited context windows prevent AI from proposing reuse of similar functions elsewhere in the codebase. The result: identical logic scattered across files, each instance a future maintenance burden.
Duplication metrics
| Metric | 2020-2021 Baseline | 2024 | Change |
|---|---|---|---|
| Copy/paste code | 8.3% | 12.3% | +48% |
| Refactored (moved) code | 24.1% | 9.5% | -60% |
| Code churn (2-week) | 3.1-4% | 7.9% | +155% |
That churn rate - code revised within two weeks of being written - more than doubled. AI-generated code requires more immediate corrections than human-written code. Fast to write, slow to fix.
Review approach for duplication
When reviewing AI-generated code, search for:
- Functions that closely resemble existing functions in the codebase
- Constants or configuration values defined in multiple places
- Error handling logic repeated rather than centralized
- Data transformation patterns that could be extracted
The question to ask: does this belong here, or does it already exist elsewhere? AI optimizes for immediate functionality. Humans must optimize for long-term maintainability.
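A small example of the last point, using hypothetical handler names: the duplicated transformation below is typical of what an agent produces when it cannot see the first handler while writing the second, and the extracted helper is what a reviewer should ask for.

```javascript
// Before: the same response normalization generated twice in separate handlers.
function formatOrderResponse(order) {
  return { id: order.id, total: Number(order.total).toFixed(2), status: order.status.toLowerCase() };
}

function formatInvoiceResponse(invoice) {
  return { id: invoice.id, total: Number(invoice.total).toFixed(2), status: invoice.status.toLowerCase() };
}

// After: one shared helper, reused by both call sites.
function formatBillingRecord({ id, total, status }) {
  return { id, total: Number(total).toFixed(2), status: status.toLowerCase() };
}
```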
Missing error handling
CodeRabbit found AI-generated code contains nearly twice as many error handling gaps as human-written code. The pattern is consistent: AI generates code that works for expected inputs and fails silently or catastrophically for unexpected inputs.
Common gaps include:
- Missing null checks before dereferencing
- Early returns that bypass necessary cleanup
- Exception handlers that log but do not propagate
- Fallback values that mask failures instead of reporting them
- Missing timeout handling for network operations
- No retry logic for transient failures
AI models see vastly more working code than failing code in their training data. They optimize for the happy path because the happy path dominates training examples. Error handling is an afterthought - when it appears at all.
What thorough error handling looks like
AI-generated code:

```javascript
async function fetchUserData(userId) {
  const response = await fetch(`/api/users/${userId}`);
  const data = await response.json();
  return data;
}
```

The same code with proper error handling:
```javascript
async function fetchUserData(userId) {
  if (!userId) {
    throw new Error('userId is required');
  }
  const response = await fetch(`/api/users/${userId}`);
  if (!response.ok) {
    throw new Error(`Failed to fetch user: ${response.status}`);
  }
  const data = await response.json();
  if (!data || typeof data !== 'object') {
    throw new Error('Invalid response format');
  }
  return data;
}
```

Reviewers must verify that every code path handles failure modes. AI rarely adds this handling unprompted.
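The gap list above also mentions missing timeouts and retry logic for transient failures, which neither version shows. A minimal sketch of both, assuming a fetch-based client; the retry count, backoff, and timeout values are illustrative choices:

```javascript
// Sketch: fetch with a timeout and simple retry for transient failures.
// Retry count, backoff, and timeout values are illustrative, not prescriptive.
async function fetchWithRetry(url, { retries = 2, timeoutMs = 5000 } = {}) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const response = await fetch(url, { signal: controller.signal });
      if (response.status >= 500 && attempt < retries) continue; // transient: retry
      return response;
    } catch (err) {
      if (attempt === retries) throw err; // out of retries: propagate, do not swallow
      await new Promise((r) => setTimeout(r, 2 ** attempt * 250)); // exponential backoff
    } finally {
      clearTimeout(timer); // always clean up the timeout
    }
  }
}
```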
Unused constructs and debugging artifacts
AI-generated code frequently contains artifacts that served no purpose from the start or served a purpose during generation that no longer applies.
Common artifacts
Unused imports. AI includes imports for types, functions, or modules that the generated code never references. Static analysis catches these reliably, but their presence indicates AI wrote the code without understanding what it needed.
Dead code paths. Conditional branches that cannot be reached. Functions defined but never called. Variables assigned but never read. AI generated more code than necessary and did not prune the excess.
Console logs and debug statements. `console.log`, `print()`, `debugger`, and similar statements that belong in development but not in production code. AI reproduces these from training examples without understanding the distinction between development and production contexts.
Commented-out code. AI sometimes generates alternative implementations as comments, or leaves previous attempts commented out rather than deleting them. Commented code creates maintenance burden and confusion.
Placeholder TODOs that were meant to be filled. Comments like `// TODO: implement validation` where validation was the actual requirement. AI generates the structure it expects to see, including placeholder comments, then moves on without completing them.
Detection strategy
Static analysis tools catch most artifacts:
- ESLint `no-unused-vars` and `no-console` rules
- Dead code detection via coverage analysis
- Import cleanup plugins for various languages
Multiple artifacts in a single PR suggest insufficient AI iteration or incomplete review by the submitting developer.
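For JavaScript projects, the relevant ESLint rules are all core rules and take one flat-config block to enable. A minimal sketch; the severity levels are illustrative choices:

```javascript
// eslint.config.js - minimal flat config enabling the artifact-catching core rules.
// Severity levels ("error" vs "warn") are illustrative, not requirements.
export default [
  {
    files: ['**/*.js'],
    rules: {
      'no-unused-vars': 'error', // unused imports, variables, and parameters
      'no-console': 'warn',      // console.log left behind by generation
      'no-debugger': 'error',    // stray debugger statements
      'no-unreachable': 'error', // dead code after return/throw/break
    },
  },
];
```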
Over-mocking in tests
AI-generated tests have a consistent problem: they mock dependencies, assert that mocked values were returned, and call it verification. This validates the mock, not the code under test.
Mutation testing reveals how bad it gets. One analysis found AI-generated test suites achieving 100% code coverage with only 4% mutation score. The tests executed all code paths but detected almost no injected faults. Coverage without mutation testing is a vanity metric.
The tautological test problem
When AI writes both code and tests, both reflect the same misunderstanding. A test asserting `divide(10, 0) == 0` passes because the buggy implementation returns 0 for division by zero. The test "characterizes a buggy implementation" rather than specifying correct behavior.
Qodo's 2025 survey found that 60% of developers report AI missing relevant context during test generation. The result: tests that pass but do not protect against regressions.
Over-mocking patterns
Tests that validate mocks instead of behavior. The test mocks a dependency to return a specific value, then asserts the function returned that exact value. The function could be a pass-through that does nothing, and the test would pass.
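A Jest-style sketch of this pattern, with a hypothetical `getUser` service and repository:

```javascript
// Hypothetical service under test: fetches a user record through a repository.
async function getUser(repo, id) {
  return repo.findById(id);
}

// Over-mocked test: the mock is configured to return a value, and the assertion
// only checks that the same value came back. A do-nothing pass-through satisfies it.
test('getUser returns the user', async () => {
  const repo = { findById: jest.fn().mockResolvedValue({ id: 1, name: 'Ada' }) };
  await expect(getUser(repo, 1)).resolves.toEqual({ id: 1, name: 'Ada' });
});
```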
Implementation coupling. Mocks that specify exactly which methods will be called in which order with which arguments. Any refactoring breaks the tests even if behavior remains correct.
Missing integration points. Every external dependency mocked, leaving no test that verifies actual integration works. Unit tests pass; the system fails when deployed.
What AI-generated tests should not mock
- The actual input validation logic
- The transformation or business logic being tested
- The return value construction
- Anything that constitutes the core behavior under test
Mocking is appropriate for external systems (databases, APIs, file systems) when those systems would make tests slow or unreliable. Mocking internal logic defeats the purpose of testing.
The compound red flag
One red flag is a problem. Multiple red flags in the same PR indicate something went wrong upstream.
A PR containing all of the following:
- Hallucinated packages
- Hardcoded API keys
- Duplicated logic
- Missing error handling
- Over-mocked tests
This is not code that needs fixing. This is code that was generated quickly, never reviewed by the submitting developer, and shipped without iteration. The appropriate review response is not to fix each issue individually but to request regeneration with better prompting and thorough pre-submission review.
The next page examines how to adapt review checklists to systematically catch these red flags before they reach production.