Applied Intelligence
Module 5: Output Validation and Iteration

Categories of Agent Errors

A taxonomy of what goes wrong

Previous pages examined error rates and review strategies. This page catalogs the error types themselves: the specific ways agents fail when generating code.

Research across 558 incorrect code solutions from six major language models identified seven categories of non-syntactic errors. Four categories were already known from earlier studies. Three emerged as distinct patterns only after systematic analysis of agent outputs. Knowing these categories turns error detection from gut feeling into systematic review.

The seven categories, in order of frequency:

  1. Conditional errors
  2. Garbage code
  3. Mathematical formula and logic errors
  4. Minor output formatting errors
  5. Misorder of operations on objects/variables
  6. Misuse of library API
  7. Index off-by-one mistakes

Each category has distinct signatures, root causes, and review strategies.

Conditional errors: the missing branch

Conditional errors occur when agents omit or misinterpret necessary conditions in control flow. The agent handles some branches while others fail silently, return incorrect values, or crash.

A study of HumanEval-X found 24 instances of conditional errors in GPT-4 outputs alone. The pattern recurs across models and benchmarks.

# Prompt: Return false if list has more than one duplicate of same number
# (i.e., 3+ identical values)

# AI generates this interpretation:
def check_duplicates(lst):
    if len(lst) != len(set(lst)):
        return False  # Wrong: triggers on ANY duplicates
    return True

# Correct interpretation requires counting occurrences:
from collections import Counter

def check_duplicates(lst):
    counts = Counter(lst)
    return all(count <= 2 for count in counts.values())

The error comes from specification misunderstanding, the most common root cause across all error categories. The phrase "more than one duplicate" requires interpretation: does it mean "any duplicate exists" or "three or more identical values"? Agents default to the simpler interpretation.

Conditional errors also appear as:

  • Missing else branches that should handle unexpected states
  • Boundary conditions checked incorrectly (< vs <=)
  • Nested conditionals that miss combination cases
  • Guard clauses that return early without handling all paths

Review strategy: trace every conditional through all possible input states. Ask what happens when inputs are null, empty, negative, maximum, or of unexpected type. If a branch isn't explicitly handled, assume the agent didn't consider it.
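
A quick way to apply this is to run the function over a handful of representative states and compare each result against the interpretation you intended. The harness below is a hypothetical spot-check using the corrected check_duplicates defined above:

# Hypothetical spot-check: exercise check_duplicates across boundary states
# and confirm each result matches the intended interpretation.
cases = [[], [1], [1, 1], [1, 1, 1], [1, 1, 2, 2]]
for lst in cases:
    print(lst, check_duplicates(lst))
# []           -> True   (no values at all)
# [1]          -> True   (no duplicates)
# [1, 1]       -> True   (a pair is allowed)
# [1, 1, 1]    -> False  (three identical values)
# [1, 1, 2, 2] -> True   (two pairs, no triple)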

Garbage code: fundamental misalignment

Garbage code is syntactically valid but fundamentally misaligned with the task. The code compiles, runs, and produces output, but solves the wrong problem entirely.

Studies found that when prompts exceed 150 words or solutions require more than 12 lines of code, approximately 60% of outputs qualify as garbage code. Complexity makes this failure mode far more likely.

Three subtypes exist:

Meaningless snippets: Code that bears no relationship to the prompt. The agent latched onto a keyword or pattern and generated something superficially related but functionally useless. Smaller models (InCoder, CodeGen) produce meaningless snippets in 7-25% of failures. Larger models (GPT-4) rarely produce this subtype.

Comments-only output: The agent returns documentation describing what should happen instead of code that makes it happen. This occurs when the task exceeds the agent's capability to solve but not its capability to describe.
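
As an illustration (not drawn from the study data), a comments-only response might look like this: the shape of a solution is narrated, but nothing executes.

# Prompt: Parse a log line into timestamp, level, and message

# Comments-only output: the agent describes the steps but never implements them
def parse_log_line(line):
    # Split the line into its whitespace-separated fields
    # Combine the first two fields into a timestamp
    # Map the third field to a log level
    # Join the remaining fields into the message
    pass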

Wrong logical direction: The code solves an inverted or perpendicular problem. Asked to find maximum, it finds minimum. Asked to filter in, it filters out. The structure looks correct, but the logic points the wrong way.

# Prompt: Find elements that appear in both lists

# Garbage code (wrong logical direction):
def find_common(list1, list2):
    result = []
    for item in list1:
        if item not in list2:  # Should be "in", not "not in"
            result.append(item)
    return result  # Returns difference, not intersection

Garbage code correlates with prompt complexity. Long prompts, ambiguous specifications, and multi-step requirements increase garbage code rates. When you receive garbage code, the fix is rarely iteration. The agent didn't partially understand the task. Restart with a clearer, shorter prompt.

Mathematical formula and logic errors

Language models process numbers as tokens, not quantities. They cannot perform step-by-step calculation the way calculators do. This limitation causes persistent arithmetic and mathematical reasoning errors.

Mathematical errors appear as:

  • Incorrect formulas (wrong operators, missing terms)
  • Calculation cascade errors where one wrong number propagates
  • Order of operations mistakes (multiplication before addition ignored)
  • Integer vs. floating-point confusion
  • Rounding and precision handling failures

# Prompt: Calculate compound interest
# Principal $1000, Rate 5%, Time 3 years

# AI-generated code with formula error:
def compound_interest(principal, rate, time):
    return principal * rate * time  # Simple interest formula
    # Correct: principal * (1 + rate) ** time - principal

# Or arithmetic error:
def calculate_total(items):
    subtotal = sum(item.price for item in items)
    tax = subtotal * 0.08
    discount = subtotal * 0.1
    return subtotal + tax + discount  # Should subtract discount, not add

Research found mathematical and logic errors in 0-6 instances per benchmark per model. They are less frequent than conditional errors or garbage code, but high in severity when they do occur: a wrong formula produces wrong results for every input, not just edge cases.

Review strategy: verify formulas against authoritative sources. Trace through a calculation by hand with test values. Mathematical correctness cannot be inferred from code that "looks right."
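
Tracing the compound-interest case by hand catches the formula error immediately. A minimal check, assuming the values from the prompt above:

# Hand-traced expected value: 1000 * (1 + 0.05) ** 3 - 1000 = 157.625
def compound_interest(principal, rate, time):
    return principal * (1 + rate) ** time - principal

assert abs(compound_interest(1000, 0.05, 3) - 157.625) < 1e-9

# The simple-interest version the agent produced returns 150.0 for the same
# inputs, so a single traced value is enough to expose the wrong formula.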

Minor output formatting errors

Output formatting errors produce correct values in wrong formats. The logic works; the presentation doesn't match requirements.

# Prompt: Return user data as JSON string

# AI returns:
def get_user(user_id):
    user = db.fetch(user_id)
    return user  # Returns dict object

# Should return:
import json

def get_user(user_id):
    user = db.fetch(user_id)
    return json.dumps(user)  # Returns JSON string

On HumanEval-X, formatting errors appeared in 2 instances. On CoderEval (production-level code), they appeared in 30 instances. The difference suggests that production code has stricter format requirements that agents miss.

Formatting errors include:

  • Wrong data type (list vs. tuple, dict vs. JSON string)
  • Missing decimal places or incorrect number formatting
  • Date/time format mismatches
  • String encoding issues (UTF-8 vs. ASCII)
  • Missing trailing newlines or whitespace handling

These errors typically surface immediately in integration testing when callers expect one format and receive another. Review strategy: verify return types and formats against calling code expectations.
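
One way to make that check concrete is to assert the contract in a test or at the call site. A hypothetical sketch for the get_user example above:

import json

# Hypothetical contract check: the caller was promised a JSON string,
# so verify the type and that the string actually parses.
payload = get_user(user_id=42)
assert isinstance(payload, str), f"expected a JSON string, got {type(payload)}"
json.loads(payload)  # raises if the "JSON string" is a dict or malformed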

Misorder of operations on objects/variables

Operation order errors occur when agents perform correct operations in wrong sequence. Dependencies between operations require specific ordering that agents infer incorrectly.

// Prompt: Initialize application with config and database

// AI generates:
async function init() {
    startServer();           // Needs database ready
    await connectDatabase(); // Needs config loaded
    loadConfig();            // Should happen first
}

// Correct order:
async function init() {
    await loadConfig();      // 1. Config first
    await connectDatabase(); // 2. Database uses config
    await startServer();     // 3. Server uses database
}

The agent produced all three required operations but sequenced them based on the prompt order rather than dependency requirements. Research found 0-2 instances per benchmark, making this one of the rarer error types. But when operation order matters (initialization, transaction sequences, cleanup routines), these errors cause failures that look like race conditions or resource unavailability.

Review strategy: map dependencies between operations. If operation B reads state that operation A creates, A must precede B. This mapping reveals sequencing errors that correct syntax obscures.
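
The mapping itself can be mechanical. A rough Python sketch (hypothetical step names, mirroring the initialization example above):

# Declare which steps each step depends on, then flag anything that runs
# before its prerequisites have executed.
dependencies = {
    "load_config": set(),
    "connect_database": {"load_config"},
    "start_server": {"connect_database"},
}

def check_order(sequence, dependencies):
    seen = set()
    for step in sequence:
        missing = dependencies[step] - seen
        if missing:
            print(f"{step} runs before its prerequisites: {sorted(missing)}")
        seen.add(step)

check_order(["start_server", "connect_database", "load_config"], dependencies)
# start_server runs before its prerequisites: ['connect_database']
# connect_database runs before its prerequisites: ['load_config']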

Misuse of library API

API misuse occurs when agents call library functions incorrectly: wrong arguments, wrong context, wrong assumptions about behavior.

A study of AI-generated code found four API misuse patterns:

Type                   Description                        Python Rate   Java Rate
Intent misuse          Functionally inappropriate API     30.4%         27.0%
Confabulation misuse   Non-existent methods/parameters    40.9%         34.6%
Missing item misuse    Omitting required parameters       3.0%          11.3%
Redundancy misuse      Unnecessary API calls              5.3%          10.2%

Intent misuse selects an API that exists but doesn't fit the task. Using np.abs() for vector magnitude when np.linalg.norm() is correct. The code runs without errors but produces wrong results.
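
Both calls run without error, which is what makes intent misuse hard to catch through execution alone:

import numpy as np

v = np.array([3.0, -4.0])
np.abs(v)           # array([3., 4.]) -- elementwise absolute values, not a magnitude
np.linalg.norm(v)   # 5.0 -- the actual vector magnitude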

Confabulation misuse references API methods or parameters that don't exist. The agent generates plausible-sounding method names from patterns in training data. These fail immediately at runtime, which makes them one of the "obvious" error types that execution catches.

# AI generates confabulated API:
result = pandas.dataframe.merge_columns(df1, df2)  # Method doesn't exist

# Correct API:
result = pandas.merge(df1, df2)

Cross-language confusion contributes to API misuse. Agents trained on multiple languages sometimes apply conventions from one language to another. Python's split() behaves differently from Java's split(): Java's accepts a regex, while Python's does not without the re module. The agent generates code that would work in a different language.
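
A small example of that confusion, assuming the agent carried Java's regex-based split over to Python:

import re

text = "a1b22c333d"

# Java's text.split("\\d+") treats the argument as a regex.
# Python's str.split() treats it as a literal substring, so nothing matches:
text.split("\\d+")      # ['a1b22c333d']

# Getting the regex behavior in Python requires the re module:
re.split(r"\d+", text)  # ['a', 'b', 'c', 'd']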

Review strategy: verify API calls against official documentation. Hover over methods in your IDE to confirm signatures. Run the code. Confabulated APIs fail immediately; intent misuse requires reasoning about correctness.

Index off-by-one mistakes

Index errors occur when array or list access uses wrong indices. The classic off-by-one error: iterating to < n when <= n was required, or vice versa.

# Prompt: Get last three elements of list

# AI generates:
def last_three(items):
    return items[-3:-1]  # Off by one: misses actual last element
    # Correct: items[-3:] or items[-3:None]

Research found index errors increased from 15 instances (HumanEval+ regular inputs) to 56 instances (LiveCodeBench regular inputs) as task complexity increased. When tested with edge inputs, all models failed roughly twice as often. Index errors stress array boundaries: empty arrays, single-element arrays, indices at limits.

Index errors appear as:

  • Loop bounds excluding final element
  • Slice notation off by one at start or end
  • Zero-based vs. one-based indexing confusion
  • Negative indices calculated incorrectly
  • Range end values (inclusive vs. exclusive)

Review strategy: trace array operations with boundary inputs. Empty list, single element, exactly the boundary size. If any boundary case behaves incorrectly, there's an index error.
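
For the last_three example above, the boundary trace is only a few assertions (a hypothetical check, using the corrected slice):

def last_three(items):
    return items[-3:]

assert last_three([]) == []                   # empty list
assert last_three([1]) == [1]                 # single element
assert last_three([1, 2, 3]) == [1, 2, 3]     # exactly the boundary size
assert last_three([1, 2, 3, 4]) == [2, 3, 4]  # one past the boundary

# The original items[-3:-1] fails every non-empty case here, which is exactly
# where boundary inputs expose the off-by-one.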

Misunderstanding is the root cause

Across all seven error categories, one root cause keeps appearing: specification misunderstanding.

Research categorized 557 errors and found:

  • Missing corner case checks: 54.4% in HumanEval+, 53.2% in RWPB
  • Residual logic misunderstanding: 69.1% in APPS+ (complex tasks)
  • Over 70% of failures in specialized domains stem from misinterpreting specifications

The agent understood the words but not the requirements. The prompt said "duplicate" and the agent chose one interpretation among several. The specification implied a constraint the agent didn't infer. The business rule lives in domain knowledge the agent lacks.

Six root causes drive specification misunderstanding:

  1. Misleading coding question specification: Ambiguous phrasing permits multiple interpretations
  2. Input-output demonstration impact: Low-quality examples enable wrong understanding
  3. Edge case oversight: Agents overfit to demonstrated scenarios
  4. Misleading function signature: Function names suggest different behavior
  5. Positional sensitivity: Agents ignore parts of long prompts due to attention limitations
  6. Incorrect trained knowledge: Cross-language confusion from training on multiple languages

Most errors trace back to what you told the agent (or didn't tell it). Clearer specifications, explicit constraints, and example edge cases prevent errors that no amount of iteration will fix after the fact. When an agent produces wrong code, ask first whether the specification was ambiguous. The answer is usually yes.
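
For instance, the ambiguous prompt from the conditional-errors example becomes much harder to misread once the interpretation and edge cases are spelled out (illustrative wording, not from the study):

# Prompt: Return True if no value appears three or more times in the list,
# otherwise return False. A value appearing exactly twice is allowed.
# Edge cases: an empty list and a list with no duplicates both return True.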

Using the taxonomy in review

This seven-category framework turns error detection from intuition into a checklist. For each AI-generated code change, verify:

Conditional coverage: Every branch handled. Edge states considered. No silent fallthrough.

Task alignment: Code solves the stated problem, not a related one. Logic direction correct.

Mathematical correctness: Formulas verified against sources. Calculations traced by hand.

Format compliance: Return types match expectations. Data formats correct.

Operation sequence: Dependencies mapped. Required ordering enforced.

API correctness: Methods exist. Parameters correct. Intent matches function purpose.

Index boundaries: Loop bounds correct. Slices include intended elements. Edge sizes handled.

The taxonomy also provides vocabulary for feedback. "Conditional error in the authentication check, missing the expired token case" communicates more precisely than "there's a bug." Precise feedback helps agents correct errors faster and helps humans understand what to look for.
