Applied Intelligence
Module 5: Output Validation and Iteration

Performance and Style Evaluation

The performance gap hides in production

Performance issues appear at 1.42x the human rate in AI-generated code. That's the smallest multiplier in the quality breakdown: smaller than logic errors (1.75x), security flaws (1.57x), or maintainability problems (1.64x). The number suggests AI performs relatively well here.

It doesn't.

Performance bugs compound under load in ways other bugs don't. A logic error fails every time. A performance problem works fine on your laptop, works fine in staging with test data, then melts production when real traffic hits. The 1.42x multiplier underweights severity.

Excessive I/O operations appear 8x more often in AI-generated code. N+1 queries. Files opened but not closed. Synchronous blocking in async contexts. Network calls inside loops.

# AI-generated code with hidden N+1 query
def get_dashboard_data(user_ids):
    results = []
    for user_id in user_ids:
        user = db.query(f"SELECT * FROM users WHERE id = {user_id}")
        orders = db.query(f"SELECT * FROM orders WHERE user_id = {user_id}")
        results.append({"user": user, "orders": orders})
    return results
# 1000 users = 2000 database queries
# Should be 2 queries with IN clauses, or 1 with a join

The agent solved the prompt: return dashboard data for users. It didn't reason about what happens when the user list grows from 10 in testing to 10,000 in production. It pattern-matched against examples that worked at small scale.
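
A hedged sketch of the batched version the comment describes, assuming a db.query helper that accepts a parameterized query plus a tuple of values and returns dict-like rows (that API is an assumption, not a real library):

# Batched version: two parameterized queries, regardless of list size
def get_dashboard_data(user_ids):
    if not user_ids:  # edge case: avoid an invalid empty IN () clause
        return []
    placeholders = ", ".join(["%s"] * len(user_ids))
    users = db.query(
        f"SELECT * FROM users WHERE id IN ({placeholders})", tuple(user_ids)
    )
    orders = db.query(
        f"SELECT * FROM orders WHERE user_id IN ({placeholders})", tuple(user_ids)
    )

    # Group orders by user in memory, then assemble the dashboard rows
    orders_by_user = {}
    for order in orders:
        orders_by_user.setdefault(order["user_id"], []).append(order)

    return [
        {"user": user, "orders": orders_by_user.get(user["id"], [])}
        for user in users
    ]
# 1000 users = 2 database queries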

Scaling assumptions to check

Agent-generated code embeds scaling assumptions that match the examples it trained on. Most example code handles trivial data volumes. Production code handles everything else.

Review for these patterns:

Unbounded operations:

  • list.sort() on collections that could grow arbitrarily
  • [x for x in huge_iterator] loading everything into memory
  • Recursive functions without depth limits

Missing pagination:

  • API endpoints returning all results without limits
  • Database queries without LIMIT clauses
  • File reads loading entire contents regardless of size

Synchronous blocking:

  • Blocking I/O in async functions
  • Sequential operations that could be parallelized
  • Network calls in request handlers without timeouts

Resource leaks:

  • File handles not closed in all code paths
  • Database connections not returned to pools
  • Event listeners never unregistered

The question to ask: what happens when inputs grow 10x? Then 100x. If the code breaks at 10x, there's a scaling bug hiding in functional code.
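
A minimal sketch of what bounded alternatives to two of the patterns above look like, again assuming a hypothetical db.query helper with parameter support; the point is that limits and cleanup are explicit:

# Paginated query: the caller controls page size, nothing is unbounded
def list_orders(db, page=0, page_size=100):
    offset = page * page_size
    return db.query(
        "SELECT * FROM orders ORDER BY id LIMIT %s OFFSET %s",
        (page_size, offset),
    )

# Bounded file read: stream line by line instead of loading the whole file
def count_error_lines(path):
    count = 0
    with open(path) as f:   # context manager closes the handle on every code path
        for line in f:      # memory stays flat no matter how large the file grows
            if "ERROR" in line:
                count += 1
    return count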

Code churn doubled since 2021

GitClear analyzed 211 million lines of code across the AI adoption curve. Code churn, the percentage of lines revised or deleted within two weeks of being written, doubled from 2021 to 2024.

In 2020, 3.1% of new code was revised within two weeks. By 2024, that number hit 5.7-7.9% and was still climbing.

Code churn measures rework. It's code that got written, then needed to be fixed, changed, or thrown out almost immediately. High churn means paying twice for the same feature: once to write it wrong, once to write it right.

AI code specifically shows 41% higher churn than human code. That's not a small effect. If an AI-assisted team writes more code per week, but 41% more of it needs rework, the productivity gains may not be what they appear.

The pattern matches what you'd expect from agents optimizing for prompt completion rather than long-term correctness. The code works on first pass: tests pass, the feature functions, but the details are wrong. Variable names that make sense only in isolation. Logic that handles the happy path but not edge cases. Implementations that do what was asked but not what was meant.

Two weeks later, someone's rewriting it.

The cloning problem

Copy-pasted code exceeded moved code in 2024, the first time since GitClear started tracking that developers copied more code than they relocated through refactoring.

Code cloning grew 4x during AI adoption. Duplicate code blocks of 5+ lines increased 8x during 2024 alone. Copy-pasted lines rose from 8.3% to 12.3% of all changed lines.

Meanwhile, refactoring dropped from 24.1% of changes in 2020 to 9.5% in 2024.

The dynamic is obvious once you see it. AI excels at generating new code. It's fast, easy to prompt, and produces something that works. Asking for refactoring requires understanding what already exists, why it exists, and what should change. That's harder to prompt. It takes more context. It requires the agent to understand the codebase holistically rather than complete a local task.

So developers, consciously or not, prompt for new code instead of cleanup. The codebase grows in volume without growing in capability. Duplication accumulates. Technical debt compounds.

"AI code assistants excel at adding code quickly, but they can cause AI-induced tech debt."

Bill Harding, GitClear

The long-term cost isn't theoretical. A 2024 enterprise case study found remediating AI-generated code took 3x as long as remediating human-written code. The extra time went to figuring out what the AI intended before fixing it.

Style drift undermines team velocity

Readability issues appear 3x more often in AI-generated code. Formatting problems hit 2.66x more often. Naming inconsistencies run at nearly 2x the human rate.

This isn't about aesthetics. Style consistency is infrastructure. When code follows predictable patterns, developers navigate it without cognitive overhead. When every file looks different, every file requires active mental parsing.

The root cause: AI models train on public GitHub, but your codebase has its own conventions. Every team develops idioms (naming patterns, architectural norms, formatting choices) that make their specific codebase navigable. The agent doesn't know these idioms. It generates code that looks like average public code, not code that looks like your project.

# Team convention: use get_ prefix for data access
def get_user_by_id(user_id):
    return db.users.find_one({"_id": user_id})

# AI generates different style
def fetchUserData(userId):
    return db.users.find_one({"_id": userId})

# And different again
def retrieve_user(id):
    return db.users.find_one({"_id": id})

All three functions do the same thing. None of them match. Now multiply by every function the agent generates.

Teams with linters and formatters catch surface-level inconsistencies. But structural style drift (how error handling is organized, how modules are composed, how APIs are shaped) escapes automation. That drift accumulates in AI-heavy codebases until the repository feels like it was written by 20 different developers who never talked to each other.

Because it was.

"Permanent First-Day Syndrome"

Pete Hodgson describes the core problem: every AI session resets to baseline knowledge. The agent is a "brand new hire" without institutional memory.

It doesn't know your team's coding conventions unless they're explicitly laid out. It doesn't know your architectural decisions unless they're documented. It doesn't know your testing philosophy, your error handling patterns, or your logging conventions.

Each prompt starts from zero.

The fix is context: CLAUDE.md files, explicit instructions, examples of preferred patterns. But context is work. Maintaining context is ongoing work. Most teams don't invest enough, and the agent keeps generating code that looks like it belongs to a different project.

The style guide in CLAUDE.md can specify naming conventions. The agent can be instructed to follow existing patterns. But someone has to write those instructions, and someone has to verify the agent follows them. Without that investment, style drift is the default.
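
What those instructions can look like in practice is a short, explicit conventions section. A sketch of a hypothetical CLAUDE.md excerpt; the specific rules are illustrative, not recommendations:

# CLAUDE.md (excerpt)
## Coding conventions
- Data access functions use snake_case with a get_ prefix: get_user_by_id, not fetchUserData
- All database access goes through the db module; no raw connections in request handlers
- Errors are raised as domain exceptions from errors.py, never returned as None
- New list-returning endpoints accept page and page_size parameters
- Follow the structure of the nearest existing module before introducing a new pattern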

The velocity-quality tradeoff isn't what you'd expect

Teams adopt AI coding tools expecting to ship faster. The pitch: more code, less time. The reality is more complicated.

The METR study measured experienced open-source developers on real tasks. They predicted AI would save them 24% of their time. Afterward, they estimated it had saved them 20%. Actual measurement: they took 19% longer.

A 39-percentage-point gap between perception and reality.

The study involved 16 developers with deep familiarity with their codebases, working on 246 real issues. These weren't beginners struggling with syntax. These were experts who knew their code. AI made them slower.

The Cortex 2026 benchmark tells the other half. Pull requests per developer increased 20%. But incidents per PR rose 23.5%. Change failure rates jumped 30%.

More code shipped. More production broke.

The pattern makes sense when you trace the workflow. AI speeds up code creation. But code creation is maybe 16% of a developer's time. The rest (design, debugging, review, communication, understanding requirements) doesn't get faster because drafting code got faster.

And if AI-assisted code needs more review attention, more debugging, more rework, the draft speed advantage evaporates in downstream costs.

The 88% acceptance problem

88% of AI-generated code that gets accepted remains in the final codebase.

That sounds like a good number until you realize what it measures. It measures developers' willingness to keep code they accepted, not code quality. If a developer accepts code and doesn't delete it, that code counts toward the 88%. Whether that code was good is a separate question.

The acceptance number says something about trust. It says something about time pressure. It says something about the friction of rejection. It doesn't say anything about correctness.

Only 3.8% of developers report both low confabulation rates and high confidence shipping AI code. 46% don't fully trust AI-generated code. But 88% of accepted code stays in.

The gap suggests developers keep code they don't fully trust. Under deadline pressure, something that works now beats nothing. The debt gets paid later in bugs, in rework, in production incidents.

Evaluation strategy for performance and style

Given these patterns, review of AI-generated code requires specific focus areas.

Performance review:

  1. Identify all I/O operations (database, network, file)
  2. Trace data flow for unbounded collections
  3. Check for pagination on any list-returning operation
  4. Verify resource cleanup in all code paths (see the sketch after this list)
  5. Ask: what happens at 10x scale? 100x?
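
For step 4, the thing to verify is cleanup that survives every exit path, including exceptions. A minimal sketch, assuming a connection pool object with acquire and release methods (the names are hypothetical):

# Fragile: the connection leaks if query() raises
def get_report(pool, report_id):
    conn = pool.acquire()
    rows = conn.query("SELECT * FROM reports WHERE id = %s", (report_id,))
    pool.release(conn)
    return rows

# Robust: try/finally returns the connection to the pool on every path
def get_report_safe(pool, report_id):
    conn = pool.acquire()
    try:
        return conn.query("SELECT * FROM reports WHERE id = %s", (report_id,))
    finally:
        pool.release(conn)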

Style review:

  1. Compare naming against existing conventions in the same module
  2. Check import organization matches project patterns
  3. Verify error handling follows established norms
  4. Look for structural consistency with similar code elsewhere
  5. Consider: would a team member recognize this as "our code"?

Churn prevention:

  1. Require tests that exercise edge cases, not just happy paths (example after this list)
  2. Document intent in PR descriptions, not just changes
  3. Check for duplication against existing utilities
  4. Verify new code didn't reimplement something that exists
  5. Ask: will this need revision in two weeks?
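
For the first item, the difference is concrete: cover the empty, single-item, and failure cases, not only a representative input. A minimal pytest sketch against the count_error_lines function sketched earlier (the reports module name is hypothetical):

# test_reports.py: edge cases, not just the happy path
import pytest

from reports import count_error_lines  # hypothetical module holding the earlier sketch

def test_counts_matching_lines(tmp_path):
    log = tmp_path / "app.log"
    log.write_text("INFO ok\nERROR boom\nERROR again\n")
    assert count_error_lines(log) == 2

def test_empty_file(tmp_path):
    log = tmp_path / "app.log"
    log.write_text("")
    assert count_error_lines(log) == 0

def test_missing_file_raises(tmp_path):
    with pytest.raises(FileNotFoundError):
        count_error_lines(tmp_path / "missing.log")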

The 1.42x performance multiplier and 3x readability gap aren't destiny. Review catches the problems AI introduces. But review requires knowing what to look for, and allocating time to look.

The velocity gains AI promises arrive only when the code AI generates doesn't create downstream rework. That requires validation before merge, not merge and hope.
