Metrics and Monitoring for AI Code Quality
The measurement problem
AI-assisted development breaks traditional metrics. Lines of code per day? Meaningless when an agent generates thousands in minutes. Commits per week? Equally useless.
New metrics matter because AI code fails in new ways. Without measuring those failure modes specifically, organizations mistake the illusion of progress for the real thing.
Technical Debt Ratio
Technical Debt Ratio (TDR) expresses accumulated code quality issues as a percentage of development effort:
```
TDR = (Remediation Cost / Development Cost) × 100
```

A TDR of 10% means fixing accumulated issues would take one hour for every ten hours of new development. The metric comes from SonarQube but the concept applies anywhere.
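As a quick illustration, here is a minimal Python sketch of the calculation; the hour figures are hypothetical estimates, not benchmarks.

```python
def technical_debt_ratio(remediation_hours: float, development_hours: float) -> float:
    """TDR = (remediation cost / development cost) * 100."""
    return remediation_hours / development_hours * 100

# Hypothetical effort estimates: 120 hours of fixes against 1,000 hours of development.
print(technical_debt_ratio(120, 1_000))  # 12.0 -> "starting to hurt" per the thresholds below
```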
Thresholds
| TDR Range | What it means |
|---|---|
| Below 5% | Sustainable |
| 5-10% | Acceptable |
| 10-20% | Starting to hurt |
| Above 20% | Something is wrong |
McKinsey's 2024 Digital Report found that organizations with high technical debt spend 40% more on maintenance and ship new features 25-50% slower.
The AI debt problem
Here's the trap: AI can reduce apparent remediation cost in the short term by generating code that passes automated checks. That same code accumulates maintenance burden through duplication, complexity, and missed refactoring.
It looks cheap now. The interest compounds later.
Some teams track an adjusted formula:
```
TDR(AI) = (Remediation Cost + AI-Introduced Debt) / (Development Cost − AI-Acceleration Benefit) × 100
```

The point isn't precision. The point is acknowledging that AI shifts when debt becomes visible, not whether debt exists.
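A minimal extension of the earlier sketch; all four inputs are hypothetical estimates you would have to supply yourself.

```python
def adjusted_tdr(remediation_hours: float, ai_debt_hours: float,
                 development_hours: float, ai_savings_hours: float) -> float:
    """TDR(AI): surface AI-introduced debt, discount the AI acceleration benefit."""
    return (remediation_hours + ai_debt_hours) / (development_hours - ai_savings_hours) * 100

# Hypothetical: the same 120 fix-hours plus 80 hours of AI-introduced debt,
# against 1,000 development hours of which 200 were "saved" by AI.
print(adjusted_tdr(120, 80, 1_000, 200))  # 25.0 -- the debt existed all along, just deferred
```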
Code churn
Code churn measures how quickly code gets revised or deleted after being written. GitClear defines churned code as "changes that were either incomplete or erroneous when they were authored."
An 84% increase
GitClear analyzed 211 million changed lines of code from 2020 to 2024:
| Year | New code revised within 2 weeks |
|---|---|
| 2020 | 3.1% |
| 2024 | 5.7% |
That 84% jump tracks the adoption curve of AI coding tools almost exactly.
Their 2024 report projected code churn would double from pre-AI baselines. The data confirms it.
What this means
High churn means code didn't work the first time. With AI code, that usually means:
- The agent misunderstood what you wanted
- It handled the happy path but missed edge cases
- Tests passed but production exposed the gaps
- Integration problems appeared only after merge
If you track churn by origin (AI-assisted versus traditional), you can see whether AI accelerates initial delivery at the cost of fixing things later.
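One rough way to get that split is to export per-commit churn from your git analytics tooling and aggregate by origin. The sketch below assumes a hypothetical `commit_churn.csv` export with `origin`, `lines_added`, and `lines_churned_14d` columns (lines from each commit revised or deleted within two weeks); the column names and the origin-tagging convention are assumptions, not a standard.

```python
import csv
from collections import defaultdict

def churn_rate_by_origin(path: str) -> dict[str, float]:
    """Percentage of newly added lines revised or deleted within 14 days, per origin tag."""
    added: dict[str, int] = defaultdict(int)
    churned: dict[str, int] = defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            origin = row["origin"]  # e.g. "ai-assisted" or "traditional"
            added[origin] += int(row["lines_added"])
            churned[origin] += int(row["lines_churned_14d"])
    return {o: churned[o] / added[o] * 100 for o in added if added[o]}

# Example output: {'ai-assisted': 6.8, 'traditional': 3.4} would mirror the GitClear pattern.
print(churn_rate_by_origin("commit_churn.csv"))
```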
The hidden math
Fast delivery followed by high churn can produce net negative productivity. A feature shipped in two days but requiring three days of fixes took five days. Traditional development might have taken four.
Churn metrics expose this. Velocity metrics alone hide it.
Duplication
Code duplication increased dramatically with AI. GitClear tracked a 4x growth in code clones attributed to AI assistance.
The numbers tell a story
| Metric | 2020 | 2024 | Change |
|---|---|---|---|
| Copy/pasted lines | 8.3% | 12.3% | +48% |
| Moved (refactored) lines | 24.1% | 9.5% | -60% |
| Duplicated blocks (5+ lines) | Baseline | 8x | +700% |
That 8x increase in duplicated blocks is genuinely striking. For the first time in GitClear's measurement history, developers paste code more often than they refactor or reuse it.
Why this matters more than it seems
Duplicated code multiplies maintenance burden. Fix a bug in one place, you need to fix it in every duplicated location. A 2023 study found 57% of co-changed cloned code was involved in bugs.
AI generates duplication because each prompt starts fresh. The agent doesn't know a nearly identical function exists three files away. It can't.
Detection tools
Standard tools catch this:
- SonarQube's duplicate detection
- PMD's Copy/Paste Detector (CPD)
- Simian for cross-file similarity
- ESLint's no-duplicate-imports and similar rules
Configure them to track trends over time, not just snapshots. Rising duplication percentages indicate accumulating debt regardless of whether individual PRs pass gates.
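However you export the numbers (SonarQube's duplication density, CPD totals, or similar), the useful signal is the trend. A minimal sketch, assuming you store one dated duplication percentage per scan in a JSON lines file; the file format and field names are made up for illustration.

```python
import json

def duplication_trend(path: str, window: int = 6) -> str:
    """Classify the recent duplication trend from dated snapshot percentages."""
    with open(path) as f:
        points = [json.loads(line) for line in f if line.strip()]
    # Each line looks like {"date": "2025-06-01", "dup_pct": 9.4} in this sketch.
    recent = [p["dup_pct"] for p in sorted(points, key=lambda p: p["date"])][-window:]
    if len(recent) < 2:
        return "not enough data"
    delta = recent[-1] - recent[0]
    if delta > 0.5:
        return f"rising ({recent[0]:.1f}% -> {recent[-1]:.1f}%): debt accumulating"
    if recent[-1] > 10:
        return f"flat but high ({recent[-1]:.1f}%): persistent problem"
    return f"stable or declining ({recent[-1]:.1f}%)"

print(duplication_trend("duplication_history.jsonl"))
```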
Defect rates
CodeRabbit analyzed 470 GitHub pull requests (320 AI-coauthored, 150 human-only) in December 2025:
| Metric | AI-Generated | Human-Written | Ratio |
|---|---|---|---|
| Issues per PR | 10.83 | 6.45 | 1.7x |
| Critical issues | Higher | Baseline | 1.4x |
| Major issues | Higher | Baseline | 1.7x |
Where AI fails and succeeds
| Issue type | AI versus human |
|---|---|
| Performance inefficiencies | ~8x more in AI code |
| Readability problems | 3x more in AI code |
| Logic and correctness errors | 75% more in AI code |
| Security vulnerabilities | 1.5-2x more in AI code |
| Spelling errors | 1.76x more in human code |
| Testability issues | 1.32x more in human code |
AI avoids typos and produces testable structures. It struggles with performance, business logic, and security.
Tracking what matters
Effective defect tracking for AI code means:
- Tag origin: Mark PRs or commits as AI-assisted versus traditional
- Categorize defects: Separate logic errors, security issues, performance problems, style violations
- Time discovery: Track when defects surface in review, testing, or production
- Measure remediation cost: Time to fix matters more than defect count
A single security vulnerability in production costs orders of magnitude more than ten style issues caught in review. Raw counts without severity weighting mislead.
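A sketch of a severity-weighted defect score; the weights and stage multipliers are illustrative, not standard values, and should be tuned to your own remediation-cost data.

```python
# Illustrative severity weights -- calibrate against your own fix costs.
WEIGHTS = {"critical": 50, "major": 10, "minor": 3, "style": 1}
# Where a defect is found also scales its cost: production escapes dominate.
STAGE_MULTIPLIER = {"review": 1, "testing": 3, "production": 15}

def weighted_defect_score(defects: list[dict]) -> float:
    """Sum of severity weight x discovery-stage multiplier across tracked defects."""
    return sum(WEIGHTS[d["severity"]] * STAGE_MULTIPLIER[d["stage"]] for d in defects)

sample = [{"severity": "critical", "stage": "production"},
          {"severity": "style", "stage": "review"}]
print(weighted_defect_score(sample))  # 751 -- one escaped critical dwarfs the style nit
```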
The METR productivity paradox
This is the finding I keep coming back to.
The Model Evaluation and Threat Research (METR) organization ran a randomized controlled trial from February to June 2025. Sixteen experienced open-source developers worked on 246 real tasks.
The core finding
| Metric | Value |
|---|---|
| Actual impact | 19% slower with AI tools |
| Developer prediction before study | 24% faster |
| Developer perception after study | 20% faster |
| Perception gap | 39 percentage points |
Developers believed they were 20% faster while actually being 19% slower. The gap approaches 40 percentage points.
Study conditions
These weren't novices on unfamiliar codebases:
- Participants averaged five years of experience with their repositories
- Repositories averaged 22,000+ stars and 1 million+ lines of code
- Tools included Cursor Pro with Claude Sonnet models
- Over 140 hours of screen recordings analyzed
What slowed them down
The slowdown came from time spent crafting prompts, reviewing AI suggestions, and integrating outputs with established codebases. There was also cognitive overhead from context-switching between directing and coding.
The counterintuitive part: developers highly familiar with codebases were slowed down more, not less. Experts who could write code quickly found AI assistance disrupted their established workflow. Novices, with less to disrupt, sometimes benefited more.
What this means for measurement
If developers believe they're faster while being slower, self-reported productivity metrics are worthless. Organizations need objective measurement: actual time to complete defined tasks, defects per feature, time in rework, end-to-end cycle time.
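End-to-end cycle time, for instance, can be measured directly from pull request timestamps rather than from how fast developers felt. A minimal sketch, assuming a hypothetical `pull_requests.csv` export with ISO 8601 `opened_at` and `merged_at` columns:

```python
import csv
from datetime import datetime
from statistics import median

def median_cycle_time_hours(path: str) -> float:
    """Median hours from PR opened to merged -- an objective proxy for delivery speed."""
    durations = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if not row["merged_at"]:
                continue  # skip PRs that never merged
            opened = datetime.fromisoformat(row["opened_at"])
            merged = datetime.fromisoformat(row["merged_at"])
            durations.append((merged - opened).total_seconds() / 3600)
    return median(durations)

print(median_cycle_time_hours("pull_requests.csv"))
```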
The METR study doesn't prove AI always slows development. It proves perception and reality can diverge dramatically. That makes measurement non-optional.
The DORA amplifier effect
Google's DORA 2025 State of AI-Assisted Software Development report surveyed nearly 5,000 technology professionals with 78 in-depth interviews.
The central finding deserves quoting:
"AI's primary role in software development is that of an amplifier. It magnifies the strengths of high-performing organisations and the dysfunctions of struggling ones."
Strong teams with solid practices see AI accelerate their advantages. Struggling teams see AI intensify their problems. Teams without clear direction "build the wrong things faster."
Individual versus organizational metrics
| Level | AI impact |
|---|---|
| Tasks completed (individual) | +21% |
| Pull requests merged (individual) | +98% |
| Code review time | +91% |
| PR size | +154% |
| Bug rates | +9% |
| Organizational delivery | Flat |
Individual productivity metrics improve dramatically. Organizational delivery stays flat. The gains disappear into downstream bottlenecks.
The new stability metric
DORA 2025 added "rework rate" as a fifth core metric, measuring unplanned deployments caused by production issues. Only 7.3% of teams report rework rates below 2%. Over 26% experience rework rates between 8% and 16%.
This metric captures what happens when you ship faster without shipping better.
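A minimal sketch of computing rework rate from deployment records, assuming each record carries a flag marking it as an unplanned fix for a production issue; how that flag gets set depends on your deployment tooling.

```python
def rework_rate(deployments: list[dict]) -> float:
    """Share of deployments that were unplanned fixes for production issues."""
    if not deployments:
        return 0.0
    unplanned = sum(1 for d in deployments if d.get("unplanned_fix"))
    return unplanned / len(deployments) * 100

# Hypothetical month: 50 deployments, 6 of them unplanned production fixes.
sample = [{"unplanned_fix": i < 6} for i in range(50)]
print(rework_rate(sample))  # 12.0 -- inside the 8-16% band over a quarter of teams report
```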
The stability trade-off
For every 25% increase in AI adoption, DORA observed:
- 7.2% decrease in delivery stability
- 1.5% decrease in delivery throughput
AI boosts individual throughput while degrading system stability. Without metrics that capture both, you only see the gains.
Building a dashboard
Combine multiple metrics for a coherent picture:
| Metric | Source | Why it matters for AI |
|---|---|---|
| Technical Debt Ratio | SonarQube | May mask long-term debt |
| Code churn (2-week) | Git analytics | 84% higher with AI |
| Duplication rate | SonarQube, CPD | 4-8x increase with AI |
| Defect density | Issue tracking | 1.7x more issues |
| Rework rate | Deployment logs | DORA stability indicator |
| PR size | Git analytics | 154% larger with AI |
| Review cycle time | PR metrics | 91% longer with AI volume |
Segment by origin
Track metrics separately for:
- AI-assisted (agent generated most of the code)
- AI-augmented (human wrote with AI suggestions)
- Traditional (human wrote without AI)
This tells you whether AI-specific patterns appear in your codebase, not just industry studies.
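One lightweight way to segment is a commit message line or trailer set at authoring time. The `AI-Origin:` convention below is made up for illustration, not a git standard; the sketch counts commits per tag by scanning message bodies.

```python
import subprocess
from collections import Counter

def commit_counts_by_origin(rev_range: str = "HEAD") -> Counter:
    """Count commits per origin, read from a (made-up) 'AI-Origin:' line in commit messages."""
    # %B is the raw body; %x00 separates commits so multi-line messages parse unambiguously.
    out = subprocess.run(
        ["git", "log", rev_range, "--pretty=format:%B%x00"],
        capture_output=True, text=True, check=True,
    ).stdout
    counts: Counter = Counter()
    for body in filter(None, out.split("\x00")):
        tags = [line.split(":", 1)[1].strip().lower()
                for line in body.splitlines() if line.lower().startswith("ai-origin:")]
        counts[tags[0] if tags else "traditional"] += 1
    return counts

# e.g. Counter({'traditional': 412, 'ai-assisted': 188, 'ai-augmented': 95})
print(commit_counts_by_origin("main"))
```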
Watch trends, not snapshots
Point-in-time metrics mislead. Track trends over rolling windows: weekly churn, monthly duplication, quarterly defect density, release-over-release rework.
Rising trends signal accumulating problems. Flat trends with high values signal persistent problems. Declining trends signal improvement regardless of current levels.
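A sketch of rolling-window aggregation with pandas, reusing the hypothetical `commit_churn.csv` export from earlier (a `date` column plus `lines_added` and `lines_churned_14d`); the column names are illustrative.

```python
import pandas as pd

def weekly_churn_trend(df: pd.DataFrame) -> pd.Series:
    """Weekly churn rate (% of added lines revised within 14 days), smoothed over 4 weeks."""
    weekly = df.set_index("date").resample("W")[["lines_added", "lines_churned_14d"]].sum()
    churn_pct = weekly["lines_churned_14d"] / weekly["lines_added"] * 100
    return churn_pct.rolling(window=4, min_periods=1).mean()  # rolling four-week average

# df = pd.read_csv("commit_churn.csv", parse_dates=["date"])
# print(weekly_churn_trend(df).tail())
```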
Set alerts
```yaml
alerts:
  churn_rate:
    warning: 6%
    critical: 8%
  duplication:
    warning: 10%
    critical: 15%
  defect_density:
    warning: 1.5x baseline
    critical: 2x baseline
  rework_rate:
    warning: 8%
    critical: 12%
```

Establish pre-AI baselines before expanding AI adoption. Without them, you can't make meaningful comparisons.
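A sketch of evaluating the baseline-relative thresholds (the 1.5x and 2x defect-density rules); the function and its defaults mirror the config above rather than any particular monitoring tool.

```python
def defect_density_alert(current: float, baseline: float,
                         warning_mult: float = 1.5, critical_mult: float = 2.0) -> str:
    """Compare current defect density (e.g. issues per PR) to the pre-AI baseline."""
    if current >= baseline * critical_mult:
        return "critical"
    if current >= baseline * warning_mult:
        return "warning"
    return "ok"

# Roughly the CodeRabbit per-PR figures from earlier: 6.45 baseline, 10.83 now.
print(defect_density_alert(10.83, 6.45))  # "warning" -- about 1.7x baseline
```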
Why measurement matters here
The METR study's 39-point perception gap should concern every engineering organization. Without objective metrics, you cannot tell genuine productivity gains from comfortable illusion.
The patterns are consistent: higher churn, more duplication, more defects per PR, stability traded for throughput. Organizations that measure these patterns can manage them. Organizations that don't measure them won't know they exist until maintenance costs overwhelm the apparent savings.
The goal isn't proving AI harmful or helpful. The goal is knowing which applies to your situation, with your team, on your codebase. That requires measurement.