Metrics and Monitoring for AI Code Quality
The measurement problem
AI-assisted development breaks traditional metrics. Lines of code per day? Meaningless when an agent generates thousands in minutes. Commits per week? Equally useless.
New metrics matter because AI code fails in new ways. Without measuring those failure modes specifically, organizations mistake the illusion of progress for the real thing.
Technical Debt Ratio
Technical Debt Ratio (TDR) expresses accumulated code quality issues as a percentage of development effort:
```
TDR = (Remediation Cost / Development Cost) × 100
```

A TDR of 10% means fixing accumulated issues would take one hour for every ten hours of new development. The metric comes from SonarQube but the concept applies anywhere.
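As a quick illustration, here is a minimal Python sketch of the calculation; the hour figures are hypothetical estimates, not benchmarks.

```python
def technical_debt_ratio(remediation_hours: float, development_hours: float) -> float:
    """TDR = (remediation cost / development cost) * 100."""
    return remediation_hours / development_hours * 100

# Hypothetical effort estimates: 120 hours of fixes against 1,000 hours of development.
print(technical_debt_ratio(120, 1_000))  # 12.0 -> "starting to hurt" per the thresholds below
```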
Thresholds
| TDR Range | What it means |
|---|---|
| Below 5% | Sustainable |
| 5-10% | Acceptable |
| 10-20% | Starting to hurt |
| Above 20% | Something is wrong |
McKinsey's 2024 Digital Report found that organizations with high technical debt spend 40% more on maintenance and ship new features 25-50% slower.
The AI debt problem
Here's the trap: AI can reduce apparent remediation cost in the short term by generating code that passes automated checks. That same code accumulates maintenance burden through duplication, complexity, and missed refactoring.
It looks cheap now. The interest compounds later.
Some teams track an adjusted formula:
```
TDR(AI) = (Remediation Cost + AI-Introduced Debt) / (Development Cost − AI-Acceleration Benefit) × 100
```

The point isn't precision. The point is acknowledging that AI shifts when debt becomes visible, not whether debt exists.
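A minimal extension of the earlier sketch; all four inputs are hypothetical estimates you would have to supply yourself.

```python
def adjusted_tdr(remediation_hours: float, ai_debt_hours: float,
                 development_hours: float, ai_savings_hours: float) -> float:
    """TDR(AI): surface AI-introduced debt, discount the AI acceleration benefit."""
    return (remediation_hours + ai_debt_hours) / (development_hours - ai_savings_hours) * 100

# Hypothetical: the same 120 fix-hours plus 80 hours of AI-introduced debt,
# against 1,000 development hours of which 200 were "saved" by AI.
print(adjusted_tdr(120, 80, 1_000, 200))  # 25.0 -- the debt existed all along, just deferred
```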
Code churn
Code churn measures how quickly code gets revised or deleted after being written. GitClear defines churned code as "changes that were either incomplete or erroneous when they were authored."
An 84% increase
GitClear analyzed 211 million changed lines of code from 2020 to 2024:
| Year | New code revised within 2 weeks |
|---|---|
| 2020 | 3.1% |
| 2024 | 5.7% |
That 84% jump tracks the adoption curve of AI coding tools almost exactly.
Their 2024 report projected code churn would double from pre-AI baselines. The data confirms it.
What this means
High churn means code didn't work the first time. With AI code, that usually means:
- The agent misunderstood what you wanted
- It handled the happy path but missed edge cases
- Tests passed but production exposed the gaps
- Integration problems appeared only after merge
If you track churn by origin (AI-assisted versus traditional), you can see whether AI accelerates initial delivery at the cost of fixing things later.
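One rough way to get that split is to export per-commit churn from your git analytics tooling and aggregate by origin. The sketch below assumes a hypothetical `commit_churn.csv` export with `origin`, `lines_added`, and `lines_churned_14d` columns (lines from each commit revised or deleted within two weeks); the column names and the origin-tagging convention are assumptions, not a standard.

```python
import csv
from collections import defaultdict

def churn_rate_by_origin(path: str) -> dict[str, float]:
    """Percentage of newly added lines revised or deleted within 14 days, per origin tag."""
    added: dict[str, int] = defaultdict(int)
    churned: dict[str, int] = defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            origin = row["origin"]  # e.g. "ai-assisted" or "traditional"
            added[origin] += int(row["lines_added"])
            churned[origin] += int(row["lines_churned_14d"])
    return {o: churned[o] / added[o] * 100 for o in added if added[o]}

# Example output: {'ai-assisted': 6.8, 'traditional': 3.4} would mirror the GitClear pattern.
print(churn_rate_by_origin("commit_churn.csv"))
```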
The hidden math
Fast delivery followed by high churn can produce net negative productivity. A feature shipped in two days but requiring three days of fixes took five days. Traditional development might have taken four.
Churn metrics expose this. Velocity metrics alone hide it.
Duplication
Code duplication increased dramatically with AI. GitClear tracked a 4x growth in code clones attributed to AI assistance.
The numbers tell a story
| Metric | 2020 | 2024 | Change |
|---|---|---|---|
| Copy/pasted lines | 8.3% | 12.3% | +48% |
| Moved (refactored) lines | 24.1% | 9.5% | -60% |
| Duplicated blocks (5+ lines) | Baseline | 8x | +700% |
That 8x increase in duplicated blocks is genuinely striking. For the first time in GitClear's measurement history, developers paste code more often than they refactor or reuse it.
Why this matters more than it seems
Duplicated code multiplies maintenance burden. Fix a bug in one place, you need to fix it in every duplicated location. A 2023 study found 57% of co-changed cloned code was involved in bugs.
AI generates duplication because each prompt starts fresh. The agent doesn't know a nearly identical function exists three files away. It can't.
Detection tools
Standard tools catch this:
- SonarQube's duplicate detection
- PMD's Copy/Paste Detector (CPD)
- Simian for cross-file similarity
- ESLint's no-duplicate-imports and similar rules
Configure them to track trends over time, not just snapshots. Rising duplication percentages indicate accumulating debt regardless of whether individual PRs pass gates.
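However you export the numbers (SonarQube's duplication density, CPD totals, or similar), the useful signal is the trend. A minimal sketch, assuming you store one dated duplication percentage per scan in a JSON lines file; the file format and field names are made up for illustration.

```python
import json

def duplication_trend(path: str, window: int = 6) -> str:
    """Classify the recent duplication trend from dated snapshot percentages."""
    with open(path) as f:
        points = [json.loads(line) for line in f if line.strip()]
    # Each line looks like {"date": "2025-06-01", "dup_pct": 9.4} in this sketch.
    recent = [p["dup_pct"] for p in sorted(points, key=lambda p: p["date"])][-window:]
    if len(recent) < 2:
        return "not enough data"
    delta = recent[-1] - recent[0]
    if delta > 0.5:
        return f"rising ({recent[0]:.1f}% -> {recent[-1]:.1f}%): debt accumulating"
    if recent[-1] > 10:
        return f"flat but high ({recent[-1]:.1f}%): persistent problem"
    return f"stable or declining ({recent[-1]:.1f}%)"

print(duplication_trend("duplication_history.jsonl"))
```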
Defect rates
CodeRabbit analyzed 470 GitHub pull requests (320 AI-coauthored, 150 human-only) in December 2025:
| Metric | AI-Generated | Human-Written | Ratio |
|---|---|---|---|
| Issues per PR | 10.83 | 6.45 | 1.7x |
| Critical issues | Higher | Baseline | 1.4x |
| Major issues | Higher | Baseline | 1.7x |
Where AI fails and succeeds
| Issue type | AI versus human |
|---|---|
| Performance inefficiencies | ~8x more in AI code |
| Readability problems | 3x more in AI code |
| Logic and correctness errors | 75% more in AI code |
| Security vulnerabilities | 1.5-2x more in AI code |
| Spelling errors | 1.76x more in human code |
| Testability issues | 1.32x more in human code |
AI avoids typos and produces testable structures. It struggles with performance, business logic, and security.
Tracking what matters
Effective defect tracking for AI code means:
- Tag origin: Mark PRs or commits as AI-assisted versus traditional
- Categorize defects: Separate logic errors, security issues, performance problems, style violations
- Time discovery: Track when defects surface in review, testing, or production
- Measure remediation cost: Time to fix matters more than defect count
A single security vulnerability in production costs orders of magnitude more than ten style issues caught in review. Raw counts without severity weighting mislead.
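A sketch of a severity-weighted defect score; the weights and stage multipliers are illustrative, not standard values, and should be tuned to your own remediation-cost data.

```python
# Illustrative severity weights -- calibrate against your own fix costs.
WEIGHTS = {"critical": 50, "major": 10, "minor": 3, "style": 1}
# Where a defect is found also scales its cost: production escapes dominate.
STAGE_MULTIPLIER = {"review": 1, "testing": 3, "production": 15}

def weighted_defect_score(defects: list[dict]) -> float:
    """Sum of severity weight x discovery-stage multiplier across tracked defects."""
    return sum(WEIGHTS[d["severity"]] * STAGE_MULTIPLIER[d["stage"]] for d in defects)

sample = [{"severity": "critical", "stage": "production"},
          {"severity": "style", "stage": "review"}]
print(weighted_defect_score(sample))  # 751 -- one escaped critical dwarfs the style nit
```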
The METR productivity paradox
This is the finding I keep coming back to.
The Model Evaluation and Threat Research (METR) organization ran a randomized controlled trial from February to June 2025. Sixteen experienced open-source developers worked on 246 real tasks.
The core finding
| Metric | Value |
|---|---|
| Actual impact | 19% slower with AI tools |
| Developer prediction before study | 24% faster |
| Developer perception after study | 20% faster |
| Perception gap | 39 percentage points |
Developers believed they were 20% faster while actually being 19% slower. The gap approaches 40 percentage points.
Study conditions
These weren't novices on unfamiliar codebases:
- Participants averaged five years of experience with their repositories
- Repositories averaged 22,000+ stars and 1 million+ lines of code
- Tools included Cursor Pro with Claude Sonnet models
- Over 140 hours of screen recordings analyzed
What slowed them down
The slowdown came from time spent crafting prompts, reviewing AI suggestions, and integrating outputs with established codebases. There was also cognitive overhead from context-switching between directing and coding.
The counterintuitive part: developers highly familiar with codebases were slowed down more, not less. Experts who could write code quickly found AI assistance disrupted their established workflow. Novices, with less to disrupt, sometimes benefited more.
What this means for measurement
If developers believe they're faster while being slower, self-reported productivity metrics are worthless. Organizations need objective measurement: actual time to complete defined tasks, defects per feature, time in rework, end-to-end cycle time.
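End-to-end cycle time, for instance, can be measured directly from pull request timestamps rather than from how fast developers felt. A minimal sketch, assuming a hypothetical `pull_requests.csv` export with ISO 8601 `opened_at` and `merged_at` columns:

```python
import csv
from datetime import datetime
from statistics import median

def median_cycle_time_hours(path: str) -> float:
    """Median hours from PR opened to merged -- an objective proxy for delivery speed."""
    durations = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if not row["merged_at"]:
                continue  # skip PRs that never merged
            opened = datetime.fromisoformat(row["opened_at"])
            merged = datetime.fromisoformat(row["merged_at"])
            durations.append((merged - opened).total_seconds() / 3600)
    return median(durations)

print(median_cycle_time_hours("pull_requests.csv"))
```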
The METR study doesn't prove AI always slows development. It proves perception and reality can diverge dramatically. That makes measurement non-optional.
The DORA amplifier effect
Google's DORA 2025 State of AI-Assisted Software Development report surveyed nearly 5,000 technology professionals with 78 in-depth interviews.
The central finding deserves quoting:
"AI's primary role in software development is that of an amplifier. It magnifies the strengths of high-performing organisations and the dysfunctions of struggling ones."
Strong teams with solid practices see AI accelerate their advantages. Struggling teams see AI intensify their problems. Teams without clear direction "build the wrong things faster."
Individual versus organizational metrics
| Level | AI impact |
|---|---|
| Tasks completed (individual) | +21% |
| Pull requests merged (individual) | +98% |
| Code review time | +91% |
| PR size | +154% |
| Bug rates | +9% |
| Organizational delivery | Flat |
Individual productivity metrics improve dramatically. Organizational delivery stays flat. The gains disappear into downstream bottlenecks.
The new stability metric
DORA 2025 added "rework rate" as a fifth core metric, measuring unplanned deployments caused by production issues. Only 7.3% of teams report rework rates below 2%. Over 26% experience rework rates between 8% and 16%.
This metric captures what happens when you ship faster without shipping better.
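A minimal sketch of computing rework rate from deployment records, assuming each record carries a flag marking it as an unplanned fix for a production issue; how that flag gets set depends on your deployment tooling.

```python
def rework_rate(deployments: list[dict]) -> float:
    """Share of deployments that were unplanned fixes for production issues."""
    if not deployments:
        return 0.0
    unplanned = sum(1 for d in deployments if d.get("unplanned_fix"))
    return unplanned / len(deployments) * 100

# Hypothetical month: 50 deployments, 6 of them unplanned production fixes.
sample = [{"unplanned_fix": i < 6} for i in range(50)]
print(rework_rate(sample))  # 12.0 -- inside the 8-16% band over a quarter of teams report
```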
The stability trade-off
For every 25% increase in AI adoption, DORA observed:
- 7.2% decrease in delivery stability
- 1.5% decrease in delivery throughput
AI boosts individual throughput while degrading system stability. Without metrics that capture both, you only see the gains.
Building a dashboard
Combine multiple metrics for a coherent picture:
| Metric | Source | Why it matters for AI |
|---|---|---|
| Technical Debt Ratio | SonarQube | May mask long-term debt |
| Code churn (2-week) | Git analytics | 84% higher with AI |
| Duplication rate | SonarQube, CPD | 4-8x increase with AI |
| Defect density | Issue tracking | 1.7x more issues |
| Rework rate | Deployment logs | DORA stability indicator |
| PR size | Git analytics | 154% larger with AI |
| Review cycle time | PR metrics | 91% longer with AI volume |
Segment by origin
Track metrics separately for:
- AI-assisted (agent generated most of the code)
- AI-augmented (human wrote with AI suggestions)
- Traditional (human wrote without AI)
This tells you whether AI-specific patterns appear in your codebase, not just industry studies.
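One lightweight way to segment is a commit message line or trailer set at authoring time. The `AI-Origin:` convention below is made up for illustration, not a git standard; the sketch counts commits per tag by scanning message bodies.

```python
import subprocess
from collections import Counter

def commit_counts_by_origin(rev_range: str = "HEAD") -> Counter:
    """Count commits per origin, read from a (made-up) 'AI-Origin:' line in commit messages."""
    # %B is the raw body; %x00 separates commits so multi-line messages parse unambiguously.
    out = subprocess.run(
        ["git", "log", rev_range, "--pretty=format:%B%x00"],
        capture_output=True, text=True, check=True,
    ).stdout
    counts: Counter = Counter()
    for body in filter(None, out.split("\x00")):
        tags = [line.split(":", 1)[1].strip().lower()
                for line in body.splitlines() if line.lower().startswith("ai-origin:")]
        counts[tags[0] if tags else "traditional"] += 1
    return counts

# e.g. Counter({'traditional': 412, 'ai-assisted': 188, 'ai-augmented': 95})
print(commit_counts_by_origin("main"))
```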
Watch trends, not snapshots
Point-in-time metrics mislead. Track trends over rolling windows: weekly churn, monthly duplication, quarterly defect density, release-over-release rework.
Rising trends signal accumulating problems. Flat trends with high values signal persistent problems. Declining trends signal improvement regardless of current levels.
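A sketch of rolling-window aggregation with pandas, reusing the hypothetical `commit_churn.csv` export from earlier (a `date` column plus `lines_added` and `lines_churned_14d`); the column names are illustrative.

```python
import pandas as pd

def weekly_churn_trend(df: pd.DataFrame) -> pd.Series:
    """Weekly churn rate (% of added lines revised within 14 days), smoothed over 4 weeks."""
    weekly = df.set_index("date").resample("W")[["lines_added", "lines_churned_14d"]].sum()
    churn_pct = weekly["lines_churned_14d"] / weekly["lines_added"] * 100
    return churn_pct.rolling(window=4, min_periods=1).mean()  # rolling four-week average

# df = pd.read_csv("commit_churn.csv", parse_dates=["date"])
# print(weekly_churn_trend(df).tail())
```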
Set alerts
```yaml
alerts:
  churn_rate:
    warning: 6%
    critical: 8%
  duplication:
    warning: 10%
    critical: 15%
  defect_density:
    warning: 1.5x baseline
    critical: 2x baseline
  rework_rate:
    warning: 8%
    critical: 12%
```

Establish pre-AI baselines before expanding AI adoption. Without them, you can't make meaningful comparisons.
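A sketch of evaluating the baseline-relative thresholds (the 1.5x and 2x defect-density rules); the function and its defaults mirror the config above rather than any particular monitoring tool.

```python
def defect_density_alert(current: float, baseline: float,
                         warning_mult: float = 1.5, critical_mult: float = 2.0) -> str:
    """Compare current defect density (e.g. issues per PR) to the pre-AI baseline."""
    if current >= baseline * critical_mult:
        return "critical"
    if current >= baseline * warning_mult:
        return "warning"
    return "ok"

# Roughly the CodeRabbit per-PR figures from earlier: 6.45 baseline, 10.83 now.
print(defect_density_alert(10.83, 6.45))  # "warning" -- about 1.7x baseline
```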
Why measurement matters here
The METR study's 39-point perception gap should concern every engineering organization. Without objective metrics, you cannot tell genuine productivity gains from comfortable illusion.
The patterns are consistent: higher churn, more duplication, more defects per PR, stability traded for throughput. Organizations that measure these patterns can manage them. Organizations that don't measure them won't know they exist until maintenance costs overwhelm the apparent savings.
The goal isn't proving AI harmful or helpful. The goal is knowing which applies to your situation, with your team, on your codebase. That requires measurement.