AI & Engineering · March 20, 2026 · 11 min read

AI Code Is Breaking Production: What DORA 2025 Found

The 2025 DORA State of DevOps Report contained a finding that surprised a lot of engineering leaders: teams with high AI coding assistant adoption showed 2.3× higher delivery instability than teams with low adoption, after controlling for team size, industry, and deployment frequency. This is not a reason to stop using AI tools. But it is a clear signal that the way most teams are using them creates a risk they are not measuring.

The Key Finding

High AI adoption + unchanged review process = 2.3× delivery instability. High AI adoption + differentiated review process = instability roughly equivalent to low-AI-adoption teams. The tool is not the variable. The review process is.

Why the 2.3× Number Matters

Delivery instability in the DORA framework is a composite measure — it combines change failure rate and rework rate into a single index. A 2.3× increase means that for every unit of instability a low-AI-adoption team experiences, a comparable AI-heavy team experiences 2.3. That is not a marginal regression. It is a systematic failure in how organizations have scaled AI adoption.
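To make the arithmetic concrete, here is a toy version of a composite instability index. DORA does not publish its exact formula; the weights and the sample rates below are illustrative assumptions only, chosen so the ratio comes out to the reported 2.3×.

```python
def instability_index(change_failure_rate: float, rework_rate: float,
                      weights: tuple[float, float] = (0.5, 0.5)) -> float:
    """Toy composite of change failure rate and rework rate.

    Illustrative only -- DORA does not publish the exact formula;
    this is a simple weighted average for intuition.
    """
    w_cfr, w_rework = weights
    return w_cfr * change_failure_rate + w_rework * rework_rate

# Hypothetical rates for a low-adoption baseline vs. a high-adoption team:
baseline = instability_index(0.10, 0.12)    # ~0.11
high_ai = instability_index(0.23, 0.276)    # ~0.25
print(round(high_ai / baseline, 1))  # -> 2.3
```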

The finding is nuanced in an important way: the correlation is not between AI adoption and instability. It is between AI adoption without review process adaptation and instability. Teams in the study that had deliberately modified their code review process to account for AI-generated code — applying different scrutiny, different coverage requirements, different CODEOWNERS routing — did not show the 2.3× effect. Their instability was comparable to low-AI-adoption teams.

This is actionable. The problem is not that AI generates bad code (though sometimes it does). The problem is that humans review AI code the same way they review code from a trusted colleague who owns the service — with implicit trust shortcuts that AI-generated code has not earned.

Why AI-Generated Code Bypasses Expert Pattern Recognition

When an experienced engineer reviews a PR from a colleague who owns a service, they apply a pattern-matching process that combines the written code with implicit knowledge about the author. "Sarah owns the auth service — if she made this change, she probably already considered the edge case with expired tokens." This trust heuristic is usually appropriate for human authors. It is systematically inappropriate for AI agents.

AI coding assistants — Copilot, Cursor, Devin, Gemini Code Assist — have no service ownership history, no accumulated context about which files are high-stakes, no institutional memory about past incidents, and no understanding of which changes have caused problems before. They generate code that looks authoritative and is often syntactically correct, but lacks the judgment that comes from domain expertise.

When reviewers see a clean, complete, well-formatted PR, they apply less scrutiny — not more. The cognitive bias is toward approving things that look finished. AI-generated PRs are very good at looking finished. The review gate that is supposed to catch architectural mistakes gets compressed because the code looks like it was written by someone who knows what they are doing.

The Specific Failure Modes

DORA 2025 categorized the incidents attributed to AI-generated code into four clusters:

Failure Mode               % of AI Incidents   Root Cause
Integration failures       41%                 AI code does not understand cross-service contracts
Performance regressions    28%                 Missing caching, N+1 queries, unoptimized algorithms
Security vulnerabilities   19%                 Missing input validation, unsafe deserialization, injection vectors
Data corruption            12%                 Missing transaction handling, race conditions, schema assumptions

Integration failures (41%)

The largest category. AI agents are excellent at implementing a feature in isolation and poor at understanding how that feature interacts with existing system behavior. An AI-generated API endpoint might correctly handle its own input validation but not account for the rate limiting strategy of the service it calls downstream. An AI-generated background job might not honor the existing retry logic and create duplicate processing under failure conditions.

Human engineers who work on a service daily accumulate implicit knowledge about these cross-cutting concerns. AI models have access only to what is written in the codebase they can see — and many of these integration contracts are not written down anywhere.
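The duplicate-processing hazard described above is worth seeing in miniature. The sketch below uses hypothetical names and an in-memory set as a stand-in for a durable deduplication store; the point is that a handler which checks an idempotency key stays correct when retry logic redelivers the same job.

```python
# Minimal sketch: a job handler that is not idempotent will double-charge
# when a queue's retry logic redelivers a message. All names here are
# hypothetical illustrations, not a real queue API.

processed_ids: set[str] = set()   # stand-in for a durable dedup store
charges: list[int] = []

def handle_payment_job(job_id: str, amount: int) -> None:
    """Idempotent handler: redelivery of the same job_id is a no-op."""
    if job_id in processed_ids:
        return                    # already handled -- safe under retries
    charges.append(amount)
    processed_ids.add(job_id)

# Simulate retry logic redelivering the same job after a timeout:
handle_payment_job("job-42", 500)
handle_payment_job("job-42", 500)  # redelivery
print(sum(charges))  # -> 500, not 1000
```

An AI-generated handler that omits the `processed_ids` check passes every unit test that delivers each job exactly once — which is exactly what happens in development.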

Performance regressions (28%)

AI models optimize for correctness, not performance. An AI-generated query that is correct but unoptimized will pass every test and every review against the 10,000 rows in a development database, then time out on 10 million rows in production. N+1 query patterns are particularly common in AI-generated ORM code because the model sees each individual query work correctly in isolation.
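The N+1 pattern is easiest to see in plain SQL, stripped of any ORM. The sketch below (using SQLite purely for illustration) runs the same lookup both ways: both produce identical results, which is why the slow version survives review.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO posts VALUES (1, 1, 't1'), (2, 2, 't2'), (3, 3, 't3');
""")

# N+1: one query for the list, then one query per row. Correct, and
# invisible at dev-scale row counts, but O(rows) round trips in production.
authors = conn.execute("SELECT id, name FROM authors").fetchall()
n_plus_one = [
    (name, conn.execute("SELECT title FROM posts WHERE author_id = ?",
                        (aid,)).fetchone()[0])
    for aid, name in authors
]

# The fix: a single JOIN -- one round trip regardless of row count.
joined = conn.execute("""
    SELECT a.name, p.title FROM authors a
    JOIN posts p ON p.author_id = a.id ORDER BY a.id
""").fetchall()

print(n_plus_one == [tuple(r) for r in joined])  # -> True
```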

The Signals That Matter for AI-Generated PRs

Given these failure modes, the signals you should weight differently for AI-generated code in a risk scoring system are:

  • Author file expertise: For AI-generated PRs, this is always zero — the AI agent has never owned this file. Weight this signal heavily for AI-authored changes in ways you would not for a human expert.
  • Test coverage delta: AI code often ships with plausible-looking unit tests that cover the happy path but miss edge cases specifically related to the integration points where failures occur. A positive coverage delta for AI code is less reassuring than for human code.
  • Change scope (files touched): AI agents tend to produce changes that touch more files than necessary to solve the problem. High file count in an AI PR correlates with higher integration failure risk.
  • CODEOWNERS overlap: If an AI-generated PR touches files owned by teams that were not explicitly involved in the PR, that is a strong signal for elevated risk.
  • Review thoroughness: AI PRs that are approved quickly (under 10 minutes) with a minimal comment thread should be flagged. A quick approval on AI code signals a shallow review, not high-quality code.
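The five signals above can be combined into a simple additive score. The weights and thresholds below are illustrative assumptions for a sketch — not Koalr's actual model or anyone's published formula.

```python
def ai_pr_risk_score(
    author_file_expertise: float,       # 0.0 for AI agents, by definition
    coverage_delta: float,              # change in test coverage, e.g. +0.02
    files_touched: int,
    codeowners_teams_not_involved: int, # owning teams absent from the PR
    review_minutes: float,
) -> float:
    """Toy additive risk score over the five signals above.

    Weights are illustrative assumptions. Higher is riskier; roughly 0-1.
    """
    score = 0.0
    score += 0.30 * (1.0 - author_file_expertise)    # no ownership history
    score += 0.15 if coverage_delta <= 0 else 0.05   # AI coverage discounted
    score += min(files_touched / 50.0, 1.0) * 0.20   # broad change scope
    score += min(codeowners_teams_not_involved, 3) / 3 * 0.20
    score += 0.15 if review_minutes < 10 else 0.0    # rubber-stamp review
    return round(score, 2)

# An AI-authored PR touching 25 files, approved in 6 minutes:
print(ai_pr_risk_score(0.0, +0.03, 25, 2, 6.0))  # -> 0.73
```

Note the deliberate asymmetry: a positive coverage delta still contributes some risk for AI code, reflecting the happy-path-test problem described above.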

What to Actually Do

The teams in the 2025 DORA study that maintained low instability with high AI adoption shared three practices:

Explicit AI PR labeling. They labeled or tagged PRs that contained significant AI-generated code (typically defined as >30% of lines generated by a tool). This label triggered different routing rules — more required reviewers, different CODEOWNERS requirements.
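The labeling rule above reduces to a small predicate plus a routing decision. This is a hedged sketch — the function names and the flat "one extra reviewer" rule are hypothetical plumbing; only the >30%-of-lines threshold comes from the study's description.

```python
def needs_ai_label(ai_lines: int, total_lines: int,
                   threshold: float = 0.30) -> bool:
    """Label a PR as AI-generated when tool-generated lines exceed the
    threshold (the >30% convention cited in the study)."""
    if total_lines == 0:
        return False
    return ai_lines / total_lines > threshold

def required_reviewers(ai_labeled: bool, base: int = 1) -> int:
    """Illustrative routing rule: AI-labeled PRs need one extra reviewer."""
    return base + 1 if ai_labeled else base

print(needs_ai_label(120, 300))                       # -> True (40% AI)
print(required_reviewers(needs_ai_label(120, 300)))   # -> 2
```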

Coverage-first review for AI PRs. Rather than reading the code top-down, reviewers started with the test coverage delta. If the AI had not written tests for the integration points, that was a blocking comment before any code review began.

Integration checklist requirements. AI PRs required a mandatory checklist item confirming that the reviewer had verified the change's behavior against each downstream service it touches. Not a rubber stamp — actual verification, documented in the PR description.

Koalr scores AI-generated PRs differently

Koalr's deploy risk model weights author file expertise (zero for AI agents), review speed (fast approvals on AI PRs raise risk, not lower it), and CODEOWNERS overlap to produce risk scores that reflect the actual risk profile of AI-authored changes. High-risk AI PRs get routed to mandatory additional review automatically.

Score AI-generated PRs before they reach production

Koalr automatically identifies AI-assisted PRs, applies elevated risk weighting to the signals that matter most, and routes high-risk changes to additional reviewers before they merge. Connect GitHub in 5 minutes.