Engineering Metrics · March 16, 2026 · 9 min read

Trunk-Based Development and DORA Metrics: The Research-Backed Connection

The DORA research program has studied over 33,000 engineering professionals across more than a decade of annual surveys. Trunk-based development appears in every single year's analysis as a statistically significant predictor of elite delivery performance. This post explains why the correlation exists, how to transition from long-lived feature branches, and what the DORA metrics actually look like for teams that have made the shift.

The core insight

Trunk-based development does not improve DORA metrics because of anything magical about a single branch. It improves DORA metrics because it enforces small batch sizes — and every DORA metric is a downstream effect of batch size. Understanding this mechanism is what makes TBD adoption stick, rather than being abandoned after the first painful merge conflict.

What trunk-based development actually means

Trunk-based development (TBD) is a source control branching model where all engineers commit to a single shared branch — commonly called main or trunk — or use very short-lived branches with a maximum lifespan of one to two days before merging back to trunk.

Two misconceptions make teams reluctant to try TBD, and both need to be addressed directly. The first misconception: TBD does not mean every commit must be a complete, deployable feature. Partial work hides behind feature flags — code ships to production but remains invisible to users until the flag is enabled. The second misconception: TBD does not mean "everyone pushes broken code to main." CI is the quality gate. TBD without a reliable, fast CI pipeline is not TBD — it is chaos. The discipline of TBD depends on CI being trusted so completely that engineers are confident pushing to trunk multiple times per day.

In practice, TBD exists on a spectrum based on team size and release model. The three most common forms are:

  • Pure TBD — Engineers commit directly to main. Works well for small, co-located teams with extremely fast CI (under 5 minutes) and strong mutual code awareness. Every commit triggers a deployment pipeline.
  • Short-lived branch TBD — Engineers create branches for individual changes, but branches must merge within one day. PRs are small by construction, CI completes in under 10 minutes, and review turnaround is measured in hours, not days. This is the most common form at companies with 10–200 engineers.
  • Release-branch TBD — Main is always deployable. Release branches are cut periodically (weekly, per sprint) only for generating versioned artifacts — not for feature development. All feature work goes to main; release branches are cherry-pick destinations for hotfixes only. Common in regulated industries or products shipping versioned software.

All three variants share the defining property: no long-lived feature branches. The maximum age for any branch that contains new feature work is measured in days, never weeks.

What the DORA research actually says about TBD

The DORA State of DevOps research program — now part of Google Cloud and running continuously since 2014 — has consistently identified trunk-based development as one of a small cluster of practices that differentiates elite software delivery performance from the rest. The 2023 State of DevOps report, which surveyed over 36,000 professionals across industries and company sizes, is the most recent publication to quantify the relationship directly.

The findings on TBD vs. long-lived feature-branch workflows are specific and substantial:

  • Teams using TBD have 3× higher deployment frequency than teams using long-lived feature branches. The mechanism is batch size: TBD teams ship many small changes; feature-branch teams accumulate work and ship large releases less often.
  • TBD teams have 2× lower MTTR (mean time to restore service). When a small, focused change causes an incident, the blast radius is narrow and the causal chain is short. Engineers identify the source of an outage from a 50-line PR in minutes; they spend hours tracing a 1,200-line feature merge.
  • TBD teams have a 1.8× lower change failure rate — the percentage of deployments that require an emergency fix, rollback, or patch. This is the number that most directly connects to customer trust and SLA adherence. Smaller batches are more thoroughly tested, have less interaction surface with existing code, and are easier to roll back cleanly when they do fail.

It is worth being precise about what the DORA research shows. TBD is not the cause of these improvements in isolation — it is a necessary precondition for the practice changes that drive them. Teams that adopt TBD but do not invest in CI reliability, fast feedback loops, and feature flag infrastructure will not see the 3× deployment frequency lift. TBD is the structural constraint that forces those investments, and the DORA metrics improve as a downstream effect.

The batch size constraint: the mechanism behind every DORA improvement

The reason TBD correlates so strongly with elite DORA performance is that it enforces a hard constraint on batch size — and batch size is the single variable with the most leverage over every DORA metric simultaneously.

Long-lived branches accumulate work over days and weeks. The PR that eventually opens against main is large: hundreds or thousands of lines, touching many files, often crossing service or module boundaries. That PR is harder to review thoroughly, harder to test completely, and dramatically harder to revert cleanly when it causes an incident. A revert of a 1,500-line PR in a production incident at 2am is one of the most stress-inducing operations in software engineering — because it may break other deployments that merged after it, because the revert itself introduces risk, and because the on-call engineer may not be the author.

TBD makes large batch sizes structurally impossible. You cannot have a three-week feature branch if your team norm is merging to main within one day. The constraint is real and automatic — it is not a process requirement that teams comply with out of discipline; it is a property of the branching model.

Trace each DORA metric back to batch size and the mechanism is clear:

  • Deployment frequency: Small batches have nothing to "hold back." There is no accumulation phase before each release. Each small change can deploy as soon as it merges and passes CI. Deployment frequency rises to match the pace of change completion rather than the pace of feature branch cycles.
  • Lead time for changes: Lead time measures the time from code commit to production. With TBD, the coding time for any given change is bounded by the branch lifespan — days, not weeks. A change that cannot live in a long-lived branch must be scoped down to something shippable within days. Lead time falls as a direct consequence.
  • Change failure rate: Smaller changes fail less often. They have less surface area for defects, are easier to test completely, and are less likely to conflict with other changes that merged concurrently. The CFR drop from TBD is not primarily about CI quality — it is about the reduced number of ways a focused 150-line change can go wrong compared to a sprawling 900-line feature.
  • MTTR: When a small, well-scoped change causes an incident, root-cause identification is fast. The causal surface is narrow. An on-call engineer looking at a recent 80-line PR can often identify the problem in the time it takes to read the diff. Compare that to debugging an incident caused by a merge of a two-week feature branch touching 47 files across three services.

What long-lived feature branches actually cost you

The costs of feature-branch workflows are often invisible in the day-to-day — they show up in aggregate, as elevated change failure rates, stretched lead times, and a persistent sense that "releases are stressful." Putting specific numbers to the costs makes the case for change concrete.

Merge conflicts. A branch open for 14 days in an active codebase — where other engineers are merging changes to main continuously — typically requires three to five conflict resolution sessions before it can merge. Each session requires the engineer to context-switch back into the branch, understand changes from other commits they did not author, and make judgment calls about which version of conflicting code is correct. Microsoft Research has estimated merge conflict resolution at two to six engineer-hours per occurrence. At five conflicts per long-lived branch, that is up to 30 engineer-hours per feature — before the PR even opens for review.

Review quality degradation. The relationship between PR size and review quality is nonlinear. A 1,200-line PR does not receive a review that is six times as thorough as a 200-line PR — it receives a review that is substantially less thorough per line, because reviewers have finite cognitive bandwidth and large diffs overwhelm it. Research from SmartBear's study of code review practices found that reviewer effectiveness drops sharply above 200–400 lines of changed code, and that review sessions longer than 60 minutes produce diminishing returns on defect detection. Breaking a 1,200-line PR into four 300-line PRs — with equivalent total content — produces better reviews than the single large PR, even though the total reviewer time is similar.

Integration debt. Features built in isolation on long-lived branches can conflict at the product level — not just the code level. Two teams building related features on separate branches for three weeks may create architecturally incompatible implementations that both pass their individual tests but break when integrated. This integration debt is invisible until the branches merge, often in the same sprint, compounding the conflict resolution burden.

Stale assumptions. Code written three weeks ago may be based on assumptions that are no longer true by the time it is reviewed. The underlying data model may have changed. A related service's contract may have been updated. A business requirement may have been revised. The longer a branch lives in isolation, the higher the probability that something about the world it was written against has changed — and that the author will not notice, because they are no longer thinking about the original context in the same way.

Feature flags: the enabler of trunk-based development

The most common and legitimate objection to TBD is: "I cannot merge a half-finished feature to main. Users will see incomplete UI, broken flows, or experimental behavior I am not ready to ship."

Feature flags dissolve this objection entirely. The pattern is simple: wrap any in-progress code in a runtime flag check that is off by default, then merge freely to main. The code ships to every production instance on every deploy — but the code path executes only when the flag is enabled. From the user's perspective, nothing has changed. From the engineering team's perspective, the work is on trunk, integrated with everything else, being tested by CI against the full codebase on every push.

The combination of TBD and feature flags supports any number of concurrent in-progress features on trunk, all invisible to users, none of them blocking each other's merges. A team of 20 engineers can each be working on separate features, all committing to trunk daily, with no feature visible to users until the individual flag is enabled. This is how high-performing teams sustain both high deployment frequency and disciplined feature releases.

Common feature flag tooling includes LaunchDarkly (full-featured, targeting rules, experimentation), Unleash (open-source, self-hostable), and Flagsmith (open-source with cloud option). For teams starting out, a simple database table with flag name and enabled boolean is sufficient to prove the pattern. The infrastructure complexity can grow with the use case — the essential concept is just a conditional check at the entry point of the in-progress code path.
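The database-table approach can be sketched in a few lines. This is a minimal illustration, not any particular vendor's API: the table name `feature_flags`, the flag name `new-pricing-engine`, and the `checkout_total` function are all hypothetical, and a production store would add caching and an audit trail.

```python
import sqlite3

# Illustrative minimal flag store: one table mapping flag name -> enabled.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feature_flags (name TEXT PRIMARY KEY, enabled INTEGER NOT NULL DEFAULT 0)"
)

def is_enabled(name: str) -> bool:
    """Flags default to off: an unknown flag is treated as disabled."""
    row = conn.execute(
        "SELECT enabled FROM feature_flags WHERE name = ?", (name,)
    ).fetchone()
    return bool(row and row[0])

def checkout_total(cart: list[float]) -> float:
    # In-progress pricing logic merges to trunk behind the flag,
    # but executes only once the flag is turned on.
    if is_enabled("new-pricing-engine"):
        return round(sum(cart) * 0.9, 2)  # hypothetical new code path
    return round(sum(cart), 2)            # existing behavior, unchanged

conn.execute("INSERT INTO feature_flags VALUES ('new-pricing-engine', 0)")
```

The essential property is visible here: the new code path ships in every deploy, but a caller sees the old behavior until the row is flipped to 1.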

How to measure whether your team is practicing TBD

Before starting a TBD transition, it is worth baselining exactly where your team stands. Four metrics give a complete picture of your current branching behavior, all computable from your existing GitHub data.

| Metric                         | TBD benchmark    | Feature-branch baseline |
| ------------------------------ | ---------------- | ----------------------- |
| Average branch age at merge    | Under 1 day      | 5–14 days               |
| Median PR size (lines changed) | Under 200 lines  | 500+ lines              |
| Merge frequency per engineer   | ≥1 PR per day    | 2–3 PRs per week        |
| CI completion time (p50)       | Under 10 minutes | 15–45 minutes           |

These metrics are computable directly from the GitHub API. The branch age calculation uses pull_request.created_at vs pull_request.merged_at — the delta is the branch age at merge. PR size uses additions + deletions on the merged PR object. Merge frequency is simply the count of merged PRs per engineer per week, averaged over the last 90 days. Koalr surfaces all four of these in the PR analytics dashboard with per-team breakdowns and trend lines, so you can track progress through a TBD transition without building the queries yourself.
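The calculations themselves are small. The sketch below assumes you have already fetched merged pull requests from the GitHub REST API; the `prs` sample data is invented, but the field names (`created_at`, `merged_at`, `additions`, `deletions`) match the API's pull request resource.

```python
from datetime import datetime

def branch_age_days(pr: dict) -> float:
    """Branch age at merge: merged_at minus created_at, in days."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # GitHub API timestamp format
    created = datetime.strptime(pr["created_at"], fmt)
    merged = datetime.strptime(pr["merged_at"], fmt)
    return (merged - created).total_seconds() / 86400

def pr_size(pr: dict) -> int:
    """Lines changed: additions plus deletions on the merged PR object."""
    return pr["additions"] + pr["deletions"]

def median(values: list) -> float:
    s = sorted(values)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

# Hypothetical sample: two merged PRs fetched from the API.
prs = [
    {"created_at": "2026-03-02T09:00:00Z", "merged_at": "2026-03-02T15:00:00Z",
     "additions": 120, "deletions": 30},
    {"created_at": "2026-03-03T10:00:00Z", "merged_at": "2026-03-05T10:00:00Z",
     "additions": 400, "deletions": 150},
]

ages = [branch_age_days(pr) for pr in prs]   # days each branch lived
sizes = [pr_size(pr) for pr in prs]          # lines changed per PR
```

Merge frequency is then just `len(prs)` per engineer per week over your chosen window, and CI completion time comes from your CI provider rather than the PR object.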

The transition plan: from long-lived branches to TBD

TBD transitions fail most often because teams try to change branching behavior before the underlying infrastructure — primarily CI reliability and feature flag discipline — is ready to support it. The four phases below are sequenced to build that infrastructure first, with branching behavior changes arriving after the foundation is stable.

Phase 1: CI reliability (weeks 1–4)

TBD requires CI you trust completely. An engineer who pushes a partial feature behind a flag and discovers that their change broke an unrelated test suite — because CI is flaky — will abandon TBD practices within a week. The reliability bar is high: CI must pass reliably on unrelated PRs more than 97% of the time, and CI completion time must be under 10 minutes for the majority of changes.

Flaky tests are the primary obstacle. A test that fails 3% of the time on unrelated changes is tolerable in a feature-branch workflow where CI runs infrequently. In a TBD environment with CI running ten or more times per day per engineer, a 3% flakiness rate becomes a constant source of noise that erodes confidence in trunk's green state. The investment in fixing flaky tests before attempting TBD pays immediate dividends — and the work is valuable regardless of branching strategy.
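One way to find those tests before they erode trust is to rerun the suite several times on the same commit and flag anything nondeterministic. This is a sketch of that idea, not a real CI feature; the `runs` data shape and function name are illustrative.

```python
def flaky_tests(runs: dict, threshold: float = 0.0) -> dict:
    """runs maps test name -> pass/fail outcomes from reruns of one commit.

    A test that both passes and fails on identical code is flaky; its rate
    is the observed failure fraction across those reruns.
    """
    flaky = {}
    for name, outcomes in runs.items():
        if True in outcomes and False in outcomes:        # nondeterministic
            rate = outcomes.count(False) / len(outcomes)  # failure fraction
            if rate > threshold:
                flaky[name] = rate
    return flaky

# Hypothetical rerun history: 100 runs of the suite on one commit.
runs = {
    "test_checkout": [True] * 97 + [False] * 3,  # the 3% case from the text
    "test_login":    [True] * 100,               # stable
}
```

A 3% rate looks harmless in isolation; at ten CI runs per engineer per day it means someone on the team hits a spurious red build most days, which is exactly the confidence leak the text describes.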

Phase 2: PR size norms (weeks 3–6)

Start tracking median PR size per team and surfacing it in retrospectives. The goal is not to shame engineers with large PRs — it is to make the size distribution visible and to create a shared vocabulary for talking about it. Teams that have never looked at their PR size distribution are often surprised: median PR size above 600 lines is common in feature-branch workflows, and most engineers had no idea their PRs were that large.

Set a soft target: 80% of PRs under 400 lines within 60 days. This is achievable without changing branching strategy yet — it simply requires more intentional scope management. Simultaneously, introduce feature flags on one high-visibility in-progress feature. Demonstrate to the team that shipping partial work behind a flag to main is safe, testable, and reviewable. This proof-of-concept breaks the psychological association between "merging to main" and "releasing to users."
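Tracking the soft target is a one-liner once PR sizes are available. The size list below is invented for illustration.

```python
def share_under(sizes: list, limit: int = 400) -> float:
    """Fraction of PRs at or under `limit` changed lines."""
    return sum(1 for s in sizes if s <= limit) / len(sizes)

# Hypothetical last-60-days PR sizes for one team.
sizes = [120, 340, 80, 900, 260, 1500, 210, 390, 50, 410]
```

Here `share_under(sizes)` reports 0.7 — this team is 10 points short of the 80% target, and the 900- and 1,500-line outliers are the PRs worth discussing in retro.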

Phase 3: Branch lifetime limits (weeks 5–8)

Introduce a team norm: any branch older than two days needs either a split plan or a brief check-in conversation. This is not a hard automated cutoff — it is a forcing function for conversations about whether the work can be decomposed. In most cases, a branch that has been open for three days contains work that can be split: there is a "safe to merge now" chunk and a "needs more work" chunk, and the safe chunk can go out behind a feature flag while work continues on the rest.
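The two-day check is easy to automate as a nudge (a bot comment or Slack message) rather than a hard gate. A minimal sketch, assuming you already have each open branch's creation time from your git host; the branch names and data shape are hypothetical.

```python
from datetime import datetime, timedelta

def branches_needing_checkin(branches: dict,
                             now: datetime,
                             max_age: timedelta = timedelta(days=2)) -> list:
    """Return open branches older than max_age, sorted by name.

    branches maps branch name -> creation time.
    """
    return sorted(name for name, created in branches.items()
                  if now - created > max_age)

now = datetime(2026, 3, 16, 9, 0)
branches = {
    "feat/search-filters": datetime(2026, 3, 15, 14, 0),  # ~19h old: fine
    "feat/billing-rework": datetime(2026, 3, 12, 10, 0),  # ~4 days: flag it
}
```

The output is a conversation list, not a violation list — each flagged branch gets the "split plan or check-in" question from the norm above.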

Track branch age in Koalr's PR dashboard. The stale PR age heatmap shows, per engineer and per team, how many open PRs have been alive for more than one day, two days, and five days. The heatmap is not a performance dashboard — it is a conversation starter in retrospectives and 1:1s. The pattern you are looking for is a gradual leftward shift in the distribution as the team learns to scope work more narrowly.

Phase 4: Full TBD (weeks 8–12)

At this point CI is reliable, PR sizes are trending down, feature flag usage is normalized, and branch lifetimes are shortening. The final phase formalizes the new norms: all new features start with a feature flag; branch protection rules enforce CI passing before any merge to main; and the team establishes a shared expectation that engineers merge to main at least once per day.
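The branch protection piece can be set up through GitHub's REST API (`PUT /repos/{owner}/{repo}/branches/{branch}/protection`). Below is a sketch of the request payload only; the status-check context names `ci/test` and `ci/lint` are placeholders for whatever jobs your pipeline actually reports.

```python
import json

# Payload for GitHub's branch protection endpoint: CI must pass before
# anything merges to main, with no admin bypass.
protection = {
    "required_status_checks": {
        "strict": True,                # branch must be up to date with main
        "contexts": ["ci/test", "ci/lint"],  # placeholder job names
    },
    "enforce_admins": True,            # the gate applies to everyone
    "required_pull_request_reviews": {
        "required_approving_review_count": 1,
    },
    "restrictions": None,              # no extra push restrictions
}

payload = json.dumps(protection)
# Apply with an authenticated HTTP client, e.g.:
# PUT https://api.github.com/repos/{owner}/{repo}/branches/main/protection
# with an Authorization header and `payload` as the request body.
```

Pairing `strict: True` with short-lived branches is deliberate: requiring the branch to be current with main is cheap when branches live a day, and it guarantees CI ran against the code that will actually land.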

The first week where every engineer on the team merges to main at least once per day is a cultural milestone worth marking explicitly. It is evidence that the structural conditions are right — CI is fast, feature flags are available, and PRs are small enough to complete within a day. The DORA metric changes in the weeks that follow are often the first concrete evidence engineering leadership sees that the investment was worthwhile.

DORA metric trajectory: what to expect over 90 days

Teams that complete a TBD transition see DORA metric improvements arrive in a predictable sequence. Understanding the expected trajectory helps set realistic expectations with engineering leadership — and prevents premature conclusions if one metric does not immediately improve.

| Timeframe  | DORA metric           | Expected change |
| ---------- | --------------------- | --------------- |
| Weeks 1–4  | Deployment frequency  | Rises immediately — smaller PRs merge and deploy more quickly, with less accumulation before release |
| Weeks 4–8  | Lead time for changes | Falls as branch age shrinks — coding time is now bounded by the branch lifespan rather than sprint cadence |
| Weeks 8–12 | Change failure rate   | Typically falls 20–40% as batch size shrinks and CI covers complete, focused feature units rather than sprawling merges |
| Weeks 1–12 | MTTR                  | Unchanged initially — TBD does not improve incident response process. Improves gradually as incident scope narrows and causal chains become shorter |

MTTR deserves a specific note. It is the one DORA metric that does not respond immediately to TBD adoption. MTTR is primarily a function of incident response capability — alerting quality, on-call process, runbook completeness, rollback tooling — and those are independent of branching strategy. What TBD does for MTTR is reduce the maximum scope of any given incident: a small, focused change that causes an incident produces a smaller blast radius and a shorter causal chain to trace. But the actual reduction shows up on a longer timeline as the team accumulates experience with smaller, more targeted deployments. Teams that see no MTTR improvement in the first 90 days of TBD adoption should look at their incident response process, not their branching strategy.

TBD at scale: three challenges and how to address them

TBD works straightforwardly for small teams with shared codebase ownership. As teams grow and codebases become more complex, three specific challenges emerge that are worth addressing proactively.

Monorepos with multiple teams. When 15 different teams are committing to the same trunk simultaneously, the challenge is scoping CI and review to the affected packages rather than running the full test suite on every commit. Build tools like Turborepo and Nx address this with dependency graph-based task execution — only the packages affected by a given change rebuild and test. Koalr's CODEOWNERS enforcement ensures that PRs touching multiple packages get routed to the correct owning teams for review, preventing the "everyone reviews everything" anti-pattern that does not scale past 30 engineers.

Regulated environments. Some industries — finance, healthcare, defense — have change management requirements that specify named releases, audit trails for specific versions, or separation between development and deployment authorization. TBD on main is still correct in these environments; the adaptation is in the release process. Cut release branches from trunk on a defined cadence; deploy to production only from release branches; apply hotfixes to release branches via cherry-pick from main. All feature development continues on trunk. The release branch model satisfies change management requirements without sacrificing the batch size benefits of TBD on the development side.

Large refactors. Some changes — a database schema migration, a framework upgrade, an API contract renegotiation — seem to require a long-lived branch by nature. In most cases, the strangler fig pattern resolves this: the new implementation is built incrementally alongside the old one, behind an abstraction boundary, with each incremental addition merging to trunk independently. The old implementation is removed in a final cleanup PR. This approach requires more design discipline than a single long-lived branch, but it results in a safer migration because each increment is independently tested and deployable. For the genuinely indivisible cases — and they exist — a long-lived branch with daily rebases from main minimizes integration drift while acknowledging the exception.
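The strangler fig's abstraction boundary is simplest to see in code. A minimal sketch, with both implementations and the `use_new_search` flag invented for illustration:

```python
def legacy_search(query: str) -> list:
    """Existing implementation: untouched while the migration proceeds."""
    return [f"legacy:{query}"]

def new_search(query: str) -> list:
    """New implementation: built up PR by PR behind the boundary."""
    return [f"indexed:{query}"]

def search(query: str, use_new_search: bool = False) -> list:
    # Every caller goes through this one boundary. Each increment of
    # new_search merges to trunk independently; flipping the flag migrates
    # traffic, and a final cleanup PR deletes legacy_search entirely.
    impl = new_search if use_new_search else legacy_search
    return impl(query)
```

In practice `use_new_search` would be a feature flag (possibly percentage-rolled), which is what makes each increment independently deployable and reversible.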

Measuring TBD adoption with Koalr

The four TBD readiness metrics — branch age at merge, PR size distribution, merge frequency per engineer, and CI completion time — are all available in Koalr's PR analytics dashboard with historical trending. Tracking them through a TBD transition gives engineering leadership a clear view of adoption progress that is more precise than "we told teams to merge faster."

The branch age heatmap in Koalr's pull request view shows the age distribution of all open PRs, by team, with configurable thresholds for what counts as "at risk of becoming long-lived." The PR size trend shows median and p90 PR size over time, making it immediately visible when a team starts splitting work more aggressively. And the DORA metrics dashboard tracks the downstream effects — deployment frequency, lead time, CFR, MTTR — so the connection between branching behavior changes and delivery performance improvements is visible in a single view.

The DORA research program has demonstrated the connection between TBD and elite delivery performance consistently across a decade and tens of thousands of data points. The mechanism is not mysterious: TBD enforces small batches, and small batches produce better DORA outcomes across every dimension. The transition requires investment in CI reliability and feature flag discipline before the branching change itself — but teams that sequence it correctly consistently see the deployment frequency, lead time, and change failure rate improvements that the research predicts.

Measure your branch age and PR size distribution in Koalr

Connect GitHub in under 5 minutes. Koalr calculates your current TBD readiness metrics immediately — average branch age at merge, median PR size per team, merge frequency per engineer, and CI time trends — alongside the DORA metrics that show where you are today and how they change as TBD adoption improves.