Engineering Metrics · March 16, 2026 · 10 min read

CI/CD Metrics: How to Measure and Improve Your Pipeline Performance

A slow or unreliable CI/CD pipeline does not just frustrate developers — it directly inflates your DORA metrics and erodes delivery confidence. This guide covers the 8 CI/CD metrics that matter most, what each one tells you, and the quick wins that consistently move the numbers in the right direction.

The 30-minute rule

Once a full CI pipeline exceeds 30 minutes at P95, developers stop waiting for it. They open a new task, lose context, and treat CI as an asynchronous formality rather than a blocking quality gate. Everything downstream of that threshold — review speed, merge confidence, deployment frequency — gets worse.

Why CI/CD metrics matter beyond pipeline uptime

Most teams know when their CI is broken. The build is red, the deploy is blocked, and the Slack channel lights up. What is harder to see is the slow degradation — pipelines that used to take 8 minutes now take 22, flaky tests that developers have learned to just re-run, runners that queue for 6 minutes before a single line of code executes. None of these states are crises. They are all individually tolerable. And together, they compound into a delivery process that consistently underperforms its potential.

CI/CD metrics make that degradation visible before it reaches a critical threshold. They also connect the pipeline directly to the four DORA metrics that engineering leaders and boards use to benchmark delivery performance. Build duration is a direct component of Lead Time for Changes. Build success rate limits Deployment Frequency. Flaky tests inflate Change Failure Rate by letting regressions through. Deployment success rate feeds CFR directly.

Measuring CI/CD is not about holding engineers accountable to pipeline numbers. It is about surfacing the maintenance work that nobody is explicitly prioritising — the slow build that nobody owns, the flaky test suite that everyone tolerates, the runner capacity that was never reviewed after the team doubled.

The 8 CI/CD metrics to track

1. Build success rate

Build success rate is the simplest and most important pipeline health signal. The formula:

Build Success Rate = (successful_builds / total_builds) × 100

Calculate this per branch or per repository, not just across the entire organisation. A repo with a 60% build success rate hiding inside an org-wide 92% average is a problem that the aggregate number obscures.
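As a sketch of that per-repository breakdown (the record shape here is illustrative, not any particular CI provider's schema):

```python
from collections import defaultdict

def success_rate_by_repo(builds):
    """Build success rate per repository, as a percentage.

    `builds` is a list of dicts with illustrative fields:
    {"repo": "api", "conclusion": "success" | "failure"}.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for b in builds:
        totals[b["repo"]] += 1
        if b["conclusion"] == "success":
            successes[b["repo"]] += 1
    return {repo: 100 * successes[repo] / totals[repo] for repo in totals}

builds = (
    [{"repo": "api", "conclusion": "success"}] * 6
    + [{"repo": "api", "conclusion": "failure"}] * 4
    + [{"repo": "web", "conclusion": "success"}] * 90
)
rates = success_rate_by_repo(builds)
# api: 60.0, web: 100.0
org_rate = 100 * sum(b["conclusion"] == "success" for b in builds) / len(builds)
# 96.0: the healthy web repo masks api's 60% rate
```

Grouping before averaging is the whole point: the aggregate looks fine while one repository is failing four builds in ten.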

Target: above 95%. Below 90% means your pipeline is a regular source of developer friction. Below 80% is a pipeline crisis — developers are routinely committing to a broken baseline and normalising it.

The two most common causes of a low build success rate are (1) a flaky test suite that fails intermittently on otherwise-clean code, and (2) a missing or outdated dependency lock that causes nondeterministic installs. Both are fixable; neither tends to get fixed until the metric makes the frequency undeniable.

2. Build duration (P50 and P95)

Mean build duration hides distribution. A pipeline that takes 5 minutes 80% of the time and 45 minutes 20% of the time has a mean around 13 minutes — but developers experience it as either fast or unusably slow, with no way to predict which they will get. Track both the median (P50) and the 95th percentile (P95) separately.
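The gap between the mean and the percentiles in that example can be checked directly (nearest-rank percentiles; a deliberately minimal sketch, not a stats library):

```python
def pctl(values, p):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    return ordered[max(0, round(p / 100 * len(ordered)) - 1)]

# 80% of builds take 5 minutes, 20% take 45 minutes
durations = [5] * 80 + [45] * 20

mean = sum(durations) / len(durations)  # 13.0: hides the bimodal reality
p50 = pctl(durations, 50)               # 5: the typical experience
p95 = pctl(durations, 95)               # 45: the tail developers remember
```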

Targets:

  • Unit test / lint pipeline — P50 under 5 minutes, P95 under 10 minutes
  • Full CI suite (unit + integration + e2e) — P50 under 15 minutes, P95 under 30 minutes

Above 30 minutes at P95, the pipeline stops functioning as a feedback loop. Developers who must wait 35 minutes before merging will, over time, stop waiting. They merge with a partial green signal, skip re-running a single flaky job, or batch multiple PRs into a single deploy to amortise the wait time. All of these patterns increase change failure rate.

Build duration trends over time are often more actionable than the current absolute value. A P50 that has grown from 7 minutes to 14 minutes over six months without a corresponding increase in test coverage is a signal of test debt accumulation — tests were added without any parallelisation or pruning work to offset the cost.

3. Test pass rate

Test pass rate measures the percentage of tests in the suite that pass on the main branch at any given point in time. This is distinct from build success rate — a build can be green because failing tests are marked as expected failures, skipped, or simply not run.

Target: 100% on main. Any failing test on the main branch that is not actively being investigated is a broken pipeline culture signal. It means engineers have accepted that the test suite is unreliable, which means they cannot trust it to catch regressions.

Low test pass rates on feature branches are expected and normal. Low pass rates on main are not. The distinction is important — do not conflate the two when calculating this metric.

4. Flaky test rate

A flaky test is one that produces different results — pass or fail — for the same code, without any code change. Flakiness is usually caused by timing dependencies, shared state between tests, or tests that depend on external services without proper mocking.

Flaky Test Rate = (tests that intermittently fail / total tests) × 100

# Detect flakiness: a test that passes and fails on the same commit
# across retries or across pipeline re-runs without code changes
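The rule in that comment can be sketched as: group results by commit SHA and flag any test that both passed and failed on the same SHA (the field names are illustrative):

```python
from collections import defaultdict

def flaky_tests(runs):
    """Flag tests that both passed and failed on the same commit.

    `runs` is a list of dicts with illustrative fields:
    {"sha": "...", "test": "...", "passed": bool}.
    """
    outcomes = defaultdict(set)
    for r in runs:
        outcomes[(r["sha"], r["test"])].add(r["passed"])
    # A test is flaky if any single commit saw both outcomes
    return {test for (sha, test), seen in outcomes.items() if seen == {True, False}}

runs = [
    {"sha": "abc", "test": "test_login", "passed": True},
    {"sha": "abc", "test": "test_login", "passed": False},  # same code, different result
    {"sha": "abc", "test": "test_cart", "passed": True},
    {"sha": "def", "test": "test_cart", "passed": True},
]
# flaky_tests(runs) contains only "test_login"
```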

Target: below 1%. Above 5%, developers stop trusting the test suite entirely. The cognitive load of evaluating "is this a real failure or just a flaky test?" on every red build is significant — and the rational response, over time, is to default to "probably flaky, re-run it." That default is how real regressions get missed.

Flakiness is insidious because it rarely causes a complete outage. It just degrades developer trust in CI incrementally. Teams with high flaky test rates tend to have elevated Change Failure Rates — not because the flaky tests themselves cause failures, but because the culture of ignoring red CI builds means real failures also get ignored or dismissed.

5. Mean time to fix build (MTTFB)

When a build breaks, how long does it stay broken? Mean time to fix build measures the average duration between a build going red on main and returning to green.
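A minimal sketch of the computation, assuming a chronological list of main-branch build results:

```python
from datetime import datetime, timedelta

def mean_time_to_fix(builds):
    """Average red-to-green duration on main.

    `builds` is chronological: [(finished_at, "success" | "failure"), ...].
    Only the first failure of a broken stretch starts the clock.
    """
    fix_durations = []
    broken_since = None
    for finished_at, conclusion in builds:
        if conclusion == "failure" and broken_since is None:
            broken_since = finished_at
        elif conclusion == "success" and broken_since is not None:
            fix_durations.append(finished_at - broken_since)
            broken_since = None
    if not fix_durations:
        return timedelta(0)
    return sum(fix_durations, timedelta(0)) / len(fix_durations)

t = datetime(2026, 3, 16, 9, 0)
builds = [
    (t, "success"),
    (t + timedelta(hours=1), "failure"),  # main goes red
    (t + timedelta(hours=2), "failure"),  # still red; clock keeps running
    (t + timedelta(hours=4), "success"),  # fixed: 3 hours broken
]
```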

Target: under 2 hours. A same-day fix culture means that a broken main branch never lingers overnight. Teams that achieve this have internalised that a broken main is a P1 incident for the team — not a background problem that someone will get to eventually.

High MTTFB values often correlate with ownership ambiguity. When no individual or team owns the CI pipeline, broken builds tend to wait for whoever broke it to come back online — which could be hours later in different timezones, or after a full day of other work. Assigning explicit CI ownership and making MTTFB visible is one of the highest ROI changes a team can make.

6. Pipeline queue time

Queue time is the delay between a pipeline being triggered and the first job actually starting execution. It is entirely invisible to the developer who pushed the code — they see "pending" and assume the pipeline is running. It is not running; it is waiting for a runner.

Target: P95 under 2 minutes. Above 5 minutes at P95, you have a runner capacity problem. The fix is almost always more runners or better runner allocation — but teams routinely leave this unaddressed because nobody is measuring the queue time separately from build duration.

Queue time spikes are often correlated with specific times of day (all-hands standups, post-sprint merges) or with monorepo matrix builds that simultaneously request a large number of runners. Tracking queue time as a separate metric from build execution time makes the capacity problem clearly visible.
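Measured separately, queue time is a one-line subtraction per run; the record shape below is illustrative:

```python
def queue_times_minutes(runs):
    """Queue time per run: delay between trigger and first job starting.

    `runs` holds illustrative fields in epoch seconds:
    {"created": ..., "first_job_started": ...}.
    """
    return [(r["first_job_started"] - r["created"]) / 60 for r in runs]

def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    return ordered[max(0, round(0.95 * len(ordered)) - 1)]

runs = [{"created": 0, "first_job_started": 30}] * 90 + \
       [{"created": 0, "first_job_started": 480}] * 10  # 8-minute waits at peak

q = queue_times_minutes(runs)
capacity_problem = p95(q) > 5  # above 5 minutes at P95: runner capacity problem
```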

7. Deployment success rate

Deployment success rate is the percentage of production deploy attempts that complete successfully. This is distinct from build success rate — a build can be green but the deployment step itself can fail due to infrastructure issues, migration failures, health check timeouts, or environment-specific configuration problems.

Deployment Success Rate = (successful_deploys / total_deploy_attempts) × 100

# Note: this is the inverse of Change Failure Rate for deployment-level events
# CFR (deployment failures) = 100 - Deployment Success Rate
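That inverse relationship is easy to codify (the status values are illustrative):

```python
def deployment_success_rate(deploys):
    """Percentage of deploy attempts that succeeded.

    `deploys` is a list of dicts: {"status": "success" | "failure"}.
    """
    ok = sum(1 for d in deploys if d["status"] == "success")
    return 100 * ok / len(deploys)

deploys = [{"status": "success"}] * 93 + [{"status": "failure"}] * 7
rate = deployment_success_rate(deploys)  # 93.0
cfr = 100 - rate                         # 7.0: deployment-level Change Failure Rate
```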

Target: above 95%. A deployment success rate below 90% means more than one in ten deployment attempts is failing to reach production. That rate of failure is high enough to erode deployment confidence — engineers begin treating every deploy as a risky event rather than a routine operation, which paradoxically reduces deployment frequency and increases batch size.

8. Artifact size trend

Artifact size is not a performance metric in the traditional sense, but it is an early warning signal for build bloat and dependency accumulation. Track the size of your primary build outputs over time: Docker image sizes, JavaScript bundle sizes, compiled binary sizes.

A Docker image that grows from 400MB to 1.4GB over eight months is not just a storage cost problem — it means pull times are increasing on every deploy, cold start latency is growing, and at some point the image size will start affecting deployment duration directly. Catching this trend early, when corrective action is cheap, is far better than addressing it after the image has become genuinely unwieldy.

GitHub Actions specific metrics

If your team runs CI on GitHub Actions, several additional metrics are available that are specific to the platform:

  • Workflow run duration by workflow name — break down build duration per workflow file rather than aggregating. A slow deploy.yml and a slow test.yml have different remediation paths.
  • Job failure rate by job name — within a workflow, which individual job fails most often? The integration-tests job failing 40% of the time while unit-tests fails 3% of the time is a very specific signal about where to invest stability work.
  • Self-hosted vs. GitHub-hosted runner utilisation — if you run a mix of self-hosted and GitHub-hosted runners, track utilisation separately. Overloaded self-hosted runners with queued GitHub-hosted capacity available is a common misconfiguration.
  • Actions Cache hit rate — a cache miss means re-downloading or rebuilding something that was already done. Cache hit rates below 80% for dependency installs suggest the cache key strategy is too granular or the cache is being invalidated too aggressively.
  • Concurrent workflow limit hits — GitHub Actions has concurrency limits per organisation. If jobs are queuing because the concurrency ceiling is being hit, that shows up as queue time spikes correlated with high-activity periods. This is distinct from runner capacity — it is a platform-level throttle.

.github/workflows/ci.yml — dependency caching pattern

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'          # cache npm dependencies automatically

      - name: Cache build output
        uses: actions/cache@v4
        with:
          path: .next/cache
          # key includes package-lock hash so cache invalidates on dep change
          key: nextjs-${{ hashFiles('package-lock.json') }}
          restore-keys: nextjs-

      - run: npm ci
      - run: npm test

How CI/CD metrics connect to DORA

The four DORA metrics are often treated as a separate concern from CI/CD pipeline health. They are not. Each DORA metric has a direct causal relationship with one or more of the CI/CD metrics above:

  • Lead Time for Changes (driven by build duration + queue time): pipeline time is a direct segment of the commit-to-production journey.
  • Deployment Frequency (driven by build success rate): failed builds block deployments; a low success rate caps how often you can ship.
  • Change Failure Rate (driven by flaky test rate + deployment success rate): unreliable tests miss regressions, and failed deploys count directly as change failures.
  • MTTR (driven by mean time to fix build): teams that fix broken builds fast also fix production incidents fast; it is the same discipline.

A team targeting DORA elite performance cannot get there with a 25-minute P95 CI pipeline and a 4% flaky test rate. The pipeline is not a separate problem to fix later — it is one of the primary variables determining where your DORA metrics land.

CI/CD anti-patterns to identify from metrics

"It's probably fine" merges

When build success rate drops below 80% on a repository, you will start to see a pattern: PRs merged with a yellow or partially-green CI status because "the failing job is probably just the flaky integration test." This rationalisation is the moment at which CI stops functioning as a quality gate. The metric that catches this pattern early is build success rate by branch, not just by main.

Pipeline debt

Pipeline debt is the accumulation of build steps that nobody removed when they stopped being useful. The classic form: a full CI suite that takes 62 minutes because it includes end-to-end tests written three years ago for a feature that no longer exists, a security scan that always passes and was never configured properly, and an artifact publish step that runs even on feature branches where it serves no purpose. Build duration trend data surfaces pipeline debt — a steady upward slope with no accompanying increase in coverage or reliability is the signature.

Test deserts

A repository where CI only checks lint and type correctness, with no meaningful test coverage, will show near-perfect build success rates and very short build durations. Both numbers look healthy. The underlying state is not. Tracking test pass rate and coverage delta alongside build success rate catches this pattern — a 99% success rate on a pipeline that runs zero tests is not a success metric.

Runner overprovisioning and underprovisioning

Pipeline queue time reveals both capacity problems. Consistent queue time above 5 minutes means underprovisioned runners. Consistent queue time near zero with runner utilisation below 20% means overprovisioned runners — you are paying for capacity that is rarely used. Both states are correctable once queue time is measured and surfaced separately from build execution time.

How to instrument CI/CD metrics

The instrumentation approach depends on which CI/CD platform you use. Most platforms expose the data required for all eight metrics above through either webhooks or a REST API.

GitHub Actions

The primary data sources are the Workflow Runs API and the Deployments API. For real-time ingestion, subscribe to the workflow_run webhook event — this fires on every workflow run completion with full metadata including conclusion, timing, and repository. The check_run webhook provides per-job granularity for job-level failure rate tracking.

For historical backfill, use GET /repos/{owner}/{repo}/actions/runs with pagination. Workflow run records include created_at, updated_at, conclusion, and run_attempt (retry count). The run_attempt field is particularly useful for detecting flakiness — a run that failed on attempt 1 and passed on attempt 2 with no code change in between is a flaky test signal.
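A sketch of the backfill and the run_attempt heuristic (the endpoint and response fields are real GitHub API names; pagination bounds and token handling are simplified):

```python
import json
import urllib.request

def fetch_runs(owner, repo, token, pages=5):
    """Backfill workflow runs via GET /repos/{owner}/{repo}/actions/runs."""
    runs = []
    for page in range(1, pages + 1):
        url = (f"https://api.github.com/repos/{owner}/{repo}/actions/runs"
               f"?per_page=100&page={page}")
        req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
        with urllib.request.urlopen(req) as resp:
            batch = json.load(resp)["workflow_runs"]
        if not batch:
            break
        runs.extend(batch)
    return runs

def retried_to_green(runs):
    """Runs that failed on attempt 1 but later passed on a retry of the
    same head commit, with no code change in between: a flaky-test signal."""
    failed_first = {r["head_sha"] for r in runs
                    if r["run_attempt"] == 1 and r["conclusion"] == "failure"}
    return [r for r in runs
            if r["run_attempt"] > 1 and r["conclusion"] == "success"
            and r["head_sha"] in failed_first]

sample = [
    {"head_sha": "abc", "run_attempt": 1, "conclusion": "failure"},
    {"head_sha": "abc", "run_attempt": 2, "conclusion": "success"},  # flaky signal
    {"head_sha": "def", "run_attempt": 1, "conclusion": "success"},
]
```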

See the GitHub Actions DORA metrics guide for full detail on the Deployments API and how to compute deployment-specific metrics accurately.

GitLab CI

GitLab exposes pipeline metrics via the Pipelines API (GET /projects/{id}/pipelines) and the Jobs API (GET /projects/{id}/jobs). Pipeline webhooks fire on pipeline events with full status and duration data. GitLab also provides built-in pipeline analytics in the CI/CD section of the project settings for basic trend data.

Jenkins

Jenkins exposes build data through its REST API at /job/{jobname}/api/json. The Blue Ocean plugin provides a more structured pipeline API with per-stage timing data — useful for identifying which stage within a Jenkins pipeline is the bottleneck. The Prometheus plugin exports Jenkins build metrics in a scrape-friendly format for teams already running a Prometheus stack.

CircleCI, Buildkite, and TeamCity

All three platforms provide pipeline webhooks that fire on build completion with outcome, duration, and attribution data. CircleCI's Insights API provides pre-computed workflow and job-level analytics. Buildkite's webhooks include agent wait time, making queue time measurement straightforward. TeamCity exposes build chain data through its REST API, which is useful for multi-stage pipeline duration analysis.

Quick wins to improve CI/CD performance

These are the changes that consistently produce the largest improvements relative to implementation effort. Prioritise in this order if your metrics are underperforming.

1. Parallelize test jobs with a matrix strategy

If your test suite runs as a single sequential job, splitting it across parallel jobs is the highest-leverage change you can make. A 20-minute sequential test run split across 4 parallel jobs takes 5 minutes — same code, same tests, 75% faster feedback.

.github/workflows/ci.yml — matrix parallelization

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
      - run: npm ci
      # vitest shard support: run 1/4, 2/4, 3/4, 4/4
      - run: npx vitest run --shard=${{ matrix.shard }}/4

2. Cache dependencies aggressively

Dependency installation is the single most avoidable source of build time. A clean npm ci on a project with 800 dependencies takes 90 seconds. A cached install takes 3 seconds. Cache hit rates below 80% usually mean the cache key includes something that changes too frequently — move the key to the package-lock.json hash only, not the commit SHA.

3. Run unit tests first, integration tests after

Unit tests should complete in under 2 minutes. If they fail, there is no reason to start an integration test suite that will take 12 minutes to complete. Structure your CI pipeline so that the fastest, most specific feedback comes first. A failing unit test job should fail fast and block the integration job from starting.

4. Use path filters in monorepos

In a monorepo, a commit touching only the documentation in the docs/ folder should not trigger a full build of the API service. Path filters limit CI execution to the parts of the codebase that actually changed, which reduces both build time and runner cost substantially.

.github/workflows/api.yml — path filter for monorepo

on:
  push:
    branches: [main]
    paths:
      - 'apps/api/**'
      - 'packages/db/**'
      - 'packages/types/**'
  pull_request:
    paths:
      - 'apps/api/**'
      - 'packages/db/**'
      - 'packages/types/**'

5. Delete coverage theatre

Coverage theatre is the set of tests that always pass trivially and provide no meaningful signal — tests that verify a constructor exists, tests that cover generated code that never changes, snapshot tests that snapshot entire page HTML. These tests increase build duration without improving failure detection. Identify them by looking at which test files have never had a failing run in the past 90 days despite code changes in the relevant modules. That is a signal of either a test that never fails or a test that never actually asserts anything useful.

6. Quarantine flaky tests explicitly

The worst thing you can do with a flaky test is leave it in the main suite where it occasionally blocks merges. The second-worst thing is delete it. The right approach is quarantine — move the test to a separate job that runs in allowed-failure mode, file a ticket to fix the underlying flakiness, and remove the test from the quarantine suite when it is stable. This preserves the test's signal for the team while preventing it from blocking delivery.

Setting a baseline and tracking improvement

The most common mistake teams make when starting CI/CD metric tracking is trying to fix everything simultaneously. The metrics are more useful as a triage and prioritisation tool than as a scorecard.

Start by establishing a 30-day baseline for all eight metrics above. Then rank the metrics by their current deviation from target and by their estimated impact on DORA performance. A 40-minute P95 build duration is almost certainly a higher priority fix than a 3% flaky test rate — even though both are below target. The build duration problem is a daily friction multiplied across every engineer on the team; the flakiness problem is significant but more contained.

Review the metrics monthly. Most improvements show up within a 2–4 week window of the change being made. If a change to the pipeline did not move the relevant metric, that is also useful information — it means the change addressed a symptom rather than a cause.

Metric gaming is a real risk

Build success rate can be gamed by removing tests that fail. Build duration can be reduced by skipping test stages. When introducing CI/CD metrics, pair them with coverage metrics and deploy failure rates so that a team cannot optimise one number by degrading another. The goal is delivery confidence — not green dashboards.

Putting it together: CI/CD metrics as a DORA accelerator

CI/CD metrics are not separate from engineering metrics — they are the mechanical layer beneath the DORA metrics that most engineering leaders track. A team that wants to improve Deployment Frequency needs to improve Build Success Rate first. A team that wants to reduce Lead Time needs to address Build Duration and Queue Time. A team that wants to reduce Change Failure Rate needs to fix its Flaky Test Rate.

The eight metrics in this guide form a complete picture of pipeline health. Track all eight, establish baselines, and prioritise improvements by their DORA impact. The fastest path to elite DORA performance almost always runs directly through CI/CD pipeline work — it is not glamorous, but it is the highest-leverage engineering investment most teams can make.

Track CI/CD outcomes alongside DORA metrics in Koalr

Koalr reads GitHub Actions workflow runs, check runs, and deployment events as part of the DORA pipeline. Build success rate, deployment frequency, and lead time are all computed from the same GitHub integration — no additional instrumentation required. See how your CI/CD performance is affecting your DORA metrics in a single dashboard.