LaunchDarkly and DORA Metrics: Connecting Flag Deployments to Delivery Health
DORA metrics — deploy frequency, lead time for changes, change failure rate, and mean time to recovery (MTTR) — are the standard framework for measuring software delivery health. Most DORA implementations measure deployments. With feature flags, value reaches users in two distinct steps (deploying the code, then enabling the flag), and most DORA dashboards capture only the first.
Why feature flags create a measurement gap in DORA
The DORA research program, founded as DevOps Research and Assessment and now run by Google Cloud, publishes its findings through the annual State of DevOps Report and defines deployment as the unit of delivery measurement. A deployment is when code changes go live to production. This definition made sense in 2014, before feature flags were standard practice in most engineering organizations.
In a flag-heavy engineering organization, the deployment event and the release event are decoupled. Code is deployed continuously, but value reaches users when a flag is enabled. A team that deploys ten times a week but rolls out features once a month has high deploy frequency and low release frequency. Which number should their DORA dashboard show?
The answer depends on what question you are trying to answer. If you want to measure the health of your CI/CD pipeline, deployment frequency is the right metric. If you want to measure how quickly value reaches users — which is what DORA is ultimately trying to capture — you need to include flag rollout events.
LaunchDarkly does not expose DORA metrics natively. It is a feature flag platform, not an engineering analytics tool. Bridging the two requires pulling LaunchDarkly's flag event data into your DORA measurement pipeline alongside your deployment events.
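A minimal sketch of that bridge: deployment events from your CI system and entries from LaunchDarkly's audit log merged into one chronological stream. The field names here (`finished_at`, `sha`, `date`, `name`) are illustrative assumptions about your pipeline's shape, not a fixed schema — check them against your actual CI payloads and audit-log responses.

```python
from datetime import datetime, timezone

def normalize_events(deployments, audit_entries):
    """Merge CI deployments and LaunchDarkly audit-log entries into one
    chronological stream. Field names are assumptions, not a schema."""
    events = [{"ts": d["finished_at"], "kind": "deployment", "ref": d["sha"]}
              for d in deployments]
    for e in audit_entries:
        # LaunchDarkly audit-log timestamps are epoch milliseconds.
        ts = datetime.fromtimestamp(e["date"] / 1000, tz=timezone.utc)
        events.append({"ts": ts, "kind": "flag_change", "ref": e["name"]})
    return sorted(events, key=lambda ev: ev["ts"])

# Hypothetical sample: one deployment, one flag change a day later.
deploys = [{"finished_at": datetime(2024, 1, 1, tzinfo=timezone.utc), "sha": "a1b2c3"}]
entries = [{"date": 1704153600000, "name": "new-checkout"}]  # 2024-01-02 UTC
timeline = normalize_events(deploys, entries)
```

Once both sources land on one timeline, every metric below becomes a query over that stream rather than a join across two systems.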
How each DORA metric maps to LaunchDarkly events
| DORA Metric | LaunchDarkly Event | Without Flags | With Flags |
|---|---|---|---|
| Deploy Frequency | Flag rollout event (percentage increase or full enable) | Only counts code deployments; underestimates actual delivery frequency | Each flag rollout increment is a release event — ships value to users even without a new deployment |
| Lead Time for Changes | Time from PR merge → flag reaching 100% rollout | Measured PR merge → deployment only; ignores time-to-user | Full lead time includes flag staging period; reveals hidden latency between "shipped" and "available" |
| Change Failure Rate | Flag kill-switch triggered = implicit failure signal | Only counts explicit rollbacks and hotfixes; misses flag-masked failures | Kill-switch events surface failures that were resolved silently without appearing in deployment history |
| MTTR | Time from kill-switch (incident signal) → stable rollout resume | Only measures incidents that went through formal PagerDuty/OpsGenie workflow | Flag-resolved incidents have their own MTTR signal; often faster but also often invisible to reporting |
Deploy frequency: flags as release events
Deployment frequency measures how often an organization successfully releases to production. Elite DORA performers deploy multiple times per day. But for a team using LaunchDarkly, many of those deployments carry no user-visible change — the new code ships dark, behind flags set to 0%.
There are two ways to handle this in DORA measurement. The first is to count only code deployments, accepting that your deploy frequency metric measures CI/CD pipeline health but does not represent release cadence. The second is to count each flag rollout increment as a deployment event, treating the enabling of a flag for users as the point at which value is delivered.
Neither approach is universally correct. The right choice depends on what your team is optimizing for. If engineering leadership uses deploy frequency to assess pipeline maturity, count deployments. If product leadership uses it to assess release cadence, count flag rollouts. The important thing is to be explicit about which definition you are using — most teams are not.
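To make the two definitions concrete, here is a sketch over a hypothetical four-week window, with counts chosen to echo the ten-deploys-a-week, one-rollout-a-month team described above:

```python
from datetime import date

def weekly_frequency(event_dates, weeks):
    """Average delivery events per week over the measurement window."""
    return len(event_dates) / weeks

# Hypothetical month: frequent deployments, one user-facing rollout.
deploy_days = [date(2024, 3, d) for d in (1, 4, 5, 8, 11, 12, 15, 18, 22, 25)]
rollout_days = [date(2024, 3, 15)]  # the one flag that reached users

pipeline_frequency = weekly_frequency(deploy_days, weeks=4)   # CI/CD health
release_frequency = weekly_frequency(rollout_days, weeks=4)   # release cadence
```

The same helper, fed different event streams, answers two different questions — which is exactly why stating the definition matters.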
There is also a third pattern worth distinguishing: the dark launch. In this pattern, a feature is deployed with its flag targeting 0% of users, but the code path is exercised in production to warm caches and validate configuration. This is a deployment without being a release — no users see the feature, yet the infrastructure is already live. Dark launch events are worth tracking separately as a leading indicator of upcoming releases.
Lead time: the hidden gap between merge and rollout
Lead time for changes measures the time from a commit being made to it running in production. DORA defines "running in production" loosely — technically it means deployed, not necessarily user-visible. But the spirit of the metric is measuring how quickly value moves from idea to user.
For teams using feature flags, there is a meaningful and often large gap between "deployed to production" and "available to users." A feature might be deployed in January but held behind a flag until a product launch in March. The lead time measured from commit to deployment looks excellent. The lead time measured from commit to users seeing the feature is two months.
Koalr tracks both. The deployment lead time (commit → deployment) tells you about your CI/CD pipeline efficiency. The release lead time (commit → flag reaching 100% rollout) tells you about your full delivery cycle. The difference between the two — the time a feature spends staged behind a flag — is a metric we call flag staging time.
High flag staging time is not always bad. Planned launches, legal review periods, and coordinated go-to-market timing all justify holding features behind flags for weeks. But high flag staging time that accumulates accidentally — because the product decision stalled, because the rollout got deprioritized, or because no one scheduled the final enable — represents waste. Tracking it makes the waste visible.
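The arithmetic behind these three numbers is simple subtraction over the same three timestamps. A sketch, using hypothetical dates where a change merges in January, deploys a day later, and reaches 100% rollout in March:

```python
from datetime import datetime, timezone

def lead_times(merged_at, deployed_at, fully_rolled_out_at):
    """Split lead time into pipeline time and flag staging time."""
    deployment_lead = deployed_at - merged_at       # merge -> production
    release_lead = fully_rolled_out_at - merged_at  # merge -> 100% of users
    staging = fully_rolled_out_at - deployed_at     # time held behind the flag
    return deployment_lead, release_lead, staging

# Hypothetical feature: merged Jan 10, deployed Jan 11, launched Mar 11.
merged = datetime(2024, 1, 10, tzinfo=timezone.utc)
deployed = datetime(2024, 1, 11, tzinfo=timezone.utc)
rolled_out = datetime(2024, 3, 11, tzinfo=timezone.utc)
dep_lead, rel_lead, staging = lead_times(merged, deployed, rolled_out)
```

Here the deployment lead time looks excellent (one day) while the release lead time is two months — the gap is entirely flag staging time.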
Change failure rate: kill switches as implicit failure signals
Change failure rate measures the percentage of deployments that result in a failure requiring a hotfix, rollback, or patch. It is calculated as failed deployments divided by total deployments over a measurement period.
Feature flags create a new category of failure signal that does not appear in the standard change failure rate calculation: the flag kill switch. When an engineer disables a flag in response to user-reported errors, elevated error rates, or performance degradation, that is a rollback. It is functionally identical to a deployment rollback — it reverts a change to reduce user impact — but it happens through LaunchDarkly rather than through your deployment pipeline.
If you measure change failure rate only from deployment rollbacks and hotfixes, you are undercounting failures. Every kill-switch event that was triggered in response to an incident is a failure that should increment your change failure rate counter. Teams that use flags extensively often have lower apparent change failure rates precisely because flags make rollbacks so fast and painless — the failures are still happening, but they resolve quickly and do not show up in deployment history.
Koalr includes kill-switch events in change failure rate calculation. This typically increases a team's reported change failure rate, but it also gives a more accurate picture. A team with a 2% deployment-only change failure rate and a 5% flag-inclusive change failure rate has a real 5% failure rate that is partly masked by effective use of flags.
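A sketch of both calculations, with hypothetical counts chosen to mirror the 2%-versus-5% example above. Note that whether rollout events belong in the denominator is the same judgment call as the deploy-frequency definition earlier — this version counts them as delivery events:

```python
def change_failure_rates(deployments, rollbacks, rollout_events, kill_switches):
    """Compare deployment-only CFR with a flag-inclusive CFR.
    Counts here are per measurement period (e.g. one quarter)."""
    deployment_only = rollbacks / deployments
    flag_inclusive = (rollbacks + kill_switches) / (deployments + rollout_events)
    return deployment_only, flag_inclusive

# Hypothetical quarter: 100 deployments, 2 rollbacks, 40 user-facing
# rollout events, 5 incident-driven kill switches.
dep_only_cfr, flag_incl_cfr = change_failure_rates(100, 2, 40, 5)
```

The deployment-only rate is 2%; including kill switches it rises to 5% — the failures were always there, they just resolved through LaunchDarkly instead of the deployment pipeline.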
MTTR: flag-resolved incidents and recovery time
Mean time to recovery measures how long it takes to restore service after a failure. Traditional MTTR calculation uses incident management systems: the gap between when an alert fires and when the incident is marked resolved.
Feature flags create a parallel recovery path that bypasses the formal incident management workflow. When an engineer disables a flag to stop user impact, service restores within seconds — but the incident may never be formally opened in PagerDuty or OpsGenie if the flag kill switch resolved the problem before anyone filed a ticket.
This is a genuine improvement in MTTR, but it is invisible to your MTTR dashboard. The recovery happened; it just happened outside the measurement system.
Koalr measures flag-resolved MTTR as a separate metric: the time from when error rates spiked (or when a user report was filed) to when a kill switch was toggled and error rates returned to baseline. This gives you a complete picture of recovery performance — both incidents that went through formal channels and incidents that were resolved silently through flag management.
For elite DORA performers, flag-resolved MTTR is often under five minutes. That is a meaningful capability worth surfacing. A team that achieves sub-five-minute recovery through flag management should see that in their metrics — not have it invisible because it bypassed the incident management tool.
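A sketch of the calculation, with hypothetical incident timestamps. Each measured span runs from the error-rate spike to error rates returning to baseline after the kill switch was toggled:

```python
from datetime import datetime, timedelta, timezone

def flag_resolved_mttr(incidents):
    """Mean recovery time for incidents resolved via a flag kill switch.
    Each incident is a (error_spike_at, baseline_restored_at) pair."""
    durations = [restored - spiked for spiked, restored in incidents]
    return sum(durations, timedelta()) / len(durations)

# Three hypothetical flag-resolved incidents: 3-, 5-, and 4-minute recoveries.
t0 = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
incidents = [
    (t0, t0 + timedelta(minutes=3)),
    (t0, t0 + timedelta(minutes=5)),
    (t0, t0 + timedelta(minutes=4)),
]
mttr = flag_resolved_mttr(incidents)
```

A four-minute mean recovery is the kind of number that never appears in PagerDuty-derived MTTR, because none of these incidents were formally opened.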
Implementing LaunchDarkly-aware DORA metrics
To implement DORA metrics that account for LaunchDarkly flag events, you need to pull data from three sources and correlate them on a common timeline:
- GitHub or your source control — commit timestamps, deployment events, PR metadata
- LaunchDarkly audit log — flag enable/disable events, percentage rollout changes, kill-switch events
- Incident management — PagerDuty or OpsGenie alerts, incident open and close timestamps
Koalr connects all three. Flag events from LaunchDarkly appear in the same timeline as deployments and incidents, so your DORA metrics automatically incorporate flag rollout events without manual reconciliation.
The implementation also handles the disambiguation problem: not every flag enable is a release event, and not every kill switch is a failure. Koalr distinguishes between flag changes that affect end users (rollout changes, targeting rule changes) and flag changes that affect only internal configuration (maintenance flags, ops toggles). Only user-facing flag events contribute to DORA metric calculations.
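A sketch of that filter. The action names and the tag convention here are illustrative assumptions about how a team might label its flags, not LaunchDarkly built-ins — substitute whatever actions and tags your audit log actually emits:

```python
# Team convention (assumed): flags tagged this way are internal-only.
OPS_TAGS = {"ops", "config", "maintenance"}

# Assumed audit-log action names for changes that can affect end users.
USER_FACING_ACTIONS = {"updateOn", "updateRules", "updateFallthrough"}

def is_user_facing(event):
    """True if a flag change should count toward DORA metrics."""
    if event["action"] not in USER_FACING_ACTIONS:
        return False  # e.g. description or metadata edits
    return not (set(event.get("tags", [])) & OPS_TAGS)

events = [
    {"action": "updateOn", "tags": ["checkout"]},       # user-facing rollout
    {"action": "updateOn", "tags": ["ops"]},            # internal ops toggle
    {"action": "updateDescription", "tags": []},        # metadata only
]
dora_events = [e for e in events if is_user_facing(e)]
```

Only the first event survives the filter, which is the behavior you want: ops toggles and metadata edits should never inflate release frequency.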
What good looks like
A team with a mature LaunchDarkly + DORA integration typically sees three things in their metrics:
- Deploy frequency increases when flag rollout events are included, reflecting the actual cadence at which value reaches users.
- Change failure rate increases slightly as kill-switch events are included, giving a more accurate picture of the failure rate the team is actually experiencing.
- MTTR decreases as flag-resolved incidents are included, revealing fast recovery capabilities that were previously invisible to reporting.
These changes do not reflect a deterioration in engineering health — they reflect a more accurate measurement of it. The team that previously appeared to have a 1% change failure rate and a 45-minute MTTR may actually have a 3% failure rate and a 6-minute MTTR. Both numbers are more useful for making decisions about where to invest in reliability engineering.
DORA metrics that include your LaunchDarkly data
Koalr connects LaunchDarkly flag events to your DORA dashboard — deploy frequency, lead time, change failure rate, and MTTR. Get a complete picture of delivery health, not just the deployment half of it.