Deploy Risk · March 2026 · 9 min read

Feature Flag Risk: How LaunchDarkly Changes Connect to Production Incidents

Feature flags were supposed to eliminate deployment risk. Ship code behind a flag, test in production with a subset of users, roll back instantly if something goes wrong. The promise was compelling enough that LaunchDarkly built a $3B business on it. But flags introduce their own category of incident — one that most engineering teams are not measuring.

The flag incident pattern

Post-mortems at high-velocity engineering teams consistently show that feature flag changes — not code deployments — are responsible for a disproportionate share of production incidents. The mechanism is simple: code ships dark, but flag evaluation logic runs live on every request.

The premise that turned out to be incomplete

The conventional wisdom in continuous delivery is that feature flags decouple deployment from release. You merge code to main, it ships in the next deploy, and nothing changes for users until you flip the flag. This is accurate. But it led to an assumption that turned out to be wrong: that the deployment event is the only moment of risk.

Flags are not passive configuration. They are executed on every request, and their evaluation logic touches your live application state — user attributes, percentage rollout buckets, targeting rules, dependency chains. Every change to a flag or to the code that a flag gates is a live production change, even if no deployment happened.
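To make "executed on every request" concrete, here is a minimal sketch of how a percentage rollout can bucket users deterministically. This is illustrative only — LaunchDarkly's real bucketing uses its own hashing scheme and salts — but it shows the key property: the rollout percentage is live configuration, so changing it changes behavior on the very next request, with no deployment.

```python
import hashlib

def rollout_bucket(flag_key: str, user_key: str) -> float:
    """Map a (flag, user) pair to a stable bucket in [0, 100).

    Illustrative sketch: the bucket is a pure function of the keys,
    so the same user always lands in the same bucket for a flag.
    """
    digest = hashlib.sha256(f"{flag_key}:{user_key}".encode()).hexdigest()
    return int(digest[:8], 16) / 0x100000000 * 100

def is_enabled(flag_key: str, user_key: str, rollout_pct: float) -> bool:
    # Evaluated live on every request: moving rollout_pct from 0 to 10
    # in the flag dashboard is a production change with no deploy event.
    return rollout_bucket(flag_key, user_key) < rollout_pct
```

The determinism is what makes rollouts feel safe — a given user sees consistent behavior — but it also means the dashboard slider, not the deploy pipeline, controls who runs the new code.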

This distinction matters because most deploy risk tooling is triggered by pull requests and deployment events. If you are not also watching for flag changes, you have a blind spot in your incident prevention pipeline.

How flag changes cause production incidents

Four failure modes appear consistently in engineering post-mortems. None of them are edge cases — they are predictable consequences of how feature flags interact with production systems.

1. Percentage rollout paired with a breaking change in the same PR

A developer ships a refactor that changes the response shape of an internal API. The change is gated behind a flag set to 0% rollout, so it passes code review without incident. Two weeks later, a product manager enables the flag for 10% of users as part of an experiment. The response shape change breaks a downstream consumer that was written against the old format.

The deployment that introduced the breaking code happened two weeks ago and passed every check. The incident was triggered by a targeting rule change in LaunchDarkly — which created no Git commit, no PR, and no deployment event. If your monitoring is deploy-correlated, you will look at the deployment timeline and find nothing.

2. Stale flag cleanup that removes code still in use

Flag hygiene is a real problem. When engineers finally clean up long-lived flags, they typically search for the flag key, delete the associated code branches, and open a PR. The PR looks correct. The flag is gone, the server-side references are removed, the tests pass.

The problem: other code paths sometimes evaluate the same flag independently, or the flag key appears in mobile apps deployed on a slower release cadence. A flag removal that looks clean in a web monorepo creates a KeyNotFound error in a mobile app still on last month's release. There is no quick rollback — you need a hotfix through the app store review process.

This class of incident is hard to catch because the risk lives in a dependency that code review does not surface. The PR is correct given what code review can see; the failure mode requires cross-system knowledge that no individual reviewer has.
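One mitigation is to treat flag removal as a cross-system search problem: scan every codebase and supported release branch for the flag key before deleting it. A minimal sketch of that check (the file contents and SDK call names below are hypothetical examples, not a real detection ruleset):

```python
import re

def find_flag_refs(flag_key: str, files: dict[str, str]) -> list[str]:
    """Return paths that still reference a flag key.

    `files` maps path -> contents. In practice you would scan every
    repo and every still-supported release branch (web, backend,
    mobile), not just the repo the cleanup PR touches.
    """
    pattern = re.compile(re.escape(flag_key))
    return [path for path, text in files.items() if pattern.search(text)]

codebases = {
    "web/src/checkout.ts": 'flags.variation("new-checkout", user, false)',
    "ios/Checkout.swift": 'client.boolVariation("new-checkout")',  # last month's release
    "api/billing.py": "def charge(order): ...",
}
```

Here the cleanup PR removes the web reference, but the iOS app on a slower release cadence still evaluates the key — so removal is not yet safe, and no single-repo review would have caught it.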

3. Targeting rule changes that hit more users than expected

LaunchDarkly targeting rules are evaluated on every request against user attributes. A common failure mode is a rule written against a user attribute that has since changed semantics. A rule targeting plan === "enterprise" was written when enterprise customers were 5% of users. After a pricing change that moved many customers to an enterprise tier, the same rule now targets 40% of users — including users on infrastructure that cannot handle the flagged feature's load profile.

The rule did not change. The underlying data did. Without observability that correlates flag evaluation rates with error rates, this failure mode is nearly invisible until the incident is already underway.
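The drift is easy to demonstrate with a toy population. The numbers below mirror the example above and are purely illustrative: the rule is byte-for-byte identical in both snapshots, and only the user data underneath it moves.

```python
def targeted_share(users: list[dict], rule) -> float:
    """Fraction of the user base a targeting rule currently matches."""
    return sum(1 for u in users if rule(u)) / len(users)

# The rule, unchanged for months.
rule = lambda u: u["plan"] == "enterprise"

# Before the pricing change: 5 of 100 users are enterprise.
before = [{"plan": "enterprise"}] * 5 + [{"plan": "pro"}] * 95
# After the pricing change: 40 of 100 users are enterprise.
after = [{"plan": "enterprise"}] * 40 + [{"plan": "pro"}] * 60
```

targeted_share goes from 0.05 to 0.40 without any edit appearing in the LaunchDarkly audit log — which is exactly why this failure mode evades change-triggered monitoring.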

4. Permanent experiments — flags without a reliable kill switch

Experiments have a defined lifecycle: hypothesis, rollout, measurement, decision. In practice, many experiments outlive their intended scope. A flag that was supposed to run for two weeks runs for eight months. The control code path stops being exercised in development, the tests against it stop being maintained, and the team treats the flagged behavior as the production default.

When an incident occurs and the team reaches for the kill switch to roll back to the control path, they discover that the control path has bit-rotted. Rolling back the flag does not restore the previous behavior — it exposes untested code that has drifted from production for months. The kill switch is not reliable. Mean time to recovery increases substantially when the rollback path cannot be trusted.

What deploy risk scoring adds to LaunchDarkly

LaunchDarkly gives you the controls — targeting rules, percentage rollouts, kill switches, audit logs. What it does not give you is risk scoring on the code changes that activate when you flip those controls.

Koalr's LaunchDarkly deploy risk integration watches for flag-related file changes in pull requests and correlates them with production incident history. When a PR modifies files that contain flag evaluations — LaunchDarkly SDK calls, flag key references, targeting rule configuration — the risk scorer applies additional signal weight to those changes.

Flag-adjacent code is not the same as ordinary application code. It tends to have more conditional branches, more user-attribute dependencies, and more paths that are exercised inconsistently in test suites. A change to a flag evaluation block that looks small by lines-changed is often higher risk than its size suggests.
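Koalr's actual detection logic is not published; as a rough sketch, identifying flag-adjacent files and boosting their signal weight can be as simple as pattern-matching for SDK evaluation calls. The patterns and the boost factor below are hypothetical placeholders.

```python
import re

# Patterns suggesting a file contains flag evaluations. Hypothetical
# and incomplete: real detection would cover every SDK in use.
FLAG_PATTERNS = [
    re.compile(r"\bvariation\s*\("),
    re.compile(r"\bboolVariation\s*\("),
    re.compile(r"ld[_-]?client", re.IGNORECASE),
]

def is_flag_adjacent(file_text: str) -> bool:
    return any(p.search(file_text) for p in FLAG_PATTERNS)

def weight_changes(changed_files: dict[str, str], boost: float = 1.5) -> dict[str, float]:
    """Assign extra risk-signal weight to flag-adjacent files in a PR."""
    return {
        path: (boost if is_flag_adjacent(text) else 1.0)
        for path, text in changed_files.items()
    }
```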

The signals that matter for flag risk

Not every flag change carries the same risk profile. Koalr's scoring model weights three categories of signal when evaluating PRs that touch flag-adjacent code.

Flag file change density

How many files containing flag evaluations were modified? A PR that modifies one file with one flag call is lower risk than a PR that modifies fifteen files across three services. Change spread is a reliable predictor of incident probability regardless of the flag context — it reflects coordination complexity and the likelihood that some changed path will not be adequately tested.
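A sketch of the density signal, under the assumption (hypothetical, for illustration) that the first path segment identifies the owning service:

```python
def flag_change_density(changed_paths: list[str], flag_files: set[str]) -> dict:
    """Count flag-evaluating files touched by a PR and their service spread.

    `flag_files` is the set of paths known to contain flag evaluations.
    A sketch of the idea, not Koalr's scoring model.
    """
    touched = [p for p in changed_paths if p in flag_files]
    services = {p.split("/", 1)[0] for p in touched}
    return {"flag_files": len(touched), "services": len(services)}

small_pr = ["billing/handlers.py"]
wide_pr = ["billing/handlers.py", "checkout/flags.py", "search/ranker.py"]
known_flag_files = set(small_pr) | set(wide_pr)
```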

Blast radius estimation

For percentage rollouts, what share of your user base does this flag affect? Koalr estimates blast radius by correlating LaunchDarkly targeting rule configuration with your GitHub PR data. A flag set to 100% rollout targeting all users carries a fundamentally different risk profile than a flag at 1% targeting internal users, even if the code change is identical.

Blast radius matters most during incident response. A kill switch that affects 100% of users requires more caution to flip than one affecting 5% of users — but in the heat of an incident, teams often treat both the same way. Surfacing blast radius before an incident is cheaper than learning it during one.
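As a first-order sketch, blast radius is the product of the rollout percentage, the share of users the targeting rule matches, and the size of the user base. Real targeting rules compose (individual targets, segments, rule ordering), so this understates the complexity, and every number below is hypothetical.

```python
def blast_radius(rollout_pct: float, targeted_share: float, total_users: int) -> int:
    """Estimated users exposed if the flag turns on.

    First-order estimate only: ignores segment overlap, individual
    targets, and rule ordering.
    """
    return round(total_users * targeted_share * rollout_pct / 100)
```

For a hypothetical base of 200,000 users: a 100% rollout targeting everyone exposes all 200,000, while a 1% rollout targeting a 2% internal segment exposes about 40 — the "fundamentally different risk profile" for an identical code change.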

Author file-expertise score

Has the engineer who wrote this change modified these files before? Koalr's file-expertise signal uses Git history to measure how familiar an author is with the specific files they changed in a PR. For flag-adjacent code, expertise is particularly important because the failure modes are non-obvious. An engineer who has worked extensively with a flag implementation knows the edge cases. An engineer making their first change to that system is more likely to miss them.

This signal is not about penalizing engineers with lower expertise scores. It is about ensuring PRs modifying high-stakes flag code receive additional review attention — and that the reviewers assigned have the context to catch subtle problems.
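The underlying computation can be sketched as the share of a PR's files the author has previously touched, derived from Git history. The pair-list input format is a simplification for illustration, not Koalr's model.

```python
def expertise_score(author: str, changed_files: list[str],
                    history: list[tuple[str, str]]) -> float:
    """Fraction of the PR's files the author has modified before.

    `history` is (author, path) pairs, e.g. extracted from `git log`.
    A sketch of the file-expertise idea, not the production signal.
    """
    if not changed_files:
        return 0.0
    seen = {path for who, path in history if who == author}
    return sum(1 for f in changed_files if f in seen) / len(changed_files)
```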

Connecting LaunchDarkly audit logs to incident timelines

LaunchDarkly maintains a full audit log of every flag change: who changed it, when, what the previous and new configuration were. This log is the key to correlating flag events with production incidents — but the correlation is manual without tooling.

The standard incident debugging sequence is: check recent deployments, check infrastructure changes, check application logs. Flag changes often fall outside this sequence because they are not deployments and do not appear in infrastructure change logs. Engineers have to remember to check the LaunchDarkly audit log separately — and in incident response, memory is not a reliable mechanism.

Koalr pulls LaunchDarkly audit events into the same incident timeline view as deployments, infrastructure changes, and alert triggers. Flag changes appear in the timeline without requiring a separate lookup, surfacing the correlation between a targeting rule change and an error rate spike that would otherwise take 20 minutes of manual digging to find.
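Mechanically, this kind of unified timeline is a merge of timestamped event streams. The tuple shape and event descriptions below are hypothetical, not LaunchDarkly's audit log schema, but they show why co-locating the streams matters during diagnosis.

```python
from datetime import datetime

def incident_timeline(*event_streams):
    """Merge deploys, infra changes, and flag audit events by timestamp.

    Each event is (timestamp, source, description) -- a hypothetical
    shape for illustration.
    """
    merged = [event for stream in event_streams for event in stream]
    return sorted(merged, key=lambda e: e[0])

deploys = [(datetime(2026, 3, 2, 14, 0), "deploy", "api v412")]
flag_events = [(datetime(2026, 3, 2, 14, 23), "launchdarkly",
                "new-checkout: rollout 0% -> 10%")]
alerts = [(datetime(2026, 3, 2, 14, 25), "alert", "5xx rate spike")]
```

Merged, the flag change lands two minutes before the alert — an adjacency that is obvious in one view and invisible when the audit log lives in a separate tab.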

Practical risk reduction without adding friction

The goal is not to add friction to LaunchDarkly usage. Feature flags are genuinely valuable for deployment safety, and a risk scoring system that makes teams afraid to use flags would be counterproductive. The goal is to make the higher-risk patterns visible before they cause incidents.

In practice, this means three things:

  • Risk scores on flag-adjacent PRs appear as GitHub PR comments before merge, so reviewers know to pay extra attention to flag evaluation changes.
  • Blast radius estimates surface in the pre-merge context, so engineers know how many users are affected when a flag rolls out.
  • Flag events in incident timelines reduce mean time to diagnosis by surfacing LaunchDarkly changes alongside deployment and infrastructure events.

None of this requires changing how engineers write code or how product managers use LaunchDarkly. It adds a layer of observability to the flag lifecycle that does not exist in either tool today.

Score deploy risk from your LaunchDarkly changes

Koalr connects LaunchDarkly flag events to your PR risk scores and incident timelines. See which flag changes are highest risk before they reach production.