Datadog + DORA Metrics: Using Observability Data to Improve Deployment Safety
Datadog has DORA metrics built in. Its Deployment Tracking, SLO dashboards, and Incident Management give you a precise picture of your delivery performance — after the fact. Koalr operates in the window that comes first: before the PR merges. Together they form a closed loop that neither tool can close on its own.
The core distinction
Datadog is great at telling you your MTTR was 47 minutes last quarter. Koalr tells you the PR that caused the incident scored 78/100 deploy risk before it merged — and which signals drove that score. The two tools answer different questions at different points in the delivery timeline.
How Datadog implements DORA metrics
Datadog introduced native DORA support via its Deployment Tracking product, which instruments deploys and computes the four key metrics — deployment frequency, lead time for changes, change failure rate, and MTTR — directly in the Datadog platform. The implementation is solid and increasingly mature.
Deployment events are the foundation. You can emit them programmatically using the POST /api/v2/dora/deployments endpoint, which accepts a service name, version, environment, and timestamp. From these events, Datadog computes deployment frequency (how often are you deploying a given service?) and lead time (how long from code commit to production deploy?). Alternatively, for teams using Datadog CI Visibility, pipeline completion events flow through automatically — you get deployment frequency without a separate instrumentation step.
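As a concrete sketch, here is what emitting one of those deployment events might look like, built from the fields named above (service, version, environment, timestamp). The exact request schema should be confirmed against the Datadog API reference before use; the body shape here is an assumption.

```python
# Sketch: building the body for POST /api/v2/dora/deployments from the
# fields described above. Verify the schema against the Datadog API docs.
import json
import time

def build_deployment_event(service, version, env, finished_at=None):
    ts = int(finished_at if finished_at is not None else time.time())
    return {
        "data": {
            "attributes": {
                "service": service,
                "version": version,
                "env": env,
                "finished_at": ts,  # unix timestamp of deploy completion
            }
        }
    }

body = build_deployment_event("payments-api", "1.2.3", "production",
                              finished_at=1700000000)
print(json.dumps(body))
```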
Retrieving the computed metrics is equally direct. The GET /api/v2/dora/metrics endpoint returns DORA metric values for a specified service and timeframe — you pass a service name, environment, and date range, and Datadog returns deployment frequency, lead time, change failure rate, and MTTR as computed aggregates. This is the data powering the built-in DORA dashboard widgets available in any Datadog account with Deployment Tracking enabled.
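The retrieval side can be sketched the same way. The parameter names below are illustrative, derived from the inputs the article lists (service, environment, date range); confirm them against the Datadog API reference.

```python
# Sketch: assembling the query for GET /api/v2/dora/metrics as described
# above. Parameter names are assumptions, not the confirmed API contract.
from datetime import datetime, timedelta, timezone

def dora_metrics_query(service, env, days=30):
    now = datetime.now(timezone.utc)
    return {
        "service": service,
        "env": env,
        "from": int((now - timedelta(days=days)).timestamp()),
        "to": int(now.timestamp()),
    }

params = dora_metrics_query("payments-api", "production")
```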
For MTTR specifically, Datadog's Incident Management product provides the incident timeline that feeds the calculation. When an incident is declared in Datadog and subsequently resolved, the open-to-close window is attributed to deployments that preceded it, giving you MTTR that is correlated with actual deployment events rather than estimated from aggregate ticket data.
SLO monitoring rounds out the picture. Datadog allows you to define SLOs against any monitored metric — for example, error_rate < 0.1% over a 30-day rolling window — and then configure burn rate alerts that fire when you are consuming your error budget faster than sustainable. This is the Google SRE burn rate model: consuming budget at 2× the sustainable rate triggers a warning; consuming at 14.4× triggers an immediate page, because at that rate you burn 2% of a 30-day error budget every hour and exhaust the whole budget in about two days.
The gap in Datadog's DORA story
Datadog measures outcomes. It captures what happened in production after your deployment landed. That is genuinely valuable — you cannot improve what you do not measure, and DORA aggregates give engineering leaders the baseline they need to have meaningful conversations about delivery performance.
But the outcome measurement arrives after the risk event has already occurred. By the time Datadog's change failure rate widget ticks up, a user-facing incident has already happened. By the time MTTR accumulates, on-call engineers have already been paged. The measurement is accurate, but the moment of intervention it points to — the pre-merge window — has already passed.
Several specific gaps follow from this:
- No pre-merge risk scoring. Datadog has no mechanism to evaluate whether a specific PR — this diff, right now, before it merges — is likely to cause an incident. The data that would enable this (author commit history, file churn rates, review coverage, code coverage delta) lives in GitHub and your CI pipeline, not in Datadog.
- No code coverage correlation. Datadog does not ingest per-PR coverage reports from Codecov or SonarCloud. A PR that drops test coverage by 12% on the files it touches is not flagged differently from one that improves coverage — Datadog has no visibility into that signal.
- No CODEOWNERS enforcement. Datadog cannot answer "did the right people review this change?" It has no integration with your repository's CODEOWNERS file or with GitHub review data. A critical security service merged without a review from its owning team is invisible in Datadog until an incident results.
- No AI chat against live engineering data. Datadog Notebooks and dashboards let you explore metrics visually, but you cannot ask a natural language question like "which services have the highest deploy risk this week and why?" and get a synthesized answer.
None of this is a criticism of Datadog — it is an observability platform, not a pre-merge risk predictor. These are different tools solving different problems at different points in the deployment lifecycle.
SLO burn rate as a deploy risk signal
The most technically precise integration point between Datadog and Koalr is SLO burn rate. This is where Datadog's post-deploy observability directly feeds Koalr's pre-merge risk scoring.
The mechanism: Datadog exposes SLO history via GET /api/v1/slo/{slo_id}/history, which accepts from_ts and to_ts query parameters and returns the SLO status over that window, including overall.slo_met (boolean) and the calculated error budget consumption rate. Koalr reads this endpoint at deploy time for each service with a connected SLO.
The risk scoring logic works as follows. When a PR targeting a service is evaluated, Koalr checks whether that service's Datadog SLO burn rate exceeded the warning threshold (2× sustainable) in the preceding 48 hours. If it did, the PR receives a +15 point penalty on its deploy risk score — because deploying to a service that is already burning its error budget at an elevated rate increases the probability of a compounding incident. The service is already degraded; the new deployment adds change surface to a system under stress.
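A minimal sketch of that penalty logic, assuming the burn-rate readings for the preceding 48 hours have already been pulled from GET /api/v1/slo/{slo_id}/history. The function name and the shape of the history data are hypothetical.

```python
# Sketch of the described scoring rule: +15 points if the service's SLO
# burn rate exceeded the 2x warning threshold in the last 48 hours.
WARNING_BURN_RATE = 2.0  # Google SRE "warning" threshold
PENALTY_POINTS = 15      # flat penalty for deploying into a degraded service

def apply_slo_penalty(base_score, burn_rate_samples):
    """Add the penalty if any reading in the window exceeded 2x; cap at 100."""
    if any(rate > WARNING_BURN_RATE for rate in burn_rate_samples):
        return min(100, base_score + PENALTY_POINTS)
    return base_score
```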
Post-deploy correlation closes the feedback loop in the other direction. After a deployment completes, Koalr reads the SLO history for the 2-hour window following the deploy event. If the SLO burn rate spiked within that window — consistent with the deploy causing a degradation — Koalr attributes this as a negative outcome to the deploying PR. This outcome feeds the deploy risk ML model: the PR's pre-merge signals are retrospectively labeled as "led to SLO burn", strengthening the model's weighting of those signals for future predictions.
The Google SRE burn rate model
A 30-day SLO with a 99.9% target has an error budget of 43.2 minutes per month. A burn rate of 1× means you are consuming that budget at exactly the sustainable rate. A burn rate of 14.4× means you consume 2% of the monthly budget every hour and exhaust it in roughly two days — Datadog fires a P1 alert at this threshold. Koalr uses the 2× burn rate threshold (warning level) as its deploy risk trigger, since by the time you hit 14.4× the incident has already started.
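The budget arithmetic works out as follows:

```python
# The arithmetic behind the burn-rate model, worked through in code.
WINDOW_DAYS = 30
TARGET = 0.999  # 99.9% availability target

# Error budget: the 0.1% of the window you are allowed to fail.
budget_minutes = round((1 - TARGET) * WINDOW_DAYS * 24 * 60, 1)
print(budget_minutes)  # 43.2

def hours_to_exhaustion(burn_rate):
    """At burn rate N, a budget sized to last the window lasts window/N."""
    return round(WINDOW_DAYS * 24 / burn_rate, 1)

print(hours_to_exhaustion(1.0))   # 720.0 hours: exactly the 30-day window
print(hours_to_exhaustion(2.0))   # 360.0 hours: the 2x warning level
print(hours_to_exhaustion(14.4))  # 50.0 hours: about two days
```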
Deployment markers and DORA frequency
Deployment frequency — how often you ship to production — is the DORA metric most directly correlated with engineering team performance. Elite performers deploy multiple times per day; high performers deploy between once per day and once per week.
Datadog provides two mechanisms for marking deployments. The first is the POST /api/v2/dora/deployments endpoint, designed specifically for DORA instrumentation. The second is the older events API: POST /api/v1/events with alert_type: info and tags including deployment, service:my-service, and version:1.2.3. Many teams have been using the events approach for years and have rich deployment history in their Datadog event stream without having explicitly adopted the DORA API.
Koalr reads both sources. Using GET /api/v1/events with tags=deployment,service:my-service, Koalr retrieves the paginated deployment event history for each connected service and computes deployment frequency independently of GitHub releases or tags. This is particularly valuable for teams using Datadog CI Visibility as their primary deployment tracking mechanism — their deployment frequency data lives in Datadog, not in GitHub release artifacts.
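The query and the frequency aggregation can be sketched like this. The start/end/tags parameter names follow the v1 events endpoint described above; the aggregation itself is illustrative.

```python
# Sketch: query parameters for GET /api/v1/events and a simple weekly
# deployment-frequency calculation over the returned events.
from datetime import datetime, timedelta, timezone

def deployment_events_query(service, days=28):
    now = datetime.now(timezone.utc)
    return {
        "start": int((now - timedelta(days=days)).timestamp()),
        "end": int(now.timestamp()),
        "tags": f"deployment,service:{service}",  # comma-separated tag filter
    }

def deploys_per_week(event_count, days=28):
    """Deployment frequency expressed as deploys per week over the window."""
    return event_count / (days / 7)
```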
Cross-validation between data sources catches discrepancies. If Koalr sees a GitHub release for v1.2.3 but no corresponding Datadog deployment event within a reasonable window, it surfaces this on the DORA dashboard as a data quality flag — useful for identifying gaps in deployment instrumentation before they affect metric accuracy.
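The cross-validation check reduces to a simple matching rule: flag any GitHub release with no Datadog deployment event inside a tolerance window. The one-hour tolerance and the data shapes below are assumptions for illustration.

```python
# Sketch of the data-quality flag: releases with no matching Datadog
# deployment event within the tolerance window are surfaced.
def missing_deploy_events(releases, dd_events, tolerance_s=3600):
    """releases and dd_events both map version -> unix timestamp."""
    flagged = []
    for version, released_at in releases.items():
        seen_at = dd_events.get(version)
        if seen_at is None or abs(seen_at - released_at) > tolerance_s:
            flagged.append(version)
    return flagged
```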
Setting up the Datadog integration in Koalr
The integration requires two Datadog credentials: an API key for authenticated requests and an Application key for read access to your account's data. Both are generated from the Datadog Organization Settings.
To generate an API key: navigate to Organization Settings → API Keys → New Key. Give it a descriptive name (e.g., Koalr Integration) and copy the key value — it is shown only once. To generate an Application key: navigate to Organization Settings → Application Keys → New Key. The Application key is tied to the user who creates it and determines which data the integration can read.
These two keys are passed as headers in every Datadog API request: DD-API-KEY and DD-APPLICATION-KEY.
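As a small helper sketch, the header pair looks like this (values shown are placeholders, never real credentials):

```python
# The two credential headers attached to every Datadog API request.
def datadog_headers(api_key, app_key):
    return {
        "DD-API-KEY": api_key,
        "DD-APPLICATION-KEY": app_key,
        "Content-Type": "application/json",
    }

headers = datadog_headers("<your-api-key>", "<your-app-key>")
```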
In Koalr: navigate to Settings → Integrations → Datadog, paste both keys, and select your Datadog site (US1 at datadoghq.com, EU1 at datadoghq.eu, US3 at us3.datadoghq.com, or US5 at us5.datadoghq.com). Once credentials are validated, Koalr queries the Datadog service catalog to discover your services automatically — you select which services to monitor from the discovered list.
What Koalr surfaces from Datadog data
Once connected, Datadog data flows into several places in the Koalr platform.
On the DORA dashboard, each tracked service displays SLO health as a color-coded indicator: green (SLO currently met, burn rate below 1×), amber (burn rate between 1× and 2×, budget being consumed faster than sustainable), or red (burn rate above 2×, or SLO currently breached). Hovering the indicator shows the specific SLO target and current error budget remaining as a percentage.
A 30-day SLO burn rate sparkline sits alongside the health indicator, showing the burn rate trend over the period. This is more informative than a point-in-time status — a service at amber with a declining burn rate trend is in a different position than a service at amber with a rising one.
Deployment frequency on the DORA dashboard is computed from Datadog deployment events (cross-validated with GitHub), giving you an accurate frequency metric even for services that do not use GitHub Releases as their canonical deployment artifact.
MTTR pulls from Datadog Incidents. When an incident is opened and subsequently resolved in Datadog, Koalr reads the incident timeline via the Incidents API and attributes it to the most recent deployment preceding the incident open time. This gives you per-service MTTR that is correlated with actual deployment events — not estimated from ticket creation timestamps.
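The attribution rule above can be sketched in a few lines: each resolved incident is attributed to the most recent deployment preceding its open time, and MTTR is the mean open-to-resolve duration. The data shapes are hypothetical.

```python
# Sketch: attribute incidents to the most recent preceding deployment and
# compute MTTR as the mean open-to-resolve duration.
def attribute_and_mttr(incidents, deploy_times):
    """incidents: list of (opened_at, resolved_at) pairs; deploy_times: unix ts."""
    attribution = {}
    durations = []
    for opened, resolved in incidents:
        prior = [d for d in deploy_times if d <= opened]
        if prior:
            attribution[opened] = max(prior)  # most recent preceding deploy
        durations.append(resolved - opened)
    mttr = sum(durations) / len(durations) if durations else 0.0
    return attribution, mttr
```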
In AI Chat, Datadog data becomes queryable in natural language. You can ask: "Which services are currently burning their error budget the fastest?" and get a ranked list with burn rates. Or: "What was the MTTR for the payments service last quarter, and which PRs were deployed in the hour before each incident?" — a query that crosses Datadog incident data with GitHub PR history in a single synthesized answer.
AI Chat example
You: "Which services are currently burning their error budget the fastest?"
Koalr: "The payments-api service has the highest current burn rate at 4.2× sustainable, consuming error budget at four times the safe rate. At this pace the monthly budget will be exhausted in 7 days. The auth-service and notification-worker are at 1.8× and 1.4× respectively. The payments-api spike began 6 hours ago, which correlates with the deployment of PR #2341 — that PR scored 71 deploy risk before merge."
The closed-loop system: from pre-merge to post-incident
The real value of combining Datadog and Koalr is not in the individual features — it is in the feedback loop they form together.
The loop has four stages. Before merge, Koalr scores the PR using code-level signals: change size, author expertise, review coverage, test coverage delta, file churn history, and current Datadog SLO health for the target service. This score is visible to the engineer and their reviewers before they click merge.
After deploy, Datadog's observability layer monitors the service. SLO burn rate, error rates, and latency are tracked in real time. If a deployment causes a degradation, Datadog fires the appropriate alert based on the burn rate threshold.
When an incident is opened in Datadog, Koalr correlates it with the preceding deployment. The PR that was deployed is identified, and its pre-merge risk score is surfaced in the incident timeline — so the on-call engineer has immediate context about whether this was a high-risk change that was flagged before merge or a low-risk change that failed unexpectedly.
The outcome is then fed back into the deploy risk model. If the PR scored 68 pre-merge and was followed by an incident, the signals that drove that score get positive reinforcement in the model's training data — they correctly predicted risk. If a low-scoring PR is followed by an incident (a false negative), the model learns that those signal weights underestimated the risk in that case. Over time, the model calibrates specifically to your codebase, your team, and your deployment patterns.
This is the feedback loop that neither tool creates on its own. Datadog observes outcomes but has no mechanism to trace them back to pre-merge signals. Koalr scores pre-merge risk but without post-deploy observability data, the model cannot learn whether its predictions were accurate. Connected together, each tool makes the other more useful.
1. Pre-merge: Koalr scores the PR 0–100 using code signals plus Datadog SLO health for the target service.
2. Post-deploy: Datadog monitors SLO burn rate, error rate, and latency, and fires alerts if thresholds are crossed.
3. Incident: Koalr surfaces the triggering PR and its pre-merge risk score in the incident timeline.
4. Feedback: the outcome feeds the deploy risk model, improving prediction accuracy for future deployments.
CI Visibility: flaky test rate as a deploy risk signal
Datadog CI Visibility captures detailed test execution data from your CI pipelines — pass, fail, and flaky status per test suite, per commit. This data source adds a signal to Koalr's deploy risk scoring that code-level metrics alone cannot provide: the flaky test rate for the repository over recent builds.
The reasoning is straightforward. A repository where 15% of test runs include at least one flaky test has degraded CI signal quality. Engineers learn to re-run CI rather than investigate failures, which means genuine test failures pass through as "just flaky". In this environment, a PR's passing CI build is a weaker quality signal than the same result in a repo with near-zero flaky test rates.
Koalr reads CI Visibility data using the GET /api/v2/ci/tests/events endpoint with a filter query such as filter[query]: @ci.pipeline.name:my-pipeline. The response includes test status per run, which Koalr aggregates into a per-repository flaky test rate over the preceding 14 days. This rate is factored into the deploy risk score: repositories with flaky test rates above 10% receive a modest score increase on all PRs, because the coverage and test signals for those PRs carry less confidence.
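The aggregation described above reduces to a flaky-run rate and a conditional score adjustment. The data shape and the size of the increase below are assumptions; the article says only "modest".

```python
# Sketch: share of recent CI runs containing at least one flaky test, and
# the score bump applied above the 10% threshold.
FLAKY_RATE_THRESHOLD = 0.10
FLAKY_SCORE_BUMP = 5  # hypothetical "modest" increase

def flaky_run_rate(runs):
    """runs: one list of per-test statuses ('pass'/'fail'/'flaky') per CI run."""
    if not runs:
        return 0.0
    return sum(1 for statuses in runs if "flaky" in statuses) / len(runs)

def adjust_for_flakiness(score, rate):
    if rate > FLAKY_RATE_THRESHOLD:
        return min(100, score + FLAKY_SCORE_BUMP)
    return score
```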
CI Visibility also surfaces the specific test suites that are flaky most frequently. Koalr exposes this breakdown in the AI Chat interface — you can ask "which test suites are flakiest in the payments service this week?" and get a ranked list with flake rates per suite. This is an actionable lead for engineering investment: fixing the flakiest test suites improves the quality of the CI signal across every PR that touches those suites going forward.
Practical implementation checklist
For teams already on Datadog who want to add Koalr to the stack, the integration path is straightforward. Start with credentials and basic connectivity, then layer in richer data sources as you validate the baseline.
1. Generate credentials: Organization Settings → API Keys → New Key (name it "Koalr"), then Organization Settings → Application Keys → New Key. Save both values immediately — the API key is shown only once.
2. Connect Koalr: Settings → Integrations → Datadog. Paste the DD-API-KEY and DD-APPLICATION-KEY values and select your Datadog site (US1, EU1, US3, or US5). Koalr validates credentials and discovers your service catalog.
3. Select services: from the discovered service list, select the services you want Koalr to track. DORA frequency, SLO health, and MTTR will begin populating for those services immediately.
4. Define SLOs: SLO health and burn rate signals require at least one SLO defined per service in Datadog. Service-level objectives based on error rate or latency work best for the burn rate signal.
5. Enable CI Visibility (optional): if you use Datadog CI Visibility, enable the flaky test signal in Koalr under Integration Settings → Datadog → CI Visibility and select the pipelines to monitor for flake rate.
Datadog vs. Koalr: what each tool owns
It is worth being explicit about the division of responsibility, because the tools overlap in surface area even when they do not overlap in function.
Datadog owns post-deploy observability. SLOs, error rates, latency percentiles, infrastructure metrics, distributed traces — these are Datadog's domain. Koalr does not replicate these. Koalr reads them as inputs for risk scoring and correlation, but if you want to investigate what is happening in production right now, that investigation happens in Datadog.
Koalr owns pre-merge risk prediction and engineering process metrics. Deploy risk scores, PR cycle time, review coverage, author expertise, CODEOWNERS enforcement, coverage trends, and the AI chat interface against live engineering data — these are Koalr's domain. Datadog does not replicate these. Datadog does not know what is in a PR diff or who reviewed it.
The DORA metrics themselves sit in the overlap. Both tools can compute deployment frequency, lead time, change failure rate, and MTTR. In practice, Koalr's DORA metrics are richer in the dimensions that involve code and people (review coverage, author expertise, PR-level attribution); Datadog's DORA metrics are richer in the dimensions that involve production systems (SLO-based CFR calculation, incident-sourced MTTR). A combined deployment gives you both.
A note on the feedback loop and ML model improvement
The deploy risk model improves over time specifically because of the Datadog integration. Without post-deploy outcome data, the model can score PRs based on historical signal distributions — but it cannot learn whether those scores were calibrated correctly for your specific services and team.
With Datadog incident and SLO data flowing in, Koalr can answer the retrospective question: "of the PRs we scored above 70 last quarter, what fraction was followed by an SLO breach or incident within 4 hours of deploy?" If the answer is 40%, the model's precision at that threshold is 40% — meaning 60% of high-score deployments were false positives. The model can then recalibrate, either by adjusting the weights of the signals that contributed to those false positives or by adjusting the score thresholds for your team's specific risk profile.
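The retrospective precision question works out as a simple ratio. The history data below is illustrative, not real outcomes.

```python
# Worked sketch: of PRs scored above a threshold, what fraction was
# followed by an incident within the outcome window?
def precision_at_threshold(scored_prs, threshold=70):
    """scored_prs: list of (pre_merge_score, incident_within_4h) pairs."""
    flagged = [hit for score, hit in scored_prs if score > threshold]
    return sum(flagged) / len(flagged) if flagged else 0.0

history = [(82, True), (75, False), (90, True), (71, False), (88, False),
           (40, False), (55, True)]
print(precision_at_threshold(history))  # 0.4
```

Of the five PRs scoring above 70 here, two were followed by incidents, so precision at that threshold is 40% — and the remaining 60% are the false positives the model would recalibrate against.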
This is the property that makes the Datadog + Koalr combination more than the sum of its parts. Each tool generates data the other cannot generate, and the combination creates a feedback loop that neither can create alone. The result is a deploy risk model that gets more accurate with every deployment — learning continuously from the outcomes that Datadog observes.
Connect Datadog to Koalr
Add Datadog observability to your deploy risk scoring. SLO burn rate becomes a live input to pre-merge risk prediction. Incidents correlate back to triggering PRs. The feedback loop that improves prediction accuracy over time starts with the first connected service.