Kubernetes Deployment Metrics: How to Track DORA from K8s Events
Kubernetes is now the default deployment substrate for most engineering teams — but it actively works against DORA measurement. A kubectl apply is not tied to a GitHub commit. An ArgoCD sync is not tied to a PR merge. Without deliberate instrumentation, your deployment frequency, lead time, change failure rate, and MTTR numbers are either wrong or missing entirely. This guide covers the four event sources that give you accurate Kubernetes DORA metrics and how to use each one.
K8s abstracts the deploy from the code
In a Kubernetes environment, the artifact that reaches production is a container image — not a commit or a PR. The path from commit to running pod passes through a CI pipeline, an image registry, a CD tool, and Kubernetes itself. Each hop breaks the direct commit-to-deploy correlation that simpler DORA tooling assumes. Accurate K8s DORA measurement requires capturing events at the pod or rollout level, then tracing backwards to the originating commit.
Why Kubernetes creates DORA measurement challenges
Traditional DORA tooling was built for a simpler deployment model: a merge to the main branch triggers a pipeline that pushes code directly to a server. The commit is the deployment. In this model, counting deployments means counting merges, and lead time is commit-to-merge.
Kubernetes breaks both assumptions. Between a developer pushing a commit and that code reaching a production pod, the following steps occur: a CI pipeline builds a container image and pushes it to a registry; a CD tool (ArgoCD, Flux, Spinnaker, or a plain CI step) applies a Kubernetes manifest to the cluster; the Kubernetes scheduler places pods on nodes; the kubelet pulls the image and starts containers; readiness probes pass; the Deployment controller marks the rollout complete. Any of these steps can fail independently, and none of them are visible from GitHub Events alone.
The most common result is that teams using Kubernetes measure DORA metrics from the wrong signal — PR merges to the application repository — and end up with deployment frequency counts that are inflated (counting merges that never reached production) and lead time values that are systematically too short (cutting off the measurement before the image build, registry push, and rollout phases that commonly add 15 to 60 minutes).
Change failure rate is even more problematic. A Kubernetes deployment can succeed at the manifest application level — the API server accepted the new Deployment spec — while the pods immediately enter CrashLoopBackOff. Standard CFR tools that only watch for explicit rollback events miss this entirely. And MTTR measurements that rely on incident ticket open/close times miss the many incidents that are resolved through automated Kubernetes self-healing before a human opens a ticket.
The four K8s deployment event sources for DORA
There is no single universal Kubernetes deployment event. Which event source gives you accurate DORA data depends on how your team deploys. Four sources cover the majority of Kubernetes deployment patterns.
1. GitHub Deployments API
The GitHub Deployments API is a first-party mechanism for recording that a specific commit was deployed to a specific environment. Your CD pipeline (or a post-deploy hook) posts a deployment event on each successful rollout, attaching the commit SHA, the environment name, and a status of success or failure.
This approach works with any Kubernetes tooling because the GitHub Deployment event is created by your pipeline, not by Kubernetes itself. It integrates with GitHub status checks, shows deployment history per environment in the GitHub UI, and gives DORA tools a clean, standardized event stream to consume.
The tradeoff: it requires your CD pipeline to post the event explicitly. It does not happen automatically. Teams often implement this as the last step in a GitHub Actions workflow or Tekton pipeline, after confirming the Kubernetes rollout completed successfully.
Example GitHub Actions steps to post a deployment event after a successful rollout (note the `id: deploy` on the first step, which the status update step references):

```yaml
- name: Create GitHub Deployment
  id: deploy
  uses: chrnorm/deployment-action@v2
  with:
    token: ${{ secrets.GITHUB_TOKEN }}
    environment: production
    ref: ${{ github.sha }}

- name: Wait for rollout
  run: kubectl rollout status deployment/api -n production --timeout=300s

- name: Update deployment status
  uses: chrnorm/deployment-status@v2
  with:
    token: ${{ secrets.GITHUB_TOKEN }}
    deployment-id: ${{ steps.deploy.outputs.deployment_id }}
    state: success
    environment-url: https://api.example.com
```

DORA tooling that reads the GitHub Deployments API sees a clean commit SHA paired with an environment and a success/failure status — exactly what is needed for deployment frequency and lead time calculation.
2. ArgoCD sync events
For teams using ArgoCD, the sync event is the deployment event. Each time ArgoCD successfully syncs an Application to the cluster, it emits a Kubernetes event with reason: Sync and type: Normal. A failed sync emits type: Warning.
ArgoCD sync events are the most precise Kubernetes deployment signal available for GitOps teams because they reflect the actual state of the cluster — not what was pushed to Git, and not what the CI pipeline attempted. A sync event only fires when Kubernetes has accepted the new manifests and the ArgoCD operation completed.
The Application CRD carries everything needed for DORA measurement: app.status.sync.revision contains the Git commit SHA that was synced; app.status.operationState.phase indicates Succeeded or Failed; app.status.health.status indicates whether the resulting pods are Healthy or Degraded.
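As a concrete sketch, these fields can be lifted from the Application object (as returned by `kubectl get application <name> -o json`) into a single DORA event. The function and the event field names here are illustrative, not part of any ArgoCD API:

```python
def dora_event_from_argocd_app(app):
    """Extract the DORA-relevant fields from an ArgoCD Application object."""
    status = app.get("status", {})
    op = status.get("operationState", {})
    return {
        "service": app["metadata"]["name"],
        "sha": status.get("sync", {}).get("revision"),      # Git commit that was synced
        "phase": op.get("phase"),                           # Succeeded / Failed / Error
        "health": status.get("health", {}).get("status"),   # Healthy / Degraded
        "deployed_at": op.get("finishedAt"),
    }

# Minimal Application fragment for illustration
app = {
    "metadata": {"name": "api"},
    "status": {
        "sync": {"revision": "abc123def456"},
        "operationState": {"phase": "Succeeded", "finishedAt": "2026-03-16T10:00:00Z"},
        "health": {"status": "Healthy"},
    },
}
event = dora_event_from_argocd_app(app)

# A sync counts toward deployment frequency only when both checks pass
is_successful_deploy = event["phase"] == "Succeeded" and event["health"] == "Healthy"
```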
For a deeper walkthrough of ArgoCD-specific DORA instrumentation, see the ArgoCD DORA metrics guide.
3. Flux GitOps reconcile events
Flux uses two primary custom resources for deployment: Kustomization (for raw manifests or Kustomize overlays) and HelmRelease (for Helm charts). Both resources emit Kubernetes events on each reconcile cycle.
A successful Kustomization reconcile sets the resource condition Ready=True and emits a ReconciliationSucceeded event. A failed reconcile sets Ready=False and emits ReconciliationFailed. The HelmRelease resource follows the same pattern, with InstallSucceeded and UpgradeSucceeded events for new installs and upgrades respectively.
Flux's Notification Controller can route these events to external webhook endpoints, making it straightforward to push reconcile events to a DORA metrics platform in real time.
4. Custom deployment webhook
For teams not using ArgoCD or Flux — running plain Helm, Kustomize via CI, or a bespoke deployment tool — the most reliable approach is emitting a structured webhook event on each rollout completion. This can be triggered from a Kubernetes event watcher that monitors Deployment objects for rollout completion, or as an explicit step at the end of any CD pipeline.
The webhook approach is the most portable: it works with any Kubernetes tooling and does not depend on a specific GitOps controller being present. The tradeoff is that you own the instrumentation code.
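A minimal sketch of the pipeline-step variant in Python, assuming a hypothetical webhook endpoint; the payload fields mirror the deployment event schema used throughout this guide:

```python
import json
import subprocess
import urllib.request

def wait_for_rollout(deployment, namespace, timeout="300s"):
    """Block until the rollout completes; kubectl rollout status exits 0 on success."""
    result = subprocess.run(
        ["kubectl", "rollout", "status", f"deployment/{deployment}",
         "-n", namespace, f"--timeout={timeout}"],
    )
    return result.returncode == 0

def build_deployment_event(service, environment, sha, image_tag, deployed_at):
    """Assemble the structured deployment event payload."""
    return {
        "service": service,
        "environment": environment,
        "sha": sha,
        "image_tag": image_tag,
        "deployed_at": deployed_at,
        "deployed_by": "ci-pipeline",
    }

def post_event(event, endpoint):
    """POST the event to the metrics platform (endpoint is a placeholder)."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)

event = build_deployment_event(
    "api", "production", "abc123def456", "sha-abc123def456", "2026-03-16T10:00:00Z"
)
# In a real pipeline: if wait_for_rollout("api", "production"): post_event(event, url)
```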
Kubernetes deployment event schema
Regardless of which event source you use, the deployment event payload should capture a consistent set of fields to support all four DORA metrics. A complete deployment event looks like this:
```json
{
  "service": "api",
  "environment": "production",
  "sha": "abc123def456",
  "image_tag": "v1.2.3",
  "deployed_at": "2026-03-16T10:00:00Z",
  "deployed_by": "argocd-sync",
  "rollout_duration_seconds": 145,
  "replicas_updated": 5,
  "previous_sha": "def456abc789"
}
```

Each field serves a specific purpose. sha and previous_sha enable lead time calculation — the commit timestamp of sha is the end of the lead time window, and previous_sha lets you determine which commits are new in this deployment. rollout_duration_seconds is the time the Kubernetes Deployment rollout took to complete, a useful operational metric separate from DORA lead time. replicas_updated distinguishes a full rollout (all replicas on new image) from a partial rollout.
Deployment frequency from Kubernetes
Deployment frequency counts successful rollout completions per service per time period. The key word is successful — an attempted deployment that ends in rollback or a failed health check does not increment deployment frequency; it contributes to change failure rate.
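As an illustrative sketch, counting frequency from a stream of deployment events. The field names follow the event schema above, plus an assumed status field recorded by the pipeline; weekly bucketing is one reasonable choice:

```python
from collections import Counter
from datetime import datetime

def weekly_deployment_frequency(events):
    """Count successful production deployments per ISO week.
    Failed or rolled-back attempts are excluded -- they feed CFR instead."""
    counts = Counter()
    for e in events:
        if e["environment"] != "production" or e["status"] != "success":
            continue
        week = datetime.fromisoformat(e["deployed_at"]).strftime("%G-W%V")
        counts[week] += 1
    return dict(counts)

events = [
    {"environment": "production", "status": "success", "deployed_at": "2026-03-16T10:00:00"},
    {"environment": "production", "status": "failure", "deployed_at": "2026-03-16T12:00:00"},
    {"environment": "staging",    "status": "success", "deployed_at": "2026-03-17T09:00:00"},
    {"environment": "production", "status": "success", "deployed_at": "2026-03-18T15:00:00"},
]
# Only the two successful production deploys count; the failure and the
# staging deploy are filtered out.
print(weekly_deployment_frequency(events))
```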
From the command line, a successful rollout is indicated by a zero exit code from:
```bash
kubectl rollout status deployment/api -n production
```

For ArgoCD, the equivalent check is both conditions being true simultaneously:

```
Application.status.sync.status == "Synced"
Application.status.health.status == "Healthy"
```

A sync that reaches Synced but leaves the application in Degraded health is not a successful deployment — it is a change failure. Both conditions must be true.
HPA scaling events are not deployments — filter them out
Horizontal Pod Autoscaler scaling events change the number of running pods without changing the pod spec or the container image. An HPA scale-up creates new pods from the same image that is already running — it is not a deployment. Deployment frequency tracking must filter for pod spec changes only: events where .spec.template changed (i.e., a new image tag, updated environment variables, or modified resource limits). Replica count changes alone should be excluded. Most GitOps tools handle this correctly because they only create sync events when the Git source changes; raw Kubernetes event watchers need explicit filtering.
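A sketch of the filter, comparing .spec.template between the running and the incoming Deployment objects (in practice the pod-template-hash label encodes the same comparison):

```python
def is_real_deployment(old, new):
    """A change counts as a deployment only if the pod template changed.
    Replica-count-only changes (HPA scaling) are filtered out."""
    return old["spec"]["template"] != new["spec"]["template"]

running = {"spec": {"replicas": 3,
                    "template": {"spec": {"containers": [{"image": "ghcr.io/org/api:sha-abc123"}]}}}}

# HPA scale-up: replicas change, pod template identical -> not a deployment
scaled = {"spec": {"replicas": 5,
                   "template": {"spec": {"containers": [{"image": "ghcr.io/org/api:sha-abc123"}]}}}}

# New image tag -> a real deployment
rollout = {"spec": {"replicas": 3,
                    "template": {"spec": {"containers": [{"image": "ghcr.io/org/api:sha-def456"}]}}}}

print(is_real_deployment(running, scaled))   # False
print(is_real_deployment(running, rollout))  # True
```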
Lead time from Kubernetes — connecting commits to deploys
Lead time requires connecting a deployment event to the Git commit that triggered it. Kubernetes does not store this connection natively — a pod's spec contains an image reference, not a Git SHA. The connection must be recovered by parsing the image tag.
The most reliable approach is a consistent image tagging convention that embeds the Git SHA. The recommended format:
```
ghcr.io/org/api:sha-abc123def456
```

With this convention, extracting the SHA from a running pod is straightforward:

```bash
kubectl get pod api-7d8b9f-xkz2p \
  -o jsonpath='{.spec.containers[0].image}'
# → ghcr.io/org/api:sha-abc123def456
```

Parse the sha- prefix off the tag to recover the Git SHA. Then query the GitHub API for the commit timestamp at that SHA. Lead time is:

```
lead_time = deploy_timestamp - commit_timestamp
```

For multi-commit changes, use the timestamp of the earliest commit in the changeset (from the previous deployment SHA to the current one) as the start of the lead time window — this gives you the full commit-to-production duration for the longest-waiting change, which is the correct DORA definition.
Teams using semantic version tags (v1.2.3) without an embedded SHA need an additional lookup step: query the container registry for the image manifest, then look up the build that produced that image in the CI system to recover the source SHA. This works but adds pipeline complexity. SHA-tagged images are strongly recommended for any team that wants accurate lead time measurement.
Change failure rate from Kubernetes
Three Kubernetes signals contribute to change failure rate: deployment-correlated incident alerts, rollback events, and post-deploy pod health degradation.
Deployment-correlated incidents
Correlate PagerDuty or Alertmanager alerts with deployment events using a time window. An alert that fires within 30 minutes after a deployment to the same service is a strong CFR signal. Koalr uses a ±30 minute window for this correlation by default, which captures the vast majority of deployment-triggered incidents while excluding unrelated alerts.
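A sketch of the window check; the 30-minute constant mirrors the default described above:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)  # correlation window, per the default above

def correlated(alert_ts, deploy_ts, alert_service, deploy_service):
    """An alert is deployment-correlated when it targets the same service
    and fires within the window around the deployment."""
    if alert_service != deploy_service:
        return False
    delta = abs(datetime.fromisoformat(alert_ts) - datetime.fromisoformat(deploy_ts))
    return delta <= WINDOW

print(correlated("2026-03-16T10:12:00", "2026-03-16T10:00:00", "api", "api"))      # True
print(correlated("2026-03-16T11:00:00", "2026-03-16T10:00:00", "api", "api"))      # False: 60 min out
print(correlated("2026-03-16T10:12:00", "2026-03-16T10:00:00", "api", "billing"))  # False: wrong service
```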
Rollback events
A kubectl rollout undo is the strongest possible CFR signal — it means the team explicitly decided the deployment was bad enough to revert. Kubernetes records this as a new Deployment revision with the previous pod spec. Watch for Deployment ReplicaSet changes that restore an older revision (the kubernetes.io/change-cause annotation often contains "rollback" for these events).
For ArgoCD rollbacks:

```bash
argocd app rollback <app-name> <revision>
```

ArgoCD records this as a sync to a previous revision, distinguishable from a forward sync by the operation annotation.
Post-deploy pod health degradation
Two Kubernetes health signals are leading indicators of a bad deployment even before a human-visible incident occurs:
CrashLoopBackOff spikes post-deploy. When a new deployment rolls out and pods immediately enter CrashLoopBackOff, the application is failing to start. Watch for pods in this state on the Deployment selector within 5 minutes of a rollout completing.
OOMKilled events post-deploy. A surge of OOMKilled pod events after a deployment indicates the new version has a memory regression. The container exceeded its memory limit and was terminated by the kernel. This frequently does not trigger a PagerDuty alert immediately (the pod restarts, traffic is served by other replicas) but it will eventually cascade into service degradation.
Both patterns can be detected from the Kubernetes Events API:

```bash
kubectl get events -n production \
  --field-selector reason=OOMKilling \
  --sort-by='.lastTimestamp'
```

MTTR from Kubernetes
Mean time to recovery in a Kubernetes environment has a clear start and end signal that is independent of incident ticketing systems.
Incident start: the moment an alert fires in PagerDuty, Datadog, or Alertmanager — or the moment pods enter a degraded state post-deploy (CrashLoopBackOff, OOMKilled, failed readiness probes).
Recovery: the moment a rollback completes (kubectl rollout undo exits zero) or a new fix deployment reaches healthy pods. The Kubernetes recovery signal is unambiguous:

```bash
kubectl get deployment api -n production \
  -o jsonpath='{.status.availableReplicas}'
```

When availableReplicas equals spec.replicas, the service is fully recovered. MTTR is the duration from incident start to this timestamp.
Kubernetes MTTR is often shorter than incident-ticket-based MTTR because Kubernetes self-healing (automatic pod restarts, replica set rollback) resolves many incidents before a human manually closes a ticket. This is a real improvement in recovery speed — measure it.
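Putting the start and recovery signals together, MTTR over a set of incidents is a simple mean. The timestamps below are illustrative:

```python
from datetime import datetime

def mttr_seconds(incidents):
    """Mean time to recovery over (start, recovered) timestamp pairs.
    Start = alert fired or pods degraded post-deploy;
    recovered = availableReplicas == spec.replicas again."""
    durations = [
        (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds()
        for start, end in incidents
    ]
    return sum(durations) / len(durations)

incidents = [
    ("2026-03-16T10:05:00", "2026-03-16T10:17:00"),  # 12 min: rollback completed
    ("2026-03-17T14:00:00", "2026-03-17T14:08:00"),  # 8 min: self-healed pod restart
]
print(mttr_seconds(incidents) / 60)  # 10.0 minutes
```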
ArgoCD-specific DORA instrumentation — the recommended GitOps approach
For teams running ArgoCD, the recommended instrumentation uses ArgoCD Notifications to emit deployment events in real time. ArgoCD Notifications is a built-in controller that watches Application status and triggers webhooks on configurable transitions.
The three notification triggers needed for complete DORA coverage:
- on-sync-succeeded — fires when app.status.operationState.phase == Succeeded. This is the deployment frequency event.
- on-sync-failed — fires when app.status.operationState.phase == Failed. This is a change failure event.
- on-health-degraded — fires when app.status.health.status == Degraded. When this follows an on-sync-succeeded within 5 minutes, it upgrades that sync to a change failure and starts the MTTR clock.
Add the notification triggers and a Koalr webhook template to the ArgoCD Notifications ConfigMap:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  trigger.on-sync-succeeded: |
    - when: app.status.operationState.phase in ['Succeeded']
      send: [koalr-deployment-webhook]
  trigger.on-sync-failed: |
    - when: app.status.operationState.phase in ['Failed', 'Error']
      send: [koalr-deployment-webhook]
  trigger.on-health-degraded: |
    - when: app.status.health.status == 'Degraded'
      send: [koalr-deployment-webhook]
  template.koalr-deployment-webhook: |
    webhook:
      koalr:
        method: POST
        path: /api/integrations/argocd/webhook
        body: |
          {
            "app": "{{.app.metadata.name}}",
            "revision": "{{.app.status.sync.revision}}",
            "phase": "{{.app.status.operationState.phase}}",
            "health": "{{.app.status.health.status}}",
            "sync_started_at": "{{.app.status.operationState.startedAt}}",
            "sync_finished_at": "{{.app.status.operationState.finishedAt}}"
          }
```

Define the koalr webhook service in the same ConfigMap. In current ArgoCD versions, notification services live in argocd-notifications-cm, and the argocd-notifications-secret Secret holds only credentials, referenced with $-variable syntax:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  service.webhook.koalr: |
    url: https://api.koalr.com
    headers:
    - name: Authorization
      value: Bearer $koalr-api-key
---
apiVersion: v1
kind: Secret
metadata:
  name: argocd-notifications-secret
  namespace: argocd
stringData:
  koalr-api-key: <your-koalr-api-key>
```

Applications opt in to the triggers with a subscription annotation, e.g. notifications.argoproj.io/subscribe.on-sync-succeeded.koalr: "". This configuration emits a structured event to Koalr on every sync outcome and every health degradation — providing real-time deployment frequency, change failure rate, and MTTR data without any polling.
Flux-specific DORA instrumentation
Flux uses its Notification Controller to route reconcile events to external webhooks. The two resources to watch for deployment events are Kustomization and HelmRelease.
Check Kustomization status conditions to determine deployment success:

```bash
kubectl get kustomization api -n flux-system -o jsonpath='{.status.conditions}'
```

A condition with type: Ready and status: True means the reconcile succeeded. The condition message includes the revision (Git SHA or semver tag) that was applied.
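To recover the commit SHA for lead time, parse the revision string (also available in status.lastAppliedRevision). Recent Flux versions format revisions as branch@sha1:sha; older versions used branch/sha. A sketch handling both, as an assumption about those two formats:

```python
def revision_sha(revision):
    """Extract the Git SHA from a Flux revision string.
    Handles 'main@sha1:<sha>' (recent Flux) and 'main/<sha>' (older Flux)."""
    if "@sha1:" in revision:
        return revision.split("@sha1:", 1)[1]
    return revision.rsplit("/", 1)[-1]

print(revision_sha("main@sha1:abc123def456"))  # abc123def456
print(revision_sha("main/abc123def456"))       # abc123def456
```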
Configure the Flux Notification Controller to route events to Koalr. Create an Alert resource:
```yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: koalr-dora
  namespace: flux-system
spec:
  providerRef:
    name: koalr-webhook
  eventSeverity: info
  eventSources:
  - kind: Kustomization
    namespace: flux-system
    matchLabels:
      environment: production
  - kind: HelmRelease
    namespace: flux-system
    matchLabels:
      environment: production
```

Pair this with a Provider resource pointing to the Koalr webhook endpoint:

```yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: koalr-webhook
  namespace: flux-system
spec:
  type: generic
  address: https://api.koalr.com/integrations/flux/webhook
  secretRef:
    name: koalr-webhook-secret
```

Flux emits ReconciliationSucceeded and ReconciliationFailed events that map directly to successful deployment and change failure signals for DORA.
For CLI-based event monitoring during incident response:

```bash
flux get kustomizations --watch
```

Common Kubernetes DORA pitfalls
| Pitfall | What goes wrong | Fix |
|---|---|---|
| Measuring kubectl apply | Apply can succeed while pods crash — counts bad deploys as successes | Wait for rollout status exit code 0 before counting |
| Counting HPA scale events | Inflates deployment frequency with autoscaling noise | Filter for pod spec changes only; ignore replica count changes |
| Missing canary deployments | Partial rollouts are counted before they reach 100%, skewing frequency | Count as deployment only when canary weight reaches 100% |
| Mixing staging deploys into DORA | Staging deploy frequency inflates the number; staging failures inflate CFR | Scope all DORA metrics to production environment label only |
| Using semver tags only | Cannot trace image tag → Git SHA → commit timestamp for lead time | Use SHA-tagged images: sha-abc123 suffix convention |
| Relying on incident tickets | Self-healing deployments that crash and recover are invisible to ticketing systems | Add pod health event watchers for CrashLoopBackOff and OOMKilled post-deploy |
DORA metric sources by K8s tooling
| DORA Metric | ArgoCD | Flux | GitHub Deploys |
|---|---|---|---|
| Deployment Frequency | Sync + Healthy events | ReconciliationSucceeded | Deployment status success |
| Lead Time | sync.revision SHA → commit timestamp | Condition message SHA → commit timestamp | Deployment ref SHA → commit timestamp |
| Change Failure Rate | Sync failed + sync-then-degrade (5 min window) | ReconciliationFailed + health events | Deployment status failure + alerts |
| MTTR | Degraded → Healthy timestamp delta | Failed condition → Ready=True delta | Alert start → availableReplicas restored |
How Koalr handles Kubernetes DORA measurement
Koalr's ArgoCD integration reads sync events directly from the ArgoCD Application CRD and Notifications controller, correlating each sync to its commit SHA for lead time calculation. The GitHub integration reads the GitHub Deployments API for teams using that approach, and correlates deployment events with pull requests and commit history to compute full commit-to-production lead time.
Koalr applies the correct filters automatically: HPA scaling events are excluded from deployment frequency counts, staging and development environment events are separated from production DORA data, and the sync-then-degrade CFR pattern is detected using the 5-minute post-sync health window.
The result is that DORA metrics in Koalr reflect what is actually happening in your Kubernetes production environment — not what Git Events suggest happened, and not what a CI pipeline reported it attempted.
Further reading
For a deeper dive into GitOps-specific DORA measurement with ArgoCD — including the sync-then-degrade CFR pattern, multi-cluster setup, and the full ArgoCD API reference — see the ArgoCD DORA metrics guide. For a foundational overview of all four DORA metrics and how to interpret your scores, see the DORA metrics guide.
Connect your Kubernetes environment to Koalr
Koalr's ArgoCD and GitHub Deployments integrations give you accurate Kubernetes DORA metrics from day one — deployment frequency from actual rollout completions, lead time from commit SHA tracing, CFR from sync failures and post-deploy health degradation, and MTTR from pod recovery signals. Setup takes under 10 minutes.