What is GitOps?

GitOps is an operational model in which git is the single source of truth for both application code and infrastructure state. Every change to a running system — whether that is a new container image tag, a Kubernetes manifest update, or a Helm values file change — is expressed as a git commit before it is applied anywhere. The git repository is not just a history of changes; it is the authoritative specification of what should be running right now.

The enforcement mechanism is a GitOps operator running inside the cluster. ArgoCD and Flux CD are the two dominant implementations. The operator continuously watches the git repository (or a specific branch or directory within it) and compares the desired state expressed in git against the actual state of the cluster. When it detects a divergence — either because a new commit has been pushed to git, or because the cluster state has drifted from what git specifies — the operator reconciles by applying the changes needed to bring actual state into alignment with desired state.
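To make the comparison step concrete, here is a minimal sketch of a reconcile plan in Python. It is not how ArgoCD or Flux are implemented. The resource keys (such as "Deployment/api"), the dictionary shapes, and the function names are assumptions chosen for illustration. Pruning is also typically opt-in in real operators.

```python
from typing import Any


def plan_reconcile(desired: dict[str, Any], actual: dict[str, Any]) -> dict[str, list[str]]:
    """Compare desired state (from git) with actual state (from the cluster).

    Keys are hypothetical resource identifiers; values are any comparable
    representation of the resource spec.
    """
    return {
        # In git but missing from the cluster.
        "create": sorted(k for k in desired if k not in actual),
        # Present in both, but the cluster differs (new commit or drift).
        "update": sorted(k for k in desired if k in actual and desired[k] != actual[k]),
        # In the cluster but not in git (only removed if pruning is enabled).
        "prune": sorted(k for k in actual if k not in desired),
    }


def in_sync(plan: dict[str, list[str]]) -> bool:
    """True when nothing needs to be created, updated, or pruned."""
    return not any(plan.values())
```

The same diff fires whether git moved (a new commit) or the cluster moved (drift). This is why both cases end in a reconciliation.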

This model has several consequences that matter for engineering teams and for DORA measurement. First, there is a complete audit trail of every production change in git history: immutable, signed (if you use commit signing), and attributable to a specific author. Second, rollback is just a git revert: a new commit that restores the previous state, which the operator then applies automatically. Third, no one runs manual kubectl commands against production (or if they do, an operator with automatic sync and self-healing enabled reverts them to the git-specified state). Fourth, and most consequentially for metrics, the deployment event is the operator sync, not the CI pipeline completion.

How GitOps changes the DORA definition of a deployment

The DORA research framework defines deployment frequency as the number of times code is deployed to production per day, per week, or per month. In a traditional CI/CD pipeline, a merge to the main branch triggers a build, which triggers a deploy pipeline, which pushes directly to production. Merge and deploy are tightly coupled — typically separated by minutes. Counting PR merges as deployments is a reasonable approximation.

GitOps decouples them. A developer merges code to the application repository. A CI pipeline builds a container image and pushes it to a registry. The image tag is updated in the GitOps repository — either by the CI pipeline writing to the GitOps repo, or by an image update automation tool like ArgoCD Image Updater or Flux Image Automation Controller. Then the GitOps operator detects the change in the GitOps repo and syncs the new state to the cluster.

Each of those steps introduces latency and a potential failure point. The CI build might take 10 minutes. The GitOps repo update might be manual, adding hours of delay. The operator sync might fail because of a policy violation, a misconfigured resource, or a health check timeout. If you count PR merges to the application repository as deployments, you are counting events that may not have reached production — and missing the timing information that makes lead time meaningful.

The correct definition in GitOps: a deployment is a commit to the GitOps repository (the branch and path that the operator watches for a given environment) that is successfully reconciled by the operator. Not the application code commit. Not the CI pipeline push. Not the image registry push. The operator sync completion is the deployment event.

Deployment frequency in GitOps

Deployment frequency in a GitOps environment is the count of successful operator reconciliations per service per day (or week, or month) on the production environment. There are two ways to count this, depending on your observability setup.

The most direct approach is to count commits to the deploy branch. If your GitOps workflow uses a dedicated branch per environment — for example, main for production and staging for staging — then count commits to the production branch per service directory per day. Each commit to the production branch represents a change that the operator will apply. This approach works without any operator API access, using only git history.

The more accurate approach is to count successful ArgoCD sync events or successful Flux reconciliation events. This is preferable because it captures the actual operator behavior rather than intent: a commit pushed to the deploy branch that never successfully reconciles (due to a sync failure, a suspended application, or a health check that keeps failing) is not a deployment. Counting operator sync completions gives you the count of changes that actually reached production.

For ArgoCD: count Sync events with type: Normal on production-labeled applications. For Flux: count ReconciliationSucceeded events from the Kustomization or HelmRelease controller, filtered to production namespace objects. In both cases, scope to production environments only — staging reconciliations are not DORA events.
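As a rough sketch of the counting logic, assume operator events have already been normalized into dictionaries with the hypothetical keys `app`, `environment`, `succeeded`, and `finished_at` (an ISO 8601 timestamp with an explicit UTC offset). Neither ArgoCD nor Flux emits this shape directly. It stands in for whatever your ingestion layer produces.

```python
from collections import Counter
from datetime import datetime


def deployments_per_day(events: list[dict]) -> dict[tuple[str, str], int]:
    """Count successful production syncs per (app, calendar day).

    The event keys used here are assumptions, not an operator API.
    """
    counts: Counter = Counter()
    for event in events:
        # Staging reconciliations and failed syncs are not deployments.
        if event["environment"] != "production" or not event["succeeded"]:
            continue
        day = datetime.fromisoformat(event["finished_at"]).date().isoformat()
        counts[(event["app"], day)] += 1
    return dict(counts)
```

Keying on the sync completion time rather than the commit time keeps the count aligned with when the change actually reached production.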

Lead time for changes in GitOps

Lead time for changes is defined by DORA as the time from code commit to code running in production. In GitOps, there is a new step at the end of the pipeline that does not exist in traditional CI/CD: the infrastructure reconciliation phase.

The full lead time path in a GitOps workflow looks like this: a developer authors a feature on a feature branch (commit time is the start of the lead time clock); the feature branch is merged to the application repository main branch after code review; the CI pipeline builds the container image; the image tag is updated in the GitOps repository (either automatically or manually); the operator detects the GitOps repo change; the operator syncs the new state to the production cluster; health checks pass and the deployment is confirmed. The end of the lead time clock is when the operator confirms the sync is complete and the application is healthy.

The reconciliation time — from when the commit lands in the GitOps repository to when the operator finishes syncing it to the cluster — is a new pipeline stage that does not exist in traditional CD and is frequently overlooked in DORA measurement. For teams with auto-sync enabled and no manual approval gates, reconciliation typically takes 2 to 10 minutes. For teams with manual sync policies or approval gates configured in ArgoCD or Flux, reconciliation time can be hours or days.

Teams that measure lead time as commit-to-PR-merge are excluding everything after the merge, including the entire reconciliation phase. If your reconciliation is fast and automatic, the error is small. If your team has manual sync gates, approval requirements, or periodic sync schedules (Flux reconciles on a configured interval, commonly 10 minutes), the error is large. The only accurate measurement is commit to operator sync completion, using the commit SHA from the sync event payload to trace back through git history to the first commit in the change set.
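The size of that error is easier to see with a stage breakdown. The sketch below assumes you have already resolved four timestamps per deployment. The key names (`first_commit_at`, `merged_at`, `gitops_commit_at`, `sync_completed_at`) and the stage labels are hypothetical.

```python
from datetime import datetime, timedelta

# Ordered checkpoints along the lead time path (names are assumptions).
CHECKPOINTS = ["first_commit_at", "merged_at", "gitops_commit_at", "sync_completed_at"]
STAGES = ["review", "build_and_promote", "reconcile"]


def stage_durations(deployment: dict[str, str]) -> dict[str, timedelta]:
    """Split lead time into stages, plus the end-to-end total."""
    times = [datetime.fromisoformat(deployment[key]) for key in CHECKPOINTS]
    # Pair each checkpoint with the next one to get per-stage durations.
    durations = {
        stage: end - start
        for stage, start, end in zip(STAGES, times, times[1:])
    }
    durations["total"] = times[-1] - times[0]
    return durations
```

For a change committed at 09:00, merged at 11:00, promoted at 11:15, and synced at 14:15 behind a manual gate, commit-to-merge reports 2 hours while the actual lead time is 5 hours 15 minutes.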

Change failure rate in GitOps

Change failure rate is the percentage of deployments that cause a degradation requiring remediation. In GitOps, there are three distinct failure patterns that all contribute to CFR, and standard tooling catches only one of them.

Sync failures. The operator attempts to reconcile a change and the sync operation fails — a manifest has a schema validation error, a resource exceeds admission webhook policy limits, a dependency is missing, or the target namespace does not exist. The operator logs a failed sync event. No change reaches the cluster. This is a change failure: an attempted deployment that did not succeed. Count these in CFR even though they leave production unchanged — the change was intended and it failed.

Sync-then-degrade. The sync operation completes successfully at the operator level — all manifests are applied, the sync phase reports Succeeded — but the application health status degrades within minutes. The pod fails its readiness probe, the deployment rollout stalls because the new image cannot be pulled, or the service crashes on startup. ArgoCD reports this as a ResourceHealthDegraded event shortly after a successful Sync event. Flux surfaces it as a ReconciliationFailed event when health checks fail post-reconciliation. Koalr uses a 5-minute correlation window: a health degradation within 5 minutes of a successful sync on the same application is classified as a change failure.

Incident correlation. A PagerDuty or Opsgenie incident fires within 30 minutes of a successful sync. This is the most visible failure pattern — it generates an alert — but it is also the most likely to be missed by GitOps-specific tooling that only watches operator events. Correlating incident timestamps with sync timestamps gives you the subset of incidents that were deployment-triggered, which is the correct CFR numerator.

Most teams measure CFR only from incident tickets, missing the first two patterns entirely. Sync failures that self-resolve (the developer fixes the manifest and pushes again) and sync-then-degrade events that recover through Kubernetes self-healing (for example, a crashing container that the kubelet restarts until it becomes ready) never create incidents. They are invisible to incident-based CFR measurement but represent real deployment failures that should count toward CFR.
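The three patterns can be combined into a single classifier. This is a sketch under stated assumptions, not Koalr's implementation. The event dictionaries and their keys (`app`, `status`, `finished_at`, `at`) are hypothetical, incidents are assumed to be already mapped to an application, and the denominator is every sync attempt, including failed ones.

```python
from datetime import datetime, timedelta

DEGRADE_WINDOW = timedelta(minutes=5)    # sync-then-degrade window from the text
INCIDENT_WINDOW = timedelta(minutes=30)  # incident correlation window from the text


def _times_for(app: str, events: list[dict]) -> list[datetime]:
    return [datetime.fromisoformat(e["at"]) for e in events if e["app"] == app]


def _any_within(start: datetime, times: list[datetime], window: timedelta) -> bool:
    return any(start <= t <= start + window for t in times)


def classify(sync: dict, degradations: list[dict], incidents: list[dict]) -> str:
    """Label one sync attempt with the first failure pattern that matches."""
    if sync["status"] != "Succeeded":
        return "sync_failure"
    done = datetime.fromisoformat(sync["finished_at"])
    if _any_within(done, _times_for(sync["app"], degradations), DEGRADE_WINDOW):
        return "sync_then_degrade"
    if _any_within(done, _times_for(sync["app"], incidents), INCIDENT_WINDOW):
        return "incident"
    return "ok"


def change_failure_rate(syncs: list[dict], degradations: list[dict], incidents: list[dict]) -> float:
    """Failed attempts divided by all attempts; 0.0 when there were none."""
    if not syncs:
        return 0.0
    failed = sum(classify(s, degradations, incidents) != "ok" for s in syncs)
    return failed / len(syncs)
```

An incident-only measurement over the same data would see just the third label, which is how the first two patterns disappear from the metric.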

MTTR in GitOps

Mean time to recovery measures how long it takes to restore service after a deployment-caused failure. GitOps changes MTTR in one important way: the canonical remediation path is a git operation, not a console command or emergency kubectl.

When a GitOps deployment causes an incident, the remediation path is: create a revert commit in the GitOps repository (reverting the problematic commit), push it to the production branch, and wait for the operator to reconcile. The operator applies the revert, which restores the previous state. Health checks pass. The incident is resolved. The revert commit is the fix — it is version-controlled, reviewable, and attributed to whoever executed the rollback.

This is structurally faster than traditional rollback approaches because there is no deploy pipeline to wait for. A git revert and push takes seconds. The operator reconciliation typically completes in under 5 minutes. Total rollback time from decision to production restoration is often under 10 minutes, compared to re-triggering a deploy pipeline and waiting for image builds and pipeline stages.

For MTTR measurement: the clock starts at incident detection (first PagerDuty or Opsgenie alert, or first health degradation event from the operator) and ends when the operator confirms the application has returned to a healthy state after the remediation sync. Both events are available from the operator event stream. Koalr measures MTTR as the duration from the ResourceHealthDegraded event timestamp to the subsequent Healthy status transition timestamp, anchored to the production environment sync history.
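A minimal sketch of that measurement, assuming the health transitions for one application are available as chronological `(timestamp, status)` pairs. The pair shape and function names are assumptions. The `Degraded` and `Healthy` values follow ArgoCD's health status names.

```python
from datetime import datetime, timedelta
from typing import Optional


def recovery_durations(transitions: list[tuple[str, str]]) -> list[timedelta]:
    """Duration of each Degraded -> Healthy episode, in order."""
    durations = []
    degraded_at = None
    for timestamp, status in transitions:
        when = datetime.fromisoformat(timestamp)
        if status == "Degraded" and degraded_at is None:
            # Start the clock at the first degradation of the episode.
            degraded_at = when
        elif status == "Healthy" and degraded_at is not None:
            durations.append(when - degraded_at)
            degraded_at = None
    return durations


def mean_time_to_recovery(transitions: list[tuple[str, str]]) -> Optional[timedelta]:
    """Mean of the recovery durations, or None if nothing recovered yet."""
    durations = recovery_durations(transitions)
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)
```

Repeated `Degraded` events inside one episode do not restart the clock, so a flapping application is measured from its first failure.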

The GitOps audit trail advantage

One of the most underappreciated benefits of GitOps for engineering teams is the audit trail it creates automatically. In a traditional deployment model, a production change might be attributable to a CI/CD pipeline run, a manual kubectl command, a Helm release, or a direct API call — and correlating which human made which change when requires checking multiple systems. Access logs, CI run history, kubectl audit logs, and Helm release history all need to be queried and correlated.

In GitOps, every production change is a git commit. Git commits are immutable, ordered by their parent links, and attributed to an author. If you use signed commits (GPG or SSH signing), they are also cryptographically verifiable, so no one can claim a commit was made by a different author after the fact. If you require pull request review for all changes to the production branch, the review history is also preserved alongside the commit.

This has direct compliance implications. SOC 2, ISO 27001, and most enterprise security frameworks require evidence of change management controls: who approved a production change, when it was applied, and what exactly changed. In a GitOps workflow, this evidence exists in git history by construction. A compliance audit query — show me every production change in the last 90 days with author and approval — is a git log query. Compare that to assembling the same evidence from CI logs, kubectl audit logs, and ticketing systems in a traditional deployment model.
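As an illustration of the audit query, the sketch below builds a `git log` command for a production path and parses its output. The function names and the `environments/production/` path are assumptions. Approver identity is not stored in a plain commit, so in practice it would come from the merge commit message or from your git hosting platform's review records.

```python
def audit_log_command(path: str, days: int = 90) -> list[str]:
    """Arguments for a git log covering the last `days` days of `path`."""
    return [
        "git", "log", "--first-parent",
        f"--since={days} days ago",
        # Full hash, author name, ISO 8601 committer date, subject (tab-separated).
        "--format=%H%x09%an%x09%cI%x09%s",
        "--", path,
    ]


def parse_audit_log(output: str) -> list[dict[str, str]]:
    """Turn the tab-separated git log output into one record per commit."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue
        # Split at most 3 times so tabs inside the subject are preserved.
        sha, author, committed_at, subject = line.split("\t", 3)
        rows.append({"sha": sha, "author": author, "committed_at": committed_at, "subject": subject})
    return rows
```

The command list can be passed to `subprocess.run(..., capture_output=True, text=True)` inside a clone of the GitOps repository, and the captured standard output fed to `parse_audit_log`.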

For DORA metrics, the audit trail means every deployment has a deterministic author, timestamp, and change set. This is what makes accurate lead time calculation possible: the commit SHA in the operator sync event traces directly back to the author and the first commit in the change set, giving you the full journey from code authored to code running in production.

Drift detection and change failure rate

GitOps operators do more than apply changes from git — they continuously watch the cluster for changes that were not made through git. When the actual cluster state diverges from the desired state in git, that divergence is called drift.

Drift happens when someone runs a manual kubectl command against a production cluster — applying a quick fix during an incident, scaling a deployment by hand to handle a traffic spike, or updating a ConfigMap directly to test a configuration change. In each case, the actual cluster state no longer matches what git specifies. ArgoCD will flag the application as OutOfSync. Flux will report a drift condition.

Drift is a change failure rate risk for two reasons. First, the manual change bypasses the review and approval controls that protect the production branch. A change that would have been blocked by a required reviewer or a policy check in a PR can be applied directly to the cluster without any of those controls. Second, the operator will eventually reconcile the drift back to the git-specified state — which means the manual change will be overwritten. If the manual change was a deliberate fix, it gets reverted the next time the operator syncs, potentially causing another incident.

For DORA instrumentation, treat drift events as untracked changes and include them in CFR analysis. An application that shows frequent drift is one where production state is regularly diverging from the intended state — a signal of process breakdown that standard deployment frequency counts will not surface. Koalr alerts on drift events from the ArgoCD OutOfSync status and the Flux drift condition, and correlates them with incident timelines to determine whether a drift event contributed to a subsequent incident.

Drift = untracked change = CFR risk

Every drift event represents a change to production that bypassed the GitOps workflow. Even if the change was made with good intentions — a manual fix during an incident — it is an unreviewed, unattributed modification that the operator will eventually overwrite. Track drift events separately from sync-based CFR and alert engineering leads when drift frequency is rising.
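One simple way to decide whether drift frequency is rising is to bucket drift events by ISO week and compare the latest week with the average of the earlier ones. This is only a sketch. The input is assumed to be a list of ISO 8601 timestamps for one application, and the "rising" rule is an arbitrary illustrative threshold, not a recommendation.

```python
from collections import Counter
from datetime import datetime


def weekly_drift_counts(timestamps: list[str]) -> dict[tuple[int, int], int]:
    """Count drift events per (ISO year, ISO week), in chronological order."""
    counts: Counter = Counter()
    for timestamp in timestamps:
        iso = datetime.fromisoformat(timestamp).isocalendar()
        counts[(iso[0], iso[1])] += 1
    return dict(sorted(counts.items()))


def drift_is_rising(weekly_counts: dict[tuple[int, int], int]) -> bool:
    """True when the latest week exceeds the mean of the earlier weeks.

    Weeks with zero drift are absent from the input, so the baseline is
    the mean over weeks that had at least one drift event.
    """
    values = list(weekly_counts.values())
    if len(values) < 2:
        return False
    *previous, latest = values
    return latest > sum(previous) / len(previous)
```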

Multi-environment GitOps DORA

Most GitOps workflows involve multiple environments: development, staging, and production at minimum. How you structure environments in your GitOps repository directly affects how you should instrument DORA metrics.

The two common patterns are separate branches per environment and separate directories per environment (within a single branch, often called the environment overlay pattern). In the branch-per-environment pattern, the production branch holds the production desired state; the staging branch holds the staging desired state. In the directory-per-environment pattern, a single main branch has subdirectories like environments/production/ and environments/staging/.

For DORA metrics, the rule is simple: count deployments per environment separately, and only production deployments count toward DORA metrics. Staging reconciliations are not DORA events. They are useful for measuring change failure rate in non-production environments (as a leading indicator of production CFR), but they should not be mixed with production deployment counts.

A consequence of this rule: a change that is deployed to staging but never reaches production is not a deployment. This matters for deployment frequency: teams that promote changes slowly through multiple environment gates often have dramatically lower production deployment frequency than their staging frequency suggests. The gap between staging deployment frequency and production deployment frequency is itself a useful metric — it surfaces how much change is stuck in environment queues rather than reaching users.
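The gap can be computed from the same normalized events used for deployment frequency. As before, the event keys (`app`, `environment`, `succeeded`) and the output field names are hypothetical.

```python
from collections import Counter
from typing import Optional


def promotion_gap(events: list[dict], app: str) -> dict[str, Optional[float]]:
    """Compare successful staging and production syncs for one app."""
    counts = Counter(
        e["environment"] for e in events if e["app"] == app and e["succeeded"]
    )
    staging, production = counts["staging"], counts["production"]
    return {
        "staging": staging,
        "production": production,
        # Staging deployments with no matching production deployment.
        "gap": max(staging - production, 0),
        # Share of staging deployments matched by a production one.
        "promotion_ratio": production / staging if staging else None,
    }
```

A ratio well below 1.0 over a long window suggests changes are queuing between environments rather than reaching users.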

For multi-environment instrumentation in Koalr: tag each ArgoCD application or Flux Kustomization with its environment label. Production-labeled objects feed DORA metrics. Staging-labeled objects are available in environment-specific views but do not affect the primary DORA dashboard.

ArgoCD vs Flux for DORA instrumentation

ArgoCD and Flux CD are both mature, production-ready GitOps operators, but they expose different APIs and emit different event structures. The DORA instrumentation approach differs between them.

Dimension	ArgoCD	Flux CD
Deployment event source	Application events API `/api/v1/applications/events`	Kubernetes events on Kustomization and HelmRelease objects
Success event	`reason: Sync, type: Normal`	`ReconciliationSucceeded`
Failure event	`reason: Sync, type: Warning` or `ResourceHealthDegraded`	`ReconciliationFailed` or `HealthCheckFailed`
Commit SHA in event	`app.status.sync.revision`	`metadata.annotations["reconciler.fluxcd.io/revision"]`
Drift detection	OutOfSync status on Application object; configurable auto-sync	Drift detection built-in; always auto-reconciles by default
Webhook support	Outbound webhooks via ArgoCD notifications; configurable per event type	Flux notification controller; alert objects per event source
Historical query	REST API with event pagination; 90-day default retention	Kubernetes events API; default retention 1 hour unless extended
UI for manual sync control	Full ArgoCD UI with sync history	CLI-primary; Weave GitOps for UI

The most significant practical difference for DORA instrumentation is historical event retention. ArgoCD stores application events for 90 days by default and exposes them through a REST API with pagination. Flux stores Kubernetes events, which default to a 1-hour retention window in most cluster configurations (controlled by the --event-ttl flag on the kube-apiserver). If you want historical Flux event data for DORA trend analysis, you need to ship events to an external store, either by configuring the Flux notification controller to emit events to a webhook or by running a log aggregation pipeline that captures Kubernetes events before they expire.

ArgoCD is therefore easier to instrument for DORA from a cold start: the REST API provides up to 90 days of history on first connection, enabling immediate trend analysis without waiting for data to accumulate. Flux instrumentation requires either an external event store or an acceptance that trend data will only be available from the date of Koalr connection forward.

How Koalr instruments GitOps DORA metrics

Koalr supports both ArgoCD and Flux CD as GitOps integration sources. The ArgoCD integration reads sync events via the ArgoCD REST API and supports outbound webhook delivery for real-time updates. The instrumentation process has four stages common to both operators.

On initial connection, Koalr discovers all applications (ArgoCD) or Kustomizations and HelmReleases (Flux) and builds an inventory mapping each workload to its git repository, target environment, and production label. This inventory is used to apply environment scoping — only production-labeled workloads feed DORA metrics.

Koalr then performs a historical backfill: for ArgoCD, paginating through the full event history via the REST API; for Flux, importing any events shipped to an external webhook endpoint before connection. Historical data populates the DORA trend charts immediately rather than requiring a waiting period.

After backfill, Koalr switches to real-time event ingestion via webhook. For each deployment event (successful sync or reconciliation), Koalr extracts the commit SHA from the event payload and uses it to trace back through the connected GitHub or GitLab repository to the first commit in the change set. This commit-SHA correlation is what makes accurate lead time measurement possible in GitOps — without it, a production sync event cannot be connected to the developer who authored the code or when that code was first written.

Drift events from ArgoCD OutOfSync status transitions are tracked separately and surfaced in the CFR analysis as untracked changes. Koalr correlates drift event timestamps with incident timelines to identify drift events that preceded incidents — a pattern that indicates manual production changes are contributing to instability.

DORA Metric	GitOps Signal	Common Wrong Signal
Deployment Frequency	Successful operator sync completions on production workloads	PR merges to application main branch
Lead Time	First feature commit → operator sync completion (via commit SHA)	Commit → PR merge timestamp
Change Failure Rate	Sync failures + sync-then-degrade (5 min window) + incident correlations	PagerDuty incidents only
MTTR	Health degradation event → Healthy status after revert sync	Incident open → incident resolved

Getting started

If you are running ArgoCD, connect it to Koalr from Settings → Integrations → ArgoCD. You will need a read-only service account token with applications, list and applications, get permissions. Koalr will discover your applications, prompt you to select which projects to monitor, and start the historical backfill. Setup takes under 10 minutes and the DORA metrics dashboard updates immediately after the backfill completes.

For teams running Flux CD, Koalr ingests events via the Flux notification controller. You configure a Flux Alert object that references a Provider of type generic-hmac, with the Provider's address set to the Koalr webhook endpoint. Koalr processes ReconciliationSucceeded, ReconciliationFailed, and HealthCheckFailed events from Kustomization and HelmRelease sources.
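If you run your own receiver for these events (for example, to archive them before they expire), the HMAC signature should be verified before the payload is trusted. The sketch below assumes the signature arrives in an `X-Signature` header formatted as `sha256=<hex digest>` and computed over the raw request body with the shared secret. Check that format against the Flux notification controller documentation for your version. The function name and the allowed algorithm set are illustrative choices.

```python
import hashlib
import hmac

# Hash functions this sketch is willing to accept (an assumption).
ALLOWED_ALGORITHMS = {"sha256", "sha512"}


def verify_signature(header_value: str, body: bytes, key: bytes) -> bool:
    """Check an 'algorithm=hexdigest' signature against the raw body."""
    algorithm, separator, received_hex = header_value.partition("=")
    if not separator or algorithm not in ALLOWED_ALGORITHMS:
        return False
    expected_hex = hmac.new(key, body, getattr(hashlib, algorithm)).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected_hex.encode(), received_hex.encode())
```

Verification must run on the exact bytes received. Re-serializing the parsed JSON before hashing will usually change the digest.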

In both cases, the first step after connecting is to verify that production workloads are correctly tagged and that staging workloads are excluded from the DORA metrics scope. A misconfigured environment label is the most common source of inflated deployment frequency counts in GitOps DORA setups.

GitOps and DORA Metrics: How Git-Based Deployments Change What You Measure