Developer Experience Metrics: How to Measure DX Without Surveys
Developer experience surveys are valuable — but they are lagging indicators. They measure how your engineers felt last week, not what is making them productive or frustrated right now. GitHub data gives you six leading indicators of DX in real time, updated with every commit, every PR, every CI run. No survey required.
Leading vs. lagging indicators
Surveys like Axify's well-being tracker or Atlassian's TeamHealth capture how engineers feel after the fact. GitHub behavioral signals — PR cycle time, rework rate, WIP depth — capture what is happening to their experience in real time. The two approaches are complementary: metrics tell you what is happening, surveys tell you why. This post focuses on the six GitHub metrics that give you the fastest signal.
Why DX matters as a business metric
Developer experience is not a soft benefit. It is a hard business input with measurable downstream effects on attrition, output quality, and hiring velocity. Engineering leaders who treat DX as a cultural nicety rather than an operational metric are leaving money on the table.
Engineer attrition. The cost to replace a senior engineer — recruiting fees, ramp time, productivity loss during transition — runs $150,000 to $300,000 per departure. Poor DX is consistently among the top-cited reasons engineers leave in exit interviews. A team of eight senior engineers losing two per year to DX-driven attrition burns $300,000 to $600,000 annually in replacement cost alone, before you account for the institutional knowledge that walks out the door.
Output quality. Frustrated engineers cut corners. When review cycles are long, requirements are unclear, and CI pipelines are flaky, the rational individual response is to minimize the amount of time spent in friction-generating activities — which means shorter test suites, less thorough reviews, and more "ship and see" behavior. The correlation between poor DX and increased change failure rate is not anecdotal: Stripe's internal DORA correlation study found that a 30% improvement in DX metrics was associated with a 20% reduction in change failure rate. DX and CFR are not independent variables.
Hiring velocity. Glassdoor scores reflect DX. Engineering candidates research tech culture before accepting offers. A reputation for poor tooling, flaky CI, and chaotic review processes makes recruiting harder and more expensive at exactly the moment when engineering capacity is most constrained.
The business case for measuring DX is clear. The challenge is choosing metrics that are actionable, not just interesting.
The 6 GitHub-measurable DX leading indicators
Each of the following indicators is computable from data your team already generates. All of them are available through the GitHub REST API or GraphQL API with no additional instrumentation. All of them update continuously — not quarterly, not weekly, but on every relevant event.
Indicator 1: PR cycle time
PR cycle time — the elapsed time from PR creation to merge — is the most direct and actionable DX signal available from GitHub data. A PR that sits unreviewed for three days is not primarily a throughput problem. It is a DX problem. The author is context-switching to other work, losing mental context on their original change, re-reading their own diff days later to re-orient, and absorbing the cognitive cost of an interrupted flow state every time they check whether the review has come in.
The sub-metric that matters most is time to first review — the gap between PR creation and the first substantive review action (comment or approval). Total cycle time is influenced by PR complexity; time to first review is a direct measure of review culture and responsiveness.
GitHub API: the first review's submitted_at minus pull_request.created_at gives you time to first review per PR. Aggregate to team-level median to avoid individual outlier sensitivity.
Healthy range: Median time to first review under 4 hours during business hours. Above 24 hours consistently signals a review culture problem — either reviewers are over-allocated, review requests are not being routed well, or there is an implicit norm that reviewing others' work is lower priority than shipping your own.
Koalr tracks per-team median time-to-first-review with a 30-day rolling trend, surfaced on the DORA dashboard alongside deployment frequency and change failure rate.
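The aggregation step can be sketched as a small pure function. This is an illustrative sketch, not Koalr's implementation: it assumes PRs have already been fetched and reduced to two timestamps per PR (the field names here are assumptions, not the raw API shape).

```javascript
// Sketch: team-level median time-to-first-review, in hours.
// Input shape is an assumption: one record per PR carrying the PR's
// created_at and the submitted_at of its first review (or null).
function medianTimeToFirstReviewHours(prs) {
  const hours = prs
    .filter((pr) => pr.first_review_at) // skip PRs never reviewed
    .map((pr) => (new Date(pr.first_review_at) - new Date(pr.created_at)) / 36e5)
    .sort((a, b) => a - b);
  if (hours.length === 0) return null;
  const mid = Math.floor(hours.length / 2);
  // Median rather than mean, to avoid individual outlier sensitivity
  return hours.length % 2 ? hours[mid] : (hours[mid - 1] + hours[mid]) / 2;
}
```

The median is deliberate: one PR that waited a week over the holidays should not swamp a 30-day window of otherwise healthy reviews.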
Indicator 2: PR rework rate
Rework rate measures the proportion of PRs that require three or more review cycles before approval. A review cycle is: author submits, reviewer requests changes, author pushes updates, reviewer reviews again. When this loop happens three or more times on a single PR, something systemic is usually wrong: requirements were unclear before the PR was opened, the PR scope grew beyond what the original specification covered, or there is a fundamental mismatch between the author's understanding and the reviewer's expectations.
High rework rate correlates with engineer frustration more reliably than any other single GitHub metric. It is the metric most directly tied to the subjective experience of feeling like your work does not land — of doing something three times before it is accepted. That experience is a documented precursor to disengagement.
GitHub API: Count review_requested events after a previous changes_requested or approved review state. A re-request after a previous cycle marks the beginning of a new cycle. Three or more cycles on a single PR = rework.
Healthy range: Under 15% of PRs requiring three or more review cycles. Rising rework rate without a corresponding change in PR complexity almost always indicates a process breakdown — unclear specs, inconsistent reviewer standards, or scope creep.
Note: Koalr also tracks rework in the code itself, distinct from review rework. Commits that touch the same lines within a 21-day window after a PR merges are flagged as code-level rework — a signal that the original implementation had to be corrected after it shipped.
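One way to count cycles is from each PR's ordered review states. This sketch approximates a cycle boundary as each changes_requested review — matching the loop described above — and the input shape is an assumption for illustration:

```javascript
// Sketch: rework rate from per-PR review state sequences.
// Each 'changes_requested' review closes one cycle (the author must
// push updates and be re-reviewed), so cycles = changes_requested + 1.
function reviewCycles(reviewStates) {
  return reviewStates.filter((s) => s === 'changes_requested').length + 1;
}

function reworkRate(prs, cycleThreshold = 3) {
  // prs: array of review-state arrays, one per PR
  const reworked = prs.filter(
    (states) => reviewCycles(states) >= cycleThreshold
  ).length;
  return reworked / prs.length;
}
```

With two PRs — one approved after two changes_requested rounds (three cycles), one approved immediately — the rework rate is 50%, well above the 15% healthy threshold.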
Indicator 3: Late-night and weekend commit rate
A rising share of commits outside business hours — specifically after 9pm or on weekends — is a leading burnout indicator. It is important to distinguish this from flexible working hours, which produce a consistent personal pattern rather than a rising trend. A developer who habitually works from 11am to 8pm will show a consistent after-hours commit pattern; that is not a DX signal, it is a personal schedule. What matters is a team's after-hours commit share increasing month-over-month without a corresponding increase in business-hours activity.
When late-night and weekend commit rates rise, it typically means engineers are absorbing workload overruns into their personal time rather than flagging scope problems. They are doing this because the DX environment — unclear requirements, large PRs, slow reviews — is making it impossible to complete their work within normal hours. The after-hours signal is a downstream effect of upstream friction.
GitHub API: commit.committer.date — segment by hour of day and day of week. Calculate after-hours percentage as a share of total commits per team per 30-day period.
Signal threshold: A team's after-hours commit share rising more than 10% month-over-month is worth investigating — not disciplining, but understanding. Start with a conversation, not a directive.
Important nuance: international distributed teams require timezone normalization. A commit at 10pm UTC is business hours in parts of Asia-Pacific. Always normalize commit timestamps to the committer's local timezone before computing after-hours rates. Koalr infers timezone from the committer offset in the git commit object.
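The per-window calculation can be sketched as below. It assumes the Date objects have already been normalized to each committer's local timezone (e.g. from the offset in the git commit object), and the 9am–9pm business window and Saturday/Sunday weekend are assumptions to adjust per team:

```javascript
// Sketch: after-hours commit share for one team over one 30-day window.
// commitDates: Date objects already in the committer's local timezone.
function afterHoursShare(commitDates, startHour = 9, endHour = 21) {
  if (commitDates.length === 0) return 0;
  const afterHours = commitDates.filter((d) => {
    const weekend = d.getDay() === 0 || d.getDay() === 6; // Sun or Sat
    const offHours = d.getHours() < startHour || d.getHours() >= endHour;
    return weekend || offHours;
  }).length;
  return afterHours / commitDates.length;
}
```

Compare this value month-over-month per team: the rising trend is the signal, not any single window's absolute share.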
Indicator 4: Open PR age
The median age of currently-open PRs is a proxy for how blocked engineers feel on a day-to-day basis. Old PRs accumulate for identifiable reasons: waiting for a reviewer who is unavailable, blocked by a dependency that has not landed, scope that crept beyond the original intent, or — most DX-relevant — engineers who effectively abandoned a PR because the review process felt futile and moved on to something else.
Unlike cycle time, which measures completed PRs, open PR age captures the work that is stuck right now. It is a real-time snapshot of friction rather than a historical average. A team whose cycle time looks acceptable but whose median open PR age is climbing has a bifurcated process: some PRs are moving fast while others are stalled, and the stalled ones are eroding the experience of the engineers who authored them.
GitHub API: (today - pull_request.created_at) for all PRs with state: open and draft: false. Draft PRs should be excluded — they are intentionally not ready for review.
Healthy range: Median open PR age under 3 days. A median above 7 days signals systemic blockers — either review capacity is insufficient, or there are structural dependencies causing work to pile up. Koalr surfaces this as a PR age heatmap per repository, making it easy to see which repositories are accumulating the oldest open work.
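A per-repository heatmap row like the one described above can be sketched by bucketing open PRs by age. The bucket edges here (3, 7, 14 days, chosen to match the healthy-range thresholds) and the input shape are assumptions for illustration:

```javascript
// Sketch: bucket open, non-draft PRs by age for a per-repo heatmap row.
// openPrs: records from GET /repos/{owner}/{repo}/pulls?state=open.
function prAgeBuckets(openPrs, now = new Date(), edges = [3, 7, 14]) {
  const buckets = new Array(edges.length + 1).fill(0);
  for (const pr of openPrs) {
    if (pr.draft) continue; // drafts are intentionally not ready for review
    const ageDays = (now - new Date(pr.created_at)) / 86_400_000;
    const i = edges.findIndex((e) => ageDays < e);
    buckets[i === -1 ? edges.length : i] += 1;
  }
  return buckets; // [ <3d, 3–7d, 7–14d, ≥14d ]
}
```

A repository whose counts skew toward the rightmost buckets is accumulating exactly the stalled work that erodes its authors' experience.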
Indicator 5: WIP (Work in Progress) per engineer
The number of open, non-draft PRs per engineer is a direct measure of context-switching load. An engineer with four open PRs in active review is managing four simultaneous workstreams — responding to review comments on one, making changes on another, waiting on a third, and keeping context on a fourth. This is the engineering equivalent of a plate spinner: impressive until a plate drops.
High WIP is causally connected to poor quality and poor DX through the cognitive load channel. Context switching between workstreams has a documented productivity cost — the time to re-orient when returning to a paused task is non-trivial, and the quality of work done in fragmented attention is lower than work done in deep focus. The research on this is not new; what is new is that GitHub makes WIP objectively measurable where it previously required self-reporting.
GitHub API: Count open PRs per author with state: open and draft: false. Aggregate to team level to see systemic WIP accumulation rather than individual variation.
Healthy range: 1–2 open PRs per engineer. More than 3 consistently signals fragmented focus. The leading-indicator property of this metric is well-established: a WIP spike in a given sprint typically precedes a throughput cliff 1–2 sprints later, as the accumulated context-switching tax comes due.
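Counting WIP from the open-PR list is a few lines. This sketch mirrors the REST API field names (pr.user.login, pr.draft); the WIP limit default of 3 is an assumption taken from the healthy range above:

```javascript
// Sketch: open non-draft PR count per author.
function wipPerEngineer(openPrs) {
  const counts = new Map();
  for (const pr of openPrs) {
    if (pr.draft) continue; // drafts are not yet in active review
    const login = pr.user.login;
    counts.set(login, (counts.get(login) ?? 0) + 1);
  }
  return counts;
}

// How many engineers are above the WIP limit right now?
function engineersOverWipLimit(openPrs, limit = 3) {
  return [...wipPerEngineer(openPrs).values()].filter((n) => n > limit).length;
}
```

The team-level view is the second function: a count of engineers over the limit trending upward sprint-over-sprint is the spike that precedes the throughput cliff.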
Indicator 6: CI failure rate and flakiness
Flaky tests are one of the most consistently cited DX pain points in engineering surveys — and one of the most directly addressable with data. A CI pipeline that fails 15% of the time on code that has not changed is not a quality signal; it is infrastructure noise. But it is noise that engineers have to respond to: they investigate the failure, determine it is a flake, re-run the build, wait again, and eventually merge. Every flake costs 10–20 minutes of an engineer's time and erodes trust in the CI system as a meaningful signal.
The distinction that matters here is between two types of CI failure: PR-driven failures (the CI correctly identifies a problem the engineer introduced, a valid quality signal) and infrastructure-driven flakiness (random failures on green code, a DX tax). Tracking both provides the full picture.
GitHub API: check_run.conclusion values — segment success, failure, and timed_out by repository and by whether the subsequent re-run of the same commit passed (indicating a flake rather than a genuine failure). A check run that fails and then passes on retry without a code change is definitionally a flake.
Healthy range: CI failure rate from infrastructure causes under 5%. Above 15% consistently signals a test reliability problem that is actively degrading DX. Koalr tracks CI flakiness as Signal #28 in the deploy risk model — but also surfaces it independently as a DX metric on the Focus dashboard, since its primary business impact is on engineer time and experience rather than deploy risk alone.
Example: Computing CI flakiness rate from GitHub Check Runs API
```javascript
// GitHub Check Runs API — flakiness detection
// GET /repos/{owner}/{repo}/commits/{ref}/check-runs?filter=all
// (filter: 'all' includes re-runs; the default 'latest' keeps only the
// most recent run per check name, which hides flakes)
const { data } = await octokit.checks.listForRef({ owner, repo, ref, filter: 'all' });

// A check name that both failed and passed on the same commit —
// a re-run with no code change — is definitionally a flake
const conclusionsByCheck = new Map();
for (const run of data.check_runs) {
  const seen = conclusionsByCheck.get(run.name) ?? new Set();
  conclusionsByCheck.set(run.name, seen.add(run.conclusion));
}
const flakes = [...conclusionsByCheck.values()]
  .filter((c) => c.has('failure') && c.has('success'));
const flakinessRate = flakes.length / conclusionsByCheck.size;

// Flag repositories with flakiness rate > 0.15 (15%)
if (flakinessRate > 0.15) {
  flagForDXReview(repo, flakinessRate);
}
```

The SPACE framework alignment
SPACE — Satisfaction, Performance, Activity, Communication, Efficiency — is the academic DX framework published by Microsoft Research (Forsgren, Storey, Maddila, et al., 2021). It is the most rigorous published framework for thinking about developer productivity across multiple dimensions simultaneously. The six GitHub indicators above map directly onto the SPACE dimensions, which is a useful validation that they are measuring the right things.
| GitHub Indicator | SPACE Dimension | What it measures |
|---|---|---|
| PR cycle time | Efficiency | How quickly work flows through the review process |
| PR rework rate | Performance | Whether work is landing correctly the first time |
| Late-night commit rate | Satisfaction (proxy) | Whether engineers are absorbing overruns into personal time |
| Open PR age | Efficiency | How much work is stuck and blocking engineers right now |
| WIP per engineer | Activity | Context-switching load and focus fragmentation |
| CI failure rate | Efficiency | Infrastructure friction interrupting engineering flow |
The dimension surveys capture most directly is Satisfaction — how engineers feel about their work, tools, team, and environment. GitHub behavioral signals capture the other four dimensions objectively without requiring self-reporting. This is the key insight behind behavioral DX measurement: you do not need to ask engineers whether they are satisfied if you can observe whether their work is flowing, their output is landing, and their attention is focused.
How to read DX signals without surveilling individuals
The surveillance concern is real and the distinction between team-level DX measurement and individual monitoring matters enormously — both ethically and practically. DX metrics used to rank individuals, set performance targets, or justify compensation decisions will be gamed, resented, and ultimately counterproductive. DX metrics used to understand systemic friction and improve team process are one of the most valuable inputs an engineering leader can have.
The practical rules for DX metrics that maintain trust:
Team-level aggregates only. Never surface per-engineer metrics to managers without explicit engineer consent. The right level of aggregation for DX signals is the team or squad — the unit at which process decisions are made. An engineer's individual PR cycle time is not a performance metric; their team's median cycle time is a process health metric.
Trend is more important than absolute value. A team's 30-day trend in PR cycle time is actionable — it tells you whether things are improving or degrading. Comparing two engineers' commit counts is not actionable and is the wrong use of the data entirely.
Engineers should see their own data. The right model is radical transparency about individual data to the individual themselves, with aggregated data available to managers. An engineer who can see their own WIP trend, CI failure rate, and cycle time has a mirror for their own work habits. A manager who can only see team aggregates has a process health dashboard.
Koalr implements this through configurable visibility settings. Full Transparency mode surfaces individual metrics to all team members. Role-Based mode restricts per-engineer detail to the engineer themselves, with only aggregated team metrics visible to managers. The default for new organizations is Role-Based, with explicit opt-in to broader visibility.
Setting up DX measurement in Koalr
The six DX indicators described above are built into Koalr's platform across several views:
DORA dashboard. PR cycle time trend (30-day rolling median), deployment frequency, and change failure rate are the core DX indicators. The dashboard surfaces trend direction alongside the current value — a cycle time of 18 hours that was 36 hours three months ago is a DX success story, not a DX problem.
Work Log page. GitHub contribution heatmap per developer, including hour-of-day and day-of-week distribution. This is the view that surfaces late-night and weekend commit patterns. Individual-level detail is opt-in per engineer.
Focus page. Work category breakdown showing what percentage of engineering time goes to features versus bugs versus KTLO (keep the lights on). High KTLO percentage is a DX signal in its own right — it means engineers are spending their time on maintenance rather than value creation, which correlates strongly with reduced job satisfaction.
AI Chat. Natural language queries against live engineering data. For example: "Which teams have the highest PR rework rate this month?" returns a ranked team list with rework rate, trend direction, and the specific repositories driving the signal. No dashboard navigation required.
Example AI Chat queries for DX analysis
"Which teams have the highest PR rework rate this month?"
"Show me repositories with CI flakiness above 10% in the last 30 days"
"What is the trend in median open PR age for the platform team?"
"Which engineers have more than 3 open PRs right now?"
DX improvement playbook: a 4-step cycle
Measuring DX without acting on the measurements is instrumentation theater. The value of the six indicators is that each one points toward a specific intervention — not a vague cultural initiative, but a concrete process change with a measurable expected effect.
Step 1: Establish baseline. Collect 30 days of data on all six indicators before changing anything. You need a baseline to measure against, and 30 days is long enough to smooth sprint-to-sprint variance while being short enough to be actionable. Note which indicators are in the healthy range, which are marginal, and which are clearly degraded.
Step 2: Identify the constraint. Pick the single worst-scoring indicator. Improvement in multiple areas simultaneously is unlikely; improvement in the primary constraint produces the highest leverage. If cycle time is the clear worst performer, that is your constraint — even if WIP is also above healthy range, fixing cycle time will often reduce WIP as a downstream effect.
Step 3: Run a targeted experiment. Design a specific, timeboxed intervention aimed at the constraint. For cycle time: "PR review SLA — first review within 4 hours on business days." For rework rate: "PR template requiring a spec link before review request." For CI flakiness: "dedicated one-week sprint to identify and quarantine the 10 most frequently flaking tests." The experiment should be narrow enough to measure its effect in 30 days.
Step 4: Measure and repeat. After 30 days post-change, compare the target indicator to baseline. Did it improve? Did other indicators move in response? An improvement in cycle time will often produce downstream improvements in WIP (engineers close out stalled PRs faster) and rework rate (faster feedback means problems are caught earlier in the review cycle). Document what worked and move to the next constraint.
When to use surveys alongside metrics
GitHub metrics tell you what is happening; surveys tell you why. Both are necessary for a complete picture of developer experience.
Consider a scenario where rework rate rises significantly over a 60-day period. The metric tells you that PRs are requiring more revision cycles — but it cannot tell you whether this is because requirements were unclear before PRs were opened (a product process problem), because reviewer standards changed (a review culture problem), or because the team took on a new domain that everyone is learning simultaneously (a temporary growth cost). A single 5-question pulse survey targeting the team can distinguish between these hypotheses in days.
The right surveying practice alongside behavioral metrics:
Quarterly pulse surveys. Five questions, five minutes. Keyed to the specific indicators that showed the most movement in the prior quarter. Not a comprehensive DX census — a targeted diagnostic. The questions should vary based on what the data is showing.
Survey fatigue is real. More than quarterly is counterproductive. Survey response quality degrades significantly with frequency; engineers begin treating them as bureaucratic checkboxes rather than genuine feedback channels. The behavioral metrics fill the signal gap between quarterly surveys without adding to survey burden.
Surveys should explain metrics, not replace them. A rising rework rate combined with survey responses citing "unclear requirements" is a product process problem, not an engineering problem. That distinction determines where the intervention should happen — and you would not know to look at product process without the survey context for the metric.
The metric that most teams are missing
Of the six indicators, CI flakiness has the highest ratio of business impact to measurement adoption. Most teams know their cycle time. Almost no teams have a running flakiness rate by repository. Yet flaky tests are consistently the top-cited DX pain point when engineers are surveyed directly. Start measuring CI flakiness by repository this week — the data will almost certainly surface something actionable immediately.
Putting it together
Developer experience is observable. The data is in your GitHub organization right now, updating with every push, every review, every CI run. Six indicators — PR cycle time, rework rate, late-night commit rate, open PR age, WIP per engineer, and CI flakiness — give you a continuous, leading-indicator view of whether your engineers are in a high-DX environment or a friction-filled one.
None of these indicators require surveys, require engineers to self-report, or add any instrumentation to your existing engineering workflow. They are all computable from the GitHub Events API and Check Runs API that you already have access to. The question is not whether the data exists — it is whether you are looking at it.
The business case is straightforward: DX improvements reduce attrition, increase output quality, and improve hiring velocity. The 4-step improvement cycle — baseline, identify constraint, experiment, measure — turns the six indicators from interesting observations into an operational improvement loop. And the SPACE framework alignment validates that these six metrics are not arbitrary choices; they are behavioral proxies for the dimensions that the academic research has established matter for developer productivity.
Start with one indicator. Pick the one that is furthest from the healthy range for your team right now, design one experiment to move it, and measure for 30 days. That is where DX improvement begins.
Measure developer experience with Koalr — free
Connect GitHub in under 5 minutes. Koalr surfaces all six DX indicators — PR cycle time, rework rate, late-night commit patterns, open PR age, WIP per engineer, and CI flakiness — alongside DORA metrics and deploy risk scores. No surveys. No additional instrumentation. Just the data your team already generates, made actionable.