DORA Metrics

Whenever a tech leader is asked about their team’s performance, two questions lurk underneath: “are we shipping fast?” and “are we shipping well?”. The problem is, without data, the answer becomes a feeling. And feelings vary with the mood of the person answering, the latest Friday-night incident, and how much coffee was at the meeting.

📊 DORA Metrics
“A set of four metrics that indicate the performance of a software development team: deployment frequency, lead time for changes, change failure rate, and time to restore service. Together, they provide a balance between speed and stability.”
— Google Cloud / DORA

DORA (DevOps Research and Assessment) is a Google Cloud research program that, for more than a decade, has been studying what separates software teams that perform well from those that don’t. The most famous synthesis of that work is in the book Accelerate and in the annual State of DevOps Report. In practice, DORA Metrics have become the gold standard for measuring software delivery (DORA — Software delivery metrics).

Like any tool that becomes a “gold standard”, they’ve also become an easy target for misuse. It’s common to see teams adopt the metrics for the wrong reason, set up the wrong dashboard, and draw the wrong conclusion. This article is an attempt to explain the concept, show how to adopt it in a healthy way, and — maybe more importantly — map the traps you can easily fall into.

The 4 (actually 5) metrics

The original DORA framework has 4 metrics, split into two groups: velocity (throughput) and stability. In 2024, DORA updated the set to 5 metrics, renaming one and adding another (A history of DORA’s metrics). I’ll focus on the 4 classics, which are still the most used and referenced, and mention the evolution at the end of the section.

Velocity

1. Deployment Frequency: how often the team ships code to production. The most visible metric and arguably the most misinterpreted (more on that below). Elite teams ship multiple times a day; low performers ship every few months.

2. Lead Time for Changes: how long it takes for a commit to reach production. It measures the “friction” of the delivery pipeline: how long until a PR becomes user value.

Stability

3. Change Failure Rate: the percentage of deploys that required urgent intervention (rollback, hotfix, emergency patch). A proxy for the quality of what’s being delivered.

4. Mean Time to Restore (MTTR): how long it takes the team to restore service after a production incident. It measures the ability to respond to failure, not the ability to avoid it.
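To make the four definitions concrete, here is a minimal Python sketch that computes them from a list of deploy records. The `Deploy` record shape and the `dora_summary` function are hypothetical names chosen for illustration, not the schema of any particular tool. Using the median for lead time is also an assumption; your tool may aggregate differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    """One production deploy (hypothetical record shape, not a tool's schema)."""
    commit_at: datetime                     # when the change was committed
    deployed_at: datetime                   # when the change reached production
    failed: bool = False                    # needed rollback, hotfix or emergency patch
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_summary(deploys: list[Deploy], window_days: int) -> dict:
    """Compute the four classic metrics over a window of `window_days` days."""
    if not deploys:
        raise ValueError("no deploys in the window")
    failures = [d for d in deploys if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        # Deployment Frequency: deploys per day in the window
        "deploys_per_day": len(deploys) / window_days,
        # Lead Time for Changes: commit-to-production time (median is an assumption)
        "median_lead_time": median([d.deployed_at - d.commit_at for d in deploys]),
        # Change Failure Rate: share of deploys that needed urgent intervention
        "change_failure_rate": len(failures) / len(deploys),
        # Mean Time to Restore: average time from failed deploy to restored service
        "mean_time_to_restore": (
            sum(restores, timedelta()) / len(restores) if restores else None
        ),
    }
```

The function returns all four numbers together on purpose: reading throughput without stability (or the other way around) is exactly the mistake the rest of this article warns about.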

Combining the benchmarks published in the most recent reports (Octopus Deploy — DORA 2024/25, DORA report), the performance tiers look roughly like this:

Metric	Elite	High	Medium	Low
Deployment Frequency	Several per day	1/day – 1/week	1/week – 1/month	< 1/month
Lead Time for Changes	< 1 day	1 – 7 days	1 week – 1 month	> 1 month
Change Failure Rate	~5%	~10%	~15%	up to 64%
MTTR	< 1 hour	< 1 day	1 day – 1 week	> 1 week

About the evolution to 5 metrics: DORA renamed MTTR to Failed Deployment Recovery Time, which is more specific (it only measures recovery from a bad deploy, not from any incident), and added Deployment Rework Rate (the percentage of deploys that were unplanned and happened because of a production incident). The official doc is at DORA’s software delivery performance metrics. For those getting started, the 4 classics already cover most of the value; you don’t need to wait to adopt all 5 to begin.

A point the DORA report has been hammering for years and many people ignore: velocity and stability are not a trade-off. Elite teams do well on all four metrics simultaneously. Anyone who attacks deployment frequency only and ignores change failure rate doesn’t become an elite team — they become a team that produces incidents faster.

Metrics as goals (well, not quite 🫠)

This is the point that needs to be stated before any dashboard gets built: DORA Metrics are not goals. They are lagging indicators — thermometers showing how the delivery system is performing over time. The moment they turn into team OKRs or criteria for individual reviews, things sour fast.

There’s a principle called Goodhart’s Law that captures it well:

“When a measure becomes a target, it ceases to be a good measure.”

In practice, when you tell the team “we need to reach 5 deploys per day by end of quarter”, what happens? The team finds a way. And that way almost always involves some kind of unintentional “cheating”. Let’s look at some real scenarios that the InfoQ folks documented:

Inflating Deployment Frequency: the team breaks a single PR into 5 artificial PRs, or does “no-op deploys” just to bump the number. Result: metric goes up, delivered value doesn’t change, review and deploy overhead grows.
Reducing Lead Time by skipping review: to get commits to prod faster, someone loosens review, or skips tests on “small” PRs. Lead Time improves, Change Failure Rate gets worse — and since no one looks at both together, it looks like an improvement.
Masking MTTR with aggressive rollback: rollback is fast, so every time something goes wrong in prod, the team rolls back. MTTR looks great. Except rollback means the release value got taken away from the user — you’re not delivering anymore, you’re just not breaking.
Diluting Change Failure Rate with trivial deploys: if you ship 100 copy-change deploys and 5 deploys that broke, your failure rate plummets to under 5% (5 out of 105). The metric lies, and the team believes it.
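The dilution effect in the last scenario is easy to check with arithmetic. The sketch below uses the numbers from that scenario and assumes those 105 deploys are everything that shipped; the `"copy"` and `"feature"` labels are hypothetical tags added for illustration.

```python
def change_failure_rate(deploys: list[tuple[str, bool]]) -> float:
    """Fraction of deploys that failed. Each deploy is (kind, failed)."""
    if not deploys:
        raise ValueError("no deploys to measure")
    return sum(failed for _, failed in deploys) / len(deploys)


# 100 trivial copy changes that did not fail, plus 5 deploys that broke.
deploys = [("copy", False)] * 100 + [("feature", True)] * 5

overall = change_failure_rate(deploys)
features_only = change_failure_rate([d for d in deploys if d[0] == "feature"])

print(f"all deploys:  {overall:.1%}")        # 4.8%
print(f"feature only: {features_only:.1%}")  # 100.0%
```

Segmenting by type of change is one way to notice the effect: the overall number looks elite while the deploys that carried real change tell a different story.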

So what’s the point of measuring, if gaming the metric makes it lie? The point is to use the metrics as a conversation trigger, not a test score. When Lead Time rises three sprints in a row, that’s information. It’s not a punishment or a badge: it’s a signal to pop the hood and investigate. Could be slow CI, review queue, oversized PRs, unstable staging. The metrics don’t tell you what to do, they tell you where to look.
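As a small illustration of “signal, not score”, a check like the one below could flag a sustained rise so the team knows to go look. It is only a sketch: the function name and the choice of three consecutive increases are assumptions taken from the example in the paragraph above.

```python
def rising_streak(values: list[float], n: int = 3) -> bool:
    """True if the series rose n times in a row at its end.

    n consecutive rises need n + 1 data points (e.g. four sprints for n=3).
    """
    if len(values) < n + 1:
        return False
    tail = values[-(n + 1):]
    return all(later > earlier for earlier, later in zip(tail, tail[1:]))


# Lead time in days per sprint (made-up numbers for illustration).
lead_time_by_sprint = [2.0, 2.5, 3.1, 3.4]
if rising_streak(lead_time_by_sprint):
    print("Lead Time rose three sprints in a row: worth a look in the retro")
```

The output is a prompt for a conversation, not a verdict; it says nothing about why the number moved.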

Team metric vs. individual metric

This is such an important corollary it deserves its own subsection: DORA Metrics are system metrics, not people metrics. The official documentation is explicit about it (DORA — DORA metrics): they should be applied at the application or service level, not compared across different teams, and never used to evaluate engineers individually.

The reason is direct: if I, as a dev, know my promotion depends on my Lead Time, I’ll optimize for my Lead Time. I’ll avoid large PRs even when they make sense, I’ll avoid picking up hard bugs because they take longer, I’ll create political friction in others’ reviews to accelerate mine. Team metrics turned into individual incentives destroy the team and lie about performance.

For individual engineer evaluation, there’s another framework called SPACE (Satisfaction and well-being, Performance, Activity, Communication and collaboration, Efficiency and flow), explicitly designed to capture individual and well-being dimensions. The two are complementary. DORA measures the system; SPACE measures people and collaboration. Don’t conflate them.

How to apply it in the team

OK, so how do we adopt this in a healthy way? The recipe isn’t complicated, but there are some prerequisites.

Prerequisites

There are several platforms that propose to calculate DORA Metrics on top of your development flow — Swarmia, LinearB, Sleuth, among others. Regardless of the tooling choice, three things need to be in place first:

Connected integrations: the central point usually is the platform’s app installed in GitHub/GitLab — it pulls PRs, commits, reviews and deploy events. For project/initiative context, it’s worth plugging the issue tracker too (Linear, Jira, etc). And to close the team-notification loop, connect Slack.
Explicit, agreed-upon definitions: what counts as a “deploy”? Does staging count or only prod? Does a feature-flag toggle count? What’s a “failure”? Bug reported by user or only incidents that paged someone? These questions need to be answered before measuring starts, and in writing. Bryan Finster has a good whitepaper on these ambiguities. In most tools, part of this decision turns into concrete configuration: you choose between GitHub Deployments, GitHub Checks or some Deployment API to tell the tool what it should count as a production deploy.
Leadership buy-in that this is a system metric: if the head of eng is going to use it for team ranking or to justify a PIP, better not start. The damage of adopting it badly is greater than the cost of not adopting it at all.

Start small

You don’t need to measure all 4 metrics on day 1. Connecting the platform to GitHub and enabling deployments already gets you Deployment Frequency and Lead Time for Changes working almost for free — they come straight from the history of PRs merged to main and from deploy events. Start there: together they already paint a reasonable picture of the pipeline’s health.

Change Failure Rate comes next, but depends on an explicit signal: most tools automatically detect when a deploy is followed by a revert, rollback or hotfix and mark the original deploy as a failure. For that detection to be reliable, the team needs to adopt a convention for those deploys — standardized commit message, PR label, or specific tag. Without a convention, the metric will silently underestimate reality.
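A rough sketch of how convention-based detection might work, and why it undercounts when the convention is not followed. The prefix list, the `DeployEvent` shape and `mark_failures` are all hypothetical; real tools use their own rules, and this version simply blames the deploy immediately before the fix.

```python
import re
from dataclasses import dataclass

# Hypothetical team convention: fix-up deploys start with one of these words.
FIX_PATTERN = re.compile(r"^(revert|rollback|hotfix)\b", re.IGNORECASE)


@dataclass
class DeployEvent:
    sha: str
    message: str


def mark_failures(deploys: list[DeployEvent]) -> set[str]:
    """Return SHAs of deploys that were directly followed by a fix-up deploy.

    `deploys` must be in chronological order.
    """
    failed = set()
    for previous, current in zip(deploys, deploys[1:]):
        if FIX_PATTERN.match(current.message):
            failed.add(previous.sha)
    return failed


history = [
    DeployEvent("a1", "Add checkout flow"),
    DeployEvent("b2", "hotfix: null price in checkout"),
    DeployEvent("c3", "Update pricing copy"),
    DeployEvent("d4", "Fix crash from pricing copy"),  # ignores the convention
    DeployEvent("e5", "Add export"),
    DeployEvent("f6", 'Revert "Add export"'),
]
print(sorted(mark_failures(history)))  # ['a1', 'e5']
```

Deploy `c3` also broke production, but its fix did not follow the naming convention, so it is never counted. That is the silent underestimation in practice.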

MTTR is where some tools have a limitation worth knowing: they tend to compute recovery from the time between a broken deploy and the fix deploy. That works well for failures visible in the pipeline, but doesn’t capture incidents detected externally (production alert, user-reported bug) — that incident-management data ends up out of scope. For most teams the deploy-based approximation is enough; just be clear that’s the definition.
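The deploy-based approximation can be sketched as follows, assuming each deploy has already been labeled `"ok"`, `"broken"` or `"fix"` (hypothetical labels for illustration, not any tool’s vocabulary). Recovery is measured from the first broken deploy to the next fix deploy.

```python
from datetime import datetime, timedelta
from typing import Optional


def deploy_based_recoveries(deploys: list[tuple[datetime, str]]) -> list[timedelta]:
    """Recovery time per failure, from broken deploy to the next fix deploy.

    `deploys` is a chronological list of (deployed_at, kind).
    """
    recoveries = []
    broken_since: Optional[datetime] = None
    for deployed_at, kind in deploys:
        if kind == "broken" and broken_since is None:
            broken_since = deployed_at  # start the clock at the first breakage
        elif kind == "fix" and broken_since is not None:
            recoveries.append(deployed_at - broken_since)
            broken_since = None         # service considered restored
    return recoveries


def mean_recovery(recoveries: list[timedelta]) -> Optional[timedelta]:
    """Average recovery time, or None when there were no recoveries."""
    if not recoveries:
        return None
    return sum(recoveries, timedelta()) / len(recoveries)
```

Note what is missing from the input: an outage that starts with an alert or a user report, with no broken deploy in the history, never enters the calculation.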

About review cadence: weekly or per-sprint is enough. Daily becomes noise (the natural day-to-day variance swallows the signal). Monthly works for a more mature team with a low deploy volume.

A feature offered by some of these platforms that pairs well with that cadence is Working Agreements: you define collective targets (e.g., “PRs open for less than 24h”, “Lead Time below X days”) and the tool notifies on Slack when the team drifts. It’s useful for automating part of the feedback without turning into micromanagement, because the alert goes to the team, not the individual.
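To show the idea behind a team-level alert, here is a minimal sketch of the “PRs open for less than 24h” agreement. The function names and message format are hypothetical, and nothing here calls a real Slack or GitHub API. The message deliberately lists PRs, not authors.

```python
from datetime import datetime, timedelta
from typing import Optional


def prs_over_limit(
    opened_at: dict[str, datetime],
    now: datetime,
    limit: timedelta = timedelta(hours=24),
) -> list[str]:
    """Return the PRs that have been open longer than the agreed limit."""
    return sorted(pr for pr, opened in opened_at.items() if now - opened > limit)


def team_alert(
    opened_at: dict[str, datetime],
    now: datetime,
    limit: timedelta = timedelta(hours=24),
) -> Optional[str]:
    """Build a message for the team channel, or None if the agreement holds."""
    stale = prs_over_limit(opened_at, now, limit)
    if not stale:
        return None
    hours = int(limit.total_seconds() // 3600)
    return f"{len(stale)} PR(s) open for more than {hours}h: " + ", ".join(stale)
```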

A practical case

Let me grab a real example documented by the BossaBox folks, which illustrates the virtuous cycle nicely:

Metric	Before	After
Deployment Frequency	1 every 15 days	> 1 per day
Change Failure Rate	~100%	significantly reduced
MTTR	~90 hours	~90 minutes

The team had deploys every 15 days and almost every deploy was followed by a hotfix (hence the failure rate near 100%). Recovery took an average of 90 hours. When this data was gathered and presented quantitatively, it opened the door for a structural discussion: the problem wasn’t team “effort”, it was architecture, deploy process, and test coverage. After the changes, the team was deploying daily with a 90-minute MTTR.

Notice the pattern: the metric fixed nothing. It just made visible a problem everyone felt but no one could size. The fix was in the practices (CI/CD, tests, architecture) and in the culture (deploy is no longer a risk event). The tool is what makes “making it visible” fast — but it’s not magic.

Improvement conversations

When the team looks at the metrics in a recurring ritual (a good fit is the sprint retro), the right questions to ask are open-ended:

“Our Lead Time went up 30% in the last month. What changed in our flow?”
“That Change Failure spike — which area did it come from? Is it repeating or isolated?”
“Deploy Frequency is dropping. Conscious decision or is a bottleneck appearing?”

The tool helps here because it lets you filter by repo, team, author or time window — you can quickly see if the symptom is concentrated in one service or is systemic, and cross-reference with issue tracker data to understand if it was a specific type of change. But the tooling doesn’t replace the discussion. Notice none of the questions above is “who’s to blame?”. A metric that blames becomes a metric that gets gamed — and that holds whether the dashboard is pretty or not.
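The “concentrated or systemic?” question boils down to grouping the same metric by service. A minimal sketch, assuming each deploy is available as a `(service, failed)` pair; the service names in the example are made up.

```python
from collections import defaultdict


def failure_rate_by_service(deploys: list[tuple[str, bool]]) -> dict[str, float]:
    """Change Failure Rate per service, from (service, failed) pairs."""
    totals: dict[str, int] = defaultdict(int)
    failures: dict[str, int] = defaultdict(int)
    for service, failed in deploys:
        totals[service] += 1
        failures[service] += int(failed)
    return {service: failures[service] / totals[service] for service in totals}


# Made-up data: one service carries most of the failures.
deploys = (
    [("checkout", True)] * 3 + [("checkout", False)] * 1
    + [("search", False)] * 9 + [("search", True)] * 1
)
rates = failure_rate_by_service(deploys)
print(rates)  # {'checkout': 0.75, 'search': 0.1}
```

Grouping by service keeps the conversation about the system. Grouping by author would answer “who’s to blame?”, which is the question this section argues against asking.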

Expected outcome

What to expect from adopting DORA Metrics properly? On different horizons:

Short-term (weeks): visibility. You start seeing where time is lost, which pipeline stage is slowest, and how much the team spends firefighting vs. delivering features.
Medium-term (months): the team starts using the metrics as an improvement tool. Lead Time drops because someone noticed CI was slow. Change Failure Rate drops because someone noticed coverage was missing in a specific area. Deployment Frequency rises because the fear of shipping went down.
Long-term (years): the DORA report documents consistent correlation between high performance on these metrics and business metrics (profitability, market share) and people metrics (retention, lower burnout). It’s not magic — it’s that a team shipping fast and stable does so on top of healthy fundamentals (CI, tests, trunk-based development, observability), and those fundamentals are good in themselves.

But it’s only honest to close with a disclaimer: the metrics don’t tell you how to improve. They only point to where to look. The actual work (trunk-based development, continuous delivery, feature flags, automated tests, blameless postmortem culture, observability) is what moves the numbers. The dashboard doesn’t; it just tells the story.

The final message is the same as the opening: a dashboard doesn’t fix a team. Data-driven conversation does. DORA Metrics are a great excuse to start that conversation in a structured way — as long as no one turns the metric into a target along the way.
