Service metrics and their meanings
Whenever you have a system, it doesn’t really matter what it does: it is good practice to… well… measure it.
The purposes are multiple: making sure it is performing well, measuring its overall cost, measuring its latency to customers, and so on. It is clearly important to do so, but then the following question arises: how?
Well, the answer to that question is not easy, as each service, use case, and app has its own aspects to consider.
A good example is Google itself, and not only because they wrote THE book on Site Reliability Engineering. I am talking about the relationship between metrics and their inherent meaning across a myriad of services so large that they cover the entire world. In those books they cover a great deal of trial and error on how to measure production systems and how to focus on what is important. See also the companion SRE Workbook and the background on postmortem culture.
Even so, you might find that those examples are irrelevant to your use case and that you’ll have to figure everything out for yourself. And that is OK. But one thing is undeniable: those concepts help when they land on a use case that fits.
From my point of view, there are two basic types of metrics that you can apply those concepts to. They are inherently interconnected, in the sense that one does not exist without the other. Related framing: product “North Star” metrics (see Amplitude guide) versus engineering reliability KPIs (e.g. Four Golden Signals).
Business meaning
When we talk about metrics from a business standpoint we are trying to quantify whether the service is succeeding in creating (and keeping) value. A metric is a proxy for an outcome; no proxy is perfect, but good ones are:
- Connected to customer impact (they move when users are happier or churn when they are not)
- Hard to game (improving the number tends to improve the underlying reality)
- Stable enough to trend yet responsive enough to detect meaningful change
Leading vs. lagging indicators
Lagging indicators describe results you usually discover too late (monthly revenue, quarterly churn). Leading indicators move earlier and let you correct course (signup conversion, time-to-value, task success rate). A healthy metric set pairs each critical lagging metric with at least one leading metric you can act on. For a primer, see Leading vs. Lagging Indicators.
| Outcome (Lagging) | Example Leading Metrics | Intervention Window |
|---|---|---|
| Customer retention | Onboarding completion %, First action latency | Days/weeks |
| Revenue growth | Trial -> paid conversion, Expansion feature adoption | Weeks |
| NPS / Satisfaction (What is NPS?) | Error-free session %, Page performance (p95) (Web Vitals) | Minutes/hours |
North Star and guardrails
The North Star metric is the clearest expression of sustained value creation (e.g., “weekly active collaborative documents”). Guardrail metrics prevent optimizing the North Star at the expense of health (support tickets backlog, cost per transaction, reliability SLO compliance). If a push increases the North Star but violates a guardrail, you slow down.
Metric layers
Think in concentric layers:
- North Star (one, maybe two)
- Strategic pillars (fairly stable: acquisition, activation, retention, efficiency)
- Operational KPIs (change more often, tied to team OKRs)
- Diagnostic metrics (rich, detailed; used to explain movement, rarely reported upward)
Business questions to validate
Before adopting a business metric, ask: What decision will change if this metric moves? Who owns reacting to it? How fast must we respond? What thresholds define success vs. acceptable vs. alert?
Common business metric mistakes
- Measuring everything and prioritizing nothing
- Declaring a vanity metric (raw signups) as success without a quality filter
- Lacking a clear owner; metrics without owners decay
- Setting targets without historical baselines or variance analysis
- Not revisiting metrics when the product stage changes (growth vs. efficiency phase)
Translating to technical metrics
Each business metric should map (not 1:1, but traceably) to technical signals. If “time-to-value” matters, you must instrument latency of first key workflow and session error rates. This translation is the handshake between product and engineering.
Technical meaning
On the technical side, metrics become the nervous system of operating the service. Their meaning comes from how precisely they reflect user-visible behavior and how actionable they are for engineers.
Observability pillars vs. service metrics
Observability often cites three pillars: metrics (numeric aggregations), logs (discrete events with context), traces (distributed request flows). Service metrics sit at the top as interpreted numbers distilled from raw telemetry. You rarely alert on raw logs; you derive counters, rates, percentiles. See OpenTelemetry for standardized telemetry and this discussion on Monitoring vs. Observability.
Golden signals
Borrowing from SRE practice, the four golden signals of a user-facing system:
- Latency – How long it takes to serve a request (track both success and error paths, p50/p95/p99).
- Traffic – Demand size: requests/second, concurrent sessions.
- Errors – Failure rate: explicit errors, timeouts, correctness failures.
- Saturation – Resource exhaustion proximity: CPU, memory, queue length.
Many modern systems add a fifth: Cost – unit economics per request/job. A sketch of computing these signals from raw request data follows.
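To make these concrete, here is a minimal sketch in Python that reduces one observation window of raw request records into the four signals. The `Request` shape and the queue-based saturation proxy are assumptions for illustration; in practice your metrics pipeline computes these for you.

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    ok: bool  # True for a successful (2xx) response

def percentile(sorted_values: list[float], p: float) -> float:
    """Nearest-rank percentile over an already-sorted, non-empty list."""
    idx = round(p / 100 * (len(sorted_values) - 1))
    return sorted_values[idx]

def golden_signals(requests: list[Request], window_seconds: float,
                   queue_depth: int, queue_capacity: int) -> dict[str, float]:
    """Summarize one (non-empty) observation window into latency, traffic, errors, saturation."""
    latencies = sorted(r.latency_ms for r in requests)
    errors = sum(1 for r in requests if not r.ok)
    return {
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "latency_p99_ms": percentile(latencies, 99),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "saturation": queue_depth / queue_capacity,  # proximity to exhaustion, 0..1
    }
```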
SLIs, SLOs, SLAs
- SLI (Service Level Indicator): Precisely defined measurement of user experience (e.g., “fraction of read API requests completed under 300 ms and returning 2xx”).
- SLO (Service Level Objective): Target for SLI over a window (“99.9% weekly”).
- SLA (Service Level Agreement): Contractual externally visible commitment; breaching may have penalties. Always set SLO tighter than SLA.
Error Budget = 1 - SLO. It is the allowed unreliability used for change velocity (deploys, experiments). If you burn budget too fast: slow releases, add reliability work. If you never spend budget: you may be over-investing. For alerting strategy, consider multi-window, multi-burn rate SLO alerts.
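The arithmetic is simple enough to sketch. The numbers below are made up; the point is that a burn rate above 1 means you will exhaust the budget before the window ends.

```python
def error_budget_report(slo: float, good_events: int, total_events: int,
                        window_fraction_elapsed: float) -> dict[str, float]:
    """Error budget math for a rolling SLO window.

    slo: target fraction of good events, e.g. 0.999
    window_fraction_elapsed: how much of the SLO window has passed (0..1)
    """
    budget = 1.0 - slo                               # allowed unreliability
    bad_fraction = 1.0 - good_events / total_events  # observed unreliability
    consumed = bad_fraction / budget                 # share of budget already burned
    # Burn rate > 1 means budget is being spent faster than the window allows.
    burn_rate = consumed / window_fraction_elapsed if window_fraction_elapsed else float("inf")
    return {"budget": budget, "consumed": consumed, "burn_rate": burn_rate}

# Example: 99.9% weekly SLO, 2 days into the week, 1.2M requests, 600 failures.
print(error_budget_report(slo=0.999, good_events=1_199_400, total_events=1_200_000,
                          window_fraction_elapsed=2 / 7))
# -> consumed ≈ 0.5 (half the budget gone), burn_rate ≈ 1.75 (too fast)
```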
Metric types (Prometheus style)
- Counter: Monotonic increase (e.g., total requests). Alert on rate not raw value. (Prometheus counter)
- Gauge: Arbitrary up/down (e.g., memory usage, queue depth). (Prometheus gauge)
- Histogram: Buckets of observations (latency). Enables percentiles & tail analysis. (Prometheus histogram)
- Summary: Client-side calculated quantiles; use sparingly due to aggregation limits. (Prometheus summary)
Prefer histograms for latency & size; counters for events; gauges for states. Ensure unit consistency (seconds, bytes). Document each metric: name, type, unit, cardinality dimensions. Good primers: Prometheus histograms and summaries and Gil Tene on latency percentiles. A minimal instrumentation sketch follows.
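This sketch uses the official Python prometheus_client; the metric names, labels, and buckets are illustrative, not a prescription.

```python
from prometheus_client import Counter, Gauge, Histogram

REQUESTS = Counter(
    "checkout_api_requests_total", "Total requests handled", ["endpoint", "status"]
)
QUEUE_DEPTH = Gauge(
    "checkout_api_queue_depth", "Jobs currently waiting in the work queue"
)
LATENCY = Histogram(
    "checkout_api_request_latency_seconds", "Request latency in seconds",
    ["endpoint"],
    buckets=[0.05, 0.1, 0.2, 0.3, 0.5, 0.8, 1.2, 2.0],
)

# Usage inside a request handler:
REQUESTS.labels(endpoint="/doc", status="success").inc()
QUEUE_DEPTH.set(42)
LATENCY.labels(endpoint="/doc").observe(0.183)
```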
Cardinality discipline
High-cardinality labels (user_id, session_id) explode storage and slow queries. Guidelines:
- Reserve high cardinality for traces/logs, not primary metrics. (Prometheus best practices)
- Dimension by stable groupings (region, API endpoint, plan tier). (Grafana labels guide)
- Keep total series per metric under sane thresholds (e.g., < 10k) unless justified. (Cardinality explained)
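A back-of-the-envelope way to sanity-check a metric design: the worst-case series count is the product of each label’s distinct values. The numbers below are invented.

```python
from math import prod

def estimated_series(label_cardinalities: dict[str, int]) -> int:
    """Worst-case number of time series one metric can produce:
    the product of the distinct values of each label."""
    return prod(label_cardinalities.values())

# Bounded labels stay manageable...
print(estimated_series({"region": 6, "endpoint": 40, "plan_tier": 3, "status": 3}))      # 2160
# ...while one high-cardinality label explodes the count.
print(estimated_series({"region": 6, "endpoint": 40, "status": 3, "user_id": 50_000}))   # 36,000,000
```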
Instrumentation checklist
- Define SLIs first (user perspective) then choose raw signals.
- Standard naming: service_namespace_subsystem_metric_unit (e.g., checkout_api_request_latency_seconds). See Prometheus naming best practices.
- Include outcome labels: status=“success|error|timeout”.
- Separate pathologically slow from normal via buckets (e.g., 50,100,200,300,500,800,1200,2000 ms).
- Emit from a well-tested middleware layer to ensure coverage.
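Tying the checklist together, here is a sketch of such a middleware layer as a Python decorator. It repeats the metric definitions from the earlier sketch so it stands alone; the handler and endpoint names are hypothetical.

```python
import functools
import time
from prometheus_client import Counter, Histogram

REQUESTS = Counter("checkout_api_requests_total", "Requests by outcome",
                   ["endpoint", "status"])
LATENCY = Histogram("checkout_api_request_latency_seconds", "Latency by endpoint",
                    ["endpoint"],
                    buckets=[0.05, 0.1, 0.2, 0.3, 0.5, 0.8, 1.2, 2.0])

def instrumented(endpoint: str):
    """Wrap a handler so every call emits latency and an outcome-labelled count."""
    def wrap(handler):
        @functools.wraps(handler)
        def inner(*args, **kwargs):
            start = time.monotonic()
            status = "success"
            try:
                return handler(*args, **kwargs)
            except TimeoutError:
                status = "timeout"
                raise
            except Exception:
                status = "error"
                raise
            finally:
                LATENCY.labels(endpoint=endpoint).observe(time.monotonic() - start)
                REQUESTS.labels(endpoint=endpoint, status=status).inc()
        return inner
    return wrap

@instrumented("/doc")
def get_doc(doc_id: str) -> dict:
    ...  # real handler logic goes here
```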
Aggregation & rollups
Store both raw series and periodic rollups (1m, 5m, 1h) to enable long-range trends affordably. Tail metrics (p99) need raw-ish resolution; cost/traffic can tolerate coarser granularity. See Recording rules and continuous aggregates.
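In Prometheus this is usually expressed as recording rules; as a language-agnostic illustration, here is a tiny sketch of a max rollup over raw gauge samples (timestamps in seconds, data shape assumed).

```python
def rollup_max(samples: list[tuple[float, float]], step_seconds: int = 60) -> list[tuple[int, float]]:
    """Collapse raw (timestamp, value) gauge samples into per-step maxima.

    Maxima are a safe rollup for saturation-style gauges; counters should be
    summed as deltas, and latency percentiles should be re-derived from
    histogram buckets, never by averaging already-computed percentiles.
    """
    buckets: dict[int, float] = {}
    for ts, value in samples:
        slot = int(ts // step_seconds) * step_seconds
        buckets[slot] = max(buckets.get(slot, value), value)
    return sorted(buckets.items())
```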
Dashboards vs. alerts
Dashboards are for exploration & storytelling; alerts for actionable interruption. An alert should meet: clear owner, severity classification, runbook link, deduplication logic, and auto-silence conditions (maintenance windows, downstream known incidents). Too many unactionable alerts create alert fatigue; measure mean time to acknowledge (MTTA) and percent of alerts yielding tickets. See PagerDuty on alert fatigue.
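For SLO-based alerts specifically, the multi-window, multi-burn-rate pattern mentioned earlier keeps pages actionable: a short window confirms the pain is current, a long window confirms it is not a blip. A simplified sketch, with thresholds that are the commonly cited examples for a 30-day window:

```python
def should_page(burn_1h: float, burn_5m: float, burn_6h: float, burn_30m: float) -> bool:
    """Simplified multi-window, multi-burn-rate condition (after the SRE Workbook).

    The fast-burn pair catches sudden outages, the slow-burn pair catches steady
    leaks, and each long window is confirmed by a short one so alerts stop firing
    once the incident has recovered. Thresholds are illustrative.
    """
    fast = burn_1h > 14.4 and burn_5m > 14.4
    slow = burn_6h > 6.0 and burn_30m > 6.0
    return fast or slow
```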
Dependency metrics
Track upstream SLIs you depend on (e.g., database p99 latency). Decompose incidents faster by correlating service SLI dip with dependency saturation metrics.
Continuous improvement loop
- Observe SLI trends and error budget consumption.
- Perform weekly reliability review: anomalies, top regression sources.
- Propose experiments (cache change, index) with expected movement.
- Deploy and compare before/after via change-focused dashboards.
- Feed learnings into next quarter’s reliability & efficiency OKRs.
Bridging business and technical metrics
The most powerful metrics tell a dual story: “p95 checkout latency improved 20%, and conversion rose 3%”. Maintain a mapping document linking each business KPI to:
- Primary SLI(s)
- Key technical driver metrics (cache hit rate, DB lock wait)
- Leading indicator hypotheses (client render time)
Revisit the mapping quarterly; prune metrics that no longer explain variance. A minimal sketch of such a mapping document follows.
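The mapping document does not need special tooling; a version-controlled dictionary (or YAML file) reviewed quarterly is enough. A hypothetical entry, reusing names from this post:

```python
# Hypothetical KPI-to-SLI mapping; the names echo examples used elsewhere in this post.
KPI_MAPPING = {
    "weekly_active_collaborative_docs": {
        "primary_slis": ["doc_save_latency_p95", "doc_save_success_ratio"],
        "driver_metrics": ["cache_hit_rate", "db_write_lock_wait_seconds"],
        "leading_indicators": ["client_render_time_seconds"],
        "owner": "docs-platform-team",  # hypothetical owning team
    },
}
```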
Metric taxonomy cheat sheet
| Layer | Example | Owner | Cadence |
|---|---|---|---|
| North Star | Weekly active collaborative docs | Product leadership | Weekly |
| SLI | Successful doc save latency p95 < 400ms | SRE/Engineering | Continuous |
| SLO | 99.95% saves < 400ms weekly | SRE | Weekly review |
| Guardrail | Infra cost per save < $0.001 | Finance/Eng | Monthly |
| Diagnostic | DB write lock wait time | Eng team | Ad hoc |
Common anti-patterns
- Alerting on averages (which hide tail pain) instead of percentiles. See ACM Queue on percentiles.
- Chasing “100%” reliability (diminishing returns) vs. defined SLO + error budget. See Error Budgets.
- Overloading a single metric with too many labels causing cardinality blow-up. See Prometheus instrumentation pitfalls.
- Building dashboards no one uses: track dashboard views, retire stale boards. See Grafana dashboard tips.
- Confusing throughput with performance (more requests could mean retries from errors). Contrast RED method vs. USE method.
Practical example (API service)
Scenario: Public REST API for document editing. Defined SLIs:
- Read latency: fraction of GET /doc/{id} served < 250ms.
- Write success: fraction of PUT /doc/{id} returning 2xx.
- Editing session stability: sessions without disconnect > 5 minutes.
SLOs: 99.9%, 99.95%, 99% respectively (weekly). Error budget alarms at 50%, 75%, 100% consumption. Business KPI mapped: active editing minutes per user. Hypothesis: improving read latency p95 will raise active minutes by reducing initial load friction. Run experiment: introduce edge caching; monitor cache hit rate (target > 80%) and origin latency drop (expect -30%). Outcome metrics decide rollout. A sketch of the weekly budget check follows.
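The sketch below runs the weekly budget check for those three SLIs; the event counts are invented so the alarm levels fire at different consumption levels.

```python
SLOS = {"read_latency": 0.999, "write_success": 0.9995, "session_stability": 0.99}
ALARM_LEVELS = (0.5, 0.75, 1.0)

def budget_consumed(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget already spent in this window."""
    return (1 - good / total) / (1 - slo)

observed = {  # (good events, total events) this week; hypothetical counts
    "read_latency": (4_996_500, 5_000_000),
    "write_success": (899_640, 900_000),
    "session_stability": (59_000, 60_000),
}

for sli, (good, total) in observed.items():
    consumed = budget_consumed(SLOS[sli], good, total)
    fired = [level for level in ALARM_LEVELS if consumed >= level]
    print(f"{sli}: {consumed:.0%} of budget used, alarms fired: {fired}")
# read_latency: 70%, write_success: 80%, session_stability: ~167% (all alarms fire)
```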
Getting started checklist
- List top 3 user journeys; define one SLI each.
- Set initial SLOs from the historical 4-week median performance minus a modest margin (see the sketch after this checklist).
- Instrument counters for requests, errors; histograms for latency.
- Create one “golden dashboard” with SLIs + dependency saturation metrics.
- Define 3 alerts only: SLI burn rate high (short & long window), dependency slowdown, error spike.
- Write runbooks before enabling alerts.
- Review after 2 weeks: adjust buckets, remove noisy labels, refine SLO.
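One reading of the second checklist item, sketched in Python: take roughly four weeks of achieved weekly SLI, use the median, and give back a small share of the observed failure budget so the first SLO is attainable rather than aspirational. The counts and the 20% margin are assumptions.

```python
from statistics import median

def initial_slo(weekly_good: list[int], weekly_total: list[int], margin: float = 0.2) -> float:
    """Derive a starting SLO from ~4 weeks of history.

    Takes the median achieved SLI and loosens it by `margin` times the observed
    unreliability, so the target reflects reality with a modest buffer.
    """
    achieved = [g / t for g, t in zip(weekly_good, weekly_total)]
    med = median(achieved)
    return med - (1 - med) * margin  # slightly looser than the median week

# Four weeks of (hypothetical) request counts.
slo = initial_slo(weekly_good=[998_900, 999_200, 998_700, 999_100],
                  weekly_total=[1_000_000] * 4)
print(f"{slo:.3%}")  # ~99.880%
```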
Conclusion
Metrics are a language; alignment happens when business and engineering speak a shared dialect rooted in user experience. Start small, be explicit, iterate continuously. The goal is not more graphs; it’s faster, better decisions.