Metrics are the cheapest, most queryable form of observability data — and the most likely to silently bankrupt your team via cardinality explosions. This guide covers how to actually use metrics in production: what Prometheus is, how PromQL works, why cardinality is the bill killer, and how to design SLOs that match user expectations without breaking your reliability budget.
What are Metrics?
A metric is a numerical measurement that changes over time. CPU utilization, request count, latency, queue depth, error rate — all metrics. They're stored as time series: a sequence of (timestamp, value) pairs, tagged with labels that identify what the value represents.
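To make the shape of a time series concrete, here is a minimal Python sketch. The `TimeSeries` class, the `queue_depth` metric, and the label values are hypothetical names chosen for illustration; this is not how any particular metrics backend stores its data.

```python
from dataclasses import dataclass, field


@dataclass
class TimeSeries:
    """One series: a metric name, identifying labels, and (timestamp, value) samples."""
    name: str
    labels: dict[str, str]
    samples: list[tuple[float, float]] = field(default_factory=list)

    def append(self, timestamp: float, value: float) -> None:
        # Samples are appended in time order; each one is a (timestamp, value) pair.
        self.samples.append((timestamp, value))


# Hypothetical example: queue depth for one service in one region.
series = TimeSeries(name="queue_depth", labels={"service": "checkout", "region": "eu-west-1"})
series.append(1700000000.0, 12.0)
series.append(1700000015.0, 17.0)

latest_timestamp, latest_value = series.samples[-1]
```

Changing any label value (for example a different `region`) would identify a different series, which is the property that matters for the cardinality discussion later in this guide.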
Metrics are cheap to store, cheap to query, and ideal for dashboards, alerts, and capacity planning. They are not ideal for debugging individual requests — that's what logs and traces are for.
Metrics vs Logs vs Traces
| Signal | Best for | Cost driver | When to reach for it |
|---|---|---|---|
| Metrics | Trends, dashboards, alerts | Cardinality | "Is the service healthy right now?" |
| Logs | Forensics, debugging | Volume (GB ingested) | "What exactly happened with this request?" |
| Traces | Latency analysis, dependencies | Sampling rate | "Where is time being spent across services?" |
Production observability needs all three. The mistake is using them interchangeably — putting user IDs in metric labels (cardinality explosion), logging every metric tick (volume explosion), or capturing 100% of traces at high traffic (cost explosion).
Prometheus & PromQL
Prometheus is the open-source standard for metrics in Kubernetes and cloud-native infrastructure. It pulls (scrapes) metrics from your services on a regular interval (15s is a common setting; the built-in default is 1m), stores them locally as time series, and lets you query them with PromQL.
A typical PromQL query looks like:
sum by (status) (rate(http_requests_total{job="api"}[5m]))
Read that as: "the rate of HTTP requests over the last 5 minutes, summed by status code, for jobs tagged api". The pieces:
- http_requests_total: the metric name
- {job="api"}: a label filter
- [5m]: a range vector (last 5 minutes of samples)
- rate(...): per-second average increase across the range
- sum by (status): collapse all dimensions except status
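The same steps can be mimicked in plain Python to show what the query computes. This is a simplified sketch: `simple_rate` and `sum_rate_by` are hypothetical helpers, the sample data is made up, and the real `rate()` also handles counter resets and extrapolates to the window boundaries, which this version does not.

```python
from collections import defaultdict

# Hypothetical counter samples: (labels, [(timestamp_seconds, counter_value), ...])
series = [
    ({"job": "api", "status": "200", "instance": "a"}, [(0, 100.0), (300, 700.0)]),
    ({"job": "api", "status": "200", "instance": "b"}, [(0, 50.0), (300, 350.0)]),
    ({"job": "api", "status": "500", "instance": "a"}, [(0, 10.0), (300, 40.0)]),
    ({"job": "web", "status": "200", "instance": "c"}, [(0, 0.0), (300, 900.0)]),
]


def simple_rate(samples):
    """Per-second increase between the first and last sample in the window."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def sum_rate_by(series, matchers, by):
    """Filter series by label matchers, take each rate, then sum per value of `by`."""
    totals = defaultdict(float)
    for labels, samples in series:
        if all(labels.get(key) == value for key, value in matchers.items()):
            totals[labels[by]] += simple_rate(samples)
    return dict(totals)


# Rough equivalent of: sum by (status) (rate(http_requests_total{job="api"}[5m]))
result = sum_rate_by(series, matchers={"job": "api"}, by="status")
```

With this data, the two `status="200"` series for the `api` job collapse into a single value, and the `web` job is filtered out before any aggregation happens.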
PromQL is dense but consistent. Once you internalize rate, sum by, and histogram_quantile, you can write 80% of useful queries.
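Of the three, histogram_quantile is the least intuitive, so a sketch of the idea may help. The function below is a simplified Python approximation of the linear interpolation applied to cumulative histogram buckets; the bucket boundaries and counts are invented, and edge cases that Prometheus handles (malformed buckets, native histograms) are left out.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count) with at least two entries,
    the last of which must be (math.inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    for i, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


# Hypothetical request-duration buckets in seconds.
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p50 = histogram_quantile(0.5, buckets)
p95 = histogram_quantile(0.95, buckets)
```

The result is only as precise as the bucket boundaries: every quantile that lands in the 0.5s to 1.0s bucket is interpolated between those two bounds, whatever the true distribution inside it.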
Cardinality — The Bill Killer
Every unique combination of label values on a metric is a separate time series. If http_requests_total has labels {method, path, status} with 5 methods, 100 paths, and 10 statuses, that's 5,000 series. Multiply by 10 replicas = 50,000. Add a user_id label with 100,000 users = 5 billion series. Prometheus dies.
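The arithmetic above is a product of label cardinalities, which a few lines of Python make explicit. `series_count` is a hypothetical helper, and the result is a worst-case upper bound: in practice not every combination of label values occurs.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int], replicas: int = 1) -> int:
    """Upper bound on time series: product of distinct values per label, times replicas."""
    return prod(label_cardinalities.values()) * replicas


base = {"method": 5, "path": 100, "status": 10}

per_replica = series_count(base)                     # 5 * 100 * 10
with_replicas = series_count(base, replicas=10)      # multiplied by 10 replicas
with_user_id = series_count({**base, "user_id": 100_000}, replicas=10)
```

Because the labels multiply, adding one unbounded label does far more damage than adding several small, bounded ones.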
Rules for low cardinality:
- Never label by user ID, request ID, IP address, or other unbounded values
- Bucket paths into route templates (/users/:id) instead of literal values (/users/12345)
- Watch out for service mesh sidecars and Kubernetes exporters; they can balloon cardinality without warning
- Use topk() queries to find your worst-offending metrics
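The route-template rule can be sketched in Python as follows. `route_template` is a hypothetical helper that treats numeric and UUID-shaped path segments as IDs; where a web framework exposes the matched route pattern, using that value directly is usually more reliable than pattern matching.

```python
import re

# Segments that look like identifiers: all digits, or a UUID.
_NUMERIC_SEGMENT = re.compile(r"^\d+$")
_UUID_SEGMENT = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def route_template(path: str) -> str:
    """Replace ID-like path segments with ':id' so the label stays bounded."""
    segments = path.split("?", 1)[0].split("/")  # drop the query string first
    templated = [
        ":id" if _NUMERIC_SEGMENT.match(s) or _UUID_SEGMENT.match(s) else s
        for s in segments
    ]
    return "/".join(templated)


label_value = route_template("/users/12345")
nested_value = route_template("/users/12345/orders/9?expand=items")
```

Every user then maps to the same `path` label value, so the series count depends on the number of routes rather than the number of users.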
Managed observability vendors (Datadog, New Relic, Grafana Cloud) typically price metrics by volume, such as active series, custom metrics, or data ingested, so cost scales with cardinality. A single high-cardinality metric can add thousands of dollars per month to your bill.
SLI, SLO, and SLA
The vocabulary is precise. Use it right:
- SLI (Service Level Indicator) — the actual metric you measure. Example: "fraction of HTTP requests returning 2xx within 500ms".
- SLO (Service Level Objective) — your internal target for the SLI. Example: "99.9% of requests meet the SLI over any 30-day window". Internal, changeable.
- SLA (Service Level Agreement) — the contractual commitment to customers. Example: "99.5% monthly availability, with 10% service credit if breached". Should be looser than your SLO so you have a buffer.
The error budget is what your SLO doesn't promise: if the SLO is 99.9%, your error budget is 0.1% of the period, which is about 43.8 minutes in an average calendar month (43.2 minutes over a strict 30-day window). Use it deliberately for releases, experiments, or chaos engineering. Burn it accidentally and you're forced to slow down.
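The error budget arithmetic can be sketched in Python. The function names are hypothetical, and the default period assumes an average calendar month (365.25 days / 12), which is what produces the 43.8-minute figure; pass a 30-day period to match a rolling 30-day SLO window.

```python
MINUTES_PER_AVERAGE_MONTH = 365.25 * 24 * 60 / 12  # 43,830 minutes
MINUTES_PER_30_DAYS = 30 * 24 * 60                 # 43,200 minutes


def error_budget_minutes(slo_percent, period_minutes=MINUTES_PER_AVERAGE_MONTH):
    """Minutes of allowed unreliability: the part of the period the SLO does not promise."""
    return (1 - slo_percent / 100) * period_minutes


def budget_remaining_fraction(slo_percent, bad_minutes, period_minutes=MINUTES_PER_AVERAGE_MONTH):
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    budget = error_budget_minutes(slo_percent, period_minutes)
    return (budget - bad_minutes) / budget


monthly_budget = error_budget_minutes(99.9)
rolling_budget = error_budget_minutes(99.9, MINUTES_PER_30_DAYS)
remaining = budget_remaining_fraction(99.9, bad_minutes=10)
```

Tracking the remaining fraction rather than raw downtime makes it easier to decide whether a risky release still fits in the current period.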
The Four Golden Signals
From Google's Site Reliability Engineering book. If you instrument only four things on every service, instrument these:
- Latency — request duration. Distinguish successful vs failed (failed requests are often faster — don't let them pollute your P99).
- Traffic — requests per second. The denominator for everything else.
- Errors — failure rate. Explicit (5xx) and implicit (200 with error in body) both count.
- Saturation — how full your service is. CPU, memory, queue depth, connection pool utilization.
Most production fires can be diagnosed from these four metrics plus the request logs from the same window. Everything else is nice-to-have until it isn't.
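As an illustration of how the four signals relate to raw request data, here is a small Python sketch. The `Request` record, the `golden_signals` function, and the in-flight limit are hypothetical; a real service would export counters, histograms, and gauges and compute these values at query time rather than in application code.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, in_flight, max_in_flight):
    # Latency is computed over successful requests only, so fast failures
    # do not drag the P99 down. Only explicit 5xx errors are counted here;
    # implicit errors (200 with an error body) need application-level checks.
    ok_durations = [r.duration_ms for r in requests if r.status < 500]
    failed = [r for r in requests if r.status >= 500]
    return {
        "latency_p99_ms_success": percentile(ok_durations, 99) if ok_durations else None,
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        "saturation": in_flight / max_in_flight,
    }


# Hypothetical one-minute window: 99 successes and one fast failure.
requests = [Request(100.0, 200)] * 95 + [Request(800.0, 200)] * 4 + [Request(5.0, 500)]
signals = golden_signals(requests, window_seconds=60, in_flight=40, max_in_flight=50)
```

Note that traffic is the denominator of the error rate, which is why it is worth recording even when it looks uninteresting on its own.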
Free Tools
Project Helena's free tools for metrics work:
- Error Budget Calculator — SLO burn rate and time-to-exhaustion
- Uptime SLA Calculator — Any SLA % to allowed downtime
- Latency Percentile Calculator — P50, P90, P95, P99 from raw data
More metrics tools are coming: a Prometheus cardinality estimator, a PromQL formatter, and an SLI/SLO YAML builder.
Also see
- AWS Pricing — observability infrastructure isn't free; understand the cost side
- Uptime Monitoring — the simplest metric of all: is the service up?
- Distributed Tracing — when metrics aren't enough, follow the request
- Log Management — the third leg of observability
Metrics & Observability FAQ
What is the difference between metrics, logs, and traces?
Metrics are numerical measurements over time (CPU %, request count, latency). Cheap to store, easy to aggregate, bad for debugging individual requests. Logs are timestamped events with rich context. Expensive at scale. Traces follow a single request across services. Best for "where is time spent in this call?" but generate a lot of data without sampling.
What is Prometheus cardinality?
Cardinality = the number of unique time series produced by a metric. Every unique combination of label values is a separate series. http_requests_total{method, path, status} with 5 methods × 100 paths × 10 status codes = 5,000 series — multiply by replicas and you can hit millions fast. High cardinality is the #1 reason Prometheus bills explode.
What is a good SLO target?
Match the SLO to user expectations and engineering cost. 99% (two nines) = 7.3 hours/month allowed downtime — fine for internal tools. 99.9% = 43.8 min/month — the SaaS baseline. 99.99% = 4.38 min/month — multi-AZ deployments + automated failover. Going from 99.9% to 99.99% typically costs 5-10x. Use the Error Budget Calculator to see what each level means in practice.
What are the Four Golden Signals?
From the Google SRE book: Latency (how long a request takes), Traffic (how many requests/sec), Errors (rate of failed requests), and Saturation (how full the service is — CPU, memory, queue depth). If you only have time to instrument four things, instrument these.
How do I reduce my Prometheus / Datadog metrics bill?
Three levers, in order of impact: (1) Reduce label cardinality by removing high-cardinality labels like user_id, request_id, and IP addresses; these belong in logs or traces, not metrics. (2) Drop unused metrics: many exporters emit hundreds of metrics that teams never query. In Prometheus, a metric_relabel_configs rule with a drop action removes them at scrape time. (3) Lengthen scrape intervals: most infra metrics don't need 15s resolution.
Should I use Prometheus or a managed service?
Self-hosted Prometheus has no license cost, but you pay for the infrastructure it runs on and for operator time (managing storage, HA, federation). Managed services (Grafana Cloud, Datadog, New Relic) start cheap and get expensive fast as data volume and cardinality grow. Most teams under 50 engineers do better with managed; larger teams running their own observability platform should consider self-hosted with Thanos or Mimir for long-term storage.