Prometheus Cardinality Is What Kills Your Bill — How to Measure It

Cardinality blows up Prometheus and managed observability bills. What it is, what to watch for, and the queries to find your worst offenders.

Project Helena

Short answer: Cardinality is the number of unique time series produced by a metric. Each unique combination of label values creates a new series, and your cost (memory in Prometheus, dollars with managed observability vendors) scales with total active series. A single bad label like user_id on a high-traffic metric can generate millions of series and add thousands of dollars per month to your bill. This post explains how to measure cardinality, the rules for keeping it sane, and the PromQL queries to find your worst offenders.

The Math

A metric http_requests_total with labels {method, path, status} produces one time series per unique combination of label values:

5 methods × 100 paths × 10 statuses = 5,000 series

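With 10 replicas each emitting that metric under its own instance label, that becomes 50,000 series. Add a user_id label with 100,000 unique users and the upper bound is 5 billion series. Real traffic rarely fills every combination, but even a small fraction of that bound is enough to run Prometheus out of memory or push a Datadog bill up by $5,000+/month.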
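The multiplication above can be sketched in a few lines of Python. The function name max_series is made up for illustration, and the result is an upper bound: it assumes every combination of label values actually occurs.

```python
from math import prod

def max_series(label_cardinalities, replicas=1):
    """Upper bound on series: product of per-label value counts, times replicas."""
    return prod(label_cardinalities.values()) * replicas

# Label value counts taken from the example in this section.
base = {"method": 5, "path": 100, "status": 10}

print(max_series(base))                                        # 5000
print(max_series(base, replicas=10))                           # 50000
print(max_series({**base, "user_id": 100_000}, replicas=10))   # 5000000000
```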

Why Cardinality Costs Money

Two distinct cost models, both broken by high cardinality:

Self-hosted Prometheus keeps every active series in memory. Each series consumes roughly 1-3 KB of resident memory, plus 1-2 bytes per sample on disk. At 1 million active series, that is roughly 1-3 GB of RAM for series bookkeeping alone, before any query overhead. At 10M series you’re looking at Prometheus VMs with tens of GB of memory and slow queries.

Managed observability vendors charge per active series per month. Approximate 2026 rates:

  • Grafana Cloud: $8 per 1,000 active series/month above the free tier
  • Datadog: $1.27 per 100 custom metrics per month (and a metric is effectively a series)
  • New Relic / Honeycomb / Chronosphere: similar per-series economics

One ungoverned high-cardinality metric on a large team can add $1,000-$10,000+/month.
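As a rough sketch of both cost models, the helpers below turn a series count into a memory estimate and a monthly bill. The function names are hypothetical, the prices are the approximate rates listed above, and the free tier size is left as a parameter (default 0) because it is not stated here.

```python
def head_memory_gb(active_series, kb_per_series=3.0):
    """Estimate resident memory in GB (decimal), using the 1-3 KB/series rule of thumb."""
    return active_series * kb_per_series / 1_000_000

def monthly_cost_usd(active_series, price, per, free_series=0):
    """Per-series pricing: `price` dollars for every `per` billable series."""
    billable = max(0, active_series - free_series)
    return billable / per * price

series = 50_000  # the 10-replica example from The Math

print(head_memory_gb(1_000_000))                          # 3.0
print(round(monthly_cost_usd(series, 8.00, 1_000), 2))    # 400.0  (Grafana Cloud rate above)
print(round(monthly_cost_usd(series, 1.27, 100), 2))      # 635.0  (Datadog rate above)
```

Treat the output as an order-of-magnitude check, not a quote: real bills depend on plan, commitments, and how each vendor counts active series.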

Labels That Should NEVER Exist on Metrics

The hard rule: a metric label’s value set must be bounded and small. These labels are unbounded by nature and belong in logs or traces, never metrics:

  • user_id, account_id, customer_id
  • request_id, trace_id, session_id
  • email, username
  • ip / client_ip
  • Raw URL path with IDs (use route templates: /users/:id not /users/12345)
  • Timestamps as labels
  • Hash values, UUIDs
  • Free-form user-input strings (search queries, error messages with variables)

If you need that detail for debugging, log it as structured log fields or attach it as a trace attribute. Metrics are for aggregates, not individual events.
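For the raw-path case in the list above, one option is to normalize the path before it becomes a label value. The sketch below is a fallback heuristic, not a recommended default: if your web framework exposes the matched route pattern, use that instead. The function name and the :id / :uuid placeholders are assumptions for illustration.

```python
import re

_UUID = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
_NUMERIC = re.compile(r"\d+")

def route_template(path):
    """Replace ID-like path segments with placeholders and drop the query string."""
    path = path.split("?", 1)[0]
    segments = []
    for seg in path.split("/"):
        if _UUID.fullmatch(seg):
            segments.append(":uuid")
        elif _NUMERIC.fullmatch(seg):
            segments.append(":id")
        else:
            segments.append(seg)
    return "/".join(segments)

print(route_template("/users/12345"))                    # /users/:id
print(route_template("/users/12345/orders/987?page=2"))  # /users/:id/orders/:id
print(route_template("/orders/123e4567-e89b-12d3-a456-426614174000/items"))  # /orders/:uuid/items
```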

Bounded Labels: The Good List

These labels are typically safe because their value set is small and known:

  • method — GET, POST, PUT, DELETE, PATCH = 5-7 values
  • status_code (bucketed: 2xx, 3xx, 4xx, 5xx) = 4 values
  • service / app — depends on your service count
  • environment — prod, staging, dev = 3 values
  • region — AWS regions = ~30 values max
  • path (as ROUTE TEMPLATE) — depends on your API surface

When in doubt, count: if the label can take more than ~1,000 values in production, it’s a problem.
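The status_code bucketing from the list above is a one-liner in application code. This sketch uses a hypothetical helper name and adds an "other" value for codes outside 100-599, so the label can take at most six values (1xx through 5xx, plus other).

```python
def status_class(code):
    """Map an HTTP status code to a bounded label value such as '2xx'."""
    if not 100 <= code <= 599:
        return "other"
    return f"{code // 100}xx"

print(status_class(200))  # 2xx
print(status_class(404))  # 4xx
print(status_class(503))  # 5xx
print(status_class(0))    # other
```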

Finding Your Cardinality Worst Offenders

Run these PromQL queries in your environment.

Top 20 metrics by total series count:

topk(20, count by(__name__) ({__name__=~".+"}))

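Distinct values of one label, per metric (substitute the label you suspect; path is used here):

topk(20, count by(__name__) (count by(__name__, path) ({__name__=~".+", path!=""})))

The inner count collapses the series to one per metric and path value; the outer count then tallies how many distinct path values each metric carries.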

Series produced by a specific metric:

count(http_requests_total)

Run these weekly. New high-cardinality metrics show up over time as developers add labels without thinking about the multiplication factor.
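To make the weekly run less manual, the first query can be sent to the Prometheus HTTP API (/api/v1/query) from a small script. This is a minimal sketch: the base URL is a placeholder, no authentication is handled, and the function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

TOP_METRICS_QUERY = 'topk(20, count by(__name__) ({__name__=~".+"}))'

def parse_vector(payload):
    """Turn an instant-vector API response into (metric_name, series_count) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    rows = [
        (sample["metric"].get("__name__", ""), int(float(sample["value"][1])))
        for sample in payload["data"]["result"]
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)

def top_metrics(base_url):
    """Run the top-20 query against a Prometheus server (base_url is an assumption)."""
    params = urllib.parse.urlencode({"query": TOP_METRICS_QUERY})
    with urllib.request.urlopen(f"{base_url}/api/v1/query?{params}", timeout=30) as resp:
        return parse_vector(json.load(resp))

if __name__ == "__main__":
    # Placeholder address; point this at your own server.
    for name, count in top_metrics("http://localhost:9090"):
        print(f"{count:>10}  {name}")
```

Storing each week's output lets you diff it against the previous run and spot metrics whose series count jumped.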

What to Do When You Find a Problem

Three remediations, in order of impact:

1. Drop the high-cardinality label via metric_relabel_configs

In your Prometheus scrape config:

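metric_relabel_configs:
  # Clear user_id on http_requests_total only.
  # Setting a label to an empty value removes it.
  - source_labels: [__name__]
    regex: http_requests_total
    target_label: user_id
    replacement: ''
    action: replace

This removes user_id from http_requests_total before storage and leaves every other metric untouched. (A bare labeldrop rule would strip user_id from every metric in the job, and an action: keep rule on the metric name would discard all other metrics.) If user_id was the dominant label, cardinality can drop by up to 100,000x.

One caveat: relabeling does not aggregate. Series that differed only by user_id now share a label set within a single scrape, and Prometheus keeps one sample and rejects the rest as duplicates, so the stored values under-report. Use this as a stopgap and remove the label in the application's instrumentation as the durable fix.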
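To see why removing one label has such a large effect, the sketch below builds a synthetic set of series and counts the distinct label sets before and after dropping user_id. The data and the drop_label helper are invented for illustration; this models the arithmetic, not Prometheus internals.

```python
from itertools import product

def drop_label(series, label):
    """Return the distinct label sets that remain after removing one label."""
    return {
        tuple(sorted((k, v) for k, v in labels.items() if k != label))
        for labels in series
    }

# Synthetic example: 2 methods x 2 status classes x 1,000 users.
series = [
    {"method": m, "status": s, "user_id": str(u)}
    for m, s, u in product(["GET", "POST"], ["2xx", "5xx"], range(1000))
]

before = len(series)
after = len(drop_label(series, "user_id"))
print(before, after, before // after)  # 4000 4 1000
```

Note that 1,000 original series map onto each surviving label set, which is why the values need to be summed at the source, not just relabeled.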

2. Bucket continuous values

If you need response time visibility per route, don’t put the raw millisecond value in a label. Use a histogram with a fixed set of buckets:

# Instead of: http_request_duration_ms with raw values as labels
# Use (bucket bounds in seconds):
http_request_duration_seconds_bucket{le="0.005"}
http_request_duration_seconds_bucket{le="0.01"}
http_request_duration_seconds_bucket{le="0.025"}
# etc.

Then query with histogram_quantile(0.99, ...) for percentiles.
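It helps to know how a percentile comes out of a handful of buckets. The sketch below is a simplified reimplementation of the linear interpolation behind histogram_quantile for classic histograms, written from its documented behavior; it skips edge cases the real function handles, and the sample counts are invented.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    `buckets` must include the (math.inf, total_count) bucket.
    """
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan

# Invented cumulative counts for the three bucket bounds shown above.
buckets = [(0.005, 50), (0.01, 80), (0.025, 95), (math.inf, 100)]
print(histogram_quantile(0.5, buckets))   # 0.005
print(histogram_quantile(0.99, buckets))  # 0.025
```

The estimate is only as precise as the bucket layout: a p99 that lands in the +Inf bucket is capped at the largest finite bound, so choose bounds that bracket your latency targets.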

3. Drop the metric entirely

If a metric is unused (no dashboards, no alerts), drop it. Many exporters emit hundreds of metrics teams never query.

metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'go_(gc_duration_seconds|memstats_gc_cpu_fraction|memstats_other_sys_bytes)'
    action: drop

Safe Cardinality Targets

Series per metric   Status     Action
< 1,000             Healthy    None
1K-10K              Watch      Note in dashboards
10K-100K            Risk       Audit labels
100K-1M             High       Drop or bucket high-cardinality labels
> 1M                Critical   Drop the metric or relabel immediately
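These tiers are easy to apply automatically to the per-metric counts from the queries earlier in the post. In this sketch the function name is hypothetical, and the boundary handling is an assumption: each lower bound is inclusive, and exactly 1,000,000 still counts as High.

```python
def classify_metric(series_count):
    """Map a per-metric series count to (status, action) using the tiers above."""
    if series_count < 1_000:
        return ("Healthy", "None")
    if series_count < 10_000:
        return ("Watch", "Note in dashboards")
    if series_count < 100_000:
        return ("Risk", "Audit labels")
    if series_count <= 1_000_000:
        return ("High", "Drop or bucket high-cardinality labels")
    return ("Critical", "Drop the metric or relabel immediately")

print(classify_metric(5_000))      # ('Watch', 'Note in dashboards')
print(classify_metric(2_500_000))  # ('Critical', 'Drop the metric or relabel immediately')
```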

Per Prometheus instance, with sufficient memory: typically up to 10M active series, with 5-10 second query latency at that scale. Beyond that, shard across multiple instances or move to a horizontally scalable backend such as Thanos or Mimir.

The Cardinality Audit Checklist

Quarterly:

  1. Run the topk(20) queries above.
  2. For each high-cardinality metric, ask: “what dashboard or alert uses this label?” If nothing uses it, drop the label.
  3. Review every exporter and library being scraped. Default configs often emit hundreds of low-value metrics.
  4. Audit any service that added new metrics in the last quarter. New code = new cardinality risk.
  5. For managed vendors, look at your monthly billing breakdown by metric — if any single metric is 5%+ of the bill, it’s worth governance attention.

For more on metrics fundamentals (SLI/SLO/SLA, the Four Golden Signals, the difference between metrics, logs, and traces), see the Metrics & Observability Guide. For an interactive cardinality forecast tool, use the Cardinality Estimator.
