Prometheus Cardinality Is What Kills Your Bill

Short answer: Cardinality is the number of unique time series produced by a metric. Each unique combination of label values creates a new series, and your cost (memory in Prometheus, dollars in managed observability vendors) scales with total active series. A single bad label like user_id on a high-traffic metric can generate millions of series and add thousands of dollars per month to your bill. This post explains how to measure cardinality, the rules for keeping it sane, and the PromQL queries to find your worst offenders.

The Math

A metric http_requests_total with labels {method, path, status} produces one time series per unique combination of label values:

5 methods × 100 paths × 10 statuses = 5,000 series

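Multiply by 10 replicas emitting that metric and you get 50,000 series. Add a user_id label with 100,000 unique users and the worst case (every user hitting every combination on every replica) is 5 billion series. Long before that ceiling, Prometheus dies and Datadog charges you $5,000+/month.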
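The multiplication above is easy to check in a few lines of Python. This is a minimal sketch: the function name worst_case_series is made up for illustration, and the result is an upper bound, since real traffic rarely produces every possible combination of label values.

```python
from math import prod

def worst_case_series(label_cardinalities: dict[str, int], replicas: int = 1) -> int:
    """Upper bound on series: product of per-label value counts, times replicas."""
    return prod(label_cardinalities.values()) * replicas

# Numbers taken from the example in the text.
base = {"method": 5, "path": 100, "status": 10}

print(worst_case_series(base))                                        # 5000
print(worst_case_series(base, replicas=10))                           # 50000
print(worst_case_series({**base, "user_id": 100_000}, replicas=10))   # 5000000000
```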

Why Cardinality Costs Money

Two distinct cost models, both broken by high cardinality:

Self-hosted Prometheus keeps every active series in memory. Each series consumes roughly 1-3 KB of resident memory plus 1-2 bytes per sample on disk. At 1 million active series, that is 1-3 GB of RAM for the series alone, before queries add their own overhead. At 10M series you’re looking at multi-tens-of-GB Prometheus VMs and slow queries.

Managed observability vendors charge per active series per month. Approximate 2026 rates:

Grafana Cloud: $8 per 1,000 active series/month above the free tier
Datadog: $1.27 per 100 custom metrics per month (and a metric is effectively a series)
New Relic / Honeycomb / Chronosphere: similar per-series economics

One ungoverned high-cardinality metric on a large team can add $1,000-$10,000+/month.
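To turn a series count into rough numbers, the per-series figures quoted above can be plugged into a small calculator. This is a back-of-the-envelope sketch, not a billing tool: estimate_costs is a hypothetical helper, the defaults (3 KB per series, $8 per 1,000 series) are the upper-end figures from this post, and free tiers and volume discounts are ignored.

```python
def estimate_costs(active_series: int,
                   bytes_per_series: int = 3 * 1024,
                   usd_per_1000_series: float = 8.0) -> dict[str, float]:
    """Rough memory (GiB) and monthly cost (USD) for a given active series count."""
    memory_gib = active_series * bytes_per_series / 1024**3
    monthly_usd = active_series / 1000 * usd_per_1000_series
    return {"memory_gib": round(memory_gib, 2), "monthly_usd": round(monthly_usd, 2)}

print(estimate_costs(1_000_000))  # {'memory_gib': 2.86, 'monthly_usd': 8000.0}
```

Swap in your own vendor rate and measured bytes per series; the point is that both numbers grow linearly with active series.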

Labels That Should NEVER Exist on Metrics

The hard rule: a metric label’s value set must be bounded and small. These labels are unbounded by nature and belong in logs or traces, never metrics:

user_id, account_id, customer_id
request_id, trace_id, session_id
email, username
ip / client_ip
Raw URL path with IDs (use route templates: /users/:id not /users/12345)
Timestamps as labels
Hash values, UUIDs
Free-form user-input strings (search queries, error messages with variables)

If you need that detail for debugging, log it as structured log fields or attach it as a trace attribute. Metrics are for aggregates, not individual events.
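The route-template rule from the list above can be sketched as a small normalizer that runs before the path is used as a label value. This is an illustration only: route_template is a hypothetical helper, it only recognizes all-digit and UUID-shaped segments, and in a real service the matched route pattern from your web framework's router is usually the better source.

```python
import re

# Matches a canonical UUID such as 123e4567-e89b-12d3-a456-426614174000.
_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")

def route_template(path: str) -> str:
    """Replace ID-like path segments with ':id' so the label stays bounded."""
    segments = []
    for seg in path.split("/"):
        if seg.isdigit() or _UUID.match(seg):
            segments.append(":id")
        else:
            segments.append(seg)
    return "/".join(segments)

print(route_template("/users/12345"))  # /users/:id
```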

Bounded Labels: The Good List

These labels are typically safe because their value set is small and known:

method — GET, POST, PUT, DELETE, PATCH = 5-7 values
status_code (bucketed: 2xx, 3xx, 4xx, 5xx) = 4 values
service / app — depends on your service count
environment — prod, staging, dev = 3 values
region — AWS regions = ~30 values max
path (as ROUTE TEMPLATE) — depends on your API surface

When in doubt, count: if the label can take more than ~1,000 values in production, it’s a problem.
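Counting can be automated. The sketch below takes a list of label sets (for example, parsed from a scrape or an API response) and flags any label whose number of distinct values exceeds a limit. The function names and the input shape are assumptions made for illustration; the default limit of 1,000 is the rule of thumb from the text.

```python
from collections import defaultdict

def label_value_counts(series: list[dict[str, str]]) -> dict[str, int]:
    """Number of distinct values seen for each label name."""
    values: dict[str, set[str]] = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}

def unbounded_labels(series: list[dict[str, str]], limit: int = 1000) -> list[str]:
    """Label names whose distinct value count exceeds the limit."""
    return sorted(name for name, n in label_value_counts(series).items() if n > limit)

# Synthetic data: 3 methods, 1,500 distinct user IDs.
sample = [{"method": m, "user_id": str(u)}
          for m in ("GET", "POST", "PUT") for u in range(1500)]
print(unbounded_labels(sample))  # ['user_id']
```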

Finding Your Cardinality Worst Offenders

Run these PromQL queries in your environment.

Top 20 metrics by total series count:

topk(20, count by(__name__) ({__name__=~".+"}))

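Metrics with the most distinct values for one label (replace label_name with the label you suspect, such as path; metrics that lack the label report 1):

topk(20, count by(__name__) (count by(__name__, label_name) ({__name__=~".+"})))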

Series produced by a specific metric:

count(http_requests_total)

Run these weekly. New high-cardinality metrics show up over time as developers add labels without thinking about the multiplication factor.
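A weekly run is easier to keep up if it is scripted. The sketch below sends the first query to the Prometheus HTTP API (/api/v1/query) using only the standard library and returns (metric, series count) pairs. The function names are hypothetical, the server address in the usage note is an assumption, and authentication is left out.

```python
import json
import urllib.parse
import urllib.request

QUERY = 'topk(20, count by(__name__) ({__name__=~".+"}))'

def parse_vector(payload: dict) -> list[tuple[str, int]]:
    """Turn an instant-vector API response into (metric name, series count) pairs."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    rows = [(s["metric"].get("__name__", ""), int(float(s["value"][1])))
            for s in payload["data"]["result"]]
    return sorted(rows, key=lambda r: r[1], reverse=True)

def top_metrics(base_url: str) -> list[tuple[str, int]]:
    """Run QUERY against a Prometheus server reachable at base_url."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_vector(json.load(resp))
```

Calling top_metrics("http://localhost:9090") assumes a server listening on that address; diffing the output from week to week shows which metrics are growing.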

What to Do When You Find a Problem

Three remediations, in order of impact:

1. Drop the high-cardinality label via `metric_relabel_configs`

In your Prometheus scrape config:

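metric_relabel_configs:
  # Blank out user_id on http_requests_total only.
  # A label set to an empty value is removed from the series.
  - source_labels: [__name__]
    regex: http_requests_total
    target_label: user_id
    replacement: ""
    action: replace

This removes user_id from http_requests_total before storage and leaves every other metric in the job untouched. To strip the label from every metric in the job instead, a single rule with regex: user_id and action: labeldrop is enough. In the example above, cardinality drops by up to 100,000x.

Relabeling does not aggregate. If user_id was the only label distinguishing two series in a scrape, they collide once it is gone, so treat this as a stopgap and remove the label in the instrumentation code.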
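The effect of removing one label can be simulated offline before touching the scrape config. This is a minimal sketch on synthetic data: drop_label is a hypothetical helper, and it models only the resulting label sets, not how Prometheus handles samples whose label sets become identical.

```python
def drop_label(series: list[dict[str, str]], label: str) -> set[frozenset]:
    """Distinct label sets that remain after removing one label from every series."""
    return {frozenset((k, v) for k, v in s.items() if k != label) for s in series}

# Synthetic example: 2 methods x 1,000 users.
series = [{"method": m, "user_id": str(u)}
          for m in ("GET", "POST") for u in range(1000)]

before = len({frozenset(s.items()) for s in series})
after = len(drop_label(series, "user_id"))
print(before, after)  # 2000 2
```

The 1,000 series per method collapse into a single label set, which is where the cardinality saving comes from.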

2. Bucket continuous values

If you need response time visibility per route, don’t put the measured value in a label. Use a histogram with fixed buckets:

# Instead of: http_request_duration_ms with raw values as labels
# Use (bucket bounds in seconds):
http_request_duration_seconds_bucket{le="0.005"}
http_request_duration_seconds_bucket{le="0.01"}
http_request_duration_seconds_bucket{le="0.025"}
# etc.

Then query with histogram_quantile(0.99, ...) for percentiles.
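To see why a histogram keeps cardinality fixed, here is a small sketch of cumulative bucketing in plain Python. It is an illustration of the idea, not the client library's implementation; cumulative_buckets is a hypothetical name, and the bounds beyond the three shown above are illustrative values chosen for this example.

```python
from bisect import bisect_left

# Upper bounds in seconds; an implicit +Inf bucket catches everything larger.
BOUNDS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]

def cumulative_buckets(durations: list[float],
                       bounds: list[float] = BOUNDS) -> dict[str, int]:
    """Count observations per 'le' bucket, cumulatively, as histograms expose them."""
    counts = [0] * (len(bounds) + 1)
    for d in durations:
        # Index of the first bound >= d; len(bounds) means the +Inf bucket.
        counts[bisect_left(bounds, d)] += 1
    result, running = {}, 0
    for le, c in zip([str(b) for b in bounds] + ["+Inf"], counts):
        running += c
        result[le] = running
    return result

buckets = cumulative_buckets([0.003, 0.005, 0.02, 2.0])
print(len(buckets), buckets["0.025"], buckets["+Inf"])  # 9 3 4
```

However many requests are observed, the output always has one entry per bound plus +Inf, so the series count per label combination stays constant.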

3. Drop the metric entirely

If a metric is unused (no dashboards, no alerts), drop it. Many exporters emit hundreds of metrics teams never query.

metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'go_(gc_duration_seconds|memstats_gc_cpu_fraction|memstats_other_sys_bytes)'
    action: drop
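Before deploying a drop rule, it is worth checking which names it actually matches. Prometheus anchors relabel regexes at both ends, which re.fullmatch reproduces in the sketch below. The helper name would_drop is hypothetical, and Python's re module stands in for the RE2 engine Prometheus uses, which behaves the same for a simple alternation like this one.

```python
import re

DROP = re.compile(
    r"go_(gc_duration_seconds|memstats_gc_cpu_fraction|memstats_other_sys_bytes)"
)

def would_drop(metric_name: str) -> bool:
    """True if the fully anchored drop regex matches this metric name."""
    return DROP.fullmatch(metric_name) is not None

for name in ("go_gc_duration_seconds",
             "go_gc_duration_seconds_count",
             "go_goroutines"):
    print(name, would_drop(name))
```

Because of the anchoring, the _sum and _count series of a summary such as go_gc_duration_seconds are not matched by this pattern; extend the regex if those should go too.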

Safe Cardinality Targets

Series per metric	Status	Action
< 1K	Healthy	None
1K-10K	Watch	Note in dashboards
10K-100K	Risk	Audit labels
100K-1M	High	Drop or bucket high-cardinality labels
> 1M	Critical	Drop the metric or relabel immediately

Per Prometheus instance, with sufficient memory: typically up to 10M active series, with 5-10 second query latency at that scale. Beyond that, you need to shard across several Prometheus instances behind Thanos, or move to Mimir.
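The per-metric thresholds in the table above can be encoded so that a script can label the output of the topk queries. This is a sketch; cardinality_status is a hypothetical helper, and how the exact boundary values (10,000, 100,000, 1,000,000) are assigned is an assumption, since the table's ranges touch at their edges.

```python
def cardinality_status(series_count: int) -> tuple[str, str]:
    """Map a per-metric series count to (status, action) from the targets table."""
    if series_count < 1_000:
        return ("Healthy", "None")
    if series_count < 10_000:
        return ("Watch", "Note in dashboards")
    if series_count < 100_000:
        return ("Risk", "Audit labels")
    if series_count <= 1_000_000:
        return ("High", "Drop or bucket high-cardinality labels")
    return ("Critical", "Drop the metric or relabel immediately")

print(cardinality_status(50_000))  # ('Risk', 'Audit labels')
```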

The Cardinality Audit Checklist

Quarterly:

Run the topk(20) queries above.
For each high-cardinality metric, ask: “what dashboard or alert uses this label?” If nothing uses it, drop the label.
Review every exporter and library being scraped. Default configs often emit hundreds of low-value metrics.
Audit any service that added new metrics in the last quarter. New code = new cardinality risk.
For managed vendors, look at your monthly billing breakdown by metric — if any single metric is 5%+ of the bill, it’s worth governance attention.
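The 5% check in the last item can be approximated from series counts when a per-metric billing breakdown is not available. The sketch below assumes cost is proportional to series count, which will not hold exactly for every vendor; heavy_metrics and the sample numbers are made up for illustration.

```python
def heavy_metrics(series_by_metric: dict[str, int],
                  share: float = 0.05) -> list[tuple[str, float]]:
    """Metrics whose share of total series is at least `share`, largest first."""
    total = sum(series_by_metric.values())
    if total == 0:
        return []
    rows = [(name, count / total)
            for name, count in series_by_metric.items()
            if count / total >= share]
    return sorted(rows, key=lambda r: r[1], reverse=True)

# Hypothetical series counts per metric.
counts = {"http_requests_total": 900, "queue_depth": 60, "up": 40}
for name, fraction in heavy_metrics(counts):
    print(f"{name}: {fraction:.0%}")
```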

For more on metrics fundamentals (SLI/SLO/SLA, the Four Golden Signals, the difference between metrics, logs, and traces), see the Metrics & Observability Guide. For an interactive cardinality forecast tool, use the Cardinality Estimator.
