Short answer: Cardinality is the number of unique time series produced by a metric. Each unique combination of label values creates a new series, and your cost (memory in Prometheus, dollars with managed observability vendors) scales with total active series. A single bad label like user_id on a high-traffic metric can generate millions of series and add thousands of dollars per month to your bill. This post explains how to measure cardinality, the rules for keeping it sane, and the PromQL queries to find your worst offenders.
The Math
A metric http_requests_total with labels {method, path, status} produces one time series per unique combination of label values:
    5 methods × 100 paths × 10 statuses = 5,000 series

Multiply by 10 replicas emitting that metric and you have 50,000 series. Add a user_id label with 100,000 unique users and the worst case is 5 billion series. Prometheus dies. Datadog charges you $5,000+/month.
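The multiplication above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `worst_case_series` is made up for illustration, and the result is an upper bound, since real traffic rarely hits every combination of label values.

```python
from math import prod


def worst_case_series(label_cardinalities: dict[str, int], replicas: int = 1) -> int:
    """Upper bound on series: product of per-label value counts, times replicas."""
    return prod(label_cardinalities.values()) * replicas


# Label value counts taken from the example in the text.
base = {"method": 5, "path": 100, "status": 10}

print(worst_case_series(base))                                       # 5000
print(worst_case_series(base, replicas=10))                          # 50000
print(worst_case_series({**base, "user_id": 100_000}, replicas=10))  # 5000000000
```

Plugging in your own label counts before shipping a new metric is a cheap way to see the multiplication factor in advance.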
Why Cardinality Costs Money
Two distinct cost models, both broken by high cardinality:
Self-hosted Prometheus keeps every active series in memory. Each series consumes roughly 1-3 KB of resident memory plus 1-2 bytes per sample on disk. At 1 million active series, that is roughly 1-3 GB of RAM for the in-memory series data alone. At 10M series you’re looking at Prometheus VMs with tens of GB of RAM and slow queries.
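As a rough back-of-the-envelope check, the per-series figure quoted above can be turned into a memory range. This sketch assumes the 1-3 KB per series estimate from the text and decimal units; the function name `head_memory_gb` is hypothetical, and actual usage depends on label sizes, churn, and Prometheus version.

```python
def head_memory_gb(active_series: int, kb_per_series: float) -> float:
    """Estimated resident memory in GB (decimal units: 1 GB = 1,000,000 KB)."""
    return active_series * kb_per_series / 1_000_000


for series in (1_000_000, 10_000_000):
    low = head_memory_gb(series, 1.0)   # optimistic: 1 KB per series
    high = head_memory_gb(series, 3.0)  # pessimistic: 3 KB per series
    print(f"{series:,} series: {low:.0f}-{high:.0f} GB")
```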
Managed observability vendors charge per active series per month. Approximate 2026 rates:
- Grafana Cloud: $8 per 1,000 active series/month above the free tier
- Datadog: $1.27 per 100 custom metrics per month (and a metric is effectively a series)
- New Relic / Honeycomb / Chronosphere: similar per-series economics
One ungoverned high-cardinality metric on a large team can add $1,000-$10,000+/month.
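To see how per-series pricing adds up, here is a small sketch that applies the approximate rates listed above to the 50,000-series example. It ignores free tiers, volume discounts, and plan differences, so treat the output as an order-of-magnitude estimate; `monthly_cost` is an illustrative helper, not a vendor API.

```python
def monthly_cost(active_series: int, price: float, per_series: int) -> float:
    """Monthly cost in dollars when a vendor charges `price` per `per_series` series."""
    return active_series / per_series * price


# Approximate rates quoted in the text; free tiers and discounts are ignored.
RATES = {
    "Grafana Cloud": (8.00, 1_000),
    "Datadog": (1.27, 100),
}

for vendor, (price, per_series) in RATES.items():
    cost = monthly_cost(50_000, price, per_series)
    print(f"{vendor}: ${cost:,.2f}/month for 50,000 series")
```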
Labels That Should NEVER Exist on Metrics
The hard rule: a metric label’s value set must be bounded and small. These labels are unbounded by nature and belong in logs or traces, never metrics:
- user_id, account_id, customer_id
- request_id, trace_id, session_id
- email, username
- ip / client_ip
- Raw URL path with IDs (use route templates: /users/:id not /users/12345)
- Timestamps as labels
- Hash values, UUIDs
- Free-form user-input strings (search queries, error messages with variables)
If you need that detail for debugging, log it as structured log fields or attach it as a trace attribute. Metrics are for aggregates, not individual events.
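One way to apply the route-template rule is to normalize the path before using it as a label value. The sketch below is an assumption-laden fallback: it treats purely numeric segments and UUID-shaped segments as IDs, and `route_template` is a hypothetical helper. If your web framework exposes the matched route pattern, prefer that over regex guessing.

```python
import re

# Assumed heuristics for "this path segment is an ID".
_NUMERIC = re.compile(r"^\d+$")
_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def route_template(path: str) -> str:
    """Replace ID-like path segments with ':id' so the label stays bounded."""
    segments = path.split("/")
    normalized = [
        ":id" if _NUMERIC.match(segment) or _UUID.match(segment) else segment
        for segment in segments
    ]
    return "/".join(normalized)


print(route_template("/users/12345"))             # /users/:id
print(route_template("/users/12345/orders/987"))  # /users/:id/orders/:id
```

Query strings are not handled here; strip them before normalizing.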
Bounded Labels: The Good List
These labels are typically safe because their value set is small and known:
- method: GET, POST, PUT, DELETE, PATCH = 5-7 values
- status_code (bucketed: 2xx, 3xx, 4xx, 5xx) = 4 values
- service / app: depends on your service count
- environment: prod, staging, dev = 3 values
- region: AWS regions = ~30 values max
- path (as ROUTE TEMPLATE): depends on your API surface
When in doubt, count: if the label can take more than ~1,000 values in production, it’s a problem.
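Keeping a label bounded usually comes down to mapping raw values onto a small, known set before they reach the metric. The following sketch shows two such mappings; `status_class` and `bounded_label` are illustrative names, and the "other" fallback value is an assumption rather than a convention from any library.

```python
def status_class(status_code: int) -> str:
    """Bucket an HTTP status code into its class, e.g. 404 -> '4xx'."""
    return f"{status_code // 100}xx"


def bounded_label(value: str, allowed: set[str], fallback: str = "other") -> str:
    """Return the value only if it is in the allowlist; otherwise a fixed fallback."""
    return value if value in allowed else fallback


METHODS = {"GET", "POST", "PUT", "DELETE", "PATCH"}

print(status_class(404))                  # 4xx
print(bounded_label("GET", METHODS))      # GET
print(bounded_label("PROPFIND", METHODS)) # other
```

With an allowlist, the label can never take more values than the list has entries plus one, no matter what clients send.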
Finding Your Cardinality Worst Offenders
Run these PromQL queries in your environment.
Top 20 metrics by total series count:
    topk(20, count by (__name__) ({__name__=~".+"}))

Distinct values of one label per metric (label_name is a placeholder; replace it with a real label such as path):

    topk(20, count by (__name__) (count by (__name__, label_name) ({__name__=~".+"})))

Series produced by a specific metric:

    count(http_requests_total)

Run these weekly. New high-cardinality metrics show up over time as developers add labels without thinking about the multiplication factor.
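If you want to run the series-count query on a schedule, it can be sent to the Prometheus HTTP API (`/api/v1/query`) and the JSON result parsed into a sorted list. This is a sketch using only the standard library; the base URL `http://localhost:9090` in the usage note is an assumption about your setup, and the helper names are made up.

```python
import json
import urllib.parse
import urllib.request

QUERY = 'topk(20, count by (__name__) ({__name__=~".+"}))'


def fetch_instant_query(base_url: str, query: str) -> dict:
    """Run an instant query against the Prometheus HTTP API and return the JSON body."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def series_counts(response: dict) -> list[tuple[str, int]]:
    """Extract (metric name, series count) pairs, largest first."""
    rows = [
        (sample["metric"].get("__name__", "<unnamed>"), int(float(sample["value"][1])))
        for sample in response["data"]["result"]
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Usage, assuming Prometheus listens on localhost:9090: `series_counts(fetch_instant_query("http://localhost:9090", QUERY))`. Storing the output each week makes it easy to spot metrics whose series count is growing.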
What to Do When You Find a Problem
Three remediations, in order of impact:
1. Drop the high-cardinality label via metric_relabel_configs
In your Prometheus scrape config:
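    metric_relabel_configs:
      # Scoped to one metric: setting a label to the empty string removes it.
      - source_labels: [__name__]
        regex: http_requests_total
        target_label: user_id
        replacement: ""

This removes user_id from http_requests_total before storage, cutting cardinality by up to 100,000x. To strip the label from every metric in the scrape job, use a rule with regex: user_id and action: labeldrop instead. Do not pair it with action: keep on __name__, which discards every metric that does not match. One caveat: series that differ only by user_id collide once the label is gone, and Prometheus typically keeps only one of the colliding samples per scrape, so the durable fix is to remove the label in the instrumentation code.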
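The effect of dropping a label can be illustrated by counting distinct label sets before and after. This is a toy simulation in Python, not Prometheus code: the label values are invented, and `drop_label` is a hypothetical helper that mimics what removing one label does to series identity.

```python
def drop_label(series: list[dict[str, str]], label: str) -> set[frozenset[tuple[str, str]]]:
    """Return the distinct label sets that remain after removing one label."""
    return {
        frozenset((name, value) for name, value in labels.items() if name != label)
        for labels in series
    }


# Invented example: 2 methods x 2 status classes x 1,000 users.
series = [
    {"method": method, "status": status, "user_id": f"u{user}"}
    for method in ("GET", "POST")
    for status in ("2xx", "5xx")
    for user in range(1000)
]

print(len(series))                         # 4000
print(len(drop_label(series, "user_id")))  # 4
```

The 4,000 original series collapse onto 4 label sets, which is also why the values of the collapsed series need to be aggregated at the source rather than simply relabeled.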
2. Bucket continuous values
If you need response time visibility per route, don’t tag the millisecond — use a histogram with buckets:
    # Instead of: http_request_duration_ms with raw values as labels
    # Use:
    http_request_duration_bucket{le="0.005"}
    http_request_duration_bucket{le="0.01"}
    http_request_duration_bucket{le="0.025"}
    # etc.

Then query with histogram_quantile(0.99, ...) for percentiles.
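To show why a histogram keeps cardinality fixed, here is a minimal Python sketch of cumulative `le` buckets. The first three bounds come from the example above; the remaining bounds and the function name `cumulative_buckets` are assumptions for illustration. Real client libraries do this for you.

```python
from bisect import bisect_left

# Upper bounds in seconds. The first three are from the text; the rest are assumed.
BUCKETS = (0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0)


def cumulative_buckets(observations: list[float], bounds: tuple[float, ...] = BUCKETS) -> dict[str, int]:
    """Count observations into cumulative 'le' buckets, including +Inf."""
    counts = [0] * (len(bounds) + 1)  # final slot is the +Inf bucket
    for value in observations:
        # First bound that is >= value, matching the inclusive 'le' semantics.
        counts[bisect_left(bounds, value)] += 1

    result: dict[str, int] = {}
    running = 0
    for label, count in zip([str(b) for b in bounds] + ["+Inf"], counts):
        running += count
        result[label] = running
    return result


buckets = cumulative_buckets([0.003, 0.007, 0.02, 0.2, 2.0])
print(len(buckets))     # 9 series, no matter how many requests are observed
print(buckets["0.25"])  # 4
print(buckets["+Inf"])  # 5
```

However many distinct durations arrive, the number of series stays at the number of buckets plus one.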
3. Drop the metric entirely
If a metric is unused (no dashboards, no alerts), drop it. Many exporters emit hundreds of metrics teams never query.
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_(gc_duration_seconds|memstats_gc_cpu_fraction|memstats_other_sys_bytes)'
        action: drop

Safe Cardinality Targets
| Series per metric | Status | Action |
|---|---|---|
| < 1,000 | Healthy | None |
| 1K-10K | Watch | Note in dashboards |
| 10K-100K | Risk | Audit labels |
| 100K-1M | High | Drop or bucket high-card labels |
| > 1M | Critical | Drop the metric or relabel immediately |
Per Prometheus instance, with sufficient memory: typically up to 10M active series, with 5-10 second query latency at that scale. Beyond that you need Thanos or Mimir.
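The per-metric tiers in the table can be encoded as a small lookup for use in an audit script. This sketch assumes each range includes its lower bound and excludes its upper bound (the table leaves the boundaries ambiguous), and `cardinality_status` is an illustrative name.

```python
# (exclusive upper bound, status, action), taken from the table above.
TIERS = [
    (1_000, "Healthy", "None"),
    (10_000, "Watch", "Note in dashboards"),
    (100_000, "Risk", "Audit labels"),
    (1_000_000, "High", "Drop or bucket high-card labels"),
]


def cardinality_status(series_count: int) -> tuple[str, str]:
    """Map a per-metric series count to (status, action)."""
    for upper_bound, status, action in TIERS:
        if series_count < upper_bound:
            return status, action
    return "Critical", "Drop the metric or relabel immediately"


print(cardinality_status(50_000))  # ('Risk', 'Audit labels')
```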
The Cardinality Audit Checklist
Quarterly:
- Run the topk(20) queries above.
- For each high-cardinality metric, ask: “what dashboard or alert uses this label?” If nothing uses it, drop the label.
- Review every exporter and library being scraped. Default configs often emit hundreds of low-value metrics.
- Audit any service that added new metrics in the last quarter. New code = new cardinality risk.
- For managed vendors, look at your monthly billing breakdown by metric — if any single metric is 5%+ of the bill, it’s worth governance attention.
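The 5% billing rule in the last checklist item is easy to automate once you have a per-metric cost export. The sketch below assumes a plain mapping of metric name to monthly cost; the input format, the sample figures, and the name `metrics_over_share` are all hypothetical, since each vendor exposes billing data differently.

```python
def metrics_over_share(
    cost_by_metric: dict[str, float], threshold: float = 0.05
) -> list[tuple[str, float]]:
    """Return (metric, share of total bill) for metrics at or above the threshold."""
    total = sum(cost_by_metric.values())
    if total == 0:
        return []
    flagged = [
        (name, cost / total)
        for name, cost in cost_by_metric.items()
        if cost / total >= threshold
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Invented monthly costs in dollars, for illustration only.
bill = {"http_requests_total": 900.0, "queue_depth": 60.0, "cache_hits": 40.0}
for name, share in metrics_over_share(bill):
    print(f"{name}: {share:.0%} of the bill")
```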
For more on metrics fundamentals (SLI/SLO/SLA, the Four Golden Signals, the difference between metrics, logs, and traces), see the Metrics & Observability Guide. For an interactive cardinality forecast tool, use the Cardinality Estimator.