Distributed Tracing Guide: OpenTelemetry, Spans, Sampling (2026)

Distributed tracing tells you where time goes when a single request crosses many services. In a monolith you can read a stack trace; in a 30-service microservice graph, tracing is what makes "this endpoint is slow" debuggable.

This guide covers the model (spans, traces, context propagation), the standard (OpenTelemetry), and the single biggest cost lever (sampling).

What is Distributed Tracing?

A trace is the recorded path of one request through your system. Each unit of work along that path is a span. Spans nest: a parent span has child spans for the work it dispatched. The output looks like a flame graph — wide for the slow operations, narrow for the fast ones.

The value: in a microservice architecture, "the API is slow" could mean the API itself, or any of 5-50 downstream services it calls. A trace tells you which one — and how much time was spent waiting, computing, or in the network.
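To make the "which one" question concrete, here is a minimal sketch of a trace as a tree of spans. The `Span` class, its field names, and the service names are hypothetical illustrations, not OpenTelemetry types, and the self-time calculation assumes child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """Illustrative span record (hypothetical, not an OTel type)."""
    name: str
    start_ms: float
    end_ms: float
    children: List["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms

    @property
    def self_ms(self) -> float:
        # Time not covered by any direct child. Assumes children do not
        # overlap; parallel children would need interval merging instead.
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def slowest_self_time(root: Span) -> Span:
    """Walk the tree and return the span with the largest self time."""
    best = root
    stack = [root]
    while stack:
        span = stack.pop()
        if span.self_ms > best.self_ms:
            best = span
        stack.extend(span.children)
    return best


# A made-up request: the gateway calls auth, then orders, which queries a DB.
db = Span("orders-db SELECT", 40, 190)
auth = Span("auth-service /verify", 5, 35)
orders = Span("orders-service /list", 38, 200, children=[db])
root = Span("api-gateway GET /orders", 0, 210, children=[auth, orders])

culprit = slowest_self_time(root)
print(culprit.name, culprit.self_ms)
```

In this made-up trace the gateway looks slow at 210 ms end to end, but almost all of that time sits in the database span, which is the kind of answer a trace view surfaces at a glance.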

Spans, Traces, and Context Propagation

Trace ID: a unique 128-bit identifier (written as 32 hex characters) shared by every span in a single request flow.
Span ID: a 64-bit identifier (16 hex characters), unique per span. Each span carries a trace ID, its own span ID, and, unless it is the root span, a parent span ID.
Context propagation: the mechanism that carries the trace ID, the calling span's ID, and the sampling flag from one service to the next. Over HTTP this travels in the traceparent header (W3C Trace Context standard).
Attributes: arbitrary key-value pairs on a span (service name, HTTP method, status code, query string, etc.).
Events: timestamped annotations within a span (think: log.info() attached to the span timeline).
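As a rough illustration of context propagation, the sketch below parses an incoming traceparent header and builds the header for an outgoing call. It handles only the version-00 layout of W3C Trace Context (version, trace ID, parent span ID, flags, separated by hyphens). The names `TraceContext`, `parse_traceparent`, and `child_traceparent` are made up for this example; in practice an OpenTelemetry SDK propagator does this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional, Tuple

# version-00 layout: 2 hex version, 32 hex trace ID, 16 hex span ID, 2 hex flags
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str        # shared by every span in the request
    parent_span_id: str  # the caller's span ID
    sampled: bool        # lowest bit of the flags byte


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the trace context, or None if the header is unusable."""
    m = _TRACEPARENT_RE.match(header.strip())
    if m is None:
        return None
    version, trace_id, span_id, flags = m.groups()
    if version != "00":
        return None  # this sketch deliberately ignores other versions
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero IDs are invalid per the spec
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> Tuple[str, str]:
    """Create a span ID for this service and the header for its outgoing call."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if ctx.sampled else "00"
    return span_id, f"00-{ctx.trace_id}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
my_span_id, outgoing = child_traceparent(ctx)
print(outgoing)
```

The trace ID stays constant across hops while the span ID field is replaced at each one, which is what lets a backend stitch the spans back into a parent-child tree.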

OpenTelemetry — The Standard

OpenTelemetry (OTel) is the open-source, vendor-neutral standard for emitting telemetry. It was formed by merging the older OpenTracing and OpenCensus projects and supersedes both. It is a CNCF project and is supported by most major observability vendors.

Three components:

SDKs — language-specific libraries that you import in your code. Auto-instrumentation packages exist for most frameworks (HTTP clients, DB drivers, message queues) so you get traces without writing code.
OTel Collector — a standalone agent that receives telemetry from your apps, processes it (sampling, batching, attribute filtering), and exports it to one or more backends.
Protocol (OTLP) — the wire format for sending telemetry. gRPC or HTTP.

The benefit of OTel over vendor-specific SDKs: you can switch from Datadog to Honeycomb to self-hosted Tempo by changing one Collector config — your application code doesn't change.

Sampling — The Cost Lever

Sampling decides which traces to keep. The two main strategies:

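Head-based sampling: decide at the start of the trace (in the first service) whether to record it, and propagate that decision downstream. Simple and cheap, but the decision is made before the outcome is known, so you might drop a slow or errored trace you would have wanted to see.
Tail-based sampling: buffer all spans of a trace, then decide once the trace has finished whether to keep it. Lets you keep 100% of errors and slow requests while dropping 99% of fast successful ones. Requires more memory in the Collector, and every span of a given trace has to reach the same Collector instance for the decision to see the whole trace.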
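The two strategies can be sketched in a few lines. This is an illustration of the decision logic only, not the OpenTelemetry SDK's sampler or the Collector's tail-sampling processor. The `FinishedSpan` type, the 500 ms slow threshold, and the 1% baseline are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List


def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Head-based: decide from the trace ID alone, before any work happens."""
    # Treat the low 64 bits of the trace ID as a uniform random number, so
    # every service that sees the same trace ID reaches the same decision.
    low64 = int(trace_id_hex[-16:], 16)
    return low64 < int(ratio * (1 << 64))


@dataclass
class FinishedSpan:
    """Hypothetical record of a completed span."""
    duration_ms: float
    is_error: bool


def tail_sample(
    trace_id_hex: str,
    spans: List[FinishedSpan],
    slow_ms: float = 500.0,        # assumed threshold for "slow"
    baseline_ratio: float = 0.01,  # assumed keep rate for ordinary traces
) -> bool:
    """Tail-based: decide after the whole trace has been buffered."""
    if any(s.is_error for s in spans):
        return True  # keep every trace that contains an error
    if max((s.duration_ms for s in spans), default=0.0) >= slow_ms:
        return True  # keep every trace that contains a slow span
    # Fast and successful: keep only a small baseline share.
    return head_sample(trace_id_hex, baseline_ratio)


boring_id = "4bf92f3577b34da6" + "f" * 16  # low 64 bits at the maximum value
fast_ok = [FinishedSpan(12.0, False), FinishedSpan(30.0, False)]
with_error = fast_ok + [FinishedSpan(8.0, True)]
with_slow = fast_ok + [FinishedSpan(900.0, False)]

print(tail_sample(boring_id, fast_ok))
print(tail_sample(boring_id, with_error))
print(tail_sample(boring_id, with_slow))
```

The head-based function never sees durations or errors, which is exactly why it can drop the trace you wanted. The tail-based function sees them but only after buffering every span.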

Rough sampling targets by traffic:

Traffic	Recommended sampling
< 100 RPS	100% (don't sample)
100-1,000 RPS	10-25% head-based, OR 100% with tail-based
1,000-10,000 RPS	1-5% head-based, OR tail-based keeping errors + slow
> 10,000 RPS	0.1-1% head-based, OR aggressive tail sampling

When to Reach for Tracing (vs Metrics or Logs)

"Is the service healthy?" → metrics.
"Why did this specific request fail?" → logs (or a single trace).
"Where is time being spent across our microservices?" → traces.
"Which downstream service is slow today?" → traces (or service-level latency metrics if you have them).
"Did the new release add latency?" → traces + metrics (compare p99 before/after).

Distributed Tracing FAQ

What is distributed tracing?

Distributed tracing records the path of a single request as it traverses multiple services. Each step (a span) records timing and metadata; the full set of spans for one request is a trace. The output answers "where did time go on this request?" — the question metrics and logs can't answer cleanly in a microservice architecture.

What is OpenTelemetry?

OpenTelemetry (OTel) is the open-source standard for emitting traces, metrics, and logs — backed by CNCF. It's a set of SDKs (one per language) plus the OTel Collector, a vendor-neutral agent that receives, processes, and exports telemetry to your backend (Jaeger, Tempo, Datadog, Honeycomb, etc.). Use OTel so you can swap backends without rewriting instrumentation.

What sampling rate should I use?

For development: 100%. For low-traffic production (< 100 RPS): 100%. For 100-1,000 RPS: 10-25% head-based. For 1,000-10,000 RPS: 1-5% head-based. Above 10,000 RPS: 0.1-1% head-based. At any of these tiers, tail-based sampling is the alternative: keep 100% of error traces and slow traces while dropping most fast successful ones. The exact rate depends on your tracing backend's cost per span.

Tracing vs APM — what is the difference?

APM (Application Performance Monitoring) is a product category that bundles tracing, metrics, error tracking, and sometimes profiling. Distributed tracing is the underlying technique. New Relic, Datadog, and AppDynamics are APM products that include tracing. Honeycomb, Tempo, and Jaeger are tracing-first tools.

How much does distributed tracing cost?

Vendor tracing typically costs $0.50-$2.50 per million spans ingested. At 100 RPS with 10 spans per request and 100% sampling, that's 100 x 10 x 86,400 seconds = 86.4M spans/day, or roughly $43-216/day. Sampling at 5% cuts that 20x. Self-hosted backends such as Tempo on S3-style object storage, or Jaeger, can be around 10x cheaper at scale.
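The arithmetic above is easy to rerun for your own traffic. The sketch below is a back-of-the-envelope estimate only: the function name and parameters are made up for this example, and it assumes a flat per-span price with a uniform sampling rate (no tail-sampling rules, free tiers, or retention charges).

```python
SECONDS_PER_DAY = 86_400


def daily_span_cost(
    rps: float,
    spans_per_request: float,
    sample_rate: float,
    usd_per_million_spans: float,
) -> float:
    """Estimated ingest cost per day in USD under a flat per-span price."""
    spans_per_day = rps * spans_per_request * SECONDS_PER_DAY * sample_rate
    return spans_per_day / 1_000_000 * usd_per_million_spans


# The example from the text: 100 RPS, 10 spans per request,
# priced at $0.50 and $2.50 per million spans.
for rate in (1.0, 0.05):
    low = daily_span_cost(100, 10, rate, 0.50)
    high = daily_span_cost(100, 10, rate, 2.50)
    print(f"{rate:.0%} sampling: ${low:,.2f} to ${high:,.2f} per day")
```

With the unrounded 86.4M spans/day this comes to roughly $43 to $216 per day at 100% sampling and about $2 to $11 per day at 5%.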