A dashboard tells you the p99 latency on the checkout endpoint just doubled. A trace tells you which of the eleven downstream calls caused it, on which pod, with which customer ID in the baggage. Distributed tracing is the only pillar that reconstructs a single request’s path across service boundaries — and it is worthless if context drops on the first hop or if your spans are named so badly the backend can’t aggregate them. This is the end-to-end wiring: propagation, span design, backend choice, and the links that turn three disconnected pillars into one investigation.
1. What a trace actually is
A span is one unit of work with a start time, a duration, a name, a set of key-value attributes, a status, and zero or more events. A trace is a tree of spans that share a single 16-byte trace_id. Each span carries an 8-byte span_id and a parent_span_id pointing at the span that caused it. The root span has no parent.
trace_id = 4bf92f3577b34da6a3ce929d0e0e4736
[HTTP GET /checkout] span A (root, parent=none)
  [validate-cart] span B (parent=A)
  [POST inventory.Reserve] span C (parent=A) <-- client span
    [grpc inventory.Reserve] span D (parent=C) <-- server span, other process
      [SELECT FROM stock] span E (parent=D)
  [POST payments.Charge] span F (parent=A)
The tuple that travels between processes — trace_id, span_id, trace flags (the sampled bit), and trace state — is the span context. Propagation is nothing more than serializing the active span’s context into the outbound request and deserializing it on the other side so the remote span sets its parent_span_id correctly. Drop that step on any hop and the trace splits into two disconnected trees.
Spans C and D above are the crux: the client (caller) and server (callee) each create a span, in different processes, and D’s parent is C only because C’s context rode across the wire.
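To make the parent-pointer model concrete, here is a minimal sketch in plain Python (no tracing library) that rebuilds the tree above from flat span records. The `Span` class and `render_tree` helper are hypothetical names used only for illustration. It also shows what a dropped context looks like: when D loses its pointer to C, the trace renders as two roots.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_span_id: Optional[str]
    name: str


def render_tree(spans: list[Span]) -> list[str]:
    """Return one indented line per span, depth-first from each root."""
    known_ids = {s.span_id for s in spans}
    children: dict[Optional[str], list[Span]] = {}
    for s in spans:
        # A span whose parent is missing from the batch is treated as a root.
        parent = s.parent_span_id if s.parent_span_id in known_ids else None
        children.setdefault(parent, []).append(s)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for child in children.get(parent_id, []):
            lines.append("  " * depth + child.name)
            walk(child.span_id, depth + 1)

    walk(None, 0)
    return lines


spans = [
    Span("A", None, "HTTP GET /checkout"),
    Span("B", "A", "validate-cart"),
    Span("C", "A", "POST inventory.Reserve"),
    Span("D", "C", "grpc inventory.Reserve"),
    Span("E", "D", "SELECT FROM stock"),
    Span("F", "A", "POST payments.Charge"),
]
tree = render_tree(spans)

# Simulate a dropped context: D never learns that C is its parent.
broken = [Span("D", None, s.name) if s.span_id == "D" else s for s in spans]
roots_when_broken = [ln for ln in render_tree(broken) if not ln.startswith(" ")]
```

With the intact input, D sits two levels under the root; with the broken input, `roots_when_broken` contains both the checkout span and the inventory server span, which is the "two disconnected trees" symptom.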
2. Context propagation across the wire
The standard is W3C Trace Context, two HTTP headers. The exact format matters because every language and proxy parses it byte-for-byte:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^^ ^------------------------------^ ^--------------^ ^^
            ver trace-id (16B)                   parent-id (8B)   flags
tracestate: vendorA=opaqueValue,vendorB=opaqueValue
The trailing 01 flag is the sampled bit: 01 means recorded, 00 means not. tracestate carries vendor-specific data and must be propagated verbatim. There is also baggage — the W3C baggage header — which carries arbitrary key-value pairs (e.g. tenant.id=acme, enduser.role=admin) to every downstream service. Baggage is powerful and dangerous: it travels on every hop, so keep it tiny and never put secrets or PII in it, because it leaks to every service and often into logs.
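As an illustration of how strict the header format is, the following sketch parses a version-00 `traceparent` value using only the standard library. The `parse_traceparent` function and `ParsedTraceparent` type are hypothetical names, not part of any SDK, and the sketch deliberately ignores the forward-compatibility rules for future versions.

```python
import re
from typing import NamedTuple, Optional

# version(2 hex) - trace-id(32 hex) - parent-id(16 hex) - flags(2 hex), lowercase only
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class ParsedTraceparent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[ParsedTraceparent]:
    """Return the parsed fields, or None if the header should be ignored."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    if version == "ff":
        return None  # version ff is invalid
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    # The sampled flag is the least significant bit of the flags byte.
    return ParsedTraceparent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
unsampled = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-00")
invalid = parse_traceparent("00-" + "0" * 32 + "-00f067aa0ba902b7-01")
```

A receiver that gets `None` here starts a fresh trace, which is why one proxy that rewrites or truncates the header is enough to split a trace in two.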
In OpenTelemetry, set a global composite propagator once at startup. This is the single most common silent failure — SDKs do not all enable W3C by default.
# Python: enable W3C trace context + baggage globally
from opentelemetry.propagate import set_global_textmap
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
from opentelemetry.baggage.propagation import W3CBaggagePropagator
set_global_textmap(CompositePropagator([
    TraceContextTextMapPropagator(),
    W3CBaggagePropagator(),
]))
// Go: same idea, set once in main()
import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
	propagation.TraceContext{},
	propagation.Baggage{},
))
For synchronous HTTP/gRPC, the instrumentation libraries inject and extract these headers automatically once the propagator is set. Async is where traces break. When you publish to Kafka, SQS, RabbitMQ, or a cloud bus, the HTTP context is gone — you must serialize traceparent into the message and re-extract it in the consumer.
# Producer: inject context into message headers/attributes
from opentelemetry.propagate import inject
carrier = {}
inject(carrier) # writes traceparent/baggage into the dict
message.headers = [(k, v.encode()) for k, v in carrier.items()]
# Consumer: extract context, then start the span as a child
from opentelemetry.propagate import extract
from opentelemetry import trace
carrier = {k: v.decode() for k, v in message.headers}
ctx = extract(carrier)
with trace.get_tracer(__name__).start_as_current_span(
        "process order", context=ctx,
        kind=trace.SpanKind.CONSUMER):
    handle(message)
Fan-out gotcha: when one message is consumed by many workers, or batched, the parent-child relationship gets ambiguous. Use span links (not parent-child) to connect a batch-processing span to the N producer spans it drains. Links express “caused by, but not in a strict tree” — exactly the messaging case.
3. Span design that pays off
Spans are only as useful as their names and attributes. The two failure modes are high cardinality in names (which destroys backend aggregation) and span explosion (which destroys cost).
Name spans by the operation, not the data. The name should be a low-cardinality template; the variable parts go in attributes.
| Bad span name | Why | Better |
|---|---|---|
| GET /users/8123 | ID in name = unbounded cardinality | GET /users/:id with http.route attribute |
| query-orders-2026-06 | date in name | db.query with db.statement attribute |
| process | tells you nothing | process order |
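The split between a templated name and variable attributes can be sketched as below. In practice the matched route should come from your web framework; this `span_name_and_attributes` helper is a hypothetical fallback that assumes numeric and UUID path segments are identifiers.

```python
import re

# Assumption: path segments that are all digits or a UUID are identifiers.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})$",
    re.IGNORECASE,
)


def span_name_and_attributes(method: str, path: str) -> tuple[str, dict[str, str]]:
    """Build a low-cardinality span name; keep the raw path as an attribute."""
    segments = [":id" if _ID_SEGMENT.match(seg) else seg for seg in path.split("/")]
    route = "/".join(segments)
    attributes = {
        "http.request.method": method,
        "http.route": route,
        "url.path": path,  # the variable part lives here, not in the name
    }
    return f"{method} {route}", attributes


name, attrs = span_name_and_attributes("GET", "/users/8123")
static_name, _ = span_name_and_attributes("GET", "/orders/open")
```

Every request to any user ID now aggregates under one span name, while the concrete path stays searchable as an attribute on the individual span.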
Lean on semantic conventions so the backend and any vendor understand your data: http.request.method, http.route, http.response.status_code, url.path, db.system, server.address, rpc.grpc.status_code. Set span status correctly: leave it Unset for normal flow, set Error only for genuine failures, and record the exception as an event. A 404 is usually not a span error — it is a valid outcome.
from opentelemetry.trace import Status, StatusCode
span.set_attribute("http.route", "/users/:id")
span.set_attribute("http.response.status_code", 503)
try:
    do_work()
except Exception as e:
    span.record_exception(e)
    span.set_status(Status(StatusCode.ERROR, "inventory backend unreachable"))
    raise
Avoid span explosion. Do not create a span per loop iteration or per row. If you process 10,000 records, emit one span with a batch.size attribute, not 10,000 spans. Each extra span costs ingest, storage, and query time. A good rule: a span should represent work worth seeing on a timeline — a network call, a meaningful compute phase, a transaction — not every function.
4. Choosing a backend: Tempo vs Jaeger
Both ingest OTLP and both work. They differ sharply on storage model and operational cost.
| Concern | Grafana Tempo | Jaeger |
|---|---|---|
| Storage backend | Object storage only (S3, GCS, Azure Blob) | Cassandra, Elasticsearch, OpenSearch, or Badger |
| Index | Minimal; TraceQL scans object storage blocks | Full index in the chosen store |
| Cost model | Cheap object storage; scales with bytes | Pay for a database cluster |
| Query language | TraceQL (filter on span attributes) | Tag-based search via UI/API |
| Best when | High volume, cost-sensitive, Grafana shop | You need rich indexed search or already run ES |
| Span metrics | Built-in metrics-generator | Via OTel Collector spanmetrics connector |
Tempo’s bet is that you rarely search blindly for traces — you arrive from a metric, a log, or an exemplar carrying a trace_id, so a heavy index is wasted money. Object storage is the cheapest durable tier, which makes Tempo the default for high-volume, cost-conscious stacks. Jaeger earns its keep when you genuinely need ad-hoc indexed search across arbitrary tags and are willing to run Elasticsearch to get it.
A minimal Tempo install via Helm, configured for S3:
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm upgrade --install tempo grafana/tempo-distributed \
  --namespace observability --create-namespace \
  -f tempo-values.yaml
# tempo-values.yaml (excerpt) — object storage + span metrics
storage:
  trace:
    backend: s3
    s3:
      bucket: my-org-tempo-traces
      endpoint: s3.us-east-1.amazonaws.com
      region: us-east-1
# generate RED metrics from spans and remote_write to Prometheus
metricsGenerator:
  enabled: true
  remoteWriteUrl: http://prometheus.observability.svc:9090/api/v1/write
Point your OTel Collector (or app SDK) at the distributor over OTLP:
# OTel Collector exporter -> Tempo
exporters:
  otlp/tempo:
    endpoint: tempo-distributor.observability.svc.cluster.local:4317
    tls:
      insecure: true # in-cluster only
5. Generating metrics from spans: RED for free
Every span already encodes Rate, Errors, and Duration. The spanmetrics connector in the OTel Collector aggregates spans into Prometheus metrics so you get RED signals without instrumenting metrics separately. (Tempo’s metrics-generator does the same server-side; pick one, not both, to avoid double counting.)
# OTel Collector: spans in -> RED metrics out
connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s]
    dimensions:
      - name: http.route
      - name: http.response.status_code
    exemplars:
      enabled: true # critical for step 6
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo, spanmetrics] # fan out: store AND aggregate
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheusremotewrite]
This emits traces_span_metrics_calls_total and traces_span_metrics_duration_* (a histogram) labeled by your chosen dimensions. Choose dimensions carefully — every dimension is a Prometheus label, and high-cardinality labels (user IDs, raw paths) will blow up your TSDB exactly the way they do for any metric. Note exemplars.enabled: true: that is what makes the next step possible.
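To show what "aggregating spans into RED metrics" means mechanically, here is a rough sketch in plain Python. It is not the connector's implementation; `FinishedSpan` and `aggregate_red` are hypothetical names, and the bucket bounds simply mirror the config above in milliseconds.

```python
from bisect import bisect_left
from collections import defaultdict
from dataclasses import dataclass

BUCKETS_MS = [5, 10, 25, 50, 100, 250, 500, 1000, 2000, 5000]


@dataclass
class FinishedSpan:
    route: str
    status_code: int
    duration_ms: float
    is_error: bool


def aggregate_red(spans):
    """Fold spans into call counts, error counts, and per-bucket duration counts.

    Each (route, status_code) pair is one label set, i.e. one time series.
    Bucket counts here are per-bucket; Prometheus exposes them cumulatively.
    """
    calls = defaultdict(int)
    errors = defaultdict(int)
    buckets = defaultdict(lambda: [0] * (len(BUCKETS_MS) + 1))  # last slot is +Inf
    for span in spans:
        key = (span.route, span.status_code)
        calls[key] += 1
        if span.is_error:
            errors[key] += 1
        # Index of the first bound >= duration, i.e. the matching "le" bucket.
        buckets[key][bisect_left(BUCKETS_MS, span.duration_ms)] += 1
    return calls, errors, buckets


sample = [
    FinishedSpan("/checkout", 200, 42.0, False),
    FinishedSpan("/checkout", 200, 730.0, False),
    FinishedSpan("/checkout", 503, 1200.0, True),
]
calls, errors, buckets = aggregate_red(sample)
```

Note how the number of series is the number of distinct keys: adding a dimension such as a user ID would multiply the keys by the number of users, which is the cardinality problem described above.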
6. Exemplars: from a percentile straight into the trace
An exemplar is a sample attached to a histogram bucket that carries the trace_id of one request that landed in that bucket. So when you look at a p99 latency panel, each plotted point can link to an actual slow trace — you stop guessing and jump straight to the offender.
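The idea can be sketched as a histogram whose buckets each remember one trace. This is an illustrative model only: `ExemplarHistogram` is a hypothetical class, and the "keep the most recent sampled trace" replacement policy is an assumption, since real implementations choose their own.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Optional


@dataclass
class Exemplar:
    trace_id: str
    value_ms: float


@dataclass
class Bucket:
    upper_bound_ms: float
    count: int = 0
    exemplar: Optional[Exemplar] = None


class ExemplarHistogram:
    def __init__(self, bounds_ms):
        self.bounds = list(bounds_ms)
        self.buckets = [Bucket(b) for b in self.bounds] + [Bucket(float("inf"))]

    def observe(self, value_ms: float, trace_id: str, sampled: bool) -> None:
        bucket = self.buckets[bisect_left(self.bounds, value_ms)]
        bucket.count += 1
        # Only a sampled trace is worth linking: an unsampled one was never stored.
        if sampled:
            bucket.exemplar = Exemplar(trace_id, value_ms)

    def slowest_exemplar(self) -> Optional[Exemplar]:
        """Return the exemplar from the highest bucket that has one."""
        for bucket in reversed(self.buckets):
            if bucket.exemplar is not None:
                return bucket.exemplar
        return None


hist = ExemplarHistogram([100, 500, 1000])
hist.observe(42.0, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", sampled=True)
hist.observe(1800.0, "4bf92f3577b34da6a3ce929d0e0e4736", sampled=True)
hist.observe(2500.0, "ffffffffffffffffffffffffffffffff", sampled=False)
slow = hist.slowest_exemplar()
```

The 2500 ms request still counts toward the percentile, but the exemplar points at the 1800 ms trace because that is the slowest one that was actually recorded.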
The plumbing has three requirements, and all three must be true:
- The metrics source emits exemplars (set above in the connector). OTLP histograms carry exemplars natively.
- Prometheus stores them. Exemplar storage must be explicitly enabled:
# Prometheus must be started with the exemplar-storage feature flag
prometheus --enable-feature=exemplar-storage \
  --storage.tsdb.path=/prometheus
- Grafana links the exemplar’s trace_id to the Tempo data source. Configure this on the Prometheus data source:
# Grafana datasource provisioning: Prometheus -> Tempo via exemplar
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus.observability.svc:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo-uid
Now query a percentile and Grafana renders exemplar diamonds under the line:
histogram_quantile(0.99,
  sum by (le, http_route) (
    rate(traces_span_metrics_duration_milliseconds_bucket[5m])
  )
)
Click a diamond on the spike, and Grafana opens that exact trace in Tempo. That click is the entire payoff of wiring exemplars: percentile to root cause in one hop.
7. Trace-to-logs and logs-to-trace correlation
The third link is logs. The rule is simple: every structured log line emitted inside a span must include the trace_id (and ideally span_id). Then you can pivot from a span to its logs and from a log line back to the full trace.
Inject the IDs from the active context into your logger:
# Python: enrich logs with the active trace context
import logging, json
from opentelemetry import trace
class TraceContextFilter(logging.Filter):
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else ""
        record.span_id = format(ctx.span_id, "016x") if ctx.is_valid else ""
        return True
A resulting structured line looks like:
{"level":"error","msg":"inventory reserve failed","service":"checkout","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"00f067aa0ba902b7"}
With Loki as the log store, wire both directions in Grafana. Logs-to-trace: a derived field parses trace_id out of the log line and links to Tempo. Trace-to-logs: Tempo’s data source jumps from a span into a Loki query scoped by trace_id and time.
# Loki -> Tempo (logs to trace)
- name: Loki
  type: loki
  uid: loki-uid # referenced by tracesToLogsV2 below
  jsonData:
    derivedFields:
      - name: trace_id
        matcherRegex: '"trace_id":"(\w+)"'
        url: "$${__value.raw}"
        datasourceUid: tempo-uid
# Tempo -> Loki (trace to logs)
- name: Tempo
  type: tempo
  uid: tempo-uid
  jsonData:
    tracesToLogsV2:
      datasourceUid: loki-uid
      filterByTraceID: true
      spanStartTimeShift: "-5m"
      spanEndTimeShift: "5m"
The time shift matters: a log line’s timestamp rarely matches the span’s exactly, so widen the window or the correlated query returns nothing.
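Both halves of the correlation are simple enough to check offline. The sketch below applies the same regex as the derived field to the sample log line, and computes the widened query window for a span; `trace_id_from_log` and `log_query_window` are hypothetical helper names, and the five-minute shift mirrors the config above.

```python
import re
from datetime import datetime, timedelta, timezone

# Same pattern as the derived field's matcherRegex.
TRACE_ID_FIELD = re.compile(r'"trace_id":"(\w+)"')


def trace_id_from_log(line: str):
    """Return the trace_id captured from a JSON log line, or None."""
    match = TRACE_ID_FIELD.search(line)
    return match.group(1) if match else None


def log_query_window(span_start, span_end, shift=timedelta(minutes=5)):
    """Widen the span's time range on both sides before querying logs."""
    return span_start - shift, span_end + shift


line = (
    '{"level":"error","msg":"inventory reserve failed","service":"checkout",'
    '"trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"00f067aa0ba902b7"}'
)
found = trace_id_from_log(line)
# A JSON encoder that writes a space after the colon defeats this exact pattern.
missed = trace_id_from_log('{"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"}')

start = datetime(2026, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
window = log_query_window(start, start + timedelta(seconds=2))
```

The `missed` case is worth testing against your real log output: if your logger pretty-prints or pads its JSON, the pattern needs to allow optional whitespace.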
8. Sampling strategy that survives production
Tracing every request at full volume is expensive and usually unnecessary. The decision is what to keep, and there are two places to decide.
Head sampling decides at the root, before the trace exists, using a ratio. It is cheap and stateless but blind — it cannot keep “all the errors” because it decides before knowing the outcome.
# Head sampling: keep 10%, respect upstream decision
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
sampler = ParentBased(root=TraceIdRatioBased(0.1))
ParentBased is non-negotiable in a distributed system: if an upstream service sampled a trace, every downstream service must honor that bit (it rides in the traceparent flags), or you get half-traces. Head sampling is consistent because the decision derives from the trace_id itself, so every service computes the same answer.
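A minimal sketch of why the ratio decision is consistent: it is a pure function of the trace_id, so any service that evaluates it gets the same answer. This shows one common approach (compare the low 64 bits of the ID against a threshold); actual SDK internals may differ, and `ratio_sampled` and `parent_based` are hypothetical names.

```python
from typing import Optional

TRACE_ID_LIMIT = (1 << 64) - 1  # mask for the low 64 bits of the trace_id


def ratio_sampled(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic decision: same trace_id and ratio always give the same result."""
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (int(trace_id_hex, 16) & TRACE_ID_LIMIT) < bound


def parent_based(parent_sampled: Optional[bool], trace_id_hex: str, ratio: float) -> bool:
    """Honor the upstream decision if there is one; otherwise decide as the root."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sampled(trace_id_hex, ratio)


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
at_10_percent = ratio_sampled(trace_id, 0.1)
at_70_percent = ratio_sampled(trace_id, 0.7)
downstream = parent_based(True, trace_id, 0.1)
```

The last call is the important one: the downstream service's own 10% ratio would have dropped this trace, but because the parent already sampled it, the trace stays whole.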
Tail sampling decides after the whole trace completes, in the Collector gateway, so it can keep every error and every slow trace while dropping the boring fast ones. It needs the complete trace buffered in one place, which is why it lives in the gateway tier — covered in depth in the OpenTelemetry Collector pipelines article. The policy shape:
# Tail sampling: keep all errors + slow traces, sample the rest
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 1000}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
A sane production default: head-sample lightly to cap SDK and network overhead, then tail-sample at the gateway to guarantee you keep the traces that matter (errors, latency outliers) plus a small representative baseline of healthy traffic.
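The keep-or-drop logic of such a tail policy can be sketched as follows. This is a simplified model, not the Collector's code: `CompletedTrace` and `keep_reason` are hypothetical names, the strict "greater than" latency comparison is an assumption, and a CRC32 hash of the trace_id stands in for the probabilistic policy so the example is deterministic.

```python
import zlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompletedTrace:
    trace_id: str
    has_error: bool
    duration_ms: float


def keep_reason(
    t: CompletedTrace,
    latency_threshold_ms: float = 1000,
    baseline_percent: int = 5,
) -> Optional[str]:
    """Return the name of the first policy that keeps the trace, or None to drop it."""
    if t.has_error:
        return "errors"
    if t.duration_ms > latency_threshold_ms:
        return "slow"
    # Stand-in for probabilistic sampling: map the trace_id into 0..99.
    if zlib.crc32(t.trace_id.encode()) % 100 < baseline_percent:
        return "baseline"
    return None


failed = CompletedTrace("4bf92f3577b34da6a3ce929d0e0e4736", True, 40.0)
slow = CompletedTrace("0af7651916cd43dd8448eb211c80319c", False, 2300.0)
healthy = CompletedTrace("b7ad6b7169203331b7ad6b7169203331", False, 35.0)

decisions = [keep_reason(failed), keep_reason(slow)]
```

Errors and slow traces are kept regardless of the baseline percentage, which is the guarantee head sampling cannot give; only healthy, fast traces are subject to the percentage.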
Cost reality: keep RED metrics at 100% (they are cheap aggregates) and sample the traces aggressively. You almost never need every healthy trace — you need every broken one plus enough baseline to compare against. Metrics give you the totals; sampled traces give you the examples.
Enterprise scenario
A payments platform migrated its checkout path to async: the API published an order.placed event to Kafka and a fleet of consumers settled it. Traces looked perfect in the synchronous tier and then died at the broker — every consumer span showed up as a brand-new root, so the p99 settlement latency had no traces behind its exemplars. The team had set the W3C propagator globally and assumed that was enough. It wasn’t: the auto-instrumentation only injects into HTTP/gRPC carriers, and nothing was writing traceparent into Kafka record headers.
The fix was explicit context injection on produce and extraction on consume, plus one subtlety they got wrong first. The batch consumer drained up to 500 records per poll, so making each one a child of a single poll span produced a meaningless 500-wide fan. They switched to span links: the poll span links back to each producer’s context instead of parenting it.
from opentelemetry.propagate import inject, extract
from opentelemetry import trace
# producer: write context into the Kafka record headers
carrier = {}
inject(carrier)
headers = [(k, v.encode()) for k, v in carrier.items()]
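# consumer: link each record's context to the batch span
tracer = trace.get_tracer(__name__)
def span_context_of(record):
    # header values are bytes; extract() returns a Context, not a SpanContext
    carrier = {k: v.decode() for k, v in record.headers}
    return trace.get_current_span(extract(carrier)).get_span_context()
links = [trace.Link(span_context_of(r)) for r in records]
with tracer.start_as_current_span("settle batch", links=links,
                                  kind=trace.SpanKind.CONSUMER):
    settle(records)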
After that, exemplars on the settlement histogram clicked straight through to traces spanning API to broker to consumer to ledger. The lesson: a global propagator covers the wire, not the queue.
Verify
Confirm each link in the chain actually works:
# 1. Context propagates: send a request with a known, sampled traceparent, then
#    confirm spans from multiple services share that trace_id in Tempo/Jaeger.
curl -s -o /dev/null \
  -H "traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01" \
  http://localhost:8080/checkout
# 2. Tempo is receiving traces (look the trace up by ID)
curl -s "http://tempo:3200/api/traces/4bf92f3577b34da6a3ce929d0e0e4736" | jq '.batches | length'
# 3. Span metrics exist in Prometheus
curl -s "http://prometheus:9090/api/v1/query?query=traces_span_metrics_calls_total" | jq '.data.result | length'
# 4. Exemplars are stored (the printed count must be greater than 0)
curl -s "http://prometheus:9090/api/v1/query_exemplars?query=traces_span_metrics_duration_milliseconds_bucket&start=$(date -u -d '-1 hour' +%s)&end=$(date -u +%s)" | jq '.data | length'
- In Grafana, open a latency panel and confirm exemplar diamonds render, and clicking one opens the trace.
- From a span in Tempo, click “Logs for this span” and confirm Loki returns the correlated lines.
- From a log line in Loki, confirm the trace_id derived field links back to the full trace.
Pitfalls
- Default propagators are not always W3C. Some SDKs and older agents default to B3 or no propagator. Mismatched propagators across services silently break traces at the boundary. Standardize on W3C everywhere and verify the traceparent header is actually present on the wire.
- PII in baggage and span attributes. Baggage propagates to every service and frequently into logs; span attributes land in your trace store and any connected vendor. Treat both as untrusted output and scrub emails, tokens, and full request bodies. The OTel Collector’s attributes and redaction processors are the right place to enforce this centrally.
- Double-counted span metrics. Running both the spanmetrics connector and Tempo’s metrics-generator doubles your RED numbers. Choose one path and disable the other.
- Clock skew across services. Spans timestamped by different hosts can render with a child appearing to start before its parent. Run NTP/chrony everywhere; it is the cheapest fix for confusing waterfalls.
- Sampling that drops the errors. Head-only sampling at a low ratio will throw away most of your failures because it decides before the failure happens. If you sample, put tail sampling at the gateway so error and latency traces are kept deterministically.
Next steps
Add TraceQL-based alerting and span-metrics SLOs so a rising error rate links straight to exemplar traces. Wire OpenTelemetry continuous profiling (Pyroscope) as a fourth correlated signal — jump from a slow span to a flame graph of that exact code path. And formalize the cardinality budget for span attributes and metric dimensions the same way you do for any time series, before a well-meaning user.id attribute quietly triples your storage bill.