Distributed Tracing End-to-End: Context Propagation, Tempo, and Correlating Traces with Metrics and Logs

A dashboard tells you the p99 latency on the checkout endpoint just doubled. A trace tells you which of the eleven downstream calls caused it, on which pod, with which customer ID in the baggage. Distributed tracing is the only pillar that reconstructs a single request’s path across service boundaries — and it is worthless if context drops on the first hop or if your spans are named so badly the backend can’t aggregate them. This is the end-to-end wiring: propagation, span design, backend choice, and the links that turn three disconnected pillars into one investigation.

1. What a trace actually is

A span is one unit of work with a start time, a duration, a name, a set of key-value attributes, a status, and zero or more events. A trace is a tree of spans that share a single 16-byte trace_id. Each span carries an 8-byte span_id and a parent_span_id pointing at the span that caused it. The root span has no parent.

trace_id = 4bf92f3577b34da6a3ce929d0e0e4736

  [HTTP GET /checkout]              span A (root, parent=none)
    [validate-cart]                 span B (parent=A)
    [POST inventory.Reserve]        span C (parent=A)   <-- client span
       [grpc inventory.Reserve]     span D (parent=C)   <-- server span, other process
          [SELECT FROM stock]       span E (parent=D)
    [POST payments.Charge]          span F (parent=A)

The tuple that travels between processes — trace_id, span_id, trace flags (the sampled bit), and trace state — is the span context. Propagation is nothing more than serializing the active span’s context into the outbound request and deserializing it on the other side so the remote span sets its parent_span_id correctly. Drop that step on any hop and the trace splits into two disconnected trees.

Spans C and D above are the crux: the client (caller) and server (callee) each create a span, in different processes, and D’s parent is C only because C’s context rode across the wire.
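
To make the parent/child mechanics concrete, here is a minimal sketch in plain Python with no OpenTelemetry dependency. The `Span` dataclass and `find_roots` helper are hypothetical names for illustration, and the single-letter IDs stand in for real 8-byte span IDs. In a real system the orphaned span would also receive a new trace_id; the sketch keeps everything in one list to show the split.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Span:
    name: str
    span_id: str
    parent_span_id: Optional[str]  # None marks a root span

def find_roots(spans):
    """Return spans whose parent is absent from the collected spans.

    A healthy trace has exactly one root. More than one means context
    was dropped on some hop and the trace split into separate trees.
    """
    known = {s.span_id for s in spans}
    return [s for s in spans if s.parent_span_id not in known]

# The example trace above, with illustrative short IDs
healthy = [
    Span("HTTP GET /checkout", "A", None),
    Span("validate-cart", "B", "A"),
    Span("POST inventory.Reserve", "C", "A"),
    Span("grpc inventory.Reserve", "D", "C"),
    Span("SELECT FROM stock", "E", "D"),
    Span("POST payments.Charge", "F", "A"),
]

# Same request, but C's context never reached the inventory service,
# so D started without a parent.
broken = [s if s.span_id != "D" else Span(s.name, "D", None) for s in healthy]

print([s.name for s in find_roots(healthy)])
print([s.name for s in find_roots(broken)])
```

Counting roots per request is a cheap way to spot a hop that drops context: the healthy list yields one root, the broken one yields two.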

2. Context propagation across the wire

The standard is W3C Trace Context, two HTTP headers. The exact format matters because every language and proxy parses it byte-for-byte:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^^ ^------------------------------^ ^--------------^ ^^
             ver           trace-id (16B)         parent-id (8B)  flags

tracestate: vendorA=opaqueValue,vendorB=opaqueValue

The trailing 01 flag is the sampled bit: 01 means recorded, 00 means not. tracestate carries vendor-specific data and must be propagated verbatim. There is also baggage — the W3C baggage header — which carries arbitrary key-value pairs (e.g. tenant.id=acme, enduser.role=admin) to every downstream service. Baggage is powerful and dangerous: it travels on every hop, so keep it tiny and never put secrets or PII in it, because it leaks to every service and often into logs.
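
The header layout is simple enough to parse by hand, which helps when debugging a proxy that mangles it. The following is a sketch, not a replacement for the SDK propagator: `parse_traceparent` is a hypothetical helper, and it only accepts the version-00 layout shown above.

```python
import re

# version-traceid-parentid-flags, all lowercase hex (version 00 layout)
_TRACEPARENT = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header):
    """Parse a traceparent header; return None if it is malformed."""
    m = _TRACEPARENT.match(header.strip())
    if m is None:
        return None
    fields = m.groupdict()
    # Version ff is invalid, as are all-zero trace and parent IDs
    if fields["version"] == "ff":
        return None
    if set(fields["trace_id"]) == {"0"} or set(fields["parent_id"]) == {"0"}:
        return None
    # Bit 0 of the flags byte is the sampled flag
    fields["sampled"] = bool(int(fields["flags"], 16) & 0x01)
    return fields

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"], ctx["parent_id"], ctx["sampled"])
```

Returning None on malformed input mirrors what a receiver is expected to do: ignore the header and start a new trace instead of failing the request.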

In OpenTelemetry, set a global composite propagator once at startup. Skipping this step is the single most common silent failure, because SDKs do not all enable W3C by default.

# Python: enable W3C trace context + baggage globally
from opentelemetry.propagate import set_global_textmap
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
from opentelemetry.baggage.propagation import W3CBaggagePropagator

set_global_textmap(CompositePropagator([
    TraceContextTextMapPropagator(),
    W3CBaggagePropagator(),
]))

// Go: same idea, set once in main()
import (
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
)

otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
    propagation.TraceContext{},
    propagation.Baggage{},
))

For synchronous HTTP/gRPC, the instrumentation libraries inject and extract these headers automatically once the propagator is set. Async is where traces break. When you publish to Kafka, SQS, RabbitMQ, or a cloud bus, the HTTP context is gone — you must serialize traceparent into the message and re-extract it in the consumer.

# Producer: inject context into message headers/attributes
from opentelemetry.propagate import inject

carrier = {}
inject(carrier)  # writes traceparent/baggage into the dict
message.headers = [(k, v.encode()) for k, v in carrier.items()]

# Consumer: extract context, then start the span as a child
from opentelemetry.propagate import extract
from opentelemetry import trace

carrier = {k: v.decode() for k, v in message.headers}
ctx = extract(carrier)
with trace.get_tracer(__name__).start_as_current_span(
        "process order", context=ctx,
        kind=trace.SpanKind.CONSUMER):
    handle(message)

Fan-out gotcha: when one message is consumed by many workers, or batched, the parent-child relationship gets ambiguous. Use span links (not parent-child) to connect a batch-processing span to the N producer spans it drains. Links express “caused by, but not in a strict tree” — exactly the messaging case.

3. Span design that pays off

Spans are only as useful as their names and attributes. The two failure modes are high cardinality in names (which destroys backend aggregation) and span explosion (which destroys cost).

Name spans by the operation, not the data. The name should be a low-cardinality template; the variable parts go in attributes.

Bad span name        | Why                                | Better
GET /users/8123      | ID in name = unbounded cardinality | GET /users/:id with http.route attribute
query-orders-2026-06 | Date in name                       | db.query with db.statement attribute
process              | Tells you nothing                  | process order
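
When the framework does not hand you a route template, a fallback is to derive one from the raw path. This is a rough sketch under that assumption: `span_name` is a hypothetical helper, and it only recognizes purely numeric segments and UUIDs as variable parts. The framework's own route template, when available, is the better source for http.route.

```python
import re

_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")

def span_name(method, path):
    """Build a low-cardinality span name from an HTTP method and raw path."""
    segments = []
    # Drop the query string, then template each variable-looking segment
    for seg in path.split("?")[0].split("/"):
        if seg.isdigit() or _UUID.match(seg):
            segments.append(":id")   # variable part becomes a placeholder
        else:
            segments.append(seg)
    return f"{method.upper()} {'/'.join(segments)}"

print(span_name("get", "/users/8123"))
print(span_name("GET", "/users/8124/orders?page=2"))
```

The raw ID is not lost: it belongs in an attribute on the span, where it is searchable without multiplying the set of span names.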

Lean on semantic conventions so the backend and any vendor understand your data: http.request.method, http.route, http.response.status_code, url.path, db.system, server.address, rpc.grpc.status_code. Set span status correctly: leave it Unset for normal flow, set Error only for genuine failures, and record the exception as an event. A 404 is usually not a span error — it is a valid outcome.

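from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

# the span opened by the HTTP server instrumentation for this request
span = trace.get_current_span()
span.set_attribute("http.route", "/users/:id")
try:
    do_work()
except Exception as e:
    # record the exception as a span event, then mark the span as failed
    span.record_exception(e)
    span.set_attribute("http.response.status_code", 503)
    span.set_status(Status(StatusCode.ERROR, "inventory backend unreachable"))
    raise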

Avoid span explosion. Do not create a span per loop iteration or per row. If you process 10,000 records, emit one span with a batch.size attribute, not 10,000 spans. Each extra span costs ingest, storage, and query time. A good rule: a span should represent work worth seeing on a timeline — a network call, a meaningful compute phase, a transaction — not every function.

4. Choosing a backend: Tempo vs Jaeger

Both ingest OTLP and both work. They differ sharply on storage model and operational cost.

Concern         | Grafana Tempo                               | Jaeger
Storage backend | Object storage only (S3, GCS, Azure Blob)   | Cassandra, Elasticsearch, OpenSearch, or Badger
Index           | Minimal; TraceQL scans object storage blocks | Full index in the chosen store
Cost model      | Cheap object storage; scales with bytes     | Pay for a database cluster
Query language  | TraceQL (filter on span attributes)         | Tag-based search via UI/API
Best when       | High volume, cost-sensitive, Grafana shop   | You need rich indexed search or already run ES
Span metrics    | Built-in metrics-generator                  | Via OTel Collector spanmetrics connector

Tempo’s bet is that you rarely search blindly for traces — you arrive from a metric, a log, or an exemplar carrying a trace_id, so a heavy index is wasted money. Object storage is the cheapest durable tier, which makes Tempo the default for high-volume, cost-conscious stacks. Jaeger earns its keep when you genuinely need ad-hoc indexed search across arbitrary tags and are willing to run Elasticsearch to get it.

A minimal Tempo install via Helm, configured for S3:

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm upgrade --install tempo grafana/tempo-distributed \
  --namespace observability --create-namespace \
  -f tempo-values.yaml

# tempo-values.yaml (excerpt) — object storage + span metrics
storage:
  trace:
    backend: s3
    s3:
      bucket: my-org-tempo-traces
      endpoint: s3.us-east-1.amazonaws.com
      region: us-east-1
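# generate RED metrics from spans and remote_write to Prometheus
# (key layout assumes the tempo-distributed chart; check values.yaml for your chart version)
metricsGenerator:
  enabled: true
  config:
    storage:
      remote_write:
        - url: http://prometheus.observability.svc:9090/api/v1/write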

Point your OTel Collector (or app SDK) at the distributor over OTLP:

# OTel Collector exporter -> Tempo
exporters:
  otlp/tempo:
    endpoint: tempo-distributor.observability.svc.cluster.local:4317
    tls:
      insecure: true   # in-cluster only

5. Generating metrics from spans: RED for free

Every span already encodes Rate, Errors, and Duration. The spanmetrics connector in the OTel Collector aggregates spans into Prometheus metrics so you get RED signals without instrumenting metrics separately. (Tempo’s metrics-generator does the same server-side; pick one, not both, to avoid double counting.)

# OTel Collector: spans in -> RED metrics out
connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s]
    dimensions:
      - name: http.route
      - name: http.response.status_code
    exemplars:
      enabled: true   # critical for step 6

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo, spanmetrics]   # fan out: store AND aggregate
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheusremotewrite]

This emits traces_span_metrics_calls_total and traces_span_metrics_duration_* (a histogram) labeled by your chosen dimensions. The exact prefix depends on the connector's namespace setting and Collector version, so confirm the names in Prometheus before building dashboards on them. Choose dimensions carefully: every dimension is a Prometheus label, and high-cardinality labels (user IDs, raw paths) will blow up your TSDB exactly the way they do for any metric. Note exemplars.enabled: true, which is what makes the next step possible.
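
A quick back-of-the-envelope calculation shows why one careless dimension matters. The sketch below is a rough estimate, not the connector's actual accounting: `estimated_series` is a hypothetical helper, the value counts are made-up inputs, and it ignores the dimensions the connector adds on its own (such as service and span name), so real numbers will be higher.

```python
from math import prod

def estimated_series(distinct_values, bucket_count):
    """Rough series count for one span-metrics duration histogram.

    distinct_values: dimension name -> expected number of distinct values
    bucket_count:    number of explicit bucket boundaries
    """
    combinations = prod(distinct_values.values())
    # Per label combination: one series per bucket, plus +Inf, _sum, and _count
    per_combination = bucket_count + 3
    return combinations * per_combination

safe = estimated_series({"http.route": 40, "http.response.status_code": 8}, 10)
risky = estimated_series({"http.route": 40, "http.response.status_code": 8,
                          "user.id": 50_000}, 10)
print(safe, risky)
```

With the ten buckets configured above, 40 routes and 8 status codes stay in the low thousands of series, while adding a 50,000-value user.id dimension multiplies that by 50,000.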

6. Exemplars: from a percentile straight into the trace

An exemplar is a sample attached to a histogram bucket that carries the trace_id of one request that landed in that bucket. So when you look at a p99 latency panel, each plotted point can link to an actual slow trace — you stop guessing and jump straight to the offender.
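
The idea fits in a few lines. This toy histogram is a sketch only: `ExemplarHistogram` is a hypothetical class, it keeps per-bucket (non-cumulative) counts for readability, and it retains just the most recent exemplar per bucket, which is one of several possible retention choices.

```python
import bisect

class ExemplarHistogram:
    """Toy histogram that keeps the most recent trace_id seen per bucket."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)                 # upper bounds, in seconds
        self.counts = [0] * (len(self.bounds) + 1)   # last slot is +Inf
        self.exemplars = [None] * (len(self.bounds) + 1)

    def observe(self, value, trace_id):
        # bisect_left finds the first bucket whose upper bound is >= value
        i = bisect.bisect_left(self.bounds, value)
        self.counts[i] += 1
        self.exemplars[i] = (value, trace_id)

h = ExemplarHistogram([0.1, 0.5, 1.0])
h.observe(0.04, "trace-fast-1")
h.observe(0.07, "trace-fast-2")
h.observe(2.30, "4bf92f3577b34da6a3ce929d0e0e4736")

print(h.counts)
print(h.exemplars[-1])
```

The slow request lands in the +Inf bucket and leaves its trace_id behind, which is exactly the pointer a dashboard needs to open the offending trace.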

The plumbing has three requirements, and all three must be true:

  1. The metrics source emits exemplars (set above in the connector). OTLP histograms carry exemplars natively.
  2. Prometheus stores them. Exemplar storage must be explicitly enabled:

# Prometheus must be started with the exemplar-storage feature flag.
# The remote-write receiver flag is needed because the Collector pushes metrics to /api/v1/write.
prometheus --enable-feature=exemplar-storage \
  --web.enable-remote-write-receiver \
  --storage.tsdb.path=/prometheus

  3. Grafana links the exemplar’s trace_id to the Tempo data source. Configure this on the Prometheus data source:

# Grafana datasource provisioning: Prometheus -> Tempo via exemplar
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus.observability.svc:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo-uid

Now query a percentile and Grafana renders exemplar diamonds under the line:

histogram_quantile(0.99,
  sum by (le, http_route) (
    rate(traces_span_metrics_duration_milliseconds_bucket[5m])
  )
)

Click a diamond on the spike, and Grafana opens that exact trace in Tempo. That click is the entire payoff of wiring exemplars: percentile to root cause in one hop.
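
It helps to remember that the percentile line itself is an estimate. The sketch below approximates the linear interpolation PromQL applies to classic histogram buckets; it is a simplified re-implementation for illustration (valid for 0 < q <= 1, with made-up bucket counts), not Prometheus source code.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with math.inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in +Inf: report the highest finite bound
                return buckets[i - 1][0]
            # Assume observations are spread evenly inside the bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count

# Cumulative counts: 90 requests <= 0.1s, 99 <= 0.5s, 100 <= 1s
buckets = [(0.1, 90), (0.5, 99), (1.0, 100), (math.inf, 100)]
p99 = histogram_quantile(0.99, buckets)
print(p99)
```

The result is a point on a bucket boundary, not the latency of any actual request. The exemplar is what ties that synthetic number back to a real trace.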

7. Trace-to-logs and logs-to-trace correlation

The third link is logs. The rule is simple: every structured log line emitted inside a span must include the trace_id (and ideally span_id). Then you can pivot from a span to its logs and from a log line back to the full trace.

Inject the IDs from the active context into your logger:

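# Python: enrich logs with the active trace context
import logging
from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    """Copy the active span's IDs onto every log record."""
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        # IDs are integers in the API; render them as lowercase hex
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else ""
        record.span_id = format(ctx.span_id, "016x") if ctx.is_valid else ""
        return True

# Attach the filter to a handler so every record passing through is enriched.
# The format string is illustrative; use a real JSON formatter in production.
handler = logging.StreamHandler()
handler.addFilter(TraceContextFilter())
handler.setFormatter(logging.Formatter(
    '{"level":"%(levelname)s","msg":"%(message)s",'
    '"trace_id":"%(trace_id)s","span_id":"%(span_id)s"}'))
logging.getLogger().addHandler(handler)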

A resulting structured line looks like:

{"level":"error","msg":"inventory reserve failed","service":"checkout","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"00f067aa0ba902b7"}

With Loki as the log store, wire both directions in Grafana. Logs-to-trace: a derived field parses trace_id out of the log line and links to Tempo. Trace-to-logs: Tempo’s data source jumps from a span into a Loki query scoped by trace_id and time.

# Loki -> Tempo (logs to trace)
- name: Loki
  type: loki
  jsonData:
    derivedFields:
      - name: trace_id
        matcherRegex: '"trace_id":"(\w+)"'
        url: "$${__value.raw}"
        datasourceUid: tempo-uid
# Tempo -> Loki (trace to logs)
- name: Tempo
  type: tempo
  uid: tempo-uid
  jsonData:
    tracesToLogsV2:
      datasourceUid: loki-uid
      filterByTraceID: true
      spanStartTimeShift: "-5m"
      spanEndTimeShift: "5m"

The time shift matters: a log line’s timestamp rarely matches the span’s exactly, so widen the window or the correlated query returns nothing.
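
Both directions reduce to two small operations, sketched here in plain Python so they can be tested outside Grafana. The helper names are hypothetical, the timestamps are illustrative, and the regex is the same pattern used in the derived field above.

```python
import re
from datetime import datetime, timedelta, timezone

# Same pattern as the derived field's matcherRegex
TRACE_ID_RE = re.compile(r'"trace_id":"(\w+)"')

def trace_id_from_log(line):
    """Logs-to-trace: pull the trace_id out of a structured log line."""
    m = TRACE_ID_RE.search(line)
    return m.group(1) if m else None

def log_query_window(span_start, span_end, shift=timedelta(minutes=5)):
    """Trace-to-logs: widen the span's time range before querying logs."""
    return span_start - shift, span_end + shift

line = ('{"level":"error","msg":"inventory reserve failed","service":"checkout",'
        '"trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"00f067aa0ba902b7"}')
print(trace_id_from_log(line))

start = datetime(2026, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
end = start + timedelta(milliseconds=850)
print(log_query_window(start, end))
```

Running the regex against a few real log lines before provisioning the data source catches the usual mismatch, such as a logger that writes traceId or adds a space after the colon.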

8. Sampling strategy that survives production

Tracing every request at full volume is expensive and usually unnecessary. The decision is what to keep, and there are two places to decide.

Head sampling decides at the root, before the trace exists, using a ratio. It is cheap and stateless but blind — it cannot keep “all the errors” because it decides before knowing the outcome.

# Head sampling: keep 10%, respect upstream decision
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
sampler = ParentBased(root=TraceIdRatioBased(0.1))

ParentBased is non-negotiable in a distributed system: if an upstream service sampled a trace, every downstream service must honor that bit (it rides in the traceparent flags), or you get half-traces. Ratio-based head sampling is consistent because the decision derives from the trace_id itself, so every service configured with the same ratio computes the same answer.
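
The two rules can be sketched in a few lines of plain Python. This is an illustration of the idea, similar in spirit to ratio samplers in the SDKs but not a copy of any of them; the function names are hypothetical and the exact bits an SDK inspects may differ.

```python
def ratio_sampled(trace_id_hex, ratio):
    """Deterministic head-sampling decision derived from the trace_id."""
    # Compare the low 64 bits of the trace_id against a ratio-scaled bound
    low64 = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low64 < round(ratio * (1 << 64))

def parent_based(parent_sampled, trace_id_hex, ratio):
    """Honor the upstream sampled bit; only roots consult the ratio."""
    if parent_sampled is not None:      # a parent context arrived on the wire
        return parent_sampled
    return ratio_sampled(trace_id_hex, ratio)

tid = "4bf92f3577b34da6a3ce929d0e0e4736"
print(ratio_sampled(tid, 0.1), ratio_sampled(tid, 0.7))
print(parent_based(True, tid, 0.1))
```

Because the decision is a pure function of the trace_id and the ratio, two services that never talk to each other still agree. And because the parent's bit wins, a downstream service with a stricter ratio cannot punch holes in a trace that upstream chose to keep.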

Tail sampling decides after the whole trace completes, in the Collector gateway, so it can keep every error and every slow trace while dropping the boring fast ones. It needs the complete trace buffered in one place, which is why it lives in the gateway tier — covered in depth in the OpenTelemetry Collector pipelines article. The policy shape:

# Tail sampling: keep all errors + slow traces, sample the rest
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 1000}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

A sane production default: head-sample lightly to cap SDK and network overhead, then tail-sample at the gateway to guarantee you keep the traces that matter (errors, latency outliers) plus a small representative baseline of healthy traffic.
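
The policy list above boils down to "keep the trace if any policy says keep." The following sketch evaluates the same three policies on an already-complete trace. It is a simplification under stated assumptions: `FinishedSpan` and `keep_trace` are hypothetical names, the longest span stands in for the trace duration, and the baseline uses a plain modulo on the trace_id in place of the Collector's own hashing.

```python
from dataclasses import dataclass

@dataclass
class FinishedSpan:
    duration_ms: float
    is_error: bool

def keep_trace(trace_id_hex, spans, slow_ms=1000, baseline_pct=5):
    """Return the name of the first matching policy, or None to drop."""
    if any(s.is_error for s in spans):
        return "errors"
    # Simplification: treat the longest span as the trace duration
    if max(s.duration_ms for s in spans) >= slow_ms:
        return "slow"
    # Deterministic baseline: bucket the trace_id into 0..99
    if int(trace_id_hex, 16) % 100 < baseline_pct:
        return "baseline"
    return None

failed = [FinishedSpan(20, False), FinishedSpan(35, True)]
slow = [FinishedSpan(1500, False)]
healthy = [FinishedSpan(12, False)]

print(keep_trace("0" * 30 + "32", failed))
print(keep_trace("0" * 30 + "32", slow))
print(keep_trace("0" * 30 + "32", healthy))
```

None of these checks can run until every span of the trace has arrived, which is the reason for decision_wait and for routing all spans of a trace to the same gateway instance.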

Cost reality: keep RED metrics at 100% (they are cheap aggregates) and sample the traces aggressively. You almost never need every healthy trace — you need every broken one plus enough baseline to compare against. Metrics give you the totals; sampled traces give you the examples.

Enterprise scenario

A payments platform migrated its checkout path to async: the API published an order.placed event to Kafka and a fleet of consumers settled it. Traces looked perfect in the synchronous tier and then died at the broker — every consumer span showed up as a brand-new root, so the p99 settlement latency had no traces behind its exemplars. The team had set the W3C propagator globally and assumed that was enough. It wasn’t: the auto-instrumentation only injects into HTTP/gRPC carriers, and nothing was writing traceparent into Kafka record headers.

The fix was explicit context injection on produce and extraction on consume, plus one subtlety they got wrong first. The batch consumer drained up to 500 records per poll, so making each one a child of a single poll span produced a meaningless 500-wide fan. They switched to span links: the poll span links back to each producer’s context instead of parenting it.

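from opentelemetry.propagate import inject, extract
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# producer: write context into the Kafka record headers
carrier = {}
inject(carrier)
headers = [(k, v.encode()) for k, v in carrier.items()]

# consumer: link each record's context to the batch span
def link_for(record):
    # Kafka header values are bytes; the propagator expects str
    carrier = {k: v.decode() for k, v in record.headers}
    # extract() returns a Context; the SpanContext lives on its current span
    ctx = extract(carrier)
    return trace.Link(trace.get_current_span(ctx).get_span_context())

links = [link_for(r) for r in records]
with tracer.start_as_current_span("settle batch", links=links,
                                  kind=trace.SpanKind.CONSUMER):
    settle(records)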

After that, exemplars on the settlement histogram clicked straight through to traces spanning API to broker to consumer to ledger. The lesson: a global propagator covers the wire, not the queue.

Verify

Confirm each link in the chain actually works:

# 1. Context propagates: send a request with a known, sampled traceparent,
#    then confirm spans from multiple services share that trace_id in Tempo/Jaeger.
curl -s -o /dev/null \
  -H "traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01" \
  http://localhost:8080/checkout

# 2. Tempo is receiving traces (fetch the trace by ID from the query API)
curl -s "http://tempo:3200/api/traces/4bf92f3577b34da6a3ce929d0e0e4736" | jq '.batches | length'

# 3. Span metrics exist in Prometheus
curl -s "http://prometheus:9090/api/v1/query?query=traces_span_metrics_calls_total" | jq '.data.result | length'

# 4. Exemplars are stored (must print a count greater than 0; date -d is GNU date syntax)
curl -s "http://prometheus:9090/api/v1/query_exemplars?query=traces_span_metrics_duration_milliseconds_bucket&start=$(date -u -d '-1 hour' +%s)&end=$(date -u +%s)" | jq '.data | length'

Next steps

Add TraceQL-based alerting and span-metrics SLOs so a rising error rate links straight to exemplar traces. Wire OpenTelemetry continuous profiling (Pyroscope) as a fourth correlated signal — jump from a slow span to a flame graph of that exact code path. And formalize the cardinality budget for span attributes and metric dimensions the same way you do for any time series, before a well-meaning user.id attribute quietly triples your storage bill.
