Distributed Tracing End-to-End: Context Propagation, Tempo, and Correlating Traces with Metrics and Logs

A dashboard tells you the p99 latency on the checkout endpoint just doubled. A trace tells you which of the eleven downstream calls caused it, on which pod, with which tenant ID in the baggage. Distributed tracing is the only observability signal that reconstructs a single request’s path across service boundaries — and it is worthless if context drops on the first hop, if your spans are named so badly the backend can’t aggregate them, or if the trace sits in Tempo while you stare at a Grafana panel with no way to jump to it. Most teams that “have tracing” are in exactly that state: traces exist, nobody looks at them, and every incident still starts with grep.

This article is the end-to-end wiring, treated as a system rather than a product install: the byte layout of the traceparent header, deliberate span design with OpenTelemetry, a hard-nosed Grafana Tempo vs Jaeger comparison (storage model, components, TraceQL versus tag search), and the three correlation links that make tracing pay rent — exemplars (metrics → trace), trace_id in logs (trace ↔ log), and service graphs + span metrics (RED derived from traces). Plus head versus tail sampling, multi-tenancy and retention on both backends, and the five-pivot debugging workflow — alert → metric → exemplar → trace → log — that turns a page at 02:14 into a root cause by 02:25.

This is the concept-and-decision piece. When you are ready to stand up the Tempo backend itself — S3 block storage, compaction, metrics-generator tuning, TraceQL metrics — the implementation companion is Configure Grafana Tempo with TraceQL, Metrics-Generator, and S3 Block Storage. Here we build the mental model that makes every config line in that guide obvious.

What problem this solves

Metrics tell you that something is wrong, aggregated across everything. Logs tell you what one process said, in isolation. Neither answers the question that costs you money: which hop, in which service, on which code path, for which subset of requests? In a monolith you attached a debugger. In a system of 40 services, 3 message buses and 2 clouds, a checkout request touches 15 processes — and the slow one is three hops from the one that alerted.

Without tracing done properly, three failure patterns repeat everywhere. The war-room fan-out: an SLO alert fires at the gateway, nobody can see the request path, so every downstream team gets paged and declares “our dashboards look fine”; mean time to innocence beats mean time to resolution. The unaggregatable trace swamp: Jaeger was installed two years ago, spans are named GET /users/8231 and process, and uniform 1% sampling means that, if each of the 30 errors that mattered sits in its own trace, there is roughly a 74% chance (0.99^30) that none of them was kept. The disconnected pillars: traces, metrics and logs in three tools, with the on-call human as the join engine, copy-pasting timestamps between tabs.

The fix is not “buy an APM” — it is six contracts applied ruthlessly: W3C propagation on every hop including queues; semantic-convention span names and attributes; a sampling policy that guarantees errors and outliers are kept; RED metrics generated from spans; exemplars so every latency panel is a door into a trace; and trace_id on every log line. Wire those six and the three pillars collapse into one investigation surface.

Who hits this: platform teams standardising observability across polyglot services, SREs whose reviews keep saying “detection fast, localisation slow”, and architects choosing between Jaeger-on-Elasticsearch and Tempo-on-S3 with a real budget attached.

Learning objectives

By the end of this article you can:

Explain span anatomy (trace_id, span_id, parent, kind, status, attributes, events, links) and read a traceparent header byte by byte.
Propagate context across HTTP, gRPC, and message queues (Kafka/SQS/RabbitMQ), and choose parent-child versus span links for fan-out and batch consumers.
Instrument services with the OTel SDK and zero-code agents, tune the batch processor, and apply semantic conventions so backends can aggregate your spans.
Compare Tempo and Jaeger component by component — ingest, storage, index, query language, scale model — and defend a backend choice in an architecture review.
Design a two-stage sampling policy: consistent head sampling in the SDK plus tail sampling in a Collector gateway that keeps 100% of errors and latency outliers.
Generate RED metrics and service graphs from spans (Collector spanmetrics/servicegraph connectors or Tempo’s metrics-generator) without double counting.
Wire exemplars end to end (SDK → Collector → Prometheus exemplar-storage → Grafana exemplarTraceIdDestinations) and both directions of trace↔log correlation (Loki derived fields, tracesToLogsV2).
Operate the system: multi-tenant isolation with X-Scope-OrgID, per-tenant retention and rate limits, and a troubleshooting playbook for broken traces, missing exemplars and empty correlations.

Prerequisites & where this fits

You should be comfortable with Kubernetes (Deployments, Services, Helm), the Prometheus data model (labels, histograms, rate()), and one language well enough to read its OTel snippets — and know the OpenTelemetry Collector at block-diagram level: receivers, processors, exporters. The deep dive lives in Building Production OpenTelemetry Collector Pipelines: Receivers, Processors, and Tail Sampling; this article uses the Collector but does not re-teach it.

This piece sits at the centre of the observability track. Upstream of it: PromQL fluency (PromQL in Anger: Rate, Histograms, and Aggregation Patterns That Actually Work) and log pipeline design (Grafana Loki Deep Dive: LogQL, Label Cardinality, and Chunk Storage Tuning). Downstream: the Tempo implementation guide already mentioned, tail sampling at fleet scale (Tail-Based Sampling at Scale with the OpenTelemetry Collector and Load-Balancing Exporter), and language-specific instrumentation (OpenTelemetry for Java Services: Auto-Instrumentation, Context Propagation, and Custom Spans).

Core concepts: spans, traces, and span context

A span is one timed unit of work. A trace is a tree of spans sharing one 16-byte trace_id. Each span carries an 8-byte span_id and (except the root) a parent_span_id. That is the whole data model; everything else — waterfalls, service graphs, RED metrics, exemplars — is derived from it.

trace_id = 4bf92f3577b34da6a3ce929d0e0e4736

  [HTTP GET /checkout]              span A (root, parent=none, kind=SERVER)
    [validate-cart]                 span B (parent=A, kind=INTERNAL)
    [POST inventory.Reserve]        span C (parent=A, kind=CLIENT)
       [grpc inventory.Reserve]     span D (parent=C, kind=SERVER, other process)
          [SELECT stock]            span E (parent=D, kind=CLIENT)
    [POST payments.Charge]          span F (parent=A, kind=CLIENT)

Spans C and D are the crux of distributed tracing: the caller and callee each create a span, in different processes, and D’s parent is C only because C’s context rode across the wire. Every field of a span has an operational consequence:

Span field	Type / size	Set by	Why it matters operationally
`trace_id`	16 bytes (32 hex)	Root span’s SDK	The join key for every correlation in this article
`span_id`	8 bytes (16 hex)	Each span’s SDK	Log correlation at sub-request granularity
`parent_span_id`	8 bytes	From extracted context	Wrong/missing → orphan spans, broken waterfalls
Name	Low-cardinality string	You / instrumentation	Aggregation key for span metrics; high cardinality here poisons everything
`kind`	Enum (5 values)	Instrumentation	Drives service graphs and client/server latency split
Start / end time	Nanosecond timestamps	SDK	Duration histograms; clock skew shows here
Attributes	Key-value map (default limit 128/span)	You / semconv	Filtering (TraceQL/tags), metric dimensions
Events	Timestamped sub-records (limit 128)	You (`record_exception`)	Exceptions, retries, cache misses on the timeline
Links	References to other span contexts (limit 128)	You	Batch/fan-out causality without a fake parent
Status	`Unset` / `Ok` / `Error`	You / instrumentation	Error-rate metrics and tail-sampling policies key off it
Resource	Process-level attributes (`service.name`…)	SDK at startup	Identifies the emitting service/pod/version everywhere
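
The parent_span_id row is easiest to appreciate by rebuilding a tree from flat span records, which is roughly what a backend does before drawing a waterfall. The sketch below is illustrative only: SpanRecord and build_tree are hypothetical names, not part of any SDK or backend, and real implementations also handle clock skew, duplicates and late-arriving spans.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanRecord:
    # Hypothetical flat record, roughly what a backend stores per span.
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    name: str


def build_tree(spans):
    """Split the spans of one trace into roots, a children map, and orphans."""
    by_id = {s.span_id: s for s in spans}
    children = {s.span_id: [] for s in spans}
    roots, orphans = [], []
    for s in spans:
        if s.parent_span_id is None:
            roots.append(s)
        elif s.parent_span_id in by_id:
            children[s.parent_span_id].append(s)
        else:
            # The parent never arrived: dropped context, sampling, or export loss.
            orphans.append(s)
    return roots, children, orphans
```

A non-empty orphan list for a trace is usually the first hint that context was lost or that a parent span was never exported.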

SpanKind deserves its own table because service graphs and the client/server latency split are computed from it, and getting it wrong silently breaks both:

Kind	Meaning	Created by	Typical pairing
`SERVER`	Handling an inbound synchronous request	HTTP/gRPC server instrumentation	Child of a remote `CLIENT` span
`CLIENT`	Making an outbound synchronous request	HTTP/gRPC/DB client instrumentation	Parent of a remote `SERVER` span
`PRODUCER`	Enqueueing an async message	Messaging producer instrumentation	Linked/parented by a remote `CONSUMER`
`CONSUMER`	Processing an async message	Messaging consumer instrumentation	Child of or linked to `PRODUCER`
`INTERNAL`	In-process work, no remote party	Manual spans, default	Anywhere inside a service
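
To see why a wrong kind silently breaks service graphs, here is a minimal sketch of edge derivation from CLIENT/SERVER pairs. The dict keys (span_id, parent_span_id, kind, service) and the function name service_edges are assumptions made for illustration; production generators also pair PRODUCER with CONSUMER and tolerate spans arriving out of order.

```python
from collections import Counter


def service_edges(spans):
    """Count caller -> callee edges from CLIENT parent / SERVER child pairs.

    Each span is a dict with hypothetical keys:
    span_id, parent_span_id, kind, service.
    """
    by_id = {s["span_id"]: s for s in spans}
    edges = Counter()
    for s in spans:
        parent = by_id.get(s.get("parent_span_id"))
        if parent is None:
            continue
        # Only a SERVER span whose parent is a CLIENT span forms an edge.
        if s["kind"] == "SERVER" and parent["kind"] == "CLIENT":
            edges[(parent["service"], s["service"])] += 1
    return edges
```

If the caller's span were mislabelled INTERNAL, the pair would not match and the checkout to inventory edge would simply be missing from the graph, with no error anywhere.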

Two rules complete the model. Status discipline: leave status Unset for normal flow, set Error only for genuine failures, and record the exception as a span event. A 404 on a lookup endpoint is a valid outcome, and marking it Error corrupts every error-rate metric derived from spans. Span context: only trace_id, span_id, trace flags and trace state travel between processes. Attributes, events and names do not propagate, so anything a downstream service must see goes in baggage, explicitly.
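
The status rule can be written down as a small decision function. This sketch assumes the HTTP semantic-convention guidance that 5xx responses are errors for both sides, while 4xx responses are errors only from the client's point of view; span_status_for_http is a hypothetical helper, and instrumentation libraries normally apply this logic for you.

```python
def span_status_for_http(status_code: int, kind: str) -> str:
    """Map an HTTP status code to a span status name for a given span kind."""
    if status_code >= 500:
        # The server failed: an error for the SERVER span and the CLIENT span.
        return "ERROR"
    if 400 <= status_code < 500 and kind == "CLIENT":
        # The caller asked for something it could not get.
        return "ERROR"
    # 1xx-3xx, and 4xx on the server side, are valid outcomes.
    return "UNSET"
```

Under these assumptions a 404 on the lookup endpoint leaves the SERVER span Unset, so it never enters the server-side error rate.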

Context propagation: W3C traceparent and where traces break

Propagation is serialising the active span context into an outbound carrier (HTTP headers, gRPC metadata, Kafka record headers) and deserialising it on the other side so the remote span parents correctly. Drop it on any hop and the trace splits, which is the most common tracing failure in production fleets.

The standard is W3C Trace Context: two headers, parsed byte-for-byte by every SDK and proxy:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^^ ^------------------------------^ ^--------------^ ^^
             ver           trace-id (16B)          parent-id (8B)  flags

tracestate: vendorA=opaqueValue,vendorB=opaqueValue
baggage:    tenant.id=acme,checkout.experiment=b

`traceparent` field	Size	Legal values	Gotcha
`version`	1 byte (2 hex)	`00` today	`ff` is invalid; parsers must accept future versions and read the known prefix
`trace-id`	16 bytes (32 hex)	Any non-zero	All-zero = invalid → receiver must start a new trace
`parent-id`	8 bytes (16 hex)	Any non-zero	This is the caller’s span id, not the trace root
`trace-flags`	1 byte (2 hex)	Bit 0 = sampled (`01`)	Level 2 of the spec adds bit 1 (`02`) = random trace-id, used by consistent-probability samplers
`tracestate`	≤ 32 list members	Vendor key=value pairs	Must be propagated verbatim even if you don’t understand it; SDKs may truncate past 512 chars
`baggage` (separate header)	Spec: ≥ 64 entries / 8192 bytes supported	Arbitrary key=value	Propagates to every downstream service and often into logs — never put secrets or PII here
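
The traceparent rows above translate almost directly into a parser. What follows is a minimal sketch rather than a replacement for an SDK propagator: parse_traceparent and the SpanContext tuple are hypothetical names, and the handling of future versions (accept extra trailing fields, read the known prefix) is a simplified reading of the spec.

```python
import re
from typing import NamedTuple, Optional

# version - trace-id - parent-id - flags, lowercase hex only.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})(-.*)?$"
)


class SpanContext(NamedTuple):
    version: int
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the parsed context, or None if the header must be ignored."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags, extra = match.groups()
    if version == "ff":
        return None  # reserved, always invalid
    if version == "00" and extra:
        return None  # version 00 has exactly four fields
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero ids are invalid; the receiver starts a new trace
    sampled = bool(int(flags, 16) & 0x01)  # bit 0 of trace-flags
    return SpanContext(int(version, 16), trace_id, parent_id, sampled)
```

Returning None here stands for "start a new trace", which is what a receiver is expected to do with an invalid header.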

W3C is the standard, but it is not the only format alive in real fleets — and mismatched propagators are the number-one cause of silently broken traces at organisational boundaries:

Propagator	Header(s)	Origin / where you still meet it	OTel `OTEL_PROPAGATORS` value
W3C Trace Context	`traceparent`, `tracestate`	OTel default; modern meshes and gateways	`tracecontext`
W3C Baggage	`baggage`	OTel default companion	`baggage`
B3 single	`b3: <trace>-<span>-<sampled>`	Zipkin heritage; Envoy/Istio (historically), Linkerd	`b3`
B3 multi	`X-B3-TraceId`, `X-B3-SpanId`, `X-B3-Sampled`, …	Spring Cloud Sleuth defaults, older meshes	`b3multi`
Jaeger	`uber-trace-id: <trace>:<span>:<parent>:<flags>`	Legacy Jaeger clients (retired), old sidecars	`jaeger`
AWS X-Ray	`X-Amzn-Trace-Id: Root=1-…;Parent=…;Sampled=1`	ALB injects it; X-Ray SDKs	`xray`
None	—	Misconfigured distros	`none` (useful only for tests)

Spec-compliant OTel SDKs default to tracecontext,baggage. Breakage comes from mixed fleets: a Sleuth service emitting only B3 calls an OTel service expecting traceparent, and the trace forks. The fix is either standardising on W3C everywhere or running a composite propagator that reads all formats during the migration:

# Transitional setting on every legacy-adjacent service:
export OTEL_PROPAGATORS="tracecontext,baggage,b3multi"
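
What "reads all formats" means can be sketched in a few lines of plain Python. This is a simplified model, not the SDK's composite propagator: extract_any is a hypothetical function, it prefers W3C when both formats are present (real composite propagators apply each configured propagator in turn), and it ignores tracestate, baggage and the B3 single-header form.

```python
import re

_W3C = re.compile(r"^[0-9a-f]{2}-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def extract_any(headers):
    """Return (trace_id, span_id, sampled) from W3C or B3 multi headers."""
    h = {k.lower(): v for k, v in headers.items()}
    m = _W3C.match(h.get("traceparent", ""))
    if m:
        trace_id, span_id, flags = m.groups()
        return trace_id, span_id, bool(int(flags, 16) & 0x01)
    trace_id, span_id = h.get("x-b3-traceid"), h.get("x-b3-spanid")
    if trace_id and span_id:
        # B3 permits 64-bit trace ids; left-pad to the 128-bit W3C width.
        sampled = h.get("x-b3-sampled") in ("1", "true")
        return trace_id.lower().rjust(32, "0"), span_id.lower(), sampled
    return None  # no usable context: the caller starts a new root
```

The padding step matters during migrations: a 64-bit B3 trace id and its padded 128-bit form must be treated as the same trace, or the logs on either side of the boundary will not join.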

Carriers: how context physically travels per transport

For synchronous HTTP/gRPC, instrumentation libraries inject and extract automatically once the propagator is set. Everything else needs deliberate work:

Transport	Carrier for `traceparent`	Auto-instrumented?	What you must do
HTTP 1.1/2	Request headers	Yes (all major libs)	Nothing — verify header on the wire
gRPC	Metadata	Yes	Nothing
Kafka	Record headers	Partially (client libs vary)	`inject()` on produce, `extract()` on consume
RabbitMQ / AMQP	Message properties/headers	Partially	Same inject/extract pattern
AWS SQS/SNS	Message attributes	Via AWS SDK instrumentation	Watch the 10-attribute SQS limit — tracing takes one
Azure Service Bus	Application properties	Via Azure SDK (`Diagnostic-Id`)	Azure SDKs emit W3C-compatible context
Cron / batch job	Nothing inbound	No	Start a new root; link to triggering trace if an ID was persisted
Database	No standard carrier	n/a	The `CLIENT` span is the leaf; DB-side tracing is separate

Async messaging is where traces die most often: the producer’s HTTP context is long gone when a consumer picks up the message, so context must ride inside the message:

# Producer: inject context into Kafka record headers
from opentelemetry.propagate import inject

carrier = {}
inject(carrier)                       # writes traceparent/baggage into the dict
producer.send("orders",
              value=payload,
              headers=[(k, v.encode()) for k, v in carrier.items()])

# Consumer: extract, then decide — child (1:1) or links (batch/fan-out)
from opentelemetry.propagate import extract
from opentelemetry import trace

tracer = trace.get_tracer("orders.consumer")

# 1:1 processing → parent-child is honest
ctx = extract({k: v.decode() for k, v in msg.headers})
with tracer.start_as_current_span("orders process", context=ctx,
                                  kind=trace.SpanKind.CONSUMER):
    handle(msg)

# Batch of 500 → parenting one poll span under 500 producers is meaningless.
# Use span LINKS: causality without a fake tree.
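# extract() returns a Context; the remote SpanContext lives on the span it carries.
links = [
    trace.Link(trace.get_current_span(
        extract({k: v.decode() for k, v in r.headers})).get_span_context())
    for r in records
]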
with tracer.start_as_current_span("orders settle-batch", links=links,
                                  kind=trace.SpanKind.CONSUMER) as span:
    span.set_attribute("messaging.batch.message_count", len(records))
    settle(records)

Where traces break: the boundary audit

Walk every boundary against this table once and you will find the split traces before an incident does:

Boundary	How the trace breaks	How to confirm	Fix
Mixed propagator fleets (B3 vs W3C)	Downstream starts a new root	Compare `trace_id` in both services’ logs for one request	Composite propagator during migration; W3C end-state
Message queues	Consumer span has no parent	Consumer traces are all single-service trees	Inject/extract into message headers (above)
Proxies/gateways that strip unknown headers	`traceparent` never arrives	`curl -v` through the proxy; tcpdump the header	Allowlist `traceparent`, `tracestate`, `baggage`
Thread pools / goroutines / async runtimes	New work has no active context	Orphan `INTERNAL` spans with fresh trace_ids	Use context-aware executors (`Context.taskWrapping`, `contextvars`)
Serverless (Lambda/Functions) triggers	Event path drops HTTP context	Function traces never join caller traces	Use the platform’s OTel layer/extension; extract from event payload
Batch/ETL jobs	No inbound context at all	Every run is a root trace (fine!)	Persist upstream `trace_id`; add a span link on the job root
Browser → backend	CORS blocks `traceparent`	Preflight fails or header missing server-side	Add `traceparent` to `Access-Control-Allow-Headers`

Two boundary rules. The sampled bit is part of the contract: ParentBased samplers downstream honour the 01 flag — that is what keeps traces complete instead of half-sampled. Baggage is a loaded weapon: the only way to move business context (tenant, experiment, priority) to every hop, and the easiest way to leak PII into every hop’s logs — budget it to a few short keys and strip it at the trust boundary with a Collector attributes/redaction processor.
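
Budgeting baggage is easier to reason about with the header in hand. The sketch below filters a baggage header down to an allowlist at a trust boundary; filter_baggage, the allowlist and the 64-character value cap are all assumptions for illustration, and in a real deployment the same policy would more likely live in a gateway or Collector processor than in application code.

```python
def filter_baggage(header: str, allowed_keys, max_value_len: int = 64) -> str:
    """Keep only allowlisted, short baggage members from a baggage header."""
    kept = []
    for member in header.split(","):
        key, sep, value = member.strip().partition("=")
        key = key.strip()
        if not sep or key not in allowed_keys:
            continue  # malformed member or key outside the budget
        # Drop optional ";property" metadata attached to the value.
        value = value.split(";", 1)[0].strip()
        if len(value) <= max_value_len:
            kept.append(f"{key}={value}")
    return ",".join(kept)
```

Called with the example header from earlier and an allowlist of tenant.id and checkout.experiment, it passes both through and drops anything else a caller tried to attach.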

Instrumenting with OpenTelemetry: SDK, auto-instrumentation, Collector

OpenTelemetry won the instrumentation war: vendor-neutral API + SDK + wire protocol (OTLP), with Tempo and Jaeger both OTLP-native. Instrumenting well means understanding the SDK pipeline you are configuring, by code or by environment variables:

SDK component	Role	Default	Production setting
`TracerProvider`	Factory holding resource, sampler, processors	One global	One per process, set at startup before any span
`Resource`	Process-level identity attributes	Auto-detected basics	Set `service.name`, `service.version`, `deployment.environment.name` explicitly
`Sampler`	Head-sampling decision at span creation	`parentbased_always_on`	`parentbased_traceidratio` (see sampling section)
`SpanProcessor`	Hook between end-of-span and export	—	`BatchSpanProcessor` always; `SimpleSpanProcessor` only in tests
`SpanExporter`	Serialisation + transport	OTLP	OTLP gRPC (4317) or HTTP/protobuf (4318) to a local Collector
`Propagator`	Context inject/extract format	`tracecontext,baggage`	Keep; add `b3multi` transitionally
Span limits	Cap attributes/events/links per span	128 / 128 / 128	Raise deliberately, never unbounded

The full SDK bootstrap in Python — every line maps to a row above:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

resource = Resource.create({
    "service.name": "checkout",
    "service.version": "1.42.0",
    "deployment.environment.name": "prod",
})

provider = TracerProvider(
    resource=resource,
    sampler=ParentBased(root=TraceIdRatioBased(0.25)),
)
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://otel-agent:4317", insecure=True),
))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout.orders", "1.0.0")

In containerised fleets you rarely hardcode this — spec-defined environment variables that every SDK honours make instrumentation a deployment concern:

Variable	Purpose	Default	Production note
`OTEL_SERVICE_NAME`	Sets `service.name`	`unknown_service:<proc>`	Mandatory — unnamed services wreck every view
`OTEL_RESOURCE_ATTRIBUTES`	Extra resource attrs	—	`service.version=1.42.0,deployment.environment.name=prod`
`OTEL_EXPORTER_OTLP_ENDPOINT`	OTLP target	`http://localhost:4317` (gRPC) or `http://localhost:4318` (HTTP)	Point at node-local Collector agent, not the backend
`OTEL_EXPORTER_OTLP_PROTOCOL`	`grpc` / `http/protobuf`	varies by SDK	`http/protobuf` traverses L7 proxies more predictably
`OTEL_TRACES_SAMPLER`	Head sampler	`parentbased_always_on`	`parentbased_traceidratio`
`OTEL_TRACES_SAMPLER_ARG`	Sampler parameter	—	e.g. `0.25`
`OTEL_PROPAGATORS`	Propagator list	`tracecontext,baggage`	Add `b3multi` only during migrations
`OTEL_BSP_SCHEDULE_DELAY`	Batch flush interval	5000 ms	Lower (1–2 s) for spiky short-lived pods
`OTEL_BSP_MAX_QUEUE_SIZE`	Buffered spans before drop	2048	Raise for chatty services; watch dropped-span counters
`OTEL_BSP_MAX_EXPORT_BATCH_SIZE`	Spans per export call	512	Keep ≤ queue size; bigger batches = fewer RPCs
`OTEL_SPAN_ATTRIBUTE_COUNT_LIMIT`	Attrs per span	128	Guardrail against attribute explosions
`OTEL_ATTRIBUTE_VALUE_LENGTH_LIMIT`	Attr value truncation	unlimited	Set (e.g. 1024) to stop request-body dumping
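
The three OTEL_BSP_* knobs interact, and a toy model makes the drop behaviour concrete. This is not the SDK's BatchSpanProcessor: BoundedSpanQueue is a hypothetical class, it has no timer or export thread, and it drops the newest span when full, whereas which span gets discarded can differ between SDKs.

```python
from collections import deque


class BoundedSpanQueue:
    """Toy model of a batch span processor's bounded buffer."""

    def __init__(self, max_queue_size=2048, max_export_batch_size=512):
        self.max_queue_size = max_queue_size
        self.max_export_batch_size = max_export_batch_size
        self.queue = deque()
        self.dropped = 0

    def on_end(self, span):
        # Called when a span finishes; a full queue means the span is lost.
        if len(self.queue) >= self.max_queue_size:
            self.dropped += 1
            return
        self.queue.append(span)

    def export_once(self):
        # One export call takes at most max_export_batch_size spans.
        n = min(self.max_export_batch_size, len(self.queue))
        return [self.queue.popleft() for _ in range(n)]
```

With the defaults, a burst of 3,000 spans between two export calls overflows the queue, which is the situation the dropped-span counters in the table are there to reveal.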

Zero-code versus manual instrumentation

You almost never write spans for HTTP servers, clients, and database drivers — auto-instrumentation covers those. You do write spans for business logic worth seeing on a timeline:

Language	Zero-code mechanism	Coverage quality	Custom spans
Java	`-javaagent:opentelemetry-javaagent.jar` (bytecode weaving)	Excellent — hundreds of libraries	`@WithSpan` or Tracer API
Python	`opentelemetry-instrument <cmd>` bootstrap	Very good (Django/Flask/FastAPI/requests/…)	Tracer API
Node.js	`--require @opentelemetry/auto-instrumentations-node/register`	Very good	Tracer API
.NET	Auto-instrumentation (CLR profiler) or built-in `ActivitySource`	Very good; tracing is semi-native	`ActivitySource`
Go	No agent — compile-time libraries (`otelhttp`, `otelgrpc`); eBPF option	Manual-ish by design	Explicit `tracer.Start` everywhere
Anything (incl. Go)	eBPF zero-code (Grafana Beyla / OBI) from the kernel	Protocol-level spans, no app context	Complements, doesn’t replace SDK

# Java: full tracing with zero code changes (-Dotel.* mirrors the env vars above)
java -javaagent:/otel/opentelemetry-javaagent.jar \
     -Dotel.service.name=checkout \
     -Dotel.traces.sampler=parentbased_traceidratio \
     -Dotel.traces.sampler.arg=0.25 \
     -jar checkout.jar

The Kubernetes-native version of this — the OTel Operator injecting agents via pod annotations — is covered in Deploy the OpenTelemetry Operator with Target Allocator and Auto-Instrumentation, and the eBPF path in Zero-Code Instrumentation with eBPF: Grafana Beyla and OpenTelemetry.

Semantic conventions: the attribute contract

Backends aggregate spans by attributes. If every team invents its own keys (http_status, statusCode, code), TraceQL queries, span-metrics dimensions and vendor integrations all break. Semantic conventions are the shared vocabulary — use them verbatim:

Domain	Key attributes (stable semconv)	Example values
HTTP	`http.request.method`, `http.route`, `http.response.status_code`, `url.path`, `server.address`	`POST`, `/users/:id`, `503`
Database	`db.system.name`, `db.query.text` (sanitised), `db.operation.name`, `db.collection.name`, `db.namespace`	`postgresql`, `SELECT`, `orders`
RPC	`rpc.system`, `rpc.service`, `rpc.method`, `rpc.grpc.status_code`	`grpc`, `Inventory`, `Reserve`, `14`
Messaging	`messaging.system`, `messaging.destination.name`, `messaging.operation.type`, `messaging.batch.message_count`	`kafka`, `orders`, `process`, `500`
Errors	`exception.type`, `exception.message`, `exception.stacktrace` (as span event)	`TimeoutError`
Resource	`service.name`, `service.version`, `service.namespace`, `deployment.environment.name`, `k8s.pod.name`	`checkout`, `1.42.0`, `prod`
End user / business	`enduser.id` (careful — PII), `tenant.id` (custom)	hashed IDs only

The naming rule that protects aggregation — the span name is a low-cardinality template; variables are attributes:

Bad span name	Failure it causes	Correct form
`GET /users/8231`	One metric series and one search bucket per user	`GET /users/:id` + `http.route`
`query-orders-2026-06-04`	Per-day names — span metrics explode	`SELECT orders` + `db.query.text`
`process`	Aggregates everything together — useless	`orders process`
`SELECT * FROM orders WHERE id=8231`	Query text (and data!) as name	`SELECT orders` + sanitised `db.query.text`
`charge-tenant-acme`	Tenant in name = cardinality + leakage	`payments charge` + `tenant.id` attr
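
The naming rule can be enforced in code where a framework does not supply a route template. The helper below is a fallback sketch under stated assumptions: templatize_path and http_span_name are hypothetical names, only numeric and UUID path segments are treated as variables, and a route reported by the framework (http.route) should always be preferred over guessing.

```python
import re

# Path segments treated as variables: integers and UUIDs (an assumption).
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})$",
    re.IGNORECASE,
)


def templatize_path(path: str) -> str:
    """Replace variable-looking path segments with a ':id' placeholder."""
    segments = path.split("?", 1)[0].split("/")
    return "/".join(":id" if _ID_SEGMENT.match(s) else s for s in segments)


def http_span_name(method: str, path: str, route=None) -> str:
    """Build a low-cardinality span name; the raw path stays an attribute."""
    return f"{method.upper()} {route or templatize_path(path)}"
```

The raw path still belongs on the span as url.path, so nothing is lost for debugging; it just stops being the aggregation key.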

Finally, avoid span explosion: no span per loop iteration or per row — processing 10,000 records is one span with batch.size=10000. A span should represent work worth seeing on a timeline (a network call, a transaction, a meaningful compute phase); everything finer is pure ingest, storage and query cost.

Tempo vs Jaeger: architecture, storage, and query

Both are open-source, OTLP-native trace backends, and both render a waterfall. The architectural bets underneath are near-opposite, and they decide your cost curve and your query ergonomics. The headline matrix first, then the mechanics:

Concern	Grafana Tempo	Jaeger
Core bet	Traces are looked up (by ID, from exemplars/logs) → minimal index, object storage	Traces are searched (by tags) → real index in a database
Storage	Object storage only: S3, GCS, Azure Blob, local disk	Cassandra, Elasticsearch/OpenSearch, Badger (single-node), memory; ClickHouse via v2 ecosystem
Block/index format	Columnar Apache Parquet (vParquet), scanned by TraceQL	Backend-native rows/documents + full index
Ingest	OTLP native (also Jaeger/Zipkin receivers)	OTLP native since 1.35; legacy Thrift/proto ports
Query language	TraceQL — structural, attribute, duration, aggregate queries	Tag/service/operation/duration search (UI + API); no structural language
Scale-out shape	Microservices: distributor, ingester, querier, query-frontend, compactor, metrics-generator	v1: collector, query, ingester(+Kafka); v2: OTel-Collector-based binary per role
RED-from-spans	Built-in metrics-generator (span metrics + service graphs)	Via Collector `spanmetrics` connector + SPM UI tab reading Prometheus
Multi-tenancy	First-class: `X-Scope-OrgID`, per-tenant limits/retention	None native — per-team instances or index-prefix separation
Retention	Compactor `block_retention` (default 336h), per-tenant override	Backend-dependent: ES ILM/index-cleaner, Cassandra TTL, Badger TTL
Operational cost centre	Cheap storage; CPU at query time (scan)	Database cluster 24×7 (ES/Cassandra); cheap reads
Project status	Actively developed (Tempo 2.x)	v1 in maintenance; Jaeger v2 (GA Nov 2024) rebuilt on the OTel Collector framework

Storage and index: why the cost curves diverge

Jaeger writes each span into a database and maintains an index over service names, operation names, tags and durations. Elasticsearch/OpenSearch is the common production choice: one index per day (jaeger-span-2026-06-04), shards and replicas to size, sub-second searches over any indexed tag. The price is the database itself: at tens of thousands of spans per second the cluster needs multiple data nodes with fast disks running around the clock — you pay for peak, always.

Tempo writes spans into columnar Parquet blocks in object storage with only a tiny bloom-filter/index footprint per block. Trace-by-ID lookup is cheap; search means queriers scanning block columns in parallel — brute force made affordable by Parquet’s column pruning and object-storage bandwidth. The price moves from storage to query-time CPU — elastic and mostly idle — and retention costs S3 prices, not ES-node prices. The block/compaction/caching machinery is what Configure Grafana Tempo with TraceQL, Metrics-Generator, and S3 Block Storage walks through.

Storage question	Tempo answer	Jaeger answer
Durable store	S3/GCS/Azure Blob bucket	ES/OpenSearch or Cassandra cluster (Badger = single node only)
Index size	Bloom filters + block metadata (tiny)	Full inverted index (ES) / partition keys (Cassandra)
Find by trace ID	Bloom-filter block lookup — fast	Primary-key read — fast
Search by attribute	Parallel Parquet scan (TraceQL), seconds over large windows	Index hit, sub-second (ES)
30-day retention of 5 TB	~₹9,600/mo S3 Standard (≈ $115)	3+ ES data nodes ≈ $400–800/mo before storage
Scaling ingest	Add distributors/ingesters (stateless-ish, WAL)	Scale collectors + database write capacity
Failure domain	Bucket + WAL disks	The database cluster (its own on-call surface)
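
The object-storage figure in the table is simple arithmetic, reproduced here so you can substitute your own volume and price. The unit price of 0.023 USD per GB-month is an assumption (a commonly quoted S3 Standard list price; check your region and tier), decimal terabytes are used, and request, retrieval and data-transfer charges are deliberately left out.

```python
def s3_monthly_cost_usd(stored_tb: float, usd_per_gb_month: float = 0.023) -> float:
    """Storage-only monthly cost for a steady-state volume of trace blocks."""
    stored_gb = stored_tb * 1000  # decimal TB to GB
    return stored_gb * usd_per_gb_month


if __name__ == "__main__":
    # 5 TB held for the whole retention window, storage cost only.
    print(round(s3_monthly_cost_usd(5), 2))
```

The Jaeger column has no equivalent one-line formula, because the cost is dominated by the node count of the database cluster rather than by bytes stored.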

Query: TraceQL versus Jaeger search

Jaeger’s search is a form: service, operation, tags (error=true http.status_code=503), min/max duration, lookback, limit (default 20, UI caps at 1500 results). It is fast and fine for “show me recent errors in checkout”. It cannot express structure — “traces where checkout calls payments AND the payments span was slow” — because the index is per-span, not per-tree.

TraceQL is a pipeline language over trace structure, and it is Tempo’s decisive UX advantage:

TraceQL query	What it finds
`{ resource.service.name = "checkout" && span.http.response.status_code >= 500 }`	Server errors in one service
`{ span.http.route = "/checkout" && duration > 800ms }`	Slow requests on one route
`{ status = error } && { span.db.system.name = "postgresql" }`	Traces containing both an error span and a Postgres span
`{ resource.service.name = "checkout" } >> { resource.service.name = "payments" && duration > 500ms }`	Checkout traces whose descendant payments span was slow
`{ resource.service.name = "checkout" } | avg(duration) > 1s`	Aggregate condition over matching spans
`{ span.tenant.id = "acme" && status = error }`	One tenant’s failing requests (your attribute discipline paying off)
`{ } | rate() by (span.http.route)`	TraceQL metrics: per-route request rate computed from stored spans (needs metrics-generator `local-blocks`)
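
The descendant operator in the fourth row is the one a per-span index cannot answer, because it needs the tree. As an illustration of the semantics only (Tempo evaluates this over Parquet columns, not Python dicts), the sketch below returns the spans matching the right-hand filter that have an ancestor matching the left-hand filter. The dict keys and the function names are hypothetical.

```python
def descendants(span_id, children):
    """Yield every span below span_id in the tree, depth first."""
    stack = list(children.get(span_id, []))
    while stack:
        s = stack.pop()
        yield s
        stack.extend(children.get(s["span_id"], []))


def match_descendant(spans, left, right):
    """Spans satisfying `right` that sit under a span satisfying `left`."""
    children = {}
    for s in spans:
        children.setdefault(s.get("parent_span_id"), []).append(s)
    hits = {}
    for s in spans:
        if left(s):
            for d in descendants(s["span_id"], children):
                if right(d):
                    hits[d["span_id"]] = d  # dedupe across several ancestors
    return list(hits.values())
```

Passing a left filter for service checkout and a right filter for service payments with a duration above 500 ms mirrors the query in the table.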

Capability side by side:

Capability	Jaeger UI/API	Tempo TraceQL
Service/operation/tag search	Yes (indexed, fast)	Yes (scan, parallel)
Duration filters	Min/max per span	Arbitrary predicates incl. per-span and trace-level
Structural conditions (child/descendant/sibling)	No	Yes (`>`, `>>`, `~`)
Aggregates in query (`count`, `avg` …)	No	Yes (pipeline stages)
Metrics from stored traces	No (SPM reads Prometheus)	TraceQL metrics (`rate`, `quantile_over_time` …)
Trace diff/compare view	Yes — underrated for regressions	Partial (compare in Explore workflows)
Saved/linkable queries	URL params	URL params + Grafana Explore/dashboards

Deployment shapes and the v2 note

Tempo ships one binary in two modes: monolithic (everything in one process — labs, small shops) and microservices (distributor → ingester → querier/query-frontend → compactor + metrics-generator) for scale, deployed via the tempo-distributed Helm chart. Jaeger v1’s shape is collector → storage ← query, optionally with Kafka and an ingester between collector and storage as a buffer. Jaeger v2 rebuilds the whole project as an OpenTelemetry Collector distribution: the same binary hosts receivers, processors, a jaeger_storage extension and the jaeger_query UI — which means Collector-style YAML config, first-class OTLP, and any Collector processor available natively in Jaeger:

# jaeger-v2.yaml — Jaeger as an OTel Collector distribution (Elasticsearch backend)
extensions:
  jaeger_query:
    storage:
      traces: es_main
  jaeger_storage:
    backends:
      es_main:
        elasticsearch:
          server_urls: ["http://elasticsearch:9200"]
          indices:
            index_prefix: "jaeger-prod"

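# The pipeline below references these, so they must be declared.
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters: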
  jaeger_storage_exporter:
    trace_storage: es_main

service:
  extensions: [jaeger_storage, jaeger_query]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger_storage_exporter]

And the Tempo equivalent skeleton, showing the object-storage bet directly:

# tempo.yaml (excerpt) — object storage + metrics-generator
multitenancy_enabled: false          # flip to true for X-Scope-OrgID isolation

distributor:
  receivers:
    otlp:
      protocols:
        grpc: { endpoint: "0.0.0.0:4317" }
        http: { endpoint: "0.0.0.0:4318" }

storage:
  trace:
    backend: s3
    s3:
      bucket: kloudvin-tempo-traces
      endpoint: s3.ap-south-1.amazonaws.com
      region: ap-south-1
    wal:
      path: /var/tempo/wal

compactor:
  compaction:
    block_retention: 336h            # 14 days, the global default

metrics_generator:
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus.observability.svc:9090/api/v1/write
        send_exemplars: true

overrides:
  defaults:
    metrics_generator:
      processors: [span-metrics, service-graphs]

The decision table

If this describes you…	Choose	Because
Grafana is your pane of glass; Loki/Mimir already run; cost per GB matters	Tempo	Native correlations, object-storage economics, one operational idiom (same microservices shape as Loki/Mimir)
High span volume (≥ 50k spans/s) and finance asks about the ES bill	Tempo	Storage cost collapses to object storage; queriers scale elastically
You need structural queries (“A calling B slower than X”)	Tempo	TraceQL is the only OSS language that expresses it
You already operate Elasticsearch well and need rich ad-hoc tag search	Jaeger (ES/OpenSearch)	The index you’re paying for is genuinely good at this
Air-gapped single node / edge appliance	Jaeger (Badger) or Tempo monolithic w/ local disk	Both fit; Badger is battle-tested for embedded
You want backend logic (sampling, redaction) config-native in the trace backend	Jaeger v2	It is an OTel Collector — processors run in-line
Hard tenant isolation (platform serving many teams)	Tempo	`X-Scope-OrgID` + per-tenant limits/retention are built in
Trace comparison UI for release regressions is a daily tool	Jaeger	Trace diff view is mature

From here the examples default to Tempo, whose correlation features are the tightest fit with Grafana — but every concept maps to Jaeger, and the tables call out where mechanics differ.

Sampling strategies: head, tail, and remote

Storing every span of every request is financially silly at scale: healthy traffic is overwhelmingly self-similar. The question is what to keep, and there are two places to decide — at trace start (head) or after completion (tail) — plus a hybrid where the backend pushes rates to SDKs (remote).

Head sampling happens in the SDK at root-span creation. The standard samplers:

Sampler (`OTEL_TRACES_SAMPLER`)	Decision	Use when	Trap
`always_on`	Record everything	Dev, low traffic, labs	Cost at scale
`always_off`	Record nothing	Load-test noise suppression	You keep nothing, including errors
`traceidratio`	Keep fraction, derived deterministically from trace_id	Never alone in a distributed system	Ignores parent’s decision → torn traces
`parentbased_always_on`	Honour parent; sample all local roots	Default; internal services behind a sampled edge	Edge decides everything
`parentbased_traceidratio`	Honour parent; ratio for new roots	The production default	Ratio applies only at the edge/root
`jaeger_remote` / `parentbased_jaeger_remote`	Poll per-service/op rates from backend	Jaeger shops needing central control	One more moving part
`xray`	X-Ray centralised rules	AWS X-Ray estates	AWS-specific

Two properties make head sampling safe in a distributed system. Parent-based composition: every non-root service honours the inbound sampled flag, or you get half-traces. Consistency: traceidratio derives the decision from the trace_id itself, so every service computing the same ratio reaches the same answer (the W3C Level-2 random flag formalises the entropy). What head sampling cannot do is see the future — a 5% head sample throws away 95% of your errors, because the decision precedes the failure.
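
Both properties fit in a few lines. This is a minimal sketch, assuming the common approach of comparing the low 64 bits of the trace ID against a bound derived from the ratio; SDKs differ in the exact arithmetic, and the helper names `ratio_keeps` and `parent_based` are illustrative, not any SDK's API.

```python
from typing import Optional

TRACE_ID_MASK = (1 << 64) - 1  # assumption: only the low 64 bits feed the decision


def ratio_keeps(trace_id: int, ratio: float) -> bool:
    """Deterministic: the same trace_id and ratio always give the same answer."""
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound


def parent_based(trace_id: int, ratio: float, parent_sampled: Optional[bool]) -> bool:
    """Honour the inbound sampled flag; apply the ratio only when there is no parent."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_keeps(trace_id, ratio)


trace_id = int("4bf92f3577b34da6a3ce929d0e0e4736", 16)

# Consistency: two services computing the same ratio independently agree.
service_a = ratio_keeps(trace_id, 0.05)
service_b = ratio_keeps(trace_id, 0.05)

# Composition: the edge decides once, every downstream hop follows the flag.
edge = parent_based(trace_id, 0.05, parent_sampled=None)
downstream = parent_based(trace_id, 1.0, parent_sampled=edge)

print(service_a == service_b, edge == downstream)  # True True
```

The downstream service is configured with a ratio of 1.0 and still drops the trace, because the parent's decision wins. That is the behaviour that prevents half-traces.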

Tail sampling decides after the whole trace is assembled, in a Collector gateway tier, and can therefore keep all the interesting traces:

`tail_sampling` policy type	Keeps	Typical use
`status_code`	Traces containing `ERROR` spans	Keep 100% of failures — the core policy
`latency`	Traces over `threshold_ms`	Keep every outlier
`probabilistic`	N% of everything else	The healthy-traffic baseline
`string_attribute` / `numeric_attribute`	Attribute matches (e.g. `tenant.id = "vip-…"`)	VIP tenants, canary versions
`boolean_attribute`	e.g. `debug.force_trace = true`	Support-driven “trace this user now”
`rate_limiting`	At most N spans/sec	Hard cost ceiling
`span_count`	Traces with ≥/≤ N spans	Catch pathological fan-outs
`trace_state` / `ottl_condition`	tracestate values / arbitrary OTTL	Vendor flags, complex predicates
`and` / `composite`	Boolean combos / ordered budget allocation	“Errors AND tenant=X”, budget split across policies

# Collector gateway: guarantee errors + outliers, sample the boring 95%
processors:
  tail_sampling:
    decision_wait: 10s          # buffer window before deciding (default 30s)
    num_traces: 200000          # in-flight traces held in memory
    policies:
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: healthy-baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

Tail sampling has real costs: the gateway buffers every in-flight trace in RAM for decision_wait, and all spans of a trace must reach the same Collector instance — at fleet scale that forces a two-tier layout with the loadbalancing exporter routing by trace ID, the subject of Tail-Based Sampling at Scale with the OpenTelemetry Collector and Load-Balancing Exporter.
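
The three policies in the gateway config above combine as "keep if any policy says keep". The sketch below models that under stated simplifications: the real processor evaluates policies per buffered trace and decides the probabilistic policy from the trace ID rather than a random draw, and the exact boundary behaviour at the threshold is an assumption here. `Span` and `tail_decision` are hypothetical names.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Span:
    status: str      # "OK", "ERROR" or "UNSET"
    start_ms: float
    end_ms: float


def tail_decision(
    spans: List[Span],
    threshold_ms: float = 1000,
    baseline_pct: float = 5.0,
    rng: Callable[[], float] = random.random,
) -> Optional[str]:
    """Return the name of the first policy that keeps the trace, or None to drop it."""
    if not spans:
        return None
    if any(s.status == "ERROR" for s in spans):
        return "keep-errors"
    # Trace duration: earliest span start to latest span end.
    duration = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    if duration > threshold_ms:
        return "keep-slow"
    if rng() * 100 < baseline_pct:
        return "healthy-baseline"
    return None


print(tail_decision([Span("OK", 0, 40), Span("ERROR", 5, 30)]))   # keep-errors
print(tail_decision([Span("OK", 0, 1500)]))                       # keep-slow
```

The decision needs every span of the trace in hand, which is why the buffer window and trace-ID-affine routing are unavoidable.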

Dimension	Head sampling	Tail sampling
Decision point	SDK, at root span start	Collector gateway, after trace completes
Can guarantee errors kept	No — decision precedes the error	Yes — that’s the point
Cost profile	Cuts SDK/network/backend cost	Full ingest to gateway; cuts backend cost only
State required	None (deterministic hash)	RAM for all in-flight traces + trace-ID-affine routing
Complexity	One env var	Gateway tier + LB exporter + capacity planning
Where each backend fits	Both backends benefit equally	Collector-side for Tempo; Jaeger v2 can run the same processor in-process; Jaeger v1 adds adaptive sampling (collector computes per-operation rates served to SDKs via `jaeger_remote`)

How the pieces combine in practice:

Scenario	Recommended policy
< 1k spans/s, small team	Head `parentbased_always_on` (keep everything); revisit at 10×
10–100k spans/s, cost-aware	Head 100% (or high ratio) → gateway tail: errors + latency + 3–10% baseline
Massive edge traffic, SDK overhead itself hurts	Head `parentbased_traceidratio(0.25)` → tail on what remains (accept losing errors inside the unsampled 75%)
Jaeger v1 estate, no gateway tier	`parentbased_jaeger_remote` + adaptive sampling (`SAMPLING_CONFIG_TYPE=adaptive`, target samples/s per operation)
Compliance: “every request auditable”	Sample traces anyway; keep span metrics at 100% and logs as the audit record — traces are diagnostics, not the ledger
Load tests	`always_off` via env on the load generator’s traffic (or a `string_attribute` tail policy dropping `synthetic=true`)

One principle above all of it: metrics at 100%, traces sampled. RED metrics generated from spans (next section) are cheap aggregates — never sample those. Sample the examples, keep the totals. And generate the metrics before the tail-sampling processor in the pipeline graph, or your “request rate” will silently become “request rate of kept traces”.
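
A small simulation shows how badly the numbers drift when metrics are computed after tail sampling. The inputs (2% true error rate, keep all errors plus a 5% baseline) are hypothetical, and `simulate` is an illustrative helper.

```python
import random


def simulate(n: int = 100_000, error_rate: float = 0.02,
             baseline_pct: float = 5.0, seed: int = 7) -> dict:
    """Compare the true error ratio with the ratio seen among kept traces."""
    rng = random.Random(seed)
    errors = kept = kept_errors = 0
    for _ in range(n):
        is_error = rng.random() < error_rate
        # Tail policy: keep every error, plus a percentage of healthy traces.
        keep = is_error or rng.random() * 100 < baseline_pct
        errors += is_error
        if keep:
            kept += 1
            kept_errors += is_error
    return {
        "true_error_ratio": errors / n,
        "error_ratio_after_sampling": kept_errors / kept,
        "kept_fraction": kept / n,
    }


result = simulate()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

With these inputs the post-sampling error ratio lands near 29% instead of 2%, and the apparent request rate shrinks to roughly 7% of reality. Both are artefacts of where the metrics were computed, not of the system.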

Span metrics and service graphs: RED from traces

Every span already encodes RED: it happened (Rate), it has a status (Errors), it has a duration (Duration). Aggregating spans into Prometheus series yields request metrics with the same dimensions as your traces — precisely what makes exemplars and tracesToMetrics line up later. Two places can run this aggregation; run exactly one:

Dimension	Collector `spanmetrics` connector	Tempo metrics-generator
Runs where	Your Collector gateway (before tail sampling!)	Inside Tempo, on ingested spans
Metric names (Prometheus)	`traces_span_metrics_calls_total`, `traces_span_metrics_duration_milliseconds_*`	`traces_spanmetrics_calls_total`, `traces_spanmetrics_latency_*`, `traces_spanmetrics_size_total`
Service graph companion	`servicegraph` connector	`service-graphs` processor
Sees pre-sampling traffic	Yes, if placed before `tail_sampling`	Only what survives sampling and reaches Tempo
Exemplar support	`histogram.exemplars.enabled: true`	`send_exemplars: true` on remote_write
Extra dimensions	`dimensions:` list (span attrs)	Per-tenant `metrics_generator` dimension overrides
Choose it when	You tail-sample (metrics must see 100%) or run multiple backends	Tempo receives ~everything; fewer moving parts
The cardinal sin	Running both → every RED number doubles	Same

The Collector-side wiring, including the fan-out that lets metrics see 100% of spans while Tempo receives only the tail-sampled subset:

connectors:
  spanmetrics:
    histogram:
      exemplars:
        enabled: true                       # carries trace_id samples → step 6
      explicit:
        buckets: [5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s]
    dimensions:
      - name: http.route
      - name: http.response.status_code
  servicegraph:
    latency_histogram_buckets: [10ms, 50ms, 100ms, 500ms, 1s, 5s]

service:
  pipelines:
    traces/full:                            # 100% of spans → RED metrics
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [spanmetrics, servicegraph]
    traces/store:                           # sampled subset → Tempo
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics/red:
      receivers: [spanmetrics, servicegraph]
      processors: [batch]
      exporters: [prometheusremotewrite]

What comes out the other side — the series you will alert and dashboard on:

Metric	Type	Key labels	Feeds
`traces_span_metrics_calls_total`	Counter	`service_name`, `span_name`, `span_kind`, `status_code` + your dimensions	Request rate, error rate panels
`traces_span_metrics_duration_milliseconds_bucket/_sum/_count`	Histogram	Same	Latency percentiles + exemplars
`traces_service_graph_request_total`	Counter	`client`, `server`, `connection_type`	Service map edge volume
`traces_service_graph_request_failed_total`	Counter	`client`, `server`	Service map edge error %
`traces_service_graph_request_server_seconds_*` / `traces_service_graph_request_client_seconds_*`	Histograms	`client`, `server`	Edge latency (both viewpoints)
`traces_service_graph_unpaired_spans_total`	Counter	—	Diagnostic: rising = broken propagation or missing CLIENT/SERVER kinds

Service graphs are computed by pairing each CLIENT span with the matching remote SERVER span (via trace_id + parent relationship) and emitting an edge client→server. Both sides must be instrumented and correctly kinded or the edge never forms, and chronic one-sided instrumentation surfaces as unpaired_spans_total — one of the most useful “is propagation healthy” meta-signals you own. Grafana renders the map from these series via serviceMap.datasourceUid pointing at your Prometheus/Mimir.
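
The pairing logic is easy to model. This sketch pairs a SERVER span with the CLIENT span named by its parent ID and counts only unpaired CLIENT spans; real implementations also handle expiry windows, messaging kinds and virtual nodes. `Span` and `service_graph` are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    kind: str  # "CLIENT", "SERVER", "INTERNAL", ...


def service_graph(spans: Iterable[Span]) -> Tuple[Counter, int]:
    """Return (edge counts keyed by (client, server), number of unpaired CLIENT spans)."""
    spans = list(spans)
    clients = {(s.trace_id, s.span_id): s for s in spans if s.kind == "CLIENT"}
    edges: Counter = Counter()
    paired = set()
    for s in spans:
        if s.kind != "SERVER":
            continue
        key = (s.trace_id, s.parent_id)
        client = clients.get(key)
        if client is not None:
            edges[(client.service, s.service)] += 1
            paired.add(key)
    return edges, len(clients) - len(paired)


spans = [
    Span("t1", "a0", None, "checkout", "SERVER"),
    Span("t1", "a1", "a0", "checkout", "CLIENT"),
    Span("t1", "b1", "a1", "payments", "SERVER"),
    Span("t1", "a2", "a0", "checkout", "CLIENT"),  # callee not instrumented
]
edges, unpaired = service_graph(spans)
print(dict(edges), unpaired)  # {('checkout', 'payments'): 1} 1
```

The second CLIENT span never finds a SERVER partner, so no edge forms and the unpaired count rises. On a real map that shows up as a missing arrow.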

Dimension discipline is the same as any Prometheus work: every dimensions: entry multiplies series count. http.route (bounded) yes; url.path, user.id, tenant.id for thousands of tenants — no. The governance maths lives in Prometheus Cardinality Control: Relabeling and Cost Governance, and dashboarding these RED series properly is Engineering Grafana Dashboards That Get Used: RED, USE, Template Variables, and Provisioning-as-Code.
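
The multiplication is worth doing on paper before adding a dimension. The sketch below computes a worst-case upper bound (real label combinations are sparser), assuming one counter plus one explicit-bucket histogram per label combination. All cardinalities are hypothetical.

```python
from math import prod
from typing import Dict


def worst_case_series(label_cardinality: Dict[str, int], explicit_buckets: int) -> int:
    """Upper bound on series: every label combination times series per combination."""
    combos = prod(label_cardinality.values())
    # 1 counter + (explicit buckets + the +Inf bucket) + _sum + _count
    series_per_combo = 1 + (explicit_buckets + 1) + 2
    return combos * series_per_combo


bounded = {"service_name": 40, "span_name": 10, "status_code": 3, "http.route": 20}
print(worst_case_series(bounded, explicit_buckets=10))                          # 336000
print(worst_case_series({**bounded, "tenant.id": 2000}, explicit_buckets=10))   # 672000000
```

One unbounded dimension turns a manageable bound into hundreds of millions of potential series, which is why tenant and user identifiers belong in span attributes and not in metric labels.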

Exemplars: the metrics-to-trace jump

An exemplar is a sample attached to a histogram bucket carrying the trace_id (plus value and timestamp) of one real request that landed in that bucket. On a Grafana latency panel exemplars render as diamonds under the line; clicking one opens that request’s trace. It is the highest-leverage click in observability — a p99 spike stops being a statistic and becomes a specific slow trace — and it fails unless all four stages of a chain hold:

#	Stage	Requirement	Config	Symptom if missing
1	Metric source	Histogram emits exemplars with `trace_id`	`spanmetrics: histogram.exemplars.enabled: true` (or Tempo `send_exemplars: true`; or app SDKs attaching exemplars to OTLP/OpenMetrics histograms)	Diamonds never exist anywhere
2	Transport	Exemplars survive the write path	OTLP carries them natively; Prometheus `remote_write` needs `send_exemplars: true` per endpoint	Present at source, absent in TSDB
3	Storage	TSDB stores exemplars	Prometheus: `--enable-feature=exemplar-storage` (still a feature flag in 3.x) + optional `storage.exemplars.max_exemplars` (default 100000, circular buffer); Mimir: per-tenant exemplar limits	`query_exemplars` API returns `[]`
4	UI link	Grafana maps exemplar label → trace datasource	Prometheus datasource `exemplarTraceIdDestinations: [{name: trace_id, datasourceUid: tempo}]`; `name` must equal the label key your source actually emits (check it with `query_exemplars`)	Diamonds render but don’t link

The storage flag and the datasource link, concretely:

# Prometheus with exemplar storage (feature flag) and remote-write receiver
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --enable-feature=exemplar-storage \
  --web.enable-remote-write-receiver

# Grafana provisioning: Prometheus → Tempo via exemplars
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prom
    url: http://prometheus:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id          # the exemplar's label key
          datasourceUid: tempo

Any histogram panel then does the work — toggle “Exemplars” on in the query options:

histogram_quantile(0.99,
  sum by (le, http_route) (
    rate(traces_span_metrics_duration_milliseconds_bucket[5m])
  )
)

Two caveats. Exemplars are not a statistical sample — Prometheus keeps a bounded circular buffer of recent representatives, which is what debugging needs but not what reporting should be built on. And an exemplar pointing at a trace tail sampling dropped is a dead link — one more reason “keep all errors + slow traces” is the right tail policy, since those are exactly the traces anomalous buckets point to. The application-side view of this plumbing (SDK histograms attaching exemplars directly) is covered in Wiring OpenTelemetry Metrics and Exemplars for Click-Through Trace Correlation.
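
When stage 3 or 4 of the chain is in doubt, it helps to look at what the exemplar API actually returns. The sketch below parses a `query_exemplars` response and pulls out the trace IDs. The sample payload is hand-written to follow the documented response shape, and the label key is an assumption: sources differ (commonly `trace_id` or `traceID`), so the helper accepts either.

```python
import json
from typing import Iterable, List

# Hand-written sample in the shape of GET /api/v1/query_exemplars
SAMPLE_RESPONSE = json.loads("""
{
  "status": "success",
  "data": [
    {
      "seriesLabels": {"__name__": "latency_bucket", "le": "1"},
      "exemplars": [
        {"labels": {"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"},
         "value": "0.83", "timestamp": 1780569663.0}
      ]
    }
  ]
}
""")


def exemplar_trace_ids(response: dict,
                       label_keys: Iterable[str] = ("trace_id", "traceID")) -> List[str]:
    """Collect the trace ID of every exemplar, whichever known key carries it."""
    keys = tuple(label_keys)
    found = []
    for series in response.get("data", []):
        for exemplar in series.get("exemplars", []):
            labels = exemplar.get("labels", {})
            for key in keys:
                if key in labels:
                    found.append(labels[key])
                    break
    return found


print(exemplar_trace_ids(SAMPLE_RESPONSE))
```

An empty list from a live response means the break is at stage 1 to 3. A non-empty list whose key differs from the datasource's `name` means the break is at stage 4.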

Trace-log correlation: trace_id everywhere

The third correlation is logs, and the rule is brutally simple: every structured log line emitted inside an active span carries trace_id and span_id. Do that and both pivots open — span to exactly its log lines, log line to the full distributed trace. Miss it and you are back to timestamp archaeology.

Injection is a logging-layer concern, one small shim per language:

# Python: stamp trace context onto every log record
import logging
from opentelemetry import trace

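class TraceContextFilter(logging.Filter):
    """Copy the active span's IDs onto each record so formatters can emit them."""

    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        # Outside a span the context is invalid: emit empty strings, not zeros.
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else ""
        record.span_id  = format(ctx.span_id,  "016x") if ctx.is_valid else ""
        return True  # annotate only, never drop the record

# Attach to handlers (not loggers) so records from child loggers are stamped too.
for handler in logging.getLogger().handlers:
    handler.addFilter(TraceContextFilter())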

{"ts":"2026-06-04T10:41:03Z","level":"error","service":"checkout",
 "msg":"inventory reserve failed","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736",
 "span_id":"00f067aa0ba902b7"}

(Java: the OTel agent auto-populates MDC keys trace_id/span_id; Go: wrap slog/zap with a handler reading trace.SpanContextFromContext; .NET: Activity.Current.TraceId flows into ILogger scopes.)
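
For Python, the remaining piece is a formatter that writes those record attributes as JSON. The sketch below is stdlib-only so it runs anywhere: `StaticTraceFilter` is a hypothetical stand-in for a span-aware filter, and `JsonFormatter` and `build_logger` are illustrative names, not part of any library.

```python
import io
import json
import logging


class StaticTraceFilter(logging.Filter):
    """Stand-in for a span-aware filter so this runs without the OTel SDK."""

    def __init__(self, trace_id: str, span_id: str):
        super().__init__()
        self.trace_id, self.span_id = trace_id, span_id

    def filter(self, record):
        record.trace_id = self.trace_id
        record.span_id = self.span_id
        return True


class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
            "trace_id": getattr(record, "trace_id", ""),
            "span_id": getattr(record, "span_id", ""),
        }, separators=(",", ":"))  # compact output: no space after ':'


def build_logger(stream, trace_filter: logging.Filter) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(trace_filter)  # on the handler: applies to every record it emits
    logger = logging.getLogger("checkout-demo")
    logger.handlers[:] = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buf = io.StringIO()
log = build_logger(buf, StaticTraceFilter("4bf92f3577b34da6a3ce929d0e0e4736",
                                          "00f067aa0ba902b7"))
log.error("inventory reserve failed")
print(buf.getvalue().strip())
```

The compact separators are deliberate. Whatever shape the formatter emits is the shape the log-side matching has to agree with later.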

With Loki storing logs, wire both directions in Grafana — pure data-source configuration, no code:

# Loki → Tempo: derived field lifts trace_id out of the line and links it
  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"trace_id":"([0-9a-f]+)"'
          url: '$${__value.raw}'        # $$ escapes $ in provisioning files
          datasourceUid: tempo

# Tempo → Loki: from a span, query logs scoped by labels + trace_id + time
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        spanStartTimeShift: '-5m'
        spanEndTimeShift: '5m'
        filterByTraceID: true
        tags: [{ key: 'service.name', value: 'service_name' }]
      tracesToMetrics:
        datasourceUid: prom
        spanStartTimeShift: '-2m'
        spanEndTimeShift: '2m'
        queries:
          - name: 'Route p99'
            query: 'histogram_quantile(0.99, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{$$__tags}[5m])))'
      serviceMap:
        datasourceUid: prom
      nodeGraph:
        enabled: true

The details that decide whether these links return results or an empty pane:

Setting	What it does	Why it’s load-bearing
`derivedFields.matcherRegex`	Extracts trace_id from the log line (or use `matcherType: label` for a structured-metadata/label field in Loki 3.x)	Regex must match your exact JSON key format; test in Explore
`tracesToLogsV2.tags`	Maps span resource attrs → Loki label names (`service.name` → `service_name`)	Wrong mapping = LogQL selector matches zero streams
`filterByTraceID`	Appends a `|= "<trace_id>"` line filter to the generated query	Without it you get all the service’s logs, not the trace’s
`spanStart/EndTimeShift`	Widens the log query window around the span	Log timestamps rarely match span times; skew + buffering make `±5m` a sane default
`tracesToMetrics.queries` + `$__tags`	Span → metrics panels with matching labels	The reverse pivot: from a span to its route’s RED history
`serviceMap.datasourceUid`	Where service-graph metrics live	Empty map usually means this points at the wrong Prometheus
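
The first row of the table fails more quietly than most. A quick check with the same pattern shows why: whitespace in the serialised JSON is enough to break the match. `TOLERANT` is a suggested alternative, not something from the configuration above. Grafana evaluates the pattern in its own regex engine, but both patterns use only syntax common to mainstream engines.

```python
import json
import re

STRICT = re.compile(r'"trace_id":"([0-9a-f]+)"')             # the matcherRegex used above
TOLERANT = re.compile(r'"trace_id"\s*:\s*"([0-9a-f]{32})"')  # assumption: allow optional spaces


def extract(pattern, line):
    """Return the captured trace ID, or None when the pattern does not match."""
    match = pattern.search(line)
    return match.group(1) if match else None


record = {"msg": "payment failed", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"}
compact = json.dumps(record, separators=(",", ":"))
spaced = json.dumps(record)  # Python's default separators add a space after ':'

for name, line in (("compact", compact), ("spaced", spaced)):
    print(name, extract(STRICT, line), extract(TOLERANT, line))
```

The strict pattern returns nothing for the spaced line, which in Grafana means the derived field simply never appears. Match the regex to what your logger emits, or make the regex tolerant.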

Where should trace_id live on the Loki side? Not as a stream label: trace IDs are unique per request, and unique-valued labels are the cardinality catastrophe explained in Grafana Loki Deep Dive: LogQL, Label Cardinality, and Chunk Storage Tuning. Keep it in the log line (filtered with |= "<id>", or parsed and matched with | json | trace_id="<id>") or, on Loki 3.x, in structured metadata, which is stored alongside each entry and is queryable without becoming an index label. That is the middle ground this problem needed, and Loki’s OTLP endpoint puts trace_id there automatically.

The complete correlation matrix — six pivots, each with its own mechanism:

From → To	Mechanism	Configured where
Metric → Trace	Exemplar diamond	Prometheus DS `exemplarTraceIdDestinations`
Trace → Logs	`tracesToLogsV2` (labels + trace_id + time window)	Tempo DS
Log → Trace	Derived field on `trace_id`	Loki DS
Trace → Metrics	`tracesToMetrics` with `$__tags`	Tempo DS
Trace → Service graph	metrics-generator/servicegraph series	Tempo DS `serviceMap`
Trace → Profile	Pyroscope span-scoped flame graph (`span_id` labelled profiles)	Tempo DS `tracesToProfiles` — the fourth signal, see Continuous Profiling in Production with eBPF: Parca, Pyroscope, and Flame Graphs

The debugging workflow: alert to root cause in five pivots

Everything above exists to make one on-call motion fast. The drill, with the artefact each pivot hands the next:

Step	Surface	Action	Artefact handed forward
1. Alert	Alertmanager/Grafana	Multi-window burn-rate alert on the checkout SLO fires	Service + SLI + time window
2. Metric	RED dashboard	p99 panel for `/checkout` shows the step change at 10:41	The anomalous histogram series
3. Exemplar	Same panel	Click a diamond inside the spike	A real `trace_id` from the bad population
4. Trace	Tempo waterfall	Read the critical path: 1.9s of 2.1s inside `payments` span `psp charge`, status Error, `retry` events ×3	Failing service, operation, span_id
5. Logs	“Logs for this span” → Loki	Scoped query returns the lines for this trace only: `psp timeout after 500ms, attempt 3/3`	Root cause string, stack trace
6. Breadth check	Service map / TraceQL	`{ span.psp.provider = "gateway-b" && status = error } | count() > 0` to check whether it is all traffic or one provider	Blast radius; who to page; rollback vs failover

Ninety seconds of clicking, zero grep, and one person does it without paging four teams. The reverse entry points work too: support ticket with request ID → log search → derived field → trace → service map; suspicious deploy → tracesToMetrics from a canary span → route history. The alert layer feeding step 1 is covered in SLOs and Error Budgets in Practice: Defining SLIs and Building Multi-Window Burn-Rate Alerts. Three checkable invariants keep the workflow trustworthy: exemplars point at kept traces (tail policy retains errors/outliers), span metrics are generated before sampling (or panels lie), and clocks are disciplined (skew breaks both the log window and waterfall ordering).

Multi-tenancy and retention

The moment tracing serves more than one team, two governance questions arrive: whose spans are whose (isolation, limits) and how long do they live (retention, cost). The backends answer very differently:

Concern	Tempo	Jaeger
Tenant model	First-class: `multitenancy_enabled: true`, tenant = `X-Scope-OrgID` header on ingest and query	None native — run per-team instances, or share one and separate by ES index prefix / tag conventions
Enforcement point	Distributor rejects unlabelled writes; queriers scope reads to header’s tenant	Whatever your deployment topology enforces
Per-tenant limits	Ingestion rate/burst bytes, `max_traces_per_user` (default 10000 live), `max_bytes_per_trace` (default ~5 MB)	Collector-level only (queue sizes); no per-tenant accounting
Per-tenant retention	`compaction.block_retention` override per tenant	Per-index policies if you split indices per team (ILM)
Grafana access control	One Tempo datasource per tenant with the header set; RBAC on datasources	Datasource-per-instance
Cross-tenant query	Deliberately impossible (or explicit multi-tenant federation at query-frontend)	Trivially possible — which is the problem

The per-tenant overrides file is where platform teams encode the contract — limits, retention, metric generation, per X-Scope-OrgID:

# tempo.yaml
multitenancy_enabled: true
overrides:
  per_tenant_override_config: /conf/overrides.yaml

# overrides.yaml — the platform contract, per tenant
overrides:
  "team-payments":                  # X-Scope-OrgID: team-payments
    ingestion:
      rate_limit_bytes: 30000000    # 30 MB/s (global default 15 MB/s)
      burst_size_bytes: 40000000
    global:
      max_bytes_per_trace: 10000000 # pathological traces rejected at 10 MB
    compaction:
      block_retention: 720h         # 30 days — they pay for it
    metrics_generator:
      processors: [span-metrics, service-graphs]
  "team-internal-tools":
    compaction:
      block_retention: 168h         # 7 days is plenty
    metrics_generator:
      processors: []                # no RED generation for this tenant

The Collector side stamps the tenant on the way in: either one gateway per team namespace with a static header, as below, or the headers_setter extension, which copies the tenant from incoming request metadata into the outbound header:

exporters:
  otlp/tempo:
    endpoint: tempo-distributor.observability.svc:4317
    headers:
      X-Scope-OrgID: team-payments
    tls:
      insecure: true    # in-cluster only; mTLS at the mesh layer

Retention mechanics: “we keep traces 30 days” is a config line in Tempo and an operational subsystem in Jaeger:

Backend	Retention mechanism	Default	Knob
Tempo	Compactor deletes blocks older than retention	336h (14 d)	`compactor.compaction.block_retention`, per-tenant override
Jaeger + Elasticsearch/OpenSearch	Daily indices + `es-index-cleaner` CronJob, or rollover + ILM (`es.use-ilm`)	None — you must set it	Cleaner schedule / ILM policy days
Jaeger + Cassandra	Row TTL baked in at schema creation	172800 s (2 d)	`TRACE_TTL` when creating the schema
Jaeger + Badger	Store-level TTL	72h	`--badger.span-store-ttl`

Retention strategy itself: traces age badly — nobody opens a 60-day-old waterfall. Keep the aggregates long-term instead (span metrics in Mimir, retained 13 months cheaply — see Running Grafana Mimir: Multi-Tenant, Horizontally Scalable Prometheus Storage); 14–30 days of traces with 100%-of-errors sampling covers real debugging at a fraction of the cost of “keep everything for a year”.
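
The cost side of that argument is plain multiplication. The sketch below estimates steady-state stored volume. Every input is hypothetical, and `bytes_per_span` is assumed to be the stored (post-compression) size, which you would have to measure on your own blocks.

```python
def stored_bytes(spans_per_s: float, bytes_per_span: float,
                 kept_fraction: float, retention_days: float) -> float:
    """Steady-state volume: ingest rate x kept fraction x retention window."""
    return spans_per_s * bytes_per_span * kept_fraction * retention_days * 86_400


GB = 1e9
for days in (14, 30, 365):
    size = stored_bytes(spans_per_s=20_000, bytes_per_span=300,
                        kept_fraction=0.06, retention_days=days)
    print(f"{days:>3} days: {size / GB:,.0f} GB")
```

Volume scales linearly with retention, so a year costs about 26 times what 14 days costs, for traces that are rarely opened after the first week or two.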

Architecture at a glance

Picture the pipeline left to right in four tiers. Tier 1 — services: every pod runs an OTel SDK (or agent) with W3C propagation; HTTP/gRPC hops carry traceparent automatically, Kafka producers inject it into record headers, and every log line is stamped with trace_id/span_id. Spans batch out via OTLP to a node-local Collector agent. Tier 2 — Collector gateway: agents forward to a small gateway fleet (routed by trace ID when tail sampling is on), where the traffic forks deliberately: one pipeline feeds the spanmetrics and servicegraph connectors so RED metrics and edges are computed from 100% of spans and remote-written to Prometheus/Mimir with exemplars; the other applies tail sampling — every error, every trace over the latency threshold, a 5% healthy baseline — and exports the survivors to the trace backend.

Tier 3 — backends: Tempo’s distributor receives OTLP (per-tenant via X-Scope-OrgID), ingesters cut Parquet blocks to S3, the compactor enforces per-tenant retention, queriers serve TraceQL; in a Jaeger deployment the same position is held by collectors writing to Elasticsearch/Cassandra with the query service in front. Loki holds the logs alongside; Prometheus/Mimir holds the RED series and service-graph edges. Tier 4 — Grafana: one pane wired with the six correlation links — exemplar diamonds from any latency panel into the exact trace, tracesToLogsV2 from any span into its Loki lines, derived fields from any log line back to the full trace, tracesToMetrics and the service map closing the loop. The topology is a lambda shape: aggregates flow unsampled to the metrics store, examples flow sampled to the trace store, and trace_id is the foreign key joining every surface.

Real-world scenario

A payments platform (“FinRail”, ~90 services, 60k spans/s peak, AWS ap-south-1) ran Jaeger 1.x on a 6-node OpenSearch cluster: roughly $1,900/month to hold 7 days of head-sampled traces at 10%. Incident reviews kept repeating one line: “detection 3 minutes, localisation 45”. The 10% head sample meant the trace for any given failed settlement existed one time in ten — the search index was fast, but the trace people needed usually wasn’t there. Checkout p99 alerts pointed at dashboards with no path into traces; engineers searched Jaeger by service + time window and scrolled.

The platform team rebuilt over six weeks. Weeks 1–2: standardised on W3C propagation (three Spring services still emitted only B3 — the composite propagator bridged the migration) and pushed trace_id into all JSON logs via the Java agent’s MDC. Weeks 3–4: a two-tier Collector layout — node agents plus a 6-pod gateway with the loadbalancing exporter — with RED generation moved into spanmetrics/servicegraph connectors before tail sampling, and sampling switched to head 100% → tail “all errors + traces > 800ms + 5% baseline”. Weeks 5–6: Tempo (microservices mode) on S3, 30-day retention for the payments tenant and 14 for everyone else via X-Scope-OrgID, and Grafana provisioned with the full correlation mesh.

The numbers after: stored trace volume dropped 84% versus “keep 10% of everything” (tail policies kept ~5.8% of traces, but the right 5.8% — error-trace coverage went from ~10% to 100%). Backend cost fell from ~$1,900/mo to ~$410/mo (S3 ~2.1 TB steady-state + Tempo and gateway pods). The cultural change came from exemplars: median “alert → trace in hand” went from 14 minutes to under 60 seconds, and localisation MTTR dropped 58% over the next quarter. The regression they accepted knowingly: ad-hoc tag search over long windows got slower (TraceQL scan vs OpenSearch index) — a fair trade, since 95% of trace opens now arrive via an exemplar or log link carrying an exact trace_id, where Tempo is instant.

Advantages and disadvantages

Advantages of tracing done as designed here	Disadvantages / costs
Localisation in one click chain: alert → exemplar → trace → log	Instrumentation is a fleet-wide programme, not an install — propagation discipline across every team
RED metrics and service maps derived from spans — one instrumentation, three signals	Sampling design is genuinely subtle; naive configs silently drop the errors
Tail sampling keeps 100% of errors/outliers at a fraction of full-volume cost	Tail sampling needs a stateful gateway tier with trace-ID-affine routing (RAM + ops)
Tempo-on-S3 economics: retention priced in object storage, not database nodes	TraceQL search over long windows is scan-bound — slower than an ES index for ad-hoc hunts
Correlation config is declarative (Grafana provisioning) and versionable	Six links = six configs that silently no-op when a label mapping drifts
Vendor-neutral: OTLP + semconv means backends are swappable	Semconv churn (HTTP/DB renames) leaves mixed old/new attribute keys during upgrades
Multi-tenant limits + retention make tracing a governable platform service	Baggage/attributes are a PII leak surface that needs active redaction policy

The honest summary: the system is strictly better than disconnected pillars, and the cost is that it is a system — a dozen configs across SDKs, Collectors, backends and Grafana that must agree. The troubleshooting table exists because they occasionally won’t.

Hands-on lab

Build the correlation mesh on your laptop: Tempo (OTLP ingest + metrics-generator) remote-writing to Prometheus (exemplar storage), Loki and Grafana alongside, traces generated with telemetrygen, then click every link. Docker Compose, no cloud spend.

1. Layout.

mkdir -p tracing-lab/{tempo,prometheus,grafana/provisioning/datasources} && cd tracing-lab

2. Tempo config — monolithic, local block storage, metrics-generator remote-writing RED + service graphs with exemplars:

# tempo/tempo.yaml
server:
  http_listen_port: 3200
distributor:
  receivers:
    otlp:
      protocols:
        grpc: { endpoint: "0.0.0.0:4317" }
storage:
  trace:
    backend: local
    local: { path: /var/tempo/blocks }
    wal:   { path: /var/tempo/wal }
compactor:
  compaction:
    block_retention: 24h
metrics_generator:
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true
overrides:
  defaults:
    metrics_generator:
      processors: [span-metrics, service-graphs]

3. Prometheus config (scrapes nothing; receives remote_write):

# prometheus/prometheus.yml
global:
  scrape_interval: 15s

4. Grafana datasources — the whole point of the lab, every correlation in one file:

# grafana/provisioning/datasources/ds.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prom
    url: http://prometheus:9090
    jsonData:
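      exemplarTraceIdDestinations:
        - name: trace_id           # key used by the Collector spanmetrics connector
          datasourceUid: tempo
        - name: traceID            # assumed default key for Tempo metrics-generator exemplars; confirm in step 7c
          datasourceUid: tempo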
  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"trace_id":"([0-9a-f]+)"'
          url: '$${__value.raw}'
          datasourceUid: tempo
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        spanStartTimeShift: '-5m'
        spanEndTimeShift: '5m'
        filterByTraceID: true
        tags: [{ key: 'service.name', value: 'service_name' }]
      serviceMap: { datasourceUid: prom }
      nodeGraph: { enabled: true }

5. Compose file. (Tempo’s image runs as a non-root user; for a lab, running as root sidesteps volume-permission setup — do not do this in production.)

# docker-compose.yaml
services:
  tempo:
    image: grafana/tempo:latest
    user: "0"
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo/tempo.yaml:/etc/tempo.yaml
      - tempo-data:/var/tempo
    ports: ["3200:3200", "4317:4317"]
  prometheus:
    image: prom/prometheus:latest
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --enable-feature=exemplar-storage
      - --web.enable-remote-write-receiver
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
    ports: ["9090:9090"]
  loki:
    image: grafana/loki:latest
    ports: ["3100:3100"]
  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
    ports: ["3000:3000"]
volumes:
  tempo-data:

docker compose up -d
docker compose ps          # expect 4 services Up

6. Generate traces with telemetrygen (the Collector project’s load tool):

docker run --rm --network tracing-lab_default \
  ghcr.io/open-telemetry/opentelemetry-collector-contrib/telemetrygen:latest \
  traces --otlp-endpoint tempo:4317 --otlp-insecure \
  --service checkout --rate 20 --duration 120s --child-spans 4

The run ends by reporting traces generated (≈ 2,400).

7. Validate each link in the chain.

# a) Tempo has traces (TraceQL search over everything)
curl -s "http://localhost:3200/api/search?q=%7B%7D&limit=3" | jq '.traces[].traceID'
# → three 32-hex trace IDs

# b) metrics-generator produced RED series in Prometheus (allow ~60s)
curl -s "http://localhost:9090/api/v1/query?query=traces_spanmetrics_calls_total" \
  | jq '.data.result | length'
# → non-zero

# c) exemplars are stored alongside the latency histogram
curl -s "http://localhost:9090/api/v1/query_exemplars" \
  --data-urlencode 'query=traces_spanmetrics_latency_bucket' \
  --data-urlencode "start=$(date -u -v-1H +%s 2>/dev/null || date -u -d '-1 hour' +%s)" \
  --data-urlencode "end=$(date -u +%s)" | jq '.data | length'
# → non-zero; each exemplar carries labels.trace_id

8. Prove log↔trace correlation — push a log line into Loki carrying a real trace_id, then click through:

TRACE_ID=$(curl -s "http://localhost:3200/api/search?q=%7B%7D&limit=1" | jq -r '.traces[0].traceID')
curl -s -X POST http://localhost:3100/loki/api/v1/push \
  -H "Content-Type: application/json" \
  -d "{\"streams\":[{\"stream\":{\"service_name\":\"checkout\",\"level\":\"error\"},
       \"values\":[[\"$(date +%s)000000000\",
       \"{\\\"msg\\\":\\\"payment failed\\\",\\\"trace_id\\\":\\\"${TRACE_ID}\\\"}\"]]}]}"

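The nested shell escaping in step 8 is easy to get wrong. As a hedged alternative, the sketch below builds the same push body in Python. `loki_push_payload` is a hypothetical helper name, the trace ID in the usage line is a placeholder, and the labels and message simply mirror the curl command above.

```python
import json
import time
from typing import Optional


def loki_push_payload(trace_id: str, ts_ns: Optional[int] = None) -> str:
    """Build the JSON body for POST /loki/api/v1/push with one log line."""
    if ts_ns is None:
        ts_ns = time.time_ns()  # Loki wants Unix epoch nanoseconds
    # The log line is itself JSON; trace_id lives in the line, not in a label.
    line = json.dumps({"msg": "payment failed", "trace_id": trace_id})
    body = {
        "streams": [
            {
                "stream": {"service_name": "checkout", "level": "error"},
                "values": [[str(ts_ns), line]],  # [timestamp string, line]
            }
        ]
    }
    return json.dumps(body)


if __name__ == "__main__":
    # Placeholder ID; in the lab, substitute the TRACE_ID captured from Tempo.
    print(loki_push_payload("0123456789abcdef0123456789abcdef"))
```

Letting `json.dumps` do the escaping twice (once for the line, once for the envelope) replaces the triple-backslash quoting, and the printed body can be piped to `curl -d @-` against the same endpoint.
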
9. Click through the mesh in Grafana (http://localhost:3000):
a) Metric → trace: Explore → Prometheus → query histogram_quantile(0.99, sum by (le) (rate(traces_spanmetrics_latency_bucket[5m]))), enable Exemplars → diamonds render → click one → the Tempo trace opens.
b) Trace → log: in the trace view, open a span → Logs for this span → your pushed Loki line appears.
c) Log → trace: Explore → Loki → {service_name="checkout"}, expand the line; the TraceID derived field links back.
d) Service graph: Explore → Tempo → Service Graph renders checkout and its synthetic children.

10. Teardown.

docker compose down -v && cd .. && rm -rf tracing-lab

If step 7c returns 0, you have reproduced the most common exemplar failure — work the troubleshooting table from row 6.

Common mistakes & troubleshooting

The playbook, symptom-first. Confirm before you fix; every row has a concrete check.

#	Symptom	Root cause	Confirm	Fix
1	Traces split into two trees at one service boundary	Propagator mismatch (B3 vs W3C) or proxy stripping headers	Log inbound headers; is `traceparent` present? Compare trace_ids across the two services for one request	Standardise W3C; composite propagator during migration; allowlist headers on the proxy
2	Consumer-side traces are all new roots	Context not injected into message headers	Inspect Kafka record headers for `traceparent` (`kafka-console-consumer --property print.headers=true`)	`inject()` on produce, `extract()` on consume; links for batches
3	Orphan spans / children before parents in waterfall	Async work started without context; or clock skew	Orphans: check executor/goroutine context passing. Skew: compare node clocks	Context-aware executors; chrony/NTP on every node
4	One service missing from every trace	SDK exports to wrong endpoint or is sampled out	`OTEL_EXPORTER_OTLP_ENDPOINT` value; SDK logs for export errors; sampler env	Point at the local agent; `parentbased_*` sampler
5	Error traces missing despite “we sample 20%”	Head-only sampling decided before the error happened	Count error spans in backend vs error rate in metrics	Move keep-decisions to tail sampling: `status_code` + `latency` policies
6	No exemplar diamonds in Grafana	A stage of the four-stage exemplar chain is broken, or the panel’s Exemplars toggle is off	Walk the chain: connector `exemplars.enabled`? remote_write `send_exemplars`? Prometheus flag? `query_exemplars` API non-empty? panel toggle on?	Fix the first broken stage; they fail silently in order
7	Diamonds render but click goes nowhere	`exemplarTraceIdDestinations` missing or label name ≠ exemplar’s label	Inspect exemplar in Explore — label key is `trace_id`? datasourceUid valid?	Align `name:` with the actual exemplar label; set correct Tempo uid
8	Exemplar click → “trace not found”	Trace was dropped by tail sampling (metrics saw it, Tempo never stored it)	Compare exemplar trace_id against `/api/traces/<id>` 404	Keep-all-errors/slow tail policies; generate metrics and exemplars from the kept stream if you must guarantee links
9	RED numbers exactly double	Both `spanmetrics` connector and Tempo metrics-generator enabled	Two metric families present: `traces_span_metrics_` AND `traces_spanmetrics_`	Disable one (keep the connector if you tail-sample)
10	“Logs for this span” returns nothing	Label mapping wrong, or time window too tight, or trace_id absent from logs	Run the generated LogQL manually; check `tags:` mapping and `spanStart/EndTimeShift`	Map `service.name`→`service_name`; widen shifts to ±5m; stamp trace_id in the logger
11	Derived field never appears on log lines	Regex doesn’t match the JSON key format	Test regex in Explore against a raw line	Fix `matcherRegex` (or `matcherType: label` for structured metadata)
12	Service graph empty, or `unpaired_spans_total` climbing	Metrics-generator/servicegraph off, wrong `serviceMap.datasourceUid`, one edge side uninstrumented, or wrong span kinds	`curl` Prometheus for `traces_service_graph_request_total`; check span kinds in a trace	Enable processors; fix datasource uid; instrument the missing side with correct kinds
13	Tempo rejects spans (`RATE_LIMITED` / 429s in Collector logs)	Per-tenant ingestion limits hit	Tempo distributor logs; `tempo_discarded_spans_total` metric by reason	Raise tenant `ingestion.rate_limit_bytes`/burst, or fix a span-explosion bug upstream
14	Huge traces rejected (`TRACE_TOO_LARGE`)	Span explosion (span-per-row bug) blowing `max_bytes_per_trace` (~5 MB default)	`tempo_discarded_spans_total{reason="trace_too_large"}`; find the producing service	Fix instrumentation (batch spans); raise the override only if genuinely needed
15	Collector gateway OOMs with tail sampling on	`num_traces` × avg trace size exceeds pod memory; or LB routing broken so traces never complete	Collector `otelcol_processor_tail_sampling_*` metrics; pod RSS	Size RAM ≈ spans/s × decision_wait × bytes/span; verify `routing_key: traceID`; `memory_limiter` first in pipeline
16	TraceQL search slow over long windows	Scan-bound query across many blocks	Query-frontend metrics; speed scales with window size	Add attribute filters (service first), shrink window, scale queriers; arrive via trace_id links instead of hunting

Meta-lesson from rows 6–8: correlation features fail silently — nothing errors, the click just returns empty. Treat the correlation mesh as a deployable artefact with its own smoke test (lab step 7, scriptable against staging).
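
One way such a smoke test could look, as a minimal sketch: it takes the already-parsed JSON responses from the three lab step 7 calls and reports failures in chain order. `check_mesh` and its parameter names are hypothetical; fetching the responses (and resolving one exemplar trace against Tempo) is left to whatever HTTP client your CI already uses.

```python
from typing import Any, Dict, List


def check_mesh(tempo_search: Dict[str, Any],
               spanmetrics_query: Dict[str, Any],
               exemplar_query: Dict[str, Any]) -> List[str]:
    """Return failure messages in chain order; an empty list means healthy."""
    failures: List[str] = []

    # Stage a) Tempo search returned at least one trace.
    if not tempo_search.get("traces"):
        failures.append("tempo: search returned no traces")

    # Stage b) span metrics reached Prometheus.
    if not spanmetrics_query.get("data", {}).get("result"):
        failures.append("prometheus: no span-metrics series")

    # Stage c) at least one stored exemplar carries a trace_id label.
    trace_ids = {
        ex.get("labels", {}).get("trace_id")
        for series in exemplar_query.get("data", [])
        for ex in series.get("exemplars", [])
    }
    trace_ids.discard(None)
    if not trace_ids:
        failures.append("prometheus: no exemplar with a trace_id label")

    return failures


if __name__ == "__main__":
    # Hand-written sample payloads in the shape of the real API responses.
    ok = check_mesh(
        {"traces": [{"traceID": "0123456789abcdef0123456789abcdef"}]},
        {"status": "success", "data": {"result": [{"metric": {}, "value": [0, "1"]}]}},
        {"status": "success", "data": []},  # exemplar chain broken
    )
    print(ok)
```

Because nothing in the chain raises an error on its own, the first entry in the returned list is the stage to fix first.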

Best practices

Standardise W3C tracecontext,baggage fleet-wide; run composite propagators only as a bounded migration, with an end date.
Name spans as low-cardinality operation templates; push all variables into semconv attributes. Review span names in PRs like metric label sets.
Set status Error only for real failures and record exceptions as events — your error-rate metrics are only as honest as this discipline.
Instrument once, derive thrice: spans → span metrics → service graphs. Never hand-maintain a parallel RED instrumentation that can drift from the traces.
Generate metrics before sampling, sample traces aggressively: 100% metrics, tail-kept errors + outliers + small baseline; audit monthly that every SEV kept its trace.
Make exemplars a tested contract: a CI/staging smoke test that queries query_exemplars and resolves one trace end-to-end.
Stamp trace_id/span_id into logs via shared logging config (platform-owned library), not per-team copy-paste.
Budget baggage (≤ 3 keys, short values, no PII) and enforce centrally with Collector redaction/attributes processors at the egress gateway.
Watch the meta-signals: unpaired_spans_total, tempo_discarded_spans_total, Collector queue/drop counters — they catch broken tracing before humans notice.
Pin per-tenant limits and retention in the overrides file under version control; changes go through PR review like any capacity contract.
Keep clocks disciplined (chrony everywhere, including VMs and on-prem) — skewed waterfalls destroy trust in the tool faster than any outage.
Run game days on the pivot chain: hand an engineer an alert and require root cause via exemplar→trace→log only; any dead pivot is the action item.

Security notes

Traces are an under-audited data store that routinely ends up containing what your DLP policy forbids. Attributes and baggage are the leak surface: URLs with tokens in query strings (url.full), raw SQL with literals (db.query.text unsanitised), enduser.id as plain email, request bodies dumped into events. Enforce redaction centrally in the Collector gateway — attributes (delete/hash specific keys) and redaction (allowlist mode) processors — so a single misbehaving team can’t leak into the shared backend. Baggage deserves special paranoia: it propagates to every downstream service, crosses trust boundaries you forgot existed, and lands in logs; strip it at the network edge in both directions.
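
To make the allowlist-plus-hash idea concrete, here is a small Python sketch of the logic only. It is not Collector configuration; `ALLOWED_KEYS`, `HASHED_KEYS` and `redact` are hypothetical names, and the key choices are assumptions for illustration.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

# Hypothetical policy: anything not listed here is dropped (allowlist mode).
ALLOWED_KEYS = {"service.name", "http.request.method", "http.route", "url.full", "enduser.id"}
HASHED_KEYS = {"enduser.id"}


def redact(attributes: dict) -> dict:
    """Return a copy of span attributes that is safe to store."""
    clean = {}
    for key, value in attributes.items():
        if key not in ALLOWED_KEYS:
            continue  # e.g. db.query.text with literals never reaches the backend
        if key in HASHED_KEYS:
            # Unsalted hash for brevity; a keyed hash is the safer choice for emails.
            value = hashlib.sha256(str(value).encode()).hexdigest()
        elif key == "url.full":
            parts = urlsplit(str(value))
            # Drop query string and fragment, where tokens usually hide.
            value = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        clean[key] = value
    return clean


if __name__ == "__main__":
    print(redact({
        "http.route": "/pay",
        "url.full": "https://api.example.com/pay?token=s3cr3t",
        "db.query.text": "SELECT * FROM cards WHERE pan = '4111...'",
        "enduser.id": "a@example.com",
    }))
```

The design point is the default: an unknown key is dropped unless someone deliberately allowlists it, so a new leaky attribute from one team fails closed.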

Transport and access: OTLP from agents to gateway to backend rides mTLS (mesh-issued certs in-cluster; never insecure: true across nodes in production). Tempo’s X-Scope-OrgID is an identifier, not authentication — anyone who reaches the distributor can claim any tenant, so front it with an authenticating proxy and give Grafana per-tenant datasources locked with RBAC so team A cannot query team B’s traces. Jaeger’s UI has no native authz — front it with an OIDC proxy. Retention is a security control too: traces older than your incident-review horizon are pure liability — let the compactor/ILM delete them, keep the bucket private with public access blocked, and use SSE-KMS if compliance asks. Finally, traceparent on outbound requests to third parties leaks request structure — usually harmless, but strip tracestate/baggage at egress and know the header crosses the boundary.

Cost & sizing

The bill drivers, in order: span volume (spans/s × bytes/span), retention days, span-metrics cardinality, and the always-on compute of your backend. Working numbers: a typical instrumented request produces 5–20 spans; a span serialises to ~300–1,000 bytes on the wire and compacts to roughly a third in columnar storage.

Driver	Working figure	Lever
Raw span volume	10k spans/s ≈ 430 GB/day at ~500 B/span	Tail sampling (keep ~5–10%), span-explosion hygiene
Tempo storage (S3 Standard, ap-south-1)	≈ $0.025/GB-mo ≈ ₹2.1/GB-mo → 2 TB steady ≈ ₹4,300/mo	Retention (14 d vs 30 d is linear), compression via compaction
Jaeger ES cluster (same load)	3 × data nodes (≥ m5.xlarge + gp3) ≈ $500–800/mo before growth	Fewer indexed tags, shorter retention, ILM to warm/cold tiers
Collector gateway	Tail sampling RAM ≈ spans/s × decision_wait × ~1 KB → 60k spans/s × 10 s ≈ 600 MB in-flight + overhead; plan pods of 4–8 GB	Shorter `decision_wait`, more/smaller pods behind LB exporter
Span-metrics cardinality	series ≈ services × span_names × kinds × status × extra dims — 50 svc × 40 names × few = easily 10–50k series	Dimension discipline; drop `status_code` granularity you don’t alert on
Exemplar storage	Prometheus circular buffer (default 100k exemplars) — negligible	None needed
Egress	Cross-AZ/region OTLP adds per-GB charges	Keep agents→gateway→backend within AZ/region
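
The working figures in the table are plain multiplication, sketched below so you can plug in your own numbers. The function names are hypothetical, units are decimal (1 GB = 10^9 bytes), and `other_dims` stands for the combined product of kinds, status and extra dimensions.

```python
def raw_bytes_per_day(spans_per_s: int, bytes_per_span: int) -> int:
    """Raw span volume before sampling or compaction."""
    return spans_per_s * bytes_per_span * 86_400


def tail_sampling_inflight_bytes(spans_per_s: int, decision_wait_s: int,
                                 bytes_per_span: int = 1_000) -> int:
    """Spans a tail-sampling gateway holds while waiting for traces to complete."""
    return spans_per_s * decision_wait_s * bytes_per_span


def spanmetrics_series(services: int, span_names: int, other_dims: int) -> int:
    """Rough series count for span metrics."""
    return services * span_names * other_dims


print(f"{raw_bytes_per_day(10_000, 500) / 1e9:.0f} GB/day")              # → 432 GB/day
print(f"{tail_sampling_inflight_bytes(60_000, 10) / 1e6:.0f} MB")        # → 600 MB
print(spanmetrics_series(50, 40, 5), spanmetrics_series(50, 40, 25))     # → 10000 50000
```

The in-flight figure is a floor, not a pod size: the table's 4–8 GB plan leaves room for runtime overhead and bursts.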

Right-sizing heuristics: start tail sampling at “errors + >p99-ish threshold + 5% baseline” and tune until kept-traffic is 5–10% of raw; set retention at 14 days and extend only per tenant, with a use case (the override makes them pay their own delta); review span-metrics dimensions quarterly exactly like scrape cardinality. For a small platform (≤ 5k spans/s) the entire Tempo stack fits in three small pods + a bucket — comfortably inside a lab budget. Grafana Cloud’s free tier (which includes a traces allowance) is a legitimate zero-ops way to run this architecture while you learn — it is Tempo underneath, so everything transfers 1:1.

Interview & exam questions

Q1. Walk me through what happens to trace context when service A calls service B over HTTP. A’s SDK has an active span; the HTTP client instrumentation starts a CLIENT span under it and asks the global propagator to inject that span’s context into headers: traceparent: 00-<trace_id>-<A's span_id>-<flags> plus tracestate/baggage. B’s server instrumentation extracts it and creates a SERVER span with the same trace_id whose parent is A’s span_id, honouring the sampled flag via its ParentBased sampler. Attributes and events do not travel; only the context tuple does.
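
The inject/extract round trip can be sketched in a few lines of plain Python. This is a hand-rolled stand-in for illustration, not the OpenTelemetry SDK API: `SpanContext`, `inject`, `extract` and `start_server_span` are hypothetical names, only version 00 of the header is handled, header keys are assumed to be lowercase, and the IDs in the usage lines are placeholders.

```python
import re
import secrets
from typing import Dict, NamedTuple, Optional, Tuple

# Version 00 layout: version-trace_id-parent_id-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    flags: int

    @property
    def sampled(self) -> bool:
        return bool(self.flags & 0x01)  # bit 0 of trace-flags


def inject(ctx: SpanContext, headers: Dict[str, str]) -> None:
    """Service A: write the calling span's context into outbound headers."""
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{ctx.flags:02x}"


def extract(headers: Dict[str, str]) -> Optional[SpanContext]:
    """Service B: read the caller's context, or None if absent or malformed."""
    match = _TRACEPARENT.fullmatch(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return SpanContext(trace_id, span_id, int(flags, 16))


def start_server_span(headers: Dict[str, str]) -> Tuple[SpanContext, Optional[str]]:
    """Return (new span context, parent span_id) the way a ParentBased setup would."""
    parent = extract(headers)
    if parent is None:
        # Context dropped: B starts a new root. The sampled flag set here is
        # arbitrary; a real root sampler would decide it.
        return SpanContext(secrets.token_hex(16), secrets.token_hex(8), 0x01), None
    # Same trace_id, fresh span_id, and the caller's sampled flag is honoured.
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.flags), parent.span_id


if __name__ == "__main__":
    headers: Dict[str, str] = {}
    inject(SpanContext("0123456789abcdef0123456789abcdef", "0123456789abcdef", 0x01), headers)
    print(headers["traceparent"])
    print(start_server_span(headers))
```

Note what the header does not carry: no span name, attributes or events, which is why a dropped header splits the trace rather than merely thinning it.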

Q2. Why can’t head sampling guarantee you keep error traces, and what’s the fix? The head decision is made at root-span creation, before any downstream failure exists, and it’s derived from the trace_id ratio — so errors are kept at exactly the baseline rate. The fix is tail sampling in a Collector gateway: buffer the complete trace (decision_wait), then apply status_code and latency policies that deterministically keep failures and outliers, plus a probabilistic baseline.
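
A small simulation makes the difference visible. This is a sketch under stated assumptions: the ratio decision compares the low 64 bits of the trace ID to a threshold (one common way to implement a trace-ID-ratio sampler, not a claim about any particular SDK), and the 2% error rate, 500 ms threshold and 5% baseline are made-up lab values.

```python
import random
from dataclasses import dataclass

_MASK64 = (1 << 64) - 1


@dataclass
class Trace:
    trace_id: int       # 128-bit trace ID as an integer
    error: bool
    duration_ms: float


def head_keep(trace_id: int, ratio: float) -> bool:
    """Ratio decision from the trace ID alone; the outcome is not known yet."""
    return (trace_id & _MASK64) < int(ratio * (1 << 64))


def tail_keep(trace: Trace, slow_ms: float, baseline: float) -> bool:
    """Decision after completion: errors, slow traces, then a random baseline."""
    return trace.error or trace.duration_ms > slow_ms or head_keep(trace.trace_id, baseline)


def simulate(n: int = 20_000, seed: int = 42):
    rng = random.Random(seed)
    traces = [
        Trace(rng.getrandbits(128), rng.random() < 0.02, rng.expovariate(1 / 120))
        for _ in range(n)
    ]
    errors = [t for t in traces if t.error]
    head_errors = sum(head_keep(t.trace_id, 0.20) for t in errors)
    tail_errors = sum(tail_keep(t, slow_ms=500, baseline=0.05) for t in errors)
    return len(errors), head_errors, tail_errors


if __name__ == "__main__":
    total, head_errors, tail_errors = simulate()
    print(f"errors={total} kept by head(20%)={head_errors} kept by tail={tail_errors}")
```

Head sampling keeps roughly one error trace in five because `head_keep` never sees the `error` field; the tail policy keeps every one of them.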

Q3. Tempo versus Jaeger: argue the storage trade-off in one minute. Jaeger indexes spans in a database (ES/Cassandra), so tag search is fast but you pay for an always-on cluster sized for peak. Tempo writes columnar Parquet blocks to object storage with minimal index; trace-by-ID is cheap, search is a parallel scan (TraceQL), and cost scales with bytes at S3 prices. Tempo’s bet: you usually arrive carrying a trace_id (exemplar, log link), so a heavy index is wasted spend. Choose Jaeger for indexed ad-hoc search you truly need; Tempo for volume and Grafana-native correlation.

Q4. What is an exemplar, end to end? A sample attached to a histogram bucket carrying the trace_id (plus value/timestamp) of one real request in that bucket. Chain: metrics source attaches it (spanmetrics connector or SDK) → remote_write with send_exemplars → Prometheus stores it behind --enable-feature=exemplar-storage in a bounded buffer → Grafana’s exemplarTraceIdDestinations renders diamonds that deep-link into the trace datasource.

Q5. Your consumer spans all start new traces. Diagnose. Context isn’t riding the messages: HTTP auto-instrumentation doesn’t cover queue carriers. Producer must inject() the context into message headers; consumer must extract() and start its CONSUMER span against that context — or, for batch consumption, attach span links to each message’s context instead of faking a single parent.
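
The three consumer outcomes (proper child, accidental new root, batch with links) can be sketched without any SDK. Everything here is illustrative: the `Span` class, the Kafka-style `(key, bytes)` header list, the span names and the IDs are assumptions, and the helpers only mimic what real `inject()`/`extract()` calls do.

```python
import secrets
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Headers = List[Tuple[str, bytes]]  # Kafka-style record headers


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_span_id: Optional[str] = None
    links: List[Tuple[str, str]] = field(default_factory=list)


def inject(span: Span, headers: Headers) -> None:
    """Producer side: put the PRODUCER span's context on the record."""
    headers.append(("traceparent", f"00-{span.trace_id}-{span.span_id}-01".encode()))


def extract(headers: Headers) -> Optional[Tuple[str, str]]:
    """Consumer side: return (trace_id, span_id) from the record, if present."""
    for key, value in headers:
        parts = value.decode().split("-")
        if key == "traceparent" and len(parts) == 4:
            return parts[1], parts[2]
    return None


def consume_one(headers: Headers) -> Span:
    ctx = extract(headers)
    if ctx is None:
        # The failure mode from the question: no context, so a new root trace.
        return Span("process orders", secrets.token_hex(16), secrets.token_hex(8))
    trace_id, parent_id = ctx
    return Span("process orders", trace_id, secrets.token_hex(8), parent_span_id=parent_id)


def consume_batch(batch: List[Headers]) -> Span:
    """One processing span for the batch, linked to each message's context."""
    span = Span("process orders batch", secrets.token_hex(16), secrets.token_hex(8))
    span.links = [ctx for headers in batch if (ctx := extract(headers)) is not None]
    return span
```

Links keep each message's originating trace reachable from the batch span without pretending that one of N producers was the parent.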

Q6. How do service-graph metrics get computed, and what breaks them? The processor pairs a CLIENT span with the remote SERVER span it spawned (same trace, parent relationship) and emits client→server edge counters and latency histograms from both viewpoints. Broken by: missing instrumentation on one side, wrong span kinds (INTERNAL everywhere), or propagation drops — all visible as traces_service_graph_unpaired_spans_total rising.
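
The pairing logic is simple enough to sketch. This is a simplified illustration, not Tempo's implementation: it works on a complete in-memory list of spans, ignores timeouts and latency histograms, skips root SERVER spans, and uses hypothetical names throughout.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    kind: str  # "CLIENT", "SERVER", "INTERNAL", ...


def service_edges(spans: Iterable[Span]) -> Tuple[Counter, int]:
    """Return (edge counts keyed by (client_service, server_service), unpaired count)."""
    spans = list(spans)
    clients = {(s.trace_id, s.span_id): s for s in spans if s.kind == "CLIENT"}
    edges: Counter = Counter()
    paired = set()
    unpaired = 0
    for s in spans:
        if s.kind != "SERVER" or s.parent_id is None:
            continue  # root SERVER spans have no caller to pair with
        key = (s.trace_id, s.parent_id)
        client = clients.get(key)
        if client is None:
            unpaired += 1  # caller uninstrumented, or its span has the wrong kind
        else:
            edges[(client.service, s.service)] += 1
            paired.add(key)
    unpaired += len(clients) - len(paired)  # CLIENT spans whose callee never appeared
    return edges, unpaired


if __name__ == "__main__":
    demo = [
        Span("t1", "a1", None, "checkout", "SERVER"),
        Span("t1", "a2", "a1", "checkout", "CLIENT"),
        Span("t1", "b1", "a2", "payments", "SERVER"),
        Span("t1", "a3", "a1", "checkout", "CLIENT"),  # callee not instrumented
    ]
    print(service_edges(demo))
```

Both halves of an edge have to exist and carry the right kind; mark everything INTERNAL and `clients` is empty, so every non-root SERVER span lands in the unpaired count.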

Q7. Why must span metrics be generated before tail sampling? Tail sampling drops most healthy traces by design. Metrics generated after it measure only the survivors: at a 5% keep rate, request rates read roughly 20× too low, and latency distributions skew toward the slow traces you kept. Fan the raw stream into spanmetrics in one pipeline and sample in another; metrics stay 100%-accurate while traces stay cheap.
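
A toy calculation shows the bias. The traffic mix and the keep rule are invented for illustration: 950 healthy requests at 100 ms, 50 slow ones at 900 ms, and a tail policy that keeps every slow request plus every 20th healthy one.

```python
from statistics import median

# Toy minute of traffic: 950 healthy requests at 100 ms, 50 slow ones at 900 ms.
durations_ms = [100] * 950 + [900] * 50


def tail_kept(durations, slow_ms=500, keep_every=20):
    """Keep every slow request plus every `keep_every`-th healthy one."""
    kept, healthy_seen = [], 0
    for d in durations:
        if d > slow_ms:
            kept.append(d)
        else:
            if healthy_seen % keep_every == 0:
                kept.append(d)
            healthy_seen += 1
    return kept


kept = tail_kept(durations_ms)
print(len(durations_ms), median(durations_ms))  # → 1000 100.0
print(len(kept), median(kept))                  # → 98 900.0
```

Measured after sampling, the request count is about ten times too low in this mix and the median latency reads 900 ms instead of 100 ms, even though nothing about the service changed.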

Q8. How does multi-tenancy work in Tempo, and what’s the equivalent story in Jaeger? Tempo: multitenancy_enabled: true; every ingest and query carries X-Scope-OrgID; per-tenant overrides govern rate limits, max trace size, metrics-generator processors and block_retention. The header is unauthenticated, so an authenticating gateway fronts it. Jaeger has no native tenancy — you isolate by running instances per team or splitting ES index prefixes, with access control at the proxy layer.

Q9. A latency panel’s exemplar click says “trace not found”. Explain the failure and two fixes. Metrics (and their exemplars) were generated from 100% of spans, but tail sampling dropped that particular trace before Tempo stored it, so the exemplar is a dangling pointer. Fixes: make sure tail policies keep the populations that exemplars on anomalous buckets point at (all errors and all slow traces), or generate the exemplar-bearing metrics from the post-sampling stream and accept slightly biased exemplar coverage of healthy traffic.

Q10. What belongs in baggage, and what must never be there? Belongs: tiny, non-sensitive routing/business context needed by many hops — tenant tier, experiment arm, synthetic-traffic flag. Never: PII, tokens, anything large — baggage propagates to every downstream service (including third parties unless stripped at egress) and leaks into logs.

Q11. Certification mapping: where does this show up? CNCF OTCA (OpenTelemetry Certified Associate) covers propagation, the SDK pipeline, sampling and Collector config directly; PCA (Prometheus Certified Associate) covers the metrics-side mechanics (histograms, remote_write, exemplars). Cloud certs (AWS DevOps Pro, Azure AZ-400) test the managed equivalents — X-Ray, Application Insights — where the same head/tail sampling and correlation concepts transfer.

Quick check

In traceparent: 00-4bf9…4736-00f0…02b7-01, what does the trailing 01 mean and which sampler type honours it downstream?
You see traces_span_metrics_calls_total and traces_spanmetrics_calls_total both in Prometheus. What happened and why is it bad?
Which TraceQL operator finds traces where a checkout span has a slow payments descendant, and can Jaeger’s search express the same?
Name the four stages of the exemplar chain that must all be enabled.
Why is trace_id stored in the log line (or structured metadata) rather than as a Loki stream label?

Answers

It’s the sampled flag (bit 0 of trace-flags). ParentBased samplers downstream honour it, which is what keeps distributed traces complete instead of partially sampled.
Both the Collector spanmetrics connector and Tempo’s metrics-generator are enabled — every RED number is double-counted. Disable one; keep the connector if you tail-sample so metrics see 100% of spans.
The descendant operator >>: { resource.service.name="checkout" } >> { resource.service.name="payments" && duration > 500ms }. Jaeger’s tag search cannot express structural (ancestor/descendant) conditions — its index is per-span, not per-tree.
Source emits exemplars (connector/SDK) → transport preserves them (send_exemplars on remote_write / OTLP native) → Prometheus stores them (--enable-feature=exemplar-storage) → Grafana links them (exemplarTraceIdDestinations on the Prometheus datasource).
Trace IDs are unique per request; as a label each would mint a one-entry stream — unbounded cardinality that destroys Loki’s index and chunk efficiency. In the line (or structured metadata) it stays filterable without minting streams.

Glossary

Span: one timed unit of work with name, attributes, events, status; the atom of tracing.
Trace: the tree of spans sharing one trace_id across processes.
Span context: the propagated tuple — trace_id, span_id, trace flags, trace state.
traceparent / tracestate / baggage: the W3C headers carrying context (and key-value baggage) between processes.
Propagator: SDK component that injects/extracts context into carriers (W3C, B3, Jaeger, X-Ray formats).
SpanKind: SERVER/CLIENT/PRODUCER/CONSUMER/INTERNAL — drives service graphs and latency attribution.
Semantic conventions (semconv): OTel’s standard attribute names (http.route, db.system.name…) that make spans aggregatable.
OTLP: the OpenTelemetry wire protocol (gRPC 4317 / HTTP-protobuf 4318) spoken by SDKs, Collectors, Tempo and Jaeger.
Head / tail sampling: keep-decision at trace start in the SDK vs after trace completion in a gateway.
TraceQL: Tempo’s structural query language over stored traces (filters, >>, aggregates, metrics functions).
Metrics-generator: Tempo component deriving span metrics and service graphs from ingested spans, remote-written to Prometheus.
Exemplar: a histogram-bucket sample carrying a trace_id — the metrics→trace hyperlink.
Derived field: Loki datasource feature extracting a value (trace_id) from log lines and linking it to another datasource.
X-Scope-OrgID: the tenant header used by Tempo (and Loki/Mimir) for multi-tenant ingest and query scoping.
Service graph: the directed service-to-service edge map computed from paired CLIENT/SERVER spans.

Next steps

Stand the backend up properly: Configure Grafana Tempo with TraceQL, Metrics-Generator, and S3 Block Storage — blocks, compaction, caching, TraceQL metrics.
Build the pipeline tier this article assumes: Building Production OpenTelemetry Collector Pipelines: Receivers, Processors, and Tail Sampling, then scale the sampling gateway with Tail-Based Sampling at Scale with the OpenTelemetry Collector and Load-Balancing Exporter.
Finish the correlation mesh from the metrics side: Wiring OpenTelemetry Metrics and Exemplars for Click-Through Trace Correlation and the log side: Deploy Loki in Distributed Microservices Mode with S3 Chunk Storage and Index Gateway.
Add the fourth signal — from a slow span straight into a flame graph: Continuous Profiling in Production with eBPF: Parca, Pyroscope, and Flame Graphs.