Wiring OpenTelemetry Metrics and Exemplars for Click-Through Trace Correlation

A p99 latency panel that turns red tells you that something is slow. It does not tell you which request, on which pod, hitting which downstream, at which moment. You stare at a curve climbing past 800 ms, you know users are hurting, and the next thirty minutes are spent eyeballing timestamps — flipping between the Grafana tab and the Tempo tab, guessing which of the ten thousand traces in that minute is the one the spike is made of. That manual correlation is the single biggest tax on latency-regression MTTR, and it exists purely because a metric is an aggregate: by the time a value lands in a histogram bucket, the identity of the request that produced it has been averaged away.

Exemplars close that gap. An exemplar is a single representative measurement — a value, a timestamp, and a set of attributes — that the metrics SDK pins to the exact trace_id (and span_id) that produced it. One bucket increment in http_server_request_duration_seconds_bucket carries alongside it a pointer that says “and one of the requests in this bucket was trace 4bf92f35….” Store that pointer through the pipeline, render it as a diamond on the Grafana panel, map the diamond to a tracing data source, and the click on the spike drops you into the offending trace. The aggregate and the example travel together; a manual archaeology dig becomes a single click.

This article is the full wiring of that path, end to end, at the depth a senior platform engineer needs to run it for hundreds of services. We cover the OpenTelemetry metrics data model (and the delta-vs-cumulative landmine that breaks Prometheus), how exemplar collection is gated by the exemplar filter and the active span context, the OTLP and Prometheus exemplar wire formats byte for byte, both export paths into Prometheus (the native OTLP receiver vs the Collector prometheusremotewrite exporter), Prometheus --enable-feature=exemplar-storage and the storage buffer, native (exponential) histograms and why they change the exemplar story, the Grafana overlay and the exemplarTraceIdDestinations metric→trace jump, the RED metrics + exemplar pivot, and — the part that bites everyone — the sampling interplay: why head sampling, tail sampling, and the exemplar filter must agree or your overlay stays empty no matter how perfect the rest is. Everything is real config — SDK env vars, Collector YAML, prometheus.yml, Grafana provisioning, PromQL — plus a free-tier lab. This is the metrics-and-correlation half; the tracing/propagation half (how the trace_id got there, and how to store and query the trace it points at) lives in Distributed Tracing End-to-End: Context Propagation, Tempo, and Correlating Traces with Metrics and Logs.

What problem this solves

Three failure modes describe almost every team that “has metrics and has traces but cannot connect them.”

The eyeball-correlation tax. On-call sees a p99 spike at 14:32. To find the slow request they open Tempo, search a narrow window around 14:32 filtered by duration > 800ms, scroll the results, and try to confirm one is representative of the spike rather than an unrelated slow request nearby. With moderate traffic that window holds thousands of traces; the search is slow, the filter is fuzzy, and the engineer is never sure they found the trace. Exemplars make the spike itself carry a known-good trace_id — no search, no guessing.

The dashboards-and-traces-are-two-worlds tax. Metrics live in Prometheus and Grafana; traces live in Tempo/Jaeger; nobody set up the jump between them. Engineers context-switch between tools, copy-pasting timestamps by hand, and the cognitive load means they often don’t bother — they ack the alert, restart the pod, and the real cause goes uninvestigated. A working exemplar pivot turns “two worlds” into one panel you click through.

The “we wired it and the diamonds never appear” tax. The cruel one, because the team did the work — enabled exemplar storage, toggled the overlay — and the panel is still empty. Almost always it is a silent drop somewhere in the chain: no sampled span was active at record time; the Collector prometheusremotewrite exporter had export_exemplars at its default false and threw every exemplar away; tail sampling decided keep/drop after the metrics pipeline already recorded; a recording rule re-aggregated the bucket series and stripped the exemplars; or the PromQL name doesn’t match the _bucket series the exemplars attached to. Every one is a config-level fault with a specific confirm command, and the bulk of this article teaches you to localize which.

Who hits this: anyone running OpenTelemetry into a Prometheus-compatible store (Prometheus, Mimir, Thanos, Grafana Cloud) alongside a tracing backend (Tempo, Jaeger), wanting click-through from latency dashboards to traces. It bites hardest on high-traffic services that must keep trace sampling low for cost — exactly where manual correlation is most painful and the sampling interplay is most subtle.

Learning objectives

By the end of this article you can:

Explain what an exemplar is at the data-model level — value, timestamp, filtered attributes, and the trace_id/span_id that link it to a trace — and where it lives inside an OTLP histogram data point and a Prometheus _bucket series.
Choose cumulative temporality for any Prometheus-bound metrics pipeline and explain exactly why delta temporality produces nonsense from rate() and histogram_quantile().
Enable exemplar collection in the OpenTelemetry SDK with OTEL_METRICS_EXEMPLAR_FILTER=trace_based, and explain why a sampled span must be active at record time for an exemplar to exist at all.
Read and reason about the OTLP exemplar wire format and the Prometheus exemplar exposition / OpenMetrics format, including the # {trace_id="…"} value timestamp syntax.
Wire both supported export paths into Prometheus — the native OTLP receiver (--web.enable-otlp-receiver) and the Collector prometheusremotewrite exporter with export_exemplars: true — and pick the right one.
Turn on --enable-feature=exemplar-storage, size storage.exemplars.max_exemplars, and reason about the fixed-size circular exemplar buffer and its retention math.
Decide between explicit-bucket and exponential/native histograms for latency SLIs, enable native histograms in Prometheus, and explain how each interacts with exemplars and quantile accuracy.
Configure the Grafana exemplar overlay and exemplarTraceIdDestinations so a click on a diamond opens the trace in Tempo/Jaeger, and build a RED dashboard whose panels pivot to traces.
Reconcile head sampling, tail sampling, and the exemplar filter so exemplars actually appear without raising your trace storage bill.

Prerequisites & where this fits

You should already be comfortable with the OpenTelemetry signal model — that metrics, traces, and logs are three signals carried by OTLP, that a span represents one unit of work with a trace_id/span_id, and that context propagation is how the trace_id flows across service boundaries. You should know Prometheus basics: that it is a pull-based, cumulative time-series database, that a histogram is a set of _bucket/_sum/_count series, and that PromQL rate() and histogram_quantile() are how you turn those into a p99. You should be able to run a container, edit YAML, and read protobuf-shaped JSON. Familiarity with Grafana data sources and Tempo or Jaeger as a trace store rounds it out.

This sits in the Observability track, specifically the correlation layer that ties the three signals together. It assumes the metrics fundamentals from PromQL in Anger: Rate, Histograms, and Aggregation Patterns That Actually Work and the collection plumbing from Building Production OpenTelemetry Collector Pipelines: Receivers, Processors, and Tail Sampling. It is the metrics-side complement to Distributed Tracing End-to-End: Context Propagation, Tempo, and Correlating Traces with Metrics and Logs, which owns the trace-side of the same loop. The sampling discussion connects directly to Tail-Based Sampling at Scale with the OpenTelemetry Collector and Load-Balancing Exporter, and the cardinality cautions to Taming Metric Cardinality: Relabeling, Limits, and Cost Governance in Prometheus. The dashboard pattern at the end builds on Engineering Grafana Dashboards That Get Used: RED, USE, Template Variables, and Provisioning-as-Code.

A quick map of who owns each hop in this pipeline, so you know which config file to open when the overlay is empty:

Hop	What lives here	Config surface	Failure it can cause
Application + SDK	Instruments, spans, exemplar filter, temporality	`OTEL_*` env vars, SDK code	No exemplar at all (no sampled span / filter off)
Sampler	Head sampling decision (sampled bit)	`OTEL_TRACES_SAMPLER*`	Sampled bit unset → no exemplar candidate
OTLP exporter	Wire format, protocol, endpoint	`OTEL_EXPORTER_OTLP_*`	Wrong endpoint/protocol → metrics or exemplars lost
Collector	Batching, tail sampling, fan-out, exemplar pass-through	Collector YAML	`export_exemplars: false` drops them; tail sampling timing
Prometheus	Ingest, exemplar storage, native histograms	`prometheus.yml` + flags	Storage off → exemplars rejected; name translation
Grafana	Overlay query, trace-id destination	Data source provisioning + panel	Overlay off / no destination → no click-through
Tempo/Jaeger	The trace the exemplar points at	Trace store config	Trace not stored (sampled out) → dead link

Core concepts

Six mental models make every later configuration decision obvious.

An exemplar is an example pinned to an aggregate. A histogram throws away identity by design — it tells you the shape of the distribution (how many requests fell in each latency bucket) but not which requests. An exemplar re-attaches one concrete example to a bucket: a single recorded value, the wall-clock timestamp it was recorded at, a small set of filtered attributes, and — the load-bearing part — the trace_id and span_id of the span that was active when it was recorded. It is deliberately one example per bucket per export interval, not all of them; you do not need every slow request, you need one you can click into.

Exemplars are read from context, never stamped by hand. You do not write code that says “attach trace_id X to this measurement.” When you call histogram.record(value, attrs) and a span is active in context, the SDK reads the active trace_id/span_id off that context and attaches them automatically. Your only job is to ensure a span is active when you record — which, with HTTP/gRPC auto-instrumentation wrapping your handler, is already true. The linkage is a property of where the record call happens, not of any explicit ID-passing.

The exemplar filter gates which measurements become candidates. The SDK has an exemplar filter that decides whether a given measurement is even eligible to become an exemplar. The default and correct production value is trace_based: a measurement is a candidate only if it is recorded inside a span whose context is sampled (the sampled flag on the span context is set). This is exactly what you want — an exemplar should point at a trace that actually exists in your tracing backend. The alternatives are always_on (every measurement is a candidate, even with no trace — wasteful, and most exemplars carry no usable trace_id) and always_off (disabled). The filter is the first valve; the active-span requirement is the second.

Temporality decides whether Prometheus can read your metrics at all. OpenTelemetry can export metrics with cumulative temporality (a counter is a running total since process start; histogram buckets are total counts since start) or delta temporality (each export carries only the increment since the last export). Prometheus is a cumulative system to its core: rate() assumes monotonically increasing counters with reset detection, and histogram_quantile() assumes cumulative bucket counts. Export delta to Prometheus and every rate and quantile is garbage. For a Prometheus backend you must force cumulative.

The histogram type changes the exemplar and accuracy story. An explicit-bucket histogram uses boundaries you define and is stored in Prometheus as classic _bucket series — battle-tested, exemplars attach per bucket, but quantile accuracy is only as good as your bucket layout. An exponential histogram auto-scales its buckets to the data and is stored in Prometheus as a native histogram — far better relative accuracy across a wide range, fewer series on the wire, but it is a newer feature with its own exemplar handling and not yet wired into every panel and recording rule. Choosing between them is a real decision, covered in its own section.

The whole loop is only as strong as the sampling agreement. Because trace_based requires a sampled span at record time, the sampling decision must be made before the handler records. Head sampling (decide at span start) satisfies this; tail sampling (decide later, in the Collector) does not — by record time the bit was not yet set, so no exemplar is produced even for kept traces. This is the most common reason a perfectly-wired overlay stays empty, and it gets its own section.

The vocabulary in one table

Pin down every moving part before the deep sections. The glossary repeats these for lookup; this is the mental model side by side:

Term	One-line definition	Where it lives	Why it matters here
Exemplar	An example measurement pinned to a bucket, carrying a `trace_id`	Histogram data point / `_bucket` series	The thing you click through to a trace
Exemplar filter	SDK policy deciding which measurements are exemplar candidates	OTel SDK	`trace_based` (right), `always_on`, `always_off`
Temporality	Cumulative vs delta accumulation of metric values	SDK exporter	Prometheus needs cumulative
Explicit-bucket histogram	Histogram with boundaries you define	SDK view → `_bucket` series	Predictable, exemplars per bucket
Exponential histogram	Auto-scaling base-2 histogram	SDK → Prometheus native histogram	High accuracy, fewer series
Native histogram	Prometheus storage form of an exponential histogram	Prometheus TSDB	Behind a feature flag; own exemplar handling
OTLP	The OpenTelemetry wire protocol (gRPC or HTTP/protobuf)	Between SDK ↔ Collector ↔ Prometheus	Carries exemplars in the histogram data point
`prometheusremotewrite`	Collector exporter to Prometheus remote-write	Collector	Needs `export_exemplars: true`
Exemplar storage	Prometheus’ fixed-size in-memory exemplar buffer	Prometheus	`--enable-feature=exemplar-storage` + sizing
`exemplarTraceIdDestinations`	Grafana mapping from exemplar `trace_id` to a trace data source	Grafana data source	Makes the diamond clickable into Tempo
RED metrics	Rate, Errors, Duration — the request-service golden signals	Dashboards	Duration panel is where exemplars pivot
Head sampling	Sampling decision at span start	SDK sampler	Sets the sampled bit before record → exemplar works
Tail sampling	Sampling decision after span completes	Collector	Bit unset at record time → no exemplar

The OTel metrics data model you actually need

Four instrument kinds carry almost all production signal, and only one of them meaningfully carries exemplars.

Instrument	Monotonic?	What it records	Carries exemplars?	Typical SLI use
Counter	Yes (up only)	Requests served, bytes written	Technically yes, rarely useful	Rate (the R in RED), error count (the E)
UpDownCounter	No	Items in a queue, active connections	Rarely	Saturation-style gauges
Histogram	n/a	A distribution into buckets (duration, size)	Yes — this is where exemplars live	Latency (the D in RED), payload size
Gauge (async)	No	Point-in-time value (memory in use, temp)	No	Resource levels

You record into a histogram, and a single bucket increment can carry an exemplar pinning that observation to a trace. Counters can technically carry an exemplar on their sum, but you almost never pivot from a counter — you pivot from a latency histogram, because “this latency was slow, show me why” is the question exemplars answer. In practice: exemplars are a histogram feature, and the histogram you care about is request duration.

Delta vs cumulative — the landmine

The decision that silently breaks Prometheus pipelines is temporality. Here is the difference and why it is not negotiable for Prometheus:

	Cumulative	Delta
Counter value	Running total since process start	Increment since last export
Histogram buckets	Total counts since start	Counts in this interval only
Reset handling	Reset to zero on restart; `rate()` detects it	Each export is self-contained
Memory in SDK	Holds running totals	Can forget after export
Prometheus wants	This one	Not this one
Delta-native backends want	—	This one (e.g. some cloud metric APIs)

Prometheus’ rate() computes the per-second increase of a counter by looking at the difference between samples and dividing by the time gap, correcting for resets (when the value drops, it assumes a process restart, not a negative rate). histogram_quantile() assumes the bucket counts are cumulative-since-start so it can compute the proportion of observations below a boundary. Feed it delta, where each export is just “what happened in this 15 s”, and reset detection misfires every time an interval is quieter than the one before it, the math goes nonsensical, and your p99 panel shows garbage that looks plausible enough to mislead. So for any Prometheus target, force cumulative on the SDK exporter:

# Prometheus is cumulative to its core. Never send it delta.
export OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=cumulative

The non-obvious part: even when you route through an OpenTelemetry Collector to Prometheus, the temporality is decided at the SDK, not the Collector. The Collector forwards what it receives unless you deliberately add a conversion step such as the contrib deltatocumulative processor. Both the Prometheus native OTLP receiver and the Collector prometheusremotewrite exporter want cumulative. Only set delta when your backend is a delta-native system that explicitly asks for it (some managed cloud metrics endpoints do), and then you are not talking to Prometheus.

Backend	Required temporality	Set on
Prometheus (OTLP receiver)	cumulative	SDK `OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE`
Prometheus (remote-write via Collector)	cumulative	SDK (Collector forwards as-is)
Grafana Mimir / Thanos	cumulative	SDK
Delta-native cloud metrics API	delta	SDK (per that vendor’s guidance)
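
To make the failure concrete, here is a minimal sketch of the reset-aware increase that rate() is built on, applied to the same traffic exported both ways. The function name and the sample numbers are illustrative assumptions, and the sketch leaves out the extrapolation to window boundaries that real PromQL performs.

```python
from itertools import accumulate


def increase_with_reset_detection(samples):
    """Sum the increase across samples, treating any drop as a counter reset.

    Simplified model of PromQL's increase(); the real function also
    extrapolates to the edges of the query window, which is omitted here.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:
            # Value went down: assume the process restarted from zero.
            total += cur
        else:
            total += cur - prev
    return total


# Hypothetical requests served in four consecutive export intervals.
per_interval = [5, 7, 3, 6]

cumulative_samples = list(accumulate(per_interval))  # running totals
delta_samples = per_interval                         # per-interval increments

# Growth that actually happened after the first sample was taken.
true_increase = sum(per_interval[1:])
cumulative_result = increase_with_reset_detection(cumulative_samples)
delta_result = increase_with_reset_detection(delta_samples)

print(true_increase, cumulative_result, delta_result)
```

On the cumulative series the function recovers the true increase of 16. On the delta series it reports 8, because the quieter third interval is mistaken for a restart and the other steps are differenced as if they were running totals.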

Turn on exemplar collection in the SDK

Two valves must both be open for an exemplar to exist: the exemplar filter must admit the measurement, and a sampled span must be active when you record. Set the filter explicitly so it survives SDK default changes:

# Only attach exemplars when a sampled span is in context.
export OTEL_METRICS_EXEMPLAR_FILTER=trace_based

The three filter values, and when each is right:

Filter value	Behaviour	When to use	Cost
`trace_based`	Candidate only if recorded in a sampled span	Production default — exemplars always point at a real trace	Minimal; bounded by sampling rate
`always_on`	Every measurement is a candidate	Debugging without tracing; you want some example even without a trace	Higher; most exemplars have no `trace_id`
`always_off`	No exemplars ever	When you have no trace store and never pivot	None

Stick with trace_based. The SDK then does the rest automatically: when record() runs and a sampled span is current, it captures the value, the timestamp, any attributes the instrument’s view filtered out of the metric’s label set, and the active trace_id/span_id. You never stamp trace IDs by hand. Your only obligation is to make a span active at record time. With auto-instrumentation (an HTTP/gRPC server span wrapping the handler), that is already the case. Here is an explicit Python example so the linkage is unambiguous; the record() sits inside the span:

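from opentelemetry import metrics, trace
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# An SDK TracerProvider is required: with the default no-op provider, spans
# are never sampled, so trace_based produces no exemplars.
trace.set_tracer_provider(TracerProvider())

# Flush metrics (and their exemplars) to the Collector every 15 s.
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://otel-collector:4317", insecure=True),
    export_interval_millis=15000,
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))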

meter = metrics.get_meter("checkout")
latency = meter.create_histogram(
    name="http.server.request.duration",
    unit="s",
    description="HTTP server request duration",
)
tracer = trace.get_tracer("checkout")

def handle(route, status, elapsed_seconds):
    with tracer.start_as_current_span("checkout"):
        # A sampled span is active here, so this record() emits an exemplar
        # carrying the active trace_id/span_id automatically.
        latency.record(elapsed_seconds, {"http.route": route, "http.response.status_code": status})
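
The two valves can also be modeled without the SDK. The sketch below is a toy, standard-library-only histogram that always updates the bucket count but only keeps an exemplar when the filter admits the measurement. `SpanContext` and `ToyHistogram` are hypothetical stand-ins for illustration, not SDK classes, and the real SDK's reservoir logic is more involved.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanContext:
    """Hypothetical stand-in for the SDK's span context."""
    trace_id: str
    span_id: str
    sampled: bool


class ToyHistogram:
    """Explicit-bucket histogram that keeps at most one exemplar per bucket."""

    def __init__(self, boundaries, exemplar_filter="trace_based"):
        self.boundaries = list(boundaries)
        self.exemplar_filter = exemplar_filter
        # One extra bucket for values above the last boundary (+Inf).
        self.counts = [0] * (len(self.boundaries) + 1)
        self.exemplars = {}

    def _is_candidate(self, ctx: Optional[SpanContext]) -> bool:
        if self.exemplar_filter == "always_off":
            return False
        if self.exemplar_filter == "always_on":
            return True
        # trace_based: needs an active span whose sampled flag is set.
        return ctx is not None and ctx.sampled

    def record(self, value: float, ctx: Optional[SpanContext] = None) -> None:
        bucket = bisect_left(self.boundaries, value)  # boundaries are upper bounds (le)
        self.counts[bucket] += 1                      # the aggregate always updates
        if self._is_candidate(ctx):
            # Latest candidate wins; IDs are None when no span was active.
            self.exemplars[bucket] = (
                value,
                ctx.trace_id if ctx else None,
                ctx.span_id if ctx else None,
            )


hist = ToyHistogram(boundaries=[0.1, 0.5, 1.0])
hist.record(0.05)                                            # no active span
hist.record(0.30, SpanContext("aaaa", "01", sampled=False))  # span not sampled
hist.record(0.84, SpanContext("4bf92f3577b34da6a3ce929d0e0e4736",
                              "00f067aa0ba902b7", sampled=True))
print(hist.counts, hist.exemplars)
```

All three measurements land in a bucket, but only the one recorded inside a sampled span leaves an exemplar behind. An empty overlay sitting on top of a healthy histogram has exactly this shape.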

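The same OTEL_* environment variables are defined by the specification for the Java agent, the Go SDK, the Node SDK, and the .NET SDK, so exemplar behaviour is meant to be consistent across languages. Support landed at different times in each SDK, though, so confirm that the version you run actually implements exemplars before debugging the rest of the chain. Language-specific notes worth knowing: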

Language	Exemplar support	How spans get active	Note
Java (agent)	Auto via `OTEL_METRICS_EXEMPLAR_FILTER`	Auto-instrumentation wraps handlers	Most turnkey; agent sets context automatically
Go (SDK)	Manual SDK setup; honours the env var	`otel.GetTracerProvider()` + middleware	Wire the histogram view and middleware yourself
Python (SDK)	Honours the env var	`start_as_current_span` or auto-instrumentation	As shown above
Node (SDK)	Honours the env var	`@opentelemetry/auto-instrumentations-node`	Ensure metrics + traces share the same context manager
.NET	`System.Diagnostics.Metrics` + OTel	`ActivitySource` spans	Exemplar support follows the OTel .NET version

The export interval (OTEL_METRIC_EXPORT_INTERVAL, default 60000 ms; 15000 shown above) controls how often the reader flushes — and therefore how fresh your exemplars are. A shorter interval gets exemplars onto the panel faster at the cost of more export traffic. The full set of SDK env vars that govern this pipeline:

Env var	Controls	Recommended for Prometheus	Default
`OTEL_METRICS_EXEMPLAR_FILTER`	Which measurements become exemplar candidates	`trace_based`	`trace_based`
`OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE`	Cumulative vs delta	`cumulative`	`cumulative` per the spec (set it anyway)
`OTEL_METRIC_EXPORT_INTERVAL`	Reader flush interval (ms)	15000–30000	60000
`OTEL_EXPORTER_OTLP_METRICS_ENDPOINT`	Where metrics go	Collector or Prometheus OTLP receiver	—
`OTEL_EXPORTER_OTLP_METRICS_PROTOCOL`	`grpc` or `http/protobuf`	match the endpoint	`http/protobuf` per the spec; some SDKs default to `grpc`
`OTEL_TRACES_SAMPLER`	Head-sampling strategy	`parentbased_traceidratio`	`parentbased_always_on`
`OTEL_TRACES_SAMPLER_ARG`	Sampling ratio	e.g. `0.05`	`1.0`

The exemplar wire format, byte for byte

To debug a missing exemplar you have to know what it looks like on the wire at each hop. There are two representations: the OTLP form (protobuf, between SDK ↔ Collector ↔ Prometheus OTLP receiver) and the Prometheus / OpenMetrics form (text exposition and the remote-write/query-API representation).

OTLP exemplar (inside a histogram data point)

In OTLP, every HistogramDataPoint (and ExponentialHistogramDataPoint) carries a repeated exemplars field. Each Exemplar message has these fields:

OTLP `Exemplar` field	Type	Meaning
`time_unix_nano`	fixed64	Wall-clock time the measurement was recorded
`as_double` / `as_int`	double / sfixed64	The exemplar’s value (one of the two)
`filtered_attributes`	repeated KeyValue	Attributes present at record time but not part of the metric’s identifying attribute set
`span_id`	bytes (8)	The active span’s ID at record time
`trace_id`	bytes (16)	The active trace’s ID at record time

Two subtleties. First, trace_id and span_id are raw bytes in OTLP (16 and 8 bytes), not hex strings — the hex you see in Grafana is a rendering. Second, filtered_attributes holds the attributes that were on the measurement but are not part of the metric’s label set, so the exemplar can carry extra context (a user.id, a pod name) without inflating metric cardinality. This is why exemplar attributes can be richer than metric labels.
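
The bytes-versus-hex distinction is easy to see in a few lines. This sketch models a subset of the OTLP Exemplar fields with a plain dataclass and renders them as the hex-labelled text form used in exposition output. `OtlpExemplar` and `to_openmetrics_suffix` are illustrative names, not the generated protobuf class or any library function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OtlpExemplar:
    """Illustrative subset of the OTLP Exemplar message."""
    time_unix_nano: int
    as_double: float
    span_id: bytes   # 8 raw bytes on the wire
    trace_id: bytes  # 16 raw bytes on the wire


def to_openmetrics_suffix(ex: OtlpExemplar) -> str:
    """Render the exemplar as a '# {...} value timestamp' suffix."""
    if len(ex.trace_id) != 16 or len(ex.span_id) != 8:
        raise ValueError("trace_id must be 16 bytes and span_id 8 bytes")
    # OTLP carries nanoseconds; the text form carries float seconds.
    seconds = ex.time_unix_nano / 1_000_000_000
    return (
        f'# {{trace_id="{ex.trace_id.hex()}",span_id="{ex.span_id.hex()}"}} '
        f"{ex.as_double} {seconds}"
    )


exemplar = OtlpExemplar(
    time_unix_nano=1_716_000_000_000_000_000,
    as_double=0.483,
    span_id=bytes.fromhex("00f067aa0ba902b7"),
    trace_id=bytes.fromhex("4bf92f3577b34da6a3ce929d0e0e4736"),
)
suffix = to_openmetrics_suffix(exemplar)
print(suffix)
```

If a tool along the path shows a trace_id that is not 32 hex characters, it is usually displaying the raw bytes in some other encoding. In that case the Grafana link will not resolve.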

Prometheus / OpenMetrics exemplar (text exposition)

When exposed in OpenMetrics text format (the /metrics exposition that a Prometheus exporter or the Collector prometheus exporter emits), an exemplar is appended to a sample line after a #:

# HELP http_server_request_duration_seconds Duration of HTTP server requests
# TYPE http_server_request_duration_seconds histogram
http_server_request_duration_seconds_bucket{http_route="/orders/{id}",le="0.5"} 24863 # {trace_id="4bf92f3577b34da6a3ce929d0e0e4736",span_id="00f067aa0ba902b7"} 0.483 1.7160000e9
http_server_request_duration_seconds_bucket{http_route="/orders/{id}",le="1.0"} 25102 # {trace_id="9c2e...",span_id="b7ad..."} 0.842 1.7160001e9

Reading that line: the series and its labels, then the bucket count, then #, then the exemplar’s labels (which must include trace_id for the trace jump to work), then the exemplar’s value, then the exemplar’s timestamp. The format pieces:

Part	Example	Notes
Series + labels	`..._bucket{http_route="/orders/{id}",le="0.5"}`	The `_bucket` series the exemplar attaches to
Sample value	`24863`	Cumulative bucket count
Separator	`#`	Marks the start of the exemplar
Exemplar labels	`{trace_id="4bf9…",span_id="00f0…"}`	Needs `trace_id` for click-through; OpenMetrics caps combined label names and values at 128 characters
Exemplar value	`0.483`	The observed value that fell in this bucket
Exemplar timestamp	`1.7160000e9`	Unix seconds (float); when it was observed

Constraints that bite: OpenMetrics limits the combined length of an exemplar’s label names and values to 128 UTF-8 characters (so you cannot stuff arbitrary context into the trace_id line; keep it to the IDs plus a tiny amount), exemplars are only valid on histogram _bucket lines and counter _total lines (not on _sum or _count), and the exemplar timestamp is the observation time, which is what lets Grafana place the diamond at the right point on the X axis rather than at scrape time.
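
When checking a /metrics endpoint by hand, a small parser makes the pieces explicit. The sketch below splits a sample line at the exemplar separator, extracts the labels, value, and timestamp, and enforces the 128-character label budget. It is a simplified reading of the format: it assumes the sequence " # " does not occur inside a series label value, and `parse_exemplar` is an illustrative helper, not a library function.

```python
import re

_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')
_EXEMPLAR_RE = re.compile(r"^\{(?P<labels>.*)\}\s+(?P<value>\S+)(?:\s+(?P<ts>\S+))?$")
MAX_EXEMPLAR_LABEL_CHARS = 128  # OpenMetrics limit on label names + values


def parse_exemplar(line: str):
    """Return (labels, value, timestamp), or None if the line has no exemplar."""
    _sample, sep, exemplar = line.partition(" # ")
    if not sep:
        return None
    match = _EXEMPLAR_RE.match(exemplar.strip())
    if match is None:
        raise ValueError(f"malformed exemplar: {exemplar!r}")
    labels = dict(_LABEL_RE.findall(match["labels"]))
    used = sum(len(name) + len(value) for name, value in labels.items())
    if used > MAX_EXEMPLAR_LABEL_CHARS:
        raise ValueError(f"exemplar labels use {used} chars, limit is 128")
    ts = float(match["ts"]) if match["ts"] is not None else None
    return labels, float(match["value"]), ts


LINE = (
    'http_server_request_duration_seconds_bucket{http_route="/orders/{id}",le="0.5"} 24863 '
    '# {trace_id="4bf92f3577b34da6a3ce929d0e0e4736",span_id="00f067aa0ba902b7"} 0.483 1.7160000e9'
)
labels, value, timestamp = parse_exemplar(LINE)
print(labels["trace_id"], value, timestamp)
```

A trace_id plus span_id pair uses 63 of the 128 characters, which leaves room for only one or two short extra labels.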

The metric-name translation that breaks overlays

The OTLP-to-Prometheus name translation is where most “the overlay has nothing to attach to” bugs originate. OTLP names like http.server.request.duration (unit s) become Prometheus series with a deterministic transformation:

Step	Rule	`http.server.request.duration` (unit `s`) becomes
Sanitize	Dots → underscores, invalid chars → `_` (case is preserved)	`http_server_request_duration`
Append unit	Add the unit suffix	`http_server_request_duration_seconds`
Histogram suffix	Bucket series get `_bucket`; also `_sum`, `_count`	`http_server_request_duration_seconds_bucket`
`le` label	Bucket boundary carried as the `le` label	`{le="0.5"}` etc.

So the series your PromQL and your Grafana overlay must reference is http_server_request_duration_seconds_bucket, not the OTLP name. Forget the _seconds or the _bucket and the query typically matches no series at all; point the panel at a _sum, _count, or recording-rule series instead and the query returns data but the exemplar overlay has nothing to hang on, because exemplars live on the _bucket series specifically. Newer Prometheus versions offer a translation strategy toggle (to optionally not append units), so confirm which convention your Prometheus is on; the names in your dashboards must match it exactly.
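
The translation steps in the table can be approximated in a few lines, which is handy for generating the expected series name in a dashboard lint. This is a sketch of the default suffix-appending behaviour only: the unit map is a small illustrative subset, and the real translator handles more cases (counter suffixes, compound units, alternative strategies).

```python
import re

# Small, illustrative subset of unit abbreviations; the real mapping is larger.
_UNIT_SUFFIXES = {"s": "seconds", "ms": "milliseconds", "By": "bytes"}


def prometheus_series_names(otlp_name: str, unit: str = "", is_histogram: bool = True):
    """Approximate the default OTLP -> Prometheus naming for one metric."""
    # Replace every character Prometheus does not allow with an underscore.
    base = re.sub(r"[^a-zA-Z0-9_:]", "_", otlp_name)
    suffix = _UNIT_SUFFIXES.get(unit)
    # Append the unit unless the name already ends with it.
    if suffix and not base.endswith("_" + suffix):
        base = f"{base}_{suffix}"
    if not is_histogram:
        return [base]
    # Classic histograms fan out into three series families.
    return [f"{base}_bucket", f"{base}_sum", f"{base}_count"]


names = prometheus_series_names("http.server.request.duration", "s")
print(names)
```

The first element is the series the overlay query must select, so comparing it against the metric name in each panel catches the missing-suffix mistakes before they reach production.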

Export OTLP and land exemplars in Prometheus

There are two supported paths, and the choice matters for exemplars and for operational shape.

Path A — push OTLP straight into Prometheus

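Prometheus 2.47+ ships a native OTLP receiver at /api/v1/otlp/v1/metrics. On 2.x it is enabled with --enable-feature=otlp-write-receiver; Prometheus 3.x replaced that with the --web.enable-otlp-receiver flag used below. It is the simplest path, it preserves exemplars, and it is ideal when you do not already need a Collector. Configure the receiver and size exemplar storage in prometheus.yml: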

# prometheus.yml
otlp:
  # Promote a few resource attributes to labels; keep this list SHORT (cardinality).
  promote_resource_attributes:
    - service.name
    - service.namespace
    - deployment.environment

storage:
  exemplars:
    # Exemplar storage is a fixed-size in-memory circular buffer, switched on
    # by --enable-feature=exemplar-storage. This sets its size; 100k is a
    # reasonable starting point (and the documented default).
    max_exemplars: 100000

Start Prometheus with the flags that enable the OTLP write path and the exemplar API:

prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --web.enable-otlp-receiver \
  --enable-feature=exemplar-storage

Point the SDK (or a Collector) at the receiver. Note OTLP/HTTP, not gRPC, for this endpoint, and the full path:

export OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://prometheus:9090/api/v1/otlp/v1/metrics
export OTEL_EXPORTER_OTLP_METRICS_PROTOCOL=http/protobuf

Path B — Collector with the `prometheusremotewrite` exporter

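Use this when you already run a Collector for fan-out, batching, tail sampling, or multi-backend routing. Exemplars are dropped by default on this exporter, the single most common cause of empty overlays. Turn them on explicitly, and check the option against the exporter README for the Collector version you run, since an unrecognized key fails config validation at startup: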

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 8192

exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
    # OFF by default. Without this, every exemplar silently vanishes.
    export_exemplars: true
    resource_to_telemetry_conversion:
      enabled: false   # don't blow up cardinality by promoting all resource attrs

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]

The Prometheus side then needs the remote-write receiver plus exemplar storage:

prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --web.enable-remote-write-receiver \
  --enable-feature=exemplar-storage

There is a third option worth naming: the Collector’s prometheus exporter, which serves an exposition endpoint that Prometheus scrapes. Its enable_open_metrics setting exposes exemplars in OpenMetrics text for that scrape-based pull. Most teams choose push (remote-write) for OTLP-native pipelines. The three export shapes compared:

Path	Mechanism	Exemplar flag	Use when	Watch-out
A. Native OTLP receiver	SDK/Collector pushes OTLP to Prometheus	None extra (preserved)	Simplest; no Collector needed for metrics	Newer Prometheus; HTTP not gRPC; name translation
B. `prometheusremotewrite`	Collector pushes remote-write	`export_exemplars: true`	You already run a Collector	Default drops exemplars; remote-write receiver flag
C. `prometheus` exporter (scrape)	Collector exposes `/metrics`, Prometheus scrapes	`enable_open_metrics: true`	Pull-based shop	OpenMetrics scrape config; staleness on `_bucket`

Whichever path, the critical shared detail is the _bucket series: exemplars attach to histogram bucket time series, and Prometheus exposes them as http_server_request_duration_seconds_bucket. Get that name right everywhere downstream.

Prometheus exemplar storage: the buffer and its math

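--enable-feature=exemplar-storage turns on a fixed-size, in-memory, circular buffer of exemplars, sized by storage.exemplars.max_exemplars. Exemplars are never compacted into TSDB blocks the way samples are; the buffer is a bounded ring that the oldest exemplars fall out of. Prometheus also appends exemplars to the WAL, so recent ones can be replayed after a restart for as long as the WAL covers them. The consequences: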

Property	Behaviour	Implication
Storage	In-memory circular buffer, also appended to the WAL	Replayed from the WAL on restart; never durable in blocks like samples
Size control	`storage.exemplars.max_exemplars`	Total exemplars retained across all series
Eviction	Oldest-first (circular)	High-churn series can evict before you query
Retention	Implicit (until buffer wraps)	Effective window = buffer size ÷ exemplar ingest rate
Query API	`/api/v1/query_exemplars`	Time-bounded; returns what’s still in the buffer
Default	Off until the feature flag is set; `max_exemplars` then defaults to 100000	Forgetting the flag is a top cause of empty overlays
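
Before touching Grafana, it helps to confirm that exemplars are actually in the buffer by calling the query API directly. The sketch below builds the request URL and extracts trace IDs from a response body. The hostname comes from this article's examples, the helper names are illustrative, and the sample body is hand-written in the documented response shape rather than captured from a real server.

```python
import json
from urllib.parse import urlencode


def exemplar_query_url(base_url: str, promql: str, start: float, end: float) -> str:
    """Build a GET URL for Prometheus' /api/v1/query_exemplars endpoint."""
    params = urlencode({"query": promql, "start": start, "end": end})
    return f"{base_url.rstrip('/')}/api/v1/query_exemplars?{params}"


def trace_ids(response_body: str):
    """Pull trace IDs out of a query_exemplars JSON response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return [
        ex["labels"]["trace_id"]
        for series in payload["data"]
        for ex in series["exemplars"]
        if "trace_id" in ex["labels"]
    ]


url = exemplar_query_url(
    "http://prometheus:9090",  # hostname assumed from the examples in this article
    "http_server_request_duration_seconds_bucket",
    start=1716000000,
    end=1716000300,
)

# Hand-written sample in the documented response shape, not real output.
sample_body = json.dumps({
    "status": "success",
    "data": [{
        "seriesLabels": {"__name__": "http_server_request_duration_seconds_bucket", "le": "0.5"},
        "exemplars": [{
            "labels": {"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"},
            "value": "0.483",
            "timestamp": 1716000000.0,
        }],
    }],
})
print(url)
print(trace_ids(sample_body))
```

Fetch the URL with curl or urllib.request. An empty data list for a window that definitely had sampled traffic points at the SDK or the export path. A populated list alongside an empty panel points at Grafana.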

The retention math matters at scale. If you set max_exemplars: 100000 and your fleet emits ~2,000 exemplars/second, the buffer holds roughly 50 seconds of exemplars — meaning a spike older than ~50 s may have nothing to click. Size the buffer to the retention window you want to investigate (you usually look at the last few minutes), not to “as small as possible.” Rough sizing:

Exemplar ingest rate	`max_exemplars` for ~5 min window	Approx memory (at roughly 100 bytes per trace_id-only exemplar; more with extra labels)
200 /s	~60,000	~6 MB and up
1,000 /s	~300,000	~30 MB and up
5,000 /s	~1,500,000	~150 MB and up
20,000 /s	Consider Mimir/Thanos (per-tenant limits)	—
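The sizing arithmetic above is simple enough to script. The sketch below reproduces the two calculations (window from buffer size, buffer size from window) and models oldest-first eviction with a bounded deque. The deque is a toy stand-in for the circular buffer, not Prometheus' implementation, and the function names are illustrative.

```python
from collections import deque


def retention_seconds(max_exemplars: int, ingest_per_second: float) -> float:
    """Effective exemplar window: buffer size divided by ingest rate."""
    return max_exemplars / ingest_per_second


def buffer_for_window(ingest_per_second: float, window_seconds: float) -> int:
    """Buffer size needed to keep `window_seconds` worth of exemplars."""
    return int(ingest_per_second * window_seconds)


# The example from the text: 100,000 slots at ~2,000 exemplars/s.
print(retention_seconds(100_000, 2_000))  # 50.0
# A five-minute window at 1,000 exemplars/s.
print(buffer_for_window(1_000, 5 * 60))   # 300000

# Oldest-first eviction: once the ring is full, each append drops the oldest entry.
ring = deque(maxlen=3)
for trace_id in ["t1", "t2", "t3", "t4"]:
    ring.append(trace_id)
print(list(ring))  # ['t2', 't3', 't4']
```

Feed it your measured exemplar ingest rate rather than your request rate; with trace_based the two differ by roughly the sampling ratio.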

Two more facts. First, the exemplar ingest rate is bounded by your sampling rate (with trace_based, only sampled requests produce exemplars), so low sampling keeps the buffer cheap. Second, because the buffer is in-memory, a Prometheus restart drops all exemplars; they refill as new sampled requests arrive, but a post-restart panel is briefly diamond-less. For durable, multi-tenant exemplar storage at fleet scale, push to Grafana Mimir or Thanos, which have their own exemplar handling and limits; see Running Grafana Mimir: Multi-Tenant, Horizontally Scalable Prometheus Storage and Thanos in Production: Global Query View, Deduplication, and Object-Storage Downsampling.

Pick the right histogram: explicit-bucket vs exponential/native

For latency SLIs this is a real decision, not a detail, and it changes both quantile accuracy and the exemplar story.

Explicit-bucket histograms

Boundaries you define. You control resolution exactly where you need it (dense around your SLO threshold, sparse in the tail) and waste nothing elsewhere. The catch: if traffic shifts and your p99 lands inside a coarse bucket, histogram_quantile() interpolates linearly within that bucket and the number gets soft. Set the boundaries with a view and register it on the MeterProvider (views=[view]); a view overrides any advisory boundaries the instrument declares:

from opentelemetry.sdk.metrics.view import View, ExplicitBucketHistogramAggregation

view = View(
    instrument_name="http.server.request.duration",
    aggregation=ExplicitBucketHistogramAggregation(
        # Dense around a ~300 ms SLO, sparse in the tail.
        boundaries=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
    ),
)
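To see why a coarse bucket makes the quantile "soft", the following sketch re-implements the linear interpolation that a classic-histogram quantile performs over cumulative buckets. It is an approximation written for illustration (the function name and the sample counts are made up), not Prometheus' source.

```python
import math


def classic_histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (le, count) buckets by linear interpolation.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_le
            in_bucket = count - prev_count
            if in_bucket == 0:
                return le
            # Assume observations are spread evenly between the bucket's bounds.
            return prev_le + (le - prev_le) * (rank - prev_count) / in_bucket
        prev_le, prev_count = le, count
    return float("nan")


coarse = [(0.25, 90), (0.5, 98), (1.0, 100), (math.inf, 100)]
finer = [(0.25, 90), (0.5, 98), (0.75, 98), (1.0, 100), (math.inf, 100)]
print(classic_histogram_quantile(0.99, coarse))  # 0.75
print(classic_histogram_quantile(0.99, finer))   # 0.875
```

Both inputs describe the same 100 requests; only the boundaries differ. The estimate moves from 0.75 s to 0.875 s because the extra boundary at 0.75 shows that the two slowest requests sit above it.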

Exponential (native) histograms

Exponential histograms auto-scale their buckets to the data with a configurable max_scale, giving high relative accuracy across the whole range without you guessing boundaries. They put fewer time series on the wire (one native-histogram series instead of dozens of _bucket series) and they are the better default for unknown or wide-ranging distributions. On the storage side, Prometheus ingests them as native histograms, which is still behind a feature flag:

# Enable native histograms (and exemplar storage) in Prometheus.
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --enable-feature=native-histograms \
  --enable-feature=exemplar-storage \
  --web.enable-otlp-receiver

Switch the SDK to exponential aggregation via the OTLP exporter's default-histogram-aggregation preference or a view:

# Prefer exponential (base-2) histograms from the OTLP exporter.
export OTEL_EXPORTER_OTLP_METRICS_DEFAULT_HISTOGRAM_AGGREGATION=base2_exponential_bucket_histogram
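For intuition on how base-2 exponential buckets "auto-scale", this sketch computes which bucket a value falls into at a given scale, using the relationship base = 2^(2^-scale) with bucket i covering (base^i, base^(i+1)]. The helper names are illustrative, and the logarithm-based mapping can be off by one at exact bucket boundaries because of floating-point rounding, so treat it as a teaching aid rather than an SDK-grade mapping.

```python
import math


def exp_bucket_index(value: float, scale: int) -> int:
    """Index of the bucket (base**index, base**(index + 1)] holding a positive value."""
    # base = 2 ** (2 ** -scale), hence log_base(value) = log2(value) * 2**scale.
    return math.ceil(math.log2(value) * 2**scale) - 1


def exp_bucket_bounds(index: int, scale: int) -> tuple[float, float]:
    """Lower (exclusive) and upper (inclusive) bound of a bucket."""
    base = 2 ** (2 ** -scale)
    return base**index, base ** (index + 1)


# At scale 0 the base is 2, so 4.0 lands in the bucket (2, 4].
print(exp_bucket_index(4.0, 0))  # 1

# At scale 3 the base is 2**(1/8): a 0.9 s request gets a much narrower bucket.
idx = exp_bucket_index(0.9, 3)
low, high = exp_bucket_bounds(idx, 3)
print(idx, round(low, 4), round(high, 4))
```

Each step up in scale halves the bucket width in log space, which is where the "high relative accuracy everywhere" property comes from.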

The decision table, end to end:

Factor	Explicit-bucket	Exponential / native
Bucket boundaries	You define them	Auto-scaled to the data
Accuracy around SLO	Excellent if you placed buckets there	High relative accuracy everywhere
Accuracy for unknown range	Poor if traffic shifts off your buckets	Robust
Series count on the wire	Many (`_bucket` per boundary)	Few (one native series)
Prometheus storage	Classic `_bucket` series	Native histogram (feature-flagged)
Grafana panel support	Universal	Improving; check your version
Recording rules	Mature	Newer functions (`histogram_*`)
Exemplars	Attach per `_bucket`	Attach to the native histogram
PromQL	`histogram_quantile(0.99, sum by (le)(rate(..._bucket[5m])))`	`histogram_quantile(0.99, rate(...[5m]))` (no `le`)
Pick it when	You know your SLO threshold and want rock-solid quantiles there, on battle-tested dashboards	The distribution is wide/unknown and your stack supports native histograms end to end

Rule of thumb: explicit buckets when you know your SLO threshold and want rock-solid quantiles around it on mature dashboards; exponential when the distribution is wide or unknown and your whole pipeline — Prometheus, Grafana, recording rules — supports native histograms. A pragmatic migration is to run explicit buckets in production today and pilot native histograms on one service until the panel and recording-rule support is proven across your version.

Align metric attributes with span attributes

Correlation is only clean if the dimensions match. If your metric labels say route but your spans use http.route, the human reading the dashboard cannot pivot mentally, and worse, the exemplar overlay shows a point whose attributes do not map to anything in the trace view. Standardize on OpenTelemetry semantic conventions on both sides:

Concept	Convention key	Use on	Cardinality note
HTTP route template	`http.route`	metric label + span attribute	Templated → bounded
HTTP method	`http.request.method`	metric label + span attribute	Small fixed set
HTTP status	`http.response.status_code`	metric label + span attribute	Small fixed set
RPC method	`rpc.method`	metric label + span attribute	Bounded by API surface
Service identity	`service.name`	resource (promoted to label)	One per service
Service namespace	`service.namespace`	resource (promoted to label)	Few
Environment	`deployment.environment`	resource (promoted to label)	Few (prod/staging/…)

Two hard rules govern this:

Rule	Why	What breaks if ignored
Use the templated route, never the raw path	`/orders/{id}` is one value; `/orders/8a1f…` is unbounded	Metric and exemplar-buffer cardinality explode; the buffer evicts instantly
Keep metric attributes minimal; push identity into the span	Exemplar `filtered_attributes` can carry rich context cheaply; metric labels cannot	Every extra metric label multiplies series count and cost

The exemplar’s filtered_attributes is the escape hatch here: it carries the attributes that were present at record time but are not metric labels — so a user.id or pod can ride on the exemplar (visible when you click it) without ever becoming a metric dimension. Lean on that to keep the metric attribute set tiny while still giving the click rich context. For the deeper cardinality discipline, see Taming Metric Cardinality: Relabeling, Limits, and Cost Governance in Prometheus.
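The partition between metric labels and exemplar-only context can be pictured with a few lines of Python. This is only a model of the idea: in a real SDK the split is driven by the attribute keys a view retains, and the function name and sample values here are hypothetical.

```python
def split_attributes(recorded: dict, metric_keys: set[str]) -> tuple[dict, dict]:
    """Split record-time attributes into metric labels and exemplar-only attributes."""
    labels = {k: v for k, v in recorded.items() if k in metric_keys}
    filtered = {k: v for k, v in recorded.items() if k not in metric_keys}
    return labels, filtered


recorded = {
    "http.route": "/orders/{id}",
    "http.response.status_code": 200,
    "user.id": "u-8a1f",            # hypothetical value
    "k8s.pod.name": "orders-7c9",   # hypothetical value
}
labels, filtered = split_attributes(
    recorded, {"http.route", "http.response.status_code"}
)
print(labels)    # becomes series labels: bounded cardinality
print(filtered)  # rides on the exemplar only: no new series
```

Only the first dictionary contributes to series count; the second is paid for once per exemplar, which is why high-cardinality identity belongs there.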

Configure the Grafana overlay and trace link

Two pieces of Grafana config make the click work: the data source must query exemplars and know where the trace lives, and the panel must request exemplars.

First, enable the exemplar overlay on the Prometheus data source and map the exemplar’s trace_id to a tracing data source. In a provisioned data source, add an exemplarTraceIdDestinations entry:

# grafana provisioning: datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://prometheus:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id            # the exemplar label carrying the trace id
          datasourceUid: tempo      # uid of your Tempo/Jaeger data source
          # datasourceUid makes this an internal link; urlDisplayLabel is the optional link text
          urlDisplayLabel: "View trace"

Second, make sure the panel query actually requests exemplars. In the panel’s Prometheus query options, toggle Exemplars on (or set "exemplar": true on the target in the dashboard JSON). A typical p99 query on an explicit-bucket histogram:

histogram_quantile(
  0.99,
  sum by (le, http_route) (
    rate(http_server_request_duration_seconds_bucket[5m])
  )
)

With exemplars enabled and the destination mapped, Grafana queries /api/v1/query_exemplars alongside the range query and overlays diamonds at the recorded measurements. Click one, and the tooltip shows the value plus a trace_id link that opens the trace in Tempo or Jaeger — the full click-through. It lands on the right span because of everything upstream: the SDK stamped the active trace_id, the attributes line up, and the trace_id label name in the destination matches the exemplar label.
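What Grafana does with the exemplar response can be approximated in a few lines: flatten the series into points and keep only those that carry the mapped label. The payload below is illustrative (IDs and values are made up) but follows the response shape the Prometheus HTTP API documents for query_exemplars; the function name is an assumption.

```python
def overlay_points(response: dict, trace_label: str = "trace_id") -> list[tuple[float, float, str]]:
    """Flatten a query_exemplars response into (timestamp, value, trace_id) points."""
    points = []
    for series in response.get("data", []):
        for ex in series.get("exemplars", []):
            trace_id = ex.get("labels", {}).get(trace_label)
            if trace_id:  # without the mapped label there is nothing to link to
                points.append((float(ex["timestamp"]), float(ex["value"]), trace_id))
    return sorted(points)


# Illustrative payload; the second exemplar lacks trace_id and is skipped.
sample = {
    "status": "success",
    "data": [
        {
            "seriesLabels": {
                "__name__": "http_server_request_duration_seconds_bucket",
                "le": "1",
            },
            "exemplars": [
                {
                    "labels": {"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"},
                    "value": "0.9",
                    "timestamp": 1700000000.123,
                },
                {"labels": {"span_id": "00f067aa0ba902b7"}, "value": "0.1", "timestamp": 1700000001.0},
            ],
        }
    ],
}
for ts, value, trace_id in overlay_points(sample):
    print(ts, value, trace_id)
```

Passing a different trace_label (say, traceID) shows at a glance what happens when the destination name and the exemplar label disagree: the list comes back empty.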

The Grafana-side knobs that govern whether the diamond appears and whether it clicks through:

Setting	Where	Effect	Common mistake
Exemplars toggle	Panel → query options	Grafana queries `/api/v1/query_exemplars`	Left off → no diamonds even with data
`exemplarTraceIdDestinations.name`	Data source `jsonData`	Which exemplar label is the trace id	Must equal the exemplar label (`trace_id`)
`datasourceUid`	Data source `jsonData`	Which trace store the click opens	Wrong/empty uid → diamond not clickable
`urlDisplayLabel`	Data source `jsonData`	Link text in the tooltip	Cosmetic; nice to set
Tempo data source uid	Tempo data source	Target of the jump	Must exist and be provisioned

The RED metrics + exemplar pivot

The reason this whole apparatus pays off is a specific dashboard pattern. RED — Rate, Errors, Duration — is the request-service golden-signals trio, and the Duration panel is exactly where exemplars belong. The pattern: present R, E, and D for a service; let the operator read the aggregate health off R/E/D; and let them pivot from a Duration spike straight to a trace via the exemplar overlay.

RED signal	PromQL (explicit-bucket)	Carries exemplars?	Role in the pivot
Rate	`sum(rate(http_server_request_duration_seconds_count[5m]))`	No (rarely pivoted)	Context: is traffic up?
Errors	`sum(rate(http_server_request_duration_seconds_count{http_response_status_code=~"5.."}[5m])) / sum(rate(http_server_request_duration_seconds_count[5m]))`	No	Context: is this a latency or an error event?
Duration (p99)	`histogram_quantile(0.99, sum by (le)(rate(http_server_request_duration_seconds_bucket[5m])))`	Yes — pivot here	The spike you click into a trace

The operator workflow this enables, hop by hop:

Step	What the operator does	What they learn
1	Alert fires; open the service’s RED row	Latency or errors? Traffic spike or not?
2	Read the Duration p99 panel; see a spike at 14:32	That it is slow, and when
3	Hover the diamond on the spike	The exemplar value + `trace_id`
4	Click the diamond	Tempo opens the exact slow trace
5	Read the trace’s span timeline	Which downstream / DB call ate the time
6	Fix the actual cause	Root cause in one click, not a 30-min hunt

This is the difference between a dashboard that reports and one that gets used — the principle behind Engineering Grafana Dashboards That Get Used: RED, USE, Template Variables, and Provisioning-as-Code. Two rules make the pivot reliable: keep the Duration panel’s by labels aligned with the exemplar’s attributes, and never re-aggregate the bucket series through a recording rule on the overlay path — re-aggregation strips exemplars (see troubleshooting).

Sampling interplay — why your overlay is empty

This is the section that turns a perfectly-wired pipeline from empty to populated, and it is the most common silent failure. The crux: with trace_based, an exemplar exists only if the span active at record time was sampled. So the sampling decision must be made before the handler records its histogram.

Sampling style	Decision made…	Sampled bit at record time	Produces exemplars?
Head (parent-based ratio)	At span start, in the SDK	Set	Yes
Always-on (100%)	Trivially, at start	Set	Yes (but trace cost is high)
Tail-based (Collector)	After span completes, in the Collector	Unset at record time	No — even for kept traces
Always-off (0%)	n/a	Never set	No
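The table's logic reduces to two small predicates: a head decision made at span start, and an exemplar gate evaluated at record time. The sketch below models both. The ratio check follows the common approach of comparing the trace id's low 64 bits against a bound; treat it as illustrative rather than as any SDK's exact code, and note that the function names are assumptions.

```python
SAMPLED_FLAG = 0x01  # W3C trace-flags "sampled" bit


def ratio_head_sample(trace_id: int, ratio: float) -> bool:
    """Deterministic head decision: sample when the low 64 bits fall under the bound."""
    bound = round(ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound


def offers_exemplar(exemplar_filter: str, trace_flags: int) -> bool:
    """Would a measurement recorded under this span context be offered as an exemplar?"""
    if exemplar_filter == "always_on":
        return True
    if exemplar_filter == "always_off":
        return False
    # trace_based: only when the active span is already marked sampled.
    return bool(trace_flags & SAMPLED_FLAG)


# A span the head sampler picked: the bit is set when the handler records.
print(offers_exemplar("trace_based", 0x01))  # True
# A span whose bit is unset at record time: no exemplar, whatever happens later.
print(offers_exemplar("trace_based", 0x00))  # False
```

The point of the second call is timing: offers_exemplar only ever sees the flags as they are when record() runs, so a decision made afterwards cannot change its answer.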

The trap is tail sampling. Teams adopt tail sampling to keep only interesting traces (errors, slow ones) and cut storage — a great pattern, covered in Tail-Based Sampling at Scale with the OpenTelemetry Collector and Load-Balancing Exporter. But the tail decision happens in the Collector after the request finished and the metric was already recorded. At record time the span’s sampled bit was not yet set (or was set optimistically and then overridden), so trace_based saw “not sampled” and produced no exemplar — and your overlay is empty even though the traces are in Tempo.

The fix is to set a small head sample so the sampled bit is set at record time, while letting tail sampling refine what is ultimately stored:

# SDK: a small probabilistic head sample so the sampled bit is set BEFORE the handler records.
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.05

# Collector: tail sampling still decides final RETENTION downstream.
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 500 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

Now ~5% of requests carry a sampled span at record time, which is more than enough density to land on a spike (a spike that lasts seconds spans many requests), while tail sampling continues to decide what gets kept in Tempo. You get exemplars without raising trace storage. The interplay, summarized:

Goal	Lever	Effect on exemplars	Effect on trace storage
Some exemplars on every panel	Small head sample (e.g. 5%)	Provides the sampled bit at record time	Adds a floor of kept traces unless tail trims
Keep only error/slow traces	Tail sampling in Collector	None directly (decision too late for the bit)	Cuts storage to interesting traces
Both at once (recommended)	Head 5% and tail sampling	Exemplars present	Storage governed by tail policies
Max exemplar density	Higher head sample	More diamonds	More traces kept (cost ↑)

One more nuance: even at 1% head sampling you can still get usable exemplars, because a latency spike is made of many requests over several seconds — even 1% lands a diamond on a sustained spike. The density you need is “at least one exemplar per spike I care about,” not per request. Tune the head ratio to the shortest spike you must be able to click.
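A quick back-of-the-envelope check helps when choosing the head ratio. The sketch assumes each request is sampled independently and that every sampled request's exemplar survives to the panel; real SDKs keep a limited number of exemplars per bucket per collection cycle, so read the results as an upper bound. The function names and the traffic figures are illustrative.

```python
def expected_exemplars(rps: float, spike_seconds: float, head_ratio: float) -> float:
    """Expected number of sampled requests (exemplar candidates) during a spike."""
    return rps * spike_seconds * head_ratio


def p_at_least_one(rps: float, spike_seconds: float, head_ratio: float) -> float:
    """Probability that at least one request in the spike is head-sampled."""
    requests = rps * spike_seconds
    return 1.0 - (1.0 - head_ratio) ** requests


# 100 requests/s, a 5-second spike, 1% head sampling: about five candidates,
# and better than a 99% chance of at least one diamond.
print(expected_exemplars(100, 5, 0.01))
print(round(p_at_least_one(100, 5, 0.01), 4))

# The same ratio on a quiet service (2 requests/s) is far less reliable.
print(round(p_at_least_one(2, 5, 0.01), 4))
```

Run it with the traffic of your quietest important service: that service, not the busiest one, sets the minimum head ratio.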

Architecture at a glance

Walk the path the way a request and its telemetry actually flow. A client request arrives at your service; the OpenTelemetry SDK (or auto-instrumentation agent) starts a server span, and the head sampler — configured as parentbased_traceidratio at, say, 5% — sets the span’s sampled bit right there at span start. Inside the handler, your code calls histogram.record(elapsed, {http.route, http.response.status_code}) on the http.server.request.duration instrument. Because a sampled span is active and the exemplar filter is trace_based, the SDK attaches an exemplar to the bucket the value lands in, reading the active trace_id and span_id straight off the context — no manual stamping. The metric reader, set to cumulative temporality, batches and exports over OTLP.

From the SDK, two routes reach Prometheus. In Path A the SDK (or a Collector) pushes OTLP/HTTP to Prometheus’ native OTLP receiver at /api/v1/otlp/v1/metrics, and exemplars ride along untouched. In Path B the telemetry first hits an OpenTelemetry Collector — which batches, optionally tail-samples to decide what traces to keep in Tempo, and exports metrics via prometheusremotewrite with export_exemplars: true so the exemplars survive the hop to Prometheus’ remote-write receiver. Either way, the OTLP name http.server.request.duration is translated to the Prometheus series http_server_request_duration_seconds_bucket, and the exemplar attaches to that _bucket series. Prometheus, started with --enable-feature=exemplar-storage and storage.exemplars.max_exemplars sized, holds the exemplars in a fixed-size in-memory circular buffer and answers /api/v1/query_exemplars.

On the read side, Grafana renders a RED row. The Duration panel runs histogram_quantile(0.99, …_bucket …) with the Exemplars toggle on, so Grafana also queries /api/v1/query_exemplars and overlays diamonds at the recorded measurements, placed by the exemplar’s observation timestamp. The data source’s exemplarTraceIdDestinations maps the exemplar’s trace_id label to the Tempo (or Jaeger) data source. The operator clicks a diamond on the spike; Grafana opens the trace whose trace_id the exemplar carried; the span timeline reveals the slow downstream. The same trace_id that the sampler set at span start, the exemplar captured at record time, Prometheus stored in its buffer, and Grafana surfaced on the panel — one identifier threading the entire loop, turning a red line into a one-click root cause.

Real-world scenario

A payments platform team — call them Lumio Pay — ran roughly 400 services through a fleet of OpenTelemetry Collectors into Prometheus and Tempo. Their on-call dashboards showed clean p99 spikes, but the exemplar overlay was empty on every panel. Engineers were spike-hunting by hand: when latency climbed at 02:40, they opened Tempo, searched duration > 800ms in a two-minute window, scrolled hundreds of traces, and tried to confirm one was representative of the spike. MTTR on latency regressions was an embarrassing thirty-to-forty minutes, most of it manual correlation.

The constraint that shaped everything: they could not raise trace sampling. At their request volume — tens of thousands of requests per second across the fleet — head sampling above ~5% would have flooded Tempo and blown the storage budget by an order of magnitude. So most requests had no trace, which meant most measurements had no exemplar candidate — fine, expected, and not the bug. The real faults were elsewhere, and there were two.

Fault one — the Collector dropped every exemplar. The prometheusremotewrite exporter was on its default config, so export_exemplars was false. Every exemplar that did exist — the ~5% from sampled requests — was being thrown away at the exporter before it ever reached Prometheus. They confirmed it by querying /api/v1/query_exemplars directly and getting an empty array even under load, then diffing the exporter config against the docs. One line fixed it:

exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
    export_exemplars: true   # was defaulting to false — dropping everything

Fault two — tail sampling made the bit too late. They ran tail-based sampling in the Collector to keep only error and slow traces. The sampling decision was made after the metrics pipeline had already recorded its histograms, so even traces that were ultimately kept had been seen as “not sampled” at record time — and produced no exemplar. The fix was to keep a small probabilistic head sampler at the SDK so the sampled bit is set before the handler records, while tail sampling continued to refine what got stored in Tempo:

export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.05

A third, smaller issue surfaced while validating: a recording rule was pre-aggregating http_server_request_duration_seconds_bucket into a :rollup series the Duration panel queried, and the rollup stripped exemplars. They pointed the overlay panel back at the raw _bucket series (keeping the rollup for the long-range view) and the diamonds appeared.

With those changes, ~5% of measurements carried a valid trace_id — ample density to land on any spike lasting more than a second or two. The diamonds appeared, the click-through to Tempo worked, and latency-regression MTTR dropped from thirty-to-forty minutes of manual correlation to a single click — without one extra byte of trace storage and without raising sampling. They sized max_exemplars: 250000 to hold roughly a five-minute window, and alerted on the exemplar API returning empty under load as a regression guard so the overlay could never silently break again.
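A regression guard of the kind described can be as small as the sketch below: count the exemplars in a query_exemplars response and flag the case where traffic is present but the count is zero. How Lumio Pay implemented theirs is not stated, so everything here (function names, the min_rps threshold, the sample payloads) is a hypothetical illustration.

```python
def count_exemplars(response: dict) -> int:
    """Total exemplars across all series in a query_exemplars response."""
    return sum(len(s.get("exemplars", [])) for s in response.get("data", []))


def overlay_regressed(requests_per_second: float, exemplar_count: int,
                      min_rps: float = 1.0) -> bool:
    """True when there is real traffic but the exemplar API came back empty."""
    under_load = requests_per_second >= min_rps
    return under_load and exemplar_count == 0


healthy = {"data": [{"exemplars": [{"labels": {"trace_id": "abc"}}]}]}
broken = {"data": []}

print(overlay_regressed(250.0, count_exemplars(healthy)))  # False
print(overlay_regressed(250.0, count_exemplars(broken)))   # True
print(overlay_regressed(0.2, count_exemplars(broken)))     # False: no load, no alarm
```

Gating on load matters: an idle service legitimately has no exemplars, and alerting on that would train people to ignore the guard.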

Advantages and disadvantages

Exemplar-based correlation is powerful but it is not free, and it has sharp edges. Weigh it honestly:

Advantages	Disadvantages
One-click pivot from an aggregate spike to the exact trace — kills manual timestamp correlation	Multi-hop pipeline (SDK → Collector → Prometheus → Grafana → Tempo); any hop can silently drop exemplars
Exemplar density is governed by sampling, so the cost is bounded and predictable	Requires a sampled span at record time — incompatible with tail-only sampling without a head floor
`filtered_attributes` carry rich context (user, pod) without inflating metric cardinality	Several config flags default to off/drop (`export_exemplars`, exemplar storage, the panel toggle) — easy to miss
Native histograms add high-accuracy quantiles and exemplars in fewer series	Native histograms are feature-flagged and not yet universal across panels/recording rules
Works across all OTel languages because exemplars are spec, not implementation	Exemplar storage is in-memory and lost on restart; retention is buffer-size-bounded, not durable
Turns “two tools, two tabs” into one panel — dashboards actually get used	Re-aggregation (recording rules) on the overlay path strips exemplars — a subtle footgun
Confirmable end to end with `/api/v1/query_exemplars` — each hop has a check	The dead-link case (exemplar points at a trace that was sampled out) needs head/tail alignment to avoid

The model is right when you run OTel metrics into a Prometheus-compatible store and a trace store, and you want operators to root-cause latency fast. It is most valuable on high-traffic, low-sampling services — exactly where manual correlation hurts most. It is least worth the wiring when you have no trace store, when you never pivot from latency to traces, or when your traffic is so low that you sample 100% and can just search Tempo directly. The disadvantages are all manageable, but only if you know the silent-drop points exist — which is the entire point of the troubleshooting section.

Hands-on lab

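Stand up the full loop on a laptop with Docker Compose: an instrumented service, an OTel Collector, Prometheus with exemplar storage, Tempo, and Grafana. Then click a diamond into a trace. Free, local, teardown at the end. One prerequisite the steps do not reproduce: the compose file mounts a tempo.yaml, so have a single-binary Tempo config in the project directory with an OTLP gRPC receiver listening on 4317 (the port the Collector exports traces to).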

Step 1 — Project layout and a compose file. Create a directory and a docker-compose.yml:

# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --web.enable-otlp-receiver
      - --web.enable-remote-write-receiver   # the Collector in Step 3 pushes remote-write
      - --enable-feature=exemplar-storage
    volumes: [ "./prometheus.yml:/etc/prometheus/prometheus.yml" ]
    ports: [ "9090:9090" ]
  tempo:
    image: grafana/tempo:latest
    command: [ "-config.file=/etc/tempo.yaml" ]
    volumes: [ "./tempo.yaml:/etc/tempo.yaml" ]
    ports: [ "3200:3200", "4319:4317" ]
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: [ "--config=/etc/otel.yaml" ]
    volumes: [ "./otel.yaml:/etc/otel.yaml" ]
    ports: [ "4317:4317", "4318:4318" ]
  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    volumes: [ "./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/ds.yaml" ]
    ports: [ "3000:3000" ]

Step 2 — Prometheus config with exemplar storage.

# prometheus.yml
otlp:
  promote_resource_attributes: [ service.name, deployment.environment ]
storage:
  exemplars:
    max_exemplars: 100000

Step 3 — Collector config that preserves exemplars.

# otel.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }
exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
    export_exemplars: true
  otlp/tempo:
    endpoint: tempo:4317
    tls: { insecure: true }
service:
  pipelines:
    metrics: { receivers: [otlp], processors: [], exporters: [prometheusremotewrite] }
    traces:  { receivers: [otlp], processors: [], exporters: [otlp/tempo] }

Note: this lab uses the Collector remote-write path so you exercise export_exemplars: true. Prometheus only accepts remote-write when it is started with --web.enable-remote-write-receiver; without that flag on the Prometheus command, /api/v1/write returns 404.

Step 4 — Grafana data source provisioning with the trace-id destination.

# grafana-datasources.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://prometheus:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200

Step 5 — Instrument a tiny service. A Python app that records a duration histogram inside a sampled span and exports OTLP to the Collector:

# app.py — pip install opentelemetry-distro opentelemetry-exporter-otlp flask, then opentelemetry-bootstrap -a install
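import os, random, time
# Caveat: under opentelemetry-instrument the SDK is configured before this module is
# imported, so these setdefault calls only document the intended settings (they are
# equivalent to the SDK defaults). To change one, export it in the shell before launch.
os.environ.setdefault("OTEL_METRICS_EXEMPLAR_FILTER", "trace_based")
os.environ.setdefault("OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE", "cumulative")
os.environ.setdefault("OTEL_TRACES_SAMPLER", "parentbased_traceidratio")
os.environ.setdefault("OTEL_TRACES_SAMPLER_ARG", "1.0")  # 100% for the lab so every request has a trace
os.environ.setdefault("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")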

from flask import Flask
from opentelemetry import metrics, trace
app = Flask(__name__)
latency = metrics.get_meter("lab").create_histogram("http.server.request.duration", unit="s")
tracer = trace.get_tracer("lab")

@app.route("/orders/<oid>")
def orders(oid):
    with tracer.start_as_current_span("orders"):
        d = random.choice([0.05, 0.1, 0.9])  # occasional slow request → a spike to click
        time.sleep(d)
        latency.record(d, {"http.route": "/orders/{id}", "http.response.status_code": 200})
        return f"ok {oid}\n"

# Run with auto-instrumentation:
#   opentelemetry-instrument flask --app app run -p 8000

Step 6 — Bring it up and generate load.

docker compose up -d
opentelemetry-instrument flask --app app run -p 8000 &   # in the project venv
# Generate traffic so sampled spans record exemplars
for i in $(seq 1 2000); do curl -s "http://localhost:8000/orders/$i" >/dev/null; done

Step 7 — Confirm exemplars landed in Prometheus.

curl -s 'http://localhost:9090/api/v1/query_exemplars' \
  --data-urlencode 'query=http_server_request_duration_seconds_bucket' \
  --data-urlencode "start=$(date -u -v-5M +%s 2>/dev/null || date -u -d '-5 min' +%s)" \
  --data-urlencode "end=$(date -u +%s)" \
  | python3 -c 'import sys,json; d=json.load(sys.stdin); print(json.dumps(d["data"][0]["exemplars"][0]["labels"]))'

Expected output is an exemplar carrying a trace_id (your IDs will differ, and the key order may too):

{"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7"}

Step 8 — Click through in Grafana. Open http://localhost:3000, Explore → Prometheus, run the p99 query with the Exemplars toggle on:

histogram_quantile(0.99, sum by (le, http_route) (rate(http_server_request_duration_seconds_bucket[1m])))

Diamonds appear on the spikes. Hover one → see the trace_id link → click → Tempo opens that exact trace.

Validation checklist. What each step proves:

Step	What you did	What it proves
2	`max_exemplars` + `--enable-feature=exemplar-storage`	Prometheus accepts and retains exemplars
3	`export_exemplars: true` on the Collector	The exporter no longer drops exemplars (Path B)
5	`record()` inside a sampled span, `trace_based` filter	The SDK attaches a `trace_id` automatically
7	`/api/v1/query_exemplars` returns a `trace_id`	The exemplar survived the whole ingest path
8	Click a diamond → Tempo	The metric→trace pivot works end to end

Teardown.

docker compose down -v
kill %1 2>/dev/null   # the flask process

Cost note. Entirely local containers, so zero cloud spend. The only resource is laptop RAM (a few hundred MB for the four containers plus the Flask process).

Common mistakes & troubleshooting

This is the differentiator — the silent-drop playbook you bookmark. First as a scannable table, then the full reasoning for the entries that bite hardest. Every row is “the overlay is empty / wrong” localized to one hop.

#	Symptom	Root cause	Confirm (exact command / path)	Fix
1	Overlay empty everywhere; `/api/v1/query_exemplars` errors or returns nothing	Exemplar storage not enabled	`curl .../api/v1/query_exemplars?...` → error; check Prometheus flags	Add `--enable-feature=exemplar-storage` + `storage.exemplars.max_exemplars`
2	Exemplars exist in the SDK but never reach Prometheus (Path B)	`export_exemplars` defaulting to `false` on `prometheusremotewrite`	Diff Collector exporter config; `/api/v1/query_exemplars` empty under load	Set `export_exemplars: true`
3	Tail sampling on; traces in Tempo; no exemplars at all	Sampled bit unset at record time (tail decides too late)	Confirm only tail sampling, no head sampler; exemplars empty	Add a small head sample (`parentbased_traceidratio`, e.g. 0.05)
4	Query returns data but overlay has no diamonds	Panel Exemplars toggle off	Panel JSON `"exemplar": false`	Toggle Exemplars on / set `"exemplar": true`
5	Diamonds appear but are not clickable	No `exemplarTraceIdDestinations` / wrong label name	Data source `jsonData` missing the mapping	Add `name: trace_id`, `datasourceUid: <tempo>`
6	Click opens nothing / “trace not found”	Exemplar points at a trace that was sampled out	Open the `trace_id` in Tempo directly → 404	Align head sample so the sampled trace is actually kept
7	p99 query works but exemplars never attach	PromQL references a re-aggregated/rollup series	Panel query uses `:rollup`/recording-rule series	Point the overlay panel at the raw `_bucket` series
8	Wrong/no series; overlay has nothing to attach to	Metric name mismatch (missing `_seconds`/`_bucket`)	`curl .../api/v1/label/__name__/values | grep duration`	Use the translated name `..._seconds_bucket`
9	p99 and rates are garbage	Delta temporality sent to Prometheus	Check `OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE`	Force `cumulative`
10	Exemplars vanish minutes after a spike	Buffer too small for the ingest rate	`max_exemplars` low vs sampled-exemplar rate	Size buffer for your investigation window (see math)
11	Native histogram p99 works but `_bucket` PromQL returns nothing	Series stored as native histogram, not classic buckets	`curl .../label/__name__/values` shows no `_bucket`	Use native-histogram PromQL (`histogram_quantile(0.99, rate(...[5m]))`, no `le`)
12	Overlay empty right after Prometheus restart	In-memory exemplar buffer cleared on restart	Restart time vs first diamond reappearing	Expected; refills as sampled requests arrive
13	Exemplar buffer evicts almost instantly	Cardinality explosion (raw path as label)	High series count; `http_route` shows raw IDs	Use templated routes; cut metric label cardinality
14	SDK export to the Prometheus OTLP receiver fails; no metrics arrive (Path A)	Wrong endpoint protocol (gRPC vs HTTP) for OTLP receiver	Check the SDK exporter's protocol setting; the Prometheus OTLP receiver expects HTTP/protobuf	Set `OTEL_EXPORTER_OTLP_METRICS_PROTOCOL=http/protobuf` for Path A

The expanded reasoning for the ones that cost the most time:

2. Exemplars exist but never reach Prometheus (Path B). The prometheusremotewrite exporter ships with export_exemplars defaulting to false. Every exemplar the SDK produced is dropped at the exporter, so the SDK looks correct, Tempo has traces, and Prometheus has none. Confirm: generate load (so sampled requests exist), then query /api/v1/query_exemplars — an empty array under sustained load with traces present points straight at the exporter. Fix: export_exemplars: true. This is the single most common cause of a wired-but-empty overlay.

3. Tail sampling, traces present, zero exemplars. With trace_based, the exemplar is created at record time only if the active span is sampled then. Tail sampling decides after the request finishes, in the Collector, so at record time the bit was unset and no exemplar was created — even for traces tail sampling ultimately keeps. Confirm: check that you have only tail sampling and no SDK head sampler; exemplar queries return empty. Fix: add a small head sample (parentbased_traceidratio, e.g. 0.05) so the bit is set before the handler records; keep tail sampling for retention.

6. The diamond clicks into nothing. The exemplar carried a real trace_id, but the trace it points at was not stored (sampled out, or expired in Tempo). The pivot opens Tempo, Tempo 404s. Confirm: copy the trace_id from the exemplar and search Tempo directly — if it 404s, the trace was never kept. Fix: ensure the head-sampled traces that produce exemplars are the same traces you keep — i.e. don’t head-sample a trace, emit its exemplar, then tail-drop that exact trace. A baseline tail policy that keeps the head-sampled probabilistic slice closes this gap.

7. p99 works but exemplars won’t attach. Re-aggregation strips exemplars. If your Duration panel queries a recording-rule rollup of the bucket series (built for long-range performance), the rollup has no exemplars to carry. Confirm: look at the panel’s PromQL — does it reference a :rollup/recording-rule series instead of the raw _bucket? Fix: point the exemplar-overlay panel at the raw http_server_request_duration_seconds_bucket (keep the rollup for non-exemplar long-range views).

8. Metric-name mismatch. Exemplars live on the _bucket series under the translated name. If your PromQL says http_server_request_duration_bucket (no _seconds) it queries a series that may not exist, and even a close-but-wrong name has no exemplars. Confirm: curl -s http://prometheus:9090/api/v1/label/__name__/values | tr ',' '\n' | grep duration to see the real names. Fix: use the exact translated name including the unit suffix and _bucket.

10 / 13. Buffer eviction. Two flavours: a buffer that is too small for a high sampled-exemplar rate (size it with the retention math: window = max_exemplars ÷ exemplar ingest rate), or a cardinality explosion from raw paths as labels, which fills the buffer with junk series that evict the good ones. Confirm: compare max_exemplars to your sampled-exemplar rate; inspect http_route for raw IDs. Fix: size the buffer to your investigation window, and template the route so the label set stays bounded.
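Several of the confirm steps above come down to asking /api/v1/query_exemplars whether anything is there. A minimal Python sketch of that check follows, assuming the documented response shape (a data array of series, each with an exemplars list whose labels carry trace_id). The helper names, the base URL, and the sample payload are hypothetical, and the sketch parses a canned body rather than calling a live server.

```python
import json
from urllib.parse import urlencode


def exemplar_query_url(base_url, promql, start, end):
    """Build a GET URL for Prometheus' /api/v1/query_exemplars endpoint."""
    params = urlencode({"query": promql, "start": start, "end": end})
    return f"{base_url.rstrip('/')}/api/v1/query_exemplars?{params}"


def trace_ids(response_body):
    """Collect the trace_id label of every exemplar in a query_exemplars response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    found = []
    for series in payload.get("data") or []:
        for exemplar in series.get("exemplars", []):
            trace_id = exemplar.get("labels", {}).get("trace_id")
            if trace_id:
                found.append(trace_id)
    return found


# Hypothetical response body; the trace_id is a made-up placeholder.
SAMPLE_BODY = json.dumps({
    "status": "success",
    "data": [{
        "seriesLabels": {
            "__name__": "http_server_request_duration_seconds_bucket",
            "le": "1",
        },
        "exemplars": [{
            "labels": {"trace_id": "0123456789abcdef0123456789abcdef"},
            "value": "0.83",
            "timestamp": 1700000123.4,
        }],
    }],
})

url = exemplar_query_url(
    "http://prometheus:9090",
    "http_server_request_duration_seconds_bucket",
    1700000000,
    1700000300,
)
print(url)
print(trace_ids(SAMPLE_BODY))
```

In practice the body would come from an HTTP GET on the built URL. An empty list under sustained, sampled load is the signal that sends you to the exporter setting or the sampling configuration.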

Best practices

Force cumulative temporality for every Prometheus-bound pipeline. Set it at the SDK; the Collector forwards as-is. Delta silently breaks rate() and histogram_quantile().
Set OTEL_METRICS_EXEMPLAR_FILTER=trace_based explicitly. It is the right default, but pin it so that an SDK default change cannot silently alter behaviour and so that exemplars always point at traces that exist.
Always set export_exemplars: true on the Collector prometheusremotewrite exporter. It defaults to false and is the number-one cause of empty overlays.
Enable exemplar storage and size the buffer to your investigation window. --enable-feature=exemplar-storage plus a max_exemplars computed from your sampled-exemplar rate, not “as small as possible.”
Keep a small head sample even when you tail-sample. parentbased_traceidratio at a few percent sets the sampled bit at record time so exemplars exist; let tail sampling govern retention.
Choose the histogram type deliberately. Explicit buckets dense around your SLO on mature dashboards; exponential/native for wide or unknown distributions once your whole stack supports them.
Use OTel semantic conventions on both metrics and spans. http.route, http.response.status_code, service.name on both sides so the pivot’s dimensions line up.
Template routes; never put raw paths in labels. Raw IDs are unbounded cardinality that explode metrics and evict good exemplars from the buffer.
Keep metric attributes minimal; push identity into the span. Use the exemplar’s filtered_attributes for rich context (user, pod) so it rides the exemplar, not the metric label set.
Point the exemplar-overlay panel at the raw _bucket series. Re-aggregation through recording rules strips exemplars; keep rollups for non-overlay long-range views.
Map exemplarTraceIdDestinations to your trace data source. The diamond is useless without the name: trace_id → datasourceUid mapping.
Build RED rows with the pivot on Duration. Rate and Errors give context; Duration carries the exemplars you click into a trace.
Alert on the exemplar API returning empty under load. If /api/v1/query_exemplars comes back empty during real traffic, the overlay has silently broken; an alert on that condition is a cheap regression guard.
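The advice to template routes can be illustrated with a small normalizer. This is only a sketch: instrumentation libraries usually supply http.route from the framework's router, and the function name, the placeholder names, and the two ID patterns below are assumptions chosen for illustration.

```python
import re

# ID-like path segments that would otherwise create one label value per request.
_NUMERIC = re.compile(r"\d+")
_UUID = re.compile(r"[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}")


def template_route(path):
    """Collapse ID-like segments into placeholders so the label set stays bounded."""
    path = path.split("?", 1)[0]  # a query string never belongs in a label
    segments = []
    for segment in path.split("/"):
        if _NUMERIC.fullmatch(segment):
            segments.append("{id}")
        elif _UUID.fullmatch(segment):
            segments.append("{uuid}")
        else:
            segments.append(segment)
    return "/".join(segments)


raw_paths = ["/users/101/orders/7", "/users/102/orders/8", "/users/103/orders/9"]
templated = {template_route(p) for p in raw_paths}
print(len(set(raw_paths)), "distinct raw paths ->", len(templated), "templated route(s)")
print(sorted(templated))
```

Every raw path would have been its own series (and its own slot churn in the exemplar buffer); the templated form keeps one series per route regardless of how many IDs pass through it.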

Security notes

A trace_id is a capability — protect the trace store. The exemplar exposes a trace_id on a metrics panel; anyone who can read the metric can read the ID and pivot. Lock down Tempo/Jaeger with auth so the ID alone does not grant access to potentially sensitive trace payloads.
Keep PII out of exemplar filtered_attributes. It is tempting to attach user.email or a raw request body to the exemplar for context — but that data then lives in the metrics store’s exemplar buffer and on dashboards. Attach low-sensitivity identifiers (a hashed user id, a pod name), not PII.
Scrub sensitive attributes in the Collector. Use the attributes/redaction processors to drop or hash sensitive keys before metrics (and their exemplars) leave the Collector, so secrets never reach Prometheus or Grafana.
Secure the Prometheus remote-write and OTLP receivers. --web.enable-remote-write-receiver and --web.enable-otlp-receiver open write paths; put them behind network policy, mTLS, or an authenticating proxy so arbitrary clients cannot inject metrics or exemplars.
Tenant-isolate exemplars in multi-tenant stores. In Mimir/Grafana Cloud, exemplars are per-tenant; ensure a tenant cannot query another tenant’s exemplars (and thus another tenant’s trace IDs).
Least-privilege the Grafana data source. The Prometheus and Tempo data sources should use read-scoped credentials; the metric→trace jump needs read on traces, not admin.

Risk	Where it lives	Control
`trace_id` leaks a path to sensitive traces	Exemplar label on a panel	AuthN/Z on Tempo/Jaeger
PII in exemplar attributes	`filtered_attributes` in the buffer	Don’t attach PII; Collector redaction processor
Unauthenticated metric injection	Remote-write / OTLP receiver	mTLS / proxy / network policy
Cross-tenant exemplar read	Multi-tenant metrics store	Per-tenant isolation (Mimir)
Over-scoped dashboard access	Grafana data sources	Read-scoped credentials
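For the "hashed user id" control, a minimal sketch is shown here. It uses a keyed hash (HMAC-SHA-256) rather than a bare hash, because a plain hash of a low-entropy identifier can be reversed by enumeration; that choice, the function name, the truncation length, and the key handling are all assumptions of this sketch, not something the pipeline prescribes.

```python
import hashlib
import hmac


def pseudonymous_id(user_id, secret_key, length=16):
    """Return a truncated keyed hash of user_id, safe to attach as exemplar context."""
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


# Hypothetical key; in a real deployment it would come from a secret store.
KEY = b"rotate-me-and-keep-me-out-of-git"

tag = pseudonymous_id("alice@example.com", KEY)
print(tag)
```

The same input and key always yield the same tag, so an operator can still group exemplars by user during an investigation, while the dashboard and the exemplar buffer never hold the email itself.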

Cost & sizing

Exemplars are cheap if you let sampling bound them; they get expensive when cardinality or buffer sizing goes wrong. The drivers:

Exemplar volume is bounded by sampling. With trace_based, only sampled requests produce exemplars, so a 5% head sample means roughly 5% of recorded measurements are eligible to carry an exemplar, and the SDK’s exemplar reservoir caps how many of those are actually exported per collection cycle. This is what keeps the buffer affordable: low sampling, low exemplar volume, small buffer.
The exemplar buffer is RAM. max_exemplars is an in-memory ring; sizing it for a five-minute investigation window at your sampled-exemplar rate costs tens to a few hundred MB of Prometheus memory. Oversizing it wastes RAM; undersizing evicts spikes before you click them.
Metric cardinality dominates the metrics bill, not exemplars. The expensive mistake is raw-path labels exploding series count — which also fills and churns the exemplar buffer. Bounded labels keep both the metrics store and the exemplar buffer cheap. See Taming Metric Cardinality: Relabeling, Limits, and Cost Governance in Prometheus.
Native histograms can cut cost. Replacing dozens of _bucket series with one native-histogram series reduces series count and ingestion while keeping exemplars and improving quantile accuracy — a rare win on both axes once your stack supports them.
Trace storage is the big bill, and exemplars don’t raise it. The whole point of the head-sample-plus-tail-sample pattern is that you get exemplars without keeping more traces. Trace storage is governed by your tail policies, not by exemplar density.

Cost driver	What you pay for	Rough scale	How to control
Exemplar buffer	Prometheus RAM for the ring	Tens–hundreds of MB	Size `max_exemplars` to your window
Metric series	TSDB storage + ingest	Linear in cardinality	Bounded labels; templated routes
Native histograms	Fewer series than `_bucket`	Saves vs explicit buckets	Adopt where supported end to end
Trace storage (Tempo)	Object storage for kept traces	Dominated by sampling	Tail policies; head sample stays small
Collector compute	CPU for batching/sampling	Scales with throughput	Right-size the fleet
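To put a number on the "tens to hundreds of MB" row, the sketch below multiplies the ring size by a per-exemplar footprint. The 100-byte default is an assumption: it is the rough figure commonly quoted for an exemplar that carries only a trace_id label, and exemplars with extra labels cost more. The function name is hypothetical.

```python
def exemplar_buffer_mb(max_exemplars, bytes_per_exemplar=100):
    """Approximate memory held by the exemplar ring, in megabytes (10^6 bytes)."""
    if max_exemplars < 0 or bytes_per_exemplar <= 0:
        raise ValueError("max_exemplars must be >= 0 and bytes_per_exemplar > 0")
    return max_exemplars * bytes_per_exemplar / 1_000_000


for slots in (100_000, 500_000, 1_500_000):
    print(f"max_exemplars={slots:>9,} -> ~{exemplar_buffer_mb(slots):.0f} MB")
```

Rerunning it with a larger bytes_per_exemplar (for example 300, if exemplars carry several filtered attributes) shows how quickly rich exemplar context moves the estimate toward the top of the range.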

Free-tier reality: the lab runs locally for nothing. In production, exemplars add a modest, bounded RAM cost on Prometheus and effectively nothing to trace storage — the MTTR savings (a 30-minute correlation becoming a one-click pivot) dwarf the spend.

Interview & exam questions

1. What is an exemplar, precisely? A single example measurement pinned to a histogram bucket, carrying a value, a timestamp, a set of filtered attributes, and — crucially — the trace_id/span_id of the span active when it was recorded. It re-attaches request identity to an aggregate that otherwise averaged it away, enabling a click-through from a metric spike to the exact trace.

2. Why must Prometheus-bound OTel metrics use cumulative temporality? Prometheus is cumulative to its core: rate() assumes monotonically increasing counters with reset detection, and histogram_quantile() assumes cumulative bucket counts. Delta temporality (per-interval increments) makes reset detection misfire on every export and produces garbage rates and quantiles. Set OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=cumulative at the SDK.

3. How does an exemplar get a trace_id — do you stamp it? No. When histogram.record() runs and a sampled span is active in context, the SDK reads the trace_id/span_id off the active context and attaches them automatically. Your only job is to ensure a span is active at record time; auto-instrumentation handles that. The linkage is a property of where the record call sits, not of explicit ID-passing.

4. What does OTEL_METRICS_EXEMPLAR_FILTER=trace_based do, and what are the alternatives? It makes a measurement an exemplar candidate only if recorded inside a sampled span — so every exemplar points at a trace that actually exists. Alternatives: always_on (every measurement is a candidate, even without a trace — wasteful), and always_off (no exemplars). trace_based is the production default.

5. Your exemplar overlay is empty even though traces exist in Tempo and exemplar storage is on. Two likely causes? (a) The Collector prometheusremotewrite exporter has export_exemplars at its default false, dropping every exemplar; (b) you run tail-only sampling, so the sampled bit was unset at record time and no exemplar was created. Fix with export_exemplars: true and a small head sample so the bit is set before the handler records.

6. Why does tail sampling alone produce no exemplars, and how do you fix it without raising trace storage? Tail sampling decides keep/drop after the request finishes, in the Collector — but trace_based needs the span sampled at record time, which is before that decision. Add a small head sample (parentbased_traceidratio, e.g. 5%) so the bit is set early enough to produce exemplars; keep tail sampling to govern what is ultimately stored. Exemplar density needs only “one per spike,” not per-request coverage.

7. Explicit-bucket vs exponential histograms — when each? Explicit buckets when you know your SLO threshold and want rock-solid quantiles around it on mature dashboards (boundaries you define, classic _bucket series). Exponential/native when the distribution is wide or unknown — auto-scaling buckets, high relative accuracy everywhere, fewer series — once your Prometheus, Grafana, and recording rules all support native histograms (feature-flagged).

8. Where do exemplars attach in Prometheus, and what’s the name-translation gotcha? They attach to the histogram _bucket series. The OTLP name http.server.request.duration (unit s) translates to http_server_request_duration_seconds_bucket — lowercased, dots to underscores, unit appended, _bucket suffix. Reference the wrong name in PromQL/Grafana and the overlay has nothing to attach to.

9. How is Prometheus exemplar storage shaped, and what’s the retention implication? It is a fixed-size, in-memory, circular buffer sized by storage.exemplars.max_exemplars, enabled by --enable-feature=exemplar-storage. It is not durable storage: the buffer evicts oldest-first, and although Prometheus also appends exemplars to the WAL, that only covers the WAL’s lifetime. The effective retention window is therefore buffer size ÷ exemplar ingest rate. Size it for the window you investigate (usually a few minutes), not minimally.

10. What makes the Grafana diamond actually clickable into a trace? Two things: the panel must request exemplars (the Exemplars toggle / "exemplar": true), and the Prometheus data source must have an exemplarTraceIdDestinations entry mapping the exemplar’s trace_id label to the Tempo/Jaeger data source uid. Miss the destination and the diamond renders but does not click through.

11. What is the RED + exemplar pivot pattern? Present Rate, Errors, and Duration for a service. Rate and Errors give context (traffic up? latency or error event?); the Duration (p99) panel carries exemplars. The operator reads aggregate health, then clicks a Duration-spike diamond straight into the trace, turning a 30-minute manual correlation into a one-click root cause.

12. The diamond clicks but Tempo says “trace not found.” Why? The exemplar carried a real trace_id, but that trace was never stored (sampled out downstream or expired). It is a head/tail alignment bug: you emitted an exemplar for a head-sampled trace, then tail-dropped that exact trace. Keep a baseline tail policy that retains the head-sampled probabilistic slice so exemplar-linked traces survive.
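The name translation described in question 8 can be sketched in a few lines. This is a simplified illustration, not the translator's actual implementation: it handles only dots-to-underscores, a unit suffix, and the _bucket suffix, the unit table is a small assumed subset, and the example name is already lowercase so the sketch leaves case alone.

```python
import re

# Small, illustrative subset of unit-to-suffix mappings (an assumption of this sketch).
UNIT_SUFFIXES = {"s": "seconds", "ms": "milliseconds", "By": "bytes"}


def classic_bucket_series(otlp_name, unit=""):
    """Approximate the Prometheus _bucket series name for an OTLP histogram."""
    name = re.sub(r"[^A-Za-z0-9_]", "_", otlp_name)  # dots (and other symbols) to underscores
    suffix = UNIT_SUFFIXES.get(unit)
    if suffix and not name.endswith(f"_{suffix}"):
        name = f"{name}_{suffix}"                     # append the unit once
    return f"{name}_bucket"


print(classic_bucket_series("http.server.request.duration", "s"))
print(classic_bucket_series("http.server.request.duration"))
```

The second call, with the unit left out, yields the name without _seconds, which is exactly the close-but-wrong series that has no exemplars attached.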

The questions above map to vendor-neutral OpenTelemetry Certified Associate (OTCA) objectives (signals, SDK config, the Collector) and to the observability sections of SRE/DevOps interviews. A compact mapping:

Question theme	Maps to
Exemplar definition, `trace_id` linkage	OTCA — metrics signal & SDK
Temporality / cumulative vs delta	OTCA — metrics data model; Prometheus fundamentals
Exemplar filter, sampling interplay	OTCA — sampling; Collector behaviour
Histogram types, name translation	Prometheus/PromQL proficiency
Exemplar storage, buffer math	Prometheus operations
Grafana overlay, RED pivot	Grafana / observability practice

Quick check

1. You send OTel metrics to Prometheus and your p99 panel shows nonsensical values that jump around. What temporality setting is the first thing to check, and what value does Prometheus need?
2. Your exemplar overlay is empty, traces are in Tempo, and exemplar storage is enabled. You route metrics through a Collector. Name the single most likely one-line config cause.
3. True or false: with OTEL_METRICS_EXEMPLAR_FILTER=trace_based, tail-based sampling alone (no head sampler) will still produce exemplars for the traces it keeps.
4. Your PromQL p99 query returns data but no diamonds attach. Two distinct causes (one Grafana-side, one PromQL-side)?
5. You set max_exemplars: 50000 and at peak your fleet emits ~5,000 sampled exemplars/second. Roughly how long a window of exemplars does your buffer hold, and is that enough to click a spike from 4 minutes ago?

Answers

1. Check OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE. Prometheus needs cumulative; delta makes rate() and histogram_quantile() misfire on every export and produce garbage that looks plausible.
2. The Collector prometheusremotewrite exporter has export_exemplars at its default false, dropping every exemplar at the exporter. Set export_exemplars: true.
3. False. Tail sampling decides keep/drop after the request finishes, but trace_based needs the span sampled at record time, which is earlier. With no head sampler the bit is unset at record time, so no exemplar is produced even for kept traces. Add a small head sample.
4. (a) Grafana-side: the panel’s Exemplars toggle is off ("exemplar": false), so Grafana never queries /api/v1/query_exemplars. (b) PromQL-side: the query references a re-aggregated/rollup series (or the wrong translated name), which has no exemplars to attach. Point it at the raw ..._seconds_bucket.
5. Buffer size ÷ ingest rate = 50,000 ÷ 5,000 = 10 seconds. That is not enough to click a spike from 4 minutes ago: the exemplars from then have long since been evicted. A 4-minute window at this rate needs at least 240 × 5,000 = 1.2M slots, so size the buffer at around 1.5M (five minutes) to leave headroom.
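The arithmetic in the last answer generalizes to any buffer size and exemplar rate. A small sketch, with hypothetical function names, that computes both directions (how long a given buffer lasts, and how large a buffer a given window needs):

```python
import math


def retention_seconds(max_exemplars, exemplars_per_second):
    """How many seconds of exemplars a ring of max_exemplars slots holds."""
    return max_exemplars / exemplars_per_second


def max_exemplars_for(window_seconds, exemplars_per_second):
    """Smallest ring size that keeps window_seconds of exemplars at the given rate."""
    return math.ceil(window_seconds * exemplars_per_second)


rate = 5_000  # sampled exemplars per second at peak, as in the quick-check question

print(retention_seconds(50_000, rate), "seconds held by max_exemplars: 50000")
print(max_exemplars_for(4 * 60, rate), "slots needed for a 4-minute window")
print(max_exemplars_for(5 * 60, rate), "slots needed for a 5-minute window")
```

Feeding in the peak rate rather than the average is the safer choice, since spikes are exactly when the buffer churns fastest and when the exemplars matter most.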

Glossary

Exemplar — a single example measurement pinned to a histogram bucket, carrying a value, timestamp, filtered attributes, and the trace_id/span_id of the span active at record time; the thing you click through to a trace.
Exemplar filter — the SDK policy (trace_based, always_on, always_off) deciding which measurements are eligible to become exemplars; trace_based requires a sampled span.
Temporality — whether metric values are cumulative (running total since start) or delta (per-interval increment); Prometheus requires cumulative.
Explicit-bucket histogram — a histogram with boundaries you define, stored in Prometheus as classic _bucket series.
Exponential histogram — an auto-scaling base-2 histogram with high relative accuracy across a wide range, stored in Prometheus as a native histogram.
Native histogram — Prometheus’ storage form of an exponential histogram; feature-flagged (--enable-feature=native-histograms), with its own exemplar handling.
OTLP — the OpenTelemetry wire protocol (gRPC or HTTP/protobuf) that carries metrics, traces, and logs; exemplars live inside the histogram data point.
prometheusremotewrite exporter — the Collector exporter that pushes metrics to Prometheus remote-write; drops exemplars unless export_exemplars: true.
Exemplar storage — Prometheus’ fixed-size, in-memory, circular buffer of exemplars; enabled by --enable-feature=exemplar-storage and sized by storage.exemplars.max_exemplars.
/api/v1/query_exemplars — the Prometheus API that returns exemplars for a query over a time window; the confirm endpoint for the whole pipeline.
exemplarTraceIdDestinations — the Grafana data source mapping from an exemplar’s trace_id label to a tracing data source uid; what makes the diamond click through.
RED metrics — Rate, Errors, Duration — the request-service golden signals; the Duration panel is where exemplars pivot.
Head sampling — the trace sampling decision made at span start (in the SDK), which sets the sampled bit before the handler records — so exemplars work.
Tail sampling — the trace sampling decision made after the span completes (in the Collector); too late to set the bit for trace_based, so it needs a head-sample companion for exemplars.
Semantic conventions — OpenTelemetry’s standardized attribute keys (http.route, http.response.status_code, service.name) used on both metrics and spans so correlation dimensions match.
filtered_attributes — the exemplar’s attributes that are present at record time but not part of the metric’s label set; carry rich context (user, pod) without inflating metric cardinality.

Next steps

You can now wire metrics → exemplars → Prometheus → Grafana → trace, and localize any silent drop in the chain. Build outward:

Next (the trace half): Distributed Tracing End-to-End: Context Propagation, Tempo, and Correlating Traces with Metrics and Logs — how the trace_id got there, and how to store and query the trace the exemplar points at.
Related: Tail-Based Sampling at Scale with the OpenTelemetry Collector and Load-Balancing Exporter — the sampling pattern this article’s “head-plus-tail” advice depends on.
Related: Building Production OpenTelemetry Collector Pipelines: Receivers, Processors, and Tail Sampling — the Collector plumbing behind prometheusremotewrite and export_exemplars.
Related: PromQL in Anger: Rate, Histograms, and Aggregation Patterns That Actually Work — the histogram_quantile/rate patterns the Duration panel runs.
Related: Engineering Grafana Dashboards That Get Used: RED, USE, Template Variables, and Provisioning-as-Code — the RED dashboard the exemplar pivot lives on.
Related: Taming Metric Cardinality: Relabeling, Limits, and Cost Governance in Prometheus — keep labels bounded so the exemplar buffer and metrics store stay cheap.