A p99 latency panel that turns red tells you that something is slow. It does not tell you which request, on which pod, hitting which downstream. Exemplars close that gap: each one pins a single representative measurement to the exact trace_id that produced it, so a click on the spike drops you into the offending trace. This is the full wiring — the OpenTelemetry metrics data model, exemplar collection in the SDK, OTLP export into Prometheus with native exemplar storage, and the Grafana overlay and trace link that make the click work. Everything here is the metrics-and-correlation half; if you need the tracing/propagation half, that lives in a separate guide.
1. The OTel metrics data model you actually need
Three instrument families carry almost all production signal:
- Counter — monotonic, only goes up (requests served, bytes written). You query the rate, never the raw value.
- Histogram — records a distribution of values into buckets (request duration, payload size). This is where latency SLIs live, and the only instrument that meaningfully carries exemplars.
- Gauge — a point-in-time value that can go up or down (queue depth, memory in use).
The decision that trips people up is temporality: delta vs cumulative.
| | Cumulative | Delta |
|---|---|---|
| Counter value | running total since process start | increment since last export |
| Histogram buckets | total counts since start | counts in this interval |
| Prometheus wants | this one | not this one |
| Restart behavior | reset to zero, PromQL rate() handles it | each export self-contained |
Prometheus is a cumulative system: rate() and histogram_quantile() assume monotonically increasing counters and reset detection. If you export delta to a Prometheus-style backend, the samples are typically dropped at ingestion or, if they are stored, misread as a counter that keeps resetting, and every rate and quantile built on them is wrong. So when the target is Prometheus, force cumulative temporality on the SDK. The OTLP exporter respects an environment variable:
# Prometheus is cumulative. Do not send it delta.
export OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=cumulative
Counter-intuitive but important: the Prometheus remote-write exporter and the prometheusremotewrite Collector exporter both want cumulative. Set delta only when your backend is delta-native and explicitly asks for it.
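To make the temporality mismatch concrete, here is a minimal sketch of reset-aware counting in the spirit of rate() and increase(). It is a simplification (no extrapolation, no time windows), and the function name and sample values are hypothetical.

```python
def increase_with_reset_detection(samples):
    """Sum the growth of a series that is assumed to be a cumulative counter.

    A drop in value is treated as a process restart, so counting resumes
    from zero at that point. This mirrors the assumption behind rate(),
    not its exact implementation.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # Value went down: interpret it as a reset and count from zero.
            total += cur
    return total


# Hypothetical per-export increments, with a restart after the third export.
deltas = [10, 15, 15, 5, 15]
# The same data as a running total; the restart resets the counter.
cumulative = [10, 25, 40, 5, 20]

true_increase = sum(deltas[1:])  # growth after the first sample
from_cumulative = increase_with_reset_detection(cumulative)
from_delta = increase_with_reset_detection(deltas)
```

Fed the cumulative series, the function recovers the true increase. Fed the delta series as if it were cumulative, it undercounts, because most increments look like small changes or spurious resets.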
2. Turn on exemplar collection in the SDK
An exemplar is recorded only when two conditions hold: a sampled span is active at the moment the measurement is recorded, and the SDK’s exemplar filter lets the measurement through. The OTel default filter is trace_based: a measurement becomes an exemplar candidate only if it is recorded inside a span whose context is sampled. That is exactly what you want, because exemplars should point at traces that actually exist in your tracing backend.
Set it explicitly so it survives SDK default changes:
# Only attach exemplars when a sampled span is in context.
export OTEL_METRICS_EXEMPLAR_FILTER=trace_based
The other valid values are always_on (every measurement is a candidate — expensive, and most exemplars will have no trace) and always_off (disable entirely). Stick with trace_based.
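The three filter values boil down to one decision per measurement. The sketch below illustrates that decision under simplifying assumptions; SpanContext here is a hypothetical stand-in for the active span context, not the SDK's class, and the function is not the SDK's implementation.

```python
from dataclasses import dataclass
from typing import Optional

SAMPLED_FLAG = 0x01  # W3C trace-flags bit that marks a sampled span


@dataclass(frozen=True)
class SpanContext:
    """Hypothetical stand-in for the span context active at record time."""
    trace_id: int
    span_id: int
    trace_flags: int


def is_exemplar_candidate(filter_name: str, ctx: Optional[SpanContext]) -> bool:
    """Decide whether a measurement may become an exemplar."""
    if filter_name == "always_off":
        return False
    if filter_name == "always_on":
        return True
    if filter_name == "trace_based":
        # Only measurements recorded inside a sampled span qualify.
        return ctx is not None and bool(ctx.trace_flags & SAMPLED_FLAG)
    raise ValueError(f"unknown exemplar filter: {filter_name}")


sampled = SpanContext(0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7, 0x01)
unsampled = SpanContext(sampled.trace_id, sampled.span_id, 0x00)
```

With trace_based, the unsampled context and the no-span case (ctx is None) both yield no candidate, which is why a handler that records outside any span never produces exemplars.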
The SDK does the rest automatically: when you record into a histogram and a sampled span is current, it attaches an exemplar carrying the value, the timestamp, the filtered attributes, and the active trace_id and span_id. You do not manually stamp trace IDs onto exemplars — the metrics SDK reads them from the active context. Your only job is to make sure a span is active when you record. With auto-instrumentation (an HTTP/gRPC server span wrapping the handler) that is already true.
Here is an explicit Python example so the linkage is unambiguous — the record() call sits inside the span:
from opentelemetry import metrics, trace
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Export metrics over OTLP/gRPC to the Collector every 15 seconds.
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://otel-collector:4317", insecure=True),
    export_interval_millis=15000,
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("checkout")
latency = meter.create_histogram(
    name="http.server.request.duration",
    unit="s",
    description="HTTP server request duration",
)

# Assumes an SDK TracerProvider with a sampler is configured elsewhere.
# With the default no-op tracer, no span is sampled and no exemplar is emitted.
tracer = trace.get_tracer("checkout")


def handle(route, status, elapsed_seconds):
    with tracer.start_as_current_span("checkout"):
        # A sampled span is active here, so this record() emits an exemplar
        # carrying the active trace_id/span_id automatically.
        latency.record(elapsed_seconds, {"http.route": route, "http.response.status_code": status})
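The same OTEL_* environment variables are defined by the OpenTelemetry specification, so the Java agent, the Go SDK, and the Node SDK read them too. Exemplar support has matured at different rates across SDKs, though, so confirm that your SDK version implements it before relying on the variable alone.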
3. Export OTLP and land exemplars in Prometheus
There are two supported paths, and the choice matters for exemplars.
Path A — push OTLP straight into Prometheus. Prometheus 2.47+ ships a native OTLP receiver at /api/v1/otlp/v1/metrics. It is the simplest path and it preserves exemplars. Enable the receiver and exemplar storage in the Prometheus config:
# prometheus.yml
otlp:
  # Promote a few resource attributes to labels; keep this list short.
  promote_resource_attributes:
    - service.name
    - service.namespace
    - deployment.environment
storage:
  exemplars:
    # Required. Exemplar storage is a fixed-size in-memory circular buffer
    # and is OFF until you size it.
    max_exemplars: 100000
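Then start Prometheus with the flag that enables the OTLP receiver and the feature flag for exemplar storage. The --web.enable-otlp-receiver spelling is the Prometheus 3.x form; 2.x releases used --enable-feature=otlp-write-receiver instead: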
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --web.enable-otlp-receiver \
  --enable-feature=exemplar-storage
Point the SDK (or a Collector) at the receiver. Note OTLP/HTTP, not gRPC, for this endpoint:
export OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://prometheus:9090/api/v1/otlp/v1/metrics
export OTEL_EXPORTER_OTLP_METRICS_PROTOCOL=http/protobuf
Path B — Collector with the prometheusremotewrite exporter. Use this when you already run a Collector for fan-out, batching, or tail sampling. Exemplars must be explicitly enabled on the exporter; they are dropped by default:
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch:
    timeout: 10s
exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
    # Off by default. Without this, your exemplars silently vanish.
    export_exemplars: true
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
The Prometheus side still needs the remote-write receiver and exemplar storage:
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --web.enable-remote-write-receiver \
  --enable-feature=exemplar-storage
Either way, the critical detail is the _bucket series. Exemplars attach to histogram bucket time series, which Prometheus exposes as http_server_request_duration_seconds_bucket. The OTLP-to-Prometheus translation lowercases the name, replaces dots with underscores, and appends the unit (_seconds) plus _bucket. Get the metric name right in your PromQL or the exemplar overlay will have nothing to attach to.
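As a rough illustration of that translation, the sketch below covers only the dot replacement and the suffixes. The function name and the small unit table are assumptions for this example; the real translation handles more units and edge cases.

```python
import re

# Small, illustrative subset of unit suffixes (an assumption for this sketch).
UNIT_SUFFIXES = {"s": "seconds", "ms": "milliseconds", "By": "bytes"}


def to_prometheus_bucket_name(otel_name: str, unit: str) -> str:
    """Approximate the Prometheus bucket series name for an OTel histogram."""
    # Characters that are not valid in a Prometheus metric name become underscores.
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
    suffix = UNIT_SUFFIXES.get(unit)
    # Append the unit once, unless the name already ends with it.
    if suffix and not name.endswith("_" + suffix):
        name = f"{name}_{suffix}"
    return name + "_bucket"


bucket_series = to_prometheus_bucket_name("http.server.request.duration", "s")
```

Comparing the result against the metric name in your PromQL is a quick way to catch a typo before concluding that exemplars are missing.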
4. Pick the right histogram: explicit-bucket vs exponential
For latency SLIs this is a real decision, not a detail.
Explicit-bucket histograms use boundaries you define. You control resolution exactly where you need it (around your SLO threshold) and waste nothing elsewhere. The cost shows up when traffic shifts: if your p99 lands between two coarse buckets, histogram_quantile() interpolates linearly inside that bucket and the number gets soft. Define boundaries with a View registered on the MeterProvider (pass it as views=[view]):
from opentelemetry.sdk.metrics.view import View, ExplicitBucketHistogramAggregation

view = View(
    instrument_name="http.server.request.duration",
    aggregation=ExplicitBucketHistogramAggregation(
        # Dense around a ~300ms SLO, sparse in the tail.
        boundaries=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
    ),
)
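To see why a coarse bucket makes the quantile soft, here is a simplified sketch of linear interpolation over cumulative bucket counts, in the spirit of histogram_quantile(). The function name and the bucket counts are hypothetical, and the real function handles more edge cases.

```python
import math


def bucket_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with math.inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Nothing to interpolate across: fall back to the lower edge.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical counts: 1000 requests, 20 of them slower than 0.5s.
coarse = [(0.25, 900), (0.5, 980), (1.0, 1000), (math.inf, 1000)]
# Same traffic with one extra boundary at 0.6s.
finer = [(0.25, 900), (0.5, 980), (0.6, 995), (1.0, 1000), (math.inf, 1000)]

p99_coarse = bucket_quantile(0.99, coarse)
p99_finer = bucket_quantile(0.99, finer)
```

With the coarse layout the estimate lands near the middle of the 0.5s to 1.0s bucket regardless of where the slow requests really sit; the extra boundary pulls it down to just under 0.6s.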
Exponential (base-2) histograms auto-scale their buckets to the data with a configurable max_scale, giving high relative accuracy across the whole range without you guessing boundaries. They are the better default for unknown or wide-ranging latency distributions and they cost fewer time series on the wire. The catch on the storage side: classic Prometheus stores them as native histograms, which is still behind a feature flag and not yet wired into every Grafana panel and recording rule. If your stack is fully on native histograms, prefer exponential; if you are on classic _bucket series and battle-tested dashboards, explicit buckets remain the pragmatic choice.
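The auto-scaling comes from a simple mapping: at a given scale, the bucket base is 2^(2^-scale), and bucket i covers (base^i, base^(i+1)]. The sketch below shows that mapping for positive values; the helper names are hypothetical, and production SDKs use more careful arithmetic near bucket boundaries.

```python
import math
from typing import Tuple


def bucket_index(value: float, scale: int) -> int:
    """Index of the bucket (base**i, base**(i+1)] that holds a positive value."""
    return math.ceil(math.log2(value) * 2**scale) - 1


def bucket_bounds(index: int, scale: int) -> Tuple[float, float]:
    """Lower (exclusive) and upper (inclusive) bound of a bucket."""
    base = 2 ** (2 ** -scale)
    return base**index, base ** (index + 1)


# At scale 0 the base is 2, so 3.0 falls in the bucket (2, 4].
idx_scale0 = bucket_index(3.0, 0)

# At scale 3 the base is 2**(1/8), so each bucket is about 9% wide.
idx_scale3 = bucket_index(0.3, 3)
low, high = bucket_bounds(idx_scale3, 3)
```

Raising the scale by one halves the relative bucket width, which is how the histogram trades bucket count against accuracy without any hand-picked boundaries.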
Rule of thumb: explicit buckets when you know your SLO threshold and want rock-solid quantiles around it; exponential when the distribution is wide or unknown and your backend supports native histograms.
5. Align metric attributes with span attributes
Correlation is only clean if the dimensions match. If your metric labels say route but your spans use http.route, the human reading the dashboard cannot pivot, and worse, the exemplar overlay shows a point with attributes that do not map to anything in the trace view. Standardize on OpenTelemetry semantic conventions on both sides:
| Concept | Convention key | Use on |
|---|---|---|
| HTTP route template | http.route | metric label + span attribute |
| HTTP status | http.response.status_code | metric label + span attribute |
| RPC method | rpc.method | metric label + span attribute |
| Service identity | service.name | resource (promoted to label) |
| Environment | deployment.environment | resource (promoted to label) |
Two hard rules:
- Use the templated route, never the raw path. /orders/{id} is one label value; /orders/8a1f... is unbounded cardinality and will blow up both your metrics and the exemplar buffer.
- Keep exemplar attributes minimal. The SDK filters exemplar attributes to the instrument’s view attributes, but you should keep the metric attribute set itself small. Stuff identity into the span, not into extra metric labels.
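Both rules are easy to check mechanically in a test or a code review helper. The sketch below is one way to do it; the function names and the raw-path heuristic are assumptions for illustration, not part of any OpenTelemetry API.

```python
import re

# Heuristic (an assumption): a segment of only digits, or 8+ hex characters,
# is probably an identifier rather than part of a route template.
_ID_SEGMENT = re.compile(r"\d+|[0-9a-fA-F]{8,}")


def unmatched_metric_keys(metric_attrs: dict, span_attrs: dict) -> list:
    """Metric attribute keys with no identically named span attribute."""
    return sorted(set(metric_attrs) - set(span_attrs))


def looks_like_raw_path(route: str) -> bool:
    """Flag route values that appear to contain a concrete identifier."""
    return any(_ID_SEGMENT.fullmatch(seg) for seg in route.split("/") if seg)


span_attrs = {"http.route": "/orders/{id}", "http.response.status_code": 200}
good_metric = {"http.route": "/orders/{id}", "http.response.status_code": 200}
bad_metric = {"route": "/orders/123", "http.response.status_code": 200}

mismatches = unmatched_metric_keys(bad_metric, span_attrs)
```

Here the mismatched key is route, which has no counterpart on the span, and its value would also be flagged as a raw path.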
6. Configure the Grafana overlay and trace link
Two pieces of Grafana config make the click work.
First, enable the exemplar overlay on the Prometheus data source so Grafana queries the exemplar API and renders the little diamonds on time-series panels. In a provisioned data source, add an exemplarTraceIdDestinations entry that maps the exemplar’s trace_id label to a tracing data source:
# grafana provisioning: datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://prometheus:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id # the exemplar label carrying the trace id
          datasourceUid: tempo # uid of your Tempo/Jaeger data source
Second, make sure the panel query actually requests exemplars. In the panel’s Prometheus query options, toggle Exemplars on (or set "exemplar": true on the target in the dashboard JSON). A typical p99 query:
histogram_quantile(
  0.99,
  sum by (le, http_route) (
    rate(http_server_request_duration_seconds_bucket[5m])
  )
)
With exemplars enabled and the destination mapped, Grafana overlays diamonds at the recorded measurements. Click one, and the tooltip shows the value plus a trace_id link that opens the trace in Tempo or Jaeger — the full click-through. The reason it lands on the right span is everything from sections 2 and 5: the SDK stamped the active trace_id, and the attributes line up.
Verify
Confirm each hop instead of trusting the wiring end to end.
# 1. Exemplar storage is on and the API answers (empty result is fine; an error is not).
curl -s 'http://prometheus:9090/api/v1/query_exemplars' \
--data-urlencode 'query=http_server_request_duration_seconds_bucket' \
--data-urlencode "start=$(date -u -d '-5 min' +%s)" \
--data-urlencode "end=$(date -u +%s)" | jq '.status'
# 2. Generate load so a sampled span records a measurement, then re-query.
hey -z 30s -q 20 http://your-service/orders/123 >/dev/null
# 3. You should now see exemplars carrying a trace_id label.
curl -s 'http://prometheus:9090/api/v1/query_exemplars' \
--data-urlencode 'query=http_server_request_duration_seconds_bucket' \
--data-urlencode "start=$(date -u -d '-5 min' +%s)" \
--data-urlencode "end=$(date -u +%s)" \
| jq '.data[0].exemplars[0].labels'
A healthy result shows a trace_id (and usually span_id) label on the exemplar:
{
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7"
}
If the trace_id is missing, the measurement was recorded outside a sampled span — check that your handler is wrapped in a span and that your trace sampler is not at 0%. If the whole array is empty, exemplars were dropped in transit: confirm export_exemplars: true on the Collector exporter (Path B) and that --enable-feature=exemplar-storage is set. Finally, paste the trace_id into Grafana Explore against Tempo — if the trace resolves, the loop is closed.
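The same triage can be scripted against the parsed JSON from query_exemplars. The sketch below is a minimal version that works on an already-decoded response; the function name and the return labels are hypothetical, and the sample payload reuses the IDs shown above.

```python
def diagnose_exemplars(response: dict) -> str:
    """Classify a decoded /api/v1/query_exemplars response."""
    if response.get("status") != "success":
        return "api_error"
    # Flatten the exemplars of every returned series into one list.
    exemplars = [
        exemplar
        for series in response.get("data", [])
        for exemplar in series.get("exemplars", [])
    ]
    if not exemplars:
        return "no_exemplars"  # dropped in transit, or storage is off
    if not any("trace_id" in e.get("labels", {}) for e in exemplars):
        return "missing_trace_id"  # recorded outside a sampled span
    return "ok"


healthy = {
    "status": "success",
    "data": [
        {
            "seriesLabels": {"__name__": "http_server_request_duration_seconds_bucket"},
            "exemplars": [
                {
                    "labels": {
                        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
                        "span_id": "00f067aa0ba902b7",
                    },
                    "value": "0.42",
                    "timestamp": 1700000000.0,
                }
            ],
        }
    ],
}
result = diagnose_exemplars(healthy)
```

Wiring this into a smoke test after each deploy catches a silently emptied exemplar pipeline before on-call needs it.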
Enterprise scenario
A payments platform team ran ~400 services through a fleet of OpenTelemetry Collectors into Prometheus and Tempo. Their on-call dashboard showed clean p99 spikes, but the exemplar overlay was empty on every panel — engineers were spike-hunting by hand, correlating timestamps between Grafana and Tempo by eye, and the MTTR on latency regressions was embarrassing.
The constraint: they could not raise the trace sampling rate. At their request volume, head sampling above 5% would have flooded Tempo and blown their storage budget. So most requests had no trace, which meant most measurements had no exemplar candidate — fine, expected. The actual bug was elsewhere.
Investigation found two faults. First, the Collector’s prometheusremotewrite exporter was on its default config, so export_exemplars was false — every exemplar that did exist was being dropped at the exporter. Second, they ran tail-based sampling in the Collector: the sampling decision was made after the metrics pipeline had already recorded measurements, so even kept traces had been seen as “not yet sampled” at record time and produced no exemplar. The fix was to enable exemplar export and to keep a small probabilistic head sampler at the SDK so the sampled bit is set before the handler records its histogram, while tail sampling continued to refine what got stored in Tempo.
# Collector exporter: stop dropping exemplars.
exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
    export_exemplars: true
# SDK: a small head sample so the sampled bit is set at record time.
# Tail sampling still decides final retention downstream.
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.05
With those two changes, ~5% of measurements carried a valid trace_id, which is more than enough density to land on a spike. The dashboard diamonds appeared, the click-through to Tempo worked, and latency-regression MTTR dropped from tens of minutes of manual correlation to a single click — without spending one extra byte of trace storage.