A payments platform team is drowning in trace cost. They run a managed APM whose per-host, per-span pricing has crossed six figures a month, and 95% of the traces it ingests are never looked at — yet the 5% that matter, the ones behind a checkout latency spike at 02:00, are exactly the ones sampled away before an engineer can pull them. The mandate from the platform lead is precise: own the tracing backend, store the raw spans cheaply enough to keep everything for a useful window, and still get the dashboards and SLO alerts the old vendor gave for free. This guide builds that: Grafana Tempo in distributed mode, with S3 as the block-storage backend so a terabyte of traces costs the price of a terabyte of object storage, the metrics-generator turning span streams into RED (Rate, Errors, Duration) metrics so you keep your dashboards, and TraceQL as the query language that makes “show me every span over 2s on the checkout service that errored” a one-liner. By the end you will have a horizontally-scaled Tempo cluster on Kubernetes, traces landing in S3, RED metrics flowing to Prometheus, and exemplars linking a latency graph straight to the trace that caused the spike.
This is an advanced, hands-on guide. It assumes you operate Kubernetes and Prometheus already and want production wiring, not a single-binary demo.
Prerequisites
- A Kubernetes cluster (EKS, GKE, AKS, or self-managed) with at least 16 vCPU / 32 GiB of schedulable headroom, and kubectl and helm 3.14+ configured against it.
- An S3 bucket (or S3-compatible store: MinIO, GCS via the S3 API, Cloudflare R2) in the same region as the cluster, plus an IAM role you can attach to pods via IRSA (EKS), Workload Identity (GKE), or a Vault-issued credential.
- A running Prometheus (or Mimir / Grafana Cloud) that can be written to via remote_write, to receive the metrics-generator output. A self-hosted Prometheus only accepts remote_write when it is started with --web.enable-remote-write-receiver.
- A Grafana 10.4+ instance with permission to add data sources.
- An OpenTelemetry Collector (or apps already emitting OTLP) as the trace source. We will point it at Tempo’s distributor.
- CLI tooling: aws v2, helm, kubectl, and jq, plus eksctl if you follow the IRSA step on EKS.
Target topology
Tempo in distributed mode is a set of independently-scalable microservices, not one process. Spans enter through the distributor, which validates and shards them by trace ID to the ingesters; ingesters batch spans into blocks and flush them to S3. The metrics-generator taps the same span stream the distributor sees and, instead of storing traces, derives RED metrics (traces_spanmetrics_*) and service-graph metrics, which it remote_writes to Prometheus. On the read side, a query-frontend shards each TraceQL query, queriers fan out across the ingesters (recent data) and S3 (older blocks), and the compactor runs in the background merging small blocks in S3 into larger ones so search stays fast and the object count stays sane. Grafana sits in front: it queries Tempo for traces over TraceQL and Prometheus for the derived metrics, and stitches them together with exemplars.
Everything below deploys this as a single Helm release and then configures each piece. Where a real platform team plugs in their existing stack, I name the tool and say exactly what it does here.
1. Create the S3 bucket and an IAM identity for Tempo
Tempo needs a bucket and an identity that can read/write/list/delete objects in it (delete is required — the compactor removes superseded blocks). Create the bucket with versioning off (Tempo manages its own block lifecycle; versioning just doubles your storage) and a lifecycle rule as a safety net.
export AWS_REGION=ap-south-1
export TEMPO_BUCKET=kloudvin-tempo-traces-prod
aws s3api create-bucket \
--bucket "$TEMPO_BUCKET" \
--region "$AWS_REGION" \
--create-bucket-configuration LocationConstraint="$AWS_REGION"
# Block all public access — traces contain request internals.
aws s3api put-public-access-block \
--bucket "$TEMPO_BUCKET" \
--public-access-block-configuration \
BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
# Default SSE-KMS encryption at rest.
aws s3api put-bucket-encryption \
--bucket "$TEMPO_BUCKET" \
--server-side-encryption-configuration '{
"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"aws:kms"},"BucketKeyEnabled":true}]
}'
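The lifecycle safety net mentioned above is not shown in the commands, so here is a minimal sketch of one. The five-day margin, the rule ID, the file name tempo-lifecycle.json, and the APPLY switch are assumptions made for illustration; the only fixed idea is that S3 expiry should sit a little beyond the compactor's block_retention (720h in step 2), so S3 only ever removes blocks the compactor has already given up on.

```shell
# Sketch: derive a lifecycle expiry slightly longer than Tempo's block_retention.
RETENTION_HOURS=720          # matches block_retention: 720h in step 2
MARGIN_DAYS=5                # assumed safety margin, tune to taste
LIFECYCLE_DAYS=$(( (RETENTION_HOURS + 23) / 24 + MARGIN_DAYS ))   # round hours up to whole days

# Write the rule to a file so it can be reviewed before it is applied.
cat > tempo-lifecycle.json <<EOF
{
  "Rules": [
    {
      "ID": "tempo-safety-net-expiry",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Expiration": { "Days": ${LIFECYCLE_DAYS} },
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
    }
  ]
}
EOF
echo "lifecycle expiry: ${LIFECYCLE_DAYS} days"

# Only touch AWS when explicitly asked (APPLY is a convention of this sketch).
if [ "${APPLY:-0}" = "1" ]; then
  aws s3api put-bucket-lifecycle-configuration \
    --bucket "$TEMPO_BUCKET" \
    --lifecycle-configuration file://tempo-lifecycle.json
fi
```

Run it once as-is to inspect the generated JSON, then again with APPLY=1 to push the rule to the bucket.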
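The minimal IAM policy Tempo needs on that bucket is below. If the bucket encrypts with a customer-managed KMS key rather than the AWS-managed aws/s3 key, the role also needs kms:GenerateDataKey and kms:Decrypt on that key: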
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "TempoBucketRW",
"Effect": "Allow",
"Action": ["s3:ListBucket", "s3:GetBucketLocation"],
"Resource": "arn:aws:s3:::kloudvin-tempo-traces-prod"
},
{
"Sid": "TempoObjectRW",
"Effect": "Allow",
"Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
"Resource": "arn:aws:s3:::kloudvin-tempo-traces-prod/*"
}
]
}
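The eksctl command in this step attaches a managed policy named TempoS3Access, so the JSON above has to be registered under that name first. A minimal sketch, assuming the policy does not exist yet; the file name tempo-s3-policy.json and the APPLY switch are illustrative choices, not something the chart or AWS requires.

```shell
# Sketch: render the policy for whichever bucket TEMPO_BUCKET names,
# then register it as the managed policy that the IRSA step attaches.
TEMPO_BUCKET="${TEMPO_BUCKET:-kloudvin-tempo-traces-prod}"

cat > tempo-s3-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    { "Sid": "TempoBucketRW", "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::${TEMPO_BUCKET}" },
    { "Sid": "TempoObjectRW", "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::${TEMPO_BUCKET}/*" }
  ]
}
EOF

# Only call AWS when explicitly asked (APPLY is a convention of this sketch).
if [ "${APPLY:-0}" = "1" ]; then
  aws iam create-policy \
    --policy-name TempoS3Access \
    --policy-document file://tempo-s3-policy.json
fi
```

The create-policy output includes the policy ARN; that is the value to pass to --attach-policy-arn.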
Create an IRSA role so Tempo pods assume this without any static keys. Static AWS keys in a values file are exactly the leak we refuse to repeat:
eksctl create iamserviceaccount \
--cluster kloudvin-prod \
--namespace tracing \
--name tempo \
--attach-policy-arn arn:aws:iam::123456789012:policy/TempoS3Access \
--role-name kloudvin-tempo-irsa \
--approve
For non-AWS clusters or where you centralize secrets, HashiCorp Vault is the alternative: enable Vault’s AWS secrets engine and have the Vault Agent sidecar inject a short-lived, dynamically-leased S3 credential into the pod, so Tempo never holds a long-lived key. We use IRSA below because it removes the credential entirely.
2. Lay down the Tempo Helm values
Use the grafana/tempo-distributed chart (distributed mode), not tempo (single binary). The values file is the heart of this guide — it configures the S3 backend, the metrics-generator, and per-component scaling in one place.
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
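# Idempotent: eksctl in step 1 may already have created the namespace.
kubectl create namespace tracing --dry-run=client -o yaml | kubectl apply -f -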
tempo-values.yaml:
# --- Storage: traces live in S3, nowhere else durable ---
storage:
  trace:
    backend: s3
    s3:
      bucket: kloudvin-tempo-traces-prod
      endpoint: s3.ap-south-1.amazonaws.com
      region: ap-south-1
      # No access_key/secret_key here; IRSA supplies credentials.
# Attach the IRSA role to every Tempo component's pods.
serviceAccount:
  create: false
  name: tempo
# --- Receivers: the chart leaves OTLP off by default; step 4 needs gRPC on 4317 ---
traces:
  otlp:
    grpc:
      enabled: true
# --- Metrics-generator: derive RED + service-graph from spans ---
# (replicas live here too: repeating a top-level key later in the file would override this block)
metricsGenerator:
  enabled: true
  replicas: 2
  config:
    storage:
      remote_write:
        - url: http://prometheus.monitoring.svc:9090/api/v1/write
          send_exemplars: true
# Turn the processors on globally and route all tenants to the generator.
global_overrides:
  defaults:
    metrics_generator:
      processors:
        - service-graphs   # node/edge latency between services
        - span-metrics     # the RED metrics (rate, errors, duration)
        - local-blocks     # required for TraceQL metrics queries over spans
# --- Component replicas (scale these independently in step 7) ---
distributor:
  replicas: 3
ingester:
  replicas: 3
  persistence:
    enabled: true
    size: 20Gi
querier:
  replicas: 3
queryFrontend:
  replicas: 2
# Keep traces 30 days in S3, then the compactor expires the blocks.
compactor:
  replicas: 1
  config:
    compaction:
      block_retention: 720h
A few choices that teams get wrong: local-blocks must be enabled in the generator processors or TraceQL metrics queries (rate/quantile over spans) silently return nothing; send_exemplars: true is what lets a Prometheus latency graph link back to a real trace; and block_retention on the compactor — not a bucket lifecycle rule — is what governs how long traces are queryable. Set the S3 lifecycle rule to a few days longer than block_retention so the compactor, not S3, owns deletion.
3. Deploy and confirm the cluster is healthy
helm upgrade --install tempo grafana/tempo-distributed \
--namespace tracing \
--values tempo-values.yaml \
--version 1.18.0
kubectl -n tracing rollout status deploy/tempo-distributor
kubectl -n tracing get pods -l app.kubernetes.io/name=tempo
You should see distributor, ingester, querier, query-frontend, compactor, and metrics-generator pods Running. Confirm each component registered into its hash ring — an ingester that is UNHEALTHY in the ring will black-hole spans:
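kubectl -n tracing port-forward svc/tempo-distributor 3200:3200 &
# The ring pages render HTML by default; ask for JSON explicitly so jq can parse them.
curl -s -H 'Accept: application/json' http://localhost:3200/ingester/ring | jq '.shards[] | {id, state}'
curl -s -H 'Accept: application/json' http://localhost:3200/metrics-generator/ring | jq '.shards[] | {id, state}'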
Every shard should read ACTIVE.
4. Point the OpenTelemetry Collector at Tempo
Your apps should already export OTLP. Configure the OpenTelemetry Collector — the vendor-neutral agent that receives, batches, and routes telemetry — to forward traces to Tempo’s distributor over OTLP/gRPC. This is also where you set tail-based sampling: keep 100% of errors and slow traces, sample the boring fast-200s, which is how you beat the old vendor’s cost without losing the traces that matter.
# otel-collector-config.yaml (Collector running in the cluster)
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
processors:
  batch: {}
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow
        type: latency
        latency: { threshold_ms: 2000 }
      - name: sample-rest
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
exporters:
  otlp/tempo:
    endpoint: tempo-distributor.tracing.svc:4317
    tls: { insecure: true }   # in-cluster; mTLS via service mesh in prod
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp/tempo]
Apply it, then generate a trace and confirm it lands. A quick smoke test is telemetrygen, the load generator from the OpenTelemetry Collector contrib project, pointed straight at the distributor:
kubectl -n tracing run telemetrygen --rm -it --restart=Never \
--image=ghcr.io/open-telemetry/opentelemetry-collector-contrib/telemetrygen:latest -- \
traces --otlp-endpoint tempo-distributor.tracing.svc:4317 \
--otlp-insecure --duration 30s --workers 4
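To confirm the generated traces actually landed, Tempo's HTTP search API can be queried with TraceQL. The sketch below is one way to do that; the helper name search_tempo, the TEMPO_URL variable, and the local port 3300 are assumptions for illustration. It expects a port-forward to the query-frontend, since the search endpoint is served on the read path rather than by the distributor.

```shell
# Sketch: search Tempo over its HTTP API with a TraceQL query.
# TEMPO_URL is an assumption: a port-forwarded query-frontend, for example
#   kubectl -n tracing port-forward svc/tempo-query-frontend 3300:3200 &
#   export TEMPO_URL=http://localhost:3300
search_tempo() {
  local query="$1" limit="${2:-5}"
  # -G moves the url-encoded fields into the query string, so braces and spaces are safe.
  curl -sG "${TEMPO_URL}/api/search" \
    --data-urlencode "q=${query}" \
    --data-urlencode "limit=${limit}"
}

# Only hit the API when an endpoint has been provided.
if [ -n "${TEMPO_URL:-}" ]; then
  # '{}' matches every span; narrow it once you know traces are arriving.
  search_tempo '{}' 5 | jq '.traces[] | {traceID, rootServiceName, durationMs}'
fi
```

An empty traces array a minute or two after the smoke test usually points at the receiver or the sampling policy rather than at storage.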
5. Wire Tempo as a Grafana data source (with TraceQL)
Add Tempo as a data source. The non-obvious settings are the ones that connect traces to metrics: tracesToMetrics lets you jump from a span to the related Prometheus series, and serviceMap points Grafana at the Prometheus that holds the generator’s service-graph metrics. The link in the other direction, from an exemplar on a metric panel to its trace, is configured on the Prometheus data source (exemplarTraceIdDestinations pointing at the uid tempo), not here.
# grafana-datasource-tempo.yaml (provisioning)
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    access: proxy
    uid: tempo
    url: http://tempo-query-frontend.tracing.svc:3200
    jsonData:
      tracesToMetrics:
        datasourceUid: prometheus   # span → metric jumps
        spanStartTimeShift: '-2m'
        spanEndTimeShift: '2m'
      serviceMap:
        datasourceUid: prometheus   # renders the service graph
      nodeGraph:
        enabled: true
      search:
        hide: false                 # enables the TraceQL search tab
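Exemplar links from a metric panel to a trace are configured on the Prometheus data source rather than on Tempo, so a matching provisioning file is needed for the click-through to work. The sketch below writes one; the file name is arbitrary, and the exemplar label traceID is an assumption based on the metrics-generator's default, so check trace_id_label_name in your Tempo config if you have changed it.

```shell
# Sketch: provision the Prometheus data source with an exemplar -> Tempo link.
cat > grafana-datasource-prometheus.yaml <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    uid: prometheus              # must match datasourceUid in the Tempo data source
    url: http://prometheus.monitoring.svc:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: traceID          # assumed exemplar label written by the generator
          datasourceUid: tempo   # uid of the Tempo data source
EOF
echo "wrote grafana-datasource-prometheus.yaml"
```

Drop the file into Grafana's provisioning/datasources directory (or the equivalent ConfigMap) next to the Tempo one.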
In Grafana’s Explore view, select the Tempo data source and the TraceQL tab. The query that motivated this whole project — every checkout span slower than 2s that errored — is now one line:
{ resource.service.name = "checkout" && duration > 2s && status = error }
TraceQL also does aggregation directly over spans, which is what makes it more than a trace lookup. The p95 latency of the checkout service’s database calls, computed from raw spans:
{ resource.service.name = "checkout" && name = "db.query" } | quantile_over_time(duration, .95)
That quantile_over_time is exactly what requires the local-blocks processor from step 2.
6. Confirm RED metrics are flowing from the generator
The metrics-generator should now be emitting span metrics into Prometheus. Verify the series exist:
kubectl -n monitoring port-forward svc/prometheus 9090:9090 &
curl -s 'http://localhost:9090/api/v1/query?query=traces_spanmetrics_calls_total' \
| jq '.data.result | length'
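A non-zero count means RED metrics are landing. Now you can rebuild the dashboards the old APM gave you, in PromQL. Keep in mind that the generator only sees spans that reach Tempo: with the tail sampling from step 4 upstream, absolute request rates are undercounted and error ratios skew high, because errors are kept at 100% while ordinary traffic is kept at 10%. Request rate per service: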
sum by (service) (rate(traces_spanmetrics_calls_total[5m]))
Error rate (fraction of calls with a non-OK span status):
sum by (service) (rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m]))
/
sum by (service) (rate(traces_spanmetrics_calls_total[5m]))
p95 duration from the generator’s latency histogram:
histogram_quantile(0.95,
sum by (service, le) (rate(traces_spanmetrics_latency_bucket[5m])))
Because the generator wrote exemplars, clicking a spike on the p95 panel in Grafana drops you straight into the slow trace behind it — the loop the old sampled vendor could never close.
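The same three queries can back the SLO alerts the old vendor provided. Below is a minimal sketch of a Prometheus rule file built from them; the alert names, the 5% and 2s thresholds, the 10-minute hold, and the severity label are placeholders rather than recommendations, and the latency threshold assumes the histogram is in seconds.

```shell
# Sketch: write an alerting rule file derived from the RED queries above.
cat > tempo-red-alerts.yaml <<'EOF'
groups:
  - name: tempo-red-slo
    rules:
      - alert: CheckoutHighErrorRatio
        expr: |
          sum(rate(traces_spanmetrics_calls_total{service="checkout",status_code="STATUS_CODE_ERROR"}[5m]))
            /
          sum(rate(traces_spanmetrics_calls_total{service="checkout"}[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "checkout error ratio above 5% for 10 minutes"
      - alert: CheckoutSlowP95
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(traces_spanmetrics_latency_bucket{service="checkout"}[5m]))) > 2
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "checkout p95 span latency above 2s for 10 minutes"
EOF
echo "wrote tempo-red-alerts.yaml"
```

If promtool is installed, promtool check rules tempo-red-alerts.yaml validates the file before it is loaded into Prometheus.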
7. Scale the components and tune ingestion limits
Distributed mode exists so each tier scales on its own signal. The signals to watch and the lever for each:
- Distributors scale on incoming span throughput (CPU-bound). If tempo_distributor_spans_received_total outpaces CPU, add replicas.
- Ingesters scale on active trace volume and flush pressure; they are stateful (the PVC from step 2), so scale deliberately and let the ring rebalance.
- Queriers scale on TraceQL query latency and S3 read concurrency.
- Metrics-generator scales on span rate; it is the component most likely to OOM under a traffic surge, so give it headroom.
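Raise the per-tenant ingestion limits before they bite; the defaults will reject spans under real load with a 429. These keys belong in the same global_overrides.defaults block as the processors from step 2, because a second top-level global_overrides key in the values file would shadow the first:
global_overrides:
  defaults:
    ingestion:
      rate_limit_bytes: 30000000      # 30 MB/s per tenant
      burst_size_bytes: 45000000
    global:
      max_bytes_per_trace: 50000000   # cap a runaway trace at 50 MB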
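The limits are easier to defend when they are derived from observed traffic rather than picked by feel. A minimal sketch of that arithmetic follows; the peak span rate and average span size are hypothetical inputs you would replace with your own measurements (for example from rate(tempo_distributor_spans_received_total[1m]) and the distributor's received-bytes counters).

```shell
# Sketch: size rate_limit_bytes from observed traffic. All inputs are assumptions.
PEAK_SPANS_PER_SEC=40000     # hypothetical peak span rate
AVG_SPAN_BYTES=500           # hypothetical average encoded span size
HEADROOM_PCT=150             # 1.5x headroom over peak

# Sustained limit = peak bytes per second plus headroom.
RATE_LIMIT_BYTES=$(( PEAK_SPANS_PER_SEC * AVG_SPAN_BYTES * HEADROOM_PCT / 100 ))
# Burst allowance = 1.5x the sustained limit.
BURST_SIZE_BYTES=$(( RATE_LIMIT_BYTES * 3 / 2 ))

echo "rate_limit_bytes: ${RATE_LIMIT_BYTES}"
echo "burst_size_bytes: ${BURST_SIZE_BYTES}"
```

With these inputs the result lands on the 30 MB/s and 45 MB values used in the overrides above.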
Drive scaling automatically with a HorizontalPodAutoscaler on the distributor and metrics-generator (CPU + custom span-rate metric), and have it all delivered by Argo CD — the GitOps controller that reconciles this Helm release from Git, so every change to tempo-values.yaml is a reviewed, auditable pull request rather than a live helm command. Terraform owns the layer below the cluster (the S3 bucket, the IAM/IRSA role, the KMS key from step 1) so storage and identity are codified too; a GitHub Actions workflow runs terraform plan on each PR and Argo CD takes over once the chart values merge.
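As an illustration of the autoscaling piece, here is a minimal CPU-only HorizontalPodAutoscaler for the distributor. It is a sketch under assumptions: the Deployment is named tempo-distributor, the pods have CPU requests set (utilization targets need them), and the replica bounds and 70% target are placeholders. The custom span-rate metric is left out because it needs a metrics adapter, and the chart may expose its own autoscaling values that are a better fit under GitOps.

```shell
# Sketch: write a CPU-based HPA manifest for the distributor.
cat > tempo-distributor-hpa.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tempo-distributor
  namespace: tracing
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tempo-distributor      # assumed Deployment name for release "tempo"
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
EOF
echo "wrote tempo-distributor-hpa.yaml"

# Only touch the cluster when explicitly asked (APPLY is a convention of this sketch).
if [ "${APPLY:-0}" = "1" ]; then
  kubectl apply -f tempo-distributor-hpa.yaml
fi
```

Under Argo CD you would commit the manifest to Git instead of applying it by hand, and stop pinning distributor replicas in the values file so the two do not fight.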
Validation
Run this end-to-end checklist after deployment; each line proves one hop of the topology.
# 1. Rings healthy — no UNHEALTHY ingester/generator.
curl -s -H 'Accept: application/json' http://localhost:3200/ingester/ring | jq -r '.shards[].state' | sort | uniq -c
# 2. Spans are being accepted by the distributor.
curl -s http://localhost:3200/metrics | grep tempo_distributor_spans_received_total
# 3. Blocks are actually landing in S3.
aws s3 ls "s3://$TEMPO_BUCKET/" --recursive | tail -n 5
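# 4. The compactor is doing work (count should rise over time).
#    Compaction metrics are exposed by the compactor pod, not the distributor.
kubectl -n tracing port-forward deploy/tempo-compactor 3201:3200 &
curl -s http://localhost:3201/metrics | grep tempodb_compaction_blocks_total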
# 5. RED metrics present in Prometheus.
curl -s 'http://localhost:9090/api/v1/query?query=traces_spanmetrics_calls_total' | jq '.status'
In Grafana, the final acceptance test is human: run the status = error TraceQL query and click through to a trace; open the p95 panel and click an exemplar dot to land on the matching slow trace. If both jumps work, traces and metrics are correctly linked.
Rollback / teardown
Because everything is one Helm release plus Terraform-managed storage, rollback is clean. To revert a bad values change:
helm history tempo -n tracing
helm rollback tempo <PREVIOUS_REVISION> -n tracing
To tear the whole stack down (note: this does not delete the S3 data — that is deliberate, so you can re-attach a new cluster to existing traces):
helm uninstall tempo -n tracing
kubectl delete namespace tracing
Only when you genuinely want the traces gone — and have confirmed retention requirements — remove the bucket contents and the bucket:
aws s3 rm "s3://$TEMPO_BUCKET/" --recursive
aws s3api delete-bucket --bucket "$TEMPO_BUCKET" --region "$AWS_REGION"
# Then `terraform destroy` the IAM/IRSA role and KMS key.
If you front this rollout with Argo CD, the real rollback is reverting the Git commit; Argo reconciles the cluster back to the prior state and you never touch helm by hand.
Common pitfalls
- TraceQL metrics queries return empty. The local-blocks processor is not enabled. Add it to metrics_generator.processors (step 2); span-metrics alone gives you the Prometheus series but not in-Tempo rate/quantile_over_time.
- Spans accepted but nothing in S3. Almost always an IAM problem: the role is missing s3:PutObject, or the IRSA annotation did not attach. Check kubectl -n tracing logs statefulset/tempo-ingester for AccessDenied. A missing s3:DeleteObject surfaces later, as compactor errors and a growing object count.
- 429 ingestion rate-limit errors at peak. The per-tenant defaults are conservative. Raise rate_limit_bytes / burst_size_bytes (step 7) and confirm the Collector’s batch processor is enabled so you are not sending tiny noisy requests.
- Service graph empty in Grafana. The service-graphs processor is off, or Grafana’s Tempo data source serviceMap.datasourceUid points at the wrong Prometheus.
- Compactor never runs / S3 object count explodes. Only run one compactor replica per tenant shard unless you have configured sharded compaction; multiple un-sharded compactors fight over the same blocks.
- Exemplars don’t link. send_exemplars: true was missing on the generator remote_write, your Prometheus was not started with --enable-feature=exemplar-storage, or the Prometheus data source in Grafana has no exemplarTraceIdDestinations entry pointing at Tempo.
Security notes
Traces carry request internals — URLs, headers, sometimes IDs — so treat the pipeline as sensitive data. Keep the S3 bucket fully private with SSE-KMS (step 1), and never put AWS keys in the values file: use IRSA so pods assume a role, or HashiCorp Vault’s AWS secrets engine to inject short-lived credentials via a sidecar. Front Grafana with your workforce IdP — Okta (or Microsoft Entra ID) over OIDC/SAML — so only authenticated engineers reach the TraceQL UI, and map IdP groups to Grafana org roles so on-call sees traces while the wider org does not. Scrub PII at the OpenTelemetry Collector with the attributes/redaction processors before spans ever reach Tempo, since once a span is in S3 it is expensive to selectively delete. Run Wiz (or Wiz Code) against the Terraform that provisions the bucket and IAM so a public-exposure or over-broad-policy regression is caught in the PR, and put CrowdStrike Falcon sensors on the cluster nodes for runtime protection of the Tempo workloads. Where TLS terminates at the edge (a public Grafana), Akamai provides WAF and bot mitigation in front of it.
Cost notes
The entire reason for this build is unit economics, so instrument the savings. Storing traces in S3 turns the dominant cost from per-span vendor pricing into roughly the price of object storage plus compaction compute: at S3 Standard list prices of about $0.023 per GB-month (check your region), a terabyte of retained traces costs on the order of $20 to $25 a month at rest. The biggest lever is tail-based sampling at the Collector (step 4): keep 100% of errors and slow traces, sample the rest, and you cut ingest and storage by close to an order of magnitude while keeping every trace that matters. Tune block_retention (step 2) to your real query window (30 days is generous; many teams keep 7–14) and let the compactor merge small blocks so S3 request and LIST costs stay low. Watch the metrics-generator: it is the priciest component to run because it processes every span, so right-size its replicas rather than over-provisioning. Pipe Tempo’s own usage metrics and the S3 cost data into Datadog (or Dynatrace) so the platform team has a single dashboard showing trace volume, storage growth, and cost per service, and route any cost or error-budget breach into ServiceNow as a ticket so the savings are governed, not just hoped for. For teams running internal training material on the platform, even a Moodle instance behind the same Okta SSO and Akamai edge benefits from this tracing backend at near-zero marginal cost: one Tempo cluster serves every service, including the long tail of low-traffic internal apps and virtual appliances, which is precisely where the old per-host pricing hurt most.
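To make the storage side of that argument concrete, here is a back-of-envelope sketch of steady-state S3 cost. Every input is an assumption chosen for illustration: the daily volume, the fraction of traces that survive tail sampling, and the per-GB price (S3 Standard list pricing varies by region and tier). It ignores request, KMS, and compute costs.

```shell
# Sketch: rough monthly S3 storage cost for retained traces. All inputs are assumptions.
RAW_GB_PER_DAY=200          # hypothetical unsampled trace volume per day
KEEP_FRACTION=0.15          # hypothetical share kept by tail sampling
RETENTION_DAYS=30           # block_retention: 720h
PRICE_PER_GB_MONTH=0.023    # assumed S3 Standard list price, check your region

COST_LINE=$(awk -v gb="$RAW_GB_PER_DAY" -v keep="$KEEP_FRACTION" \
                -v days="$RETENTION_DAYS" -v price="$PRICE_PER_GB_MONTH" 'BEGIN {
  stored = gb * keep * days                  # steady-state GB held in the bucket
  printf "stored_gb=%.0f monthly_usd=%.2f", stored, stored * price
}')
echo "$COST_LINE"
```

With these made-up inputs the bucket settles at about 900 GB and roughly $20.70 a month; the point is the shape of the calculation, not the specific figure.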