The OpenTelemetry Collector is the one piece of your observability stack that should not be coupled to a vendor. It receives telemetry in OTLP, transforms it through a processor chain, and fans it out wherever you want — Prometheus, Tempo, a SaaS backend, all three at once. Get topology and processor ordering right and you get clean, enriched, cost-controlled telemetry. Get them wrong and you get OOM-killed pods, dropped error traces, and a bill that scales with log volume instead of value.
1. Agent vs gateway: the two-tier topology
The Collector runs in two distinct roles, and a serious deployment uses both.
An agent runs close to the workload — as a DaemonSet (one per node) or a sidecar. It collects local telemetry, attaches host- and pod-level context that only exists at the source, and forwards quickly over OTLP. Agents must stay lightweight; they share node resources with your application pods.
A gateway runs as a standalone, horizontally-scaled Deployment behind a Service. It receives from all the agents, does the expensive work — heavy transformation, tail sampling, fan-out to multiple backends — and is the only tier that talks to external systems.
| Concern | Agent (DaemonSet / sidecar) | Gateway (Deployment) |
|---|---|---|
| Placement | One per node, or per pod | Central pool, N replicas |
| Job | Collect + enrich + forward | Aggregate + sample + export |
| Resource profile | Minimal, capped | Sized for buffering/sampling |
| Talks to backends | No, only to gateway | Yes |
| Scales with | Node count | Telemetry volume |
Why two tiers? Tail sampling needs a complete trace — every span in one process. If agents sampled independently, each would have only the spans from its own node and could never make a whole-trace decision. The full trace lands at the gateway, so that is where tail sampling lives. The agent’s job is to enrich and forward, never to sample traces.
A minimal agent forwards straight to the gateway Service:
# agent (DaemonSet): collect, enrich, forward
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  k8sattributes: {}
  resourcedetection:
    detectors: [env, system]
  batch: {}
exporters:
  otlp/gateway:
    endpoint: otel-gateway.observability.svc.cluster.local:4317
    tls:
      insecure: true # in-cluster; use real TLS across trust boundaries
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, resourcedetection, batch]
      exporters: [otlp/gateway]
The gateway receives OTLP from agents and does the heavy lifting (full config builds up over the next sections).
2. Receivers and the OTLP contract
OTLP (OpenTelemetry Protocol) is the native wire format and the reason this pipeline is polyglot. A Go service, a Python worker, and a Node API all emit OTLP the same way, so the Collector needs exactly one receiver for all of them. Standardize on OTLP at the edge and translate everything else into it.
The otlp receiver exposes two protocols, and you almost always want both:
- gRPC on `4317`: the default for SDKs and agent-to-gateway hops; efficient, streaming.
- HTTP on `4318`: for environments where gRPC is awkward (browsers, some serverless, restrictive proxies).
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 16 # raise for large batched payloads
      http:
        endpoint: 0.0.0.0:4318
The Collector is not OTLP-only on the way in. Contrib receivers ingest legacy formats and pull-based sources, all normalized into the same internal model:
- `prometheus`: scrape Prometheus targets, bringing pull-based metrics into the pipeline.
- `hostmetrics`: CPU, memory, disk, network from the node (agent role).
- `filelog`: tail and parse log files (replaces a separate log shipper).
- `zipkin` / `jaeger`: accept traces from apps not yet on OTLP, so you migrate incrementally.
A receiver does nothing until a pipeline references it under `service::pipelines`. Defining one in the `receivers:` block but forgetting to wire it in is the most common “why is no data flowing” mistake. Same for processors and exporters: the top-level blocks are definitions; the `service` section activates them.
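To make the defined-but-unwired mistake concrete, here is a minimal sketch. The `hostmetrics` scrapers and the `debug` exporter are illustrative choices that do not come from the configs in this article; the point is only that a component missing from `service::pipelines` never starts.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  hostmetrics:                 # defined here...
    collection_interval: 30s
    scrapers:
      cpu: {}
      memory: {}
exporters:
  debug: {}                    # illustrative exporter that prints to the Collector log
service:
  pipelines:
    metrics:
      receivers: [otlp]        # ...but not listed here, so hostmetrics never runs
      exporters: [debug]
```

Changing the pipeline line to `receivers: [otlp, hostmetrics]` is what activates the receiver.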
3. The processor pipeline that matters (ordering is load-bearing)
Processors run in the exact order you list them in the pipeline. This is not cosmetic — it changes correctness and memory safety. Here is the order I run in production and why each position is deliberate:
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  resourcedetection:
    detectors: [env, system]
    timeout: 5s
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.pod.name
        - k8s.node.name
  attributes/scrub:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: db.statement
        action: delete
  batch:
    send_batch_size: 8192
    send_batch_max_size: 10000
    timeout: 5s
The reasoning behind the order:
- `memory_limiter` first, always. It is the circuit breaker. As the process nears its memory ceiling it refuses new data (forcing backpressure to producers) instead of letting the Collector OOM and lose everything in flight. Put it first so it gates work before downstream processors allocate on every batch.
- `resourcedetection` and `k8sattributes` next, while context is fresh: they enrich resource attributes (section 4). Enrich before scrubbing and batching so later stages and the backend see complete metadata.
- `attributes` for scrubbing before batching: strip secrets (auth headers, raw SQL, PII) early so sensitive data never sits in a batch buffer or reaches an exporter.
- `batch` last. Batching groups telemetry into efficient bundles to cut network calls and backend load. It belongs at the end so everything upstream operates on individual records and only the final, clean, enriched data gets bundled.
The two non-negotiables: `memory_limiter` is first, `batch` is last. Reverse them and you batch data you might have to drop, then drop data already buffered: the worst of both. Every other processor sits between these bookends. The `batch` processor also matters for throughput: without it the Collector forwards each incoming payload as its own export request instead of coalescing them into fewer, larger ones.
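The ordering only takes effect in the pipeline's `processors` list, so the definitions above need wiring along these lines. This is a sketch that assumes the `otlp` receiver and `otlp/gateway` exporter from section 1:

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      # Executed left to right: limit, enrich, scrub, then batch.
      processors:
        - memory_limiter
        - resourcedetection
        - k8sattributes
        - attributes/scrub
        - batch
      exporters: [otlp/gateway]
```

The order of keys inside the top-level `processors:` block has no effect; only this list determines execution order.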
4. Enriching telemetry with resource detection and k8sattributes
Raw telemetry from an SDK knows the service name and not much else. The value of a central pipeline is enrichment: attaching the infrastructure context that turns “service X is slow” into “service X on node Y in namespace Z is slow.” Two processors own this.
resourcedetection reads the environment the Collector runs in and stamps resource attributes — cloud provider, region, availability zone, host ID, instance type. It ships detectors for the major clouds:
processors:
  resourcedetection:
    detectors: [env, azure, system] # also: ec2, eks, gcp, aks, lambda
    timeout: 5s
    override: false # don't clobber attributes the SDK already set
k8sattributes is the one that earns its keep in Kubernetes. It correlates the source IP of incoming telemetry with the Kubernetes API to attach pod, namespace, deployment, and node metadata automatically — no SDK changes required:
processors:
  k8sattributes:
    auth_type: serviceAccount
    passthrough: false
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.pod.uid
        - k8s.deployment.name
        - k8s.node.name
      labels:
        - tag_name: app
          key: app.kubernetes.io/name
          from: pod
    pod_association:
      - sources:
          - from: resource_attribute
            name: k8s.pod.ip
      - sources:
          - from: connection
Because it queries the Kubernetes API, this processor needs RBAC to list and watch pods. Grant it a ClusterRole:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-k8sattributes
rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["replicasets"]
    verbs: ["get", "list", "watch"]
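A ClusterRole grants nothing until it is bound to the identity the Collector pods run as. A minimal sketch of that binding follows; the ServiceAccount name `otel-collector` is a hypothetical placeholder, and the `observability` namespace is assumed from the Service hostnames used elsewhere in this article.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-collector          # hypothetical name; match your DaemonSet's serviceAccountName
  namespace: observability
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector-k8sattributes
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector-k8sattributes
subjects:
  - kind: ServiceAccount
    name: otel-collector
    namespace: observability
```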
Run `k8sattributes` on the agent (DaemonSet), not only the gateway. The agent sees the original source connection from the local pod, so it can correlate the IP reliably. By the time telemetry reaches the gateway the source IP is the agent’s, and the pod association is lost. Enrich at the edge, sample in the center.
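When the processor runs on every node, it usually helps to restrict each agent to watching only the pods on its own node, which keeps the per-agent cache and API load small. A sketch of one way to do that, assuming an environment variable named `KUBE_NODE_NAME` (the name is a free choice) injected through the Kubernetes downward API:

```yaml
# Collector config fragment: only watch pods scheduled on this node.
processors:
  k8sattributes:
    filter:
      node_from_env_var: KUBE_NODE_NAME
---
# DaemonSet container spec fragment: expose the node name to the Collector.
env:
  - name: KUBE_NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
```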
5. Head vs tail sampling: why probabilistic sampling drops your error traces
Sampling exists because storing 100% of traces at scale is ruinous. The question is which traces you keep, and the answer separates a useful trace store from an expensive random sample.
Head sampling decides at the start of a trace, before any spans complete. It is typically probabilistic — “keep 10% of traces” — implemented in the SDK or with the probabilistic_sampler processor. It is cheap and stateless because it buffers nothing. And it is blind: it makes the keep/drop call before it knows whether the trace errored, was slow, or hit an interesting path.
The failure mode is brutal. With 10% head sampling you keep 10% of your error traces too, so the 3am incident trace you need has a 90% chance of having been discarded at the source. Head sampling optimizes for the traces you do not care about.
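For reference, this is roughly what Collector-side head sampling looks like. It is a sketch of the approach being argued against, assuming the `otlp` receiver and `otlp/gateway` exporter from section 1:

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 10    # keeps roughly 10% of traces, errors included
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, probabilistic_sampler, batch]
      exporters: [otlp/gateway]
```

Nothing in this config can refer to span status or duration, because the decision is made from the trace ID alone.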
Tail sampling decides at the end, after the whole trace is assembled and every span is visible. Now you can decide with information: keep all errors, keep everything slow, keep a thin baseline of healthy traces for comparison, drop the boring fast successes.
| | Head sampling | Tail sampling |
|---|---|---|
| Decision point | Trace start | Trace complete |
| Needs full trace buffered | No | Yes |
| Keeps errors reliably | No (probabilistic) | Yes (by policy) |
| State / memory cost | Negligible | Significant |
| Where it runs | SDK or agent | Gateway |
The cost is real: the Collector buffers every span of every in-flight trace until the trace is judged complete. That needs memory and forces the one-process-per-trace constraint — exactly why it belongs on the gateway.
6. Configuring the tail_sampling processor
The tail_sampling processor evaluates a list of policies against each completed trace. A trace is sampled (kept) if any policy matches — policies are OR’d. You build a layered set: catch everything important, then keep a thin probabilistic baseline of the rest.
processors:
  tail_sampling:
    decision_wait: 10s # buffer spans this long before deciding
    num_traces: 100000 # max in-flight traces held in memory
    expected_new_traces_per_sec: 2000
    policies:
      # 1. Always keep traces that errored.
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # 2. Always keep slow traces (> 800ms end to end).
      - name: slow
        type: latency
        latency:
          threshold_ms: 800
      # 3. Keep anything explicitly flagged important.
      - name: critical-tenant
        type: string_attribute
        string_attribute:
          key: tenant.tier
          values: [enterprise, platinum]
      # 4. Keep a 5% baseline of everything else for comparison.
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
      # 5. Cap total throughput so a flood can't blow the budget.
      - name: rate-limit
        type: rate_limiting
        rate_limiting:
          spans_per_second: 5000
How the knobs interact:
- `decision_wait` is how long the Collector holds a trace’s spans before deciding, giving late spans time to arrive. Too short and you decide on incomplete traces (and re-evaluate awkwardly); too long and memory balloons. Set it above your slowest realistic trace. 10s is a sane start; raise it if you see traces being judged before they finish.
- `num_traces` caps how many traces live in memory at once. Size it from `expected_new_traces_per_sec * decision_wait` with generous headroom. This is your primary memory dial.
- Policy order does not change the keep/drop result (any match keeps the trace), but put the high-value, cheap-to-evaluate policies (`status_code`, `latency`) first for clarity.
- `rate_limiting` is a backstop, not a primary filter. It protects the backend from a sudden flood; the semantic policies above it should be doing the real selection.
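The sizing rule for `num_traces` can be worked through with the numbers already used in the example config. The 5x headroom factor is simply what those numbers imply, not a recommendation from any reference:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    expected_new_traces_per_sec: 2000
    # Steady-state traces in flight: 2000 traces/s * 10 s = 20000.
    # num_traces of 100000 therefore leaves 5x headroom for bursts
    # and for traces that take longer than usual to complete.
    num_traces: 100000
```

If the observed trace rate doubles to 4000 per second, the same `num_traces` leaves only 2.5x headroom, which is the point at which to revisit memory limits.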
Tail sampling has a sharp edge with multi-replica gateways: every span of a trace must reach the same instance, or no replica ever sees the whole trace. Plain round-robin breaks this. Put a `loadbalancing` exporter in front of the tail-sampling tier, keyed on `traceID`, so all spans of a trace route to the same downstream Collector. This is the most common way tail sampling silently misbehaves at scale.
7. Exporters and fan-out: Prometheus, Tempo, and a vendor backend at once
A single pipeline can hand the same telemetry to many exporters at once — list them all under a pipeline’s exporters and the Collector fans out a copy to each. This is how you stay vendor-neutral: send metrics to Prometheus and a SaaS, with no app changes and no lock-in.
exporters:
  # Traces -> Tempo (OTLP).
  otlp/tempo:
    endpoint: tempo.observability.svc.cluster.local:4317
    tls:
      insecure: true
    sending_queue:
      enabled: true
      queue_size: 5000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
  # Metrics -> Prometheus (remote write).
  prometheusremotewrite:
    endpoint: http://prometheus.observability.svc.cluster.local:9090/api/v1/write
    resource_to_telemetry_conversion:
      enabled: true # promote resource attrs to labels
  # Everything -> a vendor backend over OTLP.
  otlp/vendor:
    endpoint: ingest.vendor.example.com:443
    headers:
      api-key: ${env:VENDOR_API_KEY}
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo, otlp/vendor]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite, otlp/vendor]
Two distinctions that trip people up:
- `prometheus` vs `prometheusremotewrite`. The `prometheus` exporter opens an endpoint that Prometheus scrapes (pull). The `prometheusremotewrite` exporter pushes to a Prometheus remote-write endpoint. From a gateway, push (remote write) is usually what you want; expose a scrape endpoint only if your Prometheus is configured to pull from the Collector.
- `sending_queue` and `retry_on_failure` are per-exporter resilience. The queue buffers data when a backend is briefly unavailable; retry re-sends transient failures with backoff. Without them, a 30-second backend blip drops data on the floor. Enable both on every production exporter; `max_elapsed_time` bounds how long retries persist before data is finally dropped, so set it consciously.
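For contrast with the remote-write config above, a pull-style setup might look like the following sketch. Port `8889` is an arbitrary choice for illustration; anything other than the self-telemetry port works.

```yaml
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889     # Prometheus scrapes this address (pull)
    resource_to_telemetry_conversion:
      enabled: true            # promote resource attrs to labels
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
```

With several gateway replicas, Prometheus has to scrape every replica individually, which is one reason push tends to be the simpler option from a gateway.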
Fan-out copies are independent: if Tempo is healthy and the vendor endpoint is down, the vendor exporter’s queue fills and retries while Tempo keeps receiving. One slow backend does not block the others — but a slow exporter with a full queue applies backpressure upstream, which is why per-exporter queue sizing matters.
8. Operating the Collector
A pipeline you cannot observe is a pipeline you cannot trust. The Collector emits its own telemetry — turn it on and watch it.
service:
  telemetry:
    logs:
      level: info
    metrics:
      level: detailed
      readers:
        - pull:
            exporter:
              prometheus:
                host: 0.0.0.0
                port: 8888
Scrape :8888 and watch these in particular:
- `otelcol_processor_refused_spans` / `..._refused_metric_points`: the `memory_limiter` is shedding load. The Collector is under-resourced or downstream is too slow.
- `otelcol_exporter_send_failed_spans`: an exporter cannot deliver. Backend down, auth wrong, or the queue is full.
- `otelcol_exporter_queue_size` vs `otelcol_exporter_queue_capacity`: a queue near capacity is about to start dropping. Scale out or raise `queue_size`.
- `otelcol_processor_tail_sampling_sampling_trace_dropped_too_early`: traces evicted before `decision_wait` elapsed; raise `num_traces` or lower load.
Enable the health and zPages extensions for liveness and live introspection:
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679
service:
  extensions: [health_check, zpages]
Point Kubernetes liveness/readiness probes at :13133. Use zpages at /debug/tracez and /debug/pipelinez to inspect live pipeline state when debugging.
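In the Collector's container spec, the probes could be sketched as follows. The timing values are illustrative assumptions and should be tuned to how long your Collector takes to start:

```yaml
# Container spec fragment for the Collector Deployment or DaemonSet.
livenessProbe:
  httpGet:
    path: /
    port: 13133
  initialDelaySeconds: 10      # illustrative; tune to startup time
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /
    port: 13133
  periodSeconds: 10
```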
For zero-downtime config changes, do not rely on in-place reload — the Collector does not hot-reload config the way some agents do. Treat config as immutable and roll it out: update the ConfigMap, then trigger a rolling restart so new pods come up on the new config while old pods drain. With a multi-replica gateway behind a Service, this is seamless.
# Update config, then roll the deployment with zero downtime.
kubectl -n observability create configmap otel-gateway-config \
--from-file=config.yaml --dry-run=client -o yaml | kubectl apply -f -
kubectl -n observability rollout restart deployment/otel-gateway
kubectl -n observability rollout status deployment/otel-gateway
Validate config before it reaches the cluster. The Collector binary can check a config without running it; wire `otelcol validate --config config.yaml` into CI so a typo in a processor name fails the pipeline instead of crash-looping a production pod. (Use the matching binary for your distribution: `otelcol-contrib validate ...` for the contrib build.)
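One possible way to wire this into CI is to run the contrib container image with the `validate` subcommand. The sketch below assumes GitHub Actions, a `config.yaml` at the repository root, and the public `otel/opentelemetry-collector-contrib` image; none of these come from the setup described here, and in practice the image tag should be pinned to the version you deploy.

```yaml
name: collector-config
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate Collector config
        run: |
          # Arguments after the image name are passed to the Collector binary.
          docker run --rm \
            -v "$PWD/config.yaml:/etc/otelcol-contrib/config.yaml:ro" \
            otel/opentelemetry-collector-contrib:latest \
            validate --config /etc/otelcol-contrib/config.yaml
```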
Enterprise scenario
A payments platform moved tail sampling onto its gateway and immediately lost error traces — the exact ones the migration was supposed to save. Self-telemetry told the story: otelcol_processor_tail_sampling_sampling_trace_dropped_too_early was climbing into the thousands per minute. The gateway ran six replicas behind a plain Kubernetes Service, so spans for a single trace round-robined across pods. No replica ever assembled a whole trace, every trace looked incomplete at decision_wait, and the errors policy could never fire because the error span and its parent landed on different instances.
The fix was a two-tier gateway. A thin front tier did nothing but route by trace ID using the loadbalancing exporter, so all spans of a trace pinned to one backend Collector where tail_sampling actually lived.
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      k8s:
        service: otel-sampling.observability # <service>.<namespace>, the form the k8s resolver documents
        ports: [4317]
The k8s resolver watches the backing Service’s endpoints, so the routing ring rebalances automatically as sampling pods scale or restart — no static host list to rot. Within minutes dropped_too_early fell to near zero and error-trace retention jumped to the ~100% the policy promised. The lesson: with multi-replica tail sampling, consistent trace-ID routing is not an optimization, it is a correctness requirement, and the Collector’s own metrics will tell you the moment you have it wrong.
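The front tier's pipeline stays deliberately thin. A sketch of how it might be wired, assuming the `loadbalancing` exporter definition shown above lives in the same config file:

```yaml
# Front tier: route by trace ID only; no sampling happens here.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [loadbalancing]
```

Because the k8s resolver reads the Service's endpoints from the Kubernetes API, the front tier's ServiceAccount most likely also needs get, list, and watch permission on endpoints in that namespace; check the exporter's documentation for the exact resources your version requires.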
Verify
Confirm the config parses before you ship it:
otelcol-contrib validate --config config.yaml
Send a test trace to the receiver and confirm it flows. The simplest end-to-end check is to use the OTLP HTTP endpoint directly:
curl -i http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{
    "resourceSpans": [{
      "scopeSpans": [{
        "spans": [{
          "traceId": "5b8efff798038103d269b633813fc60c",
          "spanId": "eee19b7ec3c1b174",
          "name": "verify-span",
          "kind": 1,
          "startTimeUnixNano": "1700000000000000000",
          "endTimeUnixNano": "1700000000500000000"
        }]
      }]
    }]
  }'
A 200 OK means the receiver accepted it. Confirm it traversed the pipeline by checking the self-telemetry counters:
curl -s http://localhost:8888/metrics | grep -E \
'otelcol_receiver_accepted_spans|otelcol_exporter_sent_spans|otelcol_processor_refused_spans'
receiver_accepted_spans should rise and exporter_sent_spans should rise to match (minus anything tail sampling intentionally dropped). If accepted climbs but sent does not, your data is dying somewhere in the middle — check send_failed and the exporter queue. Confirm liveness:
curl -s http://localhost:13133/ # health_check extension
Pitfalls and next steps
The recurring failures are predictable. Processor order that batches before limiting memory, or scrubs after buffering. Tail sampling on a round-robin gateway pool so no replica ever sees a whole trace. Exporters with no queue or retry, quietly dropping data during a routine backend deploy. A defined-but-unwired component that does nothing. Every one returns a “working” Collector that lies about what it kept.
The highest-leverage next move is to put the config under test: otelcol validate in CI, plus a smoke test that sends a known trace and asserts the self-telemetry counters move. From there, tune tail_sampling against your real trace distribution — measure what fraction of errors and slow traces you actually retain, and set decision_wait and num_traces from observed memory rather than guesses. The Collector is infrastructure; treat its config with the same rigor as application code and it becomes the stable, vendor-neutral foundation the rest of your stack relies on.