Building Production OpenTelemetry Collector Pipelines: Receivers, Processors, and Tail Sampling

The OpenTelemetry Collector is the one piece of your observability stack that should not be coupled to a vendor. It receives telemetry in OTLP, transforms it through a processor chain, and fans it out wherever you want — Prometheus, Tempo, a SaaS backend, all three at once. Get topology and processor ordering right and you get clean, enriched, cost-controlled telemetry. Get them wrong and you get OOM-killed pods, dropped error traces, and a bill that scales with log volume instead of value.

1. Agent vs gateway: the two-tier topology

The Collector runs in two distinct roles, and a serious deployment uses both.

An agent runs close to the workload — as a DaemonSet (one per node) or a sidecar. It collects local telemetry, attaches host- and pod-level context that only exists at the source, and forwards quickly over OTLP. Agents must stay lightweight; they share node resources with your application pods.

A gateway runs as a standalone, horizontally-scaled Deployment behind a Service. It receives from all the agents, does the expensive work — heavy transformation, tail sampling, fan-out to multiple backends — and is the only tier that talks to external systems.

| Concern | Agent (DaemonSet / sidecar) | Gateway (Deployment) |
| --- | --- | --- |
| Placement | One per node, or per pod | Central pool, N replicas |
| Job | Collect + enrich + forward | Aggregate + sample + export |
| Resource profile | Minimal, capped | Sized for buffering/sampling |
| Talks to backends | No, only to gateway | Yes |
| Scales with | Node count | Telemetry volume |

Why two tiers? Tail sampling needs a complete trace — every span in one process. If agents sampled independently, each would have only the spans from its own node and could never make a whole-trace decision. The full trace lands at the gateway, so that is where tail sampling lives. The agent’s job is to enrich and forward, never to sample traces.

A minimal agent forwards straight to the gateway Service:

# agent (DaemonSet) — collect, enrich, forward
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  k8sattributes: {}
  resourcedetection:
    detectors: [env, system]
  batch: {}

exporters:
  otlp/gateway:
    endpoint: otel-gateway.observability.svc.cluster.local:4317
    tls:
      insecure: true   # in-cluster; use real TLS across trust boundaries

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, resourcedetection, batch]
      exporters: [otlp/gateway]

The gateway receives OTLP from agents and does the heavy lifting (full config builds up over the next sections).

2. Receivers and the OTLP contract

OTLP (OpenTelemetry Protocol) is the native wire format and the reason this pipeline is polyglot. A Go service, a Python worker, and a Node API all emit OTLP the same way, so the Collector needs exactly one receiver for all of them. Standardize on OTLP at the edge and translate everything else into it.
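
As a sketch of what "emit OTLP the same way" can look like in practice, the standard OpenTelemetry SDK environment variables point any language's SDK at the node-local agent. This assumes the agent DaemonSet exposes port 4317 on the node (for example via hostPort), which the configs in this article do not show; the container name, image, and service name are placeholders.

```yaml
# Pod spec fragment for a hypothetical workload.
# The same env vars apply to Go, Python, and Node SDKs.
spec:
  containers:
    - name: checkout              # placeholder name
      image: example/checkout     # placeholder image
      env:
        - name: NODE_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP        # IP of the node this pod runs on
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://$(NODE_IP):4317"     # node-local agent, gRPC port
        - name: OTEL_EXPORTER_OTLP_PROTOCOL
          value: grpc
        - name: OTEL_SERVICE_NAME
          value: checkout                     # placeholder service name
```

Because the endpoint is configuration rather than code, moving a workload between agents, or pointing it at a gateway directly, is a manifest change only.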

The otlp receiver exposes two protocols, and you almost always want both:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 16   # raise for large batched payloads
      http:
        endpoint: 0.0.0.0:4318

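The Collector is not OTLP-only on the way in. Contrib receivers ingest legacy formats and pull-based sources, such as zipkin and jaeger for older tracing clients, prometheus for scraping metrics endpoints, and filelog for tailing log files, all normalized into the same internal model.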

A receiver does nothing until a pipeline references it under service::pipelines. Defining one in the receivers: block but forgetting to wire it in is the most common “why is no data flowing” mistake. Same for processors and exporters: the top-level blocks are definitions; the service section activates them.
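
A minimal sketch of the defined-but-unwired mistake, assuming the contrib distribution (for the zipkin receiver) and using the debug exporter only as a stand-in destination:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  zipkin:                        # defined here...
    endpoint: 0.0.0.0:9411

exporters:
  debug: {}                      # writes telemetry to the Collector's own log

service:
  pipelines:
    traces:
      receivers: [otlp]          # ...but not listed here, so zipkin never starts
      exporters: [debug]
```

Changing the pipeline line to `receivers: [otlp, zipkin]` is what actually opens port 9411; the definition alone does not.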

3. The processor pipeline that matters (ordering is load-bearing)

Processors run in the exact order you list them in the pipeline. This is not cosmetic — it changes correctness and memory safety. Here is the order I run in production and why each position is deliberate:

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  resourcedetection:
    detectors: [env, system]
    timeout: 5s
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.pod.name
        - k8s.node.name
  attributes/scrub:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: db.statement
        action: delete
  batch:
    send_batch_size: 8192
    send_batch_max_size: 10000
    timeout: 5s

The reasoning behind the order:

  1. memory_limiter first, always. It is the circuit breaker. As the process nears its memory ceiling it refuses new data (forcing backpressure to producers) instead of letting the Collector OOM and lose everything in flight. Put it first so it gates work before downstream processors allocate on every batch.
  2. resourcedetection and k8sattributes next, while context is fresh — they enrich resource attributes (section 4). Enrich before scrubbing and batching so later stages and the backend see complete metadata.
  3. attributes for scrubbing before batching — strip secrets (auth headers, raw SQL, PII) early so sensitive data never sits in a batch buffer or reaches an exporter.
  4. batch last. Batching groups telemetry into efficient bundles to cut network calls and backend load. It belongs at the end so everything upstream operates on individual records and only the final, clean, enriched data gets bundled.

The two non-negotiables: memory_limiter is first, batch is last. Reverse them and you batch data you might have to drop, then drop data already buffered: the worst of both. Every other processor sits between these bookends. The batch processor also matters for throughput: without it, each incoming request is exported downstream as-is, so many small payloads become many small export calls.
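
The processor definitions above only take effect once a pipeline lists them, and the list order is the execution order. A sketch of the wiring, reusing the otlp receiver and the otlp/gateway exporter name from the agent config in section 1:

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      # List order is execution order: limit, enrich, scrub, then batch.
      processors:
        - memory_limiter
        - resourcedetection
        - k8sattributes
        - attributes/scrub
        - batch
      exporters: [otlp/gateway]
```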

4. Enriching telemetry with resource detection and k8sattributes

Raw telemetry from an SDK knows the service name and not much else. The value of a central pipeline is enrichment: attaching the infrastructure context that turns “service X is slow” into “service X on node Y in namespace Z is slow.” Two processors own this.

resourcedetection reads the environment the Collector runs in and stamps resource attributes — cloud provider, region, availability zone, host ID, instance type. It ships detectors for the major clouds:

processors:
  resourcedetection:
    detectors: [env, azure, system]   # also: ec2, gcp, gce, eks, aks, lambda
    timeout: 5s
    override: false   # don't clobber attributes the SDK already set

k8sattributes is the one that earns its keep in Kubernetes. It correlates the source IP of incoming telemetry with the Kubernetes API to attach pod, namespace, deployment, and node metadata automatically — no SDK changes required:

processors:
  k8sattributes:
    auth_type: serviceAccount
    passthrough: false
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.pod.uid
        - k8s.deployment.name
        - k8s.node.name
      labels:
        - tag_name: app
          key: app.kubernetes.io/name
          from: pod
    pod_association:
      - sources:
          - from: resource_attribute
            name: k8s.pod.ip
      - sources:
          - from: connection

Because it queries the Kubernetes API, this processor needs RBAC to list and watch pods. Grant it a ClusterRole:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-k8sattributes
rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["replicasets"]
    verbs: ["get", "list", "watch"]
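
A ClusterRole grants nothing until it is bound to the identity the Collector runs as. A sketch of the binding, assuming a ServiceAccount named otel-agent in the observability namespace (both names are illustrative, and the DaemonSet would need `serviceAccountName: otel-agent` in its pod spec):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-agent                # hypothetical name
  namespace: observability
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector-k8sattributes
subjects:
  - kind: ServiceAccount
    name: otel-agent
    namespace: observability
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector-k8sattributes   # the ClusterRole defined above
```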

Run k8sattributes on the agent (DaemonSet), not only the gateway. The agent sees the original source connection from the local pod, so it can correlate the IP reliably. By the time telemetry reaches the gateway the source IP is the agent’s, and the pod association is lost. Enrich at the edge, sample in the center.
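
When k8sattributes runs on every node, each agent would otherwise watch every pod in the cluster. A sketch of scoping each agent to its own node with the processor's node filter; the environment variable name KUBE_NODE_NAME is a convention, not a requirement:

```yaml
# Collector config on the agent
processors:
  k8sattributes:
    filter:
      node_from_env_var: KUBE_NODE_NAME   # only track pods scheduled on this node
---
# DaemonSet container spec fragment: expose the node name to the Collector
env:
  - name: KUBE_NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
```

This keeps the agent's pod cache, and the load it puts on the Kubernetes API, proportional to one node rather than the whole cluster.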

5. Head vs tail sampling: why probabilistic sampling drops your error traces

Sampling exists because storing 100% of traces at scale is ruinous. The question is which traces you keep, and the answer separates a useful trace store from an expensive random sample.

Head sampling decides at the start of a trace, before any spans complete. It is typically probabilistic — “keep 10% of traces” — implemented in the SDK or with the probabilistic_sampler processor. It is cheap and stateless because it buffers nothing. And it is blind: it makes the keep/drop call before it knows whether the trace errored, was slow, or hit an interesting path.

The failure mode is brutal. With 10% head sampling you keep 10% of your error traces too, so the 3am incident trace you need has a 90% chance of having been discarded at the source. Head sampling optimizes for the traces you do not care about.
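
For contrast with the tail-sampling config later on, head sampling in the Collector is a few lines. This is a sketch only; the exporter name otlp/tempo is borrowed from section 7:

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 10   # keep roughly 10% of traces, chosen from the trace ID

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, probabilistic_sampler, batch]
      exporters: [otlp/tempo]
```

Nothing in this config can see span status or duration, so error traces and slow traces are dropped at the same 90% rate as everything else.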

Tail sampling decides at the end, after the whole trace is assembled and every span is visible. Now you can decide with information: keep all errors, keep everything slow, keep a thin baseline of healthy traces for comparison, drop the boring fast successes.

| | Head sampling | Tail sampling |
| --- | --- | --- |
| Decision point | Trace start | Trace complete |
| Needs full trace buffered | No | Yes |
| Keeps errors reliably | No (probabilistic) | Yes (by policy) |
| State / memory cost | Negligible | Significant |
| Where it runs | SDK or agent | Gateway |

The cost is real: the Collector buffers every span of every in-flight trace until the trace is judged complete. That needs memory and forces the one-process-per-trace constraint — exactly why it belongs on the gateway.

6. Configuring the tail_sampling processor

The tail_sampling processor evaluates a list of policies against each completed trace. A trace is sampled (kept) if any policy matches — policies are OR’d. You build a layered set: catch everything important, then keep a thin probabilistic baseline of the rest.

processors:
  tail_sampling:
    decision_wait: 10s          # buffer spans this long before deciding
    num_traces: 100000          # max in-flight traces held in memory
    expected_new_traces_per_sec: 2000
    policies:
      # 1. Always keep traces that errored.
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]

      # 2. Always keep slow traces (> 800ms end to end).
      - name: slow
        type: latency
        latency:
          threshold_ms: 800

      # 3. Keep anything explicitly flagged important.
      - name: critical-tenant
        type: string_attribute
        string_attribute:
          key: tenant.tier
          values: [enterprise, platinum]

      # 4. Keep a 5% baseline of everything else for comparison.
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

      # 5. Admit up to 5000 spans/s of traces that matched nothing above.
      #    Policies are OR'd, so this adds to the kept set; it does not cap policies 1-4.
      - name: rate-limit
        type: rate_limiting
        rate_limiting:
          spans_per_second: 5000
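
One caveat on the rate_limiting policy: because top-level policies are OR'd, a standalone rate_limiting entry keeps additional traces up to its limit and does not restrain the other policies. If the goal is a hard spans-per-second budget, the composite policy type is the usual tool. The following is a sketch based on the processor's documented composite options; the sub-policy names and percentages are illustrative, and field support should be checked against the tail_sampling README for your Collector version.

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: budgeted
        type: composite
        composite:
          max_total_spans_per_second: 5000         # ceiling across all sub-policies
          policy_order: [errors, slow, baseline]   # evaluated in this order
          composite_sub_policy:
            - name: errors
              type: status_code
              status_code:
                status_codes: [ERROR]
            - name: slow
              type: latency
              latency:
                threshold_ms: 800
            - name: baseline
              type: always_sample
          rate_allocation:                         # share of the ceiling per sub-policy
            - policy: errors
              percent: 50
            - policy: slow
              percent: 30
            - policy: baseline
              percent: 20
```

The trade-off is explicit: under this sketch an error flood above its share of the budget is also trimmed, so the allocation should reflect which traces you can least afford to lose.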

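How the knobs interact: decision_wait is how long the processor holds a trace, counted from its first span, before evaluating the policies; spans that arrive after that are not considered by the policies. num_traces caps the traces held in memory at once, so it must comfortably exceed expected_new_traces_per_sec × decision_wait (here 2000 × 10 s = 20,000, well under 100,000); if it does not, the oldest traces are evicted before they are evaluated. expected_new_traces_per_sec only pre-sizes internal data structures. Raising decision_wait or num_traces raises memory use roughly in proportion.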

Tail sampling has a sharp edge with multi-replica gateways: every span of a trace must reach the same instance, or no replica ever sees the whole trace. Plain round-robin breaks this. Put a loadbalancing exporter in front of the tail-sampling tier, keyed on traceID, so all spans of a trace route to the same downstream Collector. This is the most common way tail sampling silently misbehaves at scale.

7. Exporters and fan-out: Prometheus, Tempo, and a vendor backend at once

A single pipeline can hand the same telemetry to many exporters at once: list them all under a pipeline’s exporters and the Collector fans out a copy to each. This is how you stay vendor-neutral: send traces to Tempo and metrics to Prometheus while a SaaS backend receives a copy of both, with no app changes and no lock-in.

exporters:
  # Traces -> Tempo (OTLP).
  otlp/tempo:
    endpoint: tempo.observability.svc.cluster.local:4317
    tls:
      insecure: true
    sending_queue:
      enabled: true
      queue_size: 5000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s

  # Metrics -> Prometheus (remote write).
  prometheusremotewrite:
    endpoint: http://prometheus.observability.svc.cluster.local:9090/api/v1/write
    resource_to_telemetry_conversion:
      enabled: true   # promote resource attrs to labels

  # Everything -> a vendor backend over OTLP.
  otlp/vendor:
    endpoint: ingest.vendor.example.com:443
    headers:
      api-key: ${env:VENDOR_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo, otlp/vendor]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite, otlp/vendor]

Two distinctions that trip people up:

Fan-out copies are independent: if Tempo is healthy and the vendor endpoint is down, the vendor exporter’s queue fills and retries while Tempo keeps receiving. One slow backend does not block the others — but a slow exporter with a full queue applies backpressure upstream, which is why per-exporter queue sizing matters.
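
A sketch of per-exporter queue sizing for the vendor exporter, which in the config above relies on defaults. The numbers are illustrative, not recommendations; size them from your own throughput and the outage length you want to ride out.

```yaml
exporters:
  otlp/vendor:
    endpoint: ingest.vendor.example.com:443
    headers:
      api-key: ${env:VENDOR_API_KEY}
    sending_queue:
      enabled: true
      num_consumers: 10        # parallel senders draining the queue
      queue_size: 10000        # illustrative; counted in queued requests (batches)
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s   # give up on a batch after 5 minutes of retries
```

Since the queue holds batches rather than spans, its effective capacity depends on the batch processor settings upstream.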

8. Operating the Collector

A pipeline you cannot observe is a pipeline you cannot trust. The Collector emits its own telemetry — turn it on and watch it.

service:
  telemetry:
    logs:
      level: info
    metrics:
      level: detailed
      readers:
        - pull:
            exporter:
              prometheus:
                host: 0.0.0.0
                port: 8888

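Scrape :8888 and watch these in particular: otelcol_receiver_accepted_spans and otelcol_receiver_refused_spans (what came in, and what was pushed back), otelcol_processor_refused_spans (memory_limiter refusing data), otelcol_exporter_sent_spans and otelcol_exporter_send_failed_spans (what went out, and what did not), otelcol_exporter_queue_size against otelcol_exporter_queue_capacity (how close each exporter queue is to full), and otelcol_processor_tail_sampling_sampling_trace_dropped_too_early (traces evicted before a sampling decision). Depending on the Collector version, counters may carry a _total suffix.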
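
Watching is easier when Prometheus does it for you. A sketch of two alert rules over the Collector's self-metrics; the alert names, thresholds, and durations are assumptions, and the metric names may need a _total suffix depending on your Collector version.

```yaml
groups:
  - name: otel-collector
    rules:
      - alert: OtelExporterSendFailures          # hypothetical alert name
        expr: sum by (exporter) (rate(otelcol_exporter_send_failed_spans[5m])) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Collector exporter {{ $labels.exporter }} is failing to send spans"
      - alert: OtelTailSamplingEvictions         # hypothetical alert name
        expr: rate(otelcol_processor_tail_sampling_sampling_trace_dropped_too_early[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tail sampling is evicting traces before a decision is made"
```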

Enable the health and zPages extensions for liveness and live introspection:

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679

service:
  extensions: [health_check, zpages]

Point Kubernetes liveness/readiness probes at :13133. Use zpages at /debug/tracez and /debug/pipelinez to inspect live pipeline state when debugging.
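
A sketch of the probe wiring in the Collector's container spec; the delay and period values are illustrative.

```yaml
# Container spec fragment for the Collector pod
ports:
  - name: health
    containerPort: 13133
livenessProbe:
  httpGet:
    path: /
    port: 13133
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /
    port: 13133
  periodSeconds: 5
```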

For zero-downtime config changes, do not rely on in-place reload: the Collector does not watch its config file for changes, so updating a ConfigMap alone leaves running pods on the old config. Treat config as immutable and roll it out: update the ConfigMap, then trigger a rolling restart so new pods come up on the new config while old pods drain. With a multi-replica gateway behind a Service, this is seamless.

# Update config, then roll the deployment with zero downtime.
kubectl -n observability create configmap otel-gateway-config \
  --from-file=config.yaml --dry-run=client -o yaml | kubectl apply -f -

kubectl -n observability rollout restart deployment/otel-gateway
kubectl -n observability rollout status  deployment/otel-gateway

Validate config before it reaches the cluster. The Collector binary can check a config without running it; wire otelcol validate --config config.yaml into CI so a typo in a processor name fails the pipeline instead of crash-looping a production pod. (Use the matching subcommand for your distribution — otelcol-contrib validate ... for the contrib build.)
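
One way to wire the check into CI is to run the published contrib image against the config file. The following is a sketch for GitHub Actions; the workflow path, job names, and dummy API key are assumptions, and the image tag should be pinned to the version you actually deploy rather than latest.

```yaml
# .github/workflows/otel-config.yml (hypothetical path)
name: validate-otel-config
on:
  pull_request:
    paths: ["config.yaml"]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate Collector config
        run: |
          # Pin the tag to your deployed Collector version instead of latest.
          docker run --rm \
            -v "$PWD/config.yaml:/etc/otelcol-contrib/config.yaml:ro" \
            -e VENDOR_API_KEY=dummy \
            otel/opentelemetry-collector-contrib:latest \
            validate --config=/etc/otelcol-contrib/config.yaml
```

Validation checks that the config parses and that every referenced component exists in the build; it does not prove that endpoints are reachable or that RBAC is in place.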

Enterprise scenario

A payments platform moved tail sampling onto its gateway and immediately lost error traces — the exact ones the migration was supposed to save. Self-telemetry told the story: otelcol_processor_tail_sampling_sampling_trace_dropped_too_early was climbing into the thousands per minute. The gateway ran six replicas behind a plain Kubernetes Service, so spans for a single trace round-robined across pods. No replica ever assembled a whole trace, every trace looked incomplete at decision_wait, and the errors policy could never fire because the error span and its parent landed on different instances.

The fix was a two-tier gateway. A thin front tier did nothing but route by trace ID using the loadbalancing exporter, so all spans of a trace pinned to one backend Collector where tail_sampling actually lived.

exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      k8s:
        service: otel-sampling.observability   # <service>.<namespace>, not the full cluster DNS name
        ports: [4317]

The k8s resolver watches the backing Service’s endpoints, so the routing ring rebalances automatically as sampling pods scale or restart — no static host list to rot. Within minutes dropped_too_early fell to near zero and error-trace retention jumped to the ~100% the policy promised. The lesson: with multi-replica tail sampling, consistent trace-ID routing is not an optimization, it is a correctness requirement, and the Collector’s own metrics will tell you the moment you have it wrong.
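
The k8s resolver reads the Kubernetes API, so the routing tier needs permission to do so. A sketch of the RBAC, assuming the routing tier runs as a ServiceAccount named otel-gateway (a hypothetical name) and that your Collector version resolves via the core endpoints resource; check the loadbalancing exporter README for the version you run.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: otel-lb-resolver          # hypothetical name
  namespace: observability
rules:
  - apiGroups: [""]
    resources: ["endpoints"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: otel-lb-resolver
  namespace: observability
subjects:
  - kind: ServiceAccount
    name: otel-gateway            # hypothetical service account of the routing tier
    namespace: observability
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: otel-lb-resolver
```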

Verify

Confirm the config parses before you ship it:

otelcol-contrib validate --config config.yaml

Send a test trace to the receiver and confirm it flows. The simplest end-to-end check is to use the OTLP HTTP endpoint directly:

curl -i http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{
    "resourceSpans": [{
      "scopeSpans": [{
        "spans": [{
          "traceId": "5b8efff798038103d269b633813fc60c",
          "spanId": "eee19b7ec3c1b174",
          "name": "verify-span",
          "kind": 1,
          "startTimeUnixNano": "1700000000000000000",
          "endTimeUnixNano":   "1700000000500000000"
        }]
      }]
    }]
  }'

A 200 OK means the receiver accepted it. Confirm it traversed the pipeline by checking the self-telemetry counters:

curl -s http://localhost:8888/metrics | grep -E \
  'otelcol_receiver_accepted_spans|otelcol_exporter_sent_spans|otelcol_processor_refused_spans'

receiver_accepted_spans should rise and exporter_sent_spans should rise to match (minus anything tail sampling intentionally dropped). If accepted climbs but sent does not, your data is dying somewhere in the middle — check send_failed and the exporter queue. Confirm liveness:

curl -s http://localhost:13133/   # health_check extension

Checklist

Pitfalls and next steps

The recurring failures are predictable. Processor order that batches before limiting memory, or scrubs after buffering. Tail sampling on a round-robin gateway pool so no replica ever sees a whole trace. Exporters with no queue or retry, quietly dropping data during a routine backend deploy. A defined-but-unwired component that does nothing. Every one returns a “working” Collector that lies about what it kept.

The highest-leverage next move is to put the config under test: otelcol validate in CI, plus a smoke test that sends a known trace and asserts the self-telemetry counters move. From there, tune tail_sampling against your real trace distribution — measure what fraction of errors and slow traces you actually retain, and set decision_wait and num_traces from observed memory rather than guesses. The Collector is infrastructure; treat its config with the same rigor as application code and it becomes the stable, vendor-neutral foundation the rest of your stack relies on.
