A payments platform team runs about 140 microservices across three EKS clusters, and their observability story is a museum of half-finished migrations: a hand-rolled Prometheus pair scraping a 9,000-target kubernetes_sd config that falls over every time the bigger cluster scales, four different vendor agents fighting for the same 8888 port, and a tracing setup in which only the two services whose owners cared enough to add an SDK actually emit spans. When an incident hits, the on-call spends the first ten minutes deciding which tool to trust. The mandate from the platform lead is blunt: “one collection layer, one pipeline, traces from every service whether the team instrumented it or not, and stop paging me because Prometheus ran out of memory.” This guide builds exactly that on Kubernetes using the OpenTelemetry Operator: Collector custom resources for a managed pipeline, the Target Allocator to shard Prometheus scraping so no single Collector drowns, and auto-instrumentation injection so a pod gets a tracing agent by adding one annotation, no code change required.
The Operator is the piece that turns “we run some Collectors” into “observability is a platform capability.” It owns three CRDs — OpenTelemetryCollector (the pipeline), TargetAllocator (Prometheus target sharding), and Instrumentation (the language auto-instrument config) — and reconciles them the way Kubernetes reconciles everything else: declaratively, version-controlled, and self-healing. By the end you will have a sharded scrape layer that scales with the cluster, a gateway tier that fans telemetry out to Dynatrace and Datadog in parallel, and Java/Python/Node services emitting distributed traces without their teams touching a line of code.
Prerequisites
- A Kubernetes cluster on 1.27+ (examples assume EKS, but AKS/GKE work identically) with kubectl and helm 3.14+ configured against it.
- cert-manager installed and healthy: the Operator’s admission webhooks need a serving certificate, and cert-manager is the supported way to issue it.
- Cluster-admin for the initial install (you create CRDs and a webhook), plus a namespace where your workloads run.
- An existing telemetry backend reachable from the cluster. This guide ships to Dynatrace (OTLP/HTTP with an API token) and Datadog (Datadog exporter with an API key); any OTLP backend substitutes cleanly.
- HashiCorp Vault with the Kubernetes auth method enabled, used here to inject backend API tokens into the Collector rather than committing them to a Secret.
- Argo CD for GitOps delivery (optional but assumed in the rollout section), and a ServiceNow instance if you want the change-gate described at the end.
Target topology
The design is a two-hop collection pipeline built from three Collector tiers, which is the shape almost every production OpenTelemetry deployment converges on. A DaemonSet (agent) Collector runs one pod per node: it receives OTLP from local workloads over the node IP, scrapes node-local targets, and enriches everything with k8sattributes (pod, namespace, deployment, node) before forwarding. Separately, a StatefulSet Collector with the Target Allocator owns Prometheus scraping: the Target Allocator discovers all scrape targets and distributes them across the StatefulSet replicas so each Collector scrapes a stable, bounded slice instead of one process trying to scrape everything. Both tiers forward to a Deployment (gateway) Collector, the single egress point that batches, applies tail sampling, and fans out to Dynatrace and Datadog in parallel. The Operator’s mutating webhook sits to the side, watching pod creation and injecting auto-instrumentation init-containers when it sees the right annotation.
Why three tiers rather than one big Collector: the agent has to be node-local for low-latency OTLP receive and host metadata; Prometheus sharding needs sticky pod identity (hence a StatefulSet) so the Target Allocator’s assignment stays stable across restarts; and the gateway centralizes the expensive, stateful work — tail sampling needs to see all spans of a trace, and you do not want every node renegotiating TLS to two SaaS backends. Separating them lets each scale on its own axis.
1. Install cert-manager and the OpenTelemetry Operator
The Operator’s webhooks default and validate every CR you submit, so they must be up before any CR is applied. Install cert-manager first, wait for it, then install the Operator via its Helm chart.
# cert-manager — required for the Operator's admission webhook TLS
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm upgrade --install cert-manager jetstack/cert-manager \
--namespace cert-manager --create-namespace \
--version v1.15.3 \
--set crds.enabled=true
kubectl rollout status deployment/cert-manager-webhook -n cert-manager --timeout=120s
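Now the Operator. The chart installs the three CRDs, the controller Deployment, and the mutating/validating webhooks. Crucially, enable the Target Allocator’s RBAC at install time so the chart provisions the ClusterRole the allocator needs to do Kubernetes service discovery. Value names differ between chart versions, so confirm the keys used below against helm show values open-telemetry/opentelemetry-operator --version 0.74.2 before relying on them.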
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
helm upgrade --install opentelemetry-operator \
open-telemetry/opentelemetry-operator \
--namespace opentelemetry-operator-system --create-namespace \
--version 0.74.2 \
--set "manager.collectorImage.repository=otel/opentelemetry-collector-contrib" \
--set admissionWebhooks.certManager.enabled=true \
--set manager.featureGates=operator.observability.prometheus \
--set targetAllocator.enabled=true
# The Operator must be Available before it can reconcile any CR you create next.
kubectl rollout status deployment/opentelemetry-operator \
-n opentelemetry-operator-system --timeout=180s
kubectl get crd | grep opentelemetry
You should see opentelemetrycollectors, targetallocators, and instrumentations listed. Pin the contrib collector image explicitly — the k8sattributes, prometheus receiver, and the Datadog exporter all live in the contrib distribution, not the core one. A frequent first-day failure is deploying a Collector that references a receiver the image does not contain; pinning contrib avoids it.
2. Provision backend credentials with Vault (no plaintext secrets)
The Collector needs a Dynatrace API token and a Datadog API key. Per the project’s hard rule about never committing credentials, these come from HashiCorp Vault via the Vault Agent Injector, which writes them to a tmpfs file the Collector reads with an environment-variable reference. Set up the Vault role and policy.
# In Vault: a policy granting read on the two telemetry secrets
vault policy write otel-collector - <<'EOF'
path "secret/data/observability/dynatrace" { capabilities = ["read"] }
path "secret/data/observability/datadog" { capabilities = ["read"] }
EOF
# Bind the policy to the Collector's ServiceAccount via Kubernetes auth
vault write auth/kubernetes/role/otel-collector \
bound_service_account_names=otel-gateway-collector \
bound_service_account_namespaces=observability \
policies=otel-collector \
ttl=1h
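The gateway Collector pod (defined in step 5) carries Vault Agent annotations so the rendered secrets land as files under /vault/secrets/ (dt and dd, named after the annotation suffixes). The Collector config then reads them with ${env:DT_API_TOKEN}-style references. Note that the Vault Agent only writes files; it does not set environment variables, so the export lines in the templates take effect only if the container’s entrypoint sources those files before the Collector starts. The stock contrib image has no shell to do that, so plan for a small wrapper image or another loading mechanism, and verify the variables resolve before relying on the exporters. Either way, the API token stays out of the OpenTelemetryCollector CR, out of git, and out of any kubectl get secret output: short-lived, leased, and injected at runtime.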
3. Deploy the agent (DaemonSet) Collector
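This is the per-node receiver. It accepts OTLP from local pods, tags telemetry with Kubernetes metadata, and forwards to the gateway. Note mode: daemonset, which the Operator translates into a DaemonSet for you. The CR below does not expose the OTLP ports on the host; if workloads address the agent by node IP or node name (as the Instrumentation in step 6 does), the agent also needs host networking (spec.hostNetwork) or host ports on 4317/4318.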
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-agent
  namespace: observability
spec:
  mode: daemonset
  image: otel/opentelemetry-collector-contrib:0.105.0
  serviceAccount: otel-agent-collector
  env:
    - name: K8S_NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
  config:
    receivers:
      otlp:
        protocols:
          grpc: { endpoint: 0.0.0.0:4317 }
          http: { endpoint: 0.0.0.0:4318 }
    processors:
      k8sattributes:
        auth_type: serviceAccount
        passthrough: false
        extract:
          metadata:
            - k8s.namespace.name
            - k8s.deployment.name
            - k8s.pod.name
            - k8s.node.name
      resourcedetection:
        detectors: [env, eks, ec2]
      batch:
        send_batch_size: 8192
        timeout: 5s
    exporters:
      otlp/gateway:
        endpoint: otel-gateway-collector.observability.svc:4317
        tls: { insecure: true } # in-cluster mTLS handled at the mesh layer if present
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [k8sattributes, resourcedetection, batch]
          exporters: [otlp/gateway]
        metrics:
          receivers: [otlp]
          processors: [k8sattributes, resourcedetection, batch]
          exporters: [otlp/gateway]
The k8sattributes processor needs RBAC to read pods and replicasets cluster-wide; the Operator does not grant this for you on an arbitrary ServiceAccount. Bind it explicitly:
kubectl create serviceaccount otel-agent-collector -n observability
kubectl create clusterrole otel-k8sattributes \
--verb=get,list,watch \
--resource=pods,namespaces,nodes,replicasets
kubectl create clusterrolebinding otel-agent-k8sattributes \
--clusterrole=otel-k8sattributes \
--serviceaccount=observability:otel-agent-collector
Apply the CR with kubectl apply -f otel-agent.yaml -n observability. Forgetting the ClusterRole is the single most common reason k8sattributes silently emits un-enriched telemetry — the processor fails open, so you get spans with no k8s.deployment.name and no error until you go looking.
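Because the failure is silent, it can help to ask the API server directly whether the ServiceAccount can read what k8sattributes looks up. The following is a minimal sketch, not part of the Operator: check_k8sattributes_rbac is a hypothetical helper name, the defaults assume the namespace and ServiceAccount from this step, and kubectl auth can-i with --as only works if your own user is allowed to impersonate ServiceAccounts.

```shell
# Hypothetical helper: check that the agent's ServiceAccount can get/list/watch
# the objects granted in the ClusterRole above. Prints one line per check and
# returns non-zero if any permission is missing.
# Usage: check_k8sattributes_rbac [namespace] [serviceaccount]
check_k8sattributes_rbac() {
  subject="system:serviceaccount:${1:-observability}:${2:-otel-agent-collector}"
  failed=0
  for resource in pods namespaces nodes replicasets.apps; do
    for verb in get list watch; do
      # can-i prints "yes" or "no" on stdout; warnings go to stderr.
      answer=$(kubectl auth can-i "$verb" "$resource" \
        --as="$subject" --all-namespaces 2>/dev/null || true)
      if [ "$answer" = "yes" ]; then
        echo "ok      $verb $resource"
      else
        echo "MISSING $verb $resource"
        failed=1
      fi
    done
  done
  return "$failed"
}
```

Called with no arguments it checks observability/otel-agent-collector; any MISSING line points at the ClusterRole or ClusterRoleBinding rather than at the processor config.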
4. Deploy the Target Allocator for sharded Prometheus scraping
This is the heart of the guide. A statefulset-mode Collector with targetAllocator.enabled: true hands all Prometheus service discovery to a Target Allocator, which the Operator runs as its own Deployment alongside the StatefulSet and which assigns each discovered target to exactly one Collector replica. Scaling the StatefulSet redistributes targets automatically, the thing the team’s old Prometheus pair could never do.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-prometheus
  namespace: observability
spec:
  mode: statefulset
  replicas: 3
  image: otel/opentelemetry-collector-contrib:0.105.0
  serviceAccount: otel-prometheus-collector
  targetAllocator:
    enabled: true
    serviceAccount: otel-targetallocator
    allocationStrategy: consistent-hashing # stable assignment across rescale
    prometheusCR:
      enabled: true # consume ServiceMonitor/PodMonitor CRs
      scrapeInterval: 30s
  config:
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: otel-collector-self
              scrape_interval: 30s
              static_configs:
                - targets: [0.0.0.0:8888]
    processors:
      batch:
        send_batch_size: 8192
        timeout: 5s
    exporters:
      otlp/gateway:
        endpoint: otel-gateway-collector.observability.svc:4317
        tls: { insecure: true }
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [batch]
          exporters: [otlp/gateway]
Two settings carry the whole feature. allocationStrategy: consistent-hashing means adding a fourth replica reshuffles only ~1/4 of targets, not all of them — minimizing scrape gaps during scale events. prometheusCR.enabled: true lets the allocator read existing Prometheus Operator ServiceMonitor and PodMonitor resources, so you migrate off a Prometheus Operator stack without rewriting a single scrape config — the allocator simply consumes the CRs you already have.
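The "only about a quarter moves" property can be seen on a toy hash ring. The sketch below is an illustration under stated assumptions, not the Target Allocator's implementation: the MD5-based hash, the 40 virtual nodes per replica, the collector-N names, and the 200 made-up target addresses are all chosen for the demo. It builds a ring for 3 replicas and a ring for 4, assigns the same targets to both, and counts how many changed owner.

```shell
# Toy consistent-hash ring: how many targets change owner when a StatefulSet
# grows from 3 to 4 replicas. Illustration only; requires md5sum and awk.

# Hash a string to a 32-bit integer (first 8 hex digits of its MD5).
h32() { echo $((0x$(printf '%s' "$1" | md5sum | cut -c1-8))); }

# Emit a sorted ring for $1 replicas: one "<point> <replica>" line per virtual node.
build_ring() {
  r=0
  while [ "$r" -lt "$1" ]; do
    v=0
    while [ "$v" -lt 40 ]; do
      echo "$(h32 "collector-$r#$v") collector-$r"
      v=$((v + 1))
    done
    r=$((r + 1))
  done | sort -n
}

# Read targets on stdin and print "<target> <owner>" using the ring file in $1.
# A target belongs to the first ring point at or after its hash, wrapping around.
assign() {
  while read -r t; do echo "$(h32 "$t") $t"; done |
    awk 'NR == FNR { pt[NR] = $1 + 0; own[NR] = $2; n = NR; next }
         { owner = own[1]
           for (i = 1; i <= n; i++) if (pt[i] >= $1 + 0) { owner = own[i]; break }
           print $2, owner }' "$1" -
}

# 200 hypothetical scrape targets.
targets() {
  i=0
  while [ "$i" -lt 200 ]; do
    echo "10.0.$((i / 100)).$((i % 100)):9100"
    i=$((i + 1))
  done
}

tmp=$(mktemp -d)
build_ring 3 > "$tmp/ring3"
build_ring 4 > "$tmp/ring4"
targets | assign "$tmp/ring3" > "$tmp/before"
targets | assign "$tmp/ring4" > "$tmp/after"

# Count targets whose owner changed, and how many of those did NOT go to the new replica.
result=$(paste -d' ' "$tmp/before" "$tmp/after" |
  awk '$2 != $4 { m++; if ($4 != "collector-3") s++ } END { print m + 0, s + 0 }')
moved=${result% *}
stray=${result#* }
rm -rf "$tmp"
echo "moved=$moved of 200 targets; moved to a pre-existing replica=$stray"
```

With four equally weighted replicas the expectation is that roughly a quarter of the targets move, and by construction every moved target lands on the new collector-3; targets never shuffle between the three existing replicas, which is what keeps scrape gaps small during a scale-out.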
The Target Allocator does cluster-wide discovery, so its ServiceAccount needs broad read RBAC. The Helm chart created a suitable ClusterRole at install (step 1); bind it:
kubectl create serviceaccount otel-targetallocator -n observability
kubectl create serviceaccount otel-prometheus-collector -n observability
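# Bind the ClusterRole the operator chart provisioned for target allocation.
# The role name below is an assumption; confirm it first with:
#   kubectl get clusterrole | grep -i targetallocator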
kubectl create clusterrolebinding otel-ta-discovery \
--clusterrole=opentelemetry-operator-targetallocator \
--serviceaccount=observability:otel-targetallocator
kubectl apply -f otel-prometheus.yaml -n observability
Confirm the allocator is sharding. Its /jobs and /jobs/<job>/targets HTTP API on port 8080 shows the live assignment:
kubectl port-forward -n observability svc/otel-prometheus-targetallocator 8080:80 &
curl -s localhost:8080/jobs | jq 'keys'
# Pick a job, then see which collector replica owns which targets:
curl -s "localhost:8080/jobs/serviceMonitor%2Fobservability%2Fapi-gateway%2F0/targets" \
| jq '.[].targets | length'
If replicas is 3, the targets for a busy job should be split roughly into thirds across the three Collector pods. That split is the proof the OOM problem is solved: each Collector now scrapes a bounded slice instead of all 9,000 targets.
5. Deploy the gateway (Deployment) Collector with tail sampling and fan-out
The gateway is the single egress point. It receives forwarded telemetry from agents and the Prometheus tier, applies tail-based sampling (which requires a Collector that sees a whole trace — hence centralizing it), and exports to Dynatrace and Datadog in parallel so both backends get full fidelity. This is also the pod that gets the Vault annotations from step 2.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-gateway
  namespace: observability
spec:
  mode: deployment
  replicas: 2
  image: otel/opentelemetry-collector-contrib:0.105.0
  serviceAccount: otel-gateway-collector
  podAnnotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "otel-collector"
    vault.hashicorp.com/agent-inject-secret-dt: "secret/data/observability/dynatrace"
    vault.hashicorp.com/agent-inject-template-dt: |
      {{- with secret "secret/data/observability/dynatrace" -}}
      export DT_API_TOKEN="{{ .Data.data.token }}"
      {{- end -}}
    vault.hashicorp.com/agent-inject-secret-dd: "secret/data/observability/datadog"
    vault.hashicorp.com/agent-inject-template-dd: |
      {{- with secret "secret/data/observability/datadog" -}}
      export DD_API_KEY="{{ .Data.data.api_key }}"
      {{- end -}}
  config:
    receivers:
      otlp:
        protocols:
          grpc: { endpoint: 0.0.0.0:4317 }
    processors:
      tail_sampling:
        decision_wait: 10s
        policies:
          - name: errors
            type: status_code
            status_code: { status_codes: [ERROR] }
          - name: slow
            type: latency
            latency: { threshold_ms: 500 }
          - name: baseline
            type: probabilistic
            probabilistic: { sampling_percentage: 10 }
      batch:
        send_batch_size: 8192
        timeout: 5s
    exporters:
      otlphttp/dynatrace:
        endpoint: ${env:DT_ENV_URL}/api/v2/otlp
        headers:
          Authorization: "Api-Token ${env:DT_API_TOKEN}"
      datadog:
        api:
          key: ${env:DD_API_KEY}
          site: datadoghq.com
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [tail_sampling, batch]
          exporters: [otlphttp/dynatrace, datadog]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlphttp/dynatrace, datadog]
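Listing both exporters in one pipeline’s exporters array is what fans every trace out to Dynatrace (root-cause and Davis anomaly detection) and Datadog (the SRE team’s dashboards and SLOs) simultaneously; both receive the same sampled stream, and neither is a reduced copy of the other. The tail_sampling processor keeps 100% of error and slow traces and 10% of the rest, which is the policy that keeps the Datadog ingest bill sane without losing the traces that matter during an incident. One caveat with replicas: 2: a tail-sampling decision is only correct if every span of a trace reaches the same gateway pod, so the senders should route by trace ID (the contrib loadbalancing exporter with routing_key: traceID is the usual tool) rather than rely on the plain otlp exporter and a Service. Set DT_ENV_URL as a plain (non-secret) env var on the CR; only the token comes from Vault.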
6. Inject auto-instrumentation with zero code changes
Now the payoff the platform lead asked for: traces from every service, instrumented or not. Create an Instrumentation CR describing the agents and the export endpoint, then annotate workloads to opt in. The Operator’s mutating webhook injects a language-specific init-container that copies the agent into the pod and sets the right env vars (JAVA_TOOL_OPTIONS, PYTHONPATH, etc.).
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: kloudvin-instrumentation
  namespace: payments
spec:
  exporter:
    # Send to the node-local agent Collector, addressed by node name.
    # $(K8S_NODE_NAME) is expanded from the workload container's own env,
    # so that variable must be defined on the workload for this to resolve.
    endpoint: http://$(K8S_NODE_NAME):4318
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "1.0" # let the gateway's tail sampler make the real decision
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:2.6.0
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:0.46b0
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:0.52.0
Apply it, then opt a Deployment in by adding one annotation to the pod template — not the Deployment metadata, the pod template, a detail teams routinely get wrong:
kubectl apply -f instrumentation.yaml -n payments
# Opt the 'checkout' Java service in. Value = "<namespace>/<Instrumentation name>".
kubectl patch deployment checkout -n payments --type=merge -p '{
"spec": { "template": { "metadata": { "annotations": {
"instrumentation.opentelemetry.io/inject-java": "payments/kloudvin-instrumentation"
}}}}
}'
kubectl rollout status deployment/checkout -n payments
# Verify the injected init-container and env are present:
kubectl get pod -n payments -l app=checkout -o jsonpath='{.items[0].spec.initContainers[*].name}'
kubectl get pod -n payments -l app=checkout \
-o jsonpath='{.items[0].spec.containers[0].env[?(@.name=="JAVA_TOOL_OPTIONS")].value}'
The annotation keys are language-specific: inject-java, inject-python, inject-nodejs, inject-dotnet, inject-go (Go uses an eBPF sidecar and needs elevated privileges, so treat it separately). The checkout service now emits distributed traces — its team wrote no code, added no SDK, and shipped no new image. This is the mechanism that takes tracing coverage from “the two teams who cared” to “everything,” which is the whole reason the mandate existed.
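Since a misplaced annotation produces no warning, a quick check of where it actually landed can save a debugging session. This is a sketch under assumptions: check_inject_annotation is a hypothetical helper name, and it relies only on kubectl get with a jsonpath output, where dots inside the annotation key are escaped with a backslash.

```shell
# Hypothetical helper: report where an inject annotation sits on a Deployment.
# Usage: check_inject_annotation <namespace> <deployment> <language>
check_inject_annotation() {
  # Dots in the annotation key must be escaped inside a jsonpath expression.
  key="instrumentation\.opentelemetry\.io/inject-$3"
  on_pod=$(kubectl get deployment "$2" -n "$1" \
    -o jsonpath="{.spec.template.metadata.annotations.$key}")
  on_deploy=$(kubectl get deployment "$2" -n "$1" \
    -o jsonpath="{.metadata.annotations.$key}")
  if [ -n "$on_pod" ]; then
    echo "ok: pod template carries inject-$3=$on_pod"
  elif [ -n "$on_deploy" ]; then
    echo "wrong object: inject-$3 is on the Deployment metadata, not the pod template"
    return 1
  else
    echo "not annotated: inject-$3 is absent"
    return 1
  fi
}
```

For the service patched above, the call would be check_inject_annotation payments checkout java; a "wrong object" result means the webhook never saw the annotation on a pod.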
7. Roll out via GitOps with Argo CD
The CRs above belong in git, not in kubectl apply history. Wrap them in an Argo CD Application so the Operator’s desired state is version-controlled, reviewable, and self-healing. Terraform (or Ansible) provisions the cluster, cert-manager, and the Operator Helm release at the infrastructure layer; Argo CD owns the day-2 CRs on top.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: otel-pipeline
  namespace: argocd
spec:
  project: platform
  source:
    repoURL: https://git.internal/kloudvin/observability.git
    targetRevision: main
    path: clusters/eks-prod/otel
  destination:
    server: https://kubernetes.default.svc
    namespace: observability
  syncPolicy:
    automated: { prune: true, selfHeal: true }
    syncOptions: [CreateNamespace=true]
With selfHeal: true, a manual edit to a Collector CR is reverted to match git — so the pipeline config can only change through a reviewed pull request. This is also where a change-management gate slots in: a CI job opens a ServiceNow change request for any merge into the otel path and blocks the Argo sync until it is approved, giving the SRE org a documented record every time the production collection pipeline changes.
Validation
Walk the pipeline end to end before you trust it.
# 1. All three Collector tiers are reconciled and Running
kubectl get opentelemetrycollectors -n observability
kubectl get pods -n observability -l app.kubernetes.io/managed-by=opentelemetry-operator
# The stock contrib image ships no shell or wget, so kubectl exec cannot read the
# metrics endpoint. Read :8888 through a short-lived port-forward instead
# (18888 is an arbitrary free local port).
scrape() {
  kubectl port-forward -n observability "$1" 18888:8888 >/dev/null 2>&1 &
  pf=$!; sleep 2
  curl -s localhost:18888/metrics
  kill "$pf"; wait "$pf" 2>/dev/null
}
# 2. Target Allocator is actually distributing targets across replicas
for p in 0 1 2; do
  echo "replica $p:"
  scrape pod/otel-prometheus-collector-$p | grep '^otelcol_receiver_accepted_metric_points'
done
# 3. Auto-instrumentation produced spans — check the gateway's internal metrics
scrape deploy/otel-gateway-collector | grep otelcol_exporter_sent_spans
# 4. The backends are actually receiving — no export failures
scrape deploy/otel-gateway-collector \
  | grep -E 'otelcol_exporter_send_failed_(spans|metric_points)'
A healthy system shows non-zero accepted_metric_points on every Prometheus replica (proof of sharding), rising exporter_sent_spans on the gateway (proof auto-instrumentation works), and send_failed counters flat at zero (proof Dynatrace and Datadog are both accepting data). Finally, confirm in the backends themselves: a service map in Dynatrace and a trace search in Datadog should both show the checkout service you never manually instrumented.
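Reading counters by eye gets tedious when each metric appears once per exporter or receiver. The helper below is a minimal sketch (sum_metric is a hypothetical name) that adds up every sample of one metric family from Prometheus text-format input, so "flat at zero" becomes a single number. It assumes lines carry no trailing timestamp, which holds for the Collector's own /metrics output as far as I know, and it also matches the same name with a _total suffix.

```shell
# Hypothetical helper: sum all samples of one metric family read from stdin
# in Prometheus text format. Usage: ... | sum_metric <metric_name>
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                          # skip HELP/TYPE comment lines
    {
      metric = $1
      brace = index(metric, "{")           # strip the label set, if any
      if (brace) metric = substr(metric, 1, brace - 1)
      if (metric == name || metric == (name "_total")) total += $NF
    }
    END { printf "%d\n", total }'
}
```

Piping a gateway's /metrics output into sum_metric otelcol_exporter_send_failed_spans should print 0 on a healthy pipeline, and sum_metric otelcol_exporter_sent_spans should grow between two runs.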
Rollback and teardown
Because the Operator is declarative, rollback is reverting CRs — but do it in dependency order so you do not strand resources.
# Step back one service at a time: remove the inject annotation (next rollout drops the agent)
kubectl patch deployment checkout -n payments --type=json -p='[
{"op":"remove","path":"/spec/template/metadata/annotations/instrumentation.opentelemetry.io~1inject-java"}
]'
kubectl rollout status deployment/checkout -n payments
# Tear down the pipeline CRs (Operator reconciles the workloads away)
kubectl delete -f instrumentation.yaml -n payments
kubectl delete opentelemetrycollector otel-gateway otel-prometheus otel-agent -n observability
# Remove RBAC you created out-of-band
kubectl delete clusterrolebinding otel-agent-k8sattributes otel-ta-discovery
kubectl delete clusterrole otel-k8sattributes
# Finally the Operator and cert-manager (only if nothing else uses them)
helm uninstall opentelemetry-operator -n opentelemetry-operator-system
helm uninstall cert-manager -n cert-manager
If you delivered via Argo CD, the clean path is to disable auto-sync (kubectl patch app otel-pipeline -n argocd --type=merge -p '{"spec":{"syncPolicy":null}}'), delete the Application with --cascade, and let prune remove the CRs — never kubectl delete underneath Argo while self-heal is on, or it will fight you and immediately recreate what you deleted.
Common pitfalls
- Deleting under self-heal. With Argo CD selfHeal: true, any direct edit or delete is reverted within seconds. Always disable sync first. This catches everyone once.
- Missing k8sattributes RBAC. The processor fails open, so telemetry flows but arrives un-enriched with no error. If k8s.deployment.name is absent, check the ClusterRoleBinding from step 3 before anything else.
- Annotation on the wrong object. instrumentation.opentelemetry.io/inject-* must be on the pod template (spec.template.metadata.annotations), not the Deployment’s own metadata. Put it on the Deployment and nothing injects, and there is no warning.
- Core image instead of contrib. The prometheus receiver, k8sattributes, and the datadog exporter ship only in opentelemetry-collector-contrib. A core image reconciles fine, then crash-loops on an unknown component. Pin contrib everywhere.
- Target Allocator RBAC too narrow. The allocator needs cluster-wide read on services, endpoints, pods, and the Prometheus CRs. A namespaced Role leaves it discovering nothing and the StatefulSet scraping only its self-metrics.
- statefulset vs deployment mode for the allocator. The Target Allocator’s assignment relies on stable pod identity. Run the scrape tier as deployment and rescaling reshuffles everything; always use statefulset.
- Webhook race on first install. Apply a Collector CR before the Operator Deployment is Available and the webhook rejects it with a TLS error. Always wait on the rollout (step 1) first.
Security notes
Backend credentials never touch a Kubernetes Secret or git: HashiCorp Vault leases the Dynatrace token and Datadog key and the Vault Agent injects them into the gateway pod at runtime, so they are short-lived and auditable. Workload identity to the cluster is brokered through your IdP — Okta federated to Entra ID for the platform engineers who manage the Operator via Argo CD, so access to change the collection pipeline is SSO-gated and conditional-access enforced rather than tied to a static kubeconfig. The Collector images and the autoinstrumentation images are scanned in CI by Wiz Code before they are allowed into the registry, and Wiz runs continuous CSPM over the cluster to flag if a Collector ever gets over-privileged RBAC or its OTLP receiver is exposed beyond the cluster. CrowdStrike Falcon sensors on the node pool give runtime threat detection on the DaemonSet Collectors, which run on every node and are therefore a sensitive blast radius. Restrict the OTLP receivers to in-cluster traffic only — the agent’s 4317/4318 should never be reachable from outside the VPC; if you front any Collector externally (rare), terminate TLS and apply WAF at Akamai at the edge rather than on the Collector itself.
Cost notes
The two levers that dominate cost are what you scrape and what you keep. The Target Allocator lets you scale the Prometheus StatefulSet to match real target count instead of over-provisioning a single giant Prometheus for peak — right-size replicas to the per-Collector target ceiling (a few thousand each) and add replicas only as the cluster grows. Tail sampling at the gateway is the biggest line-item control: keeping 100% of error/slow traces and 10% of the baseline typically cuts trace volume to Datadog by 70–85% with no loss of incident-relevant data, and Datadog/Dynatrace both bill on ingested volume. Use the batch processor’s send_batch_size to reduce export request count (and thus egress and per-request overhead), and drop high-cardinality, low-value metrics at the agent with a filter processor before they ever reach the gateway — cardinality, not raw volume, is what makes a metrics bill explode. Finally, because everything is declarative, you can run a smaller-replica pipeline in non-prod clusters by simply changing replicas in the per-cluster Argo path, rather than maintaining a separate config — the same source, sized down.
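The 70–85% figure follows from simple arithmetic on the sampling policy: every error or slow trace is kept, plus the baseline percentage of everything else. The sketch below works that out; kept_pct is a hypothetical helper, and it counts traces rather than bytes, so it assumes kept and dropped traces are of similar size.

```shell
# kept_pct <percent of traces that are error-or-slow> [baseline percent, default 10]
# Prints the expected share of traces kept and the resulting reduction.
kept_pct() {
  awk -v p="$1" -v b="${2:-10}" 'BEGIN {
    kept = p + (100 - p) * b / 100     # all error/slow traces + b% of the rest
    printf "kept=%.1f%% reduction=%.1f%%\n", kept, 100 - kept
  }'
}
```

For example, kept_pct 10 prints kept=19.0% reduction=81.0%. With a 10% baseline the reduction is 0.9 × (100 − p), so the 70–85% band corresponds to roughly 6% to 22% of traces being errors or slower than the 500 ms threshold; a noisier service saves less.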
The shape of the win
The payments team’s mandate is satisfied by three CRs and an annotation convention. The 9,000-target scrape problem is gone because the Target Allocator shards it across a StatefulSet that scales with the cluster: no Collector ever scrapes more than its bounded slice, and the OOM pages stop. Tracing coverage went from two services to all 140 because auto-instrumentation injection means a team opts in with one pod-template annotation and ships no code. And the on-call no longer wonders which tool to trust, because one gateway pipeline fans the same full-fidelity telemetry to both Dynatrace and Datadog through a Vault-secured, Argo-governed, ServiceNow-gated path. The Operator is what makes observability a platform capability instead of 140 teams’ worth of individual heroics: declarative, version-controlled, and self-healing like everything else on the cluster.